The NEURAL BASES of

MULTISENSORY PROCESSES

Edited by Micah M. Murray and Mark T. Wallace

FRONTIERS IN NEUROSCIENCE
Series Editors
Sidney A. Simon, Ph.D.
Miguel A.L. Nicolelis, M.D., Ph.D.

Published Titles
Apoptosis in Neurobiology
Yusuf A. Hannun, M.D., Professor of Biomedical Research and Chairman, Department
of Biochemistry and Molecular Biology, Medical University of South Carolina, Charleston,
South Carolina
Rose-Mary Boustany, M.D., tenured Associate Professor of Pediatrics and Neurobiology, Duke
University Medical Center, Durham, North Carolina

Neural Prostheses for Restoration of Sensory and Motor Function


John K. Chapin, Ph.D., Professor of Physiology and Pharmacology, State University
of New York Health Science Center, Brooklyn, New York
Karen A. Moxon, Ph.D., Assistant Professor, School of Biomedical Engineering, Science,
and Health Systems, Drexel University, Philadelphia, Pennsylvania

Computational Neuroscience: Realistic Modeling for Experimentalists


Eric DeSchutter, M.D., Ph.D., Professor, Department of Medicine, University of Antwerp,
Antwerp, Belgium

Methods in Pain Research


Lawrence Kruger, Ph.D., Professor of Neurobiology (Emeritus), UCLA School of Medicine and
Brain Research Institute, Los Angeles, California

Motor Neurobiology of the Spinal Cord


Timothy C. Cope, Ph.D., Professor of Physiology, Wright State University, Dayton, Ohio

Nicotinic Receptors in the Nervous System


Edward D. Levin, Ph.D., Associate Professor, Department of Psychiatry and Pharmacology and
Molecular Cancer Biology and Department of Psychiatry and Behavioral Sciences,
Duke University School of Medicine, Durham, North Carolina

Methods in Genomic Neuroscience


Helmin R. Chin, Ph.D., Genetics Research Branch, NIMH, NIH, Bethesda, Maryland
Steven O. Moldin, Ph.D., University of Southern California, Washington, D.C.

Methods in Chemosensory Research


Sidney A. Simon, Ph.D., Professor of Neurobiology, Biomedical Engineering,
and Anesthesiology, Duke University, Durham, North Carolina
Miguel A.L. Nicolelis, M.D., Ph.D., Professor of Neurobiology and Biomedical Engineering,
Duke University, Durham, North Carolina

The Somatosensory System: Deciphering the Brain’s Own Body Image


Randall J. Nelson, Ph.D., Professor of Anatomy and Neurobiology,
University of Tennessee Health Sciences Center, Memphis, Tennessee

The Superior Colliculus: New Approaches for Studying Sensorimotor Integration


William C. Hall, Ph.D., Department of Neuroscience, Duke University, Durham, North Carolina
Adonis Moschovakis, Ph.D., Department of Basic Sciences, University of Crete, Heraklion, Greece
New Concepts in Cerebral Ischemia
Rick C. S. Lin, Ph.D., Professor of Anatomy, University of Mississippi Medical Center,
Jackson, Mississippi

DNA Arrays: Technologies and Experimental Strategies


Elena Grigorenko, Ph.D., Technology Development Group, Millennium Pharmaceuticals,
Cambridge, Massachusetts

Methods for Alcohol-Related Neuroscience Research


Yuan Liu, Ph.D., National Institute of Neurological Disorders and Stroke,
National Institutes of Health, Bethesda, Maryland
David M. Lovinger, Ph.D., Laboratory of Integrative Neuroscience, NIAAA,
Nashville, Tennessee

Primate Audition: Behavior and Neurobiology


Asif A. Ghazanfar, Ph.D., Princeton University, Princeton, New Jersey

Methods in Drug Abuse Research: Cellular and Circuit Level Analyses


Barry D. Waterhouse, Ph.D., MCP-Hahnemann University, Philadelphia, Pennsylvania

Functional and Neural Mechanisms of Interval Timing


Warren H. Meck, Ph.D., Professor of Psychology, Duke University, Durham, North Carolina

Biomedical Imaging in Experimental Neuroscience


Nick Van Bruggen, Ph.D., Department of Neuroscience, Genentech, Inc.
Timothy P.L. Roberts, Ph.D., Associate Professor, University of Toronto, Canada

The Primate Visual System


John H. Kaas, Department of Psychology, Vanderbilt University, Nashville, Tennessee
Christine Collins, Department of Psychology, Vanderbilt University, Nashville, Tennessee

Neurosteroid Effects in the Central Nervous System


Sheryl S. Smith, Ph.D., Department of Physiology, SUNY Health Science Center,
Brooklyn, New York

Modern Neurosurgery: Clinical Translation of Neuroscience Advances


Dennis A. Turner, Department of Surgery, Division of Neurosurgery,
Duke University Medical Center, Durham, North Carolina

Sleep: Circuits and Functions


Pierre-Hervé Luppi, Université Claude Bernard, Lyon, France

Methods in Insect Sensory Neuroscience


Thomas A. Christensen, Arizona Research Laboratories, Division of Neurobiology,
University of Arizona, Tucson, Arizona

Motor Cortex in Voluntary Movements


Alexa Riehle, INCM-CNRS, Marseille, France
Eilon Vaadia, The Hebrew University, Jerusalem, Israel

Neural Plasticity in Adult Somatic Sensory-Motor Systems


Ford F. Ebner, Vanderbilt University, Nashville, Tennessee

Advances in Vagal Afferent Neurobiology


Bradley J. Undem, Johns Hopkins Asthma Center, Baltimore, Maryland
Daniel Weinreich, University of Maryland, Baltimore, Maryland

The Dynamic Synapse: Molecular Methods in Ionotropic Receptor Biology


Josef T. Kittler, University College, London, England
Stephen J. Moss, University College, London, England
Animal Models of Cognitive Impairment
Edward D. Levin, Duke University Medical Center, Durham, North Carolina
Jerry J. Buccafusco, Medical College of Georgia, Augusta, Georgia

The Role of the Nucleus of the Solitary Tract in Gustatory Processing


Robert M. Bradley, University of Michigan, Ann Arbor, Michigan

Brain Aging: Models, Methods, and Mechanisms


David R. Riddle, Wake Forest University, Winston-Salem, North Carolina

Neural Plasticity and Memory: From Genes to Brain Imaging


Frederico Bermudez-Rattoni, National University of Mexico, Mexico City, Mexico

Serotonin Receptors in Neurobiology


Amitabha Chattopadhyay, Center for Cellular and Molecular Biology, Hyderabad, India

TRP Ion Channel Function in Sensory Transduction and Cellular Signaling Cascades
Wolfgang B. Liedtke, M.D., Ph.D., Duke University Medical Center, Durham, North Carolina
Stefan Heller, Ph.D., Stanford University School of Medicine, Stanford, California

Methods for Neural Ensemble Recordings, Second Edition


Miguel A.L. Nicolelis, M.D., Ph.D., Professor of Neurobiology and Biomedical Engineering,
Duke University Medical Center, Durham, North Carolina

Biology of the NMDA Receptor


Antonius M. VanDongen, Duke University Medical Center, Durham, North Carolina

Methods of Behavioral Analysis in Neuroscience


Jerry J. Buccafusco, Ph.D., Alzheimer’s Research Center, Professor of Pharmacology and Toxicology,
Professor of Psychiatry and Health Behavior, Medical College of Georgia,
Augusta, Georgia

In Vivo Optical Imaging of Brain Function, Second Edition


Ron Frostig, Ph.D., Professor, Department of Neurobiology, University of California,
Irvine, California

Fat Detection: Taste, Texture, and Post Ingestive Effects


Jean-Pierre Montmayeur, Ph.D., Centre National de la Recherche Scientifique, Dijon, France
Johannes le Coutre, Ph.D., Nestlé Research Center, Lausanne, Switzerland

The Neurobiology of Olfaction


Anna Menini, Ph.D., Neurobiology Sector, International School for Advanced Studies (S.I.S.S.A.),
Trieste, Italy

Neuroproteomics
Oscar Alzate, Ph.D., Department of Cell and Developmental Biology,
University of North Carolina, Chapel Hill, North Carolina

Translational Pain Research: From Mouse to Man


Lawrence Kruger, Ph.D., Department of Neurobiology, UCLA School of Medicine, Los Angeles,
California
Alan R. Light, Ph.D., Department of Anesthesiology, University of Utah, Salt Lake City, Utah

Advances in the Neuroscience of Addiction


Cynthia M. Kuhn, Duke University Medical Center, Durham, North Carolina
George F. Koob, The Scripps Research Institute, La Jolla, California
Neurobiology of Huntington’s Disease: Applications to Drug Discovery
Donald C. Lo, Duke University Medical Center, Durham, North Carolina
Robert E. Hughes, Buck Institute for Age Research, Novato, California

Neurobiology of Sensation and Reward


Jay A. Gottfried, Northwestern University, Chicago, Illinois

The Neural Bases of Multisensory Processes


Micah M. Murray, CIBM, Lausanne, Switzerland
Mark T. Wallace, Vanderbilt Brain Institute, Nashville, Tennessee
The NEURAL BASES of
MULTISENSORY PROCESSES
Edited by
Micah M. Murray
Center for Biomedical Imaging
Lausanne, Switzerland

Mark T. Wallace
Vanderbilt University
Nashville, Tennessee

Boca Raton London New York

CRC Press is an imprint of the


Taylor & Francis Group, an informa business
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works

International Standard Book Number-13: 978-1-4398-1219-8 (eBook - PDF)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been
made to publish reliable data and information, but the author and publisher cannot assume responsibility for the valid-
ity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright
holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this
form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may
rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or uti-
lized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopy-
ing, microfilming, and recording, or in any information storage or retrieval system, without written permission from the
publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://
www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923,
978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For
organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com
Contents
Series Preface.................................................................................................................................. xiii
Introduction....................................................................................................................................... xv
Editors..............................................................................................................................................xix
Contributors.....................................................................................................................................xxi

Section I  Anatomy

Chapter 1 Structural Basis of Multisensory Processing: Convergence.........................................3


H. Ruth Clemo, Leslie P. Keniston, and M. Alex Meredith

Chapter 2 Cortical and Thalamic Pathways for Multisensory and Sensorimotor Interplay........ 15
Céline Cappe, Eric M. Rouiller, and Pascal Barone

Chapter 3 What Can Multisensory Processing Tell Us about the Functional Organization
of Auditory Cortex?..................................................................................................... 31
Jennifer K. Bizley and Andrew J. King

Section II Neurophysiological Bases

Chapter 4 Are Bimodal Neurons the Same throughout the Brain?............................................. 51


M. Alex Meredith, Brian L. Allman, Leslie P. Keniston, and H. Ruth Clemo

Chapter 5 Audiovisual Integration in Nonhuman Primates: A Window into the Anatomy


and Physiology of Cognition....................................................................................... 65
Yoshinao Kajikawa, Arnaud Falchier, Gabriella Musacchia, Peter Lakatos,
and Charles E. Schroeder

Chapter 6 Multisensory Influences on Auditory Processing: Perspectives from


fMRI and Electrophysiology.......................................................................................99
Christoph Kayser, Christopher I. Petkov, Ryan Remedios, and Nikos K. Logothetis

Chapter 7 Multisensory Integration through Neural Coherence............................................... 115


Andreas K. Engel, Daniel Senkowski, and Till R. Schneider

Chapter 8 The Use of fMRI to Assess Multisensory Integration.............................................. 131


Thomas W. James and Ryan A. Stevenson


Chapter 9 Perception of Synchrony between the Senses........................................................... 147


Mirjam Keetels and Jean Vroomen

Chapter 10 Representation of Object Form in Vision and Touch................................................ 179


Simon Lacey and Krish Sathian

Section III Combinatorial Principles and Modeling

Chapter 11 Spatial and Temporal Features of Multisensory Processes: Bridging Animal


and Human Studies................................................................................................... 191
Diana K. Sarko, Aaron R. Nidiffer, Albert R. Powers III, Dipanwita Ghose,
Andrea Hillock-Dunn, Matthew C. Fister, Juliane Krueger, and Mark T. Wallace

Chapter 12 Early Integration and Bayesian Causal Inference in Multisensory Perception......... 217
Ladan Shams

Chapter 13 Characterization of Multisensory Integration with fMRI: Experimental


Design, Statistical Analysis, and Interpretation........................................................ 233
Uta Noppeney

Chapter 14 Modeling Multisensory Processes in Saccadic Responses:


Time-Window-of-Integration Model......................................................................... 253
Adele Diederich and Hans Colonius

Section IV  Development and Plasticity

Chapter 15 The Organization and Plasticity of Multisensory Integration in the Midbrain........ 279
Thomas J. Perrault Jr., Benjamin A. Rowland, and Barry E. Stein

Chapter 16 Effects of Prolonged Exposure to Audiovisual Stimuli with Fixed Stimulus


Onset Asynchrony on Interaction Dynamics between Primary Auditory and
Primary Visual Cortex.............................................................................................. 301
Antje Fillbrandt and Frank W. Ohl

Chapter 17 Development of Multisensory Temporal Perception................................................. 325


David J. Lewkowicz

Chapter 18 Multisensory Integration Develops Late in Humans................................................ 345


David Burr and Monica Gori

Chapter 19 Phonetic Recalibration in Audiovisual Speech......................................................... 363


Jean Vroomen and Martijn Baart

Chapter 20 Multisensory Integration and Aging......................................................................... 381


Jennifer L. Mozolic, Christina E. Hugenschmidt, Ann M. Peiffer, and
Paul J. Laurienti

Section V Clinical Manifestations


Chapter 21 Neurophysiological Mechanisms Underlying Plastic Changes and
Rehabilitation following Sensory Loss in Blindness and Deafness.......................... 395
Ella Striem-Amit, Andreja Bubic, and Amir Amedi

Chapter 22 Visual Abilities in Individuals with Profound Deafness: A Critical Review............ 423
Francesco Pavani and Davide Bottari

Chapter 23 Peripersonal Space: A Multisensory Interface for Body–Object Interactions..........449


Claudio Brozzoli, Tamar R. Makin, Lucilla Cardinali,
Nicholas P. Holmes, and Alessandro Farnè

Chapter 24 Multisensory Perception and Bodily Self-Consciousness: From Out-of-Body to


Inside-Body Experience............................................................................................ 467
Jane E. Aspell, Bigna Lenggenhager, and Olaf Blanke

Section VI  Attention and Spatial Representations


Chapter 25 Spatial Constraints in Multisensory Attention.......................................................... 485
Emiliano Macaluso

Chapter 26 Cross-Modal Spatial Cueing of Attention Influences Visual Perception.................. 509


John J. McDonald, Jessica J. Green, Viola S. Störmer, and Steven A. Hillyard

Chapter 27 The Colavita Visual Dominance Effect.................................................................... 529


Charles Spence, Cesare Parise, and Yi-Chuan Chen

Chapter 28 The Body in a Multisensory World........................................................................... 557


Tobias Heed and Brigitte Röder

Section VII Naturalistic Multisensory Processes:


Motion Signals

Chapter 29 Multisensory Interactions during Motion Perception: From Basic Principles


to Media Applications............................................................................................... 583
Salvador Soto-Faraco and Aleksander Väljamäe

Chapter 30 Multimodal Integration during Self-Motion in Virtual Reality................................603


Jennifer L. Campos and Heinrich H. Bülthoff

Chapter 31 Visual–Vestibular Integration for Self-Motion Perception........................................ 629


Gregory C. DeAngelis and Dora E. Angelaki

Section VIII Naturalistic Multisensory Processes:


Communication Signals

Chapter 32 Unity of the Senses for Primate Vocal Communication........................................... 653


Asif A. Ghazanfar

Chapter 33 Convergence of Auditory, Visual, and Somatosensory Information in Ventral


Prefrontal Cortex....................................................................................................... 667
Lizabeth M. Romanski

Chapter 34 A Multisensory Perspective on Human Auditory Communication.......................... 683


Katharina von Kriegstein

Section IX Naturalistic Multisensory Processes: Flavor

Chapter 35 Multimodal Chemosensory Interactions and Perception of Flavor.......................... 703


John Prescott

Chapter 36 A Proposed Model of a Flavor Modality.................................................................. 717


Dana M. Small and Barry G. Green

Chapter 37 Assessing the Role of Visual and Auditory Cues in Multisensory Perception of
Flavor......................................................................................................................... 739
Massimiliano Zampini and Charles Spence
Series Preface
FRONTIERS IN NEUROSCIENCE
The Frontiers in Neuroscience Series presents the insights of experts on emerging experimental
technologies and theoretical concepts that are or will be at the vanguard of neuroscience.
The books cover new and exciting multidisciplinary areas of brain research and describe break-
throughs in fields such as insect sensory neuroscience, primate audition, and biomedical imaging.
The most recent books cover the rapidly evolving fields of multisensory processing and reward.
Each book is edited by experts and consists of chapters written by leaders in a particular field.
Books are richly illustrated and contain comprehensive bibliographies. Chapters provide substantial
background material relevant to the particular subject.
The goal is for these books to be the references neuroscientists use in order to acquaint them-
selves with new methodologies in brain research. We view our task as series editors to produce
outstanding products and to contribute to the field of neuroscience. We hope that, as the volumes
become available, the effort put in by us, the publisher, the book editors, and individual authors will
contribute to further development of brain research. To the extent that you learn from these books,
we will have succeeded.

Sidney A. Simon, PhD


Miguel A.L. Nicolelis, MD, PhD

Introduction
The field of multisensory research continues to grow at a dizzying rate. Although for those of us
working in the field this is extraordinarily gratifying, it is also a bit challenging to keep up with
all of the exciting new developments in such a multidisciplinary topic at such a burgeoning stage.
For those a bit peripheral to the field, but with an inherent interest in the magic of multisensory
interactions to shape our view of the world, the task is even more daunting. Our objectives for this
book are straightforward—to provide those working within the area a strong overview of the
current state of the field, while at the same time providing those a bit outside of the field with a solid
introduction to multisensory processes. We feel that the current volume meets these objectives,
largely through a choice of topics that span the single cell to the clinic and through the expertise of
our authors, each of whom has done an exceptional job explaining their research to an interdisci-
plinary audience.
The book is organized thematically, with the themes generally building from the more basic to
the more applied. Hence, a reader interested in the progression of ideas and approaches can start at
the beginning and see how the basic science informs the clinical and more applied sciences by read-
ing each chapter in sequence. Alternatively, one can choose to learn more about a specific theme and
delve directly into that section. Regardless of your approach, we hope that this book will serve as
an important reference related to your interests in multisensory processes. The following narrative
provides a bit of an overview to each of the sections and the chapters contained within them.
Section I (Anatomy) focuses on the essential building blocks for any understanding of the neural
substrates of multisensory processing. In Chapter 1, Clemo and colleagues describe how neural con-
vergence and synaptology in multisensory domains might account for the diversity of physiological
response properties, and provide elegant examples of structure/function relationships. Chapter 2,
from Cappe and colleagues, details the anatomical substrates supporting the growing functional
evidence for multisensory interactions in classical areas of unisensory cortex, and which highlights
the possible thalamic contributions to these processes. In Chapter 3, Bizley and King focus on the
unisensory cortical domain that has been best studied for these multisensory influences—auditory
cortex. They highlight how visual inputs into the auditory cortex are organized, and detail the pos-
sible functional role(s) of these inputs.
Section II, organized around Neurophysiological Bases, provides an overview of how multisen-
sory stimuli can dramatically change the encoding processes for sensory information. Chapter 4, by
Meredith and colleagues, addresses whether bimodal neurons throughout the brain share the same
integrative characteristics, and shows marked differences in these properties between subcortex
and cortex. Chapter 5, from Kajikawa and colleagues, focuses on the nonhuman primate model
and bridges what is known about the neural integration of auditory–visual information in monkey
cortex with the evidence for changes in multisensory-mediated behavior and perception. In Chapter
6, Kayser and colleagues also focus on the monkey model, with an emphasis now on auditory cor-
tex and the merging of classical neurophysiological analyses with neuroimaging methods used in
human subjects (i.e., functional magnetic resonance imaging (fMRI)). This chapter emphasizes not
only early multisensory interactions, but also the transformations that take place as one ascends
the processing hierarchy as well as the distributed nature of multisensory encoding. The final four
chapters in this section then examine evidence from humans. In Chapter 7, Engel and colleagues
present compelling evidence for a role of coherent oscillatory activity in linking unisensory and
multisensory brain regions and improving multisensory encoding processes. This is followed by a
contribution from James and Stevenson (Chapter 8), which focuses on fMRI measures of multisen-
sory integration and which proposes a new criterion based on inverse effectiveness in evaluating and
interpreting the BOLD signal. Chapter 9, by Keetels and Vroomen, reviews the psychophysical and
neuroimaging evidence associated with the perception of the temporal relationships (i.e., synchrony
and asynchrony) between multisensory cues. Finally, this section closes with a chapter from Lacey
and Sathian (Chapter 10), which reviews our current neuroimaging knowledge concerning the men-
tal representations of objects across vision and touch.
Section III, Combinatorial Principles and Modeling, focuses on efforts to gain a better mecha-
nistic handle on multisensory operations and their network dynamics. In Chapter 11, Sarko and
colleagues focus on spatiotemporal analyses of multisensory neurons and networks as well as com-
monalities across both animal and human model studies. This is followed by a contribution from
Shams, who reviews the psychophysical evidence for multisensory interactions and who argues that
these processes can be well described by causal inference and Bayesian modeling approaches. In
Chapter 13, Noppeney returns to fMRI and illustrates the multiple methods of analyses of fMRI
datasets, the interpretational caveats associated with these approaches, and how the combined
use of methods can greatly strengthen the conclusions that can be drawn. The final contribution
(Chapter 14), from Diederich and Colonius, returns to modeling and describes the time-window-of-
integration (TWIN) model, which provides an excellent framework within which to interpret the
speeding of saccadic reaction times seen under multisensory conditions.
Section IV encompasses the area of Development and Plasticity. Chapter 15, from Perrault and
colleagues, describes the classic model for multisensory neural studies, the superior colliculus, and
highlights the developmental events leading up to the mature state. In Chapter 16, Fillbrandt and
Ohl explore temporal plasticity in multisensory networks and show changes in the dynamics of
interactions between auditory and visual cortices following prolonged exposure to fixed auditory–
visual delays. The next two contributions focus on human multisensory development. In Chapter 17,
Lewkowicz details the development of multisensory temporal processes, highlighting the increasing
sophistication in these processes as infants grow and gain experience with the world. Chapter 18, by
Burr and Gori, reviews the neurophysiological, behavioral and imaging evidence that illustrates the
surprisingly late development of human multisensory capabilities, a finding that they posit is a result
of the continual need for cross-modal recalibration during development. In Chapter 19, Vroomen
and Baart also discuss recalibration, this time in the context of language acquisition. They argue
that in the process of phonetic recalibration, the visual system instructs the auditory system to build
phonetic boundaries in the presence of ambiguous sound sources. Finally, Chapter 20 focuses on
what can be considered the far end of the developmental process—normal aging. Here, Mozolic and
colleagues review the intriguing literature suggesting enhanced multisensory processing in aging
adults, and highlight a number of possible reasons for these apparent improvements in sensory
function.
Section V, Clinical Manifestations, addresses how perception and action are affected by altered
sensory experience. In Chapter 21, Striem-Amit and colleagues focus on sensory loss, placing
particular emphasis on plasticity following blindness and on efforts to introduce low-cost sensory
substitution devices as rehabilitation tools. The functional imaging evidence they review provides
a striking example of training-induced plasticity. In Chapter 22, Pavani and Bottari likewise con-
sider sensory loss, focusing on visual abilities in profoundly deaf individuals. One contention in
their chapter is that deafness results in enhanced speed of reactivity to visual stimuli, rather than
enhanced visual perceptual abilities. In Chapter 23, Brozzoli and colleagues use the case of visuo-
tactile interactions as an example of how multisensory brain mechanisms can be rendered plastic
both in terms of sensory as well as motor processes. This plasticity is supported by the continuous
and active monitoring of peripersonal space, including both one’s own body and the objects in its
vicinity. In Chapter 24, Aspell and colleagues address the topic of bodily self-consciousness both
in neurological patients and healthy participants, showing how the perception of one’s “self” can be
distorted by multisensory conflicts.
Section VI encompasses the topic of Attention and Spatial Representations. A contribution from
Macaluso opens this section by reviewing putative neural mechanisms for multisensory links in the
control of spatial attention as revealed by functional neuroimaging in humans. He puts particular
emphasis on there likely being multiple functional–anatomic routes for these links, which in turn
can provide a degree of flexibility in the manner by which sensory information at a given location
is selected and processed. In Chapter 26, McDonald and colleagues follow this with a review of
studies showing how nonvisual cues impact the subsequent processing (i.e., sensitivity, perceptual
awareness, and subjective experiences) of visual stimuli, demonstrating how such effects can mani-
fest within the first 200 ms of visual processing. Chapter 27, by Spence and colleagues, provides
a review of the Colavita visual dominance effect, including the proposition of an account for this
effect based on biased competition. Finally, in Chapter 28 Heed and Röder conclude this section
with a consideration of how the body schema is established and how an established body schema in
turn impacts the manner in which multisensory stimuli are treated.
Section VII focuses on Naturalistic Multisensory Processes in the context of motion signals. In
Chapter 29, Soto-Faraco and Väljamäe open this section with a consideration of how motion infor-
mation conveyed by audition and vision is integrated. First, they address the basic phenomenology
and behavioral principles. They then review studies examining the neurophysiologic bases for the
integration of multisensory motion signals. Finally, they discuss how laboratory findings can be
extended to media applications. In Chapter 30, Campos and Bülthoff address the topic of self-
motion perception. They describe and evaluate experimental settings and technologies for studying
self-motion, including the empirical findings that these methods and paradigms have produced.
The section concludes with a contribution from DeAngelis and Angelaki (Chapter 31), who review
their studies of visual–vestibular interactions in the dorsal medial superior temporal area (MSTd)
of macaque monkeys. Their review progresses from the characterization of heading-sensitive mul-
tisensory neurons, to a mathematical description of the visual–vestibular integration within MSTd
neurons, and finally to describing the links between neuronal and behavioral processes.
Section VIII continues the focus on Naturalistic Multisensory Processes, now with a particular
concentration on multisensory contributions to the perception and generation of communication
signals. In Chapter 32, Ghazanfar challenges Geschwind’s proposition that speech functions in
humans are intrinsically linked to the unique ability of humans to form multisensory associations.
He reviews the multisensory contributions to communication signals in nonhuman primates as well
as the role of auditory cortex in processing such signals. In Chapter 33, Romanski details the audi-
tory, visual, and somatosensory anatomical projections to the ventrolateral prefrontal cortex (VLPFC) as well
as neuronal responsiveness within this region with respect to communication signals and object
processing. The section closes with Chapter 34 by von Kriegstein that considers how unisensory
auditory communication is impacted by previous multisensory auditory–visual encoding as well
as by auditory-driven activity within nominally visual brain regions. One implication is that the
processing of auditory communication signals is achieved using not only auditory but also visual
brain areas.
The final section, Section IX, Naturalistic Multisensory Processes, concentrates on how the per-
ception of flavor is generated. In a pair of complementary chapters, psychophysical and neural mod-
els of flavor perception are reviewed. In Chapter 35, Prescott focuses on psychophysical findings and
covers processes ranging from basic sensation through learned olfactory–taste associations, as well
as the roles of synthetic versus fused perceptions, attention, and hedonics. Chapter 36, by Small and
Green, focuses largely on evidence from functional brain imaging. They propose that a distributed
network of regions is responsible for generating the perceived flavors of objects. Finally, in Chapter
37, Zampini and Spence conclude with a review of evidence for the impact of visual and acoustic
features on the perception of flavor. They distinguish between preingestive effects of vision, which
are more likely linked to expectancy, and effects of audition that coincide with ingestion. In paral-
lel, they discuss how auditory and visual influences can occur without awareness, highlighting the
necessity for increased neuroscientific investigation of these processes.
We hope that the reader enjoys this book as much as we have enjoyed assembling it. We have both
learned much during this endeavor, and have gained an even deeper fascination and appreciation for
our chosen field of inquiry. We are delighted by the diversity of experimental models, methodologi-
cal approaches, and conceptual frameworks that are used in the study of multisensory processes,
and that are reflected in the current volume. Indeed, in our opinion, the success of our field and
its rapid growth are attributable to this highly multidisciplinary philosophy, and bode well for the
future of multisensory science.

Micah M. Murray
Lausanne, Switzerland

Mark T. Wallace
Nashville, Tennessee
Editors
Micah M. Murray earned a double BA in psychology and English from The Johns Hopkins
University. In 2001, he received his PhD with honors from the Neuroscience Department, Albert
Einstein College of Medicine of Yeshiva University. He worked as a postdoctoral scientist in the
Neurology Clinic and Rehabilitation Department, University Hospital of Geneva, Switzerland. Since
2003 he has held a position within the Department of Clinical Neurosciences and Department of
Radiology at the University Hospital of Lausanne, Switzerland. Currently, he is an associate profes-
sor within these departments, adjunct associate professor at Vanderbilt University, as well as associ-
ate director of the EEG Brain Mapping Core of the Center for Biomedical Imaging in Lausanne,
Switzerland. Dr. Murray has a contiguous record of grant support from the Swiss National Science
Foundation. He has received awards for his research from the Leenaards Foundation (2005 Prize
for the Promotion of Scientific Research), the faculty of Biology and Medicine at the University
of Lausanne (2008 Young Investigator Prize), and from the Swiss National Science Foundation
(bonus of excellence in research). His research has been widely covered by the national and inter-
national media. He currently holds editorial board positions at Brain Topography (editor-in-chief),
Journal of Neuroscience (associate editor), Frontiers in Integrative Neuroscience (associate editor),
Frontiers in Auditory Cognitive Neuroscience (associate editor), and the Scientific World Journal.
Dr. Murray has authored more than 80 articles and book chapters. His group’s research primarily
focuses on multisensory interactions, object recognition, learning and plasticity, electroencepha-
logram-correlated functional MRI (EEG/fMRI) methodological developments, and systems/cog-
nitive neuroscience in general. Research in his group combines psychophysics, EEG, fMRI, and
transcranial magnetic stimulation in healthy and clinical populations.

Mark T. Wallace received his BS in biology from Temple University in 1985, and his PhD in
neuroscience from Temple University in 1990, where he was the recipient of the Russell Conwell
Presidential Fellowship. He did a postdoctoral fellowship with Dr. Barry Stein at the Medical
College of Virginia, where he began his research looking at the neural mechanisms of multisensory
integration. Dr. Wallace moved to the Wake Forest University School of Medicine in 1995. In 2006,
Dr. Wallace came to Vanderbilt University, and was named the director of the Vanderbilt Brain
Institute in 2008. He is professor of hearing and speech sciences, psychology, and psychiatry, and
the associate director of the Vanderbilt Silvio O. Conte Center for Basic Neuroscience Research.
He is a member of the Center for Integrative and Cognitive Neuroscience, the Center for Molecular
Neuroscience, the Vanderbilt Kennedy Center, and the Vanderbilt Vision Research Center. Dr.
Wallace has received a number of awards for both research and teaching, including the Faculty
Excellence Award of Wake Forest University and being named the Outstanding Young Investigator
in the Basic Sciences. Dr. Wallace has an established record of research funding from the National
Institutes of Health, and is the author of more than 125 research presentations and publications. He
currently serves on the editorial board of several journals including Brain Topography, Cognitive
Processes, and Frontiers in Integrative Neuroscience. His work has employed a multidisciplinary
approach to examining multisensory processing, and focuses upon the neural architecture of multi-
sensory integration, its development, and its role in guiding human perception and performance.

Contributors
Brian L. Allman, Department of Anatomy and Neurobiology, Virginia Commonwealth University School of Medicine, Richmond, Virginia
Amir Amedi, Department of Medical Neurobiology, Institute for Medical Research Israel–Canada, Hebrew University–Hadassah Medical School, Jerusalem, Israel
Dora E. Angelaki, Department of Anatomy and Neurobiology, Washington University School of Medicine, St. Louis, Missouri
Jane E. Aspell, Laboratory of Cognitive Neuroscience, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Martijn Baart, Department of Medical Psychology and Neuropsychology, Tilburg University, Tilburg, The Netherlands
Pascal Barone, Centre de Recherche Cerveau et Cognition (UMR 5549), CNRS, Faculté de Médecine de Rangueil, Université Paul Sabatier Toulouse 3, Toulouse, France
Jennifer K. Bizley, Department of Physiology, Anatomy, and Genetics, University of Oxford, Oxford, United Kingdom
Olaf Blanke, Laboratory of Cognitive Neuroscience, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Davide Bottari, Center for Mind/Brain Sciences, University of Trento, Rovereto, Italy
Claudio Brozzoli, Institut National de la Santé et de la Recherche Médicale, Bron, France
Andreja Bubic, Department of Medical Neurobiology, Institute for Medical Research Israel–Canada, Hebrew University–Hadassah Medical School, Jerusalem, Israel
Heinrich H. Bülthoff, Department of Human Perception, Cognition, and Action, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
David Burr, Dipartimento di Psicologia, Università Degli Studi di Firenze, Florence, Italy
Jennifer L. Campos, Department of Psychology, Toronto Rehabilitation Institute, University of Toronto, Toronto, Ontario, Canada
Céline Cappe, Laboratory of Psychophysics, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
Lucilla Cardinali, Institut National de la Santé et de la Recherche Médicale, Bron, France
Yi-Chuan Chen, Crossmodal Research Laboratory, Department of Experimental Psychology, University of Oxford, Oxford, United Kingdom
H. Ruth Clemo, Department of Anatomy and Neurobiology, Virginia Commonwealth University School of Medicine, Richmond, Virginia
Hans Colonius, Department of Psychology, Oldenburg University, Oldenburg, Germany
Gregory C. DeAngelis, Department of Brain and Cognitive Sciences, Center for Visual Science, University of Rochester, Rochester, New York
Adele Diederich, School of Humanities and Social Sciences, Jacobs University, Bremen, Germany
Andreas K. Engel, Department of Neurophysiology and Pathophysiology, University Medical Center Hamburg–Eppendorf, Hamburg, Germany
Arnaud Falchier, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, New York
Alessandro Farnè, Institut National de la Santé et de la Recherche Médicale, Bron, France
Antje Fillbrandt, Leibniz Institute for Neurobiology, Magdeburg, Germany
Matthew C. Fister, Vanderbilt Kennedy Center, Vanderbilt University, Nashville, Tennessee
Asif A. Ghazanfar, Departments of Psychology and Ecology and Evolutionary Biology, Neuroscience Institute, Princeton University, Princeton, New Jersey
Dipanwita Ghose, Department of Psychology, Vanderbilt University, Nashville, Tennessee
Monica Gori, Department of Robotics, Brain and Cognitive Science, Italian Institute of Technology, Genoa, Italy
Barry G. Green, The John B. Pierce Laboratory and Yale University, New Haven, Connecticut
Jessica J. Green, Duke University, Durham, North Carolina
Tobias Heed, Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany
Andrea Hillock-Dunn, Department of Hearing and Speech Sciences, Vanderbilt University, Nashville, Tennessee
Steven A. Hillyard, University of California San Diego, San Diego, California
Nicholas P. Holmes, Institut National de la Santé et de la Recherche Médicale, Bron, France
Christina E. Hugenschmidt, Center for Diabetes Research, Wake Forest University School of Medicine, Winston-Salem, North Carolina
Thomas W. James, Department of Psychological and Brain Sciences, Indiana University, Bloomington, Indiana
Yoshinao Kajikawa, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, New York
Christoph Kayser, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Mirjam Keetels, Department of Medical Psychology and Neuropsychology, Tilburg University, Tilburg, The Netherlands
Leslie P. Keniston, Department of Anatomy and Neurobiology, Virginia Commonwealth University School of Medicine, Richmond, Virginia
Andrew J. King, Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford, United Kingdom
Katharina von Kriegstein, Max Planck Institute for Human Cognitive and Brain Sciences, Leipzig, Germany
Juliane Krueger, Neuroscience Graduate Program, Vanderbilt University, Nashville, Tennessee
Simon Lacey, Department of Neurology, Emory University, Atlanta, Georgia
Peter Lakatos, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, New York
Paul J. Laurienti, Department of Radiology, Wake Forest University School of Medicine, Winston-Salem, North Carolina
Bigna Lenggenhager, Laboratory of Cognitive Neuroscience, Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
David J. Lewkowicz, Department of Psychology, Florida Atlantic University, Boca Raton, Florida
Nikos K. Logothetis, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Emiliano Macaluso, Neuroimaging Laboratory, Santa Lucia Foundation, Rome, Italy
Tamar R. Makin, Institut National de la Santé et de la Recherche Médicale, Bron, France
John J. McDonald, Simon Fraser University, Burnaby, British Columbia, Canada
M. Alex Meredith, Department of Anatomy and Neurobiology, Virginia Commonwealth University School of Medicine, Richmond, Virginia
Jennifer L. Mozolic, Department of Psychology, Warren Wilson College, Asheville, North Carolina
Gabriella Musacchia, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, New York
Aaron R. Nidiffer, Department of Hearing and Speech Sciences, Vanderbilt University, Nashville, Tennessee
Uta Noppeney, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Frank W. Ohl, Leibniz Institute for Neurobiology, Magdeburg, Germany
Cesare Parise, Department of Experimental Psychology, Crossmodal Research Laboratory, University of Oxford, Oxford, United Kingdom
Francesco Pavani, Department of Cognitive Sciences and Education, Center for Mind/Brain Sciences, University of Trento, Rovereto, Italy
Ann M. Peiffer, Department of Radiology, Wake Forest University School of Medicine, Winston-Salem, North Carolina
Thomas J. Perrault Jr., Department of Neurobiology and Anatomy, Wake Forest School of Medicine, Winston-Salem, North Carolina
Christopher I. Petkov, Institute of Neuroscience, University of Newcastle, Newcastle upon Tyne, United Kingdom
Albert R. Powers III, Neuroscience Graduate Program, Vanderbilt University, Nashville, Tennessee
John Prescott, School of Psychology, University of Newcastle, Ourimbah, Australia
Ryan Remedios, Max Planck Institute for Biological Cybernetics, Tübingen, Germany
Brigitte Röder, Biological Psychology and Neuropsychology, University of Hamburg, Hamburg, Germany
Lizabeth M. Romanski, Department of Neurobiology and Anatomy, University of Rochester, Rochester, New York
Eric M. Rouiller, Unit of Physiology and Program in Neurosciences, Department of Medicine, Faculty of Sciences, University of Fribourg, Fribourg, Switzerland
Benjamin A. Rowland, Department of Neurobiology and Anatomy, Wake Forest School of Medicine, Winston-Salem, North Carolina
Diana K. Sarko, Department of Hearing and Speech Sciences, Vanderbilt University, Nashville, Tennessee
Krish Sathian, Department of Neurology, Emory University, Atlanta, Georgia
Till R. Schneider, Department of Neurophysiology and Pathophysiology, University Medical Center Hamburg–Eppendorf, Hamburg, Germany
Charles E. Schroeder, Nathan S. Kline Institute for Psychiatric Research, Orangeburg, New York
Daniel Senkowski, Department of Neurophysiology and Pathophysiology, University Medical Center Hamburg–Eppendorf, Hamburg, Germany
Ladan Shams, Department of Psychology, University of California, Los Angeles, Los Angeles, California
Dana M. Small, The John B. Pierce Laboratory and Department of Psychiatry, Yale University School of Medicine, New Haven, Connecticut
Salvador Soto-Faraco, Departament de Tecnologies de la Informació i les Comunicacions, Institució Catalana de Recerca i Estudis Avançats, Universitat Pompeu Fabra, Barcelona, Spain
Charles Spence, Department of Experimental Psychology, Crossmodal Research Laboratory, University of Oxford, Oxford, United Kingdom
Barry E. Stein, Department of Neurobiology and Anatomy, Wake Forest School of Medicine, Winston-Salem, North Carolina
Ryan A. Stevenson, Department of Psychological and Brain Sciences, Indiana University, Bloomington, Indiana
Viola S. Störmer, Max Planck Institute of Human Development, Berlin, Germany
Ella Striem-Amit, Department of Medical Neurobiology, Institute for Medical Research Israel–Canada, Hebrew University–Hadassah Medical School, Jerusalem, Israel
Aleksander Väljamäe, Institute of Audiovisual Studies, Universitat Pompeu Fabra, Barcelona, Spain
Jean Vroomen, Department of Medical Psychology and Neuropsychology, Tilburg University, Tilburg, The Netherlands
Mark T. Wallace, Vanderbilt Brain Institute, Vanderbilt University, Nashville, Tennessee
Massimiliano Zampini, Centre for Mind/Brain Sciences, University of Trento, Rovereto, Italy
Section I
Anatomy
1 Structural Basis of
Multisensory Processing
Convergence
H. Ruth Clemo, Leslie P. Keniston, and M. Alex Meredith

CONTENTS
1.1 Introduction...............................................................................................................................3
1.2 Multiple Sensory Projections: Sources......................................................................................3
1.2.1 Multiple Sensory Projections: Termination Patterns.....................................................6
1.2.2 Supragranular Termination of Cross-Modal Projections.............................................. 7
1.3 Do All Cross-Modal Projections Generate Multisensory Integration?.....................................9
1.4 Synaptic Architecture of Multisensory Convergence.............................................................. 10
1.5 Summary and Conclusions...................................................................................................... 11
Acknowledgments............................................................................................................................. 12
References......................................................................................................................................... 12

1.1  INTRODUCTION
For multisensory processing, the requisite, defining step is the convergence of inputs from differ-
ent sensory modalities onto individual neurons. This arrangement allows postsynaptic currents
evoked by different modalities access to the same membrane, to collide and integrate there on
the common ground of an excitable bilayer. Naturally, one would expect a host of biophysical and
architectural features to play a role in shaping those postsynaptic events as they spread across
the membrane, but much more can be written about what is unknown of the structural basis
for multisensory integration than of what is known. Historically, however, what has primarily
been the focus of anatomical investigations of multisensory processing has been the identifica-
tion of sources of inputs that converge in multisensory regions. Although a few recent studies
have begun to assess the features of convergence (see below), most of what is known about the
structural basis of multisensory processing lies in the sources and pathways essentially before
convergence.

1.2  MULTIPLE SENSORY PROJECTIONS: SOURCES


Multisensory processing is defined as the influence of one sensory modality on activity generated
by another modality. However, for most of its history, the term “multisensory” had been syn-
onymous with the term “bimodal” (describing a neuron that can be activated by the independent
presentation of stimuli from more than one modality). Hence, studies of multisensory connections
first identified areas that were bimodal, either as individual neurons (Horn and Hill 1966) or areal
responses to different sensory stimuli (e.g., Toldi et al. 1984). Not surprisingly, the bimodal (and
trimodal) areas of the superior temporal sulcus (STS) in monkeys (e.g., Benevento et al. 1977;
Bruce et al. 1981; Hikosaka et al. 1988) were readily identified. Among the first comprehensive
assessments of multisensory pathways were those that injected tracers into the STS and identified
the different cortical sources of inputs to that region. With tracer injections into the upper “poly-
sensory” STS bank, retrogradely labeled neurons were identified in adjoining auditory areas of the
STS, superior temporal gyrus, and supratemporal plane, and in visual areas of the inferior parietal
lobule and the lateral intraparietal sulcus, with a somewhat more restricted projection from the
parahippocampal gyrus and the inferotemporal visual area, as illustrated in Figure 1.1 (Seltzer and
Pandya 1994; Saleem et al. 2000). Although inconclusive about potential somatosensory inputs to
the STS, this study did mention the presence of retrogradely labeled neurons in the inferior parietal
lobule, an area that processes both visual and somatosensory information (e.g., Seltzer and Pandya
1980).
Like the STS, the feline anterior ectosylvian sulcus (AES) is located at the intersection of the
temporal, parietal, and frontal lobes, contains multisensory neurons (e.g., Rauschecker and Korte
1993; Wallace et al. 1992; Jiang et al. 1994), and exhibits a higher-order visual area within its lower
(ventral) bank (Mucke et al. 1982; Olson and Graybiel 1987). This has led to some speculation that
these regions might be homologous. However, a fourth somatosensory area (SIV) representation
(Clemo and Stein 1983) is found anterior along the AES, whereas somatosensory neurons are pre-
dominantly found in the posterior STS (Seltzer and Pandya 1994). The AES also contains distinct
modality-specific regions (somatosensory SIV, visual AEV, and auditory FAES) with multisensory
neurons found primarily at the intersection between these different representations (Meredith 2004;
Wallace et al. 2004; Carriere et al. 2007; Meredith and Allman 2009), whereas the subdivisions of
the upper STS bank are largely characterized by multisensory neurons (e.g., Benevento et al. 1977;
Bruce et al. 1981; Hikosaka et al. 1988). Further distinctions between the STS and the AES reside in
the cortical connectivity of the latter, as depicted in Figure 1.2. Robust somatosensory inputs reach
the AES from somatosensory areas SI–SIII (Burton and Kopf 1984; Reinoso-Suarez and Roda
1985) and SV (Mori et al. 1996; Clemo and Meredith 2004); inputs to AEV arrive from the extras-
triate visual area posterolateral lateral suprasylvian (PLLS), with smaller contributions from the
anterolateral lateral suprasylvian (ALLS) and the posteromedial lateral suprasylvian (PMLS) visual
areas (Olson and Graybiel 1987); auditory inputs to the FAES project from the rostral suprasylvian
sulcus (RSS), second auditory area (AII), and posterior auditory field (PAF) (Clemo et al. 2007; Lee
and Winer 2008). The laminar origin of these projections is provided in only a few of these reports.


FIGURE 1.1  Cortical afferents to monkey STS. On this lateral view of monkey brain, the entire extent of
STS is opened (dashed lines) to reveal upper and lower banks. On upper bank, multisensory regions TP0–4
are located (not depicted). Auditory inputs (black arrows) from adjoining superior temporal gyrus, planum
temporale, preferentially target anterior portions of upper bank. Visual inputs, primarily from parahippocam-
pal gyrus (medium gray arrow) but also from inferior parietal lobule (light gray arrow), also target upper
STS bank. Somatosensory inputs were comparatively sparse, limited to posterior aspects of STS, and may
arise from part of inferior parietal lobule (light gray arrow). Note that inputs intermingle within their areas
of termination.

FIGURE 1.2  Cortical afferents to cat AES. On this lateral view of cat cortex, the AES is opened (dashed
lines) to reveal dorsal and ventral banks. The somatosensory representation SIV on the anterior dorsal bank
receives inputs (light gray arrow) from somatosensory areas SI, SII, SIII, and SV. The auditory field of the
AES (FAES) in the posterior end of the sulcus receives inputs (black arrows) primarily from the rostral
suprasylvian auditory field, and sulcal portion of the anterior auditory field as well as portions of dorsal zone
of the auditory cortex, AII, and PAF. The ectosylvian visual (AEV) area in the ventral bank receives visual
inputs (dark gray arrow) primarily from PLLS and, to a lesser extent, from adjacent ALLS and PMLS visual
areas. Note that the SIV, FAES, and AEV domains, as well as their inputs, are largely segregated from one
another.

The AES is not alone as a cortical site of convergence of inputs from representations of different
sensory modalities, as the posterior ectosylvian gyrus (an auditory–visual area; Bowman and Olson
1988), PLLS visual area (an auditory–visual area; Yaka et al. 2002; Allman and Meredith 2007),
and the rostral suprasylvian sulcus (an auditory–somatosensory area; Clemo et al. 2007) have had
their multiple sensory sources examined.
Perhaps the most functionally and anatomically studied multisensory structure is not in the cortex,
but in the midbrain: the superior colliculus (SC). This six-layered region contains spatiotopic
representations of visual, auditory, and somatosensory modalities within its intermediate and deep layers (for review, see Stein
and Meredith 1993). Although unisensory, bimodal, and trimodal neurons are intermingled with
one another in this region, the multisensory neurons predominate (63%; Wallace and Stein 1997).
Despite their numbers, structure–function relationships have been determined for only a few multisensory neurons. The largest, and often the most readily identifiable in cross section (or via recording), are the tectospinal and tectoreticulospinal neurons, whose somata average 35 to 40 µm in diameter and whose dendritic arbors can extend up to 1.4 mm (Moschovakis and Karabelas 1985; Behan et al. 1988). These large multipolar neurons have a high incidence of multisensory properties, usually as visual–auditory or visual–somatosensory bimodal neurons (Meredith and Stein 1986). Another morphologically distinct class of SC neuron also shows multisensory properties: the nitric oxide synthase (NOS)-positive interneuron. These excitatory local circuit neurons have
been shown to receive bimodal inputs largely from the visual and auditory modalities (Fuentes-
Santamaria et al. 2008). Thus, unlike most other structures identified as multisensory, the SC con-
tains morphological classes of neurons that highly correlate with multisensory activity. Ultimately,
this could contribute to understanding how multisensory circuits are formed and their relation to
particular features of multisensory processing.
Because the SC is a multisensory structure, anatomical tracers injected into it have identified
numerous cortical and subcortical areas representing different sensory modalities that supply its
inputs. However, identification of the sources of multiple sensory inputs to this, or any, area pro-
vides little more than anatomical confirmation that projections from different sensory modalities
were  involved. More pertinent is the information relating to the other end of the projection, the
axon  terminals, whose influence is responsible for the generation of multisensory effects on the
postsynaptic membrane. Despite the fact that axon terminals are at the physical point of multisen-
sory convergence, few studies of multisensory regions outside of the SC have addressed this specific
issue.

1.2.1  Multiple Sensory Projections: Termination Patterns


Unlike much of the multisensory cortex, the pattern of terminal projections to the SC is well
described, largely through the efforts of Harting’s group (Harting and Van Lieshout 1991; Harting
et al. 1992, 1997). Historically, this work represented a conceptual leap from the identification
of multisensory sources to the convergent arrangement of those inputs that potentially generate
multisensory effects. These and other orthograde studies (e.g., Illing and Graybiel 1986) identi-
fied a characteristic, patchy arrangement of input terminals that occupied specific domains within
the SC. Somatosensory inputs, whether from the somatosensory cortex or the trigeminal nucleus,
terminated in an interrupted series of puffs across the mediolateral extent of the middle portion of
the intermediate layers (Harting and Van Lieshout 1991; Harting et al. 1992, 1997). On the other
hand, visual inputs from, for example, the AEV, avoided the central aspects of the intermediate
layers while occupying patches above and below. These relationships among distributions of axon
terminals from different sensory modalities are depicted in Figure 1.3. This patchy, discontinuous
pattern of termination characterized most projections to the deeper SC layers and was so consistent
that some investigators came to regard these patches as a device by which the different inputs were compart-
mentalized within individually distinct functional domains (Illing and Graybiel 1986). Although
this interpretation has some validity, it is also true (as mentioned above) that some of the multi-
sensory neurons exhibit dendritic arbors of up to 1.4 mm. With this extensive branching pattern
(as illustrated in Figure 1.3), it would be difficult for a neuron to avoid contacting the domains of
different sensory inputs to the SC. In fact, it would appear that a multisensory tectoreticulospinal
neuron would likely sample repeated input domains from several modalities, and it is difficult to
imagine why there are not more SC trimodal neurons (9%; Wallace and Stein 1997). Ultimately,


FIGURE 1.3  Sensory segregation and multisensory convergence in SC. This coronal section through cat
SC shows alternating cellular and fibrous layers (SO, stratum opticum; SGI, stratum griseum intermediale).
Terminal boutons form a discontinuous, patchy distribution across multisensory layers with somatosensory
(dark gray, from SIV) and visual (light gray, from AEV) inputs that largely occupy distinct, nonoverlapping
domains. (Redrawn from Harting, J.K. et al., J. Comp. Neurol., 324, 379–414, 1992.) A tectoreticulospinal neu-
ron (redrawn from Behan, M. et al., J. Comp. Neurol., 270, 171–184, 1988) is shown, to scale, repeated across the intermediate layer, where its dendritic arbor virtually cannot avoid contacting multiple input domains from
different modalities. Accordingly, tectoreticulospinal neurons are known for their multisensory properties.

FIGURE 1.4  Supragranular cross-modal projections from auditory FAES (black injection site) to somatosensory SIV. Coronal sections through SIV correlate with levels shown on the lateral diagram of cat cortex;
location of AES is indicated by arrow. On each coronal section, SIV region is denoted by dashed lines roughly
perpendicular to pial surface, and location of layer IV (granular layer) is indicated by dashed line essentially
parallel to the gray-white border. Each dot is equivalent to one bouton labeled from FAES; note that a prepon-
derance of labeled axon terminals are found in the supragranular layers. (Redrawn from Dehner, L.R. et al.,
Cereb. Cortex, 14, 387–403, 2004.)

these different input patterns suggest a complex spatial relationship with the recipient neurons and
may provide a useful testing ground on which to determine the synaptic architecture underlying
multisensory processing.
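
The geometric argument above (Figure 1.3) can be made explicit with a back-of-the-envelope sketch in Python; only the ~1.4 mm arbor extent comes from the cited studies, while the patch width and spacing below are purely illustrative assumptions, not measured values.

# How many terminal patches would a tectoreticulospinal dendritic arbor be expected
# to span, given its published extent and an assumed, illustrative patch layout?
ARBOR_EXTENT_MM = 1.4            # Moschovakis and Karabelas 1985; Behan et al. 1988
PATCH_WIDTH_MM = 0.4             # hypothetical width of one terminal patch (illustrative)
GAP_WIDTH_MM = 0.4               # hypothetical gap between patches of one modality

period = PATCH_WIDTH_MM + GAP_WIDTH_MM
patches_per_modality = int(ARBOR_EXTENT_MM // period) + 1   # patches overlapped along one axis

for n_modalities in (2, 3):
    total = patches_per_modality * n_modalities
    print(f"{n_modalities} interleaved modalities -> ~{total} input patches within one arbor")
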
With regard to cortical multisensory areas, only relatively recent studies have examined the ter-
mination patterns of multiple sensory projections (e.g., projections from auditory and visual sources
to a target area) or cross-modal projections (e.g., projections from an auditory source to a visual tar-
get area). It had been observed that tracer injections into the anterior dorsal bank of the AES, where
the somatosensory area SIV is located, produced retrograde labeling in the posterior aspects of the
AES, where auditory field AES is found (Reinoso-Suarez and Roda 1985). This potential cross-
modal projection was further examined by Dehner et al. (2004), who injected tracers in auditory
FAES and identified orthograde projection terminals in SIV (see Figure 1.4). These experiments
were repeated with the tracer systematically placed in different portions of the FAES, showing
the constancy of the projection’s preference for terminating in the upper, supragranular layers of
SIV (Dehner et al. 2004). Functionally, such a cross-modal projection between auditory and soma-
tosensory areas would be expected to generate bimodal auditory–somatosensory neurons. However,
such bimodal neurons have rarely been observed in SIV (Clemo and Stein 1983; Rauschecker and
Korte 1993; Dehner et al. 2004) and stimulation of FAES (through indwelling electrodes) failed to
elicit a single example of orthodromic activation via this cross-modal pathway (Dehner et al. 2004).
Eventually, single- and combined-modality stimulation revealed that somatosensory SIV neurons
received subthreshold influences from auditory inputs, which was described as a “new” form of
multisensory convergence that was distinct from the well-known bimodal patterns identified in the
SC and elsewhere (Dehner et al. 2004). These functional distinctions are depicted in Figure 1.5,
where hypothetical circuits that produce different multisensory effects are illustrated. Ultimately,
these experiments (Dehner et al. 2004) indicate that bimodal neurons are not the only form of mul-
tisensory neuron.

1.2.2  Supragranular Termination of Cross-Modal Projections


The possibility that cross-modal projections underlying subthreshold multisensory processing
might be generalizable to brain regions other than the SIV was examined in several subsequent
investigations. Somatosensory area SIV was found to exhibit a reciprocal cross-modal projection
to auditory FAES, where subthreshold somatosensory effects were observed in approximately 25%

FIGURE 1.5  Different patterns of sensory convergence result in different forms of processing. In each panel,
neuron (gray) receives inputs (black) from sensory modalities “A” and/or “B.” In bimodal condition (left),
neuron receives multiple inputs from both modalities, such that it can be activated by stimulus “A” alone or
by stimulus “B” alone. Furthermore, when both “A + B” are stimulated together, inputs converge on the same
neuron and their responses integrate. In subthreshold condition (center), neuron still receives inputs from both
modalities, but inputs from modality “B” are so reduced and occur at low-priority locations that stimulation
of “B” alone fails to activate the neuron. However, when “B” is combined with “A,” activity is modulated
(facilitation or suppression). In contrast, unisensory neurons (right) receive inputs from only a single modality
“A” and stimulation of “B” has no effect alone or in combination with “A.”

of the samples (Meredith et al. 2006). These projections also showed a preference for supragranu-
lar termination, as illustrated in Figure 1.6. In another study (Clemo et al. 2008), several auditory
corticocortical projections were demonstrated to terminate in the visual PLLS area, but only those
projections from FAES were present within the entire extent of the PLLS corresponding with the
distribution of subthreshold multisensory neurons (Allman and Meredith 2007). These projections
from FAES to PLLS showed an overwhelming preference for termination in the supragranular

[Figure 1.6 image panels: (a) boutons in RSS labeled from injections in AI, AAF, AII, PAF, FAES, SIV, SV, PLLS, PMLS, and AEV; (b) boutons in PLLS labeled from injections in A1, sAAF, PAF, and FAES; (c) boutons in FAES labeled from injections in SIV and RSS.]

FIGURE 1.6  Corticocortical projections to multisensory areas preferentially terminate in supragranular lay-
ers. In “A,” all panels represent coronal sections through RSS with layer IV approximated by dashed line. For
each area injected (e.g., AI, SIV, AEV, etc.), each dot represents one labeled axon terminal (bouton). (Redrawn
from Clemo, H.R. et al., J. Comp. Neurol., 503, 110–127, 2007; Clemo, H.R. et al., Exp. Brain Res., 191, 37–47,
2008; Meredith, M.A. et al., Exp. Brain Res., 172, 472–484, 2006.)
layers (see Figure 1.6). Thus, it might seem that cross-modal projections that have supragranular
terminations underlie a specific form of multisensory processing. However, in the auditory field of
the rostral suprasylvian sulcus (which is part of the rostral suprasylvian sulcal cortex; Clemo et al.
2007), projections from somatosensory area SIV have a similar supragranular distribution, but both
subthreshold and bimodal forms of multisensory neurons are present. Therefore, it is not conclusive
that the supragranular projections and subthreshold multisensory processing correlate. It is clear,
however, that cross-modal corticocortical projections are strongly characterized by supragranular
patterns of termination.

1.3  DO ALL CROSS-MODAL PROJECTIONS GENERATE MULTISENSORY INTEGRATION?

Some of the cross-modal projections illustrated in the previous section would be described as
modest, at best, in their density of termination in the target region. In fact, it has been suggested
that this comparative reduction in projection strength may be one feature of convergence that
underlies subthreshold multisensory effects (Allman et al. 2009). Other reports of cortical cross-
modal projections, specifically those between the auditory and visual cortex in monkeys (Falchier
et al. 2002; Rockland and Ojima 2003), have also been characterized by the same sparseness of
projection. Nevertheless, in these cases, it seems to be broadly accepted that such sparse pro-
jections would not only underlie overt auditory activity in the visual cortex, but would lead to
multisensory integration there as well (Falchier et al. 2002). Data from unanesthetized, para-
lyzed animals have been cited in support of such interpretations (Murata et al. 1965; Bental et al.
1968; Spinelli et al. 1968; Morrell 1972; Fishman and Michael 1973), but it has been argued that
these results are inconsistent with auditory sensory activity (Allman et al. 2008). Thus, although
the functional effects of such sparse cross-modal projections are under dispute, the presence of
these projections among the repertoire of corticocortical connections now seems well established.
Therefore, a recent study (Allman et al. 2008) was initiated to examine the functional effects of a
modest cross-modal projection from auditory to visual cortices in ferrets. Tracer injections cen-
tered on A1 of ferret cortex were shown to label terminal projections in the supragranular layers
of visual area 21. However, single-unit recordings were unable to identify the result of that cross-
modal convergence in area 21: no bimodal neurons were observed. Furthermore, tests to reveal
subthreshold multisensory influences were also unsuccessful. Ultimately, only when local inhibi-
tion was pharmacologically blocked (via iontophoresis of bicuculine methiodide, the antagonist
of gamma-aminobutyric acid-alpha (GABA-a) was there a statistically significant indication of
cross-modal influence on visual processing. These results support the notion that multisensory
convergence does lead to multisensory processing effects, but those effects may be subtle and
manifest themselves in nontraditional forms (e.g., nonbimodal; Allman et al. 2008). In fact, this
interpretation is consistent with the results of a recent study of the effects of auditory stimulation
on visual processing in V1 of awake, behaving monkeys (Wang et al. 2008): no bimodal neurons
were observed, but responses to visual–auditory stimuli were significantly shorter in latency when
compared with those elicited by visual stimuli alone.
From another perspective, these data provide additional support to the notion that multisensory
convergence is not restricted to bimodal neurons. The well-known pattern of convergence underlying bimodal neurons has already been modified, as shown in Figure 1.5, to include subthreshold
multisensory neurons whose functional behavior might be defined by an imbalance of inputs from
the two different modalities. When considering the result of multisensory convergence in area 21, it
is not much of a design modification to reduce those subthreshold inputs even further, such that they
might be effective under specific contexts or conditions. Moreover, reducing the second set of inputs
further toward zero essentially converts a multisensory circuit (albeit a weak one) into a unisensory
circuit. Thus, it seems logical to propose that patterns of connectivity that produce multisensory
properties span a continuum from, at one end, the profuse levels of inputs from different modalities

FIGURE 1.7  Patterns of sensory convergence (black; from modality “A” or “B”) onto individual neurons
(gray) result in different forms of processing (similar to Figure 1.5). Synaptic arrangement depicted in middle panels is adjusted such that inputs from modality “B” are light (left center) or very sparse (right center), sug-
gesting a slight difference of effect of modality “B” on responses elicited by “A.” In addition, because each
of these effects results from simple yet systematic changes in synaptic arrangement, these patterns suggest
that multisensory convergence occurs over a continuum of synaptic arrangements that, on one end, produces
bimodal multisensory properties, whereas on the other, it underlies only unisensory processing.

that produce bimodal neurons to, at the other end, the complete lack of inputs from a second modal-
ity that defines unisensory neurons (see Figure 1.7).
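
The continuum proposed here can be illustrated with a deliberately simple rate-threshold toy model, sketched below in Python; the weights and threshold are arbitrary illustrative values, not measured synaptic parameters. Sweeping the weight of the second modality from strong to zero moves the same circuit through the bimodal, subthreshold, and unisensory regimes of Figures 1.5 and 1.7.

def response(drive_a, drive_b, w_b, threshold=1.0, w_a=1.2):
    """Output of a toy threshold neuron driven by modality A (weight w_a) and B (weight w_b)."""
    net = w_a * drive_a + w_b * drive_b
    return max(0.0, net - threshold)

# Sweep the strength of the second modality's input, as in the proposed continuum.
for label, w_b in [("bimodal", 1.2), ("subthreshold", 0.4), ("unisensory", 0.0)]:
    r_a  = response(1.0, 0.0, w_b)      # modality A alone: responds in all three regimes
    r_b  = response(0.0, 1.0, w_b)      # modality B alone: responds only in the bimodal regime
    r_ab = response(1.0, 1.0, w_b)      # combined stimulation: integration, facilitation, or no effect
    print(f"{label:12s}  A: {r_a:.2f}   B: {r_b:.2f}   A+B: {r_ab:.2f}")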

1.4  SYNAPTIC ARCHITECTURE OF MULTISENSORY CONVERGENCE


Implicit in the conclusions derived from the studies cited above is the notion that heavy cross-modal
projections underlie bimodal multisensory processing at the target site, whereas modest projections
subserve subthreshold multisensory processing. Although this general notion correlating projec-
tion strength with specific forms of multisensory effects awaits quantification, it is consistent with
the overarching neurophysiological principle that different patterns of connectivity underlie differ-
ent circuits and behaviors.
Another basic feature of neuronal connectivity is the priority of the location at which synapses
occur. It is well accepted that synapses located on a neuron’s soma are more likely to influence its spiking activity than synapses occurring out on the dendrites, and that synapses on proximal dendrites have a higher probability of affecting activity than those at more distal sites.
Therefore, the synaptic architecture of multisensory processing should also be considered when
assessing the functional effects of cross-modal (and multisensory) projections. However, virtu-
ally nothing is known about the structure of multisensory convergence at the neuronal level. In
fact, the only electron micrographic documentation of multisensory convergence comes not from
the cortex, but from brainstem studies of somatosensory inputs to the dorsal cochlear nucleus
(Shore et al. 2000). Although the significance of this observation of multisensory convergence at
the first synapse in the auditory projection stream cannot be overstated, the technique of electron
microscopy is poorly adapted for making comparisons of multiple synaptic contacts along the
same neuron.
Confocal laser microscopy, coupled with multiple-fluorescent labeling techniques, can visual-
ize entire neurons as well as magnify areas of synaptic contact to submicron resolution (e.g., see
Vinkenoog et al. 2005). This technique was used in a recent study of auditory FAES cross-modal

FIGURE 1.8  (See color insert.) Confocal images of a somatosensory SIV neuron (red) contacted by boutons
that originated in auditory FAES (green). A three-dimensional rendering of a trimmed confocal stack contain-
ing a calretinin-positive SIV neuron (red; scale bar, 10 μm) that was contacted by two axons (green) labeled
from auditory area FAES. Each of the axo-dendritic points of contact are enlarged on the right (white arrows;
scale bar, 1.0 μm) to reveal the putative bouton swelling. (From Keniston, L.P. et al., Exp. Brain Res., 202,
725–731, 2010. With permission.)

projections to somatosensory area SIV (Keniston et al. 2010). First, a tracer (fluoroemerald, linked
to biotinylated dextran amine) was injected into the auditory FAES and allowed to transport to
SIV. Next, because inhibitory interneurons represent only about 20% of cortical neurons, immu-
nofluorescent tags of specific subclasses of interneurons would make them stand out against the
neuropil. Therefore, immunocytochemical techniques were used to rhodamine-label SIV interneu-
rons containing a calcium-binding protein (e.g., parvalbumin, calbindin, calretinin). Double-labeled
tissue sections were examined by a laser-scanning confocal microscope (TCS SP2 AOBS, Leica
Microsystems) and high-magnification image stacks were collected, imported into Volocity
(Improvision, Lexington, Massachusetts), and deconvolved (AutoQuant, Media Cybernetics). A
synaptic contact was defined as an axon swelling that showed no gap between it and the immuno-
positive neuron. Of the 33 immunopositive neurons identified, a total of 59 contacts were observed
with axon terminals labeled from the FAES, two of which are illustrated in Figure 1.8. Sixty-four
percent (21 of 33) of interneurons showed one or more contacts; the average was 2.81 (±1.4), with a
maximum of 5 found on one neuron. Thus, the anatomical techniques used here visualized cross-modal convergence at the neuronal level and provided some of the first insights into the synaptic architecture of multisensory connections.
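
As a concrete illustration of this quantification, the short Python sketch below recomputes the summary statistics from a per-neuron list of contact counts; the list itself is hypothetical (constructed only to be consistent with the published totals of Keniston et al. 2010), and the variable names are ours.

import statistics

# Hypothetical per-neuron contact counts for the 33 immunopositive neurons (12 with
# no contact), chosen only to match the published summary values; these are NOT the
# actual counts from Keniston et al. (2010).
contacts = [0] * 12 + [5, 5, 5, 4, 4, 4, 4, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1]

contacted = [c for c in contacts if c > 0]
print(f"neurons examined:         {len(contacts)}")
print(f"total contacts:           {sum(contacts)}")                      # 59
print(f"neurons with >=1 contact: {len(contacted)} "
      f"({100 * len(contacted) / len(contacts):.0f}%)")                  # 21 (64%)
print(f"contacts per contacted neuron: "
      f"{statistics.mean(contacted):.2f} +/- {statistics.stdev(contacted):.1f} "
      f"(max {max(contacted)})")                                         # ~2.81 +/- 1.4 (max 5)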

1.5  SUMMARY AND CONCLUSIONS


Historically, anatomical studies of multisensory processing focused primarily on the source of inputs
to structures that showed responses to more than one sensory modality. However, because conver-
gence is the defining step in multisensory processing, it would seem most important to understand
how the terminations of those inputs generate multisensory effects. Furthermore, because multisen-
sory processing is not restricted to only bimodal (or trimodal) neurons, the synaptic architecture
of multisensory convergence may be revealed to be as distinct and varied as the perceptions and
behaviors these multisensory circuits subserve.

ACKNOWLEDGMENTS
This study was supported by NIH grant NS039460.

REFERENCES
Allman, B.L., and M.A. Meredith. 2007. Multisensory processing in “unimodal” neurons: Cross-modal sub-
threshold auditory effects in cat extrastriate visual cortex. Journal of Neurophysiology 98:545–549.
Allman, B.L., R.E. Bittencourt-Navarrete, L.P. Keniston, A.E. Medina, M.Y. Wang, and M.A. Meredith. 2008.
Do cross-modal projections always result in multisensory integration? Cerebral Cortex 18:2066–2076.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2009. Not just for bimodal neurons anymore: The contribu-
tion of unimodal neurons to cortical multisensory processing. Brain Topography 21:157–167.
Behan, M., P.P. Appell, and M.J. Graper. 1988. Ultrastructural study of large efferent neurons in the supe-
rior colliculus of the cat after retrograde labeling with horseradish peroxidase. Journal of Comparative
Neurology 270:171–184.
Benevento, L.A., J.H. Fallon, B. Davis, and M. Rezak. 1977. Auditory–visual interaction in single cells in the
cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey. Experimental
Neurology 57:849–872.
Bental, E., N. Dafny, and S. Feldman. 1968. Convergence of auditory and visual stimuli on single cells in the
primary visual cortex of unanesthetized unrestrained cats. Experimental Neurology 20:341–351.
Bowman, E.M., and C.R. Olson. 1988. Visual and auditory association areas of the cat’s posterior ectosylvian
gyrus: Cortical afferents. Journal of Comparative Neurology 272:30–42.
Bruce, C., R. Desimone, and C.G. Gross. 1981. Visual properties of neurons in a polysensory area in superior
temporal sulcus of the macaque. Journal of Neurophysiology 46:369–384.
Burton, H., and E.M. Kopf. 1984. Ipsilateral cortical connections from the second and fourth somatic sensory
areas in the cat. Journal of Comparative Neurology 225:527–553.
Carriere, B.N., D.W. Royal, T.J. Perrault, S.P. Morrison, J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2007.
Visual deprivation alters the development of cortical multisensory integration. Journal of Neurophysiology
98:2858–2867.
Clemo, H.R., and M.A. Meredith. 2004. Cortico-cortical relations of cat somatosensory areas SIV and SV.
Somatosensory & Motor Research 21:199–209.
Clemo, H.R., and B.E. Stein. 1983. Organization of a fourth somatosensory area of cortex in cat. Journal of
Neurophysiology 50:910–925.
Clemo, H.R., B.L. Allman, M.A. Donlan, and M.A. Meredith. 2007. Sensory and multisensory representations
within the cat rostral suprasylvian cortex. Journal of Comparative Neurology 503:110–127.
Clemo, H.R., G.K. Sharma, B.L. Allman, and M.A. Meredith. 2008. Auditory projections to extrastriate visual
cortex: Connectional basis for multisensory processing in ‘unimodal’ visual neurons. Experimental Brain
Research 191:37–47.
Dehner, L.R., L.P. Keniston, H.R. Clemo, and M.A. Meredith. 2004. Cross-modal circuitry between auditory
and somatosensory areas of the cat anterior ectosylvian sulcal cortex: A ‘new’ inhibitory form of multi-
sensory convergence. Cerebral Cortex 14:387–403.
Falchier, A., C. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience 22:5749–5759.
Fishman, M.C., and P. Michael. 1973. Integration of auditory information in the cat’s visual cortex. Vision
Research 13:1415–1419.
Fuentes-Santamaria, V., J.C. Alvarado, B.E. Stein, and J.G. McHaffie. 2008. Cortex contacts both output neu-
rons and nitrergic interneurons in the superior colliculus: Direct and indirect routes for multisensory
integration. Cerebral Cortex 18:1640–1652.
Harting, J.K., and D.P. Van Lieshout. 1991. Spatial relationships of axons arising from the substantia nigra,
spinal trigeminal nucleus, and the pedunculopontine tegmental nucleus within the intermediate gray of
the cat superior colliculus. Journal of Comparative Neurology 305:543–558.
Harting, J.K., B.V. Updyke, and D.P. Van Lieshout. 1992. Corticotectal projections in the cat: Anterograde
transport studies of twenty-five cortical areas. Journal of Comparative Neurology 324:379–414.
Harting, J.K., S. Feig, and D.P. Van Lieshout. 1997. Cortical somatosensory and trigeminal inputs to the
cat superior colliculus: Light and electron microscopic analyses. Journal of Comparative Neurology
388:313–326.
Hikosaka, K., E. Iwai, H. Saito, and K. Tanaka. 1988. Polysensory properties of neurons in the anterior
bank of the caudal superior temporal sulcus of the macaque monkey. Journal of Neurophysiology
60:1615–1637.
Horn, G., and R.M. Hill. 1966. Responsiveness to sensory stimulation of units in the superior colliculus and
subjacent tectotegmental regions of the rabbit. Experimental Neurology 14:199–223.
Illing, R.-B., and A.M. Graybiel. 1986. Complementary and non-matching afferent compartments in the cat’s
superior colliculus: Innervation of the acetylcholinesterase-poor domain of the intermediate gray layer.
Neuroscience 18:373–394.
Jiang, H., F. Lepore, M. Ptito, and J.P. Guillemot. 1994. Sensory interactions in the anterior ectosylvian cortex
of cats. Experimental Brain Research 101:385–396.
Keniston, L.P., S.C. Henderson, and M.A. Meredith. 2010. Neuroanatomical identification of crossmodal audi-
tory inputs to interneurons in somatosensory cortex. Experimental Brain Research 202:725–731.
Lee, C.C., and J.A. Winer. 2008. Connections of cat auditory cortex: III. Corticocortical system. Journal of
Comparative Neurology 507:1920–1943.
Meredith, M.A. 2004. Cortico-cortical connectivity and the architecture of cross-modal circuits. In Handbook of
Multisensory Processes. C. Spence, G. Calvert, and B. Stein, eds. 343–355. Cambridge, MA: MIT Press.
Meredith, M.A., and B.L. Allman. 2009. Subthreshold multisensory processing in cat auditory cortex. Neuroreport 20:126–131.
Meredith, M.A., and B.E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. Journal of Neurophysiology 56:640–662.
Meredith, M.A., L.P. Keniston, L.R. Dehner, and H.R. Clemo. 2006. Crossmodal projections from somatosen-
sory area SIV to the auditory field of the anterior ectosylvian sulcus (FAES) in cat: Further evidence for
subthreshold forms of multisensory processing. Experimental Brain Research 172:472–484.
Monteiro, G., H.R. Clemo, and M.A. Meredith. 2003. Auditory cortical projections to the rostral suprasylvian
sulcal cortex in the cat: Implications for its sensory and multisensory organization. Neuroreport 14:2139–2145.
Mori, A., T. Fuwa, A. Kawai et al. 1996. The ipsilateral and contralateral connections of the fifth somatosensory
area (SV) in the cat cerebral cortex. Neuroreport 7:2385–2387.
Morrell, F. 1972. Visual system’s view of acoustic space. Nature 238:44–46.
Moschovakis, A.K., and A.B. Karabelas. 1985. Observations on the somatodendritic morphology and axonal
trajectory of intracellularly HRP-labeled efferent neurons located in the deeper layers of the superior col-
liculus of the cat. Journal of Comparative Neurology 239:276–308.
Mucke, L., M. Norita, G. Benedek, and O. Creutzfeldt. 1982. Physiologic and anatomic investigation of a
visual cortical area situated in the ventral bank of the anterior ectosylvian sulcus of the cat. Experimental
Brain Research 46:1–11.
Murata, K., H. Cramer, and P. Bach-y-Rita. 1965. Neuronal convergence of noxious, acoustic, and visual stim-
uli in the visual cortex of the cat. Journal of Neurophysiology 28:1223–1239.
Olson, C.R., and A.M. Graybiel. 1987. Ectosylvian visual area of the cat: Location, retinotopic organization,
and connections. Journal of Comparative Neurology 261:277–294.
Rauschecker, J.P., and M. Korte. 1993. Auditory compensation for early blindness in cat cerebral cortex.
Journal of Neuroscience 13:4538–4548.
Reinoso-Suarez, F., and J.M. Roda. 1985. Topographical organization of the cortical afferent connections to the
cortex of the anterior ectosylvian sulcus in the cat. Experimental Brain Research 59:313–324.
Rockland, K.S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology 50:19–26.
Saleem, K.S., W. Suzuki, K. Tanaka, and T. Hashikawa. 2000. Connections between anterior inferotem-
poral cortex and superior temporal sulcus regions in the macaque monkey. Journal of Neuroscience
20:5083–5101.
Seltzer, B., and D.N. Pandya. 1980. Converging visual and somatic sensory input to the intraparietal sulcus of
the rhesus monkey. Brain Research 192:339–351.
Seltzer, B., and D.N. Pandya. 1994. Parietal, temporal, and occipital projections to cortex of the superior
temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology
343:445–463.
Shore, S.E., Z. Vass, N.L. Wys, and R.A. Altschuler. 2000. Trigeminal ganglion innervates the auditory brain-
stem. Journal of Comparative Neurology 419:271–285.
Spinelli, D.N., A. Starr, and T.W. Barrett. 1968. Auditory specificity in unit recordings from cat’s visual cortex.
Experimental Neurology 22:75–84.
Stein, B.E., and M.A. Meredith. 1993. Merging of the Senses. Cambridge, MA: MIT Press.
Toldi, J., O. Feher, and L. Feuer. 1984. Dynamic interactions of evoked potentials in a polysensory cortex of
the cat. Neuroscience 13:945–952.
Vinkenoog, M., M.C. van den Oever, H.B. Uylings, and F.G. Wouterlood. 2005. Random or selective neuroana-
tomical connectivity. Study of the distribution of fibers over two populations of identified interneurons in
cerebral cortex. Brain Research. Brain Research Protocols 14:67–76.
Wallace, M.T., and B.E. Stein. 1997. Development of multisensory neurons and multisensory integration in cat
superior colliculus. Journal of Neuroscience 17:2429–2444.
Wallace, M.T., M.A. Meredith, and B.E. Stein. 1992. The integration of multiple sensory inputs in cat cortex.
Experimental Brain Research 91:484–488.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences 101:2167–2172.
Wang, Y., S. Celebrini, Y. Trotter, and P. Barone. 2008. Visuo–auditory interactions in the primary visual cortex
of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9:79.
Yaka, R., N. Notkin, U. Yinon, and Z. Wollberg. 2002. Visual, auditory and bimodal activity in the banks of the
lateral suprasylvian sulcus in the cat. Neuroscience and Behavioral Physiology 32:103–108.
2 Cortical and Thalamic Pathways for Multisensory and Sensorimotor Interplay
Céline Cappe, Eric M. Rouiller, and Pascal Barone

CONTENTS
2.1 Introduction............................................................................................................................. 15
2.2 Cortical Areas in Multisensory Processes............................................................................... 15
2.2.1 Multisensory Association Cortices.............................................................................. 15
2.2.1.1 Superior Temporal Sulcus............................................................................. 16
2.2.1.2 Intraparietal Sulcus....................................................................................... 16
2.2.1.3 Frontal and Prefrontal Cortex....................................................................... 16
2.2.2 Low-Level Sensory Cortical Areas............................................................................. 17
2.2.2.1 Auditory and Visual Connections and Interactions...................................... 17
2.2.2.2 Auditory and Somatosensory Connections and Interactions........................ 19
2.2.2.3 Visual and Somatosensory Connections and Interactions............................ 19
2.2.2.4 Heteromodal Projections and Sensory Representation................................. 19
2.3 Thalamus in Multisensory Processes......................................................................................20
2.3.1 Thalamocortical and Corticothalamic Connections...................................................20
2.3.2 Role of Thalamus in Multisensory Integration............................................................ 21
2.4 Higher-Order, Lower-Order Cortical Areas and/or Thalamus?.............................................. 23
2.5 Conclusions..............................................................................................................................24
Acknowledgments.............................................................................................................................24
References.........................................................................................................................................24

2.1  INTRODUCTION
Numerous studies in both monkeys and humans have provided evidence for multisensory integration in both high-level and low-level cortical areas. This chapter focuses on the anatomical pathways contributing to multisensory integration. We first describe the anatomical connections existing between different sensory cortical areas, dealing briefly with the well-known connections between associative cortical areas and then with the more recently described connections targeting low-level sensory cortical areas. We then describe the connections of the thalamus with different sensory and motor areas and their potential role in multisensory and sensorimotor integration. Finally, we discuss the several routes by which the brain may integrate information about the environment arriving through the different senses.

2.2  CORTICAL AREAS IN MULTISENSORY PROCESSES


2.2.1  Multisensory Association Cortices
Parietal, temporal, and frontal cortical regions of primates have been reported to be polysensory cor-
tical areas, i.e., related to more than a single sensory modality. We describe here several important
features about these regions, focusing on the superior temporal sulcus (STS), the intraparietal sul-
cus, and the frontal cortex.

2.2.1.1  Superior Temporal Sulcus


Desimone and Gross (1979) found neurons responsive to visual, auditory, and somatosensory stimuli
in a temporal region of the STS referred to as the superior temporal polysensory area (STP) (see also Bruce et al. 1981;
Baylis et al. 1987; Hikosaka et al. 1988). The rostral part of the STS (Bruce et al. 1981; Benevento et al.
1977) appears to contain more neurons with multisensory properties than the caudal part (Hikosaka
et al. 1988). The connections of the STP include higher-order visual cortical areas such as posterior pari-
etal visual areas (Seltzer and Pandya 1994; Cusick et al. 1995) and temporal lobe visual areas (Kaas
and Morel 1993), auditory cortical areas (Pandya and Seltzer 1982), and posterior parietal cortex
(Seltzer and Pandya 1994; Lewis and Van Essen 2000). The STS region also has various connections
with the prefrontal cortex (Cusick et al. 1995). In humans, numerous neuroimaging studies have
shown multisensory convergence in the STS region (see Barraclough et al. 2005 for a review).
Recently, studies have focused on the role of the polysensory areas of the STS and their interac-
tions with the auditory cortex in processing primate communications (Ghazanfar 2009). The STS
is probably one of the origins of visual inputs to the auditory cortex (Kayser and Logothetis 2009;
Budinger and Scheich 2009; Cappe et al. 2009a; Smiley and Falchier 2009) and thus participates
in the multisensory integration of conspecific faces and vocalizations (Ghazanfar et al. 2008) that
occurs in the auditory belt areas (Ghazanfar et al. 2005; Poremba et al. 2003). These findings sup-
port the hypothesis of general roles for the STS region in synthesizing perception of speech and
general biological motion (Calvert 2001).

2.2.1.2  Intraparietal Sulcus


The posterior parietal cortex contains a number of different areas including the lateral intraparietal
(LIP) and ventral intraparietal (VIP) areas, located in the intraparietal sulcus. These areas seem to
be functionally related and appear to encode the location of objects of interest (Colby and Goldberg
1999). These areas are thought to transform sensory information into signals related to the control
of hand and eye movements via projections to the prefrontal, premotor, and visuomotor areas of
the frontal lobe (Rizzolatti et al. 1997). Neurons of the LIP area present multisensory properties
(Cohen et al. 2005; Russ et al. 2006; Gottlieb 2007). Similarly, neurons recorded in the VIP area
exhibit typical multisensory responses (Duhamel et al. 1998; Bremmer et al. 2002; Schlack et al.
2005; Avillac et al. 2007). Anatomically, LIP and VIP are connected with cortical areas of different
sensory modalities (Lewis and Van Essen 2000). In particular, VIP receives inputs from posterior
parietal areas 5 and 7 and insular cortex in the region of S2, and few inputs from visual regions
such as PO and MST (Lewis and Van Essen 2000). Although it is uncertain whether neurons in VIP
are responsive to auditory stimuli, auditory inputs may originate from the dorsolateral auditory belt
and parabelt (Hackett et al. 1998). The connectivity pattern of LIP (Andersen et al. 1990; Blatt et
al. 1990; Lewis and Van Essen 2000) is consistent with neuronal responses related to eye position
and visual inputs. Auditory and somatosensory influences appear to be very indirect and visuomo-
tor functions dominate, as the connection pattern suggests. In particular, the ventral part of the
LIP is connected with areas dealing with spatial information (Andersen et al. 1997) as well as with
the frontal eye field (Schall et al. 1995), whereas the dorsal part of the LIP is connected with areas
responsible for the processing of visual information related to the form of objects in the inferotem-
poral cortex (ventral “what” visual pathway). Both LIP and VIP neurons exhibit task-dependent
responses (Linden et al. 1999; Gifford and Cohen 2004), although the strength of this dependence
and its rules remain to be determined.

2.2.1.3  Frontal and Prefrontal Cortex


The premotor cortex, located in the frontal lobe, contains neurons with responses to somatosensory,
auditory, and visual signals, especially its ventral part as shown in monkeys (Fogassi et al. 1996;
Graziano et al. 1994, 1999). Somatosensory responses may be mediated by connections with soma-
tosensory area S2 and parietal ventral (PV) somatosensory area (Disbrow et al. 2003) and with
the posterior parietal cortex, such as areas 5, 7a, 7b, anterior intraparietal area (AIP), and VIP (see
Kaas and Collins 2004). Visual inputs could also come from the posterior parietal region. The belt
and parabelt auditory areas project to regions rostral to the premotor cortex (Hackett et al. 1999;
Romanski et al. 1999) and may contribute to auditory activation, as well as connections from the
trimodal portion of area 7b to the premotor cortex (Graziano et al. 1999).
Anterior to the premotor cortex, the prefrontal cortex plays a key role in temporal integration and
is related to evaluative and cognitive functions (Milner et al. 1985; Fuster 2001). Much of this cortex
has long been considered to be multisensory (Bignall 1970) but some regions are characterized by
some predominance in one sensory modality, such as an auditory domain in the ventral prefrontal
region (Suzuki 1985; Romanski and Goldman-Rakic 2002; Romanski 2004). This region receives
projections from auditory, visual, and multisensory cortical regions (e.g., Gaffan and Harrison 1991;
Barbas 1986; Romanski et al. 1999; Fuster et al. 2000), which are mediated through different func-
tional streams ending separately in the dorsal and ventral prefrontal regions (Barbas and Pandya
1987; Kaas and Hackett 2000; Romanski et al. 1999). This cortical input arising from different
modalities confer to the prefrontal cortex a role in cross-modal association (see Petrides and Iversen
1976; Joseph and Barone 1987; Barone and Joseph 1989; Ettlinger and Wilson 1990) as well as in
merging sensory information especially in processing conspecific auditory and visual communica-
tion stimuli (Romanski 2007; Cohen et al. 2007).

2.2.2  Low-Level Sensory Cortical Areas


Several studies provide evidence that anatomical pathways between low-level sensory cortical areas
may represent the anatomical substrate for early multisensory integration. We detail these patterns of connections below according to the sensory modalities involved.

2.2.2.1  Auditory and Visual Connections and Interactions


Recently, the use of anterograde and retrograde tracers in the monkey brain made it possible to
highlight direct projections from the primary auditory cortex (A1), the caudal auditory belt and
parabelt, and the polysensory area of the temporal lobe (STP) to the periphery of the primary visual
cortex (V1, area 17 of Brodmann) (Falchier et al. 2002), as well as from the associative auditory
cortex to the primary and secondary visual areas (Rockland and Ojima 2003). These direct projections from the auditory cortex to the primary visual areas have the character of feedback connections and may play a role in the “foveation” of a peripheral auditory sound source
(Heffner and Heffner 1992). The reciprocity of these connections from visual areas to auditory
areas was also tested in a recent study (Falchier et al. 2010) that revealed the existence of projec-
tions from visual areas V2 and prostriata to auditory areas, including the caudal medial and lateral
belt area, the caudal parabelt area, and the temporoparietal area. Furthermore, in the marmoset,
a projection from the high-level visual areas to the auditory cortex was also reported (Cappe and
Barone 2005). More precisely, an area anterior to the STS (corresponding to the STP) sends con-
nections toward the auditory core with a pattern of feedback connections. Thus, multiple sources
can provide visual input to the auditory cortex in monkeys (see also Smiley and Falchier 2009;
Cappe et al. 2009a).
Direct connections between the primary visual and auditory areas have been found in rodents,
such as in the gerbil (Budinger et al. 2006) or the prairie vole (Campi et al. 2010) as well as in car-
nivores. For example, the primary auditory cortex of the ferret receives a sparse projection from
the visual areas including the primary visual cortex (Bizley et al. 2007). Similarly, in the adult cat,
visual and auditory cortices are interconnected but the primary sensory fields are not the main areas
involved. Only a minor projection is observed from A1 toward the visual areas A17/18 (Innocenti
et al. 1988), the main component arising from the posterior auditory field (Hall and Lomber 2008). It
is important to note that the density of these auditory–visual interconnections probably tends to decrease when going from rodents to carnivores to primates. This would account for the higher incidence of cross-modal responses in unisensory areas of rodents (Wallace et al. 2004), whereas such responses are not present in the primary visual or auditory cortex of the monkey
(Lakatos et al. 2007; Kayser et al. 2008; Wang et al. 2008).
On the behavioral side, experiments conducted in animals have in most cases addressed multisensory integration of spatial cues, for instance, the correspondence between auditory space and visual space. These experiments were mainly conducted in cats (Stein et al. 1989; Stein and
Meredith 1993; Gingras et al. 2009). For example, Stein and collaborators (1989) trained cats to
move toward visual or auditory targets with weak salience, resulting in poor performance that did
not exceed 25% on average. When the same stimuli were presented in spatial and temporal con-
gruence, the percentage of correct detections increased up to nearly 100%. In monkeys, only few
experiments have been conducted on behavioral facilitation induced by multimodal stimulation
(Frens and Van Opstal 1998; Bell et al. 2005). In line with human studies, simultaneous presenta-
tion in monkeys of a sound during a visually guided saccade induced a reduction of about 10% to
15% of saccade latency depending on the visual stimulus contrast level (Wang et al. 2008). Recently,
we have shown behavioral evidence for multisensory facilitation between vision and hearing in
macaque monkeys (Cappe et al. 2010). Monkeys were trained to perform a simple detection task
to stimuli, which were auditory (noise), visual (flash), or auditory–visual (noise and flash) at dif-
ferent intensities. By varying the intensity of individual auditory and visual stimuli, we observed
that, when the stimuli are of weak saliency, the multisensory condition had a significant facilitatory
effect on reaction times, which disappeared at higher intensities (Cappe et al. 2010). We applied to the behavioral data the “race model” (Raab 1962), which supposes that the faster unimodal channel alone is responsible for the shortening in reaction time (“the faster the winner”) and which corresponds to a separate activation model (Miller 1982). The observed facilitation exceeded the race model prediction, indicating that the multisensory benefit at low intensity derives from a coactivation mechanism (Miller 1982) that implies a convergence of hearing and vision to produce multisensory interactions and a reduction in reaction time. The
anatomical studies previously described suggest that such a convergence may take place at the lower
levels of cortical sensory processing.
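
For readers unfamiliar with this test, here is a minimal Python sketch of one common way of checking Miller's race model inequality against empirical reaction time distributions; the function names, the grid resolution, and the synthetic reaction times are illustrative choices of ours and are not taken from Cappe et al. (2010).

import numpy as np

def ecdf(rts, t_grid):
    """Empirical cumulative RT distribution evaluated on a grid of times."""
    rts = np.sort(np.asarray(rts, dtype=float))
    return np.searchsorted(rts, t_grid, side="right") / rts.size

def race_model_violation(rt_a, rt_v, rt_av, n_points=200):
    """Return the time points at which the bimodal CDF exceeds Miller's race-model bound.

    Miller's inequality: P(RT_AV <= t) <= P(RT_A <= t) + P(RT_V <= t).
    Violations indicate coactivation rather than a simple race between channels.
    """
    all_rts = np.concatenate([rt_a, rt_v, rt_av])
    t_grid = np.linspace(all_rts.min(), all_rts.max(), n_points)
    bound = np.minimum(ecdf(rt_a, t_grid) + ecdf(rt_v, t_grid), 1.0)
    return t_grid[ecdf(rt_av, t_grid) > bound]

# Illustrative synthetic reaction times (ms); not data from Cappe et al. (2010).
rng = np.random.default_rng(0)
rt_a  = rng.normal(320, 40, 500)
rt_v  = rng.normal(310, 40, 500)
rt_av = rng.normal(265, 35, 500)           # faster than either unimodal condition
print(race_model_violation(rt_a, rt_v, rt_av)[:5])
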
In humans, numerous behavioral studies, using a wide range of paradigms and various types of stimuli, have shown the benefits of combined auditory–visual stimuli compared to unisensory stimuli (see Calvert et al. 2004 for a review; Romei et al. 2007; Cappe et al. 2009b as recent examples).
From a functional point of view, many studies have shown multisensory interactions early in
time and in different sensory areas with neuroimaging and electrophysiological methods. Auditory–
visual interactions have been revealed in the auditory cortex or visual cortex using electrophysi-
ological or neuroimaging methods in cats and monkeys (Ghazanfar et al. 2005; Bizley et al. 2007;
Bizley and King 2008; Cappe et al. 2007; Kayser et al. 2007, 2008; Lakatos et al. 2007; Wang et al.
2008). More specifically, electrophysiological studies in monkeys that revealed multisensory interactions in primary sensory areas such as V1 or A1 showed that cross-modal stimuli (i.e., auditory or visual stimuli, respectively) act mainly by modulating the non-“sensory-specific” response, the oscillatory activity (Lakatos et al. 2007; Kayser et al. 2008), or the latency of
the neuronal responses (Wang et al. 2008). These mechanisms can enhance the speed of sensory
processing and induce a reduction of the reaction times (RTs) during a multisensory stimulation.
Neurons recorded in the primary visual cortex showed a significant reduction in visual response
latencies, specifically in suboptimal conditions (Wang et al. 2008). It is important to mention that, in
the primary sensory areas of the primate, authors have reported the absence of nonspecific sensory
responses at the spiking level (Wang et al. 2008; Lakatos et al. 2007; Kayser et al. 2008). These
kinds of interactions between hearing and vision were also reported in humans using neuroimaging
techniques (Giard and Peronnet 1999; Molholm et al. 2002; Lovelace et al. 2003; Laurienti et al.
2004; Martuzzi et al. 2007).

2.2.2.2  Auditory and Somatosensory Connections and Interactions


The ability to use a number of distinct tracers allows connections between several cortical areas to be identified in the same animal. Indeed, we made injections of retrograde tracers into early visual (V2 and MT), somatosensory (1/3b), and auditory (core) cortical areas in marmosets (Cappe and Barone 2005), allowing us to reveal connections between cortical areas classically considered unisensory. Projections from visual areas, such as the STP, to the core auditory cortex have been found
(Cappe and Barone 2005), as described in Section 2.2.2. Other corticocortical projections, and in
particular from somatosensory to auditory cortex, were found, supporting the view that inputs from
different modalities are sent to cortical areas that are classically considered to be unimodal (Cappe
and Barone 2005). More precisely, our study revealed projections from somatosensory areas S2/
PV to the primary auditory cortex. Another study conducted in gerbils also showed connections
between the primary somatosensory cortex and the primary auditory cortex (Budinger et al. 2006).
In marmosets and macaques, projections from the retroinsular area of the somatosensory cortex
to the caudomedial belt auditory area were also reported (de la Mothe et al. 2006a; Smiley et al.
2007).
Intracranial recordings in the auditory cortex of monkeys have shown the modulation of auditory
responses by somatosensory stimuli, consistent with early multisensory convergence (Schroeder et
al. 2001; Schroeder and Foxe 2002; Fu et al. 2003). These findings have been extended by a func-
tional magnetic resonance imaging (fMRI) study in anesthetized monkeys, which showed auditory–
somatosensory interactions in the caudal lateral belt area (Kayser et al. 2005).
In humans, there have been previous demonstrations of a redundant signal effect between audi-
tory and tactile stimuli (Murray et al. 2005; Zampini et al. 2007; Hecht et al. 2008). Functional
evidence was mainly found with EEG and fMRI techniques (Foxe et al. 2000, 2002; Murray et al.
2005). In particular, Murray and collaborators (2005) reported in humans that neural responses
showed an initial auditory–somatosensory interaction in auditory association areas.

2.2.2.3  Visual and Somatosensory Connections and Interactions


Limited research has been focused on interactions between vision and touch. In our experiments,
using multiple tracing methods in marmoset monkeys (Cappe and Barone 2005), we found direct
projections from visual cortical areas to somatosensory cortical areas. More precisely, after an
injection of retrograde tracer in the primary somatosensory cortex (areas 1/3b), we observed projec-
tions originating from visual areas (the ventral and dorsal fundus of the superior temporal area, and
the middle temporal crescent).
From a functional point of view, electrophysiological recordings in the somatosensory cortex of
macaque monkeys showed modulations of responses by auditory and visual stimuli (Schroeder and
Foxe 2002). Behavioral results in humans demonstrated gain in performance when visual and tactile
stimuli were combined (Forster et al. 2002; Hecht et al. 2008). Evidence of functional interactions
between vision and touch was observed with neuroimaging techniques in humans (Amedi et al.
2002, 2007; James et al. 2002). In particular, it has been shown that the perception of tactile motion can activate the MT complex in humans (Hagen et al. 2002). It has also been demonstrated that the
extrastriate visual cortex area 19 is activated during tactile perception (see Sathian and Zangaladze
2002 for review).

2.2.2.4  Heteromodal Projections and Sensory Representation


In somatosensory (Krubitzer and Kaas 1990; Huffman and Krubitzer 2001) and visual systems
(Kaas and Morel 1993; Schall et al. 1995; Galletti et al. 2001; Palmer and Rosa 2006), there is
evidence for the existence of different connectivity patterns according to sensory representation,
especially in terms of the density of connections between areas. This observation also applies to
heteromodal connections. We found that the visual projections to areas 1/3b are restricted to the
representation of certain body parts (Cappe and Barone 2005). Some visual projections selectively
target the face (middle temporal crescent) or the arm (dorsal fundus of the superior temporal area)
representations in areas 1/3b. Similarly, auditory and multimodal projections to area V1 preferentially target the representation of the peripheral visual field (Falchier et al. 2002, 2010; Hall and
Lomber 2008), and only scattered neurons in the auditory cortex send a projection to foveal V1. The
fact that heteromodal connections are coupling specific sensory representations across modalities
probably reflects an adaptive process for behavioral specialization. This is in agreement with human
and monkey data showing that the neuronal network involved in multisensory integration, as well
as its expression at the level of the neuronal activity, is highly dependent on the perceptual task in
which the subject is engaged. In humans, the detection or discrimination of bimodal objects, as
well as the perceptual expertise of subjects, differentially affect both the temporal aspects and the
cortical areas at which multisensory interactions occur (Giard and Peronnet 1999; Fort et al. 2002).
Similarly, we have shown that the visuo–auditory interactions observed at the level of V1 neurons
are observed only in behavioral situations during which the monkey has to interact with the stimuli
(Wang et al. 2008).
Such an influence of the perceptual context on the neuronal expression of multisensory interac-
tion is also present when analyzing the phenomena of cross-modal compensation after sensory
deprivation in humans. In blind subjects (Sadato et al. 1996), the efficacy of somatosensory stimulation in activating the visual cortex is maximal during an active discrimination task
(Braille reading). This suggests that the mechanisms of multisensory interaction, at early stages of
sensory processing and the cross-modal compensatory mechanisms, are probably mediated through
common neuronal pathways involving the heteromodal connections described previously.

2.3  THALAMUS IN MULTISENSORY PROCESSES


2.3.1  Thalamocortical and Corticothalamic Connections
Although the cerebral cortex and the superior colliculus (Stein and Meredith 1993) have been shown
to be key structures for multisensory interactions, the idea that the thalamus could play a relay
role in multisensory processing has been frequently proposed (Ghazanfar and Schroeder 2006 for
review; Hackett et al. 2007; Cappe et al. 2009c; see also Cappe et al. 2009a for review).
By using anatomical multiple tracing methods in the macaque monkey, we were able to test this
hypothesis recently and looked at the relationship and the distribution of the thalamocortical and the
corticothalamic (CT) connections between different sensory and motor cortical areas and thalamic
nuclei (Cappe et al. 2009c). In this study, we provided evidence for the convergence of different
sensory modalities in the thalamus. Based on different injections in somatosensory [in the posterior
parietal somatosensory cortex (PE/PEa in area 5)], auditory [in the rostral (RAC) and caudal audi-
tory cortex (CAC)], and premotor cortical areas [dorsal and ventral premotor cortical areas (PMd
and PMv)] in the same animal, we were able to assess how connections between the cortex and the
different thalamic nuclei are organized.
We demonstrated for the first time the existence of overlapping territories of thalamic projections
to different sensory and motor areas. We focus our review on thalamic nuclei that project to more than two areas of different attributes rather than on sensory-specific thalamocortical
projections. Thalamocortical projections were found from the central lateral (CL) nucleus and the
mediodorsal (MD) nucleus to RAC, CAC, PEa, PE, PMd, and PMv. Common territories of projec-
tion were observed from the nucleus LP to PMd, PMv, PEa, and PE. The ventroanterior nucleus (VA),
known as a motor thalamic nucleus, sends projections to PE and to PEa. Interestingly, projections
distinct from the ones arising from specific unimodal sensory nuclei were observed from auditory
thalamic nuclei, such as projections from the medial geniculate nucleus to the parietal cortex (PE
in particular) and the premotor cortex (PMd/PMv). Last but not least, the medial pulvinar nucleus
(PuM) exhibits the most significant overlap across modalities, with projections from superimposed
territories to all six cortical areas injected with tracers. Projections from PuM to the auditory cor-
tex were also described by de la Mothe and colleagues (2006b). Hackett and collaborators (2007)
showed that somatosensory inputs may reach the auditory cortex (CM and CL) through connections
coming from the medial part of the medial geniculate nucleus (MGm) or the multisensory nuclei
[posterior, suprageniculate, limitans, and medial pulvinar (PuM)]. All these thalamocortical projec-
tions are consistent with the presence of thalamic territories possibly integrating different sensory
modalities with motor attributes.
We calculated the degree of overlap between thalamocortical and CT connections in the thalamus
to determine the relationships between the projections to areas of the same or of different modalities,
as previously described (Tanné-Gariépy et al. 2002; Morel et al. 2005; Cappe et al. 2009c). The
degree of overlap ranges from 0%, when the two thalamic territories projecting to two distinct
cortical areas are spatially completely segregated, to 100%, when the two territories fully overlap
(at a spatial resolution of 0.5 mm; see Cappe et al. 2009c for further details). Thalamic nuclei with spatially intermixed thalamo-
cortical cells projecting to auditory or premotor cortices were located mainly in the PuM, VA, and
CL nuclei. The overlap between the projections to the auditory and parietal cortical areas concerned
different thalamic nuclei such as PuM, CL, and to a lesser extent, LP and PuL. The projections to
the premotor and posterior parietal cortex overlapped primarily in PuM, LP, MD, and also in VA,
VLpd, and CL. Quantitatively, we found that projections from the thalamus to the auditory and
motor cortical areas overlapped to an extent ranging from 4% to 12% through the rostral thalamus
and increased up to 30% in the caudal part of the thalamus. In PuM, the degree of overlap between
thalamocortical projections to auditory and premotor cortex ranged from 14% to 20%. PuM is the
thalamic nucleus where the maximum of overlap between thalamocortical projections was found.
Aside from the thalamocortical connections, CT connections were also investigated in the same
study, concerning, in particular, the parietal areas PE and PEa injected with a tracer with antero-
grade properties (biotinylated dextran amine; Cappe et al. 2007). Indeed, areas PE and PEa send
CT projections to the thalamic nuclei PuM, LP, and to a lesser extent, VPL, CM, CL, and MD (PEa
only for MD). These thalamic nuclei contained both small and giant CT endings. The existence
of these two different types of CT endings reflects the possibility that CT connections represent
either feedback or feedforward projections (for review, see Rouiller and Welker 2000; Sherman and
Guillery 2002, 2005; Sherman 2007). In contrast to the feedback CT projection originating from
cortical layer VI, the feedforward CT projection originates from layer V and terminates in the thala-
mus in the form of giant endings, which can ensure highly secure and rapid synaptic transmission
(Rouiller and Welker 2000). Considering the TC and CT projections, some thalamic nuclei (PuM,
LP, VPL, CM, CL, and MD) could play a role in the integration of different sensory information
with or without motor attributes (Cappe et al. 2007, 2009c). Moreover, parietal areas PE and PEa
may send, via the giant endings, feedforward CT projections and thereby transthalamic projections
to remote cortical areas in the parietal, temporal, and frontal lobes, contributing to polysensory and
sensorimotor integration (Cappe et al. 2007, 2009c).

2.3.2  Role of Thalamus in Multisensory Integration


The interconnections between the thalamus and the cortex described in the preceding section suggest
that the thalamus could play the role of an early sensory integrator. An additional role for the
thalamus in multisensory interplay may derive from the organization of its CT and thalamocortical
connections/loops, as outlined in Section 2.3.1 (see also Crick and Koch 1998). Indeed, the thalamus
could also have a relay role between different sensory and/or premotor cortical areas. In particular,
the pulvinar, mainly its medial part, contains neurons which project to the auditory cortex, the
somatosensory cortex, the visual cortex, and the premotor cortex (Romanski et al. 1997; Hackett et
al. 1998; Gutierrez et al. 2000; Cappe et al. 2009c; see also Cappe et al. 2009a for a review). The
feedforward CT projection originating from different sensory or motor cortical areas, combined
with a subsequent TC projection, may allow a transfer of information between remote cortical
areas through a “cortico–thalamo–cortical” route (see, e.g., Guillery 1995; Rouiller and Welker
2000; Sherman and Guillery 2002, 2005; Sherman 2007; Cappe et al. 2009c). As described in
Section 2.3.1, the medial part of the pulvinar nucleus is the main candidate (although other thalamic
nuclei such as LP, VPL, MD, or CL may also play a role) to represent an alternative to corticocorti-
cal loops by which information can be transferred between cortical areas belonging to different sen-
sory and sensorimotor modalities (see also Shipp 2003). From a functional point of view, neurons in
PuM respond to visual stimuli (Gattass et al. 1979) and auditory stimuli (Yirmiya and Hocherman
1987), which is consistent with our hypothesis.
Another point is that, as our injections in the different sensory and motor areas included corti-
cal layer I (Cappe et al. 2009c), it is likely that some of these projections providing multimodal
information to the cortex originate from the so-called “matrix” calbindin-immunoreactive neurons
distributed in all thalamic nuclei and projecting diffusely and relatively widely to the cortex (Jones
1998).
Four different mechanisms of multisensory and sensorimotor interplay can be proposed based
on the pattern of convergence and divergence of thalamocortical and CT connections (Cappe et al.
2009c). First, some restricted thalamic territories send divergent projections to several cortical areas,
thereby providing different sensory and/or motor inputs that can be mixed simultaneously. Although such
multimodal integration in the temporal domain cannot be excluded (in case the inputs reach the
cerebral cortex at exactly the same time), it is less likely to provide massive multimodal interplay
than an actual spatial convergence of projections. More convincingly, this pattern could support
a temporal coincidence mechanism acting as a synchronizer between remote cortical areas, allowing a
higher perceptual saliency of multimodal stimuli (Fries et al. 2001). Second, thalamic nuclei could
be an integrator of multisensory information, rapidly relaying this integrated information to the
cortex by their multiple thalamocortical connections. In PuM, considerable mixing of territories
projecting to cortical areas belonging to several modalities is in line with previously reported con-
nections with several cortical domains, including visual, auditory, somatosensory, and prefrontal
and motor areas. Electrophysiological recordings showed visual and auditory responses in this
thalamic nucleus (see Cappe et al. 2009c for an extensive description). According to our analysis,
PuM, LP, MD, MGm, and MGd could play the role of integrator (Cappe et al. 2009c). Third, the
spatial convergence of different sensory and motor inputs at the cortical level coming from thal-
amocortical connections of distinct thalamic territories suggests a fast multisensory interplay. In
our experiments (Cappe et al. 2009c), the widespread distribution of thalamocortical inputs to the
different cortical areas injected could imply that this mechanism of convergence plays an impor-
tant role in multisensory and motor integration. By their cortical connection patterns, thalamic
nuclei PuM and LP, for instance, could play this role for auditory–somatosensory interplay in area
5 (Cappe et al. 2009c). Fourth, the cortico–thalamo–cortical route can support rapid and secure
transfer from area 5 (PE/PEa; Cappe et al. 2007) to the premotor cortex via the giant terminals
of these CT connections (Guillery 1995; Rouiller and Welker 2000; Sherman and Guillery 2002,
2005; Sherman 2007). These giant CT endings, consistent with this principle of a transthalamic loop,
have been shown to be present in different thalamic nuclei (e.g., Schwartz et al. 1991; Rockland
1996; Darian-Smith et al. 1999; Rouiller et al. 1998, 2003; Taktakishvili et al. 2002; Rouiller and
Durif 2004) and may well also be present in PuM, as suggested by the overlap between connections
to the auditory cortex and to the premotor cortex, allowing an auditory–motor integration (Cappe
et al. 2009c).
Thus, recent anatomical findings at the thalamic level (Komura et al. 2005; de la Mothe et al. 2006b;
Hackett et al. 2007; Cappe et al. 2007, 2009c) may provide the anatomical support for multisensory
behavioral phenomena as well as for multisensory integration at the functional level. Indeed,
some nuclei in the thalamus, such as the medial pulvinar, receive either mixed sensory inputs or
projections from different sensory cortical areas and project to sensory and premotor areas (Cappe
et al. 2009c). Sensory modalities may thus already be fused at the thalamic level before being
directly conveyed to the premotor cortex and consequently participating in the redundant signal
effect expressed by faster reaction times in response to auditory–visual stimulation (Cappe et al.
2010).

2.4 HIGHER-ORDER, LOWER-ORDER CORTICAL AREAS AND/OR THALAMUS?


When the race model is applied to behavioral performance in multisensory tasks, it cannot account
for the shorter reaction times observed in auditory–visual conditions (see Cappe et al. 2010 for data
in monkeys), a result that instead calls for a “coactivation” model and implies a convergence of the
sensory channels (Miller 1982). The anatomical level at which the coactiva-
tion occurs is still under debate (Miller et al. 2001), as it has been suggested to occur early at the
sensory level (Miller et al. 2001; Gondan et al. 2005) or late at the motor stage (Giray and Ulrich
1993). However, in humans, analysis of the relationships between behavioral and neuronal indices
(Molholm et al. 2002; Sperdin et al. 2009; Jepma et al. 2009) seems to suggest that this convergence
of the sensory channels occurs early in sensory processing, before the decision at motor levels
(Mordkoff et al. 1996; Gondan et al. 2005), as shown in monkeys (Lamarre et al. 1983; Miller et al.
2001; Wang et al. 2008). Determining the links between anatomic, neurophysiologic, and behav-
ioral indices of multisensory processes is necessary to understand the conditions under which a
redundant signal effect is observable.
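For readers unfamiliar with the race model test invoked here, the following minimal sketch (ours, not the analysis code of any of the cited studies; all names are illustrative) checks Miller's (1982) race-model inequality, P(RT <= t | AV) <= P(RT <= t | A) + P(RT <= t | V), against empirical reaction-time distributions. Latencies at which the bimodal distribution exceeds this bound indicate that a simple race between independent channels cannot explain the facilitation.

import numpy as np

def race_model_violations(rt_av, rt_a, rt_v, n_points=200):
    """Return latencies at which Miller's (1982) race-model inequality is violated.

    rt_av, rt_a, rt_v: reaction times (e.g., in ms) for bimodal, auditory-only,
    and visual-only trials. Under a race (no convergence) account,
    P(RT <= t | AV) <= P(RT <= t | A) + P(RT <= t | V) for every t; times at
    which the observed bimodal CDF exceeds this bound call for coactivation.
    """
    rt_av, rt_a, rt_v = (np.asarray(x, dtype=float) for x in (rt_av, rt_a, rt_v))
    t = np.linspace(min(r.min() for r in (rt_av, rt_a, rt_v)),
                    max(r.max() for r in (rt_av, rt_a, rt_v)), n_points)
    cdf = lambda rts: np.mean(rts[:, None] <= t, axis=0)  # empirical CDF on grid t
    bound = np.minimum(cdf(rt_a) + cdf(rt_v), 1.0)
    return t[cdf(rt_av) > bound]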
The existence of direct connections from a cortical area considered unisensory to another one of a
different modality poses a paradox for hierarchical models of sensory processing (Maunsell and
Van Essen 1983; Felleman and Van Essen 1991). The most recent findings provide evidence that
multisensory interactions can occur shortly after response onset, at the lowest processing stages
(see previous paragraphs). These new elements have to be incorporated into our view of sensory
system organization. It remains possible, of course, that some connections mediating early-stage
multisensory interactions have not yet been identified by anatomical methods.
Within a sensory system, the hierarchical relationships between cortical areas have been defined by the
feedforward or feedback nature of their connections, although the role of these connections
is only partially understood (Salin and Bullier 1995; Bullier 2006). Recent results suggest that multi-
sensory convergence in unisensory areas can intervene at low-level stages of information processing,
through feedback and feedforward circuits (Schroeder et al. 2001; Schroeder and Foxe 2002;
Fu et al. 2003; Cappe and Barone 2005). Accordingly, anatomical methods alone are not sufficient to
determine definitively whether a given connection is feedforward or feedback in functional terms,
and they cannot be used to establish a hierarchy between functional areas of different systems.
This review highlights that both higher-order association areas and lower-order cortical areas
are multisensory in nature and that the thalamus could also play a role in multisensory processing.
Figure 2.1 schematically summarizes the possible scenarios for multisensory integra-
tion through anatomical pathways. First, as traditionally proposed, information is processed from
the primary “unisensory” cortical areas to “multisensory” association cortical areas, and finally to
the premotor and motor cortical areas, in a hierarchical way (Figure 2.1a). In these multisensory
association areas, the strength and the latencies of neuronal responses are affected by the nature of
the stimuli (e.g., Avillac et al. 2007; Romanski 2007; Bizley et al. 2007). Second, recent evidence
demonstrated the existence of multisensory interaction at the first level of cortical processing of
the information (Figure 2.1b). Third, as we described in this review, the thalamus, through its numerous
connections, could play a role in this processing (Figure 2.1c). Altogether, this model represents the
different alternative pathways for multisensory integration. These multiple pathways, which coexist
(Figure 2.1d), may allow different routes to be used according to the task and/or may mediate infor-
mation of different natures (see Wang et al. 2008 for recent evidence of the influence of a perceptual
task on neuronal responses).
Taken together, the data reviewed here provide evidence for anatomical pathways possibly involved
in multisensory integration at low levels of information processing in the primate and argue against
a strict hierarchical model. An alternative route for multisensory integration appears to be the thalamus.
Indeed, as demonstrated in this chapter, the thalamus, thanks to its multiple connections, appears to
belong to a cortico–thalamo–cortical loop, suggesting that it may play a key role in
multisensory integration. Finally, higher-order association cortical areas, lower-order cortical areas,
as well as the thalamus have now been shown to be part of multisensory integration. The question
is now to determine how this system of multisensory integration is organized and how the different
parts of the system communicate to allow a unified view of the perception of the world.

FIGURE 2.1  Hypothetical scenarios for multisensory and motor integration through anatomically identified
pathways. (a) High-level cortical areas as a pathway for multisensory and motor integration. (b) Low-level
cortical areas as a pathway for multisensory integration. (c) Thalamus as a pathway for multisensory and
motor integration. (d) Combined cortical and thalamic connections as a pathway for multisensory and motor
integration. Abbreviations: A, auditory cortex; V, visual cortex; S, somatosensory cortex; M, premotor and
motor cortex; H, higher-order multisensory regions; T, “non-specific” thalamic nuclei (PuM, LP, VPL, CM,
CL, and MD as examples for connections with auditory and somatosensory cortical areas; PuM as an example
for connections with A, V, and S cortex).

2.5  CONCLUSIONS
Obviously, we are just beginning to understand the complexity of interactions in the sensory sys-
tems and between the sensory and the motor systems. More work is needed in both the neural and
perceptual domains. At the neural level, additional studies are needed to understand the extent and
hierarchical organization of multisensory interactions. At the perceptual level, further experiments
should explore the conditions necessary for cross-modal binding and plasticity, and investigate the
nature of the information transfer between sensory systems. Such studies will form the basis for a
new comprehension of how the different sensory and/or motor systems function together.

ACKNOWLEDGMENTS
This study was supported by the following grants: the CNRS ATIP program (to P.B.); Swiss
National Science Foundation grants 31-61857.00 and 310000-110005 (to E.M.R.); and the Swiss
National Science Foundation Center of Competence in Research on “Neural Plasticity and
Repair” (to E.M.R.).

REFERENCES
Amedi, A., G. Jacobson, T. Hendler, R. Malach, and E. Zohary. 2002. Convergence of visual and tactile shape
processing in the human lateral occipital complex. Cerebral Cortex 12:1202–12.
Amedi, A., W.M. Stern, J.A. Camprodon et al. 2007. Shape conveyed by visual-to-auditory sensory substitu-
tion activates the lateral occipital complex. Nature Neuroscience 10:687–9.
Andersen, R.A., C. Asanuma, G. Essick, and R.M. Siegel. 1990. Corticocortical connections of anatomi-
cally and physiologically defined subdivisions within the inferior parietal lobule. Journal of Comparative
Neurology 296:65–113.
Andersen, R.A., L.H. Snyder, D.C. Bradley, and J. Xing. 1997. Multimodal representation of space in the
posterior parietal cortex and its use in planning movements. Annual Review of Neuroscience 20:303–30.
Avillac, M., S. Ben Hamed, and J.R. Duhamel. 2007. Multisensory integration in the ventral intraparietal area
of the macaque monkey. Journal of Neuroscience 27:1922–32.
Barbas, H. 1986. Pattern in the laminar origin of corticocortical connections. Journal of Comparative Neurology
252:415–22.
Barbas, H., and D.N. Pandya. 1987. Architecture and frontal cortical connections of the premotor cortex (area
6) in the rhesus monkey. Journal of Comparative Neurology 256:211–28.
Barone, P., and J.P. Joseph. 1989. Role of the dorsolateral prefrontal cortex in organizing visually guided behav-
ior. Brain, Behavior and Evolution 33:132–5.
Barraclough, N.E., D. Xiao, C.I. Baker, M.W. Oram, and D.I. Perrett. 2005. Integration of visual and auditory
information by superior temporal sulcus neurons responsive to the sight of actions. Journal of Cognitive
Neuroscience 17:377–91.
Baylis, G.C., E.T. Rolls, and C.M. Leonard. 1987. Functional subdivisions of the temporal lobe neocortex.
Journal of Neuroscience 7:330–42.
Bell, A.H., M.A. Meredith, A.J. Van Opstal, and D.P. Munoz. 2005. Crossmodal integration in the primate
superior colliculus underlying the preparation and initiation of saccadic eye movements. Journal of
Neurophysiology 93:3659–73.
Benevento, L.A., J. Fallon, B.J. Davis, and M. Rezak. 1977. Auditory–visual interaction in single cells in the
cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey. Experimental
Neurology 57:849–72.
Bignall, K.E. 1970. Auditory input to frontal polysensory cortex of the squirrel monkey: Possible pathways.
Brain Research 19:77–86.
Bizley, J.K., and A.J. King. 2008. Visual–auditory spatial processing in auditory cortical neurons. Brain
Research 1242:24–36.
Bizley, J.K., F.R. Nodal, V.M. Bajo, I. Nelken, and A.J. King. 2007. Physiological and anatomical evidence for
multisensory interactions in auditory cortex. Cerebral Cortex 17:2172–89.
Blatt, G.J., R.A. Andersen, and G.R. Stoner. 1990. Visual receptive field organization and cortico-cortical con-
nections of the lateral intraparietal area (area LIP) in the macaque. Journal of Comparative Neurology
299:421–45.
Bremmer, F., F. Klam, J.R. Duhamel, S. Ben Hamed, and W. Graf. 2002. Visual-vestibular interactive responses
in the macaque ventral intraparietal area (VIP). European Journal of Neuroscience 16:1569–86.
Bruce, C., R. Desimone, and C.G. Gross. 1981. Visual properties of neurons in a polysensory area in superior
temporal sulcus of the macaque. Journal of Neurophysiology 46:369–84.
Budinger, E., and H. Scheich. 2009. Anatomical connections suitable for the direct processing of neuronal
information of different modalities via the rodent primary auditory cortex (review). Hearing Research
258:16–27.
Budinger, E., P. Heil, A. Hess, and H. Scheich. 2006. Multisensory processing via early cortical stages: Connec-
tions of the primary auditory cortical field with other sensory systems. Neuroscience 143:1065–83.
Bullier, J. 2006. What is feed back? In 23 Problems in Systems Neuroscience, ed. J.L. van Hemmen and T.J.
Sejnowski, 103–132. New York: Oxford University Press.
Calvert, G.A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies
(review). Cerebral Cortex 11:1110–23.
Calvert, G., C. Spence, and B.E. Stein, eds. 2004. The Handbook of Multisensory Processes. Cambridge, MA:
MIT Press.
Campi, K.L., K.L. Bales, R. Grunewald, and L. Krubitzer. 2010. Connections of auditory and visual cortex in
the prairie vole (Microtus ochrogaster): Evidence for multisensory processing in primary sensory areas.
Cerebral Cortex 20:89–108.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. European Journal of Neuroscience 22:2886–902.
Cappe, C., A. Morel, and E.M. Rouiller. 2007. Thalamocortical and the dual pattern of corticothalamic projec-
tions of the posterior parietal cortex in macaque monkeys. Neuroscience 146:1371–87.
Cappe, C., E.M. Rouiller, and P. Barone. 2009a. Multisensory anatomic pathway (review). Hearing Research
258:28–36.
Cappe, C., G. Thut, V. Romei, and M.M. Murray. 2009b. Selective integration of auditory-visual looming cues
by humans. Neuropsychologia 47:1045–52.
Cappe, C., A. Morel, P. Barone, and E.M. Rouiller. 2009c. The thalamocortical projection systems in primate:
An anatomical support for multisensory and sensorimotor integrations. Cerebral Cortex 19:2025–37.
Cappe, C., M.M. Murray, P. Barone, and E.M. Rouiller. 2010. Multisensory facilitation of behavior in monkeys:
Effects of stimulus intensity. Journal of Cognitive Neuroscience 22:2850–63.
Cohen, Y.E., B.E. Russ, and G.W. Gifford 3rd. 2005. Auditory processing in the posterior parietal cortex
(review). Behavioral and Cognitive Neuroscience Reviews 4:218–31.
Cohen, Y.E., F. Theunissen, B.E. Russ, and P. Gill. 2007. Acoustic features of rhesus vocalizations and their
representation in the ventrolateral prefrontal cortex. Journal of Neurophysiology 97:1470–84.
Colby, C.L., and M.E. Goldberg. 1999. Space and attention in parietal cortex (review). Annual Review of
Neuroscience 22:319–49.
Crick, F., and C. Koch. 1998. Constraints on cortical and thalamic projections: The no-strong-loops hypothesis.
Nature 391:245–50.
Cusick, C.G., B. Seltzer, M. Cola, and E. Griggs. 1995. Chemoarchitectonics and corticocortical terminations
within the superior temporal sulcus of the rhesus monkey: Evidence for subdivisions of superior temporal
polysensory cortex. Journal of Comparative Neurology 360:513–35.
Darian-Smith, C., A. Tan, and S. Edwards. 1999. Comparing thalamocortical and corticothalamic microstruc-
ture and spatial reciprocity in the macaque ventral posterolateral nucleus (VPLc) and medial pulvinar.
Journal of Comparative Neurology 410:211–34.
de la Mothe, L.A., S. Blumell, Y. Kajikawa, and T.A. Hackett. 2006a. Cortical connections of the auditory cortex
in marmoset monkeys: Core and medial belt regions. Journal of Comparative Neurology 496:27–71.
de la Mothe, L.A., S. Blumell, Y. Kajikawa, and T.A. Hackett. 2006b. Thalamic connections of the auditory cor-
tex in marmoset monkeys: Core and medial belt regions. Journal of Comparative Neurology 496:72–96.
Desimone, R., and C.G. Gross. 1979. Visual areas in the temporal cortex of the macaque. Brain Research
178:363–80.
Disbrow, E., E. Litinas, G.H. Recanzone, J. Padberg, and L. Krubitzer. 2003. Cortical connections of the sec-
ond somatosensory area and the parietal ventral area in macaque monkeys. Journal of Comparative
Neurology 462:382–99.
Duhamel, J.R., C.L. Colby, and M.E. Goldberg. 1998. Ventral intraparietal area of the macaque: Congruent
visual and somatic response properties. Journal of Neurophysiology 79:126–36.
Ettlinger, G., and W.A. Wilson. 1990. Cross-modal performance: Behavioural processes, phylogenetic consid-
erations and neural mechanisms (review). Behavioural Brain Research 40:169–92.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience 22:5749–59.
Falchier, A., C.E. Schroeder, T.A. Hackett et al. 2010. Low level intersensory connectivity as a fundamental
feature of neocortex. Cerebral Cortex 20:1529–38.
Felleman, D.J., and D.C. Van Essen. 1991. Distributed hierarchical processing in the primate cerebral cortex.
Cerebral Cortex 1:1–47.
Fogassi, L., V. Gallese, L. Fadiga, G. Luppino, M. Matelli, and G. Rizzolatti. 1996. Coding of peripersonal
space in inferior premotor cortex (area F4). Journal of Neurophysiology 76:141–57.
Fort, A., C. Delpuech, J. Pernier, and M.H. Giard. 2002. Dynamics of corticosubcortical cross-modal opera-
tions involved in audio-visual object detection in humans. Cerebral Cortex 12:1031–39.
Forster, B., C. Cavina-Pratesi, S.M. Aglioti, and G. Berlucchi. 2002. Redundant target effect and intersensory facili-
tation from visual-tactile interactions in simple reaction time. Experimental Brain Research 143:480–487.
Foxe, J.J., I.A. Morocz, M.M. Murray, B.A. Higgins, D.C. Javitt, and C.E. Schroeder. 2000. Multisensory
auditory–somatosensory interactions in early cortical processing revealed by high-density electrical
mapping. Brain Research. Cognitive Brain Research 10:77–83.
Foxe, J.J., G.R. Wylie, A. Martinez et al. 2002. Auditory–somatosensory multisensory processing in auditory
association cortex: An fMRI study. Journal of Neurophysiology 88:540–3.
Frens, M.A., and A.J. Van Opstal. 1998. Visual–auditory interactions modulate saccade-related activity in mon-
key superior colliculus. Brain Research Bulletin 46:211–24.
Fries, P., S. Neuenschwander, A.K. Engel, R. Goebel, and W. Singer. 2001. Rapid feature selective neuronal
synchronization through correlated latency shifting. Nature Neuroscience 4:194–200.
Fu, K.M., T.A. Johnston, A.S. Shah et al. 2003. Auditory cortical neurons respond to somatosensory stimula-
tion. Journal of Neuroscience 23:7510–5.
Fuster, J.M. 2001. The prefrontal cortex—an update: Time is of the essence (review). Neuron 30:319–33.
Fuster, J.M., M. Bodner, and J.K. Kroger. 2000. Cross-modal and cross-temporal association in neurons of
frontal cortex. Nature 405:347–51.
Gaffan, D., and S. Harrison. 1991. Auditory–visual associations, hemispheric specialization and temporal–
frontal interaction in the rhesus monkey. Brain 114:2133–44.
Galletti, C., M. Gamberini, D.F. Kutz, P. Fattori, G. Luppino, M. Matelli. 2001. The cortical connections of
area V6: An occipito-parietal network processing visual information. European Journal of Neuroscience
13:1572–88.
Gattass, R., E. Oswaldo-Cruz, and A.P. Sousa. 1979. Visual receptive fields of units in the pulvinar of cebus
monkey. Brain Research 160:413–30.
Ghazanfar, A.A. 2009. The multisensory roles for auditory cortex in primate vocal communication (review).
Hearing Research 258:113–20.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? (review). Trends in Cognitive
Sciences 10:278–85.
Ghazanfar, A.A., J.X. Maier, K.L. Hoffman, and N.K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience 25:5004–12.
Ghazanfar, A.A., C. Chandrasekaran, and N.K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience 28:4457–69.
Giard, M.H., and F. Peronnet. 1999. Auditory–visual integration during multimodal object recognition in
humans: A behavioral and electrophysiological study. Journal of Cognitive Neuroscience 11:473–90.
Gifford 3rd, G.W., and Y.E. Cohen. 2004. Effect of a central fixation light on auditory spatial responses in area
LIP. Journal of Neurophysiology 91:2929–33.
Gingras, G., B.A. Rowland, and B.E. Stein. 2009. The differing impact of multisensory and unisensory integra-
tion on behavior. Journal of Neuroscience 29:4897–902.
Giray, M., and R. Ulrich. 1993. Motor coactivation revealed by response force in divided and focused attention.
Journal of Experimental Psychology. Human Perception and Performance 19:1278–91.
Gondan, M., B. Niederhaus, F. Rösler, and B. Röder. 2005. Multisensory processing in the redundant-target
effect: A behavioral and event-related potential study. Perception & Psychophysics 67:713–26.
Gottlieb, J. 2007. From thought to action: The parietal cortex as a bridge between perception, action, and cogni-
tion (review). Neuron 53:9–16.
Graziano, M.S., G.S. Yap, and C.G. Gross. 1994. Coding of visual space by premotor neurons. Science
266:1054–7.
Graziano, M.S., L.A. Reiss, and C.G. Gross. 1999. A neuronal representation of the location of nearby sounds.
Nature 397:428–30.
Guillery, R.W. 1995. Anatomical evidence concerning the role of the thalamus in corticocortical communica-
tion: A brief review. Journal of Anatomy 187:583–92.
Gutierrez, C., M.G. Cola, B. Seltzer, and C. Cusick. 2000. Neurochemical and connectional organization of the
dorsal pulvinar complex in monkeys. Journal of Comparative Neurology 419:61–86.
Hackett, T.A., I. Stepniewska, and J.H. Kaas. 1998. Thalamocortical connections of the parabelt auditory cortex
in macaque monkeys. Journal of Comparative Neurology 400:271–86.
Hackett, T.A., I. Stepniewska, and J.H. Kaas. 1999. Prefrontal connections of the parabelt auditory cortex in
macaque monkeys. Brain Research 817:45–58.
Hackett, T.A., L.A. de La Mothe, I. Ulbert, G. Karmos, J. Smiley, and C.E. Schroeder. 2007. Multisensory
convergence in auditory cortex: II. Thalamocortical connections of the caudal superior temporal plane.
Journal of Comparative Neurology 502:924–52.
Hagen, M.C., O. Franzén, F. McGlone, G. Essick, C. Dancer, and J.V. Pardo. 2002. Tactile motion activates the
human middle temporal/V5 (MT/V5) complex. European Journal of Neuroscience 16:957–64.
Hall, A.J., and S.G. Lomber. 2008. Auditory cortex projections target the peripheral field representation of
primary visual cortex. Experimental Brain Research 190:413–30.
Hecht, D., M. Reiner, and A. Karni. 2008. Enhancement of response times to bi- and tri-modal sensory stimuli
during active movements. Experimental Brain Research 185:655–65.
Heffner, R.S., and H.E. Heffner. 1992. Visual factors in sound localization in mammals. Journal of Comparative
Neurology 317:219–32.
Hikosaka, K., E. Iwai, H. Saito, and K. Tanaka. 1988. Polysensory properties of neurons in the anterior bank of
the caudal superior temporal sulcus of the macaque monkey. Journal of Neurophysiology 60:1615–37.
Huffman, K.J., and L. Krubitzer. 2001. Area 3a: topographic organization and cortical connections in marmoset
monkeys. Cerebral Cortex 11:849–67.
Innocenti, G.M., P. Berbel, and S. Clarke. 1988. Development of projections from auditory to visual areas in
the cat. Journal of Comparative Neurology 272:242–59.
James, T.W., G.K. Humphrey, J.S. Gati, P. Servos, R.S. Menon, and M.A. Goodale. 2002. Haptic study of three-
dimensional objects activates extrastriate visual areas. Neuropsychologia 40:1706–14.
Jepma, M., E.J. Wagenmakers, G.P. Band, and S. Nieuwenhuis. 2009. The effects of accessory stimuli on
information processing: Evidence from electrophysiology and a diffusion model analysis. Journal of
Cognitive Neuroscience 21:847–64.
Jones, E.G. 1998. Viewpoint: The core and matrix of thalamic organization. Neuroscience 85:331–45.
Joseph, J.P., and P. Barone. 1987. Prefrontal unit activity during a delayed oculomotor task in the monkey.
Experimental Brain Research 67:460–8.
Kaas, J.H., and C.E. Collins. 2001. Evolving ideas of brain evolution. Nature 411:141–2.
Kaas, J., and C.E. Collins. 2004. The resurrection of multisensory cortex in primates: connection patterns that
integrates modalities. In The Handbook of Multisensory Processes, ed. G. Calvert, C. Spence, and B.E.
Stein, 285–93. Cambridge, MA: MIT Press.
Kaas, J.H., and T.A. Hackett. 2000. Subdivisions of auditory cortex and processing streams in primates.
Proceedings of the National Academy of Sciences of the United States of America 97:11793–9.
Kaas, J.H., and A. Morel. 1993. Connections of visual areas of the upper temporal lobe of owl monkeys: The
MT crescent and dorsal and ventral subdivisions of FST. Journal of Neuroscience 13:534–46.
Kayser, C., and N.K. Logothetis. 2009. Directed interactions between auditory and superior temporal cortices
and their role in sensory integration (review). Frontiers in Integrative Neuroscience 3:7. doi: 10.3389/
neuro.07.007.2009.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2005. Integration of touch and sound in auditory
cortex. Neuron 48:373–84.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2007. Functional imaging reveals visual modulation
of specific fields in auditory cortex. Journal of Neuroscience 27:1824–35.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–74.
Komura, Y., R. Tamura, T. Uwano, H. Nishijo, and T. Ono. 2005. Auditory thalamus integrates visual inputs
into behavioral gains. Nature Neuroscience 8:1203–9.
Krubitzer, L.A., and J.H. Kaas. 1990. The organization and connections of somatosensory cortex in marmosets.
Journal of Neuroscience 10:952–74.
Lakatos, P., C.M. Chen, M.N. O’Connell, A. Mills, and C.E. Schroeder. 2007. Neuronal oscillations and multi-
sensory interaction in primary auditory cortex. Neuron 53:279–92.
Lamarre, Y., L. Busby, and G. Spidalieri. 1983. Fast ballistic arm movements triggered by visual, auditory, and
somesthetic stimuli in the monkey: I. Activity of precentral cortical neurons. Journal of Neurophysiology
50:1343–58.
Laurienti, P.J., R.A. Kraft, J.A. Maldjian, J.H. Burdette, and M.T. Wallace. 2004. Semantic congruence is a
critical factor in multisensory behavioral performance. Experimental Brain Research 158:405–14.
Lewis, J.W., and D.C. Van Essen. 2000. Corticocortical connections of visual, sensorimotor, and multimodal pro-
cessing areas in the parietal lobe of the macaque monkey. Journal of Comparative Neurology 428:112–37.
Linden, J.F., A. Grunewald, and R.A. Andersen. 1999. Responses to auditory stimuli in macaque lateral intra-
parietal area: II. Behavioral modulation. Journal of Neurophysiology 82:343–58.
Lovelace, C.T., B.E. Stein, and M.T. Wallace. 2003. An irrelevant light enhances auditory detection in humans:
A psychophysical analysis of multisensory integration in stimulus detection. Brain Research. Cognitive
Brain Research 17:447–53.
Martuzzi, R., M.M. Murray, C.M. Michel et al. 2007. Multisensory interactions within human primary cortices
revealed by BOLD dynamics. Cerebral Cortex 17:1672–9.
Maunsell, J.H., and D.C. Van Essen. 1983. The connections of the middle temporal visual area (MT) and their
relationship to a cortical hierarchy in the macaque monkey. Journal of Neuroscience 3:2563–86.
Miller, J. 1982. Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology
14:247–79.
Miller, J., R. Ulrich, and Y. Lamarre. 2001. Locus of the redundant-signals effect in bimodal divided attention:
A neurophysiological analysis. Perception & Psychophysics 63:555–62.
Milner, B., M. Petrides, and M.L. Smith. 1985. Frontal lobes and the temporal organization of memory. Human
Neurobiology 4:137–42.
Molholm, S., W. Ritter, M.M. Murray, D.C. Javitt, C.E. Schroeder, and J.J. Foxe. 2002. Multisensory auditory–
visual interactions during early sensory processing in humans: A high-density electrical mapping study.
Brain Research. Cognitive Brain Research 14:115–28.
Mordkoff, J.T., J. Miller, and A.C. Roch. 1996. Absence of coactivation in the motor component: Evidence
from psychophysiological measures of target detection. Journal of Experimental Psychology. Human
Perception and Performance 22:25–41.
Morel, A., J. Liu, T. Wannier, D. Jeanmonod, and E.M. Rouiller. 2005. Divergence and convergence of thalamo-
cortical projections to premotor and supplementary motor cortex: A multiple tracing study in macaque
monkey. European Journal of Neuroscience 21:1007–29.
Murray, M.M., S. Molholm, C.M. Michel et al. 2005. Grabbing your ear: Rapid auditory–somatosensory multi-
sensory interactions in low-level sensory cortices are not constrained by stimulus alignment. Cerebral
Cortex 15:963–74.
Palmer, S.M., and M.G. Rosa. 2006. A distinct anatomical network of cortical areas for analysis of motion in
far peripheral vision. European Journal of Neuroscience 24:2389–405.
Pandya, D.N., and B. Seltzer. 1982. Intrinsic connections and architectonics of posterior parietal cortex in the
rhesus monkey. Journal of Comparative Neurology 204:196–210.
Petrides, M., and S.D. Iversen. 1976. Cross-modal matching and the primate frontal cortex. Science 192:1023–4.
Poremba, A., R.C. Saunders, A.M. Crane, M. Cook, L. Sokoloff, and M. Mishkin. 2003. Functional mapping
of the primate auditory system. Science 299:568–72.
Raab, D.H. 1962. Statistical facilitation of simple reaction times. Transactions of the New York Academy of
Sciences 24:574–90.
Rizzolatti, G., L. Fogassi, and V. Gallese. 1997. Parietal cortex: From sight to action (review). Current Opinion
in Neurobiology 7:562–7.
Rockland, K.S. 1996. Two types of corticopulvinar terminations: Round (type 2) and elongate (type 1). Journal
of Comparative Neurology 368:57–87.
Rockland, K.S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology 50:19–26.
Romanski, L.M. 2004. Domain specificity in the primate prefrontal cortex (review). Cognitive, Affective &
Behavioral Neuroscience 4:421–9.
Romanski, L.M. 2007. Representation and integration of auditory and visual stimuli in the primate ventral
lateral prefrontal cortex. Cerebral Cortex 17 Suppl. no. 1, i61–9.
Romanski, L.M., M. Giguere, J.F. Bates, and P.S. Goldman-Rakic. 1997. Topographic organization of medial
pulvinar connections with the prefrontal cortex in the rhesus monkey. Journal of Comparative Neurology
379:313–32.
Romanski, L.M., J.F. Bates, and P.S. Goldman-Rakic. 1999. Auditory belt and parabelt projections to the pre-
frontal cortex in the rhesus monkey. Journal of Comparative Neurology 403:141–57.
Romanski, L.M., and P.S. Goldman-Rakic. 2002. An auditory domain in primate prefrontal cortex. Nature
Neuroscience 5:15–6.
Romei, V., M.M. Murray, L.B. Merabet, and G. Thut. 2007. Occipital transcranial magnetic stimulation has
opposing effects on visual and auditory stimulus detection: Implications for multisensory interactions.
Journal of Neuroscience 27:11465–72.
Rouiller, E.M., and C. Durif. 2004. The dual pattern of corticothalamic projection of the primary auditory cor-
tex in macaque monkey. Neuroscience Letters 358:49–52.
Rouiller, E.M., J. Tanné, V. Moret, I. Kermadi, D. Boussaoud, and E. Welker. 1998. Dual morphology and
topography of the corticothalamic terminals originating from the primary, supplementary motor, and
dorsal premotor cortical areas in macaque monkeys. Journal of Comparative Neurology 396:169–85.
Rouiller, E.M., and E. Welker. 2000. A comparative analysis of the morphology of corticothalamic projections
in mammals. Brain Research Bulletin 53:727–41.
Rouiller, E.M., T. Wannier, and A. Morel. 2003. The dual pattern of corticothalamic projection of the premotor
cortex in macaque monkeys. Thalamus & Related Systems 2:189–97.
Russ, B.E., A.M. Kim, K.L. Abrahamsen, R. Kiringoda, and Y.E. Cohen. 2006. Responses of neurons in the
lateral intraparietal area to central visual cues. Experimental Brain Research 174:712–27.
Sadato, N., A. Pascual-Leone, J. Grafman et al. 1996. Activation of the primary visual cortex by Braille reading
in blind subjects. Nature 380:526–8.
Salin, P.A., and J. Bullier. 1995. Corticocortical connections in the visual system: Structure and function.
Physiological Reviews 75:107–54.
Sathian, K., and A. Zangaladze. 2002. Feeling with the mind’s eye: Contribution of visual cortex to tactile
perception (review). Behavioural Brain Research 135:127–32.
Schall, J.D., A. Morel, D.J. King, and J. Bullier. 1995. Topography of visual cortex connections with fron-
tal eye field in macaque: Convergence and segregation of processing streams. Journal of Neuroscience
15:4464–87.
Schlack, A., S.J. Sterbing-D’Angelo, K. Hartung, K.P. Hoffmann, and F. Bremmer. 2005. Multisensory space
representations in the macaque ventral intraparietal area. Journal of Neuroscience 25:4616–25.
Schroeder, C.E., and J.J. Foxe. 2002. The timing and laminar profile of converging inputs to multisensory areas
of the macaque neocortex. Cognitive Brain Research 14:187–98.
Schroeder, C.E., R.W. Lindsley, C. Specht, A. Marcovici, J.F. Smiley, and D.C. Javitt. 2001. Somatosensory
input to auditory association cortex in the macaque monkey. Journal of Neurophysiology 85:1322–7.
Schwartz, M.L., J.J. Dekker, and P.S. Goldman-Rakic. 1991. Dual mode of corticothalamic synaptic termina-
tion in the mediodorsal nucleus of the rhesus monkey. Journal of Comparative Neurology 309:289–304.
Seltzer, B., and D.N. Pandya. 1994. Parietal, temporal, and occipital projections to cortex of the superior
temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology
343:445–63.
Sherman, S.M. 2007. The thalamus is more than just a relay. Current Opinion in Neurobiology 17:417–22.
Sherman, S.M., and R.W. Guillery. 2002. The role of the thalamus in the flow of information to the cortex.
Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences 357:1695–708.
Sherman, S.M., and R.W. Guillery. 2005. Exploring the Thalamus and Its Role in Cortical Function. Cambridge:
MIT Press.
Shipp, S. 2003. The functional logic of cortico-pulvinar connections. Philosophical Transactions of the Royal
Society of London. Series B, Biological Sciences 358:1605–24.
Smiley, J.F., T.A. Hackett, I. Ulbert, G. Karmos, P. Lakatos, D.C. Javitt, and C.E. Schroeder. 2007. Multisensory
convergence in auditory cortex, I. Cortical connections of the caudal superior temporal plane in macaque
monkeys. Journal of Comparative Neurology 502:894–923.
Smiley, J.F., and A. Falchier. 2009. Multisensory connections of monkey auditory cerebral cortex. Hearing
Research 258:37–46.
Sperdin, H., C. Cappe, J.J. Foxe, and M.M. Murray. 2009. Early, low-level auditory–somatosensory multisen-
sory interactions impact reaction time speed. Frontiers in Integrative Neuroscience 3:2. doi:10.3389/
neuro.07.002.2009.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., M.A. Meredith, W.S. Huneycutt, and L. Mcdade. 1989. Behavioral indices of multisensory inte-
gration: Orientation to visual cues is affected by auditory stimuli. Journal of Cognitive Neuroscience
1:12–24.
Suzuki, H. 1985. Distribution and organization of visual and auditory neurons in the monkey prefrontal cortex.
Vision Research 25:465–9.
Tanné-Gariépy, J., E.M. Rouiller, and D. Boussaoud. 2002. Parietal inputs to dorsal versus ventral premo-
tor areas in the macaque monkey: Evidence for largely segregated visuomotor pathways. Experimental
Brain Research 145:91–103.
Taktakishvili, O., E. Sivan-Loukianova, K. Kultas-Ilinsky, and I.A. Ilinsky. 2002. Posterior parietal cortex pro-
jections to the ventral lateral and some association thalamic nuclei in Macaca mulatta. Brain Research
Bulletin 59:135–50.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences of the United States of America 101:2167–72.
Wang, Y., S. Celebrini, Y. Trotter, and P. Barone. 2008. Visuo–auditory interactions in the primary visual cortex
of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9:79.
Yirmiya, R., and S. Hocherman. 1987. Auditory- and movement-related neural activity interact in the pulvinar
of the behaving rhesus monkey. Brain Research 402:93–102.
Zampini, M., D. Torresan, C. Spence, and M.M. Murray. 2007. Auditory–somatosensory multisensory interac-
tions in front and rear space. Neuropsychologia 45:1869–77.
3 What Can Multisensory Processing Tell Us about the Functional Organization of Auditory Cortex?

Jennifer K. Bizley and Andrew J. King

CONTENTS
3.1 Introduction
3.2 Functional Specialization within Auditory Cortex?
3.3 Ferret Auditory Cortex: A Model for Multisensory Processing
3.3.1 Organization of Ferret Auditory Cortex
3.3.2 Surrounding Cortical Fields
3.3.3 Sensitivity to Complex Sounds
3.3.4 Visual Sensitivity in Auditory Cortex
3.3.5 Visual Inputs Enhance Processing in Auditory Cortex
3.4 Where Do Visual Inputs to Auditory Cortex Come From?
3.5 What Are the Perceptual Consequences of Multisensory Integration in the Auditory Cortex?
3.5.1 Combining Auditory and Visual Spatial Representations in the Brain
3.5.2 A Role for Auditory Cortex in Spatial Recalibration?
3.6 Concluding Remarks
References

3.1  INTRODUCTION
The traditional view of sensory processing is that the pooling and integration of information across
different modalities takes place in specific areas of the brain only after extensive processing within
modality-specific subcortical and cortical regions. This seems like a logical arrangement because
our various senses are responsible for transducing different forms of energy into neural activity
and give rise to quite distinct perceptions. To a large extent, each of the sensory systems can oper-
ate independently. We can, after all, understand someone speaking by telephone or read a book
perfectly well without recourse to cues provided by other modalities. It is now clear, however, that
multisensory convergence is considerably more widespread in the brain, and particularly the cere-
bral cortex, than was once thought. Indeed, even the primary cortical areas in each of the main
senses have been claimed as part of the growing network of multisensory regions (Ghazanfar and
Schroeder 2006).
It is clearly beneficial to be able to combine information from the different senses. Although the
perception of speech is based on the processing of sound, what we actually hear can be influenced by
visual cues provided by lip movements. This can result in an improvement in speech intelligibility
in the presence of other distracting sounds (Sumby and Pollack 1954) or even a subjective change
in the speech sounds that are perceived (McGurk and MacDonald 1976). Similarly, the accuracy
with which the source of a sound can be localized is affected by the availability of both spatially
congruent (Shelton and Searle 1980; Stein et al. 1989) and conflicting (Bertelson and Radeau 1981)
visual stimuli. With countless other examples of cross-modal interactions at the perceptual level
(Calvert and Thesen 2004), it is perhaps not surprising that multisensory convergence is so widely
found throughout the cerebral cortex.
The major challenge that we are now faced with is to identify the function of multisensory inte-
gration in different cortical circuits, and particularly at early levels of the cortical hierarchy—the
primary and secondary sensory areas—which are more likely to be involved in general-purpose
processing relating to multiple sound parameters than in task-specific computational operations
(Griffiths et al. 2004; King and Nelken 2009). In doing so, we have to try and understand how other
modalities influence the sensitivity or selectivity of cortical neurons in those areas while retaining
the modality specificity of the percepts to which the activity of the neurons contributes. By inves-
tigating the sources of origin of these inputs and the way in which they interact with the dominant
input modality for a given cortical area, we can begin to constrain our ideas about the potential
functions of multisensory integration in early sensory cortex.
In this article, we focus on the organization and putative functions of visual inputs to the audi-
tory cortex. Although anatomical and physiological studies have revealed multisensory interactions
in visual and somatosensory areas, it is arguably the auditory cortex where most attention has been
paid and where we may be closest to answering these questions.

3.2  FUNCTIONAL SPECIALIZATION WITHIN AUDITORY CORTEX?


A common feature of all sensory systems is that they comprise multiple cortical areas that can be
defined both physiologically and anatomically, and which are collectively involved in the processing
of the world around us. Although most studies on the cortical auditory system have focused on the
primary area, A1, there is considerable interest in the extent to which different sound features are
represented in parallel in distinct functional streams that extend beyond A1 (Griffiths et al. 2004).
Research on this question has been heavily influenced by studies of the visual cortex and, in par-
ticular, by the proposal that a division of function exists, with separate dorsal and ventral pathways
involved in visuomotor control and object identification, respectively. The dorsal processing stream,
specialized for detecting object motion and discriminating spatial relationships, includes the middle
temporal (MT) and medial superior temporal (MST) areas, whereas the ventral stream comprises
areas responsible for color, form, and pattern discrimination. Although the notion of strict parallel
processing of information, originating subcortically in the P and M pathways and terminating in tem-
poral and parietal cortical areas, is certainly an oversimplification (Merigan and Maunsell 1993), the
perception–action hypothesis is supported by neuroimaging, human neuropsychology, monkey neu-
rophysiology, and human psychophysical experiments (reviewed by Goodale and Westwood 2004).
A popular, if controversial, theory seeks to impose a similar organizational structure onto the
auditory cortex. Within this framework, Rauschecker and Tian (2000) proposed that the auditory
cortex can be divided into a rostral processing stream, responsible for sound identification, and
a caudal processing stream, involved in sound localization. Human functional imaging data pro-
vide support for this idea (Alain et al. 2001; Barrett and Hall 2006; Maeder et al. 2001; Warren
and Griffiths 2003), and there is evidence for regional differentiation based on the physiological
response properties of single neurons recorded in the auditory cortex of nonhuman primates (Tian
et al. 2001; Recanzone 2000; Woods et al. 2006; Bendor and Wang 2005). However, the most com-
pelling evidence for a division of labor has been provided by the specific auditory deficits induced
by transiently deactivating different cortical areas in cats. Thus, normal sound localization in this
species requires the activation of A1, the posterior auditory field (PAF), the anterior ectosylvian
sulcus and the dorsal zone of the auditory cortex, whereas other areas, notably the anterior auditory
field (AAF), ventral PAF (VPAF), and secondary auditory cortex (A2) do not appear to contribute
to this task (Malhotra and Lomber 2007). Moreover, a double dissociation between PAF and AAF
in the same animals has been demonstrated, with impaired sound localization produced by cooling
of PAF but not AAF, and impaired temporal pattern discrimination resulting from inactivation of
AAF but not PAF (Lomber and Malhotra 2008). Lastly, anatomical projection patterns in nonhu-
man primates support differential roles for rostral and caudal auditory cortex, with each of those
areas having distinct prefrontal targets (Hackett et al. 1999; Romanski et al. 1999).
Despite this apparent wealth of data in support of functional specialization within the auditory
cortex, there are a number of studies that indicate that sensitivity to both spatial and nonspatial
sound attributes is widely distributed across different cortical fields (Harrington et al. 2008; Stecker
et al. 2003; Las et al. 2008; Hall and Plack 2009; Recanzone 2008; Nelken et al. 2008; Bizley et al.
2009). Moreover, in humans, circumscribed lesions within the putative “what” and “where” path-
ways do not always result in the predicted deficits in sound recognition and localization (Adriani et
al. 2003). Clearly defined output pathways from auditory cortex to prefrontal cortex certainly seem
to exist, but what the behavioral deficits observed following localized deactivation or damage imply
about the functional organization of the auditory cortex itself is less clear-cut. Loss of activity in
any one part of the network will, after all, affect both upstream cortical areas and potentially the
responses of subcortical neurons that receive descending projections from that region of the cor-
tex (Nakamoto et al. 2008). Thus, a behavioral deficit does not necessarily reflect the specialized
properties of the neurons within the silenced cortical area per se, but rather the contribution of the
processing pathways that the area is integral to.
Can the distribution and nature of multisensory processing in the auditory cortex help reconcile
the apparently contrasting findings outlined above? If multisensory interactions in the cortex are
to play a meaningful role in perception and behavior, it is essential that the neurons can integrate
the corresponding multisensory features of individual objects or events, such as vocalizations and
their associated lip movements or the visual and auditory cues originating from the same location
in space. Consequently, the extent to which spatial and nonspatial sound features are processed in
parallel in the auditory cortex should also be apparent in both the multisensory response properties
of the neurons found there and the sources of origin of its visual inputs. Indeed, evidence for task-
specific activation of higher cortical areas by different stimulus modalities has recently been pro-
vided in humans (Renier et al. 2009). In the next section, we focus on the extent to which anatomical
and physiological studies of multisensory convergence and processing in the auditory cortex of the
ferret have shed light on this issue. In recent years, this species has gained popularity for studies of
auditory cortical processing, in part because of its particular suitability for behavioral studies.

3.3  FERRET AUDITORY CORTEX: A MODEL FOR MULTISENSORY PROCESSING


3.3.1  Organization of Ferret Auditory Cortex
Ferret auditory cortex consists of at least six acoustically responsive areas: two core fields, A1 and
AAF, which occupy the middle ectosylvian gyrus; two belt areas on the posterior ectosylvian gyrus,
the posterior pseudosylvian field (PPF) and posterior suprasylvian field (PSF); plus two areas on
the anterior ectosylvian gyrus, the anterior dorsal field (ADF) and the anterior ventral field (AVF)
(Bizley et al. 2005; Figure 3.1a). A1, AAF, PPF, and PSF are all tonotopically organized: the neu-
rons found there respond to pure tones and are most sensitive to particular sound frequencies, which
vary systematically in value with neuron location within each cortical area. There is little doubt that
an area equivalent to A1 is found in many different mammalian species,
including humans. Ferret AAF also appears to be homologous to AAF in other species, including the gerbil
(Thomas et al. 1993) and the cat (Imaizumi et al. 2004), and is characterized, relative to A1, by an
underrepresentation of neurons preferring middle frequencies and by shorter response latencies.
[Figure 3.1 appears here. Panel (a): map of ferret sensory cortex. Panels (b, c): retrogradely labeled neurons in visual cortex following tracer (BDA, CTβ) injections into core and belt auditory fields; scale bars, 1 mm. Panel (d): summary of visual cortical input: V1 and V2 (sparse) to A1 and AAF; area 20 (visual form) to PPF and PSF; SSY (visual motion) to ADF.]

FIGURE 3.1  Visual inputs to ferret auditory cortex. (a) Ferret sensory cortex. Visual (areas 17–20, PS, SSY,
AMLS), posterior parietal (PPr, PPc), somatosensory (S1, SIII, MRSS), and auditory areas (A1, AAF, PPF,
PSF, and ADF) have been identified. In addition, LRSS and AVF are multisensory regions, although many
of the areas classified as modality specific also contain some multisensory neurons. (b) Location of neurons
in visual cortex that project to auditory cortex. Tracer injections made into core auditory cortex (A1: BDA,
shown in black, and AAF: CTβ, shown in gray) result in retrograde labeling in early visual areas. Every fifth
section (50 µm thick) was examined, but for the purpose of illustration, labeling from four sections was col-
lapsed onto single sections. Dotted lines mark the limit between cortical layers IV and V; dashed lines delimit
the white matter (wm). (c) Tracer injections made into belt auditory cortex. Retrograde labeling after an injec-
tion of CTβ into the anterior fields (on the borders of ADF and AVF) is shown in gray, and retrograde labeling
resulting from a BDA injection into the posterior fields PPF and PSF is shown in black. Note the difference
in the extent and distribution of labeling after injections into the core and belt areas of auditory cortex. Scale
bars in (b) and (c), 1 mm. (d) Summary of sources of visual cortical input to auditory cortex. (Anatomical data
adapted with permission from Bizley, J.K. et al., Cereb. Cortex, 17, 2172–89, 2007.)
Neurons in the posterior fields can be distinguished from those in the primary areas by the
temporal characteristics of their responses; discharges are often sustained and they vary in latency
and firing pattern in a stimulus-dependent manner. The frequency response areas of posterior field
neurons are often circumscribed, exhibiting tuning for sound level as well as frequency. As such,
the posterior fields in the ferret resemble PAF and VPAF in the cat (Stecker et al. 2003; Phillips and
Orman 1984; Loftus and Sutter 2001) and cortical areas R and RT in the marmoset monkey (Bizley
et al. 2005; Bendor and Wang 2008), although whether PPF and PSF actually correspond to these
fields is uncertain.
Neurons in ADF also respond to pure tones, but are not tonotopically organized (Bizley et al.
2005). The lack of tonotopicity and the broad, high-threshold frequency response areas that char-
acterize this field are also properties of cat A2 (Schreiner and Cynader 1984). However, given that
ferret ADF neurons seem to show relatively greater spatial sensitivity than those in surrounding
cortical fields (see following sections), which is not a feature of cat A2, it seems unlikely that
these areas are homologous. Ventral to ADF lies AVF. Although many of the neurons that have
been recorded there are driven by sound, the high incidence of visually responsive neurons (see
Section 3.3.4) makes it likely that AVF should be regarded as a parabelt or higher multisensory field.
Given its proximity to the somatosensory area on the medial bank of the rostral suprasylvian sulcus
(MRSS) (Keniston et al. 2009), it is possible that AVF neurons might also be influenced by tactile
stimuli, but this remains to be determined.
Other studies have also highlighted the multisensory nature of the anterior ectosylvian gyrus. For
example, Ramsay and Meredith (2004) described an area surrounding the pseudosylvian sulcus that
receives largely segregated inputs from the primary visual and somatosensory cortices, which they
termed the pseudosylvian sulcal cortex. Manger et al. (2005) reported that a visually responsive
area lies parallel to the pseudosylvian sulcus on the posterolateral half of the anterior ectosylvian
gyrus, which also contains bisensory neurons that respond either to both visual and tactile or to
visual and auditory stimulation. They termed this area AEV, following the terminology used for the
visual region within the cat’s anterior ectosylvian sulcus. Because this region overlaps in part with
the acoustically responsive areas that we refer to as ADF and AVF, further research using a range of
stimuli will be needed to fully characterize this part of the ferret’s cortex. However, the presence of
a robust projection from AVF to the superior colliculus (Bajo et al. 2010) makes it likely that this region
is equivalent to the cortex of the anterior ectosylvian sulcus in the cat.

3.3.2  Surrounding Cortical Fields


The different auditory cortical areas described in the previous section are all found on the ectosyl-
vian gyrus (EG), which is enclosed by the suprasylvian sulcus (Figure 3.1a). The somatosensory
cortex lies rostral to the EG (Rice et al. 1993; McLaughlin et al. 1998), extrastriate visual areas are
located caudally (Redies et al. 1990), and the parietal cortex is found dorsal to the EG (Manger et
al. 2002). The suprasylvian sulcus therefore separates the different auditory fields from functionally
distinct parts of the cerebral cortex.
Within the suprasylvian sulcus itself, several additional cortical fields have been characterized
(Philipp et al. 2006; Manger et al. 2004, 2008; Cantone et al. 2006; Keniston et al. 2008). Beginning
at the rostral border between the auditory and somatosensory cortices, field MRSS (Keniston et al.
2009) and the lateral bank of the rostral suprasylvian sulcus (LRSS) (Keniston et al. 2008) form
the medial and lateral sides of the suprasylvian sulcus, respectively. Field LRSS has been identified
as an auditory–somatosensory area, whereas MRSS is more modality specific and is thought to
be a higher somatosensory field. Field MRSS is bordered by the anteromedial lateral suprasylvian
visual area (AMLS), which lines the medial or dorsal bank of the suprasylvian sulcus (Manger et
al. 2008). Two more visually responsive regions, the suprasylvian visual area (SSY) (Cantone et al.
2006; Philipp et al. 2006) and the posterior suprasylvian area (PS) (Manger et al. 2004) are found
on the caudal side of the sulcus. SSY corresponds in location to an area described by Philipp et al.
(2005) as the ferret homologue of primate motion-processing area MT. This region has also been
described by Manger et al. (2008) as the posteromedial suprasylvian visual area, but we will stay
with the terminology used in our previous articles and refer to it as SSY. PS has not been compre-
hensively investigated and, to our knowledge, neither of these sulcal fields has been tested with
auditory or somatosensory stimuli. On the lateral banks of the suprasylvian sulcus, at the dorsal and
caudal edges of the EG, there remains an area of cortex that has not yet been investigated. On the basis of its proximity to
AMLS and SSY, this region has tentatively been divided into the anterolateral lateral suprasylvian
visual area (ALLS) and the posterolateral lateral suprasylvian visual area (PLLS) by Manger et al.
(2008). However, because these regions of the sulcal cortex lie immediately adjacent to the primary
auditory fields, it is much more likely that they are multisensory in nature.

3.3.3  Sensitivity to Complex Sounds


In an attempt to determine whether spatial and nonspatial stimulus attributes are represented within
anatomically distinct regions of the ferret auditory cortex, we investigated the sensitivity of neurons in
both core and belt areas to stimulus periodicity, timbre, and spatial location (Bizley et al. 2009). Artificial
vowel sounds were used for this purpose, as they allowed each of these stimulus dimensions to be var-
ied parametrically. Recordings in our laboratory have shown that ferret vocalizations cover the same
frequency range as the sounds used in this study. Vowel identification involves picking out the formant
peaks in the spectral envelope of the sound, and is therefore a timbre discrimination task. The periodicity
of the sound corresponds to its perceived pitch and conveys information about speaker identity (males
tend to have lower pitch voices than females) and emotional state. Neuronal sensitivity to timbre and pitch
should therefore be found in cortical areas concerned with stimulus identification.
Neurons recorded throughout the five cortical areas (A1, AAF, PPF, PSF, and ADF) examined
were found to be sensitive to the pitch, timbre, and location of the sound source, implying a distrib-
uted representation of both spatial and nonspatial sound properties. Nevertheless, significant inter-
areal differences were observed. Sensitivity to sound pitch and timbre was most pronounced in the
primary and posterior auditory fields (Bizley et al. 2009). By contrast, relatively greater sensitivity
to sound-source location was found in A1 and in the areas around the pseudosylvian sulcus, which
is consistent with the finding that the responses of neurons in ADF carry more information about
sound azimuth than those in other auditory cortical areas (Bizley and King 2008).
The variance decomposition method used in the study by Bizley et al. (2009) to quantify the
effects of each stimulus parameter on the responses of the neurons was very different from the
measures used to define a pitch center in marmoset auditory cortex (Bendor and Wang 2005). We
did not, for example, test whether pitch sensitivity was maintained for periodic stimuli in which the
fundamental frequency had been omitted. Consequently, the distributed sensitivity we observed is
not incompatible with the idea that there might be a dedicated pitch-selective area. However, in a
subsequent study, we did find that the spiking responses of single neurons and neural ensembles
throughout the auditory cortex can account for the ability of trained ferrets to detect the direction
of a pitch change (Bizley et al. 2010). Although further research is needed, particularly in awake,
behaving animals, these electrophysiological data are consistent with the results of an ear-
lier intrinsic optical imaging study (Nelken et al. 2008) in providing only limited support for a divi-
sion of labor across auditory cortical areas in the ferret.
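For readers less familiar with this kind of analysis, the sketch below illustrates, in Python, the general logic of partitioning a neuron's response variance across stimulus dimensions. It is a deliberately simplified stand-in for the procedure used by Bizley et al. (2009): the factorial stimulus grid, the simulated responses, and all names are hypothetical, and only main effects are considered, with no treatment of interactions or trial-to-trial noise corrections.

```python
"""Minimal sketch: partitioning a neuron's response variance across
stimulus dimensions (pitch, timbre, azimuth). Illustrative only; not the
exact variance decomposition used in the published study."""
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical factorial stimulus grid: 4 pitches x 4 timbres x 4 azimuths.
n_levels = 4
pitches = np.arange(n_levels)
timbres = np.arange(n_levels)
azimuths = np.arange(n_levels)

# Simulated trial-averaged spike counts: a strong timbre effect, a weaker
# azimuth effect, no pitch effect, plus residual variability.
responses = (
    0.0 * pitches[:, None, None]
    + 1.0 * timbres[None, :, None]
    + 0.3 * azimuths[None, None, :]
    + rng.normal(0.0, 0.2, (n_levels, n_levels, n_levels))
)

grand_mean = responses.mean()
ss_total = ((responses - grand_mean) ** 2).sum()

def main_effect_ss(axis_kept):
    """Sum of squares captured by the marginal means of one stimulus dimension."""
    other_axes = tuple(a for a in range(3) if a != axis_kept)
    marginal_means = responses.mean(axis=other_axes)
    cells_per_level = responses.size / marginal_means.size
    return cells_per_level * ((marginal_means - grand_mean) ** 2).sum()

for name, axis in (("pitch", 0), ("timbre", 1), ("azimuth", 2)):
    share = main_effect_ss(axis) / ss_total
    print(f"{name:8s} explains {share:5.1%} of the response variance")
```

In this toy example, most of the variance is attributed to timbre, a little to azimuth, and essentially none to pitch, mimicking the kind of inter-areal comparison described above.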

3.3.4  Visual Sensitivity in Auditory Cortex


Visual inputs into auditory cortex have been described in several species, including humans (Calvert
et al. 1999; Giard and Peronnet 1999; Molholm et al. 2002), nonhuman primates (Brosch et al. 2005;
Ghazanfar et al. 2005; Schroeder and Foxe 2002; Kayser et al. 2007), ferrets (Bizley and King 2008,
2009; Bizley et al. 2007), gerbils (Cahill et al. 1996), and rats (Wallace et al. 2004). In our studies
on the ferret, the responses of single neurons and multineuron clusters were recorded to simple
artificial stimuli presented under anesthesia. Sensitivity to visual stimulation was defined either as a
statistically significant change in spiking activity after the presentation of light flashes from a light-emitting
diode (LED) positioned in the contralateral hemifield, or as a significant modulation of the response to
auditory stimulation, even if the LED by itself was apparently ineffective in driving the neuron.
Although the majority of neurons recorded in the auditory cortex were classified as auditory alone,
the activity of more than one quarter was found to be influenced by visual stimulation. Figure 3.2a
shows the relative proportion of different response types observed in the auditory cortex as a whole.
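The classification underlying Figure 3.2a can be summarized as a simple decision rule. The sketch below is only a schematic of that logic: the statistical tests that would generate the p-values, the 0.05 criterion, and the function name are illustrative assumptions rather than the exact analysis pipeline used in these studies.

```python
# Schematic of the response classification described above; all names and
# the significance criterion are illustrative assumptions.
ALPHA = 0.05

def classify_unit(p_auditory: float, p_visual: float, p_av_modulation: float) -> str:
    """Assign one of the response categories summarized in Figure 3.2a."""
    driven_by_sound = p_auditory < ALPHA        # spiking response to noise bursts
    driven_by_light = p_visual < ALPHA          # spiking response to LED flashes
    av_modulated = p_av_modulation < ALPHA      # AV response differs from A alone

    if driven_by_sound and driven_by_light:
        return "AV (bisensory)"
    if driven_by_sound and av_modulated:
        return "AVmod (auditory response modulated by vision)"
    if driven_by_light:
        return "visual"
    if driven_by_sound:
        return "auditory"
    return "unresponsive"

# Example: a sound-driven unit whose auditory response is altered by the LED.
print(classify_unit(p_auditory=0.001, p_visual=0.40, p_av_modulation=0.02))
```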

[Figure 3.2 appears here. Panel (a): pie chart of response types (auditory, visual, AV, AVmod) across the auditory cortex. Panel (b): proportions of unisensory auditory, unisensory visual, and bisensory units in each field (A1, AAF, PPF, PSF, ADF, AVF). Panel (c): proportion of cells showing enhancement versus suppression in each field. Panel (d): mutual information (bits) computed from mean spike latency plotted against that computed from spike count.]

FIGURE 3.2  Visual–auditory interactions in ferret auditory cortex. (a) Proportion of neurons (n = 716) that
responded to contralaterally presented noise bursts (auditory), to light flashes from an LED positioned in the
contralateral visual field (visual), to both of these stimuli (AV), or whose responses to the auditory stimulus
were modulated by the presentation of the visual stimulus, which did not itself elicit a response (AVmod).
(b) Bar graph showing the relative proportions of unisensory auditory (white), unisensory visual (black), and
bisensory (gray) neurons recorded in each auditory field. The actual numbers of neurons recorded are given at
the top of each column. (c) Proportion of neurons whose spike rates in response to combined visual–auditory
stimulation were enhanced or suppressed. Total number of bisensory neurons in each field: A1, n = 9; AAF,
n = 16; PPF, n = 13; PSF, n = 32; ADF, n = 32; AVF, n = 24. (d) Distribution of mutual information (MI) values
obtained when two reduced spike statistics were used: spike count and mean spike latency. Points above the
unity line indicate that mean response latency was more informative about the stimulus than spike count. This
was increasingly the case for all three stimulus conditions when the spike counts were low. (Data adapted
from Bizley, J.K. et al., Cereb. Cortex, 17, 2172–89, 2007 and Bizley, J.K., and King, A.J., Hearing
Res., 258, 55–63, 2009.)
Bisensory neurons comprised both those neurons whose spiking responses were altered by auditory
and visual stimuli and those whose auditory response was modulated by the simultaneously presented
visual stimulus. The fact that visual stimuli can drive spiking activity in the auditory cortex has also
been described in highly trained monkeys (Brosch et al. 2005). Nevertheless, this finding is unusual, as
most reports emphasize the modulatory nature of nonauditory inputs on the cortical responses to sound
(Ghazanfar 2009; Musacchia and Schroeder 2009). At least part of the explanation for this is likely to
be that we analyzed our data by calculating the mutual information between the neural responses and
the stimuli that elicited them. Information (in bits) was estimated by taking into account the temporal
pattern of the response rather than simply the overall spike count. This method proved to be substan-
tially more sensitive than a simple spike count measure, and allowed us to detect subtle, but nonetheless
significant, changes in the neural response produced by the presence of the visual stimulus.
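To make this point concrete, the sketch below shows a bare-bones plug-in estimate of mutual information computed either from the spike count alone or from a crude two-bin temporal response pattern. The simulated neuron, the binning, and the absence of any bias correction are simplifying assumptions; real analyses of this kind require bias correction and careful choice of bins, but the comparison illustrates why a temporal-pattern code can reveal stimulus-related changes that a spike-count measure misses.

```python
"""Minimal sketch: plug-in mutual information between a pair of stimuli and
either the spike count or a two-bin temporal response pattern. Plug-in
estimates are upward-biased for small samples; this is purely illustrative."""
import numpy as np
from collections import Counter

rng = np.random.default_rng(1)

def mutual_information_bits(stimuli, responses):
    """Plug-in estimate of I(S;R) in bits from paired discrete labels."""
    n = len(stimuli)
    p_s = Counter(stimuli)
    p_r = Counter(responses)
    p_sr = Counter(zip(stimuli, responses))
    mi = 0.0
    for (s, r), count in p_sr.items():
        p_joint = count / n
        mi += p_joint * np.log2(p_joint / ((p_s[s] / n) * (p_r[r] / n)))
    return mi

# Two stimuli that evoke similar total spike counts but different latencies:
# stimulus 0 fires mostly in the early bin, stimulus 1 mostly in the late bin.
stimuli, spike_counts, spike_patterns = [], [], []
for trial in range(400):
    s = trial % 2
    early = rng.poisson(3) if s == 0 else rng.poisson(1)
    late = rng.poisson(1) if s == 0 else rng.poisson(3)
    stimuli.append(s)
    spike_counts.append(early + late)        # count code
    spike_patterns.append((early, late))     # crude temporal-pattern code

print(f"MI from spike count:      {mutual_information_bits(stimuli, spike_counts):.2f} bits")
print(f"MI from temporal pattern: {mutual_information_bits(stimuli, spike_patterns):.2f} bits")
```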
Although neurons exhibiting visual–auditory interactions are found in all six areas of the ferret
auditory cortex, the proportion of such neurons varies across cortical areas (Figure 3.2b). Perhaps not
surprisingly, visual influences are least common in the primary areas, A1 and AAF. Nevertheless,
approximately 20% of the neurons recorded in those regions were found to be sensitive to visual
stimulation, and even included some unisensory visual responses. In the fields on the posterior
ectosylvian gyrus and ADF, 40% to 50% of the neurons were found to be sensitive to visual stimuli.
This rose to 75% in AVF, which, as described in Section 3.3.1, should probably be regarded as a
multisensory rather than as a predominantly auditory area.
We found that visual stimulation could either enhance or suppress the neurons’ response to sound
and, in some cases, increased the precision in their spike timing without changing the overall firing
rate (Bizley et al. 2007). Analysis of all bisensory neurons, including both neurons in which there
was a spiking response to each sensory modality and those in which concurrent auditory–visual
stimulation modulated the response to sound alone, revealed that nearly two-thirds produced stron-
ger responses to bisensory than to unisensory auditory stimulation. Figure 3.2c shows the propor-
tion of response types in each cortical field. Although the sample size in some areas was quite small,
the relative proportions of spiking responses that were either enhanced or suppressed varied across
the auditory cortex. Apart from the interactions in A1, the majority of the observed interactions
were facilitatory rather than suppressive.
Although a similar trend for a greater proportion of sites to show enhancement as compared with
suppression has been reported for local field potential data in monkey auditory cortex, analysis of
spiking responses revealed that suppressive interactions are more common (Kayser et al. 2008).
This trend was found across four different categories of naturalistic and artificial stimuli, so the
difference in the proportion of facilitatory and suppressive interactions is unlikely to reflect the
use of different stimuli in the two studies. By systematically varying onset asynchronies between
the visual and auditory stimuli, we did observe in a subset of neurons that visual stimuli could
have suppressive effects when presented 100 to 200 ms before the auditory stimuli, which were not
apparent when the two modalities were presented simultaneously (Bizley et al. 2007). This finding,
along with the results of several other studies (Meredith et al. 2006; Dehner et al. 2004; Allman et
al. 2008), emphasizes the importance of using an appropriate combination of stimuli to reveal the
presence and nature of cross-modal interactions.
Examination of the magnitude of cross-modal facilitation in ferret auditory cortex showed that
visual–auditory interactions are predominantly sublinear. In other words, both the mutual informa-
tion values (in bits) and the spike rates in response to combined auditory–visual stimulation are
generally less than the linear sum of the responses to the auditory and visual stimuli presented in
isolation, although some notable exceptions to this have been found (e.g., Figure 2E, F of Bizley et
al. 2007). This is unsurprising as the stimulus levels used in that study were well above threshold
and, according to the “inverse effectiveness principle” (Stein et al. 1988), were unlikely to produce
supralinear responses to combined visual–auditory stimulation. Consistent with this is the observa-
tion of Kayser et al. (2008), showing that, across stimulus types, multisensory facilitation is more
common for those stimuli that are least effective in driving the neurons.
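The degree of sub- or supralinearity can be summarized with simple indices that compare the bisensory response with the sum of the unisensory responses, or with the best unisensory response. The sketch below computes two such commonly used indices for hypothetical firing rates; the numbers are invented solely to illustrate the logic of inverse effectiveness and are not data from the studies discussed.

```python
"""Minimal sketch: quantifying whether a bisensory response is sub- or
supralinear relative to the unisensory responses. The index definitions are
standard in the multisensory literature; the firing rates are hypothetical."""

def additivity_index(resp_av: float, resp_a: float, resp_v: float) -> float:
    """>1 indicates supralinear summation, <1 sublinear (relative to A + V)."""
    return resp_av / (resp_a + resp_v)

def enhancement_index(resp_av: float, resp_a: float, resp_v: float) -> float:
    """Percentage change of the bisensory response relative to the best
    unisensory response (cf. Stein et al. 1988)."""
    best_unisensory = max(resp_a, resp_v)
    return 100.0 * (resp_av - best_unisensory) / best_unisensory

# A weakly effective stimulus pair (near threshold) versus a strongly effective one.
for label, (a, v, av) in {"weak": (2.0, 1.5, 6.0), "strong": (20.0, 15.0, 28.0)}.items():
    print(label,
          f"additivity = {additivity_index(av, a, v):.2f},",
          f"enhancement = {enhancement_index(av, a, v):.0f}%")
```

In the weak case the combined response exceeds the sum of its parts, whereas in the strong, well-above-threshold case the interaction is sublinear, in line with the observations summarized above.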
As mentioned above, estimates of the mutual information between the neural responses and each
of the stimuli that produce them take into account the full spike discharge pattern. It is then possible
to isolate the relative contributions of spike number and spike timing to the neurons’ sensitivity to
multisensory stimulation. It has previously been demonstrated in both ferret and cat auditory cortex
that the stimulus information contained in the complete spike pattern is conveyed by a combination
of spike count and mean spike latency (Nelken et al. 2005). By carrying out a similar analysis of
the responses to the brief stimuli used to characterize visual–auditory interactions in ferret auditory
cortex, we found that more than half the neurons transmitted more information in the timing of their
responses than in their spike counts (Bizley et al. 2007). This is in agreement with the results of
Nelken et al. (2005) for different types of auditory stimuli. We found that this was equally the case for
unisensory auditory or visual stimuli and for combined visual–auditory stimulation (Figure 3.2d).

3.3.5  Visual Inputs Enhance Processing in Auditory Cortex


To probe the functional significance of the multisensory interactions observed in the auditory cor-
tex, we systematically varied the spatial location of the stimuli and calculated the mutual informa-
tion between the neural responses and the location of unisensory visual, unisensory auditory, and
spatially and temporally coincident auditory–visual stimuli (Bizley and King 2008). The majority of
the visual responses were found to be spatially restricted, and usually carried more location-related
information than was the case for the auditory responses. The amount of spatial information avail-
able in the neural responses varied across the auditory cortex (Figure 3.3). For all three stimulus

[Figure 3.3 appears here: box plots of mutual information (bits) about visual (a), auditory (b), and bisensory (c) stimulus location for each cortical field (A1, AAF, PPF, PSF, ADF).]

FIGURE 3.3  Box plots displaying the amount of information transmitted by neurons in each of five ferret
cortical fields about LED location (a), sound-source location (b), or the location of temporally and spatially
congruent auditory–visual stimuli (c). Only neurons for which there was a significant unisensory visual or
auditory response are plotted in (a) and (b), respectively, whereas (c) shows the multisensory mutual informa-
tion values for all neurons recorded, irrespective of their response to unisensory stimulation. The box plots
show the median (horizontal bar), interquartile range (boxes), spread of data (tails), and outliers (cross sym-
bols). The notch indicates the distribution of data about the median. There were significant differences in the
mutual information values in different cortical fields (Kruskal–Wallis test; LED location, p = .0001; auditory
location, p = .0035; bisensory stimulus location, p < .0001). Significant post hoc pairwise differences (Tukey–
Kramer test, p < .05) between individual cortical fields are shown by the lines above each box plot. Note that
neurons in ADF transmitted the most spatial information irrespective of stimulus modality. (Adapted with
permission from Bizley, J.K., and King, A.J., Brain Res., 1242, 24–36, 2008.)
conditions, spatial sensitivity was found to be highest in ADF, supporting the notion that there is
some functional segregation across the auditory cortex, with the anterior fields more involved in
spatial processing. Relative to the responses to sound alone, the provision of spatially coincident
visual cues frequently altered the amount of information conveyed by the neurons about stimulus
location. Bisensory stimulation reduced the spatial information in the response in one third of these
cases, but increased it in the remaining two thirds. Thus, overall, visual inputs to the auditory cortex
appear to enhance spatial processing.
Because of the simple stimuli that were used in these studies, it was not possible to determine
whether or how visual inputs might affect the processing of nonspatial information in ferret auditory
cortex. However, a number of studies in primates have emphasized the benefits of visual influences
on auditory cortex in terms of the improved perception of vocalizations. In humans, lip reading
has been shown to activate the auditory cortex (Molholm et al. 2002; Giard and Peronnet 1999;
Calvert et al. 1999), and a related study in macaques has shown that presenting a movie of a mon-
key vocalizing can modulate the auditory cortical responses to that vocalization (Ghazanfar et al.
2005). These effects were compared to a visual control condition in which the monkey viewed a
disk that was flashed on and off to approximate the movements of the animal’s mouth. In that study,
the integration of face and voice stimuli was found to be widespread in both core and belt areas of
the auditory cortex. However, to generate response enhancement, a greater proportion of recording
sites in the belt areas required the use of a real monkey face, whereas nonselective modulation of
auditory cortical responses was more common in the core areas. Because a number of cortical areas
have now been shown to exhibit comparable sensitivity to monkey calls (Recanzone 2008), it would
be of considerable interest to compare the degree to which face and non-face visual stimuli can
modulate the activity of the neurons found there. This should help us determine the relative extent
to which each area might be specialized for processing communication signals.

3.4  WHERE DO VISUAL INPUTS TO AUDITORY CORTEX COME FROM?


Characterizing the way in which neurons are influenced by visual stimuli and their distribution
within the auditory cortex is only a first step in identifying their possible functions. It is also neces-
sary to know where those visual inputs originate. Potentially, visual information might gain access
to the auditory cortex in a number of ways. These influences could arise from direct projections
from the visual cortex or they could be inherited from multisensory subcortical nuclei, such as
nonlemniscal regions of the auditory thalamus. A third possibility includes feedback connections
from higher multisensory association areas in temporal, parietal, or frontal cortex. Anatomical evi-
dence from a range of species including monkeys (Smiley et al. 2007; Hackett et al. 2007a; Cappe
et al. 2009), ferrets (Bizley et al. 2007), prairie voles (Campi et al. 2010), and gerbils (Budinger et
al. 2006) has shown that subcortical as well as feedforward and feedback corticocortical inputs could
underpin multisensory integration in auditory cortex. To determine the most likely origins of the
nonauditory responses in the auditory cortex, we therefore need to consider studies of anatomical
connectivity in conjunction with information about the physiological properties of the neurons, such
as tuning characteristics or response latencies.
Previous studies have demonstrated direct projections from core and belt auditory cortex into
visual areas V1 and V2 in nonhuman primates (Rockland and Ojima 2003; Falchier et al. 2002) and,
more recently, in cats (Hall and Lomber 2008). The reciprocal projection, from V1 to A1, remains to
be described in primates, although Hackett et al. (2007b) have found evidence for a pathway termi-
nating in the caudomedial belt area of the auditory cortex from the area prostriata, adjacent to V1,
which is connected with the peripheral visual field representations in V1, V2, and MT. Connections
between early auditory and visual cortical fields have also been described in gerbils (Budinger et al.
2006, 2008) and prairie voles (Campi et al. 2010).
By placing injections of neural tracer into physiologically identified auditory fields in the ferret,
we were able to characterize the potential sources of visual input (Bizley et al. 2007; Figure 3.1b, c).
These data revealed a clear projection pattern whereby specific visual cortical fields innervate spe-
cific auditory fields. A sparse direct projection exists from V1 to the core auditory cortex (A1 and
AAF), which originates from the region of V1 that represents the peripheral visual field. This find-
ing mirrors that of the reciprocal A1 to V1 projection in monkeys and cats, which terminates in
the peripheral field representation of V1 (Rockland and Ojima 2003; Falchier et al. 2002; Hall and
Lomber 2008). Ferret A1 and AAF are also weakly innervated by area V2. The posterior auditory
fields, PPF and PSF, are innervated principally by areas 20a and 20b, thought to be part of the visual
form-processing pathway (Manger et al. 2004). In contrast, the largest inputs to the anterior fields,
ADF and AVF, come from SSY, which is regarded as part of the visual “where” processing stream
(Philipp et al. 2006).
Interestingly, this difference in the sources of cortical visual input, which is summarized in
Figure 3.1d, appears to reflect the processing characteristics of the auditory cortical fields con-
cerned. As described above, the fields on the posterior ectosylvian gyrus are more sensitive to
pitch and timbre, parameters that contribute to the identification of a sound source, whereas spatial
sensitivity for auditory, visual, and multisensory stimuli is greatest in ADF (Figure 3.3). This func-
tional distinction therefore matches the putative roles of the extrastriate areas that provide the major
sources of cortical visual input to each of these regions.
These studies appear to support the notion of a division of labor across the nonprimary areas
of ferret auditory cortex, but it would be premature to conclude that distinct fields are responsible
for the processing of spatial and nonspatial features of the world. Thus, although PSF is innervated
by nonspatial visual processing areas 20a and 20b (Figure 3.1c), the responses of a particularly
large number of neurons found there show an increase in transmitted spatial information when a
spatially congruent visual stimulus is added to the auditory stimulus (Bizley and King 2008). This
could be related to a need to integrate spatial and nonspatial cues when representing objects and
events in the auditory cortex. The possibility that connections between the visual motion-sensi-
tive area SSY and the fields on the anterior ectosylvian gyrus are involved in processing spatial
information provided by different sensory modalities is supported by a magnetoencephalography
study in humans showing that audio–visual motion signals are integrated in the auditory cortex
(Zvyagintsev et al. 2009). However, we must not forget that visual motion also plays a key role in
the perception of communication calls. By making intracranial recordings in epileptic patients,
Besle et al. (2008) found that the visual cues produced by lip movements activate MT followed,
approximately 10 ms later, by secondary auditory areas, where they alter the responses to sound in
ways that presumably influence speech perception. Thus, although the influence of facial expres-
sions on auditory cortical neurons is normally attributed to feedback from the superior temporal
sulcus (Ghazanfar et al. 2008), the availability of lower-level visual signals that provide cues to
sound onset and offset may be important as well.

3.5  WHAT ARE THE PERCEPTUAL CONSEQUENCES OF MULTISENSORY INTEGRATION IN THE AUDITORY CORTEX?

The concurrent availability of visual information presumably alters the representation in the audi-
tory cortex of sources that can be seen as well as heard in ways that are relevant for perception and
behavior. Obviously, the same argument applies to the somatosensory inputs that have also been
described there (Musacchia and Schroeder 2009). By influencing early levels of cortical process-
ing, these nonauditory inputs may play a fairly general processing role by priming the cortex to
receive acoustic signals. It has, for example, been proposed that visual and somatosensory inputs
can modulate the phase of oscillatory activity in the auditory cortex, potentially amplifying the
response to related auditory signals (Schroeder et al. 2008). But, as we have seen, visual inputs
can also have more specific effects, changing the sensitivity and even the selectivity of cortical
responses to stimulus location and, at least in primates, to vocalizations where communication
relies on both vocal calls and facial gestures. The role of multisensory processing in receptive audi-
tory communication is considered in more detail in other chapters in this volume. Here, we will
focus on the consequences of merging spatial information across different sensory modalities in
the auditory cortex.

3.5.1  Combining Auditory and Visual Spatial Representations in the Brain


There are fundamental differences in the ways in which source location is extracted by the visual
and auditory systems. The location of visual stimuli is represented topographically, first by the
distribution of activity across the retina and then at most levels of the central visual pathway. By
contrast, auditory space is not encoded explicitly along the cochlea. Consequently, sound-source
location has to be computed within the brain on the basis of the relative intensity and timing of
sounds at each ear (“binaural cues”), coupled with the location-dependent filtering of sounds by
the external ear (King et al. 2001). By tuning neurons to appropriate combinations of these cues,
a “visual-like” map of auditory space is constructed in the superior colliculus, allowing spatial
information from different sensory modalities to be represented in a common format (King and
Hutchings 1987; Middlebrooks and Knudsen 1984). This arrangement is particularly advantageous
for facilitating the integration of multisensory cues from a common source for the purpose of direct-
ing orienting behavior (Stein and Stanford 2008). However, because spatial signals provided by each
sensory modality are initially encoded using different reference frames, with visual signals based
on eye-centered retinal coordinates and auditory signals being head centered, information about
current eye position has to be incorporated into the activity of these neurons in order to maintain
map alignment (Hartline et al. 1995; Jay and Sparks 1987).
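As a concrete illustration of these two ideas, the sketch below gives the textbook Woodworth spherical-head approximation for the interaural time difference generated by a source at a frontal azimuth, together with the idealized subtraction of eye position that would convert a head-centered azimuth into eye-centered coordinates. The head radius, the speed of sound, and the assumption of a full (rather than partial) remapping are nominal, illustrative choices.

```python
"""Minimal sketch: (i) the Woodworth spherical-head approximation for the
interaural time difference (ITD) of a source at a given frontal azimuth, and
(ii) an idealized conversion of a head-centered azimuth into eye-centered
coordinates given current eye position. Parameter values are nominal; real
neural reference-frame transformations are typically only partial."""
import math

SPEED_OF_SOUND = 343.0   # m/s
HEAD_RADIUS = 0.09       # m, nominal adult human value (illustrative)

def itd_seconds(azimuth_deg: float) -> float:
    """Woodworth approximation: ITD = (r / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

def eye_centered_azimuth(head_centered_deg: float, eye_position_deg: float) -> float:
    """Idealized full remapping: subtract the current eye deviation from the
    head-centered source azimuth."""
    return head_centered_deg - eye_position_deg

# A source 30 degrees to the right while the eyes are deviated 10 degrees right.
print(f"ITD                 : {itd_seconds(30) * 1e6:.0f} microseconds")
print(f"Eye-centered azimuth: {eye_centered_azimuth(30, 10):.0f} degrees")
```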
In contrast to the topographic representation of auditory space in the superior colliculus, there
is no space map in the auditory cortex (King and Middlebrooks 2011), posing an even greater chal-
lenge for the integration of visual and auditory spatial signals at the cortical level. The integrity of
several auditory cortical areas is essential for normal sound localization (Malhotra and Lomber
2007), but we still have a very incomplete understanding of how neural activity in those regions
contributes to the percept of where a sound source is located. The spatial receptive fields of indi-
vidual cortical neurons are frequently very broad and, for the most part, occupy the contralateral
side of space. However, several studies have emphasized that sound-source location can also be
signaled by the timing of spikes (Jenison 2000; Nelken et al. 2005; Stecker et al. 2003). Our finding
that the presence of spatially congruent visual stimuli leads to auditory cortical neurons becom-
ing more informative about the source location, and that this greater spatial selectivity is based on
both the timing and number of spikes evoked, is clearly consistent with this. Whatever the relative
contributions of different neural coding strategies might be, it seems that sound-source location
is signaled by the population response of neurons in the auditory cortex (Woods et al. 2006). The
approach used by Allman and colleagues (2009) to estimate the response facilitation produced in a
population of cortical neurons by combining visual and auditory stimuli might therefore be useful
for characterizing the effects on spatial processing at this level.
We pointed out above that meaningful interactions between different sensory modalities can
take place only if the different reference frames used to encode modality-specific spatial signals
are brought together. Further evidence for the multisensory representation of spatial signals in the
auditory cortex is provided by the demonstration that gaze direction can change the activity of
neurons in the auditory cortex (Fu et al. 2004; Werner-Reiss et al. 2003). A modulatory influence of
eye position on auditory responses has been observed as early as the inferior colliculus (Groh et al.
2001), indicating that these effects could be inherited from the midbrain rather than created de novo
in the auditory cortex. On the other hand, the timing and laminar profile of eye-position effects in
the auditory cortex is more consistent with an origin from nonlemniscal regions of the thalamus or
via feedback projections from the parietal or frontal cortices (Fu et al. 2004). As in the superior col-
liculus, varying eye position does not change auditory cortical spatial tuning in a manner consistent
with a straightforward transformation into eye-centered coordinates. Rather, spatial tuning seems to
take on an intermediate form between eye-centered and head-centered coordinates (Werner-Reiss
et al. 2003).

3.5.2  A Role for Auditory Cortex in Spatial Recalibration?


One possibility that has attracted recent attention is that visual–auditory interactions in early sen-
sory cortex could be involved in the visual recalibration of auditory space. The representation of
auditory space in the brain is inherently plastic, even in adulthood, and there are several well-
­documented examples in which the perceived location of sound sources can be altered so as to
conform to changes in visual inputs (King 2009; King et al. 2001). The most famous of these is the
ventriloquism illusion, whereby synchronous but spatially disparate visual cues can “capture” the
location of a sound source, so that it is incorrectly perceived to arise from near the seen location
(Bertelson and Radeau 1981). Repeated presentation of consistently misaligned visual and auditory
cues results in a shift in the perception of auditory space that can last for tens of minutes once the
visual stimulus is removed. This aftereffect has been reported in humans (Recanzone 1998; Radeau
and Bertelson 1974; Lewald 2002) and in nonhuman primates (Woods and Recanzone 2004).
Given the widespread distribution of visual–auditory interactions in the cortex, a number of sites
could potentially provide the neural substrate for this cross-modal spatial illusion. The finding that
the ventriloquism aftereffect does not transfer across sound frequency (Lewald 2002; Recanzone
1998; Woods and Recanzone 2004) implies the involvement of a tonotopically organized region,
i.e., early auditory cortex. On the other hand, generalization across frequencies has been observed
in another study (Frissen et al. 2005), so this conclusion may not stand. However, neuroimaging
results in humans have shown that activity levels in the auditory cortex vary on a trial-by-trial basis
according to whether a spatially discrepant visual stimulus is presented at the same time (Bonath et
al. 2007). Furthermore, the finding by Passamonti et al. (2009) that patients with unilateral lesions
of the visual cortex fail to show the ventriloquism aftereffect in the affected hemifield, whereas
patients with parietotemporal lesions still do, is consistent with the possibility that connections
between the visual and auditory cortices are involved. On the other hand, the hemianopic patients
did show improved sound localization accuracy when visual and auditory stimuli were presented at
the same location in space, implying that different neural circuits may underlie these cross-modal
spatial effects.
Visual capture of sound-source location is thought to occur because visual cues normally provide
more reliable and higher-resolution spatial information. If the visual stimuli are blurred, however, so
that this is no longer the case, spatially conflicting auditory cues can then induce systematic errors
in visual localization (Alais and Burr 2004). Nothing is known about the neural basis for reverse
ventriloquism, but it is tempting to speculate that auditory influences on visual cortex might be
involved. Indeed, the influence of sound on perceptual learning in a visual motion discrimination
task has been shown to be limited to locations in visual space that match those of the sound source,
implying an auditory influence on processing in a visual area that is retinotopically organized (Beer
and Watanabe 2009).
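The standard account of both effects is reliability-weighted cue combination, in which each modality's estimate of source azimuth is weighted by its inverse variance (the model tested by Alais and Burr 2004). The sketch below works through this arithmetic for hypothetical visual and auditory estimates; the numerical values are illustrative only.

```python
"""Minimal sketch: reliability-weighted (maximum-likelihood) combination of a
visual and an auditory estimate of source azimuth. Values are illustrative."""

def combine(est_v: float, sigma_v: float, est_a: float, sigma_a: float):
    """Weight each cue by its inverse variance; return (estimate, combined sd)."""
    w_v = 1.0 / sigma_v ** 2
    w_a = 1.0 / sigma_a ** 2
    combined = (w_v * est_v + w_a * est_a) / (w_v + w_a)
    combined_sd = (1.0 / (w_v + w_a)) ** 0.5
    return combined, combined_sd

# Sharp vision dominates the percept (classic ventriloquism)...
print(combine(est_v=0.0, sigma_v=1.0, est_a=10.0, sigma_a=8.0))
# ...whereas blurred vision hands the weight to audition (reverse ventriloquism).
print(combine(est_v=0.0, sigma_v=15.0, est_a=10.0, sigma_a=8.0))
```

With the first set of values the combined estimate sits close to the visual location; blurring the visual cue (larger sigma_v) pulls the estimate toward the auditory location, as in the behavioral results described above.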
Behavioral studies have shown that adult humans and other mammals can adapt substantially to
altered auditory spatial cues produced, for example, by reversibly occluding or changing the shape
of the external ear (reviewed by Wright and Zhang 2006). Because visual cues provide a possible
source of sensory feedback about the accuracy of acoustically guided behavior, one potential role
of visual inputs to the auditory cortex is to guide the plasticity observed when localization cues are
altered. However, Kacelnik et al. (2006) found that the capacity of adult ferrets to relearn to local-
ize sound accurately after altering binaural cues by reversible occlusion of one ear is not depen-
dent on visual feedback. It has been suggested that instead of being guided by vision, this form of
adaptive plasticity could result from unsupervised sensorimotor learning, in which the dynamic
acoustic inputs resulting from an animal’s own movements help stabilize the brain’s representation
of auditory space (Aytekin et al. 2008). Although vision is not essential for the recalibration of
auditory space in monaurally occluded ferrets, it is certainly possible that training with congruent
multisensory cues might result in faster learning than that seen with auditory cues alone, as shown
in humans for a motion detection task (Kim et al. 2008).

3.6  CONCLUDING REMARKS


There is now extensive anatomical and physiological evidence from a range of species that multi-
sensory convergence occurs at the earliest levels of auditory cortical processing. These nonauditory
influences therefore have to be taken into account in any model of what the auditory cortex actu-
ally does. Indeed, one of the consequences of visual, somatosensory, and eye-position effects on
the activity of neurons in core and belt areas of the auditory cortex is that those influences will be
passed on to each of the brain regions to which these areas project. Multiple sources of input have
been implicated in multisensory integration within auditory cortex, and a more detailed charac-
terization of those inputs will help determine the type of information that they provide and what
effect this might have on auditory processing. Some of those inputs are likely to provide low-level
temporal or spatial cues that enhance auditory processing in a fairly general way, whereas others
provide more complex information that is specifically related, for example, to the processing of
communication signals. Revealing where those inputs come from and where they terminate will
help unravel the relative contributions of different auditory cortical areas to perception. Indeed, the
studies that have been carried out to date have provided additional support for the standpoint that
there is some functional segregation across the different parts of the auditory cortex. In order to take
this further, however, it will also be necessary to examine the behavioral and physiological effects
of experimentally manipulating activity in those circuits if we are to understand how visual inputs
influence auditory processing and perception.

REFERENCES
Adriani, M., P. Maeder, R. Meuli et al. 2003. Sound recognition and localization in man: Specialized cortical
networks and effects of acute circumscribed lesions. Experimental Brain Research 153:591–604.
Alain, C., S.R. Arnott, S. Hevenor, S. Graham, and C.L. Grady. 2001. “What” and “where” in the human
auditory system. Proceedings of the National Academy of Sciences of the United States of America
98:12301–6.
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Current
Biology 14:257–62.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2008. Subthreshold auditory inputs to extrastriate visual
neurons are responsive to parametric changes in stimulus quality: Sensory-specific versus non-specific
coding. Brain Research 1242:95–101.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2009. Adult deafness induces somatosensory conversion of
ferret auditory cortex. Proceedings of the National Academy of Sciences of the United States of America
106:5925–30.
Aytekin, M., C.F. Moss, and J.Z. Simon. 2008. A sensorimotor approach to sound localization. Neural
Computation 20:603–35.
Bajo, V.M., F.R. Nodal, J.K. Bizley, and A.J. King. 2010. The non-lemniscal auditory cortex in ferrets:
Convergence of corticotectal inputs in the superior colliculus. Frontiers in Neuroanatomy 4:18.
Barrett, D.J., and D.A. Hall. 2006. Response preferences for “what” and “where” in human non-primary audi-
tory cortex. NeuroImage 32:968–77.
Beer, A.L., and T. Watanabe. 2009. Specificity of auditory-guided visual perceptual learning suggests cross-
modal plasticity in early visual cortex. Experimental Brain Research 198:353–61.
Bendor, D., and X. Wang. 2005. The neuronal representation of pitch in primate auditory cortex. Nature
436:1161–5.
Bendor, D., and X. Wang. 2008. Neural response properties of primary, rostral, and rostrotemporal core fields
in the auditory cortex of marmoset monkeys. Journal of Neurophysiology 100:888–906.
Bertelson, P., and M. Radeau. 1981. Cross-modal bias and perceptual fusion with auditory-visual spatial dis-
cordance. Perception & Psychophysics 29:578–84.
Besle, J., C. Fischer, A. Bidet-Caulet, F. Lecaignard, O. Bertrand, and M.H. Giard. 2008. Visual activation
and audiovisual interactions in the auditory cortex during speech perception: Intracranial recordings in
humans. Journal of Neuroscience 28:14301–10.
Bizley, J.K., and A.J. King. 2008. Visual-auditory spatial processing in auditory cortical neurons. Brain
Research 1242:24–36.
Bizley, J.K., and A.J. King. 2009. Visual influences on ferret auditory cortex. Hearing Research 258:55–63.
Bizley, J.K., F.R. Nodal, I. Nelken, and A.J. King. 2005. Functional organization of ferret auditory cortex.
Cerebral Cortex 15:1637–53.
Bizley, J.K., F.R. Nodal, V.M. Bajo, I. Nelken, and A.J. King. 2007. Physiological and anatomical evidence for
multisensory interactions in auditory cortex. Cerebral Cortex 17:2172–89.
Bizley, J.K., K.M. Walker, B.W. Silverman, A.J. King, and J.W. Schnupp. 2009. Interdependent encoding of
pitch, timbre, and spatial location in auditory cortex. Journal of Neuroscience 29:2064–75.
Bizley, J.K., K.M. Walker, A.J. King, and J.W. Schnupp. 2010. Neural ensemble codes for stimulus period-
icity in auditory cortex. Journal of Neuroscience 30:5078–91.
Bonath, B., T. Noesselt, A. Martinez et al. 2007. Neural basis of the ventriloquist illusion. Current Biology
17:1697–703.
Brosch, M., E. Selezneva, and H. Scheich. 2005. Nonauditory events of a behavioral procedure activate audi-
tory cortex of highly trained monkeys. Journal of Neuroscience 25:6797–806.
Budinger, E., P. Heil, A. Hess, and H. Scheich. 2006. Multisensory processing via early cortical stages:
Connections of the primary auditory cortical field with other sensory systems. Neuroscience 143:
1065–83.
Budinger, E., A. Laszcz, H. Lison, H. Scheich, and F.W. Ohl. 2008. Non-sensory cortical and subcortical con-
nections of the primary auditory cortex in Mongolian gerbils: Bottom-up and top-down processing of
neuronal information via field AI. Brain Research 1220:2–32.
Cahill, L., F. Ohl, and H. Scheich. 1996. Alteration of auditory cortex activity with a visual stimulus through
conditioning: a 2-deoxyglucose analysis. Neurobiology of Learning and Memory 65:213–22.
Calvert, G.A., and T. Thesen. 2004. Multisensory integration: Methodological approaches and emerging prin-
ciples in the human brain. Journal of Physiology, Paris 98:191–205.
Calvert, G.A., M.J. Brammer, E.T. Bullmore, R. Campbell, S.D. Iversen, and A.S. David. 1999. Response
amplification in sensory-specific cortices during crossmodal binding. Neuroreport 10:2619–23.
Campi, K.L., K.L. Bales, R. Grunewald, and L. Krubitzer. 2010. Connections of auditory and visual cortex in
the prairie vole (Microtus ochrogaster): Evidence for multisensory processing in primary sensory areas.
Cerebral Cortex 20:89–108.
Cantone, G., J. Xiao, and J.B. Levitt. 2006. Retinotopic organization of ferret suprasylvian cortex. Visual
Neuroscience 23:61–77.
Cappe, C., A. Morel, P. Barone, and E.M. Rouiller. 2009. The thalamocortical projection systems in primate:
An anatomical support for multisensory and sensorimotor interplay. Cerebral Cortex 19:2025–37.
Dehner, L.R., L.P. Keniston, H.R. Clemo, and M.A. Meredith. 2004. Cross-modal circuitry between auditory
and somatosensory areas of the cat anterior ectosylvian sulcal cortex: A ‘new’ inhibitory form of multi-
sensory convergence. Cerebral Cortex 14:387–403.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience 22:5749–59.
Frissen, I., J. Vroomen, B. De Gelder, and P. Bertelson. 2005. The aftereffects of ventriloquism: Generalization
across sound-frequencies. Acta Psychologica 118:93–100.
Fu, K.M., A.S. Shah, M.N. O’Connell et al. 2004. Timing and laminar profile of eye-position effects on audi-
tory responses in primate auditory cortex. Journal of Neurophysiology 92:3522–31.
Ghazanfar, A.A. 2009. The multisensory roles for auditory cortex in primate vocal communication. Hearing
Research 258:113–20.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences
10:278–85.
Ghazanfar, A.A., J.X. Maier, K.L. Hoffman, and N.K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience 25:5004–12.
Ghazanfar, A.A., C. Chandrasekaran, and N.K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience 28:4457–69.
Giard, M.H., and F. Peronnet. 1999. Auditory-visual integration during multimodal object recognition in
humans: A behavioral and electrophysiological study. Journal of Cognitive Neuroscience 11:473–90.
Goodale, M.A., and D.A. Westwood. 2004. An evolving view of duplex vision: Separate but interacting cortical
pathways for perception and action. Current Opinion in Neurobiology 14:203–11.
Griffiths, T.D., J.D. Warren, S.K. Scott, I. Nelken, and A.J. King. 2004. Cortical processing of complex sound:
A way forward? Trends in Neuroscience 27:181–5.
Groh J.M., A.S. Trause, A.M. Underhill, K.R. Clark, and S. Inati. 2001. Eye position influences auditory
responses in primate inferior colliculus. Neuron 29:509–18.
Hackett, T.A., I. Stepniewska, and J.H. Kaas. 1999. Prefrontal connections of the parabelt auditory cortex in
macaque monkeys. Brain Research 817:45–58.
Hackett, T.A., L.A. De La Mothe, I. Ulbert, G. Karmos, J. Smiley, and C.E. Schroeder. 2007a. Multisensory
convergence in auditory cortex: II. Thalamocortical connections of the caudal superior temporal plane.
Journal of Comparative Neurology 502:924–52.
Hackett, T.A., J.F. Smiley, I. Ulbert et al. 2007b. Sources of somatosensory input to the caudal belt areas of
auditory cortex. Perception 36:1419–30.
Hall, A.J., and S.G. Lomber. 2008. Auditory cortex projections target the peripheral field representation of
primary visual cortex. Experimental Brain Research 190:413–30.
Hall, D.A., and C.J. Plack. 2009. Pitch processing sites in the human auditory brain. Cerebral Cortex
19:576–85.
Harrington, I.A., G.C. Stecker, E.A. Macpherson, and J.C. Middlebrooks. 2008. Spatial sensitivity of neurons
in the anterior, posterior, and primary fields of cat auditory cortex. Hearing Research 240:22–41.
Hartline, P.H., R.L. Vimal, A.J. King, D.D. Kurylo, and D.P. Northmore. 1995. Effects of eye position on audi-
tory localization and neural representation of space in superior colliculus of cats. Experimental Brain
Research 104:402–8.
Imaizumi, K., N.J. Priebe, P.A. Crum, P.H. Bedenbaugh, S.W. Cheung, and C.E. Schreiner. 2004. Modular
functional organization of cat anterior auditory field. Journal of Neurophysiology 92:444–57.
Jay, M.F., and D.L. Sparks. 1987. Sensorimotor integration in the primate superior colliculus: II. Coordinates
of auditory signals. Journal of Neurophysiology 57:35–55.
Jenison, R.L. 2000. Correlated cortical populations can enhance sound localization performance. Journal of the
Acoustical Society of America 107:414–21.
Kacelnik, O., F.R. Nodal, C.H. Parsons, and A.J. King. 2006. Training-induced plasticity of auditory localiza-
tion in adult mammals. PLoS Biology 4:627–38.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2007. Functional imaging reveals visual modulation
of specific fields in auditory cortex. Journal of Neuroscience 27:1824–35.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–74.
Keniston, L.P., B.L. Allman, and M.A. Meredith. 2008. The rostral suprasylvian sulcus (RSSS) of the ferret: A
‘new’ multisensory area. Society for Neuroscience Abstracts 38:457.10.
Keniston, L.P., B.L. Allman, M.A. Meredith, and H.R. Clemo. 2009. Somatosensory and multisensory properties
of the medial bank of the ferret rostral suprasylvian sulcus. Experimental Brain Research 196:239–51.
Kim, R.S., A.R. Seitz, and L. Shams. 2008. Benefits of stimulus congruency for multisensory facilitation of
visual learning. PLoS ONE 3:e1532.
King, A.J. 2009. Visual influences on auditory spatial learning. Philosophical Transactions of the Royal Society
of London. Series B, Biological Sciences 364:331–9.
King, A.J., and M.E. Hutchings. 1987. Spatial response properties of acoustically responsive neurons in the
superior colliculus of the ferret: A map of auditory space. Journal of Neurophysiology 57:596–624.
King, A.J., and I. Nelken. 2009. Unraveling the principles of auditory cortical processing: Can we learn from
the visual system? Nature Neuroscience 12:698–701.
King, A.J., and J.C. Middlebrooks. 2011. Cortical representation of auditory space. In The Auditory Cortex,
eds. J.A. Winer and C.E. Schreiner, 329–41. New York: Springer.
King, A.J., J.W. Schnupp, and T.P. Doubell. 2001. The shape of ears to come: Dynamic coding of auditory
space. Trends in Cognitive Sciences 5:261–70.
Las, L., A.H. Shapira, and I. Nelken. 2008. Functional gradients of auditory sensitivity along the anterior ecto-
sylvian sulcus of the cat. Journal of Neuroscience 28:3657–67.
Lewald, J. 2002. Rapid adaptation to auditory–visual spatial disparity. Learning and Memory 9:268–78.
Loftus, W.C., and M.L. Sutter. 2001. Spectrotemporal organization of excitatory and inhibitory receptive fields
of cat posterior auditory field neurons. Journal of Neurophysiology 86:475–91.
Lomber, S.G., and S. Malhotra. 2008. Double dissociation of ‘what’ and ‘where’ processing in auditory cortex.
Nature Neuroscience 11:609–16.
Maeder, P.P., R.A. Meuli, M. Adriani et al. 2001. Distinct pathways involved in sound recognition and localiza-
tion: A human fMRI study. Neuroimage 14:802–16.
Malhotra, S., and S.G. Lomber. 2007. Sound localization during homotopic and heterotopic bilateral cooling deacti-
vation of primary and nonprimary auditory cortical areas in the cat. Journal of Neurophysiology 97:26–43.
Manger, P.R., I. Masiello, and G.M. Innocenti. 2002. Areal organization of the posterior parietal cortex of the
ferret (Mustela putorius). Cerebral Cortex 12:1280–97.
Manger, P.R., H. Nakamura, S. Valentiniene, and G.M. Innocenti. 2004. Visual areas in the lateral temporal
cortex of the ferret (Mustela putorius). Cerebral Cortex 14:676–89.
Manger, P.R., G. Engler, C.K. Moll, and A.K. Engel. 2005. The anterior ectosylvian visual area of the fer-
ret: A homologue for an enigmatic visual cortical area of the cat? European Journal of Neuroscience
22:706–14.
Manger, P.R., G. Engler, C.K. Moll, and A.K. Engel. 2008. Location, architecture, and retinotopy of the antero-
medial lateral suprasylvian visual area (AMLS) of the ferret (Mustela putorius). Visual Neuroscience
25:27–37.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264:746–8.
McLaughlin, D.F., R.V. Sonty, and S.L. Juliano. 1998. Organization of the forepaw representation in ferret
somatosensory cortex. Somatosensory & Motor Research 15:253–68.
Meredith, M.A., L.R. Keniston, L.R. Dehner, and H.R. Clemo. 2006. Crossmodal projections from somatosen-
sory area SIV to the auditory field of the anterior ectosylvian sulcus (FAES) in cat: Further evidence for
subthreshold forms of multisensory processing. Experimental Brain Research 172:472–84.
Merigan, W.H., and J.H. Maunsell. 1993. How parallel are the primate visual pathways? Annual Review of
Neuroscience 16:369–402.
Middlebrooks, J.C., and E.I. Knudsen. 1984. A neural code for auditory space in the cat’s superior colliculus.
Journal of Neuroscience 4:2621–34.
Molholm, S., W. Ritter, M.M. Murray, D.C. Javitt, C.E. Schroeder, and J.J. Foxe. 2002. Multisensory auditory-
visual interactions during early sensory processing in humans: A high-density electrical mapping study.
Brain Research Cognitive Brain Research 14:115–28.
Musacchia, G., and C.E. Schroeder. 2009. Neuronal mechanisms, response dynamics and perceptual functions
of multisensory interactions in auditory cortex. Hearing Research 258:72–9.
Nakamoto, K.T., S.J. Jones, and A.R. Palmer. 2008. Descending projections from auditory cortex modulate
sensitivity in the midbrain to cues for spatial position. Journal of Neurophysiology 99:2347–56.
Nelken, I., G. Chechik, T.D. Mrsic-Flogel, A.J. King, and J.W. Schnupp. 2005. Encoding stimulus informa-
tion by spike numbers and mean response time in primary auditory cortex. Journal of Computational
Neuroscience 19:199–221.
Nelken, I., J.K. Bizley, F.R. Nodal, B. Ahmed, A.J. King, and J.W. Schnupp. 2008. Responses of auditory
cortex to complex stimuli: Functional organization revealed using intrinsic optical signals. Journal of
Neurophysiology 99:1928–41.
Passamonti, C., C. Bertini, and E. Ladavas. 2009. Audio-visual stimulation improves oculomotor patterns in
patients with hemianopia. Neuropsychologia 47:546–55.
Philipp, R., C. Distler, and K.P. Hoffmann. 2006. A motion-sensitive area in ferret extrastriate visual cortex: An
analysis in pigmented and albino animals. Cerebral Cortex 16:779–90.
Phillips, D.P., and S.S. Orman. 1984. Responses of single neurons in posterior field of cat auditory cortex to
tonal stimulation. Journal of Neurophysiology 51:147–63.
Radeau, M., and P. Bertelson. 1974. The after-effects of ventriloquism. Quarterly Journal of Experimental
Psychology 26:63–71.
Ramsay, A.M., and M.A. Meredith. 2004. Multiple sensory afferents to ferret pseudosylvian sulcal cortex.
Neuroreport 15:461–5.
Rauschecker, J.P., and B. Tian. 2000. Mechanisms and streams for processing of “what” and “where” in auditory
cortex. Proceedings of the National Academy of Sciences of the United States of America 97:11800–6.
Recanzone, G.H. 1998. Rapidly induced auditory plasticity: The ventriloquism aftereffect. Proceedings of the
National Academy of Sciences of the United States of America 95:869–75.
Recanzone, G.H. 2000. Spatial processing in the auditory cortex of the macaque monkey. Proceedings of the
National Academy of Sciences of the United States of America 97:11829–35.
Recanzone, G.H. 2008. Representation of con-specific vocalizations in the core and belt areas of the auditory
cortex in the alert macaque monkey. Journal of Neuroscience 28:13184–93.
Redies, C., M. Diksic, and H. Riml. 1990. Functional organization in the ferret visual cortex: A double-label
2-deoxyglucose study. Journal of Neuroscience 10:2791–803.
Renier, L.A., I. Anurova, A.G. De Volder, S. Carlson, J. Vanmeter, and J.P. Rauschecker. 2009. Multisensory
integration of sounds and vibrotactile stimuli in processing streams for “what” and “where.” Journal of
Neuroscience 29:10950–60.
Rice, F.L., C.M. Gomez, S.S. Leclerc, R.W. Dykes, J.S. Moon, and K. Pourmoghadam. 1993. Cytoarchitecture
of the ferret suprasylvian gyrus correlated with areas containing multiunit responses elicited by stimula-
tion of the face. Somatosensory & Motor Research 10:161–88.
Rockland, K.S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology 50:19–26.
Romanski, L.M., B. Tian, J. Fritz, M. Mishkin, P.S. Goldman-Rakic, and J.P. Rauschecker. 1999. Dual streams
of auditory afferents target multiple domains in the primate prefrontal cortex. Nature Neuroscience
2:1131–6.
Schreiner, C.E., and M.S. Cynader. 1984. Basic functional organization of second auditory cortical field (AII)
of the cat. Journal of Neurophysiology 51:1284–305.
Schroeder, C.E., and J.J. Foxe. 2002. The timing and laminar profile of converging inputs to multisensory areas
of the macaque neocortex. Brain Research Cognitive Brain Research 14:187–98.
Schroeder, C.E., P. Lakatos, Y. Kajikawa, S. Partan, and A. Puce. 2008. Neuronal oscillations and visual ampli-
fication of speech. Trends in Cognitive Sciences 12:106–13.
Shelton, B.R., and C.L. Searle. 1980. The influence of vision on the absolute identification of sound-source
position. Perception & Psychophysics 28:589–96.
Smiley, J.F., T.A. Hackett, I. Ulbert et al. 2007. Multisensory convergence in auditory cortex, I. Cortical con-
nections of the caudal superior temporal plane in macaque monkeys. Journal of Comparative Neurology
502:894–923.
Stecker, G.C., B.J. Mickey, E.A. Macpherson, and J.C. Middlebrooks. 2003. Spatial sensitivity in field PAF of
cat auditory cortex. Journal of Neurophysiology 89:2889–903.
Stein, B.E., and T.R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the
single neuron. Nature Reviews. Neuroscience 9:1477–85.
Stein, B.E., W.S. Huneycutt, and M.A. Meredith. 1988. Neurons and behavior: The same rules of multisensory
integration apply. Brain Research 448:355–8.
Stein, B.E., M.A. Meredith, W.S. Huneycutt, and L. McDade. 1989. Behavioral indices of multisensory inte-
gration: Orientation to visual cues is affected by auditory stimuli. Journal of Cognitive Neuroscience
1:12–24.
Sumby, W.H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the
Acoustical Society of America 26:212–15.
Tian, B., D. Reser, A. Durham, A. Kustov, and J.P. Rauschecker. 2001. Functional specialization in rhesus
monkey auditory cortex. Science 292:290–3.
Thomas, H., J. Tillein, P. Heil, and H. Scheich. 1993. Functional organization of auditory cortex in the mon-
golian gerbil (Meriones unguiculatus). I. Electrophysiological mapping of frequency representation and
distinction of fields. European Journal of Neuroscience 5:882–97.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences of the United States of America 101:2167–72.
Warren, J.D., and T.D. Griffiths. 2003. Distinct mechanisms for processing spatial sequences and pitch
sequences in the human auditory brain. Journal of Neuroscience 23:5799–804.
Werner-Reiss, U., K.A. Kelly, A.S. Trause, A.M. Underhill, and J.M. Groh. 2003. Eye position affects activity
in primary auditory cortex of primates. Current Biology 13:554–62.
Woods, T.M., and G.H. Recanzone. 2004. Visually induced plasticity of auditory spatial perception in macaques.
Current Biology 14:1559–64.
Woods, T.M., S.E. Lopez, J.H. Long, J.E. Rahman, and G.H. Recanzone. 2006. Effects of stimulus azimuth
and intensity on the single-neuron activity in the auditory cortex of the alert macaque monkey. Journal
of Neurophysiology 96:3323–37.
Wright, B.A., and Y. Zhang. 2006. A review of learning with normal and altered sound-localization cues in
human adults. International Journal of Audiology 45 Suppl 1, S92–8.
Zvyagintsev, M., A.R. Nikolaev, H. Thonnessen, O. Sachs, J. Dammers, and K. Mathiak. 2009. Spatially con-
gruent visual motion modulates activity of the primary auditory cortex. Experimental Brain Research
198:391–402.
Section II
Neurophysiological Bases
4 Are Bimodal Neurons the
Same throughout the Brain?
M. Alex Meredith, Brian L. Allman,
Leslie P. Keniston, and H. Ruth Clemo

CONTENTS
4.1 Introduction............................................................................................................................. 51
4.2 Methods................................................................................................................................... 52
4.2.1 Surgical Procedures..................................................................................................... 52
4.2.2 Recording..................................................................................................................... 52
4.2.3 Data Analysis............................................................................................................... 53
4.3 Results...................................................................................................................................... 54
4.3.1 Anterior Ectosylvian Sulcal Cortex............................................................................. 54
4.3.2 Posterolateral Lateral Suprasylvian Cortex................................................................. 54
4.3.3 Rostral Suprasylvian Sulcal Cortex............................................................................. 59
4.3.4 Superior Colliculus...................................................................................................... 59
4.4 Discussion................................................................................................................................60
4.4.1 Bimodal Neurons with Different Integrative Properties.............................................60
4.4.2 Bimodal Neurons in SC and Cortex Differ.................................................................60
4.4.3 Bimodal Neurons in Different Cortical Areas Differ..................................................60
4.4.4 Population Contribution to Areal Multisensory Function........................................... 61
4.4.5 Methodological Considerations................................................................................... 62
4.5 Conclusions.............................................................................................................................. 63
Acknowledgments............................................................................................................................. 63
References......................................................................................................................................... 63

4.1  INTRODUCTION
It is a basic tenet of neuroscience that different neural circuits underlie different functions or behav-
iors. For the field of multisensory processing, however, this concept appears to be superseded by
the system’s requirements: convergence of inputs from different sensory modalities onto individual
neurons is the requisite, defining step. This requirement is fulfilled by the bimodal neuron, which
has been studied for half a century now (Horn and Hill 1966) and has come to represent the basic
unit of multisensory processing (but see Allman et al. 2009). Bimodal neurons are ubiquitous: they
are found throughout the neuraxis and in nervous systems across the animal kingdom (for review,
see Stein and Meredith 1993). Bimodal (and trimodal) neurons exhibit suprathreshold responses to
stimuli from more than one sensory modality, and often integrate (a significant response change
when compared with unisensory responses) those responses when the stimuli are combined. As
revealed almost exclusively by studies of the superior colliculus (SC), bimodal neurons integrate
multisensory information according to the spatial, temporal, and physical parameters of the stim-
uli involved (for review, see Stein and Meredith 1993). The generality of these principles and the

broadness of their applicability appeared to be confirmed by similar findings in cortical bimodal
neurons (Wallace et al. 1992) and overt multisensory behaviors (Stein et al. 1989).

FIGURE 4.1  Lateral view of cat brain depicts multisensory recording sites in cortex and midbrain (rostral
suprasylvian, posterolateral lateral suprasylvian, and anterior ectosylvian cortices, and the superior colliculus).

Although it has been generally assumed that bimodal neurons are essentially the same, an
insightful study of multisensory integration in bimodal SC neurons demonstrated that bimodal
neurons exhibit different functional ranges (Perrault et al. 2005). Some bimodal neurons were
highly integrative and exhibited integrated, superadditive (combined response > sum of unisen-
sory responses) responses to a variety of stimulus combinations, whereas others never produced
superadditive levels despite the full range of stimuli presented. In this highly integrative structure,
approximately 28% of the bimodal neurons showed multisensory integration in the superadditive
range. Thus, within the SC, there was a distribution of bimodal neurons with different functional
ranges. Hypothetically, if this distribution were altered, for example, in favor of low-integrating
bimodal neurons, then it would be expected that the overall SC would exhibit lower levels of mul-
tisensory processing. Because many studies of cortical multisensory processing reveal few exam-
ples of superadditive levels of integration (e.g., Meredith et al. 2006; Clemo et al. 2007; Allman
and Meredith 2007; Meredith and Allman 2009), it seems possible that bimodal cortical neurons
also exhibit functional ranges like those observed in the SC, but do so in different proportions.
Therefore, the present investigation reviewed single-unit recording data derived from several dif-
ferent cortical areas and the SC (as depicted in Figure 4.1) to address the possibility that bimodal
neurons in different parts of the brain might exhibit different integrative properties that occur in
area-specific proportions.

4.2  METHODS
4.2.1  Surgical Procedures
A two-part implantation/recording procedure was used as described in detail in previous reports
(Meredith and Stein 1986; Meredith et al. 2006). First, the animals were anesthetized (pentobarbi-
tal, 40 mg/kg) and their heads were secured in a stereotaxic frame. Sterile techniques were used to
perform a craniotomy that exposed the targeted recording area and a recording well was implanted
over the opening. The scalp was then sutured closed around the implant and routine postoperative
care was provided. Approximately 7 to 10 days elapsed before the recording experiment.

4.2.2  Recording
Recording experiments were initiated by anesthetizing the animal (ketamine, 35 mg/kg, and
acepromazine, 3.5 mg/kg initial dose; with 8 and 1 mg kg−1 h−1 supplements, respectively) and securing
the implant to a supporting bar. A leg vein was cannulated for continuous administration of fluids and
supplemental anesthetics and, to prevent spontaneous movements, a muscle relaxant (pancuronium
bromide, 0.3 mg/kg initial dose; 0.2 mg kg−1 h−1 supplement). The animal was intubated through
the mouth and maintained on a ventilator; expired CO2 was monitored and maintained at ~4.5%.
A  glass-insulated tungsten electrode (impedance <1.0 MΩ) was used for recording. A hydraulic
microdrive was used to advance the electrode and to record the depth of identified neurons. Neuronal
activity was amplified and routed through a counter (for SC recordings) or to a PC for storage and
analysis (for cortical recordings). Neurons were identified by their spontaneous activity and by their
responses to somatosensory (puffs of air through a pipette, brush strokes and taps, manual pres-
sure and joint movement, taps, and stroking by calibrated von Frey hairs), auditory (claps, clicks,
whistles, and hisses), and/or visual (flashed or moving spots or bars of light from a handheld oph-
thalmoscope projected onto the translucent hemisphere, or dark stimuli from a rectangular piece
of black cardboard) search stimuli. Sensory receptive fields were mapped using adequate stimuli in
each modality and were graphically recorded. During recording, the depth of each identified neuron
was noted and tabulated along with its sensory responsivity (e.g., auditory, visual, somatosensory,
bimodal, or trimodal) and level of evoked stimulation activity obtained during quantitative tests
(see below). Multiple recording penetrations were performed in a single experiment and success-
ful recording penetrations were marked with a small electrolytic lesion. At the conclusion of the
experiment, the animal was euthanized and the brain fixed and blocked stereotaxically. Standard
histological techniques were used to stain and mount the tissue. A projecting microscope was used
to trace sections and to reconstruct recording penetrations from the lesion sites.
For selected neurons in each recording area, quantitative tests were conducted to document their
responses to sensory/multisensory stimulation. Electronically gated, repeatable somatosensory,
auditory, and visual stimuli were presented. Somatosensory stimuli were produced by an electroni-
cally driven, modified shaker (Ling, 102A) whose amplitude, velocity, and temporal delay were
independently set to either indent the skin or deflect hairs. Auditory stimulation consisted of a white
noise burst, 100 ms duration, generated by a solenoid-gated air hose (for some SC recordings), or an
electronic waveform played through a hoop-mounted speaker (for all other recordings) positioned
in contralateral auditory space. Visual stimuli were generated by a projector that cast an image of
a light bar through a rotating prism (to determine angle of trajectory) onto a galvanometer-driven
mirror (to control delay, amplitude, and velocity of movement). This image was projected onto a
translucent Plexiglas hemisphere (92 cm diameter) positioned in front of the animal. Visual stimuli
of effective size and luminosity were moved through the visual receptive field at an effective ori-
entation, direction, and speed. These controlled somatosensory, auditory, and visual stimuli were
presented alone and in paired combinations (i.e., visual–auditory, auditory–somatosensory, visual–
somatosensory). An interstimulus interval of 7 to 15 s was used to avoid habituation; each test was
repeated 10 to 25 times.
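For illustration only, a minimal Python sketch of how such a stimulus protocol could be scripted is given below. It is not the software used in these experiments; the condition labels, default repeat count, and function name are assumptions. It simply builds an interleaved trial list in which each unimodal and combined condition is repeated a fixed number of times with a randomized 7 to 15 s interstimulus interval.

import random

def build_trial_schedule(conditions=("V", "A", "VA"), n_repeats=15,
                         isi_range_s=(7.0, 15.0), seed=0):
    """Return a shuffled list of (condition, isi_seconds) tuples."""
    rng = random.Random(seed)
    trials = [cond for cond in conditions for _ in range(n_repeats)]
    rng.shuffle(trials)  # interleave conditions rather than presenting them in blocks
    return [(cond, rng.uniform(*isi_range_s)) for cond in trials]

if __name__ == "__main__":
    for cond, isi in build_trial_schedule()[:5]:
        print(f"condition={cond:>2}  next stimulus in {isi:4.1f} s")
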

4.2.3  Data Analysis


For cortical recordings, neuronal activity was digitized (rate >25 kHz) using Spike2 (Cambridge
Electronic Design) software and sorted by waveform template for analysis. Then, for each test condi-
tion (somatosensory alone, somatosensory–auditory combined, etc.), a peristimulus time histogram
was generated from which the mean spike number per trial (and standard deviation) was calculated.
For the SC recordings, the online spike counter displayed trial-by-trial spike counts for each of the
stimulus conditions, from which these values were recorded and the mean spike number per trial
(and standard deviation) was calculated. A paired, two-tailed t-test was used to statistically compare
the responses to the combined stimuli to that of the most effective single stimulus, and responses
that showed a significant difference (p < .05) were defined as response interactions (Meredith and
Stein 1986, 1996). The magnitude of a response interaction was estimated by the following formula:
(C – M)/M × 100 = %, where C is the response to the combined stimulation, and M is the maxi-
mal response to the unimodal stimulation (according to the criteria of Meredith and Stein 1986).
Summative responses were evaluated by comparing the responses evoked by the combined stimuli
to the sum of the responses elicited by the same stimuli presented separately.
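To make these criteria concrete, the brief Python sketch below computes the same measures from per-trial spike counts: the mean response per condition, a paired two-tailed t-test against the most effective single-modality response, the interaction magnitude (C – M)/M × 100, and the summative (superadditive) comparison. It is an illustration only; the function name, example spike counts, and use of NumPy/SciPy are assumptions and do not reproduce the analysis software actually used.

import numpy as np
from scipy import stats

def multisensory_metrics(spikes_mod1, spikes_mod2, spikes_combined, alpha=0.05):
    """Summarize a multisensory test from per-trial spike counts (equal trial numbers)."""
    m1, m2, c = (np.asarray(x, dtype=float)
                 for x in (spikes_mod1, spikes_mod2, spikes_combined))
    best = m1 if m1.mean() >= m2.mean() else m2        # most effective single-modality response (M)
    t_stat, p_val = stats.ttest_rel(c, best)           # paired, two-tailed t-test
    interaction_pct = (c.mean() - best.mean()) / best.mean() * 100.0   # (C - M)/M x 100
    return {"mean_best_unimodal": best.mean(),
            "mean_combined": c.mean(),
            "interaction_pct": interaction_pct,
            "significant": bool(p_val < alpha),
            "superadditive": bool(c.mean() > m1.mean() + m2.mean())}

# Example with made-up spike counts (25 trials per condition):
rng = np.random.default_rng(1)
print(multisensory_metrics(rng.poisson(6, 25), rng.poisson(4, 25), rng.poisson(9, 25)))
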

4.3  RESULTS
4.3.1  Anterior Ectosylvian Sulcal Cortex
The banks of the anterior ectosylvian sulcus (AES) contain auditory (field of the AES; Clarey and
Irvine 1990), visual (AEV; Olson and Graybiel 1987), and somatosensory (SIV; Clemo and Stein
1983) representations. Numerous studies of this region have identified bimodal neurons (Wallace
et al. 1992; Rauschecker and Korte 1993; Jiang et al. 1994a, 1994b) particularly at the intersection
of the different sensory representations (Meredith 2004; Carriere et al. 2007). The bimodal neu-
rons described in the present study were collected during the recordings reported by Meredith and
Allman (2009).
Neurons were identified in six penetrations in three cats, of which 24% (n = 46/193) were bimodal.
These neurons exhibited suprathreshold responses to independent presentations of auditory and
visual (n = 39), auditory and somatosensory (n = 6), or visual and somatosensory (n = 1) stimuli. A
typical example is illustrated in Figure 4.2, where the presentation of either auditory or visual stim-
uli vigorously activated this neuron. Furthermore, the combination of visual and auditory stimuli
induced an even stronger response representing a significant (p < .05, paired t-test) enhancement
of activity (36%) over that elicited by the most effective stimulus presented alone (see Meredith
and Stein 1986 for criteria). This response increment was representative of bimodal AES neurons
because the population average level of enhancement was 34% (see Figure 4.3). This modest level of
multisensory integration was collectively achieved by neurons of widely different activity levels. As
illustrated in Figure 4.4, responses to separate or combined-modality stimulation ranged from an average
of 1 to 50 spikes/trial [response averages to the weakest (5.1 ± 4.9 standard deviation
(SD)) and best (8.9 ± 7.9 SD) separate stimuli and to combined-modality stimulation (11.7 ± 9.9 SD)
are also shown in Figure 4.3]. However, only a minority (46%; n = 21/46) of bimodal neurons showed
response enhancement to the available stimuli and most showed levels of activity that plotted close
to the line of unity in Figure 4.4. Figure 4.5 shows that the highest levels of enhancement were gen-
erally achieved in those neurons with lower levels of unimodal response activity. Specifically, the
neurons showing >75% response change (average 130%) exhibited responses to unimodal stimuli
that averaged 6.6 spikes/trial. As illustrated in Figure 4.6, however, most (85%; n = 39/46) bimodal
neurons demonstrated response enhancements of <75%. In addition, a few (11%; 5/46) AES bimodal
neurons even showed smaller responses to combined-modality stimulation than to the most effective
unimodal stimulus.
Another measure of multisensory processing is the proportional relationship of the activity
evoked by the combined stimuli to that of the sum of responses to the different separate-modality
stimuli (e.g., King and Palmer 1985). This analysis for bimodal AES neurons is presented in
Figure 4.7, which indicates that fewer neurons (17%; n = 8/46) show superadditive activity com-
pared with those that show statistically significant levels of response enhancement (46%; n =
21/46). Given that bimodal neurons represent only about 25% of the AES neurons (Jiang et al.
1994b; Meredith and Allman 2009), and that multisensory integration occurs in a portion of that
population (17–46%, depending on the criterion for integration), these data suggest that integrated
multisensory signals in response to effective sensory stimuli contribute to a small portion of the
output from the AES.

4.3.2  Posterolateral Lateral Suprasylvian Cortex


The auditory cortices of the middle ectosylvian gyrus are bordered, medially, by the suprasylvian
sulcus whose banks contain the lateral suprasylvian visual areas. Containing a representation of
the contralateral upper visual hemifield, the posterolateral lateral suprasylvian (PLLS) visual area
(Palmer et al. 1978) is bordered, laterally, by the dorsal zone of auditory cortex (Stecker et al. 2005).
Largely along this lateral border, the PLLS contains bimodal visual–auditory neurons whose visual
receptive fields are restricted to eccentric portions of visual space >40° (Allman and Meredith 2007).
The bimodal neurons described in the present study were collected during PLLS recordings reported by
Allman and Meredith (2007).

FIGURE 4.2  For each recording area [(a) anterior ectosylvian sulcal area; (b) posterolateral lateral
suprasylvian area; (c) rostral suprasylvian sulcal area; (d) superior colliculus], individual bimodal neurons
showed responses to both unimodal stimuli presented separately as well as to their combination stimuli, as
illustrated by rasters (1 dot = 1 spike) and histograms (10 ms time bins). Waveforms above each raster/
histogram indicate stimulation condition (square wave labeled “A” = auditory; ramp labeled “V” = visual;
ramp labeled “S” = somatosensory; presented separately or in combination). Bar graphs depict mean (and
standard deviation) of responses to different stimulus conditions; numerical percentage indicates proportional
difference between the most effective unimodal stimulus and the response elicited by stimulus combination
(i.e., integration). Asterisk (*) indicates that response change between these two conditions was statistically
significant (p < .05 paired t-test).

FIGURE 4.3  For each recording area, average response levels (and standard error of the mean [SEM]) for
population of bimodal neurons. Responses to unimodal stimuli were grouped by response level (lowest, best),
not by modality. Percentage (and SEM) indicates proportional change between the best unimodal response
and that elicited by combined stimulation (i.e., integration). In each area, combined response was statistically
greater than that evoked by the most effective unimodal stimulus (p < .05; paired t-test).

FIGURE 4.4  For neural areas sampled, response of a given bimodal neuron to the most effective unimodal
stimulus (x axis) was plotted against its response to stimulus combination (y axis). For the most part, bimodal
neurons in each area showed activity that almost always plotted above line of unity (dashed line).

FIGURE 4.5  For each of recording areas, response of a given bimodal neuron to the most effective unimodal
stimulus (x axis) was plotted against proportional change (interaction) elicited by combined stimuli (y axis).
Most bimodal neurons exhibited interactions > 0, but level of interaction generally decreased with increasing
levels of spiking activity.

A total of 520 neurons were identified in eight penetrations in three cats, of which 9% (n = 49/520)
were visual–auditory bimodal. A typical example is illustrated in Figure 4.2, where the presentation
of either auditory or visual stimuli vigorously activated the neuron. In addition, when the same visual
and auditory stimuli were combined, an even stronger response was evoked. The combined response
represented a significant (p < .05, paired t-test) enhancement of activity (39%) over that elicited by
the most effective stimulus presented alone (see Meredith and Stein 1986 for criteria). This response
increment was slightly larger than the average magnitude of integration (24%) seen in the population
of bimodal PLLS neurons [response averages to the weakest (4.7 ± 5.4 SD) and best (7.1 ± 6.8 SD)
separate stimuli and to combined-modality stimulation (8.8 ± 8.8 SD) are shown in Figure 4.3]. This
modest response increment was generated by neurons of widely different activity levels. As illustrated
in Figure 4.4, PLLS responses to separate or combined-modality stimulation produced between 1 and
50 mean spikes/trial. However, only a minority (39%; n = 19/49) of bimodal neurons showed signifi-
cant response enhancement to the available stimuli and most showed levels of activity that plotted
close to the line of unity in Figure 4.4. Figure 4.5 shows that levels of response interaction were gener-
ally the same across activity levels. Furthermore, all PLLS interaction magnitudes represented <75%
change, as also depicted in Figure 4.6. A few (16%; 8/49) PLLS bimodal neurons even showed smaller
responses to the combined stimuli than elicited by the most effective unimodal stimulus.
Analysis of the proportional change in bimodal PLLS neurons resulting from combined-modality
stimulation revealed that even fewer neurons (10%; n = 5/49) achieved superadditive levels of activity
than statistically significant levels of response enhancement (39%; n = 19/49). Given that bimodal
neurons represent only about 25% of the PLLS neurons (Allman and Meredith 2007), and that mul-
tisensory integration occurs in a portion of that population (10–39%, depending on the criterion for
integration), these data suggest that integrated multisensory signals in response to effective sensory
stimuli contribute to a small portion of the output from the PLLS.

FIGURE 4.6  For each recording area (AES, PLLS, RSS, SC), many bimodal neurons showed low levels of
interaction (–25% to 25%). However, only AES and SC exhibited integrated levels in excess of 175%.

FIGURE 4.7  Multisensory interactions in bimodal neurons can be evaluated by statistical (paired t-test
between best unimodal and combined responses) or by summative (combined response exceeds sum of both
unimodal responses) methods. For each area (AES, PLLS, RSS, SC), fewer combined responses met these
criteria using summative rather than statistical methods. However, only in SC was integration (by either
method) achieved by >50% of neurons.

4.3.3  Rostral Suprasylvian Sulcal Cortex


As described by Clemo et al. (2007), extracellular recordings were made in three cats in which
recording penetrations (n = 27) covered the anterior–posterior extent and depth of the lateral
bank of rostral suprasylvian sulcus (RSS; see Figure 4.1 for location). A total of 946 neurons were
recorded, of which 24% were identified as bimodal: either auditory–somatosensory neurons (20%;
n = 193/946) or audio–visual neurons (4%; n = 35/946). Of these, 86 were tested quantitatively for
responses to separate and combined-modality stimulation, of which a representative example is pro-
vided in Figure 4.2. This neuron showed a reliable response to the auditory stimulus, and a vigorous
response to the somatosensory stimulus. When the two stimuli were combined, a vigorous response
was also elicited but did not significantly differ from that of the most effective (somatosensory)
stimulus presented alone. In addition, nearly 20% (18/97) of the neurons showed smaller responses
to the combined stimuli than to the most effective single-modality stimulus. This low level of mul-
tisensory integration was surprising, although not unusual in the RSS. In fact, the majority (66%;
64/97) of RSS bimodal neurons failed to show a significant response interaction to combined stimu-
lation. This effect is evident in the averaged responses of the RSS population, which achieved an
average 37% response increase (see Figure 4.3). Also evident from this figure are the comparatively
low levels of response evoked by stimuli presented separately (least effective, 1.67 ± 1.2 SD; most
effective, 2.8 ± 2.2 SD average spikes/trial) or together (3.6 ± 2.9 SD average spikes/trial). These low
response levels are also apparent in Figure 4.4, where responses to best and combined stimulation
are plotted for each neuron and, under no condition, was activity measured >20 spikes/trial. This
low level of activity may underlie the strong inverse relationship between effectiveness and inter-
active level, shown in Figure 4.5, because the neurons with the lowest unimodal response values
also showed the highest proportional gains. In fact, all of the neurons that showed >75% response
change had an average response to the most effective unimodal stimulus of only 0.89 ± 0.5 spikes/
trial. Therefore, the appearance of large proportional changes in these low-activity neurons may
be the result of comparisons among low values. With that in mind, the proportion of RSS neurons
showing response changes that were more than summative may be artificially large. As shown in
Figure 4.7, the proportion of RSS bimodal neurons with significant (34%) or more than summative
(20%) changes represented only a third of the sample or less. Given that only 24% of the RSS was
identified as bimodal, the small amount of multisensory integration produced by less than one third
of participating neurons would indicate that integrated multisensory signals are not a robust indica-
tor of this cortical region.

4.3.4  Superior Colliculus


The bimodal SC neurons described in the present study were collected from recordings reported
by Meredith and Stein (1983, 1985). A total of 81 bimodal neurons that met acceptance criteria (see
Methods) were identified from recordings in 20 cats. Of these SC neurons, 62% (n = 50/81)
were visual–auditory, 16% (n = 13/81) were visual–somatosensory, 10% (n = 8/81) were auditory–
somatosensory, and 12% (n = 10/81) were trimodal; these proportions were similar to those reported
earlier (Meredith and Stein 1986). A typical example of a bimodal SC neuron is illustrated in Figure
4.2, where the presentation of either auditory or visual stimuli activated the neuron. When the same
visual and auditory stimuli were combined, however, a significantly (p < .05 paired t-test) stron-
ger response was evoked. This response to the combined stimulation represented a multisensory
enhancement of activity of >300%. Most (77%; n = 62/81) bimodal SC neurons showed significant
response enhancement, averaging a magnitude of 88% for the overall population [response aver-
ages to the weakest (5.9 ± 6.7 SD) and best (10.9 ± 10.4 SD) separate stimuli and to combined-
modality stimulation (17.4 ± 13.5 SD) are shown in Figure 4.3]. As depicted in Figure 4.4, response
enhancement was generated by neurons of widely different activity levels, ranging from 1 to 40
mean spikes/trial. However, Figure 4.5 shows that levels of response enhancement tended to be
larger for responses with lower levels of activity. Given the levels of enhancement achieved by such
a large proportion of SC bimodal neurons, it did not seem surprising that >48% of neurons showed
enhancement levels in excess of a 75% change (see Figure 4.6). In contrast, few SC neurons (3%;
3/97) produced combined responses that were lower than that elicited by the most effective single-
modality stimulus.
Analysis of the proportional change in bimodal SC neurons resulting from combined-modality
stimulation revealed that a majority (56%; n = 45/81) achieved superadditive levels of activity; a
large majority also demonstrated statistically significant levels of response enhancement (76%; n =
62/81). Given that bimodal neurons represent a majority of neurons in the deep layers of the SC
(63%; Wallace and Stein 1997), and that significant levels of multisensory response enhancement
are achieved in more than three-fourths of those, these data suggest that integrated multisensory
signals are a robust component of sensory signals in the SC.

4.4  DISCUSSION
4.4.1  Bimodal Neurons with Different Integrative Properties
Bimodal neurons clearly differ from one another (Perrault et al. 2005). In the SC, some bimodal
neurons are highly integrative and exhibit integrated, superadditive responses to a variety of stimu-
lus combinations, whereas others never produce superadditive levels in spite of the full range of
stimuli presented. Thus, different bimodal neurons exhibit different functional ranges. The ques-
tion of whether bimodal neurons elsewhere in the brain might also exhibit integrative differences
was examined in the present study. Bimodal neurons in the AES, PLLS, and RSS were tested for
their responses to combined-modality stimuli that revealed that some cortical neurons generated
multisensory integrated responses whereas others did not. It should be pointed out that the present
study did not make an exhaustive characterization of the integrative capacity of each neuron (as
done by Perrault et al. 2005). However, the present sampling methods appear to have overestimated
(not underestimated) the proportion of integrative neurons because 56% of the SC sample showed
superadditive response levels, whereas fewer (28%) were identified using more intensive methods
(Perrault et al. 2005). Regardless of these testing differences, these combined studies indicate that
bimodal neurons from across the brain are a diverse group.

4.4.2  Bimodal Neurons in SC and Cortex Differ


The SC is well known for its highly integrative neurons, with examples of multisensory response
enhancement in excess of 1200% (Meredith and Stein 1986). The present sample of bimodal SC
neurons (derived from Meredith and Stein 1983, 1985) showed a range of –11% to 918% change
(average 88%) with most (55%; 45/81) neurons showing superadditive responses. In contrast, corti-
cal bimodal neurons (AES, PLLS, and RSS) generated a consistently lower range of integration
(–62% to 212%; 33% overall average). In fact, only a minority (39%; 75/192) of cortical bimodal neurons
exhibited significant multisensory response changes and only 17% (33/192) produced superadditive
response levels. As a group, the average level of response interaction was only 17% change from
the best unimodal response. In addition, instances where the combined response was less than the
maximal unimodal response occurred in 16% of cortical bimodal neurons, but only in 3% of the SC
neurons (no such examples were observed in SC by Perrault et al. 2005). Clearly, bimodal neurons
in the cortex integrate multisensory information differently from those in the SC.

4.4.3  Bimodal Neurons in Different Cortical Areas Differ


Bimodal neurons in different cortical areas also exhibit different capacities for multisensory inte-
gration. Proportionally more bimodal AES neurons showed significant response interactions (46%;
21/46) and higher levels of integration (34% average) than those in the RSS (34%; 33/97 showed
significant response change; 24% average). Furthermore, bimodal neurons in these regions showed
significantly different (p < .01 t-test) spike counts in response to adequate separate and combined-
modality stimuli. AES neurons averaged 8.9 ± 7.9 SD spikes/trial in response to the most effective
separate-modality stimulus, and 11.7 ± 9.9 SD spikes/trial to the combined stimuli. In contrast,
RSS neurons averaged 2.8 ± 2.2 SD spikes/trial in response to the most effective separate-modality
stimulus, and 3.6 ± 2.9 SD spikes/trial to the combined stimuli. In addition, nearly 20% of RSS
neurons showed combined responses that were less than the maximal unimodal responses, com-
pared with 11% of AES bimodal neurons. Thus, by a variety of activity measures, the multisensory
processing capacity is clearly different for bimodal neurons in different cortical areas. Measures of
multisensory processing in bimodal PLLS neurons appear to fall between those obtained for AES
and RSS.

4.4.4  Population Contribution to Areal Multisensory Function


The present results indicate that the range of multisensory integration is different for bimodal
neurons in different neural areas. Therefore, it should be expected that the performance of dif-
ferent areas will differ under the same multisensory conditions. As illustrated in the left panel
of Figure 4.8, some areas contain relatively few bimodal neurons, and those that are present are
generally poor multisensory integrators (e.g., those observed in the RSS). In contrast, other areas
(e.g., the SC) contain a high proportion of bimodal neurons of which many are strong integrators
(right panel Figure 4.8). Furthermore, the data suggest that areas of intermediate multisensory
properties also occur (e.g., AES), as schematized by the intermingled low- and high-integrators
in the center panel of Figure 4.8. Under these conditions, it is likely that a given multisensory
stimulus will simultaneously elicit widely different multisensory responses and levels of inte-
gration in these different areas. Furthermore, although the cat SC contains ~63% bimodal (and
trimodal) neurons (Wallace and Stein 1997), most cortical areas exhibit bimodal populations of
only between 25% and 30% (Rauschecker and Korte 1993; Jiang et al. 1994a, 1994b; Carriere et al. 2007;
Meredith et al. 2006; Clemo et al. 2007; Allman and Meredith 2007). Therefore, from an areal level, the
comparatively weak multisensory signal from a cortical area is likely to be further diluted by the fact that
only a small proportion of bimodal neurons contribute to that signal. It should also be pointed out that
many cortical areas have now been demonstrated to contain subthreshold multisensory (also termed
“modulatory”) neurons. These neurons are activated by inputs from only one modality, but that response
can be subtly modulated by influences from another to show modest (but statistically significant) levels of
multisensory interaction (Dehner et al. 2004; Meredith et al. 2006; Carriere et al. 2007; Allman and
Meredith 2007; Meredith and Allman 2009). Collectively, these observations suggest that cortical
multisensory activity is characterized by comparatively low levels of integration. In the context of the
behavioral/perceptual role of cortex, these modest integrative levels may be appropriate. For example,
when combining visual and auditory inputs to facilitate speech perception (e.g., the cocktail party effect),
it is difficult to imagine how accurate perception would be maintained if every neuron showed a response
change in excess of 1200%. On the other hand, for behaviors in which survival is involved (e.g., detection),
multisensory interactions >1200% would clearly provide an adaptive advantage.

FIGURE 4.8  Bimodal neurons with different functional modes, when distributed in different proportions,
underlie regions exhibiting different multisensory properties. Each panel shows same array of neurons, except
that proportions of unisensory (white), low-integrator (gray), and high-integrator (black) multisensory neurons
are different. Areas in which low-integrator neurons predominate show low overall levels of multisensory
integration (left), whereas those with a large proportion of high-integrators (right) exhibit high levels of
multisensory integration. Intermediate proportions of low- and high-integrators collectively generate
intermediate levels of multisensory integration at areal level. Ultimately, these arrangements may underlie a
range of multisensory processes that occur along a continuum from one extreme (no integration, not depicted)
to the other (high integration).
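A simple numerical illustration of this argument (and of the scheme in Figure 4.8) is sketched below in Python. The proportions and enhancement ranges are loose assumptions based on values reported in this chapter, not a model used in the original studies; the point is only that the same two classes of bimodal integrator, mixed in different proportions, yield very different area-wide levels of enhancement.

import numpy as np

def areal_enhancement(n_neurons, frac_bimodal, frac_high, rng,
                      low_gain=(0.0, 0.40), high_gain=(0.75, 3.0)):
    """Mean proportional enhancement (in %) across all neurons of a simulated area.

    frac_bimodal -- fraction of neurons that are bimodal at all
    frac_high    -- fraction of those bimodal neurons that are high-integrators
    low_/high_gain -- assumed ranges of proportional enhancement (0.40 = +40%)
    """
    n_bimodal = int(n_neurons * frac_bimodal)
    n_high = int(n_bimodal * frac_high)
    gains = np.zeros(n_neurons)                   # unisensory neurons contribute no enhancement
    gains[:n_high] = rng.uniform(*high_gain, n_high)
    gains[n_high:n_bimodal] = rng.uniform(*low_gain, n_bimodal - n_high)
    return gains.mean() * 100.0

rng = np.random.default_rng(0)
# Cortex-like mix (~25% bimodal, few high-integrators) vs. SC-like mix (~63% bimodal, many).
print(f"cortex-like area: {areal_enhancement(1000, 0.25, 0.15, rng):5.1f}% mean enhancement")
print(f"SC-like area:     {areal_enhancement(1000, 0.63, 0.55, rng):5.1f}% mean enhancement")
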

4.4.5  Methodological Considerations


Several methodological considerations should be appreciated for these results to have their proper
context. The results were obtained from cats under essentially the same experimental conditions
(single-unit recording under ketamine anesthesia). Data collection for all cortical values was car-
ried out using the same paradigms and equipment. Although the experimental design was the
same, the SC data were obtained before the incorporation of computers into experimental meth-
ods. Consequently, the different sensory trials were not interleaved but taken in sequential blocks,
usually with fewer repetitions (n = 10–16). This is important because the number of trials has
recently been demonstrated to be a key factor in determining statistical significance among mul-
tisensory interactions (Allman et al. 2009), where the larger number of trials was correlated with
more neurons meeting statistical criterion. However, the SC recordings revealed a higher proportion
of significantly affected neurons than in the cortex, despite these statistical measures being based
on fewer trials for SC neurons (10–16 trials) than for the cortical neurons (25 trials). All cortical
sensory tests were conducted in essentially the same manner: adequate (not minimal or maximal)
stimuli from each modality were used and they were not systematically manipulated to maximize
their integrative product. For this reason, only SC data taken before the spatial and temporal para-
metric investigations (e.g., Meredith and Stein 1986; Meredith et al. 1987) were included in the
present comparative study.
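The influence of trial number on statistical outcome can be appreciated with a simple simulation, sketched below in Python. The firing rates, effect size, and Poisson spiking assumption are illustrative choices, not values taken from the recordings; the sketch only shows that, for the same modest enhancement, a paired t-test reaches significance far more often with 25 trials per condition than with 10 to 16.

import numpy as np
from scipy import stats

def significance_rate(n_trials, base_rate=6.0, enhancement=0.25,
                      n_sim=2000, alpha=0.05, seed=0):
    """Fraction of simulated neurons whose +25% combined response reaches p < alpha."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_sim):
        best = rng.poisson(base_rate, n_trials)                            # best unimodal response
        combined = rng.poisson(base_rate * (1.0 + enhancement), n_trials)  # combined response
        if stats.ttest_rel(combined, best).pvalue < alpha:
            hits += 1
    return hits / n_sim

for n in (10, 16, 25):
    print(f"{n:2d} trials/condition: significant in {significance_rate(n):.0%} of simulated neurons")
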
The present results are based completely on comparisons of spike counts in response to sin-
gle- and combined-modality stimulation. It is also possible (indeed likely) that other response
measures, such as temporal pattern or information content, may provide reliable indicators of
these different effects. In addition, each of these experiments used an anesthetized prepara-
tion and it would be expected that effects such as alertness and attention would have an influ-
ence on neuronal properties. However, the anesthetic regimen was the same for each of the
experiments and the comparisons were made with respect to relative changes within the data
sample. Furthermore, it would seem counterintuitive that response subtleties among bimodal
neurons would be observable under anesthesia but not in alert animals. However, these issues
await empirical evaluation.
In an effort to identify cortical areas capable of multisensory processing in humans, studies
using noninvasive technologies have adopted the principles of multisensory integration deter-
mined at the level of the bimodal neuron in the SC into the criteria by which computational,
perceptual, and cognitive multisensory effects could be measured and defined. For example,
the metric of superadditivity has been used in neuroimaging studies in a conservative effort
to avoid “false positives” while identifying sites of multisensory integration within the cortex
(see Laurienti et al. 2005 for review). Based on the multisensory characteristics of SC neurons
(Perrault et al. 2005), however, Laurienti and colleagues cautioned that multisensory stimuli
would not likely generate superadditive responses in the blood oxygenation level–dependent sig-
nal as measured by functional magnetic resonance imaging (Laurienti et al. 2005). The results
of the present study further support this caution because proportionally fewer cortical neurons
reveal superadditive responses than SC neurons (Figure 4.7), and the magnitude of response
enhancement is considerably smaller in the cortex (Figure 4.6). On the other hand, given the
tenuous relationship between single neuron discharge activity (i.e., action potentials) and brain
hemodynamics underlying changes in the blood oxygenation level–dependent signal (Logothetis
et al. 2001; Laurienti et al. 2005; Sirotin and Das 2009; Leopold 2009), it remains debatable
whether effects identified in single-unit electrophysiological studies are appropriate to charac-
terize/define multisensory processing in neuroimaging studies in the first place. How this issue
is resolved, however, does not change the fact that electrophysiological measures of multisensory
processing at the neuronal level reveal differences among bimodal neurons from different brain
regions.

4.5  CONCLUSIONS
Bimodal neurons are known to differ functionally within the same structure, the SC. The present
study shows that this variation also occurs within the cortex. Ultimately, by varying the propor-
tional representation of the different types of bimodal neurons (defined by functional ranges), dif-
ferent neural areas can exhibit different levels of multisensory integration in response to the same
multisensory stimulus.

ACKNOWLEDGMENTS
Collection of superior colliculus data was supported by grants NS019065 (to B.E. Stein) and NS06838
(to M.A. Meredith); that of cortical data was supported by grant NS039460 (to M.A. Meredith).

REFERENCES
Allman, B.L., and M.A. Meredith. 2007. Multisensory processing in ‘unimodal’ neurons: Cross-modal sub-
threshold auditory effects in cat extrastriate visual cortex. Journal of Neurophysiology 98:545–549.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2009. Not just for bimodal neurons anymore: The contribu-
tion of unimodal neurons to cortical multisensory processing. Brain Topography 21:157–167.
Carriere, B.N., D.W. Royal, T.J. Perrault, S.P. Morrison, J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2007.
Visual deprivation alters the development of cortical multisensory integration. Journal of Neurophysiology
98:2858–2867.
Clarey, J.C., and D.R.F. Irvine. 1990. The anterior ectosylvian sulcal auditory field in the cat: I. An electro-
physiological study of its relationship to surrounding auditory cortical fields. Journal of Comparative
Neurology 301:289–303.
Clemo, H.R., B.L. Allman, M.A. Donlan, and M.A. Meredith. 2007. Sensory and multisensory representations
within the cat rostral suprasylvian cortices. Journal of Comparative Neurology 503:110–127.
Clemo, H.R., and B.E. Stein. 1983. Organization of a fourth somatosensory area of cortex in cat. Journal of
Neurophysiology 50:910–925.
Dehner, L.R., L.P. Keniston, H.R. Clemo, and M.A. Meredith. 2004. Cross-modal circuitry between auditory
and somatosensory areas of the cat anterior ectosylvian sulcal cortex: A ‘new’ inhibitory form of multi-
sensory convergence. Cerebral Cortex 14:387–403.
Horn, G., and R.M. Hill. 1966. Responsiveness to sensory stimulation of units in the superior colliculus and
subjacent tectotegmental regions of the rabbit. Experimental Neurology 14:199–223.
Jiang, H., F. Lepore, M. Ptito, and J.P. Guillemot. 1994a. Sensory interactions in the anterior ectosylvian cortex
of cats. Experimental Brain Research 101:385–396.
Jiang, H., F. Lepore, M. Ptito, and J.P. Guillemot. 1994b. Sensory modality distribution in the anterior ectosyl-
vian cortex (AEC) of cats. Experimental Brain Research 97:404–414.
King, A.J., and A.R. Palmer. 1985. Integration of visual and auditory information in bimodal neurones in the
guinea-pig superior colliculus. Experimental Brain Research 60:492–500.
Laurienti, P.J., T.J. Perrault, T.R. Stanford, M.T. Wallace, and B.E. Stein. 2005. On the use of superadditivity
as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental
Brain Research 166:289–297.
Leopold, D.A. 2009. Neuroscience: Pre-emptive blood flow. Nature 457:387–388.
Logothetis, N.K., J. Pauls, M. Augath, T. Trinath, and A. Oeltermann. 2001. Neurophysiological investigation
of the basis of the fMRI signal. Nature 412:150–157.
Meredith, M.A. 2004. Cortico-cortical connectivity and the architecture of cross-modal circuits. In Handbook of
Multisensory Processes, eds. C. Spence, G. Calvert, and B. Stein, 343–355. Cambridge, MA: MIT Press.
Meredith, M.A., and B.L. Allman. 2009. Subthreshold multisensory processing in cat auditory cortex.
Neuroreport 20:126–131.
Meredith, M.A., and B.E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221:389–391.
Meredith, M.A., and B.E. Stein. 1985. Descending efferents of the superior colliculus relay integrated multi-
sensory information. Science 227:657–659.
Meredith, M.A., and B.E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in the superior
colliculus results in multisensory integration. Journal of Neurophysiology 56:640–662.
Meredith, M.A., and B.E. Stein. 1996. Spatial determinants of multisensory integration in cat superior collicu-
lus neurons. Journal of Neurophysiology. 75:1843–1857.
Meredith, M.A., L.R. Keniston, L.R. Dehner, and H.R. Clemo. 2006. Cross-modal projections from somatosen-
sory area SIV to the auditory field of the anterior ectosylvian sulcus (FAES) in cat: Further evidence for
subthreshold forms of multisensory processing. Experimental Brain Research 172:472–484.
Meredith, M.A., J.W. Nemitz, and B.E. Stein. 1987. Determinants of multisensory integration in superior col-
liculus neurons: I. Temporal factors. Journal of Neuroscience 7:3215–3229.
Olson, C.R., and A.M. Graybiel. 1987. Ectosylvian visual area of the cat: Location, retinotopic organization,
and connections. Journal of Comparative Neurology 261:277–294.
Palmer, L.A., A.C. Rosenquist, and R.J. Tusa. 1978. The retinotopic organization of lateral suprasylvian visual
areas in the cat. Journal of Comparative Neurology 177:237–256.
Perrault, T.J., J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2005. Superior colliculus neurons use distinct opera-
tional modes in the integration of multisensory stimuli. Journal of Neurophysiology 93:2575–2586.
Rauschecker, J.P., and M. Korte. 1993. Auditory compensation for early blindness in cat cerebral cortex.
Journal of Neuroscience 13:4538–4548.
Sirotin, Y.B., and A. Das. 2009. Anticipatory haemodynamic signals in sensory cortex not predicted by local
neuronal activity. Nature 457:475–479.
Stecker, G.C., I.A. Harrington, E.A. MacPherson, and J.C. Middlebrooks. 2005. Spatial sensitivity in the dorsal
zone (area DZ) of cat auditory cortex. Journal of Neurophysiology 94:1267–1280.
Stein, B.E., and M.A. Meredith. 1993. Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., M.A. Meredith, W.S. Huneycutt, and L. McDade. 1989. Behavioral indices of multisensory inte-
gration: Orientation to visual cues is affected by auditory stimuli. Journal of Cognitive Neuroscience
1:12–24.
Sugihara, T., M.D. Diltz, B.B. Averbeck, and L.M. Romanski. 2006. Integration of auditory and visual
communication information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience
26:11138–11147.
Wallace, M.T., and B.E. Stein. 1997. Development of multisensory neurons and multisensory integration in cat
superior colliculus. Journal of Neuroscience 17:2429–2444.
Wallace, M.T., M.A. Meredith, and B.E. Stein. 1992. Integration of multiple sensory inputs in cat cortex.
Experimental Brain Research 91:484–488.
5 Audiovisual Integration
in Nonhuman Primates
A Window into the Anatomy
and Physiology of Cognition
Yoshinao Kajikawa, Arnaud Falchier, Gabriella Musacchia,
Peter Lakatos, and Charles E. Schroeder

CONTENTS
5.1 Behavioral Capacities..............................................................................................................66
5.1.1 Recognition..................................................................................................................66
5.1.2 Fusion and Illusions.....................................................................................................66
5.1.3 Perception.................................................................................................................... 67
5.2 Neuroanatomical and Neurophysiological Substrates.............................................................68
5.2.1 Prefrontal Cortex......................................................................................................... 69
5.2.2 Posterior Parietal Cortex............................................................................................. 71
5.2.3 STP Area..................................................................................................................... 72
5.2.4 MTL Regions............................................................................................................... 73
5.2.5 Auditory Cortex........................................................................................................... 74
5.2.6 Visual Cortex............................................................................................................... 75
5.2.7 Subcortical Regions..................................................................................................... 76
5.3 Functional Significance of Multisensory Interactions............................................................. 77
5.3.1 Influences on Unimodal Perception............................................................................. 77
5.3.1.1 Influence on Temporal Dynamics of Visual Processing............................... 77
5.3.1.2 Sound Localization....................................................................................... 78
5.3.2 AV Recognition........................................................................................................... 79
5.4 Principles of Multisensory Interaction.................................................................................... 79
5.4.1 Inverse Effectiveness................................................................................................... 80
5.4.2 Temporal Contiguity....................................................................................................80
5.4.3 Spatial Contiguity........................................................................................................ 81
5.5 Mechanisms and Dynamics of Multisensory Interaction........................................................ 82
5.5.1 Phase Reset: Mechanisms............................................................................................ 82
5.5.2 Phase Reset: Dependence on Types of Stimuli........................................................... 83
5.6 Importance of Salience in Low-Level Multisensory Interactions........................................... 83
5.6.1 Role of (Top-Down) Attention.....................................................................................84
5.6.2 Attention or Saliency of Stimuli.................................................................................. 85
5.7 Conclusions, Unresolved Issues, and Questions for Future Studies........................................ 85
5.7.1 Complex AV Interactions............................................................................................. 85
5.7.2 Anatomical Substrates of AV Interaction.................................................................... 85
5.7.3 Implication of Motor Systems in Modulation of Reaction Time................................. 85
5.7.4 Facilitation or Information?......................................................................................... 86


5.7.5 Inverse Effectiveness and Temporal Interaction.......................................................... 86


5.7.6 What Drives and What Is Driven by Oscillations?...................................................... 86
5.7.7 Role of Attention.......................................................................................................... 86
Acknowledgment.............................................................................................................................. 87
References......................................................................................................................................... 87

5.1  BEHAVIORAL CAPACITIES


Humans can associate a sound with its visual source, where it comes from, how it is produced, and
what it means. This association, or audiovisual (AV) integration, also occurs in many nonhuman
primate species, and may be used in kin recognition, localization, and social interaction, among
other things (Cheney and Seyfarth 1990; Ghazanfar and Santos 2004). These abilities suggest that
nonhuman primates integrate sight and sound as humans do: through recognition of AV vocaliza-
tions and enhanced perception of audiovisual stimuli.

5.1.1  Recognition
One of the most ubiquitous AV functions in everyday human life is recognizing and matching the
sight and sounds of other familiar humans. Nonhuman primates can also recognize the sight and
sound of a familiar object and can express this association behaviorally. Primates reliably associate
coincident auditory and visual signals of conspecific vocalizations (Evans et al. 2005; Ghazanfar
and Logothetis 2003; Jordan et al. 2005; Sliwa et al. 2009) and can match pictures to vocal sounds
of both conspecifics and familiar humans (Izumi and Kojima 2004; Kojima et al. 2003; Martinez
and Matsuzawa 2009). Monkeys can also identify a picture in which the number of individuals
matches the number of vocal sounds (Jordan et al. 2005). Although it appears that primates rec-
ognize the AV components of a talking face much better when the individual is socially familiar,
familiarity does not appear to be a critical component of audiovisual recognition; many of the
studies cited above showed that primates can correctly match AV vocalizations from other primate
species (Martinez and Matsuzawa 2009; Zangenehpour et al. 2009). Facial movement, on the other
hand, appears to be a key component for nonhuman primates in recognizing the vocal behavior of
others. When matching a visual stimulus to a vocalization, primates correctly categorized a still
face as a mismatch (Izumi and Kojima 2004; Evans et al. 2005; Ghazanfar and Logothetis 2003)
and performed poorly when only the back view was presented (Martinez and Matsuzawa 2009).
AV matching by monkeys is not limited to facial recognition. Ghazanfar et al. (2002) showed
that a rising-intensity sound attracted a monkey’s attention to a similar degree as a looming visual
object (Schiff et al. 1962). These auditory and visual signals are signatures of an approaching object.
Monkeys preferentially look at the corresponding looming rather than receding visual signal when
presented with a looming sound. This was not the case when the monkey was presented with either
a receding sound or white noise control stimulus with an amplitude envelope matching that of the
looming sound (Maier et al. 2004). Therefore, monkeys presumably form unified representations of single events by associating sound and visual attributes, at least for signals of approaching objects.
Taken together, these data indicate that the dynamic structure of the visual stimulus and the compatibility between the two modalities are vital for AV recognition in primates, and they suggest a common mechanistic nature across primate species.

5.1.2  Fusion and Illusions


For humans, one of the most striking aspects of AV integration is that synchronous auditory and
visual speech stimuli seem fused together, and illusions relating to this phenomenon may arise. The
McGurk illusion is a case of this sort. When a mismatch between certain auditory and visual syl-
lables occurs (e.g., an auditory “ba” with a visual “ga”), humans often perceive a synthesis of those
syllables, mostly “da” (McGurk and MacDonald 1976). The illusion persists even when the listener
is aware of the mismatch, which indicates that visual articulations are automatically integrated into
speech perception (Green et al. 1991; Soto-Faraco and Alsius 2009).
Vatakis et al. (2008) examined whether auditory and visual components of monkey vocalizations
elicited a fused perception in humans. It is well known that people are less sensitive to temporal
asynchrony when auditory and visual components of speech are matched compared to a mismatched
condition (called the “unity effect”). Capitalizing on this phenomenon, Vatakis and colleagues used
a temporal order judgment task with matched and mismatched sounds and movies of monkey vocal-
izations across a range of stimulus onset asynchronies (SOA). The unity effect was observed for
human speech vocalization, but was not observed when people observed monkey vocalizations.
The authors also showed negative results for human vocalizations mimicking monkey vocaliza-
tions, suggesting that the fusion of face–voice components is limited to human speech for humans.
This may be because monkey vocal repertoires are much more limited than those of humans and show greater dissimilarity between facial expressive components and sound (Chakladar et al. 2008; Partan 2002).
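Sensitivity to asynchrony in temporal order judgment tasks of this kind is typically summarized by fitting a psychometric function to the proportion of "vision first" responses across SOAs and reading off the just noticeable difference (JND); the unity effect then appears as a larger JND (poorer temporal sensitivity) for matched face–voice pairs. The following is a minimal sketch of such a fit, assuming a cumulative Gaussian model; the SOA values, response proportions, and JND convention below are hypothetical, for illustration only.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Hypothetical data: positive SOA means the visual stream leads the auditory stream (ms).
soa_ms = np.array([-300.0, -200.0, -100.0, 0.0, 100.0, 200.0, 300.0])
p_vision_first = np.array([0.05, 0.15, 0.35, 0.55, 0.75, 0.90, 0.97])

def psychometric(soa, pss, sigma):
    # Probability of reporting "vision first" as a function of SOA.
    return norm.cdf(soa, loc=pss, scale=sigma)

(pss, sigma), _ = curve_fit(psychometric, soa_ms, p_vision_first, p0=(0.0, 100.0))
jnd = sigma * norm.ppf(0.75)  # one common convention: SOA change from 50% to 75% "vision first"
print(f"PSS = {pss:.1f} ms, JND = {jnd:.1f} ms")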
Another famous AV illusion, called the "ventriloquist effect," also appears to have a counterpart in nonhuman primate perception. The effect is such that, under the right conditions, a sound may be perceived as originating from a visual location despite a spatial disparity. After training a monkey to identify the location of a sound source, Recanzone's group introduced a 20 to 60 min period of spatially disparate auditory (tones) and visual (dots) stimuli (Woods and Recanzone 2004). The consequence of this manipulation appeared in the sound lateralization task as a deviation of the "auditory center spot" in the direction of the location of the sound relative to the visual fixation spot during the prior task. The underlying neural mechanism of this effect may be similar to the realignment
of visual and auditory spatial maps after adapting to an optical prism displacing the visual space
(Cui et al. 2008; Knudsen and Knudsen 1989).
What about perception of multisensory moving objects? Preferential looking at looming sound
and visual signal suggests that monkeys associate sound and visual attributes of approaching objects
(Maier et al. 2004). However, longer looking does not necessarily imply fused perception; it may instead reflect attentional attraction to moving stimuli after their congruency has been assessed. Fused perception of looming AV signals is supported by human studies showing a redundant signal effect (see Section 5.1.3 for more details) in reaction time (shorter reaction times to congruent looming AV signals) under conditions of bimodal attention (Cappe et al. 2010; see also Romei et al. 2009 for data suggesting preattentive effects of looming auditory signals). Interestingly, for such an
AV looming effect to happen, the spectrum of the sound has to be dynamically structured along
with sound intensity. It is not known which other attributes of a visual stimulus, other than motion,
could contribute to this effect. It is likely that auditory and visual stimuli must be related, not only
in spatial and temporal terms, but also in dynamic spectral dimensions in both modalities in order
for an attentional bias or performance enhancement to appear.

5.1.3  Perception
Visual influences on auditory perception, and vice versa, are well established in humans (Sumby and Pollack 1954; Raab 1962; Miller 1982; Welch and Warren 1986; Sams et al. 1991; Giard and Peronnet 1999; for review, see Calvert 2001; Stein and Meredith 1993) and have been examined in
several studies on nonhuman primates (described below). By using simple auditory and visual stim-
uli, such as tones and dots, the following studies show that auditory and visual information interact
with each other to modulate perception in monkeys.
Barone's group trained monkeys to make a saccade to a visual target that began to flash at the moment the fixation point disappeared (Wang et al. 2008). In half of the trials, the visual target was presented with a brief task-irrelevant noise. The result was faster saccadic reaction times when the visual target was accompanied by a sound than when it was not. Frens and Van Opstal (1998) also
studied the influence of auditory stimulation on saccadic responses in monkeys performing tasks
similar to that of Wang et al. (2008). They showed not only a shortening of reaction time, but also
that reaction time depended on the magnitude of the spatial and temporal shift between visual and
auditory stimuli; smaller distance and closer timing yielded shorter reaction times. These results
demonstrated a temporal effect of sound on visual localization. These results are compatible with
human psychophysical studies of AV integration (Frens et al. 1995; Diederich and Colonius 2004;
Perrott et al. 1990) and suggest that the underlying mechanism may be common to both human and
nonhuman primates.
Like humans, monkeys have also been shown to have shorter manual reaction times to bimodal
targets compared with unimodal targets. In a simple detection task in which a monkey had to report
the detection of a light flash (V alone), noise sound (A alone), or both (AV) stimuli by manual
response, reaction times to AV stimuli were faster than to V alone regardless of its brightness (Cappe
et al. 2010; see also Miller et al. 2001, showing similar data for small data sets). When the sound was
loud, reaction times to AV stimuli and A alone were not different. When sound intensity was low,
the overall reaction time was longer and the response to AV stimuli was still faster than A alone.
A study from our laboratory showed that reaction times to perceptual “oddballs,” or novel stimuli
in a train of standard stimuli, were faster for AV tokens than for the visual or auditory tokens pre-
sented alone (Kajikawa and Schroeder 2008). Monkeys were presented with a series of standard AV
stimuli (monkey picture and vocal sound) with an occasional oddball embedded in the series that
differed from the standard in image (V alone), sound (A alone), or both (AV) stimuli. The monkey
had to manually respond upon detection of such oddballs. In that case, although intensity levels were fixed, reaction times to the AV oddballs were faster than to either A alone or V alone oddballs. In addition, the probability of a correct response was highest for the AV oddball and lowest for the A alone condition. Therefore, not only the detection of signals, but also their categorization benefited
from AV integration.
This pattern of reaction times conforms to the results of human psychophysics studies showing faster reaction times to bimodal than to unimodal stimuli (Frens et al. 1995; Diederich and Colonius 2004; Perrott et al. 1990). Observations of faster reactions to bimodal compared with unimodal stimuli across different motor systems suggest that AV integration occurs in sensory systems
before the motor system is engaged to generate a behavioral response (or that a similar integration
mechanism is present in several motor systems).
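A common reference point for interpreting such bimodal reaction time gains is the race model inequality of Miller (1982), cited above. If the faster of two independent unimodal processes triggered each response, the cumulative distribution of bimodal reaction times could never exceed the sum of the unimodal distributions,

P(\mathrm{RT}_{AV} \le t) \;\le\; P(\mathrm{RT}_{A} \le t) + P(\mathrm{RT}_{V} \le t) \quad \text{for all } t,

and violations of this bound are taken as evidence that the auditory and visual signals are coactivated before response initiation rather than merely racing to it.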
Differences in task demands complicate the ability to define the role of attention in the effect of AV integration on reaction times. In the study conducted by Wang et al. (2008), monkeys were required to monitor only the occurrence of the visual stimulus. Therefore, the task-irrelevant sound acted exogenously from outside the attended sensory domain; that is, it likely drew the monkey's attention, but this possibility is impossible to assess. In contrast, Cappe et al. (2010) and Kajikawa and Schroeder (2008) used monkeys that were actively paying attention to both visual and auditory modalities during every trial. It is worth noting that the sound stimuli used by Wang et al. (2008) did not act as distracters. Hence, it is possible that the monkeys performed the task by paying attention to both the task-relevant visual stimuli and the task-irrelevant sound (see Section 5.6).

5.2  NEUROANATOMICAL AND NEUROPHYSIOLOGICAL SUBSTRATES


In the following sections, we will describe AV interactions in numerous monkey brain regions
(Figure 5.1). Investigators have identified AV substrates in broadly two ways: by showing that (1) a region responds to both auditory and visual stimuli, or (2) AV stimuli produce neural activity that differs from the responses to either unimodal stimulus presented alone. AV integration has been shown at the early
stages of processing, including primary sensory and subcortical areas (for review, see Ghazanfar
and Schroeder 2006; Musacchia and Schroeder 2009; Schroeder and Foxe 2005; Stein and Stanford
2008). Other areas that respond to both modalities have been identified in the prefrontal cortex
(PFC), the posterior parietal cortex (PPC), the superior temporal polysensory area (STP), and
[Figure 5.1 schematic: somatosensory, visual, higher-order cortical, and thalamic inputs to primate auditory cortex; area abbreviations and color coding are listed in the caption below.]

FIGURE 5.1  (See color insert.) Connections mediating multisensory interactions in primate auditory
cortex. Primate auditory cortices receive a variety of inputs from other sensory and multisensory areas.
Somatosensory areas (PV, parietoventral area; Ri, retroinsular area; S2, secondary somatosensory cortex)
and their projections to auditory cortex are shown in red. Blue areas and lines denote known visual inputs
(FST, fundus of superior temporal area; Pro, prostriata; V1, primary visual cortex; V2, secondary visual
cortex). Feedback inputs from higher cognitive areas (7A, Brodmann’s area 7A; 23, Brodmann’s area 23;
31, Brodmann’s area 31; DLPFC, dorsolateral prefrontal cortex; VLPFC, ventrolateral prefrontal cortex) are
shown in green. Multisensory feedforward inputs from thalamic nuclei (Li, limitans; MP, medial pulvinar;
MGm, medial division of medial geniculate; Po, posterior nucleus; SG, suprageniculate nucleus) are shown
in purple.

medial temporal lobe (MTL). Even though most studies could not elucidate the relationship between
behavior and physiology because they did not test the monkey’s behavior in conjunction with physi-
ological measures, these studies provide promising indirect evidence that is useful in directing
future behavioral/physiological studies.

5.2.1  Prefrontal Cortex


In the PFC, broad regions have been reported to be multisensory. PFC is proposed to contain "what" and "where" pathways, with visual object and spatial information processing segregated into ventrolateral (VLPFC) and dorsolateral (DLPFC) parts of PFC, respectively (Goldman-Rakic et al. 1996; Levy and Goldman-Rakic 2000; Ungerleider et al. 1998). Although numerous studies support the idea of
segregated information processing in PFC (Wilson et al. 1993), others found single PFC neurons
integrated what and where information during a task that required monitoring of both object and
location (Rao et al. 1997).
It appears that auditory information processing in PFC also divides into analogous “what” (e.g.,
speaker specific) and “where” (e.g., location specific) domains. The proposed “what” and “where”
pathways of the auditory cortical system (Kaas and Hackett 2000; Rauschecker and Tian 2000) have
been shown to project to VLPFC and DLPFC, respectively (Hackett et al. 1999; Romanski et al.
1999a, 1999b). Broad areas of the DLPFC were shown to be sensitive to sound location (Artchakov
et al. 2007; Azuma and Suzuki 1984; Kikuchi-Yorioka and Sawaguchi 2000; Vaadia et al. 1986).
Conversely, response selectivity to macaque vocal sounds was found in VLPFC (Cohen et al. 2009;
Gifford et al. 2005; Romanski and Goldman-Rakic 2002; Romanski et al. 2005) and orbitofrontal
cortex (Rolls et al. 2006). These two areas may correspond to face-selective regions of frontal lobe
in nonhuman primates (Parr et al. 2009; Tsao et al. 2008b). Taken together, these findings support
the notion that, as in the visual system, sensitivity to location and nonspatial features of sounds are
segregated in PFC.
Although the dorsolateral stream in PFC has largely been shown to be sensitive to location, audi-
tory responses to species-specific vocalizations were also found in regions of DLPFC in squirrel
monkey (Newman and Lindsley 1976; Wollberg and Sela 1980) and macaque monkey (Bon and
Lucchetti 2006). Interestingly, visual fixation diminished responses to vocal sounds in some neu-
rons (Bon and Lucchetti 2006). Taken together with the results of Rao et al. (1997) showing that
neurons of the “what” and “where” visual stream are distributed over a region spanning both the
DLPFC and VLPFC, these studies suggest that the “what” auditory stream might extend outside
the VLPFC.
Apart from showing signs of analogous processing streams in auditory and visual pathways, PFC
is anatomically equipped to process multisensory stimuli. In addition to auditory cortical affer-
ents, the DLPFC and VLPFC have reciprocal connections with rostral and caudal STP subdivisions
(Seltzer and Pandya 1989). The VLPFC also receives inputs from the PPC, a presumed “where”
visual region (Petrides and Pandya 2009). Within both the DLPFC and VLPFC, segregated projec-
tions of different sensory afferents exist. Area 8 receives projections from visual cortices (occipital
and IPS) in its caudal part, and auditory-responsive cortices [superior temporal gyrus (STG) and
STP] in its rostral part (Barbas and Mesulam 1981). Similar segregation of visual [inferior temporal
(IT)] and auditory (STG and STP) afferents exist within VLPFC (Petrides and Pandya 2002). Thus,
DLPFC and VLPFC contain intermingled regions that receive auditory projections, visual projections, or both. Additionally, orbitofrontal cortex and medial PFC receive inputs
from IT, STP, and STG (Barbas et al. 1999; Carmichael and Price 1995; Cavada et al. 2000; Kondo
et al. 2003; Saleem et al. 2008), and may contribute to AV integration (see Poremba et al. 2003).
Not surprisingly, bimodal properties of PFC neurons have been described in numerous studies.
Some early studies described neurons responsive to both tones and visual stimuli (Kubota et al.
1980; Aou et al. 1983). However, because these studies used sound as a cue to initiate immediate
behavioral response, it is possible that the neuronal response to the sound might be related to motor
execution. Other studies of PFC employed tasks in which oculomotor or manual responses were
delayed from sensory cues (Artchakov et al. 2007; Ito 1982; Joseph and Barone 1987; Kikuchi-
Yorioka and Sawaguchi 2000; Vaadia et al. 1986; Watanabe 1992). Despite the delayed response,
populations of neurons still responded to both visual and auditory stimuli. Such responses showed spatial tuning and depended on task conditions such as the task's modality and demands, for example, discrimination, active detection, or passive reception (Vaadia et al. 1986), or on reward/no-reward contingency (Watanabe 1992). One report shows that visuospatial and audiospatial working memory
processes seem to share a common neural mechanism (Kikuchi-Yorioka and Sawaguchi 2000).
The behavioral tasks used in studies described so far did not require any comparison of visual
and auditory events. Fuster et al. (2000) trained monkeys to learn pairing of tones and colors and
perform a cross-modal delayed matching task using tones as the sample cue and color signals as the
target. They found that PFC neurons in those monkeys had elevated firing during the delay period
that was not present on error trials. Therefore, PFC has many neurons that respond to both auditory and visual signals in a manner that depends on behavioral conditions, and that may associate the two.
Romanski’s group explored multisensory responses in VLPFC (Sugihara et al. 2006), and found
that this region may have unimodal visual, unimodal auditory, or bimodal AV responsive regions
(Romanski et al. 2002, 2005). Their group used movies, images, and sounds of monkeys produc-
ing vocalizations as stimuli, and presented them unimodally or bimodally while subjects fixated.
Although individual neurons responded exclusively to one modality or to both, about half of the neurons examined exhibited AV integration, manifested as either enhancement or suppression of the unimodal response.


Because subjects were not required to maintain working memory or make decisions, those responses
are considered to be sensory.
In addition to the above described regions, premotor (PM) areas between the primary motor
cortex and the arcuate sulcus contain neurons sensitive to sound and vision. Although most of the
neurons in PM respond to somatosensory stimuli, there are neurons that also respond to sound and
visual stimuli and have receptive fields spatially registered between different modalities (Graziano
et al. 1994, 1999). Those neurons are located in caudal PM and particularly code the space proximal to the face (Fogassi et al. 1996; Graziano et al. 1997; Graziano and Gandhi 2000) as well as defensive actions (Cooke and Graziano 2004a, 2004b). Rostral PM contains audiovisual mirror neurons, whose activity is elevated not only during the execution of actions but also during the observation of such actions performed by others. Those neurons discharge during specific manual actions and respond to the sound as well as the sight of such actions (Keysers et al. 2003; Kohler et al. 2002; Rizzolatti et al. 1996; Rizzolatti and Craighero 2004) and to the goal objects of those actions (Murata et al. 1997). Although
AV sensitivity in caudal PM seems directly connected to the subject’s actions, rostral PM presum-
ably reflects the cognitive processing of others’ actions.
In summary, the PFC is subdivided into various regions based on sensory, motor, and other
cognitive processes. Each subdivision contains AV sensitivity that could serve to code locations or
objects. There are neurons specialized in coding vocalization, associating sound and visual signals,
or engaged in representation/execution of particular motor actions.

5.2.2  Posterior Parietal Cortex


The PPC in the monkey responds to different modalities (Cohen 2009), is known to be a main station
of the “where” pathway before the information enters PFC (Goodale and Milner 1992; Ungerleider
and Mishkin 1982), and is highly interconnected with multisensory areas (see below).
PPC receives afferents from various cortices involved in visual spatial and motion processing
(Baizer et al. 1991; Cavada and Goldman-Rakic 1989a; Lewis and Van Essen 2000; Neal et al.
1990). The caudal area of PPC has reciprocal connections with multisensory parts of PFC and STS,
suggesting that the PPC plays a key role in multisensory integration (Cavada and Goldman-Rakic
1989b; Neal et al. 1990). The ventral intraparietal area receives input from the auditory association
cortex of the temporoparietal area (Lewis and Van Essen 2000). The anterior intraparietal area also
receives projections from the auditory cortex (Padberg et al. 2005). PPC receives subcortical inputs
from the medial pulvinar (Baizer et al. 1993) and superior colliculus (SC; Clower et al. 2001) that
may subserve multisensory responses in PPC.
Several subregions of PPC are known to be bimodal. An auditory responsive zone in PPC over-
laps with visually responsive areas (Poremba et al. 2003). Space-sensitive responses to sound (noise)
in several areas of PPC, typically thought to be primarily visually oriented, have been observed in the lateral intraparietal cortex (LIP; Stricanne et al. 1996), the ventral intraparietal area (Schlack et al.
2005), the medial intraparietal cortex, and the parietal reach region (Cohen and Andersen 2000,
2002). The auditory space-sensitive neurons in PPC also respond to visual stimulation with similar
spatial tuning (Mazzoni et al. 1996; Schlack et al. 2005). Furthermore, the spatial tuning of the
auditory and visual response properties was sufficiently correlated to be predictive of one another,
indicating a shared spatial reference frame across modalities (Mullette-Gilman et al. 2005, 2009).
PPC also plays a major role in motor preparation during localization tasks (Andersen et al. 1997).
Auditory responses in LIP only appeared after training on memory-guided delayed reaction tasks
with auditory and visual stimuli (Grunewald et al. 1999) and disappeared when the sound cue
became irrelevant for the task (Linden et al. 1999). These results suggested that auditory responses
in PPC were not just sensory activity. Information encoding the location of auditory cues evolves as the task progresses through its phases, but information remains consistently higher for visual cues in LIP and the parietal reach region (Cohen et al. 2002, 2004). Thus, there is a difference in processing between modalities.

Even though most PPC studies used simple stimuli such as LED flashes and noise bursts, one
study also examined LIP responses to vocal sounds and showed that LIP neurons are capable of carrying information about acoustic features of sounds in addition to their spatial location (Gifford and Cohen 2005). In that study, sounds were delivered passively to monkeys during visual fixation. This seems inconsistent with the previously mentioned findings that the manifestation of auditory responses
in PPC requires behavioral relevance of the sounds (Grunewald et al. 1999; Linden et al. 1999).
Nevertheless, that study suggested the possibility that auditory coding in PPC may not be limited to
spatial information. Similarly, the existence of face-selective patches was shown in the PPC of chimpanzees using PET (Parr et al. 2009).
Although these studies suggest AV integration in PPC, responses to stimuli in bimodal condi-
tions have not yet been directly examined in monkeys.

5.2.3  STP Area


The STP, located in the anterior region of the superior temporal sulcus, from the fundus to the upper
bank, responds to multisensory stimuli in monkeys (Bruce et al. 1981; Desimone and Gross 1979;
Schroeder and Foxe 2002; Poremba et al. 2003) and is a putative key site for AV integration in
both monkeys and humans.
STP is highly connected to subcortical and cortical multisensory regions. STP receives inputs
from presumed multisensory thalamic structures (Yeterian and Pandya 1989) and medial pulvinar
(Burton and Jones 1976), and has reciprocal connections with the PFC and other higher-order corti-
cal regions such as PPC, IT cortex, cingulate cortex, MTL, and auditory parabelt regions (Barnes
and Pandya 1992; Cusick et al. 1995; Padberg et al. 2003; Saleem et al. 2000; Seltzer et al. 1996;
Seltzer and Pandya 1978, 1994). Based on connectivity patterns, area STP can be subdivided into
rostral and caudal regions. Its anterior part is connected to the ventral PFC, whereas the caudal part
seems to be connected to the dorsal PFC (Seltzer and Pandya 1989).
STP exhibits particular selectivity to complex objects, faces, and moving stimuli. STP was shown
to have responses to visual objects (Oram and Perrett 1996), and particularly to show some degree
of face selectivity (Bruce et al. 1981; Baylis et al. 1987). Face-selectivity was shown to exist in dis-
crete patches in monkeys (Pinsk et al. 2005; Tsao et al. 2006, 2008a) and chimpanzees (Parr et al.
2009), although others found responses to faces over a wide area (Hoffman et al. 2007). Responses
to faces are further selective to identity, gaze direction, and/or viewing angle of the presented face
(De Souza et al. 2005; Eifuku et al. 2004). Regions of the caudal STS like MT (Born and Bradley 2005; Duffy and Wurtz 1991; Felleman and Kaas 1984) or MST (Gu et al. 2008; Tanaka et al. 1986), as well as the anterior STP (Anderson and Siegel 1999, 2005; Nelissen et al. 2006; Oram et al. 1993), are sensitive to directional movement patterns. Although the caudal STS is regarded as a part of the "where" pathway, the anterior STP probably is not, because of its large spatial receptive field sizes (Bruce et al. 1981, 1986; Oram et al. 1993). Given this and taken together with face selectivity,
it stands to reason that anterior STP may be important for the perception or recognition of facial
gestures, such as mouth movement.
In addition, STP responds to somatosensory, auditory, and visual stimulation. Multisensory
responsiveness of neurons in STS was tested in anesthetized (Benevento et al. 1977; Bruce et al.
1981; Hikosaka et al. 1988) and alert monkeys (Baylis et al. 1987; Perrett et al. 1982; Watanabe and
Iwai 1991). In both cases, stimuli were delivered unimodally (Baylis et al. 1987; Bruce et al. 1981;
Hikosaka et al. 1988) or simple bimodal stimuli (tone and LED flash) were used (Benevento et al.
1977; Watanabe and Iwai 1991). Although auditory and visual selective neurons were present in
STG and formed segregated clusters in STP (Dahl et al. 2009), a population of neurons responded
to both visual and auditory stimuli (Baylis et al. 1987; Bruce et al. 1981; Hikosaka et al. 1988). When
the response to bimodal stimuli was examined, the neural firing rate was either enhanced or reduced
compared to unimodal stimuli (Benevento et al. 1977; Watanabe and Iwai 1991). The laminar profile of current source density (CSD) responses to sounds (clicks) and lights (flashes), which reflects the pattern of afferent termination across cortical layers, indicated that STP receives feedforward auditory and visual inputs to layer IV (Schroeder and Foxe 2002).
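For readers unfamiliar with the measure, CSD is conventionally estimated from a laminar array of field potentials as the sign-inverted second spatial derivative of voltage across equally spaced contacts, which localizes net transmembrane current sinks and sources to particular cortical depths. The sketch below illustrates that standard estimate; the array size, 100-micrometer contact spacing, and conductivity value are illustrative assumptions rather than parameters of the studies discussed here.

import numpy as np

def csd_estimate(lfp, spacing_m=100e-6, conductivity_s_per_m=0.3):
    # lfp: array of shape (contacts, time) in volts, from a linear laminar probe.
    # Second spatial difference: phi(z - h) - 2*phi(z) + phi(z + h) at interior contacts.
    second_diff = lfp[:-2, :] - 2.0 * lfp[1:-1, :] + lfp[2:, :]
    # CSD ~ -sigma * d2(phi)/dz2, in amperes per cubic meter.
    return -conductivity_s_per_m * second_diff / spacing_m**2

lfp = 1e-4 * np.random.randn(23, 1000)   # fake 23-contact, 1000-sample recording
csd = csd_estimate(lfp)                  # shape (21, 1000): interior contacts only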
Lesion studies in STP reveal that the region appears to process certain dimensions of sound and
vision used for discrimination. Monkeys with lesions of STG and STP areas showed impairments of auditory, but not visual, working memory and of auditory pattern discrimination, while hearing itself was spared (Iversen and Mishkin 1973; Colombo et al. 2006). Although IT lesions impair many visual tasks, IT and STP lesions (Aggleton and Mishkin 1990; Eacott et al. 1993) more severely and selectively impair visual discrimination of objects while sparing the performance of other visual tasks. These
findings suggest that multisensory responses in STP are not simply sensory, but are involved in
cognitive processing of certain aspects of sensory signals.
A series of recent studies examined AV integration in STS during visual fixation with more naturalistic stimuli: the sound and sight of conspecific vocalizations, naturally occurring scenes,
and artifactual movies (Barraclough et al. 2005; Dahl et al. 2009; Chandrasekaran and Ghazanfar
2009; Ghazanfar et al. 2008; Kayser and Logothetis 2009; Maier et al. 2008). As in previous studies
(Benevento et al. 1977; Watanabe and Iwai 1991), neuronal firing to bimodal stimuli was found to
be either stronger or weaker when compared to unimodal stimuli. Barraclough et al. (2005) showed
that the direction of the change in response magnitude to AV stimuli, relative to the visual response, depended on the size of the visual response. Incongruent pairs of sounds and scenes seem to evoke
weaker responses (Barraclough et al. 2005; Maier et al. 2008).
To our knowledge, there are no animal studies that used task conditions requiring active behav-
ioral discrimination. Therefore, results may not be conclusive about whether the STS can associ-
ate/integrate information of different modalities to form a recognizable identity. However, their
bimodal responsiveness, specialization for objects such as faces in the visual modality, and sensitiv-
ity to congruence of signals in different modalities suggests that areas in STP are involved in such
cognitive processes and/or AV perception.

5.2.4  MTL Regions


The MTL is composed of the hippocampus and the entorhinal, perirhinal, and parahippocampal cortices.
These regions are involved in declarative memory formation (Squire et al. 2004) and place coding
(McNaughton et al. 2006). The amygdala plays a predominant role in emotional processes (Phelps
and LeDoux 2005), some of which may be affected by multisensory conjunction (e.g., in response
to “dominant” conspecifics or looming stimuli, as discussed above).
The MTL receives various multisensory cortical inputs. Entorhinal cortex (EC), the corti-
cal gate to the hippocampus, receives inputs from STG, STP, IT, and other nonprimary sensory
cortices either directly or through parahippocampal and perirhinal cortices (Blatt et al. 2003;
Mohedano-Moriano et al. 2007, 2008; Suzuki and Amaral 1994). Auditory, visual, and soma-
tosensory association cortices also project to the nuclei of the amygdala (Kosmal et al. 1997;
Turner et al. 1980).
Although IT, a part of the ventral “what” pathway (Ungerleider and Mishkin 1982) and the major
input stage to MTL, responds mainly to complex visual stimuli, IT can exhibit postauditory sample
delay activity during cross-modal delayed match-to-sample tasks, in which auditory sample stimuli
(tones or broadband sounds) were used to monitor the type of visual stimuli (Colombo and Gross
1994; Gibson and Maunsell 1997). During the same task, greater auditory responses and delay
activity were observed in the hippocampus. Those delay activities presumably reflected the working
memory of a visual object associated with sound after learning. In a visual discrimination task that
used a tone as a warning signal to inform monkeys of the start of a trial, ventral IT neurons responded to this
warning sound (Ringo and O’Neill 1993). Such auditory responses did not appear when identical
tones were used to signal the end of a trial, thereby indicating that effects were context-dependent.

In the hippocampus, a small population of neurons responds to both auditory and visual cues for
moving tasks in which monkeys control their own spatial translation and position (Ono et al. 1993).
Even without task demands, hippocampal neurons exhibit spatial tuning properties to auditory and
visual stimuli (Tamura et al. 1992).
Neurons in the amygdala respond to passively presented faces or vocalizations of conspecifics
(Brothers et al. 1990; Kuraoka and Nakamura 2007; Leonard et al. 1985). Some neurons respond
selectively to emotional content (Hoffman et al. 2007; Kuraoka and Nakamura 2007). Multisensory
responses to different sensory cues were also shown in the amygdala of monkeys performing sev-
eral kinds of tasks to retrieve food or drink, avoid aversive stimuli, or discriminate sounds associ-
ated with reward (Nishijo et al. 1988a). These responses reflected affective values of those stimuli
rather than the sensory aspect (Nishijo et al. 1988b).
These data corroborate the notion that sensory activity in MTL is less likely to contribute to detection and more likely to be related to sensory association, evaluation, or other cognitive processes
(Murray and Richmond 2001). The integrity of these structures is presumably needed for the
formation and retention of cross-modal associational memory (Murray and Gaffan 1994; Squire
et al. 2004).

5.2.5  Auditory Cortex


Recent findings of multisensory sensitivity in sensory (early) cortical areas, including primary
areas, have revised our understanding of cortical “AV integration” (for review, see Ghazanfar and
Schroeder 2006). Before these findings came to light, it was thought that AV integration occurred
in higher-order cortices during complex component processing. To date, a large body of work has
focused on multisensory mechanisms in the auditory cortex (AC). In line with some of the seminal findings with human subjects in this field (Sams et al. 1991; Calvert and Campbell 2003), the monkey AC appears to respond to visual stimuli presented alone. Kayser et al. (2007) measured the BOLD signal to
natural unimodal and bimodal stimuli over the superior temporal plane. They observed that visual
stimuli alone could induce activity in the caudal area of the auditory cortex. In this same area, the
auditory-evoked signal was also modulated by cross-modal stimuli.
The primate auditory cortex stretches from the fundus of the lateral sulcus (LS) medially to the
STG laterally, and has more than 10 defined areas (Hackett 2002; Hackett et al. 2001; Kaas and
Hackett 2000). Among auditory cortical areas, the first area in which multisensory responsiveness
was examined was the caudal–medial area (CM; Schroeder et al. 2001). In addition to CM, other
auditory areas including the primary auditory cortex (A1) were also shown to receive somatosen-
sory inputs (Cappe and Barone 2005; Disbrow et al. 2003; de la Mothe et al. 2006a; Kayser et al.
2005; Lakatos et al. 2007; Smiley et al. 2007; for a review, see Musacchia and Schroeder 2009).
Most areas also receive multisensory thalamic inputs (de la Mothe et al. 2006b; Hackett et al. 2007;
Kosmal et al. 1997). Documented visual inputs to the auditory cortex have thus far originated from
STP (Cappe and Barone 2005) as well as from the peripheral visual field representations of V2 and prostriata (Falchier
et al. 2010).
Schroeder and Foxe (2002) reported CSD responses to unimodal and bimodal combinations of
auditory, visual, and somatosensory stimuli in area CM of the awake macaque. The laminar profiles
of CSD activity in response to visual stimuli differed from those of auditory and somatosensory
responses. Analysis of activity in different cortical layers revealed that visual inputs targeted the
extragranular layers, whereas auditory and somatosensory inputs terminated in the granular layers
in area CM. These two termination profiles are in accordance with the pattern of laminar projections
of visual corticocortical projections (Falchier et al. 2002; Rockland and Ojima 2003) and primary-like thalamocortical projections (Jones 1998), respectively. In contrast, A1 receives auditory and
somatosensory inputs in the granular and supragranular cortical layers, respectively (Lakatos et al.
2007). This suggests that somatosensory input to A1 originates from lateral, feedback, or nonspe-
cific thalamic nuclei connections. Our laboratory showed that attended visual stimuli presented in
isolation modulate activity in the extragranular layer of A1 (Lakatos et al. 2009) and the same pattern
is observed with attended auditory stimuli in V1 (Lakatos et al. 2008). These findings strengthen the
hypothesis that nonspecific thalamic projections (Sherman and Guillery 2002) or pulvinar-mediated
lateral connections (Cappe et al. 2009) contribute to AV integration in A1.
The groups of Ghazanfar and Logothetis have shown that concurrent visual stimuli systematically influence auditory cortical responses in A1 as well as in the lateral association auditory cortices and STP (Ghazanfar et al. 2005; Hoffman et al. 2008; Kayser et al. 2007, 2008). These
studies used complex and natural AV stimuli, which are more efficient in evoking responses in
some nonprimary auditory areas (Petkov et al. 2008; Rauschecker et al. 1995; Russ et al. 2008).
Their initial study (Ghazanfar et al. 2005) revealed that movies of vocalizations presented with the
associated sounds could modulate local field potential (LFP) responses in A1 and the lateral belt.
Kayser et al. (2008) showed visual responses in the LFP at frequency bands near 10 Hz. This frequency component responded preferentially to faces, and the preference was stronger in the lateral belt than in A1 (Hoffman et al. 2008). However, multiunit activity (MUA) barely showed a visual response that correlated in magnitude with the LFP response. AV interactions occurred as a small enhancement in
LFP and suppression in MUA (see also Kayser and Logothetis 2009).
Although AV integration in areas previously thought to be unisensory is intriguing and provocative, the use of a behavioral task is imperative in order to determine the significance of this
phenomenon. Brosch et al. (2005) employed a task in which an LED flash cued the beginning of
an auditory sequence. Monkeys were trained to touch a bar to initiate the trial and to signal the
detection of a change in the auditory sequence. They found that some neurons in AC responded to the LED, but only when the monkey touched the bar after detecting the auditory change. This response disappeared when the monkey had to perform a visual task that did not require auditory attention. Although this may be due in part to the fact that the monkeys were highly trained (or potentially overtrained) on the experimental task, these findings also point to the importance of engaging auditory atten-
tion in evoking responses to visual stimuli. Findings like these, which elucidate the integrative
responses of individual and small populations of neurons, can provide key substrates to understand
the effects of bimodal versus unimodal attention on cross-modal responses demonstrated in humans
(Jääskeläinen et al. 2007; McDonald et al. 2003; Rahne and Böckmann-Barthel 2009; Talsma et al.
2009; von Kriegstein and Giraud 2006).
The timing of cross-modal effects in primary auditory and posterior auditory association corti-
ces in resting or anesthetized monkeys seemed consistent with the cross-modal influence of touch
and sight in monkeys engaged in an auditory task. In resting monkeys, the somatosensory CSD
response elicited by electrical stimulation of the median nerve had an onset latency as short as 9 ms
(Lakatos et al. 2007; Schroeder et al. 2001), and single neurons responded to air-puff stimulation of the dorsal hand in anesthetized monkeys with a latency of about 30 ms (Fu et al. 2003). Cutaneous sensory responses of single units in AC during an active task peaked at 20 ms (Brosch et al. 2005), slower than responses to direct electrical activation of afferent fibers but faster than those under passive conditions.
Similarly, visual responses of single units in AC were observed from 60 ms and peaked at around
100 ms after the onset of the LED during an active task (Brosch et al. 2005). That was within the same range as the onset latency (about 100 ms) of neuronal firing and the peak timing of LFP responses
to complex visual stimuli in AC when monkeys were simply visually fixating (Hoffman et al. 2007;
Kayser et al. 2008). The effect of gaze direction/saccades will also need to be taken into account
in future studies because it has been proposed that it can considerably affect auditory processing
(Fu et al. 2004; Groh et al. 2001; Werner-Reiss et al. 2006).

5.2.6  Visual Cortex


There has been much less multisensory research done in visual cortex than in auditory cortex,
although it has been shown that the peripheral visual field representations of primary visual cortex
(V1) receive inputs from auditory cortical areas, A1, parabelt areas on STG, and STP (Falchier et al.
2002). The peripheral visual field representation of area V2 also receives feedback inputs from cau-
dal STG/auditory belt region (Rockland and Ojima 2003). A preference to vocal sounds, relative to
other sounds, was found in the nonprimary visual cortex using functional MRI (fMRI) in monkeys
(Petkov et al. 2008).
In contrast to studies of visual responses in the auditory cortex, few studies have recorded auditory responses in visual cortex during the performance of a task. Wang et al. (2008) recorded
V1 single-unit firing while monkeys performed a visual detection task. Concurrent presentation of
auditory and visual stimuli not only shortened saccadic reaction time, but also increased the neu-
ronal response magnitude and reduced response latency. This effect was greatest when the intensity
of the visual stimuli was low to moderate, and disappeared when the luminance of the visual stimuli was high. When monkeys were not performing a task, no auditory effect was observed in
V1 (see Section 5.6.1).
In a series of studies from our laboratory, a selective attention task was employed to deter-
mine whether attention to auditory stimuli influenced neuronal activity in V1 (Lakatos et al. 2008,
2009; Mehta et al. 2000a, 2000b). In these studies, tones and flashes were presented alternately
and monkeys had to monitor a series of either visual or auditory stimuli, while ignoring the other
modality. The visual response was stronger when monkeys tracked the visual series than when they
tracked the auditory series. In the attend-auditory condition, it appeared that a phase reset of ongo-
ing neuronal oscillations occurred earlier than the visual response (Lakatos et al. 2009). This effect
disappeared when the same stimuli were ignored. Thus, auditory influences on V1 were observed
only when auditory stimuli were attended. This contrasts with the findings of Wang et al. (2008), in
which sound affected V1 activity in monkeys performing a visual task. As we propose later, con-
trol of attention likely has a major role in the manifestation of auditory effects in V1 (see Section
5.6.2).
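Phase reset of this kind is commonly quantified as the concentration of oscillatory phase across trials at each time point (inter-trial coherence, ITC): a reset with little added evoked power appears as a rise in phase concentration without a comparable rise in single-trial amplitude. A minimal sketch of that measure is given below; the sampling rate, the 4 to 8 Hz band, and the simulated data are illustrative assumptions, not parameters of the studies just described.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 1000.0                                  # sampling rate (samples per second)
trials = np.random.randn(100, 600)           # fake field potentials: trials x samples

# Band-pass filter in a low-frequency band, then extract instantaneous phase.
b, a = butter(3, [4.0 / (fs / 2), 8.0 / (fs / 2)], btype="band")
band = filtfilt(b, a, trials, axis=1)
phase = np.angle(hilbert(band, axis=1))

# ITC at each time point: length of the mean unit phase vector across trials.
itc = np.abs(np.mean(np.exp(1j * phase), axis=0))   # 0 = random phases, 1 = perfect alignment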

5.2.7  Subcortical Regions


The basal ganglia are composed of several nuclei, each having a distinct function, such as motor planning and execution, habit learning, and motivation. Several studies show auditory, visual,
and bimodally responsive neurons in basal ganglia nuclei. Even though multisensory responses
could be observed under passive conditions (Santos-Benitez et al. 1995), many studies showed that
these responses were related to reinforcement (Wilson and Rolls 1990) or sensorimotor association
(Aosaki et al. 1995; Hikosaka et al. 1989; Kimura 1992).
Although it is well known that the SC is a control station for orienting movements (Wurtz and Albano 1980), its multisensory properties have been a hotbed of research for decades in monkeys (Allon and Wollberg 1978; Cynader and Berman 1972; Updyke 1974; Wallace et al. 1996) and other animal models (Meredith and Stein 1983; Meredith et al. 1987; Rauschecker and Harris 1989; Stein et al. 2001, 2002). Neurons in the monkey SC adhered to well-established principles of multisensory integration such as spatial contiguity and inverse effectiveness (for review, see Stein and Stanford 2008), whether the animals were engaged in tasks (Frens and Van Opstal 1998) or under anesthesia (Wallace et al. 1996). In the SC of awake animals, AV integration depended on task conditions, namely whether the animals fixated on a visible or a memory-guided spot during AV stimulation (Bell et al. 2003). The presence of a visual fixation spot decreased unimodal responses, and nearly suppressed response enhancement by AV stimuli. Bell et al. (2003) attributed the weaker AV integration during visually guided fixation to fixation-mediated inhibition in the SC. This is consistent with the fact that,
whereas activity in SC is coupled to eye movements, fixation requires the monkey to refrain from
gaze shifts.
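In this literature, the strength of such response enhancement (or suppression) is conventionally expressed as a multisensory enhancement index that compares the bimodal response with the most effective unimodal response (Stein and Meredith 1993),

\text{enhancement (\%)} \;=\; \frac{R_{AV} - \max(R_{A}, R_{V})}{\max(R_{A}, R_{V})} \times 100,

where R denotes the response magnitude (e.g., mean evoked spike count) in each condition. Inverse effectiveness refers to the tendency of this index to grow as the unimodal responses become weaker.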
Although the inferior colliculus (IC) has been generally assumed to be a passive station for pri-
marily auditory information and immune to nonauditory or cognitive influences, recent AV studies
challenge this view. Neuronal activity in the IC has been shown to be influenced by eye position
(Groh et al. 2001), saccades, and visual stimuli (Porter et al. 2007), suggesting that the IC may be
influenced by covert orienting to concurrent visual events. This covert orienting may contribute to
the visual influence observed on portions of human auditory brainstem responses that are roughly
localized to the IC (Musacchia et al. 2006).
Studies of thalamic projections to the primary auditory cortex show that multisensory connec-
tions are present in centers previously thought to be “unisensory” (de la Mothe et al. 2006b; Hackett
et al. 2007; Jones 1998). Multiple auditory cortices also receive divergent afferents originating from
common thalamic nuclei (Cappe et al. 2009; Jones 1998). In addition, the connections between
thalamic nuclei and cortices are largely reciprocal. Even though the functions of those thalamic
nuclei remain to be clarified, they may contribute to multisensory responsiveness in cerebral cortices.
Bimodal responsiveness was shown in a few thalamic nuclei (Matsumoto et al. 2001; Tanibuchi and
Goldman-Rakic 2003).

5.3  FUNCTIONAL SIGNIFICANCE OF MULTISENSORY INTERACTIONS


It was shown in monkeys that, under certain circumstances, audition influences vision (Wang et al.
2008), vision influences audition (Woods and Recanzone 2004), or the two senses influence each
other (Cappe et al. 2010). For AV integration of any form, auditory and visual information has to
converge. As described in the previous section, most brain regions have the potential to support that
interaction (for review, see Ghazanfar and Schroeder 2006; Musacchia and Schroeder 2009), but
the importance of that potential can only be determined by assessing the functional role that each
region plays in helping to achieve perceptual integration of sight and sound. This can be achieved
by observing the behavioral effects of cortical lesions or electrical stimulation in different areas and
by simultaneously measuring behavioral performance and neural activity in normally functioning and
impaired populations.

5.3.1  Influences on Unimodal Perception


Neural activity in a unimodal area is thought to give rise to sensations only in the preferential
modality of the area. It is not surprising, therefore, that lesions in these areas only extinguish sensa-
tions of the “primary” modality. For example, STG lesions impair auditory memory retention but
leave visual memory retention intact (Colombo et al. 1996). One exception to this rule lies in cases
of acquired cross-modal activity such as auditory responses in the occipital cortex in blind people
(Théoret et al. 2004). Despite this reorganization, direct cortical stimulation in the visual cortex
of blind people elicits photic sensations of simple patterns (such as letters) (Dobelle et al. 1974).
Similar sensations of phosphenes can also be induced in sighted individuals using transcranial
magnetic stimulation (TMS) (Bolognini et al. 2010; Ramos-Estebanez et al. 2007; Romei et al.
2007, 2009). But do they also induce auditory sensations? Our opinion is that auditory activity in the
visual cortex does not induce visual sensations, and visual activity in the auditory cortex does not
induce auditory sensations, although this may depend on the subject's experience with the stimuli (Meyer et al. 2010). In humans, cross-modal attention is known to influence the activity of sensory cortices during cross-modal stimulus presentation; for example, visual attention gates visual modulation of auditory cortex (Ciaramitaro et al. 2007; Lehman et al. 2006; Nager et al. 2006; Teder-
Sälejärvi et al. 1999). In particular, the functional role of visual information on speech perception
and underlying auditory cortical modulation is well documented (Besle et al. 2009; van Atteveldt
et al. 2009; Schroeder et al. 2008). The findings described below also suggest that the functional
role of cross-modal activation in early sensory cortices is likely the modulation of primitive (low-
level) sensory perception/detection.

5.3.1.1  Influence on Temporal Dynamics of Visual Processing


In the sensory system, more intense stimuli generally produce higher neuronal firing rates,
faster response onset latencies, and stronger sensations. AV interactions often have a facilitative
effect on the neural response, either through increased firing rate or faster response (for review,
see Stein and Stanford 2008), suggesting that AV stimuli should increase the acuity of the behav-
ioral sensation in some fashion. In humans, AV stimuli speed reaction times during
target detection (Diederich and Colonius 2004; Giard and Peronnet 1999; Molholm et al. 2002,
2007) and improve temporal order judgments (Hairston et al. 2006; Santangelo and Spence
2009).
In the monkey, Wang et al. (2008) showed electrophysiological results consistent with this notion.
During a visual localization task, the effect of AV enhancement in V1 appeared as a shortening of response latency. Interestingly, no appreciable enhancement of the visual response was elicited by auditory stimuli when monkeys were not engaged in tasks.
The auditory stimuli by themselves did not evoke firing responses in V1. This suggests that audi-
tory influence on V1 activity is a subthreshold phenomenon. Suprathreshold response in V1 begins
at about 25 to 30 ms poststimulation (Chen et al. 2007; Musacchia and Schroeder 2009). To achieve
auditory influences on visual responses, auditory responses must arrive within a short temporal
window, a few milliseconds before visual input arrives (Lakatos et al. 2007; Schroeder et al. 2008).
Auditory responses in the auditory system generally begin much earlier than visual responses in V1.
For some natural events such as speech, visible signals lead the following sounds (Chandrasekaran
et al. 2009; for review, see Musacchia and Schroeder 2009). For these events, precedence of visual
input, relative to auditory input, is likely a requirement for very early AV interaction in early sensory
interactions.

5.3.1.2  Sound Localization


The ventriloquist aftereffect observed by Woods and Recanzone (2004) involves the alteration of
auditory spatial perception by vision. This phenomenon implies the recruitment of structures whose
auditory response depends on or encodes sound location. In monkeys, several brain structures are sensitive to the spatial location of sounds. These include the IC (Groh et al. 2001), SC (Wallace et al. 1996),
ventral division of the medial geniculate body (Starr and Don 1972), caudal areas of auditory cor-
tex (Recanzone et al. 2000; Tian et al. 2001), PPC (Cohen 2009), and PFC (Artchakov et al. 2007;
Kikuchi-Yorioka and Sawaguchi 2000).
Woods and Recanzone (2004) used two tasks to test for bimodal interaction during sound localization: one for training, to induce the ventriloquist aftereffect, and another to test sound lateralization. Monkeys maintained fixation except when making a saccade to the target sound location in the latter, test task. The location of the LED on which monkeys fixated during the training task differed between sessions and affected sound localization in the subsequent test task. The monkeys' “sound mislocalization” was predicted by the deviation of the LED position used during training from the true center position on which the monkey fixated during the test task. Because monkeys always fixated on the LED, the retinotopic locus of the LED was identical across the tasks. However, there was a small difference in gaze direction that played a key role in causing the “mislocalization,” presumably by inducing a plastic change in the alignment between the proprioceptive gaze position and the sensed LED position. An additional key feature of that study was that, even though the LED positions were not identical between tasks, they were close enough that the monkeys presumably treated the slightly different fixation points as the same and did not notice the difference in gaze direction. Therefore, it can be inferred that plasticity in visual spatial localization affected auditory spatial localization.
Although the precise substrate for the ventriloquist aftereffect in the macaque has not been
established, several structures are candidates: IC (Groh et al. 2001), SC (Jay and Sparks 1984),
AC (Werner-Reiss et al. 2006), and LIP and MIP (Mullette-Gilman et al. 2005). However, in all of these structures except the SC, the observed effects varied among simple gain modulation without alteration of the spatial receptive field (head-centered coordinates), systematic shifts that followed gaze direction (eye-centered coordinates), and other, more complex changes. Plastic change in either coordinate frame, or in both, can presumably contribute to inducing the ventriloquist aftereffect.
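As a purely schematic illustration of the two reference frames referred to above (all parameters are hypothetical and do not come from the cited recordings), the toy tuning functions below contrast a head-centered cell, whose preferred sound location is unaffected by gaze, with an eye-centered cell, whose preferred location, expressed in head coordinates, shifts along with gaze.

    import math

    def tuning(delta_deg, width_deg=15.0):
        """Gaussian falloff of response with distance from the preferred location."""
        return math.exp(-0.5 * (delta_deg / width_deg) ** 2)

    def head_centered_response(sound_deg, gaze_deg, preferred_deg=20.0):
        # Response depends only on the sound's position relative to the head;
        # gaze_deg is deliberately unused.
        return tuning(sound_deg - preferred_deg)

    def eye_centered_response(sound_deg, gaze_deg, preferred_deg=20.0):
        # Response depends on the sound's position relative to the line of gaze.
        return tuning((sound_deg - gaze_deg) - preferred_deg)

    for gaze in (0.0, 10.0):  # fixation straight ahead vs. deviated by 10 degrees
        print(f"gaze {gaze:+5.1f} deg: head-centered {head_centered_response(20.0, gaze):.2f}, "
              f"eye-centered {eye_centered_response(20.0, gaze):.2f}")
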
Fixation during head restraint does not allow any eye movement. During fixation, subjects can nonetheless direct visual attention to locations away from the fixated spot (covert attention) or listen carefully. Neuronal correlates of such processes have been observed in PFC (Artchakov et al. 2007; Kikuchi-Yorioka and Sawaguchi 2000) and PPC (Andersen et al. 1997). Meanwhile, subjects must continuously issue oculomotor command signals to maintain a steady eye position. Therefore, a signal that conveys the fixated location and differentiates the center from deviant positions should be present. A possible correlate of such a signal, a change in spontaneous activity dependent on gaze direction, was described in AC, whereas it was not observed in IC (Werner-Reiss et al. 2006). Even though the source of the eye position signal to AC is unknown, this finding suggests AC as one of the candidate structures for inducing the ventriloquist aftereffect.
It is worth mentioning that, despite its name, the ventriloquist aftereffect is quite different from the ventriloquist effect. The ventriloquist effect occurs when auditory and visual signals originate from nearby locations; it requires neither fixation on a visual spot nor a steady eye position signal. In contrast, the ventriloquist aftereffect concerns the spatial coding of purely auditory events. Hence, the study of this phenomenon may help clarify which type of neuronal coding serves as the main strategy for the cortical encoding of sound location.

5.3.2  AV Recognition
Identifying a previously known AV object, such as a speaker’s face and voice, requires AV inte-
gration, discrimination, and retention. This process likely relies on accurate encoding of complex
stimulus features in sensory cortices and more complex multiplexing in higher-order multisensory
association cortices. Multisensory cortices in the “what” pathway probably function to unite these
sensory attributes. In humans, audiovisual integration plays an important role in person recogni-
tion (Campanella and Belin 2007). Several studies have shown that unimodal memory retrieval of
multisensory experiences activated unisensory cortices, presumably because of multisensory asso-
ciation (Wheeler et al. 2000; Nyberg et al. 2000; Murray et al. 2004, 2005; von Kriegstein and
Giraud 2006), and that such memory depended on the meaningfulness of the combined signals (Lehmann and Murray 2005).
Differential responses to vocal sounds were observed in PFC (Gifford et al. 2005; Romanski
et al. 2005), STG (Rauschecker et al. 1995; Russ et al. 2008), and AC (Ghazanfar et al. 2005).
Differential responses to faces were found in PFC (Rolls et al. 2006), temporal lobe cortices (Eifuku
et al. 2004), and amygdala (Kuraoka and Nakamura 2007). Some of these structures may possess selectivity for both vocal sounds and faces. Because it involves recognition of a previously learned object, this process likely relies in part on working and long-term memory centers. The fact that the identification
of correspondence between vocal sound and face is better when the individuals are socially famil-
iar (Martinez and Matsuzawa 2009) supports this notion. PFC and MTL are also involved in the
association of simple auditory and visual stimuli as shown by delayed match to sample task studies
(Colombo and Gross 1994; Fuster et al. 2000; Gibson and Maunsell 1997). Lesions in MTL (Murray
and Gaffan 1994) or PFC (Gaffan and Harrison 1991) impaired performance in tasks requiring
memory and AV association. These findings implicate PFC, STG, and MTL in AV recognition.

5.4  PRINCIPLES OF MULTISENSORY INTERACTION


Relationships between multisensory responses and stimulus parameters, derived primarily from
single-unit studies in the cat SC, are summarized in three principles of multisensory interaction:
inverse effectiveness, temporal, and spatial principles (Stein and Meredith 1993). These organiz-
ing principles have been shown to be preserved with other sensory combinations (e.g., auditory–somatosensory; Lakatos et al. 2007) and in humans (Stevenson and James 2009); however,
systematic examination of these principles for AV integration in monkey cerebral cortex is limited
to the auditory cortex.

5.4.1  Inverse Effectiveness


The inverse effectiveness principle of multisensory interaction states that the combination of weaker unimodal inputs results in a larger relative gain of the multisensory response. In the case of audition, the response to a softer sound should be enhanced more by visual input than the response to a louder sound; in the case of vision, the response to a dimmer object should be enhanced more by sound than the response to a brighter object.
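A minimal numerical sketch of this principle uses the enhancement index common in the single-unit literature following Stein and Meredith (1993), the percentage gain of the bimodal response over the best unimodal response; the response values below are entirely hypothetical and serve only to show how the index behaves.

    # Hypothetical firing rates (spikes/s); the index is the percentage gain of
    # the bimodal response over the best unimodal response.
    def enhancement_percent(auditory, visual, audiovisual):
        best_unimodal = max(auditory, visual)
        return 100.0 * (audiovisual - best_unimodal) / best_unimodal

    strong = enhancement_percent(auditory=40.0, visual=35.0, audiovisual=48.0)  # loud/bright pair
    weak = enhancement_percent(auditory=6.0, visual=5.0, audiovisual=12.0)      # soft/dim pair
    print(f"strong unimodal inputs: {strong:.0f}% enhancement")  # ~20%
    print(f"weak unimodal inputs:   {weak:.0f}% enhancement")    # ~100%
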
Cappe et al. (2010) showed a behavioral correlate of inverse effectiveness in monkeys. Manual reaction times to soft sounds were slower than those to loud sounds, and only the reaction times to soft sounds were shortened by simultaneous visual stimuli. Responses to AV stimuli were also more accurate than responses to sounds alone at the lowest sound intensities. The same group also showed that the effect of sound on saccadic and V1 neuronal response latencies was larger for less salient visual stimuli (Wang et al. 2008).
fMRI studies show that degraded auditory and visual stimuli both evoke weaker BOLD signal
responses in the macaque AC, relative to intact stimuli (Kayser et al. 2007). When those degraded
stimuli were presented simultaneously, the enhancement of BOLD signal responses was larger than that for simultaneously presented intact stimuli. Even though they did not test the combination of degraded and intact
stimuli, the results suggest synergistic inverse effectiveness between modalities.
Electrophysiologically, Ghazanfar et al. (2005) showed that weaker LFP responses to vocal
sounds were enhanced more by concurrently viewing a movie clip of a vocalizing monkey, rela-
tive to stronger responses. Another study showed that responses to vocal stimuli were modulated
by movie stimuli differentially depending on loudness: responses to the loud vocal stimuli were
suppressed when the movie was added, whereas the responses to the soft sounds were enhanced
(Kayser et al. 2008). These studies are compatible with the idea that weak responses are enhanced
by AV integration. Additionally, a recent study reported a small but significant increase in the infor-
mation capacity of auditory cortical activity (Kayser et al. 2010).
Thus, visual stimuli may not only enhance responses but also deploy more cortical neurons in
the computational analysis of auditory signals, creating redundancy in the processed information that helps secure perception.

5.4.2  Temporal Contiguity


The Temporal Principle of multisensory processing (Stein and Meredith 1993) predicts that integra-
tion effects will be greatest when neuronal responses evoked by stimuli of the two modalities are
within a small temporal window. Quite a few studies investigated spatial and temporal contiguity
principles of AV integration in nonhuman primates.
Overall, results in the monkey SC and A1 conform to the principle of temporal contiguity and
describe a range of enhancement and suppression effects. In the SC, Wallace et al. (1996) showed
that visual stimuli preceding auditory stimuli tend to produce more interaction. This condition
corresponds to the natural order of physical events in everyday stimuli where the visual stimulus
precedes the accompanying auditory one.
Ghazanfar et al. (2005) described neural responses in A1 and lateral belt areas to the presentation
of conspecific vocal sounds, with and without the accompanying movies at different SOAs. In this
region, bimodal stimulation can elicit suppression or enhancement, depending on the neural popula-
tion. Results showed that the proportion of sites exhibiting bimodal enhancement depended on the
SOA: SOAs longer than 100 ms produced enhancement in fewer regions of AC. When the auditory response was sup-
pressed by a movie, the proportion of suppressed locations peaked at SOAs shorter than 80 ms and
longer than 100 ms, interestingly sparing the peak timing of visually evoked LFPs.
Kayser et al. (2008) tested responses in A1 and belt areas to systematic combinations of noise
bursts and flashes with the SOA varied in 20 ms steps. Bimodal suppression was observed only when the flash preceded the noise by 20 to 80 ms. For the natural AV stimuli, bimodal enhancement was observed in some popu-
lations of auditory cortex at an SOA of 0 ms, and that was abolished by introducing a perceivable
delay between stimuli (160 ms).
These results suggest that AV interaction in AC could happen as either enhancement (if audio
and visual stimuli are nearly synchronized or separated by less than 100 ms delay) or suppression
(at delays longer than 100 ms). Interpretations of these data should be approached with some
caution. In the first study, the effect of AV interaction was attributed to the interaction between
movements of the mouth and the following vocal sound (Ghazanfar et al. 2005). However, because
the mouth movement started immediately after the abrupt appearance of the first movie frame,
the sudden change in the screen image could capture visual attention. In other studies, an abrupt
visual change was shown to elicit a brief freeze of gaze position in monkeys (Cui et al. 2009)
and in humans (e.g., Engbert and Kliegl 2003). Therefore, the onset of the movie itself could
evoke transient activity. This would suggest that the observed effects were related simply to visual
response or a transient change in covert visual attention. Because LFPs capture the response of a
large population of neurons, such activity generated in non-AC structures may be superimposed.
Further studies are necessary to dissociate the AV interaction into mouth movement-related and
other components.

5.4.3  Spatial Contiguity


The spatial principle of multisensory integration states that multisensory integration is greatest
when loci of events of different modalities overlap with the receptive fields of neurons and when
those receptive fields of different modalities overlap with each other. Although there are few data on this topic for AV integration in monkey cortex, we can speculate about how it operates based on anatomical and electrophysiological findings.
Anatomical studies predict that peripheral representations of visual stimuli should be more sus-
ceptible to auditory influences. The representation of the visual periphery is retinotopically orga-
nized in the visual cortex and is interconnected with caudal auditory cortices (Falchier et al. 2002,
2010; Rockland and Ojima 2003). In accordance with this prediction, Wang et al. (2008) observed
auditory influences on V1 responses to visual stimuli presented more peripherally than 10°, although
central vision was not tested. Similarly, in humans, auditory activation of visual cortex subserving
the peripheral visual fields was shown (Cate et al. 2009). However, many human studies used central
and parafoveal stimuli, for which the anatomical substrates or other physiological mechanisms remain to be identified.
Other studies used different types of visual stimuli to study auditory cortical responses. Flashes
(e.g., Kayser et al. 2008; Lakatos et al. 2009) excite a wide area of the retinotopic map. Images and
movies were overlaid around a central fixation point (Ghazanfar et al. 2005, 2008). In the latter
case, visual stimulation did not extend to peripheral visual space. In addition, when monkey faces
are used, the subjects tend to look at the mouth and eyes, near the center of the face (Ghazanfar
et al. 2006).
These findings suggest that visual influence may have different sources depending on the type
of stimulus preference in each area. For example, cortical areas along the STS possess face preference, large receptive fields, and position-invariant object selectivity. Therefore, facial influences on AC may originate from the STS, as proposed by recent studies (Ghazanfar et al. 2008; Kayser and Logothetis 2009, see below). Such speculation could be tested by comparing the effect of vocalization movies on AC across different positions of the face relative to gaze, taking into account the differences in receptive field size among visually responsive cortices.
In PPC, common spatial tuning to visual and auditory stimuli was observed (Mazzoni et al.
1996; Schlack et al. 2005). Even though PPC response to simultaneous AV stimuli has not been
investigated, it is likely that integration there depends on spatial congruency between modalities.
Further studies are needed to verify this.

5.5  MECHANISMS AND DYNAMICS OF MULTISENSORY INTERACTION


Traditionally, multisensory integration is indexed at the neuronal level by a change in the aver-
aged magnitude of evoked activity relative to the sum of unimodal responses. This type of effect
was most often studied in the classical higher-order multisensory regions of the temporal, parietal,
and frontal cortices, and generally manifested as a simple enhancement of the excitatory response
beginning at the initial input stage in layer 4 as reviewed by Schroeder and Foxe (2002). Recent
studies have shown that cross-modal influence on traditional unisensory cortices could occur via
manipulation of ongoing oscillatory activity in supragranular layers, which in turn modulates the
probability that neurons will fire in response to the dominant (driving) auditory input (Lakatos et al.
2007; Schroeder and Lakatos 2009). Similarly, modulatory rather than driving multisensory influ-
ences were found in single-unit studies as well (Allman and Meredith 2007; Allman et al. 2008;
Dehner et al. 2004; Meredith et al. 2009). This more novel mechanism will be the focus of discus-
sion here.

5.5.1  Phase Reset: Mechanisms


Somatosensory stimuli evoked a modulatory response in the supragranular layers of A1, with an onset time even faster than that of the auditory response (Lakatos et al. 2007). When paired with synchronized auditory stimuli, this faster somatosensory activation influenced the forthcoming auditory response. However, the somatosensory input did not evoke a single rapid bolus of afferent activity like a click, which elevates signal power across a broad frequency range at once. Instead, the somatosensory effect appeared as a modulation, by phase reset, of certain dominant neuronal oscillations observed in the CSD. In other words, the somatosensory stimulus shifted the randomly fluctuating excitability of the auditory neuronal ensemble to a particular excitable state (represented by the oscillatory phase), thereby determining the effect of the auditory input. The modulatory effect differs across somatosensory–auditory SOAs depending on how a given SOA relates to the periods of delta, theta, and gamma oscillations; that is, facilitation is maximal at SOAs corresponding to full gamma, theta, and delta cycles, and these peaks in the function are separated by “suppressive” troughs, particularly at SOAs corresponding to one-half of a theta cycle and one-half of a delta cycle.
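This SOA dependence can be caricatured with a toy model (ours, not the published analysis) in which a cross-modal input at time zero resets delta, theta, and gamma oscillations to their most excitable phase; the band frequencies below are assumed nominal values rather than parameters measured in these recordings.

    import numpy as np

    # Toy model of post-reset excitability; band center frequencies are assumed.
    bands_hz = {"delta": 1.5, "theta": 7.0, "gamma": 35.0}

    def excitability(soa_ms):
        """Summed cosine excitability met by an input arriving soa_ms after the reset."""
        t = soa_ms / 1000.0
        return sum(np.cos(2.0 * np.pi * f * t) for f in bands_hz.values())

    theta_period = 1000.0 / bands_hz["theta"]
    for label, soa in [("one full theta cycle", theta_period),
                       ("one half theta cycle", theta_period / 2.0)]:
        print(f"SOA = {soa:5.1f} ms ({label}): excitability = {excitability(soa):+.2f}")
    # The full-cycle SOA lands near an excitability peak (facilitation), whereas
    # the half-cycle SOA lands in a trough (suppression), mirroring the profile
    # described above.
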
In contrast with somatosensory activation of A1, visual responses are relatively slow even within
the visual system (Chen et al. 2007; Musacchia and Schroeder 2009; Schmolesky et al. 1998). In both A1 and V1, visually driven activity arrives later than auditory-driven activity (Lakatos et al. 2009). Therefore, for visual signals to coincide with, or reach AC earlier than, the auditory signal, visual stimuli have to occur earlier than auditory stimuli, which is the
case for many natural forms of AV stimulation, particularly speech (Chandrasekaran et al. 2009).
Cross-modal auditory modulation of V1 activity and visual modulation of A1 activity were
observed in monkeys performing an intermodal selective attention task, in which auditory and
visual stimuli were presented in alternation at a rate in the delta frequency range (Lakatos et al. 2009). As in the case of somatosensory modulation of A1 activity, the cross-modal responses occurred as a modulatory phase reset of ongoing oscillatory activity in the supragranular layers, without a significant change in neuronal firing, while those stimuli were attended.
Supragranular layers receive corticocortical and nonspecific thalamocortical inputs, whereas granular layers receive sensory-specific thalamocortical inputs. A modulatory phase reset in the supragranular layers, without any change in neuronal firing in the granular or even the supragranular layers, suggests that cross-modal activation occurs as a transient, subthreshold change in supragranular cellular excitability. This is consistent with the fact that cross-modal sensory firing responses have not been reported for primary sensory cortices in the many studies that relied on action potentials as the sole dependent measure. The presence of multiple poststimulus time windows of excitability is consistent with the nested hierarchical structure of the frequency bands of ongoing neuronal activity (Lakatos et al. 2005).
Cross-modal responses during an intermodal selective attention task were observed in response
to unimodal stimuli (Lakatos et al. 2008, 2009). What would be the effect of a phase reset when
auditory and visual stimuli are presented simultaneously? Wang et al. (2008) analyzed neuronal
firing responses to light with or without paired auditory noise stimuli using single-unit recordings
in V1. When stimuli were presented passively, firing rate in a population of V1 neurons increased
and remained high for 500 ms. V1 population responses to a visual target without sound during
visual detection tasks appeared as double peaks in a temporal pattern. The timing of each peak after
response onset was in the range of cycle length of gamma or theta frequency bands. In response to
AV stimuli, an additional peak near the time frame of a full delta cycle showed up in the temporal
firing pattern. Although translation of firing activity into underlying membrane potential is not
straightforward, those activity parameters are roughly monotonically proportional to each other
(e.g., Anderson et al. 2000). Thus, the oscillatory pattern of neuronal firing suggests oscillatory
modulation of neuronal excitability by the nonauditory stimuli.

5.5.2  Phase Reset: Dependence on Types of Stimuli


How would phase reset work in response to stimuli with complex temporal envelopes? Sounds and movies of vocalizations are popular stimuli in studies of AV integration in auditory cortical areas and STP in nonhuman primates. Because a vocalization starts with visible facial movement before a sound is generated, phase reset by the visible movement pattern is in a position to affect the processing of the sound that follows. Kayser et al. (2008) showed changes in LFP frequency bands (around and below 10 Hz) consistent with this prediction; that is, they observed phase reset and increases in excitability at the time the response to the auditory component of complex AV stimuli began in A1. When phase reset occurred, it was accompanied by enhanced firing responses.
The frequency bands in which visual inputs produced phase reset differed between Kayser et al. (2008) and Lakatos et al. (2009): the latter showed that cross-modal phase reset in A1 and V1 occurred in the theta (below 10 Hz) and gamma (above 25 Hz) bands, leaving the 10 to 25 Hz band unaffected, whereas Kayser et al. observed phase reset by visual input alone across the 5 to 25 Hz range. The differences between the results of these studies are likely attributable to differences in the visual stimuli.
Lakatos et al. (2009) did not examine whether phase reset of ongoing oscillatory activity at
theta and gamma bands contributed to AV integration because their task did not present auditory
and visual stimuli simultaneously. Kayser et al. (2008) showed that enhanced neuronal firing responses to AV stimuli, compared with auditory stimuli, correlated with the occurrence of phase reset at about 10 Hz, underscoring the importance of reset in that band for AV response enhancement. Also, the difference in the frequency bands of phase reset by visual stimuli between the Lakatos et al. and Kayser et al. studies suggests that the frequency of oscillation influenced by cross-modal inputs depends on the conditions of attention and stimulation.
Is phase reset a phenomenon beyond primary sensory cortices? This question is open. At least
STP clearly receives feedforward excitatory input from several modalities (Schroeder and Foxe
2002). The contribution of oscillatory phase reset in STP and other higher-order multisensory areas
has not been examined in detail, although the suspicion is that phase reset may have more to do
with attentional modulation than multisensory representation.

5.6  IMPORTANCE OF SALIENCE IN LOW-LEVEL MULTISENSORY INTERACTIONS


Variations in AV integration effects according to saliency and attentional conditions are so perva-
sive that some have begun to wonder if attention is a prerequisite to integration (Navarra et al. 2010).
However, AV integration has been observed in many higher cortical areas even when subjects were
only required to maintain visual fixation without further demands of a task (PFC, Sugihara et al.
2006; STP, Barraclough et al. 2005; AC, Ghazanfar et al. 2005; Kayser et al. 2008). Does this mean
that audiovisual interactions happen automatically? The answer may depend on the level of the system being studied, as well as on the behavioral state, as discussed below.

5.6.1  Role of (Top-Down) Attention


There is strong evidence that top-down attention is required in order for AV integration to take
place in primary sensory cortices. Using an intermodal selective attention task, Lakatos et al. (2008,
2009) showed that the manifestation of visual influence in A1 and auditory influence in V1 was
dependent on attention. If a stimulus was ignored, its cross-modal influence could not be detected.
The selective role of sensory attention illustrated above contrasts with findings showing that attention to either modality can elicit AV effects. Wang et al. (2008) showed that V1 neuronal responses were influenced by auditory stimuli only when monkeys performed a purely visual localization task. Similarly, in humans, task-irrelevant sound promoted the detection of phosphenes induced by TMS over the visual cortex during a task requiring only visual attention (Romei et al. 2007, 2009). Thus, tasks requiring either auditory (Lakatos et al. 2009) or visual (Romei et al. 2007, 2009; Wang et al. 2008) attention both rendered auditory influences observable in V1. This apparent
disagreement is most likely because of differences in the role of unattended sensory stimuli during
those tasks.
In the visual localization task (Wang et al. 2008), monkeys needed to react rapidly to localize visual targets. Task-irrelevant auditory stimuli occurred in half of the trials and were always delivered in temporal congruence with the visual targets and at a fixed central location. In this task, the status of the sound is key. Auditory stimuli, when delivered, were always informative and thus could act as an instruction, much like the verbal instruction given to subjects performing visual localization in Posner's classic study (Posner et al. 1980). Therefore, it is possible that the monkeys paid attention to these informative auditory stimuli, in addition to the visual stimuli, in order to perform the visual localization task. In a similar vein, responses to visual events in the auditory discrimination task of Brosch et al. (2005) may be regarded as responses to an informative cross-modal cue for performing the task, although again, the effects of overtraining must also be considered.
In the intermodal attention task (Lakatos et al. 2008, 2009), subjects did not have to spread
their spatial attention to different locations because visual and auditory stimuli were spatially con-
gruent. However, those stimuli were temporally incongruent, being divided into two asynchronous streams. Furthermore, while the monkeys had to monitor the sequence in one modality, deviants also appeared in the other sequence, and the monkeys had to refrain from responding to them. The easiest way to perform such a task would be to plug one's ears when watching and to close one's eyes when listening. Prevented from using these strategies, all the monkeys could actually do was to pay attention to the cued modality while at the same time ignoring the other stream, in order to perform the task.
Although it may be impossible to determine what monkeys are actually attending to during any
given task, it can be argued that monkeys do not ignore informative sounds based on the observa-
tion of auditory influence on visual response in V1 (Wang et al. 2008). Further studies are needed to
determine how attentional conditions influence AV integration. It would be interesting to see whether an auditory influence would still be observable in a visual localization task, as in the study of Wang et al. (2008), if the auditory stimuli were made spatially and temporally incongruent with the visual stimuli, thereby acting as distracters.
Auditory attention has also been suggested to play a role in evoking auditory response in LIP
(Linden et al. 1999) and PFC (Vaadia et al. 1986). Further clarification of the role of attention in
higher associative areas, such as the PFC, is very important because many models assume that those
cortices impose attentional control over lower cortices.

5.6.2  Attention or Saliency of Stimuli


Degrees of attentional focus and ranges of stimulus saliency surely have differential effects on AV
integration. It is difficult to argue that monkeys monitor AV stimuli during simple tasks such as fixation, because monkeys will receive a reward regardless of what happens during stimulus presentation. However, monkeys are certainly alert in such a condition. Even though the mandated level of attention differs from that of active monitoring, such weak attention, or the lack of competing stimulation, may be enough to induce audiovisual integration.
Besides attentional requirements, there are differences in stimulus saliency between simple stimuli, such as flashes and tones, and complex stimuli such as faces. It is well known that meaningful visual stimuli attract attention in a behaviorally observable manner. The eyes and mouths of vocalizing individuals draw a subject's gaze (Ghazanfar et al. 2006). Thus, it is possible that highly salient
stimuli may passively induce AV effects in the absence of explicit requirements to attend.
Certain forms of AV effects in adult animals occur only after training (Grunewald et al. 1999;
Woods and Recanzone 2004). In that sense, perception of vocalization has already been acquired
by life-long training in monkeys. We may suppose that AV integration is essential for acquisi-
tion of communication skills in nonhuman primates. Once trained, AV integration may become “prepotent,” requiring less attention, and may proceed “effortlessly.”

5.7  CONCLUSIONS, UNRESOLVED ISSUES, AND QUESTIONS FOR FUTURE STUDIES


Compared to human studies, behavioral studies of AV integration in nonhuman primates are still
relatively rare. The ability to record behavior and local neural activity simultaneously has helped to reconcile multisensory findings in humans and to expand our understanding of how AV integration occurs in the nervous system. Below, we list several issues to be addressed in the future.

5.7.1  Complex AV Interactions


Tasks requiring linguistic ability may be out of reach for experiments involving nonhuman pri-
mates; however, visual tasks of high complexity have been used in previous studies. Considering
that many AV effects in humans were seen with purely visual tasks, it may be possible to train mon-
keys to perform complex visual tasks and then study the effect of auditory presentation on visual
performance.

5.7.2  Anatomical Substrates of AV Interaction


The anatomical substrates of cross-modal inputs to primary sensory cortices (de la Mothe et al.
2006b; Cappe and Barone 2005; Cappe et al. 2009; Falchier et al. 2002, 2010; Hackett et al. 2007;
Rockland and Ojima 2003; Smiley et al. 2007) provide the basis for the models of routes for AV
integration. These data show that two types of corticocortical inputs (feedback and lateral connections), as well as thalamocortical and subcortical inputs from nonspecific and multisensory thalamic nuclei, are potential pathways mediating early multisensory convergence and integration.
The challenge here is to discriminate the influence of each of these pathways during a behavioral
task. It is probable that the weight of these different pathways is defined by the sensory context as
well as by the nature of the task objective.

5.7.3  Implication of Motor Systems in Modulation of Reaction Time


Brain structures showing AV responses include parts not only of sensory systems but also of motor systems. Facilitated reaction times in both saccadic and manual responses raise the issue of whether
enhancement occurs only in sensory systems or somewhere else as well. Because Miller et al. (2001) showed that motor cortical activation triggered by sensory stimuli reflects sensory signals that have already been integrated by the stage of primary motor cortex, it is possible that activation of the PPC, PFC, and particularly the PM areas or the SC is facilitated by redundant sensory inputs. These possibilities have not been fully discerned yet. The possibility of additional sources of facilitated reaction time was also sug-
gested by the findings of Wang et al. (2008). When intense visual stimuli were presented, additional
auditory stimuli did not affect the visual response in V1, but they did influence saccadic reaction times. This suggests either that the visual response is facilitated somewhere in the visual system outside of V1 or
that auditory stimuli directly affect motor responses.

5.7.4  Facilitation or Information?


In general, larger neuronal responses can be beneficial for faster reactions to, and discrimination of, events because they have faster onset latencies and better signal-to-noise ratios. Which coding strategy, or strategies, neurons adopt as they respond to stimuli remains to be discerned. For example, visual localization tasks require not only fast reaction times but also good discrimination of visual target location. Visual influences on ongoing oscillations through phase reset mechanisms, and the consequences of such modulation for response magnitude, have been shown by several groups. Additionally, Kayser et al. (2010) have shown that visual influences may tune the auditory response by increasing its signal-to-noise ratio and thereby its information capacity. Because it is not known which aspect of the neuronal response the brain utilizes, it is desirable to compare mechanisms of modulation with behavioral responses.

5.7.5  Inverse Effectiveness and Temporal Interaction


Inverse effectiveness states that multisensory integration is most effective when weak stimuli are presented. Most electrophysiological studies of AV integration in monkey auditory cortex, however, utilize loud sounds. Low stimulus intensity can degrade the temporal response pattern of sensory neurons. Such an effect would be more prominent for complex stimuli, such as vocal sounds, because smaller peaks in the temporal envelope (e.g., the first envelope peak of a macaque grunt call) may be missed in auditory encoding. The condition of weak sound is relevant to Sumby and Pollack's (1954) classic observation of inverse effectiveness in human speech. It is thus important to investigate how AV integration works under degraded conditions. It is possible that degraded stimuli would reveal a more central role of attention, because weaker stimuli require more attention to be discerned. Also, the altered timing of peaks in the response to a weak vocal sound may interact differently with the excitability phases of ongoing oscillations, leading to different patterns of enhancement.

5.7.6  What Drives and What Is Driven by Oscillations?


Recent studies of AV integration in AC and STP stress the importance of oscillatory neuronal
activity. Oscillations in field potentials and CSD reflect rhythmic net excitability fluctuations of the
local neuronal ensemble in sensory cortical areas. Although numerous hypotheses are available,
the role of oscillatory modulation in other structures is unknown. Endogenous attention may also
be reflected in ongoing activity by top-down modulation. Its interaction with bottom-up sensory
activation can contribute to and be influenced by oscillatory dynamics. This is an extremely fruitful
area for future studies.

5.7.7  Role of Attention


Although some multisensory studies in monkeys did control for attention, most were done without attention being specifically controlled. The former studies provide ample evidence for a
definitive role of sensory attention in AV integration. To get a clear picture of the role attention plays in multisensory interactions, more studies are needed in which attention, even unimodal attention, is controlled through behavioral tasks and stimuli. It will also be important to investigate attentional load, because differences in selective attention may emerge only under high-load conditions: under a high attentional load in the attended modality, subjects may try to ignore stimuli of the irrelevant modalities, either consciously or unconsciously.

ACKNOWLEDGMENT
This work was supported by grant nos. K01MH082415, R21DC10415, and R01MH61989.

REFERENCES
Aggleton, J.P., and M. Mishkin. 1990. Visual impairments in macaques following inferior temporal lesions are
exacerbated selectively by additional damage to superior temporal sulcus. Behavioural Brain Research
39:262–274.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2008. Subthreshold auditory inputs to extrastriate visual
neurons are responsive to parametric changes in stimulus quality: Sensory-specific versus non-specific
coding. Brain Research 1242:95–101.
Allman, B.L., and M.A. Meredith. 2007. Multisensory processing in “unimodal” neurons: Cross-modal sub-
threshold auditory effects in cat extrastriate visual cortex. Journal of Neurophysiology 98:545–549.
Allon, N., and Z. Wollberg. 1978. Responses of cells in the superior colliculus of the squirrel monkey to audi-
tory stimuli. Brain Research 159:321–330.
Andersen, R.A., L.H. Snyder, D.C. Bradley, and J. Xing. 1997. Multimodal representation of space in
the posterior parietal cortex and its use in planning movements. Annual Review of Neuroscience
20:303–330.
Anderson, J., I. Lampl, I. Reichova, M. Carandini, and D. Ferster. 2000. Stimulus dependence of two-state
fluctuations of membrane potential in cat visual cortex. Nature Neuroscience 3:617–621.
Anderson, K.C., and R.M. Siegel. 1999. Optic flow selectivity in the anterior superior temporal polysensory
area, STPa, of the behaving monkey. Journal of Neuroscience 19:2681–2691.
Anderson, K.C., and R.M. Siegel. 2005. Three-dimensional structure-from-motion selectivity in the anterior
superior temporal polysensory area STPs of the behaving monkey. Cerebral Cortex 15:1299–1307.
Aosaki, T., M. Kimura, and A.M. Graybiel. 1995. Temporal and spatial characteristics of tonically active neu-
rons of the primate’s striatum. Journal of Neurophysiology 73:1234–1252.
Aou, S., Y. Oomura, H. Nishino, et al. 1983. Functional heterogeneity of single neuronal activity in the monkey
dorsolateral prefrontal cortex. Brain Research 260:121–124.
Artchakov, D., D. Tikhonravov, V. Vuontela, I. Linnankoski, A. Korvenoja, and S. Carlson. 2007. Processing of
auditory and visual location information in the monkey prefrontal cortex. Experimental Brain Research
180:469–479.
Azuma, M., and H. Suzuki. 1984. Properties and distribution of auditory neurons in the dorsolateral prefrontal
cortex of the alert monkey. Brain Research 298:343–346.
Baizer, J.S., L.G. Ungerleider, and R. Desimone. 1991. Organization of visual inputs to the inferior temporal
and posterior parietal cortex in macaques. Journal of Neuroscience 11:168–190.
Baizer, J.S., R. Desimone, and L.G. Ungerleider. 1993. Comparison of subcortical connections of inferior tem-
poral and posterior parietal cortex in monkeys. Visual Neuroscience 10:59–72.
Barbas, H., H. Ghashghaei, S.M. Dombrowski, and N.L. Rempel-Clower. 1999. Medial prefrontal cortices
are unified by common connections with superior temporal cortices and distinguished by input from
memory-related areas in the rhesus monkey. Journal of Comparative Neurology 410:343–367.
Barbas, H., and M.M. Mesulam. 1981. Organization of afferent input to subdivisions of area 8 in the rhesus
monkey. Journal of Comparative Neurology 200:407–431.
Barnes, C.L., and D.N. Pandya. 1992. Efferent cortical connections of multimodal cortex of the superior tem-
poral sulcus in the rhesus monkey. Journal of Comparative Neurology 318:222–244.
Barraclough, N.E., D. Xiao, C.I. Baker, M.W. Oram, and D.I. Perrett. 2005. Integration of visual and auditory
information by superior temporal sulcus neurons responsive to the sight of actions. Journal of Cognitive
Neuroscience 17:377–391.
Baylis, G.C., E.T. Rolls, and C.M. Leonard. 1987. Functional subdivisions of the temporal lobe neocortex.
Journal of Neuroscience 7:330–342.
Bell, A.H., B.D. Corneil, D.P. Munoz, and M.A. Meredith. 2003. Engagement of visual fixation suppresses sen-
sory responsiveness and multisensory integration in the primate superior colliculus. European Journal of
Neuroscience 18:2867–2873.
Benevento, L.A., J. Fallon, B.J. Davis, and M. Rezak. 1977. Auditory–visual interaction in single cells in the
cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey. Experimental
Neurology 57:849–872.
Besle, J., Bertrand, O., and Giard, M.H. 2009. Electrophysiological (EEG, sEEG, MEG) evidence for multiple
audiovisual interactions in the human auditory cortex. Hearing Research 258:143–151.
Blatt, G.J., D.N. Pandya, and D.L. Rosene. 2003. Parcellation of cortical afferents to three distinct sectors in
the parahippocampal gyrus of the rhesus monkey: An anatomical and neurophysiological study. Journal
of Comparative Neurology 466:161–179.
Bolognini, N., I. Senna, A. Maravita, A. Pascual-Leone, and L.B. Merabet. 2010. Auditory enhancement
of visual phosphene perception: The effect of temporal and spatial factors and of stimulus intensity.
Neuroscience Letters 477:109–114.
Bon, L., and C. Lucchetti. 2006. Auditory environmental cells and visual fixation effect in area 8B of macaque
monkey. Experimental Brain Research 168:441–449.
Born, R.T., and D.C. Bradley. 2005. Structure and function of visual area MT. Annual Review of Neuroscience
28:157–189.
Brosch, M., E. Selezneva, and H. Scheich. 2005. Nonauditory events of a behavioral procedure activate audi-
tory cortex of highly trained monkeys. Journal of Neuroscience. 25:6797–6806.
Brothers, L., B. Ring, and A. Kling. 1990. Response of neurons in the macaque amygdala to complex social
stimuli. Behavioural Brain Research 41:199–213.
Bruce, C.J., R. Desimone, and C.G. Gross. 1981. Visual properties of neurons in polysensory area in superior
temporal sulcus of the macaque. Journal of Neurophysiology 46:369–384.
Bruce, C.J., R. Desimone, and C.G. Gross. 1986. Both striate cortex and superior colliculus contributes to
visual properties of neurons in superior temporal polysensory area of macaque monkey. Journal of
Neurophysiology 55:1057–1075.
Burton, H., and E.G. Jones. 1976. The posterior thalamic region and its cortical projection in new world and old
world monkeys. Journal of Comparative Neurology 168:249–302.
Carmichael, S.T., and J.L. Price. 1995. Sensory and premotor connections of the orbital and medial prefrontal
cortex of macaque monkeys. Journal of Comparative Neurology 363:642–664.
Calvert, G.A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies.
Cerebral Cortex 11:1110–1123.
Calvert, G.A., and R. Campbell. 2003. Reading speech from still and moving faces: The neural substrates of
visible speech. Journal of Cognitive Neuroscience 15:57–70.
Campanella, S., and P. Belin. 2007. Integrating face and voice in person perception. Trends in Cognitive
Sciences 11:535–543.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. European Journal of Neuroscience 22:2886–2902.
Cappe, C., A. Morel, P. Barone, and E. Rouiller. 2009. The thalamocortical projection systems in primate: An
anatomical support for multisensory and sensorimotor interplay. Cerebral Cortex 19:2025–2037.
Cappe, C., M.M. Murray, P. Barone, and E.M. Rouiller. 2010. Multisensory facilitation of behavior in monkeys:
Effects of stimulus intensity. Journal of Cognitive Neuroscience 22:2850–2863.
Cate, A.D., T.J. Herron, E.W. Yund, et al. 2009. Auditory attention activates peripheral visual cortex. PLoS
ONE 4:e4645.
Cavada, C., and P.S. Goldman-Rakic. 1989a. Posterior parietal cortex in rhesus monkey: I. Parcellation of areas
based on distinctive limbic and sensory corticocortical connections. Journal of Comparative Neurology
287:393–421.
Cavada, C., and P.S. Goldman-Rakic. 1989b. Posterior parietal cortex in rhesus monkey: II. Evidence for
segregated corticocortical networks linking sensory and limbic areas with the frontal lobe. Journal of
Comparative Neurology 287:422–445.
Cavada, C., T. Company, J. Tejedor, R.J. Cruz-Rizzolo, and F. Reinoso-Suarez. 2000. The anatomical connec-
tions of the macaque monkey orbitofrontal cortex. A review. Cerebral Cortex 10:220–242.
Chakladar, S., N.K. Logothetis, and C.I. Petkov. 2008. Morphing rhesus monkey vocalizations. Journal of
Neuroscience Methods 170:45–55.
Chandrasekaran, C., and A.A. Ghazanfar. 2009. Different neural frequency bands integrate faces and voices
differently in the superior temporal sulcus. Journal of Neurophysiology 101:773–788.
Chandrasekaran, C., A. Trubanova, S. Stillittano, A. Caplier, and A.A. Ghazanfar. 2009. The natural statistics
of audiovisual speech. PLoS Computational Biology 5:e1000436.
Chen, C.M., P. Lakatos, A.S. Shah, et al. 2007. Functional anatomy and interaction of fast and slow visual
pathways in macaque monkeys. Cerebral Cortex 17:1561–1569.
Cheney, D.L., and Seyfarth, R.M. 1990. How Monkeys See the World. Chicago: Univ. of Chicago Press.
Ciaramitaro, V.M., G.T. Buracas, and G.M. Boynton. 2007. Spatial and crossmodal attention alter responses to
unattended sensory information in early visual and auditory human cortex. Journal of Neurophysiology
98:2399–2413.
Clower, D.M., R.A. West, J.C. Lynch, and P.L. Strick. 2001. The inferior parietal lobule is the target of output
from the superior colliculus, hippocampus, and cerebellum. Journal of Neuroscience. 21:6283–6291.
Cohen, Y.E. 2009. Multimodal activity in the parietal cortex. Hearing Research 258:100–105.
Cohen, Y.E., and R.A. Andersen. 2000. Reaches to sounds encoded in an eye-centered reference frame. Neuron
27:647–652.
Cohen, Y.E., and R.A. Andersen. 2002. A common reference frame for movement plans in the posterior parietal
cortex. Nature Reviews. Neuroscience 3:553–562.
Cohen, Y.E., A.P. Batista, and R.A. Andersen. 2002. Comparison of neural activity preceding reaches to audi-
tory and visual stimuli in the parietal reach region. Neuroreport 13:891–894.
Cohen, Y.E., I.S. Cohen, and G.W. Gifford III. 2004. Modulation of LIP activity by predictive auditory and
visual cues. Cerebral Cortex 14:1287–1301.
Cohen, Y.E., B.E. Russ, S.J. Davis, A.E. Baker, A.L. Ackelson, and R. Nitecki. 2009. A functional role for the
ventrolateral prefrontal cortex in non-spatial auditory cognition. Proceedings of the National Academy of
Sciences of the United States of America 106:20045–20050.
Colombo, M., and C.G. Gross. 1994. Responses of inferior temporal cortex and hippocampal neurons
during delayed matching to sample in monkeys (Macaca fascicularis). Behavioral Neuroscience
108:443–455.
Colombo, M., H.R. Rodman, and C.G. Gross. 1996. The effects of superior temporal cortex lesions on the
processing and retention of auditory information in monkeys (Cebus apella). Journal of Neuroscience.
16:4501–4517.
Cooke, D.F., and M.S.A. Graziano. 2004a. Super-flinchers and nerves of steel: Defensive movements altered
by chemical manipulation of a cortical motor area. Neuron 43:585–593.
Cooke, D.F., and M.S.A. Graziano. 2004b. Sensorimotor integration in the precentral gyrus: Polysensory neu-
rons and defensive movements. Journal of Neurophysiology 91:1648–1660.
Cui, Q.N., L. Bachus, E. Knoth, W.E. O’Neill, and G.D. Paige. 2008. Eye position and cross-sensory learning
both contribute to prism adaptation of auditory space. Progress in Brain Research 171:265–270.
Cui, J., M. Wilke, N.K. Logothetis, D.A. Leopold, and H. Liang. 2009. Visibility states modulate microsaccade
rate and direction. Vision Research 49:228–236.
Cusick, C.G., B. Seltzer, M. Cola, and E. Griggs. 1995. Chemoarchitectonics and corticocortical terminations
within the superior temporal sulcus of the rhesus monkey: Evidence for subdivisions of superior temporal
polysensory cortex. Journal of Comparative Neurology 360:513–535.
Cynader, M., and N. Berman. 1972. Receptive field organization of monkey superior colliculus. Journal of
Neurophysiology 35:187–201.
Dahl, C.D., N.K. Logothetis, and C. Kayser. 2009. Spatial organization of multisensory responses in temporal
association cortex. Journal of Neuroscience. 29:11924–11932.
de la Mothe, L.A., S. Blumell, Y. Kajikawa, and T.A. Hackett. 2006a. Cortical connections of the auditory cortex
in marmoset monkeys: Core and medial belt regions. Journal of Comparative Neurology 496:27–71.
de la Mothe, L.A., S. Blumell, Y. Kajikawa, and T.A. Hackett. 2006b. Thalamic connections of the auditory cor-
tex in marmoset monkeys: Core and medial belt regions. Journal of Comparative Neurology 496:72–96.
De Souza, W.C., S. Eifuku, R. Tamura, H. Nishijo, and T. Ono. 2005. Differential characteristics of face neu-
ron responses within the anterior superior temporal sulcus of macaques. Journal of Neurophysiology
94:1251–1566.
Dehner, L.R., L.P. Keniston, H.R. Clemo, and M.A. Meredith. 2004. Cross-modal circuitry between auditory
and somatosensory areas of the cat anterior ectosylvian sulcal cortex: A ‘new’ inhibitory form of multi-
sensory convergence. Cerebral Cortex 14:387–403.
Desimone, R., and C.G. Gross. 1979. Visual areas in the temporal cortex of the macaque. Brain Research
178:363–380.
Diederich, A., and H. Colonius. 2004. Modeling the time course of multisensory interaction in manual and
saccadic responses. In Handbook of Multisensory Processes, ed. G. Calvert, C. Spence, and B.E. Stein,
373–394. Cambridge, MA: MIT Press.
Disbrow, E., E. Litinas, G.H. Recanzone, J. Padberg, and L. Krubitzer. 2003. Cortical connections of the sec-
ond somatosensory area and the parietal ventral area in macaque monkeys. Journal of Comparative
Neurology 462:382–399.
Dobelle, W.H., M.G. Mladejovsky, and J.P. Girvin. 1974. Artificial vision for the blind: Electrical stimulation
of visual cortex offers hope for a functional prosthesis. Science 183:440–444.
Duffy, C.J., and R.H. Wurtz. 1991. Sensitivity of MST neurons to optic flow stimuli: I. A continuum of response
selectivity to large-field stimuli. Journal of Neurophysiology 65:1329–1345.
Eaccott, M.J., C.A. Heywood, C.G. Gross, and A. Cowey. 1993. Visual discrimination impairments fol-
lowing lesions of the superior temporal sulcus are not specific for facial stimuli. Neuropsychologia
31:609–619.
Eifuku, S., W.C. De Souza, R. Tamura, H. Nishijo, and T. Ono. 2004. Neuronal correlates of face identification
in the monkey anterior temporal cortical areas. Journal of Neurophysiology 91:358–371.
Engbert, R., and R. Kliegl. 2003. Microsaccades uncover the orientation of covert attention. Vision Research
43:1035–1045.
Evans, T.A., S. Howell, and G.C. Westergaard. 2005. Auditory–visual cross-modal perception of communica-
tive stimuli in tufted capuchin monkeys (Cebus apella). Journal of Experimental Psychology. Animal
Behavior Processes 31:399–406.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience. 22:5749–5759.
Falchier, A., C.E. Schroeder, T.A. Hackett, et al. 2010. Projection from visual areas V2 and prostriata to caudal
auditory cortex in the monkey. Cerebral Cortex 20:1529–1538.
Felleman, D.J., and J.H. Kaas. 1984. Receptive field properties of neurons in middle temporal visual area (MT)
of owl monkeys. Journal of Neurophysiology 52:488–513.
Fogassi, L., V. Gallese, L. Fadiga, F. Luppino, M. Matelli, and G. Rizzolatti. 1996. Coding of peripersonal
space in inferior premotor cortex (area F4). Journal of Neurophysiology 76:141–157.
Frens, M.A., and A.J. Van Opstal. 1998. Visual–auditory interactions modulate saccade-related activity in mon-
key superior colliculus. Brain Research Bulletin 46:211–224.
Frens, M.A., A.J. Van Opstal, and R.F. Van der Willigen. 1995. Spatial and temporal factors determine auditory–visual interactions in human saccadic eye movements. Perception & Psychophysics 57:802–816.
Fu, K.G., T.A. Johnston, A.S. Shah, et al. 2003. Auditory cortical neurons respond to somatosensory stimula-
tion. Journal of Neuroscience. 23:7510–7515.
Fu, K.G., A.S. Shah, M.N. O’Connell, et al. 2004. Timing and laminar profile of eye-position effects on audi-
tory responses in primate auditory cortex. Journal of Neurophysiology 92:3522–3531.
Fuster, J.M., M. Bodner, and J.K. Kroger. 2000. Cross-modal and cross-temporal association in neurons of
frontal cortex. Nature 405:347–351.
Gaffan, D., and S. Harrison. 1991. Auditory–visual associations, hemispheric specialization and temporal–
frontal interaction in the rhesus monkey. Brain 114:2133–2144.
Ghazanfar, A.A., and N.K. Logothetis. 2003. Facial expressions linked to monkey calls. Nature 423:934–934.
Ghazanfar, A.A., and L.R. Santos. 2004. Primate brains in the wild: The sensory bases for social interactions.
Nature Reviews. Neuroscience 5:603–616.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences
10:278–285.
Ghazanfar, A.A., J.G. Neuhoff, and N.K. Logothetis. 2002. Auditory looming perception in rhesus monkeys.
Proceedings of the National Academy of Sciences of the United States of America 99:15755–15757.
Ghazanfar, A.A., J.X. Maier, K.L. Hoffman, and N.K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience. 25:5004–5012.
Ghazanfar, A.A., K. Nielsen, and N.K. Logothetis. 2006. Eye movements of monkey observers viewing vocal-
izing conspecifics. Cognition 101:515–529.
Ghazanfar, A.A., C. Chandrasekaran, and N.K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience. 28:4457–4469.
Giard, M.H., and F. Peronnet. 1999. Auditory–visual integration during multimodal object recognition in
humans: A behavioral and electrophysiological study. Journal of Cognitive Neuroscience 11:473–490.
Gibson, J.R., and J.H.R. Maunsell. 1997. Sensory modality specificity of neural activity related to memory in
visual cortex. Journal of Neurophysiology 78:1263–1275.
Gifford III, G.W., and Y.E. Cohen. 2005. Spatial and non-spatial auditory processing in the lateral intraparietal
area. Experimental Brain Research 162:509–512.
Gifford III, G.W., K.A. MacLean, M.D. Hauser, and Y.E. Cohen. 2005. The neurophysiology of functionally
meaningful categories: Macaque ventrolateral prefrontal cortex plays a critical role in spontaneous cat-
egorization of species-specific vocalizations. Journal of Cognitive Neuroscience 17:1471–1482.
Goldman-Rakic, P.S., A.R. Cools, and K. Srivastava. 1996. The prefrontal landscape: Implications of functional
architecture for understanding human mentation and the central executive. Philosophical Transactions of
the Royal Society of London. Series B, Biological Sciences 351:1445–1453.
Goodale, M.A., and A.D. Milner. 1992. Separate visual pathways for perception and action. Trends in
Neurosciences 15:20–25.
Graziano, M.S.A., and S. Gandhi. 2000. Location of the polysensory zone in the precentral gyrus of anesthe-
tized monkeys. Experimental Brain Research 135:259–266.
Graziano, M.S.A., X.T. Hu, and C.G. Gross. 1997. Visuospatial properties of ventral premotor cortex. Journal
of Neurophysiology 77:2268–2292.
Graziano, M.S.A., L.A.J. Reiss, and C.G. Gross. 1999. A neuronal representation of the location of nearby
sounds. Nature 397:428–430.
Graziano, M.S.A., G.S. Yap, and C.G. Gross. 1994. Coding of visual space by premotor neurons. Science
266:1054–1057.
Green, K.P., P.K. Kuhl, A.N. Meltzoff, and E.B. Stevens. 1991. Integrating speech information across talk-
ers, gender, and sensory modality: Female faces and male voices in the McGurk effect. Perception &
Psychophysics 50:524–536.
Groh, J.M., A.S. Trause, A.M. Underhill, K.R. Clark, and S. Inati. 2001. Eye position influences auditory
responses in primate inferior colliculus. Neuron 29:509–518.
Grunewald, A., J.F. Linden, and R.A. Andersen. 1999. Responses to auditory stimuli in macaque lateral intra-
parietal area I. Effects of training. Journal of Neurophysiology 82:330–342.
Gu, Y., D.E. Angelaki, and G.C. DeAngelis. 2008. Neural correlates of multisensory cue integration in macaque
MSTd. Nature Neuroscience 11:1201–1210.
Hackett, T.A. 2002. The comparative anatomy of the primate auditory cortex. In: Primate Audition: Ethology
and Neurobiology, ed. Asif A. Ghazanfar, 199–226. Boca Raton, FL: CRC.
Hackett, T.A., L.A. de la Mothe, I. Ulbert, G. Karmos, J.F. Smiley, and C.E. Schroeder. 2007. Multisensory
convergence in auditory cortex: II. Thalamocortical connections of the caudal superior temporal plane.
Journal of Comparative Neurology 502:924–952.
Hackett, T.A., T.M. Preuss, and J.H. Kaas. 2001. Architectonic identification of the core region in auditory
cortex of macaques, chimpanzees, and humans. Journal of Comparative Neurology 441:197–222.
Hackett, T.A., I. Stepniewska, and J.H. Kaas. 1999. Prefrontal connections of the parabelt auditory cortex in
macaque monkeys. Brain Research 817:45–58.
Hairston, W.D., D.A. Hodges, J.H. Burdette, and M.T. Wallace. 2006. Auditory enhancement of visual tempo-
ral order judgment. Neuroreport 17:791–795.
Hikosaka, K., E. Iwai, H. Saito, and K. Tanaka. 1988. Polysensory properties of neurons in the anterior
bank of the caudal superior temporal sulcus of the macaque monkey. Journal of Neurophysiology
60:1615–1637.
Hikosaka, O., M. Sakamoto, and S. Usui. 1989. Functional properties of monkey caudate neurons: II. Visual
and auditory responses. Journal of Neurophysiology 61:799–813.
Hoffman, K.L., A.A. Ghazanfar, I. Gauthier, and N.K. Logothetis. 2008. Category-specific responses to faces
and objects in primate auditory cortex. Frontiers in Systems Neuroscience 1:2.
Hoffman, K.L., K.M. Gothard, M.C. Schmid, and N.K. Logothetis. 2007. Facial-expression and gaze-selective
responses in the monkey amygdala. Current Biology 17:766–772.
Ito, S. 1982. Prefrontal activity of macaque monkeys during auditory and visual reaction time tasks. Brain
Research 247:39–47.
Iversen, S.D., and M. Mishkin. 1973. Comparison of superior temporal and inferior prefrontal lesions on audi-
tory and non-auditory task in rhesus monkeys. Brain Research 55:355–367.
Izumi, A., and S. Kojima. 2004. Matching vocalizations to vocalizing faces in chimpanzee (Pan troglodytes).
Animal Cognition 7:179–184.
Jääskeläinen, I.P., J. Ahveninen, J.W. Belliveau, T. Raij, and M. Sams. 2007. Short-term plasticity in auditory
cognition. Trends in Neurosciences 30:653–661.
Jay, M.F., and D.L. Sparks. 1984. Auditory receptive fields in primate superior colliculus shift with changes in
eye position. Nature 309:345–347.
Jones, E.G. 1998. Viewpoint: The core and matrix of thalamic organization. Neuroscience 85:331–345.
Jordan, K.E., E.M. Brannon, N.K. Logothetis, and A.A. Ghazanfar. 2005. Monkeys match the number of voices
they hear to the number of faces they see. Current Biology 15:1034–1038.
Joseph, J.P., and P. Barone. 1987. Prefrontal unit activity during a delayed oculomotor task in the monkey.
Experimental Brain Research 67:460–468.
Kaas, J.H., and T.A. Hackett. 2000. Subdivisions of auditory cortex and processing streams in primates.
Proceedings of the National Academy of Sciences of the United States of America 97:11793–11799.
Kajikawa, Y., C.E. Schroeder. 2008. Face–voice integration and vocalization processing in the monkey.
Abstracts Society for Neuroscience 852.22.
Kayser, C., and N.K. Logothetis. 2009. Directed interactions between auditory and superior temporal cortices
and their role in sensory integration. Frontiers in Integrative Neuroscience 3:7.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2005. Integration of touch and sound in auditory
cortex. Neuron 48:373–384.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2007. Functional imaging reveals visual modulation
of specific fields in auditory cortex. Journal of Neuroscience 27:1824–1835.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–1574.
Kayser, C., N.K. Logothetis, and S. Panzeri. 2010. Visual enhancement of the information representation in
auditory cortex. Current Biology 20:19–24.
Keysers, C., E. Kohler, M.A. Umilta, L. Nanetti, L. Fogassi, and V. Gallese. 2003. Audiovisual mirror neurons
and action recognition. Experimental Brain Research 153:628–636.
Kikuchi-Yorioka, Y., and T. Sawaguchi. 2000. Parallel visuospatial and audiospatial working memory pro-
cesses in the monkey dorsolateral prefrontal cortex. Nature Neuroscience 3:1075–1076.
Kimura, M. 1992. Behavioral modulation of sensory responses of primate putamen neurons. Brain Research
578:204–214.
Knudsen, E.I., and P.F. Knudsen. 1989. Vision calibrates sound localization in developing barn owls. Journal
of Neuroscience 9:3306–3313.
Kohler, E., C. Keysers, M.A. Umilta, L. Fogassi, V. Gallese, and G. Rizzolatti. 2002. Hearing sounds, under-
standing actions: Action representation in mirror neurons. Science 297:846–848.
Kojima, S., A. Izumi, and M. Ceugniet. 2003. Identification of vocalizers by pant hoots, pant grunts and screams
in a chimpanzee. Primates 44:225–230.
Kondo, H., K.S. Saleem, and J.L. Price. 2003. Differential connections of the temporal pole with the orbital and
medial prefrontal networks in macaque monkeys. Journal of Comparative Neurology 465:499–523.
Kosmal, A., M. Malinowska, and D.M. Kowalska. 1997. Thalamic and amygdaloid connections of the audi-
tory association cortex of the superior temporal gyrus in rhesus monkey (Macaca mulatta). Acta
Neurobiologiae Experimentalis 57:165–188.
Kubota, K., M. Tonoike, and A. Mikami. 1980. Neuronal activity in the monkey dorsolateral prefrontal cortex
during a discrimination task with delay. Brain Research 183:29–42.
Kuraoka, K., and K. Nakamura. 2007. Responses of single neurons in monkey amygdala to facial and vocal
emotions. Journal of Neurophysiology 97:1379–1387.
Lakatos, P., C.-M. Chen, M. O’Connell, A. Mills, and C.E. Schroeder. 2007. Neuronal oscillations and multi-
sensory interaction in primary auditory cortex. Neuron 53:279–292.
Lakatos, P., G. Karmos, A.D. Mehta, I. Ulbert, and C.E. Schroeder. 2008. Entrainment of neural oscillations as
a mechanism of attentional selection. Science 320:110–113.
Lakatos, P., M.N. O’Connell, A. Barczak, A. Mills, D.C. Javitt, and C.E. Schroeder. 2009. The leading sense:
Supramodal control of neurophysiological context by attention. Neuron 64:419–430.
Lakatos, P., A.S. Shaw, K.H. Knuth, I. Ulbert, G. Karmos, and C.E. Schroeder. 2005. An oscillatory hier-
archy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of
Neurophysiology 94:1904–1911.
Lehmann, C., M. Herdener, F. Esposito, et al. 2006. Differential patterns of multisensory interactions in core
and belt areas of human auditory cortex. Neuroimage 31:294–300.
Lehmann, S., and M.M. Murray. 2005. The role of multisensory memories in unisensory object discrimination.
Brain Research. Cognitive Brain Research 24:326–334.
Leonard, C.M., E.T. Rolls, F.A. Wilson, and G.C. Baylis. 1985. Neurons in the amygdala of the monkey with
responses selective for faces. Behavioural Brain Research 15:159–176.
Levy, R., and P.S. Goldman-Rakic. 2000. Segregation of working memory functions within the dorsolateral
prefrontal cortex. Experimental Brain Research 133:23–32.
Lewis, J.W., and D.C. Van Essen. 2000. Corticocortical connections of visual, sensorimotor, and multimodal pro-
cessing areas in the parietal lobe of the macaque monkey. Journal of Comparative Neurology
428:112–137.
Linden, J.F., A. Grunewald, and R.A. Andersen. 1999. Responses to auditory stimuli in macaque lateral intra-
parietal area: II. Behavioral modulation. Journal of Neurophysiology 82:343–358.
Maier, J.X., J.G. Neuhoff, N.K. Logothetis, and A.A. Ghazanfar. 2004. Multisensory integration of looming
signals by rhesus monkeys. Neuron 43:177–181.
Maier, J.X., C. Chandrasekaran, and A.A. Ghazanfar. 2008. Integration of bimodal looming signals through
neuronal coherence in the temporal lobe. Current Biology 18:963–968.
Martinez, L., and T. Matsuzawa. 2009. Auditory–visual intermodal matching based on individual recognition
in a chimpanzee (Pan troglodytes). Animal Cognition 12:S71–S85.
Matsumoto, N., T. Minamimoto, A.M. Graybiel, and M. Kimura. 2001. Neurons in the thalamic CM-Pf com-
plex supply striatal neurons with information about behaviorally significant sensory events. Journal of
Neurophysiology 85:960–976.
Mazzoni, P., R.P. Bracewell, S. Barash, and R.A. Andersen. 1996. Spatially tuned auditory responses in area
LIP of macaques performing delayed memory saccades to acoustic targets. Journal of Neurophysiology
75:1233–1241.
McDonald, J.J., W.A. Teder-Sälejärvi, F. Di Russo, and S.A. Hillyard. 2003. Neural substrates of perceptual
enhancement by cross-modal spatial attention. Journal of Cognitive Neuroscience 15:10–19.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264:746–748.
McNaughton, B.L., F.P. Battaglia, O. Jensen, E.I. Moser, and M.B. Moser. 2006. Path integration and the neu-
ral basis of the ‘cognitive map.’ Nature Reviews. Neuroscience 7:663–678.
Mehta, A.D., I. Ulbert, and C.E. Schroeder. 2000a. Intermodal selective attention in monkeys: I. Distribution
and timing of effects across visual areas. Cerebral Cortex 10:343–358.
Mehta, A.D., I. Ulbert, and C.E. Schroeder. 2000b. Intermodal selective attention in monkeys: II. Physiological
mechanisms of modulation. Cerebral Cortex 10:359–370.
Meredith, M.A., B.L. Allman, L.P. Keniston, and H.R. Clemo. 2009. Auditory influences on non-auditory cor-
tices. Hearing Research 258:64–71.
Meredith, M.A., J.W. Nemitz, and B.E. Stein. 1987. Determinants of multisensory integration in superior col-
liculus neurons: I. Temporal factors. Journal of Neuroscience 7:3215–3229.
Meredith, M.A., and B.E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221:389–391.
Meyer, K., J.T. Kaplan, R. Essex, C. Webber, H. Damasio, and A. Damasio. 2010. Predicting visual stimuli on
the basis of activity in auditory cortices. Nature Neuroscience 13:667–668.
Miller, J.O. 1982. Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology
14:247–279.
Miller, J., R. Ulrich, and Y. Lamarre. 2001. Locus of the redundant-signals effect in bimodal divided attention:
A neurophysiological analysis. Perception & Psychophysics 63:555–562.
Mohedano-Moriano, A., P. Pro-Sistiaga, M.M. Arroyo-Jimenez, et al. 2007. Topographical and laminar distri-
bution of cortical input to the monkey entorhinal cortex. Journal of Anatomy 211:250–260.
Mohedano-Moriano, A., A. Martinez-Marcos, P. Pro-Sistiaga, et al. 2008. Convergence of unimodal and poly-
modal sensory input to the entorhinal cortex in the fascicularis monkey. Neuroscience 151:255–271.
Molholm, S., W. Ritter, M.M. Murray, D.C. Javitt, C.E. Schroeder, and J.J. Foxe. 2002. Multisensory auditory–
visual interactions during early sensory processing in humans: A high-density electrical mapping study.
Brain Research. Cognitive Brain Research 14:115–128.
Molholm, S., A. Martinez, M. Shpaner, and J.J. Foxe. 2007. Object-based attention is multisensory: Co-activation
of an object’s representations in ignored sensory modalities. European Journal of Neuroscience 26:499–509.
Mullette-Gilman, O.A., Y.E. Cohen, and J.M. Groh. 2005. Eye-centered, head-centered, and complex coding of
visual and auditory targets in the intraparietal sulcus. Journal of Neurophysiology 94:2331–2352.
Mullette-Gilman, O.A., Y.E. Cohen, and J.M. Groh. 2009. Motor-related signals in the intraparietal cortex
encode locations in a hybrid, rather than eye-centered reference frame. Cerebral Cortex 19:1761–1775.
Murata, A., L. Fadiga, L. Fogassi, V. Gallese, V. Raos, and G. Rizzolatti. 1997. Object representation in the
ventral premotor cortex (area F5) of the monkey. Journal of Neurophysiology 78:2226–2230.
Murray, E.A., and D. Gaffan. 1994. Removal of the amygdala plus subjacent cortex disrupts the retention of both
intramodal and crossmodal associative memories in monkeys. Behavioral Neuroscience 108:494–500.
Murray, E.A., and B.J. Richmond. 2001. Role of perirhinal cortex in object perception, memory, and associa-
tions. Current Opinion in Neurobiology 11:188–193.
Murray, M.M., C.M. Michel, R.G. de Peralta, et al. 2004. Rapid discrimination of visual and multisensory
memories revealed by electrical neuroimaging. Neuroimage 21:125–135.
Murray, M.M., J.J. Foxe, and G.R. Wylie. 2005. The brain uses single-trial multisensory memories to discrimi-
nate without awareness. Neuroimage 27:473–478.
Musacchia, G., M. Sams, T. Nicol, and N. Kraus. 2006. Seeing speech affects acoustic information processing
in the human brainstem. Experimental Brain Research 168:1–10.
Musacchia, G., and C.E. Schroeder. 2009. Neuronal mechanisms, response dynamics and perceptual functions
of multisensory interactions in auditory cortex. Hearing Research 258:72–79.
Nager, W., K. Estorf, and T.F. Münte. 2006. Crossmodal attention effects on brain responses to different stimu-
lus classes. BMC Neuroscience 7:31.
Navarra, J., A. Alsius, S. Soto-Faraco, and C. Spence. 2010. Assessing the role of attention in the audiovisual
integration of speech. Information Fusion 11:4–11.
Neal, J.W., R.C. Pearson, and T.P. Powell. 1990. The connections of area PG, 7a, with cortex in the parietal,
occipital and temporal lobes of the monkey. Brain Research 532:249–264.
Nelissen, K., W. Vanduffel, and G.A. Orban. 2006. Charting the lower superior temporal region, a new motion-
sensitive region in monkey superior temporal sulcus. Journal of Neuroscience 26:5929–5947.
Newman, J.D., and D.F. Lindsley. 1976. Single unit analysis of auditory processing in squirrel monkey frontal
cortex. Experimental Brain Research 25:169–181.
Nishijo, H., T. Ono, and H. Nishino. 1988a. Topographic distribution of modality-specific amygdalar neurons
in alert monkey. Journal of Neuroscience 8:3556–3569.
Nishijo, H., T. Ono, and H. Nishino. 1988b. Single neuron responses in amygdala of alert monkey during com-
plex sensory stimulation with affective significance. Journal of Neuroscience 8:3570–3583.
Nyberg, L., R. Habib, A.R. McIntosh, and E. Tulving. 2000. Reactivation of encoding-related brain activ-
ity during memory retrieval. Proceedings of the National Academy of Sciences of the United States of
America 97:11120–11124.
Ono, T., K. Nakamura, H. Nishijo, and S. Eifuku. 1993. Monkey hippocampal neurons related to spatial and
nonspatial functions. Journal of Neurophysiology 70:1516–1529.
Oram, M.W., and D.I. Perrett. 1996. Integration of form and motion in the anterior superior temporal polysen-
sory area (STPa) of the macaque monkey. Journal of Neurophysiology 76:109–129.
Oram, M.W., D.I. Perrett, and J.K. Hietanen. 1993. Directional tuning of motion-sensitive cells in the anterior
superior temporal polysensory area of the macaque. Experimental Brain Research 97:274–294.
Padberg, J., B. Seltzer, and C.G. Cusick. 2003. Architectonics and cortical connections of the upper bank
of the superior temporal sulcus in the rhesus monkey: An analysis in the tangential plane. Journal of
Comparative Neurology 467:418–434.
Padberg, J., E. Disbrow, and L. Krubitzer. 2005. The organization and connections of anterior and posterior
parietal cortex in titi monkeys: Do new world monkeys have an area 2? Cerebral Cortex 15:1938–1963.
Parr, L.A., E. Hecht, S.K. Barks, T.M. Preuss, and J.R. Votaw. 2009. Face processing in the chimpanzee brain.
Current Biology 19:50–53.
Partan, S.R. 2002. Single and multichannel signal composition: Facial expressions and vocalizations of rhesus
macaques (Macaca mulatta). Behaviour 139:993–1027.
Perrett, D.I., E.T. Rolls, and W. Caan. 1982. Visual neurones responsive to faces in the monkey temporal cortex.
Experimental Brain Research 47:329–342.
Perrott, D.R., K. Saberi, K. Brown, and T.Z. Strybel. 1990. Auditory psychomotor coordination and visual
search performance. Perception & Psychophysics 48:214–226.
Petkov, C.I., C. Kayser, T. Steudel, K. Whittingstall, M. Augath, and N.K. Logothetis. 2008. A voice region in
the monkey brain. Nature Neuroscience 11:367–374.
Petrides, M., and D.N. Pandya. 2002. Comparative cytoarchitectonic analysis of the human and the macaque
ventrolateral prefrontal cortex and corticocortical connection patterns in the monkey. European Journal
of Neuroscience 16:291–310.
Petrides, M., and D.N. Pandya. 2009. Distinct parietal and temporal pathways to the homologues of Broca’s
area in the monkey. PLoS Biology 7:e1000170.
Phelps, E.A., and J.E. LeDoux. 2005. Contributions of the amygdala to emotion processing: From animal mod-
els to human behavior. Neuron 48:175–187.
Pinsk, M.A., K. DeSimone, T. Moore, C.G. Gross, and S. Kastner. 2005. Representations of faces and body
parts in macaque temporal cortex: A functional MRI study. Proceedings of the National Academy of
Sciences of the United States of America 102:6996–7001.
Poremba, A., R.C. Saunders, A.M. Crane, M. Cook, L. Sokoloff, and M. Mishkin. 2003. Functional mapping
of the primate auditory system. Science 299:568–572.
Porter, K.K., R.R. Metzger, and J.M. Groh. 2007. Visual- and saccade-related signals in the primate infe-
rior colliculus. Proceedings of the National Academy of Sciences of the United States of America
104:17855–17860.
Posner, M.I., C.R.R. Snyder, and D.J. Davidson. 1980. Attention and the detection of signals. Journal of
Experimental Psychology. General 109:160–174.
Raab, D.H. 1962. Statistical facilitation of simple reaction times. Transactions of the New York Academy of
Sciences 24:574–590.
Rahne, T., and M. Böckmann-Barthel. 2009. Visual cues release the temporal coherence of auditory objects in
auditory scene analysis. Brain Research 1300:125–134.
Ramos-Estebanez, C., L.B. Merabet, K. Machii, et al. 2007. Visual phosphene perception modulated by sub-
threshold crossmodal sensory stimulation. Journal of Neuroscience 27:4178–4181.
Rao, S.C., G. Rainer, and E.K. Miller. 1997. Integration of what and where in the primate prefrontal cortex.
Science 276:821–824.
Rauschecker, J.P., and B. Tian. 2000. Mechanisms and streams for processing of “what” and “where” in
auditory cortex. Proceedings of the National Academy of Sciences of the United States of America
97:11800–11806.
Rauschecker, J.P., B. Tian, and M. Hauser. 1995. Processing of complex sounds in the macaque nonprimary
auditory cortex. Science 268:111–114.
Rauschecker, J.P., and L.R. Harris. 1989. Auditory and visual neurons in the cat’s superior colliculus selective
for the direction of apparent motion stimuli. Brain Research 490:56–63.
Recanzone, G.H., D.C. Guard, M.L. Phan, and T.K. Su. 2000. Correlation between the activity of single auditory
cortical neurons and sound-localization behavior in the macaque monkey. Journal of Neurophysiology
83:2723–2739.
Ringo, J.L., and S.G. O’Neill. 1993. Indirect inputs to ventral temporal cortex of monkey: The influence on unit
activity of alerting auditory input, interhemispheric subcortical visual input, reward, and the behavioral
response. Journal of Neurophysiology 70:2215–2225.
Rizzolatti, G., and L. Craighero. 2004. The mirror-neuron system. Annual Review of Neuroscience 27:169–192.
Rizzolatti, G., L. Fadiga, V. Gallese, and L. Fogassi. 1996. Premotor cortex and the recognition of motor
actions. Brain Research. Cognitive Brain Research 3:131–141.
Rockland, K.S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology 50:19–26.
Rolls, E.T., H.D. Critchley, A.S. Browning, and K. Inoue. 2006. Face-selective and auditory neurons in the
primate orbitofrontal cortex. Experimental Brain Research 170:74–87.
Romanski, L.M., B.B. Averbeck, and M. Diltz. 2005. Neural representation of vocalizations in the primate
ventrolateral prefrontal cortex. Journal of Neurophysiology 93:734–747.
Romanski, L.M., J.F. Bates, and P.S. Goldman-Rakic. 1999a. Auditory belt and parabelt projections to the pre-
frontal cortex in the rhesus monkey. Journal of Comparative Neurology 403:141–157.
Romanski, L.M., and P.S. Goldman-Rakic. 2002. An auditory domain in primate prefrontal cortex. Nature
Neuroscience 5:15–16.
Romanski, L.M., B. Tian, J. Fritz, M. Mishkin, P.S. Goldman-Rakic, and J.P. Rauschecker. 1999b. Dual streams
of auditory afferents target multiple domains in the primate prefrontal cortex. Nature Neuroscience
2:1131–1136.
Romei, V., M.M. Murray, L.B. Merabet, and G. Thut. 2007. Occipital transcranial magnetic stimulation has
opposing effects on visual and auditory stimulus detection: Implications for multisensory interactions.
Journal of Neuroscience 27:11465–11472.
Romei, V., M.M. Murray, C. Cappe, and G. Thut. 2009. Preperceptual and stimulus-selective enhancement of
low-level human visual cortex excitability by sounds. Current Biology 19:1799–1805.
Russ, B.E., A.L. Ackelson, A.E. Baker, and Y.E. Cohen. 2008. Coding of auditory-stimulus identity in the audi-
tory non-spatial processing stream. Journal of Neurophysiology 99:87–95.
Saleem, K.S., W. Suzuki, K. Tanaka, and T. Hashikawa. 2000. Connections between anterior inferotemporal cortex
and superior temporal sulcus regions in the macaque monkey. Journal of Neuroscience 20:5083–5101.
Saleem, K.S., H. Kondo, and J.L. Price. 2008. Complementary circuits connecting the orbital and medial
prefrontal networks with the temporal, insular, and opercular cortex in the macaque monkey. Journal of
Comparative Neurology 506:659–693.
Sams, M., R. Aulanko, M. Hämäläinen, et al. 1991. Seeing speech: Visual information from lip movements
modifies activity in the human auditory cortex. Neuroscience Letters 127:141–145.
Santangelo, V., and C. Spence. 2009. Crossmodal exogenous orienting improves the accuracy of temporal order
judgments. Experimental Brain Research 194:577–586.
Santos-Benitez, H., C.M. Magarinos-Ascone, and E. Garcia-Austt. 1995. Nucleus basalis of Meynert cell
responses in awake monkeys. Brain Research Bulletin 37:507–511.
Schiff, W., J.A. Caviness, and J.J. Gibson. 1962. Persistent fear responses in rhesus monkeys to the optical
stimulus of “looming.” Science 136:982–983.
Schlack, A., S.J. Sterbing-D’Angelo, K. Hartung, K.-P. Hoffmann, and F. Bremmer. 2005. Multisensory space
representations in the macaque ventral intraparietal area. Journal of Neuroscience 25:4616–4625.
Schmolesky, M.T., Y. Wang, D.P. Hanes, et al. 1998. Signal timing across the macaque visual system. Journal
of Neurophysiology 79:3272–3278.
Schroeder, C.E., and J.J. Foxe. 2002. The timing and laminar profile of converging inputs to multisensory areas
of the macaque neocortex. Brain Research. Cognitive Brain Research 14:187–198.
Schroeder, C.E., and J.J. Foxe. 2005. Multisensory contributions to low-level, ‘unisensory’ processing. Current
Opinion in Neurobiology 15:454–458.
Schroeder, C.E., and P. Lakatos. 2009. Low-frequency neuronal oscillations as instruments of sensory selec-
tion. Trends in Neurosciences 32:9–18.
Schroeder, C.E., P. Lakatos, Y. Kajikawa, S. Partan, and A. Puce. 2008. Neuronal oscillations and visual ampli-
fication of speech. Trends in Cognitive Sciences 12:106–113.
Schroeder, C.E., R.W. Lindsley, C. Specht, A. Marcovici, J.F. Smiley, and D.C. Javitt. 2001. Somatosensory input
to auditory association cortex in the macaque monkey. Journal of Neurophysiology 85:1322–1327.
Seltzer, B., M.G. Cola, C. Gutierrez, M. Massee, C. Weldon, and C.G. Cusick. 1996. Overlapping and non-
overlapping cortical projections to cortex of the superior temporal sulcus in the rhesus monkey: Double
anterograde tracer studies. Journal of Comparative Neurology 370:173–190.
Seltzer, B., and D.N. Pandya. 1978. Afferent cortical connections and architectonics of the superior temporal
sulcus and surrounding cortex in the rhesus monkey. Brain Research 149:1–24.
Seltzer, B., and D.N. Pandya. 1989. Frontal lobe connections of the superior temporal sulcus in the rhesus
monkey. Journal of Comparative Neurology 281:97–113.
Seltzer, B., and D.N. Pandya. 1994. Parietal, temporal, and occipital projections to cortex of the superior
temporal sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology
343:445–463.
Sherman, S.M., and R.W. Guillery. 2002. The role of the thalamus in the flow of information to the cor-
tex. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences
357:1695–1708.
Sliwa, J., J.-R. Duhamel, O. Pascalis, and S.C. Wirth. 2009. Cross-modal recognition of identity in rhesus mon-
keys for familiar conspecifics and humans. Abstracts Society for Neuroscience 684.14.
Smiley, J.F., T.A. Hackett, I. Ulbert, et al. 2007. Multisensory convergence in auditory cortex, I. Cortical con-
nections of the caudal superior temporal plane in macaque monkeys. Journal of Comparative Neurology
502:894–923.
Soto-Faraco, S., and A. Alsius. 2009. Deconstructing the McGurk–MacDonald illusion. Journal of Experimental
Psychology. Human Perception and Performance 35:580–587.
Squire, L.R., C.E.L. Stark, and R.E. Clark. 2004. The medial temporal lobe. Annual Review of Neuroscience
27:279–306.
Starr, A., and M. Don. 1972. Responses of squirrel monkey (Saimiri sciureus) medial geniculate units to binau-
ral click stimuli. Journal of Neurophysiology 35:501–517.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., W. Jiang, M.T. Wallace, and T.R. Stanford. 2001. Nonvisual influences on visual-information pro-
cessing in the superior colliculus. Progress in Brain Research 134:143–156.
Stein, B.E., M.T. Wallace, T.R. Stanford, and W. Jiang. 2002. Cortex governs multisensory integration in the
midbrain. Neuroscientist 8:306–314.
Stein, B.E., and T.R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the single
neuron. Nature Reviews. Neuroscience 9:255–266.
Stevenson, R.A., and T.W. James. 2009. Audiovisual integration in human superior temporal sulcus: Inverse
effectiveness and the neural processing of speech and object recognition. Neuroimage 44:1210–1223.
Stricanne, B., R.A. Andersen, and P. Mazzoni. 1996. Eye-centered, head-centered, and intermediate coding of
remembered sound locations in area LIP. Journal of Neurophysiology 76:2071–2076.
Sugihara, T., M.D. Diltz, B.B. Averbeck, and L.M. Romanski. 2006. Integration of auditory and visual
communication information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience
26:11138–11147.
Sumby, W.H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the
Acoustical Society of America 26:212–215.
Suzuki, W.A., and D.G. Amaral. 1994. Perirhinal and parahippocampal cortices of the macaque monkey:
Cortical afferents. Journal of Comparative Neurology 350:497–533.
Talsma, D., D. Senkowski, and M.G. Woldorff. 2009. Intermodal attention affects the processing of the tempo-
ral alignment of audiovisual stimuli. Experimental Brain Research 198:313–328.
Tamura, R., T. Ono, M. Fukuda, and K. Nakamura. 1992. Spatial responsiveness of monkey hippocampal neu-
rons to various visual and auditory stimuli. Hippocampus 2:307–322.
Tanaka, K., K. Hikosaka, H. Saito, M. Yukie, Y. Fukada, and E. Iwai. 1986. Analysis of local and wide-field
movements in the superior temporal visual areas of the macaque monkey. Journal of Neuroscience
6:134–144.
Tanibuchi, I., and P.S. Goldman-Rakic. 2003. Dissociation of spatial-, object-, and sound-coding neurons in the
mediodorsal nucleus of the primate thalamus. Journal of Neurophysiology 89:1067–1077.
Teder-Sälejärvi, W.A., T.F. Münte, F. Sperlich, and S.A. Hillyard. 1999. Intra-modal and cross-modal spatial
attention to auditory and visual stimuli. An event-related brain potential study. Brain Research. Cognitive
Brain Research 8:327–343.
Théoret, H., L. Merabet, and A. Pascual-Leone. 2004. Behavioral and neuroplastic changes in the
blind: Evidence for functionally relevant cross-modal interactions. Journal of Physiology, Paris
98:221–233.
Tian, B., D. Reser, A. Durham, A. Kustov, and J.P. Rauschecker. 2001. Functional specialization in rhesus
monkey auditory cortex. Science 292:290–293.
Tsao, D.Y., W.A. Freiwald, R.B.H. Tootell, and M.S. Livingstone. 2006. A cortical region consisting entirely of
face-selective cells. Science 311:670–674.
Tsao, D.Y., S. Moeller, and W.A. Freiwald. 2008a. Comparing face patch systems in macaques and humans.
Proceedings of the National Academy of Sciences of the United States of America 105:19514–19519.
Tsao, D.Y., N. Schweers, S. Moeller, and W.A. Freiwald. 2008b. Patches of face-selective cortex in the macaque
frontal lobe. Nature Neuroscience 11:877–879.
Turner, B.H., M. Mishkin, and M. Knapp. 1980. Organization of the amygdalopetal projections from modality-
specific cortical association areas in the monkey. Journal of Comparative Neurology 191:515–543.
Ungerleider, L.G., and M. Mishkin. 1982. Two cortical visual systems. In Analysis of Visual Behavior, ed. D.J.
Ingle, M.A. Goodale, and R.J.W. Mansfield, 549–586. Cambridge: MIT Press.
Ungerleider, L.G., S.M. Courtney, and J.V. Haxby. 1998. A neural system for human visual working memory.
Proceedings of the National Academy of Sciences of the United States of America 95:883–890.
Updyke, B.V. 1974. Characteristics of unit responses in superior colliculus of the cebus monkey. Journal of
Neurophysiology 37:896–909.
Vaadia, E., D.A. Benson, R.D. Hienz, and M.H. Goldstein Jr. 1986. Unit study of monkey frontal cortex: Active
localization of auditory and of visual stimuli. Journal of Neurophysiology 56:934–952.
van Atteveldt, N., A. Roebroeck, and R. Goebel. 2009. Interaction of speech and script in human auditory
cortex: Insights from neuro-imaging and effective connectivity. Hearing Research 258:152–164.
Vatakis, A., A.A. Ghazanfar, and C. Spence. 2008. Facilitation of multisensory integration by the “unity effect”
reveals that speech is special. Journal of Vision 8(9):14.
von Kriegstein, K., and A.-L. Giraud. 2006. Implicit multisensory associations influence voice recognition.
PLoS Biology 4:e326.
Wallace, M.T., L.K. Wilkinson, and B.E. Stein. 1996. Representation and integration of multiple sensory inputs
in primate superior colliculus. Journal of Neurophysiology 76:1246–1266.
Wang, Y., S. Celebrini, Y. Trotter, and P. Barone. 2008. Visuo-auditory interactions in the primary visual cortex
of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9:79.
Watanabe, M. 1992. Frontal units of the monkey coding the associative significance of visual and auditory
stimuli. Experimental Brain Research 89:233–247.
Watanabe, J., and E. Iwai. 1991. Neuronal activity in visual, auditory and polysensory areas in the monkey
temporal cortex during visual fixation task. Brain Research Bulletin 26:583–592.
Welch, R., and D. Warren. 1986. Intersensory interactions. In Handbook of Perception and Human Performance,
ed. K.R. Boff, L. Kaufman, and J.P. Thomas, 21–36. New York: Wiley.
Werner-Reiss, U., K.A. Kelly, A.S. Trause, A.M. Underhill, and J.M. Groh. 2006. Eye position affects activity
in primary auditory cortex of primates. Current Biology 13:554–562.
Wheeler, M.E., S.E. Petersen, and R.L. Buckner. 2000. Memory’s echo: Vivid remembering reactivates
sensory-­specific cortex. Proceedings of the National Academy of Sciences of the United States of America
97:11125–11129.
Wilson, F.A.W., and E.T. Rolls. 1990. Neuronal responses related to reinforcement in the primate basal fore-
brain. Brain Research 509:213–231.
Wilson, F.A.W., S.P.O. Scalaidhe, and P.S. Goldman-Rakic. 1993. Dissociation of object and spatial processing
in primate prefrontal cortex. Science 260:1955–1958.
Wollberg, Z., and J. Sela. 1980. Frontal cortex of the awake squirrel monkey: Responses of single cells to visual
and auditory stimuli. Brain Research 198:216–220.
Woods, T.M., and G.H. Recanzone. 2004. Visually induced plasticity of auditory spatial perception in macaques.
Current Biology 14:1559–1564.
Wurtz, R.H., and J.E. Albano. 1980. Visual–motor function of the primate superior colliculus. Annual Review
of Neuroscience 3:189–226.
Yeterian, E.H., and D.N. Pandya. 1989. Thalamic connections of the cortex of the superior temporal sulcus in
the rhesus monkey. Journal of Comparative Neurology 282:80–97.
Zangenehpour, S., A.A. Ghazanfar, D.J. Lewkowicz, and R.J. Zatorre. 2009. Heterochrony and cross-species
intersensory matching by infant vervet monkeys. PLoS ONE 4:e4302.
6 Multisensory Influences
on Auditory Processing
Perspectives from fMRI
and Electrophysiology
Christoph Kayser, Christopher I. Petkov,
Ryan Remedios, and Nikos K. Logothetis

CONTENTS
6.1 Introduction.............................................................................................................................99
6.2 The Where and How of Sensory Integration......................................................................... 100
6.3 Using Functional Imaging to Localize Multisensory Influences in Auditory Cortex........... 101
6.4 Multisensory Influences along the Auditory Processing Stream.......................................... 102
6.5 Multisensory Influences and Individual Neurons.................................................................. 104
6.6 Multisensory Influences and Processing of Communication Signals................................... 106
6.7 Conclusions............................................................................................................................ 109
References....................................................................................................................................... 109

6.1  INTRODUCTION
Traditionally, perception has been described as a modular function, with the different sensory
modalities operating as independent and separate processes. Following this view, sensory integra-
tion supposedly occurs only after sufficient unisensory processing and only in higher association
cortices (Jones and Powell 1970; Ghazanfar and Schroeder 2006). Studies in the past decade, how-
ever, promote a different view, and demonstrate that the different modalities interact at early stages
of processing (Kayser and Logothetis 2007; Schroeder and Foxe 2005; Foxe and Schroeder 2005).
A good model for this early integration hypothesis has been the auditory cortex, where multisensory
influences from vision and touch have been reported using a number of methods and experimental
paradigms (Kayser et al. 2009c; Schroeder et al. 2003; Foxe and Schroeder 2005). In fact, anatomi-
cal afferents are available to provide information about nonacoustic stimuli (Rockland and Ojima
2003; Cappe and Barone 2005; Falchier et al. 2002) and neuronal responses showing cross-modal
influences have been described in detail (Lakatos et al. 2007; Kayser et al. 2008, 2009a; Bizley et al.
2006). These novel insights, together with the traditional notion that multisensory processes are
more prominent in higher association regions, suggest that sensory integration is a rather distributed
process that emerges over several stages.
Of particular interest in the context of sensory integration are stimuli with particular behavioral
significance, such as sights and sounds related to communication (Campanella and Belin 2007;
Petrini et al. 2009; Ghazanfar and Logothetis 2003; von Kriegstein and Giraud 2006; von Kriegstein
et al. 2006). Indeed, a famous scenario used to exemplify sensory integration—the cocktail party—
concerns exactly this: when in a loud and noisy environment, we can better understand a person
talking to us when we observe the movements of his/her lips at the same time (Sumby and Pollack
1954; Ross et al. 2007). In this situation, the visual information about lip movements enhances the
(perceived) speech signal, hence providing an example of how visual information can enhance
auditory perception. However, as for many psychophysical phenomena, the exact neural substrate
mediating the sensory integration underlying this behavioral benefit remains elusive.
In this review, we discuss some of the results of early multisensory influences on auditory pro-
cessing, and provide evidence that sensory integration occurs distributed and across several pro-
cessing stages. In particular, we discuss some of the methodological aspects relevant for studies
seeking to localize and characterize multisensory influences, and emphasize some of the recent
results pertaining to speech and voice integration.

6.2  THE WHERE AND HOW OF SENSORY INTEGRATION


To understand how the processing of acoustic information benefits from the stimulation of other
modalities, we need to investigate “where” along auditory pathways influences from other modali-
ties occur, and “how” they affect the neural representation of the sensory environment. Notably,
the questions of “where” and “how” address different scales and levels of organization. Probing
the “where” question requires the observation of sensory responses at many stages of processing,
and hence a large spatial field of view. This is, for example, provided by functional imaging, which
can assess signals related to neural activity in multiple brain regions at the same time. Probing
the “how” question, in contrast, requires an investigation of the detailed neural representation of
sensory information in localized regions of the brain. Given our current understanding of neural
information processing, this level is best addressed by electrophysiological recordings that assess
the responses of individual neurons, or small populations thereof, at the same time (Donoghue
2008; Kayser et al. 2009b; Quian Quiroga 2009).
These two approaches, functional imaging (especially functional magnetic resonance imaging
(fMRI)-blood oxygenation level-dependent (BOLD) signal) and electrophysiology, complement
each other not only with regard to the sampled spatiotemporal dimensions, but also with regard
to the kind of neural activity that is seen by the method. Although electrophysiological methods
sample neural responses at the timescale of individual action potentials (millisecond precision)
and the spatial scale of micrometers, functional imaging reports an aggregate signal derived from
(subthreshold) responses of millions of neurons sampled over several hundreds of micrometers and
hundreds of milliseconds (Logothetis 2002, 2008; Lauritzen 2005). In fact, because the fMRI-
BOLD signal is only indirectly related to neuronal activity, it is difficult, at least at the moment, to
make detailed inferences about neuronal responses from imaging data (Leopold 2009). As a result,
both methods provide complementary evidence on sensory integration.
In addition to defining methods needed to localize and describe sensory interactions, operational
criteria are required to define what kind of response properties are considered multisensory influ-
ences. At the level of neurons, many criteria have been derived from seminal work on the superior
colliculus by Stein and Meredith (1993). Considering an auditory neuron, as an example, visual
influences would be assumed if the response to a bimodal (audiovisual) stimulus differs signifi-
cantly from the unimodal (auditory) response. Although this criterion can be easily implemented
as a statistical test to search for multisensory influences, it is, by itself, not sufficient to conclude
that an observed process merits the label “sensory integration.” At the level of behavior,
sensory integration is usually assumed if the bimodal sensory stimulus leads to a behavioral gain
compared with the unimodal stimulus (Ernst and Bülthoff 2004). Typical behavioral gains are faster
responses, higher detection rates, or improved stimulus discriminability. Often, these behavioral
gains are highest when individual unimodal stimuli are least effective in eliciting responses, a phe-
nomenon known as the principle of inverse effectiveness. In addition, different unimodal stimuli
are only integrated when they are perceived to originate from the same source, i.e., when they occur
coincident in space and time. Together, these two principles provide additional criteria to decide
whether a particular neuronal process might be related to sensory integration (Stein 1998, 2008).
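
In practice, this amounts to a trial-wise comparison of bimodal and unimodal responses, usually summarized by an enhancement index. The following Python sketch applies the idea to simulated spike counts; the variable names, the Mann–Whitney test, and the particular index are illustrative assumptions rather than the exact procedures of the studies discussed here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated trial-wise spike counts for one auditory neuron (toy numbers).
resp_A = rng.poisson(12, size=40)    # auditory-alone trials
resp_V = rng.poisson(3, size=40)     # visual-alone trials
resp_AV = rng.poisson(16, size=40)   # combined audiovisual trials

# Criterion: the bimodal response differs significantly from the (best) unimodal response.
best_uni = resp_A if resp_A.mean() >= resp_V.mean() else resp_V
stat, p_val = stats.mannwhitneyu(resp_AV, best_uni, alternative="two-sided")

# Enhancement index: percent change of the bimodal response relative to the best unimodal one.
enhancement = 100.0 * (resp_AV.mean() - best_uni.mean()) / best_uni.mean()
print(f"multisensory modulation: p = {p_val:.3g}, enhancement = {enhancement:+.1f}%")
```

Verifying inverse effectiveness and temporal coincidence then amounts to repeating such comparisons across stimulus efficacies and across audiovisual delays.
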
This statistical criterion, in conjunction with the verification of these principles, has become the
standard approach to detect neural processes related to sensory integration. In addition, recent work
has introduced more elaborate concepts derived from information theory and stimulus decoding.
Such methods can be used to investigate whether neurons indeed become more informative about
the sensory stimuli, and whether they allow better stimulus discrimination in multisensory com-
pared to unisensory conditions (Bizley et al. 2006; Bizley and King 2008; Kayser et al. 2009a).
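
As a sketch of this decoding-based approach, one can ask whether a standard classifier discriminates the stimuli more accurately from responses recorded under bimodal than under unimodal stimulation. The example below uses simulated population spike counts and scikit-learn; the stronger stimulus "gain" assigned to the audiovisual condition is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
n_trials, n_units, n_stimuli = 60, 20, 3

def simulate(gain):
    """Toy spike-count matrix (trials x units) with stimulus tuning scaled by 'gain'."""
    labels = np.repeat(np.arange(n_stimuli), n_trials // n_stimuli)
    tuning = rng.normal(0, 1, size=(n_stimuli, n_units))
    counts = 10 + gain * tuning[labels] + rng.normal(0, 2, size=(n_trials, n_units))
    return counts, labels

# Assume, for illustration, that the audiovisual condition carries a stronger stimulus signal.
conditions = {"auditory alone": simulate(gain=1.0), "audiovisual": simulate(gain=1.5)}

for name, (X, y) in conditions.items():
    accuracy = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
    print(f"{name:15s} decoding accuracy: {accuracy:.2f} (chance = {1 / n_stimuli:.2f})")
```

A higher cross-validated accuracy in the bimodal condition would indicate that the population response indeed carries more stimulus information, which is the question these decoding analyses address.
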

6.3  USING FUNCTIONAL IMAGING TO LOCALIZE MULTISENSORY INFLUENCES IN AUDITORY CORTEX

Functional imaging is by far the most popular method to study the cortical basis of sensory integra-
tion, and many studies report multisensory interactions between auditory, visual, and somatosen-
sory stimulation in association cortices of the temporal and frontal lobes (Calvert 2001). In addition,
a number of studies reported that visual or somatosensory stimuli activate regions in close proxim-
ity to the auditory cortex or enhance responses to acoustic stimuli in these regions (Calvert and
Campbell 2003; Calvert et al. 1997, 1999; Pekkola et al. 2005; Lehmann et al. 2006; van Atteveldt
et al. 2004; Schurmann et al. 2006; Bernstein et al. 2002; Foxe et al. 2002; Martuzzi et al. 2006; van
Wassenhove et al. 2005). Together, these studies promoted the notion of early multisensory interac-
tions in the auditory cortex.
However, the localization of multisensory influences is only as good as the localization of those
structures relative to which the multisensory influences are defined. To localize multisensory effects
to the auditory core (primary) or belt (secondary) fields, one needs to be confident about the location
of these auditory structures in the respective subjects. Yet, this can be a problem given the small
scale and variable position of auditory fields in individual subjects (Kaas and Hackett 2000; Hackett
et al. 1998; Fullerton and Pandya 2007; Clarke and Rivier 1998; Chiry et al. 2003). One way to over-
come this would be to first localize individual areas in each subject and to analyze functional data
within these regions of interest. Visual studies often follow this strategy by mapping visual areas
using retinotopically organized stimuli, which exploit the well-known functional organization of the
visual cortex (Engel et al. 1994; Warnking et al. 2002). Auditory studies, in principle, could exploit
a similar organization of auditory cortex, known as tonotopy, to define individual auditory fields
(Rauschecker 1998; Rauschecker et al. 1995; Merzenich and Brugge 1973). In fact, electrophysi-
ological studies have demonstrated that several auditory fields contain an ordered representation
of sound frequency, with neurons preferring similar sound frequencies appearing in clusters and
forming continuous bands encompassing the entire range from low to high frequencies (Merzenich
and Brugge 1973; Morel et al. 1993; Kosaki et al. 1997; Recanzone et al. 2000). In addition, neurons
in the auditory core and belt show differences in their preferences to narrow and broadband sounds,
providing a second feature to distinguish several auditory fields (Rauschecker 1998; Rauschecker
et al. 1997) (Figure 6.1a). Yet, although these properties in principle provide characteristics to dif-
ferentiate individual auditory fields, this has proven surprisingly challenging in human fMRI stud-
ies (Wessinger et al. 2001; Formisano et al. 2003; Talavage et al. 2004).
To sidestep these difficulties, we exploited high-resolution imaging facilities in combination with
a model system for which there exists considerably more prior knowledge about the organization
of the auditory cortex: the macaque monkey. This model system allows imaging voxel sizes on the
order of 0.5 × 0.5 mm, whereas conventional human fMRI studies operate on a resolution of 3 ×
3 mm (Logothetis et al. 1999). Much of the evidence about the anatomical and functional structure
of the auditory cortex originates from this model system, providing important a priori information
about the expected organization (Kaas and Hackett 2000; Hackett et al. 1998; Rauschecker and Tian
2004; Recanzone et al. 2000). Combining this a priori knowledge with high-resolution imaging
systems as well as optimized data acquisition for auditory paradigms, we were able to obtain a tono-
topic functional parcellation in individual animals (Petkov et al. 2006, 2009). By comparing the
activation to stimulation with sounds of different frequency compositions, we obtained a smoothed

[Figure 6.1 image: (a) schematic map of auditory core (RT, R, A1), belt (RTM, RTL, RM, AL, MM, ML, CM, CL), and parabelt (RPB, CPB) fields, with low-to-high frequency gradients and narrow/broad bandwidth preferences along the rostral–caudal and medial–lateral axes; (b) voxel-wise low- versus high-frequency preferences and a smoothed frequency map (0.5–16 kHz); scale bar 5 mm.]
FIGURE 6.1  (See color insert.) Mapping individual auditory fields using fMRI. (a) Schematic of organi-
zation of monkey auditory cortex. Three primary auditory fields (core region) are surrounded by secondary
fields (belt region) as well as higher association areas (parabelt). Electrophysiological studies have shown that
several of these fields contain an ordered representation of sound frequency (tonotopic map, indicated on left),
and that core and belt fields prefer narrow- and broadband sounds, respectively. These two functional proper-
ties can be exploited to map layout of these auditory fields in individual subjects using functional imaging.
(b) Single-slice fMRI data showing frequency-selective BOLD responses to low and high tones (left panel)
and a complete (smoothed) frequency map obtained from stimulation using six frequency bands (right panel).
Combining frequency map with an estimate of core region and anatomical landmarks to delineate the parabelt
results in a full parcellation of auditory cortex in individual subjects. This parcellation is indicated in the left
panel as white dashed lines and is shown in full in panel a.

frequency preference map which allowed determining the anterior–posterior borders of potential
fields. In addition, the preference to sounds of different bandwidths often allowed a segregation of
core and belt fields, hence providing borders in medial–lateral directions. When combined with the
known organization of auditory cortex, the evidence from these activation patterns allowed a more
complete parcellation into distinct core and belt fields, and provided constraints for the localization
of the parabelt regions (Figure 6.1b). This functional localization procedure for auditory fields now
serves as a routine tool to delineate auditory structures in experiments involving auditory cortex.
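
At its core, this parcellation assigns each voxel the frequency band that drives it most strongly and then smooths the resulting preference map. A schematic Python version is shown below; the array shapes, the simulated responses, and the choice of a median filter for smoothing are assumptions for illustration, not the published processing pipeline.

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(2)

# Hypothetical response estimates: 6 frequency bands x (64 x 64) in-plane voxels.
freq_bands_khz = np.array([0.5, 1, 2, 4, 8, 16])
responses = rng.normal(0, 1, size=(len(freq_bands_khz), 64, 64))

# Best frequency per voxel = frequency band yielding the strongest response.
best_band = responses.argmax(axis=0)

# Smooth the preference map; a median filter keeps the discrete band labels intact.
smoothed_band = median_filter(best_band, size=3)
best_freq_map = freq_bands_khz[smoothed_band]

print("voxels preferring <= 1 kHz:", int(np.sum(best_freq_map <= 1)))
print("voxels preferring >= 8 kHz:", int(np.sum(best_freq_map >= 8)))
```

The anterior–posterior progression of such a best-frequency map, together with the bandwidth preference, is what delineates the field borders described above.
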

6.4  MULTISENSORY INFLUENCES ALONG THE AUDITORY PROCESSING STREAM

In search for a better localization of multisensory influences in the auditory cortex reported by human
imaging studies, we combined the above localization technique with audiovisual and audio-tactile
stimulation paradigms (Kayser et al. 2005, 2007). To localize multisensory influences, we searched
for regions (voxels) in which responses to acoustic stimuli were significantly enhanced when a visual
stimulus was presented at the same time. Because functional imaging poses particular constraints
on statistical contrasts (Laurienti et al. 2005), we used a conservative formulation of this criterion in
which multisensory influences are defined as significant superadditive effects, i.e., the response in the
bimodal condition is required to be significantly stronger than the sum of the two unisensory responses:
AV > (A + V). In our experiments, we employed naturalistic stimuli in order to activate those regions
especially involved in the processing of everyday scenarios. These stimuli included scenes of conspe-
cific animals vocalizing as well as scenes showing other animals in their natural settings.
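
A minimal sketch of such a voxel-wise superadditivity contrast is given below, assuming that per-run response estimates (e.g., percent signal change) have already been extracted for the auditory (A), visual (V), and audiovisual (AV) conditions; the simulated numbers, the one-sided t-test, and the uncorrected threshold are illustrative choices, not the published analysis.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_runs, n_voxels = 12, 1000

# Hypothetical per-run response estimates (percent signal change) for each voxel.
resp_A = rng.normal(1.0, 0.4, size=(n_runs, n_voxels))
resp_V = rng.normal(0.2, 0.4, size=(n_runs, n_voxels))
resp_AV = rng.normal(1.5, 0.4, size=(n_runs, n_voxels))

# Superadditivity criterion AV > (A + V), tested across runs for every voxel.
diff = resp_AV - (resp_A + resp_V)
t, p_two_sided = stats.ttest_1samp(diff, popmean=0.0, axis=0)
p_one_sided = np.where(t > 0, p_two_sided / 2, 1.0)

superadditive = p_one_sided < 0.001   # uncorrected threshold, for illustration only
print(f"{int(superadditive.sum())} of {n_voxels} voxels pass the superadditivity contrast")
```
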
In concordance with previous reports, we found that visual stimuli indeed influence fMRI
responses to acoustic stimuli within the classical auditory cortex. These visual influences were stron-
gest in the caudal portions of the auditory cortex, especially in the caudo–medial and caudo–lateral
belt, portions of the medial belt, and the caudal parabelt (Figure 6.2a and b). These multisensory

[Figure 6.2 image: (a) activation maps thresholded from p < 0.01 to p < 10^–7 with a percent-signal-change time course; (b) schematic of auditory fields (RT, R, A1, MM, CM, CL, CPB) showing visual influences; (c) three-dimensional rendering of core, medial belt, lateral belt, STG, and uSTS; (d) bar plot of visual contribution (% of total activation, 0–40%) for A1, caudal belt, parabelt/STG, and uSTS (lower to higher areas).]
FIGURE 6.2  (See color insert.) Imaging multisensory influences in monkey auditory cortex. (a) Data from
an experiment with audiovisual stimulation. Sensory activation to auditory (left) and visual (right) stimuli are
shown on single image slices (red to yellow voxels). An outline of auditory fields is indicated (white lines).
Time course illustrates multisensory enhancement during combined audiovisual stimulation (data from one
session, averaged over 36 repeats of the stimulus). For details, see Kayser, C. et al. (2007). (b) Schematic of
auditory fields exhibiting significant visual influences. Visual influences (shown in blue) were most prominent
in caudal fields, and effects in A1 were only observed in alert animals. (From Kayser, C. et al., J. Neurosci.,
27, 1824–1835, 2007. With permission.) (c) Three-dimensional rendering of a segment of a monkey brain.
Different structures investigated in fMRI experiments are color coded and comprise classical auditory cortex
(core and belt) as well as auditory association cortex (parabelt) and general association cortex (STS). Please
note that this figure serves as an illustration only, and individual structures have been sketched based on
approximate anatomical location, not on functional criteria. (d) Strength of visual influence along auditory
hierarchy. The graph displays contribution of responses to (unimodal) visual stimuli to total fMRI-BOLD
activation obtained during auditory, visual, and audiovisual stimulation. This was computed as a fraction (in
percentage) of BOLD response to visual stimulation relative to sum of BOLD responses to all three conditions.
Visual contribution increases from lower to higher areas.

interactions in secondary and higher auditory regions occurred reliably in both anesthetized and
alert animals. In addition, we found multisensory interactions in the core region A1, but only in the
alert animal, indicating that these early interactions could be dependent on the vigilance of the ani-
mal, perhaps involving cognitive or top-down influences. To rule out nonspecific modulatory pro-
jections as the source of these effects, we tested two functional criteria of sensory integration: the
principles of temporal coincidence and inverse effectiveness. We found both criteria to be obeyed,
and multisensory influences were stronger when sensory stimuli were in temporal coincidence and
when unisensory stimuli were less effective in eliciting BOLD responses. Overall, these findings
not only confirm previous results from human imaging, but also localize multisensory influences
mostly to secondary fields and demonstrate a clear spatial organization, with caudal regions being
most susceptible to multisensory inputs (Kayser et al. 2009c).
In addition to providing a good localization of cross-modal influences (the “where” question),
functional imaging can also shed light on the relative influence of visual stimuli on auditory pro-
cessing at several processing stages. Because fMRI allows measuring responses at many locations
at the same time, we were able to quantify visual influences along multiple stages in the caudal
auditory network (Figure 6.2c). Using the above-mentioned localization technique in conjunction
with anatomical landmarks, we defined several regions of interest outside the classical auditory
cortex: these comprised the caudal parabelt, the superior temporal gyrus, as well as the upper bank
of the STS (uSTS). The uSTS is a well-known multisensory area where neuronal responses as well
as fMRI activations to stimulation of several modalities have been described (Benevento et al.
1977; Bruce et al. 1981; Beauchamp et al. 2004, 2008; Dahl et al. 2009). As a result, one should
expect a corresponding increase in visual influence when proceeding from the auditory core to the
uSTS. This was indeed the case, as shown in Figure 6.2d: visual influences were relatively small
in auditory core and belt fields, as described above. In the parabelt/STG region, an auditory asso-
ciation cortex, visual influences already contributed a considerable proportion to the total activa-
tion, and were much stronger still in the uSTS. As a rule of thumb, it seemed that the contribution
of visual stimuli to the total measured activation roughly doubled from stage to stage along this
hierarchy.
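
The quantity plotted in Figure 6.2d reduces to a simple ratio once region-of-interest responses are available. The sketch below uses made-up response values solely to show the computation; they are placeholders, not the measured data.

```python
# Hypothetical mean BOLD responses (percent signal change) per region of interest.
roi_responses = {
    "A1":           {"A": 1.2, "V": 0.05, "AV": 1.3},
    "caudal belt":  {"A": 1.0, "V": 0.15, "AV": 1.2},
    "parabelt/STG": {"A": 0.8, "V": 0.25, "AV": 1.0},
    "uSTS":         {"A": 0.6, "V": 0.45, "AV": 0.9},
}

for roi, r in roi_responses.items():
    # Visual contribution: response to visual stimulation as a fraction of the
    # summed responses to the auditory, visual, and audiovisual conditions.
    contribution = 100.0 * r["V"] / (r["A"] + r["V"] + r["AV"])
    print(f"{roi:13s} visual contribution: {contribution:4.1f}%")
```
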
Although human functional imaging has described multisensory influences at different stages
of auditory processing, and in a number of behavioral contexts, imaging studies with the animal
model localized these influences to the identified areas. These results promote a model in which
multisensory influences already exist at early processing stages and progressively increase in higher
areas. This suggests that sensory integration is a distributed process involving several processing
stages to varying degrees, in opposition to the traditional idea of a modular organization of sensory
processing into independent unisensory processing modules.

6.5  MULTISENSORY INFLUENCES AND INDIVIDUAL NEURONS


Having localized multisensory influences to particular auditory fields, the obvious question arises
of whether and how nonauditory inputs improve the processing of acoustic information. As noted
above, this “how” question is ideally investigated using electrophysiological methods, for two rea-
sons. First, the imaging signal reflects neuronal activity only indirectly and does not permit defi-
nite conclusions about the underlying neuronal processes (Logothetis 2008; Kayser et al. 2009c;
Laurienti et al. 2005). And second, electrophysiology can directly address those parameters that
are believed to be relevant for neural information processing, such as the spike count of individual
neurons, temporal patterns of action potentials, or the synchronous firing of several neurons (Kayser
et al. 2009b).
Several electrophysiological studies have characterized multisensory influences in the auditory
cortex. Especially at the level of subthreshold activity, as defined by field potentials and current
source densities, strong visual or somatosensory influences were reported (Ghazanfar et al. 2005,
2008; Lakatos et al. 2007; Schroeder and Foxe 2002; Schroeder et al. 2001, 2003). These multi-
sensory influences were widespread, in that they occurred at the vast majority of recording sites in
each of these studies. In addition, these multisensory influences were not restricted to secondary
areas but also occurred in regions functionally and anatomically characterized as primary auditory
cortex (Kayser et al. 2008; Lakatos et al. 2007). Given that field potentials are especially sensitive
to synaptic activity in the vicinity of the electrode (Mitzdorf 1985; Juergens et al. 1999; Logothetis
2002), these observations demonstrate that multisensory input to the auditory cortex occurs at the
synaptic level. These results provide a direct neural basis for the multisensory influences seen in
imaging studies, but do not yet reveal whether the neural information representation benefits from
the multisensory input.
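Current source density analysis, mentioned here alongside field potentials, is typically obtained as the negative second spatial derivative of the LFP across equally spaced laminar contacts. The sketch below is a minimal, generic illustration of that estimator (the function name, the contact spacing, and the conductivity value are placeholder assumptions, not taken from the cited studies):

```python
import numpy as np

def csd_second_derivative(lfp, spacing_mm=0.1, conductivity=0.3):
    """Estimate current source density (CSD) from a laminar LFP profile.

    lfp          : array of shape (n_channels, n_samples), equally spaced contacts
    spacing_mm   : inter-contact distance in mm (placeholder value)
    conductivity : tissue conductivity in S/m (placeholder value)

    Returns the CSD for the interior channels, shape (n_channels - 2, n_samples).
    """
    h = spacing_mm * 1e-3                                  # contact spacing in meters
    second_diff = lfp[2:] - 2 * lfp[1:-1] + lfp[:-2]       # second spatial difference
    return -conductivity * second_diff / h**2

# Toy usage: a 16-contact laminar probe, 1000 samples of simulated LFP.
lfp = np.random.randn(16, 1000)
print(csd_second_derivative(lfp).shape)                    # (14, 1000)
```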
Other studies provide evidence for multisensory influences on the firing of individual neurons in
the auditory cortex. For example, measurements in ferret auditory cortex revealed that 15% of the
neurons in core fields are sensitive to nonauditory inputs such as flashes of light (Bizley et al. 2006;
and see Cappe et al. 2007 for similar results in monkeys). We investigated such visual influences
in the macaque and found that a similar proportion (12%) of neurons in the auditory core revealed
multisensory interactions in their firing rates. Of these, nearly 4% responded to both acoustic and
visual stimuli when presented individually, and hence, constitute bimodal neurons. The remain-
ing 8% responded to unimodal sounds but did not respond to unimodal visual stimuli; however,
their responses were enhanced (or reduced) by the simultaneous presentation of both stimuli. This
response pattern does not conform to the traditional notion of bimodal neurons but represents a
kind of multisensory influence typically called subthreshold response modulation (Dehner et al.
2004). Similar subthreshold response modulations have been observed in a number of cortical
areas (Allman et al. 2008a, 2008b; Allman and Meredith 2007; Meredith and Allman 2009), and
suggest that multisensory influences can fall along a continuum, ranging from true unimodal neu-
rons to the classical bimodal neuron that exhibits suprathreshold responses to stimuli in several
modalities.
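The distinction drawn in this paragraph between bimodal neurons and subthreshold response modulation can be made concrete with trial-based criteria: a unit is bimodal if it responds significantly to each unimodal stimulus alone, and subthreshold-modulated if it responds only to sound but its auditory response is significantly changed by an added visual stimulus. The sketch below is purely illustrative; the test choice, threshold, and function names are assumptions and not the analysis used in the studies cited above.

```python
import numpy as np
from scipy import stats

def classify_unit(spk_base, spk_aud, spk_vis, spk_av, alpha=0.05):
    """Illustrative classification of a unit from trial-wise spike counts.

    spk_base, spk_aud, spk_vis, spk_av : 1-D arrays of spike counts per trial for
    baseline, auditory-only, visual-only, and audiovisual conditions.
    """
    aud_resp = stats.mannwhitneyu(spk_aud, spk_base).pvalue < alpha
    vis_resp = stats.mannwhitneyu(spk_vis, spk_base).pvalue < alpha
    av_mod = stats.mannwhitneyu(spk_av, spk_aud).pvalue < alpha

    if aud_resp and vis_resp:
        return "bimodal"                 # suprathreshold responses to both modalities
    if aud_resp and av_mod:
        return "subthreshold-modulated"  # responds only to sound, but vision changes that response
    if aud_resp:
        return "unimodal auditory"
    return "unresponsive / other"

# Toy usage with simulated Poisson spike counts (40 trials per condition).
rng = np.random.default_rng(0)
print(classify_unit(rng.poisson(2, 40), rng.poisson(6, 40),
                    rng.poisson(2, 40), rng.poisson(9, 40)))
```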
Notably, the fraction of neurons with significant multisensory influences in the auditory cortex
was considerably smaller than the fraction of sites showing similar response properties in the
local field potential (LFP), or the spatial area covered by the voxels showing multisensory responses
in the imaging data. Hence, although visual input seems to be widely present at the subthreshold
level, only a minority of neurons actually exhibit significant changes of their firing rates. This sug-
gests that the effect of visual stimulation on auditory information coding in early auditory cortex
is weaker than one would estimate from the strong multisensory influences reported in imaging
studies.
When testing the principles of temporal coincidence and inverse effectiveness for these audi-
tory cortex neurons, we found both to be obeyed: the relative timing of auditory and visual stimuli
was as important in shaping the multisensory influence as was the efficacy of the acoustic stimu-
lus (Kayser et al. 2008). Similar constraints of spatiotemporal stimulus alignment on audiovisual
response modulations in the auditory cortex have been observed in other studies as well (Bizley
et al. 2006). Additional experiments using either semantically congruent or incongruent audiovisual
stimuli revealed that visual influences in the auditory cortex also show specificity to more complex
stimulus attributes. For example, neurons integrating information about audiovisual communication
signals revealed reduced visual modulation when the acoustic communication call was paired with
a moving disk instead of the movie displaying the conspecific animal (Ghazanfar et al. 2008). A
recent study also revealed that pairing a natural sound with a mismatching movie abolishes multi-
sensory benefits for acoustic information representations (Kayser et al. 2009a). Altogether, this sug-
gests that visual influences in the primary and secondary auditory fields indeed provide functionally
specific visual information.
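The principles of temporal coincidence and inverse effectiveness mentioned above are commonly quantified with the multisensory enhancement index, the audiovisual response expressed relative to the best unisensory response. A minimal worked example with invented response values (not data from the cited experiments) illustrates how inverse effectiveness appears as larger enhancement for weaker unisensory responses:

```python
def enhancement_index(resp_av, resp_a, resp_v):
    """Percent multisensory enhancement relative to the best unisensory response."""
    best_unisensory = max(resp_a, resp_v)
    return 100.0 * (resp_av - best_unisensory) / best_unisensory

# Invented firing rates (spikes/s) for a weak and a strong acoustic stimulus.
print(enhancement_index(resp_av=9.0, resp_a=6.0, resp_v=1.0))    # 50.0 -> large gain at low efficacy
print(enhancement_index(resp_av=32.0, resp_a=30.0, resp_v=1.0))  # ~6.7 -> small gain at high efficacy
```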
Given that imaging studies reveal an increase of multisensory influence in higher auditory
regions, one should expect a concomitant increase in the proportion of multisensory neurons.
Indeed, when probing neurons in a classical association cortex, such as the STS, much stronger
multisensory influences are visible in the neurons' firing. Using the same stimuli and statistical
criteria, a recent study revealed a rather homogeneous population of unimodal and bimodal neurons
in the upper bank of the STS (Dahl et al. 2009): about half the neurons responded significantly to
both sensory modalities, whereas 28% of the neurons preferred the visual and 19% preferred the
auditory modality. Importantly, this study not only revealed a more complex interplay of auditory
and visual information representations in this region, but detailed electrophysiological mappings
demonstrated that a spatial organization of neurons according to their modality preferences exists
in the STS: neurons preferring the same modality (auditory or visual) co-occurred in close spatial
proximity or occurred intermingled with bimodal neurons, whereas neurons preferring different
modalities occurred only at spatially separated sites. This organization at the scale of individual neurons
led to extended patches of same modality preference when analyzed at the scale of millimeters,
revealing large-scale regions that preferentially respond to the same modality. These results lend
support to the notion that topographical organizations might serve as a general principle of inte-
grating information within and across the sensory modalities (Beauchamp et al. 2004; Wallace et
al. 2004).
These insights from studies of multisensory integration at the neuronal level are in concordance
with the notion that sensory integration is a distributed hierarchical process that extends over sev-
eral processing stages. Given the difficulty in characterizing and interpreting the detailed effect of
multisensory influences at a single processing stage, a comparative approach might prove useful:
comparing multisensory influences at different stages using the same stimuli might help not only
in understanding the contribution of individual stages to the process of sensory integration, but
also in understanding the exact benefit a particular region derives from receiving multisensory
input.

6.6  MULTISENSORY INFLUENCES AND PROCESSING OF COMMUNICATION SIGNALS

The above findings clearly reveal that the processing of auditory information is modulated by visual
(or somatosensory) information already at processing stages in or close to the primary auditory
cortex. Notably, these cross-modal influences were seen not only in the context of naturalistic
stimuli, but also for very simple and artificial stimuli. For example, visual influences on neuronal
firing rates occurred when using flashes of light, short noise bursts, or very rapid somatosensory
stimulation (Bizley et al. 2006; Lakatos et al. 2007; Kayser et al. 2008; Cappe et al. 2007; Bizley
and King 2008). This suggests that multisensory influences in early auditory fields are not special-
ized for natural stimuli such as communication sounds, but rather reflect a more general process that
is sensitive to stimulus attributes such as relative timing, relative position, or semantic congruence.
To those especially interested in the neural basis of communication or speech, this poses the
immediate question of where in the brain multisensory influences are specialized for such stimuli
and mediate the well-known behavioral benefits of integration. As seen above, the integration of
face and voice information, exemplified by the benefit of seeing a talker's face in a noisy,
cocktail-party-like setting, serves as one of the key examples to illustrate the importance of
audiovisual integration; the underlying neural substrate, however,
remains elusive. One approach to elucidate this could be to focus on those cortical regions in
which neural processes directly related to the processing of communication sounds have been
reported. Besides the classical speech areas of the human brain, a number of other areas have been
implicated in the nonhuman primate: response preferences to conspecific vocaliza-
tions have been reported in the lateral belt (Tian et al. 2001), the insula cortex (Remedios et al.
2009), in a voice area on the anterior temporal plane (Petkov et al. 2008), and in the ventrolateral
prefrontal cortex (Cohen et al. 2007; Romanski et al. 2005). Notably, several of these stages
have not only been investigated in the context of purely auditory processing but have also been
assessed for audiovisual integration.
The lateral belt is one of the regions classically implicated in an auditory “what” pathway con-
cerned with the processing of acoustic object information (Romanski et al. 1999; Rauschecker and
Tian 2000). The process of object segmentation or identification could well benefit from input from
other modalities. Indeed, studies have reported that audiovisual interactions in the lateral belt are
widespread at the level of LFPs and include about 40% of the recorded units (Ghazanfar et al. 2005).
In fact, the multisensory influences in this region were found to depend on stimulus parameters such
as the face–voice onset asynchrony or the match of visual and acoustic vocalizations, suggesting a
good degree of specificity of the visual input. At the other end of this pathway, in the ventrolateral
prefrontal cortex, 46% of the neurons were found to reflect audiovisual components of vocalization
signals (Sugihara et al. 2006). Although the existence of a dedicated “what” pathway is still debated
(Bizley and Walker 2009; Hall 2003; Wang 2000), these results highlight the prominence of multi-
sensory influences in the implicated areas.
In addition to these stages of the presumed “what” pathway, two other regions have recently
been highlighted in the context of vocal communication sounds. Recording in the primate
insula, we recently found a large cluster of neurons that respond preferentially to conspecific
vocalizations, when contrasted with a large set of other natural sounds (Remedios et al. 2009)
(Figure 6.3a). Many of these neurons not only responded more strongly to conspecific vocaliza-
tions, but also responded selectively to only a few examples, and their responses allowed the
decoding of the identity of individual vocalizations. This suggests that the insular cortex might
play an important role in the representation of vocal communication sounds. Notably, this
response preference to conspecific vocalizations is also supported by functional imaging studies
in animals (Figure 6.3b) and humans (Griffiths et al. 1997; Rumsey et al. 1997; Kotz et al. 2003;
Meyer et al. 2002; Zatorre et al. 1994). In addition, lesions of the insula often manifest as deficits
in sound or speech recognition (auditory agnosia) and speech production, confirming a central
function of this structure in communication-related processes (Habib et al. 1995; Cancelliere and
Kertesz 1990; Engelien et al. 1995). Notably, some of the neurons in this auditory responsive
region in the insula also show sensitivity to visual stimuli or response interactions during audio-
visual stimulation (R. Remedios and C. Kayser, unpublished data). However, the vast majority of
units in this structure are not affected by visual stimuli, suggesting that this region is likely not
concerned with the sensory integration of information related to communication calls, but mostly
processes acoustic input.
Another region that has recently been implicated in the processing of communication sounds
is the so-called voice region in the anterior temporal lobe. A preference for the human voice,
in particular, the identity of a human speaker, has been found in the human anterior temporal
lobe (Belin and Zatorre 2003; Belin et al. 2000; von Kriegstein et al. 2003) and a similar pref-
erence for conspecific vocalizations and the identity of a monkey caller has been observed in
the anterior temporal lobe of the nonhuman primate (Petkov et al. 2008). For example, high-
resolution functional imaging revealed several regions in the superior temporal lobe responding
preferentially to the presentation of conspecific macaque vocalizations over other vocalizations
and natural sounds (see the red clusters in the middle panel of Figure 6.3c), as has been seen in
humans (Belin et al. 2000; von Kriegstein et al. 2003). These results can be interpreted as evi-
dence for sensitivity to the acoustic features that distinguish the vocalizations of members of the
species from other sounds. Further experiments have shown that one of these regions located in
the anterior temporal lobe responds more vigorously to sounds that come from different speakers,
whose meaning is constant, rather than to those that come from the same speaker, whose mean-
ing and acoustics vary (Belin and Zatorre 2003; von Kriegstein et al. 2003; Petkov et al. 2008).
These observations support the conclusion of a high-level correspondence in the processing of
species-specific vocal features and a common cross-species substrate in the brains of human and
nonhuman primates.
Notably, this human voice region can also be influenced by multisensory input. For
instance, von Kriegstein and colleagues (2006) used face and voice stimuli to first localize the
human “face” and “voice” selective regions. They then showed that the activity of each of these
regions was modulated by multisensory input. Comparable evidence from the animal model
is still unavailable. Ongoing work in our laboratory is pursuing this question (Perrodin et al.
2009a, 2009b).

[Figure 6.3 appears here. Panel (a): peristimulus time histogram (imp/s relative to baseline vs. time) of an example insula neuron and the relative population response (155 units) to conspecific vocalizations (MVoc), animal vocalizations (Asnd), and environmental sounds (Esnd). Panel (b): % BOLD response map marking the insula and auditory cortex (AC), with a bar plot of responses to Mvoc, Asnd, and Esnd. Panel (c): schematic of auditory fields (A1, Ts1, Ts2, Tpt, Pro; core, belt, parabelt) marking the voice area, with a bar plot of normalized responses to Mvoc, Asnd, and Esnd; the color code indicates preference for conspecific vocalizations versus other sounds.]

FIGURE 6.3  (See color insert.) Response preferences to (vocal) communication sounds. Preferences to con-
specific communication sounds have been found in insula (panels a and b) and in anterior temporal lobe (panel
c). In both cases, responses to conspecific communication sounds (Mvoc) have been contrasted with sounds of
other animals (Asnd) and environmental sounds (Esnd). (a) Data from an electrophysiological investigation of
insula neurons. (From Remedios, R. et al., J. Neurosci., 29, 1034–1045, 2009. With permission.) Upper panel
displays one example neuron, showing a strong response to Mvocs. Lower panel displays normalized popu-
lation response to three sound categories (mean ± SEM). (b) Example data from a single fMRI experiment
showing voxels significantly preferring conspecific vocalizations over other sounds (color code) in a single
slice. Such voxels were found in anterior auditory cortex (field TS2), core and lateral belt, and in insula. Bar
plot displays BOLD signal change for different conditions (mean ± SEM for insula voxels). (c) Identification
of a voice region in monkey brain using functional imaging. (From Petkov, C.I. et al., Nat. Neurosci., 11,
367–374, 2008. With permission.) Preferences to conspecific vocal sounds (red voxels) were found in caudal
auditory cortex (as also seen in b), and on anterior temporal lobe (voice area). This location of voice area is
consistent with studies on voice processing in human brain, and suggests a common basis of voice processing
in human and nonhuman primates. Bar plot displays BOLD signal change in voice region for different sound
conditions (mean ± SEM across experiments).

6.7  CONCLUSIONS
During everyday actions, we benefit tremendously from the combined input provided by our different
sensory modalities. Although seldom experienced explicitly, only this combined sensory input makes
an authentic and coherent percept of our environment possible (Adrian 1928; Stein and Meredith
1993). In fact, multisensory integration helps us to react faster or with higher precision (Calvert et al.
2004; Hershenson 1962), improves our learning capacities (Montessori 1967; Oakland et al. 1998),
and sometimes even completely alters our percept (McGurk and MacDonald 1976). As a result,
understanding sensory integration and its neural basis not only provides insights into brain function
and perception, but could also yield improved strategies for learning and rehabilitation programs
(Shams and Seitz 2008).
Evidence from functional imaging and electrophysiology demonstrates that this process of sen-
sory integration is likely distributed across multiple processing stages. Multisensory influences
are already present at early stages, such as in the primary auditory cortex, but increase along the
processing hierarchy and are ubiquitous in higher association cortices. Existing data suggest that
multisensory influences at early stages are specific to basic stimulus characteristics such as spatial
and temporal localization, but are not specialized toward particular kinds of stimuli, such as com-
munication signals. Whether, where, and how multisensory influences become more specialized
remains to be investigated by future work. In this search, a comparative approach comparing the
multisensory influences at multiple processing stages during the same stimulation paradigm might
prove especially useful. As highlighted here, such work would ideally proceed using a combination of
methods that probe neural responses at different spatiotemporal scales, such as electrophysiology and
functional imaging. Certainly, much remains to be learned before we fully understand the neural
basis underlying the behavioral gains provided by multisensory stimuli.

REFERENCES
Adrian, E.D. 1928. The Basis of Sensations. New York: Norton.
Allman, B.L., and M.A. Meredith. 2007. Multisensory processing in “unimodal” neurons: Cross-modal sub-
threshold auditory effects in cat extrastriate visual cortex. Journal of Neurophysiology 98:545–9.
Allman, B.L., R.E. Bittencourt-Navarrete, L.P. Keniston et al. 2008a. Do cross-modal projections always result
in multisensory integration? Cerebral Cortex 18:2066–76.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2008b. Subthreshold auditory inputs to extrastriate visual
neurons are responsive to parametric changes in stimulus quality: Sensory-specific versus non-specific
coding. Brain Research 1242:95–101.
Beauchamp, M.S., B.D. Argall, J. Bodurka, J.H. Duyn, and A. Martin. 2004. Unraveling multisensory integra-
tion: Patchy organization within human STS multisensory cortex. Nature Neuroscience 7:1190–2.
Beauchamp, M.S., N.E. Yasar, R.E. Frye, and T. Ro. 2008. Touch, sound and vision in human superior temporal
sulcus. NeuroImage 41:1011–20.
Belin, P., and R.J. Zatorre. 2003. Adaptation to speaker’s voice in right anterior temporal lobe. Neuroreport
14:2105–9.
Belin, P., R.J. Zatorre, P. Lafaille, P. Ahad, and B. Pike. 2000. Voice-selective areas in human auditory cortex.
Nature 403:309–12.
Benevento, L.A., J. Fallon, B.J. Davis, and M. Rezak. 1977. Auditory–visual interaction in single cells in the
cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey. Experimental
Neurology 57:849–72.
Bernstein, L.E., E.T. Auer Jr., J.K. Moore et al. 2002. Visual speech perception without primary auditory cortex
activation. Neuroreport 13:311–5.
Bizley, J.K., and A.J. King. 2008. Visual–auditory spatial processing in auditory cortical neurons. Brain
Research 1242:24–36.
Bizley, J.K., and K.M. Walker. 2009. Distributed sensitivity to conspecific vocalizations and implications for
the auditory dual stream hypothesis. Journal of Neuroscience 29:3011–3.
Bizley, J.K., F.R. Nodal, V.M. Bajo, I. Nelken, and A.J. King. 2006. Physiological and anatomical evidence for
multisensory interactions in auditory cortex. Cerebral Cortex 17:2172–89.
Bruce, C., R. Desimone, and C.G. Gross. 1981. Visual properties of neurons in a polysensory area in superior
temporal sulcus of the macaque. Journal of Neurophysiology 46:369–84.
Calvert, G.A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies.
Cerebral Cortex 11:1110–23.
Calvert, G.A., and R. Campbell. 2003. Reading speech from still and moving faces: The neural substrates of
visible speech. Journal of Cognitive Neuroscience 15:57–70.
Calvert, G.A., M.J. Brammer, E.T. Bullmore et al. 1999. Response amplification in sensory-specific cortices
during crossmodal binding. Neuroreport 10:2619–23.
Calvert, G., C. Spence, and B.E. Stein. 2004. The Handbook of Multisensory Processes. Cambridge: MIT
Press.
Calvert, G.A., E.T. Bullmore, M.J. Brammer et al. 1997. Activation of auditory cortex during silent lipreading.
Science 276:593–6.
Campanella, S., and P. Belin. 2007. Integrating face and voice in person perception. Trends in Cognitive
Sciences 11:535–43.
Cancelliere, A.E., and A. Kertesz. 1990. Lesion localization in acquired deficits of emotional expression and
comprehension. Brain and Cognition 13:133–47.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. European Journal of Neuroscience 22:2886–902.
Cappe, C., G. Loquet, P. Barone, and E.M. Rouiller. 2007. Neuronal responses to visual stimuli in auditory
cortical areas of monkeys performing an audio-visual detection task. European Brain and Behaviour
Society. Trieste.
Chiry, O., E. Tardif, P.J. Magistretti, and S. Clarke. 2003. Patterns of calcium-binding proteins support
parallel and hierarchical organization of human auditory areas. European Journal of Neuroscience
17:397–410.
Clarke, S., and F. Rivier. 1998. Compartments within human primary auditory cortex: Evidence from cyto-
chrome oxidase and acetylcholinesterase staining. European Journal of Neuroscience 10:741–5.
Cohen, Y.E., F. Theunissen, B.E. Russ, and P. Gill. 2007. Acoustic features of rhesus vocalizations and their
representation in the ventrolateral prefrontal cortex. Journal of Neurophysiology 97:1470–84.
Dahl, C., N. Logothetis, and C. Kayser. 2009. Spatial organization of multisensory responses in temporal asso-
ciation cortex. Journal of Neuroscience 29:11924–32.
Dehner, L.R., L.P. Keniston, H.R. Clemo, and M.A. Meredith. 2004. Cross-modal circuitry between auditory
and somatosensory areas of the cat anterior ectosylvian sulcal cortex: A ‘new’ inhibitory form of multi-
sensory convergence. Cerebral Cortex 14:387–403.
Engel, S.A., D.E. Rumelhart, B.A. Wandell et al. 1994. fMRI of human visual cortex. Nature 369:525.
Engelien, A., D. Silbersweig, E. Stern et al. 1995. The functional anatomy of recovery from auditory
agnosia. A PET  study of sound categorization in a neurological patient and normal controls. Brain
118(Pt 6):1395–409.
Ernst, M.O., and H.H. Bülthoff. 2004. Merging the senses into a robust percept. Trends in Cognitive Science
8:162–9.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience 22:5749–59.
Formisano, E., D.S. Kim, F. Di Salle et al. 2003. Mirror-symmetric tonotopic maps in human primary auditory
cortex. Neuron 40:859–69.
Foxe, J.J., and C.E. Schroeder. 2005. The case for feedforward multisensory convergence during early cortical
processing. Neuroreport 16:419–23.
Foxe, J.J., G.R. Wylie, A. Martinez et al. 2002. Auditory–somatosensory multisensory processing in auditory
association cortex: An fMRI study. Journal of Neurophysiology 88:540–3.
Fullerton, B.C., and D.N. Pandya. 2007. Architectonic analysis of the auditory-related areas of the superior
temporal region in human brain. Journal of Comparative Neurology 504:470–98.
Ghazanfar, A.A., C. Chandrasekaran, and N.K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience 28:4457–69.
Ghazanfar, A.A., and N.K. Logothetis. 2003. Neuroperception: Facial expressions linked to monkey calls.
Nature 423:937–8.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences
10:278–85.
Ghazanfar, A.A., J.X. Maier, K.L. Hoffman, and N.K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience 25:5004–12.
Griffiths, T.D., A. Rees, C. Witton et al. 1997. Spatial and temporal auditory processing deficits following right
hemisphere infarction. A psychophysical study. Brain 120(Pt 5):785–94.
Habib, M., G. Daquin, L. Milandre et al. 1995. Mutism and auditory agnosia due to bilateral insular damage—
role of the insula in human communication. Neuropsychologia 33:327–39.
Hackett, T.A., I. Stepniewska, and J.H. Kaas. 1998. Subdivisions of auditory cortex and ipsilateral cortical connec-
tions of the parabelt auditory cortex in macaque monkeys. Journal of Comparative Neurology 394:475–95.
Hall, D.A. 2003. Auditory pathways: Are ‘what’ and ‘where’ appropriate? Current Biology 13:R406–8.
Hershenson, M. 1962. Reaction time as a measure of intersensory facilitation. Journal of Experimental
Psychology 63:289–93.
Jones, E.G., and T.P. Powell. 1970. An anatomical study of converging sensory pathways within the cerebral
cortex of the monkey. Brain 93:793–820.
Juergens, E., A. Guettler, and R. Eckhorn. 1999. Visual stimulation elicits locked and induced gamma oscilla-
tions in monkey intracortical- and EEG-potentials, but not in human EEG. Experimental Brain Research
129:247–59.
Kaas, J.H., and T.A. Hackett. 2000. Subdivisions of auditory cortex and processing streams in primates.
Proceedings of the National Academy of Sciences of the United States of America 97:11793–9.
Kayser, C., and N.K. Logothetis. 2007. Do early sensory cortices integrate cross-modal information? Brain
Structure and Function 212:121–32.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2005. Integration of touch and sound in auditory
cortex. Neuron 48:373–84.
Kayser, C., C.I. Petkov, M. Augath, and N.K. Logothetis. 2007. Functional imaging reveals visual modulation
of specific fields in auditory cortex. Journal of Neuroscience 27:1824–35.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–74.
Kayser, C., N. Logothetis, and S. Panzeri. 2009a. Visual enhancement of the information representation in audi-
tory cortex. Current Biology (in press).
Kayser, C., M.A. Montemurro, N. Logothetis, and S. Panzeri. 2009b. Spike-phase coding boosts and stabilizes
the information carried by spatial and temporal spike patterns. Neuron 61:597–608.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2009c. Multisensory interactions in primate auditory cortex:
fMRI and electrophysiology. Hearing Research (in press). doi:10.1016/j.heares.2009.02.011.
Kosaki, H., T. Hashikawa, J. He, and E.G. Jones. 1997. Tonotopic organization of auditory cortical fields
delineated by parvalbumin immunoreactivity in macaque monkeys. Journal of Comparative Neurology
386:304–16.
Kotz, S.A., M. Meyer, K. Alter et al. 2003. On the lateralization of emotional prosody: An event-related func-
tional MR investigation. Brain and Language 86:366–76.
Lakatos, P., C.M. Chen, M.N. O’Connell, A. Mills, and C.E. Schroeder. 2007. Neuronal oscillations and multi-
sensory interaction in primary auditory cortex. Neuron 53:279–92.
Laurienti, P.J., T.J. Perrault, T.R. Stanford, M.T. Wallace, and B.E. Stein. 2005. On the use of superadditivity
as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental
Brain Research 166:289–97.
Lauritzen, M. 2005. Reading vascular changes in brain imaging: Is dendritic calcium the key? Nature
Neuroscience Reviews 6(1):77–85.
Lehmann, C., M. Herdener, F. Esposito et al. 2006. Differential patterns of multisensory interactions in core and
belt areas of human auditory cortex. NeuroImage 31:294–300.
Leopold, D.A. 2009. Neuroscience: Pre-emptive blood flow. Nature 457:387–8.
Logothetis, N.K. 2002. The neural basis of the blood-oxygen-level-dependent functional magnetic resonance
imaging signal. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences
357:1003–37.
Logothetis, N.K. 2008. What we can do and what we cannot do with fMRI. Nature 453:869–78.
Logothetis, N.K., H. Guggenberger, S. Peled, and J. Pauls. 1999. Functional imaging of the monkey brain.
Nature Neuroscience 2:555–62.
Martuzzi, R., M.M. Murray, C.M. Michel et al. 2006. Multisensory interactions within human primary cortices
revealed by BOLD dynamics. Cerebral Cortex 17:1672–9.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264:746–8.
Meredith, M.A., and B.L. Allman. 2009. Subthreshold multisensory processing in cat auditory cortex.
Neuroreport 20:126–31.
Merzenich, M.M., and J.F. Brugge. 1973. Representation of the cochlear partition of the superior temporal
plane of the macaque monkey. Brain Research 50:275–96.
Meyer, M., K. Alter, A.D. Friederici, G. Lohmann, and D.Y. Von Cramon. 2002. FMRI reveals brain regions
mediating slow prosodic modulations in spoken sentences. Human Brain Mapping 17:73–88.
Mitzdorf, U. 1985. Current source-density method and application in cat cerebral cortex: Investigation of
evoked potentials and EEG phenomena. Physiological Reviews 65:37–100.
Montessori, M. 1967. The Absorbent Mind. New York: Henry Holt & Co.
Morel, A., P.E. Garraghty, and J.H. Kaas. 1993. Tonotopic organization, architectonic fields, and connections of
auditory cortex in macaque monkeys. Journal of Comparative Neurology 335:437–59.
Oakland, T., J.L. Black, G. Stanford, N.L. Nussbaum, and R.R. Balise. 1998. An evaluation of the dyslexia
training program: A multisensory method for promoting reading in students with reading disabilities.
Journal of Learning Disabilities 31:140–7.
Pekkola, J., V. Ojanen, T. Autti et al. 2005. Attention to visual speech gestures enhances hemodynamic activity
in the left planum temporale. Human Brain Mapping 27:471–7.
Perrodin, C., C. Kayser, N. Logothetis, and C. Petkov. 2009a. Visual influences on voice-selective neurons
in the anterior superior-temporal plane. International Conference on Auditory Cortex. Magdeburg,
Germany, 2009.
Perrodin, C., L. Veit, C. Kayser, N.K. Logothetis, and C.I. Petkov. 2009b. Encoding properties of neurons sensi-
tive to species-specific vocalizations in the anterior temporal lobe of primates. International Conference
on Auditory Cortex. Magdeburg, Germany, 2009.
Petkov, C.I., C. Kayser, M. Augath, and N.K. Logothetis. 2006. Functional imaging reveals numerous fields in
the monkey auditory cortex. PLoS Biology 4:e215.
Petkov, C.I., C. Kayser, T. Steudel et al. 2008. A voice region in the monkey brain. Nature Neuroscience
11:367–74.
Petkov, C.I., C. Kayser, M. Augath, and N.K. Logothetis. 2009. Optimizing the imaging of the monkey auditory
cortex: Sparse vs. continuous fMRI. Magnetic Resonance Imaging 27:1065–73.
Petrini, K., M. Russell, and F. Pollick. 2009. When knowing can replace seeing in audiovisual integration of
actions. Cognition 110:432–9.
Rauschecker, J.P. 1998. Cortical processing of complex sounds. Current Opinion in Neurobiology 8:516–21.
Rauschecker, J.P., and B. Tian. 2000. Mechanisms and streams for processing of what and where in auditory
cortex. Proceedings of the National Academy of Sciences of the United States of America 97:11800–6.
Rauschecker, J.P., and B. Tian. 2004. Processing of band-passed noise in the lateral auditory belt cortex of the
rhesus monkey. Journal of Neurophysiology 91:2578–89.
Rauschecker, J.P., B. Tian, and M. Hauser. 1995. Processing of complex sounds in the macaque nonprimary
auditory cortex. Science 268:111–4.
Rauschecker, J.P., B. Tian, T. Pons, and M. Mishkin. 1997. Serial and parallel processing in rhesus monkey
auditory cortex. Journal of Comparative Neurology 382:89–103.
Recanzone, G.H., D.C. Guard, and M.L. Phan. 2000. Frequency and intensity response properties of single neu-
rons in the auditory cortex of the behaving macaque monkey. Journal of Neurophysiology 83:2315–31.
Remedios, R., N.K. Logothetis, and C. Kayser. 2009. An auditory region in the primate insular cortex respond-
ing preferentially to vocal communication sounds. Journal of Neuroscience 29:1034–45.
Rockland, K.S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology 50:19–26.
Romanski, L.M., B. Tian, J. Fritz et al. 1999. Dual streams of auditory afferents target multiple domains in the
primate prefrontal cortex. Nature Neuroscience 2:1131–6.
Romanski, L.M., B.B. Averbeck, and M. Diltz. 2005. Neural representation of vocalizations in the primate
ventrolateral prefrontal cortex. Journal of Neurophysiology 93:734–47.
Ross, L.A., D. Saint-Amour, V.M. Leavitt, D.C. Javitt, and J.J. Foxe. 2007. Do you see what I am saying?
Exploring visual enhancement of speech comprehension in noisy environments. Cerebral Cortex 17:
1147–53.
Rumsey, J.M., B. Horwitz, B.C. Donohue et al. 1997. Phonological and orthographic components of word
recognition. A PET-rCBF study. Brain 120(Pt 5):739–59.
Schroeder, C.E., and J.J. Foxe. 2002. The timing and laminar profile of converging inputs to multisensory areas
of the macaque neocortex. Brain Research. Cognitive Brain Research 14:187–98.
Schroeder, C.E., and J. Foxe. 2005. Multisensory contributions to low-level, ‘unisensory’ processing. Current
Opinion in Neurobiology 15:454–8.
Schroeder, C.E., R.W. Lindsley, C. Specht et al. 2001. Somatosensory input to auditory association cortex in
the macaque monkey. Journal of Neurophysiology 85:1322–7.
Schroeder, C.E., J. Smiley, K.G. Fu et al. 2003. Anatomical mechanisms and functional implications of multi-
sensory convergence in early cortical processing. International Journal of Psychophysiology 50:5–17.
Schurmann, M., G. Caetano, Y. Hlushchuk, V. Jousmaki, and R. Hari. 2006. Touch activates human auditory
cortex. NeuroImage 30:1325–31.
Shams, L., and A.R. Seitz. 2008. Benefits of multisensory learning. Trends in Cognitive Sciences 12:411–7.
Stein, B.E. 1998. Neural mechanisms for synthesizing sensory information and producing adaptive behaviors.
Experimental Brain Research 123:124–35.
Stein, B.E., and T.R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the single
neuron. Nature Reviews Neuroscience 9:255–66.
Stein, B.E., and M.A. Meredith. 1993. Merging of the Senses. Cambridge: MIT Press.
Sugihara, T., M.D. Diltz, B.B. Averbeck, and L.M. Romanski. 2006. Integration of auditory and visual communi-
cation information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience 26:11138–47.
Sumby, W.H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the Acoustical
Society of America 26:212–5.
Talavage, T.M., M.I. Sereno, J.R. Melcher et al. 2004. Tonotopic organization in human auditory cortex revealed
by progressions of frequency sensitivity. Journal of Neurophysiology 91:1282–96.
Tian, B., D. Reser, A. Durham, A. Kustov, and J.P. Rauschecker. 2001. Functional specialization in rhesus
monkey auditory cortex. Science 292:290–3.
van Atteveldt, N., E. Formisano, R. Goebel, and L. Blomert. 2004. Integration of letters and speech sounds in
the human brain. Neuron 43:271–82.
van Wassenhove, V., K.W. Grant, and D. Poeppel. 2005. Visual speech speeds up the neural processing of
auditory speech. Proceedings of the National Academy of Sciences of the United States of America
102:1181–6.
von Kriegstein, K., and A.L. Giraud. 2006. Implicit multisensory associations influence voice recognition.
PLoS Biology 4:e326.
von Kriegstein, K., E. Eger, A. Kleinschmidt, and A.L. Giraud. 2003. Modulation of neural responses to speech
by directing attention to voices or verbal content. Brain Research. Cognitive Brain Research 17:48–55.
von Kriegstein, K., A. Kleinschmidt, and A.L. Giraud. 2006. Voice recognition and cross-modal responses to
familiar speakers’ voices in prosopagnosia. Cerebral Cortex 16:1314–22.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences of the United States of America 101:2167–72.
Wang, X. 2000. On cortical coding of vocal communication sounds in primates. Proceedings of the National
Academy of Sciences of the United States of America 97:11843–9.
Warnking, J., M. Dojat, A. Guerin-Dugue et al. 2002. fMRI retinotopic mapping—step by step. NeuroImage
17:1665–83.
Wessinger, C.M., J. Vanmeter, B. Tian et al. 2001. Hierarchical organization of the human auditory cortex
revealed by functional magnetic resonance imaging. Journal of Cognitive Neuroscience 13:1–7.
Zatorre, R.J., A.C. Evans, and E. Meyer. 1994. Neural mechanisms underlying melodic perception and memory
for pitch. Journal of Neuroscience 14:1908–19.
7  Multisensory Integration through Neural Coherence
Andreas K. Engel, Daniel Senkowski, and Till R. Schneider

CONTENTS
7.1 Introduction........................................................................................................................... 115
7.2 Views on Cross-Modal Integration........................................................................................ 116
7.2.1 Integration by Convergence....................................................................................... 116
7.2.2 Integration through Neural Coherence...................................................................... 117
7.3 Oscillatory Activity in Cross-Modal Processing................................................................... 117
7.3.1 Oscillations Triggered by Multisensory Stimuli....................................................... 117
7.3.2 Effects of Cross-Modal Semantic Matching on Oscillatory Activity....................... 119
7.3.3 Modulation of Cross-Modal Oscillatory Responses by Attention............................ 119
7.3.4 Percept-Related Multisensory Oscillations................................................................ 121
7.4 Functional Role of Neural Synchrony for Cross-Modal Interactions.................................... 123
7.5 Outlook.................................................................................................................................. 125
References....................................................................................................................................... 126

7.1  INTRODUCTION
The inputs delivered by different sensory organs provide us with complementary information about
the environment. Constantly, multisensory interactions occur in the brain to evaluate cross-modal
matching or conflict of such signals. The outcome of these interactions is of critical importance
for perception, cognitive processing, and the control of action (Meredith and Stein 1983, 1985;
Stein and Meredith 1993; Macaluso and Driver 2005; Kayser and Logothetis 2007). Recent stud-
ies have revealed that a vast amount of cortical operations, including those carried out by primary
regions, are shaped by inputs from multiple sensory modalities (Amedi et al. 2005; Ghazanfar and
Schroeder 2006; Kayser and Logothetis 2007, 2009). Multisensory integration is highly automatized
and can even occur when there is no meaningful relationship between the different sensory inputs
and even under conditions with no perceptual awareness, as demonstrated in pioneering research on
multisensory interactions in the superior colliculus of anesthetized cats (Meredith and Stein 1983,
1985; Stein and Meredith 1993; Stein et al. 2002). Clearly, these findings suggest the fundamental
importance of multisensory processing for development (Sur et al. 1990; Shimojo and Shams 2001;
Bavelier and Neville 2002) and normal functioning of the nervous system.
In recent years, an increasing number of studies has aimed at characterizing multisensory corti-
cal regions, revealing multisensory processing in the superior temporal sulcus, the intraparietal
sulcus, frontal regions as well as the insula and claustrum (Calvert 2001; Ghazanfar and Schroeder
2006; Kayser and Logothetis 2007). Interestingly, there is increasing evidence that neurons in areas
formerly considered unimodal, such as auditory belt areas (Foxe et al. 2002; Kayser et al. 2005;
Macaluso and Driver 2005; Ghazanfar and Schroeder 2006; Kayser and Logothetis 2007), can also
exhibit multisensory characteristics. Furthermore, numerous subcortical structures are involved in
multisensory processing. In addition to the superior colliculus (Meredith and Stein 1983, 1985),
this includes the striatum (Nagy et al. 2006), the cerebellum (Baumann and Greenlee 2007), the
amygdala (Nishijo et al. 1988), and there is evidence for cross-modal interactions at the level of the
thalamus (Komura et al. 2005).
Whereas the ubiquity and fundamental relevance of multisensory processing have become
increasingly clear, the neural mechanisms underlying multisensory interaction are much less
well understood. In this chapter, we review recent studies that may cast new light on this issue.
Although classical studies have postulated a feedforward convergence of unimodal signals as the
primary mechanism for multisensory integration (Stein and Meredith 1993; Meredith 2002), there
is now evidence that both feedback and lateral interaction may also be relevant (Driver and Spence
2000; Foxe and Schroeder 2005; Ghazanfar and Schroeder 2006; Kayser and Logothetis 2007).
Beyond this changing view on the anatomical substrate, there is increasing awareness that com-
plex dynamic interactions of cell populations, leading to coherent oscillatory firing patterns, may
be crucial for mediating cross-systems integration in the brain (von der Malsburg and Schneider
1986; Singer and Gray 1995; Singer 1999; Engel et al. 1992, 2001; Varela et al. 2001; Herrmann
et al. 2004a; Fries 2005). Here, we will consider the hypothesis that synchronized oscillations may
also provide a potential mechanism for cross-modal integration and for the selection of informa-
tion that is coherent across different sensory channels. We will (1) contrast the two different views
on cross-modal integration that imply different mechanisms (feedforward convergence vs. neural
coherence), (2) review recent studies on oscillatory responses and cross-modal processing, and
(3) discuss functional aspects and scenarios for the involvement of neural coherence in cross-modal
interaction.

7.2  VIEWS ON CROSS-MODAL INTEGRATION


7.2.1  Integration by Convergence
The classical view posits that multisensory integration occurs in a hierarchical manner by progres-
sive convergence of pathways and, thus, sensory signals are integrated only in higher association
areas and in specialized subcortical regions (Stein and Meredith 1993; Meredith 2002). A core
assumption of this approach is that the neural representation of an object is primarily reflected in a
firing rate code. Multisensory integration, accordingly, is expressed by firing rate changes in neu-
rons or neural populations receiving convergent inputs from different modalities. A frequently used
approach to investigate multisensory processing at the level of single neurons is the comparison of
spike rate in response to multisensory stimuli with the firing rate observed when presenting the most
effective of these stimuli alone (Meredith and Stein 1983, 1985; Stein and Meredith 1993; Stein et al.
2002). In more recent studies, an approach in which the neuronal responses to multisensory inputs
are directly compared with the algebraic sum of the neuronal responses to the unisensory constitu-
ents has been applied (Rowland et al. 2007; Stanford et al. 2005). In this approach, multisensory
responses that are larger than the sum of the unisensory responses are referred to as superaddi-
tive, whereas multisensory responses that are smaller are classified as subadditive. A large body of
evidence demonstrates such multisensory response patterns in a wide set of brain regions (Calvert
2001; Macaluso and Driver 2005; Ghazanfar and Schroeder 2006).
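For neuronal data, the additive criterion described in this paragraph amounts to comparing the multisensory response against the algebraic sum of the unisensory responses. A minimal sketch follows; the tolerance parameter and the example numbers are assumptions standing in for the trial-wise statistics used in the cited studies:

```python
def additive_model_label(resp_av, resp_a, resp_v, tol=0.05):
    """Label a multisensory response relative to the sum of the unisensory responses.

    tol is an arbitrary tolerance (fraction of the predicted sum) standing in for a
    proper trial-wise statistical test.
    """
    predicted = resp_a + resp_v
    if resp_av > predicted * (1 + tol):
        return "superadditive"
    if resp_av < predicted * (1 - tol):
        return "subadditive"
    return "additive"

# Invented responses (e.g., spikes/s or BOLD amplitude).
print(additive_model_label(resp_av=22.0, resp_a=10.0, resp_v=8.0))  # superadditive
print(additive_model_label(resp_av=12.0, resp_a=10.0, resp_v=8.0))  # subadditive
```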
However, as recognized by numerous authors in recent years, a pure convergence model would
probably not suffice to account for all aspects of multisensory processing (Driver and Spence 2000;
Foxe and Schroeder 2005; Ghazanfar and Schroeder 2006; Kayser and Logothetis 2007). First,
strong cross-modal interactions and modulation occur in primary cortices, which is difficult to
reconcile with the notion of hierarchical convergence. Second, a convergence scenario does not
appear flexible enough because it does not allow for rapid recombination of cross-modal signals into
completely novel percepts. Furthermore, a feedforward convergence model does not explain how
low-level information about objects can remain accessible because the high-level representation is
noncompositional, i.e., it does not explicitly make reference to elementary features.

7.2.2  Integration through Neural Coherence


A different account of multisensory interaction can be derived from data on the functional role of
correlated neural activity, which is likely to play a key role for feature integration and response
selection in various sensory modalities (von der Malsburg and Schneider 1986; Singer and Gray
1995; Singer 1999; Tallon-Baudry and Bertrand 1999; Engel et al. 1992, 2001; Herrmann et al.
2004a; Fries 2005). As shown by numerous studies in both animals and humans, synchronized
oscillatory activity, in particular at frequencies in the gamma band (>30 Hz), is related to a large
variety of cognitive and sensorimotor functions. The majority of these studies were conducted in the
visual modality, relating gamma band coherence of neural assemblies to processes such as feature
integration over short and long distances (Engel et al. 1991a, 1991b; Tallon-Baudry et al. 1996), sur-
face segregation (Gray et al. 1989; Castelo-Branco et al. 2000), perceptual stimulus selection (Fries
et al. 1997; Siegel et al. 2007), and attention (Müller et al. 2000; Fries et al. 2001; Siegel et al. 2008).
Beyond the visual modality, gamma band synchrony has also been observed in the auditory (Brosch
et al. 2002; Debener et al. 2003), somatosensory (Bauer et al. 2006), and olfactory systems (Bressler
and Freeman 1980; Wehr and Laurent 1996). Moreover, gamma band synchrony has been impli-
cated in processes such as sensorimotor integration (Roelfsema et al. 1997; Womelsdorf et al. 2006),
movement preparation (Sanes and Donoghue 1993; Farmer 1998) or memory formation (Csicsvari
et al. 2003; Gruber and Müller 2005; Herrmann et al. 2004b).
Collectively, these data provide strong support for the hypothesis that synchronization of neu-
ral signals is a key mechanism for integrating and selecting information in distributed networks.
This so-called “temporal correlation hypothesis” (Singer and Gray 1995; Singer 1999; Engel et al.
2001) predicts that coherence of neural signals allows highly specific patterns of effective neuronal
coupling to be set up, thus enabling flexible and context-dependent binding, the selection of
relevant information, and the efficient routing of signals through processing pathways (Salinas and
Sejnowski 2001; Fries 2005; Womelsdorf et al. 2007). Based on experimental evidence discussed in
the subsequent sections, we suggest that the same mechanism may also serve to establish specific
relationships across different modalities, allowing cross-modal interactions of sensory inputs and
the preferential routing of matching cross-modal information to downstream assemblies (Senkowski
et al. 2008). We would like to note that this view does not contradict the notion that cross-modal
interactions have strong effects on neuronal firing rates, but it shifts emphasis to considering a richer
dynamic repertoire of neural interactions and a more flexible scenario of cross-modal communica-
tion in the brain.

7.3  OSCILLATORY ACTIVITY IN CROSS-MODAL PROCESSING


A variety of different paradigms have been used to study the role of oscillatory responses and neu-
ral coherence during multisensory processing. Most studies have been performed in humans using
electroencephalography (EEG) or magnetoencephalography (MEG), whereas only few animal
studies are available. The approaches used address different aspects of multisensory processing,
including (1) bottom-up processing of multisensory information, (2) cross-modal semantic match-
ing, (3) modulation by top-down attention, as well as (4) cross-modally induced perceptual changes.
In all these approaches, specific changes in oscillatory responses or coherence of neural activity
have been observed, suggesting that temporally patterned neural signals may be relevant for more
than just one type of multisensory interaction.

7.3.1  Oscillations Triggered by Multisensory Stimuli


The first attempt to investigate neural synchronization of oscillatory responses in the human EEG
was the comparison of phase coherence patterns for multiple pairs of electrodes during the presenta-
tion of auditory and visual object names, as well as pictures of objects (von Stein et al. 1999). Under
conditions of passive stimulation (i.e., subjects were not required to perform any task), the authors
reported an increase of phase coherence in the lower beta band between temporal and parietal
electrode sites. The authors therefore suggested that meaningful semantic inputs are processed in a
modality-independent network of temporal and parietal areas.
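Phase coherence between electrode pairs, the measure used in that study, is commonly computed as a phase-locking value: band-pass filter each channel, extract the instantaneous phase with the Hilbert transform, and average the phase-difference vectors across trials. The sketch below illustrates this generic procedure; the filter order and the lower-beta band limits are placeholder choices rather than the parameters of the original analysis.

```python
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def phase_locking_value(x, y, fs, band=(13.0, 20.0)):
    """Phase-locking value between two channels across trials.

    x, y : arrays of shape (n_trials, n_samples)
    fs   : sampling rate in Hz
    band : frequency band in Hz (placeholder lower-beta range)

    Returns the PLV per time sample (0 = no phase coupling, 1 = perfect coupling).
    """
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    phase_x = np.angle(hilbert(filtfilt(b, a, x, axis=-1), axis=-1))
    phase_y = np.angle(hilbert(filtfilt(b, a, y, axis=-1), axis=-1))
    return np.abs(np.mean(np.exp(1j * (phase_x - phase_y)), axis=0))

# Toy usage: 50 trials of 1 s at 500 Hz from two electrodes.
rng = np.random.default_rng(1)
x, y = rng.standard_normal((2, 50, 500))
print(phase_locking_value(x, y, fs=500.0).shape)   # (500,)
```

In practice, sensor-level phase coupling estimates of this kind are affected by volume conduction, which is one motivation for the source-space analyses mentioned later in this chapter.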
Additional evidence for the involvement of oscillatory beta responses in multisensory process-
ing comes from a study in which subjects were instructed to respond to the appearance of any
stimulus in a stream of semantically meaningless auditory, visual, and multisensory audiovisual
stimuli (Senkowski et al. 2006). In the cross-modal condition, an enhancement was observed for
evoked oscillations, i.e., early oscillatory activity that is phase-locked to stimulus onset. This inte-
gration effect, which specifically occurred in the beta band, predicted the shortening of reaction
times observed for multisensory audiovisual stimuli, suggesting an involvement of beta activity
in the multisensory processing of behaviorally relevant stimuli. Cross-modal effects on evoked
beta responses have been also reported in a sensory gating paradigm (Kisley and Cornwell 2006),
in which auditory and somatosensory stimuli were presented at short or long interstimulus inter-
vals under conditions of passive stimulation. Higher auditory and somatosensory evoked beta
responses were found when the preceding stimulus was from the other compared to when it was
from the same modality, suggesting a cross-modal gating effect on the oscillatory activity in
this frequency range. Further EEG investigations have focused on the examination of oscilla-
tory activity in response to basic auditory, visual, and audiovisual stimuli during passive stimula-
tion (Sakowitz et al. 2000, 2001, 2005). In these studies, multisensory interactions were found in
evoked oscillatory responses across a wide range of frequencies and across various scalp sites,
indicating an involvement of neural synchronization of cell assemblies in different frequency
bands and brain regions.
Compelling evidence for an association between oscillatory responses and multisensory process-
ing comes from a recent study on somatosensory modulation of processing in primary auditory
cortex of alert monkeys (Lakatos et al. 2007). The authors investigated the effect of median nerve
stimulation on auditory responses and observed a pronounced augmentation of oscillations in the
delta, theta, and gamma frequency ranges. Further analysis revealed that this effect was mainly due
to a phase resetting of auditory oscillations by the somatosensory inputs. Another intriguing obser-
vation in the same study was that systematic variation of the relative delay between somatosensory
and auditory inputs led to multisensory response enhancements at intervals corresponding to the
cycle length of gamma, theta, and delta band oscillations. In contrast, for intermediate delays, the
paired stimulus response was smaller than the responses to auditory stimuli alone. Further support
for phase resetting as a potential mechanism of cross-modal interaction comes from a recent study
focusing on visual modulation of auditory processing in the monkey (Kayser et al. 2008). Using
auditory and visual stimuli while recording in the auditory core and belt regions of awake behav-
ing monkeys, the authors observed both enhancement and suppression of unit and field potential
responses. Importantly, visual stimuli could be shown to modulate the phase angle of auditory alpha
and theta band activity.
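Phase resetting of ongoing oscillations by input from another modality is typically assessed with inter-trial phase coherence: if the cross-modal stimulus resets the phase, phase angles at a given frequency become aligned across trials after stimulus onset. The following generic sketch uses a hand-built complex Morlet wavelet; the wavelet parameters are assumptions and the data are simulated, not taken from the cited recordings.

```python
import numpy as np

def inter_trial_coherence(trials, fs, freq, n_cycles=5.0):
    """Inter-trial phase coherence at a single frequency via complex Morlet convolution.

    trials : array of shape (n_trials, n_samples)
    fs     : sampling rate in Hz
    freq   : frequency of interest in Hz

    Returns the ITC per time sample (0 = random phases, 1 = perfect phase alignment).
    """
    # Complex Morlet wavelet with roughly n_cycles cycles at the target frequency.
    sigma_t = n_cycles / (2 * np.pi * freq)
    t = np.arange(-3 * sigma_t, 3 * sigma_t, 1.0 / fs)
    wavelet = np.exp(2j * np.pi * freq * t) * np.exp(-t**2 / (2 * sigma_t**2))
    wavelet /= np.sum(np.abs(wavelet))

    # Convolve each trial, keep only the phase, then average the unit phase vectors.
    phases = np.array([np.angle(np.convolve(tr, wavelet, mode="same")) for tr in trials])
    return np.abs(np.mean(np.exp(1j * phases), axis=0))

# Toy usage: 60 trials of 0.8 s at 1000 Hz, ITC at 6 Hz (theta range).
rng = np.random.default_rng(2)
trials = rng.standard_normal((60, 800))
print(inter_trial_coherence(trials, fs=1000.0, freq=6.0).shape)   # (800,)
```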
Two recent studies have addressed interactions between auditory and multisensory regions in the
superior temporal sulcus in behaving monkeys. One of the studies examined the effect of audiovi-
sual looming signals on neural oscillations in the two regions (Maier et al. 2008). The main finding
of this study was enhanced gamma band coherence between the two structures for cross-modally
coherent looming signals compared to unimodal or receding motion inputs. This suggests that
coupling of neuronal populations between primary sensory areas and higher-order multisensory
structures may be functionally relevant for the integration of audiovisual signals. In a recent study,
Kayser and Logothetis (2009) have investigated directed interactions between auditory cortex and
multisensory sites in the superior temporal sulcus. Their analysis, which was confined to frequen-
cies below the gamma band, suggests that superior temporal regions provide one major source of
visual influences to the auditory cortex and that the beta band is involved in directed information
flow through coupled oscillations.
In line with other studies (Foxe et al. 2002; Kayser et al. 2005; Ghazanfar and Schroeder 2006;
Kayser and Logothetis 2007), these data support the notion that inputs from other modalities and
from multisensory association regions can shape, in a context-dependent manner, the processing of
stimuli in presumed unimodal cortices. Taken together, the findings discussed above suggest that
modulation of both the power and the phase of oscillatory activity could be important mechanisms
of cross-modal interaction.

7.3.2  Effects of Cross-Modal Semantic Matching on Oscillatory Activity


In addition to spatial and temporal congruency (Stein and Meredith 1993), an important factor influ-
encing cross-modal integration is the semantic matching of information across sensory channels. A
recent study has addressed this issue during audiovisual processing in an object recognition task, in
which sounds of animals were presented in combination with a picture of either the same or a differ-
ent animal. Larger gamma band activity (GBA) was observed for semantically congruent compared to
semantically incongruent audiovisual stimuli (Yuval-Greenberg and Deouell 2007). We have recently
been able to obtain similar results using a visual-to-auditory semantic priming paradigm (Schneider
et al. 2008), in which we also observed stronger GBA for trials with cross-modal semantic congru-
ence as compared to incongruent trials (Figure 7.1). Source localization using the method of “linear
beamforming” revealed that the matching operation presumably reflected in the GBA involves multi-
sensory regions in the left lateral temporal cortex (Schneider et al. 2008). In line with these results, we
have recently observed an enhanced GBA for the matching of visual and auditory inputs in working
memory in a visual-to-auditory object-matching paradigm (Senkowski et al. 2009).
The effect of multisensory matching of meaningful stimuli on oscillatory activity has also been
the subject of studies that have used socially important stimuli such as faces and voices. In a study
exploiting the interesting case of synchronous versus asynchronous audiovisual speech (Doesburg et
al. 2007), changes in phase synchrony were shown to occur in a transiently activated gamma oscillatory net-
work. Gamma band phase-locking values were increased for asynchronous as compared to synchro-
nous speech between frontal and left posterior sensors, whereas gamma band amplitude showed an
enhancement for synchronous compared to asynchronous speech at long latencies after stimulus onset.
A more complex pattern of multisensory interactions between faces and voices of conspecifics has
been recently observed in the superior temporal sulcus of macaques (Chandrasekaran and Ghazanfar
2009). Importantly, this study demonstrates that faces and voices elicit distinct bands of activity in the
theta, alpha, and gamma frequency ranges in the superior temporal sulcus, and moreover, that these
frequency band activities show differential patterns of cross-modal integration effects.
The relationship between the early evoked auditory GBA and multisensory processing has also
been investigated in an audiovisual symbol-to-sound-matching paradigm (Widmann et al. 2007).
An enhanced left-frontally distributed evoked GBA and later parietally distributed induced (i.e.,
non–phase locked) GBA were found for auditory stimuli that matched the elements of a visual pat-
tern compared to auditory inputs that did not match the visual pattern. In another study, the role
of neural synchronization between visual and sensorimotor cortex has been examined in a multi-
sensory matching task in which tactile Braille stimuli and visual dot patterns had to be compared
(Hummel and Gerloff 2005). In trials in which subjects performed well compared to trials in which
they performed poorly, this study revealed an enhancement of phase coherence in the alpha band
between occipital and lateral central regions, whereas no significant effects could be found in other
frequency bands. In summary, the available studies suggest that cross-modal matching may be
reflected in both local and long-range changes of neural coherence.
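The long-range effects summarized here are commonly quantified with measures such as the phase-locking value used by Doesburg et al. (2007). The sketch below is a generic illustration of that measure, not the analysis of any particular study; the sampling rate, frequency band, and synthetic data are placeholder assumptions:

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def phase_locking_value(x, y, fs, band=(35.0, 55.0)):
        """Phase-locking value between two channels across trials.
        x, y: arrays of shape (n_trials, n_samples)."""
        b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
        phx = np.angle(hilbert(filtfilt(b, a, x, axis=-1), axis=-1))
        phy = np.angle(hilbert(filtfilt(b, a, y, axis=-1), axis=-1))
        # Magnitude of the mean phase-difference vector at each time point:
        # 1 = perfectly constant phase relation across trials, 0 = random.
        return np.abs(np.mean(np.exp(1j * (phx - phy)), axis=0))

    fs, n_trials, n_samples = 500.0, 50, 500
    rng = np.random.default_rng(0)
    x = rng.standard_normal((n_trials, n_samples))
    y = x + 0.5 * rng.standard_normal((n_trials, n_samples))  # partially coupled channel
    plv = phase_locking_value(x, y, fs)

Because the trial-wise phase difference, rather than the amplitude, enters the measure, increases in phase locking between distant sensors can occur without any change in local gamma-band power.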

7.3.3  Modulation of Cross-Modal Oscillatory Responses by Attention


One of the key functions of attention is to enhance perceptual salience and reduce stimulus ambigu-
ity. Behavioral, electrophysiological, and functional imaging studies have shown that attention plays

[Figure 7.1 graphic: (a) example congruent (sheep picture with "baaa") and incongruent (sheep picture with "ring") trials, with the sequence fixation (500 ms), S1 (400 ms), ISI (1000 ms), S2 (400 ms), response; (b) time–frequency plots (30–100 Hz, –200 to 800 ms) for congruent, incongruent, and difference; (c) source localization slices at x = –52, y = –32, z = –8.]

FIGURE 7.1  Enhanced gamma band activity during semantic cross-modal matching. (a) Semantically
congruent and incongruent objects were presented in a cross-modal visual-to-auditory priming paradigm.
(b) GBA in response to auditory target stimuli (S2) was enhanced following congruent compared to incongru-
ent stimuli. Square in right panel indicates a time-frequency window in which GBA difference was signifi-
cant. (c) Source localization of GBA (40–50 Hz) between 120 and 180 ms after auditory stimulus onset (S2)
using “linear beamforming” method (threshold at z = 2.56). Differences between congruent and incongruent
conditions are prominent in left middle temporal gyrus (BA 21; arrow). This suggests that enhanced GBA
reflects cross-modal semantic matching processes in lateral temporal cortex. (Adapted with permission from
Schneider, T.R. et al., NeuroImage, 42, 1244–1254, 2008.)

an important role in multisensory processing (Driver and Spence 2000; Macaluso et al. 2000; Foxe
et al. 2005; Talsma and Woldorff 2005). The effect of spatial selective attention on GBA in a mul-
tisensory setting has recently been investigated (Senkowski et al. 2005). Subjects were presented
with a stream of auditory, visual, and combined audiovisual stimuli to the left and right hemispaces
and had to attend to a designated side to detect occasional target stimuli in either modality. An
enhancement of the evoked GBA was found for attended compared to unattended multisensory
stimuli. In contrast, no effect of spatial attention was observed for unimodal stimuli. An additional
analysis of the gamma band phase distribution suggested that attention primarily acts to enhance
GBA phase-locking, compatible with the idea already discussed above that cross-modal interactions
can affect the phase of neural signals.
The effects of nonspatial intermodal attention and the temporal relation between auditory and
visual inputs on the early evoked GBA have been investigated in another EEG study (Senkowski
et al. 2007). Subjects were presented with a continuous stream of centrally presented unimodal
and bimodal stimuli while they were instructed to detect an occasional auditory or visual target.
Using combined auditory and visual stimuli with different onset delays revealed clear effects on the
evoked GBA. Although there were no significant differences between the two attention conditions,
an enhancement of the GBA was observed when auditory and visual inputs of multisensory stimuli
were presented simultaneously (i.e., 0 ± 25 ms; Figure 7.2). This suggests that the integration of
auditory and visual inputs, as reflected in high-frequency oscillatory activity, is sensitive to the rela-
tive onset timing of the sensory inputs.

7.3.4  Percept-Related Multisensory Oscillations


A powerful approach to study cross-modal integration is the use of physically identical multisensory
events that can lead to different percepts across trials. A well-known example for this approach is
the sound-induced visual flash illusion that exploits the effect that a single flash of light accompa-
nied by rapidly presented auditory beeps is often perceived as multiple flashes (Shams et al. 2000).
This illusion allows the direct comparison of neural responses to illusory trials (i.e., when more than
one flash is perceived) with nonillusory trials (i.e., when a single flash is perceived), while keep-
ing the physical parameters of the presented stimuli constant. In an early attempt to study GBA dur-
ing the sound-induced visual flash illusion, an increase in induced GBA was observed over occipital
sites in an early (around 100 ms) and a late time window (around 450 ms) for illusory but not for
nonillusory trials (Bhattacharya et al. 2002). Confirming these data, a more recent study has also
observed enhanced induced GBA over occipital areas around 130 and 220 ms for illusory compared
to nonillusory trials (Mishra et al. 2007).
Using a modified version of the McGurk effect, the link between induced GBA and illusory per-
ception during audiovisual speech processing has been addressed in MEG investigations (Kaiser
et al. 2005, 2006). In the McGurk illusion, an auditory phoneme is dubbed onto a video showing
an incongruent lip movement, which often leads to an illusory auditory percept (McGurk and
MacDonald 1976). Exploiting this cross-modal effect, an enhanced GBA was observed in epochs
in which an illusory auditory percept was induced by a visual deviant within a continuous stream
of multisensory audiovisual speech stimuli (Kaiser et al. 2005). Remarkably, the topography of
this effect was comparable with the frontal topography of a GBA enhancement obtained in an
auditory mismatch study (Kaiser et al. 2002), suggesting that the GBA effect in the McGurk illu-
sion study may represent a perceived auditory pattern change caused by the visual lip movement.
Moreover, across subjects, the amplitude of induced GBA over the occipital cortex and the degree
of the illusory acoustic change were closely correlated, suggesting that the induced GBA in early
visual areas may be directly related to the generation of the illusory auditory percept (Kaiser et al.
2006).
Further evidence for a link of gamma band oscillations to illusory cross-modal perception comes
from a study on the rubber hand illusion (Kanayama et al. 2007). In this study, a rubber hand was
placed atop a box in which the subject’s own hand was located. In such a setting, subjects can
have the illusory impression that a tactile input presented to one of their fingers actually stimulated
the rubber hand. Interestingly, there was a strong effect of cross-modal congruence of the stimula-
tion site. Stronger induced GBA and phase synchrony between distant electrodes occurred when a
visual stimulus was presented near the finger of the rubber hand that corresponded to the subject’s

[Figure 7.2 graphic: (a) experimental setup; (b) five audiovisual asynchrony subranges, A|V(–100 ± 25) to A|V(+100 ± 25 ms), on a –100 to +100 ms timing axis; (c) time–frequency plots (20–80 Hz, –100 to 400 ms) for A|V(0 ± 25), auditory only, and their difference, with scalp topographies at 50–100 ms.]

FIGURE 7.2  Effect of relative timing of multisensory stimuli on gamma band oscillations. (a) Horizontal
gratings and sinusoidal tones were presented with different stimulus onset asynchronies. (b) GBA to auditory
and visual components of multisensory audiovisual stimuli were extracted for five asynchrony ranges centered
about –100, –50, 0, +50, and +100 ms delay between visual and auditory stimulus, respectively. GBA evoked
with multisensory inputs was compared to GBA to unisensory control stimuli. (c) An enhancement of evoked
GBA compared to unimodal input was observed when auditory and visual inputs were presented with smallest
relative onset asynchrony window (0 ± 25 ms). This shows that precision of temporal synchrony has an effect
on early cross-modal processing as reflected by evoked GBA. (Adapted with permission from Senkowski, D.
et al., Neuropsychologia, 45, 561–571, 2007.)

finger receiving a tactile stimulus, as compared to a spatial cross-modal misalignment. This finding
suggests a close relationship between multisensory tactile–visual stimulation and phase coherence
in gamma band oscillations. In sum, the findings discussed in this section suggest that oscillatory
activity, in particular at gamma band frequencies, can reflect perceptual changes resulting from
cross-modal interactions.

7.4  FUNCTIONAL ROLE OF NEURAL SYNCHRONY FOR CROSS-MODAL INTERACTIONS

The data available support the hypothesis that synchronization of oscillatory responses plays a
role in multisensory processing (Senkowski et al. 2008). They consistently show that multisensory
interactions are accompanied by condition-specific changes in oscillatory responses which often,
albeit not always, occur in the gamma band (Sakowitz et al. 2000, 2001, 2005; Bhattacharya et al.
2002; Kaiser et al. 2005, 2006; Senkowski et al. 2005, 2006, 2007; Lakatos et al. 2007; Mishra et
al. 2007; Kanayama et al. 2007; Doesburg et al. 2007; Widmann et al. 2007; Schneider et al. 2008).
These effects, observed in EEG or MEG signals, likely result not only from
changes in oscillatory power, but also from altered phase coherence in the underlying neuronal
populations. Several of the studies reviewed above have addressed this directly, providing evidence
that coherence of neural signals across cortical areas may be a crucial mechanism involved in mul-
timodal processing (von Stein et al. 1999; Hummel and Gerloff 2005; Doesburg et al. 2007; Maier
et al. 2008; Kayser and Logothetis 2009).
Theoretical arguments suggest that coherent oscillatory signals may be well-suited to serve cross-
modal integration. It has been argued that synchronization of neural activity may help to cope with
binding problems that occur in distributed architectures (von der Malsburg and Schneider 1986;
Singer and Gray 1995; Engel et al. 1992, 2001; Singer 1999). Clearly, multisensory processing poses
binding problems in at least two respects (Foxe and Schroeder 2005): information must be inte-
grated across different neural systems; moreover, real-world scenes comprise multiple objects, cre-
ating the need for segregating unrelated neural signals within processing modules while, at the same
time, selectively coordinating signals across channels in the correct combination. It seems unlikely
that such complex coordination could be achieved by anatomical connections alone because this
would not provide sufficient flexibility to cope with a fast-changing multisensory world. In contrast,
establishment of relations between signals by neural coherence may provide both the required flex-
ibility and selectivity because transient phase-locking of oscillatory signals allows for the dynamic
modulation of effective connectivity between spatially distributed neuronal populations (König
et al. 1995; Salinas and Sejnowski 2001; Fries 2005; Womelsdorf et al. 2007).
If neural coherence indeed supports multisensory integration, a number of scenarios seem pos-
sible regarding the interaction of “lower-order” and “higher-order” regions. The studies reviewed
above demonstrate the effects of multisensory interactions on oscillatory responses at multiple lev-
els, including primary sensory areas (Kaiser et al. 2006; Kayser et al. 2008; Lakatos et al. 2007)
as well as higher-order multimodal and frontal areas (Kaiser et al. 2005; Senkowski et al. 2006),
suggesting that coherent neural activity might play a role in both “early” and “late” integration of
multisensory signals. However, the available data do not yet allow us to decide conclusively which
interaction patterns are most plausibly involved and, likely, these will also depend on the nature of
the task and the stimuli. Using the case of audiovisual interactions, a number of hypothetical scenar-
ios are schematically depicted in Figure 7.3. The simplest scenario predicts that during multisensory
interactions, neural synchronization changes between early sensory areas. An alternative possibility
is that changes in neural coherence or power occur mainly within cell assemblies of multisensory
association cortices such as superior temporal regions. More complex scenarios would result from
a combination of these patterns. For instance, changes in neural synchrony among unimodal regions
could also be associated with enhanced oscillatory activity in multisensory areas. This could result
from reentrant bottom-up and top-down interactions between unimodal and multimodal cortices.
In addition, changes in multisensory perception will often also involve frontal regions, which might
exert a modulatory influence on temporal patterns in multisensory parietotemporal regions through
oscillatory coupling. Most likely, at least for multisensory processing in naturalistic environments,
these interactions will combine into a highly complex pattern involving the frontal cortex, tem-
poroparietal regions as well as unimodal cortices and presumably also subcortical structures.

[Figure 7.3 graphic: four schematic networks (a–d) linking auditory cortex, visual cortex, multisensory temporal cortex, and multisensory parietal cortex, with prefrontal and premotor cortex added in panel (d).]

FIGURE 7.3  Scenarios for large-scale neural communication during cross-modal perception. The model
proposed here is compatible with a number of different patterns of neural interactions. The figure refers to
the case of audiovisual interactions. (a) Multisensory interactions by coherence change between early sen-
sory areas. (b) Alternatively, changes in neural coherence or power might occur mainly within or between
multisensory association cortices, e.g., superior temporal and parietal regions. (c) Combining both scenarios,
neural synchrony among unimodal regions could also be associated with enhanced oscillatory activity in mul-
tisensory areas. (d) Multisensory perception might also involve oscillatory activity in frontal regions, which is
likely to exert a modulatory influence on temporal patterns in parietal and temporal regions.

Exploiting coherent oscillations as a potential mechanism would be compatible with various
modes, or outcomes, of cross-modal interaction. An important case is the integration of spatially or
semantically matching cross-modal signals. Congruent multisensory information would lead, very
likely, to coherent activation of neurons processing sensory inputs from different modalities. This,
in turn, will lead to stronger activation of cells in multisensory temporal, parietal, or frontal regions
that receive input from such a synchronized assembly (Figure 7.3). Thus, cross-modal coherence
might provide a plausible mechanism to implement the binding of features across different sensory
pathways. In addition, cross-modal integration may be considerably facilitated by top-down influ-
ences from higher-order regions (Engel et al. 2001; Herrmann et al. 2004a). During the processing
of natural multimodal scenes or semantically complex cross-modal information, such top-down
influences might express a dynamic “prediction” (Engel et al. 2001) about expected multisensory
inputs. In the case of a match with newly arriving sensory inputs, “resonance” is likely to occur, which
would augment and accelerate the processing and selection of matching multisensory information
(Widmann et al. 2007; Schneider et al. 2008; Senkowski et al. 2009).
The mechanism postulated here may also account for the processing of conflicting cross-modal
information. In this case, the mismatching of spatiotemporal phase patterns would presumably
lead to competition between different assemblies and a winner-take-all scenario (Fries et al. 2007).
Evidence from work in the visual system suggests that such a competition would lead to an augmen-
tation of temporal coherence in the dominant assembly, but a weakening of the temporal binding
in other assemblies (Roelfsema et al. 1994; Fries et al. 1997, 2001). Because synchronized signals
are particularly efficient in driving downstream cell populations (König et al. 1996; Womelsdorf
et al. 2007) and in modulating synaptic weights (Markram et al. 1997; Bi and Poo 2001), such a
mechanism would then lead to a selection of strongly synchronized populations and suppression of
decorrelated activity.
A third case may be cross-modal modulation, i.e., the bias of a percept by concurrent input from
a different sensory modality. The model suggested here predicts that the inputs from the second
modality can change the temporal structure of activity patterns in the first modality. One possible
mechanism for such a modulation by oscillatory inputs is suggested by studies discussed above
(Lakatos et al. 2007; Kayser et al. 2008). Both “lateral” interactions between assemblies in early
areas as well as top-down influences could lead to a shift in phase of the respective local oscillations,
thus entraining the local population into a temporal pattern that may be optimally suited to enhance
the effect on downstream assemblies. The prediction is that this phase resetting or phase shift-
ing should be maximally effective in case of spatial, temporal, or semantic matching cross-modal
information. Such a mechanism might help to explain why cross-modal context can often lead to
biases in the processing of information in one particular sensory system and might contribute to
understanding the nature of “early” multisensory integration (Foxe and Schroeder 2005). Because
such modulatory effects might occur on a range of time scales (defined by different frequency bands
in oscillatory activity), this mechanism may also account for broader temporal integration windows
that have been reported for multisensory interactions (Vroomen and Keetels 2010).
Finally, our hypothesis might also help to account for key features of multisensory process-
ing such as the superadditivity or subadditivity of responses (Stein and Meredith 1993; Meredith
2002) and the principle of “inverse effectiveness” (Kayser and Logothetis 2007). Because of non-
linear dendritic processing, appropriately timed inputs will generate a much stronger postsynaptic
response in target neuronal populations than temporally uncoordinated afferent signals (König et al.
1996; Singer 1999; Fries 2005) and, therefore, matching cross-modal inputs can have an impact
that differs strongly from the sum of the unimodal responses. Conversely, incongruent signals from
two modalities might result in temporally desynchronized inputs and, therefore, in “multisensory
depression” in downstream neural populations (Stein et al. 2002).
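The coincidence-detection argument in the preceding paragraph can be made concrete with a toy simulation. The leaky integrate-and-fire sketch below is purely illustrative (all parameters are arbitrary assumptions and are not taken from the cited studies): the same number of afferent spikes drives the downstream cell past threshold when the spikes arrive near-synchronously, but not when they are temporally dispersed.

    import numpy as np

    rng = np.random.default_rng(1)
    dt, t_max = 0.1e-3, 0.5                     # 0.1 ms steps, 0.5 s simulation
    n_steps = int(t_max / dt)
    tau_m = 10e-3                               # membrane time constant
    v_rest, v_thresh, v_reset = -70e-3, -54e-3, -70e-3
    w = 1.2e-3                                  # depolarization per input spike (V)

    def output_spikes(jitter_sd):
        """Count output spikes of a leaky integrate-and-fire neuron driven by
        20 inputs firing one volley every 100 ms, each jittered by jitter_sd (s)."""
        drive = np.zeros(n_steps)
        for t0 in np.arange(0.05, t_max, 0.1):  # volley times
            times = t0 + jitter_sd * rng.standard_normal(20)
            idx = np.clip((times / dt).astype(int), 0, n_steps - 1)
            np.add.at(drive, idx, 1.0)
        v, n_out = v_rest, 0
        for i in range(n_steps):
            v += dt / tau_m * (v_rest - v) + w * drive[i]
            if v >= v_thresh:                   # threshold crossing: spike and reset
                n_out += 1
                v = v_reset
        return n_out

    print("near-synchronous inputs (0.5 ms jitter):", output_spikes(0.5e-3))
    print("dispersed inputs (20 ms jitter):        ", output_spikes(20e-3))

With these settings the synchronized volleys reliably trigger output spikes, whereas the temporally dispersed input, although identical in total strength, remains subthreshold, which illustrates why coincident cross-modal inputs can produce responses that differ markedly from the sum of the unimodal responses.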

7.5  OUTLOOK
Although partially supported by data, the hypothesis that neural synchrony may play a role in
multisensory processing clearly requires further experimental testing. Thus far, only a relatively
small number of multisensory studies have used coherence measures to explicitly address interac-
tions across different neural systems. Very likely, substantial progress can be achieved by studies
in humans if the approaches are suitable to capture dynamic cross-systems interactions among
specific brain regions. Such investigations may be carried out using MEG (Gross et al. 2001; Siegel
et al. 2007, 2008), combination of EEG with functional magnetic resonance imaging (Debener et
al. 2006) or intracerebral multisite recordings (Lachaux et al. 2003), if the recordings are com-
bined with advanced source modeling techniques (Van Veen et al. 1997) and analysis methods
that quantify, e.g., directed information transfer between the activated regions (Supp et al. 2007).
In addition, some of the earlier EEG studies on multisensory oscillations involving visual stimuli
(e.g., Yuval-Greenberg and Deouell 2007) seem to be confounded by artifacts relating to microsac-
cades (Yuval-Greenberg et al. 2008), a methodological issue that needs to be clarified and possibly
can be avoided by using MEG (Fries et al. 2008). To characterize the role of correlated activity for
multisensory processing at the cellular level, further microelectrode studies in higher mammals will
be indispensable.
The model put forward here has several implications. We believe that the study of synchroniza-
tion phenomena may lead to a new view on multisensory processing that considers the dynamic
interplay of neural populations as a key to cross-modal integration and stimulates the development of
new research approaches and experimental strategies. Conversely, the investigation of multisensory
interactions may also provide a crucial test bed for further validation of the temporal correlation
hypothesis (Engel et al. 1992; Singer and Gray 1995; Singer 1999), because task- or percept-related
changes in coherence between independent neural sources have hardly been shown in humans thus
far. In this context, the role of oscillations in different frequency bands is yet another unexplored
issue that future studies will have to address. As discussed above, multisensory effects are often, but
not exclusively, observed in higher frequency ranges, and it is unclear why gamma band oscillations
figure so prominently.
Finally, abnormal synchronization across sensory channels may play a role in conditions of
abnormal cross-modal perception such as synesthesia (Hubbard and Ramachandran 2005) or in
disorders such as schizophrenia or autism. In synesthesia, excessively strong multisensory coher-
ence might occur, which then would not just modulate processing in unimodal regions but actually
drive sensory neurons even in the absence of a proper stimulus. In contrast, abnormal weakness of
cross-modal coupling might account for the impairment of multisensory integration that is observed
in patients with schizophrenia (Ross et al. 2007) or autism (Iarocci and McDonald 2006). Thus,
research on cross-modal binding may help to advance our understanding of brain disorders that
partly result from dysfunctional integrative mechanisms (Schnitzler and Gross 2005; Uhlhaas and
Singer 2006).

REFERENCES
Amedi, A., K. von Kriegstein, N.M. van Atteveldt, M.S. Beauchamp, M.J. Naumer. 2005. Functional imaging
of human crossmodal identification and object recognition. Experimental Brain Research 166:559–571.
Bauer, M., R. Oostenveld, M. Peeters, P. Fries. 2006. Tactile spatial attention enhances gamma-band activ-
ity in somatosensory cortex and reduces low-frequency activity in parieto-occipital areas. Journal of
Neuroscience 26:490–501.
Baumann, O., and M.W. Greenlee. 2007. Neural correlates of coherent audiovisual motion perception. Cerebral
Cortex 17:1433–1443.
Bavelier, D., and H.J. Neville. 2002. Cross-modal plasticity: Where and how? Nature Reviews. Neuroscience
3:443–452.
Bhattacharya, J., L. Shams, S. Shimojo. 2002. Sound-induced illusory flash perception: Role of gamma band
responses. Neuroreport 13:1727–1730.
Bi, G.-Q., and M.-M. Poo. 2001. Synaptic modification by correlated activity: Hebb’s postulate revisited.
Annual Review of Neuroscience 24:139–166.
Bressler, S.L., and W.J. Freeman. 1980. Frequency analysis of olfactory system EEG in cat, rabbit, and rat.
Electroencephalography and Clinical Neurophysiology 50:19–24.
Brosch, M., E. Budinger, H. Scheich. 2002. Stimulus-related gamma oscillations in primate auditory cortex.
Journal of Neurophysiology 87:2715–2725.
Calvert, G.A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies.
Cerebral Cortex 11:1110–1123.
Castelo-Branco, M., R. Goebel, S. Neuenschwander, W. Singer. 2000. Neural synchrony correlates with surface
segregation rules. Nature 405:685–689.
Chandrasekaran, C., and A.A. Ghazanfar. 2009. Different neural frequency bands integrate faces and voices
differently in the superior temporal sulcus. Journal of Neurophysiology 101:773–788.
Csicsvari, J., B. Jamieson, K.D. Wise, G. Buzsaki. 2003. Mechanisms of gamma oscillations in the hippocam-
pus of the behaving rat. Neuron 37:311–322.
Debener, S., C.S. Herrmann, C. Kranczioch, D. Gembris, A.K. Engel. 2003. Top-down attentional processing
enhances auditory evoked gamma band activity. Neuroreport 14:683–686.
Debener, S., M. Ullsperger, M. Siegel, A.K. Engel. 2006. Single-trial EEG-fMRI reveals the dynamics of cog-
nitive function. Trends in Cognitive Sciences 10:558–563.
Doesburg, S.M., L.L. Emberson, A. Rahi, D. Cameron, L.M. Ward. 2007. Asynchrony from synchrony:
Long-range gamma-band neural synchrony accompanies perception of audiovisual speech asynchrony.
Experimental Brain Research 185:11–20.
Driver, J., and C. Spence. 2000. Multisensory perception: Beyond modularity and convergence. Current Biology
10:R731–R735.
Engel, A.K., P. König, W. Singer. 1991a. Direct physiological evidence for scene segmentation by temporal cod-
ing. Proceedings of the National Academy of Sciences of the United States of America 88:9136–9140.
Engel, A.K., P. König, A.K. Kreiter, W. Singer. 1991b. Interhemispheric synchronization of oscillatory neu-
ronal responses in cat visual cortex. Science 252:1177–1179.
Engel, A.K., P. König, A.K. Kreiter, T.B. Schillen, W. Singer. 1992. Temporal coding in the visual cortex: New
vistas on integration in the nervous system. Trends in Neurosciences 15:218–226.
Engel, A.K., P. Fries, W. Singer. 2001. Dynamic predictions: Oscillations and synchrony in top-down process-
ing. Nature Reviews. Neuroscience 2:704–716.
Farmer, S.F. 1998. Rhythmicity, synchronization and binding in human and primate motor systems. Journal of
Physiology 509:3–14.
Foxe, J.J., and C.E. Schroeder. 2005. The case for feedforward multisensory convergence during early cortical
processing. Neuroreport 16:419–423.
Foxe, J.J., G.R. Wylie, A. Martinez et al. 2002. Auditory-somatosensory multisensory processing in auditory
association cortex: An fMRI study. Journal of Neurophysiology 88:540–543.
Foxe, J.J., G.V. Simpson, S.P. Ahlfors, C.D. Saron. 2005. Biasing the brain’s attentional set: I. cue driven
deployments of intersensory selective attention. Experimental Brain Research 166:370–392.
Fries, P. 2005. A mechanism for cognitive dynamics: Neuronal communication through neuronal coherence.
Trends in Cognitive Sciences 9:474–480.
Fries, P., P.R. Roelfsema, A.K. Engel, P. König, W. Singer. 1997. Synchronization of oscillatory responses in
visual cortex correlates with perception in interocular rivalry. Proceedings of the National Academy of
Sciences of the United States of America 94:12699–12704.
Fries, P., S. Neuenschwander, A.K. Engel, R. Goebel, W. Singer. 2001. Modulation of oscillatory neuronal
synchronization by selective visual attention. Science 291:1560–1563.
Fries, P., D. Nikolic, W. Singer. 2007. The gamma cycle. Trends in Neurosciences 30:309–316.
Fries, P., R. Scheeringa, R. Oostenveld. 2008. Finding gamma. Neuron 58:303–305.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences
10:278–285.
Gray, C.M., P. König, A.K. Engel, W. Singer. 1989. Oscillatory responses in cat visual cortex exhibit inter-
columnar synchronization which reflects global stimulus properties. Nature 338:334–337.
Gross, J., J. Kujala, M. Hamalainen et al. 2001. Dynamic imaging of coherent sources: Studying neural inter-
actions in the human brain. Proceedings of the National Academy of Sciences of the United States of
America 98:694–699.
Gruber, T., and M.M. Müller. 2005. Oscillatory brain activity dissociates between associative stimulus content
in a repetition priming task in the human EEG. Cerebral Cortex 15:109–116.
Herrmann, C.S., M.H. Munk, A.K. Engel. 2004a. Cognitive functions of gamma-band activity: Memory match
and utilization. Trends in Cognitive Sciences 8:347–355.
Herrmann, C.S., D. Lenz, S. Junge, N.A. Busch, B. Maess. 2004b. Memory-matches evoke human gamma-
responses. BMC Neuroscience 5:13.
Hubbard, E.M., and V.S. Ramachandran. 2005. Neurocognitive mechanisms of synesthesia. Neuron 48:
509–520.
Hummel, F., and C. Gerloff. 2005. Larger interregional synchrony is associated with greater behavioral success
in a complex sensory integration task in humans. Cerebral Cortex 15:670–678.
Iarocci, G., and J. McDonald. 2006. Sensory integration and the perceptual experience of persons with autism.
Journal of Autism and Developmental Disorders 36:77–90.
Kaiser, J., W. Lutzenberger, H. Ackermann, N. Birbaumer. 2002. Dynamics of gamma-band activity induced by
auditory pattern changes in humans. Cerebral Cortex 12:212–221.
Kaiser, J., I. Hertrich, H. Ackermann, K. Mathiak, W. Lutzenberger. 2005. Hearing lips: Gamma-band activity
during audiovisual speech perception. Cerebral Cortex 15:646–653.
Kaiser, J., I. Hertrich, H. Ackermann, W. Lutzenberger. 2006. Gamma-band activity over early sensory areas
predicts detection of changes in audiovisual speech stimuli. NeuroImage 30:1376–1382.
Kanayama, N., A. Sato, H. Ohira. 2007. Crossmodal effect with rubber hand illusion and gamma-band activity.
Psychophysiology 44:392–402.
Kayser, C., and N.K. Logothetis. 2007. Do early sensory cortices integrate crossmodal information? Brain
Structure and Function 212:121–132.
Kayser, C., and N.K. Logothetis. 2009. Directed interactions between auditory and superior temporal cortices
and their role in sensory integration. Frontiers in Integrative Neuroscience 3:7.
Kayser, C., C.I. Petkov, M. Augath, N.K. Logothetis. 2005. Integration of touch and sound in auditory cortex.
Neuron 48:373–384.
Kayser, C., C.I. Petkov, N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–1574.
Kisley, M.A., and Z.M. Cornwell. 2006. Gamma and beta neural activity evoked during a sensory gating
paradigm: Effects of auditory, somatosensory and cross-modal stimulation. Clinical Neurophysiology
117:2549–2563.
Komura, Y., R. Tamura, T. Uwano, H. Nishijo, T. Ono. 2005. Auditory thalamus integrates visual inputs into
behavioral gains. Nature Neuroscience 8:1203–1209.
König, P., A.K. Engel, W. Singer. 1995. Relation between oscillatory activity and long-range synchronization
in cat visual cortex. Proceedings of the National Academy of Sciences of the United States of America
92:290–294.
König, P., A.K. Engel, W. Singer. 1996. Integrator or coincidence detector? The role of the cortical neuron
revisited. Trends in Neurosciences 19:130–137.
Lachaux, J.P., D. Rudrauf, P. Kahane. 2003. Intracranial EEG and human brain mapping. Journal of Physiology
97:613–628.
Lakatos, P., C.M. Chen, M.N. O’Connell, A. Mills, C.E. Schroeder. 2007. Neuronal oscillations and multisen-
sory interaction in primary auditory cortex. Neuron 53:279–292.
Macaluso, E., and J. Driver. 2005. Multisensory spatial interactions: A window onto functional integration in
the human brain. Trends in Neurosciences 28:264–271.
Macaluso, E., C.D. Frith, J. Driver. 2000. Modulation of human visual cortex by crossmodal spatial attention.
Science 289:1206–1208.
Maier, J.X., C. Chandrasekaran, A.A. Ghazanfar. 2008. Integration of bimodal looming signals through neu-
ronal coherence in the temporal lobe. Current Biology 18:963–968.
Markram, H., J. Lübke, M. Frotscher, B. Sakmann. 1997. Regulation of synaptic efficacy by coincidence of
postsynaptic APs and EPSPs. Science 275:213–215.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264:746–748.
Meredith, M.A. 2002. On the neuronal basis for multisensory convergence: A brief overview. Brain Research.
Cognitive Brain Research 14:31–40.
Meredith, M.A., and B.E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221:389–391.
Meredith, M.A., and B.E. Stein. 1985. Descending efferents from the superior colliculus relay integrated mul-
tisensory information. Science 227:657–659.
Mishra, J., A. Martinez, T.J. Sejnowski, S.A. Hillyard. 2007. Early cross-modal interactions in auditory and
visual cortex underlie a sound-induced visual illusion. Journal of Neuroscience 27:4120–4131.
Müller, M.M., T. Gruber, A. Keil. 2000. Modulation of induced gamma band activity in the human EEG by
attention and visual information processing. International Journal of Psychophysiology 38:283–299.
Nagy, A., G. Eördegh, Z. Paroczy, Z. Markus, G. Benedek. 2006. Multisensory integration in the basal ganglia.
European Journal of Neuroscience 24:917–924.
Nishijo, H., T. Ono, H. Nishino. 1988. Topographic distribution of modality-specific amygdalar neurons in alert
monkey. Journal of Neuroscience 8:3556–3569.
Roelfsema, P.R., P. König, A.K. Engel, R. Sireteanu, W. Singer. 1994. Reduced synchronization in the visual
cortex of cats with strabismic amblyopia. European Journal of Neuroscience 6:1645–1655.
Roelfsema, P.R., A.K. Engel, P. König, W. Singer. 1997. Visuomotor integration is associated with zero time-
lag synchronization among cortical areas. Nature 385:157–161.
Ross, L.A., D. Saint-Amour, V.M. Leavitt, S. Molholm, D.C. Javitt, J.J. Foxe. 2007. Impaired multisensory
processing in schizophrenia: Deficits in the visual enhancement of speech comprehension under noisy
environmental conditions. Schizophrenia Research 97:173–183.
Rowland, B.A., S. Quessy, T.R. Stanford, B.E. Stein. 2007. Multisensory integration shortens physiological
response latencies. Journal of Neuroscience 27:5879–5884.
Sakowitz, O.W., M. Schürmann, E. Basar. 2000. Oscillatory frontal theta responses are increased upon bisen-
sory stimulation. Clinical Neurophysiology 111:884–893.
Sakowitz, O.W., R.Q. Quiroga, M. Schürmann, E. Basar. 2001. Bisensory stimulation increases gamma-
responses over multiple cortical regions. Brain Research. Cognitive Brain Research 11:267–279.
Sakowitz, O.W., R.Q. Quiroga, M. Schürmann, E. Basar. 2005. Spatio-temporal frequency characteristics of
intersensory components in audiovisual evoked potentials. Brain Research. Cognitive Brain Research
23:316–326.
Salinas, E., and T.J. Sejnowski. 2001. Correlated neuronal activity and the flow of neural information. Nature
Reviews Neuroscience 2:539–550.
Sanes, J.N., and J.P. Donoghue. 1993. Oscillations in local field potentials of the primate motor cortex during
voluntary movement. Proceedings of the National Academy of Sciences of the United States of America
90:4470–4474.
Schneider, T.R., S. Debener, R. Oostenveld, A.K. Engel. 2008. Enhanced EEG gamma-band activity reflects
multisensory semantic matching in visual-to-auditory object priming. NeuroImage 42:1244–1254.
Schnitzler, A., and J. Gross. 2005. Normal and pathological oscillatory communication in the brain. Nature
Reviews Neuroscience 6:285–296.
Senkowski, D., D. Talsma, C.S. Herrmann, M.G. Woldorff. 2005. Multisensory processing and oscil-
latory gamma responses: Effects of spatial selective attention. Experimental Brain Research
3–4:411–426.
Senkowski, D., S. Molholm, M. Gomez-Ramirez, J.J. Foxe. 2006. Oscillatory beta activity predicts response
speed during a multisensory audiovisual reaction time task: A high-density electrical mapping study.
Cerebral Cortex 16:1556–1565.
Senkowski, D., D. Talsma, M. Grigutsch, C.S. Herrmann, M.G. Woldorff. 2007. Good times for multisensory
integration: Effects of the precision of temporal synchrony as revealed by gamma-band oscillations.
Neuropsychologia 45:561–571.
Senkowski, D., T.R. Schneider, J.J. Foxe, A.K. Engel. 2008. Crossmodal binding through neural coherence:
Implications for multisensory processing. Trends in Neurosciences 31:401–409.
Senkowski, D., T.R. Schneider, R. Tandler, A.K. Engel. 2009. Gamma-band activity reflects multisensory
matching in working memory. Experimental Brain Research 198:363–372.
Shams, L., Y. Kamitani, S. Shimojo. 2000. Illusions. What you see is what you hear. Nature 408:788.
Shimojo, S., and L. Shams. 2001. Sensory modalities are not separate modalities: Plasticity and interactions.
Current Opinion in Neurobiology 11:505–509.
Siegel, M., T.H. Donner, R. Oostenveld, P. Fries, A.K. Engel. 2007. High-frequency activity in human visual
cortex is modulated by visual motion strength. Cerebral Cortex 17:732–741.
Siegel, M., T.H. Donner, R. Oostenveld, P. Fries, A.K. Engel. 2008. Neuronal synchronization along the dorsal
visual pathway reflects the focus of spatial attention. Neuron 60:709–719.
Singer, W. 1999. Neuronal synchrony: A versatile code for the definition of relations? Neuron 24:49–65.
Singer, W., and C.M. Gray. 1995. Visual feature integration and the temporal correlation hypothesis. Annual
Review of Neuroscience 18:555–586.
Stanford, T.R., S. Quessy, B.E. Stein. 2005. Evaluating the operations underlying multisensory integration in
the cat superior colliculus. Journal of Neuroscience 25:6499–6508.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., M.W. Wallace, T.R. Stanford, W. Jiang. 2002. Cortex governs multisensory integration in the mid-
brain. Neuroscientist 8:306–314.
Supp, G.G., A. Schlögl, N. Trujillo-Barreto, M.M. Müller, T. Gruber. 2007. Directed cortical information flow
during human object recognition: Analyzing induced EEG gamma-band responses in brain’s source
space. PLoS ONE 2:e684.
Sur, M., S.L. Pallas, A.W. Roe. 1990. Cross-modal plasticity in cortical development: Differentiation and speci-
fication of sensory neocortex. Trends in Neurosciences 13:227–233.
Tallon-Baudry, C., and O. Bertrand. 1999. Oscillatory gamma activity in humans and its role in object repre-
sentation. Trends in Cognitive Sciences 3:151–162.
Tallon-Baudry, C., O. Bertrand, C. Delpuech, J. Pernier. 1996. Stimulus specificity of phase-locked and non-
phase-locked 40 Hz visual responses in human. Journal of Neuroscience 16:4240–4249.
Talsma, D., and M.G. Woldorff. 2005. Selective attention and multisensory integration: Multiple phases of
effects on the evoked brain activity. Journal of Cognitive Neuroscience 17:1098–1114.
Uhlhaas, P.J., and W. Singer. 2006. Neural synchrony in brain disorders: Relevance for cognitive dysfunctions
and pathophysiology. Neuron 52:155–168.
Van Veen, B.D., W. van Drongelen, M. Yuchtman, A. Suzuki. 1997. Localization of brain electrical activity via
linearly constrained minimum variance spatial filtering. IEEE Transactions on Bio-Medical Engineering
44:867–880.
Varela, F., J.P. Lachaux, E. Rodriguez, J. Martinerie. 2001. The brainweb: Phase synchronization and large-
scale integration. Nature Reviews. Neuroscience 2:229–239.
von der Malsburg, C., and W. Schneider. 1986. A neural cocktail-party processor. Biological Cybernetics
54:29–40.
von Stein, A., P. Rappelsberger, J. Sarnthein, H. Petsche. 1999. Synchronization between temporal and parietal
cortex during multimodal object processing in man. Cerebral Cortex 9:137–150.
Vroomen, J., and M. Keetels. 2010. Perception of intersensory synchrony: A tutorial review. Attention,
Perception, & Psychophysics 72:871–884.
Wehr, M., and G. Laurent. 1996. Odour encoding by temporal sequences of firing in oscillating neural assem-
blies. Nature 384:162–166.
Widmann, A., T. Gruber, T. Kujala, M. Tervaniemi, E. Schröger. 2007. Binding symbols and sounds: Evidence
from event-related oscillatory gamma-band activity. Cerebral Cortex 17:2696–2702.
Womelsdorf, T., P. Fries, P.P. Mitra, R. Desimone. 2006. Gamma-band synchronization in visual cortex predicts
speed of change detection. Nature 439:733–736.
Womelsdorf, T., J.M. Schoffelen, R. Oostenveld et al. 2007. Modulation of neuronal interactions through neu-
ronal synchronization. Science 316:1609–1612.
Yuval-Greenberg, S., and L.Y. Deouell. 2007. What you see is not (always) what you hear: Induced gamma
band responses reflect cross-modal interactions in familiar object recognition. Journal of Neuroscience
27:1090–1096.
Yuval-Greenberg, S., O. Tomer, A.S. Keren, I. Nelken, L.Y. Deouell. 2008. Transient induced gamma-band
response in EEG as a manifestation of miniature saccades. Neuron 58:429–441.
8  The Use of fMRI to Assess Multisensory Integration
Thomas W. James and Ryan A. Stevenson

CONTENTS
8.1 Principles of Multisensory Enhancement.............................................................................. 131
8.2 Superadditivity and BOLD fMRI.......................................................................................... 133
8.3 Problems with Additive Criterion.......................................................................................... 134
8.4 Inverse Effectiveness............................................................................................................. 136
8.5 BOLD Baseline: When Zero Is Not Zero.............................................................................. 138
8.6 A Difference-of-BOLD Measure........................................................................................... 139
8.7 Limitations and Future Directions........................................................................................ 143
8.8 Conclusions............................................................................................................................ 144
Acknowledgments........................................................................................................................... 144
References....................................................................................................................................... 145

Although scientists have only recently had the tools available to noninvasively study the neural
mechanisms of multisensory perceptual processes in humans (Calvert et al. 1999), the study of
multisensory perception has had a long history in science (James 1890; Molyneux 1688). Before the
advent of neuroimaging techniques, such as functional magnetic resonance imaging (fMRI) and
high-density electrical recording, the study of neural mechanisms, using single-unit recording, was
restricted to nonhuman animals such as monkeys and cats. These groundbreaking neurophysiologi-
cal studies established many principles for understanding multisensory processing at the level of
single neurons (Meredith and Stein 1983), and continue to improve our understanding of multisen-
sory mechanisms at that level (Stein and Stanford 2008).
It is tempting to consider that neuroimaging measurements, like blood oxygenation level–
dependent (BOLD) activation measured with fMRI, are directly comparable with findings from
single-unit recordings. Although several studies have established clear links between BOLD activa-
tion and neural activity (Attwell and Iadecola 2002; Logothetis and Wandell 2004; Thompson et al.
2003), there remains a fundamental difference between BOLD activation and single-unit activity:
BOLD activation is measured from the vasculature supplying a heterogeneous population of neu-
rons, whereas single-unit measures are taken from individual neurons (Scannell and Young 1999).
The ramifications of this difference are not inconsequential because the principles of multisensory
phenomena established using single-unit recording may not apply to population-based neuroimaging
data (Calvert et al. 2000). The established principles must be tested theoretically and empirically, and
where they fail, they must be replaced with new principles that are specific to the new technique.

8.1  PRINCIPLES OF MULTISENSORY ENHANCEMENT


Although the definitions of unisensory and multisensory neurons may seem intuitive, for clarity, we
will define three different types of neurons that are found in multisensory brain regions. The first
class of neurons is unisensory. They produce significant neural activity (measured as an increase in
spike count above spontaneous baseline) with only one modality of sensory input, and this response
is not modulated by concurrent input from any other sensory modality. The second class of neurons is
bimodal (or trimodal). They produce significant neural activity with two or more modalities of unisen-
sory input (Meredith and Stein 1983; Stein and Stanford 2008). With single-unit recording, bimodal
neurons can be identified by testing their response with unisensory stimuli from two different sensory
modalities. The premise is simple: if the neuron produces significant activity with both modalities,
then it is bimodal. However, bimodal activation only implies a convergence of sensory inputs, not the
integration of those inputs (Stein et al. 2009). Bimodal neurons can be further tested for multisensory
integration by using multisensory stimuli. When tested with a multisensory stimulus, most bimodal
neurons produce activity that is greater than the maximum activity produced with either unisensory
stimulus, an effect termed multisensory enhancement. The criterion usually used to identify multisensory enhance-
ment is called the maximum criterion or rule (AV > Max(A,V)). A minority of neurons produce activity
that is lower than the maximum criterion, which is considered multisensory suppression. Whether the
effect is enhancement or suppression, a change in activity of a neuron when the subject is stimulated
through a second sensory channel only occurs if those sensory channels interact. Thus, multisensory
enhancement and suppression are indicators that information is being integrated. The third class of
neurons is subthreshold. They have patterns of activity that look unisensory when they are tested with
only unisensory stimuli, but when tested with multisensory stimuli, show multisensory enhancement
(Allman and Meredith 2007; Allman et al. 2008; Meredith and Allman 2009). For example, a sub-
threshold neuron may produce significant activity with visual stimuli, but not with auditory stimuli.
Because it does not respond significantly with both, it cannot be classified as bimodal. However, when
tested with combined audiovisual stimuli, the neuron shows multisensory enhancement and thus inte-
gration. For graphical representations of each of these three classes of neurons, see Figure 8.1.

[Figure 8.1 graphic: impulse counts for A, V, and AV input for each neuron class: unisensory auditory and unisensory visual; bimodal enhanced, bimodal suppressed, and bimodal superadditive; subthreshold auditory and subthreshold visual.]

FIGURE 8.1  Activity profiles of neurons found in multisensory brain regions.



A majority of bimodal and subthreshold neurons show multisensory enhancement (i.e., exceed
the maximum criterion when stimulated with a multisensory stimulus); however, neurons that show
multisensory enhancement can be further subdivided into those that are superadditive and those
that are subadditive. Superadditive neurons show multisensory activity that exceeds a criterion that
is greater than the sum of the unisensory activities (AV > Sum(A,V); Stein and Meredith 1993). In
the case of subthreshold neurons, neural activity is only elicited by a single unisensory modality;
therefore, the criterion for superadditivity is the same as (or very similar to) the maximum crite-
rion. However, in the case of bimodal neurons, the criterion for superadditivity is usually much
greater than the maximum criterion. Thus, superadditive bimodal neurons can show extreme levels
of multisensory enhancement. Although bimodal neurons that are superadditive are, by definition,
multisensory (because they must also exceed the maximum criterion), the majority of multisensory
enhancing neurons are not superadditive (Alvarado et al. 2007; Perrault et al. 2003; Stanford et al.
2007). To be clear, in single-unit studies, superadditivity is not a criterion for identifying multisen-
sory enhancement, but instead is used to classify the degree of enhancement.
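The classification logic described in this section can be summarized in a few lines of Python. The function below is a sketch of the criteria only (the spike counts are hypothetical, and real single-unit analyses additionally test each comparison for statistical significance):

    def classify_multisensory_response(a, v, av):
        """Classify mean impulse counts from A-only, V-only, and AV trials
        using the maximum and additive criteria described in the text."""
        if av > max(a, v):
            label = "multisensory enhancement"
            label += " (superadditive)" if av > a + v else " (subadditive)"
        elif av < max(a, v):
            label = "multisensory suppression"
        else:
            label = "no interaction"
        return label

    # Hypothetical mean impulse counts per trial
    print(classify_multisensory_response(a=4.0, v=6.0, av=14.0))  # enhancement, superadditive
    print(classify_multisensory_response(a=4.0, v=6.0, av=8.0))   # enhancement, subadditive
    print(classify_multisensory_response(a=4.0, v=6.0, av=3.0))   # suppression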

8.2  SUPERADDITIVITY AND BOLD fMRI


BOLD activation is measured from the vasculature that supplies blood to a heterogeneous popula-
tion of neurons. When modeling (either formally or informally) the underlying activity that pro-
duces BOLD activation, it is tempting to consider that all of the neurons in that population have
similar response properties. However, there is little evidence to support such an idea, especially
within multisensory brain regions. Neuronal populations within multisensory brain regions contain
a mixture of unisensory neurons from different sensory modalities in addition to bimodal and sub-
threshold multisensory neurons (Allman and Meredith 2007; Allman et al. 2008; Barraclough et al.
2005; Benevento et al. 1977; Bruce et al. 1981; Hikosaka et al. 1988; Meredith 2002; Meredith and
Stein 1983, 1986; Stein and Meredith 1993; Stein and Stanford 2008). It is this mixture of neurons
of different classes in multisensory brain regions that necessitates the development of new criteria
for assessing multisensory interactions using BOLD fMRI.
The first guideline established for studying multisensory phenomena specific to population-based
BOLD fMRI measures was superadditivity (Calvert et al. 2000), which we will refer to here as the
additive criterion to differentiate it from superadditivity in single units. In her original fMRI study,
Calvert used audio and visual presentations of speech (talking heads) and isolated an area of the
superior temporal sulcus that produced BOLD activation with a multisensory speech stimulus that
was greater than the sum of the BOLD activations with the two unisensory stimuli (AV > Sum(A,V)).
The use of this additive criterion was a departure from the established maximum criterion that was
used in single-unit studies, but was based on two supportable premises. First, BOLD activation can
be modeled as a time-invariant linear system, that is, activation produced by two stimuli presented
together can be modeled by summing the activity produced by those same two stimuli presented
alone (Boynton et al. 1996; Dale and Buckner 1997; Glover 1999; Heeger and Ress 2002). Second,
the null hypothesis to be rejected is that the neuronal population does not contain multisensory
neurons (Calvert et al. 2000, 2001; Meredith and Stein 1983). Using the additive criterion, the pres-
ence of multisensory neurons can be inferred (and the null hypothesis rejected) if activation with the
multisensory stimulus exceeds the additive criterion (i.e., superadditivity).
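In practice, the additive criterion is typically applied to condition-wise response estimates (e.g., GLM beta weights) for a voxel or region of interest. The sketch below is only an illustration of the two criteria at the group level, with made-up values; it is not a reconstruction of Calvert's analysis, and the reported p values are two-sided:

    import numpy as np
    from scipy import stats

    # Hypothetical per-subject beta estimates for one region of interest
    beta_a  = np.array([0.9, 1.1, 1.0, 0.8, 1.2])   # auditory-only
    beta_v  = np.array([1.2, 1.0, 1.3, 1.1, 0.9])   # visual-only
    beta_av = np.array([2.4, 2.3, 2.6, 2.1, 2.5])   # audiovisual

    # Additive criterion: is AV - (A + V) reliably greater than zero?
    t_add, p_add = stats.ttest_1samp(beta_av - (beta_a + beta_v), 0.0)
    # Maximum criterion: is AV reliably greater than the larger unisensory response?
    t_max, p_max = stats.ttest_1samp(beta_av - np.maximum(beta_a, beta_v), 0.0)

    print(f"additive criterion: t = {t_add:.2f}, p = {p_add:.3f}")
    print(f"maximum criterion:  t = {t_max:.2f}, p = {p_max:.3f}")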
The justification for an additive criterion as the null hypothesis is illustrated in Figure 8.2. Data
in Figure 8.2 are simulated based on single-unit recording statistics taken from Laurienti et al.
(2005). Importantly, the data are modeled based on a brain region that does not contain multi-
sensory neurons. A brain region that only contains unisensory neurons is not a site of integration,
and therefore represents an appropriate null hypothesis. The heights of the two left bars indicate
simulated BOLD activation with unisensory auditory (A) and visual (V) stimulation. The next
bar is the simulated BOLD activation with simultaneously presented auditory and visual stimuli
(AV). The rightmost bar, Sum(A,V), represents the additive criterion. Assuming that the pools of
[Figure 8.2 graphic: two-population null hypothesis. Simulated BOLD response for the A, V, and AV input conditions and for the Max(A,V) and Sum(A,V) criteria, with stacked contributions of A cells and V cells.]

FIGURE 8.2  Criteria for assessing multisensory interactions in neuronal populations.

unisensory neurons respond similarly under unisensory and multisensory stimulation (otherwise
they would be classified as subthreshold neurons), the modeled AV activation is the same as the
additive criterion.
For comparison, we include the maximum criterion (the Max(A,V) bar), which is the crite-
rion used in single-unit recording, and sometimes used with BOLD fMRI (Beauchamp 2005; van
Atteveldt et al. 2007). The maximum criterion is clearly much more liberal than the additive cri-
terion, and the model in Figure 8.2 shows that the use of the maximum criterion with BOLD data
could produce false-positives in brain regions containing only two pools of unisensory neurons
and no multisensory neurons. That is, if a single voxel contained only unisensory neurons and no
neurons with multisensory properties, the BOLD response will still exceed the maximum criterion.
Thus, the simple model shown in Figure 8.2 demonstrates both the utility of the additive criterion
for assessing multisensory interactions in populations containing a mixture of unisensory and mul-
tisensory neurons, and that the maximum criterion, which is sometimes used in place of the additive
criterion, may inappropriately identify unisensory areas as multisensory.
It should be noted that the utility of the additive criterion applied to BOLD fMRI data is different
conceptually from the superadditivity label used with single units. The additive criterion is used to
identify multisensory interactions with BOLD activation. This is analogous to maximum criterion
being used to identify multisensory interactions in single-unit activity. Thus, superadditivity with
single units is not analogous to the additive criterion with BOLD fMRI. The term superadditivity is
used with single-unit recordings as a label to describe a subclass of neurons that not only exceeded
the maximum criterion, but also the superadditivity criterion.

8.3  PROBLEMS WITH ADDITIVE CRITERION


Although the additive criterion tests a more appropriate null hypothesis than the maximum crite-
rion, in practice, the additive criterion has had only limited success. Some early studies successfully
identified brain regions that met the additive criterion (Calvert et al. 2000, 2001), but subsequent
studies did not find evidence for additivity even in known multisensory brain regions (Beauchamp
2005; Beauchamp et al. 2004a, 2004b; Laurienti et al. 2005; Stevenson et al. 2007). These findings
prompted researchers to suggest that the additive criterion may be too strict and thus susceptible to
false negatives. As such, some suggested using the more liberal maximum criterion (Beauchamp
2005), which, as shown in Figure 8.2, is susceptible to false positives.
One possible reason for the discrepancy between theory and practice was described by Laurienti
et al. (2005) and is demonstrated in Figure 8.3. The values in the bottom row of the table in Figure
8.3 are simulated BOLD activation. Each column in the table is a different stimulus condition,
[Figure 8.3 graphic: bar plot of modeled BOLD responses, with stacked contributions of A, V, and AV cells, for unisensory A and V input, for AV input under the maximum, supermaximum, additive, superadditive, and Laurienti models, and for the Max(A,V) and Sum(A,V) criteria.]

Neural contributions by class (same values as plotted above):

              A      V     Max   Supermax  Additive  Superadditive  Laurienti  Max(A,V)  Sum(A,V)
  A cells   0.60   0.00   0.60     0.60      0.60         0.60         0.60      0.00      0.60
  V cells   0.00   0.80   0.80     0.80      0.80         0.80         0.80      0.80      0.80
  AV cells  0.54   0.48   0.54     0.79      1.03         1.88         0.80      0.49      1.03
  BOLD      1.14   1.29   1.94     2.19      2.43         2.95         2.20      1.29      2.43

(Columns Max through Laurienti are the five modeled responses to AV input; Max(A,V) and Sum(A,V) are the criteria.)

FIGURE 8.3  Models of BOLD activation with multisensory stimulation.

including unisensory auditory, unisensory visual, and multisensory audiovisual. The Sum(A,V) col-
umn is simply the sum of the audio and visual BOLD signals and represents the additive criterion
(null hypothesis). The audiovisual stimulus conditions were simulated using five different models:
the maximum model, the supermaximum model, the additive model, the superadditive model, and
the Laurienti model. The first three rows of the table represent the contributions of different classes
of neurons to BOLD activation, including auditory unisensory neurons (A cells), visual unisensory
neurons (V cells), and audiovisual multisensory neurons (AV cells). To be clear, the BOLD value in
the bottom-most row is the sum of the A, V, and AV cells’ contributions. Summing these contribu-
tions is based on the assumption that voxels (or clusters of voxels) contain mixtures of unisensory
and multisensory neurons, not a single class of neurons. Although the “contributions” have no units,
they are simulated based on the statistics of recorded impulse counts (spike counts) from neurons
in the superior colliculus, as reported by Laurienti et al. (2005). Unisensory neurons were explicitly
modeled to respond under multisensory stimulation just as they did under unisensory stimulation;
otherwise, they would be classified as subthreshold neurons, which were not considered in the
models.
The five models of BOLD activation under audiovisual stimulation differed in the calculation
of only one value: the contribution of the AV multisensory neurons. For the maximum model, the
contribution of AV cells was calculated as the maximum of the AV cell contributions with visual
and auditory unisensory stimuli. For the super-max model, the contribution of AV neurons was cal-
culated as 150% of the AV cell contribution used for the maximum model. For the additive model,
the contribution of AV cells was calculated as the sum of AV cell contributions with visual and audi-
tory unisensory stimuli. For the superadditive model, the contribution of AV cells was calculated as
150% of the AV cell contribution used for the additive model. Finally, for the Laurienti model, the
contribution of the AV cells was based on the statistics of recorded impulse counts. What the table
makes clear is that, based on Laurienti’s statistics, the additive criterion is too conservative, which is
consistent with what has been found in practice (Beauchamp 2005; Beauchamp et al. 2004a, 2004b;
Laurienti et al. 2005; Stevenson et al. 2007).
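The five modeled AV responses can be reproduced in a few lines. The sketch below is our own illustration; the input values are taken from the table in Figure 8.3, so small rounding differences from the published values are expected.

# AV-cell contributions under unisensory stimulation (from the table in Figure 8.3)
av_with_A, av_with_V = 0.54, 0.48
# Unisensory pools respond the same under AV stimulation as under A or V alone
a_cells, v_cells = 0.60, 0.80

av_cell_models = {
    "maximum":       max(av_with_A, av_with_V),
    "supermaximum":  1.5 * max(av_with_A, av_with_V),
    "additive":      av_with_A + av_with_V,
    "superadditive": 1.5 * (av_with_A + av_with_V),
    "Laurienti":     0.80,   # empirical value based on recorded impulse counts
}

sum_AV_criterion = (a_cells + av_with_A) + (v_cells + av_with_V)   # Sum(A,V)

for name, av_cells in av_cell_models.items():
    bold_av = a_cells + v_cells + av_cells        # population response to AV
    print(f"{name:13s}  AV cells = {av_cells:.2f}  BOLD = {bold_av:.2f}")

print(f"Sum(A,V) criterion = {sum_AV_criterion:.2f}")
# With these values, only the superadditive model clearly exceeds Sum(A,V); the
# Laurienti-based response (2.20) falls between the maximum (1.94) and additive
# (2.42) models, which is why the additive criterion is so rarely met in practice.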
Laurienti and colleagues (2005) suggest three reasons why the simulated BOLD activation may
not exceed the additive criterion based on the known neurophysiology: first, the proportion of AV
neurons is small compared to unisensory neurons; second, of those multisensory neurons, only a
small proportion are superadditive; and third, superadditive neurons have low impulse counts rela-
tive to other neurons. The average impulse count of the pool of bimodal neurons must be significantly
superadditive for population-based measurements to exceed the additive criterion. The presence of
superadditive neurons in the pool is not enough by itself
because those superadditive responses are averaged with other subadditive, and even suppressive,
responses. According to Laurienti’s statistics, the result of this averaging is a value somewhere
between maximum and additive. Thus, even though the additive criterion is appropriate because
it represents the correct null hypothesis, the statistical distribution of cell and impulse counts in
multisensory brain regions may make it impractical as a criterion.

8.4  INVERSE EFFECTIVENESS


The Laurienti model is consistent with recent findings suggesting that the additive criterion is too
conservative (Beauchamp 2005; Beauchamp et al. 2004a, 2004b; Laurienti et al. 2005; Stevenson et
al. 2007); however, those recent studies used stimuli that were highly salient. Another established
principle of multisensory single-unit recording is the law of inverse effectiveness. Effectiveness
in this case refers to how well a stimulus drives the neurons in question. Multisensory neurons
usually increase their proportional level of multisensory enhancement as the stimulus quality is
degraded (Meredith and Stein 1986; Stein et al. 2008). That is, the multisensory gain increases as the
“effectiveness” of the stimulus decreases. If the average level of multisensory enhancement of a
pool of neurons increases when stimuli are degraded, then BOLD activation could exceed the addi-
tive criterion when degraded stimuli are used.
Figure 8.4 shows this effect using the simulated data from the Laurienti model (Figure 8.3).

[Figure 8.4: bar graphs of simulated BOLD responses (stacked contributions of A cells, V cells, and
AV cells) for the A, V, AV, and Sum(A,V) conditions under high stimulus quality (left), where the
AV-cell contribution is subadditive, and low stimulus quality (right), where it is superadditive.]

FIGURE 8.4  Influence of inverse effectiveness on simulated multisensory BOLD activation.

In the high stimulus quality condition, the simulated AV activation clearly does not exceed the
additive criterion, indicated as Sum(A,V), and it can be seen that this is because of the subadditive
contribution of the multisensory neurons. On the right in Figure 8.4, a similar situation is shown,
but with less effective, degraded stimuli. In general, neurons in multisensory regions decrease their
impulse counts when stimuli are less salient. However, the size of the decrease is different across
different classes of neurons and different stimulus conditions (Alvarado et al. 2007). In our simu-
lation, impulse counts of unisensory neurons were reduced by 30% from the values simulated by
the Laurienti model. Impulse counts of bimodal neurons were reduced by 75% under unisensory
stimulus conditions, and by 50% under multisensory stimulus conditions. This difference in reduc-
tion for bimodal neurons between unisensory and multisensory stimulus conditions reflects inverse
effectiveness, that is, the multisensory gain increases with decreasing stimulus effectiveness.
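A minimal sketch of these reductions (our own code, using the simulated contribution values shown in Figures 8.3 and 8.4) reproduces the numbers behind this argument; as described in the next paragraph, the degraded-stimulus AV response now exceeds the sum of the degraded unisensory responses.

# Contribution of each pool to the BOLD signal under A, V, and AV stimulation
# (high stimulus quality; Laurienti-model values from Figure 8.3).
a_cells  = {"A": 0.60, "V": 0.00, "AV": 0.60}
v_cells  = {"A": 0.00, "V": 0.80, "AV": 0.80}
av_cells = {"A": 0.54, "V": 0.49, "AV": 0.80}

def degrade(pool, cut_unisensory, cut_multisensory):
    """Reduce a pool's contributions by fixed proportions (stimulus degradation)."""
    return {"A": pool["A"] * (1 - cut_unisensory),
            "V": pool["V"] * (1 - cut_unisensory),
            "AV": pool["AV"] * (1 - cut_multisensory)}

# Unisensory pools: 30% reduction throughout. Bimodal pool: 75% reduction under
# unisensory stimulation but only 50% under AV stimulation (inverse effectiveness).
low_quality = [degrade(a_cells, 0.30, 0.30),
               degrade(v_cells, 0.30, 0.30),
               degrade(av_cells, 0.75, 0.50)]

bold = {cond: sum(pool[cond] for pool in low_quality) for cond in ("A", "V", "AV")}
print(bold["AV"], bold["A"] + bold["V"])        # ~1.38 vs ~1.24
print(bold["AV"] > bold["A"] + bold["V"])       # True: additive criterion exceeded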
Using these reductions in activity with stimulus degradation, BOLD activation with the AV
stimulus now exceeds the additive criterion. Admittedly, the reductions that were assigned to
the different classes of neurons were chosen somewhat arbitrarily. There are definitely different
combinations of reductions that would lead to AV activation that would not exceed the criterion.
However, the reductions shown are based on statistics of impulse counts taken from single-unit
recording data, and are consistent with the principle of inverse effectiveness reported routinely
in the single-unit recording literature (Meredith and Stein 1986). Furthermore, there is empirical
evidence from neuroimaging showing an increased likelihood of exceeding the additive criterion
as stimulus quality is degraded (Stevenson and James 2009; Stevenson et al. 2007, 2009). Figure
8.5 compares AV activation with the additive criterion at multiple levels of stimulus quality. These
are a subset of data from a study reported elsewhere (Stevenson and James 2009). Stimulus quality
was degraded by parametrically varying the signal-to-noise ratio (SNR) of the stimuli until partici-
pants were able to correctly identify the stimuli at a given accuracy. This was done by embedding
the audio and visual signals in constant external noise and lowering the root mean square contrast
of the signals. AV activation exceeded the additive criterion at low SNR, but failed to exceed the
criterion at high SNR.
Although there is significant empirical and theoretical evidence suggesting that the additive
criterion is too conservative at high stimulus SNR, the data presented in Figure 8.5 suggest that
the additive criterion may be a better criterion at low SNR. However, there are two possible prob-
lems with using low-SNR stimuli to assess multisensory integration with BOLD fMRI. First,
based on the data in Figure 8.5, the change from failing to meet the additive criterion to exceeding
the additive criterion is gradual, not a sudden jump at a particular level of SNR. Thus, the choice
of SNR level(s) is extremely important for the interpretation of the results. Second, there may be
problems with using the additive criterion with measurements that lack a natural zero, such as
BOLD.

[Figure 8.5: bar graph of BOLD response for the AV condition and the Sum(A,V) criterion at four levels
of stimulus quality (95%, 85%, 75%, and 65% identification accuracy).]
FIGURE 8.5  Assessing inverse effectiveness empirically with BOLD activation. These are a subset of data
reported elsewhere. (From Stevenson, R.A. and James, T.W., NeuroImage, 44, 1210–23, 2009. With permission.)

8.5  BOLD BASELINE: WHEN ZERO IS NOT ZERO


It is established procedure with fMRI data to transform raw BOLD values to percentage signal
change values by subtracting the mean activation for the baseline condition and dividing by the
baseline. Thus, for BOLD measurements, “zero” is not absolute, but is defined as the activation
produced by the baseline condition chosen by the experimenter (Binder et al. 1999; Stark and Squire
2001). Statistically, this means that BOLD measurements would be considered an interval scale at
best (Stevens 1946). The use of an interval scale affects the interpretation of the additive criterion
because calculating the additive criterion relies on summing two unisensory activations and
comparing the result with a single multisensory activation. Because the activation values are
measured relative to an arbitrary baseline, the value of the baseline condition has a different effect
on the summed unisensory activations than on the single multisensory activation. In short, the value
of the baseline is subtracted from the additive criterion twice, but is subtracted from the multisen-
sory activation only once (see Equation 8.3).
The additive criterion for audiovisual stimuli is described according to the following equation:

AV > A + V (8.1)

But, Equation 8.1 is more accurately described by

(AV – baseline)/baseline > (A – baseline)/baseline + (V – baseline)/baseline (8.2)

[Figure 8.6: top panel, raw BOLD signal for the A, V, AV, and baseline conditions in two experiments
with different baseline activation levels; bottom panel, the corresponding percentage BOLD change for
A, V, AV, and Sum(A,V), with the same AV response appearing subadditive relative to Sum(A,V) in one
experiment and superadditive in the other.]
FIGURE 8.6  Influence of baseline activation on additive criterion.



Equation 8.2 can be rewritten as

AV – baseline > A + V – 2 × baseline, (8.3)

and then

AV > A + V – baseline. (8.4)

Equation 8.4 clearly shows that the level of activation produced by the baseline condition influences
the additive criterion. An increase in activation of the baseline condition causes the additive crite-
rion to become more liberal (Figure 8.6). The fact that the additive criterion can be influenced by the
activation of the experimenter-chosen baseline condition may explain why similar experiments from
different laboratories produce different findings when that criterion is used (Beauchamp 2005).
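The dependence on the baseline term in Equations 8.1 through 8.4 is easy to demonstrate numerically. In the sketch below (all raw values are hypothetical and chosen only for illustration), the same three raw responses are converted to percentage signal change against two different baseline levels; the additive criterion is exceeded under the higher baseline but not under the lower one.

def percent_change(raw, baseline):
    """BOLD percentage signal change relative to the baseline condition."""
    return (raw - baseline) / baseline

raw_A, raw_V, raw_AV = 530.0, 540.0, 560.0      # hypothetical raw BOLD values

for baseline in (500.0, 520.0):                 # two experimenter-chosen baselines
    A = percent_change(raw_A, baseline)
    V = percent_change(raw_V, baseline)
    AV = percent_change(raw_AV, baseline)
    # Equivalent to Equation 8.4: raw_AV > raw_A + raw_V - baseline
    print(baseline, AV > A + V)                 # False for 500, True for 520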

8.6  A DIFFERENCE-OF-BOLD MEASURE


We have provided a theoretical rationale for the inconsistency of the additive criterion for assessing
multisensory integration using BOLD fMRI as well as a theoretical rationale for the inappropriate-
ness of the maximum criterion as a null hypothesis for this same assessment. The maximum cri-
terion is appropriate when used with single-unit recording data, but when used with BOLD fMRI
data, which represent populations of neurons, it cannot account for the contribution of unisensory
neurons that are found in multisensory brain regions. Without being able to account for the hetero-
geneity of neuronal populations, the maximum criterion is likely to produce false-positives when
used with a population-based measure such as fMRI.
Although the null hypothesis tested by the additive criterion is more appropriate than the maxi-
mum criterion, the additive criterion is not without issues. First, an implicit assumption with the
additive criterion is that the average multisensory neuronal response shows a pattern that is super-
additive, an assumption that is clearly not substantiated empirically. Second, absolute BOLD per-
centage signal change measurements are measured on an interval scale. An interval scale is one
with no natural zero, and on which the absolute values are not meaningful (in a statistical sense).
The relative differences between absolute values, however, are meaningful, even when the absolute
values are measured on an interval scale. To specifically relate relative differences to the use of an
additive criterion, imagine an experiment where A, V, and AV were not levels of a sensory modality
factor, but instead A, V, and AV were three separate factors, each with at least two different levels
(e.g., levels of stimulus quality). Rather than analyzing the absolute BOLD values associated with
each condition, a relative difference measurement could be calculated between the levels of each
factor, resulting in ΔA, ΔV, and ΔAV measurements. The use of relative differences alleviates the
baseline problem because the baseline activations embedded in the measurements cancel out when
a difference operation is performed across levels of a factor. If we replace the absolute BOLD values
in Equation 8.1 with BOLD differences, the equation becomes

ΔAV ≠ ΔA + ΔV. (8.5)

Note that the inequality sign is different in Equation 8.5 than in Equation 8.1. Equation 8.1 is used
to test the directional hypothesis that AV activation exceeds the additive criterion. Subadditivity, the
hypothesis that AV activation is less than the additive criterion, is rarely, if ever, used as a criterion
by itself. It has been used in combination with superadditivity, for instance, showing that a
brain region exceeds the additive criterion with semantically congruent stimuli but does not exceed
the additive criterion with semantically incongruent stimuli (Calvert et al. 2000). This example
(using both superadditivity and subadditivity), however, is testing two directional hypotheses, rather
than testing one nondirectional hypothesis. Equation 8.5 is used to test a nondirectional hypothesis,
and we suggest that it should be nondirectional for two reasons. First, the order in which the two
terms are subtracted to produce each delta is arbitrary. For each delta term, if the least effective
stimulus condition is subtracted from the most effective condition, then Equation 8.5 can be rewrit-
ten as ΔAV < ΔA + ΔV to test for inverse effectiveness, that is, the multisensory difference should
be less than the sum of the unisensory differences. If, however, the differences were taken in the
opposite direction (i.e., most effective subtracted from least effective), Equation 8.5 would need to
be rewritten with the inequality in the opposite direction (i.e., ΔAV > ΔA + ΔV). Second, inverse
effectiveness may not be the only meaningful effect that can be seen with difference measures,
perhaps especially if the measures are used to assess function across the whole brain. This point is
discussed further at the end of the chapter (Figure 8.9).
Each component of Equation 8.5 can be rewritten with the baseline activation made explicit. The
equation for the audio component would be

ΔA = (A1 – baseline)/baseline – (A2 – baseline)/baseline, (8.6)

where A1 and A2 represent auditory stimulus conditions with different levels of stimulus quality.
When Equation 8.5 is rewritten by substituting Equation 8.6 for each of the three stimulus condi-
tions, all baseline variables in both the denominator and the numerator cancel out, producing the
following equation:

(AV1 – AV2) ≠ (A1 – A2) + (V1 – V2). (8.7)

The key importance of Equation 8.7 is that the baseline variable cancels out when relative differ-
ences are used instead of absolute values. Thus, the level of baseline activation has no influence on
a criterion calculated from BOLD differences.
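A short numerical sketch (again with hypothetical raw values; the code is our own) makes the cancellation explicit: the verdict of the difference-based comparison in Equation 8.5 is the same whichever baseline is used, because the comparison reduces to the raw-signal form of Equation 8.7.

def pct(raw, baseline):
    return (raw - baseline) / baseline

# Hypothetical raw BOLD values at two levels of stimulus quality (1 = high, 2 = low)
raw = {"A1": 560.0, "A2": 535.0, "V1": 570.0, "V2": 540.0, "AV1": 600.0, "AV2": 565.0}

for baseline in (500.0, 520.0):
    p = {cond: pct(val, baseline) for cond, val in raw.items()}
    dA, dV, dAV = p["A1"] - p["A2"], p["V1"] - p["V2"], p["AV1"] - p["AV2"]
    print(baseline, dAV < dA + dV)      # the verdict is identical for either baseline

# The same comparison expressed on the raw signals (Equation 8.7): the baseline
# term has dropped out entirely.
print((raw["AV1"] - raw["AV2"]) < (raw["A1"] - raw["A2"]) + (raw["V1"] - raw["V2"]))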
The null hypothesis represented by Equation 8.5 is similar to the additive criterion in that the
sum of two unisensory values is compared to a multisensory value. Those values, however, are
relative differences instead of absolute BOLD percentage signal changes. If the multisensory differ-
ence is less (or greater) than the additive difference criterion, one can infer an interaction between
sensory channels, most likely in the form of a third pool of multisensory neurons in addition to
unisensory neurons. The rationale for using additive differences is illustrated in Figure 8.7. The
simulated data for the null hypothesis reflect the contributions of neurons in a brain region that
contains only unisensory auditory and visual neurons (Figure 8.7a). In the top panel, the horizontal
axis represents the stimulus condition, either unisensory auditory (A) or visual (V), or multisensory
audiovisual (AV). The subscripts 1 and 2 represent different levels of stimulus quality. For example,
A1 is high-quality audio and A2 is low-quality audio. To relate these simulated data to the data in
Figure 8.2 and the absolute additive criterion, the height of the stacked bar for AV1 is the absolute
additive criterion (or null hypothesis) for the high-quality stimuli, and the height of the AV2 stacked
bar is the absolute additive criterion for the low-quality stimuli. Those absolute additive criteria,
however, suffer from the issues discussed above. Evaluating the absolute criterion at multiple levels
of stimulus quality provides the experimenter with more information than evaluating it at only one
level, but a potentially better way of assessing multisensory integration is to use a criterion based
on differences between the high- and low-quality stimulus conditions. The null hypothesis for this
additive differences criterion is illustrated in the bottom panel of Figure 8.7a. The horizontal axis
shows the difference in auditory (ΔA), visual (ΔV), and audiovisual (ΔAV) stimuli, all calculated
as differences in the heights of the stacked bars in the top panel. The additive differences criterion,
labeled Sum(ΔA,ΔV), is also shown, and is the same as the difference in multisensory activation
(ΔAV). Thus, for a brain region containing only two pools of unisensory neurons, the appropriate
null hypothesis to be tested is provided by Equation 8.5.

[Figure 8.7: (a) two-population null hypothesis. Top panel: simulated % BOLD change (A cells and V
cells only) for conditions A1, A2, V1, V2, AV1, and AV2. Bottom panel: the BOLD differences ΔA, ΔV,
and ΔAV together with the criterion Sum(ΔA,ΔV), with ΔAV = Sum(ΔA,ΔV). (b) Three-population
hypothesis. Top panel: the same conditions with an added pool of AV cells showing inverse effectiveness.
Bottom panel: the corresponding BOLD differences, with ΔAV < Sum(ΔA,ΔV).]
FIGURE 8.7  Additive differences criterion.

The data in Figure 8.7b apply the additive differences criterion to the simulated BOLD activation
data shown in Figure 8.4. Recall from Figure 8.4 that the average contribution of the multisensory
neurons is subadditive for high-quality stimuli (A1, V1, AV1), but is superadditive with low-quality
stimuli (A2, V2, AV2). In other words, the multisensory pool shows inverse effectiveness. The data
in the bottom panel of Figure 8.7b are similar to the bottom panel of Figure 8.7a, but with the addi-
tion of this third pool of multisensory neurons to the population. Adding the third pool makes ΔAV
(the difference in multisensory activation) significantly less than the additive differences criterion
(Sum(ΔA,ΔV)), and rejects the null hypothesis of only two pools of unisensory neurons.
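The pattern in Figure 8.7b can be verified directly from the simulated contribution values in Figures 8.3 and 8.4; the short sketch below (our own code) reproduces the comparison.

# Summed population responses (A cells + V cells + AV cells) at two quality levels,
# taken from the Laurienti-model simulations in Figures 8.3 and 8.4.
high = {"A": 0.60 + 0.54, "V": 0.80 + 0.49, "AV": 0.60 + 0.80 + 0.80}
low  = {"A": 0.42 + 0.13, "V": 0.56 + 0.12, "AV": 0.42 + 0.56 + 0.40}

dA  = high["A"]  - low["A"]     # ~0.59
dV  = high["V"]  - low["V"]     # ~0.61
dAV = high["AV"] - low["AV"]    # ~0.82

print(dAV < dA + dV)            # True: 0.82 < 1.20, so the two-pool null is rejected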
Figure 8.8 shows the same additive differences analysis performed on the empirical data from
Figure 8.5 (Stevenson and James 2009; Stevenson et al. 2009). The empirical data show the same
pattern as the simulated data. With both the simulated and empirical data, ΔAV was less than
Sum(ΔA,ΔV), a pattern of activation similar to inverse effectiveness seen in single units. In single-
unit recording, there is a positive relation between stimulus quality and impulse count (or effective-
ness). This same relation was seen between stimulus quality and BOLD activation. Although most
neurons show this relation, the multisensory neurons tend to show smaller decreases (proportion-
ately) than the unisensory neurons. Thus, as the effectiveness of the stimuli decreases, the multisen-
sory gain increases. Decreases in stimulus quality also had a smaller effect on multisensory BOLD
activation than on unisensory BOLD activation, suggesting that the results in Figure 8.8 could (but
do not necessarily) reflect the influence of inversely-effective neurons.
[Figure 8.8: bar graph of BOLD differences, comparing ΔAV with Sum(ΔA,ΔV) at three levels of stimulus
quality (95–85%, 85–75%, and 75–65% accuracy).]

FIGURE 8.8  Assessing multisensory interactions empirically with additive differences.

In summary, we have demonstrated some important theoretical limitations of the criteria commonly
used in BOLD fMRI studies to assess multisensory integration. First, the additive criterion
is susceptible to variations in baseline. Second, the additive criterion is sensitive only if the aver-
age activity profile of the multisensory neurons in the neuronal population is superadditive, which,
empirically, only occurs with very low-quality stimuli. A combination of these two issues may
explain the inconsistency in empirical findings using the additive criterion (Beauchamp 2005;
Calvert et al. 2000; Stevenson et al. 2007). Third, the maximum criterion tests a null hypothesis that
is based on a homogeneous population of only multisensory neurons. Existing single-unit recording
data suggest that multisensory brain regions have heterogeneous populations containing unisensory,
bimodal, and sometimes, subthreshold neurons. Thus, the null hypothesis tested with the maximum
criterion is likely to produce false-positive results in unisensory brain regions.

[Figure 8.9 includes schematic panels illustrating possible BOLD additive-difference interactions:
direct gain suppression (ΔAV < Sum(ΔA,ΔV)), direct gain enhancement (ΔAV > Sum(ΔA,ΔV)), indirect
gain suppression (ΔAV > Sum(ΔA,ΔV)), and indirect gain enhancement (ΔAV < Sum(ΔA,ΔV)), each
showing BOLD activity for A, V, and AV at high and low stimulus quality.]
FIGURE 8.9  A whole-brain statistical parametric map of regions demonstrating audiovisual neuronal con-
vergence as assessed by additive differences criterion.

As a potential solution to these concerns, we have developed a new criterion for assessing mul-
tisensory integration using relative BOLD differences instead of absolute BOLD measurements.
Relative differences are not influenced by changes in baseline, protecting the criterion from incon-
sistencies across studies. The null hypothesis to be tested is the sum of unisensory differences
(additive differences), which is based on the assumption of a heterogeneous population of neurons.
In addition to the appropriateness of the null hypothesis tested, the additive differences criterion
produced positive results in known multisensory brain regions when tested empirically (Stevenson
et al. 2009). Evidence for inverse effectiveness with audiovisual stimuli was found in known mul-
tisensory brain regions such as the superior temporal gyrus and inferior parietal lobule, but also
in regions that have garnered less attention from the multisensory community, such as the medial
frontal gyrus and parahippocampal gyrus (Figure 8.9). These results were found across different
pairings of sensory modalities and with different experimental designs, suggesting that the additive
differences approach may be of general use for assessing integration across sensory channels. A num-
ber of different brain regions, such as the insula and caudate nucleus, also showed an effect that
appeared to be the opposite of inverse effectiveness (Figure 8.9). BOLD activation in these brain
regions showed the opposite relation with stimulus quality to that of sensory brain regions, that is, high-
quality stimuli produced less activation than low-quality stimuli. Because of this opposite relation,
we termed the effect observed in these regions indirect inverse effectiveness. More research will be
needed to assess the contribution of indirect inverse effectiveness to multisensory neural processing
and behavior.

8.7  LIMITATIONS AND FUTURE DIRECTIONS


All of the simulations above made the assumption that BOLD activation could be described by
a time-invariant linear system. Although there is clearly evidence supporting this assumption
(Boynton et al. 1996; Dale and Buckner 1997; Glover 1999; Heeger and Ress 2002), studies using
serial presentation of visual stimuli suggest that nonlinearities in BOLD activation may exist when
stimuli are presented closely together in time, that is, closer than a few seconds (Boynton and
Finney 2003; Friston et al. 1999). Simultaneous presentation could be considered just a serial pre-
sentation with the shortest asynchrony possible. In that case, the deviations from linearity with
simultaneous presentation may be substantial. A careful examination of unisensory integration and
a comparison of unisensory with multisensory integration could provide valuable insights about the
linearity assumption of BOLD responses.
The simulations above were also based on only one class of multisensory neuron, the bimodal
neurons, which respond to two or more sensory modalities. Another class of multisensory neurons
has recently been discovered, which was not used in the simulations presented here. Subthreshold
neurons respond to only one sensory modality when stimulated with unisensory stimuli. However,
when stimulated with multisensory stimuli, these neurons show multisensory enhancement (Allman
and Meredith 2007; Allman et al. 2008; Meredith and Allman 2009). Adding this class of neurons to
the simulations may increase the precision of the predictions for population models with more than
two populations of neurons. The goal of the simulations presented here, however, was to develop
null hypotheses based on neuronal populations composed of only two unisensory pools of neurons.
Rejecting the null hypothesis then implies the presence of at least one other pool of neurons besides
the unisensory pools. In our simulations, we modeled that pool as bimodal; however, we could have
also modeled subthreshold neurons or a combination of bimodal and subthreshold neurons. Our
impression is that the addition of subthreshold neurons to the simulations would not qualitatively
change the results, because subthreshold neurons are found in relatively small numbers (less than
the number of subadditive bimodal neurons), and their impulse counts are low compared to other
classes of neurons (Allman and Meredith 2007).
The simulations above made predictions about levels of BOLD activation, but were based on prin-
ciples of multisensory processing that were largely derived from spike (action potential) count data
collected using single-unit recording. BOLD activation reflects a hemodynamic response, which
itself is the result of local neural activity. The exact relationship, however, between neural activ-
ity and BOLD activation is unclear. There is evidence that increased spiking produces small brief
local reductions in tissue oxygenation, followed by large sustained increases in tissue oxygenation
(Thompson et al. 2003). Neural spike count, however, is not the only predictor of BOLD activation
levels nor is it the best predictor. The correlation of BOLD activation with local field potentials is
stronger than the correlation of BOLD with spike count (Heeger et al. 2000; Heeger and Ress 2002;
Logothetis and Wandell 2004). Whereas spikes reflect the output of neurons, local field potentials
are thought to reflect the postsynaptic potentials or input to neurons. This distinction between input
and output and the relationship with BOLD activation raises some concerns about relating studies
using BOLD fMRI to studies using single-unit recording. Of course, spike count is also highly
correlated with local field potentials, suggesting that spike count, local field potentials, and BOLD
activation are all interrelated and, in fact, that the correlations among them may be related to another
variable that is responsible for producing all of the phenomena (Attwell and Iadecola 2002).
Multisensory single-unit recordings are mostly performed in monkey and cat superior colliculus
and monkey superior temporal sulcus or cat posterolateral lateral suprasylvian area (Allman and
Meredith 2007; Allman et al. 2008; Barraclough et al. 2005; Benevento et al. 1977; Bruce et al. 1981;
Hikosaka et al. 1988; Meredith 2002; Meredith and Stein 1983, 1986; Stein and Meredith 1993; Stein
and Stanford 2008). With BOLD fMRI, whole-brain imaging is routine, which allows for exploration
of the entire cortex. The principles that are derived from investigation of specific brain areas may not
always apply to other areas of the brain. Thus, whole-brain investigation has the distinct promise of
producing unexpected results. The unexpected results could be because of the different proportions
of known classes of neurons, or the presence of other classes of multisensory neurons that have not
yet been found with single-unit recording. It is possible that the indirect inverse effectiveness effect
described above (Figure 8.9) may reflect the combined activity of types of multisensory neurons with
response profiles that have not yet been discovered with single-unit recording.

8.8  CONCLUSIONS
We must stress that each method used to investigate multisensory interactions has a unique set
of limitations and assumptions, whether the method is fMRI, high-density recording, single-unit
recording, behavioral reaction time, or others. Differences between methods can have a great impact
on how multisensory interactions are assessed. Thus, it should not be assumed that a criterion that is
empirically tested and theoretically sound when used with one method will be similarly sound when
applied to another method. We have developed a method for assessing multisensory integration
using BOLD fMRI that makes fewer assumptions than established methods. Because BOLD mea-
surements have an arbitrary baseline, a criterion that is based on relative BOLD differences instead
of absolute BOLD values is more interpretable and reliable. Also, the use of BOLD differences is
not limited to comparing across multisensory channels, but should be equally effective when com-
paring across unisensory channels. Finally, it is also possible that the use of relative differences may
be useful with other types of measures, such as EEG, which also use an arbitrary baseline. However,
before using the additive differences criterion with other measurement methods, it should be tested
both theoretically and empirically, as we have done here with BOLD fMRI.

ACKNOWLEDGMENTS
This research was supported in part by the Indiana METACyt Initiative of Indiana University, funded
in part through a major grant from the Lilly Endowment, Inc., the IUB Faculty Research Support
Program, and the Indiana University GPSO Research Grant. We appreciate the insights provided
by Karin Harman James, Sunah Kim, and James Townsend, by other members of the Perception and
Neuroimaging Laboratory, and by other members of the Indiana University Neuroimaging Group.

REFERENCES
Allman, B.L., and M.A. Meredith. 2007. Multisensory processing in “unimodal” neurons: Cross-modal sub-
threshold auditory effects in cat extrastriate visual cortex. Journal of Neurophysiology 98:545–9.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2008. Subthreshold auditory inputs to extrastriate visual
neurons are responsive to parametric changes in stimulus quality: Sensory-specific versus non-specific
coding. Brain Research 1242:95–101.
Alvarado, J.C., J.W. Vaughan, T.R. Stanford, and B.E. Stein. 2007. Multisensory versus unisensory integration:
Contrasting modes in the superior colliculus. Journal of Neurophysiology 97:3193–205.
Attwell, D., and C. Iadecola. 2002. The neural basis of functional brain imaging signals. Trends in Neurosciences
25:621–5.
Barraclough, N.E., D. Xiao, C.I. Baker, M.W. Oram, and D.I. Perrett. 2005. Integration of visual and auditory
information by superior temporal sulcus neurons responsive to the sight of actions. Journal of Cognitive
Neuroscience 17:377–91.
Beauchamp, M.S. 2005. Statistical criteria in FMRI studies of multisensory integration. Neuroinformatics
3:93–113.
Beauchamp, M.S., B.D. Argall, J. Bodurka, J.H. Duyn, and A. Martin. 2004a. Unraveling multisensory integra-
tion: Patchy organization within human STS multisensory cortex. Nature Neuroscience 7:1190–2.
Beauchamp, M.S., K.E. Lee, B.D. Argall, and A. Martin. 2004b. Integration of auditory and visual information
about objects in superior temporal sulcus. Neuron 41:809–23.
Benevento, L.A., J. Fallon, B.J. Davis, and M. Rezak. 1977. Auditory–visual interaction in single cells in the
cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey. Experimental
Neurology 57:849–72.
Binder, J.R., J.A. Frost, T.A. Hammeke et al. 1999. Conceptual processing during the conscious resting state.
A functional MRI study. Journal of Cognitive Neuroscience 11:80–95.
Boynton, G.M., S.A. Engel, G.H. Glover, and D.J. Heeger. 1996. Linear systems analysis of functional mag-
netic resonance imaging in human V1. Journal of Neuroscience 16:4207–21.
Boynton, G.M., and E.M. Finney. 2003. Orientation-specific adaptation in human visual cortex. The Journal of
Neuroscience 23:8781–7.
Bruce, C., R. Desimone, and C.G. Gross. 1981. Visual properties of neurons in a polysensory area in superior
temporal sulcus of the macaque. Journal of Neurophysiology 46:369–84.
Calvert, G.A., M.J. Brammer, E.T. Bullmore et al. 1999. Response amplification in sensory-specific cortices
during crossmodal binding. NeuroReport 10:2619–23.
Calvert, G.A., R. Campbell, and M.J. Brammer. 2000. Evidence from functional magnetic resonance imaging
of crossmodal binding in the human heteromodal cortex. Current Biology 10:649–57.
Calvert, G.A., P.C. Hansen, S.D. Iversen, and M.J. Brammer. 2001. Detection of audio-visual integration sites
in humans by application of electrophysiological criteria to the BOLD effect. NeuroImage 14:427–38.
Dale, A.M., and R.L. Buckner. 1997. Selective averaging of rapidly presented individual trials using fMRI.
Human Brain Mapping 5:329–40.
Friston, K.J., E. Zarahn, O. Josephs, R.N. Henson, and A.M. Dale. 1999. Stochastic designs in event-related
fMRI. NeuroImage 10:607–19.
Glover, G.H. 1999. Deconvolution of impulse response in event-related BOLD fMRI. NeuroImage 9:416–29.
Heeger, D.J., A.C. Huk, W.S. Geisler, and D.G. Albrecht. 2000. Spikes versus BOLD: What does neuroimaging
tell us about neuronal activity? Nature Neuroscience 3:631–3.
Heeger, D.J., and D. Ress. 2002. What does fMRI tell us about neuronal activity? Nature Reviews Neuroscience
3:142–51.
Hikosaka, K., E. Iwai, H. Saito, and K. Tanaka. 1988. Polysensory properties of neurons in the anterior bank of
the caudal superior temporal sulcus of the macaque monkey. Journal of Neurophysiology 60:1615–37.
James, W. 1890. The Principles of Psychology. New York: Henry Holt & Co.
Laurienti, P.J., T.J. Perrault, T.R. Stanford, M.T. Wallace, and B.E. Stein. 2005. On the use of superadditivity
as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental
Brain Research 166:289–97.
Logothetis, N.K., and B.A. Wandell. 2004. Interpreting the BOLD signal. Annual Review of Physiology
66:735–69.
Meredith, M.A. 2002. On the neuronal basis for multisensory convergence: A brief overview. Brain Research.
Cognitive Brain Research 14:31–40.
Meredith, M.A., and B.L. Allman. 2009. Subthreshold multisensory processing in cat auditory cortex.
NeuroReport 20:126–31.
Meredith, M.A., and B.E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221:389–91.
Meredith, M.A., and B.E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. Journal of Neurophysiology 56:640–62.
Molyneux, W. 1688. Letter to John Locke. In E.S. de Beer (ed.), The correspondence of John Locke. Oxford:
Clarendon Press.
Perrault Jr., T.J., J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2003. Neuron-specific response characteristics
predict the magnitude of multisensory integration. Journal of Neurophysiology 90:4022–6.
Scannell, J.W., and M.P. Young. 1999. Neuronal population activity and functional imaging. Proceedings of the
Royal Society of London. Series B. Biological Sciences 266:875–81.
Stanford, T.R., and B.E. Stein. 2007. Superadditivity in multisensory integration: Putting the computation in
context. NeuroReport 18:787–92.
Stark, C.E., and L.R. Squire. 2001. When zero is not zero: The problem of ambiguous baseline conditions in
fMRI. Proceedings of the National Academy of Sciences of the United States of America 98:12760–6.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: The MIT Press.
Stein, B.E., and T.R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the single
neuron. Nature Reviews Neuroscience 9:255–66.
Stein, B.E., T.R. Stanford, R. Ramachandran, T.J. Perrault Jr., and B.A. Rowland. 2009. Challenges in quantify-
ing multisensory integration: Alternative criteria, models, and inverse effectiveness. Experimental Brain
Research 198:113–26.
Stevens, S.S. 1946. On the theory of scales of measurement. Science 103:677–80.
Stevenson, R.A., and T.W. James. 2009. Audiovisual integration in human superior temporal sulcus: Inverse
effectiveness and the neural processing of speech and object recognition. NeuroImage 44:1210–23.
Stevenson, R.A., M.L. Geoghegan, and T.W. James. 2007. Superadditive BOLD activation in superior temporal
sulcus with threshold non-speech objects. Experimental Brain Research 179:85–95.
Stevenson, R.A., S. Kim, and T.W. James. 2009. An additive-factors design to disambiguate neuronal and areal
convergence: Measuring multisensory interactions between audio, visual, and haptic sensory streams
using fMRI. Experimental Brain Research 198:183–94.
Thompson, J.K., M.R. Peterson, and R.D. Freeman. 2003. Single-neuron activity and tissue oxygenation in the
cerebral cortex. Science 299:1070–2.
van Atteveldt, N.M., E. Formisano, L. Blomert, and R. Goebel. 2007. The effect of temporal asynchrony on the
multisensory integration of letters and speech sounds. Cerebral Cortex 17:962–74.
9 Perception of Synchrony
between the Senses
Mirjam Keetels and Jean Vroomen

CONTENTS
9.1 Introduction........................................................................................................................... 147
9.2 Measuring Intersensory Synchrony: Temporal Order Judgment Task and Simultaneity
Judgment Task....................................................................................................................... 148
9.3 Point of Subjective Simultaneity............................................................................................ 150
9.3.1 Attention Affecting PSS: Prior Entry........................................................................ 151
9.4 Sensitivity for Intersensory Asynchrony............................................................................... 152
9.4.1 Spatial Disparity Affects JND................................................................................... 153
9.4.2 Stimulus Complexity Affects JND............................................................................ 154
9.4.3 Stimulus Rate Affects JND....................................................................................... 155
9.4.4 Predictability Affects JND........................................................................................ 155
9.4.5 Does Intersensory Pairing Affect JND?.................................................................... 156
9.5 How the Brain Deals with Lags between the Senses............................................................ 156
9.5.1 Window of Temporal Integration.............................................................................. 156
9.5.2 Compensation for External Factors........................................................................... 158
9.5.3 Temporal Recalibration............................................................................................. 161
9.5.4 Temporal Ventriloquism............................................................................................ 164
9.6 Temporal Synchrony: Automatic or Not?.............................................................................. 167
9.7 Neural Substrates of Temporal Synchrony............................................................................ 169
9.8 Conclusions............................................................................................................................ 170
References....................................................................................................................................... 171

9.1  INTRODUCTION
Most of our real-world perceptual experiences are specified by synchronous redundant and/or com-
plementary multisensory perceptual attributes. As an example, a talker can be heard and seen at the
same time, and as a result, we typically have access to multiple features across the different senses
(i.e., lip movements, facial expression, pitch, speed, and temporal structure of the speech sound).
This is highly advantageous because it increases perceptual reliability and saliency and, as a result,
it might enhance learning, discrimination, or the speed of a reaction to the stimulus (Sumby and
Pollack 1954; Summerfield 1987). However, the multisensory nature of perception also raises the
question about how the different sense organs cooperate so as to form a coherent representation of
the world. In recent years, this has been the focus of much behavioral and neuroscientific research
(Calvert et al. 2004). The most commonly held view among researchers in multisensory perception
is what has been referred to as the “assumption of unity.” It states that the more (amodal) properties
information from different modalities shares, the more likely it is that the brain will treat that information as originating
from a common object or source (see, e.g., Bedford 1989; Bertelson 1999; Radeau 1994; Stein and
Meredith 1993; Welch 1999; Welch and Warren 1980). Without a doubt, the most important amodal
property is temporal coincidence (e.g., Radeau 1994). From this perspective, one expects intersen-
sory interactions to occur if, and only if, information from the different sense organs arrives at
around the same time in the brain; otherwise, two separate events are perceived rather than a single
multimodal one.
The perception of time and, in particular, synchrony between the senses is not straightforward
because there is no dedicated sense organ that registers time in an absolute scale. Moreover, to
perceive synchrony, the brain has to deal with differences in physical (outside the body) and neural
(inside the body) transmission times. Sounds, for example, travel through air much slower than
visual information does (i.e., 300,000,000 m/s for vision vs. 330 m/s for audition), whereas no
physical transmission time through air is involved for tactile stimulation as it is presented directly
at the body surface. The neural processing time also differs between the senses, and it is typically
slower for visual than it is for auditory stimuli (approximately 50 vs. 10 ms, respectively), whereas
for touch, the brain may have to take into account where the stimulation originated from as the trav-
eling time from the toes to the brain is longer than from the nose (the typical conduction velocity
is 55 m/s, which results in a ~30 ms difference between toe and nose when this distance is 1.60 m;
Macefield et al. 1989). Because of these differences, one might expect that for audiovisual events,
only those occurring at the so-called “horizon of simultaneity” (Pöppel 1985; Poppel et al. 1990)—a
distance of approximately 10 to 15 m from the observer—will result in the approximately synchronous
arrival of auditory and visual information at the primary sensory cortices. Sounds will arrive
before visual stimuli if the audiovisual event is within 15 m from the observer, whereas vision will
arrive before sounds for events farther away. Surprisingly, despite these naturally occur-
ring lags, observers perceive intersensory synchrony for most multisensory events in the external
world, and not only for those at 15 m.
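The arithmetic behind that 10 to 15 m figure can be written out explicitly. The sketch below is a simplified calculation using only the approximate values given above (light travel time is treated as negligible).

SPEED_OF_SOUND = 330.0         # m/s, approximate
NEURAL_LAG_VISION = 0.050      # s, approximate visual processing time
NEURAL_LAG_AUDITION = 0.010    # s, approximate auditory processing time

def arrival_difference(distance_m):
    """Auditory minus visual arrival time at cortex (positive = sound arrives later)."""
    auditory = distance_m / SPEED_OF_SOUND + NEURAL_LAG_AUDITION
    visual = NEURAL_LAG_VISION                     # light travel time ignored
    return auditory - visual

# The "horizon of simultaneity" is the distance at which the difference is zero:
# distance / 330 = 0.050 - 0.010, i.e., roughly 13 m.
horizon = SPEED_OF_SOUND * (NEURAL_LAG_VISION - NEURAL_LAG_AUDITION)
print(horizon, arrival_difference(horizon))        # ~13.2 m, 0.0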
In recent years, a substantial amount of research has been devoted to understanding how the
brain handles these timing differences (Calvert et al. 2004; King 2005; Levitin et al. 2000; Spence
and Driver 2004; Spence and Squire 2003). Here, we review several key issues about intersensory
timing. We start with a short overview of how intersensory timing is generally measured, and then
discuss several factors that affect the point of subjective simultaneity and sensitivity. In the sections
that follow, we address several ways in which the brain might deal with naturally occurring lags
between the senses.

9.2 MEASURING INTERSENSORY SYNCHRONY: TEMPORAL ORDER


JUDGMENT TASK AND SIMULTANEITY JUDGMENT TASK
Before examining some of the basic findings, we first devote a few words to how intersensory
synchrony is usually measured. There are two classic tasks that have been used most of the time in
the literature. In both tasks, observers are asked to judge—in a direct way—the relative timing of
two stimuli from different modalities: the temporal order judgment (TOJ) task and the simultaneity
judgment (SJ) task. In the TOJ task, stimuli are presented in different modalities at various stimulus
onset asynchronies (SOA; Dixon and Spitz 1980; Hirsh and Sherrick 1961; Sternberg and Knoll
1973), and observers may judge which stimulus came first or which came second. In an audiovisual
TOJ task, participants may thus respond with “sound-first” or “light-first.” If the percentage of
“sound-first” responses is plotted as a function of the SOA, one usually obtains an S-shaped logistic
psychometric curve. From this curve, one can derive two measures: the 50% crossover point, and
the steepness of the curve at the 50% point. The 50% crossover point is the SOA at which observers
were—presumably—maximally unsure about temporal order. In general, this is called the “point of
subjective simultaneity” (PSS) and it is assumed that at this SOA, the information from the differ-
ent modalities is perceived as being maximally simultaneous. The second measure—the steepness
at the crossover point—reflects the observers’ sensitivity to temporal asynchronies. The steepness
can also be expressed in terms of the just noticeable difference (JND; half the difference in SOA
between the 25% and 75% points), and it represents the smallest interval observers can reliably
notice. A steep psychometric curve thus implies a small JND and good sensitivity, as observ-
ers are able to detect small asynchronies (see Figure 9.1).
The second task that has been used often is the SJ task. Here, stimuli are also presented at
various SOAs, but rather than judging which stimulus came first, observers now judge whether
the stimuli were presented simultaneously or not. In the SJ task, one usually obtains a bell-shaped
Gaussian curve if the percentage of “simultaneous” responses is plotted as a function of the SOA.
For the audiovisual case, the raw data are usually not mirror-symmetric, but skewed toward more
“simultaneous” responses on the “light-first” side of the axis. Once a curve is fitted on the raw data,
one can, as in the TOJ task, derive the PSS and the JND: the peak of the bell shape corresponds to
the PSS, and the width of the bell shape corresponds to the JND.
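To make these two measures concrete, the following Python sketch (the response proportions are simulated and the particular logistic parameterization is our own choice, not taken from any of the studies cited here) fits an S-shaped curve to TOJ data and reads off the PSS and the JND as defined above.

import numpy as np
from scipy.optimize import curve_fit

def logistic(soa, pss, slope):
    """Proportion of 'vision-first' responses as a function of SOA (ms)."""
    return 1.0 / (1.0 + np.exp(-(soa - pss) / slope))

# Simulated TOJ data: SOAs in ms (negative = auditory first) and the observed
# proportion of 'vision-first' responses at each SOA.
soas = np.array([-80, -60, -40, -20, 20, 40, 60, 80], dtype=float)
p_vfirst = np.array([0.03, 0.08, 0.20, 0.42, 0.71, 0.88, 0.95, 0.99])

(pss, slope), _ = curve_fit(logistic, soas, p_vfirst, p0=[0.0, 20.0])

# JND = half the SOA difference between the 25% and 75% points of the fitted curve
jnd = slope * np.log(3.0)
print(f"PSS = {pss:.1f} ms, JND = {jnd:.1f} ms")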
The TOJ and SJ tasks have, in general, been used more or less interchangeably, despite the fact
that comparative studies have found differences in performance measures derived from both tasks.
Possibly, it reflects that judgments about simultaneity and temporal order are based on different
sources of information (Hirsh and Fraisse 1964; Mitrani et al. 1986; Schneider and Bavelier 2003;
Zampini et al. 2003a). As an example, van Eijk et al. (2008) examined task effects on the PSS.
They presented observers with a sound and a light, or a bouncing ball and an impact sound, at various
SOAs, and had them perform three tasks: an audiovisual TOJ task (“sound-first” or “light-first”
responses required), an SJ task with two response categories (SJ2; “synchronous” or “asynchro-
nous” responses required), and an SJ task with three response categories (SJ3; “sound-first,” “syn-
chronous,” or “light-first” responses required). Results from both stimulus types showed that the
individual PSS values for the two SJ tasks correlated well, but there was no correlation between the
TOJ and SJ tasks. This made the authors conclude, arguably, that the SJ task should be preferred
over the TOJ task if one wants to measure perception of audiovisual synchrony.

[Figure 9.1: schematic psychometric functions for the SJ task (“synchronous or asynchronous?”) and
the TOJ task (“sound or light first?”), with the percentage of “synchronous” or “V-first” responses
plotted against stimulus onset asynchrony (in ms) and the PSS and JND indicated.]

FIGURE 9.1  S-shaped curve that is typically obtained for a TOJ task and a bell-shaped curve typically
obtained in a simultaneity judgment (SJ) task. Stimuli from different modalities are presented at varying
SOAs, ranging from clear auditory-first (A-first) to clear vision-first (V-first). In a TOJ task, the participant’s
task is to judge which stimulus comes first, sound or light, whereas in an SJ task, subjects judge whether
stimuli are synchronous or not. The PSS represents the interval at which information from different
modalities is perceived as being maximally simultaneous (~0 ms). In an SJ task, this is the point at which
the most synchronous responses are given; in a TOJ task, it is the point at which 50% of responses are
vision-first and 50% auditory-first. The JND represents the smallest interval observers can reliably notice
(in this example ~27 ms). In an SJ task, this is the average interval (for A-first and V-first) at which a
participant responds with 75% synchronous responses. In a TOJ task, it is half the difference between the
SOAs at the 25% and 75% points.

In our view, there is no straightforward solution about how to measure the PSS or JND for
intersensory timing because the tasks are subject to different kinds of response biases (see Schneider
and Bavelier 2003; Van Eijk et al. 2008; Vatakis et al. 2007, 2008b for discussion). In the TOJ task,
in which only temporal order responses can be given (“sound-first” or “light-first”), observers may
be inclined to adopt the assumption that stimuli are never simultaneous, which thus may result in
rather low JNDs. On the other hand, in the SJ task, observers may be inclined to assume that stimuli
actually belong together because the “synchronous” response category is available. Depending on
criterion settings, this may result in many “synchronous” responses, and thus a wide bell-shaped
curve, which will lead to the invalid conclusion that sensitivity is poor.
In practice, both the SJ and TOJ task will have their limits. The SJ2 task suffers heavily from the
fact that observers have to adopt a criterion about what counts as “simultaneous/nonsimultaneous.”
And in the SJ3 task, the participant has to dissociate sound-first stimuli from synchronous ones, and
light-first stimuli from synchronous ones. Hence, in the SJ3 task there are two criteria: a “sound-first/
simultaneous” criterion, and a “light-first/simultaneous” criterion. If observers change, for whatever
reason, their criterion (or criteria) over the course of the experiment or between experimental manipulations, it
changes the width of the curve and the corresponding JND. If sensitivity is the critical measure, one
should thus be careful using the SJ task because JNDs depend heavily on these criterion settings.
A different critique can be applied to the TOJ task. Here, the assumption is made that observ-
ers respond at about 50% for each of the two response alternatives when maximally unsure about
temporal order. In practice, however, participants may adopt a different strategy and respond, for
example, “sound-first” (and others may, for arbitrary reasons, respond “light-first”) whenever unsure
about temporal order. Such a response bias will shift the derived 50% point toward one side of the
continuum or the other, and the 50% point will then not be a good measure of the PSS, the point at
which simultaneity is supposed to be maximal. If performance of an individual observer on an SJ
task is compared with a TOJ task, it should thus not come as too big of a surprise that the PSS and
JND derived from both tasks do not converge.

9.3  POINT OF SUBJECTIVE SIMULTANEITY


The naïve reader might think that stimuli from different modalities are perceived as being maxi-
mally simultaneous if they are presented the way nature presents them, that is, synchronously, at 0 ms
SOA. Surprisingly, most of the time this is not the case. For audiovisual stimuli, the PSS
is usually shifted toward a visual lead, so perceived simultaneity is maximal if vision
comes slightly before sounds (e.g., Kayser et al. 2008; Lewald and Guski 2003; Lewkowicz 1996;
Slutsky and Recanzone 2001; Zampini et al. 2003a, 2005b, 2005c). This bias was found in a clas-
sic study by Dixon and Spitz (1980). Here, participants monitored continuous videos consisting of
an audiovisual speech stream or an object event consisting of a hammer hitting a peg. The videos
started off in synchrony and were then gradually desynchronized at a constant rate of 51 ms/s up to
a maximum asynchrony of 500 ms. Observers were instructed to respond as soon as they noticed
the asynchrony. They were better at detecting the audiovisual asynchrony if the sound preceded the
video rather than if the video preceded the sound (131 vs. 258 ms thresholds for speech, and 75 vs.
188 ms thresholds for the hammer, respectively). PSS values also pointed in the same direction, as
simultaneity was maximal when the video preceded the audio by 120 ms for speech, and by 103 ms
for the hammer. Many other studies have reported this vision-first PSS (Dinnerstein and Zlotogura
1968; Hirsh and Fraisse 1964; Jaskowski et al. 1990; Keetels and Vroomen 2005; Spence et al. 2003;
Vatakis and Spence 2006a; Zampini et al. 2003a), although some also reported opposite results
(Bald et al. 1942; Rutschmann and Link 1964; Teatini et al. 1976; Vroomen et al. 2004). There have
been many speculations about the underlying reason for this overall visual–lead asymmetry, the
main one being that observers are tuned toward the natural situation in which lights arrive before
sounds on the sense organs (King and Palmer 1985). There will then be a preference for vision to
have a head start over sound so as to be perceived as simultaneous.
Besides this possibility, though, there are many other reasons why the PSS can differ quite substan-
tially from 0 ms SOA. To point out just a few: the PSS depends, among other factors, on stimulus intensity (more intense stimuli are processed faster or come to consciousness more quickly; Jaskowski 1999; Neumann and Niepel 2004; Roefs 1963; Sanford 1971; Smith 1933), stimulus duration (Boenke et
al. 2009), the nature of the response that participants have to make (e.g., “Which stimulus came
first?” vs. “Which stimulus came second?”; see Frey 1990; Shore et al. 2001), individual differ-
ences (Boenke et al. 2009; Mollon and Perkins 1996; Stone et al. 2001), and the modality to which
attention is directed (Mattes and Ulrich 1998; Schneider and Bavelier 2003; Shore et al. 2001, 2005;
Stelmach and Herdman 1991; Zampini et al. 2005c). We do not intend to list all the factors known
thus far, but we only pick out the one that has been particularly important in theorizing about per-
ception in general, that is, the role of attention.

9.3.1  Attention Affecting PSS: Prior Entry


A vexing issue in experimental psychology is the idea that attention speeds up sensory process-
ing. Titchener (1908) termed it the “law of prior entry,” implying that attended objects come to
consciousness more quickly than unattended ones. Many of the old studies on prior entry suffered from the fact that their results might simply reflect response biases (see Schneider and Bavelier 2003; Shore
et al. 2001; Spence et al. 2001; Zampini et al. 2005c for discussions on the role of response bias in
prior entry). As an example, observers may, whenever unsure, just respond that the attended stimu-
lus was presented first without really having that impression. This strategy would reflect a change
in decision criterion rather than a low-level sensory interaction between attention and the attended
target stimulus. To disentangle response biases from truly perceptual effects, Spence et al. (2001)
performed a series of important TOJ experiments in which visual–tactile, visual–visual, or tactile–
tactile stimulus pairs were presented from the left or right of fixation. The focus of attention was
directed toward either the visual or tactile modality by varying the probability of each stimulus
modality (e.g., in the attend–touch condition, there were 50% tactile–tactile pairs, 0% visual–visual,
and 50% critical tactile–visual pairs). Participants had to indicate whether the left or right stimulus
was presented first. The idea tested was that attention to one sensory modality would speed up
perception of stimuli in that modality, thus resulting in a change of the PSS (see also Mattes and
Ulrich 1998; Schneider and Bavelier 2003; Shore et al. 2001, 2005; Stelmach and Herdman 1991;
Zampini et al. 2005c). Their results indeed supported this notion: when attention was directed
to touch, visual stimuli had to lead by much greater intervals (155 ms) than when attention was
directed to vision (22 ms) for them to be perceived as simultaneous. Additional experiments demon-
strated that attending to one side (left or right) also speeded perception of stimuli presented at that
side. Therefore, both spatial attention and attention to modality were effective in shifting the PSS,
presumably because they speeded up perceptual processes. To minimize the contribution of any
simple response bias on the PSS, Spence et al. (2001) performed these experiments in which atten-
tion was manipulated in a dimension (modality or side) that was orthogonal to that of responding
(side or modality, respectively). Thus, while attending to vision or touch, participants had to judge
which side came first; and while attending to the left or right, participants judged which modality
came first. The authors reported similar shifts of the PSS in these different tasks, thus favoring a
perceptual basis for prior entry.
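For illustration, a trial list of the kind described above can be sketched in a few lines. The sketch below (our own construction, not the authors' code) builds a hypothetical attend–touch block in which attention to modality is induced purely through stimulus probabilities (50% tactile–tactile pairs, no visual–visual pairs, and 50% critical tactile–visual pairs), while the response dimension is which side came first; the SOA values and the sign convention are assumptions.

    import random

    def make_attend_touch_block(n_trials=100, soas_ms=(-60, -30, -10, 10, 30, 60)):
        # Hypothetical "attend touch" block: modality is made relevant only via stimulus
        # probabilities, while participants respond about side (left or right first).
        trials = []
        for _ in range(n_trials):
            # 50% tactile-tactile filler pairs, 50% critical tactile-visual pairs, 0% visual-visual.
            pair = random.choice([("tactile", "tactile"), ("tactile", "visual")])
            soa = random.choice(soas_ms)                 # assumed convention: negative = left leads
            first_side = "left" if soa < 0 else "right"
            trials.append({"pair": pair, "soa_ms": soa, "correct_side": first_side})
        random.shuffle(trials)
        return trials

    block = make_attend_touch_block()
    print(block[0])

Shifts of the PSS under such a manipulation can then be read off psychometric functions fitted separately for each attention condition, as in the fitting sketch given earlier.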
Besides such behavioral data, there is also extensive electrophysiological support for the idea
that attention affects perceptual processing. Very briefly, in the electroencephalogram (EEG) one
can measure the event-related response (ERP) of stimuli that were either attended or unattended.
Naïvely speaking, if attention speeds up stimulus processing, one would expect ERPs of attended
stimuli to be faster than those of unattended ones. In a seminal study by Hillyard and Munte (1984), participants were presented with a stream of brief flashes and tones to the left or right of fixation. The
participant’s task was to attend either the auditory or visual modality, and to respond to infrequent
targets in that modality at an attended location (e.g., respond to a slightly longer tone on the left).
The attended modality was constant during the experiment (but varied between subjects), and the
relevant location was specified at the beginning of each block of trials. The authors found enhanced
negativity in the ERP for stimuli at attended locations if compared to nonattended locations. The
negativity started at about 150 ms poststimulus for visual stimuli and at about 100 ms for auditory
stimuli. Evidence for a cross-modal link in spatial attention was also found, as the enhancement (although smaller) was also present for stimuli at the attended location in the unattended modality
(see also Spence and Driver 1996; Spence et al. 2000 for behavioral results). Since then, analogous
results have been found by many others. For example, Eimer and Schröger (1998) found similar
results using a different design in which the side of the attended location varied from trial to trial.
Again, their results demonstrated enhanced negativities (between 160 and 280 ms after stimulus
onset) for attended locations as compared to unattended locations, and the effect was again bigger
for the relevant rather than irrelevant modality.
The critical issue for the idea of prior entry is whether these ERP effects also indicate that attended
stimuli are processed faster. In most EEG studies, attention affects the amplitude of the ERP rather
than its speed (for a review, see Eimer and Driver 2001). The problem is that there are many other interpretations of an amplitude modulation besides increased processing speed (e.g., less smearing
of the EEG signal over trials if attended). A shift in the latencies of the ERP would have been easier
to interpret in terms of increased processing speed, but the problem is that even if a latency shift
in the ERP is obtained, it is usually small if compared to the behavioral data. As an example, in an
ERP study by Vibell et al. (2007), attention was directed toward the visual or tactile modality in a
visual–tactile TOJ task. Results showed that the peak latency of the visual evoked potentials (P1
and N1) was earlier when attention was directed to vision (P1 = 147 ms, and N1 = 198 ms) rather
than when directed to touch (P1 = 151 ms, and N1 = 201 ms). This shift in the P1 may be taken as
evidence that attention indeed speeds up perception in the attended modality, but it should also be
noted that the 4-ms shift in the ERP is of a quite different order of magnitude than the 38-ms shift of the PSS in the behavioral data, or the 133-ms shift reported by Spence et al. (2001) in a similar
study. In conclusion, there is both behavioral and electrophysiological support for the idea that atten-
tion speeds up perceptual processing, but the underlying neural mechanisms remain, for the time
being, elusive.

9.4  SENSITIVITY FOR INTERSENSORY ASYNCHRONY


Besides the point at which simultaneity is perceived to be maximal (the PSS), the second measure
that one can derive from the TOJ and SJ task—but which is unfortunately not always reported—is
the observers’ sensitivity to timing differences, the JND. The sensitivity to intersensory timing
differences is not only of interest for theoretical reasons, but is also of practical importance, for example, in video broadcasting and multimedia Internet applications, where standards are required for allowable audio and video delays (Finger and Davis 2001; Mortlock et al. 1997; Rihs 1995). One of the clas-
sic studies on sensitivity for intersensory synchrony was done by Hirsh and Sherrick (1961). They
presented audio–visual, visual–tactile, and audio–tactile stimuli in a TOJ task and reported JNDs
to be approximately 20 ms regardless of the modalities used. More recent studies, however, have found substantially larger JNDs and bigger differences between the sensory modalities. For simple
cross-modal stimuli such as auditory beeps and visual flashes, JNDs have been reported in the order
of approximately 25 to 50 ms (Keetels and Vroomen 2005; Zampini et al. 2003a, 2005b), but for
audio–tactile pairs, Zampini et al. (2005a) obtained JNDs of about 80 ms, and for visual–tactile
pairs, JNDs have been found in the order of 35 to 65 ms (Keetels and Vroomen 2008b; Spence et
al. 2001). More importantly, JNDs are not constant, but have been shown to depend on various other factors, such as the spatial separation between the components of the stimuli, stimulus complexity, whether the stimulus is speech or not, and, more controversially, semantic congruency. Some of these factors will be described below.

9.4.1  Spatial Disparity Affects JND


A factor that has been shown to affect sensitivity for intersensory timing is the spatial separation
between the components of a stimulus pair. Typically, sensitivity for temporal order improves if
the components of the cross-modal stimuli are spatially separated (i.e., lower JNDs; Bertelson
and Aschersleben 2003; Spence et al. 2003; Zampini et al. 2003a, 2003b, 2005b). Bertelson and
Aschersleben, for example, reported audiovisual JNDs to be lower when a beep and a flash were
presented from different locations rather than from a common and central location. Zampini et al.
(2003b) qualified these findings and observed that sensitivity in an audiovisual TOJ task improved if
the sounds and lights were presented from different locations, but only if they were presented to the left and right of the median plane (at 24°). No effect of separation was found for vertically separated stimuli. This
made Zampini et al. conclude that the critical factor for the TOJ improvement was that the individual
components of an audiovisual stimulus were presented in different hemifields. Keetels and Vroomen
(2005), though, examined this notion and varied the (horizontal) size of the spatial disparity. Their
results showed that JNDs also improved when spatial disparity was large rather than small, even if
stimuli did not cross hemifields. Audiovisual JNDs thus depend on both the relative position from
which stimuli are presented and on whether hemifields are crossed or not. Spence et al. (2001) further
demonstrated that sensitivity improves for spatially separated visual–tactile stimulus pairs, although
no such effect was found for audio–tactile pairs (Zampini et al. 2005a). In blind people, on the other
hand, audio–tactile temporal sensitivity was found to be affected by spatial separation (Occelli et al.
2008) and similar spatial modulation effects were demonstrated in rear space (Kitagawa 2005).
What is the underlying reason that sensitivity to temporal differences improves if the sources are
spatially separated? Or, why does the brain fail to notice temporal intervals when stimuli come
from a single location? Two accounts have been proposed (Spence et al. 2003). First, it has been
suggested that intersensory pairing impairs sensitivity for temporal order. The idea underlying
“intersensory pairing” is that the brain has a list of criteria on which it decides whether information
from different modalities belong together or not. Commonality in time is, without a doubt, a very
important criterion, but there may be others like commonality in space, association based on co-
occurrence, or semantic congruency. Stimuli from the same location may, for this reason, be more
likely paired into a single multimodal event if compared to stimuli presented far apart (see Radeau
1994). Any such tendency to pair stimuli could then cause the relative temporal order of the components to be lost, thereby worsening temporal sensitivity in TOJ or SJ tasks.
In contrast with this notion, many cross-modal effects occur despite spatial discordance, and
there are reasons to argue that spatial congruency may not be an important criterion for intersensory
pairing (Bertelson 1994; Colin et al. 2001; Jones and Munhall 1997; Keetels et al. 2007; Keetels
and Vroomen 2007, 2008a; Stein et al. 1996; Teder-Salejarvi et al. 2005; Vroomen and Keetels
2006). But why, then, does sensitivity for temporal order improve with spatially separated stimuli
if not because intersensory pairing is impeded? A second reason why JNDs may improve is that
of spatial redundancy. Whenever multisensory information is presented from different locations,
observers actually have extra spatial information on which to base their response. That is, observ-
ers may initially not know which modality had been presented first, but still know on which side
the first stimulus appeared, and they may then infer which modality had been presented first. As an
example, in an audiovisual TOJ task, an observer may have noticed that the first stimulus came from
the left (possibly because attention was captured by the first stimulus toward that side). They may
also remember that the light was presented on the right. By inference, then, the sound must have
been presented first. Sensitivity for temporal order for spatially separated stimuli then improves
because there are extra spatial cues that are not present for colocated stimuli.

9.4.2  Stimulus Complexity Affects JND


Many studies exploring temporal sensitivity have used relatively simple stimuli such as flashes
and beeps that have a single and rather sharp transient onset. However, in real-world situations,
the brain has to deal with much more complex stimuli that often have complicated variations in
temporal structure over time (e.g., seeing and hearing someone speaking; or seeing, hearing, and
touching the keys on a computer keyboard). How does the brain notice timing differences between
these more complicated and dynamic stimuli? Theoretically, one might expect that more complex
stimuli also provide a richer base on which to judge temporal order. Audiovisual speech would be
the example “par excellence” because it is rich in content and fluctuates over time. In fact, however, several studies have found the opposite: in particular for audiovisual speech, the “temporal window” within which the auditory and visual streams are perceived as synchronous is rather wide
(Conrey and Pisoni 2006; Dixon and Spitz 1980; Jones and Jarick 2006; Stekelenburg and Vroomen
2007; a series of studies by Vatakis and Spence 2006a; Vatakis, Ghanzanfar and Spence 2008a; van
Wassenhove et al. 2007). For example, in a study by van Wassenhove et al. (2007), observers judged
in an SJ task whether congruent audiovisual speech stimuli and incongruent McGurk-like speech
stimuli* (McGurk and MacDonald 1976) were synchronous or not. The authors found a temporal
window of 203 ms for the congruent pairs (ranging from −76 ms sound-first to +127 ms vision-first,
with PSS at 26 ms vision-first) and a 159 ms window for the incongruent pairs (ranging from –40 to
+119 ms, with PSS at 40 ms vision-first). These windows are rather wide if compared to the much
smaller windows found for simple flashes and beeps (mostly below 50 ms; Hirsh and Sherrick 1961;
Keetels and Vroomen 2005; Zampini et al. 2003a, 2005b). The relatively wide temporal window
for complex stimuli has also been demonstrated by indirect tests. For example, the McGurk effect
was found to diminish if the auditory and visual information streams were out of sync, but this only
occurred at rather long intervals (comparable with the ones found in SJ tasks; Grant et al. 2004;
Massaro et al. 1996; McGrath and Summerfield 1985; Munhall et al. 1996; Pandey et al. 1986;
Tanaka et al. 2009b; van Wassenhove et al. 2007).
There have been several recent attempts to compare sensitivity for intersensory timing in audio-
visual speech with other audiovisual events such as music (guitar and piano) and object actions
(e.g., smashing a television set with a hammer, or hitting a soda can with a block of wood; Vatakis
and Spence 2006a, 2006b). Observers made TOJs about which stream (auditory or visual) appeared
first. Overall, results showed better temporal sensitivity for audiovisual stimuli of “lower complexity” in comparison with stimuli having continuously varying properties (i.e., syllables vs. words
and/or sentences). Similar findings were reported by Stekelenburg and Vroomen (2007), who com-
pared JNDs for audiovisual speech (pronunciation of the syllable /bi/) with those for natural nonspeech
events (a video of a handclap) in a TOJ task. Again, JNDs were much better for the nonspeech events
(64 ms) than for speech (105 ms).
On the basis of these findings, some have concluded that “speech is special” (van Wassenhove
et al. 2007; Vatakis et al. 2008a) or that when “stimulus complexity” increases, sensitivity for tem-
poral order deteriorates (Vatakis and Spence 2006a). In our view, however, these proposals do not really clarify the issue, because the notions of “speech is special” and “stimulus complexity” are both ill-defined and, most likely, confounded with other stimulus factors that can
be described more clearly. As an example, it is known that the rate at which stimuli are presented
affects audiovisual JNDs for intersensory timing (Benjamins et al. 2008; Fujisaki and Nishida
2005). Sensitivity may also be affected by whether there is anticipatory information that predicts
the onset of an audiovisual event (Stekelenburg and Vroomen 2007; Van Eijk 2008; Vroomen and Stekelenburg 2009), and by whether there is a sharp transition that can serve as a temporal anchor (Fujisaki and Nishida 2005). Each of these stimulus characteristics, and likely many others, needs to be controlled if one wants to compare across stimuli in a nonarbitrary way. Below, we address some of these factors.

* In the McGurk illusion (McGurk and MacDonald 1976), it is shown that the perception of nonambiguous speech tokens can be modified by the simultaneous presentation of visually incongruent articulatory gestures. Typically, when presented with an auditory syllable /ba/ dubbed onto a face articulating /ga/, participants report hearing /da/. The occurrence of this so-called McGurk effect has been taken as a particularly powerful demonstration of the use of visual information in speech perception.

9.4.3  Stimulus Rate Affects JND


It has been demonstrated that perception of intersensory synchrony breaks down if stimuli are
presented at a temporal frequency above ~4 Hz. This is very slow compared to unimodal
visual or auditory sensitivity for temporal coherence. Fujisaki and Nishida (2005) examined this
using audiovisual stimuli consisting of a luminance-modulated Gaussian blob and an amplitude-
modulated white noise presented at various rates. They demonstrated that synchrony–asynchrony
discrimination for temporally dense random pulse trains became nearly impossible at temporal fre-
quencies larger than 4 Hz, even when the audiovisual interval was large enough for discrimination
of single pulses (the discrimination thresholds were 75, 81, and 119 ms for single pulses and for 2- and 4-Hz repetitive stimuli, respectively). This 4-Hz boundary was also reported by Benjamins et al. (2008).
They explored the temporal limit of audiovisual integration using a visual stimulus that alternated
in color (red or green) and a sound that alternated in frequency (high or low). Observers had to
indicate which sound (high or low) accompanied the red disk. Their results demonstrated that at
rates of 4.2 Hz and higher, observers were no longer able to match the visual and auditory stimuli
across modalities (the proportion of correct matches dropped from 0.9 at 1.9 Hz to 0.5 at 4.2 Hz). Further
experiments also demonstrated that manipulating other temporal stimulus characteristics such as
the stimulus offsets and/or audiovisual SOAs did not change the 4-Hz threshold. Here, it should be
mentioned that the 4-Hz rate is also the approximate rate with which syllables are spoken in con-
tinuous speech, and temporal order in audiovisual speech might thus be difficult simply because
stimulus presentation is too fast, and not because speech is special.*
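For concreteness, the kind of rate-modulated audiovisual stimulus used to probe this temporal limit can be sketched as follows. The code below loosely follows the description of Fujisaki and Nishida's (2005) luminance-modulated blob and amplitude-modulated noise, but the sampling rate, the sinusoidal envelope, and all names are our assumptions rather than their actual stimulus code.

    import numpy as np

    def modulated_av_pair(rate_hz, lag_ms, duration_s=2.0, fs=1000):
        # Sketch of a rate-modulated audiovisual pair: a shared envelope drives the visual
        # luminance directly and the amplitude of an auditory noise carrier, with the
        # auditory stream shifted (circularly) by lag_ms relative to the visual stream.
        t = np.arange(0, duration_s, 1.0 / fs)
        envelope = 0.5 * (1 + np.sin(2 * np.pi * rate_hz * t))   # modulation between 0 and 1
        luminance = envelope                                      # visual signal
        lag_samples = int(round(lag_ms * fs / 1000.0))
        audio = np.roll(envelope, lag_samples) * np.random.randn(len(t))  # amplitude-modulated noise
        return t, luminance, audio

    # Above roughly 4 Hz, synchrony-asynchrony discrimination of such streams reportedly
    # becomes very hard, even for lags that are easy to judge with single pulses.
    t, lum, aud = modulated_av_pair(rate_hz=4.0, lag_ms=100)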

9.4.4  Predictability Affects JND


Another factor that may play a role in intersensory synchrony judgments, but one that has not yet
been studied extensively, is the extent to which (one of the components of) a multisensory event
can be predicted. As an example, for many natural events—such as the clapping of hands—vision
provides predictive information about when a sound is to occur, as there is visual anticipatory
information about sound onset. Stimuli with predictive information allow observers to make a clear
prediction about when a sound is to occur, and this might improve sensitivity for temporal order.
A study by van Eijk et al. (2008, Chapter 4) is of relevance here. They explored the effect of visual
predictive information (or, as the authors called it, “apparent causality”) on perceived audiovisual synchrony. Visual predictive information was made either present or absent by showing all or only part of a Newton’s cradle toy (i.e., a ball that appears to fall from a suspended position on the left of the
display, strikes the leftmost of four contiguous balls, and then launches the rightmost ball into an
arc motion away from the other balls). The collision of the balls was accompanied by a sound that
varied around the time of the impact. The predictability of the sound was varied by showing either
the left side of the display (motion followed by a collision and sound so that visual motion predicted
sound occurrence) or the right side of the display (a sound followed by visual motion; so no predict-
able information about sound onset). In line with the argument made here, the authors reported better temporal sensitivity if visual predictive information about sound onset was available (the left display) rather than if it was absent (the right display).

* It has also been reported that the presentation rate may shift the PSS. In a study by Arrighi et al. (2006), participants were presented a video of hands drumming on a conga at various rates (1, 2, and 4 Hz). Observers were asked to judge whether the auditory and visual streams appeared to be synchronous or not (an SJ task). Results showed that the auditory delay for maximum simultaneity (the PSS) varied inversely with drumming tempo, from about 80 ms at 1 Hz and 60 ms at 2 Hz to 40 ms at 4 Hz. Video sequences of random drumming motion, and of a disk moving along the motion profile of the drummer’s hands, produced similar results, with higher tempos requiring less auditory delay.

9.4.5  Does Intersensory Pairing Affect JND?


A more controversial issue in the literature on intersensory timing is the extent to which information
from different modalities is treated by the brain as belonging to the same event. Some have grouped it under the already mentioned notion of “intersensory pairing,” others under the “unity assump-
tion” (Welch and Warren 1980). The idea is that observers find it difficult to judge temporal order
if the information streams naturally belong together, for reasons other than temporal coincidence,
because there is then more intersensory integration, in which case temporal order is lost. Several
studies have examined this issue but with varying outcomes. In a study by Vatakis and Spence
(2007), participants judged the temporal order of audiovisual speech stimuli that varied in gender
and phonemic congruency. Face and voice congruency could vary in gender (a female face articulat-
ing /pi/ with a sound of either a female or male /pi/), or phonemic content (a face saying /ba/ with a
voice saying /ba/ or /da/). In support of the unity assumption, results showed that for both the gender
and phonemic congruency manipulation, sensitivity for temporal order improved if the auditory
and visual streams were incongruent rather than congruent. In a recent study, Vatakis et al. (2008a)
qualified these findings and reported that this effect may be specific for human speech. In this study,
the effect of congruency was examined using matching or mismatching call types of monkeys
(“cooing” vs. “grunt” or threat calls). For audiovisual speech, sensitivity for temporal order was again better for the incongruent than for the congruent trials, but there was no congruency effect for
the monkey calls. In another study, Vatakis and Spence (2008) also found no congruency effect for
audiovisual music and object events that either matched (e.g., the sight of a note being played on a
piano together with the corresponding sound, or the video of a bouncing ball with a correspond-
ing sound) or mismatched. At this stage, it therefore appears that the “unity assumption” may only
apply to audiovisual speech. It leaves one to wonder, though, whether this effect is best explained
in terms of the “special” nature of audiovisual speech, or whether other factors are at play (e.g., the
high level of exposure to speech stimuli in daily life, the possibly more attention-grabbing nature of
speech stimuli, or the specific low-level acoustic stimulus features of speech; Vatakis et al. 2008a).

9.5  HOW THE BRAIN DEALS WITH LAGS BETWEEN THE SENSES
In any multisensory environment, the brain has to deal with lags in arrival and processing time
between the different senses. Surprisingly, though, despite these lags, temporal coherence is usually maintained, and only in exceptional circumstances, such as thunder that is heard after the lightning, is a single multisensory event perceived as separated in time. This raises the question of
how temporal coherence is maintained. In our view, at least four options are available: (1) the brain
might be insensitive to small lags, or it could just ignore them (a window of temporal integration);
(2) the brain might be “intelligent” and bring deeply rooted knowledge about the external world into
play that allows it to compensate for various external factors; (3) the brain might be flexible and shift
its criterion about synchrony in an adaptive fashion (recalibration); or (4) the brain might actively
shift the time at which one information stream is perceived to occur toward the other (temporal
ventriloquism). Below, we discuss each of these notions. It should be noted beforehand that these options are not mutually exclusive.

9.5.1  Window of Temporal Integration


The first notion, that the brain is rather insensitive to lags, comes close to the idea that there is a
“window of temporal integration.” Any information that falls within this hypothetical window is
potentially assigned to the same external event, and streams within the window are then treated as having occurred simultaneously (see Figure 9.2, panel 1). Many have alluded to this concept, but what
is less satisfying about it is that it is basically a description rather than an explanation. To make this
point clear, some have reported that the temporal window for audiovisual speech can be quite large
because it can range from approximately 40 ms audio-first to 240 ms vision-first. However, sensitiv-
ity for intersensory asynchronies (JND) is usually much smaller than the size of this window. For
example, Munhall et al. (1996) demonstrated that exact temporal coincidence between the auditory
and visual parts of audiovisual speech stimuli is not a very strict constraint on the McGurk effect
(McGurk and MacDonald 1976). Their results demonstrated that the McGurk effect was biggest
when vowels were synchronized (see also McGrath and Summerfield 1985), but the effect survived
even if audition lagged vision by 180 ms (see also Soto-Faraco and Alsius 2007, 2009; these studies show that participants can still perceive a McGurk effect when they can quite reliably perform TOJs).

[Figure 9.2 near here. The panels illustrate (1) a wide window of temporal integration, (2) compensation for auditory delays caused by sound distance, (3) adaptation to intersensory asynchrony via (a) adjustment of the criterion, (b) widening of the window, or (c) adjustment of the sensory threshold, and (4) temporal ventriloquism, in which the perceived visual onset time is shifted toward audition; the schematic symbols mark air travel time, neural processing time, actual stimulus onset time, the window of integration, and perceived temporal occurrence along a time axis.]

FIGURE 9.2  Synchrony can be perceived despite lags. How is this accomplished? Four possible mechanisms are depicted for audiovisual stimuli like a flash and beep. Similar mechanisms might apply for other stimuli and other modality pairings. Time is represented on the x-axis, and accumulation of sensory evidence on the y-axis. A stimulus is time-stamped once it surpasses a sensory threshold. Stimuli in audition and vision are perceived as being synchronous if they occur within a certain time window. (1) The brain might be insensitive to naturally occurring lags because the window of temporal integration is rather wide. (2) The brain might compensate for predictable variability—here, sound distance—by adjusting the perceived occurrence of a sound in accordance with sound travel time. (3) Temporal recalibration. Three different mechanisms might underlie adaptation to asynchrony: (a) a shift in the criterion about synchrony for the adapted stimuli or modalities, (b) a widening of the temporal window for the adapted stimuli or modalities, and (c) a change in the threshold of sensory detection (when did the stimulus occur?) within one of the adapted modalities. (4) Temporal ventriloquism: a visual event is actively shifted toward an auditory event.

Outside the speech domain, similar findings have been reported. In a study by Shimojo et al.
(2001), the role of temporal synchrony was examined using the streaming–bouncing illusion (i.e.,
two identical visual targets that move across each other and are normally perceived as a streaming
motion are typically perceived to bounce when a brief sound is presented at the moment that the
visual targets coincide; Sekuler et al. 1997). The phenomenon is dependent on the timing of the
sound relative to the coincidence of the moving objects. Although it has been demonstrated that a
brief sound induced the visual bouncing percept most effectively when it was presented about 50 ms before the moving objects coincided, their data furthermore showed a rather large temporal window
of integration because intervals ranging from 250 ms before visual coincidence to 150 ms after
coincidence still induced the bouncing percept (see also Bertelson and Aschersleben 1998, for the
effect of temporal asynchrony on spatial ventriloquism; or Shams et al. 2002, for the illusory-flash
effect). All these intersensory effects thus occur at asynchronies that are much larger than JNDs
normally reported when directly exploring the effect of asynchrony using TOJ or SJ tasks (van
Wassenhove et al. 2007). One might argue that even though observers do notice small delays between the senses, the brain can still ignore them if doing so helps other purposes, such as understanding speech (Soto-Faraco and Alsius 2007, 2009). But the question then becomes why there is more than one window: one for understanding, and another for noticing timing differences.
Besides the width of the temporal window varying with the purpose of the task, it has also been
found to vary for different kinds of stimuli. As already mentioned, the temporal window is much
smaller for clicks and flashes than it is for audiovisual speech. However, why would the size be
different for different stimuli? Does the brain have a separate window for each stimulus and each
purpose? If so, we are left with explaining how and why it varies. Some have taken the concept of
a window quite literally, and have argued that “speech is special” because the window for audiovi-
sual speech is wide (van Wassenhove et al. 2007; Vatakis et al. 2008a). We would rather refrain from such speculations, however, and consider it more useful to examine what the critical features are that determine when perception of simultaneity becomes easy (a small window) or difficult (a large window). The size of the window is thus, in our view, the factor that needs to be explained rather than the explanation itself.
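To make explicit what a “window of temporal integration” amounts to computationally, the sketch below classifies an audiovisual lag as falling inside or outside an asymmetric window around synchrony. The example bounds (roughly 40 ms audio-first to 240 ms vision-first) are the audiovisual-speech values mentioned above; the rule itself is our own schematic illustration, not a model proposed by any of the studies cited.

    def within_integration_window(soa_ms, audio_lead_bound_ms=40, vision_lead_bound_ms=240):
        # Schematic window-of-integration rule; assumed convention:
        # negative SOA = audio first, positive SOA = vision first.
        return -audio_lead_bound_ms <= soa_ms <= vision_lead_bound_ms

    # The same lag can fall inside this (wide) integration window while still being
    # detectable as asynchronous in a TOJ or SJ task with a much smaller JND.
    for soa in (-80, -20, 0, 120, 260):
        print(soa, within_integration_window(soa))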

9.5.2  Compensation for External Factors


The second possibility—the intelligent brain that compensates for various delays—is a controver-
sial issue that has received support mainly from studies that examined whether observers take
distance into account when judging audiovisual synchrony (see Figure 9.2, panel 2). The relatively
slow transmission time of sounds through air causes natural differences in arrival time between sounds and lights. This implies that the farther away an audiovisual event is, the more the sound will lag the visual stimulus, although such a lag might be compensated for by the brain if the distance were known. The brain might then treat a lagging sound as being synchronous with a light, provided that the audiovisual event occurred at the right distance. Some have indeed reported that the brain does just that, as judgments about audiovisual synchrony were found to depend on perceived distance
(Alais and Carlile 2005; Engel and Dougherty 1971; Heron et al. 2007; Kopinska and Harris 2004).
Others, however, have failed to demonstrate compensation for distance (Arnold et al. 2005; Lewald and Guski 2004).
Sugita and Suzuki (2003) explored compensation for distance with an audiovisual TOJ task. The
visual stimuli were delivered by light-emitting diodes (LEDs) at distances ranging from 1 to 50 m in
free-field circumstances (their intensity, though not their size, was adjusted to compensate for distance). Of importance,
the sounds were delivered through headphones, and no attempt was made to equate the distance of
the sound with that of the light. Note that this, in essence, undermines the whole idea that the brain
compensates for lags of audiovisual events out in space. Nevertheless, PSS values were found to
shift with visual stimulus distance. When the visual stimulus was 1 m away, the PSS was at about a
~5 ms sound delay, and the delay increased when the LEDs were farther away. The increment was
consistent with the velocity of sounds up to a viewing distance of about 10 m, after which it leveled
off. This led the authors to conclude that lags between auditory and visual inputs are perceived as
synchronous not because the brain has a wide temporal window for audiovisual integration, but
because the brain actively changes the temporal location of the window depending on the distance
of the source.
Alais and Carlile (2005) came to similar conclusions, but with different stimuli. In their study,
auditory stimuli were presented over a loudspeaker and auditory distance was simulated by varying
the direct-to-reverberant energy ratio as a depth cue for sounds (Bronkhorst 1995; Bronkhorst and
Houtgast 1999). The near sounds simulated a depth of 5 m and had substantial amounts of direct
energy with a sharp transient onset; the far sounds simulated a depth of 40 m and did not have a
transient. The visual stimulus was a Gaussian blob on a computer screen in front of the observer
without variations in the distance. Note that, again, no attempt was made to equate auditory and
visual distance, thus again undermining the underlying notion. The effect of apparent auditory dis-
tance on temporal alignment with the blob on the screen was measured in a TOJ task. The authors
found compensation for depth: the PSS in the audiovisual TOJ task shifted with the apparent distance of the sound in accordance with the speed of sound through air up to 40 m. On closer inspection of their data, however, it is clear that the shift in the PSS was mainly caused by the fact that sensitivity for intersensory synchrony became increasingly worse for more distant sounds. Judging from their figures, sensitivity for nearby sounds at 5 m was in the normal range, but for the most distant sound, sensitivity was extremely poor: it never reached plateau, and even at a sound delay of 200 ms, 25% of the responses were still “auditory-first” (see also Arnold et al. 2005; Lewald and Guski 2004). This suggests that observers, while performing the audiovisual TOJ task, could not use the onset of the far sound as a cue for temporal order, possibly because it lacked a sharp transient, so that they had to rely on other cues instead. Besides these controversies over stimuli and data, there are oth-
ers who simply failed to observe compensation for distance (Arnold et al. 2005; Heron et al. 2007;
Lewald and Guski 2004; Stone et al. 2001). For example, Stone et al. (2001) used an audiovisual SJ
task and varied stimulus–observer distances from 0.5 m in the near condition to 3.5 m in the far con-
dition. This resulted in a 3-m difference that would theoretically correspond to an 11 ms difference
in the PSS if sound travel time were not compensated (a sound velocity of 330 m/s corresponds
to ~3.5 m/11 ms). For three out of five subjects, the PSS values were indeed shifted in that direction,
which led the authors to conclude that distance was not compensated. Against this conclusion, it
should be said that the SJ tasks depend heavily on criterion settings, that “three-out-of-five” is not
persuasively above chance, and that the range of distances was rather restricted.
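The expected uncompensated shift is simple arithmetic: at roughly 330 m/s, each additional metre of observer–stimulus distance delays the sound by about 3 ms. The short sketch below is our own worked example of that calculation; the variable names are assumptions.

    SPEED_OF_SOUND_M_PER_S = 330.0   # approximate value quoted in the text

    def auditory_delay_ms(distance_m):
        # Extra time the sound needs to travel the given distance, in milliseconds.
        return distance_m / SPEED_OF_SOUND_M_PER_S * 1000.0

    # Predicted PSS shift between two viewing distances if the brain did NOT compensate
    # for sound travel time, e.g., the 0.5 m and 3.5 m conditions of Stone et al. (2001).
    near_m, far_m = 0.5, 3.5
    print(f"{auditory_delay_ms(far_m) - auditory_delay_ms(near_m):.1f} ms")   # about 9 ms
    print(f"{auditory_delay_ms(far_m):.1f} ms")                               # about 10.6 ms (the ~11 ms for 3.5 m cited above)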
Less open to these kinds of criticisms is a study by Lewald and Guski (2004). They used a
rather wide range of distances (1, 5, 10, 20, and 50 m), and their audiovisual stimuli (a sequence of
five beeps/flashes) were delivered by colocated speakers/LEDs placed in the open field. Note that
in this case, there were no violations in the “naturalness” of the audiovisual stimuli and that they
were physically colocated. Using this setup, the authors did not observe compensation for distance.
Rather, their results showed that when the physical observer–stimulus distance increased, the PSS
shifted precisely with the variation in sound transmission time through air. For audiovisual stimuli
that are far away, sounds thus had to be presented earlier than for nearby stimuli to be perceived
as simultaneous, and there was no sign that the brain would compensate for sound–traveling time.
The authors also suggested that the discrepancy between their findings and those of studies that did find compensation for distance lies in the fact that the latter simulated distance rather than using the natural
situation.
Similar conclusions were also reached by Arnold et al. (2005), who examined whether the stream/
bounce illusion (Sekuler et al. 1997) varies with distance. The authors examined whether the opti-
mal time to produce a “bounce” percept varied with the distance of the display, which ranged from
~1 to ~15 m. The visual stimuli were presented on a computer monitor—keeping retinal proper-
ties constant—and the sounds were presented either over loudspeakers at these distances or over
headphones. The optimal time to induce a bounce percept shifted with the distance of the sound if it was presented over loudspeakers, but there was no shift if the sound was presented over head-
phones. Similar effects of timing shifts with viewing distance after loudspeaker, but not headphone,
presentation were obtained in an audiovisual TOJ task in which observers judged whether a sound
came before or after two disks collided. This led the authors to conclude that there is no compensa-
tion for distance if distance is real and presented over speakers rather than simulated and presented
over headphones.
This conclusion might well be correct, but it raises the question of how to account for the findings
by Kopinska and Harris (2004). These authors reported complete compensation for distance despite
using colocated sounds and lights produced at natural distances. In their study, the audiovisual
stimulus was a bright disk that flashed once on a computer monitor and it was accompanied by a
tone burst presented from the computer’s inbuilt speaker. Participants were seated at various dis-
tances from the screen (1, 4, 8, 16, 24, and 32 m) and made TOJs about the flash and the sound. The
authors also selectively slowed down visual processing by presenting the visual stimulus at 20° of
eccentricity rather than in the fovea, or by having observers wear darkened glasses. As an additional
control, they used simple reaction time tasks and found that all these variations—distance, eccen-
tricity, and dark glasses—had predictable effects on auditory or visual speeded reaction. However,
audiovisual simultaneity was not affected by distance, eccentricity, or darkened glasses. Thus, there
was no shift in the PSS despite the fact that the change in distance, illumination, and retinal location
affected simple reaction times. This made the authors conclude that observers recover the external
world by taking into account all kinds of predictable variations, most importantly distance, alluding
to similar phenomena such as size or color constancy.
There are some studies that varied audiovisual distance in a natural way, but came to diametri-
cally opposing conclusions: Lewald and Guski (2004) and Arnold et al. (2005) found no compensa-
tion for distance, whereas Kopinska and Harris (2004) reported complete compensation. What is the critical difference between them? Our conjecture is that they differ in two critical respects, that is,
(1) whether distance was randomized on a trial-by-trial basis or blocked, and (2) whether sensitivity
for temporal order was good or poor. In the study by Lewald and Guski, the distance of the stimuli
was varied on a trial-by-trial basis as they used a setup of five different speakers/LEDs. In Kopinska
and Harris’s study, though, the distance between the observer and the screen was blocked over trials
because otherwise subjects would have to be shifted back and forth after each trial. If the distance is
blocked, then either adaptation to the additional sound lag may occur (i.e., recalibration), or subjects
may equate their response probabilities at the particular distance at which they are seated. Either way, the
effect of distance on the PSS will diminish if trials are blocked, and no shift in the PSS will then
be observed, leading to the “wrong” conclusion that distance is compensated. This line of reason-
ing corresponds with a recent study by Heron et al. (2007). In their study, participants performed a
TOJ task in which audiovisual stimuli (a white disk and a click) were presented at varying distances
(0, 5, 10, 20, 30, and 40 m). Evidence for compensation was only found after a period of adaptation
(1 min + 5 top-up adaptation stimuli between trials) to the naturally occurring audiovisual asyn-
chrony associated with a particular viewing distance. No perceptual compensation for distance-
induced auditory delays could be demonstrated whenever there was no adaptation period (although we should note that in that study, observer distance was always blocked).
The second potentially relevant difference between studies that do or do not demonstrate com-
pensation is the difficulty of the stimuli. Lewald and Guski (2004) used a sequence of five pulses/
sounds, whereas Kopinska and Harris (2004) presented a single sound/flash. In our experience, a
sequence of pulses/flashes drastically improves accuracy for temporal order if compared to a single
pulse/flash because there are many more cues in the signal. In the study by Arnold et al. (2005),
judgments about temporal order could also be relatively accurate because the two colliding disks
provided anticipatory information about when to expect the sound. Most likely, observers in the
study of Kopinska and Harris were inaccurate because their single sound/flash stimuli without
anticipatory information were difficult (unfortunately, none of the studies reported JNDs). In effect,
this amounts to adding noise to the psychometric function, which then effectively masks the effect
of distance on temporal order. It might easily lead one to conclude “falsely” that there is compensa-
tion for distance.

9.5.3  Temporal Recalibration


The third possibility of how the brain might deal with lags between the senses entails that the brain is flexible in adjusting what it counts as synchronous (see Figure 9.2, panel 3). This phenomenon is
also known as “temporal recalibration.” Recalibration is a well-known phenomenon in the spatial
domain, but it has only recently been demonstrated in the temporal domain (Fujisaki et al. 2004;
Vroomen et al. 2004). As for the spatial case, more than a century ago, von Helmholtz (1867) had
already shown that the visual–motor system is remarkably flexible, as it adapts to shifts of the visual field induced by wedge prisms. If prism-wearing subjects had to pick up a visually displaced object, they would quickly adapt to the new sensorimotor arrangement, and even after only a few trials, small visual displacements might go unnoticed. Recalibration was the term used to explain
this phenomenon. In essence, recalibration is thought to be driven by a tendency of the brain to
minimize discrepancies between the senses about objects or events that normally belong together.
For the prism case, it is the position at which the hand is seen and felt. Nowadays, it is also known
that the least reliable source is adjusted toward the more reliable one (Ernst and Banks 2002; Ernst
et al. 2000; Ernst and Bulthoff 2004).
The first evidence of recalibration in the temporal domain came from two studies with very
similar designs: an exposure–test paradigm. Both Fujisaki et al. (2004) and Vroomen et al. (2004)
first exposed observers to a train of sounds and light flashes with a constant but small intersensory
interval, and then tested them by using an audiovisual TOJ or SJ task. The idea was that observers
would adapt to small audiovisual lags in such a way that the adapted lag is eventually perceived as
synchronous. Therefore, after a light-first exposure, light-first trials would be perceived as synchro-
nous, and after a sound-first exposure, a sound-first stimulus would be perceived as synchronous (see
Figure 9.3). Both studies indeed observed that the PSS was shifted in the direction of the exposure
lag. For example, Vroomen and Keetels exposed subjects for ~3 min to a sequence of sound bursts/
light flashes with audiovisual lags of either ±100 or ±200 ms (sound-first or light-first). During the
test, the PSS was shifted, on average, by 27 and 18 ms (PSS difference between sound-first and
light-first) for the SJ and TOJ tasks, respectively. Fujisaki et al. used slightly bigger lags (±235 ms
sound-first or light-first) and found somewhat bigger shifts in the PSS (59 ms shifts of the PSS in SJ
and 51 ms in TOJ), but data were, in essence, comparable. Many others have reported similar effects
(Asakawa et al. 2009; Di Luca et al. 2007; Hanson et al. 2008; Keetels and Vroomen 2007, 2008b;
Navarra et al. 2005, 2007, 2009; Stetson et al. 2006; Sugano et al. 2010; Sugita and Suzuki 2003;
Takahashi et al. 2008; Tanaka et al. 2009a; Yamamoto et al. 2008).
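The exposure–test logic of these studies (see also Figure 9.3 below) can be summarized schematically: a few minutes of audiovisual pairs at a fixed lag are followed by TOJ or SJ test trials at a range of SOAs, from which the post-exposure PSS is estimated. The sketch below is our own outline of such a procedure; the trial counts, SOA values, and names are assumptions, not the parameters of any particular study.

    import random

    def recalibration_session(adapt_lag_ms, exposure_pairs=480,
                              test_soas_ms=(-200, -100, -50, 0, 50, 100, 200), test_reps=10):
        # Schematic exposure-test design: an exposure train with a constant audiovisual lag,
        # followed by randomized test trials whose responses are later fitted to estimate the PSS.
        exposure = [{"phase": "exposure", "av_lag_ms": adapt_lag_ms} for _ in range(exposure_pairs)]
        test = [{"phase": "test", "soa_ms": soa} for soa in test_soas_ms for _ in range(test_reps)]
        random.shuffle(test)
        return exposure + test

    # For example, a light-first adaptation block (+100 ms), after which the fitted PSS is
    # expected to shift slightly in the light-first direction.
    session = recalibration_session(adapt_lag_ms=+100)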
The mechanism underlying temporal recalibration, though, remains elusive at this point. One
option is that there is a shift in the criterion for simultaneity in the adapted modalities (Figure 9.2,
panel 3a). After exposure to light-first pairings, participants may thus change their criterion for
audiovisual simultaneity in such a way that light-first stimuli are taken to be simultaneous. On this
view, other modality-pairings (e.g., vision–touch) would be unaffected and the change in crite-
rion should then not affect unimodal processing of visual and auditory stimuli presented in isola-
tion. Another strong prediction is that stimuli that were once synchronous, before adaptation, can
become asynchronous after adaptation. The most dramatic case of this phenomenon can be found
in motor–visual adaptation. In a study by Eagleman and Holcombe (2002), participants were asked
to repeatedly tap their finger on a key, and after each key tap, a delayed flash was presented. If the
visual flash occurred at an unexpectedly short delay after the tap (or in synchrony with it), it was actually
perceived as occurring before the tap, an experience that runs against the law of causality.
It may also be the case that one modality (vision, audition, or touch) is “shifted” toward the other,
possibly because the sensory threshold for stimulus detection in one of the adapted modalities is
changed (see Figure 9.2, panel 3c). For example, in an attempt to perceive simultaneity during light-first exposure, participants might delay processing time in the visual modality by adopting a more stringent criterion for sensory detection of visual stimuli. After exposure to light-first audiovisual pairings, one might then expect slower processing times of visual stimuli in general, and other modality pairings that involve the visual modality, say vision–touch, would then also be affected.

[Figure 9.3 near here. Panel (a) depicts audiovisual (AV) exposure trains and panel (b) tactile–visual (TV) exposure trains, each at lags of –100, 0, and +100 ms; the schematic symbols mark visual, auditory, and vibrotactile stimuli along a time axis.]

FIGURE 9.3  Schematic illustration of exposure conditions typically used in a temporal recalibration paradigm. During exposure, participants are exposed to a train of auditory–visual (AV) or tactile–visual (TV) stimulus pairs (panels a and b, respectively) with a lag of –100, 0, or +100 ms. To explore possible shifts in perceived simultaneity or sensitivity to asynchrony, typically a TOJ or SJ task is performed in a subsequent test phase. (From Fujisaki, W. et al., Nat. Neurosci., 7, 773–8, 2004; Vroomen, J. et al., Cogn. Brain Res., 22, 32–5, 2004; Keetels, M., Vroomen, J., Percept. Psychophys., 70, 765–71, 2008; Keetels, M., Vroomen, J., Neurosci. Lett., 430, 130–4, 2008. With permission.)
Two strategies have been undertaken to explore the mechanism underlying temporal recalibra-
tion. The first is to examine whether temporal recalibration generalizes to other stimuli within
the adapted modalities; the second is to examine whether temporal recalibration affects different modality pairings than the ones adapted. Fujisaki et al. (2004) have already demonstrated that adaptation to temporal misalignment was effective even when the visual test stimulus was very different from the exposure situation. The authors exposed observers to asynchronous tone-
flash stimulus pairs and later tested them on the “stream/bounce” illusion (Sekuler et al. 1997).
Fujisaki et al. reported that the optimal delay for obtaining a bounce percept in the stream/bounce
illusion was shifted in the same direction as the adapted lag. Furthermore, after exposure to a “wall-
display,” in which tones were timed with a ball bouncing off the inner walls of a square, similar
shifts in the PSS on the bounce percept were found (a ~45 ms difference when comparing the PSS
of the –235 ms sound-first exposure with the +235 ms vision-first exposure). Audiovisual temporal
recalibration thus generalized well to other visual stimuli.
Navarra et al. (2005) and Vatakis et al. (2008b) also tested generalization for audiovisual tempo-
ral recalibration using stimuli from different domains (speech/nonspeech). Their observers had to
monitor a continuous speech stream for target words that were presented either in synchrony with
the video of a speaker, or with the audio stream lagging 300 ms behind. During the monitoring
task, participants performed a TOJ (Navarra et al. 2005; Vatakis et al. 2007) or SJ task (Vatakis
et al. 2008b) on simple flashes and white noise bursts that were overlaid on the video. Their results
showed that, rather than a shift in the PSS, sensitivity became worse if subjects were exposed to desynchronized rather than synchronized audiovisual speech. Similar effects (larger JNDs) were found with music stimuli. This led the authors to conclude that the “window of temporal integration” was widened (see Figure 9.2, panel 3b) because of asynchronous exposure (see also Navarra
et al. 2007 for effects on JND after adaptation to asynchronous audio–tactile stimuli). The authors
argued that this effect on the JND may reflect an initial stage of recalibration in which a more
lenient criterion is adopted for simultaneity. With prolonged exposure, subjects may then shift the
PSS. An alternative explanation—also considered by the authors, but rejected—might be that sub-
jects became confused by the nonmatching exposure stimuli, which as a result may also affect the
JND rather than the PSS because it adds noise to the distribution.
The second way to study the underlying mechanisms of temporal recalibration is to examine
whether temporal recalibration generalizes to different modality pairings. Hanson et al. (2008)
explored whether a “supramodal” mechanism might be responsible for the recalibration of multi-
sensory timing. They examined whether adaptation to audiovisual, audio–tactile, and tactile–visual
asynchronies (10 ms flashes, noise bursts, and taps on the left index finger) generalized across
modalities. The data showed that a brief period of repeated exposure to ±90 ms asynchrony in any
of these pairings resulted in shifts of about 70 ms of the PSS on subsequent TOJ tasks, and that
the size and nature of the shifts were very similar across all three pairings. This made them con-
clude that there is a “general mechanism.” Opposite conclusions, though, were reached by Harrar
and Harris (2005). They exposed participants for 5 min to audiovisual pairs with a fixed time lag
(250 ms light-first), but did not obtain shifts in the PSSs for touch–light pairs. In an extension of this
topic (Harrar and Harris 2008), observers were exposed for 5 min to ~100 ms lags of light-first stim-
uli for the audiovisual case, and touch-first stimuli for the auditory–tactile and visual–tactile case.
Participants were tested on each of these pairs before and after exposure. Shifts of the PSS in the
predicted direction were only found in the audiovisual exposure–test stimuli, but not for the other
cases. Di Luca et al. (2007) also exposed participants to asynchronous audiovisual pairs (~200 ms
lags of sound-first and light-first) and measured the PSS for audiovisual, audio–tactile, and visual–
tactile test stimuli. Besides obtaining a shift in the PSS for audiovisual pairs, the effect was found
to generalize to audio–tactile, but not to visual–tactile test pairs. This pattern made the authors
conclude that adaptation resulted in a phenomenal shift of the auditory event (Di Luca et al. 2007).
Navarra et al. (2009) also recently reported that the auditory rather than visual modality is more
flexible. Participants were exposed to synchronous or asynchronous audiovisual stimuli (224 ms
vision-first, or 84 ms auditory-first for 5 min of exposure) after which they performed a speeded
reaction time task on unimodal visual or auditory stimuli. In contrast with the idea that visual
stimuli get adjusted in time to the relatively more accurate auditory stimuli (Hirsh and Sherrick
1961; Shipley 1964; Welch 1999; Welch and Warren 1980), their results seemed to show the oppo-
site, namely, that auditory rather than visual stimuli were shifted in time. The authors reported that
simple reaction times to sounds became approximately 20 ms faster after vision-first exposure and
about 20 ms slower after auditory-first exposure, whereas simple reaction times for visual stimuli
remained unchanged. They explained this finding by alluding to the idea that visual information can serve as the temporal anchor because it provides a more exact estimate of the time of occurrence of a distal event than auditory information does, given that light travel time is effectively independent of distance. Further research is needed, however, to examine whether a change in simple reaction times truly reflects a change in the perceived timing of the event, as there is quite some evidence showing that the
two do not always go hand-in-hand (e.g., reaction times are more affected by variations in intensity
than TOJs; Jaskowski and Verleger 2000; Neumann and Niepel 2004).
To summarize, there is as yet no clear account of the mechanism underlying temporal recalibration, as the data on generalization across modalities are discrepant. It seems safe to conclude that the audiovisual exposure–test situation is the most reliable one in which to obtain
a shift in the PSS. Arguably, audiovisual pairs are more flexible because the brain has to correct
for timing differences between auditory and visual stimuli because of naturally occurring delays
caused by distance. Tactile stimuli might be more rigid in time because visual–tactile and audio–
tactile events always occur at the body surface, so less compensation for latency differences might
be required here. As already mentioned above, a widening of the JND, rather than a shift in the PSS, has also been observed; it may reflect an initial stage of recalibration in which a more lenient criterion for simultaneity is adopted. The reliability of each modality on its own is also likely to play a role. Visual stimuli are known to be less reliable in time than auditory or tactile stimuli (Fain 2003), and as a consequence they may be more malleable (Ernst and Banks 2002; Ernst et al. 2000; Ernst and Bulthoff 2004), although there is also evidence that it is, in fact, the auditory modality that is shifted.
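
To make the PSS and JND measures discussed here concrete, the sketch below fits a cumulative Gaussian to hypothetical TOJ data collected before and after an adaptation phase; the recalibration aftereffect is then simply the difference between the two PSS estimates. The SOAs, response proportions, and the JND convention (half of the 25%–75% interquartile range) are our own illustrative assumptions, not data from any of the studies cited.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cum_gauss(soa, pss, sigma):
    # Probability of a "vision first" response as a function of SOA (ms);
    # positive SOA means the visual stimulus leads the auditory one.
    return norm.cdf(soa, loc=pss, scale=sigma)

soas = np.array([-240, -120, -60, 0, 60, 120, 240])            # ms (audio-lead negative)
p_pre = np.array([0.02, 0.10, 0.30, 0.55, 0.80, 0.95, 1.00])   # hypothetical pre-test data
p_post = np.array([0.05, 0.20, 0.45, 0.70, 0.90, 0.97, 1.00])  # hypothetical post-exposure data

(pss_pre, sd_pre), _ = curve_fit(cum_gauss, soas, p_pre, p0=[0, 80])
(pss_post, sd_post), _ = curve_fit(cum_gauss, soas, p_post, p0=[0, 80])

jnd_pre = 0.675 * sd_pre   # one common convention: half the 25%-75% range
print(f"PSS pre: {pss_pre:.1f} ms, JND pre: {jnd_pre:.1f} ms")
print(f"Temporal recalibration (PSS shift): {pss_post - pss_pre:.1f} ms")
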

9.5.4  Temporal Ventriloquism


The fourth possibility for how the brain might deal with lags between the senses, and how these lags may go unnoticed, is that the perceived timing of a stimulus in one modality is actively shifted toward the other (see Figure 9.2, panel 4). This phenomenon is known as “temporal ventriloquism,” named in analogy with the spatial ventriloquist effect. For spatial ventriloquism, it has long been known that listeners who hear a sound while seeing a spatially displaced flash have the (false) impression that the sound originates from the flash. This phenomenon was named the
“ventriloquist illusion” because it was considered a stripped-down version of what the ventriloquist
was doing when performing on stage. The temporal ventriloquist effect is analogous to the spatial
variant, except that here, sound attracts vision in the time dimension rather than vision attracting
sound in the spatial dimension. There are, by now, many demonstrations of this phenomenon, and
we describe several in subsequent paragraphs. They all show that small lags between sound and
vision go unnoticed because the perceived timing of visual events is flexible and is attracted toward
events in other modalities.
Scheier et al. (1999) were among the first to demonstrate temporal ventriloquism using a visual
TOJ task (see Figure 9.4). Observers were presented with two lights at various SOAs, one above and
one below a fixation point, and their task was to judge which light came first (the upper or the lower).
To induce temporal ventriloquism, Scheier et al. added two sounds that could either be presented
before the first and after the second light (condition AVVA), or the sounds could be presented in
between the two lights (condition VAAV). Note that they used a visual TOJ task, and that sounds
were task-irrelevant. The results showed that observers were more sensitive (i.e., smaller intervals
were still perceived correctly) in the AVVA condition compared to the VAAV condition (visual
JNDs were approximately 24 and 39 ms, respectively). Presumably, the two sounds attracted the
temporal occurrence of the two lights, and thus, effectively pulled the lights farther apart in the
AVVA condition, and closer together in the VAAV condition. In single-sound conditions, AVV
and VVA, sensitivity was not different from a visual-only baseline, indicating that the effects were
not because of the initial sound acting as a warning signal, or some cognitive factor related to the
observer’s awareness of the sounds.
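
As a toy illustration of this capture account (our own sketch, not the authors' model), suppose that each light's perceived onset is pulled a fraction w toward its nearest sound whenever the sound–light lag is within roughly 200 ms. Flanking sounds (AVVA) then stretch the effective visual SOA, whereas sounds falling between the lights (VAAV) compress it, mirroring the better versus worse JNDs just described. The weight w, the SOA, and the sound offsets below are arbitrary assumptions.

def perceived_onset(light_t, sound_t, w=0.1, window=200):
    # Pull a light's onset a fraction w toward a sound, if the lag is within ~200 ms.
    lag = sound_t - light_t
    return light_t + w * lag if abs(lag) <= window else light_t

def effective_soa(soa, sound_offsets, w=0.1):
    # Perceived interval between two lights at 0 and soa (ms); sound_offsets give
    # each sound's time relative to "its" light (negative = sound comes earlier).
    l1, l2 = 0.0, soa
    s1, s2 = l1 + sound_offsets[0], l2 + sound_offsets[1]
    return perceived_onset(l2, s2, w) - perceived_onset(l1, s1, w)

soa = 75  # hypothetical interval between the two lights, in ms
print("AVVA (sounds flank the lights):    %.0f ms" % effective_soa(soa, (-100, +100)))
print("VAAV (sounds between the lights):  %.0f ms" % effective_soa(soa, (+25, -25)))
print("Baseline (sounds at light onsets): %.0f ms" % effective_soa(soa, (0, 0)))
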
Morein-Zamir et al. (2003) replicated these effects and further explored the sound–light inter-
vals at which the effect occurred. Sound–light intervals of ~100 to ~600 ms were tested, and it was
shown that the second sound was mainly responsible for the temporal ventriloquist effect up to a
sound–light interval of 200 ms, whereas the interval of the first sound had little effect.
The results were also consistent with earlier findings of Fendrich and Corballis (2001) who used
a paradigm in which participants judged when a flash occurred by reporting the clock position of
a rotating marker. The repeating flash was seen earlier when it was preceded by a click and later
when the click lagged the visual stimulus. Another demonstration of temporal ventriloquism using
a different paradigm came from a study by Vroomen and de Gelder (2004b). Here, temporal ven-
triloquism was demonstrated using the flash-lag effect (FLE).

FIGURE 9.4  A schematic illustration of conditions typically used to demonstrate auditory–visual temporal
ventriloquism (panel a) and tactile–visual temporal ventriloquism (panel b). The first capturing stimulus (i.e.,
either a sound or a vibro–tactile stimulus) precedes the first light by 100 ms, whereas the second capturing
stimulus trails the second light by 100 ms. Baseline condition consists of presentation of two capturing stimuli
simultaneous with light onsets. Temporal ventriloquism is typically shown by improved visual TOJ sensitivity
when capture stimuli are presented with a 100-ms interval. (From Scheier, C.R. et al., Invest. Ophthalmol. Vis.
Sci., 40, 4169, 1999; Morein-Zamir, S. et al., Cogn. Brain Res., 17, 154–63, 2003; Vroomen, J., Keetels, M.,
J. Exp. Psychol. Hum. Percept. Perform., 32, 1063–71, 2006; Keetels, M. et al., Exp. Brain Res., 180, 449–56,
2007; Keetels, M., Vroomen, J., Percept. Psychophys., 70, 765–71, 2008; Keetels, M., Vroomen, J., Neurosci.
Lett., 430, 130–4, 2008. With permission.)

In the typical FLE (Mackay 1958;
Nijhawan 1994, 1997, 2002), a flash appears to lag behind a moving visual stimulus even though
the stimuli are presented at the same physical location. To induce temporal ventriloquism, Vroomen
and de Gelder added a single click presented slightly before, at, or after the flash (intervals of 0, 33,
66, and 100 ms). The results showed that the sound attracted the temporal onset of the flash and shifted it by roughly 5% of the sound–flash interval. A sound ~100 ms before the flash thus made the flash appear ~5 ms earlier, and a sound 100 ms after the flash made it appear ~5 ms later. A sound, including the synchronous one, also improved sensitivity on the visual task: JNDs were smaller when a sound was present than when it was absent.
Yet another recent manifestation of temporal ventriloquism used an apparent visual motion par-
adigm. Visual apparent motion occurs when a stimulus is flashed in one location and is followed by
another identical stimulus flashed in another location (Korte 1915). Typically, an illusory movement
is observed that starts at the lead stimulus and is directed toward the second lagging stimulus (the
strength of the illusion depends on the exposure time of the stimuli, and the temporal and spatial
separation between them). Getzmann (2007) explored the effects of irrelevant sounds on this motion
illusion. In this study, two temporally separated visual stimuli (SOAs ranging from 0 to 350 ms)
were presented and participants classified their impression of motion using a categorization system.
The results demonstrated that sounds intervening between the visual stimuli facilitated the impres-
sion of apparent motion relative to no sounds, whereas sounds presented before the first and after
the second visual stimulus reduced motion perception (see Bruns and Getzmann 2008 for similar
results). The idea was that because exposure time and spatial separation were both held constant in
this study, the impression of apparent motion was systematically affected by the perceived length of
the interstimulus interval. The effect was explained in terms of temporal ventriloquism, as sounds
attracted the illusory onset of visual stimuli.
Freeman and Driver (2008) investigated whether the timing of a static sound could influence spa-
tiotemporal processing of visual apparent motion. Apparent motion was induced by visual stimuli
alternating between opposite hemifields. The perceived direction typically depends on the relative
timing interval between the left–right and right–left flashes (e.g., rightward motion dominating
when left–right interflash intervals are shortest; von Grunau 1986). In their study, the interflash
intervals were always 500 ms (ambiguous motion), but sounds could slightly lead the left flash and
lag the right flash by 83 ms or vice versa. Because of temporal ventriloquism, this variation made
visual apparent motion depend on the timing of the sound stimuli (e.g., more rightward responses if
a sound preceded the left flash, and lagged the right flash, and more leftward responses if a sound
preceded the right flash, and lagged the left flash).
The temporal ventriloquist effect has also been used as a diagnostic tool to examine whether
commonality in space is a constraint on intersensory pairing. Vroomen and Keetels (2006) adopted
the visual TOJ task of Scheier et al. (1999) and replicated that sounds improved sensitivity in the
AVVA version of the visual TOJ task. Importantly, the temporal ventriloquist effect was unaffected
by whether sounds and lights were colocated or not. For example, the authors varied whether the
sounds came from a central location or a lateral one, whether the sounds were static or moving,
and whether the sounds and lights came from the same or different sides of fixation at either small
or large spatial disparities. None of these variations affected the temporal ventriloquist effect, even though spatially discordant sounds were shown to attract reflexive spatial attention and to interfere with speeded visual discrimination. These results led the authors to conclude that intersensory interactions in general do not require spatial correspondence between the components of the cross-modal stimuli (see also Keetels et al. 2007).
In another study (Keetels and Vroomen 2008a), it was explored whether touch affects vision on
the time dimension as audition does (visual–tactile ventriloquism), and whether spatial disparity
between the vibrator and lights modifies this effect. Given that tactile stimuli are spatially better
defined than tones because of their somatotopic rather than tonotopic initial coding, this study pro-
vided a strong test case for the notion that spatial co-occurrence between the senses is required for
intersensory temporal integration. The results demonstrated that tactile–visual stimuli behaved like
audiovisual stimuli, in that temporally misaligned tactile stimuli captured the onsets of the lights
and spatial discordance between the stimuli did not harm this phenomenon.
Besides exploring whether spatial disparity affects temporal ventriloquism, the effect of synes-
thetic congruency between modalities was also recently explored (Keetels and Vroomen 2010;
Parise and Spence 2008). Parise and Spence (2008) suggested that pitch–size synesthetic congruency (i.e., a natural association between the relative pitch of a sound and the relative size of a visual stimulus) might affect temporal ventriloquism. In their study, participants made visual TOJs about small and large visual stimuli while high-pitched or low-pitched tones were presented before the first and after the second light. The results showed that, at large sound–light inter-
vals, sensitivity for visual temporal order was better for synesthetically congruent than incongruent
pairs. In a more recent study, Keetels and Vroomen (2010) reexamined this effect and showed that
this congruency effect could not be attributed to temporal ventriloquism, as it disappeared at short
sound–light intervals if compared to a synchronous AV baseline condition that excludes response
biases. In addition, synesthetic congruency did not affect temporal ventriloquism even if partici-
pants were made explicitly aware of congruency before testing, challenging the view that synes-
thetic congruency affects temporal ventriloquism.
Stekelenburg and Vroomen (2005) also investigated the time course and the electrophysiologi-
cal correlates of the audiovisual temporal ventriloquist effect using ERPs in the FLE. Their results
demonstrated that the amplitude of the visual N1 was systematically affected by the temporal inter-
val between the visual target flash and the task-irrelevant sound in the FLE paradigm (Mackay
1958; Nijhawan 1994, 1997, 2002). If a sound was presented in synchrony with the flash, the N1
amplitude was larger than when the sound lagged the visual stimulus, and it was smaller when the
sound led the flash. No latency shifts, however, were found. Yet, based on the latency of the cross-
modal effect (N1 at 190 ms) and its localization in the occipitoparietal cortex, this study confirmed
the sensory nature of temporal ventriloquism. An explanation for the absence of a temporal shift of
the ERP components may lie in the small size of the temporal ventriloquist effect found (3 ms). Such
a small temporal difference may not be reliably reflected in the ERPs because it reaches the lower
limit of the temporal resolution of the sampled EEG.
In most of the studies examining temporal ventriloquism (visual TOJ, FLE, reporting clock posi-
tion or motion direction), the timing of the visual stimulus is the task-relevant dimension. Recently, however, Vroomen and Keetels (2009) explored whether a temporally offset sound could improve the identification of a visual stimulus when temporal order is not involved. In this study, it was
examined whether four-dot masking was affected by temporal ventriloquism. In the four-dot mask-
ing paradigm, visual target identification is impaired when a briefly presented target is followed by
a mask that consists of four dots that surround but do not touch the visual target (Enns 2004; Enns
and DiLollo 1997, 2000). The idea tested was that a sound presented slightly before the target and
slightly after the mask might lengthen the perceived interval between target and mask. By lengthen-
ing the perceived target–mask interval, there is more time for the target to consolidate, and in turn
target identification should be easier. Results were in line with this hypothesis as a small release
from four-dot masking was reported (1% improvement, which corresponds to an increase of the
target–mask ISI of 4.4 ms) when two sounds were presented at approximately 100-ms intervals before the target and after the mask, rather than when only a single sound was presented before the target or no sound was presented at all.
To summarize, there are by now many demonstrations that vision is flexible on the time dimen-
sion. In general, the perceived timing of a visual event is attracted toward other events in audition
and touch, provided that the lag between them is less than ~200 ms. The deeper reason for this mutual attraction is still untested. In our view, though, it serves to reduce natural lags between the senses so that they go unnoticed, thus maintaining intersensory coherence.
If so, one can ask what the relationship is between temporal ventriloquism and temporal recali-
bration. Despite the fact that temporal ventriloquism occurs immediately when a temporal asynchrony is presented, whereas temporal recalibration manifests itself as an aftereffect, both effects are explained as perceptual
solutions to maintain intersensory synchrony. The question can then be asked whether the same
mechanism underlies the two phenomena. At first sight, one might argue that the magnitude of
the temporal ventriloquist effect seems smaller than the temporal recalibration effects (temporal
ventriloquism: Morein-Zamir et al. 2003, ~15 ms JND improvement; Scheier et al. 1999, 15 ms
JND improvement; Vroomen and Keetels 2006, ~6 ms JND improvement; temporal recalibration:
Fujisaki et al. 2004, ~30 ms PSS shifts for 225 ms adaptation lags; Hanson et al. 2008, ~35 ms
PSS shifts for 90 ms adaptation lags; Navarra et al. 2009, ~20 ms shifts in reaction times; although
relatively small effects were found by Vroomen et al. 2004, ~8 ms PSS shifts for 100 ms adaptation
lags). However, these magnitudes cannot be compared directly because the temporal ventriloquist
effect refers to an improvement in JNDs, whereas the temporal recalibration effect is typically a
shift of the PSS. Moreover, in studies measuring temporal recalibration, there is usually much more
exposure to temporal asynchronies than in studies measuring temporal ventriloquism. Therefore,
it remains up to future studies to examine whether the mechanisms that are involved in temporal
ventriloquism and temporal recalibration are the same.

9.6  TEMPORAL SYNCHRONY: AUTOMATIC OR NOT?


An important question about the perception of intersensory synchrony is whether it occurs in an automatic fashion or not. As is often the case, there are two opposing views on this issue. Some have reported that the detection of temporal alignment is a slow, serial, and attention-demanding process, whereas others have argued that it is fast and requires only the minimal amount of attention needed to perceive the visual stimulus; once this criterion is met, audiovisual or visual–tactile integration comes for free.
An important signature of automatic processing is that the stimulus in question is salient and
“pops out.” If so, the stimulus is easy to find among distracters. What about intersensory synchrony:
does it “pop out”? In a study by van de Par and Kohlrausch (2004), this question was addressed by
presenting observers with a visual display of several circles independently moving up and down along a Gaussian profile. Along with the motion display, a concurrent sound was presented whose amplitude was modulated coherently with one of the circles. The participant’s task was to identify the coherently moving visual circle as quickly as possible. The authors found that response times increased approximately linearly with the number of distracters (~500 ms/distracter), indicating a slow serial search process rather than pop-out.
Fujisaki et al. (2006) came to similar conclusions. They examined search functions for a visual
target that changed in synchrony with an auditory stimulus. The visual display consisted of two, four,
or eight luminance-modulated Gaussian blobs presented at 5, 10, 20, and 40 Hz that were accompa-
nied by a white noise sound whose amplitude was modulated in synch with one of the visual stimuli.
Other displays contained clockwise/counterclockwise rotations of windmills synchronized with a
sound whose frequency was modulated up or down at a rate of 10 Hz. The observers’ task was to
indicate which visual stimulus was luminance-modulated in synch with the sound. Search func-
tions for both displays were slow (~1 s/distractor in target-present displays), and increased linearly
with the number of visual distracters. In a control experiment, it was also shown that synchrony
discrimination was unaffected by the presence of distractors if attention was directed at the visual
target. Fujisaki et al. therefore concluded that perception of audiovisual synchrony is a slow and
serial process based on a comparison of salient temporal features that need to be individuated from
within-modal signal streams.
Others, though, came to quite opposing conclusions and found that intersensory synchrony can
be detected in an automatic fashion. Most notably, van der Burg et al. (2008b) reported an interest-
ing study in which they showed that a simple auditory pip can drastically reduce search times for
a color-changing object that is synchronized with the pip. The authors presented a horizontal or
vertical target line among a large array of oblique lines. Each of the lines (target and distracters)
changed color from green-to-red or red-to-green in a random fashion. If a pip sound was synchro-
nized with a color change, visual attention was automatically drawn to the location of the line that
changed color. When the sound was synchronized with the color change of the target, search times
improved drastically and the number of irrelevant distracters had virtually no effect on search times
(a nearly flat slope indicating pop-out). The authors concluded that the temporal information of the
auditory signal was integrated with the visual signal, generating a relatively salient emergent feature
that automatically draws spatial attention (see also van der Burg et al. 2008a). Similar effects were
also demonstrated for tactile stimuli instead of auditory pips (Olivers and van der Burg 2008; van
der Burg et al. 2009).
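
The diagnostic in these search studies is the slope of mean response time against set size. The sketch below (with made-up response times, not the published data) estimates that slope by linear regression: a slope of several hundred milliseconds per item indicates serial search, whereas a near-zero slope indicates pop-out.

import numpy as np

def search_slope(set_sizes, mean_rts):
    # Least-squares slope (ms per distractor) and intercept (ms).
    slope, intercept = np.polyfit(set_sizes, mean_rts, deg=1)
    return slope, intercept

set_sizes = np.array([2, 4, 8])
serial_rts = np.array([1800, 2800, 4800])   # hypothetical: ~500 ms per extra item
popout_rts = np.array([900, 920, 930])      # hypothetical: nearly flat, as in "pip and pop"

print("serial search slope: %.0f ms/item" % search_slope(set_sizes, serial_rts)[0])
print("pop-out slope:       %.0f ms/item" % search_slope(set_sizes, popout_rts)[0])
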
Kanai et al. (2007) also explored temporal correspondences in visually ambiguous displays. They
presented multiple disks flashing sequentially at one of eight locations in a circle, thus inducing the
percept of a disk revolving around fixation. A sound was presented at one particular location in every cycle, and participants had to indicate the disk that was temporally aligned with the sound. The disk seen as being synchronized with the sound was perceived as brighter, with a sharper onset and offset (Vroomen and de Gelder 2000). Moreover, which disk appeared synchronized with the sound fluctuated over time, changing position every 5 to 10 s. Kanai et al. explored whether this flexibility was dependent on attention by having observers
perform a concurrent task in which they had to count the number of X’s in a letter stream. The results
demonstrated that the transitions disappeared whenever attention was distracted from the stimulus.
On the other hand, if attention was directed to one particular visual event—either by making it
“pop out” by using a different color, by presenting a cue next to the target dot, or by overtly cueing
it—the perceived timing of the sound was attracted toward that event. These results thus suggest that
perception of intersensory synchrony is flexible, and is not completely immune to attention.
These opposing views on the role of attention can be reconciled on the assumption that percep-
tion of synchrony depends on a matching process of salient temporal features (Fujisaki et al. 2006;
Fujisaki and Nishida 2007). Saliency may be lost when stimuli are presented at fast rates (typi-
cally above 4 Hz), when perceptually grouped into other streams, or if they lack a sharp transition
(Keetels et al. 2007; Sanabria et al. 2004; Vroomen and de Gelder 2004a; Watanabe and Shimojo
2001). In line with this notion, studies reporting that audiovisual synchrony detection is slow either presented stimuli at fast rates (>4 Hz, up to 80 per second) or used stimuli that lacked a sharp onset/offset (e.g., van de Par and Kohlrausch 2004, who used a Gaussian amplitude modulation). Studies reporting automatic detection of auditory–visual synchrony used much slower rates (1.11 Hz; van der Burg et al. 2008b) and sharp transitions (a pip).

9.7  NEURAL SUBSTRATES OF TEMPORAL SYNCHRONY


Although temporal correspondence is frequently considered one of the most important constraints
on cross-modal integration (e.g., Bedford 1989; Bertelson 1999; Radeau 1994; Stein and Meredith
1993; Welch 1999; Welch and Warren 1980), the neural correlates for the ability to detect and use
temporal synchrony remain largely unknown. Most likely, however, a whole network is involved.
Seminal studies examining the neural substrates of intersensory temporal correspondence were done in animals. It is well known that the firing rate of a subsample of cells in the superior colliculus (SC) increases dramatically, and more than would be expected from summing the unimodal responses, when auditory stimuli (tones) and visual stimuli (flashes) occur in close temporal and spatial proximity (Meredith et al. 1987; Stein et al. 1993). More recently, Calvert et al. (2001) used
functional magnetic resonance imaging (fMRI) on human subjects for studying brain areas that
demonstrate facilitation and suppression effects in the blood oxygenation level–dependent (BOLD)
signal for temporally aligned and temporally misaligned audiovisual stimuli. Their stimulus con-
sisted of a reversing checkerboard pattern of alternating black and white squares, with sounds presented either simultaneously with the onset of a reversal (synchronous condition) or randomly phase-shifted relative to it (asynchronous condition). The results showed an involvement of the SC, as its response was superadditive for temporally matched stimuli and depressed for temporally mismatched ones. Other cross-modal interactions were also identified in a network of cortical brain areas that included several frontal sites: the right inferior frontal gyrus, multiple sites within the right lateral sulcus, and the ventromedial frontal gyrus. Furthermore, response enhancement and depression were observed in the insula bilaterally, right superior parietal lobule, right inferior parietal
sulcus, left superior occipital gyrus, and left superior temporal sulcus (STS).
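
The additive criterion referred to here can be stated compactly. The sketch below classifies a hypothetical multisensory response against the unimodal ones; the numbers, the response units, and the exact cutoff used for "depression" (falling below the larger unimodal response) are our own assumptions, given only for illustration.

def classify_multisensory(av, a, v):
    # "Superadditive" if the bimodal response exceeds the sum of the unimodal
    # responses; "depressed" if it falls below the larger unimodal response.
    if av > a + v:
        return "superadditive enhancement"
    if av < max(a, v):
        return "response depression"
    return "subadditive enhancement"

# Hypothetical responses (e.g., spikes/s in SC, or percentage BOLD signal change)
print(classify_multisensory(av=30.0, a=10.0, v=12.0))   # temporally aligned pair
print(classify_multisensory(av=8.0, a=10.0, v=12.0))    # temporally misaligned pair
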
Bushara et al. (2001) examined the effect of temporal asynchrony in a positron emission tomog-
raphy study. Here, observers had to decide whether a colored circle was presented simultaneously
with a tone or not. The stimulus pairs could either be auditory-first (AV) or vision-first (VA) at three
levels of SOAs that varied in difficulty. A control condition (C) was included in which the auditory
and visual stimuli were presented simultaneously, and in which participants performed a visual
color discrimination task whenever a sound was present. The brain areas involved in auditory–visual
synchrony detection were identified by subtracting the activity of the control condition from that in
the asynchronous conditions (AV-C and VA-C). Results revealed a network of heteromodal brain
areas that included the right anterior insula, the right ventrolateral prefrontal cortex, right inferior
parietal lobe, and left cerebellar hemisphere. Activity that correlated positively with decreasing asynchrony was centered on a cluster within the right insula, suggesting that this region is most
important for the detection of auditory–visual synchrony. Given that interactions were also found
between the insula, the posterior thalamus, and the SC, it was suggested that intersensory temporal
processing is mediated via subcortical tecto-thalamo-insula pathways.
In a positron emission tomography study by Macaluso et al. (2004), subjects watched a video monitor showing a face mouthing words. In different blocks of trials, the audiovisual
signals were either presented synchronously or asynchronously (the auditory stimulus was leading
by a clearly noticeable 240 ms). In addition, the visual and auditory sources were either presented
at the same location or in opposite hemifields. Results showed that activity in ventral occipital areas
and left STS increased during synchronous audiovisual speech, regardless of the relative location of
the auditory and visual input.
More recently, in an fMRI study, Dhamala et al. (2007) examined the networks that are involved
in the perception of physically synchronous versus asynchronous audiovisual events. Two timing
parameters were varied: the SOA between sound and light (–200 to +200 ms) and the stimulation
rate (0.5–3.5 Hz). In the behavioral task, observers had to report whether stimuli were perceived
as simultaneous, sound-first, light-first, or “Can’t tell,” resulting in the classification of three dis-
tinct perceptual states, that is, the perception of synchrony, asynchrony, and “no clear perception.”
The fMRI data showed that each of these states involved activation in different brain networks.
Perception of asynchrony activated the primary sensory, prefrontal, and inferior parietal corti-
ces, whereas perception of synchrony disengaged the inferior parietal cortex and further recruited
the SC.
An fMRI study by Noesselt et al. (2007) also explored the effect of temporal correspondence
between auditory and visual streams. The stimuli were arranged such that auditory and visual
streams were temporally corresponding or not, using irregular and arrhythmic temporal patterns
that either matched between audition and vision or mismatched substantially while maintaining
the same overall temporal statistics. For the coincident audiovisual streams, there was an increase
in the BOLD response in multisensory STS contralateral to the visual stream. The contralateral
primary visual and auditory cortices were also found to be affected by the synchrony–asynchrony
manipulations, and a connectivity analysis indicated enhanced influence from mSTS on primary
sensory areas during temporal correspondence.
In an EEG paradigm, Senkowski et al. (2007) examined the neural mechanisms underlying
intersensory synchrony by measuring oscillatory gamma-band responses (GBRs; 30–80 Hz).
Oscillatory GBRs have been linked to feature integration mechanisms and to multisensory pro-
cessing. The authors reasoned that GBRs might also be sensitive to the temporal alignment of
intersensory stimulus components. The temporal synchrony of auditory and visual components of a
multisensory signal was varied (tones and horizontal gratings with SOAs ranging from –125 to +125
ms). The GBRs to the auditory and visual components of multisensory stimuli were extracted for
five subranges of asynchrony and compared with GBRs to unisensory control stimuli. The results
revealed that multisensory interactions were strongest in the early GBRs when the sound and light
stimuli were presented with the closest synchrony. These effects were most evident over medial–
frontal brain areas after 30 to 80 ms and over occipital areas after 60 to 120 ms, indicating that
temporal synchrony may have an effect on early intersensory interactions in the human cortex.
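
In signal-processing terms, a gamma-band response can be obtained by band-pass filtering the EEG between 30 and 80 Hz and taking its amplitude envelope. The sketch below is a generic illustration of that step, not the authors' pipeline; the sampling rate, filter order, and simulated trace are assumptions.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 500.0                                   # assumed sampling rate in Hz
t = np.arange(0, 1.0, 1.0 / fs)
eeg = np.random.randn(t.size) + 0.5 * np.sin(2 * np.pi * 40 * t)   # toy single-channel trace

b, a = butter(4, [30.0, 80.0], btype="bandpass", fs=fs)   # 30-80 Hz band-pass filter
gamma = filtfilt(b, a, eeg)                               # zero-phase filtered gamma band
gbr_envelope = np.abs(hilbert(gamma))                     # instantaneous gamma amplitude

print("mean gamma amplitude: %.3f" % gbr_envelope.mean())
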
Overall, it should be noted that there is considerable variation in the outcomes of studies that have examined the neural basis of intersensory temporal synchrony. At present, the issue is far from resolved, and more research is needed to unravel the exact neural substrates involved. The most consistent finding is that the SC and mSTS are repeatedly implicated in intersensory synchrony detection studies, which at least suggests a prominent role for these structures in the processing
of intersensory stimuli based on their temporal correspondence. For the time being, however, it is
unknown how these areas would affect the perception of intersensory synchrony if they were dam-
aged or temporarily blocked by, for example, transcranial magnetic stimulation.

9.8  CONCLUSIONS
In recent years, a substantial amount of research has been devoted to understanding how the brain
handles lags between the senses. The most important conclusion we draw is that intersensory timing
is flexible and adaptive. The flexibility is clearly demonstrated by studies showing one or another
variant of temporal ventriloquism. In that case, small lags go unnoticed because the brain actively
shifts one information stream (usually vision) toward the other, possibly to maintain temporal
coherence. The adaptive part rests on studies of temporal recalibration demonstrating that observ-
ers are flexible in adopting what counts as synchronous. The extent to which temporal recalibration
generalizes to other stimuli and domains, however, remains to be further explored. The idea that the
brain compensates for predictable variability between the senses—most notably distance—is, in
our view, not well-founded. We are more enthusiastic about the notion that intersensory synchrony
is perceived mostly in an automatic fashion, provided that the individual components of the stimuli
are sufficiently salient. The neural mechanisms that underlie this ability are of clear importance for
future research.

REFERENCES
Alais, D., and S. Carlile. 2005. Synchronizing to real events: Subjective audiovisual alignment scales with
perceived auditory depth and speed of sound. Proceedings of the National Academy of Sciences of the
United States of America 102(6);2244–7.
Arnold, D.H., A. Johnston, and S. Nishida. 2005. Timing sight and sound. Vision Research 45(10);1275–84.
Arrighi, R., D. Alais, and D. Burr. 2006. Perceptual synchrony of audiovisual streams for natural and artificial
motion sequences. Journal of Vision 6(3);260–8.
Asakawa, K., A. Tanaka, and H. Imai. 2009. Temporal Recalibration in Audio-Visual Speech Integration Using
a Simultaneity Judgment Task and the McGurk Identification Task. Paper presented at the 31st Annual
Meeting of the Cognitive Science Society (July 29–August 1, 2009). Amsterdam, The Netherlands.
Bald, L., F.K. Berrien, J.B. Price, and R.O. Sprague. 1942. Errors in perceiving the temporal order of auditory
and visual stimuli. Journal of Applied Psychology 26;283–388.
Bedford, F.L. 1989. Constraints on learning new mappings between perceptual dimensions. Journal of
Experimental Psychology. Human Perception and Performance 15(2);232–48.
Benjamins, J.S., M.J. van der Smagt, and F.A. Verstraten. 2008. Matching auditory and visual signals: Is sen-
sory modality just another feature? Perception 37(6);848–58.
Bertelson, P. 1994. The cognitive architecture behind auditory-visual interaction in scene analysis and speech
identification. Cahiers de Psychologie Cognitive 13(1);69–75.
Bertelson, P. 1999. Ventriloquism: A case of crossmodal perceptual grouping. In G. Aschersleben, T. Bachmann,
and J. Musseler (eds.), Cognitive Contributions to the Perception of Spatial and Temporal Events, 347–63.
North-Holland: Elsevier.
Bertelson, P., and G. Aschersleben. 1998. Automatic visual bias of perceived auditory location. Psychonomic
Bulletin & Review 5(3);482–89.
Bertelson, P., and G. Aschersleben. 2003. Temporal ventriloquism: Crossmodal interaction on the time
dimension: 1. Evidence from auditory–visual temporal order judgment. International Journal of
Psychophysiology 50(1–2);147–55.
Boenke, L.T., M. Deliano, and F.W. Ohl. 2009. Stimulus duration influences perceived simultaneity in audiovi-
sual temporal-order judgment. Experimental Brain Research 198(2–3);233–44.
Bronkhorst, A.W. 1995. Localization of real and virtual sound sources. Journal of the Acoustical Society of
America 98(5);2542–53.
Bronkhorst, A.W., and T. Houtgast. 1999. Auditory distance perception in rooms. Nature 397;517–20.
Bruns, P., and S. Getzmann. 2008. Audiovisual influences on the perception of visual apparent motion:
Exploring the effect of a single sound. Acta Psychologica 129(2);273–83.
Bushara, K.O., J. Grafman, and M. Hallett. 2001. Neural correlates of auditory-visual stimulus onset asyn-
chrony detection. Journal of Neuroscience 21(1);300–4.
Calvert, G., P.C. Hansen, S.D. Iversen, and M.J. Brammer. 2001. Detection of audio-visual integration sites in
humans by application of electrophysiological criteria to the BOLD effect. NeuroImage 14(2);427–38.
Calvert, G., C. Spence, and B. Stein. 2004. The Handbook of Multisensory Processes. Cambridge, MA: The
MIT Press.
Colin, C., M. Radeau, P. Deltenre, and J. Morais. 2001. Rules of intersensory integration in spatial scene analy-
sis and speechreading. Psychologica Belgica 41(3);131–44.
Conrey, B., and D.B. Pisoni. 2006. Auditory–visual speech perception and synchrony detection for speech and
nonspeech signals. Journal of the Acoustical Society of America 119(6);4065–73.
Dhamala, M., C.G. Assisi, V.K. Jirsa, F.L. Steinberg, and J.A. Kelso. 2007. Multisensory integration for timing
engages different brain networks. NeuroImage 34(2);764–73.
Di Luca, M., T. Machulla, and M.O. Ernst. 2007. Perceived Timing Across Modalities. Paper presented at the
International Intersensory Research Symposium 2007: Perception and Action (July 3, 2007). Sydney,
Australia.
Dinnerstein, A.J., and P. Zlotogura. 1968. Intermodal perception of temporal order and motor skills: Effects of
age. Perceptual and Motor Skills 26(3);987–1000.
Dixon, N.F., and L. Spitz. 1980. The detection of auditory visual desynchrony. Perception 9(6);719–21.
Eagleman, D.M., and A.O. Holcombe. 2002. Causality and the perception of time. Trends in Cognitive Sciences
6(8);323–5.
Eimer, M., and J. Driver. 2001. Crossmodal links in endogenous and exogenous spatial attention: Evidence
from event-related brain potential studies. Neuroscience and Biobehavioral Reviews 25(6);497–511.
Eimer, M., and E. Schroger. 1998. ERP effects of intermodal attention and cross-modal links in spatial atten-
tion. Psychophysiology 35(3);313–27.
Engel, G.R., and W.G. Dougherty. 1971. Visual–auditory distance constancy. Nature 234(5327);308.
Enns, J.T. 2004. Object substitution and its relation to other forms of visual masking. Vision Research
44(12);1321–31.
Enns, J.T., and V. DiLollo. 1997. Object substitution: A new form of masking in unattended visual locations.
Psychological Science 8;135–9.
Enns, J.T., and V. DiLollo. 2000. What’s new in visual masking? Trends in Cognitive Sciences 4(9);345–52.
Ernst, M.O., and M.S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415(6870);429–33.
Ernst, M.O., and H.H. Bulthoff. 2004. Merging the senses into a robust percept. Trends in Cognitive Sciences
8(4);162–9.
Ernst, M.O., M.S. Banks, and H.H. Bulthoff. 2000. Touch can change visual slant perception. Nature Neuroscience 3(1);69–73.
Fain, G.L. 2003. Sensory Transduction. Sunderland, MA: Sinauer Associates.
Fendrich, R., and P.M. Corballis. 2001. The temporal cross-capture of audition and vision. Perception &
Psychophysics 63(4);719–25.
Finger, R., and A.W. Davis. 2001. Measuring Video Quality in Videoconferencing Systems. Technical Report
SN187-D. Los Gatos, CA: Pixel Instrument Corporation.
Freeman, E., and J. Driver. 2008. Direction of visual apparent motion driven solely by timing of a static sound.
Current Biology 18(16);1262–6.
Frey, R.D. 1990. Selective attention, event perception and the criterion of acceptability principle: Evidence sup-
porting and rejecting the doctrine of prior entry. Human Movement Science 9;481–530.
Fujisaki, W., and S. Nishida. 2005. Temporal frequency characteristics of synchrony–asynchrony discrimina-
tion of audio-visual signals. Experimental Brain Research 166(3–4);455–64.
Fujisaki, W., and S. Nishida. 2007. Feature-based processing of audio-visual synchrony perception revealed by
random pulse trains. Vision Research 47(8);1075–93.
Fujisaki, W., S. Shimojo, M. Kashino, and S. Nishida. 2004. Recalibration of audiovisual simultaneity. Nature
Neuroscience 7(7);773–8.
Fujisaki, W., A. Koene, D. Arnold, A. Johnston, and S. Nishida. 2006. Visual search for a target changing in
synchrony with an auditory signal. Proceedings of Biological Science 273(1588);865–74.
Getzmann, S. 2007. The effect of brief auditory stimuli on visual apparent motion. Perception 36(7);1089–103.
Grant, K.W., V. van Wassenhove, and D. Poeppel. 2004. Detection of auditory (cross-spectral) and auditory–
visual (cross-modal) synchrony. Speech Communication 44;43–53.
Hanson, J.V., J. Heron, and D. Whitaker. 2008. Recalibration of perceived time across sensory modalities.
Experimental Brain Research 185(2);347–52.
Harrar, V., and L.R. Harris. 2005. Simultaneity constancy: Detecting events with touch and vision. Experimental
Brain Research 166(3–4);465–73.
Harrar, V., and L.R. Harris. 2008. The effect of exposure to asynchronous audio, visual, and tactile stimulus
combinations on the perception of simultaneity. Experimental Brain Research 186(4);517–24.
Heron, J., D. Whitaker, P.V. McGraw, and K.V. Horoshenkov. 2007. Adaptation minimizes distance-related
audiovisual delays. Journal of Vision 7(13);51–8.
Hillyard, S.A., and T.F. Munte. 1984. Selective attention to color and location: An analysis with event-related
brain potentials. Perception & Psychophysics 36(2);185–98.
Hirsh, I.J., and P. Fraisse. 1964. Simultaneous character and succession of heterogenous stimuli. L’Année
Psychologique 64;1–19.
Hirsh, I.J., and C.E. Sherrick. 1961. Perceived order in different sense modalities. Journal of Experimental
Psychology 62(5);423–32.
Jaskowski, P. 1999. Reaction time and temporal-order judgment as measures of perceptual latency: The prob-
lem of dissociations. In G. Aschersleben, T. Bachmann, and J. Müsseler (eds.), Cognitive Contributions
to the Perception of Spatial and Temporal Events (pp. 265–82). North-Holland: Elsevier Science B.V.
Jaskowski, P., and R. Verleger. 2000. Attentional bias toward low-intensity stimuli: An explanation for the
intensity dissociation between reaction time and temporal order judgment? Consciousness and Cognition
9(3);435–56.
Jaskowski, P., F. Jaroszyk, and D. Hojan-Jezierska. 1990. Temporal-order judgments and reaction time for
stimuli of different modalities. Psychological Research, 52(1);35–8.
Jones, J.A., and M. Jarick. 2006. Multisensory integration of speech signals: The relationship between space
and time. Experimental Brain Research 174(3);588–94.
Jones, J.A., and K.G. Munhall. 1997. The effects of separating auditory and visual sources on the audiovisual
integration of speech. Canadian Acoustics 25(4);13–9.
Kanai, R., B.R. Sheth, F.A. Verstraten, and S. Shimojo. 2007. Dynamic perceptual changes in audiovisual
simultaneity. PLoS ONE 2(12);e1253.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18(7);1560–74.
Keetels, M., and J. Vroomen. 2005. The role of spatial disparity and hemifields in audio-visual temporal order
judgements. Experimental Brain Research 167;635–40.
Keetels, M., and J. Vroomen. 2007. No effect of auditory-visual spatial disparity on temporal recalibration.
Experimental Brain Research 182(4);559–65.
Keetels, M., and J. Vroomen. 2008a. Tactile–visual temporal ventriloquism: No effect of spatial disparity.
Perception & Psychophysics 70(5);765–71.
Keetels, M., and  J. Vroomen. 2008b. Temporal recalibration to tactile–visual asynchronous stimuli. Neuroscience
Letters 430(2);130–4.
Keetels, M., and J. Vroomen. 2010. No effect of synesthetic congruency on temporal ventriloquism. Attention,
Perception, & Psychophysics 72(4);871–4.
Keetels, M., J. Stekelenburg, and J. Vroomen. 2007. Auditory grouping occurs prior to intersensory pairing:
Evidence from temporal ventriloquism. Experimental Brain Research 180(3);449–56.
King, A.J. 2005. Multisensory integration: Strategies for synchronization. Current Biology 15(9);R339–41.
King, A.J., and A.R. Palmer. 1985. Integration of visual and auditory information in bimodal neurones in the
guinea-pig superior colliculus. Experimental Brain Research 60(3);492–500.
Kitagawa, N., M. Zampini, and C. Spence. 2005. Audiotactile interactions in near and far space. Experimental
Brain Research 166(3–4);528–37.
Kopinska, A., and L.R. Harris. 2004. Simultaneity constancy. Perception 33(9);1049–60.
Korte, A. 1915. Kinematoskopische untersuchungen. Zeitschrift für Psychologie mit Zeitschrift für Angewandte
Psychologie 72;194–296.
Levitin, D., K. MacLean, M. Mathews, and L. Chu. 2000. The perception of cross-modal simultaneity.
International Journal of Computing and Anticipatory Systems, 323–9.
Lewald, J., and R. Guski. 2003. Cross-modal perceptual integration of spatially and temporally disparate audi-
tory and visual stimuli. Cognitive Brain Research 16(3);468–78.
Lewald, J., and R. Guski. 2004. Auditory–visual temporal integration as a function of distance: No compensa-
tion for sound-transmission time in human perception. Neuroscience Letters 357(2);119–22.
Lewkowicz, D.J. 1996. Perception of auditory-visual temporal synchrony in human infants. Journal of
Experimental Psychology. Human Perception and Performance 22(5);1094–106.
Macaluso, E., N. George, R. Dolan, C. Spence, and J. Driver. 2004. Spatial and temporal factors during process-
ing of audiovisual speech: A PET study. NeuroImage 21(2);725–32.
Macefield, G., S.C. Gandevia, and D. Burke. 1989. Conduction velocities of muscle and cutaneous afferents in
the upper and lower limbs of human subjects. Brain 112(6);1519–32.
Mackay, D.M. 1958. Perceptual stability of a stroboscopically lit visual field containing self-luminous objects.
Nature 181(4607);507–8.
Massaro, D.W., M.M. Cohen, and P.M. Smeele. 1996. Perception of asynchronous and conflicting visual and
auditory speech. Journal of the Acoustical Society of America 100(3);1777–86.
Mattes, S., and R. Ulrich. 1998. Directed attention prolongs the perceived duration of a brief stimulus. Perception
& Psychophysics 60(8);1305–17.
McGrath, M., and Q. Summerfield. 1985. Intermodal timing relations and audio-visual speech recognition by
normal-hearing adults. Journal of the Acoustical Society of America 77(2);678–85.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264(5588);746–8.
Meredith, M.A., J.W. Nemitz, and B.E. Stein. 1987. Determinants of multisensory integration in superior col-
liculus neurons. I. Temporal factors. Journal of Neuroscience 7(10);3215–29.
Mitrani, L., S. Shekerdjiiski, and N. Yakimoff. 1986. Mechanisms and asymmetries in visual perception of
simultaneity and temporal order. Biological Cybernetics 54(3);159–65.
Mollon, J.D., and A.J. Perkins. 1996. Errors of judgement at Greenwich in 1796. Nature 380(6570);101–2.
Morein-Zamir, S., S. Soto-Faraco, and A. Kingstone. 2003. Auditory capture of vision: Examining temporal
ventriloquism. Cognitive Brain Research 17(1);154–63.
Mortlock, A.N., D. Machin, S. McConnell, and P. Sheppard. 1997. Virtual conferencing. BT Technology Journal
15;120–9.
Munhall, K.G., P. Gribble, L. Sacco, and M. Ward. 1996. Temporal constraints on the McGurk effect. Perception
& Psychophysics 58(3);351–62.
Navarra, J., A. Vatakis, M. Zampini et al. 2005. Exposure to asynchronous audiovisual speech extends the tem-
poral window for audiovisual integration. Cognitive Brain Research 25(2);499–507.
Navarra, J., S. Soto-Faraco, and C. Spence. 2007. Adaptation to audiotactile asynchrony. Neuroscience Letters
413(1);72–6.
Navarra, J., J. Hartcher-O’Brien, E. Piazza, and C. Spence. 2009. Adaptation to audiovisual asynchrony modu-
lates the speeded detection of sound. Proceedings of the National Academy of Sciences of the United
States of America 106(23);9169–73.
Neumann, O., and M. Niepel. 2004. Timing of “perception” and perception of “time.” In C. Kaernbach,
E. Schröger, and H. Müller (eds.), Psychophysics Beyond Sensation: Laws and Invariants of Human
Cognition (pp. 245–70): Lawrence Erlbaum Associates, Inc.
Nijhawan, R. 1994. Motion extrapolation in catching. Nature 370(6487);256–7.
Nijhawan, R. 1997. Visual decomposition of colour through motion extrapolation. Nature 386(6620);66–9.
Nijhawan, R. 2002. Neural delays, visual motion and the flash-lag effect. Trends in Cognitive Science
6(9);387.
Noesselt, T., J.W. Rieger, M.A. Schoenfeld et al. 2007. Audiovisual temporal correspondence modulates
human multisensory superior temporal sulcus plus primary sensory cortices. Journal of Neuroscience
27(42);11431–41.
Occelli, V., C. Spence, and M. Zampini. 2008. Audiotactile temporal order judgments in sighted and blind
individuals. Neuropsychologia 46(11);2845–50.
Olivers, C.N., and E. van der Burg. 2008. Bleeping you out of the blink: Sound saves vision from oblivion.
Brain Research 1242;191–9.
Pandey, P.C., H. Kunov, and S.M. Abel. 1986. Disruptive effects of auditory signal delay on speech perception
with lipreading. Journal of Auditory Research 26(1);27–41.
Parise, C., and C. Spence. 2008. Synesthetic congruency modulates the temporal ventriloquism effect.
Neuroscience Letters 442(3);257–61.
Pöppel, E. 1985. Grenzen des Bewusstseins. Stuttgart: Deutsche Verlags-Anstalt; translated as Mindworks: Time
and Conscious Experience. New York: Harcourt Brace Jovanovich. 1988.
Poppel, E., K. Schill, and N. von Steinbuchel. 1990. Sensory integration within temporally neutral systems
states: A hypothesis. Naturwissenschaften 77(2);89–91.
Radeau, M. 1994. Auditory-visual spatial interaction and modularity. Cahiers de Psychologie Cognitive
13(1);3–51.
Rihs, S. 1995. The Influence of Audio on Perceived Picture Quality and Subjective Audio-Visual Delay
Tolerance. Paper presented at the MOSAIC Workshop: Advanced methods for the evaluation of television
picture quality, Eindhoven, 18–19 September.
Roefs, J.A.J. 1963. Perception lag as a function of stimulus luminance. Vision Research 3;81–91.
Rutschmann, J., and R. Link. 1964. Perception of temporal order of stimuli differing in sense mode and simple
reaction time. Perceptual and Motor Skills 18;345–52.
Sanabria, D., S. Soto-Faraco, and C. Spence. 2004. Exploring the role of visual perceptual grouping on the
audiovisual integration of motion. Neuroreport 15(18);2745–9.
Sanford, A.J. 1971. Effects of changes in the intensity of white noise on simultaneity judgements and simple
reaction time. Quarterly Journal of Experimental Psychology 23;296–303.
Scheier, C.R., R. Nijhawan, and S. Shimojo. 1999. Sound alters visual temporal resolution. Investigative
Ophthalmology & Visual Science 40;4169.
Schneider, K.A., and D. Bavelier. 2003. Components of visual prior entry. Cognitive Psychology 47(4);
333–66.
Sekuler, R., A.B. Sekuler, and R. Lau. 1997. Sound alters visual motion perception. Nature 385;308.
Senkowski, D., D. Talsma, M. Grigutsch, C.S. Herrmann, and M.G. Woldorff. 2007. Good times for multisen-
sory integration: Effects of the precision of temporal synchrony as revealed by gamma-band oscillations.
Neuropsychologia 45(3);561–71.
Shams, L., Y. Kamitani, and S. Shimojo. 2002. Visual illusion induced by sound. Cognitive Brain Research
14(1);147–52.
Shimojo, S., C. Scheier, R. Nijhawan et al. 2001. Beyond perceptual modality: Auditory effects on visual per-
ception. Acoustical Science & Technology 22(2);61–67.
Shipley, T. 1964. Auditory flutter-driving of visual flicker. Science 145;1328–30.
Shore, D.I., C. Spence, and R.M. Klein. 2001. Visual prior entry. Psychological Science 12(3);205–12.
Shore, D.I., C. Spence, and R.M. Klein. 2005. Prior entry. In L. Itti, G. Rees, and J. Tsotsos (eds.), Neurobiology
of Attention (pp. 89–95). North Holland: Elsevier.
Slutsky, D.A., and G.H. Recanzone. 2001. Temporal and spatial dependency of the ventriloquism effect.
Neuroreport 12(1);7–10.
Smith, W.F. 1933. The relative quickness of visual and auditory perception. Journal of Experimental Psychology
16;239–257.
Soto-Faraco, S., and A. Alsius. 2007. Conscious access to the unisensory components of a cross-modal illusion.
Neuroreport 18(4);347–50.
Soto-Faraco, S., and A. Alsius. 2009. Deconstructing the McGurk–MacDonald illusion. Journal of Experimental Psychology. Human Perception and Performance 35(2);580–7.
Spence, C., and J. Driver. 1996. Audiovisual links in endogenous covert spatial attention. Journal of Experimental Psychology. Human Perception and Performance 22(4);1005–30.
Spence, C., and J. Driver. 2004. Crossmodal Space and Crossmodal Attention. Oxford: Oxford University
Press.
Spence, C., F. Pavani, and J. Driver. 2000. Crossmodal links between vision and touch in covert endoge-
nous spatial attention. Journal of Experimental Psychology. Human Perception and Performance
26(4);1298–319.
Spence, C., and S. Squire. 2003. Multisensory integration: Maintaining the perception of synchrony. Current
Biology 13(13);R519–21.
Spence, C., D.I. Shore, and R.M. Klein. 2001. Multisensory prior entry. Journal of Experimental Psychology.
General 130(4);799–832.
Spence, C., R. Baddeley, M. Zampini, R. James, and D.I. Shore. 2003. Multisensory temporal order judgments:
When two locations are better than one. Perception & Psychophysics 65(2);318–28.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: The MIT Press.
Stein, B.E., M.A. Meredith, and M.T. Wallace. 1993. The visually responsive neuron and beyond: Multisensory
integration in cat and monkey. Progress in Brain Research 95;79–90.
Stein, B.E., N. London, L.K. Wilkinson, and D.D. Price. 1996. Enhancement of perceived visual intensity by
auditory stimuli: A psychophysical analysis. Journal of Cognitive Neuroscience 8(6);497–506.
Stekelenburg, J.J., and J. Vroomen. 2005. An event-related potential investigation of the time-course of tempo-
ral ventriloquism. Neuroreport 16;641–44.
Stekelenburg, J.J., and J. Vroomen. 2007. Neural correlates of multisensory integration of ecologically valid
audiovisual events. Journal of Cognitive Neuroscience 19(12);1964–73.
Stelmach, L.B., and C.M. Herdman. 1991. Directed attention and perception of temporal order. Journal of
Experimental Psychology. Human Perception and Performance 17(2);539–50.
Sternberg, S., and R.L. Knoll. 1973. The perception of temporal order: Fundamental issues and a general model.
In S. Kornblum (ed.), Attention and Performance (vol. IV, pp. 629–85). New York: Academic Press.
Stetson, C., X. Cui, P.R. Montague, and D.M. Eagleman. 2006. Motor–sensory recalibration leads to an illusory
reversal of action and sensation. Neuron 51(5);651–9.
Stone, J.V., N.M. Hunkin, J. Porrill et al. 2001. When is now? Perception of simultaneity. Proceedings of the
Royal Society of London. Series B. Biological Sciences 268(1462);31–8.
Sugano, Y., M. Keetels, and J. Vroomen. 2010. Adaptation to motor–visual and motor–auditory temporal lags
transfer across modalities. Experimental Brain Research 201(3);393–9.
Sugita, Y., and Y. Suzuki. 2003. Audiovisual perception: Implicit estimation of sound-arrival time. Nature
421(6926);911.
Sumby, W.H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the
Acoustical Society of America 26;212–15.
Summerfield, Q. 1987. A comprehensive account of audio-visual speech perception. In B. Dodd and R.
Campbell (eds.), Hearing by Eye: The Psychology of Lip-Reading (pp. 3–51). London: Lawrence
Erlbaum Associates.
Takahashi, K., J. Saiki, and K. Watanabe. 2008. Realignment of temporal simultaneity between vision and
touch. Neuroreport 19(3);319–22.
Tanaka, A., S. Sakamoto, K. Tsumura, and S. Suzuki. 2009a. Visual speech improves the intelligibility of time-
expanded auditory speech. Neuroreport 20;473–7.
Tanaka, A., S. Sakamoto, K. Tsumura, and Y. Suzuki. 2009b. Visual speech improves the intelligibility of time-
expanded auditory speech. Neuroreport 20(5);473–7.
Teatini, G., M. Ferne, F. Verzella, and J.P. Berruecos. 1976. Perception of temporal order: Visual and auditory
stimuli. Giornale Italiano di Psicologia 3;157–64.
Teder-Salejarvi, W.A., F. Di Russo, J.J. McDonald, and S.A. Hillyard. 2005. Effects of spatial congruity on
audio-visual multimodal integration. Journal of Cognitive Neuroscience 17(9);1396–409.
Titchener, E.B. 1908. Lectures on the Elementary Psychology of Feeling and Attention. New York:
Macmillan.
van de Par, S., and A. Kohlrausch. 2004. Visual and auditory object selection based on temporal correlations
between auditory and visual cues. Paper presented at the 18th International Congress on Acoustics,
Kyoto, Japan.
van der Burg, E., C.N. Olivers, A.W. Bronkhorst, and J. Theeuwes. 2008a. Audiovisual events capture attention:
Evidence from temporal order judgments. Journal of Vision 8(5);2, 1–10.
van der Burg, E., C.N. Olivers, A.W. Bronkhorst, and J. Theeuwes. 2008b. Pip and pop: Nonspatial audi-
tory signals improve spatial visual search. Journal of Experimental Psychology. Human Perception and
Performance 34(5);1053–65.
van der Burg, E., C.N. Olivers, A.W. Bronkhorst, and J. Theeuwes. 2009. Poke and pop: Tactile–visual syn-
chrony increases visual saliency. Neuroscience Letters 450(1);60–4.
Van Eijk, R.L. 2008. Audio-Visual Synchrony Perception. Thesis, Technische Universiteit Eindhoven, The
Netherlands.
Van Eijk, R.L., A. Kohlrausch, J.F. Juola, and S. van de Par. 2008. Audiovisual synchrony and tempo-
ral order judgments: Effects of experimental method and stimulus type. Perception & Psychophysics
70(6);955–68.
van Wassenhove, V., K.W. Grant, and D. Poeppel. 2007. Temporal window of integration in auditory–visual
speech perception. Neuropsychologia 45;598–601.
Vatakis, A., and C. Spence. 2006a. Audiovisual synchrony perception for music, speech, and object actions.
Brain Research 1111(1);134–42.
Vatakis, A., and C. Spence. 2006b. Audiovisual synchrony perception for speech and music assessed using a
temporal order judgment task. Neuroscience Letters 393(1);40–4.
Vatakis, A., and C. Spence. 2007. Crossmodal binding: Evaluating the “unity assumption” using audiovisual
speech stimuli. Perception & Psychophysics 69(5);744–56.
Vatakis, A., and C. Spence. 2008. Evaluating the influence of the ‘unity assumption’ on the temporal perception
of realistic audiovisual stimuli. Acta Psychologica 127(1);12–23.
Vatakis, A., J. Navarra, S. Soto-Faraco, and C. Spence. 2007. Temporal recalibration during asynchronous
audiovisual speech perception. Experimental Brain Research 181(1);173–81.
Vatakis, A., A.A. Ghazanfar, and C. Spence. 2008a. Facilitation of multisensory integration by the “unity effect”
reveals that speech is special. Journal of Vision 8(9);14, 1–11.
Vatakis, A., J. Navarra, S. Soto-Faraco, and C. Spence. 2008b. Audiovisual temporal adaptation of speech:
Temporal order versus simultaneity judgments. Experimental Brain Research 185(3);521–9.
Vibell, J., C. Klinge, M. Zampini, C. Spence, and A.C. Nobre. 2007. Temporal order is coded temporally in the
brain: Early event-related potential latency shifts underlying prior entry in a cross-modal temporal order
judgment task. Journal of Cognitive Neuroscience 19(1);109–20.
von Grunau, M.W. 1986. A motion aftereffect for long-range stroboscopic apparent motion. Perception &
Psychophysics 40(1);31–8.
Von Helmholtz, H. 1867. Handbuch der Physiologischen Optik. Leipzig: Leopold Voss.
Vroomen, J., and B. de Gelder. 2000. Sound enhances visual perception: Cross-modal effects of auditory
organization on vision. Journal of Experimental Psychology. Human Perception and Performance
26(5);1583–90.
Vroomen, J., and B. de Gelder. 2004a. Perceptual effects of cross-modal stimulation: Ventriloquism and the
freezing phenomenon. In G.A. Calvert, C. Spence, and B.E. Stein (eds.). The Handbook of Multisensory
Processes. Cambridge, MA: MIT Press.
Vroomen, J., and B. de Gelder. 2004b. Temporal ventriloquism: Sound modulates the flash-lag effect. Journal
of Experimental Psychology. Human Perception and Performance 30(3);513–8.
Vroomen, J., and M. Keetels. 2006. The spatial constraint in intersensory pairing: No role in temporal ventrilo-
quism. Journal of Experimental Psychology. Human Perception and Performance 32(4);1063–71.
Vroomen, J., and M. Keetels. 2009. Sounds change four-dot masking. Acta Psychologica 130(1);58–63.
Vroomen, J., and J.J. Stekelenburg. 2009. Visual anticipatory information modulates multisensory interactions
of artificial audiovisual stimuli. Journal of Cognitive Neuroscience 22(7);1583–96.
Vroomen, J., M. Keetels, B. de Gelder, and P. Bertelson. 2004. Recalibration of temporal order perception by
exposure to audio-visual asynchrony. Cognitive Brain Research 22(1);32–5.
Watanabe, K., and S. Shimojo. 2001. When sound affects vision: Effects of auditory grouping on visual motion
perception. Psychological Science 12(2);109–16.
Welch, R.B. 1999. Meaning, attention, and the “unity assumption” in the intersensory bias of spatial and tem-
poral perceptions. In G. Aschersleben, T. Bachmann, and J. Müsseler (eds.), Cognitive Contributions to
the Perception of Spatial and Temporal Events (pp. 371–87). Amsterdam: Elsevier.
Welch, R.B., and D.H. Warren. 1980. Immediate perceptual response to intersensory discrepancy. Psychological
Bulletin 88(3);638–67.
Yamamoto, S., M. Miyazaki, T. Iwano, and S. Kitazawa. 2008. Bayesian calibration of simultaneity in audio-
visual temporal order judgment. Paper presented at the 9th International Multisensory Research Forum
(July 16–19, 2008). Hamburg, Germany.
Zampini, M., D.I. Shore, and C. Spence. 2003a. Audiovisual temporal order judgments. Experimental Brain
Research 152(2);198–210.
Zampini, M., D.I. Shore, and C. Spence. 2003b. Multisensory temporal order judgments: The role of hemi-
spheric redundancy. International Journal of Psychophysiology 50(1–2);165–80.
Zampini, M., T. Brown, D.I. Shore et al. 2005a. Audiotactile temporal order judgments. Acta Psychologica
118(3);277–91.
Zampini, M., S. Guest, D.I. Shore, and C. Spence. 2005b. Audio-visual simultaneity judgments. Perception &
Psychophysics 67(3);531–44.
Zampini, M., D.I. Shore, and C. Spence. 2005c. Audiovisual prior entry. Neuroscience Letters 381(3);217–22.
10 Representation of Object
Form in Vision and Touch
Simon Lacey and Krish Sathian

CONTENTS
10.1 Introduction........................................................................................................................... 179
10.2 Cortical Regions Involved in Visuo-Haptic Shape Processing............................................. 179
10.2.1 Lateral Occipital Complex......................................................................................... 179
10.2.2 Parietal Cortical Regions........................................................................................... 180
10.3 Do Vision and Touch Share a Common Shape Representation?........................................... 180
10.3.1 Potential Role of Visual Imagery.............................................................................. 180
10.3.2 A Modality-Independent Shape Representation?...................................................... 181
10.4 Properties of Shared Representation..................................................................................... 182
10.4.1 View-Dependence in Vision and Touch.................................................................... 182
10.4.2 Cross-Modal View-Independence............................................................................. 183
10.5 An Integrative Framework for Visuo-Haptic Shape Representation..................................... 183
Acknowledgments........................................................................................................................... 184
References....................................................................................................................................... 184

10.1  INTRODUCTION
The idea that the brain processes sensory inputs in parallel modality-specific streams has given way
to the concept of a “metamodal” brain with a multisensory task-based organization (Pascual-Leone
and Hamilton 2001). For example, recent research shows that many cerebral cortical regions previ-
ously considered to be specialized for processing various aspects of visual input are also activated
during analogous tactile or haptic tasks (reviewed by Sathian and Lacey 2007). In this article,
which concentrates on shape processing in humans, we review the current state of knowledge about
the mental representation of object form in vision and touch. We begin by describing the cortical
regions showing multisensory responses to object form. Next, we consider the extent to which the
underlying representation of object form is explained by cross-modal visual imagery or multisen-
sory convergence. We then review recent work on the view-dependence of visuo-haptic shape rep-
resentations and the resulting model of a multisensory, view-independent representation. Finally,
we discuss a recently presented conceptual framework of visuo-haptic shape processing as a basis
for future investigations.

10.2  CORTICAL REGIONS INVOLVED IN VISUO-HAPTIC SHAPE PROCESSING


10.2.1  Lateral Occipital Complex
Most notable among the several cortical regions implicated in visuo-haptic shape processing
is the lateral occipital complex (LOC), an object-selective region in the ventral visual pathway
(Malach et al. 1995). Part of the LOC responds selectively to objects in both vision and touch and
has been termed LOtv (Amedi et al. 2001, 2002). The LOC is shape-selective during both haptic
three-dimensional shape perception (Amedi et al. 2001; Stilla and Sathian 2008; Zhang et al. 2004)
and tactile two-dimensional shape perception (Stoesz et al. 2003; Prather et al. 2004). Neurological
case studies indicate that the LOC is necessary for both haptic and visual shape perception: a patient
with a left occipitotemporal cortical lesion, likely including the LOC, was found to exhibit tactile
in addition to visual agnosia (inability to recognize objects), although somatosensory cortex and
basic somatosensory function were intact (Feinberg et al. 1986). Another patient with bilateral LOC
lesions could not learn new objects either visually or haptically (James et al. 2006). LOtv is thought
to be a processor of geometric shape because it is not activated during object recognition triggered
by object-specific sounds (Amedi et al. 2002). Interestingly, though, LOtv does respond when audi-
tory object recognition is mediated by a visual–auditory sensory substitution device that converts
visual shape information into an auditory stream, but only when individuals (whether sighted or
blind) are specifically trained in a manner permitting generalization to untrained objects and not
when merely arbitrary associations are taught (Amedi et al. 2007). This dissociation further bolsters
the idea that LOtv is concerned with geometric shape information, regardless of the input sensory
modality.

10.2.2  Parietal Cortical Regions


Multisensory shape selectivity also occurs in parietal cortical regions, including the postcentral sul-
cus (Stilla and Sathian 2008), which is the location of Brodmann’s area 2 in human primary soma-
tosensory cortex (S1; Grefkes et al. 2001). Although this region is generally assumed to be purely
somatosensory, earlier neurophysiological observations in monkeys suggested visual responsiveness
in parts of S1 (Iwamura 1998; Zhou and Fuster 1997). Visuo-haptic shape selectivity has also repeat-
edly been reported in various parts of the human intraparietal sulcus (IPS), which is squarely in clas-
sical multisensory cortex. The particular bisensory foci lie either anteriorly in the IPS (Grefkes et al.
2002; Stilla and Sathian 2008), within either the region referred to as the anterior intraparietal area
(AIP; Grefkes and Fink 2005; Shikata et al. 2008) or that termed the medial intraparietal area (Grefkes
et al. 2004), or posteroventrally (Saito et al. 2003; Stilla and Sathian 2008) in a region comprising the
caudal intraparietal area (CIP; Shikata et al. 2008) and the adjacent, retinotopically mapped areas
IPS1 and V7 (Swisher et al. 2007). It should be noted that areas AIP, medial intraparietal, CIP, and V7
were first described in macaque monkeys, and their homologies in humans remain somewhat uncer-
tain. A recent study reported that repetitive transcranial magnetic stimulation over the left anterior
IPS impaired visual–haptic, but not haptic–visual, shape matching using the right hand (Buelte et al.
2008). However, repetitive transcranial magnetic stimulation over the right AIP during shape match-
ing with the left hand had no effect on either cross-modal condition. The reason for this discrepancy
is unclear, and emphasizes that the exact roles of the postcentral sulcus, the IPS regions, and LOtv in
multisensory shape processing remain to be fully worked out.

10.3 DO VISION AND TOUCH SHARE A COMMON SHAPE REPRESENTATION?


10.3.1  Potential Role of Visual Imagery
An intuitively appealing explanation for haptically evoked activation of visual cortex is that this is
mediated by visual imagery rather than multisensory convergence of inputs (Sathian et al. 1997).
The visual imagery hypothesis is supported by evidence that the LOC is active during visual imag-
ery. For example, the left LOC is active during mental imagery of familiar objects previously
explored haptically by blind individuals or visually by sighted individuals (De Volder et al. 2001),
and also during recall of both geometric and material object properties from memory (Newman et
al. 2005). Furthermore, individual differences in ratings of the vividness of visual imagery were
found to strongly predict individual differences in haptic shape-selective activation magnitudes in
the right LOC (Zhang et al. 2004). On the other hand, the magnitude of LOC activation during
visual imagery can be considerably less than during haptic shape perception, suggesting that visual
imagery may be relatively unimportant in haptic shape perception (Amedi et al. 2001; see also Reed
et al. 2004). However, performance on the visual imagery task has not generally been monitored, so
that lower levels of LOC activity during visual imagery could simply reflect participants not main-
taining their visual images throughout the imagery scan. Because both the early and late blind show
shape-related activity in the LOC evoked by tactile input (Amedi et al. 2003; Burton et al. 2002;
Pietrini et al. 2004; Stilla et al. 2008; reviewed by Pascual-Leone et al. 2005; Sathian 2005; Sathian
and Lacey 2007), or by auditory input when sensory substitution devices were used (Amedi et al.
2007; Arno et al. 2001; Renier et al. 2004, 2005), some have concluded that visual imagery does not
account for cross-modal activation of visual cortex. Although this is true for the early-blind, it cer-
tainly does not exclude the use of visual imagery in the sighted, especially in view of the abundant
evidence for cross-modal plasticity resulting from visual deprivation (Pascual-Leone et al. 2005;
Sathian 2005; Sathian and Lacey 2007).
It is also important to be clear about what is meant by “visual imagery,” which is often treated
as a unitary ability. Recent research has shown that there are two different kinds of visual imagery:
“object imagery” (images that are pictorial and deal with the actual appearance of objects in terms
of shape, color, brightness, and other surface properties) and “spatial imagery” (more schematic
images dealing with the spatial relations of objects and their component parts and with spatial
transformations; Kozhevnikov et al. 2002, 2005; Blajenkova et al. 2006). This distinction is relevant
because both vision and touch encode spatial information about objects—for example, size, shape,
and the relative positions of different object features—and such information may well be encoded in a
modality-independent spatial representation (Lacey and Campbell 2006). Support for this possibil-
ity is provided by recent work showing that spatial, but not object, imagery scores were correlated
with accuracy on cross-modal, but not within-modal, object identification for a set of closely similar
and previously unfamiliar objects (Lacey et al. 2007a). Thus, it is probably beneficial to explore the
roles of object and spatial imagery rather than taking an undifferentiated visual imagery approach.
We return to this idea later but, as an aside, we note that the object–spatial dimension of imagery
can be viewed as orthogonal to the modality involved, as there is evidence that early-blind individu-
als perform both object-based and spatially based tasks equally well (Aleman et al. 2001; see also
Noordzij et al. 2007). However, the object–spatial dimension of haptically derived representations
remains unexplored.

10.3.2  A Modality-Independent Shape Representation?


An alternative to the visual imagery hypothesis is that incoming inputs in both vision and touch
converge on a modality-independent representation, which is suggested by the overlap of visual and
haptic shape-selective activity in the LOC. Some researchers refer to such modality-independent
representations as “amodal,” but we believe that this term is best reserved for linguistic or other
abstract representations. Instead, we suggest the use of the term “multisensory” to refer to a rep-
resentation that can be encoded and retrieved by multiple sensory systems and which retains the
modality “tags” of the associated inputs (Sathian 2004). The multisensory hypothesis is suggested
by studies of effective connectivity derived from functional magnetic resonance imaging (fMRI)
data indicating bottom-up projections from S1 to the LOC (Peltier et al. 2007; Deshpande et al.
2008) and also by electrophysiological data showing early propagation of activity from S1 into the
LOC during tactile shape discrimination (Lucan et al. 2011).
If vision and touch engage a common spatial representational system, then we would expect to
see similarities in processing of visually and haptically derived representations and this, in fact,
turns out to be the case. Thus, LOC activity is greater when viewing objects previously primed hap-
tically, compared to viewing nonprimed objects (James et al. 2002b). In addition, behavioral studies
have shown that cross-modal priming is as effective as within-modal priming (Easton et al. 1997a,
1997b; Reales and Ballesteros 1999). Candidate regions for housing a common visuo-haptic shape
representation include the right LOC and the left CIP because activation magnitudes during visual
and haptic processing of (unfamiliar) shape are significantly correlated across subjects in these
regions (Stilla and Sathian 2008). Furthermore, the time taken to scan both visual images (Kosslyn
1973; Kosslyn et al. 1978) and haptically derived images (Röder and Rösler 1998) increases with
the spatial distance to be inspected. Also, the time taken to judge whether two objects are the same
or mirror images increases nearly linearly with increasing angular disparity between the objects
for mental rotation of both visual (Shepard and Metzler 1971) and haptic stimuli (Marmor and
Zaback 1976; Carpenter and Eisenberg 1978; Hollins 1986; Dellantonio and Spagnolo 1990). The
same relationship was found when the angle between a tactile stimulus and a canonical angle was
varied, with associated activity in the left anterior IPS (Prather et al. 2004), an area also active dur-
ing mental rotation of visual stimuli (Alivisatos and Petrides 1997), and probably corresponding to
AIP (Grefkes and Fink 2005; Shikata et al. 2008). Similar processing has been found with sighted,
early- and late-blind individuals (Carpenter and Eisenberg 1978; Röder and Rösler 1998). These
findings suggest that spatial metric information is preserved in both vision and touch, and that both
modalities rely on similar, if not identical, imagery processes (Röder and Rösler 1998).
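
To make the chronometric argument explicit, these findings can be summarized by simple linear models; the symbols below are our own shorthand for illustration, not equations taken from the studies cited:

```latex
% Illustrative linear chronometric models (our notation, not from the cited studies)
\begin{align*}
  RT_{\mathrm{scan}}(d)          &\approx a + b\,d,      \qquad b > 0,\\
  RT_{\mathrm{rotation}}(\theta) &\approx c + k\,\theta, \qquad k > 0,
\end{align*}
```

where d is the spatial distance to be inspected across the image, θ is the angular disparity between the two objects, and the slopes b and k index the scanning and rotation rates. The key observation is that comparable slopes are obtained whether the images are visually or haptically derived.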

10.4  PROPERTIES OF SHARED REPRESENTATION


In this section, we discuss the properties of the multisensory representation of object form with par-
ticular reference to recent work on view-independence in visuo-haptic object recognition. The repre-
sentation of an object is said to be view-dependent if rotating the object away from the learned view
impairs object recognition, that is, optimal recognition depends on perceiving the same view of the
object. By contrast, a representation is view-independent if objects are correctly identified despite
being rotated to provide a different view. The shared multisensory representation that enables cross-
modal object recognition is likely distinct from the separate unisensory representations that support
visual and haptic within-modal object recognition: we examine the relationship between these.
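
As a purely illustrative way of operationalizing this definition (the function, accuracies, and interpretation below are our own assumptions, not a measure used in the studies reviewed here), view-dependence can be expressed as the recognition cost incurred when the test view differs from the learned view:

```python
def rotation_cost(acc_learned_view, acc_rotated_view):
    """Drop in recognition accuracy (proportion correct) when an object is tested
    at a rotated view rather than the learned view. Values near zero indicate
    view-independence; clearly positive values indicate view-dependence."""
    return acc_learned_view - acc_rotated_view

# Hypothetical accuracies: within-modal recognition suffers after rotation,
# cross-modal recognition does not (cf. Lacey et al. 2007a).
print(rotation_cost(acc_learned_view=0.85, acc_rotated_view=0.65))  # ~0.20 -> view-dependent
print(rotation_cost(acc_learned_view=0.80, acc_rotated_view=0.79))  # ~0.01 -> view-independent
```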

10.4.1  View-Dependence in Vision and Touch


It has long been known that visual object representations are view-dependent (reviewed by Peissig
and Tarr 2007) but it might be expected that haptic object representations are view-independent
because the hands can simultaneously contact an object from different sides (Newell et al. 2001).
This expectation is reinforced because following the contours of a three-dimensional object is nec-
essary for haptic object recognition (Lederman and Klatzky 1987). Nonetheless, several studies have
shown that haptic object representations are in fact view-dependent for both unfamiliar (Newell et
al. 2001; Lacey et al. 2007a) and familiar objects (Lawson 2009). This may be because the bio-
mechanics of the hands can be restrictive in some circumstances: some hand positions naturally
facilitate exploration more than others (Woods et al. 2008). Furthermore, for objects with a vertical
main axis, haptic exploration is biased to the far (back) “view” of an object, explored by the fingers
whereas the thumbs stabilize the object rather than explore it (Newell et al. 2001). However, haptic
recognition remains view-dependent even when similar objects are presented so that their main axis
is horizontal, an orientation that allows freer, more comprehensive haptic exploration of multiple
object surfaces (Lacey et al. 2007a). The extent to which visual object recognition is impaired by
changes in orientation depends on the particular axis of rotation: picture-plane rotations are less
disruptive than depth-plane rotations in both object recognition and mental rotation tasks, even
though these tasks depend on different visual pathways—ventral and dorsal, respectively (Gauthier
et al. 2002). By contrast, haptic object recognition is equally disrupted by rotation about each of the
three main axes (Lacey et al. 2007a). Thus, although visual and haptic unisensory representations
may be functionally equivalent in that they are both view-dependent, the underlying basis for this
may be very different in each case.
A further functional equivalence between visual and haptic object representation is that each
has preferred or canonical views of objects. In vision, the preferred view for both familiar and
unfamiliar objects is one in which the main axis is angled at 45° to the observer (Palmer et al. 1981;
Perrett et al. 1992). Recently, Woods et al. (2008) have shown that haptic object recognition also has
canonical views—again independently of familiarity—but that these are defined by reference to the
midline of the observer’s body, the object’s main axis being aligned either parallel or perpendicu-
lar to the midline. This may be due to grasping and object function: Craddock and Lawson (2008)
found that haptic recognition was better for objects in typical rather than atypical orientations; for
example, a cup oriented with the handle to the right for a right-handed person.

10.4.2  Cross-Modal View-Independence


Remarkably, although visual and haptic within-modal object recognition are both view-dependent,
visuo-haptic cross-modal recognition is view-independent (Lacey et al. 2007a; Ueda and Saiki
2007). Rotating an object away from the learned view did not degrade recognition, whether visual
study was followed by haptic test or vice versa (Lacey et al. 2007a; Ueda and Saiki 2007), although
Lawson (2009) found view-independence only in the haptic study–visual test condition. Cross-modal
object recognition was also independent of the particular axis of rotation (Lacey et al. 2007a). Thus,
visuo-haptic cross-modal object recognition clearly relies on a different representation from that
involved in the corresponding within-modal task (see also Newell et al. 2005).
In a recent series of experiments, we used a perceptual learning paradigm to investigate the
relationship between the unisensory view-dependent and multisensory view-independent represen-
tations (Lacey et al. 2009a). We showed that a relatively brief period of within-modal learning to
establish within-modal view-independence resulted in complete, symmetric cross-modal transfer of
view-independence: visual view-independence acquired following exclusively visual learning also
resulted in haptic view-independence, and vice versa. In addition, both visual–haptic and haptic–­
visual cross-modal learning also transformed visual and haptic within-modal recognition from view-
dependent to view-independent. We concluded from this study that visual and haptic within-modal
and visuo-haptic cross-modal view-independence all rely on the same shared representation. Thus,
this study and its predecessor (Lacey et al. 2007a) suggest a model of view-independence in which
separate, view-dependent, unisensory representations feed directly into a view-independent, bisen-
sory representation rather than being routed through intermediate, unisensory, view-independent
representations. A possible mechanism for this is the integration of multiple low-level, view-dependent,
unisensory representations into a higher-order, view-independent, multisensory representation (see
Riesenhuber and Poggio 1999 for a similar proposal regarding visual object recognition). Cortical
localization of this modality-independent, view-independent representation is an important goal for
future work. Although the IPS is a potential candidate, being a well-known convergence site for
visual and haptic shape processing (Amedi et al. 2001; James et al. 2002b; Zhang et al. 2004; Stilla
and Sathian 2008), IPS responses appear to be view-dependent (James et al. 2002a). The LOC also
shows convergent multisensory shape processing; however, responses in this area have shown view-
dependence in some studies (Grill-Spector et al. 1999; Gauthier et al. 2002) but view-independence
in other studies (James et al. 2002a).
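
A minimal sketch of the pooling idea mentioned above, assuming Gaussian view tuning and a MAX pooling operation in the spirit of Riesenhuber and Poggio (1999); the tuning bandwidth, the set of learned views, and the function names are illustrative assumptions rather than parameters from any of the studies cited:

```python
import numpy as np

def view_tuned_response(theta, preferred, bandwidth=45.0):
    """Response of a low-level, view-dependent unit with Gaussian tuning to
    object orientation in degrees (purely illustrative)."""
    return np.exp(-0.5 * ((theta - preferred) / bandwidth) ** 2)

def pooled_response(theta, preferred_views):
    """Higher-order unit that pools its view-tuned inputs with a MAX operation,
    in the spirit of Riesenhuber and Poggio (1999)."""
    return max(view_tuned_response(theta, p) for p in preferred_views)

preferred_views = [0.0, 90.0, 180.0, 270.0]   # hypothetical learned views
for theta in [0.0, 45.0, 135.0]:
    single = view_tuned_response(theta, preferred_views[0])
    print(f"orientation {theta:5.1f} deg: single view-tuned unit = {single:.2f}, "
          f"pooled unit = {pooled_response(theta, preferred_views):.2f}")
# The single unit's response collapses away from its preferred view, whereas the
# pooled response stays comparatively flat, i.e., it behaves view-independently.
```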

10.5  AN INTEGRATIVE FRAMEWORK FOR VISUO-HAPTIC SHAPE REPRESENTATION
An important goal of multisensory research is to model the processes underlying visuo-haptic object
representation. As a preliminary step to this goal, we have recently investigated connectivity and
intertask correlations of activation magnitudes during visual object imagery and haptic perception
of both familiar and unfamiliar objects (Deshpande et al. 2010; Lacey et al. 2010). In the visual
object imagery task, participants listened to word pairs and decided whether the objects designated
by those words had the same or different shapes. Thus, in contrast with earlier studies, participants
had to process their images throughout the scan and this could be verified by monitoring their
performance. In a separate session, participants performed a haptic shape discrimination task. For
one group of subjects, the haptic objects were familiar; for the other group, they were unfamiliar.
We found that both intertask correlations and connectivity were modulated by object familiarity
(Deshpande et al. 2010; Lacey et al. 2010). Although the LOC was active bilaterally during both
visual object imagery and haptic shape perception, there was an intertask correlation only for famil-
iar shape. Analysis of connectivity showed that visual object imagery and haptic familiar shape
perception engaged quite similar networks characterized by top-down paths from prefrontal and
parietal regions into the LOC, whereas a very different network emerged during haptic perception
of unfamiliar shape, featuring bottom-up inputs from S1 to the LOC (Deshpande et al. 2010).
Based on these findings and on the literature reviewed earlier in this chapter, we proposed a
conceptual framework for visuo-haptic object representation that integrates the visual imagery and
multisensory approaches (Lacey et al. 2009b). In this proposed framework, the LOC houses a rep-
resentation that is independent of the input sensory modality and is flexibly accessible via either
bottom-up or top-down pathways, depending on object familiarity (or other task attributes). Haptic
perception of familiar shape uses visual object imagery via top-down paths from prefrontal and
parietal areas into the LOC whereas haptic perception of unfamiliar shape may use spatial imagery
processes and involves bottom-up pathways from the somatosensory cortex to the LOC. Because
there is no stored representation of an unfamiliar object, its global shape has to be computed by
exploring it in its entirety and the framework would therefore predict the somatosensory drive of
LOC. The IPS has been implicated in visuo-haptic perception of both shape and location (Stilla
and Sathian 2008; Gibson et al. 2008). We might therefore expect that, to compute global shape in
unfamiliar objects, the IPS would be involved in processing the relative spatial locations of object
parts. For familiar objects, global shape can be inferred easily, perhaps from distinctive features
that are sufficient to retrieve a visual image, and so the framework predicts increased contribution
from parietal and prefrontal regions. Clearly, objects are not exclusively familiar or unfamiliar and
individuals are not purely object or spatial imagers: these are continua along which objects and indi-
viduals may vary. In this respect, an individual differences approach is likely to be productive (see
Lacey et al. 2007b; Motes et al. 2008) because these factors may interact, with different weights in
different circumstances, for example task demands or individual history (visual experience, train-
ing, etc.). More work is required to define and test this framework.

ACKNOWLEDGMENTS
This work was supported by the National Eye Institute, the National Science Foundation, and the
Veterans Administration.

REFERENCES
Aleman, A., L. van Lee, M.H.M. Mantione, I.G. Verkoijen, and E.H.F. de Haan. 2001. Visual imagery without
visual experience: Evidence from congenitally totally blind people. Neuroreport 12:2601–2604.
Alivisatos, B., and M. Petrides. 1997. Functional activation of the human brain during mental rotation.
Neuropsychologia 35:111–118.
Amedi, A., R. Malach, T. Hendler, S. Peled, and E. Zohary. 2001. Visuo-haptic object-related activation in the
ventral visual pathway. Nature Neuroscience 4:324–330.
Amedi, A., G. Jacobson, T. Hendler, R. Malach, and E. Zohary. 2002. Convergence of visual and tactile shape
processing in the human lateral occipital complex. Cerebral Cortex 12:1202–1212.
Amedi, A., N. Raz, P. Pianka, R. Malach, and E. Zohary. 2003. Early ‘visual’ cortex activation correlates with
superior verbal memory performance in the blind. Nature Neuroscience 6:758–766.
Amedi, A., W.M. Stern, J.A. Camprodon et al. 2007. Shape conveyed by visual-to-auditory sensory substitution
activates the lateral occipital complex. Nature Neuroscience 10:687–689.
Arno, P., A.G. De Volder, A. Vanlierde et al. 2001. Occipital activation by pattern recognition in the early blind
using auditory substitution for vision. NeuroImage 13:632–645.
Blajenkova, O., M. Kozhevnikov, and M.A. Motes. 2006. Object-spatial imagery: A new self-report imagery
questionnaire. Applied Cognitive Psychology 20:239–263.
Buelte, D., I.G. Meister, M. Staedtgen et al. 2008. The role of the anterior intraparietal sulcus in crossmodal
processing of object features in humans: An rTMS study. Brain Research 1217:110–118.
Burton, H., A.Z. Snyder, T.E. Conturo, E. Akbudak, J.M. Ollinger, and M.E. Raichle. 2002. Adaptive changes
in early and late blind: A fMRI study of Braille reading. Journal of Neurophysiology 87:589–607.
Carpenter, P.A., and P. Eisenberg. 1978. Mental rotation and the frame of reference in blind and sighted indi-
viduals. Perception & Psychophysics 23:117–124.
Craddock, M., and R. Lawson. 2008. Repetition priming and the haptic recognition of familiar and unfamiliar
objects. Perception & Psychophysics 70:1350–1365.
Dellantonio, A., and F. Spagnolo. 1990. Mental rotation of tactual stimuli. Acta Psychologica 73:245–257.
Deshpande, G., X. Hu, R. Stilla, and K. Sathian. 2008. Effective connectivity during haptic perception: A
study using Granger causality analysis of functional magnetic resonance imaging data. NeuroImage
40:1807–1814.
Deshpande, G., X. Hu, S. Lacey, R. Stilla, and K. Sathian. 2010. Object familiarity modulates effective con-
nectivity during haptic shape perception. NeuroImage 49:1991–2000.
De Volder, A.G., H. Toyama, Y. Kimura et al. 2001. Auditory triggered mental imagery of shape involves visual
association areas in early blind humans. NeuroImage 14:129–139.
Easton, R.D., A.J. Greene, and K. Srinivas. 1997a. Transfer between vision and haptics: Memory for 2-D pat-
terns and 3-D objects. Psychonomic Bulletin & Review 4:403–410.
Easton, R.D., K. Srinivas, and A.J. Greene. 1997b. Do vision and haptics share common representations?
Implicit and explicit memory within and between modalities. Journal of Experimental Psychology.
Learning, Memory, and Cognition 23:153–163.
Feinberg, T.E., L.J. Rothi, and K.M. Heilman. 1986. Multimodal agnosia after unilateral left hemisphere lesion.
Neurology 36:864–867.
Gauthier, I., W.G. Hayward, M.J. Tarr et al. 2002. BOLD activity during mental rotation and view-dependent
object recognition. Neuron 34:161–171.
Gibson, G., R. Stilla, and K. Sathian. 2008. Segregated visuo-haptic processing of texture and location. Abstract,
Human Brain Mapping.
Grefkes, C., S. Geyer, T. Schormann, P. Roland, and K. Zilles. 2001. Human somatosensory area 2: Observer-
independent cytoarchitectonic mapping, interindividual variability, and population map. NeuroImage
14:617–631.
Grefkes, C., P.H. Weiss, K. Zilles, and G.R. Fink. 2002. Crossmodal processing of object features in human
anterior intraparietal cortex: An fMRI study implies equivalencies between humans and monkeys. Neuron
35:173–184.
Grefkes, C., A. Ritzl, K. Zilles, and G.R. Fink. 2004. Human medial intraparietal cortex subserves visuomotor
coordinate transformation. NeuroImage 23:1494–1506.
Grefkes, C., and G. Fink. 2005. The functional organization of the intraparietal sulcus in humans and monkeys.
Journal of Anatomy 207:3–17.
Grill-Spector, K., T. Kushnir, S. Edelman, G. Avidan, Y. Itzchak, and R. Malach. 1999. Differential processing of
objects under various viewing conditions in the human lateral occipital complex. Neuron 24:187–203.
Hollins, M. 1986. Haptic mental rotation: More consistent in blind subjects? Journal of Visual Impairment &
Blindness 80:950–952.
Iwamura, Y. 1998. Hierarchical somatosensory processing. Current Opinion in Neurobiology 8:522–528.
James, T.W., G.K. Humphrey, J.S. Gati, R.S. Menon, and M.A. Goodale. 2002a. Differential effects of view on
object-driven activation in dorsal and ventral streams. Neuron 35:793–801.
James, T.W., G.K. Humphrey, J.S. Gati, P. Servos, R.S. Menon, and M.A. Goodale. 2002b. Haptic study of
three-dimensional objects activates extrastriate visual areas. Neuropsychologia 40:1706–1714.
James, T.W., K.H. James, G.K. Humphrey, and M.A. Goodale. 2006. Do visual and tactile object representa-
tions share the same neural substrate? In Touch and Blindness: Psychology and Neuroscience, ed. M.A.
Heller and S. Ballesteros, 139–155. Mahwah, NJ: Lawrence Erlbaum Associates.
Kosslyn, S.M. 1973. Scanning visual images: Some structural implications. Perception & Psychophysics
14:90–94.
Kosslyn, S.M., T.M. Ball, and B.J. Reiser. 1978. Visual images preserve metric spatial information: Evidence
from studies of image scanning. Journal of Experimental Psychology. Human Perception and Performance
4:47–60.
Kozhevnikov, M., M. Hegarty, and R.E. Mayer. 2002. Revising the visualiser–verbaliser dimension: Evidence
for two types of visualisers. Cognition and Instruction 20:47–77.
Kozhevnikov, M., S.M. Kosslyn, and J. Shephard. 2005. Spatial versus object visualisers: A new characterisa-
tion of cognitive style. Memory & Cognition 33:710–726.
Lacey, S., and C. Campbell. 2006. Mental representation in visual/haptic crossmodal memory: Evidence from
interference effects. Quarterly Journal of Experimental Psychology 59:361–376.
Lacey, S., A. Peters, and K. Sathian. 2007a. Cross-modal object representation is viewpoint-independent. PLoS
ONE 2:e890. doi: 10.1371/journal.pone0000890.
Lacey, S., C. Campbell, and K. Sathian. 2007b. Vision and touch: Multiple or multisensory representations of
objects? Perception 36:1513–1521.
Lacey, S., M. Pappas, A. Kreps, K. Lee, and K. Sathian. 2009a. Perceptual learning of view-independence in
visuo-haptic object representations. Experimental Brain Research 198:329–337.
Lacey, S., N. Tal, A. Amedi, and K. Sathian. 2009b. A putative model of multisensory object representation.
Brain Topography 21:269–274.
Lacey, S., P. Flueckiger, R. Stilla, M. Lava, and K. Sathian. 2010. Object familiarity modulates the relationship
between visual object imagery and haptic shape perception. NeuroImage 49:1977–1990.
Lawson, R. 2009. A comparison of the effects of depth rotation on visual and haptic three-dimensional object
recognition. Journal of Experimental Psychology. Human Perception and Performance 35:911–930.
Lederman, S.J., and R.L. Klatzky. 1987. Hand movements: A window into haptic object recognition. Cognitive
Psychology 19:342–368.
Lucan, J.N., J.J. Foxe, M. Gomez-Ramirez, K. Sathian, and S. Molholm. 2011. Tactile shape discrimination
recruits human lateral occipital complex during early perceptual processing. Human Brain Mapping
31:1813–1821.
Malach, R., J.B. Reppas, R.R. Benson et al. 1995. Object-related activity revealed by functional magnetic reso-
nance imaging in human occipital cortex. Proceedings of the National Academy of Sciences of the United
States of America 92:8135–8139.
Marmor, G.S., and L.A. Zaback. 1976. Mental rotation by the blind: Does mental rotation depend on visual
imagery? Journal of Experimental Psychology. Human Perception and Performance 2:515–521.
Motes, M.A., R. Malach, and M. Kozhevnikov. 2008. Object-processing neural efficiency differentiates object
from spatial visualizers. Neuroreport 19:1727–1731.
Newell, F.N., M.O. Ernst, B.S. Tjan, and H.H. Bülthoff. 2001. View dependence in visual and haptic object
recognition. Psychological Science 12:37–42.
Newell, F.N., A.T. Woods, M. Mernagh, and H.H. Bülthoff. 2005. Visual, haptic and crossmodal recognition of
scenes. Experimental Brain Research 161:233–242.
Newman, S.D., R.L. Klatzky, S.J. Lederman, and M.A. Just. 2005. Imagining material versus geometric prop-
erties of objects: An fMRI study. Cognitive Brain Research 23:235–246.
Noordzij, M.L., S. Zuidhoek, and A. Postma. 2007. The influence of visual experience on visual and spatial
imagery. Perception 36:101–112.
Palmer, S., E. Rosch, and P. Chase. 1981. Canonical perspective and the perception of objects. In Attention
and Performance IX, ed. J.B. Long and A.D. Baddeley, 135–151. Hillsdale, NJ: Lawrence Earlbaum
Associates.
Pascual-Leone, A., and R.H. Hamilton. 2001. The metamodal organization of the brain. Progress in Brain
Research 134:427–445.
Pascual-Leone, A., A. Amedi, F. Fregni, and L.B. Merabet. 2005. The plastic human brain. Annual Review of
Neuroscience 28:377–401.
Peissig, J.J., and M.J. Tarr. 2007. Visual object recognition: Do we know more now than we did 20 years ago?
Annual Review of Psychology 58:75–96.
Peltier, S., R. Stilla, E. Mariola, S. LaConte, X. Hu, and K. Sathian. 2007. Activity and effective connectivity of
parietal and occipital cortical regions during haptic shape perception. Neuropsychologia 45:476–483.
Perrett, D.I., M.H. Harries, and S. Looker. 1992. Use of preferential inspection to define the viewing sphere and
characteristic views of an arbitrary machined tool part. Perception 21:497–515.
Pietrini, P., M.L. Furey, E. Ricciardi et al. 2004. Beyond sensory images: Object-based representation in the
human ventral pathway. Proceedings of the National Academy of Sciences of the United States of America
101:5658–5663.
Prather, S.C., J.R. Votaw, and K. Sathian. 2004. Task-specific recruitment of dorsal and ventral visual areas
during tactile perception. Neuropsychologia 42:1079–1087.
Reales, J.M., and S. Ballesteros. 1999. Implicit and explicit memory for visual and haptic objects: Cross-modal
priming depends on structural descriptions. Journal of Experimental Psychology. Learning, Memory, and
Cognition 25:644–663.
Reed, C.L., S. Shoham, and E. Halgren. 2004. Neural substrates of tactile object recognition: An fMRI study.
Human Brain Mapping 21:236–246.
Renier, L., O. Collignon, D. Tranduy et al. 2004. Visual cortex activation in early blind and sighted subjects
using an auditory visual substitution device to perceive depth. NeuroImage 22:S1.
Renier, L., O. Collignon, C. Poirier et al. 2005. Cross modal activation of visual cortex during depth perception
using auditory substitution of vision. NeuroImage 26:573–580.
Riesenhuber, M., and T. Poggio. 1999. Hierarchical models of object recognition in cortex. Nature Neuroscience
2:1019–1025.
Röder, B., and F. Rösler. 1998. Visual input does not facilitate the scanning of spatial images. Journal of Mental
Imagery 22:165–181.
Saito, D.N., T. Okada, Y. Morita, Y. Yonekura, and N. Sadato. 2003. Tactile–visual cross-modal shape match-
ing: A functional MRI study. Cognitive Brain Research 17:14–25.
Sathian, K. 2004. Modality, quo vadis? Comment. Behavioral and Brain Sciences 27:413–414.
Sathian, K. 2005. Visual cortical activity during tactile perception in the sighted and the visually deprived.
Developmental Psychobiology 46:279–286.
Sathian, K., and S. Lacey. 2007. Journeying beyond classical somatosensory cortex. Canadian Journal of
Experimental Psychology 61:254–264.
Sathian, K., A. Zangaladze, J.M. Hoffman, and S.T. Grafton. 1997. Feeling with the mind’s eye. Neuroreport
8:3877–3881.
Shepard, R.N., and J. Metzler. 1971. Mental rotation of three-dimensional objects. Science 171:701–703.
Shikata, E., A. McNamara, A. Sprenger et al. 2008. Localization of human intraparietal areas AIP, CIP, and
LIP using surface orientation and saccadic eye movement tasks. Human Brain Mapping 29:411–421.
Stilla, R., R. Hanna, X. Hu, E. Mariola, G. Deshpande, and K. Sathian. 2008. Neural processing underlying
tactile microspatial discrimination in the blind: A functional magnetic resonance imaging study. Journal
of Vision 8:1–19. doi:10.1167/8.10.13.
Stilla, R., and K. Sathian. 2008. Selective visuo-haptic processing of shape and texture. Human Brain Mapping
29:1123–1138.
Stoesz, M., M. Zhang, V.D. Weisser, S.C. Prather, H. Mao, and K. Sathian. 2003. Neural networks active during
tactile form perception: Common and differential activity during macrospatial and microspatial tasks.
International Journal of Psychophysiology 50:41–49.
Swisher, J.D., M.A. Halko, L.B. Merabet, S.A. McMains, and D.C. Somers. 2007. Visual topography of human
intraparietal sulcus. Journal of Neuroscience 27:5326–5337.
Ueda, Y., and J. Saiki. 2007. View independence in visual and haptic object recognition. Japanese Journal of
Psychonomic Science 26:11–19.
Woods, A.T., A. Moore, and F.N. Newell. 2008. Canonical views in haptic object representation. Perception
37:1867–1878.
Zhang, M., V.D. Weisser, R. Stilla, S.C. Prather, and K. Sathian. 2004. Multisensory cortical processing of object
shape and its relation to mental imagery. Cognitive, Affective & Behavioral Neuroscience 4:251–259.
Zhou, Y.-D., and J.M. Fuster. 1997. Neuronal activity of somatosensory cortex in a cross-modal (visuo-haptic)
memory task. Experimental Brain Research 116:551–555.
Section III
Combinatorial Principles and Modeling
11 Spatial and Temporal Features
of Multisensory Processes
Bridging Animal and Human Studies
Diana K. Sarko, Aaron R. Nidiffer, Albert R. Powers III,
Dipanwita Ghose, Andrea Hillock-Dunn,
Matthew C. Fister, Juliane Krueger, and Mark T. Wallace

CONTENTS
11.1 Introduction........................................................................................................................... 192
11.2 Neurophysiological Studies in Animal Models: Integrative Principles as a Foundation
for Understanding Multisensory Interactions........................................................................ 192
11.3 Neurophysiological Studies in Animal Models: New Insights into Interdependence of
Integrative Principles............................................................................................................. 193
11.3.1 Spatial Receptive Field Heterogeneity and Its Implications for Multisensory
Interactions................................................................................................................ 193
11.3.2 Spatiotemporal Dynamics of Multisensory Processing............................................ 197
11.4 Studying Multisensory Integration in an Awake and Behaving Setting: New Insights
into Utility of Multisensory Processes.................................................................................. 199
11.5 Human Behavioral and Perceptual Studies of Multisensory Processing: Building
Bridges between Neurophysiological and Behavioral and Perceptual Levels of Analysis.........201
11.5.1 Defining the “Temporal Window” of Multisensory Integration............................... 201
11.5.2 Stimulus-Dependent Effects on the Size of the Multisensory Temporal Window....202
11.5.3 Can “Higher-Order” Processes Affect Multisensory Temporal Window?............... 203
11.6 Adult Plasticity in Multisensory Temporal Processes: Psychophysical and
Neuroimaging Evidence........................................................................................................203
11.7 Developmental Plasticity in Multisensory Representations: Insights from Animal and
Human Studies....................................................................................................................... 205
11.7.1 Neurophysiological Studies into Development of Multisensory Circuits..................205
11.7.2 Development of Integrative Principles......................................................................206
11.7.3 Experientially Based Plasticity in Multisensory Circuits..........................................207
11.7.4 Development of Human Multisensory Temporal Perception....................................207
11.8 Conclusions and Future Directions........................................................................................209
References....................................................................................................................................... 210


11.1  INTRODUCTION
Multisensory processing is a pervasive and critical aspect of our behavioral and perceptual reper-
toires, facilitating and enriching a wealth of processes including target identification, signal detec-
tion, speech comprehension, spatial navigation, and flavor perception to name but a few. The adaptive
advantages that multisensory integration confers are critical to survival, with effective acquisition
and use of multisensory information enabling the generation of appropriate behavioral responses
under circumstances in which one sense is inadequate. In the behavioral domain, a number of
studies have illustrated the strong benefits conferred under multisensory circumstances, with the
most salient examples including enhanced orientation and discrimination (Stein et al. 1988, 1989),
improved target detection (Frassinetti et al. 2002; Lovelace et al. 2003), and speeded responses
(Hershenson 1962; Hughes et al. 1994; Frens et al. 1995; Harrington and Peck 1998; Corneil et al.
2002; Forster et al. 2002; Molholm et al. 2002; Amlot et al. 2003; Diederich et al. 2003; Calvert
and Thesen 2004).
Along with these behavioral examples, there are myriad perceptual illustrations of the power
of multisensory interactions. For example, the intensity of a light is perceived as greater when
presented with a sound (Stein et al. 1996) and judgments of stimulus features such as speed and
orientation are often more accurate when combined with information available from another sense
(Soto-Faraco et al. 2003; Manabe and Riquimaroux 2000; Clark and Graybiel 1966; Wade and Day
1968). One of the most compelling examples of multisensory-mediated perceptual gains can be
seen in the speech realm, where the intelligibility of a spoken signal can be greatly enhanced when
the listener can see the speaker’s face (Sumby and Pollack 1954). In fact, this bimodal gain may
be a principal factor in the improvements in speech comprehension seen in those with significant
hearing loss after visual training (Schorr et al. 2005; Rouger et al. 2007). Regardless of whether
the benefits are seen in the behavioral or perceptual domains, they typically exceed those that are
predicted on the basis of responses to each of the component unisensory stimuli (Hughes et al. 1994,
1998; Corneil and Munoz 1996; Harrington and Peck 1998). Such deviations from simple additive
models provide important insights into the neural bases for these multisensory interactions in that
they strongly argue for a convergence and active integration of the different sensory inputs within
the brain.
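
One common way to generate such a prediction from the component unisensory responses, offered here only as an illustration (the chapter does not commit to this particular model), is to assume statistically independent detection in the two channels, so that the predicted multisensory hit rate follows probability summation:

```python
def probability_summation(p_visual, p_auditory):
    """Predicted detection probability if the visual and auditory channels are
    statistically independent: P(V or A) = P(V) + P(A) - P(V)*P(A)."""
    return p_visual + p_auditory - p_visual * p_auditory

# Hypothetical detection rates for weak unisensory stimuli.
p_v, p_a = 0.40, 0.35
predicted = probability_summation(p_v, p_a)   # 0.61
observed = 0.78                               # hypothetical multisensory hit rate

print(f"predicted (independent channels): {predicted:.2f}")
print(f"observed multisensory rate:       {observed:.2f}")
print("exceeds independence prediction" if observed > predicted else "within prediction")
```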

11.2  NEUROPHYSIOLOGICAL STUDIES IN ANIMAL MODELS: INTEGRATIVE PRINCIPLES AS A FOUNDATION FOR UNDERSTANDING MULTISENSORY INTERACTIONS
Information from multiple sensory modalities converges at many sites within the central nervous
system, providing the necessary anatomical framework for multisensory interactions (Calvert and
Thesen 2004; Stein and Meredith 1993). Multisensory convergence at the level of the single neuron
commonly results in an integrated output such that the multisensory response is typically distinct
from the component responses, and often from their predicted addition as well. Seminal studies
of multisensory processing initially focused on a midbrain structure, the superior colliculus (SC),
because of its high incidence of multisensory neurons, its known spatiotopic organization, and its
well-defined role in controlling orientation movements of the eyes, pinnae, and head (Sparks 1986;
Stein and Meredith 1993; Sparks and Groh 1995; Hall and Moschovakis 2004; King 2004).
These foundational studies of the SC of cats (later reaffirmed by work in nonhuman primate
models, see Wallace and Stein 1996, 2001; Wallace et al. 1996) provided an essential understanding
of the organization of multisensory neurons and the manner in which they integrate their different
sensory inputs. In addition to characterizing the striking nonlinearities that frequently define the
responses of these neurons under conditions of multisensory stimulation, these studies established
a series of fundamental principles that identified key stimulus features that govern multisensory
interactions (Meredith and Stein 1983, 1985, 1986; Meredith et al. 1987). The spatial principle deals
with the physical location of the paired stimuli, and illustrates the importance of spatial proximity
in driving the largest proportionate gains in response. Similarly, the temporal principle captures
the fact that the largest gains are typically seen when stimuli are presented close together in time,
and that the magnitude of the interaction declines as the stimuli become increasingly separated in
time. Finally, the principle of inverse effectiveness reflects the fact that the largest gains are gener-
ally seen to the pairing of two weakly effective stimuli. As individual stimuli become increasingly
effective in driving neuronal responses, the size of the interactions seen to the pairing declines.
Together, these principles have provided an essential predictive outline for understanding multisen-
sory integration at the neuronal level, as well as for understanding the behavioral and perceptual
consequences of multisensory pairings. However, it is important to point out that these principles,
although widely instructive, fail to capture the complete integrative profile of any individual neuron.
The reason for this is that space, time, and effectiveness are intimately intertwined in naturalistic
stimuli, and manipulating one has a consequent effect on the others. Recent studies, described in
the next section, have sought to better understand the strong interdependence between these fac-
tors, with the hope of better elucidating the complex spatiotemporal architecture of multisensory
interactions.
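
These principles translate directly into the quantitative measures used throughout this literature. The sketch below is a minimal formalization under our own assumptions (hypothetical firing rates; conventional definitions of enhancement and additivity rather than any single study's exact procedure): it computes the gain of the multisensory response over the best unisensory response, classifies it against the additive prediction, and uses a weak and a strong pairing to illustrate inverse effectiveness.

```python
def enhancement_index(multi, uni_a, uni_b):
    """Percentage gain of the multisensory response over the best
    unisensory response: 100 * (CM - max(A, B)) / max(A, B)."""
    best_uni = max(uni_a, uni_b)
    return 100.0 * (multi - best_uni) / best_uni

def additivity(multi, uni_a, uni_b):
    """Classify the multisensory response relative to the summed
    unisensory responses (the additive prediction)."""
    predicted = uni_a + uni_b
    if multi > predicted:
        return "superadditive"
    if multi < predicted:
        return "subadditive"
    return "additive"

# Hypothetical firing rates (spikes/s) for a weakly and a strongly
# effective pairing in the same neuron.
weak   = dict(uni_a=2.0,  uni_b=3.0,  multi=9.0)
strong = dict(uni_a=20.0, uni_b=25.0, multi=30.0)

for label, r in [("weak pairing", weak), ("strong pairing", strong)]:
    gain = enhancement_index(r["multi"], r["uni_a"], r["uni_b"])
    kind = additivity(r["multi"], r["uni_a"], r["uni_b"])
    print(f"{label}: enhancement = {gain:.0f}% ({kind})")
```

With these illustrative numbers, the weak pairing yields a 200% enhancement and a superadditive interaction, whereas the strong pairing yields a 20% enhancement and a subadditive interaction, mirroring the principle of inverse effectiveness.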

11.3  NEUROPHYSIOLOGICAL STUDIES IN ANIMAL MODELS: NEW INSIGHTS INTO INTERDEPENDENCE OF INTEGRATIVE PRINCIPLES
11.3.1  Spatial Receptive Field Heterogeneity and Its Implications for Multisensory Interactions
Early observations during the establishment of the neural principles of multisensory integration
hinted at a complexity not captured by integrative “rules” or constructs. For example, in structuring
experiments to test the spatial principle, it was clear that stimulus location not only played a key role
in the magnitude of the multisensory interaction, but also that the individual sensory responses were
strongly modulated by stimulus location. Such an observation suggested an interaction between the
spatial and inverse effectiveness principles, and one that might possibly be mediated by differences
in unisensory responses as a function of location within the neuron’s receptive field. Recently, this
concept has been tested by experiments specifically designed to characterize the microarchitecture
of multisensory receptive fields.
In these experiments, stimuli from each of the effective modalities were presented at a series
of locations within and outside the classically defined excitatory receptive field of individual mul-
tisensory neurons (Figure 11.1). Studies were conducted in both subcortical (i.e., SC) and cortical
[i.e., the anterior ectosylvian sulcus (AES)] multisensory domains in the cat, in which prior work
had illustrated that the receptive fields of multisensory neurons are quite large (Stein and Meredith
1993; Benedek et al. 2004; Furukawa and Middlebrooks 2002; Middlebrooks and Knudsen 1984;
Middlebrooks et al. 1998; Xu et al. 1999; Wallace and Stein 1996, 1997; Nagy et al. 2003). In this
manner, spatial receptive fields (SRFs) can be created for each of the effective modalities, as well as
for the multisensory combination. It is important to point out that in these studies, the stimuli are
identical (e.g., same luminance, loudness, and spectral composition) except for their location. The
results of these analyses have revealed a marked degree of heterogeneity to the SRFs of both SC and
AES multisensory neurons (Carriere et al. 2008; Royal et al. 2009). This response heterogeneity is
typically characterized by regions of high response (i.e., hot spots) surrounded by regions of sub-
stantially weaker response. Studies are ongoing to determine whether features such as the number
or size of these hot spots differ between subcortical and cortical areas.
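
A minimal sketch of how such an SRF can be assembled from raw spike times, assuming a Gaussian-kernel spike density function, a regular azimuth-by-elevation grid, and a fixed post-stimulus response window; the kernel width, window, grid, and hypothetical spike data are our own illustrative choices, not the parameters used by Carriere et al. (2008) or Royal et al. (2009):

```python
import numpy as np

def spike_density(spike_times, t, sigma=0.010):
    """Gaussian-kernel spike density function (spikes/s) evaluated at times t,
    given spike times in seconds."""
    if len(spike_times) == 0:
        return np.zeros_like(t)
    diffs = t[:, None] - np.asarray(spike_times)[None, :]
    kernel = np.exp(-0.5 * (diffs / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return kernel.sum(axis=1)

def build_srf(trials, azimuths, elevations, window=(0.0, 0.4)):
    """Average the SDF within a post-stimulus response window at each tested
    location and return a normalized azimuth-by-elevation response map.
    trials[(az, el)] is a list of spike-time arrays, one per trial."""
    t = np.arange(window[0], window[1], 0.001)
    srf = np.zeros((len(elevations), len(azimuths)))
    for i, el in enumerate(elevations):
        for j, az in enumerate(azimuths):
            sdfs = [spike_density(spikes, t) for spikes in trials[(az, el)]]
            srf[i, j] = np.mean([sdf.mean() for sdf in sdfs])
    return srf / srf.max() if srf.max() > 0 else srf

# Hypothetical example with a single response "hot spot" at (10, 0) degrees.
rng = np.random.default_rng(0)
azimuths, elevations = [-10, 0, 10, 20], [-15, 0, 15]
trials = {(az, el): [rng.uniform(0.0, 0.4, size=20 if (az, el) == (10, 0) else 5)
                     for _ in range(10)]
          for el in elevations for az in azimuths}
print(np.round(build_srf(trials, azimuths, elevations), 2))
```
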
Although these SRF analyses have revealed a previously uncharacterized feature of multisensory
neurons, perhaps the more important consequence of this SRF heterogeneity is the implication that
this has for multisensory interactions. At least three competing hypotheses can be envisioned for
the role of receptive field heterogeneity in multisensory integration—each with strikingly different
predictions.

FIGURE 11.1  Construction of an SRF for an individual multisensory neuron. Each stimulus location tested
within receptive field generates a response that is then compiled into a single unit activity (SUA) plot. SUA
plot at one location is shown in detail to illustrate how spike density function (SDF) is derived. Finally, SDF/
SUA data are transformed into a pseudocolor SRF plot in which normalized evoked response is shown relative
to azimuth and elevation. Evoked responses are scaled to maximal response, with warmer colors representing
higher firing rates. (Adapted from Carriere, B.N. et al., J. Neurophysiol., 99, 2357–2368, 2008.)

The first is that spatial location takes precedence and that the resultant interactions
would be completely a function of the spatial disparity between the paired stimuli. In this scenario,
the largest interactions would be seen when the stimuli were presented at the same location, and
the magnitude of the interaction would decline as spatial disparity increased. Although this would
seem to be a strict interpretation of the spatial principle, in fact, even the early characterization of
this principle focused not on location or disparity, but rather on the presence or absence of stimuli
within the receptive field (Meredith and Stein 1986), hinting at the relative lack of importance of
absolute location. The second hypothesis is that stimulus effectiveness would be the dominant fac-
tor, and that the interaction would be dictated not by spatial location but rather by the magnitude of
the individual sensory responses (which would be modulated by changes in spatial location). The
final hypothesis is that there is an interaction between stimulus location and effectiveness, such that
both would play a role in shaping the resultant interaction. If this were the case, studies would seek
to identify the relative weighting of these two stimulus dimensions to gain a better mechanistic view
into these interactions.
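If the third hypothesis were correct, one simple way to estimate the relative weighting of space and effectiveness would be to regress observed interaction magnitudes on the two dimensions. The sketch below assumes a linear model and hypothetical variable names (it is not the analysis used in the studies discussed here); it returns standardized coefficients so the two predictors can be compared directly.

import numpy as np

def weight_space_vs_effectiveness(disparity_deg, unisensory_rate, interaction_pct):
    """Fit interaction (%) ~ b0 + b1*disparity + b2*effectiveness by ordinary
    least squares on z-scored variables and return the standardized weights."""
    X = np.column_stack([disparity_deg, unisensory_rate]).astype(float)
    y = np.asarray(interaction_pct, dtype=float)
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)    # z-score each predictor
    yz = (y - y.mean()) / y.std()
    A = np.column_stack([np.ones(len(yz)), Xz])  # add an intercept column
    beta, *_ = np.linalg.lstsq(A, yz, rcond=None)
    return {"space": float(beta[1]), "effectiveness": float(beta[2])}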
The first foray into this question focused on cortical area AES (Carriere et al. 2008). Here, it
was found that SRF architecture played an essential deterministic role in the observed multisensory
interactions, and most intriguingly, in a manner consistent with the second hypothesis outlined
above. Thus, and as illustrated in Figure 11.2, SRF architecture resulted in changes in stimulus
effectiveness that formed the basis for the multisensory interaction. In the neuron shown, if the stim-
uli were presented in a region of strong response within the SRF, a response depression would result
(Figure 11.2b, left column). In contrast, if the stimuli were moved to a location of weak response,
their pairing resulted in a large enhancement (Figure 11.2b, center column). Intermediate regions

FIGURE 11.2  (See color insert.) Multisensory interactions in AES neurons differ based on location of paired stimuli. (a) Visual, auditory, and multisensory SRFs
are shown with highlighted locations (b, d) illustrating response suppression (left column), response enhancement (middle column), and no significant interaction (right
column). (c) Shaded areas depict classically defined receptive fields for visual (blue) and auditory (green) stimuli.

of response resulted in either weak or no interactions (Figure 11.2b, right column). In addition to
this traditional measure of multisensory gain (relative to the best unisensory response), these same
interactions can also be examined and quantified relative to the predicted summation of the unisen-
sory responses (Wallace et al. 1992; Wallace and Stein 1996; Stein and Wallace 1996; Stanford et
al. 2005; Royal et al. 2009; Carriere et al. 2008). In these comparisons, strongly effective pairings
typically result in subadditive interactions, weakly effective pairings result in superadditive inter-
actions, and intermediate pairings result in additive interactions. These different categories of interaction can be visualized relative to additive models in pseudocolor representations such as that shown in Figure 11.3, in which the actual multisensory SRF is contrasted against that predicted on the basis of additive modeling. Together, these results clearly illustrate the primacy of stimulus efficacy in dictating multisensory interactions, and suggest that space per se plays a relatively minor role in governing these integrative processes.
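In quantitative terms, these two comparisons reduce to simple indices: gain relative to the best unisensory response (the traditional measure mentioned above) and gain relative to the additive (V + A) prediction. The sketch below computes both with made-up firing rates; the specific numbers are illustrative only.

def multisensory_gain(v_rate, a_rate, va_rate):
    """Percent gain of the multisensory response over the best unisensory response."""
    best = max(v_rate, a_rate)
    return 100.0 * (va_rate - best) / best

def additivity_index(v_rate, a_rate, va_rate):
    """Percent difference from the additive prediction: positive values are
    superadditive, values near zero additive, negative values subadditive."""
    predicted = v_rate + a_rate
    return 100.0 * (va_rate - predicted) / predicted

# Weakly effective pairing (spikes/s): large gain, superadditive
print(multisensory_gain(4, 3, 12), additivity_index(4, 3, 12))      # 200.0, ~71.4
# Strongly effective pairing: modest gain, subadditive
print(multisensory_gain(60, 50, 75), additivity_index(60, 50, 75))  # 25.0, ~-31.8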
Parallel studies are now beginning to focus on the SC, and provide an excellent comparative
framework from which to view multisensory interactive mechanisms across brain structures. In
this work, Krueger et al. (2009) reported that the SRF architecture of multisensory neurons in the
SC is not only similar to that of cortical neurons, but also that stimulus effectiveness appears to
once again be the key factor in dictating the multisensory response. Thus, stimulus pairings within
regions of weak unisensory response often resulted in superadditive interactions (Figure 11.4b–c,
◼), whereas pairings at locations of strong unisensory responses typically exhibited subadditive
interactions (Figure 11.4b–c, ○). Overall, such an organization presumably boosts signals within
weakly effective regions of the unisensory SRFs during multisensory stimulus presentations and
yields more reliable activation for each stimulus presentation.
Although SRF architecture appears similar in both cortical and subcortical multisensory brain
regions, there are also subtle differences that may provide important insights into both the underly-
ing mechanistic operations and the different behavioral and perceptual roles of AES and SC. For
example, when the SRFs of a multisensory neuron in the SC are compared under different sensory


FIGURE 11.3  Multisensory interactions relative to additive prediction models. Visual, auditory, and multi-
sensory (VA) SRFs are shown for an individual multisensory neuron of AES. True multisensory responses can
be contrasted with those predicted by an additive model (V + A) and reveal a richer integrative microarchitec-
ture than predicted by simple linear summation of unisensory response profiles. (Adapted from Carriere, B.N.
et al., J. Neurophysiol., 99, 2357–2368, 2008.)

FIGURE 11.4  Multisensory interactions in SC neurons differ based on location of paired stimuli. (a) Visual,
auditory, and multisensory SRFs are shown as a function of azimuth (x axis) and elevation (y axis). Specific
locations within receptive field (b) are illustrated in detail (c) to show evoked responses for visual, auditory,
and multisensory conditions. Weakly effective locations (square) result in response enhancement, whereas
conditions evoking a strong unisensory response (circle) result in response suppression.

conditions, there appears to be a global similarity in the structure of each SRF with respect to both
the number and location of hot spots. This might indicate that the overall structure of the SRF is
dependent on fixed anatomical and/or biophysical constraints such as the extent of dendritic arbors.
However, these characteristics are far less pronounced in cortical SRFs (Carriere et al. 2008), possi-
bly due to the respective differences in the inputs to these two structures (the cortex receiving more
heterogeneous inputs) and/or due to less spatiotopic order in the cortex. Future work will seek to
better clarify these intriguing differences across structures.

11.3.2  Spatiotemporal Dynamics of Multisensory Processing


In addition to the clear interactions between space and effectiveness captured by the aforementioned
SRF analyses, an additional stimulus dimension that needs to be included is time. For example, and
returning to the initial outlining of the interactive principles, changing stimulus location impacts
not only stimulus effectiveness, but also the temporal dynamics of each of the unisensory (and mul-
tisensory) responses. Thus, dependent on the location of the individual stimuli, responses will have
very different temporal patterns of activation.
More recently, the importance of changes in temporal response profiles has been highlighted by
findings that the multisensory responses of SC neurons show shortened latencies when compared
with the component unisensory responses (Rowland et al. 2007), a result likely underlying the
behavioral finding of the speeding of saccadic eye movements under multisensory conditions (Frens
and Van Opstal 1998; Frens et al. 1995; Hughes et al. 1998; Amlot et al. 2003; Bell et al. 2005).
Additional work focused on the temporal dimension of multisensory responses has extended the
original characterization of the temporal principle to nonhuman primate cortex, where Kayser and
colleagues (2008) have found that audiovisual interactions in the superior temporal plane of rhesus
monkey neocortex are maximal when a visual stimulus precedes an auditory stimulus by 20 to 80
ms. Along with these single-unit changes, recent work has also shown that the timing of sensory inputs
with respect to ongoing neural oscillations in the neocortex has a significant impact on whether
neuronal responses are enhanced or suppressed. For instance, in macaque primary auditory cortex,
properly timed somatosensory input has been found to reset ongoing oscillations to an optimal
excitability phase that enhances the response to temporally correlated auditory input. In contrast,
somatosensory input delivered during suboptimal, low-excitability oscillatory periods depresses the
auditory response (Lakatos et al. 2007).
Although clearly illustrating the importance of stimulus timing in shaping multisensory interac-
tions, these prior studies have yet to characterize the interactions between time, space, and effec-
tiveness in the generation of a multisensory response. To do this, recent studies from our laboratory
have extended the SRF analyses described above to include time, resulting in the creation of spa-
tiotemporal receptive field (STRF) plots. It is important to point out that such analyses are not unique to multisensory systems, but rather stem from both spatiotemporal and spectrotemporal receptive field studies within individual sensory systems (David et al. 2004; Machens et al. 2004; Haider et al. 2010; Ye et al. 2010). Instead, the power of the STRF here lies in its application
to multisensory systems as a modeling framework from which important mechanistic insights can
be gained about the integrative process.
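As a rough computational illustration, an STRF can be treated as a locations x time matrix of spike density functions, and the multisensory STRF can then be contrasted against an additive prediction built from the unisensory STRFs. The sketch below is an assumed analysis for illustration, loosely mirroring the VA - (V + A) contrast shown later in Figure 11.6; array shapes and the normalization are arbitrary choices.

import numpy as np

def stack_strf(sdf_by_location):
    """Stack per-location spike density functions (1-D arrays over time)
    into a locations x time matrix: a simple STRF representation."""
    return np.vstack(sdf_by_location)

def superadditivity_map(strf_v, strf_a, strf_va):
    """Contrast the observed multisensory STRF with the additive model.
    Positive bins mark superadditive space-time regions, negative bins
    subadditive ones, relative to the peak of the predicted response."""
    predicted = strf_v + strf_a
    return (strf_va - predicted) / predicted.max()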
The creation of STRFs for cortical multisensory neurons has revealed interesting features about
the temporal dynamics of multisensory interactions and the evolution of the multisensory response
(Royal et al. 2009). Most importantly, these analyses, when contrasted with simple additive mod-
els based on the temporal architecture of the unisensory responses, identified two critical epochs
in the multisensory response not readily captured by additive processes (Figure 11.5). The first of
these, presaged by the Rowland et al. study described above, revealed an early phase of superaddi-
tive multisensory responses that manifest as a speeding of response (i.e., reduced latency) under


FIGURE 11.5  Spatiotemporal response dynamics in multisensory AES neurons. A reduced response latency
and increased response duration characterized spatiotemporal dynamics of paired multisensory stimuli.

multisensory conditions. The second of these happens late in the response epoch, where the multi-
sensory response continues beyond the truncation of the unisensory responses, effectively increasing
response duration under multisensory circumstances. It has been postulated that these two distinct
epochs of multisensory integration may ultimately be linked to very different behavioral and/or
perceptual roles (Royal et al. 2009). Whereas reduced latencies may speed target detection and
identification, extended response duration may facilitate perceptual analysis of the object or area of
interest. One interesting hypothesis is that the early speeding of responses will be more prominent
in SC multisensory neurons given their important role in saccadic (and head) movements, and that
the extended duration will be seen more in cortical networks engaged in perceptual analyses. Future
work, now in progress in our laboratory (see below), will seek to clarify the behavioral/perceptual
roles of these integrative processes by directly examining the links at the neurophysiological and
behavioral levels.

11.4 STUDYING MULTISENSORY INTEGRATION IN AN AWAKE AND BEHAVING SETTING:
NEW INSIGHTS INTO UTILITY OF MULTISENSORY PROCESSES
As research on the neural substrates of multisensory integration progresses, and as the behavioral
and perceptual consequences of multisensory combinations become increasingly apparent, contem-
porary neuroscience is faced with the challenge of bridging between the level of the single neuron
and whole animal behavior and perception. To date, much of the characterization of multisensory
integration at the cellular level has been conducted in anesthetized animals, which offer a variety
of practical advantages. However, given that anesthesia could have substantial effects on neural
encoding, limiting the interpretation of results within the broader construct of perceptual abilities
(Populin 2005; Wang et al. 2005; Ter-Mikaelian et al. 2007), the field must now turn toward awake
preparations in which direct correlations can be drawn between neurons and behavior/perception.
Currently, in our laboratory, we are using operant conditioning methods to train animals to fix-
ate on a single location while audiovisual stimuli are presented in order to study SRF architecture
in this setting (and compare these SRFs with those generated in anesthetized animals). In addition
to providing a more naturalistic view into receptive field organization, these studies can then be
extended in order to begin to address the relationships between the neural and behavioral levels.
One example of this is the use of a delayed saccade task, which has been used in prior work to parse
sensory from motor responses in the SC (where many neurons have both sensory and motor activ-
ity; Munoz et al. 1991a, 1991b; Munoz and Guitton 1991; Guitton and Munoz 1991). In this task,
an animal is operantly conditioned to fixate on a simple visual stimulus (a light-emitting diode or
LED), and to hold fixation for the duration of the LED. While the animal maintains fixation, a peripheral LED is illuminated, resulting in a sensory (i.e., visual) response in the SC. A short time later (usually
on the order of 100–200 ms), the fixation LED is shut off, cueing the animal to generate a motor
response to the location at which the target was previously presented. The “delay” allows the sen-
sory response to be dissociated from the motor response, thus providing insight into the nature of
the sensory–motor transform. Although such delayed saccade tasks have been heavily employed in
both the cat and monkey, they are typically used to eliminate “confounding” sensory influences on
the motor responses.
Another advantage afforded by the awake preparation is the ability to study how space, time, and
effectiveness interact in a state more reflective of normal brain function, and which is likely to reveal
important links between multisensory neuronal interactions and behavioral/perceptual enhance-
ments such as speeded responses, increased detection, and accuracy gains. Ideally, these analyses
could be structured to allow direct neurometric–psychometric comparisons, providing fundamental
insights into how individual neurons and neuronal assemblies impact whole organismic processes.

FIGURE 11.6  (See color insert.) Representative STRF from awake (a) versus anesthetized (b) recordings from cat SC using simple audiovisual stimulus presentations
(an LED paired with broadband noise). In awake animals, superadditive interactions occurred over multiple time points in multisensory condition (VA) when compared
to what would be predicted based on a linear summation of unisensory responses (V + A; see contrast, VA – [V + A]). This differs from anesthetized recordings from
SC in which multisensory interactions are limited to earliest temporal phase of multisensory response.

Preliminary studies have already identified that multisensory neurons in the SC of the awake cat
demonstrate extended response durations, as well as superadditive interactions over multiple time
scales, when compared to anesthetized animals in which multisensory interactions are typically
limited to the early phases of the response (Figure 11.6; Krueger et al. 2008). These findings remain
to be tested in multisensory regions of the cortex, or extended beyond simple stimuli (LEDs paired
with white noise) to more complex, ethologically relevant cues that might better address multisen-
sory perceptual capabilities. Responses to naturalistic stimuli in cats have primarily been examined
in unisensory cortices, demonstrating that simplification of natural sounds (bird chirps) results in
significant alteration of neuronal responses (Bar-Yosef et al. 2002) and that firing rates differ for
natural versus time-reversed conspecific vocalizations (Qin et al. 2008) in the primary auditory cor-
tex. Furthermore, multisensory studies in primates have shown that multisensory enhancement in
the primary auditory cortex of awake monkeys was reduced when a mismatched pair of naturalistic
audiovisual stimuli was presented (Kayser et al. 2010).

11.5 HUMAN BEHAVIORAL AND PERCEPTUAL STUDIES OF MULTISENSORY PROCESSING:
BUILDING BRIDGES BETWEEN NEUROPHYSIOLOGICAL AND BEHAVIORAL
AND PERCEPTUAL LEVELS OF ANALYSIS
As should be clear from the above description, the ultimate goal of neurophysiological studies is to
provide a more informed view into the encoding processes that give rise to our behaviors and per-
ceptions. Indeed, these seminal findings in the animal model can be used as important instruction
sets for the design of experiments in human subjects to bridge between these domains. Recently, our
laboratory has embarked on such experiments with a focus on better characterizing how stimulus
timing influences multisensory perceptual processes, with a design shaped by our knowledge of the
temporal principle.

11.5.1  Defining the “Temporal Window” of Multisensory Integration


In addition to emphasizing the importance of stimulus onset asynchrony (SOA) in determining the
outcome of a given multisensory pairing, experiments in both SC and AES cortex of the cat showed
that the span of time over which response enhancements are generally seen in these neurons is on
the order of several hundred milliseconds (Meredith et al. 1987; Wallace and Stein 1996; Wallace
et al. 1992, 1996). Behavioral studies have followed up on these analyses to illustrate the temporal
constraints of multisensory combinations on human performance, and have found that the presenta-
tion of cross-modal stimulus pairs in close temporal proximity results in shortened saccadic reaction times (Colonius and Diederich 2004; Colonius and Arndt 2001; Frens et al. 1995) and heightened accuracy in understanding speech in noise (McGrath and Summerfield 1985; Pandey et al. 1986; van Wassenhove et al. 2007), and that temporal proximity plays an important role in multisensory illusions such
as the McGurk effect (Munhall et al. 1996), the sound-induced flash illusion (Shams et al. 2000,
2002), the parchment skin illusion (Guest et al. 2002), and the stream-bounce illusion (Sekuler et
al. 1997). Moreover, multisensory interactions as demonstrated using population-based functional
imaging methods (Dhamala et al. 2007; Kavounoudias et al. 2008; Macaluso et al. 2004; Noesselt
et al. 2007) have been shown to be greatest during synchronous presentation of stimulus pairs.
Perhaps even more important than synchrony in these studies was the general finding that multisen-
sory interactions were typically preserved over an extended window of time (i.e., several hundred
milliseconds) surrounding simultaneity, giving rise to the term “temporal window” for describing
the critical period for these interactions (Colonius and Diederich 2004; van Wassenhove et al. 2007;
Dixon and Spitz 1980). The concept of such a window makes good ethological sense, in that it pro-
vides a buffer for the latency differences that characterize the propagation times of energies in the
different senses. Most illustrative here are the differences between the propagation times of light
and sound in our environment, which differ by many orders of magnitude. As a simple example of
this difference, take an audiovisual event happening at a distance of 1 m, where the incident ener-
gies will arrive at the retina almost instantaneously and at the cochlea about 3 ms later (the speed
of sound is approximately 330 m/s). Now, if we move that same audiovisual source to a distance of
20 m, the difference in arrival times expands to 60 ms. Hence, having a window of tolerance for
these audiovisual delays represents an effective means to continue to bind stimuli across modalities
even without absolute correspondence in their incident arrival times.
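The arithmetic behind this example is easy to make explicit. The snippet below treats light as effectively instantaneous and uses the 330 m/s figure for the speed of sound assumed in the text.

SPEED_OF_SOUND_M_PER_S = 330.0  # approximate value used in the text

def audiovisual_lag_ms(distance_m):
    """Lag of the auditory signal behind the (effectively instantaneous)
    visual signal for an audiovisual event at the given distance."""
    return 1000.0 * distance_m / SPEED_OF_SOUND_M_PER_S

for d in (1, 20, 100):
    print(f"{d:>3} m -> sound lags light by about {audiovisual_lag_ms(d):.0f} ms")
# ~3 ms at 1 m and ~61 ms at 20 m (the text rounds the latter to 60 ms)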
Because of the importance of temporal factors for multisensory integration, a number of experi-
mental paradigms have been developed for use in human subjects as a way to systematically study
the temporal binding window and its associated dynamics. One of the most commonly used of these
is a simultaneity judgment task, in which paired visual and auditory stimuli are presented at various
SOAs and participants are asked to judge whether the stimuli occurred simultaneously or succes-
sively (Zampini et al. 2005a; Engel and Dougherty 1971; Stone et al. 2001; Stevenson et al. 2010).
A distribution of responses can then be created that plots the probability of simultaneity reports as
a function of SOA. This distribution yields not only the point of subjective simultaneity, defined as
the peak of the function (Stone et al. 2001; Zampini et al. 2005a) but, more importantly, can also be used
to define a “window” of time within which simultaneity judgments are highly likely. A similar
approach is taken in paradigms designed to assess multisensory temporal order judgments, wherein
participants judge which modality's stimulus was presented first. Similar to
the simultaneity judgment task, the point of subjective simultaneity is the time point at which par-
ticipants judge either stimulus to have occurred first at a rate of 50% (Zampini et al. 2003; Spence
et al. 2001). Once again, this method can also be adapted to create response distributions that serve
as proxies for the temporal binding window. Although the point measures (i.e., point of subjective
simultaneity) derived from these studies tend to differ based on the paradigm chosen (Fujisaki et
al. 2004; Vroomen et al. 2004; Zampini et al. 2003, 2005a), the span of time over which there is a
high likelihood of reporting simultaneity is remarkably constant, ranging from about –100 ms to
250 ms, where negative values denote auditory-leading-visual conditions (Dixon and Spitz 1980;
Fujisaki et al. 2004; Vroomen et al. 2004; Zampini et al. 2003, 2005a). The larger window size on
the right side of these distributions—in which vision leads audition—appears in nearly all stud-
ies of audiovisual simultaneity perception, and has been proposed to arise from the inherent flex-
ibility needed to process real-world audiovisual events, given that the propagation speeds of light
and sound will result in SOAs only on the right side of these distributions (Dixon and Spitz 1980).
Indeed, very recent efforts to model the temporal binding window within a probabilistic framework
(Colonius and Diederich 2010a, 2010b) have described this asymmetry as arising from an asym-
metry in Bayesian priors across SOAs corresponding to the higher probability that visual-first pairs
were generated by the same external event.
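As an illustration of how such a proxy for the temporal binding window might be computed, the sketch below takes a hypothetical distribution of simultaneity reports over SOAs and measures the width of the SOA range where the report probability stays above a criterion fraction of its peak (75% here, mirroring the criterion used later in Figure 11.7). The data, the criterion, and the interpolation scheme are assumptions for illustration.

import numpy as np

def window_width_ms(soas_ms, p_simultaneous, criterion=0.75):
    """Width of the SOA range over which the probability of a 'simultaneous'
    report exceeds criterion * peak, estimated by linear interpolation."""
    soas = np.asarray(soas_ms, dtype=float)
    p = np.asarray(p_simultaneous, dtype=float)
    thresh = criterion * p.max()
    fine = np.linspace(soas.min(), soas.max(), 2001)  # dense SOA grid
    p_fine = np.interp(fine, soas, p)
    above = fine[p_fine >= thresh]
    return float(above.max() - above.min()) if above.size else 0.0

# Hypothetical responses; negative SOAs denote auditory-leading pairs.
soas = [-300, -200, -100, 0, 100, 200, 300]
p_sim = [0.10, 0.30, 0.70, 0.95, 0.90, 0.60, 0.20]
print(f"estimated window ~ {window_width_ms(soas, p_sim):.0f} ms")  # wider on the visual-leading side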

11.5.2  Stimulus-Dependent Effects on the Size of the Multisensory Temporal Window


Although some have argued for an invariant size to the temporal window (see Munhall et al.
1996), there is a growing body of evidence to suggest that the size of the temporal window is very
much dependent on the type of stimulus that is used (Dixon and Spitz 1980; van Wassenhove et
al. 2008; Soto-Faraco and Alsius 2009). The largest distinctions in this domain have been seen
when contrasting speech versus nonspeech stimuli, in which the window for speech appears to
be far larger (approximately 450 ms) when compared with the pairing of simpler stimuli such as
flash-tone pairs or videos of inanimate objects, such as a hammer pounding a nail—about 250 ms
(Dixon and Spitz 1980; van Atteveldt et al. 2007; van Wassenhove et al. 2007; Massaro et al.
1996; Conrey and Pisoni 2006; McGrath and Summerfield 1985). Interpretation of this seeming
expansion in the case of speech has ranged from the idea that learned tolerance of asynchrony
is greatest with stimuli to which we are most exposed (Dixon and Spitz 1980), to the theory that
the richness of auditory spectral and visual dynamic content in speech allows for binding over a
larger range of asynchrony (Massaro et al. 1996), to the view that speech window size is dictated
by the duration of the elemental building blocks of the spoken language—phonemes (Crystal and
House 1981).
Other studies have focused on altering the statistics of multisensory temporal relations in an
effort to better characterize the malleability of these processes. For example, repeated exposure to
a 250-ms auditory-leading-visual asynchronous pair is capable of biasing participants’ simultaneity
judgments in the direction of that lag by about 25 ms, with effects lasting on the order of minutes
(Fujisaki et al. 2004; Vroomen et al. 2004). Similar recalibration effects have been noted after expo-
sure to asynchronous audiovisual speech, as well as to visual–tactile, audio–tactile, and sensory–
motor pairs (Hanson et al. 2008; Fajen 2007; Stetson et al. 2006; Navarra et al. 2005). Although the
exact mechanisms underlying these changes are unknown, they have been proposed to represent a
recalibration of sensory input consistent with Bayesian models of perception (Hanson et al. 2008;
Miyazaki et al. 2005, 2006).

11.5.3  Can “Higher-Order” Processes Affect Multisensory Temporal Window?


In addition to these studies examining stimulus-dependent effects, other work has sought to
determine the malleability of multisensory temporal processing resulting from the manipulation
of cognitive processes derived from top-down networks. Much of this work has focused on atten-
tional control, and has been strongly influenced by historical studies showing that attention within
a modality could greatly facilitate information processing of a cued stimulus within that modality.
This work has now been extended to the cross-modal realm, and has shown that attention to one
modality can bias temporally based judgments concerning a stimulus in another modality (Zampini
et al. 2005b; Spence et al. 2001; Shore et al. 2001), illustrating the presence of strong attentional
links between different sensory systems.

11.6 ADULT PLASTICITY IN MULTISENSORY TEMPORAL PROCESSES:
PSYCHOPHYSICAL AND NEUROIMAGING EVIDENCE
Further work in support of top-down influences on multisensory perception has focused on char-
acterizing the plasticity that can be engendered with the use of classic perceptual learning para-
digms. The first of these studies were directed outside the temporal domain, and focused on the
simple question of whether perceptual learning within a single sensory modality can be improved
with the use of cross-modal stimuli. In these studies, participants were trained on a motion dis-
crimination task using either a visual cue alone or combined visual–auditory cues. Results revealed
enhanced visual motion discrimination abilities and an abbreviated time course of learning in
the group trained on the audiovisual version of the task when compared with those trained only
on the visual version (Kim et al. 2008; Seitz et al. 2006). Similar results have been seen in the
visual facilitation of voice discrimination learning (von Kriegstein and Giraud 2006), cross-modal
enhancement of both auditory and visual natural object recognition (Schneider et al. 2008), and
in the facilitation of unisensory processing based on prior multisensory memories (Murray et al.
2004, 2005).
More recently, our laboratory has extended these perceptual plasticity studies into the temporal
realm, by attempting to assess the plasticity of the multisensory temporal binding window itself.
Initial efforts used a two-alternative forced choice audiovisual simultaneity judgment task in which
subjects were asked to choose on a trial-by-trial basis whether a stimulus pair was synchronously or
asynchronously presented (Powers et al. 2009). In the initial characterization (i.e., before training),
a distribution of responses was obtained that allowed us to define a proxy measure for the multisen-
sory temporal binding window for each individual subject (Figure 11.7). After this baseline mea-
surement, subjects were then engaged in the same task, except that now they were given feedback as
to the correctness of their judgments. Training was carried out for an hour a day over 5 days. This
training regimen resulted in a marked narrowing in the width of the multisensory temporal binding

FIGURE 11.7  Training on a two-alternative forced choice simultaneity judgment task. (a) An
estimate of temporal binding window is derived using a criterion set at 75% of maximum. In this representa-
tive individual case, window narrows from 321 to 115 ms after 5 days (1 h/day) of feedback training. (b) After
training, a significant decrease in probability of judging nonsimultaneous audiovisual pairs to be simultane-
ous was found (*P < .05). (c) Average window size dropped significantly after first day (1 h) of training, then
remained stable (*P < .05).

window, with a group average reduction of 40%. Further characterization revealed that the changes
in window size were very rapid (being seen after the first day of training), were durable (lasting
at least a week after the cessation of training), and were a direct result of the feedback provided
(control subjects passively exposed to the same stimulus set did not exhibit window narrowing).
Additionally, to rule out the possibility that this narrowing was the result of changes in cognitive
biases, a second experiment using a two-interval forced choice paradigm was undertaken in which
participants were instructed to identify the simultaneously presented audiovisual pair presented
within one of two intervals. The two-interval forced choice paradigm resulted in a narrowing that
was similar in both degree and dynamics to that using the two-alternative forced choice approach.
Overall, this result is the first to illustrate a marked experience-dependent malleability to the mul-
tisensory temporal binding window, a result that has potentially important implications for clinical
conditions such as autism and dyslexia in which there is emerging evidence for changes in multisen-
sory temporal function (Ciesielski et al. 1995; Laasonen et al. 2001, 2002; Kern 2002; Hairston et
al. 2005; Facoetti et al. 2010; Foss-Feig et al. 2010).
In an effort to better define the brain networks responsible for multisensory temporal perception
(and the demonstrable plasticity), our laboratory has conducted a follow-up neuroimaging study using
functional magnetic resonance imaging (fMRI) (Powers et al. 2010). The findings revealed marked
changes in one of the best-established multisensory cortical domains in humans, the posterior supe-
rior temporal sulcus (pSTS). The pSTS exhibited striking decreases in blood oxygen level dependent
(BOLD) activation after training, suggestive of an increased efficiency of processing. In addition to
these changes in pSTS were changes in regions of the auditory and visual cortex, along with marked
changes in functional coupling between these unisensory domains and the pSTS. Together, these
studies are beginning to reveal the cortical networks involved in multisensory temporal processing
and perception, as well as the dynamics of these networks that must be continually adjusted to cap-
ture the ever-changing sensory statistics of our natural world and their cognitive valence.

11.7 DEVELOPMENTAL PLASTICITY IN MULTISENSORY REPRESENTATIONS:
INSIGHTS FROM ANIMAL AND HUMAN STUDIES
In addition to this compelling emerging evidence as to the plastic potential of the adult brain for
having its multisensory processing architecture shaped in an experience-dependent manner, there
is a rich literature on the development of multisensory representations and the role that postnatal
experience plays in shaping these events. Although the questions were first posed in the literature
associated with the development of human perceptual abilities, more recent work in animal models
has laid the foundation for better understanding the seminal events in the maturation of multisen-
sory behaviors and perceptions.

11.7.1  Neurophysiological Studies into Development of Multisensory Circuits


The studies described above in adult animal models provide an ideal foundation on which to eval-
uate the developmental events in the nervous system that lead up to the construction of mature
multisensory representations. Hence, subsequent studies focused on establishing the developmental
chronology for multisensory neurons and their integrative features in these same model structures—
the subcortical SC and the cortical AES. In the SC, recordings immediately after birth reveal an
absence of multisensory neurons (Wallace and Stein 1997). Indeed, the first neurons present in the
SC at birth and soon after are those that are exclusively responsive to somatosensory cues. By 10 to
12 days postnatal, auditory-responsive neurons appear, setting the stage for the first multisensory
neurons that are responsive to both somatosensory and auditory cues. More than a week later, the
first visually responsive neurons appear, providing the basis for the first visually responsive multi-
sensory neurons. These early multisensory neurons were found to be far different from their adult counterparts: they responded weakly to sensory stimuli and had poorly developed response selectivity,
long latencies, and large receptive fields (Wallace and Stein 1997; Stein et al. 1973a, 1973b). Perhaps
most importantly, these early multisensory neurons failed to integrate their different sensory inputs,
responding to stimulus combinations in a manner that was indistinguishable from their component
unisensory responses (Wallace and Stein 1997). Toward the end of the first postnatal month, this
situation begins to change, with individual neurons starting to show the capacity to integrate their
different sensory inputs. Over the ensuing several months, both the number of multisensory neurons
and those with integrative capacity grow steadily, such that by 4 to 5 months after birth, the adult-
like incidences are achieved (Figure 11.8).
The developmental progression in the cortex is very similar to that in the SC, except that it
appears to be delayed by several weeks (Wallace et al. 2006). Thus, the first multisensory neurons
do not appear in AES until about 6 weeks after birth (Figure 11.8). As in the SC, these early
multisensory neurons are reflective of the adjoining unisensory representations, being auditory–
somatosensory. Four weeks or so later, we see the appearance of visual neurons and the coincident
appearance of visually responsive multisensory neurons. Once again, early cortical multisensory
neurons are strikingly immature in many respects, including a lack of integrative capacity. As devel-
opment progresses, we see a substantial growth in the multisensory population and we see most
multisensory AES neurons develop their integrative abilities.

FIGURE 11.8  Development of multisensory neurons in SC (open circles) versus AES (closed circles) of cat.
Development of multisensory neurons is similar between SC and AES with exceptions of onset and overall
percentage of multisensory neurons. At 4 months postnatal life, percentages of multisensory neurons in both
AES and SC are at their mature levels, with SC having a higher percentage than AES.

The parallels between SC and AES in their multisensory developmental chronology likely
reflect the order of overall sensory development (Gottlieb 1971) rather than a dependence on connectivity between the two regions, because the establishment of sensory profiles in the SC precedes the
functional maturation of connections between AES and the SC (Wallace and Stein 2000). Thus, a
gradual recruitment of sensory functions during development appears to produce neurons capable
of multisensory integration (Lewkowicz and Kraebel 2004; Lickliter and Bahrick 2004), and points
strongly to a powerful role for early experience in sculpting the final multisensory state of these
systems (see Section 11.7.3).

11.7.2  Development of Integrative Principles


In addition to characterizing the appearance of multisensory neurons and the maturation of their
integrative abilities, these studies also examined how the integrative principles changed during
the course of development. Intriguingly, the principle of inverse effectiveness appeared to hold in
the earliest integrating neurons, in that as soon as a neuron demonstrated integrative abilities, the
largest enhancements were seen in pairings of weakly effective stimuli. Indeed, one of the most
surprising findings in these developmental studies is the all-or-none nature of multisensory inte-
gration. Thus, neurons appear to transition very rapidly from a state in which they lack integrative
capacity to one in which that capacity is adult-like in both magnitude and adherence to the principle
of inverse effectiveness. In the spatial domain, the situation appears to be much the same. Whereas
early multisensory neurons have large receptive fields and lack integration, as soon as receptive
fields become adult-like in size, neurons show integrative ability. Indeed, these processes appear
to be so tightly linked that it has been suggested that they reflect the same underlying mechanistic
process (Wallace and Stein 1997; Wallace et al. 2006).
The one principle that appears to differ in a developmental context is the temporal principle.
Observations from the earliest integrating neurons reveal that they typically show response enhancements only to pairings at a single SOA (see Wallace and Stein 1997). This is in stark contrast to
adults, in which enhancements are typically seen over a span of SOAs lasting several hundred mil-
liseconds, and which has led to the concept of a temporal “window” for multisensory integration. In
these animal studies, as development progresses, the range of SOAs over which enhancements can
be generated grows, ultimately resulting in adult-sized distributions reflective of the large temporal
window. Why such a progression is seen in the temporal domain and not in the other domains is not
yet clear, but may have something to do with the fact that young animals are generally only con-
cerned with events in immediate proximity to the body (which would make an SOA close
to 0 of greatest utility). As the animal becomes increasingly interested in exploring space at greater
distances, an expansion in the temporal window would allow for the better encoding of these more
distant events. We will return to the issue of plasticity in the multisensory temporal window when
we turn to the human studies (see Section 11.7.4).

11.7.3  Experientially Based Plasticity in Multisensory Circuits


Although the protracted timeline for the development of mature multisensory circuits is strongly
suggestive of a major deterministic role for early experience in shaping these circuits, only with
controlled manipulation of this experience can we begin to establish causative links. To address this
issue, our laboratory has performed a variety of experiments in which sensory experience is elimi-
nated or altered in early life, after which the consequent impact on multisensory representations is
examined. In the first of these studies, the necessity of cross-modal experiences during early life
was examined by eliminating all visual experiences from birth until adulthood, and then assessing
animals as adults (Wallace et al. 2004; Carriere et al. 2007). Although there were subtle differences
between SC and AES in these studies, the impact on multisensory integration in both structures was
profound. Although dark-rearing allowed for the appearance of a robust (albeit smaller than normal) visual population, it abolished virtually all response enhancements to visual–nonvisual stimulus pairings.
A second series of experiments then sought to address the importance of the statistical relation-
ship of the different sensory cues to one another on the construction of these multisensory represen-
tations. Here, animals were reared in environments in which the spatial relationship between visual
and auditory stimuli was systematically altered, such that visual and auditory events that were
temporally coincident were always separated by 30°. When examined as adults, these animals were
found to have multisensory neurons with visual and auditory receptive fields that were displaced
by approximately 30°, but more importantly, to now show maximal multisensory enhancements
when stimuli were separated by this disparity (Figure 11.9a). More recent work has extended these
studies into the temporal domain, and has shown that raising animals in environments in which
the temporal relationship of visual and auditory stimuli is altered by 100 ms results in a shift in the
peak tuning profiles of multisensory neurons by approximately 100 ms (Figure 11.9b). Of particular
interest was that when the temporal offset was extended to 250 ms, the neurons lost the capacity
to integrate their different sensory inputs, suggesting that there is a critical temporal window for
this developmental process. Collectively, these results provide strong support for the power of the
statistical relations of multisensory stimuli in driving the formation of multisensory circuits; circuits
that appear to be optimally designed to code the relations most frequently encountered in the world
during the developmental period.

11.7.4  Development of Human Multisensory Temporal Perception


The ultimate goal of these animal model–based studies is to provide a better framework from which
to view human development, with a specific eye toward the maturation of the brain mechanisms
that underlie multisensory-mediated behaviors and perceptions. Human developmental studies on
multisensory processing have provided us with important insights into the state of the newborn
and infant brains, and have illustrated that multisensory abilities change rapidly in the first
year of postnatal life (see Lewkowicz and Ghazanfar 2009). Intriguingly, there is then a dearth of
knowledge about multisensory maturation until adulthood. In an effort to begin to fill this void,
our laboratory has embarked on a series of developmental studies focused on childhood and ado-
lescence, with a specific emphasis on multisensory temporal processes, one of the principal themes
of this chapter.

FIGURE 11.9  Developmental manipulations of spatial and temporal relationships of audiovisual stimuli.
(a) Multisensory interaction is shown as a function of spatially disparate stimuli between normally reared
animals and animals reared with a 30° disparity between auditory and visual stimuli. Peak multisensory
interaction for disparately reared group falls by 30° from that of normally reared animals. (b) Multisensory
interaction as a function of SOA in animals reared normally versus animals reared in environments with 100
and 250 ms temporal disparities. As might be expected, peak multisensory interactions are offset by 100 ms
for normally reared versus the 100 ms disparate group. Interestingly, the 250 ms group loses the ability to
integrate audiovisual stimuli.

These studies strongly suggest that the maturation of multisensory temporal functioning extends
beyond the first decade of life. In the initial study, it was established that multisensory temporal
functioning was still not mature by 10 to 11 years of age (Hillock et al. 2010). Here, children were
assessed on a simultaneity judgment task in which flashes and tone pips were presented at SOAs
ranging from –450 to +450 ms (with positive values representing visual-leading stimulus trials and

FIGURE 11.10  Temporal window size decreases from childhood to adulthood. Each data point represents a
participant’s window size as determined by width at 75% of maximum probability of perceived simultaneity
using nonspeech stimuli. See Section 11.5.1. (Adapted from Hillock, A.R. et al., Binding of sights and sounds:
Age-related changes in audiovisual temporal processing, 2010, submitted for publication.)

negative values representing auditory-leading stimulus trials), allowing for the creation of a response
distribution identical to what has been done in adults and which serves as a proxy for the multi-
sensory temporal binding window (see Section 11.6). When compared with adults, the group mean
window size for these children was found to be approximately 38% larger (i.e., 413 vs. 299 ms). A
larger follow-up study then sought to detail the chronology of this maturational process from 6 years
of age until adulthood, and identified the closure of the binding window in mid to late adolescence
for these simple visual–auditory pairings (Figure 11.10; Hillock and Wallace 2011b). A final study
then sought to extend these analyses into the stimulus domain with which children likely have the
greatest experience—speech. Using the McGurk effect, in which discordant visual and auditory speech stimuli are paired (e.g., a visual /ga/ with an auditory /ba/), it is possible to index the
integrative process by looking at how often participants report fusions that represent a synthesis of
the visual and auditory cues (e.g., /da/ or /tha/). Furthermore, because this effect has been shown
to be temporally dependent, it can be used as a tool to study the multisensory temporal binding
window for speech-related stimuli. Surprisingly, when used with children (6–11 years), adolescents
(12–17 years), and adults (18–23 years), windows were found to be indistinguishable (Hillock and
Wallace 2011a). Together, these studies show a surprising dichotomy between the development of
multisensory temporal perception for nonspeech versus speech stimuli, a result that may reflect the
powerful imperative placed on speech in young children, and that reinforces the importance of sensory
experience in the development of multisensory abilities.

11.8  CONCLUSIONS AND FUTURE DIRECTIONS


As should be clear from the above, substantial efforts are ongoing to bridge between the rapidly
growing knowledge sets concerning multisensory processing derived from both animal and human
studies. This work should not only complement each domain, but should inform the design of better
experiments in each. As an example, the final series of human experiments described above begs
for a nonhuman correlate to better explore the mechanistic underpinnings that result in very differ-
ent timelines for the maturation of nonspeech versus speech integrative networks. Experiments in
nonhuman primates, in which the critical nodes for communicative signal processing are beginning
to emerge (Ghazanfar et al. 2008, 2010), can begin to tease out the relative maturation of the relevant
neurophysiological processes likely to result in these distinctions.

Although we have made great strides in recent years in building a better understanding of multi-
sensory behavioral and perceptual processes and their neural correlates, we still have much to dis-
cover. Fundamental questions remain unanswered, providing both a sense of frustration and a
time of great opportunity. One domain of great interest to our laboratory is creating a bridge between
the neural and the behavioral/perceptual in an effort to extend beyond the correlative analyses done
thus far. Paradigms developed in awake and behaving animals allow for a direct assessment of neu-
ral and behavioral responses during performance on the same task, and should more directly link
multisensory encoding processes to their striking behavioral benefits (e.g., see Chandrasekaran and
Ghazanfar 2009). However, even these experiments provide only correlative evidence, and future
work will seek to use powerful new methods such as optogenetic manipulation in animal models
(e.g., see Cardin et al. 2009) and transcranial magnetic stimulation in humans (e.g., see Romei et al.
2007; Beauchamp et al. 2010; Pasalar et al. 2010) to selectively deactivate specific circuit compo-
nents and then assess the causative impact on multisensory function.

REFERENCES
Amlot, R., R. Walker, J. Driver, and C. Spence. 2003. Multimodal visual–somatosensory integration in saccade
generation. Neuropsychologia, 41, 1–15.
Bar-Yosef, O., Y. Rotman, and I. Nelken. 2002. Responses of neurons in cat primary auditory cortex to bird
chirps: Effects of temporal and spectral context. Journal of Neuroscience, 22, 8619–8632.
Beauchamp, M.S., A.R. Nath, and S. Pasalar. 2010. fMRI-guided transcranial magnetic stimulation reveals
that the superior temporal sulcus is a cortical locus of the McGurk effect. Journal of Neuroscience, 30,
2414–2417.
Bell, A.H., M.A. Meredith, A.J. Van Opstal, and D.P. Munoz. 2005. Crossmodal integration in the primate
superior colliculus underlying the preparation and initiation of saccadic eye movements. Journal of
Neurophysiology, 93, 3659–3673.
Benedek, G., G. Eordegh, Z. Chadaide, and A. Nagy. 2004. Distributed population coding of multisensory spa-
tial information in the associative cortex. European Journal of Neuroscience, 20, 525–529.
Calvert, G.A., and T. Thesen. 2004. Multisensory integration: methodological approaches and emerging prin-
ciples in the human brain. Journal of Physiology, Paris, 98, 191–205.
Cardin, J.A., M. Carlen, K. Meletis, U. Knoblich, F. Zhang, K. Deisseroth, L.H. Tsai, and C.I. Moore. 2009.
Driving fast-spiking cells induces gamma rhythm and controls sensory responses. Nature, 459, 663–667.
Carriere, B.N., D.W. Royal, T.J. Perrault, S.P. Morrison, J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2007.
Visual deprivation alters the development of cortical multisensory integration. Journal of Neurophysiology,
98, 2858–2867.
Carriere, B.N., D.W. Royal, and M.T. Wallace. 2008. Spatial heterogeneity of cortical receptive fields and its
impact on multisensory interactions. Journal of Neurophysiology, 99, 2357–2368.
Chandrasekaran, C., and A.A. Ghazanfar. 2009. Different neural frequency bands integrate faces and voices
differently in the superior temporal sulcus. Journal of Neurophysiology, 101, 773–788.
Ciesielski, K.T., J.E. Knight, R.J. Prince, R.J. Harris, and S.D. Handmaker. 1995. Event-related potentials in
cross-modal divided attention in autism. Neuropsychologia, 33, 225–246.
Clark, B., and A. Graybiel. 1966. Factors contributing to the delay in the perception of the oculogravic illusion.
American Journal of Psychology, 79, 377–388.
Colonius, H., and P. Arndt. 2001. A two-stage model for visual–auditory interaction in saccadic latencies.
Perception & Psychophysics, 63, 126–147.
Colonius, H., and A. Diederich. 2004. Multisensory interaction in saccadic reaction time: A time-window-of-
integration model. Journal of Cognitive Neuroscience, 16, 1000–1009.
Colonius, H., and A. Diederich. 2010a. The optimal time window of visual–auditory integration: A reaction
time analysis. Frontiers in Integrative Neuroscience, 4, 11.
Colonius, H., and A. Diederich. 2010b. Optimal time windows of integration. Abstract Presented at 2010
International Multisensory Research Forum.
Conrey, B., and D.B. Pisoni. 2006. Auditory–visual speech perception and synchrony detection for speech and
nonspeech signals. Journal of the Acoustical Society of America, 119, 4065–4073.
Corneil, B.D., and D.P. Munoz 1996. The influence of auditory and visual distractors on human orienting gaze
shifts. Journal of Neuroscience, 16, 8193–8207.
Corneil, B.D., M. Van Wanrooij., D.P. Munoz, and A.J. Van Opstal. 2002. Auditory–visual interactions subserv-
ing goal-directed saccades in a complex scene. Journal of Neurophysiology, 88, 438–454.
Crystal, T.H., and A.S. House. 1981. Segmental durations in connected speech signals. Journal of the Acoustical
Society of America, 69, S82–S83.
David, S.V., W.E. Vinje, and J.L. Gallant. 2004. Natural stimulus statistics alter the receptive field structure of
v1 neurons. Journal of Neuroscience, 24, 6991–7006.
Dhamala, M., C.G. Assisi, V.K. Jirsa, F.L. Steinberg, and J.A. Kelso. 2007. Multisensory integration for timing
engages different brain networks. NeuroImage, 34, 764–773.
Diederich, A., H. Colonius, D. Bockhorst, and S. Tabeling. 2003. Visual–tactile spatial interaction in saccade
generation. Experimental Brain Research, 148, 328–337.
Dixon, N.F., and L. Spitz. 1980. The detection of auditory visual desynchrony. Perception, 9, 719–721.
Engel, G.R., and W.G. Dougherty. 1971. Visual–auditory distance constancy. Nature, 234, 308.
Facoetti, A., A.N. Trussardi, M. Ruffino, M.L. Lorusso, C. Cattaneo, R. Galli, M. Molteni, and M. Zorzi. 2010.
Multisensory spatial attention deficits are predictive of phonological decoding skills in developmental
dyslexia. Journal of Cognitive Neuroscience, 22, 1011–1025.
Fajen, B.R. 2007. Rapid recalibration based on optic flow in visually guided action. Experimental Brain
Research, 183, 61–74.
Forster, B., C. Cavina-Pratesi, S.M. Aglioti, and G. Berlucchi. 2002. Redundant target effect and intersensory
facilitation from visual–tactile interactions in simple reaction time. Experimental Brain Research, 143,
480–487.
Foss-Feig, J.H., L.D. Kwakye, C.J. Cascio, C.P. Burnette, H. Kadivar, W.L. Stone, and M.T. Wallace. 2010.
An extended multisensory temporal binding window in autism spectrum disorders. Experimental Brain
Research, 203, 381–389.
Frassinetti, F., N. Bolognini, and E. Ladavas. 2002. Enhancement of visual perception by crossmodal visuo-
auditory interaction. Experimental Brain Research, 147, 332–343.
Frens, M.A., and A.J. Van Opstal. 1998. Visual–auditory interactions modulate saccade-related activity in mon-
key superior colliculus. Brain Research Bulletin, 46, 211–224.
Frens, M.A., A.J. Van Opstal, and R.F. van der Willigen. 1995. Spatial and temporal factors determine audito-
ry–visual interactions in human saccadic eye movements. Perception & Psychophysics, 57, 802–816.
Fujisaki, W., S. Shimojo, M. Kashino, and S. Nishida. 2004. Recalibration of audiovisual simultaneity. Nature
Neuroscience, 7, 773–778.
Furukawa, S., and J.C. Middlebrooks. 2002. Cortical representation of auditory space: Information-bearing
features of spike patterns. Journal of Neurophysiology, 87, 1749–1762.
Ghazanfar, A.A., C. Chandrasekaran, and N.K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience, 28, 4457–4469.
Ghazanfar, A.A., C. Chandrasekaran, and R.J. Morrill. 2010. Dynamic, rhythmic facial expressions and the
superior temporal sulcus of macaque monkeys: implications for the evolution of audiovisual speech.
European Journal of Neuroscience, 31, 1807–1817.
Gottlieb, G. 1971. Ontogenesis of sensory function in birds and mammals. In The biopsychology of develop-
ment, ed. E. Tobach, L.R. Aronson, and E. Shaw. New York: Academic Press.
Guest, S., C. Catmur, D. Lloyd, and C. Spence. 2002. Audiotactile interactions in roughness perception.
Experimental Brain Research, 146, 161–171.
Guitton, D., and D.P. Munoz. 1991. Control of orienting gaze shifts by the tectoreticulospinal system in the
head-free cat. I. Identification, localization, and effects of behavior on sensory responses. Journal of
Neurophysiology, 66, 1605–1623.
Haider, B., M.R. Krause, A. Duque, Y. Yu, J. Touryan, J.A. Mazer, and D.A. McCormick. 2010. Synaptic and
network mechanisms of sparse and reliable visual cortical activity during nonclassical receptive field
stimulation. Neuron, 65, 107–121.
Hairston, W.D., J.H. Burdette, D.L. Flowers, F.B. Wood, and M.T. Wallace. 2005. Altered temporal profile of
visual–auditory multisensory interactions in dyslexia. Experimental Brain Research, 166, 474–480.
Hall, W.C., and A.K. Moschovakis. 2004. The superior colliculus: New approaches for studying sensorimotor
integration. Boca Raton, FL: CRC Press.
Hanson, J.V., J. Heron, and D. Whitaker. 2008. Recalibration of perceived time across sensory modalities.
Experimental Brain Research, 185, 347–352.
Harrington, L.K., and C.K. Peck. 1998. Spatial disparity affects visual–auditory interactions in human senso-
rimotor processing. Experimental Brain Research, 122, 247–252.
Hershenson, M. 1962. Reaction time as a measure of intersensory facilitation. Journal of Experimental
Psychology, 63, 289–293.
Hillock, A.R., and M.T. Wallace. 2011a. Changes in the multisensory temporal binding window persist into
adolescence. In preparation.
Hillock, A.R., and M.T. Wallace. 2011b. A developmental study of the temporal constraints for audiovisual
speech binding. In preparation.
Hillock, A.R., A.R. Powers 3rd, and M.T. Wallace. 2010. Binding of sights and sounds: Age-related changes in
audiovisual temporal processing. (Submitted).
Hughes, H.C., P.A. Reuter-Lorenz, G. Nozawa, and R. Fendrich. 1994. Visual–auditory interactions in sen-
sorimotor processing: saccades versus manual responses. Journal of Experimental Psychology. Human
Perception and Performance, 20, 131–153.
Hughes, H.C., M.D. Nelson, and D.M. Aronchick. 1998. Spatial characteristics of visual–auditory summation
in human saccades. Vision Research, 38, 3955–3963.
Kavounoudias, A., J.P. Roll, J.L. Anton, B. Nazarian, M. Roth, and R. Roll. 2008. Proprio-tactile integration for
kinesthetic perception: An fMRI study. Neuropsychologia, 46, 567–575.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex, 18, 1560–1574.
Kayser, C., N.K. Logothetis, and S. Panzeri. 2010. Visual enhancement of the information representation in
auditory cortex. Current Biology, 20, 19–24.
Kern, J.K. 2002. The possible role of the cerebellum in autism/PDD: Disruption of a multisensory feedback
loop. Medical Hypotheses, 59, 255–260.
Kim, R.S., A.R. Seitz, and L. Shams. 2008. Benefits of stimulus congruency for multisensory facilitation of
visual learning. PLoS One, 3, e1532.
King, A.J. 2004. The superior colliculus. Current Biology, 14, R335–R338.
Krueger, J., M.C. Fister, D.W. Royal, B.N. Carriere, and M.T. Wallace. 2008. A comparison of spatiotemporal
receptive fields of multisensory superior colliculus neurons in awake and anesthetized cat. Society for
Neuroscience Abstract, 457.17.
Krueger, J., D.W. Royal, M.C. Fister, and M.T. Wallace. 2009. Spatial receptive field organization of multisen-
sory neurons and its impact on multisensory interactions. Hearing Research, 258, 47–54.
Laasonen, M., E. Service, and V. Virsu. 2001. Temporal order and processing acuity of visual, auditory,
and tactile perception in developmentally dyslexic young adults. Cognitive, Affective & Behavioral
Neuroscience, 1, 394–410.
Laasonen, M., E. Service, and V. Virsu. 2002. Crossmodal temporal order and processing acuity in develop-
mentally dyslexic young adults. Brain and Language, 80, 340–354.
Lakatos, P., C.M. Chen, M.N. O’Connell, A. Mills, and C.E. Schroeder. 2007. Neuronal oscillations and multi-
sensory interaction in primary auditory cortex. Neuron, 53, 279–292.
Lewkowicz, D.J., and K.S. Kraebel. 2004. The value of multisensory redundancy in the development of
intersensory perception. In The Handbook of Multisensory Processes, ed. G.A. Calvert, C. Spence, and
B.E. Stein. Cambridge, MA: MIT Press.
Lewkowicz, D.J., and A.A. Ghazanfar. 2009. The emergence of multisensory systems through perceptual nar-
rowing. Trends in Cognitive Sciences, 13, 470–478.
Lickliter, R., and L.E. Bahrick. 2004. Perceptual development and the origins of multisensory responsiveness.
In The Handbook of Multisensory Processes, ed. G.A. Calvert, C. Spence, and B.E. Stein. Cambridge,
MA: MIT Press.
Lovelace, C.T., B.E. Stein, and M.T. Wallace. 2003. An irrelevant light enhances auditory detection in humans:
a psychophysical analysis of multisensory integration in stimulus detection. Brain Research Cognitive
Brain Research, 17, 447–453.
Macaluso, E., N. George, R. Dolan, C. Spence, and J. Driver. 2004. Spatial and temporal factors during process-
ing of audiovisual speech: A PET study. NeuroImage, 21, 725–732.
Machens, C.K., M.S. Wehr, and A.M. Zador. 2004. Linearity of cortical receptive fields measured with natural
sounds. Journal of Neuroscience, 24, 1089–1100.
Manabe, K., and H. Riquimaroux. 2000. Sound controls velocity perception of visual apparent motion. Journal
of the Acoustical Society of Japan, 21, 171–174.
Massaro, D.W., M.M. Cohen, and P.M. Smeele. 1996. Perception of asynchronous and conflicting visual and
auditory speech. Journal of the Acoustical Society of America, 100, 1777–1786.
McGrath, M., and Q. Summerfield. 1985. Intermodal timing relations and audio-visual speech recognition by
normal-hearing adults. Journal of the Acoustical Society of America, 77, 678–685.
Meredith, M.A., J.W. Nemitz, and B.E. Stein. 1987. Determinants of multisensory integration in superior col-
liculus neurons. I. Temporal factors. Journal of Neuroscience, 7, 3215–3229.
Meredith, M.A., and B.E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science, 221, 389–391.
Meredith, M.A., and B.E. Stein. 1985. Descending efferents from the superior colliculus relay integrated mul-
tisensory information. Science, 227, 657–659.
Meredith, M.A., and B.E. Stein. 1986. Spatial factors determine the activity of multisensory neurons in cat
superior colliculus. Brain Research, 365, 350–354.
Middlebrooks, J.C., and E.I. Knudsen. 1984. A neural code for auditory space in the cat’s superior colliculus.
Journal of Neuroscience, 4, 2621–2634.
Middlebrooks, J.C., L. Xu, A.C. Eddins, and D.M. Green. 1998. Codes for sound-source location in nontono-
topic auditory cortex. Journal of Neurophysiology, 80, 863–881.
Miyazaki, M., D. Nozaki, and Y. Nakajima. 2005. Testing Bayesian models of human coincidence timing.
Journal of Neurophysiology, 94, 395–399.
Miyazaki, M., S. Yamamoto, S. Uchida, and S. Kitazawa. 2006. Bayesian calibration of simultaneity in tactile
temporal order judgment. Nature Neuroscience, 9, 875–877.
Molholm, S., W. Ritter, M.M. Murray, D.C. Javitt, C.E. Schroeder, and J.J. Foxe. 2002. Multisensory auditory–
visual interactions during early sensory processing in humans: A high-density electrical mapping study.
Brain Research. Cognitive Brain Research, 14, 115–128.
Munhall, K.G., P. Gribble, L. Sacco, and M. Ward. 1996. Temporal constraints on the McGurk effect. Perception
& Psychophysics, 58, 351–362.
Munoz, D.P., and D. Guitton. 1991. Control of orienting gaze shifts by the tectoreticulospinal system in the
head-free cat: II. Sustained discharges during motor preparation and fixation. Journal of Neurophysiology,
66, 1624–1641.
Munoz, D.P., D. Guitton, and D. Pelisson. 1991a. Control of orienting gaze shifts by the tectoreticulospinal
system in the head-free cat: III. Spatiotemporal characteristics of phasic motor discharges. Journal of
Neurophysiology, 66, 1642–1666.
Munoz, D.P., D. Pelisson, and D. Guitton. 1991b. Movement of neural activity on the superior colliculus motor
map during gaze shifts. Science, 251, 1358–1360.
Murray, M.M., C.M. Michel, R. Grave De Peralta, S. Ortigue, D. Brunet, S. Gonzalez Andino, and A. Schnider.
2004. Rapid discrimination of visual and multisensory memories revealed by electrical neuroimaging.
NeuroImage, 21, 125–135.
Murray, M.M., J.J. Foxe, and G.R. Wylie. 2005. The brain uses single-trial multisensory memories to discrimi-
nate without awareness. NeuroImage, 27, 473–478.
Nagy, A., G. Eordegh, and G. Benedek. 2003. Spatial and temporal visual properties of single neurons in the
feline anterior ectosylvian visual area. Experimental Brain Research, 151, 108–114.
Navarra, J., A. Vatakis, M. Zampini, S. Soto-Faraco, W. Humphreys, and C. Spence. 2005. Exposure to asyn-
chronous audiovisual speech extends the temporal window for audiovisual integration. Brain Research.
Cognitive Brain Research, 25, 499–507.
Noesselt, T., J.W. Rieger, M.A. Schoenfeld, M. Kanowski, H. Hinrichs, H.J. Heinze, and J. Driver. 2007.
Audiovisual temporal correspondence modulates human multisensory superior temporal sulcus plus pri-
mary sensory cortices. Journal of Neuroscience, 27, 11431–11441.
Pandey, P.C., H. Kunov, and S.M. Abel. 1986. Disruptive effects of auditory signal delay on speech perception
with lipreading. Journal of Auditory Research, 26, 27–41.
Pasalar, S., T. Ro, and M.S. Beauchamp. 2010. TMS of posterior parietal cortex disrupts visual tactile multisen-
sory integration. European Journal of Neuroscience, 31, 1783–1790.
Populin, L.C. 2005. Anesthetics change the excitation/inhibition balance that governs sensory processing in the
cat superior colliculus. Journal of Neuroscience, 25, 5903–5914.
Powers 3rd, A.R., A.R. Hillock, and M.T. Wallace. 2009. Perceptual training narrows the temporal window of
multisensory binding. Journal of Neuroscience, 29, 12265–12274.
Powers 3rd, A.R., M.A. Hevey, and M.T. Wallace. 2010. Neural correlates of multisensory perceptual learning.
In preparation.
Qin, L., J.Y. Wang, and Y. Sato. 2008. Representations of cat meows and human vowels in the primary auditory
cortex of awake cats. Journal of Neurophysiology, 99, 2305–2319.
Romei, V., M.M. Murray, L.B. Merabet, and G. Thut. 2007. Occipital transcranial magnetic stimulation has
opposing effects on visual and auditory stimulus detection: implications for multisensory interactions.
Journal of Neuroscience, 27, 11465–11472.
Rouger, J., S. Lagleyre, B. Fraysse, S. Deneve, O. Deguine, and P. Barone. 2007. Evidence that cochlear-
implanted deaf patients are better multisensory integrators. Proceedings of the National Academy of
Sciences of the United States of America, 104, 7295–7300.
Rowland, B.A., S. Quessy, T.R. Stanford, and B.E. Stein. 2007. Multisensory integration shortens physiological
response latencies. Journal of Neuroscience, 27, 5879–5884.
Royal, D.W., B.N. Carriere, and M.T. Wallace. 2009. Spatiotemporal architecture of cortical receptive fields
and its impact on multisensory interactions. Experimental Brain Research, 198, 127–136.
Schneider, T.R., A.K. Engel, and S. Debener. 2008. Multisensory identification of natural objects in a two-way
crossmodal priming paradigm. Experimental Psychology, 55, 121–132.
Schorr, E.A., N.A. Fox, V. van Wassenhove, and E.I. Knudsen. 2005. Auditory–visual fusion in speech percep-
tion in children with cochlear implants. Proceedings of the National Academy of Sciences of the United
States of America, 102, 18748–18750.
Seitz, A.R., R. Kim, and L. Shams. 2006. Sound facilitates visual learning. Current Biology, 16, 1422–1427.
Sekuler, R., A.B. Sekuler, and R. Lau. 1997. Sound alters visual motion perception. Nature, 385, 308.
Shams, L., Y. Kamitani, and S. Shimojo. 2000. Illusions. What you see is what you hear. Nature, 408, 788.
Shams, L., Y. Kamitani, and S. Shimojo. 2002. Visual illusion induced by sound. Brain Research Cognitive
Brain Research, 14, 147–152.
Shore, D.I., C. Spence, and R.M. Klein. 2001. Visual prior entry. Psychological Science, 12, 205–212.
Soto-Faraco, S., and A. Alsius. 2009. Deconstructing the McGurk–MacDonald illusion. Journal of Experimental
Psychology. Human Perception and Performance, 35, 580–587.
Soto-Faraco, S., A. Kingstone, and C. Spence. 2003. Multisensory contributions to the perception of motion.
Neuropsychologia, 41, 1847–1862.
Sparks, D.L. 1986. Translation of sensory signals into commands for control of saccadic eye movements: Role
of primate superior colliculus. Physiological Reviews, 66, 118–171.
Sparks, D.L., and J.M. Groh. 1995. The superior colliculus: A window for viewing issues in integrative neuro-
science. In The Cognitive Sciences, ed. M.S. Gazzaniga. Cambridge, MA: MIT Press.
Spence, C., D.I. Shore, and R.M. Klein. 2001. Multisensory prior entry. Journal of Experimental Psychology.
General, 130, 799–832.
Stanford, T.R., S. Quessy, and B.E. Stein. 2005. Evaluating the operations underlying multisensory integration
in the cat superior colliculus. Journal of Neuroscience, 25, 6499–6508.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., and M.T. Wallace. 1996. Comparisons of cross-modality integration in midbrain and cortex.
Progress in Brain Research, 112, 289–299.
Stein, B.E., E. Labos, and L. Kruger. 1973a. Determinants of response latency in neurons of superior colliculus
in kittens. Journal of Neurophysiology, 36, 680–689.
Stein, B.E., E. Labos, and L. Kruger. 1973b. Sequence of changes in properties of neurons of superior colliculus
of the kitten during maturation. Journal of Neurophysiology, 36, 667–679.
Stein, B.E., W.S. Huneycutt, and M.A. Meredith. 1988. Neurons and behavior: The same rules of multisensory
integration apply. Brain Research, 448, 355–358.
Stein, B.E., M.A. Meredith, W.S. Huneycutt, and L. McDade. 1989. Behavioral indices of multisensory inte-
gration: Orientation to visual cues is affected by auditory stimuli. Journal of Cognitive Neuroscience, 1,
12–24.
Stein, B.E., N. London, L.K. Wilkinson, and D.D. Price. 1996. Enhancement of perceived visual intensity by
auditory stimuli: A psychophysical analysis. Journal of Cognitive Neuroscience, 8, 497–506.
Stetson, C., X. Cui, P.R. Montague, and D.M. Eagleman. 2006. Motor-sensory recalibration leads to an illusory
reversal of action and sensation. Neuron, 51, 651–659.
Stevenson, R.A., N.A. Altieri, S. Kim, D.B. Pisoni, and T.W. James. 2010. Neural processing of asynchronous
audiovisual speech perception. NeuroImage, 49, 3308–3318.
Stone, J.V., N.M. Hunkin, J. Porrill, R. Wood, V. Keeler, M. Beanland, M. Port, and N.R. Porter. 2001. When
is now? Perception of simultaneity. Proceedings of the Royal Society of London. Series B. Biological
Sciences, 268, 31–38.
Sumby, W.H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the
Acoustical Society of America, 26, 212–215.
Ter-Mikaelian, M., D.H. Sanes, and M.N. Semple. 2007. Transformation of temporal properties between audi-
tory midbrain and cortex in the awake Mongolian gerbil. Journal of Neuroscience, 27, 6091–6102.
van Atteveldt, N.M., E. Formisano, L. Blomert, and R. Goebel. 2007. The effect of temporal asynchrony on the
multisensory integration of letters and speech sounds. Cerebral Cortex, 17, 962–974.
van Wassenhove, V., K.W. Grant, and D. Poeppel. 2007. Temporal window of integration in auditory–visual
speech perception. Neuropsychologia, 45, 598–607.
van Wassenhove, V., D.V. Buonomano, S. Shimojo, and L. Shams. 2008. Distortions of subjective time percep-
tion within and across senses. PLoS One, 3, e1437.
Von Kriegstein, K., and A.L. Giraud. 2006. Implicit multisensory associations influence voice recognition.
PLoS Biology, 4, e326.
Vroomen, J., M. Keetels, B. De Gelder, and P. Bertelson. 2004. Recalibration of temporal order perception by
exposure to audio-visual asynchrony. Brain Research. Cognitive Brain Research, 22, 32–35.
Wade, N.J., and R.H. Day. 1968. Development and dissipation of a visual spatial aftereffect from prolonged
head tilt. Journal of Experimental Psychology, 76, 439–443.
Wallace, M.T., and B.E. Stein. 1996. Sensory organization of the superior colliculus in cat and monkey. Progress
in Brain Research, 112, 301–311.
Wallace, M.T., and B.E. Stein. 1997. Development of multisensory neurons and multisensory integration in cat
superior colliculus. Journal of Neuroscience, 17, 2429–2444.
Wallace, M.T., and B.E. Stein. 2000. Onset of cross-modal synthesis in the neonatal superior colliculus is gated
by the development of cortical influences. Journal of Neurophysiology, 83, 3578–3582.
Wallace, M.T., and B.E. Stein. 2001. Sensory and multisensory responses in the newborn monkey superior col-
liculus. Journal of Neuroscience, 21, 8886–8894.
Wallace, M.T., M.A. Meredith, and B.E. Stein. 1992. Integration of multiple sensory modalities in cat cortex.
Experimental Brain Research, 91, 484–488.
Wallace, M.T., L.K. Wilkinson, and B.E. Stein. 1996. Representation and integration of multiple sensory inputs
in primate superior colliculus. Journal of Neurophysiology, 76, 1246–1266.
Wallace, M.T., T.J. Perrault Jr., W.D. Hairston, and B.E. Stein. 2004. Visual experience is necessary for the
development of multisensory integration. Journal of Neuroscience, 24, 9580–9584.
Wallace, M.T., B.N. Carriere, T.J. Perrault Jr., J.W. Vaughan, and B.E. Stein. 2006. The development of cortical
multisensory integration. Journal of Neuroscience, 26, 11844–11849.
Wang, X., T. Lu, R.K. Snider, and L. Liang. 2005. Sustained firing in auditory cortex evoked by preferred
stimuli. Nature, 435, 341–346.
Xu, L., S. Furukawa, and J.C. Middlebrooks. 1999. Auditory cortical responses in the cat to sounds that pro-
duce spatial illusions. Nature, 399, 688–691.
Ye, C.Q., M.M. Poo, Y. Dan, and X.H. Zhang. 2010. Synaptic mechanisms of direction selectivity in primary
auditory cortex. Journal of Neuroscience, 30, 1861–1868.
Zampini, M., D.I. Shore, and C. Spence. 2003. Audiovisual temporal order judgments. Experimental Brain
Research, 152, 198–210.
Zampini, M., S. Guest, D.I. Shore, and C. Spence. 2005a. Audio-visual simultaneity judgments. Perception &
Psychophysics, 67, 531–544.
Zampini, M., D.I. Shore, and C. Spence. 2005b. Audiovisual prior entry. Neuroscience Letters, 381, 217–222.
12 Early Integration and
Bayesian Causal Inference
in Multisensory Perception
Ladan Shams

CONTENTS
12.1 Introduction........................................................................................................................... 217
12.2 Early Auditory–Visual Interactions in Human Brain............................................................ 218
12.3 Why Have Cross-Modal Interactions?................................................................................... 219
12.4 The Problem of Causal Inference.......................................................................................... 220
12.5 Spectrum of Multisensory Combinations..............................................................................220
12.6 Principles Governing Cross-Modal Interactions................................................................... 222
12.7 Causal Inference in Multisensory Perception........................................................................ 223
12.8 Hierarchical Bayesian Causal Inference Model.................................................................... 225
12.9 Relationship with Nonhierarchical Causal Inference Model................................................ 226
12.10 Hierarchical Causal Inference Model versus Human Data................................................. 226
12.11 Independence of Priors and Likelihoods............................................................................. 227
12.12 Conclusions.......................................................................................................................... 229
References....................................................................................................................................... 229

12.1  INTRODUCTION
Brain function in general, and perception in particular, has been viewed as highly modular for more
than a century. Although phrenology is considered obsolete, its general notion of the brain being
composed of compartments each devoted to a single function and independent of other functions has
been the dominant paradigm, especially in the context of perception (Pascual-Leone and Hamilton
2001). In the cerebral cortex, it is believed that the different sensory modalities are organized into
separate pathways that are independent of each other, and process information almost completely in
a self-contained manner until the “well digested” processed signals converge at some higher-order
level of processing in the polysensory association cortical areas, wherein the unified perception of
the environment is achieved. The notion of modularity of sensory modalities has been particularly
strong as related to visual perception. Vision has been considered to be highly self-contained and
independent of extramodal influences. This view has several sources. Humans are considered to
be “visual animals,” a notion underscored in contemporary society by the ever-increasing
importance of text and images in our lives, along with the advent of electricity (and
light at night). The notion of visual dominance has been supported by classic and well-known
studies of cross-modal interactions in which a conflict was artificially imposed between vision and
another modality, and vision was found to override the conflicting sensory modality. For example, in
the ventriloquist illusion, vision captures the location of discrepant auditory stimulus (Howard and
Templeton 1966). Similarly, in the “visual capture” effect, vision captures the spatial location of a
tactile or proprioceptive stimulus (Rock and Victor 1964). In the McGurk effect, vision strongly and

qualitatively alters the perceived syllable (McGurk and MacDonald 1976). As a result, the influence
of vision on other modalities has been acknowledged for some time. However, the influence of other
modalities on vision has not been appreciated until very recently. There have been several reports
of vision being influenced by another modality; however, most of these have involved quantitative
effects (Gebhard and Mowbray 1959; Scheier et al. 1999; Walker and Scott 1981; McDonald et al.
2000; Spence and Driver 1997; Spence et al. 1998; Stein et al. 1996). Over the past few years, two
studies have reported radical alterations of visual perception by auditory modality. In one case, the
motion trajectory of two visual targets is sometimes changed from a streaming motion to a bounc-
ing motion by a brief sound occurring at the time of visual coincidence (Sekuler et al. 1997). In this
case, the motion of the visual stimuli is, in principle, ambiguous in the absence of sound, and one
could argue that the sound simply resolves this ambiguity. In another study, we found that the perceived
number of pulsations of a visual flash (for which there is no obvious ambiguity) is often increased
when paired with multiple beeps (Shams et al. 2000, 2002). This phenomenon demonstrates, in an
unequivocal fashion, that visual perception can be altered by a nonvisual signal. The effect is also
very robust and resistant to changes in the shape, pattern, intensity, and timing of the visual and
auditory stimuli (Shams et al. 2001, 2002; Watkins et al. 2006). For this reason, this illusion, known
as the “sound-induced flash illusion,” appears to reflect a mainstream mechanism of auditory–visual
interaction in the brain as opposed to some aberration in neural processing. Thus, we used the
sound-induced flash illusion as an experimental paradigm for investigating auditory–visual interac-
tions in the human brain.

12.2  EARLY AUDITORY–VISUAL INTERACTIONS IN HUMAN BRAIN


The first question we asked was, at what level of processing do auditory–visual perceptual interac-
tions occur? Do they occur at some higher-order polysensory area in the association cortex or do
they involve the modulation of activation along the visual cortex? We examined whether visually
evoked potentials, as recorded from three electrodes in the occipital regions of the scalp, are affected
by sound. We recorded evoked potentials under visual-alone (1flash or 2flashes), auditory-alone
(2beeps), and auditory–visual (1flash2beeps) stimulus conditions. When comparing the pattern of
activity associated with a second physical flash (2flash – 1flash) with that of an illusory second
flash (i.e., 1flash2beeps – 1flash – 2beeps), we obtained a very similar temporal pattern of activity
(Shams et al. 2001). Furthermore, for the 1flash2beep condition, comparing illusion and no-illusion
trials revealed that the perception of illusion was associated with increased gamma-band activity
in the occipital region (Bhattacharya et al. 2002). A magnetoencephalography (MEG) study of the
flash illusion revealed the modulation of activity in occipital channels by sound as early as 35 to
65 ms poststimulus onset (Shams et al. 2005a). These results altogether indicated a mechanism of
auditory–visual interaction with very short latency, and in the occipital cortex. However, to map
the exact location of the interactions, we needed higher spatial resolution. Therefore, we performed
functional MRI (fMRI) studies of the sound-induced flash illusion. In these studies (Watkins et al.
2006, 2007), the visual cortical areas were functionally mapped for each individual subject using
retinotopic mapping. We contrasted auditory–visual conditions (1flash1beep, 2flash2beep) versus
visual-alone conditions (1flash, 2flash). This contrast indicated auditory cortical areas, which is not
surprising because in one condition, there is sound, and in another condition, there is no sound.
But interestingly, the contrast also indicated areas V1, V2, and V3, which is surprising because
the visual stimulus is identical in the contrasted conditions. Therefore, these results (Watkins et al.
2006) clearly demonstrated for the first time (but see Calvert et al. 2001) that activity in the human
visual cortex as early as V1 can be modulated by nonvisual stimulation. The observed increase in
activation was very robust and significant. We suspected that this increase in activity might reflect
a general arousal effect caused by sound, as opposed to auditory–visual integration per se.
Indeed, attention has been previously shown to increase activity in early visual cortical areas. To
address this question, we focused on the 1flash2beep condition which, in some trials, gave rise to
an illusory percept of two flashes (also referred to as a fission effect). We compared the illusion and
no-illusion trials, reasoning that given that the physical stimuli are identical in both of these post
hoc–defined conditions, the arousal level should also be equal. Contrasting illusion and nonillusion
trials revealed increased activity in V1 in the illusion condition (Watkins et al. 2006), indicating
that the perception of illusion is correlated with increased activity in V1. Although this contradicts
the attention hypothesis laid out earlier, one could still argue that sound may only increase arousal
in some trials and those trials happen to be the illusion trials. Although this argument confounds
attention with integration, we could nevertheless address it using another experiment in which
we included a 2flash1beep condition. On some trials of this condition, the two flashes are fused,
leading to an illusory percept of a single flash (also referred to as a fusion effect), whereas in other
trials, the observers correctly perceived two flashes. Contrasting the illusion and nonillusion tri-
als, we again found a significant difference in the activation level of V1; however, this time, the
perception of sound-induced visual illusion was correlated with decreased activity in V1 (Watkins
et al. 2007), therefore ruling out the role of attention or arousal. As mentioned above, the event-
related potential (ERP) study showed a similar temporal pattern of activity for the illusory and
physical second flash. Here, we found a similar degree of V1 activation for physical and illusory
double flash, and a similar degree of activation for the physical and illusory single flash (Watkins
et al. 2007). These results altogether establish clearly that activity in early visual cortical areas,
as early as in the primary visual cortex, is modulated by sound through cross-modal integration
processes.
What neural pathway could underlie these early auditory–visual interactions? Again, the last
decade has witnessed the overturning of another dogma: the dogma of no connectivity among
the sensory cortical areas. There has been mounting evidence for direct and indirect anatomical
connectivity among the sensory cortical areas (e.g., Clavagnier et al. 2004; Falchier et al. 2002;
Ghazanfar and Schroeder 2006; Rockland and Ojima 2003; Hackett et al. 2007). Of particular
interest here are the findings of extensive projections from the auditory core and parabelt, and from the
multisensory superior temporal polysensory area, to V1 and V2 in the monkey (Falchier et
al. 2002; Rockland and Ojima 2003; Clavagnier et al. 2004). Intriguingly, these projections appear
to be extensive only for the peripheral representations in V1, and not for the foveal representa-
tions (Falchier et al. 2002). This pattern is highly consistent with the much stronger behavioral and
physiological auditory modulation of vision in the periphery compared with the fovea that we have
observed (Shams et al. 2001). Interestingly, tactile modulation of visual processing also seems to
be stronger in the periphery (Diederich and Colonius 2007). Therefore, it seems likely that a direct
projection from A1 or a feedback projection from superior temporal sulcus (STS) could mediate the
modulations we have observed. We believe that the former may be more likely because, although the
activation in V1 was found to track the number of perceived flashes, the activation of area STS was
always increased with the perception of illusion, regardless of the type of illusion (single or double
flash; Watkins et al. 2006, 2007). Therefore, these results are more readily consistent with a direct
modulation of V1 by projections from auditory areas.

12.3  WHY HAVE CROSS-MODAL INTERACTIONS?


The findings discussed above as well as those discussed in other chapters, make it clear that cross-
modal interactions are prevalent, and can be very strong and robust. But why? At first glance, it
may not be obvious why having cross-modal interactions would be advantageous or necessary for
human’s survival in the environment. Especially in the context of visual perception, one could
argue that visual perception is highly precise and accurate in so many tasks, that it may even be
disadvantageous to “contaminate” it with other sensory signals that are not as reliable (which could
then cause illusions or errors). Theory tells us, and experimental studies have confirmed, that even
when a second source of information is not very reliable, combining two sources of information
could result in superior estimation compared with using only the most reliable source. Maximum
likelihood estimation of an object property using two independent cues, for example, an auditory esti-
mate and a visual estimate, results in an estimate that is more reliable (more precise) than either one of
the individual estimates. Many studies of multisensory perception have confirmed that the human ner-
vous system integrates two cross-modal estimates in a similar fashion (e.g., Alais and Burr 2004; Ernst
and Banks 2002; van Beers et al. 1999; Ronsse et al. 2009). Therefore, in this framework, integrating information across
modalities is always beneficial. Interestingly, recent studies using single-cell recordings and behavioral
measurements from macaque monkeys have provided a bridge between the behavioral manifestations of
multisensory integration and neural activity, showing that the activity of multisensory (visual–vestibular)
neurons is consistent with Bayesian cue integration (for a review, see Angelaki et al. 2009).
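As a concrete reference point (a standard result, not tied to any particular study cited above): for two unbiased estimates xA and xV corrupted by independent Gaussian noise with variances σA² and σV², the maximum-likelihood combination and its variance are
\[
\hat{s}_{AV} = \frac{x_A/\sigma_A^2 + x_V/\sigma_V^2}{1/\sigma_A^2 + 1/\sigma_V^2},
\qquad
\sigma_{AV}^2 = \left(\frac{1}{\sigma_A^2} + \frac{1}{\sigma_V^2}\right)^{-1} \le \min\left(\sigma_A^2, \sigma_V^2\right),
\]
so the combined estimate is at least as precise as the better of the two unisensory estimates.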

12.4  THE PROBLEM OF CAUSAL INFERENCE


Although it is beneficial to integrate information from different modalities if the signals correspond
to the same object, one could see that integrating information from two different objects would not
be advantageous. For example, while trying to cross the street on a foggy day, it would be beneficial
to combine auditory and visual information to estimate the speed and direction of an approaching
car. It could be a fatal mistake, on the other hand, to combine the information from the sound of
a car moving behind us in the opposite direction with the image of another moving car in front of
us. It should be noted that humans (as with most other organisms) are constantly surrounded by
multiple objects and thus multiple sources of sensory stimulation. Therefore, at any given moment,
the nervous system is engaged in processing multiple sensory signals across the senses, and not all
of these signals are caused by the same object, and therefore not all of them should be bound and
integrated. The problem of whether to combine two signals involves an (implicit or explicit) infer-
ence about whether the two signals are caused by the same object or by different objects, i.e., causal
inference. This is not a trivial problem, and cannot be simply solved, for example, based on whether
the two signals originate from the same coordinates in space. The different senses have different
precisions in all dimensions, including the temporal and spatial dimensions, and even if the two
signals are derived from the same object/event, the noise in the environment and in the nervous sys-
tem makes the sensory signals somewhat inconsistent with each other most of the time. Therefore,
the nervous system needs to use as much information as possible to solve this difficult problem. It
appears that whether two sensory signals are perceptually bound together typically depends on a
combination of spatial, temporal, and structural consistency between the signals as well as the prior
knowledge derived from experience about the coupling of the signals in nature. For example, mov-
ing cars often make a frequency-sweep sound; therefore, the prior probability for combining these
two stimuli should be very high. On the other hand, moving cars do not typically produce a bird song;
therefore, the prior bias for combining the image of a car and the sound of a bird is low.
Unlike the problem of causal inference in cognition, which only arises intermittently, the prob-
lem of causal inference in perception has to be solved by the nervous system at any given moment,
and is therefore at the heart of perceptual processing. In addition to solving the problem of causal
inference, the perceptual system also needs to determine how to integrate signals that appear to
have originated from the same source, i.e., to what extent, and in which direction (which modality
should dominate which modality).

12.5  SPECTRUM OF MULTISENSORY COMBINATIONS


To investigate these theoretical issues, we used two complementary experimental paradigms: a tem-
poral numerosity judgment task (Shams et al. 2005b), and a spatial localization task (Körding et al.
2007). These two tasks are complementary in that the former is primarily a temporal task, whereas
the latter is clearly a spatial task. Moreover, in the former, the auditory modality dominates, whereas
in the latter, vision dominates. In both of these paradigms, there are strong illusions that occur under
some stimulus conditions: the sound-induced flash illusion and the ventriloquist illusion.
In the temporal numerosity experiment, a variable number of flashes were presented in the
periphery simultaneously with a variable number of beeps. The task of the observers was to judge
the number of flashes and beeps in each trial. In the spatial localization experiment, a Gabor patch
and/or a noise burst were briefly presented at one of several locations along a horizontal line and
the task of the subject was to judge the location of both the visual and auditory stimuli in each trial.
In both experiments, we observed a spectrum of interactions (Figure 12.1). When there was no dis-
crepancy between the auditory and visual stimuli, the two stimuli were fused (Figure 12.1a, left).
When the discrepancy was small between the two stimuli, they were again fused in a large frac-
tion of trials (Figure 12.1a, middle and right). These trials are those in which an illusion occurred.
For example, when one flash paired with two beeps was presented, in a large fraction of trials, the
observers reported seeing two flashes (sound-induced flash illusion) and hearing two beeps. The
reverse illusion occurred when two flashes paired with one beep were seen as a single flash in a
large fraction of trials. Similarly, in the localization experiment, when the spatial gap between the
flash and noise burst was small (5°), the flash captured the location of the sound in a large fraction of
trials (ventriloquist illusion). In the other extreme, when the discrepancy between the auditory and
visual stimuli was large, there was little interaction, if any, between the two. For example, in the
1flash4beep or 4flash1beep conditions in the numerosity judgment experiments, or in the conditions
in which the flash was all the way to the left and noise all the way to the right or vice versa in the
localization experiment, there was hardly any shift in the visual or auditory percepts relative to the
unisensory conditions. We refer to this lack of interaction as segregation (Figure 12.1c) because it
appears that the signals are kept separate from each other. Perhaps most interestingly, in conditions
in which there was a moderate discrepancy between the two stimuli, sometimes there was a partial
shift of the two modalities toward each other. We refer to this phenomenon as “partial integration”
(Figure 12.1b). For example, in the 1flash3beep condition, the observers sometimes reported seeing
two flashes and hearing three beeps. Or, in the condition in which the flash was at –5° (left of fixa-
tion) and the noise at +5° (right of fixation), the observers sometimes reported hearing the noise at
0° and seeing the flash at –5°. In summary, in both experiments, we observed a

[Figure 12.1 appears here: schematic panels (a) fusion, (b) partial integration, and (c) segregation, arranged along increasing audiovisual conflict; see caption below.]
FIGURE 12.1  Range of cross-modal interactions. Horizontal axis in these panels represents a perceptual
dimension such as space, time, number, etc. Light bulb and loudspeaker icons represent visual stimulus
and auditory stimulus, respectively. Eye and ear icons represent visual and auditory percepts, respectively.
(a) Fusion. Three examples of conditions in which fusion often occurs. Left: when stimuli are congruent and
veridically perceived. Middle: when discrepancy between auditory and visual stimuli is small, and percept
corresponds to a point in between two stimuli. Right: when discrepancy between two stimuli is small, and one
modality (in this example, vision) captures the other modality. (b) Partial integration. Left: when discrepancy
between two stimuli is moderate, and the less reliable modality (in this example, vision) gets shifted toward
the other modality but does not converge. Right: when discrepancy is moderate and both modalities get shifted
toward each other but not enough to converge. (c) Segregation. When conflict between two stimuli is large, and
the two stimuli do not affect each other.
[Figure 12.2 appears here: (a) % visual bias vs. number disparity (#); (b) % auditory bias vs. spatial disparity (deg.); see caption below.]
FIGURE 12.2  Interaction between auditory and visual modalities as a function of conflict. (a) Visual bias
(i.e., influence of sound on visual perception) as a function of discrepancy between number of flashes and
beeps in temporal numerosity judgment task. (b) Auditory bias (i.e., influence of vision on auditory perception)
as a function of spatial gap between the two in spatial localization task.

spectrum of interactions between the two modalities. When the discrepancy is zero or small, the
two modalities tend to get fused. When the conflict is moderate, partial integration may occur, and
when the conflict is large, the two signals tend to be segregated (Figure 12.1, right). In both experi-
ments, the interaction between the two modalities gradually decreased as the discrepancy between
the two increased (Figure 12.2).
What would happen if we had more than two sensory signals, for example, a visual, an
auditory, and a tactile signal, as is most often the case in nature? We investigated this scenario using
the numerosity judgment task (Wozny et al. 2008). We presented a variable number of flashes paired
with a variable number of beeps and a variable number of taps, providing unisensory, bisensory,
and trisensory conditions pseudorandomly interleaved. The task of the participants was to judge the
number of flashes, beeps, and taps on each trial. This experiment provided a rich set of data that
replicated the sound-induced flash illusion (Shams et al. 2000) and the touch-induced flash illusion
(Violentyev et al. 2005), as well as many previously unreported illusions. In fact, in every condition
in which there was a small discrepancy between two or three modalities, we observed an illusion.
This finding demonstrates that the interaction among these modalities is the rule rather than the
exception, and the sound-induced flash illusions that have been previously reported are not “special”
in the sense that they are not unusual or out of the ordinary, but rather, they are consistent with a
general pattern of cross-modal interactions that cuts across modalities and stimulus conditions.
We wondered whether these changes in perceptual reports reflect a change in response criterion
as opposed to a change in perception per se. We calculated the sensitivity (d′) change between
bisensory and unisensory conditions (and between trisensory and bisensory conditions) and found
statistically significant changes in sensitivity as a result of the introduction of a second (or third)
sensory signal in most of the cases despite the very conservative statistical criterion used. In other
words, the observed illusions (both fission and fusion) reflect cross-modal integration processes, as
opposed to response bias.
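To illustrate the logic of this analysis (a hedged sketch only: the hit and false-alarm rates below are hypothetical, and the published analyses may differ in detail), a standard yes/no signal-detection computation of d′ separates sensitivity from response criterion:
```python
from scipy.stats import norm

def d_prime(hit_rate, false_alarm_rate):
    """Standard signal-detection sensitivity: z(hit rate) - z(false-alarm rate)."""
    return norm.ppf(hit_rate) - norm.ppf(false_alarm_rate)

# Hypothetical rates for reporting a second flash, with and without a second beep.
d_unisensory = d_prime(hit_rate=0.70, false_alarm_rate=0.20)
d_bisensory = d_prime(hit_rate=0.60, false_alarm_rate=0.35)
delta_d = d_bisensory - d_unisensory  # a reliable change in d' implicates perception, not just criterion
```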

12.6  PRINCIPLES GOVERNING CROSS-MODAL INTERACTIONS


Is there anything surprising about the fact that there are a range of interactions between the senses?
Let us examine that. Intuitively, it is reasonable for the brain to combine different sources of infor-
mation to come up with the most informative guess about an object, if all the bits of information
are about the same object. For example, if we are holding a mug in our hand, it makes sense that
we use both haptic and visual information to estimate the shape of the mug. It is also expected for
the bits of information to be fairly consistent with each other if they arise from the same object.
Therefore, it would make sense for the nervous system to fuse the sensory signals when there is
little or no discrepancy between the signals. Similarly, as discussed earlier, it is reasonable for the
nervous system not to combine the bits of information if they correspond to different objects. It is
also expected for the bits of information to be highly disparate if they stem from different objects.
Therefore, if we are holding a mug while watching TV, it would be best not to combine the visual
and haptic information. Therefore, segregation also makes sense from a functional point of view.
How about partial integration? Is there a situation in which partial integration would be beneficial?
There is no intuitively obvious explanation for partial integration, as we do not encounter situations
wherein two signals are only partially caused by the same object. Therefore, the phenomenon of
partial integration is rather curious. Is there a single rule that can account for the entire range of
cross-modal interactions including partial integration?

12.7  CAUSAL INFERENCE IN MULTISENSORY PERCEPTION


The traditional model of cue combination (Ghahramani 1995; Yuille and Bülthoff 1996; Landy et
al. 1995), which has been the dominant model for many years, assumes that the sensory cues all
originate from the same object (Figure 12.3a) and therefore they should all be fused to obtain an
optimal estimate of the object property in question. In this model, it is assumed that the sensory
signals are corrupted by independent noise and, therefore, are conditionally independent of each
other. The optimal estimate of the source is then a linear combination of the two sensory cues. If
Gaussian distributions are assumed for the sensory cues, and there is no a priori bias,
this linear combination becomes a weighted average of the two sensory estimates, with each
estimate weighted by its precision (or inverse of variance). This model has been very successful in
accounting for the integration of sensory cues in various tasks and various combinations of sensory
modalities (e.g., Alais and Burr 2004; Ernst and Banks 2002; Ghahramani 1995; van Beers et al.
1999). Although this model can account well for behavior when the conflict between the two signals
is small (i.e., for situations of fusion, for obvious reasons), it fails to account for the rest of the spec-
trum (i.e., partial integration and segregation).
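For reference, a minimal sketch of this forced-fusion rule in code, assuming unbiased Gaussian likelihoods and no prior bias; the function and variable names are illustrative rather than taken from the studies cited above.
```python
def forced_fusion(x_a, sigma_a, x_v, sigma_v):
    """Traditional single-cause cue combination: a precision-weighted average.

    Returns the fused estimate and its standard deviation.
    """
    w_a = 1.0 / sigma_a ** 2           # auditory precision (inverse variance)
    w_v = 1.0 / sigma_v ** 2           # visual precision
    s_hat = (w_a * x_a + w_v * x_v) / (w_a + w_v)
    sigma_fused = (w_a + w_v) ** -0.5  # never larger than the smaller unisensory sigma
    return s_hat, sigma_fused
```
Because the output depends only on the cue values and their precisions, and not on the size of the conflict between them, the cues are always fully fused, which is precisely the limitation described above.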

[Figure 12.3 appears here: graphical models (a)–(d), with source variables s and sensory signals x as described in the caption below.]
FIGURE 12.3  Generative model of different models of cue combination. (a) Traditional model of cue combi-
nation, in which two signals are assumed to be caused by one source. (b) Causal inference model of cue combi-
nation, in which each signal has a respective cause, and causes may or may not be related. (c) Generalization of
model in (b) to three signals. (d) Hierarchical causal inference model of cue combination. There are two explicit
causal structures, one corresponding to common cause and one corresponding to independent causes, and
variable C chooses between the two. (b, Adapted from Shams, L. et al., Neuroreport, 16, 1923–1927, 2005b; c,
adapted from Wozny, D.R. et al., J. Vis., 8, 1–11, 2008; d, Körding, K. et al., PLoS ONE, 2, e943, 2007.)
To come up with a general model that can account for the entire range of interactions, we aban-
doned the assumption of a single source, and allowed each of the sensory cues to have a respective
source. By allowing the two sources to be either dependent or independent, we allowed for both
conditions of a common cause and conditions of independent causes for the sensory signals (Figure
12.3b). We assume that the two sensory signals (xA and xV) are conditionally independent of each
other. This follows from the assumption that up to the point where the signals get integrated, the
sensory signals in different modalities are processed in separate pathways and thus are corrupted
by independent noise processes. As mentioned above, this is a common assumption. The additional
assumption made here is that the auditory signal is independent of the visual source (sV) given the
auditory source (sA), and likewise for visual signal. This is based on the observation that either the
two signals are caused by the same object, in which case, the dependence of auditory signal on
the visual source is entirely captured by its dependence on the auditory source, or they are caused
by different objects, in which case, the auditory signal is entirely independent of the visual source
(likewise for visual signal). In other words, this assumption follows from the observation that there
is either a common source or independent sources. This general model of bisensory perception
(Shams et al. 2005b) results in a very simple inference rule:

\[
P(s_A, s_V \mid x_A, x_V) = \frac{P(x_A \mid s_A)\, P(x_V \mid s_V)\, P(s_A, s_V)}{P(x_A, x_V)} \tag{12.1}
\]

where the probability of the auditory and visual sources, sA and sV, given the sensory signals xA
and xV is a normalized product of the auditory likelihood (i.e., the probability of getting a signal xA
given that there is a source sA out there), the visual likelihood (i.e., the probability of getting a signal
xV given that there is a source sV), and the prior probability of sources sA and sV occurring jointly.
The joint prior probability P(sA,sV) represents the implicit knowledge that the perceptual system has
accumulated over the course of a lifetime about the statistics of auditory–visual events in the envi-
ronment. In effect, it captures the coupling between the two modalities, and therefore, how much
the two modalities will interact in the process of inference. If the two signals (e.g., the number of
flashes and beeps) have always been consistent in one’s experience, then the expectation is that they
will be highly consistent in the future, and therefore, the joint prior matrix would be diagonal (only
the identical values of number of flashes and beeps are allowed, and the rest will be zero). On the
other hand, if in one’s experience, the number of flashes and beeps are completely independent of
each other, then P(sA,sV) would be factorizable (e.g., a uniform distribution or an isotropic Gaussian
distribution) indicating that the two events have nothing to do with each other, and can take on any
values independently of each other. Therefore, by having nonzero values for both sA = sV and sA ≠
sV in this joint probability distribution, both common cause and independent cause scenarios are
allowed, and the relative strength of these probabilities would determine the prior expectation of
a common cause versus independent causes. Other recent models of multisensory integration have
also used joint prior probabilities to capture the interaction between two modalities, for example, in
haptic–visual numerosity judgment tasks (Bresciani et al. 2006) and auditory–visual rate perception
(Roach et al. 2006).
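As a rough illustration of how Equation 12.1 can be evaluated in practice, the sketch below computes the posterior over a discrete grid of candidate numerosities; the noise widths and joint-prior values are invented for illustration and are not the parameters estimated in the studies cited above.
```python
import numpy as np

counts = np.arange(1, 5)  # candidate source values s_A, s_V (number of events)

def likelihood(x, sigma):
    """P(x | s) for each candidate s, assuming Gaussian sensory noise (illustrative)."""
    return np.exp(-0.5 * ((x - counts) / sigma) ** 2)

like_a = likelihood(x=2.0, sigma=0.4)  # a fairly reliable auditory signal near "2"
like_v = likelihood(x=1.0, sigma=0.9)  # a noisier visual signal near "1"

# Joint prior P(s_A, s_V): heavier on the diagonal (the numbers usually agree),
# smaller but nonzero off the diagonal (independent causes remain possible).
prior = np.full((counts.size, counts.size), 0.02)
np.fill_diagonal(prior, 0.20)
prior /= prior.sum()

# Equation 12.1: posterior is proportional to auditory likelihood × visual likelihood × joint prior.
posterior = like_a[:, None] * like_v[None, :] * prior
posterior /= posterior.sum()

post_a = posterior.sum(axis=1)  # marginal posterior over the auditory source
post_v = posterior.sum(axis=0)  # marginal posterior over the visual source
```
With these illustrative values, the strongly diagonal prior pulls the visual marginal toward two events, qualitatively mirroring the sound-induced flash illusion; a factorizable prior would instead leave the two marginals independent.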
The model of Equation 12.1 is simple, general, and readily extendable to more complex situa-
tions. For example, the inference rule for trisensory perception (Figure 12.3c) would be as follows:

\[
P(s_A, s_V, s_T \mid x_A, x_V, x_T) = \frac{P(x_A \mid s_A)\, P(x_V \mid s_V)\, P(x_T \mid s_T)\, P(s_A, s_V, s_T)}{P(x_A, x_V, x_T)} \tag{12.2}
\]

To test the trisensory perception model of Equation 12.2, we modeled the three-dimensional joint
prior P(sA,sV,sT) with a multivariate Gaussian function, and each of the likelihood functions with
a univariate Gaussian function. The means of the likelihoods were assumed to be unbiased (i.e., on
average at the veridical number), and the standard deviation of the likelihoods was estimated using
data from unisensory conditions. It was also assumed that the mean and variance for the prior of
the three modalities were equal, and the three covariances (for three pairs of modalities) were also
equal.* This resulted in a total of three free parameters (mean, variance, and covariance of the
prior). These parameters were fitted to the data from the trisensory numerosity judgment experi-
ment discussed earlier. The model accounted for 95% of variance in the data (676 data points) using
only three free parameters. To test whether the three parameters rendered the model too powerful
and able to account for any data set, we scrambled the data and found that the model badly failed
to account for the arbitrary data (R² &lt; .01). In summary, the Bayesian model of Figure 12.3c could
provide a remarkable account for the myriad of two-way and three-way interactions observed in
the data.
* These assumptions were made to minimize the number of free parameters and maximize the parsimony of the model. However, the assumptions were verified by fitting a model with nine parameters (allowing different values for the mean, variance, and covariance across modalities) to the data, and finding almost equal values for all three means, all three variances, and all three covariances.

12.8  HIERARCHICAL BAYESIAN CAUSAL INFERENCE MODEL


The model described above can account for the entire range of interactions. However, it does not
directly make predictions about the perceived causal structure. To make such predictions, one needs
a hierarchical model in which a variable (C in Figure 12.3d) chooses between the different causal
structures. We describe this model in the context of the spatial localization task as an example. In
this model, the probability of a common cause (i.e., C = 1) is computed using Bayes' rule as follows:

\[
p(C = 1 \mid x_V, x_A) = \frac{p(x_V, x_A \mid C = 1)\, p(C = 1)}{p(x_V, x_A)} \tag{12.3}
\]

According to this rule, the probability of a common cause is proportional to the product of two factors. The
left term in the numerator—the likelihood that the two sensory signals occur if there is a common
cause—is a function of how similar the two sensory signals are. The more dissimilar the two sig-
nals, the lower this probability will be. The right term in the numerator is the a priori expectation
of a common cause, and is a function of prior experience (how often two signals are caused by the
same source in general). The denominator again is a normalization factor.
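Under the generative model of Figure 12.3d, and with the conditional-independence assumptions stated earlier, these two likelihood terms are obtained by marginalizing over the unobserved source(s):
\[
p(x_V, x_A \mid C = 1) = \int p(x_V \mid s)\, p(x_A \mid s)\, p(s)\, \mathrm{d}s,
\qquad
p(x_V, x_A \mid C = 2) = \int p(x_V \mid s_V)\, p(s_V)\, \mathrm{d}s_V \int p(x_A \mid s_A)\, p(s_A)\, \mathrm{d}s_A .
\]
For unimodal (e.g., Gaussian) likelihoods and prior, the first integral is large only when xV and xA fall close to each other, which is why similar signals favor the common-cause interpretation.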
Given this probability of a common cause, the location of the auditory and visual stimulus can
now be computed as follows:


\[
\hat{s} = p(C = 1 \mid x_V, x_A)\, \hat{s}_{C=1} + p(C = 2 \mid x_V, x_A)\, \hat{s}_{C=2} \tag{12.4}
\]

where ŝ denotes the overall estimate of the location of the sound (or visual stimulus), and ŝC=1 and ŝC=2
denote the optimal estimates of location for the common-cause scenario and the independent-
causes scenario, respectively. The inference rule is interesting because it is a weighted average of two
optimal estimates, and it is nonlinear in xA and xV.
What does this inference rule mean? Let us focus on auditory estimation of location for example,
and assume Gaussian functions for prior and likelihood functions over space. If the task of the
observer is to judge the location of sound, then if the observer knows for certain that the auditory
and visual stimuli were caused by two independent sources (e.g., a puppeteer talking and a puppet
moving), then the optimal estimate of the location of sound would be entirely based on the auditory

* These assumptions were made to minimize the number of free parameters and maximize the parsimony of the model.
However, the assumptions were verified by fitting a model with nine parameters (allowing different values for the mean,
variance, and covariance across modalities) to the data, and finding almost equal values for all three means, all three
variances, and all three covariances.
information and the prior: $\hat{s}_{A,C=2} = \frac{x_A/\sigma_A^2 + x_P/\sigma_P^2}{1/\sigma_A^2 + 1/\sigma_P^2}$, where σ_A and σ_P are the standard deviations of the auditory likelihood and the prior, respectively. On the other hand, if the observer knows for certain that the auditory and visual stimuli were caused by the same object (e.g., a puppet talking and moving), then the optimal estimate of the location of sound would take visual information into account: $\hat{s}_{A,C=1} = \frac{x_V/\sigma_V^2 + x_A/\sigma_A^2 + x_P/\sigma_P^2}{1/\sigma_V^2 + 1/\sigma_A^2 + 1/\sigma_P^2}$. In nature, the observer is hardly ever certain about the
causal structure of the events in the environment, and in fact, it is the job of the nervous system
to solve that problem. Therefore, in general, the nervous system would have to take both of these
possibilities into account, thus, the overall optimal estimate of the location of sound happens to be
a weighted average of the two optimal estimates each weighted by their respective probabilities as
in Equation 12.3. It can now be understood how partial integration could result from this optimal
scheme of multisensory perception.
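The two conditional estimates and their weighted combination (Equation 12.4) reduce to a few lines of arithmetic. In the sketch below, mu_p plays the role of the prior mean (the x_P term in the formulas above) and p_c1 is the posterior probability of a common cause from Equation 12.3; names are illustrative.

def auditory_location_estimates(x_v, x_a, sigma_v, sigma_a, sigma_p, mu_p, p_c1):
    """Conditional and model-averaged estimates of the location of the sound."""
    # Independent causes: precision-weighted average of the auditory signal and the prior.
    s_a_c2 = ((x_a / sigma_a**2 + mu_p / sigma_p**2) /
              (1.0 / sigma_a**2 + 1.0 / sigma_p**2))
    # Common cause: the visual signal is taken into account as well.
    s_a_c1 = ((x_v / sigma_v**2 + x_a / sigma_a**2 + mu_p / sigma_p**2) /
              (1.0 / sigma_v**2 + 1.0 / sigma_a**2 + 1.0 / sigma_p**2))
    # Model averaging (Equation 12.4): each conditional estimate is weighted
    # by the posterior probability of its causal structure.
    s_a = p_c1 * s_a_c1 + (1.0 - p_c1) * s_a_c2
    return s_a_c1, s_a_c2, s_a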
It should be noted that Equation 12.4 is derived assuming a mean squared error cost function.
This is a common assumption, and roughly speaking, it means that the nervous system tries to
minimize the average magnitude of error. The mean squared error function is minimized if the
mean of the posterior distribution is selected as the estimate. The estimate shown in Equation 12.4
corresponds to the mean of the posterior distribution, and as it is a weighted average of the estimates
of the two causal structures (i.e., ŝA,C = 2 and ŝA,C = 1), it is referred to as “model averaging.” If, on the
other hand, the goal of the perceptual system is to minimize the number of times that an error is
made, then the maximum of the posterior distribution would be the optimal estimate. In this sce-
nario, the overall estimate of location would be the estimate corresponding to the causal structure
with the higher probability, and thus, this strategy is referred to as “model selection.” Although the
model averaging strategy of Equation 12.4 provides estimates that are never entirely consistent with
either one of the two possible scenarios (i.e., with what occurs in the environment), this strategy
does minimize the magnitude of error on average (the mean squared error) more than any other
strategy, and therefore, it is optimal given the cost function.
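The difference between the two readout strategies can be made explicit in a few lines; this is an illustrative sketch of the decision rules, not the code used in the studies cited below.

def read_out(s_c1, s_c2, p_c1, strategy="averaging"):
    """Combine the two conditional estimates under a given cost function.

    "averaging" minimizes the mean squared error (posterior mean, Equation 12.4);
    "selection" minimizes the probability of error (commit to the more probable
    causal structure and use its estimate).
    """
    if strategy == "averaging":
        return p_c1 * s_c1 + (1.0 - p_c1) * s_c2
    if strategy == "selection":
        return s_c1 if p_c1 >= 0.5 else s_c2
    raise ValueError("unknown strategy: " + strategy)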

12.9  RELATIONSHIP WITH NONHIERARCHICAL CAUSAL INFERENCE MODEL


The hierarchical causal inference model of Equation 12.3 can be thought of as a special form of
the nonhierarchical causal inference model of Equation 12.1. By integrating out the hidden variable
C, the hierarchical model can be recast as $p(s_A, s_V \mid x_A, x_V) = \frac{p(x_A \mid s_A)\, p(x_V \mid s_V)\, p(s_A, s_V)}{p(x_A, x_V)}$, where $p(s_A, s_V) = p(C = 1)\,p(s) + p(C = 2)\,p(s_A)\,p(s_V)$. In other words, the hierarchical model is a special form
of the nonhierarchical model in which the joint prior is a mixture of two priors, a prior correspond-
ing to the independent sources, and a prior corresponding to common cause. The main advantage of
the hierarchical model over the nonhierarchical model is that it performs causal inference explicitly
and allows making direct predictions about perceived causal structure (C).
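The equivalence can be checked numerically by constructing the mixture joint prior on a discrete grid, with the common-cause component placing all of its mass on the diagonal s_A = s_V. This is a sketch under the same Gaussian-prior assumption used above; grid and parameter values are arbitrary.

import numpy as np

def mixture_joint_prior(s_grid, sigma_p, mu_p, p_common):
    """Joint prior p(sA, sV): mixture of a common-cause component (mass on the
    diagonal sA == sV) and an independent-causes component (product of marginals)."""
    p_s = np.exp(-0.5 * ((s_grid - mu_p) / sigma_p) ** 2)
    p_s /= p_s.sum()                       # discrete prior over the grid

    independent = np.outer(p_s, p_s)       # p(C = 2) component: p(sA) p(sV)
    common = np.diag(p_s)                  # p(C = 1) component: mass only where sA == sV

    return p_common * common + (1.0 - p_common) * independent

joint = mixture_joint_prior(np.linspace(-20.0, 20.0, 81), sigma_p=15.0, mu_p=0.0, p_common=0.3)
print(joint.sum())   # 1.0: a proper joint distribution over the grid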

12.10  HIERARCHICAL CAUSAL INFERENCE MODEL VERSUS HUMAN DATA


We tested whether the hierarchical causal inference model can account for human auditory–visual
spatial localization (Körding et al. 2007). We modeled the likelihood and prior over space using
Gaussian functions. We assumed that the likelihood functions are, on average, centered around the
veridical location. We also assumed that there is a bias for the center (straight ahead) location. There
were four free parameters that were fitted to the data: the prior probability of a common cause, the
standard deviation of the visual likelihood (i.e., the visual sensory noise), the standard deviation of
auditory likelihoods (i.e., the auditory sensory noise), and the standard deviation of the prior over
space (i.e., the strength of the bias for center). Because the width of the Gaussian prior over space is
a free parameter, if there is no such bias for center position, the parameter will take on a large value,
practically rendering this distribution uniform, and thus, the bias largely nonexistent.
The model accounted for 97% of variance in human observer data (1225 data points) using only
four free parameters (Körding et al. 2007). This is a remarkable fit, and as before, is not due to the
degrees of freedom of the model, as the model cannot account for arbitrary data using the same
number of free parameters. Also, if we set the value of the four parameters using some common
sense values or the published data from other studies, and compare the data with the predictions of
the model with no free parameters, we can still account for the data similarly well.
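The goodness-of-fit measure reported here and elsewhere in the chapter, the proportion of variance in the behavioral response distributions explained by the model, can be computed generically as sketched below; how trials are binned into matched observed and predicted proportions is specific to each experiment and is left open here.

import numpy as np

def variance_explained(observed, predicted):
    """R^2 between observed and model-predicted response frequencies.

    observed, predicted: arrays of matched data points, e.g., the proportion of
    trials in each response category for each stimulus condition.
    """
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ss_residual = np.sum((observed - predicted) ** 2)
    ss_total = np.sum((observed - observed.mean()) ** 2)
    return 1.0 - ss_residual / ss_total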
We tested whether model averaging (Equation 12.4) or model selection (see above) explains the
observers’ data better, and found that observers’ responses were far more consistent with model averaging than with model selection.
In our spatial localization experiment, we did not ask participants to report their perceived causal
structure on each trial. However, Wallace and colleagues did ask their subjects to report whether they perceived a unified source for the auditory and visual stimuli on each trial (Wallace et al. 2004).
The hierarchical causal inference model can account for their published data, both the judgments of unity and the spatial localization responses and interactions between the two modalities (Körding et al. 2007).
We compared this model with other models of cue combination on the spatial localization data
set. The causal inference model accounts for the data substantially better than the traditional forced
fusion model of integration, and better than two recent models of integration that do not assume
forced fusion (Körding et al. 2007). One of these models was a model developed by Bresciani et al.
(2006) that assumes a Gaussian ridge distribution as the joint prior, and the other one was a model
developed by Roach et al. (2006) that assumes the sum of a uniform distribution and a Gaussian
ridge as the joint prior.
We tested the hierarchical causal inference model on the numerosity judgment data described ear-
lier. The model accounts for 86% of variance in the data (576 data points) using only four free param-
eters (Beierholm 2007). We also compared auditory–visual interactions and visual–visual interactions
in the numerosity judgment task, and found that both cross-modal and within-modality interactions
could be explained using the causal inference model, with the main difference between the two being
in the a priori expectation of a common cause (i.e., Pcommon). The prior probability of a common cause for the visual–visual condition was higher than that for the auditory–visual condition (Beierholm
2007). Hospedales and Vijayakumar (2009) have also recently shown that an adaptation of the causal
inference model for an oddity detection task accounts well for both within-modality and cross-modal
oddity detection of observers. Consistent with our results, they found the prior probability of a com-
mon cause to be higher for the within-modality task compared with the cross-modality task.
In summary, we found that the causal inference model accounts well for two complementary
sets of data (spatial localization and numerosity judgment), it accounts well for data collected by
another group, it outperforms the traditional and other contemporary models of cue combination
(on the tested data set), and it provides a unifying account of within-modality and cross-modality
integration.

12.11  INDEPENDENCE OF PRIORS AND LIKELIHOODS


These results altogether strongly suggest that human observers are Bayes-optimal in multisensory
perceptual tasks. What does it exactly mean to be Bayes-optimal? The general understanding of
Bayesian inference is that inference is based on two factors, likelihood and prior. Likelihood rep-
resents the sensory noise (in the environment or in the brain), whereas prior captures the statistics
of the events in the environment, and therefore, the two quantities are independent of each other.
Although this is the general interpretation of Bayesian inference, it is important to note that dem-
onstrating that observers are Bayes-optimal under one condition does not necessarily imply that the
likelihoods and priors are independent of each other. It is quite possible that changing the likeli-
hoods would result in a change in priors or vice versa. Given that we are able to estimate likelihoods
and priors using the causal inference model, we can empirically investigate the question of inde-
pendence of likelihoods and priors. Furthermore, it is possible that the Bayes-optimal performance
is achieved without using Bayesian inference (Maloney and Mamassian 2009). For example, it has
been described that an observer using a table-lookup mechanism can achieve near-optimal perfor-
mance using reinforcement learning (Maloney and Mamassian 2009). Because the Bayes-optimal
performance can be achieved by using different processes, it has been argued that comparing human
observer performance with a Bayesian observer in one setting alone is not sufficient as evidence
for Bayesian inference as a process model of human perception. For these reasons, Maloney and
Mamassian (2009) have proposed transfer criteria as more powerful experimental tests of Bayesian
decision theory as a process model of perception. The transfer criterion tests whether a change in one component of the decision process (i.e., likelihood, prior, or decision rule) leaves the other components unchanged. The idea is that if the perceptual system indeed engages in Bayesian inference, a change in the likelihoods, for example, would not affect the priors. However, if the system uses another process, such as table lookup, it would fail these kinds of transfer tests.
We asked whether priors are independent of likelihoods (Beierholm et al. 2009). To address this
question, we decided to induce a strong change in the likelihoods and examine whether this would
lead to a change in priors. To induce a change in likelihoods, we manipulated the visual stimulus.
We used the spatial localization task and tested participants under two visual conditions, one with a
high-contrast visual stimulus (Gabor patch), and one with a low-contrast visual stimulus. The task,
procedure, auditory stimulus, and all other variables were identical across the two conditions that
were tested in two separate sessions. The two sessions were held 1 week apart, so that if the observers learned the statistics of the stimuli during the first session, the effect of this learning would have dissipated by the time of the second session. The change in visual contrast was drastic enough that performance on visual-alone trials in the low-contrast condition was as much as 41% lower than in the high-contrast condition. The performance on auditory-alone trials did not change significantly because the auditory stimuli were unchanged. The model accounts for both sets of data very well (R² = .97 for the high-contrast and R² = .84 for the low-contrast session). Therefore, the performance of the participants appears
to be Bayes-optimal in both the high-contrast and low-contrast conditions. Considering that the
performances in the two sessions were drastically different (substantially worse in the low-contrast
condition), and considering that the priors were estimated from the behavioral responses, there is no
reason to believe that the priors in these two sessions would be equal (as they are derived from very
different sets of data). Therefore, if the estimated priors do turn out to be equal between the two sessions, that would provide strong evidence for the independence of priors from likelihoods.
If the priors are equal, then swapping them between the two sessions should not hurt the good-
ness of fit to the data. We tested this using priors estimated from the low-contrast data to predict
high-contrast data, and the priors estimated from the high-contrast data to predict the low-contrast
data. The results were surprising: the goodness of fit remained almost as good (R² = .97 and R² = .81) as when using priors estimated from the same data set (Beierholm et al. 2009). Next, we directly compared the
estimated parameters of the likelihood and prior functions for the two sessions. The model was
fitted to each individual subject’s data, and the likelihood and prior parameters were estimated
for each subject for each of the two sessions separately. Comparing the parameters across subjects
(Figure 12.4) revealed a statistically significant (P < .0005) difference only for the visual likelihood
(showing a higher degree of noise for the low-contrast condition). No other parameters (neither the
auditory likelihood nor the two prior parameters) were statistically different between the two ses-
sions. Despite a large difference between the two visual likelihoods (by >10 standard deviations), no change was detected in either the probability of a common cause or the prior over space. Therefore,
these results suggest that priors are encoded independently of the likelihoods (Beierholm et al.
2009). These findings are consistent with the findings of a previous study showing that the change
in the kind of perceptual bias transfers qualitatively to other types of stimuli (Adams et al. 2004).
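The logic of this prior-swap (transfer) test can be summarized schematically as follows. The routines fit_model, predict, and r_squared are placeholders for the fitting, prediction, and goodness-of-fit procedures of a given study; only the structure of the test is illustrated.

def prior_transfer_test(data_high, data_low, fit_model, predict, r_squared):
    """Do priors estimated in one session transfer to the other session?

    fit_model(data) is assumed to return a dict with 'likelihood' and 'prior'
    parameter estimates; predict(params, data) returns model predictions.
    """
    fit_high, fit_low = fit_model(data_high), fit_model(data_low)

    # Swap the priors across sessions while keeping each session's own likelihoods.
    swapped_high = {"likelihood": fit_high["likelihood"], "prior": fit_low["prior"]}
    swapped_low = {"likelihood": fit_low["likelihood"], "prior": fit_high["prior"]}

    return {
        "own_priors": (r_squared(data_high, predict(fit_high, data_high)),
                       r_squared(data_low, predict(fit_low, data_low))),
        "swapped_priors": (r_squared(data_high, predict(swapped_high, data_high)),
                           r_squared(data_low, predict(swapped_low, data_low))),
    }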
[Figure 12.4 (graphic): bar graph of the fitted likelihood parameters σV and σA and prior parameters σP (in degrees along azimuth) and Pcommon (percentage common) for the two sessions; only σV differs significantly (***), all other comparisons are n.s.]

FIGURE 12.4  Mean prior and likelihood parameter values across participants in two experimental sessions
differing only in contrast of visual stimulus. Black and gray denote values corresponding to session with
high-contrast and low-contrast visual stimulus, respectively. Error bars correspond to standard error of mean.
(From Beierholm, U. et al., J. Vis., 9, 1–9, 2009. With permission.)

12.12  CONCLUSIONS
Together with a wealth of other accumulating findings, our behavioral findings suggest that cross-modal interactions are ubiquitous, strong, and robust in human perceptual processing. Even visual perception, traditionally believed to be the dominant and highly self-contained modality, can be strongly and radically influenced by cross-modal stimulation. Our ERP, MEG, and fMRI
findings consistently show that visual processing is affected by sound at the earliest levels of corti-
cal processing, namely at V1. This modulation reflects a cross-modal integration phenomenon as
opposed to attentional modulation. Therefore, multisensory integration can occur even at these early
stages of sensory processing, in areas that have been traditionally held to be unisensory.
Cross-modal interactions depend on a number of factors, namely the temporal, spatial, and struc-
tural consistency between the stimuli. Depending on the degree of consistency between the two
stimuli, a spectrum of interactions may result, ranging from complete integration, to partial integra-
tion, to complete segregation. The entire range of cross-modal interactions can be explained by a
Bayesian model of causal inference wherein the inferred causal structure of the events in the envi-
ronment depends on the degree of consistency between the signals as well as the prior knowledge/
bias about the causal structure. Indeed given that humans are surrounded by multiple objects and
hence multiple sources of sensory stimulation, the problem of causal inference is a fundamental
problem at the core of perception. The nervous system appears to have implemented the optimal
solution to this problem as the perception of human observers appears to be Bayes-optimal in mul-
tiple tasks, and the Bayesian causal inference model of multisensory perception presented here can
account in a unified and coherent fashion for an entire range of interactions in a multitude of tasks.
Not only does the performance of observers appear to be Bayes-optimal in multiple tasks, but the priors also appear to be independent of the likelihoods, consistent with the notion of priors encoding the statistics of objects and events in the environment independently of sensory representations.

REFERENCES
Adams, W.J., E.W. Graf, and M.O. Ernst. 2004. Experience can change the ‘light-from-above’ prior. Nature
Neuroscience, 7, 1057–1058.
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Current
Biology, 14, 257–62.
Angelaki, D.E., Y. Gu, and G.C. Deangelis. 2009. Multisensory integration: Psychophysics, neurophysiology,
and computation. Current Opinion in Neurobiology, 19, 452–458.
Beierholm, U. 2007. Bayesian modeling of sensory cue combinations. PhD Thesis, California Institute of
Technology.
Beierholm, U., S. Quartz, and L. Shams. 2009. Bayesian priors are encoded independently of likelihoods in
human multisensory perception. Journal of Vision, 9, 1–9.
Bhattacharya, J., L. Shams, and S. Shimojo. 2002. Sound-induced illusory flash perception: Role of gamma
band responses. Neuroreport, 13, 1727–1730.
Bresciani, J.P., F. Dammeier, and M.O. Ernst. 2006. Vision and touch are automatically integrated for the per-
ception of sequences of events. Journal of Vision, 6, 554–564.
Calvert, G., P.C. Hansen, S.D. Iversen, and M.J. Brammer. 2001. Detection of audio-visual integration sites in
humans by application of electro-physiological criteria to the BOLD effect. NeuroImage, 14, 427–438.
Clavagnier, S., A. Falchier, and H. Kennedy. 2004. Long-distance feedback projections to area V1: Implications
for multisensory integration, spatial awareness, and visual consciousness. Cognitive Affective Behavioral
Neuroscience, 4, 117–126.
Diederich, A., and H. Colonius. 2007. Modeling spatial effects in visual-tactile saccadic reaction time.
Perception & Psychophysics, 69, 56–67.
Ernst, M.O., and M.S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature, 415, 429–433.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience, 22, 5749–5759.
Gebhard, J.W., and G.H. Mowbray. 1959. On discriminating the rate of visual flicker and auditory flutter.
American Journal of Psychology, 72, 521–528.
Ghahramani, Z. 1995. Computation and psychophysics of sensorimotor integration. Ph.D. Thesis, Massachusetts
Institute of Technology.
Ghazanfar, A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences,
10, 278–285.
Hackett, T.A., J.F. Smiley, I. Ulbert, G. Karmos, P. Lakatos, L.A. De La Mothe, and C.E. Schroeder. 2007.
Sources of somatosensory input to the caudal belt areas of auditory cortex. Perception, 36, 1419–1430.
Hospedales, T., and S. Vijayakumar. 2009. Multisensory oddity detection as Bayesian inference. PLoS ONE,
4, e4205.
Howard, I.P., and W.B. Templeton. 1966. Human Spatial Orientation, London, Wiley.
Körding, K., U. Beierholm, W.J. Ma, J.M. Tenenbaum, S. Quartz, and L. Shams. 2007. Causal inference in
multisensory perception. PLoS ONE, 2, e943.
Landy, M.S., L.T. Maloney, E.B. Johnston, and M. Young. 1995. Measurement and modeling of depth cue
combination: In defense of weak fusion. Vision Research, 35, 389–412.
Maloney, L.T., and P. Mamassian. 2009. Bayesian decision theory as a model of human visual perception:
Testing Bayesian transfer. Visual Neuroscience, 26, 147–155.
McDonald, J.J., W.A. Teder-Sälejärvi, and S.A. Hillyard. 2000. Involuntary orienting to sound improves visual
perception. Nature, 407, 906–908.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature, 264, 746–748.
Pascual-Leone, A., and R. Hamilton. 2001. The metamodal organization of the brain. Progress in Brain
Research, 134, 427–445.
Roach, N., J. Heron, and P. McGraw. 2006. Resolving multisensory conflict: A strategy for balancing the costs
and benefits of audio-visual integration. Proceedings of the Royal Society B: Biological Sciences, 273, 2159–2168.
Rock, I., and J. Victor. 1964. Vision and touch: An experimentally created conflict between the two senses.
Science, 143, 594–596.
Rockland, K.S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology, 50, 19–26.
Ronsse, R., C. Miall, and S.P. Swinnen. 2009. Multisensory integration in dynamical behaviors: Maximum
likelihood estimation across bimanual skill learning. Journal of Neuroscience, 29, 8419–8428.
Scheier, C.R., R. Nijhawan, and S. Shimojo. 1999. Sound alters visual temporal resolution. Investigative
Ophthalmology and Visual Science, 40, S4169.
Sekuler, R., A.B. Sekuler, and R. Lau. 1997. Sound alters visual motion perception. Nature, 385, 308.
Shams, L., Y. Kamitani, and S. Shimojo. 2000. What you see is what you hear. Nature, 408, 788.
Shams, L., Y. Kamitani, S. Thompson, and S. Shimojo. 2001. Sound alters visual evoked potentials in humans.
Neuroreport, 12, 3849–3852.
Shams, L., Y. Kamitani, and S. Shimojo. 2002. Visual illusion induced by sound. Cognitive Brain Research,
14, 147–152.
Shams, L., S. Iwaki, A. Chawla, and J. Bhattacharya. 2005a. Early modulation of visual cortex by sound: An
MEG study. Neuroscience Letters, 378, 76–81.
Shams, L., W.J. Ma, and U. Beierholm. 2005b. Sound-induced flash illusion as an optimal percept. Neuroreport,
16, 1923–1927.
Spence, C., and J. Driver. 1997. Audiovisual links in exogenous covert spatial orienting. Perception &
Psychophysics, 59, 1–22.
Spence, C., M.E. Nicholls, N. Gillespie, and J. Driver. 1998. Cross-modal links in exogenous covert spatial
orienting between touch, audition, and vision. Perception and Psychophysics, 60, 544–557.
Stein, B.E., N. London, L.K. Wilkinson, and D.D. Price. 1996. Enhancement of perceived visual intensity by
auditory stimuli: A psychophysical analysis. Journal of Cognitive Neuroscience, 8, 497–506.
Van Beers, R.J., A.C. Sittig, and J.J. Denier van der Gon. 1999. Integration of proprioceptive and visual position
information: An experimentally supported model. Journal of Neurophysiology, 81, 1355–1364.
Violentyev, A., S. Shimojo, and L. Shams. 2005. Touch-induced visual illusion. Neuroreport, 16, 1107–1110.
Walker, J.T., and K.J. Scott. 1981. Auditory–visual conflicts in the perceived duration of lights, tones, and gaps.
Journal of Experimental Psychology: Human Perception and Performance, 7, 1327–1339.
Wallace, M.T., G.H. Roberson, W.D. Hairston, B.E. Stein, J.W. Vaughan, and J.A. Schirillo. 2004. Unifying multisensory signals across time and space. Experimental Brain Research, 158, 252–258.
Watkins, S., L. Shams, S. Tanaka, J.-D. Haynes, and G. Rees. 2006. Sound alters activity in human V1 in asso-
ciation with illusory visual perception. NeuroImage, 31, 1247–1256.
Watkins, S., L. Shams, O. Josephs, and G. Rees. 2007. Activity in human V1 follows multisensory perception.
NeuroImage, 37, 572–578.
Wozny, D.R., U.R. Beierholm, and L. Shams. 2008. Human trimodal perception follows optimal statistical
inference. Journal of Vision, 8, 1–11.
Yuille, A.L., and H.H. Bülthoff. 1996. Bayesian decision theory and psychophysics. In Perception as Bayesian
Inference, ed. D.C. Knill and W. Richards. Cambridge, UK: Cambridge Univ. Press.
13 Characterization of
Multisensory Integration
with fMRI
Experimental Design,
Statistical Analysis, and
Interpretation
Uta Noppeney

CONTENTS
13.1 Functional Specialization: Mass-Univariate Statistical Approaches  234
  13.1.1 Conjunction Analyses  234
  13.1.2 Max and Mean Criteria  236
  13.1.3 Interaction Approaches  236
    13.1.3.1 Classical Interaction Design: 2 × 2 Factorial Design Manipulating Presence versus Absence of Sensory Inputs  236
    13.1.3.2 Interaction Design: 2 × 2 Factorial Design Manipulating Informativeness or Reliability of Sensory Inputs  238
    13.1.3.3 Elaborate Interaction Design: m × n Factorial Design (i.e., More than Two Levels)  240
    13.1.3.4 Interaction Analyses Constrained by Maximum Likelihood Estimation Model  242
    13.1.3.5 Combining Interaction Analyses with Max Criterion  242
  13.1.4 Congruency Manipulations  243
  13.1.5 fMRI Adaptation (or Repetition Suppression)  243
13.2 Multisensory Representations: Multivariate Decoding and Pattern Classifier Analyses  246
13.3 Functional Integration: Effective Connectivity Analyses  247
  13.3.1 Data-Driven Effective Connectivity Analysis: Psychophysiological Interactions and Granger Causality  247
  13.3.2 Hypothesis-Driven Effective Connectivity Analysis: Dynamic Causal Modeling  248
13.4 Conclusions and Future Directions  249
Acknowledgments  249
References  249

This chapter reviews the potential and limitations of functional magnetic resonance imaging (fMRI)
in characterizing the neural processes underlying multisensory integration. The neural basis of mul-
tisensory integration can be characterized from two distinct perspectives. From the perspective of
functional specialization, we aim to identify regions where information from different senses con-
verges and/or is integrated. From the perspective of functional integration, we investigate how infor-
mation from multiple sensory regions is integrated via interactions among brain regions. Combining
these two perspectives, this chapter discusses experimental design, analysis approaches, and inter-
pretational limitations of fMRI results. The first section describes univariate statistical analyses of
fMRI data and emphasizes the interpretational ambiguities of various statistical criteria that are
commonly used for the identification of multisensory integration sites. The second section explores
the potential and limitations of multivariate and pattern classifier approaches in multisensory inte-
gration. The third section introduces effective connectivity analyses that investigate how multi-
sensory integration emerges from distinct interactions among brain regions. The complementary
strengths of data-driven and hypothesis-driven effective connectivity analyses will be discussed.
We conclude by emphasizing that the combined potentials of these various analysis approaches
may help us to overcome or at least ameliorate the interpretational ambiguities associated with each
analysis when applied in isolation.

13.1 FUNCTIONAL SPECIALIZATION: MASS-UNIVARIATE STATISTICAL APPROACHES
Mass-univariate statistical analyses are used to identify regions where information from multiple
senses converges or is integrated. Over the past decade, mass-univariate analyses have formed the mainstay of fMRI research on multisensory integration. In the following section, we will discuss the
pros and cons of the various analyses and statistical criteria that have been applied in the fMRI
literature.

13.1.1  Conjunction Analyses


Conjunction analyses explicitly test whether a voxel or an area responds to several unisensory inputs.
For instance, a brain area is implicated in audiovisual convergence if it responds to both auditory and
visual inputs presented in isolation. Conjunction analyses are well motivated by the neurophysiologi-
cal findings that unisensory cortical domains are separated from one another by transitional multi-
sensory zones (Wallace et al. 2004) and by the proposed patchy sensory organization of higher-order
association cortices such as the superior temporal sulcus (STS; Seltzer et al. 1996; Beauchamp et al.
2004). Given the location of multisensory integration in transition zones between unisensory regions,
it seems rational to infer multisensory properties from responsiveness to multiple unisensory inputs.
However, whereas conjunction analyses can identify candidate multisensory regions that respond
to inputs from multiple senses, even when presented alone (see Figure 13.1b), they cannot capture
integration processes in which one unisensory (e.g., visual) input in itself does not elicit a significant
response, but rather modulates the response elicited by another unisensory (e.g., auditory) input (see
Figure 13.1c). In fact, at the single neuron level, recent neurophysiological studies have demonstrated
that these sorts of modulatory multisensory interactions seem to be a rather common phenomenon in
both higher level regions such as STS (Barraclough et al. 2005; Avillac et al. 2007) and particularly
in low level, putatively unisensory regions (Allman et al. 2009; Meredith and Allman 2009; Dehner
et al. 2004; Kayser et al. 2008). Conjunction approaches are blind to these modulatory interactions
that can instead be revealed by interaction analyses (see below).
Even though, based on neurophysiological results, regions that respond to multiple unisensory
inputs are likely to be involved in multisensory integration, conjunction analyses cannot formally
dissociate (1) genuine multisensory integration from (2) regional convergence with independent
sensory neuronal populations. (1) In the case of true multisensory integration, multisensory neurons

[Figure 13.1 (graphic): panel (a) example auditory and visual stimuli; panels (b) and (c) bar graphs of auditory, visual, and conjunction effects relative to a statistical threshold.]

FIGURE 13.1  Conjunction design and analysis. (a) Experimental design. (1) Auditory: environmental sounds;
(2) visual: pictures or video clips. Example stimuli are presented as visual images and corresponding sound
spectrograms. (b and c) Data analysis and interpretation. (b) A region responding to auditory “and” visual
inputs when presented in isolation is identified as multisensory in a conjunction analysis. (c) A region respond-
ing only to auditory but not visual inputs is identified as unisensory in a conjunction analysis. Therefore,
conjunction analyses cannot capture modulatory interactions in which one sensory (e.g., visual) input in itself
does not elicit a response, but significantly modulates response of another sensory input (e.g., auditory). Bar
graphs represent effect for auditory (black) and visual (darker gray) stimuli, and “multisensory” (lighter gray)
effect as defined by a conjunction.

would respond to unisensory inputs from multiple sensory modalities (e.g., AV neurons to A inputs
and V inputs). (2) In the case of pure regional convergence, the blood oxygen level dependent
(BOLD) response is generated by independent populations of either auditory neurons or visual
neurons (e.g., A neurons to A and V neurons to V inputs). Given the low spatial resolution of fMRI,
both cases produce a “conjunction” BOLD response profile, i.e., regional activation that is elicited
by unisensory inputs from multiple senses. Hence, conjunction analyses cannot unambiguously
identify multisensory integration.
From a statistical perspective, it is important to note that the term “conjunction analysis” has been
used previously to refer to two distinct classes of statistical tests that have later on been coined (1)
“global null conjunction analysis” (Friston et al. 1999, 2005) and (2) “conjunction null conjunction
analysis” (Nichols et al. 2005). (1) A global null conjunction analysis generalizes the one-sided t-test
to multiple dimensions (i.e., comparable to an F-test, but unidirectional) and enables inferences about
k or more effects being present. Previous analyses based on minimum statistics have typically used
the null hypothesis that k = 0. Hence, they tested whether one or more effects were present. In the
context of multisensory integration, this sort of global null conjunction analysis tests whether “at least
one” unisensory input significantly activates a particular region or voxel (with all unisensory inputs
eliciting an effect greater than a particular minimum t value). (2) The more stringent conjunction
null conjunction analysis (implemented in most software packages) explicitly tests whether a region
is significantly activated by both classes of unisensory inputs. Hence, a conjunction null conjunc-
tion analysis forms a logical “and” operation of the two statistical comparisons. This second type of
inference, i.e., a logical “and” operation, is needed when identifying multisensory convergence with
the help of conjunction analyses. Nevertheless, because conjunction analyses were used primarily
in the early stages of fMRI multisensory research, when this distinction was not yet clearly drawn,
most of the previous research is actually based on the more liberal and, in this context, inappropriate
global null conjunction analysis. For instance, initial studies identified integration sites of motion
information by performing a global null conjunction analysis on motion effects in the visual, tactile,
and auditory domains (Bremmer et al. 2001). Future studies are advised to use the more stringent
conjunction null conjunction approach to identify regional multisensory convergence.
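At the voxel level, the difference between the two tests can be illustrated with the minimum statistic: both operate on the minimum of the unisensory t-maps, but the conjunction null requires each contrast to exceed the conventional single-contrast threshold (a logical "and"), whereas the global null uses a more liberal critical value derived for the minimum of k statistics. The sketch below is a simplified illustration, not the implementation of any particular software package, and the critical values are arbitrary examples.

import numpy as np

def conjunction_null(t_maps, t_crit):
    """Logical-AND conjunction: every unisensory contrast is individually significant."""
    return np.min(np.asarray(t_maps), axis=0) > t_crit

def global_null(t_maps, t_crit_minimum):
    """Global-null minimum statistic: tests whether one or more effects are present,
    using a more liberal critical value for the minimum of k t-values."""
    return np.min(np.asarray(t_maps), axis=0) > t_crit_minimum

# Example: auditory and visual t-maps for five voxels.
t_auditory = np.array([4.2, 1.1, 3.5, 0.3, 2.9])
t_visual = np.array([3.8, 4.0, 0.9, 0.2, 3.1])
print(conjunction_null([t_auditory, t_visual], t_crit=3.1))         # strict "A and V"
print(global_null([t_auditory, t_visual], t_crit_minimum=1.7))      # liberal "at least one"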

13.1.2  Max and Mean Criteria


Although conjunction analyses look for commonalities in activations to unisensory inputs from
multiple sensory modalities, fMRI studies based on the max criterion include unisensory and
multisensory stimulation conditions. For statistical inference, the BOLD response evoked by a
bisensory input is compared to the maximal BOLD response elicited by any of the two unisensory
inputs. This max criterion is related, yet not identical, to the multisensory enhancement used in
neurophysiological studies. fMRI studies quantify the absolute multisensory enhancement (e.g.,
AV – max(A,V); van Atteveldt et al. 2004). Neurophysiological studies usually evaluate the rela-
tive multisensory enhancement, i.e., the multisensory enhancement standardized by the maximal
unisensory response, e.g., (AV – max(A,V))/max(A,V) (Stein and Meredith 1993; Stein et al. 2009).
Despite the similarities in criterion, the interpretation and conclusions that can be drawn from
neurophysiological and fMRI results differ. Although in neurophysiology, multisensory enhance-
ment or depression in activity in single neurons unambiguously indicate multisensory integration,
multisensory BOLD enhancement does not compellingly prove multisensory integration within a
region. For instance, if a region contains independent visual and auditory neuronal populations, the
response to an audiovisual stimulus should be equal to the sum of the auditory and visual responses, and hence exceed the maximal unisensory response (Calvert et al. 2001). Hence,
like the conjunction analysis, the max criterion cannot dissociate genuine multisensory integration
from regional convergence with independent unisensory populations. Nevertheless, it may be useful
to further characterize the response profile of multisensory regions identified in interaction analyses
using the max criterion (see Section 13.1.3.5).
In addition to the max criterion, some researchers have proposed or used a mean criterion, i.e.,
the response to the bisensory input should be greater than the mean response to the two unisen-
sory inputs when presented in isolation (Beauchamp 2005). However, even in true unisensory (e.g.,
visual) regions, responses to audiovisual stimuli (equal to visual response) are greater than the mean
of the auditory and visual responses (equal to ½ visual response). Hence, the mean criterion does not
seem to be theoretically warranted and will therefore not be discussed further (Figure 13.2).
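Expressed in terms of condition parameter estimates (relative to fixation), the fMRI max criterion and the neurophysiological relative enhancement amount to the following simple computations (a sketch; variable names are illustrative):

import numpy as np

def max_criterion(beta_a, beta_v, beta_av):
    """Absolute multisensory enhancement used in fMRI: AV - max(A, V)."""
    return beta_av - np.maximum(beta_a, beta_v)

def relative_enhancement(resp_a, resp_v, resp_av):
    """Relative enhancement used in neurophysiology: (AV - max(A, V)) / max(A, V)."""
    max_unisensory = np.maximum(resp_a, resp_v)
    return (resp_av - max_unisensory) / max_unisensory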

13.1.3  Interaction Approaches


As demonstrated in the discussion of the conjunction and max criterion approaches, the limited
spatial resolution of the BOLD response precludes dissociation of true multisensory integration
from regional convergence—when the bisensory response is equal to the sum of the two unisensory
responses. Given this fundamental problem of independent unisensory neuronal populations within
a particular region, more stringent methodological approaches have therefore posed response additiv-
ity as the null hypothesis and identified multisensory integration through response nonlinearities, i.e.,
the interaction between, for example, visual and auditory inputs (Calvert et al. 2001; Calvert 2001).

13.1.3.1 Classical Interaction Design: 2 × 2 Factorial Design Manipulating Presence versus Absence of Sensory Inputs
In a 2 × 2 factorial design, multisensory integration is classically identified through the interac-
tion between presence and absence of input from two sensory modalities, e.g., (A – fixation) ≠
[Figure 13.2 (graphic): panel (a) example auditory, visual, and audiovisual stimuli; panels (b) and (c) bar graphs illustrating the max criterion, max(A − Fix, V − Fix) < AV − Fix; panel (d) bar graph illustrating the mean criterion, [(A − Fix) + (V − Fix)]/2 < AV − Fix.]

FIGURE 13.2  Max and mean criteria. (a) Experimental design. (1) Auditory: environmental sounds;
(2) visual: pictures or video clips; (3) audiovisual: sounds + concurrent pictures. Example stimuli are presented
as visual images and corresponding sound spectrograms. (b–d) Data analysis and interpretation. (b) A region
where audiovisual response is equal to sum of auditory and visual responses is identified as potentially multi-
sensory. However, this activation profile could equally well emerge in a region with independent auditory and
visual neuronal populations. (c and d) A “unisensory” region responding equally to auditory and audiovisual
inputs but not to visual inputs is identified as unisensory by max criterion (C), but as multisensory by mean cri-
terion (d). Bar graphs represent effect for auditory (black), visual (darker gray), and audiovisual (lighter gray)
stimuli, and “multisensory” (gray) effect as defined by max (multisensory enhancement) or mean criteria.

(AV – V). For example, the interaction approach investigates whether the response to an auditory
stimulus depends on the presence versus the absence of a visual stimulus. To relate the interaction
approach to the classical neurophysiological criterion of superadditivity, we can rewrite this formula
as (AV – fixation) ≠ (A – fixation) + (V – fixation) ↔ (AV + fixation) ≠ (A + V). In other words,
the response to the bisensory stimulus is different from the sum of two unisensory stimuli when
presented alone (with each stimulus evoked response being normalized relative to, e.g., prestimulus
baseline activity; Stanford et al. 2005; Perrault et al. 2005). A positive interaction identifies regions
where the bisensory response exceeds the sum of the unisensory responses—hence referred to as
a superadditive response. Similarly, subadditive (and even suppressive) effects can be identified by
negative interactions. Although previous fMRI research has largely ignored and discarded subad-
ditive interactions for methodological reasons (Beauchamp 2005), recent neurophysiological stud-
ies have clearly revealed the relevance of different, i.e., superadditive and subadditive interaction
profiles for multisensory integration (Stanford et al. 2005; Laurienti et al. 2005; Stanford and Stein
2007; Sugihara et al. 2006; Avillac et al. 2007). This emphasizes the need to develop methodologi-
cal approaches in fMRI that enable the interpretation of subadditive interactions.
A BOLD response profile consistent with a significant superadditive and subadditive interac-
tion cannot be attributed to the summation of independent auditory and visual responses within a
region and hence implicates a region in multisensory integration. Furthermore, in contradistinc-
tion to the conjunction analysis, the interaction approach does not necessitate that a multisensory
region responds to unisensory input from multiple sensory modalities. Therefore, it can also capture
the modulatory interactions in which auditory input modulates the processing of visual input even
though the auditory input does not elicit a response when presented alone. However, this classical
interaction design gives rise to four major drawbacks. First, by definition, the interaction term can
only identify nonlinear combinations of modality-specific inputs, leaving out additive multisensory
integration effects that have been observed at the single neuron level. Second, for the interaction
term to be valid and unbiased, the use of “fixation” (the absence of auditory and visual informa-
tion) precludes subjects from performing a task on the stimuli (Beauchamp 2005). This is because
task-related activations are absent during the “fixation” condition, leading to an overestimation of
the summed unisensory relative to the bisensory fMRI-responses in the interaction term. Yet, even
in the absence of a task, the interaction term may be unbalanced with respect to processes that
are induced by stimuli but not during the fixation condition. For instance, stimulus-induced exog-
enous attention is likely to be enhanced for (A + V) relative to (AV + fixation). Third, subadditive
interactions may arise from nonlinearities or ceiling effects not only in the neuronal but also
in the BOLD response—rendering the interpretation ambiguous. Fourth, during the recognition of
complex environmental stimuli such as speech, objects, or actions, multisensory interactions could
emerge at multiple processing levels, ranging from the integration of low-level spatiotemporal to
higher-level object-related perceptual information. These different types of integration processes
are all included in the statistical comparison (i.e., interaction) when using a “fixation” condition
(Werner and Noppeney 2010c). Hence, a selective dissociation of integration at multiple processing
stages such as spatiotemporal and object-related information is not possible (Figure 13.3).
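In a first-level GLM in which the four conditions are modeled as separate regressors, the classical interaction is tested with a single contrast vector. The sketch below assumes the condition order [A, V, AV, Fix] and the availability of the condition betas and their covariance for a voxel; it is a generic illustration, not tied to a specific analysis package.

import numpy as np

# Assumed condition order: [A, V, AV, Fix].
# Superadditivity corresponds to (AV + Fix) - (A + V) > 0; subadditivity to < 0.
c_interaction = np.array([-1.0, -1.0, 1.0, 1.0])

def interaction_t(betas, beta_cov, contrast=c_interaction):
    """t-statistic for the audiovisual interaction contrast at one voxel."""
    betas = np.asarray(betas, dtype=float)
    contrast = np.asarray(contrast, dtype=float)
    return contrast @ betas / np.sqrt(contrast @ np.asarray(beta_cov, dtype=float) @ contrast)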

13.1.3.2 Interaction Design: 2 × 2 Factorial Design Manipulating Informativeness or Reliability of Sensory Inputs
Some of the drawbacks of the classical interaction design can, in part, be addressed in a 2 × 2
factorial design that manipulates (1) visual informativeness (intact = Vi, noise = Vn) and (2) audi-
tory informativeness (intact = Ai, noise = An). Even though the audiovisual noise stimulus does
not provide visual or auditory object information, pure noise stimuli can be treated as a “degraded
object stimulus” by subjects (Gosselin and Schyns 2003). Hence, in contrast to the classical inter-
action that manipulates the presence versus the absence of inputs, subjects can perform a task on
the “noise” stimulus rendering the interaction AiVi + VnAn ≠ AiVn + ViAn matched with respect to
stimulus evoked attention and response selection processes at least to a certain degree. Obviously,
conditions cannot be matched entirely with respect to task demands. However, performance differ-
ences in a multisensory integration study should generally not be considered a confound, but rather
an interesting property of multisensory integration. Indeed, it is an important question how neural
processes mediate multisensory benefits. Furthermore, as auditory and visual inputs are provided
in all conditions, the audiovisual interaction focuses selectively on the integration of higher-order
object features rather than low-level spatiotemporal information (Figure 13.4). Hence, this design is
a first step toward dissociating multisensory integration at multiple processing stages (Werner and
Noppeney 2010a).

[Figure 13.3 (graphic): panel (a) 2 × 2 design matrix (auditory present/absent × visual present/absent: AV, V, A, Fix); panels (b–d) bar graphs of superadditive, subadditive enhancement, and subadditive suppression profiles for the interaction (AV + Fix) − (A + V).]

FIGURE 13.3  Classical interaction design: 2 × 2 factorial design manipulating presence versus absence of
sensory inputs. (a) Experimental design: 2 × 2 factorial design with the factors (1) auditory: present versus
absent; (2) visual: present versus absent. Example stimuli are presented as visual images and correspond-
ing sound spectrograms. (b–d) Data analysis and interpretation. Three activation profiles are illustrated.
(b) Superadditive interaction as indexed by a positive MSI effect. (c) Subadditive interaction as indexed by
a negative interaction term in context of audiovisual enhancement. (d) Subadditive interaction as indexed by
a negative interaction term in context of audiovisual suppression. Please note that subadditive (yet not sup-
pressive) interactions can also result from nonlinearities in BOLD response. Bar graphs represent effect for
auditory (black), visual (darker gray), and audiovisual (lighter gray) stimuli, and “multisensory” (gray) effect
as defined by audiovisual interaction (AV + Fix) – (A + V). To facilitate understanding, two additional bars are
inserted indicating sums that enter into interaction, i.e., AV + Fix and A + V.

[Figure 13.4 (graphic): panel (a) 2 × 2 design matrix (auditory intact/noise × visual intact/noise: AiVi, ViAn, AiVn, AnVn); panel (b) bar graph of a superadditive interaction, (AiVi + AnVn) − (AiVn + ViAn) = MSI.]

FIGURE 13.4  Interaction design: 2 × 2 factorial design manipulating reliability of sensory inputs.
(a)  Experimental design. 2 × 2 factorial design with the factors (1) auditory: reliable versus unreliable;
(2) visual: reliable versus unreliable. Example stimuli are presented as visual images and corresponding sound
spectrograms. Please note that manipulating stimulus reliability rather than presence evades the problem of
fixation condition. (b) Data analysis and interpretation. One activation profile is illustrated as an example:
superadditive interaction as indexed by a positive MSI effect.

13.1.3.3  Elaborate Interaction Design: m × n Factorial Design (i.e., More than Two Levels)
The drawbacks of the classical interaction design can be ameliorated further if the factorial design
includes more than two levels. For instance, in a 3 × 3 factorial design, auditory and visual modali-
ties may include three levels of sensory input: (1) sensory intact = Vi or Ai, (2) sensory degraded =
Vd or Ad, or (3) sensory absent (Figure 13.5). This more elaborate interaction design enables the
dissociation of audiovisual integration at multiple stages of information processing (Werner and
Noppeney 2010b). The interaction approach can thus open up the potential for a fine-grained char-
acterization of the neural processes underlying the integration of different types of audiovisual
information. In addition to enabling the estimation of interactions, it also allows us to compare
interactions across different levels. For instance, in a 3 × 3 factorial design, we can investigate
whether an additive response combination for degraded stimuli turns into subadditive response
combinations for intact stimuli by comparing superadditivity for degraded stimuli to superadditivity for intact stimuli (formally: AdVd + fixation – Vd – Ad > AiVi + fixation – Vi – Ai → AdVd – Vd – Ad > AiVi – Vi – Ai).
Thus, an additive integration profile at one particular sensory input level becomes an interesting
finding when it is statistically different from the integration profile (e.g., subadditive) at a different
input level. In this way, the interaction approach that is initially predicated on response nonlineari-
ties is rendered sensitive to additive combinations of unisensory responses. Testing for changes in
superadditivity (or subadditivity) across different stimulus levels can also be used as a test for the
principle of inverse effectiveness. According to the principle of inverse effectiveness, superadditiv-
ity is expected to decrease with stimulus efficacy as defined by, for instance, stimulus intensity or
informativeness. A more superadditive or less subadditive integration profile would be expected
for weak signal intensities (Stein and Stanford 2008). Finally, it should be emphasized that this

[Figure 13.5 (graphic): panel (a) 3 × 3 design matrix (auditory intact/degraded/absent × visual intact/degraded/absent); panels (b) and (c) bar graphs of the interactions for intact (MSIi) and degraded (MSId) stimuli; panel (d) the inverse effectiveness contrast MSId − MSIi.]

FIGURE 13.5  “Elaborate” interaction design with more than two levels. (a) Experimental design: 3 × 3 fac-
torial design with factors (1) auditory: (i) auditory intact = Ai, (ii) auditory degraded = Ad, and (iii) auditory
absent Aa; (2) visual: (i) visual intact = Vi, (ii) visual degraded = Vd, and (iii) visual absent Va. Example stimuli
are presented as visual images and corresponding sound spectrograms. (b–d) Data analysis and interpretation.
This more elaborate design enables computation of (b) interaction for intact stimuli (MSIi), (c) interaction for
degraded stimuli (MSId), and (d) inverse effectiveness contrast, i.e., MSId – MSIi = (AdVd – Vd – Ad) – (AiVi – Vi – Ai), which does not depend on fixation condition.

more complex inverse effectiveness contrast does not depend on the “fixation” condition, as that is
included on both sides of the inequality (and eliminated from the contrast). Thus, the inverse effec-
tiveness contrast is an elegant way to circumvent the problems associated with the fixation condition
mentioned above (Stevenson et al. 2009; Stevenson and James 2009; Werner and Noppeney 2010b;
also, for a related approach in which audiovisual interactions are compared between intelligible and
nonintelligible stimuli, see Lee and Noppeney 2010).
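The inverse effectiveness contrast reduces to a difference of two interaction terms in which the fixation condition cancels. A minimal sketch, with beta a dictionary of condition estimates (keys are illustrative):

def inverse_effectiveness(beta):
    """Inverse effectiveness contrast MSI_degraded - MSI_intact; fixation cancels out."""
    msi_intact = beta["AiVi"] - beta["Ai"] - beta["Vi"]
    msi_degraded = beta["AdVd"] - beta["Ad"] - beta["Vd"]
    return msi_degraded - msi_intact   # > 0 is consistent with inverse effectiveness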

13.1.3.4  Interaction Analyses Constrained by Maximum Likelihood Estimation Model


A more elaborate interaction design also accommodates more sophisticated analyses developed from
the maximum likelihood framework. Numerous psychophysics studies have shown that humans
integrate information from multiple senses in a Bayes optimal fashion by forming a weighted aver-
age of the independent sensory estimates (maximum likelihood estimation, MLE; Ernst and Banks
2002; Knill and Saunders 2003). This multisensory percept is Bayes optimal in that it yields the
most reliable percept (n.b., reliability is the inverse of variance). Combining fMRI and an elaborate
interaction design, we can investigate the neural basis of Bayes optimal multisensory integration at
the macroscopic scale as provided by the BOLD response. First, we can investigate whether regional
activations are modulated by the relative reliabilities of the unisensory estimates as predicted by the
MLE model. For instance, in visuo–tactile integration, we would expect the activation in the soma-
tosensory cortex during visuo–tactile stimulation to increase when the reliability of visual input
is reduced and higher weight is attributed to the tactile input (Helbig et al. 2010). Second, we can
investigate whether differential activations (i.e., bisensory–unisensory) in higher-order association
cortices, for instance, reflect the increase in reliability during bisensory stimulation as predicted
by the MLE model. This reliability increase for bisensory stimulation should be maximal when
the reliabilities of the two unisensory inputs are equal. By cleverly manipulating the reliabilities of
the two sensory inputs, we can thus independently test the two main MLE predictions within the
same interaction paradigm: (1) the contributions of the sensory modalities to multisensory process-
ing depend on the reliability of the unisensory estimates and (2) the reliability of the multisensory
estimate is greater than the reliability of each unisensory estimate.
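The two MLE predictions follow from a few lines of arithmetic: the weight given to each cue tracks its relative reliability, and the predicted bisensory variance is always smaller than either unisensory variance, with the largest relative gain when the two reliabilities are equal. The sketch uses the visuo-tactile example from the text; names are illustrative.

def mle_combination(x_v, x_t, sigma_v, sigma_t):
    """Reliability-weighted (maximum likelihood) combination of a visual and a tactile cue.

    Returns the combined estimate and its predicted standard deviation;
    reliability is the inverse variance of each unisensory estimate.
    """
    w_v = (1.0 / sigma_v**2) / (1.0 / sigma_v**2 + 1.0 / sigma_t**2)
    s_combined = w_v * x_v + (1.0 - w_v) * x_t
    sigma_combined = (1.0 / (1.0 / sigma_v**2 + 1.0 / sigma_t**2)) ** 0.5
    return s_combined, sigma_combined

# Degrading vision (increasing sigma_v) shifts the weight toward touch, as in the
# somatosensory-cortex prediction discussed above.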

13.1.3.5  Combining Interaction Analyses with Max Criterion


Interaction analyses can be used to refute the possibility of independent unisensory neuronal popu-
lations in a region. Nevertheless, a significant interaction is still open to many different functional
interpretations. Further insights need to be gained from the activation profile of the unisensory and
bisensory conditions that formed the interaction contrast. More formally, the activation profiles of
superadditive and subadditive interactions can be further characterized according to the max cri-
terion (for a related approach, see Avillac et al. 2007; Perrault et al. 2005; Werner and Noppeney
2010c). For instance, a subadditive interaction in which the audiovisual response is greater than the
maximal unisensory response may simply be because of nonlinearities in the BOLD response (e.g.,
saturation effects) and needs to be interpreted with caution. In contrast, a subadditive interaction in
which the audiovisual response is smaller than the maximal unisensory response cannot easily be
attributed to such nonlinearities in the BOLD response. Instead, suppressive interactions indicate
that one sensory input modulates responses to the other sensory input (Sugihara et al. 2006). Finally,
a subadditive interaction with equivalent responses for auditory, visual, and audiovisual conditions
is most parsimoniously explained by amodal functional properties of a particular brain region.
Rather than genuinely integrating inputs from multiple sensory modalities, an amodal region may
be located further “upstream” and be involved in higher-order processing of already integrated
inputs. For instance, in audiovisual speech integration, a region involved in amodal semantic pro-
cessing may be equally activated via visual, auditory, or audiovisual inputs. These examples dem-
onstrate that a significant interaction is not the end, but rather the starting point of analysis and
interpretation. To reach conclusive interpretations, a careful characterization of the activation pro-
file is required.
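The post hoc characterization described here can be summarized as a simple decision scheme applied to the condition estimates of a region that already shows a significant interaction. This is a sketch: in practice, "equivalent" responses would be established by a statistical test rather than a numerical tolerance.

def characterize_profile(beta_a, beta_v, beta_av, tol=1e-6):
    """Qualitative profile of a significant audiovisual interaction (max-criterion logic)."""
    interaction = beta_av - (beta_a + beta_v)      # relative to a common baseline
    max_unisensory = max(beta_a, beta_v)
    if interaction > 0:
        label = "superadditive"
    elif abs(beta_av - beta_a) < tol and abs(beta_av - beta_v) < tol:
        label = "amodal-like: equivalent A, V, and AV responses"
    elif beta_av > max_unisensory:
        label = "subadditive enhancement: interpret with caution (possible BOLD nonlinearity)"
    else:
        label = "suppressive: AV response below the maximal unisensory response"
    return interaction, label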

13.1.4  Congruency Manipulations


Congruency manipulations are based on the rationale that if a region distinguishes between congru-
ent and incongruent component pairs, it needs to have access to both sensory inputs. Congruency
manipulations can be used to focus selectively on different aspects of information integration.
For instance, audiovisual stimuli can be rendered incongruent in terms of space (Fairhall and
Macaluso 2009; Busse et al. 2005; Bonath et al. 2007), time (Noesselt et al. 2007; Lewis and
Noppeney 2010), phonology (van Atteveldt et al. 2007a; Noppeney et al. 2008), or semantics
(Doehrmann and Naumer 2008; Hein et al. 2007; Noppeney et al. 2008, 2010; Sadaghiani et al.
2009; Adam and Noppeney 2010). Thus, congruency manipulations seem to be ideal to dissociate
multisensory integration at multiple processing stages. However, the interpretation of congru-
ency results is impeded by the fact that incongruencies are usually artifactual and contradict
natural environmental statistics. At the behavioral level, it is well-known that multisensory inte-
gration breaks down and no unified multisensory percept is formed when the senses disagree.
However, it is currently unknown how the human brain responds when it encounters discrepancies
between the senses. Most of the previous fMRI research has adopted the view that integration
processes are reduced for incongruent sensory inputs (Calvert et al. 2000; van Atteveldt et al.
2004; Doehrmann and Naumer 2008). Hence, comparing congruent to incongruent conditions
was thought to reveal multisensory integration regions. However, the brain may also unsuccess-
fully attempt to integrate the discrepant sensory inputs. In this case, activations associated with
multisensory integration may actually be enhanced for unfamiliar incongruent (rather than famil-
iar congruent) sensory inputs. A similar argument has been put forward in the language process-
ing domain where activations associated with lexical retrieval were found to be enhanced for
pseudowords relative to familiar words, even though pseudowords are supposedly not endowed
with a semantic representation (Price et al. 1996). Finally, within the framework of predictive
coding, the brain may act as a prediction device and generate a prediction error signal when pre-
sented with unpredictable incongruent sensory inputs. Again, in this case, increased activations
would be expected for incongruent rather than congruent sensory inputs in brain areas that are
involved in processing the specific stimulus attributes that define the incongruency (e.g., tempo-
ral, spatial, semantic, etc.). As fMRI activations are known to be very susceptible to top-down
modulation and cognitive set, these inherent interpretational ambiguities limit the role of incon-
gruency manipulations in the investigation of multisensory integration, particularly for fMRI
(rather than neurophysiological) studies. In fact, a brief review of the literature seems to suggest
that congruency manipulations strongly depend on the particular cognitive set and experimental
paradigm. Under passive listening/viewing conditions, increased activations have been reported
primarily for congruent relative to incongruent conditions (Calvert et al. 2000; van Atteveldt et al.
2004). In contrast, in selective attention paradigms, where subjects attend to one sensory modal-
ity and ignore sensory inputs from other modalities, the opposite pattern has been reported, i.e.,
increased activations are observed for incongruent relative to congruent inputs (Noppeney et al.
2008, 2010; Sadaghiani et al. 2009). Finally, when subjects perform a congruency judgment that
requires access and comparison of the two independent unisensory percepts and hence precludes
natural audiovisual integration, differences between congruent and incongruent stimulus pairs
are attenuated (van Atteveldt et al. 2007b). This complex pattern of fMRI activations suggests
that incongruency does not simply prevent the brain from integrating sensory inputs, but elicits
a range of other cognitive effects and top-down modulations that need to be taken into account
when interpreting fMRI results.

13.1.5  fMRI Adaptation (or Repetition Suppression)


fMRI adaptation (used here synonymously with repetition suppression) refers to the phenomenon
that prior processing of stimuli (or stimulus attributes) decreases activation elicited by processing
subsequent stimuli with identical attributes. Repetition suppression has frequently been interpreted
as the fMRI analogue of neuronal response suppression, i.e., a decrease in neuronal firing rate as
recorded in nonhuman primates (Desimone 1996). Despite current uncertainties about its underly-
ing neural mechanisms, fMRI repetition suppression has been widely used as a tool for dissociating
and mapping the various stages of sensory and cognitive processing. These fMRI experiments are
based on the rationale that the sensitivity of a brain region to variations in stimulus attributes deter-
mines the degree of repetition suppression: the more a brain region is engaged in processing and
hence sensitive to a particular stimulus feature, the more it will adapt to stimuli that are identical
with respect to this feature—even though they might vary with respect to other dimensions (Grill-
Spector and Malach 2001; Grill-Spector et al. 2006). Repetition suppression can thus be used to
define the response selectivity and invariance of neuronal populations within a region. Initial fMRI
adaptation paradigms have used simple block designs, i.e., they presented alternating blocks of
“same (adaptation)” versus “different (no adaptation)” stimuli. However, arrangement of the stimuli
in blocks introduces a strong attentional confound that renders the interpretations of the adaptation
effect difficult (even when attempts are made to maintain attention in a control task). More recent
studies have therefore used randomized fMRI adaptation paradigms that reduce attentional top-
down modulation at least to a certain degree. In addition to attentional confounds, task effects (e.g.,
response priming) need to be very tightly controlled in adaptation paradigms (for further discussion,
see Henson and Rugg 2003; Henson 2003).
In the field of multisensory integration, fMRI adaptation may be used to identify “amodal” neu-
ral representations. Thus, despite the changes in sensory modality, a multisensory or amodal region
should show fMRI adaptation when presented with identical stimuli in different sensory modalities.
For instance, by presenting identical words successively in written and spoken format, cross-modal
adaptation effects have been used to identify amodal or multisensory phonological representations
(Noppeney et al. 2008; Hasson et al. 2007). fMRI adaptation paradigms may also be combined with
the outlined interaction approach. Here, a 2 × 2 factorial design would manipulate the repetition
of (1) visual and (2) auditory features. A region that integrates visual and auditory features is then
expected to show an interaction between the auditory and visual repetition effects, e.g., an increased
visual adaptation, if the auditory feature is also repeated (Tal and Amedi 2009). This experimental
approach has recently been used to study form and motion integration within the visual domain
(Sarkheil et al. 2008). Most commonly, fMRI adaptation is used to provide insights into subvoxel
neuronal representation. This motivation is based on the so-called fatigue model that proposes that
the fMRI adaptation effect is attributable to a “fatigue” (as indexed by decreased activity) of the
neurons initially responding to a specific stimulus (Grill-Spector and Malach 2001). For instance,
let us then assume that a voxel contains populations of A and B neurons and responds equally to
stimuli A and B, so that a standard paradigm would not be able to reveal selectivity for stimulus
A. Yet, repetitive presentation of stimulus A will only fatigue the A-responsive neurons. Therefore,
subsequent presentation of stimulus B will lead to a rebound response of the “fresh” B neurons. Thus,
it has been argued that fMRI adaptation can increase the spatial resolution to a subvoxel level. Along
similar lines, fMRI adaptation could potentially be used to dissociate unisensory and multisensory
neuronal populations. In the case of independent populations of visual and auditory neurons (no
multisensory neurons), after adaptation to a specific visual stimulus, a rebound in activation should
be observed when the same stimulus is presented in the auditory modality. This activation increase
should be comparable to the rebound observed when presented with a new unrelated stimulus. In
contrast, if a region contains multisensory neurons, it will adapt when presented with the same
stimulus irrespective of sensory modality. Thus, within the fatigue framework, fMRI adaptation
may help us to dissociate unisensory and multisensory neuronal populations that evade standard
analyses. However, it is likely that voxels containing visual and auditory neurons will also include
audiovisual neurons. This mixture of multiple neuronal populations within a voxel may produce a
more complex adaptation profile than illustrated in our toy example. Furthermore, given the diver-
sity of multisensory enhancement and depression profiles for concurrently presented sensory inputs,
the adaptation profile for asynchronously presented inputs from multiple modalities is not yet well
characterized—it may depend on several factors such as the temporal relationship, stimulus inten-
sity, and a voxel’s responsiveness.
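A toy simulation in the spirit of the fatigue account may help make this logic explicit. Here a voxel is modeled as a sum of neuronal populations labeled by their preferred modality (V, A, or AV), the same stimulus is presented first visually and then auditorily, and adapted populations are assumed to respond at half strength; the population sizes, the 50% suppression factor, and the function names are illustrative assumptions.

```python
def voxel_response(modality, adapted, pops):
    """Summed activity of a voxel's populations; a population responds to its
    preferred modality ('V', 'A') or to both ('AV'), at half gain once adapted."""
    resp = 0.0
    for pref, size in pops.items():
        if pref == 'AV' or pref == modality:
            resp += size * (0.5 if pref in adapted else 1.0)
    return resp

def crossmodal_adaptation(pops):
    """Same stimulus presented first visually, then auditorily."""
    r1 = voxel_response('V', set(), pops)
    adapted = {p for p in pops if p in ('V', 'AV')}   # populations driven by presentation 1
    r2 = voxel_response('A', adapted, pops)
    return r1, r2

# Independent unisensory populations only: the auditory response after visual adaptation
# equals the fresh auditory response (full "rebound", no cross-modal adaptation).
pops_uni = {'V': 50, 'A': 50}
print(crossmodal_adaptation(pops_uni), voxel_response('A', set(), pops_uni))
# Voxel that also contains multisensory (AV) neurons: the AV population adapts during
# the visual presentation, so the auditory response falls below its fresh level.
pops_multi = {'V': 30, 'A': 30, 'AV': 40}
print(crossmodal_adaptation(pops_multi), voxel_response('A', set(), pops_multi))
```

As noted above, real voxels will typically contain mixtures of all three populations with graded adaptation and enhancement profiles, so the observable pattern is expected to be considerably less clear-cut than in this toy case.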
Even in the “simple” unisensory case, the interpretation of fMRI adaptation results is impeded
by our lack of understanding of the underlying neuronal mechanisms as well as the relationship
between the decreased BOLD activation and neuronal response suppression (for review and discus-
sion, see Henson and Rugg 2003; Henson 2003). In fact, multiple models and theories have been
advanced to explain repetition suppression. (1) According to the fMRI adaptation approach (the
“fatigue” model mentioned above), the number of neurons that are important for stimulus represen-
tation and processing remain constant, but show reductions in their firing rates for repeated stimuli
(Grill-Spector and Malach 2001). (2) Repetition suppression has been attributed to a sharpening


FIGURE 13.6  Cross-modal fMRI adaptation paradigm and BOLD predictions. Figure illustrates BOLD
predictions for different stimulus pairs with (1) stimulus and/or (2) sensory modality being same or differ-
ent for the two presentations. Please note that this simplistic toy example serves only to explain fundamental
principles rather than characterizing the complexity of multisensory adaptation profiles (see text for further
discussion). (a) Same stimulus, same sensory modality: decreased BOLD response is expected in unisensory,
multisensory, and amodal areas. (b) Same stimulus, different sensory modality: decreased BOLD response
is expected for higher-order “amodal” regions and not for unisensory regions. Given the complex interaction
profiles for concurrently presented sensory inputs, prediction for multisensory regions is unclear. Different
stimulus, same sensory modality (c) and different stimulus, different sensory modality (d). No fMRI adapta-
tion is expected of unisensory, multisensory, or amodal regions.

of the cortical stimulus representations, whereby neurons that are not essential for stimulus pro-
cessing respond less for successive stimulus presentations (Wiggs and Martin 1998). (3) In neural
network models, repetition suppression is thought to be mediated by synaptic changes that decrease
the settling time of an attractor neural network (Becker et al. 1997; Stark and McClelland 2000).
(4) Finally, hierarchical models of predictive coding have proposed that response suppression reflects
a reduced prediction error: as the brain learns to predict the stimulus attributes on successive expo-
sures to identical stimuli, the firing rates of stimulus-evoked error units are suppressed by top-down
predictions mediated by backward connections from higher-level cortical areas (Friston 2005). The
predictive coding model raises questions about the relationship between cross-modal congruency
and adaptation effects. Both fMRI adaptation and congruency designs manipulate the “congru-
ency” between two stimuli. The two approaches primarily differ in the (a)synchrony between the
two sensory inputs. For instance, spoken words and the corresponding facial movements would be
presented synchronously in a classical congruency paradigm and sequentially in an adaptation para-
digm. The different latencies of the sensory inputs may induce distinct neural mechanisms for con-
gruency and/or adaptation effects. Yet, events in the natural environment often produce temporal
asynchronies between sensory signals. For instance, facial movements usually precede the auditory
speech signal. Furthermore, the asynchrony between visual and auditory signals depends on the
distance between signal source and observer because of differences in velocity of light and sound.
Finally, the neural processing latencies for signals from different sensory modalities depend on the
particular brain regions and stimuli, which will lead, in turn, to variations in the width and asym-
metry of temporal integration windows as a function of stimulus and region. Collectively, the vari-
ability in latency and temporal integration window suggests a continuum between “synchronous”
congruency effects and “asynchronous” adaptation effects that may rely on distinct and shared
neural mechanisms (Figure 13.6).

13.2  MULTISENSORY REPRESENTATIONS: MULTIVARIATE DECODING AND PATTERN CLASSIFIER ANALYSES


All methodological approaches discussed thus far were predicated on encoding models using
mass-univariate statistics. In other words, these approaches investigated how external variables or
stimulus functions cause and are thus encoded by brain activations in a regionally specific fash-
ion. This is a mass-univariate approach because a general linear model with the experimental
variables as predictors is estimated independently for each voxel time course, followed by statistical
inference (n.b., statistical dependencies are usually taken into account at the stage of statistical
inference, using, e.g., Gaussian random field theory; Friston et al. 1995). Over the past 5 years,
multivariate decoding models and pattern classifiers have progressively been used in functional
imaging studies. In contrast to encoding models that infer a mapping from experimental variables
to brain activations, these decoding models infer a mapping from brain activations to cognitive
states. There are two main approaches: (1) canonical correlation analyses (and related models such
as linear discriminant analyses, etc.) infer a mapping from data features (voxel activations) to cog-
nitive states using classical multivariate statistics (based on Wilk’s lambda). Recently, an alterna-
tive Bayesian method, multivariate Bayesian decoding, has been proposed that uses a parametric
empirical or hierarchical Bayesian model to infer the mapping from voxel activations to a target
variable (Friston et al. 2008). (2) Pattern classifiers (e.g., using support vector machines) implicitly
infer a mapping between voxel patterns and cognitive states via cross-validation schemes and clas-
sification performance on novel unlabeled feature vectors (voxel activation pattern). To this end,
the data are split into two (or multiple) sets. In a cross-validation scheme, the classifier is trained
on set 1 and its generalization performance is tested on set 2 (for a review, see Haynes and Rees
2006; Pereira et al. 2009). Linear classifiers are often used in functional imaging, as the voxel
weights after training provide direct insights into the contribution of different voxels to the clas-
sification performance. Thus, even if the classifier is applied to the entire brain, the voxel weights
may indicate regional functional specialization. Furthermore, multivariate decoding approaches
can also be applied locally (at each location in the brain) using searchlight procedures (Nandy
and Cordes 2003; Kriegeskorte et al. 2006). Because multivariate decoding and pattern classifiers
extract the discriminative signal from multiple voxels, they can be more sensitive than univariate
encoding approaches and provide additional insights into the underlying distributed neural repre-
sentations. By carefully designing training and test sets, pattern classifiers can also characterize
the invariance of the neural representations within a region. Within the field of multisensory inte-
gration, future studies may, for instance, identify amodal representations by investigating whether
a pattern classifier that is trained on visual stimuli generalizes to auditory stimuli. In addition,
pattern classifiers trained on different categories of multisensory stimuli could be used to provide
a more fine-grained account of multisensory representations in low-level, putatively unisensory, and
higher-order multisensory areas.
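As a sketch of such a cross-modal generalization test, the following example trains a linear support vector machine on synthetic “visual” multivoxel patterns and evaluates it on “auditory” patterns of the same two categories; scikit-learn is an assumed dependency, and the signal structure, voxel counts, and trial numbers are arbitrary illustrations rather than a recipe for real data.

```python
import numpy as np
from sklearn.svm import LinearSVC  # assumed dependency

rng = np.random.default_rng(0)
n_trials, n_voxels = 80, 200

# Two stimulus categories; an assumed "amodal" subset of voxels carries the same
# category code in both modalities, embedded in independent noise.
category = rng.integers(0, 2, n_trials)
code = np.zeros((2, n_voxels))
code[1, :50] = 0.8
visual = code[category] + rng.normal(0.0, 1.0, (n_trials, n_voxels))
auditory = code[category] + rng.normal(0.0, 1.0, (n_trials, n_voxels))

# Train on visual trials, test on auditory trials: above-chance transfer is taken as
# evidence for a modality-invariant (amodal) representation within the region.
clf = LinearSVC(C=1.0, dual=False).fit(visual, category)
print("cross-modal decoding accuracy:", clf.score(auditory, category))
```

In practice, this logic would be applied within regions of interest or searchlights, with accuracy assessed against chance by permutation tests or across subjects.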

13.3  FUNCTIONAL INTEGRATION: EFFECTIVE CONNECTIVITY ANALYSES


From the perspective of functional integration, effective connectivity analyses can be used to inves-
tigate how information from multiple senses is integrated via distinct interactions among brain
regions. In contrast with functional connectivity analyses that simply characterize statistical depen-
dencies between time series in different voxels or regions, effective connectivity analyses investi-
gate the influence that one region exerts on another region. The aim of these analyses is to estimate
and make inference about the coupling among brain areas and how this coupling is influenced by
experimental context (e.g., cognitive set, task). We will limit our discussion to approaches that have
already been applied in the field of multisensory integration. From the experimenter’s perspective,
the models are organized according to data-driven and hypothesis-driven approaches for effective
connectivity, even though this is only one of many differences and possible classifications.

13.3.1  Data-Driven Effective Connectivity Analysis: Psychophysiological Interactions and Granger Causality


Early studies have used simple regression models to infer a context-dependent change in effec-
tive connectivity between brain regions. In psychophysiological interaction analyses, the activation
time courses in each voxel within the brain are regressed on the time course in a particular seed
voxel under two contexts (Friston et al. 1997). A change in coupling is inferred from a change in
regression slopes under the two contexts. Based on a psychophysiological interaction analysis, for
instance, visuo–tactile interactions in the lingual gyrus were suggested to be induced by increased
connectivity from the parietal cortex (Macaluso et al. 2000). Similarly, a psychophysiological inter-
action analysis was used to demonstrate increased coupling between the left prefrontal cortex and
the inferior temporal gyrus in blind, relative to sighted, subjects as a result of cross-modal plas-
ticity (Noppeney et al. 2003). More recent approaches aim to infer directed connectivity based on
Granger causality, that is, temporal precedence. A time series X is said to Granger cause Y if the
history of X (i.e., the lagged values of X) provides statistically significant information about future
values of Y, after taking into account the known history of Y. Inferences of Granger causality are
based on multivariate autoregressive models or directed information transfer (a measure derived
from mutual information; Roebroeck et al. 2005; Goebel et al. 2003; Harrison et al. 2003; Hinrichs
et al. 2006). It is important to note that Granger causality does not necessarily imply true causal-
ity because a single underlying process may cause both signals X and Y, yet with different lags.
Furthermore, temporal differences between regions in hemodynamic time series that result from
variations in vascular architecture and hemodynamic response functions may be misinterpreted as
causal influences. The second problem can be partly controlled by comparing Granger causality
across two conditions and prior deconvolution to obtain an estimate of the underlying neuronal
signals (Roebroeck et al. 2009; David et al. 2008). As a primarily data-driven approach, the analysis
estimates the Granger causal influences of a seed region on all other voxels in the brain. Because
this analysis approach does not require an a priori selection of regions of interest, it may be very
useful to generate hypotheses that may then be further evaluated on new data in a more constrained
framework. Recently, Granger causality has been used to investigate and reveal top-down influ-
ences from the STS on auditory cortex/planum temporale in the context of letter–speech sound
congruency (multivariate autoregressive models; van Atteveldt et al. 2009) and temporal synchrony
manipulations (directed information transfer; Noesselt et al. 2007). For instance, van Atteveldt et al.
(2009) have suggested that activation increases for congruent relative to incongruent letter–sound
pairs may be mediated via increased connectivity from the STS. Similarly, Granger causality has
been used to investigate the influence of somatosensory areas on the lateral occipital complex dur-
ing shape discrimination (Deshpande et al. 2010; Peltier et al. 2007).
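For illustration, the sketch below implements a bare-bones pairwise Granger test by comparing a restricted autoregressive model of Y (its own lags only) with a full model that also includes lagged values of X; it is intended only to make the definition concrete and is not the multivariate autoregressive or directed-information-transfer machinery used in the studies cited above. The lag order, simulated time series, and variable names are illustrative assumptions.

```python
import numpy as np

def granger_f(x, y, p=2):
    """F statistic testing whether the p lags of x improve prediction of y
    beyond y's own p lags (larger values = stronger Granger influence)."""
    n = len(y)
    target = y[p:]
    lags_y = np.column_stack([y[p - k:n - k] for k in range(1, p + 1)])
    lags_x = np.column_stack([x[p - k:n - k] for k in range(1, p + 1)])
    X_restricted = np.column_stack([np.ones(len(target)), lags_y])
    X_full = np.column_stack([X_restricted, lags_x])
    rss = lambda X: np.sum((target - X @ np.linalg.lstsq(X, target, rcond=None)[0]) ** 2)
    rss_r, rss_f = rss(X_restricted), rss(X_full)
    return ((rss_r - rss_f) / p) / (rss_f / (len(target) - X_full.shape[1]))

# Toy data in which x drives y at a lag of one sample, so F(x -> y) should clearly exceed F(y -> x).
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.6 * x[t - 1] + 0.2 * y[t - 1] + 0.1 * rng.normal()
print("F(x -> y) =", round(granger_f(x, y), 1), "   F(y -> x) =", round(granger_f(y, x), 1))
```

As emphasized above, such temporal-precedence statistics computed on BOLD time series must be interpreted cautiously, for example by comparing them across experimental conditions.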

13.3.2  Hypothesis-Driven Effective Connectivity Analysis: Dynamic Causal Modeling


The basic idea of dynamic causal modeling (DCM) is to construct a reasonably realistic model
of interacting brain regions that form the key players of the functional system under investigation
(Friston et al. 2003). DCM treats the brain as a dynamic input–state–output system. The inputs
correspond to conventional stimulus functions encoding experimental manipulations. The state
variables are neuronal activities and the outputs are the regional hemodynamic responses mea-
sured with fMRI. The idea is to model changes in the states, which cannot be observed directly,
using the known inputs and outputs. Critically, changes in the states of one region depend on the
states (i.e., activity) of others. This dependency is parameterized by effective connectivity. There
are three types of parameters in a DCM: (1) input parameters that describe how much brain regions
respond to experimental stimuli, (2) intrinsic parameters that characterize effective connectivity
among regions, and (3) modulatory parameters that characterize changes in effective connectivity
caused by experimental manipulation. This third set of parameters, the modulatory effects, allows
us to explain context-sensitive activations by changes in coupling among brain areas. Importantly,
this coupling (effective connectivity) is expressed at the level of neuronal states. DCM uses a for-
ward model, relating neuronal activity to fMRI data, which can be inverted during the model fitting
process. Put simply, the forward model is used to predict outputs using the inputs. During model
fitting, the parameters are adjusted so that the predicted and observed outputs match. Thus, DCM
differs from (auto)regressive-like models that were discussed in the previous section in three impor-
tant aspects: (1) it is a hypothesis-driven approach that requires a priori selection of regions and
specification of model space in terms of potential connectivity structures, (2) the neuronal responses
are driven by experimentally designed inputs rather than endogenous noise, and (3) the regional
interactions emerge at the neuronal level and are transformed into observable BOLD response using
a biophysically plausible hemodynamic forward model.
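The sketch below writes out the bilinear neuronal state equation that underlies DCM for a two-region toy system, which makes the three parameter sets explicit: A (intrinsic coupling), B (modulatory effects), and C (driving inputs). The parameter values are arbitrary illustrations, simple Euler integration is used, and the hemodynamic forward model and the Bayesian inversion are deliberately omitted.

```python
import numpy as np

def dcm_neural_states(A, B, C, u, dt=0.1):
    """Euler integration of the bilinear state equation
    dz/dt = (A + sum_j u_j * B[j]) z + C u
    for an input time course u (n_time x n_inputs); returns the neuronal states z."""
    z = np.zeros(A.shape[0])
    states = []
    for u_t in u:
        coupling = A + sum(u_t[j] * B[j] for j in range(len(B)))
        z = z + dt * (coupling @ z + C @ u_t)
        states.append(z.copy())
    return np.array(states)

# Two-region toy: input 1 drives region 0 (via C), region 0 projects to region 1 (via A),
# and input 2 (e.g., an audiovisual context) strengthens that connection (via B).
A = np.array([[-1.0, 0.0], [0.4, -1.0]])
B = [np.zeros((2, 2)), np.array([[0.0, 0.0], [0.6, 0.0]])]
C = np.array([[1.0, 0.0], [0.0, 0.0]])
u = np.zeros((100, 2))
u[10:90, 0] = 1.0       # driving input on
u[50:90, 1] = 1.0       # modulatory input on during the second half
z = dcm_neural_states(A, B, C, u)
print(z[40], z[85])     # region 1 is driven more strongly once the modulation is on
```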
DCM can be used to make two sorts of inferences: first, we can compare multiple models that
embody hypotheses about functional neural architectures. Using Bayesian model selection, we can
infer the optimal model given the data (Penny et al. 2004; Stephan et al. 2009). Second, given the
optimal model, we can make inferences about the connectivity parameters (Friston et al. 2003). For instance,
we can compare the strength of forward and backward connections or test whether attention modu-
lates the connectivity between sensory areas. In the field of multisensory integration, DCM has
been used to investigate whether incongruency effects emerge via forward or backward connec-
tivity. Comparing DCMs in which audiovisual incongruency modulates either the forward or the
backward connectivity, we suggested that increased activation for incongruent relative to congruent
stimulus pairs is mediated via enhanced forward connectivity from low-level auditory areas to STS
and IPS (Noppeney et al. 2008). More recently, we used DCM to address the question of whether
audiovisual interactions in low-level auditory areas (superior temporal gyrus; Driver and Noesselt
2008; Schroeder and Foxe 2005) are mediated via direct connectivity from visual occipital areas or

FIGURE 13.7  Candidate dynamic causal models. (a) “Direct” influence DCM: audiovisual costimulation
modulates direct connectivity between auditory and visual regions. (b) “Indirect” influence DCM: audiovisual
costimulation modulates indirect connectivity between auditory and visual regions. STG, superior temporal
gyrus; CaS, calcarine sulcus; A, auditory input; V, visual input; AV, audiovisual input.

indirect pathways via the STS. Partitioning the model space into “direct,” “indirect,” or “indirect +
direct” models suggested that visual input may influence auditory processing in the superior tem-
poral gyrus via direct and indirect connectivity from visual cortices (Lewis and Noppeney 2010;
Noppeney et al. 2010; Werner and Noppeney 2010a; Figure 13.7).

13.4  CONCLUSIONS AND FUTURE DIRECTIONS


Multisensory integration has been characterized with fMRI using a variety of experimental design
and statistical analysis approaches. When applied in isolation, each approach provides only limited
insights and can lead to misinterpretations. A more comprehensive picture may emerge by com-
bining the potentials of multiple methodological approaches. For instance, pattern classifiers and
fMRI adaptation may be jointly used to provide insights into subvoxel neuronal representations and
dissociate unisensory and multisensory neuronal populations. Amodal neural representations may
then be identified, if the classification performance and fMRI adaptation generalizes across stimuli
from different sensory modalities. Increased spatial resolution at higher field strength will enable
us to more thoroughly characterize the response properties of individual regions. To go beyond
structure–function mapping, we also need to establish the effective connectivity between regions
using neurophysiologically plausible observation models. Understanding the neural mechanisms of
multisensory integration will require an integrative approach combining computational modeling
and the complementary strengths of fMRI, EEG/MEG, and lesion studies.

ACKNOWLEDGMENTS
We thank Sebastian Werner, Richard Lewis, and Johannes Tünnerhoff for helpful comments on a
previous version of this manuscript and JT for his enormous help with preparing the figures.

REFERENCES
Adam, R., U. Noppeney. 2010. Prior auditory information shapes visual category-selectivity in ventral occipito-
­temporal cortex. NeuroImage 52:1592–1602.
Allman, B.L., L.P. Keniston, and M.A. Meredith. 2009. Not just for bimodal neurons anymore: The contribu-
tion of unimodal neurons to cortical multisensory processing. Brain Topography 21:157–167.
Avillac, M., H.S. Ben, and J.R. Duhamel. 2007. Multisensory integration in the ventral intraparietal area of the
macaque monkey. Journal of Neuroscience 27:1922–1932.
Barraclough, N.E., D. Xiao, C.I. Baker, M.W. Oram, and D.I. Perrett. 2005. Integration of visual and auditory
information by superior temporal sulcus neurons responsive to the sight of actions. Journal of Cognitive
Neuroscience 17:377–391.
Beauchamp, M.S. 2005. Statistical criteria in FMRI studies of multisensory integration. Neuroinformatics
3:93–113.
Beauchamp, M.S., B.D. Argall, J. Bodurka, J.H. Duyn, and A. Martin. 2004. Unraveling multisensory integra-
tion: Patchy organization within human STS multisensory cortex. Nature Neuroscience 7:1190–1192.
Becker, S., M. Moscovitch, M. Behrmann, and S. Joordens. 1997. Long-term semantic priming: a computa-
tional account and empirical evidence. Journal of Experimental Psychology. Learning, Memory, and
Cognition 23:1059–1082.
Bonath, B., T. Noesselt, A. Martinez, J. Mishra, K. Schwiecker, H.J. Heinze, and S.A. Hillyard. 2007. Neural
basis of the ventriloquist illusion. Current Biology 17:1697–1703.
Bremmer, F., A. Schlack, N.J. Shah et al. 2001. Polymodal motion processing in posterior parietal and premo-
tor cortex: A human fMRI study strongly implies equivalencies between humans and monkeys. Neuron
29:287–296.
Busse, L., K.C. Roberts, R.E. Crist, D.H. Weissman, and M.G. Woldorff. 2005. The spread of attention across
modalities and space in a multisensory object. Proceedings of the National Academy of Sciences of the
United States of America 102:18751–18756.
Calvert, G.A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies.
Cerebral Cortex 11:1110–1123.
Calvert, G.A., R. Campbell, and M.J. Brammer. 2000. Evidence from functional magnetic resonance imaging
of crossmodal binding in the human heteromodal cortex. Current Biology 10:649–657.
Calvert, G.A., P.C. Hansen, S.D. Iversen, and M.J. Brammer. 2001. Detection of audio-visual integration
sites in humans by application of electrophysiological criteria to the BOLD effect. NeuroImage 14:
427–438.
David, O., I. Guillemain, S. Saillet et al. 2008. Identifying neural drivers with functional MRI: An electrophysi-
ological validation. PLoS Biology 6:2683–2697.
Dehner, L.R., L.P. Keniston, H.R. Clemo, and M.A. Meredith 2004. Cross-modal circuitry between auditory
and somatosensory areas of the cat anterior ectosylvian sulcal cortex: A ‘new’ inhibitory form of multi-
sensory convergence. Cerebral Cortex 14:387–403.
Deshpande, G., X. Hu, S. Lacey, R. Stilla, and K. Sathian. 2010. Object familiarity modulates effective con-
nectivity during haptic shape perception. NeuroImage 49:1991–2000.
Desimone, R. 1996. Neural mechanisms for visual memory and their role in attention. Proceedings of the
National Academy of Sciences of the United States of America 93:13494–13499.
Doehrmann, O., and M.J. Naumer. 2008. Semantics and the multisensory brain: how meaning modulates pro-
cesses of audio-visual integration. Brain Research 1242:136–150.
Driver, J., and T. Noesselt 2008. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’
brain regions, neural responses, and judgments. Neuron 57:11–23.
Ernst, M.O., and M.S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415:429–433.
Fairhall, S.L., and E. Macaluso. 2009. Spatial attention can modulate audiovisual integration at multiple corti-
cal and subcortical sites. European Journal of Neuroscience 29:1247–1257.
Friston, K. 2005. A theory of cortical responses. Philosophical Transactions of the Royal Society of London.
Series B, Biological Sciences 360:815–836.
Friston, K., C. Chu, J. Mourao-Miranda, O. Hulme, G. Rees, W. Penny, and J. Ashburner. 2008. Bayesian
decoding of brain images. NeuroImage 39:181–205.
Friston, K.J., C. Buechel, G.R. Fink, J. Morris, E. Rolls, and R.J. Dolan. 1997. Psychophysiological and modu-
latory interactions in neuroimaging. NeuroImage 6:218–229.
Friston, K.J., L. Harrison, and W. Penny. 2003. Dynamic causal modelling. NeuroImage 19:1273–1302.
Friston, K.J., A. Holmes, K.J. Worsley, J.B. Poline, C.D. Frith, and R. Frackowiak. 1995. Statistical parametric
mapping: A general linear approach. Human Brain Mapping 2:189–210.
Friston, K.J., A.P. Holmes, C.J. Price, C. Buchel, and K.J. Worsley. 1999. Multisubject fMRI studies and con-
junction analyses. NeuroImage 10:385–396.
Friston, K.J., W.D. Penny, and D.E. Glaser. 2005. Conjunction revisited. NeuroImage 25:661–667.
Goebel, R., A. Roebroeck, D.S. Kim, and E. Formisano. 2003. Investigating directed cortical interactions in
time-resolved fMRI data using vector autoregressive modeling and Granger causality mapping. Magnetic
Resonance Imaging 21:1251–1261.
Gosselin, F., and P.G. Schyns. 2003. Superstitious perceptions reveal properties of internal representations.
Psychological Science 14:505–509.
Grill-Spector, K., and R. Malach. 2001. fMR-adaptation: A tool for studying the functional properties of human
cortical neurons. Acta Psychologica 107:293–321.
Grill-Spector, K., R. Henson, and A. Martin. 2006. Repetition and the brain: neural models of stimulus-specific
effects. Trends in Cognitive Sciences 10:14–23.
Harrison, L., W.D. Penny, and K. Friston. 2003. Multivariate autoregressive modeling of fMRI time series.
NeuroImage 19:1477–1491.
Hasson, U., J.I. Skipper, H.C. Nusbaum, and S.L. Small. 2007. Abstract coding of audiovisual speech: Beyond
sensory representation. Neuron 56:1116–1126.
Haynes, J.D., and G. Rees. 2006. Decoding mental states from brain activity in humans. Nature Reviews.
Neuroscience 7:523–534.
Hein, G., O. Doehrmann, N.G. Muller, J. Kaiser, L. Muckli, and M.J. Naumer. 2007. Object familiar-
ity and semantic congruency modulate responses in cortical audiovisual integration areas. Journal of
Neuroscience 27:7881–7887.
Helbig, H.B., M.O. Ernst, E. Ricciardi, P. Pietrini, A. Thielscher, K.M. Mayer, J. Schultz, and U. Noppeney.
2010. Reliability of visual information modulates tactile shape processing in primary somatosensory
cortices (Submitted for publication).
Henson, R.N. 2003. Neuroimaging studies of priming. Progress in Neurobiology 70:53–81.
Henson, R.N., and M.D. Rugg. 2003. Neural response suppression, haemodynamic repetition effects, and
behavioural priming. Neuropsychologia 41:263–270.
Hinrichs, H., H.J. Heinze, and M.A. Schoenfeld. 2006. Causal visual interactions as revealed by an information
theoretic measure and fMRI. NeuroImage 31:1051–1060.
Kayser, C., C.I. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–1574.
Knill, D.C., and J.A. Saunders. 2003. Do humans optimally integrate stereo and texture information for judg-
ments of surface slant? Vision Research 43:2539–2558.
Kriegeskorte, N., R. Goebel, and P. Bandettini. 2006. Information-based functional brain mapping. Proceedings
of the National Academy of Sciences of the United States of America 103:3863–3868.
Laurienti, P.J., T.J. Perrault, T.R. Stanford, M.T. Wallace, and B.E. Stein. 2005. On the use of superadditivity
as a metric for characterizing multisensory integration in functional neuroimaging studies. Experimental
Brain Research 166:289–297.
Lee, H., and U. Noppeney. Physical and perceptual factors shape the neural mechanisms that integrate audiovi-
sual signals in speech comprehension (submitted for publication).
Lewis, R., and U. Noppeney. 2010. Audiovisual synchrony improves motion discrimination via enhanced con-
nectivity between early visual and auditory areas. Journal of Neuroscience 30:12329–12339.
Macaluso, E., C.D. Frith, and J. Driver. 2000. Modulation of human visual cortex by crossmodal spatial atten-
tion. Science 289:1206–1208.
Meredith, M.A., and B.L. Allman. 2009. Subthreshold multisensory processing in cat auditory cortex.
Neuroreport 20:126–131.
Nandy, R.R., and D. Cordes. 2003. Novel nonparametric approach to canonical correlation analysis with appli-
cations to low CNR functional MRI data. Magnetic Resonance in Medicine 50:354–365.
Nichols, T., M. Brett, J. Andersson, T. Wager, and J.B. Poline. 2005. Valid conjunction inference with the mini-
mum statistic. NeuroImage 25:653–660.
Noesselt, T., J.W. Rieger, M.A. Schoenfeld et al. 2007. Audiovisual temporal correspondence modulates
human multisensory superior temporal sulcus plus primary sensory cortices. Journal of Neuroscience
27:11431–11441.
Noppeney, U., K. Friston, and C. Price. 2003. Effects of visual deprivation on the organisation of the semantic
system. Brain 126:1620–1627.
Noppeney, U., O. Josephs, J. Hocking, C.J. Price, and K.J. Friston. 2008. The effect of prior visual information
on recognition of speech and sounds. Cerebral Cortex 18:598–609.
Noppeney, U., D. Ostwald, and S. Werner. 2010. Perceptual decisions formed by accumulation of audiovisual evi-
dence in prefrontal cortex. Journal of Neuroscience 30:7434–7446.
Peltier, S., R. Stilla, E. Mariola, S. LaConte, X. Hu, and K. Sathian. 2007. Activity and effective connectivity of
parietal and occipital cortical regions during haptic shape perception. Neuropsychologia 45:476–483.
Penny, W.D., K.E. Stephan, A. Mechelli, and K.J. Friston. 2004. Comparing dynamic causal models. NeuroImage
22:1157–1172.
Pereira, F., T. Mitchell, and M. Botvinick. 2009. Machine learning classifiers and fMRI: A tutorial overview.
NeuroImage 45:S199–S209.
Perrault Jr., T.J., J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2005. Superior colliculus neurons use dis-
tinct operational modes in the integration of multisensory stimuli. Journal of Neurophysiology 93:​
2575–2586.
Price, C.J., R.J. Wise, and R.S. Frackowiak. 1996. Demonstrating the implicit processing of visually presented
words and pseudowords. Cerebral Cortex 6:62–70.
Roebroeck, A., E. Formisano, and R. Goebel. 2005. Mapping directed influence over the brain using Granger
causality and fMRI. NeuroImage 25:230–242.
Roebroeck, A., E. Formisano, and R. Goebel. 2009. The identification of interacting networks in the brain using
fMRI: Model selection, causality and deconvolution. NeuroImage.
Sadaghiani, S., J.X. Maier, and U. Noppeney. 2009. Natural, metaphoric, and linguistic auditory direction sig-
nals have distinct influences on visual motion processing. Journal of Neuroscience 29:6490–6499.
Sarkheil, P., Q.C. Vuong, H.H. Bulthoff, and U. Noppeney. 2008. The integration of higher order form and
motion by the human brain. NeuroImage 42:1529–1536.
Schroeder, C.E., and J. Foxe. 2005. Multisensory contributions to low-level, ‘unisensory’ processing. Current
Opinion in Neurobiology 15:454–458.
Seltzer, B., M.G. Cola, C. Gutierrez, M. Massee, C. Weldon, and C.G. Cusick. 1996. Overlapping and non-
overlapping cortical projections to cortex of the superior temporal sulcus in the rhesus monkey: Double
anterograde tracer studies. Journal of Comparative Neurology 370:173–190.
Stanford, T.R., and B.E. Stein. 2007. Superadditivity in multisensory integration: Putting the computation in
context. Neuroreport 18:787–792.
Stanford, T.R., S. Quessy, and B.E. Stein. 2005. Evaluating the operations underlying multisensory integration in
the cat superior colliculus. Journal of Neuroscience 25:6499–6508.
Stark, C.E., and J.L. McClelland. 2000. Repetition priming of words, pseudowords, and nonwords. Journal of
Experimental Psychology. Learning, Memory, and Cognition 26:945–972.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., and T.R. Stanford. 2008. Multisensory integration: current issues from the perspective of the single
neuron. Nature Reviews. Neuroscience 9:255–266.
Stein, B.E., T.R. Stanford, R. Ramachandran, T.J. Perrault Jr., and B.A. Rowland. 2009. Challenges in quantify-
ing multisensory integration: alternative criteria, models, and inverse effectiveness. Experimental Brain
Research 198(2–3):113–126.
Stephan, K.E., W.D. Penny, J. Daunizeau, R.J. Moran, and K.J. Friston. 2009. Bayesian model selection for
group studies. NeuroImage 46(4):1004–1017. Erratum in NeuroImage 48(1):311.
Stevenson, R.A., and T.W. James. 2009. Audiovisual integration in human superior temporal sulcus: Inverse
effectiveness and the neural processing of speech and object recognition. NeuroImage 44:1210–1223.
Stevenson, R.A., S. Kim, and T.W. James. 2009. An additive-factors design to disambiguate neuronal and areal
convergence: Measuring multisensory interactions between audio, visual, and haptic sensory streams
using fMRI. Experimental Brain Research 198(2–3):183–194.
Sugihara, T., M.D. Diltz, B.B. Averbeck, and L.M. Romanski. 2006. Integration of auditory and visual
communication information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience
26:11138–11147.
Tal, N., and A. Amedi. 2009. Multisensory visual-tactile object related network in humans: insights gained
using a novel crossmodal adaptation approach. Experimental Brain Research 198:165–182.
van Atteveldt, N., E. Formisano, R. Goebel, and L. Blomert. 2004. Integration of letters and speech sounds in
the human brain. Neuron 43:271–282.
van Atteveldt, N.M., E. Formisano, L. Blomert, and R. Goebel. 2007a. The effect of temporal asynchrony on
the multisensory integration of letters and speech sounds. Cerebral Cortex 17:962–974.
van Atteveldt, N.M., E. Formisano, R. Goebel, and L. Blomert. 2007b. Top-down task effects overrule automatic
multisensory responses to letter-sound pairs in auditory association cortex. NeuroImage 36:1345–1360.
van Atteveldt, N., A. Roebroeck, and R. Goebel. 2009. Interaction of speech and script in human auditory cor-
tex: Insights from neuro-imaging and effective connectivity. Hearing Research 258(1–2):152–164.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences of the United States of America 101:2167–2172.
Werner, S., and U. Noppeney. 2010a. Distinct functional contributions of primary sensory and association areas
to audiovisual integration in object categorization. Journal of Neuroscience 30:2662–2675.
Werner, S., and U. Noppeney. 2010b. Superadditive responses in superior temporal sulcus predict audiovisual
benefits in object categorization. Cerebral Cortex 20:1829–1842.
Werner, S., and U. Noppeney. 2010c. The contributions of transient and sustained response codes to audiovisual
integration. Cerebral Cortex 21(4):920–931.
Wiggs, C.L., and A. Martin. 1998. Properties and mechanisms of perceptual priming. Current Opinion in
Neurobiology 8:227–233.
14 Modeling Multisensory Processes in Saccadic Responses
Time-Window-of-Integration Model
Adele Diederich and Hans Colonius

CONTENTS
14.1 Summary............................................................................................................................... 253
14.2 Multisensory Processes Measured through Response Time.................................................254
14.3 TWIN Modeling.................................................................................................................... 255
14.3.1 Basic Assumptions..................................................................................................... 255
14.3.2 Quantifying Multisensory Integration in the TWIN Model...................................... 257
14.3.3 Some General Predictions of TWIN......................................................................... 257
14.4 TWIN Models for Specific Paradigms: Assumptions and Predictions................................. 258
14.4.1 Measuring Cross-Modal Effects in Focused Attention and Redundant Target
Paradigms.................................................................................................................. 258
14.4.2 TWIN Model for the FAP......................................................................................... 259
14.4.2.1 TWIN Predictions for the FAP...................................................................260
14.4.3 TWIN Model for RTP............................................................................................... 263
14.4.3.1 TWIN Predictions for RTP.........................................................................264
14.4.4 Focused Attention versus RTP.................................................................................. 265
14.5 TWIN Model for Focused Attention: Including a Warning Mechanism..............................266
14.5.1 TWIN Predictions for FAP with Warning................................................................268
14.6 Conclusions: Open Questions and Future Directions............................................................ 270
Appendix A..................................................................................................................................... 271
A.1 Deriving the Probability of Interaction in TWIN................................................................. 271
A.1.1 Focused Attention Paradigm..................................................................................... 271
A.1.2 Redundant Target Paradigm...................................................................................... 272
A.1.3 Focused Attention and Warning................................................................................ 273
References....................................................................................................................................... 274

14.1  SUMMARY
Multisensory research within experimental psychology has led to the emergence of a number of
lawful relations between response speed and various empirical conditions of the experimental setup
(spatiotemporal stimulus configuration, intensity, number of modalities involved, type of instruc-
tion, and so forth). This chapter presents a conceptual framework to account for the effects of
cross-modal stimulation on response speed. Although our framework applies to measures of cross-
modal response speed in general, here we focus on modeling saccadic reaction time as a measure
of orientation performance toward cross-modal stimuli.
The central postulate is the existence of a critical “time-window-of-integration” (TWIN) con-
trolling the combination of information from different modalities. It is demonstrated that a few
basic assumptions about this timing mechanism imply a remarkable number of empirically testable
predictions. After introducing a general version of the TWIN model framework, we present various
specifications and extensions of the original model that are geared toward more specific experi-
mental paradigms. Our emphasis will be on predictions and empirical testability of these model
versions, but for experimental data, we refer the reader to the original literature.

14.2  MULTISENSORY PROCESSES MEASURED THROUGH RESPONSE TIME


For more than 150 years, response time (RT) has been used in experimental psychology as a ubiq-
uitous measure to investigate hypotheses about the mental and motor processes involved in simple
cognitive tasks (Van Zandt 2002). Interpreting RT data, in the context of some specific experimen-
tal paradigm, is subtle and requires a high level of technical skill. Fortunately, over the years, many
sophisticated mathematical and statistical methods for response time analysis and corresponding
processing models have been developed (Luce 1986; Schweickert et al., in press). One reason for
the sustained popularity of RT as a measure of mental processes may be the simple fact that these
processes always have to unfold over time. A similar rationale, of course, is valid for other methods
developed to investigate mental processes, such as electrophysiological and related brain-imaging
techniques, and it may be one reason why we are currently witnessing some transfer of concepts and
techniques from RT analysis into these domains (e.g., Sternberg 2001). Here, we focus on the early,
dynamic aspects of simultaneously processing cross-modal stimuli—combinations of vision, audi-
tion, and touch—as they are revealed by a quantitative stochastic analysis of response times.
One of the first psychological studies on cross-modal interaction using RT to measure the effect
of combining stimuli from different modalities and of varying their intensities is the classic article
by Todd (1912). A central finding, supported by subsequent research, is that the occurrence of cross-
modal effects critically depends on the temporal arrangement of the stimulus configuration. For
example, the speedup of response time to a visual stimulus resulting from presenting an accessory
auditory stimulus typically becomes most pronounced when the visual stimulus precedes the audi-
tory by an interval that equals the difference in RT between response to the visual alone and the
auditory alone (Hershenson 1962). The rising interest in multisensory research in experimental
psychology over the past 20 years has led to the emergence of a number of lawful relations between
response speed, on the one hand, and properties of the experimental setting, such as (1) spatiotempo-
ral stimulus configuration, (2) stimulus intensity levels, (3) number of modalities involved, (4) type
of instruction, and (5) semantic congruity, on the other. In the following, rather than reviewing
the abundance of empirical results, we present a modeling framework within which a number of
specific quantitative models have been developed and tested. Although such models can certainly
not reflect the full complexity of the underlying multisensory processes, their predictions are suf-
ficiently specific to be rigorously tested through experiments.
For a long time, the ubiquitous mode of assessing response speed has been to measure the time it
takes to press a button, or to release it, by moving a finger or foot. With the advance of modern eye
movement registration techniques, the measurement of gaze shifts has become an important addi-
tional technique to assess multisensory effects. In particular saccadic reaction time, i.e., the time
from the presentation of a target stimulus to the beginning of the eye movement, is ideally suited for
studying both the temporal and spatial rules of multisensory integration. Although participants can
be asked to move their eyes to either visual, auditory, or somatosensory targets, because the ocular
system is geared to the visual system, the saccadic RT characteristics will be specific to each modal-
ity. For example, it is well-known that saccades to visual targets have a higher level of accuracy than
those to auditory or somatosensory stimuli. Note also, as the superior colliculus is an important site
of oculomotor control (e.g., Munoz and Wurtz 1995), measuring saccadic responses is an obvious
choice for studying the behavioral consequences of multisensory integration.

14.3  TWIN MODELING


We introduce a conceptual framework to account for the effects of cross-modal stimulation as mea-
sured by changes in response speed.* The central postulate is the existence of a critical TWIN
controlling the integration of information from different modalities. The starting idea simply is that
a visual and an auditory stimulus must not be presented too far away from each other in time for
bimodal integration to occur. As we will show, this seemingly innocuous assumption has a number
of nontrivial consequences that any multisensory integration model of response speed has to satisfy.
Most prominently, it imposes a process consisting of—at least—two serial stages: one early stage,
before the outcome of the time window check has occurred, and a later one, in which the outcome
of the check may affect further processing.
Although the TWIN framework applies to measures of cross-modal response speed in general,
the focus is on modeling saccadic reaction time. First, a general version of the TWIN model and
its predictions, introduced by Colonius and Diederich (2004), will be described. Subsequently, we
present various extensions of the original model that are geared toward more specific experimental
paradigms. Our emphasis will again be on the predictions and empirical testability of these model
versions but because of space limitations, no experimental data will be presented here.

14.3.1  Basic Assumptions


A classic explanation for a speedup of responses to cross-modal stimuli is that subjects are merely
responding to the first stimulus detected. Taking these detection times to be random variables and
glossing over some technical details, observed reaction time would then become the minimum of the
reaction times to the visual, auditory, or tactile signal leading to a purely statistical facilitation effect
(also known as probability summation) in response speed (Raab 1962). Over time, numerous studies
have shown that this race model was not sufficient to explain the observed speedup in saccadic reac-
tion time (Harrington and Peck 1998; Hughes et al. 1994, 1998; Corneil and Munoz 1996; Arndt and
Colonius 2003). Using Miller’s inequality as a benchmark test (cf. Colonius and Diederich 2006;
Miller 1982), saccadic responses to bimodal stimuli have been found to be faster than predicted by
statistical facilitation, in particular, when the stimuli were spatially aligned. Moreover, in the race
model, there is no natural explanation for the decrease in facilitation observed with variations in
many cross-modal stimulus properties, e.g., increasing spatial disparity between the stimuli.
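For reference, the benchmark can be written down explicitly: under the race model, the bimodal reaction time distribution can never exceed the sum of the two unimodal distributions at any time point, so positive values of the difference computed below indicate a violation of statistical facilitation. The simulated reaction times, sample sizes, and time grid are purely illustrative; real tests would of course use observed saccadic latencies.

```python
import numpy as np

def race_violation(rt_av, rt_a, rt_v, t_grid):
    """Miller's inequality: P(RT_AV <= t) <= P(RT_A <= t) + P(RT_V <= t).
    Returns the difference between the bimodal CDF and the (capped) bound."""
    cdf = lambda rt, t: np.mean(rt[:, None] <= t, axis=0)
    bound = np.minimum(cdf(rt_a, t_grid) + cdf(rt_v, t_grid), 1.0)
    return cdf(rt_av, t_grid) - bound

rng = np.random.default_rng(2)
rt_a = rng.normal(180, 30, 400)    # unimodal auditory saccadic RTs (ms, simulated)
rt_v = rng.normal(160, 30, 400)    # unimodal visual saccadic RTs (ms, simulated)
rt_av = rng.normal(125, 25, 400)   # bimodal RTs, assumed faster than any race prediction
t = np.arange(80, 300, 5)
print("maximal violation of Miller's inequality:", race_violation(rt_av, rt_a, rt_v, t).max())
```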
Nevertheless, the initial anatomic separation of the afferent pathways for different sensory
modalities suggests that an early stage of peripheral processing exists, during which no intermodal
interaction may occur. For example, a study by Whitchurch and Takahashi (2006) collecting (head)
saccadic reaction times in the barn owl lends support to the notion of a race between early visual and
auditory processes depending on the relative intensity levels of the stimuli. In particular, their data
suggest that the faster modality initiates the saccade, whereas the slower modality remains available
to refine saccade trajectory. Thus, there are good reasons for retaining the construct of an—albeit
very peripheral—race mechanism.
Even under invariant experimental conditions, observed responses typically vary from one trial
to the next, presumably because of an inherent variability of the underlying neural processes in
both ascending and descending pathways. In analogy to the classic race model, this is taken into
account in the TWIN framework by assuming any processing duration to be a random variable.
In particular, the peripheral processing times for visual, auditory, and somatosensory stimuli are

* See Section 14.6 for possible extensions to other measures of performance.


assumed to be stochastically independent random variables. This leads to the first postulate of the
TWIN model:
(B1) First Stage Assumption: The first stage consists in a (stochastically independent) race among the
peripheral processes in the visual, auditory, and/or somatosensory pathways triggered by a cross-modal
stimulus complex.

The existence of a critical “spatiotemporal window” for multisensory integration to occur has been
suggested by several authors, based on both neurophysiological and behavioral findings in humans,
monkey, and cat (e.g., Bell et al. 2005; Meredith 2002; Corneil et al. 2002; Meredith et al. 1987; see
Navarra et al. 2005 for a recent behavioral study). This integration may manifest itself in the form of
an increased firing rate of a multisensory neuron (relative to unimodal stimulation), an acceleration
of saccadic reaction time (Frens et al. 1995; Diederich et al. 2003), an effective audiovisual speech
integration (Van Wassenhove et al. 2007), or in an improved or degraded judgment of temporal
order of bimodal stimulus pairs (cf. Spence and Squire 2003).
One of the basic tenets of the TWIN framework, however, is the priority of temporal proximity
over any other type of proximity: rather than assuming a joint spatiotemporal window of integra-
tion permitting interaction to occur only for both spatially and temporally neighboring stimuli, the
TWIN model allows for cross-modal interaction to occur, for example, even for spatially rather
distant stimuli of different modalities as long as they fall within the time window.
(B2) TWIN Assumption: Multisensory integration occurs only if the peripheral processes of the first
stage all terminate within a given temporal interval, the TWIN.

In other words, even if a visual and an auditory stimulus occur at the same spatial location, or the lip
movements of a speaker are perfectly consistent with the utterance, no intersensory interaction
effect will be possible if the data from the two sensory channels are registered too far apart
in time. Thus, the window acts like a filter determining whether afferent information
delivered from different sensory organs is registered close enough in time to allow for multisensory
integration. Note that passing the filter is a necessary, but not sufficient, condition for multisensory
integration to occur. The reason is that the amount of multisensory integration also depends on other
aspects of the stimulus set, such as the spatial configuration of the stimuli. For example, response
depression may occur with nearly simultaneous but distant stimuli, making it easier for the organ-
ism to focus attention on the more important event. In other cases, multisensory integration may fail
to occur—despite near-simultaneity of the unisensory events—because the a priori probability for
a cross-modal event is very small (e.g., Körding et al. 2007).
Although the priority of temporal proximity seems to afford more flexibility for an organism
in a complex environment, the next assumption delimits the role of temporal proximity to the first
processing stage:
(B3) Assumption of Temporal Separability: The amount of interaction manifesting itself in an
increase or decrease of second stage processing time is a function of cross-modal stimulus features, but
it does not depend on the presentation asynchrony (stimulus onset asynchrony, SOA) of the stimuli.

This assumption is based on a distinction between intra- and cross-modal stimulus properties,
where the properties may refer to both subjective and physical properties. Cross-modal properties
are defined when stimuli of more than one modality are present, such as spatial distance of target
to nontarget, or subjective similarity between stimuli of different modalities. Intramodal proper-
ties, on the other hand, refer to properties definable for a single stimulus, regardless of whether this
property is definable in all modalities (such as intensity) or in only one modality (such as wavelength
for color or frequency for pitch). Intramodal properties can affect the outcome of the race in the
first stage and, thereby, the probability of an interaction. Cross-modal properties may affect the
amount of cross-modal interaction occurring in the second stage. Note that cross-modal features
cannot influence first stage processing time because the stimuli are still being processed in separate
pathways.

(B4) Second Stage Assumption: The second stage comprises all processes after the first stage includ-
ing preparation and execution of a response.

The assumption of only two stages is certainly an oversimplification. Note, however, that the second
stage is defined here by default: it includes all subsequent, possibly overlapping, processes that are
not part of the peripheral processes in the first stage (for a similar approach, see Van Opstal and
Munoz 2004). Thus, the TWIN model retains the classic notion of a race mechanism as an explana-
tion for cross-modal interaction but restricts it to the very first stage of stimulus processing.

14.3.2  Quantifying Multisensory Integration in the TWIN Model


To derive empirically testable predictions from the TWIN framework, its assumptions must be put
into more precise form. According to the two-stage assumption, total saccadic reaction time in the
cross-modal condition can be written as a sum of two nonnegative random variables defined on a
common probability space:

RTcross-modal = S1 + S2, (14.1)

where S1 and S2 refer to first and second stage processing time, respectively (a base time would also
be subsumed under S2). Let I denote the event that multisensory integration occurs, having prob-
ability P(I). The expected reaction time in the cross-modal condition then follows as:

E[RTcross-modal] = E[S1] + E[S2]

                = E[S1] + P(I) · E[S2|I] + (1 − P(I)) · E[S2|Ic]

                = E[S1] + E[S2|Ic] − P(I) · (E[S2|Ic] − E[S2|I]),

where E[S2|I] and E[S2|Ic] denote the expected second stage processing time conditioned on interac-
tion occurring (I) or not occurring (Ic), respectively. Putting Δ ≡ E[S2|Ic] – E[S2|I], this becomes

E[RTcross-modal] = E[S1] + E[S2|Ic] – P(I) · Δ. (14.2)

That is, mean RT to cross-modal stimuli is the sum of the mean first stage processing time and the
mean second stage processing time when no interaction occurs, minus the term P(I) · Δ, which
is a measure of the expected amount of intersensory interaction in the second stage, with positive Δ
values corresponding to facilitation and negative values corresponding to inhibition.
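For illustration with purely hypothetical values: if E[S1] = 60 ms, E[S2|Ic] = 120 ms, P(I) = 0.6, and Δ = 25 ms, Equation 14.2 gives E[RTcross-modal] = 60 + 120 − 0.6 · 25 = 165 ms, i.e., a 15-ms speedup relative to the 180 ms expected when integration never occurs.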
This factorization of expected intersensory interaction into the probability of interaction P(I)
and the amount and sign of interaction (Δ) is an important feature of the TWIN model. According
to Assumptions B1 to B4, the first factor, P(I), depends on the temporal configuration of the stimuli
(SOA), whereas the second factor, Δ, depends on nontemporal aspects, in particular their spatial
configuration. Note that this separation of temporal and nontemporal factors is in accordance with
the definition of the window of integration: the incidence of multisensory integration hinges on the
stimuli occurring in temporal proximity, whereas the amount and sign of interaction (Δ) are modulated
by nontemporal aspects, such as semantic congruity or spatial proximity reaching, in the latter case,
from enhancement for neighboring stimuli to possible inhibition for distant stimuli (cf. Diederich
and Colonius 2007b).

14.3.3  Some General Predictions of TWIN


In the next section, more specific assumptions on first stage processing time, S1, and probabil-
ity of interaction P(I) will be introduced to derive detailed quantitative predictions for specific

experimental cross-modal paradigms. Nonetheless, even at the general level of the framework intro-
duced thus far, a number of qualitative empirical predictions of TWIN are possible.
SOA effects. The amount of cross-modal interaction should depend on the SOA between the
stimuli because the probability of integration, P(I), changes with SOA. Let us assume that two stim-
uli from different modalities differ considerably in their peripheral processing times. If the faster
stimulus is delayed (in terms of SOA) so that the arrival times of both stimuli have a high probability
of falling into the window of integration, then the amount of cross-modal interaction should be larg-
est for that value of SOA (see, e.g., Frens et al. 1995; Colonius and Arndt 2001).
Intensity effects. Stimuli of high intensity have relatively fast peripheral processing times.
Therefore, for example, if a stimulus from one modality has a high intensity compared to a stimulus
from the other modality, the chance that both peripheral processes terminate within the time win-
dow will be small, assuming simultaneous stimulus presentations. The resulting low value of P(I) is
in line with the empirical observation that a very strong signal will effectively rule out any further
reduction of saccadic RT by adding a stimulus from another modality (e.g., Corneil et al. 2002).
Cross-modal effects. The amount of multisensory integration (Δ) and its sign (facilitation or inhi-
bition) occurring in the second stage depend on cross-modal features of the stimulus set, for exam-
ple, spatial disparity and laterality (laterality here refers to whether all stimuli appear in the same
hemisphere). Cross-modal features cannot have an influence on first stage processing time because
the modalities are being processed in separate pathways. Conversely, because parameter Δ does not depend
on SOA, it cannot change its sign as a function of SOA and, therefore, the model cannot simultaneously
predict facilitation to occur for some SOA values and inhibition for others. Some empirical evidence
against this prediction has been observed (Diederich and Colonius 2008).
In the classic race model, the addition of a stimulus from a modality not yet present will increase
(or, at least, not decrease) the amount of response facilitation. This follows from the fact that—
even without assuming stochastic independence—the probability of the fastest of several processes
terminating processing before time t will increase with the number of “racers” (e.g., Colonius and
Vorberg 1994). In the case of TWIN, both facilitation and inhibition are possible under certain
conditions as follows:
Number of modalities effect. The addition of a stimulus from a modality not yet present will
increase (or, at least, not decrease) the expected amount of interaction if the added stimulus is not
“too fast” and the time window is not “too small.” The latter restrictions are meant to guarantee that
the added stimulus will fall into the time window, thereby increasing the probability of interaction
to occur.

14.4 TWIN MODELS FOR SPECIFIC PARADIGMS: ASSUMPTIONS AND PREDICTIONS
In a cross-modal experimental paradigm, the individual modalities may either be treated as being
on an equal footing, or one modality may be singled out as a target modality, whereas stimuli from
the remaining modalities may be ignored by the participant as nontargets. Cross-modal effects are
assessed in different ways, depending on task instruction. As shown below, the TWIN model can
take these different paradigms into account simply by modifying the conditions that lead to an
opening of the time window.

14.4.1  Measuring Cross-Modal Effects in Focused Attention and Redundant Target Paradigms

In the redundant target paradigm (RTP; also known as the divided attention paradigm), stimuli
from different modalities are presented simultaneously or with certain SOA, and the participant
is instructed to respond to the stimulus detected first. Typically, the time to respond in the cross-

modal condition is faster than in either of the unimodal conditions. In the focused attention para-
digm (FAP), cross-modal stimulus sets are presented in the same manner, but now participants are
instructed to respond only to the onset of a stimulus from a specifically defined target modality,
such as the visual, and to ignore the remaining nontarget stimulus (the tactile or the auditory). In
the latter setting, when a stimulus of a nontarget modality, for example, a tone, appears before the
visual target at some spatial disparity, there is no overt response to the tone if the participant is
following the task instructions. Nevertheless, the nontarget stimulus has been shown to modulate
the saccadic response to the target: depending on the exact spatiotemporal configuration of target
and nontarget, the effect can be a speedup or an inhibition of saccadic RT (see, e.g., Amlôt et al.
2003; Diederich and Colonius 2007b), and the saccadic trajectory can be affected as well (Doyle
and Walker 2002).
Some striking similarities to human data have been found in a detection task utilizing both
paradigms. Stein et al. (1988) trained cats to orient to visual or auditory stimuli, or both. In one
paradigm, the target was a visual stimulus (a dimly illuminating LED) and the animal learned that
although an auditory stimulus (a brief, low-intensity broadband noise) would be presented periodi-
cally, responses to it would never be rewarded, and the cats learned to “ignore” it (FAP). Visual–
auditory stimuli were always presented spatially coincident, but their location varied from trial to
trial. The weak visual stimulus was difficult to detect and the cats’ performance was <50% correct
detection. However, combining the visual stimulus with the neutral auditory stimulus markedly
enhanced performance, regardless of their position. A similar result was obtained when animals
learned that both stimuli were potential targets (RTP). In a separate experiment in which the visual
and the (neutral) auditory stimuli were spatially disparate, however, performance was significantly
worse than when the visual stimulus was presented alone (cf. Stein et al. 2004).
A common method to assess the amount of cross-modal interaction is to use a measure that
relates mean RT in cross-modal conditions to that in the unimodal condition. The following defini-
tions quantify the percentage of RT enhancement in analogy to a measure proposed for measuring
multisensory enhancement in neural responses (cf. Meredith and Stein 1986; Anastasio et al. 2000;
Colonius and Diederich 2002; Diederich and Colonius 2004a, 2004b). For visual, auditory, and
visual–auditory stimuli with observed mean (saccadic or manual) reaction time, RTV, RTA, and
RTVA, respectively, and SOA = τ, the multisensory response enhancement (MRE) for the redundant
target task is defined as

MRERTP = [min(RTV, RTA + τ) − RTVA,τ] / min(RTV, RTA + τ) · 100, (14.3)

where RTVA,τ refers to observed mean RT to the bimodal stimulus with SOA = τ. For the focused
attention task, MRE is defined as

MREFAP = (RTV − RTVA,τ) / RTV · 100, (14.4)

assuming vision as target modality.
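As a minimal computational sketch of Equations 14.3 and 14.4 (the Python function and variable names below are illustrative choices, not part of the original studies):

```python
def mre_rtp(rt_v, rt_a, rt_va_tau, tau):
    """Multisensory response enhancement (%) in the redundant target
    paradigm, Equation 14.3; inputs are mean RTs in ms, tau is the SOA."""
    baseline = min(rt_v, rt_a + tau)
    return (baseline - rt_va_tau) / baseline * 100.0

def mre_fap(rt_v, rt_va_tau):
    """Multisensory response enhancement (%) in the focused attention
    paradigm, Equation 14.4, with vision as the target modality."""
    return (rt_v - rt_va_tau) / rt_v * 100.0
```

For example, mre_fap(150.0, 138.0) evaluates to 8.0, i.e., an 8% enhancement.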

14.4.2  TWIN Model for the FAP


TWIN is adapted to the focused attention task by replacing the original TWIN Assumption B2 with
(B2-FAP) TWIN Assumption: In the FAP, cross-modal interaction occurs only if (1) a nontarget
stimulus wins the race in the first stage, opening the TWIN such that (2) the termination of the target
peripheral process falls in the window. The duration of the time window is a constant.

The idea here is that the winning nontarget will keep the saccadic system in a state of heightened
reactivity such that the upcoming target stimulus, if it falls into the time window, will trigger cross-
modal interaction. At the neural level, this may correspond to a gradual inhibition of fixation neu-
rons (in the superior colliculus) and/or omnipause neurons (in the midline pontine brain stem). In
the case of the target being the winner, no discernible effect on saccadic RT is predicted, such as in
the unimodal situation.
The race in the first stage of the model is made explicit by assigning statistically independent,
nonnegative random variables V and A to the peripheral processing times, for example, for a visual
target and an auditory nontarget stimulus, respectively. With τ as SOA value and ω as integration
window width parameter, Assumption B2-FAP amounts to the event that multisensory integration
occurs, IFAP , being

IFAP = {A + τ < V < A + τ + ω}.

Thus, the probability of integration to occur, P(IFAP), is a function of both τ and ω, and it can be
determined numerically once the distribution functions of A and V have been specified.
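As a sketch of such a numerical computation, the snippet below evaluates P(IFAP) = ∫ fA(x){FV(x + τ + ω) − FV(x + τ)} dx by numerical integration, here with exponential distributions for V and A (the choice made in the parametric version of TWIN discussed below and in Appendix A); the use of SciPy and the default parameter values are illustrative assumptions only.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import expon

def p_integration_fap(tau, omega, mean_v=50.0, mean_a=30.0):
    """P(I_FAP) = Pr(A + tau < V < A + tau + omega) for independent,
    exponentially distributed peripheral processing times V and A
    (means in ms); tau is the SOA, omega the window width."""
    f_a = expon(scale=mean_a).pdf   # density of the nontarget time A
    F_v = expon(scale=mean_v).cdf   # distribution function of the target time V
    def integrand(x):
        return f_a(x) * (F_v(x + tau + omega) - F_v(x + tau))
    value, _ = quad(integrand, 0.0, np.inf)
    return value
```

With mean peripheral times of 50 ms (visual) and 30 ms (auditory), ω = 200 ms, and τ = 0, this yields P(IFAP) ≈ 0.61.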
Expected reaction time in the bimodal condition then is (cf. Equation 14.2)

E[RTVA,τ] = E[V] + E[S2|IcFAP] − P(IFAP) · Δ. (14.5)

No interaction is possible in the unimodal condition. Thus, the expected reaction time for the visual
(target) stimulus condition is

E[RTV] = E[V] + E[S2|IcFAP]. (14.6)

Note that in the focused attention task, the first stage duration is defined as the time it takes to pro-
cess the (visual) target stimulus, E[V]. Cross-modal interaction (CI) is defined as difference between
mean RT to the unimodal and cross-modal stimuli, i.e.,

CI ≡ E[RTV] – E[RTVA,τ] = P(IFAP) · Δ. (14.7)

Thus, the separation of temporal and nontemporal factors expressed in the above equation for the
observable CI is directly inherited from Assumptions B4 and B2-FAP.

14.4.2.1  TWIN Predictions for the FAP


The integration Assumption B2-FAP permits further specification of TWIN’s general predictions
of Section 14.3.3. From a model testing point of view, it is a clear strength of the TWIN framework
that it allows for numerous qualitative predictions without having to specify the probability distribu-
tions for the random processing times. Thus, a violation of any one of these predictions cannot be
attributed to an inappropriate choice of the distributions but may point to a more fundamental inad-
equacy of one or, possibly, several model assumptions. For a quantitative fit to an observed set of
data, however, some distributional assumptions are required. In the parametric version of TWIN, all
peripheral processing times are assumed to be exponentially distributed (cf. Colonius and Diederich
2004b). This choice is made mainly for computational simplicity: calculating the probability of
integration, P(IFAP), is straightforward, and the exponential distribution is characterized by a single
quantity, the intensity parameter λ (see Appendix A). As long as predictions are limited to the level
of means, no specific assumptions about the distribution of processing times in the second stage are
necessary (but see Section 14.6).
Next, we demonstrate how the focused attention context leads to more specific empirically test-
able predictions of TWIN. Predictions relying on the parametric TWIN version are postponed to the

final part of this section. If not specifically mentioned otherwise, we always assume nonnegative Δ
values in the following elaborations.
SOA effects. When the nontarget is presented very late relative to the target (large positive SOA),
its chance of winning the race against the target and thus opening the window of integration becomes
very small. When it is presented rather early (large negative SOA), it is likely to win the race and
to open the window, but the window may be closed by the time the target arrives. Again, the prob-
ability of integration, P(IFAP), is small. Therefore, the largest probability of integration is expected
for some midrange SOA values. Although P(IFAP) is unobservable, it should leave its mark on a well-
known observable measure, i.e., MRE. In fact, MREFAP, defined in (Equation 14.4) as a function of
SOA, should have the same form as P(IFAP), scaled only by some constant:

MREFAP = (RTV − RTVA,τ) / RTV · 100

       = P(IFAP) · Δ / RTV · 100 (14.8)

       = P(IFAP) · Δ · const.

Intensity effects. Increasing the intensity of the visual stimulus will speed up visual peripheral
processing (up to some minimum level) thereby increasing the chance for the visual target to win the
race. Thus, the probability that the window of integration opens decreases, predicting less multisen-
sory integration. Increasing the intensity of the nontarget auditory stimulus, on the other hand, leads
to the opposite prediction: the auditory stimulus will have a better chance to win the race and to
open the window of integration, hence, predicting more multisensory integration to occur on aver-
age. Two further distinctions can be made. For large negative SOA, i.e., when the auditory nontarget
arrives very early, further increasing the auditory intensity makes it more likely for the TWIN to
close before the target arrives and therefore results in a lower P(IFAP) value. For smaller negative
SOA, however, i.e., when the nontarget is presented shortly before the target, increasing the auditory
intensity improves its chances to win against the target and to open the window. Given the com-
plexity of these intensity effects, however, more specific quantitative predictions will require some
distributional assumptions for the first stage processing times (see below). Alternatively, it may be
feasible to adapt the “double factorial paradigm” developed by Townsend and Nozawa (1995) to
analyze predictions when the effects of both targets and nontargets presented at two different
intensity levels are observed.
Cross-modal effects. If target and nontarget are presented in two distinct cross-modal condi-
tions, one would expect parameter Δ to take on two different values. For example, for two spatial
conditions, ipsilateral and contralateral, the values could be Δi and Δc, respectively. Subtracting the
corresponding cross-modal interaction terms then gives (cf. Equation 14.7)

CIi – CIc = P(IFAP) · (Δi – Δc), (14.9)

an expression that should again yield the same qualitative behavior, as a function of SOA, as P(IFAP).
In a similar vein, one can capitalize on the factorization of expected cross-modal interaction if some
additional experimental factor affecting Δ, but not P(IFAP), is available. In Colonius et al. (2009), an
auditory background masker stimulus, presented at increasing intensity levels, was hypothesized to
simultaneously increase Δc and decrease Δi. The ratio of CIs in both configurations,

CIi / CIc = [P(IFAP) · Δi] / [P(IFAP) · Δc] = Δi / Δc, (14.10)

should then remain invariant across SOA values, with a separate value for each level of the
masker.
Number of nontargets effects. For cross-modal interaction to occur in the focused attention task,
it is necessary that the nontarget process wins the race in the first stage. With two or more nontar-
gets entering the race, the probability of one of them winning against the target process increases
and, therefore, the probability of opening the window of integration increases with the number of
nontargets present. In this case, there are even two different ways of utilizing the factorization of CI,
both requiring the existence of two cross-modal conditions with two different Δ parameters (spatial
or other). The first test is analogous to the previous one. Because the number of nontargets affects
P(IFAP) only, the ratio in Equation 14.10 should be the same whether it is computed from conditions
with one or two nontargets. The second test results from taking the ratio of CI based on one non-
target, C1, over CI based on two nontargets, C2. Because Δ should not be affected by the number of
nontargets, the ratio

CI1 / CI2 = [P1(IFAP) · Δ] / [P2(IFAP) · Δ] = P1(IFAP) / P2(IFAP), (14.11)

where P1 and P2 refer to the probability of opening the window under one or two nontargets, respec-
tively, should be the same, no matter from which one of the two cross-modal conditions it was
computed. In the study of Diederich and Colonius (2007a), neither of these tests revealed evidence
against these TWIN predictions.
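A minimal sketch of how these invariance checks could be computed from observed mean RTs (arrays indexed by SOA condition; all names below are illustrative, and no data from the cited study are reproduced):

```python
import numpy as np

def crossmodal_interaction(rt_unimodal, rt_crossmodal):
    """CI = mean unimodal RT minus mean cross-modal RT (Equation 14.7),
    evaluated elementwise over a set of SOA conditions."""
    return np.asarray(rt_unimodal, float) - np.asarray(rt_crossmodal, float)

def invariance_ratio(ci_a, ci_b):
    """Elementwise ratio of two CI curves. Under TWIN the ipsi/contralateral
    ratio (Equation 14.10) and the one-/two-nontarget ratio (Equation 14.11)
    should both be approximately constant across SOA."""
    return np.asarray(ci_a, float) / np.asarray(ci_b, float)
```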
SOA and intensity effects predicted by a parametric TWIN version. Assuming exponential dis-
tributions for the peripheral processing times, the intensity parameter for the visual modality is set
to 1/λV = 50 (ms) and to 1/λA = 10, 30, 70, or 90 (ms) for the auditory nontarget. Quantitative predic-
tions of TWIN for focused attention are shown in the left of Figure 14.1. Panels 1 and 2 show mean
RT and P(IFAP) as a function of SOA for the various intensities of the auditory nontarget. Note that
two of the auditory intensities correspond to faster peripheral processing than the visual target, whereas
the other two correspond to slower peripheral processing. Here, the parameter for second stage processing time
when no integration occurs, μ, was set to 100 ms. The TWIN was set to 200 ms. The parameter for
multisensory integration was set to Δi = 20 ms for bimodal stimuli presented ipsilaterally, implying
a facilitation effect. Note that neither λV nor μ are directly observable, but the sum of the peripheral
and central processing time for the visual target stimulus constitutes a prediction for unimodal
mean saccadic RT:

E[RTV] = 1/λV + μ,

which, for the present example, is 50 ms + 100 ms = 150 ms. The dashed line and the dotted line
show the bimodal RT predictions for the auditory nontargets with the highest and lowest intensity,
respectively.
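The following is a trial-level simulation sketch of these parametric predictions (exponential peripheral processing times; the second stage enters only through its mean μ, which suffices for predictions at the level of means). The default parameter values follow those listed above; the function name and the Monte Carlo approach are illustrative choices.

```python
import numpy as np

def simulate_fap_mean_rt(tau, mean_v=50.0, mean_a=30.0, mu=100.0,
                         omega=200.0, delta=20.0, n_trials=200_000, seed=1):
    """Monte Carlo estimate of the parametric TWIN prediction for mean
    saccadic RT in the focused attention paradigm at SOA tau (ms)."""
    rng = np.random.default_rng(seed)
    V = rng.exponential(mean_v, n_trials)   # peripheral time, visual target
    A = rng.exponential(mean_a, n_trials)   # peripheral time, auditory nontarget
    integrated = (A + tau < V) & (V < A + tau + omega)   # event I_FAP
    rt = V + mu - delta * integrated        # Equation 14.5, realized trial by trial
    return rt.mean()
```

At τ = 0 this returns a value close to 150 − P(IFAP) · 20 ≈ 138 ms, compared with the unimodal prediction 1/λV + μ = 150 ms.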
No fits to empirical data sets are presented here, but good support of TWIN has been found thus
far (see, e.g., Diederich and Colonius 2007a, 2007b, 2008; Diederich et al. 2008). Close correspon-
dence between data and model prediction, however, is not the only aspect to consider. Importantly,
the pattern of parameter values estimated for a given experimental setting should suggest a mean-
ingful interpretation. For example, increasing stimulus intensities are reflected in a decrease of the
corresponding λ parameters, assuming higher intensities to lead to faster peripheral processing
times (at least, within certain limits). Furthermore, in the study with an auditory background masker
(Colonius et al. 2009), the cross-modal interaction parameter (Δ) was a decreasing or increasing
function of masker level for the contralateral or ipsilateral condition, respectively, as predicted.

[Figure 14.1 near here: mean RT, Pr(I), and MRE plotted as functions of SOA (ms), for the focused attention paradigm (left column) and the redundant target paradigm (right column).]

FIGURE 14.1  TWIN predictions for FAP (left panels) and RTP (right panels). Parameters in both paradigms
were chosen to be identical. Mean RT for visual stimulus is 150 ms (1/λV = 50, μ = 100). Peripheral processing
times for auditory stimuli are 1/λA = 10 ms (dashed line), 1/λA = 30 ms (solid), 1/λA = 70 ms (dash-dotted),
and 1/λA = 90 ms (dotted). Interaction parameter is Δ = 20 ms.

14.4.3  TWIN Model for RTP


TWIN is adapted to the redundant target task by replacing the original TWIN Assumption B2 by
(B2-RTP) TWIN Assumption: In the RTP, (1) the window of integration is opened by whichever
stimulus wins the race in the first stage and (2) cross-modal interaction occurs if the termination of the
peripheral process of a stimulus of another modality falls within the window. The duration of the time
window is a constant.
Obviously, if stimuli from more than two modalities are presented, the question of a possible addi-
tional effect on cross-modal interaction arises. There is both behavioral and neurophysiological
evidence for trimodal interaction (e.g., Diederich and Colonius 2004b; Stein and Meredith 1993),
but data from saccadic eye movement recordings do not yet seem to be conclusive enough to justify
further elaboration of Assumption B2-RTP.

To compute the probability of interaction in the RTP, P(IRTP), we assume that a visual and an
auditory stimulus are presented with an SOA equal to τ. Then, either the visual stimulus wins, V <
A + τ, or the auditory stimulus wins, A + τ < V; so, in either case, min(V, A + τ) < max(V, A + τ) and,
by Assumption B2-RTP,

IRTP = {max(V, A + τ) < min(V, A + τ) + ω}.

Thus, the probability of integration to occur is a function of both τ and ω, as before. Expected reac-
tion time in the cross-modal condition is computed as (see Equation 14.2)

E[RTVA,τ] = E[min(V, A + τ)] + E[S2|IcRTP] − P(IRTP) · Δ. (14.12)

In the RTP, first stage duration is determined by the termination time of the winner. This is an
important difference to the focused attention situation in which first stage duration is defined by
the time it takes to process the (visual) target stimulus. Even for a zero probability of interaction,
expected reaction time in the bimodal condition is smaller than, or equal to, either of the unimodal
stimulus conditions. These are

E[RTV] = E[V] + E[S2|IcRTP] (14.13)

and

E[RTA] = E[A] + E[S2|IcRTP], (14.14)

because in the redundant target version of TWIN, the race in the first stage produces a statistical
facilitation effect equivalent to the one in the classic race model. Thus, a possible cross-modal
enhancement observed in a redundant target task may be because of multisensory integration or sta-
tistical facilitation, or both. Moreover, a possible cross-modal inhibition effect may be weakened by
the simultaneous presence of statistical facilitation in the first stage. Predictions for the redundant
target case are less straightforward than for focused attention because the factorization of cross-
modal interaction (CI) in the latter is no longer valid. Nevertheless, some general predictions can be
made assuming, as before, a multisensory facilitation effect, i.e., Δ > 0.
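A corresponding trial-level sketch for the redundant target version (with the same caveats as before: exponential peripheral times are the parametric assumption, the mean second stage time is added deterministically, and all names are illustrative):

```python
import numpy as np

def simulate_rtp_mean_rt(tau, mean_v=50.0, mean_a=30.0, mu=100.0,
                         omega=200.0, delta=20.0, n_trials=200_000, seed=2):
    """Monte Carlo estimate of the parametric TWIN prediction for mean RT
    in the redundant target paradigm; tau >= 0 is the delay of the second
    (here auditory) stimulus relative to the first (visual) one."""
    rng = np.random.default_rng(seed)
    V = rng.exponential(mean_v, n_trials)
    A = rng.exponential(mean_a, n_trials) + tau
    first_stage = np.minimum(V, A)                       # winner of the race
    integrated = np.maximum(V, A) < first_stage + omega  # event I_RTP
    rt = first_stage + mu - delta * integrated           # Equation 14.12, trial by trial
    return rt.mean()
```

Even with delta set to 0 the result lies below both unimodal predictions (150 ms visual, 130 ms auditory for these parameters), which is the statistical facilitation contributed by the first stage race.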

14.4.3.1  TWIN Predictions for RTP


In this paradigm, both stimuli are on an equal footing and, therefore, negative SOA values need not
be introduced. Each SOA value now indicates the time between the stimulus presented first and the
one presented second, regardless of modality.
SOA effects. The probability of cross-modal interaction decreases with increasing SOA: the later
the second stimulus is presented, the less likely it is to win the race and to open the window of inte-
gration; alternatively, if the window has already been opened by the first stimulus, the less likely it
is to fall into that window with increasing SOA. For large enough SOA values, mean saccadic RT in
the cross-modal condition approaches the mean for the stimulus presented first.
To fix ideas, we now assume, without loss of generality, that a visual stimulus of constant intensity
is presented first and that an auditory stimulus is presented second, or simultaneous with the visual,
and at different intensities. Predictions then depend on the relative intensity difference between
both stimuli. Note that the unimodal means constitute upper bounds for bimodal mean RT.
Intensity effects. For a visual stimulus presented first, increasing the intensity of the auditory
stimulus (presented second) increases the amount of facilitation.

SOA and intensity effects predicted by a parametric TWIN version. Figure 14.1 (right panels)
shows the quantitative predictions of TWIN for SOA and intensity variations under exponential
distributions for the peripheral processing times. Parameters are the same as for the FAP predic-
tions (left panels). Panels 1 and 2 show mean RT and P(I) as a function of SOA for various intensity
levels (λ parameters) of the auditory stimulus. Both panels exhibit the predicted monotonicity in
SOA and intensity. The third panel, depicting MRE, reveals some nonmonotonic behavior in both
SOA and intensity.
Without going into numerical details, this nonmonotonicity of MRE can be seen to be because
of a subtle interaction between two mechanisms, both being involved in the generation of MRE:
(1) statistical facilitation occurring in the first stage and (2) opening of the time window. The former
is maximal if presentation of the stimulus processed faster is delayed by an SOA equal to the differ-
ence in mean RT in the unimodal stimulus conditions, that is when peripheral processing times are
in physiological synchrony; for example, if mean RT to an auditory stimulus is 110 ms and mean
RT to a visual stimulus is 150 ms, the maximal amount of statistical facilitation is expected when
the auditory stimulus is presented 150 ms – 110 ms = 40 ms after the visual stimulus. The SOA
value being “optimal” for statistical facilitation, however, need not be the one producing the highest
probability of opening the time window that was shown to be decreasing with SOA. Moreover, the
nonmonotonicity in intensity becomes plausible if one realizes that variation in intensity results in a
change in mean processing time analogous to an SOA effect: for example, lowering auditory stimu-
lus intensity has an effect on statistical facilitation and the probability of opening the time window
that is comparable to increasing SOA.

14.4.4  Focused Attention versus RTP


Top-down versus bottom-up. The distinction between RTP and FAP is not only an interesting exper-
imental variation as such but it may also provide an important theoretical aspect. In fact, because
physically identical stimuli can be presented under the same spatiotemporal configuration in both
paradigms, any differences observed in the corresponding reaction times would have to be because
of the instructions being different, thereby pointing to a possible separation of top-down from bot-
tom-up processes in the underlying multisensory integration mechanism.
Probability of integration. Moreover, comparing both paradigms yields some additional insight
into the mechanics of TWIN. Note that under equivalent stimulus conditions, IFAP ⊂ IRTP; this rela-
tion follows from the observation that

IFAP = IRTP ∩ {A + τ is the winner of the race}.

It means that any realization of the peripheral processing times that leads to an opening of the time
window under the focused attention instruction also leads to the same event under the redundant
target instruction. Thus, the probability of integration under redundant target instructions cannot
be smaller than that under focused attention instruction: P(IFAP) ≤ P(IRTP), given identical stimulus
conditions (see also Figure 14.1).
Inverse effectiveness. It is instructive to consider the effect of varying stimulus intensity in both
paradigms when both stimuli are presented simultaneously (SOA = 0) and at intensity levels produc-
ing the same mean peripheral speed, i.e., with the same intensity parameters, λV = λA. Assuming
exponential distributions, Figure 14.2 depicts the probability of integration (upper panels) and MRE
(lower panels) as a function of time window width (ω) for both paradigms and with each curve repre-
senting a specific intensity level. The probability of integration increases monotonically from zero
(for ω = 0) toward 0.5 for the focused attention, and toward 1 for the RTP. For the former, the prob-
ability of integration cannot surpass 0.5 because, for any given window width, the target process has
the same chance of winning as the nontarget process under the given λ parameters. For both para-
digms, P(I), as a function of ω, is ordered with respect to intensity level: it increases monotonically

[Figure 14.2 near here: probability of integration Pr(I) (upper panels) and MRE (lower panels) plotted as functions of time window width (ms), for FAP (left column) and RTP (right column).]

FIGURE 14.2  TWIN predictions for FAP (left panels) and RTP (right panels) as a function of time window
width (ω) at SOA = 0. Upper panels depict probability of integration P(I), whereas lower panels show MRE.
Each curve corresponds to a specific intensity parameter of stimuli. Peripheral processing times for auditory
and visual stimuli are 1/λA = 1/λV equal to 30 ms (dashed line), 50 ms (solid), 70 ms (dash-dotted), and 90 ms
(black dotted). Mean second stage processing time is μ = 100 ms. Interaction parameter is Δ = 20 ms.

with the mean processing time of both stimuli* (upper panels of Figure 14.2). The same ordering is
found for MRE in the FAP; somewhat surprisingly, however, the ordering is reversed for MRE in
the RTP: increasing intensity implies less enhancement, i.e., it exhibits the “inverse effectiveness”
property often reported in empirical studies (Stein and Meredith 1993; Rowland and Stein 2008).
Similar to the above discussion of intensity effects for RTP, this is because of an interaction gener-
ated by increasing intensity: it weakens statistical facilitation in first stage processing but simultane-
ously increases the probability of integration.

* This is because of a property of the exponential distribution: mean and SD are identical.

14.5 TWIN MODEL FOR FOCUSED ATTENTION: INCLUDING A WARNING MECHANISM
Although estimates for the TWIN vary somewhat across subjects and task specifics, a 200-ms width
showed up in several studies (e.g., Eimer 2001; Sinclair and Hammond 2009). In a focused attention
task, when the nontarget occurs at an early point in time (i.e., 200 ms or more before the target), a
substantial decrease of RT compared to the unimodal condition has been observed by Diederich
and Colonius (2007a). This decrease, however, no longer depended on whether target and nontarget
appeared at ipsilateral or contralateral positions, thus supporting the hypothesis that the nontarget
plays the role of a spatially unspecific alerting cue, or warning signal, for the upcoming target when-
ever the SOA is large enough.
The hypothesis of increased cross-modal processing triggered by an alerting cue had already
been advanced by Nickerson (1973), who called it “preparation enhancement.” In the eye movement
literature, the effects of a warning signal have been studied primarily in the context of explaining
the “gap effect,” i.e., the latency to initiate a saccade to an eccentric target is reduced by extinguish-
ing the fixation stimulus approximately 200 ms before target onset (Reuter-Lorenz et al. 1991;
Klein and Kingstone 1993). An early study on the effect of auditory or visual warning signals on
saccade latency, but without considering multisensory integration effects, was conducted by Ross
and Ross (1981).
Here, the dual role of the nontarget—inducing multisensory integration that is governed by the
above-mentioned spatiotemporal rules, on the one hand, and acting as a spatially unspecific cross-
modal warning cue, on the other—will be taken into account by an extension of TWIN that yields
an estimate of the relative contribution of either mechanism for any specific SOA value.
(W) Assumption on warning mechanism: If the nontarget wins the processing race in the first stage
by a margin wide enough for the TWIN to be closed again before the arrival of the target, then subse-
quent processing will be facilitated or inhibited (“warning effect”) without dependence on the spatial
configuration of the stimuli.*

* In the study of Diederich and Colonius (2008), an alternative version of this assumption was considered
as well (version B): if the nontarget wins the processing race in the first stage by a wide enough margin,
then subsequent processing will in part be facilitated or inhibited without dependence on the spatial
configuration of the stimuli. This version is less restrictive: all that is needed for the nontarget to act as a
warning signal is a "large enough" head start against the target in the race, and P(I ∩ W) can be larger than 0.
Assuming that the effects on RT of the two events I and W, integration and warning, combine additively, it
can then be shown that the cross-modal interaction prediction of this model version is captured by the same
equation as under the original version, i.e., Equation 14.17 below. The only difference is in the order
restriction for the parameters, γ ≥ ω. Up to now, no empirical evidence has been found in favor of one of
the two versions over the other.

The time margin by which the nontarget may win against the target will be called head start denoted
as γ. The assumption stipulates that the head start is at least as large as the width of the time window
for a warning effect to occur. That is, the warning mechanism of the nontarget is triggered when-
ever the nontarget wins the race by a head start γ ≥ ω ≥ 0. Taking, for concreteness, the auditory as
nontarget modality, occurrence of a warning effect corresponds to the event:

W = {A + τ + γ < V}.

The probability of warning to occur, P(W), is a function of both τ and γ. Because γ ≥ ω ≥ 0 this
precludes the simultaneous occurrence of both warning and multisensory interaction within one
and the same trial and, therefore, P(I ∩ W) = 0 (because no confusion can arise, we write I for IFAP
throughout this section). The actual value of the head start criterion is a parameter to be estimated
in fitting the model under Assumption W.
The expected saccadic reaction time in the cross-modal condition in the TWIN model with
warning assumption can then be shown to be

E[RTcross-modal] = E[S1] + E[S2]

                = E[S1] + E[S2|Ic ∩ Wc] − P(I) · {E[S2|Ic ∩ Wc] − E[S2|I]}

                  − P(W) · {E[S2|Ic ∩ Wc] − E[S2|W]},

where E[S2|I], E[S2|W], and E[S2|Ic ∩ Wc] denote the expected second stage processing time condi-
tioned on interaction occurring (I), warning occurring (W), or neither of them occurring (Ic ∩ Wc),
respectively (Ic, Wc stand for the complement of events I, W). Setting

Δ ≡ E[S2|Ic ∩ Wc] − E[S2|I]

κ ≡ E[S2|Ic ∩ Wc] − E[S2|W],

where κ denotes the amount of the warning effect (in milliseconds), this becomes

E[RTcross-modal] = E[S1] + E[S2|Ic ∩ Wc] – P(I) · Δ – P(W) · κ. (14.15)

In the unimodal condition, neither integration nor warning are possible. Thus,

E[RTunimodal] = E[S1] + E[S2|Ic ∩ Wc], (14.16)

and we arrive at a simple expression for the combined effect of multisensory integration and warn-
ing, cross-modal interaction (CI),

CI ≡ E[RTunimodal] – E[RTcross-modal] = P(I) · Δ + P(W) · κ. (14.17)

Recall that the basic assumptions of TWIN imply that for a given spatial configuration and nontar-
get modality, there are no sign reversals or changes in magnitude of Δ across all SOA values. The
same holds for κ. Note, however, that Δ and κ can separately take on positive or negative values
(or zero) depending on whether multisensory integration and warning have a facilitative or inhibi-
tory effect. Furthermore, like the probability of integration P(I), the probability of warning P(W)
does change with SOA.
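A trial-level sketch of the extended model (original version of Assumption W, γ ≥ ω), again under the exponential assumption and with hypothetical parameter values, also makes explicit that the events I and W cannot co-occur:

```python
import numpy as np

def simulate_fap_with_warning(tau, mean_v=50.0, mean_a=30.0, mu=100.0,
                              omega=200.0, gamma=250.0, delta=20.0, kappa=5.0,
                              n_trials=200_000, seed=3):
    """Monte Carlo estimates of P(I), P(W), and mean cross-modal RT for the
    focused attention paradigm with a warning mechanism (Equation 14.15);
    gamma >= omega makes integration and warning mutually exclusive."""
    assert gamma >= omega >= 0.0
    rng = np.random.default_rng(seed)
    V = rng.exponential(mean_v, n_trials)
    A = rng.exponential(mean_a, n_trials)
    integrated = (A + tau < V) & (V < A + tau + omega)   # event I
    warned = A + tau + gamma < V                         # event W
    assert not np.any(integrated & warned)               # P(I and W) = 0
    rt = V + mu - delta * integrated - kappa * warned    # Equation 14.15, trial by trial
    return integrated.mean(), warned.mean(), rt.mean()
```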

14.5.1  TWIN Predictions for FAP with Warning


The occurrence of a warning effect depends on intramodal characteristics of the target and the
nontarget, such as modality or intensity. Assuming that increasing stimulus intensity goes along
with decreased reaction time (for auditory stimuli, see, e.g., Frens et al. 1995; Arndt and Colonius
2003; for stimuli, see Diederich and Colonius 2004b), TWIN makes specific predictions regarding
the effect of nontarget intensity variation.
Intensity effects. An intense (auditory) nontarget may have a higher chance to win the race with
a head start compared to a weak nontarget. In general, increasing the intensity of the nontarget
(1) increases the probability of it functioning as a warning signal, and (2) makes it more likely for
the nontarget to win the peripheral race against the target process.
SOA effects. The probability of warning P(W) decreases monotonically with SOA: the later the
nontarget is presented, the smaller its chances to win the race against the target with some head
start γ. This differs from the nonmonotonic relationship predicted between P(IFAP) and SOA (see
above). It is interesting to note that the difference in how P(I) and P(W) should depend on SOA is,
in principle, empirically testable without any distributional assumptions by manipulating the con-
ditions of the experiment. Specifically, if target and nontarget are presented in two distinct spatial
conditions, for example, ipsilateral and contralateral, one would expect Δ to take on two different
values, Δi and Δc, whereas P(W) · κ, the expected nonspatial warning effect, should remain the same
under both conditions. Subtracting the corresponding cross-modal interaction terms then gives,
after canceling the warning effect terms (Equation 14.17),

CIi – CIc = P(I) · (Δi – Δc). (14.18)



This expression is an observable function of SOA and, because the factor Δi – Δc does not depend
on SOA by Assumption B3, it should exhibit the same functional form as P(I): increasing and then
decreasing (see Figure 14.1, middle left panel).
Context effects. The magnitude of the warning effect may be influenced by the experimental design.
Specifically, suppose nontargets from different modalities are presented in two distinct presentation modes,
e.g., with the modality of the auditory and tactile nontargets either blocked or mixed within an experimental
block of trials, such that supposedly no changes in the expected amount of multisensory integration occur.
Subtraction of the corresponding CI values then yields, after canceling the integration effect terms,

CIblocked – CImixed = P(W) · (κmixed – κblocked), (14.19)

a quantity that should decrease monotonically with SOA because P(W) does.
The extension of the model to include warning effects has been probed for both auditory and tac-
tile nontargets. Concerning the warning assumptions, no clear superiority of version A over version

[Figure 14.3 near here: mean RT, probability of warning/integration, and MRE plotted as functions of SOA (ms), when only warning occurs (left column) and when both integration and warning occur (right column).]

FIGURE 14.3  TWIN predictions for FAP when only warning occurs (left panels) and when both integration
and warning occur (right panels). Parameters are chosen as before: 1/λV = 50 and μ = 100, resulting in a mean
RT for visual stimulus of 150 ms. Peripheral processing times for auditory stimuli are 1/λA = 10 ms (dashed
line), 1/λA = 30 ms (solid), 1/λA = 70 ms (dash-dotted), and 1/λA = 90 ms (black dotted).

B was found in the data. For detailed results on all of the tests described above, we refer the reader
to Diederich and Colonius (2008).
SOA and intensity: quantitative predictions. To illustrate the predictions of TWIN with warning
for mean SRT, we choose the following set of parameters. As before, the intensity parameter for the
visual modality is set to 1/λV = 50 (ms) and to 1/λA = 10, 30, 70, or 90 (ms) for the (auditory) nontar-
get, the parameter for second stage processing time when no integration and no warning occurs, μ ≡
E[S2|Ic ∩ Wc], is set to 100 ms, and the TWIN to 200 ms. The parameter for multisensory integration
is set to Δi = 20 ms for bimodal stimuli presented ipsilaterally, and κ is set to 5 ms (Figure 14.3).

14.6  CONCLUSIONS: OPEN QUESTIONS AND FUTURE DIRECTIONS


The main contribution of the TWIN framework thus far is to provide an estimate of the multi-
sensory integration effect—and, for the extended model, also of a possible warning effect—that
is “contaminated” neither by a specific SOA nor by intramodal stimulus properties such as inten-
sity. This is achieved through factorizing* expected cross-modal interaction into the probability
of interaction in a given trial, P(I), times the amount of interaction Δ (cf. Equation 14.2), the latter
being measured in milliseconds. Some potential extensions of the TWIN framework are discussed
next.
Although the functional dependence of P(I) on SOA and stimulus parameters is made explicit
in the rules governing the opening and closing of the time window, the TWIN model framework
as such does not stipulate a mechanism for determining the actual amount of interaction. By
Assumption B4, Δ depends on cross-modal features like, for example, spatial distance between
the stimuli of different modalities, and by systematically varying the spatial configuration, some
insight into the functional dependence can be gained (e.g., Diederich and Colonius 2007b). Given
the diversity of intersensory interaction effects, however, it would be presumptuous to aim at a
single universal mechanism for predicting the amount of Δ. This does not preclude incorporating
multisensory integration mechanisms into the TWIN framework within a specific context such as
a spatial orienting task. Such an approach, which includes stipulating distributional properties of
second stage processing time in a given situation, would bring along the possibility of a stronger
quantitative model test, namely at the level of the entire observable reaction time distribution rather
than at the level of means only.
In line with the framework of modeling multisensory integration as (nearly) optimal decision
making (Körding et al. 2007), we have recently suggested a decision rule that determines an optimal
window width as a function of (1) the prior odds in favor of a common multisensory source, (2) the
likelihood of arrival time differences, and (3) the payoff for making correct or wrong decisions
(Colonius and Diederich 2010).
Another direction is to extend the TWIN framework to account for additional experimental
paradigms. For example, in many studies, a subject’s task is not simply to detect the target but to
perform a speeded discrimination task between two stimuli (Driver and Spence 2004). Modeling
this task implies not only a prediction of reaction time but also of the frequency of a correct or
incorrect discrimination response. Traditionally, such data have been accommodated by assuming
an evidence accumulation mechanism sequentially sampling information from the stimulus display
favoring either response option A or B, for example, and stopping as soon as a criterion threshold for
one or the other alternative has been reached. A popular subclass of these models are the diffusion
models, which have been considered models of multisensory integration early on (Diederich 1995,
2008). At this point, however, it is an open question how this approach can be reconciled with the
TWIN framework.

* Strictly speaking, this only holds for the focused attention version of TWIN; for the redundant target version, an estimate
of the amount of statistical facilitation is required and can be attained empirically (cf. Colonius and Diederich 2006).

One of the most intriguing neurophysiological findings has been the suppression of multisensory
integration ability of superior colliculus neurons by a temporary suspension of corticotectal inputs
from the anterior ectosylvian sulcus and the lateral suprasylvian sulcus (Clemo and Stein 1986; Jiang
et al. 2001). A concomitant effect on multisensory orientation behavior observed in the cat (Jiang
et al. 2002) suggests the existence of more general cortical influences on multisensory integration.
Currently, there is no explicit provision of a top-down mechanism in the TWIN framework. Note,
however, that the influence of task instruction (FAP vs. RTP) is implicitly incorporated in TWIN
because the probability of integration is supposed to be computed differently under otherwise iden-
tical stimulus conditions (cf. Section 14.4.4). It is a challenge for future development to demonstrate
that the explicit incorporation of top-down processes can be reconciled with the two-stage structure
of the TWIN framework.

APPENDIX A
A.1  DERIVING THE PROBABILITY OF INTERACTION IN TWIN
The peripheral processing times V for the visual and A for the auditory stimulus have an exponential
distribution with parameters λV and λA, respectively. That is,

fV(t) = λV exp(−λV t),

fA(t) = λA exp(−λA t)

for t ≥ 0, and fV(t) = fA(t) ≡ 0 for t < 0. The corresponding distribution functions are referred to as
FV(t) and FA(t).

A.1.1  Focused Attention Paradigm


The visual stimulus is the target and the auditory stimulus is the nontarget. By definition,

P(IFAP) = Pr(A + τ < V < A + τ + ω)

        = ∫ fA(x){FV(x + τ + ω) − FV(x + τ)} dx,

where τ denotes the SOA value and ω is the width of the integration window. Computing the integral
expression requires that we distinguish between three cases for the sign of τ + ω:

(1) τ < τ + ω < 0

P(IFAP) = ∫_{−τ−ω}^{−τ} λA exp(−λA x) {1 − exp(−λV(x + τ + ω))} dx
          + ∫_{−τ}^{∞} λA exp(−λA x) {exp(−λV(x + τ)) − exp(−λV(x + τ + ω))} dx

        = [λV/(λV + λA)] exp(λA τ) (exp(λA ω) − 1);


(2) τ < 0 < τ + ω

P(IFAP) = ∫_{0}^{−τ} λA exp(−λA x) {1 − exp(−λV(x + τ + ω))} dx
          + ∫_{−τ}^{∞} λA exp(−λA x) {exp(−λV(x + τ)) − exp(−λV(x + τ + ω))} dx

        = [1/(λV + λA)] {λA (1 − exp(−λV(ω + τ))) + λV (1 − exp(λA τ))};

(3) 0 < τ < τ + ω

P(IFAP) = ∫_{0}^{∞} λA exp(−λA x) {exp(−λV(x + τ)) − exp(−λV(x + τ + ω))} dx

        = [λA/(λV + λA)] {exp(−λV τ) − exp(−λV(ω + τ))}

The mean RT for cross-modal stimuli is

E[RTVA,τ] = E[V] + E[S2|IcFAP] − P(IFAP) · Δ
          = 1/λV + μ − P(IFAP) · Δ

and the mean RT for the visual target is

E[RTV] = 1/λV + μ,

where 1/λV, the mean of the exponential distribution, is the mean RT of the first stage and μ is the
mean RT of the second stage when no interaction occurs.
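For convenience, the three cases above translate directly into a short function (a Python sketch; λV and λA denote the rate parameters, i.e., the reciprocals of the mean peripheral processing times):

```python
import math

def p_integration_fap_closed(tau, omega, lam_v, lam_a):
    """Closed-form P(I_FAP) for exponentially distributed peripheral
    processing times, following the three cases derived above."""
    s = lam_v + lam_a
    if tau + omega < 0:       # case (1): tau < tau + omega < 0
        return lam_v / s * math.exp(lam_a * tau) * (math.exp(lam_a * omega) - 1.0)
    if tau < 0:               # case (2): tau < 0 < tau + omega
        return (lam_a * (1.0 - math.exp(-lam_v * (omega + tau)))
                + lam_v * (1.0 - math.exp(lam_a * tau))) / s
    # case (3): 0 <= tau
    return lam_a / s * (math.exp(-lam_v * tau) - math.exp(-lam_v * (omega + tau)))
```

For example, p_integration_fap_closed(0.0, 200.0, 1/50, 1/30) evaluates to approximately 0.61.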

A.1.2  Redundant Target Paradigm


The visual stimulus is presented first and the auditory stimulus second. By definition,

P(IRTP) = Pr{max(V, A + τ) < min(V, A + τ) + ω}

If the visual stimulus wins:

(1) 0 ≤ τ ≤ ω

P(IRTPV) = ∫_{0}^{τ} λV exp(−λV x) (1 − exp(−λA(x + ω − τ))) dx
           + ∫_{τ}^{∞} λV exp(−λV x) {(1 − exp(−λA(x + ω − τ))) − (1 − exp(−λA(x − τ)))} dx

         = [1/(λV + λA)] {λV (1 − exp(−λA(ω − τ))) + λA (1 − exp(−λV τ))};

(2) 0 < ω ≤ τ

P(IRTPV) = ∫_{τ−ω}^{τ} λV exp(−λV x) (1 − exp(−λA(x + ω − τ))) dx
           + ∫_{τ}^{∞} λV exp(−λV x) {(1 − exp(−λA(x + ω − τ))) − (1 − exp(−λA(x − τ)))} dx

         = [λA/(λV + λA)] exp(−λV τ) (exp(λV ω) − 1).

If the auditory stimulus wins: 0 < τ ≤ τ + ω and

P(IRTPA) = ∫_{0}^{∞} λA exp(−λA x) {exp(−λV(x + τ)) − exp(−λV(x + τ + ω))} dx

         = [λA/(λV + λA)] {exp(−λV τ) − exp(−λV(ω + τ))}.
The probability that the visual or the auditory stimulus wins is therefore

P(IRTP) = P(IRTPV) + P(IRTPA).

The mean RT for cross-modal stimuli is


E[RTVA,τ] = E[min(V, A + τ)] + E[S2|IcRTP] − P(IRTP) · Δ
          = 1/λV − exp(−λV τ) · (1/λV − 1/(λV + λA)) + μ − P(IRTP) · Δ

and the mean RTs for the visual and the auditory stimulus are

E[RTV] = 1/λV + μ

and

E[RTA] = 1/λA + μ.

A.1.3  Focused Attention and Warning


By definition,

P(W) = Pr(A + τ + γA < V)

     = ∫_{0}^{∞} fA(x){1 − FV(x + τ + γA)} dx

     = 1 − ∫_{0}^{∞} fA(x) FV(x + τ + γA) dx.

Again, we need to consider different cases:

(1) τ + γA < 0

P(W) = 1 − ∫_{−τ−γA}^{∞} λA exp(−λA a) {1 − exp(−λV(a + τ + γA))} da

     = 1 − [λV/(λV + λA)] exp(λA(τ + γA));

(2) τ + γA ≥ 0

P(W) = 1 − ∫_{0}^{∞} λA exp(−λA a) {1 − exp(−λV(a + τ + γA))} da

     = [λA/(λV + λA)] exp(−λV(τ + γA)).

The mean RT for cross-modal stimuli is


E[RTVA,τ] = E[V] + E[S2|IcFAP] − P(IFAP) · Δ − P(W) · κ
          = 1/λV + μ − P(IFAP) · Δ − P(W) · κ,

where 1/λV is the mean RT of the first stage, μ is the mean RT of the second stage when no inter-
action occurs, P(IFAP) · Δ is the expected amount of intersensory interaction, and P(W) · κ is the
expected amount of warning.
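These two cases likewise translate into a short function; combined with p_integration_fap_closed from the A.1.1 sketch above, it yields the mean RT prediction of the extended model (again a Python sketch with illustrative naming):

```python
import math

def p_warning(tau, gamma_a, lam_v, lam_a):
    """Closed-form P(W) = Pr(A + tau + gamma_A < V) for exponentially
    distributed peripheral processing times, following the two cases above."""
    s = lam_v + lam_a
    if tau + gamma_a < 0:     # case (1)
        return 1.0 - lam_v / s * math.exp(lam_a * (tau + gamma_a))
    # case (2): tau + gamma_a >= 0
    return lam_a / s * math.exp(-lam_v * (tau + gamma_a))

def mean_rt_fap_with_warning(tau, omega, gamma_a, lam_v, lam_a, mu, delta, kappa):
    """Mean cross-modal RT: 1/lam_v + mu - P(I_FAP)*delta - P(W)*kappa,
    using p_integration_fap_closed from the sketch in A.1.1."""
    p_i = p_integration_fap_closed(tau, omega, lam_v, lam_a)
    p_w = p_warning(tau, gamma_a, lam_v, lam_a)
    return 1.0 / lam_v + mu - p_i * delta - p_w * kappa
```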

REFERENCES
Amlôt, R., R. Walker, J. Driver, and C. Spence. 2003. Multimodal visual-somatosensory integration in saccade
generation. Neuropsychologia 41:1–15.
Anastasio, T.J., P.E. Patton, and K. Belkacem-Boussaid. 2000. Using Bayes’ rule to model multisensory
enhancement in the superior colliculus. Neural Computation 12:1165–1187.
Arndt, A., and H. Colonius. 2003. Two separate stages in crossmodal saccadic integration: Evidence from vary-
ing intensity of an auditory accessory stimulus. Experimental Brain Research 150:417–426.
Bell, A.H., A. Meredith, A.J. Van Opstal, and D.P. Munoz. 2005. Crossmodal integration in the primate
superior colliculus underlying the preparation and initiation of saccadic eye movements. Journal of
Neurophysiology 93:3659–3673.
Clemo, H.R., and B.E. Stein. 1986. Effects of cooling somatosensory corticotectal influences in cat. Journal of
Neurophysiology 55:1352–1368.
Colonius, H., and P. Arndt. 2001. A two-stage model for visual-auditory interaction in saccadic latencies.
Perception & Psychophysics, 63:126–147.
Colonius, H., and A. Diederich. 2002. A maximum-likelihood approach to modeling multisensory enhance-
ment. In Advances in Neural Information Processing Systems 14, T.G. Dietterich, S. Becker, and Z.
Ghahramani (eds.). Cambridge, MA: MIT Press.
Colonius, H., and A. Diederich. 2004. Multisensory interaction in saccadic reaction time: A time-window-of-
integration model. Journal of Cognitive Neuroscience 16:1000–1009.
Colonius, H., and A. Diederich. 2006. Race model inequality: Interpreting a geometric measure of the amount
of violation. Psychological Review 113(1):148–154.
Colonius, H., and A. Diederich. 2010. The optimal time window of visual–auditory integration: A reaction time
analysis. Frontiers in Integrative Neuroscience, 4:11. doi:10.3389/fnint.2010.00011.

Colonius, H., and D. Vorberg. 1994. Distribution inequalities for parallel models with unlimited capacity.
Journal of Mathematical Psychology 38:35–58.
Colonius, H., A. Diederich, and R. Steenken. 2009. Time-window-of-integration (TWIN) model for saccadic
reaction time: Effect of auditory masker level on visual-auditory spatial interaction in elevation. Brain
Topography 21:177–184.
Corneil, B.D., and D.P. Munoz. 1996. The influence of auditory and visual distractors on human orienting gaze
shifts. Journal of Neuroscience 16:8193–8207.
Corneil, B.D., M. Van Wanrooij, D.P. Munoz, A.J. Van Opstal. 2002. Auditory-visual interactions subserving
goal-directed saccades in a complex scene. Journal of Neurophysiology 88:438–454.
Diederich, A. 1995. Intersensory facilitation of reaction time: Evaluation of counter and diffusion coactivation
models. Journal of Mathematical Psychology 39:197–215.
Diederich, A. 2008. A further test on sequential sampling models accounting for payoff effects on response bias
in perceptual decision tasks. Perception & Psychophysics 70(2):229–256.
Diederich, A., and H. Colonius. 2004a. Modeling the time course of multisensory interaction in manual and
saccadic responses. In Handbook of multisensory processes, ed. G. Calvert, C. Spence, and B.E. Stein,
395–408. Cambridge, MA: MIT Press.
Diederich, A., and H. Colonius. 2004b. Bimodal and trimodal multisensory enhancement: Effects of stimulus
onset and intensity on reaction time. Perception & Psychophysics 66(8):1388–1404.
Diederich, A., and H. Colonius. 2007a. Why two “distractors” are better than one: Modeling the effect of
nontarget auditory and tactile stimuli on visual saccadic reaction time. Experimental Brain Research
179:43–54.
Diederich, A., and H. Colonius. 2007b. Modeling spatial effects in visual–tactile saccadic reaction time.
Perception & Psychophysics 69(1):56–67.
Diederich, A., and H. Colonius. 2008. Crossmodal interaction in saccadic reaction time: Separating multisensory
from warning effects in the time window of integration model. Experimental Brain Research 186:1–22.
Diederich, A., H. Colonius, D. Bockhorst, and S. Tabeling. 2003. Visual–tactile spatial interaction in saccade
generation. Experimental Brain Research 148:328–337.
Diederich, A., H. Colonius, and A. Schomburg. 2008. Assessing age-related multisensory enhancement with
the time-window-of-integration model. Neuropsychologia 46:2556–2562.
Doyle, M.C., and R. Walker. 2002. Multisensory interactions in saccade target selection: Curved saccade tra-
jectories. Experimental Brain Research 142:116–130.
Driver, J., and C. Spence. 2004. Crossmodal spatial attention: Evidence from human performance. In Crossmodal
space and crossmodal attention, ed. C. Spence and J. Driver, 179–220. Oxford: Oxford Univ. Press.
Eimer, M. 2001. Crossmodal links in spatial attention between vision, audition, and touch: Evidence from
event-related brain potentials. Neuropsychologia 39:1292–1303.
Frens, M.A., A.J. Van Opstal, and R.F. Van der Willigen. 1995. Spatial and temporal factors determine auditory–
visual interactions in human saccadic eye movements. Perception & Psychophysics 57:802–816.
Harrington, L.K., and C.K. Peck. 1998. Spatial disparity affects visual–auditory interactions in human senso-
rimotor processing. Experimental Brain Research 122:247–252.
Hershenson, M. 1962. Reaction time as a measure of intersensory facilitation. Journal of Experimental
Psychology 63:289–293.
Hughes, H.C., P.-A. Reuter-Lorenz, G. Nozawa, and R. Fendrich. 1994. Visual–auditory interactions in sen-
sorimotor processing: Saccades versus manual responses. Journal of Experimental Psychology: Human
Perception and Performance 20:131–153.
Hughes, H.C., M.D. Nelson, and D.M. Aronchick. 1998. Spatial characteristics of visual–auditory summation
in human saccades. Vision Research 38:3955–3963.
Jiang, W., M.T. Wallace, H. Jiang, J.W. Vaughan, and B.E. Stein. 2001. Two cortical areas mediate multisensory
integration in superior colliculus neurons. Journal of Neurophysiology 85:506–522.
Jiang, W., H. Jiang, and B.E. Stein. 2002. Two cortical areas facilitate multisensory orientation behaviour.
Journal of Cognitive Neuroscience 14:1240–1255.
Körding, K.P., U. Beierholm, W.J. Ma, S. Quartz, J.B. Tenenbaum et al. 2007. Causal inference in multisensory
perception. PLoS ONE 2(9):e943, doi:10.1371/journal.pone.0000943.
Klein, R., and A. Kingstone. 1993. Why do visual offsets reduce saccadic latencies? Behavioral and Brain
Sciences 16(3):583–584.
Luce, R.D. 1986. Response times: Their role in inferring elementary mental organization. New York: Oxford
Univ. Press.
Meredith, M.A. 2002. On the neural basis for multisensory convergence: A brief overview. Cognitive Brain
Research 14:31–40.
276 The Neural Bases of Multisensory Processes

Meredith, M.A., and B.E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. Journal of Neurophysiology 56:640–662.
Meredith, M.A., J.W. Nemitz, and B.E. Stein. 1987. Determinants of multisensory integration in superior col-
liculus neurons. I. Temporal factors. Journal of Neuroscience 10:3215–3229.
Miller, J.O. 1982. Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology
14:247–279.
Munoz, D.P., and R. H. Wurtz. 1995. Saccade-related activity in monkey superior colliculus. I. Characteristics
of burst and buildup cells. Journal of Neurophysiology 73:2313–2333.
Navarra, J., A. Vatakis, M. Zampini, S. Soto-Faraco, W. Humphreys, and C. Spence. 2005. Exposure to asyn-
chronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain
Research 25:499–507.
Nickerson, R.S. 1973. Intersensory facilitation of reaction time: Energy summation or preparation enhance-
ment. Psychological Review 80:489–509.
Raab, D.H. 1962. Statistical facilitation of simple reaction times. Transactions of the New York Academy of
Science 24:574–590.
Reuter-Lorenz, P.A., H.C. Hughes, and R. Fendrich. 1991. The reduction of saccadic latency by prior offset of
the fixation point: An analysis of the gap effect. Perception & Psychophysics 49(2):167–175.
Ross, S.M., and L.E. Ross. 1981. Saccade latency and warning signals: Effects of auditory and visual stimulus
onset and offset. Perception & Psychophysics 29(5):429–437.
Rowland, B.A., and B.E. Stein. 2008. Temporal profiles of response enhancement in multisensory integration.
Frontiers in Neuroscience 2:218–224.
Schweickert, R., D.L. Fisher, and K. Sung. Discovering Cognitive Architecture by Selectively Influencing
Mental Processes. London: World Scientific Publishing (in press).
Sinclair, C., and G.R. Hammond. 2009. Excitatory and inhibitory processes in primary motor cortex during
the foreperiod of a warned reaction time task are unrelated to response expectancy. Experimental Brain
Research 194:103–113.
Spence, C., and S. Squire. 2003. Multisensory integration: Maintaining the perception of synchrony. Current
Biology 13:R519–R521.
Stein, B.E., and M.A. Meredith. 1993. The Merging of the Senses. Cambridge, MA: MIT Press.
Stein, B.E., W.S. Huneycutt, and M.A. Meredith. 1988. Neurons and behavior: The same rules of multisensory
integration apply. Brain Research 448:355–358.
Stein, B.E., W. Jiang, and T.R. Stanford. 2004. Multisensory integration in single neurons in the midbrain. In
Handbook of multisensory processes, ed. G. Calvert, C. Spence, and B.E. Stein, 243–264. Cambridge,
MA: MIT Press.
Sternberg, S. 2001. Separate modifiability, mental modules, and the use of pure and composite measures to
reveal them. Acta Psychologica 106:147–246.
Todd, J.W. 1912. Reaction to multiple stimuli, in Archives of Psychology, No. 25. Columbia contributions to
philosophy and psychology, ed. R.S. Woodworth, Vol. XXI, No. 8, New York: The Science Press.
Townsend, J.T., and G. Nozawa. 1995. Spatio-temporal properties of elementary perception: An investigation
of parallel, serial, and coactive theories. Journal of Mathematical Psychology 39:321–359.
Van Opstal, A.J., and D.P. Munoz. 2004. Auditory–visual interactions subserving primate gaze orienting. In
Handbook of multisensory processes, ed. G. Calvert, C. Spence, and B.E. Stein, 373–393. Cambridge,
MA: MIT Press.
Van Wassenhove, V., K.W. Grant, and D. Poeppel. 2007. Temporal window of integration in auditory–visual
speech perception. Neuropsychologia 45:598–607.
Van Zandt, T. 2002. Analysis of response time distributions. In Stevens’ handbook of experimental psychology,
vol. 4, 3rd edn, ed. H. Pashler. New York: Wiley & Sons, Inc.
Whitchurch, E.A., and T.T. Takahashi. 2006. Combined auditory and visual stimuli facilitate head saccades in
the barn owl (Tyto alba). Journal of Neurophysiology 96:730–745.
Section IV
Development and Plasticity
15  The Organization and Plasticity of Multisensory Integration in the Midbrain

Thomas J. Perrault Jr., Benjamin A. Rowland, and Barry E. Stein

CONTENTS
15.1 Impact of Multisensory Integration....................................................................................... 279
15.2 Organization of Multisensory Integration in Adult SC......................................................280
15.3 SC Multisensory Integration Depends on Influences from Cortex....................................... 287
15.4 Ontogeny of SC Multisensory Integration............................................................................. 288
15.4.1 Impact of Developing in Absence of Visual–Nonvisual Experience........................ 289
15.4.2 Altering Early Experience with Cross-Modal Cues by Changing Their Spatial
Relationships.............................................................................................................. 291
15.4.3 Role of Cortical Inputs during Maturation................................................................ 291
15.4.4 Ontogeny of Multisensory Integration in Cortex...................................................... 292
15.4.5 Ontogeny of SC Multisensory Integration in a Primate............................................ 292
Acknowledgments........................................................................................................................... 294
References....................................................................................................................................... 294

A great deal of attention has been paid to the physiological processes through which the brain
integrates information from different senses. This reflects the substantial impact of this process on
perception, cognitive decisions, and overt behavior. Yet, less attention has been given to the postna-
tal development, organization, and plasticity associated with this process. In the present chapter we
examine what is known about the normal development of multisensory integration and how early
alterations in postnatal experience can disrupt and dramatically alter the fundamental proper-
ties of multisensory integration. The focus here is on the multisensory layers of the cat superior
colliculus (SC), a system that has served as an excellent model for understanding multisensory inte-
gration at the level of the single neuron and at the level of overt orientation behavior. Before discuss-
ing this structure’s normal development and its capacity to change, it is important to examine what
has been learned about multisensory integration and the functional role of the SC in this process.

15.1  IMPACT OF MULTISENSORY INTEGRATION


The ability of the brain to integrate information from different sources speeds and enhances its
ability to detect, locate, and identify external events as well as the higher-order and behavioral pro-
cesses necessary to deal with these events (Corneil and Munoz 1996; Frens et al. 1995a; Hughes et
al. 1994; Marks 2004; Newell 2004; Sathian et al. 2004; Shams et al. 2004; Stein et al. 1989; Stein
and Meredith 1993; Woods et al. 2004). All brains engage in this process of multisensory integra-
tion, and do so at multiple sites within the nervous system (Calvert et al. 2004a). The proper identifi-
cation of an event includes the ability to disambiguate potentially confusing signals, including those
associated with speech and animal communication (Bernstein et al. 2004; Busse et al. 2005; Corneil


and Munoz 1996; Frens et al. 1995b; Ghazanfar et al. 2005; Ghazanfar and Schroeder 2006; Grant
et al. 2000; Hughes et al. 1994; King and Palmer 1985; Lakatos et al. 2007; Liotti et al. 1998; Marks
2004; Massaro 2004; Newell 2004; Partan 2004; Recanzone 1998; Sathian 2000, 2005; Sathian
et al. 2004; Schroeder and Foxe 2004; Senkowski et al. 2007; Shams et al. 2004; Stein et al. 1989;
Sugihara et al. 2006; Sumby and Pollack 1954; Talsma et al. 2006, 2007; Wallace et al. 1996;
Weisser et al. 2005; Woldorff et al. 2004; Woods and Recanzone 2004a, 2004b; Zangaladze et al.
1999). The facilitation of these capabilities has enormous survival value, so its retention and elabo-
ration in all extant species is no surprise. What is surprising is that despite the frequent discussion
of this phenomenon in adults (see Calvert et al. 2004b; Ghazanfar and Schroeder 2006; Spence and
Driver 2004; Stein and Meredith 1993), there is much less effort directed to understanding how this
process develops, and how it adapts to the environment in which it will be used.
The multisensory neuron in the cat SC is an excellent model system to explore the organization
and plasticity of multisensory integration. This is not only because it is the primary site of converging inputs from different senses (Fuentes-Santamaria et al. 2008; Stein et al. 1993; Wallace et al. 1993), but also because it is involved in well-defined behaviors (orientation and localization), thereby
providing an opportunity to relate physiology to behavior. Furthermore, we already know a good
deal about the normal development of the unisensory properties of SC neurons (Kao et al. 1994;
Stein 1984) and SC neurons have been one of the richest sources of information about the ontogeny
and organization of multisensory integration (Barth and Brett-Green 2004; Calvert et al. 2004b;
Groh and Sparks 1996a, 1996b; Gutfreund and Knudsen 2004; Jay et al. 1987a, 1987b; King et al.
2004; Lakatos et al. 2007; Peck 1987b; Sathian et al. 2004; Senkowski et al. 2007; Stein 1984; Stein
and Arigbede 1972; Stein and Clamann 1981; Stein and Meredith 1993; Stein et al. 1973, 1976,
1993; Wallace 2004; Woods et al. 2004a).
Of the most interest in the present context are two experimental observations. The first is that
influences from the cortex are critical for the maturation of SC multisensory integration, the second
is that experience during early postnatal life guides the nature of that integrative process. These
are likely to be interrelated observations given the well-known plasticity of neonatal cortex. One
reasonable possibility is that experience is coded in the cortex and in the morphology and functional
properties of its connections with the SC.

15.2  ORGANIZATION OF MULTISENSORY INTEGRATION IN ADULT SC


Traditionally, the seven-layered structure of the SC has been subdivided into two functional sets
of laminae: the superficial laminae (I–III) are exclusively visual, and the deeper laminae (IV–VII)
contain unisensory (visual, auditory, and somatosensory) and multisensory neurons of all possible
combinations (Stein and Meredith 1993). Visual, auditory, and somatosensory representations in the
SC are all arranged in a similar map-like fashion so that they are all in register with each other (see
Figure 15.1; Meredith and Stein 1990; Meredith et al. 1991; Middlebrooks and Knudsen 1984; Stein
and Clamann 1981; Stein et al. 1976, 1993). The frontal regions of sensory space (forward visual and
auditory space, and the face) are represented in the anterior aspect of the structure, whereas more
temporal space (and the rear of the body) are represented in the posterior SC. Superior sensory space
is represented in the medial aspect of the structure, and inferior space in the more lateral aspect of
the structure. As a consequence, the neurons in a given region of the SC represent the same region
of sensory space. These sensory maps are in register with the premotor map in the SC. This is a
convenient way of matching incoming sensory information with the outgoing signals that program
an orientation to the initiating event (Grantyn and Grantyn 1982; Groh et al. 1996a, 1996b; Guitton
and Munoz 1991; Harris 1980; Jay and Sparks 1984, 1987a, 1987b; Munoz and Wurtz 1993a, 1993b;
Peck 1987b; Sparks 1986; Sparks and Nelson 1987; Stein and Clamann 1981; Wurtz and Goldberg
1971; Wurtz and Albano 1980).
FIGURE 15.1  Correspondence of visual, auditory, and somatosensory representations in SC. Horizontal and vertical meridians of different sensory representations in SC suggest a common coordinate system representing multisensory space. (From Stein, B.E., and Meredith, M.A., The merging of the senses, MIT Press, Cambridge, 1993. With permission.)

Each multisensory SC neuron has multiple receptive fields, one for each of the modalities to which it responds. As would be expected from the structure’s map-like representations of the senses,
these receptive fields are in spatial coincidence with each other (King et al. 1996; Meredith and Stein
1990; Meredith et al. 1991, 1992). Cross-modal stimuli that are in spatial and temporal coincidence
with one another and fall within the excitatory receptive fields of a given neuron function synergisti-
cally. They elicit more vigorous responses (more impulses) than are evoked by the strongest of them
individually. This is called “multisensory enhancement” and is illustrated in Figure 15.2. However,
when these same stimuli are disparate in space, such that one falls within its excitatory receptive field and the other falls within the inhibitory portion of its receptive field, the result is “multisensory depression.” Now the response consists of fewer impulses than that evoked by the most effective individual component stimulus. This ubiquitous phenomenon of enhancement and depression has been described in the SC and cortex for a number of organisms ranging from the rat to the human (Barth and Brett-Green 2004; Calvert et al. 2004b; DeGelder et al. 2004; Fort and Giard 2004; Ghazanfar and Schroeder 2006; King and Palmer 1985; Lakatos et al. 2007; Laurienti et al. 2002; Lovelace et al. 2003; Macaluso and Driver 2004; Meredith and Stein 1983, 1986a, 1986b, 1996; Morgan et al. 2008; Romanski 2007; Sathian et al. 2004; Schroeder et al. 2001; Schroeder and Foxe 2002, 2004; Wallace and Stein 1994; Wallace et al. 1992, 1993, 1998, 2004b).

FIGURE 15.2  Multisensory enhancement and depression. Middle: visual (dark gray) and auditory (light gray) receptive fields (RF) of this SC neuron are plotted on hemispheres representing visual and auditory space. Each concentric circle represents 10° of space with right caudal aspect of auditory space represented by the half hemisphere. White bar labeled V represents a moving visual stimulus, whereas speakers labeled Ao and Ai represent auditory stimuli. Left: response enhancement occurred when visual and auditory stimuli were placed in spatial congruence (VAi). Note, in plot to the left, multisensory response exceeded sum of visual and auditory responses (horizontal dotted line) and was 94% greater than response to the most effective component stimulus (visual). Right: response depression occurred when visual and auditory stimuli were spatially disparate (VAo) so that multisensory response was 47% less than response to visual stimulus.
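The percentages quoted in Figure 15.2 (94% enhancement, 47% depression) express the multisensory response relative to the response evoked by the most effective component stimulus. A minimal Python sketch of that index is given below; the impulse counts are invented for illustration and are not the figure's data.

```python
def interaction_index(multisensory, unisensory_responses):
    """Percent change of the multisensory response relative to the most effective
    component response, the convention behind the 94% / -47% values in Figure 15.2."""
    best = max(unisensory_responses)
    return 100.0 * (multisensory - best) / best

# Hypothetical mean impulse counts per trial (illustrative only)
visual, auditory = 10.0, 4.0
print(interaction_index(16.5, [visual, auditory]))  # positive value -> multisensory enhancement
print(interaction_index(6.0, [visual, auditory]))   # negative value -> multisensory depression
```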
The clearest indicator that a neuron can engage in multisensory integration is its ability to show
multisensory enhancement because multisensory depression occurs only in a subset of neurons that
show multisensory enhancement (Kadunce et al. 2001). The magnitude of response enhancement
will vary dramatically, both among neurons across the population and within a particular
neuron throughout its dynamic range. This variation is in part due to differences in responses to dif-
ferent cross-modal stimulus combinations. When spatiotemporally aligned cross-modal stimuli are
poorly effective, multisensory response enhancement magnitudes are often proportionately greater
than those elicited when stimuli are robustly effective. Recordings from single neurons have demonstrated that multisensory responses can exceed predictions based on the simple addition of the two
unisensory responses. These superadditive interactions generally occur at the lower end of a given
neuron’s dynamic range and as stimulus effectiveness increases, multisensory responses tend to
exhibit more additive or subadditive interactions (Alvarado et al. 2007b; Perrault et al. 2003, 2005;
Stanford and Stein 2007; Stanford et al. 2005), a series of transitions that are consistent with the
concept of “inverse effectiveness” (Meredith and Stein 1986b), in which the product of an enhanced
multisensory interaction is proportionately largest when the effectiveness of the cross-modal stimuli
is weakest. Consequently, the proportionate benefits that accrue to performance based on this
neural process will also be greatest.
This makes intuitive sense because highly effective cues are generally easiest to detect, locate,
and identify. Using the same logic, the enhanced magnitude of a multisensory response is likely
to be proportionately largest at its onset, because it is at this point when the individual component
responses would be just beginning, and thus, weakest. Recent data suggest that this is indeed the case
(Rowland et al. 2007a, 2007b; see Figure 15.3). This is of substantial interest because it means that
individual responses often, if not always, involve multiple underlying computations: superadditivity
at their onset and additivity (and perhaps subadditivity) as the response evolves. In short, the super-
additive multisensory computation may be far more common than previously thought, rendering the
initial portion of the response of far greater impact than would otherwise be the case and markedly
increasing its likely role in the detection and localization of an event.
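These computational labels compare the multisensory response with the sum of the two unisensory responses. The sketch below is a bare-bones illustration of that classification and of the inverse-effectiveness trend; the spike counts are invented, and the fixed tolerance merely stands in for the statistical tests used in the studies cited above.

```python
def classify_interaction(multisensory, uni_a, uni_b, tolerance=0.05):
    """Compare the multisensory response with the predicted sum of the two unisensory
    responses; 'tolerance' is an arbitrary placeholder for a statistical criterion."""
    predicted_sum = uni_a + uni_b
    if multisensory > predicted_sum * (1 + tolerance):
        return "superadditive"
    if multisensory < predicted_sum * (1 - tolerance):
        return "subadditive"
    return "additive"

# Illustrative of inverse effectiveness: weakly effective components tend toward
# superadditivity, strongly effective components toward additivity or subadditivity.
print(classify_interaction(3.5, 1.0, 1.2))    # weak inputs -> superadditive
print(classify_interaction(18.0, 12.0, 10.0)) # strong inputs -> subadditive
```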
Regarding computational modes, one should be cautious when interpreting multisensory response
enhancements from pooled samples of neurons. As noted earlier, the underlying computation varies
among neurons as a result of their inherent properties and the specific features of the cross-modal
stimuli with which they are evaluated. Many of the studies cited above yielded significant population
enhancements that appear “additive,” yet one cannot conclude from these data that this was their
default computation (e.g., Alvarado et al. 2007b; Perrault et al. 2005; Stanford et al. 2005). This is
because they were examined with a battery of stimuli whose individual efficacies were dispropor-
tionately high. Because of inverse effectiveness, combinations of such stimuli would, of course, be
expected to produce less robust enhancement and a high incidence of additivity (Stanford and Stein
2007). If those same neurons were tested with minimally effective stimuli exclusively, the incidence
of superadditivity would have been much higher. Furthermore, most neurons, regardless of the com-
putation that best describes their averaged response, exhibit superadditive computations at their onset,
when activity is weakest (Rowland and Stein 2007). It is important to consider that this initial portion
of a multisensory response may have the greatest impact on behavior (Rowland et al. 2007a).

FIGURE 15.3  Temporal profile of multisensory enhancement. Left: impulse rasters illustrating responses of a multisensory SC neuron to visual (V), auditory (A), and combined visual–auditory (VA) stimulation. Right: two different measures of response show the same basic principle of “initial response enhancement.” Multisensory responses are enhanced from their very onset and have shorter latencies than either of individual unisensory responses. Upper right: measure is mean stimulus-driven cumulative impulse count (qsum), reflecting temporal evolution of enhanced response. Bottom right: an instantaneous measure of response efficacy using event estimates. Event estimates use an appropriate kernel function that convolves impulse spike trains into spike density functions that differentiate spontaneous activity from stimulus-driven activity using a mutual information measure. Spontaneous activity was then subtracted from stimulus-driven activity and a temporal profile of multisensory integration was observed. (From Rowland, B.A., and Stein, B.E., Frontiers in Neuroscience, 2, 218–224, 2008. With permission.)

This process of integrating information from different senses is computationally distinct from
the integration of information within a sense. This is likely to be the case, in large part, because the
multiple cues in the former provide independent estimates of the same initiating event whereas the
multiple cues in the latter contain substantial noise covariance (Ernst and Banks 2002). Using this
logic, one would predict that a pair of within-modal stimuli would not yield the same response
enhancement obtained with a pair of cross-modal stimuli even if both stimulus pairs were posi-
tioned at the same receptive field locations. On the other hand, one might argue that equivalent
results would be likely because, in both cases, the effect reflects the amount of environmental
energy. This latter argument posits that multiple, redundant stimuli explain the effect, rather than
some unique underlying computation (Gondan et al. 2005; Leo et al. 2008; Lippert et al. 2007;
Miller 1982; Sinnett et al. 2008).
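The statistical intuition behind this distinction, following the cue-combination logic of Ernst and Banks (2002) cited above, can be illustrated with a generic calculation: the benefit of optimally combining two estimates shrinks as their noise becomes correlated. The sketch below is a statistical illustration only, not a model of SC neurons, and the noise values are invented.

```python
import numpy as np

def combined_sd(sd1, sd2, rho):
    """Standard deviation of the minimum-variance linear combination of two unbiased
    location estimates whose noise has correlation rho (rho = 0: independent cues)."""
    cov = np.array([[sd1**2, rho * sd1 * sd2],
                    [rho * sd1 * sd2, sd2**2]])
    ones = np.ones(2)
    var = 1.0 / (ones @ np.linalg.inv(cov) @ ones)  # precision-weighted combination
    return np.sqrt(var)

sd_v, sd_a = 2.0, 3.0  # hypothetical single-cue localization noise (deg)
print(combined_sd(sd_v, sd_a, rho=0.0))  # independent noise (cross-modal-like case): largest benefit
print(combined_sd(sd_v, sd_a, rho=0.7))  # shared noise (within-modal-like case): little benefit
```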
The experimental results obtained by Alvarado and colleagues (Figure 15.4) argue for the former
explanation. The integration of cross-modal cues produced significantly greater response products
than did the integration of within-modal cues. The two integration products also reflected very dif-
ferent underlying neural computations, with the latter most frequently reflecting subadditivity—a
computation that was rarely observed with cross-modal cues (Alvarado et al. 2007b). Gingras et al.
(2009) tested the same assumption and came to the same conclusions using an overt behavioral mea-
sure in which cats performed a detection and localization task in response to cross-modal (visual–
auditory) and within-modal (visual–visual or auditory–auditory) stimulus combinations (Gingras et
al. 2009; Figure 15.5).

FIGURE 15.4  Physiological comparisons of multisensory and unisensory integration. (a) Magnitude of response evoked by a cross-modal stimulus (y-axis) is plotted against magnitude of largest response evoked by component unisensory stimuli (x-axis). Most of observations show multisensory enhancement (positive deviation from solid line of unity). (b) The same cannot be said for response magnitudes evoked by two within-modal stimuli. Here, typical evoked response is not statistically better than that evoked by largest response to a component stimulus. Within-modal responses are similar in both multisensory and unisensory neurons (insets on right). (From Alvarado, J.C. et al., Journal of Neurophysiology 97, 3193–205, 2007b. With permission.)

Because the SC is a site at which modality-specific inputs from the different senses converge
(Meredith and Stein 1986b; Stein and Meredith 1993; Wallace et al. 1993), it is a primary site of
their integration, and is not a reflection of multisensory integration elsewhere in the brain. The
many unisensory structures from which these inputs are derived have been well-described (e.g., see
Edwards et al. 1979; Huerta and Harting 1984; Stein and Meredith 1993; Wallace et al. 1993). Most
multisensory SC neurons send their axons out of the structure to target motor areas of the brain-
stem and spinal cord. It is primarily via this descending route that the multisensory responses of
SC neurons effect orientation behaviors (Moschovakis and Karabelas 1985; Peck 1987a; Stein and
Meredith 1993; Stein et al. 1993). Thus, it is perhaps no surprise that the principles found to govern multisensory integration at the level of the individual SC neuron also govern SC-mediated
overt behavior (Burnett et al. 2004, 2007; Jiang et al. 2002, 2007; Stein et al. 1989; Wilkinson et
al. 1996).
FIGURE 15.5  Multisensory integration was distinct from unisensory visual–visual integration. (a) At every spatial location, multisensory integration produced substantial performance enhancements (94–168%; mean, 137%), whereas unisensory visual integration produced comparatively modest enhancements (31–79%; mean, 49%). Asterisks indicate comparisons that were significantly different (χ2 test; P < 0.05). (b) Pie charts to left show performance in response to modality-specific auditory (A1) and visual (V1 and V2 are identical) stimuli. Figures within the bordered region show performance to cross-modal (V1A1) and within-modal (V1V2) stimulus combinations. No-Go errors (NG; gray) and Wrong Localization errors (W; white) were significantly decreased as a result of multisensory integration, but only No-Go errors were significantly reduced as a result of unisensory integration. (c) Differential effect of multisensory and unisensory integration was reasonably constant, regardless of effectiveness of best component stimulus, and both showed an inverse relationship, wherein benefits were greatest when effectiveness of component stimuli was lowest. V, visual; A, auditory; C, correct. (From Gingras, G. et al., Journal of Neuroscience, 29, 4897–902, 2009. With permission.)

FIGURE 15.6  SC multisensory integration depends on influences from association cortex. SC responses to auditory (A), visual (V), and multisensory (AV) stimuli were recorded before (left) and after (right) deactivation of association cortex. Visual stimulus was presented at multiple (five) levels of effectiveness. At the top of the figure are individual stimulus traces, impulse rasters, and peristimulus time histograms for each response. Graphs at bottom summarize these data showing mean response levels (lines) and percentage of multisensory enhancement (bars) observed for each of stimulus pairings. Before cortical deactivation, enhanced responses showed characteristic “inverse effectiveness” profile with larger unisensory responses associated with smaller multisensory enhancements. However, after cortical deactivation (shaded region of inset), multisensory enhancements were eliminated at each of stimulus effectiveness levels tested so that multisensory and unisensory responses were no longer significantly different. (From Jiang, W. et al., Journal of Neurophysiology, 85, 506–22, 2001. With permission.)

15.3  SC MULTISENSORY INTEGRATION DEPENDS ON INFLUENCES FROM CORTEX

Although, as noted above, SC neurons become multisensory as a result of receiving converging
inputs from multiple visual, auditory, and somatosensory sources, this does not automatically ren-
der them capable of integrating these multiple sensory inputs. Rather, a specific component of the
circuit must be operational: the projection from the association cortex. As shown in Figure 15.6,
deactivating this input renders SC neurons incapable of multisensory integration. Their multisen-
sory responses now approximate those elicited by the most effective modality-specific component
stimulus, a result that is paralleled at the level of overt behavior (Alvarado et al. 2007a; Jiang and
Stein 2003; Jiang et al. 2001, 2002, 2006; Stein and Meredith 1993a; Stein et al. 2002; Wallace and
Stein 1994, 1997).
In the cat, this association cortex comprises the anterior ectosylvian sulcus (AES) and an adjacent area, the rostral aspect of the lateral suprasylvian sulcus (rLS). Homologues in other species have not yet been identified. These two areas appear to be unique in this context (Burnett et al. 2004;
Jiang et al. 2003, 2006, 2007; Wilkinson et al. 1996). Thus, when one of them is damaged during
early life, the other can take on its role, but when both are damaged, no other cortical areas seem
capable of substituting for them. In the normal animal, they generally function together in medi-
ating SC multisensory integration, but the AES is the more important of the two, as many more
neurons in the SC are dependent on AES influences than on rLS influences for this capability (Jiang
et al. 2001).
The intense experimental scrutiny on the influences of AES over SC multisensory integration
has helped us understand the nature of these descending influences. First, their projections to the
SC are derived from unisensory neurons; second, they converge from different subregions of the
AES (visual, AEV; auditory, FAES; and somatosensory, SIV) onto a given SC neuron in a pattern
that matches the convergence pattern from non-AES input sources (Fuentes-Santamaria et al. 2008;
Wallace et al. 1992). For example, an individual multisensory SC neuron that receives converging
visual input from the retina and auditory input from the inferior colliculus, will also likely receive
convergent input from AEV and FAES.

FIGURE 15.7  (See color insert.) SC neurons receive converging input from different sensory subregions of anterior ectosylvian (association) cortex. Fluorescent tracers were deposited in auditory (FAES; green) and somatosensory (SIV; red) subregions. Axons of these cortical neurons often had boutons in contact with SC neurons, and sometimes could be seen converging onto the same target neurons. Presumptive contact points are indicated by arrows. (From Fuentes-Santamaria, V. et al., Cerebral Cortex, 18, 1640–52, 2008. With permission.)

Rowland et al. (2007b) used these convergence patterns as the basis for an explanatory model in which AES and non-AES inputs are distributed differently across the dendrites of their SC target neurons (Figure 15.7). The model assumes N-methyl-d-aspartate (NMDA) and 2-amino-3-(5-methyl-3-oxo-1,2-oxazol-4-yl)propanoic acid (AMPA) receptors at every dendritic region, which makes nonlinear interactions possible between inputs that cluster in the same region. These clustering inputs are selectively those from
AES, and are preferentially on proximal dendrites. The currents they introduce affect one another,
and produce a nonlinear amplification through the NMDA receptors, something that the inputs
from non-AES areas cannot do because they are more computationally segregated from one
another. All inputs also contact a population of inhibitory interneurons, and these also contact
SC multisensory neurons, so that the output of the SC neuron depends on the relative balance of
excitatory drive from the direct projections and the shunting inhibition via the inhibitory
interneurons.
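A toy numerical caricature of this arrangement is sketched below: clustered AES-derived inputs interact through a voltage-dependent, NMDA-like gain, spatially segregated non-AES inputs sum roughly linearly, and all inputs drive a pool of inhibitory interneurons. The functional forms and parameter values are placeholders chosen for illustration; they are not the equations or parameters of Rowland et al. (2007b).

```python
import numpy as np

def nmda_gain(v):
    """Toy voltage-dependent amplification standing in for NMDA-receptor unblock."""
    return 1.0 / (1.0 + np.exp(-(v - 0.5) / 0.15))

def sc_output(aes_inputs, non_aes_inputs, w_inh=0.4):
    """Caricature of a multisensory SC neuron's response (arbitrary units)."""
    clustered = sum(aes_inputs)                                # AES inputs share a proximal dendritic region
    amplified = clustered * (1.0 + 2.0 * nmda_gain(clustered)) # co-active clustered inputs amplify nonlinearly
    segregated = sum(non_aes_inputs)                           # non-AES inputs are more isolated: roughly linear sum
    inhibition = w_inh * (clustered + segregated)              # interneuron drive, simplified here as subtractive
    return max(0.0, amplified + segregated - inhibition)

# Unisensory versus cross-modal drive (input strengths are arbitrary)
uni_v = sc_output(aes_inputs=[0.35, 0.0], non_aes_inputs=[0.3, 0.0])
uni_a = sc_output(aes_inputs=[0.0, 0.35], non_aes_inputs=[0.0, 0.3])
multi = sc_output(aes_inputs=[0.35, 0.35], non_aes_inputs=[0.3, 0.3])
print(uni_v, uni_a, multi, multi > uni_v + uni_a)  # True: combined response exceeds the sum (superadditive)

# "Deactivating" AES removes the clustered, nonlinearly interacting inputs
multi_no_aes = sc_output(aes_inputs=[0.0, 0.0], non_aes_inputs=[0.3, 0.3])
print(multi_no_aes)  # remaining inputs combine linearly; the superadditive boost disappears
```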

15.4  ONTOGENY OF SC MULTISENSORY INTEGRATION


The multisensory properties of SC neurons described above are not characteristic of the neonate.
This is evident from studies of the cat SC. The cat is an excellent model for exploring the ontogeny
of sensory information processing because it is an altricial species, so that a good deal of its devel-
opment is observable after birth. At this time, its eyelids are still fused and its ear canals have not
yet opened. Most SC neurons are unresponsive to sensory stimuli at this time, and the few that do
respond to external stimulation are activated by tactile stimuli, often on the perioral region. This is
a condition that is already evident in late fetal stages (Stein et al. 1973) and has been thought to help
prepare the infant for finding the nipple and suckling (Larson and Stein 1984). The first neurons
that respond to auditory stimulation are encountered at approximately 5 days postnatal, but neurons
responsive to visual stimuli in the multisensory (i.e., deep) layers are not evident until approxi-
mately 3 weeks postnatal, long after their overlying superficial layer counterparts have been active
(Kao et al. 1994; Stein et al. 1973, 1984; Wallace and Stein 1997).
Just as the appearance of multisensory neurons is delayed relative to their unisensory counter-
parts, so is the maturation of their most characteristic property, multisensory integration. This may
be because they, compared with their unisensory neighbors, have to accommodate a more complex
task: determining which signals from different senses should be coupled, and which should be
segregated.
The first multisensory neurons that appear are those responsive to somatosensory and auditory
stimuli. They become active at about postnatal day 10, several days after auditory responsiveness
appears. Visual–auditory, visual–somatosensory, and trisensory neurons become active at about
3 weeks, as soon as deep-layer visual responsiveness is evident. Yet, the capacity to integrate a
neuron’s multiple sensory inputs does not appear until approximately 5 weeks of age, and at this
time, very few neurons are capable of this feat (Figure 15.8a). During this time, the characteristic
response properties of these neurons change dramatically, exhibiting substantially reduced recep-
tive fields and decreased response latencies (Figure 15.8b and c). Achieving the normal complement
of multisensory neurons capable of multisensory integration requires months of development, a
period of maturation during which inputs from the association cortex also become functional (Stein
and Gallagher 1981b; Stein et al. 2002; Wallace and Stein 1997, 2000).
The observation that this ontogenetic process is so gradual was taken to suggest that this period
is one in which experience plays a substantial role in guiding the maturation of multisensory inte-
gration. One possibility considered was that the brain is learning to expect that certain physical
properties of cues from different senses are linked to common events, specifically their timing and
location. This would provide the brain with a way of crafting the principles that govern multisensory
integration to adapt to the environment in which it will be used. To examine this possibility, ani-
mals were reared without the opportunity to obtain experience with visual and nonvisual cues (i.e.,
in darkness), and also in situations in which the spatial cues associated with common events were perturbed. The first experimental condition tests the notion that in the absence of such experience, multisensory integration would not develop, and the second tests the possibility that the specific features of experience guide the formation of the principles governing multisensory integration.

FIGURE 15.8  Developmental chronology of SC multisensory neurons. (a) Percentage of multisensory neurons as a proportion of sensory-responsive neurons in deep SC is shown as a function of postnatal age. Each closed circle represents a single age, and increasing proportion of such neurons is also shown on pie charts. (b) Rapid decrease in size of different receptive fields (as a percentage of mean adult value) of multisensory neurons is shown as a function of postnatal age. (c) Decrease in response latencies of multisensory neurons to each modality-specific stimulus is shown as a function of postnatal age. (From Wallace, M.T., and Stein, B.E., Journal of Neuroscience, 17, 2429–44, 1997. With permission.)

15.4.1  Impact of Developing in Absence of Visual–Nonvisual Experience


In this experimental series, animals were reared in darkness until they were 6 months of age, a time
at which most of the physiological properties of SC neurons appear mature, or near-mature. These
animals developed a near-normal set of visual, auditory, and somatosensory neurons that were
highly responsive to natural physiological stimuli (Wallace et al. 2001, 2004a). That these neurons
were atypical, however, was indicated by their abnormally large receptive fields, receptive fields that
were more characteristic of a neonate than of an adult animal. These neurons were also unable to integrate their multiple sensory inputs as evidenced by the absence of visual–auditory integration (Figure 15.9a). This too made them appear more like neonatal animals, or like adults whose association cortex has been removed, than like normal adults (Jiang et al. 2006). These observations are consistent with the idea that experience with cross-modal cues is necessary for integrating those cues.

FIGURE 15.9  Early experience influences receptive field and response properties of SC multisensory neurons. Impact of dark rearing (a) and disparity rearing (b) on properties of adult multisensory neurons is shown using two exemplar neurons. Rearing in absence of visual experience was characterized by large visual and auditory receptive fields (a) that were more characteristic of neonates than adults. This neuron was typical of population of neurons from dark-reared animals. It was responsive to visual and auditory stimuli, but its inexperience with visual–auditory stimuli was evident in its lack of ability to integrate those cross-modal stimuli to produce an enhanced response. Responses from neuron depicted in panel (b) were characteristic of those affected by a rearing environment in which visual and auditory stimuli were always spatially disparate. Its visual and auditory receptive fields did not develop normal spatial register, but were completely out of alignment. It was also incapable of “normal” multisensory integration as indicated by absence of enhanced responses to spatiotemporally aligned cross-modal stimuli (B1 and B2). Nevertheless, it did show multisensory enhancement to spatially disparate stimuli (B3), revealing that its multisensory integrative properties had been crafted to adapt them to presumptive environment in which they would be used. (Adapted from Wallace, M.T. et al., Journal of Neuroscience, 24, 9580–4, 2004a; Wallace, M.T. et al., Proceedings of the National Academy of Sciences of the United States of America, 101, 2167–72, 2004b; Wallace, M.T., and Stein, B.E., Journal of Neurophysiology, 97, 921–6, 2007.)

15.4.2  Altering Early Experience with Cross-Modal Cues by Changing Their Spatial Relationships


If early experience does indeed craft the principles governing multisensory integration, changes in
those experiences should produce corresponding changes in those principles. Under normal circum-
stances, cross-modal events provide cues that have a high degree of spatial and temporal fidelity.
In short, the different sensory cues come from the same event, so they come from about the same
place at about the same time. Presumably, with extensive experience, the brain links stimuli from
the two senses by their temporal and spatial relationships. In that way, similar concordances among
cross-modal stimuli that are later encountered facilitate the detection, localization, and identifica-
tion of those initiating events.
Given those assumptions, any experimental changes in the physical relationships of the cross-
modal stimuli that are experienced during early life should be reflected in adaptations in the prin-
ciples governing multisensory integration. In short, they should be appropriate for that “atypical”
environment and inappropriate for the normal environment. To examine this expectation, a group
of cats was reared in a darkroom from birth to 6 months of age, and were periodically presented
with visual and auditory cues that were simultaneous, but derived from different locations in space
(Wallace and Stein 2007). This was accomplished by fixing speakers and light-emitting diodes to
different locations on the wall of the cages.
When SC neurons were then examined, many had developed visual–auditory responsiveness.
Most of them looked similar to those found in animals reared in the dark. They had very large recep-
tive fields, and were unable to integrate their visual–auditory inputs. The retention of these neonatal
properties was not surprising in light of the fact that these stimuli presented in an otherwise dark
room required no response, and were not associated with any consequence. However, there were a
substantial number of SC neurons in these animals that did appear to reflect their visual–auditory
experience. Their visual–auditory receptive fields had contracted as would be expected with sen-
sory experience, but they had also developed poor alignment. A number of them had no overlap
between them (see Figure 15.9b), a relationship almost never seen in animals reared in illuminated
conditions or in animals reared in the dark. However, it did reflect their unique rearing condition.
Most significant in the present context is that these neurons could engage in multisensory integration. However, only when the cross-modal stimuli were disparate in space could they fall simultaneously within the neuron's respective visual and auditory receptive fields. In this case, the magnitude of the
response to the cross-modal stimulus was significantly enhanced, just as in normally reared animals
when presented with spatially aligned visual–auditory stimuli. Conversely, spatially coincident cross-modal stimulus configurations failed to fall within the neuron's corresponding receptive fields, and the result was response depression or no integration (see Kadunce et al.
2001; Meredith and Stein 1996). These observations are consistent with the prediction above, and
reveal that early experience with the simple temporal coincidence of the two cross-modal stimuli
was sufficient for the brain to link them, and initiate multisensory integration.
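The logic of the preceding paragraphs, in which enhancement requires that each stimulus fall within its own (possibly displaced) receptive field, can be summarized in a few lines. The sketch below treats receptive fields as simple one-dimensional azimuth intervals; the centers, widths, and stimulus positions are invented for illustration.

```python
def falls_within(stimulus_azimuth, rf_center, rf_halfwidth):
    """True if a stimulus lies inside a receptive field modeled as an azimuth interval."""
    return abs(stimulus_azimuth - rf_center) <= rf_halfwidth

def enhancement_expected(vis_az, aud_az, vis_rf, aud_rf):
    """Enhancement is predicted only when each stimulus falls within its receptive field."""
    return falls_within(vis_az, *vis_rf) and falls_within(aud_az, *aud_rf)

# Normally reared neuron: visual and auditory fields overlap (center, halfwidth in deg azimuth)
normal_vis_rf, normal_aud_rf = (10, 20), (12, 25)
# Disparity-reared neuron: fields developed out of register
disparity_vis_rf, disparity_aud_rf = (10, 15), (55, 20)

print(enhancement_expected(10, 10, normal_vis_rf, normal_aud_rf))        # coincident stimuli -> True
print(enhancement_expected(10, 10, disparity_vis_rf, disparity_aud_rf))  # coincident stimuli -> False
print(enhancement_expected(10, 55, disparity_vis_rf, disparity_aud_rf))  # disparate stimuli -> True
```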

15.4.3  Role of Cortical Inputs during Maturation


The data from the above experiments did not reveal where in the multisensory SC circuitry these
early sensory experiences were exerting their greatest effects. Nevertheless, the fact that the cortex
is known to be highly dependent on early experience for its development made it a prime candidate

for this role. To test this idea, Rowland and colleagues (Stein and Rowland 2007) reversibly deacti-
vated both AES and rLS during the period (25–81 days postnatal) in which multisensory integration
normally develops (see Wallace and Stein 1997), so that their neurons were unable to participate in
these sensory experiences. This was accomplished by implanting a drug-infused polymer over these
cortical areas. The polymer would gradually release its store of muscimol, a gamma-aminobutyric
acid A (GABAa) receptor agonist that blocked neuronal activity. Once the stores of muscimol were
depleted over many weeks, or the polymer was physically removed, these cortical areas would once
again become active and responsive to external stimulation. As predicted, SC neurons in these ani-
mals were unable to integrate their visual and auditory inputs to enhance their responses. Rather,
their responses were no greater to the cross-modal combination of stimuli than they were to the
most effective of its component stimuli. Furthermore, comparable deficits were apparent in overt
behavior. Animals were no better at localizing a cross-modal stimulus than they were at local-
izing the most effective of its individual component stimuli. Although these data do not prove the
point, they do suggest that the cortical component of the SC multisensory circuit is a critical site
for incorporating the early sensory experiences required for the development of SC multisensory
integration.

15.4.4  Ontogeny of Multisensory Integration in Cortex


The development of the cortex is believed to lag the development of the midbrain, and this principle
would be expected to extend to the maturation of sensory response properties. Consequently, the
inability of SC neurons in the neonatal cat brain to exhibit multisensory integration before 4 post-
natal weeks suggests that the property would develop even later in the cortex. To evaluate this issue,
multisensory neurons were studied in the developing AES. Although, as discussed above, neurons
from the AES that project to the SC are unisensory, there are multisensory neurons scattered along
the AES and concentrated at the borders between its three largely modality-specific zones. The
visual–auditory neurons in this “SC independent” multisensory group were the target of this study.
They, like their counterparts in the SC, share many fundamental characteristics of an integrated
response, such as response enhancement and depression (Wallace et al. 1992), and significant altera-
tions in their temporal response profile (Royal et al. 2009). Thus, neurons in the AES can serve as a good
maturational referent for the SC.
As predicted, multisensory neurons in the neonatal AES were unable to integrate their visual
and auditory inputs. They too developed their capacity for multisensory integration only gradually,
and did so within a time window that began and ended later in ontogeny than does the time window
for SC neurons (Wallace et al. 2006). The data not only support the contention that cortical sensory
processes lag those of the midbrain during development, but also raise the possibility that, just as
in the SC, experience with visual and auditory stimuli in cross-modal configurations is required
for the maturation of multisensory integration. The likelihood of this possibility was strengthened
using the same rearing strategy as discussed earlier. Animals were raised in the dark to preclude
visual–nonvisual experience. As a result, AES neurons failed to develop the capacity to integrate
their visual and auditory inputs. Once again, this rearing condition did not impair the development
of visually-responsive, auditory-responsive, and even visual–auditory neurons. They were common.
The rearing condition simply impaired AES multisensory neurons from developing an ability to use
these inputs synergistically (Carriere et al. 2007).

15.4.5  Ontogeny of SC Multisensory Integration in a Primate


The multisensory properties of SC neurons discussed above are not unique to the cat. Although
their incidence is somewhat lower, multisensory neurons in the rhesus monkey SC have proper-
ties very similar to those described above (Wallace et al. 1996). They have multiple, overlapping
receptive fields and show multisensory enhancement and multisensory depression, respectively, to

spatially aligned and spatially disparate cross-modal stimuli. Although there may seem to be no a
priori reason to assume that their maturation would depend on different factors than those of the
cat, the monkey, unlike the cat, is a precocial species. Its SC neurons have comparatively more time
to develop in utero than do those of the cat. Of course, they also have to do so in the dark, making
one wonder if the late in utero visual-free experiences of the monkey have some similarity to the
visual-free environment of the dark-reared cat.
Wallace and Stein (2001) examined the multisensory properties of the newborn monkey SC and
found that, unlike the SC of the newborn cat, there were already multisensory neurons present (Wallace and Stein 2001; Figure 15.10). However, as in the cat SC, these multisensory neurons were unable to inte-
grate visual–nonvisual inputs. Their responses to combinations of coincident visual and auditory or
somatosensory cues were no better than were their responses to the most effective of these component
stimuli individually. Although there are no data regarding when they develop this capacity, and whether
dark-rearing would preclude its appearance, it seems highly likely that the monkey shares the same
developmental antecedents for the maturation of multisensory integration as the cat.
Recent reports in humans suggest that this may be a general mammalian plan. People who have
experienced early visual deprivation due to dense congenital cataracts were examined many years
after surgery to remove those cataracts. The observations are consistent with predictions that would
be made from the animal studies. Specifically, their vision appeared to be normal, but their ability
to integrate visual–nonvisual information was significantly less well developed than in normal sub-
jects. This ability was compromised in a variety of tasks including those that involved speech and
those that did not (Putzar et al. 2007).
Whether neurons in the human SC, like those in the SC of cat and monkey, are incapable of
multisensory integration is not yet known. However, human infants do poorly on tasks requiring
the integration of visual and auditory information to localize events before 8 months of age (Neil

[Figure 15.10 pie charts: in the newborn monkey SC, multisensory neurons made up 14.7% and modality-specific neurons 85.3% of the recorded sensory-responsive sample, whereas in the adult (inset) multisensory neurons made up 28% and modality-specific neurons 72%.]
FIGURE 15.10  Modality convergence patterns in SC of newborn and adult (inset) monkey. Pie charts show
distributions of all recorded sensory-responsive neurons in multisensory laminas (IV–VII) of SC. (From
Wallace, M.T., and Stein, B.E., Journal of Neuroscience, 21, 8886–94, 2001. With permission.)

et al. 2006), and do poorly on tasks requiring the integration of visual and haptic information before
8 years of age (Gori et al. 2008). These data indicate that multisensory capabilities develop over
far longer periods in the human brain than in the cat brain, an observation consistent with the long
period of postnatal life devoted to human brain maturation. These observations, coupled with those indicating that early sensory deprivation has a negative effect on multisensory integration even far later in life, suggest that early experience with cross-modal cues is essential for normal multisensory development in all higher-order species. If so, we can only wonder how well the human brain
can adapt its multisensory capabilities to the introduction of visual or auditory input later in life
via prosthetic devices. Many people who had congenital hearing impairments, and later received
cochlear implants, have shown remarkable accommodation to them. They learn to use their newly
found auditory capabilities with far greater precision than one might have imagined when such devices were first introduced. Nevertheless, it is not yet known whether they can use them in concert with other
sensory systems. Although the population of people with retinal implants is much smaller, there are
very encouraging reports among them as well. However, the same questions apply: Can these patients acquire the ability to engage in some forms of multisensory integration after experience with visual–auditory cues later in life and, if so, how much experience, and what kinds, are necessary for them to develop this capability? These issues remain to be resolved.

ACKNOWLEDGMENTS
The research described here was supported in part by NIH grants NS36916 and EY016716.

REFERENCES
Alvarado, J.C., T.R. Stanford, J.W. Vaughan, and B.E. Stein. 2007a. Cortex mediates multisensory but not
unisensory integration in superior colliculus. Journal of Neuroscience 27:12775–86.
Alvarado, J.C., J.W. Vaughan, T.R. Stanford, and B.E. Stein. 2007b. Multisensory versus unisensory integra-
tion: Contrasting modes in the superior colliculus. Journal of Neurophysiology 97:3193–205.
Barth, D.S., and B. Brett-Green. 2004. Multisensory-Evoked Potentials in Rat Cortex. In The handbook of mul-
tisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 357–70. Cambridge, MA: MIT Press.
Bernstein, L.E., E.T. Auer Jr., and J.K. Moore. 2004. Audiovisual Speech Binding: Convergence or Association. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein,
203–23. Cambridge, MA: MIT Press.
Burnett, L.R., B.E. Stein, D. Chaponis, and M.T. Wallace. 2004. Superior colliculus lesions preferentially dis-
rupt multisensory orientation. Neuroscience 124:535–47.
Burnett, L.R., B.E. Stein, T.J. Perrault Jr., and M.T. Wallace. 2007. Excitotoxic lesions of the superior colliculus
preferentially impact multisensory neurons and multisensory integration. Experimental Brain Research
179:325–38.
Busse, L., K.C. Roberts, R.E. Crist, D.H. Weissman, and M.G. Woldorff. 2005. The spread of attention across
modalities and space in a multisensory object. Proceedings of the National Academy of Sciences of the
United States of America 102:18751–6.
Calvert, G., C. Spence, and B.E. Stein. 2004a. The handbook of multisensory processes. Cambridge, MA: MIT
Press.
Calvert, G.A., and J.W. Lewis. 2004b. Hemodynamic Studies of Audiovisual Interactions. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 483–502. Cambridge, MA: MIT Press.
Carriere, B.N., D.W. Royal, T.J. Perrault et al. 2007. Visual deprivation alters the development of cortical mul-
tisensory integration. Journal of Neurophysiology 98:2858–67.
Corneil, B.D., and D.P. Munoz. 1996. The influence of auditory and visual distractors on human orienting gaze
shifts. Journal of Neuroscience 16:8193–207.
DeGelder, B., J. Vroomen, and G. Pourtois. 2004. Multisensory Perception of Emotion, Its Time Course, and
Its Neural Basis. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein,
581–96. Cambridge, MA: MIT Press.
Edwards, S.B., C.L. Ginsburgh, C.K. Henkel, and B.E. Stein. 1979. Sources of subcortical projections to the
superior colliculus in the cat. Journal of Comparative Neurology 184:309–29.
Ernst, M.O., and M.S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415:429–33.
Fort, A., and M.-H. Giard. 2004. Multiple Electrophysiological Mechanisms of Audiovisual Integration in
Human Perception. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E.
Stein, 503–13. Cambridge, MA: MIT Press.
Frens, M.A., and A.J. Van Opstal. 1995a. A quantitative study of auditory-evoked saccadic eye movements in
two dimensions. Experimental Brain Research 107:103–17.
Frens, M.A., A.J. Van Opstal, and R.F. Van der Willigen. 1995b. Spatial and temporal factors determine auditory-
visual interactions in human saccadic eye movements. Perception & Psychophysics 57:802–16.
Fuentes-Santamaria, V., J.C. Alvarado, B.E. Stein, and J.G. McHaffie. 2008. Cortex contacts both output
neurons and nitrergic interneurons in the superior colliculus: Direct and indirect routes for multisensory
integration. Cerebral Cortex 18:1640–52.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences
10:278–285.
Ghazanfar, A.A., J.X. Maier, K.L. Hoffman, and N.K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience 25:5004–12.
Gingras, G., B.A. Rowland, and B.E. Stein. 2009. The differing impact of multisensory and unisensory integra-
tion on behavior. Journal of Neuroscience 29:4897–902.
Gondan, M., B. Niederhaus, F. Rosler, and B. Roder. 2005. Multisensory processing in the redundant-target
effect: A behavioral and event-related potential study. Perception & Psychophysics 67:713–26.
Gori, M., M. Del Viva, G. Sandini, and D.C. Burr. 2008. Young children do not integrate visual and haptic form
information. Current Biology 18:694–8.
Grant, A.C., M.C. Thiagarajah, and K. Sathian. 2000. Tactile perception in blind Braille readers: A psy-
chophysical study of acuity and hyperacuity using gratings and dot patterns. Perception & Psychophysics
62:301–12.
Grantyn, A., and R. Grantyn. 1982. Axonal patterns and sites of termination of cat superior colliculus neurons
projecting in the tecto-bulbo-spinal tract. Experimental Brain Research 46:243–56.
Groh, J.M., and D.L. Sparks. 1996a. Saccades to somatosensory targets: II. Motor convergence in primate
superior colliculus. Journal of Neurophysiology 75:428–38.
Groh, J.M., and D.L. Sparks. 1996b. Saccades to somatosensory targets: III. Eye-position-dependent soma-
tosensory activity in primate superior colliculus. Journal of Neurophysiology 75:439–53.
Guitton, D., and D.P. Munoz. 1991. Control of orienting gaze shifts by the tectoreticulospinal system in the
head-free cat: I. Identification, localization, and effects of behavior on sensory responses. Journal of
Neurophysiology 66:1605–23.
Gutfreund, Y., and E.I. Knudsen. 2004. Visual Instruction of the Auditory Space Map in the Midbrain. In The handbook
of multisensory processes, ed. G.A. Calvert, C. Spence and B.E. Stein, 613–24. Cambridge, MA: MIT Press.
Harris, L.R. 1980. The superior colliculus and movements of the head and eyes in cats. Journal of Physiology
300:367–91.
Huerta, M.F., and J.K. Harting. 1984. The mammalian superior colliculus: Studies of its morphology and con-
nections. In Comparative neurology of the optic tectum, ed. H. Vanegas, 687–773. New York: Plenum
Publishing Corporation.
Hughes, H.C., P.A. Reuter-Lorenz, G. Nozawa, and R. Fendrich. 1994. Visual–auditory interactions in senso-
rimotor processing: Saccades versus manual responses. Journal of Experimental Psychology. Human
Perception and Performance 20:131–53.
Jay, M.F., and D.L. Sparks. 1984. Auditory receptive fields in primate superior colliculus shift with changes in
eye position. Nature 309:345–7.
Jay, M.F., and D.L. Sparks. 1987a. Sensorimotor integration in the primate superior colliculus: I. Motor conver-
gence. Journal of Neurophysiology 57:22–34.
Jay, M.F., and D.L. Sparks. 1987b. Sensorimotor integration in the primate superior colliculus: II. Coordinates
of auditory signals. Journal of Neurophysiology 57:35–55.
Jiang, W., and B.E. Stein. 2003. Cortex controls multisensory depression in superior colliculus. Journal of
Neurophysiology 90:2123–35.
Jiang, W., M.T. Wallace, H. Jiang, J.W. Vaughan, and B.E. Stein. 2001. Two cortical areas mediate multisensory
integration in superior colliculus neurons. Journal of Neurophysiology 85:506–22.
Jiang, W., H. Jiang, and B.E. Stein. 2002. Two corticotectal areas facilitate multisensory orientation behavior.
Journal of Cognitive Neuroscience 14:1240–55.
Jiang, H., B.E. Stein, and J.G. McHaffie. 2003. Opposing basal ganglia processes shape midbrain visuomotor
activity bilaterally. Nature 423:982–6.
Jiang, W., H. Jiang, B.A. Rowland, and B.E. Stein. 2007. Multisensory orientation behavior is disrupted by
neonatal cortical ablation. Journal of Neurophysiology 97:557–62.
Jiang, W., H. Jiang, and B.E. Stein. 2006. Neonatal cortical ablation disrupts multisensory development in
superior colliculus. Journal of Neurophysiology 95:1380–96.
Kadunce, D.C., J.W. Vaughan, M.T. Wallace, and B.E. Stein. 2001. The influence of visual and auditory
receptive field organization on multisensory integration in the superior colliculus. Experimental Brain
Research 139:303–10.
Kao, C.Q., B.E. Stein, and D.A. Coulter. 1994. Postnatal development of excitatory synaptic function in deep
layers of SC. Society of Neuroscience Abstracts.
King, A.J., T.P. Doubell, and I. Skaliora. 2004. Epigenetic factors that align visual and auditory maps in the
ferret midbrain. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein,
599–612. Cambridge, MA: MIT Press.
King, A.J., and A.R. Palmer. 1985. Integration of visual and auditory information in bimodal neurones in the
guinea-pig superior colliculus. Experimental Brain Research. 60:492–500.
King, A.J., J.W. Schnupp, S. Carlile, A.L. Smith, and I.D. Thompson. 1996. The development of topographi-
cally-aligned maps of visual and auditory space in the superior colliculus. Progress in Brain Research
112:335–50.
Lakatos, P., C.M. Chen, M.N. O’Connell, A. Mills, and C.E. Schroeder. 2007. Neuronal oscillations and multi-
sensory interaction in primary auditory cortex. Neuron 53:279–92.
Larson, M.A., and B.E. Stein. 1984. The use of tactile and olfactory cues in neonatal orientation and localiza-
tion of the nipple. Developmental Psychobiology 17:423–36.
Laurienti, P.J., J.H. Burdette, M.T. Wallace et al. 2002. Deactivation of sensory-specific cortex by cross-modal
stimuli. Journal of Cognitive Neuroscience 14:420–9.
Leo, F., N. Bolognini, C. Passamonti, B.E. Stein, and E. Ladavas. 2008. Cross-modal localization in hemiano-
pia: New insights on multisensory integration. Brain 131: 855–65.
Liotti, M., K. Ryder, and M.G. Woldorff. 1998. Auditory attention in the congenitally blind: Where, when and
what gets reorganized? Neuroreport 9:1007–12.
Lippert, M., N.K. Logothetis, and C. Kayser. 2007. Improvement of visual contrast detection by a simultaneous
sound. Brain Research 1173:102–9.
Lovelace, C.T., B.E. Stein, and M.T. Wallace. 2003. An irrelevant light enhances auditory detection in humans:
A psychophysical analysis of multisensory integration in stimulus detection. Cognitive Brain Research
17:447–453.
Macaluso, E., and J. Driver. 2004. Functional imaging evidence for multisensory spatial representations and
cross-modal attentional interactions in the human brain. In The handbook of multisensory processes, ed.
G.A. Calvert, C. Spence, and B.E. Stein, 529–48. Cambridge, MA: MIT Press.
Marks, L.E. 2004. Cross-modal interactions in speeded classification. In The handbook of multisensory pro-
cesses, ed. G.A. Calvert, C. Spence, and B.E. Stein, 85–106. Cambridge, MA: MIT Press.
Massaro, D.W. 2004. From multisensory integration to talking heads and language learning. In The handbook of
multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 153–76. Cambridge, MA: MIT Press.
Meredith, M.A., and B.E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221:389–91.
Meredith, M.A., and B.E. Stein. 1986a. Spatial factors determine the activity of multisensory neurons in cat
superior colliculus. Brain Research 365:350–4.
Meredith, M.A., and B.E. Stein. 1986b. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. Journal of Neurophysiology 56:640–62.
Meredith, M.A., and B.E. Stein. 1990. The visuotopic component of the multisensory map in the deep laminae
of the cat superior colliculus. Journal of Neuroscience 10:3727–42.
Meredith, M.A., and B.E. Stein. 1996. Spatial determinants of multisensory integration in cat superior collicu-
lus neurons. Journal of Neurophysiology 75:1843–57.
Meredith, M.A., H.R. Clemo, and B.E. Stein. 1991. Somatotopic component of the multisensory map in the
deep laminae of the cat superior colliculus. Journal of Comparative Neurology 312:353–70.
Meredith, M.A., M.T. Wallace, and B.E. Stein. 1992. Visual, auditory and somatosensory convergence in output
neurons of the cat superior colliculus: Multisensory properties of the tecto-reticulo-spinal projection.
Experimental Brain Research 88:181–6.
Middlebrooks, J.C., and E.I. Knudsen. 1984. A neural code for auditory space in the cat’s superior colliculus.
Journal of Neuroscience 4:2621–34.
Miller, J. 1982. Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology
14:247–79.
Morgan, M.L., G.C. Deangelis, and D.E. Angelaki. 2008. Multisensory integration in macaque visual cortex
depends on cue reliability. Neuron 59:662–73.
Moschovakis, A.K., and A.B. Karabelas. 1985. Observations on the somatodendritic morphology and axonal
trajectory of intracellularly HRP-labeled efferent neurons located in the deeper layers of the superior col-
liculus of the cat. Journal of Comparative Neurology 239:276–308.
Munoz, D.P., and R.H. Wurtz. 1993a. Fixation cells in monkey superior colliculus. I. Characteristics of cell
discharge. Journal of Neurophysiology 70:559–75.
Munoz, D.P., and R.H. Wurtz. 1993b. Fixation cells in monkey superior colliculus: II. Reversible activation and
deactivation. Journal of Neurophysiology 70:576–89.
Neil, P.A., C. Chee-Ruiter, C. Scheier, D.J. Lewkowicz, and S. Shimojo. 2006. Development of multisensory
spatial integration and perception in humans. Developmental Science 9:454–64.
Newell, F.N. 2004. Cross-modal object recognition. In The handbook of multisensory processes, ed. G.A.
Calvert, C. Spence, and B.E. Stein, 123–39: Cambridge, MA: MIT Press.
Partan, S.R. 2004. Multisensory animal communication. In The handbook of multisensory processes, ed. G.A.
Calvert, C. Spence, and B.E. Stein, 225–40. Cambridge, MA: MIT Press.
Peck, C.K. 1987a. Saccade-related burst neurons in cat superior colliculus. Brain Research 408:329–33.
Peck, C.K. 1987b. Visual–auditory interactions in cat superior colliculus: Their role in the control of gaze.
Brain Research 420:162–6.
Perrault Jr., T.J., J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2003. Neuron-specific response characteristics
predict the magnitude of multisensory integration. Journal of Neurophysiology 90:4022–6.
Perrault Jr., T.J., J.W. Vaughan, B.E. Stein, and M.T. Wallace. 2005. Superior colliculus neurons use distinct
operational modes in the integration of multisensory stimuli. Journal of Neurophysiology 93:2575–86.
Putzar, L., I. Goerendt, K. Lange, F. Rosler, and B. Roder. 2007. Early visual deprivation impairs multisensory
interactions in humans. Nature Neuroscience 10:1243–5.
Recanzone, G.H. 1998. Rapidly induced auditory plasticity: The ventriloquism aftereffect. Proceedings of the
National Academy of Sciences of the United States of America 95:869–75.
Romanski, L.M. 2007. Representation and integration of auditory and visual stimuli in the primate ventral
lateral prefrontal cortex. Cerebral Cortex 17(Suppl 1):i61–9.
Rowland, B.A., and B.E. Stein. 2007. Multisensory integration produces an initial response enhancement.
Frontiers in Integrative Neuroscience 1:4.
Rowland, B.A., and B.E. Stein. 2008. Temporal profiles of response enhancement in multisensory integration.
Frontiers in Neuroscience 2:218–24.
Rowland, B.A., S. Quessy, T.R. Stanford, and B.E. Stein. 2007a. Multisensory integration shortens physiologi-
cal response latencies. Journal of Neuroscience 27:5879–84.
Rowland, B.A., T.R. Stanford, and B.E. Stein. 2007b. A model of the neural mechanisms underlying multisen-
sory integration in the superior colliculus. Perception 36:1431–43.
Royal, D.W., B.N. Carriere, and M.T. Wallace. 2009. Spatiotemporal architecture of cortical receptive fields
and its impact on multisensory interactions. Experimental Brain Research 198:127–36.
Sathian, K. 2000. Practice makes perfect: Sharper tactile perception in the blind. Neurology 54:2203–4.
Sathian, K. 2005. Visual cortical activity during tactile perception in the sighted and the visually deprived.
Developmental Psychobiology 46:279–86.
Sathian, K., S.C. Prather, and M. Zhang. 2004. Visual cortical involvement in normal tactile perception. In The
handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 703–9. Cambridge,
MA: MIT Press.
Schroeder, C.E., and J.J. Foxe. 2002. The timing and laminar profile of converging inputs to multisensory areas
of the macaque neocortex. Brain Research. Cognitive Brain Research 14:187–98.
Schroeder, C. E., and J.J. Foxe. 2004. Multisensory convergence in early cortical processing. In The handbook
of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 295–309. Cambridge, MA: MIT
Press.
Schroeder, C.E., R.W. Lindsley, C. Specht et al. 2001. Somatosensory input to auditory association cortex in
the macaque monkey. Journal of Neurophysiology 85:1322–7.
Senkowski, D., D. Talsma, M. Grigutsch, C.S. Herrmann, and M.G. Woldorff. 2007. Good times for multisen-
sory integration: Effects of the precision of temporal synchrony as revealed by gamma-band oscillations.
Neuropsychologia 45:561–71.
Shams, L., Y. Kamitani, and S. Shimojo. 2004. Modulations of visual perception by sound. In The handbook of
multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 27–33. Cambridge, MA: MIT Press.
Sinnett, S., S. Soto-Faraco, and C. Spence. 2008. The co-occurrence of multisensory competition and facilita-
tion. Acta Psychologica 128:153–61.
Sparks, D.L. 1986. Translation of sensory signals into commands for control of saccadic eye movements: Role
of primate superior colliculus. Physiological Reviews 66:118–71.
Sparks, D.L., and J.S. Nelson. 1987. Sensory and motor maps in the mammalian superior colliculus. Trends in
Neuroscience 10:312–7.
Spence, C., and J. Driver. 2004. Crossmodal space and crossmodal attention. Oxford: Oxford Univ. Press.
Stanford, T.R., and B.E. Stein. 2007. Superadditivity in multisensory integration: Putting the computation in
context. Neuroreport 18:787–92.
Stanford, T.R., S. Quessy, and B.E. Stein. 2005. Evaluating the operations underlying multisensory integration
in the cat superior colliculus. Journal of Neuroscience 25:6499–508.
Stein, B.E. 1984. Development of the superior colliculus. Annual Review of Neuroscience 7:95–125.
Stein, B.E., and M.O. Arigbede. 1972. Unimodal and multimodal response properties of neurons in the cat’s
superior colliculus. Experimental Neurology 36:179–96.
Stein, B.E., and H.P. Clamann. 1981. Control of pinna movements and sensorimotor register in cat superior
colliculus. Brain, Behavior and Evolution 19:180–92.
Stein, B.E., and H.L. Gallagher. 1981. Maturation of cortical control over superior colliculus cells in cat. Brain
Research 223:429–35.
Stein, B.E., and M.A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Stein, B.E., and B.A. Rowland. 2007. The critical role of cortico-collicular interactions in the development of
multisensory integration. Paper presented at the Society for Neuroscience.
Stein, B.E., E. Labos, and L. Kruger. 1973. Sequence of changes in properties of neurons of superior colliculus
of the kitten during maturation. Journal of Neurophysiology 36:667–79.
Stein, B.E., B. Magalhaes-Castro, and L. Kruger. 1976. Relationship between visual and tactile representations
in cat superior colliculus. Journal of Neurophysiology 39:401–19.
Stein, B.E., R.F. Spencer, and S.B. Edwards. 1984. Efferent projections of the neonatal cat superior colliculus:
Facial and cerebellum-related brainstem structures. Journal of Comparative Neurology 230:47–54.
Stein, B.E., M.A. Meredith, W.S. Huneycutt, and L. McDade. 1989. Behavioral indices of multisensory inte-
gration: Orientation to visual cues is affected by auditory stimuli. Journal of Cognitive Neuroscience
1:12–24.
Stein, B.E., M.A. Meredith, and M.T. Wallace. 1993. The visually responsive neuron and beyond: Multisensory
integration in cat and monkey. Progress in Brain Research 95:79–90.
Stein, B.E., M.W. Wallace, T.R. Stanford, and W. Jiang. 2002. Cortex governs multisensory integration in the
midbrain. Neuroscientist 8:306–14.
Sugihara, T., M.D. Diltz, B.B. Averbeck, and L.M. Romanski. 2006. Integration of auditory and visual communi-
cation information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience 26:11138–47.
Sumby, W.H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the
Acoustical Society of America 26:212–5.
Talsma, D., T.J. Doty, R. Strowd, and M.G. Woldorff. 2006. Attentional capacity for processing concurrent
stimuli is larger across sensory modalities than within a modality. Psychophysiology 43:541–9.
Talsma, D., T.J. Doty, and M.G. Woldorff. 2007. Selective attention and audiovisual integration: Is attending to
both modalities a prerequisite for early integration? Cerebral Cortex 17:679–90.
Wallace, M.T. 2004. The development of multisensory integration. In The handbook of multisensory processes,
ed. G.A. Calvert, C. Spence, and B.E. Stein, 625–42. Cambridge, MA: MIT Press.
Wallace, M.T., and B.E. Stein. 1994. Cross-modal synthesis in the midbrain depends on input from cortex.
Journal of Neurophysiology 71:429–32.
Wallace, M.T., and B.E. Stein. 1997. Development of multisensory neurons and multisensory integration in cat
superior colliculus. Journal of Neuroscience 17:2429–44.
Wallace, M.T., and B.E. Stein. 2000. Onset of cross-modal synthesis in the neonatal superior colliculus is gated
by the development of cortical influences. Journal of Neurophysiology 83:3578–82.
Wallace, M.T., and B.E. Stein. 2001. Sensory and multisensory responses in the newborn monkey superior col-
liculus. Journal of Neuroscience 21:8886–94.
Wallace, M.T., and B.E. Stein. 2007. Early experience determines how the senses will interact. Journal of
Neurophysiology 97:921–6.
Wallace, M.T., M.A. Meredith, and B.E. Stein. 1992. Integration of multiple sensory modalities in cat cortex.
Experimental Brain Research 91:484–8.
Wallace, M.T., M.A. Meredith, and B.E. Stein. 1993. Converging influences from visual, auditory, and
somatosensory cortices onto output neurons of the superior colliculus. Journal of Neurophysiology
69:1797–809.
Wallace, M.T., L.K. Wilkinson, and B.E. Stein. 1996. Representation and integration of multiple sensory inputs
in primate superior colliculus. Journal of Neurophysiology 76:1246–66.
Wallace, M.T., M.A. Meredith, and B.E. Stein. 1998. Multisensory integration in the superior colliculus of the
alert cat. Journal of Neurophysiology 80:1006–10.
Wallace, M.T., W.D. Hairston, and B.E. Stein. 2001. Long-term effects of dark-rearing on multisensory pro-
cessing. Paper presented at the Society for Neuroscience.
Wallace, M.T., T.J. Perrault Jr., W.D. Hairston, and B.E. Stein. 2004a. Visual experience is necessary for the
development of multisensory integration. Journal of Neuroscience 24:9580–4.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004b. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences of the United States of America 101:2167–72.
Wallace, M.T., B.N. Carriere, T.J. Perrault Jr., J.W. Vaughan, and B.E. Stein. 2006. The development of cortical
multisensory integration. Journal of Neuroscience 26:11844–9.
Weisser, V., R. Stilla, S. Peltier, X. Hu, and K. Sathian. 2005. Short-term visual deprivation alters neural pro-
cessing of tactile form. Experimental Brain Research 166:572–82.
Wilkinson, L.K., M.A. Meredith, and B.E. Stein. 1996. The role of anterior ectosylvian cortex in cross-modality
orientation and approach behavior. Experimental Brain Research 112:1–10.
Woldorff, M.G., C.J. Hazlett, H.M. Fichtenholtz et al. 2004. Functional parcellation of attentional control
regions of the brain. Journal of Cognitive Neuroscience 16:149–65.
Woods, T.M., and G.H. Recanzone. 2004a. Cross-modal interactions evidenced by the ventriloquism effect in
humans and monkeys. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E.
Stein, 35–48. Cambridge, MA: MIT Press.
Woods, T.M., and G.H. Recanzone. 2004b. Visually induced plasticity of auditory spatial perception in
macaques. Current Biology 14:1559–64.
Wurtz, R.H., and J.E. Albano. 1980. Visual–motor function of the primate superior colliculus. Annual Review
of Neuroscience 3:189–226.
Wurtz, R.H., and M.E. Goldberg. 1971. Superior colliculus cell responses related to eye movements in awake
monkeys. Science 171:82–4.
Zangaladze, A., C.M. Epstein, S.T. Grafton, and K. Sathian. 1999. Involvement of visual cortex in tactile dis-
crimination of orientation. Nature 401:587–90.
16 Effects of Prolonged
Exposure to Audiovisual
Stimuli with Fixed Stimulus
Onset Asynchrony on
Interaction Dynamics
between Primary Auditory
and Primary Visual Cortex
Antje Fillbrandt and Frank W. Ohl

CONTENTS
16.1 Introduction...........................................................................................................................302
16.1.1 Speed of Signal Transmission Is Modality-Specific.................................................. 303
16.1.2 Simultaneity Constancy............................................................................................. 303
16.1.3 Temporal Recalibration............................................................................................. 303
16.1.4 Mechanisms of Temporal Recalibration....................................................................304
16.1.4.1 Are There Any Indications for Recalibration at Early Levels of
Stimulus Processing?..................................................................................304
16.1.4.2 To What Extent Does Temporal Recalibration Need Attentional
Resources?..................................................................................................304
16.1.4.3 Is Recalibration Stimulus-Specific?............................................................ 305
16.1.4.4 Is Recalibration Modality-Specific?........................................................... 305
16.1.4.5 Does Recalibration Occur at Decision Level?............................................ 305
16.1.5 Outlook on Experiments............................................................................................ 305
16.2 Methods.................................................................................................................................306
16.2.1 Animals.....................................................................................................................306
16.2.2 Electrodes..................................................................................................................306
16.2.3 Animal Preparation and Recording...........................................................................306
16.2.4 Stimuli.......................................................................................................................306
16.2.5 Experimental Protocol...............................................................................................306
16.2.6 Data Preprocessing....................................................................................................307
16.2.7 DTF: Mathematical Definition.................................................................................. 307
16.2.8 Estimation of Autoregressive Models........................................................................ 308
16.2.9 Normalization of DTF...............................................................................................309
16.2.10 Statistical Testing.....................................................................................................309

16.3 Results.................................................................................................................................... 310


16.3.1 Stimulus-Induced Changes in Single-Trial nDTF, Averaged across All Trials
from All Sessions....................................................................................................... 310
16.3.1.1 Animals Receiving Light Followed by Tone Stimulus (VA-Animals)....... 311
16.3.1.2 Animals Receiving Tone Followed by Light Stimulus (AV-Animals)........ 312
16.3.2 Development of Amplitude of nDTFA→V and nDTFV→A within Sessions.................. 313
16.3.2.1 VA-Animals................................................................................................ 313
16.3.2.2 AV-Animals................................................................................................ 313
16.3.3 Development of the Amplitude of nDTFA→V and nDTFV→A across Sessions............ 314
16.4 Discussion.............................................................................................................................. 316
16.4.1 Interpretation of DTF-Amplitudes............................................................................ 316
16.4.2 Development of nDTF-Amplitude within Sessions................................................... 317
16.4.3 Audiovisual Stimulus Association as a Potential Cause of Observed Changes in
nDTF-Amplitudes...................................................................................................... 318
16.4.4 Changes in Lag Detection as a Potential Cause of Observed Changes in DTF-
Amplitudes................................................................................................................. 318
16.4.5 Mechanisms of Recalibration: Some Preliminary Restrictions................................ 318
16.4.5.1 Expectation and Lag Detection................................................................... 318
16.4.5.2 Processes after the Second Stimulus.......................................................... 319
16.4.5.3 Speed of Processing.................................................................................... 319
16.5 Conclusions............................................................................................................................ 319
References....................................................................................................................................... 320

Temporal congruity between auditory and visual stimuli has frequently been shown to be an impor-
tant factor in audiovisual integration, but information about temporal congruity is blurred by the
different speeds of transmission in the two sensory modalities. Compensating for the differences in
transmission times is challenging for the brain because at each step of transmission, from the pro-
duction of the signal to its arrival at higher cortical areas, the speed of transmission can be affected
in various ways. One way to deal with this complexity could be for the compensation mechanisms to remain plastic throughout the organism's lifetime so that they can flexibly adapt to the typical transmission delays of new types of stimuli. Temporal recalibration to new values of stimulus asynchronies has
been demonstrated in several behavioral studies. This study seeks to explore the potential mecha-
nisms underlying such recalibration at the cortical level. Toward this aim, tone and light stimuli
were presented repeatedly to awake, passively listening Mongolian gerbils at the same constant
lag. During stimulation, the local field potential was recorded from electrodes implanted into the
auditory and visual cortices. The interaction dynamics between the auditory and visual cortices
were examined using the directed transfer function (DTF; Kaminski and Blinowska 1991). With an
increasing number of stimulus repetitions, the amplitude of the DTF showed characteristic changes
at specific time points between and after the stimuli. Our findings support the view that repeated
presentation of audiovisual stimuli at a constant delay alters the interactions between the auditory
and visual cortices.

16.1  INTRODUCTION
Listening to a concert is also enjoyable while watching the musicians play. Under normal circum-
stances, we are not confused by seeing the drumstick movement or the lip movement of the singer
after hearing the beat and the vocals. When, in our conscious experience of the world, the senses appear united, this seems to imply that stimulus processing in the different modalities must have reached consciousness at about the same time.
Apparently, the task of judging which stimuli appeared simultaneously is quite challenging for the brain: during the past decade, an increasing number of studies have been published indicating that temporal perception remains plastic throughout the lifetime. These studies demonstrated that when stimuli from different sensory modalities are presented repeatedly at a small, constant temporal onset asynchrony, their temporal disparity is, after a while, perceived as diminished in conscious experience. This chapter describes the electrophysiological results of interaction
processes between the auditory and visual cortices during constant asynchronous presentation of
audiovisual stimuli in a rodent preparation designed to mimic relevant aspects of classic experi-
ments in humans on the recalibration of temporal order judgment.

16.1.1  Speed of Signal Transmission Is Modality-Specific


From the point in time a single event causes an auditory and a visual signal, to the point in time a
certain brain area is activated by these signals, information about temporal congruity is blurred in
various ways by the different speeds of transmission of the two signals. The first temporal dispari-
ties in signal propagation arise outside the brain from the different velocities of sound and light. At
the receptor level, sound transduction in the ear is faster than phototransduction in the retina (see
Fain 2003, for a detailed review). The minimum response latency for a bright flash, approximately
7 ms, is nearly the same in rods and cones (Cobbs and Pugh 1987; Hestrin and Korenbrot 1990;
Robson et al. 2003). But with low light intensities, the rod-driven response might take as long as
300 ms (Baylor et al. 1984, 1987). In contrast, transduction by the hair cells of the inner ear is effec-
tively instantaneous via direct mechanical linkage (~10 µs; Corey and Hudspeth 1979, 1983; Crawford
and Fettiplace 1985; Crawford et al. 1991).
Next, the duration of the transmission of auditory and visual signals depends on the length of
the nerves used for their transmission (Von Békésy 1963; Harrar and Harris 2005). The relation-
ship of transmission delays between sensory modalities is further complicated by the fact that, in
each modality, processing speed seems to be modulated by detailed physical stimulus character-
istics, such as stimulus intensity (Wilson and Anstis 1969) and visual eccentricity (Nickalls 1996;
Kopinska and Harris 2004), as well as by subjective factors, such as attention (e.g., Posner et al.
1980).

16.1.2  Simultaneity Constancy


The ability to perceive stimuli as simultaneous despite their different transmission delays has been
termed simultaneity constancy (Kopinska and Harris 2004). Several studies demonstrated that
human beings are able to compensate for temporal lags caused by variations in spatial distance
(Engel and Dougherty 1971; Sugita and Suzuki 2003; Kopinska and Harris 2004; Alais and Carlile
2005). Interestingly, the compensation also worked when distance cues were presented only to a sin-
gle modality. In the study by Sugita and Suzuki (2003), only visual distance cues were used. Alais
and Carlile (2005) varied only cues for auditory distance perception. The question of which cues are
essential to induce a lag compensation is still a matter of ongoing debate as there are also several
studies that failed to find evidence for a similar perceptual compensation (Stone 2001; Lewald and
Guski 2004; Arnold et al. 2005; Heron et al. 2007).

16.1.3  Temporal Recalibration


Because the transmission delays of auditory and visual signals depend on many factors, they cannot
be described by simple rules. One way to deal with this complexity could be for the compensation mechanisms to remain plastic throughout the organism's lifetime so that they can flexibly adapt to new sets of stimuli and their typical transmission delays.
The existence of temporal recalibration to new stimuli has been demonstrated in several stud-
ies (Fujisaki et al. 2004; Vroomen et al. 2004; Navarra et al. 2005; Heron et al. 2007; Keetels and
Vroomen 2007). In these studies, experimental paradigms typically start with an adaptation phase
with auditory and visual stimuli being presented repeatedly over several minutes, and consistently
at a slight onset asynchrony of about 0 to 250 ms. In a subsequent behavioral testing phase, auditory
and visual stimuli are presented at various temporal delays and their perceived temporal distance
is usually assessed by a simultaneity judgment task (subjects have to indicate whether the stimuli
are simultaneous or not) or a temporal order judgment task (subjects have to indicate which of the
stimuli they perceived first).
Using these procedures, temporal recalibration could be demonstrated repeatedly: the average
time one stimulus had to lead the other for the two to be judged as occurring simultaneously, the
point of subjective simultaneity (PSS), was shifted in the direction of the lag used in the adapta-
tion phase (Fujisaki et al. 2004; Vroomen et al. 2004). For example, if sound was presented before
light in the adaptation phase, in the testing phase, the sound stimulus had to be presented earlier
in time than before the adaptation to be regarded as having occurred simultaneously with the light
stimulus.
In addition, in several studies, an increase in the just noticeable difference (JND) was observed (the
smallest temporal interval between two stimuli needed for the participants in a temporal order task
to be able to judge correctly which of the stimuli was presented first in 75% of the trials; Fujisaki et
al. 2004; Navarra et al. 2005).
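For readers unfamiliar with how these two measures are obtained, the following sketch illustrates the standard procedure of fitting a cumulative Gaussian psychometric function to temporal order judgments; the PSS is the asynchrony at which the curve crosses 50% and the JND is the additional asynchrony needed to reach 75%. All data values, names, and parameter choices below are invented for illustration and are not taken from the studies cited above.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Hypothetical temporal-order-judgment data: stimulus onset asynchrony in ms
# (negative = sound first, positive = light first) and the proportion of
# "light first" responses at each asynchrony.
soa_ms = np.array([-240.0, -120.0, -60.0, -30.0, 0.0, 30.0, 60.0, 120.0, 240.0])
p_light_first = np.array([0.02, 0.10, 0.25, 0.40, 0.55, 0.70, 0.85, 0.95, 0.99])

def psychometric(soa, pss, sigma):
    # Cumulative Gaussian psychometric function.
    return norm.cdf(soa, loc=pss, scale=sigma)

(pss, sigma), _ = curve_fit(psychometric, soa_ms, p_light_first, p0=[0.0, 50.0])

# PSS: asynchrony at which "light first" is reported on 50% of trials.
# JND: additional asynchrony needed to go from 50% to 75% "light first".
jnd = norm.ppf(0.75) * sigma
print(f"PSS = {pss:.1f} ms, JND = {jnd:.1f} ms")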

16.1.4  Mechanisms of Temporal Recalibration


The neural mechanisms underlying temporal recalibration have not yet been investigated in detail.
In the following we will review current psychophysical data with respect to cognitive processes
hypothetically involved in recalibration to develop first ideas about the neural levels at which reca-
libration might operate.
The idea that temporal recalibration works on an early level of processing is quite attractive:
more accurate temporal information is available at the early stages because the different processing
delays of later stages have not yet been added. However, there are also reasons to believe that recalibration works at later levels: recalibration effects are usually observed in conscious perception.
It is plausible to assume that the conscious percept is also shaped by the results of later processing
stages. For recalibration to operate correctly, it should also compensate for delays of later process-
ing stages.

16.1.4.1  Are There Any Indications for Recalibration at Early Levels of Stimulus Processing?


There are indications that recalibration does not occur at the very periphery. Fujisaki et al. (2004)
presented sound stimuli during the testing phase to a different ear than during the adaptation phase
and found clear evidence for recalibration. They concluded that recalibration occurs at least at stages
of processing where information from both ears had already been combined.
To investigate the possible neuronal mechanism of temporal recalibration, the neuronal sites at
which temporal onset asynchronies are represented might be of interest. There are indications that
neurons are tuned to different onset asynchronies of multimodal stimuli at the level of the superior
colliculus (Meredith et al. 1987). In addition, there are initial findings of neural correlates of
onset asynchrony detection at the cortical level (Bushara et al. 2001; Senkowski et al. 2007).

16.1.4.2  To What Extent Does Temporal Recalibration Need Attentional Resources?


An increasing number of results indicate that processes of synchrony detection require attentional
resources (Fujisaki and Nishida 2005, 2008; Fujisaki et al. 2006). Recalibration is often measured
by a change in the perception of synchrony, but preliminary results suggest that the mechanisms
of recalibration and attention might be independent: Fujisaki and colleagues found no interaction
between the shift in the PSS caused by attention and the shift in PSS caused by adaptation in a
recalibration experiment (Fujisaki et al. 2004).

16.1.4.3  Is Recalibration Stimulus-Specific?


Several studies demonstrated that the lag adaptation can easily generalize to stimuli not presented
during the adaptation phase, suggesting that temporal recalibration occurs at a level of processing that abstracts from the details of the specific stimuli (Fujisaki et al. 2004; Navarra et al. 2005; Vatakis et al. 2007, 2008).

16.1.4.4  Is Recalibration Modality-Specific?


Also fundamental for understanding the basics of recalibration is the question of whether it is
a supramodal process. Because, in conscious experience, the information from all senses usually appears to be temporally aligned, a hypothetical compensatory process should take into account
all combinations of modality pairs, this might cause conflicts between the different compensatory
mechanisms.
Recalibration experiments involving modality pairs other than the audiovisual one have yielded variable results (Miyazaki et al. 2006; Navarra et al. 2006; Hanson et al. 2008; Harrar and
Harris 2008).
If there were a single compensatory mechanism, we should be able to observe a transfer of recalibration across modality pairings. In the study of Harrar and Harris (2008), exposure to visuotactile asynchronous stimuli in the adaptation phase shifted the PSS in a subsequent audiovisual temporal order judgment task, and adaptation to audiotactile asynchronous stimuli caused an increase in the JND in an audiovisual temporal order judgment task. However, the effects
do not seem to be simple because, in this study, no recalibration effects were found when audiotac-
tile and visuotactile pairings were used in the testing phase.

16.1.4.5  Does Recalibration Occur at Decision Level?


Fujisaki et al. (2004) advanced the hypothesis that recalibration might occur as late as at the deci-
sion level. According to this hypothesis, the effect of recalibration could be explained by a change in
the response bias in the temporal order task. Fujisaki et al. tested this hypothesis by assessing their participants' perception of simultaneity indirectly, using an auditory-induced visual illusion. As the perception of this illusion changed after the lag adaptation phase, they concluded that recalibration does not occur at the response level.

16.1.5  Outlook on Experiments


This short review of studies addressing the mechanisms of recalibration makes it clear that it is still too early to deduce a precise hypothesis about the neural level at which recalibration might operate. In the current explorative study, we began by searching for neural mechanisms of
recalibration at the level of the primary sensory cortex. In the past decade, the primary sensory
cortices have repeatedly been demonstrated to be involved in multisensory interactions (e.g., Cahill
et al. 1996; Brosch et al. 2005; Bizley et al. 2007; Kayser et al. 2008; Musacchia and Schroeder
2009).
The experimental paradigm for rodents resembled the previously described human studies on
temporal recalibration: auditory and visual stimuli were presented repeatedly at a constant inter-
modal temporal onset asynchrony of 200 ms.
We implanted one electrode into the primary auditory cortex and one electrode into the visual
cortex of Mongolian gerbils, and during stimulation, local field potentials were recorded in the
awake animal. Our main question of interest was whether the interaction patterns between auditory
and visual cortices change during the course of continuous asynchronous presentation of auditory
and visual stimuli. There is accumulating evidence that the synchronization dynamics between
brain areas might reflect their mode of interaction (Bressler 1995, 1996). We examined directional
influences between auditory and visual cortices by analyzing the local field potential data using the
DTF (Kaminski and Blinowska 1991).

16.2  METHODS
16.2.1  Animals
Data were obtained from eight adult male Mongolian gerbils (Meriones unguiculatus). All ani-
mal experiments were reviewed and approved by the animal care committee of the Land Sachsen-Anhalt.

16.2.2  Electrodes
Electrodes were made of stainless steel wire (diameter, 185 µm) and were deinsulated only at the tip.
The tip of the reference electrodes was bent into a small loop (diameter, 0.6 mm). The impedance of
the recording electrodes was 1.5 MΩ (at 1 kHz).

16.2.3  Animal Preparation and Recording


Electrodes were chronically implanted under deep ketamine–xylazine anesthesia (xylazine, 2 mg/100 g body weight, i.p.; ketamine, 20 mg/100 g body weight, i.p.). One recording electrode was inserted into
the right primary auditory cortex and one into the right visual cortex, at depths of 300 µm, using a
microstepper. Two reference electrodes were positioned onto the dura mater over the region of the
parietal and the frontal cortex, electrically connected, and served as a common frontoparietal refer-
ence. After the operation, animals were allowed to recover for 1 week before the recording sessions
began. During the measurements, the animal was allowed to move freely in the recording box (20 ×
30 cm). The measured local field potentials from auditory and visual cortices were digitized at a
rate of 1000 Hz.

16.2.4  Stimuli
Auditory and visual stimuli were presented at a constant intermodal stimulus onset asynchrony
of 200 ms. The duration of both the auditory and the visual stimuli was 50 ms and the intertrial
interval varied randomly between 1 and 2 s with a rectangular distribution of intervals in that range.
Acoustic stimuli were tones presented from a loudspeaker located 30 cm above the animal. The tone
frequency was chosen for each individual animal to match the frequency that evoked, in preparatory
experiments, the strongest amplitude of local field potential at the recording site within the tono-
topic map of primary auditory cortex (Ohl et al. 2000, 2001). The frequencies used ranged from 250
Hz to 4 kHz with the peak level of the tone stimuli varying between 60 dB (low frequencies) and 48
dB (high frequencies), measured with a Brüel & Kjær sound level meter. Visual stimuli were flashes presented from an LED lamp (9.6 cd/m²) located at the height of the animal's eyes.
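As a concrete illustration of this stimulation protocol, the short sketch below generates one session's worth of onset times under the parameters just described (200-ms onset asynchrony, 50-ms stimuli, 750 presentations, intertrial intervals drawn uniformly between 1 and 2 s). The bookkeeping and all variable names are our own and are not taken from the authors' stimulation software.

import numpy as np

rng = np.random.default_rng(0)

N_TRIALS = 750       # stimulus presentations per session
SOA = 0.200          # fixed audiovisual onset asynchrony (s)
STIM_DUR = 0.050     # duration of each tone and each flash (s)

# Intertrial intervals drawn from a rectangular (uniform) distribution, 1-2 s.
iti = rng.uniform(1.0, 2.0, size=N_TRIALS)

# Onset of the first stimulus of each trial (tone for AV-animals, flash for
# VA-animals); the second stimulus follows after the fixed asynchrony.
trial_span = SOA + STIM_DUR                      # first onset to end of pair (a simplification)
first_onsets = np.cumsum(iti) + np.arange(N_TRIALS) * trial_span
second_onsets = first_onsets + SOA

print(f"session lasts ~{(second_onsets[-1] + STIM_DUR) / 60:.1f} min")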

16.2.5  Experimental Protocol


To be able to examine both short-term and long-term adaptation effects, animals were presented
with asynchronous stimuli for 10 sessions with 750 stimulus presentations at each session. For four
animals, the auditory stimuli were presented first, for the remaining four animals, the visual stimuli
were presented first.

16.2.6  Data Preprocessing


The local field potential of each trial was analyzed from 1 s before to 1 s after the first stimulus.
The data of this time period were detrended separately for each trial and each channel. In addition,
the temporal mean and the temporal standard deviation of the time period were determined for
each trial and for each channel, and used for z-standardization. Amplifier clipping resulting from movement of the animals was identified by visual inspection. Only artifact-free trials were included in the analysis (~70–90% of the trials).
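A minimal sketch of this preprocessing, assuming the single-trial local field potentials are held in a NumPy array of shape (n_trials, n_channels, n_samples); the visual identification of clipped trials is not reproduced here, and the function name is ours.

import numpy as np
from scipy.signal import detrend

def preprocess_trials(lfp):
    """Detrend and z-standardize single-trial LFP data.

    lfp : array of shape (n_trials, n_channels, n_samples) covering the
          period from 1 s before to 1 s after the first stimulus.
    """
    # Remove a linear trend separately for each trial and each channel.
    lfp = detrend(lfp, axis=-1, type="linear")
    # z-standardize with the temporal mean and SD of each trial and channel.
    mean = lfp.mean(axis=-1, keepdims=True)
    std = lfp.std(axis=-1, keepdims=True)
    return (lfp - mean) / std

# Example on surrogate data: 100 trials, 2 channels, 2 s at 1000 Hz.
rng = np.random.default_rng(1)
lfp = rng.standard_normal((100, 2, 2000))
lfp_z = preprocess_trials(lfp)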

16.2.7  DTF: Mathematical Definition


Directional influences between the auditory and the visual cortex were analyzed in single trials by
estimating the DTF (Kaminski and Blinowska 1991; Kaminski et al. 2001; for comparison of the
performance of the DTF with other spectral estimators, see Kus et al. 2004; Astolfi et al. 2007). The
DTF is based on the concept of Granger causality. According to this concept, one time series can be
called causal to a second one if its values can be used for the prediction of values of the second time
series measured at later time points. This basic principle is typically mathematically represented in
the formalism of autoregressive models (AR models).
Let X1(t) be the time series data from a selectable channel 1, and X2(t) the data from a selectable
channel 2:

X_1(t) = \sum_{j=1}^{p} A_{1\to 1}(j)\, X_1(t-j) + \sum_{j=1}^{p} A_{2\to 1}(j)\, X_2(t-j) + E    (16.1)

X_2(t) = \sum_{j=1}^{p} A_{1\to 2}(j)\, X_1(t-j) + \sum_{j=1}^{p} A_{2\to 2}(j)\, X_2(t-j) + E    (16.2)

Here, the A(j) are the autoregressive coefficients at time lag j, p is the order of the autoregressive
model, and E the prediction error. According to the concept of Granger causality, in Equation 16.1,
the channel X2 is said to have a causal influence on channel X1 if the prediction error E can be
reduced by including past measurements of channel X2 (for the influence of the channel X1 on the
channel X2, see Equation 16.2).
To investigate the spectral characteristics of interchannel interaction, the autoregressive coef-
ficients in Equation 16.1 were Fourier-transformed; the transfer matrix was then obtained by matrix
inversion:

\begin{pmatrix} A_{1\to 1}(f) & A_{2\to 1}(f) \\ A_{1\to 2}(f) & A_{2\to 2}(f) \end{pmatrix}^{-1} = \begin{pmatrix} H_{1\to 1}(f) & H_{2\to 1}(f) \\ H_{1\to 2}(f) & H_{2\to 2}(f) \end{pmatrix}    (16.3)

where the components of the A(f) matrix are

A_{l\to m}(f) = 1 - \sum_{j=1}^{p} A_{l\to m}(j)\, e^{-i 2\pi f j}    when l = m    (16.4)

with l being the number of the transmitting channel and m the number of the receiving channel

A_{l\to m}(f) = 0 - \sum_{j=1}^{p} A_{l\to m}(j)\, e^{-i 2\pi f j}    otherwise.    (16.5)

The DTF for the influence from a selectable channel 1 to a selectable channel 2, DTF1→2, is
defined as

DTF_{1\to 2}(f) = |H_{1\to 2}(f)|^2    (16.6)


In the case of only two channels, the DTF measures how well the frequency response of one channel can be predicted from the activity of the other channel measured earlier in time. When, for example, X1 describes the
local field potential from the auditory cortex, X2 the local field potential from the visual cortex, and
the amplitude of the nDTF1→2 has high values in the beta band, this means that we are able to predict
the beta response of the visual cortex from the beta response of the auditory cortex measured earlier
in time. There are several possible situations of cross-cortical interaction that might underlie the
modulation of DTF amplitudes (see, e.g., Kaminski et al. 2001; Cassidy and Brown 2003; Eichler
2006). See Section 16.4 for more details.
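To make Equations 16.1 through 16.6 concrete, the sketch below computes the non-normalized DTF of a two-channel AR model: the coefficient matrices are Fourier-transformed as in Equations 16.4 and 16.5, the transfer matrix is obtained by matrix inversion as in Equation 16.3, and the squared magnitude of its elements gives the directed transfer of Equation 16.6. The example coefficients are arbitrary illustrative values rather than coefficients fitted to the data of this chapter.

import numpy as np

def dtf_from_ar(A, freqs, fs):
    """Non-normalized DTF of a bivariate AR model.

    A     : array of shape (p, 2, 2); A[j-1, m, l] is the coefficient
            A_{l->m}(j) of Equations 16.1 and 16.2.
    freqs : frequencies in Hz at which to evaluate the DTF.
    fs    : sampling rate in Hz (the exponent uses f/fs because the lags
            are counted in samples).
    """
    p, n_ch = A.shape[0], A.shape[1]
    dtf = np.zeros((len(freqs), n_ch, n_ch))
    for k, f in enumerate(freqs):
        # A(f) = I - sum_j A(j) exp(-i 2 pi f j)   (Equations 16.4 and 16.5)
        Af = np.eye(n_ch, dtype=complex)
        for lag in range(1, p + 1):
            Af -= A[lag - 1] * np.exp(-2j * np.pi * f * lag / fs)
        H = np.linalg.inv(Af)          # transfer matrix (Equation 16.3)
        dtf[k] = np.abs(H) ** 2        # DTF_{l->m}(f) = |H_{l->m}(f)|^2
    return dtf

# Illustrative stable AR(2) model in which channel 1 drives channel 2.
A = np.array([[[0.5, 0.0],
               [0.4, 0.5]],           # lag 1
              [[-0.2, 0.0],
               [0.0, -0.2]]])         # lag 2
dtf = dtf_from_ar(A, freqs=np.arange(1, 101), fs=1000)
# dtf[:, 1, 0] is the influence of channel 1 on channel 2 across 1-100 Hz.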

16.2.8  Estimation of Autoregressive Models


We fitted bivariate autoregressive models to local field potential time series from auditory and visual
cortices using the Burg method as this algorithm has been shown to provide accurate results (Marple
1987; Kay 1988; Schlögl 2003). We partitioned the time series data of single trials into 100-ms time
windows that were stepped at intervals of 5 ms through each trial from 1 s before the first stimulus
to 1 s after the first stimulus. Models were estimated separately for each time window of the single
trials. Occasionally, the covariance matrix used for estimation of the AR coefficients turned out to
be singular or close to singular, in these rare cases, the whole trial was not analyzed any further.
In the present study, we used a model order of 8, and the full sampling rate of 1000 Hz was used for model estimation. The model order was determined by the Akaike Information Criterion (Akaike
1974). After model estimation, the adequacy of the model was tested by analyzing the residuals
(Lütkepohl 1993). Using this model order, the auto- and crosscovariance of the residuals was found
to have values between 0.001% and 0.005% of the auto- and crosscovariance of the original data
(data averaged from two animals here). In other words, the model was able to capture most of the
covariance structure contained in the data. When DTFs were computed from the residuals, the
single-trial spectra were almost flat, indicating that the noise contained in the residuals was close
to white noise.
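The sliding-window scheme described above can be sketched as follows. The chapter fits the bivariate AR models with the Burg algorithm; because a multivariate Burg routine is not part of the standard scientific Python stack, this sketch substitutes the ordinary least-squares VAR estimator from statsmodels purely to illustrate the 100-ms window, 5-ms step logic. All names and the surrogate data are our own.

import numpy as np
from statsmodels.tsa.api import VAR

FS = 1000      # sampling rate (Hz)
WIN = 100      # window length in samples (100 ms)
STEP = 5       # window step in samples (5 ms)
ORDER = 8      # AR model order used in the chapter

def sliding_window_ar(trial):
    """Fit an AR model in every 100-ms window of a single trial.

    trial : array of shape (n_samples, 2) holding the z-standardized LFP of
            the auditory and visual channels of one trial.
    Returns a list of (window_start_sample, coefs) pairs, where coefs has
    shape (ORDER, 2, 2) and can be passed to dtf_from_ar() above.
    """
    models = []
    for start in range(0, trial.shape[0] - WIN + 1, STEP):
        window = trial[start:start + WIN]
        # The chapter uses the Burg algorithm; the least-squares VAR
        # estimator is used here only as a readily available stand-in.
        res = VAR(window).fit(ORDER)
        models.append((start, res.coefs))
    return models

# Example on a surrogate 2-s trial of two noise channels.
rng = np.random.default_rng(2)
trial = rng.standard_normal((2 * FS, 2))
models = sliding_window_ar(trial)
print(f"{len(models)} windows fitted")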
The estimation of AR models requires normality of the process. To analyze the extent to which the normality assumption was fulfilled in our data, the residuals were inspected by plotting
them as histograms and, in addition, a Lillie test was computed separately for the residuals of the
single data windows. In about 80% of the data windows, the Lillie test confirmed the normality
assumption.
A second requirement for the estimation of the autoregressive models is the stationarity of the
time series data. Generally, this assumption is better fulfilled with small data windows (Ding et al.
2000), although it is impossible to tell in advance at which data window a complex system like the
brain will move to another state (Freeman 2000).
A further reason why the use of small data windows is recommendable is that changes in the
local field potential are captured at a higher temporal resolution. The spectral resolution of low
frequencies does not seem to be a problem for small data windows when the spectral estimates are
based on AR models (for a mathematical treatment of this issue, see, e.g., Marple 1987, p. 199f).
Using a high sampling rate ensures that the number of data points contained in the small time
windows is sufficient for model estimation. For example, when we used a sampling rate of 500 Hz
instead of 1000 Hz to estimate models from our time windows of 100 ms, the covariance of the
residuals increased, signaling that the estimation had become worse (the autocovariance of the
residuals of the auditory and visual channels at 1000 Hz was about 10% of the auto- and crosscovariance of the auditory and visual channels at 500 Hz). Importantly, when inspected visually, the spectra seemed quite alike, indicating that AR models were robust, to an extent, to a
change in sampling rate. When using a data window of 200 ms with the same sampling rate of 500
Hz, the model estimation improved (the covariance of the residuals was 20–40% of the covariance
of a model with a window of 100 ms), but at the expense of the temporal resolution.

16.2.9  Normalization of DTF


Kaminski and Blinowska (1991) suggested normalization of the DTF relative to the structure that sends
the signal, i.e., for the case of the directed transfer from the auditory channel to the visual channel:

\[
\mathrm{nDTF}_{A\to V}(f) \;=\; \frac{\left| H_{A\to V}(f) \right|^{2}}{\displaystyle\sum_{M=1}^{k} \left| H_{M\to V}(f) \right|^{2}}
\tag{16.7}
\]

In the two-channel case, the DTFA→V is divided by the sum of itself and the spectral autocovariance
of the visual channel. Thus, when using this normalization, the amplitude of the nDTFA→V depends
on the influence of the auditory channel on itself and, reciprocally, the amplitude of the nDTFV→A
is dependent on the influence of the visual channel on itself. This is problematic in two ways: first,
we cannot tell whether differences between the amplitude of the nDTFA→V and the amplitude of the
nDTFV→A are due to differences in normalization or to differences in the strengths of cross-
cortical influences. Second, analysis of our data has shown that the auditory and the visual stimuli
influenced both the amplitude of the local field potential and the spectral autocovariance of both
auditory and visual channels. Thus, it is not clear whether changes in the amplitude of the nDTF
after stimulation signal changes in the crosscortical interaction or changes in spectral autocovari-
ance of the single channels.
As the nonnormalized DTF is difficult to handle because of large differences in the amplitudes
at different frequencies, we normalized the DTF in the following way:

\[
\mathrm{nDTF}_{A\to V}(f) \;=\; \frac{\mathrm{DTF}_{A\to V}(f)}{\displaystyle\sum_{1}^{n\_\mathrm{session}} \sum_{1}^{n\_\mathrm{trials}} \sum_{1}^{n\_\mathrm{windows}} \mathrm{DTF}_{A\to V}(f) \,\Big/\, \bigl( n\_\mathrm{windows} \cdot n\_\mathrm{trials} \cdot n\_\mathrm{session} \bigr)}
\tag{16.8}
\]

with n_windows being the number of time windows of the prestimulus interval per trial, n_trials the
number of trials per session, and n_session the number of sessions.
Hence, the amplitude of the DTF estimated for each single time window of the single trials was
divided by the average of the DTF of all time windows taken from the 1 s prestimulus interval of
the single trials of all sessions.
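A sketch of how the DTF and this baseline normalization might be computed from the fitted AR coefficients is given below; it would be applied to the windowed coefficient arrays obtained as in the previous sketch. The transfer matrix follows from the standard relation H(f) = (I − Σ_lag A_lag e^(−2πif·lag/fs))^(−1); array shapes and variable names are illustrative assumptions, not the original analysis code.

import numpy as np

def transfer_function(A, freqs, fs):
    """Transfer matrix H(f) of a VAR model with coefficients A (p, k, k):
    H(f) = (I - sum_lag A[lag] * exp(-2j*pi*f*lag/fs))^{-1}.
    Element H[i, j] describes the transfer from channel j to channel i."""
    p, k, _ = A.shape
    H = np.empty((len(freqs), k, k), dtype=complex)
    for n, f in enumerate(freqs):
        Af = np.eye(k, dtype=complex)
        for lag in range(1, p + 1):
            Af -= A[lag - 1] * np.exp(-2j * np.pi * f * lag / fs)
        H[n] = np.linalg.inv(Af)
    return H

def dtf_amplitude(A, freqs, fs):
    """Non-normalized DTF amplitude |H_{j->i}(f)| for every channel pair."""
    return np.abs(transfer_function(A, freqs, fs))

def normalize_to_baseline(dtf_windows, baseline_windows):
    """Normalization of Equation 16.8: each single-window DTF is divided by
    the DTF averaged over all prestimulus windows of all trials and sessions.
    dtf_windows, baseline_windows: arrays of shape (n_windows, n_freqs, k, k)."""
    return dtf_windows / baseline_windows.mean(axis=0)

# For comparison, the Kaminski/Blinowska normalization of Equation 16.7 would be
#   nDTF[:, i, j] = np.abs(H[:, i, j])**2 / (np.abs(H[:, i, :])**2).sum(axis=-1)
# i.e., the transfer into channel i is scaled by the sum of all inflows to i.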

16.2.10  Statistical Testing


We assessed the statistical significance of differences in the amplitude of the nDTF using the boot-
strap technique (e.g., Efron and Tibshirani 1993) to avoid being bound to assumptions about the

empirical statistical error distribution of the nDTF (but see Eichler 2006, for an investigation of the
statistical properties of the DTF). The general procedure was as follows: first, bootstrap samples
were drawn from real data under the assumption that the null hypothesis was true. Then for each
bootstrap sample, a chosen test statistic was computed. The values of the test statistic from all boot-
strap samples formed a distribution of values of the test statistic under the assumption of the null
hypothesis. Next, we determined from the bootstrap distribution of the test statistic the probability
of finding values equal to or larger than the empirically observed one by chance. If this value was
less than the preselected significance level, the null hypothesis was rejected.
More specifically, in our first bootstrap test, we wanted to test whether the
nDTF has higher amplitude values in the poststimulus interval than in the prestimulus interval.
Under the assumption of the null hypothesis, the nDTF amplitude values of the prestimulus and
the poststimulus interval should not be different from each other. Thus, pairs of bootstrap samples
were generated by taking single-trial nDTF amplitude values at random but with replacement from
the prestimulus and from the poststimulus interval. For each of the sample pairs, the amplitudes
were averaged across trials and the difference between the averages was computed separately for
each pair. This procedure of drawing samples was repeated 1000 times, yielding a distribution of
differences between the average amplitudes. The resulting bootstrap distribution was then used to
determine the probability of the real amplitude difference of the averages between the prestimulus
and the poststimulus interval under the assumption of the null hypothesis.
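The first bootstrap test can be sketched as follows. Pooling the pre- and poststimulus values is one common way of enforcing the null hypothesis of exchangeability described above; the synthetic input arrays are placeholders for the single-trial nDTF amplitudes at one frequency and time window.

import numpy as np

rng = np.random.default_rng(0)

def bootstrap_diff_test(pre, post, n_boot=1000):
    """Bootstrap test of whether the trial-averaged nDTF amplitude is larger
    in the post- than in the prestimulus interval.  pre, post: 1-D arrays of
    single-trial amplitudes.  The null hypothesis (no difference) is enforced
    by drawing both bootstrap samples, with replacement, from the pooled values."""
    observed = post.mean() - pre.mean()
    pooled = np.concatenate([pre, post])
    null_diffs = np.empty(n_boot)
    for b in range(n_boot):
        boot_pre = rng.choice(pooled, size=pre.size, replace=True)
        boot_post = rng.choice(pooled, size=post.size, replace=True)
        null_diffs[b] = boot_post.mean() - boot_pre.mean()
    # One-sided p value: probability of a null difference at least as large
    # as the observed one.
    p_value = np.mean(null_diffs >= observed)
    return observed, p_value

# Example with synthetic amplitudes (poststimulus values slightly larger).
pre = rng.normal(1.0, 0.3, size=750)
post = rng.normal(1.1, 0.3, size=750)
print(bootstrap_diff_test(pre, post))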
In a second bootstrap test, we assessed the significance of the slope of a line fitted to the data by
linear regression analysis. We used the null hypothesis that the predictor variable (here, the number
of stimulus presentations) and the response variable (here, the nDTF amplitude) are independent
from each other. We generated bootstrap samples by randomly pairing the values of the predictor
and response variables. For each of these samples, a line was fitted by linear regression analysis and
the slope was computed, yielding a distribution of slope values under the null hypothesis.
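The slope test can be sketched along the same lines; here the null hypothesis of independence is realized by randomly re-pairing predictor and response values (a permutation of the response values), with variable names and synthetic data chosen for illustration.

import numpy as np

rng = np.random.default_rng(1)

def slope(x, y):
    """Least-squares slope of a line fitted to (x, y)."""
    return np.polyfit(x, y, 1)[0]

def bootstrap_slope_test(x, y, n_boot=1000):
    """Test the regression slope against the null hypothesis that predictor
    (e.g., number of stimulus presentations) and response (nDTF peak amplitude)
    are independent.  The null distribution is built by randomly re-pairing
    the x and y values and refitting the line."""
    observed = slope(x, y)
    null_slopes = np.empty(n_boot)
    for b in range(n_boot):
        null_slopes[b] = slope(x, rng.permutation(y))
    # Two-sided p value based on the magnitude of the slope.
    p_value = np.mean(np.abs(null_slopes) >= abs(observed))
    return observed, p_value

# Example: amplitudes of one nDTF peak averaged over six 125-trial windows.
x = np.arange(1, 7)
y = 1.0 + 0.03 * x + rng.normal(0, 0.02, size=x.size)
print(bootstrap_slope_test(x, y))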

16.3  RESULTS
16.3.1  Stimulus-Induced Changes in Single-Trial nDTF,
Averaged across All Trials from All Sessions
For a first inspection of the effect the audiovisual stimulation had on the nDTF, from the auditory
to the visual cortex (nDTFA→V) and from the visual to the auditory cortex (nDTFV→A), we averaged
nDTF amplitudes across all single trials of all sessions, separately for each time window from 1 s
before to 1 s after the first stimulus. Figure 16.1 shows time-frequency plots of the nDTFA→V (left),
which describes the predictability of the frequency response of the visual cortex based on the fre-
quency response of the auditory cortex, and the nDTF V→A (right), which describes the predictability
of the frequency response of the auditory cortex based on the frequency response of the visual cor-
tex. Results from animals receiving the light stimulus first are presented in the upper two graphs and
results from animals receiving the tone stimulus first are shown in the lower two graphs. Data from
200 ms before the first stimulus to 1 s after the first stimulus is shown here. Note that the abscissa
indicates the start of a time window (window duration: 100 ms), so the data from time windows at
100 ms before the first stimulus are already influenced by effects occurring after the presentation
of the first stimulus.
The significance of the observed changes in the nDTF amplitude was assessed separately for
each animal using Student’s t-test based on the bootstrap technique (see Methods). More precisely,
we tested whether the amplitudes of the nDTF averaged across trials at different time points after
the presentation of the first stimulus were significantly different from the nDTF amplitude of the
prestimulus interval, averaged across trials and across time from 1000 to 100 ms before the first stimulus.
To compare the relative amplitudes of the nDTFA→V and the nDTFV→A, we tested whether the dif-
ference of the amplitudes of nDTFA→V and nDTFV→A averaged across trials at different time points

[Figure 16.1 appears here: time–frequency plots (Time [s] × Frequency [Hz]) of the A→V–DTF (left) and V→A–DTF (right) for each stimulus order, and their difference; see caption below.]

FIGURE 16.1  (a and b) nDTFA→V (left) and nDTF V→A (right), averaged across all trials from all sessions,
separately for time windows from –0.2 to 0.9 s after start of first stimulus. (a) Animal receiving light first.
(b) Animal receiving tone first. (c) Difference between averages (nDTFA→V – nDTFV→A). Animal receiving
light first (left). Animal receiving tone first (right).

after the presentation of the first stimulus were significantly different from the difference of the
amplitudes of nDTFA→V and nDTFV→A of the prestimulus interval. In the following, we will describe
only those peaks of the nDTF amplitude that deviated significantly (P < 0.01) from the average
amplitude of the prestimulus interval.

16.3.1.1  Animals Receiving Light Followed by Tone Stimulus (VA-Animals)


At first sight, the response of the nDTFA→V closely resembled the response of the nDTFV→A. In ani-
mals receiving first the light stimulus and then the tone stimulus we observed two prominent posi-
tive peaks in both the nDTFA→V (Figure 16.1a, left) and the nDTFV→A (Figure 16.1a, right), the first
one after the light stimulus started at about –20 ms and the second one after the tone stimulus began
at about 151 ms. After the second peak, the amplitude of the nDTFA→V and the nDTFV→A dropped to
slightly less than the prestimulus baseline values and returned very slowly to the prestimulus values
within the next second.

Even though the temporal development and the frequency spectra were roughly similar in the
nDTFA→V and the nDTFV→A, there were small but important differences.
First, there were stimulus-evoked differences in the amplitudes of the nDTFA→V and the nDTF V→A
(Figure 16.1c, left, and the line plots in Figure 16.2, top). After the visual stimulus, the nDTF ampli-
tude was significantly higher in the nDTFV→A than in the nDTFA→V, whereas after the auditory
stimulus, the nDTFA→V reached higher values, but only at frequencies exceeding 30 Hz.
Second, even though the peaks could be found at all frequency bands in the nDTF V→A, the first
peak was strongest at a frequency of 1 Hz and at about 32 Hz, and the second peak at frequencies of
1 Hz and at about 40 Hz. In the nDTFA→V, the highest amplitude values after the first peak could be
observed at 1 Hz and at about 35 Hz and after the second peak at 1 Hz and at about 45 Hz.

16.3.1.2  Animals Receiving Tone Followed by Light Stimulus (AV-Animals)


In animals receiving first the tone stimulus and then the light stimulus, three positive peaks devel-
oped after stimulation. As in the VA animals, the nDTFA→V and nDTFV→A were similar to each
other (Figure 16.1b and the line plots in Figure 16.2, bottom). The first peak could be found between

[Figure 16.2 appears here: nDTF amplitude traces at frequencies from 5 to 95 Hz plotted against time (–200 to 800 ms); see caption below.]

FIGURE 16.2  Top: representative nDTFV→A (dashed) and nDTFA→V (solid), averaged across all trials from
all sessions, separately for all time windows from –200 to 900 ms after start of first stimulus, from an animal
receiving light first, followed by tone stimulus. Bottom: data from an animal receiving tone first, followed by
light stimulus.

the tone and the light stimulus, at about –40 ms. The second and the third peaks occurred after
the light stimulus at about 170 ms and 330 ms, respectively. And as in the VA animals, after the
auditory stimulus (here the first stimulus), the amplitude of the nDTFA→V significantly exceeded the
amplitude of the nDTF V→A for frequencies above 20 Hz in the AV animals, whereas after the visual
stimulus, amplitudes were significantly higher in the nDTFV→A (Figure 16.1c, right). Thus, the sign
of the difference between the nDTFA→V and the nDTFV→A depended on the type of the stimulus
(auditory or visual) and not on the order of stimulus presentation.
The peaks ran through all frequencies from 0 to 100 Hz. The first peak of the nDTFA→V was most
pronounced at 1 Hz and at about 42 Hz, the second peak at 1 Hz, at about 32 Hz, and at 100 Hz. The
first peak of the nDTFV→A reached its highest values at 1 Hz and at 35 Hz, and the second peak had
its highest amplitude at 1 Hz and at 28 Hz. For the third peak, the amplitude was most prominent
at 1 Hz.

16.3.2  Development of Amplitude of nDTFA→V and nDTFV→A within Sessions


To investigate the development of the effects within the sessions, we divided the 750 trials of each
session into windows of 125 trials from the start to the end of each session. Averaging was done
across the trials of each trial window, but separately for the time windows within the course of each
trial. Trials from all sessions were included in the average. Because, for the majority of the animals, the
nDTF amplitude increased or decreased fairly smoothly within the sessions, we decided to
characterize the effects by linear regression analysis. The slope of the regression line fitted to the
observed data points was subjected to statistical testing using the bootstrap technique (for details,
see Methods).
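The within-session analysis described here amounts to block-averaging the single-trial peak amplitudes in nonoverlapping 125-trial windows and fitting a regression line to the block averages; a minimal numpy sketch with synthetic data and illustrative variable names might look like this (the slope would then be tested with the bootstrap procedure of Section 16.2.10).

import numpy as np

rng = np.random.default_rng(2)
# Placeholder: single-trial nDTF peak amplitudes of one session in trial order
# (750 trials per session in this study), with a small upward drift added.
peak_amp = 1.0 + 0.0002 * np.arange(750) + rng.normal(0, 0.05, size=750)

window = 125
n_blocks = peak_amp.size // window
# Average within nonoverlapping windows of 125 trials from start to end of session.
block_means = peak_amp[:n_blocks * window].reshape(n_blocks, window).mean(axis=1)

# Characterize the within-session development by the slope of a regression line.
trend = np.polyfit(np.arange(1, n_blocks + 1), block_means, 1)[0]
print(block_means.round(3), trend)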

16.3.2.1  VA-Animals
In Figure 16.3a and b, the development of the nDTF amplitude of the first and the second peaks
within the sessions is depicted, averaged across all four animals that received the light stimulus
first. Most of the effects could roughly be observed over the whole range of frequencies tested (in
Figure 16.3, we selected nDTF peaks at a frequency of 40 Hz for illustration). Nevertheless, effects
did not always reach significance at all frequencies tested (see Tables 16.1 and 16.2 for more detailed
information on the development of peaks at other frequencies).
After the first (visual) stimulus, the amplitude of the first peak increased in the nDTFA→V and
decreased in the nDTFV→A (Figure 16.3a, left). At the beginning of the session, the amplitude was
higher in the nDTFV→A than in the nDTFA→V, thus the amplitude difference between the nDTFA→V
and the nDTFV→A decreased significantly over the session (Figure 16.3a, right).
After the second (auditory) stimulus, the amplitude of the second peak increased both in the
nDTFA→V and the nDTFV→A (Figure 16.3b, left). Importantly, the increase in the nDTFA→V exceeded
the increase in the nDTFV→A, gradually increasing the difference between the nDTFA→V and the
nDTFV→A (Figure 16.3b, right).

16.3.2.2  AV-Animals
Similar to the nDTF development in VA-animals after the second (auditory) stimulus, in the AV-
animals after the first (auditory) stimulus, the amplitude increased both in the nDTFA→V and the
nDTFV→A (Figure 16.3c, left). The increase was more pronounced in nDTFA→V, further increasing
the difference between the nDTFA→V and the nDTFV→A (Figure 16.3c, right).
Interestingly, after the second (visual) stimulus, the behavior of the nDTF in the AV-animals
did not resemble the behavior of the nDTF after the first (visual) stimulus in the VA-animals. In
the AV-animals, the amplitude of the nDTFV→A increased after the visual stimulus; the ampli-
tude of the nDTFA→V decreased slightly in some animals, whereas in other animals, an increase
could be observed (Figure 16.3d, left; Table 16.1). After the visual stimulus, the amplitude of the
nDTFV→A was already higher than the amplitude of the nDTFA→V at the beginning of the sessions,

[Figure 16.3 appears here: panels (a, b) VA animals, peaks 1 and 2; panels (c, d) AV animals, peaks 1 and 2. Left column: nDTFA→V and nDTFV→A peak amplitudes plotted against trial interval number; right column: difference (A→V–DTF – V→A–DTF); see caption below.]

FIGURE 16.3  Development of nDTF peaks at 40 Hz within sessions averaged across nonoverlapping win-
dows of 125 trials stepped through all sessions. (a and b) Animals receiving light first. (c and d) Animals
receiving tone first. Left: development of average amplitude peak after first stimulus in nDTFA→V and nDTF V→A
(a and c). Development of average amplitude peak after second stimulus in nDTFA→V and nDTFV→A (b and
d). Right: amplitude of nDTFV→A peak subtracted from amplitude of nDTFA→V peak shown in left. Error bars
denote standard error of mean, averaged across animals.

and the difference between the nDTFA→V and the nDTFV→A further increased during the course of the
sessions (Figure 16.3d, right).

16.3.3  Development of the Amplitude of nDTFA→V and nDTFV→A across Sessions


To examine the effects of long-term adaptation, the nDTF amplitude of the first 100 trials was
averaged separately for each session. The development of the amplitude averages across sessions

TABLE 16.1
P Values of Slope of a Regression Line Fitted to Peak Amplitudes of nDTF Averaged across
Nonoverlapping Windows of 125 Trials Stepped through All Sessions
A→V nDTF peak 1 V→A nDTF peak 1
Animals 1 Hz 20 Hz 40 Hz 60 Hz 80 Hz 1 Hz 20 Hz 40 Hz 60 Hz 80 Hz
AV090 <0.001a 0.003a 0.00a 0.006a 0.003a <0.001b <0.001b <0.001b >0.05c >0.001a
AV091 0.003a 0.001a 0.001a 0.004a 0.002a 0.001a 0.001a 0.01a >0.05c >0.05c
AV106 0.02a 0.001a <0.001a <0.001a <0.001a 0.01a <0.001a 0.002a 0.05a >0.05c
AV125 0.0a <0.001a 0.04a 0.01a 0.001a 0.02a 0.02a 0.03a >0.05c >0.05c
VA099 <0.001a 0.001b 0.001b >0.05c 0.03a <0.001b <0.001b <0.001b <0.001b 0.001b
VA100 0.02a 0.01a 0.04a 0.001a 0.001a 0.02b <0.001b 0.001b 0.002b 0.01b
VA107 0.004a 0.001b 0.01b 0.01a 0.001a >0.05c 0.004b >0.05c >0.05c 0.01a
VA124 0.03a <0.001a <0.001a 0.01a >0.05c 0.01a <0.001b <0.001b 0.01b 0.01b
A→V nDTF peak 2 V→A nDTF peak 2
Animals 1 Hz 20 Hz 40 Hz 60 Hz 80 Hz 1 Hz 20 Hz 40 Hz 60 Hz 80 Hz
AV090 <0.001a 0.05 >0.05c >0.05c >0.05c 0.01a 0.03a >0.05c <0.001a <0.001a
AV091 0.001a 0.001a 0.002a 0.001a <0.001a <0.001a 0.001a 0.002a 0.006a 0.01b
AV106 <0.001a >0.05c 0.002a <0.05c 0.001a <0.001a <0.001a 0.002a <0.001a 0.004a
AV125 <0.001a <0.001a 0.001a 0.003a <0.05c 0.03a <0.001a <0.001a >0.05c >0.05c
VA099 0.001a <0.001a <0.001a <0.001a <0.001a 0.02a 0.03a 0.001a >0.05c >0.05c
VA100 0.001a 0.001a 0.001a 0.001a 0.001a >0.05c 0.002a >0.05c >0.05c 0.001a
VA107 0.001a 0.001a 0.001a 0.001a 0.001a >0.05c 0.001a >0.05c >0.05c >0.05c
VA124 >0.05c 0.01b 0.001b 0.01a >0.05c 0.01a 0.02a 0.001a 0.001a >0.05c

Note: Upper table: results from the nDTF peak after the first stimulus. Bottom table, results from the nDTF peak after the
second stimulus. Animal notation: AV, animals receiving tone first; VA, animals receiving the light first.
a Slope is positive.

b Slope is negative.

c Nonsignificant results.

was examined by linear regression analysis and the significance of the slope was tested using
the bootstrap technique. In the following, effects are reported for a chosen significance level
of 0.05.
Even though some significant trends could be observed, results were not consistent among
animals. In the VA-group, one animal showed a decrease in the amplitude of the nDTFA→V at the
beginning of the first stimulus, but an increase could be found only 20 ms after the beginning
of the first stimulus. In a second animal, there was an increase in the amplitude of the nDTFA→V
after the second stimulus. In the amplitude of the nDTF V→A of two VA-animals, decreases could
be observed after the first and second stimulus, whereas in a third animal, an increase was found
after the second stimulus. All these results could be observed for the majority of examined
frequencies.
In the nDTFA→V of the AV-animals, at many frequencies, no clear developmental trend could be
observed, but at frequencies less than 10 Hz, there was an increase in amplitude both after the first
and second stimulus in two animals, whereas in one animal, a decrease could be found after both
stimuli. In the amplitude of the nDTF V→A, increases could be observed at various frequencies and
time points after stimulation.

TABLE 16.2
P Values of Slope of a Regression Line Fitted to Difference of Peak Amplitudes of nDTFV→A
and nDTFA→V Averaged in Nonoverlapping Windows of 125 Trials Stepped through All
Sessions
Difference (A→V minus V→A): peak 1 Difference (A→V minus V→A): peak 2
Animals 1 Hz 20 Hz 40 Hz 60 Hz 80 Hz 1 Hz 20 Hz 40 Hz 60 Hz 80 Hz
AV090 0.03a 0.002a 0.004a 0.006a 0.009a 0.01b 0.02b 0.02b >0.05c >0.05c
AV091 0.01a 0.006a 0.007a 0.004a 0.009a 0.01b 0.04b 0.02b >0.05c 0.01a
AV106 0.008a 0.03a 0.04a 0.03a 0.02a 0.02b >0.05c >0.05c >0.05c >0.05c
AV125 >0.05c >0.05c 0.04a 0.005a 0.06a 0.02b 0.01b 0.02b 0.03b 0.01b
VA099 0.002a 0.005a 0.002a <0.001a 0.002a 0.002a 0.001a 0.002a 0.002a 0.002a
VA100 0.04a 0.009a 0.01a 0.008a 0.04a 0.03a 0.004a 0.001a 0.001a 0.001a
VA107 0.01a >0.05c 0.04a 0.02a 0.04a 0.01a 0.06c >0.05c >0.05c >0.05c

Note: Left, first nDTF peak. Right, second nDTF peak. Animal notation: AV, animals receiving tone first; VA, animals
receiving the light first.
a Slope is positive.

b Slope is negative.

c Nonsignificant results.

16.4  DISCUSSION
The repeated presentation of pairs of auditory and visual stimuli, with random intervals between
stimulus pairs but constant audiovisual stimulus onset asynchrony within each pair, led to robust
changes in the interaction dynamics between the primary auditory and the primary visual cortex.
Independent of the stimulus order, when an auditory stimulus was presented, the amplitude of the
nDTFA→V exceeded the amplitude of the nDTF V→A, whereas after the visual stimulus, the amplitude
of the nDTFV→A reached higher values. Moreover, within adaptation sessions, some of the observed
changes in nDTF amplitudes showed clear dynamic trends, whereas across adaptation sessions, no
coherent development could be observed. In the following we will discuss which processes might
be evoked by the repeated asynchronous presentation of audiovisual stimuli and whether they might
offer suitable explanations for the amplitude changes in the nDTF we observed.
As paired-stimulus adaptation protocols, similar to the one used in the present study, have been
shown to induce recalibration of temporal order judgment in humans (e.g., Fujisaki et al. 2004;
Vroomen et al. 2004), we want to discuss whether some of the described effects on the directed
information transfer could possibly underlie such recalibration functions. To prepare the discussion,
some general considerations of the interpretation of nDTF amplitudes seem appropriate.

16.4.1  Interpretation of DTF-Amplitudes


Long-range interaction processes have been frequently associated with coherent oscillatory activity
between the cortices (Bressler 1995; Bressler et al. 1993; Roelfsema et al. 1997; Rodriguez et al.
1999; Varela et al. 2001). Moreover, it has been shown that the oscillatory activity in one cortical
area can be predicted by earlier measurement of another cortical area using the DTF (Kaminski
et al. 1997, 2001; Korzeniewska et al. 1997, 2003; Franaszczuk and Bergey 1998; Medvedev and
Willoughby 1999; Liang et al. 2000), indicating that the oscillatory activity might signal directional
influences between the cortices.

However, as Cassidy and Brown (2003) have demonstrated in a series of simulation studies, there
is no straightforward way to infer the nature of cross-cortical interactions from the information
provided by the DTF. Specifically, from DTF amplitudes alone, we cannot tell whether the information flow
is unidirectional, bidirectional, or even multidirectional, including additional brain areas.
Let us consider the situation after the presentation of the auditory stimulus when the amplitude of
the nDTFA→V attains higher values than the amplitude of the nDTFV→A. First, this result might indi-
cate that there is unidirectional influence from the auditory to the visual cortex, with the size of the
amplitude difference positively correlating with the delay in the information transfer. Second, this
finding could also reflect a reciprocal influence between the auditory and visual cortices, but with
the influence from the auditory cortex either larger in amplitude or lagged relative to the influence
from the visual cortex. Third, additional unobserved structures might be involved, sending input
slightly earlier to the auditory cortex than to the visual cortex.

16.4.2  Development of nDTF-Amplitude within Sessions


The development of the nDTF after the auditory stimulus did not seem to depend strongly on the
order of stimulus presentation. Independent of whether an auditory or a visual stimulus was pre-
sented first, after the auditory stimulus, the peak amplitude of both the nDTFA→V and nDTFV→A
increased. Noteworthy, the increase was more pronounced in the nDTFA→V than in the nDTF V→A,
further increasing the difference between the amplitudes of the nDTFA→V and the nDTFV→A. Using
the interpretation scheme introduced above, under the assumption of unidirectional interaction, the
influence from the auditory to the visual cortex not only increased in strength but also the lag with
which the input is sent became larger with increasing number of stimulus repetitions. In case of
bidirectional interaction, influences from both sides increased, but the influence from the auditory
cortex became stronger relative to the influence from the visual cortex. Finally, in case of multi-
directional interaction, the influence of a third structure in both the auditory and the visual cortex
might become more pronounced, but at the same time, the temporal delay of the input sent to the visual
cortex relative to the delay of the input sent to the auditory cortex increased even further. All three
interpretations have in common that not only did the interaction gain in strength but also the mode of the
interaction changed.
In contrast to the development of the nDTF after the auditory stimulus, the development of
the nDTF after the visual stimulus clearly depended on the order of stimulus presentation. When
the visual stimulus was presented first, contrary to expectations, the amplitude of the nDTF V→A
decreased with increasing number of stimulus repetitions, whereas the amplitude of the nDTFA→V
increased in the majority of the animals. Thus, assuming that unidirectional influence underlies
our data, this finding might reflect that the visual cortex sends influences to the auditory cortex
at increasingly shorter delays. In case of bidirectional interaction, the input from the visual cortex
decreases whereas the input from the auditory cortex increases. Finally, under the assumption of
multidirectional interaction, a hypothetical third structure might still send its input earlier to the
visual cortex, but the delay becomes diminished with increasing number of stimulus repetitions.
When the visual stimulus was presented as the second stimulus, the nDTF behaved similarly
as after the auditory stimulus. More precisely, both the peak amplitude of the nDTFA→V and the
nDTFV→A increased within the sessions. But importantly, now the increase was stronger in the
nDTFV→A. To summarize, the characteristic developmental trend after the second stimulus was an
increase in both nDTFA→V and nDTFV→A, with the increase stronger in the nDTF sending informa-
tion from the structure the stimulus had been presented to, namely in the nDTF V→A after the visual
stimulus and in the nDTFA→V after the auditory stimulus.
After the first stimulus, no typical development of the nDTF can be outlined: the behavior of
the nDTF clearly depended on the stimulus modality as the difference in nDTFA→V and nDTFV→A
amplitudes increased for an auditory stimulus, but decreased for a visual stimulus.

16.4.3  Audiovisual Stimulus Association as a Potential Cause of Observed Changes in nDTF-Amplitudes

The cross-cortical interaction between auditory and visual cortices reflected in the peaks of the
nDTF could simply be an indication that information is spread among the sensory cortices dur-
ing the course of stimulus processing. However, we also have to take into account that the nDTF
amplitudes increased within the sessions, signaling that the interaction between the auditory and
the visual cortex intensified. In addition, after the visual stimulus, the behavior of the DTF differed
strongly with the order of stimulus presentation. Each of these observations might be a sign that the
auditory and the visual information became associated. This hypothesis is in accordance with the
unity assumption (e.g., Bedford 2001; Welch 1999; Welch and Warren 1980), which states that two
stimuli from different sensory modalities are more likely to be regarded as deriving from the same
event when they are presented, for example, in close temporal congruence.
The increase in the nDTF after the second stimulus might indicate that stimuli are integrated
after the second stimulus has been presented. The increase in the nDTF before the second stimulus
might indicate the expectation of the second stimulus. Several other studies have demonstrated
increases in coherent activity associated with anticipatory processing (e.g., Roelfsema et al. 1998;
Von Stein et al. 2000; Fries et al. 2001; Liang et al. 2002). But on the other hand, our results on
the development of the nDTF after the first stimulus varied strongly with the stimulus order, and it
seems strange that the effect the expectation of an auditory stimulus has on the nDTF is quite dif-
ferent from the effect the expectation of a visual stimulus might have on the nDTF.
To clarify whether the observed changes might have something to do with stimulus associa-
tion or expectation processes, the repetition of this experiment with anesthetized animals might be
helpful. To explore whether the nDTF amplitude is influenced by anticipatory processing, it might
also be interesting to vary the likelihood with which a stimulus of a first modality is followed by
a stimulus of a second modality (see Sutton et al. 1965, for an experiment examining the effect of
stimulus uncertainty on local field potentials).

16.4.4  Changes in Lag Detection as a Potential Cause of Observed Changes in DTF-Amplitudes

As we presented our stimuli constantly at the same lag, it does not seem far-fetched to assume
that our stimulation alerted hypothetical lag detectors. There are already some studies on the
neural correlates of synchronous and asynchronous stimulus presentation (Meredith et al. 1987;
Bushara et al. 2001; Senkowski et al. 2007). Senkowski et al. (2007) examined the oscillatory
gamma-band responses in the human EEG for different stimulus onset asynchronies of auditory
and visual stimuli. They found clear evidence for multisensory interactions in the gamma-band
response when stimuli were presented in very close temporal synchrony. In addition, they also
found a very specific interaction effect over occipital areas when auditory inputs were leading
visual input by 100 ± 25 ms, indicating that cortical responses could be specific for certain
asynchronies.

16.4.5  Mechanisms of Recalibration: Some Preliminary Restrictions


16.4.5.1  Expectation and Lag Detection
Experiments on the recalibration of temporal order judgment typically demonstrate a shift of the
entire psychometric function (at many stimulus onset asynchronies), despite the fact that only a single
intermodal lag value has been used in the adaptation phase. In other words, a specific stimulus
order does not seem to be necessary to be able to observe the change in temporal order percep-
tion, indicating that expectation processes are unlikely to play a major role in evoking recalibration

effects. In a similar way, a specific stimulus onset asynchrony between the stimuli does not seem
to be required, speaking against a dominant role for lag-specific detection processes underlying the
recalibration effect.

16.4.5.2  Processes after the Second Stimulus


Even though the presentation of stimuli at a specific lag or in a specific order does not seem to be neces-
sary to make the recalibration of temporal perception observable in behavior, it is still possible that the
presentation of both an auditory and a visual stimulus is required. Under this hypothesis, the mechanisms
of recalibration should come into play only after stimuli of both modalities have been presented.
After the second stimulus, we could observe an increase in the difference of the amplitudes of
the nDTFs in both AV and VA animals. We hypothesized that this increase might reflect an ongoing
stimulus association.
Vatakis and Spence (2007) demonstrated that subjects showed decreased temporal sensitivity, as
measured by the JND, when an auditory and a visual speech stimulus belonged to the same speech
event. Also, in some experiments on recalibration, an increase in the JND was observed (Navarra et
al. 2005, 2006; Fujisaki et al. 2004). However, it is premature to conclude that stimulus association
plays a role in recalibration experiments. First, an increase in JND after stimulus association could
not be observed with different experimental conditions (Vatakis and Spence 2008). Second, as
already discussed in the Introduction, recalibration does not seem to be stimulus-specific (Fujisaki
et al. 2004; Navarra et al. 2005).

16.4.5.3  Speed of Processing


The observation that neither a specific lag nor a specific stimulus order seemed to be required to
observe a recalibration effect supports a further possibility: to observe a change in temporal per-
ception, the presentation of a second stimulus might not be necessary at all. Temporal perception
in the different modalities is then probably not recalibrated relative to one another; rather, perception is simply
speeded up or slowed down in one modality.
In our data, we did not find any indication for an increase in the speed of stimulus processing.
The latencies of the nDTF peaks did not change with increasing number of stimulus presentations,
but one has to keep in mind here that there might not be a direct relationship between the speed of
processing and the speed of perception measured in recalibration experiments.
Fujisaki et al. (2004) investigated the role of the speed of sensory processing in recalibration.
Specifically, they advanced the hypothesis that processing in one modality might be speeded up by
drawing attention to that modality, but based on the results of their experiments, they concluded that
attention and recalibration were independent.
If there was a general increase in the speed of perception, a transfer of recalibration effects to
modality pairs not presented in the adaptation phase should be easy to detect. Preliminary results
indicate that the effects and mechanisms of recalibration are not that simple. In the study by Harrar
and Harris (2008), after the adaptation with visuotactile stimulus pairs, the visual stimulus was
perceived to occur later relative to an auditory stimulus, but surprisingly, there were no changes in
the perception of temporal disparities when the visual stimulus was presented with a tactile stimulus
during the testing phase.

16.5  CONCLUSIONS
The repeated presentation of paired auditory and visual stimuli with constant intermodal onset
asynchrony is known to recalibrate audiovisual temporal order judgment in humans. The aim of
this study was to identify potential neural mechanisms that could underlie this recalibration in an
animal model amenable to detailed electrophysiological analysis of neural mass activity. Using
Mongolian gerbils, we found that prolonged presentation of paired auditory and visual stimuli

caused characteristic changes in the neuronal interaction dynamics between the primary auditory
cortex and the primary visual cortex, as evidenced by changes in the amplitude of the nDTF esti-
mated from local field potentials recorded in both cortices. Specifically, changes in both the DTF
from auditory to visual cortex (nDTFA→V) and from visual to auditory cortex (nDTF V→A) dynami-
cally developed over the course of the adaptation trials. We discussed three types of processes
that might have been induced by the repeated stimulation: stimulus association processes, lag
detection processes, and changes in the speed of stimulus processing. Although all three processes
could potentially have contributed to the observed changes in nDTF amplitudes, their relative roles
for mediating psychophysical recalibration of temporal order judgment must remain speculative.
Further clarification of this issue would require a behavioral test of the recalibration of temporal
order judgment in combination with the electrophysiological analysis.

REFERENCES
Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control
19:716–723.
Alais, D. and S. Carlile. 2005. Synchronizing to real events: Subjective audiovisual alignment scales with per-
ceived auditory depth and speed of sound. Proceedings of the National Academy of Science of the United
States of America 102(6):2244–2247.
Arnold, D.H., A. Johnston, and S. Nishida. 2005. Timing sight and sound. Vision Research 45:1275–1284.
Astolfi, L., F. Cincotti, D. Mattia, M.G. Marciani, L.A. Baccala, F. de Vico Fallani, S. Salinari, M. Ursino, M.
Zavaglia, L. Ding, J.C. Edgar, G.A. Miller, B. He, and F. Babiloni. 2007. Comparison of different cortical
connectivity estimators for high-resolution EEG recordings. Human Brain Mapping 28:143–157.
Baylor, D.A., B.J. Nunn, and J.L. Schnapf. 1984. The photocurrent, noise and spectral sensitivity of rods of the
monkey Macaca fascicularis. Journal of Physiology 357:575–607.
Baylor, D.A., B.J. Nunn, and J.L. Schnapf. 1987. Spectral sensitivity of cones of the monkey Macaca fascicu-
laris. Journal of Physiology 390:124–160.
Bedford, F.L. 2001. Toward a general law of numerical/object identity. Current Psychology of Cognition
20(3–4):113–175.
Bizley, J.K., F.R. Nodal, V.M. Bajo, I. Nelken, and A.J. King. 2007. Physiological and anatomical evidence for
multisensory interactions in auditory cortex. Cerebral Cortex 17:2172–2198.
Bressler, S.L. 1995. Large scale cortical networks and cognition. Brain Research Reviews 20:288–304.
Bressler, S.L. 1996. Interareal synchronization in the visual cortex. Behavioral Brain Research 76:37–49.
Bressler, S.L., R. Coppola, and R. Nakamura. 1993. Episodic multiregional cortical coherence at multiple fre-
quencies during visual task performance. Nature 366:153–156.
Brosch, M., E. Selezneva, and H. Scheich. 2005. Nonauditory events of a behavioral procedure activate audi-
tory cortex of highly trained monkeys. Journal of Neuroscience 25(29):6796–6806.
Bushara, K.O., J. Grafman, and M. Hallet. 2001. Neural correlates of audio-visual stimulus onset asynchrony
detection. The Journal of Neuroscience 21(1):300–304.
Cahill, L., F.W. Ohl, and H. Scheich. 1996. Alternation of auditory cortex activity with a visual stimulus through
conditioning: A 2-deoxyglucose analysis. Neurobiology of Learning and Memory 65(3):213–222.
Cassidy, M., and P. Brown. 2003. Spectral phase estimates in the setting of multidirectional coupling. Journal
of Neuroscience Methods 127:95–103.
Cobbs, E.H., and E.N. Pugh Jr. 1987. Kinetics and components of the flash photocurrent of isolated retinal rods
of the larval salamander, Ambystoma tigrinum. Journal of Physiology 394:529–572.
Corey, D.P., and A.J. Hudspeth. 1979. Response latency of vertebrate hair cells. Biophysical Journal 26:499–506.
Corey, D.P., and A.J. Hudspeth. 1983. Analysis of the microphonic potential of the bullfrog’s sacculus. Journal
of Neuroscience 3:942–961.
Crawford, A.C., and R. Fettiplace. 1985. The mechanical properties of ciliary bundles of turtle cochlear hair
cells. Journal of Physiology 364:359–379.
Crawford, A.C., M.G. Evans, and R. Fettiplace. 1991. The actions of calcium on the mechanoelectrical trans-
ducer current of turtle hair cells. Journal of Physiology 491:405–434.
Ding, M., S.L. Bressler, W. Yang, and H. Liang. 2000. Short-window spectral analysis of cortical event-related
potentials by adaptive autoregressive modelling: Data preprocessing, model validation, variability assess-
ment. Biological Cybernetics 83:35–45.

Eichler, M. 2006. On the evaluation of information flow in multivariate systems by the directed transfer func-
tion. Biological Cybernetics 94:469–482.
Engel, G.R., and W.G. Dougherty 1971. Visual-auditory distance constancy. Nature 234:308.
Efron, B., and R.J. Tibshirani 1993. An Introduction to the Bootstrap. Boca Raton, FL: Chapman and Hall/
CRC.
Fain, G.L. 2003. Sensory Transduction. Sunderland: Sinauer Associates.
Franaszczuk, P.J., and G.K. Bergey. 1998. Application of the directed transfer function method to mesial and
lateral onset temporal lobe seizures. Brain Topography 11:13–21.
Freeman, W.J. 2000. Neurodynamics: An Exploration in Mesoscopic Brain Dynamics. London: Springer
Verlag.
Fries, P., J.H. Reynolds, A.E. Rorie, and R. Desimone. 2001. Modulation of oscillatory neuronal synchroniza-
tion by selective visual attention. Science 291:1560–1563.
Fujisaki, W., and S. Nishida. 2005. Temporal frequency characteristics of synchrony-asynchrony discrimina-
tion of audio-visual signals. Experimental Brain Research 166:455–464.
Fujisaki, W., and S. Nishida. 2008. Top-down feature based selection of matching feature for audio-visual syn-
chrony discrimination. Neuroscience Letters 433:225–230.
Fujisaki, W., S. Shimojo, M. Kashino, and S. Nishida. 2004. Recalibration of audiovisual simultaneity. Nature
Neuroscience 7(7):773.
Fujisaki, W., A. Koene, D. Arnold, A. Johnston and S. Nishida. 2006. Visual search for a target changing in
synchrony with an auditory signal. Proceedings of the Royal Society of London. Series B. Biological
Sciences 273:865–874.
Harrar, V., and L.R. Harris. 2005. Simultaneity constancy: Detecting events with touch and vision. Experimental
Brain Research 166:465–473.
Harrar, V., and L.R. Harris. 2008. The effects of exposure to asynchronous audio, visual, and tactile stimulus
combination on the perception of simultaneity. Experimental Brain Research 186:517–524.
Hanson, J.V.M., J. Heron, and D. Whitaker. 2008. Recalibration of perceived time across sensory modalities.
Experimental Brain Research 185:347–352.
Heron, J., D. Whitaker, P. McGraw, and K.V. Horoshenkov. 2007. Adaptation minimizes distance-related audio-
visual delays. Journal of Vision 7(13):1–8.
Hestrin, S., and J.I. Korenbrot. 1990. Activation kinetics of retinal cones and rods: Response to intense flashes
of light. Journal of Neuroscience 10:1967–1973.
Kaminski, M., and K.J. Blinowska. 1991. A new method for the description of the information flow in the brain
structures. Biological Cybernetics 65:203–210.
Kaminski, M., K.J. Blinowska, and W. Szelenberger. 1997. Topographic analysis of coherence and propaga-
tion of EEG activity during sleep and wakefulness. Electroencephalography Clinical Neurophysiology
102:216–277.
Kaminski, M., M. Ding, W.A. Trucculo, and S.L. Bressler. 2001. Evaluating causal relations in neural sys-
tems: Granger causality, directed transfer function and statistical assessment of significance. Biological
Cybernetics 85:145–157.
Kay, S.M. 1988. Modern Spectral Estimation. Englewood Cliffs, NJ: Prentice Hall.
Kayser, C., C. Petkov, and N.K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18:1560–1574.
Keetels, M., and J. Vroomen. 2007. No effect of auditory–visual spatial disparity on temporal recalibration.
Experimental Brain Research 182:559–565.
Kopinska, A., and L.R. Harris. 2004. Simultaneity constancy. Perception 33:1049–1060.
Korzeniewska, A., S. Kasicki, M. Kaminski, and K.J. Blinowska. 1997. Information flow between hippocam-
pus and related structures during various types of rat’s behavior. Journal of Neuroscience Methods
73:49–60.
Korzeniewska, A., M. Manczak, M. Kaminski, K.J. Blinowska, and S. Kasicki. 2003. Determination of infor-
mation flow direction among brain structures by a modified directed transfer function (dDTF) method.
Journal of Neuroscience Methods 125:195–207.
Kus, R., M. Kaminski, and K.J. Blinowska. 2004. Determination of EEG activity propagation: Pairwise versus
multichannel estimate. IEEE Transactions on Bio-Medical Engineering 51:1501–1510.
Lewald, J., and R. Guski. 2004. Auditory-visual temporal integration as a function of distance: No compensa-
tion for sound-transmission time in human perception. Neuroscience Letters 357:119–122.
Liang, H., M. Ding, R. Nakamura, and S.L. Bressler. 2000. Causal influences in primate cerebral cortex.
Neuroreport 11(13):2875–2880.

Liang, H., S.L. Bressler, M. Ding, W.A. Truccolo, and R. Nakamura. 2002. Synchronized activity in prefrontal
cortex during anticipation of visuomotor processing. Neuroreport 13(16):2011–2015.
Lütkepohl, H. 1993. Introduction to Multiple Time Series Analysis, 2nd ed. Berlin: Springer.
Marple, S.L. 1987. Digital Spectral Analysis with Applications. Englewood Cliffs, NJ: Prentice Hall.
Medvedev, A., and J.O. Willoughby. 1999. Autoregressive modeling of the EEG in systemic kainic acid-induced
epileptogenesis. International Journal of Neuroscience 97:149–167.
Meredith, M.A., J.W. Nemitz, and B.E. Stein. 1987. Determinants of multisensory integration in superior col-
liculus neurons. I. Temporal factors. The Journal of Neuroscience 7(10):3212–3229.
Miyazaki, M., S. Yamamoto, S. Uchida, and S. Kitazawa. 2006. Bayesian calibration of simultaneity in tactile
temporal order judgment. Nature Neuroscience 9:875–877.
Musacchia, G., and C.E. Schroeder. 2009. Neural mechanisms, response dynamics and perceptual functions of
multisensory interactions in auditory cortex. Hearing Research 285:72–79.
Navarra, J., A. Vatakis, M. Zampini, S. Soto-Faraco, W. Humphreys, and C. Spence. 2005. Exposure to asyn-
chronous audiovisual speech extends the temporal window for audiovisual integration. Cognitive Brain
Research 25:499–507.
Navarra, J., S. Soto-Faraco, and C. Spence. 2006. Adaptation to audiovisual asynchrony. Neuroscience Letters
431:72–76.
Nickalls, R.W.D. 1996. The influences of target angular velocity on visual latency difference determined using
the rotating Pulfrich effect. Vision Research 36:2865–2872.
Ohl, F.W., H. Scheich, and W.J. Freeman. 2000. Topographic analysis of epidural pure-tone-evoked potentials
in gerbil auditory cortex. Journal of Neurophysiology 83:3123–3132.
Ohl, F.W., H. Scheich, and W.J. Freeman. 2001. Change in pattern of ongoing cortical activity with auditory
learning. Nature 412:733–736.
Posner, M.I., C.R.R. Snyder, and B.J. Davidson. 1980. Attention and the detection of signals. Journal of
Experimental Psychology: General 109(2):160–174.
Robson, J.G., S.M. Saszik, J. Ahmed, and L.J. Frishman. 2003. Rod and cone contributions to the a-wave of the
electroretinogram of the macaque. Journal of Physiology 547:509–530.
Rodriguez, E., N. Georg, J.P. Lachaux, J. Martinerie, B. Renault, and F.J. Varela. 1999. Perception’s shadow:
Long-distance synchronization of neural activity. Nature 397:430–433.
Roelfsema, P.R., A.K. Engel, P. König, and W. Singer. 1997. Visuomotor integration is associated with zero
time-lag synchronization among cortical areas. Nature 385:157–161.
Roelfsema, P.R., V.A.F. Lamme, and H. Spekreijse. 1998. Object-based attention in the primary visual cortex
of the macaque monkey. Nature 395:377–381.
Schlögl, A. 2006. A comparison of multivariate autoregressive estimators. Signal Processing 86:2426–2429.
Senkowski, D., D. Talsma, M. Grigutsch, C.S. Herrmann, and M.G. Woldorff. 2007. Good times for multisen-
sory integration: Effects of the precision of temporal synchrony as revealed by gamma band oscillations.
Neuropsychologia 45:561–571.
Stone, J.V. 2001. Where is now? Perception of simultaneity. Proceedings of the Royal Society of London. Series
B. Biological Sciences 268:31–38.
Sugita, Y., and Y. Suzuki. 2003. Audiovisual perception. Implicit evaluation of sound arrival time. Nature
421:911.
Sutton, S., M. Braren, J. Subin, and E.R. John. 1965. Evoked potential correlates of stimulus uncertainty.
Science 150:1178–1188
Varela, F., J. Lacheaux, E. Rodriguez, and J. Martinerie. 2001. The brain-web: Phase synchronization and large-
scale integration. Nature Reviews Neuroscience 2:229–239.
Vatakis, A., and C. Spence. 2007. Crossmodal binding: Evaluating the influence of the ‘unity assumption’ using
audiovisual speech stimuli. Perception & Psychophysics 69(5):744–756.
Vatakis, A., and C. Spence. 2008. Evaluating the influence of the ‘unity assumption’ on the temporal perception
of realistic audiovisual stimuli. Acta Psychologica 127:12–23.
Vatakis, A., J. Navarra, S. Soto-Faraco, and C. Spence. 2007. Temporal recalibration during asynchronous
audiovisual speech perception. Experimental Brain Research 181:173–181.
Vatakis, A., J. Navarra, S. Soto-Faraco, and C. Spence. 2008. Audiovisual temporal adaptation of speech:
Temporal order versus simultaneity judgments. Experimental Brain Research 185:521–529.
Von Békésy, G. 1963. Interaction of paired sensory stimuli and conduction of peripheral nerves. Journal of
Applied Physiology 18:1276–1284.
Von Stein, A., C. Chiang, and P. König. 2000. Top-down processing mediated by interarea synchronization.
Proceedings of the National Academy of Science of the United States of America 97:14748–14753.

Vroomen, J., M. Keetels, B. de Gelder, and P. Bertelson. 2004. Recalibration of temporal order perception by
exposure to audio-visual asynchrony. Cognitive Brain Research 22:32–35.
Welch, R.B. 1999. Meaning, attention and the unity assumption in the intersensory bias of spatial and tem-
poral perceptions. In Cognitive contributions to the perception of spatial and temporal events, ed. G.
Achersleben, T. Bachmann, and J. Müsseler, 371–387. Amsterdam: Elsevier.
Welch, R.B., and D.H. Warren. 1980. Immediate perceptual response to intersensory discrepancy. Psychological
Bulletin 88:638–667.
Wilson, J.A., and S.M. Anstis. 1996. Visual delay as a function of luminance. American Journal of Psychology
82:350–358.
17 Development of Multisensory
Temporal Perception
David J. Lewkowicz

CONTENTS
17.1 Introduction........................................................................................................................... 325
17.2 Perception of Multisensory Temporal Information and Its Coherence................................. 326
17.3 Developmental Emergence of Multisensory Perception: General Patterns and Effects
of Experience......................................................................................................................... 327
17.4 Perception of Temporal Information in Infancy.................................................................... 330
17.5 Perception of A–V Temporal Synchrony............................................................................... 331
17.5.1 A–V Temporal Synchrony Threshold........................................................................ 331
17.5.2 Perception of A–V Speech Synchrony and Effects of Experience............................ 332
17.5.3 Binding of Nonnative Faces and Vocalizations......................................................... 334
17.6 Perception of Multisensory Temporal Sequences in Infancy................................................ 336
17.7 Speculations on Neural Mechanisms Underlying the Development of Multisensory
Perception.............................................................................................................................. 338
References....................................................................................................................................... 339

17.1  INTRODUCTION
The objects and events in our external environment provide us with a constant flow of multisensory
information. Such an unrelenting flow of information might be potentially confusing if no mecha-
nisms were available for its integration. Fortunately, however, sophisticated multisensory integra-
tion* mechanisms have evolved across the animal kingdom to solve this problem (Calvert et al.
2004; Ghazanfar and Schroeder 2006; Maier and Schneirla 1964; Marks 1978; Partan and Marler
1999; Rowe 1999; Stein and Meredith 1993; Stein and Stanford 2008; Welch and Warren 1980).
These mechanisms enable mature organisms to integrate multisensory inputs and, in the process,
make it possible for them to perceive the coherent nature of their multisensory world.
The other chapters in this volume discuss the structural and functional characteristics of multi-
sensory processing and integration mechanisms in adults. Here, I address the developmental ques-
tion by asking (1) when do multisensory response mechanisms begin to emerge in development,
and (2) what specific processes underlie their emergence? To answer these questions, I discuss our
work on the development of multisensory processing of temporal information and focus primar-
ily on human infants. I show that processing of multisensory temporal information, as well as the

* Historically, the term “integration,” when used in the context of work on multisensory processing, has been used to refer
to different processes by different researchers (Stein et al. 2010). For some, this term is reserved for cases in which sen-
sory input in one modality changes the qualitative experience that one has in response to stimulation in another modality,
as is the case in the McGurk effect (McGurk and MacDonald 1976). For others, it has come to be associated with neural
and behavioral responsiveness to near-threshold stimulation in one modality either being enhanced or suppressed by
stimulation in another modality (Stein and Stanford 2008). Finally, for some investigators, integration has simply meant
the process that enables perceivers to detect and respond to the relational nature of multisensory stimulation and no
assumptions were made about underlying perceptual or neural mechanisms. It is this last meaning that is used here.


processing of other types of multisensory information, emerges gradually during the first year of
life, argue that the rudimentary multisensory processing abilities found at the beginning of life
reflect neural/behavioral immaturities and the relative lack of perceptual and sensorimotor experi-
ence, and provide evidence that the gradual improvement in multisensory processing ability reflects
the interaction between behavioral and (implied) neural maturation and perceptual experience.

17.2 PERCEPTION OF MULTISENSORY TEMPORAL


INFORMATION AND ITS COHERENCE
The temporal dimension of our everyday experience is an inescapable and fundamental part of
our perceptual and cognitive existence (Fraisse 1982; Greenfield 1991; Lashley 1951; Martin 1972;
Nelson 1986, 2007). The temporal flow of stimulation provides a host of perceptual cues that observ-
ers can use to detect the coherence, global structure, and even the hidden meanings inherent in mul-
tisensory events. For example, when people speak, they produce a series of mouth movements and
vocalizations. At a basic level, the onsets and offsets of mouth movements and the accompanying
vocalizations are precisely synchronized. This allows observers to determine that the movements
and vocalizations are part of a coherent multisensory event. Of course, detecting that multisensory
inputs correspond in terms of their onsets and offsets is not terribly informative because it does not
provide any information about the other key and overlapping characteristics of multisensory events.
For example, in the case of audiovisual speech, synchrony does not provide information about the
invariant durations of the audible and visible utterances nor about the correlated dynamic temporal
and spectral patterns across audible and visible articulations that are normally available and used
by adults (Munhall and Vatikiotis-Bateson 2004; Yehia et al. 1998). These latter perceptual cues
inform the observer about the amodal invariance of the event and, thus, serve as another important
basis for the perception of multisensory coherence.* Finally, at an even more global level, the tem-
poral patterning (i.e., rhythm) of the audible and visible attributes of a continuous utterance (i.e.,
a string of words) not only can provide another important basis for the perception of multisensory
coherence but can also provide cues to “hidden” meanings. The hidden meanings derive from the
particular ordering of the different constituents (e.g., syllables, words, and phrases) and when those
constituents are specified by multisensory attributes, their extraction can be facilitated by multisen-
sory redundancy effects (Bahrick et al. 2004; Lewkowicz and Kraebel 2004).
As might be expected, adults are highly sensitive to multisensory temporal information. This is
evident from the results of studies showing that adults can perceive temporally based multisensory
coherence (Gebhard and Mowbray 1959; Handel and Buffardi 1969; Myers et al. 1981; Shipley
1964; Welch et al. 1986). It is also evident from the results of other studies showing that adults’
responsiveness to the temporal aspects of stimulation in one sensory modality can influence their
responsiveness to the temporal aspects of stimulation in another modality. For example, when adults
hear a fluttering sound, their perception of a flickering light changes as a function of the frequency
of the flutter; the flutter “drives” the flicker (Myers et al. 1981). Particularly interesting are findings
showing that some forms of temporal intersensory interaction can produce illusions (Sekuler et al.
1997; Shams et al. 2000) or can influence the strength of illusions (Slutsky and Recanzone 2001). For
example, when adults see a single flash and hear two tones, they report two flashes even though they
know that there is only a single flash (Shams et al. 2000). Similarly, when two identical objects are
seen moving toward and then through each other and a brief sound is presented at the point of their
coincidence, adults report that the objects bounce against each other rather than pass through one
another (Sekuler et al. 1997). This “bounce” illusion emerges in infancy in that starting at 6 months
of age, infants begin to exhibit evidence that they experience it as well (Scheier et al. 2003).

* There is a functionally important distinction between intersensory cues such as duration, tempo, and rhythm, on the one
hand, and intersensory temporal synchrony cues, on the other. The former are all amodal stimulus attributes because they
can be specified independently in different modalities and, as a result, can be perceived even in the absence of temporal
synchrony cues (e.g., even if the auditory and visual attributes of a speech utterance are not presented together, their
equal duration can be perceived). In contrast, temporal synchrony is not an amodal perceptual cue because it cannot be
specified independently in a single sensory modality; an observer must have access to the concurrent information in the
different modalities to perceive it. Moreover, and especially important for developmental studies, infants might be able to
perceive intersensory synchrony relations without being able to perceive the amodal cues that characterize the multisen-
sory attributes (e.g., an infant might be able to perceive that a talking face and the vocalizations that it produces belong
together but may not be able to detect the equal duration of the visible and audible articulations).

Even though the various amodal and invariant temporal attributes are natural candidates for
the perception of multisensory coherence, there are good a priori theoretical reasons to expect that
intersensory temporal synchrony might play a particularly important role during the earliest stages of
development (Gibson 1969; Lewkowicz 2000a; Thelen and Smith 1994) and that young infants may
not perceive the kinds of higher-level amodal invariants mentioned earlier. One reason for this may be
the fact that, unlike in the case of the detection of higher-level amodal invariants, it is relatively easy to
detect multisensory temporal synchrony relations. All that is required is the detection of the concur-
rent onsets and offsets of stimulus energy across modalities. In contrast, the detection of amodal cues
requires the ability to perceive the equivalence of some of the higher-level types of correlated patterns
of information discussed earlier. Moreover, observers are sometimes required to detect such patterns
even when they are not concurrently available, and adults can indeed do so (Kamachi et al. 2003). Infants
also exhibit this ability, but the available evidence indicates that they can do so only starting at 6 months
of age (Pons et al. 2009); no studies have shown that they can perform this kind of task earlier.
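The computational contrast drawn above can be made concrete with a minimal sketch (in Python, and
not a model taken from any of the work discussed here): a low-level synchrony detector needs only the
onset and offset times of stimulus energy in each modality, whereas an amodal (duration) matcher compares
a pattern property that survives temporal misalignment. The 100 ms tolerance and the event times below
are arbitrary illustrative values.

def onsets_offsets_synchronous(audio, visual, tolerance_ms=100.0):
    # Low-level check: do stimulus energy onsets and offsets coincide
    # (within a tolerance) across the two modalities?
    a_on, a_off = audio
    v_on, v_off = visual
    return abs(a_on - v_on) <= tolerance_ms and abs(a_off - v_off) <= tolerance_ms

def durations_equivalent(audio, visual, tolerance_ms=100.0):
    # Higher-level amodal check: are the two durations equivalent, regardless
    # of whether the two streams are presented at the same time?
    return abs((audio[1] - audio[0]) - (visual[1] - visual[0])) <= tolerance_ms

# A visible utterance and its audio track, with the audio shifted by 500 ms:
visible = (0.0, 800.0)      # (onset, offset) in ms
audible = (500.0, 1300.0)   # same 800 ms duration, but delayed

print(onsets_offsets_synchronous(audible, visible))  # False: onsets/offsets do not coincide
print(durations_equivalent(audible, visible))        # True: equal 800 ms durations

In this toy case the shifted audio fails the synchrony check but passes the duration check, which is
why duration counts as an amodal cue that can, in principle, be detected without concurrent presentation.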
Although young infants’ presumed inability to perceive amodal cues might seem like a serious
limitation, it has been argued by some that developmental limitations actually serve an important
function (Oppenheim 1981). With specific regard to multisensory functions, Turkewitz has argued
that sensory limitations help infants organize their perceptual world in an orderly fashion while at
the same time not overwhelming their system (Turkewitz 1994; Turkewitz and Kenny 1982). From
this perspective, the ability to detect temporal synchrony cues very early in life makes it possible
for young, immature, and inexperienced infants to first discover that multisensory inputs cohere
together, albeit at a very low level. This, in turn, gives them an entrée into a multisensory world
composed not only of the various higher-level amodal invariants mentioned earlier but other higher-
level nontemporal multisensory attributes such as gender, affect, and identity. Most theorists agree
that the general processes that mediate this gradual improvement in multisensory processing ability
are perceptual learning and differentiation in concert with infants’ everyday experience and senso-
rimotor interactions with their multisensory world.
Extant empirical findings are generally consistent with the theoretical developmental pattern
described above. For instance, young infants can detect the synchronous onsets of inanimate visual
and auditory stimuli (Lewkowicz 1992a, 1992b, 1996) and rely on synchrony cues to perceive the
amodal property of duration (Lewkowicz 1986). Likewise, starting at birth and thereafter, infants
can detect the synchronous relationship between the audible and visible attributes of vocalizing faces
(Lewkowicz 2000b, 2010; Lewkowicz and Ghazanfar 2006; Lewkowicz et al. 2010). Interestingly,
however, when the multisensory temporal task is too complex (i.e., when it requires infants to detect
which of two objects that are moving at different tempos corresponds to a synchronous sound)
synchrony cues are not sufficient for the perception of multisensory coherence (Lewkowicz 1992a,
1994). Similarly, when the relationship between two moving objects and a sound that occurs during
their coincidence is ambiguous (as is the case in the bounce illusion), 6- and 8-month-old infants
perceive this relationship but 4-month-olds do not.

17.3 DEVELOPMENTAL EMERGENCE OF MULTISENSORY PERCEPTION: GENERAL PATTERNS AND EFFECTS OF EXPERIENCE

As indicated above, data from studies of infant response to multisensory temporal information sug-
gest that multisensory processing abilities improve during the first year of life. If that is the case,
do these findings reflect a general developmental pattern? The answer is that the same pattern holds
for infant perception of other types of multisensory perceptual cues. To make theoretical sense of
the overall body of findings on the developmental emergence of multisensory perceptual abilities in
infancy, it is helpful to first ask what the key theoretical questions are in this area. If, as indicated
earlier, infants’ initial immaturity and relative lack of experience impose serious limitations on
their ability to integrate the myriad inputs that constantly bombard their perceptual systems, how
do they go about integrating those inputs and how does this process get bootstrapped at the start of
postnatal life? As already suggested, one possible mechanism is a synchrony detection mechanism
that simply detects synchronous stimulus onsets and offsets across different modalities. This, in
turn, presumably provides developing infants with the opportunity to gradually discover increas-
ingly more complex multisensory coherence cues.
Although the detection of multisensory synchrony is one possible specific mechanism that can
mediate developmental change, other more general processes probably contribute to developmental
change as well. Historically, these more general processes have been proposed in what appear to be
two diametrically opposed theoretical views concerning the development of multisensory functions.
One of these views holds that developmental differentiation is the process underlying developmental
change, whereas the other holds that developmental integration is the key process. More specifi-
cally, the first, known as the developmental differentiation view, holds that infants come into the
world prepared to detect certain amodal invariants and that this ability improves and broadens in
scope as they grow (Gibson 1969; Thelen and Smith 1994; Werner 1973). According to the principal
proponent of this theoretical view (Gibson 1969), the improvement and broadening is mediated by
perceptual differentiation, learning, and the emergence of increasingly better stimulus detection
abilities. The second, known as the developmental integration view, holds that infants come into the
world with their different sensory systems essentially disconnected and that the senses gradually
become functionally connected as a result of children’s active interaction with their world (Birch
and Lefford 1963, 1967; Piaget 1952). One of the most interesting and important features of each of
these theoretical views is that both assign central importance to developmental experience.
A great deal of empirical evidence has been amassed since the time that the two principal theo-
retical views on the development of multisensory functions were proposed. It turns out that some of
this evidence can be interpreted as consistent with the developmental differentiation view whereas
some of it can be interpreted as consistent with the developmental integration view. Overall, then,
it seems that both processes play a role in the developmental emergence of multisensory functions.
The evidence that is consistent with the developmental differentiation view comes from studies
showing that despite the fact that the infant nervous system is highly immature and, despite the fact
that infants are perceptually inexperienced, infants exhibit some multisensory perceptual abilities
from birth onward (Gardner et al. 1986; Lewkowicz et al. 2010; Lewkowicz and Turkewitz 1980,
1981; Slater et al. 1997, 1999). Importantly, however, and as indicated earlier, these abilities are
relatively rudimentary. For example, newborns can detect multisensory synchrony cues and do so
by detecting nothing more than stimulus energy onsets and offsets (Lewkowicz et al. 2010). In
addition, newborns are able to detect audiovisual (A–V) intensity equivalence (Lewkowicz and
Turkewitz 1980) and can associate arbitrary auditory and visual object attributes on the basis of
their synchronous occurrence (Slater et al. 1997, 1999). Although impressive, these kinds of findings
are not surprising given that there are ample opportunities for intersensory interactions—especially
those involving the co-occurrence of sensations in different modalities—during fetal life and that
these interactions are likely to provide the foundation for the kinds of rudimentary multisensory
perceptual abilities found at birth (Turkewitz 1994).
Other evidence from the body of empirical work amassed to date is consistent with the develop-
mental integration view by indicating that multisensory perceptual abilities improve as infants grow
and acquire perceptual experience (Bremner et al. 2008; Lewkowicz 1994, 2000a, 2002; Lickliter
and Bahrick 2000; Walker-Andrews 1997). This evidence shows that older infants possess more
sophisticated multisensory processing abilities than do younger infants. For example, young infants
can perceive multisensory synchrony cues (Bahrick 1983; Bahrick and Lickliter 2000; Lewkowicz
1992a,b, 1996, 2000b, 2003, 2010), amodal intensity (Lewkowicz and Turkewitz 1980), amodal
duration (Lewkowicz 1986), and the multisensory invariance of isolated audible and visible pho-
nemes (Brookes et al. 2001; Kuhl and Meltzoff 1982, 1984; Patterson and Werker 2003). In contrast,
however, older infants (roughly older than 6 months of age), but not younger ones, also
exhibit the ability to perceive amodal affect produced by strangers (Walker-Andrews 1986) and
amodal gender (Patterson and Werker 2002; Walker-Andrews et al. 1991), bind arbitrary modality-
specific cues (Bahrick 1994; Reardon and Bushnell 1988), integrate auditory and visual spatial cues
in an adult-like manner (Neil et al. 2006), and integrate multisensory spatial bodily and external
cues (Bremner et al. 2008). Considered together, this latter body of findings clearly shows that mul-
tisensory perceptual abilities improve over the first year of life. Thus, when all the extant empirical
evidence is considered together, it is clear that developmental differentiation and developmental
integration processes operate side-by-side in early human development and that both contribute to
the emergence of multisensory perceptual abilities in infancy and probably beyond.
If developmental differentiation and integration both contribute to the development of multisen-
sory perception, what role might experience play in this process? As might be expected (Gibson
1969), evidence from studies of human infants indicates that experience plays a critical role in the
development of multisensory functions. Until now, however, very little direct evidence for the effects
of early experience was available at the human level except for two studies that together demon-
strated that infant response to amodal affect information depends on the familiarity of the informa-
tion. Thus, in the first study, Walker-Andrews (1986) found that 7-month-olds but not 5-month-olds
detected amodal affect when the affect was produced by a stranger. In the second study, Kahana-
Kalman and Walker-Andrews (2001) found that when the affect was produced by the infant’s own
mother, infants as young as 3.5 months of age detected it.
More recently, my colleagues and I have discovered a particularly intriguing and seemingly
paradoxical effect of experience on the development of multisensory responsiveness. We have dis-
covered that some multisensory perceptual functions are initially present early in life and then
decline as infants age. This multisensory perceptual narrowing phenomenon was not predicted by
either the developmental differentiation or the developmental integration view. In these recent stud-
ies, we have found that infants between birth and 6 months of age can match monkey faces and the
vocalizations that they produce but that older infants no longer do so (Lewkowicz and Ghazanfar
2006; Lewkowicz et al. 2008, 2010). In addition, we have found that 6-month-old infants can match
visible and audible phonemes regardless of whether these phonemes are functionally relevant in
their own language or in other languages (Pons et al. 2009). Specifically, we found that 6-month-old
Spanish-learning infants can match a visible /ba/ to an audible /ba/ and a visible /va/ to an audible
/va/, whereas 11-month-old Spanish-learning infants no longer do so. In contrast, English-learning
infants can make such matches at both ages. The failure of the older Spanish-learning infants to
make the matches is correlated with the fact that the /ba/ – /va/ phonetic distinction is not phonemi-
cally functional in Spanish. This means that when older Spanish-learning infants have to choose
between a face mouthing a /ba/ and a face mouthing a /va/ after having listened to one of these
phonemes, they cannot choose the matching face because the phonemes are no longer distinct for
them. Together, our findings on multisensory perceptual narrowing indicate that as infants grow
and gain experience with vocalizing human faces and with native language audiovisual phonology,
their ability to perceive cross-species and cross-language multisensory coherence declines because
nonnative multisensory information is not relevant for everyday functioning.
We have also explored the possible evolutionary origins of multisensory perceptual narrowing
and, thus far, have found that it seems to be restricted to the human species. We tested young vervet
monkeys, at ages when they are old enough to be past the point of narrowing, with the same vocal-
izing rhesus monkey faces that we presented in our initial infant studies and found that vervets
do not exhibit multisensory perceptual narrowing (Zangenehpour et al. 2009). That is, the vervets
matched rhesus monkey visible and audible vocalizations even though they were past the point when
narrowing should have occurred. We interpreted this finding as reflecting the fact that monkey
brains mature four times as fast as human brains do and that, as a result, young vervets are less open
to the effects of early experience than are human infants. This interpretation suggests that experi-
ence interacts with the speed of neural growth and differentiation and that slower brain growth and
differentiation is highly advantageous because it provides for greater developmental plasticity.
The vervet monkey study demonstrates that the rate of neural growth plays an important role
in the development of behavioral functions and provides yet another example illustrating this key
developmental principle (Turkewitz and Kenny 1982). What about neural and experiential immatu-
rity, especially at the beginning of postnatal and/or posthatching life? Do other organisms, besides
humans, manifest relatively poor and immature multisensory processing functions? The answer
is that they do. A number of studies have found that the kinds of immaturities and developmental
changes observed in human infants are also found in the young of other species. Together, these
studies have found that rats, cats, and monkeys exhibit relatively poor multisensory responsiveness
early in life, that its emergence follows a pattern of gradual improvement, and that early experience
plays a critical role in this process. For example, Wallace and Stein (1997, 2001) have found that
multisensory cells in the superior colliculus of cats and rhesus monkeys, which normally integrate
auditory and visual spatial cues in the adult, do not integrate in newborn cats and monkeys, and that
integration only emerges gradually over the first weeks of life. Moreover, Wallace et al. (2006) have
found that the appropriate alignment of the auditory and visual maps in the superior colliculus of
the rat depends on their normal spatial coregistration. The same kinds of effects have been found in
barn owls and ferrets, in which calibration of the precise spatial tuning of the neural map of auditory
space depends critically on concurrent visual input (King et al. 1988; Knudsen and Brainard 1991).
Finally, in bobwhite quail hatchlings, the ability to respond to the audible and visible attributes of
the maternal hen after hatching depends on prehatching and posthatching experience with the audi-
tory, visual, and tactile stimulation arising from the embryo’s own vocalizations, the maternal hen,
and broodmates (Lickliter and Bahrick 1994; Lickliter et al. 1996).
Taken together, the human and animal data indicate that the general developmental pattern con-
sists of an initial emergence of low-level multisensory abilities, a subsequent experience-dependent
improvement of emerging abilities, and finally, the emergence of higher-level multisensory abilities.
This developmental pattern, especially in humans, appears to be due to the operation of develop-
mental differentiation and developmental integration processes. Moreover, and most intriguing, our
recent discovery of multisensory perceptual narrowing indicates that even though young infants
possess relatively crude and low-level types of multisensory perceptual abilities (i.e., sensitivity to
A–V synchrony relations), these abilities imbue them with much broader multisensory perceptual
tuning than is the case in older infants. As indicated earlier, the distinct advantage of this kind of
tuning is that it provides young infants with a way of bootstrapping their multisensory perceptual
abilities at a time when they are too immature and inexperienced to extract higher-level amodal
attributes.
In the remainder of this chapter, I review results from our studies on infant response to multi-
sensory temporal information as an example of the gradual emergence of multisensory functions.
Moreover, I review additional evidence of the role of developmental differentiation and integration
processes as well as of early experience in the emergence of multisensory responsiveness. Finally,
I speculate on the neural mechanisms that might underlie the developmental emergence of multi-
sensory perception and highlight the importance of studying the interaction between neural and
behavioral growth and experience.

17.4  PERCEPTION OF TEMPORAL INFORMATION IN INFANCY


As indicated earlier, the temporal dimension of stimulation is the multisensory attribute par excel-
lence because it provides observers with various types of overlapping patterns of multisensory
information. For infants, this means that they have a ready-made and powerful basis for coherent
and cognitively meaningful multisensory experiences. This, of course, assumes that they are sensi-
tive to the temporal flow of information in each modality. Indeed, evidence indicates that infants are
sensitive to temporal information at both the unisensory and multisensory levels. For example, it has
been found that infants as young as 3 months of age can predict the occurrence of a visual stimulus
at a particular location based on their prior experience with a temporally predictable pattern of spa-
tiotemporally alternating visual stimuli (Canfield and Haith 1991; Canfield et al. 1997). Similarly,
it has been found that 4-month-old infants can quickly learn to detect a “missing” visual stimulus
after adaptation to a regular and predictable visual stimulus regimen (Colombo and Richman 2002).
In the auditory modality, studies have shown that newborn infants (1) exhibit evidence of temporal
anticipation when they hear a tone that is not followed by glucose—after the tone (CS) and the glu-
cose (UCS) were paired during an initial conditioning phase (Clifton 1974) and (2) can distinguish
between different classes of linguistic input on the basis of the rhythmic attributes of the auditory
input (Nazzi and Ramus 2003). Finally, in the audiovisual domain, it has been found that 7-month-
old infants can anticipate the impending presentation of an audiovisual event when they first hear a
white noise stimulus that has previously reliably predicted the occurrence of the audiovisual event
(Donohue and Berg 1991), and that infants’ duration discrimination improves between 6 and 10
months of age (Brannon et al. 2007). Together, these findings indicate that infants are generally
sensitive to temporal information in the auditory and visual modalities.

17.5  PERCEPTION OF A–V TEMPORAL SYNCHRONY


Earlier, it was indicated that the multisensory world consists of patterns of temporally coincident
and amodally invariant information (Gibson 1966) and that infants are likely to respond to A–V
temporal synchrony relations from an early age. There are two a priori reasons why this is the case.
The perceptual basis for this has already been mentioned, namely, that the detection of temporal
A–V synchrony is relatively easy because it only requires perception of synchronous energy onsets
and offsets in different modalities. In addition, the neural mechanisms underlying the detection of
intersensory temporal synchrony cues in adults are relatively widespread in the brain and are largely
subcortical (Bushara et al. 2001). Given that at least some of these mechanisms are subcortical, this
makes it likely that such mechanisms are also present and operational in the immature brain.
Consistent with these expectations, results from behavioral studies have shown that, starting
early in life, infants respond to A–V temporal synchrony and that this cue is primary for them.
These results have revealed (1) that 6- and 8-month-old infants can match pulsing auditory and
flashing static visual stimuli on the basis of their duration but only if the matching pair is also syn-
chronous (Lewkowicz 1986); (2) that 4- and 8-month-old infants can match an impact sound to one
of two bouncing visual stimuli on the basis of synchrony but not on the basis of tempo, regardless
of whether the matching tempos are synchronous or not (Lewkowicz 1992a, 1994); (3) that 4- to
8-month-old infants can perceive A–V synchrony relations inherent in simple audiovisual events
consisting of bouncing/sounding objects (Lewkowicz 1992b) as well as those inherent in vocalizing
faces (Lewkowicz 2000b, 2003); and (4) that newborns (Lewkowicz et al. 2010) and 4- to 6-month-
old infants (Lewkowicz and Ghazanfar 2006) can rely on A–V synchrony to match other species’
facial and vocal expressions.

17.5.1  A–V Temporal Synchrony Threshold


Given the apparently primary importance of A–V temporal synchrony, Lewkowicz (1996) con-
ducted a series of studies to investigate the threshold for the detection of A–V temporal asynchrony
in 2-, 4-, 6-, and 8-month-old infants and compared it to that in adults tested in a similar manner.
Infants were first habituated to a two-dimensional object that could be seen bouncing up and down
on a computer monitor and an impact sound that occurred each time the object changed direction
at the bottom of the monitor. They were then given a set of separate test trials during which the
impact sound was presented 150, 250, and 350 ms before the object’s visible bounce (sound-first
group) or 250, 350, or 450 ms after the visible bounce (sound-second group). Infants in the sound-
first group detected the 350 ms asynchrony, whereas infants in the sound-second group detected the
450 ms asynchrony (no age effects were found). Adults, who were tested in a similar task and with
the same stimuli, detected an asynchrony of 80 ms in the sound-first condition and 112 ms in the
sound-second condition. Conceptualizing these results in terms of an intersensory temporal conti-
guity window (ITCW), they indicate that the ITCW is wider in infants than it is in adults and that it
decreases in size during development.
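As a rough illustration of this developmental difference, the following sketch treats the detection
values reported above as approximate edges of an asymmetric ITCW. This is a simplification (the true
thresholds lie somewhere below the smallest asynchrony that was reliably detected), and the code is
illustrative rather than a model used in these studies.

# Approximate ITCW edges (ms); negative lag means the sound precedes the visible impact.
ITCW_MS = {
    "infant": {"sound_first": 350, "sound_second": 450},
    "adult":  {"sound_first": 80,  "sound_second": 112},
}

def perceived_as_synchronous(audio_lag_ms, observer="infant"):
    # Classify an audiovisual lag as falling inside or outside the observer's window.
    window = ITCW_MS[observer]
    if audio_lag_ms < 0:
        return -audio_lag_ms < window["sound_first"]
    return audio_lag_ms < window["sound_second"]

for lag in (-350, -150, 0, 250, 450):
    print(f"lag {lag:5d} ms: infant={perceived_as_synchronous(lag, 'infant')}, "
          f"adult={perceived_as_synchronous(lag, 'adult')}")

For example, a 250 ms lag falls inside the hypothetical infant window but well outside the adult one,
which is the sense in which the infant window is wider.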

17.5.2  Perception of A–V Speech Synchrony and Effects of Experience


In subsequent studies, we found that the ITCW is substantially larger for multisensory speech than
for abstract nonspeech events. In the first of these studies (Lewkowicz 2000b), we habituated 4-,
6-, and 8-month-old infants to audiovisually synchronous syllables (/ba/ or /sha/) and then tested
their response to audiovisually asynchronous versions of these syllables (sound-first condition only)
and found that, regardless of age, infants only detected an asynchrony of 666 ms (in pilot work,
we tested infants with much lower asynchronies but did not obtain discrimination). We then repli-
cated the finding of such a high discrimination threshold in a subsequent study (Lewkowicz 2003)
in which we found that 4- to 8-month-old infants detected an asynchrony of 633 ms. It should be
noted that other than our pilot work, these two studies only tested infants with one degree of A–V
asynchrony. In other words, we did not formally investigate the size of the ITCW for audiovisual
speech events until more recently.
We investigated the size of the ITCW in our most recent studies (Lewkowicz 2010). In addition,
in these studies, we examined the effects of short-term experience on the detection of A–V temporal
synchrony relations and the possible mechanism underlying the detection of A–V synchrony rela-
tions. To determine the size of the ITCW, in Experiment 1, we habituated 4- to 10-month-old infants
to an audiovisually synchronous syllable and then tested for their ability to detect three increasingly
greater levels of asynchrony (i.e., 366, 500, and 666 ms). Infants exhibited response recovery to the
666 ms asynchrony but not to the other two asynchronies, indicating that the threshold was located
between 500 and 666 ms (see Figure 17.1).
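The discrimination logic behind such habituation results can be summarized with a small sketch:
recovery of looking on a novel (asynchronous) test trial relative to the familiar (synchronous) test
trial is taken as evidence of detection. The looking times below are fabricated placeholder values, and
the simple paired comparison is meant only to illustrate the logic, not the analyses actually reported.

import math
from statistics import mean, stdev

familiar_s = [5.2, 6.1, 4.8, 7.0, 5.5, 6.3]    # looking time (s), familiar (synchronous) test trial
novel_s    = [9.8, 8.7, 7.9, 11.2, 9.1, 10.4]  # looking time (s), novel (asynchronous) test trial

# Response recovery = looking on the novel trial minus looking on the familiar trial, per infant.
diffs = [n - f for n, f in zip(novel_s, familiar_s)]

# One-sample t statistic on the difference scores (equivalent to a paired t test).
t = mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))
print(f"mean recovery = {mean(diffs):.2f} s, t({len(diffs) - 1}) = {t:.2f}")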
Prior studies in adults have shown that when they are first tested with audiovisually asynchro-
nous events, they perceive them as asynchronous. If, however, they are first given short-term expo-
sure to an asynchronous event and are tested again for detection of asynchrony, they now respond
to such events as if they are synchronous (Fujisaki et al. 2004; Navarra et al. 2005; Vroomen et
al. 2004). In other words, short-term adaptation to audiovisually asynchronous events appears to
widen the ITCW in adults. One possible explanation for this adaptation effect is that it is partly
due to an experience-dependent synchrony bias that develops during adults’ lifetime of experience
with exclusively synchronous audiovisual events. This bias presumably leads to the formation of an
audiovisual “unity assumption” (Welch and Warren 1980). If that is the case, then infants might not
exhibit an adaptation effect, given their relatively limited overall experience with
synchronous multisensory events and the absence of a unity assumption. More specifically, infants
may not exhibit a widening of their ITCW after habituation to an asynchronous audiovisual event.
If so, rather than fail to discriminate between the asynchronous event and those that are physically
less synchronous, infants may actually exhibit a decrease in the size of the ITCW and, hence, even
better discrimination. To test this possibility, in Experiment 2, we habituated a new group of 4- to
10-month-old infants to an asynchronous syllable (A–V asynchrony was 666 ms) and then tested
them for the detection of decreasing levels of asynchrony (i.e., 500, 366, and 0 ms). As predicted,
this time, infants not only discriminated between the 666 ms asynchrony and temporal synchrony
(0 ms), but they also discriminated between the 666 ms asynchrony and an asynchrony of 366 ms
(see Figure 17.2). That is, short-term adaptation with a discriminable A–V asynchrony produced a
decrease, rather than an increase, in the size of the ITCW. These results show that in the absence
of a unity assumption, short-term exposure to an asynchronous multisensory event does not cause
infants to treat it as synchronous but rather focuses their attention on the event’s temporal attributes
and, in the process, sharpens their perception of A–V temporal relations.

FIGURE 17.1  Mean duration of looking during test trials in response to each of three different A–V tem-
poral asynchronies after habituation to a synchronous audiovisual syllable. Error bars indicate standard error
of mean and asterisk indicates that response recovery in that particular test trial was significantly higher than
response obtained in the familiar test trial (Fam-0 ms.).

Finally, to investigate the mechanisms underlying A–V asynchrony detection, in Experiment
3, we habituated infants to a synchronous audiovisual syllable and then tested them again for the
detection of asynchrony with audiovisual asynchronies of 366, 500, and 666 ms. This time, how-
ever, the test stimuli consisted of a visible syllable and a 400 Hz tone rather than the audible syllable.


FIGURE 17.2  Mean duration of looking during test trials in response to each of three different A–V tempo-
ral asynchronies after habituation to an asynchronous audiovisual syllable. Error bars indicate standard error
of mean and asterisks indicate that response recovery in those particular test trials was significantly higher
than response obtained in the familiar test trial (Fam-666 ms.).

FIGURE 17.3  Mean duration of looking during test trials in response to each of three different A–V tempo-
ral asynchronies after habituation to an audiovisual stimulus consisting of a visible syllable and a synchronous
tone. Error bars indicate standard error of mean and asterisk indicates that response recovery in that particular
test trial was significantly higher than response obtained in the familiar test trial (Fam-0 ms.).

Substituting the tone for the acoustic part of the syllable was done to determine whether the dynamic
variations in the spectral energy inherent in the acoustic part of the audiovisual speech signal and/
or their correlation with the dynamic variations in gestural information contribute to infant detec-
tion of A–V speech synchrony relations. Once again, infants detected the 666 ms asynchrony but
not the two lower ones (see Figure 17.3). The fact that these findings replicated the findings from
Experiment 1 indicates that infants rely neither on acoustic spectral energy nor on its correlation
with the dynamic variations in the gestural information to detect A–V speech synchrony relations.
Rather, it appears that infants attend primarily to energy onsets and offsets when processing A–V
speech synchrony relations, suggesting that detection of such relations is not likely to require the
operation of higher-level neural mechanisms.

17.5.3  Binding of Nonnative Faces and Vocalizations


Given that energy onsets and offsets provide infants with sufficient information regarding the tem-
poral alignment of auditory and visual inputs, the higher-level perceptual features of the stimulation
in each modality are probably irrelevant to them. This is especially likely early in life, when the
nervous system and the sensory systems are highly immature and inexperienced. As a result, it is
possible that young infants might perceive the faces and vocalizations of other species as belonging
together as long as they are synchronous. We tested this idea by showing side-by-side videos of the
same monkey’s face producing two different visible calls on each side to groups of 4-, 6-, 8-, and
10-month-old infants (Lewkowicz and Ghazanfar 2006). During the two initial preference trials,
infants saw the faces in silence, whereas during the next two trials, infants saw the same faces and
heard the audible call that matched one of the two visible calls. The different calls (a coo and a grunt)
differed in their durations and, as a result, the matching visible and audible calls corresponded in
terms of their onsets and offsets as well as their durations. In contrast, the nonmatching ones only
corresponded in terms of their onsets. We expected that infants would look longer at the visible call
that matched the audible call if they perceived the temporal synchrony that bound them. Indeed, we
found that the two younger groups of infants matched the corresponding faces and vocalizations
but that the two older groups did not. These results indicate that young infants can rely on A–V
synchrony relations to perceive even nonnative facial gestures and accompanying vocalizations as
coherent entities. The older infants no longer do so for two related reasons. First, they gradually
shift their attention to higher-level perceptual features as a function of increasing neural growth,
maturation of their perceptual systems, and increasing perceptual experience all acting together
to make it possible for them to extract such features. Second, their exclusive and massive experi-
ence with human faces and vocalizations narrows their perceptual expertise to ecologically relevant
signals. In other words, as infants grow and as they acquire experience with vocalizing faces, they
learn to extract more complex features (e.g., gender, affect, and identity), rendering low-level syn-
chrony relations much less relevant. In addition, as infants grow, they acquire exclusive experience
with human faces and vocalizations and, as a result, become increasingly more specialized. As they
specialize, they stop responding to the faces and vocalizations of other species.
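For readers unfamiliar with the intermodal matching procedure, one common way to summarize such
data (an assumption on my part, not a description of the scoring used in the published studies) is the
proportion of total looking directed to the sound-matching face, evaluated against a chance level of .50.
The looking times below are hypothetical.

from statistics import mean

def matching_proportion(look_match_s, look_nonmatch_s):
    # Proportion of total looking time directed at the sound-matching visible call.
    total = look_match_s + look_nonmatch_s
    return look_match_s / total if total > 0 else float("nan")

# One hypothetical infant per row: (looking to matching face, looking to nonmatching face), in seconds.
trials = [(7.5, 4.1), (6.2, 5.9), (8.8, 3.4), (5.0, 4.8)]
scores = [matching_proportion(m, n) for m, n in trials]
print(f"mean proportion to matching face = {mean(scores):.2f} (chance = 0.50)")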
Because the matching faces and vocalizations corresponded not only in terms of onset and offset
synchrony but in terms of duration as well, the obvious question is whether amodal duration might
have contributed to multisensory matching. To investigate this question, we repeated the Lewkowicz
and Ghazanfar (2006) procedures in a subsequent study (Lewkowicz et al. 2008), except that this
time, we presented the monkey audible calls out of synchrony with respect to both visible calls. This
meant that the corresponding visible and audible calls were now only related in terms of their dura-
tion. Results yielded no matching in either the 4- to 6-month-old or the 8- to 10-month-old infants,
indicating that A–V temporal synchrony mediated successful matching in the younger infants. The
fact that the younger infants did not match in this study, despite the fact that the corresponding faces
and vocalizations corresponded in their durations, shows that duration did not mediate matching in
the original study. This is consistent with previous findings that infants do not match equal-duration
auditory and visual inputs unless they are also synchronous (Lewkowicz 1986).
If A–V temporal synchrony mediates intersensory matching in young infants, and if responsive-
ness to this multisensory cue depends on a basic and relatively low-level process, then it is possible
that cross-species multisensory matching emerges very early in development. To determine if that
is the case, we asked whether newborns also might be able to match monkey faces and vocaliza-
tions (Lewkowicz et al. 2010). In Experiment 1 of this study, we used the identical stimulus mate-
rials and testing procedures used by Lewkowicz and Ghazanfar (2006), and found that newborns
also matched visible and audible monkey calls. We then investigated whether successful matching
reflected matching of the synchronous onsets and offsets of the audible and visible calls. If so,
then newborns should be able to make the matches even when some of the identity information is
removed. Thus, we repeated Experiment 1, except that rather than present the natural call, we pre-
sented a complex tone in Experiment 2. To preserve the critical temporal features of the audible call,
we ensured that the tone had the same duration as the natural call and that its onsets and offsets were
synchronous with the matching visible call. Despite the absence of acoustic identity information and
the absence of a correlation between the dynamic variations in facial gesture information and the
amplitude and formant structure inherent in the natural audible call, newborns still performed suc-
cessful intersensory matching. This indicates that newborns’ ability to make cross-species matches
in Experiment 1 was based on their sensitivity to the temporally synchronous onsets and offsets of
the matching faces and vocalizations and that it was based neither on identity information nor on the
dynamic correlation between the visible and audible call features.
Together, the positive findings of cross-species intersensory matching in newborns and 4- to
6-month-old infants demonstrate that young infants are sensitive to a basic feature of their percep-
tual world, namely, stimulus energy onsets and offsets. This basic perceptual sensitivity bootstraps
newborns’ entry into the world of multisensory objects and events and enables them to perceive
them as coherent entities, regardless of their specific identity. This sensitivity is especially potent
when the visual information is dynamic. When it is not, infants do not begin to bind the auditory
and visual attributes of multisensory objects, such as color/shape and pitch, or color and taste until
the second half of the first year of life. The pervasive and fundamental role that A–V temporal syn-
chrony plays in infant perceptual response to multisensory attributes suggests that sensitivity to this
intersensory perceptual cue reflects the operation of a fundamental early perceptual mechanism.
That is, as indicated earlier, even though sensitivity to A–V temporal synchrony is mediated by rela-
tively basic and low-level processing mechanisms, it provides infants with a powerful initial percep-
tual tool for gradually discovering that multisensory objects are characterized by many other forms
of intersensory invariance. For example, once infants start to bind the audible and visible attributes
of talking faces, they are in a position to discover that faces and the vocalizations that accompany
them could also be specified by common duration, tempo, and rhythm, as well as by higher-level
amodal and invariant attributes such as affect, gender, and identity.

17.6  PERCEPTION OF MULTISENSORY TEMPORAL SEQUENCES IN INFANCY


Multisensory objects often participate in complex actions that are sequentially organized over time.
For example, when people speak, they simultaneously produce sequences of vocal sounds and cor-
related facial gestures. The syntactically prescribed order of the syllables and words imbues utter-
ances with specific meanings. Unless infants master the ability to extract the sequential structure
from such an event, they will not be able to acquire language. Because this ability is so funda-
mental to adaptive perceptual and cognitive functioning, we have investigated its developmental
emergence. When we began these studies, there was little, if any, empirical evidence on infant
perception of multisensory temporal sequences to guide our initial exploration of this issue. Prior
theoretical views claimed that sequence learning is an innate ability (Greenfield 1991; Nelson 1986)
but neither of these views specified what they meant by sequence learning abilities nor what infants
should be capable of doing in this regard. Indeed, recent empirical research on infant pattern and
sequence perception has contradicted the claim that this ability is innate and, if anything, has shown
that sequence perception and learning is a very complex skill that consists of several component
skills and that it takes several years to reach adult levels of proficiency (Gulya and Colombo 2004;
Thomas and Nelson 2001).
Although no studies have investigated sequence perception and learning at birth, studies have
shown that different sequence perception abilities, including the ability to perceive and learn adja-
cent and distant statistical relations, simple sequential rules, and ordinal position information,
emerge at different points in infancy. Thus, beginning as early as 2 months of age, infants can
learn adjacent statistical relations that link a series of looming visual shapes (Kirkham et al. 2002;
Marcovitch and Lewkowicz 2009), by 8 months, they can learn the statistical relations that link
adjacent static object features (Fiser and Aslin 2002) as well as adjacent nonsense words in a stream
of sounds (Saffran et al. 1996), and by 15 months, they begin to exhibit the ability to learn distant
statistical relations (Gómez and Maye 2005). Moreover, although infants as young as 5 months of
age can learn simple abstract temporal rules such as one specifying the order (e.g., AAB vs. ABB)
of distinct elements consisting of abstract objects and accompanying speech sounds (Frank et al.
2009), only 7.5-month-old infants can learn such rules when they are instantiated by nonsense syl-
lables (Marcus et al. 1999, 2007) and only 11-month-olds can learn simple rules instantiated by
looming objects (Johnson et al. 2009). Finally, it is not until 9 months of age that infants can track
the ordinal position of a particular syllable in a string of syllables (Gerken 2006).
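What is meant by an abstract temporal rule such as AAB versus ABB can be illustrated with a minimal
sketch: the rule is defined over the repetition structure of a sequence rather than the identity of its
elements, so the same rule can be instantiated by syllables, objects, or object/sound pairs. The items
below are arbitrary stand-ins, not stimuli from the cited studies.

def repetition_pattern(seq):
    # Map each element to the order in which it first appears, e.g., ["wo", "wo", "fe"] -> "AAB".
    labels = {}
    out = []
    for item in seq:
        if item not in labels:
            labels[item] = chr(ord("A") + len(labels))
        out.append(labels[item])
    return "".join(out)

print(repetition_pattern(["wo", "wo", "fe"]))       # AAB
print(repetition_pattern(["ga", "ti", "ti"]))       # ABB
print(repetition_pattern(["ball", "ball", "cup"]))  # AAB: same rule, different elements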
It is important to note that most of the studies of infant sequence learning have presented unisen-
sory stimuli even though most of our daily perceptual experiences are multisensory in nature. As
a result, we investigated whether the developmental pattern found thus far in the development of
sequence perception and learning differs for multisensory sequences. To do so, in some studies we
provided infants with an opportunity to learn a single audiovisual sequence consisting of distinct moving objects
and their impact sounds, whereas in others we allowed infants to learn a set of different sequences
in which each one was composed of different objects and impact sounds. Regardless of whether
infants had to learn a single sequence or multiple ones, during the habituation phase, they could see
the objects appear one after another at the top of a computer monitor and then move down toward
a ramp at the bottom of the stimulus display monitor. When the objects reached the ramp, they
made an impact sound, turned to the right, and moved off to the side and disappeared. This cycle
was repeated for the duration of each habituation trial. After habituation, infants were given test
trials during which the order of sequence elements was changed in some way and the question was
whether they detected the change.
In an initial study (Lewkowicz 2004), we asked whether infants can learn a sequence composed of
three moving/impacting objects and, if so, what aspects of that sequence they encoded. Results indi-
cated that 4-month-old infants detected serial order changes only when the changes were specified
concurrently by audible and visible attributes during the learning as well as the test phase and only
when the impact part of the event—a local event feature that was not informative about sequential
order—was blocked from view. In contrast, 8-month-old infants detected order changes regardless
of whether they were specified by unisensory or bisensory attributes and whether they could see the
impact or not. In sum, younger infants required multisensory redundancy to detect the serial order
changes whereas older infants did not. A follow-up study (Lewkowicz 2008) replicated the earlier
findings, ruled out primacy effects, and extended them by showing that even 3-month-old
infants can perceive and discriminate three-element dynamic audiovisual sequences and that they
also rely on multisensory redundancy for successful learning and discrimination. In addition, this
study showed that object motion plays an important role in that infants exhibited less robust respon-
siveness to audiovisual sequences consisting of looming rather than explicitly moving objects.
Because the changes in our two initial studies involved changes in the order of a particular object/
impact sound as well as its statistical relations vis-à-vis the other sequence elements, we investigated
the separate role of each of these sequential attributes in our most recent work (Lewkowicz and
Berent 2009). Here, we investigated directly whether 4-month-old infants could track the statis-
tical relations among specific sequence elements (e.g., AB, BC), and/or whether they could also
encode abstract ordinal position information (e.g., that B is the second element in a sequence such
as ABCD). Thus, across three experiments, we habituated infants to sequences of four moving/
sounding objects in which three of the objects and their sounds varied in their ordinal position but
in which the position of one target object/sound remained invariant (e.g., ABCD, CBDA). Figure
17.4 shows an example of one of these sequences and how they moved. We then tested whether the
infants detected a change in the target’s position. We found that infants detected an ordinal position
change only when it disrupted the statistical relations between adjacent elements, but not when the
statistical relations were controlled. Together, these findings indicate that 4-month-old infants learn
the order of sequence elements by tracking their statistical relations but not their invariant ordinal
position. When these findings are combined with the previously reviewed findings on sequence
learning in infancy, they show that different and increasingly more complex temporal sequence
learning abilities emerge during infancy. For example, they suggest that the ability to perceive and
learn the invariant ordinal position of a sequence element emerges sometime after 4 months of age.
When it emerges and what mediates its emergence is currently an open question, as are the ques-
tions about the emergence of the other more complex sequence perception and learning skills.

FIGURE 17.4  One of three different sequences presented during the habituation phase of the sequence
learning experiment (actual objects presented are shown). Each object made a distinct impact sound when it
came in contact with the black ramp. Across three different sequences, the triangle was the target stimulus
and, thus, for one group of infants, the target remained in second ordinal position during habituation phase
and then changed to third ordinal position in the test trials.
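
The distinction between tracking adjacent statistical relations and tracking invariant ordinal position
can be made concrete with a small sketch; the sequences below are hypothetical stand-ins rather than
the actual object/sound sequences and controls used in these studies.

from collections import Counter

def adjacent_pairs(seq):
    # All adjacent element pairs (bigrams) in a sequence such as "ABCD".
    return [seq[i:i + 2] for i in range(len(seq) - 1)]

# Hypothetical habituation set: the target 'B' is always in second ordinal position.
habituation = ["ABCD", "CBDA", "DBAC"]
pair_counts = Counter(p for s in habituation for p in adjacent_pairs(s))

def novel_adjacent_pairs(test_seq):
    # Adjacent pairs in the test sequence that never occurred during habituation.
    return [p for p in adjacent_pairs(test_seq) if pair_counts[p] == 0]

def target_position(seq, target="B"):
    return seq.index(target) + 1  # 1-based ordinal position

# Both test sequences move the target to third ordinal position, but only the
# first one also disrupts the adjacent statistical relations.
for test in ("ADBC", "CDBA"):
    print(test, "| target position:", target_position(test),
          "| novel adjacent pairs:", novel_adjacent_pairs(test))

In this toy case, both test sequences displace the target from second position, but only the first
introduces an adjacent pair that never occurred during habituation; it is this kind of contrast that
allows the statistical and ordinal cues to be teased apart.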

17.7 SPECULATIONS ON NEURAL MECHANISMS UNDERLYING THE DEVELOPMENT OF MULTISENSORY PERCEPTION

It is now abundantly clear that some basic multisensory processing abilities are present early in
human development, and that as infants grow and as they acquire perceptual experience, these
abilities improve. As indicated earlier, this general developmental pattern is consistent with the two
classic theoretical views because the core prediction that both views make is that multisensory
functions improve with development. Unfortunately, both views were silent about the possible neu-
ral mechanisms underlying the developmental emergence of multisensory processing. For example,
although Gibson (1969) proposed that infants are sensitive to perceptual structure and the amodal
invariants that are inherent in the structured stimulus array from birth onward, her insistence that
the information is already integrated in the external perceptual array can be interpreted to mean
that the nervous system does not play a significant role in integration. Of course, this assumption
does not square with the results from modern neurobiological studies, which clearly show that the
brain plays a crucial role in this process. Consequently, a more complete theoretical framework for
conceptualizing the development of multisensory processing is one that not only acknowledges that
the external stimulus array is highly structured but one that also admits that the perception of that
structure is intimately dependent on neural mechanisms that have evolved to detect that structure
(Ghazanfar and Schroeder 2006; Stein and Stanford 2008). In other words, perception of multi-
sensory coherence at any point in development is the joint product of the infant’s ability to detect
increasingly greater stimulus structure—because of the cumulative effects of sensory/perceptual
experience and learning—and of the increasing elaboration of neural structures and their functional
properties. The latter may not only permit the integration of multisensory inputs but sometimes
may actually induce integral perception even when stimulation in the external sensory array is only
unisensory (Romei et al. 2009). Like Gibson’s ecological view of multisensory perceptual develop-
ment, the developmental integration view also failed to specify the underlying neural mechanisms
that mediate the long-term effects of experience with the multisensory world and, thus, is subject to
similar limitations.
What possible neural mechanisms might mediate multisensory processing in early development?
Traditionally, it has been assumed that the neural mechanisms that mediate multisensory process-
ing are hierarchically organized with initial analysis being sensory-specific and only later analysis
being multisensory (presumably once the information arrives in the classic cortical association
areas). This hierarchical processing model has recently been challenged by findings showing that
multisensory interactions in the primary cortical areas begin to occur as early as 40 to 50 ms after
stimulation (Giard and Peronnet 1999; Molholm et al. 2002). Moreover, it has been suggested that
multisensory interactions are not only mediated by feedback connections from higher-level cortical
areas onto lower level areas but that they are also mediated by feedforward and lateral connections
from lower-level primary cortical areas (Foxe and Schroeder 2005). As a result, there is a growing
consensus that multisensory interactions occur all along the neuraxis, that multisensory integration
mechanisms are widespread in the primate neocortex, and that this is what makes the perception
of multisensory coherence possible (Ghazanfar and Schroeder 2006). This conclusion is supported
by findings showing that traditionally unisensory areas actually contain neurons that respond to
stimulation in other modalities. For example, responsiveness in the auditory cortex has been shown
to be modulated by visual input in humans (Calvert et al. 1999), monkeys (Ghazanfar et al. 2005),
ferrets (Bizley et al. 2007), and rats (Wallace et al. 2004).
If multisensory interactions begin to occur right after the sensory input stage and before sensory
elaboration has occurred, and if such interactions continue to occur as the information ascends
the neural pathways to the traditional association areas of the cortex, then this resolves a critical
problem. From the standpoint of the adult brain, it solves the problem of having to wait until the
higher-order cortical areas can extract the various types of relations inherent in multisensory input.
This way, the observer can begin to perform a veridical scene analysis and arrive at a coherent mul-
tisensory experience shortly after input arrives at the sensory organs (Foxe and Schroeder 2005).
From the standpoint of the immature infant brain, the adult findings raise some interesting possibili-
ties. For example, because these early neural interactions are of a relatively low level, they are likely
to occur very early in human development and can interact with any other low level subcortical
integration mechanisms. Whether this scenario is correct is currently unknown and awaits further
investigation. As shown here, behavioral findings from human infants support these conjectures
in that starting at birth, human infants are capable of multisensory perception. Thus, the question
is no longer whether such mechanisms operate but rather what is their nature and where in the
brain are such mechanisms operational. Another interesting question is whether the heterochronous
emergence of heterogeneous multisensory perceptual skills that has been found in behavioral infant
studies (Lewkowicz 2002) is reflected in the operation of distinct neural mechanisms emerging at
different times and in different regions of the brain.
There is little doubt that the neural mechanisms underlying multisensory processing are likely to be
quite rudimentary in early human development. The central nervous system and the different sensory
systems are immature, and young infants are perceptually and cognitively inexperienced. This is
the case despite the fact that the tactile, vestibular, chemical, and auditory modalities begin to function
before birth (Gottlieb 1971) and despite the fact that this provides fetuses with some sensory experience
and some opportunity for intersensory interaction (Turkewitz 1994). Consequently, newborn infants are
relatively unprepared for the onslaught of new multisensory input that also, for the first time, includes
visual information. In addition, newborns are greatly limited by the immature nature of their different
sensory systems (Kellman and Arterberry 1998). That is, their visual limitations include poor spatial
and temporal resolution and poor sensitivity to contrast, orientation, motion, depth, and color. Their
auditory limitations include much higher thresholds than those of adults, including higher absolute
frequency, frequency resolution, and temporal resolution thresholds.
functions improve rapidly over the first months of life, but there is little doubt that they initially impose
limitations on infant perception and probably account for some of the developmental changes found
in the development of multisensory responsiveness. The question for future studies is: How do infants
overcome these limitations? The work reviewed here suggests that the answer lies in the complex inter-
actions between neural and behavioral levels of organization and in the daily experiences that infants
have in their normal ecological setting. Because developmental change is driven by such interactions
(Gottlieb et al. 2006), the challenge for future studies is to explicate these interactions.

REFERENCES
Bahrick, L.E. 1983. Infants’ perception of substance and temporal synchrony in multimodal events. Infant
Behavior & Development 6:429–51.
Bahrick, L.E. 1994. The development of infants’ sensitivity to arbitrary intermodal relations. Ecological
Psychology 6:111–23.
Bahrick, L.E., and R. Lickliter. 2000. Intersensory redundancy guides attentional selectivity and perceptual
learning in infancy. Developmental Psychology 36:190–201.
Bahrick, L.E., R. Lickliter, and R. Flom. 2004. Intersensory redundancy guides the development of selective
attention, perception, and cognition in infancy. Current Directions in Psychological Science 13:99–102.
Birch, H.G., and A. Lefford. 1963. Intersensory development in children. Monographs of the Society for
Research in Child Development 25.
Birch, H.G., and A. Lefford. 1967. Visual differentiation, intersensory integration, and voluntary motor control.
Monographs of the Society for Research in Child Development 32:1–87.
Bizley, J.K., F.R. Nodal, V.M. Bajo, I. Nelken, and A.J. King. 2007. Physiological and anatomical evidence for
multisensory interactions in auditory cortex. Cerebral Cortex 17:2172–89.
Brannon, E.M., S. Suanda, and K. Libertus. 2007. Temporal discrimination increases in precision over develop-
ment and parallels the development of numerosity discrimination. Developmental Science 10:770–7.
Bremner, A.J., N.P. Holmes, and C. Spence. 2008. Infants lost in (peripersonal) space? Trends in Cognitive
Sciences 12:298–305.
Brookes, H., A. Slater, P.C. Quinn et al. 2001. Three-month-old infants learn arbitrary auditory-visual pairings
between voices and faces. Infant & Child Development 10:75–82.
Bushara, K.O., J. Grafman, and M. Hallett. 2001. Neural correlates of auditory-visual stimulus onset asyn-
chrony detection. Journal of Neuroscience 21:300–4.
Calvert, G.A., M.J. Brammer, E.T. Bullmore et al. 1999. Response amplification in sensory-specific cortices
during crossmodal binding. Neuroreport 10:2619–23.
Calvert, G.A., C. Spence, and B. Stein (eds.). 2004. The Handbook of Multisensory Processes. Cambridge,
MA: MIT Press.
Canfield, R.L., and M.M. Haith. 1991. Young infants’ visual expectations for symmetric and asymmetric stimu-
lus sequences. Developmental Psychology 27:198–208.
Canfield, R.L., E.G. Smith, M.P. Brezsnyak, and K.L. Snow. 1997. Information processing through the first
year of life: A longitudinal study using the visual expectation paradigm. Monographs of the Society for
Research in Child Development 62:v–vi, 1–145.
Clifton, R.K. 1974. Heart rate conditioning in the newborn infant. Journal of Experimental Child Psychology
18:9–21.
Colombo, J., and W.A. Richman. 2002. Infant timekeeping: Attention and temporal estimation in 4-month-olds.
Psychological Science 13:475–9.
Donohue, R.L., and W.K. Berg. 1991. Infant heart-rate responses to temporally predictable and unpredictable
events. Developmental Psychology 27:59–66.
Fiser, J., and R.N. Aslin. 2002. Statistical learning of new visual feature combinations by infants. Proceedings
of the National Academy of Sciences of the United States of America 99:15822–6.
Foxe, J.J., and C.E. Schroeder. 2005. The case for feedforward multisensory convergence during early cortical
processing. Neuroreport 16:419.
Fraisse, P. 1982. The adaptation of the child to time. In W.J. Friedman (ed.), The developmental psychology of
time, 113–40. New York: Academic Press.
Frank, M.C., J.A. Slemmer, G.F. Marcus, and S.P. Johnson. 2009. Information from multiple modalities helps
5-month-olds learn abstract rules. Developmental Science 12:504–9.
Fujisaki, W., S. Shimojo, M. Kashino, and S.Y. Nishida. 2004. Recalibration of audiovisual simultaneity.
Nature Neuroscience 7:773–8.
Gardner, J.M., D.J. Lewkowicz, S.A. Rose, and B.Z. Karmel. 1986. Effects of visual and auditory stimula-
tion on subsequent visual preferences in neonates. International Journal of Behavioral Development
9:251–63.
Gebhard, J.W., and G.H. Mowbray. 1959. On discriminating the rate of visual flicker and auditory flutter.
American Journal of Psychology 72:521–9.
Gerken, L. 2006. Decisions, decisions: Infant language learning when multiple generalizations are possible.
Cognition 98:B67–74.
Ghazanfar, A.A., and C.E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive Sciences
10:278–85. Epub 2006 May 18.
Ghazanfar, A.A., J.X. Maier, K.L. Hoffman, and N.K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience 25:5004–12.
Giard, M.H., and F. Peronnet. 1999. Auditory–visual integration during multimodal object recognition in
humans: A behavioral and electrophysiological study. Journal of Cognitive Neuroscience 11:473–90.
Gibson, J.J. 1966. The senses considered as perceptual systems. Boston: Houghton-Mifflin.
Gibson, E.J. 1969. Principles of perceptual learning and development. New York: Appleton.
Gómez, R.L., and J. Maye. 2005. The developmental trajectory of nonadjacent dependency learning. Infancy
7:183–206.
Gottlieb, G. 1971. Ontogenesis of sensory function in birds and mammals. In The biopsychology of develop-
ment, ed. E. Tobach, L.R. Aronson, and E. Shaw, 67–128. New York: Academic Press.
Gottlieb, G., D. Wahlsten, and R. Lickliter. 2006. The significance of biology for human development: A devel-
opmental psychobiological systems view. In Handbook of child psychology, ed. R. Lerner, 210–57. New
York: Wiley.
Greenfield, P.M. 1991. Language, tools and brain: The ontogeny and phylogeny of hierarchically organized
sequential behavior. Behavioral and Brain Sciences 14:531–95.
Gulya, M., and M. Colombo. 2004. The ontogeny of serial-order behavior in humans (Homo sapiens):
Representation of a list. Journal of Comparative Psychology 118:71–81.
Handel, S., and L. Buffardi. 1969. Using several modalities to perceive one temporal pattern. Quarterly Journal
of Experimental Psychology 21:256–66.
Johnson, S.P., K.J. Fernandes, M.C. Frank et al. 2009. Abstract rule learning for visual sequences in 8- and
11-month-olds. Infancy 14:2–18.
Kahana-Kalman, R., and A.S. Walker-Andrews. 2001. The role of person familiarity in young infants’ percep-
tion of emotional expressions. Child Development 72:352–69.
Kamachi, M., H. Hill, K. Lander, and E. Vatikiotis-Bateson. 2003. Putting the face to the voice: Matching
identity across modality. Current Biology 13:1709–14.
Kellman, P.J., and M.E. Arterberry. 1998. The cradle of knowledge: Development of perception in infancy.
Cambridge, MA: MIT Press.
King, A.J., M.E. Hutchings, D.R. Moore, and C. Blakemore. 1988. Developmental plasticity in the visual and
auditory representations in the mammalian superior colliculus. Nature 332:73–6.
Kirkham, N.Z., J.A. Slemmer, and S.P. Johnson. 2002. Visual statistical learning in infancy: Evidence for a
domain general learning mechanism. Cognition 83:B35–42.
Knudsen, E.I., and M.S. Brainard. 1991. Visual instruction of the neural map of auditory space in the develop-
ing optic tectum. Science 253:85–7.
Kuhl, P.K., and A.N. Meltzoff. 1982. The bimodal perception of speech in infancy. Science 218:1138–41.
Kuhl, P.K., and A.N. Meltzoff. 1984. The intermodal representation of speech in infants. Infant Behavior &
Development 7:361–81.
Lashley, K.S. 1951. The problem of serial order in behavior. In Cerebral mechanisms in behavior: The Hixon
symposium, ed. L.A. Jeffress, 123–47. New York: Wiley.
Lewkowicz, D.J. 1986. Developmental changes in infants’ bisensory response to synchronous durations. Infant
Behavior & Development 9:335–53.
Lewkowicz, D.J. 1992a. Infants’ response to temporally based intersensory equivalence: The effect of synchro-
nous sounds on visual preferences for moving stimuli. Infant Behavior & Development 15:297–324.
Lewkowicz, D.J. 1992b. Infants’ responsiveness to the auditory and visual attributes of a sounding/moving
stimulus. Perception & Psychophysics 52:519–28.
Lewkowicz, D.J. 1994. Limitations on infants’ response to rate-based auditory-visual relations. Developmental
Psychology 30:880–92.
Lewkowicz, D.J. 1996. Perception of auditory-visual temporal synchrony in human infants. Journal of
Experimental Psychology: Human Perception & Performance 22:1094–106.
Lewkowicz, D.J. 2000a. The development of intersensory temporal perception: An epigenetic systems/limita-
tions view. Psychological Bulletin 126:281–308.
Lewkowicz, D.J. 2000b. Infants’ perception of the audible, visible and bimodal attributes of multimodal syl-
lables. Child Development 71:1241–57.
Lewkowicz, D.J. 2002. Heterogeneity and heterochrony in the development of intersensory perception.
Cognitive Brain Research 14:41–63.
Lewkowicz, D.J. 2003. Learning and discrimination of audiovisual events in human infants: The hierarchical
relation between intersensory temporal synchrony and rhythmic pattern cues. Developmental Psychology
39:795–804.
Lewkowicz, D.J. 2004. Perception of serial order in infants. Developmental Science 7:175–84.
Lewkowicz, D.J. 2008. Perception of dynamic and static audiovisual sequences in 3- and 4-month-old infants.
Child Development 79:1538–54.
Lewkowicz, D.J. 2010. Infant perception of audio-visual speech synchrony. Developmental Psychology 46:66–77.
Lewkowicz, D.J., and I. Berent. 2009. Sequence learning in 4-month-old infants: Do infants represent ordinal
information? Child Development 80:1811–23.
Lewkowicz, D., and K. Kraebel. 2004. The value of multisensory redundancy in the development of intersen-
sory perception. The Handbook of Multisensory Processes: 655–78. Cambridge, MA: MIT Press.
Lewkowicz, D.J., and A.A. Ghazanfar. 2006. The decline of cross-species intersensory perception in human
infants. Proceedings of the National Academy of Sciences of the United States of America 103:6771–4.
Lewkowicz, D.J., and G. Turkewitz. 1980. Cross-modal equivalence in early infancy: Auditory–visual intensity
matching. Developmental Psychology 16:597–607.
Lewkowicz, D.J., and G. Turkewitz. 1981. Intersensory interaction in newborns: Modification of visual prefer-
ences following exposure to sound. Child Development 52:827–32.
Lewkowicz, D.J., R. Sowinski, and S. Place. 2008. The decline of cross-species intersensory perception in human
infants: Underlying mechanisms and its developmental persistence. Brain Research 1242:291–302.
Lewkowicz, D.J., I. Leo, and F. Simion. 2010. Intersensory perception at birth: Newborns match non-human
primate faces and voices. Infancy 15:46–60.
Lickliter, R., and L.E. Bahrick. 2000. The development of infant intersensory perception: Advantages of a
comparative convergent-operations approach. Psychological Bulletin 126:260–80.
Lickliter, R., and H. Banker. 1994. Prenatal components of intersensory development in precocial birds. In
Development of intersensory perception: Comparative perspectives, ed. D.J. Lewkowicz and R. Lickliter,
59–80. Norwood, NJ: Lawrence Erlbaum Associates, Inc.
Lickliter, R., D.J. Lewkowicz, and R.F. Columbus. 1996. Intersensory experience and early perceptual devel-
opment: The role of spatial contiguity in bobwhite quail chicks’ responsiveness to multimodal maternal
cues. Developmental Psychobiology 29:403–16.
Maier, N.R.F., and T.C. Schneirla. 1964. Principles of animal psychology. New York: Dover Publications.
Marcovitch, S., and D.J. Lewkowicz. 2009. Sequence learning in infancy: The independent contributions of
conditional probability and pair frequency information. Developmental Science 12:1020–5.
Marcus, G.F., S. Vijayan, S. Rao, and P. Vishton. 1999. Rule learning by seven-month-old infants. Science
283:77–80.
Marcus, G.F., K.J. Fernandes, and S.P. Johnson. 2007. Infant rule learning facilitated by speech. Psychological
Science 18:387–91.
Marks, L. 1978. The unity of the senses. New York: Academic Press.
Martin, J.G. 1972. Rhythmic (hierarchical) versus serial structure in speech and other behavior. Psychological
Review 79:487–509.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264:229–39.
Molholm, S., W. Ritter, M.M. Murray et al. 2002. Multisensory auditory–visual interactions during early
sensory processing in humans: A high-density electrical mapping study. Cognitive Brain Research
14:115–28.
Munhall, K.G., and E. Vatikiotis-Bateson. 2004. Spatial and temporal constraints on audiovisual speech percep-
tion. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B.E. Stein, 177–88.
Cambridge, MA: MIT Press.
Myers, A.K., B. Cotton, and H.A. Hilp. 1981. Matching the rate of concurrent tone bursts and light flashes as a
function of flash surround luminance. Perception & Psychophysics 30(1):33–8.
Navarra, J., A. Vatakis, M. Zampini et al. 2005. Exposure to asynchronous audiovisual speech extends the tem-
poral window for audiovisual integration. Cognitive Brain Research 25:499–507.
Nazzi, T., and F. Ramus. 2003. Perception and acquisition of linguistic rhythm by infants. Speech Communication
41:233–43.
Neil, P.A., C. Chee-Ruiter, C. Scheier, D.J. Lewkowicz, and S. Shimojo. 2006. Development of multisensory
spatial integration and perception in humans. Developmental Science 9:454–64.
Nelson, K. 1986. Event knowledge: Structure and function in development. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Nelson, K. 2007. Young minds in social worlds. Cambridge, MA: Harvard Univ. Press.
Oppenheim, R.W. 1981. Ontogenetic adaptations and retrogressive processes in the development of the nervous
system and behavior: A neuroembryological perspective. In Maturation and development: Biological and
psychological perspectives, ed. K.J. Connolly and H.F.R. Prechtl, 73–109. Philadelphia, PA: Lippincott.
Partan, S., and P. Marler. 1999. Communication goes multimodal. Science 283:1272–3.
Patterson, M.L., and J.F. Werker. 2002. Infants’ ability to match dynamic phonetic and gender information in
the face and voice. Journal of Experimental Child Psychology 81:93–115.
Patterson, M.L., and J.F. Werker. 2003. Two-month-old infants match phonetic information in lips and voice.
Developmental Science 6(2):191–6.
Piaget, J. 1952. The origins of intelligence in children. New York: International Universities Press.
Pons, F., D.J. Lewkowicz, S. Soto-Faraco, and N. Sebastián-Gallés. 2009. Narrowing of intersensory speech
perception in infancy. Proceedings of the National Academy of Sciences of the United States of America
106:10598–602.
Reardon, P., and E.W. Bushnell. 1988. Infants’ sensitivity to arbitrary pairings of color and taste. Infant Behavior
and Development 11:245–50.
Romei, V., M.M. Murray, C. Cappe, and G. Thut. 2009. Preperceptual and stimulus-selective enhancement of
low-level human visual cortex excitability by sounds. Current Biology 19:1799–805.
Rowe, C. 1999. Receiver psychology and the evolution of multicomponent signals. Animal Behaviour
58:921–31.
Saffran, J.R., R.N. Aslin, and E.L. Newport. 1996. Statistical learning by 8-month-old infants. Science
274:1926–8.
Scheier, C., D.J. Lewkowicz, and S. Shimojo. 2003. Sound induces perceptual reorganization of an ambiguous
motion display in human infants. Developmental Science 6:233–44.
Sekuler, R., A.B. Sekuler, and R. Lau. 1997. Sound alters visual motion perception. Nature 385:308.
Shams, L., Y. Kamitani, and S. Shimojo. 2000. What you see is what you hear. Nature 408(6814):788.
Shipley, T. 1964. Auditory flutter-driving of visual flicker. Science 145:1328–30.
Slater, A., E. Brown, and M. Badenoch. 1997. Intermodal perception at birth: Newborn infants’ memory for
arbitrary auditory–visual pairings. Early Development & Parenting 6:99–104.
Slater, A., P.C. Quinn, E. Brown, and R. Hayes. 1999. Intermodal perception at birth: Intersensory redundancy
guides newborn infants’ learning of arbitrary auditory–visual pairings. Developmental Science 2:333–8.
Slutsky, D.A., and G.H. Recanzone. 2001. Temporal and spatial dependency of the ventriloquism effect.
Neuroreport 12:7–10.
Stein, B.E., and M.A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Stein, B.E., and T.R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the single
neuron. Nature Reviews. Neuroscience 9:255–66.
Stein, B.E., D. Burr, C. Constantinidis et al. 2010. Semantic confusion regarding the development of multisen-
sory integration: A practical solution. European Journal of Neuroscience 31:1713–20.
Thelen, E., and L.B. Smith. 1994. A dynamic systems approach to the development of cognition and action.
Cambridge, MA: MIT Press.
Thomas, K.M., and C.A. Nelson. 2001. Serial reaction time learning in preschool- and school-age children.
Journal of Experimental Child Psychology 79:364–87.
Turkewitz, G. 1994. Sources of order for intersensory functioning. In The development of intersensory percep-
tion: Comparative perspectives, ed. D.J. Lewkowicz and R. Lickliter, 3–17. Hillsdale, NJ: Lawrence
Erlbaum Associates.
Turkewitz, G., and P.A. Kenny. 1982. Limitations on input as a basis for neural organization and perceptual
development: A preliminary theoretical statement. Developmental Psychobiology 15:357–68.
Vroomen, J., M. Keetels, B. de Gelder, and P. Bertelson. 2004. Recalibration of temporal order perception by
exposure to audio-visual asynchrony. Cognitive Brain Research 22:32–5.
Walker-Andrews, A.S. 1986. Intermodal perception of expressive behaviors: Relation of eye and voice?
Developmental Psychology 22:373–7.
Walker-Andrews, A.S. 1997. Infants’ perception of expressive behaviors: Differentiation of multimodal infor-
mation. Psychological Bulletin 121:437–56.
Walker-Andrews, A.S., L.E. Bahrick, S.S. Raglioni, and I. Diaz. 1991. Infants’ bimodal perception of gender.
Ecological Psychology 3:55–75.
Wallace, M.T., and B.E. Stein. 1997. Development of multisensory neurons and multisensory integration in cat
superior colliculus. Journal of Neuroscience 17:2429–44.
Wallace, M.T., and B.E. Stein. 2001. Sensory and multisensory responses in the newborn monkey superior col-
liculus. Journal of Neuroscience 21:8886–94.
Wallace, M.T., R. Ramachandran, and B.E. Stein. 2004. A revised view of sensory cortical parcellation.
Proceedings of the National Academy of Sciences of the United States of America 101:2167–72.
Wallace, M.T., B.E. Stein, and R. Ramachandran. 2006. Early experience determines how the senses will inter-
act: A revised view of sensory cortical parcellation. Journal of Neurophysiology 101:2167–72.
Welch, R.B., and D.H. Warren. 1980. Immediate perceptual response to intersensory discrepancy. Psychological
Bulletin 88:638–67.
Welch, R.B., L.D. Duttenhurt, and D.H. Warren. 1986. Contributions of audition and vision to temporal rate
perception. Perception & Psychophysics 39:294–300.
Werner, H. 1973. Comparative psychology of mental development. New York: International Universities
Press.
Yehia, H., P. Rubin, and E. Vatikiotis-Bateson. 1998. Quantitative association of vocal-tract and facial behavior.
Speech Communication 26:23–43.
Zangenehpour, S., A.A. Ghazanfar, D.J. Lewkowicz, and R.J. Zatorre. 2009. Heterochrony and cross-species
intersensory matching by infant vervet monkeys. PLoS ONE 4:e4302.
18 Multisensory Integration Develops Late in Humans
David Burr and Monica Gori

CONTENTS
18.1 Development of Multimodal Perception in Infancy and Childhood
18.2 Neurophysiological Evidence for Development of Multimodal Integration
18.3 Development of Cue Integration in Spatial Navigation
18.4 Development of Audiovisual Cue Integration
18.5 Sensory Experience and Deprivation Influence Development of Multisensory Integration
18.6 Development of Visuo-Haptic Integration
18.7 Calibration by Cross-Modal Comparison?
18.8 Haptic Discrimination in Blind and Low-Vision Children: Disruption of Cross-Sensory Calibration?
18.9 Concluding Remarks: Evidence of Late Multisensory Development
Acknowledgment
References

18.1 DEVELOPMENT OF MULTIMODAL PERCEPTION IN INFANCY AND CHILDHOOD

From birth, we interact with the world through our senses, which provide complementary informa-
tion about the environment. To perceive and interact with a coherent world, our brain has to merge
information from the different senses as efficiently as possible. Because the same environmental
property may be signaled by more than one sense, the brain must integrate redundant signals of a
particular property (such as the size and shape of an object held in the hand), which can result in a
more precise estimate than either individual estimate. Much behavioral, electrophysiological, and
neuroimaging evidence has shown that signals from the different senses related to the same event,
congruent in space and time, increase the accuracy and precision of its encoding well beyond what
would be possible from independent estimates from individual senses. Several recent studies have
suggested that human adults integrate redundant information in a statistically optimal fashion (e.g.,
Alais and Burr 2004; Ernst and Banks 2002; Trommershäuser et al. in press).
An important question is whether this optimal multisensory integration is present at birth, or
whether (and if so when) it develops during childhood. Early development of multisensory inte-
gration could be useful for the developing brain, but may also bring fresh challenges, given the
dramatic changes that the human brain and body undergo during this period. The clear advantages
of multisensory integration may come at a cost to the developing organism. In fact, as we will see
later in this chapter, many multisensory functions appear only late in development, well after the
maturation of individual senses.
Sensory systems are not mature at birth, but become increasingly refined during development.
The brain has to continuously update its mapping between sensory and motor correspondence and
to take these changes into account. This is a very protracted process, with cognitive changes and
neural reorganization lasting well into early adolescence (Paus 2005). A further complication is that
different senses develop at different rates: first touch, followed by vestibular, chemical, and auditory
(all beginning to function before birth), and finally vision (Gottlieb 1971). These differences in developmental
rate could exacerbate the challenges of cross-modal integration and calibration, which must take
into account growing limbs, eye length, interocular distances, and so on.
Some sensory properties, like contrast sensitivity, visual acuity, binocular vision, color percep-
tion, and some kinds of visual motion perception mature rapidly to reach near adult-like levels
within 8 to 12 months of age (for a review, see Atkinson 2000). Similarly, young infants can explore,
manipulate, and discriminate the form of objects haptically, analyzing and coding tactile and weight
information, during a period when their hands are undergoing rapid changes (Streri 2003; Streri
et al. 2000, 2004; Striano and Bushnell 2005).
On the other hand, not all perceptual skills develop early. For example, auditory frequency dis-
crimination (Olsho 1984; Olsho et al. 1988), temporal discrimination (Trehub et al. 1995), and basic
speech abilities all improve during infancy (Jusczyk et al. 1998). Also, projective size and shape
are not noticed or understood until at least 7 years of age, and evidence suggests that even visual
acuity and contrast sensitivity continue to improve slightly up until 5 to 6 years of age (Brown et
al. 1987). Other attributes, such as the use of binocular cues to control prehensile movements (Watt
et al. 2003) and the development of complex form and motion perception (Del Viva et al. 2006;
Ellemberg et al. 1999, 2004; Kovács et al. 1999; Lewis et al. 2004) continue until 8 to 14 years of
age. Object manipulation also continues to improve until 8 to 14 years (Rentschler et al. 2004),
and tactile object recognition in blind and sighted children does not develop until 5 to 6 years
(Morrongiello et al. 1994). Many other complex and experience-dependent capacities, such as facili-
tation of speech perception in noise (e.g., Elliott 1979; Johnson 2000), have been reported to be
immature throughout childhood.
All these studies suggest that there is a difference not only in the developmental rates of different
sensory systems, but also in the development of different aspects within each sensory system, all
potential obstacles for the development of cue integration. The development of multimodal percep-
tual abilities in human infants has been studied with various techniques, such as habituation and
preferential looking. Many studies suggest that some multisensory processes, such as cross-modal
facilitation, cross-modal transfer, and multisensory matching are present to some degree at an early
age (e.g., Streri 2003; Lewkowicz 2000, for review). Young infants can match signals between dif-
ferent sensory modalities (Dodd 1979; Lewkowicz and Turkewitz 1981) and detect equivalence in
the amodal properties of objects across the senses (e.g., Patterson and Werker 2002; Rose 1981). For
example, they can match faces with voices (Bahrick 2001) and visual and auditory motion signals
(Lewkowicz 1992) on the basis of their synchrony. By 3 to 5 months of age, they can discriminate
audiovisual changes in tempo and rhythm (Bahrick et al. 2002; Bahrick and Lickliter 2000), from
4 months of age, they can match visual and tactile form properties (Rose and Ruff 1987), and at
about 6 months of age, they can do duration-based matches (Lewkowicz 1986).
Young infants seem to be able to benefit from multimodal redundancy of information across
senses (Bahrick and Lickliter 2000, 2004; Bahrick et al. 2002; Lewkowicz 1988a, 1996; Neil et al.
2006). There is also evidence for cross-modal facilitation, in which stimuli in one modality increases
the responsiveness to stimuli in other modalities (Lewkowicz and Lickliter 1994; Lickliter et al.
1996; Morrongiello et al. 1998). However, not all forms of facilitation develop early. Infants do not
exhibit multisensory facilitation of reflexive head and eye movements for spatial localization until
about 8 months of age (Neil et al. 2006), and multisensory coactivation during a simple audiovisual
detection task does not occur until 8 years of age in most children (Barutchu et al. 2009, 2010).
Recent studies suggest that human infants can transfer information gleaned from one sense to
another (e.g., Streri 2003; Streri et al. 2004). For example, 1-month-old infants can visually recog-
nize an object they have previously explored orally (Gibson and Walker 1984; Meltzoff and Borton
1979) and 2-month-old infants can visually recognize an object they have previously felt (Rose
1981; Streri et al. 2008). However, many of these studies show an asymmetry in the transfer (Sann
and Streri 2007; Streri 2003; Streri et al. 2008) or a partial dominance of one modality over another
(Lewkowicz 1988a, 1988b), supporting the idea that, even when multimodal skills are present, they
are not necessarily fully mature. Recent results (Bremner et al. 2008a, 2008b) on the representation
of peripersonal space support the presence of two distinct mechanisms in sensory integration with
different developmental trends: the first, relying principally on visual information, is present dur-
ing the first 6 months; the second, which incorporates information about hand and body posture together with vision,
develops only after 6.5 months of age.
Over the years, the majority of multisensory studies in infants and children have investi-
gated the development of multisensory matching, transfer, and facilitation abilities, whereas few
have investigated the development of multisensory integration. Those few that did investigate
multisensory integration in school-age children point to unimodal dominance rather than integra-
tion abilities (Hatwell 1987; Klein 1966; McGurk and Power 1980; Misceo et al. 1999).

18.2 NEUROPHYSIOLOGICAL EVIDENCE FOR DEVELOPMENT OF MULTIMODAL INTEGRATION

There is now firm neurophysiological evidence for multimodal integration. Many studies have
demonstrated that the midbrain structure superior colliculus is involved in integrating information
between modalities and in initiating and controlling localization and orienting motor
responses (Stein et al. 1993). This structure is highly sensitive to input from the association cortex
(Stein 2005), and the inactivation of this input impairs the integration of multisensory signals (Jiang
and Stein 2003). Maturation of multisensory responses depends strongly on environmental expe-
rience (Wallace and Stein 2007): after visual deprivation (Wallace et al. 2004), the responses of
multisensory neurons are atypical, and fail to show multisensory integration.
A typically developed superior colliculus is structured in layers. Neurons in the superficial
layers are unisensory, whereas those in the deeper layers respond to the combination of visual,
auditory, and tactile stimuli (Stein et al. 2009b). Neurons related to a specific sensory modality
have their own spatial map that is spatially registered with the maps of the neurons involved in
the processing of other modalities (Stein et al. 1993, 2009b). These multisensory neurons respond
to spatiotemporally coincident multisensory stimuli with a multisensory enhancement (more
impulses than evoked by the strongest stimulus; Meredith and Stein 1986). Multisensory enhance-
ment has been observed in several different species of animals (in the superior colliculus of the cat,
hamster, guinea pig, and monkey, as well as in the cortex of cat and monkey; Meredith and Stein
1986; Stein et al. 2009a, 2009b; Wilkinson et al. 1996), and functional magnetic resonance imag-
ing and behavioral studies support the existence of similar processes in humans (e.g., Macaluso
and Driver 2004).
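For concreteness, multisensory enhancement is conventionally expressed as the percentage change of the combined-modality response relative to the response to the most effective single-modality stimulus. The sketch below is only an illustration of that index; the spike counts in it are hypothetical and are not taken from any of the studies cited here.

def enhancement_index(combined, best_unimodal):
    """Multisensory enhancement: percent change of the combined-modality response
    (in impulses) relative to the response to the most effective single stimulus."""
    return 100.0 * (combined - best_unimodal) / best_unimodal

# Hypothetical spike counts: visual alone 8, auditory alone 5, combined 20 impulses.
print(enhancement_index(combined=20, best_unimodal=max(8, 5)))  # 150.0 (% enhancement)

A positive value indicates a response greater than the best unimodal response; values well above zero are what is meant by "enhancement" in the text.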
Multimodal responses of collicular neurons are not present at birth but develop late in cats and
monkeys (Stein et al. 1973; Wallace and Stein 1997, 2001). For example, in the cat superior colli-
culus, neurons are somatosensory at birth (Stein et al. 1973), whereas auditory and visual neurons
appear only postnatally. Initially, these neurons respond well to either somatic or auditory or visual
signals. Enhanced multisensory responses emerge many weeks later, and their development depends
on both experience and input from association cortex (Wallace and Stein 2001).
Behavioral data suggest that the visual modality is principally involved in the processing of the
spatial domain and the auditory system in the temporal domain. Most neurophysiological stud-
ies have investigated spatial rather than temporal processing. However, development of temporal
properties may be interesting, as the temporal patterns of stimulation can be perceived in the uterus
before birth by the vestibular, tactile, and auditory senses. Indeed, neurophysiological studies sug-
gest that somatosensory–auditory multisensory neurons develop a few days after birth, whereas mul-
tisensory neurons that also respond to visual information appear only a few weeks later (Stein et al.
1973). Thus, integration of temporal attributes of perception could develop before spatial attributes
(such as location or orientation), which are not typically available prenatally.

18.3  DEVELOPMENT OF CUE INTEGRATION IN SPATIAL NAVIGATION


When do human infants start integrating multisensory signals, and when does the integration become
statistically optimal? Nardini et al. (2008) studied the reliance on multiple spatial cues for short-range
navigation in children and adults. Navigation depends on both visual landmarks and self-generated
cues, such as vestibular and proprioceptive signals generated from the movement of the organism in
the environment. To measure and quantify the ability of adults and children to integrate this informa-
tion, they first measured the precision for each modality and then observed the improvement in the
bimodal condition. The subjects (adults and children aged 4 to 5 and 7 to 8 years) walked in a dark
room with peripherally illuminated landmarks and collected a series of objects (1, 2, and 3 in Figure
18.1a). After a delay, they replaced the objects. Subjects were provided with two cues to navigation,
visual landmarks (“moon,” “lightning bolt,” and “star” in Figure 18.1a) and self-motion.
They recorded the distance between the participant’s responses and the correct location as
well as root mean square errors for each condition, both for the two unimodal conditions—with
the room in darkness (no landmarks; SM) and with visual landmarks present (LM) but subjects

(a) (b)

1 1R
2 1 2
3 3

Start Start
(c) SM (self-motion)
100
LM (landmarks)
80 SM+LM
Mean SD (cm)

60

40

20

0
4-5 yr. 7-8 yr. Adult
Group
(d)
Prediction, integration model ±1 s.e. Prediction, alternation model ±1 s.e.
mean predicted SD (model)
Mean SD (cm) (measured)/

100 4-5 yr. 7-8 yr. Adult


90
80
70
60
50
40
30
0
0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00 0.00 0.25 0.50 0.75 1.00
Mean relative LM proximity (measured)/LM weight or probability (model)

FIGURE 18.1  (See color insert.) Use of multiple cues for navigation in adults and children. (a) Representation
of room in which subject performed the task in nonconflictual condition. Starting from “start,” subject picked up
three numbered objects in sequence. Three visual landmarks (a “moon,” a “lightning bolt,” and a “star”) were also
present in the room. (b) Representation of room in which subject performed the task in conflictual condition. Here,
landmarks were rotated around the subject by 15°, from the white to the colored positions. (c) Mean standard deviation (SD)
of participant responses for three different conditions. (d) Curves report the means of functions that predict mean
standard deviation (SD ±1 SE) from integration model (in green) or alternation model (in pink) for different age
groups. (Reproduced from Nardini, M. et al., Curr. Biol., 18, 689–693, 2008. With permission.)

disoriented—and with both cues present (SM + LM). Figure 18.1c shows a clear developmental
trend in the unimodal performance, with mean mislocalization thresholds decreasing with age, sug-
gesting that navigation improves during development.
More interestingly, whereas adults take advantage of multiple cue integration, the children do
not. SM + LM thresholds were higher than LM thresholds for children in both age groups, whereas
the adults showed lower thresholds in the two-cue condition (evidence of cross-sensory fusion).
Nardini et al. (2008) also measured navigation in a conflict condition (Figure 18.1b), in which land-
marks were rotated by 15° after the participants had collected the objects. They considered two
models, one in which the cues were weighted by the inverse of variance and integrated (green line in
Figure 18.1d), and one in which subjects alternate between the two cues (pink line in Figure 18.1d).
Although the integration model predicted adult performance in the conflict condition, 4- to 5- and
7- to 8-year-olds followed the alternation model rather than the integration model.
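The two models can be made concrete with a small numerical sketch. Assuming Gaussian errors, the integration model predicts a bimodal standard deviation given by inverse-variance (reliability-weighted) fusion, whereas the alternation model predicts a mixture of the two unimodal response distributions, whose means may be pulled apart by a cue conflict. This is our own illustrative rendering, not Nardini et al.'s analysis code, and the standard deviations, switching probability, and conflict size used below are hypothetical.

def integration_sd(sd_sm, sd_lm):
    # Optimal fusion: reliabilities (inverse variances) add.
    return (sd_sm**-2 + sd_lm**-2) ** -0.5

def alternation_sd(sd_sm, sd_lm, p_lm=0.5, conflict=0.0):
    # Trial-by-trial switching: responses form a mixture of the two unimodal
    # distributions; a conflict separates the two means, adding a between-cue
    # variance term to the mixture.
    var = p_lm * sd_lm**2 + (1 - p_lm) * sd_sm**2 + p_lm * (1 - p_lm) * conflict**2
    return var ** 0.5

# Hypothetical unimodal SDs (cm) for self-motion (SM) and landmark (LM) cues.
sd_sm, sd_lm = 60.0, 45.0
print(integration_sd(sd_sm, sd_lm))   # ~36 cm: better than either cue alone
print(alternation_sd(sd_sm, sd_lm))   # ~53 cm: between the two unimodal SDs

The key qualitative signature is that integration predicts bimodal precision better than the best single cue, whereas alternation never does.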
Although adults clearly integrate multiple cues for navigation optimally, young children do not,
alternating between cues from trial to trial. These results suggest that the two individual spatial
representations develop before they are integrated within a common reference frame. This study
suggests that optimal multisensory integration of spatial cues for short-range
navigation occurs late during development.

18.4  DEVELOPMENT OF AUDIOVISUAL CUE INTEGRATION


Audiovisual integration is fundamental for many tasks, such as orientation toward novel stimuli and
understanding speech in noisy environments. As the auditory system starts to develop before vision,
commencing in utero, it is interesting to examine when the two senses are integrated.
Neil et al. (2006) measured audiovisual facilitation of spatial location in adults and 1- to 10-month-
old infants, by comparing the response latency and accuracy of head and eye turns toward unimodal
(visual or auditory) and bimodal stimuli. Subjects were required to orient toward a stimulus (a red
vertical line or a sustained burst of white noise, or both) presented at one of five different locations.
For all stimuli, orientation latencies decreased steadily with age, from about 900 ms at 0 to 2 months
to 200 ms for adults. The response to the bimodal stimulus was faster than for either unimodal
stimulus at all ages, but only for adults and for 8- to 10-month-old infants was the “race model” (the
standard probability summation model of reaction times) consistently violated, implying neural inte-
gration. For young infants, the results were well-explained by independent probability summation,
without any evidence that the audiovisual signals were combined in any physiological way. Only after
8 to 10 months did the faster bimodal response suggest that behavioral summation had occurred.
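The race-model test referred to above compares the bimodal reaction-time distribution with the bound set by probability summation (Miller's inequality). The following sketch is a generic illustration of that test on simulated reaction times; it is not the procedure or data of Neil et al. (2006), and the latencies are invented.

import numpy as np

def ecdf(rts, t):
    """Empirical cumulative distribution of reaction times evaluated at time t."""
    return np.mean(np.asarray(rts) <= t)

def race_model_violated(rt_av, rt_a, rt_v, times):
    # Under probability summation (no neural integration),
    # P(RT_AV <= t) <= P(RT_A <= t) + P(RT_V <= t) for every t.
    # Any time point where the bimodal CDF exceeds this bound counts as a violation.
    return any(ecdf(rt_av, t) > ecdf(rt_a, t) + ecdf(rt_v, t) for t in times)

# Simulated reaction times in ms (illustrative values only).
rng = np.random.default_rng(0)
rt_a = rng.normal(320, 40, 500)
rt_v = rng.normal(300, 40, 500)
rt_av = rng.normal(240, 35, 500)   # markedly faster than either unimodal condition
print(race_model_violated(rt_av, rt_a, rt_v, np.arange(150, 400, 10)))  # True

A bimodal distribution that exceeds the summed unimodal distributions at some latency cannot be explained by independent parallel races, which is why such violations are taken as evidence of neural integration.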
Although multisensory facilitation of reflexive audiovisual eye and head movements for spatial
localization has been found to develop at about 10 months of age (Neil et al. 2006), recent findings
(Barutchu et al. 2009, 2010) report a different developmental trend for multisensory facilitation of
nonreflexive audiovisual motor responses. Barutchu et al. (2009) studied motor reaction times
during an audiovisual detection task and found that multisensory facilitation is still immature at 10 to 11
years of age; only at around 7 years of age did the facilitation start to become consistent with the
coactivation model (Barutchu et al. 2009). These authors suggest that the difference between the
two trends may depend on the maturity of the process being facilitated by multisensory integration:
the processes facilitated during reflexive eye and head movements mature before those facilitated
during more complex motor detection tasks (Barutchu et al. 2009) or speech perception (Massaro 1987).
Similarly, Tremblay et al. (2007) showed that different audiovisual illusions seem to develop at
different rates. They investigated audiovisual abilities in two tasks, one involving a speech illusion
and one a nonspeech illusion, and found that although the audiovisual speech illusion varied as a
function of age and did not emerge fully until 10 years of age, the nonspeech illusion was constant
across ages and already present at 5 years of age.
Later in the chapter, we shall suggest a different interpretation of these results, one of “cross-
modal calibration,” which we believe could stabilize at different ages for different tasks.

18.5 SENSORY EXPERIENCE AND DEPRIVATION INFLUENCE DEVELOPMENT OF MULTISENSORY INTEGRATION

Animal studies have shown that deprivation of cross-modal cues compromises the development
of normal multisensory responses. For example, Wallace et al. (2004) found that cats deprived of
audiovisual and visuo–tactile experience showed no multisensory response enhancement in the
superior colliculus.
Similarly, patients with specific sensory deficits, such as congenital deafness or blindness later
restored by surgery techniques, are ideal models to investigate the effects of sensory experience
on multisensory integration in humans. For example, Putzar et al. (2007) tested patients born with
dense congenital binocular cataracts (removed at 2 or 3 months) on a nonverbal audiovisual task as
well as audiovisual speech perception. This group actually performed better than a control group
on the nonverbal task, where they were required to make temporal judgments of visual stimuli pre-
sented together with auditory distractors, suggesting that the visuo–auditory “binding” was weaker
in patients who had been visually deprived for the first few months of life. Similarly, they performed
worse than controls in the speech experiment, where a fusion between spatial and temporal visuo–
auditory perceptual aspects assisted the task. These results highlight the importance of adequate
sensory input during early life for the development of multisensory interactions (see also Gori et al.
2010; Hotting and Roder 2009; Röder et al. 2004, 2007).
Also, auditory deprivation can influence the perception of multisensory stimuli, notably speech
perception, which involves the interaction of temporal and spatial visual and audio signals. The
clearest example of this is the McGurk effect (McGurk and Power 1980): subjects listening to a
spoken phoneme (e.g., /pa/) and watching a speaker pronounce another phoneme (such as /ka/) will
report hearing an in-between phoneme, /ta/. This compelling illusion occurs both for adults and
young children (e.g., Bergeson and Pisoni 2003). Schorr et al. (2005) took advantage of this robust
illusion to study bimodal fusion in children born deaf, with hearing restored by cochlear implants.
They first replicated the illusion in a control group of children with normal hearing, of whom 57%
showed bimodal fusion on at least 70% of trials, perceiving /ta/ when /pa/ was pronounced and /ka/
observed on video (Figure 18.2). Of those who did not show fusion, the majority showed a clear

100

80
Subjects (%)

60

40

20

0
AV A V AV A V AV A V
Controls Implants < 30 m Implants > 30 m

FIGURE 18.2  McGurk effect in children with cochlear implants compared with age-matched controls.
Phoneme /pa/ was played to subjects while they observed a video of lips pronouncing /ka/, and reported the pho-
neme they perceived. Black bars show percentage of each group reporting fusion (/ta/) on at least 70% of trials;
light gray bars show auditory dominance (/pa/) and dark gray bars show visual dominance (/ka/). For controls,
more than half showed bimodal fusion (McGurk effect), and of those that did not, most showed auditory domi-
nance. Likewise, for children with early cochlear implants (before 30 months of age), the majority showed fusion, but those
that did not showed visual dominance. For children with later implants, almost all showed visual dominance.

auditory dominance. Among the group who had implants at an early age (before 30 months), a simi-
lar proportion (56%) perceived the fused phoneme, suggesting that bimodal fusion was occurring.
However, the majority of those who did not perceive the fused phoneme perceived the visual /ka/
rather than the auditory /pa/ that the control children perceived. For late implants, however, only one
child showed cross-modal fusion; all the others showed visual dominance.
These results suggest that cross-modal fusion is not innate, but needs to be learned. The group
of hearing-restored children who received the implant after 30 months of age showed no evidence
of cross-modal fusion, with the visual phoneme dominating perception. Those with early implants
demonstrate a remarkable plasticity in acquiring bimodal fusion, suggesting that there is a sensitive
period for the development of bimodal integration of speech.
It is interesting that in normal-hearing children, sound dominates the multimodal perception,
whereas vision dominated in all the cochlea-implanted children, both early and late implants. It is
possible that the dominance can be explained by reliability-based integration. Speech is a complex
temporal task in audition and a spatiotemporal task in vision. Although performance has not yet been
measured (to our knowledge), it is reasonable to suppose that in normal-hearing children, the audi-
tory perception is more precise, explaining the dominance. What about the cochlea-implanted chil-
dren? Is their auditory precision worse than visual precision, so the visual dominance is the result
of ideal fusion? Or is the auditory perception actually better than visual perception at this task, so
the visual dominance is not the optimal solution? In this case, it may be that vision remains the
most robust sense, even if not the most precise. This would be interesting to investigate, perhaps in
a simplified situation, as has been done for visuo-haptic judgments (see following section).

18.6  DEVELOPMENT OF VISUO-HAPTIC INTEGRATION


One of the earliest studies to investigate the capacity to integrate information between perceptual
systems was that of Ernst and Banks (2002), who examined the integration of visual and haptic
estimates of size in human adults. Their results were consistent with a simple but powerful model
in which visual and haptic inputs are combined in an optimal fashion that maximizes the pre-
cision of the final estimate (see also the chapter by Marc Ernst). This maximum likelihood estimate
(MLE) model combines sensory information by summing the independent estimates from each
modality, after weighting each estimate by its reliability, which is in turn inversely proportional to the
variance of the presumed underlying noise distribution.

ŜVH = wVŜV + wHŜH    (18.1)

where ŜVH is the combined visuo-haptic estimate, and ŜV and ŜH are the independent visual and
haptic estimates. The weights wV and wH sum to unity and are inversely proportional to the
variance (σ²) of the presumed underlying noise distribution:

wV = σV⁻²/(σH⁻² + σV⁻²),   wH = σH⁻²/(σH⁻² + σV⁻²)    (18.2)

The MLE prediction for the visuo-haptic threshold (σVH) is given by

σVH⁻² = σV⁻² + σH⁻²    (18.3)

where σV and σH are the visual and haptic unimodal thresholds. The improvement is greatest
(a factor of √2) when σV = σH.
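A minimal numerical sketch of Equations 18.1 through 18.3 follows; the size estimates, conflict, and thresholds in it are hypothetical values chosen only to illustrate how the weights, the combined estimate, and the predicted bimodal threshold are obtained.

def mle_combine(s_v, s_h, sigma_v, sigma_h):
    """Reliability-weighted (maximum likelihood) fusion of a visual and a haptic
    estimate. Returns the combined estimate (Equation 18.1) and the predicted
    bimodal threshold (Equation 18.3)."""
    w_v = sigma_v**-2 / (sigma_v**-2 + sigma_h**-2)   # Equation 18.2
    w_h = 1.0 - w_v                                   # weights sum to unity
    s_vh = w_v * s_v + w_h * s_h                      # Equation 18.1
    sigma_vh = (sigma_v**-2 + sigma_h**-2) ** -0.5    # Equation 18.3
    return s_vh, sigma_vh

# Hypothetical size estimates (mm) under a 3 mm cross-modal conflict, with vision
# twice as precise as touch.
print(mle_combine(s_v=58.0, s_h=52.0, sigma_v=2.0, sigma_h=4.0))
# -> (56.8, ~1.79): the combined estimate sits near the visual one, and the
#    predicted threshold is below the better (visual) unimodal threshold.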
This model has been spectacularly successful in predicting human multimodal integration for
various tasks, including visuo-haptic size judgments (Ernst and Banks 2002), audiovisual position
judgments (Alais and Burr 2004), and visual–tactile integration of sequence of events (Bresciani
and Ernst 2007). Gori et al. (2008) adapted the technique to study the development of reliability-
based cross-sensory integration of two aspects of form perception: size and orientation discrimina-
tion. The size discrimination task (top left icon of Figure 18.3) was a low-technology, child-friendly
adaptation of Ernst and Banks’ technique (Ernst and Banks 2002), where visual and haptic informa-
tion were placed in conflict with each other to investigate which dominates perception under vari-
ous degrees of visual degradation. The stimuli were physical blocks of variable height, displayed in

(a) (d)

(b) Haptic standard (e) Visual standard

1.0 1.0
10 Years 8 Years

0.5 0.5
Proportion “steeper”
Proportion “taller”

0.0 0.0

MLE prediction MLE prediction

(c) Haptic standard (f ) Visual standard

1.0 1.0
5 Years 5 Years

0.5 0.5

0.0 0.0
–6 –3 0 3 6 –12 –6 0 6 12

MLE prediction MLE prediction


Relative probe size (mm) Relative probe orientation (deg)

FIGURE 18.3  (See color insert.) Development of cross-modal integration for size and orientation discrimi-
nation. Illustration of experimental setup for size (a) and orientation (d) discrimination. Sample psychometric
functions for four children, with varying degrees of cross-modal conflict. (b and c) Size discriminations: SB
age 10.2 (b); DV age 5.5 (c); (e and f) orientation discrimination: AR age 8.7 (e); GF age 5.7 (f). Lower color-
coded arrows show MLE predictions, calculated from threshold measurements (Equation 18.1). Black-dashed
horizontal lines show 50% performance point, intersecting with curves at their PSE (shown by short vertical
bars). Upper color-coded arrows indicate size of haptic standard in size condition (b and c) and orientation of
visual standard in orientation condition (e and f). Older children generally follow the adult pattern, whereas
5-year-olds were dominated by haptic information for size task, and visual information for orientation task.
For size judgment, amount of conflict was 0 for red symbols, +3 mm (where plus means vision was larger) for
blue symbols, and –3 mm for green symbols. For orientation, same colors refer to 0° and ±4°.

front of an occluding screen for visual judgments, behind the screen for haptic judgments, or both
in front and behind for bimodal judgments.
All trials involved a two-alternative forced-choice task in which the subject judged whether a
standard block seemed taller or shorter than a probe of variable height. For the single-modality tri-
als, one stimulus was the standard, always 55 mm high, the other the probe, of variable height. The
proportion of trials in which the probe was judged taller than the standard was computed for each
probe height, yielding psychometric functions. The crucial condition was the dual-modality condi-
tion, in which visual and haptic sizes of the standard were in conflict, with the visual block 55 +
Δ mm and the haptic block 55 – Δ mm (Δ = 0 or ±3 mm). The probe was composed of congruent
visual and haptic stimuli of variable heights (48–62 mm).
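As described in the next paragraph, such response proportions are typically fitted with a cumulative Gaussian whose mean gives the point of subjective equality (PSE) and whose standard deviation indexes the discrimination threshold. The sketch below shows one standard way of performing such a fit; the response proportions are invented for illustration, and this is not the authors' analysis code.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cum_gauss(x, pse, sigma):
    # Cumulative Gaussian psychometric function: its median (pse) is the point
    # of subjective equality, its SD (sigma) serves as the threshold estimate.
    return norm.cdf(x, loc=pse, scale=sigma)

# Hypothetical data: probe heights (mm) and proportion of "probe taller" responses.
probe = np.array([48, 50, 52, 54, 56, 58, 60, 62], dtype=float)
p_taller = np.array([0.05, 0.10, 0.25, 0.45, 0.70, 0.85, 0.95, 1.00])

(pse, sigma), _ = curve_fit(cum_gauss, probe, p_taller, p0=[55.0, 3.0])
print(pse, sigma)   # PSE near 54-55 mm; sigma gives the discrimination threshold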
After validating the technique with adults, demonstrating that optimal cross-modal integration
also occurred under these conditions, we measured haptic, visual, and bimodal visuo-haptic size dis-
crimination in 5- to 10-year-old children. Figure 18.3 shows sample psychometric functions for the
dual-modality measurements, fitted with cumulative Gaussian functions whose median estimates
the point of subjective equality (PSE) between the probe and standard. The pattern of results for the
10-year-old (Figure 18.3b) was very much like those for the adult: negative values of Δ caused the
curves to shift leftward, positive values caused them to shift rightward. That is to say, the curves
followed the visual standard, suggesting that visual information dominated the match, as the
MLE model predicts it should given that the visual thresholds were lower than the haptic thresholds. This
is consistent with the MLE model (indicated by color-coded arrows below the abscissa): the visual
judgment was more precise, and should therefore dominate.
For the 5-year-olds (Figure 18.3c), however, the results were completely different: the psycho-
metric functions shifted in the direction opposite to that of the 10-year-olds, following the bias of
the haptic stimulus. The predictions (color-coded arrows under the abscissa) are similar for both the
5- and 10-year-olds, as for both groups of children, visual thresholds were much lower than haptic
thresholds, so the visual stimuli should dominate: but for the 5-year-olds, the reverse holds, with the
haptic standard dominating the match.
These data show that for size judgments, touch dominates over vision. But is this universally
true? We repeated the experiments with orientation discrimination, another basic spatial task,
one that could, in principle, be computed by the neural hardware of primary visual
cortex (Hubel and Wiesel 1968). Subjects were required to discriminate which bar of a dual pre-
sentation (standard and probe) was rotated more counterclockwise. As with the size discrimina-
tions, we first measured thresholds in each separate modality, then visuo-haptically, with varying
degrees of conflict (Δ = 0 or ±4°). Figure 18.3e and f show sample psychometric functions for the
dual-modality measurements for a 5- and 8-year-old child. As with the size judgments, the pattern
of results for the 8-year-old was very much like those for the adult, with the functions of the three
different conflicts (Figure 18.3e) falling very much together, as predicted from the single modality
thresholds by the MLE model (arrows under the abscissa). Again, however, the pattern of results
for the 5-year-old was quite different (Figure 18.3f). Although the MLE model predicts similar
curves for the three conflict conditions, the psychometric functions very closely followed the visual
standards (indicated by the arrows above the graphs), the exact opposite pattern to that observed for
size discrimination.
Figure 18.4 reports PSEs for children of all ages for the three conflict conditions, plotted as
a function of the MLE predictions from single-modality discrimination thresholds. If the MLE
prediction held, the data should fall along the black-dotted equality line (as in the bottom graph,
which reports the adults’ results). For adults this was so, for both size and orientation. However,
at 5 years of age, the story was quite different. For the size discriminations (Figure 18.4a), not
only did the measured PSEs fail to follow the MLE predictions, they varied inversely with Δ (fol-
lowing the haptic standard), lining up almost orthogonal to the equality line. Similarly, the data
for the 6-year-olds did not follow the prediction, although they tended to be more
scattered than orthogonally ordered relative to the prediction line. By 8 years of age, the data began


FIGURE 18.4  (See color insert.) Summary data showing PSEs for all subjects for all conflict conditions,
plotted against predictions, for size (a) and orientation (b) discriminations. Different colors refer to different
subjects within each age group. Symbol shapes refer to level of cross-sensory conflict (Δ): squares, 3 mm or
4°; circles, –3 mm or –4°; upright triangles, 0; diamonds, 2 mm; inverted triangles, –2 mm. Closed symbols
refer to no-blur condition for size judgments, and vertical orientation judgments; open symbols to modest blur
(screen at 19 cm) or oblique orientations; cross in symbols to heavy blur (screen at 39 cm).

to follow the prediction, and by age 10 the data fell along it well, similar to the adult pattern of
results.
Figure 18.5a shows how thresholds vary with age for the various conditions. For both tasks,
visual and haptic thresholds decreased steadily up till 10 years (orientation more so than size).
The light-blue symbols show the thresholds predicted from the MLE model (Equation 18.3). For
the adults, the predicted improvement was close to the best single-modality threshold, and indeed,
the dual-modality thresholds were never worse than the best single-modality threshold. For the
5-year-old children, the results were quite different, with the dual-modality thresholds following
the worst thresholds. For the size judgment, they followed the haptic thresholds, which were not only much higher
than the MLE predictions, but twice the best single-modality (visual) thresholds. This result shows
not only that integration was not optimal, it was not even a close approximation such as “winner take
all.” Indeed, it reflects a “loser take all” strategy. This reinforces the PSE data in showing that these
young children do not integrate cross-modally in a way that benefits perceptual discrimination.
Figure 18.5b plots the development of theoretical (violet symbols) and observed (black symbols)
visual and haptic weights. For both size and orientation judgments, the theoretical haptic weights
(calculated from thresholds) were fairly constant over age, 0.2 to 0.3 for size and 0.3 to 0.4 for

[Figure 18.5 appears here: (a) thresholds (mm for size, deg for orientation) for haptic, vision, MLE, and cross-modal conditions as a function of age (3–10 years and adult, plus a blur condition); (b) haptic and visual weights (0–1) as a function of age, derived from PSEs and from thresholds.]

FIGURE 18.5  (See color insert.) Development of thresholds and visuo-haptic weights. Average thresholds
(geometric means) for haptic (red symbols), visual (green), and visuo-haptic (dark blue) size and orientation
discrimination, together with average MLE predictions (light blue), as a function of age. Predictions were cal-
culated individually for each subject and then averaged. Tick-labeled “blur” shows thresholds for visual stimuli
blurred by a translucent screen 19 cm from blocks. Error bars are ±1 SEM. Haptic and visual weights for size
and orientation discrimination, derived from thresholds via MLE model (violet circles) or from PSE values
(black squares). Weights were calculated individually for each subject, and then averaged. After 8 to 10 years,
the two estimates converged, suggesting that the system then integrates in a statistically optimal manner.

orientation. However, the haptic weights necessary to predict the 5-year-old PSE size data are 0.6
to 0.8, far, far greater than the prediction, implying that these young children give far more weight
to touch for size judgments than is optimal. Similarly, the haptic weights necessary to predict the
orientation judgments are around 0, far less than the prediction, suggesting that these children base
orientation judgments almost entirely on visual information. In neither case does anything like
optimal cue combination occur.
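The relation between the two weight estimates compared above can be made concrete with a short numerical sketch. The numbers below are hypothetical (not the chapter's data), and the PSE-to-weight conversion assumes one common conflict convention, in which the visual and haptic standards are displaced by +Δ/2 and −Δ/2; under the weighted-average model this gives PSE = (1 − 2wH)Δ/2, so the slope of PSE against Δ yields an empirical haptic weight that can be compared with the MLE weight computed from thresholds.

import numpy as np

sigma_v, sigma_h = 2.0, 4.0   # hypothetical visual and haptic size thresholds (mm)

# Theoretical haptic weight from thresholds (MLE: weight inversely proportional to variance)
w_h_mle = sigma_v**2 / (sigma_v**2 + sigma_h**2)

# Empirical haptic weight from PSEs measured at different cross-modal conflicts Delta
deltas = np.array([-4.0, 0.0, 4.0])      # conflict sizes (mm); visual at +Delta/2, haptic at -Delta/2
pse = np.array([1.5, 0.0, -1.5])         # hypothetical measured PSEs (mm), following the haptic standard
slope = np.polyfit(deltas, pse, 1)[0]    # fit PSE = slope * Delta
w_h_emp = (1.0 - 2.0 * slope) / 2.0      # invert PSE = (1 - 2*w_h) * Delta / 2

print(f"MLE-predicted haptic weight:       {w_h_mle:.2f}")
print(f"Empirical haptic weight from PSEs: {w_h_emp:.2f}")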

18.7  CALIBRATION BY CROSS-MODAL COMPARISON?


Our experiments showed that before 8 years of age, children do not integrate information between
the senses; rather, one sense dominates the other. Which sense dominates depends on the task: for
size judgments it is touch, for orientation judgments it is vision, so neither acts as a universal "gold standard."
Given the overwhelming body of evidence for optimal integration in adults, the finding that children do not
integrate optimally was unexpected, and suggests that multisensory interaction
in infants and children is fundamentally different from that in adults. How could it differ? Although most recent
work on multisensory interactions has concentrated on sensory fusion, the efficient combination of
information from all the senses, an equally important but somewhat neglected potential function is
calibration. In his 300-year-old “Essay towards a new theory of vision,” Bishop George Berkeley
(1709) correctly observed that vision has no direct access to attributes such as distance, solidity,
or "bigness." These can be acquired visually only after they have been associated with touch
(proposition 45): in other words, “touch educates vision,” perhaps better expressed as “touch cali-
brates vision.” Calibration is probably necessary at all ages, but during the early years of life, when

High precision Low precision


Low accuracy High accuracy

FIGURE 18.6  Accuracy and precision. Accuracy is defined as closeness of a measurement to its true physi-
cal value (its veracity), whereas precision is degree of reproducibility or repeatability between measurements,
usually measured as standard deviation of distribution. “Target analogy” shows high precision but poor accu-
racy (left), and good average accuracy but poor precision (right). The archer would correct his or her aim by
calibrating sights of the bow. Similarly, perceptual systems can correct for a bias by cross-calibration between
senses.

children are effectively “learning to see,” calibration may be expected to be more important. It is
during these years that limbs are growing rapidly, eye length and eye separation are increasing, all
necessitating constant recalibration between sight and touch. Indeed, many studies suggest that the
first 8 years of life correspond to the critical period of plasticity for many properties, such as binocular
vision (Banks et al. 1975) and acquiring accent-free language (Doupe and Kuhl 1999).
So before 8 years of age, calibration may be more important than integration. The advantages of
fusing sensory information are probably more than offset by those of keeping the evolving system
calibrated, and using one system to calibrate another precludes the fusion of the two. If we accept
Berkeley's idea that vision must be calibrated by touch, this would explain why size discrimination
thresholds are dominated by touch, even though touch is less precise than vision. But why
are orientation thresholds dominated by vision? Perhaps Berkeley was not quite right, and touch
does not always calibrate vision, but the more robust sense for a particular task is the calibrator. In
the same way that the more precise sense has the highest weights for sensory fusion, perhaps the
more accurate sense is used for calibration. The more accurate need not be the more precise, but is
probably the more robust. Accuracy is defined in absolute terms, as the distance from physical real-
ity, whereas precision is a relative measure, related to the reliability or repeatability of the results
(see Figure 18.6). It is therefore reasonable that for size, touch should be more accurate, as vision
cannot code size directly, but only through a computation combining retinal size with an estimate of distance.
Orientation, on the other hand, is coded directly by primary visual cortex (Hubel and Wiesel 1968),
whereas it can be derived from touch only indirectly, via complex coordinate transforms.
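As a concrete numerical illustration of this distinction (a minimal sketch with hypothetical measurements, chosen only to mirror the target analogy of Figure 18.6): accuracy is indexed by the bias of the mean estimate from the true value, precision by the spread of repeated estimates.

import numpy as np

true_size = 50.0                                    # mm, the physical value being estimated
estimates_a = np.array([44.9, 45.1, 45.0, 44.8])    # precise (low spread) but inaccurate (biased)
estimates_b = np.array([47.0, 53.0, 49.5, 50.5])    # accurate on average but imprecise

for name, est in [("A", estimates_a), ("B", estimates_b)]:
    bias = abs(est.mean() - true_size)   # inaccuracy: distance of the mean from the true value
    sd = est.std(ddof=1)                 # imprecision: standard deviation of repeated estimates
    print(f"Observer {name}: bias = {bias:.2f} mm, SD = {sd:.2f} mm")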

18.8  HAPTIC DISCRIMINATION IN BLIND AND LOW-VISION CHILDREN: DISRUPTION OF CROSS-SENSORY CALIBRATION?
If the idea of calibration is correct, then early deficits in one sense should affect the function of
other senses that rely on it for calibration. Specifically, haptic impairment should lead to poor visual
discrimination of size and visual impairment to poor haptic discrimination of orientation. We have
tested and verified the latter of these predictions (Gori et al. 2010). In 17 congenitally visually
impaired children (aged 5–19 years), we measured haptic discrimination thresholds for both orien-
tation and size, and found that orientation, but not size, thresholds were impaired. Figure 18.7 plots
size against orientation thresholds, both normalized by age-matched normally sighted children.

[Figure 18.7 appears here: normalized size thresholds plotted against normalized orientation thresholds on logarithmic axes (0.3–10).]

FIGURE 18.7  Thresholds for orientation discrimination, normalized by age-matched controls, plotted
against normalized size thresholds, for 17 unsighted or low-vision children aged between 5 and 18 years. Most
points lie in lower-right quadrant, implying better size and poorer orientation discrimination. Arrows refer
to group averages, 2.2 ± 0.3 for orientation and 0.8 ± 0.06 for size. Star in lower-left quadrant is the acquired
low-vision child. (Reprinted from Gori, M. et al., Curr. Biol., 20, 223–5, 2010. With permission.)

Orientation discrimination thresholds were all worse than the age-matched controls (>1), on aver-
age twice as high, whereas size discrimination thresholds were generally better than the controls
(<1). Interestingly, one child with an acquired visual impairment (star symbol) showed a completely
different pattern of results, with no orientation deficit. Although we have only one such subject, we
presume that his fine orientation thresholds result from the early visual experience (before 2½ years
of age), which may have been sufficient for the visual system to calibrate touch.
Many previous studies have examined haptic perception in the visually impaired, with seem-
ingly contradictory results: some studies show the performance of blind and low-vision subjects
to be as good or better than normally sighted controls, in tasks such as size discrimination with a
cane (Sunanto and Nakata 1998), haptic object exploration and recognition, and tactile recognition
of two-dimensional angles and gratings (Morrongiello et al. 1994); whereas other tasks including
haptic orientation discrimination (Alary et al. 2009; Postma et al. 2008), visual spatial imagination
(Noordzij et al. 2007), and representation and updating of spatial information (Pasqualotto and
Newell 2007) have shown impairments. Visually impaired children had particular difficulties with
rotated object arrays (Ungar et al. 1995). Most recently, Bülthoff and colleagues have shown that
congenitally blind subjects are worse than both blindfolded sighted and acquired-blind subjects at
haptic recognition of faces (Dopjans et al. 2009). It is possible that the key to understanding the
discrepancy in the literature is whether the haptic task required an early cross-modal visual
calibration. However, early exposure to vision seems to be sufficient to calibrate the developing
haptic system, suggesting that the sensitive period for damage is shorter than that for normal devel-
opment. This is consistent with other evidence for multiple sensitive periods, such as global motion
perception (Lewis and Maurer 2005).
The suggestion that specific perceptual tasks may require cross-modal calibration during devel-
opment could have practical implications, possibly leading to improvements in rehabilitation pro-
grams. Where cross-sensory calibration has been compromised, for example by blindness, it may
be possible to train people to use some form of “internal” calibration, or to calibrate by another
modality such as sound.

18.9  CONCLUDING REMARKS: EVIDENCE OF LATE MULTISENSORY DEVELOPMENT
To perceive a coherent world, we need to combine signals from our five sensory systems, signals that
can be complementary or redundant. In adults, redundant signals from various sensory systems—
vision, audition, and touch—are usually integrated in an optimal manner, improving the precision

of the individual estimates. In the past few years, great interest has emerged in when and how
these functions develop in children and young animals.
Many studies, both in children and animal models, suggest that multisensory integration does not
occur at birth, but develops over time. Some basic forms of integration, such as reflexive orienting
toward an audiovisual signal, develop quite early (Neil et al. 2006); some others, such as integration
of visual-haptic signals for orientation and size (Gori et al. 2008), and self-generated cues during
navigation (Nardini et al. 2008), develop only after 8 years of age. Similarly, whereas orienting
reflexes benefit from cue integration by 8 months (Neil et al. 2006), nonreflexive motor responses to
bimodal stimuli continue to develop throughout childhood (Barutchu et al. 2009, 2010).
Some have suggested that the late development might occur because multisensory integration
requires that higher-order cognitive processes, including attention, reach a certain level of maturity or,
alternatively, that all motor processes reach maturity (Barutchu et al. 2010), which does not occur
until late adolescence (e.g., Betts et al. 2006; Kanaka et al. 2008; Smith and Chatterjee 2008).
However, it is far from clear what complex cognitive processes are involved in simple size and ori-
entation discriminations, and processes such as attention have been shown to operate at very low
levels, including V1 and A1 (Gandhi et al. 1999; Woldorff et al. 1993).
We suggest that anatomical and physiological differences in maturation rates could pose particu-
lar challenges for development, as could the need for the senses to continually recalibrate, to take
into account growing limbs, eye length, interocular distances, etc. If cross-sensory calibration were
more fundamental during development than for mature individuals, this would explain the lack of
integration, as the use of one sense to calibrate the other necessarily precludes the integration of
redundant information. Calibration does not always occur in the same direction (such as touch edu-
cating vision) but, in general, the more robust sense for a particular task calibrates the less robust.
The haptic system, which has the more immediate information about size, seems to calibrate vision,
which has no absolute size information and must scale for distance. On the other hand, for orienta-
tion discrimination, the visual system, which has specialized detectors tuned for orientation, seems
to calibrate touch. Indeed, congenitally blind or low-vision children show a strong deficit in haptic
orientation judgments, a finding consistent with the possibility that the deficit results from an early
failure to calibrate.
Cross-sensory calibration can explain many curious results, such as the fact that before integration
emerges, dominance is task-dependent: visual for orientation, haptic for size. Similar results have
been observed with audiovisual integration: audiovisual speech illusions do not seem to develop
until 10 years of age, whereas illusions not involving speech are mature by age 5 (Tremblay et al.
2007). Along the same lines, it can also explain the asymmetries in task performance of subjects
with different sensory deficits (Gori et al. 2010; Putzar et al. 2007; Schorr et al. 2005).
All these results suggest that whereas the different sensory systems of infants and children are
clearly interconnected, multimodal perception may not be fully developed until quite late. Cross-sensory
calibration may be a useful strategy that allows the brain to take into account the dramatic
anatomical and sensory changes during early life, as well as keeping our senses robustly calibrated
through life's trials and tribulations.

ACKNOWLEDGMENT
This research was supported by the Italian Ministry of Universities and Research, EC project
“STANIB” (FP7 ERC), EC project “RobotCub” (FP6-4270), and Istituto David Chiossone Onlus.

REFERENCES
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Current
Biology 14:257–62.

Alary, F., M. Duquette, R. Goldstein, C. Elaine Chapman, P. Voss, V. La Buissonniere-Ariza, and F. Lepore.
2009. Tactile acuity in the blind: A closer look reveals superiority over the sighted in some but not all
cutaneous tasks. Neuropsychologia 47:2037–43.
Atkinson, J. 2000. The developing visual brain. New York: Oxford Univ. Press.
Bahrick, L.E. 2001. Increasing specificity in perceptual development: Infants’ detection of nested levels of
multimodal stimulation. Journal of Experimental Child Psychology 79:253–70.
Bahrick, L.E., R. Flom, and R. Lickliter. 2002. Intersensory redundancy facilitates discrimination of tempo in
3-month-old infants. Developmental Psychobiology 41:352–63.
Bahrick, L.E., and R. Lickliter. 2000. Intersensory redundancy guides attentional selectivity and perceptual
learning in infancy. Developmental Psychology 36:190–201.
Bahrick, L.E., and R. Lickliter. 2004. Infants’ perception of rhythm and tempo in unimodal and multimodal
stimulation: a developmental test of the intersensory redundancy hypothesis. Cognitive, Affective &
Behavioral Neuroscience 4:137–47.
Banks, M.S., R.N. Aslin, and R.D. Letson. 1975. Sensitive period for the development of human binocular
vision. Science 190:675–7.
Barutchu, A., D.P. Crewther, and S.G. Crewther. 2009. The race that precedes coactivation: development of
multisensory facilitation in children. Developmental Science 12:464–73.
Barutchu, A., J. Danaher, S.G. Crewther, H. Innes-Brown, M.N. Shivdasani, and A.G. Paolini. 2010. Audiovisual
integration in noise by children and adults. Journal of Experimental Child Psychology 105:38–50.
Bergeson, T.R., and D.B. Pisoni. 2003. Audiovisual speech perception in deaf adults and children following
cochlear implantation. In Handbook of multisensory integration, ed. G. Calvert, C. Spence, and B.E.
Stein, 749–772. Cambridge, MA: MIT Press.
Berkeley, G. 1709. An essay towards a new theory of vision. 1963. Indianapolis, IN: Bobbs-Merrill.
Betts, J., J. McKay, P. Maruff, and V. Anderson. 2006. The development of sustained attention in children: The
effect of age and task load. Child Neuropsychology 12:205–21.
Bremner, A.J., N.P. Holmes, and C. Spence. 2008a. Infants lost in (peripersonal) space? Trends in Cognitive
Sciences 12:298–305.
Bremner, A.J., D. Mareschal, S. Lloyd-Fox, and C. Spence. 2008b. Spatial localization of touch in the first year
of life: Early influence of a visual spatial code and the development of remapping across changes in limb
position. Journal of Experimental Psychology. General 137:149–62.
Bresciani, J.P., and M.O. Ernst. 2007. Signal reliability modulates auditory–tactile integration for event count-
ing. Neuroreport 18:1157–61.
Brown, A.M., V. Dobson, and J. Maier. 1987. Visual acuity of human infants at scotopic, mesopic and photopic
luminances. Vision Research 27:1845–58.
Del Viva, M.M., R. Igliozzi, R. Tancredi, and D. Brizzolara. 2006. Spatial and motion integration in children
with autism. Vision Research 46:1242–52.
Dodd, B. 1979. Lip reading in infants: Attention to speech presented in- and out-of-synchrony. Cognitive
Psychology 11:478–84.
Dopjans, L., C. Wallraven, and H.H. Bülthoff. 2009. Visual experience supports haptic face recognition:
Evidence from the early- and late-blind. 10th International Multisensory Research Forum (IMRF), New
York City, The City College of New York.
Doupe, A.J., and P.K. Kuhl. 1999. Birdsong and human speech: Common themes and mechanisms. Annual
Review of Neuroscience 22:567–631.
Ellemberg, D., T.L. Lewis, D. Maurer, C.H. Lui, and H.P. Brent. 1999. Spatial and temporal vision in patients
treated for bilateral congenital cataracts. Vision Research 39:3480–9.
Ellemberg, D., T.L. Lewis, M. Dirks, D. Maurer, T. Ledgeway, J.P. Guillemot, and F. Lepore. 2004. Putting
order into the development of sensitivity to global motion. Vision Research 44:2403–11.
Elliott, L.L. 1979. Performance of children aged 9 to 17 years on a test of speech intelligibility in noise using
sentence material with controlled word predictability. Journal of the Acoustical Society of America
66:651–3.
Ernst, M.O., and M.S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415:429–33.
Gandhi, S.P., D.J. Heeger, and G.M. Boynton. 1999. Spatial attention affects brain activity in human pri-
mary visual cortex. Proceedings of the National Academy of Sciences of the United States of America
96:3314–9.
Gibson, E.J., and A.S. Walker. 1984. Development of knowledge of visual-tactual affordances of substance.
Child Development 55:453–60.

Gori, M., M.M. Del Viva, G. Sandini, and D.C. Burr. 2008. Young children do not integrate visual and haptic
form information. Current Biology 18:694–8.
Gori, M., G. Sandini, C. Martinoli, and D. Burr. 2010. Poor haptic orientation discrimination in nonsighted
children may reflect disruption of cross-sensory calibration. Current Biology 20:223–5.
Gottlieb, G. 1971. Development of species identification in birds: An inquiry into the prenatal determinants of
perception. Chicago: Univ. of Chicago Press.
Hatwell, Y. 1987. Motor and cognitive functions of the hand in infancy and childhood. International Journal of
Behavioural Development 10:509–26.
Hotting, K., and B. Roder. 2009. Auditory and auditory–tactile processing in congenitally blind humans.
Hearing Research 258:165–74.
Hubel, D.H., and T.N. Wiesel. 1968. Receptive fields and functional architecture of monkey striate cortex.
Journal of Physiology 195:215–43.
Jiang, W., and B.E. Stein. 2003. Cortex controls multisensory depression in superior colliculus. Journal of
Neurophysiology 90:2123–35.
Johnson, C.E. 2000. Children’s phoneme identification in reverberation and noise. Journal of Speech, Language,
and Hearing Research 43:144–57.
Jusczyk, P., D. Houston, and M. Goodman. 1998. Speech perception during the first year. In Perceptual devel-
opment: Visual, Auditory, and Speech Perception in Infancy, ed. A. Slater. Psychology Press.
Kanaka, N., T. Matsuda, Y. Tomimoto, Y. Noda, E. Matsushima, M. Matsuura, and T. Kojima. 2008.
Measurement of development of cognitive and attention functions in children using continuous perfor-
mance test. Psychiatry and Clinical Neurosciences 62:135–41.
Klein, R.E. 1966. A developmental study of perception under condition of conflicting cues. Dissertation abstract.
Kovács, I., P. Kozma, A. Fehér, and G. Benedek. 1999. Late maturation of visual spatial integration in humans.
Proceedings of the National Academy of Sciences of the United States of America 96:12204–9.
Lewis, T.L., and D. Maurer. 2005. Multiple sensitive periods in human visual development: Evidence from
visually deprived children. Developmental Psychobiology 46:163–83.
Lewis, T.L., D. Ellemberg, D. Maurer, J.-P. Guillemot, and F. Lepore. 2004. Motion perception in 5-year-olds:
Immaturity is related to hypothesized complexity of cortical processing. Journal of Vision 4:30–30a.
Lewkowicz, D.J. 1986. Developmental changes in infants’ bisensory response to synchronous durations. Infant
Behavior and Development 163:180–8.
Lewkowicz, D.J. 1988a. Sensory dominance in infants: 1. Six-month-old infants’ response to auditory–visual
compounds. Developmental Psychology 24:155–71.
Lewkowicz, D.J. 1988b. Sensory dominance in infants: 2. Ten-month-old infants’ response to auditory-visual
compounds. Developmental Psychology 24:172–82.
Lewkowicz, D.J. 1992. Infants’ responsiveness to the auditory and visual attributes of a sounding/moving
stimulus. Perception & Psychophysics 52:519–28.
Lewkowicz, D.J. 1996. Perception of auditory–visual temporal synchrony in human infants. Journal of
Experimental Psychology. Human Perception and Performance 22:1094–106.
Lewkowicz, D.J. 2000. The development of intersensory temporal perception: An epigenetic systems/limita-
tions view. Psychological Bulletin 126:281–308.
Lewkowicz, D.J., and R. Lickliter (ed.). 1994. The development of intersensory perception: Comparative perspec-
tives. Hillsdale, NJ: Lawrence Erlbaum Associates Inc.
Lewkowicz, D.J., and G. Turkewitz. 1981. Intersensory interaction in newborns: Modification of visual prefer-
ences following exposure to sound. Child Development 52:827–32.
Lickliter, R., D.J. Lewkowicz, and R.F. Columbus. 1996. Intersensory experience and early perceptual devel-
opment: The role of spatial contiguity in bobwhite quail chicks’ responsiveness to multimodal maternal
cues. Developmental Psychobiology 29:403–16.
Macaluso, E., and J. Driver. 2004. Neuroimaging studies of cross-modal integration for emotion. In The Handbook of
Multisensory Processes, ed. G.A. Calvet, C. Spence, and B.E. Stein, 529–48. Cambridge, MA: MIT Press.
Massaro, D.W. 1987. Speech perception by ear and eye: A paradigm for psychological inquiry. Program in
experimental psychology. Hillsdale, NJ: Lawrence Erlbaum Associates.
McGurk, H., and R.P. Power. 1980. Intermodal coordination in young children: Vision and touch. Developmental
Psychology 16:679–80.
Meltzoff, A.N., and R.W. Borton. 1979. Intermodal matching by human neonates. Nature 282:403–4.
Meredith, M.A., and B.E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. Journal of Neurophysiology 56:640–62.
Misceo, G.F., W.A. Hershberger, and R.L. Mancini. 1999. Haptic estimates of discordant visual-haptic size
vary developmentally. Perception & Psychophysics 61:608–14.

Morrongiello, B.A., G.K. Humphrey, B. Timney, J. Choi, and P.T. Rocca. 1994. Tactual object exploration and
recognition in blind and sighted children. Perception 23:833–48.
Morrongiello, B.A., K.D. Fenwick, and G. Chance. 1998. Cross-modal learning in newborn infants: Inferences
about properties of auditory–visual events. Infant Behavior and Development 21:543–54.
Nardini, M., P. Jones, R. Bedford, and O. Braddick. 2008. Development of cue integration in human navigation.
Current Biology 18:689–93.
Neil, P.A., C. Chee-Ruiter, C. Scheier, D.J. Lewkowicz, and S. Shimojo. 2006. Development of multisensory
spatial integration and perception in humans. Developmental Science 9:454–64.
Noordzij, M.L., S. Zuidhoek, and A. Postma. 2007. The influence of visual experience on visual and spatial
imagery. Perception 36:101–12.
Olsho, L.W. 1984. Infant frequency discrimination as a function of frequency. Infant Behavior and Development
7:27–35.
Olsho, L.W., E.G. Koch, E.A. Carter, C.F. Halpin, and N.B. Spetner. 1988. Pure-tone sensitivity of human
infants. Journal of the Acoustical Society of America 84:1316–24.
Pasqualotto, A., and F.N. Newell. 2007. The role of visual experience on the representation and updating of
novel haptic scenes. Brain and Cognition 65:184–94.
Patterson, M.L., and J.F. Werker. 2002. Infants’ ability to match dynamic phonetic and gender information in
the face and voice. Journal of Experimental Child Psychology 81:93–115.
Paus, T. 2005. Mapping brain development and aggression. Canadian Child and Adolescent Psychiatry Review
14:10–5.
Postma, A., S. Zuidhoek, M.L. Noordzij, and A.M. Kappers. 2008. Haptic orientation perception benefits from
visual experience: Evidence from early-blind, late-blind, and sighted people. Perception & Psychophysics
70:1197–206.
Putzar, L., I. Goerendt, K. Lange, F. Rosler, and B. Roder. 2007. Early visual deprivation impairs multisensory
interactions in humans. Nature Neuroscience 10:1243–5.
Rentschler, I., M. Jüttner, E. Osman, A. Müller, and T. Caelli. 2004. Development of configural 3D object rec-
ognition. Behavioural Brain Research 149:107–11.
Rose, S.A. 1981. Developmental changes in infants’ retention of visual stimuli. Child Development 52:227–33.
Rose, S.A., and H.A. Ruff. 1987. Cross-modal abilities in human infants. In Handbook of Infant Development,
ed. J.D. Osofsky, 318–62. New York: Wiley.
Röder, B., F. Rosler, and C. Spence. 2004. Early vision impairs tactile perception in the blind. Current Biology
14:121–4.
Röder, B., A. Kusmierek, C. Spence, and T. Schicke. 2007. Developmental vision determines the reference
frame for the multisensory control of action. Proceedings of the National Academy of Sciences of the
United States of America 104:4753–8.
Sann, C., and A. Streri. 2007. Perception of object shape and texture in human newborns: evidence from cross-
modal transfer tasks. Developmental Science 10:399–410.
Schorr, E.A., N.A. Fox, V. van Wassenhove, and E.I. Knudsen. 2005. Auditory-visual fusion in speech percep-
tion in children with cochlear implants. Proceedings of the National Academy of Sciences of the United
States of America 102:18748–50.
Smith, S.E., and A. Chatterjee. 2008. Visuospatial attention in children. Archives of Neurology 65:1284–8.
Stein, B.E. 2005. The development of a dialogue between cortex and midbrain to integrate multisensory infor-
mation. Experimental Brain Research 166:305–15.
Stein, B.E., E. Labos, and L. Kruger. 1973. Sequence of changes in properties of neurons of superior colliculus
of the kitten during maturation. Journal of Neurophysiology 36:667–79.
Stein, B.E., M.A. Meredith, and M.T. Wallace. 1993. The visually responsive neuron and beyond: multisensory
integration in cat and monkey. Progress in Brain Research 95:79–90.
Stein, B.E., T.J. Perrault, T.R. Stanford, and B.A. Rowland. 2009a. Postnatal experiences influence how the
brain integrates information from different senses. Frontiers in Integrative Neuroscience 3:21.
Stein, B.E., T.R. Stanford, and B.A. Rowland. 2009b. The neural basis of multisensory integration in the mid-
brain: Its organization and maturation. Hearing Research 258:4–15.
Streri, A. 2003. Cross-modal recognition of shape from hand to eyes in human newborns. Somatosensory &
Motor Research 20:13–8.
Streri, A., M. Lhote, and S. Dutilleul. 2000. Haptic perception in newborns. Developmental Science 3:319–27.
Streri, A., E. Gentaz, E. Spelke, and G. van de Walle. 2004. Infants’ haptic perception of object unity in rotating
displays. Quarterly Journal of Experimental Psychology A 57:523–38.
Streri, A., C. Lemoine, and E. Devouche. 2008. Development of inter-manual transfer of shape information in
infancy. Developmental Psychobiology 50:70–6.

Striano, T., and E. Bushnell. 2005. Haptic perception of material properties by 3-month-old infants. Infant
Behavior and Development 28:266–89.
Sunanto, J., and H. Nakata. 1998. Indirect tactual discrimination of heights by blind and blindfolded sighted
subjects. Perceptual and Motor Skills 86:383–6.
Trehub, S.E., B.A. Schneider, and J.L. Henderson. 1995. Gap detection in infants, children, and adults. Journal
of the Acoustical Society of America 98:2532–41.
Tremblay, C., F. Champoux, P. Voss, B.A. Bacon, F. Lepore, and H. Theoret. 2007. Speech and non-speech
audio-visual illusions: a developmental study. PLoS One 2:e742.
Trommershäuser, J., M. Landy, and K. Körding (eds.) (in press). Sensory cue integration. New York: Oxford
Univ. Press.
Ungar, S., M. Blades, and C. Spencer. 1995. Mental rotation of a tactile layout by young visually impaired
children. Perception 24:891–900.
Wallace, M.T., and B.E. Stein. 1997. Development of multisensory neurons and multisensory integration in cat
superior colliculus. Journal of Neuroscience 17:2429–44.
Wallace, M.T., and B.E. Stein. 2001. Sensory and multisensory responses in the newborn monkey superior col-
liculus. Journal of Neuroscience 21:8886–94.
Wallace, M.T., and B.E. Stein. 2007. Early experience determines how the senses will interact. Journal of
Neurophysiology 97:921–6.
Wallace, M.T., T.J. Perrault Jr., W.D. Hairston, and B.E. Stein. 2004. Visual experience is necessary for the
development of multisensory integration. Journal of Neuroscience 24:9580–4.
Watt, S.J., M.F. Bradshaw, T.J. Clarke, and K.M. Elliot. 2003. Binocular vision and prehension in middle child-
hood. Neuropsychologia 41:415–20.
Wilkinson, L.K., M.A. Meredith, and B.E. Stein. 1996. The role of anterior ectosylvian cortex in cross-modal-
ity orientation and approach behavior. Experimental Brain Research 112:1–10.
Woldorff, M.G., C.C. Gallen, S.A. Hampson, S.A. Hillyard, C. Pantev, D. Sobel, and F.E. Bloom. 1993.
Modulation of early sensory processing in human auditory cortex during auditory selective attention.
Proceedings of the National Academy of Sciences of the United States of America 90:8722–6.
19  Phonetic Recalibration in Audiovisual Speech
Jean Vroomen and Martijn Baart

CONTENTS
19.1 Introduction........................................................................................................................... 363
19.2 A Short Historical Background on Audiovisual Speech Aftereffects...................................364
19.3 Seminal Study on Lip-Read–Induced Recalibration............................................................. 365
19.4 Other Differences between Recalibration and Selective Speech Adaptation....................... 367
19.4.1 Buildup...................................................................................................................... 367
19.4.2 Dissipation................................................................................................................. 368
19.4.3 Recalibration in “Speech” versus “Nonspeech” Mode.............................................. 368
19.5 Stability of Recalibration over Time..................................................................................... 369
19.5.1 Basic Phenomenon of Lexically Induced Recalibration............................................ 369
19.5.2 Lip-Read–Induced versus Lexically Induced Recalibration..................................... 370
19.6 Developmental Aspects......................................................................................................... 372
19.7 Computational Mechanisms.................................................................................................. 373
19.8 Neural Mechanisms............................................................................................................... 374
19.9 Conclusion............................................................................................................................. 376
Acknowledgments........................................................................................................................... 376
References....................................................................................................................................... 376

19.1  INTRODUCTION
In the literature on cross-modal perception, there are two important findings that most researchers
in this area will know about, although only a few have ever made a connection between the two. The
first is that perceiving speech is not solely an auditory, but rather a multisensory phenomenon. As
many readers know by now, seeing a speaker deliver a statement can help decode the spoken message.
The most famous experimental demonstration of the multisensory nature of speech is the so-called
McGurk illusion: when perceivers are presented with an auditory syllable /ba/ dubbed onto a face
articulating /ga/, they report "hearing" /da/ (McGurk and MacDonald 1976). The second finding
goes back more than 100 years, to Stratton (1896). He performed experiments with goggles and
prisms that radically changed his visual field, thereby creating a conflict between vision and
proprioception. He found that after wearing the prisms for a couple of days, he adapted to the
upside-down visual world and learned to move around in it quite well. According to Stratton, the
visual world itself had changed, as it sometimes appeared to him to be "right side up," although
others such as Held (1965) later argued that it was rather the sensory–motor system that had adapted.
What these two seemingly different phenomena have in common is that in both cases an arti-
ficial conflict between the senses is created about an event that should yield congruent data under
normal circumstances. Thus, in the McGurk illusion, there is a conflict between the auditory system
that hears the syllable /ba/ and the visual system that sees the face of a speaker saying /ga/; in the
prism case, there is a conflict between proprioception, which may feel the hand going upward, and the


visual system that sees the same hand going downward. In 2003, the commonality between these
two phenomena led us (Bertelson et al. 2003) to question whether one might also observe long-
term adaptation effects with audiovisual speech as reported by Stratton for prism adaptation. To be
more specific, to the best of our knowledge, nobody had ever examined whether auditory speech
perception would adapt as a consequence of exposure to the audiovisual conflict present in McGurk
stimuli. This was rather surprising given that the original paper by McGurk and MacDonald is one
of the most widely cited papers in this research area (more than 1500 citations by January 2009).
Admittedly, though, at first sight it may look like a somewhat exotic enterprise to examine whether
listeners adapt to speech sounds induced by exposure to an audiovisual conflict. After all, why
would adaptation to a video of an artificially dubbed speaker be of importance? Experimental psy-
chologists should rather spend their time on fundamental aspects of perception and cognition that
remain constant across individuals, cultures, and time, and not on matters that are flexible and
adjustable. And, indeed, the dominant approach in speech research did just that by focusing on the
information available in the speech signal, the idea being that there must be acoustic invariants in
the signal that are extracted during perception. On second thought, though, it has turned out to be
extremely difficult to find a set of invariant acoustic parameters that works for all contexts, cultures,
and speakers, and the question we addressed might open an alternative view: Rather than searching
for acoustic invariants, it might be equally fruitful to examine whether and how listeners adjust their
phoneme boundaries so as to accommodate the variation they hear.
In 2003, we (Bertelson et al. 2003) reported that phonetic recalibration induced by McGurk-like
stimuli can indeed be observed. We termed the phenomenon “recalibration” in analogy with the
much better known “spatial recalibration,” as we considered it a readjustment or a fine-tuning of an
already existing phonetic representation. In the same year, and in complete independence, Norris
et al. (2003) reported a very similar phenomenon they named “perceptual learning in speech.” The
basic procedure in both studies was very similar: Listeners were presented with a phonetically
ambiguous speech sound and another source of contextual information that disambiguated that
sound. In our study, we presented listeners with a sound halfway between /b/ and /d/, accompanied by
the video of a synchronized face that articulated /b/ or /d/ (in short, lip-read information), whereas
in the study of Norris et al. (2003), an ambiguous /s/-/f/ sound was heard embedded in the context
of an f- or s-biasing word (e.g., “witlo-s/f” was an f-biasing context because “witlof” is a word in
Dutch meaning “chicory,” but “witlos” is not a Dutch word). Recalibration (or perceptual learning)
was subsequently measured in an auditory-only identification test in which participants identified
members of a speech continuum. Recalibration manifested itself as a shift in phonetic categoriza-
tion toward the contextually defined speech environment. Listeners thus increased their report of
sounds consistent with the context they had received before, so more /b/ responses after exposure to
lip-read /b/ rather than lip-read /d/, and more /f/ responses after exposure to f-biasing words rather
than /s/-biasing words. Presumably, this shift reflected an adjustment of the phoneme boundary that
had helped listeners to understand speech better in the prevailing input environment.
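To make the logic of this measurement concrete, the following sketch (with made-up response proportions, not data from either study) shows how such an aftereffect is typically quantified: the same ambiguous test tokens are identified after the two exposure conditions, and the difference in the proportion of context-consistent responses indexes recalibration.

import numpy as np

test_tokens = ["A?-1", "A?", "A?+1"]
p_b_after_Vb = np.array([0.75, 0.55, 0.30])   # proportion /b/ responses after exposure to A?Vb (hypothetical)
p_b_after_Vd = np.array([0.55, 0.35, 0.15])   # proportion /b/ responses after exposure to A?Vd (hypothetical)

# A positive difference means more responses consistent with the exposed lip-read context
# (recalibration); a negative difference would indicate a contrastive aftereffect
# (selective speech adaptation).
aftereffect = (p_b_after_Vb - p_b_after_Vd).mean()
print(f"Mean aftereffect (recalibration if > 0): {aftereffect:+.2f}")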
After these seminal reports, there have been a number of studies that examined phonetic recali-
bration in more detail (Baart and Vroomen 2010a, 2010b; Cutler et al. 2008; Eisner and McQueen
2005, 2006; Jesse and McQueen 2007; Kraljic et al. 2008a, 2008b; Kraljic and Samuel 2005, 2006,
2007; McQueen et al. 2006a, 2006b; Sjerps and McQueen 2010; Stevens 2007; van Linden and
Vroomen 2007, 2008; Vroomen and Baart 2009a, 2009b; Vroomen et al. 2004, 2007). In what fol-
lows, we will provide an overview of this literature and, given the topic of this book, we will focus
on the audiovisual case.

19.2  A SHORT HISTORICAL BACKGROUND ON AUDIOVISUAL SPEECH AFTEREFFECTS
Audiovisual speech has been extensively studied in recent decades ever since seminal reports came
out that lip-read information is of help in noisy environments (Sumby and Pollack 1954) and, given

appropriate dubbings, can change the auditory percept (McGurk and MacDonald 1976). More
recently, audiovisual speech has served in functional magnetic resonance imaging (fMRI) stud-
ies as an ideal stimulus for studying the neural substrates of multisensory integration (Calvert and
Campbell 2003). Surprisingly, though, until 2003 only three studies had focused on auditory
aftereffects as a consequence of exposure to audiovisual speech, despite the fact that aftereffects
were studied extensively in the late 1970s and are again today.
Roberts and Summerfield (1981) were the first to study the aftereffects of audiovisual speech,
although they were not searching for recalibration, but for “selective speech adaptation,” which is
basically a contrastive effect. The main question of their study was whether selective speech adapta-
tion takes place at a phonetic level of processing, as originally proposed by Eimas and Corbit (1973),
or at a more peripheral acoustic level. Selective speech adaptation differs from recalibration in that
it does not depend on an (intersensory) conflict, but rather on the repeated presentation of an acous-
tically nonambiguous sound that reduces report of sounds similar to the repeating one. For example,
hearing /ba/ many times reduces subsequent report of /ba/ on a /ba/–/da/ test continuum. Eimas
and Corbit (1973) argued that selective speech adaptation reflects the neural fatigue of hypothetical
“linguistic feature detectors,” but this viewpoint was not left unchallenged by others claiming that
it reflects a mere shift in criterion (Diehl 1981; Diehl et al. 1978, 1980) or a combination of both
(Samuel 1986), or possibly that even more qualitatively different levels of analyses are involved
(Samuel and Kat 1996). Still, others (Sawusch 1977) showed that the size of selective speech adapta-
tion depends on the degree of spectral overlap between the adapter and test sound, and that most—
although not all—of the effect is acoustic rather than phonetic.
Roberts and Summerfield (1981) found a clever way to disentangle the acoustic from the phonetic
contribution using McGurk-like stimuli. They dubbed a canonical auditory /b/ (a “good” acoustic
example) onto the video of lip-read /b/ to create an audiovisual congruent adapter and also dubbed
the auditory /b/ onto a lip-read /g/ to create a compound stimulus intended to be perceived as /d/.
Results showed that repeated exposure to the congruent audiovisual adapter induced contrastive
aftereffects on a /b/–/d/ test continuum (i.e., fewer /b/ responses) similar to those of the incongruent adapter
AbVg, even though the two adapters were perceived differently. This led the authors to conclude
that selective speech adaptation mainly depends on the acoustic quality of the stimulus, and not the
perceived or lip-read one.
Saldaña and Rosenblum (1994) and Shigeno (2002) later replicated these results with different
adapters. Saldaña and Rosenblum compared auditory-only adapters with audiovisual ones (auditory
/b/ paired with visual /v/, a compound stimulus perceived mostly as /v/), and found, as Roberts and
Summerfield did, that the two adapters again behaved similarly, as in both cases fewer /b/ responses
were obtained at test. Similar results were also found by Shigeno (2002), using AbVg as adapter,
and by us (unpublished), demonstrating that selective speech adaptation depends, to a large extent,
on repeated exposure to nonambiguous sounds.

19.3  SEMINAL STUDY ON LIP-READ–INDUCED RECALIBRATION


Bertelson et al. (2003) also studied the aftereffects of audiovisual incongruent speech; however,
their focus was not on selective speech adaptation, but on recalibration. Their study was inspired
by previous work on aftereffects of the “ventriloquist illusion.” In the ventriloquist illusion, the
apparent location of a target sound is shifted toward a visually displaced distracter that moves or
flashes in synchrony with that sound (Bermant and Welch 1976; Bertelson and Aschersleben 1998;
Bertelson and Radeau 1981; Klemm 1909). Besides this immediate bias in sound localization, one
can also observe aftereffects following a prolonged exposure to a ventriloquized sound (Bertelson
et al. 2006; Radeau and Bertelson 1974, 1976, 1977). For the ventriloquist situation, it was known
that the location of target sounds was shifted toward the visual distracter seen during the preceding
exposure phase. These aftereffects were similar to the ones following exposure to discordant visual
and proprioceptive information—as when the apparent location of a hand is displaced through a

prism (Welch and Warren 1986)—and they all showed that exposure to spatially conflicting inputs
recalibrates processing in the respective modalities in a way that reduces the conflict.
Despite the fact that immediate biases and recalibration effects had been demonstrated for spa-
tial conflict situations, the existing evidence was less complete for conflicts regarding audiovisual
speech. Here, immediate biases were well known (the McGurk effect) as well as selective speech
adaptation, but recalibration had not been demonstrated. Bertelson et al. (2003) hypothesized that
a slight variation in the paradigm introduced by Roberts and Summerfield (1981) might neverthe-
less produce these effects, thus revealing recalibration. The key factor was the ambiguity of the
adapter sound. Rather than using a conventional McGurk-like stimulus containing a canonical (and
incongruent) sound, Bertelson et al. (2003) used an ambiguous sound. They created a synthetic
sound halfway between /aba/ and /ada/ (henceforth A? for auditory ambiguous) and dubbed it onto
the corresponding video of a speaker pronouncing /aba/ or /ada/ (A?Vb and A?Vd, respectively).
Participants were shortly exposed to either A?Vb or A?Vd, and then tested on identification of A?,
and the two neighbor tokens on the auditory continuum A? −1 and A? +1. Each exposure block con-
tained eight adapters (either A?Vb or A?Vd) immediately followed by six test trials. These exposure-test
blocks were repeated many times, and participants were thus biased toward both /b/ and /d/ in
randomly ordered blocks (a within-subjects factor). Results showed that listeners quickly learned
to label the ambiguous sound in accordance with the lip-read information they were exposed to
shortly before. Listeners thus gave more /aba/ responses after exposure to A?Vb than after exposure
to A?Vd, and this was taken as the major sign of recalibration (see Figure 19.1, left panel).
In a crucial control experiment, Bertelson et al. (2003) extended these findings by incorporat-
ing audiovisual congruent adapters AbVb and AdVd. These adapters were not expected to induce
recalibration because there was no conflict between sound and vision. Rather, they were expected
to induce selective speech adaptation due to the nonambiguous nature of the sound. As shown in
Figure 19.1, right panel, these adapters indeed induced selective speech adaptation: there were
thus fewer /aba/ responses after exposure to AbVb than to AdVd, an effect in the opposite direction to
recalibration.
The attractiveness of these control stimuli was that participants could not distinguish them from
the ones with an ambiguous sound that induced recalibration. This was confirmed in an identifi-
cation test in which A?Vb and AbVb were perceived as /b/, and A?Vd and AdVd as /d/ on nearly

[Figure 19.1 appears here: proportion of 'b' responses as a function of test token (/A?/-1, /A?/, /A?/+1); left panel, adapters with ambiguous sounds (A?Vaba, A?Vada); right panel, adapters with non-ambiguous sounds (AVaba, AVada).]

FIGURE 19.1  Percentage of /aba/ responses as a function of auditory test token. Left panel: After exposure
to audiovisual adapters with ambiguous sounds, A?Vaba or A?Vada, there were more responses consistent
with the adapter (recalibration). Right panel: After exposure to audiovisual adapters with non-ambiguous
sounds, AVaba or AVada, there were fewer responses consistent with the adapter (selective speech adaptation).
(Results on auditory tests adapted from Bertelson, P. et al., Psychol. Sci., 14, 6, 592–597, 2003; Exp. 2.)

100% of the trials. Moreover, even when participants were explicitly asked to discriminate AbVb from
A?Vb, and AdVd from A?Vd, they performed at chance level because there was a strong immedi-
ate bias by the lip-read information that captured the identity of the sound (Vroomen et al. 2004).
These findings imply that the difference in aftereffects induced by adapters with ambiguous ver-
sus nonambiguous sounds cannot be ascribed to some (unknown) explicit strategy of the listeners,
because listeners simply could not know whether they were actually hearing adapters with ambigu-
ous sounds (causing recalibration) or nonambiguous sounds (causing selective speech adaptation).
This confirms the sensory, rather than strategic, nature of the phenomenon.
Lip-read–induced recalibration of speech was thus demonstrated, and appeared to be contingent
upon exposure to an ambiguous sound and another source of information that disambiguated that
sound. Selective speech adaptation, on the other hand, occurred in the absence of an intersensory
conflict, and mainly depended on repeated presentation of an acoustically clear sound. These two
forms of aftereffects had been studied before in other perceptual domains, but always in isolation.
Recalibration was earlier demonstrated for the ventriloquist situation and analogous intramodal
conflicts such as between different cues to visual depth (see reviews by Epstein 1975 and Wallach
1968), whereas contrastive aftereffects were already well known for color, curvature (Gibson 1933),
size (Blakemore and Sutton 1969) and motion (Anstis 1986; Anstis et al. 1998).

19.4  OTHER DIFFERENCES BETWEEN RECALIBRATION AND SELECTIVE SPEECH ADAPTATION
After the first report, several follow-up studies appeared examining differences in the manifestation
of lip-read–induced recalibration and selective speech adaptation. Besides differing in the direction
of their aftereffects, the two phenomena were found to differ in their buildup, dissipation,
and the processing mode in which they occur (i.e., "speech mode" versus "nonspeech mode").

19.4.1  Buildup
To examine the buildup of recalibration and selective speech adaptation, Vroomen et al. (2007)
presented the four previously used audiovisual adapters (A?Vb, A?Vd, AbVb, and AdVd) in a con-
tinuous series of exposure trials, and inserted test trials after 1, 2, 4, 8, 16, 32, 64, 128, and 256
exposures. The aftereffects of adapters with ambiguous sounds (A?Vb and A?Vd) were already
at ceiling after only eight exposure trials (the level of exposure used in the original study) and
then, surprisingly, after 32 exposure trials fell off with prolonged exposure (128 and 256 trials).
Aftereffects of adapters with nonambiguous sounds AbVb and AdVd were again contrastive and
the effect linearly increased with the (log-)number of exposure trials. The latter fitted well with the
idea that selective speech adaptation reflects an accumulative process, but there was no apparent
reason why a learning effect such as recalibration would reverse at some point. The authors sug-
gested that two processes might be involved here: selective speech adaptation running in parallel
with recalibration and eventually taking over. Recalibration would then dominate the observed
aftereffects in the early stages of exposure, whereas selective speech adaptation would become
manifest later on.
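The suggested interplay of the two processes can be illustrated with a toy simulation; the functional forms and parameters below are illustrative assumptions, not estimates from the data, but they reproduce the qualitative pattern of an early recalibration-like aftereffect that reverses into a contrastive one with prolonged exposure.

import numpy as np

exposures = np.array([1, 2, 4, 8, 16, 32, 64, 128, 256])

recalibration = 0.20 * (1 - np.exp(-exposures / 4.0))   # fast, saturating (assimilative aftereffect)
selective_adaptation = 0.04 * np.log2(exposures)        # slow, log-linear (contrastive aftereffect)

# Positive values mimic a net recalibration-like aftereffect, negative values a contrastive one.
net_aftereffect = recalibration - selective_adaptation
for n, net in zip(exposures, net_aftereffect):
    print(f"{int(n):4d} exposures: net aftereffect = {float(net):+.3f}")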
Such a phenomenon was indeed observed when data of an “early” study (i.e., one before the
initial reports on phonetic recalibration) by Samuel (2001) were reanalyzed. Samuel exposed his
participants to massive repeated presentations of an ambiguous /s/–/∫/ sound in the context of either
an /s/-final word (e.g., /bronchiti?/, from bronchitis), or a /∫/-final one (e.g., /demoli?/, from demolish).
In this situation, one might expect recalibration to take place. However, in post-tests involving iden-
tification of the ambiguous /s/–/∫/ sound, Samuel obtained contrastive aftereffects indicative of
selective speech adaptation, that is, fewer /s/ responses after exposure to /bronchiti?/ than to /demoli?/ (and
thus an effect in the direction opposite to that later reported by Norris et al. 2003). This made him conclude
that a lexically restored phoneme produces selective speech adaptation similar to a nonambiguous

sound. Others, though—including Samuel—would report in later years recalibration effects using
the same kinds of stimuli (Kraljic and Samuel 2005; Norris et al. 2003; van Linden and Vroomen
2007). To examine this potential conflict in more detail, Samuel allowed us to reanalyze the data
from his 2001 study as a function of number of exposures blocks (Vroomen et al. 2007). His experi-
ment consisted of 24 exposure blocks, each containing 32 adapters. Contrastive aftereffects were
indeed observed for the majority of blocks following block 3, showing the reported dominant role
of selective speech adaptation. Crucially, though, a significant recalibration effect was obtained (so
more /s/ responses after exposure to /bronchiti?/ than /demoli?/) in the first block of 32 exposure
trials, which, in the overall analyses, was swamped by selective adaptation in later blocks. Thus,
the same succession of aftereffects dominated early by recalibration and later by selective adapta-
tion was already present in Samuel’s data. The same pattern may therefore occur generally during
prolonged exposure to various sorts of conflict situations involving ambiguous sounds.

19.4.2  Dissipation
A study by Vroomen et al. (2004) focused on how long recalibration and selective speech adaptation
effects last over time. Participants were again exposed to A?Vb, A?Vd, AdVd, or AbVb, but rather
than using multiple blocks of eight adapters and six test trials in a within-subject design (as in the
original study), participants were now exposed to only one of the four adapters (a between-subject
factor) in three similar blocks consisting of 50 exposure trials followed by 60 test trials. The recal­
ibration effect turned out to be very short-lived and lasted only about six test trials, whereas the
selective speech adaptation effect was observed even after 60 test trials. The results again confirmed
that the two phenomena were different from each other. Surprisingly, though, lip-read–induced
recalibration turned out to be rather short-lived, a finding to which we will return later.

19.4.3  Recalibration in “Speech” versus “Nonspeech” Mode


The basic notion underlying recalibration is that it occurs to the extent that there is a (moder-
ate) conflict between two information sources that refer to the same external event (for speech,
a particular phoneme or gesture). Using sine-wave speech (SWS), one can manipulate whether a
sound is assigned to a speech sound (for short, a phoneme) or not, and thus whether recalibration
occurs. In SWS, the natural richness of speech sounds is reduced, and an identical sound can
be perceived as speech or nonspeech depending on the listener’s perceptual mode (Remez et al.
1981). Tuomainen et al. (2005) demonstrated that when SWS sounds are delivered in combination
with lip-read speech, listeners who are in speech mode show almost similar intersensory integra-
tion as when presented with natural speech (i.e., lip-read information strongly biases phoneme
identification), but listeners who do not know the SWS tokens are derived from speech (nonspeech
mode) show no, or only negligible, integration. Using these audiovisual SWS stimuli, we reasoned
that recalibration should only occur for listeners in speech mode (Vroomen and Baart 2009a). To
demonstrate this, participants were first trained to distinguish the SWS tokens /omso/ and /onso/
that were the two extremes of a seven-step continuum. Participants in the speech group labeled the
tokens as /omso/ or /onso/, whereas the nonspeech group labeled the same sounds as “1” and “2”.
Listeners were then shortly exposed to the adapters A?Vomso and A?Vonso (to examine recalibra-
tion), and AomsoVomso and AonsoVonso (to examine selective speech adaptation), and then tested on the
three most ambiguous SWS tokens that were identified as /omso/ or /onso/ in the speech group,
and as “1” or “2” in the nonspeech group. As shown in Figure 19.2, recalibration only occurred
for listeners in speech mode (the upper left panel), but not in nonspeech mode (lower left panel),
whereas selective speech adaptation occurred likewise in speech and nonspeech mode (right pan-
els). Attributing the auditory and visual signal to the same event was thus of crucial importance
for recalibration, whereas selective speech adaptation did not depend on the interpretation of the
signal.

[Figure 19.2 appears here: proportion of 'onso' responses (speech mode, upper panels) or '2' responses (nonspeech mode, lower panels) as a function of auditory token (/A?-1/, /A?/, /A?+1/); left panels, exposure to adapters with the ambiguous component /A?/ (recalibration); right panels, exposure to unambiguous adapters (selective speech adaptation); separate curves for V/omso/ and V/onso/.]

FIGURE 19.2  Curves represent mean proportion of /onso/ responses as a function of auditory test tokens
of continuum after exposure to auditory ambiguous adapters A?Vonso and A?Vomso (left panels), and audi-
tory non-ambiguous adapters AonsoVonso and AomsoVomso (right panels). Upper panels show performance
of speech group; lower panels show performance of non-speech group. Error bars = 1 SEM. (Adapted from
Vroomen, J., and Baart, M., Cognition, 110, 2, 254–259, 2009a.)

19.5  STABILITY OF RECALIBRATION OVER TIME


As noted before, studies on phonetic recalibration began with a pair of seminal studies, one of which
used lip-read information (Bertelson et al. 2003) and the other used lexical information (Norris et al.
2003). Both showed in essence the same phenomenon, but the results were nevertheless strikingly
different in one aspect: Whereas lip-read–induced recalibration was short-lived, lexical recalibra-
tion turned out to be robust and long-lived in the majority of studies. The reasons for this difference
are still not well understood, but in the following subsections we will give an overview of the find-
ings and some hints on possible causes.

19.5.1  Basic Phenomenon of Lexically Induced Recalibration


It is well known that in natural speech there are, besides the acoustic and lip-read input, other infor-
mation sources that inform listeners about the identity of the phonemes. One of the most important
ones is the listener’s knowledge about the words in the language, or for short, lexical information.
As an example, listeners can infer that an ambiguous sound somewhere in between /b/ and /d/ in the
context of “?utter” is more likely to be /b/ rather than /d/ because “butter” is a word in English, but
not “dutter.” There is also, as for lip-reading, an immediate lexical bias in phoneme identification
known as the Ganong effect (Ganong 1980). For example, an ambiguous /g/-/k/ sound is “heard” as
/g/ when followed by “ift” and as /k/ when followed by “iss” because “gift” and “kiss” are words,
but “kift” and “giss” are not.
The corresponding aftereffect that results from exposure to such lexically biased phonemes was
first reported by Norris et al. (2003). They exposed listeners to a sound halfway between /s/ and /f/
in the context of an f- or s-biasing word, and listeners were then tested on an /es/-/ef/ continuum.
As in the lip-reading case, the authors observed recalibration (or, in their terminology,
perceptual learning): more /f/ responses after an f-biasing context, and more /s/ responses after
an s-biasing context.
Later studies confirmed the original finding and additionally suggested that the effect is speaker-
specific (Eisner and McQueen 2005), or possibly, token-specific (Kraljic and Samuel 2006, 2007),
that it generalizes to words outside the original training set (McQueen et al. 2006a) and across syl-
labic positions (Jesse and McQueen 2007), and that it arises automatically as a consequence of hear-
ing the ambiguous pronunciations in words (McQueen et al. 2006b). Although Jesse and McQueen
(2007) demonstrated that lexical recalibration can generalize to word onset positions, there was no
lexical learning when listeners were exposed to ambiguous onset words (Jesse and McQueen 2007).
However, Cutler et al. (2008) showed that legal word-onset phonotactic information can induce reca-
libration, presumably because this type of information can be used immediately, whereas lexical
knowledge about the word is not yet available when one hears the ambiguous onset. Moreover, lexical
retuning is not restricted to a listener’s native language as the English fricative theta ([θ] as in “bath”)
presented in a Dutch f- or s-biasing context induced lexical learning (Sjerps and McQueen 2010).

19.5.2  Lip-Read –Induced versus Lexically Induced Recalibration


So far, these data fit well with studies on lip-read–induced recalibration, but there was one remark-
able difference: the duration of the reported aftereffects. Whereas lip-read–induced recalibration
was found to be fragile and short-lived (in none of the tests did it survive more than 6 to 12 test tri-
als; van Linden and Vroomen 2007; Vroomen and Baart 2009b; Vroomen et al. 2004), two studies
on lexically induced recalibration found that it was long-lived and resistant to change. Kraljic and
Samuel (2005) demonstrated that recalibration of an ambiguous /s/ or /∫/ remained robust after a
25-min delay. Moreover, it remained robust even after listeners heard canonical pronunciations of
/s/ and /∫/ during the 25-min delay, and the only condition in which the effect became somewhat
smaller, although not significantly so, was when listeners heard canonical pronunciations of /s/ and
/∫/ from the same speaker that they had originally adjusted to. In another study, Eisner and McQueen
(2006) showed that lexically induced recalibration remained stable over an even longer delay
(12 h), regardless of whether subjects slept in the intervening time.
At this stage, one might conclude that, simply by their nature, lexical recalibration is robust and
lip-read recalibration is fragile. However, these studies were difficult to compare in a direct way
because there were many procedural and item-specific differences. To examine this in more detail,
van Linden and Vroomen (2007) conducted a series of experiments on lip-read–induced and lexi-
cally induced recalibration using the same procedure and test stimuli to check various possibilities.
They used an ambiguous stop consonant halfway between /t/ or /p/ that could be disambiguated by
either lip-read or lexical information. For lip-read recalibration, the auditory ambiguous sound was
embedded in Dutch nonwords such as “dikasoo?” and dubbed onto the video of lip-read “dikasoop”
or “dikasoot,” for lexical recalibration the ambiguous sound was embedded in Dutch p-words such
as “microscoo?” (“microscope”) or t-words such as “idioo?” (“idiot”).
Across experiments, results showed that lip-read–induced and lexically induced recalibration effects were very
much alike. The lip-read aftereffect tended to be bigger than the lexical one, which was to be
expected because lip-reading has in general a much stronger impact on sound processing than lexi-
cal information does (Brancazio 2004). Most important, though, both aftereffects dissipated equally
fast, and thus there was no sign that lexical recalibration by itself was more robust than lip-read–
induced recalibration.
The same study also explored whether recalibration would become more stable if a contrast pho-
neme from the opposite category was included in the set of exposure items. Studies reporting long-
lasting lexical aftereffects presented during the exposure not only words with ambiguous sounds,
but also filler words with nonambiguous sounds taken from the opposite side of the phoneme con-
tinuum. For example, in the exposure phase of Norris et al. (2003) in which an ambiguous s/f sound
was biased toward /f/, there were not only exposure stimuli such as “witlo?” that supposedly drive
recalibration, but also contrast stimuli containing the nonambiguous sound /s/ (e.g., naaldbos). Such
contrast stimuli might serve as an anchor or a comparison model for another stimulus, and afteref-
fects thought to reflect recalibration might in this way be boosted because listeners set the criterion
for the phoneme boundary in between the ambiguous token and the extreme one. The obtained
aftereffect may then reflect the contribution of two distinct processes: one related to recalibration
proper (i.e., a shift in the phoneme boundary meant to reduce the conflict between the sound and
the context), the other to a strategic and long-lasting criterion setting operation that depends on the
presence of an ambiguous phoneme and a contrast phoneme from the opposing category. Our results
showed that aftereffects did indeed become substantially bigger if a contrast stimulus was included
in the exposure set but crucially, aftereffects did not become more stable. Contrast stimuli thus
boosted the effect, but did not explain why sometimes long-lasting aftereffects were obtained.
Another factor that was further explored was whether participants were biased in consecutive
exposure phases toward only one or both phoneme categories. One can imagine that if listeners
are biased toward both a t-word and p-word (as was standard in lip-read studies, but not the lexical
ones), the boundary setting that listeners adopt may become fragile. However, this did not turn out
to be critical: whether participants were exposed to only one context or to both did not change the
size or stability of the aftereffect.
Of note is that the lip-read and lexical recalibration effects did not vanish if a 3-min silent interval
separated the exposure phase from the test. The latter finding indicates that recalibration as such is not
fragile, but that other factors possibly related to the test itself may explain why aftereffects dissipate
quickly during testing. One such possibility might be that listeners adjust their response criterion
in the course of testing such that the two response alternatives are chosen about equally often.
However, although this seems reasonable, it does not explain why, in the same test, selective speech
adaptation effects remained stable over the course of testing (Vroomen et al. 2004).
Still, another possibility is that recalibration needs time to consolidate, and sleep might be a fac-
tor in this. Eisner and McQueen (2006) explored this possibility and observed equal amounts of lex-
ically induced aftereffects after 12 h, regardless of whether listeners had slept. Vroomen and Baart
(2009b) conducted a similar study on lip-read–induced recalibration, including contrast phonemes
to boost the aftereffect, and tested participants twice: immediately after the lip-read exposure phase
(as standard) and after a 24-h period during which participants had slept. The authors found large
recalibration effects in the beginning of the test (the first six test trials), but they again quickly dis-
sipated with prolonged testing (within 12 trials), and did not reappear after a 24-h delay.
It may also be the case that the dissipation rate of recalibration depends on the acoustic nature
of the stimuli. The studies that found quick dissipation used intervocalic and syllable-final stops
that varied in place of articulation (/aba/-/ada/ and /p/-/t/), whereas others used fricatives (/f-s/ and
/s-∫/; Eisner and McQueen 2006; Kraljic et al. 2008b; Kraljic and Samuel 2005) or syllable-initial
voiced–voiceless stop consonants (/d-t/ and /b/-/p/; Kraljic and Samuel 2006). If the stability of the
phenomenon depends on the acoustic nature of the cues (e.g., place cues might be more vulnerable),
one may observe aftereffects to differ in this respect as well.
Another variable that may play a role is whether the same ambiguous sound is used during the
exposure phase, or whether the token varies from trial to trial. Stevens (2007, Chapter 3) examined
token variability in lexical recalibration using similar procedures as those used by Norris et al.
(2003), but listeners were either exposed to the same or different versions of an ambiguous s/f sound
embedded in s- and f-biasing words. His design also included contrast phonemes from the opposite
phoneme category that should have boosted the effect. When the ambiguous token was constant,
as in the original study by Norris et al., the learning effect was quite substantial on the first test
trials, but quickly dissipated with prolonged testing, and in the last block (test trials 36–42), lexical
recalibration had disappeared completely akin to lip-read–induced recalibration (van Linden and
Vroomen 2007; Vroomen and Baart 2009b; Vroomen et al. 2004). When the sound varied from trial
to trial, the overall learning effect was much smaller and restricted to the f-bias condition, but the
effect lasted longer.
Another aspect that may play a role is the use of filler items. Studies reporting short-lived after-
effects tended to use massed trials of adapters with either no filler items separating the critical
items, or only a few contrast stimuli. Others, which reported long-lasting effects, used many filler items
separating the critical items (Eisner and McQueen 2006; Kraljic and Samuel 2005, 2006; Norris
et al. 2003). Typically, about 20 critical items containing the ambiguous phoneme were interspersed
among 180 filler items. A classic learning principle is that massed trials produce a weaker learning
effect than spaced trials (e.g., Hintzman 1974). At present, it remains to be explored whether
recalibration is sensitive to this variable as well and whether it follows the same principle. One other
factor that might prove to be valuable in the discussion regarding short- versus long-lasting effects
is that extensive testing may override, or wash out, the learning effects (e.g., Stevens 2007) because
during the test, listeners might “relearn” their initial phoneme boundary. Typically, in the Bertelson
et al. (2003) paradigm, more test trials are used than in the Norris et al. (2003) paradigm, possibly
influencing the time course of the observed effects. For the time being, though, the critical differ-
ence between the short- and long-lasting recalibration effects remains elusive.

19.6  DEVELOPMENTAL ASPECTS


Several developmental studies have suggested that integration of visual and auditory speech is
already present early in life (e.g., Desjardins and Werker 2004; Kuhl and Meltzoff 1982; Rosenblum
et al. 1997). For example, 4-month-old infants, exposed to two faces articulating vowels on a screen,
look longer at the face that matches an auditory vowel played simultaneously (Kuhl and Meltzoff
1982; Patterson and Werker 1999) and even 2-month-old infants can detect the correspondence
between auditory and visually presented speech (Patterson and Werker 2003). However, it has also
been found that the impact of lip-reading on speech perception increases with age (Massaro 1984;
McGurk and MacDonald 1976). Such a developmental trend in the impact of visual speech may
suggest that lip-reading is an ability that needs to mature, or alternatively that linguistic experience
is necessary, possibly because visible articulation is initially not well specified. Exposure to audio-
visual speech may then be necessary to develop phonetic representations more completely.
Van Linden and Vroomen (2008) explored whether there is a developmental trend in the use
of lip-read information by testing children of two age groups, 5-year-olds and 8-year-olds, on lip-
read–induced recalibration. Results showed that the older children learned to categorize the initially
ambiguous speech sound in accord with the previously seen lip-read information, but this was not
the case for the younger age group. Presumably, the 8-year-olds adjusted their phoneme boundary to
reduce the phonetic conflict in the audiovisual stimuli; this shift may occur in the older group
but not the younger one because lip-reading is not yet very effective at the age of 5.
However, Teinonen et al. (2008) were able to observe learning effects induced by lip-read speech
testing much younger infants with a different procedure. They exposed 6-month-old infants to
speech sounds from a /ba/-/da/ continuum. One group was exposed to audiovisual congruent map-
pings so that tokens from the /ba/ side of the continuum were combined with lip-read /ba/, and
tokens from the /da/ side were combined with lip-read /da/. Two other groups of infants were pre-
sented with the same sounds from the /ba/-/da/ continuum, but in one group all auditory tokens were
paired with lip-read /ba/, and in the other group all auditory tokens were paired with lip-read /da/. In
the latter two groups, lip-read information thus did not inform the infant how to divide the sounds
from the continuum into two categories. A preference procedure revealed that infants in the former,
but not in the two latter groups learned to discriminate the tokens from the /ba/–/da/ continuum.
These results suggest that infants can use lip-read information to adjust the phoneme boundary of
an auditory speech continuum. Further testing, however, is clearly needed so as to understand what
critical experience is required and how it relates to lip-read–induced recalibration in detail.

19.7  COMPUTATIONAL MECHANISMS


How might the retuning of phoneme categories be accomplished from a computational perspective?
In principle, there are many solutions. All that is needed is that the system is able to use context to
change the way an ambiguous phoneme is categorized. Recalibration may be initiated whenever
there is discrepancy between the phonological representations induced by the auditory and lip-read
input, or for lexical recalibration, if there is a mismatch between the auditory input and the one
expected from lexical information. Recalibration might be accomplished at the phonetic level by
moving the position of the whole category, by adding the ambiguous sound as a new exemplar of
the appropriate category, or by changing the category boundaries. For example, in models such as
TRACE (McClelland and Elman 1986) or Merge (Norris et al. 2000), speech perception is envis-
aged in layers where features activate phonemes that in their turn activate words. Here, one can
implement recalibration as a change in the weights of the auditory feature-to-phoneme connections
(Mirman et al. 2006; Norris et al. 2000).
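To make these alternatives concrete, the toy sketch below contrasts a boundary-shift account with a weight-change account on a seven-step /aba/-/ada/ continuum. It is our illustration only, not an implementation of TRACE, Merge, or any published model, and every parameter value in it is invented.

```python
import numpy as np

# Toy illustration (not TRACE or Merge): a 1-D /aba/-/ada/ continuum classified by a
# logistic phoneme boundary, with recalibration modeled either as a boundary shift or
# as a small change to feature-to-phoneme weights. All values are arbitrary.

tokens = np.arange(1, 8)            # acoustic continuum: 1 = clear /aba/, 7 = clear /ada/
boundary, slope = 4.0, 1.5          # initial category boundary at the ambiguous midpoint

def p_ada(x, boundary, slope):
    """Probability of categorizing token x as /ada/."""
    return 1.0 / (1.0 + np.exp(-slope * (x - boundary)))

# (a) Boundary-shift account: pairing the ambiguous token with lip-read /aba/ moves the
# boundary so the ambiguous region is now categorized as /b/; the endpoints barely change.
p_shift = p_ada(tokens, boundary + 0.5, slope)

# (b) Weight-change account: the same logistic expressed as weights on [x, 1], updated once
# in a gradient/Hebbian-like fashion toward the lip-read-specified category (/aba/ -> target 0).
w = np.array([slope, -slope * boundary])        # sigma(w[0]*x + w[1]) equals p_ada above
x_ambiguous, target = 4.0, 0.0
error = 1.0 / (1.0 + np.exp(-(w[0] * x_ambiguous + w[1]))) - target
w -= 0.15 * error * np.array([x_ambiguous, 1.0])
p_weights = 1.0 / (1.0 + np.exp(-(w[0] * tokens + w[1])))

print("p(/ada/) before:        ", np.round(p_ada(tokens, boundary, slope), 2))
print("p(/ada/) boundary shift:", np.round(p_shift, 2))
print("p(/ada/) weight change: ", np.round(p_weights, 2))
```

Both variants reduce /ada/ responses to the ambiguous token after exposure to lip-read /aba/, but only the weight-change variant also slightly alters responses to the clear endpoint tokens, which is the kind of generalization difference discussed next.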
Admittedly though, the differences among these various possibilities are quite subtle. Yet, the
extent to which recalibration generalizes to new exemplars might be of relevance to distinguish
these alternatives. One observation is that repeated exposure to typical McGurk stimuli containing
a canonical sound, say nonambiguous auditory /ba/ combined with lip-read /ga/, does not invoke a
retuning effect of the canonical /ba/ sound itself (Roberts and Summerfield 1981). A “good” audi-
tory /ba/ thus remains a good example of its category even though the lip-read input repeatedly
signals that the phoneme belongs to another category. This may suggest that recalibration reflects
a shift in the phoneme boundary, thus affecting only sounds near that boundary, rather than a
rewiring of the acoustic-to-phonetic connections on the fly, which would affect all sounds, and in
particular the trained ones.
In contrast with this view, however, there are also some data indicating the opposite. In particu-
lar, a closer inspection of the data from Shigeno (2002) shows that a single exposure to a McGurk-
like stimulus AbVg—here called an anchor—followed by a target sound did change the quality of the
canonical target sound /b/ (see Figure 2 of Shigeno 2002). This finding may be more
in line with the idea of a “rewiring” of feature-to-phoneme connections, or alternatively that this
specific trained sound is incorporated into the new category. However, it is clear that more data are
needed that specifically address these details.
There has also been a controversy about whether lexical recalibration actually occurs at the same
processing level as immediate lexical bias. Norris et al. (2003) have argued quite strongly in favor
of two types of lexical influence in speech perception: a lexical bias on phonemic decision-making
that does not involve any form of feedback, and lexical feedback necessary for perceptual learning.
Although there is a recent report supporting the idea of a dissociation between lexical involvement
in online decisions and in lexical recalibration (McQueen et al. 2009), we never obtained any data
that support this distinction—that is, we have not been able to dissociate bias (lip-read or lexical)
from recalibration. In fact, listeners who were strongly biased by the lip-read or lexical context from
the adapter stimuli (as measured in separate tests) also tended to show the biggest recalibration
effects (van Linden and Vroomen 2007). Admittedly, this argument is only based on a correlation,
and the correlation was at best marginally significant. Perhaps more relevant though are the SWS
findings in which it was demonstrated that when lip-read context did not induce a cross-modal
bias—namely, in the case where SWS stimuli were perceived as nonspeech—there was also no
recalibration. Immediate bias and recalibration thus usually go hand in hand, and in order to claim
that they are distinct, one would like to see empirical evidence in the form of a dissociation.

19.8  NEURAL MECHANISMS


What are the neural mechanisms that underlie phonetic recalibration? Despite the fact that the
integration of auditory and visual speech has been extensively studied with brain imaging methods
(e.g., Callan et al. 2003; Calvert et al. 1997; Calvert and Campbell 2003; Campbell 2008; Colin et
al. 2002; Klucharev et al. 2003; Sams et al. 1991; Stekelenburg and Vroomen 2007), so far only
two studies have addressed the potential brain mechanisms involved in phonetic recalibration. Van
Linden et al. (2007) used mismatch negativity (MMN) as a tool to examine whether a recalibrated
phoneme left traces in the evoked potential. The MMN is a component in the event-related potential
that signals an infrequent discriminable change in an acoustic or phonological feature of a repeti-
tive sound (Näätänen et al. 1978), and its latency and amplitude are correlated with the behavioral
discriminability of the stimuli (Lang et al. 1990). The MMN is thought to be generated through
automatic change detection and is elicited regardless of sound relevance for the participant’s task
(Näätänen 1992; Näätänen et al. 1993). The MMN is not only sensitive to acoustic changes, but also
to learned language-specific auditory deviancy (Näätänen 2001; Winkler et al. 1999). Van Linden
et al. (2007) used a typical oddball paradigm to elicit an MMN so as to investigate whether lexi-
cally induced recalibration penetrates the mechanisms of perception at early pre-lexical levels, and
thus affects the way a sound is heard. The standard stimulus (delivered in 82% of the trials) was an
ambiguous sound halfway between /t/ and /p/ in either a t-biasing context “vloo?” (derived from
“vloot,” meaning “fleet”) or a p-biasing context “hoo?” (derived from “hoop,” meaning “hope”).
For the deviant condition, the ambiguous sound was in both conditions replaced by an acoustically
clear /t/, so “vloot” for the t-biasing context and “hoot” (a pseudoword in Dutch) for the p-biasing
context. If subjects had learned to “hear” the sound as specified by the context, we predicted the
perceptual change—as indexed by MMN—from /?/ → /t/ to be smaller in t-words than p-words,
even though the acoustic change was identical. As displayed in Figure 19.3, the MMN in t-words
was indeed smaller than in p-words, thus confirming that recalibration might penetrate low-level
auditory mechanisms.

FIGURE 19.3  Grand-averaged waveforms of the standard, the deviant, and the MMN at electrode Fz for the
t-word condition (left panel) and the p-word condition (middle panel). Right panel: MMNs and their scalp
topographies for both conditions; voltage map ranges in μV are displayed below each map. The y-axis marks
the onset of the acoustic deviation between /?/ and /t/. (Adapted from Vroomen, J. et al., Neuropsychologia,
45, 3, 572–577, 2007.)
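The MMN itself is simply the deviant-minus-standard difference wave. The sketch below illustrates that computation on simulated epochs; the trial counts, noise level, and peak amplitude and latency are invented and are not taken from the study.

```python
import numpy as np

# Schematic MMN computation on simulated epochs (invented values): average the EEG epochs
# for standards and deviants separately and subtract to obtain the difference wave.

rng = np.random.default_rng(0)
n_standards, n_deviants, n_samples = 820, 180, 300    # roughly the 82%/18% oddball ratio
times = np.linspace(-50, 250, n_samples)              # ms relative to onset of the acoustic deviation

def simulate_epochs(n_trials, negativity_uV):
    """Noisy epochs with an optional negativity peaking ~150 ms after deviation onset."""
    deflection = negativity_uV * np.exp(-((times - 150.0) ** 2) / (2 * 40.0 ** 2))
    return rng.normal(0.0, 2.0, (n_trials, n_samples)) - deflection

standard_erp = simulate_epochs(n_standards, negativity_uV=0.0).mean(axis=0)
deviant_erp = simulate_epochs(n_deviants, negativity_uV=3.0).mean(axis=0)
mmn = deviant_erp - standard_erp                      # the difference wave plotted in Figure 19.3

print(f"simulated MMN peak: {mmn.min():.2f} µV at {times[mmn.argmin()]:.0f} ms")
```

The recalibration prediction then amounts to comparing the amplitude of this difference wave between the t-word and p-word contexts.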
The second study used fMRI to examine the brain mechanisms that drive phonetic recalibration
(Kilian-Hütten et al. 2008). The authors
adapted the original study of Bertelson et al. (2003) for the fMRI scanner environment. Listeners
were presented with a short block of eight audiovisual adapters containing the ambiguous /aba/-/
ada/ sound dubbed onto the video of lip-read /aba/ or /ada/ (A?Vb or A?Vd). Each exposure block
was followed by six auditory test trials consisting of event-related forced-choice /aba/-/ada/ judg-
ments. Functional runs were analyzed using voxelwise multiple linear regression (General Linear
Model) of the blood oxygen level–dependent (BOLD) response time course. Brain regions involved
in the processing of the audiovisual stimuli were identified by contrasting the activation blocks with
a baseline. Moreover, a contrast based on behavioral performance was utilized so as to identify
regions of interest (ROIs) whose activation during the recalibration phase would predict subsequent
test performance (see also Formisano et al. 2008). Behaviorally, the results of Bertelson et al. (2003)
were replicated in the fMRI environment, so there were more /aba/ responses after exposure to
A?Vb than A?Vd. Also as expected, lip-read information during the exposure blocks elicited activa-
tion in typical areas, including primary and extrastriate visual areas, early auditory areas, superior
temporal gyrus and sulcus (STG/STS), middle and inferior frontal gyrus (MFG, IFG), premotor
regions, and posterior parietal regions. Most interestingly, the BOLD behavior analysis identified a
subset of this network (MFG, IFG, and inferior parietal cortex) whose activity during audiovisual
exposure correlated with the proportion of correctly recalibrated responses in the auditory test tri-
als. Activation in areas MFG, IFG, and inferior parietal cortex thus predicted, on a trial-by-trial
basis, the subjects’ percepts of ambiguous sounds to be tested about 10 s later. The functional inter-
pretation of these areas is to be explored further, but the activation changes may reflect trial-by-trial
variations in subjects’ processing of the audiovisual stimuli, which in turn influence recalibration
and later auditory perception. For instance, variations in recruitment of attentional mechanisms
and/or involvement of working memory might be of importance, although the latter seems to be
unlikely (Baart and Vroomen 2010b).
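The logic of that BOLD-behavior analysis can be sketched in a few lines. The code below is only a schematic of the idea, not the study's voxelwise regression pipeline: the block counts and all numbers are invented, and a simple Pearson correlation across exposure blocks stands in for the model-based contrast.

```python
import numpy as np

# Schematic of the BOLD-behavior logic (invented numbers, not the study's data): for each
# audiovisual exposure block, take an ROI's mean activation and ask whether it predicts the
# proportion of recalibrated responses on the auditory test trials that follow that block.

rng = np.random.default_rng(0)
n_blocks = 40
roi_activation = rng.normal(0.0, 1.0, n_blocks)                         # e.g., block-wise GLM beta estimates
noise = rng.normal(0.0, 0.15, n_blocks)
p_recalibrated = np.clip(0.5 + 0.1 * roi_activation + noise, 0.0, 1.0)  # simulated test performance

# A positive correlation across blocks would mean that stronger exposure-phase activation
# goes with more recalibrated responses on the subsequent test trials.
r = np.corrcoef(roi_activation, p_recalibrated)[0, 1]
print(f"BOLD-behavior correlation across {n_blocks} blocks: r = {r:.2f}")
```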

19.9  CONCLUSION
We reviewed literature that demonstrates that listeners adjust their phoneme boundaries to the pre-
vailing speech context. Phonetic recalibration can be induced by lip-read and lexical context. Both
yield converging data, although the stability of the effect varies quite substantially between studies
for as yet unknown reasons. One reason could be that aftereffects as measured during tests reflect
the contribution of both recalibration and selective speech adaptation that run in parallel but with
different contributions over time. Several computational mechanisms have been proposed that can
account for phonetic recalibration, but critical data that distinguish between these alternatives—in
particular, about the generalization to new tokens—have not yet been collected. Phonetic recalibra-
tion leaves traces in the brain that can be examined with brain imaging techniques. Initial studies
suggest that a recalibrated sound behaves like an acoustically real sound from that category, and
possible loci (e.g., middle and inferior frontal gyrus, parietal cortex) that subserve recalibration have
been identified. Further testing, however, is needed to examine this in more detail. Involvement of
the parietal cortex could indicate that (verbal) short-term memory plays a role in phonetic
recalibration, although a recent study conducted by our group indicates that phonetic recalibration is
not affected if subjects are involved in a difficult verbal or spatial short-term memory task (Baart
and Vroomen 2010b). Moreover, auditory speech has also been shown to shift the interpretation of
lip-read speech categories, just as auditory speech can be recalibrated by lip-read information,
so the effect is genuinely bidirectional (Baart and Vroomen 2010a). On this view,
audiovisual speech is like other cross-modal learning effects (e.g., the ventriloquist illusion) where
bidirectional effects have been demonstrated.

ACKNOWLEDGMENTS
We would like to thank Arthur Samuel and James McQueen for insightful comments on an earlier
version of this manuscript.

REFERENCES
Anstis, S. 1986. Motion perception in the frontal plane: Sensory aspects. In Handbook of perception and human
performance, Vol. 2, Chap. 27, ed. K. R. Boff, L. Kaufman, and J. P. Thomas. New York: Wiley.
Anstis, S., F. A. J. Verstraten, and G. Mather. 1998. The motion aftereffect. Trends in Cognitive Sciences 2:
111–117.
Baart, M., and J. Vroomen. 2010a. Do you see what you are hearing?: Crossmodal effects of speech sounds on
lipreading. Neuroscience Letters 471: 100–103.
Baart, M., and J. Vroomen. 2010b. Phonetic recalibration does not depend on working memory. Experimental
Brain Research 203: 575–582.
Bermant, R. I., and R. B. Welch. 1976. Effect of degree of separation of visual–auditory stimulus and eye posi-
tion upon spatial interaction of vision and audition. Perceptual and Motor Skills 42(43): 487–493.
Bertelson, P., and G. Aschersleben. 1998. Automatic visual bias of perceived auditory location. Psychonomic
Bulletin and Review 5(3): 482–489.
Bertelson, P., I. Frissen, J. Vroomen, and B. De Gelder. 2006. The aftereffects of ventriloquism: Patterns of
spatial generalization. Perception and Psychophysics 68(3): 428–436.
Bertelson, P., and M. Radeau. 1981. Cross-modal bias and perceptual fusion with auditory–visual spatial dis-
cordance. Perception and Psychophysics 29(6): 578–584.
Bertelson, P., J. Vroomen, and B. De Gelder. 2003. Visual recalibration of auditory speech identification: A
McGurk aftereffect. Psychological Science 14(6): 592–597.
Blakemore, C., and P. Sutton. 1969. Size adaptation: A new aftereffect. Science 166(902): 245–247.
Brancazio, L. 2004. Lexical influences in audiovisual speech perception. Journal of Experimental Psychology:
Human Perception and Performance 30(3): 445–463.
Callan, D. E. et al. 2003. Neural processes underlying perceptual enhancement by visual speech gestures.
Neuroreport 14(17): 2213–2218.
Calvert, G. A., E. T. Bullmore, M. J. Brammer et al. 1997. Activation of auditory cortex during silent lipreading.
Science 276(5312): 593–596.
Calvert, G. A., and R. Campbell. 2003. Reading speech from still and moving faces: The neural substrates of
visible speech. Journal of Cognitive Neuroscience 15(1): 57–70.
Campbell, R. 2008. The processing of audio-visual speech: Empirical and neural bases. Philosophical
Transactions of the Royal Society of London. Series B, Biological Sciences 363(1493): 1001–1010.
Colin, C., M. Radeau, A. Soquet, D. Demolin, F. Colin, and P. Deltenre. 2002. Mismatch negativity evoked
by the McGurk-MacDonald effect: A phonetic representation within short-term memory. Clinical
Neurophysiology 113(4): 495–506.
Cutler, A., J. M. McQueen, S. Butterfield, and D. Norris. 2008. Prelexically-driven perceptual retuning of pho-
neme boundaries. Proceedings of Interspeech 2008, Brisbane, Australia.
Desjardins, R. N., and J. F. Werker. 2004. Is the integration of heard and seen speech mandatory for infants?
Developmental Psychobiology 45: 187–203.
Diehl, R. L. 1981. Feature detectors for speech: a critical reappraisal. Psychological Bulletin 89(1): 1–18.
Diehl, R. L., J. L. Elman, and S. B. McCusker. 1978. Contrast effects on stop consonant identification. Journal
of Experimental Psychology: Human Perception and Performance 4(4): 599–609.
Diehl, R. L., M. Lang, and E. M. Parker. 1980. A further parallel between selective adaptation and contrast.
Journal of Experimental Psychology: Human Perception and Performance 6(1): 24–44.
Eimas, P. D., and J. D. Corbit. 1973. Selective adaptation of linguistic feature detectors. Cognitive Psychology
4: 99–109.
Eisner, F., and J. M. McQueen. 2005. The specificity of perceptual learning in speech processing. Perception
and Psychophysics 67(2): 224–238.
Eisner, F., and J. M. McQueen. 2006. Perceptual learning in speech: Stability over time. Journal of the Acoustical
Society of America 119(4): 1950–1953.
Epstein, W. 1975. Recalibration by pairing: A process of perceptual learning. Perception 4: 59–72.
Formisano, E., F. De Martino, M. Bonte, and R. Goebel. 2008. “Who” is saying “what”? Brain-based decoding
of human voice and speech. Science 322(5903): 970–973.
Ganong, W. F. 1980. Phonetic categorization in auditory word perception. Journal of Experimental Psychology:
Human Perception and Performance 6(1): 110–125.
Gibson, J. J. 1933. Adaptation, after-effects and contrast in the perception of curved lines. Journal of
Experimental Psychology 18: 1–31.
Held, R. 1965. Plasticity in sensory–motor systems. Scientific American 213(5): 84–94.
Hintzman, D. L. 1974. Theoretical implications of the spacing effect. In Theories in cognitive psychology: The
Loyola symposium, ed. R. L. Solso, 77–99. Potomac, MD: Erlbaum.
Jesse, A., and J. M. McQueen. 2007. Prelexical adjustments to speaker idiosyncracies: Are they position-specific?
In Proceedings of Interspeech 2007, ed. H. V. Hamme and R. V. Son, 1597–1600. Antwerpen,
Belgium: Causal Productions (DVD).
Kilian-Hütten, N. J., J. Vroomen, and E. Formisano. 2008. One sound, two percepts: Predicting future speech
perception from brain activation during audiovisual exposure. [Abstract]. Neuroimage 41, Supplement
1: S112.
Klemm, O. 1909. Localisation von Sinneneindrücken bei disparaten Nebenreizen. Psychologische Studien 5:
73–161.
Klucharev, V., R. Möttönen, and M. Sams. 2003. Electrophysiological indicators of phonetic and non-phonetic
multisensory interactions during audiovisual speech perception. Brain Research, Cognitive Brain
Research 18(1): 65–75.
Kraljic, T., S. E. Brennan, and A. G. Samuel. 2008a. Accommodating variation: dialects, idiolects, and speech
processing. Cognition 107(1): 54–81.
Kraljic, T., and A. G. Samuel. 2005. Perceptual learning for speech: Is there a return to normal? Cognitive
Psychology 51(2): 141–178.
Kraljic, T., and A. G. Samuel. 2006. Generalization in perceptual learning for speech. Psychonomic Bulletin
and Review 13(2): 262–268.
Kraljic, T., and A. G. Samuel. 2007. Perceptual adjustments to multiple speakers. Journal of Memory and
Language 56: 1–15.
Kraljic, T., A. G. Samuel, and S. E. Brennan. 2008b. First impressions and last resorts: How listeners adjust to
speaker variability. Psychological Science 19(4): 332–338.
Kuhl, P. K., and A. N. Meltzoff. 1982. The bimodal perception of speech in infancy. Science 218: 1138–1141.
Lang, H., T. Nyrke, M. Ek, O. Aaltonen, I. Raimo, and R. Näätänen. 1990. Pitch discrimination performance
and auditory event-related potentials. In Psychophysiological Brain Research, vol. 1, ed. C. M. H. Brunia,
A. W. K. Gaillard, A. Kok, G. Mulder, and M. N. Verbaten, 294–298. Tilburg: Tilburg University Press.
Massaro, D. W. 1984. Children’s perception of visual and auditory speech. Child Development 55:
1777–1788.
McClelland, J. L., and J. L. Elman. 1986. The TRACE model of speech perception. Cognitive Psychology
18(1): 1–86.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264: 746–748.
McQueen, J. M., A. Cutler, and D. Norris. 2006a. Phonological abstraction in the mental lexicon. Cognitive
Science 30: 1113–1126.
McQueen, J. M., A. Jesse, and D. Norris. 2009. No lexical–prelexical feedback during speech perception or: Is
it time to stop playing those Christmas tapes? Journal of Memory and Language 61: 1–18.
McQueen, J. M., D. Norris, and A. Cutler. 2006b. The dynamic nature of speech perception. Language and
Speech 49(1): 101–112.
Mirman, D., J. L. McClelland, and L. L. Holt. 2006. An interactive Hebbian account of lexically guided tuning
of speech perception. Psychonomic Bulletin and Review 13(6): 958–965.
Norris, D., J. M. McQueen, and A. Cutler. 2000. Merging information in speech recognition: Feedback is never
necessary. Behavioral and Brain Sciences 23(3): 299–325 discussion: 325–370.
Norris, D., J. M. McQueen, and A. Cutler. 2003. Perceptual learning in speech. Cognitive Psychology 47(2):
204–238.
Näätänen, R. 1992. Attention and brain function. Hillsdale: Erlbaum.
Näätänen, R. 2001. The perception of speech sounds by the human brain as reflected by the mismatch negativ-
ity (MMN) and its magnetic equivalent. Psychophysiology 38: 1–21.
Näätänen, R., A. W. K. Gaillard, and S. Mäntysalo. 1978. Early selective-attention effect in evoked potential
reinterpreted. Acta Psychologica 42: 313–329.
Näätänen, R., P. Paavilainen, H. Tiitinen, D. Jiang, and K. Alho. 1993. Attention and mismatch negativity.
Psychophysiology 30: 436–450.
Patterson, M., and J. F. Werker. 1999. Matching phonetic information in lips and voice is robust in 4.5-month-
old infants. Infant Behavior and Development 22: 237–247.
Patterson, M. L., and J. F. Werker. 2003. Two-month-old infants match phonetic information in lips and voice.
Developmental Science 6(2): 191–196.
Radeau, M., and P. Bertelson. 1974. The after-effects of ventriloquism. The Quarterly Journal of Experimental
Psychology 26(1): 63–71.
Radeau, M., and P. Bertelson. 1976. The effect of a textured visual field on modality dominance in a ventrilo-
quism situation. Perception and Psychophysics 20: 227–235.
Radeau, M., and P. Bertelson. 1977. Adaptation to auditory–visual discordance and ventriloquism in semireal-
istic situations. Perception and Psychophysics 22(2): 137–146.
Remez, R. E., P. E. Rubin, D. B. Pisoni, and T. D. Carrell. 1981. Speech perception without traditional speech
cues. Science 212: 947–949.
Roberts, M., and Q. Summerfield. 1981. Audiovisual presentation demonstrates that selective adaptation in
speech perception is purely auditory. Perception and Psychophysics 30(4): 309–314.
Rosenblum, L. D., M. A. Schmuckler, and J. A. Johnson. 1997. The McGurk effect in infants. Perception and
Psychophysics 59: 347–357.
Saldaña, H. M., and L. D. Rosenblum. 1994. Selective adaptation in speech perception using a compelling
audiovisual adaptor. Journal of the Acoustical Society of America 95(6): 3658–3661.
Sams, M., R. Aulanko, M. Hämäläinen et al. 1991. Seeing speech: Visual information from lip movements
modifies activity in the human auditory cortex. Neuroscience Letters 127(1): 141–145.
Samuel, A. G. 1986. Red herring detectors and speech perception: in defense of selective adaptation. Cognitive
Psychology 18(4): 452–499.
Samuel, A. G. 2001. Knowing a word affects the fundamental perception of the sounds within it. Psychological
Science 12(4): 348–351.
Samuel, A. G., and D. Kat. 1996. Early levels of analysis of speech. Journal of Experimental Psychology:
Human Perception and Performance 22(3): 676–694.
Sawusch, J. R. 1977. Peripheral and central processes in selective adaptation of place of articulation in stop
consonants. Journal of the Acoustical Society of America 62(3): 738–750.
Shigeno, S. 2002. Anchoring effects in audiovisual speech perception. Journal of the Acoustical Society of
America 111(6): 2853–2861.
Sjerps, M. J., and J. M. McQueen. 2010. The bounds on flexibility in speech perception. Journal of Experimental
Psychology: Human Perception and Performance 36: 195–211.
Stekelenburg, J. J., and J. Vroomen. 2007. Neural correlates of multisensory integration of ecologically valid
audiovisual events. Journal of Cognitive Neuroscience 19(12): 1964–1973.
Stevens, M. 2007. Perceptual adaptation to phonological differences between language varieties. Ph.D. thesis,
University of Gent, Gent.
Stratton, G. M. 1896. Some preliminary experiments on vision without inversion of the retinal image.
Psychological Review 611–617.
Sumby, W. H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the
Acoustical Society of America 26: 212–215.
Teinonen, T., R. N. Aslin, P. Alku, and G. Csibra. 2008. Visual speech contributes to phonetic learning in
6-month-old infants. Cognition 108(3): 850-855.
Tuomainen, J., T. S. Andersen, K. Tiippana, and M. Sams. 2005. Audio-visual speech perception is special.
Cognition 96(1): B13–B22.
van Linden, S., J. J. Stekelenburg, J. Tuomainen, and J. Vroomen. 2007. Lexical effects on auditory speech
perception: An electrophysiological study. Neuroscience Letters 420(1): 49–52.
van Linden, S., and J. Vroomen. 2007. Recalibration of phonetic categories by lipread speech versus lexi-
cal information. Journal of Experimental Psychology: Human Perception and Performance 33(6):
1483–1494.
van Linden, S., and J. Vroomen. 2008. Audiovisual speech recalibration in children. Journal of Child Language
35(4): 809–822.
Vroomen, J., and M. Baart. 2009a. Phonetic recalibration only occurs in speech mode. Cognition 110(2):
254–259.
Vroomen, J., and M. Baart. 2009b. Recalibration of phonetic categories by lipread speech: Measuring afteref-
fects after a twenty-four hours delay. Language and Speech 52: 341–350.
Vroomen, J., S. van Linden, B. de Gelder, and P. Bertelson. 2007. Visual recalibration and selective adaptation
in auditory–visual speech perception: Contrasting build-up courses. Neuropsychologia 45(3): 572–577.
Vroomen, J., S. van Linden, M. Keetels, B. de Gelder, and P. Bertelson. 2004. Selective adaptation and recali-
bration of auditory speech by lipread information: Dissipation. Speech Communication 44: 55–61.
Wallach, H. 1968. Informational discrepancy as a basis of perceptual adaptation. In The neuropsychology of
spatially oriented behaviour, ed. S. J. Freeman, 209–230. Homewood, IL: Dorsey.
Welch, R. B., and D. H. Warren. 1986. In Handbook of perception and human performance, ed. K. R. Kaufman
and J. P. Thomas, 1–36. New York: Wiley.
Winkler, I., T. Kujala, and Y. Shtyrov. 1999. Brain responses reveal the learning of foreign language phonemes.
Psychophysiology 36: 638–642.
20  Multisensory Integration and Aging
Jennifer L. Mozolic, Christina E. Hugenschmidt,
Ann M. Peiffer, and Paul J. Laurienti

CONTENTS
20.1 General Cognitive Slowing
20.2 Inverse Effectiveness
20.3 Larger Time Window of Integration
20.4 Deficits in Attentional Control
20.5 An Alternative Explanation: Increased Noise at Baseline
20.6 Summary and Conclusions
References

Effective processing of multisensory stimuli relies on both the peripheral sensory organs and cen-
tral processing in subcortical and cortical structures. As we age, there are significant changes in all
sensory systems and a variety of cognitive functions. Visual acuity tends to decrease and hearing
thresholds generally increase (Kalina 1997; Liu and Yan 2007), whereas performance levels on tasks
of motor speed, executive function, and memory typically decline (Rapp and Heindel 1994; Birren
and Fisher 1995; Rhodes 2004). There are also widespread changes in the aging brain, including
reductions in gray and white matter volume (Good et al. 2001; Salat et al. 2009), alterations in
neurotransmitter systems (Muir 1997; Backman et al. 2006), regional hypoperfusion (Martin et al.
1991; Bertsch et al. 2009), and altered patterns of functional activity during cognitive tasks (Cabeza
et al. 2004; Grady 2008). Given the extent of age-related alterations in sensation, perception, and
cognition, as well as in the anatomy and physiology of the brain, it is not surprising that multisen-
sory integration also changes with age.
Several early studies provided mixed results on the differences between multisensory process-
ing in older and younger adults (Stine et al. 1990; Helfer 1998; Strupp et al. 1999; Cienkowski
and Carney 2002; Sommers et al. 2005). For example, Stine and colleagues (1990) reported that
although younger adults’ memory for news events was better after audiovisual presentation than
after auditory information alone, older adults did not show improvement during the multisensory
conditions. In contrast, Cienkowski and Carney (2002) demonstrated that audiovisual integration
on the McGurk illusion was similar for older and younger adults, and that in some conditions,
older adults were even more likely to report the fusion of visual and auditory information than
their young counterparts. Similarly, in a study examining the contribution of somatosensory input
to participants’ perception of visuospatial orientation, Strupp et al. (1999) reported an age-related
increase in the integration of somatosensory information into the multisensory representation of
body orientation.
Despite providing a good indication that multisensory processing is somehow altered in aging,
the results of these studies are somewhat difficult to interpret due to their use of complex cog-
nitive tasks and illusions, and to the variability in analysis methods. Several newer studies that
have attempted to address these factors more clearly demonstrate that multisensory integration is
enhanced in older adults (Laurienti et al. 2006; Peiffer et al. 2007; Diederich et al. 2008).
On a two-choice audiovisual discrimination task, Laurienti and colleagues (2006) showed that
response time (RT) benefits for multisensory versus unisensory targets were larger for older adults
than for younger adults (Figure 20.1). That is, older adults’ responses during audiovisual conditions
were speeded more than younger adults’, when compared with their respective responses during
unisensory conditions. Multisensory gains in older adults remained significantly larger than those
observed in younger adults, even after controlling for the presence of two targets in the multisensory
condition (redundant target effect; Miller 1982, 1986; Laurienti et al. 2006).

FIGURE 20.1  Multisensory performance enhancements are significantly larger in older adults than in
younger adults on a two-choice audiovisual discrimination paradigm. These curves illustrate multisensory-
mediated gains relative to race model, which is the summed probability of unisensory responses. Each curve
is the difference between the cumulative distribution of response times for the multisensory condition and the
race model cumulative distribution function. Thus, positive deflections in these curves represent responses to
multisensory stimuli that were faster than would be predicted by independent processing of the auditory and
visual stimulus components (i.e., multisensory integration). Significant multisensory facilitation was observed
in younger adults 340–550 ms after stimulus onset, and the maximum benefit achieved was approximately
8.3%. Older adults exhibited significant multisensory gains over a broader temporal window (330–690 and
730–740 ms after stimulus onset), and had performance gains of about 13.5%. Thus, both younger and older
participants demonstrated speeding of responses to multisensory stimuli that exceeded gains predicted by
the race model; however, older adults benefited more from the multisensory stimulus presentation than did
younger adults. (Adapted from Laurienti, P.J. et al., Neurobiol Aging, 27, 1155–1163, 2006, with permission
from Elsevier.)
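For readers unfamiliar with this analysis, the sketch below reproduces the logic summarized in the figure caption on simulated response times; the RT distributions and all parameter values are invented and are not the study's data.

```python
import numpy as np

# Race-model comparison on simulated response times (invented parameters, not the study's
# data): positive values of the difference curve indicate multisensory responses faster
# than the summed probability of the unisensory responses predicts.

rng = np.random.default_rng(1)
rt_auditory = rng.normal(520, 80, 2000)      # auditory-only RTs in ms
rt_visual = rng.normal(540, 80, 2000)        # visual-only RTs in ms
rt_audiovisual = rng.normal(450, 70, 2000)   # audiovisual RTs, speeded beyond either unisensory mean

t = np.arange(250, 1601, 10)                 # time points spanning the RT range, as in Figure 20.1

def cdf(rt, t):
    """Empirical cumulative probability of a response by time t."""
    return np.searchsorted(np.sort(rt), t, side="right") / rt.size

race_model = np.minimum(cdf(rt_auditory, t) + cdf(rt_visual, t), 1.0)
difference = cdf(rt_audiovisual, t) - race_model        # the curves plotted in Figure 20.1

violated = t[difference > 0]
if violated.size:
    print(f"race-model violation from ~{violated.min()} to ~{violated.max()} ms; "
          f"peak gain {difference.max():.1%}")
else:
    print("no race-model violation in this simulated data set")
```

Larger and longer-lasting positive stretches of this difference curve are what distinguish the older group in Figure 20.1.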
Using similar analysis methods, Peiffer et al. (2007) also reported increased multisensory gains
in older adults. On a simple RT task, where average unisensory RTs were equivalent in younger
and older adults, older adults actually responded faster than younger adults on multisensory trials
because of their enhanced multisensory integration (Peiffer et al. 2007). Diederich and colleagues
(2008) have also shown that older adults exhibit greater speeding of responses to multisensory
targets than younger adults on a saccadic RT task. The analysis methods used in this experiment
indicate a slowing of peripheral sensory processing, as well as a wider time window over which
integration of auditory and visual stimuli can occur (Diederich et al. 2008).
These experiments highlight several possible explanations that could help answer a critical ques-
tion about multisensory processing in aging: Why do older adults exhibit greater integration of
multisensory stimuli than younger adults? Potential sources of enhanced integration in older adults
include age-related cognitive slowing not specific to multisensory processing, inverse effectiveness
associated with sensory deficits, alterations in the temporal parameters of integration, and ineffi-
cient top–down modulation of sensory processing. In the following sections we will investigate each
of these possible explanations in greater detail and offer some alternative hypotheses for the basis
of enhanced multisensory integration in older adults.

20.1  GENERAL COGNITIVE SLOWING


It is well documented that older adults exhibit a general slowing of sensorimotor and cognitive pro-
cessing that impacts their performance on nearly all tasks (Cerella 1985; Birren and Fisher 1995;
Salthouse 2000). This general cognitive slowing is very mild on easy tasks, but is exacerbated on
more demanding tasks that require more cognitive processing. For example, in a meta-analysis of
age-related changes in performance on the Stroop color–word task, Verhaeghen and De Meersman
(1998) showed that older adults were slower than younger adults on both the easier baseline condi-
tion and the more difficult interference condition of the task. If cognitive slowing did not factor into
task performance, we would expect that any differences between younger and older adults (because
of differences in sensory processing, motor responses, etc.) would remain constant across the differ-
ent task conditions. Older adults, however, were slowed down to a greater extent on the difficult task
compared to the easy task. The authors interpreted these findings as support for the general cogni-
tive slowing hypothesis rather than evidence of an age-related change specific to the skills assessed
by the Stroop task (Verhaeghen and De Meersman 1998).
In a typical multisensory paradigm, the multisensory trials contain redundant input (e.g., an
auditory and a visual stimulus), whereas the unisensory trials contain only one input. Thus, the
unisensory condition could be regarded as a more difficult task where more slowing would be
expected in older adults. When multisensory RTs are then compared to unisensory RTs, it would
appear as if the older adults were speeded more in the multisensory condition relative to the unisen-
sory condition than the younger adults. If this were the case, increased multisensory gains observed
in older adults might not be attributable to specific changes in multisensory processing, but could
simply be an artifact of proportional differences in younger and older adults’ processing speed on
tasks of different cognitive loads.
To account for general cognitive slowing, Laurienti et al. (2006) log transformed multisensory
and unisensory RTs from younger and older adults who performed the two-choice audiovisual dis-
crimination task. Log transforming RTs is a post hoc adjustment that can help to equate young and
old RTs and correct for differences related to general cognitive slowing (Cerella 1985; Salthouse
1988; Cornelissen and Kooijman 2000). Older adults still exhibited larger gains in multisensory
integration than younger adults after log transforming the data, suggesting that differences between
the age groups cannot be accounted for solely by general cognitive slowing (Laurienti et al. 2006).
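A minimal numerical illustration of why this adjustment helps, using invented mean RTs: purely proportional slowing contributes the same constant to every condition on a log scale, so a multisensory gain that survives the log transform cannot be explained by general slowing alone.

```python
import numpy as np

# Invented mean RTs (ms) illustrating the log-transform logic: proportional slowing becomes
# an additive constant on the log scale, so it cancels out of condition differences.

young = {"unisensory": 400.0, "multisensory": 360.0}
old_general = {cond: 1.3 * rt for cond, rt in young.items()}                     # proportional slowing only
old_enhanced = {"unisensory": 1.3 * 400.0, "multisensory": 1.3 * 360.0 - 25.0}   # plus an extra multisensory benefit

for label, rts in [("young", young),
                   ("old, general slowing only", old_general),
                   ("old, extra multisensory gain", old_enhanced)]:
    log_gain = np.log(rts["unisensory"]) - np.log(rts["multisensory"])
    print(f"{label:30s} log-RT gain = {log_gain:.3f}")

# The first two rows print identical gains (0.105), whereas the third prints a larger gain
# (0.160) that cannot be attributed to proportional slowing.
```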
Peiffer et al. (2007) took additional steps to rule out general cognitive slowing as an explanation
for age-related multisensory enhancements by using a very simple audiovisual detection task. The
effects of general cognitive slowing are minimized on such tasks where RTs on unisensory trials
are the same for younger and older adults (Yordanova et al. 2004). In this experiment there were no
differences between old and young RTs on unisensory visual or auditory trials; however, on trials
that contained simultaneous visual and auditory targets, older participants were significantly faster
than young subjects (Peiffer et al. 2007). These results support the notion that there are specific
age-related differences in multisensory processing that cannot be explained by general cognitive
slowing.

20.2  INVERSE EFFECTIVENESS


In addition to nonspecific slowing of cognitive processes evident in aging, older adults also dem-
onstrate functional declines in all sensory systems. These functional declines are attributable both
to age-related changes in the peripheral sensory organs, such as rigidity in the lens, loss of hair
384 The Neural Bases of Multisensory Processes

cells, and changes in cutaneous receptors and the olfactory epithelium (Kovács 2004; Liu and Yan
2007; Shaffer and Harrison 2007; Charman 2008), and to age-related alterations in how the central
nervous system processes sensory information (Schmolesky et al. 2000; Cerf-Ducastel and Murphy
2003; Ostroff et al. 2003; Quiton et al. 2007).
Reduced sensitivity or acuity in the individual sensory systems is another potential explanation
for increased multisensory benefits for older adults, attributable to a governing principle of mul-
tisensory integration known as inverse effectiveness. According to this principle, decreasing the
effectiveness of individual sensory stimuli increases the magnitude of multisensory enhancements
(Meredith and Stein 1983, 1986). In other words, when an auditory or visual stimulus is presented
just above threshold level, the gains produced by bimodal audiovisual presentation are larger than
when the individual stimuli are highly salient. Early demonstrations of inverse effectiveness in the
cat superior colliculus (Meredith and Stein 1983, 1986) have been extended to cat and monkey cor-
tex (Wallace et al. 1992; Kayser et al. 2005) as well as both neural and behavioral data in humans
(Hairston et al. 2003; Stevenson and James 2009). For example, Hairston and colleagues (2003)
demonstrated that young participants with normal vision were able to localize unimodal visual
and bimodal audiovisual targets equally well; however, when participants’ vision was artificially
degraded, their localization abilities were significantly enhanced during audiovisual conditions
relative to performance on visual targets alone.
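As a small worked example of this principle, the sketch below uses the proportional enhancement index that is commonly used in this literature, comparing the multisensory response with the best unisensory response; the response values are invented.

```python
# Worked example of inverse effectiveness using a commonly used enhancement index
# (invented response values): enhancement (%) = 100 * (AV - max(A, V)) / max(A, V).

def enhancement(auditory, visual, audiovisual):
    best_unisensory = max(auditory, visual)
    return 100.0 * (audiovisual - best_unisensory) / best_unisensory

# Highly salient stimuli: strong unisensory responses, modest proportional gain.
print(f"salient stimuli:        {enhancement(auditory=40, visual=45, audiovisual=55):+.0f}%")

# Near-threshold stimuli: weak unisensory responses, much larger proportional gain,
# even though the absolute multisensory response is smaller.
print(f"near-threshold stimuli: {enhancement(auditory=4, visual=5, audiovisual=12):+.0f}%")
```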
The evidence for inverse effectiveness as a source of enhanced multisensory integration in
older adults is not yet clear. In the study performed by Peiffer and colleagues (2007; mentioned
above), RTs on unisensory trials were similar for younger and older adults, yet the older adults still
showed larger multisensory gains than the younger group. This finding suggests that other mech-
anisms beyond inverse effectiveness may be required to explain the age-related enhancements.
The paradigm used in this study, however, matched the performance between populations using
superthreshold stimuli and did not specifically investigate the consequence of degrading stimulus
effectiveness.
In a population composed exclusively of older adults, Tye-Murray et al. (2007) demonstrated
that integration levels in an audiovisual speech perception task did not differ for older adults with
mild-to-moderate hearing loss and older adults without hearing impairment. However, all test-
ing in this experiment was conducted in the presence of background auditory noise (multitalker
“babble”), and the level of this noise was adjusted for each participant so that the performance of
the two groups was matched in the unisensory auditory condition. This design makes it difficult
to address the interesting question of whether reduced stimulus effectiveness due to age-related
hearing loss would increase multisensory integration in hearing-impaired versus normal-hearing
older adults.
Results from a study conducted by Cienkowski and Carney (2002) provide some clues on the
effects of hearing loss on age-related integration enhancements. This experiment tested three groups
of participants on the McGurk illusion: (1) young adults with normal hearing, (2) older adults with
mild, but age-appropriate hearing loss, and (3) a control group of young adults with hearing thresh-
olds artificially shifted to match the older adults. Both the older adults and the threshold-shifted
controls were more likely to integrate the visual and auditory information than young, normal hear-
ing participants in one experimental condition (Cienkowski and Carney 2002). In this condition, the
participants viewed the McGurk illusion presented by a male talker. Interestingly, integration did
not differ between the three groups when the illusion was presented by a female talker. Although the
response patterns of the threshold-shifted controls closely matched that of the older adults with mild
hearing loss, the level of integration experienced by each group across the different experimental
conditions did not have a clear inverse relationship with successful unisensory target identification.
For example, in an auditory-only condition, all groups were better at identifying syllables presented
by the male talker than the female talker, yet levels of audiovisual integration were higher for all
groups in the male-talker condition (Cienkowski and Carney 2002). If increased integration in this
task were due simply to increased ambiguity in the auditory signals for older adults and control
subjects (whose hearing thresholds were shifted by noise), then we would expect the highest levels
of integration under conditions where unisensory performance was poorest. Clearly, more studies
that carefully modulate signal intensities and compare the multisensory gains in younger and older
adults will be needed to further characterize the role of inverse effectiveness in age-related multi-
sensory enhancements.

20.3  LARGER TIME WINDOW OF INTEGRATION


A common finding across many studies that compare distributions of RTs in younger and older
adults is that older adults’ responses are both slower and more variable, creating distributions that
are broader and shifted to the right in older adults relative to young (Hale et al. 1988; Morse 1993;
Hultsch et al. 2002). Multisensory enhancements have also been demonstrated to occur over a
wider distribution of RTs for older adults (Laurienti et al. 2006; Peiffer et al. 2007; Diederich et al.
2008). For example, in an audiovisual discrimination paradigm, Laurienti et al. (2006) reported
that whereas younger adults showed multisensory behavioral facilitation 340–550 ms after stimulus
onset, older adults began showing facilitation at approximately the same point (330 ms), but contin-
ued to show enhancements in responses made up to 740 ms after the audiovisual stimuli had been
presented (see Figure 20.1).
Recently, Diederich and colleagues (2008) have studied the temporal characteristics of integra-
tion in older and younger adults using a time-window-of-integration (TWIN) model. This model
is able to distinguish between the relative contributions of early, peripheral sensory processes and
subsequent, central integration processes to multisensory enhancements (Colonius and Diederich
2004; Diederich et al. 2008). Using a focused attention task where saccadic reaction time to a visual
target was measured with and without an accessory auditory stimulus, the authors reported that
older adults’ responses were slower, more variable, and showed greater multisensory enhancements
than younger adults’ responses (Diederich et al. 2008). Additionally, the TWIN model analysis
indicated that peripheral slowing in older adults resulted in a broader temporal window of multi-
sensory integration. Despite this longer period for potential interaction between stimuli, increased
RT and response variability in older adults actually reduce the probability that processing of both
the auditory and visual stimulus will occur within this time window. Given this reduced probability
of stimulus overlap, these data suggest that a longer time window for cross-modal interactions can
only partially compensate for an age-related reduction in the probability that multisensory integra-
tion will occur (Diederich et al. 2008). Thus, a wider time window of integration in older adults is
primarily the result of slower and more variable peripheral sensory processing, and cannot fully
explain why, when integration does occur, its magnitude is larger in older adults.
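The TWIN logic can be made concrete with a small simulation. The Python sketch below is a simplified illustration of the model's two core assumptions for the focused attention paradigm (integration occurs only if the auditory accessory finishes its peripheral stage before the visual target and the target finishes within a fixed window; integration then shortens second-stage processing). The parameter values and the choice of exponential peripheral-stage distributions are illustrative assumptions, not the parameterization used by Colonius and Diederich (2004) or Diederich et al. (2008).

```python
import random

def simulate_twin(n_trials=100_000, mean_a=60.0, mean_v=90.0,
                  window=200.0, second_stage=180.0, benefit=40.0, seed=0):
    """Simplified TWIN-style simulation of a focused attention task.

    Peripheral processing times of the auditory accessory (A) and visual
    target (V) are drawn from exponential distributions.  Integration occurs
    if A terminates before V and V terminates within `window` ms of A; it
    shortens second-stage processing by `benefit` ms.  Returns the mean RT
    and the probability of integration."""
    rng = random.Random(seed)
    total_rt, n_integrated = 0.0, 0
    for _ in range(n_trials):
        a = rng.expovariate(1.0 / mean_a)
        v = rng.expovariate(1.0 / mean_v)
        integrated = a < v <= a + window
        n_integrated += integrated
        total_rt += v + second_stage - (benefit if integrated else 0.0)
    return total_rt / n_trials, n_integrated / n_trials

# Hypothetical "older" parameters: slower (and, for exponentials, therefore
# more variable) peripheral stages lower the probability that A and V fall
# within the window, even though the window itself is unchanged here.
print(simulate_twin())                           # younger-like parameters
print(simulate_twin(mean_a=90.0, mean_v=130.0))  # older-like parameters
```

With these illustrative settings, the simulated probability of integration is lower for the slower parameter set, mirroring the reduced probability of stimulus overlap described above.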

20.4  DEFICITS IN ATTENTIONAL CONTROL


In addition to stimulus properties such as timing, location, and intensity that can affect multisensory
integration, there are also top–down cognitive factors that can modulate cross-modal interactions
such as semantic congruence (Laurienti et al. 2004) and selective attention (Alsius et al. 2005;
Talsma and Woldorff 2005; Talsma et al. 2007; Mozolic et al. 2008a). Of particular relevance to
aging is selective attention, a top–down control mechanism that allows us to focus on a particular
location, stimulus feature, or sensory modality while ignoring other possible options (Corbetta et al.
1990; Posner and Driver 1992; Spence and Driver 1997; Kastner and Ungerleider 2000; Spence et
al. 2001). Attention to a particular sensory modality typically results in small behavioral benefits
in the attended modality and larger deficits in the unattended modality (Spence and Driver 1997;
Spence et al. 2001). Similarly, neuroimaging data suggest that modality-specific attention causes
activity increases in cortical areas associated with the attended modality and activity decreases in
brain regions associated with processing information from the unattended modality (Kawashima
et al. 1995; Ghatan et al. 1998; Johnson and Zatorre 2006; Mozolic et al. 2008b).

In young, healthy adults, dividing attention across multiple sensory modalities appears to be
critical for multisensory integration, whereas restricting attention to a single sensory modality can
abolish behavioral and neural enhancements associated with multisensory stimuli (Alsius et al.
2005; Talsma and Woldorff 2005; Talsma et al. 2007; Mozolic et al. 2008a). Many studies have
demonstrated that older adults have deficits in attention and are more distracted by stimuli within
and across sensory modalities (Dywan et al. 1998; Alain and Woods 1999; West and Alain 2000;
Milham et al. 2002; Andres et al. 2006; Poliakoff et al. 2006; Yang and Hasher 2007; Healey et al.
2008). For example, Andres and colleagues (2006) reported that older adults were more distracted
by irrelevant sounds than younger adults on an auditory–visual oddball paradigm. It would seem
possible then, that increased multisensory integration in older adults could result from deficits in
top–down attentional control that allow more cross-modal information to be processed.
This apparently simple account of age-related increases in distractibility is complicated by the
fact that there is also strong evidence suggesting that older adults can, in fact, successfully engage
selective attention on a variety of tasks (Groth and Allen 2000; Verhaeghen and Cerella 2002;
Madden et al. 2004; Townsend et al. 2006; Ballesteros et al. 2008; Hugenschmidt et al. 2009a;
Hugenschmidt et al. 2009c). In a recent study, Hugenschmidt and colleagues (2009a) used a cued
multisensory discrimination paradigm to demonstrate that older adults can reduce multisensory
integration by attending to a single sensory modality in a manner similar to that observed in
young adults (Mozolic et al. 2008a). However, multisensory integration was still enhanced in older
adults relative to young because the levels of integration in older adults were significantly higher at
baseline, in the absence of modality-specific attentional modulation (Figure 20.2). These results indi-
cate that enhanced integration in older adults is not due to deficits in engaging top–down selective
attention mechanisms, but could instead result from age-related increases in baseline cross-modal
interactions. This alternative explanation may also help to account for the seemingly contradictory
evidence that older adults are both more distractible than younger adults and equally able to engage
selective attention.

[Figure 20.2 plot area: probability difference (%) between multisensory and race-model cumulative distributions (y-axis, –10% to 15%) as a function of reaction time (x-axis, 200–1400 ms), with separate curves for divided, selective auditory, and selective visual attention; panel (a) younger adults, panel (b) older adults.]

FIGURE 20.2  Selective attention reduces multisensory integration in younger and older adults. As in Figure
20.1, each curve represents the difference between the cumulative distribution for multisensory responses and
the race model, and thus, positive deflections show time bins where multisensory integration was observed.
In this cued, two-choice discrimination paradigm, multisensory and unisensory targets were presented under
three different attention conditions: divided attention, selective auditory attention, and selective visual atten-
tion. Younger adults exhibited integration only during divided attention conditions (peak facilitation ≈ 5%);
selective attention abolished multisensory gains (a). Older adults were also able to reduce multisensory inte-
gration during selective attention; however, due to higher levels of integration during the baseline divided
attention condition (peak facilitation ≈ 10%), older adults still exhibited significant multisensory gains during
selective attention (b). These data demonstrate that older adults are able to engage selective attention and
modulate multisensory integration, yet have a general increase in the level of integration relative to younger
adults that is independent of attention condition. (Adapted from Hugenschmidt, C.E. et al., Neuroreport, 20,
349–353, 2009a, with permission from Wolters Kluwer Health.)
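The curves in Figures 20.1 and 20.2 are typically derived by comparing the cumulative RT distribution for multisensory trials against a race-model bound constructed from the two unisensory distributions (Miller 1982). The Python sketch below shows one common way to compute such a difference curve; the function and variable names, and the simulated RTs, are illustrative and do not reproduce the exact procedures of the studies discussed here.

```python
import numpy as np

def race_model_difference(rt_av, rt_a, rt_v, t_grid):
    """Difference between the multisensory CDF and Miller's (1982) race-model
    bound, min[F_A(t) + F_V(t), 1], evaluated at the times in t_grid.
    Positive values mark time bins where the race model is violated, i.e.,
    where multisensory integration is inferred."""
    def ecdf(samples, t):
        samples = np.sort(np.asarray(samples, dtype=float))
        return np.searchsorted(samples, t, side="right") / samples.size

    f_av = ecdf(rt_av, t_grid)
    bound = np.minimum(ecdf(rt_a, t_grid) + ecdf(rt_v, t_grid), 1.0)
    return f_av - bound

# Hypothetical RT samples (ms) for a single participant:
rng = np.random.default_rng(0)
rt_a = rng.normal(520, 80, 200)   # auditory-only trials
rt_v = rng.normal(500, 80, 200)   # visual-only trials
rt_av = rng.normal(440, 70, 200)  # audiovisual trials
t_grid = np.arange(200, 1401, 10)
diff = race_model_difference(rt_av, rt_a, rt_v, t_grid)
print(t_grid[diff > 0])           # time bins showing race-model violations
```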

20.5  AN ALTERNATIVE EXPLANATION: INCREASED NOISE AT BASELINE


In the cued multisensory discrimination paradigm mentioned above (Hugenschmidt et al. 2009a),
older adults experienced multisensory enhancements even while selectively attending to a single
sensory modality. These multisensory gains can be used to index the level of background sensory
noise being processed, because behavioral enhancements can only result if irrelevant stimuli from
the ignored sensory modality speed up responses to targets in the attended modality. When younger
adults engage selective attention on this task, multisensory enhancements are abolished, indicating
that extraneous sensory information from the ignored modality is being successfully suppressed
(Mozolic et al. 2008a; Hugenschmidt et al. 2009a). In contrast, results in older adults appear para-
doxical. They decrease multisensory integration during selective attention to a degree comparable to that of
younger adults, but in spite of this, they integrate nearly as much during selective attention as their
younger counterparts do during divided attention (see Figure 20.2). This occurs because older adults
show increased integration during the baseline divided attention state. When older adults engage
selective attention, their relative level of integration suppression (divided-attention peak ≈ 10%;
selective-attention peak ≈ 5%) is similar to the attention-mediated drop seen in younger adults
(divided-attention peak ≈ 5%; selective-attention peak < 2%).
However, because older adults have higher levels of integration at baseline, they still exhibit robust
integration after attention-mediated suppression. The important point is that increased processing
of irrelevant sensory information during selective attention is not due to a failure of attentional pro-
cesses, but rather to a shift in the processing of sensory stimuli at baseline. Attention does serve to
limit integration in older adults, but it does not appear to completely compensate for the increased
level of background sensory noise (Hugenschmidt et al. 2009a).
To directly test for age-related increases in background sensory processing, Hugenschmidt and
colleagues (2009b) compared cerebral blood flow (CBF) during resting state and a visual steady-
state task in younger and older adults. The hypothesis that older adults show higher baseline levels
of sensory processing led to three sets of predictions. First, during resting state, CBF to the audi-
tory cortex associated with background noise of the magnetic resonance imaging (MRI) scanner
would be increased in older adults relative to younger adults. Second, during visual stimulation,
both groups would have reduced auditory cortex CBF, but the relative amount of auditory CBF
should still be higher in the older adults. Third, older adults would show reductions in cross-modal
signal-to-noise ratio (SNR) during both resting and visual tasks. The cross-modal SNR was quan-
tified as the ratio between CBF in the visual and auditory cortices. The results of this study sup-
port these claims, suggesting that older adults process more background sensory information than
younger adults, demonstrated by increased CBF in the auditory cortex during rest and during visual
stimulation. Despite the fact that both older and younger adults show comparable reductions in
auditory CBF when engaged in a visual task, the older adults still have higher CBF in the auditory
cortex in response to the ongoing, but task-irrelevant, scanner noise. This increase in background
sensory processing also results in a reduced SNR for older adults during visual task performance
(Hugenschmidt et al. 2009b).
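Because the cross-modal SNR in that study reduces to a simple ratio, the predicted age difference is easy to illustrate numerically. The values in the Python snippet below are invented for illustration only and are not data from Hugenschmidt et al. (2009b).

```python
def cross_modal_snr(cbf_visual, cbf_auditory):
    """Cross-modal signal-to-noise ratio: CBF in visual cortex divided by CBF
    in auditory cortex (during a visual task, auditory-cortex activity
    reflects processing of the task-irrelevant scanner noise)."""
    return cbf_visual / cbf_auditory

# Hypothetical CBF values (ml/100 g/min) during visual stimulation: equal
# visual-cortex CBF, but higher residual auditory-cortex CBF in older adults
# yields a lower cross-modal SNR.
print(cross_modal_snr(cbf_visual=55.0, cbf_auditory=40.0))  # younger: ~1.4
print(cross_modal_snr(cbf_visual=55.0, cbf_auditory=48.0))  # older:   ~1.1
```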
The results of this imaging study parallel the behavioral results discussed above, suggesting
that enhanced multisensory integration in older adults is at least partly attributable to the fact that
older adults are processing more sensory information than their younger counterparts, regardless
of stimulus relevance or attentional state (Alain and Woods 1999; Rowe et al. 2006; Hugenschmidt
et al. 2009a). Environmental conditions and task demands can determine whether increased sensory
processing is beneficial or detrimental to performance. In certain controlled laboratory paradigms,
processing more sensory information can be beneficial. For example, enhanced processing of both
a visual target and a spatially coincident, semantically congruent auditory stimulus can result in
speeded RTs (Laurienti et al. 2004). In the real world, however, we are constantly bombarded
with information from all sensory modalities. We use top–down control mechanisms, like selec-
tive attention, to focus on important and task-relevant sensory information and filter out irrelevant
and distracting stimuli. It seems that older adults can benefit from selective attention, yet because
their baseline levels of sensory processing are elevated, they are still more distracted than younger
adults when incoming sensory streams contain irrelevant or conflicting information. However, if
the extraneous sensory information becomes task relevant, older adults will exhibit larger gains
than younger adults, as information that was previously interfering with task performance becomes
helpful in completing the task.
Additional illustrations of the costs and benefits that older adults experience as a consequence of
increased baseline sensory processing can be seen in unisensory distraction tasks (Rowe et al. 2006;
Yang and Hasher 2007; Healey et al. 2008). In one example, Yang and Hasher (2007) demonstrated
that older adults were more distracted by irrelevant pictures than young in a task that required par-
ticipants to make semantic judgments about words that appeared superimposed on the pictures. In a
very similar paradigm that modified task demands, however, older adults had an advantage (Rowe
et al. 2006). In this experiment, younger and older adults were required to make same/different
judgments about the pictures that appeared beneath an overlay containing irrelevant words. On a
subsequent test of implicit memory for the irrelevant words, older adults actually showed better
memory, indicating that they had indeed processed more “noise” or irrelevant background infor-
mation than younger adults (Rowe et al. 2006). These studies support the notion that older adults
are more distractible than younger adults because they do not adequately filter sensory noise, but
when to-be-ignored information becomes relevant, older adults can actually benefit from increased
background sensory processing.
In spite of the accumulating evidence that baseline sensory processing changes with age, there is
no clear evidence for an underlying neural mechanism. One potential source of age-related changes
in baseline filtering parameters is dysregulation of the default mode network (DMN), an anatomi-
cally and physiologically defined system of structures thought to be involved in monitoring inter-
nal thoughts and the external environment at rest (Raichle et al. 2001; Greicius and Menon 2004;
Buckner et al. 2008). Composed of regions such as the anterior cingulate, posterior cingulate/pre-
cuneus region, and the parietal cortex, the default mode network is most active during rest and
becomes less active during most goal-directed behaviors (Raichle et al. 2001; Greicius and Menon
2004; Buckner et al. 2008). Several studies have reported that the DMN is not suppressed as effec-
tively during external tasks in older adults as in young (Lustig et al. 2003; Grady et al. 2006; Persson
et al. 2007). Failure to suppress default mode network activity has also been implicated in reduced
stimulus processing during attentional lapses, increased frequency of task-unrelated thoughts, and
increased error rates (McKiernan et al. 2006; Weissman et al. 2006; Li et al. 2007). A recent study
by Stevens and colleagues (2008) directly linked increased background activity in auditory cortex
during a visual task to DMN activity. In this functional MRI (fMRI) study, older and younger adults
were asked to complete a visual working memory task in a noisy MRI scanner environment. When
older adults made errors on this task, they had increased activity in the auditory cortex. In younger
adults, however, error trials were not associated with increased auditory activation. This suggests
that older adults were processing more background information than younger adults and that the
increased processing was related to distraction by irrelevant auditory stimulation. Furthermore,
increased auditory activity was associated with increased DMN activity, indicating that older adults’
vulnerability to distraction may be linked to age-related differences in suppression of the DMN
(Stevens et al. 2008). It seems likely, therefore, that further characterization of the default mode
network in aging may be important for understanding the neural basis of altered baseline sensory
processing and enhanced multisensory integration in older adults.

20.6  SUMMARY AND CONCLUSIONS


Given the existing literature on multisensory processing in aging, it appears that there is not yet a
clear explanation for why older adults exhibit greater multisensory integration than younger adults.
Based on the studies summarized in this review, several potential sources of increased integration
can be ruled out as the sole cause of age-related gains. Experiments that apply adjustments for
general cognitive slowing (Laurienti et al. 2006) or use paradigms that equate unisensory RTs for
younger and older adults (Peiffer et al. 2007) demonstrate that multisensory gains are still larger for
older participants. A large portion of the behavioral changes that older adults exhibit in these para-
digms must therefore be specific to multisensory processing, rather than be attributed to the general
effects of sensorimotor and cognitive slowing.
Similarly, older adults’ broad time window of integration does not seem to be the source of their
multisensory processing enhancements. The analysis methods used by Diederich and colleagues
(2008) clearly show that older adults have a larger time interval over which multisensory integration
can occur; however, this is the result of slowed peripheral sensory processing and does not appear to
compensate for a decreased probability that the processing of multiple unisensory stimuli will over-
lap in time. This decreased probability of interaction between unisensory stimuli is because older
adults’ unisensory processing times are slow and highly variable, and therefore two independent
stimuli are less likely to be available for processing and integration at the same time. Yet if the two
stimuli are integrated, the older adults are speeded more than younger adults (Diederich et al. 2008).
Thus, older adults’ wider time window of integration, a consequence of increased RT and variability,
does not provide an explanation as to why integration is stronger in older adults when it does occur.
Another logical hypothesis is that older adults show enhanced multisensory integration because
they are unable to use selective attention to filter incoming sensory information; however, age-related
deficits in attentional control fail to adequately explain integration enhancements. Hugenschmidt
et al. (2009c) have confirmed that older adults can successfully instantiate modality-specific selec-
tive attention and have further demonstrated that there is no age-related difference in the magnitude
of multisensory integration reduction during selective attention (Hugenschmidt et al. 2009a). Rather
than implicating selective attention deficits as the source of underlying increases in multisensory
integration, data suggest that older adults differ from younger adults in the amount of baseline
sensory processing. Findings from an MRI study of CBF support this notion, showing that audi-
tory cortex CBF associated with task-irrelevant scanner noise is increased in older adults relative to
young, both during rest and during a visual task (Hugenschmidt et al. 2009b). Increased activity in
brain structures that comprise the default mode network has been implicated in the level of back-
ground sensory processing in older adults, and further investigation of the DMN may yield critical
information about the nature of age-related changes in baseline sensory processing that can inform
our understanding of multisensory integration in aging.
Another potential mechanism for age-related increases in multisensory benefits that cannot be
discounted is inverse effectiveness. To our knowledge, there have been no conclusive studies on the
relationship between stimulus salience and multisensory gains in older adults. A recent fMRI study
in younger adults, performed by Stevenson and colleagues (2009), demonstrated inverse effective-
ness in the patterns of cortical activity during audiovisual presentations of speech and object stimuli.
As the intensity of the auditory and visual stimulus components decreased, activation gains in the
superior temporal sulcus during multisensory stimuli increased. In other words, highly effective
sensory stimuli resulted in smaller activity changes in multisensory cortex compared to degraded
stimuli. A similar experimental paradigm could be used to investigate the relationship between stim-
ulus effectiveness and multisensory enhancements at the cortical level in younger and older adults.
Over the past several years, we have learned a great deal about how multisensory processing
changes with age; however, the mechanisms underlying age-related enhancements in multisensory
integration are not yet clear. Further exploration of the connections between baseline sensory pro-
cessing, stimulus salience, and multisensory gains should provide insight into the advantages and
impairments older adults can experience from changes in multisensory integration.

REFERENCES
Alain, C., and D. L. Woods. 1999. Age-related changes in processing auditory stimuli during visual attention:
Evidence for deficits in inhibitory control and sensory memory. Psychol Aging 14:507–519.
Alsius, A., J. Navarra, R. Campbell, and S. Soto-Faraco. 2005. Audiovisual integration of speech falters under
high attention demands. Curr Biol 15:839–843.
Andres, P., F. B. Parmentier, and C. Escera. 2006. The effect of age on involuntary capture of attention by irrel-
evant sounds: A test of the frontal hypothesis of aging. Neuropsychologia 44:2564–2568.
Backman, L., L. Nyberg, U. Lindenberger, S. C. Li, and L. Farde. 2006. The correlative triad among aging,
dopamine, and cognition: Current status and future prospects. Neurosci Biobehav Rev 30:791–807.
Ballesteros, S., J. M. Reales, J. Mayas, and M. A. Heller. 2008. Selective attention modulates visual and haptic
repetition priming: Effects in aging and Alzheimer’s disease. Exp Brain Res 189:473–483.
Bertsch, K., D. Hagemann, M. Hermes, C. Walter, R. Khan, and E. Naumann. 2009. Resting cerebral blood
flow, attention, and aging. Brain Res 1267:77–88.
Birren, J. E., and L. M. Fisher. 1995. Aging and speed of behavior: Possible consequences for psychological
functioning. Ann Rev Psychol 46:329–353.
Buckner, R. L., J. R. Andrews-Hanna, and D. L. Schacter. 2008. The brain’s default network: Anatomy, func-
tion, and relevance to disease. Ann NY Acad Sci 1124:1–38.
Cabeza, R., S. M. Daselaar, F. Dolcos, S. E. Prince, M. Budde, and L. Nyberg. 2004. Task-independent and
task-specific age effects on brain activity during working memory, visual attention and episodic retrieval.
Cereb Cortex 14:364–375.
Cerella, J. 1985. Information processing rates in the elderly. Psychol Bull 98:67–83.
Cerf-Ducastel, B., and C. Murphy. 2003. FMRI brain activation in response to odors is reduced in primary
olfactory areas of elderly subjects. Brain Res 986:39–53.
Charman, W. N. 2008. The eye in focus: Accommodation and presbyopia. Clin Exp Optom
91:207–225.
Cienkowski, K. M., and A. E. Carney. 2002. Auditory–visual speech perception and aging. Ear Hear
23:439–449.
Colonius, H., and A. Diederich. 2004. Multisensory interaction in saccadic reaction time: A time-window-of-
integration model. J Cogn Neurosci 16:1000–1009.
Corbetta, M., F. M. Miezin, S. Dobmeyer, G. L. Shulman, and S. E. Petersen. 1990. Attentional modulation of
neural processing of shape, color, and velocity in humans. Science 248:1556–1559.
Cornelissen, F. W., and A. C. Kooijman. 2000. Does age change the distribution of visual attention? A comment
on McCalley, Bouwhuis, and Juola (1995). J Gerontol B Psychol Sci Soc Sci 55:187–190.
Diederich, A., H. Colonius, and A. Schomburg. 2008. Assessing age-related multisensory enhancement with
the time-window-of-integration model. Neuropsychologia 46:2556–2562.
Dywan, J., S. J. Segalowitz, and L. Webster. 1998. Source monitoring: ERP evidence for greater reactivity to
nontarget information in older adults. Brain Cogn 36:390–430.
Ghatan, P. H., J. C. Hsieh, K. M. Petersson, S. Stone-Elander, and M. Ingvar. 1998. Coexistence of attention-
based facilitation and inhibition in the human cortex. Neuroimage 7:23–29.
Good, C. D., I. S. Johnsrude, J. Ashburner, R. N. Henson, K. J. Friston, and R. S. Frackowiak. 2001. A voxel-
based morphometric study of ageing in 465 normal adult human brains. Neuroimage 14:21–36.
Grady, C. L. 2008. Cognitive neuroscience of aging. Ann NY Acad Sci 1124:127–144.
Grady, C. L., M. V. Springer, D. Hongwanishkul, A. R. McIntosh, and G. Winocur. 2006. Age-related changes
in brain activity across the adult lifespan. J Cogn Neurosci 18:227–241.
Greicius, M. D., and V. Menon. 2004. Default-mode activity during a passive sensory task: Uncoupled from
deactivation but impacting activation. J Cogn Neurosci 16:1484–1492.
Groth, K. E., and P. A. Allen. 2000. Visual attention and aging. Front Biosci 5:D284.
Hairston, W. D., P. J. Laurienti, G. Mishra, J. H. Burdette, and M. T. Wallace. 2003. Multisensory enhancement
of localization under conditions of induced myopia. Exp Brain Res 152:404–408.
Hale, S., J. Myerson, G. A. Smith, and L. W. Poon. 1988. Age, variability, and speed: Between-subjects diver-
sity. Psychol Aging 3:407.
Healey, M. K., K. L. Campbell, and L. Hasher. 2008. Cognitive aging and increased distractibility: Costs and
potential benefits (Chapter 22). Prog Brain Res 169:353–363.
Helfer, K. S. 1998. Auditory and auditory–visual recognition of clear and conversational speech by older adults.
J Am Acad Audiol 9:234.
Hugenschmidt, C. E., J. L. Mozolic, and P. J. Laurienti. 2009a. Suppression of multisensory integration by
modality-specific attention in aging. Neuroreport 20:349–353.
Hugenschmidt, C. E., J. L. Mozolic, H. Tan, R. A. Kraft, and P. J. Laurienti. 2009b. Age-related increase in
cross-sensory noise in resting and steady-state cerebral perfusion. Brain Topogr 20:241–251.
Hugenschmidt, C. E., A. M. Peiffer, T. P. McCoy, S. Hayasaka, and P. J. Laurienti. 2009c. Preservation of cross-
modal selective attention in healthy aging. Exp Brain Res 198:273–285.
Hultsch, D. F., S. W. MacDonald, and R. A. Dixon. 2002. Variability in reaction time performance of younger
and older adults. J Gerontol B Psychol Sci Soc Sci 57:101–115.
Johnson, J. A., and R. J. Zatorre. 2006. Neural substrates for dividing and focusing attention between simulta-
neous auditory and visual events. Neuroimage 31:1673–1681.
Kalina, R. E. 1997. Seeing into the future. Vision and aging. West J Med 167:253–257.
Kastner, S., and L. G. Ungerleider. 2000. Mechanisms of visual attention in the human cortex. Annu Rev
Neurosci 23:315–341.
Kawashima, R., B. T. O’Sullivan, and P. E. Roland. 1995. Positron-emission tomography studies of cross-
modality inhibition in selective attentional tasks: Closing the “mind’s eye.” Proc Natl Acad Sci U S A
92:5969–5972.
Kayser, C., C. I. Petkov, M. Augath, and N. K. Logothetis. 2005. Integration of touch and sound in auditory
cortex. Neuron 48:373–384.
Kovács, T. 2004. Mechanisms of olfactory dysfunction in aging and neurodegenerative disorders. Ageing Res
Rev 3:215.
Laurienti, P. J., J. H. Burdette, J. A. Maldjian, and M. T. Wallace. 2006. Enhanced multisensory integration in
older adults. Neurobiol Aging 27:1155–1163.
Laurienti, P. J., R. A. Kraft, J. A. Maldjian, J. H. Burdette, and M. T. Wallace. 2004. Semantic congruence is a
critical factor in multisensory behavioral performance. Exp Brain Res 158:405–414.
Li, C. S., P. Yan, K. L. Bergquist, and R. Sinha. 2007. Greater activation of the “default” brain regions predicts
stop signal errors. Neuroimage 38:640–648.
Liu, X., and D. Yan. 2007. Ageing and hearing loss. J Pathol 211:188–197.
Lustig, C., A. Z. Snyder, M. Bhakta et al. 2003. Functional deactivations: Change with age and dementia of the
Alzheimer type. Proc Natl Acad Sci U S A 100:14504.
Madden, D. J., W. L. Whiting, R. Cabeza, and S. A. Huettel. 2004. Age-related preservation of top-down atten-
tional guidance during visual search. Psychol Aging 19:304.
Martin, A. J., K. J. Friston, J. G. Colebatch, and R. S. Frackowiak. 1991. Decreases in regional cerebral blood
flow with normal aging. J Cereb Blood Flow Metab 11:684–689.
McKiernan, K. A., B. R. D’Angelo, J. N. Kaufman, and J. R. Binder. 2006. Interrupting the “stream of con-
sciousness”: An fMRI investigation. Neuroimage 29:1185–1191.
Meredith, M. A., and B. E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221:389–391.
Meredith, M. A., and B. E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. J Neurophysiol 56:640–662.
Milham, M. P., K. I. Erickson, M. T. Banich et al. 2002. Attentional control in the aging brain: Insights from an
fMRI study of the stroop task. Brain Cogn 49:277.
Miller, J. 1982. Divided attention: Evidence for coactivation with redundant signals. Cogn Psychol
14:247–279.
Miller, J. 1986. Time course of coactivation in bimodal divided attention. Percept Psychophys 40:331–343.
Morse, C. K. 1993. Does variability increase with age? An archival study of cognitive measures. Psychol Aging
8:156–164.
Mozolic, J. L., C. E. Hugenschmidt, A. M. Peiffer, and P. J. Laurienti. 2008a. Modality-specific selective atten-
tion attenuates multisensory integration. Exp Brain Res 184:39–52.
Mozolic, J. L., D. Joyner, C. E. Hugenschmidt et al. 2008b. Cross-modal deactivations during modality-specific
selective attention. BMC Neurol 8:35.
Muir, J. L. 1997. Acetylcholine, aging, and Alzheimer’s disease. Pharmacol Biochem Behav 56:687–696.
Ostroff, J. M., K. L. McDonald, B. A. Schneider, and C. Alain. 2003. Aging and the processing of sound dura-
tion in human auditory cortex. Hear Res 181:1–7.
Peiffer, A. M., J. L. Mozolic, C. E. Hugenschmidt, and P. J. Laurienti. 2007. Age-related multisensory enhance-
ment in a simple audiovisual detection task. Neuroreport 18:1077–1081.
Persson, J., C. Lustig, J. K. Nelson, and P. A. Reuter-Lorenz. 2007. Age differences in deactivation: A link to
cognitive control? J Cogn Neurosci 19:1021–1032.
Poliakoff, E., S. Ashworth, C. Lowe, and C. Spence. 2006. Vision and touch in ageing: Crossmodal selective
attention and visuotactile spatial interactions. Neuropsychologia 44:507–517.
Posner, M. I., and J. Driver. 1992. The neurobiology of selective attention. Curr Opin Neurobiol 2:165–169.
Quiton, R. L., S. R. Roys, J. Zhuo, M. L. Keaser, R. P. Gullapalli, and J. D. Greenspan. 2007. Age-related
changes in nociceptive processing in the human brain. Ann NY Acad Sci 1097:175–178.
Raichle, M. E., A. M. MacLeod, A. Z. Snyder, W. J. Powers, D. A. Gusnard, and G. L. Shulman. 2001. A default
mode of brain function. Proc Natl Acad Sci U S A 98:676–682.
Rapp, P. R., and W. C. Heindel. 1994. Memory systems in normal and pathological aging. Curr Opin Neurol
7:294–298.
Rhodes, M. G. 2004. Age-related differences in performance on the Wisconsin card sorting test: A meta-ana-
lytic review. Psychol Aging 19:482–494.
Rowe, G., S. Valderrama, L. Hasher, and A. Lenartowicz. 2006. Attentional disregulation: A benefit for implicit
memory. Psychol Aging 21:826–830.
Salat, D. H., D. N. Greve, J. L. Pacheco et al. 2009. Regional white matter volume differences in nondemented
aging and Alzheimer’s disease. Neuroimage 44:1247–1258.
Salthouse, T. A. 1988. The complexity of age × complexity functions: Comment on Charness and Campbell
(1988). J Exp Psychol Gen 117:425.
Salthouse, T. A. 2000. Aging and measures of processing speed. Biol Psychol 54:35–54.
Schmolesky, M. T., Y. Wang, M. Pu, and A. G. Leventhal. 2000. Degradation of stimulus selectivity of visual
cortical cells in senescent rhesus monkeys. Nat Neurosci 3:384–390.
Shaffer, S. W., and A. L. Harrison. 2007. Aging of the somatosensory system: A translational perspective. Phys
Ther 87:193–207.
Sommers, M. S., N. Tye-Murray, and B. Spehar. 2005. Auditory–visual speech perception and auditory–visual
enhancement in normal-hearing younger and older adults. Ear Hear 26:263–275.
Spence, C., and J. Driver. 1997. On measuring selective attention to an expected sensory modality. Percept
Psychophys 59:389–403.
Spence, C., M. E. Nicholls, and J. Driver. 2001. The cost of expecting events in the wrong sensory modality.
Percept Psychophys 63:330–336.
Stevens, W. D., L. Hasher, K. S. Chiew, and C. L. Grady. 2008. A neural mechanism underlying memory failure
in older adults. J Neurosci 28:12820–12824.
Stevenson, R. A., and T. W. James. 2009. Audiovisual integration in human superior temporal sulcus: Inverse
effectiveness and the neural processing of speech and object recognition. Neuroimage 44:1210.
Stine, E. A., A. Wingfield, and S. D. Myers. 1990. Age differences in processing information from television
news: The effects of bisensory augmentation. J Gerontol 45:1–8.
Strupp, M., V. Arbusow, C. Borges Pereira, M. Dieterich, and T. Brandt. 1999. Subjective straight-ahead during
neck muscle vibration: Effects of ageing. Neuroreport 10:3191–3194.
Talsma, D., and M. G. Woldorff. 2005. Selective attention and multisensory integration: Multiple phases of
effects on the evoked brain activity. J Cogn Neurosci 17:1098–1114.
Talsma, D., T. J. Doty, and M. Woldorff. 2007. Selective attention and audiovisual integration: Is attending to
both modalities a prerequisite for early integration? Cereb Cortex 17:679–690.
Townsend, J., M. Adamo, and F. Haist. 2006. Changing channels: An fMRI study of aging and cross-modal
attention shifts. Neuroimage.
Tye-Murray, N., M.S. Sommers, and B. Spehar. 2007. Audiovisual integration and lipreading abilities of older
adults with normal and impaired hearing. Ear Hear 28:656–668.
Verhaeghen, P., and L. De Meersman. 1998. Aging and the Stroop effect: A meta-analysis. Psychol Aging
13:120–126.
Verhaeghen, P., and J. Cerella. 2002. Aging, executive control, and attention: A review of meta-analyses.
Neurosci Biobehav Rev 26:849–857.
Wallace, M. T., M. A. Meredith, and B. E. Stein. 1992. Integration of multiple sensory modalities in cat cortex.
Exp Brain Res 91:484–488.
Weissman, D. H., K. C. Roberts, K. M. Visscher, and M. G. Woldorff. 2006. The neural bases of momentary
lapses in attention. Nat Neurosci 9:971–978.
West, R., and C. Alain. 2000. Age-related decline in inhibitory control contributes to the increased Stroop effect
observed in older adults. Psychophysiology 37:179.
Yang, L., and L. Hasher. 2007. The enhanced effects of pictorial distraction in older adults. J Gerontol B
Psychol Sci Soc Sci 62:230–233.
Yordanova, J., V. Kolev, J. Hohnsbein, and M. Falkenstein. 2004. Sensorimotor slowing with ageing is medi-
ated by a functional dysregulation of motor-generation processes: Evidence from high-resolution event-
related potentials. Brain 127:351–362.
Section V
Clinical Manifestations
21 Neurophysiological Mechanisms Underlying Plastic Changes and Rehabilitation following Sensory Loss in Blindness and Deafness

Ella Striem-Amit, Andreja Bubic, and Amir Amedi

CONTENTS
21.1 Introduction
21.2 Rehabilitation following Sensory Loss
21.2.1 Sensory Substitution Devices
21.2.2 Sensory Restoration Approaches
21.2.3 Functional Visual Rehabilitation
21.3 Neural and Cognitive Consequences of Sensory Loss
21.3.1 Evidence for Robust Plasticity Promoted by Sensory Loss
21.3.2 Principles Guiding Reorganization following Sensory Loss
21.3.3 Plasticity following Sensory Loss across the Lifespan
21.3.4 Neurophysiologic Mechanisms Underlying Plastic Changes in the Blind
21.4 Rehabilitation-Induced Plasticity
21.4.1 Plasticity after SSD Use and Its Theoretical Implications
21.5 Concluding Remarks and Future Directions
References

21.1  INTRODUCTION
We live in a society that is based on vision. Visual information is used for orienting in our envi-
ronment, identifying objects in our surroundings, alerting us to important events that require our
attention, engaging in social interactions, and many more necessary functions so we can efficiently
function in everyday life. Similarly, audition is used for communication and for guiding our atten-
tion to potentially important or even dangerous events (e.g., the sound of a nearing car). Thus, the
loss of any of these modalities decreases the quality of life and represents a severe challenge to
efficient functioning for tens of millions of individuals worldwide (World Health Organization, Fact
Sheet no. 282, May 2009). Furthermore, it has a significant economic impact on society.
It is therefore not surprising that numerous approaches and potential solutions designed to over-
come these difficulties have been put forward to help the sensory-impaired. Although such compen-
sation devices, for example, highly sensitive hearing aids, volume enhancing devices for different
technologies, and medical–technological solutions such as cochlear implants, are much more suc-
cessful for the auditorily impaired, compensation and technological aids for the visually impaired,
the focus of this chapter, are currently much less effective. At this point, the most commonly used
rehabilitation techniques for blindness are sensory aids such as the Braille reading system, mobil-
ity aids such as canes, or more contemporary devices such as obstacle detectors, laser canes, or
ultrasonic echolocating devices. All of these devices derive from the premise that the blind are
deprived of numerous important types of information typically acquired through vision and attempt
to supply such information through other sensory systems. Typically, these attempts utilize the nor-
mal perceptual processing of the system they exploit for communicating the relevant information.
In contrast to this, the new generation of sensory aids takes one step further, as it aims to deliver
pure visual information to the brains of the blind, either by surgically or medically restoring the
missing functionality of the eyes and brain areas typically exploited for visual processing (as is
already done in audition to some extent, mainly for successful perception of auditory single speaker
communication, using cochlear implants; Fallon et al. 2008; Spelman 2006; Geers 2006) or by
“teaching” these regions to take over visual functions after introducing them to visual information
transmitted through nonvisual modalities. The first group of such techniques, neuroprosthetic medical
solutions, is invasive, requiring surgical intervention, and is currently extremely expensive. These
approaches show promising results, but unfortunately only in very restricted populations of the blind and
only to a limited extent. However, once the technological (and neuroscientific, i.e.,
the ability of the brain to make sense of the restored input; see below) obstacles are resolved, these
may hold great future promise for restoring natural vision to many blind individuals, similar to
the enormous progress in the treatment of deafness that has been made since the development of
cochlear implants. Similarly, novel medical approaches for replacing the damaged sensory receptor
cells, via stem cell transplantation (which will be discussed briefly) may be very promising in the
further future, but are currently only at relatively preliminary research stages. The second group
of rehabilitation approaches includes sensory substitution devices (SSDs) that represent noninva-
sive, cheap, and relatively accessible techniques. These devices are specifically designed in order to
deliver visual information to the blind using their remaining and fully functioning sensory modali-
ties, in hope that the brains of such individuals would learn to exploit this information, similar to the
way the sighted use equivalent information transmitted through the visual pathway. Although this
hope may appear counterintuitive or even unrealistic, the most recent SSDs are currently showing
remarkable behavioral outcomes. Such efficiency, combined with their low cost and broad applicability to
different types of sensory loss and ages at onset, makes them highly attractive sensory aids. This is
especially important in blindness, given that 87% of the blind are located in developing countries
and therefore need cheap and widely applicable solutions (World Health Organization, Fact Sheet
no. 282, May 2009).
In order to capture the “magic” of these rehabilitation approaches and illustrate how surprisingly
efficient they might be if proper training is applied, we will begin this chapter by presenting some
of these exciting new solutions and briefly discuss the rehabilitation outcomes currently associ-
ated with them. To better understand the mechanisms mediating such outcomes and appreciate the
remaining challenges that need to be overcome, in the second part of the chapter we provide a more
theoretical illustration of neuroplastic changes associated with the use of these devices. In particu-
lar, we show that these changes are not “magic” nor in any way restricted to the use of the presented
rehabilitation techniques. On the contrary, these techniques are designed in order to exploit and
channel the brain’s natural potential for change. This potential is present in all individuals, but may
become somewhat more accentuated in the brains of the sensory-impaired, as the lack of one sen-
sory modality leaves vast cortical regions free of their typical input and triggers a reorganization of
such cortices and their integration into other brain networks. This reorganization is constrained and
channeled by the individual’s own activity, information available from the environment, as well as
intrinsic properties of the neural system promoting or limiting such changes during different periods
in life. Importantly, such restructuring is crucial for enabling the cognitive changes that also occur
after sensory loss, allowing the sensory-impaired individuals to efficiently function in their environ-
ment. Specifically, successfully dealing with sensory impairment often results in collateral benefits,
which include better differentiation and higher efficiency of nonvisual sensory or other cognitive
functions. Many of the neural and cognitive changes triggered by sensory loss will be reviewed in
the second part of the chapter, illustrating how they rely on the same mechanisms as those underly-
ing the successful outcomes of novel rehabilitation techniques, which will now be presented.

21.2  REHABILITATION FOLLOWING SENSORY LOSS


Sensory loss and blindness in particular decreases quality of life in millions of individuals (e.g., 314
million of individuals are visually impaired worldwide; WHO Report 2009, Fact Sheet no. 282).
Blindness hinders independent navigation in space, reading, recognizing people, and even commu-
nicating with them by restricting nonverbal communication via hand gesturing or facial expressions
such as gaze direction or smiling. Numerous approaches and potential solutions aimed at overcom-
ing these difficulties have been put forward (with various levels of success), offering hope and help
to those suffering from sensory impairment. In the blind, these include reading and mobility aids,
more advanced SSDs, and invasive sensory restoration and neuroprosthetic approaches. In this part
of the chapter we present some of these techniques. The main focus is on SSDs, which are gaining
increased popularity thanks to their noninvasiveness, low cost, and high potential for providing
systematic rehabilitation solutions for all types of blindness. In addition, we will briefly discuss the
potential for medically enabled sensory restoration which, although holding great potential, still
needs to overcome numerous technical and other challenges before becoming truly useful for most
of the blind.

21.2.1  Sensory Substitution Devices


Sensory substitution refers to the transformation of the characteristics of one sensory modality into
the stimuli of another modality. For example, it is possible to present visual information by touch or
audition, audition or vestibular information by touch, etc. In the case of blindness, SSDs represent
a noninvasive rehabilitation approach in which visual information is captured by an external device
such as a video camera and communicated to the blind via a human–machine interface in the form
of auditory or tactile input. Louis Braille (1809–1852), who developed Braille writing, pioneered the
work that paved the way to modern SSDs by substituting visually read letters by a raised dot code,
as an adaptation of Charles Barbier’s night writing code. Charles Barbier originally developed a
tactile 12-raised-dot writing code for the French army, which was deemed too difficult to decipher
and was abandoned by the army. After meeting Barbier and testing his invention, Louis Braille
simplified the code to six raised dots, which made each symbol recognizable without moving the
fingertip (the original 12-dot code required slow, effortful motion to recognize each letter), thus
inventing the tactile Braille code widely used today (Sathian 2000; Sadato 2005). However, Braille
can only work for material transformed offline from printed visual letters to Braille dots, and can-
not be used for online reading of regular letters. In recent years, other reading substitutions have
been developed for online reading such as the Optacon (a print-to-tactual-image device devised for
reading embossed letters; Goldish and Taylor 1974; Linvill and Bliss 1966) and various versions of
dedicated text-to-speech engines (from the Kurzweil reading machine; Kleiner and Kurzweil 1977)
to current talking software such as JAWS (Freedom Scientific, Inc., St. Petersburg, Florida). In
addition to these reading aids, a great deal of effort has been invested in developing devices aimed
at improving the mobility of the blind. The long cane used to mechanically probe for obstacles rep-
resents the simplest, most commonly used device. The rapid adoption of both the Braille system and the
cane by blind users suggests that a simple, low-technology solution, rather than a technologically demanding
one, is at times the most widely used. However, in recent years more advanced
counterparts of the cane have become available, such as electronic travel aids designed to be used
along with the long cane in order to extend the distance for environmental preview and thus increase
the possible speed and efficiency of travel. The Sonic Pathfinder (Heyes 1984) and the Sonicguide
(Kay and Kay 1983) typically scan the environment acoustically (ultrasonically) or optically (laser
light), and transmit spatial information on obstacles and objects in the surroundings via vibrotactile
or auditory signals.
In contrast to devices that are typically designed for a limited purpose and are successful in
substituting for only certain functional aspects of vision, more sophisticated techniques that replace
vision through tactile or auditory information have been developed over the past few decades (see
Figure 21.1a). The first targeted modality for substituting vision was touch, because of the simplicity
and ease of transforming visual into tactile signals that are both characterized by two-dimensional
(2-D) spatial representations (retina in vision and skin surface in touch). Pioneering work in this
field was done in the 1970s by Paul Bach-y-Rita, who devised a tactile display that mapped images
from a video camera to a vibrotactile device worn on the subject’s back. This device (Bach-y-Rita
2004; Bach-y-Rita et al. 1969; Bach-y-Rita and Kercel 2003), dubbed the Tactile Vision Substitution
System, provided tactile transformation of black-and-white images at a resolution of 20 × 20 pixels
and enabled the blind to perform sufficiently well in some visual tasks. However, it was extremely
large and immobile, which motivated the development of smaller, mobile tactile devices placed on
the tongue and forehead (for a review, see Bach-y-Rita 2004) that are also characterized by better
spatial somatosensory resolution. One of these, the Tongue display unit (TDU) (Bach-y-Rita et
al. 1968, 1998), an electrotactile device composed of a 12 × 12 matrix of stimulators (measuring
approximately 3 cm2) placed on the subject’s tongue, provides blind individuals with an initial
“visual” acuity (tested by the Snellen E chart) comparable to 20/860 (Sampaio et al. 2001; the
numerator refers to the distance in feet from which a person can reliably distinguish a pair of
objects, whereas the denominator is the distance from which a person with standard visual acuity
would be able to distinguish them; in North America and most of Europe, legal blindness is defined
as visual acuity of 20/200 or poorer), which might improve after training. Other studies investigat-
ing this device suggest that at least a subgroup of early-onset blind individuals may particularly
benefit from its use (Chebat et al. 2007).
Audition was the second candidate to substitute for vision. The development of auditory-based
devices was triggered by certain limitations of tactile SSDs: their price and the fact that they are
inherently limited by the spatial resolution of touch and relatively lower information content due to
a cap on the number of electrodes. The first auditory SSD device was The vOICe system (Meijer
1992), which currently uses a default resolution of 176 × 64 sampling points. This mobile and inex-
pensive device uses a video camera, which provides the visual input, a small computer running
the conversion program, and stereo headphones that provide the resulting sound patterns to the
user. Given the fact that 87% of the world’s visually impaired live in developing countries (WHO
report 2009 fact sheet 282), the importance of providing solutions that are not just high-resolution,
but also cheap and accessible, cannot be underestimated. To some extent, visual-to-auditory SSDs
fulfill all of these criteria. However, these devices still pose great challenges both to the developers
and the brains of blind individuals using them, as they rely on conversion algorithms that are much
less intuitive than those employed by visual-to-tactile SSDs. For example, in the visual-to-auditory
The vOICe SSD (Meijer 1992), the conversion program transforms visual into auditory information
(‘soundscapes’) based on three simple rules: the vertical axis (i.e., elevation of the object) is repre-
sented by frequency, the horizontal axis by time and stereo panning, and the brightness of the image
is encoded by loudness. Although these conversion rules appear relatively simple, explicit and quite
extensive training is required to learn how to interpret even simple shapes. Similar but not identi-
cal transformations are implemented in two more recently developed auditory SSDs: the Prosthesis
Substituting Vision with Audition (PSVA; Capelle et al. 1998) and SmartSight (Cronly-Dillon et al.
1999, 2000). PSVA uses different tones to provide horizontal location directly, whereas SmartSight
presents the vertical location information in terms of musical notes. PSVA can break down the
“visual sound” into components of vertically and horizontally oriented edges. Additionally, PSVA
applies a magnification to the center of the image to simulate the better resolution (magnification
factor) of the human fovea.
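The conversion rules described above lend themselves to a compact implementation. The Python sketch below illustrates a vOICe-style mapping, mono for simplicity (the published device also pans the stereo image from left to right): image columns are scanned over time, each pixel row is assigned a sine-tone frequency (top rows high, bottom rows low), and pixel brightness sets that tone's amplitude. The scan duration, frequency range, and sample rate are arbitrary illustrative choices and do not reproduce The vOICe's actual parameters or the PSVA and SmartSight algorithms.

```python
import numpy as np

def image_to_soundscape(image, scan_seconds=1.0, sample_rate=22050,
                        f_min=500.0, f_max=5000.0):
    """Convert a 2-D grayscale image (values 0-1, row 0 at the top) into a
    mono 'soundscape': columns are played left to right over `scan_seconds`,
    each row is mapped to a sine-tone frequency (top = high, bottom = low),
    and brightness controls that tone's amplitude."""
    n_rows, n_cols = image.shape
    samples_per_col = int(scan_seconds * sample_rate / n_cols)
    t = np.arange(samples_per_col) / sample_rate
    freqs = np.linspace(f_max, f_min, n_rows)   # top rows -> high pitch
    columns = []
    for c in range(n_cols):
        tones = image[:, c, None] * np.sin(2 * np.pi * freqs[:, None] * t)
        columns.append(tones.sum(axis=0))
    sound = np.concatenate(columns)
    peak = np.abs(sound).max()
    return sound / peak if peak > 0 else sound

# Example: a bright anti-diagonal line on a dark background is heard as a
# tone sweeping upward in pitch as the scan proceeds from left to right.
img = np.zeros((64, 64))
np.fill_diagonal(np.fliplr(img), 1.0)
soundscape = image_to_soundscape(img)
print(soundscape.shape)  # a single scan_seconds-long mono waveform
```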

[Figure 21.1 panels: (a) SSD schematic; (b) tactile and SSD objects (n = 7; p = 0.005 and p = 0.05, corrected; LOtv); (c) visual and haptic objects (IPS, LOtv).]

FIGURE 21.1  Sensory substitution devices: general concept of sensory substitution (SSD) and use of SSDs
in studying brain plasticity, perception, and multisensory integration. (a) SSDs typically include a visual cap-
turing device (e.g., camera glasses), a computational device transforming visual input into either a tactile or
auditory display using a simple known transformation algorithm, and an output device, then transmitting this
information to user. Right: example of an auditory SSD (e.g., The vOICe; Meijer 1992) transmitting sensory-
transformed information using headphones. Left: example of a tactile device that can transmit tactile informa-
tion via an electrode array targeting the tongue (e.g., TDU; Bach-y-Rita et al. 1998) or another skin surface,
in this case placed on the neck. (With kind permission from Springer Science+Business Media: Multisensory
Object Perception in the Primate Brain, part 4, 2010, 351–380, Bubic, A. et al., figure number 18.2.) (b) A conjunction analysis for shape perception across modalities and experimental conditions in a group of seven expert users of The vOICe SSD (five sighted, one late blind, and one congenitally blind). The analysis tested for common areas of activation between object recognition using soundscapes (i.e., using The vOICe SSD to extract shape information) and by touch, but not by typical sounds made by objects (which do not convey shape information) or by corresponding sensory controls. The contrast (random-effects GLM, corrected for multiple comparisons) showed bilateral LO activation with weaker responses in the right hemisphere, indicating that the lateral occipital complex (LOC) region is a multimodal operator for shape.
Amedi, A. et al., Nat Neurosci, 10, 687–689, 2007.) (c) Object-related regions in visual and haptic modalities
shown on an inflated right hemisphere (top: lateral view; bottom: ventral view). Visual object selectivity is
relative to scrambled visual images; haptic object selectivity is relative to haptic textures. Visuo-haptic object
selectivity in LOC is found within the lateral occipito-temporal sulcus (delineating LOtv), similar to the location of the multimodal object-related area shown in panel (b). (Modified and adapted from Amedi, A. et al., Nat Neurosci, 4, 324–330, 2001; and Lacey, S. et al., Brain Topogr, 21, 269–274, 2009.)
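As an aside on the analysis itself: a conjunction across contrasts, as in panel (b), is commonly approximated with a minimum-statistic approach, in which a voxel counts as jointly active only if it survives the threshold in every contrast. The following sketch, using synthetic t-maps and an arbitrary threshold, is a simplified illustration of that logic rather than the exact pipeline used in the cited study.

import numpy as np

def conjunction(stat_maps, threshold):
    """Minimum-statistic conjunction: a voxel 'survives' only if its statistic
    exceeds the threshold in every input map."""
    stacked = np.stack(stat_maps)          # (n_contrasts, x, y, z)
    min_map = stacked.min(axis=0)          # worst case across contrasts
    return min_map, min_map > threshold

# Synthetic example: two t-maps sharing one genuinely common cluster.
rng = np.random.default_rng(0)
shape = (10, 10, 10)
t_soundscape_shape = rng.normal(size=shape)
t_touch_shape = rng.normal(size=shape)
t_soundscape_shape[4:6, 4:6, 4:6] += 6.0   # shared "LOtv-like" cluster
t_touch_shape[4:6, 4:6, 4:6] += 6.0

min_map, mask = conjunction([t_soundscape_shape, t_touch_shape], threshold=3.1)
print("voxels active in both contrasts:", int(mask.sum()))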
Although extremely different, both auditory and tactile SSDs can potentially be very useful for
the blind. Recent tests show that blind and/or blindfolded sighted individuals can, especially after
training or prolonged use of the device (Poirier et al. 2006b), learn to interpret the transmitted infor-
mation and use it in simple visual discrimination and recognition (Arno et al. 1999, 2001; Sampaio
et al. 2001; Poirier et al. 2006b) as well as in more complex tasks in which acquiring knowledge of the
spatial locations of objects (Auvray et al. 2007; Proulx et al. 2008) or constructing mental images
of more complex environments (Cronly-Dillon et al. 2000) is required. More anecdotal reports that have not yet been explored in formal research suggest that, following extended use of such devices, behavioral abilities using SSDs may be even more promising, as the devices can be used to identify facial expressions and read simple words (Amedi, Striem-Amit, and Reich, unpublished observation; see, e.g., http://brain.huji.ac.il/press.asp) as well as to orient and navigate in everyday life (see, e.g., the reports of a late-onset blind individual about her experiences with The vOICe SSD at http://www.seeingwithsound.com/users.htm).
Although the sensory-transformed information may partly occupy an available sensory chan-
nel or at least add to its attentional load (e.g., provide vOICe information in addition to naturally
occurring environmental sounds), after training such percepts should not significantly interfere with
normal sensory perception. However, it needs to be emphasized that training is crucial in obtain-
ing optimal results in this regard, as initial usage of SSDs may be confusing and overwhelming for
the sensory impaired. Because of the multisensory nature of perception, the human brain can be
expected to successfully process these percepts in a parallel manner, similarly to processing several
types of visual parameters, allocating attention to the most relevant visual feature at the time, and
similarly to perceiving an auditory conversation above other environmental noises. Naturally, however, if an individual uses the SSD at very high volume or if the environmental sounds are near perceptual threshold, SSD use might impose a significant cost on the intact sensory channel.
Future studies on dividing attention between SSD input and the natural sensory input are needed to
fully assess the possible interference in such cases.
Overall, although there is still a great deal of work to be done in this area, initial experi-
ences with SSDs show more than promising results. These devices truly offer new hope for the
sensory-impaired in a somewhat nonintuitive, but “brain-friendly” manner, as they use normal
neural resources and functioning modalities for transmitting previously unavailable informa-
tion. Furthermore, in order to fully appreciate the value of sensory substitution, it might be use-
ful to imagine how exciting it would be to have infrared vision or hear ultrasound frequencies.
Interestingly, future, second-generation SSDs might just make these “superhuman” abilities pos-
sible: just as visual information can be transmitted and used by the blind through their functioning
auditory or tactile modality, so could infrared or ultrasound frequencies be perceived by anyone
using functioning vision or audition. For the blind, the efficient use of vision or visual information
transferred via such SSDs represents exactly this type of ability or an even greater accomplish-
ment, as they need to function in an environment optimally designed for the majority of the popu-
lation, that is, the sighted.

21.2.2  Sensory Restoration Approaches


Restoration of sensory input to the visually impaired represents an alternative to SSDs. Although
these approaches are perhaps more attractive than SSDs, as they provide a sense of “real” vision
(as compared to providing only visual information converted into the other senses), and are at the
cutting edge of medical, technological, and scientific advances, they are unfortunately only at the
relatively preliminary stages of research and development. Conventional sight restoration includes
surgical removal of cataracts and treatment or surgical solutions to vision loss caused by glaucoma.
Although highly practical and yielding remarkable results, especially after extensive training fol-
lowing the surgery (Ostrovsky et al. 2006, 2009; Sinha 2003), these solutions were originally only
applicable to specific causes and stages of vision loss. Sight restoration in blindness due to other
etiologies, such as congenital or late-onset degeneration [e.g., age-related macular degeneration (ARMD)] of the retina or optic tract, is only now being addressed.
Sight restoration approaches can be grossly divided into two major types: biological methods, such as cell transplantation therapy, which aim to replace the deteriorating retinal cells with healthy ones, and electrical sensory restoration solutions, which try to create brain–machine interfaces by, for example, replacing a damaged retina (using visual prostheses, similar to the cochlear prostheses used in cases of deafness). Although we do not go into detail on the biological
methods and research directions, we recommend several good review articles on the matter (to
name some from recent years, Cronin et al. 2007; MacLaren and Pearson 2007; Lamba et al. 2008,
2009; Bull and Martin 2009; Locker et al. 2009; West et al. 2009) for a more thorough examination
of these approaches. In short, cell replacement strategies for retinal repair can be done by implant-
ing external stem cells (cell transplantation therapy), from various origins (e.g., neural or retinal
precursor cells) and differentiation/developmental stages (from multipotent progenitor cells to post-
mitotic photoreceptor progenitor cells), to compensate for retinal degeneration. These approaches
receive a great deal of attention, as they are most likely to generate high-resolution natural vision
in the future. However, these approaches may still require many years of development until they
are fully applicable, and even then, might suffer the same disadvantages as the visual prostheses
methods, which will be detailed below.
With regard to electrical sensory restoration efforts, the development of visual prostheses was
motivated by early studies in which visual percepts (phosphenes, visual light dots or patterns) were
successfully generated by electrical stimulation of the visual cortex (Penfield and Rasmussen 1950).
The idea of channeling these findings into clinical applications was suggested years ago by a handful
of researchers (Haddock and Berlin 1950; Newman et al. 1987), but their ideas are only now being
pursued as shown by the extensive development of visual prostheses. Today, different approaches
in which visual information is recorded by external (or implanted) devices and transmitted to the
sensory tract or secondary processing cells in the retina, ganglion cells, thalamus, or the visual cor-
tex, are being studied or tested in clinical trials (for several recent reviews of current technologies
and the remaining challenges, see Dagnelie 2008; Dowling 2008; Merabet et al. 2005; Rizzo et al.
2007; Weiland et al. 2005).
There are four main types of approaches to electrical sensory restoration, targeting the retina,
optic nerve, lateral geniculate nucleus (LGN), and the visual cortex. The retinal approach is designed
to stimulate secondary neurons in the inner retina by an electrode array placed on the inner retinal
surface or inserted under the retina (for a description of the different groups and devices devel-
oped in recent years, see Dowling 2008). Such an approach is mainly useful in cases of retinitis
pigmentosa and ARMD, which cause selective degeneration of the photoreceptor layer of the outer
retina. In this case, the information sent to the visual cortex can still be transmitted over minimally
damaged retinal ganglion cells. Optic nerve approaches (Brelen et al. 2005; Veraart et al. 2003;
Delbeke et al. 2002) use two forms of stimulation: the simultaneous activation of many optic nerve
fibers through cuff electrodes, and more focused stimulation of small groups of fibers with pen-
etrating microelectrodes. Future thalamic prostheses (Pezaris and Reid 2005, 2009) will stimulate
a later station in the visual pathways, that is, the LGN, but they are currently only at the stage of preliminary methodological research in nonhuman primates. The cortical approach (Troyk et al. 2003;
Fernandez et al. 2002; Schmidt et al. 1996) places electrodes over the central visual field projection
in primary visual cortex. Typically, this is accomplished using surface (or penetrating) electrodes
that may provide relatively good stability of tissue stimulation, but are difficult to position in the
optimal location based on the known retinotopic mapping of V1. However, this approach can be applied in most cases of blindness (apart from cortical blindness, a relatively rare cause), including conditions affecting the retina (e.g., glaucoma and diabetic retinopathy) that may not benefit from a retinal prosthesis. Devices based on these approaches have so far shown some prom-
ising results, as experienced blind users can, to some extent, utilize visual phosphenes generated by
some of these devices in order to experience meaningful visual percepts, detect motion (Weiland
and Humayun 2008), or identify very simple patterns, shapes and even letters (Brelen et al. 2005;
Dobelle 2000; Weiland and Humayun 2008). However, there are still several major issues currently
preventing these electrical and biological approaches from becoming true clinical solutions. First
of all, their invasive nature makes them prone to risks related to surgical procedures, such as inflammation, hemorrhage, increased patient mortality, and focal seizures induced by direct cortical stimulation in the case of visual prostheses, and to risks of immune rejection of the implanted cells in the case of cell transplantation solutions.
Moreover, retinal prostheses (and retinal molecular approaches such as cell transplantation ther-
apy, detailed above), which currently appear more promising as future solutions for blindness, are
not applicable to all populations of the blind, as they require the existence of residual functional
retinal ganglion cells. Additionally, these techniques are expensive, making them unavailable to the
majority of the blind, who reside in developing countries. In addition to these drawbacks, visual
prostheses have severe technical limitations including relatively low resolution, narrow field of view,
and the need for complicated image processing algorithms compensating for the visual processing
taking place in the retina itself. Functionally, these devices typically do not take advantage of eye
movements (an exception to this is the system developed by Palanker and colleagues; Palanker et al. 2005), and
require large and slow head movements to scan entire visual patterns (Brelen et al. 2005; Veraart
et al. 2003; Chen et al. 2007). Therefore, visual prostheses (which are not yet available except in
preliminary clinical trials) do not yet provide sight that resembles natural vision, and a key milestone in this field, namely, generating truly useful and functional vision at an affordable cost, has yet to be reached. Finally, just like cochlear implants (or even more so), visual prostheses require
extensive training in order to achieve reasonable performance even for very simple stimuli. This
will be discussed in the next section. If, however, visual prosthesis research, and even more so bio-
logical methods replacing the actual retinal cells, can overcome these obstacles, these approaches
could provide a real visual experience and not just the “visual” information or orientation provided
by SSDs.

21.2.3  Functional Visual Rehabilitation


Although further developing and improving rehabilitation techniques is still an enormous techno-
logical challenge, sensory restoration efforts may require more than simply transmitting the visual
information (either via other modalities as in SSDs or by providing vision through the natural visual
system) to the brain. In a sense, when first introduced to the brain of a congenitally blind individual,
the visual information is meaningless, as that individual lacks any previous experience against which such information could be interpreted. Furthermore, the brains of such individuals may lack a functioning visual system needed for interpreting the newly introduced information and giving it functional meaning. Even in the case of the noncongenitally blind, who have had some previous visual experience,
one cannot expect that reintroducing visual information to their brains would automatically result
in fully sophisticated visual perception, since their “visual” brain regions may now be integrated
into other, nonvisual brain networks. Although this is somewhat counterintuitive, evidence for this
claim can be found in the relatively successful rehabilitation of deaf and hearing-impaired indi-
viduals using cochlear implants (Spelman 2006). Cochlear implants work because patients learn to associate sounds with their sources and meanings, a process that requires explicit teaching. Moreover, such rehabilitation is accompanied and enabled by cor-
responding plasticity in the auditory cortex (Kral et al. 2006), which is now required to respond to
the newly delivered input. Similarly, two case studies of surgical sight restoration after long-term
visual deprivation (Gregory and Wallace 1963; Fine et al. 2003) suggest that pure restoration of
the lost sensory input may also not suffice in the case of vision. The patients in both of these studies
showed profound difficulty in recognizing objects, even after a long period of sight and visual train-
ing. This indicates that allowing the visual information to enter the brain via a functional retina
does not guarantee or enable full or natural visual perception. This can only be accomplished if the
surgical procedure is coupled with specific additional rehabilitation strategies that modulate brain
processing, enabling it to extract relevant and functionally meaningful information from neuropros-
thetic inputs, which should gradually lead to the restoration or development of visual functions. Thus, in
contrast to the encouraging behavioral outcomes of some cochlear implant patients, it is illusory
to expect that such successful sensory restoration can easily generalize to different subpopulations
of the sensory impaired, such as the visually impaired. More research and development of behavioral
rehabilitation may be needed to achieve functional sensory ability in those who once suffered from
sensory loss. To fulfill this goal, we will have to overcome more than just surgical or technical chal-
lenges that will enable safer medical procedures or more advanced sensory substitution algorithms.
Although necessary, such advancements will have to be complemented by knowledge pertaining to
brain mechanisms and cognitive functions we want to change or develop using the available reha-
bilitation techniques. Thus, achieving full sensory restoration will only be possible if we take into
account the specificities of cognitive and neural functioning of the sensory impaired, a topic that
will be presented in the next part of the chapter.

21.3  NEURAL AND COGNITIVE CONSEQUENCES OF SENSORY LOSS


When attempting to understand the minds and brains of individuals who have lost one or more
sensory modalities, it is worth starting by considering factors that shape the minds and brains of
those whose development and capabilities are considered normal. Regardless of our individual abili-
ties, we are all equipped with nervous systems whose development is constrained by our genetic
dispositions, but can be channeled in different directions depending on environmental factors and
specific individual experiences and activities. The interaction of all of these factors defines our neu-
ral and cognitive functioning. Thus, a substantial reorganization within the nervous system follow-
ing the loss of a sensory function, regardless of when, why, and in which modality it occurs, is not
merely a physiological or “brain” phenomenon that can be understood without taking into account
the cognitive challenges and demands imposed by nonstandard sensory input. Rather, in order to
achieve the same functional level in their everyday life, those who suffer from sensory loss need to
develop strategies that enable them to extract information relevant for achieving their goals from
alternative sources typically ignored by the majority of the population. Such adjustments are medi-
ated through sufficient restructuring in other sensory or higher-order cognitive functions. Thus,
different cognitive demands lead to different individual experiences and activities, which in turn
promote a certain pattern of plastic reorganization within the nervous system that is additionally
constrained by genetic and purely physiological factors. Therefore, brain reorganization after sen-
sory loss needs to be considered as a neurocognitive phenomenon that strongly reflects the brain’s
intrinsic potential for change as well as altered cognitive demands aimed at compensating for the
missing sensory information, both of which are crucial to rehabilitation efforts.
In addition, it is important to keep in mind that various subpopulations of individuals suffering
from sensory loss, differing in its etiology or onset, differ in their potential for plasticity as well as in the cognitive resources that can be exploited for dealing with the loss.
The early onset of sensory loss encountered in congenital conditions triggers the most dramatic cases
of plasticity and enables drastic brain restructuring that compensates for the deficits, generating a
remarkably different functional network than the one seen in healthy individuals or individuals who
sustained brain or peripheral injuries later in life. Congenital blindness and deafness affect large por-
tions of the brain, especially when resulting from peripheral damage (i.e., to the retina, cochlea, or the
sensory tracts), which does not injure the brain itself but instead deprives parts of the brain of their natural input, leaving them essentially unemployed. More than 20% of the cerebral cortex is devoted to
analyzing visual information, and a similar portion is devoted to auditory information (but note that
some of these cortical areas overlap to some extent as some recent reports suggest: Beauchamp et al.
2004; Calvert 2001; Cappe and Barone 2005; Clavagnier et al. 2004; Schroeder and Foxe 2005; van
Atteveldt et al. 2004). Despite the lack of visual or auditory input, the visual and auditory cortices
of the blind and deaf do not degenerate. Rather, they undergo extensive plasticity resulting in sig-
nificantly changed neural responsiveness as well as functional involvement in nonvisual/nonauditory
cognitive functions. Significant, although typically less extensive, plastic changes may also occur in
populations suffering from noncongenital sensory loss. This neuroplasticity is evident both in atypi-
cal brain activation in the blind when compared with that of the sighted, as well as in behavioral
manifestations, for example, sensory hyperacuity and specific cognitive skills.

21.3.1  Evidence for Robust Plasticity Promoted by Sensory Loss


The first set of evidence for the extensive reorganization undergone by the brains of the congeni-
tally blind and deaf can be found in the reports of enhanced sensory and cognitive abilities of
such individuals that compensate for their sensory deficits. For example, blind individuals need to
compensate for the lack of vision, a modality that normally allows one to “know what is where by
looking” (Marr 1982) and is ideal for providing concurrent information about relations of planes
and surfaces to each other, drawing attention to relevant external cues and greatly facilitating spatial
coding (Millar 1981). Although the blind cannot acquire information needed for object localization
and recognition by looking, they still require this information in order to, for example, navigate
through space or find and recognize the objects around them. Therefore, they have to acquire this
information through alternative sensory or other strategies. For example, as early as the epoch of the Mishna (about 350 c.e.), it was known that blind individuals possess superior memory abilities compared to the sighted (“The traditions cited by Rabbi Sheshet are not subject to doubt as he is a
blind man.” Talmud Yerushalmi, tractate Shabbat 6b), which enable them to remember the exact
location and identity of stationary objects and the sequence of steps required to complete paths
(Raz et al. 2007; Noordzij et al. 2006; Vanlierde and Wanet-Defalque 2004). Such phenomenal
memory of the blind was also demonstrated in modern scientific studies (Tillman and Bashaw
1968; Smits and Mommers 1976; Pozar 1982; Pring 1988; Hull and Mason 1995; Röder et al. 2001;
D’Angiulli and Waraich 2002; Amedi et al. 2003; Raz et al. 2007). Similarly, it has been shown
that the blind have superior tactile and auditory perception abilities: for instance, they are able to
better discriminate fine tactile patterns or auditory spatial locations than the sighted, and even to
better identify smells (Murphy and Cain 1986; Röder et al. 1999; Grant et al. 2000; Goldreich and
Kanics 2003, 2006; Hugdahl et al. 2004; Wakefield et al. 2004; Doucet et al. 2005; Smith et al.
2005; Collignon et al. 2006). Similarly, deaf individuals show improved visual abilities on certain
tasks (Bavelier et al. 2006), which indicates that the remaining modalities compensate for the miss-
ing one, a phenomenon termed hypercompensation (Zwiers et al. 2001) or cross-modal compensa-
tory plasticity (Rauschecker 2000). However, the blind (or the deaf) do not always perform better
on such tasks (Zwiers et al. 2001), suggesting that optimal development of some aspects of sensory
processing in the unaffected modalities may depend on, or at least benefit from, concurrent visual
(auditory) input. Furthermore, when comparing different populations of the blind, it becomes clear
that the identified benefits in some auditory and tactile tasks depend to a great extent on the age
at sight loss. Specifically, these advantages are often, but not always, limited to the congenitally
and early blind, whereas the performance of the late blinded tends to resemble that of the sighted
(Fine 2008), reflecting differences in the potential for neuroplastic reorganization and the amount of
visual experience between these populations. However, there is also evidence indicating that com-
pensatory benefits also occur in the late blind, in which case they may be mediated by different neu-
rophysiological mechanisms (Fieger et al. 2006; Voss et al. 2004), as detailed in the next sections.
Importantly, although prolonged experience with a reduced number of available sensory modalities
leads to such benefits, these do not appear automatically. For example, it has been shown that blind children have significant difficulties with some tasks, especially those requiring reference to external cues or understanding of directions and spatial relations between objects. Such tasks are challenging
for the blind, as they have compromised spatial representations and rely mostly on self-reference
and movement sequences (Millar 1981; Noordzij et al. 2006; Vanlierde and Wanet-Defalque 2004).
Consequently, the blind have problems recognizing potentially useful information needed to per-
form the mentioned tasks and lack the benefits that could arise from simultaneously available vision.
For example, concurrent visual input could facilitate recognition and learning of helpful auditory
or somatosensory features given that the existence of redundant or overlapping information from
more than one modality is generally associated with guiding attention and enhanced learning of
amodal stimulus features (Lickliter and Bahrick 2004). Nevertheless, such recognition of useful
cues or calibration of auditory and tactile space is eventually possible even in the absence of vision,
as it may be achieved using different cues, for example, those stemming from self-motion (Ashmead
et al. 1989, 1998). Importantly, although it may require relatively long training to reach a stage in
which the missing sensory input is replaced and compensated for by equivalent information from
other modalities, spatial representations that are finally generated on the basis of haptic and auditory
input of the blind seem to be equivalent to the visually based ones in the sighted (Röder and Rösler
1998; Vanlierde and Wanet-Defalque 2004). Overall, the findings indicate that the blind, once they
learn to deal with the available sensory modalities, can show comparable or superior performance
in many tasks when compared to the sighted. This advantage can even be compromised by the pres-
ence of visual information, as indicated by inferior performance of the partially blind (Lessard et
al. 1998). Thus, the available evidence tends to counter the notion that sensory loss leads to general
maladjustment and dysfunction outside the missing modality. Quite the contrary, this
general-loss hypothesis should be abandoned in favor of the alternative, compensatory hypothesis
suggesting that sensory loss leads to the superior development of the remaining senses (Pascual-
Leone et al. 2005).
In the past decades, neural correlates of reported impairment-induced changes in cognitive func-
tions and strategies have been thoroughly studied, providing a wealth of information regarding the
brain’s abilities to change. Studies investigating neural processing of congenitally blind and deaf
individuals, as well as more invasive animal models of these conditions, show that the brain is
capable of robust plasticity reflected in profoundly modified functioning of entire brain networks.
Important evidence pertaining to the altered cognitive processing and the functional status of the
occipital cortex in the blind stems from electrophysiological studies that investigated nonvisual
sensory functions of the blind. These yielded results showing shorter latencies for event-related
potentials (ERPs) in auditory and somatosensory tasks in the blind compared with the sighted, sug-
gesting more efficient processing in these tasks in this population (Niemeyer and Starlinger 1981;
Röder et al. 2000). Furthermore, different topographies of the elicited ERP components in the
sighted and the blind provided first indications of reorganized processing in the blind, such as to
include the engagement of their occipital cortex in nonvisual tasks (Kujala et al. 1992; Leclerc et al.
2000; Rösler et al. 1993; Uhl et al. 1991). Functional neuroimaging studies have corroborated and
extended these findings by showing functional engagement of the occipital lobe (visual cortex) of
congenitally blind individuals in perception in other modalities (i.e., audition and touch; Gougoux
et al. 2005; Kujala et al. 2005; Sathian 2005; Stilla et al. 2008; for a recent review of these findings,
see Noppeney 2007), tactile Braille reading (Büchel et al. 1998; Burton et al. 2002; Gizewski et al.
2003; Sadato et al. 1996, 1998), verbal processing (Burton et al. 2002, 2003; Ofan and Zohary 2006;
Röder et al. 2002), and memory tasks (Amedi et al. 2003; Raz et al. 2005). Importantly, the reported
activations reflect functionally relevant contributions to these tasks, as indicated by studies in which
processing within the occipital cortex was transiently disrupted using transcranial magnetic stimu-
lation (TMS) during auditory (Collignon et al. 2007), tactile processing including Braille reading
(Cohen et al. 1997; Merabet et al. 2004) as well as linguistic functions (Amedi et al. 2004). Akin
to the findings in the blind, it has been shown that the auditory cortex of the congenitally deaf is
activated by visual stimuli (Finney et al. 2001), particularly varieties of visual movement (Campbell
and MacSweeney 2004).
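For readers less familiar with the ERP latency comparisons mentioned above, such measures boil down to averaging single-trial epochs and locating the peak of a component within a search window. The following is a minimal sketch on synthetic single-channel data; the window, noise level, and peak times are arbitrary choices, intended only to illustrate how a shorter latency would show up, not to reproduce any cited result.

import numpy as np

def erp_peak_latency(epochs, times, window=(0.05, 0.20)):
    """Average single-trial epochs into an ERP and return the latency (in s)
    of the largest absolute deflection within a search window.
    epochs: (n_trials, n_times); times: (n_times,) in seconds."""
    erp = epochs.mean(axis=0)
    idx = np.flatnonzero((times >= window[0]) & (times <= window[1]))
    peak = idx[np.argmax(np.abs(erp[idx]))]
    return times[peak]

# Synthetic example: an earlier component peak yields a shorter latency.
rng = np.random.default_rng(1)
times = np.linspace(-0.1, 0.4, 500)

def make_group(peak_time, n_trials=40):
    component = np.exp(-((times - peak_time) ** 2) / (2 * 0.01 ** 2))
    return component + rng.normal(scale=0.5, size=(n_trials, times.size))

latency_a = erp_peak_latency(make_group(0.10), times)   # "earlier" group
latency_b = erp_peak_latency(make_group(0.13), times)   # "later" group
print(latency_a, latency_b)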
It is important to realize that involvement of unisensory brain regions in cross-modal perception
is not only limited to individuals with sensory impairments, but can under certain circumstances
also be identified in the majority of the population (Sathian et al. 1997; Zangaladze et al. 1999;
Amedi et al. 2001, 2005b; Merabet et al. 2004; Sathian 2005), consistent with reports in experi-
mental animals of nonvisual inputs into visual cortex and nonauditory inputs into auditory cortex
(Falchier et al. 2002; Rockland and Ojima 2003; Schroeder et al. 2003; Lakatos et al. 2007). In the
blind and deaf this involvement is much stronger, because sensory areas deprived of their custom-
ary sensory input become functionally reintegrated into different circuits, which lead to profound
changes in the affected modality and the system as a whole.

21.3.2  Principles Guiding Reorganization following Sensory Loss


Earlier in this section we listed experimental findings stemming from behavioral, electrophysiologi-
cal, imaging, and TMS studies illustrating such changes. We will now try to offer a systematiza-
tion of such changes as it may be helpful for understanding the extent and main principles guiding
reorganization following sensory loss (for similar attempts of systematization, see, e.g., Röder and
Rösler 2004; Grafman 2000; Rauschecker 2008). These include intramodal, multimodal, cross-modal (intermodal), and supramodal changes: respectively, plastic changes occurring within the cortices normally serving the unaffected modalities; changes in multisensory regions; the involvement of typically visual areas in processing tactile and auditory information in the blind (or of typically auditory areas in processing visual information in the deaf); and global, whole-brain changes involving more than the unisensory and multisensory networks. Although some-
what autonomous, these different types of changes are in reality strongly interdependent and cannot
be separated on the level of either cognitive or neural processing.
Intramodal plasticity refers to the changes occurring within one sensory modality as a conse-
quence of altered, either increased or decreased, use of that sensory modality. These changes are
reflected, for example, in the superior performance of the blind on auditory or tactile tasks. Studies
investigating the neural foundations of this phenomenon indicate a high degree of reorganization of
sensory maps in different modalities following local peripheral damage, extensive training, or per-
ceptual learning (Kaas 2000; Buonomano and Johnson 2009; Recanzone et al. 1992). This reorga-
nization includes a coordinated shrinkage of maps representing the unused areas and expansion of
those representing the modality/limb experiencing increased use (Rauschecker 2008) and is deter-
mined by the amount of stimulation and structure of the input pattern within which competition
between the inputs plays an important role (Buonomano and Johnson 2009).
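One way to build intuition for such use-dependent map reorganization is a toy self-organizing (Kohonen-style) map, in which inputs that arrive more often end up claiming more cortical units. This is a cartoon under simplifying assumptions (a one-dimensional map, artificial input segments, arbitrary learning parameters), not a model of any specific experiment cited here.

import numpy as np

def simulate_cortical_map(use_probabilities, n_units=100, n_steps=20000,
                          lr=0.1, sigma_start=10.0, sigma_end=1.0, seed=0):
    """Toy 1-D self-organizing map. Each cortical unit on a 1-D sheet learns a
    preferred input value in [0, 1]; inputs are drawn from equal-width segments
    of that range with the given probabilities. Frequently stimulated segments
    claim more units (map expansion); rarely stimulated segments shrink."""
    rng = np.random.default_rng(seed)
    prefs = np.sort(rng.uniform(0.0, 1.0, n_units))     # initial, roughly ordered map
    lattice = np.arange(n_units)
    n_seg = len(use_probabilities)
    edges = np.linspace(0.0, 1.0, n_seg + 1)
    for step in range(n_steps):
        sigma = sigma_start * (sigma_end / sigma_start) ** (step / n_steps)
        seg = rng.choice(n_seg, p=use_probabilities)     # pick an input channel
        x = rng.uniform(edges[seg], edges[seg + 1])      # stimulus from that channel
        winner = np.argmin(np.abs(prefs - x))            # competition: best-matching unit
        neighborhood = np.exp(-0.5 * ((lattice - winner) / sigma) ** 2)
        prefs += lr * neighborhood * (x - prefs)         # winner and neighbors move
    return np.histogram(prefs, bins=edges)[0]            # units claimed per segment

# Equal use of four input channels vs. heavy over-use of the first one.
print(simulate_cortical_map([0.25, 0.25, 0.25, 0.25]))   # roughly even territories
print(simulate_cortical_map([0.70, 0.10, 0.10, 0.10]))   # first territory expands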
Multimodal or multisensory plasticity refers to the reorganization of multisensory areas after
sensory loss that arises from the impairment of one and compensatory hyperdevelopment of the
remaining sensory modalities. This altered structure of sensory inputs leads to changes in mul-
tisensory areas, the development of which is shaped by the convergence of incoming input from
unisensory systems (Wallace 2004a). For example, studies investigating the multisensory anterior
ectosylvian cortex in congenitally blind cats indicate an expansion of auditory and somatosensory
fields into the area usually housing visual neurons (Rauschecker and Korte 1993) as well as sharp-
ened spatial filtering characteristics (Korte and Rauschecker 1993) following blindness. These
changes underlie the improved spatial abilities of these animals and may also be crucially important
for the development of cross-modal plasticity.
Cross-modal (intermodal) plasticity refers to the reassignment of a particular sensory function
to another sensory modality, as reflected in, for example, engagement of the visual cortex in pro-
cessing auditory information. Numerous invasive studies in animals have shown the vast potential
for such reorganization, reflecting the fact that most aspects of structure and function of a given
brain area are determined by its inputs, not geographic location. For example, it has been shown
that typically auditory areas can, after being exposed to visual input through rerouted thalamic
fibers normally reaching primary visual areas, develop orientation-sensitive cells with the pattern
of connectivity resembling one typically found in the normally developed visual cortex (Sharma et
al. 2000) and fulfill the visual functionality of the rewired projections (von Melchner et al. 2000).
Similarly, tissue transplanted from the visual into the somatosensory cortex acquires functional
properties of its “host” and does not hold on to its genetic predisposition (Schlaggar and O’Leary
1991). This implies that the cross-modal plasticity observed in the blind is most probably subserved
by altered connectivity patterns, as will be further discussed in the next section.
Supramodal plasticity refers to changes encompassing areas and brain functions that are typi-
cally considered nonsensory. Evidence for such plasticity has been revealed in studies showing
involvement of the occipital cortex in memory or language (verb generation or semantic judgments)
processing in the blind (Amedi et al. 2003, 2004; Burton et al. 2002, 2003; Ofan and Zohary 2006;
Raz et al. 2005; Röder et al. 2000, 2002). This type of plasticity is comparable to cross-modal
plasticity and is enabled by altered connectivity patterns between the visual cortex and other supra-
modal brain regions.
When describing and systematizing different types of plastic changes, we want to once again
emphasize that these are not mutually independent. They often occur in synchrony and it may
occasionally be difficult to categorize a certain type of change as belonging to one of the suggested
types. Furthermore, all types of large-scale plasticity depend on or reflect anatomical and functional
changes in neural networks and may therefore rely on similar neurophysiological mechanisms.
Before describing these mechanisms in more detail and illustrating how they could underlie differ-
ent types of plastic changes, we will present another important element that needs to be considered
with respect to compensating for sensory impairments. Specifically, we will now focus on the fact
that all of the mentioned changes in neural networks show large variability between individuals,
resulting in corresponding variability in compensatory cognitive and behavioral skills. It is impor-
tant to consider some of the main sources of this variability, not just so that we can better understand
the reorganization following sensory loss in different populations of the blind, but also because this
variability has important implications for the potential for successful rehabilitation.

21.3.3  Plasticity following Sensory Loss across the Lifespan


When discussing different types of neuroplastic changes and possible mechanisms underlying them,
it is important to emphasize that all of these vary significantly depending on the age at onset of
blindness, as was briefly discussed in previous sections. These differences reflect several factors: the
brain’s potential to change at different periods of development, the amount of experience with visual
or auditory processing before sensory loss, and the amount of practice with the remaining senses
or some special materials, for example, Braille letters. The most important of these factors reflects
the fact that the general potential for any form of plastic changes varies enormously across the
lifespan. Although the brain retains some ability to change throughout life, it is generally believed
and experimentally corroborated that the nervous system is most plastic during its normal develop-
ment as well as following brain injury. The developing brain is a highly dynamic system that under-
goes several distinct phases from cell formation to the rapid growth and subsequent elimination of
unused synapses before finally entering into a more stable phase following puberty (Chechik et al.
1998). The functional assignment of individual brain regions occurring during this time is crucially
dependent on synaptic development, which includes drastic changes that often take place in spurts.
In the visual cortex, during the first year after birth, the number of synapses grows tremendously
and is subsequently scaled down to the adult level around the age of 11 through extensive decreases
in synaptic and spine density, dendritic length, or even the number of neurons (Kolb 1995). This pro-
cess is primarily determined by experience and neural activity: synapses that are used are strength-
ened whereas those that are neither reinforced nor actively used are eliminated. Synaptic development
is highly dependent on competition between incoming inputs, the lack of which can result in a
decreased level of synaptic revision and persistence of redundant connections in adulthood (De
Volder et al. 1997). This process of synaptic pruning represents a fairly continuous and extended
tuning of neural circuits and can be contrasted with other types of changes that occur at very short
timescales. During such periods of intensified development (i.e., critical or, more broadly, sensi-
tive periods; Knudsen 2004; Michel and Tyler 2005), the system is the most sensitive to abnormal
environmental inputs or injuries (Wiesel and Hubel 1963). Thus, injuries affecting different stages
of development, even when they occur at roughly similar ages, may trigger distinct patterns of
compensatory neuroplastic changes and lead to different levels of recovery. Specifically, early stud-
ies of recovery after visual loss (Wiesel and Hubel 1963, 1965) suggested that vision is particularly
sensitive to receiving natural input during early development, and that visual deprivation even for
short durations, but at an early developmental stage, may irreversibly damage the ability for normal
visual perception at older ages. Conversely, evidence of sparse visual recovery after early-onset
blindness (Gregory and Wallace 1963; Fine et al. 2003) demonstrates that this may not necessarily
apply in all cases, and some (although not all) visual abilities may be regained later in life.
The potential for neuroplasticity after puberty is considered to be either much lower than in childhood or even absent, except in cases of pathological states and neural overstimulation
(Shaw and McEachern 2000). However, recovery following different types of pathological states
occurring in adulthood (Brown 2006; Chen et al. 2002), changes in neuronal counts and compensa-
tory increases in the number of synapses in aging (Kolb 1995), and the profound changes following
short periods of blindfolding (Pitskel et al. 2007; Pascual-Leone et al. 2005; Amedi et al. 2006)
suggest otherwise. In reconciling these seemingly contradictory conclusions, it is useful to take into
account the multifaceted nature of plasticity that includes different forms of changes occurring at
different timescales and on different levels of neural functioning. For example, synaptic changes
occurring in aging develop over an extended period and in synergy with altered experiences and
needs characteristic of later periods in life. The robust, short-term plasticity occurring after blind-
folding may arise from the recruitment of already existing, but commonly unused, inhibited, or
masked pathways that become available once the source or reason for such masking (e.g., avail-
ability of visual input in those who have been blindfolded) is removed. Therefore, some forms of
adult plasticity do not reflect “plasticity de novo,” which is characterized by the creation of new
connectivity patterns (Burton 2003). In contrast, in pathological states, injuries, or late sensory
loss, both of these types of changes can occur. Rapid changes reflecting the unmasking of existing
connections occurring in the first phase promote and enable subsequent slow, but more permanent
structural changes (Amedi et al. 2005a; Pascual-Leone et al. 2005). This suggests that potentially
similar functional outcomes may be mediated by different neural mechanisms whose availability
depends on the developmental stage in which they occur.
All of these general principles and differences in neuroplasticity across the lifespan can be applied
to the more specific case of plasticity following sensory loss. Given that the most extensive plasticity
is seen in the congenitally or early-onset blind, it has been suggested that processing advantages
and large-scale cortical reorganization might be limited to the congenitally and early blind with the
performance of the late blind resembling more that of the sighted (Fine 2008). Similarly, Cohen et
al. (1999) suggested that the critical period of susceptibility for significant cross-modal plasticity
would end at puberty. However, findings showing a high degree of modifiability of cortical maps
even in adulthood (Kaas 1991) as well as those indicating significant reorganization in the occipi-
tal cortex of the late blind (Büchel et al. 1998; Burton et al. 2002; Voss et al. 2004) argue against
this restriction. They are in line with the previous suggestion that significant potential for plastic
changes exists throughout the lifespan, but may differ in the extent and the underlying neurophysi-
ological mechanisms available in different periods of development.
Specifically, experience of vision, especially if still available after puberty, shapes both cogni-
tion and the brain, and this influence persists even after vision is lost. Although the late blind need to reorganize information processing in order to compensate for the lack of visual input, they can, much more than the early blind, employ previously learned visual strategies, for example, visual imagery, which remains available after visual loss (Büchel et al. 1998). They also benefit greatly from
fully developed multisensory systems, which may explain differences in multisensory plasticity
encountered across the populations of the congenitally, early, and late blind. Equivalent benefits
and cross-modal connections encountered in the late blind cannot be expected to occur in those
who lack the experience of concurrent, often redundant or complementary input from different
sensory modalities. Although the potential for multisensory integration can primarily be seen as a
phenomenon that develops through integration of unisensory inputs (Wallace 2004b), it is important
to emphasize that this does not imply a serial process in which fully developed individual modalities
somehow merge in order to produce multisensory percepts. On the contrary, although some level of
development of unisensory processing may be needed for the emergence of multisensory neurons,
unisensory and multisensory perception start developing in a highly interdependent manner soon
after this initial phase. Furthermore, although multisensory percepts may develop as a consequence
of concurrent and correlated inputs from different modalities, they in turn also influence or channel
the development and differentiation of single modalities (Lickliter and Bahrick 2004). Specifically,
recent findings (Putzar et al. 2007) show that humans deprived of patterned visual input during the
first months of life, who later had their patterned vision restored, show reduced audiovisual interac-
tions. This indicates that adequate multisensory input during early development is indeed necessary
for the full development of cross-modal interactions. Similar findings have been found for abnormal
cross-modal integration in cochlear implant patients (Schorr et al. 2005).
Overall, findings indicate substantial differences in all types of plasticity across congenitally,
early, and late blind individuals. These between-group differences are not necessarily the same
across all types of plastic changes and brain areas (networks) affected by them, because they depend
to a great extent on the interaction between the onset of blindness and the exact stage of devel-
opment at the time of blindness, which may differ in different brain systems. For example, it is
plausible to assume that the ventral and dorsal pathways within the visual systems would be dif-
ferently influenced by loss of vision at different developmental stages. Thus, systems dedicated to
dynamically shifting relations between locations, objects, and events (including the dorsal visual
pathway) may develop earlier and be therefore prone to a different pattern of developmental defi-
cits (Neville and Bavelier 2000), comparable to specific findings showing that motion perception
develops earlier than object perception (Fine et al. 2003). Finally, although some of the described,
more “extreme” examples of plasticity may take years to develop, several studies suggest that with-
holding visual information for short periods, even a week, may have dramatic results: subjects who
were blindfolded for only a week showed posterior occipital lobe activation during Braille reading
(Amedi et al. 2006), and during tactile discrimination tasks (Merabet et al. 2008b). This activation
was reduced when the blindfold was removed. Hence, not all cross-modal changes require long-
term sensory deprivation, or slowly developing altered connectivity patterns; some may result from
the previously mentioned unmasking of existing connectivity between the visual and other cortices,
which are dormant (or actively inhibited) in normal conditions. It is likely that at least some of the
plastic changes require extended periods of sensory deprivation, possibly occurring in the critical
or sensitive periods in development. Such dependence may have important implications concern-
ing the ability to restore sight and regain functional vision, as well as for understanding the neural
mechanisms explaining the plastic changes evident both in early-onset as well as late-onset blind.

21.3.4  Neurophysiologic Mechanisms Underlying Plastic Changes in the Blind


Although the exact etiology of the changes observed in blindness, regardless of its onset, is not yet
fully understood, it has been suggested that all levels of connectivity, including connections within
local circuits as well as long-range corticocortical and subcortical connections are altered in the
blind (Bavelier and Neville 2002). Corroborating this, recent evidence indicates that the visual tracts
connecting the visual cortex with the eyes are degenerated in the early-onset blind (Noppeney et al.
2005; Pan et al. 2007; Shimony et al. 2006), while the functional connectivity of the occipital cortex
and various other cortical sites, including the supplementary motor area, pre- and postcentral gyri,
superior parietal lobule, and the left superior and middle temporal gyri (Yu et al. 2008), is decreased.
Therefore, altered connectivity patterns within other cortical or subcortical systems not affected by
blindness may underlie the robust plastic changes exhibited in the blind. Several models have been
posited, emphasizing the relevance of different types of connectivity in mediating different types
of plasticity (i.e., cross-modal or multisensory plasticity), the plasticity of different areas, or, as
previously described, plastic changes that occur at different onsets of vision loss. Evidence has been
provided in support of each such model, suggesting that the individual models may capture different
phenomena of relevance and that their combination may offer the full specification of the changes
encountered after sensory loss. We will now briefly present models that emphasize subcortical and
cortical connectivity changes to different extents, and review theories that aim at explaining
the general trends in long-range plasticity changes triggered by sensory loss.
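The functional connectivity findings cited in this section (e.g., Wittenberg et al. 2004; Liu et al. 2007; Yu et al. 2008) are, at their core, correlations between regional activity time courses. The following seed-based sketch on synthetic time series, with placeholder names and values, is only meant to illustrate the basic computation, not the preprocessing pipelines of the cited studies.

import numpy as np

def seed_functional_connectivity(seed_ts, region_ts):
    """Pearson correlation between a seed time course and each other region.
    seed_ts: (n_timepoints,), e.g., a mean occipital signal;
    region_ts: (n_regions, n_timepoints)."""
    seed = (seed_ts - seed_ts.mean()) / seed_ts.std()
    regions = (region_ts - region_ts.mean(axis=1, keepdims=True)) / \
              region_ts.std(axis=1, keepdims=True)
    return regions @ seed / len(seed)          # one r value per region

# Synthetic example: one region shares signal with the seed, one does not.
rng = np.random.default_rng(2)
n_t = 200
shared = rng.normal(size=n_t)
seed = shared + 0.5 * rng.normal(size=n_t)              # "occipital" seed
coupled = shared + 0.5 * rng.normal(size=n_t)           # strongly correlated region
uncoupled = rng.normal(size=n_t)                        # independent region
print(seed_functional_connectivity(seed, np.stack([coupled, uncoupled])))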
Subcortical models of connectivity are based mostly on findings in animal models of plasticity
following sensory loss. Such studies in mammals, similar to the studies of rewiring sensory input
(Sharma et al. 2000; von Melchner et al. 2000), suggest that the visual cortex may receive sensory
input from subcortical sensory stations, which may cause a reorganization of the visual cortex,
enabling it to process stimuli from other modalities. Specifically, several studies have shown that
congenital blindness (caused by early enucleation) causes a rewiring of tactile and auditory inputs
from the thalamic and other brainstem stations in the sensory pathways to the visual cortex (Chabot
et al. 2007; Izraeli et al. 2002; Karlen et al. 2006; Laemle et al. 2006; Piche et al. 2007). This rewir-
ing is evident both in the neural connectivity (indicated by the use of anterograde and retrograde
tracers) and in the functional properties of the “visual” cortex (examined by electrophysiologi-
cal recordings), which now starts to exhibit auditory or tactile responses. This type of model may
explain the engagement of the visual cortex of blind humans in “low-level” sensory tasks (Kujala et
al. 2005; Gougoux et al. 2005) seen in many studies, which constitutes cross-modal (intermodal)
plasticity. However, despite the evidence for spontaneous occurrence of such connectivity in mam-
mals, no definite support for such a model has been established in humans as of yet.
Corticocortical models of connectivity are currently better grounded to account for the large-
scale plasticity observed in the blind. Although it was previously assumed that there are no direct
connections between sensory modalities, recent anatomical studies in primates indicate the exis-
tence of projections from the auditory to the visual cortex and multisensory feedback connections to
primary visual areas (Falchier et al. 2002; Rockland and Ojima 2003). Supporting this connectivity
in humans, increased functional connectivity between the primary somatosensory cortex and pri-
mary visual cortex was found in the early-onset blind (Wittenberg et al. 2004). Although direct connectivity from auditory or somatosensory cortex to primary visual cortex may explain some of the visual cortex’s cross-modal responses, it may not account for all of its perceptual properties. In addition, such a model may not explain the “high
cognitive” component of the compensatory plasticity, as reflected in, for example, the involvement
of the visual cortex in verbal memory and language. In order to account for these findings, models
of corticocortical connectivity have to be further refined. Specifically, these cannot remain limited
only to illustrating the presence or lack of white matter fibers between different regions, but need
to address the dynamics of information transfer. In one such model, the so-called inverted hierar-
chy model (Amedi et al. 2003; Büchel 2003), feedback connectivity is considered to play a crucial
role in cross-modal (and supramodal) plasticity. Specifically, connections stemming from temporal,
parietal, and frontal lobes may, in the absence of visual input and visual pathway connectivity com-
petition, be responsible for providing nonvisual input to the occipital lobe, enabling its engagement
in nonvisual processing. This is particularly true for areas involved in multisensory processing even
in the sighted, such as regions within the lateral occipital complex (LOC) that are naturally active
both during tactile and visual object recognition (Amedi et al. 2001, 2002). Such areas retain some
of their original sensory input following the loss of one modality and may consequently preserve
their original functions (i.e., tactile shape recognition, including Braille reading), corresponding to
multimodal or multisensory plasticity. The feedback connectivity from these regions to earlier sta-
tions in the visual pathways, such as the primary visual cortex, may further expand this network.
Since these stations are now even further away from direct sensory input (a similar distance from
the sensory receptors as the frontal cortex, as measured by the number of synapses), the model
posits they may now begin to engage in even higher cognitive functions, similar to the frontal
cortex. In support of this hypothesis, it was demonstrated that the functional connectivity of the
visual cortex with frontal language regions is increased in the blind (Liu et al. 2007). Such changes
in connectivity could account for the altered pattern of inputs reaching the occipital cortex, which
may in the end determine the morphological and physiological features of this area and enable its
functional reassignment to nonsensory tasks. It is still too early to speculate about all implications
of the inverted hierarchy approach, particularly in relation to those areas that might be at the top
of the postulated hierarchy. On a somewhat less speculative front, recent studies have provided evidence
supporting some claims of the hypothesis suggesting increased feedback corticocortical informa-
tion transfer following sensory loss. For example, it has been shown that the area involved in (visual
and auditory) motion processing in the sighted is involved in auditory (Poirier et al. 2006a) as well
as tactile (Ricciardi et al. 2007) motion processing in the blind. Similar conclusions can be drawn
from findings showing the engagement of the ventral visual pathway typically involved in process-
ing information related to the identification of objects and faces (Ungerleider and Mishkin 1982)
in auditorily mediated object recognition, but only if detailed shape information is provided and
efficiently extracted (Poirier et al. 2006c; Amedi et al. 2007). All of these results are congruent
with the more general notion that cross-modal plasticity occurs in situations where the information
originally processed within a certain area remains similar regardless of the input modality rerouted into it
(Grafman 2000). This implies that each cortical area may operate in a metamodal fashion (Pascual-
Leone and Hamilton 2001), being specialized in a particular type of computation rather than being
tied to a specific input modality. However, this type of broad generalization is subject to caution as it
is still not clear how such metamodal computations would develop, especially when inputs are
significantly altered during development, as in congenital blindness. On one hand,
the metamodal theory suggests that, in blindness, visual deafferentation may lead to a strengthening
of the corresponding input signal from other modalities to the “visual” areas, which will maintain
the original cortical operation. This hypothesis predicts that the classical hierarchy (i.e., low-level
basic feature analysis in early visual areas, high level object recognition in LOC) is maintained in
the blind, who now utilize the tactile (and auditory) modalities. By contrast, the inverted hierarchy
theory suggests that, because of the dysfunctional main bottom-up geniculostriatal pathway in the
blind, the retinotopic areas (especially V1) will be much farther (in terms of the number of synapses)
from the remaining functional sense organs (in the tactile or auditory modalities). This, in turn,
would lead to V1 resembling more the prefrontal cortex (which is similarly remote from any direct
sensory input), rather than becoming a primary sensory area in the blind.
Both theories may, however, be reconciled by considering the connectivity of the reorganized
visual cortex of the blind and the onset of visual loss. The development of computations charac-
teristic of a certain region is strongly dependent on the input that it originally receives. Therefore,
in cases of congenital sensory loss, primary sensory areas are less likely to develop computations
similar to the ones performed in the typical brain. These differences in connectivity may lead to
more extensive developmental changes, causing cortical regions to assume computations very
different from their natural roles. The visual cortex of the congenitally blind may correspond to early
stations of sensory processing due to auditory or tactile subcortical (or even cortical, see Wittenberg
et al. 2004) connectivity as seen in animal models of blindness, or to higher stations in the hierarchy
(as predicted by the inverted hierarchy model) if most of the connectivity is indeed from high-order
(multisensory) cortical regions. Currently, evidence for both types of plasticity can be found, as the
visual cortex of the congenitally blind is activated both by simple perceptual tasks as well as by
mnemonic and semantic tasks, but there appear to be differences in the preference for perceptual vs.
high-level cognitive functions in different areas of the occipital cortex. Specifically, there is growing
evidence that as one moves anteriorly in the ventral visual stream, the balance of activation between
the two task types shifts toward the perceptual tasks, whereas posteriorly, in and around the calcarine
sulcus (V1), there is a clear preference for the higher-order verbal memory and language tasks (Raz et
al. 2005). However, this issue is not yet resolved and it will greatly benefit from future anatomical
connectivity findings in humans. In the case of late-onset blindness, the connectivity of the visual
cortex and its development are more typical of the sighted brain (as previously described), and
reorganization is more likely to be of the high-order corticocortical type, along with some unmask-
ing of subcortical connectivity, which is also apparent in blindfolded normally sighted individuals
(Pitskel et al. 2007; Pascual-Leone et al. 2005; Amedi et al. 2006).

21.4  REHABILITATION-INDUCED PLASTICITY


At the start of this chapter, we mentioned the most promising rehabilitation techniques available to
restore functional vision in the blind. Next, we illustrated how these approaches are used on indi-
viduals whose cognition and brains have, due to sensory loss, already undergone drastic changes
that depend crucially on the onset of such loss. In this case, plasticity can be viewed as a double-
edged sword. On one hand, it represents a critical property allowing functional adaptation to the
lost sensory input. On the other, these same changes need to be modified or partly reversed once
the lost sensory input is restored. Importantly, this remodeling does not have to result in a func-
tional organization that would be identical to the one found in the majority of the population. On
the contrary, considerable interindividual variability is to be expected in adapting to the implant
(e.g., in cochlear implant patients, speech recognition performance ranges from very poor to near
perfect; Geers 2006). This is particularly true with regard to the variability in onset of blindness,
as was discussed in relation to neural mechanisms of plasticity at different stages of life. The late-onset
blind may particularly benefit from reacquiring visual information, as their visual cortex has devel-
oped in a manner that would allow it to process such information: sensory loss triggers less
pronounced reorganization of their visual cortex than in the early-onset or congenitally blind,
whose brains undergo more extensive remodeling and therefore encounter more
difficulties in adapting to visual information. Furthermore, sensory implants (and, although in a
different manner, SSDs) are prone to influence the brain as a system, not just one modality. For
example, it was shown that visual information can disrupt the processing of auditory information
in newly implanted cochlear implant patients, most probably due to cross-modal visual process-
ing in the auditory cortex (Champoux et al. 2009), and that “badly performing” cochlear implant
patients may have more extended visual responses in their auditory cortex (Doucet et al. 2006).
Similarly, cross-modal plasticity in the auditory cortex before the surgery can constrain cochlear
implant efficacy (Lee et al. 2001). Therefore, decreasing the amount of nonrelevant visual informa-
tion and increasing responses to input arriving from the cochlear implant might be a useful part of
rehabilitation. Analogous interference may occur after visual restoration in all tasks that are functionally
dependent on the occipital lobe of the blind (particularly tasks that can be disrupted by occipital
TMS, as described in previous sections). One solution to at least some of these problems might
be the integration of SSDs and prostheses, as they provide fairly distinct advantages. Specifically,
whereas prostheses may allow entry of visual signals into the brain, SSDs can be very useful in teach-
ing the brain how to interpret the input from the new modality, and perhaps prepare the brain and
mind of the blind, in advance, for new sensory information.

21.4.1  Plasticity after SSD Use and Its Theoretical Implications


Observing the outcomes of sensory restoration and substitution is not only of practical relevance:
it also provides a unique opportunity to address numerous theoretical questions about perception
and the nature of qualia, and can teach valuable lessons about brain plasticity. Several studies
investigating SSD use have examined the way blind and sighted brains process sensory-transformed
information provided by such devices. Functional properties of multisensory regions can easily be
studied using SSDs as a methodology of choice, as the information they convey is naturally pro-
cessed in a multisensory or trans-modal manner. For example, several recent studies used SSDs to
test the metamodal processing theory (Pascual-Leone and Hamilton 2001), which states that each
brain region implements a particular type of computation regardless of its modality of input. These
studies showed that object shape information drives the activation of the LOC in the visual cortex,
regardless of whether it is transmitted in the visual, tactile, or auditory modality (Amedi et al.
2007; Poirier et al. 2006c) in sighted as well as blind individuals (see Figure 21.1b, c). Interestingly,
applying TMS to this region can disrupt shape identification using an auditory SSD (Merabet et al.
2008a). In the same way, studies conducted using PSVA in the sighted show that auditorily mediated
face perception can activate the visual fusiform face area (Plaza et al. 2009), whereas depth percep-
tion activates the occipito-parietal and occipito-temporal regions (Renier et al. 2005).
Studying the use of SSDs in a longitudinal fashion also provides a good opportunity to monitor
in real time how newly acquired information is learned, and investigate the accompanying cogni-
tive and neural changes. For example, several studies have looked into differential activation before
and after learning how to use a specific SSD. One study showed that shape discrimination using
the TDU SSD generated activation of the occipital cortex following short training only in early-
onset blind individuals (but not in sighted; Ptito et al. 2005), and that TDU training enables TMS to
induce spatially organized tactile sensations on the tongue (Kupers et al. 2006). These studies sug-
gest that the occipital lobe of the blind may be more prone to plasticity or to cross-modal process-
ing even in adulthood when compared to that of the sighted. Cross-modal activation of the visual
cortex of sighted subjects was also demonstrated, following training on the PSVA SSD (Poirier et al.
2007a). Although such behavioral and imaging findings have been reported for both early-
(Arno et al. 2001) and late-onset blind (Cronly-Dillon et al. 1999) as well as sighted individu-
als (Poirier et al. 2007a), it has recently been claimed that the recruitment of occipital areas during
the use of SSDs could be mediated by different processes or mechanisms in different populations.
Specifically, although the early blind might exhibit real bottom-up activation of occipital cortex for
tactile or auditory perception, in the late blind and sighted this activation might reflect top-down
visual imagery mechanisms (Poirier et al. 2007b). This suggestion is not surprising, given that we
have previously made a similar claim with regard to the mechanisms underlying plastic changes
following sensory loss itself. Importantly, recent evidence of multisensory integration for object rec-
ognition, as shown by using a novel cross-modal adaptation paradigm (Tal and Amedi 2009), may
imply that the sighted could share some bottom-up mechanisms of tactile and visual integration in
visual cortex. Nevertheless, in addition to relying on different neurophysiological mechanisms, the
behavioral potential of SSDs may also vary between subpopulations of the blind, as the
late-onset blind can better associate the cross-modal input with the properties of vision as they knew it
(e.g., they have better knowledge of the 2-D representation of visual pictures, which is useful in most
current 2-D SSDs), whereas early blind individuals lack such understanding of the visual world,
but may have more highly developed auditory and tactile cross-modal networks and plasticity. This
difference in utilizing visual rehabilitation between the two blind groups may be even more pronounced in
the case of sensory restoration. Importantly, this differentiation between early and late-onset blind
in SSD use also highlights the potential of introducing such devices as early as possible in develop-
ment, while the brain is still in its prime with respect to plasticity. Similar to the improved outcomes
of cochlear implantation in early childhood (Harrison et al. 2005), it may be of particular interest
to attempt to teach young blind children to utilize such devices. Several early attempts to teach
blind infants to use the Sonicguide (Kay and Kay 1983) showed some promise, as younger subjects
showed more rapid sensitivity to the spatial information provided by the device (Aitken and Bower
1982, 1983) (although with highly variable results; for a discussion, see Warren 1994). However, to
our knowledge, only a few preliminary later attempts (Segond et al. 2007; Amedi et al., unpublished
observations) have been made to adapt the use of SSDs to children. The training of infants on SSD
use may also lead to a more “natural” perception of the sensorily transformed information, perhaps
even to the level of synesthesia (Proulx and Stoerig 2006), a condition in which one type of sensory
stimulation evokes the sensation of another, commonly in another modality or submodality: for
instance, color is associated with letters or numbers, sounds with vision or other sensory combina-
tions. This type of synesthesia may create visual experiences or even visual qualia with regard to
the SSD percept. In a recent study (Ward and Meijer 2009) describing the phenomenology of two
blind users of the vOICe SSD, some evidence for its feasibility can be seen in reports of a late-
blind vOICe user, who reports synesthetic percepts of vision while using the device, and even
synesthetic percepts of color, which is not conveyed by the device, but is “filled in” by her mind’s
eye. Some of her descriptions of her subjective experience illustrate the synesthetic nature of the
SSD percept: “the soundscapes seem to trigger a sense of vision for me. . . . It does not matter to me
that my ears are causing the sight to occur in my mind” (see more on the vOICe website, http://www​
.seeingwithsound.com/users.htm).
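Several of the devices discussed in this chapter, most notably the vOICe (Meijer 1992), encode camera images as “soundscapes” by scanning the image from left to right while, roughly, mapping vertical position to pitch and pixel brightness to loudness. The short Python sketch below illustrates only this general mapping principle; the function name, frequency range, scan duration, and normalization step are illustrative assumptions rather than the parameters of the vOICe or any other actual device.

```python
import numpy as np


def image_to_soundscape(image, duration=1.0, sample_rate=22050,
                        f_min=500.0, f_max=5000.0):
    """Render a 2-D grayscale image (values in [0, 1]) as a mono 'soundscape'.

    Columns are scanned left to right over `duration` seconds; each row is
    assigned a sine tone whose pitch rises with height in the image and whose
    loudness follows pixel brightness (all parameter values are illustrative).
    """
    n_rows, n_cols = image.shape
    samples_per_col = int(duration * sample_rate / n_cols)
    t = np.arange(samples_per_col) / sample_rate
    # Top image rows -> higher frequencies; log spacing approximates pitch perception.
    freqs = np.logspace(np.log10(f_min), np.log10(f_max), n_rows)[::-1]
    columns = []
    for col in range(n_cols):
        brightness = image[:, col][:, None]              # shape (n_rows, 1)
        tones = np.sin(2 * np.pi * freqs[:, None] * t)   # shape (n_rows, samples_per_col)
        columns.append((brightness * tones).sum(axis=0))
    signal = np.concatenate(columns)
    return signal / (np.abs(signal).max() + 1e-12)       # normalize to [-1, 1]


# A bright diagonal running from bottom-left to top-right yields a rising sweep.
demo_image = np.eye(64)[::-1]
soundscape = image_to_soundscape(demo_image)
```

Actual devices differ in many implementation details, but the kind of cross-modal mapping that users must learn to interpret is of this general form.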
In summary, observing the outcomes of sensory restoration and substitution offers a unique
opportunity to address and potentially answer numerous theoretical questions about the funda-
mental principles of brain organization, neuroplasticity, unisensory processing, and multisensory
integration, in addition to their obvious clinical use. Research in these fields may also provide use-
ful insights that can be applied in clinical settings, such as the suggested use of SSDs and sensory
recovery at an early developmental stage.

21.5  CONCLUDING REMARKS AND FUTURE DIRECTIONS


The findings reviewed in this chapter clearly show the extent of changes that accompany sensory
loss. Cognitive functioning as well as brain organization on all spatial scales undergoes profound
reorganization that depends on multiple factors, primarily the onset of the sensory impairment.
Functionally, many (although not necessarily all) of these changes are beneficial and allow indi-
viduals to better adapt and use the available resources to compensate for the loss and function
more efficiently in their surroundings. Sensory loss triggers a wide range of such changes under-
pinned by different neurophysiological mechanisms that we still know very little about. Therefore,
an important goal for the future will include elucidating these individual phenomena at various
levels and timescales. One critical issue that has so far not received enough attention concerns the
connectivity changes that enable different types of restructuring as well as the factors determining
such changes, for instance, the onset of blindness. The major theoretical mission for the future will,
however, include bringing all of these changes and suggested mechanisms together. Although dif-
ficult, this mission can be successful, as all of these seemingly unrelated types of changes occur in
the same dynamic system, are mutually interdependent, and are shaped by each other. Investigating
the consequences of sensory loss can also lead to significant advances in our understanding of the
fundamental principles of the formation and the development of the nervous system. For example,
investigating the brains of visually or auditorily impaired individuals can shed light on issues that
include specifying criteria for defining and delineating individual cortical areas or determining the
extent and the interplay of genetic and environmental determination of brain organization.
On a less theoretical note, studying the outcomes of sensory loss is crucially important for devel-
oping rehabilitation techniques, the benefits of which cannot be emphasized enough. Optimization
of all of these approaches is motivated and enabled by theoretical progress and emerging knowledge
about plastic changes following sensory impairment. On the other hand, these approaches should
not be viewed as pure applications constrained by basic science. They are developed in synergy
and are sometimes even ahead of their theoretical counterparts. For example, the first SSDs were
developed at a time when their outcomes would never have been predicted by mainstream theories.
Even today, we are fascinated and often surprised by the outcomes of such devices, which strongly
inform our theoretical knowledge. Therefore, one major mission for the future will include bringing
theory and practice even closer together as they each benefit from the questions posed and answers
provided by the other. On a purely practical level, the main direction for the future will include the
improvement of current or development of new rehabilitation techniques and approaches aimed at
combining these techniques as these may often complement each other.
In conclusion, the present chapter reviewed the main findings and illustrated the theoretical and
practical importance of studying consequences of sensory loss. All of the described plastic changes
that occur in this context indicate that no region or network in the brain is an island and that all
types of lesions or atypical developmental patterns are bound to influence the system as a whole. The
current challenge is to understand the principles that guide, mechanisms that underlie, and factors
that influence such changes, so that this knowledge can be channeled into practical rehabilitation
purposes.

REFERENCES
Aitken, S., and T. G. Bower. 1982. Intersensory substitution in the blind. J Exp Child Psychol 33: 309–323.
Aitken, S., and T. G. Bower. 1983. Developmental aspects of sensory substitution. Int J Neurosci 19: 13–91.
Amedi, A., J. Camprodon, L. Merabet et al. 2006. Highly transient activation of primary visual cortex (V1) for
tactile object recognition in sighted following 5 days of blindfolding. Paper presented at the 7th Annual
Meeting of the International Multisensory Research Forum, University of Dublin.
Amedi, A., A. Floel, S. Knecht, E. Zohary, and L. G. Cohen. 2004. Transcranial magnetic stimulation of the
occipital pole interferes with verbal processing in blind subjects. Nat Neurosci 7: 1266–1270.
Amedi, A., G. Jacobson, T. Hendler, R. Malach, and E. Zohary. 2002. Convergence of visual and tactile shape
processing in the human lateral occipital complex. Cereb Cortex 12: 1202–1212.
Amedi, A., R. Malach, R. Hendler, S. Peled, and E. Zohary. 2001. Visuo-haptic object-related activation in the
ventral visual pathway. Nat Neurosci 4: 324–330.
Amedi, A., L. B. Merabet, F. Bermpohl, and A. Pascual-Leone. 2005a. The occipital cortex in the blind:
Lessons about plasticity and vision. Curr Dir Psychol Sci 14: 306–311.
Amedi, A., N. Raz, P. Pianka, R. Malach, and E. Zohary. 2003. Early ‘visual’ cortex activation correlates with
superior verbal memory performance in the blind. Nat Neurosci 6: 758–766.
Amedi, A., W. M. Stern, J. A. Camprodon et al. 2007. Shape conveyed by visual-to-auditory sensory substitu-
tion activates the lateral occipital complex. Nat Neurosci 10: 687–689.
Amedi, A., K. Von Kriegstein, N. M. Van Atteveldt, M. S. Beauchamp, and M. J. Naumer. 2005b. Functional
imaging of human crossmodal identification and object recognition. Exp Brain Res 166: 559–571.
Arno, P., C. Capelle, M. C. Wanet-Defalque, M. Catalan-Ahumada, and C. Veraart. 1999. Auditory coding of
visual patterns for the blind. Perception 28: 1013–1029.
Arno, P., A. G. De Volder, A. Vanlierde et al. 2001. Occipital activation by pattern recognition in the early blind
using auditory substitution for vision. Neuroimage 13: 632–645.
Ashmead, D. H., E. W. Hill, and C. R. Talor. 1989. Obstacle perception by congenitally blind children. Percept
Psychophys 46: 425–433.
Ashmead, D. H., R. S. Wall, K. A. Ebinger, S. B. Eaton, M. M. Snook-Hill, and X. Yang. 1998. Spatial hearing
in children with visual disabilities. Perception 27: 105–122.
Auvray, M., S. Hanneton, and J. K. O’Regan. 2007. Learning to perceive with a visuo-auditory substitution
system: Localisation and object recognition with ‘The vOICe’. Perception 36: 416–430.
Bach-y-Rita, P. 2004. Tactile sensory substitution studies. Ann N Y Acad Sci 1013: 83–91.
Bach-y-Rita, P., C. C. Collins, F. A. Saunders, B. White, and L. Scadden. 1969. Vision substitution by tactile
image projection. Nature 221: 963–964.
Bach-y-Rita, P., K. A. Kaczmarek, M. E. Tyler, and J. Garcia-Lara. 1998. Form perception with a 49-point
electrotactile stimulus array on the tongue: A technical note. J Rehabil Res Dev 35: 427–430.
Bach-y-Rita, P., and S. W. Kercel. 2003. Sensory substitution and the human–machine interface. Trends Cogn
Sci 7: 541–546.
Bavelier, D., M. W. Dye, and P. C. Hauser. 2006. Do deaf individuals see better? Trends Cogn Sci 10: 512–518.
Bavelier, D., and H. J. Neville. 2002. Cross-modal plasticity: Where and how? Nat Rev Neurosci 3: 443–452.
Beauchamp, M. S., B. D. Argall, J. Bodurka, J. H. Duyn, and A. Martin. 2004. Unraveling multisensory integra-
tion: Patchy organization within human STS multisensory cortex. Nat Neurosci 7: 1190–1192.
Brelen, M. E., F. Duret, B. Gerard, J. Delbeke, and C. Veraart. 2005. Creating a meaningful visual perception
in blind volunteers by optic nerve stimulation. J Neural Eng 2: S22–S28.
Brown, J. A. 2006. Recovery of motor function after stroke. Prog Brain Res 157: 223–228.
Bubic, A., E. Striem-Amit, and A. Amedi. 2010. Large-scale brain plasticity following blindness and the use of
sensory substitution devices. In Multisensory Object Perception in the Primate Brain, ed. J. Kaiser and
M. Naumer, part 4, 351–380.
Büchel, C. 2003. Cortical hierarchy turned on its head. Nat Neurosci 6: 657–658.
Büchel, C., C. Price, R. S. Frackowiak, and K. Friston. 1998. Different activation patterns in the visual cortex
of late and congenitally blind subjects. Brain 121(Pt 3): 409–419.
Bull, N. D., and K. R. Martin. 2009. Using stem cells to mend the retina in ocular disease. Regen Med 4:
855–864.
Buonomano, D. V., and H. A. Johnson. 2009. Cortical plasticity and learning: Mechanisms and models. In
Encyclopedia of neuroscience, ed. L. R. Squire. London: Academic Press.
Burton, H., A. Z. Snyder, J. B. Diamond, and M. E. Raichle. 2002. Adaptive changes in early and late blind: A
FMRI study of verb generation to heard nouns. J Neurophysiol 88: 3359–3371.
Burton, H. 2003. Visual cortex activity in early and late blind people. J Neurosci 23: 4005–4011.
Burton, H., J. B. Diamond, and K. B. McDermott. 2003. Dissociating cortical regions activated by semantic and
phonological tasks: A FMRI study in blind and sighted people. J Neurophysiol 90: 1965–1982.
Calvert, G. A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies.
Cereb Cortex 11: 1110–1123.
Campbell, R., and M. MacSweeney. 2004. Neuroimaging studies of cross-modal plasticity and language pro-
cessing in deaf people. In The handbook of multisensory processes, ed. G. Calvert, C. Spence, and B. E.
Stein. Cambridge, MA: MIT Press.
Capelle, C., C. Trullemans, P. Arno, and C. Veraart. 1998. A real-time experimental prototype for enhancement
of vision rehabilitation using auditory substitution. IEEE Trans Biomed Eng 45: 1279–1293.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. Eur J Neurosci 22: 2886–2902.
Chabot, N., S. Robert, R. Tremblay, D. Miceli, D. Boire, and G. Bronchti. 2007. Audition differently activates
the visual system in neonatally enucleated mice compared with anophthalmic mutants. Eur J Neurosci
26: 2334–2348.
Champoux, F., F. Lepore, J. P. Gagne, and H. Theoret. 2009. Visual stimuli can impair auditory processing in
cochlear implant users. Neuropsychologia 47: 17–22.
Chechik, G., I. Meilijson, and E. Ruppin. 1999. Neuronal regulation: A mechanism for synaptic pruning during
brain maturation. Neural Comput 11: 2061–2080.
Chebat, D. R., C. Rainville, R. Kupers, and M. Ptito. 2007. Tactile–‘visual’ acuity of the tongue in early blind
individuals. Neuroreport 18: 1901–1904.
Chen, R., L. G. Cohen, and M. Hallett. 2002. Nervous system reorganization following injury. Neuroscience
111: 761–773.
Chen, S. C., L. E. Hallum, G. J. Suaning, and N. H. Lovell. 2007. A quantitative analysis of head movement
behaviour during visual acuity assessment under prosthetic vision simulation. J Neural Eng 4: S108.
Clavagnier, S., A. Falchier, and H. Kennedy. 2004. Long-distance feedback projections to area V1: Implications
for multisensory integration, spatial awareness, and visual consciousness. Cogn Affect Behav Neurosci
4: 117–126.
Cohen, L. G., P. Celnik, A. Pascual-Leone, B. Corwell, L. Falz, J. Dambrosia, M. Honda, N. Sadato, C. Gerloff,
M. D. Catala, and M. Hallett. 1997. Functional relevance of cross-modal plasticity in blind humans.
Nature 389: 180–183.
Cohen, L. G., R. A. Weeks, N. Sadato, P. Celnik, K. Ishii, and M. Hallett. 1999. Period of susceptibility for
cross-modal plasticity in the blind. Ann Neurol 45: 451–460.
Collignon, O., L. Renier, R. Bruyer, D. Tranduy, and C. Veraart. 2006. Improved selective and divided spatial
attention in early blind subjects. Brain Res 1075: 175–182.
Collignon, O., M. Lassonde, F. Lepore, D. Bastien, and C. Veraart. 2007. Functional cerebral reorganization
for auditory spatial processing and auditory substitution of vision in early blind subjects. Cereb Cortex
17: 457–465.
Cronin, T., T. Leveillard, and J. A. Sahel. 2007. Retinal degenerations: From cell signaling to cell therapy; pre-
clinical and clinical issues. Curr Gene Ther 7: 121–129.
Cronly-Dillon, J., K. Persaud, and R. P. Gregory. 1999. The perception of visual images encoded in musical
form: A study in cross-modality information transfer. Proc Biol Sci 266: 2427–2433.
Cronly-Dillon, J., K. C. Persaud, and R. Blore. 2000. Blind subjects construct conscious mental images of
visual scenes encoded in musical form. Proc Biol Sci 267: 2231–2238.
D’angiulli, A., and P. Waraich. 2002. Enhanced tactile encoding and memory recognition in congenital blind-
ness. Int J Rehabil Res 25: 143–145.
Dagnelie, G. 2008. Psychophysical evaluation for visual prosthesis. Annu Rev Biomed Eng 10: 339–368.
Delbeke, J., M. C. Wanet-Defalque, B. Gerard, M. Troosters, G. Michaux, and C. Veraart. 2002. The microsys-
tems based visual prosthesis for optic nerve stimulation. Artif Organs 26: 232–234.
De Volder, A. G., A. Bol, J. Blin, A. Robert, P. Arno, C. Grandin, C. Michel, and C. Veraart. 1997. Brain energy
metabolism in early blind subjects: Neural activity in the visual cortex. Brain Res 750: 235–244.
Dobelle, W. H. 2000. Artificial vision for the blind by connecting a television camera to the visual cortex.
ASAIO J 46: 3–9.
Doucet, M. E., F. Bergeron, M. Lassonde, P. Ferron, and F. Lepore. 2006. Cross-modal reorganization and
speech perception in cochlear implant users. Brain 129: 3376–3383.
Doucet, M. E., J. P. Guillemot, M. Lassonde, J. P. Gagne, C. Leclerc, and F. Lepore. 2005. Blind subjects pro-
cess auditory spectral cues more efficiently than sighted individuals. Exp Brain Res 160: 194–202.
Dowling, J. 2008. Current and future prospects for optoelectronic retinal prostheses. Eye 23: 1999–2005.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. J Neurosci 22: 5749–5759.
Fallon, J. B., D. R. Irvine, and R. K. Shepherd. 2008. Cochlear implants and brain plasticity. Hearing Res 238:
110–117.
Fernandez, E., P. Ahnelt, P. Rabischong, C. Botella, and F. Garcia-De Quiros. 2002. Towards a cortical visual
neuroprosthesis for the blind. IFMBE Proc 3(2): 1690–1691.
Fieger, A., B. Röder, W. Teder-Salejarvi, S. A. Hillyard, and H. J. Neville. 2006. Auditory spatial tuning in late-
onset blindness in humans. J Cogn Neurosci 18: 149–157.
Fine, I. 2008. The behavioral and neurophysiological effects of sensory deprivation. In Blindness and brain
plasticity in navigation and object perception, ed. J. J. Rieser, D. H. Ashmead, F. F. Ebner, and A. L.
Corn. New York: Taylor and Francis.
Fine, I., A. R. Wade, A. A. Brewer et al. 2003. Long-term deprivation affects visual perception and cortex. Nat
Neurosci 6: 915–916.
Finney, E. M., I. Fine, and K. R. Dobkins. 2001. Visual stimuli activate auditory cortex in the deaf. Nat Neurosci
4: 1171–1173.
Geers, A. E. 2006. Factors influencing spoken language outcomes in children following early cochlear implan-
tation. Adv Otorhinolaryngol 64: 50–65.
Gizewski, E. R., T. Gasser, A. de Greiff, A. Boehm, and M. Forsting. 2003. Cross-modal plasticity for sensory
and motor activation patterns in blind subjects. Neuroimage 19: 968–975.
Goldish, L. H., and H. E. Taylor. 1974. The Optacon: A valuable device for blind persons. New Outlook Blind
68: 49–56.
Goldreich, D., and I. M. Kanics. 2003. Tactile acuity is enhanced in blindness. J Neurosci 23: 3439–3445.
Goldreich, D., and I. M. Kanics. 2006. Performance of blind and sighted humans on a tactile grating detection
task. Percept Psychophys 68: 1363–1371.
Gougoux, F., R. J. Zatorre, M. Lassonde, P. Voss, and F. Lepore. 2005. A functional neuroimaging study of sound
localization: Visual cortex activity predicts performance in early-blind individuals. PLoS Biol 3: e27.
Grafman, J. 2000. Conceptualizing functional neuroplasticity. J Commun Disord 33: 345–355; quiz 355-6.
Grant, A. C., M. C. Thiagarajah, and K. Sathian. 2000. Tactile perception in blind Braille readers: A psy-
chophysical study of acuity and hyperacuity using gratings and dot patterns. Percept Psychophys 62:
301–312.
Gregory, R. L., and J. G. Wallace. 1963. Recovery from early blindness: A case study. In Experimental
Psychology Society, Monograph Supplement. 2nd ed. Cambridge, MA: Heffers.
Haddock, J. N., and L. Berlin. 1950. Transsynaptic degeneration in the visual system; report of a case. Arch
Neurol Psychiatry 64: 66–73.
Harrison, R. V., K. A. Gordon, and R. J. Mount. 2005. Is there a critical period for cochlear implantation in
congenitally deaf children? Analyses of hearing and speech perception performance after implantation.
Dev Psychobiol 46: 252–261.
Heyes, A. D. 1984. The Sonic Pathfinder: A new electronic travel aid. J Vis Impair Blind 78: 200–202.
Hugdahl, K., M. Ek, F. Takio et al. 2004. Blind individuals show enhanced perceptual and attentional sensitivity
for identification of speech sounds. Brain Res Cogn Brain Res 19: 28–32.
Hull, T., and H. Mason. 1995. Performance of blind children on digit-span tests. J Vis Impair Blind 89:
166–169.
Izraeli, R., G. Koay, M. Lamish, A. J. Heicklen-Klein, H. E. Heffner, R. S. Heffner, and Z. Wollberg. 2002.
Cross-modal neuroplasticity in neonatally enucleated hamsters: Structure, electrophysiology and behav-
iour. Eur J Neurosci 15: 693–712.
Kaas, J. H. 1991. Plasticity of sensory and motor maps in adult mammals. Annu Rev Neurosci 14: 137–167.
Kaas, J. H. 2000. The reorganization of somatosensory and motor cortex after peripheral nerve or spinal cord
injury in primates. Prog Brain Res 128: 173–179.
Karlen, S. J., D. M. Kahn, and L. Krubitzer. 2006. Early blindness results in abnormal corticocortical and thal-
amocortical connections. Neuroscience 142: 843–858.
Kay, L., and N. Kay. 1983. An ultrasonic spatial sensor’s role as a developmental aid for blind children. Trans
Ophthalmol Soc N Z 35: 38–42.
Kleiner, A., and R. C. Kurzweil. 1977. A description of the Kurzweil reading machine and a status report on its
testing and dissemination. Bull Prosthet Res 10: 72–81.
Knudsen, E. I. 2004. Sensitive periods in the development of the brain and behavior. J Cogn Neurosci 16:
1412–1425.
Kolb, B. 1995. Brain plasticity and behavior. Mahwah: Lawrence Erlbaum Associates, Inc.
Korte, M., and J. P. Rauschecker. 1993. Auditory spatial tuning of cortical neurons is sharpened in cats with
early blindness. J Neurophysiol 70: 1717–1721.
Kral, A., J. Tillein, S. Heid, R. Klinke, and R. Hartmann. 2006. Cochlear implants: Cortical plasticity in con-
genital deprivation. Prog Brain Res 157: 283–313.
Kujala, T., K. Alho, P. Paavilainen, H. Summala, and R. Naatanen. 1992. Neural plasticity in processing of sound loca-
tion by the early blind: An event-related potential study. Electroencephalogr Clin Neurophysiol 84: 469–472.
Kujala, T., M. J. Palva, O. Salonen et al. 2005. The role of blind humans’ visual cortex in auditory change detec-
tion. Neurosci Lett 379: 127–131.
Kupers, R., A. Fumal, A. M. De Noordhout, A. Gjedde, J. Schoenen, and M. Ptito. 2006. Transcranial magnetic
stimulation of the visual cortex induces somatotopically organized qualia in blind subjects. Proc Natl
Acad Sci U S A 103: 13256–13260.
Lacey, S., N. Tal, A. Amedi, and K. Sathian. 2009. A putative model of multisensory object representation.
Brain Topogr 21: 269–274.
Laemle, L. K., N. L. Strominger, and D. O. Carpenter. 2006. Cross-modal innervation of primary visual cortex
by auditory fibers in congenitally anophthalmic mice. Neurosci Lett 396: 108–112.
Lakatos, P., C. M. Chen, M. N. O’Connell, A. Mills, and C. E. Schroeder. 2007. Neuronal oscillations and
multisensory interaction in primary auditory cortex. Neuron 53: 279–292.
Lamba, D., M. Karl, and T. Reh. 2008. Neural regeneration and cell replacement: A view from the eye. Cell
Stem Cell 2: 538–549.
Lamba, D. A., M. O. Karl, and T. A. Reh. 2009. Strategies for retinal repair: Cell replacement and regeneration.
Prog Brain Res 175: 23–31.
Leclerc, C., D. Saint-Amour, M. E. Lavoie, M. Lassonde, and F. Lepore. 2000. Brain functional reorganization
in early blind humans revealed by auditory event-related potentials. Neuroreport 11: 545–550.
Lee, D. S., J. S. Lee, S. H. Oh, S. K. Kim, J. W. Kim, J. K. Chung, M. C. Lee, and C. S. Kim. 2001. Cross-modal
plasticity and cochlear implants. Nature 409: 149–150.
Lessard, N., M. Pare, F. Lepore, and M. Lassonde. 1998. Early-blind human subjects localize sound sources
better than sighted subjects. Nature 395: 278–280.
Lickliter, R., and L. E. Bahrick. 2004. Perceptual development and the origins of multisensory responsiveness.
In The handbook of multisensory processes, ed. G. Calvert, C. Spence, and B. E. Stein. Cambridge, MA:
MIT Press.
Linvill, J. G., and J. C. Bliss. 1966. A direct translation reading aid for the blind. Proc IEEE 54: 40–51.
Liu, Y., C. Yu, M. Liang et al. 2007. Whole brain functional connectivity in the early blind. Brain 130:
2085–2096.
Locker, M., C. Borday, and M. Perron. 2009. Stemness or not stemness? Current status and perspectives of
adult retinal stem cells. Curr Stem Cell Res Ther 4: 118–130.
MacLaren, R. E., and R. A. Pearson. 2007. Stem cell therapy and the retina. Eye (London) 21: 1352–1359.
Marr, D. 1982. Vision. San Francisco: W. H. Freeman.
Meijer, P. B. 1992. An experimental system for auditory image representations. IEEE Trans Biomed Eng 39:
112–121.
Merabet, L., G. Thut, B. Murray, J. Andrews, S. Hsiao, and A. Pascual-Leone. 2004. Feeling by sight or seeing
by touch? Neuron 42: 173–179.
Merabet, L. B., J. F. Rizzo, A. Amedi, D. C. Somers, and A. Pascual-Leone. 2005. What blindness can tell us
about seeing again: Merging neuroplasticity and neuroprostheses. Nat Rev Neurosci 6: 71–77.
Merabet, L. B., L. Battelli, S. Obretenova, S. Maguire, P. Meijer, and A. Pascual-Leone. 2008a. Functional recruit-
ment of visual cortex for sound encoded object identification in the blind. Neuroreport 20: 132–138.
Merabet, L. B., R. Hamilton, G. Schlaug et al. 2008b. Rapid and reversible recruitment of early visual cortex
for touch. PLoS ONE 3: e3046.
Michel, G. F., and A. N. Tyler. 2005. Critical period: A history of the transition from questions of when, to
what, to how. Dev Psychobiol 46: 156–162.
Millar, S. 1981. Cross-modal and intersensory perception and the blind. In Intersensory perception and sensory
integration, ed. R. D. Walk and H. L. J. Pick. New York: Plenum Press.
Murphy, C., and W. S. Cain. 1986. Odor identification: The blind are better. Physiol Behav 37: 177–180.
Neville, H. J., and D. Bavelier. 2000. Specificity of developmental neuroplasticity in humans: Evidence from
sensory deprivation and altered language experience. In Toward a theory of neuroplasticity, ed. C. A.
Shaw and J. C. Mceachern. New York: Taylor and Francis.
Newman, N. M., R. A. Stevens, and J. R. Heckenlively. 1987. Nerve fibre layer loss in diseases of the outer
retinal layer. Br J Ophthalmol 71: 21–26.
Niemeyer, W., and I. Starlinger. 1981. Do the blind hear better? Investigations on auditory processing in con-
genital or early acquired blindness: II. Central functions. Audiology 20: 510–515.
Noordzij, M. L., S. Zuidhoek, and A. Postma. 2006. The influence of visual experience on the ability to form
spatial mental models based on route and survey descriptions. Cognition 100: 321–342.
Noppeney, U. 2007. The effects of visual deprivation on functional and structural organization of the human
brain. Neurosci Biobehav Rev 31: 1169–1180.
Noppeney, U., K. J. Friston, J. Ashburner, R. Frackowiak, and C. J. Price. 2005. Early visual deprivation
induces structural plasticity in gray and white matter. Curr Biol 15: R488–R490.
Ofan, R. H., and E. Zohary. 2006. Visual cortex activation in bilingual blind individuals during use of native and
second language. Cereb Cortex 17: 1249–1259.
Ostrovsky, Y., A. Andalman, and P. Sinha. 2006. Vision following extended congenital blindness. Psychol Sci
17: 1009–1014.
Ostrovsky, Y., E. Meyers, S. Ganesh, U. Mathur, and P. Sinha. 2009. Visual parsing after recovery from blind-
ness. Psychol Sci 20: 1484–1491.
Palanker, D., A. Vankov, P. Huie, and S. Baccus. 2005. Design of a high-resolution optoelectronic retinal pros-
thesis. J Neural Eng 2: S105–S120.
Pan, W. J., G. Wu, C. X. Li, F. Lin, J. Sun, and H. Lei. 2007. Progressive atrophy in the optic pathway and visual
cortex of early blind Chinese adults: A voxel-based morphometry magnetic resonance imaging study.
Neuroimage 37: 212–220.
Pascual-Leone, A., A. Amedi, F. Fregni, and L. B. Merabet. 2005. The plastic human brain cortex. Annu Rev
Neurosci 28: 377–401.
Pascual-Leone, A., and R. Hamilton. 2001. The metamodal organization of the brain. Prog Brain Res 134:
427–445.
Penfield, W., and T. Rasmussen. 1950. The cerebral cortex of man: A clinical study of localization of function.
New York: Macmillan.
Pezaris, J. S., and R. C. Reid. 2005. Microstimulation in LGN produces focal visual percepts: Proof of concept
for a visual prosthesis. J Vis 5: 367.
Pezaris, J. S., and R. C. Reid. 2009. Simulations of electrode placement for a thalamic visual prosthesis. IEEE
Trans Biomed Eng 56: 172–178.
Piche, M., N. Chabot, G. Bronchti, D. Miceli, F. Lepore, and J. P. Guillemot. 2007. Auditory responses in the
visual cortex of neonatally enucleated rats. Neuroscience 145: 1144–1156.
Pitskel, N. B., L. B. Merabet, C. Ramos-Estebanez, T. Kauffman, and A. Pascual-Leone. 2007. Time-dependent
changes in cortical excitability after prolonged visual deprivation. Neuroreport 18: 1703–1707.
Plaza, P., I. Cuevas, O. Collignon, C. Grandin, A. G. De Volder, and L. Renier. 2009. Perceiving faces using
auditory substitution of vision activates the fusiform face area. Belgian Society for Fundamental and
Clinical Physiology and Pharmacology, Spring Meeting 2009. Acta Physiologica 195: S670.
Poirier, C., O. Collignon, C. Scheiber et al. 2006a. Auditory motion perception activates visual motion areas in
early blind subjects. Neuroimage 31: 279–285.
Poirier, C., A. De Volder, D. Tranduy, and C. Scheiber. 2007a. Pattern recognition using a device substituting
audition for vision in blindfolded sighted subjects. Neuropsychologia 45: 1108–1121.
Poirier, C., A. G. De Volder, and C. Scheiber. 2007b. What neuroimaging tells us about sensory substitution.
Neurosci Biobehav Rev 31: 1064–1070.
Poirier, C., M. A. Richard, D. T. Duy, and C. Veraart. 2006b. Assessment of sensory substitution prosthesis
potentialities in minimalist conditions of learning. Appl Cogn Psychol 20: 447–460.
Poirier, C. C., A. G. De Volder, D. Tranduy, and C. Scheiber. 2006c. Neural changes in the ventral and dorsal
visual streams during pattern recognition learning. Neurobiol Learn Mem 85: 36–43.
Pozar, L. 1982. Effect of long-term sensory deprivation on recall of verbal material. Stud Psychol 24: 311–311.
Pring, L. 1988. The ‘reverse-generation’ effect: A comparison of memory performance between blind and
sighted children. Br J Psychol 79 (Pt 3): 387–400.
Proulx, M. J., and P. Stoerig. 2006. Seeing sounds and tingling tongues: Qualia in synaesthesia and sensory
substitution. Anthropol Philos 7: 135–151.
Proulx, M. J., P. Stoerig, E. Ludowig, and I. Knoll. 2008. Seeing ‘where’ through the ears: Effects of learning-
by-doing and long-term sensory deprivation on localization based on image-to-sound substitution. PLoS
ONE 3: e1840.
Ptito, M., S. M. Moesgaard, A. Gjedde, and R. Kupers. 2005. Cross-modal plasticity revealed by electrotactile
stimulation of the tongue in the congenitally blind. Brain 128: 606–614.
Putzar, L., I. Goerendt, K. Lange, F. Rösler, and B. Röder. 2007. Early visual deprivation impairs multisensory
interactions in humans. Nat Neurosci 10: 1243–1245.
Rauschecker, J. P. 2000. Developmental neuroplasticity during brain development. In Toward a theory of neu-
roplasticity, ed. C. A. Shaw and J. C. McEachern. New York: Taylor and Francis.
Rauschecker, J. P. 2008. Plasticity of cortical maps in visual deprivation. In Blindness and brain plasticity in
navigation and object perception, ed. J. J. Rieser, D. H. Ashmead, F. F. Ebner, and A. L. Corn. New York:
Taylor and Francis.
Rauschecker, J. P., and M. Korte. 1993. Auditory compensation for early blindness in cat cerebral cortex. J
Neurosci 13: 4538–4548.
Raz, N., A. Amedi, and E. Zohary. 2005. V1 activation in congenitally blind humans is associated with episodic
retrieval. Cereb Cortex 15: 1459–1468.
Raz, N., E. Striem, G. Pundak, T. Orlov, and E. Zohary. 2007. Superior serial memory in the blind: A case of
cognitive compensatory adjustment. Curr Biol 17: 1129–1133.
Recanzone, G. H., M. M. Merzenich, W. M. Jenkins, K. A. Grajski, and H. R. Dinse. 1992. Topographic
reorganization of the hand representation in cortical area 3b of owl monkeys trained in a frequency-
discrimination task. J Neurophysiol 67: 1031–1056.
Renier, L., O. Collignon, C. Poirier et al. 2005. Cross-modal activation of visual cortex during depth perception
using auditory substitution of vision. J Vis 5: 902.
Ricciardi, E., N. Vanello, L. Sani et al. 2007. The effect of visual experience on the development of functional
architecture in hMT+. Cereb Cortex 17: 2933–2939.
Rizzo, J. F., L. Snebold, and M. Kenney. 2007. Development of a visual prosthesis. In Visual Prosthesis and
Ophthalmic Devices: New Hope in Sight, ed. J. Tombran-Tink, C. J. Barnstable, and J. F. Rizzo, 71–93.
Totowa, NJ: Humana Press.
Rockland, K. S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
Int J Psychophysiol 50: 19–26.
Röder, B., and F. Rösler. 1998. Visual input does not facilitate the scanning of spatial images. J Ment Imagery
22: 127–144.
Röder, B., and F. Rösler. 2004. Compensatory plasticity as consequence of sensory loss. In The handbook of multi-
sensory processes, ed. G. Calvert, C. Spence, and B. E. Stein. Cambridge, MA: Bradford Books, MIT Press.
Röder, B., F. Rösler, and H. J. Neville. 2000. Event-related potentials during auditory language processing in
congenitally blind and sighted people. Neuropsychologia 38: 1482–1502.
Röder, B., F. Rösler, and H. J. Neville. 2001. Auditory memory in congenitally blind adults: A behavioral–
electrophysiological investigation. Brain Res Cogn Brain Res 11: 289–303.
Röder, B., W. Teder-Salejarvi, A. Sterr, F. Rösler, S. A. Hillyard, and H. J. Neville. 1999. Improved auditory
spatial tuning in blind humans. Nature 400: 162–166.
Röder, B., O. Stock, S. Bien, H. Neville, and F. Rösler. 2002. Speech processing activates visual cortex in con-
genitally blind humans. Eur J Neurosci 16: 930–936.
Rösler, F., B. Röder, M. Heil, and E. Hennighausen. 1993. Topographic differences of slow event-related brain
potentials in blind and sighted adult human subjects during haptic mental rotation. Brain Res Cogn Brain
Res 1: 145–159.
Sadato, N. 2005. How the blind “see” Braille: Lessons from functional magnetic resonance imaging.
Neuroscientist 11: 577–582.
Sadato, N., A. Pascual-Leone, J. Grafman, M. P. Deiber, V. Ibanez, and M. Hallett. 1998. Neural networks for
Braille reading by the blind. Brain 121: 1213–1229.
Sadato, N., A. Pascual-Leone, J. Grafman, V. Ibanez, M. P. Deiber, G. Dold, and M. Hallett. 1996. Activation of
the primary visual cortex by Braille reading in blind subjects. Nature 380: 526–528.
Sampaio, E., S. Maris, and P. Bach-y-Rita. 2001. Brain plasticity: ‘Visual’ acuity of blind persons via the
tongue. Brain Res 908: 204–207.
Sathian, K. 2000. Practice makes perfect: Sharper tactile perception in the blind. Neurology 54: 2203–2204.
Sathian, K. 2005. Visual cortical activity during tactile perception in the sighted and the visually deprived. Dev
Psychobiol 46: 279–286.
Sathian, K., A. Zangaladze, J. M. Hoffman, and S. T. Grafton. 1997. Feeling with the mind’s eye. Neuroreport
8: 3877–3881.
Schlaggar, B. L., and D. D. O’Leary. 1991. Potential of visual cortex to develop an array of functional units
unique to somatosensory cortex. Science 252: 1556–1560.
Schmidt, E. M., M. J. Bak, F. T. Hambrecht, C. V. Kufta, D. K. O’Rourke, and P. Vallabhanath. 1996. Feasibility
of a visual prosthesis for the blind based on intracortical microstimulation of the visual cortex. Brain
119(Pt 2): 507–522.
Schorr, E. A., N. A. Fox, V. van Wassenhove, and E. I. Knudsen. 2005. Auditory–visual fusion in speech percep-
tion in children with cochlear implants. Proc Natl Acad Sci U S A 102: 18748–18750.
Schroeder, C. E., J. Smiley, K. G. Fu, T. McGinnis, M. N. O’Connell, and T. A. Hackett. 2003. Anatomical
mechanisms and functional implications of multisensory convergence in early cortical processing. Int J
Psychophysiol 50: 5–17.
Schroeder, C. E., and J. Foxe. 2005. Multisensory contributions to low-level, ‘unisensory’ processing. Curr
Opin Neurobiol 15: 454–458.
Segond, H., D. Weiss, and E. Sampaio. 2007. A proposed tactile vision–substitution system for infants who are
blind tested on sighted infants. J Vis Impair Blind 101: 32–43.
Sharma, J., A. Angelucci, and M. Sur. 2000. Induction of visual orientation modules in auditory cortex. Nature
404: 841–847.
Shaw, C. A., and J. C. Mceachern. 2000. Transversing levels of organization: A theory of neuronal stability and
plasticity. In Toward a theory of neuroplasticity, ed. C. A. Shaw and J. C. Mceachern. New York: Taylor
and Francis.
Shimony, J. S., H. Burton, A. A. Epstein, D. G. McLaren, S. W. Sun, and A. Z. Snyder. 2006. Diffusion tensor
imaging reveals white matter reorganization in early blind humans. Cereb Cortex 16: 1653–1661.
Sinha, P. 2003. Face classification following long-term visual deprivation. J Vis 3: 104.
Smith, M., E. A. Franz, S. M. Joy, and K. Whitehead. 2005. Superior performance of blind compared with
sighted individuals on bimanual estimations of object size. Psychol Sci 16: 11–14.
Smits, B., and M. J. C. Mommers. 1976. Differences between blind and sighted children on WISC Verbal
Subtests. New Outlook Blind 70: 240–246.
Spelman, F. A. 2006. Cochlear electrode arrays: Past, present and future. Audiol Neurootol 11: 77–85.
Stilla, R., R. Hanna, X. Hu, E. Mariola, G. Deshpande, and K. Sathian. 2008. Neural processing underlying
tactile microspatial discrimination in the blind: A functional magnetic resonance imaging study. J Vis 8:
13.1–13.19.
Tal, N., and A. Amedi. 2009. Multisensory visual–tactile object-related network in humans: Insights from a
novel crossmodal adaptation approach. Exp Brain Res 198: 165–182.
Tillman, M. H., and W. L. Bashaw. 1968. Multivariate analysis of the WISC scales for blind and sighted chil-
dren. Psychol Rep 23: 523–526.
Troyk, P., M. Bak, J. Berg et al. 2003. A model for intracortical visual prosthesis research. Artif Organs 27:
1005–1015.
Uhl, F., P. Franzen, G. Lindinger, W. Lang, and L. Deecke. 1991. On the functionality of the visually deprived
occipital cortex in early blind persons. Neurosci Lett 124: 256–259.
Ungerleider, L. G., and M. Mishkin. 1982. Two cortical visual systems. In Analysis of visual behavior, ed. D. J.
Ingle, M. A. Goodale, and R. J. W. Mansfield. Boston: MIT Press.
Van Atteveldt, N., E. Formisano, R. Goebel, and L. Blomert. 2004. Integration of letters and speech sounds in
the human brain. Neuron 43: 271–282.
Vanlierde, A., and M. C. Wanet-Defalque. 2004. Abilities and strategies of blind and sighted subjects in visuo-
spatial imagery. Acta Psychol (Amst) 116: 205–222.
Veraart, C., M. C. Wanet-Defalque, B. Gerard, A. Vanlierde, and J. Delbeke. 2003. Pattern recognition with the
optic nerve visual prosthesis. Artif Organs 27: 996–1004.
Von Melchner, L., S. L. Pallas, and M. Sur. 2000. Visual behaviour mediated by retinal projections directed to
the auditory pathway. Nature 404: 871–876.
Voss, P., M. Lassonde, F. Gougoux, M. Fortin, J. P. Guillemot, and F. Lepore. 2004. Early- and late-onset blind
individuals show supra-normal auditory abilities in far-space. Curr Biol 14: 1734–1738.
Wakefield, C. E., J. Homewood, and A. J. Taylor. 2004. Cognitive compensations for blindness in children: An
investigation using odour naming. Perception 33: 429–442.
Wallace, M. 2004a. The development of multisensory processes. Cogn Process 5: 69–83.
Wallace, M. T. 2004b. The development of multisensory integration. In The handbook of multisensory pro-
cesses, ed. G. Calvert, C. Spence, and B. E. Stein. Cambridge, MA: MIT Press.
Ward, J., and P. Meijer. 2009. Visual experiences in the blind induced by an auditory sensory substitution
device. Conscious Cogn 19: 492–500.
Warren, D. H. 1994. Blindness and children: An individual differences approach. New York: Cambridge Univ.
Press.
Weiland, J. D., W. Liu, and M. S. Humayun. 2005. Retinal prosthesis. Annu Rev Biomed Eng 7: 361–401.
Weiland, J. D., and M. S. Humayun. 2008. Visual prosthesis. Proc IEEE 96: 1076–1084.
West, E. L., R. A. Pearson, R. E. MacLaren, J. C. Sowden, and R. R. Ali. 2009. Cell transplantation strategies
for retinal repair. Prog Brain Res 175: 3–21.
Wiesel, T. N., and D. H. Hubel. 1963. Single-cell responses in striate cortex of kittens deprived of vision in one
eye. J Neurophysiol 26: 1003–1017.
Wiesel, T. N., and D. H. Hubel. 1965. Comparison of the effects of unilateral and bilateral eye closure on corti-
cal unit responses in kittens. J Neurophysiol 28: 1029–1040.
Wittenberg, G. F., K. J. Werhahn, E. M. Wassermann, P. Herscovitch, and L. G. Cohen. 2004. Functional connec-
tivity between somatosensory and visual cortex in early blind humans. Eur J Neurosci 20: 1923–1927.
Yu, C., Y. Liu, J. Li et al. 2008. Altered functional connectivity of primary visual cortex in early blindness.
Hum Brain Mapp 29(5): 533–543.
Zangaladze, A., C. M. Epstein, S. T. Grafton, and K. Sathian. 1999. Involvement of visual cortex in tactile
discrimination of orientation. Nature 401: 587–590.
Zwiers, M. P., A. J. Van Opstal, and J. R. Cruysberg. 2001. A spatial hearing deficit in early-blind humans.
J Neurosci 21: RC142: 1–5.
22 Visual Abilities in Individuals
with Profound Deafness
A Critical Review
Francesco Pavani and Davide Bottari

CONTENTS
22.1 Visual Abilities in Profound Deafness: An Open Challenge for Cross-Modal Plasticity
Research................................................................................................................................. 423
22.1.1 Multiple Operational Definitions............................................................................... 425
22.1.2 Making Sense of Heterogeneity................................................................................ 426
22.2 A Task-Oriented Review of Empirical Evidence................................................................... 427
22.2.1 Perceptual Thresholds Tasks..................................................................................... 427
22.2.2 Simple Detection and Lateralization Tasks............................................................... 430
22.2.3 Visual Search Tasks................................................................................................... 432
22.2.4 Visual Discrimination and Identification Tasks........................................................ 434
22.2.4.1 Visual Discrimination with Flanker Interference . .................................... 436
22.2.5 Visual Tasks of Higher Complexity........................................................................... 438
22.3 A Transversal View on Literature.........................................................................................440
22.3.1 Enhanced Reactivity Rather than Enhanced Perceptual Processing........................440
22.3.2 Role of Deaf Sample Characteristics and Visual Stimulus Characteristics Are
Relevant but Not Critical........................................................................................... 441
22.3.3 Role of Target Eccentricity and Selective Visual Attention Is Critical but
Underspecified........................................................................................................... 441
22.4 Conclusions and Future Directions........................................................................................ 443
Acknowledgments...........................................................................................................................444
References.......................................................................................................................................444

22.1 VISUAL ABILITIES IN PROFOUND DEAFNESS: AN OPEN
CHALLENGE FOR CROSS-MODAL PLASTICITY RESEARCH

The world is inherently multisensory, and our ability to interact with it largely depends on the
capability of our cognitive system to coherently use and integrate such a variety of sensory inputs.
Consider, for instance, the way in which we monitor the environment. In humans, vision plays a
crucial role in informing the cognitive system about the spatial layout of the scene, and in recog-
nizing objects and events. However, during steady fixation of gaze in one direction, the visual field
typically extends 100° laterally on either side, 60° upward, and 75° downward (Harrington 1971).
This leaves a large portion of the surrounding environment unexplored to vision, unless constant
eye, head, and trunk movements are performed. Other distal senses, such as hearing or smell, can
overcome this visual field limitation, providing inputs about regions of the environment beyond
the boundaries of current visual perception. These additional sensory modalities can inform our
cognitive system about stimuli that occur behind our body, are hidden by visual obstacles, or occur
very far in space. In particular, hearing can provide a good estimate of the most likely location in
space of the nonvisible stimulus (see Heffner and Heffner 1992 for a cross-species evaluation of the
relationship between the ability to localize a sound and the width of the field of best vision). In addi-
tion, hearing constantly models the acoustic regularity in the environment and reacts to violations
of such regularity, regardless of the current behavioral goal of the individual (Näätänen 1992). Thus,
audition constitutes a fundamental guidance for reorienting our exploratory behavior. Efficient inte-
gration of sensory inputs from audition and vision is therefore essential for successful exploration
of the surrounding environment.
The way our cognitive system perceives the multisensory environment in which we live leads to
a fundamental question that has long been debated among scientists and philosophers: What are the
consequences of the absence of one sensory modality for cognition and multisensory perception?
For instance, what are the consequences of long-term auditory deprivation due to profound deaf-
ness for the remaining sensory modalities, mainly vision and touch? An interest for this issue can
be traced back at least to the seventeenth century (for historical reviews, see Hartmann 1933; Jordan
1961), and two opposing hypotheses have traditionally been put forward to account for the impact
of sensory deprivation (i.e., deafness or blindness) on the remaining senses. The first hypothesis is
that a substantial deficit in one sensory modality could affect the development and organization of
the other sensory systems. We will refer to this first perspective as the perceptual deficit hypothesis.
When applied to the case of profound deafness, the perceptual deficit hypothesis predicts poorer
visual and tactile perceptual performance in deaf individuals, as compared to the age-matched hear-
ing controls (e.g., Myklebust 1964). This hypothesis was based on the assumption that auditory defi-
ciency can have a direct impact on the development of the other senses. In addition, it assumed that
any language impairments resulting from profound deafness would limit hearing-impaired children
in their interaction with the world, and result in a cognitive development lag in perceptual and cog-
nitive tasks (Furth 1966). The second hypothesis is that a deficit in one sensory system would make
the other modalities more sensitive, vicariously compensating for the loss of one sensory channel
(e.g., Gibson 1969). We will refer to this second perspective as the sensory compensation hypoth-
esis. When applied to the case of profound deafness, the sensory compensation hypothesis predicts
that the visual and tactile modalities will show enhanced sensitivity. The latter prediction is often
stated both in terms of behavioral consequences of deafness, and in terms of its neural outcomes.
Specifically, the neural implications of the sensory compensation hypothesis are that the brain areas
serving the impaired sensory modality may develop the ability to process perceptual inputs from
one or more of the intact sensory systems (functional reallocation account), or alternatively that
brain areas of the remaining senses may acquire enhanced functional and processing capabilities
(remaining senses hypertrophy account).
After more than 30 years of systematic research conducted mainly on the visual abilities of pro-
foundly deaf individuals, it is apparent that the long-standing debate as to whether perceptual and
cognitive functions of deaf individuals are deficient or supranormal is far from being settled. Several
reviews of this literature (e.g., Parasnis 1983; Bavelier et al. 2006; Mitchell and Maslin 2007) clearly
indicate that deaf and hearing individuals perform comparably on a number of perceptual tasks. As
we shall see later (see Section 22.2.1), this conclusion is strongly supported by tasks involving basic
perceptual thresholds. Instead, other studies have revealed a differential performance in the two
groups, either in the direction of deficient abilities in deaf compared to hearing participants (e.g., Quittner
et al. 2004; Parasnis et al. 2003), or in the direction of supranormal abilities for the deaf population
(e.g., Bottari et al. 2010; Loke and Song 1991; Neville and Lawson 1987). In this context, it should
perhaps be emphasized that in the absence of clear behavioral differences between deaf and hear-
ing participants, even the most striking differences between the two groups observed at the neural
level cannot disentangle the perceptual deficit hypothesis from the sensory compensation
hypothesis. For instance, much of the renewed interest in the study of visual abilities in deaf indi-
viduals has been motivated by the seminal work of Neville et al. (1983). In that study, visual evoked
potentials (VEPs) recorded from the scalp of eight congenitally deaf adults were significantly larger
over both auditory and visual cortices, with respect to those of eight hearing controls, specifically
for visual stimuli occurring in the periphery of the visual field (8.3°). Although this pioneering work
implies that the lack of auditory experience from an early age can influence the organization of the
human brain for visual processing [a finding that was later confirmed and extended by many other
studies using different methodologies for the recording of brain responses; e.g., electroencephalo-
gram (EEG): Neville and Lawson 1987; magnetoencephalography: Finney et al. 2003; functional
magnetic resonance imaging: Bavelier et al. 2000, 2001], in the absence of a behavioral difference
between the two groups it remains potentially ambiguous whether modifications at the neural level
are an index of deficiency or compensation. In other words, even if one assumes that larger visual
evoked components (e.g., Neville et al. 1983; Neville and Lawson 1987) or stronger BOLD responses
(e.g., Bavelier et al. 2000, 2001) indicate enhanced processing of the incoming input, if this is not
accompanied by behavioral enhancement it is difficult to conclude that it really serves some adap-
tive functional role. Unfortunately, the current evidence in the literature lacks this explanatory power.
With the sole exception of the work by Neville and Lawson (1987), all other neuroimaging studies
focused on measures of brain response alone, instead of combined measures of brain response and
behavior. Furthermore, conclusive evidence that cortical reorganization serves a functional role can
only originate from the observation that interfering with the reorganized brain response [e.g., using
transcranial magnetic stimulation (TMS)] impairs the supranormal behavioral performance in the
sensory-deprived participants (e.g., see Cohen et al. 1997 for an example of abolished supranormal
tactile discrimination in the blind, following disruption of occipital lobe function using TMS).

22.1.1  Multiple Operational Definitions


Resolving the controversy between deficient and compensatory behavioral outcomes of pro-
found deafness requires, first of all, a clear operational definition of the concept of “enhanced
visual abilities in deaf individuals.” On one hand, the question “Do deaf individuals see better?”
(e.g., Rettenbach et al. 1999; Bavelier et al. 2006) is provocatively broad and calls for a specifica-
tion of the domains of visual perception in which the sensory compensation hypothesis is to be
tested for the case of deafness. On the other hand, a definition centered on the sole concept of
enhanced sensitivity (e.g., Bross 1979a) is perhaps too limited, as it implies that the compensation
hypothesis can only be true whenever discrimination sensitivity of deaf individuals is better than
that measured in age-matched hearing controls. The concept of sensitivity refers to the ability of a
perceptual system to discriminate a signal (e.g., a target) from noise (e.g., background events), and
it is best described within the theoretical framework of the signal detection theory (SDT; Green and
Swets 1966). In particular, SDT allows distinguishing sensitivity (expressed by the d′ index) from
the observer’s response criterion (expressed by the c or β indices). Although SDT is largely
considered a cornerstone of the study of perception, it is worth noting that the studies on visual
abilities in deaf individuals have very rarely used the SDT approach to describe performance (see
Bross 1979a, 1979b; Neville and Lawson 1987; Bosworth and Dobkins 1999, 2002a, 2002b; Bottari
et al., in preparation).
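For reference, under the standard equal-variance Gaussian SDT model (a textbook formulation, not one tied to any particular study reviewed here), the two classes of indices are computed from the hit rate H and the false-alarm rate F as
\[
d' = \Phi^{-1}(H) - \Phi^{-1}(F), \qquad
c = -\tfrac{1}{2}\bigl[\Phi^{-1}(H) + \Phi^{-1}(F)\bigr], \qquad
\ln \beta = d' \cdot c ,
\]
where \(\Phi^{-1}\) denotes the inverse of the standard normal cumulative distribution function. The point of the distinction is that a group difference in d′ reflects perceptual sensitivity, whereas a difference confined to c or β reflects response bias.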
The first aim of this review is to provide a detailed description of the empirical evidence of visual
abilities in profound deafness, structured as a function of the visual tasks that have been adopted
by the different investigators and the dependent variable considered in the analyses. We start by
describing the studies that investigated perceptual thresholds in the visual and tactile modalities,
which gave an operational definition of enhanced visual ability in terms of better low-level sensitiv-
ity to the stimulus. Second, we describe studies that centered on simple detection or lateralization
(left/right) responses, which gave an operational definition of enhanced visual ability in terms of
faster response to a target onset. Third, we review studies that adopted visual search tasks, which
gave an operational definition in terms of efficiency in searching for a target feature in the visual
scene. Fourth, we review reports that centered on discrimination and identification of suprathresh-
old stimuli, which gave an operational definition of enhanced ability in terms of better recognition
of perceptual events. Finally, we conclude with a section on visual tasks of higher complexity that
extended the operational definition to include the contribution of visual working memory and dual
task performance.

22.1.2  Making Sense of Heterogeneity


In addition to the controversy between “deficit” and “compensation” accounts, another critical issue
in this research domain concerns the understanding of which aspect may be transversal to the dif-
ferent behavioral tasks, and may possibly explain the heterogeneity of the empirical results.
The first transversal aspect that may account for the heterogeneity of the results is the diversity in
the deaf sample characteristics. As originally pointed out by Hoemann (1978), in choosing deaf par-
ticipants several studies have not controlled for differences in the amount of hearing loss, etiology of
deafness, time from deafness onset at testing, and language(s) or mode(s) of communication used by
deaf participants (see also Parasnis 1983). Recently, Bavelier and colleagues (2006) suggested that
these differences in the deaf population sample can largely account for the heterogeneity in the liter-
ature. Specifically, they argued that studies reporting deficient visual functions in deaf compared to hearing
individuals typically included deaf participants with heterogeneous backgrounds, whereas studies
that have documented enhanced visual functions only included “deaf native signers” (i.e., individu-
als with no associated central nervous system damage and born profoundly deaf to deaf parents;
Bavelier et al. 2006, p. 512). This specific deaf group achieves language development milestones at
the same rate and time as hearing individuals, thus giving the opportunity to investigate the effects
of auditory deprivation free from other confounding factors, such as language deprivation or
atypical cognitive development due to communication deficiencies. As we shall see later (see Section
22.3.2), although a selection of deaf participants on the basis of the criteria proposed by Bavelier et
al. (2006) has great methodological benefits, it appears unlikely that the heterogeneity in the empiri-
cal evidence can be reduced to this aspect alone. Furthermore, restricting the analysis only to “deaf
native signers” would greatly limit generalization of the results, as this subgroup represents only 5%
of the total deaf population (at least in the United States; see Mitchell and Karchmer 2002).
The second important aspect that has often been emphasized as a potential source of heterogene-
ity in the empirical evidence concerns the visual characteristics of the target stimulus. Several authors
(e.g., Armstrong et al. 2002; Bavelier et al. 2006; Neville and Bavelier 2002) have proposed that
enhanced visual abilities in deaf individuals may emerge selectively for the analysis of visual fea-
tures that are preferentially processed within the visual-for-action pathway (also termed “motion
pathway”), associated with the dorsal visual stream (Milner and Goodale 1995). For instance, an
event-related potential (ERP) study by Armstrong and colleagues (2002) revealed enhanced corti-
cal responses (larger N1 components) in deaf than in hearing adults in response to task-irrelevant
motion stimuli at peripheral locations. Importantly, when cortical activity was compared between
groups for stimuli varying along the color dimension (a visual feature preferentially processed by
the ventral visual stream), enhanced cortical responses for deaf than hearing participants were no
longer evident. Motion stimuli have also been shown to activate the MT+ complex more strongly in
deaf than in hearing individuals using functional neuroimaging (Bavelier et al. 2000, 2001), and to
activate the right auditory cortex in the deaf participants (Fine et al. 2005; Finney et al. 2001).
The third aspect that has systematically been described as critical for enhanced visual abilities
in deaf people is the eccentricity of the visual stimulus. The main working hypothesis for several
investigations in this field has been that any visual enhancement in deaf individuals should emerge
particularly for visual stimuli appearing toward the periphery of the visual field (e.g., Parasnis
1983; Neville and Lawson 1987). This prediction stems from the observation that, under normal
conditions, the auditory system provides important information about the events that occur outside
the field of view. Therefore, in the absence of audition, visual processing might recalibrate to favor
visual events outside the fovea, in an attempt to monitor the environment through peripheral vision
(e.g., Loke and Song 1991; Parasnis and Samar 1985). As shall be shown, a number of independent
studies have provided general support to the hypothesis that peripheral regions of the visual field
have a different status for deaf individuals with respect to hearing controls. However, the actual
visual eccentricities associated with the terms “central,” “perifoveal,” and “peripheral” consider-
ably varied across the different studies. Researchers have referred to stimulus location as “central”
both when the stimulus was presented directly at fixation (e.g., Poizner and Tallal 1987) and when
it was perifoveal (e.g., Neville and Lawson 1987). More critically, the term “peripheral” has been
applied to locations in the visual field ranging from 3° of eccentricity (e.g., Chen et al. 2006) to 20°
or more (e.g., Colmenero et al. 2004; Loke and Song 1991; Stevens and Neville 2006). As pointed
out by Reynolds (1993), this ambiguity in the adopted terminology originates from the fact that the
boundaries of the foveal region (up to 1.5° from fixation) are well defined by anatomical structures,
whereas the distinction between perifoveal and peripheral visual field is not.
Finally, most researchers have suggested that spatial selective attention plays a key role in modulat-
ing visual responses in deaf individuals (e.g., Bavelier et al. 2006; Dye et al. 2008; Loke and Song 1991;
Neville and Lawson 1987; Parasnis and Samar 1985; Sladen et al. 2005). This suggestion originated
from the studies that examined attention orienting in deaf and hearing participants (e.g., Colmenero et
al. 2004; Parasnis and Samar 1985) and found that deaf individuals pay less of a cost when detecting a
target occurring at invalidly cued locations. Furthermore, a potential difference in selective attention
has been proposed by those studies that examined the interference of flankers on target discrimination
(Proksch and Bavelier 2002; Sladen et al. 2005) and found that deaf individuals were more suscep-
tible to peripheral flankers than hearing controls. Finally, the suggestion that employment of selective
attention resources is the key requisite for revealing differences between deaf and hearing participants
has emerged from the empirical observation that differences between deaf individuals and hearing
controls have sometimes emerged specifically when attention was endogenously directed to the target
(e.g., Bavelier et al. 2000; Neville and Lawson 1987; but see Bottari et al. 2008).
However, whether all aspects of visual enhancement in deaf individuals are necessarily linked to
allocation of selective attention in space is still a matter of debate. Furthermore, it is well acknowl-
edged that selective spatial attention is not a unitary mechanism, and at least two functionally and
anatomically distinct mechanisms of spatial attention have been identified (Corbetta and Shulman
2002; Jonides 1981; Mayer et al. 2004; Posner 1980). Visual attention can be oriented to an object
or a location in a bottom-up fashion, because an abrupt change in visual luminance at the retinal
level has occurred in a specific region of the visual field. This type of attention orienting is entirely
automatic and has typically been referred to as exogenous orienting. Alternatively, visual attention
can be summoned to an object or a location because of its relevance for the behavioral goal of the
individual. This type of top-down attention orienting is voluntary and strategic, and has typically
been referred to as endogenous orienting. Whether one or both of the components of selective atten-
tion are changed as a consequence of deafness remains an open question. Thus, whenever the claim
that “early deafness results in a redistribution of attentional resources to the periphery” is made
(e.g., Dye et al. 2008, p. 75), one should also ask which aspect of selective attention (endogenous,
exogenous, or both) is changed by profound deafness.
In sum, four distinct transversal aspects may contribute to explain the heterogeneity of the
empirical results in the different behavioral tasks: diversity in the deaf sample characteristics, visual
characteristics of the target stimulus, target eccentricity, and role of selective spatial attention. The
second aim of the present review is to reevaluate the empirical evidence in support of these four
different (but possibly interrelated) aspects in modulating visual abilities in deaf individuals.

22.2  A TASK-ORIENTED REVIEW OF EMPIRICAL EVIDENCE


22.2.1  Perceptual Thresholds Tasks
One of the first studies to investigate perceptual thresholds in deaf individuals was conducted by
Bross (1979a), who tested brightness discrimination sensitivity in six deaf and six hearing children
(11 years old on average) for two circular patches of white light presented at 4.8° of eccentricity,
on opposite sides with respect to the participant’s body midline. Initially, the just noticeable dif-
ference (JND) between the two patches was measured for each participant. Then, brightness for
one of the two stimuli (variable) was set to 0.75 JND units above or equal to the other (standard),
and participants were instructed to indicate whether the variable stimulus was brighter or equal in
apparent brightness with respect to the standard. In the latter task, the probability that the variable
stimulus was brighter than the standard changed between blocks, from less likely (0.25), to equal
(0.50), to more likely (0.75). Deaf and hearing participants showed comparable JNDs for brightness
discrimination. However, deaf participants’ sensitivity in the forced-choice task, as measured by d′,
was better than that of hearing controls. Intriguingly, deaf performance was entirely unaffected by the probability
manipulation (i.e., deaf participants maintained a stable criterion, as measured by β), unlike hearing
controls who became more liberal in their criterion as stimulus probability increased. However, the
same two groups of participants showed comparable sensitivity (d′) when retested in a second study
with largely comparable methods (Bross 1979b). In addition, in one further study adapting the same
paradigm for visual-flicker thresholds, no difference between deaf and hearing controls emerged in
terms of d′ or β (Bross and Sauerwein 1980). This led Bross and colleagues (Bross 1979a, 1979b;
Bross and Sauerwein 1980) to conclude that no enhanced sensory sensitivity is observed in deaf
children, in disagreement with the sensory compensation hypothesis.
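As a point of reference for the criterion result, standard SDT also specifies how an ideal observer should shift β with stimulus probability. Under symmetric payoffs (an assumption made here purely for illustration, not reported by Bross), the accuracy-maximizing criterion when the signal occurs with prior probability p is
\[
\beta_{\mathrm{opt}} = \frac{1 - p}{p},
\]
that is, 3, 1, and 1/3 for the 0.25, 0.50, and 0.75 probability conditions. A progressively more liberal criterion with increasing signal probability, as shown by the hearing controls, is therefore the normatively expected adjustment, which makes the criterion stability of the deaf children all the more striking.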
Finney and Dobkins (2001) reached a similar conclusion when measuring contrast sensitivity
to moving stimuli in 13 congenital or early deaf adult participants (all signers), 14 hearing subjects
with no signing experience, and 7 hearing subjects who signed from birth [Hearing Offspring of
Deaf parents (HOD)]. Stimuli were black and white moving sinusoidal gratings presented for 300
ms to the left or to the right of one visual marker, and the participant’s task was to report whether the
stimulus appeared to the left or to the right of the marker. Five markers were visible throughout the
task (the central fixation cross and four dots located at 15° of eccentricity with respect to fixation).
The stimulus could appear next to any of the five markers, thus forcing participants to distribute
their visual attention across several visual locations. The luminance contrast required to yield 75%
correct performance was measured for each participant across a range of 15 different combina-
tions of spatial and temporal frequency of the stimulus. Regardless of all these manipulations,
deaf, hearing, and HOD participants performed comparably on both central and peripheral stimuli,
leading to the conclusion that neither deafness nor sign-language use leads to overall increases or
decreases in absolute contrast sensitivity (Finney and Dobkins 2001, p. 175). Stevens and Neville
(2006) expanded this finding by showing that contrast sensitivity was comparable in 17 congenitally
deaf and 17 hearing individuals, even for stimuli delivered in the macula of the participant, at 2°
around visual fixation (see also Bavelier et al. 2000, 2001, for further evidence of comparable lumi-
nance change detection in deaf and hearing individuals). Interestingly, a between-group difference
was instead documented when the task was changed to unspeeded detection of a small (1 mm) white
light, moving from the periphery to the center of the visual field. In this kinetic perimetry task,
deaf participants showed an enlarged field of view (about 196 cm²) with respect to hearing controls
(180 cm²), regardless of stimulus brightness.
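A brief note on terminology may help here and in the studies that follow: a “75% correct threshold” is the stimulus level at which performance reaches 75% correct, whether estimated from a fitted psychometric function or from an adaptive staircase converging on that level (the specific procedure varies across, and is not always reported by, the studies reviewed). As an illustration only, for a two-alternative task with a 50% guessing rate one may write
\[
\Psi(x) = 0.5 + 0.5\,F(x), \qquad \text{threshold } x_{75}: \ \Psi(x_{75}) = 0.75 ,
\]
where F is any sigmoid rising from 0 to 1 (e.g., a Weibull or cumulative Gaussian), so that the 75% point corresponds to the halfway mark between chance and perfect performance.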
The latter finding suggests that perceptual thresholds may differ for deaf and hearing individuals
when motion stimuli are employed. However, three further investigations (Bosworth and Dobkins
1999, 2002a; Brozinsky and Bavelier 2004) that examined the performance of deaf and hearing
participants in motion discrimination tasks indicate that this is not always the case. Bosworth and
Dobkins (1999) tested 9 congenital or early deaf (all signers) and 15 hearing (nonsigner) adults in
a motion direction–discrimination task. The stimulus consisted of a field of white dots presented
within a circular aperture, in which a proportion of dots (i.e., signal dots) moved in a coherent direc-
tion (either left or right), whereas the remaining dots (i.e., noise dots) moved in a random fashion.
Similar to the study of Finney and Dobkins (2001), stimuli were either presented at central fixa-
tion, or 15° to the left or to the right of fixation. Participants were instructed to report the direction
of motion with a key press, and the proportion of coherent motion signal yielding 75% correct
performance was measured for each participant. Mean thresholds did not differ between deaf and
hearing controls, regardless of stimulus eccentricity (central or peripheral), stimulus duration (250,
400, or 600 ms) and vertical location of the lateralized stimuli (upper or lower visual field). The
only between-group difference concerned the performance across the two visual hemifields. Deaf
participants exhibited a right visual field (RVF) advantage, whereas hearing controls exhibited a
slight left visual field (LVF) advantage. The latter finding, however, reflected the signing experience
rather than auditory deprivation, and resulted from the temporal coincidence between visual and
linguistic input in the left hemisphere of experienced signers, as subsequently shown by the same
authors (Bosworth and Dobkins 2002b). A convergent pattern of results emerged from the study by
Bosworth and Dobkins (2002a), in which 16 deaf signers (12 congenital), 10 hearing signers, and 15
hearing controls were asked to detect, within a circular aperture, the direction of motion of a pro-
portion of dots moving coherently (leftward or rightward), whereas the remaining dots moved in a
random fashion. The proportion of dots moving coherently varied across trials, to obtain a threshold
for the number of coherently moving dots necessary to yield 75% correct discriminations. The
results showed that all groups of participants performed comparably in terms of thresholds, suggest-
ing that deafness does not modulate motion thresholds.
Convergent findings also emerged from a study by Brozinsky and Bavelier (2004), in which 13
congenitally deaf (signers) and 13 hearing (nonsigner) adults were asked to detect velocity increases
in a ring of radially moving dots. On each trial, dots accelerated in one quadrant and participants
indicated the location of this velocity change in a four-alternative forced choice. Across experi-
ments, the field of dots extended between 0.5° and 8°, or between 0.4° and 2° (central field), or
between 12° and 15° (peripheral field). The temporal duration of the velocity change yielding
79% correct was measured for each participant. Regardless of whether the dots moved centrally or
peripherally, velocity thresholds were equivalent for deaf and hearing individuals. Similar to the
study by Bosworth and Dobkins (1999), deaf signers displayed better performance in the RVF than
the LVF, again as a possible result of their fluency in sign language.
Equivalent performance in deaf and hearing individuals has been documented also when
assessing temporal perceptual thresholds (e.g., Bross and Sauerwein 1980; Poizner and Tallal
1987; Nava et al. 2008; but see Heming and Brown 2005). Poizner and Tallal (1987) conducted a
series of experiments to test temporal processing abilities in 10 congenitally deaf and 12 hearing
adults. Two experiments examined flicker fusion thresholds for a single circle flickering on and
off at different frequencies, or for two circles presented in sequence with variable interstimulus
interval (ISI) (Poizner and Tallal 1987; Experiments 1 and 2). One additional experiment tested
temporal order judgment abilities for pairs or triplets of visual targets presented in sequence
(Poizner and Tallal 1987; Experiment 3). All visual targets appeared from the same central spatial
location on the computer screen and participants were asked to report the correct order of target
appearance. No difference between deaf and hearing participants emerged across these tasks.
More recently, Nava et al. (2008) tested 10 congenital or early deaf adults (all signers), 10 hear-
ing controls who were auditorily deprived during testing, and 12 hearing controls who were not subjected to
any deprivation procedure, in a temporal order judgment task for pairs of visual stimuli presented at
perifoveal (3°) or peripheral (8°) visual eccentricities. Regardless of stimulus eccentricity, tem-
poral order thresholds (i.e., JNDs) and points of subjective simultaneity did not differ between
groups. Notably, however, faster discrimination responses were systematically observed in deaf
than hearing participants, especially when the first of the two stimuli appeared at peripheral loca-
tions (Nava et al. 2008).
Finally, one study testing perceptual threshold for frequency discrimination in the tactile modal-
ity also confirmed the conclusion of comparable perceptual thresholds in deaf and hearing individu-
als (Levanen and Hamdorf 2001). Six congenitally deaf (all signers) and six hearing (nonsigners)
adults were asked to decide whether the frequency difference between a reference stimulus (at 200
Hz) and a test stimulus (varying between 160 and 250 Hz) was “rising” or “falling.”
The frequency difference between the two stimuli that yielded 75% correct responses was measured
for each participant. Although the frequency difference threshold was numerically smaller for deaf
than hearing participants, no statistically significant difference emerged.
In sum, the studies that have adopted perceptual thresholds to investigate the consequences of
deafness on vision and touch (i.e., used an operational definition of better performance in terms of
better low-level sensitivity to the stimulus) overall documented an entirely comparable performance
between deaf and hearing individuals. Importantly, these findings emerged regardless of whether
hearing-impaired participants were congenitally deaf born from deaf parents or early deaf. One
clear example of this is the comparison between the study by Poizner and Tallal (1987) and Nava
et al. (2008), which tested genetically versus early deaf on a comparable temporal order judgment
task, and converged to the same conclusion. The absence of a difference at the perceptual level also
emerged regardless of stimulus feature and eccentricity, i.e., regardless of whether target stimuli
were static (e.g., Bross 1979a, 1979b) or moving (e.g., Bosworth and Dobkins 1999; Brozinsky and
Bavelier 2004), and regardless of whether they appeared at central (e.g., Bosworth and Dobkins
1999; Brozinsky and Bavelier 2004; Poizner and Tallal 1987; Stevens and Neville 2006) or periph-
eral locations (e.g., Bosworth and Dobkins 1999; Brozinsky and Bavelier 2004; Nava et al. 2008).
Finally, making the stimulus location entirely predictable (Bross 1979a; Poizner and Tallal 1987) or
entirely unpredictable (e.g., Bosworth and Dobkins 1999; Brozinsky and Bavelier 2004) also had no
effect, indicating that comparable performance of deaf and hearing participants was not modulated
by the direction of selective visual attention in the scene. The only notable discrepancy with respect
to this very consistent pattern of results is the observation of Stevens and Neville (2006) that deaf
individuals possess a larger field of view with respect to hearing controls in the kinetic perimetry
task. It would be interesting to examine whether this finding can also be replicated with stationary
targets at the extreme visual periphery.

22.2.2  Simple Detection and Lateralization Tasks


Another approach to the study of visual abilities in profound deafness has been the direct assess-
ment of the reactivity of deaf individuals in response to simple visual events or the assessment of
their lateralization abilities (left vs. right response). One important aspect to note concerning these
seemingly simple tasks is that any advantage measured using these procedures could reflect faster
processing of the perceptual events, faster response preparation or release, or a combination of the
two. Many of the early studies on visual abilities in deaf individuals that aimed to test visual speed
(e.g., the classic article by Doehring and Rosenstein 1969, entitled “Speed of visual perception in
deaf children”; see also Olson 1967; Hartung 1970) actually examined unspeeded discriminations
and visual memory abilities for stimuli presented tachistoscopically. Thus, they are not directly
informative about the speed of visual processing and the speed of response in deaf people.
Loke and Song (1991) were among the first to compare 20 congenital or early-deaf high school
students and 19 hearing controls, in a task requiring simple detection of an asterisk briefly appear-
ing on the computer screen. The asterisk was presented either at fixation (0.5°), or in the visual
periphery (25°), and the task was always performed in monocular vision. The results documented
faster responses for deaf than hearing controls (85 ms on average), selectively for targets appearing
at peripheral locations. Interestingly, a similar between-group difference was also numerically evi-
dent for central locations (38 ms), and perhaps fell short of significance because of the very limited
number of trials in each experimental condition (20 trials overall, 10 for each target location).
Two years later, Reynolds (1993) also examined a group of 16 adult participants with early deaf-
ness (before 3 years of age, all signers) and 16 hearing controls, in two speeded detection tasks
to visual stimuli presented using a tachistoscope. In one task (baseline measure; Reynolds 1993,
p. 531), simple detection response times (RTs) were recorded in response to a black circular target,
presented for 70 ms directly at fixation, in the absence of any peripheral stimulus. In the other
task, participants were required to make a speeded bilateral key press to indicate the side of a
perifoveal target (4°), by pressing a button located to the left or to the right of the starting position
of the responding finger (the purpose of the simultaneous bilateral response was to balance hemi-
spheric motoric activity in the task). Perifoveal targets consisted of six simple shapes (e.g., circle,
square, triangle, diamond) that could be presented alone or simultaneously with task-irrelevant
shapes of increasing complexity (from basic shapes to human faces or letters) delivered at fixation.
Immediately after stimulus detection, participants were also required to identify the shape of the
peripheral stimulus. Two results are noteworthy: first, simple detection of the foveal circle (baseline
task) was faster for deaf than hearing participants (70 ms on average); second, simple detection and
subsequent discrimination of the peripheral shapes also confirmed faster RTs for deaf than hearing
participants (56 ms), but failed to show any between-group difference in identification accuracy (see
Section 22.2.4 for further discussion of this study).
More recently, Bottari et al. (in preparation) asked 11 congenital or early deaf (all signers) and 11
hearing adults (nonsigners) to press the space bar of the computer keyboard at the appearance of a
small black circle, delivered for 48 ms on the computer screen at 3° or 8° of eccentricity. The results
showed that deaf participants were faster than hearing controls (56 ms on average) at detecting the onset of the
visual target, regardless of whether it appeared at 3° or 8°. Similarly, Bottari et al. (2010) asked a
different group of 11 congenital or early deaf (all signers) and 11 hearing controls (nonsigners) to
detect a circle open on the left or right side, presented for 48 ms at 3° or 8° from central fixa-
tion. Stimuli were now corrected in size as a function of their eccentricity, and trials per condition
were increased from 24 to 96 to increase statistical power. The results of this second study entirely
supported those of Bottari et al. (in preparation), and showed a response time advantage for deaf
over hearing participants (44 ms on average) that again was not spatially selective, i.e., it emerged
regardless of target location instead of appearing only for peripheral targets (as in Loke and Song 1991).
One further finding of the study by Bottari and colleagues (2010) was that the overall RT advantage
for deaf participants emerged together with differential response time ratios in the two groups
as a function of target location. Hearing controls paid a significant RT cost when responding to
peripheral compared to central targets, whereas deaf individuals performed comparably across the two target
locations. This suggests that advantages in reactivity and advantages in peripheral processing may
be two dissociable aspects of enhanced visual processing in deaf individuals (see Section 22.3.3 for
further discussion of this point).
Other studies measuring speeded simple detection or speeded target lateralization in deaf people
also manipulated the direction of attention before target onset, typically adapting the cue–target
paradigm developed by Posner (1980). The first study to adopt this manipulation was conducted by
Parasnis and Samar in 1985. They tested 20 hearing and 20 congenitally deaf college students (all
signers and born from deaf parents) in a task requiring a speeded bimanual response (see Reynolds
1993) to indicate the side of a black unfilled circle, presented for 100 ms at 2.2° from central fixation.
The stimulus was preceded by an arrow indicating the correct target side 80% of the times, or by a
neutral cross signaling equal probability of the target on either side. In addition, across blocks, the
peripheral target was presented with concurrent stimulation at fixation (five black crosses; i.e., foveal
load condition) or alone (no load condition). Unlike the simple detection studies described above, the
results of this experiment showed no overall RT advantage for deaf over hearing participants (in fact,
there was even a trend for slower RTs in deaf than hearing participants overall). Furthermore, all participants
showed RT benefits and costs, with respect to the neutral trials, when the target appeared at the cued
or the uncued location, respectively. However, deaf participants paid less cost than hearing controls
when responding to targets at the uncued locations under the foveal load condition. Parasnis and
Samar (1985) interpreted this finding as evidence of more efficient “redirecting of attention from one
part of the visual field to another in the presence of interfering foveal stimulation,” and concluded
that “developmental experience involving a visual–spatial language and/or a predominantly visual
(as contrasted with visual plus auditory) perception of the world leads to selective and ecologically
useful alterations in attentional control of perceptual processes” (Parasnis and Samar 1985, p. 321).
The results and conclusions of the classic study by Parasnis and Samar (1985) created the basis
for the widespread notion that attention reorienting is more efficient in deaf than hearing individuals.
However, two further contributions that also examined simple detection of visual stimuli in the
presence of attentional cues suggest a more complex framework. Colmenero et al. (2004) asked 17
deaf (all signers with prelingual deafness) and 27 hearing adults to press a key whenever an “O”
appeared on the computer screen. The target appeared for 150 ms, at 20° of eccentricity to the left
or the right of central fixation, and was preceded by a vertical mark delivered at the exact target
location (valid condition, 53% of the trials), on the opposite side with respect to the target (invalid
condition, 13% of the trials) or on both sides (neutral condition, 33% of the trials). Stimulus onset
asynchrony (SOA) between cue and target ranged between 125 and 250 ms. Note that the use of
peripheral informative cues in this paradigm inevitably mixed exogenous and endogenous cueing of
attention within the same task. Deaf participants were faster than hearing controls at detecting the
target (43 ms on average). Furthermore, the analysis of RT costs and benefits, for invalid and valid
cues, respectively, revealed that both attentional effects were larger in hearing than deaf partici-
pants. In a second experiment, Colmenero and colleagues (2004) examined whether performance
in the two groups differed when the SOA between the lateralized cue and the target was extended to
350 or 850 ms. With such long SOAs, hearing individuals typically show a cost at detecting targets
occurring at the cued location, which is interpreted as an inhibition against reexploring locations where atten-
tion has previously been oriented [i.e., inhibition of return (IOR); Klein 2000]. The results of this
second experiment revealed less enduring IOR in deaf than in hearing participants, again suggest-
ing a different role of attention orienting in the hearing-deprived population.
Chen et al. (2006) asked 16 congenitally deaf and 22 hearing adults to detect the occasional
appearance of a dot, presented at perifoveal locations (3°; see also Section 22.2.4 for a full descrip-
tion of the design of this study). The dot appeared with equal probability to the right or to the left
of fixation and was preceded by a valid or invalid exogenous cue. As in the study of Colmenero et
al. (2004), the SOA between the lateralized cue and the target was in the typical range for IOR (i.e.,
900 ms). Although IOR effects were again observed, these did not differ between the two groups.
However, the results revealed that detection of perifoveal targets was systematically faster in deaf
than in hearing participants (59 ms on average) regardless of the attention condition (i.e., valid or
invalid; Chen et al. 2006, Experiment 1).
In sum, two relevant aspects emerge from the studies that adopted an operational definition of bet-
ter visual performance in deaf individuals in terms of enhanced reactivity to the stimulus. First, all
reports (with the sole exception of the speeded lateralization study by Parasnis and Samar 1985) docu-
mented a response speed advantage in deaf than hearing individuals. Figure 22.1 summarizes this
result graphically, by plotting the percentage difference in RTs between hearing and deaf participants
with respect to the mean RT of the hearing group, in the different studies and as a function of stimulus
eccentricity. With the sole exception of point [3] corresponding to the study by Parasnis and Samar
(1985), all data points are above zero, indicating that deaf participants were faster than the hearing
controls (on average, 13% faster with respect to the hearing group; see legend to Figure 22.1 for exact
RT differences in milliseconds). Importantly, this response advantage in deaf participants emerged
regardless of whether the target appeared directly at fixation or at locations further toward the periph-
ery. This supranormal performance of deaf individuals in terms of response speed was also uninflu-
enced by the preceding attention cueing condition (e.g., Colmenero et al. 2004; Chen et al. 2006).
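In formula, the quantity plotted in Figure 22.1 for each study and eccentricity is
\[
\Delta\% = 100 \times \frac{\overline{RT}_{\mathrm{hearing}} - \overline{RT}_{\mathrm{deaf}}}{\overline{RT}_{\mathrm{hearing}}},
\]
so that positive values indicate faster mean responses in the deaf group.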
The second relevant aspect concerns the effect of attentional instructions on the performance of
deaf people. Deaf participants can benefit from valid cueing of spatial selective attention (Parasnis
and Samar 1985), but at the same time there is evidence that their performance may be less suscep-
tible to invalid attention orienting (e.g., Parasnis and Samar 1985; Colmenero et al. 2004) or IOR
(Colmenero et al. 2004; but see Chen et al. 2006) than hearing controls.

22.2.3  Visual Search Tasks


One further operational definition of better visual ability in deaf individuals has been in terms
of faster search times when a prespecified target has to be found among distractors. In the visual

[Figure 22.1 appears here: a scatter plot titled “Simple detection or localization tasks,” showing the percentage difference with respect to the mean RT of the hearing group (y-axis, from −50% to +50%, with positive values labeled “Deaf are faster” and negative values “Hearing are faster”) as a function of visual eccentricity in degrees (x-axis, 0 to 30). See caption below.]

FIGURE 22.1  Difference in RT between hearing and deaf individuals (expressed as a percentage of mean
RT of hearing group) across different studies, as a function of target eccentricity (in degrees). Multiple data
points from the same study (e.g., see point [2]) refer to targets at different eccentricities. Positive values on
Y-axis indicate faster response time in deaf than in hearing controls. Foveal (up to 1.5°), perifoveal (from 1.5°
to 5°), and peripheral eccentricities (beyond 5°) are indicated in plot by shadings of different hues of gray.
However, note that only boundaries of foveal visual field are clearly specified by anatomical landmarks; thus,
the distinction between perifoveal and peripheral regions is instead conventional (we adopted here the distinc-
tion proposed by Reynolds 1993; see Section 22.1.2). Actual RT differences are as follows: [1] Reynolds (1993):
70 ms at 0°, 56 ms at 4°; [2] Loke and Song (1991): 38 ms at 0.5°, 85 ms at 25°; [3] Parasnis and Samar (1985):
−58 ms at 2.2°; [4] Chen et al. (2006): 59 ms at 3°; [5] Colmenero et al. (2004): 43 ms at 20°; [6] Bottari et al.
(in preparation): 52 ms at 3°, 59 ms at 8°; [7] Bottari et al. (2010): 54 ms at 3°, 59 ms at 8°.

perception literature, visual search tasks have classically been employed to distinguish perceptual
processes requiring attention from perceptual processes occurring preattentively. When response
time is unaffected by the number of distractors in the array, the search is typically described as
preattentive (i.e., it does not require an attention shift to the target in order to produce the response). By
contrast, when response time increases as a function of the number of distractors in the array, the
search is assumed to require serial attention shifts to the various items (Treisman 1982).
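The diagnostic quantity behind this distinction is the search slope, i.e., the slope of a linear fit of mean RT against the number of items in the display (ms per item): slopes near zero are taken as evidence of parallel (preattentive) search, whereas steeper slopes are read as serial attention shifts. The following minimal Python sketch shows the computation; the RT values are purely illustrative (chosen to mirror slopes of roughly 9 and 22 ms/item, the figures reported later in this section for Stivalet et al. 1998) and are not data from any cited study.

import numpy as np

# Hypothetical mean RTs (ms) at set sizes of 4, 10, and 16 items.
# Values are illustrative only, not taken from any study reviewed here.
set_sizes = np.array([4, 10, 16])
rt_group_a = np.array([520, 574, 628])   # shallow slope (~9 ms/item): parallel-like search
rt_group_b = np.array([540, 672, 804])   # steep slope (~22 ms/item): serial-like search

for label, rts in [("group A", rt_group_a), ("group B", rt_group_b)]:
    slope, intercept = np.polyfit(set_sizes, rts, 1)  # linear fit: RT = intercept + slope * set size
    print(f"{label}: search slope = {slope:.1f} ms/item (intercept = {intercept:.0f} ms)")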
Henderson and Henderson (1973) were the first to compare the abilities of deaf and hearing
children (12.5 to 16.5 years old) in a visual search task that required searching for a target letter
in a letter array containing capital and lowercase letters. Although they found that the two groups
did not differ in the visual search task, it should be noted that the high similarity between the
target and the distractors inevitably forced a serial search in both groups. Several years later,
Stivalet and colleagues (1998) also adopted a visual search task to examine visual processing in
congenitally deaf and hearing adults. Unlike Henderson and Henderson (1973), they manipulated
the complexity of the search by asking participants to detect the presence or absence of a Q among
O’s (easier search, because the target contains a single identifying feature) or of an O among Q’s
(harder search, because the target is lacking a feature with respect to the distractors). Moreover,
to obtain a measure of visual processing time, which could be separate from the time required
for motor program retrieval and response initiation/execution, all stimuli were masked after a
variable interval and the dependent variable was the duration of the interval between stimuli and
mask sufficient to reach 90% correct. Notably, all stimuli were presented within the perifoveal
region, at an eccentricity ranging between 4.1° and 4.9°. When searching for Q among Os (easier
search), both groups performed a parallel search that was unaffected by the number of distractors
(4, 10, or 16). By contrast, when searching for an O among Qs (harder search), deaf adults proved
substantially more efficient than hearing controls, with their visual search time (9 ms/letter) fall-
ing within the range of parallel processing (Enns and Rensink 1991), unlike hearing participants
(22 ms/letter).
Further evidence along the same direction came from a visual search study by Rettenbach and
colleagues (1999). They tested eight deaf and eight hearing adults in a series of visual search tasks of
different complexity. Unlike the study by Stivalet and colleagues (1998), the stimuli covered a wide
visual angle, both vertically (20°) and horizontally (26°), thus spanning from central to peripheral
locations. The results revealed more efficient visual search in deaf than hearing adults. Interestingly,
when the same study was repeated in children and adolescents, deaf participants systematically
underperformed with respect to the age-matched hearing controls (see also Marendaz et al. 1997),
suggesting that the visual search advantage of deaf individuals may follow a protracted developmental
trajectory.
In sum, the studies that evaluated visual search abilities in deaf and hearing controls indicate
that the range of parallel processing is broader in deaf individuals than in hearing controls (Stivalet et al. 1998;
Rettenbach et al. 1999). Furthermore, this enhanced visual ability appears to be independent of the
spatial location of the stimuli, as it emerged for perifoveal (Stivalet et al. 1998) as well as periph-
eral stimuli (Rettenbach et al. 1999). However, the reconciliation of visual search findings with the
observation of less susceptibility of deaf participants to invalid cueing or IOR (e.g., Parasnis and
Samar 1985; Colmenero et al. 2004) is not straightforward. As we shall discuss later (see Section
22.3.3), assuming that both visual search and cueing effects can be accounted for by faster reorient-
ing of attention implies a description of better visual search in deaf individuals in terms of faster and
more efficient movement of the attention spotlight in space. This interpretation, however, is at odds
with the description of better search as being the result of preattentive processing.

22.2.4  Visual Discrimination and Identification Tasks


One aspect common to the simple detection tasks described in Section 22.2.2 and the easy
visual search tasks described in Section 22.2.3 (e.g., easy search of a Q among O's) is that both
these tasks can in principle be performed without attention shifts (i.e., under distributed atten-
tion; e.g., see Bravo and Nakayama 1992; Sagi and Julesz 1984). Instead, shifts of spatial attention
are certainly required to perform complex visual search tasks or to perform visual discrimination
tasks. Discrimination or identification of a visual target requires binding of the multiple target
features, and therefore inevitably relies on selective attention processing (e.g., Turatto et al. 2007).
Furthermore, discriminating one stimulus from another implies some sort of perceptual matching
with a template held in working memory. In this respect, adopting discrimination and identification
tasks for the study of visual abilities in deaf individuals clearly implies taking a step forward in the
examination of visual cognition in this auditorily deprived population.
Early studies on visual discrimination in deaf individuals assessed the ability of this population
in discriminating colors or complex shapes. For instance, Heider and Heider (1940) tested prelin-
gually deaf and hearing children in a color sorting task, in which participants had to select a range
of hues that could match a given standard color. Performance in the two groups was comparable,
and in fact deaf children selected a wider range of hues compared to hearing children. Similarly,
Suchman (1966) compared the ability of deaf and hearing individuals in an oddity discrimina-
tion task, which required the identification of an odd stimulus among other items. When the odd
stimulus differed in color (5% white increase or decrease in hues), deaf participants had higher
accuracy scores than hearing controls. By contrast, when the odd stimulus differed in shape (4° of
internal angle with respect to the other simple shapes) hearing controls discriminated better than
deaf participants. Hartung (1970) used tachistoscopic presentation to show prelingually deaf and
hearing children a series of English or Greek trigrams. The task was to determine if a particular let-
ter appeared in each trigram and to reproduce the English trigram. Although deaf children performed worse
than hearing children with the English trigrams, no discrimination difference emerged with the
unfamiliar Greek trigrams, suggesting that any discrimination difference between groups reflected
linguistic rather than perceptual difficulties.
A seminal work that adopted a visual discrimination task was conducted by Neville and Lawson
in 1987. In that study, behavioral and ERP responses were recorded while 12 congenitally deaf
adults (all signers, with at least one deaf parent) and 12 age-matched hearing controls performed a
discrimination of direction of motion for suprathreshold visual stimuli. Visual stimuli were white
squares presented at central (just above fixation) or peripheral locations (18° to the right or to the left
of central fixation), with an ISI from trial onset ranging randomly between 280 and 480 ms. On 80%
of the trials (termed “standards”), a single square appeared at one of these predetermined locations
for 33 ms. On the remaining 20% of the trials (termed “deviants”), the square jumped slightly to
one of eight possible immediately adjacent locations after the first 33 ms. The participant’s task con-
sisted in discriminating the direction of this moving square in deviant trials. Importantly, although
participants fixated centrally throughout the experimental session, they were also requested to ori-
ent their attention to one of the three possible target locations (center, left, or right) across blocks.
In terms of behavioral performance, deaf participants were faster than hearing controls (on average 70 ms) at
discriminating moving targets at the peripheral locations; by contrast, no between-group differ-
ence in RT emerged for targets occurring at central locations. Instead, the two groups performed
comparably in terms of sensitivity (d′): hearing individuals showed better discrimination
ability in the RVF than the LVF, whereas deaf participants showed the opposite pattern. In terms of EEG response,
three main findings were reported. First, the visual evoked component, termed P1 (i.e., positivity
peaking at about 100 ms after the stimulus presentation), was comparable between groups regard-
less of whether the stimulus was standard or deviant, and regardless of stimulus location and atten-
tion condition. Second, a larger amplitude in the N1 component emerged in deaf than in hearing
controls, when standard or deviant targets appeared at attended peripheral locations. These greater
increases in cortical response due to attentional engagement in deaf than hearing controls were
recorded over the occipital electrodes and in the left parietal and temporal regions. Third, the over-
all amplitude of N1 was larger over the right than left hemisphere in hearing controls, but larger
over the left than right hemisphere in deaf individuals. VEPs in response to central standards and
targets were instead comparable between groups. In summary, the results of the study by Neville and
Lawson (1987) suggested that deaf individuals can outperform hearing controls in terms of reactivity (but
not sensitivity) when discriminating the direction of motion for targets presented at peripheral loca-
tions. In addition, because VEP differences emerged in response to both static and moving stimuli
(i.e., standard and targets, respectively) specifically in the condition of attentional engagement to
peripheral locations, Neville and Lawson (1987) concluded that deafness modulates the neural sys-
tem that mediates spatial attention. However, later empirical evidence has shown that a similar N1
modulation can also be documented for targets monitored under distributed attention (Armstrong et al.
2002), thus challenging the conclusion that differences between deaf and hearing controls emerge
selectively under conditions of focused attention.
Another study that evaluated discrimination performance in deaf and hearing participants
adopting moving stimuli was conducted by Bosworth and Dobkins (2002a; see also Bosworth and
Dobkins 2002b). These authors evaluated 16 profoundly deaf signers (12 congenital), 10 hearing
signers, and 15 hearing nonsigners in a direction-of-motion discrimination task. Participants were
required to discriminate the direction of motion of coherent moving dots presented among random
moving dots, within a single or multiple displays appearing in one or all the quadrants of the moni-
tor. The coherent motion threshold for each participant was the number of coherently moving dots
that yielded 75% correct discriminations. In addition to the number of presented displays, two other
conditions were manipulated: the presence or absence of endogenous cueing (a 100% predictive
spatial cue, delivered before display presentation) and stimulus duration (200 or 600 ms). Results
showed no overall better performance in deaf than hearing participants when discriminating direc-
tion of motion. Intriguingly, deaf individuals tended to be faster yet less accurate than the other
groups, suggesting a possible speed–accuracy trade-off in deaf but not hearing participants. The
analyses also revealed that direction-of-motion thresholds were less affected by cueing of attention
in deaf individuals than in hearing controls (regardless of signing abilities). Furthermore, when the
stimuli lasted for 600 ms, performance for the deaf group paradoxically improved with multiple
rather than single displays, unlike hearing participants. Both these findings may indicate better
capture of attention by a discontinuity in a complex visual scene in deaf than hearing participants,
given enough time for the perceptual analysis.
Finally, in a recent study conducted in our laboratory (Bottari et al. 2010), we asked 11 congenital
or early deaf and 11 hearing controls to perform a speeded shape discrimination for visual targets
presented at one of eight possible locations (at 3° or 8° from central fixation). Targets were open circles
lasting for 48 ms and participants were required to discriminate whether the circle was open on the left
or on the right side. The results of this study showed comparable performance between deaf and hear-
ing individuals in terms of the RT measure, even though deaf participants showed numerically faster RTs.
Interestingly, deaf individuals performed worse than hearing controls in terms of accuracy, suggesting a
different speed–accuracy trade-off in the deaf group (see also Bosworth and Dobkins 2002a).
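
One conventional way to check for such a trade-off is to fold speed and accuracy into a single inverse efficiency score (mean correct RT divided by proportion correct); the sketch below illustrates the computation on hypothetical group means, not on the published data.

```python
# Combine speed and accuracy into an inverse efficiency score (IES):
# mean correct RT divided by proportion correct. If one group is faster but
# less accurate (a speed-accuracy trade-off), the IES difference shrinks.
def inverse_efficiency(mean_rt_ms, prop_correct):
    return mean_rt_ms / prop_correct

# Hypothetical group means (not the published values).
groups = {
    "deaf":    {"rt": 520.0, "acc": 0.88},
    "hearing": {"rt": 545.0, "acc": 0.94},
}
for name, g in groups.items():
    print(f"{name:8s} IES = {inverse_efficiency(g['rt'], g['acc']):.1f} ms")
```
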
In sum, the tasks requiring perceptual discrimination for suprathreshold stimuli did not provide
consistent evidence in support of the notion of enhanced abilities in deaf individuals compared to hearing controls.
When adopting static stimuli, better accuracy in deaf individuals compared to hearing controls
has been documented only for discrimination of color changes (Suchman 1966). Instead, the
studies that required shape discrimination for static visual events failed to show any enhanced
abilities in deaf individuals (Hartung 1970; Bottari et al. 2010). When adopting moving stimuli,
faster RTs in deaf subjects than in hearing participants have been documented only by Neville
and Lawson (1987), selectively for events at peripheral locations. Instead, Bosworth and Dobkins
(2002a) showed an overall comparable performance between deaf and hearing controls when dis-
criminating coherence of motion.

22.2.4.1  Visual Discrimination with Flanker Interference


A series of experiments adopting discrimination or identification tasks also evaluated the effect
of entirely task-irrelevant competing stimuli on discrimination performance. The main rationale
underlying this manipulation is that any bias for processing peripheral events more than central
ones in the deaf population should emerge as larger interference effects of peripheral distracting
information on central targets (or, conversely, as smaller interference effects of central distractors
on peripheral targets).
One of the first examples of this experimental paradigm is the study by Reynolds (1993). In addi-
tion to the speeded lateralization task already described in Section 22.2.2, deaf and hearing partici-
pants were required to identify the figures that appeared 4° to the left or right of central fixation.
Target figures were presented alone or together with concurrent stimuli delivered at fixation (simple
shapes, outline drawings of familiar objects or letters). Overall, no recognition accuracy advantage
emerged for deaf compared to hearing participants (62% vs. 58% correct). The only difference between
deaf and hearing controls emerged with respect to hemifield of stimulus presentation. Deaf partici-
pants showed an LVF advantage in identification accuracy when concurrent stimuli at fixation were
absent or were simple shapes, and an RVF advantage when concurrent stimuli at fixation consisted
of drawings or letter stimuli. The reversed pattern of results emerged in hearing controls.
One influential study that also examined identification with concurrent distractors at central and
peripheral locations has been conducted several years later by Proksch and Bavelier (2002). In three
experiments, they tested deaf students (all congenital and signers) and hearing controls (including
a group of participants born from deaf parents, who learned sign language in infancy) in a speeded
shape identification task. The target shape (square or diamond) was presented inside one of six cir-
cular frames, arranged around fixation in a ring of 2.1° of radius. In each trial, a distracting shape
was presented concurrently with the target, either in the center of the screen (0.5° to the left or right
of fixation) or outside the target ring (4.2° to the left or right of fixation). The distractor was an item
from the target set, either compatible (e.g., target: diamond; distractor: diamond) or incompatible
(e.g., target: diamond; distractor: square), or else a neutral shape. Finally, a variable number (0, 1,
3, or 5) of filler shapes was introduced in the empty circular frames of the target ring to manipulate
perceptual load across trials. Participants were instructed to identify the target as quickly as pos-
sible, while ignoring all other distracting shapes. Overall, target identification proved longer for deaf
than for hearing participants (Experiment 1: 824 vs. 765 ms; Experiment 3: 814 vs. 703
ms). All experiments consistently revealed the interfering effect of perceptual load and lateralized
distractors on RT performance. Critically, however, peripheral distractors proved more distracting
for deaf individuals, whereas central ones were more distracting for hearing controls (regardless
of whether they were signers). This led Proksch and Bavelier (2002) to conclude that “the spatial
distribution of visual attention is biased toward the peripheral field after early auditory deprivation”
(p. 699).
A related study was conducted by Sladen and colleagues (2005), using the classic flanker inter-
ference task developed by Eriksen and Eriksen (1974). Ten early deaf (onset before 2 years of age,
all signers) and 10 hearing adults were asked a speeded identification of a letter (H or N) presented
either in isolation (baseline) or surrounded by four response-compatible letters (two on either side;
e.g., HHHHH) or response-incompatible letters (e.g., NNHNN). Letters were presented 0.05°, 1°, or
3° apart from each other. The results showed that letter discrimination was faster in hearing than in
deaf participants in each of the experimental conditions including the baseline (e.g., between 50 and
81 ms difference, for incompatible stimuli), but this was accompanied by more errors in the hearing
group during incompatible trials. Interestingly, the two groups also differed in their performance
with the 1° spacing between target and flankers: the incongruent flanker cost emerged for both
groups, but was larger in deaf than in hearing participants. Again, this finding is compatible with
the notion that deaf individuals may have learned to “focus their visual attention in front of them
in addition to keeping visual resources allocated further out in the periphery” (Sladen et al. 2005,
p. 1536).
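
In studies of this kind, the flanker interference cost is typically quantified as the RT difference between incompatible and compatible (or baseline) trials, computed separately per group and per target–flanker spacing. The sketch below illustrates that computation on hypothetical condition means; the numbers are not taken from Sladen et al. (2005).

```python
# Quantify flanker interference as incompatible-minus-compatible RT,
# separately for each group and target-flanker spacing (hypothetical means).
mean_rt = {  # group -> spacing (deg) -> condition -> mean RT (ms)
    "deaf":    {1.0: {"compatible": 610, "incompatible": 668},
                3.0: {"compatible": 600, "incompatible": 630}},
    "hearing": {1.0: {"compatible": 560, "incompatible": 590},
                3.0: {"compatible": 555, "incompatible": 585}},
}

for group, by_spacing in mean_rt.items():
    for spacing, cond in by_spacing.items():
        cost = cond["incompatible"] - cond["compatible"]
        print(f"{group:8s} spacing {spacing:.0f} deg: interference cost = {cost} ms")
```
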
The study by Chen et al. (2006), described in Section 22.2.2, also adopted a flanker interfer-
ence paradigm. On each trial, participants were presented with a row of three horizontally aligned
boxes, of which the central one contained the target and the side ones (arranged 3° on either side)
contained the distractors. The task required a speeded discrimination among four different colors.
Two colors were mapped onto the same response button, whereas the other two colors were mapped
onto a different response button. Simultaneous to target presentation, a flanker appeared in one of
the lateral boxes. The flanker was either identical to the target (thus leading to no perceptual conflict
and no motor response conflict), or different in color with respect to the target but mapped onto the
same motor response (thus leading only to a perceptual conflict) or different in color with respect
to the target and mapped onto a different response than the target (thus leading to perceptual and
motor conflict). Finally, spatial attention to the flanker was modulated exogenously by changing
the thickness and brightness of one of the lateral boxes at the beginning of each trial. Because the
time interval between this lateralized cue and the target was 900 ms, this attentional manipulation
created an IOR effect (see also Colmenero et al. 2004). Overall, color discrimination was compa-
rable between groups in terms of reaction times (see also Heider and Heider 1940). However, the
interference determined by the flankers emerged at different levels (perceptual vs. motor response)
in deaf and hearing participants, regardless of the cueing condition. Hearing participants displayed
flanker interference effects for flankers interacting at both the perceptual and the response levels. In
contrast, deaf participants showed flanker interference effects at the response level, but not at the
perceptual level.
Finally, Dye et al. (2007) asked 17 congenitally deaf and 16 hearing adults to perform a speeded
discrimination about the direction of a central arrow (pointing left or right) presented 1.5° above or
below central fixation and flanked by peripheral distractors (other arrows with congruent or incon-
gruent pointing directions, or neutral lines without arrowheads). A cue consisting of one or two
asterisks presented 400 ms before the onset of the arrows oriented attention to central fixation, to
the exact upcoming arrow location, or to both potential arrow locations (thus alerting for stimulus
appearance without indicating the exact target location). The findings showed comparable effects
of orienting spatial cues in hearing and deaf individuals, as well as comparable alerting benefits.
Interestingly, when the number of flanker arrows was reduced to 2 and their relative distance from
the central arrow was increased to 1°, 2°, or 3° of visual angle, deaf participants displayed stronger
flanker interference effects in RTs compared to hearing controls.
In sum, the studies that measured allocation of attentional resources in the visual scene using
flanker interference tasks showed larger interference from distractors in deaf than in hearing partic-
ipants (Proksch and Bavelier 2002; Sladen et al. 2005; Chen et al. 2006; Dye et al. 2007). However,
although Proksch and Bavelier (2002) showed enhanced distractor processing in deaf than in hear-
ing adults at 4.2° from central fixation, Sladen et al. (2005) showed enhanced distractor processing
at 1° from central fixation, but comparable distractor processing at 3°. Finally, Dye et al. (2007)
showed increased flanker interference in deaf than in hearing controls regardless of whether the two
distracting items were located at 1°, 2°, or 3° from fixation. These mixed results suggest that some
characteristics of the visual scene and task, other than just the peripheral location of the distractors,
could play a role. These additional characteristics might include the degree of perceptual load, the
amount of crowding, or the relative magnification of the stimuli.

22.2.5  Visual Tasks of Higher Complexity


Beyond visual discrimination or identification tasks, our attempt to relate single experimental para-
digms with single operational definitions of better visual abilities in deaf individuals becomes inevi-
tably more complex. For instance, the visual enumeration test and the Multiple Object Tracking test
recently adopted by Hauser and colleagues (2007), the change detection task adopted by our group
(Bottari et al. 2008, in preparation), or the studies on speech-reading ability of deaf individuals (e.g.,
Bernstein et al. 2000; Mohammed et al. 2005) can hardly be reduced to single aspects of visual
processing. Nonetheless, we report these studies in detail because they are informative about the
selectivity of the performance enhancements observed in the deaf population.
Hauser and colleagues (2007) evaluated 11 congenitally deaf individuals and 11 hearing controls in an enu-
meration task, asking participants to report on a keyboard the number of briefly flashed static targets
in a display, as quickly and accurately as possible. The task was either conducted with a field of
view restricted to 5° around fixation or with a wider field of view of 20° around fixation. In such
enumeration tasks, participants typically display a bilinear performance function, with fast and
accurate performance with few items (subitizing range), and a substantially greater cost in terms
of reaction times and accuracy as the number of items increases. The results of Hauser et al. (2007)
showed comparable subitizing performance in deaf and hearing individuals, regardless of which
portion of the visual field was evaluated. A second experiment, conducted on 14 congenitally deaf and
12 hearing control participants, adapted the Multiple Object Tracking test (Pylyshyn 1989). In this task, partici-
pants are presented with a number of moving dots of which a subset is initially cued. When the cues
disappear, participants are required to keep track of the dots that were initially cued until one of
the dots in the set is highlighted. Participants have to indicate whether that dot was also cued at the
beginning of the trial. Although this task was performed over a wide field of view to maximize the
possibility of detecting any difference between deaf and hearing participants, no sensitivity difference emerged.
The authors concluded that “early deafness does not enhance the ability to deploy visual attention
to several different objects at once, to dynamically update information in memory as these objects
move through space, and to ignore irrelevant distractors during such tracking” (Hauser et al. 2007,
p. 183).
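
The "bilinear performance function" mentioned above can be made explicit by fitting a two-segment (piecewise linear) function to RTs as a function of set size, with the estimated breakpoint taken as the upper bound of the subitizing range. The sketch below uses hypothetical RTs and a least-squares fit; it is not the procedure reported by Hauser et al. (2007).

```python
# Fit a two-segment (bilinear) function to enumeration RTs: a shallow slope
# within the subitizing range and a steeper slope beyond the breakpoint.
import numpy as np
from scipy.optimize import curve_fit

def bilinear(n, breakpoint, intercept, slope1, slope2):
    n = np.asarray(n, dtype=float)
    return np.where(
        n <= breakpoint,
        intercept + slope1 * n,
        intercept + slope1 * breakpoint + slope2 * (n - breakpoint),
    )

set_size = np.arange(1, 9)                                                # 1..8 items
rt = np.array([520, 540, 565, 600, 880, 1160, 1450, 1720], dtype=float)  # hypothetical RTs (ms)

params, _ = curve_fit(bilinear, set_size, rt, p0=[3.5, 500.0, 25.0, 280.0])
print(f"estimated subitizing breakpoint ~ {params[0]:.1f} items")
```
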
Two studies from our group evaluated the ability of deaf and hearing individuals to discriminate
between the presence or absence of a change in a visual scene (Bottari et al. 2008, in preparation).
In these studies, two visual scenes were presented one after the other in each experimental trial,
separated by an entirely blank display. Each visual scene comprised four or eight line-drawing
images, half of which were arranged at 3° from central fixation and the other half were arranged
at 8°. On 50% of the trials, the second scene was entirely identical to the first (i.e., no change
occurred), whereas on the other 50% of the trials one drawing in the first scene changed into a dif-
ferent one in the second scene. The participant’s task was to detect whether the change was present
or absent. When comparing two alternating visual scenes, any change is typically detected without
effort because it constitutes a local transient that readily attracts exogenous attention to the location
where the change has occurred (O’Regan et al. 1999, 2000; Turatto and Bridgeman 2005). However,
if a blank image is interposed between the two alternating scenes (as in the adopted paradigm),
any single part of the new scene changes with respect to the previous blank image, resulting in a
global rather than local transient. The consequence of this manipulation is that attention is no longer
exogenously captured to the location of change, and the change is noticed only through a strategic
(endogenous) scan of the visual scene (the so-called “change blindness” effect; Rensink 2001). Thus,
the peculiarity of this design was the fact that all local transients related to target change or target
onset were entirely removed. This produced an entirely endogenous experimental setting, which
had never been adopted in previous visual tasks with deaf people (see Bottari et al. 2008 for further
discussion of this point). The result of two studies (Bottari et al. 2008, in preparation) revealed that
sensitivity to the change in deaf and hearing adults was comparable, regardless of the location of the change
(center or periphery), suggesting that the sensitivity to changes in an entirely transient-free context
is not modulated by deafness. Furthermore, this conclusion was confirmed also when the direction
of endogenous attention was systematically manipulated between blocks by asking participants
to either focus attention on specific regions of the visual field (at 3° or 8°) or to distribute spatial
attention across the whole visual scene (Bottari et al. 2008). In sum, even visual tasks tapping
into multiple stages of nonlinguistic visual processing (and particularly visual working memory)
do not reveal enhanced processing in deaf individuals compared to hearing controls. Once again, the absence of
supranormal performance was documented regardless of the eccentricity of the visual stimulation.
Furthermore, the results of Bottari et al. (2008) indicate that focusing endogenous attention is not
sufficient to determine a between-group difference. It remains to be ascertained whether the latter
result (which is at odds with the behavioral observation of Neville and Lawson 1987 and with the
neural observation of Bavelier et al. 2000) might be the consequence of having removed from the
scene all target-related transients that could exogenously capture the participant’s attention.
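
Because a change occurred on 50% of trials, sensitivity in this paradigm is naturally expressed as d′ computed from hit and false-alarm rates (cf. Green and Swets 1966). The sketch below illustrates the computation on hypothetical response counts; the log-linear correction for extreme proportions is an assumption, not necessarily the procedure used in our studies.

```python
# Compute change-detection sensitivity (d') from hits and false alarms,
# using a log-linear correction to avoid infinite z-scores at rates of 0 or 1.
from statistics import NormalDist

def dprime(hits, misses, false_alarms, correct_rejections):
    z = NormalDist().inv_cdf
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return z(hit_rate) - z(fa_rate)

# Hypothetical counts for one participant (80 change and 80 no-change trials).
print(f"d' = {dprime(hits=58, misses=22, false_alarms=14, correct_rejections=66):.2f}")
```
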
A different class of complex visual tasks in which deaf individuals were compared to hearing
controls evaluated speech-reading abilities (also termed lip-reading). Initial studies on speech-read-
ing suggested that this ability was considerably limited in hearing controls (30% of words or fewer cor-
rect in sentences, according to Rönnberg 1995) and that “the best totally deaf and hearing-impaired
subject often perform only as well as the best subjects with normal hearing” (Summerfield 1991,
p. 123; see also Rönnberg 1995). However, two later contributions challenged this view and clearly
showed that deaf individuals can outperform hearing controls in speech-reading tasks. Bernstein et
al. (2001) asked 72 deaf individuals and 96 hearing controls to identify consonant–vowel nonsense
syllables, isolated monosyllabic words and sentences presented through silent video recordings of a
speaker. The results showed that deaf individuals were more accurate than hearing controls, regard-
less of the type of the verbal material. In agreement with this conclusion, Auer and Bernstein (2007)
showed a similar pattern of results in a study that evaluated identification of visually presented sen-
tences in even larger samples of deaf individuals and hearing controls (112 and 220, respectively).
It is important to note that neither study included deaf individuals who used sign language as their
preferential communication mode, thus relating these enhanced lip-reading skills to the extensive
training that deaf individuals receive throughout their lives.
For the purpose of the present review, it is important to note that speechreading is a competence
that links linguistic and nonlinguistic abilities. Mohammed and colleagues (2005) replicated the
observation that deaf individuals outperform hearing controls in lip-reading skills. Furthermore,
they showed that the lip-reading performance of deaf individuals (but not hearing controls) cor-
related with the performance obtained in a classical motion coherence test (see also Bosworth and
Dobkins 1999; Finney and Dobkins 2001), even though the overall visual motion thresholds were
entirely comparable between the two groups (in agreement with what we reported in Section 22.2.1).
In sum, lip-reading is a visual skill that has systematically been found to be enhanced in deaf individuals com-
pared to hearing controls. Intriguingly, in deaf individuals this skill appears to be strongly intercon-
nected with the ability to perceive motion in general, supporting the notion that visual motion
perception has a special role in this sensory-deprived population.

22.3  A TRANSVERSAL VIEW ON LITERATURE


The first aim of the present review was to provide a detailed report on the empirical evidence about
visual abilities in profound deafness, organized as a function of task. This served the purpose of
highlighting the different operational definitions of “better visual ability” adopted in the literature
and of examining the consistency of the findings across tasks. The second aim was to evaluate to what
extent four distinct aspects, which are transversal to the different behavioral tasks, can contribute
to the understanding of the heterogeneity of the empirical findings. In particular, the aspects con-
sidered were: (1) diversity in the characteristics of the deaf samples; (2) visual characteristics of the
target stimulus; (3) target eccentricity; and (4) the role of selective spatial attention.

22.3.1  Enhanced Reactivity Rather than Enhanced Perceptual Processing


One aspect that clearly emerges from our task-based review of the literature is that the operational
definitions of better visual abilities in deaf individuals in terms of enhanced perceptual processing
of the visual stimulus do not reveal systematic differences between deaf and hearing controls. This
conclusion is particularly clear in all those studies that examined perceptual processing for stimuli
at or near threshold (see Section 22.2.1), but it is also confirmed by studies that required discrimina-
tion or identification for stimuli above thresholds (see Section 22.2.4) and by studies that also took
the role of visual working memory into account (see Section 22.2.5). In the case of discrimination
and identification tasks, only one report has shown a behavioral discrimination advantage for deaf
over hearing controls (e.g., see the RT difference for stimuli at peripheral locations in the work
of Neville and Lawson 1987), whereas in all the remaining studies a between-group difference
emerged only in the way attention instructions or flankers impacted on the performance of deaf and
hearing participants, but not in terms of overall processing advantage for the deaf group. In striking
contrast with this pattern of results, almost all studies adopting simple detection or lateralization
tasks have shown a reactivity advantage (occurring in a range between 38 and 85 ms) in deaf over
hearing participants. Furthermore, when these studies are considered collectively, enhanced reac-
tivity in deaf participants does not appear to be modulated by stimulus eccentricity in any obvious
way (see Figure 22.1). Finally, although attentional manipulations did impact on simple detection
performance (e.g., Colmenero et al. 2004; Chen et al. 2006), the between-group difference did not
emerge selectively as a function of the attentional condition.
The observation that better visual abilities in deaf individuals emerge mainly for tasks designed
around speeded simple detection of the stimulus, rather than tasks designed around discrimina-
tion performance, suggests that profound deafness might not result in enhanced perceptual repre-
sentation of visual events. Instead, any modification of visual processing in deaf individuals may
occur at the level of visual processing speed, at the level of response selection/generation, or at both
these stages (for further discussion of this point, see Bottari et al. 2010). Prinzmetal and colleagues
(2005, 2009) recently proposed that performance enhancement could reflect either perceptual chan-
nel enhancement or perceptual channel selection. Whereas channel enhancement would result in
better sensory representation of the perceptual events, channel selection would only result in faster
processing. We suggest that enhanced visual abilities in deaf individuals may reflect channel selec-
tion more than channel enhancement, and that enhanced reactivity may be the core aspect of the
compensatory cross-modal plasticity occurring in this sensory-deprived population. In the context
of the present review, it is also interesting to note that Prinzmetal and colleagues (2005, 2009) have
associated channel enhancements with endogenous attention selection, but channel selection with
exogenous attention capture (see also Section 22.3.3).

22.3.2  Role of Deaf Sample Characteristics and Visual Stimulus Characteristics Are Relevant but Not Critical


Several investigators have suggested that the heterogeneity of the results observed in the literature
on visual abilities in deaf individuals might reflect the diversity in the characteristics of deaf par-
ticipants recruited across the different studies (e.g., Bavelier et al. 2006; Hoemann 1978). Although
this perspective appears very likely, to the best of our knowledge systematic studies on the impact
of critical variables, such as deafness onset (early or late) or preferred communication mode, on
the visual skills of deaf individuals have not been conducted. Similarly, the exact role of the amount of hearing loss
and etiology of deafness remains to be ascertained. Our review indicates that the vast majority
of investigations have tested congenital or early deaf participants, using primarily sign language.
However, our review also challenges the idea that sample characteristics alone can account for the
variability in the results. Even those studies that restricted the population to “deaf native signers”
(Bavelier et al. 2006) did not find systematically better abilities in deaf than in hearing controls. For
instance, Hauser and colleagues (2007) pointed out that the comparable performance between deaf
and hearing controls in their visual enumeration and visual working memory tasks emerged despite
the fact that the population of deaf native signers tested in the study was identical to that recruited
in previous studies from the same research group that instead documented enhanced performance
with respect to hearing controls (Hauser et al. 2007, p. 184).
Specificity of the target stimulus characteristics is also unlikely to explain the heterogeneity of
the findings. The hypothesis that motion stimuli are more effective than static ones in determin-
ing enhanced visual abilities in deaf individuals is, at the very least, controversial in light of the cur-
rent review of the literature. Studies adopting perceptual threshold tasks consistently documented
comparable performance between deaf and hearing participants regardless of whether the stimuli
were static (as in Bross 1979a, 1979b) or moving (e.g., Bosworth and Dobkins 1999; Brozinsky and
Bavelier 2004; but see Stevens and Neville 2006). Instead, in simple detection tasks, enhanced
reactivity in deaf compared to hearing participants has been documented primarily with static stim-
uli. Finally, using complex visual tasks tapping on working memory capacities, Hauser and col-
leagues (2007) showed comparable performance between deaf and hearing individuals regardless
of whether stimuli were stationary (enumeration task) or moving (Multiple Object Tracking task).
One piece of evidence that could support the notion that motion stimuli are more effective than
static ones in eliciting differences between the two groups is the observation that discrimination for
moving stimuli at the visual periphery (18°) is better for deaf than hearing participants (Neville and
Lawson 1987), whereas discrimination for static stimuli also appearing toward the periphery (8°)
is not (Bottari et al. 2010). However, the evident discrepancy in stimulus location between the two
studies prevents any definite conclusion, which could only be obtained by running a direct compari-
son of deaf and hearing performance using stimuli differing in the motion/static dimension, while
other variables are held fixed.

22.3.3  Role of Target Eccentricity and Selective Visual Attention Is Critical but Underspecified


The present review supports the notion that representation of the visual periphery in the profoundly
deaf might indeed be special. It is clearly more often the case that differences between the two
groups emerged for stimuli delivered at peripheral than central locations (e.g., Loke and Song 1991;
Bottari et al. 2010, in preparation; Neville and Lawson 1987). However, it is also clear that the
central or peripheral location of the stimulus is not a definite predictor of whether deaf and hearing
participants will differ in their performance. Better performance in deaf than in hearing participants
has been documented with both central and peripheral stimuli (e.g., see Section 22.2.2). Conversely,
threshold tasks proved ineffective in showing between-group differences, regardless of whether
stimuli were delivered centrally or peripherally. Thus, the question of what exactly is special in the
representation of peripheral stimuli in deaf individuals has not yet been resolved.
One observation relevant to this problem may be the recent finding from our group that the
differential processing for central and peripheral locations in deaf and hearing people emerges
independently of the orienting of attention. Bottari et al. (2010) showed no RT cost when process-
ing peripheral compared to central items in deaf participants, unlike hearing controls. Importantly, this
occurred in a task (simple detection) that requires no selective allocation of attentional resources
(Bravo and Nakayama 1992). This implies a functional enhancement for peripheral portions of the
visual field that cannot be reduced to the differential allocation of attentional resources alone (see
also Stevens and Neville 2006 for related evidence). Because the cost for peripheral compared to central
processing in hearing controls is classically attributed to the greater number of visual neurons devoted to
the analysis of central relative to peripheral portions of the visual field (e.g., Marzi and Di Stefano 1981;
Chelazzi et al. 1988), it can be hypothesized that profound deafness modifies the relative propor-
tion of neurons devoted to peripheral processing or their baseline activity. Note that assuming a dif-
ferent neural representation of the peripheral field also has implications for studies that examined the
effects of peripheral flankers on central targets (e.g., Proksch and Bavelier 2002; Sladen et al. 2005),
that is, it suggests that the larger interference from peripheral flankers in deaf individuals could at
least partially result from enhanced sensory processing of these stimuli, rather than attentional bias
to the periphery (similar to what would be obtained in hearing controls by simply changing the size
or the saliency of the peripheral flanker).
The final important aspect to consider is the role of selective attention in enhanced visual abili-
ties of deaf individuals. Our review of the literature concurs with the general hypothesis that deaf-
ness somehow modulates selective visual attention (e.g., Parasnis 1983; Neville and Lawson 1987;
Bavelier et al. 2006; Mitchell and Maslin 2007). However, it also indicates that any further devel-
opment of this theoretical assumption requires a better definition of which aspects of selective
attention are changed in this context of cross-modal plasticity. To date, even the basic distinction
between exogenous and endogenous processes has largely been neglected. If this minimal distinc-
tion is applied, it appears that endogenous orienting alone does not necessarily lead to better behav-
ioral performance in deaf than in hearing controls. This is, first of all, illustrated by the fact that
endogenous cueing of spatial attention (e.g., using a central arrow, as Parasnis and Samar 1985 have
done) can produce similar validity effects in deaf and hearing individuals. Furthermore, a recent
study by Bottari et al. (2008), which examined endogenous orienting of attention in the absence of
the exogenous captures induced by target onset, revealed no difference whatsoever between deaf
and hearing participants, regardless of whether attention was focused to the center, focused to the
periphery, or distributed across the entire visual scene. By contrast, several lines of evidence suggest
that the exogenous component of selective attention may be more prominent in deaf than in hearing
people. First, studies that have adopted the cue–target paradigm have shown more efficient detection
in deaf than in hearing adults when the target occurs at a location of the visual field that has been
made unattended (i.e., invalid; see Parasnis and Samar 1985; Colmenero et al. 2004, Experiment
1; Bosworth and Dobkins 2002a). Second, paradigms that adopted an SOA between cue and target
that can lead to IOR also revealed that deaf participants are less susceptible to this attentional manipu-
lation and respond more efficiently than controls to targets appearing at the supposedly inhibited
location (e.g., Colmenero et al. 2004, Experiment 2). Finally, deaf participants appear
to be more distracted than hearing controls by lateralized flankers that compete with a (relatively)
more central target (Dye et al. 2008; Proksch and Bavelier 2002; Sladen et al. 2005), as if the flanker
onset in the periphery of the visual field can capture exogenous attention more easily.
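
The two attentional signatures invoked here can be made explicit as simple RT contrasts: a validity effect (invalid minus valid RT at short cue–target intervals) and an IOR effect (cued minus uncued RT at long intervals). The sketch below computes both from hypothetical mean RTs; smaller validity and IOR effects in deaf participants would correspond to the findings summarized above.

```python
# Two standard cue-target measures: the validity effect (invalid minus valid RT
# at short cue-target intervals) and inhibition of return (cued minus uncued RT
# at long intervals, where responses to the previously cued location slow down).
mean_rt = {  # hypothetical mean RTs (ms)
    "short_soa": {"valid": 350, "invalid": 395},
    "long_soa":  {"cued": 385, "uncued": 360},
}

validity_effect = mean_rt["short_soa"]["invalid"] - mean_rt["short_soa"]["valid"]
ior_effect = mean_rt["long_soa"]["cued"] - mean_rt["long_soa"]["uncued"]
print(f"validity effect = {validity_effect} ms, IOR effect = {ior_effect} ms")
```
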
In the literature on visual attention in deaf individuals, the latter three findings have been inter-
preted within the spotlight metaphor for selective attention (Posner 1980), assuming faster shifts of
visual attention (i.e., faster reorienting) in deaf than in hearing participants. However, this is not the
only way in which attention can be conceptualized. A well-known alternative to the spotlight meta-
phor of attention is the so-called gradient metaphor (Downing and Pinker 1985), which assumes a
peak of processing resources at the location selected (as a result of bottom-up or top-down signals)
as well as a gradual decrease of processing resources as the distance from the selected location
increases. Within this alternative perspective, the different performance in deaf participants during
the attention tasks (i.e., enhanced response to targets at the invalid locations, or more interference
from lateralized flankers) could reflect a less steep gradient of processing resources in the profoundly
deaf. Although it is premature to conclude in favor of one or the other metaphor of selective atten-
tion, we believe it is important to consider the implications of assuming one instead of the other. For
instance, the gradient metaphor could provide a more neurally plausible model of selective atten-
tion. If one assumes that reciprocal patterns of facilitation and inhibition in the visual cortex can
lead to the emergence of a saliency map that can contribute to the early filtering of bottom-up inputs
(e.g., Li 2002), the different distribution of exogenous selective attention in deaf individuals could
represent a modulation occurring at the level of this early saliency map. Furthermore, assuming a
gradient hypothesis may perhaps better reconcile the results obtained in the studies that adopted the
cue–target and flanker paradigms in deaf individuals, with the results showing more efficient visual
search patterns in this population. Within the gradient perspective, better visual search for simple
features or faster detection of targets at invalidly cued locations could both relate to more resources
for preattentive detection of discontinuities in deaf individuals.
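
The contrast between the two metaphors can be made explicit with a toy model in which processing resources fall off as a Gaussian function of the distance from the attended location. With a wider (shallower) gradient, as hypothesized here for deaf observers, more resources remain available both at invalidly cued locations and at peripheral flanker locations, capturing the two findings with a single parameter. All values below are illustrative and are not fitted to any data set.

```python
# Toy gradient model: resources at a location x deg away from the attended
# location fall off as a Gaussian with width sigma. A wider gradient (larger
# sigma) leaves more resources at uncued and flanker locations.
import math

def resources(distance_deg, sigma_deg):
    return math.exp(-0.5 * (distance_deg / sigma_deg) ** 2)

sigma = {"hearing": 2.0, "deaf": 4.0}   # illustrative gradient widths (deg)

for group, s in sigma.items():
    invalid = resources(6.0, s)          # target 6 deg away from the cued location
    flanker = resources(3.0, s)          # flanker 3 deg from the attended target
    print(f"{group:8s} resources at invalid location: {invalid:.2f}; "
          f"at flanker location: {flanker:.2f}")
```
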

22.4  CONCLUSIONS AND FUTURE DIRECTIONS


When taken collectively, the past 30 years of research on visual cognition in deaf individuals may,
at first sight, appear heterogeneous. However, our systematic attempt to distinguish between the dif-
ferent operational definitions of “better visual abilities” in deaf individuals proved useful in reveal-
ing at least some of the existing regularities in this literature and specify under which context the
compensatory hypothesis is consistently supported.
First, the remarkable convergence of findings in the studies that adopted simple detection tasks and
the mixed findings of the studies that adopted discrimination paradigms (either for near-threshold or
suprathreshold stimuli) suggest that enhanced visual abilities in deaf individuals might be best
conceptualized as enhanced reactivity to visual events, rather than enhanced perceptual representa-
tions. In other words, deaf individuals “do not see better,” but react faster to the stimuli in the envi-
ronment. If this conclusion is true, reactivity measures may prove more informative than accuracy
reports when comparing deaf and hearing controls, even when discrimination tasks are adopted.
This raises the issue of what the neural basis for enhanced reactivity in deaf individuals may be
and at which processing stage it may emerge (i.e., perceptual processing, response preparation/
execution, or both). In addition, it raises the question of what functional role enhanced reactivity
may play in real life. In this respect, the multisensory perspective that we have introduced at the
beginning of this chapter may be of great use for understanding the ecological relevance of this phe-
nomenon. If audition constitutes a fundamental guide for reorienting our exploratory behavior
and a dedicated system for detecting and reacting to discontinuities, one could hypothesize that
faster reactivity to the visual events in deaf individuals may primarily serve the purpose of trigger-
ing orienting responses. Because all the evidence we have reviewed in this chapter originated from
paradigms in which overt orienting was completely prevented, this question remains open for future
research.
The second consistency that emerged from our review concerns the modulation that profound
deafness exerts on the representation of peripheral visual space and on visual attention. Although a
considerable body of evidence in the literature converges in supporting this conclusion, the challenge for future
research is the better specification of the operational description of both these concepts. Without
such an effort, the concepts of enhanced peripheral processing and enhanced visual attention are
at risk of remaining tautological redefinitions of the empirical findings. As discussed above for the
example of selective attention, even a minimal description of which aspects of selective attention
may be changed by profound deafness, or a basic discussion of the theoretical assumptions
underlying the notion of selective attention, can already contribute to the generation of novel predic-
tions for empirical research.

ACKNOWLEDGMENTS
We thank two anonymous reviewers for helpful comments and suggestions on an earlier version of
this manuscript. We are also grateful to Elena Nava for helpful comments and discussion. This work
was supported by a PRIN 2006 grant (Prot. 2006118540_004) from MIUR (Italy), a grant from
Comune di Rovereto (Italy), and a PAT-CRS grant from University of Trento (Italy).

REFERENCES
Armstrong, B., S. A. Hillyard, H. J. Neville, and T. V. Mitchell. 2002. Auditory deprivation affects processing
of motion, but not colour. Brain Research Cognitive Brain Research 14: 422–434.
Auer, E. T., Jr., and L. E. Bernstein. 2007. Enhanced visual speech perception in individuals with early-onset
hearing impairment. Journal of Speech Language and Hearing Research 50(5):1157–1165.
Bavelier, D., C. Brozinsky, A. Tomann, T. Mitchell, H. Neville, and G. H. Liu. 2001. Impact of early deaf-
ness and early exposure to sign language on the cerebral organization for motion processing. Journal of
Neuroscience 21: 8931–8942.
Bavelier, D., M. W. G. Dye, and P. C. Hauser. 2006. Do deaf individuals see better? Trends in Cognitive Science
10: 512–518.
Bavelier, D., A. Tomann, C. Hutton, T. V. Mitchell, D. P. Corina, G. Liu, and H. J. Neville. 2000. Visual atten-
tion to the periphery is enhanced in congenitally deaf individuals. Journal of Neuroscience 20: 1–6.
Bernstein, L. E., M. E. Demorest, and P. E. Tucker. 2000. Speech perception without hearing. Perception &
Psychophysics 62: 233–252.
Bernstein, L. E., E. T. Auer Jr., and P. E. Tucker. 2001. Enhanced speechreading in deaf adults: Can short-term
training/practice close the gap for hearing adults? Journal of Speech, Hearing, and Language Research
44: 5–18.
Bosworth, R. G., and K. R. Dobkins. 1999. Left-hemisphere dominance for motion processing in deaf signers.
Psychological Science 10: 256–262.
Bosworth, R. G., and K. R. Dobkins. 2002a. The effect of spatial attention on motion processing in deaf signers,
hearing signers, and hearing nonsigners. Brain and Cognition 4: 152–169.
Bosworth, R. G., and K. R. Dobkins. 2002b. Visual field asymmetries for motion processing in deaf and hearing
signers. Brain and Cognition 4: 152–169.
Bottari, D., M. Turatto, F. Bonfioli, C. Abbadessa, S. Selmi, M. A. Beltrame, and F. Pavani. 2008. Change blindness in profoundly deaf individuals and cochlear implant recipients. Brain Research 1242: 209–218.
Bottari, D., E. Nava, P. Ley, and F. Pavani. 2010. Enhanced reactivity to visual stimuli in deaf individuals.
Restorative Neurology and Neuroscience 28: 167–179.
Bottari, D., M. Turatto, and F. Pavani. In preparation. Visual change perception and speeded simple detection
in profound deafness.
Bravo, M. Y., and K. Nakayama. 1992. The role of attention in different visual search tasks. Perception &
Psychophysics 51: 465–472.
Bross, M. 1979a. Residual sensory capacities of the deaf: A signal detection analysis of a visual discrimination
task. Perceptual Motor Skills 1: 187–194.
Bross, M. 1979b. Response bias in deaf and hearing subjects as a function of motivational factors. Perceptual
Motor Skills 3: 779–782.
Bross, M., and H. Sauerwein. 1980. Signal detection analysis of visual flicker in deaf and hearing individuals.
Perceptual Motor Skills 51: 839–843.
Brozinsky, C. J., and D. Bavelier. 2004. Motion velocity thresholds in deaf signers: Changes in lateralization
but not in overall sensitivity. Brain Research Cognitive Brain Research 21: 1–10.
Chelazzi, L., C. A. Marzi, G. Panozzo, N. Pasqualini, G. Tassinari, and L. Tomazzoli. 1988. Hemiretinal difference in speed of light detection in esotropic amblyopes. Vision Research 28(1): 95–104.
Chen, Q., M. Zhang, and X. Zhou. 2006. Effects of spatial distribution of attention during inhibition of return (IOR) on flanker interference in hearing and congenitally deaf people. Brain Research 1109: 117–127.
Cohen, L. G., P. Celnik, A. Pascual-Leone, B. Corwell, L. Falz, J. Dambrosia, et al. 1997. Functional rel-
evance of cross-modal plasticity in blind humans. Nature 389: 180–183.
Colmenero, J. M., A. Catena, L. J. Fuentes, and M. M. Ramos. 2004. Mechanisms of visuo-spatial orienting in
deafness. European Journal Cognitive Psychology 16: 791–805.
Corbetta, M., and G. L. Shulman. 2002. Control of goal-directed and stimulus-driven attention in the brain.
Nature Reviews Neuroscience 3: 201–215.
Doehring, D. G., and J. Rosenstein. 1969. Speed of visual perception in deaf children. Journal of Speech and
Hearing Research 12:118–125.
Downing, C. J., and S. Pinker. 1985. The spatial structure of visual attention. In Attention and Performance XI, ed. M. I. Posner and O. S. M. Marin, 171–187. Hillsdale, NJ: Erlbaum.
Dye, M. W., P. C. Hauser, and D. Bavelier. 2008. Visual skills and cross-modal plasticity in deaf readers:
Possible implications for acquiring meaning from print. Annals of the New York Academy of Science
1145: 71–82.
Dye, M. W. G., D. E. Baril, and D. Bavelier. 2007. Which aspects of visual attention are changed by deafness?
The case of the Attentional Network Test. Neuropsychologia 45: 1801–1811.
Enns, J. T., and R. A. Rensink. 1991. Preattentive recovery of three-dimensional orientation from line-draw-
ings. Psychological Review 98: 335–351.
Eriksen, B. A., and C. W. Eriksen. 1974. Effects of noise letters upon the identification of a target letter in a
nonsearch task. Perception & Psychophysics 16: 143–149.
Fine, I., E. M. Finney, G. M. Boynton, and K. R. Dobkins. 2005. Comparing the effects of auditory depriva-
tion and sign language within the auditory and visual cortex. Journal of Cognitive Neuroscience 17:
1621–1637.
Finney, E. M., and K. R. Dobkins. 2001. Visual contrast sensitivity in deaf versus hearing populations: explor-
ing the perceptual consequences of auditory deprivation and experience with a visual language. Cognitive
Brain Research 11(1): 171–183.
Finney, E. M., I. Fine, and K. R. Dobkins. 2001. Visual stimuli activate auditory cortex in the deaf. Nature Neuroscience 4(12): 1171–1173.
Finney, E. M., B. A. Clementz, G. Hickok, and K. R. Dobkins. 2003. Visual stimuli activate auditory cortex in
deaf subjects: Evidence from MEG. Neuroreport 11: 1425–1427.
Furth, H. 1966. Thinking without language: Psychological implications of deafness. New York: Free
Press.
Gibson, E. 1969. Principles of perceptual learning and development. New York: Meredith.
Green, D. M., and J. A. Swets. 1966. Signal detection theory and psychophysics. New York: Wiley.
Harrington, D. O. 1971. The visual fields. St. Louis, MO: CV Mosby.
Hartmann, G. W. 1933. Changes in visual acuity through simultaneous stimulation of other sense organs.
Journal of Experimental Psychology 16:393–407.
Hartung, J. E. 1970. Visual perceptual skill, reading ability, and the young deaf child. Exceptional Children
36(8): 603–638.
Hauser, P. C., M. W. G. Dye, M. Boutla, C. S. Green, and D. Bavelier. 2007. Deafness and visual enumeration:
Not all aspects of attention are modified by deafness. Brain Research 1153: 178–187.
Heider, F., and G. Heider. 1940. Studies in the psychology of the deaf. Psychological Monographs 52: 6–22.
Heffner, R. S., and H. E. Heffner. 1992. Visual factors in sound localization in mammals. Journal of Comparative Neurology 317(3): 219–232.
Heming, J. E., and L. N. Brown. 2005. Sensory temporal processing in adults with early hearing loss. Brain and Cognition 59: 173–182.
Henderson, S. E., and L. Henderson. 1973. Levels of visual-information processing in deaf and hearing chil-
dren. American Journal of Psychology 86(3): 507–521.
Hoemann, H. 1978. Perception by the deaf. In Handbook of perception: Perceptual ecology, vol. 10, ed.
E. Carterette and M. Friedman, 43–64. New York: Academic Press.
Jonides, J. 1981. Voluntary versus automatic control over the mind’s eye’s movement. In Attention and perfor-
mance, Vol. IX, ed. L. B. Long and A. D. Baddeley, 187–203. Hillsdale, NJ: Erlbaum.
Jordan, T. E. 1961. Historical notes on early study of the deaf. Journal of Speech Hearing Disorders 26:118–121.
Klein, R. M. 2000. Inhibition of return. Trends in Cognitive Science 4: 138–147.
Levanen, S., and D. Hamdorf. 2001. Feeling vibrations: Enhanced tactile sensitivity in congenitally deaf
humans. Neuroscience Letters 301: 75–77.
Li, Z. 2002. A saliency map in primary visual cortex. Trends in Cognitive Science 1: 9–16.
Loke, W. H., and S. Song. 1991. Central and peripheral visual processing in hearing and nonhearing individu-
als. Bulletin of the Psychonomic Society 29: 437–440.
Marendaz, C., C. Robert, and F. Bonthoux. 1997. Deafness and attentional visual search: A developmental
study. Perception A: 26.
Marzi, C. A., and M. Di Stefano. 1981. Hemiretinal differences in visual perception. Documenta Ophthalmologica
Proceedings Series 30: 273–278.
Mayer, A. R., J. M. Dorflinger, S. M. Rao, and M. Seidenberg. 2004. Neural networks underlying endogenous
and exogenous visual–spatial orienting. Neuroimage 2: 534–541
Milner, A. D., and M. A. Goodale. 1995. The visual brain in action. Oxford, UK: Oxford Univ. Press.
Mitchell, R. E., and M. A. Karchmer. 2002. Demographics of deaf education: More students in more places.
American Annals of the Deaf 151(2): 95–104.
Mitchell, T., and M. T. Maslin. 2007. How vision matters for individuals with hearing loss. International
Journal of Audiology 46(9): 500–511.
Mohammed T., R. Campbell, M. MacSweeney, E. Milne, P. Hansen, and M. Coleman. 2005. Speechreading
skill and visual movement sensitivity are related in deaf speechreaders. Perception 34: 205–216.
Myklebust, H. 1964. The psychology of deafness. New York: Grune and Stratton.
Näätänen, R. 1992. Attention and brain function. Hillsdale, NJ: Erlbaum.
Nava, E., D. Bottari, M. Zampini, and F. Pavani. 2008. Visual temporal order judgment in profoundly deaf
individuals. Experimental Brain Research 190(2): 179–188.
Neville, H. J., and D. S. Lawson. 1987. Attention to central and peripheral visual space in a movement detec-
tion task: an event related potential and behavioral study: II. Congenitally deaf adults. Brain Research
405: 268–283.
Neville, H. J., and D. Bavelier. 2002. Human brain plasticity: Evidence from sensory deprivation and altered
language experience. Progress in Brain Research 138: 177–188.
Neville, H. J., A. Schmidt, and M. Kutas. 1983. Altered visual-evoked potentials in congenitally deaf adults.
Brain Research 266(1): 127–132.
O’Regan, J. K., H. Deubel, J. J. Clark, and R. A. Rensink. 2000. Picture changes during blinks: Looking with-
out seeing and seeing without looking. Visual Cognition 7: 191–212.
O’Regan, J. K., R. A. Rensink, and J. J. Clark. 1999. Change-blindness as a result of “mudsplashes.” Nature
398: 34.
Olson, J. R. 1967. A factor analytic study of the relation between the speed of visual perception and the lan-
guage abilities of deaf adolescents. Journal of Speech and Hearing Research 10(2): 354–360.
Parasnis, I. 1983. Visual perceptual skills and deafness: A research review. Journal of the Academy of
Rehabilitative Audiology 16: 148–160.
Parasnis, I., and V. J. Samar. 1985. Parafoveal attention in congenitally deaf and hearing young adults. Brain
and Cognition 4: 313–327.
Parasnis, I., V. J. Samar, and G. P. Berent. 2003. Deaf adults without attention deficit hyperactivity disorder
display reduced perceptual sensitivity and elevated impulsivity on the Test of Variables of Attention
(T.O.V.A.). Journal of Speech Language and Hearing Research 5: 1166–1183.
Poizner, H., and P. Tallal. 1987. Temporal processing in deaf signers. Brain and Language 30: 52–62.
Posner, M. 1980. Orienting of attention. The Quarterly Journal of Experimental Psychology 32: 3–25.
Prinzmetal, W., C. McCool, and S. Park. 2005. Attention: Reaction time and accuracy reveal different mecha-
nisms. Journal of Experimental Psychology: General 134: 73–92.
Prinzmetal, W., A. Zvinyatskovskiy, P. Gutierrez, and L. Dilem. 2009. Voluntary and involuntary attention
have different consequences: The effect of perceptual difficulty. The Quarterly Journal of Experimental
Psychology 2: 352–369.
Proksch, J., and D. Bavelier. 2002. Changes in the spatial distribution of visual attention after early deafness.
Journal of Cognitive Neuroscience 14: 687–701.
Pylyshyn, Z.W. 1989. The role of location indexes in spatial perception: A sketch of the FINST spatial-index
model. Cognition 32: 65–97.
Quittner, A. L., P. Leibach, and K. Marciel. 2004. The impact of cochlear implants on young deaf children:
New methods to assess cognitive and behavioral development. Archives of Otolaryngology Head Neck
and Surgery 5: 547–554.
Rensink, R. A. 2001. Change blindness: Implications for the nature of attention. In Vision and Attention, ed. M. R. Jenkin and L. R. Harris, 169–188. New York: Springer.
Rettenbach, R., G. Diller, and R. Sireteanu. 1999. Do deaf people see better? Texture segmentation and visual
search compensate in adult but not in juvenile subjects. Journal of Cognitive Neuroscience 5: 560–583.
Reynolds, H. 1993. Effects of foveal stimulation on peripheral visual processing and laterality in deaf and hear-
ing subjects. American Journal of Psychology 106(4): 523–540.
Rönnberg, J. 1995. Perceptual compensation in the deaf and blind: Myth or reality? In Compensating for psy-
chological deficits and declines, ed. R. A. Dixon and L. Backman, 251–274. Mahwah, NJ: Erlbaum.
Sagi, D., and B. Julesz. 1984. Detection versus discrimination of visual orientation. Perception 13(5):
619–628.
Sladen, D., A. M. Tharpe, D. H. Ashmead, D. W. Grantham, and M. M. Chun. 2005. Visual attention in deaf
and normal hearing adults: effects of stimulus compatibility. Journal of Speech Language and Hearing
Research 48: 1–9.
Stevens, C., and H. Neville. 2006. Neuroplasticity as a double-edged sword: Deaf enhancements and dyslexic
deficits in motion processing. Journal of Cognitive Neuroscience 18: 701–714.
Stivalet, P., Y. Moreno, J. Richard, P. A. Barraud, and C. Raphael. 1998. Differences in visual search tasks
between congenitally deaf and normally hearing adults. Brain Research Cognitive Brain Research 6:
227–232.
Suchman, R. G. 1966. Color–form preference, discriminative accuracy and learning of deaf and hearing children. Child Development 37(2): 439–451.
Summerfield, Q. 1991. Visual perception of phonetic gestures. In Modularity and the motor theory of speech
perception, ed. I. G. Mattingly and M. Studdert-Kennedy, 117–137. Hillsdale, NJ: Erlbaum.
Treisman, A. 1982. Perceptual grouping and attention in visual search for features and for objects. Journal of
Experimental Psychology: Human Perception and Performance 8(2): 194–214.
Turatto, M., and B. Bridgeman. 2005. Change perception using visual transients: Object substitution and dele-
tion. Experimental Brain Research 167: 595–608.
Turatto, M., M. Valsecchi, L. Tamè, and E. Betta. 2007. Microsaccades distinguish between global and local
visual processing. Neuroreport 18:1015–1018.
23 A Multisensory Interface for
Peripersonal Space

Body–Object Interactions
Claudio Brozzoli, Tamar R. Makin, Lucilla Cardinali,
Nicholas P. Holmes, and Alessandro Farnè

CONTENTS
23.1 Multisensory and Motor Representations of Peripersonal Space.......................................... 449
23.1.1 Multisensory Features of Peripersonal Space: Visuo-Tactile Interaction around
the Body.....................................................................................................................449
23.1.1.1 Premotor Visuo-Tactile Interactions........................................................... 450
23.1.1.2 Parietal Visuo-Tactile Interactions.............................................................. 450
23.1.1.3 Subcortical Visuo-Tactile Interaction......................................................... 451
23.1.1.4 A Visuo-Tactile Network............................................................................ 452
23.1.1.5 Dynamic Features of PpS Representation.................................................. 452
23.1.2 Motor Features of PpS: Visuo-Tactile Interaction around the Acting Body............. 452
23.1.3 A Multisensory–Motor Network for Body–Object Interactions in PpS.................... 454
23.2 Multisensory-Based PpS Representation in Humans............................................................ 455
23.2.1 PpS Representation in Humans................................................................................. 455
23.2.1.1 PpS Representation in Neuropsychological Patients.................................. 455
23.2.1.2 PpS Representation in Neurotypical Participants....................................... 456
23.2.2 A Multisensory Interface for Body–Objects Interactions......................................... 458
23.3 Conclusion.............................................................................................................................460
Acknowledgments...........................................................................................................................460
References.......................................................................................................................................460

23.1 MULTISENSORY AND MOTOR REPRESENTATIONS OF PERIPERSONAL SPACE

23.1.1  Multisensory Features of Peripersonal Space: Visuo-Tactile Interaction around the Body
The binding of visual information available outside the body with tactile information arising, by
definition, on the body, allows the representation of the space lying in between, which is often the
theater of our interactions with objects. The representation of this intermediate space has become
known as “peripersonal space” (Rizzolatti et al. 1981b, 1981c). The definition of peripersonal space
(PpS hereafter) originates from single-unit electrophysiological studies in macaque monkeys, based
on a class of multisensory, predominantly visual–tactile neurons. Over the years, such neurons have
been identified in several regions of the monkey brain, including premotor area 6, parietal areas
(Broadmann’s area 7b and the ventral intraparietal area, VIP), and the putamen (Fogassi et al. 1999;
Graziano 2001; Rizzolatti et al. 1997). The most relevant characteristic of these neurons, for present
purposes, is that, in addition to responding both to visual and tactile stimulation (referred to here
as visuo-tactile), their visually evoked responses are modulated by the distance between the visual
object and the tactile receptive field (RF). This allows for the coding of visual information that is
dependent, or centered, on the body part that contains the tactile RF.

23.1.1.1  Premotor Visuo-Tactile Interactions


The most detailed series of studies on the properties of visuo-tactile neurons has been performed
in the premotor cortex. Neurons in the F4 subregion of inferior area 6 in ventral premotor cortex
(Matelli et al. 1985) are strongly responsive to tactile stimulation. They are characterized by rela-
tively large tactile RFs located primarily on the monkey’s face, neck, arm, hand, or both hands
and face (e.g., in the peribuccal region; Gentilucci et al. 1988; Rizzolatti et al. 1981a). A large
proportion (85%) of the tactile neurons in this area also discharges in response to visual stimuli.
According to the depth of the visual RFs extending out from the body, these bimodal neurons were
originally subdivided into pericutaneous (54%) and distant peripersonal neurons (46%). The pericu-
taneous neurons responded best to stimuli presented a few centimeters from the skin (10 cm or less;
Rizzolatti et al. 1981b), whereas the distant peripersonal neurons responded to stimuli within reach
of the monkey’s arms. We will refer to both as “peripersonal” visuo-tactile neurons throughout the
text. Therefore, an important property of these neurons (and neurons in other PpS-related areas; see
below) is that their visual RFs are limited in depth from the tactile RFs (in most cases from ~5 to
~50 cm). The visual RFs are generally independent of gaze direction (Fogassi et al. 1992; Gentilucci
et al. 1983), being spatially related instead to the body parts on which the tactile RFs are located.
Moreover, when the arm is moved under the monkey’s view, the visual RF follows the body part,
being “anchored” to the tactile RF thus keeping a rough spatial match between the locations of the
visual RF and the arm with every displacement (Graziano et al. 1994, 1997; Figure 23.1).
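The computational signature suggested by these recordings (a visual response gated by the distance of the stimulus from the body part bearing the tactile RF, anchored to that body part, and indifferent to gaze) can be made concrete with a deliberately simple toy model. The following sketch, in Python, is purely illustrative and is not drawn from any of the studies reviewed here; the function, the Gaussian fall-off, and all numerical values (RF depth, firing rates, coordinates) are assumptions chosen only to make hand-anchored, gaze-independent coding explicit.

    # Illustrative toy model (not from the studies reviewed here): a "peripersonal"
    # visuo-tactile neuron whose visual response depends on the distance between a
    # visual stimulus and the hand carrying its tactile RF, not on gaze direction.
    # All parameter values are arbitrary assumptions chosen for illustration.
    import numpy as np

    def pps_neuron_response(stimulus_xyz, hand_xyz, gaze_xyz,
                            rf_depth_cm=10.0, max_rate_hz=60.0, baseline_hz=5.0):
        """Firing rate (spikes/s) of a toy neuron with a hand-anchored visual RF."""
        dist_from_hand = np.linalg.norm(np.asarray(stimulus_xyz) - np.asarray(hand_xyz))
        gain = np.exp(-(dist_from_hand / rf_depth_cm) ** 2)  # falls off with distance from the hand
        return baseline_hz + (max_rate_hz - baseline_hz) * gain  # gaze_xyz is deliberately unused

    # The same visual stimulus drives the model neuron when the hand is nearby,
    # but not once the hand is moved away: the visual RF "follows" the hand.
    stimulus = (10.0, 0.0, 0.0)  # cm, arbitrary workspace coordinates
    print(pps_neuron_response(stimulus, hand_xyz=(12.0, 0.0, 0.0), gaze_xyz=(0.0, 50.0, 0.0)))
    print(pps_neuron_response(stimulus, hand_xyz=(-30.0, 0.0, 0.0), gaze_xyz=(0.0, 50.0, 0.0)))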
Although less numerous, visuo-tactile neurons are also present in the rostral subregion F5 of area
6, and have smaller tactile RFs than F4 neurons. The tactile RFs are frequently located on the face,
the hand, or both. However, the visual properties of these neurons were shown to be quite differ-
ent: even though stimuli presented close to the body resulted in stronger responses, the size of the
stimuli appeared to be a more critical factor in driving the activity of F5 neurons (Rizzolatti et al.
1988; Rizzolatti and Gentilucci 1988).

23.1.1.2  Parietal Visuo-Tactile Interactions


The posterior parietal lobe of the macaque brain contains two subregions with visuo-tactile prop-
erties: area 7b of the inferior posterior parietal lobe and the ventral section of the intraparietal
sulcus (VIP). As in the premotor cortex, electrophysiological studies in awake monkeys revealed
that visuo-tactile integration in these areas arises at the single unit level (Hyvärinen and Poranen
1974; Hyvärinen 1981; Leinonen et al. 1979; Leinonen and Nyman 1979; Mountcastle et al. 1975;
Robinson et al. 1978; Robinson and Burton 1980a, 1980b).* Within area 7b, most neurons were
responsive to tactile stimuli, and presented a gross somatotopic organization, with separate face,
arm, and hand representations (Hyvärinen and Shelepin 1979; Hyvärinen 1981; Robinson and
Burton 1980a). Within the face and arm regions of this map, visuo-tactile cells (33%) have been
reported (Hyvärinen and Poranen 1974; Hyvärinen and Shelepin 1979; Hyvärinen 1981; Leinonen et
al. 1979; Leinonen and Nyman 1979). What is the function of these responses? Researchers initially
interpreted these visual responses as an “anticipatory activation” that appeared before the neuron’s
tactile RF was touched (Hyvärinen and Poranen 1974, p. 675). Importantly, a close correspondence
between the tactile and visual RFs has been documented, especially for tactile RFs on the arm

* A possibly earlier report can be attributed to Sakata and colleagues (1973, p. 100). In this study about the functional
organization of area 5, the authors stated: “Even the relatively rare neurons which we could activate visually were more
powerfully driven by somatosensory stimuli.” However, no further detail or discussion was offered concerning the limi-
tation in depth of the visual RF.

FIGURE 23.1  Representation of visual stimuli in hand-based coordinates. Visual responses of a typical premotor
neuron with a tactile RF (hatched) on forearm and hand, and a visual RF within 10 cm of tactile RF. On each
trial, the arm contralateral to neuron was fixed in one of two positions: (a) on the right (light gray symbols and
lines) or (b) on the left (dark gray symbols and lines) and visual stimulus was advanced along one of four
trajectories (numbered 1–4). (c) Responses of neuron (spikes/s) to four stimulus trajectories when the arm was
visible to the monkey were recorded for both positions. When the arm was fixed on the right, response was
maximal for trajectory 3, which was approaching the neuron’s tactile RF. When the arm was fixed on the left,
maximal response shifted with the hand to trajectory 2, which was now approaching the tactile RF. This example
shows that neurons in the monkey’s premotor cortex represent visual information with respect to the tactile RF.
(Modified from Graziano, M. S. A. In Proceedings of the National Academy of Sciences of the United States of
America, 1999.)

(Leinonen et al. 1979). That is, these neurons’ activation was shown to be dependent on the distance
of the effective visual stimulus from the body part. Most of these neurons responded to visual
stimuli moving toward the monkey, within about 10 cm of the tactile RF (although in some cases,
stimulation presented further away, but still within a reachable distance, was also effective).
Multisensory neurons have also been found in the monkey area VIP, in the fundus of the intrapa-
rietal sulcus (Avillac et al. 2005; Colby and Duhamel 1991; Colby et al. 1993; Duhamel et al. 1998).
VIP neurons respond to tactile and visual stimulation presented within a few centimeters of the
tactile RF. Unlike area 7b neurons, tactile RFs in VIP are primarily located on the face and head,
and visual RFs are anchored to a region of space around the face (Colby et al. 1993).

23.1.1.3  Subcortical Visuo-Tactile Interaction


Pools of multisensory neurons have also been found in subcortical structures of the macaque brain.
The multisensory encoding of events has been well established in the superior colliculus (Stein
and Meredith 1993; Wallace and Stein 2007). Such collicular activity, however, seems not to be
devoted primarily to representing the space near the body (for a full discussion of the properties and
functional roles of multisensory neurons in the superior colliculus, see Chapter 11 and Chapter 15).
The putamen, on the other hand, seems to be a relevant region for the visuo-tactile processing of
events in the space around the body (Graziano and Gross 1993, 1994, 1995). Visuo-tactile neurons
in the putamen with tactile RFs on the arm, hand, and face are somatotopically organized. Just as
for the cortical visuo-tactile neurons, the visual and tactile RFs in the putamen show a rough spatial
correspondence, with the visual RFs being anchored to the tactile ones. Most of these neurons are also
responsive to visual stimuli, as long as they are presented close to the tactile RF. A large portion
(82%) of face neurons responds best to visual stimuli presented in a region of space within 10–20 cm
from the tactile RF. Neurons with tactile RFs on the arm and hand have even shallower visual
RFs around the hand (up to 5 cm; Graziano and Gross 1993).

23.1.1.4  A Visuo-Tactile Network


The neurophysiological findings described in the previous sections define a set of at least four
distinct areas with similar visuo-tactile responses: premotor inferior area 6, parietal areas 7b and
VIP, and the putamen. These areas are heavily interconnected, forming a tight network (Matelli and
Luppino 2001; Rizzolatti et al. 1997, 1998). Neurons in this network share some common features:
(1) The visual responses lie primarily within a head–face or arm–hand centered somatosensory rep-
resentation of the body. (2) Visual stimuli moving near the monkey modulate the neurons’ responses
more strongly than stimuli presented farther away. This suggests that these neurons allow for body part–centered coding
of visual stimuli within sectors of space adjacent to the tactile surface. This network possesses all
of the necessary properties to bind together external visual information around the body and tactile
information on a specific body part (Fogassi et al. 1992; Graziano and Gross 1993; Rizzolatti et al.
1997).

23.1.1.5  Dynamic Features of PpS Representation


An important characteristic of some visuo-tactile areas is the dynamic property of their visual RFs.
Fogassi and colleagues (1996) found that the depth of the visual RFs of F4 visuo-tactile neurons can
increase with increases in the velocity (20–80 cm/s) of a visual stimulus approaching the cutaneous
RF. This property could be crucial for preparing and/or executing actions toward nearby objects.
Iriki and colleagues (1996) revealed that, after training monkeys to use a rake as a tool to reach
food pellets placed outside their reaching space, some neurons in the post-central gyrus (somewhat
extending into the intraparietal sulcus) began to display visual responses. In addition, although con-
cerns have been raised in this respect (Holmes and Spence 2004), such visual responses appeared
to be modulated by active, but not by passive, tool use. The newly acquired visual RFs seemed to
have expanded toward the tool tip. A few minutes after the active tool use, the visual RFs apparently
shrank back to their original size. In other words, the dynamic aspects of the visual RF may depend
on the execution of specific motor actions (Rizzolatti et al. 1998).
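One way to make the velocity-dependent expansion reported by Fogassi and colleagues (1996) concrete is a toy model in which the effective depth of the visual RF grows with the approach speed of the stimulus, so that faster objects begin to drive the neuron from farther away. The following sketch is illustrative only; the linear scaling, the function name, and the specific constants are assumptions, not parameters estimated from the recordings.

    # Illustrative toy model (assumed, not fitted to data): the effective depth of a
    # visuo-tactile neuron's visual RF expands with the approach velocity of the
    # stimulus (cf. the 20-80 cm/s range mentioned in the text), so that
    # faster-approaching objects recruit the neuron from farther away.

    def effective_rf_depth_cm(approach_velocity_cm_s,
                              base_depth_cm=10.0, anticipation_s=0.25):
        """Distance at which the toy neuron starts responding, for a given approach speed.

        base_depth_cm  -- assumed RF depth for a static or slowly moving stimulus
        anticipation_s -- assumed extra "look-ahead" time converted into distance
        """
        return base_depth_cm + anticipation_s * approach_velocity_cm_s

    for v in (20, 40, 80):  # cm/s, spanning the range reported in the text
        print(f"approach at {v} cm/s -> effective RF depth ~{effective_rf_depth_cm(v):.0f} cm")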
An interesting recent finding showed that visuo-tactile neurons within area 7b and VIP also
respond when another individual’s body part is approached by a visual stimulus (Ishida et al. 2009).
Similarly to the visuo-tactile neurons described above, these “body-matching neurons” respond to
visual stimuli presented near the tactile RF. Moreover, the neurons are responsive to a visual stimu-
lus presented close to the corresponding body part of another individual (a human experimenter)
being observed by the monkey. For instance, a neuron displaying a tactile RF on the arm not only
responded to a visual stimulus presented close to the monkey’s own arm, but also to visual stimuli
presented close to another individual’s arm. For some of these neurons, this matching property
seems to be independent of the position of the observed individual with respect to the observing
monkey (up to 35° of rotation).

23.1.2  Motor Features of PpS: Visuo-Tactile Interaction around the Acting Body
Why should the brain maintain a representation of the space around the body separate from a
representation of far extrapersonal space? One possibility is that this dichotomy serves purely
perceptual aims, giving a “greater” perceptual salience to visual events occurring in the vicinity of
the body. Following this idea, the parieto-frontal network, together with the putamen, would code
visual space with individual body parts as its reference. This is suggested by the sensory properties
of this set of neurons, which respond selectively to visual information close to the body. However, we
believe that this interpretation does not fully capture the potential functions of this
system, since it is not consistent with some of the evidence described above. First, it may be dif-
ficult to interpret the complex tactile RFs of some of these neurons (e.g., single neurons in area F4
that represent both the hand and face, as reported by Rizzolatti et al. 1981a, 1981b). Second, it does
not account for the dynamic changes in their visual RFs, as observed in cases of objects approach-
ing the body (Fogassi et al. 1996). More critically, a purely perceptual account does not fit with the
presence of such bimodal neurons in a predominantly “motor” area, such as the premotor cortex.
Numerous visuo-tactile neurons in inferior area 6 (Gentilucci et al. 1988; Rizzolatti et al. 1981c,
1987, 1988, 1997; Rizzolatti and Gentilucci 1988), parietal areas 7b (Hyvärinen 1981; Hyvärinen
and Poranen 1974; Hyvärinen and Shelepin 1979; Leinonen 1980; Leinonen et al. 1979; Leinonen
and Nyman 1979; Robinson et al. 1978), and the putamen (Crutcher and DeLong 1984) respond not
only to passive visual and tactile stimulation, but also during motor activity.
These findings raise the more compelling possibility that the multisensory representation of PpS
serves some motor function. Objects in the vicinity of the body are indeed more relevant by virtue
of the possible interactions our body can establish with them (Graziano et al. 1993; Rizzolatti et
al. 1997, 1998). Therefore, hand-centered representation of PpS provides us with extremely valu-
able information regarding the spatial position of objects with respect to our hands. Here follows a
description of the motor aspects associated with PpS brain areas, as revealed by electrophysiologi-
cal studies in macaque monkeys.
The premotor cortex has both direct (Martino and Strick 1987) and indirect (Godschalk et al.
1984; Matsumura and Kubota 1979; Muakkassa and Strick 1979; Pandya and Vignolo 1971) access
to the control of upper limb movements, via projections to the spinal cord and the primary motor
cortex, respectively. The motor properties of neurons in the inferior premotor cortex support a
role for this structure in a perception–action interface. In particular, the visual responses of some
neurons within this area are enhanced when a reaching movement is performed toward an object
(Godschalk et al. 1985), as well as during reaching and grasping movements of the arm and hand
(Godschalk et al. 1981, 1985; Kurata et al. 1985; Kurata and Tanji 1986; Rizzolatti and Gentilucci
1988) and mouth (Rizzolatti et al. 1981c). Moreover, neurons in this area show a rather fine-grained
motor representation. Proximal and distal movements are represented separately (in areas F4/F1
and area F5, respectively), with the proximal neurons mostly activated for arm and face movements
(Gentilucci et al. 1988; Kurata and Tanji 1986; Murata et al. 1997; Raos et al. 2006; Rizzolatti et al.
1987, 1988; Rizzolatti and Gentilucci 1988). Crucially, the passive RFs and the active movements
appear to share related functional roles: neurons with visuo-tactile RFs on the face also discharged
during arm reaching movements toward the upper part of space that corresponds to their visual RFs.
This suggests that the sensory and motor responses are expressed in a common reference frame for
locating objects in the space close to the body and for guiding movements toward them. We believe
that such a complex motor mechanism cannot subserve a purely perceptual function.
Parietal area 7b also has motor properties. As in the premotor cortex, parietal motor functions
seem to be related to approaching movements of a body part toward an object (Gardner et al. 2007;
Lacquaniti and Caminiti 1998; Rizzolatti et al. 1997). Indeed, the posterior parietal cortex is part of
the dorsal stream of action-oriented visual processing (Milner and Goodale 1995), and both inferior
and superior parietal lobules are interconnected with the premotor cortex (see above).
Ablation and reversible inactivation studies in monkeys have shown a direct relationship between
the PpS network and motor responses. These studies tested for the behavioral consequences of a
lesion within premotor and posterior parietal areas, where visuo-tactile neurons have been found.
Interestingly, lesions to either the anterior or the posterior parts of this network seem to produce very
similar patterns of motor impairments, most of which affect, in particular, the execution of visually
guided reaching actions (Battaglini et al. 2002; Deuel and Regan 1985; Ettlinger and Kalsbeck 1962;
Faugier-Grimaud et al. 1978; Gallese et al. 1994; Halsband and Passingham 1982; Moll and Kuypers
1977; Rizzolatti et al. 1983). After premotor ablation, for instance, the monkeys were unable to
reach with the contralesional arm when the movement required them to avoid an obstacle.
Arm movements were executed without correctly taking into account visual information within
PpS (Battaglini et al. 2002; Moll and Kuypers 1977). Similarly, removal of postarcuate regions
in the premotor cortex where the mouth is represented (presumably in area F4), caused a severe
impairment in grasping with the mouth (Rizzolatti et al. 1983). Attentional deficits have also been
reported after selective damage to visuo-tactile parietal and premotor regions (Rizzolatti et al. 1983)
in the form of spatial hemineglect and extinction. The monkeys appeared to be unaware of visual
(or tactile) stimuli presented in the contralesional space. Crucially, this deficit was selective for the
space around the body.
Subregion F5 of the inferior area 6 is also characterized by the presence of “mirror” neurons, a
special class of motor neurons with visual properties. These neurons are selective for the execution
of a specific motor act, such as precision grasping. They also discharge when the monkey observes
another monkey or a human executing the same action (di Pellegrino et al. 1992; Gallese et al. 1996;
Rizzolatti et al. 1996).* Relevant for this chapter is a recent study that showed selectivity in certain
mirror neurons for actions performed within the observer’s PpS rather than in its extrapersonal
space (peripersonal mirror neurons, Caggiano et al. 2009). A different subpopulation of mirror neu-
rons showed the opposite preference (i.e., selectivity for actions performed in extrapersonal space,
rather than PpS). Moreover, peripersonal and extrapersonal space appeared to be defined according
to a functional criterion: When accessibility to PpS was limited (e.g., by placing a screen in front of
the monkey), the responses of several peripersonal mirror neurons were reduced during observation
of actions performed in the inaccessible portion of the space. That is, when PpS was inaccessible
for action, it was represented as farther extrapersonal space. Indeed, in such circumstances,
extrapersonal mirror neurons started to respond to observation of actions performed in the inacces-
sible PpS.

23.1.3  A Multisensory–Motor Network for Body–Object Interactions in PpS


The studies reviewed above provide a large body of indirect evidence in favor of the proposal that
this parieto-frontal network binds together visual and tactile information in order to generate an
appropriate motor program toward objects in the world. We would like to suggest that the occur-
rence of multisensory and motor processing within the same area provides an interface between
perception and action.
What kind of body–object interactions can body-centered PpS representation subserve? PpS
has traditionally been suggested to play a role in guiding hand actions toward objects within reach-
ing distance (Bremmer 2005; Fogassi and Luppino 2005; Graziano 1999; Maravita et al. 2003;
Maravita 2006; Rizzolatti et al. 1987). Indeed, the evidence described above seems to support the
involvement of some PpS areas in reaching and grasping. Another intriguing possibility that has
recently been investigated is the involvement of the PpS network in defensive (re)actions. By act-
ing as an anticipatory sensory–motor interface, PpS may serve the early detection of potential
threats approaching the body (Fogassi et al. 1996) in order to drive involuntary defensive move-
ments (Cooke and Graziano 2004; Graziano and Cooke 2006). The most direct evidence in favor
of this hypothesis comes from cortical electrical stimulation studies (although concerns have been
raised in this respect; see Strick 2002; Graziano et al. 2002). Electrical stimulation of the ventral
premotor cortex and the VIP (Graziano and Cooke 2006) has been reported to elicit a pattern of
movements that is compatible with defensive arm movements and the withdrawal of the arm or the
head (Cooke and Graziano 2003). However, the same anticipatory features may also have evolved to
serve voluntary object-oriented actions (Gardner et al. 2007; Rizzolatti et al. 1981a, 1981b, 1997). In
support of this view are the results of the described electrophysiological recording studies, showing
the motor properties of both parietal and periarcuate visuo-tactile neurons, whose discharges are

* A first report of neurons responding while the monkey was watching an action performed by another individual is
already present in an early electrophysiological study over the parietal area 7b (Leinonen 1980, p. 305) : “[…] two cells
discharged when the monkey grasped an object […] or when the monkey saw an investigator grasp an object.”

mostly correlated with reaching and grasping movements (see Section 23.1.2). The two hypotheses
(involuntary and voluntary object-oriented actions) are not mutually exclusive and one could specu-
late that a fine-grained and sophisticated function could have developed from a more primordial
defensive machinery, using the same visuo-tactile spatial coding of the PpS (see the “neuronal
recycling hypothesis,” as proposed by Dehaene 2005). This hypothetical evolutionary advancement
could lead to the involvement of the PpS mechanisms in the control of the execution of voluntary
actions toward objects. Some comparative data showed, for instance, that the prosimian sensory areas
corresponding to the monkeys’ parietal areas already display some rudimentary motor activity. The
most frequently represented movements are highly stereotyped limb retractions associated with
avoidance movements (Fogassi et al. 1994).

23.2  MULTISENSORY-BASED PPS REPRESENTATION IN HUMANS


Several studies support the existence of a similar body part–centered multisensory representation
of the space around the body in the human brain. In this respect, the study of a neuropsycho-
logical condition called “extinction” (Bender 1952; Brozzoli et al. 2006) has provided considerable
insight into the behavioral characteristics of multisensory spatial representation in the human brain
(Ladavas 2002; Ladavas and Farnè 2004; Legrand et al. 2007). Evidence for visuo-tactile interac-
tions is also available in healthy people, in the form of distance-modulated interference exerted by
visual over tactile stimuli (Brozzoli et al. 2009a, 2009b; Spence et al. 2004a, 2008). The crucial
point of these studies is the presence, both in the brain-damaged and healthy populations, of stron-
ger visuo-tactile interactions when visual stimuli are presented in near, as compared to far space.
These studies thus support the idea that the human brain also represents PpS through an integrated
visuo-tactile system (Figure 23.2).

23.2.1  PpS Representation in Humans


23.2.1.1  PpS Representation in Neuropsychological Patients
Extinction is a pathological sign following brain damage, whereby patients fail to perceive con-
tralesional stimuli only under conditions of double simultaneous stimulation, thus revealing the
competitive nature of this phenomenon (di Pellegrino and De Renzi 1995; Driver 1998; Ward et al.


FIGURE 23.2  Peripersonal space representation. Head- and hand-centered peripersonal space (dark gray
areas) with respect to arm-centered reaching space (light gray region). (Modified from Cardinali, L. et al., In
Encyclopedia of Behavioral Neuroscience, 2009b.)

1994). A number of studies have shown that extinction can emerge when concurrent stimuli are pre-
sented in different sensory modalities: A visual stimulus presented near to the ipsilesional hand can
extinguish a touch delivered on the contralesional hand (di Pellegrino et al. 1997; see also Costantini
et al. 2007, for an example of cross-modal extinction within a hemispace). Crucially, such cross-
modal visuo-tactile extinction appears to be stronger when visual stimuli are presented in near
as compared to far space, thus providing neuropsychological support for the idea that the human
brain represents PpS through an integrated visuo-tactile system. Moreover, in accordance with the
findings from the electrophysiological studies described in the previous section, visual responses
to stimuli presented near the patient’s hand remain anchored to the hand when it is moved to the
opposite hemispace. This evidence suggests that PpS in humans is also coded in a hand-centered
reference frame (di Pellegrino et al. 1997; Farnè et al. 2003). A converging line of evidence suggests
that the space near the human face is also represented by a multisensory mechanism. We demon-
strated that visuo-tactile extinction can occur by applying visual and tactile stimuli on the patient’s
face (Farnè et al. 2005b). Interestingly, the extinction was strongest when the homologous body part
was being stimulated (i.e., left and right cheeks, rather than left hand and right cheek), suggesting
that different spatial regions, adjacent to different body parts, are represented separately (Farnè et
al. 2005b). In a further study, we presented four extinction patients with visual stimuli near and far
from the experimenter’s right hand, as well as from their own right hands (Farnè et al., unpublished
data). Although visual stimuli presented near the patients’ own hands successfully extinguished the
touch on the patients’ left hand, visual stimuli presented near the experimenter’s hand induced no
cross-modal extinction, offering no support for a possible body-matching property of the human PpS
system. This discrepancy with the evidence reported
in the electrophysiological literature might stem from the fact that we used a more radical change
in orientation between the observer’s own and the observed hands (more than 35°; see Section
23.1.1). Finally, we have shown that the human PpS also features plastic properties, akin to those
demonstrated in the monkey: Visual stimuli presented in far space induced stronger cross-modal
extinction after the use of a 38-cm rake to retrieve (or act upon) distant objects (Farnè and Làdavas
2000; see also Berti and Frassinetti 2000; Bonifazi et al. 2007; Farnè et al. 2005c, 2007; Maravita
and Iriki 2004). The patients’ performance was evaluated before tool use, immediately after a 5-min
period of tool use, and after a further 5- to 10-min resting period. Far visual stimuli were found to
induce more severe contralesional extinction immediately after tool use, compared with before tool
use. These results demonstrate that, although near and far spaces are separately represented, this
spatial division is not defined a priori. Instead, the definition of near and far space may be derived
functionally, depending on movements that allow the body to interact with objects in space.*

23.2.1.2  PpS Representation in Neurotypical Participants


In healthy participants, most of the behavioral evidence for the hand-centered visuo-tactile repre-
sentation of near space derives from a visuo-tactile interference (VTI) paradigm. In this series of
studies, participants were asked to discriminate between two locations of a tactile stimulus, while
an irrelevant visual distractor was delivered at a congruent or incongruent location. The overall
effect was a slowing in response times for the incongruent trials, as compared with the congru-
ent ones (Pavani and Castiello 2004; Spence et al. 2004b, 2008). More relevant here is how this
interference varied when the visual distractor was presented near to, as compared to far
from, the tactile targets. In analogy with the cross-modal extinction studies, the VTI was stronger
when the visual information occurred close to the tactually stimulated body part rather than in far
space (for reviews, see Spence et al. 2004b, 2008). Using the same approach, the effect of tool use
on VTI in near and far space has been studied in healthy individuals (Holmes et al. 2004, 2007a,

* We have recently studied the effects of tool use on the body schema (Cardinali et al. 2009c). We found that the rep-
resentation of the body is dynamically updated with the use of the tool. This dynamic updating of the body schema
during action execution may serve as a sort of skeleton for PpS representation (for a critical review of the relationship
between human PpS and body schema representations, see Cardinali et al. 2009a).

2007b, 2008), with some differences in results as compared to studies conducted in neurological
patients, as described above (see also Maravita et al. 2002).
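The dependent measure in these paradigms is simply the cost, in tactile discrimination time (and/or accuracy), of an incongruent relative to a congruent visual distractor, computed separately for distractors presented near to and far from the stimulated hand. The following sketch outlines that comparison with made-up reaction times; the trial structure, variable names, and numbers are assumptions for illustration, not data from the studies cited above.

    # Illustrative computation of the visuo-tactile interference (VTI), i.e., the
    # crossmodal congruency effect: RT(incongruent) - RT(congruent), calculated
    # separately for near and far visual distractors. All reaction times are made up.
    from statistics import mean

    # (distractor_distance, congruency, reaction_time_ms) -- hypothetical trials
    trials = [
        ("near", "congruent", 520), ("near", "incongruent", 610),
        ("near", "congruent", 540), ("near", "incongruent", 590),
        ("far",  "congruent", 530), ("far",  "incongruent", 560),
        ("far",  "congruent", 530), ("far",  "incongruent", 560),
    ]

    def vti(trials, distance):
        """Mean RT cost of incongruent versus congruent distractors at a given distance."""
        rt = lambda congruency: mean(t[2] for t in trials
                                     if t[0] == distance and t[1] == congruency)
        return rt("incongruent") - rt("congruent")

    # A larger effect for near than for far distractors is the behavioral signature
    # of hand-centered visuo-tactile interaction described in the text.
    print("VTI near:", vti(trials, "near"), "ms")  # 70 ms with these toy numbers
    print("VTI far: ", vti(trials, "far"), "ms")   # 30 ms with these toy numbers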
Evidence for the existence of multisensory PpS is now accumulating from neuroimaging studies
in healthy humans. These new studies provide further support for the homologies between some
of the electrophysiological evidence reviewed above and the PpS neural mechanisms in the human
brain. Specifically, brain areas that represent visual and tactile information on and near to the hand
and face in body-centered coordinates have been reported to be the anterior section of the intrapa-
rietal sulcus and the ventral premotor cortex (Bremmer et al. 2001; Makin et al. 2007; Sereno and
Huang 2006). These findings correspond nicely with the anatomical locations of the monkey visuo-
tactile network. Moreover, recent studies have identified the superior parietal occipital junction as
a potential site for representing near-face and near-hand visual space (Gallivan et al. 2009; Quinlan
and Culham 2007). This new evidence extends our current knowledge of the PpS neural network,
and may guide further electrophysiological studies to come.
Although using functional brain imaging enabled us to demonstrate that multiple brain areas in
both sensory and motor cortices modulate their responses to visual stimuli based on their distance
from the hand and face, it did not allow us to determine the direct involvement of such representa-
tions in motor processing. In a series of experiments inspired by the macaque neurophysiological
literature, we recently examined the reference frames underlying rapid motor responses to real,
three-dimensional objects approaching the hand (Makin et al. 2009). We asked subjects to make a
simple motor response to a visual “Go” signal while they were simultaneously presented with a task-
irrelevant distractor ball, rapidly approaching a location either near to or far from their responding
hand. To assess the effects of these rapidly approaching distractor stimuli on the excitability of the
human motor system, we used single pulse transcranial magnetic stimulation, applied to the pri-
mary motor cortex, eliciting motor evoked potentials (MEPs) in the responding hand. As expected,
and across several experiments, we found that motor excitability was modulated as a function of the
distance of approaching balls from the hand: MEP amplitude was selectively reduced when the ball
approached near the hand, both when the hand was on the left and on the right of the midline. This
suppression likely reflects the proactive inhibition of possible avoidance responses elicited
by the approaching ball (see Makin et al. 2009). Strikingly, this hand-centered suppression occurred
as early as 70 ms after ball appearance, and was not modified by the location of visual fixation rela-
tive to the hand. Furthermore, it was selective for approaching balls, since static visual distractors
did not modulate MEP amplitude. Together with additional behavioral measurements, this new
series of experiments provides direct and converging evidence for automatic hand-centered coding
of visual space in the human motor system. These results strengthen our interpretation of PpS as a
mechanism for translating potentially relevant visual information into a rapid motor response.
Together, the behavioral and imaging studies reviewed above confirm the existence of brain
mechanisms in humans that are specialized for representing visual information selectively when it
arises from near the hand. As highlighted in the previous section on monkey research, a strong bind-
ing mechanism of visual and tactile inputs has repeatedly been shown also in humans. Importantly,
these converging results have refined and extended our understanding of the neural processes under-
lying multisensory representation of PpS, namely, by identifying various cortical areas that are
involved in different sensory–motor aspects of PpS representation, and the time course of hand-
centered processing.
The tight relationship between motor and visual representation of near space in the human brain
led us most recently to an intriguing question: Would the loss of a hand through amputation (and
therefore the inability of the brain to represent visual information with respect to it) lead to changes
in visual perception? We recently discovered that hand amputation is indeed associated with a mild
visual “neglect” of the amputated side: Participants with an amputated hand favored their intact side
when comparing distances in a landmark position-judgment task (Makin et al. 2010). Importantly,
this bias was absent when the exact same task was repeated with the targets placed in far space.
These results thus suggest that the possibility for action within near space shapes the actor’s spatial
perception, and emphasize the unique role that PpS mechanisms may play as a medium for interac-
tions between the hands and the world.

23.2.2  A Multisensory Interface for Body–Object Interactions


Until recently, the characteristics of visuo-tactile PpS in humans had been assessed exclusively
while the relevant body parts were held static. Even the most “dynamic” properties of PpS,
such as tool-use modulation of the visuo-tactile interaction, have been studied in the static phase
before or after the active use of the tool (Farnè et al. 2005a; Holmes et al. 2007b; Maravita et al.
2002). An exception could be found in studies showing dynamic changes of PpS during tasks such
as line bisection (e.g., Berti and Frassinetti 2000), although multisensory integration was not mea-
sured in these studies. However, if the PpS representation is indeed directly involved in body–object
interactions, then modulations of visuo-tactile interaction should be found without needing the use
of any tools. Rather, the visuo-tactile interaction, or the dynamic “remapping” of near
space, should be a basic, primary property that only secondarily can be generalized to tool use (see
Brozzoli et al. 2009b). In this respect, the execution of a voluntary free-hand action, for instance
reaching toward an object, should induce a rapid online remapping of visuo-tactile spatial interac-
tions, as the action unfolds. To test this hypothesis in humans, we conceived a modified version of
the VTI paradigm described above, where multisensory interactions were also assessed during the
dynamic phases of an action. We asked a group of healthy participants to perform two tasks within
each trial. The first task was perceptual, whereby participants discriminated the elevation (up or
down) of a tactile target delivered to a digit on one hand (index finger or thumb), trying to ignore a
task-irrelevant visual distractor presented on a target object. The second motor task consisted of
grasping the target object, which was presented in four different orientations, with the index finger
and thumb in a precision grip. The visuo-tactile stimulation was presented at one of three different
timings with respect to the execution of the action: either in a static phase, when the grasping hand
had not yet moved; at the onset of the movement (0 ms); or in the early execution phase (200 ms after
movement onset). When participants performed the action with the tactually stimulated hand, the
VTI was enhanced (i.e., there was more interference from the visual distractor on the tactile task) as
compared to the static phase (Figure 23.3a). This effect was even more pronounced when the visuo-
tactile interaction was assessed during the early execution phase of the grasping. Crucially, if the
same action was performed with the nonstimulated hand, no multisensory modulation was observed,
even though both hands displayed comparable kinematic profiles (Brozzoli et al. 2009b; see Figure
23.3b). This result provided the first evidence that, in humans, a motor-evoked remapping of PpS
occurs, which is triggered by the execution of a grasping action: As in the monkey brain (see Section
23.1.1), the human brain links sources of visual and tactile information that are spatially separated at
the action onset, updating their interaction as a function of the phase of the action. Our brain updates
the relationship between visual and tactile information well before the hand comes into contact with
the object, since the perceptual reweighting is already effective at the very early stage of the action
(Figure 23.3a and b). The finding that such visuo-tactile reweighting was observed selectively when
both perceptual and grasping tasks concerned the same hand, not only confirms the hand-centered
nature of the PpS, but critically extends this property to ecological and adaptive dynamic situations
of voluntary manipulative actions. Furthermore, the kinematics analysis revealed possible parallels
between the motor and perceptual performances, showing that a difference in the kinematic pattern
was reflected by a difference in the perceptual domain (for details, see Brozzoli et al. 2009b).
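To make this time course concrete, the congruency effects reported for this experiment (22 ms in the static phase, 55 ms at movement onset, and 79 ms during early execution; see the caption of Figure 23.3) can be laid out as a simple phase-indexed summary. The sketch below merely restates those published values; the helper function and the derived “remapping” quantity it computes are illustrative additions, not an analysis from the original study.

    # VTI (congruency effect) by action phase, as reported in the Figure 23.3 caption
    # for Brozzoli et al. (2009b). The derived "remapping" values below are an
    # illustrative summary, not a measure used in the original study.
    vti_ms = {"static": 22, "movement_onset": 55, "early_execution": 79}

    def remapping_vs_static(vti_by_phase, baseline="static"):
        """Increase in visuo-tactile interference relative to the static baseline."""
        return {phase: vti - vti_by_phase[baseline]
                for phase, vti in vti_by_phase.items() if phase != baseline}

    print(remapping_vs_static(vti_ms))
    # {'movement_onset': 33, 'early_execution': 57}: the interference grows as the
    # grasp unfolds and, as noted above, only when the acting hand is also the
    # tactually stimulated one.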
It is worth noting that the action-triggered increase in VTI, although already present at the very
onset of the movement (Figure 23.3a and b), grew further during the early execution phase. That is,
an even stronger interference of visual over tactile information was revealed as
the action unfolded in time and space. This suggests that performing a voluntary action triggers a
continuous monitoring of action space, which keeps “assisting” the motor execution of the action
during its whole dynamic phase.

FIGURE 23.3  (See color insert.) Grasping actions remap peripersonal space. (a) Action induces a reweighting
of multisensory processing, as shown by a stronger VTI (crossmodal congruency effect, CCE, in ms) at action
onset (55 ms) compared to the static condition (22 ms). The increase is even larger (79 ms) when stimulation
occurs in the early execution phase (200 ms after the action starts). (b) Dynamics of free-hand grasping (hand
position in the Z–Y plane, in mm): schematic representation of the estimated position of the hand at the instant
when stimulation occurred, for the static condition (blue panel), exactly at movement onset (yellow panel), or
during the early execution phase (light blue panel). Wrist displacement (green trajectory) and grip evolution
(pink trajectories) are shown in each panel. (Modified from Brozzoli, C. et al., NeuroReport, 20, 913–917,
2009b.)

To investigate more deeply the relationship between PpS remapping and the motor characteris-
tics of the action, we tested whether different multisensory interactions might arise as a function of
the required sensory–motor transformations. We would expect action-dependent multisensory
remapping to be more pronounced whenever the action requires relatively more complex
sensory–motor transformations.
In a more recent study (Brozzoli et al. 2009a), we asked a group of healthy participants to per-
form either grasping movements (as in Brozzoli et al. 2009b) or pointing movements. For both
movements, the interaction between task-irrelevant visual information on the object and the tactile
information delivered on the acting hand increased in the early component of the action (as reflected
in a higher VTI), thus replicating our previous findings. However, a differential updating of the
VTI took place during the execution phase of the two action types. Whereas the VTI magnitude
further increased during the execution phase of the grasping action (with respect to movement
onset), this was not the case for the pointing action. In other words, when the hand approached the
object, the grasping movement triggered stronger visuo-tactile interaction than pointing. Thus, not
only does a continuous updating of PpS occur during action execution, but this remapping also varies with
the characteristics of the given motor act. For relatively simple movements such as pointing, the
part of the remapping of PpS that is already effective at the onset of the motor program remains
unchanged thereafter. For relatively complex object-oriented interactions such as grasping, by contrast,
the remapping of PpS is dynamically updated with respect to the motor command.

23.3  CONCLUSION
The studies reviewed in this chapter uncover the multisensory mechanisms our brain uses in order
to directly link visual information available outside our body with tactile information on
our body. In particular, electrophysiological studies in monkeys revealed that the brain builds a
body part–centered representation of the space around the body, through a network of visuo-tactile
areas. We also reviewed later evidence suggesting a functionally homologous representation of PpS
in humans, which serves as a multisensory interface for interactions with objects in the external
world. Moreover, the action-related properties of PpS representation feature a basic aspect that
might be crucial for rapid and automatic avoidance reactions, that is, a hand-centered representa-
tion of objects in near space. We also showed that PpS representation is dynamically remapped
during action execution, as a function of the sensory–motor transformations required by the action
kinematics. We therefore suggested that PpS representation may also play a major role in voluntary
action execution on nearby objects. These two hypotheses (involuntary and voluntary object-oriented
actions) are not mutually exclusive and one could speculate that, from a more primordial defensive
function of this machinery, a more fine-grained and sophisticated function could have developed
using the same, relatively basic visuo-tactile spatial computational capabilities. This development
could lead to its involvement in the control of the execution of voluntary actions toward objects.

ACKNOWLEDGMENTS
This work was supported by European Mobility Fellowship, ANR grants no. JCJC06_133960 and
RPV08085CSA, and INSERM AVENIR grant no. R05265CS.

REFERENCES
Avillac, M., S. Denève, E. Olivier, A. Pouget, and J. R. Duhamel. 2005. Reference frames for representing
visual and tactile locations in parietal cortex. Nature Neuroscience 8: 941–949.
Battaglini, P. P., A. Muzur, C. Galletti, M. Skrap, A. Brovelli, and P. Fattori. 2002. Effects of lesions to area
V6A in monkeys. Experimental Brain Research 144: 419–422.
Bender, M. 1952. Disorders in perception. Springfield, IL: Thomas.
Berti, A., and F. Frassinetti. 2000. When far becomes near: Remapping of space by tool use. Journal of Cognitive
Neuroscience 12: 415–420.
Bremmer, F. 2005. Navigation in space—the role of the macaque ventral intraparietal area. Journal of Physiology
566: 29–35.
Bremmer, F. et al. 2001. Polymodal motion processing in posterior parietal and premotor cortex: A human
fMRI study strongly implies equivalencies between humans and monkeys. Neuron 29: 287–296.
Brozzoli, C., L. Cardinali, F. Pavani, and A. Farnè. 2009a. Action specific remapping of peripersonal space.
Neuropsychologia, in press.
Brozzoli, C., M. L. Demattè, F. Pavani, F. Frassinetti, and A. Farnè. 2006. Neglect and extinction: Within and
between sensory modalities. Restorative Neurology and Neuroscience 24: 217–232.
Brozzoli, C., F. Pavani, C. Urquizar, L. Cardinali, and A. Farnè. 2009b. Grasping actions remap peripersonal
space. NeuroReport 20: 913–917.
Bonifazi, S., A. Farnè, L. Rinaldesi, and E. Ladavas. 2007. Dynamic size-change of peri-hand space through
tool-use: Spatial extension or shift of the multi-sensory area. Journal of Neuropsychology 1: 101–114.
Caggiano, V., L. Fogassi, G. Rizzolatti, P. Thier, and A. Casile. 2009. Mirror neurons differentially encode the
peripersonal and extrapersonal space of monkeys. Science 324: 403–406.
Cardinali, L., C. Brozzoli, and A. Farnè. 2009a. Peripersonal space and body schema: Two labels for the same
concept? Brain Topography 21: 252–260
Cardinali, L., C. Brozzoli, and A. Farnè. 2009b. Peripersonal space and body schema. In Encyclopedia of
Behavioral Neuroscience, ed. G. F. Koob, M. Le Moal, and R. R. Thompson, 40, Elsevier Science
Ltd.
Cardinali, L., F. Frassinetti, C. Brozzoli, C. Urquizar, A. Roy, and A. Farnè. 2009c. Tool-use induces morpho-
logical up-dating of the body schema. Current Biology 19: R478–R479.
Colby, C. L., and J. R. Duhamel. 1991. Heterogeneity of extrastriate visual areas and multiple parietal areas in
the macaque monkey. Neuropsychologia 29: 517–537.
Colby, C. L., J. R. Duhamel, and M. E. Goldberg, 1993. Ventral intraparietal area of the macaque: Anatomic
location and visual response properties. Journal of Neurophysiology 69: 902–914.
Cooke, D. F., and M. S. Graziano. 2003. Defensive movements evoked by air puff in monkeys. Journal of
Neurophysiology 90: 3317–3329.
Cooke, D. F., and M. S. Graziano. 2004. Sensorimotor integration in the precentral gyrus: Polysensory neurons
and defensive movements. Journal of Neurophysiology 91: 1648–1660.
Costantini, M., D. Bueti, M. Pazzaglia, and S. M. Aglioti. 2007. Temporal dynamics of visuo-tactile extinction
within and between hemispaces. Neuropsychology 21: 242–250.
Crutcher, M. D., and M. R. DeLong. 1984. Single cell studies of the primate putamen: II. Relations to direction
of movement and pattern of muscular activity. Experimental Brain Research 53: 244–258.
Dehaene, S. 2005. Evolution of human cortical circuits for reading and arithmetic: The “neuronal recycling”
hypothesis. In From Monkey Brain to Human Brain, ed. S. Dehaene, J. R. Duhamel, M. Hauser, and G.
Rizzolatti, 133–157. Cambridge, MA: MIT Press.
Deuel, R. K., and D. J. Regan. 1985. Parietal hemineglect and motor deficits in the monkey. Neuropsychologia
23: 305–314.
di Pellegrino, G., and E. De Renzi. 1995. An experimental investigation on the nature of extinction.
Neuropsychologia 33: 153–170.
di Pellegrino, G., L. Fadiga, L. Fogassi, V. Gallese, and G. Rizzolatti. 1992. Understanding motor events: A
neurophysiological study. Experimental Brain Research 91: 176–180.
di Pellegrino, G., E. Ladavas, and A. Farnè. 1997. Seeing where your hands are. Nature 388: 730.
Driver, J. 1998. The neuropsychology of spatial attention. In Attention, ed. H. Pashler, 297–340. Hove:
Psychology Press.
Duhamel, J. R., C. L. Colby, and M. E. Goldberg. 1998. Ventral Intraparietal area of the macaque: Congruent
visual and somatic response properties. Journal of Neurophysiology 79: 126–136.
Ettlinger, G., and J. E. Kalsbeck. 1962. Changes in tactile discrimination and in visual reaching after successive
and simultaneous bilateral posterior parietal ablations in the monkey. Journal of Neurology, Neurosurgery
and Psychiatry 25: 256–268.
Farnè, A. et al. 2003. Visuo-motor control of the ipsilateral hand: Evidence from right brain–damaged patients.
Neuropsychologia 41: 739–757.
Farnè, A., S. Bonifazi, and E. Ladavas. 2005a. The role played by tool-use and tool-length on the plastic elonga-
tion of peri-hand space: A single case study. Cognitive Neuropsychology 22: 408–418.
Farnè, A., M. Demattè, and E. Ladavas. 2003. Beyond the window: Multisensory representation of periper-
sonal space across a transparent barrier. Journal of Physiology Paris 50: 51–61.
Farnè, A., M. L. Demattè, and E. Ladavas. 2005b. Neuropsychological evidence of modular organization of the
near peripersonal space. Neurology 13: 1754–1758.
Farnè, A., A. Iriki, and E. Ladavas. 2005c. Shaping multisensory action-space with tools: Evidence from
patients with cross-modal extinction. Neuropsychologia 43: 238–248.
Farnè, A., and E. Ladavas. 2000. Dynamic size-change of hand peripersonal space following tool use.
NeuroReport 11: 1645–1649.
Farnè, A., A. Serino, and E. Ladavas. 2007. Dynamic size-change of peri-hand space following tool-use:
Determinants and spatial characteristics revealed through cross-modal extinction. Cortex 43: 436–443.
Faugier-Grimaud, S., C. Frenois, and D. G. Stein. 1978. Effects of posterior parietal lesions on visually guided
behavior in monkeys. Neuropsychologia 16: 151–168.
Fogassi, L. et al. 1992. Space coding by premotor cortex. Experimental Brain Research 89: 686–690.
Fogassi, L. et al. 1996. Coding of peripersonal space in inferior premotor cortex (area F4). Journal of
Neurophysiology 76: 141–157.
Fogassi, L., V. Gallese, M. Gentilucci, G. Luppino, M. Matelli, and G. Rizzolatti. 1994. The fronto-parietal
cortex of the prosimian Galago: Patterns of cytochrome oxidase activity and motor maps. Behavioral
Brain Research 60: 91–113.
Fogassi, L., and G. Luppino. 2005. Motor functions of the parietal lobe. Current Opinion in Neurobiology 15:
626–631.
Fogassi, L., V. Raos, G. Franchi, V. Gallese, G. Luppino, and M. Matelli. 1999. Visual responses in the dorsal
premotor area F2 of the macaque monkey. Experimental Brain Research 128: 194–199.
Gallese, V., L. Fadiga, L. Fogassi, and G. Rizzolatti. 1996. Action recognition in the premotor cortex. Brain
119: 593–609.
Gallese, V., A. Murata, M. Kaseda, N. Niki, and H. Sakata. 1994. Deficit of hand preshaping after muscimol
injection in monkey parietal cortex. NeuroReport 5: 1525–1529.
Gallivan, J. P., C. Cavina-Pratesi, and J. C. Culham. 2009. Is that within reach? fMRI reveals that the human
superior parieto-occipital cortex encodes objects reachable by the hand. Journal of Neuroscience 29:
4381–4391.
Gardner, E. P. et al. 2007. Neurophysiology of prehension: I. Posterior parietal cortex and object-oriented hand
behaviors. Journal of Neurophysiology 97: 387–406.
Gentilucci, M. et al. 1988. Somatotopic representation in inferior area 6 of the macaque monkey. Experimental
Brain Research 71: 475–490.
Gentilucci, M., C. Scandolara, I. N. Pigarev, and G. Rizzolatti. 1983. Visual responses in the postarcuate cortex
(area 6) of the monkey that are independent of eye position. Experimental Brain Research 50: 464–468.
Godschalk, M., R. N. Lemon, H. G. Nijs, and H. G. Kuypers. 1981. Behaviour of neurons in monkey peri-
arcuate and precentral cortex before and during visually guided arm and hand movements. Experimental
Brain Research 44: 113–116.
Godschalk, M., R. N. Lemon, H. G. Kuypers, and R. K. Ronday. 1984. Cortical afferents and efferents of
monkey postarcuate area: An anatomical and electrophysiological study. Experimental Brain Research
56: 410–424.
Godschalk, M., R. N. Lemon, H. G. Kuypers, and J. van der Steen. 1985. The involvement of monkey premo-
tor cortex neurones in preparation of visually cued arm movements. Behavioral Brain Research 18:
143–157.
Graziano, M. S. A. 1999. Where is my arm? The relative role of vision and proprioception in the neuronal
representation of limb position. Proceedings of the National Academy of Sciences of the United States of
America 96: 10418–10421.
Graziano, M. S. A. 2001. A system of multimodal areas in the primate brain. Neuron 29: 4–6.
Graziano, M. S. A., and D. F. Cooke. 2006. Parieto-frontal interactions, personal space, and defensive behavior.
Neuropsychologia 44: 2621–2635.
Graziano, M. S. A., and C. G. Gross. 1993. A bimodal map of space: Somatosensory receptive fields in the
macaque putamen with corresponding visual receptive fields. Experimental Brain Research 97: 96–109.
Graziano, M. S. A., and C. G. Gross. 1994. Multiple pathways for processing visual space. In Attention and
Performance XV, ed. C. Umiltà and M. Moscovitch, 181–207. Oxford: Oxford Univ. Press.
Graziano, M. S. A., and C. G. Gross. 1995. The representation of extrapersonal space: A possible role for
bimodal, visuo-tactile neurons. In The Cognitive Neurosciences, ed. M. Gazzaniga, 1021–1034. MIT
Press.
Graziano, M. S., X. T. Hu, and C. G. Gross. 1997. Visuospatial properties of ventral premotor cortex. Journal
of Neurophysiology 77: 2268–2292.
Graziano, M. S., C. S. Taylor, T. Moore, and D. F. Cooke. 2002. The cortical control of movement revisited.
Neuron 36: 349–362.
Graziano, M. S., G. S. Yap, and C. G. Gross. 1994. Coding of visual space by premotor neurons. Science 266:
1054–1057.
Halsband, U., and R. Passingham. 1982. The role of premotor and parietal cortex in the direction of action.
Brain Research 240: 368–372.
Holmes, N. P., G. A. Calvert, and C. Spence. 2004. Extending or projecting peripersonal space with tools? Multisensory
interactions highlight only the distal and proximal ends of tools. Neuroscience Letters 372: 62–67.
Holmes, N. P., D. Sanabria, G. A. Calvert, and C. Spence. 2007a. Tool-use: Capturing multisensory spatial
attention or extending multisensory peripersonal space? Cortex 43: 469–489. Erratum in: Cortex 43:
575.
Holmes, N. P., and C. Spence. 2004. The body schema and multisensory representations of peripersonale space.
Cognitive Processing 5: 94–105.
Holmes, N. P., G. A. Calvert, and C. Spence. 2007b. Tool use changes multisensory interactions in seconds:
Evidence from the crossmodal congruency task. Experimental Brain Research 183: 465–476.
Holmes, N. P., C. Spence, P. C. Hansen, C. E. Mackay, and G. A. Calvert. 2008. The multisensory attentional
consequences of tool use: A functional magnetic resonance imaging study. PLoS One 3: e3502.
Hyvärinen, J. 1981. Regional distribution of functions in parietal association area 7 of the monkey. Brain
Research 206: 287–303.
Hyvärinen, J., and A. Poranen. 1974. Function of the parietal associative area 7 as revealed from cellular dis-
charges in alert monkeys. Brain 97: 673–692.
Hyvärinen, J., and Y. Shelepin. 1979. Distribution of visual and somatic functions in the parietal associative
area 7 of the monkey. Brain Research 169: 561–564.
Iriki, A., M. Tanaka, and Y. Iwamura. 1996. Coding of modified body schema during tool use by macaque
postcentral neurons. NeuroReport 7: 2325–2330.
Ishida, H., K. Nakajima, M. Inase, and A. Murata. 2009. Shared mapping of own and others’ bodies in visuo-
tactile bimodal area of monkey parietal cortex. Journal of Cognitive Neuroscience 1–14.
Jeannerod, M. 1988. Motor control: Concepts and issues. New York: Wiley.
Jones, E. G., and T. P. Powell. 1970. An anatomical study of converging sensory pathways within the cerebral
cortex of the monkey. Brain 93: 793–820.
Kurata, K., and J. Tanji. 1986. Premotor cortex neurons in macaques: Activity before distal and proximal fore-
limb movements. Journal of Neuroscience 6: 403–411.
Kurata, K., K. Okano, and J. Tanji. 1985. Distribution of neurons related to a hindlimb as opposed to forelimb
movement in the monkey premotor cortex. Experimental Brain Research 60: 188–191.
Lacquaniti, F., and R. Caminiti. 1998. Visuo-motor transformations for arm reaching. European Journal of
Neuroscience 10: 195–203. Review. Erratum in: European Journal of Neuroscience, 1998, 10: 810.
Ladavas, E. 2002. Functional and dynamic properties of visual peripersonal space. Trends in Cognitive Sciences
6: 17–22.
Ladavas, E., and A. Farnè. 2004. Visuo-tactile representation of near-the-body space. Journal of Physiology
Paris 98: 161–170.
Legrand, D., C. Brozzoli, Y. Rossetti, and A. Farnè. 2007. Close to me: Multisensory space representations
for action and pre-reflexive consciousness of oneself-in-the-world. Consciousness and Cognition 16:
687–699.
Leinonen, L. 1980. Functional properties of neurones in the posterior part of area 7 in awake monkey. Acta
Physiologica Scandinavica 108: 301–308.
Leinonen, L., J. Hyvärinen, G. Nyman, and I. Linnankoski. 1979. I. Functional properties of neurons in lateral
part of associative area 7 in awake monkeys. Experimental Brain Research, 34: 299–320.
Leinonen, L., and G. Nyman. 1979. II. Functional properties of cells in anterolateral part of area 7 associative
face area of awake monkeys. Experimental Brain Research 34: 321–333.
Luppino, G., A. Murata, P. Govoni, and M. Matelli. 1999. Largely segregated parietofrontal connections linking rostral intraparietal cortex (areas AIP and VIP) and the ventral premotor cortex (areas F5 and F4). Experimental Brain Research 128: 181–187.
Lynch, J. C., V. B. Mountcastle, W. H. Talbot, and T. C. T. Yin. 1977. Parietal lobe mechanisms for directed
visual attention. Journal of Neurophysiology 40: 362–389.
Makin, T. R., N. P. Holmes, C. Brozzoli, Y. Rossetti, and A. Farnè. 2009. Coding of visual space during motor
preparation: Approaching objects rapidly modulate corticospinal excitability in hand-centered coordi-
nates. Journal of Neuroscience 29: 11841–11851.
Makin, T. R., N. P. Holmes, and H. H. Ehrsson. 2008. On the other hand: Dummy hands and peripersonal space.
Behavioral Brain Research 191: 1–10.
Makin, T. R., N. P. Holmes, and E. Zohary. 2007. Is that near my hand? Multisensory representation of peri-
personal space in human intraparietal sulcus. Journal of Neuroscience 27: 731–740.
Makin, T. R., M. Wilf, I. Schwartz, and E. Zohary. 2010. Amputees “neglect” the space near their missing hand.
Psychological Science, in press.
Maravita, A. 2006. From body in the brain, to body in space: Sensory and intentional aspects of body represen-
tation. In The human body: Perception from the inside out, ed. G. Knoblich, M. Shiffrar, and M. Grosjean,
65–88. Oxford Univ. Press.
Maravita, A., and A. Iriki. 2004. Tools for the body (schema). Trends in Cognitive Science 8: 79–86.
Maravita, A., C. Spence, and J. Driver. 2003. Multisensory integration and the body schema: Close to hand and
within reach. Current Biology 13: R531–R539.
Maravita, A., C. Spence, S. Kennett, and J. Driver. 2002. Tool-use changes multimodal spatial interactions
between vision and touch in normal humans. Cognition 83: B25–B34.
Martino, A. M., and P. L. Strick. 1987. Corticospinal projections originate from the arcuate premotor area.
Brain Research 404: 307–312.
Matelli, M., R. Camarda, M. Glickstein, and G. Rizzolatti. 1984a. Interconnections within the postarcuate
cortex (area 6) of the macaque monkey. Brain Research 310: 388–392.
Matelli, M., R. Camarda, M. Glickstein, and G. Rizzolatti. 1984b. Afferent and efferent projections of the inferior area 6 in the macaque monkey. Journal of Comparative Neurology 251: 281–298.
Matelli, M., and G. Luppino. 2001. Parietofrontal circuits for action and space perception in the macaque monkey. Neuroimage 14: S27–S32.
Matelli, M., G. Luppino, and G. Rizzolatti. 1985. Patterns of cytochrome oxidase activity in the frontal agranular cortex of the macaque monkey. Behavioral Brain Research 18: 125–136.
Matsumura, M., and K. Kubota. 1979. Cortical projection to hand-arm motor area from post-arcuate area in
macaque monkeys: A histological study of retrograde transport of horseradish peroxidase. Neuroscience
Letters 11: 241–246.
Maunsell, J. H., and D. C. van Essen. 1983. The connections of the middle temporal visual area (MT) and their
relationship to a cortical hierarchy in the macaque monkey. Journal of Neuroscience 3: 2563–2586.
Meredith, M. A., and B. E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. Journal of Neurophysiology 56: 640–662.
Mesulam, M. M., G. W. Van Hoesen, D. N. Pandya, and N. Geschwind. 1977. Limbic and sensory connections
of the inferior parietal lobule (area PG) in the rhesus monkey: A study with a new method for horseradish
peroxidase histochemistry. Brain Research 136: 393–414.
Milner, A. D., and M. A. Goodale. 1995. The visual brain in action. Oxford: Oxford Univ. Press.
Moll, L., and H. G. Kuypers. 1977. Premotor cortical ablations in monkeys: Contralateral changes in visually
guided reaching behavior. Science 198: 317–319.
Mountcastle, V. B., J. C. Lynch, A. Georgopoulos, H. Sakata, and C. Acuna. 1975. Posterior parietal associa-
tion cortex of the monkey: Command functions for operations within extrapersonal space. Journal of
Neurophysiology 38: 871–908.
Muakkassa, K. F., and P. L. Strick. 1979. Frontal lobe inputs to primate motor cortex: Evidence for four soma-
totopically organized ‘premotor’ areas. Brain Research 177: 176–182.
Murata, A., L. Fadiga, L. Fogassi, V. Gallese, V. Raos, and G. Rizzolatti. 1997. Object representation in the
ventral premotor cortex (area F5) of the monkey. Journal of Neurophysiology 78: 2226–2230.
Murray, M. M. et al. 2005. Grabbing your ear: Rapid auditory–somatosensory multisensory interactions in low-
level sensory cortices are not constrained by stimulus alignment. Cerebral Cortex 15: 963–974.
Pandya, D. N., and L. A. Vignolo. 1971. Intra- and interhemispheric projections of the precentral, premotor and
arcuate areas in the rhesus monkey. Brain Research 26: 217–233.
Paulignan, Y., C. MacKenzie, R. Marteniuk, and M. Jeannerod. 1991. Selective perturbation of visual input
during prehension movements: 1. The effects of changing object position. Experimental Brain Research
83: 502–512.
Pavani, F., and U. Castiello. 2004. Binding personal and extrapersonal space through body shadows. Nature
Neuroscience 7: 14–16.
Pisella, L. et al. 2000. An ‘automatic pilot’ for the hand in human posterior parietal cortex: Toward reinterpret-
ing optic ataxia. Nature Neuroscience 3: 729–736.
Prabhu, G. et al. 2009. Modulation of primary motor cortex outputs from ventral premotor cortex during visu-
ally guided grasp in the macaque monkey. Journal of Physiology 587: 1057–1069.
Quinlan, D. J., and J. C. Culham. 2007. fMRI reveals a preference for near viewing in the human parieto-
occipital cortex. Neuroimage 36: 167–187.
Raos, V., M. A. Umiltá, A. Murata, L. Fogassi, and V. Gallese. 2006. Functional properties of grasping-related
neurons in the ventral premotor area F5 of the macaque monkey. Journal of Neurophysiology 95:
709–729.
Rizzolatti, G., R. Camarda, L. Fogassi, M. Gentilucci, G. Luppino, and M. Matelli. 1988. Functional organization of inferior area 6 in the macaque monkey: II. Area F5 and the control of distal movements. Experimental Brain Research 71: 491–507.
Rizzolatti, G., L. Fadiga, L. Fogassi, and V. Gallese. 1997. The space around us. Science 277: 190–191.
Rizzolatti, G., L. Fadiga, V. Gallese, and L. Fogassi. 1996. Premotor cortex and the recognition of motor
actions. Cognitive Brain Research 3: 131–141.
Rizzolatti, G., and M. Gentilucci. 1988. Motor and visual–motor functions of the premotor cortex. In
Neurobiology of Neocortex, ed. P. Rakic and W. Singer, 269–284. John Wiley and Sons Ltd.
Rizzolatti, G., M. Gentilucci, L. Fogassi, G. Luppino, M. Matelli, and S. Ponzoni-Maggi. 1987. Neurons related
to goal-directed motor acts in inferior area 6 of the macaque monkey. Experimental Brain Research 67:
220–224.
Rizzolatti, G., and G. Luppino. 2001. The cortical motor system. Neuron 31: 889–901.
Rizzolatti, G., M. Matelli, and G. Pavesi. 1983. Deficits in attention and movement following the removal of
postarcuate (area 6) and prearcuate (area 8) cortex in macaque monkeys. Brain 106: 655–673.
Rizzolatti, G., G. Luppino, and M. Matelli. 1998. The organization of the cortical motor system: New concepts.
Electroencephalography and Clinical Neurophysiology 106: 283–296.
Rizzolatti, G., and M. Matelli. 2003. Two different streams form the dorsal visual system: Anatomy and func-
tions. Experimental Brain Research 153: 146–157.
Rizzolatti, G., C. Scandolara, M. Gentilucci, and R. Camarda. 1981a. Response properties and behavioral
modulation of “mouth” neurons of the postarcuate cortex (area 6) in macaque monkeys. Brain Research
225: 421–424.
Rizzolatti, G., C. Scandolara, M. Matelli, and M. Gentilucci. 1981b. Afferent properties of periarcuate neurons
in macaque monkeys: I. Somatosensory responses. Behavioral Brain Research 2: 125–146.
Rizzolatti, G., C. Scandolara, M. Matelli, and M. Gentilucci. 1981c. Afferent properties of periarcuate neurons
in macaque monkeys: II. Visual responses. Behavioral Brain Research 2: 147–163.
Robinson, D. L., M. E. Goldberg, and G. B. Stanton. 1978. Parietal association cortex in the primate: Sensory
mechanisms and behavioral modulations. Journal of Neurophysiology 41: 910–932.
Robinson, C. J., and H. Burton. 1980a. Organization of somatosensory receptive fields in cortical areas 7b,
retroinsula, postauditory and granular insula of M. fascicularis. Journal of Comparative Neurology 192:
69–92.
Robinson, C. J., and H. Burton. 1980b. Somatic submodality distribution within the second somatosensory
(SII), 7b, retroinsular, postauditory, and granular insular cortical areas of M. fascicularis. Journal of
Comparative Neurology 192: 93–108.
Sakata, H., Y. Takaoka, A. Kawarasaki, and H. Shibutani. 1973. Somatosensory properties of neurons in the
superior parietal cortex (area 5) of the rhesus monkey. Brain Research 64: 85–102.
Seltzer, B., and D. N. Pandya. 1980. Converging visual and somatic sensory cortical input to the intraparietal
sulcus of the rhesus monkey. Brain Research 192: 339–351.
Sereno, M. I., and R. S. Huang. 2006. A human parietal face area contains aligned head-centered visual and
tactile maps. Nature Neuroscience 9: 1337–1343.
Shimazu, H., M. A. Maier, G. Cerri, P. A. Kirkwood, and R. N. Lemon. 2004. Macaque ventral premotor cortex
exerts powerful facilitation of motor cortex outputs to upper limb motoneurons. Journal of Neuroscience
24: 1200–1211.
Spence, C., F. Pavani, and J. Driver. 2004a. Spatial constraints on visual–tactile cross-modal distractor congru-
ency effects. Cognitive, Affective, and Behavioral Neuroscience 4: 148–169.
Spence, C., F. Pavani, A. Maravita, and N. Holmes. 2004b. Multisensory contributions to the 3-D representation
of visuotactile peripersonal space in humans: Evidence from the crossmodal congruency task. Journal of
Physiology Paris 98: 171–189.
Spence, C., F. Pavani, A. Maravita, and N. P. Holmes. 2008. Multisensory interactions. In Haptic rendering:
Foundations, algorithms, and applications, ed. M. C. Lin and M. A. Otaduy, 21–52. Wellesley, MA:
A. K. Peters Ltd.
Stein, B. E., and M. A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Strick, P. L., and C. C. Kim. 1978. Input to primate motor cortex from posterior parietal cortex (area 5):
I. Demonstration by retrograde transport. Brain Research 157: 325–330.
Strick, P. L. 2002. Stimulating research on motor cortex. Nature Neuroscience 5: 714–715.
Ungerleider, L. G., and R. Desimone. 1986. Cortical connections of visual area MT in the macaque. Journal of
Comparative Neurology 248: 190–222.
Wallace, M. T., and B. E. Stein. 2007. Early experience determines how the senses will interact. Journal of
Neurophysiology 97: 921–926.
Wang, Y., S. Celebrini, Y. Trotter, and P. Barone. 2008. Visuo-auditory interactions in the primary visual cortex
of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9: 79.
Ward, R., S. Goodrich, and J. Driver. 1994. Grouping reduces visual extinction: Neuropsychological evidence
for weight-linkage in visual selection. Visual Cognition 1: 101–129.
24 Multisensory Perception and
Bodily Self-Consciousness
From Out-of-Body to
Inside-Body Experience
Jane E. Aspell, Bigna Lenggenhager, and Olaf Blanke

CONTENTS
24.1 Introduction........................................................................................................................... 467
24.2 Multisensory Disintegration in Out-of-Body and Related Experiences of Neurological
Origin.....................................................................................................................................468
24.3 Using Multisensory Conflicts to Investigate Bodily Self in Healthy Subjects...................... 470
24.3.1 Body Part Studies: Rubber Hand Illusion.................................................................. 470
24.3.2 Full Body Studies...................................................................................................... 471
24.3.3 Mislocalization of Touch during FBIs....................................................................... 475
24.3.4 Multisensory First-Person Perspective...................................................................... 477
24.4 Conclusion............................................................................................................................. 478
References....................................................................................................................................... 478

24.1  INTRODUCTION
The most basic foundations of the self arguably lie in those brain systems that represent the body
(Blanke and Metzinger 2009; Damasio 2000; Gallagher 2005; Jeannerod 2006; Knoblich 2002;
Metzinger et al. 2007). The representation of the body is complex, involving the encoding and inte-
gration of a wide range of multisensory (somatosensory, visual, auditory, vestibular, visceral) and
motor signals (Damasio 2000; Gallagher 2005; Metzinger 2003). One’s own body is thus possibly
the most multisensory “object” in the world. Importantly, whereas external objects of perception
come and go, multisensory bodily inputs are continuously present, and have thus been proposed
as the basis for bodily self-consciousness—the nonconceptual and prereflective representation of
body-related information (Gallagher 2000; Haggard et al. 2003; Jeannerod 2007; Metzinger et al.
2007; Pacherie 2008).
Despite the apparent unitary, global character of bodily self-consciousness, experimental manip-
ulations have mainly focused on subglobal aspects, such as the sense of ownership and agency for
one’s hand and its movements (Botvinick and Cohen 1998; Ehrsson et al. 2004; Jeannerod 2006,
2007; Knoblich 2002; Pavani et al. 2000; Tsakiris and Haggard 2005; Tsakiris et al. 2007). These
latter studies on body-part representation are important (and will be discussed below in detail), yet
we have argued (e.g., see Blanke and Metzinger 2009) that they fail to account for a key feature of
bodily self-consciousness: its global character. This is because a fundamental aspect of bodily self-
consciousness is its association with a single, whole body, not with multiple body parts (Blanke and
Metzinger 2009; Carruthers 2008; Lenggenhager et al. 2007; Metzinger et al. 2007). A number of
recent studies (Aspell et al. 2009; Ehrsson 2007; Lenggenhager et al. 2007, 2009; Mizumoto and
Ishikawa 2005; Petkova and Ehrsson 2008) have demonstrated that more global aspects of body
perception can also be experimentally manipulated using multisensory conflicts. These experi-
mental studies on healthy subjects were inspired by an unusual and revealing set of neurological
phenomena—autoscopic phenomena—in which the sense of the body as a whole is disrupted in
different ways, and which are likely to be caused by an underlying abnormality in the multisensory
integration of global bodily inputs (Blanke and Mohr 2005). In this chapter, we first examine how
the scientific understanding of bodily self-consciousness and its multisensory mechanisms can be
informed by the study of autoscopic phenomena. We then present a review of investigations of
multisensory processing relating to body-part perception (“rubber hand” illusion studies: Botvinick
and Cohen 1998; Ehrsson et al. 2004; Tsakiris and Haggard 2005) and go on to discuss more recent
“full body” illusion studies that were inspired by autoscopic phenomena and have shown that it is
also possible to dissociate certain components of bodily self-consciousness—namely, self-location,
self-identification, and the first-person perspective—in healthy subjects by inducing multisensory
conflicts.

24.2 MULTISENSORY DISINTEGRATION IN OUT-OF-BODY AND RELATED EXPERIENCES OF NEUROLOGICAL ORIGIN
The following is a description of an out-of-body experience (OBE) by Sylvan Muldoon, one of the
first authors to publish detailed descriptions of his own (and others’) OBEs: “I was floating in the
very air, rigidly horizontal, a few feet above the bed […] I was moving toward the ceiling, horizontal
and powerless […] I managed to turn around and there […] was another ‘me’ lying quietly upon the
bed” (Muldoon and Carrington 1929).
We and other research groups (Irwin 1985; Brugger et al. 1997; Brugger 2002; Blanke et al.
2002, 2004; Blanke and Mohr 2005) have argued that an OBE is a breakdown of several key aspects
of bodily self-consciousness, and that the study of this phenomenon is likely to lead to insights into
the multisensory foundations of bodily self-consciousness. OBEs can be characterized by three
phenomenological elements: the impression (1) that the self is localized outside one’s body (dis-
embodiment or extracorporeal self-location), (2) of seeing the world from an extracorporeal and
elevated first-person perspective, and (3) of seeing one’s own body from this perspective (Blanke et
al. 2004; Irwin 1985). OBEs challenge our everyday experience of the spatial unity of self and body:
the experience of a “real me” that “resides” in my body and is the subject or “I” of experience and
thought (Blackmore 1982).
OBEs have been estimated to occur in about 5% of the general population (Blackmore 1982;
Irwin 1985) and they also occur in various medical conditions (Blanke et al. 2004). Several pre-
cipitating factors have been determined including certain types of neurological and psychiatric
diseases (Devinsky et al. 1989; Kölmel 1985; Lippman 1953; Todd and Dewhurst 1955). OBEs have
also been associated with various generalized and focal diseases of the central nervous system
(Blanke et al. 2004; Brugger et al. 1997; Dening and Berrios 1994; Devinsky et al. 1989; Hécaen
and Ajuriaguerra 1952; Lhermitte 1939). OBEs of focal origin mainly implicate posterior regions
of the brain and some authors have suggested a primary involvement of either the temporal or
parietal lobe (Blanke et al. 2004; Devinsky et al. 1989; Hécaen and Ajuriaguerra 1952; Todd and
Dewhurst 1955). More recently, Blanke and colleagues (2004) argued for a crucial role for the cor-
tex at the temporo-parietal junction (TPJ). The crucial role of the right TPJ was suggested because
lesions and foci of epileptic seizures overlap in several patients with OBEs centered on this region
(Blanke et al. 2004; Blanke and Mohr 2005), electrical stimulation of this region can give rise to
OBE-like experiences (Blanke et al. 2002; De Ridder et al. 2007; Penfield and Erickson 1941), and
because the TPJ is activated during mental imagery of disembodied self-location (Arzy et al. 2006).
The role of the TPJ in OBEs makes sense on functional grounds since this region is important
for multisensory integration and vestibular processing, and for generating an egocentric perspective
(Brandt and Dieterich 1999; Bremmer et al. 2001; Calvert et al. 2000; Leube et al. 2003; Ruby and
Decety 2001).
An individual undergoing an OBE usually experiences a dissociation of his self-location and his first-person visuospatial perspective from the seen location of his own body—in other words, he perceives his own body (and the world) from a spatial location and perspective that do not coincide with the seen position of his body (Blanke et al. 2004; Blanke and Mohr 2005;
Brugger et al. 1997). In OBEs the origin of the first-person visuospatial perspective is colocalized
with self-location (as it is for healthy subjects under normal conditions), but the body is experienced
at a different location. What causes this breakdown in the unity between self and body?
To date, only a few neurological and neuroscientific investigations have been carried out on
OBEs, probably because, in general, they occur spontaneously, are of short duration, and happen
only once or twice in a lifetime (Irwin 1985). However, the anatomical, phenomenological, and
behavioral data collected from patients has led to the hypothesis that the abnormal perceptions in
OBEs are due to selective deficits in integrating multisensory body-related information into a single
coherent neural representation of one’s body and its position in extrapersonal space (Blanke et al.
2004; Blanke and Mohr 2005). This theory extended previous propositions made for the related
phenomena of phantom limb sensations (Brugger 2002; Brugger et al. 1997) and synesthesia (Irwin
1985). Furthermore, OBE deficits have been attributed to abnormal processing at the TPJ: TPJ lesions are found in patients with OBEs (Blanke et al. 2004; Blanke and Mohr 2005), and neuroimaging
studies (Arzy et al. 2006; Blanke et al. 2005; Vallar et al. 1999) have shown that this region plays an
important role in multisensory integration, embodiment, and in generating an egocentric perspec-
tive in healthy subjects (see also Bremmer et al. 2001; Calvert et al. 2000; Leube et al. 2003; Ruby
and Decety 2001; Schwabe et al. 2008; Vogeley and Fink 2003).
More precisely, Blanke and colleagues (Blanke et al. 2004; Blanke and Mohr 2005) have pro-
posed that OBEs occur when there is, first, a disintegration in own-body (personal) space because
of incongruent tactile, proprioceptive, and visual inputs and, second, a disintegration between
personal and extrapersonal space due to incongruent vestibular and visual inputs. They further suggested that the phenomenological variation between the different types of autoscopic phenomena—the group of illusions characterized by an illusory multisensory duplication of one's own body, comprising OBEs, heautoscopy, and autoscopic hallucination—reflects a shared disintegration in own-body (personal) space combined with different degrees of disintegration between personal and extrapersonal space due to vestibular disturbance. Vestibular dysfunction (mainly of otolithic origin) is greatest in OBEs, which are
strongly associated with feelings of floating and elevation (usually absent in the two other auto-
scopic phenomena; Blanke et al. 2004). During autoscopic hallucinations patients see their body in
extrapersonal space, but there is no disembodiment, no self-identification with the illusory extra-
corporeal body, and no change in first-person perspective (Blanke et al. 2004; Brugger et al. 1997).
Autoscopic hallucinations are caused by damage that primarily implicates the temporo-occipital
and parieto-occipital cortices (Blanke and Castillo 2007). Patients with heautoscopy—linked to
the left TPJ (Blanke and Mohr 2005)—may experience their self-location and visuospatial per-
spective at the position of the physical body or at the position of the illusory body, or these may
even rapidly alternate, leaving them confused about where their self is localized (Blanke et al.
2004; Brugger et al. 1994). The pronounced vestibular disturbance in OBEs and heautoscopy fits
with the greater implication of the TPJ in both disorders (Blanke and Mohr 2005; Lopez et al.
2008), as the core region of vestibular cortex is located in the TPJ (Brandt and Dieterich 1999;
Fasold et al. 2002; Lobel et al. 1998). These clinical data may suggest that vestibular function in
the left and right TPJs may differ, with the left TPJ specialized for vestibular input from the semi-
circular canals and the right TPJ encoding primarily otolithic input (for more details, see Lopez
et al. 2008).

24.3 USING MULTISENSORY CONFLICTS TO INVESTIGATE BODILY SELF IN HEALTHY SUBJECTS
Clinical patients with disturbed bodily self-consciousness due to aberrant brain processes present
unique and important opportunities to study the relation between the representation of the body and
the self. However, the small sample sizes of clinical studies, the often long-term pathological history
of these patients, as well as other methodological concerns make it difficult to generalize these find-
ings to normal bodily self-consciousness in healthy subjects. In the past decade, a growing number
of studies have therefore used the technique of providing conflicting or ambiguous multisensory
information about the body in order to “trick” the brain and induce bodily illusions in healthy sub-
jects that resemble experiences in neurological patients. These experimental manipulations enable
better-controlled and repeatable investigations of bodily self-consciousness and its underlying neu-
ral bases in large samples of healthy subjects.

24.3.1  Body Part Studies: Rubber Hand Illusion


Probably the most commonly used body illusion is the so-called “rubber hand illusion” (Botvinick
and Cohen 1998), in which a subject watches a rubber hand on a table being stroked in synchrony
with his corresponding (left or right) hidden hand. After a few seconds this simple manipulation
causes the rubber hand (Ehrsson et al. 2004; Lloyd 2007) to “feel like my own hand,” that is, to be
self-attributed. This does not happen when the stroking is applied asynchronously, suggesting that an
intermodal correlation of different senses is crucial for self-attribution (Botvinick and Cohen 1998).
The phenomenological experience of self-attribution is accompanied by a change in where subjects
localize their real stroked hand (“proprioceptive drift”; Botvinick and Cohen 1998; Tsakiris and
Haggard 2005; Kammers et al. 2009; Longo et al. 2008; Schütz-Bosbach et al. 2009). It has been
argued that this latter finding demonstrates that the changes in bodily self-consciousness induced
by the rubber hand illusion are due to changes in low-level, multisensory body representations.
Recent studies of the illusion revealed a number of further behavioral changes related to the rubber
hand such as increased cortical (Ehrsson et al. 2007), physiological (skin conductance response;
Ehrsson 2007; Hägni et al. 2008), and fear responses to a threat to the rubber hand. Moreover, there
are also rubber hand illusion–related changes in the real stimulated hand (e.g., body part–specific
decrease in skin temperature; Moseley et al. 2008). However, the relation between these different
measurements is still unclear (Schütz-Bosbach et al. 2009), and recent studies discuss the possibil-
ity of the existence of multiple (parallel and serial) body representations and dimensions of bodily
self-consciousness (Longo et al. 2008; Kammers et al. 2009) that are differentially affected by the
rubber hand illusion.
The rubber hand illusion has been explained as an effect of visual capture—the dominance
of vision over other modalities in representations of the spatial location of events (Botvinick and
Cohen 1998)—and has been related to properties of bimodal neurons in the parietal and premotor
cortices (Ehrsson et al. 2004; Graziano et al. 2000; Iriki et al. 1996, 2001; Rizzolatti et al. 1981;
Tsakiris et al. 2007). A recent article on the rubber hand illusion (Makin et al. 2008) proposed an
explanatory model for the rubber hand illusion that implicates the role of multisensory integration
within peri-hand space. The relative weighting (compared to that of proprioception) of visual infor-
mation about hand position is greater when the hand is not moving; thus, in this situation visual
information can bias proprioception. Furthermore, because vision can dominate over touch in the
representation of spatial location, the brushstrokes that are seen to occur on the rubber hand may
be processed as though they are occurring nearer to or on the real hand. Thus, the central repre-
sentation of the location of the real hand may shift toward the rubber hand (Lloyd 2007). Given the
temporal congruence of the seen and felt stroking, these inputs are integrated together as a coherent
multisensory event in spatial coordinates that are shifted toward those of the rubber hand (Graziano
et al. 2000). Makin and colleagues (2008) propose that this may result in the sensation of touch and
ownership being referred to the rubber hand. It should be noted that these mechanisms and this
direction of causality have yet to be verified experimentally. It is worth noting that the size of the
drift is generally quite small (a few centimeters) compared to the actual distance between the fake
and the real hand, and that the induced changes in illusory touch and (even more so) ownership dur-
ing the rubber hand illusion are most often relatively weak changes in conscious bodily experience
(even after 30 min of stroking).
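The visual-capture account sketched above can be made concrete with a standard reliability-weighted (maximum-likelihood) cue-combination rule, in which each position estimate is weighted by its inverse variance. The sketch below is only illustrative: the variances and the 15 cm hand–rubber hand separation are invented values rather than parameters reported in the studies cited above, and the rule is a generic formalization, not the specific model proposed by Makin and colleagues (2008).

```python
def fuse_estimates(x_vis, var_vis, x_prop, var_prop):
    """Reliability-weighted fusion of visual and proprioceptive estimates.

    Each cue is weighted by its inverse variance, so the fused estimate of
    hand position is pulled toward the more reliable cue.
    """
    w_vis = (1.0 / var_vis) / (1.0 / var_vis + 1.0 / var_prop)
    fused = w_vis * x_vis + (1.0 - w_vis) * x_prop
    fused_var = 1.0 / (1.0 / var_vis + 1.0 / var_prop)
    return fused, fused_var

# Invented example: rubber hand seen 15 cm from the real, unseen, stationary
# hand; vision assumed more reliable (smaller variance) than proprioception.
fused_pos, _ = fuse_estimates(x_vis=15.0, var_vis=1.0, x_prop=0.0, var_prop=9.0)
print(f"Fused hand position: {fused_pos:.1f} cm toward the rubber hand")
# Prints 13.5 cm; the empirically observed proprioceptive drift is much
# smaller, consistent with the caveats discussed in the text.
```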
Several studies have investigated the brain mechanisms involved in the rubber hand illusion, for
example, using functional MRI (Ehrsson et al. 2004) and positron emission tomography (Tsakiris
et al. 2007). A systematic review of the studies using the rubber hand illusion would be beyond the
scope of the present review, as this chapter focuses on scientific experimentation with full body illu-
sions and global aspects of bodily self-consciousness. The interested reader is referred to the recent
review on body part–specific aspects of bodily self-consciousness by Makin and colleagues (2008).
We only note here that comparison of neuroimaging studies of the rubber hand illusion is hampered
by the fact that the studies employed different methods to induce the rubber hand illusion, used
different control conditions, different behavioral proxies to quantify illusory touch and ownership,
and employed different brain imaging techniques. Not surprisingly, though, these studies implicated
several key brain areas that have previously been shown to be important in multisensory integration,
such as the premotor and intraparietal cortices as well as the TPJ, insula, extrastriate cortex, and
the cerebellum.

24.3.2  Full Body Studies


Although illusory ownership in the rubber hand illusion exemplifies a deviant form of bodily self-
consciousness, the illusion only affects partial ownership, or the attribution and localization of a
hand with respect to the global bodily self, that is, it is characterized by a change in part-to-whole
relationships. As we have seen, the situation is different in neurological patients who have illusory
perceptions of their full bodies such as in OBEs and heautoscopy. These states are characterized
by abnormal experience with respect to the global bodily self, for example, a mislocalization and a
misidentification of the entire body and self (Blanke et al. 2004; Blanke and Mohr 2005; Brugger
et al. 1997). Recent studies in healthy subjects (Ehrsson 2007; Lenggenhager et al. 2007, 2009;
Mizumoto and Ishikawa 2005) have therefore sought to investigate these global aspects of self-
consciousness (self-location and self-identification) by the systematic manipulation of the multisen-
sory cues that the brain uses to create a representation of self-location and identity. As we shall see,
these experimental setups have allowed us to gain insight into the biological mechanisms that are
important for humans’ everyday “inside-body experience.” They show that this experience—which
is often taken for granted (“where else should I be localized than in my body?”)—is made possible
by active multisensory brain processes.
Two groups (Ehrsson 2007; Lenggenhager et al. 2007) have separately developed novel tech-
niques to dissociate (1) the location of the physical body, (2) the location of the self (self-location),
(3) the location of the origin of the first-person visuospatial perspective, and (4) self-identifica-
tion. Both groups utilized congruent and incongruent visual–tactile stimulation to alter these four
aspects of bodily self-consciousness by extending a protocol similar to that used in the rubber hand
illusion (Botvinick and Cohen 1998) to the full body (see Figure 24.1; see also Altschuler and
Ramachandran 2007; Mizumoto and Ishikawa 2005; Stratton 1899). The general idea in these full
body studies is to mislead subjects about where they experience their body and/or self to be, and/or
with what location and which body they self-identify with. To achieve this, a visual (real-time video)
image of their body was presented via a head-mounted display (HMD) that was linked to a video
camera that filmed their back from behind (Figure 24.1). They were thus able to see themselves from
an “outside” or third-person visuospatial perspective, as though they were viewing their own body
from the visuospatial perspective of the camera (note that this is related to changes in perspective
during OBEs). In one study (Lenggenhager et al. 2007), subjects viewed the video image of their

FIGURE 24.1  Experimental setup in synchronous (back) stroking condition in Lenggenhager et al.’s (2007)
study (top panel) and in synchronous (chest) stroking condition in Ehrsson’s (2007) study (bottom panel). In
both panels, the physical body of the subject is light-colored and the dark-colored body indicates the hypoth-
esized location of the perceived body (bodily self). (Modified from Lenggenhager, B. et al., Consciousness
and Cognition, 18(1), 110–117, 2009.)

body (the “virtual body”) while they were stroked on their real back with a stick. This stroking was
felt on their back and also seen in front on the virtual body either simultaneously (in real time) or not
(when delayed by a video delay). The stroking manipulation thus generated either congruent (syn-
chronous) or incongruent (asynchronous) visuo-tactile stimulation (as had been shown to affect the
perception of hand ownership and hand location in the rubber hand illusion; Botvinick and Cohen
1998). It was found that the illusion of self-identification with the virtual body (i.e., global owner-
ship, the feeling that “the virtual body is my body”) and the referral of touch (“feeling the touch of
the stick where I saw it touching my virtual body”) were both stronger when subjects were stroked
synchronously than when they were stroked asynchronously (Lenggenhager et al. 2007). Self-
location was also measured by passively displacing blindfolded subjects after the stroking period
and then asking them to walk back to the original position. Note that, as predicted, self-location was
experienced at a position that was closer to the virtual body, as if the subject was located “in front”
of the position where (s)he had been standing during the experiment. This ensemble of measures has
been termed the full body illusion (FBI).
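A minimal sketch of how the drift measure just described could be scored is given below. The positions, sign convention, and function name are hypothetical illustrations, not the scoring procedure actually used by Lenggenhager and colleagues (2007).

```python
def drift_toward_virtual_body(original_pos_cm, walked_back_pos_cm):
    """Drift score for one trial of the walking task.

    Positions are measured along the axis pointing from the original standing
    position toward the seen virtual body, so positive values indicate that
    the subject repositioned him/herself toward the virtual body.
    """
    return walked_back_pos_cm - original_pos_cm

# Invented trial data (cm): subjects tend to stop closer to the virtual body
# after synchronous stroking than after asynchronous stroking.
sync_drifts = [drift_toward_virtual_body(0.0, p) for p in (12.0, 9.5, 14.0)]
async_drifts = [drift_toward_virtual_body(0.0, p) for p in (3.0, 1.5, 4.0)]
print(sum(sync_drifts) / len(sync_drifts))    # larger mean forward drift
print(sum(async_drifts) / len(async_drifts))  # smaller mean forward drift
```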
In a related study (Ehrsson 2007), subjects were stroked on their chest (Figure 24.1). They were
seated while they viewed themselves (via an HMD) from behind, and they could see a stick moving
(synchronous or asynchronous with the touch) just below the camera’s lens. In this case, subjects
felt that the stick they saw was touching their real chest, self-identified with the camera's location, and felt that looking at the virtual body was like viewing the body of someone else (i.e.,
decreased self-identification with the virtual body). Self-location was not quantified in this study by
using the drift measure as in Lenggenhager et al.’s (2007) study; instead, a threatening stimulus was
presented to the apparent location of the origin of the visuospatial perspective (just below the cam-
era). The skin conductance response to a swinging hammer (approaching the camera) was found
to be higher during synchronous stroking than during asynchronous stroking, providing implicit
physiological evidence that subjects self-identified with a spatial position that was displaced toward
the position of the camera.
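As an illustration of how such an implicit physiological measure can be compared across conditions, the sketch below runs a paired comparison of per-subject skin conductance responses. The numerical values are invented for illustration and are not data from Ehrsson (2007).

```python
from scipy.stats import ttest_rel

# Hypothetical per-subject skin conductance responses (microsiemens) to the
# approaching hammer; one value per subject and stroking condition.
scr_synchronous = [0.82, 0.64, 0.91, 0.55, 0.73, 0.60]
scr_asynchronous = [0.51, 0.47, 0.66, 0.40, 0.58, 0.45]

# A paired test is appropriate because each subject contributes one value per
# condition; a larger response during synchronous stroking is taken as
# implicit evidence of self-identification with the camera's position.
t_stat, p_value = ttest_rel(scr_synchronous, scr_asynchronous)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```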
There were several differences in bodily experiences in these two similar setups, and it is worth
considering what may account for these. Meyer (2008) proposed (in a response to these studies) that
in both setups the brain may use at least four different sources of information to generate the con-
scious experience of self-location and self-identification: (1) where the body is seen, (2) where the
world is seen from (the origin of the visuospatial perspective), (3) where the touch is seen to occur,
and (4) where the touch is felt to occur. These four “cues” do not correspond in both experimental
setups (but in everyday life, they usually do). Meyer argued that the most important of these cues for
the conscious experience of self-location might be where the touch is seen to occur (i.e., where the
stroking stick is seen). He concluded this because, first, in neither setup did self-location (measured
via drift by Lenggenhager et al. 2007 or assessed via a questionnaire score by Ehrsson 2007) exactly
coincide with the location where the touch was felt (i.e., where the physical body was located).
Second, the seen location of the virtual body biased self-location in one study (Lenggenhager et
al. 2007) but not in the other (Ehrsson 2007), and third, the location of the visuospatial perspec-
tive corresponded to self-location in Ehrsson’s (2007) study but not in Lenggenhager et al.’s (2007)
study. However, in both cases (during synchronous stroking), self-location coincided with (or more
accurately, was biased toward) the location where the touch was seen to occur (i.e., the seen location
of the stroking stick).
It is not very surprising that the tactile sense appears to have the weakest role in determining
self-location. Touch, after all, cannot give any reliable information regarding the location of the
body in external space, except via tactile contact with external surfaces. There is, however, an
additional important point to consider regarding the four cues: self-location was biased toward the
virtual body more when the seen stroking was synchronous with the felt stroking than when it was
asynchronous (Blanke et al. 2008). Thus, the congruence between tactile and visual input is an
additional important factor in determining self-location in this context. It seems that when vision
and touch are incongruent, the influence of the “visual information about stroking” is weaker and
not preeminent as Meyer implies. Thus, in the asynchronous condition, subjects’ self-location is
closer to where the touch is felt (i.e., where their physical body is actually located) than it is in the
synchronous condition.
It should be noted that different methods (different experimental conditions and dependent vari-
ables to quantify changes in bodily self-consciousness) were used in these studies (Ehrsson 2007;
Lenggenhager et al. 2007). It is therefore difficult to make meaningful, direct comparisons between
the results of these studies. A more recent study (Lenggenhager et al. 2009) therefore sought to
directly compare the approaches presented in these previous studies by using identical body posi-
tions and measures in order to quantify the conscious experience of self-identification, first-person
visuospatial perspective, and self-location. In addition, the authors investigated these aspects of
bodily self-consciousness while subjects were tested in the supine position (as OBEs usually occur
in this position; Bünning and Blanke 2005; Green 1968).
Subjects were again fitted with an HMD that displayed a video image of their body. Their vir-
tual body thus appeared to be located below their physical body (see Figure 24.2). The dependent
behavioral measure for the quantification of self-location was a new one: a “mental ball dropping”
(MBD) task in which subjects had to imagine that a ball fell from their hand, and they had to press
one button when they imagined that it left their grasp, and then another button when they imagined
that it hit the floor. The authors hypothesized that MBD estimation would be greater (i.e., the time
that subjects imagined it would take for the ball to reach the ground would be longer) when subjects’
self-location (where they perceived their self to be) was higher from the ground than when it was
closer to the ground. The prediction in this study was that, compared to asynchronous stroking,

FIGURE 24.2  Experimental setup in synchronous (back) stroking condition (top panel) and synchronous
(chest) stroking condition (bottom panel) in Lenggenhager et al.’s (2009) study. Subject was filmed from
above and viewed the scene via an HMD. Light-colored body indicates where subjects’ real body was
located and dark-colored body, the hypothesized location of the perceived body (bodily self). (Modified from
Lenggenhager, B. et al., Consciousness and Cognition, 18(1), 110–117, 2009.)

synchronous back stroking would lead to a “downward” shift in self-location (toward the virtual
body, seen as though below subjects) and an increased self-identification with the virtual body.
Synchronous chest stroking, conversely, would lead to an “upward” shift in self-location (“away”
from the virtual body seen below), and a decreased self-identification with the virtual body. As
predicted, self-identification with the virtual body and referral of touch to the virtual body were
found to be greater during synchronous than during asynchronous back stroking. In contrast, during
synchronous chest stroking, there was decreased self-identification with the virtual body and decreased
illusory touch. The MBD time estimates (quantifying self-location) were lower for synchronous
back stroking than synchronous chest stroking, suggesting that, as predicted, self-location was more
biased toward the virtual body in the synchronous back stroking condition and relatively more
toward the location of the visuospatial perspective (a third-person perspective) in the synchronous
chest stroking condition. This study confirmed the earlier suggestion that self-location and self-
identification are strongly influenced by where the stroking is seen to occur. Thus, self-location was
biased toward the virtual body located as though below (or in front) when subjects were stroked on
the back, and biased toward the location of the visuospatial perspective (behind/above the virtual
body) when subjects were stroked on their chests. These studies revealed that humans’ “inside-
body” self-location and “inside-body” first-person perspective can be transferred to an extracorpo-
real self-location and a third-person perspective.
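The logic of the MBD measure described above can be made explicit with simple free-fall kinematics: if the imagined ball behaves roughly like a real ball, a longer imagined drop time corresponds to a greater height and hence to a higher experienced self-location. The calculation below is only a back-of-the-envelope illustration with invented response times; it is not the analysis used by Lenggenhager and colleagues (2009).

```python
G = 9.81  # standard gravity, m/s^2

def implied_height_m(drop_time_s):
    """Height implied by an imagined drop time, assuming simple free fall.

    From h = 0.5 * g * t**2; purely illustrative, since the MBD task does not
    require the imagined ball to obey real-world kinematics exactly.
    """
    return 0.5 * G * drop_time_s ** 2

# Invented mean drop-time estimates (seconds) for the two stroking conditions.
t_sync_back, t_sync_chest = 0.55, 0.70
print(implied_height_m(t_sync_back))   # ~1.5 m: self-location biased downward
print(implied_height_m(t_sync_chest))  # ~2.4 m: self-location biased upward
```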
It is notable that the subjective upward drift in self-location during synchronous chest strok-
ing was correlated with sensations of elevation and floating (as assessed by questionnaires). This
suggests that, when subjects adopt a relaxed prone position, synchronous visual–tactile events may
interfere with vestibular processing. The importance of vestibular (otolith) input in abnormal self-
location has already been demonstrated (Blanke et al. 2002, 2004). Furthermore, there is evidence
that vestibular cues may interfere with body and self-representation (Le Chapelain et al. 2001;
Lenggenhager et al. 2008; Lopez et al. 2008; Yen Pik Sang et al. 2006). The relatively motionless
prone body position of the subjects in this study would have minimized vestibular sensory updating
and thus may have further contributed to the occurrence of such vestibular sensations, highlight-
ing their potential relevance for bodily self-consciousness, OBEs, and related experiences (see also
Lopez et al. 2008; Schwabe and Blanke 2008).
Can the mechanisms (explained above) for the rubber hand illusion also explain the changes
in self-location, first-person perspective, and self-identification during the FBI? It is probable that
some mechanisms are shared but there are likely to be several important conceptual, behavioral,
and neurobiological differences. The finding that in the FBI there appears to be referral of touch
to a virtual body viewed as though at a distance of 2 m away is in contrast to the finding that the
rubber hand illusion is greatly weakened or abolished by changing the posture of the rubber hand
to an implausible one (Tsakiris and Haggard 2005) or by placing the rubber hand at more distant
positions (Lloyd 2007). Viewing one’s body from an external perspective at 2 m distance is even
less “anatomically plausible” than a rubber hand with a misaligned posture; therefore, it is perhaps
surprising that the FBI occurs at all under such conditions. However, it has been shown that the
visual receptive field size of parietal bimodal neurons with tactile receptive fields centered on the
shoulder or the back can be very large—extending sometimes for more than a meter in extraper-
sonal space (Duhamel et al. 1998; Maravita and Iriki 2004). Shifts in the spatial characteristics of
such trunk-centered bimodal neurons may thus account for the observed changes during the FBI
(Blanke and Metzinger 2009). What these differences illustrate is that the constraints operating in
the FBI are in certain ways markedly different to those operating in the rubber hand illusion. They
appear similar in that the strength of both illusions depends on the temporal congruence between
seen and felt stroking. However, the constraints regarding the spatial relations between the location
of the origin of the first-person visuospatial perspective and the rubber hand are different to those
between the location of the origin of the first-person visuospatial perspective and the location of
the seen virtual body (see also Blanke and Metzinger 2009). Moreover, in the rubber hand illusion (RHI) it is the hand
with respect to the body that is mislocalized: a “body part–body” interaction. In the FBI the entire
body (the bodily self) is mislocalized within external space: a “body–world” interaction. It may be
that the “whole body drift” entails that (during the synchronous condition) the “volume” of peri-
personal space is relocated (toward the virtual body) within a stable external space (compatible with
subjective reports during OBEs). Alternatively, it may be that peripersonal and extrapersonal space
are modified. The dimensions of the external room—for example, the proximity of walls to the
subjects—are likely to affect the FBI more than the RHI, but this has not been systematically tested
yet. Given the differences between the illusions, it is to be expected that there should be differences
in both the spatial constraints and neural bases (at the level of bimodal visuo-tactile neurons and of
brain regions encoding multisensory bodily signals) between these illusions.

24.3.3  Mislocalization of Touch during FBIs


The studies discussed above (Ehrsson 2007; Lenggenhager et al. 2007, 2009) suggest that during
the FBI changes in self-location and self-identification are accompanied by a mislocalization of
touch, that is, the feeling of touch is biased toward where the touch is seen on one’s own body in
extrapersonal space. However, the evidence for this that was presented in these studies (Ehrsson
2007; Lenggenhager et al. 2007, 2009) came only from questionnaire ratings, specifically the
statements: “It seemed as if I was feeling the touch in the location where I saw the virtual body
touched” (Lenggenhager et al. 2007, 2009) and “I experienced that the hand I was seeing approach-
ing the cameras was directly touching my chest (with the rod)” (Ehrsson 2007). Questionnaire
ratings, being explicit judgments, are susceptible to various biases, for example, experimenter
expectancy effects. Also, the questions were asked only after the period of stroking, not during,
and so were not “online” measures of bodily self-consciousness. Furthermore, as recently pointed
out (Ehrsson and Petkova 2008), such questions are somewhat ambiguous in a virtual reality (VR) setup: they are,
arguably, unable to distinguish between self-identification with a virtual body and self-recognition
in a VR/video system. A more recent study (Aspell et al. 2009) therefore developed an online
measure for the mislocalization of touch that would be less susceptible to response biases and that
would test more directly whether tactile mapping is altered during the FBI. This study investigated
whether modifications in bodily self-consciousness are associated with changes in tactile spatial
representations.
To investigate this, the authors (Aspell et al. 2009) adapted the cross-modal congruency task
(Spence et al. 2004) for the full body. This task was used because the cross-modal congruency
effect (CCE) measured in the task can function as a behavioral index of the perceived proximity of
visual and tactile stimuli. In previous studies of the CCE (Igarashi et al. 2008; Pavani and Castiello
2004; Pavani et al. 2000; Shore et al. 2006; Spence et al. 2004), the visual and tactile stimuli were
presented on foam cubes held in the hands: single vibrotactile devices paired with small lights [light
emitting diodes (LEDs)] were positioned next to the thumb and index finger of each hand. Subjects
made speeded elevation discriminations (“up”/index or “down”/thumb) of the tactile stimuli while
attempting to ignore the visual distractors. It was found that subjects performed worse when a
distracting visual stimulus occurred at an incongruent elevation with respect to the tactile (target)
stimulus. Importantly, the CCE (difference between reaction times during incongruent and congru-
ent conditions) was larger when the visual and tactile stimuli occurred closer to each other in space
(Spence et al. 2004).
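For concreteness, the sketch below shows how a CCE score can be computed from raw reaction times; the numbers are invented for illustration and do not come from the studies cited above.

```python
import statistics

def crossmodal_congruency_effect(rt_incongruent_ms, rt_congruent_ms):
    """CCE = mean RT on incongruent trials minus mean RT on congruent trials.

    Larger values indicate a stronger influence of the visual distractors on
    the tactile elevation judgments, i.e., visual and tactile stimuli being
    represented as closer together in space.
    """
    return statistics.mean(rt_incongruent_ms) - statistics.mean(rt_congruent_ms)

# Invented reaction times (ms) for one subject in each stroking condition.
cce_sync = crossmodal_congruency_effect([720, 735, 748], [620, 610, 633])
cce_async = crossmodal_congruency_effect([690, 700, 684], [640, 655, 648])
print(cce_sync, cce_async)  # a larger CCE is expected with synchronous stroking
```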
The CC task was adapted for the full body (from the typical setup for the hands; Spence et al.
2004) by placing the vibrotactile devices and LEDs on the subject's torso (back). Subjects were able to view
their body and the LEDs via an HMD (see Figure 24.3) as the setup was similar to that used in the
previous FBI study (Lenggenhager et al. 2007). To investigate whether “full body CCEs” would be
associated in a predictable way with changes in bodily self-consciousness, subjects’ self-identifica-
tion with the virtual body and self-location were manipulated across different blocks by employing
either synchronous or asynchronous stroking of the subjects’ backs. CCEs were measured during

FIGURE 24.3  Subject stood 2 m in front of a camera with a 3-D encoder. Four paired light and vibration devices were fixed to the subject's back, the upper two at the inner edges of the shoulder blades and the lower two 9 cm below. Small inset windows represent what the subject viewed via the head-mounted display. (1) Left panel: synchro-
nous stroking condition. (2) Right panel: asynchronous stroking condition. (Modified from Aspell, J. E. et al.,
PLoS ONE, 4(8), e6488, 2009.)

the stroking period and, as predicted, were found to be larger during synchronous than asynchronous blocks, indicating a greater mislocalization of touch during synchronous stroking than during asynchronous stroking. [Note that although a number of
components—attention, response bias, and multisensory integration—are all thought to contrib-
ute to the CCE to varying degrees (e.g., depending on the stimulus-onset asynchrony between the
visual and tactile stimuli), the finding of a difference in the CCE between same-side and different-side
stimuli during the synchronous condition, but not during the asynchronous condition, indicates that
the visual and tactile stimuli were represented as being closer to each other in the former case.] In
the synchronous condition, there was also a greater bias in self-location toward the virtual body
and a greater self-identification with the virtual body compared to in asynchronous blocks (as in
Lenggenhager et al. 2007). Control conditions revealed that the modulating effect of spatial remap-
ping of touch was body-specific.
Interestingly, this study also found that the size of the CCE, the degree of self-identification
with, and the bias in self-location toward the virtual body were all modulated by the stimulus onset
synchrony between the visual and vibrotactile stimuli used in the CCE task. These data thus suggest
that certain key components of bodily self-consciousness—that is, “what I experience as my body”
(self-identification) and “where I experience my body to be” (self-location)—are associated with
changes in the spatial representation of tactile stimuli. They imply that a greater degree of visual
capture of tactile location occurs when there is a greater degree of self-identification for the seen
body. This change in the tactile spatial representation of stimuli is not a remapping on the body, but
is, we suggest, a change in tactile mapping with respect to extrapersonal space: the tactile sensations
are perceived at a spatial location biased toward the virtual body.

24.3.4  Multisensory First-Person Perspective


Less work has been carried out on the question of whether the experienced spatial position of the
first-person perspective can be dissociated from that of self-location (Blanke and Metzinger 2009;
Schwabe and Blanke 2008). The aforementioned FBI studies suggest that the first-person visuospa-
tial perspective can (at least with a video setup) be dissociated from self-location in healthy subjects.
This has rarely been reported in patients with own body illusions such as OBEs and related experi-
ences. As seen above, in a typical OBE the self is experienced as “colocalized” with the first-person
visuospatial perspective. However, a recent neurological study (De Ridder et al. 2007) showed that
intracranial electrical stimulation at the right TPJ may lead to the experience of dissociation of self-
location from the first-person visuospatial perspective. Thus, the patient experienced extracorporeal
self-location and disembodiment to a position behind his body, but perceived the environment from
his normal, body-centered, first-person visuospatial perspective (and not from the disembodied per-
spective as is classically reported by people with OBEs). Furthermore, some patients suffering from
heautoscopy may experience two rapidly alternating first-person visuospatial perspectives and self-
locations (Blanke et al. 2004; Brugger et al. 1994). In such patients, the first-person visuospatial
perspective may sometimes even be experienced at two positions at the same time and this is often
associated with feelings of bilocation: the experience of a duplicated or split self, that is, not just a
split between body and self as in OBEs, but between two experienced self-locations (see also Lopez
et al. 2008).
The first-person visuospatial perspective is perhaps the only perspective that usually comes
to mind, and yet vision is not the only modality with an inherent “perspectivalness” (Metzinger
2003; Metzinger et al. 2007)—there is certainly also an auditory first-person perspective and
possibly also “perspectives” based primarily on proprioceptive and motor signals (Schwabe and
Blanke 2008). Again, in healthy subjects the auditory perspective and visual perspective are spa-
tially congruent, and yet patients with heautoscopy may describe spatial incongruence between
these perspectives (for further examples and discussion, see Blanke et al. 2004; Blanke and
Metzinger 2009).

24.4  CONCLUSION
Studies of OBEs of neurological origin have influenced current scientific thinking on the nature of
global bodily self-consciousness. These clinical studies have highlighted that bodily self-conscious-
ness can be broken down into three key components: self-location, first-person perspective, and self-
identification (Blanke and Metzinger 2009). The phenomenology of OBEs and related experiences
demonstrates that these three components are dissociable, suggesting that they may have distinct
functional and neural bases. The first empirical investigations into the key dimensions of bodily
self-consciousness that we have reviewed here show that it is also possible to study and dissociate
these three components of the global bodily self in healthy subjects.
Future studies should seek to develop experimental settings in which bodily self-consciousness
can be manipulated more robustly and more strongly in healthy subjects. It will also be important
for future studies to characterize in detail the neural machinery that leads to the described experien-
tial and behavioral changes in bodily self-consciousness. The TPJ is likely to be crucially involved
(Blanke et al. 2004; Blanke and Mohr 2005), but we expect that other areas such as the medial
prefrontal cortex (Gusnard et al. 2001) and the precuneus (Northoff and Bermpohl 2004), as well as
somatosensory (Ruby and Decety 2001) and vestibular cortex (Lopez et al. 2008) will also be found
to contribute to bodily self-consciousness.
Will it ever be possible to experimentally induce full-blown OBEs in healthy subjects? OBEs
have previously been induced using direct brain stimulation in neurological patients (Blanke et al.
2002; De Ridder et al. 2007; Penfield 1955), but these clinical examinations can only be carried
out in a highly selective patient population, whereas related techniques, such as transcranial magnetic
stimulation, do not induce similar effects (Blanke and Thut 2007). Blackmore (1982, 1984)
has listed a number of behavioral procedures that may induce OBEs, and it may be interesting for
future empirical research to employ some of these “induction” methods in a systematic manner in
combination with well-controlled scientific experimentation. It is important to note that the studies
that used video projection (Ehrsson 2007; Lenggenhager et al. 2007, 2009) did not actually induce
OBEs, but rather produced states that are more comparable to heautoscopy. Where will we
find techniques to create experimental setups able to induce something even closer to an OBE? We
believe that virtual reality technology, robotics, and methods from the field of vestibular physiology
may be promising avenues to explore.

REFERENCES
Altschuler, E., and V. Ramachandran. 2007. A simple method to stand outside oneself. Perception 36(4):
632–634.
Arzy, S., G. Thut, C. Mohr, C. M. Michel, and O. Blanke. 2006. Neural basis of embodiment: Distinct con-
tributions of temporoparietal junction and extrastriate body area. Journal of Neuroscience 26(31):
8074–8081.
Aspell, J. E., B. Lenggenhager, and O. Blanke. 2009. Keeping in touch with one’s self: Multisensory mecha-
nisms of self-consciousness. PLoS ONE 4(8): e6488.
Blackmore, S. 1982. Beyond the body. An investigation of out-of-body experiences. London: Heinemann.
Blackmore, S. 1984. A psychological theory of the out-of-body experience. Journal of Parapsychology 48:
201–218.
Blanke, O., T. Landis, L. Spinelli, and M. Seeck. 2004. Out-of-body experience and autoscopy of neurological
origin. Brain 127(2): 243–258.
Blanke, O., and T. Metzinger. 2009. Full-body illusions and minimal phenomenal selfhood. Trends in Cognitive
Sciences 13(1): 7–13.
Blanke, O., T. Metzinger, and B. Lenggenhager. 2008. Response to Kaspar Meyer’s E-letter. Science E-letter.
Blanke, O., and C. Mohr. 2005. Out-of-body experience, heautoscopy, and autoscopic hallucination of neuro-
logical origin: Implications for neurocognitive mechanisms of corporeal awareness and self-conscious-
ness. Brain Research Reviews 50(1): 184–199.
Blanke, O., S. Ortigue, T. Landis, and M. Seeck. 2002. Neuropsychology: Stimulating illusory own-body per-
ceptions. Nature 419(6904): 269–270.
Blanke, O., and V. Castillo. 2007. Clinical neuroimaging in epileptic patients with autoscopic hallucinations
and out-of-body experiences. Epileptologie 24: 90–95.
Blanke, O., and G. Thut. 2007. Inducing out of body experiences. In Tall Tales, ed. G. Della Sala. Oxford:
Oxford Univ. Press.
Botvinick, M., and J. Cohen. 1998. Rubber hands ‘feel’ touch that eyes see. Nature 391(6669): 756.
Brandt, T., and M. Dieterich. 1999. The vestibular cortex: Its locations, functions, and disorders. Annals of the
New York Academy of Science 871(1): 293–312.
Bremmer, F., A. Schlack, J.-R. Duhamel, W. Graf, and G. R. Fink. 2001. Space coding in primate posterior
parietal cortex. NeuroImage 14(1): S46–S51.
Brugger, P. 2002. Reflective mirrors: Perspective-taking in autoscopic phenomena. Cognitive Neuropsychiatry
7: 179–194.
Brugger, P., R. Agosti, M. Regard, H. Wieser, and T. Landis. 1994. Heautoscopy, epilepsy, and suicide. Journal
of Neurology, Neurosurgery & Psychiatry 57(7): 838–839.
Brugger, P., M. Regard, and T. Landis. 1997. Illusory reduplication of one’s own body: Phenomenology and
classification of autoscopic phenomena. Cognitive Neuropsychiatry 2(1): 19–38.
Bünning, S., and O. Blanke. 2005. The out-of body experience: Precipitating factors and neural correlates. In
Progress in Brain Research, Vol. 150, 331–350. Amsterdam, The Netherlands: Elsevier.
Calvert, G. A., R. Campbell, and M. J. Brammer. 2000. Evidence from functional magnetic resonance imaging
of crossmodal binding in the human heteromodal cortex. Current Biology 10(11): 649–657.
Carruthers, G. 2008. Types of body representation and the sense of embodiment. Consciousness and Cognition
17: 1302–1316.
Damasio, A. R. 2000. The feeling of what happens: Body and emotion in the making of consciousness. New
York: Harcourt Brace.
Dening, T. R., and G. E. Berrios. 1994. Autoscopic phenomena. The British Journal of Psychiatry 165: 808–817,
doi:10.1192/bjp.165.6.808.
De Ridder, D., K. Van Laere, P. Dupont, T. Menovsky, and P. Van de Heyning. 2007. Visualizing out-of-body
experience in the brain. New England Journal of Medicine 357(18): 1829–1833.
Devinsky, O., E. Feldmann, K. Burrowes, and E. Bromfield. 1989. Autoscopic phenomena with seizures.
Archives of Neurology 46(10): 1080–1088.
Duhamel, J., C. L. Colby, and M. Goldberg. 1998. Ventral intraparietal area of the macaque: Congruent visual
and somatic response properties. Journal of Neurophysiology 79(1): 126–136.
Ehrsson, H. 2007. The experimental induction of out-of-body experiences. Science 317(5841): 1048,
doi:10.1126/science.1142175.
Ehrsson, H., and V. Petkova. 2008. Response to Kaspar Meyer’s E-letter. Science, E-Letter.
Ehrsson, H., C. Spence, and R. Passingham. 2004. That’s my hand! Activity in premotor cortex reflects feeling
of ownership of a limb. Science 305(5685): 875–877, doi:10.1126/science.1097011.
Ehrsson, H. H., N. P. Holmes, and R. E. Passingham. 2005. Touching a rubber hand: Feeling of body ownership
is associated with activity in multisensory brain areas. The Journal of Neuroscience 25: 10564–10573,
doi:10.1523/jneurosci.0800-05.2005.
Ehrsson, H. H., K. Wiech, N. Weiskopf, R. J. Dolan, and R. E. Passingham. 2007. Threatening a rubber hand
that you feel is yours elicits a cortical anxiety response. Proceedings of the National Academy of Sciences
104: 9828–9833, doi:10.1073/pnas.0610011104.
Fasold, O. et al. 2002. Human vestibular cortex as identified with caloric stimulation in functional magnetic
resonance imaging. NeuroImage 17: 1384–1393.
Gallagher, S. 2000. Philosophical conceptions of the self: Implications for cognitive science. Trends in Cognitive
Sciences 4(1): 14–21.
Gallagher, S. 2005. How the body shapes the mind. Oxford: Clarendon Press.
Graziano, M., D. Cooke, and C. Taylor. 2000. Coding the location of the arm by sight. Science 290(5497):
1782–1786.
Green, C. 1968. Out-of-body experiences. Oxford: Institute of Psychophysical Research.
Gusnard, D. A., E. Akbudak, G. L. Shulman, and M. E. Raichle. 2001. Medial prefrontal cortex and self-refer-
ential mental activity: Relation to a default mode of brain function. Proceedings of the National Academy
of Sciences of the United States of America 98(7): 4259–4264.
Haggard, P., M. Taylor-Clarke, and S. Kennett. 2003. Tactile perception, cortical representation and the bodily
self. Current Biology 13(5): R170–R173.
Hägni, K., K. Eng, M.-C. Hepp-Reymond, L. Holper, B. Keisker, E. Siekierka et al. 2008. Observing virtual
arms that you imagine are yours increases the galvanic skin response to an unexpected threat. PLoS ONE
3(8): e3082.
Hécaen, H., and J. Ajuriaguerra. 1952. Méconnaissances et hallucinations corporelles: intégration et désinté-
gration de la somatognosie. Masson.
Igarashi, Y., Y. Kimura, C. Spence, and S. Ichihara. 2008. The selective effect of the image of a hand on visuo­
tactile interactions as assessed by performance on the crossmodal congruency task. Experimental Brain
Research 184(1): 31–38.
Iriki, A., M. Tanaka, and Y. Iwamura. 1996. Coding of modified body schema during tool use by macaque
postcentral neurones. NeuroReport 7: 2325–2330.
Iriki, A., M. Tanaka, S. Obayashi, and Y. Iwamura. 2001. Self-images in the video monitor coded by monkey
intraparietal neurons. Neuroscience Research 40: 163–173.
Irwin, H. 1985. Flight of mind: A psychological study of the out-of-body experience. Metuche, NJ: Scarecrow
Press.
Jeannerod, M. 2006. Motor cognition: What actions tell the self. Oxford, UK: Oxford Univ. Press.
Jeannerod, M. 2007. Being oneself. Journal of Physiology – Paris 101(4–6): 161–168.
Kammers, M. P. M., F. de Vignemont, L. Verhagen, and H. C. Dijkerman. 2009. The rubber hand illusion in
action. Neuropsychologia 47: 204–211, doi:10.1016/j.neuropsychologia.2008.07.028.
Knoblich, G. 2002. Self-recognition: Body and action. Trends in Cognitive Sciences 6(11): 447–449.
Kölmel, H. 1985. Complex visual hallucinations in the hemianopic field. Journal of Neurology, Neurosurgery
and Psychiatry 48: 29–38.
Le Chapelain, L., J. Beis, J. Paysant, and J. André. 2001. Vestibular caloric stimulation evokes phantom limb
illusions in patients with paraplegia. Spinal Cord 39(2): 85–87.
Lenggenhager, B., C. Lopez, and O. Blanke. 2008. Influence of galvanic vestibular stimulation on egocentric
and object-based mental transformations. Experimental Brain Research 184(2): 211–221.
Lenggenhager, B., M. Mouthon, and O. Blanke. 2009. Spatial aspects of bodily self-consciousness.
Consciousness and Cognition 18(1): 110–117.
Lenggenhager, B., T. Tadi, T. Metzinger, and O. Blanke. 2007. Video ergo sum: Manipulating bodily self-
consciousness. Science 317(5841): 1096–1099.
Leube, D. T., G. Knoblich, M. Erb, W. Grodd, M. Bartels, and T. T. J. Kircher. 2003. The neural correlates of
perceiving one’s own movements. NeuroImage 20(4): 2084–2090.
Lhermitte, J. 1939. In L’image de notre corps, 170–227. L’Harmattan.
Lippman, C. 1953. Hallucinations of physical duality in migraine. Journal of Nervous and Mental Disease
117: 345–350.
Lloyd, D. M. 2007. Spatial limits on referred touch to an alien limb may reflect boundaries of visuo-tactile
peripersonal space surrounding the hand. Brain and Cognition 64(1): 104–109.
Lobel, E., J. F. Kleine, D. L. Bihan, A. Leroy-Willig, and A. Berthoz. 1998. Functional MRI of galvanic ves-
tibular stimulation. Journal of Neurophysiology 80: 2699–2709.
Longo, M. R., S. Cardozo, and P. Haggard. 2008. Visual enhancement of touch and the bodily self. Consciousness
and Cognition 17: 1181–1191.
Lopez, C., P. Halje, and O. Blanke. 2008. Body ownership and embodiment: Vestibular and multisensory mech-
anisms. Neurophysiologie Clinique/Clinical Neurophysiology 38(3): 149–161.
Makin, T. R., N. P. Holmes, and H. H. Ehrsson. 2008. On the other hand: Dummy hands and peripersonal space.
Behavioural Brain Research 191(1): 1–10.
Maravita, A., and A. Iriki. 2004. Tools for the body (schema). Trends in Cognitive Sciences 8(2): 79–86.
Metzinger, T. 2003. Being no one. The self-model theory of subjectivity. Cambridge, MA: MIT Press.
Metzinger, T., B. Rahul, and K. C. Bikas. 2007. Empirical perspectives from the self-model theory of subjec-
tivity: A brief summary with examples. In Progress in Brain Research, Vol. 168, 215–245, 273–278.
Amsterdam, The Netherlands: Elsevier.
Meyer, K. 2008. How does the brain localize the self? Science, E-Letter.
Mizumoto, M., and M. Ishikawa. 2005. Immunity to error through misidentification and the bodily illusion
experiment. Journal of Consciousness Studies 12(7): 3–19.
Muldoon, S., and H. Carrington. 1929. The projection of the astral body. London: Rider and Co.
Northoff, G., and F. Bermpohl. 2004. Cortical midline structures and the self. Trends in Cognitive Sciences
8(3): 102–107.
Pacherie, E. 2008. The phenomenology of action: A conceptual framework. Cognition 107(1): 179–217.
Pavani, F., and U. Castiello. 2004. Binding personal and extrapersonal space through body shadows. Nature
Neuroscience 7(1): 14–16.
Pavani, F., C. Spence, and J. Driver. 2000. Visual capture of touch: Out-of-the-body experiences with rubber
gloves. Psychological Science 11(5): 353–359.
Penfield, W. 1955. The 29th Maudsley lecture—The role of the temporal cortex in certain psychical phenom-
ena. Journal of Mental Science 101(424): 451–465.
Penfield, W., and T. Erickson. 1941. Epilepsy and Cerebral Localization. Oxford, England: Charles C.
Thomas.
Petkova, V., and H. H. Ehrsson. 2008. If I were you: Perceptual illusion of body swapping. PLoS ONE 3(12):
e3832.
Rizzolatti, G., C. Scandolara, M. Matelli, and M. Gentilucci. 1981. Afferent properties of periarcuate neurons
in macaque monkeys. II. Visual responses. Behavioural Brain Research 2: 147–163.
Ruby, P., and J. Decety. 2001. Effect of subjective perspective taking during simulation of action: A PET inves-
tigation of agency. Nature Neuroscience 4(5): 546–550.
Schütz-Bosbach, S., J. Musil, and P. Haggard. 2009. Touchant-touché: The role of self-touch in the representa-
tion of body structure. Consciousness and Cognition 18: 2–11.
Schwabe, L., and O. Blanke. 2008. The vestibular component in out-of-body experiences: A computational
approach. Frontiers in Human Neuroscience 2: 17.
Shore, D. I., M. E. Barnes, and C. Spence. 2006. Temporal aspects of the visuotactile congruency effect.
Neuroscience Letters 392(1–2): 96–100.
Spence, C., F. Pavani, and J. Driver. 2004. Spatial constraints on visual–tactile cross-modal distractor congru-
ency effects. Cognitive, Affective and Behavioral Neuroscience 4(2): 148–169.
Stratton, G. 1899. The spatial harmony of touch and sight. Mind 8: 492–505.
Todd, J., and K. Dewhurst. 1955. The double: Its psychopathology and psycho-physiology. Journal of Nervous
and Mental Disease 122: 47–55.
Tsakiris, M., and P. Haggard. 2005. The rubber hand illusion revisited: Visuotactile integration and self-attribu-
tion. Journal of Experimental Psychology-Human Perception and Performance 31(1): 80–91.
Tsakiris, M., M. Hesse, C. Boy, P. Haggard, and G. R. Fink. 2007. Neural signatures of body ownership: A
sensory network for bodily self-consciousness. Cerebral Cortex 17(10): 2235–2244, doi:10.1093/cercor/bhl131.
Vallar, G. et al. 1999. A fronto-parietal system for computing the egocentric spatial frame of reference in
humans. Experimental Brain Research 124: 281–286.
Vogeley, K., and G. R. Fink. 2003. Neural correlates of the first-person-perspective. Trends in Cognitive
Sciences 7: 38–42.
Yen Pik Sang, F., K. Jáuregui-Renaud, D. A. Green, A. Bronstein, and M. Gresty. 2006. Depersonalisation/
derealisation symptoms in vestibular disease. Journal of Neurology Neurosurgery and Psychiatry 77(6):
760–766.
Section VI
Attention and Spatial Representations
25 Spatial Constraints in Multisensory Attention
Emiliano Macaluso

CONTENTS
25.1 Introduction........................................................................................................................... 485
25.2 Unisensory and Multisensory Areas in Human Brain.......................................................... 487
25.3 Multisensory Endogenous Spatial Attention......................................................................... 490
25.4 Stimulus-Driven Spatial Attention........................................................................................ 492
25.5 Possible Relationship between Spatial Attention and Multisensory Integration................... 497
25.6 Conclusions............................................................................................................................500
References....................................................................................................................................... 501

25.1  INTRODUCTION
Our sensory organs continuously receive a large amount of input from the external world; some
of these are important for a successful interaction with the environment, whereas others can be
ignored. The operation of selecting relevant signals and filtering out irrelevant information is a
key task of the attentional system (Desimone and Duncan 1995; Kastner and Ungerleider 2001).
Attentional selection can occur on the basis of many different criteria, with a main distinction
between endogenous control (i.e., selection based on voluntary attention, current aims, and knowl-
edge) and stimulus-driven control (i.e., selection based on the intrinsic features of the sensory input).
Accordingly, we can decide to pay attention to the face of one person in a crowded room (i.e., attend-
ing to subtle details in a rich and complex environment), or attention can be captured by a loud
sound in a quiet room (i.e., attention captured by a salient stimulus).
Many different constraints can guide endogenous and stimulus-driven attention. We can volun-
tarily decide to attend to a specific visual feature, such as color or motion, but the very same features
can guide stimulus-driven attention if they stand out from the surrounding environment (“pop-out”
item, e.g., a single red stimulus presented among many green stimuli). Here, I will focus on pro-
cesses related to attentional selection based on spatial location. The investigation of mechanisms of
spatial attention control is appealing for many reasons. Spatial selectivity is one of the most impor-
tant characteristics of single neurons (i.e., the neuron’s receptive field) and well-organized maps of
space can be found throughout the brain (Gross and Graziano 1995). These include sensory areas
(e.g., striate and extrastriate occipital regions, for retinotopic representations of the visual world;
Tootell et al. 1982), subcortical regions [e.g., the superior colliculus (SC); Wallace et al. 1997], and
higher-level associative areas in frontal and parietal cortex (e.g., Ben Hamed et al. 2001; Sommer
and Wurtz 2000). This widespread selectivity for spatial locations raises the question of how, or
whether, these anatomically segregated representations contribute to the formation of an integrated
representation of external space. Indeed, from a subjective point of view, signals about different
visual features (e.g., shape/color) as well as motor commands seem to all merge effortlessly, giving
rise to a coherent and unified perception–action system that allows us to interact spatially with the
external environment.

The coordination of anatomically distributed spatial representations is also particularly relevant
for the processing of signals from different sensory modalities. The position of a single object
or event in the external world can be registered using signals in different modalities (e.g., a car
approaching that we can both see and hear), but this requires some mechanisms matching spatial
information that is initially processed in anatomically separate areas (e.g., occipital cortex for vision,
and temporal cortex for audition). The brain’s ability to detect spatial colocalization of signals in
different modalities can lead to faster and more accurate responses for multisensory signals origi-
nating from a single external location compared with spatially separate signals (e.g., see Spence et
al. 1998). Indeed, spatial alignment is considered a key determinant for multisensory integration
(Stein and Meredith 1993; Meredith and Stein 1996).
The main topic of this chapter concerns the possible relationship between the coordination/inte-
gration of spatial maps across sensory modalities and the control of spatial attention. Before tack-
ling this issue, I will briefly consider a few ideas about the neural basis of spatial attention control
in the visual modality. These will then help to highlight commonalities and differences between
visuospatial attention and spatial attention control in multisensory situations.
The neural mechanisms underlying visuospatial attention control have been studied extensively.
One of the most popular approaches consists of first presenting an informative cue that instructs the
subject to attend to a specific spatial location, followed by a target stimulus either at the attended
location (“valid trial,” typically 75–80%) or somewhere else in the visual display (“invalid trials,”
20–25%; see Posner 1980; Posner et al. 1984). Using event-related functional magnetic resonance
imaging (fMRI) techniques, it became possible to identify brain areas associated with three main
processes involved in these tasks. Processing of the cue, which involves preparatory orienting of
endogenous spatial attention toward the cued location, has been associated with activation of a dor-
sal fronto-parietal (dFP) network including the intraparietal sulcus (IPS) and posterior parietal cortex (PPC),
plus dorsal premotor cortex thought to correspond to the human frontal eye fields (FEFs; Corbetta
and Shulman 2002; Yantis et al. 2002; Vandenberghe et al. 2001). Kastner and colleagues (1999)
reported that activity in the dorsal network remains elevated in the interval between the cue and the
target when the display does not contain any visual stimulus, indicating that the activation of these
regions is generated “internally” (i.e., endogenous shift and holding of spatial attention; see also
Corbetta et al. 2005; Kelley et al. 2008).
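To make the trial structure of such cueing protocols concrete, the following minimal Python sketch (purely illustrative: the trial count, labels, and the 80% validity rate are assumptions based on the proportions quoted above, not parameters taken from any specific study) generates a cue–target trial list of the kind used in these paradigms.

import random

def make_trials(n_trials=200, p_valid=0.8, seed=0):
    # Generate cue-target pairs: the cue indicates one hemifield, and the target
    # appears at the cued location on ~80% of trials (valid) or opposite otherwise (invalid).
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        cued_side = rng.choice(["left", "right"])
        valid = rng.random() < p_valid
        target_side = cued_side if valid else ("right" if cued_side == "left" else "left")
        trials.append({"cue": cued_side, "target": target_side,
                       "validity": "valid" if valid else "invalid"})
    return trials

trials = make_trials()
print(sum(t["validity"] == "valid" for t in trials), "valid trials out of", len(trials))

In event-related designs of this kind, cue and target events can then be modeled as separate regressors, which is what allows the cue-related (preparatory) and target-related responses described in this section to be distinguished.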
Upon presentation of the target, different brain areas are activated depending on whether the
target is presented at the attended (valid) or unattended (invalid) location. On valid trials, attention
modulates activity in retinotopic occipital visual areas that represent the target location (e.g., see
Martinez et al. 1999; Hopfinger et al. 2000; see also Luck et al. 1997) as well as dorsal parietal and
frontal regions that also contain retinotopically organized maps of visual space (Sereno et al. 2001;
Hagler and Sereno 2006; Saygin and Sereno 2008). Accordingly, visuospatial attention can modulate
activity in multiple brain regions, all representing the position of the attended stimulus. When the
target is presented at the uncued side (invalid trials), activation is typically found in a ventral fronto-
parietal network (vFP), comprising the inferior parietal cortex [temporo-parietal junction (TPJ)]
and inferior premotor regions [inferior frontal gyrus (IFG) and frontal operculum; see Arrington et
al. 2000; Corbetta et al. 2000]. Invalidly cued targets are thought to trigger a shift of spatial atten-
tion from the cued/attended location to the unattended target location, and therefore activation of the
vFP has been associated with stimulus (target)-driven reorienting of spatial attention.
These findings can be contextualized in “site-source models” of visuospatial attention control
(Figure 25.1a; for review, see also Pessoa et al. 2003). They postulate that control regions in dFP
(“sources” of endogenous control) influence activity in sensory areas (“sites” of attentional mod-
ulation) that represent the currently relevant/attended position, via modulatory feedback signals
(Desimone and Duncan 1995; Bressler et al. 2008; Ruff et al. 2006; see also Moore 2006, for review).
These modulatory influences facilitate the processing of stimuli at the attended position, enabling
them to outcompete other stimuli for the same processing resources. A recent EEG study clarified
the temporal sequence of activation in these control and sensory areas (Green and McDonald 2008).

FIGURE 25.1  Schematic models of spatial attention control. (a) “Site-source model” of visuospatial con-
trol. This distinguishes areas that generate spatial biases [“sources,” in dorsal fronto-parietal (dFP) cortex]
and areas that receive these modulatory signals (“sites,” occipital visual cortex). The model also includes a
distinction between endogenous control (dark gray) and stimulus-driven control (light gray; see Corbetta et
al. 2002). The two control systems operate together and interaction between them has been proposed to affect
functional coupling between visual cortex and ventral attention control network (vFP; see Corbetta et al.
2008). IPS, intraparietal sulcus; PPC, posterior parietal cortex; TPJ, temporo-parietal cortex; IFG, inferior
frontal gyrus; SC, superior colliculus; Som/Aud, somatosensory/auditory. (b) An extension of “site-source”
model, with feedforward connectivity and backprojections that allow transferring spatial information between
sensory-specific (e.g., visual, auditory, and somatosensory areas) and multisensory regions (dFP and vFP).
These multiple pathways may mediate spatial constraints in multisensory attention. Possible routes include:
(1) feedforward multisensory input converging into vFP, for stimulus-driven control; (2) multisensory interac-
tions in dFP, which in turn may affect interplay between dorsal (endogenous) and ventral (stimulus-driven)
attention control systems; (3) direct projections between sensory-specific areas that may mediate cross-modal
effects in sensory-specific areas; (4) multisensory interaction via subcortical structures that send and receive
projections to/from sensory-specific and multisensory cortical areas.

During the cue-to-target interval, activation occurred first in the parietal cortex (approximately
at 200 ms post-cue onset) and then in the frontal regions (at 400 ms), followed by reactivation
of parietal regions (600 ms), and lastly, attentional modulation was found in the occipital cortex.
Moreover, this study also showed that these preparatory effects are predictive of subsequent per-
ceptual performance upon presentation of the target, confirming the relationship between activation
of fronto-parietal control regions and attentional benefits for targets presented at the cued location.
The vFP provides an additional control system that can flexibly interrupt endogenous control when
unexpected/salient events require reorienting of attention toward a new location (stimulus-driven
control; Corbetta and Shulman 2002; Corbetta et al. 2008).
It should be stressed that this is a simplified model of visuospatial attention control, as there
are many other processes (e.g., feature conjunction, sensory–motor transformations, etc.) and brain
regions (e.g., the SC and the pulvinar) that also contribute to covert spatial orienting. However, this
simple model embodies a few key concepts concerning (1) attention control vs. modulation, (2) feed-
forward vs. feedback connectivity, and (3) endogenous vs. stimulus-driven control, which can help
in the interpretation of many findings in studies of multisensory spatial attention.
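As a purely numerical illustration of the first two of these concepts (control vs. modulation, and feedback from “sources” to “sites”), the toy Python sketch below applies a multiplicative attentional gain to the sensory drive at the attended location; the gain value and response units are arbitrary assumptions and do not correspond to any model proposed here.

# Toy illustration (all values arbitrary): a "source" bias enhances "site" responses
# at the attended location, letting attended stimuli outcompete unattended ones.
stimulus_drive = {"left": 1.0, "right": 1.0}   # identical physical input on both sides
attended = "right"                             # endogenous bias issued by the control system
attention_gain = 0.5                           # strength of the modulatory feedback signal

site_response = {
    loc: drive * (1.0 + (attention_gain if loc == attended else 0.0))
    for loc, drive in stimulus_drive.items()
}
print(site_response)  # {'left': 1.0, 'right': 1.5}: the attended location gains a competitive advantage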

25.2  UNISENSORY AND MULTISENSORY AREAS IN HUMAN BRAIN


Before addressing the main issue of multisensory spatial attention control, it is worth briefly reviewing
the regions of the brain that respond to stimuli in more than one modality, and asking
whether they also show some differential activation depending on stimulus position. This will
highlight potential players in mechanisms of multisensory spatial attention. Single-cell electrophysiology
identified multisensory neurons in many cortical regions, including intraparietal areas
(Duhamel et al. 1998), premotor regions (Graziano and Gross 1993), posterior temporal cortex
(Bruce et al. 1981), and subcortical regions (Meredith and Stein 1986a). Noninvasive functional
imaging revealed corresponding multisensory responses in humans. For example, Bremmer et al.
(2001) presented subjects with visual (moving vs. stationary random dots), auditory (moving tones
vs. rest), or tactile stimuli (moving air puff vs. rest), asking where in the brain activity increases
irrespective of the stimulated modality. This showed activation of the intraparietal sulcus, inferior
parietal cortex, and premotor regions. Additional multisensory activations have been also found
in the insula (Lewis et al. 2000) and in posterior regions of the superior temporal sulcus (STS;
Beauchamp et al. 2004).
Turning to the more specific question about multisensory spatial representations, we utilized
visual or tactile stimuli presented either in the left or right hemifield, with left and right visual
stimuli positioned in close spatial proximity to the corresponding left and right hands (see Figure
25.2a). This allowed us to test for brain regions that show differential activation depending on the
stimulus position (left vs. right stimuli and vice versa) and—critically—whether any such differ-
ence depends on the stimulated modality (i.e., unisensory vs. multisensory, side-specific activa-
tions). Occipital visual regions activated for visual stimuli presented in the contralateral hemifield,
but were unaffected during tactile stimulation (unisensory visual side-specific effects; see Figure
25.2a). Somatosensory areas in the post-central sulcus activated for contralateral tactile stimuli and
did not respond to visual stimulations (unisensory tactile side-specific effects). Most importantly, a
higher-order region in the anterior intraparietal sulcus (aIPS) activated more for contralateral than
ipsilateral stimulation, but now irrespective of the modality of the stimuli (multisensory visuo-
tactile, side-specific effects; see signal plots in Figure 25.2a). Accordingly, spatial information from
different senses appears to come together in this region, forming a supramodal representation of
contralateral space. More recently, Sereno and Huang (2006) extended these results by showing that
multisensory responses in anterior IPS are not only side-specific, but also follow a well-organized
topographical layout with contiguous positions around the face represented in contiguous regions
of IPS. The activation of this intraparietal region does not seem to merely reflect a common out-
put/motor system for the different modalities, because multisensory spatial effects in aIPS have
been found irrespective of overt motor task (see Macaluso et al. 2003a, who used manual/saccadic
responses to visual/tactile stimuli).
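The logic of this 2 (modality: vision/touch) × 2 (side: left/right) mapping can be summarized in a short Python sketch; the response values and threshold below are invented for illustration and are not the estimates reported in these studies.

def classify_region(betas, threshold=1.0):
    # betas: responses of a left-hemisphere region to each (modality, side) condition.
    contra_vision = betas[("vision", "right")] - betas[("vision", "left")]
    contra_touch = betas[("touch", "right")] - betas[("touch", "left")]
    if contra_vision > threshold and contra_touch > threshold:
        return "multisensory side-specific (cf. aIPS)"
    if contra_vision > threshold:
        return "unisensory visual side-specific (cf. occipital cortex)"
    if contra_touch > threshold:
        return "unisensory tactile side-specific (cf. post-central sulcus)"
    return "no side-specific effect"

# Hypothetical response estimates (arbitrary units) for two left-hemisphere regions:
occipital = {("vision", "right"): 5.0, ("vision", "left"): 1.0,
             ("touch", "right"): 0.2, ("touch", "left"): 0.1}
aips = {("vision", "right"): 3.0, ("vision", "left"): 0.5,
        ("touch", "right"): 2.8, ("touch", "left"): 0.4}
print(classify_region(occipital))  # unisensory visual side-specific
print(classify_region(aips))       # multisensory side-specific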
To summarize, multisensory responses have been found in frontal, parietal, and temporal cortex.
These include intraparietal and dorsal premotor regions (Bremmer et al. 2001) that overlap with
the dFP network involved in endogenous control of visuospatial attention (Corbetta et al. 2002);
and regions around the TPJ and ventral premotor cortex (Beauchamp et al. 2004; Bremmer et
al. 2001) that seem to overlap with the vFP attention network (Corbetta et al. 2002). Studies that
manipulated the spatial position of stimuli in different modalities (Macaluso and Driver 2001;
Sereno and Huang 2006) revealed a segregation between sensory-specific spatial representation in
occipital (vision) and post-central (touch) areas, and multisensory representation primarily in the
intraparietal sulcus.
These findings fit with the idea that the integration of multisensory signals and the construction
of multisensory representations of space may occur via feedforward convergence (see Massaro
1999; Graziano and Gross 1995). Accordingly, spatial locations are first computed in sensory-
specific areas, which then project to common (multisensory) regions in high-order frontal, pari-
etal, and temporal cortex. In addition, the localization of multisensory responses both in vFP and
dFP raises the possibility that the attention control systems operate not only in vision, but rather
may control the deployment of spatial attention irrespective of modality (Farah et al. 1989;
Driver and Spence 1998). This will be addressed in the following sections, where first we will
examine examples of endogenous multisensory attention and then situations related to stimulus-
driven multisensory attention.

FIGURE 25.2  (a) Mapping of multisensory space. Top panel shows a schematic illustration of an fMRI
experiment to map visual and tactile side-specific activation. In different blocks/conditions, subjects were
presented with stimuli in one modality and one side only (right touch in example). A region in aIPS showed
greater responses for contralateral than ipsilateral stimuli, irrespective of stimulus modality. Middle panel
shows multisensory activation of left aIPS for visual and tactile stimuli on right side. By contrast, sensory-
specific areas showed an effect of contralateral versus ipsilateral stimuli only for corresponding modality. For
example, left occipital visual cortex activated significantly more for right than left stimulation, but only for
visual stimuli (see bottom panel). (b) Multisensory endogenous spatial attention. Top panel shows a schematic
illustration of one of the setups utilized to study visuo-tactile cross-modal links in endogenous spatial atten-
tion. (Reproduced from Macaluso, E. et al., Cereb. Cortex, 12, 357–368, 2002b.) The stimulation was always
bimodal and bilateral, but in different conditions subjects were asked to attend to only one side and one modality
(attend right touch, in this example). Direct comparison of conditions of attention to one versus the other side
(attend right vs. attend left, in the figure) reveals modality-independent attentional modulation in contralateral
multisensory regions (e.g., left aIPS for attention to right hemifield; see middle panel) but also cross-modal
influences in sensory-specific areas. For example, bottom panel shows cross-modal spatial attentional effects
in left occipital visual cortex, with increased activation when subjects attended right vision (bar 2 minus 1, in
signal plot) but also when they attended right touch (bar 4 minus 3, in plot). V/T, visual/tactile; L/R, left/right;
aL/aR, attend left/right; Bs, baseline condition (central detection); *p < .05.

25.3  MULTISENSORY ENDOGENOUS SPATIAL ATTENTION


Behavioral studies have often reported cross-modal cueing effects that suggest the existence of a
common system for spatial attention control across the different sensory modalities (e.g., Spence
and Driver 1996). The typical finding here is that when subjects are asked to attend to one location
and to expect targets in one specific modality there (e.g., attend to the left side to discriminate visual
targets), subjects are not only faster in discriminating visual targets at the attended side compared
to visual targets on the opposite side (i.e., intramodal effect of visuospatial attention), but will also
show an advantage for targets in a different modality (e.g., touch) presented on the “visually attended”
side compared to the unattended side (e.g., faster responses for left vs. right touch). These
results are consistent with the proposal that the selection of the relevant location occurs irrespective
of modality, supporting a supramodal account of endogenous attention control (Farah et al. 1989;
Driver et al. 1998).
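To illustrate how such cueing effects are typically quantified, the short Python sketch below computes intramodal and cross-modal cueing effects from hypothetical mean reaction times; all numbers are invented for illustration and are not data from the studies cited above.

# Hypothetical mean reaction times (ms) when subjects attend to the left side to judge vision:
rt = {
    ("visual", "attended_side"): 420.0, ("visual", "unattended_side"): 460.0,
    ("tactile", "attended_side"): 510.0, ("tactile", "unattended_side"): 535.0,
}

intramodal_effect = rt[("visual", "unattended_side")] - rt[("visual", "attended_side")]
crossmodal_effect = rt[("tactile", "unattended_side")] - rt[("tactile", "attended_side")]
print("intramodal (visual) cueing effect:", intramodal_effect, "ms")    # 40.0 ms
print("cross-modal (tactile) cueing effect:", crossmodal_effect, "ms")  # 25.0 ms

A reliably positive cross-modal effect of this kind is what supports the claim that selecting a location for one modality also benefits targets presented in another modality at that location.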
We have investigated the neural substrates of cross-modal cueing effects in endogenous atten-
tion using positron emission tomography and fMRI (Macaluso et al. 2000a, 2002b, 2003b). The
supramodal account of attention control predicts that regions of the brain involved in spatial atten-
tion should show modality-independent activation. Indeed, we found that associative regions in the
fronto-parietal cortex activated when subjects attended to one or the other side (compared with cen-
tral attention), regardless of whether they were asked to judge vision or touch (see also Shomstein
and Yantis 2006). In addition, in one fMRI study we separated cue-related effects and target-related
activity (Macaluso et al. 2003b). Specifically, we compared the activity associated with predictive
auditory cues (tones instructing subjects to shift attention toward one or the other hemifield) versus
control cues (a different tone indicating that no target would follow on that trial, i.e., no shift is
required). This showed activation of dorsal fronto-parietal regions, regardless of whether subjects
had to prepare for visual or tactile discrimination. Preparatory, cue-related effects in dFP regions
have also been reported in studies on pure auditory attention (Wu et al. 2007), confirming that dFP
is involved in endogenous, voluntary attention control irrespective of modality.
Additional evidence for supramodal mechanisms of attention control came from analyses that
directly compared attention to one or the other hemifield. In particular, if attention control selects
spatial locations irrespective of modality, it may be expected that activity in regions that represent
the attended location should be affected irrespective of the modality of the target presented there.
It should be stressed that in attention experiments, the stimuli are held constant across conditions
(cf. Figure 25.2a and b), thus highlighting the modulatory effect of endogenous spatial attention over
and above any activation related to the sensory input. The comparison of attention to one versus
the other hemifield showed modulation of activity in the aIPS, both when subjects attend and judge
vision but also when they attend and judge touch (Macaluso et al. 2000a, 2002b, 2003b; see Figure
25.2b). Moreover, these spatially specific attentional modulations in multisensory aIPS occurred
in anticipation of the appearance of the target (cue-related effects), corroborating the “internal,”
endogenous origin of these signals (see Macaluso et al. 2003b). These results are in agreement with
the idea that parietal association cortex contains multisensory representation of space (feedforward
convergence hypothesis) and that supramodal mechanisms of endogenous spatial attention control
operate by means of these representations.
Together with the modulation of activity within “multisensory convergence” regions, we have
also consistently found cross-modal influences of endogenous spatial attention in “sensory-specific”
occipital visual cortex. The occipital visual cortex does not respond to tactile stimuli (see
Figure 25.2a; but see also Kayser et al. 2005; Meyer et al. 2007) and, if anything, tends overall to
deactivate when attention is focused on nonvisual stimuli (Laurienti et al. 2002). Nonetheless, the
direction of tactile attention was found to modulate activity there cross-modally. For example, when
subjects directed endogenous attention toward the right hand to discriminate tactile targets there,
activity increased in the left occipital cortex that represents the contralateral, right visual hemifield
(Figure 25.2b, bottom panel; see also Macaluso et al. 2000a, 2002b, 2003b). These effects were
observed even when visual distracters at the attended side conveyed misleading information (e.g.,
a single flash of light, while subjects attempted to detect double pulses of vibrations; Macaluso et
al. 2000a, 2002b, 2003b; see also Ciaramitaro et al. 2007). Accordingly, it is unlikely that subjects
decided to strategically shift both tactile and visual attention toward one side, but rather cross-modal
spatial influences in visual cortex appear to be obligatory (see also Eimer 1999). It should be noted
that modulatory effects of one modality on areas dedicated to a different modality are not confined
to tactile attention affecting the visual cortex (for review, see Eimer and Driver 2001). For example,
Eimer and Van Velzen (2002) showed modulation of early somatosensory event-related potentials
(ERPs) depending on the direction of visual attention (see also Kida et al. 2007, for a recent mag-
netoencephalography study localizing related cross-modal influences in secondary somatosensory
cortex); Teder-Salejarvi et al. (1999) found that endogenous visuospatial attention can modulate
early auditory ERPs; and Hotting et al. (2003) reported reciprocal cross-modal influences of audi-
tory and tactile spatial attention on tactile and auditory ERPs, respectively.
Our visuo-tactile fMRI study that isolated cue-related, preparatory processes (Macaluso et al.
2003b) provided additional hints about the nature of spatially specific cross-modal influences in the
occipital cortex. The comparison of leftward versus rightward attention-directing cues, and vice
versa, demonstrated that activity in contralateral occipital cortex increases before the presentation
of the target stimuli, that is, when subjects prepared for the upcoming tactile judgment. For exam-
ple, when the auditory cue instructed the subject to shift tactile attention to the right hemifield, brain
activity increased not only in left post-central somatosensory areas and in left multimodal intrapa-
rietal cortex, but also in the left extrastriate visual cortex (for preparatory cross-modal influences
between other modalities, see also Trenner et al. 2008; Eimer et al. 2002; Green et al. 2005). This
supports the hypothesis that endogenous attention generates “multisensory spatial biases,” and that
these can influence multiple levels of processing, including activity in multisensory regions (aIPS)
as well as in sensory-specific areas (somatosensory and visual cortex, for tactile spatial attention).
To summarize, studies on multisensory endogenous spatial attention have shown that: (1) con-
trol regions in dFP activate irrespective of modality; (2) selective attention to one hemifield boosts
activity in areas that represent the contralateral hemifield, including also sensory-specific areas
concerned with a different modality (e.g., cross-modal modulation of occipital cortex during tactile
attention); (3) both multisensory regions in dFP (plus spatially specific aIPS) and unisensory areas
show attentional modulation before the presentation of the target stimuli (cue-related effects), con-
sistent with the endogenous, internally generated nature of these attentional signals.
These findings can be interpreted in the context of “site-source” models of attention con-
trol. Accordingly, feedforward sensory convergence would make multisensory information avail-
able to the dFP attentional network that can therefore operate as a supramodal control system.
Backprojections from the control system (“sources”) to sensory-specific areas (“sites”) enable con-
veying modulatory signals about the currently relevant location. Critically, because the control sys-
tem operates supramodally and is connected with several modalities, these signals will spread over
multiple “site” regions affecting activity in a distributed network of multimodal and unimodal brain
regions, all representing the attended location. The net result of this is that endogenous attention
selects the attended location irrespective of modality, with all stimuli presented at the attended loca-
tion receiving enhanced processing (e.g., Eimer and Driver 2000).
This proposal entails that feedforward and feedback connections between sensory areas
and associative regions in dFP mediate a transfer of spatial information across modalities. This
effectively means that endogenous attention “broadcasts” information about the currently attended
location between anatomically distant brain areas, thus mediating multisensory integration of space.
Drawing a loose analogy with the feature integration theory in the visual modality (Treisman and
Gelade 1980), we can think of space as an “object” composed of multiple “features” (visual
location, auditory location, saccadic target location, etc.). Each “feature” is represented in a specific
region of brain, including many sensory-specific, multisensory, and motor representations local-
ized in separate brain regions. Attention coordinates and binds together these representations via
modulatory influences, thus generating a coherent representation of the whole “object,” that is, an
integrated representation of space.
However, traditional views of multisensory integration posit that signals in different modalities
interact in an automatic manner, suggesting “preattentive” mechanisms of multisensory integration.
The next two sections will address this issue in more detail, first looking for multisensory effects
in paradigms involving stimulus-driven rather than voluntary attention, and then discussing a set
of studies that directly tested for the interplay between endogenous and stimulus-driven factors in
multisensory spatial attention. In the last section, I will further specify the possible relationship
between attention control and multisensory integration.

25.4  STIMULUS-DRIVEN SPATIAL ATTENTION


The previous section highlighted multisensory consequences of spatial selection, when subjects choose
voluntarily to pay attention to one specific location (endogenous attention). Under these conditions,
selection appears to operate cross-modally, with supramodal mechanisms of control (“sources”: dFP)
and modulation (“sites”: sensory areas) that boost processing of stimuli at the attended location irre-
spective of modality. The question arises whether these supramodal mechanisms are contingent on
voluntary control (e.g., maybe because of a limited pool of resources for endogenous attention) or
whether the merging of spatial representations across modalities can also occur in situations that do
not involve any strategic, voluntary control. Cross-modal cueing effects in automatic, stimulus-driven
spatial attention and any associated change of brain activity might also provide further evidence
about the possible relationship between spatial attention and multisensory integration.
Behavioral studies showed that nonpredictive stimuli in one modality can affect processing of
subsequent targets in a different modality (e.g., Spence et al. 1998; McDonald et al. 2000). For
example, task-irrelevant touch on the left hand can speed up responses to visual targets presented
nearby, compared with visual targets presented on the other side. Critically, this occurs also when
the side of the tactile stimulus does not predict the side of the subsequent visual target (nonpredic-
tive tactile cues), suggesting that these cross-modal cueing effects do not depend solely on strategic
(endogenous) deployment of spatial attention.
The investigation of the neural basis of stimulus-driven (visual) attention with neuroimaging
methods can be somewhat problematic. Studies on endogenous attention typically compare experi-
mental conditions with identical stimuli, manipulating only the instructions that are given to the
subject (Heinze et al. 1994; cf. also Figure 25.2b for an example in the context of multisensory atten-
tion). On the contrary, stimulus-driven effects depend by definition on the stimulus configuration.
In the visual modality, a typical stimulus-driven (exogenous) cueing study entails presentation of
a nonpredictive peripheral visual cue followed by a visual target either at the same location (valid
trials, 50%) or on the opposite side (invalid trials, 50%). The direct comparison of these two trials
entails not only a different attentional status (e.g., “attended” target on valid trials vs. “unattended”
target on invalid trials), but also different physical stimuli. Hence, the interpretation of any differ-
ential brain activation can be problematic. For example, two stimuli presented in close spatiotem-
poral proximity (as during valid trials) can give rise to nonlinear summation of the hemodynamic
responses, which will appear as a differential activation when valid and invalid trials are directly
compared even if there is no actual change in neuronal activity or attentional modulation. These
drawbacks are somewhat mitigated in the context of multisensory fMRI experiments. Multisensory
stimulus-driven paradigms also entail the co-occurrence of attentional (attended vs. unattended)
and sensory effects (same side vs. opposite side), but—critically—“same-side” conditions do not
entail delivering the same, or very similar, stimuli twice (i.e., cue and target are now in differ-
ent modalities). Moreover, the registration of the cue–target spatial alignment (i.e., the distinction
between valid/same-side and invalid/opposite-side trials) cannot occur in a straightforward manner
within low-level sensory-specific maps of space, because the two stimuli are initially processed in
anatomically separate brain regions (e.g., visual vs. somatosensory cortices).
We have utilized stimulus-driven paradigms in a series of multisensory, visuo-tactile fMRI studies
(Macaluso et al. 2000b, 2002a, 2005; Zimmer and Macaluso 2007). As in classical behavioral
studies, we compared trials with nonpredictive touch on the same side as the visual target versus
trials with touch and vision on opposite sides. More specifically, we revealed hemifield-specific
cross-modal interactions by testing for the interaction between the position of the visual target (left/
right) and the presence of the tactile input (e.g., Macaluso et al. 2000b) or the spatial congruence of
the visuo-tactile stimuli (same or opposite sides; cf. Figure 25.3a; see also Zimmer and Macaluso
2007). This allowed us to eliminate any trivial difference due to the sensory stimuli; for example,
comparing directly VLTL trials (vision and touch on the left, “same side”) versus VLTR trials
(“opposite side” trials) would activate the right somatosensory cortex just because this comparison
also entails left versus right tactile stimuli. On the contrary, the comparison (VLTL – VLnoT) >
(VRTL – VRnoT) does not entail this confound, with the mere effect of visual and tactile stimula-
tion subtracting out in the interaction.
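A small numerical example may help to make explicit why the interaction removes purely sensory effects; the response values below are hypothetical and simply follow the condition labels used above (V = vision, T = touch, L/R = left/right, noT = no touch).

def interaction(r):
    # The contrast described in the text: (VLTL - VLnoT) > (VRTL - VRnoT).
    return (r["VLTL"] - r["VLnoT"]) - (r["VRTL"] - r["VRnoT"])

# A region responding additively to vision and touch (no cross-modal interaction):
additive = {"VLnoT": 2.0, "VRnoT": 0.5, "VLTL": 3.0, "VRTL": 1.5, "VLTR": 2.0}
print(additive["VLTL"] - additive["VLTR"])  # 1.0: a direct VLTL vs. VLTR comparison is driven by touch side alone
print(interaction(additive))                # 0.0: the sensory main effects cancel in the interaction

# A region where left touch selectively boosts responses to spatially congruent left vision:
congruent_boost = {"VLnoT": 2.0, "VRnoT": 0.5, "VLTL": 4.0, "VRTL": 1.0}
print(interaction(congruent_boost))         # 1.5: a genuinely spatially specific cross-modal effect remains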
Our results consistently showed that nonpredictive, task-irrelevant tactile stimuli on the same
side as the visual target can boost activity in occipital visual cortex contralateral to the target side
(e.g., Macaluso et al. 2000b). Figure 25.3a shows an example of this cross-modal, stimulus-driven,
and spatially specific effect in the visual cortex. In this experiment, the visual target was delivered
equiprobably in left or right visual hemifield near to the subject’s face. Task-irrelevant nonpredictive
touch consisted of air puffs presented equiprobably on the left or right side of the forehead, in close
spatial correspondence of the position of the visual stimuli on each side. The task of the subject
was to discriminate the “up/down” elevation of the visual target (two LEDs were mounted on each
side). The test for hemifield-specific effects of cross-modal attention (i.e., the interaction between
the position of the visual stimulus and the spatial congruence of the bimodal stimulation; for more
details on this topic, see also Zimmer and Macaluso 2007) revealed increased activation in the left
occipital cortex when both vision and touch were on the right side of space; and activation in the
right occipital cortex for spatially congruent (same-side) vision and touch on the left side (see Figure
25.3a). Accordingly, task-irrelevant touch can affect processing in the visual cortex in a spatially
specific and fully stimulus-driven manner. This is consistent with the hypothesis that spatial infor-
mation about one modality (e.g., touch) can be transmitted to anatomically distant areas that process
stimuli in a different modality (e.g., occipital visual cortex), and that this can occur irrespective of
strategic, endogenous task requirements (see also McDonald and Ward 2000; Kennett et al. 2001;
Kayser et al. 2005; Teder-Salejarvi et al. 2005; related findings about stimulus-driven cross-talk
between areas processing different modalities are also discussed later in this book).
The finding of spatially specific cross-modal influences of touch in visual areas is also remark-
able because the visual cortex registers the stimulus position in a retino-centered frame of reference,
whereas the position of touch is initially registered in a body-centered frame of reference. Thus, the
question arises whether these side-specific effects of multisensory spatial congruence truly reflect
the alignment of visual and tactile stimuli in external space or rather merely reflect an overall hemi-
spheric bias. Indeed, a congruent VRTR stimulus entails a double stimulation of the left hemisphere,
whereas on incongruent VRTL trials the two stimuli will initially activate opposite hemispheres (see
also Kinsbourne 1970, on hemispheric biases in spatial attention). We dissociated the influence of
hemisphere versus external location by manipulating the direction of gaze with respect to the hand posi-
tion (Macaluso et al. 2002a). Tactile stimuli were always delivered to the right hand that was posi-
tioned centrally. When subjects fixated on the left side, the right visual field stimulus was spatially
aligned with touch, and both right touch and right vision projected to the left hemisphere. However,
when gaze was shifted to the right side, now the left visual field stimulus was spatially aligned with
right touch, with vision and touch projecting initially to opposite hemispheres. The fMRI results
showed that common location in external space, rather than common hemisphere, determined cross-
modal influences in the occipital cortex. Hence, right-hand touch can boost the right visual field
when the right hand is in the right visual field, but will boost the left visual field if a posture change
puts the right hand in the left visual field (see also Kennett et al. 2001, for a related ERP study).
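The remapping logic can be stated compactly; the function below uses made-up head-centered coordinates (in degrees) and is only meant to illustrate how gaze direction determines which retinal hemifield is spatially aligned with a centrally placed right hand, not to reproduce the actual geometry of the experiment.

def aligned_visual_hemifield(hand_position_deg=0.0, gaze_direction_deg=-10.0):
    # Retinal position of the hand = external (head-centered) position minus gaze direction.
    hand_retinal_deg = hand_position_deg - gaze_direction_deg
    return "right" if hand_retinal_deg > 0 else "left"

print(aligned_visual_hemifield(gaze_direction_deg=-10.0))  # gaze left: the right visual field is aligned with right-hand touch
print(aligned_visual_hemifield(gaze_direction_deg=10.0))   # gaze right: the left visual field is aligned with right-hand touch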

FIGURE 25.3  Stimulus-driven cross-modal spatial attention and interactions with endogenous control.
(a) Stimulus-driven cross-modal influences in visual cortex. In this event-related fMRI study (unpublished
data), subjects performed a visual discrimination task (“up/down” judgment) with visual stimuli presented
in left or right hemifield near the forehead. Task-irrelevant touch was presented equiprobably on left or right
side of the forehead, yielding spatially congruent trials (vision and touch on same side; e.g., both stimuli on
right side, cf. top-central panel) and incongruent trials (vision and touch on opposite sides; e.g., vision on the
right and touch on the left). Imaging data tested for interaction between position of visual target (left/right) and
spatial congruence of bimodal stimulation (congruent/incongruent: e.g., testing for greater activation for right
than left visual targets, in spatially congruent vs. incongruent trials). This revealed activity enhancement in
occipital visual areas when a contralateral visual target was coupled with a spatially congruent task-irrelevant
touch. For example, left occipital cortex showed greater activation comparing “right minus left visual targets,”
when touch was congruent vs. incongruent (see signal plot on left side: compare “bar 2 minus 1” vs. “bar 4
minus 3”); effectively yielding maximal activation of left occipital cortex when a right visual target
was combined with right touch on same side (see bar 2, in same plot). (b) Stimulus-driven cross-modal influ-
ences and endogenous visuospatial attention. (From Zimmer, U. and Macaluso, E., Eur. J. Neurosci., 26,
1681–1691, 2007.) Also in this study, we indexed side-specific cross-modal influences testing for interaction
between position of visual stimuli and spatial congruence of visuo-tactile input (see also Figure 25.3a; note
that, for simplicity, panel b shows only “right-congruent” condition), but now with both vision and touch fully
task-irrelevant. We assessed these cross-modal spatial effects under two conditions of endogenous visuospa-
tial attentional load. In “High load” condition, subjects were asked to detect subtle changes of orientation of
a grating patch presented above fixation. In “Low load” condition, they detected changes of luminance at
fixation. fMRI results showed that activity in occipital cortex increased for spatially congruent visuo-tactile
stimuli in the contralateral hemifield and that, critically, this occurred irrespective of the load of the endogenous visuospatial task. Accordingly, analogous effects of spatial congruence were found in the “Low load” condition (bar 1 minus 2) and in the “High load” condition (bar 3 minus 4, in each signal plot). V/T, vision/touch; L/R, left/
right; Cong/Incong, congruent (VT on the same side)/incongruent (VT on opposite sides).
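The interaction test described in the caption of Figure 25.3a can be written out as a simple contrast over the four condition estimates. The Python sketch below is a minimal illustration; the beta values are invented for the example and the condition ordering is an assumption, not taken from the actual dataset.

import numpy as np

# Condition estimates (arbitrary units) for left occipital cortex, ordered as
# [left target/congruent, right target/congruent,
#  left target/incongruent, right target/incongruent]; values invented.
betas = np.array([-2.0, 3.5, 1.0, 0.5])

# Interaction contrast: (right minus left targets, congruent) minus
# (right minus left targets, incongruent), i.e., "bar 2 minus 1" vs. "bar 4 minus 3".
contrast = np.array([-1.0, 1.0, 1.0, -1.0])
print(contrast @ betas)  # positive: congruent touch boosts the contralateral visual response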
The finding that cross-modal influences in sensory-specific occipital cortex can take posture
into account suggests that intermediate brain structures representing the current posture are also
involved. Postural signals have been found to affect activity in many different regions of the brain,
including fronto-parietal areas that also participate in attention control and multisensory processing
(Andersen et al. 1997; Ben Hamed and Duhamel 2002; Boussaoud et al. 1998; Bremmer et al. 1999;
Kalaska et al. 1997; Fasold et al. 2008). Hence, we can hypothesize that the fronto-parietal cortex
may also take part in stimulus-driven multisensory attention control.
In the visual modality, stimulus-driven control has been associated primarily with activation
of the vFP network, including the TPJ and the IFG. These areas activate when subjects are cued to attend to one hemifield but the visual target appears on the opposite side (invalid trials), thus triggering
a stimulus/target-driven shift of visuospatial attention (plus other task-related resetting processes;
see below). We employed a variation of this paradigm to study stimulus-driven shifts of attention
in vision and in touch (Macaluso et al. 2002c). A central informative cue instructed the subject
to attend to one side. On 80% of the trials the target appeared on the attended side (valid trials),
whereas in the remaining 20% of the trials the target appeared on the opposite side (invalid trials).
Critically, the target could be either visual (LED near to the left/right hands, on each side) or tactile
(air puff on the left/right hands). The modality of the target stimulus was randomized and unpre-
dictable, so subjects could not strategically prepare to perform target discrimination in one or the other modality. The dorsal FP network activated irrespective of cue validity, consistent with the role of this network in voluntary shifts of attention regardless of modality (see also Wu et al. 2007).
The direct comparison of invalid versus valid trials revealed activation of the vFP (TPJ and IFG),
both for invalid visual targets and for invalid tactile targets. This demonstrates that both visual and
tactile target stimuli at the unattended location can trigger stimulus-driven reorienting of spatial
attention and activation of the vFP network (see also Mayer et al. 2006; Downar et al. 2000).
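As a rough illustration of this cueing design, the short Python sketch below generates a trial list with an 80%-valid central cue and a randomly chosen target modality; the function name, trial count, and seed handling are illustrative assumptions rather than details of the original experiment.

import random

def make_trials(n_trials=200, p_valid=0.8, seed=0):
    """Illustrative trial list: a central cue indicates one side; the target
    appears there on 80% of trials (valid) and on the opposite side otherwise
    (invalid); target modality (visual or tactile) is unpredictable."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        cued = rng.choice(["left", "right"])
        valid = rng.random() < p_valid
        target = cued if valid else ("right" if cued == "left" else "left")
        trials.append({"cue": cued, "target_side": target,
                       "target_modality": rng.choice(["visual", "tactile"]),
                       "valid": valid})
    return trials

trials = make_trials()
print(sum(t["valid"] for t in trials) / len(trials))  # close to 0.8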
Nonetheless, extensive investigation of spatial cueing paradigms in the visual modality indicates
that the activation of the vFP network does not reflect pure stimulus-driven control. Indeed, invalid trials involve not only stimulus-driven shifts of attention from the cued location to the new target location, but also breaches of expectation (Nobre et al. 1999), the updating of task-related settings (Corbetta and Shulman 2002), and the processing of low-frequency stimuli (Vossel et
al. 2006). Several different strategies have been undertaken to tease apart the contribution of these
factors (e.g., Kincade et al. 2005; Indovina and Macaluso 2007). Overall, the results of these studies
lead to the current view that task-related (e.g., the task-relevance of the reorienting stimulus, i.e.,
the target that requires judgment and response) and stimulus-driven factors jointly contribute to the
activation of the vFP system (see Corbetta et al. 2008 for review).
Additional evidence for the role of task relevance for the activation of vFP in the visual modality
comes from a recent fMRI study, where we combined endogenous predictive cues and exogenous
nonpredictive visual cues on the same trial (Natale et al. 2009). Each trial began with a central,
predictive endogenous cue indicating the most likely (left/right) location of the upcoming target.
The endogenous cue was followed by a task-irrelevant, nonpredictive exogenous cue (brighten-
ing and thickening of a box in the left or right hemifield) that was quickly followed by the (left or
right) visual target. This allowed us to cross factorially the validity of endogenous and exogenous
cues within the same trial. We reasoned that if pure stimulus-driven attentional control can influ-
ence activity in vFP, exogenous cues that anticipate the position of an “endogenous-invalid” task-
relevant target (e.g., endogenous cue left, exogenous cue right, target right) should affect reorienting
related activation of vFP. Behaviorally, we found that both endogenous and exogenous cues affected
response times. Subjects were faster to discriminate “endogenous-invalid” targets when the exog-
enous cue anticipated the position of the target (exogenous valid trials, as in the stimulus sequence
above). However, the fMRI data did not reveal any significant effect of the exogenous cues in the
vFP, which activated equivalently in all conditions containing task-relevant targets on the opposite
side of the endogenously cued hemifield (i.e., all endogenous-invalid trials). These findings are in
agreement with the hypothesis that fully task-irrelevant visual stimuli do not affect activity in vFP
(even when the behavioral data demonstrate an influence of these task-irrelevant cues on target
discrimination; see also Kincade et al. 2005).
However, a different picture emerged when we used task-irrelevant auditory rather than visual
cues (Santangelo et al. 2009). The experimental paradigm was analogous to the pure visual study
described above, with a predictive endogenous cue followed by a nonpredictive exogenous cue (now
auditory) and by the visual target, within each trial. The visual targets were presented in the left/right
hemifields near to the subject’s face and the task-irrelevant auditory stimuli were delivered at cor-
responding external locations. The overall pattern of reaction times was similar to the visual study:
both valid endogenous and valid exogenous cues speeded up responses, confirming cross-modal
influences of the task-irrelevant auditory cues on the processing of the visual targets (McDonald et
al. 2000). The fMRI data revealed the expected activation of vFP for “endogenously invalid” visual
targets, demonstrating once again the role of these regions during reorienting toward task-relevant
targets (e.g., Corbetta et al. 2000). But critically, now the side of the task-irrelevant auditory stim-
uli was found to modulate activity in the vFP. Activation of the right TPJ for endogenous-invalid
trials diminished when the auditory cue was on the same side as the upcoming invalid target (e.g.,
endogenous cue left, exogenous auditory cue right, visual target right). Accordingly, task-irrelevant
sounds that anticipate the position of the invalid visual target reduce reorienting-related activation
in TPJ, demonstrating a “pure” stimulus-driven cross-modal spatial effect in the ventral attention
control system (but see also Downar et al. 2001; Mayer et al. 2009).
To summarize, multisensory studies of stimulus-driven attention showed that: (1) task-irrelevant
stimuli in one modality modulate activity in sensory-specific areas concerned with a different
modality, and they can do so in a spatially specific manner (e.g., boosting of activity in contralateral
occipital cortex for touch and vision on the same side); (2) spatially specific cross-modal influences
in sensory-specific areas take posture into account, suggesting indirect influences via higher-order
areas; (3) control regions in vFP operate supramodally, activating during stimulus-driven spatial
reorienting toward visual or tactile targets; (4) task-irrelevant auditory stimuli can modulate activity
in vFP, revealing a “special status” of multisensory stimulus-driven control compared with unisen-
sory visuospatial attention (cf. Natale et al. 2009). These findings call for an extension of site-source
models of attention control, which should take into account the “special status” of multisensory
stimuli. In particular, models of multisensory attention control should include pathways allowing
nonvisual stimuli to reach the visual cortex and to influence activity in the ventral attention network
irrespective of task-relevance.
Figure 25.1b shows some of the hypothetical pathways that may mediate these effects. “Pathway 1”
entails direct feedforward influences from auditory/somatosensory cortex into the vFP attention
system. The presence of multisensory neurons in the temporo-parietal cortex and inferior premotor cortex (Bruce et al. 1981; Barraclough et al. 2005; Hyvarinen 1981; Dong et al. 1994; Graziano
et al. 1997), plus activation of these regions for vision, audition, and touch in humans (Macaluso
and Driver 2001; Bremmer et al. 2001; Beauchamp et al. 2004; Downar et al. 2000) is consistent
with convergent multisensory projections into the vFP. A possible explanation for the effect of task-
irrelevant auditory cues in TPJ (see Santangelo et al. 2009) is that feedforward pathways from the
auditory cortex, unlike the pathway from occipital cortex, might not be under “task-related inhibi-
tory influences” (see Figure 25.1a). The hypothesis of inhibitory influences on the visual, occipital-
to-TPJ pathway was initially put forward by Corbetta and Shulman (2002) as a possible explanation
for why task-irrelevant visual stimuli do not activate TPJ (see also Natale et al. 2009). More recently
the same authors suggested that these inhibitory effects may arise from the middle frontal gyrus
and/or via subcortical structures (locus coeruleus; for details on this topic, see Corbetta et al. 2008).
Our finding of a modulatory effect by task-irrelevant audition in TPJ (Santangelo et al. 2009) sug-
gests that these inhibitory effects may not apply in situations involving task-irrelevant stimuli in a
modality other than vision.
“Pathway 2” involves indirect influences of multisensory signals in the ventral FP network, via
dorsal FP regions. Task-related modulations of the pathway between occipital cortex and TPJ are
thought to implicate the dFP network (Corbetta et al. 2008; see also the previous paragraph). Because
multisensory stimuli can affect processing in the dorsal FP network (via feedforward convergence),
these may in turn modify any influence that the dorsal network exerts on the ventral network (see
also He et al. 2007, for an example of how changes/lesions of one attention network can affect func-
tioning of the other network). This could include the removal of any inhibitory influence on (auditory) task-irrelevant stimuli. The involvement of dorsal FP areas may also be consistent with
the finding that cross-modal effects in unisensory areas take posture into account. Postural signals
modulate activity of neurons in many dFP regions (e.g., Andersen et al. 1997; Ben Hamed et al.
2002; Boussaoud et al. 1998; Bremmer et al. 1999; Kalaska et al. 1997). An indirect route via dFP
could therefore combine sensory signals and postural information about the eyes/head/body, yielding cross-modal influences according to position in external space (cf. Stein and Stanford 2008; but note that postural signals are available in multisensory regions of the vFP network, Graziano et al. 1997, and in the SC, Grossberg et al. 1997; see also Pouget et al. 2002 and Deneve and Pouget 2004, for computational models on this issue).
“Pathway 3” involves direct anatomical projections between sensory-specific areas that process
stimuli in different modalities. These have now been reported in many animal studies (e.g., Falchier
et al. 2002; Rockland and Ojima 2003; Cappe and Barone 2005) and could mediate automatic
influences of one modality (e.g., touch) on activity in sensory-specific areas of a different modality
(e.g., occipital visual cortex; see also Giard and Peronnet 1999; Kayser et al. 2005; Eckert et al.
2008). These connections between sensory-specific areas may provide fast, albeit spatially coarse,
indications about the presence of a multisensory object or event in the external environment. In
addition, a direct effect of audition or touch in occipital cortex could change the functional con-
nectivity between occipital cortex and TPJ (see Indovina and Macaluso 2004), also determining
stimulus-driven cross-modal influences in vFP.
Finally, additional pathways are likely to involve subcortical structures (“pathways 4” in Figure
25.1b). Many different subcortical regions contain multisensory neurons and can influence cortical
processing (e.g., superior colliculus, Meredith and Stein 1983; thalamus, Cappe et al. 2009; basal
ganglia, Nagy et al. 2006). In addition, subcortical structures are important for spatial orienting (e.g., the intermediate and deep layers of the SC are involved in the generation of overt saccadic responses; see also Frens and Van Opstal 1998, for a study on overt orienting to bimodal stimuli) and have been linked to selection processes in spatial attention (Shipp 2004). The critical role of the SC for combining spatial information across sensory modalities has also been demonstrated in two recent behavioral
studies (Maravita et al. 2008; Leo et al. 2008). These showed that superior behavioral performance
for spatially aligned, same-side versus opposite-side audiovisual trials disappears when the visual
stimuli are invisible to the SC (purple/blue stimuli).
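As a compact summary of the four candidate routes just described, the following Python sketch lists them as directed connections between coarsely labeled nodes; the node names are shorthand for the regions mentioned in the text and are not meant to reproduce Figure 25.1b.

# Shorthand summary of the candidate pathways discussed above (labels assumed).
pathways = {
    1: [("auditory_cortex", "vFP"), ("somatosensory_cortex", "vFP")],                        # direct feedforward into vFP
    2: [("auditory_cortex", "dFP"), ("somatosensory_cortex", "dFP"), ("dFP", "vFP")],         # indirect route via dorsal FP
    3: [("auditory_cortex", "occipital_cortex"), ("somatosensory_cortex", "occipital_cortex")],  # direct cortico-cortical links
    4: [("superior_colliculus", "cortex"), ("thalamus", "cortex"), ("basal_ganglia", "cortex")],  # subcortical routes
}

for number, edges in pathways.items():
    for source, target in edges:
        print(f"pathway {number}: {source} -> {target}")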

25.5  POSSIBLE RELATIONSHIP BETWEEN SPATIAL ATTENTION AND MULTISENSORY INTEGRATION
Regardless of the specific pathways involved (see preceding section), the finding that spatial information can be shared between multiple sensory-specific and multisensory areas even under conditions of stimulus-driven, automatic attention suggests a possible relationship between attention control
and the integration of space across sensory modalities. The central idea here is that attention may
“broadcast” information about the currently relevant location between anatomically distant brain
areas, thus providing a mechanism that coordinates spatial representations in different sensory
modalities and implying some relationship between attention and multisensory integration.
The functional relationship between attention and multisensory integration is much debated and not yet understood (e.g., Talsma et al. 2007; McDonald et al. 2001; Alsius et al. 2005; Saito et al. 2005; Macaluso and Driver 2005; Driver and Spence 1998; Bertelson et al. 2000; Kayser et al. 2005). This is attributable, at least to some extent, to the difficulty of defining unambiguous indices
of multisensory integration. Different authors have proposed and utilized a variety of measures to
highlight interactions between stimuli in different senses. These include phenomenological mea-
sures such as the perception of multisensory illusions (e.g., as in the “McGurk” illusion, McGurk
and MacDonald 1976; see also Soto-Faraco and Alsius 2009; or the “sound-bounce” illusion,
Bushara et al. 2003), behavioral criteria based on violations of the Miller inequality (Miller 1982;
see Tajadura-Jiménez et al. 2009, for an example), or physiological measures related to nonlinear
effects in single-cell spiking activity (Meredith and Stein 1986b), EEG (Giard and Peronnet 1999),
or fMRI (Calvert et al. 2001) signals. At present, there is still no consensus as most of these mea-
sures have drawbacks and no single index appears suitable for all possible experimental situations
(for an extensive treatment, see Beauchamp 2005; Laurienti et al. 2005; Holmes 2009).
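For the behavioral criterion based on the Miller inequality, a minimal race-model test is sketched below in Python; the quantile grid and the simulated reaction times are assumptions made purely to show the computation.

import numpy as np

def race_model_violation(rt_av, rt_a, rt_v, quantiles=np.arange(0.05, 0.51, 0.05)):
    """Evaluate the Miller (1982) inequality P(RT<t | AV) <= P(RT<t | A) + P(RT<t | V)
    at times given by quantiles of the bimodal distribution; positive values
    indicate a violation (evidence against separate-activation race models)."""
    ts = np.quantile(rt_av, quantiles)
    cdf = lambda rts, t: float(np.mean(np.asarray(rts) < t))
    return np.array([cdf(rt_av, t) - min(1.0, cdf(rt_a, t) + cdf(rt_v, t)) for t in ts])

# Simulated reaction times (ms), invented for illustration only.
rng = np.random.default_rng(0)
rt_a = rng.normal(350, 40, 500)
rt_v = rng.normal(360, 40, 500)
rt_av = rng.normal(300, 35, 500)
print(np.any(race_model_violation(rt_av, rt_a, rt_v) > 0))  # True: inequality violated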
In the case of cross-modal spatial cueing effects in stimulus-driven attention, the issue is further complicated by the fact that stimulus-driven effects depend on changes in stimulus configuration (same vs. different position), which is also considered a critical determinant of multisensory integration (Meredith and Stein 1986b). Therefore, it is difficult to experimentally tease apart these two
processes. In our initial study (Macaluso et al. 2000b), we showed boosting of activity in occipital
cortex contralateral to the position of spatially congruent bimodal visuo-tactile stimuli that were
presented simultaneously and for a relatively long duration (300 ms). McDonald et al. (2001) argued
that these cross-modal influences may relate to multisensory interactions rather than spatial atten-
tion, as there was no evidence that task-irrelevant touch captured attention to the side of the visual target. However, this point is difficult to address because it is impossible to obtain behavioral evidence that exogenous cues, which by definition do not require any response, trigger shifts of spatial attention. A related argument suggested that a minimal condition for disentangling attention from integration is to introduce a gap between the offset of the cue and the onset
of the target (McDonald et al. 2001). This should eliminate multisensory integration (the trial would
never include simultaneous bimodal stimulation), while leaving spatial attentional effects intact
(i.e., faster and more accurate behavioral responses for same-side vs. opposite-side trials). However,
we have previously argued that criteria based on stimulus timing may be misleading because of differential response latencies and discharge properties of neurons in different regions of the brain
(Macaluso et al. 2001). Thus, physically nonoverlapping stimuli (e.g., an auditory cue that precedes
a visual target) may produce coactivation of a bimodal neuron that has shorter response latency for
audition than for vision (e.g., see Meredith et al. 1987; for related findings using ERPs in humans;
see also Meylan and Murray 2007).
As an extension of the idea that the temporal sequence of events may be used to disentangle the
role of attention and multisensory integration in stimulus-driven cross-modal cueing paradigms
(McDonald et al. 2001), one may consider the timing of neuronal activation rather than the tim-
ing of the external stimuli. This can be addressed in the context of site-source models of attention
(cf. Figure 25.1). Along these lines, Spence et al. (2004) suggested that if control regions activate
before any modulation in sensory areas, this would speak for a key role of attention in cross-modal integration; whereas, if attentional control engages only after cross-modal effects have arisen in sensory-specific areas, this would favor the view that multisensory integration takes place irrespective of attention.
In the latter case, cross-modal cueing effects could be regarded as arising as a “consequence” of
the integration process (see also Busse et al. 2005). Using ERP and dipole source localization in a
stimulus-driven audiovisual cueing paradigm, McDonald and colleagues (2003) found that associa-
tive regions in the posterior temporal cortex activate before any cross-modal spatial effect in the
visual cortex. In this study, there was a 17- to 217-ms gap between cue offset and target onset, and
the analysis of the behavioral data showed increased perceptual sensitivity (d′) for valid compared to
invalid trials. Accordingly, the authors suggested that the observed sequence of activation (including
cross-modal influences of audition on visual ERPs) could be related to involuntary shifts of spatial
attention. However, this study did not assess brain activity associated specifically with the exog-
enous cues, thus again not providing any direct evidence for cue-related shifts of attention. Using a
different approach to investigate the dynamics of cross-modal influences in sensory areas, a recent
fMRI study of functional connectivity showed that during processing of simultaneous audiovisual
streams, temporal areas causally influence activity in visual and auditory cortices, rather than the
other way round (Noesselt et al. 2007). Thus, cross-modal boosting of activity in sensory-specific
areas seems to arise because of backprojections from multisensory regions, emphasizing the causal
role of high-order associative areas and consistent with some coupling between attention control
and the sharing of spatial information across sensory modalities (which, depending on the defini-
tion, can be viewed as an index of multisensory integration).
More straightforward approaches can be undertaken to investigate the relationship between
endogenous attention and multisensory integration. Although this still depends on the specific definition of multisensory integration (see above), one may ask whether endogenous attention affects the way signals
in different modalities interact with each other. For example, Talsma and Woldorff (2005) indexed
multisensory integration using a supra-additive criterion on ERP amplitudes (AV > A + V), and
tested whether this was different for stimuli at the endogenously attended versus unattended side
(note that both vision and audition were task-relevant/attended in this experiment). Supra-additive
responses for AV stimuli were found in frontal and centro-medial scalp sites. Critically, this effect
was larger for stimuli at the attended than the unattended side, demonstrating some interplay
between spatial endogenous attention and multisensory integration (see also the study of Talsma et
al. 2007, who manipulated relevant-modality rather than relevant-location).
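The supra-additive criterion used in that study can be written down in a few lines; the Python sketch below uses invented ERP amplitudes and is only meant to show the comparison between attended and unattended locations.

def supra_additivity(erp_av, erp_a, erp_v):
    """Supra-additivity index on response amplitudes: AV - (A + V);
    positive values indicate a supra-additive multisensory effect."""
    return erp_av - (erp_a + erp_v)

# Illustrative mean amplitudes (microvolts) at a fronto-central site (invented numbers).
attended = supra_additivity(erp_av=4.2, erp_a=1.5, erp_v=1.8)     # about 0.9
unattended = supra_additivity(erp_av=3.4, erp_a=1.4, erp_v=1.7)   # about 0.3
print(attended > unattended)  # True: larger supra-additive effect at the attended side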
In a similar vein, we have recently investigated the effect of selective visuospatial endogenous
attention on the processing of audiovisual speech stimuli (Fairhall and Macaluso 2009). Subjects
were presented visually with two “speaking mouths” simultaneously in the left and right visual
fields. A central auditory stream (a speaking voice) was congruent with one of the two visual stimuli (a mouth reading the same passage of a tale) and incongruent with the other (a mouth reading a different passage). In different blocks, subjects were asked to attend either to the congruent or to the
incongruent visual stimulus. In this way, we were able to keep the absolute level of multisensory
information present in the environment constant, testing specifically for the effect of selective spa-
tial attention to congruent or incongruent multisensory stimuli. The results showed that endogenous
visuospatial attention can influence the processing of audiovisual stimuli, with greater activation
for “attend to congruent” than “attend to incongruent” conditions. This interplay between attention and multisensory processing was found to affect brain activity at multiple stages, including higher-level regions in the superior temporal sulcus, the superior colliculus subcortically, and sensory-specific occipital visual cortex (V1 and V2).
Endogenous attention has been found not only to boost multisensory processing, but also in
some cases to reduce responses for attended versus unattended multisensory stimuli. For example,
van Atteveldt and colleagues (2007) presented subjects with letter–sound pairs that were either
congruent or incongruent. Under conditions of passive listening, activity increased in association
cortex for congruent compared to incongruent presentations. However, this effect disappeared as
soon as subjects were asked to perform an active “same/different” judgment with the letters and
sounds. The authors suggested that voluntary top-down attention can overrule bottom-up multi-
sensory interactions (see also Mozolic et al. 2008, on the effect of active attention to one modality
during multisensory stimulation). In another study on audiovisual speech, Miller and D’Esposito
(2005) dissociated patterns of activation related to physical stimulus attributes (synchronous vs.
asynchronous stimuli) and perception (“fused” vs. “unfused” percept). This showed that active per-
ception leads to increases in activity in the auditory cortex and the superior temporal sulcus for
fused audiovisual stimuli, whereas in the SC activity decreased for synchronous vs. asynchronous
stimuli, irrespective of perception. These results indicate that constraints of multisensory integra-
tion may change as a function of endogenous factors (fused/unfused percept), for example, with
synchronous audiovisual stimuli reducing rather than increasing activity in the SC (cf. Miller and
D’Esposito 2005 and Meredith et al. 1987).
Another approach to investigate the relationship between endogenous attention and multisensory
integration is to manipulate the attentional load of a primary task and to assess how this influences
multisensory processing. The underlying idea is that if a single/common pool of neural resources
mediates both processes, increasing the amount of resources spent on a primary attentional task
should lead to some changes in the processing of the multisensory stimuli. On the contrary, if multi-
sensory integration does not depend on endogenous attention, changes in the attentional task should
not have any influence on multisensory processing. We used this approach to investigate the possible
role of endogenous visuospatial attention for the integration of visuo-tactile stimuli (Zimmer and
Macaluso 2007). We indexed multisensory integration by comparing same-side versus opposite-side visual–tactile stimuli and assessing activity enhancement in the contralateral occipital cortex for the
same-side condition (cf. Figure 25.3a). These visual and tactile stimuli were fully task-irrelevant
and did not require any response. Concurrently, we asked subjects to perform a primary endogenous
visuospatial attention task. This entailed either attending to central fixation (low load) or sustaining
visuospatial covert attention to a location above fixation to detect subtle orientation changes in a
grating patch (high load; see Figure 25.3b). The results showed cross-modal enhancements in the
contralateral visual cortex for spatially congruent trials, irrespective of the level of endogenous load
(see signal plots in Figure 25.3b). These findings suggest that the processing of visuo-tactile spatial
congruence in visual cortex can be uncoupled from endogenous visuospatial attention control (see
also Mathiak et al. 2005, for a magnetoencephalography study reporting related findings in auditory
cortex).
In summary, direct investigation of the possible relationship between attention control and mul-
tisensory integration revealed that voluntary attention to multisensory stimuli or changing the task
relevance of the unisensory components of a multisensory stimulus (attend to one modality, to both,
or to neither) can affect multisensory interactions. This indicates that—to some extent—attention
control and multisensory integration make use of a shared pool of processing resources. However,
when both components of a multisensory stimulus are fully task-irrelevant, changes in cognitive load in a separate task do not affect the integration of the multisensory input (at least for the load manipulations reported by Zimmer and Macaluso 2007; Mathiak et al. 2005).
Taken together, these findings suggest that multisensory interactions can occur at multiple levels
of processing, and that different constraints apply depending on the relative weighting of stimulus-
driven and endogenous attentional requirements. This multifaceted scenario can be addressed in
the context of models of spatial attention control that include multiple routes for the interaction of
signals in different modalities (see Figure 25.1b). It can be hypothesized that some of these path-
ways (or network’s nodes) are under the modulatory influence of endogenous and/or stimulus-driven
attention. For instance, cross-modal interactions that involve dorsal FP areas are likely to be subject
to endogenous and task-related attentional factors (e.g., see Macaluso et al. 2002b). Conversely,
stimulus-driven factors may influence multisensory interactions that take place within or via the
ventral FP system (e.g., Santangelo et al. 2009). Direct connections between sensory-specific areas
should be—at least in principle—fast, automatic, and preattentive (Kayser et al. 2005), although
attentional influences may then superimpose on these (e.g., see Talsma et al. 2007). Some interplay
between spatial attention and multisensory processing can take place also in subcortical areas, as
demonstrated by attentional modulation there (Fairhall and Macaluso 2009; see also Wallace and Stein 1994;
Wilkinson et al. 1996, for the role of cortical input on multisensory processing in the SC).

25.6  CONCLUSIONS
Functional imaging studies of multisensory spatial attention revealed a complex interplay between
effects associated with the external stimulus configuration (e.g., spatially congruent vs. incon-
gruent multisensory input) and endogenous task requirements. Here, I propose that these can be
addressed in the context of “site-source” models of attention that include control regions in dorsal and ventral fronto-parietal associative cortex, connected via feedforward and feedback projections with sensory-specific areas (plus subcortical regions). This architecture permits the sharing of spatial information across multiple brain regions that represent space (unisensory, multisensory, plus motor representations). Spatial attention and the selection of the currently relevant location result from the dynamic
interplay between the nodes of this network, with both stimulus-driven and endogenous factors
influencing the relative contribution of each node and pathway. I propose that the coordination of
activity within this complex network underlies the integration of space across modalities, produc-
ing a sensory–motor system that allows us to perceive and act within a unified representation of
external space.
In this framework, future studies may seek to better specify the dynamics of this network. A
key issue concerns possible causal links between activation of some parts of the network and atten-
tion/integration effects in other parts of the network. This relationship is indeed a main feature of the “site-source” distinction emphasized in this model. It can be addressed in several ways. Transcranial magnetic stimulation (TMS) can be used to transiently knock out one node of the network during multisensory attention tasks, revealing the precise timing of activation of each network node. Using this approach, Chambers and colleagues (2004a) identified two critical windows
for the activation of inferior parietal cortex during visuospatial reorienting, and demonstrated the
involvement of the same region (the angular gyrus) for stimulus-driven visuo-tactile spatial inter-
actions (Chambers et al. 2007; but see also Chambers et al. 2004b, for modality-specific effects).
TMS was also used to demonstrate the central role of posterior parietal cortex for spatial remapping
between vision and touch (Bolognini and Maravita 2007) and to infer direct influences of auditory
input on human visual cortex (Romei et al. 2007). Most recently, TMS has been combined with fMRI, which makes it possible to investigate the causal influence of one area (e.g., frontal or parietal regions) on activity in other areas (e.g., sensory-specific visual areas; see Ruff et al. 2006; and Bestmann et
al. 2008, for review). These studies may be extended to multisensory attention paradigms, looking
for the coupling between fronto-parietal attention control regions and sensory areas as a function
of the type of input (unisensory or multisensory, spatially congruent or incongruent). Task-related
changes in functional coupling between brain areas can also be assessed using analyses of effective
connectivity (e.g., dynamic causal modeling; Stephan et al. 2007). These have been successfully
applied to both fMRI and ERP data in multisensory experiments, showing causal influences of
associative areas in parietal and temporal cortex on sensory processing in the visual cortex (Moran
et al. 2008; Noesselt et al. 2007; Kreifelts et al. 2007). Future studies may combine attentional
manipulations (e.g., the direction of endogenous attention) and multisensory stimuli (e.g., spatially
congruent vs. incongruent multisensory input), providing additional information on the causal role
of top-down and bottom-up influences for the formation of an integrated system that represents
space across sensory modalities.

REFERENCES
Alsius, A., J. Navarra, R. Campbell, and S. Soto-Faraco. 2005. Audiovisual integration of speech falters under
high attention demands. Curr Biol 15: 839–843.
Andersen, R. A., L. H. Snyder, D. C. Bradley, and J. Xing. 1997. Multimodal representation of space in the
posterior parietal cortex and its use in planning movements. Annu Rev Neurosci 20: 303–330.
Arrington, C. M., T. H. Carr, A. R. Mayer, and S. M. Rao. 2000. Neural mechanisms of visual attention: Object-
based selection of a region in space. J Cogn Neurosci 2: 106–117.
Barraclough, N. E., D. Xiao, C. I. Baker, M. W. Oram, and D. I. Perrett. 2005. Integration of visual and audi-
tory information by superior temporal sulcus neurons responsive to the sight of actions. J Cogn Neurosci
17: 377–391.
Beauchamp, M. S. 2005. Statistical criteria in FMRI studies of multisensory integration. Neuroinformatics 3:
93–113.
Beauchamp, M. S., B. D. Argall, J. Bodurka, J. H. Duyn, and A. Martin. 2004. Unraveling multisensory integra-
tion: Patchy organization within human STS multisensory cortex. Nat Neurosci 7: 1190–1192.
Ben Hamed, S., J. R. Duhamel, F. Bremmer, and W. Graf. 2001. Representation of the visual field in the lat-
eral intraparietal area of macaque monkeys: A quantitative receptive field analysis. Exp Brain Res 140:
127–144.
Ben Hamed, S., and J. R. Duhamel. 2002. Ocular fixation and visual activity in the monkey lateral intraparietal
area. Exp Brain Res 142: 512–528.
Bertelson, P., J. Vroomen, B. de Gelder, and J. Driver. 2000. The ventriloquist effect does not depend on the
direction of deliberate visual attention. Percept Psychophys 62: 321–332.
Bestmann, S., C. C. Ruff, F. Blankenburg, N. Weiskopf, J. Driver, and J. C. Rothwell. 2008. Mapping causal
interregional influences with concurrent TMS-fMRI. Exp Brain Res 191: 383–402.
Bolognini, N., and A. Maravita. 2007. Proprioceptive alignment of visual and somatosensory maps in the pos-
terior parietal cortex. Curr Biol 17: 1890–1895.
Boussaoud, D., C. Jouffrais, and F. Bremmer. 1998. Eye position effects on the neuronal activity of dorsal
premotor cortex in the macaque monkey. J Neurophysiol 80: 1132–1150.
Bremmer, F., W. Graf, S. Ben Hamed, and J. R. Duhamel. 1999. Eye position encoding in the macaque ventral
intraparietal area (VIP). Neuroreport 10: 873–878.
Bremmer, F., A. Schlack, N. J. Shah, O. Zafiris, M. Kubischik, K. Hoffmann et al. 2001. Polymodal motion
processing in posterior parietal and premotor cortex: A human fMRI study strongly implies equivalencies
between humans and monkeys. Neuron 29: 287–296.
Bressler, S. L., W. Tang, C. M. Sylvester, G. L. Shulman, and M. Corbetta. 2008. Top-down control of human
visual cortex by frontal and parietal cortex in anticipatory visual spatial attention. J Neurosci 28:
10056–10061.
Bruce, C., R. Desimone, and C. G. Gross. 1981. Visual properties of neurons in a polysensory area in superior
temporal sulcus of the macaque. J Neurophysiol 46: 369–384.
Bushara, K. O., T. Hanakawa, I. Immisch, K. Toma, K. Kansaku, and M. Hallett. 2003. Neural correlates of
cross-modal binding. Nat Neurosci 6:190–195.
Busse, L., K. C. Roberts, R. E. Crist, D. H. Weissman, and M. G. Woldorff. 2005. The spread of attention across
modalities and space in a multisensory object. Proc Natl Acad Sci USA 102: 18751–18756.
Calvert, G. A., P. C. Hansen, S. D. Iversen, and M. J. Brammer. 2001. Detection of audio-visual integration sites
in humans by application of electrophysiological criteria to the BOLD effect. Neuroimage 14: 427–438.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. Eur J Neurosci 22: 2886–2902.
Cappe, C., A. Morel, P. Barone, and E. M. Rouiller. 2009. The thalamocortical projection systems in primate:
an anatomical support for multisensory and sensorimotor interplay. Cereb Cortex 19: 2025–2037.
Chambers, C. D., J. M. Payne, and J. B. Mattingley. 2007. Parietal disruption impairs reflexive spatial attention
within and between sensory modalities. Neuropsychologia 45: 1715–1724.
Chambers, C. D., J. M. Payne, M. G. Stokes, and J. B. Mattingley. 2004a. Fast and slow parietal pathways
mediate spatial attention. Nat Neurosci 7: 217–218.
Chambers, C. D., M. G. Stokes, and J. B. Mattingley. 2004b. Modality-specific control of strategic spatial atten-
tion in parietal cortex. Neuron 44: 925–930.
Ciaramitaro, V. M., G. T. Buracas, and G. M. Boynton. 2007. Spatial and cross-modal attention alter responses
to unattended sensory information in early visual and auditory human cortex. J Neurophysiol 98:
2399–2413.
Corbetta, M., J. M. Kincade, J. M. Ollinger, M. P. McAvoy, and G. L. Shulman. 2000. Voluntary orienting is
dissociated from target detection in human posterior parietal cortex. Nat Neurosci 3: 292–297.
Corbetta, M., G. Patel, and G. L. Shulman. 2008. The reorienting system of the human brain: From environ-
ment to theory of mind. Neuron 58: 306–324.
Corbetta, M., and G. L. Shulman. 2002. Control of goal-directed and stimulus-driven attention in the brain. Nat
Rev Neurosci 3: 215–229.
Corbetta, M., A. P. Tansy, C. M. Stanley, S. V. Astafiev, A. Z. Snyder, and G. L. Shulman. 2005. A functional
MRI study of preparatory signals for spatial location and objects. Neuropsychologia 43: 2041–2056.
Deneve, S., and A. Pouget. 2004. Bayesian multisensory integration and cross-modal spatial links. J Physiol
Paris 98: 249–258.
Desimone, R., and J. Duncan. 1995. Neural mechanisms of selective visual attention. Annu Rev Neurosci 18:
193–222.
Dong, W. K., E. H. Chudler, K. Sugiyama, V. J. Roberts, and T. Hayashi. 1994. Somatosensory, multisen-
sory, and task-related neurons in cortical area 7b (PF) of unanesthetized monkeys. J Neurophysiol 72:
542–564.
Downar, J., A. P. Crawley, D. J. Mikulis, and K. D. Davis. 2000. A multimodal cortical network for the detec-
tion of changes in the sensory environment. Nat Neurosci 3: 277–283.
Downar, J., A. P. Crawley, D. J. Mikulis, and K. D. Davis. 2001. The effect of task relevance on the cortical response
to changes in visual and auditory stimuli: An event-related fMRI study. Neuroimage 14: 1256–1267.
Driver, J., and C. Spence. 1998. Attention and the crossmodal construction of space. Trends Cogn Sci 2:
254–262.
Duhamel, J. R., C. L. Colby, and M. E. Goldberg. 1998. Ventral intraparietal area of the macaque: Congruent
visual and somatic response properties. J Neurophysiol 79: 126–136.
Eckert, M. A., N. V. Kamdar, C. E. Chang, C. F. Beckmann, M. D. Greicius, and V. Menon. 2008. A cross-
modal system linking primary auditory and visual cortices: Evidence from intrinsic fMRI connectivity
analysis. Hum Brain Mapp 29: 848–857.
Eimer, M. 1999. Can attention be directed to opposite locations in different modalities? An ERP study. Clin
Neurophysiol 110: 1252–1259.
Eimer, M., and J. Driver. 2000. An event-related brain potential study of cross-modal links in spatial attention
between vision and touch. Psychophysiology 37: 697–705.
Eimer, M., and J. Driver. 2001. Crossmodal links in endogenous and exogenous spatial attention: Evidence
from event-related brain potential studies. Neurosci Biobehav Rev 25: 497–511.
Eimer, M., and J. van Velzen. 2002. Crossmodal links in spatial attention are mediated by supramodal control
processes: Evidence from event-related potentials. Psychophysiology 39: 437–449.
Eimer, M., J. van Velzen, and J. Driver. 2002. Cross-modal interactions between audition, touch, and vision
in endogenous spatial attention: ERP evidence on preparatory states and sensory modulations. J Cogn
Neurosci 14: 254–271.
Fairhall, S. L., and E. Macaluso. 2009. Spatial attention can modulate audiovisual integration at multiple corti-
cal and subcortical sites. Eur J Neurosci 29: 1247–1257.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2002. Anatomical evidence of multimodal integration
in primate striate cortex. J Neurosci 22: 5749–5759.
Farah, M. J., A. B. Wong, M. A. Monheit, and L. A. Morrow. 1989. Parietal lobe mechanisms of spatial atten-
tion: Modality-specific or supramodal? Neuropsychologia 27: 461–470.
Fasold, O., J. Heinau, M. U. Trenner, A. Villringer, and R. Wenzel. 2008. Proprioceptive head posture-related
processing in human polysensory cortical areas. Neuroimage 40: 1232–1242.
Frens, M. A., and A. J. Van Opstal. 1998. Visual–auditory interactions modulate saccade-related activity in
monkey superior colliculus. Brain Res Bull 46: 211–224.
Giard, M. H., and F. Peronnet. 1999. Auditory–visual integration during multimodal object recognition in
humans: A behavioral and electrophysiological study. J Cogn Neurosci 11: 473–490.
Graziano, M. S., and C. G. Gross. 1993. A bimodal map of space: Somatosensory receptive fields in the macaque
putamen with corresponding visual receptive fields. Exp Brain Res 97: 96–109.
Graziano, M. S., and C. G. Gross. 1995. The representation of extrapersonal space: A possible role for bimodal,
visuo-tactile neurons. In The Cognitive Neurosciences, ed. M. S. Gazzaniga, 1021–1034. Cambridge,
MA: MIT Press.
Graziano, M. S., X. T. Hu, and C. G. Gross. 1997. Visuospatial properties of ventral premotor cortex. J
Neurophysiol 77: 2268–2292.
Green, J. J., and J. J. McDonald. 2008. Electrical neuroimaging reveals timing of attentional control activity in
human brain. PLoS Biol 6: 81.
Green, J. J., W. A. Teder-Salejarvi, and J. J. McDonald. 2005. Control mechanisms mediating shifts of attention
in auditory and visual space: A spatio-temporal ERP analysis. Exp Brain Res 166: 358–369.
Gross, C. G., and M. S. Graziano. 1995. Multiple representations of space in the brain. The Neuroscientist 1: 43–50.
Grossberg, S., K. Roberts, M. Aguilar, and D. Bullock. 1997. A neural model of multimodal adaptive saccadic
eye movement control by superior colliculus. J Neurosci 17: 9706–9725.
Hagler Jr., D. J., and M. I. Sereno. 2006. Spatial maps in frontal and prefrontal cortex. Neuroimage 29:
567–577.
He, B. J., A. Z. Snyder, J. L. Vincent, A. Epstein, G. L. Shulman, and M. Corbetta. 2007. Breakdown of func-
tional connectivity in frontoparietal networks underlies behavioral deficits in spatial neglect. Neuron 53:
905–918.
Heinze, H. J., G. R. Mangun, W. Burchert, H. Hinrichs, M. Scholz, T. F. Munte et al. 1994. Combined spatial and temporal imaging of brain activity during visual selective attention in humans. Nature 372: 543–546.
Holmes, N. P. 2009. The principle of inverse effectiveness in multisensory integration: Some statistical consid-
erations. Brain Topogr 21: 168–176.
Hopfinger, J. B., M. H. Buonocore, and G. R. Mangun. 2000. The neural mechanisms of top-down attentional
control. Nat Neurosci 3: 284–291.
Hotting, K., F. Rosler, and B. Roder. 2003. Crossmodal and intermodal attention modulate event-related brain
potentials to tactile and auditory stimuli. Exp Brain Res 148: 26–37.
Hyvarinen, J. 1981. Regional distribution of functions in parietal association area 7 of the monkey. Brain Res
206: 287–303.
Indovina, I., and E. Macaluso. 2004. Occipital–parietal interactions during shifts of exogenous visuospatial
attention: Trial-dependent changes of effective connectivity. Magn Reson Imaging 22: 1477–1486.
Indovina, I., and E. Macaluso. 2007. Dissociation of stimulus relevance and saliency factors during shifts of
visuospatial attention. Cereb Cortex 17: 1701–1711.
Kalaska, J. F., S. H. Scott, P. Cisek, and L. E. Sergio. 1997. Cortical control of reaching movements. Curr Opin
Neurobiol 7: 849–859.
Kastner, S., M. A. Pinsk, P. De Weerd, R. Desimone, and L. G. Ungerleider. 1999. Increased activity in human
visual cortex during directed attention in the absence of visual stimulation. Neuron 22: 751–761.
Kastner, S., and L. G. Ungerleider. 2001. The neural basis of biased competition in human visual cortex.
Neuropsychologia 39: 1263–1276.
Kayser, C., C. I. Petkov, M. Augath, and N. K. Logothetis. 2005. Integration of touch and sound in auditory
cortex. Neuron 48: 373–384.
Kelley, T. A., J. T. Serences, B. Giesbrecht, and S. Yantis. 2008. Cortical mechanisms for shifting and holding
visuospatial attention. Cereb Cortex 18: 114–125.
Kennett, S., M. Eimer, C. Spence, and J. Driver. 2001. Tactile–visual links in exogenous spatial attention under
different postures: Convergent evidence from psychophysics and ERPs. J Cogn Neurosci 13: 462–478.
Kida, T., K. Inui, T. Wasaka, K. Akatsuka, E. Tanaka, and R. Kakigi. 2007. Time-varying cortical activa-
tions related to visual–tactile cross-modal links in spatial selective attention. J Neurophysiol 97:
3585–3596.
Kincade, J. M., R. A. Abrams, S. V. Astafiev, G. L. Shulman, and M. Corbetta. 2005. An event-related functional
magnetic resonance imaging study of voluntary and stimulus-driven orienting of attention. J Neurosci
25: 4593–4604.
Kinsbourne, M. 1970. The cerebral basis of lateral asymmetries in attention. Acta Psychol (Amst) 33:
193–201.
Kreifelts, B., T. Ethofer, W. Grodd, M. Erb, and D. Wildgruber. 2007. Audiovisual integration of emotional
signals in voice and face: An event-related fMRI study. Neuroimage 37: 1445–1456.
Laurienti, P. J., J. H. Burdette, M. T. Wallace, Y. F. Yen, A. S. Field, and B. E. Stein. 2002. Deactivation of
sensory-specific cortex by cross-modal stimuli. J Cogn Neurosci 14: 420–429.
Laurienti, P. J., T. J. Perrault, T. R. Stanford, M. T. Wallace, and B. E. Stein. 2005. On the use of superadditiv-
ity as a metric for characterizing multisensory integration in functional neuroimaging studies. Exp Brain
Res 166: 289–297.
Leo, F., C. Bertini, G. di Pellegrino, and E. Ladavas. 2008. Multisensory integration for orienting responses in
humans requires the activation of the superior colliculus. Exp Brain Res 186: 67–77.
Lewis, J. W., M. S. Beauchamp, and E. A. DeYoe. 2000. A comparison of visual and auditory motion process-
ing in human cerebral cortex. Cereb Cortex 10: 873–888.
Luck, S. J., L. Chelazzi, S. A. Hillyard, and R. Desimone. 1997. Neural mechanisms of spatial selective atten-
tion in areas V1, V2, and V4 of macaque visual cortex. J Neurophysiol 77: 24–42.
Macaluso, E., and J. Driver. 2001. Spatial attention and crossmodal interactions between vision and touch.
Neuropsychologia 39: 1304–1316.
Macaluso, E., and J. Driver. 2005. Multisensory spatial interactions: A window onto functional integration in
the human brain. Trends Neurosci 28: 264–271.
Macaluso, E., J. Driver, and C. D. Frith. 2003a. Multimodal spatial representations engaged in human parietal
cortex during both saccadic and manual spatial orienting. Curr Biol 13: 990–999.
Macaluso, E., M. Eimer, C. D. Frith, and J. Driver. 2003b. Preparatory states in crossmodal spatial attention:
Spatial specificity and possible control mechanisms. Exp Brain Res 149: 62–74.
Macaluso, E., C. Frith, and J. Driver. 2000a. Selective spatial attention in vision and touch: Unimodal and
multimodal mechanisms revealed by PET. J Neurophysiol 83: 3062–3075.
Macaluso, E., C. D. Frith, and J. Driver. 2005. Multisensory stimulation with or without saccades: fMRI evi-
dence for crossmodal effects on sensory-specific cortices that reflect multisensory location-congruence
rather than task-relevance. Neuroimage 26: 414–425.
Macaluso, E., C. D. Frith, and J. Driver. 2001. Multisensory integration and crossmodal attention effects in the
human brain. Science [Technical response] 292: 1791.
Macaluso, E., C. D. Frith, and J. Driver. 2002a. Crossmodal spatial influences of touch on extrastriate visual
areas take current gaze direction into account. Neuron 34: 647–658.
Macaluso, E., C. D. Frith, and J. Driver. 2002b. Directing attention to locations and to sensory modalities:
Multiple levels of selective processing revealed with PET. Cereb Cortex 12: 357–368.
Macaluso, E., C. D. Frith, and J. Driver. 2002c. Supramodal effects of covert spatial orienting triggered by
visual or tactile events. J Cogn Neurosci 14: 389–401.
Macaluso, E., C. D. Frith, and J. Driver. 2000b. Modulation of human visual cortex by crossmodal spatial atten-
tion. Science 289: 1206–1208.
Maravita, A., N. Bolognini, E. Bricolo, C. A. Marzi, and S. Savazzi. 2008. Is audiovisual integration subserved
by the superior colliculus in humans? Neuroreport 19: 271–275.
Martinez, A., L. Anllo-Vento, M. I. Sereno, L. R. Frank, R. B. Buxton, D. J. Dubowitz et al. 1999. Involvement
of striate and extrastriate visual cortical areas in spatial attention. Nat Neurosci 2: 364–369.
Massaro, D. W. 1999. Speechreading: Illusion or window into pattern recognition. Trends Cogn Sci 3:
310–317.
Mathiak, K., I. Hertrich, M. Zvyagintsev, W. Lutzenberger, and H. Ackermann. 2005. Selective influences of
cross-modal spatial-cues on preattentive auditory processing: A whole-head magnetoencephalography
study. Neuroimage 28: 627–634.
Mayer, A. R., A. R. Franco, and D. L. Harrington. 2009. Neuronal modulation of auditory attention by informa-
tive and uninformative spatial cues. Hum Brain Mapp 30: 1652–1666.
Mayer, A. R., D. Harrington, J. C. Adair, and R. Lee. 2006. The neural networks underlying endogenous audi-
tory covert orienting and reorienting. Neuroimage 30: 938–949.
McDonald, J. J., W. A. Teder-Salejarvi, F. Di Russo, and S. A. Hillyard. 2003. Neural substrates of perceptual
enhancement by cross-modal spatial attention. J Cogn Neurosci 15: 10–19.
McDonald, J. J., W. A. Teder-Salejarvi, and S. A. Hillyard. 2000. Involuntary orienting to sound improves
visual perception. Nature 407: 906–908.
McDonald, J. J., W. A. Teder-Salejarvi, and L. M. Ward. 2001. Multisensory integration and crossmodal atten-
tion effects in the human brain. Science 292: 1791.
McDonald, J. J., and L. M. Ward. 2000. Involuntary listening aids seeing: Evidence from human electrophysiol-
ogy. Psychol Sci 11: 167–171.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264: 746–748.
Meredith, M. A., J. W. Nemitz, and B. E. Stein. 1987. Determinants of multisensory integration in superior
colliculus neurons: I. Temporal factors. J Neurosci 7: 3215–3229.
Meredith, M. A., and B. E. Stein. 1996. Spatial determinants of multisensory integration in cat superior col-
liculus neurons. J Neurophysiol 75: 1843–1857.
Meredith, M. A., and B. E. Stein. 1986a. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. J Neurophysiol 56: 640–662.
Meredith, M. A., and B. E. Stein. 1986b. Spatial factors determine the activity of multisensory neurons in cat
superior colliculus. Brain Res 365: 350–354.
Meredith, M. A., and B. E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221: 389–391.
Meyer, M., S. Baumann, S. Marchina, and L. Jancke. 2007. Hemodynamic responses in human multisensory
and auditory association cortex to purely visual stimulation. BMC Neurosci 8: 14.
Meylan, R. V., and M. M. Murray. 2007. Auditory–visual multisensory interactions attenuate subsequent visual
responses in humans. Neuroimage 35: 244–254.
Miller, J. 1982. Discrete versus continuous stage models of human information processing: In search of partial output. J Exp Psychol Hum Percept Perform 8: 273–296.
Miller, L. M., and M. D’Esposito. 2005. Perceptual fusion and stimulus coincidence in the cross-modal integra-
tion of speech. J Neurosci 25: 5884–5893.
Moore, T. 2006. The neurobiology of visual attention: Finding sources. Curr Opin Neurobiol 16: 159–165.
Moran, R. J., S. Molholm, R. B. Reilly, and J. J. Foxe. 2008. Changes in effective connectivity of human supe-
rior parietal lobule under multisensory and unisensory stimulation. Eur J Neurosci 27: 2303–2312.
Mozolic, J. L., D. Joyner, C. E. Hugenschmidt, A. M. Peiffer, R. A. Kraft, J. A. Maldjian et al. 2008. Cross-
modal deactivations during modality-specific selective attention. BMC Neurol 8: 35.
Nagy, A., G. Eordegh, Z. Paroczy, Z. Markus, and G. Benedek. 2006. Multisensory integration in the basal
ganglia. Eur J Neurosci 24: 917–924.
Natale, E., C. A. Marzi, and E. Macaluso. 2009. FMRI correlates of visuo-spatial reorienting investigated with
an attention shifting double-cue paradigm. Hum Brain Mapp 30: 2367–2381.
Nobre, A. C., J. T. Coull, C. D. Frith, and M. M. Mesulam. 1999. Orbitofrontal cortex is activated during
breaches of expectation in tasks of visual attention. Nat Neurosci 2: 11–12.
Noesselt, T., J. W. Rieger, M. A. Schoenfeld, M. Kanowski, H. Hinrichs, H. J. Heinze et al. 2007. Audiovisual
temporal correspondence modulates human multisensory superior temporal sulcus plus primary sensory
cortices. J Neurosci 27: 11431–11441.
Pessoa, L., S. Kastner, and L. G. Ungerleider. 2003. Neuroimaging studies of attention: From modulation of
sensory processing to top-down control. J Neurosci 23: 3990–3998.
Posner, M. I. 1980. Orienting of attention. Q J Exp Psychol 32: 3–25.


Posner, M. I., J. A. Walker, F. J. Friedrich, and R. D. Rafal. 1984. Effects of parietal injury on covert orienting
of attention. J Neurosci 4: 1863–1874.
Pouget, A., S. Deneve, and J. R. Duhamel. 2002. A computational perspective on the neural basis of multisen-
sory spatial representations. Nat Rev Neurosci 3: 741–747.
Rockland, K. S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
Int J Psychophysiol 50: 19–26.
Romei, V., M. M. Murray, L. B. Merabet, and G. Thut. 2007. Occipital transcranial magnetic stimulation has
opposing effects on visual and auditory stimulus detection: Implications for multisensory interactions.
J Neurosci 27: 11465–11472.
Ruff, C. C., F. Blankenburg, O. Bjoertomt, S. Bestmann, E. Freeman, J. D. Haynes et al. 2006. Concurrent
TMS-fMRI and psychophysics reveal frontal influences on human retinotopic visual cortex. Curr Biol
16: 1479–1488.
Saito, D. N., K. Yoshimura, T. Kochiyama, T. Okada, M. Honda, and N. Sadato. 2005. Cross-modal binding
and activated attentional networks during audio-visual speech integration: A functional MRI study. Cereb
Cortex 15: 1750–1760.
Santangelo, V., M. O. Belardinelli, C. Spence, and E. Macaluso. 2009. Interactions between voluntary and stimulus-
­driven spatial attention mechanisms across sensory modalities. J Cogn Neurosci 21: 2384–2397.
Saygin, A. P., and M. I. Sereno. 2008. Retinotopy and attention in human occipital, temporal, parietal, and
frontal cortex. Cereb Cortex 18: 2158–2168.
Sereno, M. I., and R. S. Huang. 2006. A human parietal face area contains aligned head-centered visual and
tactile maps. Nat Neurosci 9: 1337–1343.
Sereno, M. I., S. Pitzalis, and A. Martinez. 2001. Mapping of contralateral space in retinotopic coordinates by
a parietal cortical area in humans. Science 294: 1350–1354.
Shipp, S. 2004. The brain circuitry of attention. Trends Cogn Sci 8: 223–230.
Shomstein, S., and S. Yantis. 2006. Parietal cortex mediates voluntary control of spatial and nonspatial auditory
attention. J Neurosci 26: 435–439.
Sommer, M. A., and R. H. Wurtz. 2000. Composition and topographic organization of signals sent from the
frontal eye field to the superior colliculus. J Neurophysiol 83: 1979–2001.
Soto-Faraco, S., and A. Alsius. 2009. Deconstructing the McGurk–MacDonald illusion. J Exp Psychol Hum Percept Perform 35: 580–587.
Spence, C., and J. Driver. 1996. Audiovisual links in endogenous covert spatial attention. J Exp Psychol Hum
Percept Perform 22: 1005–1030.
Spence, C., J. J. McDonald, and J. Driver. 2004. Exogenous spatial-cuing studies of human cross-modal atten-
tion and multisensory integration. In: Crossmodal space and crossmodal attention, ed. C. Spence and
J. Driver, 277–320. Oxford: Oxford Univ. Press.
Spence, C., M. E. Nicholls, N. Gillespie, and J. Driver. 1998. Cross-modal links in exogenous covert spatial
orienting between touch, audition, and vision. Percept Psychophys 60: 544–557.
Stein, B. E., and M. A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Stein, B. E., and T. R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the
single neuron. Nat Rev Neurosci 9: 255–266.
Stephan, K. E., L. M. Harrison, S. J. Kiebel, O. David, W. D. Penny, and K. J. Friston. 2007. Dynamic causal
models of neural system dynamics: Current state and future extensions. J Biosci 32: 129–144.
Tajadura-Jiménez, A., N. Kitagawa, A. Väljamäe, M. Zampini, M. M. Murray, and C. Spence. 2009. Auditory–
somatosensory multisensory interactions are spatially modulated by stimulated body surface and acous-
tic spectra. Neuropsychologia 47: 195–203.
Talsma, D., T. J. Doty, and M. G. Woldorff. 2007. Selective attention and audiovisual integration: Is attending
to both modalities a prerequisite for early integration? Cereb Cortex 17: 679–690.
Talsma, D., and M. G. Woldorff. 2005. Selective attention and multisensory integration: Multiple phases of
effects on the evoked brain activity. J Cogn Neurosci 17: 1098–1114.
Teder-Sälejärvi, W. A., F. Di Russo, J. J. McDonald, and S. A. Hillyard. 2005. Effects of spatial congruity on
audio-visual multimodal integration. J Cogn Neurosci 17: 1396–1409.
Teder-Sälejärvi, W. A., T. F. Münte, F. Sperlich, and S. A. Hillyard. 1999. Intra-modal and cross-modal spa-
tial attention to auditory and visual stimuli. An event-related brain potential study. Cogn Brain Res 8:
327–343.
Tootell, R. B., M. S. Silverman, E. Switkes, and R. L. De Valois. 1982. Deoxyglucose analysis of retinotopic
organization in primate striate cortex. Science 218: 902–904.
Treisman, A. M., and G. Gelade. 1980. A feature-integration theory of attention. Cogn Psychol 12: 97–136.
Trenner, M. U., H. R. Heekeren, M. Bauer, K. Rossner, R. Wenzel, A. Villringer et al. 2008. What happens in
between? Human oscillatory brain activity related to crossmodal spatial cueing. PLoS ONE 3: e1467.
van Atteveldt, N. M., E. Formisano, R. Goebel, and L. Blomert. 2007. Top-down task effects overrule auto-
matic multisensory responses to letter–sound pairs in auditory association cortex. Neuroimage 36:
1345–1360.
Vandenberghe, R., D. R. Gitelman, T. B. Parrish, and M. M. Mesulam. 2001. Functional specificity of superior
parietal mediation of spatial shifting. Neuroimage 14: 661–673.
Vossel, S., C. M. Thiel, and G. R. Fink. 2006. Cue validity modulates the neural correlates of covert endog-
enous orienting of attention in parietal and frontal cortex. Neuroimage 32: 1257–1264.
Wallace, M. T., J. G. McHaffie, and B. E. Stein. 1997. Visual response properties and visuotopic representation
in the newborn monkey superior colliculus. J Neurophysiol 78: 2732–2741.
Wallace, M. T., and B. E. Stein, 1994. Cross-modal synthesis in the midbrain depends on input from cortex.
J Neurophysiol 71: 429–432.
Wilkinson, L. K., M. A. Meredith, and B. E. Stein. 1996. The role of anterior ectosylvian cortex in cross-
modality orientation and approach behavior. Exp Brain Res 112: 1–10.
Wu, C. T., D. H. Weissman, K. C. Roberts, and M. G. Woldorff. 2007. The neural circuitry underlying the
executive control of auditory spatial attention. Brain Res 1134: 187–198.
Yantis, S., J. Schwarzbach, J. T. Serences, R. L. Carlson, M. A. Steinmetz, J. J. Pekar et al. 2002. Transient
neural activity in human parietal cortex during spatial attention shifts. Nat Neurosci 5: 995–1002.
Zimmer, U., and E. Macaluso. 2007. Processing of multisensory spatial congruency can be dissociated from
working memory and visuo-spatial attention. Eur J Neurosci 26: 1681–1691.
26  Cross-Modal Spatial Cueing of Attention Influences Visual Perception
John J. McDonald, Jessica J. Green, Viola S. Störmer, and Steven A. Hillyard

CONTENTS
26.1 Spatial Attention: Modality-Specific or Supramodal?..........................................................509
26.2 Involuntary Cross-Modal Spatial Attention Enhances Perceptual Sensitivity...................... 511
26.3 Involuntary Cross-Modal Spatial Attention Modulates Time-Order Perception.................. 512
26.4 Beyond Temporal Order: The Simultaneity Judgment Task................................................. 516
26.5 Involuntary Cross-Modal Spatial Attention Alters Appearance........................................... 518
26.6 Possible Mechanisms of Cross-Modal Cue Effects............................................................... 520
26.7 Conclusions and Future Directions........................................................................................ 523
References....................................................................................................................................... 523

26.1  SPATIAL ATTENTION: MODALITY-SPECIFIC OR SUPRAMODAL?


It has long been known that “looking out of the corner of one’s eye” can influence the processing
of objects in the visual field. One of the first experimental demonstrations of this effect came from
Hermann von Helmholtz, who, at the end of the nineteenth century, demonstrated that he could
identify letters in a small region of a briefly illuminated display if he directed his attention covertly
(i.e., without moving his eyes) toward that region in advance (Helmholtz 1866). Psychologists began
to study this effect systematically in the 1970s using the spatial-cueing paradigm (Eriksen and
Hoffman 1972; Posner 1978). Across a variety of speeded response tasks, orienting attention to a
particular location in space was found to facilitate responses to visual targets that appeared at the
cued location. Benefits in speeded visual performance were observed when attention was oriented
voluntarily (endogenously, in a goal-driven manner) in response to a spatially predictive symbolic
visual cue or involuntarily (exogenously, in a stimulus-driven manner) in response to a spatially
nonpredictive peripheral visual cue such as a flash of light. For many years, the covert orienting of
attention in visual space was seen as a special case, because initial attempts to find similar spatial
cueing effects in the auditory modality did not succeed (e.g., Posner 1978). Likewise, in several
early cross-modal cueing studies, voluntary and involuntary shifts of attention in response to visual
cues were found to have no effect on the detection of subsequent auditory targets (for review, see
Spence and McDonald 2004). Consequently, during the 1970s and 1980s (and to a lesser extent
1990s), the prevailing view was that location-based attentional selection was a modality-specific
and predominantly visual process.
Early neurophysiological and neuropsychological studies painted a different picture about the
modality specificity of spatial attention. On the neurophysiological front, Hillyard and colleagues
(1984) showed that sustaining attention at a predesignated location to the left or right of fixation mod-
ulates the event-related potentials (ERPs) elicited by stimuli in both task-relevant and task-irrelevant
modalities. Visual stimuli presented at the attended location elicited an enlarged negative ERP
component over the anterior scalp 170 ms after stimulus onset, both when visual stimuli were rel-
evant and when they were irrelevant. Similarly, auditory stimuli presented at the attended location
elicited an enlarged negativity over the anterior scalp beginning 140 ms after stimulus onset, both
when auditory stimuli were relevant and when they were irrelevant. Follow-up studies confirmed
that spatial attention influences ERP components elicited by stimuli in an irrelevant modality when
attention is sustained at a prespecified location over several minutes (Teder-Sälejärvi et al. 1999)
or is cued on a trial-by-trial basis (Eimer and Schröger 1998). The results from these ERP studies
indicate that spatial attention is not an entirely modality-specific process.
On the neuropsychological front, Farah and colleagues (1989) showed that unilateral damage to
the parietal lobe impairs reaction time (RT) performance in a spatial cueing task involving spatially
nonpredictive auditory cues. Prior visual-cueing studies had shown that patients with damage to the
right parietal lobe were substantially slower to detect visual targets appearing in the left visual field
following a peripheral visual cue to the right visual field (invalid trials) than when attention was
cued to the left (valid trials) or was cued to neither side (neutral trials) (Posner et al. 1982, 1984).
This location-specific RT deficit was attributed to an impairment in the disengagement of attention,
mainly because the patients appeared to have no difficulty in shifting attention to the contralesional
field following a valid cue or neutral cue. In Farah et al.’s study, similar impairments in detecting
contralesional visual targets were observed following either invalid auditory or visual cues pre-
sented to the ipsilesional side. On the basis of these results, Farah and colleagues concluded that
sounds and lights automatically engage the same supramodal spatial attention mechanism.
Given the neurophysiological and neuropsychological evidence in favor of a supramodal (or at
least partially shared) spatial attention mechanism, why did several early behavioral studies appear
to support the modality-specific view of spatial attention? These initial difficulties in showing spa-
tial attention effects outside of the visual modality may be attributed largely to methodological
factors, because some of the experimental designs that had been used successfully to study visual
spatial attention were not ideal for studying auditory spatial attention. In particular, because sounds
can be rapidly detected based on spectrotemporal features that are independent of a sound’s spatial
location, simple detection measures that had shown spatial specificity in visual cueing tasks did not
always work well for studying spatial attention within audition (e.g., Posner 1978). As researchers
began to realize that auditory spatial attention effects might be contingent on the degree to which
sound location is processed (Rhodes 1987), new spatial discrimination tasks were developed to
ensure the use of spatial representations (McDonald and Ward 1999; Spence and Driver 1994). With
these new tasks, researchers were able to document spatial cueing effects using all the various com-
binations of visual, auditory, and tactile cue and target stimuli. As reviewed elsewhere (e.g., Driver
and Spence 2004), voluntary spatial cueing studies had begun to reveal a consistent picture by the
mid 1990s: voluntarily orienting attention to a location facilitated the processing of subsequent tar-
gets regardless of the cue and target modalities.
The picture that emerged from involuntary spatial cueing studies remained less clear because
some of the spatial discrimination tasks that were developed failed to reveal cross-modal cueing
effects (for detailed reviews of methodological issues, see Spence and McDonald 2004; Wright
and Ward 2008). For example, using an elevation-discrimination task, Spence and Driver found an
asymmetry in the involuntary spatial cueing effects between visual and auditory stimuli (Spence and
Driver 1997). In their studies, spatially nonpredictive auditory cues facilitated responses to visual
targets, but spatially nonpredictive visual cues failed to influence responses to auditory targets. For
some time the absence of a visual–auditory cue effect weighed heavily on models of involuntary
spatial attention. In particular, it was taken as evidence against a single supramodal attention system
that mediated involuntary deployments of attention in multisensory space. However, researchers
began to suspect that Spence and Driver’s (1997) missing audiovisual cue effect stemmed from the
large spatial separation between cue and target, which existed even on validly (ipsilaterally) cued
trials, and the different levels of precision with which auditory and visual stimuli can be localized.
Specifically, it was hypothesized that visual cues triggered shifts of attention that were focused
too narrowly around the cued location to affect processing of a distant auditory target (Ward et al.
2000). Data from a recent study confirmed this narrow-focus explanation for the last remaining
“missing link” in cross-modal spatial attention (Prime et al. 2008). Visual cues were found to facili-
tate responses to auditory targets that were presented at the cued location but not auditory targets
that were presented 14° above or below the cued location (see also McDonald et al. 2001).
The bulk of the evidence to date indicates that orienting attention involuntarily or voluntarily
to a specific location in space can facilitate responding to subsequent targets, regardless of the
modality of the cue and target stimuli. In principle, such cross-modal cue effects might reflect the
consequences of a supramodal attention-control system that alters the perceptual representations of
objects in different modalities (Farah et al. 1989). However, the majority of behavioral studies to
date have examined the effects of spatial cues on RT performance, which is at best a very indirect
measure of perceptual experience (Luce 1986; Watt 1991). Indeed, measures of response speed are
inherently ambiguous in that RTs reflect the cumulative output of multiple stages of processing,
including low-level sensory and intermediate perceptual stages, as well as later stages involved
in making decisions and executing actions. In theory, spatial cueing could influence processing at
any one of these stages. There is some evidence that the appearance of a spatial cue can alter an
observer’s willingness to respond and reduce the uncertainty of his or her decisions without affect-
ing perception (Shiu and Pashler 1994; Sperling and Dosher 1986). Other evidence suggests that
whereas voluntary shifts of attention can affect perceptual processing, involuntary shifts of atten-
tion may not (Prinzmetal et al. 2005).
In this chapter, we review studies that have extended the RT-based chronometric investigation
of cross-modal spatial attention by utilizing psychophysical measures that better isolate perceptual-
level processes. In addition, neurophysiological and neuroimaging methods have been combined
with these psychophysical approaches to identify changes in neural activity that might underlie the
cross-modal consequences of spatial attention on perception. These methods have also examined
neural activity within the cue–target interval that might reflect supramodal (or modality specific)
control of spatial attention and subsequent anticipatory biasing of activity within sensory regions
of the cortex.

26.2  INVOLUNTARY CROSS-MODAL SPATIAL ATTENTION ENHANCES PERCEPTUAL SENSITIVITY

The issue of whether attention affects perceptual or post-perceptual processing of external stimuli
has been vigorously debated since the earliest dichotic listening experiments revealed that selective
listening influenced auditory performance (Broadbent 1958; Cherry 1953; Deutsch and Deutsch
1963; Treisman and Geffen 1967). In the context of visual–spatial cueing experiments, the debate
has focused on two general classes of mechanisms by which attention might influence visual per-
formance (see Carrasco 2006; Lu and Dosher 1998; Luck et al. 1994, 1996; Smith and Ratcliff
2009; Prinzmetal et al. 2005). On one hand, attention might lead to a higher signal-to-noise ratio
for stimuli at attended locations by enhancing their perceptual representations. On the other hand,
attention might reduce the decision-level or response-level uncertainty without affecting perceptual
processing. For example, spatial cueing might bias decisions about which location contains relevant
stimulus information (the presumed signal) in favor of the cued location, thereby promoting a strat-
egy to exclude stimulus information arising from uncued locations (the presumed noise; e.g., Shaw
1982, 1984; Shiu and Pashler 1994; Sperling and Dosher 1986). Such noise-reduction explanations
account for the usual cueing effects (e.g., RT costs and benefits) without making assumptions about
limited perceptual capacity.
Several methods have been developed to discourage decision-level mechanisms so that any
observable cue effect can be ascribed more convincingly to attentional selection at perceptual stages
of processing. One such method was used to investigate whether orienting attention involuntarily to
a sudden sound influences perceptual-level processing of subsequent visual targets (McDonald et
al. 2000). The design was adapted from earlier visual-cueing studies that eliminated location uncer-
tainty by presenting a mask at a single location and requiring observers to indicate whether they
saw a target at the masked location (Luck et al. 1994, 1996; see also Smith 2000). The mask serves
a dual purpose in this paradigm: to ensure that the location of the target (if present) is known with
complete certainty and to backwardly mask the target so as to limit the accrual and persistence of
stimulus information at the relevant location. Under such conditions, it is possible to use methods of
signal detection theory to obtain a measure of an observer’s perceptual sensitivity (d′)—the ability
to discern a sensory event from background noise—that is independent of the observer’s decision
strategy (which, in signal detection theory, is characterized by the response criterion, β; see Green
and Swets 1966).
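As a concrete illustration of these measures, the following minimal sketch computes d′ and the (log) likelihood-ratio criterion β from the standard normal transforms of the hit and false-alarm rates; the rates used here are arbitrary values chosen for illustration, not data from the studies discussed in this section.

    # Illustrative signal detection computation with hypothetical hit and
    # false-alarm rates (not data from McDonald et al. 2000).
    from scipy.stats import norm

    def sdt_measures(hit_rate, false_alarm_rate):
        z_hit = norm.ppf(hit_rate)          # z-transform of the hit rate
        z_fa = norm.ppf(false_alarm_rate)   # z-transform of the false-alarm rate
        d_prime = z_hit - z_fa              # separation of signal and noise distributions
        log_beta = (z_fa ** 2 - z_hit ** 2) / 2.0  # log likelihood-ratio criterion
        return d_prime, log_beta

    # A higher hit rate with the same false-alarm rate yields a larger d'.
    print(sdt_measures(0.75, 0.20))  # e.g., validly cued location
    print(sdt_measures(0.65, 0.20))  # e.g., invalidly cued location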
Consistent with a perceptual-level explanation, McDonald and colleagues (2000) found that per-
ceptual sensitivity was higher when the visual target appeared at the location of the auditory cue
than when it appeared on the opposite side of fixation (Figure 26.1a and b). This effect was ascribed
to an involuntary shift of attention to the cued location because the sound provided no information
about the location of the impending target. Also, because there was no uncertainty about the target
location, the effect could not be attributed to a reduction in location uncertainty. Consequently,
the results provided strong evidence that shifting attention involuntarily to the location of a sound
actually improves the perceptual quality of a subsequent visual event appearing at that location (see
also Dufour 1999). An analogous effect on perceptual sensitivity has been reported in the converse
audiovisual combination, when spatially nonpredictive visual cues were used to orient attention
involuntarily before the onset of an 800-Hz target embedded in a white-noise mask (Soto-Faraco et
al. 2002). Together, these results support the view that sounds and lights engage a common supra-
modal spatial attention system, which then modulates perceptual processing of relevant stimuli at
the cued location (Farah et al. 1989).
To investigate the neural processes by which orienting spatial attention to a sudden sound influ-
ences processing of a subsequent visual stimulus, McDonald and colleagues (2003) recorded ERPs
in the signal-detection paradigm outlined above. ERPs to visual stimuli appearing at validly and
invalidly cued locations began to diverge from one another at about 100 ms after stimulus onset,
with the earliest phase of this difference being distributed over the midline central scalp (Figure
26.1c and d). After about 30–40 ms, this ERP difference between validly and invalidly cued visual
stimuli shifted to midline parietal and lateral occipital scalp regions. A dipole source analysis indi-
cated that the initial phase of this difference was generated in or near the multisensory region of the
superior temporal sulcus (STS), whereas the later phase was generated in or near the fusiform gyrus
of the occipital lobe (Figure 26.1e). This pattern of results suggests that enhanced visual perception
produced by the cross-modal orienting of spatial attention may depend on feedback connections
from the multisensory STS to the ventral stream of visual cortical areas. Similar cross-modal cue
effects were observed when participants made speeded responses to the visual targets, but the earli-
est effect was delayed by 100 ms (McDonald and Ward 2000). This is in line with behavioral data
suggesting that attentional selection might take place earlier when target detection accuracy (or fine
perceptual discrimination; see subsequent sections) is emphasized than when speed of responding
is emphasized (Prinzmetal et al. 2005).

26.3  INVOLUNTARY CROSS-MODAL SPATIAL ATTENTION MODULATES TIME-ORDER PERCEPTION

The findings reviewed in the previous section provide compelling evidence that cross-modal atten-
tion influences the perceptual quality of visual stimuli. In the context of a spatial cueing experi-
ment, perceptual enhancement at an early stage of processing could facilitate decision and response
FIGURE 26.1  Results from McDonald et al.’s (2000, 2003) signal detection experiments. (a) Schematic
illustration of stimulus events on a valid-cue trial. Small light displays were fixed to bottoms of two loud-
speaker cones, one situated to the left and one to the right of a central fixation point. Each trial began with a spatially
nonpredictive auditory cue from the left or right speaker (first panel), followed by a faint visual target on some
trials (second panel) and a salient visual mask (third panel). Participants were required to indicate whether
they saw the visual target. (b) Perceptual sensitivity data averaged across participants. (c) Grand-average
event-related potentials (ERPs) to left visual field stimuli following valid and invalid auditory cues. The ERPs
were recorded from lateral occipital electrodes PO7 and PO8. Negative voltages are plotted upward, by con-
vention. Shaded box highlights interval of P1 and N1 components, in which cue effects emerged. (d) Scalp
topographies of enhanced negative voltages to validly cued visual targets. (e) Projections of best-fitting dipo-
lar sources onto sections of an individual participant’s MRI. Dipoles were located in superior temporal sulcus
(STS), fusiform gyrus (FG), and perisylvian cortex near post-central gyrus (PostC). PostC dipoles accounted
for relatively late (200–300 ms) activity over more anterior scalp regions.

processing at later stages, thereby leading to faster responses for validly cued objects than for inval-
idly cued objects. Theoretically, however, changes in the timing of perceptual processing could also
contribute to the cue effects on RT performance: an observer might become consciously aware of a
target earlier in time when it appears at a cued location than when it appears at an uncued location.
In fact, the idea that attention influences the timing of our perceptions is an old and controversial
one. More than 100 years ago, Titchener (1908) asserted that when confronted with multiple objects,
an observer becomes consciously aware of an attended object before other unattended objects.
Titchener called the hypothesized temporal advantage for attended objects the law of prior entry.
Observations from laboratory experiments in the nineteenth and early twentieth centuries were
interpreted along the lines of attention-induced prior entry. In one classical paradigm known as
the complication experiment, observers were required to indicate the position of a moving pointer
at the moment a sound was presented (e.g., Stevens 1904; Wundt 1874; for a review, see Boring
1929). When listening in anticipation for the auditory stimulus, observers typically indicated that
the sound appeared when the pointer was at an earlier point along its trajectory than was actually
the case. For example, observers might report that a sound appeared when a pointer was at posi-
tion 4 even though the sound actually appeared when the pointer was at position 5. Early on, it was
believed that paying attention to the auditory modality facilitated sound perception and led to a
relative delay of visual perception, so that the pointer’s perceived position lagged behind its actual
position. However, this explanation fell out of favor when later results indicated that a specific judg-
ment strategy, rather than attention-induced prior entry, might be responsible for the mislocalization
error (e.g., Cairney 1975).
In more recent years, attention-induced prior entry has been tested experimentally in visual tem-
poral-order judgment (TOJ) tasks that require observers to indicate which of two rapidly presented
visual stimuli appeared first. When the attended and unattended stimuli appear simultaneously,
observers typically report that the attended stimulus appeared to onset before the unattended stim-
ulus (Stelmach and Herdman 1991; Shore et al. 2001). Moreover, in line with the supramodal view
of spatial attention, such changes in temporal perception have been found when shifts in spatial
attention were triggered by spatially nonpredictive auditory and tactile cues as well as visual cues
(Shimojo et al. 1997).
Despite the intriguing behavioral results from TOJ experiments, the controversy over attention-
induced prior entry has continued. The main problem harks back to the debate over the complication
experiments: an observer’s judgment strategy might contribute to the tendency to report the cued
target as appearing first (Pashler 1998; Schneider and Bavelier 2003; Shore et al. 2001). Thus, in a
standard TOJ task, observers might perceive two targets to appear simultaneously but still report
seeing the target on the cued side first because of a decision rule that favors the cued target (e.g.,
when in doubt, select the cued target). Simple response biases (e.g., stimulus–response compatibility
effects) can be avoided quite easily by altering the task (McDonald et al. 2005; Shore et al. 2001),
but it is difficult to completely avoid the potential for response bias.
As noted previously, ERP recordings can be used to distinguish between changes in high-level
decision and response processes and changes in perceptual processing that could underlie entry
to conscious awareness. An immediate challenge to this line of research is to specify the ways
in which the perceived timing of external events might be associated with activity in the brain.
Philosopher Daniel Dennett expounded two alternatives (Dennett 1991). On one hand, the perceived
timing of external events may be derived from the timing of neural activities in relevant brain cir-
cuits. For example, the perceived temporal order of external events might be based on the timing
of early cortical evoked potentials. On the other hand, the brain might not represent the timing of
perceptual events with time itself. In Dennett’s terminology, the represented time (e.g., A before B)
is not necessarily related to the time of the representing (e.g., representing of A does not necessarily
precede representing of B). Consequently, the perceived temporal order of external events might be
based on nontemporal aspects of neural activities in relevant brain circuits.
McDonald et al. (2005) investigated the effect of cross-modal spatial attention on visual time-
order perception using ERPs to track the timing of cortical activity in a TOJ experiment. A spatially
nonpredictive auditory cue was presented to the left or right side of fixation just before the occur-
rence of a pair of simultaneous or nearly simultaneous visual targets (Figure 26.2a). One of the
visual targets was presented at the cued location, whereas the other was presented at the homolo-
gous location in the opposite visual hemifield. Consistent with previous behavioral studies, the
auditory spatial cue had a considerable effect on visual TOJs (Figure 26.2b). Participants judged the
cued target as appearing first on 79% of all simultaneous-target trials. To nullify this cross-modal
cueing effect, the uncued target had to be presented nearly 70 ms before the cued target.
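To make the estimation of such a shift concrete, the sketch below fits a logistic psychometric function to hypothetical proportions of "cued side first" reports and reads off the cued-side onset advantage at which the two responses are equally likely. Only the 79% value at simultaneity and the roughly 70-ms nullification point follow the text; the remaining proportions are invented for illustration.

    # Sketch of estimating the point of subjective simultaneity (PSS) from a
    # temporal-order-judgment psychometric function. Proportions are illustrative.
    import numpy as np
    from scipy.optimize import curve_fit

    csoa = np.array([-70.0, -35.0, 0.0, 35.0, 70.0])          # cued-side onset advantage (ms)
    p_cued_first = np.array([0.50, 0.65, 0.79, 0.88, 0.94])   # proportion "cued side first"

    def logistic(x, pss, slope):
        # pss is the CSOA at which the two report types are equally likely
        return 1.0 / (1.0 + np.exp(-(x - pss) / slope))

    (pss, slope), _ = curve_fit(logistic, csoa, p_cued_first, p0=(-50.0, 30.0))
    # The uncued target must lead by roughly 70 ms for "cued first" and
    # "uncued first" reports to be equally likely.
    print(round(pss))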
FIGURE 26.2  Results from McDonald et al.’s (2005) temporal-order-judgment experiment. (a) Schematic
illustration of events on a simultaneous-target trial (top) and nonsimultaneous target trials (bottom).
Participants indicated whether a red or a green target appeared first. SOA between cue and first target event was
100–300 ms, and SOA between nonsimultaneous targets was 35 or 70 ms. T1 and T2 denote times at which
visual targets could occur. (b) Mean percentage of trials on which participants reported seeing the target on
cued side first, as a function of cued-side onset advantage (CSOA; i.e., lead time). Negative CSOAs indicate
that uncued-side target was presented first; positive CSOAs indicate that cued-side target was presented first.
(c) Grand-average ERPs to simultaneous visual targets, averaged over 79% of trials on which participants
indicated that cued-side target appeared first. ERPs were recorded at contralateral and ipsilateral occipital
electrodes (PO7/PO8). Statistically significant differences between contralateral and ipsilateral waveforms are
denoted in gray on time axis. (d) Scalp topographies of ERP waveforms in time range of P1 (90–120 ms). Left
and right sides of the map show electrodes ipsilateral and contralateral to the cued side, respectively. (e) Projections
of best-fitting dipolar sources onto sections of an average MRI. Dipoles were located in superior temporal sulcus
(STS) and fusiform gyrus (FG). FG dipoles accounted for cue-induced P1 amplitude modulation, whereas STS
dipoles accounted for a long-latency (200–250 ms) negative deflection.

To elucidate the neural basis of this prior-entry effect, McDonald and colleagues (2005) exam-
ined the ERPs elicited by simultaneously presented visual targets following the auditory cue. The
analytical approach taken was premised on the lateralized organization of the visual system and
the pattern of ERP effects that have been observed under conditions of bilateral visual stimulation.
Several previous studies on visual attention showed that directing attention to one side of a bilateral
visual display results in a lateralized asymmetry of the early ERP components measured over the
occipital scalp, with an increased positivity at electrode sites contralateral to the attended location
beginning in the time range of the occipital P1 component (80–140 ms; Heinze et al. 1990, 1994;
Luck et al. 1990; see also Fukuda and Vogel 2009). McDonald et al. (2005) hypothesized that if
attention speeds neural transmission at early stages of the visual system, the early ERP compo-
nents elicited by simultaneous visual targets would show an analogous lateral asymmetry in time,
such that the P1 measured contralateral to the attended (cued) visual target would occur earlier
than the P1 measured contralateral to the unattended (uncued) visual target. Such a finding would
be consistent with Stelmach and Herdman’s (1991) explanation of attention-induced prior entry as
well as with the view that the time course of perceptual experience is tied to the timing of the early
evoked activity in the visual cortex (Dennett 1991). Such a latency shift was not observed, however,
even though the auditory cue had a considerable effect on the judgments of temporal order of the
visual targets. Instead, cross-modal cueing led to an amplitude increase (with no change in latency)
of the ERP positivity in the ventral visual cortex contralateral to the side of the auditory cue, start-
ing in the latency range of the P1 component (90–120 ms) (Figure 26.2c–e). This finding suggests
that the effect of spatial attention on the perception of temporal order occurs because an increase in
the gain of the cued sensory input causes a perceptual threshold to be reached at an earlier time, not
because the attended input was transmitted more rapidly than the unattended input at the earliest
stages of processing.
The pattern of ERP results obtained by McDonald and colleagues is likely an important clue for
the understanding the neural basis of visual prior entry due to involuntary deployments of spatial
attention to sudden sounds. Although changes in ERP amplitude appear to underlie visual percep-
tual prior entry when attention is captured by lateralized auditory cues, changes in ERP timing
might contribute to perceptual prior entry in other situations. This issue was addressed in a recent
study of multisensory prior entry, in which participants voluntarily attended to either visual or tac-
tile stimuli and judged whether the stimulus on the left or right appeared first, regardless of stimulus
modality (Vibell et al. 2007). The ERP analysis centered on putatively visual ERP peaks over the
posterior scalp (although ERPs to the tactile stimuli were not subtracted out and thus may have
contaminated the ERP waveforms; cf. Talsma and Woldorff 2005). Interestingly, the P1 peaked at
an average of 4 ms earlier when participants were attending to the visual modality than when they
were attending to the tactile modality, suggesting that modality-based attentional selection may have
a small effect on the timing of early, evoked activity in the visual system. These latency results are
not entirely clear, however, because the small-but-significant attention effect may have been caused
by a single participant with an implausibly large latency difference (17 ms) and may have been influ-
enced by overlap with the tactile ERP. Unfortunately, the authors did not report whether attention
had a similar effect on the latency of the tactile ERPs, which may have helped to corroborate the
small attention effect on P1 latency. Notwithstanding these potential problems in the ERP analysis,
it is tempting to speculate that voluntary modality-based attentional selection influences the timing
of early visual activity, whereas involuntary location-based attentional selection influences the gain
of early visual activity. The question would still remain, however, of how very small changes in ERP
latency (4 ms or less) could underlie much larger perceptual effects of tens of milliseconds.

26.4  BEYOND TEMPORAL ORDER: THE SIMULTANEITY JUDGMENT TASK


Recently, Santangelo and Spence (2008) offered an alternative explanation for the finding of
McDonald and colleagues (2005) that nonpredictive auditory spatial cues affect visual time order
perception. Specifically, the authors suggested that the behavioral results in McDonald et al.’s TOJ
task were not due to changes in perception but rather to decision-level factors. They acknowledged
that simple response biases (e.g., a left cue primes a "left" response) would not have contributed to
the behavioral results because participants indicated the color, not the location, of the target that
appeared first. However, Santangelo and Spence raised the concern that some form of “secondary”
response bias might have contributed to the TOJ effects (Schneider and Bavelier 2003; Stelmach
and Herdman 1991).* For example, participants might have decided to select the stimulus at the
cued location when uncertain as to which stimulus appeared first. In an attempt to circumvent such
secondary response biases, Santangelo and Spence used a simultaneity judgment (SJ) task, in which
participants had to judge whether two stimuli were presented simultaneously or successively (Carver
and Brown 1997; Santangelo and Spence 2008; Schneider and Bavelier 2003). They reported that

* This argument would also apply to the findings of Vibell et al.’s (2007) cross-modal TOJ study.

the uncued target had to appear 15–17 ms before the cued target in order for participants to have the
subjective impression that the two stimuli appeared simultaneously. This difference is referred to as
a shift in the point of subjective simultaneity (PSS), and it is typically attributed to the covert ori-
enting of attention (but see Schneider and Bavelier 2003, for an alternative sensory-based account).
The estimated shift in PSS was much smaller than the one reported in McDonald et al.’s earlier TOJ
task (17.4 vs. 68.5 ms), but the conclusions derived from the two findings were the same: Involuntary
capture of spatial attention by a sudden sound influences the perceived timing of visual events.
Santangelo and Spence went on to argue that the shift in PSS reported by McDonald et al. might
have been due to secondary response biases and, as a result, the shift in PSS observed in their study
provided “the first unequivocal empirical evidence in support of the effect of cross-modal atten-
tional capture on the latencies of perceptual processing” (p. 163).
Although the SJ task has its virtues, there are two main arguments against Santangelo and
Spence’s conclusions. First, the authors did not take into account the neurophysiological findings
of McDonald and colleagues’ ERP study. Most importantly, the effect of auditory spatial cuing
on early ERP activity arising from sensory-specific regions of the ventral visual cortex cannot be
explained in terms of response bias. Thus, although it may be difficult to rule out all higher-order
response biases in a TOJ task, the ERP findings provide compelling evidence that cross-modal spa-
tial attention modulates early visual-sensory processing. Moreover, although the SJ task may be less
susceptible to some decision-level factors, it may be impossible to rule out all decision-level factors
entirely as contributors to the PSS effect.* Thus, it is not inconceivable that Santangelo and Spence’s
behavioral findings may have reflected post-perceptual rather than perceptual effects.
Second, it should be noted that Santangelo and Spence’s results provided little, if any, empirical
support for the conclusion that cross-modal spatial attention influences the timing of visual percep-
tual processing. The problem is that their estimated PSS did not accurately represent their empiri-
cal data. Their PSS measure was derived from the proportion of “simultaneous” responses, which
varied as a function of the stimulus onset asynchrony (SOA) between the target on the cued side
and the target on the uncued side. As shown in their Figure 2a, the proportion of "simultaneous"
responses peaked when the cued and uncued targets appeared simultaneously (0 ms SOA) and
decreased as the SOA between targets increased. The distribution of responses was fit to a Gaussian
function using maximum likelihood estimation, and the mean of the fitted Gaussian function—not
the observed data—was used as an estimate of the PSS. Critically, this procedure led to a mismatch
between the mean of the fitted curve (or more aptly, the mean of the individual-subject fitted curves)
and the mean of the observed data. Specifically, whereas the mean of the fitted curves fell slightly
to the left of the 0-ms SOA (uncued target presented first), the mean of the observed data actually
fell slightly to the right of the 0-ms SOA (cued target presented first) because of a positive skew of
the distribution.†
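The underlying statistical point can be illustrated with a small, self-contained example (using invented response proportions, not the published data): when the distribution of "simultaneous" responses is skewed, the mean of a best-fitting Gaussian, the mean of the observed data, and the mode of the observed data can all differ.

    # Illustration (not a reanalysis of Santangelo and Spence 2008): for skewed
    # "simultaneous"-response data, a fitted Gaussian mean, the empirical mean,
    # and the empirical mode need not agree.
    import numpy as np
    from scipy.optimize import curve_fit

    soa = np.arange(-80.0, 81.0, 20.0)  # SOA between cued-side and uncued-side targets (ms)
    # Hypothetical proportions, peaked near 0 ms with a heavier tail on one side
    p_simult = np.array([0.02, 0.05, 0.20, 0.75, 0.95, 0.70, 0.45, 0.30, 0.20])

    def gaussian(x, mu, sigma, amp):
        return amp * np.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

    (mu_fit, sigma_fit, amp_fit), _ = curve_fit(gaussian, soa, p_simult, p0=(0.0, 40.0, 1.0))
    empirical_mean = np.sum(soa * p_simult) / np.sum(p_simult)
    empirical_mode = soa[np.argmax(p_simult)]
    print(mu_fit, empirical_mean, empirical_mode)  # three different summaries of the same data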
Does auditory cueing influence the subjective impression of simultaneity in the context of a SJ
task? Unfortunately, the results from Santangelo and Spence’s study provide no clear answer to this
question. The reported leftward shift in PSS suggests that the auditory cue had a small facilitatory
effect on the perceived timing of the ipsilateral target. However, the rightward skew of the observed

* Whereas Santangelo and Spence (2008) made the strong claim that performance in SJ tasks should be completely inde-
pendent of all response biases, Schneider and Bavelier (2003) argued only that performance in SJ tasks should be less
susceptible to such decision-level effects than performance in TOJ tasks.
† The mismatch between the estimated PSS and the mean of the observed data in Santangelo and Spence’s (2008) SJ task
might have been due to violations in the assumptions of the fitting procedure. Specifically, the maximum likelihood
procedure assumes that data are distributed normally, whereas the observed data were clearly skewed. Santangelo and
Spence did perform one goodness-of-fit test to help determine whether the data differed significantly from the fitted
Gaussians, but this test was insufficient to pick up the positive skew (note that other researchers have employed multiple
goodness-of-fit tests before computing PSS; e.g., Stone et al. 2001). Alternatively, the mismatch between the estimated
PSS and the mean of the observed data might have arisen because data from the simultaneous-target trials were actually
discarded prior to the curve-fitting procedure. This arbitrary step shifted the mode of the distribution 13 ms to the left
(uncued target was presented 13 ms before cued target), which happened to be very close to the reported shift in PSS.

distribution (and consequential rightward shift in the mean) suggests that the auditory cue may actu-
ally have delayed perception of the ipsilateral target. Finally, the mode of the observed distribution
suggests that the auditory cue had no effect on subjective reports of simultaneity. These inconclusive
results suggest that the SJ task may lack adequate sensitivity to detect shifts in perceived time order
induced by cross-modal cueing.

26.5  INVOLUNTARY CROSS-MODAL SPATIAL ATTENTION ALTERS APPEARANCE

The findings of the signal-detection and TOJ studies outlined in previous sections support the view
that involuntary cross-modal spatial attention alters the perception of subsequent visual stimuli as
well as the gain of neural responses in extrastriate visual cortex 100–150 ms after stimulus onset.
These results largely mirrored the effects of visual spatial cues on visual perceptual sensitivity
(e.g., Luck et al. 1994; Smith 2000) and temporal perception (e.g. Stelmach and Herdman 1991;
Shore et al. 2001). However, none of these studies directly addressed the question of whether atten-
tion alters the subjective appearance of objects that reach our senses. Does attention make white
objects appear whiter and dark objects appear darker? Does it make the ticking of a clock sound
louder? Psychologists have pondered questions like these for well over a century (e.g., Fechner 1882;
Helmholtz 1866; James 1890).
Recently, Carrasco and colleagues (2004) introduced a psychophysical paradigm to address the
question, "does attention alter appearance?" The paradigm is similar to the TOJ paradigm except
that, rather than varying the SOA between two visual targets and asking participants to judge which
one was first (or last), the relative physical contrast of two targets is varied and participants are
asked to judge which one is higher (or lower) in perceived contrast. In the original variant of the
task, a small black dot was used to summon attention to the left or right just before the appearance
of two Gabor patches at both left and right locations. When the physical contrasts of the two targets
were similar or identical, observers tended to report the orientation of the target on the cued side.
Based on these results, Carrasco and colleagues (2004) concluded that attention alters the subjective
impression of contrast. In subsequent studies, visual cueing was found to alter the subjective impres-
sions of several other stimulus features, including color saturation, spatial frequency, and motion
coherence (for a review, see Carrasco 2006).
Carrasco and colleagues performed several control experiments to help rule out alternative expla-
nations for their psychophysical findings (Prinzmetal et al. 2008; Schneider and Komlos 2008). The
results of these controls argued against low-level sensory factors (Ling and Carrasco 2007) as well
as higher-level decision or response biases (Carrasco et al. 2004; Fuller et al. 2008). However, as we
have discussed in previous sections, it is difficult to rule out all alternative explanations on the basis
of the behavioral data alone. Moreover, results from different paradigms have led to different con-
clusions about whether attention alters appearance: whereas the results from Carrasco’s paradigm
have indicated that attention does alter appearance, the results from an equality-judgment paradigm
introduced by Schneider and Komlos (2008) have suggested that attention may alter decision pro-
cesses rather than contrast appearance.
Störmer et al. (2009) recently investigated whether cross-modal spatial attention alters visual
appearance. The visual cue was replaced by a spatially nonpredictive auditory cue delivered in ste-
reo so that it appeared to emanate from a peripheral location of a visual display (25° from fixation).
After a 150-ms SOA, two Gabors were presented, one at the cued location and one on the opposite
side of fixation (Figure 26.3a). The use of an auditory cue eliminated some potential sensory inter-
actions between visual cue and target that might boost the contrast of the cued target even in the
absence of attention (e.g., the contrast of a visual cue could add to the contrast of the cued-location
target, thereby making it higher in contrast than the uncued-location target). As in Carrasco et al.’s
(2004) high-contrast experiment, the contrast of one (standard) Gabor was set at 22%, whereas the
FIGURE 26.3  Results from Störmer et al.’s (2009) contrast-appearance experiment. (a) Stimulus sequence
and grand-average ERPs to equal-contrast Gabor, recorded at occipital electrodes (PO7/PO8) contralateral
and ipsilateral to cued side. On a short-SOA trial (depicted), a peripheral auditory cue was presented 150 ms
before a bilateral pair of Gabors that varied in contrast (see text for details). Isolated target ERPs revealed an
enlarged positivity contralateral to cued target. Statistically significant differences between contralateral and
ipsilateral waveforms are denoted in gray on time axis. (b) Mean probability of reporting contrast of test patch
to be higher than that of standard patch, as a function of test-patch contrast. Probabilities for cued-test and
cued-standard trials are shown separately. (c) Scalp topographies of equal-contrast-Gabor ERPs in time inter-
val of P1 (120–140 ms). Left and right sides of the map show electrodes ipsilateral and contralateral to the cued side,
respectively. (d) Localization of distributed cortical current sources underlying contralateral-minus-ipsilateral
ERP positivity in 120–140 ms interval, projected onto cortical surface. View of the ventral surface, with
occipital lobes at the top. Source activity was estimated using LAURA algorithm and is shown in contralateral
hemisphere (right side of brain) only. (e) Correlations between individual participants’ tendencies to report the
cued-side target to be higher in contrast and magnitude of enlarged ERP positivities recorded at occipital and
parieto-occipital electrodes (PO7/PO8, PO3/PO4) in 120–140 ms interval.

contrast of the other (test) Gabor varied between 6% and 79%. ERPs were recorded on the trials (1/3
of the total) where the two Gabors were equal in contrast. Participants were required to indicate
whether the higher-contrast Gabor patch was oriented horizontally or vertically.
The psychophysical findings in this auditory cueing paradigm were consistent with those reported
by Carrasco and colleagues (2004). When the test and standard Gabors had the same physical con-
trast, observers reported the orientation of the cued-location Gabor significantly more often than the
uncued-location Gabor (55% vs. 45%) (Figure 26.3b). The point of subjective equality (PSE)—the
test contrast at which observers judged the test patch to be higher in contrast on half of the trials—
averaged 20% when the test patch was cued and 25% when the standard patch was cued (in compari-
son with the 22% standard contrast; Figure 26.3a). These results indicate that spatially nonpredictive
auditory cues as well as visual cues can influence subjective (visual) contrast judgments.
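A minimal sketch of how such a PSE can be obtained follows; the contrast levels match those described in the text, but the response proportions are invented for illustration. The 50% point of the proportion of "test higher" reports is interpolated on a log-contrast axis.

    # Sketch of estimating the point of subjective equality (PSE) on a log-contrast
    # axis. Proportions are hypothetical; contrast levels follow those in the text.
    import numpy as np

    test_contrast = np.array([6.0, 13.0, 22.0, 37.0, 78.0])    # percent contrast
    p_test_higher = np.array([0.10, 0.30, 0.55, 0.85, 0.98])   # "test higher" reports, test patch cued

    pse = np.exp(np.interp(0.5, p_test_higher, np.log(test_contrast)))
    print(pse)  # below the 22% standard: the cued test patch is judged higher in contrast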
To investigate whether the auditory cue altered visual appearance as opposed to a decision or
response processes, Störmer and colleagues (2009) examined the ERPs elicited by the equal-contrast
Gabors as a function of cue location. The authors reasoned that changes in subjective appearance
would likely be linked to modulations of early ERP activity in visual cortex associated with percep-
tual processing rather than decision- or response-level processing (see also Schneider and Komlos
2008). Moreover, any such effect on early ERP activity should correlate with the observers’ tenden-
cies to report the cued target as being higher in contrast. This is exactly what was found. Starting at
approximately 90 ms after presentation of the equal-contrast targets, the waveform recorded con-
tralaterally to the cued side became more positive than the waveform recorded ipsilaterally to the
cued side (Figure 26.3a). This contralateral positivity was observed on those trials when observers
judged the cued-location target to be higher in contrast but not when observers judged the uncued-
location target to be higher in contrast. The tendency to report the cued-location target as being
higher in contrast correlated with the contralateral ERP positivity, most strongly in the time interval
of the P1 component (120–140 ms), which is generated at early stages of visual cortical processing.
Topographical mapping and distributed source modeling indicated that the increased contralateral
positivity in the P1 interval reflected modulations of neural activity in or near the fusiform gyrus of
the occipital lobe (Figure 26.3c and d). These ERP findings converge with the behavioral evidence
that cross-modal spatial attention affects visual appearance through modulations at an early sensory
level rather than by affecting a late decision process.
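As a schematic of this kind of brain-behavior analysis (a sketch with randomly generated placeholder arrays and hypothetical variable names, not the authors' actual pipeline), the lateralized ERP effect can be summarized per participant as the contralateral-minus-ipsilateral mean amplitude in the 120–140 ms window and then correlated, across participants, with the tendency to report the cued-side target as higher in contrast.

    # Sketch of the correlation between the lateralized ERP effect and behavior.
    # Arrays are randomly generated placeholders, not recorded data.
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    n_subjects, n_times = 14, 500
    times = np.linspace(-100, 400, n_times)               # ms relative to target onset
    erp_contra = rng.normal(size=(n_subjects, n_times))   # waveforms contralateral to cue
    erp_ipsi = rng.normal(size=(n_subjects, n_times))     # waveforms ipsilateral to cue
    bias = rng.normal(size=n_subjects)                    # cued-minus-uncued report tendency

    window = (times >= 120) & (times <= 140)              # P1 interval examined in the study
    lateralized = (erp_contra[:, window] - erp_ipsi[:, window]).mean(axis=1)

    r, p = pearsonr(lateralized, bias)                    # across-participant correlation
    print(r, p)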

26.6  POSSIBLE MECHANISMS OF CROSS-MODAL CUE EFFECTS


The previous sections have focused on the perceptual consequences of cross-modal spatial cue-
ing. To sum up, salient-but-irrelevant sounds were found to enhance visual perceptual sensitivity,
accelerate the timing of visual perceptions, and alter the appearance of visual stimuli. Each of these
perceptual effects was accompanied by modulation of the early cortical response elicited by the
visual stimulus within ventral-stream regions. Such findings are consistent with the hypothesis that
auditory and visual stimuli engage a common neural network involved in the control and covert
deployment of attention in space (Farah et al. 1989). Converging lines of evidence have pointed to
the involvement of several key brain structures in the control and deployment of spatial attention in
visual tasks. These brain regions include the superior colliculus, pulvinar nucleus of the thalamus,
intraparietal sulcus, and dorsal premotor cortex (for additional details, see Corbetta and Shulman
2002; LaBerge 1995; Posner and Raichle 1994). Importantly, multisensory neurons have been found
in each of these areas, which suggests that the neural network responsible for the covert deployment
of attention in visual space may well control the deployment of attention in multisensory space (see
Macaluso, this volume; Ward et al. 1998; Wright and Ward 2008).
At present, however, there is no consensus as to whether a supramodal attention system is
responsible for the cross-modal spatial cue effects outlined in the previous sections. Two different
controversies have emerged. The first concerns whether a single, supramodal system controls the
deployment of attention in multisensory space or whether separate, modality-specific systems direct
attention to stimuli of their respective modalities. The latter view can account for cross-modal cue-
ing effects by assuming that the activation of one system triggers coactivation of others. According
to this separate-but-linked proposal, a shift of attention to an auditory location would lead to a
separate shift of attention to the corresponding location of the visual field. Both the supramodal and
separate-but-linked hypotheses can account for cross-modal cueing effects, making it difficult to
distinguish between the two views in the absence of more direct measures of the neural activity that
underlies attention control.
The second major controversy over the possible mechanisms of cross-modal cue effects is spe-
cific to studies utilizing salient-but-irrelevant stimuli to capture attention involuntarily. In these
studies, the behavioral and neurophysiological effects of cueing are typically maximal when the
cue appears 100–300 ms before the target. Although it is customary to attribute these facilitatory
effects to the covert orienting of attention, they might alternatively result from sensory interactions
between cue and target (Tassinari et al. 1994). The cross-modal-cueing paradigm eliminates uni-
modal sensory interactions, such as those taking place at the level of the retina, but the possibility of
cross-modal sensory interaction remains because of the existence of multisensory neurons at many
levels of the sensory pathways that respond to stimuli in different modalities (Driver and Noesselt
2008; Foxe and Schroeder 2005; Meredith and Stein 1996; Schroeder and Foxe 2005). In fact,
the majority of multisensory neurons do not simply respond to stimuli in different modalities, but
rather appear to integrate the input signals from different modalities so that their responses to mul-
timodal stimulation differ quantitatively from the simple summation of their unimodal responses
(for reviews, see Stein and Meredith 1993; Stein et al. 2009; other chapters in this volume). Such
multisensory interactions are typically largest when stimuli from different modalities occur at about
the same time, but they are possible over a period of several hundreds of milliseconds (Meredith
et al. 1987). In light of these considerations, the cross-modal cueing effects described in previous
sections could in principle have been due to the involuntary covert orienting of spatial attention or
to the integration of cue and target into a single multisensory event (McDonald et al. 2001; Spence
and McDonald 2004; Spence et al. 2004).
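For reference, the single-neuron literature cited above typically quantifies such interactions with a multisensory enhancement index, the percentage change of the multisensory response relative to the most effective unisensory response; the sketch below applies that conventional formula to hypothetical response rates rather than recorded data.

    # Conventional multisensory enhancement index (see Stein and Meredith 1993),
    # computed for hypothetical mean responses rather than recorded data.
    def enhancement_index(multisensory, best_unisensory):
        # Percentage change relative to the most effective unisensory response
        return 100.0 * (multisensory - best_unisensory) / best_unisensory

    auditory, visual, audiovisual = 8.0, 12.0, 30.0  # e.g., mean spikes per trial
    index = enhancement_index(audiovisual, max(auditory, visual))
    superadditive = audiovisual > (auditory + visual)  # exceeds the sum of unimodal responses
    print(index, superadditive)  # 150.0 True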
Although it is often difficult to determine which of these mechanisms are responsible for cross-
modal cueing effects, several factors can help to tip the scales in favor of one explanation or the
other. One factor is the temporal relationship between the cue and target stimuli. A simple rule of
thumb is that increasing the temporal overlap between the cue and target will make multisensory
integration more likely and pre-target attentional biasing less likely (McDonald et al. 2001). Thus,
it is relatively straightforward to attribute cross-modal cue effects to multisensory integration when
cue and target are presented concurrently or to spatial attention when cue and target are separated
by a long temporal gap. The likely cause of cross-modal cueing effects is not so clear, however,
when there is a short gap between cue and target that is within the temporal window where inte-
gration is possible. In such situations, other considerations may help to disambiguate the causes of
the cross-modal cueing effects. For example, multisensory integration is largely an automatic and
invariant process, whereas stimulus-driven attention effects are dependent on an observer’s goals
and intentions (i.e., attentional set; e.g., Folk et al. 1992). Thus, if cross-modal spatial cue effects
were found to be contingent upon an observer’s current attentional set, they would be more likely to
have been caused by pre-target attentional biasing. To our knowledge, there has been little discus-
sion of the dependency of involuntary cross-modal spatial cueing effects on attentional set and other
task-related factors (e.g., Ward et al. 2000).
A second consideration that could help distinguish between alternative mechanisms of cross-
modal cueing effects concerns the temporal sequence of control operations (Spence et al. 2004).
According to the most prominent multisensory integration account, signals arising from stimuli
in different modalities converge onto multimodal brain regions and are integrated therein. The
resulting integrated signal is then fed back to the unimodal brain regions to influence processing
of subsequent stimuli in modality-specific regions of cortex (Calvert et al. 2000; Macaluso et al.
2000). Critically, such an influence on modality-specific processing would occur only after feedfor-
ward convergence and integration of the unimodal signals takes place (Figure 26.4a). This contrasts
with the supramodal-attention account, according to which the cue’s influence on modality-specific
processing may be initiated before the target in another modality has been presented (i.e., before
integration is possible). In the context of a peripheral cueing task, a cue in one modality (e.g.,
audition) would initiate a sequence of attentional control operations (such as disengage, move, reen-
gage; see Posner and Raichle 1994) that would lead to anticipatory biasing of activity in another
modality (e.g., vision) before the appearance of the target (Figure 26.4b). In other words, whereas

[Figure 26.4 schematic, panels (a) Integration and (b) Attention: in each panel, unimodal auditory and visual regions converge on a multisensory region (AV), unfolding over time from cue to target; panel (b) additionally depicts an attentional spotlight.]

FIGURE 26.4  Hypothetical neural mechanisms for involuntary cross-modal spatial cueing effects.
(a) Integration-based account. Nearly simultaneous auditory and visual stimuli first activate unimodal audi-
tory and visual cortical regions and then converge upon a multisensory region (AV). Audiovisual interaction
within multisensory region feeds back to boost activity in visual cortex. (b) Attention-based account. An audi-
tory cue elicits a shift of spatial attention in a multisensory representation, which leads to pre-target biasing of
activity in visual cortex and ultimately boosts target-related activity in visual cortex.

multisensory integration occurs only after stimulation in two (or more) modalities, the consequences
of spatial attention are theoretically observable after stimulation in the cue modality alone. Thus,
a careful examination of neural activity in the cue–target interval would help to ascertain whether
pre-target attentional control is responsible for the cross-modal cueing effects on perception. This
is a challenging task in the case of involuntary cross-modal cue effects, because the time interval
between the cue and target is typically very short. In the future, however, researchers might success-
fully adapt the electrophysiological methods used to track the voluntary control of spatial attention
(e.g., Doesburg et al. 2009; Eimer et al. 2002; Green and McDonald 2008; McDonald and Green
2008; Worden et al. 2000) to look for signs of attentional control in involuntary cross-modal cueing
paradigms such as the ones described in this chapter.

26.7  CONCLUSIONS AND FUTURE DIRECTIONS


To date, most of the research on spatial attention has considered how attending to a particular region
of space influences the processing of objects within isolated sensory modalities. However, a grow-
ing number of studies have demonstrated that orienting attention to the location of a stimulus in one
modality can influence the perception of subsequent stimuli in different modalities. As outlined
here, recent cross-modal spatial cueing studies have shown that the occurrence of a nonpredictive
auditory cue affects the way we see subsequent visual objects in several ways: (1) by improving the
perceptual sensitivity for detection of masked visual stimuli appearing at the cued location, (2) by
producing earlier perceptual awareness of visual stimuli appearing at the cued location, and (3) by
altering the subjective appearance of visual stimuli appearing at the cued location. Each of these
cross-modally induced changes in perceptual experience is accompanied by short-latency changes
in the neural processing of targets within occipitotemporal cortex in the vicinity of the fusiform
gyrus, which is generally considered to represent modality-specific cortex belonging to the ventral
stream of visual processing.
There is still much to be learned about these cross-modally induced changes in perception. One
outstanding question is why spatial cueing appears to alter visual perception in tasks that focus
on differences in temporal order or contrast (Carrasco et al. 2004; McDonald et al. 2005; Störmer
et al. 2009) but not in tasks that focus on similarities (i.e., “same or not” judgments; Santangelo and
Spence 2008; Schneider and Komlos 2008). Future studies could address this question by recording
physiological measures (such as ERPs) in the two types of tasks. If an ERP component previously
shown to correlate with perception were found to be elicited equally well under the two types of
task instructions, it might be concluded that the same-or-not judgment lacks sensitivity to reveal
perceptual effects.
Another outstanding question is whether the cross-modal cueing effects reviewed in this chapter
are caused by the covert orienting of attention or by passive intersensory interactions. Some insight
may come from recent ERP studies of the “double flash” illusion produced by the interaction of a
single flash with two pulsed sounds (Mishra et al. 2007, 2010). In these studies, an enhanced early
ventral stream response at 100–130 ms was observed in association with the perceived extra flash.
Importantly, this neural correlate of the illusory flash was sensitive to manipulations of spatial
selective attention, suggesting that the illusion is not the result of automatic multisensory integra-
tion. Along these lines, it is tempting to conclude that the highly similar enhancement of early
ventral-stream activity found in audiovisual cueing studies (McDonald et al. 2005; Störmer et al.
2009) also results from the covert deployment of attention rather than the automatic integration of
cue and target stimuli. Future studies could address this issue by looking for electrophysiological
signs of attentional control and anticipatory modulation of visual cortical activity before the onset
of the target stimulus.
A further challenge for future research will be to extend these studies to different combinations
of sensory modalities to determine whether cross-modal cueing of spatial attention has analogous
effects on the perception of auditory and somatosensory stimuli. Such findings would be consistent
with the hypothesis that stimuli from the various spatial senses can all engage the same neural sys-
tem that mediates the covert deployment of attention in multisensory space (Farah et al. 1989).

REFERENCES
Boring, E. G. 1929. A history of experimental psychology. New York: Appleton-Century.
Broadbent, D. E. 1958. Perception and communication. London: Pergamon Press.
Cairney, P. T. 1975. The complication experiment uncomplicated. Perception 4: 255–265.
Calvert, G. A., R. Campbell, and M. J. Brammer. 2000. Evidence from functional magnetic resonance imaging
of crossmodal binding in the human heteromodal cortex. Current Biology 10: 649–657.
Carrasco, M. 2006. Covert attention increases contrast sensitivity: Psychophysical, neurophysiological, and
neuroimaging studies. In Progress in Brain Research, Volume 154, Part 1: Visual Perception. Part I.
Fundamentals of Vision: Low and Mid-level Processes in Perception, ed. S. Martinez-Conde, S. L.
Macknik, L. M. Martinez, J. M. Alonso, and P. U. Tse, 33–70. Amsterdam: Elsevier.
Carrasco, M., S. Ling, and S. Read. 2004. Attention alters appearance. Nature Neuroscience 7: 308–313.
Carver, R. A., and V. Brown. 1997. Effects of amount of attention allocated to the location of visual stimulus
pairs on perception of simultaneity. Perception & Psychophysics 59: 534–542.
Cherry, C. E. 1953. Some experiments on the recognition of speech with one and two ears. Journal of the
Acoustical Society of America 25: 975–979.
Corbetta, M., and G. L. Shulman. 2002. Control of goal-directed and stimulus-driven attention in the brain.
Nature Reviews Neuroscience 3: 201–215.
Dennett, D. 1991. Consciousness explained. Boston: Little, Brown & Co.
Deutsch, J. A., and D. Deutsch. 1963. Attention: Some theoretical considerations. Psychological Review 70:
80–90.
Doesburg, S. M., J. J. Green, J. J. McDonald, and L. M. Ward. 2009. From local inhibition to long-range inte-
gration: A functional dissociation of alpha-band synchronization across cortical scales in visuospatial
attention. Brain Research 1303: 97–110.
Driver, J., and T. Noesselt. 2008. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’
brain regions, neural responses, and judgments. Neuron 57: 11–23.
Driver, J., and C. Spence. 2004. Crossmodal spatial attention: Evidence from human performance. In Crossmodal
space and crossmodal attention, ed. C. Spence and J. Driver, 179–220. Oxford: Oxford Univ. Press.
Dufour, A. 1999. Importance of attentional mechanisms in audiovisual links. Experimental Brain Research
126: 215–222.
Eimer, M., J. van Velzen, and J. Driver. 2002. Cross-modal interactions between audition, touch, and vision in
endogenous spatial attention: ERP evidence on preparatory states and sensory modulations. Journal of
Cognitive Neuroscience 14: 254–271.
Eimer, M., and E. Schröger. 1998. ERP effects of intermodal attention and cross-modal links in spatial atten-
tion. Psychophysiology 35: 313–327.
Eriksen, C. W., and J. E. Hoffman. 1972. Temporal and spatial characteristics of selective encoding from visual
displays. Perception & Psychophysics 12: 201–204.
Farah, M. J., A. B. Wong, M. A. Monheit, and L. A. Morrow. 1989. Parietal lobe mechanisms of spatial atten-
tion—modality-specific or supramodal. Neuropsychologia 27: 461–470.
Fechner, G. T. 1882. Revision der Hauptpunkte der Psychophysik. Leipzig: Breitkopf & Härtel.
Folk, C. L., R. W. Remington, and J. C. Johnston. 1992. Involuntary covert orienting is contingent on atten-
tional control settings. Journal of Experimental Psychology: Human Perception and Performance 18:
1030–1044.
Foxe, J. J., and C. E. Schroeder. 2005. The case for feedforward multisensory convergence during early cortical
processing. Neuroreport 16: 419–423.
Fukuda, K., and E. K. Vogel. 2009. Human variation in overriding attentional capture. Journal of Neuroscience
29: 8726–8733.
Fuller, S., R. Z. Rodriguez, and M. Carrasco. 2008. Apparent contrast differs across the vertical meridian:
Visual and attentional factors. Journal of Vision 8: 1–16.
Green, D. M., and J. A. Swets. 1966. Signal detection theory and psychophysics. New York: Wiley.
Green, J. J., and J. J. McDonald. 2008. Electrical neuroimaging reveals timing of attentional control activity in
human brain. PLoS Biology 6: e81.
Heinze, H. J., G. R. Mangun, and S. A. Hillyard. 1990. Visual event-related potentials index perceptual accu-
racy during spatial attention to bilateral stimuli. In Psychophysiological Brain Research, ed. C. Brunia et
al., 196–202. Tilburg, The Netherlands: Tilburg Univ. Press.
Heinze, H. J., G. R. Mangun, W. Burchert et al. 1994. Combined spatial and temporal imaging of brain activity
during visual selective attention in humans. Nature 372: 543–546.
Helmholtz, H. V. 1866. Treatise on physiological optics, 3rd ed., Vols. 2 & 3. Rochester: Optical Society of
America.
Hillyard, S. A., G. V. Simpson, D. L. Woods, S. Van Voorhis, and T. F. Münte. 1984. Event-related brain poten-
tials and selective attention to different modalities. In Cortical Integration, ed. F. Reinoso-Suarez and C.
Ajmone-Marsan, 395–414. New York: Raven Press.
James, W. 1890. The principles of psychology. New York: Henry Holt.
LaBerge, D. 1995. Attentional processing: The brain’s art of mindfulness. Cambridge, MA: Harvard Univ.
Press.
Ling, S., and M. Carrasco. 2007. Transient covert attention does alter appearance: A reply to Schneider 2006.
Perception & Psychophysics 69: 1051–1058.
Lu, Z. L., and B. A. Dosher. 1998. External noise distinguishes attention mechanisms. Vision Research 38:
1183–1198.
Luce, P. A. 1986. A computational analysis of uniqueness points in auditory word recognition. Perception &
Psychophysics 39: 155–158.
Luck, S. J., H. J. Heinze, G. R. Mangun, and S. A. Hillyard. 1990. Visual event-related potentials index
focussed attention within bilateral stimulus arrays: II. Functional dissociation of P1 and N1 components.
Electroencephalography and Clinical Neurophysiology 75: 528–542.
Luck, S. J., S. A. Hillyard, M. Mouloua, and H. L. Hawkins. 1996. Mechanisms of visual–spatial attention:
Resource allocation or uncertainty reduction? Journal of Experimental Psychology: Human Perception
and Performance 22: 725–737.
Luck, S. J., S. A. Hillyard, M. Mouloua, M. G. Woldorff, V. P. Clark, and H. L. Hawkins. 1994. Effects of spa-
tial cuing on luminance detectability: Psychophysical and electrophysiological evidence for early selec-
tion. Journal of Experimental Psychology: Human Perception and Performance 20: 887–904.
Macaluso, E., C. D. Frith, and J. Driver. 2000. Modulation of human visual cortex by crossmodal spatial atten-
tion. Science 289: 1206–1208.
McDonald, J. J., and J. J. Green. 2008. Isolating event-related potential components associated with voluntary
control of visuo-spatial attention. Brain Research 1227: 96–109.
McDonald, J. J., W. A. Teder-Sälejärvi, F. Di Russo, and S. A. Hillyard. 2003. Neural substrates of perceptual
enhancement by cross-modal spatial attention. Journal of Cognitive Neuroscience 15: 10–19.
McDonald, J. J., W. A. Teder-Sälejärvi, F. Di Russo, and S. A. Hillyard. 2005. Neural basis of auditory-induced
shifts in visual time-order perception. Nature Neuroscience 8: 1197–1202.
McDonald, J. J., W. A. Teder-Sälejärvi, D. Heraldez, and S. A. Hillyard. 2001. Electrophysiological evidence
for the “missing link” in crossmodal attention. Canadian Journal of Experimental Psychology 55:
141–149.
McDonald, J. J., W. A. Teder-Sälejärvi, and S. A. Hillyard. 2000. Involuntary orienting to sound improves
visual perception. Nature 407: 906–908.
McDonald, J. J., and L. M. Ward. 1999. Spatial relevance determines facilitatory and inhibitory effects of audi-
tory covert spatial orienting. Journal of Experimental Psychology: Human Perception and Performance
25: 1234–1252.
McDonald, J. J., and L. M. Ward. 2000. Involuntary listening aids seeing: Evidence from human electrophysiol-
ogy. Psychological Science 11: 167–171.
Meredith, M. A., J. W. Nemitz, and B. E. Stein. 1987. Determinants of multisensory integration in superior
colliculus neurons: 1. Temporal factors. Journal of Neuroscience 7: 3215–3229.
Meredith, M. A., and B. E. Stein. 1996. Spatial determinants of multisensory integration in cat superior col-
liculus neurons. Journal of Neurophysiology 75: 1843–1857.
Mishra, J., A. Martinez, T. J. Sejnowski, and S. A. Hillyard. 2007. Early cross-modal interactions in auditory
and visual cortex underlie a sound-induced visual illusion. Journal of Neuroscience 27: 4120–4131.
Mishra, J., A. Martinez, and S. A. Hillyard. 2010. Effect of attention on early cortical processes associated with
the sound-induced extra flash illusion. Journal of Cognitive Neuroscience 22: 1714–1729.
Pashler, H. E. 1998. The psychology of attention. Cambridge, MA: MIT Press.
Posner, M. I. 1978. Chronometric explorations of mind. Hillsdale, NJ: Lawrence Erlbaum.
Posner, M. I., Y. Cohen, and R. D. Rafal. 1982. Neural systems control of spatial orienting. Philosophical
Transactions of the Royal Society of London Series B: Biological Sciences 298: 187–198.
Posner, M. I., and M. E. Raichle. 1994. Images of mind. New York: W. H. Freeman.
Posner, M. I., J. A. Walker, F. J. Friedrich, and R. D. Rafal. 1984. Effects of parietal injury on covert orienting
of attention. Journal of Neuroscience 4: 1863–1874.
Prime, D. J., J. J. McDonald, J. Green, and L. M. Ward. 2008. When cross-modal spatial attention fails.
Canadian Journal of Experimental Psychology 62: 192–197.
Prinzmetal, W., V. Long, and J. Leonhardt. 2008. Involuntary attention and brightness contrast. Perception &
Psychophysics 70: 1139–1150.
Prinzmetal, W., C. McCool, and S. Park. 2005. Attention: Reaction time and accuracy reveal different mecha-
nisms. Journal of Experimental Psychology: General 134: 73–92.
Rhodes, G. 1987. Auditory attention and the representation of spatial information. Perception & Psychophysics
42: 1–14.
Santangelo, V., and C. Spence. 2008. Crossmodal attentional capture in an unspeeded simultaneity judgement
task. Visual Cognition 16: 155–165.
Schneider, K. A., and D. Bavelier. 2003. Components of visual prior entry. Cognitive Psychology 47:
333–366.
Schneider, K. A., and M. Komlos. 2008. Attention biases decisions but does not alter appearance. Journal of
Vision 8: 1–10.
Schroeder, C. E., and J. J. Foxe. 2005. Multisensory contributions to low-level, 'unisensory' processing. Current
Opinion in Neurobiology 15: 454–458.
Shaw, M. L. 1982. Attending to multiple sources of information: 1. The integration of information in decision-
making. Cognitive Psychology 14: 353–409.
Shaw, M. L. 1984. Division of attention among spatial locations: A fundamental difference between detection
of letters and detection of luminance increments. In Attention and Performance X, ed. H. Bouma and
D. G. Bouwhui, 109–121. Hillsdale, NJ: Erlbaum.
Shimojo, S., S. Miyauchi, and O. Hikosaka. 1997. Visual motion sensation yielded by non-visually driven
attention. Vision Research 37: 1575–1580.
Shiu, L. P., and H. Pashler. 1994. Negligible effect of spatial precueing on identification of single digits. Journal
of Experimental Psychology: Human Perception and Performance 20: 1037–1054.
Shore, D. I., C. Spence, and R. M. Klein. 2001. Visual prior entry. Psychological Science 12: 205–212.
Smith, P. L. 2000. Attention and luminance detection: Effects of cues, masks, and pedestals. Journal of
Experimental Psychology: Human Perception and Performance 26: 1401–1420.
Smith, P. L., and R. Ratcliff. 2009. An integrated theory of attention and decision making in visual signal detec-
tion. Psychological Review 116: 283–317.
Soto-Faraco, S., J. McDonald, and A. Kingstone. 2002. Gaze direction: Effects on attentional orienting and
crossmodal target responses. Poster presented at the annual meeting of the Cognitive Neuroscience
Society, San Francisco, CA.
Spence, C. J., and J. Driver. 1994. Covert spatial orienting in audition—exogenous and endogenous mecha-
nisms. Journal of Experimental Psychology: Human Perception and Performance 20: 555–574.
Spence, C., and J. Driver. 1997. Audiovisual links in exogenous covert spatial orienting. Perception &
Psychophysics 59: 1–22.
Spence, C., and J. J. McDonald. 2004. The crossmodal consequences of the exogenous spatial orienting of
attention. In The handbook of multisensory processing, ed. G. A. Calvert, C. Spence, and B. E. Stein,
3–25. Cambridge, MA: MIT Press.
Spence, C., J. J. McDonald, and J. Driver. 2004. Exogenous spatial cuing studies of human crossmodal atten-
tion and multisensory integration. In Crossmodal space and crossmodal attention, ed. C. Spence and J.
Driver, 277–320. Oxford: Oxford Univ. Press.
Sperling, G., and B. A. Dosher. 1986. Strategy and optimization in human information processing. In Handbook
of Perception and Human Performance, ed. K. R. Boff, L. Kaufman, and J. P. Thomas, 1–65. New York:
Wiley.
Stein, B. E., and M. A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Stein, B. E., T. R. Stanford, R. Ramachandran, T. J. Perrault, and B. A. Rowland. 2009. Challenges in quantify-
ing multisensory integration: Alternative criteria, models, and inverse effectiveness. Experimental Brain
Research 198: 113–126.
Stelmach, L. B., and C. M. Herdman. 1991. Directed attention and perception of temporal-order. Journal of
Experimental Psychology: Human Perception and Performance 17: 539–550.
Stevens, H. C. 1904. A simple complication pendulum for qualitative work. American Journal of Psychology
15: 581.
Stone, J. V., N. M. Hunkin, J. Porrill et al. 2001. When is now? Perception of simultaneity. Proceedings of the
Royal Society of London Series B: Biological Sciences 268: 31–38.
Störmer, V. S., J. J. McDonald, and S. A. Hillyard. 2009. Cross-modal cueing of attention alters appearance and
early cortical processing of visual stimuli. Proceedings of the National Academy of Sciences 106: 22456–22461.
Talsma, D., and M. G. Woldorff. 2005. Selective attention and multisensory integration: Multiple phases of
effects on the evoked brain activity. Journal of Cognitive Neuroscience 17: 1098–1114.
Tassinari, G., S. Aglioti, L. Chelazzi, A. Peru, and G. Berlucchi. 1994. Do peripheral non-informative cues
induce early facilitation of target detection? Vision Research 34: 179–189.
Teder-Sälejärvi, W. A., T. F. Münte, F. J. Sperlich, and S. A. Hillyard. 1999. Intra-modal and cross-modal
spatial attention to auditory and visual stimuli. An event-related brain potential study. Cognitive Brain
Research 8: 327–343.
Titchener, E. B. 1908. Lectures on the elementary psychology of feeling and attention. New York: The
MacMillan Company.
Treisman, A., and G. Geffen. 1967. Selective attention: Perception or response? Quarterly Journal of
Experimental Psychology 19: 1–18.
Vibell, J., C. Klinge, M. Zampini, C. Spence, and A. C. Nobre. 2007. Temporal order is coded temporally in the
brain: Early event-related potential latency shifts underlying prior entry in a cross-modal temporal order
judgment task. Journal of Cognitive Neuroscience 19: 109–120.
Ward, L. M., J. J. McDonald, and N. Golestani. 1998. Cross-modal control of attention shifts. In Visual atten-
tion, ed. R. D. Wright, 232–268. New York: Oxford Univ. Press.
Ward, L. M., J. J. McDonald, and D. Lin. 2000. On asymmetries in cross-modal spatial attention orienting.
Perception & Psychophysics 62: 1258–1264.
Watt, R. J. 1991. Understanding vision. San Diego, CA: Academic Press.
Worden, M. S., J. J. Foxe, N. Wang, and G. V. Simpson. 2000. Anticipatory biasing of visuospatial attention
indexed by retinotopically specific alpha-band electroencephalography increases over occipital cortex. Journal
of Neuroscience 20 (RC63): 1–6.
Wright, R. D., and L. M. Ward. 2008. Orienting of attention. New York: Oxford Univ. Press.
Wundt, W. 1874. Grundzüge der physiologischen Psychologie [Foundations of physiological psychology].
Leipzig, Germany: Wilhelm Engelmann.
27 The Colavita Visual Dominance Effect
Charles Spence, Cesare Parise, and Yi-Chuan Chen

CONTENTS
27.1 Introduction........................................................................................................................... 529
27.2 Basic Findings on Colavita Visual Dominance Effect.......................................................... 531
27.2.1 Stimulus Intensity...................................................................................................... 531
27.2.2 Stimulus Modality..................................................................................................... 531
27.2.3 Stimulus Type............................................................................................................ 532
27.2.4 Stimulus Position....................................................................................................... 532
27.2.5 Bimodal Stimulus Probability................................................................................... 532
27.2.6 Response Demands.................................................................................................... 533
27.2.7 Attention.................................................................................................................... 533
27.2.8 Arousal...................................................................................................................... 534
27.2.9 Practice Effects.......................................................................................................... 535
27.3 Interim Summary.................................................................................................................. 537
27.4 Prior Entry and Colavita Visual Dominance Effect.............................................................. 537
27.5 Explaining the Colavita Visual Dominance Effect...............................................................540
27.5.1 Accessory Stimulus Effects and Colavita Effect.......................................................540
27.5.2 Perceptual and Decisional Contributions to Colavita Visual Dominance Effect...... 541
27.5.3 Stimulus, (Perception), and Response?...................................................................... 542
27.6 Biased (or Integrated) Competition and Colavita Visual Dominance Effect........................ 545
27.6.1 Putative Neural Underpinnings of Modality-Based Biased Competition................. 545
27.6.2 Clinical Extinction and Colavita Visual Dominance Effect..................................... 547
27.7 Conclusions and Questions for Future Research...................................................................548
27.7.1 Modeling the Colavita Visual Dominance Effect..................................................... 549
27.7.2 Multisensory Facilitation versus Interference........................................................... 549
References....................................................................................................................................... 550

27.1  INTRODUCTION
Visually dominant behavior has been observed in many different species, including birds, cows, dogs,
and humans (e.g., Partan and Marler 1999; Posner et al. 1976; Uetake and Kudo 1994; Wilcoxin et al.
1971). This has led researchers to suggest that visual stimuli may constitute “prepotent” stimuli for
certain classes of behavioral responses (see Colavita 1974; Foree and LoLordo 1973; LoLordo 1979;
Meltzer and Masaki 1973; Shapiro et al. 1980). One particularly impressive example of vision’s
dominance over audition (and more recently, touch) has come from research on the Colavita visual
dominance effect (Colavita 1974). In the basic experimental paradigm, participants have to make
speeded responses to a random series of auditory (or tactile), visual, and audiovisual (or visuotac-
tile) targets, all presented at a clearly suprathreshold level. Participants are instructed to make one
response whenever an auditory (or tactile) target is presented, another response whenever a visual
target is presented, and to make both responses whenever the auditory (or tactile) and visual targets

are presented at the same time (i.e., on the bimodal target trials). Typically, the unimodal targets are
presented more frequently than the bimodal targets (the ratio of 40% auditory—or tactile—targets,
40% visual targets, and 20% bimodal targets has often been used; e.g., Koppen and Spence 2007a,
2007b, 2007c). The striking result to have emerged from a number of studies on the Colavita effect
is that although participants have no problem in responding rapidly and accurately to the unimodal
targets, they often fail to respond to the auditory (or tactile) targets on the bimodal target trials (see
Figure 27.1a and b). It is almost as if the simultaneous presentation of the visual target leads to the
“extinction” of the participants’ perception of, and/or response to, the nonvisual target on a propor-
tion of the bimodal trials (see Egeth and Sager 1977; Hartcher-O’Brien et al. 2008; Koppen et al.
2009; Koppen and Spence 2007c).
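
To make the scoring of the paradigm concrete, the minimal Python sketch below generates a randomized trial list using the commonly reported 40/40/20 split (cf. the 300-trial sessions described in the caption of Figure 27.1) and computes the Colavita effect as the percentage of visual-only responses minus the percentage of nonvisual-only responses on the bimodal trials. The function names and the toy error counts in the example are illustrative assumptions, not values taken from any particular study.

import random

def make_trials(n_trials=300, p_auditory=0.4, p_visual=0.4, p_bimodal=0.2):
    """Build a randomized trial list with the commonly used 40/40/20 split."""
    trials = (["A"] * round(n_trials * p_auditory) +
              ["V"] * round(n_trials * p_visual) +
              ["AV"] * round(n_trials * p_bimodal))
    random.shuffle(trials)
    return trials

def colavita_effect(bimodal_responses):
    """bimodal_responses: one response set per bimodal (AV) trial, e.g.,
    {"A", "V"} (correct), {"V"} (visual-only error), or {"A"} (auditory-only
    error). Returns % visual-only minus % auditory-only responses; a positive
    value indicates visual dominance."""
    n = len(bimodal_responses)
    visual_only = 100 * sum(r == {"V"} for r in bimodal_responses) / n
    auditory_only = 100 * sum(r == {"A"} for r in bimodal_responses) / n
    return visual_only - auditory_only

# Toy example: on 60 bimodal trials a participant misses the sound 12 times
# and the light twice; the Colavita effect is (12 - 2) / 60 * 100, i.e.,
# roughly 17 percentage points of visual dominance.
responses = [{"V"}] * 12 + [{"A"}] * 2 + [{"A", "V"}] * 46
print(round(colavita_effect(responses), 1))  # prints 16.7

For comparison, the experiments summarized in Figure 27.1 correspond to differences of roughly 24 to 45 percentage points computed in this way.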
Although the majority of research on the Colavita effect has focused on the pattern of errors
made by participants in the bimodal target trials, it is worth noting that visual dominance can also
show up in reaction time (RT) data. For example, Egeth and Sager (1977) reported that although
participants responded more rapidly to unimodal auditory targets than to unimodal visual targets,
this pattern of results was reversed on the bimodal target trials—that is, participants responded

[Figure 27.1 bar graphs, panels (a) Audiovisual Colavita, (b) Visuotactile Colavita, (c) Audiovisual Colavita with caffeine, and (d) Audiovisual Colavita with placebo: each panel plots the percentage of responses (both responses, vision-only, and audition-only or touch-only) as a function of response type, with the Colavita effect marked as the difference between the vision-only and nonvisual-only bars.]

FIGURE 27.1  Results of experiments conducted by Elcock and Spence (2009) highlighting a significant
Colavita visual dominance effect over both audition (a) and touch (b). Values reported in the graphs refer to the
percentage of bimodal target trials in which participants correctly made both responses, or else made either
a visual-only or auditory- (tactile-) only response. The order in which the two experiments were performed
was counterbalanced across participants. Nine participants (age, 18–22 years) completed 300 experimental
trials (40% auditory, 40% visual, and 20% bimodal; plus 30 unimodal practice trials) in each experiment. In the
audiovisual experiment (a), the auditory stimulus consisted of a 4000-Hz pure tone (presented at 63 dB) and the
visual stimulus consisted of illumination of the loudspeaker cone by an LED (64.3 cd/m²). In the visuotactile
experiment (b), the tactile stimulus was presented to a finger on the participant's left hand, and the visual target
now consisted of illumination of the same finger. Thus, auditory, visual, and tactile stimuli were presented from exactly the same
spatial location. Participants were given 2500 ms from the onset of the target in which to respond, and intertrial
interval was set at 650 ms. The Colavita effect was significant in both cases, that is, participants in audiovisual
experiment made 45% more visual-only than auditory-only responses, whereas participants in visuotactile
experiment made 41% more visual-only than tactile-only responses. (c and d) Results from Elcock and Spence’s
Experiment 3, in which they investigated the effects of caffeine (c) versus a placebo pill (d) on the audiovisual
Colavita visual dominance effect. The results show that participants made significantly more visual-only than
auditory-only responses in both conditions (24% and 29% more, respectively), although there was no significant
difference between the magnitude of the Colavita visual dominance effect reported in the two cases.

more rapidly to the visual targets than to the auditory targets. Note that Egeth and Sager made sure
that their participants always responded to both the auditory and visual targets on the bimodal tri-
als by presenting each target until the participant had made the relevant behavioral response.* A
similar pattern of results in the RT data has also been reported in a number of other studies (e.g.,
Colavita 1974, 1982; Colavita and Weisberg 1979; Cooper 1998; Koppen and Spence 2007a; Sinnett
et al. 2007; Zahn et al. 1994).
In this article, we will focus mainly (although not exclusively) on the Colavita effect present in the
error data (in line with the majority of published research on this phenomenon). We start by sum-
marizing the basic findings to have emerged from studies of the Colavita visual dominance effect
conducted over the past 35 years or so. By now, many different factors have been investigated in
order to determine whether they influence the Colavita effect. Here, they are grouped into stimulus-related
factors (such as stimulus intensity, stimulus modality, stimulus type, stimulus position, and
bimodal stimulus probability) and task/participant-related factors (such as attention, arousal, task/
response demands, and practice). A range of potential explanations for the Colavita effect are evalu-
ated, and all are shown to be lacking. A new account of the Colavita visual dominance effect is
therefore proposed, one that is based on the “biased competition” model put forward by Desimone
and Duncan (1995; see also Duncan 1996; Peers et al. 2005). Although this model was initially
developed in order to provide an explanation for the intramodal competition taking place between
multiple visual object representations in both normal participants and clinical patients (suffering
from extinction), here we propose that it can be extended to provide a helpful framework in which
to understand what may be going on in the Colavita visual dominance effect. In particular, we argue
that a form of cross-modal biased competition can help to explain why participants respond to the
visual stimulus while sometimes failing to respond to the nonvisual stimulus on the bimodal target
trials in the Colavita paradigm. More generally, it is our hope that explaining the Colavita visual
dominance effect may provide an important step toward understanding the mechanisms underlying
multisensory interactions. First, though, we review the various factors that have been hypothesized
to influence the Colavita visual dominance effect.

27.2  BASIC FINDINGS ON COLAVITA VISUAL DOMINANCE EFFECT


27.2.1  Stimulus Intensity
The Colavita visual dominance effect occurs regardless of whether the auditory and visual stimuli
are presented at the same (subjectively matched) intensity (e.g., Colavita 1974; Koppen et al. 2009;
Zahn et al. 1994) or the auditory stimulus is presented at an intensity that is rated subjectively as
being twice that of the visual stimulus (see Colavita 1974, Experiment 2). Hartcher-O’Brien et al.
(2008; Experiment 4) have also shown that vision dominates over touch under conditions in which
the intensity of the tactile stimulus is matched to that of the visual stimulus (presented at the 75%
detection threshold). Taken together, these results suggest that the dominance of vision over both
audition and touch in the Colavita paradigm cannot simply be attributed to any systematic differ-
ences in the relative intensity of the stimuli that have been presented to participants in previous
studies (but see also Gregg and Brogden 1952; O’Connor and Hermelin 1963; Smith 1933).

27.2.2  Stimulus Modality


Although the majority of the research on the Colavita visual dominance effect has investigated the
dominance of vision over audition, researchers have recently shown that vision also dominates over

* That is, the visual target was only turned off once the participants made a visual response, and the auditory target was
only turned off when the participants made an auditory response. This contrasts with Colavita’s (1974) studies, in which
a participant’s first response turned off all the stimuli, and with other more recent studies in which the targets were only
presented briefly (i.e., for 50 ms; e.g., Koppen and Spence 2007a, 2007b, 2007c, 2007d).

touch in normal participants (Hartcher-O’Brien et al. 2008, 2010; Hecht and Reiner 2009; see also
Gallace et al. 2007). Costantini et al. (2007) have even reported that vision dominates over touch
in extinction patients (regardless of whether the two stimuli were presented from the same position,
or from different sides; see also Bender 1952). Interestingly, however, no clear pattern of sensory
dominance has, as yet, been observed when participants respond to simultaneously presented audi-
tory and tactile stimuli (see Hecht and Reiner 2009; Occelli et al. 2010; but see Bonneh et al. 2008,
for a case study of an autistic child who exhibited auditory dominance over both touch and vision).
Intriguingly, Hecht and Reiner (2009) have recently reported that vision no longer dominates
when targets are presented in all three modalities (i.e., audition, vision, and touch) at the same time.
In their study, the participants were given a separate button with which to respond to the targets
in each modality, and had to press one, two, or three response keys depending on the combination
of target modalities that happened to be presented on each trial. Whereas vision dominated over
both audition and touch in the bimodal target trials, no clear pattern of dominance was shown on
the trimodal target trials (see also Shapiro et al. 1984, Experiment 3). As yet, there is no obvious
explanation for this result.

27.2.3  Stimulus Type


The Colavita visual dominance effect has been reported for both onset and offset targets (Colavita
and Weisberg 1979; see also Osborn et al. 1963). The effect occurs both with simple stimuli (i.e.,
tones, flashes of light, and brief taps on the skin) and also with more complex stimuli, including
pictures of objects and realistic object sounds, and with auditory and visual speech stimuli (see
Koppen et al. 2008; Sinnett et al. 2007, 2008). The Colavita effect not only occurs when the target
stimuli are presented in isolation (i.e., in an otherwise dark and silent room), but also when they are
embedded within a rapidly presented stream of auditory and visual distractors (Sinnett et al. 2007).
Interestingly, however, the magnitude of the Colavita visual dominance effect does not seem to be
affected by whether or not the auditory and visual targets on the bimodal trials are semantically
congruent (see Koppen et al. 2008).

27.2.4  Stimulus Position


Researchers have also considered what effect, if any, varying either the absolute and/or relative
location from which the stimuli are presented might have on performance in the Colavita task.
The Colavita visual dominance effect occurs both when the auditory stimuli are presented over
headphones and when they are presented from an external loudspeaker placed in front of the par-
ticipant (Colavita 1974, 1982). Researchers have demonstrated that it does not much matter whether
the participants look in the direction of the visual or auditory stimulus or else fixate on some other
intermediate location (see Colavita et al. 1976). Vision’s dominance over both audition and touch
has also been shown to occur regardless of whether the stimuli are presented from the same spatial
location or from different positions (one on either side of fixation), although the Colavita effect is
somewhat larger in the former case (see Hartcher-O’Brien et al. 2008, 2010; Koppen and Spence
2007c). Taken together, these results therefore show that varying either the absolute position (e.g.,
presenting the stimuli from the center vs. in the periphery) or relative position (i.e., presenting the
various stimuli from the same or different positions) from which the target stimuli are presented
has, at most, a relatively modest impact on the magnitude of the Colavita visual dominance effect
(see also Johnson and Shapiro 1989).

27.2.5  Bimodal Stimulus Probability


As already noted, studies on the Colavita visual dominance effect usually present far fewer bimodal
targets than unimodal targets. Nevertheless, researchers have shown that a robust Colavita visual
dominance effect can still be obtained if the probability of each type of target is equalized (i.e.,
when 33.3% auditory, 33.3% visual, and 33.3% bimodal targets are presented; see Koppen and
Spence 2007a). Koppen and Spence (2007d) investigated the effect of varying the probability of
bimodal target trials on the Colavita visual dominance effect (while keeping the relative propor-
tion of unimodal auditory and visual target trials matched).* They found that although a significant
Colavita effect was demonstrated whenever the bimodal targets were presented on 60% or less of
the trials, vision no longer dominated when the bimodal targets were presented on 90% of the tri-
als (see also Egeth and Sager 1974; Manly et al. 1999; Quinlan 2000). This result suggests that the
Colavita effect is not caused by stimulus-related (i.e., sensory) factors, since these should not have
been affected by any change in the probability of occurrence of bimodal targets (cf. Odgaard et al.
2003, 2004, on this point). Instead, the fact that the Colavita effect disappears if the bimodal targets
are presented too frequently (i.e., on too high a proportion of the trials) would appear to suggest that
response-related factors (linked to the probability of participants making bimodal target responses)
are likely to play an important role in helping to explain the Colavita effect (see also Gorea and Sagi
2000).

27.2.6  Response Demands


The majority of studies on the Colavita visual dominance effect have been conducted under condi-
tions in which participants were given a separate response key with which to respond to the targets
presented in each sensory modality. Normally, participants are instructed to respond to the (rela-
tively infrequent) bimodal targets by pressing both response keys. Similar results have, however,
now also been obtained under conditions in which the participants are given a separate response
key with which to respond to the bimodal targets (Koppen and Spence 2007a; Sinnett et al. 2007).
This result rules out the possibility that the Colavita effect is simply caused by participants having
to make two responses at more or less the same time. Surprisingly, Colavita (1974; Experiment
4) showed that participants still made a majority of visual responses after having been explicitly
instructed to respond to the bimodal targets by pressing the auditory response key instead.
Koppen et al. (2008) have also reported that the Colavita effect occurs when participants are
instructed to press one button whenever they either see or hear a dog, another button whenever they
see or hear a cat, and to make both responses whenever a cat and a dog are presented at the same
time. Under such conditions, the visual presentation of the picture of one of these animals resulted
in participants failing to respond to the sound of the other animal (be it the woofing of the dog or
the meowing of the cat) on 10% more of the trials than they failed to respond to the identity of the
visually presented animal. Taken together, these results therefore confirm the fact that the Colavita
visual dominance effect occurs under a variety of different task demands/response requirements
(i.e., it occurs no matter whether participants respond to the sensory modality or semantic identity
of the target stimuli).

27.2.7  Attention
Originally, researchers thought that the Colavita visual dominance effect might simply reflect a
predisposition by participants to direct their attention preferentially toward the visual modality
(Colavita 1974; Posner et al. 1976). Posner et al.’s idea was that people endogenously (or voluntarily)
directed their attention toward the visual modality in order to make up for the fact that visual stimuli
are generally less alerting than stimuli presented in the other modalities (but see Spence et al. 2001b,
footnote 5). Contrary to this suggestion, however, a number of more recent studies have actually

* Note that researchers have also manipulated the relative probability of unimodal auditory and visual targets (see Egeth
and Sager 1977; Quinlan 2000; Sinnett et al. 2007). However, since such probability manipulations have typically been
introduced in the context of trying to shift the focus of a participant’s attention between the auditory and visual modali-
ties, they will be discussed later (see Section 27.2.7).

shown that although the manipulation of a person’s endogenous attention can certainly modulate
the extent to which vision dominates over audition, it cannot in and of itself be used to reverse the
Colavita effect. That is, even when a participant’s attention is directed toward the auditory modality
(i.e., by verbally instructing them to attend to audition or by presenting unimodal auditory targets
much more frequently than unimodal visual targets), people still exhibit either visually dominant
behavior or else their behavior shows no clear pattern of dominance (see Koppen and Spence 2007a,
2007d; Sinnett et al. 2007). These results therefore demonstrate that any predisposition that partici-
pants might have to direct their attention voluntarily (or endogenously) toward the visual modality
cannot explain why vision always seems to dominate in the Colavita visual dominance effect.
De Reuck and Spence (2009) recently investigated whether varying the modality of a second-
ary task would have any effect on the magnitude of the Colavita visual dominance effect. To
this end, a video game (“Food boy” by T3Software) and a concurrent auditory speech stream
(consisting of pairs of auditory words delivered via a central loudspeaker) were presented in the
background while participants performed the two-response version of the Colavita task (i.e.,
pressing one key in response to auditory targets, another key in response to visual targets, and
both response keys on the bimodal target trials; the auditory targets in this study consisted of a
4000-Hz pure tone presented from a loudspeaker cone placed in front of the computer screen,
whereas the visual target consisted of the illumination of a red light-emitting diode (LED), also
mounted in front of the computer screen). In the condition involving the secondary visual task,
the participants performed the Colavita task with their right hand while playing the video game
with their left hand (note that the auditory distracting speech streams were presented in the back-
ground, although they were irrelevant in this condition and so could be ignored). The participants
played the video game using a computer mouse to control a character moving across the bottom
of the computer screen. The participants had to “swallow” as much of the food dropping from the
top of the screen as possible, while avoiding any bombs that happened to fall. In the part of the
study involving an auditory secondary task, the video game was run in the demonstration mode
to provide equivalent background visual stimulation to the participants who now had to respond
by pressing a button with their left hand whenever they heard an animal name in the auditory
stream.
The results showed that the modality of the secondary task (auditory or visual) did not modulate
the magnitude of the Colavita visual dominance effect significantly, that is, the participants failed
to respond to a similar number of the auditory stimuli regardless of whether they were performing a
secondary task that primarily involved participants having to attend to the auditory or visual modal-
ity. De Reuck and Spence’s (2009) results therefore suggest that the Colavita visual dominance
effect may be insensitive to manipulations of participants’ attention toward either the auditory or
visual modality that are achieved by varying the requirements of a simultaneously performed sec-
ondary task (see Spence and Soto-Faraco 2009).
Finally, Koppen and Spence (2007a) have shown that exogenously directing a participant’s atten-
tion toward either the auditory or visual modality via the presentation of a task-irrelevant nonpre-
dictive auditory or visual cue 200 ms before the onset of the target (see Rodway 2005; Spence et al.
2001a; Turatto et al. 2002) has only a marginal effect on the magnitude of vision’s dominance over
audition (see also Golob et al. 2001). Taken together, the results reported in this section therefore
highlight the fact that although attentional manipulations (be they exogenous or endogenous) can
sometimes be used to modulate, or even to eliminate, the Colavita visual dominance effect, they
cannot be used to reverse it.

27.2.8  Arousal
Early animal research suggested that many examples of visual dominance could be reversed under
conditions in which an animal was placed in a highly aroused state (i.e., when, for example, fear-
ful of the imminent presentation of an electric shock; see Foree and LoLordo 1973; LoLordo and
Furrow 1976; Randich et al. 1978). It has been reported that although visual stimuli tend to control
appetitive behaviors, auditory stimuli tend to control avoidance behaviors in many species. Shapiro
et al. (1984) extended the idea that changes in the level of an organism’s arousal might change the
pattern of sensory dominance in the Colavita task to human participants (see also Johnson and
Shapiro 1989; Shapiro and Johnson 1987). They demonstrated what looked like auditory domi-
nance (i.e., participants making more auditory-only than visual-only responses in the Colavita task)
under conditions in which their participants were aversively motivated (by the occurrence of electric
shock, or to a lesser extent by the threat of electric shock, or tactile stimulation, presented after the
participants’ response on a random 20% of the trials).
It should, however, be noted that no independent measure of the change in a participant’s level
of arousal (i.e., such as a change in their galvanic skin response) was provided in this study. What
is more, Shapiro et al.’s (1984) participants were explicitly told to respond to the stimulus that they
perceived first on the bimodal target trials, that is, the participants effectively had to perform a tem-
poral order judgment (TOJ) task. What this means in practice is that their results (and those from
the study of Shapiro and Johnson (1987) and Johnson and Shapiro (1989), in which similar instruc-
tions were given) may actually reflect the effects of arousal on “prior entry” (see Spence 2010; Van
Damme et al. 2009b), rather than, as the authors argued, the effects of arousal on the Colavita visual
dominance effect.
Indeed, the latest research has demonstrated that increased arousal can lead to the prior entry of
certain classes of stimuli over others (when assessed by means of a participant’s responses on a TOJ
task; Van Damme et al. 2009b). In Van Damme et al.’s study, auditory and tactile stimuli delivered
from close to one of the participant’s hands were prioritized when an arousing picture showing
physical threat to a person’s bodily tissues was briefly flashed beforehand from the same (rather
than opposite) location. Meanwhile, Van Damme et al. (2009a) have shown that, when participants
are instructed to respond to both of the stimuli in the bimodal trials, rather than just to the stimulus
that the participant happens to have perceived first, the effects of arousal on the Colavita visual
dominance effect are far less clear-cut (we return later to the question of what role, if any, prior entry
plays in the Colavita visual dominance effect).
Elcock and Spence (2009) recently investigated the consequences for the Colavita effect of phar-
macologically modulating the participants’ level of arousal by administering caffeine. Caffeine
is known to increase arousal and hence, given Shapiro et al.’s (1984) research, ingesting caffeine
might be expected to modulate the magnitude of the Colavita visual dominance effect (Smith et al.
1992).* To this end, 15 healthy participants were tested in a within-participants, double-blind study,
in which a 200-mg caffeine tablet (equivalent to drinking about two cups of coffee) was taken 40
min before one session of the Colavita task and a visually identical placebo pill was taken before the
other session (note that the participants were instructed to refrain from consuming any caffeine in
the morning before taking part in the study). The Colavita visual dominance effect was unaffected
by whether the participants had ingested the caffeine tablet or the placebo (see Figure 27.1c and d).
Taken together, the results reported in this section would therefore appear to suggest that, contrary
to Shapiro et al.’s early claim, the magnitude of the Colavita visual dominance effect is not affected
by changes in a participant’s level of arousal.

27.2.9  Practice Effects


The largest Colavita visual dominance effects have been reported in studies in which only a small
number of bimodal target trials were presented. In fact, by far the largest effects on record were
reported by Frank B. Colavita himself in his early research (see Koppen and Spence 2007a, Table 1,

* Caffeine is a stimulant that accelerates physiological activity, and results in the release of adrenaline and the increased
production of the neurotransmitter dopamine. Caffeine also interferes with the operation of another neurotransmitter:
adenosine (Smith 2002; Zwyghuizen-Doorenbos et al. 1990).

[Figure 27.2 graph: errors (% of bimodal trials) for auditory-only and visual-only responses plotted as a function of SOA (ms), ranging from 600 ms audition-first to 600 ms vision-first, with regions of auditory dominance and visual dominance indicated.]

FIGURE 27.2  Graph highlighting the results of Koppen and Spence’s (2007b) study of Colavita effect in
which auditory and visual targets on bimodal target trials could be presented at any one of 10 SOAs. Although
a significant visual dominance effect was observed at a majority of asynchronies around objective simulta-
neity, a significant auditory dominance effect was only observed at the largest auditory-leading asynchrony.
Shaded gray band in the center of the graph represents the temporal window of audiovisual integration. Shaded
areas containing the ear and the eye schematically highlight SOAs at which auditory and visual dominance,
respectively, were observed. Note though (see text on this point) that differences between the proportion of
auditory-only and visual-only responses only reached statistical significance at certain SOAs (that said, the
trend in the data is clear). The error bars represent standard errors of means.

for a review). In these studies, each participant was only ever presented with a maximum of five
or six bimodal targets (see Colavita 1974, 1982; Colavita et al. 1976; Colavita and Weisberg 1979).
Contrast this with the smaller Colavita effects that have been reported in more recent research,
where as many as 120 bimodal targets were presented to each participant (e.g., Hartcher-O’Brien
et al. 2008; Koppen et al. 2008; Koppen and Spence 2007a, 2007c). This observation leads on to the
suggestion that the Colavita visual dominance effect may be more pronounced early on in the exper-
imental session (see also Kristofferson 1965).* That said, significant Colavita visual dominance
effects have nevertheless still been observed in numerous studies where participants’ performance
has been averaged over many hundreds of trials. Here, it may also be worth considering whether
any reduction in the Colavita effect resulting from increasing the probability of (and/or practice
with responding to) bimodal stimuli may also be related to the phenomenon of response coupling
(see Ulrich and Miller 2008). That is, the more often two independent target stimuli happen to be
presented at exactly the same time, the more likely it is that the participant will start to couple (i.e.,
program) their responses to the two stimuli together.
In the only study (as far as we are aware) to have provided evidence relevant to the question of
the consequence of practice on the Colavita visual dominance effect, the vigilance performance of
a group of participants was assessed over a 3-h period (Osborn et al. 1963). The participants in this
study had to monitor a light and sound source continuously for the occasional (once every 2½ min)
brief (i.e., lasting only 41 ms) offset of either or both of the stimuli. The participants were instructed
to press one button whenever the light was extinguished and another button whenever the sound was
interrupted. The results showed that although participants failed to respond to more of the auditory
than visual targets during the first 30-min session (thus showing a typical Colavita visual domi-
nance effect), this pattern of results reversed in the final four 30-min sessions (i.e., participants made

* Note that if practice were found to reduce the magnitude of the Colavita visual dominance effect, then this might pro-
vide an explanation for why increasing the probability of occurrence of bimodal target trials up to 90% in Koppen and
Spence’s (2007d) study has been shown to eliminate the Colavita effect (see Section 27.2.5). Alternatively, however,
increasing the prevalence (or number) of bimodal targets might also lead to the increased coupling of a participant's
responses on the bimodal trials (see main text for further details; Ulrich and Miller 2008).

more auditory-only than visual-only responses on the bimodal target trials; see Osborn et al. 1963;
Figure 27.2). It is, however, unclear whether these results necessarily reflect the effects of practice
on the Colavita visual dominance effect, or whether instead they may simply highlight the effects
of fatigue or boredom after the participants had spent several hours on the task (given that auditory
events are more likely to be responded to than visual events should the participants temporarily look
away or else close their eyes).

27.3  INTERIM SUMMARY


To summarize, the latest research has confirmed that the Colavita visual dominance effect
is a robust empirical phenomenon. The basic Colavita effect—defined here in terms of participants
failing to respond to the nonvisual stimulus more often than they fail to respond to the visual
stimulus on the bimodal audiovisual or visuotactile target trials—has now been replicated in many
different studies, and by a number of different research groups (although it is worth noting that the
magnitude of the effect has fluctuated markedly from one study to the next). That said, the Colavita
effect appears to be robust to a variety of different experimental manipulations (e.g., of stimulus
intensity, stimulus type, stimulus position, response demands, attention, arousal, etc.). Interestingly,
though, while many experimental manipulations have been shown to modulate the size of the
Colavita visual dominance effect, and a few studies have even been able to eliminate it entirely, only
two of the studies discussed thus far have provided suggestive evidence regarding a reversal of the
Colavita effect in humans (i.e., evidence that is consistent with, although not necessarily providing
strong support for, auditory dominance; see Osborn et al. 1963; Shapiro et al. 1984).
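
To make this operational definition concrete, the short Python sketch below shows one way of quantifying the effect from trial-by-trial response codes on the bimodal trials; the response labels and the small data set are our own invented illustrations, not values taken from any of the studies reviewed here.

from collections import Counter

# Minimal sketch (with invented response codes) of how the basic Colavita
# effect is typically quantified: on bimodal trials, compare the proportion
# of vision-only errors with the proportion of auditory-only errors.
# Response codes: "both" = correct, "vision_only" = auditory target missed
# (the classic Colavita error), "auditory_only" = visual target missed.
bimodal_responses = ["both", "vision_only", "both", "vision_only",
                     "auditory_only", "both", "vision_only", "both"]

counts = Counter(bimodal_responses)
n = len(bimodal_responses)
p_vision_only = counts["vision_only"] / n
p_auditory_only = counts["auditory_only"] / n

# A positive difference indicates visual dominance in the Colavita sense.
print(f"Vision-only errors:   {p_vision_only:.1%}")
print(f"Auditory-only errors: {p_auditory_only:.1%}")
print(f"Colavita effect:      {p_vision_only - p_auditory_only:+.1%}")
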
Having reviewed the majority of the published research on the Colavita visual dominance effect,
and having ruled out accounts of the effect in terms of people having a predisposition to attend
endogenously to the visual modality (see Posner et al. 1976), differences in stimulus intensity
(Colavita 1974), and/or difficulties associated with participants having to make two responses at
the same time on the bimodal target trials (Koppen and Spence 2007a), how should the effect be
explained? Well, researchers have recently been investigating whether the Colavita effect can be
accounted for, at least in part, by the prior entry of the visual stimulus to participants’ awareness
(see Spence 2010; Spence et al. 2001; Titchener 1908). It is to this research that we now turn.

27.4  PRIOR ENTRY AND COLAVITA VISUAL DOMINANCE EFFECT


Koppen and Spence (2007b) investigated whether the Colavita effect might result from the prior
entry of the visual stimulus into participants’ awareness on some proportion of the bimodal target
trials. That is, even though the auditory and visual stimuli were presented simultaneously in the
majority of published studies of the Colavita effect, research elsewhere has shown that a visual
stimulus may be perceived first under such conditions (see Rutschmann and Link 1964). In order
to evaluate the prior entry account of the Colavita visual dominance effect, Koppen and Spence
assessed participants’ perception of the temporal order of pairs of auditory and visual stimuli that
had been used in another part of the study to demonstrate the typical Colavita visual dominance
effect.* Psychophysical analysis of participants’ TOJ performance showed that when the auditory
and visual stimuli were presented simultaneously, participants actually judged the auditory stimulus
to have been presented slightly, although not significantly, ahead of the visual stimulus (i.e., con-
trary to what would have been predicted according to the prior entry account; but see Exner 1875
and Hirsh and Sherrick 1961, for similar results; see also Jaśkowski 1996, 1999; Jaśkowski et al.
1990).

* Note the importance of using the same stimuli within the same pool of participants, given the large individual differences
in the perception of audiovisual simultaneity that have been reported previously (Smith 1933; Spence 2010; Stone et al.
2001).

It is, however, important to note that there is a potential concern here regarding the interpretation
of Koppen and Spence’s (2007b) findings. Remember that the Colavita visual dominance effect is
eliminated when bimodal audiovisual targets are presented too frequently (e.g., see Section 27.2.5).
Crucially, Koppen and Spence looked for any evidence of the prior entry of visual stimuli into
awareness in their TOJ study under conditions in which a pair of auditory and visual stimuli were
presented on each and every trial. The possibility therefore remains that visual stimuli may only
be perceived before simultaneously presented auditory stimuli under those conditions in which the
occurrence of bimodal stimuli is relatively rare (cf. Miller et al. 2009). Thus, in retrospect, Koppen
and Spence’s results cannot be taken as providing unequivocal evidence against the possibility that
visual stimuli have prior entry into participants’ awareness on the bimodal trials in the Colavita
paradigm. Ideally, future research will need to look for any evidence of visual prior entry under
conditions in which the bimodal targets (in the TOJ task) are actually presented as infrequently as
when the Colavita effect is demonstrated behaviorally (i.e., when the bimodal targets requiring a
detection/discrimination response are presented on only 20% or so of the trials).
Given these concerns over the design (and hence interpretation) of Koppen and Spence’s (2007b)
TOJ study, it is interesting to note that Lucey and Spence (2009) were recently able to eliminate
the Colavita visual dominance effect by delaying the onset of the visual stimulus by a fixed 50 ms
with respect to the auditory stimuli on the bimodal target trials. Lucey and Spence used a between-
participants experimental design in which one group of participants completed the Colavita task
with synchronous auditory and visual targets on the bimodal trials (as in the majority of previ-
ous studies), whereas for the other group of participants, the onset of the visual target was always
delayed by 50 ms with respect to that of the auditory target. The apparatus and materials were
identical to those used by Elcock and Spence (2009; described earlier) although the participants
in Lucey and Spence’s study performed the three-button version of the audiovisual Colavita task
(i.e., in which participants had separate response keys for auditory, visual, and bimodal targets).
The results revealed that although participants made significantly more vision-only than auditory-
only responses in the synchronous bimodal condition (10.3% vs. 2.4%, respectively), no significant
Colavita visual dominance effect was reported when the onset of the visual target was delayed (4.6%
vs. 2.9%, respectively; n.s.). These results therefore demonstrate that the Colavita visual dominance
effect can be eliminated by presenting the auditory stimulus slightly ahead of the visual stimu-
lus. The critical question here, following on from Lucey and Spence’s results, is whether auditory
dominance would have been elicited had the auditory stimulus led the visual stimulus by a greater
interval.
Koppen and Spence (2007b) have provided an answer to this question. In their study of the
Colavita effect, the auditory and visual stimuli on the bimodal target trials were presented at one
of 10 stimulus onset asynchronies (SOAs; from auditory leading by 600 ms through to vision lead-
ing by 600 ms). Koppen and Spence found that the auditory lead needed in order to eliminate the
Colavita visual dominance effect on the bimodal target trials was correlated with the SOA at which
participants reliably started to perceive the auditory stimulus as having been presented before the
visual stimulus (defined as the SOA at which participants make 75% audition first responses; see
Koppen and Spence 2007b; Figure 27.3). This result therefore suggests that the prior entry of the
visual stimulus to awareness plays some role in its dominance over audition in the Colavita effect.
That said, however, Koppen and Spence also found that auditory targets had to be presented 600 ms
before visual targets in order for participants to make significantly more auditory-only than
visual-only responses on the bimodal target trials (although a similar nonsignificant trend toward auditory
dominance was also reported at an auditory lead of 300 ms; see Figure 27.2).
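
To make the 75% criterion concrete, the following Python sketch fits a cumulative Gaussian to temporal order judgments across SOAs and reads off the SOA at which "audition first" responses reach 75%; the proportions used here are fabricated purely for illustration and are not Koppen and Spence's data.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Illustrative TOJ data (fabricated): negative SOAs = audition leads (ms).
soas = np.array([-600, -300, -150, -75, -30, 30, 75, 150, 300, 600])
p_vision_first = np.array([0.03, 0.12, 0.25, 0.38, 0.47,
                           0.58, 0.70, 0.83, 0.94, 0.99])

def cum_gauss(soa, pss, sigma):
    # Probability of a "vision first" response as a function of SOA.
    return norm.cdf(soa, loc=pss, scale=sigma)

(pss, sigma), _ = curve_fit(cum_gauss, soas, p_vision_first, p0=[0.0, 100.0])

# SOA at which "vision first" responses fall to 25%, i.e., "audition first" = 75%
# (a negative value means audition leading, given the sign convention above).
soa_75_audition_first = norm.ppf(0.25, loc=pss, scale=sigma)
print(f"PSS = {pss:.1f} ms, sigma = {sigma:.1f} ms")
print(f"75% 'audition first' point = {soa_75_audition_first:.1f} ms")
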
It is rather unclear, however, what exactly caused the auditorily dominant behavior observed at
the 600 ms SOA in Koppen and Spence’s (2007b) study. This (physical) asynchrony between the
auditory and visual stimuli is far greater than any shift in the perceived timing of visual relative to
auditory stimuli that might reasonably be expected due to the prior entry of the visual stimulus to
awareness when the targets were actually presented simultaneously (see Spence 2010). In fact, this

[Figure 27.3 appears here: four panels (a–d) plotting putative neural activity against time, with response criteria and rates of information accrual (R) marked for unimodal auditory (RTA), unimodal visual (RTV), and accessory-stimulus (RTA(V), RTV(A)) conditions.]

FIGURE 27.3  (a) Schematic illustration of the results of Sinnett et al.’s (2008; Experiment 2) speeded target
detection study. The figure shows how the presentation of an accessory sound facilitates visual RTs (RTV(A)),
whereas the presentation of an accessory visual stimulus delays auditory RTs (RTA(V)). Note that unimodal
auditory (RTA) and visual (RTV) response latencies were serendipitously matched in this study (V, visual
target; A, auditory stimulus). (b) Schematic diagrams showing how the asymmetrical cross-modal accessory
stimulus effects reported by Sinnett et al. might lead to more (and more rapid) vision-only than auditory-only
responses on bimodal trials. Conceptually simple models outlined in panels (b) and (c) account for Sinnett et
al.’s asymmetrical RT effect in terms of changes in the criterion for responding to auditory and visual targets
(on bimodal as opposed to unimodal trials; b), or in terms of asymmetrical cross-modal changes in the rate of
information accrual (c). We plot the putative rate of information accrual (R) as a function of the stimuli presented.
However, the results of Koppen et al.'s (2009) recent signal detection study of the Colavita effect have now pro-
vided evidence that is inconsistent with both of these simple accounts (see Figure 27.4). Hence, in panel (d), a
mixture model is proposed in which the presentation of an accessory stimulus in one modality leads both to a
change in criterion for responding to targets in the other modality (in line with the results of Koppen et al.’s,
study) and also to an asymmetrical effect on the rate of information accrual in the other modality (see Koppen
et al. 2007a; Miller 1986).

SOA is also longer than the mean RT of participants’ responses to the unimodal auditory (440 ms)
targets. Given that the mean RT for auditory-only responses on the bimodal target trials was only
470 ms (i.e., 30 ms longer, on average, than the correct responses on the bimodal trials; see Koppen
and Spence 2007b, Figure 1 and Table 1), one can also rule out the possibility that this failure to
report the visual stimulus occurred on trials in which the participants made auditory responses that
were particularly slow. Therefore, given that the visual target on the bimodal trials (in the 600 ms
SOA vision-lagging condition) was likely being extinguished by an already-responded-to auditory
target, one might think that this form of auditory dominance reflects some sort of refractory period
effect (i.e., resulting from the execution of the participants’ response to the first target; see Pashler
1994; Spence 2008), rather than the Colavita effect proper.
In summary, although Koppen and Spence’s (2007b) results certainly do provide an example
of auditory dominance, the mechanism behind this effect is most probably different from the
one causing the visual dominance effect that has been reported in the majority of studies (of the
Colavita effect), where the auditory and visual stimuli were presented simultaneously (see also
Miyake et al. 1986). Thus, although recent research has shown that delaying the presentation of
the visual stimulus can be used to eliminate the Colavita visual dominance effect (see Koppen and
Spence 2007b; Lucey and Spence 2009), and although the SOA at which participants reliably start
to perceive the auditory target as having been presented first correlates with the SOA at which the
Colavita visual dominance effect no longer occurs (Koppen and Spence 2007b), we do not, as yet,
have any convincing evidence that auditory dominance can be observed in the Colavita paradigm
by presenting the auditory stimulus slightly before the visual stimulus on the bimodal target trials
(i.e., at SOAs where the visual target is presented before the participants have initiated/executed
their response to the already-presented auditory target). That is, to date, no simple relationship has
been demonstrated between the SOA on the audiovisual target trials in the Colavita paradigm and
modality dominance. Hence, we need to look elsewhere for an explanation of vision’s advantage
in the Colavita visual dominance effect. Recent progress in understanding what may be going on
here has come from studies looking at the effect of accessory stimuli presented in one modality on
participants’ speeded responding to targets presented in another modality (Sinnett et al. 2008), and
from studies looking at the sensitivity and criterion of participants’ responses in the Colavita task
(Koppen et al. 2009).

27.5  EXPLAINING THE COLAVITA VISUAL DOMINANCE EFFECT


27.5.1  Accessory Stimulus Effects and Colavita Effect
One of the most interesting recent developments in the study of the Colavita effect comes from an
experiment reported by Sinnett et al. (2008; Experiment 2). The participants in this study had to
make speeded target detection responses to either auditory or visual targets. An auditory stimulus
was presented on 40% of the trials, a visual stimulus was presented on a further 40% of the trials,
and both stimuli were presented simultaneously on the remaining 20% of trials (i.e., just as in a typi-
cal study of the Colavita effect; note, however, that this task can also be thought of as a kind of go/
no-go task; see Egeth and Sager 1977; Miller 1986; Quinlan 2000). The participants responded sig-
nificantly more rapidly to the visual targets when they were accompanied by an accessory auditory
stimulus than when they were presented by themselves (see Figure 27.3a). By contrast, participants’
responses to the auditory targets were actually slowed by the simultaneous presentation of an acces-
sory visual stimulus (cf. Egeth and Sager 1977).
How might the fact that the presentation of an auditory accessory stimulus speeds participants’
visual detection/discrimination responses, whereas the presentation of a visual stimulus slows their
responses to auditory stimuli be used to help explain the Colavita visual dominance effect? Well, let
us imagine that participants set one criterion for initiating their responses to the relatively common
unimodal visual targets and another criterion for initiating their responses to the equally common
unimodal auditory targets. Note that the argument here is phrased in terms of changes in the crite-
rion for responding set by participants, rather than in terms of changes in the perceptual threshold,
given the evidence cited below that behavioral responses can sometimes be elicited under conditions
in which participants remain unaware (i.e., they have no conscious access to the inducing stimulus).
According to Sinnett et al.’s (2008) results, the criterion for initiating a speeded response to the
visual targets should be reached sooner on the relatively infrequent bimodal trials than on the uni-
modal visual trials, whereas it should be reached more slowly (on the bimodal than on the unimodal
trials) for auditory targets.
There are at least two conceptually simple means by which such a pattern of behavioral results
could be achieved. First, the participants could lower their criterion for responding to the visual
targets on the bimodal trials while simultaneously raising their criterion for responding to the audi-
tory target (see Figure 27.3b). Alternatively, however, the criterion for initiating a response might not
change but the presentation of the accessory stimulus in one modality might instead have a cross-
modal effect on the rate of information accrual (R) within the other modality (see Figure 27.3c).
The fact that the process of information accrual (like any other internal process) is likely to be a
noisy one might then help to explain why the Colavita effect is only observed on a proportion of the
bimodal target trials. Evidence that is seemingly consistent with both of these simple accounts can
be found in the literature.
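
As an informal illustration of the second possibility, the toy simulation below implements a noisy rate-of-accrual account along the lines of Figure 27.3c: the visual accrual rate is assumed to be boosted, and the auditory rate slowed, on bimodal trials, so that vision reaches the (unchanged) response criterion first on most, but not all, trials. All parameter values are arbitrary choices of ours rather than estimates from any published study.

import numpy as np

rng = np.random.default_rng(0)
criterion = 1.0      # response criterion (assumed identical for both modalities)
dt = 0.001           # time step (s)
n_trials = 2000

def time_to_criterion(rate, noise_sd):
    """Accumulate noisy evidence until the criterion is reached; return time (s)."""
    evidence, t = 0.0, 0.0
    while evidence < criterion:
        evidence += rate * dt + noise_sd * np.sqrt(dt) * rng.standard_normal()
        t += dt
    return t

# Assumed bimodal-trial accrual rates: visual boosted, auditory slowed cross-modally.
rate_visual, rate_auditory, noise_sd = 2.2, 1.8, 0.9

vision_first = sum(
    time_to_criterion(rate_visual, noise_sd) < time_to_criterion(rate_auditory, noise_sd)
    for _ in range(n_trials)
)
print(f"Vision reaches criterion first on {vision_first / n_trials:.1%} of simulated bimodal trials")
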
In particular, evidence consistent with the claim that bimodal (as compared to unimodal) stimu-
lation can result in a change in the rate of information accrual comes from an older go/no-go study
reported by Miller (1986). Unimodal auditory and unimodal visual target stimuli were presented ran-
domly in this experiment together with trials in which both stimuli were presented at one of a range
of different SOAs (0–167 ms). The participants had to make a simple speeded detection response
whenever a target was presented (regardless of whether it was unimodal or bimodal). Catch trials,
in which no stimulus was presented (and no response was required), were also included. Analysis of
the results provided tentative evidence that visual stimuli needed less time to reach the criterion for
initiating a behavioral response (measured from the putative onset of response-related activity) com-
pared to the auditory stimuli on the redundant bimodal target trials—this despite the fact that the
initiation of response-related activation after the presentation of an auditory stimulus started earlier
in time than following the presentation of a visual stimulus (see Miller 1986, pp. 340–341). Taken
together, these results therefore suggest that stimulus-related information accrues more slowly for
auditory targets in the presence (vs. absence) of concurrent visual stimuli than vice versa, just as
highlighted in Figure 27.3c. Similarly, Romei et al.’s (2009) recent results showing that looming
auditory signals enhance visual excitability in a preperceptual manner can also be seen as being
consistent with the information accrual account. However, results arguing for the inclusion of some
component of criterion shifting into one’s model of the Colavita visual dominance effect (although
note that the results are inconsistent with the simple criterion-shifting model put forward in Figure
27.3b) come from a more recent study reported by Koppen et al. (2009).

27.5.2  Perceptual and Decisional Contributions to Colavita Visual Dominance Effect


Koppen et al. (2009) recently explicitly assessed the contributions of perceptual (i.e., threshold) and
decisional (i.e., criterion-related) factors to the Colavita visual dominance effect in a study in which
the intensities of the auditory and visual stimuli were initially adjusted until participants were only
able to detect them on 75% of the trials. Next, a version of the Colavita task was conducted using
these near-threshold stimuli (i.e., rather than the clearly suprathreshold stimuli that have been uti-
lized in the majority of previous studies). A unimodal visual target was presented on 25% of the tri-
als, a unimodal auditory target on 25% of trials, a bimodal audiovisual target on 25% of trials (and
no target was presented on the remaining 25% of trials). The task of reporting which target modali-
ties (if any) had been presented in each trial was unspeeded and the participants were instructed to
refrain from responding on those trials in which no target was presented.

Analysis of Koppen et al.’s (2009) results using signal detection theory (see Green and Swets
1966) revealed that although the presentation of an auditory target had no effect on visual sensitiv-
ity, the presentation of a visual target resulted in a significant drop in participants’ auditory sen-
sitivity (see Figure 27.4a; see also Golob et al. 2001; Gregg and Brogden 1952; Marks et al. 2003;
Odgaard et al. 2003; Stein et al. 1996; Thompson et al. 1958). These results therefore show that
the presentation of a visual stimulus can lead to a small, but significant, lowering of sensitivity to
a simultaneously presented auditory stimulus, at least when the participants’ task involves trying
to detect which target modalities (if any) have been presented.* Koppen et al.’s results suggest that
only a relatively small component of the Colavita visual dominance effect may be attributable to the
asymmetrical cross-modal effect on auditory sensitivity (i.e., on the auditory perceptual threshold)
that results from the simultaneous presentation of a visual stimulus. That is, the magnitude of the
sensitivity drop hardly seems large enough to account for the behavioral effects observed in the
normal speeded version of the Colavita task.
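
For reference, the two signal detection quantities reported by Koppen et al. (2009) can be computed from hit and false-alarm counts in the standard way (Green and Swets 1966); the counts in the Python sketch below are invented purely for illustration.

from scipy.stats import norm

def d_prime_and_criterion(hits, misses, false_alarms, correct_rejections):
    # Sensitivity d' = z(H) - z(F); criterion c = -0.5 * (z(H) + z(F)).
    # A log-linear correction guards against hit/false-alarm rates of 0 or 1.
    h = (hits + 0.5) / (hits + misses + 1)
    f = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z_h, z_f = norm.ppf(h), norm.ppf(f)
    return z_h - z_f, -0.5 * (z_h + z_f)

# Invented auditory detection counts on unimodal vs. bimodal trials.
d_uni, c_uni = d_prime_and_criterion(hits=70, misses=30,
                                     false_alarms=10, correct_rejections=90)
d_bi, c_bi = d_prime_and_criterion(hits=60, misses=40,
                                   false_alarms=12, correct_rejections=88)
print(f"Unimodal auditory: d' = {d_uni:.2f}, c = {c_uni:.2f}")
print(f"Bimodal auditory:  d' = {d_bi:.2f}, c = {c_bi:.2f}")
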
The more important result to have emerged from Koppen et al.’s (2009) study in terms of the
argument being developed here was the significant drop in participants’ criterion for responding on
the bimodal (as compared to the unimodal) target trials. Importantly, this drop was significantly
larger for visual than for auditory targets (see Figure 27.4b). The fact that the criterion dropped
for both auditory and visual targets is inconsistent with the simple criterion shifting account of
the asymmetrical cross-modal effects highlighted by Sinnett et al. (2008) that were put forward in
Figure 27.3b. In fact, when the various results now available are taken together, the most plausible
model of the Colavita visual dominance effect would appear to be one in which an asymmetrical
lowering of the criterion for responding to auditory and visual targets (Koppen et al. 2009) is paired
with an asymmetrical cross-modal effect on the rate of information accrual (Miller 1986; see Figure
27.3d).
However, although the account outlined in Figure 27.3d may help to explain why it is that a
participant will typically respond to the visual stimulus first on the bimodal target trials (despite
the fact that the auditory and visual stimuli are actually presented simultaneously), it does not
explain why participants do not quickly recognize the error of their ways (after making a vision-
only response, say), and then quickly initiate an additional auditory response.† The participants
certainly had sufficient time in which to make a response before the next trial started in many of the
studies where the Colavita effect has been reported. For example, in Koppen and Spence’s (2007a,
2007b, 2007c) studies, the intertarget interval was in the region of 1500–1800 ms, whereas mean
vision-only response latencies fell in the 500–700 ms range.

27.5.3  Stimulus, (Perception), and Response?


We believe that in order to answer the question of why participants fail to make any response to
the auditory (or tactile) targets on some proportion of the bimodal target trials in the Colavita
paradigm, one has to break with the intuitively appealing notion that there is a causal link between
(conscious) perception and action. Instead, it needs to be realized that our responses do not always
rely on our first becoming aware of the stimuli that have elicited those responses. In fact, according
to Neumann (1990), the only causal link that exists is the one between a stimulus and its associated
response. Neumann has argued that conscious perception should not always be conceptualized as a

* Note here that a very different result (i.e., the enhancement of perceived auditory intensity by a simultaneously presented
visual stimulus) has been reported in other studies in which the participants simply had to detect the presence of an audi-
tory target (see Odgaard et al. 2004). This discrepancy highlights the fact that the precise nature of a participant’s task
constitutes a critical determinant of the way in which the stimuli presented in different modalities interact to influence
human information processing (cf. Gondan and Fisher 2009; Sinnett et al. 2008; Wang et al. 2008, on this point).
† Note here that we are talking about the traditional two-response version of the Colavita task. Remember that in the three-
response version, the participant’s first response terminates the trial, and hence there is no opportunity to make a second
response.

[Figure 27.4 appears here: bar graphs of (a) sensitivity (d') and (b) criterion (c) for unimodal and bimodal auditory and visual targets.]

FIGURE 27.4  Summary of mean sensitivity (d' ) values (panel a) and criterion (c) (panel b) for unimodal
auditory, unimodal visual, bimodal auditory, and bimodal visual targets in Koppen et al.’s (2009) signal detec-
tion study of the Colavita visual dominance effect. Error bars indicate the standard errors of means. The
results show that although the simultaneous presentation of auditory and visual stimuli resulted in a reduction
of auditory sensitivity (when compared to performance in unimodal auditory target trials), no such effect was
reported for visual targets. The results also highlight the fact that the presentation of a bimodal audiovisual
target resulted in a significant reduction in the criteria (c) for responding, and that this effect was significantly
larger for visual targets than for auditory targets. (Redrawn from Koppen, C. et al., Exp. Brain Res., 196,
353–360, 2009. With permission.)

necessary stage in the chain of human information processing. Rather, he suggests that conscious
perception can, on occasion, be bypassed altogether. Support for Neumann’s view that stimuli can
elicit responses in the absence of awareness comes from research showing, for example, that par-
ticipants can execute rapid and accurate discrimination responses to masked target stimuli that
they are subjectively unaware of (e.g., Taylor and McCloskey 1996). The phenomenon of blindsight
is also pertinent here (e.g., see Cowey and Stoerig 1991). Furthermore, researchers have shown
that people sometimes lose their memory for the second of two stimuli as a result of their having
executed a response to the first stimulus (Crowder 1968; Müsseler and Hommel 1997a, 1997b; see
also Bridgeman 1990; Ricci and Chatterjee 2004; Rizzolatti and Berti 1990). On the basis of such
results, then, our suggestion is that a participant’s awareness (of the target stimuli) in the speeded
version of the Colavita paradigm may actually be modulated by the responses that they happen to
make (select or initiate) on some proportion of the trials, rather than necessarily always being driven
by their conscious perception of the stimuli themselves (see also Hefferline and Perera 1963).
To summarize, when participants try to respond rapidly in the Colavita visual dominance task,
they may sometimes end up initiating their response before becoming aware of the stimulus (or
stimuli) that have elicited that response. Their awareness of which stimuli have, in fact, been pre-
sented is then constrained by the response(s) that they actually happen to make. In other words, if (as
a participant) I realize that I have made (or am about to make) a vision-only response, it would seem
unsurprising that I only then become aware of the visual target, even if an auditory target had also
been presented at the same time (although it perhaps reached the threshold for initiating a response
more slowly than the visual stimulus; see above). Here, one might even consider the possibility that
participants simply stop processing (or stop responding to) the target stimulus (or stimuli) after they
have selected/triggered a response (to the visual target; i.e., perhaps target processing reflects a kind
of self-terminating processing). Sinnett et al.’s (2008) research is crucial here in showing that, as a
result of the asymmetrical cross-modal effects of auditory and visual stimuli on each other, the first
response that a participant makes on a bimodal target trial is likely to be to a visual (rather than an
auditory) stimulus.
If this hypothesis regarding people’s failure to respond to some proportion of the auditory (or
tactile) stimuli on the bimodal trials in the Colavita paradigm were to be correct, one would expect
the fastest visual responses to occur on those bimodal trials in which participants make a visual-
only response. Koppen and Spence’s (2007a; Experiment 3) results show just such a result in their
three-response study of the Colavita effect (i.e., where participants made one response to auditory
targets, one to visual targets, and a third to the bimodal targets; note, however, that the participants
did not have the opportunity to respond to the visual and auditory stimuli sequentially in this study).
In Koppen and Spence’s study, the visual-only responses on the bimodal target trials were actually
significantly faster, on average (mean RT = 563 ms), than the visual-only responses on unimodal
visual trials (mean RT = 582 ms; see Figure 27.5). This result therefore demonstrates that even
though participants failed to respond to the auditory target, its presence nevertheless still facilitated
their behavioral performance. Finally, the vision-only responses (on the bimodal trials) were also
found, on average, to be significantly faster than the participants’ correct bimodal responses on the
bimodal target trials (mean = 641 ms).
Interestingly, however, participants’ auditory-only responses on the bimodal target trials in
Koppen and Spence’s (2007a) study were significantly slower, on average, than on the unimodal
auditory target trials (mean RTs of 577 and 539 ms, respectively). This is the opposite pattern of
results to that seen for the visual target detection data (i.e., a bimodal slowing of responding for
auditory targets paired with a bimodal speeding of responding to the visual targets). This result
provides additional evidence for the existence of an asymmetrical cross-modal effect on the rate of
information accrual. Indeed, taken together, these results mirror those reported by Sinnett et al.
(2008) in their speeded target detection task, but note here that the data come from a version of the
Colavita task instead. Thus, it really does seem as though the more frequent occurrence of vision-
only as compared to auditory-only responses on the bimodal audiovisual target trials in the Colavita
visual dominance paradigm is tightly linked to the speed with which a participant initiates his/

[Figure 27.5 appears here: a timeline of mean reaction times at 539, 563, 577, 582, and 641 ms, with significant differences and one nonsignificant comparison marked.]

FIGURE 27.5  Schematic timeline showing the mean latency of participants’ responses (both correct and
incorrect responses) in Koppen and Spence's (2007a) three-button version of the Colavita task. Significant dif-
ferences between particular conditions of interest (p < .05) are highlighted with an asterisk. (See text for
details.)

her response. When participants respond rapidly, they are much more likely to make an erroneous
visual-only response than to make an erroneous auditory-only response.*

27.6  BIASED (OR INTEGRATED) COMPETITION AND COLAVITA VISUAL DOMINANCE EFFECT

How can the asymmetric cross-modal effects of simultaneously presented auditory and visual tar-
gets on each other (that were highlighted in the previous section) be explained? We believe that a
fruitful approach may well come from considering them in the light of the biased (or integrated)
competition hypothesis (see Desimone and Duncan 1995; Duncan 1996). According to Desimone
and Duncan, brain systems (both sensory and motor) are fundamentally competitive in nature.
What is more, within each system, a gain in the activation of one object/event representation always
occurs at a cost to others. That is, the neural representation of different objects/events is normally
mutually inhibitory. An important aspect of Desimone and Duncan’s biased competition model
relates to the claim that the dominant neural representation suppresses the neural activity associated
with the representation of the weaker stimulus (see Duncan 1996). In light of the discussion in the
preceding section (see Section 27.5.2), one might think of biased competition as affecting the rate of
information accrual, changing the criterion for responding, and/or changing perceptual sensitivity
(but see Gorea and Sagi 2000, 2002). An extreme form of this probabilistic winner-takes-all princi-
ple might therefore help to explain why it is that the presentation of a visual stimulus can sometimes
have such a profound effect on people’s awareness of the stimuli coded by a different brain area (i.e.,
modality; see also Hahnloser et al. 1999).
Modality-based biased competition can perhaps also provide a mechanistic explanation for the
findings of a number of other studies of multisensory information processing. For example, over
the years, many researchers have argued that people’s attention is preferentially directed toward
the visual modality when pairs of auditory and visual stimuli are presented simultaneously (e.g.,
see Falkenstein et al. 1991; Golob et al. 2001; Hohnsbein and Falkenstein 1991; Hohnsbein et al.
1991; Oray et al. 2002). As Driver and Vuilleumier (2001, p. 75) describe the biased (or integrated)
competition hypothesis: “ . . . multiple concurrent stimuli always compete to drive neurons and
dominate the networks (and ultimately to dominate awareness and behavior).” They continue: “vari-
ous phenomena of ‘attention’ are cast as emergent properties of whichever stimuli happen to win the
competition.” In other words, particularly salient stimuli will have a competitive advantage and may
thus tend to “attract attention” on purely bottom-up grounds. Visual stimuli might then, for whatever
reason (see below), constitute a particularly salient class of stimuli. Such stimulus-driven competi-
tion between the neural activation elicited by the auditory (or tactile) and visual targets on bimodal
target trials might also help to explain why the attentional manipulations that have been utilized
previously have proved so ineffective in terms of reversing the Colavita visual dominance effect
(see Koppen and Spence 2007d; Sinnett et al. 2007). That is, although the biasing of a participant’s
attention toward one sensory modality (in particular, the nonvisual modality) before stimulus onset
may be sufficient to override the competitive advantage resulting from any stimulus-driven biased
competition (see McDonald et al. 2005; Spence 2010; Vibell et al. 2007), it cannot reverse it.

27.6.1  Putative Neural Underpinnings of Modality-Based Biased Competition


Of course, accounting for the Colavita visual dominance effect in terms of biased competition does
not itself explain why it is the visual stimulus that wins the competition more frequently than
the nonvisual stimulus. Although a satisfactory neurally inspired answer to this question will need

* One final point to note here concerns the fact that when participants made an erroneous response on the bimodal target
trials, the erroneous auditory-only responses were somewhat slower than the erroneous vision-only responses, although
this difference failed to reach statistical significance.

to await future research, it is worth noting here that recent research has highlighted the importance
of feedback activity from higher order to early sensory areas in certain aspects of visual awareness
(e.g., Lamme 2001; Lamme et al. 2000; Pascual-Leone and Walsh 2001; but see also Macknik 2009;
Macknik and Martinez-Conde 2007, in press). It is also pertinent to note that far more of the brain
is given over to the processing of visual stimuli than to the processing of stimuli from the other sen-
sory modalities. For example, Sereno et al. (1995) suggest that nearly half of the cortex is involved
in the processing of visual information. Meanwhile, Felleman and van Essen (1991) point out that
in the macaque there are less than half the number of brain areas involved in the processing of
tactile information as involved in the processing of visual information. In fact, in their authoritative
literature review, they estimate that 55% of neocortex (by volume), is visual, as compared to 12%
somatosensory, 8% motor, 3% auditory, and 0.5% gustatory. Given such statistics, it would seem
probable that the visual system might have a better chance of setting-up such feedback activity fol-
lowing the presentation of a visual stimulus than would the auditory or tactile systems following the
simultaneous presentation of either an auditory or tactile stimulus. Note that this account suggests
that visual dominance is natural, at least for humans, in that it may have a hardwired physiologi-
cal basis (this idea was originally captured by Colavita et al.’s (1976) suggestion that visual stimuli
might be “prepotent”). It is interesting to note in this context that the amount of cortex given over
to the processing of auditory and tactile information processing is far more evenly matched than for
the competition between audition and vision, hence perhaps explaining the lack of a clear pattern
of dominance when stimuli are presented in these two modalities at the same time (see Hecht and
Reiner 2009; Occelli et al. 2010).
It is also important to note here that progress in terms of explaining the Colavita effect at
a neural level might also come from a more fine-grained study of the temporal dynamics of
multisensory integration in various brain regions. In humans, the first wave of activity in pri-
mary auditory cortex in response to the presentation of suprathreshold stimuli is usually seen at
a latency of about 10–15 ms (e.g., Liegeois-Chauvel et al. 1994; Howard et al. 2000; Godey et al.
2001; Brugge et al. 2003). Activity in primary visual cortex starts about 40–50 ms after stimulus
presentation (e.g., Foxe et al. 2008; see also Schroeder et al. 1998), whereas for primary soma-
tosensory cortex the figure is about 8–12 ms (e.g., Inui et al., 2004; see also Schroeder et al. 2001).
Meanwhile, Schroeder and Foxe (2002, 2004) have documented the asymmetrical time course of
the interactions taking place between auditory and visual cortex. Their research has shown that
the visual modulation of activity in auditory cortex occurs several tens of milliseconds after the
feedforward sweep of activation associated with the processing of auditory stimuli, under condi-
tions where auditory and visual stimuli happen to be presented simultaneously from a location
within peripersonal space (i.e., within arm’s reach; see Rizzolatti et al. 1997). This delay is caused
by the fact that much of the visual input to auditory cortex is routed through superior temporal
polysensory areas (e.g., Foxe and Schroeder 2002; see also Ghazanfar et al. 2005; Kayser et al.
2008; Smiley et al. 2007), and possibly also through prefrontal cortex. It therefore seems plau-
sible to suggest that such delayed visual (inhibitory) input to auditory cortex might play some
role in disrupting the setting-up of the feedback activity from higher (auditory) areas.* That said,
Falchier et al. (2010) recently reported evidence suggesting the existence of a more direct rout-
ing of information from visual to auditory cortex (i.e., from V2 to caudal auditory cortex), hence
potentially confusing the story somewhat.
By contrast, audition’s influence on visual information processing occurs more rapidly, and
involves direct projections from early auditory cortical areas to early visual areas. That is, direct
projections have now been documented from the primary auditory cortex A1 to the primary visual
cortex V1 (e.g., see Wang et al. 2008; note, however, that these direct connections tend to target

* Note here also the fact that visual influences on primary and secondary auditory cortex are greatest when the visual
stimulus leads the auditory stimulus by 20–80 ms (see Kayser et al. 2008), the same magnitude of visual leads that have
also been shown to give rise to the largest Colavita effect (see Figure 2; Koppen and Spence 2007b).

peripheral, rather than central, locations in the visual field; that said, other projections may well
be more foveally targeted). Interestingly, however, until very recently no direct connections had
as yet been observed in the opposite direction (see Falchier et al. 2010). These direct projections
from auditory to visual cortex may help to account for the increased visual cortical excitability seen
when an auditory stimulus is presented together with a visual stimulus (e.g., Martuzzi et al. 2007;
Noesselt et al. 2007; Rockland and Ojima 2003; Romei et al. 2007, 2009; see also Besle et al. 2009;
Clavagnier et al. 2004; Falchier et al. 2003). Indeed, Bolognini et al. (2010) have recently shown that
transcranial magnetic stimulation (TMS)-elicited phosphenes (presented near threshold) are more
visible when a white noise burst is presented approximately 40 ms before the TMS pulse (see also
Romei et al. 2009).
It is also interesting to note here that when auditory and tactile stimuli are presented simultaneously
from a distance of less than 1 m (i.e., in peripersonal space), the response in multisensory
convergence regions of auditory association cortex is both rapid and approximately simultaneous
for these two input modalities (see Schroeder and Foxe 2002, p. 193; see also Foxe et al. 2000,
2002; Murray et al. 2005; Schroeder et al. 2001). Such neurophysiological timing properties may
then also help to explain why no clear Colavita dominance effect has as yet been reported between
these two modalities (see also Sperdin et al. 2009).* That said, any neurally inspired account of the
Colavita effect will likely also have to incorporate the recent discovery of feedforward multisensory
interactions to early cortical areas taking place in the thalamus (i.e., via the thalamocortical loop;
Cappe et al. 2009).
Although any attempt to link human behavior to single-cell neurophysiological data in either
awake or anesthetized primates is clearly speculative at this stage, we are nevertheless convinced
that this kind of interdisciplinary approach will be needed if we are to develop a fuller understand-
ing of the Colavita effect in the coming years. It may also prove fruitful, when trying to explain
why it is that participants fail to make an auditory (or tactile) response once they have made a visual
one to consider the neuroscience research on the duration (and decay) of sensory memory in the
different modalities (e.g., Lu et al. 1992; Harris et al. 2002; Uusitalo et al. 1996; Zylberberg et al.
2009). Here, it would be particularly interesting to know whether there are any systematic modality-
specific differences in the decay rate of visual, auditory, and tactile sensory memory.

27.6.2  Clinical Extinction and Colavita Visual Dominance Effect


It will most likely also be revealing in future research to explore the relationship between the
Colavita visual dominance effect and the clinical phenomenon of extinction that is sometimes seen
in patients following lateralized (typically right parietal) brain damage (e.g., Baylis et al.
1993; Bender 1952; Brozzoli et al. 2006; Driver and Vuilleumier 2001; Farnè et al. 2007; Rapp and
Hendel 2003; Ricci and Chatterjee 2004). The two phenomena share a number of similarities: Both
are sensitive to the relative spatial position from which the stimuli are presented (Costantini et al.
2007; Hartcher-O’Brien et al. 2008, 2010; Koppen and Spence 2007c); both are influenced by the
relative timing of the two stimuli (Baylis et al. 2002; Costantini et al. 2007; Koppen and Spence
2007b; Lucey and Spence 2009; Rorden et al. 1997); both affect perceptual sensitivity as well
as being influenced by response-related factors (Koppen et al. 2009; Ricci and Chatterjee 2004;

* It would be interesting here to determine whether the feedforward projections between primary auditory and tactile
cortices are any more symmetrical than those between auditory and visual cortices (see Cappe and Barone 2005; Cappe
et al. 2009; Hackett et al. 2007; Schroeder et al. 2001; Smiley et al. 2007, on this topic), since this could provide a neural
explanation for why no Colavita effect has, as yet, been reported between the auditory and tactile modalities (Hecht and
Reiner 2009; Occelli et al. 2010). That said, it should also be borne in mind that the nature of auditory-somatosensory
interactions have recently been shown to differ quite dramatically as a function of the body surface stimulated (e.g., dif-
ferent audio–tactile interactions have been observed for stimuli presented close to the hands in frontal space vs. close to
the back of the neck in rear space; see Fu et al. 2003; Tajadura-Jiminez et al. 2009; cf. Critchley 1953, p. 19). The same
may, of course, also turn out to be true for the auditory–tactile Colavita effect.

Sinnett et al. 2008; see also Gorea and Sagi 2002). The proportion of experimental trials on which
each phenomenon occurs in the laboratory has also been shown to vary greatly between studies.
In terms of the biased (or integrated) competition hypothesis (Desimone and Duncan 1995;
Duncan 1996), extinction (in patients) is thought to reflect biased competition against stimuli from
one side (Driver and Vuilleumier 2001; Rapp and Hendel 2003), whereas here we have argued that
the Colavita effect reflects biased competition that favors the processing of visual stimuli. Although
extinction has typically been characterized as a spatial phenomenon (i.e., it is the contralesional
stimulus that normally extinguishes a simultaneously presented ipsilesional stimulus), it is worth
noting that nonspatial extinction effects have also been reported (Costantini et al. 2007; Humphreys
et al. 1995; see also Battelli et al. 2007). Future neuroimaging research will hopefully help to deter-
mine the extent to which the neural substrates underlying the Colavita visual dominance effect in
healthy individuals and the phenomenon of extinction in clinical patients are similar (Sarri et al.
2006). Intriguing data here come from a neuroimaging study of a single patient with visual–tactile
extinction reported by Sarri et al. In this patient, awareness of touch on the bimodal visuotactile
trials was associated with increased activity in right parietal and frontal regions. Sarri et al. argued
that the cross-modal extinction of the tactile stimulus in this patient resulted from increased com-
petition arising from the functional coupling of visual and somatosensory cortex with multisensory
parietal cortex.
The literature on unimodal and cross-modal extinction suggests that the normal process of
biased competition can be interrupted by the kinds of parietal damage that lead to neglect and/or
extinction. It would therefore be fascinating to see whether one could elicit the same kinds of biases
in neural competition (usually seen in extinction patients) in normal participants, simply by admin-
istering TMS over posterior parietal areas (see Driver and Vuilleumier 2001; Duncan 1996; Sarri
et al. 2006). Furthermore, following on from the single-cell neurophysiological work conducted by
Schroeder and his colleagues (e.g., see Schroeder and Foxe 2002, 2004; Schroeder et al. 2004), it
might also be interesting to target superior temporal polysensory areas, and/or the prefrontal cortex
in order to try and disrupt the modality-based biased competition seen in the Colavita effect (i.e.,
rather than the spatial or temporal competition that is more typically reported in extinction patients;
see Battelli et al. 2007). There are two principal outcomes that could emerge from such a study,
and both seem plausible: (1) TMS over one or more such cortical sites might serve to magnify the
Colavita visual dominance effect observed in normal participants, based on the consequences of
pathological damage to these areas observed in extinction patients; (2) TMS over these cortical sites
might also reduce the magnitude of the Colavita effect, by interfering with the normal processes of
biased competition, and/or by interfering with the late-arriving cross-modal feedback activity from
visual to auditory cortex (see Section 27.6.1). It would, of course, also be very interesting in future
research to investigate whether extinction patients exhibit a larger Colavita effect than normal par-
ticipants in the traditional version of the Colavita task (cf. Costantini et al. 2007).

27.7  CONCLUSIONS AND QUESTIONS FOR FUTURE RESEARCH


Research conducted over the past 35 years or so has shown the Colavita visual dominance effect to
be a robust empirical phenomenon. However, traditional explanations of the effect simply cannot
account for the range of experimental data that is currently available. In this article, we argue that the
Colavita visual dominance effect may be accounted for in terms of Desimone and Duncan’s (1995;
see also Duncan 1996) model of biased (or integrated) competition. According to the explanation
outlined here, the Colavita visual dominance effect can be understood in terms of the cross-modal
competition between the neural representations of simultaneously presented visual and auditory (or
tactile) stimuli. Cognitive neuroscience studies would certainly help to further our understanding of
the mechanisms underlying the Colavita effect. It would be particularly interesting, for example, to
compare the pattern of brain activation on those trials in which participants fail to respond correctly
to the nonvisual stimulus to the activation seen on those trials in which they respond appropriately
(cf. Fink et al. 2000; Golob et al. 2001; Sarri et al. 2006; Schubert et al. 2006). Event-related poten-
tial studies could also help to determine just how early (or late, see Falkenstein et al. 1991; Quinlan
2000; Zahn et al. 1994) the processing of ignored and reported auditory (or tactile) stimuli differs
(see Hohnsbein et al. 1991).

27.7.1  Modeling the Colavita Visual Dominance Effect


There is also a considerable amount of interesting work to be done in terms of modeling the Colavita
visual dominance effect. Cooper (1998) made a start on this more than a decade ago. He developed a
computational model that was capable of simulating the pattern of participants’ RTs in the Colavita
task. Cooper’s model consisted of separate modality-specific input channels feeding into a single
“object representation network” (whose function involved activating specific response schemas—
presumably equivalent to a target stimulus reaching the criterion for responding, as discussed earlier)
in which the speed of each channel was dependent on the strength (i.e., weight) of the channel
itself. By assuming that the visual channel was stronger than the auditory channel, the model was
able to successfully account for the fact that although responses to auditory stimuli are faster than
responses to visual stimuli in unimodal trials, the reverse pattern is typically found on bimodal
target trials.
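
Cooper's model was implemented within a schema-based cognitive modeling framework; the Python sketch below is only a loose caricature of the core idea, in which channel weight matters when the two channels compete for the shared object representation network on bimodal trials but not on unimodal trials. All parameter values are our own illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
n, noise_sd = 5000, 40.0  # simulated trials per condition; RT noise (ms)

# Assumed parameters: baseline latencies (ms) and channel weights (strengths).
base = {"auditory": 440.0, "visual": 500.0}
weight = {"auditory": 1.0, "visual": 1.4}

# Unimodal trials: no competition, so latency reflects the baseline alone
# (auditory responses come out faster, as in the empirical data).
uni = {m: base[m] + rng.normal(0.0, noise_sd, n) for m in base}

# Bimodal trials: the channels race for the shared network, and each channel's
# effective speed is scaled by its weight, so the stronger visual channel tends to win.
bi = {m: base[m] / weight[m] + rng.normal(0.0, noise_sd, n) for m in base}

print(f"Unimodal means: auditory {uni['auditory'].mean():.0f} ms, "
      f"visual {uni['visual'].mean():.0f} ms")
print(f"Visual channel wins the bimodal race on "
      f"{(bi['visual'] < bi['auditory']).mean():.1%} of simulated trials")
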
The challenge for researchers in this area will be to try and develop models that are also capable
of accounting for participants’ failure to respond to the nonvisual stimulus (i.e., the effect that has
constituted the focus for the research discussed in this article; cf. Peers et al. 2005); such models
might presumably include the assignment of different weights to visual and auditory cues, biases to
preferentially respond to either visual or auditory stimuli, different gain/loss functions associated
with responding, or failing to respond, to auditory and visual target stimuli, etc. It will be especially
interesting here to examine whether the recent models of Bayesian multisensory integration (see
Ernst 2005) that have proved so successful in accounting for many aspects of cross-modal percep-
tion, sensory dominance, and multisensory information processing, can also be used to account for
the Colavita visual dominance effect.
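
By way of reminder, the simplest of these models combines two Gaussian cue estimates by weighting each in proportion to its reliability (inverse variance); the Python sketch below implements that standard computation with arbitrary example numbers. How such a scheme might be extended to predict the response omissions seen in the Colavita paradigm is precisely the open question raised above.

def combine_cues(mu_v, sigma_v, mu_a, sigma_a):
    """Maximum-likelihood (reliability-weighted) combination of two Gaussian cues."""
    w_v = (1 / sigma_v**2) / (1 / sigma_v**2 + 1 / sigma_a**2)
    mu_combined = w_v * mu_v + (1 - w_v) * mu_a
    sigma_combined = (1 / (1 / sigma_v**2 + 1 / sigma_a**2)) ** 0.5
    return mu_combined, sigma_combined

# Arbitrary example: a reliable visual estimate and a noisier auditory estimate.
mu, sigma = combine_cues(mu_v=0.0, sigma_v=1.0, mu_a=3.0, sigma_a=2.0)
print(f"Combined estimate = {mu:.2f}, combined SD = {sigma:.2f}")
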

27.7.2  Multisensory Facilitation versus Interference


Finally, in closing, it is perhaps worth pausing to consider the Colavita effect in the context of
so many other recent studies that have demonstrated the benefits of multisensory over unisensory
stimulus presentation (e.g., in terms of speeding simple speeded detection responses; Nickerson
1973; Sinnett et al. 2008, Experiment 1; see also Calvert et al. 2004). To some, the existence of
the Colavita effect constitutes a puzzling example of a situation in which multisensory stimula-
tion appears to impair (rather than to facilitate) human performance. It is interesting to note here
though that whether one observes benefits or costs after multisensory (as compared to unisensory)
stimulation seems to depend largely on the specific requirements of the task faced by participants.
For example, Sinnett et al. (2008; Experiment 2) reported the facilitation of simple speeded detec-
tion latencies on bimodal audiovisual trials (i.e., they observed a violation of the race model; Miller
1982, 1991) when their participants had to make the same simple speeded detection responses to
auditory, visual, and audiovisual targets. By contrast, they observed an inhibitory effect when their
participants had to respond to the targets in each modality by pressing a separate response key (i.e.,
the typical Colavita paradigm). However, this latter result is not really so surprising if one stops to
consider the fact that in the Colavita task participants can really be thought of as performing two
tasks at once: that is, in the traditional two-response version of the Colavita task, the participants
perform both a speeded auditory target detection task as well as a speeded visual target detection
task. Although on the majority of (unimodal) trials the participants only have to perform one task,
on a minority of (bimodal) trials they have to perform both tasks at the same time (and it is on these
trials that the Colavita effect occurs when the nonvisual stimulus is seemingly ignored).* By con-
trast, in the redundant target effect paradigm (see earlier), both stimuli are relevant to the same task
(i.e., to making a simple speeded target detection response).
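
For readers unfamiliar with the race model test mentioned above, the Python sketch below compares an observed bimodal RT distribution against Miller's (1982) race model bound (the sum of the two unimodal cumulative distributions); the RT samples are simulated stand-ins rather than data from any of the studies discussed here.

import numpy as np

rng = np.random.default_rng(2)
rt_a = rng.normal(440, 60, 2000)    # simulated unimodal auditory RTs (ms)
rt_v = rng.normal(520, 60, 2000)    # simulated unimodal visual RTs (ms)
rt_av = rng.normal(430, 55, 2000)   # simulated redundant (bimodal) RTs (ms)

def ecdf(samples, t):
    """Empirical cumulative distribution: P(RT <= t)."""
    return float(np.mean(samples <= t))

# Race model inequality: P(RT_bimodal <= t) <= P(RT_auditory <= t) + P(RT_visual <= t).
# Violations at fast latencies are taken as evidence for multisensory coactivation.
for t in (350, 400, 450, 500):
    bound = min(1.0, ecdf(rt_a, t) + ecdf(rt_v, t))
    observed = ecdf(rt_av, t)
    status = "violation" if observed > bound else "consistent with a race"
    print(f"t = {t} ms: observed {observed:.2f} vs. bound {bound:.2f} ({status})")
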
Researchers have known for more than half a century that people find it difficult to perform two
tasks at the same time, regardless of whether the target stimuli relevant to performing those tasks
are presented in the same or in different sensory modalities (e.g., Pashler 1994; Spence 2008). One
can therefore think of the Colavita paradigm in terms of a form of dual-task interference (resulting
from modality-based biased competition at the response-selection level)—interference that appears
to be intimately linked to the making of speeded responses to the target stimuli (however, see
Koppen et al. 2009). More generally, it is important to stress that although multisensory integra-
tion may, under the appropriate conditions, give rise to improved perception/performance, the ben-
efits may come at the cost of some loss of access to the component unimodal signals
(cf. Soto-Faraco and Alsius 2007, 2009). In closing, it is perhaps worth highlighting the fact that
the task-dependent nature of the consequences of multisensory integration that show up in studies
related to the Colavita effect has now also been demonstrated in a number of different behavioral
paradigms, in both humans (see Cappe et al. in press; Gondan and Fischer 2009; Sinnett et al. 2008;
Spence et al. 2003) and monkeys (see Besle et al. 2009; Wang et al. 2008).

REFERENCES
Battelli, L., A. Pascual-Leone, and P. Cavanagh. 2007. The ‘when’ pathway of the right parietal lobe. Trends in
Cognitive Sciences 11: 204–210.
Baylis, G. C., J. Driver, and R. D. Rafal. 1993. Visual extinction and stimulus repetition. Journal of Cognitive
Neuroscience 5: 453–466.
Baylis, G. C., S. L. Simon, L. L. Baylis, and C. Rorden. 2002. Visual extinction with double simultaneous
stimulation: What is simultaneous? Neuropsychologia 40: 1027–1034.
Bender, M. B. 1952. Disorders in perception. Springfield, IL: Charles Thomas.
Besle, J., O. Bertrand, and M. H. Giard. 2009. Electrophysiological (EEG, sEEG, MEG) evidence for multiple
audiovisual interactions in the human auditory cortex. Hearing Research 258(1–2): 143–151.
Bolognini, N., I. Senna, A. Maravita, A. Pascual-Leone, and L. B. Merabet. 2010. Auditory enhancement
of visual phosphene perception: The effect of temporal and spatial factors and of stimulus intensity.
Neuroscience Letters 477: 109–114.
Bonneh, Y. S., M. K. Belmonte, F. Pei, P. E. Iversen, T. Kenet, N. Akshoomoff, Y. Adini, H. J. Simon, C. I.
Moore, J. F. Houde, and M. M. Merzenich. 2008. Cross-modal extinction in a boy with severely autistic
behavior and high verbal intelligence. Cognitive Neuropsychology 25: 635–652.
Bridgeman, B. 1990. The physiological basis of the act of perceiving. In Relationships between perception and
action: Current approaches, ed. O. Neumann and W. Prinz, 21–42. Berlin: Springer.
Brozzoli, C., M. L. Demattè, F. Pavani, F. Frassinetti, and A. Farnè. 2006. Neglect and extinction: Within and
between sensory modalities. Restorative Neurology and Neuroscience 24: 217–232.
Brugge, J. F., I. O. Volkov, P. C. Garell, R. A. Reale, and M. A. Howard 3rd. 2003. Functional connections
between auditory cortex on Heschl’s gyrus and on the lateral superior temporal gyrus in humans. Journal
of Neurophysiology 90: 3750–3763.
Calvert, G. A., C. Spence, and B. E. Stein (eds.). 2004. The handbook of multisensory processes. Cambridge,
MA: MIT Press.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels
of cortical processing in the monkey. European Journal of Neuroscience 22: 2886–2902.

* One slight complication here, though, relates to the fact that people typically start to couple multiple responses to different
stimuli into response couplets under the appropriate experimental conditions (see Ulrich and Miller 2008). Thus, one
could argue about whether participants’ responses on the bimodal target trials actually count as a third single (rather
than dual) task, albeit one that, in the two-response version of the Colavita task, involves a two-finger rather than a single-finger
response. When considered in this light, the interference with performance seen in the Colavita task does not seem quite so
surprising.

Cappe, C., A. Morel, P. Barone, and E. M. Rouiller. 2009. The thalamocortical projection systems in primates:
An anatomical support for multisensory and sensorimotor interplay. Cerebral Cortex 19: 2025–2037.
Clavagnier, S., A. Falchier, and H. Kennedy. 2004. Long-distance feedback projections to area V1: Implications
for multisensory integration, spatial awareness, and visual consciousness. Cognitive, Affective, and
Behavioural Neuroscience 4: 117–126.
Colavita, F. B. 1974. Human sensory dominance. Perception and Psychophysics 16: 409–412.
Colavita, F. B. 1982. Visual dominance and attention in space. Bulletin of the Psychonomic Society 19:
261–262.
Colavita, F. B., R. Tomko, and D. Weisberg. 1976. Visual prepotency and eye orientation. Bulletin of the
Psychonomic Society 8: 25–26.
Colavita, F. B., and D. Weisberg. 1979. A further investigation of visual dominance. Perception and
Psychophysics 25: 345–347.
Cooper, R. 1998. Visual dominance and the control of action. In Proceedings of the 20th Annual Conference of
the Cognitive Science Society, ed. M. A. Gernsbacher and S. J. Derry, 250–255. Mahwah, NJ: Lawrence
Erlbaum Associates.
Costantini, M., D. Bueti, M. Pazzaglia, and S. M. Aglioti. 2007. Temporal dynamics of visuo-tactile extinction
within and between hemispaces. Neuropsychology 21: 242–250.
Cowey, A., and P. Stoerig. 1991. The neurobiology of blindsight. Trends in Neurosciences 14: 140–145.
Critchley, M. 1953. Tactile thought, with special reference to the blind. Brain 76: 19–35.
Crowder, R. G. 1968. Repetition effects in immediate memory when there are no repeated elements in the
stimuli. Journal of Experimental Psychology 78: 605–609.
De Reuck, T., and C. Spence. 2009. Attention and visual dominance. Unpublished manuscript.
Desimone, R., and J. Duncan. 1995. Neural mechanisms of selective visual attention. Annual Review of
Neuroscience 18: 193–222.
Driver, J., and P. Vuilleumier. 2001. Perceptual awareness and its loss in unilateral neglect and extinction.
Cognition 79: 39–88.
Duncan, J. 1996. Cooperating brain systems in selective perception and action. In Attention and performance
XVI: Information integration in perception and communication, ed. T. Inui and J. L. McClelland, 549–
578. Cambridge, MA: MIT Press.
Egeth, H. E., and L. C. Sager. 1977. On the locus of visual dominance. Perception and Psychophysics 22: 77–86.
Elcock, S., and C. Spence. 2009. Caffeine and the Colavita visual dominance effect. Unpublished manuscript.
Ernst, M. 2005. A Bayesian view on multimodal cue integration. In Perception of the human body from the
inside out, ed. G. Knoblich, I. Thornton, M. Grosjean, and M. Shiffrar, 105–131. New York: Oxford Univ.
Press.
Exner, S. 1875. Experimentelle Untersuchung der einfachsten psychischen Processe (Experimental study of
the most simple psychological processes). Archiv für die gesammte Physiologie des menschens und der
Thiere (Pflüger’s Archive) 11: 403–432.
Falchier, A., S. Clavagnier, P. Barone, and H. Kennedy. 2003. Anatomical evidence of multimodal integration
in primate striate cortex. Journal of Neuroscience 22: 5749–5759.
Falchier, A., C. E. Schroeder, T. A. Hackett, P. Lakatos, S. Nascimento-Silva, I. Ulbert, G. Karmos, and J. F.
Smiley. 2010. Projection from visual areas V2 and prostriata to caudal auditory cortex in the monkey.
Cerebral Cortex 20: 1529–1538.
Falkenstein, M., J. Hohnsbein, J. Hoormann, and L. Blanke. 1991. Effects of crossmodal divided attention on
late ERP components: II. Error processing in choice reaction tasks. Electroencephalography and Clinical
Neurophysiology 78: 447–455.
Farnè, A., C. Brozzoli, E. Làdavas, and T. Ro. 2007. Investigating multisensory spatial cognition through the
phenomenon of extinction. In Attention and performance XXII: Sensorimotor foundations of higher cog-
nition, ed. P. Haggard, Y. Rossetti, and M. Kawato, 183–206. Oxford: Oxford Univ. Press.
Felleman, D. J., and D. C. Van Essen. 1991. Distributed hierarchical processing in primate cerebral cortex.
Cerebral Cortex 1: 1–47.
Fink, G. R., J. Driver, C. Rorden, T. Baldeweg, and R. J. Dolan. 2000. Neural consequences of competing
stimuli in both visual hemifields: A physiological basis for visual extinction. Annals of Neurology 47:
440–446.
Foree, D. D., and V. M. LoLordo. 1973. Attention in the pigeon: Differential effects of food-getting versus
shock-avoidance procedures. Journal of Comparative and Physiological Psychology 85: 551–558.
Foxe, J. J., I. A. Morocz, M. M. Murray, B. A. Higgins, D. C. Javitt, and C. E. Schroeder. 2000. Multisensory
auditory–somatosensory interactions in early cortical processing revealed by high-density electrical
mapping. Cognitive Brain Research 10: 77–83.
Foxe, J. J., E. C. Strugstad, P. Sehatpour, S. Molholm, W. Pasieka, C. E. Schroeder, and M. E. McCourt. 2008.
Parvocellular and magnocellular contributions to the initial generators of the visual evoked potential:
High-density electrical mapping of the “C1” component. Brain Topography 21: 11–21.
Foxe, J. J., G. R. Wylie, A. Martinez, C. E. Schroeder, D. C. Javitt, D. Guilfoyle, W. Ritter, and M. M. Murray.
2002. Auditory–somatosensory multisensory processing in auditory association cortex: An fMRI study.
Journal of Neurophysiology 88: 540–543.
Fu, K.-M. G., T. A. Johnston, A. S. Shah, L. Arnold, J. Smiley, T. A. Hackett, P. E. Garraghty, and C. E. Schroeder.
2003. Auditory cortical neurons respond to somatosensory stimulation. Journal of Neuroscience 23:
7510–7515.
Gallace, A., H. Z. Tan, and C. Spence. 2007. Multisensory numerosity judgments for visual and tactile stimuli.
Perception and Psychophysics 69: 487–501.
Ghazanfar, A. A., J. X. Maier, K. L. Hoffman, and N. K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in Rhesus monkey auditory cortex. Journal of Neuroscience 25: 5004–5012.
Godey, B., D. Schwartz, J. B. de Graaf, P. Chauvel, and C. Liegeois-Chauvel. 2001. Neuromagnetic source
localization of auditory evoked fields and intracerebral evoked potentials: A comparison of data in the
same patients. Clinical Neurophysiology 112: 1850–1859.
Golob, E. J., G. G. Miranda, J. K. Johnson, and A. Starr. 2001. Sensory cortical interactions in aging, mild
cognitive impairment, and Alzheimer’s disease. Neurobiology of Aging 22: 755–763.
Gondan, M., and V. Fischer. 2009. Serial, parallel, and coactive processing of double stimuli presented with
onset asynchrony. Perception 38(Suppl.): 16.
Gorea, A., and D. Sagi. 2000. Failure to handle more than one internal representation in visual detection tasks.
Proceedings of the National Academy of Sciences of the United States of America 97: 12380–12384.
Gorea, A., and D. Sagi. 2002. Natural extinction: A criterion shift phenomenon. Visual Cognition 9: 913–936.
Green, D. M., and J. A. Swets. 1966. Signal detection theory and psychophysics. New York: Wiley.
Gregg, L. W., and W. J. Brogden. 1952. The effect of simultaneous visual stimulation on absolute auditory
sensitivity. Journal of Experimental Psychology 43: 179–186.
Hackett, T. A., L. De La Mothe, I. Ulbert, G. Karmos, J. Smiley, and C. E. Schroeder. 2007. Multisensory
convergence in auditory cortex: II. Thalamocortical connections of the caudal superior temporal plane.
Journal of Comparative Neurology 502: 924–952.
Hahnloser, R., R. J. Douglas, M. Mahowald, and K. Hepp. 1999. Feedback interactions between neuronal
pointers and maps for attentional processing. Nature Neuroscience 2: 746–752.
Harris, J. A., C. Miniussi, I. M. Harris, and M. E. Diamond. 2002. Transient storage of a tactile memory trace
in primary somatosensory cortex. Journal of Neuroscience 22: 8720–8725.
Hartcher-O’Brien, J., A. Gallace, B. Krings, C. Koppen, and C. Spence. 2008. When vision ‘extinguishes’
touch in neurologically-normal people: Extending the Colavita visual dominance effect. Experimental
Brain Research 186: 643–658.
Hartcher-O’Brien, J., C. Levitan, and C. Spence. 2010. Out-of-touch: Does vision dominate over touch when it
occurs off the body? Brain Research 1362: 48–55.
Hecht, D., and M. Reiner. 2009. Sensory dominance in combinations of audio, visual and haptic stimuli.
Experimental Brain Research 193: 307–314.
Hefferline, R. F., and T. B. Perera. 1963. Proprioceptive discrimination of a covert operant without its observa-
tion by the subject. Science 139: 834–835.
Hirsh, I. J., and C. E. Sherrick Jr. 1961. Perceived order in different sense modalities. Journal of Experimental
Psychology 62: 423–432.
Hohnsbein, J., and M. Falkenstein. 1991. Visual dominance: Asymmetries in the involuntary processing of
visual and auditory distractors. In Channels in the visual nervous system: Neurophysiology, psychophys-
ics and models, ed. B. Blum, 301–313. London: Freund Publishing House.
Hohnsbein, J., M. Falkenstein, and J. Hoormann. 1991. Visual dominance is reflected in reaction times and
event-related potentials (ERPs). In Channels in the visual nervous system: Neurophysiology, psychophys-
ics and models, ed. B. Blum, 315–333. London: Freund Publishing House.
Howard, M. A., I. O. Volkov, R. Mirsky, P. C. Garell, M. D. Noh, M. Granner, H. Damasio, M. Steinschneider,
R. A. Reale, J. E. Hind, and J. F. Brugge. 2000. Auditory cortex on the human posterior superior temporal
gyrus. Journal of Comparative Neurology 416: 79–92.
Humphreys, G. W., C. Romani, A. Olson, M. J. Riddoch, and J. Duncan. 1995. Nonspatial extinction following
lesions of the parietal lobe in man. Nature 372: 357–359.
Inui, K., X. Wang, Y. Tamura, Y. Kaneoke, and R. Kakigi. 2004. Serial processing in the human somatosensory
system. Cerebral Cortex 14: 851–857.
Jaśkowski, P. 1996. Simple reaction time and perception of temporal order: Dissociations and hypotheses.
Perceptual and Motor Skills 82: 707–730.
Jaśkowski, P. 1999. Reaction time and temporal-order judgment as measures of perceptual latency: The prob-
lem of dissociations. In Cognitive contributions to the perception of spatial and temporal events, ed. G.
Aschersleben, T. Bachmann, and J. Müsseler, 265–282. North-Holland: Elsevier Science.
Jaśkowski, P., F. Jaroszyk, and D. Hojan-Jesierska. 1990. Temporal-order judgments and reaction time for
stimuli of different modalities. Psychological Research 52: 35–38.
Johnson, T. L., and K. L. Shapiro. 1989. Attention to auditory and peripheral visual stimuli: Effects of arousal
and predictability. Acta Psychologica 72: 233–245.
Kayser, C., C. I. Petkov, and N. K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18: 1560–1574.
Koppen, C., A. Alsius, and C. Spence. 2008. Semantic congruency and the Colavita visual dominance effect.
Experimental Brain Research 184: 533–546.
Koppen, C., C. Levitan, and C. Spence. 2009. A signal detection study of the Colavita effect. Experimental
Brain Research 196: 353–360.
Koppen, C., and C. Spence. 2007a. Seeing the light: Exploring the Colavita visual dominance effect.
Experimental Brain Research 180: 737–754.
Koppen, C., and C. Spence. 2007b. Audiovisual asynchrony modulates the Colavita visual dominance effect.
Brain Research 1186: 224–232.
Koppen, C., and C. Spence. 2007c. Spatial coincidence modulates the Colavita visual dominance effect.
Neuroscience Letters 417: 107–111.
Koppen, C., and C. Spence. 2007d. Assessing the role of stimulus probability on the Colavita visual dominance
effect. Neuroscience Letters 418: 266–271.
Kristofferson, A. B. 1965. Attention in time discrimination and reaction time. NASA Contractors Report 194.
Washington, D.C.: Office of Technical Services, U.S. Department of Commerce.
Lamme, V. A. F. 2001. Blindsight: The role of feedforward and feedback corticocortical connections. Acta
Psychologica 107: 209–228.
Lamme, V. A. F., H. Supèr, R. Landman, P. R. Roelfsema, and H. Spekreijse. 2000. The role of primary visual
cortex (V1) in visual awareness. Vision Research 40: 1507–1521.
Liegeois-Chauvel, C., A. Musolino, J. M. Badier, P. Marquis, and P. Chauvel. 1994. Evoked potentials
recorded from the auditory cortex in man: Evaluation and topography of the middle latency components.
Electroencephalography and Clinical Neurophysiology 92: 204–214.
LoLordo, V. M. 1979. Selective associations. In Mechanisms of learning and motivation: A memorial to Jerzy
Konorski, ed. A. Dickinson and R. A. Boakes, 367–399. Hillsdale, NJ: Erlbaum.
LoLordo, V. M., and D. R. Furrow. 1976. Control by the auditory or the visual element of a compound discrimi-
native stimulus: Effects of feedback. Journal of the Experimental Analysis of Behavior 25: 251–256.
Lu, Z.-L., S. J. Williamson, and L. Kaufman. 1992. Behavioral lifetime of human auditory sensory memory
predicted by physiological measures. Science 258: 1669–1670.
Lucey, T., and C. Spence. 2009. Visual dominance. Unpublished manuscript.
Macaluso, E., and J. Driver. 2005. Multisensory spatial interactions: A window onto functional integration in
the human brain. Trends in Neurosciences 28: 264–271.
Macknik, S. L. 2009. The role of feedback in visual attention and awareness. Perception 38(Suppl.): 162.
Macknik, S., and S. Martinez-Conde. 2007. The role of feedback in visual masking and visual processing.
Advances in Cognitive Psychology 3: 125–152.
Macknik, S., and S. Martinez-Conde. In press. The role of feedback in visual attention and awareness. In The
new cognitive neurosciences, ed. M. S. A. Gazzaniga, 1163–1177. Cambridge, MA: MIT Press.
Manly, T., I. H. Robertson, M. Galloway, and K. Hawkins. 1999. The absent mind: Further investigations of
sustained attention to response. Neuropsychologia 37: 661–670.
Marks, L. E., E. Ben-Artzi, and S. Lakatos. 2003. Cross-modal interactions in auditory and visual discrimina-
tion. International Journal of Psychophysiology 50: 125–145.
Martuzzi, R., M. M. Murray, C. M. Michel, J. P. Thiran, P. P. Maeder, S. Clarke, and R. A. Meuli. 2007.
Multisensory interactions within human primary cortices revealed by BOLD dynamics. Cerebral Cortex
17: 1672–1679.
McDonald, J. J., W. A. Teder-Sälejärvi, F. Di Russo, and S. A. Hillyard. 2005. Neural basis of auditory-induced
shifts in visual time-order perception. Nature Neuroscience 8: 1197–1202.
Meltzer, D., and M. A. Masaki. 1973. Measures of stimulus control and stimulus dominance. Bulletin of the
Psychonomic Society 1: 28–30.
Miller, J. O. 1982. Divided attention: Evidence for coactivation with redundant signals. Cognitive Psychology
14: 247–279.
Miller, J. O. 1986. Time course of coactivation in bimodal divided attention. Perception and Psychophysics 40:
331–343.
Miller, J. O. 1991. Channel interaction and the redundant targets effect in bimodal divided attention. Journal of
Experimental Psychology: Human Perception and Performance 17: 160–169.
Miller, J., R. Ulrich, and B. Rolke. 2009. On the optimality of serial and parallel processing in the psycho-
logical refractory period paradigm: Effects of the distribution of stimulus onset asynchronies. Cognitive
Psychology 58: 273–310.
Miyake, S., S. Taniguchi, and K. Tsuji. 1986. Effects of light stimulus upon simple reaction time and EP latency
to the click presented with different SOA. Japanese Psychological Research 28: 1–10.
Murray, M. M., S. Molholm, C. M. Michel, D. J. Heslenfeld, W. Ritter, D. C. Javitt, C. E. Schroeder, and
J. J. Foxe. 2005. Grabbing your ear: Auditory–somatosensory multisensory interactions in early sensory
cortices are not constrained by stimulus alignment. Cerebral Cortex 15: 963–974.
Müsseler, J., and B. Hommel. 1997a. Blindness to response-compatible stimuli. Journal of Experimental
Psychology: Human Perception and Performance 23: 861–872.
Müsseler, J., and B. Hommel. 1997b. Detecting and identifying response-compatible stimuli. Psychonomic
Bulletin and Review 4: 125–129.
Neumann, O. 1990. Direct parameter specification and the concept of perception. Psychological Research 52:
207–215.
Nickerson, R. 1973. Intersensory facilitation of reaction time: Energy summation or preparation enhancement?
Psychological Review 80: 489–509.
Noesselt, T., J. W. Rieger, M. A. Schoenfeld, M. Kanowski, H. Hinrichs, H.-J. Heinze, and J. Driver. 2007.
Audiovisual temporal correspondence modulates human multisensory temporal sulcus plus primary sen-
sory cortices. Journal of Neuroscience 27: 11431–11441.
Occelli, V., J. Hartcher-O’Brien, C. Spence, and M. Zampini. 2010. Assessing the audiotactile Colavita effect
in near and rear space. Experimental Brain Research 203: 517–532.
O'Connor, N., and B. Hermelin. 1963. Sensory dominance in autistic children and subnormal controls.
Perceptual and Motor Skills 16: 920.
Odgaard, E. C., Y. Arieh, and L. E. Marks. 2003. Cross-modal enhancement of perceived brightness: Sensory
interaction versus response bias. Perception and Psychophysics 65: 123–132.
Odgaard, E. C., Y. Arieh, and L. E. Marks. 2004. Brighter noise: Sensory enhancement of perceived loudness by
concurrent visual stimulation. Cognitive, Affective, and Behavioral Neuroscience 4: 127–132.
Oray, S., Z. L. Lu, and M. E. Dawson. 2002. Modification of sudden onset auditory ERP by involuntary atten-
tion to visual stimuli. International Journal of Psychophysiology 43: 213–224.
Osborn, W. C., R. W. Sheldon, and R. A. Baker. 1963. Vigilance performance under conditions of redundant
and nonredundant signal presentation. Journal of Applied Psychology 47: 130–134.
Partan, S., and P. Marler. 1999. Communication goes multimodal. Science 283: 1272–1273.
Pascual-Leone, A., and V. Walsh. 2001. Fast backprojections from the motion to the primary visual area neces-
sary for visual awareness. Science 292: 510–512.
Pashler, H. 1994. Dual-task interference in simple tasks: Data and theory. Psychological Bulletin 116:
220–244.
Peers, P. V., C. J. H. Ludwig, C. Rorden, R. Cusack, C. Bonfiglioli, C. Bundesen, J. Driver, N. Antoun, and J.
Duncan. 2005. Attentional functions of parietal and frontal cortex. Cerebral Cortex 15: 1469–1484.
Posner, M. I., M. J. Nissen, and R. M. Klein. 1976. Visual dominance: An information-processing account of
its origins and significance. Psychological Review 83: 157–171.
Quinlan, P. 2000. The “late” locus of visual dominance. Abstracts of the Psychonomic Society 5: 64.
Randich, A., R. M. Klein, and V. M. LoLordo. 1978. Visual dominance in the pigeon. Journal of the Experimental
Analysis of Behavior 30: 129–137.
Rapp, B., and S. K. Hendel. 2003. Principles of cross-modal competition: Evidence from deficits of attention.
Psychonomic Bulletin and Review 10: 210–219.
Ricci, R., and A. Chatterjee. 2004. Sensory and response contributions to visual awareness in extinction.
Experimental Brain Research 157: 85–93.
Rizzolatti, G., and A. Berti. 1990. Neglect as a neural representation deficit. Revue Neurologique (Paris) 146:
626–634.
Rizzolatti, G., L. Fadiga, L. Fogassi, and V. Gallese. 1997. The space around us. Science 277: 190–191.
Rockland, K. S., and H. Ojima. 2003. Multisensory convergence in calcarine visual areas in macaque monkey.
International Journal of Psychophysiology 50: 19–26.
Rodway, P. 2005. The modality shift effect and the effectiveness of warning signals in different modalities. Acta
Psychologica 120: 199–226.
Romei, V., M. M. Murray, C. Cappe, and G. Thut. 2009. Preperceptual and stimulus-selective enhancement of
low-level human visual cortex excitability by sounds. Current Biology 19: 1799–1805.
Romei, V., M. M. Murray, L. B. Merabet, and G. Thut. 2007. Occipital transcranial magnetic stimulation has
opposing effects on visual and auditory stimulus detection: Implications for multisensory interactions.
Journal of Neuroscience 27: 11465–11472.
Rorden, C., J. B. Mattingley, H.-O. Karnath, and J. Driver. 1997. Visual extinction and prior entry: Impaired
perception of temporal order with intact motion perception after unilateral parietal damage. Neuro­
psychologia 35: 421–433.
Rutschmann, J., and R. Link. 1964. Perception of temporal order of stimuli differing in sense mode and simple
reaction time. Perceptual and Motor Skills 18: 345–352.
Sarri, M., F. Blankenburg, and J. Driver. 2006. Neural correlates of crossmodal visual–tactile extinction and
of tactile awareness revealed by fMRI in a right-hemisphere stroke patient. Neuropsychologia 44:
2398–2410.
Schroeder, C. E., and J. J. Foxe. 2002. The timing and laminar profile of converging inputs to multisensory
areas of the macaque neocortex. Brain Research: Cognitive Brain Research 14: 187–198.
Schroeder, C. E., and J. J. Foxe. 2004. Multisensory convergence in early cortical processing. In The handbook of
multisensory processes, ed. G. A. Calvert, C. Spence, and B. E. Stein, 295–309. Cambridge, MA: MIT Press.
Schroeder, C. E., R. W. Lindsley, C. Specht, A. Marcovici, J. F. Smiley, and D. C. Javitt. 2001. Somatosensory input
to auditory association cortex in the macaque monkey. Journal of Neurophysiology 85: 1322–1327.
Schroeder, C. E., A. D. Mehta, and S. J. Givre. 1998. A spatiotemporal profile of visual system activation
revealed by current source density analysis in the awake macaque. Cerebral Cortex 8: 575–592.
Schroeder, C. E., S. Molholm, P. Lakatos, W. Ritter, and J. J. Foxe. 2004. Human simian correspondence in the
early cortical processing of multisensory cues. Cognitive Processing 5: 140–151.
Schubert, R., F. Blankenberg, S. Lemm, A. Villringer, and G. Curio. 2006. Now you feel it, now you don’t: ERP
correlates of somatosensory awareness. Psychophysiology 43: 31–40.
Sereno, M. I., A. M. Dale, J. B. Reppas, K. K. Kwong, J. W. Belliveau, T. J. Brady, B. R. Rosen, and R. B.
H. Tootell. 1995. Borders of multiple visual areas in humans revealed by functional magnetic resonance
imaging. Science 268: 889–893.
Shapiro, K. L., B. Egerman, and R. M. Klein. 1984. Effects of arousal on human visual dominance. Perception
and Psychophysics 35: 547–552.
Shapiro, K. L., W. J. Jacobs, and V. M. LoLordo. 1980. Stimulus–reinforcer interactions in Pavlovian condi-
tioning of pigeons: Implications for selective associations. Animal Learning and Behavior 8: 586–594.
Shapiro, K. L., and T. L. Johnson. 1987. Effects of arousal on attention to central and peripheral visual stimuli.
Acta Psychologica 66: 157–172.
Sinnett, S., S. Soto-Faraco, and C. Spence. 2008. The co-occurrence of multisensory competition and facilita-
tion. Acta Psychologica 128: 153–161.
Sinnett, S., C. Spence, and S. Soto-Faraco. 2007. Visual dominance and attention: The Colavita effect revisited.
Perception and Psychophysics 69: 673–686.
Smiley, J., T. A. Hackett, I. Ulbert, G. Karmos, P. Lakatos, D. C. Javitt, and C. E. Schroeder. 2007. Multisensory
convergence in auditory cortex: I. Cortical connections of the caudal superior temporal plane in Macaque
monkey. Journal of Comparative Neurology 502: 894–923.
Smith, A. 2002. Effects of caffeine on human behavior. Food Chemistry and Toxicology 40: 1243–1255.
Smith, A. P., A. M. Kendrick, and A. L. Maben. 1992. Effects of breakfast and caffeine on performance and
mood in the late morning and after lunch. Neuropsychobiology 26: 198–204.
Smith, W. F. 1933. The relative quickness of visual and auditory perception. Journal of Experimental Psychology
16: 239–257.
Soto-Faraco, S., and A. Alsius. 2007. Conscious access to the unisensory components of a cross-modal illusion.
Neuroreport 18: 347–350.
Soto-Faraco, S., and A. Alsius. 2009. Deconstructing the McGurk–MacDonald illusion. Journal of Experimental
Psychology: Human Perception and Performance 35: 580–587.
Spence, C. 2008. Cognitive neuroscience: Searching for the bottleneck in the brain. Current Biology 18:
R965–R968.
Spence, C. 2010. Prior entry: Attention and temporal perception. In Attention and time, ed. A. C. Nobre and J.
T. Coull, 89–104. Oxford: Oxford Univ. Press.
Spence, C., R. Baddeley, M. Zampini, R. James, and D. I. Shore. 2003. Crossmodal temporal order judgments:
When two locations are better than one. Perception and Psychophysics 65: 318–328.
Spence, C., M. E. R. Nicholls, and J. Driver. 2001a. The cost of expecting events in the wrong sensory modality.
Perception and Psychophysics 63: 330–336.
Spence, C., D. I. Shore, and R. M. Klein. 2001b. Multisensory prior entry. Journal of Experimental Psychology:
General 130: 799–832.
Spence, C., and S. Soto-Faraco. 2009. Auditory perception: Interactions with vision. In Auditory perception, ed.
C. Plack, 271–296. Oxford: Oxford Univ. Press.
Sperdin, H. F., C. Cappe, J. J. Foxe, and M. M. Murray. 2009. Early, low-level auditory–somatosensory multi-
sensory interactions impact reaction time speed. Frontiers in Integrative Neuroscience 3(2): 1–10.
Stein, B. E., N. London, L. K. Wilkinson, and D. P. Price. 1996. Enhancement of perceived visual intensity by
auditory stimuli: A psychophysical analysis. Journal of Cognitive Neuroscience 8: 497–506.
Stone, J. V., N. M. Hunkin, J. Porrill, R. Wood, V. Keeler, M. Beanland, M. Port, and N. R. Porter. 2001. When
is now? Perception of simultaneity. Proceedings of the Royal Society (B) 268: 31–38.
Tajadura-Jiménez, A., N. Kitagawa, A. Väljamäe, M. Zampini, M. M. Murray, and C. Spence. 2009. Auditory–
somatosensory multisensory interactions are spatially modulated by stimulated body surface and acous-
tic spectra. Neuropsychologia 47: 195–203.
Taylor, J. L., and D. I. McCloskey. 1996. Selection of motor responses on the basis of unperceived stimuli.
Experimental Brain Research 110: 62–66.
Thompson, R. F., J. F. Voss, and W. J. Brogden. 1958. Effect of brightness of simultaneous visual stimulation
on absolute auditory sensitivity. Journal of Experimental Psychology 55: 45–50.
Titchener, E. B. 1908. Lectures on the elementary psychology of feeling and attention. New York:
Macmillan.
Turatto, M., F. Benso, G. Galfano, L. Gamberini, and C. Umilta. 2002. Non-spatial attentional shifts between
audition and vision. Journal of Experimental Psychology: Human Perception and Performance 28:
628–639.
Uetake, K., and Y. Kudo. 1994. Visual dominance over hearing in feed acquisition procedure of cattle. Applied
Animal Behaviour Science 42: 1–9.
Ulrich, R., and J. Miller. 2008. Response grouping in the psychological refractory period (PRP) paradigm:
Models and contamination effects. Cognitive Psychology 57: 75–121.
Uusitalo, M. A., S. J. Williamson, and M. T. Seppä. 1996. Dynamical organisation of the human visual system
revealed by lifetimes of activation traces. Neuroscience Letters 213: 149–152.
Van Damme, S., G. Crombez, and C. Spence. 2009a. Is the visual dominance effect modulated by the threat
value of visual and auditory stimuli? Experimental Brain Research 193: 197–204.
Van Damme, S., A. Gallace, C. Spence, and G. L. Moseley. 2009b. Does the sight of physical threat induce
a tactile processing bias? Modality-specific attentional facilitation induced by viewing threatening pic-
tures. Brain Research 1253: 100–106.
Vibell, J., C. Klinge, M. Zampini, C. Spence, and A. C. Nobre. 2007. Temporal order is coded temporally in
the brain: Early ERP latency shifts underlying prior entry in a crossmodal temporal order judgment task.
Journal of Cognitive Neuroscience 19: 109–120.
Wang, Y., S. Celebrini, Y. Trotter, and P. Barone. 2008. Visuo-auditory interactions in the primary visual cortex
of the behaving monkey: Electrophysiological evidence. BMC Neuroscience 9: 79.
Wilcoxon, H. C., W. B. Dragoin, and P. A. Kral. 1971. Illness-induced aversions in rat and quail: Relative
salience of visual and gustatory cues. Science 171: 826–828.
Zahn, T. P., D. Pickar, and R. J. Haier. 1994. Effects of clozapine, fluphenazine, and placebo on reaction time
measures of attention and sensory dominance in schizophrenia. Schizophrenia Research 13: 133–144.
Zwyghuizen-Doorenbos, A., T. A. Roehrs, L. Lipschutz, V. Timms, and T. Roth. 1990. Effects of caffeine on
alertness. Psychopharmacology 100: 36–39.
Zylberberg, A., S. Dehaene, G. B. Mindlin, and M. Sigman. 2009. Neurophysiological bases of exponential
sensory decay and top-down memory retrieval: A model. Frontiers in Computational Neuroscience 3(4):
1–16.
28 The Body in a
Multisensory World
Tobias Heed and Brigitte Röder

CONTENTS
28.1 Introduction........................................................................................................................... 557
28.2 Construction of Body Schema from Multisensory Information............................................ 558
28.2.1 Representing Which Parts Make Up the Own Body................................................. 558
28.2.2 Multisensory Integration for Limb and Body Ownership......................................... 559
28.2.3 Extending the Body: Tool Use................................................................................... 561
28.2.4 Rapid Plasticity of Body Shape................................................................................. 562
28.2.5 Movement and Posture Information in the Brain...................................................... 563
28.2.6 The Body Schema: A Distributed versus Holistic Representation............................564
28.2.7 Interim Summary...................................................................................................... 565
28.3 The Body as a Modulator for Multisensory Processing........................................................ 565
28.3.1 Recalibration of Sensory Signals and Optimal Integration....................................... 565
28.3.2 Body Schema and Peripersonal Space....................................................................... 566
28.3.3 Peripersonal Space around Different Parts of the Body............................................ 568
28.3.4 Across-Limb Effects in Spatial Remapping of Touch............................................... 569
28.3.5 Is the External Reference Frame a Visual One?........................................................ 570
28.3.6 Investigating the Body Schema and Reference Frames with Electrophysiology...... 572
28.3.7 Summary................................................................................................................... 574
28.4 Conclusion............................................................................................................................. 574
References....................................................................................................................................... 575

28.1  INTRODUCTION
It is our body through which we interact with the environment. We have a very clear sense of our
physical selves: we know where our body ends and which body parts we own. Beyond that, we are
usually aware (or can easily become aware) of where each of our body parts is currently
located, and most of our movements seem effortless, whether performed under conscious control or
not. When we think about ourselves, we normally perceive our body as a stable entity. For example,
when we go to bed, we do not expect that our body will be different when we wake up the next
morning. Quite contrary to such introspective assessment, the brain has been found to be surpris-
ingly flexible in updating its representation of the body. As an illustration, consider what happens
when an arm or leg becomes numb after you have sat or slept in an unsuitable position for too long.
Touching the numb foot feels very strange, as if you were touching someone else’s foot. When you lift a
numb hand with the other hand, it feels far too heavy. Somehow, it feels as if the limb does not
belong to your own body.
Neuroscientists have long been fascinated with how the brain represents the body. It is usually
assumed that there are several different types of body representations, but there is no consensus
about what these representations are, or how many there may be (de Vignemont 2010; Gallagher
1986; Berlucchi and Aglioti 2010; see also Dijkerman and de Haan 2007 and commentaries thereof).

The most common distinction is that between a body schema and a body image. The body schema
is usually defined as a continuously updated sensorimotor map of the body that is important in the
context of action, informing the brain about which parts belong to the body, and where those parts
are currently located (de Vignemont 2010). In contrast, the term body image is usually used to
refer to perceptual, emotional, or conceptual knowledge about the body. However, other taxonomies
have been proposed (see Berlucchi and Aglioti 2010; de Vignemont 2010), and the use of the terms
body schema and body image has been inconsistent. This chapter will not present an exhaustive
debate about these definitions, and we refer the interested reader to the articles cited above for
detailed discussion; in this chapter, we will use the term body schema with the sensorimotor defini-
tion introduced above, referring both to which parts make up the body and to where those parts
are located.
The focus of this chapter will be on the importance of multisensory processing for representing
the body, as well as on the role of body representations for multisensory processing. On one hand,
one can investigate how the body schema is constructed and represented in the brain, and Section
28.2 will illustrate that the body schema emerges from the interaction of multiple sensory modali-
ties. For this very reason, one can, on the other hand, ask how multisensory interactions between
the senses are influenced by the fact that the brain commands a body. Section 28.3, therefore, will
present research on how the body schema is important in multisensory interactions, especially for
spatial processing.

28.2  CONSTRUCTION OF BODY SCHEMA FROM MULTISENSORY INFORMATION

28.2.1  Representing Which Parts Make Up the Own Body
There is some evidence suggesting that an inventory of the normally existing body parts is genet-
ically predetermined. Just like amputees, people born without arms and/or legs can have vivid
sensations of the missing limbs, including the feeling of using them for gestural movements dur-
ing conversation and for finger-aided counting (Ramachandran 1993; Ramachandran and Hirstein
1998; Brugger et al. 2000; Saadah and Melzack 1994; see also Lacroix et al. 1992). This phenom-
enon has therefore been termed phantom limbs. Whereas the existence of a phantom limb in ampu-
tees could be explained by the persistence of experience-induced representations of this limb after
the amputation, such an explanation does not hold for congenital phantom limbs. In one person with
congenital phantom limbs, transcranial magnetic stimulation (TMS) over primary motor, premotor,
parietal, and primary sensory cortex evoked sensations and movements of the congenital phantom
limbs (Brugger et al. 2000). This suggests that the information about which parts make up one’s own
body is distributed across different areas of the brain.
There are not many reports of congenital phantoms in the literature, and so the phenomenon
may be rare. However, the experience of phantom limbs after the loss of a limb, for example, due
to amputation, is very common. It has been reported (Simmel 1962) that the probability of perceiv-
ing phantom limbs gradually increases with the age of limb loss from very young (2 of 10 children
with amputations below the age of 2) to the age of 9 years and older (all of 60 cases), suggesting
that developmental factors within this age interval may be crucial for the construction of the body
schema (and, in turn, for the occurrence of phantom limbs).
The term “phantom limb” refers to limbs that would normally be present in a healthy person. In
contrast, a striking impairment after brain damage, for example, to the basal ganglia (Halligan et al.
1993), the thalamus (Bakheit and Roundhill 2005), or the frontal lobe (McGonigle et al. 2002), is
the report of one or more supernumerary limbs in addition to the normal limbs. The occurrence of a
supernumerary limb is usually associated with the paralysis of the corresponding real limb, which is
also attributable to the brain lesion. The supernumerary limb is vividly felt, and patients confabulate
to rationalize why the additional limb is present (e.g., it was attached by the clinical staff during
sleep), and why it is not visible (e.g., it was lost 20 years ago) (Halligan et al. 1993; Sellal et al.
1996; Bakheit and Roundhill 2005). It has therefore been suggested that the subjective presence of a
supernumerary limb may result from cognitive conflicts between different pieces of sensory infor-
mation (e.g., visual vs. proprioceptive) or fluctuations in the awareness about the paralysis, which in
turn may be resolved by assuming the existence of two (or more) limbs rather than one (Halligan et
al. 1993; Ramachandran and Hirstein 1998).
Whereas a patient with a phantom or a supernumerary limb perceives more limbs than he actu-
ally owns, some brain lesions result in the opposite phenomenon of patients denying the owner-
ship of an existing limb. This impairment, termed somatoparaphrenia, has been reported to occur
after temporo-parietal (Halligan et al. 1995) or thalamic-temporo-parietal damage (Daprati et al.
2000)—notably all involving the parietal lobe, which is thought to mediate multisensory integration
for motor planning. Somatoparaphrenia is usually observed in conjunction with hemineglect and
limb paralysis (Cutting 1978; Halligan et al. 1995; Daprati et al. 2000) and has been suggested to
reflect a disorder of body awareness due to the abnormal sensorimotor feedback for the (paralyzed)
limb after brain damage (Daprati et al. 2000).
Lesions can also affect the representation of the body and self as a whole, rather than just affect-
ing single body parts. These experiences have been categorized into three distinct phenomena
(Blanke and Metzinger 2009). During out-of-body experiences, a person feels as if she were located
outside of her real body, looking at herself, often from above. In contrast, during an autoscopic illusion,
the person localizes herself in her real body, but sees an illusory body in extrapersonal space (e.g.,
in front of herself). Finally, during heautoscopy, a person sees a second body and feels located
in both, either at the same time or in sometimes rapid alternation. In patients, such illusions have
been suggested to be related to damage to the temporo-parietal junction (TPJ) (Blanke et al. 2004),
and an out-of-body experience was elicited by stimulation of an electrode implanted over the TPJ
for presurgical assessment (Blanke et al. 2002). Interestingly, whole body illusions can coincide
with erroneous visual perceptions about body parts, for example, an impression of limb shortening
or illusory flexion of an arm. It has therefore been suggested that whole body illusions are directly
related to the body schema, resulting from a failure to integrate multisensory (e.g., vestibular and
visual) information about the body and its parts, similar to the proposed causes of supernumerary
limbs (Blanke et al. 2004).
In sum, many brain regions are involved in representing the configuration of the body; some
aspects of these representations seem to be innate, and are probably refined during early develop-
ment. Damage to some of the involved brain regions can lead to striking modifications of the per-
ceived body configuration, as well as to illusions about the whole body.

28.2.2  Multisensory Integration for Limb and Body Ownership


Although the previous section suggests that some aspects of the body schema may be hardwired, the
example of the sleeping foot with which we started this chapter suggests that the body schema is a
more flexible representation. Such fast changes of the body’s representation have been demonstrated
with an ingenious experimental approach: to mislead the brain as to the status of ownership of a
new object and to provoke its inclusion into the body schema. This trick can be achieved by using
rubber hands: a rubber hand is placed in front of a participant in such a way that it could belong
to her own body, and it is then stroked in parallel with the participant’s real, hidden hand. Most
participants report that they feel the stroking at the location of the rubber hand, and that they feel
as if the rubber hand were their own (Botvinick and Cohen 1998). One of the main determinants
for this illusion to arise is the synchrony of the visual and tactile stimulation. In other words, the
touches felt at one’s own hand and those seen to be delivered to the rubber hand must match. It might
in fact be possible to trick the brain into integrating objects other than hand-like ones into its body schema
using this synchronous stroking technique: when the experimenter stroked not a rubber hand but a
shoe placed on the table (Ramachandran and Hirstein 1998) or even the table surface (Armel and
Ramachandran 2003), participants reported that they “felt” the touch delivered to their real hand to
originate from the shoe or the table surface, respectively. Similarly, early event-related potentials (ERPs) in response
to tactile stimuli were enhanced after synchronous stimulation of a rubber hand as well as of a non-
hand object (Press et al. 2008). Even more surprisingly, participants in Armel and Ramachandran’s
study displayed signs of distress and an increased skin conductance response when the shoe was hit
with a hammer, or a band-aid was ripped off the table surface. Similar results, that is, signs of dis-
tress, were also observed when the needle of a syringe was stabbed into the rubber hand, and these
behavioral responses were associated with brain activity in anxiety-related brain areas (Ehrsson et
al. 2007). Thus, the mere synchrony of visual events at an object with the tactile sensations felt at
the hand seems to have led to some form of integration of the objects (the rubber hand, the shoe, or
the table surface) into the body schema, resulting in physiological and emotional responses usu-
ally reserved for the real body. It is important to understand that participants in the rubber hand
illusion (RHI) do not feel additional limbs; rather, they feel a displacement of their own limb,
which is reflected behaviorally by reaching errors after the illusion has manifested itself (Botvinick
and Cohen 1998; Holmes et al. 2006; but see Kammers et al. 2009a, 2009c, and discussion in de
Vignemont 2010), and by an adjustment of grip aperture when finger posture has been manipulated
during the RHI (Kammers et al. 2009b). Thus, a new object (the rubber hand) is integrated into the
body schema, but is interpreted as an already existing part (the participant’s own, hidden arm).
The subjective feeling of ownership of a rubber hand has also been investigated using func-
tional magnetic resonance imaging (fMRI). Activity emerged in the ventral premotor cortex and
(although, statistically, only as a trend toward significance) in the superior parietal lobule (SPL)
(Ehrsson et al. 2004). In the monkey, both of these areas respond to peripersonal stimuli around
the hand and head. Activity related to multisensory integration—synchrony of tactile and visual
events, as well as the alignment of visual and proprioceptive information about arm posture—was
observed in the SPL, presumably in the human homologue of an area in the monkey concerned with
arm reaching [the medial intraparietal (MIP) area]. Before the onset of the illusion, that is, during
its buildup, activity was seen in the intraparietal sulcus (IPS), in the dorsal premotor cortex (PMd),
and in the supplementary motor area (SMA), which are all thought to be part of an arm-reaching
circuit in both monkeys and humans. Because the rubber arm is interpreted as one’s own arm, the
illusion may be based on a recalibration of perceived limb position, mediated parietally, according
to the visual information about the rubber arm (Ehrsson et al. 2004; Kammers et al. 2009c). As
such, current multisensory information about the alleged position of the hand must be integrated
with long-term knowledge about body structure (i.e., the fact that there is a hand to be
located) (de Vignemont 2010; Tsakiris 2010).
Yet, as noted earlier, an integration of a non-body-like object also seems possible in some cases.
Besides the illusory integration of a shoe or the table surface due to synchronous stimulation, an
association of objects with the body has been reported in a clinical case of a brain-lesioned patient
who denied ownership of her arm and hand; when she wore the wedding ring on that hand, she did
not recognize it as her own. When it was taken off the neglected hand, the patient immediately rec-
ognized the ring as her own (Aglioti et al. 1996). Such findings might therefore indicate an involve-
ment of higher cognitive processes in the construction of the body schema.
It was mentioned in the previous section that brain damage can lead to misinterpretations of
single limbs (say, an arm or a leg), but also of the whole body. Similarly, the rubber hand paradigm
has been modified to study also the processes involved in the perception of the body as a whole
and of the feeling of self. Participants viewed a video image of themselves filmed from the back
(Ehrsson 2007) or a virtual reality character at some distance in front of them (Lenggenhager et al.
2007). They could see the back of the figure in front of them being stroked in synchrony with feel-
ing their own back being stroked. This manipulation resulted in the feeling of the self being located
outside one’s own body and of looking at oneself (Ehrsson 2007). Furthermore, when participants
were displaced from their viewing position and asked to walk to the location at which they felt
“themselves” during the illusion, they placed themselves in between the real and the virtual body’s
locations (Lenggenhager et al. 2007). Although both rubber hand and whole body illusions use the
same kind of multisensory manipulation, the two phenomena have been proposed to tap into dif-
ferent aspects of body processing (Blanke and Metzinger 2009): whereas the rubber hand illusion
leads to the incorporation of an object into the body schema, the whole body illusion manipulates the
location of a global “self” (Blanke and Metzinger 2009; Metzinger 2009), and accordingly the first-
person perspective (Ehrsson 2007). This distinction notwithstanding, both illusions convincingly
demonstrate how the representation of the body in the brain is determined by the integration of
multisensory information.
To sum up, our brain uses the synchrony of multisensory (visual and tactile) stimulation to deter-
mine body posture. Presumably, because touch is necessarily located on the body, such synchronous
visuo-tactile stimulation can lead to the illusion that external objects belong to our body, and even
to mislocalization of the whole body. However, the illusion is not of a new body part
having been added, but rather of a non-body object taking the place of an already existing body part
(or, in the case of the whole body illusion, the video image indicating our body’s location).

28.2.3  Extending the Body: Tool Use


At first sight, the flexibility of the body schema demonstrated with the rubber hand illusion and the
whole body illusion may seem a hindrance rather than an asset. However, a very common situation
in which such integration may be highly useful is the use of tools. Humans, and to some extent also
monkeys, use tools to complement and extend the abilities and capacity of their own body parts to
act upon their environment. In this situation, visual events at the tip of the tool (or, more generally, at
the part of the tool used to manipulate the environment) coincide with tactile information received
at the hand—a constellation that is very similar to the synchronous stroking of a non-body object
and a person’s hand. Indeed, some neurons in the intraparietal part of area PE (PEip) of monkeys
respond to tactile stimuli to the hand, as well as to visual stimuli around the tactile location (see also
Section 28.3.2). When the monkey was trained to use a tool to retrieve otherwise unreachable food,
the visual receptive fields (RFs), which encompassed only the hand when no tool was used, now
encompassed both the hand and the tool (Iriki et al. 1996). In a similar manner, when the monkey
learned to observe its hand on a monitor rather than seeing it directly, the visual RFs now encom-
passed the hand seen on the monitor (Obayashi et al. 2000). These studies have received some methodological
criticism (Holmes and Spence 2004), but their results are often interpreted as some form of integra-
tion of the tool into the monkey’s body schema. Neurons with such RF characteristics might there-
fore be involved in the mediation of rapid body schema modulations illustrated by the rubber hand
illusion in humans. Although these monkey findings are an important step toward understanding
tool use and its relation to the body schema, it is important to note that the mechanisms discovered
in the IPS cannot explain all phenomena involved either in tool use or in ownership illusions. For
example, it has been pointed out that a tool does not usually feel like one’s own body part, even when
it is frequently used, as is the case for a fork (Botvinick 2004). Such true ownership feelings may
rather be restricted to body part–shaped objects such as a prosthesis or a rubber hand, given that
they are located in an anatomically plausible location (Graziano and Gandhi 2000; Pavani et al.
2000). For the majority of tools, one might rather feel that the sensation of a touch is projected to
the action-related part of the tool (usually the tip), just as one may feel the touch of a pen to occur
between the paper and the pen tip, and not at the fingers holding the pen (see also Yamamoto and
Kitazawa 2001b; Yamamoto et al. 2005). Accordingly, rather than the tool being integrated into the
body schema, it may be that tool use results in the directing of attention toward the part of space
that is relevant for the currently performed action. Supporting such interpretations, it has recently
been shown that visual attention was enhanced at the movement endpoint of the tool as well as at the
movement endpoint of the hand when a reach was planned with a tool. Attention was not enhanced,
however, in between those locations along the tool (Collins et al. 2008). Similarly, cross-modal
(visual–tactile) interactions have been shown to be enhanced at the tool tip and at the hand, but not
in locations along the tool (Holmes et al. 2004; Yue et al. 2009). Finally, in a recent study, partici-
pants were asked to make tactile discrimination judgments about stimuli presented to the tip of a
tool. Visual distractors were presented in parallel to the tactile stimuli. fMRI activity in response
to the visual distractors near the end of the tool was enhanced in the occipital cortex, compared to
locations further away from the tool (Holmes et al. 2008). These findings were also interpreted to
indicate an increase of attention at the tool tip, due to the use of the tool.
Experimental results such as these challenge the idea of an extension of the body schema. Other
results, in contrast, do corroborate the hypothesis of an extension of the body schema due to tool
use. For example, tool use resulted in a change of the perceived distance between two touches to
the arm, which was interpreted to indicate an elongated representation of the arm (Cardinali et al.
2009b).
It has recently been pointed out that the rubber hand illusion seems to consist of several disso-
ciable aspects (Longo et al. 2008), as revealed by the factor analysis of questionnaire responses related
to the experience of the rubber hand illusion. More specific distinctions may need to be made about
the different processes (and, as a consequence, the different effects found in experiments) involved
in the construction of the body schema, and different experimental paradigms may tap into only a
subset of these processes.
In sum, multisensory signals are not only important for determining what parts we perceive our
body to be made of. Multisensory mechanisms are also important in mediating the ability to use tools. It is
currently under debate whether tools extend the body schema by integrating the tool as a body part, or whether
other multisensory processes, for example, a deployment of attention to the space manipulated by
the tool, are at the core of our ability to use tools.

28.2.4  Rapid Plasticity of Body Shape


The rubber hand illusion demonstrates that what the brain interprets as its own body can be rapidly
adjusted to the information that is received from the senses. Rapid changes of the body schema are,
however, not restricted to the inventory of body parts considered to belong to the body, or their cur-
rent posture. They also extend to the body’s shape. We already mentioned that the representation of
the arm may be prolonged after tool use (Cardinali et al. 2009b). An experience most of us have had
is the feeling of an increased size of an anesthetized body part, for example, the lip during a dentist's
appointment (see also Türker et al. 2005; Paqueron et al. 2003). Somewhat more spectacularly, when
participants hold the tip of their nose with thumb and index finger while their biceps muscle is
vibrated to induce the illusion of the arm moving away from the body, many report that
they perceive their nose to elongate to a length of up to 30 cm (sometimes referred to as the Pinocchio
illusion; Lackner 1988). A related illusion can be evoked when an experimenter guides a participant's
finger to tap, in an irregular rhythm, the nose of a second person (seated next to the participant), while
the experimenter synchronously taps the participant's own nose (Ramachandran and Hirstein 1997; see also discussion
in Ramachandran and Hirstein 1998). Both illusions are induced by presenting the brain with mis-
matching information about touch and proprioception. They demonstrate that, despite the fact that
our life experience would seem to preclude sudden elongations of the nose (or any other body part, for
that matter), the body schema is readily adapted when sensory information from different modalities
(here, tactile and proprioceptive) calls for an integration of initially mismatching content.
The rubber hand illusion has also been used to investigate effects of perceived body part
size. Participants judged the size of a coin to be bigger when the illusion was elicited with a rub-
ber hand bigger than their own, and to be smaller when the rubber hand was smaller (Bruno and
Bertamini 2010). The rubber hand illusion thus influenced tactile object perception. This influence
was systematic: as the real object held by the participants was always the same size, their finger
posture was identical in all conditions. With the illusion of a small hand, this posture would indicate
a relatively small distance between the small fingers. In contrast, with the illusion of a big hand, the
same posture would indicate a larger distance between the large fingers.
Similarly, visually perceived hand size has also been shown to affect grip size, although more so
when the visual image of the hand (a projection of an online video recording of the hand) was bigger
than normal (Marino et al. 2010).
The rubber hand illusion has also been used to create the impression of having an elongated arm
by having participants wear a shirt with an elongated sleeve from which the rubber hand protruded
(Schaefer et al. 2007). By recording magnetoencephalographic (MEG) responses to tactile stimuli
to the illusion hand, this study also demonstrated an involvement of primary somatosensory cortex
in the illusion.
These experiments demonstrate that perception of the body can be rapidly adjusted by the brain,
and that these perceptual changes in body shape affect object perception as well as hand actions.

28.2.5  Movement and Posture Information in the Brain


The rubber hand illusion shows how intimately body part ownership and body posture are related:
in this illusion, an object is felt to belong to one's own body, and at the same time, the real
arm is felt to be located where the rubber arm lies. In the same way, posture is, of course, intimately
related to movement, as every movement leads to a change in posture. However, different brain
areas seem to be responsible for perceiving movement and posture.
The perception of limb movement seems to depend on the primary sensory and motor cortex as
well as on the premotor and supplementary motor cortex (reviewed by Naito 2004). This is true also
for the illusory movement of phantom limbs, which is felt as real movement (Bestmann et al. 2006;
Lotze et al. 2001; Roux et al. 2003; Brugger et al. 2000). The primary motor cortex, in particular, may
play a crucial role in movement perception. One can create an illusion of movement by vibration
of the muscles responsible for the movement of a body part, for example, the arm or hand. When a
movement illusion is created for one hand, then this illusion transfers to the other hand if the palms
of the two hands touch. For both hands, fMRI activity increased in primary motor cortex, suggesting
a primary role of this motor-related structure also for the sensation of movement (Naito et al. 2002).
In contrast, the current body posture seems to be represented quite differently from limb move-
ment perception. Proprioceptive information arrives in the cortex via the somatosensory cortex.
Accordingly, neuronal responses in secondary somatosensory cortex (SII) to tactile stimuli to a
monkey's hand were shown to be modulated by the monkey's arm posture (Fitzgerald et al. 2004).
In humans, the proprioceptive drift associated with the rubber hand illusion—that is, the change of
the subjective position of the own hand toward the location of the rubber hand—was correlated with
activity in SII acquired with PET (Tsakiris et al. 2007). SII was also implicated in body schema
functions by a study in which participants determined the laterality of an arm seen on a screen by
imagining turning their own arm until it matched the seen one, as compared to when they deter-
mined the onscreen arm’s laterality by imagining its movement toward the appropriate location on
a body that was also presented on the screen (Corradi-Dell’Acqua et al. 2009). SII was thus active
during the imagination of specifically one’s own posture when making a postural judgment.
However, many other findings implicate hierarchically higher, more posterior parietal areas in
the maintenance of a posture representation. When participants were asked to reach with their hand
to another body part, activity increased in the SPL after a posture change as compared to when
participants repeated a movement they had just executed before. This posture change effect was
observed both when the reaching hand changed its posture, as well as when participants reached
with one hand to the other, and the target hand rather than the reaching hand changed its posture
(Pellijeff et al. 2006). Although the authors interpreted their results as reflecting postural updating,
they may instead be attributable to reach planning. However, a patient with an SPL lesion displayed
symptoms that corroborate the view that the SPL is involved in the maintenance of a continuous
postural model of the body (Harris and Wolpert 1998). This patient complained that her arm and
leg felt like they drifted and then faded, unless she could see them. This subjective feeling was
accompanied by an inability to maintain grip force as well as a loss of tactile perception of a vibratory
stimulus after it had been applied for several seconds. Because the patient's deficit was not a general
inability to detect tactile stimulation or perform hand actions, these results seem to imply that it is
the maintenance of the current postural state of the body that was lost over time unless new visual,
tactile, or proprioceptive information forced an update of the model. The importance of the SPL for
posture control is also evident from a patient who, after SPL damage, lost her ability to correctly
interact with objects in ways requiring whole-body coordination, such as sitting down on a chair (Kase et al. 1977).
Still further evidence for an involvement of the SPL in posture representation comes from experi-
ments in healthy participants. When people are asked to judge the laterality of a hand presented in a
picture, these judgments are influenced by the current hand posture adopted by the participant: the
more unnatural it would be to align the own hand with the displayed hand, the longer participants
take to respond (Parsons 1987; Ionta et al. 2007). A hand posture change during the hand lateral-
ity task led to an activation in the SPL in fMRI (de Lange et al. 2006). Hand crossing also led to
a change in intraparietal activation during passive tactile stimulation (Lloyd et al. 2003). Finally,
recall that fMRI activity during the buildup of the rubber hand illusion, thought to involve postural
recalibration due to the visual information about the rubber arm, was also observed in the SPL.
These findings are consistent with data from neurophysiological recordings in monkeys show-
ing that neurons in area 5 (Sakata et al. 1973) in the superior parietal lobe as well as neurons in
area PEc (located just at the upper border of the IPS and extending into the sulcus to border MIP;
Breveglieri et al. 2008) respond to complex body postures, partly involving several limbs. Neurons
in these areas respond to tactile, proprioceptive, and visual input (Breveglieri et al. 2008; Graziano
et al. 2000). Furthermore, some area 5 neurons fire most when the felt and the seen position of the
arm correspond rather than when they do not (Graziano 1999; Graziano et al. 2000). These neurons
respond not only to vision of the own arm, but also to vision of a fake arm, if it is positioned in an
anatomically plausible way such that it looks as if it might belong to the animal's own body,
reminiscent of the rubber hand illusion in humans. Importantly, some neurons fire most when the
visual information of the fake arm matches the arm posture of the monkey’s real, hidden arm, but
reduce their firing rate when vision and proprioception do not match.
To summarize, body movement and body posture are represented by different brain regions.
Movement perception relies on the motor structures of the frontal lobe. Probably, the most impor-
tant brain region for the representation of body posture, in contrast, is the SPL. This region is known
to integrate signals from different sensory modalities, and damage to it results in dysfunctions of
posture perception and actions requiring postural adaptations. However, other brain regions are
involved in posture processing as well.

28.2.6  The Body Schema: A Distributed versus Holistic Representation


The evidence reviewed so far has shown that what has been subsumed under the term body schema
is not represented as one single, unitary entity in the brain—even if, from a psychological stand-
point, it would seem to constitute an easily graspable and logically coherent concept. However,
as has often proved to be the case in psychology and in the neurosciences, the functional entities that
researchers have hypothesized for the brain's organization do not necessarily correspond to the way
the brain has actually evolved. The organization of the parietal and frontal areas seems to be modu-
lar, and they appear to be specialized for certain body parts (Rizzolatti et al. 1998; Grefkes and
Fink 2005; Andersen and Cui 2009), for example, for hand grasping, arm reaching, and eye move-
ments. Similarly, at least in parts of the premotor cortex, RFs for the different sensory modalities are
body part–centered (e.g., around the hand; see also Section 28.3.2), suggesting that, possibly, other
body part–specific areas may feature coordinate frames anchored to those body parts (Holmes and
Spence 2004). As a consequence, the holistic body schema that we subjectively experience has been
proposed to emerge from the interaction of multiple space-, body-, and action-related brain areas
(Holmes and Spence 2004).

28.2.7  Interim Summary


The first part of this chapter has highlighted how important the integration of multisensory informa-
tion is for body processing. We showed that a representation of our body parts is probably innate,
and that lesions to different brain structures such as the parietal and frontal lobes as well as subcor-
tical structures can lead to malfunctions of this representation. Patients can perceive lost limbs as
still present, report additional limbs to the normal ones, and deny the ownership of a limb. We went
on to show how the integration of multisensory (usually visual and tactile) information is used in an
online modification or “construction” of the body schema. In the rubber hand illusion, synchronous
multisensory information leads to the integration of an external object into the body schema in the
sense that the location of the real limb is felt to be at the external object. Multisensory information
can also lead to adjustments of perceived body shape, as in the Pinocchio illusion. Information
about body parts—their movement and their posture—is represented in a widespread network
in the brain. Whereas limb movement perception seems to rely on motor structures, multisensory
parietal areas are especially important for the maintenance of a postural representation. Finally, we
noted that the current concept of the body schema in the brain is that of an interaction between many
body part–specific representations.

28.3  THE BODY AS A MODULATOR FOR MULTISENSORY PROCESSING


The first part of this chapter has focused on the multisensory nature of the body schema with its two
aspects of what parts make up the body, and where those parts are located in space and in relation
to one another. These studies form the basis for an exploration of the specific characteristics of body
processing and its relevance for perception, action, and the connection of these two processes. The
remainder of this chapter, therefore, will adopt the opposite perspective from the first part: it will assume the
existence of a body schema and explore its influence on multisensory processing.
One of the challenges for multisensory processing is that information from the different senses
is received by sensors that are arranged very differently from modality to modality. In vision, light
originating from neighboring spatial locations falls on neighboring rods and cones on the retina.
When the eyes move, light from the same spatial origin falls on different sensors on the retina.
Visual information is therefore initially eye-centered. Touch is perceived through sensors all over
the skin. Because the body parts constantly move in relation to each other, a touch to the same part
of the skin can correspond to very different locations in external, visual space. Similar challenges
arise for the spatial processing in audition, but we will focus here on vision and touch.

28.3.1  Recalibration of Sensory Signals and Optimal Integration


In some cases, the knowledge about body posture and movement is used to interpret sensory infor-
mation. For example, Lackner and Shenker (1985) attached a light or a sound source to each hand
of their participants who sat in a totally dark room. They then vibrated the biceps muscles of the
two arms; recall that muscle vibration induces the illusion of limb movement. In this experimental
setup, participants perceived an outward movement of the two arms. Both the lights and the sound
were perceived as moving with the apparent location of the hands, although the sensory information
on the retina and in the cochlea remained identical throughout these manipulations. Such experi-
mental findings have led to the proposal that the brain frequently recalibrates the different senses
to ensure that the actions carried out with the limbs are in register with the external world (Lackner
and DiZio 2000). The brain seems to use different sensory input to do this, depending on the experi-
mental situation. In the rubber hand illusion, visual input about arm position apparently overrules
proprioceptive information about the real position of the arm. In other situations, such as in the arm
vibration illusion, proprioception can overrule vision.
Although winner-take-all schemes for such dominance of one sense over another have been
proposed (e.g., Ramachandran and Hirstein 1998), there is ample evidence that inconsistencies in
the information from the different senses do not simply lead to an overruling of one sense by the other.
Rather, the brain seems to combine the different senses to come up with a statistically optimal esti-
mate of the true environmental situation, allowing for statistically optimal movements (Körding and
Wolpert 2004; Trommershäuser et al. 2003) as well as perceptual decisions (Ernst and Banks 2002;
Alais and Burr 2004). Because in many cases one of our senses outperforms the others in a specific
sensory ability—for example, spatial acuity is superior in vision (Alais and Burr 2004), and tem-
poral acuity is best in audition (Shams et al. 2002; Hötting and Röder 2004)—many experimental
results have been interpreted in favor of an “overrule” hypothesis. Nevertheless, it has been demon-
strated, for example, in spatial tasks, that the weight the brain assigns to the information received
through a sensory channel is directly related to its spatial acuity, and that audition (Alais and Burr
2004) and touch (Ernst and Banks 2002) will overrule vision when visual acuity is sufficiently
degraded. Such integration is probably involved also in body processing and in such phenomena as
the rubber hand and Pinocchio illusions.
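The optimal-integration scheme referred to above is usually formalized as maximum-likelihood estimation (Ernst and Banks 2002; Alais and Burr 2004), in which each unimodal estimate is weighted by its reliability, that is, by its inverse variance. In the generic notation of the block below (ours, not the chapter's), the visual and tactile estimates of the same property are combined with weights proportional to their inverse variances:

```latex
% Reliability-weighted (maximum-likelihood) combination of a visual and a
% tactile estimate of the same property.
\hat{S}_{VT} = w_V\,\hat{S}_V + w_T\,\hat{S}_T,
\qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_T^2},
\qquad
w_T = 1 - w_V,
\qquad
\sigma_{VT}^2 = \frac{\sigma_V^2\,\sigma_T^2}{\sigma_V^2 + \sigma_T^2} \le \min(\sigma_V^2, \sigma_T^2)
```

When vision is precise, its variance is small, the visual weight approaches 1, and vision appears to "overrule" touch; degrading the visual input increases the visual variance and shifts the weight toward touch, which is the pattern described above.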
In sum, the body schema influences how multisensory information is interpreted by the brain.
The weight that a piece of sensory information is given varies with its reliability (see also de
Vignemont 2010).

28.3.2  Body Schema and Peripersonal Space


Many neurons in a brain circuit involving the ventral intraparietal (VIP) area and the ventral pre-
motor cortex (PMv) feature tactile RFs, mainly around the monkey’s mouth, face, or hand. These
tactile RFs are supplemented by visual and sometimes auditory RFs that respond to stimuli up to
~30 cm around the body part (Fogassi et al. 1996; Rizzolatti et al. 1981a, 1981b; Graziano et al.
1994, 1999; Duhamel et al. 1998; Graziano and Cooke 2006). Importantly, when either the body
part or the eyes are moved, the visual RF is adjusted online such that the tactile and the visual
modality remain aligned within a given neuron (Graziano et al. 1994). When one of these neurons
is electrically stimulated, the animal makes defensive movements (Graziano and Cooke 2006).
Because of these unique RF properties, the restricted part of space represented by this VIP–PMv
circuit has been termed the peripersonal space, and it has been suggested to represent a defense zone
around the body. Note that the continuous spatial adjustment of the visual to the tactile RF requires
both body posture and eye position to be integrated in a continuous manner. Two points therefore
become immediately clear: first, the peripersonal space and the body schema are intimately related
(see also Cardinali et al. 2009a); and second, like the body schema, the representation of peripersonal
space includes information from several (if not all) sensory modalities. As is the case with the term
“body schema,” the term “peripersonal space” has also been defined in several ways. It is sometimes
used to denote the space within arm’s reach (see, e.g., Previc 1998). For the purpose of this review,
“peripersonal space” will be used to denote the space directly around the body, in accord with the
findings in monkey neurophysiology.
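To make the defining RF property concrete, the following is a purely illustrative sketch (not a model taken from the studies cited above); the function name, the Gaussian response profile, and all parameter values are our own assumptions:

```python
import numpy as np

def peripersonal_response(stimulus_xy, hand_xy, rf_radius_cm=30.0, gain=1.0):
    """Toy firing-rate model of a hand-centered visuotactile neuron.

    The visual response depends only on the stimulus position *relative to
    the current hand position*, so the effective visual RF follows the hand
    as it moves, mirroring the alignment property described in the text.
    """
    distance = np.linalg.norm(np.asarray(stimulus_xy) - np.asarray(hand_xy))
    return gain * np.exp(-(distance / rf_radius_cm) ** 2)

# The same external stimulus drives the neuron strongly only when the hand
# is nearby: the RF is anchored to the body part, not to a fixed location.
stimulus = (40.0, 10.0)                                        # cm, external coordinates
print(peripersonal_response(stimulus, hand_xy=(35.0, 12.0)))   # hand near the stimulus: strong response
print(peripersonal_response(stimulus, hand_xy=(-20.0, 0.0)))   # hand far away: response close to zero
```

Because only the relative position enters the computation, the modeled RF is aligned with the hand wherever proprioception and vision indicate the hand to be, which is the property that gives peripersonal space its body-anchored character.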
Different approaches have been taken to investigate whether peripersonal space is represented
similarly in humans and in monkeys. One of them has been the study of patients suffering from extinction.
These patients are usually able to report single stimuli in all spatial locations, but fail to detect con-
tralesional stimuli when these are concurrently presented with ipsilesional stimuli (Ladavas 2002).
The two stimuli can be presented in two different modalities (Ladavas et al. 1998), indicating that
the process that is disrupted by extinction is multisensory in nature. More importantly, extinction
is modulated in some patients by the distance of the distractor stimulus (i.e., the ipsilesional stimu-
lus that extinguishes the contralesional stimulus) from the hand. For example, in some patients a
tactile stimulus to the contralesional hand is extinguished by an ipsilesional visual stimulus to a
much higher degree when it is presented in the peripersonal space of the patient’s ipsilesional hand
than when it is presented far from it (di Pellegrino and Frassinetti 2000). Therefore, extinction is
characterized by two properties that are central to neurons representing peripersonal space in
monkeys: (1) extinction can be multisensory and (2) it can dissociate between peripersonal and
extrapersonal space. In addition, locations of lesions associated with extinction coincide (at least
coarsely) with the brain regions associated with peripersonal spatial functions in monkeys (Mort et
al. 2003; Karnath et al. 2001). The study of extinction patients has therefore suggested that a circuit
for peripersonal space exists in humans, analogous to that in the monkey.
The peripersonal space has also been investigated in healthy humans. One of the important
characteristics of the way the brain represents peripersonal space is the alignment of visual and
tactile events. In an fMRI study in which participants had to judge if a visual stimulus and a tactile
stimulus to the hand were presented from the same side of space, hand crossing led to an increase
of activation in the secondary visual cortex, indicating an influence of body posture on relatively
low-level sensory processes (Misaki et al. 2002). In another study, hand posture was manipulated
in relation to the eye: rather than changing hand posture itself, gaze was directed such that a tac-
tile stimulus occurred either in the right or the left visual hemifield. The presentation of bimodal
visual–tactile stimuli led to higher activation in the visual cortex in the hemisphere contralateral to
the visual hemifield of the tactile location, indicating that the tactile location was remapped with
respect to the visual space and then influenced visual cortex (Macaluso et al. 2002). These influ-
ences of posture and eye position on early sensory cortex may be mediated by parietal cortex. For
example, visual stimuli were better detected when a tactile stimulus was concurrently presented
(Bolognini and Maravita 2007). This facilitatory influence of the tactile stimulus was strongest when the
hand was held near the visual stimulus, whether this implied an uncrossed or a crossed hand posture.
However, hand crossing had a very different effect when neural processing in the posterior parietal
cortex was impaired by repetitive TMS: now a tactile stimulus was most effective when it was deliv-
ered to the hand anatomically belonging to that side of the body at which the visual stimulus was
presented; when the hands were crossed, a right hand stimulus, for example, facilitated a right-side
visual stimulus, although the hand was located in the left visual space (Bolognini and Maravita
2007). This result indicates that after disruption of parietal processing, body posture was no longer
taken into account during the integration of vision and touch, nicely in line with the findings about
the role of parietal cortex for posture processing (see Section 28.2.5).
A more direct investigation of how the brain determines whether a stimulus is located in the peri-
personal space was undertaken in an fMRI study that independently manipulated visual and pro-
prioceptive cues about hand posture to modulate the perceived distance of a small visual object
from the participants’ hand. Vision of the arm could be occluded, and the occluded arm was then
located near the visual object (i.e., peripersonally) or far from it; the distance from the object could
be determined by the brain only by using proprioceptive information. Alternatively, vision could be
available to show that the hand was either close or far from the stimulus. Ingeniously, the authors
manipulated these proprioceptive and visual factors independently of each other by using a rubber arm: when the real
arm was held far away from the visual object, the rubber hand could be placed near the object so
that visually the object was in peripersonal space (Makin et al. 2007). fMRI activity due to these
manipulations was found in posterior parietal areas. There was some evidence that for the determi-
nation of posture in relation to the visual object, proprioceptive signals were more prominent in the
anterior IPS close to the somatosensory cortex, and that vision was more prominent in more poste-
rior IPS areas, closer to visual areas. Importantly, however, all of these activations were located in
the SPL and IPS, the areas that have repeatedly been shown to be relevant for the representation of
posture and of the body schema.
Besides these neuroimaging approaches, behavioral studies have also been successful in investi-
gating the peripersonal space and the body schema. One task that has yielded a multitude of find-
ings is a cross-modal interference paradigm, the cross-modal congruency (CC) task (reviewed by
Spence et al. 2004b). In this task, participants receive a tactile stimulus to one of four locations; two
of these locations are located “up” and two are located “down” (see Figure 28.1). Participants are
asked to judge the elevation of the tactile stimulus in each trial, regardless of its side (left or right).

FIGURE 28.1  Standard cross-modal congruency task. Tactile stimuli are presented to two locations on the
hand (often index finger and thumb holding a cube; here, back and palm of the hand). In each trial, one of the
tactile stimuli is presented concurrently with one of the visual distractor stimuli. Participants report whether
the tactile stimulus came from an upper or a lower location. Although they are to ignore the visual distractors,
their tactile judgment is biased toward the location of the light. This influence is biggest when the distractor
is presented at the same hand as the tactile stimulus, and reduced when the distractor occurs at the other hand.

However, a to-be-ignored visual distractor stimulus is presented with every tactile target stimulus,
also located at one of the four locations at which the tactile stimuli can occur. The visual distractor
is independent of the tactile target; it can therefore occur at a congruent location (tactile and visual
stimulus have the same elevation) or at an incongruent location (tactile and visual stimulus have
opposing elevations). Despite the instruction to ignore the visual distractors, participants’ reaction
times and error probabilities are influenced by them. When the visual distractors are congruent,
participants perform faster and with higher accuracy than when the distractors are incongruent.
The difference between the incongruent and the congruent conditions (e.g., in RT and in accuracy) is
referred to as the CC effect. Importantly, the CC effect is larger when the distractors are located
close to the stimulated hands rather than far away (Spence et al. 2004a). Moreover, the CC effect is
larger when the distractors are placed near rubber hands, but only if those are positioned in front
of the participant in such a way that, visually, they could belong to the participant’s body (Pavani
et al. 2000). The CC effect is also modulated by tool use in a similar manner as by rubber hands;
when a visual distractor is presented in far space, the CC effect is relatively small, but it increases
when a tool is held near the distractor (Maravita et al. 2002; Maravita and Iriki 2004; Holmes et al.
2007). Finally, the CC effect is increased during the whole body illusion (induced by synchronous
stroking; see Section 28.2.2) when the distractors are presented on the back of the video image felt
to be the own body, compared to when participants see the same video image and distractor stimuli,
but without the induction of the whole body illusion (Aspell et al. 2009). These findings indicate that
cross-modal interaction, as indexed in the CC effect, is modulated by the distance of the distractors
from what is currently represented as the own body (i.e., the body schema) and thus suggest that the
CC effect arises in part from the processing of peripersonal space.
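For concreteness, the CC effect just described is a simple difference score; the following minimal sketch uses invented reaction times (none of these numbers are data from the cited studies):

```python
def cc_effect(incongruent_rt_ms, congruent_rt_ms):
    """Cross-modal congruency (CC) effect: incongruent minus congruent performance.

    Larger values indicate stronger interference from the visual distractor;
    the same difference score can be computed for error rates.
    """
    return incongruent_rt_ms - congruent_rt_ms

# Hypothetical mean reaction times (ms): interference is typically larger
# when the distractors appear close to the stimulated hand.
print(cc_effect(620, 540))  # distractor near the hand -> CC effect of 80 ms
print(cc_effect(585, 550))  # distractor in far space  -> CC effect of 35 ms
```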
To summarize, monkey physiology, neuropsychological findings, and behavioral research sug-
gest that the brain specially represents the space immediately around the body, the peripersonal space.
There is a close relationship between the body schema and the representation of peripersonal space,
as body posture must be taken into account to remap, from moment to moment, which part of exter-
nal space is peripersonal.

28.3.3  Peripersonal Space around Different Parts of the Body


All of this behavioral research—like the largest part of neurophysiological, neuro-
psychological, and neuroimaging research—has explored peripersonal space and the body schema
using stimulation to and near the hands. The hands may be considered special in that they are
used for almost any kind of action we perform. Processing principles revealed for the hands may
therefore not generalize to other body parts. As an example, hand posture, but not foot posture, has
been reported to influence the mental rotation of these limbs (Ionta et al. 2007; Ionta and Blanke
2009; but see Parsons 1987). Moreover, monkey work has demonstrated multisensory neurons with
peripersonal spatial characteristics only for the head, hand, and torso, but neurons with equivalent
characteristics for the lower body have so far not been reported (Graziano et al. 2002). The periper-
sonal space representation may thus be limited to body parts that are important for the manipulation
of objects under (mainly) visual control. To test this hypothesis in humans, body schema–related
effects such as the CC effect, which have so far been demonstrated for the hands, must be investigated for
other body parts.
The aforementioned study of the CC effect during the whole body illusion (Aspell et al. 2009;
see also Section 28.2.2) demonstrated a peripersonal spatial effect near the back. The CC effect was
observable also when stimuli were delivered to the feet (Schicke et al. 2009), suggesting that a rep-
resentation of the peripersonal space exists also for the space around these limbs. If the hypothesis
is correct that the body schema is created from body part–specific representations, one might expect
that the representation of the peripersonal space of the hand and that of the foot do not interact. To
test this prediction, tactile stimuli were presented to the hands while visual distractors were flashed
either near the participant’s real foot, near a fake foot, or far from both the hand and the foot. The
cross-modal interference of the visual distractors, indexed by the CC effect, was larger when they
were presented in the peripersonal space of the real foot than when they were presented near the
fake foot or in extrapersonal space (Schicke et al. 2009). The spatial judgment of tactile stimuli at
the hand was thus modulated when a visual distractor appeared in the peripersonal space of another
body part. This effect cannot be explained by the current concept of peripersonal space as tactile
RFs encompassed by visual RFs. These results rather imply either a holistic body schema represen-
tation, or, more probably, interactions beyond simple RF overlap between the peripersonal space
representations of different body parts (Holmes and Spence 2004; Spence et al. 2004b).
In sum, the peripersonal space is represented not just for the hands, but also for other body parts.
Interactions between the peripersonal spatial representations of different body parts challenge the
concept of peripersonal space being represented merely by overlapping RFs.

28.3.4  Across-Limb Effects in Spatial Remapping of Touch


The fact that visual distractors in the CC paradigm have a greater influence when they are presented
in the peripersonal space implies that the brain matches the location of the tactile stimulus with
that of the visual one. The tactile stimulus is registered on the skin; to match this skin location to
the location of the visual stimulus requires that body posture be taken into account and the skin
location be projected into an external spatial reference frame. Alternatively, the visual location of
the distractor could be computed with regard to the current location of the tactile stimulus, that is,
with respect to the hand, and thus be viewed as a projection of external onto the somatotopic space
(i.e., the skin).
This remapping of visual–tactile space has been more thoroughly explored by manipulating
hand posture. As in the standard CC task described earlier, stimuli were presented to the two hands
and the distractors were placed near the tactile stimuli (Spence et al. 2004a). However, in half of the
trials, participants crossed their hands. If spatial remapping occurs in this task, then the CC effect
should be high whenever the visual distractor is located near the stimulated hand. In contrast, if tac-
tile stimuli were not remapped into external space, then a tactile stimulus on the right hand should
always be influenced most by a right-hemifield visual stimulus, independent of body posture. The
results were clear-cut: when the hands were crossed, the distractors that were now near the stimulated hand
were most effective. In fact, in this experiment the CC effect pattern of left and right distractor
stimuli completely reversed, which the authors interpreted as a “complete remapping of visuotactile
space” (p. 162).
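To make the logic of this remapping concrete, here is a deliberately minimal sketch (the function names and the binary left/right coding are ours, purely for illustration): the tactile stimulus is recoded from an anatomical label into the external hemispace its hand currently occupies, and the distractor sharing that hemispace is predicted to interfere most.

```python
def external_side(stimulated_hand, hands_crossed):
    """Recode a tactile stimulus from its anatomical label ('left'/'right' hand)
    into the external hemispace that this hand currently occupies."""
    if not hands_crossed:
        return stimulated_hand
    return "left" if stimulated_hand == "right" else "right"

def most_effective_distractor_side(stimulated_hand, hands_crossed):
    """After remapping, the distractor sharing the tactile stimulus's external
    hemispace is predicted to produce the largest CC effect."""
    return external_side(stimulated_hand, hands_crossed)

# Uncrossed: a right-hand touch is most affected by a right-side distractor.
print(most_effective_distractor_side("right", hands_crossed=False))  # -> 'right'
# Crossed: the right hand now lies in left external space, so a left-side
# distractor interferes most, as in the reversal reported by Spence et al. (2004a).
print(most_effective_distractor_side("right", hands_crossed=True))   # -> 'left'
```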
Spatial remapping could thus be viewed as a means of integrating spatial information from the
different senses in multisensory contexts. However, spatial remapping has also been observed in
purely tactile tasks that do not involve any distractor stimuli of a second modality. One example is
the temporal order judgment (TOJ) task, in which participants judge which of two tactile stimuli
occurred first. Performance in this task is impaired when participants cross their hands (Yamamoto
and Kitazawa 2001a; Shore et al. 2002; Röder et al. 2004; Schicke and Röder 2006; Azanon and
Soto-Faraco 2007). It is usually assumed that the performance deficit after hand crossing in the
TOJ task is due to a conflict between two concurrently active reference frames: one anatomical and
one external (Yamamoto and Kitazawa 2001a; Röder et al. 2004; Schicke and Röder 2006). The
right–left coordinate axes of these two reference frames are opposed to each other when the hands
are crossed; for example, the anatomically right arm is located in the externally left hemispace dur-
ing hand crossing. This remapping takes place despite the task being purely tactile, and despite the
detrimental effect of using the external reference frame in the task. Remapping of stimulus location
by accounting for current body posture therefore seems to be an automatically evoked process in
the tactile system.
In the typical TOJ task, the two stimuli are applied to the two hands. It would therefore be pos-
sible that the crossing effect is simply due to a confusion regarding the two homologous limbs,
rather than to the spatial location of the stimuli. This may be due to a coactivation of homologous
brain areas in the two hemispheres (e.g., in SI or SII), which may make it difficult to assign the two
concurrent tactile percepts to their corresponding visual spatial locations. However, a TOJ crossing
effect was found for tactile stimuli delivered to the two hands, to the two feet, or to one hand and the
contralateral foot (Schicke and Röder 2006). In other words, participants were confused not only
about which of the two hands or the two feet was stimulated first, but they were equally impaired
in deciding if it was a hand or a foot that received the first stimulus. Therefore, the tactile location
originating on the body surface seems to be remapped into a more abstract spatial code for which
the original skin location, and the somatotopic coding of primary sensory cortex, is no longer a
dominating feature. In fact, it has been suggested that the location of a tactile stimulus on the body
may be reconstructed by determining which body part currently occupies the part of space at which
the tactile stimulus has been sensed (Kitazawa 2002). The externally anchored reference frame is
activated in parallel with a somatotopic one, and their concurrent activation leads to the observed
behavioral impairment.
To summarize, remapping of stimulus location in a multisensory experiment such as the CC
paradigm is a necessity for aligning signals from different modalities. Yet, even when stimuli are
purely unimodal, and the task would not require a recoding of tactile location into an external coor-
dinate frame, such a transformation nonetheless seems to take place. Thus, even for purely tactile
processing, posture information (e.g., proprioceptive and visual) is automatically integrated.

28.3.5  Is the External Reference Frame a Visual One?


The representation of several reference frames is, of course, not unique to the TOJ crossing effect. In
monkeys, the parallel existence of multiple reference frames has been demonstrated in the different
subareas of the IPS, for example, in VIP (Schlack et al. 2005), which is involved in the representa-
tion of peripersonal space, in MIP, which is involved in arm reaching (Batista et al. 1999), and in
LIP, which is engaged in saccade planning (Stricanne et al. 1996). Somewhat counterintuitively,
many neurons in these areas do not represent space in a reference frame that can be assigned to
one of the sensory systems (e.g., a retinotopic one for vision, a head-centered one for audition)
or a specific limb (e.g., a hand-centered reference frame for hand reach planning). Rather, there
are numerous intermediate coding schemes present in the different neurons (Mullette-Gillman et
al. 2005; Schlack et al. 2005). However, such intermediate coding has been shown to enable the
transformation of spatial codes between different reference frames, possibly even in different direc-
tions, for example, from somatotopic to eye-centered and vice versa (Avillac et al. 2005; Pouget et
al. 2002; Cohen and Andersen 2002; Xing and Andersen 2000). Similar intermediate coding has
been found in posture-related area 5, which codes hand position in an intermediate manner between
eye- and hand-centered coordinates (Buneo et al. 2002). Further downstream, in some parts of MIP,
arm reaching coordinates may, in contrast, be represented fully in eye-centered coordinates, inde-
pendent of whether the sensory target for reaching is visual (Batista et al. 1999; Scherberger et al.
2003; Pesaran et al. 2006) or auditory (Cohen and Andersen 2000). In addition to these results from
monkeys, an fMRI experiment in humans has suggested common spatial processing of visual and
tactile targets for saccade as well as for reach planning (Macaluso et al. 2007). Still further down-
stream, in the motor-related PMv, which has been proposed to form the peripersonal space circuit
together with VIP, visual RFs are aligned with hand position (Graziano and Cooke 2006).
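As a deliberately simplified, one-dimensional illustration of what such a coordinate transformation involves (generic symbols, not a formalization from the cited studies; rotational geometry is ignored), an eye-centered target location can be converted into head- and body-centered coordinates by adding the current postural signals, and the inverse transformations subtract them:

```latex
% x_eye: eye-centered target location; e: eye-in-head position; h: head-on-body position.
x_{\text{head}} = x_{\text{eye}} + e,
\qquad
x_{\text{body}} = x_{\text{head}} + h = x_{\text{eye}} + e + h
```

Intermediate codes of the kind described above can be thought of as lying partway along this chain, which is why they are well suited for converting locations between sensory and effector-specific reference frames.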
These findings have led to the suggestion that the external reference frame involved in tactile
localization is a visual one, and that remapping occurs automatically to aid the fusion of spatial
information of the different senses. Such use of visual coordinates may be helpful not only for action
planning (e.g., the reach of the hand toward an object), but also for an efficient online correction of
motor error with respect to the visual target (Buneo et al. 2002; Batista et al. 1999).
A number of variants of the TOJ paradigm have been employed to study the visual origin of the
external reference frame in humans. For example, the crossing effect could be ameliorated when
participants viewed uncrossed rubber hands (with their real hands hidden), indicating that visual
(and not just proprioceptive) cues modulate spatial remapping (Azanon and Soto-Faraco 2007). In
the same vein, congenitally blind people did not display a TOJ crossing effect, suggesting that they
do not by default activate an external reference frame for tactile localization (Röder et al. 2004).
Congenitally blind people also outperformed sighted participants when the use of an anatomically
anchored reference frame was advantageous to solve a task, whereas they performed worse than
the sighted when an external reference frame was better suited to solve a task (Röder et al. 2007).
Importantly, people who became blind later in life were influenced by an external reference frame in
the same manner as sighted participants, indicating that spatial remapping develops during ontog-
eny when the visual system is available, and that the lack of automatic coordinate transformations
into an external reference frame is not simply an unspecific effect of long-term visual deprivation
(Röder et al. 2004, 2007). In conclusion, the use of an external reference frame seems to be induced
by the visual system, and this suggests that the external coordinates used in the remapping of sen-
sory information are visual coordinates.
Children did not show a TOJ crossing effect before the age of ~5½ years (Pagel et al. 2009). This
late use of external coordinates suggests that spatial remapping requires extensive learning
and visual–tactile experience during interaction with the environment. One might therefore expect
remapping to take place only in regions of space that are accessible to vision. In the TOJ paradigm, one
would thus expect a crossing effect when the hands are held in front, but no such crossing effect
when the hands are held behind the back (as, because of the lack of tactile–visual experience in that
part of space, no visual–tactile remapping should take place). At odds with these predictions, Kobor
and colleagues (2006) observed a TOJ crossing effect (although somewhat reduced) also behind the
back. We conducted the same experiment in our laboratory, and found that the size of the crossing
effect did not differ in the front and in the back [previously unpublished data; n = 11 young, healthy,
blindfolded adults; just noticeable difference (JND) for correct stimulus order: uncrossed front: 66 ±
10 ms; uncrossed back: 67 ± 11 ms; crossed front: 143 ± 39 ms; crossed back: 138 ± 25 ms; ANOVA
main effect of part of space and interaction of hand crossing with part of space, both F(1,10) < 1].
Because we must assume only minimal visual–tactile experience for the space behind our body,
the results of these two experiments do not support the idea that the external coordinate system in
tactile remapping is purely visual.
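The JNDs reported above are derived from psychometric functions fitted to the order judgments. As a generic illustration only (we make no claim about the exact fitting procedure used for these unpublished data), a cumulative Gaussian can be fitted to the proportion of "right hand first" responses across stimulus onset asynchronies (SOAs), and the JND derived from its slope:

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def psychometric(soa_ms, pse, sigma):
    """Cumulative Gaussian: probability of responding 'right hand first' at a
    given SOA (positive SOA = right hand actually stimulated first)."""
    return norm.cdf(soa_ms, loc=pse, scale=sigma)

# Hypothetical TOJ data: SOAs (ms) and proportion of 'right first' responses.
soas = np.array([-200.0, -90.0, -30.0, 30.0, 90.0, 200.0])
p_right_first = np.array([0.05, 0.20, 0.40, 0.65, 0.85, 0.95])

(pse, sigma), _ = curve_fit(psychometric, soas, p_right_first, p0=[0.0, 50.0])

# One common convention takes the JND as the SOA between the 50% and 75%
# points of the fitted curve, i.e., JND = sigma * z(0.75).
jnd = sigma * norm.ppf(0.75)
print(f"PSE = {pse:.1f} ms, JND = {jnd:.1f} ms")
```

A shallower fitted function (larger sigma), and hence a larger JND, is what the crossed-hands conditions produce relative to the uncrossed conditions in the data reported above.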
It is possible that the brain uses, rather than simply visual (i.e., eye- or retina-centered) coordinates,
an action-based reference frame that represents the environment (or the action target location) in
external coordinates that can be used to orient not only the eyes, but also gaze (eyes and head together),
the trunk, or the whole body. In other words, the external reference frame may be anchored to the eyes for that part of
space that is currently accessible to the eyes, but may be related to head, trunk, or body movement
parameters for those parts of space currently out of view. Such a coordinate system would benefit
from using external coordinates, because eye-, head-, and possibly body or trunk position must all
be fused to allow the directing of the eyes (and, with them, usually the focus of attention) onto an
externally located target.
Such an action-related reference frame seems plausible for several reasons. At least eye and
head orienting (together referred to as gaze orienting) are both mediated by the superior colliculus
(SC) (Walton et al. 2007), a brain structure that is important for multisensory processing (Stein et
al. 1995; Freedman and Sparks 1997; Stuphorn et al. 2000) and that is connected to the IPS (Pare
and Wurtz 1997, 2001). IPS as well as PMv (also connected to IPS) encode reaching (i.e., action)
targets also in the dark, that is, in the absence of visual information (Fattori et al. 2005; Graziano
et al. 1997). Moreover, a recent fMRI study demonstrated activation of the frontal eye fields—a
structure thought to be involved in saccadic eye movements and visual attention—to sounds behind
the head (Tark and Curtis 2009), which would suggest either a representation of unseen space (Tark
and Curtis 2009) or, alternatively, the representation of a target coordinate in “action space” rather
than in eye-centered space.
For locations that one can orient toward with an eye–head movement, an action-based reference
frame could be identical to a visual reference frame and use eye- or gaze-centered coordinates, in
line with eye-centered coding of saccade as well as hand reach targets in LIP and MIP. The mon-
key’s head is usually fixed during single-cell recording experiments, making it impossible to differ-
entiate between eye-centered and gaze-centered (let alone trunk- or body-centered target) coding.
In addition, the spatial coding of reach targets that are out of view (but not in the dark) has, to our
knowledge, not been investigated.
To sum up, many electrophysiological and behavioral studies have suggested that touch is
remapped into visual coordinates, presumably to permit its integration with information from other
modalities. Remapping refers to a recoding of the location of a tactile event on the skin onto its
external-spatial coordinates; in other words, remapping accounts for body posture when matching
visual and tactile spatial locations. Because of the influence of the visual system during ontogeny
(and, therefore, not in the congenitally blind), remapping occurs even for unimodal tactile events.
Yet, the external reference frame may be “more than” visual, subserving orienting actions also to
parts of space outside the current visual field.

28.3.6  Investigating the Body Schema and Reference Frames with Electrophysiology
Most of the evidence for coordinate transformations in humans discussed in this chapter so far has
used behavioral measures. Electrophysiological measures (e.g., ERPs) offer an additional approach
to investigate these processes. ERPs record electrical brain signals with millisecond resolution and
therefore allow a very detailed investigation of functional brain activity. One fruitful approach is the
manipulation of the attentional processing of sensory stimuli: it is known that the ERP is enhanced
when a stimulus is presented at a location the person is currently attending to. In fact, there have
been reports about the effect of hand crossing on the attentional processing of tactile stimuli deliv-
ered to the two hands. When a tactile stimulus is delivered to a hand while participants direct their
attention to that hand, ERP deflections in the time range of 80–150 ms as well as between 200 and
300 ms are enhanced compared to when the same stimuli are delivered to the same hand while it is
not attended. However, when participants crossed their hands, early attentional effects disappeared,
and later effects were significantly reduced (Eimer et al. 2001, 2003).
These ERP results imply that tactile spatial attentional processes do not rely on an anatomi-
cal reference frame alone, as posture should otherwise have had no influence on attention-related
ERP effects. A disadvantage of this experimental design is that it differentiates only coarsely between
attended and unattended stimulation when determining the influence of reference frames: the lack of
difference between attended and unattended conditions after hand crossing may be due to mere
confusion of the two hands, effectively preventing selective direction of attention to one hand.
Alternatively, it may be due to attentional enhancement to one hand in a somatotopic, and to the
other hand in an external reference frame.
However, the difference in ERP magnitude between attended and unattended spatial locations is
not binary. Rather, ERPs gradually decrease with the distance at which a stimulus is presented from
the attended location, a phenomenon termed the spatial attentional gradient (Mangun and Hillyard
1988). The spatial gradient can be exploited to test more thoroughly if the ERP effects due to hand
crossing are attributable to hand confusion, and to investigate if coordinate transformations are
calculated for body parts other than the hands. To this end, participants were asked to attend to one
of their feet, while tactile stimuli were presented to both hands and feet in random stimulus order.
The hands were placed near the feet. Crucially, in some blocks each hand lay near its ipsilateral foot,
whereas in some blocks, the hands were crossed so that each hand lay next to its contralateral foot.
Thus, each hand could be near to or far from the attended location (one of the feet). The external
spatial distance of each hand to the attended foot reversed with hand crossing, whereas of course
the anatomical distance from each hand to the attended foot remained identical in both uncrossed
and crossed conditions. Investigating the spatial gradient in ERPs to hand and foot stimuli thus
made it possible to investigate whether the tactile system defines spatial distance in somatotopic or
in external coordinates.
In the time interval 100–140 ms after stimulus presentation, ERPs of unattended tactile hand
stimuli were more similar to the ERP of an attended hand stimulus when the hands were located
close to the attended foot than when they were located far away, demonstrating that tactile attention
uses an external reference frame (Heed and Röder 2010; see Figure 28.2). At the same time, ERPs
were also influenced by the anatomical distance between the attended and the stimulated locations.

[Figure 28.2 appears here: ERP traces plotted as amplitude (µV) against time (–100 to 400 ms relative to stimulus onset).]

FIGURE 28.2  ERP results for hand stimulation. Traces from a fronto-central electrode ipsilateral to stimu-
lation. In the figures depicting the different conditions, the attended foot is indicated by a filled gray dot; the
stimulated right hand is indicated by a gray cross. Thin black (lowest) trace (last figure): the hand was attended
and stimulated. Signal should be highest in this condition. Note that the direction of ERP (positive or negative
deflection) does not carry meaning in this context. Black traces (first and second figures): stimulation of the
hand contralateral to attended foot. Gray traces (third and fourth figures): stimulation of the hand ipsilateral
to attended foot. Thin traces: close spatial distance (according to an external reference frame) between stimu-
lated and attended limb. Bold traces: far spatial distance between stimulated and attended limbs. ERPs started
to differ after ~100 ms after stimulus. Differences were largest in the 100- to 140-ms time interval, which has
been known to be modulated by spatial attention. For hand stimulation both ipsilateral and contralateral to the
attended foot, a short spatial distance from the attended foot led to a more positive ERP in this time interval;
in other words, the ERP was more similar to the thin black trace (stimulation at attended location) for near
than for far spatial distance (thin vs. bold traces), indicating the use of an external spatial reference frame for
the representation of nonattended tactile stimuli. At the same time, anatomical distance (black vs. gray colors)
also modulated ERPs, indicating the use of an anatomical reference frame.

ERPs to unattended hand stimuli were more similar to the ERP of an attended hand stimulus when
the ipsilateral rather than the contralateral foot was attended.
ERPs in this time range are thought to originate in the secondary somatosensory cortex (SII)
(Frot and Mauguiere 1999; Eimer and Forster 2003). Recall that SII was implicated in the integra-
tion of a rubber hand, as indexed by the perceptual drift of the own hand toward the rubber hand
(Tsakiris et al. 2007), as well as in making postural judgments (Corradi-Dell’Acqua et al. 2009).
These findings thus converge with the ERP results in emphasizing the importance of relatively
lower-level somatosensory areas in the representation of our body schema by coding not only the
current position of our hands, but also the current spatial relationship of different body parts to each
other, both in anatomical and external coordinates.

28.3.7  Summary
The second part of this chapter focused on the influence of the body and the body schema on multi-
sensory processing. We started by showing that body posture can be used to calibrate the spatial rela-
tionship between the senses, and we discussed that the brain weighs information from the different
senses depending on their reliability. Such statistically optimal integration processes may also
be at the heart of the phenomena presented in the first part of the chapter, for example, the rubber
hand illusion. The remainder of the chapter focused on multisensory spatial processing, starting out
with the evidence for a special representation of the space directly around our body, demonstrat-
ing the link between the body schema and multisensory spatial processing. We showed that the
peripersonal space is represented not only for the hands, but also for other body parts, and that not
all experimental results can be explained by the common notion of the peripersonal space being
represented simply by tactile RFs on a body part with a matching visual RF. We then showed that
the body schema is important not only in multisensory processing, but also in purely tactile processing,
in that tactile locations are automatically remapped into external spatial coordinates. These external
coordinates are closely related to the visual modality, but extend beyond the current visual field into
the space that cannot be seen. Finally, ERPs were shown to be modulated both by anatomical and
external coordinate frames. This highlights that although in some situations tactile locations seem
to be fully remapped into purely external coordinates, the original, anatomical location of the touch
is never quite forgotten. Such concurrent representations of both anatomical and external location
seem useful in the context of action control. For example, to fend off a dangerous object—be it an
insect ready to sting or the hand of an adversary who has grabbed one’s arm—it is crucial to know
which limb can be chosen to initiate the defensive action, but also where the action must be guided
in space. Thus, when the right arm has been grabbed, one cannot use this arm to strike against the
opponent—and this is independent of the current external location of the captured arm. However,
once it has been determined which arm is free for use in a counterattack, it becomes crucial to know
where (in space) this arm should strike to fend off the attacker.

28.4  CONCLUSION
Our different senses enable us to perceive and act upon the environment. However, they also
enable us to perceive ourselves and, first and foremost, our body. Because we can move in many different
ways, our brain must keep track of our current posture at all times to guide actions effectively.
However, the brain is also surprisingly flexible with respect to representing what it assumes to
belong to the body at any given point in time, and about the body’s current shape. One of the main
principles of the brain’s body processing seems to be the attempt to “make sense of all the senses”
by integrating all available information. As we saw, this processing principle can lead to surprising
illusions, such as the rubber hand illusion, the Pinocchio nose, or the feeling of being located outside
the body, displaced toward a video image. As is often the case in psychology, these illusions also
inform us about how the brain operates under normal circumstances.
As much as multisensory information is important for the construction of our body schema, this
body representation is in turn important for many instances of multisensory processing. Visual
events in the peripersonal space are specially processed to protect our body, and our flexibility
to move in many ways requires that spatial locations registered by the different sensory modalities are
transformed into a common reference system. None of these functions could work without some
representation of the body’s current configuration.

REFERENCES
Aglioti, S., N. Smania, M. Manfredi, and G. Berlucchi. 1996. Disownership of left hand and objects related to
it in a patient with right brain damage. Neuroreport 8: 293–296.
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol
14: 257–262.
Andersen, R. A., and H. Cui. 2009. Intention, action planning, and decision making in parietal–frontal circuits.
Neuron 63: 568–583.
Armel, K. C., and V. S. Ramachandran. 2003. Projecting sensations to external objects: Evidence from skin
conductance response. Proc R Soc Lond B Biol Sci 270: 1499–1506.
Aspell, J. E., B. Lenggenhager, and O. Blanke. 2009. Keeping in touch with one’s self: Multisensory mecha-
nisms of self-consciousness. PLoS ONE 4: e6488.
Avillac, M., S. Deneve, E. Olivier, A. Pouget, and J. R. Duhamel. 2005. Reference frames for representing
visual and tactile locations in parietal cortex. Nat Neurosci 8: 941–949.
Azanon, E., and S. Soto-Faraco. 2007. Alleviating the ‘crossed-hands’ deficit by seeing uncrossed rubber
hands. Exp Brain Res 182: 537–548.
Bakheit, A. M., and S. Roundhill. 2005. Supernumerary phantom limb after stroke. Postgrad Med J 81: e2.
Batista, A. P., C. A. Buneo, L. H. Snyder, and R. A. Andersen. 1999. Reach plans in eye-centered coordinates.
Science 285: 257–260.
Berlucchi, G., and S. M. Aglioti. 2010. The body in the brain revisited. Exp Brain Res 200: 25–35.
Bestmann, S., A. Oliviero, M. Voss, P. Dechent, E. Lopez-Dolado, J. Driver, and J. Baudewig. 2006.
Cortical correlates of TMS-induced phantom hand movements revealed with concurrent TMS-fMRI.
Neuropsychologia 44: 2959–2971.
Blanke, O., T. Landis, L. Spinelli, and M. Seeck. 2004. Out-of-body experience and autoscopy of neurological
origin. Brain 127: 243–258.
Blanke, O., and T. Metzinger. 2009. Full-body illusions and minimal phenomenal selfhood. Trends Cogn Sci
13: 7–13.
Blanke, O., S. Ortigue, T. Landis, and M. Seeck. 2002. Stimulating illusory own-body perceptions. Nature 419:
269–270.
Bolognini, N., and A. Maravita. 2007. Proprioceptive alignment of visual and somatosensory maps in the pos-
terior parietal cortex. Curr Biol 17: 1890–1895.
Botvinick, M. 2004. Neuroscience. Probing the neural basis of body ownership. Science 305: 782–783.
Botvinick, M., and J. Cohen. 1998. Rubber hands ‘feel’ touch that eyes see. Nature 391: 756.
Breveglieri, R., C. Galletti, S. Monaco, and P. Fattori. 2008. Visual, somatosensory, and bimodal activities in
the macaque parietal area PEc. Cereb Cortex 18: 806–816.
Brugger, P., S. S. Kollias, R. M. Muri, G. Crelier, M. C. Hepp-Reymond, and M. Regard. 2000. Beyond remem-
bering: Phantom sensations of congenitally absent limbs. Proc Natl Acad Sci U S A 97: 6167–6172.
Bruno, N., and M. Bertamini. 2010. Haptic perception after a change in hand size. Neuropsychologia 48: 1853–1856.
Buneo, C. A., M. R. Jarvis, A. P. Batista, and R. A. Andersen. 2002. Direct visuomotor transformations for
reaching. Nature 416: 632–636.
Cardinali, L., C. Brozzoli, and A. Farne. 2009a. Peripersonal space and body schema: Two labels for the same
concept? Brain Topogr 21: 252–260.
Cardinali, L., F. Frassinetti, C. Brozzoli, C. Urquizar, A. C. Roy, and A. Farne. 2009b. Tool-use induces mor-
phological updating of the body schema. Curr Biol 19: R478–R479.
Cohen, Y. E., and R. A. Andersen. 2000. Reaches to sounds encoded in an eye-centered reference frame. Neuron
27: 647–652.
Cohen, Y. E., and R. A. Andersen. 2002. A common reference frame for movement plans in the posterior pari-
etal cortex. Nat Rev Neurosci 3: 553–562.
Collins, T., T. Schicke, and B. Röder. 2008. Action goal selection and motor planning can be dissociated by tool
use. Cognition 109: 363–371.
Corradi-Dell’Acqua, C., B. Tomasino, and G. R. Fink. 2009. What is the position of an arm relative to the body?
Neural correlates of body schema and body structural description. J Neurosci 29: 4162–4171.
Cutting, J. 1978. Study of anosognosia. J Neurol Neurosurg Psychiatry 41: 548–555.
Daprati, E., A. Sirigu, P. Pradat-Diehl, N. Franck, and M. Jeannerod. 2000. Recognition of self-produced move-
ment in a case of severe neglect. Neurocase 6: 477–486.
de Lange, F. P., R. C. Helmich, and I. Toni. 2006. Posture influences motor imagery: An fMRI study. Neuroimage
33: 609–617.
de Vignemont, F. 2010. Body schema and body image—pros and cons. Neuropsychologia 48: 669–680.
di Pellegrino, G., and F. Frassinetti. 2000. Direct evidence from parietal extinction of enhancement of visual
attention near a visible hand. Curr Biol 10: 1475–1477.
Dijkerman, H. C., and E. H. de Haan. 2007. Somatosensory processes subserving perception and action. Behav
Brain Sci 30: 189–201; discussion 201–239.
Duhamel, J. R., C. L. Colby, and M. E. Goldberg. 1998. Ventral intraparietal area of the macaque: Congruent
visual and somatic response properties. J Neurophysiol 79: 126–136.
Ehrsson, H. H. 2007. The experimental induction of out-of-body experiences. Science 317: 1048.
Ehrsson, H. H., C. Spence, and R. E. Passingham. 2004. That’s my hand! Activity in premotor cortex reflects
feeling of ownership of a limb. Science 305: 875–877.
Ehrsson, H. H., K. Wiech, N. Weiskopf, R. J. Dolan, and R. E. Passingham. 2007. Threatening a rubber hand
that you feel is yours elicits a cortical anxiety response. Proc Natl Acad Sci U S A 104: 9828–9833.
Eimer, M., D. Cockburn, B. Smedley, and J. Driver. 2001. Cross-modal links in endogenous spatial attention
are mediated by common external locations: Evidence from event-related brain potentials. Exp Brain Res
139: 398–411.
Eimer, M., and B. Forster. 2003. Modulations of early somatosensory ERP components by transient and sus-
tained spatial attention. Exp Brain Res 151: 24–31.
Eimer, M., B. Forster, and J. Van Velzen. 2003. Anterior and posterior attentional control systems use differ-
ent spatial reference frames: ERP evidence from covert tactile–spatial orienting. Psychophysiology 40:
924–933.
Ernst, M. O., and M. S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415: 429–433.
Fattori, P., D. F. Kutz, R. Breveglieri, N. Marzocchi, and C. Galletti. 2005. Spatial tuning of reaching activity in
the medial parieto-occipital cortex (area V6A) of macaque monkey. Eur J Neurosci 22: 956–972.
Fitzgerald, P. J., J. W. Lane, P. H. Thakur, and S. S. Hsiao. 2004. Receptive field properties of the macaque second
somatosensory cortex: Evidence for multiple functional representations. J Neurosci 24: 11193–11204.
Fogassi, L., V. Gallese, L. Fadiga, G. Luppino, M. Matelli, and G. Rizzolatti. 1996. Coding of peripersonal
space in inferior premotor cortex (area F4). J Neurophysiol 76: 141–157.
Freedman, E. G., and D. L. Sparks. 1997. Activity of cells in the deeper layers of the superior colliculus of the
rhesus monkey: Evidence for a gaze displacement command. J Neurophysiol 78: 1669–1690.
Frot, M., and F. Mauguiere. 1999. Timing and spatial distribution of somatosensory responses recorded in the
upper bank of the sylvian fissure (SII area) in humans. Cereb Cortex 9: 854–863.
Gallagher, S. 1986. Body image and body schema: A conceptual clarification. J Mind Behav 7: 541–554.
Graziano, M. S. 1999. Where is my arm? The relative role of vision and proprioception in the neuronal repre-
sentation of limb position. Proc Natl Acad Sci U S A 96: 10418–10421.
Graziano, M. S., and D. F. Cooke. 2006. Parieto-frontal interactions, personal space, and defensive behavior.
Neuropsychologia 44: 845–859.
Graziano, M. S., D. F. Cooke, and C. S. Taylor. 2000. Coding the location of the arm by sight. Science 290:
1782–1786.
Graziano, M. S., and S. Gandhi. 2000. Location of the polysensory zone in the precentral gyrus of anesthetized
monkeys. Exp Brain Res 135: 259–266.
Graziano, M. S., X. T. Hu, and C. G. Gross. 1997. Visuospatial properties of ventral premotor cortex. J
Neurophysiol 77: 2268–2292.
Graziano, M. S., L. A. Reiss, and C. G. Gross. 1999. A neuronal representation of the location of nearby sounds.
Nature 397: 428–430.
Graziano, M. S., C. S. Taylor, and T. Moore. 2002. Complex movements evoked by microstimulation of pre-
central cortex. Neuron 34: 841–851.
Graziano, M. S., G. S. Yap, and C. G. Gross. 1994. Coding of visual space by premotor neurons. Science 266:
1054–1057.
Grefkes, C., and G. R. Fink. 2005. The functional organization of the intraparietal sulcus in humans and mon-
keys. J Anat 207: 3–17.
Halligan, P. W., J. C. Marshall, and D. T. Wade. 1993. Three arms: A case study of supernumerary phantom
limb after right hemisphere stroke. J Neurol Neurosurg Psychiatry 56: 159–166.
Halligan, P. W., J. C. Marshall, and D. T. Wade. 1995. Unilateral somatoparaphrenia after right hemisphere
stroke: A case description. Cortex 31: 173–182.
Harris, C. M., and D. M. Wolpert. 1998. Signal-dependent noise determines motor planning. Nature 394:
780–784.
Heed, T., and B. Röder. 2010. Common anatomical and external coding for hands and feet in tactile attention:
Evidence from event-related potentials. J Cogn Neurosci 22: 184–202.
Holmes, N. P., G. A. Calvert, and C. Spence. 2004. Extending or projecting peripersonal space with tools?
Multisensory interactions highlight only the distal and proximal ends of tools. Neurosci Lett 372:
62–67.
Holmes, N. P., G. A. Calvert, and C. Spence. 2007. Tool use changes multisensory interactions in seconds:
Evidence from the crossmodal congruency task. Exp Brain Res 183: 465–476.
Holmes, N. P., H. J. Snijders, and C. Spence. 2006. Reaching with alien limbs: Visual exposure to pros-
thetic hands in a mirror biases proprioception without accompanying illusions of ownership. Percept
Psychophys 68: 685–701.
Holmes, N. P., and C. Spence. 2004. The body schema and multisensory representation(s) of peripersonal
space. Cogn Process 5: 94–105.
Holmes, N. P., C. Spence, P. C. Hansen, C. E. Mackay, and G. A. Calvert. 2008. The multisensory attentional
consequences of tool use: A functional magnetic resonance imaging study. PLoS ONE 3: e3502.
Hötting, K., and B. Röder. 2004. Hearing cheats touch, but less in congenitally blind than in sighted individuals.
Psychol Sci 15: 60–64.
Ionta, S., and O. Blanke. 2009. Differential influence of hands posture on mental rotation of hands and feet in
left and right handers. Exp Brain Res 195: 207–217.
Ionta, S., A. D. Fourkas, M. Fiorio, and S. M. Aglioti. 2007. The influence of hands posture on mental rotation
of hands and feet. Exp Brain Res 183: 1–7.
Iriki, A., M. Tanaka, and Y. Iwamura. 1996. Coding of modified body schema during tool use by macaque
postcentral neurones. Neuroreport 7: 2325–2330.
Kammers, M. P., F. de Vignemont, L. Verhagen, and H. C. Dijkerman. 2009a. The rubber hand illusion in
action. Neuropsychologia 47: 204–211.
Kammers, M. P., J. A. Kootker, H. Hogendoorn, and H. C. Dijkerman. 2009b. How many motoric body repre-
sentations can we grasp? Exp Brain Res 202: 203–212.
Kammers, M. P., L. O. Verhagen, L. H. C. Dijkerman, H. Hogendoorn, F. De Vignemont, and D. J. Schutter.
2009c. Is this hand for real? Attenuation of the rubber hand illusion by transcranial magnetic stimulation
over the inferior parietal lobule. J Cogn Neurosci 21: 1311–1320.
Karnath, H. O., S. Ferber, and M. Himmelbach. 2001. Spatial awareness is a function of the temporal not the
posterior parietal lobe. Nature 411: 950–953.
Kase, C. S., J. F. Troncoso, J. E. Court, J. F. Tapia, and J. P. Mohr. 1977. Global spatial disorientation. Clinico-
pathologic correlations. J Neurol Sci 34: 267–278.
Kitazawa, S. 2002. Where conscious sensation takes place. Conscious Cogn 11: 475–477.
Kobor, I., L. Furedi, G. Kovacs, C. Spence, and Z. Vidnyanszky. 2006. Back-to-front: Improved tactile dis-
crimination performance in the space you cannot see. Neurosci Lett 400: 163–167.
Körding, K. P., and D. M. Wolpert. 2004. Bayesian integration in sensorimotor learning. Nature 427:
244–247.
Lackner, J. R. 1988. Some proprioceptive influences on the perceptual representation of body shape and orien-
tation. Brain 111(Pt 2): 281–297.
Lackner, J. R., and P. A. DiZio. 2000. Aspects of body self-calibration. Trends Cogn Sci 4: 279–288.
Lackner, J. R., and B. Shenker. 1985. Proprioceptive influences on auditory and visual spatial localization. J
Neurosci 5: 579–583.
Lacroix, R., R. Melzack, D. Smith, and N. Mitchell. 1992. Multiple phantom limbs in a child. Cortex 28:
503–507.
Ladavas, E. 2002. Functional and dynamic properties of visual peripersonal space. Trends Cogn Sci 6: 17–22.
Ladavas, E., G. di Pellegrino, A. Farne, and G. Zeloni. 1998. Neuropsychological evidence of an integrated
visuotactile representation of peripersonal space in humans. J Cogn Neurosci 10: 581–589.
Lenggenhager, B., T. Tadi, T. Metzinger, and O. Blanke. 2007. Video ergo sum: manipulating bodily self-
consciousness. Science 317: 1096–1099.
Lloyd, D. M., D. I. Shore, C. Spence, and G. A. Calvert. 2003. Multisensory representation of limb position in
human premotor cortex. Nat Neurosci 6: 17–18.
Longo, M. R., F. Schuur, M. P. Kammers, M. Tsakiris, and P. Haggard. 2008. What is embodiment? A psycho-
metric approach. Cognition 107: 978–998.
Lotze, M., H. Flor, W. Grodd, W. Larbig, and N. Birbaumer. 2001. Phantom movements and pain. An fMRI
study in upper limb amputees. Brain 124: 2268–2277.
Macaluso, E., C. D. Frith, and J. Driver. 2002. Crossmodal spatial influences of touch on extrastriate visual
areas take current gaze direction into account. Neuron 34: 647–658.
Macaluso, E., C. D. Frith, and J. Driver. 2007. Delay activity and sensory–motor translation during planned eye
or hand movements to visual or tactile targets. J Neurophysiol 98: 3081–3094.
Makin, T. R., N. P. Holmes, and E. Zohary. 2007. Is that near my hand? Multisensory representation of peri-
personal space in human intraparietal sulcus. J Neurosci 27: 731–740.
Mangun, G. R., and S. A. Hillyard. 1988. Spatial gradients of visual attention: Behavioral and electrophysi-
ological evidence. Electroencephalogr Clin Neurophysiol 70: 417–428.
Maravita, A., and A. Iriki. 2004. Tools for the body (schema). Trends Cog Sci 8: 79–86.
Maravita, A., C. Spence, S. Kennett, and J. Driver. 2002. Tool-use changes multimodal spatial interactions
between vision and touch in normal humans. Cognition 83: B25–B34.
Marino, B. F., N. Stucchi, E. Nava, P. Haggard, and A. Maravita. 2010. Distorting the visual size of the hand
affects hand pre-shaping during grasping. Exp Brain Res 202: 499–505.
McGonigle, D. J., R. Hanninen, S. Salenius, R. Hari, R. S. Frackowiak, and C. D. Frith. 2002. Whose arm is it
anyway? An fMRI case study of supernumerary phantom limb. Brain 125: 1265–1274.
Metzinger, T. 2009. Why are out-of-body experiences interesting for philosophers? The theoretical relevance
of OBE research. Cortex 45: 256–258.
Misaki, M., E. Matsumoto, and S. Miyauchi. 2002. Dorsal visual cortex activity elicited by posture change in
a visuo- tactile matching task. Neuroreport 13: 1797–1800.
Mort, D. J., P. Malhotra, S. K. Mannan, C. Rorden, A. Pambakian, C. Kennard, and M. Husain. 2003. The
anatomy of visual neglect. Brain 126: 1986–1997.
Mullette-Gillman, O. A., Y. E. Cohen, and J. M. Groh. 2005. Eye-centered, head-centered, and complex coding
of visual and auditory targets in the intraparietal sulcus. J Neurophysiol 94: 2331–2352.
Naito, E. 2004. Sensing limb movements in the motor cortex: How humans sense limb movement. Neuroscientist
10: 73–82.
Naito, E., P. E. Roland, and H. H. Ehrsson. 2002. I feel my hand moving: A new role of the primary motor
cortex in somatic perception of limb movement. Neuron 36: 979–988.
Obayashi, S., M. Tanaka, and A. Iriki. 2000. Subjective image of invisible hand coded by monkey intraparietal
neurons. Neuroreport 11: 3499–3505.
Pagel, B., T. Heed, and B. Röder. 2009. Change of reference frame for tactile localization during child develop-
ment. Dev Sci 12: 929–937.
Paqueron, X., M. Leguen, D. Rosenthal, P. Coriat, P. J. C. Willer, and N. Danziger. 2003. The phenomenology
of body image distortions induced by regional anaesthesia. Brain 126: 702–712.
Pare, M., and R. H. Wurtz. 1997. Monkey posterior parietal cortex neurons antidromically activated from supe-
rior colliculus. J Neurophysiol 78: 3493–3497.
Pare, M., and R. H. Wurtz. 2001. Progression in neuronal processing for saccadic eye movements from parietal
cortex area LIP to superior colliculus. J Neurophysiol 85: 2545–2562.
Parsons, L. M. 1987. Imagined spatial transformations of one’s hands and feet. Cogn Psychol 19: 178–241.
Pavani, F., C. Spence, and J. Driver. 2000. Visual capture of touch: Out-of-the-body experiences with rubber
gloves. Psychol Sci 11: 353–359.
Pellijeff, A., L. Bonilha, P. S. Morgan, K. McKenzie, and S. R. Jackson. 2006. Parietal updating of limb pos-
ture: An event-related fMRI study. Neuropsychologia 44: 2685–2690.
Pesaran, B., M. J. Nelson, and R. A. Andersen. 2006. Dorsal premotor neurons encode the relative position of
the hand, eye, and goal during reach planning. Neuron 51: 125–134.
Pouget, A., S. Deneve, and J. R. Duhamel. 2002. A computational perspective on the neural basis of multi­
sensory spatial representations. Nat Rev Neurosci 3: 741–747.
Press, C., C. Heyes, P. Haggard, and M. Eimer. 2008. Visuotactile learning and body representation: An ERP
study with rubber hands and rubber objects. J Cogn Neurosci 20: 312–323.
Previc, F. H. 1998. The neuropsychology of 3-D space. Psychol Bull 124: 123–164.
Ramachandran, V. S. 1993. Behavioral and magnetoencephalographic correlates of plasticity in the adult human
brain. Proc Natl Acad Sci U S A 90: 10413–10420.
Ramachandran, V. S., and W. Hirstein. 1997. Three laws of qualia—what neurology tells us about the biological
functions of consciousness, qualia and the self. J Consciousness Stud 4: 429–458.
Ramachandran, V. S., and W. Hirstein. 1998. The perception of phantom limbs. The D. O. Hebb lecture. Brain
121(Pt 9): 1603–1630.
Rizzolatti, G., G. Luppino, and M. Matelli. 1998. The organization of the cortical motor system: New concepts.
Electroencephalogr Clin Neurophysiol 106: 283–296.
Rizzolatti, G., C. Scandolara, M. Matelli, and M. Gentilucci. 1981a. Afferent properties of periarcuate neurons
in macaque monkeys. I. Somatosensory responses. Behav Brain Res 2: 125–146.
Rizzolatti, G., C. Scandolara, M. Matelli, and M. Gentilucci. 1981b. Afferent properties of periarcuate neurons
in macaque monkeys: II. Visual responses. Behav Brain Res 2: 147–163.
Röder, B., A. Kusmierek, C. Spence, and T. Schicke. 2007. Developmental vision determines the reference
frame for the multisensory control of action. Proc Natl Acad Sci U S A 104: 4753–4758.
Röder, B., F. Rösler, and C. Spence. 2004. Early vision impairs tactile perception in the blind. Curr Biol 14:
121–124.
Roux, F. E., J. A. Lotterie, E. Cassol, Y. Lazorthes, J. C. Sol, and I. Berry. 2003. Cortical areas involved in vir-
tual movement of phantom limbs: comparison with normal subjects. Neurosurgery 53: 1342–1352.
Saadah, E. S., and R. Melzack. 1994. Phantom limb experiences in congenital limb-deficient adults. Cortex
30: 479–485.
Sakata, H., Y. Takaoka, A. Kawarasaki, and H. Shibutani. 1973. Somatosensory properties of neurons in the
superior parietal cortex (area 5) of the rhesus monkey. Brain Res 64: 85–102.
Schaefer, M., H. Flor, H. J. Heinze, and M. Rotte. 2007. Morphing the body: Illusory feeling of an elongated
arm affects somatosensory homunculus. Neuroimage 36: 700–705.
Scherberger, H., M. A. Goodale, and R. A. Andersen. 2003. Target selection for reaching and saccades share a
similar behavioral reference frame in the macaque. J Neurophysiol 89: 1456–1466.
Schicke, T., F. Bauer, and B. Röder. 2009. Interactions of different body parts in peripersonal space: how vision
of the foot influences tactile perception at the hand. Exp Brain Res 192: 703–715.
Schicke, T., and B. Röder. 2006. Spatial remapping of touch: Confusion of perceived stimulus order across
hand and foot. Proc Natl Acad Sci U S A 103: 11808–11813.
Schlack, A., S. J. Sterbing-D’Angelo, K. Hartung, K. P. Hoffmann, and F. Bremmer. 2005. Multisensory space
representations in the macaque ventral intraparietal area. J Neurosci 25: 4616–4625.
Sellal, F., C. Renaseau-Leclerc, and R. Labrecque. 1996. The man with 6 arms. An analysis of supernumerary
phantom limbs after right hemisphere stroke. Rev Neurol (Paris) 152: 190–195.
Shams, L., Y. Kamitani, and S. Shimojo. 2002. Visual illusion induced by sound. Brain Res Cogn Brain Res
14: 147–152.
Shore, D. I., E. Spry, and C. Spence. 2002. Confusing the mind by crossing the hands. Brain Res Cogn Brain
Res 14: 153–163.
Simmel, M. L. 1962. The reality of phantom sensations. Soc Res 29: 337–356.
Spence, C., F. Pavani, and J. Driver. 2004a. Spatial constraints on visual–tactile cross-modal distractor congru-
ency effects. Cogn Affect Behav Neurosci 4: 148–169.
Spence, C., F. Pavani, A. Maravita, and N. Holmes. 2004b. Multisensory contributions to the 3-D represen-
tation of visuotactile peripersonal space in humans: Evidence from the crossmodal congruency task. J
Physiol Paris 98: 171–189.
Stein, B. E., M. T. Wallace, and M. A. Meredith. 1995. Neural mechanisms mediating attention and orientation
to multisensory cues. In The Cognitive Neurosciences, ed. M. S. Gazzaniga, 683–702. Cambridge, MA:
MIT Press, Bradford Book.
Stricanne, B., R. A. Andersen, and P. Mazzoni. 1996. Eye-centered, head-centered, and intermediate coding of
remembered sound locations in area LIP. J Neurophysiol 76: 2071–2076.
Stuphorn, V., E. Bauswein, and K. P. Hoffmann. 2000. Neurons in the primate superior colliculus coding for
arm movements in gaze-related coordinates. J Neurophysiol 83: 1283–1299.
Tark, K. J., and C. E. Curtis. 2009. Persistent neural activity in the human frontal cortex when maintaining
space that is off the map. Nat Neurosci 12: 1463–1468.
Trommershäuser, J., L. T. Maloney, and M. S. Landy. 2003. Statistical decision theory and trade-offs in the
control of motor response. Spat Vis 16: 255–275.
Tsakiris, M. 2010. My body in the brain: a neurocognitive model of body-ownership. Neuropsychologia 48:
703–712.
Tsakiris, M., M. D. Hesse, C. Boy, P. Haggard, and G. R. Fink. 2007. Neural signatures of body ownership: A
sensory network for bodily self-consciousness. Cereb Cortex 17: 2235–2244.
Türker, K. S., P. L. Yeo, and S. C. Gandevia. 2005. Perceptual distortion of face deletion by local anaesthesia of
the human lips and teeth. Exp Brain Res 165: 37–43.
Walton, M. M., B. Bechara, and N. J. Gandhi. 2007. Role of the primate superior colliculus in the control of
head movements. J Neurophysiol 98: 2022–2037.
Xing, J., and R. A. Andersen. 2000. Models of the posterior parietal cortex which perform multimodal integra-
tion and represent space in several coordinate frames. J Cogn Neurosci 12: 601–614.
Yamamoto, S., and S. Kitazawa. 2001a. Reversal of subjective temporal order due to arm crossing. Nat Neurosci
4: 759–765.
Yamamoto, S., and S. Kitazawa. 2001b. Sensation at the tips of invisible tools. Nat Neurosci 4: 979–980.
Yamamoto, S., S. Moizumi, and S. Kitazawa. 2005. Referral of tactile sensation to the tips of L-shaped sticks.
J Neurophysiol 93: 2856–2863.
Yue, Z., G. N. Bischof, X. Zhou, C. Spence, and B. Röder. 2009. Spatial attention affects the processing of
tactile and visual stimuli presented at the tip of a tool: An event-related potential study. Exp Brain Res
193: 119–128.
Section VII
Naturalistic Multisensory Processes:
Motion Signals
29 Multisensory Interactions
during Motion Perception
From Basic Principles to
Media Applications
Salvador Soto-Faraco and Aleksander Väljamäe

CONTENTS
29.1 Introduction........................................................................................................................... 583
29.2 Basic Phenomenology of Multisensory Interactions in Motion Perception.......................... 584
29.3 Some Behavioral Principles .................................................................................. 586
29.3.1 What Is the Processing Level at Which Cross-Modal Interactions in Motion
Processing Originate?................................................................................................ 586
29.3.2 Are These Interactions Specific to Motion Processing?............................................ 588
29.3.3 Pattern of Modality Dominance................................................................................ 588
29.3.4 Multisensory Integration of Motion Speed................................................................ 589
29.4 Neural Correlates of Multisensory Integration of Motion..................................................... 591
29.4.1 Multisensory Motion Processing Areas in the Brain................................................ 591
29.4.2 Evidence for Cross-Modal Integration of Motion Information in the
Human Brain............................................................................................................. 592
29.5 Motion Integration in Multisensory Contexts beyond the Laboratory.................................. 593
29.5.1 Sound Compensating for Reduced Visual Frame Rate............................................. 593
29.5.2 Filling in Visual Motion with Sound......................................................................... 594
29.5.3 Perceptually Optimized Media Applications............................................................ 596
29.6 Conclusions............................................................................................................................ 597
Acknowledgments........................................................................................................................... 598
References....................................................................................................................................... 598

29.1  INTRODUCTION
Hearing the blare of an ambulance siren often impels us to trace the location of the emergency
vehicle with our gaze so we can quickly decide which way to pull the car over. In doing so, we must
combine motion information from the somewhat imprecise but omnidirectional auditory system
with the far more precise, albeit spatially bounded, visual system. This type of multisensory inter-
play, so pervasive in everyday life perception of moving objects, has been largely ignored in the
scientific study of motion perception until recently. Here, we provide an overview of recent research
about behavioral and neural mechanisms that support the binding of different sensory modalities
during the perception of motion, and discuss some potential extensions of this research into the
applied context of audiovisual media.


29.2  BASIC PHENOMENOLOGY OF MULTISENSORY INTERACTIONS IN MOTION PERCEPTION
Early research addressing the perception of motion when more than one sensory modality is
involved mainly focused on psychophysical tasks in humans (for a review of early research, see
Soto-Faraco and Kingstone 2004). The most popular approach has been the use of intersensory
conflict situations, where incongruent information in two sensory modalities is presented to the
observer (as illustrated, for spatially static events, in the famous ventriloquist illusion; de Gelder
and Bertelson 2003; Howard and Templeton 1966). Some of the early observations arising from
intersensory conflict were already indicative of the consequences of cross-modality interactions in
the perception of motion direction (Anstis 1973; Zapparoli and Reatto 1969). Zapparoli and Reatto
(1969), for instance, reported that when presenting their observers with combinations of direction-
ally incongruent visual and auditory apparent motion streams,* some of them reported that they
experienced the two modalities moving in a unified trajectory (see also Anstis 1973, for a similar
kind of introspective report regarding subjective perception of motion coherence upon auditory–
visual motion conflict).
More recent research sought to confirm these introspective data via more sophisticated experi-
mental tasks (e.g., Allen and Kohlers 1981; Staal and Donderi 1983). For example, Allen and Kohlers
estimated the interstimulus interval that leads to the perception of apparent motion between two dis-
crete stimuli in one sensory modality (vision or audition), as a function of the timing and directional
congruency of two discrete events in another sensory modality (audition or vision, respectively)
presented concurrently. In their study, Allen and Kohlers (see also Staal and Donderi
1983) found significant cross-modal effects of vision on the likelihood of perceiving motion in
auditory apparent motion displays, whereas the effects in the reverse direction were much weaker,
if present at all. Although these studies did not specifically measure perceived direction of motion,
their results often revealed cross-modal influences that were independent of the directional congru-
ency between modalities. Given the results of later studies (described below), this failure to find
direction congruency effects could have been related more to methodological peculiarities of the setups used in these studies than to the actual absence of cross-modal interaction in terms of motion direction (see Soto-Faraco and Kingstone 2004 and Soto-Faraco et al. 2004a, for further
discussion of these confounds). Along these lines, Mateeff et al. (1985) reported an elegant study
with more tightly controlled conditions where the speed and direction of a moving sound were
adjusted psychophysically for subjective steadiness. When visual motion was presented concur-
rently but in the opposite direction of the moving sound, the sound needed to move at a velocity of
25% to 50% of that of the distractor light for subjective sound steadiness to be achieved. This find-
ing suggests, again, that the perception of motion direction reflects the outcome of a combination
process involving directional information available to both sensory modalities.
Some of the past work in our laboratory has attempted to apply the logic of the seminal intro-
spective studies described earlier, from a psychophysical viewpoint. In particular, we developed a
procedure based on intersensory conflict whereby directionally incongruent motion signals were
presented in audition and vision (Soto-Faraco et al. 2002, 2004a). In a typical task, participants
are asked to report the direction of an auditory apparent motion stream while an (irrelevant) visual
apparent motion stream is presented concurrently in the same or different direction (Figure 29.1a
and b). The results in this type of experiment, replicated now many times and in different labora-
tories, are clear-cut. Responses to the sound direction are very accurate when a concurrent visual
motion stream is presented in the same direction as the sounds (about 100% correct), whereas sound

* The phenomenon of apparent motion (namely, experiencing a connected trajectory across two discrete events presented
successively at alternate locations) has been described in different sensory modalities, including the classic example of
vision (Exner 1875; Wertheimer 1912) but also in audition and touch (Burt 1917a, 1917b; Hulin 1927; Kirman 1974).
Moreover, there is evidence suggesting that the principles governing apparent motion are similar for the different senses
(Lakatos and Shepard 1997).

(Figure 29.1, panels a–d, appears here; see caption below.)

FIGURE 29.1  Cross-modal dynamic capture effect. (a) Observer is presented with auditory motion together
with visual motion along the horizontal plane, and is asked to determine the direction of sounds and ignore
the visual event. (b) Examples of different kinds of trials used in the task, combining directional congruency
and synchrony between sound and light. (c) Typical outcome in this task, where accuracy in sound direction
task is strongly influenced by congruency of visual distractor (CDC effect), but only in synchronous trials.
(d) Histogram regarding the size of congruency effect across a sample of 384 participants who performed this
task across a variety of experiments, but under comparable conditions.

motion discrimination performance drops dramatically (by 50%) when the lights are presented
synchronously but move in the opposite direction (see Figure 29.1c and d). This effect of directional
congruency, termed cross-modal dynamic capture (CDC), occurs with equivalent strength when
using continuous (rather than apparent) motion displays, but is eliminated if the visual and auditory
signals are desynchronized in time by as little as half a second. One interesting aspect is that the
frequent errors made by observers under directionally incongruent audiovisual motion are better
explained by a phenomenological reversal in the direction of sounds, rather than by mere confusion.
This latter inference is supported by the finding that the same pattern of directional congruency
effects is seen even after filtering out low confidence responses (self-rated by the observer, after
every trial) from the data (Soto-Faraco et al. 2004a).
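
To make the dependent measure concrete, the following sketch (a minimal Python illustration, not code from the original studies; the trial data and column names are invented) shows how a congruency effect of this kind can be computed from trial-level accuracy data, separately for synchronous and asynchronous presentations.

```python
import pandas as pd

# Hypothetical trial-level data: one row per trial, coding the congruency of the
# visual distractor, whether sound and light were synchronous, and whether the
# sound-direction response was correct. All names and values are illustrative.
trials = pd.DataFrame({
    "congruent":   [True, True, False, False, True, False, True, False],
    "synchronous": [True, True, True,  True,  False, False, False, False],
    "correct":     [1,    1,    0,     1,     1,     1,     1,     1],
})

def cdc_effect(df):
    """Cross-modal dynamic capture: accuracy on congruent trials minus
    accuracy on incongruent (conflicting) trials."""
    acc = df.groupby("congruent")["correct"].mean()
    return acc.loc[True] - acc.loc[False]

# Compute the effect separately for synchronous and asynchronous trials;
# capture is expected only when sound and light are synchronous.
for sync, group in trials.groupby("synchronous"):
    print(f"synchronous={sync}: CDC = {cdc_effect(group):.2f}")
```

In the experiments described above, this difference is large on synchronous trials and essentially absent on asynchronous ones.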
Another relevant finding supporting the existence of multisensory integration between motion
signals comes from an adaptation paradigm developed by Kitagawa and Ichihara (2002). In their
study, Kitagawa and Ichihara adapted observers with visual motion either receding from or looming
toward them, and found adaptation aftereffects not only on the perceived direction of subsequent
visual stimuli, but also on auditory stimuli. For example, after adapting observers to looming visual
motion, a steady sound would appear to move away from them (i.e., its intensity would seem to fade
off slightly over time). This result supports the early nature of multisensory interactions between
auditory and visual motion detectors. Interestingly, Kitagawa and Ichihara also tested adaptation
aftereffects when adapting observers with combined auditory and visual motion moving stimuli,
and found that the magnitude of the adaptation effect depended on the directional congruency
between the two adaptor motion signals (for related findings, see Väljamäe and Soto-Faraco 2008;
Vroomen and de Gelder 2003).
In summary, the findings discussed above seem to point to the existence of robust interactions
between sensory modalities during the extraction of motion information, and in particular of its
direction. However, they are still far from providing a full characterization of these interactions at a
behavioral and at a neural level. We provide some of the main findings regarding these two aspects
in the following sections.

29.3  SOME BEHAVIORAL PRINCIPLES


Despite the existence of strong phenomenological correlates suggesting cross-modal interactions in
motion perception, there are a number of important questions that need to be addressed for a com-
plete characterization of these interactions. We discuss some of them below.

29.3.1  What Is the Processing Level at Which Cross-Modal Interactions in Motion Processing Originate?
One critical question for any cross-modal interaction arising from intersensory conflict concerns
its level of processing (e.g., Bertelson and Aschersleben 1998; de Gelder and Bertelson 2003). That
is, given intersensory conflict between the stimuli present in the display, there are a number of
levels of information processing at which interactions could potentially occur, ranging from early,
sensory stages up to late, decisional ones. Whereas the former stages of information processing are
relevant in terms of characterizing multisensory interactions during perception, the latter ones are
relatively uninformative in this respect; they inform us about cognitive mechanisms that apply to response selection in general and are not necessarily specific to multisensory interactions.
The general problem in the interpretation of intersensory conflict paradigms has been well
described by a number of authors (e.g., Bertelson and Aschersleben 1998; Choe et al. 1975; de
Gelder and Bertelson 2003; Welch 1999, about this issue) and has produced a long historical debate
in the case of the ventriloquist illusion. The problem can be stated as follows: irrelevant (to-be-
ignored) information present in the display maps onto the response set available to the participant in
a way that, for incongruent (conflict) trials this irrelevant modality primes the erroneous response
to the target, whereas in the congruent (no conflict) trials the distractor favors the appropriate
response. This creates at least two types of confound, one based on stimulus–response compatibil-
ity effects (first reported by Fitts and Seeger 1953; see also Fitts and Deininger 1954; Simon 1969;
see Hommel 2000, for a review) and the other based on the potential (conscious or unconscious)
strategies adopted by participants based on their awareness about the conflicting nature of some
trials, also referred to as cognitive bias (e.g., Bertelson and Aschersleben 1998). Either type of bias,
in isolation or combined, can provide a sufficient explanation of many intersensory conflict results
based on known empirical facts, without the need to resort to cross-modal interactions at the percep-
tual level. As in other domains, however, in the specific case of multisensory interactions during motion perception, a series of results has shown convincingly that interactions at a perceptual level
are relevant. Results from adaptation aftereffects, such as those shown in Kitagawa and Ichihara’s
(2002) study, favor an interpretation in terms of early stages of processing, given that aftereffects are
often attributed to fatigue of motion detectors in sensory brain areas. However, note that Kitagawa
and Ichihara’s study still contains the elements for a post-perceptual interpretation (i.e., presence of
intersensory conflict, from adaptation to test phase, and awareness of the conflict).
Soto-Faraco et al. (2005) used a variation of the original CDC task described above, but replaced
the unimodal “left vs. right” discrimination task with a “same vs. different” task in which partici-
pants were asked to compare motion direction across the two modalities presented (auditory and
visual). Soto-Faraco et al. found that participants were unable to distinguish between same- and
different-direction audiovisual apparent motion streams unless the interstimulus interval between
the two discrete flashes/beeps was larger than 300 ms. Yet, the same observers were able to accu-
rately discriminate the direction of apparent motion streams in each sensory modality individually
for interstimulus intervals below 75 ms. Given that conflict between the stimulus (left–right) and
response (same–different) was not possible in this paradigm, stimulus–response compatibility could
be ruled out as the source of the behavioral effect. Moreover, in this experiment the interstimulus
interval was adjusted using interleaved adaptive staircases to reach the point of perceptual uncer-
tainty. At this point, by definition participants are not aware of whether they are being presented
with a conflicting or a congruent trial, and thus cannot adopt strategies based on stimulus congru-
ence, thereby also ruling out cognitive biases. After eliminating stimulus–response compatibility
and cognitive biases as possible explanations, the participants’ failure to individuate the direction
of each sensory modality in multisensory displays can only be attributed to an interference at a
perceptual level (Soto-Faraco et al. 2005).
Other approaches that have been used to disentangle the contribution of perceptual versus post-
perceptual mechanisms in cross-modal motion effects include the use of analytic tools such as the
signal detection theory (see MacMillan and Creelman 1991), where an independent estimation of
the sensitivity (associated with perceptual sources) and the decision bias (associated with post-perceptual sources) can be obtained (e.g., Sanabria et al. 2007; Soto-Faraco et al. 2006). In the Sanabria et al. (2007) and Soto-Faraco et al. (2006) studies, for example, participants were asked to discrimi-
nate left-moving sounds (signal) from right-moving sounds (noise) in the context of visual stimuli
that moved in a constant direction throughout the whole experimental block (always left or always
right). The findings were clear: the presence of visual motion lowered sensitivity (d′) to sound direc-
tion as compared to a no-vision baseline, regardless of whether sound direction was consistent or
inconsistent with the visual distractor motion. That is, visual motion made signal and noise motion
signals in the auditory modality more similar to each other, and thus discrimination was more
difficult. However, response criterion (c) in this task shifted consistently with the direction of the
visual distractor. In sum, this experiment was able to dissociate the effects of perceptual interactions
from the effects at the response level. Other authors have used a similar strategy to disentangle the
contribution of perceptual versus post-perceptual processes in somewhat different types of displays
(see Meyer and Wuerger 2001; Meyer et al. 2005). For instance, Meyer and Wuerger presented their
participants with a visual direction discrimination task (using random dot kinematograms) in the
context of auditory distractor motion. They used a mathematical model that included a sensitivity
parameter and a bias parameter, and found that most of the influence that auditory motion had on
the detection responses to visual random dot displays was explained by a decision bias (for a simi-
lar strategy, see Alais and Burr 2004a). This result highlights the importance that post-perceptual
biases can have in experiments using cross-modal distractors, and, in part, contrasts with the result
presented above using the CDC task. Although it is difficult to compare across these methodologi-
cally very different studies, part of the discrepancy might be rooted in the use of vision as the target
modality (as opposed to sound), and the target stimulus being near the threshold for direction dis-
crimination (in Meyer et al.’s case, random dot displays with low directional coherence). More
recent applications of this type of approach have revealed, however, that one can obtain a shift
in sensitivity over and above any bias effects (Meyer et al. 2005; Wuerger et al. 2003).
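
To illustrate how sensitivity and bias are separated in such analyses, the sketch below computes d′ and the criterion c from hit and false-alarm counts using the standard equal-variance Gaussian formulas; the numbers are invented for illustration and are not data from the studies cited above.

```python
from statistics import NormalDist

def dprime_and_criterion(hits, misses, false_alarms, correct_rejections):
    """Equal-variance Gaussian SDT estimates:
    d' = z(H) - z(F);  c = -(z(H) + z(F)) / 2."""
    z = NormalDist().inv_cdf
    # Log-linear correction to avoid infinite z-scores at rates of 0 or 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    d_prime = z(hit_rate) - z(fa_rate)
    criterion = -(z(hit_rate) + z(fa_rate)) / 2
    return d_prime, criterion

# Made-up counts: "signal" = left-moving sound, "noise" = right-moving sound.
d, c = dprime_and_criterion(hits=70, misses=30, false_alarms=25, correct_rejections=75)
print(f"d' = {d:.2f}, criterion c = {c:.2f}")
```

In the logic of the studies above, a reduction of d′ relative to a no-vision baseline points to a perceptual effect of the visual distractor, whereas a shift of c with distractor direction points to a decision-level bias.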
In sum, the presence of decision and cognitive influences is very likely to have an effect in most
of the tasks used to address multisensory contributions to motion processing. Yet, the data of several
independent studies seem to show rather conclusively that influences at the level of perception do
also occur during multisensory motion processing. It must be noted, however, that there are lim-
its on how early in the processing hierarchy this cross-modal interaction can occur. For example,
there is evidence that cross-modal motion integration takes place only after certain unisensory
perceptual processes have been completed, such as visual perceptual grouping (e.g., Sanabria et
al. 2004a, 2004b) or the computation of visual speed (López-Moliner and Soto-Faraco 2007; see
Section 29.3.4).

29.3.2  Are These Interactions Specific to Motion Processing?


Although the studies described above have used moving stimuli for their tasks, one potential concern
is about the specificity of the conclusions. That is, the processes whereby these multisensory interac-
tions arise might not be based on motion information (i.e., direction of motion), but on other features
that are unrelated to motion per se yet present in the displays, such as the location and timing of the
onsets or offsets of the stimuli (for a discussion of this problem, see Meyer and Wuerger 2001; Soto-
Faraco et al. 2002, 2004a). This is especially relevant, but not unique, for experiments using apparent
motion displays. If an explanation based on nonmotion features was possible, then the results of most
of the studies discussed up to now would not be particularly informative as to whether and how
multisensory integration of motion information occurs. A number of findings suggest, however, that
multisensory integration does indeed take place specifically during motion processing.
For instance, Soto-Faraco et al. (2004a) estimated that the magnitude of the CDC (see Figure
29.1c) for apparent motion displays was about 50%, that is, sound direction was judged erroneously
in about half the conflicting trials, whereas it was nearly 100% accurate in congruent trials (hence an effect of about 50%). Using the same setup, spatial disparity, and timings, they assessed the chances that either of the two individual components of the auditory apparent motion stream was ventriloquized
toward a concurrent light flash, which resulted in an effect magnitude of only 17%. From this result,
it would appear that the consequences of CDC go beyond what can be expected on the basis of
simple static capture (see also Soto-Faraco et al. 2002, Experiment 3, for a similar result).
The exploration of split-brain patient JW afforded another independent test of the critical role of
motion for CDC (Soto-Faraco et al. 2002). JW had his corpus callosum (all the fibers connecting the
left and right cerebral hemispheres) surgically sectioned, so that there is no direct communication
between his two hemispheres at a cortical level. Whereas auditory apparent motion is experienced
normally by JW (given that the auditory pathways do not cross at the corpus callosum), he does not
experience the typical impression of visual apparent motion when presented with two sequential
flashes in different hemifields. This is so despite the fact that JW's ability to localize individual flashes
presented laterally is spared (Gazzaniga 1987; Soto-Faraco et al. 2002). Interestingly, when tested
in the CDC task, JW barely made any errors regarding the direction of sounds in incongruent trials
(in stark contrast with the healthy controls, who experienced the typical 50% CDC effect).
This result suggests that JW’s inability to experience visual apparent motion across the midline
spared him from the CDC effect, and reveals that the experience of motion is essential for the cross-
modal interactions observed in previous studies.

29.3.3  Pattern of Modality Dominance


The studies discussed up to this point have touched on the case of audiovisual interactions in motion
perception, and in particular on the effects of visual motion on the perception of sound motion. This
bias reflects the imbalance of modality combinations being represented in current cross-modal lit-
erature addressing motion perception. Yet, cross-modal literature should make us aware that many
multisensory phenomena present asymmetries, so that visual effects on audition are substantially
different from auditory effects on vision (i.e., the ventriloquist illusion; e.g., Alais and Burr 2004b;
Radeau and Bertelson 1976; the McGurk effect, McGurk and MacDonald 1976). What is more,
since motion can be extracted from tactile signals, as well as from visual and acoustic ones, differ-
ent modality combinations may involve different sets of constraints.
Previous research in multisensory integration of motion already suggests a clear asymmetry
whereby effects of audition on vision are smaller, if present at all, than those typically seen for
vision on audition (e.g., Allen and Kohlers 1981; Kitagawa and Ichihara 2002; Meyer and Wuerger
2001; Ohmura 1987). For example, the CDC effect does not occur at all, or does occur very weakly,
when visual apparent motion is the target and sounds are the to-be-ignored modality (e.g., Soto-
Faraco et al. 2004a). There are indeed some reports of acoustic influences on visual motion, but
these are invariably obtained when the visual signal is at or near threshold (e.g., Alais and Burr
2004b; Meyer and Wuerger 2001).
We have incorporated touch to the CDC paradigm in several studies (e.g., Oruç et al. 2008;
Sanabria et al. 2005a, 2005b; Soto-Faraco et al. 2004b). In these cases, participants were asked to
wear vibrotactile stimulators on their index finger pads and rest their hands on the table,
near the LEDs and/or loudspeakers. Tactile apparent motion was generated by presenting a brief
(50 ms) sine wave (200 Hz) vibration in alternation to each index finger. In this way, auditory, tac-
tile, and visual apparent motion streams could be presented using equivalent stimulus parameters in
terms of onset time, duration, and spatial location. All possible combinations of distractor and target
modality using tactile, visual, and acoustic stimuli were tested (Oruç et al. 2008; Soto-Faraco et al.
2000; Soto-Faraco and Kingstone 2004). When considered as a whole, the results of these experi-
ments reveal a hierarchy of sensory modalities with respect to their contribution to the perception
of motion direction. Vision has a strong influence on auditory motion, yet acoustic distractors did
not modulate the perception of visual motion (along the lines of other recent results, such as those
of Meyer and Wuerger 2001; Kitagawa and Ichihara 2002). A similar pattern applies to visuo-tactile interactions, whereby vision captures tactile motion direction but touch hardly exerts any influence
on the perception of motion in vision. The combination of auditory and tactile motion stimuli, how-
ever, showed reciprocal influence between both modalities, albeit with a stronger effect of touch on
sound than the reverse. This particular hierarchy, however, must be considered with some caution,
given that factors such as stimulus saliency, reliability, and even cognitive aspects such as attention,
may indeed exert an important influence on the relative strength of the modalities. For example,
it has been shown that directing the focus of attention to one or another modality can modulate
CDC in the case of audio-tactile interactions, although not in modality pairings where vision was
involved (Oruç et al. 2008).
According to the findings described above, vision would be the most dominant sense in terms
of its contribution to computing the direction of motion, followed by touch and lastly audition. Within this
framework, multisensory integration would not consist of a process in which one modality overrides
the information in another modality (as the results of the audiovisual case, when considered in isola-
tion, might sometimes suggest; for a similar example based on the dominance of vision over touch
in shape/size perception, see, e.g., Rock and Harris 1967). Instead, the results support the proposal
that multisensory integration of motion would conform to some kind of weighted combination of
different information sources (see López-Moliner and Soto-Faraco 2007). If this is so, then the
strength to which each modality is weighted during the unisensory perception of motion becomes a
particularly relevant issue. This is clearly a matter for further research, but based on the success in
explaining cross-modal results regarding other perceptual domains, one could borrow the ideas of
modality appropriateness or the so-called optimal integration model (Ernst and Banks 2002; Ernst
and Bülthoff 2004) to answer this question.
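
As a generic illustration of what such a weighted combination could look like under the optimal integration model, the sketch below implements reliability-weighted (inverse-variance) cue combination; the numbers are arbitrary and the code is not taken from any of the cited studies.

```python
def optimal_combination(estimates, variances):
    """Maximum-likelihood (inverse-variance weighted) cue combination.
    Each cue's weight is proportional to its reliability, 1/variance."""
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    weights = [r / total for r in reliabilities]
    combined_estimate = sum(w * x for w, x in zip(weights, estimates))
    combined_variance = 1.0 / total  # never larger than the best single cue
    return combined_estimate, combined_variance, weights

# Arbitrary example: visual and auditory estimates of motion direction (deg),
# with vision assumed to be more reliable (smaller variance) than audition.
est, var, w = optimal_combination(estimates=[10.0, 25.0], variances=[4.0, 16.0])
print(f"combined = {est:.1f} deg, variance = {var:.1f}, weights = {[round(x, 2) for x in w]}")
```

On this scheme, the modality with the lower variance (here vision) dominates the combined estimate, which offers one way of understanding the hierarchy described above without assuming that any modality simply overrides another.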

29.3.4  Multisensory Integration of Motion Speed


Most current knowledge regarding multisensory contributions to motion perception has been
based on tasks using discrimination of motion direction, or detection of motion signals. However,
another fundamental property of motion, velocity, has been largely neglected in these past studies
(for exceptions see, López-Moliner and Soto-Faraco 2007; Manabe and Riquimaroux 2000). López-
Moliner and Soto-Faraco (2007) addressed whether, and how, different sensory sources of velocity
information influence each other during the perception of motion. Participants were presented with
moving sound images (left to right or right to left) at varying speeds in a two-interval forced-choice (2IFC) task, with the velocity of the comparison sounds adjusted by QUEST staircases (see Figure 29.2a) so as to match a standard sound moving at 30° s−1. The comparison sound, however, could be presented concur-
rent with a variety of visual sinusoidal gratings moving at different velocities, ranging from 15 to
45° s−1. The results clearly showed a shift in perceived sound velocity as a function of the velocity

(Figure 29.2, panels a–e, appears here; see caption below.)

FIGURE 29.2  A summary of results about cross-modal effects in velocity perception, from López-Moliner
and Soto-Faraco’s study. (a) 2IFC paradigm involved a velocity discrimination judgment regarding sounds
(Was the second sound faster or slower than the first?), where second interval could contain either no visual
motion or else moving gratings. (b) Graphical description of different gratings used in the task in space
defined by temporal and spatial frequency. Note that two exemplars of each motion velocity formed by differ-
ent combinations of spatial and temporal frequencies were used. Each of three velocities is denoted by a dif-
ferent symbol (see labels next to symbols). (c–e) Point of subjective equality for sound velocity with reference
to a 30° s−1 standard, when combined with different kinds of moving gratings. Same data are depicted as a
function of (c) spatial frequency, (d) temporal frequency, and (e) velocity of gratings. It can be seen that the results are ordered most systematically when plotted along the velocity axis.

of the concurrent visual stimulus, so that slow visual motion made participants underestimate the
velocity of concurrent sounds, and rapid visual motion made people perceive the sounds moving
faster than they really were.
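
As an aside on how a point of subjective equality (PSE) of the kind plotted in Figure 29.2c–e can be obtained, the sketch below fits a cumulative Gaussian psychometric function to hypothetical 2IFC data and reads off the 50% point; the response proportions and the use of SciPy here are assumptions for illustration, not the analysis of the original study.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

# Hypothetical 2IFC data: comparison sound velocities (deg/s) and the proportion
# of trials on which the comparison was judged faster than the 30 deg/s standard.
velocity = np.array([18, 22, 26, 30, 34, 38, 42], dtype=float)
p_faster = np.array([0.05, 0.15, 0.35, 0.55, 0.75, 0.90, 0.97])

def psychometric(v, pse, sigma):
    """Cumulative Gaussian: PSE is the 50% point; sigma sets the steepness."""
    return norm.cdf(v, loc=pse, scale=sigma)

(pse, sigma), _ = curve_fit(psychometric, velocity, p_faster, p0=[30.0, 5.0])
print(f"PSE = {pse:.1f} deg/s")
```

A PSE pulled above or below the 30° s−1 standard by the accompanying gratings is the signature of the cross-modal shift in perceived velocity described above.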
In this study, gratings composed of different combinations of spatial and temporal frequencies
could represent visual motion of a given velocity (see Figure 29.2b). This was done so because
sinusoidal moving gratings can be conveniently separated into spatial frequency (sf ) and temporal
frequency (tf ) (Watson and Ahumada 1983), and velocity (v) of a grating can be expressed by the
ratio between its tf (in Hz) and its sf (number of cycles per degree of visual angle). This spatiotem-
poral definition of stimulus space has been previously used to characterize the spectral receptive
fields of neurons at various levels of the visual system in the monkey (e.g., Perrone and Thiele 2001),
and it has received confirmation from human psychophysics (Reisbeck and Gegenfurtner 1999). For
instance, many neurons in the middle temporal cortex (MT) encode velocity, in the sense that the
set of stimuli that best drives these neurons lies along an isovelocity continuum in the space defined
by sf and tf. Unlike MT neurons, many of the motion-sensitive neurons found at earlier stages of
the visual system such as V1 fail to display an invariant response across different stimuli moving
at equivalent velocities, but rather they often display a response profile tuned to particular temporal
frequencies (however, see Priebe et al. 2006, for velocity responses in some V1 neurons). Given that
the velocity of a grating can be decomposed in terms of spatial and temporal frequency (v = tf/sf),
one can then attempt to isolate the influence of varying spatial and temporal frequencies of the
visual stimulus, on the perceived velocity of sounds. What López-Moliner and Soto-Faraco found
is that neither spatial frequency per se nor temporal frequency produced any systematic effect on
sound velocity perception above and beyond that explained by velocity. One could then infer that
the binding of multisensory velocity information might be based on motion (velocity) information of
the kind that becomes available at late stages of processing within the visual system. Note that this inference reso-
nates with the finding that multisensory motion integration occurs only after perceptual grouping
has been completed within vision (Sanabria et al. 2004a, 2004b).
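To make the sf/tf decomposition concrete, the short Python sketch below enumerates drifting-grating exemplars that share a velocity even though their spatial and temporal frequencies differ. The specific velocity and frequency values are illustrative assumptions in the spirit of Figure 29.2b, not the study's exact stimuli.

```python
# Illustrative sketch: grating velocity is the ratio of temporal to spatial
# frequency, v = tf / sf, so many (sf, tf) pairs share the same velocity.

def isovelocity_pairs(velocity_deg_s, spatial_frequencies_cpd):
    """Return (sf, tf) pairs that all drift at the requested velocity."""
    return [(sf, velocity_deg_s * sf) for sf in spatial_frequencies_cpd]

if __name__ == "__main__":
    # Hypothetical values loosely inspired by Figure 29.2b.
    for v in (15.0, 30.0, 45.0):                          # deg/s
        for sf, tf in isovelocity_pairs(v, (0.25, 0.5)):  # cycles/deg
            print(f"v = {v:4.1f} deg/s  <-  sf = {sf:.2f} c/deg, tf = {tf:5.2f} Hz")
```

A velocity-tuned neuron of the kind found in MT would respond similarly to the two exemplars within each velocity, whereas a temporal-frequency-tuned V1 neuron generally would not.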

29.4  NEURAL CORRELATES OF MULTISENSORY INTEGRATION OF MOTION


29.4.1  Multisensory Motion Processing Areas in the Brain
A particularly important way to constrain any explanation of multisensory contributions to the
perception of motion is to understand its neural underpinnings.
Although there are many structures in the visual system that contain motion-responsive neurons, the
processing of global visual motion has repeatedly been shown to strongly involve visual area V5/
MT, in the occipitotemporal cortex. Together with its surrounding regions (the V5/MT+ complex),
V5/MT has been characterized as a motion processing area in both human (e.g., Watson et al. 1993;
Zihl et al. 1983, 1991) and nonhuman primate studies (e.g., Ungerleider and Desimone 1986). Area
V5/MT+ is well connected with other visual areas such as the V1/V2 complex and V3, as well as
higher-order areas in the posterior parietal cortex, in particular the ventral intraparietal area (VIP),
within the intraparietal sulcus (IPS; e.g., Maunsell and Van Essen 1983), which contains direction-
ally sensitive neurons [e.g., Colby et al. 1993; Duhamel et al. 1991, 1998; see also Bremmer et al.
2001b, for evidence regarding a homologous region in the human cortex using functional magnetic
resonance imaging (fMRI)]. Given the strong connectivity between VIP and the ventral premotor
cortex (PMv; e.g., Luppino et al. 1999), it is conceivable that this frontal area is also involved in some
aspects of visual motion processing.
Although the level of knowledge about the network of brain areas supporting motion perception
in the auditory and somatosensory systems is not nearly as detailed as that of the visual system, there
have been some important recent advances. For example, some researchers have demonstrated the
selective involvement of certain brain areas (such as the planum temporale, the inferior and superior
parietal cortices, and the right insula) in response to auditory motion (e.g., Baumgart et al. 1999;
Pavani et al. 2002). The primary and secondary somatosensory areas (SI and SII), located in the post-
central gyrus, seem to play a major role in the perception of tactile motion (e.g., Hagen et al. 2002).
Beyond these brain areas associated with the processing of motion in individual sensory modali-
ties, the relevant question, however, is “Which are the brain areas that may serve motion processing
in more than one sensory modality?” Animal electrophysiology has already made some important
advances in detailing several structures containing neurons that are responsive to dynamic changes
and/or moving stimuli in several sensory modalities. Among the most relevant, we find several
of the higher-order areas pointed out earlier as subserving visual motion processing such as VIP
(Bremmer et al. 2001b; Colby et al. 1993; Duhamel et al. 1998; Schlack et al. 2005) and PMv (e.g.,
Graziano et al. 1994, 1997). Some parts of the temporal lobe (in particular, the superior temporal
sulcus and the middle temporal gyrus; Bremmer et al. 2001a, 2001b; Bruce et al. 1981; Desimone
and Gross 1979) have also been linked to the processing of dynamic stimuli in multiple sensory
modalities.

In humans, a particularly compelling demonstration of brain areas relevant for multisensory
motion processing would be the association of motion processing deficits across sensory modalities
after focal brain lesions. Yet, to our knowledge, this kind of neuropsychological evidence is still
lacking (Griffiths et al. 1997b; Zihl et al. 1991). Despite this lack of neuropsychological association,
a number of neuroimaging studies in humans have helped to map multisensory brain areas that
receive converging motion information from several sensory systems (e.g., Bremmer et al. 2001b;
Hagen et al. 2002; Lewis et al. 2000). In Bremmer et al.’s fMRI study, human observers were
scanned while presented with visual, acoustic, or somatosensory motion stimuli in alternate blocks
of trials. Regions commonly active for motion in all three modalities were found in both right and
left posterior parietal cortices (with local maxima suggesting a potential involvement of VIP), and
also in PMv. Thus, this pattern would appear to match well with the data obtained in monkey elec-
trophysiology (e.g., Graziano et al. 1994, 1997, 2004). In another fMRI study, Lewis et al. found a
region of common activation for visual and auditory motion processing in the lateral parietal cortex,
again suggesting the involvement of some areas along the IPS, as well as some frontal structures
such as the lateral frontal cortex and the anterior cingulate.
The results of these fMRI studies are clear in that there are some particular brain structures
that are responsive to motion signals in more than one sensory modality. The combination of evi-
dence from human neuroimaging and monkey neurophysiology helps us infer that the multisensory
responses in these motion-responsive areas can be due to the presence of multisensory neurons that
are sensitive to motion in at least one of their sensory modalities. This is an important step in order
to disentangle the mechanisms responsible for processing multisensory motion, but it is perhaps not
sufficient to enable us to claim that these areas play any particular role in the process of binding
multisensory information about motion.

29.4.2  Evidence for Cross-Modal Integration of Motion Information in the Human Brain
A couple of recent human brain imaging studies have shed some light precisely on the question
of the brain areas that may respond preferentially to the combined presence of motion in several
sensory modalities (Alink et al. 2008; Baumann and Greenlee 2007). Baumann and Greenlee used
random dot kinematograms where 16% of the dots moved in a predetermined direction, to create
their visual motion stimulus. In the critical comparison, they contrasted brain activity arising in a
condition where sounds moved in phase versus a condition in which they moved in antiphase with
respect to the direction of this visual motion stimulus. They used a stationary sound condition as
their baseline, and found that, with respect to this baseline, sounds in-phase with visual motion
produced a pattern of activity involving extensive regions of the superior temporal gyrus (STG), the
supramarginal gyrus (SMG), the IPS, and the superior parietal lobule (SPL), in addition to some
regions of the primary and secondary visual cortex. Sounds in antiphase with the visual stimulus
produced a similar pattern of activity, but with a weaker BOLD (blood oxygen level dependent)
increase throughout, thus leading the authors to suggest that these areas are responsive to the inte-
gration of correlated motion information across modalities.
In another recent study, Alink et al. (2008) used an adaptation of the CDC task (e.g., Soto-
Faraco et al. 2002, 2004a) to compare brain activity in response to audiovisual motion presented
in directional congruency or incongruency. Alink et al. focused their analysis on unimodal motion
processing regions, which were localized a priori on an individual basis. These areas included the classical
visual motion areas V5/MT+ as well as the auditory motion complex (AMC), an area of the poste-
rior part of the planum temporale that had already been associated with auditory motion processing
(e.g., Baumgart et al. 1999). One of Alink et al.’s relevant findings was that these unimodally defined
motion processing regions responded with decreased activity when presented with directionally
conflicting auditory and visual motion signals, as compared to congruent ones. This result would
lend support to the idea that the consequences of multisensory integration of motion information
can be traced back to relatively early (sensory) stages of motion processing (for behavioral support
of this hypothesis, see Soto-Faraco et al. 2004a, 2005; Sanabria et al. 2007). A second interesting
aspect of Alink et al.’s study is that they reproduced the CDC effect while measuring brain activ-
ity. Thus, they were able to contrast BOLD changes resulting from trials where the CDC illusion
presumably occurred (incorrectly responded ones) with that evoked by otherwise identical trials but
where sounds were perceived to move in the physically correct direction. This contrast revealed that
in trials where the CDC illusion was experienced, activity in the auditory motion areas (AMC) suf-
fered a reduction with respect to veridically perceived trials, whereas the reverse pattern occurred
for visual motion areas (i.e., enhanced activity in MT/V5+ in illusion trials with respect to veridical
perception). This result parallels the visual dominance pattern typically observed in the behavioral
manifestation of CDC. Finally, when extending the scope of their analysis to the whole brain, Alink
et al. found that conflicting motion led to the activation of an extensive network of frontoparietal
areas, including the IPS and supplementary motor areas (SMA). Remarkably, in this analysis Alink
et al. also found that VIP was modulated by the occurrence of illusory motion percepts (as indexed
by the CDC task). In particular, not only was VIP more strongly activated in trials leading to
illusory percepts, but this activation also seemed to precede in time the activity evoked by the stimulus
itself, an indication that the prior state of the motion processing network might be a critical determinant
of whether CDC, and hence cross-modal motion integration, occurs.

29.5  MOTION INTEGRATION IN MULTISENSORY CONTEXTS BEYOND THE LABORATORY
Although the behavioral and neural principles described above have represented an important advance
in our understanding of how perception of multisensory motion signals works, we are still far from
achieving a satisfactory characterization. Yet, this has not prevented the development of media appli-
cations, such as cinema, that recreate an illusion of the real world by exploiting (multisensory) motion per-
ception principles. Most of the time, these applications have been developed independently of, and
even antedate, the scientific principles they are related to. Here we discuss a few relevant cases.

29.5.1  Sound Compensating for Reduced Visual Frame Rate


Visual “flicker” fusion occurs at rates above roughly 50 to 100 Hz (Landis 1954; van der Zee and
van der Meulen 1982). Because of this perceptual characteristic of the visual system, standard cin-
ema and television applications in the twentieth century used frame rates of 50 and 60 Hz. However,
compared to the visual system, the temporal acuity of the auditory system is considerably higher, as
humans are able to detect amplitude modulations of high-frequency tonal carriers at rates of up to 600 Hz
(Kohlrausch et al. 2000). Given this sensory imbalance, together with the multisensory nature of
our perception, it is a likely possibility that sound can significantly alter our perception of visual
world dynamics. In fact, several well-studied effects provide evidence for that, such as the “auditory
driving” effect (e.g., Welch et al. 1986), the temporal ventriloquism effect (Morein-Zamir et al. 2003), or
the freezing effect (e.g., Vroomen and de Gelder 2000).
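For readers who want a concrete sense of what a 600 Hz amplitude modulation of a tonal carrier looks like, here is a minimal NumPy sketch of a sinusoidally amplitude-modulated tone. The carrier frequency, modulation depth, and duration are arbitrary illustrative choices, not the parameters of Kohlrausch et al. (2000).

```python
import numpy as np

def am_tone(carrier_hz=4000.0, mod_hz=600.0, depth=1.0,
            duration_s=0.5, sample_rate=44100):
    """Sinusoidally amplitude-modulated tone:
    (1 + depth * sin(2*pi*mod_hz*t)) * sin(2*pi*carrier_hz*t)."""
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    envelope = 1.0 + depth * np.sin(2.0 * np.pi * mod_hz * t)
    return envelope * np.sin(2.0 * np.pi * carrier_hz * t)

# A 600 Hz modulation is near the upper limit that listeners can detect on a
# tonal carrier, whereas visual flicker fuses at roughly 50-100 Hz.
signal = am_tone()
```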
The audiovisual media industry has extensively exploited this feature of multisensory perception
(auditory influence on vision in the time domain) in applications where sound has been traditionally
used for highlighting temporal structure of rapid visual events. Consider, for example, sound tran-
sients articulating hits in kung fu fighting scenes (Chion 1994) or Walt Disney’s “Mickey Mousing”
technique whereby a specific theme of a soundtrack is tightly synchronized with a cartoon charac-
ter’s movements (Thomas and Johnston 1981). Interestingly, sound has also been used in cinema
for creating an illusion of visual event continuity, “greasing the cut” as editors would put it, when
an accompanying sound effect masks rough editing between two shots (Eidsvik 2005). In a classic
example from George Lucas’ film, The Empire Strikes Back (1980), the visual illusion of a space-
ship door sliding open is created using two successive still shots of a door closed and a door opened
combined with a “whoosh” sound effect (Chion 1994).
These auditory modulations of visual dynamics perception give rise to the practical question of
whether sound could compensate for reduced frame rate in films. A recent study by Mastoropoulou
et al. (2005) investigated the influence of sound in a forced choice discrimination task between pairs
of 3-s video sequences displayed at varying temporal resolutions of 10, 12, 15, 20, or 24 frames per
second (fps). Participants judged motion smoothness of the videos being presented. In visual-only
conditions, naïve participants could discriminate between displays differing by as little as 4 fps. By
contrast, in audiovisual presentations participants could reliably discriminate between displays
only when they differed by 14 fps. It is perhaps surprising that Mastoropoulou et al. (2005) hypoth-
esized that divided attention might be the cause of the reported effects, without even considering the
alternative explanation that audiovisual information integration might have produced the sensation
of smoother visual displays altogether, thereby making it more difficult to spot discontinuities.

29.5.2  Filling in Visual Motion with Sound


Recent studies focusing on cross-modal interactions have found that sound can induce the illusion
of seeing a visual event when there is none. For example, Shams et al. (2000) created an illusion of
multiple visual flashes by coupling a single brief visual stimulus with multiple auditory beeps. In
these experiments, participants were asked to count the number of times a flickering white disk had
flashed in displays containing one or more concurrent task-irrelevant brief sounds. The number of
flashes reported by observers increased with the number of beeps presented. In a follow-up ERP study,
Shams et al. (2001) reported that sound modulated early visual evoked potentials originating from
the occipital cortex. Interestingly, the electrophysiological activity corresponding to the illusory
flashes was found to be very similar to the activity produced when a flash was physically presented.
Other research groups have demonstrated that this illusory-flash effect occurs at a perceptual level
with psychophysically assessable characteristics (e.g., McCormick and Mamassian 2008) and that it
does not subjectively differ from a real flash when used in orientation-discrimination tasks (Berger
et al. 2003).
The sound-induced flash illusion has also been studied using apparent motion, where visual bars
were flashed in succession from one side of the screen to the other (Kamitani and Shimojo
2001). In this case, increasing the number of beeps produces additional illusory bars, leading
to a subjective experience of smoother visual object motion. In order to quantify the perceptual
effects of illusory-flash in time-sampled motion contexts, such as that often seen in animated car-
toons, Väljamäe and Soto-Faraco (2008) applied a methodology similar to the motion adaptation
aftereffect paradigm of Kitagawa and Ichihara (2002), discussed earlier (Section 29.1). Väljamäe
and Soto-Faraco exposed participants to time-sampled approaching/receding visual, auditory, or
audiovisual motion in depth that was simulated by changing the size of visual stimulus (0° to 9° of
visual angle) and intensity of sound (40–80 dB sound pressure level range). Both unisensory and
audiovisual combinations of adaptors were used, with the audiovisual adaptors being either direction-
ally congruent or conflicting (see Figure 29.3a). Visual and auditory adaptors had two frequency
rates: a high-rate train of flashes (flicker) or beeps (flutter) at 12.5 Hz, or low-rate flicker or flutter at
6.25 Hz. An adaptive staircase procedure was used to measure the amount of auditory
motion aftereffect (point of subjective steadiness) induced by different adaptor stimuli. In addition, in
one of the experiments, participants also had to judge the subjective smoothness of the visual adaptors.
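The adaptive procedure is not described in detail here, so the following Python sketch only illustrates the general logic of a simple 1-up/1-down staircase converging on a null point such as a point of subjective steadiness. It is a generic illustration under our own assumptions (starting level, step size, number of reversals), not the exact procedure used by Väljamäe and Soto-Faraco (2008).

```python
import random

def run_staircase(respond, start_level=4.0, step=0.5, n_reversals=8):
    """Generic 1-up/1-down staircase: the stimulus level is decreased after one
    response category and increased after the other, so it oscillates around
    the level at which both responses are equally likely (the null point)."""
    level, last_response, reversal_levels = start_level, None, []
    while len(reversal_levels) < n_reversals:
        response = respond(level)                 # e.g., "the sound seemed to approach"
        if last_response is not None and response != last_response:
            reversal_levels.append(level)         # record the level at each reversal
        level += -step if response else step
        last_response = response
    return sum(reversal_levels) / len(reversal_levels)  # estimate of the null point

# Simulated observer whose true null point is at 1.5 (arbitrary units):
estimate = run_staircase(lambda lvl: lvl + random.gauss(0.0, 0.5) > 1.5)
```

Averaging the levels at which the staircase reverses direction gives a quick estimate of the null point around which the observer's responses flip.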
The results showed visual adaptation aftereffects on sounds, so that high-rate flashes produced
stronger auditory motion aftereffect than flashes at a lower rate, which were largely ineffective.
Importantly, when the visual adaptors were combined with high-rate flutter, not only did the size of

[Figure 29.3: panel (a) schematics of the adaptors (visual flicker and auditory flutter at 6.25 or 12.5 Hz; +/− denote approaching/receding stimuli; Vh/Vl = 12.5/6.25 Hz visual flicker, Ah/Al = 12.5/6.25 Hz auditory flutter); panel (b) magnitude of the auditory aftereffect (dB/s) for congruent (e.g., Vh+Ah+, Vl+Ah+, Vh−Ah−, Vl−Ah−) and incongruent (e.g., Vl+Ah−, Vl−Ah+, Vl+Al−, Vl−Al+) adaptor combinations.]

FIGURE 29.3  A subset of experimental conditions and results. (a) Some examples of motion adaptors rep-
resenting low-rate approaching visual stimuli, high-rate receding sounds, and direction conflicting stimuli
combining high-rate sounds with low-rate visual events. (Reprinted from Väljamäe, A. and Soto-Faraco, S.,
Acta Psychol., 129, 249–254, Copyright 2008, with permission from Elsevier.) (b) Magnitude of auditory
aftereffect (in dB/s) after adaptation to time-sampled approaching (+) or receding (−) audiovisual motion
in depth. Left subpanel shows results for directionally congruent adaptors (high-rate visual combined with
high-rate sounds; and low-rate visual combined with high-rate sounds) and right subpanel represents results for
directionally incongruent audiovisual adaptors (low-rate visual combined with high-rate sounds; and low-rate
visual combined with low-rate sounds; in both cases with visual and auditory motion in opposite directions).

the adaptation aftereffect increase overall, but, interestingly, both the fast and the slow flicker rates
turned out to be equally effective in producing auditory aftereffects (see Figure 29.3b, left subpanel).
This result strongly suggested that high-rate flutter can fill in sparsely sampled visual object motion.
This filling-in effect could be related to the sound-induced flash phenomenon, whereby the combi-
nation of low-rate flicker with a rapid train of beeps leads to illusory flashes (e.g., Shams et al. 2002).
In fact, the judgments of subjective smoothness regarding the visual flicker stimuli supported the
psychophysical data—low-rate flicker was rated as being smoother when combined with high-rate
beeps than when combined with low-rate flutter.
However, results from these experiments did not speak directly to whether the observed effects
are specific to motion per se or whether they simply result from the high-frequency temporal
structure of the sound signal. In a separate experiment, Väljamäe and Soto-Faraco (2008) tested the
relevance of motion direction congruency of the adaptors by using direction incongruent multisen-
sory adaptors. If the effect of the audiovisual adaptor lacks direction specificity, then the audiovi-
sual adapting stimulus should work equally well despite the cross-modal incongruence in motion
direction. However, the results showed that incongruent combinations of audiovisual
adaptors produced weaker aftereffects (Figure 29.3b, right subpanel). In fact, the aftereffects
of these adaptors did not differ in size or direction from the auditory motion aftereffects induced
by unimodal acoustic adaptors.
The findings of Väljamäe and Soto-Faraco’s (2008) study could be potentially attributed to the
sound-induced visual flash illusion, given that the timing parameters of the discrete stimuli used here
are similar to those of the original experiments by Shams et al. (2000) (cf. Lewkowicz 1999 for a
discussion of the intermodal temporal contiguity window for the integration of discrete multisensory events).
Thus, the aftereffects of multisensory adaptors might be explained by perceptual “upgrading” of low-
rate flicker by a high-rate train of beeps. In this case, illusory visual flashes might have filled in sparsely
sampled real visual flicker and increased motion aftereffects. Importantly, the observed effects did
not depend solely on the flutter rate, but also on the directional congruency between auditory and
visual adaptors. This means that the potential of sounds to fill in the visual series critically depends
on some kind of compatibility, or congruence, between the motion signals being processed by hearing
and sight. Thus, above and beyond the potential contribution of the auditory driving phenomenon
(e.g., Welch et al. 1986), the effect described above seems to reflect interactions between motion
cues provided by a moving object. These results might support the idea that sound can compen-
sate for a reduced visual frame rate in media applications as described in Section 29.5.1. A better
understanding of the mechanisms underlying such cross-modal fill-in effects may facilitate
new types of perceptually optimized media applications, where sound and visuals are tightly syn-
chronized on a frame-by-frame basis. Among classical examples, one can highlight the animation films of
Walt Disney (e.g., Fantasia), where music was used directly as a reference for the animators’ work, or
the abstract films of Len Lye (e.g., Colour Flight, Rhythm, Free Radicals; see Len Lye filmography
2005), in which he visualized musical rhythm by painting or scratching directly on celluloid.

29.5.3  Perceptually Optimized Media Applications


Modern times challenge us with rapidly evolving technologies that mediate our perception of physi-
cal space and time. In many situations, cinema, television, and virtual reality aim to represent
realistic or fictional worlds using the most immersive technologies available on the market (e.g.,
Sanchez-Vives and Slater 2005). The history of cinema serves as an illustration of how technical
development gradually equipped the “Great Mute” with sound, color, large and curved projection
screens, and stereoscopy. Developers of new immersive broadcasting solutions often include vibro-
tactile and even olfactory stimulation in their prototypes (Isono et al. 1996). From this perspective, it
is important to answer the question of which sensory information is needed to create a coherent per-
ceptual representation of real-world motion in viewers. For example, limited animation tech-
niques with reduced frame rates and minimalist graphics (e.g., Japanese animation or “anime”) have
been widely used in media since the 1950s and represent an alternative to a photorealistic approach
(Furniss 1998). Representing a scene by a series of still images instead of a continuous video stream
is becoming an increasingly common technique in music video clips and advertisements, where fast-
editing techniques are typically used to catch the user’s attention (Fahlenbrach 2002). However, pre-
senting an audiovisual stream as a sequence of still images can also be seen as a way of regulating
the sensory load for an end-user of multisensory displays. In a way, such slideshow-like presentation
of a visual stream resembles the common technique in visual and multimodal information search
and retrieval where only key video frames are used (Snoek and Worring 2002).
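As a minimal illustration of such keyframe-style presentation, the OpenCV sketch below keeps only every nth frame of a video so that the visual stream is reduced to a low-rate series of stills. The file name and target rate are placeholders, and in practice the continuous soundtrack would be handled separately by the player or editing tool.

```python
import cv2  # OpenCV

def sample_keyframes(video_path, target_fps=5.0):
    """Keep roughly `target_fps` frames per second from a video stream."""
    cap = cv2.VideoCapture(video_path)
    source_fps = cap.get(cv2.CAP_PROP_FPS) or 25.0   # fall back if fps is unknown
    step = max(1, int(round(source_fps / target_fps)))
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)                      # retained "keyframe"
        index += 1
    cap.release()
    return frames

stills = sample_keyframes("clip.mp4", target_fps=5.0)  # hypothetical file name
```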
An interesting question is to what extent visual information can be compensated for by other modali-
ties in new, perceptually optimized media applications (Väljamäe et al. 2008). For example, a video
stream can be divided into a sequence of still images combined with a continuous soundtrack, either
monophonic or spatialized. Several research projects used single still photographs combined with
spatial sound to recreate a coherent audiovisual experience (Bitton and Agamanolis 2004; Hoisko
2003). A classic cinematographic example where successive still images are used as a visual repre-
sentation is the photo-novel La Jetée by Marker (1962), in which the accompanying voice of a narrator
guides the viewer through the story. Comparing the subjective experiences produced by La Jetée and
a conventional film shows that the reduced temporal resolution of the visual stream does not degrade
emotional impact or the evaluation of spatial content (Väljamäe and Tajadura-Jiménez 2007). In the
spirit of Chris Marker’s pioneering films, current movie and video clip makers make increasing use
of photograph trains instead of a continuous visual stream. An illustrative example of this approach is
the recent music video clip for Jem’s “It’s Amazing,” directed by Saam Gabbay (Gabbay 2009; http://
www.youtube.com/watch?v=8XDxhDbtDak), where more than 25,000 still photographs were used
to create a 4-min music video and, on average, 4–5 fps were used for the passages that required lip sync
to the music (Saam Gabbay, personal communication).
Effective reduction of visual information in audiovisual content, as shown in the examples above,
critically depends on a better understanding of multisensory motion processing. Such perceptual
optimization may have important implications for audio and video compression and rendering tech-
nologies, especially in wireless communication, which at the present time are developed rather
independently. In these technologies, a critical problem is to find a compromise between the limited
transmission rate of information available to current technology and the realism of the content being
displayed (e.g., Sanchez-Vives and Slater 2005). Future audiovisual media content synthesis, deliv-
ery, and reproduction may switch from such a unisensory approach to amodal categories of end-
user percepts (such as objects, events, or even affective states of the user). In this new multisensory
design, amodal categories may then define the sensory modalities to be reproduced and their rendering
quality (cf. the “quality of service” approach in media delivery applications).

29.6  CONCLUSIONS
We started by providing an overview of past and recent developments revealing the phenomeno-
logical interactions that can be observed during the perception of motion in multisensory con-
texts. Over and above early findings based on introspective reports (e.g., Anstis 1973; Zapparoli and
Reatto 1969), which already pointed to the existence of strong multisensory interactions in motion
processing, more recent psychophysical studies in humans have often reported that perception of
sound motion can be influenced by several properties of a concurrently presented visual motion
signal (direction, smoothness, speed). For example, in the CDC effect, sounds can appear to move
in the same direction as a synchronized visual moving object, despite the fact that in reality they travel in
opposite directions (e.g., Soto-Faraco et al. 2002, 2004a). Although these findings were frequently
observed under artificially induced intersensory conflict, they speak directly to the strong
tendency for multisensory binding that governs motion perception in naturalistic, everyday envi-
ronments. Some of the characteristics that define this multisensory binding of motion signals are:
(1) that these multisensory combination processes occur, at least in part, at early perceptual stages
before other potential effects related to decisional stages take place; (2) that motion information is
subject to multisensory binding, over and above any other binding phenomena that can take place
between spatially static stimuli; and (3) that when other sensory modality combinations are taken
into account, a hierarchy of modalities arises, where vision dominates touch which, in turn, domi-
nates audition (e.g., Soto-Faraco et al. 2003). This hierarchy, however, can be modulated by factors
such as attention focus (Oruç et al. 2008) and, most probably, by the relative reliabilities of the
sensory signals, as suggested by recent findings in other domains.
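The reliability-based weighting alluded to here is usually formalized as maximum-likelihood (inverse-variance) cue combination (e.g., Ernst and Banks 2002; Alais and Burr 2004a). The sketch below states that rule with invented numbers purely for illustration; it is not presented as a model of motion binding in particular.

```python
def mle_combine(estimates, variances):
    """Reliability-weighted (inverse-variance) combination of unisensory estimates.
    The combined variance is never larger than that of the most reliable cue."""
    reliabilities = [1.0 / v for v in variances]
    total = sum(reliabilities)
    combined = sum(w * x for w, x in zip(reliabilities, estimates)) / total
    return combined, 1.0 / total

# Invented example: a visual and an auditory estimate of the same motion parameter.
combined, combined_var = mle_combine(estimates=[10.0, 4.0], variances=[1.0, 4.0])
# combined == 8.8: the fused estimate is pulled toward the more reliable (visual) cue.
```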
We have also touched upon the potential underlying brain mechanisms that support multisensory
binding of motion information. Both animal and human studies reveal that among the brain struc-
tures that are responsive to visual motion, the higher the processing stage at which one looks, the
greater the chance that the area will reveal multisensory properties. In fact, some studies in the past
have provided evidence of overlap between the brain regions that are active during the presentation
of motion in audition, touch, and vision (Bremmer et al. 2001b; Lewis et al. 2000). Two of the struc-
tures that are consistently found in this type of study are the PMv and parts of the IPS, possibly the
human homologue of the monkey ventral intraparietal (VIP) region. As per animal electrophysiology
data, these two areas are strongly interconnected, display a similar tuning to spatial representations of
moving objects, and contain multisensory neurons. Two recent studies have provided further insight
about the functional organization of multisensory motion processing in the human brain (Alink et
al. 2008; Baumann and Greenlee 2007). In both cases, the involvement of posterior parietal (VIP) and
frontal (PMv) areas in binding multisensory motion information seems clear. In addition, Alink et al.’s
results were suggestive of the cross-modal modulation of early sensory areas usually considered to be
involved in unisensory motion processing (MT/V5 in vision, and the planum temporale in audition). One additional
recent finding is also suggestive of the responsiveness of early visual areas to acoustic motion, in this
case as a consequence of brain plasticity in the blind (Saenz et al. 2008).
Finally, we have discussed some of the potential connections between basic and applied research
with regard to the use of dynamic displays in audiovisual media. Film editing techniques that have
been developed empirically over the years reflect some of the principles that have been indepen-
dently discovered in the laboratory. For example, sound is often used in the cinema to support visual
continuity of a highly dynamic scene, capitalizing on the superior temporal resolution of audition
over vision. Väljamäe and Soto-Faraco (2008) attempted to bridge the gap between basic research
on motion perception and application of multisensory principles by showing that sounds with high-
rate dynamic structure could help compensate for the poor visual continuity of moving stimuli
displayed at low sampling rates. These examples show that better understanding of the underlying
principles of multisensory integration might help to optimize synthesis, transmission, and presenta-
tion of multimedia content.
Future research on multisensory motion perception might make use of the principles that are
being discovered in the laboratory in order to achieve more realistic ecological stimuli using vir-
tual or augmented reality setups. It will also be interesting to study situations where the user or
observer can experience either illusory or real self-motion (see Hettinger 2002 for a recent review).
Although multisensory motion research has so far concentrated mostly on situations where the
user is static, viewers are often moving about in real-life situations, which implies that the perception
of moving objects in the surrounding environment is modulated by experienced self-motion (Probst
et al. 1984, see Calabro, Soto-Faraco and Vaina 2011, for a multisensory approach). Such investiga-
tions can shed light on interactions between neural mechanisms involved in self-motion and object
motion perception (cf. Bremmer 2005) and, in addition, may further contribute to the optimization
of media applications for training and entertainment.

ACKNOWLEDGMENTS
S.S.-F. received support from the Spanish Ministry of Science and Innovation (PSI2010-15426
and Consolider INGENIO CSD2007-00012) and the Comissionat per a Universitats i Recerca del
DIUE-Generalitat de Catalunya (SRG2009-092). A.V. was supported by Fundació La Marató de
TV3 through grant no. 071932.

REFERENCES
Alais, D., and D. Burr. 2004a. The ventriloquist effect results from near-optimal bimodal integration. Current
Biology 14: 257–262.
Alais, D., and D. Burr. 2004b. No direction-specific bimodal facilitation for audiovisual motion detection.
Cognitive Brain Research 19: 185–194.
Alink, A., W. Singer, and L. Muckli. 2008. Capture of auditory motion by vision is represented by an activation shift
from auditory to visual motion cortex. Journal of Neuroscience 28: 2690–2697.
Allen, P. G., and P. A. Kolers. 1981. Sensory specificity of apparent motion. Journal of Experimental Psychology:
Human Perception and Performance 7: 1318–1326.
Anstis, S. M. 1973. Hearing with the hands. Perception, 2, 337–341.
Baumann, O., and M. W. Greenlee. 2007. Neural correlates of coherent audiovisual motion perception. Cerebral
Cortex 17:1433–1443.
Berger, T. D., M. Martelli, and D. G. Pelli. 2003. Flicker flutter: Is an illusory event as good as the real thing?
Journal of Vision 3(6): 406–412.
Bertelson, P., and G. Aschersleben. 1998. Automatic visual bias of perceived auditory location. Psychonomic
Bulletin and Review 5: 482–489.
Bitton, J., and S. Agamanolis. 2004. RAW: Conveying minimally-mediated impressions of everyday life with
an audio-photographic tool. In Proceedings of CHI 2004, 495–502. ACM Press.
Bremmer, F., A. Schlack, J. R. Duhamel, W. Graf, and G. R. Fink. 2001a. Space coding in primate parietal
cortex. Neuroimage 14: S46–S51.
Bremmer, F., A. Schlack, N. J. Shah et al. 2001b. Polymodal motion processing in posterior parietal and premo-
tor cortex: A human fMRI study strongly implies equivalencies between humans and monkeys. Neuron
29: 287–296.
Bremmer, F. 2005. Navigation in space: The role of the macaque ventral intraparietal area. Journal of Physiology
566: 29–35.
Burtt, H. E. 1917a. Auditory illusions of movement — A preliminary study. Journal of Experimental Psychology
2: 63–75.
Burtt, H. E. 1917b. Tactile illusions of movement. Journal of Experimental Psychology 2: 371–385.
Calabro, F., S. Soto-Faraco, and L. M. Vaina. 2011. Acoustic facilitation of object movement detection during
self-motion. Proceedings of the Royal Academy of Sciences B. doi:10.1098/rspb.2010.2757. In press.
Calvert, G., C. Spence, and E. Barry (eds). 2004. The handbook of multisensory processes. Cambridge, MA:
MIT Press.
Chion, M. 1994. Audio-vision: Sound on screen. New York: Columbia Univ. Press.
Choe, C. S., R. B. Welch, R. M. Gilford, and J. F. Juola. 1975. The ‘ventriloquist effect’: Visual dominance or
response bias? Perception and Psychophysics 18: 55–60.
Colby, C. L., J. R. Duhamel, and M. E. Goldberg. 1993. Ventral intra-parietal area of the macaque: Anatomical
location and visual response properties. Journal of Neurophysiology 69: 902–914.
Connor, S. 2000. Dumbstruck: A cultural history of ventriloquism. Oxford: Oxford Univ. Press.
de Gelder, B., and P. Bertelson. 2003. Multisensory integration, perception and ecological validity. Trends in
Cognitive Sciences 7: 460–467.
Dong, C., N. V. Swindale, and M. S. Cynader. 1999. A contingent aftereffect in the auditory system. Nature
Neuroscience 2: 863–865.
Duhamel, J. R., C. L. Colby, and M. E. Goldberg. 1991. Congruent representations of visual and somatosensory
space in single neurons of monkey ventral intra-parietal cortex (area VIP). In Brain and space, ed. J.
Palliard, 223–236. Oxford: Oxford Univ. Press.
Duhamel, J. R., C. L. Colby, and M. E. Goldberg. 1998. Ventral intra-parietal area of the macaque: Congruent
visual and somatic response properties. Journal of Neurophysiology 79: 126–136.
Eidsvik, C. 2005. Background tracks in recent cinema. In Moving image theory: Ecological considerations,
ed. J. D. Anderson and B. F. Anderson, 70–78. Carbondale, IL: Southern Illinois Univ. Press.
Ernst, M. O., and M. S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415: 429–433.
Ernst, M. O., and H. H. Bulthoff. 2004. Merging the senses into a robust percept. Trends in Cognitive Sciences
8: 162–169.
Exner, S. 1875. Experimentelle Untersuchung der einfachsten psychischen Processe. Pfluger’s Arch Physiol
11: 403–432.
Fahlenbrach, K. 2002. Feeling sounds: Emotional aspects of music videos. In Proceedings of IGEL 2002 con-
ference, Pécs, Hungary.
Fitts, P. M., and R. L. Deininger. 1954. S–R compatibility: Correspondence among paired elements within
stimulus and response codes. Journal of Experimental Psychology 48: 483–492.
Fitts, P. M., and C. M. Seeger. 1953. S–R compatibility: Spatial characteristics of stimulus and response codes.
Journal of Experimental Psychology 46: 199–210.
Furniss, M. 1998. Art in motion: Animation aesthetics. London: John Libbey.
Gazzaniga, M. S. 1987. Perceptual and attentional processes following callosal section in humans. Neuro­
psychologia 25: 119–133.
Gepshtein, S., and K. Kubovy. 2000. The emergence of visual objects in space-time. Proceedings of the National
Academy of Sciences 97: 8186–8191.
Gilbert, G. M. 1939. Dynamic psychophysics and the phi phenomenon. Archives of Psychology 237: 5–43.
Graziano, M. S. A., C. G. Gross, C. S. R. Taylor, and T. Moore. 2004. A system of multimodal areas in the pri-
mate brain. In Crossmodal space and crossmodal attention, ed. C. Spence and J. Driver, 51–68. Oxford:
Oxford Univ. Press.
Graziano, M. S. A., X. Hu, and C. G. Gross. 1997. Visuo-spatial properties of ventral premotor cortex. Journal
of Neurophysiology 77: 2268–2292.
Graziano, M. S. A., G. S. Yap, and C. G. Gross. 1994. Coding of visual space by premotor neurons. Science
266: 1054–1057.
Hagen, M. C., O. Franzen, F. McGlone, G. Essick, C. Dancer, and J. V. Pardo. 2002. Tactile motion activates
the human MT/V5 complex. European Journal of Neuroscience 16: 957–964.
Hettinger, L. J. 2002. Illusory self-motion in virtual environments. In Handbook of virtual environments, ed.
K. M. Stanney, 471–492. Hillsdale, NJ: Lawrence Erlbaum.
Hoisko, J. 2003. Early experiences of visual memory prosthesis for supporting episodic memory. International
Journal of Human–Computer Interaction 15: 209–320.
Hommel, B. 2000. The prepared reflex: Automaticity and control in stimulus–response translation. In Control
of cognitive processes: Attention and performance XVIII, ed. S. Monsell and J. Driver, 247–273.
Cambridge, MA: MIT Press.
Howard, I. P., and W. B. Templeton. 1966. Human spatial orientation. New York: Wiley.
Hulin, W. S. 1927. An experimental study of apparent tactual movement. Journal of Experimental Psychology
10: 293–320.
Isono H., S. Komiyama, and H. Tamegaya. 1996. An autostereoscopic 3-D HDTV display system with reality
and presence. SID Digest 135–138.
Kamitani, Y., and S. Shimojo. 2001. Sound-induced visual “rabbit.” Journal of Vision 1: 478a.
Kirman, J. H. 1974. Tactile apparent movement: The effects of interstimulus onset interval and stimulus dura-
tion. Perception and Psychophysics 15: 1–6.
Kitagawa, N., and S. Ichihara. 2002. Hearing visual motion in depth. Nature 416: 172–174.
Kohlrausch, A., R. Fassel, and T. Dau. 2000. The influence of carrier level and frequency on modulation and
beat-detection thresholds for sinusoidal carriers. Journal of the Acoustical Society of America 108:
723–734.
Korte, A. 1915. Kinematoscopische Untersuchungen. Zeitschrift für Psychologie 72: 193–296.
Lakatos, S., and R. N. Shepard. 1997. Constraints common to apparent motion in visual, tactile and auditory
space. Journal of Experimental Psychology: Human Perception and Performance 23: 1050–1060.
Landis, C. 1954. Determinants of the critical flicker-fusion threshold. Physiological Reviews 34: 259–286.
Len Lye filmography. Len Lye Foundation site, http://www.govettbrewster.com/LenLye/Foundation/
LenLyeFoundation.aspx (accessed 28 March 2011).
Lewkowicz, D. J. 1999. The development of temporal and spatial intermodal perception. In Cognitive con-
tributions to the perception of spatial and temporal events, ed. G. Aschersleben, 395–420. Amsterdam:
Elsevier.
Lopez-Moliner, J., and S. Soto-Faraco. 2007. Vision affects how fast we hear sounds move. Journal of Vision
7:6.1–6.7.
Luppino, G., A. Murata, P. Govoni, and M. Matelli. 1999. Largely segregated parietofrontal connections link-
ing rostral intraparietal cortex (areas AIP and VIP) and the ventral premotor cortex (areas F5 and F4).
Experimental Brain Research 128: 181–187.
Macmillan, N. A., and C. D. Creelman. 1991. Detection theory: A user’s guide. Cambridge, UK: Cambridge
Univ. Press.
Manabe, K., and H. Riquimaroux. 2000. Sound controls velocity perception of visual apparent motion. Journal
of the Acoustical Society of Japan 21: 171–174.
Marker, C. 1962. La Jetée [Motion picture]. France: Argos Film.
Mastoropoulou, G., K. Debattista, A. Chalmers, and T. Troscianko. 2005. The influence of sound effects on
the perceived smoothness of rendered animations. Paper presented at APGV’05: Second Symposium on
Applied Perception in Graphics and Visualization, La Coruña, Spain.
Mateeff, S., J. Hohnsbein, and T. Noack. 1985. Dynamic visual capture: Apparent auditory motion induced by
a moving visual target. Perception 14: 721–727.
Maunsell, J. H. R., and D. C. Van Essen. 1983. The connections of the middle temporal visual area (MT) and their
relationship to a cortical hierarchy in the macaque monkey. Journal of Neuroscience 3: 2563–2580.
McCormick, D., and P. Mamassian. 2008. What does the illusory flash look like? Vision Research 48: 63–69.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264: 746–748.
Meyer, G. F., and S. M. Wuerger. 2001. Cross-modal integration of auditory and visual motion signals.
Neuroreport 12: 2557–2560.
Meyer, G. F., S. M. Wuerger, F. Röhrbein, and C. Zetzsche. 2005. Low-level integration of auditory and visual
motion signals requires spatial co-localisation. Experimental Brain Research 166: 538–547.
Morein-Zamir, S., S. Soto-Faraco, and A. Kingstone. 2003. Auditory capture of vision: Examining temporal
ventriloquism. Cognitive Brain Research 17: 154–163.
Ohmura, H. 1987. Intersensory influences on the perception of apparent movement. Japanese Psychological
Research 29: 1–19.
Oruç, I., S. Sinnett, W. F. Bischof, S. Soto-Faraco, K. Lock, and A. Kingstone. 2008. The effect of attention on
the illusory capture of motion in bimodal stimuli, Brain Research 1242: 200–208.
Pavani, F., E. Macaluso, J. D. Warren, J. Driver, and T. D. Griffiths. 2002. A common cortical substrate acti-
vated by horizontal and vertical sound movement in the human brain. Current Biology 12: 1584–1590.
Perrone, J. A., and A. Thiele. 2001. Speed skills: Measuring the visual speed analyzing properties of primate
MT neurons. Nature Neuroscience 4: 526–532.
Pick, H. L., D. H. Warren, and J. C. Hay. 1969. Sensory conflict in judgments of spatial direction. Perception
and Psychophysics 6: 203–205.
Priebe, N. J., S. G. Lisberger, and J. A. Movshon. 2006. Tuning for spatiotemporal frequency and speed in
directionally selective neurons of macaque striate cortex. Journal of Neuroscience 26: 2941–2950.
Probst, T., S. Krafczyk, T. Brandt, and E. Wist. 1984. Interaction between perceived self-motion and object
motion impairs vehicle guidance. Science 225: 536–538.
Radeau, M., and P. Bertelson. 1976. The effect of a textured visual field on modality dominance in a ventrilo-
quism situation. Perception and Psychophysics 20: 227–235.
Reisbeck, T. E., and K. R. Gegenfurtner. 1999. Velocity tuned mechanisms in human motion processing. Vision
Research 39: 3267–3285.
Rock, I., and C. S. Harris. 1967. Vision and touch. Scientific American 216: 96–104.
Saenz, M., L. B. Lewis, A. G. Huth, I. Fine, and C. Koch. 2008. Visual motion area MT+/V5 responds to audi-
tory motion in human sight-recovery subjects. Journal of Neuroscience 28: 5141–5148.
Sanabria, D., S. Soto-Faraco, and C. Spence. 2004a. Exploring the role of visual perceptual grouping on the
audiovisual integration of motion. Neuroreport 18: 2745–2749.
Sanabria, D., S. Soto-Faraco, and C. Spence. 2005a. Spatiotemporal interactions between audition and touch
depend on hand posture. Experimental Brain Research 165: 505–514.
Sanabria, D., S. Soto-Faraco, and C. Spence. 2005b. Assessing the influence of visual and tactile distractors on
the perception of auditory apparent motion. Experimental Brain Research 166: 548–558.
Sanabria, D., S. Soto-Faraco, J. S. Chan, and C. Spence. 2004b. When does visual perceptual grouping affect
multisensory integration? Cognitive, Affective, and Behavioral Neuroscience 4: 218–229.
Sanabria, D., C. Spence, and S. Soto-Faraco. 2007. Perceptual and decisional contributions to audiovisual inter-
actions in the perception of apparent motion: A signal detection study. Cognition 102: 299–310.
Sanchez-Vives, M. V., and M. Slater. 2005. From presence to consciousness through virtual reality. Nature
Reviews Neuroscience 4: 332–339.
Schlack, A., S. Sterbing, K. Hartung, K. P. Hoffmann, and F. Bremmer. 2005. Multisensory space representa-
tions in the macaque ventral intraparietal area. Journal of Neuroscience 25: 4616–4625.
Sekuler, R., A. B. Sekuler, and R. Lau. 1997. Sound alters visual motion perception. Nature 385: 308.
Shams, L., Y. Kamitani, and S. Shimojo. 2000. What you see is what you hear. Nature 408: 788.
Shams, L., Y. Kamitani, and S. Shimojo. 2002. Visual illusion induced by sound. Cognitive Brain Research 14:
147–152.
Shams, L., Y. Kamitani, S. Thompson, and S. Shimojo. 2001. Sound alters visual evoked potential in humans.
NeuroReport 12: 3849–3852.
Shimojo, S., C. Scheier, R. Nijhawan, L. Shams, Y. Kamitani, and K. Watanabe. 2001. Beyond perceptual
modality: Auditory effects on visual perception. Acoustical Science and Technology 22: 61–67.
Simon, J. R. 1969. Reactions towards the source of stimulation. Journal of Experimental Psychology 81:
174–176.
Snoek, C., and M. Worring. 2002. Multimodal video indexing: A review of the state-of-the-art. Multimedia
Tools and Applications 25: 5–35.
Soto-Faraco, S., and A. Kingstone. 2004. Multisensory integration of dynamic information. In The hand-
book of multisensory processes, ed. G. Calvert, C. Spence, and B. E. Stein, 49–68. Cambridge, MA:
MIT Press.
Soto-Faraco, S., A. Kingstone, and C. Spence. 2000. The role of movement and attention in modulating audio-
visual and audiotactile ‘ventriloquism’ effects. Abstracts of the Psychonomic Society 5: 40.
Soto-Faraco, S., A. Kingstone, and C. Spence. 2003. Multisensory contributions to the perception of motion.
Neuropsychologia 41: 1847–1862.
Soto-Faraco, S., A. Kingstone, and C. Spence. 2006. Integrating motion information across sensory modalities:
The role of top-down factors. In Progress in Brain Research: Visual Perception Series, vol. 155, ed. S.
Martínez-Conde et al., 273–286. Amsterdam: Elsevier.
Soto-Faraco, S., J. Lyons, M. S. Gazzaniga, C. Spence, and A. Kingstone. 2002. The ventriloquist in motion:
Illusory capture of dynamic information across sensory modalities. Cognitive Brain Research 14:
139–146.
Soto-Faraco, S., C. Spence, and A. Kingstone. 2004a. Crossmodal dynamic capture: Congruency effects in the
perception of motion across sensory modalities. Journal of Experimental Psychology: Human Perception
and Performance 30: 330–345.
Soto-Faraco, S., C. Spence, and A. Kingstone. 2005. Assessing automaticity in the audio-visual integration of
motion. Acta Psychologica 118: 71–92.
Soto-Faraco, S., C. Spence, and A. Kingstone. 2004b. Congruency effects between auditory and tactile motion:
Extending the phenomenon of cross-modal dynamic capture. Cognitive Affective and Behavioral
Neuroscience 4: 208–217.
Staal, H. E., and D. C. Donderi. 1983. The effect of sound on visual apparent movement. American Journal of
Psychology 96: 95–105.
Stein, B. E., and M. A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Thomas, F., and O. Johnston. 1981. Disney animation: The illusion of life. New York: Abbeyville Press.
Ungerleider, L. G., and R. Desimone. 1986. Cortical connections of visual area MT in the macaque. Journal of
Computational Neurology 248: 190–222.
Väljamäe, A., and S. Soto-Faraco. 2008. Filling-in visual motion with sounds. Acta Psychologica 129:
249–254.
Väljamäe, A., and A. Tajadura-Jiménez. 2007. Perceptual optimization of audio-visual media: Moved by sound.
In Narration and spectatorship in moving images, ed. B. Anderson and J. Anderson. Cambridge Scholars
Press.
Väljamäe, A., A. Tajadura-Jiménez, P. Larsson, D. Västfjäll, and M. Kleiner. 2008. Handheld experiences:
Using audio to enhance the illusion of self-motion. IEEE MultiMedia 15: 68–75.
van der Zee, E., and A. W. van der Meulen. 1982. The influence of field repetition frequency on the visibility of
flicker on displays. IPO Annual Progress Report 17: 76–83.
Vroomen, J., and B. de Gelder. 2000. Sound enhances visual perception: Cross-modal effects of auditory
organization on vision. Journal of Experimental Psychology: Human Perception and Performance 26:
1583–1590.
Vroomen, J., and B. de Gelder. 2003. Visual motion influences the contingent auditory motion aftereffect.
Psychological Science 14: 357–361.
Watson, A. B., and A. J. Ahumada. 1983. A look at motion in the frequency domain. In Motion: Perception and
representation, ed. J. K. Tsotsos, 1–10. New York: Association for Computing Machinery.
Watson, J. D., R. Myers, R. S. Frackowiak et al. 1993. Area V5 of the human brain: Evidence from a combined
study using positron emission tomography and magnetic resonance imaging. Cerebral Cortex 3: 79–94.
Welch, R. B. 1999. Meaning, attention, and the “unity assumption” in the intersensory bias of spatial and
temporal perceptions. In Cognitive contributions to the perception of spatial and temporal events, ed. G.
Aschersleben, T. Bachmann, and J. Musseler, 371–387. Amsterdam: Elsevier Science.
Welch, R. B., and D. H. Warren. 1986. Intersensory interactions. In Handbook of perception and human perfor-
mance. Vol. 1, Sensory processes and perception, ed. K. R. Boff, L. Kaufman, and J. P. Thomas, 25–36.
New York: Wiley.
Welch, R. B., L. D. DuttonHurt, and D. H. Warren. 1986. Contributions of audition and vision to temporal rate
perception. Perception and Psychophysics 39: 294–300.
Wertheimer, M. 1912. Experimentelle Studien über das Sehen von Bewegung. [Experimental studies on the
visual perception of movement]. Zeitschrift für Psychologie 61: 161–265.
Wertheimer, M. 1932. Principles of perceptual organization. Psychologische Forschung 41: 301–350. Abridged
translation by M. Wertheimer, in Readings in perception, ed. D. S. Beardslee and M. Wertheimer, 115–
137. Princeton, NJ: Van Nostrand-Reinhold.
Wuerger, S. M., M. Hofbauer, and G. F. Meyer. 2003. The integration of auditory and visual motion signals at
threshold. Perception and Psychophysics 65: 1188–1196.
Zapparoli, G. C., and L. L. Reatto. 1969. The apparent movement between visual and acoustic stimulus and the
problem of intermodal relations. Acta Psychologica 29: 256–267.
Zihl, J., D. von Cramon, and N. Mai. 1983. Selective disturbance of movement vision after bilateral brain dam-
age. Brain 106: 313–40.
Zihl, J., D. von Cramon, N. Mai, and C. Schmid. 1991. Disturbance of movement vision after bilateral posterior
brain damage. Further evidence and follow up observations. Brain 114: 2235–2252.
30 Multimodal Integration during
Self-Motion in Virtual Reality
Jennifer L. Campos and Heinrich H. Bülthoff

CONTENTS
30.1 Introduction...........................................................................................................................603
30.2 Simulation Tools and Techniques..........................................................................................604
30.2.1 Visual Displays..........................................................................................................604
30.2.2 Treadmills and Self-Motion Simulators....................................................................606
30.3 Influence of Visual, Proprioceptive, and Vestibular Information on Self-Motion
Perception.............................................................................................................................. 611
30.3.1 Unisensory Self-Motion Perception........................................................................... 611
30.3.2 Multisensory Self-Motion Perception........................................................................ 613
30.3.2.1 Effects of Cue Combination........................................................................ 613
30.3.2.2 Cue Weighting under Conflict Conditions.................................................. 616
30.3.3 Unique Challenges in Studying Multisensory Self-Motion Perception..................... 618
30.4 Advantages and Disadvantages of Using Simulation Technology to Study Multisensory
Self-Motion Perception.......................................................................................................... 619
30.5 Multisensory Self-Motion Perception: An Applied Perspective........................................... 620
30.6 Summary............................................................................................................................... 622
Acknowledgments........................................................................................................................... 622
References....................................................................................................................................... 622

30.1  INTRODUCTION
Our most common, everyday activities, and those that are most essential to our survival, typically
involve moving within and throughout our environment. Whether navigating to acquire resources,
avoiding dangerous situations, or tracking one’s position in space relative to important landmarks,
accurate self-motion perception is critically important. Self-motion perception is typically experi-
enced when an observer is physically moving through space, including self-propelled movements
such as walking, running, or swimming, and also when being passively moved, as on a train, or
when actively driving a car or flying a plane. Self-motion perception is important for estimating
movement parameters such as speed, distance, and heading direction. It is also important for the
control of posture, the modulation of gait, and for predicting time to contact when approaching or
avoiding obstacles. It is an essential component of path integration, which involves the accumula-
tion of self-motion information when tracking one’s position in space relative to other locations or
objects. It is also important for the formation of spatial memories when learning complex routes and
environmental layouts.
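As a toy illustration of what path integration involves computationally, the sketch below dead-reckons a 2-D position from a stream of speed and heading estimates. It is a schematic illustration only, not a model from this chapter, and the sample trajectory values are invented.

```python
import math

def path_integrate(samples, dt=0.1):
    """Accumulate (speed, heading) samples into an estimated 2-D position.
    speed in m/s, heading in radians, dt in seconds."""
    x = y = 0.0
    for speed, heading in samples:
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
    return x, y

# Invented example: walk forward for 1 s, then turn 90 degrees and walk for 1 s.
trajectory = [(1.0, 0.0)] * 10 + [(1.0, math.pi / 2)] * 10
print(path_integrate(trajectory))   # roughly (1.0, 1.0)
```

In real navigation the speed and heading signals come from noisy visual, vestibular, and proprioceptive estimates, which is why errors accumulate over distance.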
During almost all natural forms of self-motion, there are several sensory systems that provide
redundant information about the extent, speed, and direction of egocentric movement, the most
important of which include dynamic visual information (i.e., optic flow), vestibular information (i.e.,
provided through the inner ear organs including the otoliths and semicircular canals), proprioceptive
information provided by the muscles and joints, and the efference copy signals representing the com-
mands of these movements. Also important, although less well studied, are auditory signals related
to self-motion and somatosensory cues provided through wind, vibrations, and changes in pressure.
Currently, much work has been done to understand how several of these individual modalities can
be used to perceive different aspects of self-motion independently. However, researchers have only
recently begun to evaluate how they are combined to form a coherent percept of self-motion and the
relative influences of each cue when more than one is available.
Not only is it important to take a multisensory approach to self-motion perception in order to
understand the basic science underlying cue combination, but it is also important to strive toward
evaluating human behaviors as they occur under natural, cue-rich, ecologically valid conditions.
The inherent difficulty in achieving this is that the level of control that is necessary to conduct
careful scientific evaluations is often very difficult to achieve under natural, realistic conditions.
Consequently, in order to maintain strict control over experimental conditions, much of the past
work has been conducted within impoverished, laboratory environments using unnatural tasks.
More recently, however, Virtual Reality (VR) technology and sophisticated self-motion interfaces
have been providing researchers with the opportunity to provide natural, yet tightly controlled,
stimulus conditions, while also maintaining the capacity to create unique experimental scenarios
that could not occur in the real world (Bülthoff and van Veen 2001; Loomis et al. 1999; Tarr and
Warren 2002). VR also does this in a way that maintains an important perception–action loop that
is inherent to nearly all aspects of human–environment interactions.
Visually simulated Virtual Environments (VEs) have been the most commonly used form of VR,
because, until very recently, it has been difficult to simulate full-body motion through these environ-
ments without having to resort to unnatural control devices such as joysticks and keyboards. More
recently, the development of high-precision motion tracking systems and sophisticated self-motion
simulators (e.g., treadmills and motion platforms) is allowing far more control and flexibility in
the presentation of body-based self-motion cues (i.e., proprioceptive and vestibular information).
Consequently, researchers are now able to study multisensory self-motion perception in novel and
exciting ways. The significant technological advancements and increased accessibility of many VR
systems have stimulated a renewed excitement in recognizing its significant potential now and in
the future.
Much of the multisensory research up until this point has focused on tasks involving discrete
stimulus presentations in near body space, including visual–auditory, visual–proprioceptive, and
visual–haptic interactions. Far less is understood about how different sources of sensory informa-
tion are combined during large-scale self-motion through action space. Unlike other approaches
used to examine the integration of two specific cues at a particular, discrete instance in time, navi-
gating through the environment requires the dynamic integration of several cues across space and
over time. Understanding the principles underlying multimodal integration in this context of unfold-
ing cue dynamics provides insight into an important category of multisensory processing.
This chapter begins with a brief description of some of the different types of simulation tools and
techniques that are being used to study self-motion perception, along with some of the advantages
and disadvantages of the different interfaces. Subsequently, some of the current empirical work inves-
tigating multisensory self-motion perception using these technologies will be summarized, focusing
mainly on visual, proprioceptive, and vestibular influences during full-body self-motion through
space. Finally, the implications of this research for several applied areas will be briefly described.

30.2  SIMULATION TOOLS AND TECHNIQUES


30.2.1  Visual Displays
The exciting potential of VR comes from the fact that you can create worlds with particular char-
acteristics that can be systematically manipulated and customized. This includes elaborate worlds
unlike anything that can or does exist within the known real world. Rich, realistic visual details
can be included, or the visual scene can be intentionally limited to particular visual cues of inter-
est such as the optic flow provided through a cloud of dots or the relative positioning of selected
landmarks. Instant teleportation from one position in space to another (Meilinger et al. 2007), the
inclusion of wormholes to create non-Euclidean spaces (Schnapp and Warren 2007), and navigation
throughout four-dimensional (4-D) environments (D’Zmura et al. 2000) are all possible. This type
of control and flexibility is not something that can be achieved in a real-world testing environment.
Whereas in the past the process of using computer graphics to create more complex VEs, such as
realistic buildings or cities, was time consuming and arduous, new software advancements are now
allowing entire virtual cities of varying levels of detail to be built in just a few days (e.g., Müller et
al. 2006).
In order to allow an observer to visualize these VEs, different types of displays have been used
(for a more thorough review, see Campos et al. 2007a). Traditionally, desktop displays have been
the most commonly used visualization tool for presenting VEs. These displays typically consist of
a stationary computer monitor paired with an external control device that is used to interact with
the VE (i.e., a joystick or a mouse). Even though the quality and resolution of desktop displays have
been steadily increasing in recent years (e.g., high dynamic range displays; see Akyüz et al. 2007),
they are nonimmersive, have a limited field of view (FOV), and can accommodate very little natural
movement.
Other displays such as the Cave Automatic Virtual Environments (CAVE™; Cruz-Neira et al.
1993) and other large curved projection screen systems (e.g., Meilinger et al. 2008; http://www.cyberneum.com/PanoLab_en.html; see Figure 30.1) provide observers with a much wider FOV by projecting images onto the walls surrounding the observer and, in some cases, the floor.

FIGURE 30.1  MPI Panoramic projection screen. This large, spherical panoramic projection screen consists of four projectors that project images of Virtual Environments (VEs) onto the surrounding curved walls and also the floor. This provides a field of view of more than 220° horizontal and 125° vertical, thereby taking up almost the entire human visual field. Participants can move through the VE via various input devices such as bicycles, driving interfaces, or joysticks (as shown here). The VE displayed in the photo is a highly realistic virtual model of the city center of Tübingen. (Photo courtesy of Axel Griesch.)

Such displays
are often projected with two slightly different images (accounting for the interpupillary distance),
which, when paired with stereo glasses (anaglyph stereo or polarized stereo), can provide a 3-D
display of the environment. Despite the full FOV and high level of immersion provided by these
displays, they again only allow for a limited range of active movements.
Apart from desktop displays, head-mounted displays (HMDs) are perhaps the most widely used
visualization system for navigational tasks. HMDs range in size, resolution, and FOV. Their typi-
cally small FOV is one of the main restrictions. This restriction can be partially ameliorated by
pairing the HMD with a motion tracking system that can be used to update the visual image directly
as a function of the observer’s own head movements. This allows for a greater visual sampling of the
environmental space and a more natural method of visually exploring one’s environment. HMDs
also provide a highly immersive experience because the visual information is completely restricted
to that experienced through the display by blocking out all surrounding visual input. The greatest
advantage of HMDs is the extent of mobility that is possible, allowing for natural, large-scale move-
ments through space such as walking.
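As a purely illustrative aside (not part of any of the systems described above), the head-tracking update can be thought of as a simple mapping from the tracked head pose to the rendering camera. A minimal Python sketch, with hypothetical function and variable names, is:

    import numpy as np

    def yaw_pitch_to_gaze(yaw_deg, pitch_deg):
        # Convert tracked head yaw/pitch (degrees) into a unit gaze vector
        # (assumes a convention with -z forward and +y up).
        yaw, pitch = np.radians([yaw_deg, pitch_deg])
        return np.array([np.cos(pitch) * np.sin(yaw),
                         np.sin(pitch),
                         -np.cos(pitch) * np.cos(yaw)])

    def update_camera(head_position_m, yaw_deg, pitch_deg):
        # The rendering camera simply follows the tracked head pose.
        gaze = yaw_pitch_to_gaze(yaw_deg, pitch_deg)
        return head_position_m, head_position_m + gaze

    # Example: head at eye height, turned 30 degrees, level gaze.
    eye, look_at = update_camera(np.array([0.0, 1.7, 0.0]), 30.0, 0.0)

In a real system this update would run at the display frame rate and use the full 6 degree-of-freedom pose, but the basic principle is the same.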
In terms of understanding the role of particular sources of sensory information in self-motion
perception, there is often a trade-off between having high-resolution, wide FOV displays, which
provide the most compelling visual information, and the flexibility of having a visualization system
that can move with the observer (i.e., HMD), thus providing natural body-based cues. Therefore,
using a combination of approaches is often advisable.

30.2.2  Treadmills and Self-Motion Simulators


The most natural way in which humans interact with and navigate within their environment is by
actually moving. Therefore, understanding self-motion perception can only truly be accomplished
by studying an active observer as they physically move through space, something for which a simple
visualization device alone will not suffice. From the perspective of multisensory approaches to study-
ing self-motion, it is also important that particular body-based cues can be isolated from each other,
for instance, by independently manipulating proprioceptive and vestibular inputs. Several sophisti-
cated self-motion interfaces and motion capture systems are now providing such opportunities.
Of course, the most natural form of movement through a VE is, in fact, not simulated movement at all, but actual walking. Several laboratories have now developed large, fully tracked, free
walking spaces (e.g., the MPI Tracking Lab, Campos et al. 2009, http://www.cyberneum.com/
TrackingLab_en.html, see Figure 30.2; the VENlab, Tarr and Warren 2002; and the HIVE, Waller
et al. 2007). Using motion capture information, an observer can walk, rotate, and orient in any
direction while his/her movements are used directly to update the information in the visual display
(i.e., HMD). This provides a highly natural locomotor experience and retains proprioceptive and
vestibular inputs in their purest form. The main limitation of these setups is that the size of the VE is
constrained by the size of the actual environment. Although this is sufficient for studying behaviors
that take place in smaller-scale spaces, it would not suffice for understanding the role of self-motion
perception during the exploration of larger outdoor spaces or complex buildings, for instance. Some
strategies have been used to maximize movement capacities, such as placing a gain on the visuals
during rotations. This redirects the walker, causing them to turn through a greater or lesser angle physically than is specified visually, thereby containing their movements within the confines of the space (Engel et
al. 2008; Peck et al. 2008; Razzaque et al. 2001, 2002). However, the perceptual consequences of
such redirected walking manipulations are currently not known.
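To make the rotation-gain idea concrete, a minimal sketch in Python is given below; the gain value and function names are hypothetical and are not taken from the cited studies.

    def virtual_turn(physical_turn_deg, rotation_gain):
        # The virtual scene rotates faster (gain > 1) or slower (gain < 1)
        # than the physically tracked head/body rotation.
        return physical_turn_deg * rotation_gain

    # With a gain of 1.3, a 90-degree turn in the VE requires only about
    # 69 degrees of physical rotation (90 / 1.3), which helps keep the
    # walker inside the boundaries of the tracked space.
    physical_needed_deg = 90.0 / 1.3
    assert abs(virtual_turn(physical_needed_deg, 1.3) - 90.0) < 1e-9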
The advantage of tracking an observer’s position in space as a way of updating their position in
the VE is that this also provides a moment-by-moment recording of the behaviors that are being
performed during any given task. This is particularly informative when studying self-motion per-
ception because it provides a measure of different movement parameters such as walking speed and
the walked trajectory. With full or partial body tracking, additional movement characteristics such as step length, facing direction, pointing direction, and body posture can also be recorded. This provides a rich source of information, as it effectively captures even subtle movement characteristics at every instant in time (e.g., Campos et al. 2009; Siegle et al. 2009).

FIGURE 30.2  MPI Tracking Laboratory. This fully tracked, free-walking space is 12 × 12 m in size. In this space, participants' position and orientation are tracked using an optical tracking system (16 Vicon MX13 cameras) through monitoring of reflective markers. Information about a participant's position and orientation is sent from the optical trackers, via a wireless connection, to a backpack-mounted laptop worn by the participant. This system can therefore be used both to update the visual environment as a function of participants' own movements (i.e., in the HMD as shown here) and to capture different movement parameters. With this setup, it is also possible to track two or more observers, thus allowing for multiuser interactions within a VE. (Photo courtesy of Manfred Zentsch.)
Other devices that are used to allow physical walking through VEs are treadmill setups. Unlike
free walking spaces, treadmills permit unconstrained walking over infinite distances. Standard
treadmills typically provide a capacity for straight, forward walking while limiting the walker to
one position in space. Essentially, this limits the body-based cues to proprioceptive information.
Most often these setups also use a handrail for stability and support, which provides additional
haptic information informing the observer of their lack of movement through space. When walk-
ing in place under such conditions, not only are the kinematics of walking different from walking
over ground (e.g., propulsive forces), but the vestibular information that is typically generated dur-
ing the acceleration phase of walking is missing. In order to account for this, other, much larger
treadmills (ranging from 1.5 to 2.5 m wide and 3 to 6 m long) have been developed, which allow for
forward, accelerated walking across the treadmill belt until a constant walking velocity is reached
(Hollerbach et al. 2000; Souman et al. 2010; Thompson et al. 2005). A harness can be used for safety
to ensure that the walker does not leave the surface of the treadmill, while still allowing the flex-
ibility of relatively unconstrained movements. Furthermore, systems such as the Sarcos Treadport, developed by Hollerbach and colleagues, are equipped with a tether that can be used to push
and pull the walker in a way that simulates the accelerating or decelerating forces that accompany
walking through space (Christensen et al. 2000). This tether can also be used to simulate uphill or
downhill locomotion (Tristano et al. 2000).
By pairing these types of setups with a motion tracking system, the treadmill speed can be
adjusted online in response to the observer’s own movements. Specifically, control algorithms have
been developed as a way of allowing an observer to walk naturally (including stopping and chang-
ing walking speeds), while at the same time the treadmill speed is adjusted in a way that causes
the walker to remain as centrally on the treadmill as possible (e.g., Souman et al. 2010). These
algorithms are also optimized so that the recentering movements produce accelerations that are not
strong enough to create large perturbations during walking, which could cause a loss of balance. In general,
as a method of naturally moving through VEs, large linear treadmills can effectively provide pro-
prioceptive information during walking, as well as some important vestibular cues. However, they
do not allow for turning or rotational movement trajectories and can create some “noisy” vestibular
stimulation during recentering when using a control algorithm.
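The published control laws are not reproduced here, but the recentering logic can be sketched as a simple proportional–derivative rule (all gains, limits, and names below are hypothetical):

    def update_belt_speed(belt_speed, walker_pos, walker_vel, dt,
                          kp=0.5, kd=0.8, max_accel=0.5):
        # One control step for a linear treadmill recentering scheme (sketch).
        # walker_pos: position along the belt relative to the treadmill center (m)
        # walker_vel: walker velocity relative to the laboratory (m/s)
        # Speeding up the belt drives a forward-drifting walker back toward
        # the center; the acceleration is capped to limit balance perturbations.
        desired_accel = kp * walker_pos + kd * walker_vel
        desired_accel = max(-max_accel, min(max_accel, desired_accel))
        return belt_speed + desired_accel * dt

The cap on acceleration corresponds to the optimization mentioned above, in which recentering is kept gentle enough not to perturb balance.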
Circular treadmills constitute another type of movement device that allows for limitless cur-
vilinear walking through space without reaching any end limits. During curvilinear walking, the
vestibular system is always stimulated, thus providing a rich sensory experience through both pro­
prioceptive and inertial senses. Most circular treadmills are quite small in diameter and thus mainly
permit walking or rotating in place (e.g., Jürgens et al. 1999). Larger circular treadmills allow for
natural, full-stride walking in circles (see Figure 30.3 for an image of the MPI circular treadmill that
is 3.6 m in diameter). The MPI circular treadmill is a modified version of that originally developed
by Mittelstaedt and Mittelstaedt (1996), which includes new control and safety features and a motor-
ized handlebar that can move independently of the treadmill belt/disk. Consequently, this provides
a unique opportunity to decouple vestibular and proprioceptive information by having participants
walk in place at one rate as they are moved through space at a different rate. This is achieved by
having the participants’ rate of movement through space (i.e., inertial input) dictated by the speed at
which the handlebar is moved, while the rate at which they walk in place (i.e., proprioceptive input) is dictated by the rate of the disk relative to the walking/handlebar speed. Using this setup, the relation between the handlebar speed and the disk speed can be systematically manipulated to provide different information to the two sensory systems.

FIGURE 30.3  MPI Circular Treadmill. This circular treadmill (3.6 m in diameter) allows for natural, full-stride walking in circles. It is equipped with a motorized handlebar that can move independently from the treadmill belt/disk. Using this setup, the relation between handlebar speed and disk speed can be systematically manipulated to provide different information to the two sensory systems. A computer monitor mounted on the handlebar can also be used to present visual information during movement. (Photo courtesy of Axel Griesch.)
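The decoupling afforded by this setup can be summarized with a toy calculation (hypothetical variable names; this is an interpretation of the apparatus, not its actual control software): the handlebar speed sets the rotation of the body through space (vestibular input), whereas the stepping rate is set by the speed of the handlebar relative to the disk (proprioceptive input).

    def circular_treadmill_cues(handlebar_speed_dps, disk_speed_dps):
        # Angular speeds in degrees per second (sketch).
        vestibular_dps = handlebar_speed_dps                        # rotation through space
        proprioceptive_dps = handlebar_speed_dps - disk_speed_dps   # stepping rate
        return vestibular_dps, proprioceptive_dps

    # Congruent condition: stationary disk, both cues signal 20 deg/s.
    print(circular_treadmill_cues(20.0, 0.0))   # (20.0, 20.0)
    # Conflict condition: the disk rotates with the walker at 10 deg/s, so the
    # legs signal slower self-rotation than the vestibular system.
    print(circular_treadmill_cues(20.0, 10.0))  # (20.0, 10.0)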
The main drawback of most of these types of treadmill systems is that they do not allow for
combinations of purely linear and rotational movements, nor can they accommodate changes in
walking direction. To address this problem, there have been a handful of attempts to develop omni-
directional treadmills that allow limitless walking in every direction (Darken et al. 1997; Iwata
1999, Torus treadmill). The newest omnidirectional treadmill, built by the Cyberwalk project (http://www.cyberwalk-project.org), is the largest, measuring 6.5 × 6.5 m overall (with a 4 × 4 m walking area) and weighing 11 tons (see Figure 30.4). It is made up of a series of individual treadmill belts running
in one direction (x), all mounted on two chains that move the belts in the orthogonal direction (y).
Consequently, the combined motion of belts and chains can create motion in any direction. Again,
this system is used in combination with a customized control algorithm to ensure that the walker
remains centered on the platform while allowing them to change speed and direction (Souman et
al. 2010).
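Conceptually, the belt/chain arrangement amounts to a two-axis velocity decomposition; the sketch below (a simplification with hypothetical names, not the Cyberwalk control algorithm itself) shows how a walker's planar velocity could be split into a belt component and a chain component for recentering.

    import math

    def recentering_surface_velocity(walker_vx, walker_vy):
        # Surface velocity needed to counteract the walker's motion:
        # the belts take the x component, the chains take the y component.
        return -walker_vx, -walker_vy

    # A walker heading 30 degrees off the belt axis at 1.2 m/s:
    vx = 1.2 * math.cos(math.radians(30))
    vy = 1.2 * math.sin(math.radians(30))
    print(recentering_surface_velocity(vx, vy))  # approximately (-1.04, -0.60)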
Another form of self-motion perception is that which occurs when one is passively moved
through space. In this case, proprioceptive information about lower limb movements is not avail-
able and thus, in the absence of vision, self-motion is mainly detected through vestibular cues and
other sources of nonvisual information (e.g., wind, changes in skin pressure, vibrations). In order to
understand how inertial information can be used for self-motion perception, researchers have used
devices that are able to move an observer within 2-D space, including manual wheelchairs (Allen
et al. 2004; Waller and Greenauer 2007), programmable robotic wheelchairs (Berthoz et al. 1995;
Israël et al. 1997; Siegle et al. 2009), frictionless sleds (Seidman 2008), rotating platforms (Jürgens et al. 1999), and circular treadmills (Mittelstaedt and Mittelstaedt 1996; MPI circular treadmill, see Figure 30.3).

FIGURE 30.4  Cyberwalk Omni-directional Treadmill. This large omnidirectional treadmill was built by the Cyberwalk project (http://www.cyberwalk-project.org) and is housed at the MPI for Biological Cybernetics. It is 6.5 × 6.5 m (4 × 4 m walking area) and weighs 11 tons. It is made up of a series of individual treadmill belts running in one direction (x), all mounted on two chains that can move the belts in the orthogonal direction (y). Consequently, the combined motion of belts and chains can create motion in any direction. (Photo courtesy of Tina Weidgans.)

Other devices allow for 3-D movements, such as standard 6 degree-of-freedom motion platforms (e.g., Stewart motion platform; Berger et al. 2010; Butler et al. 2010; Lehmann et al. 2008; Riecke et al. 2006; http://www.cyberneum.com/MotionLab_en.html; see Figure 30.5). The MPI
has recently developed a completely new type of motion simulator based on an anthropomorphic
robot arm design (Teufel et al. 2007, http://www.cyberneum.com/RoboLab_en.html; see Figure
30.6). The MPI Motion Simulator can move participants linearly over a range of several meters and
can rotate them around any axis, thus offering a high degree of freedom of motion. Observers can
be passively moved along predefined trajectories (i.e., open loop; Siegle et al. 2009) or they can be
given complete interactive control of their own movements (i.e., closed loop) via a variety of input
devices, including a helicopter cyclic stick (Beykirch et al. 2007) and a steering wheel. As a conse-
quence of its structure, certain degrees of freedom, such as roll and lateral arcs, do not interact with
other degrees of freedom. Furthermore, this serial design provides a larger workspace and allows for upside-down movements, infinite roll capabilities, and continuous centrifugal forces—all of which are not possible with traditional simulator designs.

FIGURE 30.5  MPI Stewart motion platform. The Motion Lab at the MPI for Biological Cybernetics consists of a Maxcue 600, 6 degree-of-freedom Stewart platform coupled with an 86 × 65 degree field of view projection screen mounted on the platform. Subwoofers are installed underneath the seat to produce somatosensory vibrations as a way of masking the platform motors. Movements can be presented passively, or participants can control the platform via several different input devices, including a helicopter cyclic stick and a 4 degree-of-freedom haptics manipulator. (Photo courtesy of Manfred Zentsch.)

FIGURE 30.6  MPI Motion Simulator. The MPI Motion Simulator is based on an anthropomorphic robot arm design; it can move participants linearly over a range of several meters and rotate them around any axis. Observers can be passively moved along predefined trajectories or they can be given complete interactive control of their own movements via a variety of input devices, such as a helicopter cyclic stick or a steering wheel. A curved projection screen can also be mounted on the end of the robot arm in front of the seated observer, or alternatively an HMD can be used to present immersive visuals. Optical tracking systems have also been mounted on the robot arm to measure the position and orientation of an observer's head or arm during pointing-based tasks. (Photo courtesy of Anne Faden.)
In summary, as evidenced by the range of interfaces now available and customizable for address-
ing particular research questions, technology is now providing a means by which to carefully eval-
uate multimodal self-motion perception. Visualization devices can be used to assess how visual
information alone can be used to perceive self-motion and can help to determine the importance
of particular visual cues. Self-motion devices are allowing for the systematic isolation of vestibular
or proprioceptive cues during both active, self-propelled movements and passive transport.
When these different interfaces are combined, this provides the opportunity to devise very specific
multisensory scenarios. Much of this was not possible until very recently; as such, multisensory self-motion perception is an exciting and newly emerging field.

30.3  INFLUENCE OF VISUAL, PROPRIOCEPTIVE, AND VESTIBULAR INFORMATION ON SELF-MOTION PERCEPTION
30.3.1  Unisensory Self-Motion Perception
The classic approach to understanding how particular cues contribute to different aspects of self-
motion perception has been to systematically eliminate particular cues and evaluate behaviors under
reduced cue conditions. This, of course, is an important first step in understanding which cues are
necessary and/or sufficient to accurately perceive self-motion. Performance has been measured for
observers who only receive computer-simulated visual information in the absence of body-based
cues, and also when evaluating behaviors during movements in the complete absence of vision (e.g.,
when walking or being passively moved).
Much of the work on visual self-motion perception has looked specifically at the capacity of an
observer to use optic flow alone to effectively perceive self-motion using either sparse visual input
(i.e., textured ground plane or cloud of dots) or a rich visual scene (i.e., realistic visual environment).
For example, it has been shown that individuals are relatively accurate at using dynamic visual
information to discriminate and reproduce visually simulated traveled distances (Bremmer and
Lappe 1999; Frenz et al. 2003; Frenz and Lappe 2005; Redlick et al. 2001; Sun et al. 2004a) and
to update their landmark-relative position in space (Riecke et al. 2002). Other studies have shown
that optic flow alone can be used to estimate various other characteristics of self-motion through space, including direction (Warren and Hannon 1988; Warren et al. 2001) and speed (Larish and Flach 1990; Sun et al. 2003). Optic flow can also induce postural sway in the absence
of physical movement perturbations (Lee and Aronson 1974; Lestienne et al. 1977) and can be used
to predict the time to contact with an environmental object (Lee 1976). Characteristics of visually
induced illusory self-motion, referred to as “vection,” have also received considerable interest, par-
ticularly from individuals using VR (Dichgans and Brandt 1978; Hettinger 2002; Howard 1986).
Most readers have likely experienced vection while sitting in a stationary train when a neighbor-
ing train begins to move. In this case, the global movement of the outside visual scene induces a
compelling sense of self-motion when really it is the environment (i.e., the neighboring train) that
is moving relative to you. This phenomenon highlights the extent to which vision alone can create
a compelling illusion of self-motion.
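As a reminder of the standard formulation behind Lee's (1976) result (not reproduced from this chapter), time to contact can be approximated from purely optical quantities: for an approaching object subtending a small visual angle θ that expands at rate θ̇,

    \tau \approx \frac{\theta}{\dot{\theta}},

so, for example, an image subtending 2° and expanding at 0.5°/s signals roughly 4 s to contact (assuming a constant approach velocity), without any explicit estimate of distance or speed.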
Others have studied conditions in which access to visual information is removed and only body-
based cues (e.g., inertial and proprioceptive cues) remain available during movement. It has been
clearly established that humans are able to view a static target up to 20 m away and accurately reproduce
this distance by walking an equal extent without vision (Elliott 1986; Fukusima et al. 1997; Loomis
et al. 1992; Mittelstaedt and Mittelstaedt 2001; Rieser et al. 1990; Sun et al. 2004b; Thomson 1983).
Participants can also continuously point to a previously viewed target when walking past it blind-
folded on a straight, forward trajectory (Campos et al. 2009; Loomis et al. 1992). Others have dem-
onstrated that individuals are able to estimate distance information when learning and responding
through blindfolded walking (Ellard and Shaughnessy 2003; Klatzky et al. 1998; Mittelstaedt and
Mittelstaedt 2001; Sun et al. 2004b). A recent article by Durgin et al. (2009) looked specifically at
the mechanisms through which proprioceptive information can be used to estimate an extent of self-
motion and suggests that step integration might be a form of odometry used by humans (even when
explicit step counting is not permitted). Such mechanisms are similar to those previously shown to
be used by terrestrial insects such as desert ants (Wittlinger et al. 2006). There is some evidence,
however, that step integration could be susceptible to accumulating noise and might therefore only
be reliable for short traveled distances (Cheung et al. 2007).
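A toy simulation illustrates why noise accumulation limits step-based odometry; the stride parameters below are arbitrary and purely illustrative.

    import random

    def step_odometry_error_sd(n_steps, stride_m=0.7, stride_sd=0.05, n_runs=2000):
        # Distance is estimated by summing noisy stride lengths; return the
        # standard deviation of the resulting distance estimate.
        errors = []
        for _ in range(n_runs):
            estimate = sum(random.gauss(stride_m, stride_sd) for _ in range(n_steps))
            errors.append(estimate - n_steps * stride_m)
        return (sum(e * e for e in errors) / n_runs) ** 0.5

    # The absolute error grows roughly with the square root of the number of steps:
    print(step_odometry_error_sd(10))    # about 0.16 m
    print(step_odometry_error_sd(100))   # about 0.50 m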
A thorough collection of research has focused specifically on investigating the role of inertial
information, mainly provided through the vestibular organs, during simple linear and rotational
movements (Berthoz et al. 1995; Bertin and Berthoz 2004; Butler et al. 2010; Harris et al. 2000;
Israël and Berthoz 1989; Ivanenko et al. 1997; Mittelstaedt and Glasauer 1991; Mittelstaedt and
Mittelstaedt 2001; Seidman 2008; Siegle et al. 2009; Yong et al. 2007) and when traveling along
more complex routes involving several different segments (Allen et al. 2004; Sholl et al. 1989; Waller
and Greenauer 2007). Some findings have been interpreted to indicate that head velocity and dis-
placement can be accurately perceived by temporally integrating the linear acceleration information
detected by the otolith system. Others indicate that the influence and/or effectiveness of vestibular
information is somewhat limited, particularly when other nonvisual information such as vibrations
are no longer available (Seidman 2008), when moving along trajectories with more complex velocity
profiles (Siegle et al. 2009) or during larger-scale navigation (Waller and Greenauer 2007).
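The integration idea can be illustrated with an idealized, noise-free computation (a sketch only, not a model of the vestibular system or of any particular study): doubly integrating a sampled linear acceleration profile recovers velocity and displacement.

    def integrate_acceleration(accelerations_mps2, dt):
        # Euler integration of acceleration (m/s^2) into velocity (m/s)
        # and displacement (m); ignores sensor noise, bias, and drift.
        velocity, displacement = 0.0, 0.0
        for a in accelerations_mps2:
            velocity += a * dt
            displacement += velocity * dt
        return velocity, displacement

    # One second of 0.5 m/s^2 acceleration followed by one second at constant
    # speed, sampled at 100 Hz, yields roughly 0.5 m/s and 0.75 m.
    profile = [0.5] * 100 + [0.0] * 100
    print(integrate_acceleration(profile, dt=0.01))

In practice, noise and bias in the acceleration signal accumulate rapidly under double integration, which is consistent with the limits on vestibular-only estimates noted above.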
30.3.2  Multisensory Self-Motion Perception


As important as it is to understand how humans are able to perceive self-motion during reduced
sensory conditions, under most natural conditions, it is almost always the case that information from
several modalities is concurrently available. Unlike other types of cue combinations that maintain a
correlational relationship, in the case of self-motion perception, visual–proprioceptive and proprio-
ceptive–vestibular interactions are often causally related. For instance, when observers propel
themselves during walking, they immediately experience changes in optic flow information as a
direct consequence of their movement. Rarely does the entire visual field move when the body sig-
nals that it is stationary or vice versa. In fact, motion sickness often arises when the brain attempts
to reconcile the fact that the visual environment (e.g., the interior cabin of a ship) does not appear to
move relative to the head and yet the vestibular system is clearly detecting physical movements.
Traditionally, several different approaches have been used to evaluate the contributions of par-
ticular sensory systems to self-motion perception and spatial updating during egocentric move-
ments. These have included: (1) directly comparing the effects of multisensory versus unisensory
conditions (the most common approach); (2) creating subtle and transient cue conflicts between the
information provided by different sensory systems; and (3) introducing a prolonged conflict as a way
of evaluating the effects of sensory recalibration. Empirical evidence obtained using each of these
strategies will be discussed in turn, with a focus on studies that have exploited simulation tools and
techniques.

30.3.2.1  Effects of Cue Combination


Tasks that have been used to investigate the role of different sensory systems for self-motion per-
ception have ranged from estimating the speed, distance, and direction of a simple linear or rota-
tional movement, returning to origin after traveling a two-segment path separated by a turn, and
navigating longer, more complex routes. In order to evaluate the contributions of particular sensory
systems to each of these tasks, research has directly compared reduced cue conditions to conditions
in which all or most sensory information is available. What is clear is that no single sensory sys-
tem appears to be globally critical for all aspects of self-motion perception, but rather, the relative
importance of particular modalities is somewhat task-dependent. Although an exhaustive review is
not provided here, this summary is intended to emphasize the necessity of taking a comprehensive
approach to evaluating multisensory self-motion perception as it applies to different levels and types
of behaviors.
When looking at multisensory self-motion for purely rotational movements, several studies have
indicated that proprioceptive information appears to be quite important. For instance, Bakker et al.
(1999) asked subjects to turn various angles while in a virtual forest. They compared conditions in
which only visual or vestibular (passive rotations) information was available, to conditions in which
participants actively rotated themselves by moving their legs. It was reported that having only visual
information led to the poorest performance in this task, followed by pure vestibular stimulation.
When participants actually moved themselves, they were the most consistent and accurate. Visual
information, however, was not completely ignored, because the estimates in the combined cue condi-
tion (i.e., when participants saw the forest while physically moving) fell between the two unisensory
conditions, indicating a combined cue effect. Lathrop and Kaiser (2002) also evaluated perceived
self-orientation by measuring pointing accuracy to unseen virtual landmarks. Performance was better when participants learned the location of landmarks (via an HMD) during full-body rotations than when the same movements were simulated visually on a desktop monitor.
Consistent with the idea that pure vestibular inputs are not sufficient for self-orientation in space
during rotational movements, Wilkie and Wann (2005) reported that, when completing steering
maneuvers by rotating on a motorized chair, inertial information did not contribute significantly
more than that already provided through various visual inputs. Using both a realistic, visually rich
scene and a pure optic flow stimulus, Riecke et al. (2006) evaluated the effects of rotational inertial
cues on obligatory egocentric spatial updating. It was found that neither in rich nor impoverished
visual conditions did the added inertial information improve performance. Unlike other studies,
however, Riecke et al. (2006) demonstrated that with a realistic, familiar visual scene, dynamic
visual information alone could be used for updating egocentric positions.
Lehmann et al. (2008) evaluated the benefits of having inertially based self-motion information
during the mental rotation of an array of objects. In this case, participants were seated on a motion
simulator while viewing a large projection screen and a virtual array of objects displayed on a table-
top directly in front of them. When having to identify which object in the array was shifted after
a viewpoint change (either physically or visually introduced), a detection advantage was observed
after the physical rotation. This indicates that the inertial information provided during the rota-
tion facilitated mental rotation, thus also supporting previous real-world studies (Simons and Wang
1998).
Others have also investigated individual cue contributions during purely linear movements. For
instance, Harris et al. (2000) evaluated the ability of participants to estimate linear trajectories
using either visual information provided through an HMD and/or vestibular sources when passively
moved on a cart. Here they found that when visual and vestibular inputs were concurrently avail-
able, estimates more closely approximated those of the purely vestibular estimates than the purely
visual estimates. The importance of body-based cues for traveled distance estimation has also been
revealed through a series of studies by Campos et al. (2007b). In these experiments, body-based
cues were provided either by: (1) natural walking in a fully tracked free walking space (propriocep-
tive and vestibular), (2) being passively moved by a robotic wheelchair (vestibular), or (3) walking
in place on a treadmill (proprioceptive). Distances were either presented through optic flow alone,
body-based cues alone, or both visual and body-based cues combined. In this case, combined cue
effects were again always observed, indicating that no modality was ever completely disregarded.
When visual and body-based cues were combined during walking, estimates more closely approxi-
mated the unisensory body-based estimates. When visual and inertial cues were combined during
passive movements, the estimates fell in between the two unisensory estimates.
Sun et al. (2004a) investigated the relative contributions of visual and proprioceptive information
by having participants compare two traveled distances experienced by riding a stationary bicycle
down a virtual hallway viewed in an HMD. It was concluded in this case that visual information
was predominantly used. It is important to note that when riding a bike, there is no absolute one-
to-one relationship between the metrics of visual space and those of the proprioceptive movements
because of the unknown scale of one pedal rotation (i.e., this would depend on the gear, for
instance). Even under such conditions, combined cue effects were observed, such that, when visual
and proprioceptive cues were both available, estimates differed from those in either of the unimodal
conditions.
Cue combination effects have also been evaluated for speed perception during linear self-​motion
(Durgin et al. 2005; Sun et al. 2003). For instance, Durgin et al. (2005) have reported that physi-
cally moving (i.e., walking or being passively moved) during visually simulated self-motion causes
a reduction in perceived visual speed compared to situations in which visually simulated self-
motion is experienced when standing stationary. The authors attribute this to the brain’s attempt
at optimizing its efficiency when presented with two, typically correlated cues with a predictable
relationship.
Slightly more complex paths consisting of two linear segments separated by a rotation of vary-
ing angles have also been used to understand how self-motion is integrated across different types
of movements. Typically, such tasks are used to answer questions about how accurately observers
are able to continuously update their position in space without using landmarks (i.e., perform path
integration). For instance, triangle completion tasks typically require participants to travel a linear
path, rotate a particular angle, travel a second linear path, and then return to home (or face the point
of origin). In a classic triangle completion study, Klatzky et al. (1998) demonstrated that during
purely visual simulations of rotational components of the movement (i.e., the turn), participants
were highly disoriented when attempting to face back to start compared to conditions under which
full-body information was present during the rotation. In fact, the physical turn condition resulted in
errors that were almost as low as the errors in the full, real walking condition in which body infor-
mation was available during the entire route (walk, turn, walk, face start). Unlike several of the rota-
tional self-motion experiments described above, vestibular inputs during the rotational component
of this triangle completion task appeared to be very important for perceived self-orientation. Again,
however, it emphasizes the importance of physical movement cues over visual cues in isolation.
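For readers unfamiliar with the geometry of triangle completion, the required response can be written as a short path-integration computation; the sketch below uses planar geometry and hypothetical names.

    import math

    def triangle_completion(leg1_m, turn_deg, leg2_m):
        # Start at the origin heading along +x, walk leg1, turn (left positive),
        # walk leg2; return the distance home and the additional turn needed
        # to face the origin.
        heading = math.radians(turn_deg)
        x = leg1_m + leg2_m * math.cos(heading)
        y = leg2_m * math.sin(heading)
        distance_home = math.hypot(x, y)
        bearing_home = math.atan2(-y, -x)
        turn_to_face = math.degrees(bearing_home - heading)
        return distance_home, (turn_to_face + 180.0) % 360.0 - 180.0

    # Walk 3 m, turn 90 degrees left, walk 4 m: home is 5 m away and the
    # walker must turn about 143 degrees (to the left) to face the origin.
    print(triangle_completion(3.0, 90.0, 4.0))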
Using a similar task, Kearns et al. (2002) demonstrated that pure optic flow information was
sufficient to complete a return-to-origin task, although the introduction of body-based cues (pro-
prioceptive and vestibular) when walking through the virtual environment led to decreased vari-
ability in responding. This was true regardless of the amount of optic flow that was available from
the surrounding environment, thus suggesting a stronger reliance on body-based cues. When using
a two-segment path reproduction task to compare moving via a joystick versus walking on the
Torus treadmill, Iwata and Yoshida (1999) reported a higher accuracy during actual walking on the
treadmill compared to when active control of self-motion was provided through the use of an input
device.
Chance et al. (1998) used a more demanding task in which participants were asked to travel
through a virtual maze and learn the locations of several objects as they moved. At the end of the
route, when prompted, participants turned and faced the direction of a particular target. Here, the
authors compared conditions in which participants actually walked through the maze (propriocep-
tive and vestibular inputs during translation and rotation), to that in which a joystick was used to
navigate the whole maze (vision alone), to that in which a joystick was used to translate and physical
rotations were provided (proprioceptive and vestibular inputs during rotation only). When physically walking, errors were the lowest; when only visual information was available, errors were the highest; and when only physical rotations were possible, responses fell in between (although they
were not significantly different from the vision-only condition). Using similar conditions, Ruddle
and Lessels (2006, 2009) observed a comparable pattern of results when evaluating performances
on a search task in a room-sized virtual environment. Specifically, conditions in which participants
freely walked during their search resulted in highly accurate and efficient search performance; observers who were only allowed to physically rotate were less efficient, and those with only visual information available were even less efficient.
Waller and colleagues have evaluated questions related to multisensory navigation as they relate
to larger-scale self-motion perception and the acquisition and encoding of spatial representations.
For instance, they have considered whether the inertial information provided during passive move-
ments in a car contributes to the development of an accurate representation of a route beyond the
information already provided through dynamic visual inputs (Waller et al. 2003). They found that
inertial inputs did not significantly improve performance; even when the inertial cues were
not consistent with the visuals, instead of disorienting or distracting observers, there was in fact
no impact on spatial memory. Similarly, Waller and Greenauer (2007) asked participants to travel
along a long indoor route (about 480 ft), and then evaluated their ability to perform a variety of spa-
tial tasks. Although participants learned the route under different sensory conditions—by walking
with updated vision, by being passively moved with updated vision, or by viewing a visual simula-
tion of the same movement—there appeared to be no significant effects of cue availability (however,
see Waller et al. 2004). Overall, the less obvious role of body-based cues in these larger-scale, more
cognitively demanding tasks stands in contrast to the importance of body-based cues evidenced in
simpler self-motion updating tasks. As such, future work must help to reconcile these findings and
to form a more comprehensive model of multisensory self-motion in order to understand how the
scale of a space, the accumulation of self-motion information, and the demands of the task relate to
relative cue weighting.
Not only do the effects of cue combinations exhibit themselves through consciously produced
behaviors or responses in spatial tasks, but they can also be seen in other aspects of self-motion,
including the characteristics of gait. For instance, Mohler et al. (2007a) investigated differences in
gait parameters such as walking speed, step length, and head-to-trunk angle when walking with
eyes open versus closed, and also when walking in a VE (wearing an HMD) versus walking in the
real world. It was found that participants walked slower and exhibited a shorter stride length when
walking with their eyes closed. During sighted walking while viewing the VE through the HMD,
participants walked slower and took smaller steps than when walking in the real world. Their head-
to-trunk angle was also smaller when walking in the VE, most likely due to the reduced vertical
FOV.
Similarly, Sheik-Nainar and Kaber (2007) evaluated different aspects of gait, such as speed,
cadence, and joint angles when walking on a treadmill. They evaluated the effects of presenting
participants with congruent and updated visuals (via an HMD projecting a simulated version of the
laboratory space) compared to stationary visuals (real-world laboratory space with reduced FOV to
approximate HMD). These two conditions were compared to natural, overground walking. Results
indicated that although both the treadmill conditions caused participants to walk slower and take
smaller steps, when optic flow was consistent with the walking speed, gait characteristics more
closely approximated that of overground walking.
Finally, although most of the work on multisensory self-motion perception has dealt specifically
with visual interactions with body-based cues, it is important to note that researchers have begun to
evaluate the impact of auditory cues on self-motion perception. For instance, Väljamäe et al. (2008)
have shown that sounds associated with self-motion through space, such as footsteps, can enhance
the perception of linear vection. Furthermore, Riecke et al. (2009) have shown that sounds produced
by a particular spatial location (i.e., by water flowing in a fountain) can enhance circular vection
when appropriately updated with the moving visuals.

30.3.2.2  Cue Weighting under Conflict Conditions


Although understanding the perceptual and behavioral consequences of adding or subtracting
cues remains an informative approach to understanding self-motion perception, it is limited when
attempting to precisely quantify the contributions made by individual cues or when defining the
exact principles underlying this integration. Considering that individual modalities are sufficient
in isolation for many of the different self-motion based tasks, it is difficult to assess how the dif-
ferent modalities combine when several are simultaneously present. In most cases, the information
provided by two different sensory modalities regarding the same external stimuli is redundant and
thus, it is difficult to dissociate the individual contributions of each.
A popular and effective strategy for dissociating naturally congruent cues has been the cue conflict
approach. This approach involves providing individual modalities with different and incongruent
information about a single perceptual event or environmental stimulus. Much of the classic research
using experimentally derived cue conflicts in the real world comes from work using displacement
prisms (Pick et al. 1969; Welch and Warren 1980), and other recent examples have used magnifica-
tion/minimization lenses (Campos et al. 2010). In the case of self-motion perception, prism goggles
have been used, for instance, to shift the entire optic array horizontally, thus causing a conflict
between what is perceived visually and what is perceived via other modalities such as propriocep-
tion (Rushton et al. 1998). Although prism approaches have, in the past, provided great insight into
sensory–motor interactions in the real world, distortions can occur and the type of conflict manipu-
lations that can be introduced are limited (e.g., restricted to changing heading direction or vertical
eye height). VR, however, provides a much more flexible system that can change many different
characteristics of the visual environment as well as present visual speeds, traveled distances, head-
ing directions, orientation in 3-D space, etc., that differ from that being simultaneously presented
to proprioceptive and vestibular sources. In the context of understanding multisensory integration
during self-motion, cue conflicts have been used to understand (1) the immediate consequences of
transient sensory conflict (momentary incongruencies) and (2) the recalibration of optic flow and
body-based cues over time (enduring conflict). In this section, each will be considered.
Transient cue conflicts are typically introduced on a trial-by-trial basis in an effort to avoid adaptation effects. Here, the idea is ultimately to understand
the relative cue weighting of visual and body-based cues when combined under normal circum-
stances. For instance, Sun et al. (2003, 2004a) used this strategy in the aforementioned simulated
bike riding experiment as a way of dissociating the proprioceptive information provided by pedal-
ing, from the yoked optic flow information provided via an HMD. In a traveled distance comparison
task, they reported an overall higher weighting of visual information when the relation between the
two cues was constantly varied. However, the presence of proprioceptive information continued to
improve visually specified distance estimates, even when it was not congruent with the visuals (Sun
et al. 2004a). On the other hand, Harris et al. (2000) used a similar technique to examine the relative
contributions of visual–vestibular information to linear self-motion estimation over several meters
and found that observers' estimates more closely approximated the distances specified by vestibular
cues than those specified by optic flow. Sun et al. (2003) also evaluated the relative contributions of
visual and proprioceptive information using a speed discrimination task while bike riding down a
virtual hallway. Here, they found that although both cues contributed to speed estimates, proprio-
ceptive information was in fact weighted higher.
Other studies have investigated visual–vestibular integration during smaller-scale, simulated full-body movements by presenting optic flow stimuli via a projection screen and vestibular information via a 6 degree-of-freedom motion platform (Butler et al. 2010; Fetsch et al. 2009; Gu et al.
2008). In this case, it has consistently been shown that the variances observed for the estimates in
the combined cue conditions are lower than estimates in either of the unisensory conditions. In the
series of traveled distance experiments by Campos et al. (2007b) described above, subtle cue con-
flicts were also created between visual and body-based cues (see also Kearns 2003). Here, incon-
gruencies were created by either changing the visual gain during physical movements or changing
the proprioceptive gain during walking (i.e., by changing the treadmill speed). Overall, the results
demonstrated a higher weighting of body-based cues during natural overground walking, a higher
weighting of proprioceptive information during treadmill walking, and a relatively equal weighting
of visual and vestibular cues during passive movement. These results were further strengthened by
the fact that the higher weighting of body-based cues during walking was unaffected by whether
visual or proprioceptive gain was manipulated.
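The logic of inferring relative weights from such subtle conflicts can be illustrated with a simple weighted-average model (an idealized sketch, not the analysis used in the studies cited above): when the two cues specify slightly different distances, the perceived distance reveals how strongly each cue was weighted.

    def combined_percept(d_visual, d_body, w_body):
        # Linear cue-combination model of perceived traveled distance (sketch).
        return w_body * d_body + (1.0 - w_body) * d_visual

    def inferred_body_weight(d_visual, d_body, perceived):
        # Recover the implied body-based weight from a conflict trial.
        return (perceived - d_visual) / (d_body - d_visual)

    # Conflict trial: body-based cues signal 10 m while a visual gain makes
    # optic flow signal 12 m. A perceived distance of 10.5 m implies a
    # body-based weight of about 0.75.
    print(combined_percept(12.0, 10.0, 0.75))      # 10.5
    print(inferred_body_weight(12.0, 10.0, 10.5))  # 0.75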
The vast majority of the work evaluating relative cue weighting during self-motion perception
using cue conflict paradigms has considered how vision combines with different body-based cues.
Others have recently conducted some of the first experiments to use this technique for studying
proprioceptive–vestibular integration. In order to achieve this, they used the MPI circular treadmill
setup described above (see Figure 30.3). Because this treadmill setup consists of a handlebar that
can move independently of the treadmill disk, the relation between the handlebar speed and the disk
speed can be changed to provide different information to the two sensory systems.
Cue conflict techniques have also been used to evaluate the effect of changing cue relations on
various gait parameters. For instance, Prokop et al. (1997) asked participants to walk at a comfort-
able, yet constant speed on a self-driven treadmill. When optic flow was accelerated or decelerated
relative to the actual walking speed, unintentional modulations in walking speed were observed.
Specifically, when the visual speed increased, walking speeds decreased, whereas the opposite was
true for decreased visual speeds. Similarly, it has also been shown that walk-to-run and run-to-walk
transitions can also be unintentionally modified by providing a walking observer with different
rates of optic flow (Guerin and Bardy 2008; Mohler et al. 2007b). Again, as the rate of optic flow
is increased, the speed at which an observer will transition from running to walking will be lower,
whereas the opposite is true for decreased optic flow rates.
Another group of studies has used prolonged cue conflicts as a way of investigating sensory–­
motor recalibration effects during self-motion. A classic, real-world multisensory recalibration
experiment was conducted by Rieser and colleagues (1995), in which an extended mismatch was
created between visual flow and body-based cues. Using a cleverly developed setup, participants
walked on a treadmill at one speed, while it was pulled behind a tractor moving at either a faster or
slower speed. Consequently, the speed of the movement experienced motorically was either greater
or less than the speed of the visually experienced movement. After adaptation, participants walked
blindfolded to previewed visual targets. Results indicated that when the visual flow was slower than
the locomotor information participants overshot the target (relative to pretest), whereas when the
visual flow was faster than the locomotor information they undershot the target distance.
Although the approach used by Rieser et al. (1995) was ingenious, one can imagine that this
strategy can be accomplished much more easily, safely, and under more highly controlled circum-
stances by using simulation devices. Indeed, the results of Rieser et al.’s (1995) original study have
since been replicated and expanded upon using VR. This has been achieved by having participants
walk on a treadmill or within a tracked walking space while they experience either relatively faster
or slower visually perceived flow via an HMD or a large FOV projection display (Durgin et al. 2005;
Mohler et al. 2007c; Proffitt et al. 2003; Thompson et al. 2005). For instance, it has been shown that
adaptations that occur when subjects are walking through a VE on a treadmill transfer to a real-
world blind walking task (Mohler et al. 2007c). There is also some indication that the aftereffects
observed during walking on solid ground (tracked walking space) are larger than those observed
during treadmill walking (Durgin et al. 2005). Pick et al. (1999) have also shown similar recalibra-
tion effects for rotational self-motion.

30.3.3  Unique Challenges in Studying Multisensory Self-Motion Perception


In recent years, much of the multisensory research community has used psychophysical methods
as a way of evaluating whether two cues are integrated in a statistically optimal fashion [i.e., maxi-
mum likelihood estimation (MLE) or Bayesian approaches to cue integration; Alais and Burr 2004;
Blake and Bülthoff 1993; Bülthoff and Mallot 1988; Bülthoff and Yuille 1991, 1996; Butler et al.
2010; Cheng et al. 2007; Ernst and Banks 2002; Ernst and Bülthoff 2004; Fetsch et al. 2009; Knill
and Saunders 2003; Kording and Wolpert 2004; MacNeilage et al. 2007; Welchman et al. 2008].
A traditional design used to evaluate such predictions involves a comparison of the characteristics
of the psychometric functions (i.e., just noticeable difference or variance scores) obtained during
unisensory conditions to those obtained during multisensory conditions. Based on the assumptions
of an MLE account, at least two general predictions can be made. First, the variance observed in
the combined sensory condition should be lower than that observed in either of the unimodal condi-
tions. Second, the cue with the highest unimodal variance should be given less weight when the two
cues are combined. A cue conflict is often used as a way of providing slightly different information
to each of the two cues as a way of identifying which cue was weighted higher in the combined
estimate.
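For reference, the MLE predictions discussed here are conventionally written as follows (the standard two-cue formulation from the cue-combination literature, shown with visual and body-based cues as an example):

    \hat{S}_{VB} = w_V \hat{S}_V + w_B \hat{S}_B, \qquad
    w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_B^2}, \qquad
    w_B = 1 - w_V, \qquad
    \sigma_{VB}^2 = \frac{\sigma_V^2 \, \sigma_B^2}{\sigma_V^2 + \sigma_B^2},

so the combined variance can never exceed the smaller of the two unimodal variances, and the noisier cue receives the smaller weight; these are precisely the two predictions described above.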
However, because of the tight relationship between visual, vestibular, and proprioceptive infor-
mation during self-motion, this presents a unique challenge for obtaining unbiased unisensory esti-
mates through which to base predictive models. This is because even in the unisensory conditions,
there remains an inherent conflict. For instance, when visual self-motion is simulated in the absence
of proprioceptive and vestibular inputs, this could be a challenge for the brain to reconcile. Because
the proprioceptive and vestibular systems cannot be “turned off,” they constantly send the brain
information about self-motion, regardless of whether that information indicates self-motion through
space or a stationary egocentric position. Therefore, when the visual system is provided with a com-
pelling sense of self-motion, both the muscles and joints and inner ear organs clearly do not support
this assessment.
Despite these constraints, effective models of self-motion perception have recently been devel-
oped as a way of assessing some of the abovementioned predictions (e.g., Jürgens and Becker 2006;
Laurens and Droulez 2007). For instance, Jürgens and Becker (2006) evaluated the weighting of ves-
tibular, proprioceptive, and cognitive inputs on displacement perception. They report that the more
sensory information that is available, the less participants appeared to rely on cognitive strategies.
In addition, with increasing sources of combined information, lower variance scores were observed.
Cheng et al. (2007) have also summarized some of the multisensory work in locomotion and spatial
navigation and evaluated how these findings fit within the context of Bayesian theoretical predic-
tions. Overall, there remains much important work to be done concerning the development of quan-
titative models describing the principles underlying multisensory self-motion perception.

30.4 ADVANTAGES AND DISADVANTAGES OF USING SIMULATION TECHNOLOGY TO STUDY MULTISENSORY SELF-MOTION PERCEPTION
Throughout this chapter, numerous unique benefits of using visually simulated environments and
various self-motion simulators to study multisensory self-motion perception have been described.
However, because VR technology is not yet capable of achieving the extraordinary task of capturing
every aspect of reality in veridical spatial and temporal terms, there are several limitations that must
also be acknowledged. Below, we will briefly consider some of the additional advantages and disad-
vantages of using VR in studying multisensory self-motion perception not already discussed earlier
in this chapter (see also Bülthoff and van Veen 2001; Loomis et al. 1999; Tarr and Warren 2002).
Considering that the natural world contains an infinite amount of contextual and behaviorally
relevant sensory information, it is often difficult to predict how these sources of information will
interact. As mentioned above, perhaps the most significant advantage of VR is that it can provide
a highly controlled, multisensory experience. It is also able to overcome some of the difficulties
inherent in experimentally manipulating a unimodal component of a multisensory experience and
in dissociating individual cues within one modality. Moreover, each of these manipulations is
achieved under safe, low-risk, highly replicable circumstances, and often (although not always) at a
much lower cost than is possible in the real world. For instance, Souman et al. (2009) were interested
in empirically testing the much-speculated question of whether humans indeed walk in circles when
lost in the desert. To do this, Souman and colleagues traveled to the Sahara desert. Without going
through this level of effort and expense, conducting such experiments would otherwise be extremely
difficult to test in the real world because of the need to have a completely sparse environment
through which an individual can walk for hours. However, following the original, real-world experi-
ment, Souman and colleagues have since been able to evaluate similar questions under more precise
conditions by using the newly developed MPI omnidirectional treadmill. Here they can manipulate
particular characteristics of the VE as a way of evaluating the exact causes of any observed veering
behaviors, while still allowing for limitless walking capabilities in any direction.
Although many of the tasks described thus far have dealt mainly with consciously reported or
reproduced behaviors in VEs, it is also important to note that even unconscious, physiological reac-
tions (e.g., heart rate and galvanic skin response) often respond in ways similar to those observed for
real-world events. For instance, when observers walk to the very edge of a cliff in a VE, not
only do many participants report a compelling sense of fear, but their heart rate also increases con-
siderably (Meehan et al. 2005). This effect is further amplified when additional sensory cues, such
as the haptic sensation of feeling the edge of the drop-off with one's feet, are also provided.
The disadvantages of VR must also be accounted for when considering whether particular tech-
nologies are appropriate for addressing specific research questions. For instance, as mentioned
above, there is often a trade-off between high-quality, wide FOV visualization systems and mobil-
ity. However, the impact that a reduced FOV has on self-motion perception is still relatively unclear.
The results of Warren and Kurtz (1992) indicate that, contrary to what was previously believed,
peripheral optic flow information is not necessarily the dominant source of visual input when performing a visual
heading task; rather, central visual input tends to provide more accurate estimates. Banton et al.
(2005), on the other hand, indicate that peripheral information seems to be important for accurately
perceiving visual speed when walking. Therefore, the perceptual impact of having a restricted FOV
on the perception of various aspects of self-motion requires further investigation.
There are also several clear and consistent perceptual errors that occur in VEs that do not occur
in the real world. For instance, although much research has now demonstrated that humans are very
good at estimating the distance between themselves and a stationary target in the real world (see
Loomis and Philbeck 2008 for a review), the same distance magnitudes are consistently underes-
timated in immersive VEs by as much as 50% (Knapp and Loomis 2004; Loomis and Knapp 2003;
Thompson et al. 2004; Witmer and Kline 1998). This effect is not entirely attributable to poor visual
graphics (Thompson et al. 2004) and although some groups have reported a distance compression
effect when the FOV is reduced and the viewer is stationary (Witmer and Kline 1998), others have
shown that when head movements are allowed under restricted FOV conditions, these effects are
not observed (Creem-Regehr et al. 2005; Knapp and Loomis 2004). Strategies have been used to
reduce this distance compression effect, for instance, by providing various forms of feedback when
interacting in the VE (Mohler et al. 2007c; Richardson and Waller 2005; Waller and Richardson
2008), yet the exact cause of this distance compression remains unknown.
Another, less-studied perceptual difference between virtual and real environments is the
misperception of visual speed when walking in VEs (Banton et al. 2005;
Durgin et al. 2005). For instance, Banton et al. (2005) required participants to match their visual
speed (presented via an HMD) to their walking speed as they walked on a treadmill. When facing
forward during walking, the visual speed had to be increased to about 1.6× the walking speed in
order to appear equal.
When motion tracking is used to visually update an observer’s position in the VE, there is also
the concern that temporal lag has the potential to create unintentional sensory conflict, disrupt the
feeling of presence, and cause cybersickness. There is also some indication that characteristics of
gait change when walking overground in a VE compared to the real world (Mohler et al. 2007a),
and walking on a treadmill in a VE is associated with increased stride frequency (Sheik-Nainar
and Kaber 2007). It is not yet known how such changes in physical movement characteristics might
impact particular aspects of self-motion perception.
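As a rough back-of-the-envelope illustration (the walking speed and latency below are assumptions chosen
for the example, not measurements from the studies cited above), the spatial mismatch introduced by
end-to-end tracking latency grows in proportion to the observer's speed:

```python
# Approximate positional lag of the rendered viewpoint caused by
# end-to-end (motion-to-photon) tracking latency during walking.
# Both values below are illustrative assumptions.
walking_speed = 1.4   # m/s, a typical comfortable walking speed
latency = 0.050       # s, assumed end-to-end system latency

position_lag = walking_speed * latency   # metres the visual scene lags behind
print(f"Rendered viewpoint trails the tracked head by ~{position_lag * 100:.0f} cm")
```

Even a lag of a few centimetres constitutes a visual–vestibular discrepancy that may contribute to the
conflicts, loss of presence, and cybersickness described above.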
In addition to lower-level perceptual limitations of VEs, there are also higher-level cognitive
effects that can affect behavior. For instance, there is often a general awareness when interacting
within a VE that one is in fact engaging with artificially derived stimuli. Observers might react dif-
ferently to simulated scenarios, for instance, by placing a lower weighting on sensory information
that they know to be simulated. Furthermore, when visually or passively presented movements defy
what is physically possible in the real world, this information might also be treated differently. In
cue conflict situations, it has also been shown that relative cue weighting during self-motion can
change as a function of whether an observer is consciously aware of any cue conflicts that are intro-
duced (Berger and Bülthoff 2009).
There is also a discord between the perceptual attributes of the virtual world that an observer is
immersed in and the knowledge of the real world that they are physically located within. Evidence
that this awareness might impact behavior comes from findings indicating that, during a homing
task in a VE, knowledge of the size of the real-world room impacts navigational behaviors in the VE
(Nico et al. 2002). Specifically, when participants knowingly moved within a smaller real-world
room they undershot the origin in the VE compared to when they were moving within a larger real-
world space (even though the VEs were of identical size).
Overall, researchers should ideally strive to exploit the advantages offered by the various avail-
able interfaces while controlling for the specific limitations through the use of others. Furthermore,
whenever possible, it is best to take the reciprocally informative approach of comparing and coordi-
nating research conducted in VR with that taking place in real-world testing scenarios.

30.5  MULTISENSORY SELF-MOTION PERCEPTION: AN APPLIED PERSPECTIVE


Being able to effectively and accurately represent multiple sources of sensory information within a
simulated scenario is essential for a broad variety of applied areas. VR technologies are now being
widely adopted for use in areas as diverse as surgical, aviation, and rescue training, architectural
design, driving and flight simulation, athletic training and evaluation, psychotherapy, gaming, and
entertainment. Therefore, not only is it important to understand cue integration during relatively
simple tasks, but it is also imperative to understand these perception–action loops during more
complex, realistic, multifaceted behaviors. Although most multisensory research has focused on the
interaction of only two sensory cues, most behaviors occur in the context of a variety of sensory
inputs, and therefore understanding the interaction of three or more cues (e.g., Bresciani et al. 2008)
during ecologically valid stimulus conditions is also important. These issues are particularly criti-
cal considering the possibly grave consequences of misperceiving spatial properties or incorrectly
adapting to particular stimulus conditions. Here, we briefly consider two applied fields that we feel
are of particular interest as they relate to multisensory self-motion perception: helicopter flight
behavior and locomotor rehabilitation.
Helicopter flight represents one of the most challenging multisensory control tasks accomplished
by humans. The basic science of helicopter flight behavior is extremely complex and the effects
of specific flight simulation training on real-world performance (i.e., transfer of training) remain
poorly understood. Because several misperceptions are known to occur during helicopter flight, it
is important to first understand the possible causes of such misperceptions in a way that will allow
for more effective training procedures. One example of such a misperception that can occur when
reliable visual information is not available during flight is the somatogravic illusion. In this case,
the inertial forces produced by accelerations of the aircraft may be confused with those specifying
gravity, causing an illusion of tilt during purely linear accelerations and often resulting
in devastating outcomes.
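A worked example conveys the size of the effect: during sustained linear acceleration a, the
gravito-inertial acceleration vector tilts away from gravity by atan(a/g), and it is this tilt that can
be misattributed to pitch of the aircraft. The sketch below uses an arbitrary, illustrative acceleration
value rather than data from any particular flight scenario.

```python
import math

g = 9.81         # m/s^2, gravitational acceleration
a_forward = 2.5  # m/s^2, assumed sustained forward acceleration (illustrative)

# Tilt of the gravito-inertial vector relative to gravity; without reliable
# visual cues this tilt can be perceived as a backward pitch of the aircraft.
illusory_pitch = math.degrees(math.atan2(a_forward, g))
print(f"Gravito-inertial tilt: ~{illusory_pitch:.1f} degrees")
```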
Several studies have been conducted using the MPI Motion Simulator by outfitting it with a heli-
copter cyclic stick and various visualization devices in order to create a unique and customizable
flight simulator. For instance, nonexpert participants were trained on the simulator to acquire the
skills required to stabilize a helicopter during a hovering task (Nusseck et al. 2008). In this case,
the robot was programmed to move in a way that mimicked particular helicopter dynamics and the
participants’ task was to hover in front of real-world targets. Two helicopter sizes were simulated:
one that was light and agile and another that was heavy and inert. Participants were initially trained
on one of the two helicopters and subsequently their performance was tested when flying the second
helicopter. This method was used to reveal the novice pilots’ ability to transfer the general flight
skills they learned on one system, to another system with different dynamics. The results indicated
that participants were able to effectively transfer the skills obtained when training in the light heli-
copter to the heavy helicopter, whereas the opposite was not true. Understanding these transfer-of-
training effects is important for assessing the effectiveness of both training in simulators and
flying in actual aircraft, and also for understanding the subtle differences between flying familiar
and unfamiliar aircraft, something almost all pilots face at one time or another.
Another applied area that would benefit greatly from understanding multisensory self-motion
perception is the diagnosis and rehabilitative treatment of those with disabling injury or illness. A
significant percentage of the population suffers from the locomotor consequences of Parkinson’s
disease, stroke, acquired brain injuries, and other age-related conditions. Oftentimes rehabilitation
therapies consist of passive range of motion tasks (through therapist manipulation or via robotic-
assisted walking) or self-initiated repetitive action tasks. In the case of lower-limb sensory–motor
disabilities, one rehabilitative technique is to have patients walk on a treadmill as a way of actively
facilitating and promoting the movements required for locomotion. The focus of such techniques,
however, is exclusively on the motor system, with very little consideration given to the multimodal
nature of locomotion. In fact, treadmill walking actually causes a conflict between proprioceptive
information, which specifies that the person is moving, and visual information, which indicates a
complete lack of self-motion.
Considering that one of the key factors in the successful learning or relearning of motor behaviors
is feedback, a natural source of feedback can be provided by the visual flow information obtained
during walking. As such, incorporating visual feedback into rehabilitative treadmill walking thera-
pies could prove to be of great importance. Actively moving within a VE is also likely to be highly
rewarding for individuals lacking stable mobility and thus may increase levels of motivation in addi-
tion to recalibrating the perceptual–motor information.
Although some work has been done to evaluate multimodal effects in upper-limb movement
recovery, this is not something that has been investigated as thoroughly for full body locomotor
behavior such as walking. One group that has evaluated such questions is Fung et al. (2006), who
used a self-paced treadmill mounted on a small motion platform and coupled with a projection
display as a way of adapting gait behavior in stroke patients. They found that, by training with
this multimodal system, patients showed clear locomotor improvements such as increases in gait
speed and the ability to more flexibly adapt their gait when faced with changes in ground terrain.
Rehabilitation research and treatment programs can benefit greatly from the flexibility, safety, and
high level of control offered by VR and simulator systems. As such, technologies that offer multi-
modal stimulation and control are expected to have a major impact in the future [e.g., see the Toronto
Rehabilitation Institute's Challenging Environment Assessment Laboratory (CEAL);
http://www.cealidapt.com].

30.6  SUMMARY
This chapter has emphasized the closed-loop nature of human locomotor behavior by evaluating
studies that preserve the coupling between perception and action during self-motion percep-
tion. This combined cue approach to understanding full body movements through space offers
unique insights into multisensory processes as they occur over space and time. Future work in
this area should aim to define the principles underlying human perceptual and cognitive pro-
cesses in the context of realistic sensory information. Using simulation techniques also allows
for a reciprocally informative approach of using VR as a useful tool for understanding basic
science questions related to the human observer in action, while also utilizing the results of this
research to provide informed methods of improving VR technologies. As such, the crosstalk
between applied fields and basic science research approaches should be strongly encouraged and
facilitated.

ACKNOWLEDGMENTS
We would like to thank members of the MPI Cyberneum group, past and present
(http://www.cyberneum.com/People.html), Jan Souman, John Butler, and Ilja Frissen for fruitful discussions.
We also thank Simon Musall for invaluable assistance and two anonymous reviewers for helpful
comments.

REFERENCES
Akyüz, A. O., R. W. Fleming, B. E. Riecke, E. Reinhard, and H. H. Bülthoff. 2007. Do HDR displays support
LDR content: A psychophysical evaluation. ACM Trans Graphics 26(3:38): 1–7.
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol
14: 257–262.
Allen, G. L., K. C. Kirasic, M. A. Rashotte, and D. B. M. Haun. 2004. Aging and path integration skill:
Kinesthetic and vestibular contributions to wayfinding. Percept Psychophys 66(1): 170–179.
Bakker, N. H., P. J. Werkhoven, and P. O. Passenier. 1999. The effects of proprioceptive and visual feedback on
geographical orientation in virtual environments. Pres Teleop Virtual Environ 8: 36–53.
Banton, T., J. Stefanucci, F. Durgin, A. Fass, and D. Proffitt. 2005. The perception of walking speed in a virtual
environment. Pres Teleop Virtual Environ 14(4): 394–406.
Berger, D. R., J. Schulte-Pelkum, and H. H. Bülthoff. 2010. Simulating believable forward accelerations on a
Stewart motion platform. ACM Trans Appl Percept 7(1:5): 1–27.
Berger, D. R., and H. H. Bülthoff. 2009. The role of attention on the integration of visual and inertial cues. Exp
Brain Res 198(2–3): 287–300.
Berthoz, A., I. Israël, P. Georges-François, R. Grasso, and T. Tsuzuku. 1995. Spatial memory of body linear
displacement: What is being stored? Science 269: 95–98.
Bertin, R. J. V., and A. Berthoz. 2004. Visuo-Vestibular interaction in the reconstruction of travelled trajecto-
ries. Exp Brain Res 154: 11–21.
Beykirch, K., F. M. Nieuwenhuizen, H. J. Teufel, H.-G. Nusseck, J. S. Butler, and  H. H. Bülthoff. 2007.
Control of a lateral helicopter sidestep maneuver on an anthropomorphic robot. Proceedings of the AIAA
Modeling and Simulation Technologies Conference and Exhibit, 1–8. American Institute of Aeronautics
and Astronautics, Reston, VA, USA.
Blake, A., H. H. Bülthoff, and D. Sheinberg. 1993. Shape from texture: Ideal observers and human psychophys-
ics. Vis Res 33: 1723–1737.
Bremmer, F., and M. Lappe. 1999. The use of optical velocities for distance discrimination and reproduction
during visually simulated self motion. Exp Brain Res 127: 33–42.
Bresciani, J.-P., F. Dammeier, and M. O. Ernst, 2008. Trimodal integration of visual, tactile and auditory signals
for the perception of sequences of events. Brain Res Bull 75(6): 753–760.
Bülthoff, H. H., and H. A. Mallot. 1988. Integration of depth modules: Stereo and shading. J Opt Soc Am 5:
1749–1758.
Bülthoff, H. H., and H.-J. van Veen. 2001. Vision and action in virtual environments: Modern psychophysics.
In Spatial cognition research. Vision and attention, ed. M. L. Jenkin and L. Harris, 233–252. New York:
Springer Verlag.
Bülthoff, H. H., and A. Yuille. 1991. Bayesian models for seeing shapes and depth. Comments Theor Biol 2(4):
283–314.
Bülthoff, H. H., and A. L. Yuille. 1996. A Bayesian framework for the integration of visual modules. In Attention
and Performance XVI: Information Integration in Perception and Communication, ed. J. McClelland and
T. Inui, 49–70. Cambridge, MA: MIT Press.
Butler, J. S., S. T. Smith, J. L. Campos, and H. H. Bülthoff. 2010. Bayesian integration of visual and vestibular
signals for heading. J Vis 10(11): 23, 1–13.
Campos, J. L., J. S. Butler, B. Mohler, and H. H. Bülthoff. 2007b. The contributions of visual flow and locomo-
tor cues to walked distance estimation in a virtual environment. Appl Percept Graphics Vis 4: 146.
Campos, J. L., P. Byrne, and H.-J. Sun. 2010. Body-based cues trump vision when estimating walked distance.
Eur J Neurosci 31: 1889–1898.
Campos, J. L., H.-G. Nusseck, C. Wallraven, B. J. Mohler, and  H. H. Bülthoff. 2007a. Visualization and
(mis) perceptions in virtual reality. Tagungsband 10. Proceedings of Workshop Sichtsysteme, ed. R. Möller
and R. Shaker, 10–14. Aachen, Germany.
Campos, J. L., J. Siegle, B. J. Mohler, H. H. Bülthoff and J. M. Loomis. 2009. Imagined self-motion dif-
fers from perceived self-motion: Evidence from a novel continuous pointing method. PLoS ONE 4(11):
e7793. doi:10.1371/journal.pone.0007793.
Chance, S. S., F. Gaunet, A. C. Beall, and J. M. Loomis. 1998. Locomotion mode affects the updating of objects
encountered during travel: The contribution of vestibular and proprioceptive inputs to path integration.
Pres Teleop Virtual Environ 7(2): 168–178.
Cheng, K., S. Shettleworth, J. Huttenlocher, and J. J. Rieser. 2007. Bayesian integrating of spatial information.
Psychol Bull 133(4): 625–637.
Cheung, A., S. Zhang, C. Stricker, and M. V. Srinivasan. 2007. Animal navigation: The difficulty of moving in
a straight line. Biol Cybern 97: 47–61.
Christensen, R., J. M. Hollerbach, Y. Xu, and S. Meek. 2000. Inertial force feedback for the Treadport locomo-
tion interface. Pres Teleop Virtual Environ 9: 1–14.
Creem-Regehr, S. H., P. Willemsen, A. A. Gooch, and W. B. Thompson. 2005. The influence of restricted
viewing conditions on egocentric distance perception: Implications for real and virtual environments.
Perception 34(2): 191–204.
Cruz-Neira, C., T. A. Sandin, and R. V. DeFantini. 1993. Surround screen projection-based virtual reality: The
design and implementation of the cave. Proc SIGGRAPH, 135–142.
Darken, R. P., W. R. Cockayne, and D. Carmein. 1997. The omni-directional treadmill: A locomotion device
for virtual worlds. Proceedings of the ACM User Interface Software and Technology, Banff, Canada,
October 14–17, 213–221.
Dichgans, J., and T. Brandt. 1978. Visual–vestibular interaction: Effects on self-motion perception and postural
control. In Perception, Vol. VIII, Handbook of sensory physiology, ed. R. Held, H. W. Leibowitz, and H.
L. Teuber, 756–804. Berlin: Springer.
Durgin, F. H., A. Pelah, L. F. Fox et al. 2005. Self-motion perception during locomotor recalibration: More than
meets the eye. J Exp Psychol Hum Percept Perform 31: 398–419.
Durgin, F. H., M. Akagi, C. R. Gallistel, and W. Haiken. 2009. The precision of locomotor odometry in humans.
Exp Brain Res 193(3): 429–436.
D’Zmura, M., P. Colantoni, and G. Seyranian. 2000. Virtual environments with four or more spatial dimen-
sions. Pres Teleop Virtual Environ 9(6): 616–631.
Ellard, C. G., and S. C. Shaughnessy. 2003. A comparison of visual and non-visual sensory inputs to walked
distance in a blind-walking task. Perception 32(5): 567–578.
Elliott, D. 1986. Continuous visual information may be important after all: A failure to replicate Thomson. J
Exp Psychol Hum Percept Perform 12: 388–391.
Engel, D., C. Curio, L. Tcheang, B. J. Mohler, and H. H. Bülthoff. 2008. A psychophysically calibrated control-
ler for navigating through large environments in a limited free-walking space. In Proceedings of the 2008
ACM Symposium on Virtual Reality Software and Technology, ed. S. Feiner, D. Thalmann, P. Guitton, B.
Fröhlich, E. Kruijff, M. Hachet, 157–164. New York: ACM Press.
Ernst, M. O., and M. S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415: 429–433.
Ernst, M. O., and H. H. Bülthoff. 2004. Merging the senses into a robust percept. Trends Cogn Sci 8: 162–169.
Fetsch, C. R., A. H. Turner, G. C. DeAngelis, and D. E. Angelaki. 2009. Dynamic reweighting of visual and
vestibular cues during self-motion perception. J Neurosci 29(49): 15601–15612.
Frenz, H., F. Bremmer, and M. Lappe. 2003. Discrimination of travel distances from “situated” optic flow. Vis
Res 43(20): 2173–2183.
Frenz, H., and M. Lappe. 2005. Absolute travel distance from optic flow. Vis Res 45(13): 1679–1692.
Fukusima, S. S., J. M. Loomis, and J. A. DaSilva. 1997. Visual perception of egocentric distance as assessed by
triangulation. J Exp Psychol Hum Percept Perform 23: 86–100.
Fung, J., C. L. Richards, F. Malouin, B. J. McFadyen, and A. Lamontagne. 2006. A treadmill and motion
coupled virtual reality system for gait training post-stroke. Cyberpsych Behav 9(2): 157–162.
Gu, Y., D. E. Angelaki, and G. C. Deangelis. 2008. Neural correlates of multisensory cue integration in macaque
MSTd. Nat Neurosci 11(10): 1201–1210.
Guerin, P., and B. G. Bardy. 2008. Optical modulation of locomotion and energy expenditure at preferred tran-
sition speed. Exp Brain Res 189: 393–402.
Harris, L. R., M. Jenkin, and D. C. Zikovitz. 2000. Visual and non-visual cues in the perception of linear self-
motion. Exp Brain Res 135: 12–21.
Hettinger, L. J. 2002. Illusory self-motion in virtual environments. In Handbook of virtual environments, ed.
K. M. Stanney, 471–492. Hillsdale, NJ: Lawrence Erlbaum.
Hollerbach, J. M., Y. Xu, R. Christensen, and S. C. Jacobsen. 2000. Design specifications for the second gen-
eration Sarcos Treadport locomotion interface. Haptics Symposium, Proc. ASME Dynamic Systems and
Control Division, DSC-Vol. 69-2, 1293–1298, Orlando, November.
Howard, I. P. 1986. The perception of posture, self-motion, and the visual vertical. In Sensory processes and
perception, Vol. I, Handbook of human perception and performance, ed. K. R. Boff, L. Kaufman, and
J. P. Thomas, 18.1–18.62, New York: Wiley.
Israël, I., and A. Berthoz. 1989. Contributions of the otoliths to the calculation of linear displacement.
J Neurophysiol 62(1): 247–263.
Israël, I., R. Grasso, P. Georges-Francois, T. Tsuzuku, and A. Berthoz. 1997. Spatial memory and path integration
studied by self-driven passive linear displacement: I. Basic properties. J Neurophysiol 77: 3180–3192.
Ivanenko, Y. P., R. Grasso, I. Israël, and A. Berthoz. 1997. The contributions of otoliths and semicircular
canals to the perception of two-dimensional passive whole-body motion in humans. J Physiol 502(1):
223–233.
Iwata, H. 1999. Walking about virtual environments on an infinite floor. IEEE Virtual Real 13–17, March.
Iwata, H., and Y. Yoshida. 1999. Path reproduction tests using a torus treadmill. Pres Teleop Virtual Environ
8(6): 587–597.
Jürgens, R., and W. Becker. 2006. Perception of angular displacement without landmarks: Evidence for
Bayesian fusion of vestibular, optokinetic, podokinesthetic, and cognitive information. Exp Brain Res
174(3): 528–543.
Jürgens, R., T. Boß, and W. Becker. 1999. Estimation of self-turning in the dark: Comparison between active
and passive rotation. Exp Brain Res 128: 491–504.
Kearns, M. J., W. H. Warren, A. P. Duchon, and M. J. Tarr. 2002. Path integration from optic flow and body
senses in a homing task. Perception 31: 349–374.
Kearns, M. J. 2003. The roles of vision and body senses in a homing task: The visual environment matters.
Unpublished doctoral thesis, Brown University.
Klatzky, R. L., J. M. Loomis, A. C. Beall, S. S. Chance, and R. G. Golledge. 1998. Spatial updating of self-
position and orientation during real, imagined, and virtual locomotion. Psychol Sci 9(4): 293–298.
Knapp, J. M., and J. M. Loomis. 2004. Limited field of view of head-mounted displays is not the cause of dis-
tance underestimation in virtual environments. Pres Teleop Virtual Environ 13(5): 572–577.
Knill, D. C., and J. A. Saunders. 2003. Do humans optimally integrate stereo and texture information for judg-
ments of surface slant? Vis Res 43: 2539–2558.
Kording, K. P., and D. M. Wolpert. 2004. Bayesian integration in sensorimotor learning. Nature 427(15):
244–247.
Larish, J. F., and J. M. Flach. 1990. Sources of optical information useful for perception of speed of rectilinear
self-motion. J Exp Psychol Hum Percept Perform 16: 295–302.
Lathrop, W. B., and M. K. Kaiser. 2002. Perceived orientation in physical and virtual environments: Changes
in perceived orientation as a function of idiothetic information available. Pres Teleop Virtual Environ
11(1): 19–32.
Laurens, J., and J. Droulez. 2007. Bayesian processing of vestibular information. Biol Cybern 96: 389–404.
Lee, D. N. 1976. Theory of visual control of braking based on information about time-to-collision. Perception
5(4): 437–459.
Lee, D. N., and E. Aronson. 1974. Visual proprioceptive control of standing in human infants. Percept
Psychophys 15(3): 529–532.
Lehmann, A., M. Vidal, and H. H. Bülthoff. 2008. A high-end virtual reality setup for the study of mental rota-
tions. Pres Teleop Virtual Environ 17(4): 365–375.
Lestienne, F., J. Soechting, and A. Berthoz. 1977. Postural readjustments induced by linear vection of visual
scenes. Exp Brain Res 28(3–4): 363–384.
Loomis, J. M., J. A. Da Silva, N. Fujita, and S. S. Fukusima. 1992. Visual space perception and visually directed
action. J Exp Psychol Hum Percept Perform 18: 906– 921.
Loomis, J. M., J. J. Blascovich, and A. C. Beall. 1999. Immersive virtual environment technology as a basic
research tool in psychology. Behav Res Methods Instrum Comp 31(4): 557–564.
Loomis, J. M., and J. M. Knapp. 2003. Visual perception of egocentric distance in real and virtual environ-
ments. In Virtual and adaptive environments, ed. L. J. Hettinger and M. W. Haas, 21–46. Mahwah, NJ:
Erlbaum.
Loomis, J. M., and J. W. Philbeck. 2008. Measuring perception with spatial updating and action. In Embodiment,
ego-space and action, ed. R. L. Klatzky, M. Behrmann, and B. MacWhinney, 1–42.  Mahwah, NJ:
Erlbaum.
MacNeilage, P. R., M. S. Banks, D. R. Berger, and H. H. Bülthoff. 2007. A Bayesian model of the disambigu-
ation of gravitoinertial force by visual cues. Exp Brain Res 179: 263–290.
Meehan, M., S. Razzaque, B. Insko, M. Whitton, and F. P. Brooks. 2005. Review of four studies on the use of
physiological reaction as a measure of presence in stressful Virtual Environments. Appl Psychophysiol
Biofeedback 30(3): 239–258.
Meilinger, T., B. E. Riecke, and H. H. Bülthoff. 2007. Orientation specificity in long-term memory for envi-
ronmental spaces. Proceedings of the Cognitive Sciences Society, Nashville, Tennessee, USA, August
1–4, 479–484.
Meilinger, T., M. Knauff, and H. H. Bülthoff. 2008. Working memory in wayfinding: A dual task experiment in
a virtual city. Proc Cog Sci 32(4): 755–770.
Mittelstaedt, M. L., and S. Glasauer. 1991. Idiothetic navigation in gerbils and humans. Zool J Physiol 95:
427–435.
Mittelstaedt, M. L., and H. Mittelstaedt. 1996. The influence of otoliths and somatic graviceptors on angular
velocity estimation. J Vestib Res 6(5): 355–366.
Mittelstaedt, M. L., and H. Mittelstaedt. 2001. Idiothetic navigation in humans: Estimation of path length. Exp
Brain Res 13: 318–332.
Mohler, B. J., J. L. Campos, M. Weyel, and H. H. Bülthoff. 2007a. Gait parameters while walking in a head-
mounted display virtual environment and the real world. Proc Eurographics, 85–88.
Mohler, B. J., W. B. Thompson, S. H. Creem-Regehr, H. L. Pick, and W. H. Warren. 2007b. Visual flow influ-
ences gait transition speed and preferred walking speed. Exp Brain Res 181(2): 221–228.
Mohler, B. J., W. B. Thompson, S. H. Creem-Regehr, P. Willemsen, H. L. Pick, and  J. J. Rieser. 2007c.
Calibration of locomotion due to visual motion in a treadmill-based virtual environment. ACM Trans
Appl Percept 4(1): 20–32.
Müller, P., P. Wonka, S. Haegler, A. Ulmer, and L. Van Gool. 2006. Procedural modeling of buildings. Proc
ACM SIGGRAPH 2006/ACM Transactions on Graphics (TOG), 25(3): 614–623. New York: ACM Press.
Nico, D., I. Israël, and A. Berthoz. 2002. Interaction of visual and idiothetic information in a path completion
task. Exp Brain Res 146: 379–382.
Nusseck, H.-G., H. J. Teufel, F. M. Nieuwenhuizen, and  H. H. Bülthoff. 2008. Learning system dynamics:
Transfer of training in a helicopter hover simulator. Proc AIAA Modeling and Simulation Technologies
Conference and Exhibit, 1–11, AIAA, Reston, VA, USA.
Peck, T. C., M. C. Whitton, and H. Fuchs. 2008. Evaluation of reorientation techniques for walking in
large virtual environments. Proceedings of IEEE Virtual Reality, Reno, NV, 121–128. IEEE Computer
Society.
Pick, H. L., D. H. Warren, and J. C. Hay. 1969. Conflict in judgments of spatial direction. Percept Psychophys
6(4): 203.
Pick, H. L., D. Wagner, J. J. Rieser, and A. E. Garing. 1999. The recalibration of rotational locomotion. J Exp
Psychol Hum Percept Perform 25(5): 1179–1188.
Proffitt, D. R., J. Stefanucci, T. Banton, and W. Epstein. 2003. The role of effort in perceiving distance. Psychol
Sci 14(2): 106–112.
Prokop, T., M. Schubert, and W. Berger. 1997. Visual influence on human locomotion. Exp Brain Res 114:
63–70.
Razzaque, S., Z. Kohn, and M. Whitton. 2001. Redirected walking. Proceedings of Eurographics, 289–294.
Manchester, UK.
Razzaque, S., D. Swapp, M. Slater, M. C. Whitton, and A. Steed. 2002. Redirected walking in place. Proceedings
of Eurographics, 123–130.
Redlick, F. P., M. Jenkin, and L. R. Harris. 2001. Humans can use optic flow to estimate distance of travel. Vis
Res 41: 213–219.
Richardson, A. R., and D. Waller. 2005. The effect of feedback training on distance estimation in Virtual
Environments. Appl Cogn Psychol 19: 1089–1108.
Riecke, B. E., D. W. Cunningham, and H. H. Bülthoff. 2006. Spatial updating in virtual reality: The sufficiency
of visual information. Psychol Res 71(3): 298–313.
Riecke, B. E., H. A. H. C. van Veen, and H. H. Bülthoff, 2002. Visual homing is possible without landmarks—​
A path integration study in virtual reality. Pres Teleop Virtual Environ 11(5): 443–473.
Riecke, B. E., A. Väljamäe, and J. Schulte-Pelkum. 2009. Moving sounds enhance the visually-induced self-
motion illusion (circular vection) in Virtual Reality. ACM Trans Appl Percept 6(2): 1–27.
Rieser, J. J., D. H. Ashmead, C. R. Talor, and G. A. Youngquist. 1990. Visual perception and the guidance of
locomotion without vision to previously seen targets. Perception 19(5): 675–689.
Rieser, J. J., H. L. Pick, D. H. Ashmead, and A. E. Garing. 1995. Calibration of human locomotion and models
of perceptual motor organization. J Exp Psychol Hum Percept Perform 21(3): 480–497.
Ruddle, R. A., and S. Lessels. 2006. For efficient navigational search humans require full physical movement
but not a rich visual scene. Psychol Sci 17: 460–465.
Ruddle, R. A., and S. Lessels. 2009. The benefits of using a walking interface to navigate virtual environments.
ACM Trans Comput-Hum Interact 16(1): 1–18.
Rushton, S. K., J. M. Harris, and M. R. Lloyd. 1998. Guidance of locomotion on foot uses perceived target
location rather than optic flow. Curr Biol 8(21): 1191–1194.
Schnapp, B., and W. Warren. 2007. Wormholes in Virtual Reality: What spatial knowledge is learned for navi-
gation? J Vis 7(9): 758, 758a.
Seidman, S. H. 2008. Translational motion perception and vestiboocular responses in the absence of non-inertial
cues. Exp Brain Res 184: 13–29.
Sheik-Nainar, M. A., and D. B. Kaber. 2007. The utility of a Virtual Reality locomotion interface for studying
gait behavior. Hum Factors 49(4): 696–709.
Sholl, M. J. 1989. The relation between horizontality and rod-and-frame and vestibular navigational perfor-
mance. J Exp Psychol Learn Mem Cogn 15: 110–125.
Siegle, J., J. L. Campos, B. J. Mohler, J. M. Loomis, and H. H. Bülthoff. 2009. Measurement of instantaneous
perceived self-motion using continuous pointing. Exp Brain Res 195(3): 429–444.
Simons, D. J., and R. F. Wang. 1998. Perceiving real-world viewpoint changes. Psychol Sci 9: 315–320.
Souman, J. L., P. Robuffo Giordano, I. Frissen, A. De Luca, and M. O. Ernst. 2010. Making virtual walking real:
Perceptual evaluation of a new treadmill control algorithm. ACM Trans Appl Percept 7(2:11): 1–14.
Souman, J. L., I. Frissen, M. Sreenivasa, and M. O. Ernst. 2009. Walking straight into circles. Curr Biol 19(18):
1538–1542.
Sun, H.-J., A. J. Lee, J. L. Campos, G. S. W. Chan, and D. H. Zhang. 2003. Multisensory integration in speed
estimation during self-motion. Cyberpsychol Behav 6(5): 509–518.
Sun, H.-J., J. L. Campos, and G. S. W. Chan. 2004a. Multisensory integration in the estimation of relative path
length. Exp Brain Res 154(2): 246–254.
Sun, H.-J., J. L. Campos, G. S. W. Chan, M. Young, and C. Ellard. 2004b. The contributions of static visual
cues, nonvisual cues, and optic flow in distance estimation. Perception 33: 49–65.
Tarr, M. J., and W. H. Warren. 2002. Virtual reality in behavioral neuroscience and beyond. Nat Neurosci 5:
1089–1092.
Teufel, H. J., H.-G. Nusseck, K. A. Beykirch, J. S. Butler, M. Kerger, and H. H. Bülthoff. 2007. MPI Motion
Simulator: Development and analysis of a novel motion simulator. Proc AIAA Modeling and Simulation
Technologies Conference and Exhibit, 1–11, American Institute of Aeronautics and Astronautics, Reston,
VA, USA.
Thompson, W. B., P. Willemsen, A. A. Gooch, S. H. Creem-Regehr, J. M. Loomis, and A. C. Beall. 2004. Does
the quality of the computer graphics matter when judging distances in visually immersive environments?
Pres Teleop Virtual Environ 13(5): 560–571.
Thompson, W. B., S. H. Creem-Regehr, B. J. Mohler, and P. Willemsen. 2005. Investigations on the interac-
tions between vision and locomotion using a treadmill Virtual Environment. Proc. SPIE/IS&T Human
Vision & Electronic Imaging Conference, January.
Thomson, J. A. 1983. Is continuous visual monitoring necessary in visually guided locomotion? J Exp Psychol
Hum Percept Perform 9: 427–443.
Tristano, D., J. M. Hollerbach, and R. Christensen. 2000. Slope display on a locomotion interface. In
Experimental Robotics VI, ed. P. Corke and J. Trevelyan, 193–201. London: Springer-Verlag.
Väljamäe, A., P. Larsson, D. Västfjäll, and M. Kleiner. 2008. Sound representing self-motion in Virtual
Environments enhances linear vection. Pres Teleop Virtual Environ 17(1): 43–56.
Waller, D., J. M. Loomis, and D. B. M. Haun. 2004. Body-based senses enhance knowledge of directions in
large-scale environments. Psychon Bull Rev 11(1): 157–163.
Waller, D., J. M. Loomis, and S. D. Steck. 2003. Inertial cues do not enhance knowledge of environmental
layout. Psychon Bull Rev 10: 987–993.
Waller, D., E. Bachmann, E. Hodgson, and A. C. Beall. 2007. The HIVE: A Huge Immersive Virtual Environment
for research in spatial cognition. Behav Res Methods 39: 835–843.
Waller, D., and N. Greenauer. 2007. The role of body-based sensory information in the acquisition of enduring
spatial representations. Psychol Res 71(3): 322–332.
Waller, D., and A. R. Richardson. 2008. Correcting distance estimates by interacting with immersive virtual
environments: Effects of task and available sensory information. J Exp Psychol Appl 14(1): 61–72.
Warren, W. H., and D. J. Hannon. 1988. Direction of self-motion is perceived from optical flow. Nature 336:
162–163.
Warren, W. H., B. A. Kay, W. D. Zosh, A. P. Duchon, and S. Sahuc. 2001. Optic flow is used to control human
walking. Nat Neurosci 4: 213–216.
Warren, W. H., and K. J. Kurtz. 1992. The role of central and peripheral vision in perceiving the direction of
self-motion. Percept Psycho 51(5): 443–454.
Welch, R. B., and D. H. Warren. 1980. Immediate perceptual response to intersensory discrepancy. Psychol
Bull 88(3): 638–667.
Welchman, A. E., J. M. Lam, and H. H. Bülthoff. 2008. Bayesian motion estimation accounts for a surprising
bias in 3D vision. Proc Natl Acad Sci U S A 105(33): 12087–12092.
Wilkie, R. M., and J. P. Wann. 2005. The role of visual and nonvisual information in the control of locomotion.
J Exp Psychol Hum Percept Perform 31(5): 901–911.
Witmer, B. G., and P. B. Kline. 1998. Judging perceived and traversed distance in virtual environments. Pres
Teleop Virtual Environ 7: 144–167.
Wittlinger, M., R. Wehner, and H. Wolf. 2006. The ant odometer: Stepping on stilts and stumps. Science 312:
1965–1967.
Yong, N. A., G. D. Paige, and S. H. Seidman. 2007. Multiple sensory cues underlying the perception of transla-
tion and path. J Neurophysiol 97: 1100–1113.
31 Visual–Vestibular Integration
for Self-Motion Perception
Gregory C. DeAngelis and Dora E. Angelaki

CONTENTS
31.1 The Problem of Self-Motion Perception and the Utility of Visual–Vestibular
Integration.............................................................................................................................. 629
31.1.1 Optic Flow................................................................................................................. 630
31.1.2 Vestibular Signals...................................................................................................... 630
31.2 Potential Neural Substrates for Visual–Vestibular Integration.............................................. 631
31.3 Heading Tuning and Spatial Reference Frames in Area MSTd............................................ 633
31.3.1 Heading Tuning......................................................................................................... 633
31.3.2 Reference Frames...................................................................................................... 634
31.4 The Neuronal Combination Rule and Its Dependence on Cue Reliability............................ 636
31.5 Linking Neuronal and Perceptual Correlates of Multisensory Integration........................... 639
31.5.1 Behavioral Results.....................................................................................................640
31.5.2 Neurophysiological Results....................................................................................... 641
31.5.3 Correlations with Behavioral Choice......................................................................... 642
31.6 Conclusion.............................................................................................................................644
Acknowledgments........................................................................................................................... 645
References....................................................................................................................................... 645

31.1 THE PROBLEM OF SELF-MOTION PERCEPTION AND THE UTILITY OF VISUAL–VESTIBULAR INTEGRATION
How do we perceive our direction of self-motion through space? To navigate effectively through a
complex three-dimensional (3-D) environment, we must accurately estimate our own motion rela-
tive to objects around us. Self-motion perception is a demanding problem in sensory integration,
requiring the neural combination of visual signals (e.g., optic flow), vestibular signals regarding
head motion, and perhaps also somatosensory and proprioceptive cues (Hlavacka et al. 1992, 1996;
Dichgans and Brandt 1974). Consider a soccer player running downfield to intercept a pass and head
the ball toward the goal. This athlete must be able to accurately judge the trajectory of the ball rela-
tive to the trajectory of his/her self-motion, in order to precisely time his/her head thrust to meet the
ball. Optic flow and vestibular signals are likely the two most sensitive cues for judging self-motion
(Gu et al. 2007, 2008; Fetsch et al. 2009). To understand the need for multisensory integration of
these cues, it is useful to consider the strengths and weaknesses of each cue. Although self-motion
generally involves both translations and rotations of the observer, we shall limit the scope of this
review to translational movements, such that we focus on visual and vestibular cues that determine
our perceived direction of heading.


31.1.1  Optic Flow


It has long been recognized that visual cues provide a rich source of information about self-­motion
(Gibson 1950). As we move through the environment, the resulting pattern of full-field retinal
motion (optic flow) can be used to estimate heading. In the simplest case, involving an observer
with stationary eyes and head moving through a stationary scene, the location of the focus of radial
expansion in the optic flow field provides a direct indicator of heading. Many visual psychophysical
and theoretical studies have examined how heading can be computed from optic flow (see Warren
2003 for review). The notion that optic flow contributes to self-motion perception is further sup-
ported by the fact that optic flow, by itself, can elicit powerful illusions of self-motion. As early
as 1875, Ernst Mach described self-motion sensations (i.e., circular and linear vection) induced by
optic flow. Numerous studies have subsequently characterized the behavioral observation that large-
field optic flow stimulation induces self-motion perception (e.g., Berthoz et al. 1975; Brandt et al.
1973; Dichgans and Brandt 1978).
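As a concrete illustration of how heading can be read out from the focus of expansion, the short sketch
below simulates the instantaneous flow field produced by pure translation past a random cloud of points
(pinhole camera with unit focal length, no eye or head rotation) and recovers the focus of expansion by
least squares. All parameter values are arbitrary choices for the example, and the code is not drawn from
any of the studies cited here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Observer translation (arbitrary illustrative values). For pure translation,
# the focus of expansion (FOE) lies at (Tx/Tz, Ty/Tz) in image coordinates.
T = np.array([0.2, 0.1, 1.0])               # m/s
foe_true = T[:2] / T[2]

# Random scene points ahead of the observer, expressed as image positions
# (pinhole camera, focal length = 1) with associated depths.
n = 200
x = rng.uniform(-1.0, 1.0, n)
y = rng.uniform(-1.0, 1.0, n)
Z = rng.uniform(2.0, 10.0, n)               # depths in metres

# Instantaneous motion field for pure translation (no rotation):
#   u = (x*Tz - Tx)/Z,   v = (y*Tz - Ty)/Z
u = (x * T[2] - T[0]) / Z
v = (y * T[2] - T[1]) / Z

# Every flow vector lies on a line radiating from the FOE, so the FOE can be
# recovered as the least-squares intersection of those lines.
d = np.stack([u, v], axis=1)
d /= np.linalg.norm(d, axis=1, keepdims=True)    # unit flow directions
q = np.stack([x, y], axis=1)                     # points the lines pass through

A = np.zeros((2, 2))
b = np.zeros(2)
for di, qi in zip(d, q):
    P = np.eye(2) - np.outer(di, di)             # projects onto the line's normal
    A += P
    b += P @ qi
foe_est = np.linalg.solve(A, b)

print("true heading (FOE):     ", foe_true)
print("estimated heading (FOE):", foe_est)
```

The least-squares step simply finds the image point closest to all of the flow lines; rotational flow
components and independently moving objects, discussed next, would have to be removed or estimated
jointly before such a readout is valid.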
Interpretation of optic flow, however, becomes considerably complicated under more natural con-
ditions. Specifically, optic flow is substantially altered by movements of the eyes and head (Banks
et al. 1996; Crowell et al. 1998; Royden et al. 1992, 1994), and by motion of objects in the visual
field (Royden and Hildreth 1996; Gibson 1954; Warren and Saunders 1995). An extensive literature,
including studies cited above, has been devoted to perceptual mechanisms that compensate for eye
and/or head rotation during translational self-motion, making use of both retinal and extraretinal
signals (reviewed by Warren 2003). Perceptual compensation for eye and head movements is largely
successful, and is likely aided by the fact that the brain contains internal signals related to eye and
head movements (e.g., efference copy) that can be used to transform visual signals. The neural basis
of this compensation for eye and head movements has been explored considerably (Bradley et al.
1996; Page and Duffy 1999; Shenoy et al. 1999), although our understanding of these compensatory
mechanisms is far from complete.
Motion of objects in the world presents an even greater challenge to interpretation of optic
flow because the brain contains no internal signals related to object motion. In general, the brain
needs to solve a source separation problem because optic flow on the retina at any moment in time
includes two major components: flow resulting from self-motion along with the static 3-D structure
of the environment, and flow resulting from the movement of objects relative to the observer. Some
psychophysical studies have suggested that this source separation problem can be solved through
purely visual analysis of optic flow (Rushton and Warren 2005; Warren and Rushton 2007, 2008;
Matsumiya and Ando 2009), whereas other studies indicate that nonvisual signals may be essen-
tial for interpretation of optic flow in the presence of object motion (Wexler 2003; Wexler et al.
2001; Wexler and van Boxtel 2005). Although interactions between object and background motion
have been studied physiologically (Logan and Duffy 2006), the neural mechanisms that solve this
problem remain unclear. Vestibular signals may be of particular importance in dealing with object
motion because the vestibular system provides an independent source of information about head
movements that may help to identify optic flow that is inconsistent with self-motion (induced by
moving objects).

31.1.2  Vestibular Signals


The vestibular system provides a powerful independent source of information about head motion
in space. Specifically, vestibular sensors provide information about the angular rotation and linear
acceleration of the head in space (Angelaki 2004; Angelaki and Cullen 2008), and thus provide
important inputs to self-motion estimation. A role of the vestibular system in the perception of self-
motion has long been acknowledged (Guedry 1974, 1978; Benson et al. 1986; Telford et al. 1995).
With regard to heading perception, the limitations of optic flow processing might be overcome
by making use of inertial motion signals from the vestibular otolith organs (Benson et al. 1986;
Fernandez and Goldberg 1976a, 1976b; Guedry 1974). The otoliths behave much like linear accel-
erometers, and otolith afferents provide the basis for directional selectivity that could in principle
be used to guide heading judgments. Indeed, with a sensory organ that signals real inertial motion
of the head, one might ask why the nervous system should rely on visual information at all. Part
of the answer is that even a reliable linear accelerometer has shortcomings, such as the inability to
encode constant-velocity motion and the inability to distinguish between translation and tilt relative
to gravity (due to Einstein’s equivalence principle). The latter problem may be resolved using angu-
lar velocity signals from the semicircular canals (Angelaki et al. 1999, 2004; Merfeld et al. 1999),
but the properties of the canals render this strategy ineffective during low-frequency motion or
static tilts. In fact, in the absence of visual cues, linear acceleration is often misperceived as tilt (the
somatogravic illusion; Previc et al. 1992; Wolfe and Cramer 1970). This illusion can be quite dan-
gerous for aviators, who feel compelled to pitch the nose of their aircraft downward to compensate
for a nonexistent upward tilt, when in fact what they experienced was linear inertial acceleration.
In summary, both the visual and vestibular systems are limited in their ability to unambiguously
signal self-motion. A sensible approach for heading estimation would thus be to combine visual and
vestibular information to overcome the limitations of each modality on its own. As discussed fur-
ther below, this cross-modal integration can also improve perceptual discrimination of heading over
what is possible for each modality alone. Thus, we suggest that multisensory integration of visual
and vestibular inputs provides dual benefits: it overcomes important limitations of each sensory
system alone and it provides increased sensitivity when both systems are active.

31.2 POTENTIAL NEURAL SUBSTRATES FOR VISUAL–VESTIBULAR INTEGRATION
Where should one look in the brain to find neurons that integrate visual and vestibular signals for
self-motion perception? One possibility is to look in portions of “visual” cortex that are known to
carry selective responses to optic flow stimuli. Another possibility is to look in regions of “vestibu-
lar” cortex that may integrate otolith inputs with visual signals. Here, we briefly consider what is
known about each of these possibilities.
Optic flow–sensitive neurons have been found in the dorsal portion of the medial superior tem-
poral area (MSTd; Tanaka et al. 1986; Duffy and Wurtz 1991, 1995), ventral intraparietal area
(VIP; Bremmer et al. 2002a, 2002b; Schaafsma and Duysens 1996), posterior parietal cortex (7a;
Siegel and Read 1997), and the superior temporal polysensory area (STP; Anderson and Siegel
1999). Among these areas, MSTd and VIP (Figure 31.1) currently stand out as good candidates for

[Figure 31.1 image: partially inflated cortical surface with labeled areas VIP, MST, PIVC, 3a, and 2v; see caption below.]

FIGURE 31.1  (See color insert.) Illustration of some of the areas thought to be involved in processing of
visual and/or vestibular signals for self-motion perception (see text for details). A partially inflated surface
of cerebral cortex of a macaque monkey is shown. Colored regions indicate different functionally and ana-
tomically defined areas. MST, medial superior temporal; VIP, ventral intra-parietal; PIVC, parieto-insular
vestibular cortex.
integrating visual and vestibular signals to subserve heading perception because (1) they have large
receptive fields and selectivity for complex optic flow patterns that simulate self-motion (Duffy
and Wurtz 1991, 1995; Tanaka et al. 1986; Tanaka and Saito 1989; Schaafsma and Duysens 1996;
Bremmer et al. 2002a), (2) they show some compensation for shifts in the focus of expansion due to
pursuit eye movements (Bradley et al. 1996; Zhang et al. 2004; Page and Duffy 1999), and (3) they
have been causally linked to heading judgments based on optic flow in microstimulation studies
(Britten and van Wezel 1998, 2002; Zhang and Britten 2003). Perhaps most importantly, MSTd and
VIP also contain neurons sensitive to physical translation in darkness (Bremmer et al. 1999, 2002b;
Duffy 1998; Gu et al. 2006; Chen et al. 2007; Schlack et al. 2002; Takahashi et al. 2007; Chowdhury
et al. 2009). This suggests the presence of vestibular signals that may be useful for heading percep-
tion, and thus the potential for integration with optic flow signals.
In addition to regions conventionally considered to be largely visual in nature, there are sev-
eral potential loci within the vestibular system where otolith-driven signals regarding transla-
tion could be combined with optic flow signals. Putative visual–vestibular convergence has been
reported as early as one or two synapses from the vestibular periphery, in the brainstem vestibular
nuclei (Daunton and Thomsen 1979; Henn et al. 1974; Robinson 1977; Waespe and Henn 1977) and
vestibulo-­cerebellum (Markert et al. 1988; Waespe et al. 1981; Waespe and Henn 1981). However,
responses to visual (optokinetic) stimuli within these subcortical circuits are more likely related
to gaze stabilization and eye movements [optokinetic nystagmus (OKN), vestibulo-ocular reflex
(VOR), and/or smooth pursuit] rather than self-motion perception per se. This conclusion is sup-
ported by recent experiments (Bryan and Angelaki 2008) showing a lack of optic-flow responsive-
ness in the vestibular and deep cerebellar nuclei when animals were required to fixate a head-fixed
target (suppressing OKN).
At higher stages of vestibular processing, several interconnected cortical areas have tradition-
ally been recognized as “vestibular cortex” (Fukushima 1997; Guldin and Grusser 1998), and are
believed to receive multiple sensory inputs, including visual, vestibular, and somatosensory/pro­
prioceptive signals. Specifically, three main cortical areas (Figure 31.1) have been characterized as
either exhibiting responses to vestibular stimulation and/or receiving short-latency vestibular sig-
nals (trisynaptic through the vestibular nuclei and the thalamus). These include: (1) area 2v, located
in the transition zone of areas 2, 5, and 7 near the lateral tip of the intraparietal sulcus (Schwarz and
Fredrickson 1971a, 1971b; Fredrickson et al. 1966; Buttner and Buettner 1978); (2) the parietoinsu-
lar vestibular cortex (PIVC), located between the auditory and secondary somatosensory cortices
(Grusser et al. 1990a, 1990b); and (3) area 3a, located within the central sulcus extending into the
anterior bank of the precentral gyrus (Odkvist et al. 1974; Guldin et al. 1992). In addition to show-
ing vestibular responsiveness, neurons in PIVC (Grusser et al. 1990b) and 2v (Buttner and Buettner
1978) were reported to show an influence of visual/optokinetic stimulation, similar to subcortical
structures. However, these studies did not conclusively demonstrate that neurons in any of these
areas provide robust information about self-motion from optic flow. Indeed, we have recently shown
that PIVC neurons generally do not respond to brief (2-second) optic flow stimuli with a Gaussian
velocity profile (Chen et al. 2010), whereas these same visual stimuli elicit very robust directional
responses in areas MSTd and VIP (Gu et al. 2006; Chen et al. 2007). Thus far, we also have not
encountered robust optic flow selectivity in area 2v (unpublished observations).
In summary, the full repertoire of brain regions that carry robust signals related to both optic flow
and inertial motion remains to be further elaborated, and other areas that serve as important players
in multisensory integration for self-motion perception may yet emerge. However, two aspects of the
available data are fairly clear. First, extrastriate areas MSTd and VIP contain robust representations
of self-motion direction based on both visual and vestibular cues. Second, traditional vestibular cor-
tical areas (PIVC, 2v) do not appear to have sufficiently robust responses to optic flow to be serious
candidates for the neural basis of multimodal heading perception. In the remainder of this review,
we shall therefore focus on what is known about visual–vestibular integration in area MSTd, as this
area has been best studied so far.

31.3  HEADING TUNING AND SPATIAL REFERENCE FRAMES IN AREA MSTd


31.3.1  Heading Tuning
The discovery of vestibular translation responses in MSTd, first reported by Duffy (1998), was sur-
prising because this area is traditionally considered part of the extrastriate visual cortex. The results
of Duffy’s groundbreaking study revealed a wide variety of visual–vestibular interactions in MSTd,
including enhancement and suppression of responses relative to single-cue conditions, as well as
changes in cells’ preferred direction with anticongruent stimulation.
Building upon Duffy’s findings, we used a custom-built virtual reality system (Figure 31.2a)
to examine the spatial tuning of MSTd neurons in three dimensions (Figure 31.2b), making use of
stimuli with a Gaussian stimulus velocity profile (Figure 31.2c) that is well suited to activating the
otolith organs (Gu et al. 2006; Takahashi et al. 2007).

FIGURE 31.2  (a–c) Apparatus and stimuli used to examine visual–vestibular interactions in rhesus mon-
keys. (a) 3-D virtual reality system, (b) heading trajectories, and (c) velocity and acceleration profiles used by
Gu et al. (2006). 3-D heading tuning functions of two example MSTd neurons: (d) a “congruent cell” and (e) an
“opposite” cell. Firing rate (grayscale) is plotted as a function of azimuth (abscissa) and elevation (ordinate)
of heading trajectory. For each cell, tuning was measured in three stimulus conditions: vestibular (inertial
motion only), visual (optic flow only), and combined visual–vestibular stimulation. (Adapted from Gu, Y. et
al., J. Neurosci., 26, 73–85, 2006.)

Heading tuning was measured under three
stimulus conditions: visual only, vestibular only, and a combined condition in which the stimulus
contained precisely synchronized optic flow and inertial motion. We found that about 60% of MSTd
neurons show significant directional tuning for both visual and vestibular heading cues. MSTd neu-
rons showed a wide variety of heading preferences, with individual neurons being tuned to virtually
all possible directions of translation in 3-D space. Notably, however, there was a strong bias for
MSTd neurons to respond best to lateral motions within the frontoparallel plane (i.e., left/right and
up/down), with relatively few neurons preferring fore–aft directions of motion. This was true for
both visual and vestibular tuning separately (Gu et al. 2006, 2010).
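As a concrete illustration of this class of stimulus, the short Python sketch below builds a 2-second Gaussian velocity profile and differentiates it to obtain the biphasic acceleration that drives the otolith organs. The peak velocity and profile width used here are assumed, illustrative values, not the exact parameters of Gu et al. (2006).

import numpy as np

def gaussian_velocity_profile(duration=2.0, peak_velocity=0.3, n_samples=2000):
    """Translational stimulus with a Gaussian velocity profile.

    Returns time, velocity, and acceleration traces. The 2-s duration matches
    the text; the peak velocity and the width (set so the profile is ~0 at
    stimulus onset and offset) are illustrative assumptions.
    """
    t = np.linspace(0.0, duration, n_samples)
    mu = duration / 2.0                 # velocity peaks at mid-stimulus
    sigma = duration / 6.0              # +/- 3 sigma spans the stimulus
    velocity = peak_velocity * np.exp(-0.5 * ((t - mu) / sigma) ** 2)
    acceleration = np.gradient(velocity, t)   # biphasic profile that drives the otolith organs
    return t, velocity, acceleration

t, v, a = gaussian_velocity_profile()
print(f"peak velocity = {v.max():.2f} m/s, peak |acceleration| = {abs(a).max():.2f} m/s^2")
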
Interestingly, MSTd neurons seemed to fall into one of two categories based on their relative
preferences for heading defined by visual and vestibular cues. For congruent cells, the visual and
vestibular heading preferences are closely matched, as illustrated by the example neuron shown in
Figure 31.2d. This neuron preferred rightward motion of the head in both the visual and vestibu-
lar conditions. In contrast, opposite cells have visual and vestibular heading preferences that are
roughly 180° apart (Gu et al. 2006). For example, the opposite cell in Figure 31.2e prefers rightward
and slightly upward motion in the vestibular condition, but prefers leftward and slightly downward
translation in the visual condition. For this neuron, responses in the combined stimulus condition
(Figure 31.2e, right panel) were very similar to those elicited by optic flow in the visual condition.
This pattern of results was common in the study of Gu et al. (2006). However, as discussed further
below, this apparent visual dominance arose because high-coherence visual stimuli were used. We
shall consider this issue in considerably more detail in the next section.
The responses of MSTd neurons to translation in the vestibular condition were found to be very
similar when responses were recorded during translation in complete darkness (as opposed to during
viewing of a fixation target on a dim background), suggesting that spatial tuning seen in the vestibu-
lar condition (e.g., Figure 31.2d, e) was indeed of labyrinthine origin (Gu et al. 2006; Chowdhury et
al. 2009). To verify this, we examined the responses of MSTd neurons after a bilateral labyrinthec-
tomy. After the lesion, MSTd neurons did not give significant responses in the vestibular condition,
and spatial tuning was completely abolished (Gu et al. 2007; Takahashi et al. 2007). Thus, responses
observed in MSTd during the vestibular condition arise from otolith-driven input.

31.3.2 Reference Frames


Given that neurons in MSTd show spatial tuning for both visual and vestibular inputs, a natural
question arises regarding the spatial reference frames of these signals. Vestibular signals regarding
translation must initially be coded by the otolith afferents in head-centered coordinates, because
the vestibular organs are fixed in the head. In contrast, visual motion signals must initially be
coded in retinal (eye-centered) coordinates. Since these two signals arise in different spatial frames
of reference, how are they coded when they are integrated by MSTd neurons? Some researchers
have suggested that signals from different sensory systems should be expressed in a common refer-
ence frame when they are integrated (Groh 2001). On the other hand, computational models show
that neurons can have mixed and intermediate reference frames while still allowing signals to be
decoded accurately (Deneve et al. 2001; Avillac et al. 2005).
To investigate this issue, we tested whether visual and vestibular heading signals in MSTd share
a common reference frame (Fetsch et al. 2007). To decouple head-centered and eye-centered coor-
dinates, we measured visual and vestibular heading tuning while monkeys fixated on one of three
target locations: straight ahead, 20–25° to the right, and 20–25° to the left. If heading is coded in
eye-centered coordinates, the heading preference of the neuron should shift horizontally (in azi-
muth) by the same amount as the gaze is deviated from straight ahead. If heading is coded in
head-centered coordinates, then the heading preference should remain constant as a function of eye
position.

Figure 31.3a shows the effect of eye position on the vestibular heading preference of an MSTd
neuron. In this case, heading preference (small white circles connected by dashed line) remains
quite constant as eye position varies, indicating head-centered tuning. Figure 31.3b shows the effect
of eye position on the visual heading tuning of another MSTd neuron. Here, the heading prefer-
ence clearly shifts with eye position, such that the cell signals heading in an eye-centered frame of
reference. A cross-correlation technique was used to measure the amount of shift of the heading
preference relative to the change in eye position. This yields a metric, the displacement index, which
will be 0.0 for head-centered tuning and 1.0 for eye-centered tuning.

FIGURE 31.3  Reference frames of visual and vestibular heading signals in MSTd. Tuning functions are
plotted for two example cells in (a) vestibular and (b) visual conditions, measured separately at three static eye
positions along horizontal meridian: −20° (top), 0° (middle), and +20° (bottom). Dashed white line connects
preferred heading in each case, to illustrate horizontal shift (or lack thereof) of tuning function across eye
positions. (c) Histogram of displacement index (DI) values for MSTd neurons tested in vestibular (black bars)
and visual (gray bars) conditions. DI is defined as angular shift of the tuning function normalized by change
in eye position; thus a value of 0 indicates a head- (or body-) centered reference frame and 1 indicates an eye-
centered frame. (d) Binned average DI values for three stimulus conditions (vestibular, visual, combined) as a
function of relative strength of visual and vestibular single-cue tuning (visual/vestibular ratio). (Adapted from
Fetsch, C.R. et al., J. Neurosci., 27, 700–712, 2007.)

As shown in Figure 31.3c, we
found that visual heading tuning was close to eye-centered, with a median displacement index of
0.89. In contrast, vestibular heading tuning was found to be close to head-centered, with a median
displacement index of 0.24. This value for the vestibular condition was significantly larger than 0.0,
indicating that vestibular heading tuning was slightly shifted toward eye-centered coordinates.
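The displacement index computation can be summarized in a few lines of code. The sketch below is a simplified stand-in for the cross-correlation analysis of Fetsch et al. (2007): it finds the circular shift that best aligns tuning curves measured at two gaze positions and normalizes that shift by the change in eye position. The synthetic tuning curves and the 40° gaze separation are illustrative assumptions.

import numpy as np

def displacement_index(tuning_at_left_gaze, tuning_at_right_gaze,
                       azimuth_step_deg=45.0, gaze_shift_deg=40.0):
    """Displacement index (DI) from tuning curves measured at two gaze positions.

    The tuning-curve shift is taken as the circular lag (in azimuth) that
    maximizes the correlation between the two curves; DI = shift / gaze shift,
    so DI ~ 0 indicates head-centered and DI ~ 1 indicates eye-centered tuning.
    A simplified stand-in for the analysis of Fetsch et al. (2007).
    """
    left = np.asarray(tuning_at_left_gaze, dtype=float)
    right = np.asarray(tuning_at_right_gaze, dtype=float)
    best_lag, best_r = 0, -np.inf
    for lag in range(len(left)):
        r = np.corrcoef(np.roll(left, lag), right)[0, 1]
        if r > best_r:
            best_r, best_lag = r, lag
    shift_deg = best_lag * azimuth_step_deg
    if shift_deg > 180.0:                # express the shift in the range (-180, 180]
        shift_deg -= 360.0
    return shift_deg / gaze_shift_deg

azimuths = np.arange(0.0, 360.0, 45.0)
head_centered = 20 + 15 * np.cos(np.deg2rad(azimuths))   # same preferred heading at both gaze positions
eye_centered = np.roll(head_centered, 1)                 # tuning shifted by one 45-deg step
print(displacement_index(head_centered, head_centered))  # ~0.0 (head-centered)
print(displacement_index(head_centered, eye_centered))   # ~45/40, i.e., ~1.1 (eye-centered)
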
These data show that visual and vestibular signals in MSTd are not expressed in a common refer-
ence frame. By conventional thinking, this might cast doubt on the ability of this area to perform
sensory integration for heading perception. However, computational modeling suggests that sensory
signals need not explicitly occupy a common reference frame for integration to occur (Avillac et
al. 2005; Fetsch et al. 2007; Deneve et al. 2001). Moreover, as we will see in a later section, MSTd
neurons can account for improved behavioral sensitivity under cue combination. Thus, the conven-
tional and intuitive notion that sensory signals need to be expressed in a common reference frame
for multisensory integration to occur may need to be discarded.
The results of the study by Fetsch et al. (2007) also provide another challenge to conventional
ideas regarding multisensory integration and reference frames. To our knowledge, all previous stud-
ies on reference frames of sensory signals have only examined responses during unisensory stimu-
lation. Also relevant is the reference frame exhibited by neurons during combined, multimodal
stimulation, and how this reference frame depends on the relative strengths of responses to the
two sensory modalities. To examine this issue, Fetsch et al. (2007) measured the reference frame
of activity during the combined (visual–vestibular) condition, as well as the unimodal conditions.
Average displacement index values were computed as a function of the relative strength of unimodal
visual and vestibular responses [visual/vestibular ratio (VVR)]. For the visual (circles) and vestibu-
lar (squares) conditions, the average displacement index did not systematically depend on VVR
(Figure 31.3d), indicating that the reference frame in the unimodal conditions was largely indepen-
dent of the relative strengths of visual and vestibular inputs to the neuron under study. In contrast,
for the combined condition (diamonds), the average displacement index changed considerably as a
function of VVR, such that the reference frame of combined responses was more head-centered for
neurons with low VVR and more eye-centered for neurons with high VVR (Figure 31.3d). Thus, the
reference frame of responses to multimodal stimuli can vary as a function of the relative strengths
of the visual and vestibular inputs. This has potentially important implications for understanding
how multisensory responses are decoded, and deserves further study.

31.4  THE NEURONAL COMBINATION RULE AND ITS DEPENDENCE ON CUE RELIABILITY
An issue of great interest in multisensory integration has been the manner in which neurons com-
bine their unimodal sensory inputs. Specifically, how is the response to a bimodal stimulus related
to the responses to the unimodal components presented separately? Traditionally, this issue has
been examined by computing one of two metrics: (1) a multisensory enhancement index, which
compares the bimodal response to the largest unimodal response, and (2) an additivity index, which
compares the bimodal response to the sum of the unimodal responses (Stein and Stanford 2008).
In classic studies of visual–auditory integration in the superior colliculus (Stein and Meredith
1993), bimodal responses were often found to be superadditive (larger than the sum of the unimodal
responses) and this was taken as evidence for a nonlinear cue combination rule such as multiplica-
tion (Meredith and Stein 1983, 1986). In contrast, a variety of studies of multisensory integration
in cortical areas have reported subadditive interactions (Avillac et al. 2007; Morgan et al. 2008;
Sugihara et al. 2006). Some of this variation is likely accounted for by variations in the efficacy
of unimodal stimuli, as recent studies in the superior colliculus have demonstrated that superaddi-
tive interactions become additive or even subadditive as the strength of unimodal stimuli increases
(Perrault et al. 2003, 2005; Stanford et al. 2005).

Although many studies have measured additivity and/or enhancement of multisensory responses,
there has been a surprising lack of studies that have directly attempted to measure the mathemati-
cal rule by which multisensory neurons combine their unimodal inputs (hereafter the “combination
rule”). Measuring additivity (or enhancement) for a limited set of stimuli is not sufficient to charac-
terize the combination rule. To illustrate this point, consider a hypothetical neuron whose bimodal
response is the product (multiplication) of its unimodal inputs. The response of this neuron could
appear to be subadditive (e.g., 2 × 1 = 2), additive (2 × 2 = 4), or superadditive (2 × 3 = 6) depend-
ing on the magnitudes of the two inputs to the neuron. Thus, to estimate the combination rule, it is
essential to examine responses to a wide range of stimulus variations in both unimodal domains.
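To make these metrics and the multiplicative example concrete, the following Python sketch computes an enhancement index and an additivity index for hypothetical response values. The particular formulas are common conventions rather than the exact definitions used in any single study.

def enhancement_index(bimodal, unimodal_1, unimodal_2):
    """Percent enhancement of the bimodal response relative to the largest
    unimodal response (one common formulation of the enhancement index)."""
    best_unimodal = max(unimodal_1, unimodal_2)
    return 100.0 * (bimodal - best_unimodal) / best_unimodal

def additivity_index(bimodal, unimodal_1, unimodal_2):
    """Ratio of the bimodal response to the sum of the unimodal responses:
    > 1 superadditive, ~1 additive, < 1 subadditive."""
    return bimodal / (unimodal_1 + unimodal_2)

# The hypothetical multiplicative neuron from the text: the same combination rule
# (a product) can look subadditive, additive, or superadditive depending on input size.
for r1, r2 in [(2, 1), (2, 2), (2, 3)]:
    bimodal = r1 * r2
    print(r1, r2, round(additivity_index(bimodal, r1, r2), 2))   # 0.67, 1.0, 1.2
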
Recently, we have performed an experiment to measure the combination rule by which neurons
in area MSTd integrate their visual and vestibular inputs related to heading (Morgan et al. 2008).
We asked whether bimodal responses in MSTd are well fit by a weighted linear summation of uni-
modal responses, or whether a nonlinear (i.e., multiplicative) combination rule is required. We also
asked whether the combination rule changes with the relative reliability of the visual and vestibular
cues. To address these questions, we presented eight evenly spaced directions of motion (45° apart)
in the horizontal plane (Figure 31.4, inset). Unimodal tuning curves (Figure 31.4a–c, margins) were
measured by presenting these eight headings in both the vestibular and visual stimulus conditions.
In addition, we measured a full bimodal interaction profile by presenting all 64 possible combina-
tions of these 8 vestibular and 8 visual headings, including 8 congruent and 56 incongruent (cue-
conflict) conditions. Figure 31.4a–c shows data from an exemplar “congruent” cell in area MSTd.
The unimodal tuning curves (margins) show that this neuron responded best to approximately right-
ward motion (0°) in both the visual and vestibular conditions. When optic flow at 100% coherence
was combined with vestibular stimulation, the bimodal response profile of this neuron (grayscale
map in Figure 31.4a) was dominated by the visual input, as indicated by the horizontal band of high
firing rates. When the optic flow stimulus was weakened by reducing the motion coherence to 50%
(Figure 31.4b), the bimodal response profile showed a more balanced, symmetric peak, indicating
that the bimodal response now reflects roughly equal contributions of visual and vestibular inputs.
When the motion coherence was further reduced to 25% (Figure 31.4c), the unimodal visual tuning
curve showed considerably reduced amplitude and the bimodal response profile became dominated
by the vestibular input, as evidenced by the vertical band of high firing rates. Thus, as the relative
strengths of visual and vestibular cues to heading vary, bimodal responses of MSTd neurons range
from visually dominant to vestibularly dominant.
To characterize the combination rule used by MSTd neurons in these experiments, we attempted
to predict the bimodal response profile as a function of the unimodal tuning curves. We found that
bimodal responses were well fit by a weighted linear summation of unimodal responses (Morgan et
al. 2008). On average, this linear model accounted for ~90% of the variance in bimodal responses,
and adding various nonlinear components to the model (such as a product term) accounted for only
1–2% additional variance. Thus, weighted linear summation provides a good model for the combi-
nation rule used in MSTd, and the weights are typically less than 1 (Figure 31.4d, e), indicating that
subadditive interactions are commonplace.
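A weighted linear summation fit of this general kind can be expressed compactly. The sketch below (illustrative synthetic tuning curves; a simplification of the model fitting in Morgan et al. 2008) recovers the vestibular and visual weights and an R² value from a full bimodal response matrix by least squares.

import numpy as np

def fit_linear_combination(bimodal, r_ves, r_vis):
    """Fit R_bimodal(i, j) ~ w_ves * R_ves(i) + w_vis * R_vis(j) + C by least squares.

    bimodal: matrix of mean responses to all pairings of vestibular heading i and
    visual heading j; r_ves, r_vis: the corresponding unimodal tuning curves.
    A simplified sketch, not the exact model-fitting procedure of Morgan et al. (2008).
    """
    rows = [[rv, rf, 1.0] for rv in r_ves for rf in r_vis]
    X = np.array(rows)
    y = np.asarray(bimodal, dtype=float).ravel()
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    w_ves, w_vis, c = coefs
    r_squared = 1.0 - np.sum((y - X @ coefs) ** 2) / np.sum((y - y.mean()) ** 2)
    return w_ves, w_vis, c, r_squared

az = np.arange(0.0, 360.0, 45.0)
r_ves = 20 + 15 * np.cos(np.deg2rad(az))                      # synthetic vestibular tuning curve
r_vis = 25 + 20 * np.cos(np.deg2rad(az - 30))                 # synthetic visual tuning curve
bimodal = 0.4 * r_ves[:, None] + 0.7 * r_vis[None, :] + 5.0   # subadditive weights < 1
print(fit_linear_combination(bimodal, r_ves, r_vis))          # recovers ~(0.4, 0.7, 5.0, 1.0)
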
How does the weighted linear summation model of MSTd integration depend on the reliability
of the cues to heading? As the visual cue varies in reliability due to changes in motion coherence,
the bimodal response profile clearly changes shape (Figure 31.4a–c). There are two basic possible
explanations for this change in shape. One possibility is that the bimodal response profile changes
simply from the fact that lower coherences elicit visual responses with weaker modulation as a func-
tion of heading. In this case, the weights with which each neuron combines its vestibular and visual
inputs remain constant and the decreased visual influence in the bimodal response profile is simply
due to weaker visual inputs at lower coherences. In this scenario, each neuron has a combination
rule that is independent of cue reliability. A second possibility is that the weights given to the ves-
tibular and visual inputs could change with the relative reliabilities of the two cues. This outcome
would indicate that the neuronal combination rule is not fixed, but changes with cue reliability. This
is a fundamental issue of considerable importance in multisensory integration.

FIGURE 31.4  Effects of cue strength (motion coherence) on weighted summation of visual and vestibular
inputs by MSTd neurons. (a–c) Comparison of unimodal and bimodal tuning for a congruent MSTd cell, tested
at three motion coherences. Grayscale maps show mean firing rates as a function of vestibular and visual
headings in bimodal condition (including all 64 possible combinations of 8 visual headings and 8 vestibular
headings at 45° intervals). Tuning curves along left and bottom margins show mean (±SEM) firing rates versus
heading for unimodal visual and vestibular conditions, respectively. (a) Bimodal responses at 100% coherence
are visually dominated. (b) Bimodal responses at 50% coherence show a balanced contribution of visual and
vestibular cues. (c) At 25% coherence, bimodal responses appear to be dominated by vestibular input. (d–g)
Dependence of vestibular and visual weights on visual motion coherence. Vestibular and visual weights for
each MSTd neuron were derived from linear fits to bimodal responses. (d, e) Histograms of vestibular and
visual weights computed from data at 100% (black) and 50% (gray) coherence. Triangles are plotted at medi-
ans. (f, g) Vestibular and visual weights are plotted as a function of motion coherence for each neuron exam-
ined at multiple coherences. Data points are coded by significance of unimodal visual tuning (open vs. filled
circles). (Adapted from Morgan, M.L. et al., Neuron, 59, 662–673, 2008.)

To address this issue, we obtained the best fit of the weighted linear summation model separately
for each motion coherence. At all coherences, the linear model provided a good fit to the bimodal
responses. The key question then becomes whether the visual and vestibular weights attributed to
each neuron remain constant as a function of coherence or whether they change systematically.
Figure 31.4d, e shows the distributions of weights obtained at 100% (black bars) and 50% (gray
bars) coherence. The average visual weight is significantly higher at 100% coherence than 50%
coherence, whereas the average vestibular weight shows the opposite effect. For all neurons that
were tested at multiple coherences, Figure 31.4f, g shows how the vestibular and visual weights,
respectively, change with coherence for each neuron. There is a clear and significant trend for ves-
tibular weights to decline with coherence, whereas visual weights increase (Morgan et al. 2008).
A model in which the weights are fixed across coherences does not fit the data as well as a model
in which the weights vary with coherence, for the majority of neurons (Morgan et al. 2008). The
improvement in model fit with variable weights (although significant) is rather modest for most neu-
rons, however, and it remains to be determined whether these weight changes have large or small
effects on population codes for heading.
The findings of Morgan et al. (2008) could have important implications for understanding the
neural circuitry that underlies multisensory integration. Whereas the neuronal combination rule is
well described as weighted linear summation for any particular values of stimulus strength/energy,
the weights in this linear combination rule are not constant when stimulus strength varies. If MSTd
neurons truly perform a simple linear summation of their visual and vestibular inputs, then this
finding would suggest that the synaptic weights of these inputs change as a function of stimulus
strength. Although this is possible, it is not clear how synaptic weights would be dynamically modi-
fied from moment to moment when the stimulus strength is not known in advance. Yet, it is well
established that human cue integration behavior involves a dynamic, trial-by-trial reweighting of
cues. A recent neural theory of cue integration shows that neurons that simply sum their multisen-
sory inputs can account for dynamic cue reweighting at the perceptual level, if their spiking statis-
tics fall into a Poisson-like family (Ma et al. 2006). In this theory, it was not necessary for neurons
to change their combination rule with stimulus strength, but this is what the results of Morgan et al.
(2008) demonstrate.
One possible resolution to this conundrum is that multisensory neurons linearly sum their inputs
with fixed weights, at the level of membrane potential, but that some network-level nonlinearity
makes the weights appear to change with stimulus strength. A good candidate mechanism that may
account for the findings of Morgan et al. (2008) is divisive normalization (Carandini et al. 1997;
Heeger 1992). In a divisive normalization circuit, each cell performs a linear weighted summation
of its inputs at the level of membrane potential, but the output of each neuron is divided by the
summed activity of all neurons in the circuit (Heeger 1992). This model has been highly success-
ful in accounting for how the responses of neurons in the primary visual cortex (V1) change with
stimulus strength (i.e., contrast; Carandini et al. 1997) and how neurons in visual area MT combine
multiple motion signals (Rust et al. 2006), and has also recently been proposed as an explanation
for how selective attention modifies neural activity (Lee and Maunsell 2009; Reynolds and Heeger
2009). Recent modeling results (not shown) indicate that divisive normalization can account for the
apparent changes in weights with coherence (Figure 31.4f, g), as well as a variety of other classic
findings in multisensory integration (Ohshiro et al. 2011). Evaluating the normalization model of
multisensory integration is a topic of current research in our laboratories.
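The following Python sketch conveys the basic idea schematically, not as a reimplementation of any published model: each unit sums its inputs with fixed weights, and each unit's output is divided by the pooled activity of the population plus a semisaturation constant. Weakening one input (a stand-in for lowering motion coherence) changes the normalization pool even though the weights themselves never change. The population size, weights, and drive values are arbitrary illustrative choices.

import numpy as np

def normalized_responses(ves_drive, vis_drive, w_ves, w_vis, sigma=1.0):
    """Divisive normalization over a population of model multisensory units.

    Each unit computes a fixed-weight linear sum of its vestibular and visual
    drives; the output of each unit is then divided by the pooled activity of
    the whole population plus a semisaturation constant sigma. A schematic
    sketch in the spirit of Heeger (1992) and Ohshiro et al. (2011).
    """
    linear = w_ves * ves_drive + w_vis * vis_drive      # per-unit weighted sum; weights never change
    pooled = np.sum(linear) + sigma                     # network-level normalization signal
    return linear / pooled

rng = np.random.default_rng(1)
w_ves = rng.uniform(0.5, 1.0, 50)
w_vis = rng.uniform(0.5, 1.0, 50)
ves = rng.uniform(5.0, 15.0, 50)
vis_full = rng.uniform(5.0, 15.0, 50)
# Weakening the visual drive changes the normalization pool, so the *apparent*
# visual contribution falls even though the synaptic weights are fixed.
for gain in (1.0, 0.5, 0.25):                           # stand-in for 100%, 50%, 25% coherence
    out = normalized_responses(ves, gain * vis_full, w_ves, w_vis)
    print(gain, out[:3].round(4))
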

31.5  LINKING NEURONAL AND PERCEPTUAL CORRELATES OF MULTISENSORY INTEGRATION
Most physiological studies of multisensory integration have been performed in animals that are
anesthetized or passively experiencing sensory stimuli. Ultimately, to understand the neural basis of
multisensory cue integration, we must relate neural activity to behavioral performance. Because cue
integration may only occur when cues have roughly matched perceptual reliabilities (Alais and Burr
2004; Ernst and Banks 2002), it is critical to address the neural mechanisms of sensory integration
under conditions in which cue combination is known to take place perceptually. As a first major
step in this direction, we have developed a multisensory heading discrimination task for monkeys
(Gu et al. 2008; Fetsch et al. 2009). This task enabled us to ask two fundamental questions that
had remained unaddressed: (1) Can monkeys integrate visual and vestibular cues near-optimally to
improve heading discrimination performance? (2) Can the activity of MSTd neurons account for the
behavioral improvement observed?

31.5.1  Behavioral Results


Monkeys were trained to report their perceived heading relative to straight ahead in a two-alterna-
tive forced choice task (Figure 31.5a). In each trial of this task, the monkey experienced a forward
motion with a small leftward or rightward component, and the animal’s task was to make a saccade
to one of two choice targets to indicate its perceived heading. Again, three stimulus conditions
(visual, vestibular, and combined) were examined, except that the heading angles during the task
were limited to a small range around straight forward. Psychometric functions were plotted as
the proportion of rightward choices as a function of heading angle (negative, leftward; positive,
rightward) and fit with a cumulative Gaussian function (Wichmann and Hill 2001). The standard
deviation (σ) of the fitted function was taken as the psychophysical threshold, corresponding to the
heading at which the subject was approximately 84% correct.
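The psychometric fit can be sketched as follows. This minimal Python example uses an unweighted least-squares fit of a cumulative Gaussian rather than the maximum-likelihood procedure of Wichmann and Hill (2001); the simulated observer's 3.5° threshold is chosen to match the vestibular threshold shown in Figure 31.5b.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import norm

def cumulative_gaussian(heading_deg, mu, sigma):
    """Proportion of 'rightward' choices as a function of heading (deg)."""
    return norm.cdf(heading_deg, loc=mu, scale=sigma)

def fit_psychometric(heading_deg, prop_rightward):
    """Return (bias mu, threshold sigma) from a cumulative-Gaussian fit.

    sigma corresponds to ~84% correct, as described in the text.
    """
    params, _ = curve_fit(cumulative_gaussian, heading_deg, prop_rightward, p0=[0.0, 2.0])
    return params

headings = np.array([-8.0, -4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0, 8.0])
prop_right = norm.cdf(headings / 3.5)            # idealized observer with a 3.5-deg threshold
mu, sigma = fit_psychometric(headings, prop_right)
print(f"bias = {mu:.2f} deg, threshold = {sigma:.1f} deg")
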


FIGURE 31.5  Heading discrimination task and behavioral performance. (a) After fixating a visual target,
the monkey experienced forward motion (real and/or simulated with optic flow) with a small leftward or right-
ward component, and subsequently reported his perceived heading (“left” vs. “right”) by making a saccadic
eye movement to one of two choice targets. (b) Psychometric functions for one animal under unimodal (ves-
tibular: dashed curve, visual: solid curve) and bimodal (gray curve) conditions. Psychophysical threshold was
defined as the standard deviation (σ) of fitted cumulative Gaussian. (c) Summary of measured and predicted
psychophysical thresholds for monkey C. Bars show average threshold (±SE) for vestibular (white), visual
(dark gray), and combined conditions (black), along with predicted threshold for combined condition assum-
ing optimal cue integration (light gray). (d) Summary of psychophysical performance for monkey A. (Adapted
from Gu, Y. et al., Neuron, 66, 596–609, 2008.)

Optimal cue-integration models (e.g., Alais and Burr 2004; Ernst and Banks 2002; Knill and
Saunders 2003) predict that the threshold in the combined condition (σcomb) should be lower than the
single-cue thresholds (σves, σvis), as given by the following expression:

    \sigma_{comb}^{2} = \frac{\sigma_{vis}^{2}\,\sigma_{ves}^{2}}{\sigma_{vis}^{2} + \sigma_{ves}^{2}}    (31.1)

To maximize the predicted improvement in performance, the reliability of the visual and vestibular
cues (as measured by thresholds in the single-cue conditions) was matched by adjusting the motion
coherence of optic flow in the visual display (for details, see Gu et al. 2008). Psychometric func-
tions for one animal are plotted in Figure 31.5b. The vestibular (filled symbols, dashed curve) and
visual (open symbols, solid curve) functions are nearly overlapping, with thresholds of 3.5° and 3.6°,
respectively. In the combined condition (gray symbols and curve), the monkey’s heading threshold
was substantially smaller (2.3°), as evidenced by the steeper slope of the psychometric function.
Figure 31.5c, d summarizes the psychophysical data from two monkeys. For both animals, psy-
chophysical thresholds in the combined condition were significantly lower than thresholds in the
visual and vestibular conditions, and were quite similar to the optimal predictions generated from
Equation 31.1 (Gu et al. 2008). Thus, monkeys integrate visual and vestibular cues near-optimally
to improve their sensitivity in the heading discrimination task. Similar results were also found for
human subjects (Fetsch et al. 2009).
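As a worked example of Equation 31.1, the single-cue thresholds from Figure 31.5b (3.6° visual, 3.5° vestibular) predict a combined threshold of about 2.5°, close to the measured value of 2.3°. A minimal Python sketch:

def predicted_combined_threshold(sigma_vis, sigma_ves):
    """Optimal-integration prediction for the combined threshold (Equation 31.1)."""
    return (sigma_vis**2 * sigma_ves**2 / (sigma_vis**2 + sigma_ves**2)) ** 0.5

# Single-cue thresholds from Figure 31.5b: 3.6 deg (visual) and 3.5 deg (vestibular).
print(round(predicted_combined_threshold(3.6, 3.5), 2))   # ~2.51 deg; measured combined threshold was 2.3 deg
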

31.5.2  Neurophysiological Results


Having established robust cue integration behavior in macaques, we recorded from single neurons
in area MSTd while monkeys performed the heading discrimination task (Gu et al. 2008). Figure
31.6a, b shows tuning curves from two example neurons tested with heading directions evenly
spaced in the horizontal plane. The neuron in Figure 31.6a preferred leftward (negative) headings for
both visual and vestibular stimuli, and was classified as a congruent cell. In contrast, the neuron in
Figure 31.6b preferred leftward headings under the visual condition (solid line) and rightward head-
ings under the vestibular condition (dashed line), and was classified as an opposite cell.
Figure 31.6c and d shows the tuning of these example neurons over the much narrower range
of headings sampled during the discrimination task. For the congruent cell (Figure 31.6c), heading
tuning became steeper in the combined condition, whereas for the opposite cell (Figure 31.6d) it
became flatter. To allow a more direct comparison between neuronal and behavioral sensitivities,
we used signal detection theory (receiver operating characteristic (ROC) analysis; Bradley et al.
1987; Green and Swets 1966; Britten et al. 1992) to quantify the ability of an ideal observer to
discriminate heading based on the activity of a single neuron (Figure 31.6e and f, symbols). As
with the psychometric data, we fitted these neurometric data with cumulative Gaussian functions
(Figure 31.6e and f, smooth curves) and defined the neuronal threshold as the standard deviation
of the Gaussian. For the congruent neuron in Figure 31.6e, the neuronal threshold was smallest in
the combined condition (gray symbols and lines), indicating that the neuron could discriminate
smaller variations in heading when both cues were provided. In contrast, for the opposite neuron
in Figure 31.6f, the reverse was true: the neuron became less sensitive in the presence of both cues
(gray symbols and lines).
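Each point of such a neurometric function reduces to an ROC area, which can be computed directly from the firing-rate distributions. The sketch below uses hypothetical spike rates; fitting the resulting points with a cumulative Gaussian then proceeds exactly as for the psychometric functions above.

import numpy as np

def roc_area(rates_null, rates_pref):
    """Area under the ROC curve: the probability that a random draw from
    rates_pref exceeds a random draw from rates_null (ties count 0.5).
    Equivalent to a normalized Mann-Whitney U statistic."""
    a = np.asarray(rates_null, dtype=float)
    b = np.asarray(rates_pref, dtype=float)
    greater = (b[:, None] > a[None, :]).mean()
    ties = (b[:, None] == a[None, :]).mean()
    return greater + 0.5 * ties

# One point on a neurometric function: how well an ideal observer could tell
# +h from -h trials from this neuron's firing rates (hypothetical values).
rates_minus_h = [12, 15, 11, 14, 13]
rates_plus_h = [18, 21, 17, 19, 22]
print(roc_area(rates_minus_h, rates_plus_h))   # 1.0 -> perfectly discriminable at this heading
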
The effect of visual–vestibular congruency on neuronal sensitivity in the combined condition
was robust across the population of recorded MSTd neurons. To summarize this effect, we defined
a congruency index (CI) that ranged from +1 (when visual and vestibular tuning functions have a
consistent slope, e.g., Figure 31.6c) to −1 (when they have opposite slopes; Figure 31.6d) (for details,
see Gu et al. 2008). We then computed, for each neuron, the ratio of the neuronal threshold in
the combined condition to the expected threshold if neurons combine cues optimally according to
Equation 31.1.

FIGURE 31.6  Heading tuning and heading sensitivity in area MSTd. (a–b) Heading tuning curves of two
example neurons with (a) congruent and (b) opposite visual–vestibular heading preferences. (c–d) Responses
of same neurons to a narrow range of heading stimuli presented while the monkey performed the discrimi-
nation task. (e–f) Neurometric functions computed by ROC analysis from firing rate data plotted in panels
(c) and (d). Smooth curves show best-fitting cumulative Gaussian functions. (Adapted from Gu, Y. et al.,
Neuron, 66, 596–609, 2008.)

A significant correlation was seen between the combined threshold/predicted thresh-
old ratio and CI (Figure 31.7a), such that neurons with large positive CIs (congruent cells, black
circles) had thresholds close to the optimal prediction (ratios near unity). Thus, neuronal thresholds
for congruent MSTd cells followed a pattern similar to the monkeys’ behavior. In contrast, com-
bined thresholds for opposite cells were generally much higher than predicted from optimal cue
integration (Figure 31.7a, open circles), indicating that these neurons became less sensitive during
cue combination.
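One simple way to construct such an index, offered here only as an illustrative stand-in for the exact definition given by Gu et al. (2008), is to multiply the signed correlations between firing rate and heading for the two single-cue tuning curves measured over the narrow heading range:

import numpy as np

def congruency_index(headings, rates_ves, rates_vis):
    """Signed index of visual-vestibular congruency over a narrow heading range.

    Taken here as the product of the Pearson correlations between firing rate
    and heading for the vestibular and visual tuning curves: near +1 when both
    curves have consistent slopes, near -1 when their slopes are opposite.
    An illustrative stand-in, not the exact CI definition of Gu et al. (2008).
    """
    r_ves = np.corrcoef(headings, rates_ves)[0, 1]
    r_vis = np.corrcoef(headings, rates_vis)[0, 1]
    return r_ves * r_vis

headings = np.array([-10.0, -5.0, -2.5, 0.0, 2.5, 5.0, 10.0])
rates_ves = 15 + 1.0 * headings                 # vestibular rate increases with rightward heading
congruent_vis = 20 + 1.5 * headings             # visual slope consistent with vestibular slope
opposite_vis = 20 - 1.5 * headings              # visual slope opposite to vestibular slope
print(congruency_index(headings, rates_ves, congruent_vis))   # ~ +1 (congruent cell)
print(congruency_index(headings, rates_ves, opposite_vis))    # ~ -1 (opposite cell)
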

31.5.3  Correlations with Behavioral Choice


If monkeys rely on area MSTd for heading discrimination, the results of Figure 31.7a suggest that
they selectively monitor the activity of congruent cells and not opposite cells. To test this hypoth-
esis, we used the data from the recording experiments (Gu et al. 2007, 2008) to compute “choice
probabilities” (CPs) (Britten et al. 1996).

FIGURE 31.7  Neuronal thresholds and choice probabilities as a function of visual–vestibular congruency
in combined condition. (a) Ordinate in this scatter plot represents ratio of threshold measured in combined
condition to prediction from optimal cue integration. Abscissa represents CI of heading tuning for visual and
vestibular responses. Asterisks denote neurons for which CI is not significantly different from zero. Dashed
horizontal line denotes that threshold in combined condition is equal to the prediction. (b) Choice probability
(CP) data are plotted as a function of congruency index for each MSTd neuron tested in combined condition.
Note that congruent cells (black filled symbols), which have neuronal thresholds similar to optimal prediction
in panel (a), also have CPs consistently and substantially larger than 0.5. (Adapted from Gu, Y. et al., Neuron,
66, 596–609, 2008.)

CPs are computed by ROC analysis similar to neuronal
thresholds, except that the ideal observer is asked to predict the monkey’s choice (rather than the
stimulus) from the firing rate of the neuron. This analysis is performed after the effect of heading
on response has been removed, such that it isolates the effect of choice on firing rates. Thus, CPs
quantify the relationship between trial-to-trial fluctuations in neural firing rates and the monkeys’
perceptual decisions. A CP significantly greater than 0.5 indicates that the monkey tended to choose
the neuron’s preferred sign of heading (leftward or rightward) when the neuron fires more strongly.
Such a result is thought to reflect a functional link between the neuron and perception (Britten et
al. 1996; Krug 2004; Parker and Newsome 1998). Notably, although MSTd is classically considered
visual cortex, CPs significantly larger than 0.5 (mean = 0.55) were seen in the vestibular condition
(Gu et al. 2007), indicating that MSTd activity is correlated with perceptual decisions about heading
based on nonvisual information.
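The choice probability computation can be sketched as follows. This is a simplified version of the procedure of Britten et al. (1996): stimulus-driven response differences are removed by z-scoring within each heading, the z-scores are pooled, and an ROC comparison is made between trials ending in the neuron's preferred versus non-preferred choice. The example data are synthetic.

import numpy as np

def choice_probability(rates, headings, choices):
    """Grand choice probability: ROC comparison of choice-sorted firing rates
    after removing the effect of heading (z-scoring within each heading).

    choices: +1 when the monkey chose the neuron's preferred sign of heading,
    -1 otherwise. A simplified sketch of the Britten et al. (1996) procedure.
    """
    rates = np.asarray(rates, dtype=float)
    headings = np.asarray(headings)
    choices = np.asarray(choices)
    z = np.empty_like(rates)
    for h in np.unique(headings):                 # remove stimulus-driven response differences
        idx = headings == h
        sd = rates[idx].std()
        z[idx] = (rates[idx] - rates[idx].mean()) / (sd if sd > 0 else 1.0)
    pref, null = z[choices == 1], z[choices == -1]
    greater = (pref[:, None] > null[None, :]).mean()
    ties = (pref[:, None] == null[None, :]).mean()
    return greater + 0.5 * ties                   # > 0.5: higher rates precede preferred choices

rng = np.random.default_rng(2)
headings = np.repeat([-2.0, 0.0, 2.0], 40)
choices = np.where(rng.random(120) < 0.5, 1, -1)
rates = 10 + 2.0 * headings + 1.5 * choices + rng.normal(0, 1, 120)   # choice-related variability
print(choice_probability(rates, headings, choices))                    # substantially > 0.5
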
It is of particular interest to examine the relationship between CP and CI in the combined con-
dition, where the monkey makes use of both visual and vestibular cues. Given that opposite cells
become insensitive during cue combination and congruent cells increase sensitivity, we might expect
CP to depend on congruency in the combined condition. Indeed, Figure 31.7b shows that there is a
robust correlation between CP and CI (Gu et al. 2008). Congruent cells (black symbols) generally
have CPs greater than 0.5, often much greater, indicating that they are robustly correlated with the
animal’s perceptual decisions during cue integration. In contrast, opposite cells (unfilled symbols)
tend to have CP values near 0.5, and the mean CP for opposite cells does not differ significantly
from 0.5 (t-test, p = .08). This finding is consistent with the idea that the animals selectively monitor
congruent cells to achieve near-optimal cue integration.
These findings suggest that opposite cells are not useful for visual–vestibular cue integration dur-
ing heading discrimination. What, then, is the functional role of opposite cells? We do not yet know
the answer to this question, but we hypothesize that opposite cells, in combination with congruent
cells, are important for dissociating object motion from self-motion. In general, the complex pat-
tern of image motion on the retina has two sources: (1) self-motion combined with the 3-D layout
of the scene and (2) objects moving in the environment. It is important for estimates of heading not
to be biased by the presence of moving objects, and vice versa. Note that opposite cells will not be
optimally stimulated when a subject moves through a static environment, but may fire more robustly
when retinal image motion is inconsistent with self-motion. Thus, the relative activity of congruent
and opposite cells may help identify (and perhaps discount) retinal image motion that is not pro-
duced by self-motion. Indeed, ongoing modeling work suggests that decoding a mixed population
of congruent and opposite cells allows heading to be estimated with much less bias from moving
objects.
In summary, by simultaneously monitoring neural activity and behavior, it has been possible
to study neural mechanisms of multisensory processing under conditions in which cue integra-
tion is known to take place perceptually. In addition to demonstrating near-optimal cue integration
by monkeys, a population of neurons has been identified in area MSTd that could account for the
improvement in psychophysical performance under cue combination. These findings implicate area
MSTd in sensory integration for heading perception and establish a model system for studying the
detailed mechanisms by which neurons combine different sensory signals.

31.6  CONCLUSION
These studies indicate that area MSTd is one important brain area where visual and vestibular sig-
nals might be integrated to achieve robust perception of self-motion. It is likely that other areas also
integrate visual and vestibular signals in meaningful ways, and a substantial challenge for the future
will be to understand the specific roles that various brain regions play in multisensory perception
of self-motion and object motion. In addition, these studies raise a number of important general
questions that may guide future studies on multisensory integration in multiple systems and species.
What are the respective functional roles of neurons that have congruent or incongruent tuning for
two sensory inputs? Do the spatial reference frames in which multiple sensory signals are expressed
constrain the contribution of multisensory neurons to perception? Do multisensory neurons gener-
ally perform weighted linear summation of their unimodal inputs, or do the mathematical combina-
tion rules used by neurons vary across brain regions and across stimuli/tasks within a brain region?
How can we account for the change in the weights that neurons apply to their unimodal inputs as the
strength of the sensory inputs varies? Does this require dynamic changes in synaptic weights or can
this phenomenology be explained in terms of nonlinearities (such as divisive normalization) that
operate at the level of the network? During behavioral discrimination tasks involving cue conflict,
do single neurons show correlates of the dynamic cue reweighting effects that have been seen con-
sistently in human perceptual studies of cue integration? How do populations of multimodal sensory
neurons represent the reliabilities (i.e., variance) of the sensory cues as they change dynamically
in the environment? Most of these questions should be amenable to study within the experimental
paradigm of visual–vestibular integration that we have presented thus far. Thus, we expect that this
will serve as an important platform for tackling critical questions regarding multisensory integra-
tion in the future.

ACKNOWLEDGMENTS
We thank Amanda Turner and Erin White for excellent monkey care and training. This work
was supported by NIH EY017866 and EY019087 (to DEA) and by NIH EY016178 and an EJLB
Foundation grant (to GCD).

REFERENCES
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Curr Biol
14: 257–262.
Anderson, K. C., and R. M. Siegel. 1999. Optic flow selectivity in the anterior superior temporal polysensory
area, STPa, of the behaving monkey. J Neurosci 19: 2681–2692.
Angelaki, D. E. 2004. Eyes on target: What neurons must do for the vestibuloocular reflex during linear motion.
J Neurophysiol 92: 20–35.
Angelaki, D. E., and K. E. Cullen. 2008. Vestibular system: The many facets of a multimodal sense. Annu Rev
Neurosci 31: 125–150.
Angelaki, D. E., M. Q. Mchenry, J. D. Dickman, S. D. Newlands, and B. J. Hess. 1999. Computation of inertial
motion: Neural strategies to resolve ambiguous otolith information. J Neurosci 19: 316–327.
Angelaki, D. E., A. G. Shaikh, A. M. Green, and J. D. Dickman. 2004. Neurons compute internal models of the
physical laws of motion. Nature 430: 560–564.
Avillac, M., S. Ben Hamed, and J. R. Duhamel. 2007. Multisensory integration in the ventral intraparietal area
of the macaque monkey. J Neurosci 27: 1922–1932.
Avillac, M., S. Deneve, E. Olivier, A. Pouget, and J. R. Duhamel. 2005. Reference frames for representing
visual and tactile locations in parietal cortex. Nat Neurosci 8: 941–949.
Banks, M. S., S. M. Ehrlich, B. T. Backus, and J. A. Crowell. 1996. Estimating heading during real and simu-
lated eye movements. Vision Res 36: 431–443.
Benson, A. J., M. B. Spencer, and J. R. Stott. 1986. Thresholds for the detection of the direction of whole-body,
linear movement in the horizontal plane. Aviat Space Environ Med 57: 1088–1096.
Berthoz, A., B. Pavard, and L. R. Young. 1975. Perception of linear horizontal self-motion induced by periph-
eral vision (linearvection) basic characteristics and visual–vestibular interactions. Exp Brain Res 23:
471–489.
Bradley, A., B. C. Skottun, I. Ohzawa, G. Sclar, and R. D. Freeman. 1987. Visual orientation and spatial fre-
quency discrimination: A comparison of single neurons and behavior. J Neurophysiol 57: 755–772.
Bradley, D. C., M. Maxwell, R. A. Andersen, M. S. Banks, and K. V. Shenoy. 1996. Mechanisms of heading
perception in primate visual cortex. Science 273: 1544–1547.
Brandt, T., J. Dichgans, and E. Koenig. 1973. Differential effects of central versus peripheral vision on egocen-
tric and exocentric motion perception. Exp Brain Res 16: 476–491.
Bremmer, F., J. R. Duhamel, S. Ben Hamed, and W. Graf. 2002a. Heading encoding in the macaque ventral
intraparietal area (VIP). Eur J Neurosci 16: 1554–1568.
Bremmer, F., F. Klam, J. R. Duhamel, S. Ben Hamed, and W. Graf. 2002b. Visual–vestibular interactive
responses in the macaque ventral intraparietal area (VIP). Eur J Neurosci 16: 1569–1586.
Bremmer, F., M. Kubischik, M. Pekel, M. Lappe, and K. P. Hoffmann. 1999. Linear vestibular self-motion
signals in monkey medial superior temporal area. Ann N Y Acad Sci 871: 272–281.
Britten, K. H., W. T. Newsome, M. N. Shadlen, S. Celebrini, and J. A. Movshon. 1996. A relationship between
behavioral choice and the visual responses of neurons in macaque MT. Vis Neurosci 13: 87–100.
Britten, K. H., M. N. Shadlen, W. T. Newsome, and J. A. Movshon. 1992. The analysis of visual motion: A
comparison of neuronal and psychophysical performance. J Neurosci 12: 4745–4765.
Britten, K. H., and R. J. Van Wezel. 1998. Electrical microstimulation of cortical area MST biases heading
perception in monkeys. Nat Neurosci 1: 59–63.
Britten, K. H., and R. J. Van Wezel. 2002. Area MST and heading perception in macaque monkeys. Cereb
Cortex 12: 692–701.
Bryan, A. S., and D. E. Angelaki. 2008. Optokinetic and vestibular responsiveness in the macaque rostral ves-
tibular and fastigial nuclei. J Neurophysiol 101: 714–720.
Buttner, U., and U. W. Buettner. 1978. Parietal cortex (2v) neuronal activity in the alert monkey during natural
vestibular and optokinetic stimulation. Brain Res 153: 392–397.
Carandini, M., D. J. Heeger, and J. A. Movshon. 1997. Linearity and normalization in simple cells of the
macaque primary visual cortex. J Neurosci 17: 8621–8644.
Chen, A., G. C. Deangelis, and D. E. Angelaki. 2010. Macaque parieto-insular vestibular cortex: Responses to
self-motion and optic flow. J Neurosci 30: 3022–3042.
Chen, A., E. Henry, G. C. Deangelis, and D. E. Angelaki. 2007. Comparison of responses to three-dimensional
rotation and translation in the ventral intraparietal (VIP) and medial superior temporal (MST) areas of
rhesus monkey. Program No. 715.19. 2007 Neuroscience Meeting Planner. San Diego, CA: Society for
Neuroscience, 2007. Online.
Chowdhury, S. A., K. Takahashi, G. C. Deangelis, and D. E. Angelaki. 2009. Does the middle temporal area
carry vestibular signals related to self-motion? J Neurosci 29: 12020–12030.
Crowell, J. A., M. S. Banks, K. V. Shenoy, and R. A. Andersen. 1998. Visual self-motion perception during head
turns. Nat Neurosci 1: 732–737.
Daunton, N., and D. Thomsen. 1979. Visual modulation of otolith-dependent units in cat vestibular nuclei. Exp
Brain Res 37: 173–176.
Deneve, S., P. E. Latham, and A. Pouget. 2001. Efficient computation and cue integration with noisy population
codes. Nat Neurosci 4: 826–831.
Dichgans, J., and T. Brandt. 1974. The psychophysics of visually-induced perception of self motion and tilt. In
The Neurosciences, 123–129. Cambridge, MA: MIT Press.
Dichgans, J., and T. Brandt. 1978. Visual–vestibular interaction: Effects on self-motion perception and postural
control. In Handbook of sensory physiology, ed. R. Held, H. W. Leibowitz, and H. L. Teuber. Berlin:
Springer-Verlag.
Duffy, C. J. 1998. MST neurons respond to optic flow and translational movement. J Neurophysiol 80:
1816–1827.
Duffy, C. J., and R. H. Wurtz. 1991. Sensitivity of MST neurons to optic flow stimuli: I. A continuum of
response selectivity to large-field stimuli. J Neurophysiol 65: 1329–1345.
Duffy, C. J., and R. H. Wurtz. 1995. Response of monkey MST neurons to optic flow stimuli with shifted cen-
ters of motion. J Neurosci 15: 5192–5208.
Ernst, M. O., and M. S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415: 429–433.
Fernandez, C., and J. M. Goldberg. 1976a. Physiology of peripheral neurons innervating otolith organs of the
squirrel monkey: I. Response to static tilts and to long-duration centrifugal force. J Neurophysiol 39:
970–984.
Fernandez, C., and J. M. Goldberg. 1976b. Physiology of peripheral neurons innervating otolith organs of the
squirrel monkey: II. Directional selectivity and force–response relations. J Neurophysiol 39: 985–995.
Fetsch, C. R., A. H. Turner, G. C. Deangelis, and D. E. Angelaki. 2009. Dynamic reweighting of visual and
vestibular cues during self-motion perception. J Neurosci 29: 15601–15612.
Fetsch, C. R., S. Wang, Y. Gu, G. C. Deangelis, and D. E. Angelaki. 2007. Spatial reference frames of visual,
vestibular, and multimodal heading signals in the dorsal subdivision of the medial superior temporal area.
J Neurosci 27: 700–712.
Fredrickson, J. M., P. Scheid, U. Figge, and H. H. Kornhuber. 1966. Vestibular nerve projection to the cerebral
cortex of the rhesus monkey. Exp Brain Res 2: 318–327.
Fukushima, K. 1997. Corticovestibular interactions: Anatomy, electrophysiology, and functional consider-
ations. Exp Brain Res 117: 1–16.
Gibson, J. J. 1950. The perception of the visual world. Boston: Houghton-Mifflin.
Gibson, J. J. 1954. The visual perception of objective motion and subjective movement. Psychol Rev 61:
304–314.
Green, D. M., and J. A. Swets. 1966. Signal detection theory and psychophysics. New York: Wiley.
Groh, J. M. 2001. Converting neural signals from place codes to rate codes. Biol Cybern 85: 159–165.
Grusser, O. J., M. Pause, and U. Schreiter. 1990a. Localization and responses of neurones in the parieto-insular
vestibular cortex of awake monkeys (Macaca fascicularis). J Physiol 430: 537–557.
Grusser, O. J., M. Pause, and U. Schreiter. 1990b. Vestibular neurones in the parieto-insular cortex of monkeys
(Macaca fascicularis): Visual and neck receptor responses. J Physiol 430: 559–583.
Gu, Y., D. E. Angelaki, and G. C. Deangelis. 2008. Neural correlates of multisensory cue integration in macaque
MSTd. Nat Neurosci 11: 1201–1210.
Gu, Y., G. C. Deangelis, and D. E. Angelaki. 2007. A functional link between area MSTd and heading percep-
tion based on vestibular signals. Nat Neurosci 10: 1038–1047.
Gu, Y., C. R. Fetsch, B. Adeyemo, G. C. Deangelis, and D. E. Angelaki. 2010. Decoding of MSTd population
activity accounts for variations in the precision of heading perception. Neuron 66: 596–609.
Gu, Y., P. V. Watkins, D. E. Angelaki, and G. C. Deangelis. 2006. Visual and nonvisual contributions to three-
dimensional heading selectivity in the medial superior temporal area. J Neurosci 26: 73–85.
Guedry, F. E. 1974. Psychophysics of vestibular sensation. In Handbook of sensory physiology. The vestibular
system, ed. H. H. Kornhuber. New York: Springer-Verlag.
Guedry Jr., F. E. 1978. Visual counteraction on nauseogenic and disorienting effects of some whole-body
motions: A proposed mechanism. Aviat Space Environ Med 49: 36–41.
Guldin, W. O., S. Akbarian, and O. J. Grusser. 1992. Cortico-cortical connections and cytoarchitectonics
of the primate vestibular cortex: A study in squirrel monkeys (Saimiri sciureus). J Comp Neurol 326:
375–401.
Guldin, W. O., and O. J. Grusser. 1998. Is there a vestibular cortex? Trends Neurosci 21: 254–259.
Heeger, D. J. 1992. Normalization of cell responses in cat striate cortex. Vis Neurosci 9: 181–197.
Henn, V., L. R. Young, and C. Finley. 1974. Vestibular nucleus units in alert monkeys are also influenced by
moving visual fields. Brain Res 71: 144–149.
Hlavacka, F., T. Mergner, and B. Bolha. 1996. Human self-motion perception during translatory vestibular and
proprioceptive stimulation. Neurosci Lett 210: 83–86.
Hlavacka, F., T. Mergner, and G. Schweigart. 1992. Interaction of vestibular and proprioceptive inputs for
human self-motion perception. Neurosci Lett 138: 161–164.
Knill, D. C., and J. A. Saunders. 2003. Do humans optimally integrate stereo and texture information for judg-
ments of surface slant? Vision Res 43: 2539–2558.
Krug, K. 2004. A common neuronal code for perceptual processes in visual cortex? Comparing choice and
attentional correlates in V5/MT. Philos Trans R Soc Lond B Biol Sci 359: 929–941.
Lee, J., and J. H. Maunsell. 2009. A normalization model of attentional modulation of single unit responses.
PLoS ONE 4: e4651.
Logan, D. J., and C. J. Duffy. 2006. Cortical area MSTd combines visual cues to represent 3-D self-movement.
Cereb Cortex 16: 1494–1507.
Ma, W. J., J. M. Beck, P. E. Latham, and A. Pouget. 2006. Bayesian inference with probabilistic population
codes. Nat Neurosci 9: 1432–1438.
Markert, G., U. Buttner, A. Straube, and R. Boyle. 1988. Neuronal activity in the flocculus of the alert monkey
during sinusoidal optokinetic stimulation. Exp Brain Res 70: 134–144.
Matsumiya, K., and H. Ando. 2009. World-centered perception of 3D object motion during visually guided
self-motion. J Vis 9: 151–153.
Meredith, M. A., and B. E. Stein. 1983. Interactions among converging sensory inputs in the superior colliculus.
Science 221: 389–391.
Meredith, M. A., and B. E. Stein. 1986. Visual, auditory, and somatosensory convergence on cells in superior
colliculus results in multisensory integration. J Neurophysiol 56: 640–662.
Merfeld, D. M., L. Zupan, and R. J. Peterka. 1999. Humans use internal models to estimate gravity and linear
acceleration. Nature 398: 615–618.
Morgan, M. L., G. C. Deangelis, and D. E. Angelaki. 2008. Multisensory integration in macaque visual cortex
depends on cue reliability. Neuron 59: 662–673.
Odkvist, L. M., D. W. Schwarz, J. M. Fredrickson, and R. Hassler. 1974. Projection of the vestibular nerve to
the area 3a arm field in the squirrel monkey (Saimiri sciureus). Exp Brain Res 21: 97–105.
Ohshiro, T., D. E. Angelaki, and G. C. DeAngelis. 2011. A normalization model of multisensory integration.
Nat Neurosci, in press.
Page, W. K., and C. J. Duffy. 1999. MST neuronal responses to heading direction during pursuit eye move-
ments. J Neurophysiol 81: 596–610.
Parker, A. J., and W. T. Newsome. 1998. Sense and the single neuron: Probing the physiology of perception.
Annu Rev Neurosci 21: 227–277.
Perrault Jr., T. J., J. W. Vaughan, B. E. Stein, and M. T. Wallace. 2003. Neuron-specific response characteristics
predict the magnitude of multisensory integration. J Neurophysiol 90: 4022–406.
Perrault Jr., T. J., J. W. Vaughan, B. E. Stein, and M. T. Wallace. 2005. Superior colliculus neurons use distinct
operational modes in the integration of multisensory stimuli. J Neurophysiol 93: 2575–2586.
Previc, F. H., D. C. Varner, and K. K. Gillingham. 1992. Visual scene effects on the somatogravic illusion. Aviat
Space Environ Med 63: 1060–1064.
Reynolds, J. H., and D. J. Heeger. 2009. The normalization model of attention. Neuron 61: 168–185.
Robinson, D. A. 1977. Linear addition of optokinetic and vestibular signals in the vestibular nucleus. Exp Brain
Res 30: 447–450.
Royden, C. S., M. S. Banks, and J. A. Crowell. 1992. The perception of heading during eye movements. Nature
360: 583–585.
Royden, C. S., J. A. Crowell, and M. S. Banks. 1994. Estimating heading during eye movements. Vis Res 34:
3197–3214.
Royden, C. S., and E. C. Hildreth. 1996. Human heading judgments in the presence of moving objects. Percept
Psychophys 58: 836–856.
Rushton, S. K., and P. A. Warren. 2005. Moving observers, relative retinal motion and the detection of object
movement. Curr Biol 15: R542–R543.
Rust, N. C., V. Mante, E. P. Simoncelli, and J. A. Movshon. 2006. How MT cells analyze the motion of visual
patterns. Nat Neurosci 9: 1421–1431.
Schaafsma, S. J., and J. Duysens. 1996. Neurons in the ventral intraparietal area of awake macaque monkey
closely resemble neurons in the dorsal part of the medial superior temporal area in their responses to
optic flow patterns. J Neurophysiol 76: 4056–4068.
Schlack, A., K. P. Hoffmann, and F. Bremmer. 2002. Interaction of linear vestibular and visual stimulation in
the macaque ventral intraparietal area (VIP). Eur J Neurosci 16: 1877–1886.
Schwarz, D. W., and J. M. Fredrickson. 1971a. Rhesus monkey vestibular cortex: A bimodal primary projection
field. Science 172: 280–281.
Schwarz, D. W., and J. M. Fredrickson. 1971b. Tactile direction sensitivity of area 2 oral neurons in the rhesus
monkey cortex. Brain Res 27: 397–401.
Shenoy, K. V., D. C. Bradley, and R. A. Andersen. 1999. Influence of gaze rotation on the visual response of
primate MSTd neurons. J Neurophysiol 81: 2764–2786.
Siegel, R. M., and H. L. Read. 1997. Analysis of optic flow in the monkey parietal area 7a. Cereb Cortex 7:
327–346.
Stanford, T. R., S. Quessy, and B. E. Stein. 2005. Evaluating the operations underlying multisensory integration
in the cat superior colliculus. J Neurosci 25: 6499–6508.
Stein, B. E., and M. A. Meredith. 1993. The merging of the senses. Cambridge, MA: MIT Press.
Stein, B. E., and T. R. Stanford. 2008. Multisensory integration: Current issues from the perspective of the
single neuron. Nat Rev Neurosci 9: 255–266.
Sugihara, T., M. D. Diltz, B. B. Averbeck, and L. M. Romanski. 2006. Integration of auditory and visual com-
munication information in the primate ventrolateral prefrontal cortex. J Neurosci 26: 11138–11147.
Takahashi, K., Y. Gu, P. J. May, S. D. Newlands, G. C. DeAngelis, and D. E. Angelaki. 2007. Multimodal coding
of three-dimensional rotation and translation in area MSTd: Comparison of visual and vestibular selectiv-
ity. J Neurosci 27: 9742–9756.
Tanaka, K., K. Hikosaka, H. Saito, M. Yukie, Y. Fukada, and E. Iwai. 1986. Analysis of local and wide-field
movements in the superior temporal visual areas of the macaque monkey. J Neurosci 6: 134–144.
Tanaka, K., and H. Saito. 1989. Analysis of motion of the visual field by direction, expansion/contraction, and
rotation cells clustered in the dorsal part of the medial superior temporal area of the macaque monkey. J
Neurophysiol 62: 626–641.
Telford, L., I. P. Howard, and M. Ohmi. 1995. Heading judgments during active and passive self-motion. Exp
Brain Res 104: 502–510.
Waespe, W., U. Buttner, and V. Henn. 1981. Visual–vestibular interaction in the flocculus of the alert monkey:
I. Input activity. Exp Brain Res 43: 337–348.
Waespe, W., and V. Henn. 1977. Neuronal activity in the vestibular nuclei of the alert monkey during vestibular
and optokinetic stimulation. Exp Brain Res 27: 523–538.
Waespe, W., and V. Henn. 1981. Visual–vestibular interaction in the flocculus of the alert monkey: II. Purkinje
cell activity. Exp Brain Res 43: 349–360.
Warren, P. A., and S. K. Rushton. 2007. Perception of object trajectory: Parsing retinal motion into self and object movement components. J Vis 7(11): 2.1–11.
Warren, P. A., and S. K. Rushton. 2008. Evidence for flow-parsing in radial flow displays. Vis Res 48:
655–663.
Warren, W. H. 2003. Optic flow. In The visual neurosciences, ed. L. M. Chalupa and J. S. Werner. Cambridge,
MA: MIT Press.
Warren, W. H., and J. A. Saunders. 1995. Perceiving heading in the presence of moving objects. Perception 24:
315–331.
Wexler, M. 2003. Voluntary head movement and allocentric perception of space. Psychol Sci 14: 340–346.
Wexler, M., F. Panerai, I. Lamouret, and J. Droulez. 2001. Self-motion and the perception of stationary objects.
Nature 409: 85–88.
Wexler, M., and J. J. Van Boxtel. 2005. Depth perception by the active observer. Trends Cogn Sci 9: 431–438.
Wichmann, F. A., and N. J. Hill. 2001. The psychometric function: I. Fitting, sampling, and goodness of fit.
Percept Psychophys 63: 1293–1313.
Wolfe, J. W., and R. L. Cramer. 1970. Illusions of pitch induced by centripetal acceleration. Aerosp Med 41:
1136–1139.
Zhang, T., and K. H. Britten. 2003. Microstimulation of area VIP biases heading perception in monkeys.
Program No. 339.9. 2003 Neuroscience Abstract Viewer/Itinerary Planner. New Orleans, LA: Society
for Neuroscience.
Zhang, T., H. W. Heuer, and K. H. Britten. 2004. Parietal area VIP neuronal responses to heading stimuli are
encoded in head-centered coordinates. Neuron 42: 993–1001.
Section VIII
Naturalistic Multisensory Processes:
Communication Signals
32 Unity of the Senses for Primate
Vocal Communication
Asif A. Ghazanfar

CONTENTS
32.1 Introduction
32.2 Multisensory Communication Is the Default Mode of Communication
32.3 Monkeys Link Facial Expressions to Vocal Expressions
32.4 Dynamic Faces Modulate Voice Processing in Auditory Cortex
32.5 Auditory Cortical Interactions with Superior Temporal Sulcus Mediates Face/Voice Integration
32.6 Viewing Vocalizing Conspecifics
32.7 Somatosensory Feedback during Vocal Communication
32.8 Emergence of Multisensory Systems for Communication
32.9 Conclusions
Acknowledgments
References

32.1  INTRODUCTION
The basic tenet of neocortical organization is: different regions of the cortex have different func-
tions. Some regions receive visual, auditory, tactile, olfactory, and gustatory sensations. Each of
these sensory regions is thought to send projections that converge on an “association area,” which
then enables the association between the different senses and between the senses and movement.
According to a highly influential two-part review by Norman Geschwind, entitled, “Disconnexion
syndromes in animals and man” (Geschwind 1965a, 1965b), the connections between sensory asso-
ciation areas are not robust in nonhuman animals, limiting their ability to make cross-modal sen-
sory associations. In contrast, humans can readily make such associations, for example, between the
sight of a lion and the sounds of its roar.
This picture of human versus nonhuman cross-modal abilities based on anatomy led to the idea
that human speech and language evolved in parallel with robust cross-modal connections within the
neocortex. Geschwind claimed that the “ability to acquire speech has as a prerequisite the ability to
form cross-modal associations” (Geschwind 1965a, 1965b). This view of cross-modal associations
as a potentially uniquely human capacity remains present even in more current ideas about the evo-
lution of language. For example, it has been suggested that human language depends on our unique
ability to imitate in multiple modalities, which in turn relies on a “substantial change in neural orga-
nization, one that affects not only imitation but also communication” (Hauser et al. 2002, p. 1575).
The purpose of this review is twofold: (1) to refute the view that the cross-modal (multisensory,
hereafter) associations are mediated solely through association areas and (2) to debunk the view
that human communication is uniquely multisensory. To achieve these two goals, I will focus on the
multisensory nature of nonhuman primate vocal communication and the many possible roles that
one, nonassociation area plays: the auditory cortex.

32.2  MULTISENSORY COMMUNICATION IS THE DEFAULT MODE OF COMMUNICATION
It is widely accepted that human speech is fundamentally a multisensory behavior, with face-to-face
communication perceived through both the visual and auditory channels. Such multisensory speech
perception is evident even at the earliest stages of human cognitive development (Gogate et al. 2001;
Patterson and Werker 2003); its integration across the two modalities is ubiquitous and automatic
(McGurk and MacDonald 1976), and at the neural level, audiovisual speech integration occurs at the
“earliest” stages of cortical processing (Ghazanfar and Schroeder 2006). Indeed, there are strong
arguments suggesting that multisensory speech is the primary mode of speech perception and is not
a capacity that is “piggybacked” on to auditory speech perception (Rosenblum 2005). This implies
that the perceptual mechanisms, neurophysiology, and evolution of speech perception are based
on primitives that are not tied to a single sensory modality (Romanski and Ghazanfar 2009). The
essence of these ideas is shared by many investigators in the domain of perception (Liberman and
Mattingly 1985; Meltzoff and Moore 1997; Fowler 2004).

32.3  MONKEYS LINK FACIAL EXPRESSIONS TO VOCAL EXPRESSIONS


What is true for human speech is also true for vocal communication in nonhuman primates: vision
and audition are inextricably linked. Human and primate vocalizations are produced by coordinated
movements of the lungs, larynx (vocal folds), and the supralaryngeal vocal tract (Fitch and Hauser
1995; Ghazanfar and Rendall 2008). The vocal tract consists of the column of air derived from
the pharynx, mouth, and nasal cavity. In humans, speech-related vocal tract motion results in the
predictable deformation of the face around the oral aperture and other parts of the face (Yehia et al.
1998, 2002; Jiang et al. 2002). For example, human adults automatically link high-pitched sounds to
facial postures producing an /i/ sound and low-pitched sounds to faces producing an /a/ sound (Kuhl
et al. 1991). In primate vocal production, there is a similar link between acoustic output and facial
dynamics. Different macaque monkey vocalizations are produced with unique lip configurations
and mandibular positions and the motion of such articulators influences the acoustics of the signal
(Hauser et al. 1993; Hauser and Ybarra 1994). Coo calls, such as /u/ in speech, are produced with
the lips protruded, whereas screams, such as the /i/ in speech, are produced with the lips retracted
(Figure 32.1). Thus, it is likely that many of the facial motion cues that humans use for speech-
reading are present in other primates as well.
Given that both humans and other extant primates use both facial and vocal expressions as com-
munication signals, it is perhaps not surprising that many primates other than humans recognize
the correspondence between the visual and auditory components of vocal signals. Macaque mon-
keys (Macaca mulatta), capuchins (Cebus apella), and chimpanzees (Pan troglodytes) all recognize
auditory–visual correspondences between their various vocalizations (Ghazanfar and Logothetis
2003; Izumi and Kojima 2004; Parr 2004; Evans et al. 2005). For example, rhesus monkeys read-
ily match the facial expressions of “coo” and “threat” calls with their associated vocal components
(Ghazanfar and Logothetis 2003). Perhaps more pertinent, rhesus monkeys can also segregate com-
peting voices in a chorus of coos, much as humans might with speech in a cocktail party scenario,
and match them to the correct number of individuals seen cooing on a video screen (Jordan et al.
2005). Finally, macaque monkeys use formants (i.e., vocal tract resonances) as acoustic cues to
assess age-related body size differences among conspecifics (Ghazanfar et al. 2007). They do so
by linking across modalities the body size information embedded in the formant spacing of vocal-
izations (Fitch 1997) with the visual size of animals who are likely to produce such vocalizations
(Ghazanfar et al. 2007).
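The acoustic logic behind such formant-based size cues can be made explicit with a textbook idealization that is not specific to the studies cited above: if the vocal tract is approximated as a uniform tube closed at the glottis and open at the lips, its resonances (formants) scale inversely with vocal tract length L, so the spacing between neighboring formants shrinks as an animal, and its vocal tract, grows. With c the speed of sound in the vocal tract (roughly 350 m/s in warm, humid air),

F_n = \frac{(2n-1)\,c}{4L}, \qquad \Delta F = F_{n+1} - F_n = \frac{c}{2L} \quad\Rightarrow\quad L \approx \frac{c}{2\,\Delta F}.

Under this approximation, a formant spacing of about 1750 Hz implies a vocal tract roughly 10 cm long, whereas a spacing of about 875 Hz implies one roughly twice that length; real vocal tracts are not uniform tubes, so the relation serves only to illustrate why formant spacing can index body size.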
Taken together, these data suggest that humans are not at all unique in their ability to perceive
communication signals across modalities. Indeed, as will be described below, vocal communica-
tion is a fully integrated multi-sensori-motor system with numerous similarities between humans
and monkeys and in which the auditory cortex may serve as a key node in a larger neocortical network.

FIGURE 32.1  Exemplars of facial expressions produced concomitantly with vocalizations. Rhesus monkey coo and scream calls taken at midpoint of expressions with their corresponding spectrograms.

32.4  DYNAMIC FACES MODULATE VOICE PROCESSING IN AUDITORY CORTEX
Traditionally, the linking of vision with audition in the multisensory vocal perception described
above would be attributed to the functions of association areas such as the superior temporal sulcus
in the temporal lobe or the principal and intraparietal sulci located in the frontal and parietal lobes,
respectively. Although these regions may play important roles (see below), they are certainly not
necessary for all types of multisensory behaviors (Ettlinger and Wilson 1990), nor are they the sole
regions for multisensory convergence (Ghazanfar and Schroeder 2006; Driver and Noesselt 2008).
The auditory cortex, in particular, has many potential sources of visual inputs (Ghazanfar and
Schroeder 2006), and this is borne out in the increasing number of studies demonstrating visual
modulation of auditory cortical activity (Schroeder and Foxe 2002; Ghazanfar et al. 2005, 2008;
Bizley et al. 2007; Kayser et al. 2007, 2008). Here we focus on those auditory cortical studies inves-
tigating face/voice integration specifically.
Recordings from both primary and lateral belt auditory cortex reveal that responses to the voice
are influenced by the presence of a dynamic face (Ghazanfar et al. 2005, 2008). Monkey subjects
viewing unimodal and bimodal versions of two different species-typical vocalizations (coos and
grunts) show both enhanced and suppressed local field potential (LFP) responses in the bimodal
condition relative to the unimodal auditory condition (Ghazanfar et al. 2005). Consistent with
evoked potential studies in humans (Besle et al. 2004; van Wassenhove et al. 2005), the combination
of faces and voices led to integrative responses (significantly different from unimodal responses)
in the vast majority of auditory cortical sites—both in the primary auditory cortex and the lateral
belt auditory cortex. These data demonstrated that LFP signals in the auditory cortex are capable of multisensory integration of facial and vocal signals in monkeys (Ghazanfar et al. 2005), a finding subsequently confirmed at the single-unit level in the lateral belt cortex as well (Ghazanfar et al. 2008; Figure 32.2a).
FIGURE 32.2  (See color insert.) Single neuron examples of multisensory integration of Face + Voice
stimuli compared with Disk + Voice stimuli in lateral belt area. Left: enhanced response when voices are
coupled with faces, but no similar modulation when coupled with disks. Right: similar effects for a suppressed
response. x-Axes show time aligned to onset of face (solid line). Dashed lines indicate onset and offset of voice
signal. y-Axes depict firing rate of neuron in spikes per second. Shaded regions denote SEM.
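The comparison behind terms such as “enhanced” and “suppressed” can be made concrete with a minimal sketch. The code below assumes hypothetical arrays of trial-wise response amplitudes (e.g., mean evoked LFP amplitude or spike count in a fixed response window) for a face + voice condition and a voice-alone condition; the enhancement index and rank-sum test are generic choices for illustration and are not a reimplementation of the analyses in the studies cited above.

import numpy as np
from scipy import stats

def multisensory_enhancement(bimodal, unimodal, alpha=0.05):
    """Compare face + voice (bimodal) with voice-alone (unimodal) responses.

    bimodal, unimodal : 1-D arrays of per-trial response amplitudes
    (hypothetical units). Returns the percent change of the bimodal mean
    relative to the unimodal mean and a label based on a rank-sum test.
    """
    b, u = np.mean(bimodal), np.mean(unimodal)
    index = 100.0 * (b - u) / abs(u)          # > 0: enhancement, < 0: suppression
    _, p = stats.ranksums(bimodal, unimodal)
    label = "enhanced" if index > 0 else "suppressed"
    return index, p, (label if p < alpha else "no reliable change")

# Fabricated numbers purely for illustration.
rng = np.random.default_rng(0)
voice_only = rng.normal(10.0, 2.0, size=40)
face_voice = rng.normal(13.0, 2.0, size=40)
print(multisensory_enhancement(face_voice, voice_only))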

The specificity of face/voice integrative responses was tested by replacing the dynamic faces with
dynamic disks that mimicked the aperture and displacement of the mouth. In human psychophysical
experiments, such artificial dynamic stimuli can still lead to enhanced speech detection, but not to
the same degree as a real face (Bernstein et al. 2004; Schwartz et al. 2004). When cortical sites or
single units were tested with dynamic disks, far less integration was seen when compared to the real
monkey faces (Ghazanfar et al. 2005, 2008; Figure 32.2). This was true primarily for the lateral belt
auditory cortex (LFPs and single units) and was observed to a lesser extent in the primary auditory
cortex (LFPs only). This suggests that there may be increasingly specific influences of “extra” sen-
sory modalities as one moves away from the primary sensory regions.
Unexpectedly, grunt vocalizations were overrepresented relative to coos in terms of enhanced
multisensory LFP responses (Ghazanfar et al. 2005). As coos and grunts are both produced fre-
quently in a variety of affiliative contexts and are broadband spectrally, the differential representa-
tion cannot be attributed to experience, valence, or the frequency tuning of neurons. One remaining
possibility is that this differential representation may reflect a behaviorally relevant distinction, as
coos and grunts differ in their direction of expression and range. Coos are generally contact calls
rarely directed toward any particular individual. In contrast, grunts are often directed toward indi-
viduals in one-on-one situations, often during social approaches as in baboons and vervet monkeys
(Cheney and Seyfarth 1982; Palombit et al. 1999). Given their production at close range and context,
grunts may produce a stronger face/voice association than coo calls. This distinction appeared to be reflected in the pattern of significant multisensory responses in the auditory cortex; that is, the multisensory bias toward grunt calls may be related to the fact that grunts (relative to coos) are often produced during intimate, one-on-one social interactions.

32.5  AUDITORY CORTICAL INTERACTIONS WITH SUPERIOR TEMPORAL SULCUS MEDIATES FACE/VOICE INTEGRATION
The face-specific visual influence on the lateral belt auditory cortex raises the question of its anatomical source. Although there are multiple possible sources of visual input to auditory cortex
(Ghazanfar and Schroeder 2006), the STS is likely to be a prominent one, particularly for integrat-
ing faces and voices, for the following reasons. First, there are reciprocal connections between the
STS and the lateral belt and other parts of the auditory cortex (Barnes and Pandya 1992; Seltzer and
Pandya 1994). Second, neurons in the STS are sensitive to both faces and biological motion (Harries
and Perrett 1991; Oram and Perrett 1994). Finally, the STS is known to be multisensory (Benevento
et al. 1977; Bruce et al. 1981; Schroeder and Foxe 2002; Barraclough et al. 2005; Chandrasekaran
and Ghazanfar 2009). One way to establish whether the auditory cortex and the STS interact at the functional level is to measure their temporal correlations as a function of stimulus condition. Concurrent recordings of LFPs and spiking activity in the lateral belt of the auditory cortex and
the upper bank of the STS revealed that functional interactions, in the form of gamma band cor-
relations, between these two regions increased in strength during presentations of faces and voices
together relative to the unimodal conditions (Ghazanfar et al. 2008; Figure 32.3a). Furthermore,
these interactions were not solely modulations of response strength, as phase relationships were
significantly less variable (tighter) in the multisensory conditions (Figure 32.3b).
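One concrete (and deliberately simplified) version of this kind of measurement is sketched below: band-pass two simultaneously recorded LFP traces in a nominal gamma range, extract their instantaneous phases with the Hilbert transform, and summarize the trial-wise phase differences with a phase-locking value (the length of the mean resultant vector). The signal names, sampling rate, and band edges are assumptions for illustration; the published analyses relied on cross-spectrograms rather than this reduced pipeline.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def gamma_phase_locking(lfp_a, lfp_b, fs=1000.0, band=(40.0, 80.0)):
    """Phase-locking value between two LFP channels in a gamma band.

    lfp_a, lfp_b : arrays of shape (n_trials, n_samples), e.g., hypothetical
    auditory cortex and STS recordings aligned to voice onset. Returns a
    value in [0, 1]; larger values mean the phase difference between the
    two sites is more consistent across trials and time points.
    """
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    phase_a = np.angle(hilbert(filtfilt(b, a, lfp_a, axis=-1), axis=-1))
    phase_b = np.angle(hilbert(filtfilt(b, a, lfp_b, axis=-1), axis=-1))
    return np.abs(np.mean(np.exp(1j * (phase_a - phase_b))))

# Usage sketch (hypothetical condition-wise arrays):
# plv_face_voice = gamma_phase_locking(aud_lfp_face_voice, sts_lfp_face_voice)
# plv_voice_only = gamma_phase_locking(aud_lfp_voice, sts_lfp_voice)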
The influence of the STS on the auditory cortex was not merely on its gamma oscillations.
Spiking activity seems to be modulated, but not “driven,” by ongoing activity arising from the STS.
Three lines of evidence suggest this scenario. First, visual influences on single neurons were most
robust when in the form of dynamic faces and were only apparent when neurons had a significant
response to a vocalization (i.e., there were no overt responses to faces alone). Second, these integra-
tive responses were often “face-specific” and had a wide distribution of latencies, which suggested
that the face signal was an ongoing signal that influenced auditory responses (Ghazanfar et al.
2008). Finally, this hypothesis for an ongoing signal is supported by the sustained gamma band
activity between the auditory cortex and the STS and by a spike-field coherence analysis. This
analysis reveals that just before spiking activity in the auditory cortex, there is an increase in gamma
band power in the STS (Ghazanfar et al. 2008; Figure 32.3c).

FIGURE 32.3  (See color insert.) (a) Time–frequency plots (cross-spectrograms) illustrate modulation of
functional interactions (as a function of stimulus condition) between lateral belt auditory cortex and STS for
a population of cortical sites. x-Axes depict time in milliseconds as a function of onset of auditory signal
(solid black line). y-Axes depict frequency of oscillations in Hz. Color bar indicates amplitude of these signals
normalized by baseline mean. (b) Population phase concentration from 0 to 300 ms after voice onset. x-Axes
depict frequency in Hz. y-Axes depict average normalized phase concentration. Shaded regions denote SEM
across all electrode pairs and calls. All values are normalized by baseline mean for different frequency bands.
Right panel shows phase concentration across all calls and electrode pairs in gamma band for four conditions.
(c) Spike-field cross-spectrogram illustrates relationship between spiking activity of auditory cortical neurons
and STS local field potential across population of cortical sites. x-Axes depict time in milliseconds as a func-
tion of onset of multisensory response in auditory neuron (solid black line). y-Axes depict frequency in Hz.
Color bar denotes cross-spectral power normalized by baseline mean for different frequencies.
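A simplified way to picture the spike-field relationship summarized in Figure 32.3c is sketched below: for each auditory cortical spike, take the STS gamma-band power in a short window ending at the spike time and compare it with the power at randomly chosen control times. The variable names, window length, and comparison are illustrative assumptions, not the cross-spectral method used in the study itself.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def spike_triggered_gamma(sts_lfp, spike_times, fs=1000.0, band=(40.0, 80.0),
                          win_s=0.05, n_null=1000, seed=0):
    """Mean STS gamma power in the window preceding each auditory cortex spike.

    sts_lfp     : 1-D STS LFP trace (hypothetical continuous recording).
    spike_times : spike times (s) of a simultaneously recorded auditory neuron.
    Returns (spike-triggered mean power, mean power at random control times).
    """
    rng = np.random.default_rng(seed)
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    power = np.abs(hilbert(filtfilt(b, a, sts_lfp))) ** 2   # squared gamma envelope
    win = int(win_s * fs)

    def mean_power_before(times):
        idx = (np.asarray(times) * fs).astype(int)
        idx = idx[(idx >= win) & (idx < len(power))]
        return np.mean([power[i - win:i].mean() for i in idx])

    null_times = rng.uniform(win_s, len(power) / fs, size=n_null)
    return mean_power_before(spike_times), mean_power_before(null_times)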
Both the auditory cortex and the STS have multiple bands of oscillatory activity generated in
responses to stimuli that may mediate different functions (Lakatos et al. 2005; Chandrasekaran
and Ghazanfar 2009). Thus, interactions between the auditory cortex and the STS are not lim-
ited to spiking activity and high frequency gamma oscillations. Below 20 Hz, and in response to
naturalistic audiovisual stimuli, there are directed interactions from the auditory cortex to the STS,
whereas above 20 Hz (but below the gamma range), there are directed interactions from the STS to
the auditory cortex (Kayser and Logothetis 2009). Given that different frequency bands in the STS
integrate faces and voices in distinct ways (Chandrasekaran and Ghazanfar 2009), it is possible that
these lower frequency interactions between the STS and the auditory cortex also represent distinct
multisensory processing channels.
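A rough way to ask which area leads in a given band is to cross-correlate the band-limited power envelopes of the two sites and note the sign of the lag at the correlation peak. The sketch below does only that, with assumed names and parameters; it is a crude stand-in for the formal directed-interaction measures used in the work cited above, not a substitute for them.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def peak_lag_ms(lfp_a, lfp_b, fs=1000.0, band=(10.0, 20.0), max_lag_ms=200):
    """Lag (ms) at which the band-limited power envelope of site A best
    matches that of site B. Positive values suggest A leads B; negative
    values suggest the reverse. Band edges and names are illustrative.
    """
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    env_a = np.abs(hilbert(filtfilt(b, a, lfp_a)))
    env_b = np.abs(hilbert(filtfilt(b, a, lfp_b)))
    env_a = (env_a - env_a.mean()) / env_a.std()
    env_b = (env_b - env_b.mean()) / env_b.std()
    max_lag = int(max_lag_ms * fs / 1000.0)
    lags = np.arange(-max_lag, max_lag + 1)
    xc = [np.corrcoef(env_a[max(0, -l):len(env_a) - max(0, l)],
                      env_b[max(0, l):len(env_b) - max(0, -l)])[0, 1]
          for l in lags]
    return lags[int(np.argmax(xc))] * 1000.0 / fs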
Two things should be noted here. The first is that functional interactions between the STS and the
auditory cortex are not likely to occur solely during the presentation of faces with voices. Other con-
gruent, behaviorally salient audiovisual events such as looming signals (Maier et al. 2004; Gordon
and Rosenblum 2005; Cappe et al. 2009) or other temporally coincident signals may elicit similar
functional interactions (Noesselt et al. 2007; Maier et al. 2008). The second is that there are other areas that, consistent with their connectivity and response properties (e.g., sensitivity to faces and voices), could also (and very likely do) have a visual influence on the auditory cortex. These include
the ventrolateral prefrontal cortex (Romanski et al. 2005; Sugihara et al. 2006) and the amygdala
(Gothard et al. 2007; Kuraoka and Nakamura 2007).

32.6  VIEWING VOCALIZING CONSPECIFICS


Humans and other primates readily link facial expressions with appropriate, congruent vocal
expressions. The cues they use to make such matches are not known. One method for investigating such behavioral strategies is the measurement of eye movement patterns. When human subjects are given no task or instruction regarding which acoustic cues to attend to, they will consistently look
at the eye region more than the mouth when viewing videos of human speakers (Klin et al. 2002).
Macaque monkeys exhibit the exact same strategy. The eye movement patterns of monkeys viewing
conspecifics producing vocalizations reveal that monkeys spend most of their time inspecting the
eye region relative to the mouth (Ghazanfar et al. 2006; Figure 32.4a). When they did fixate on the
mouth, it was highly correlated with the onset of mouth movements (Figure 32.4b). This, too, was
highly reminiscent of human strategies: subjects asked to identify words increased their fixations
onto the mouth region with the onset of facial motion (Lansing and McConkie 2003).
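The two measurements just described (where fixations land, and how mouth-directed fixations relate to mouth movements) amount to simple bookkeeping plus a correlation, as in the sketch below. The rectangle coordinates, data structures, and variable names are invented for illustration and are not taken from the study itself.

import numpy as np
from scipy import stats

def in_roi(x, y, roi):
    """roi = (x_min, y_min, x_max, y_max) in screen coordinates (hypothetical)."""
    return (roi[0] <= x <= roi[2]) and (roi[1] <= y <= roi[3])

def fixation_summary(fixations, eye_roi, mouth_roi, mouth_motion_onsets):
    """fixations: list of (t_onset_s, x, y) tuples from an eye tracker.

    Returns the percentage of fixations landing in the eye and mouth regions
    and the correlation between mouth-fixation onsets and the nearest mouth
    movement onset.
    """
    eye = [f for f in fixations if in_roi(f[1], f[2], eye_roi)]
    mouth = [f for f in fixations if in_roi(f[1], f[2], mouth_roi)]
    pct_eye = 100.0 * len(eye) / len(fixations)
    pct_mouth = 100.0 * len(mouth) / len(fixations)

    # Pair each mouth fixation with the closest mouth-movement onset.
    onsets = np.asarray(mouth_motion_onsets, dtype=float)
    fix_t = np.array([f[0] for f in mouth])
    nearest = onsets[np.argmin(np.abs(onsets[None, :] - fix_t[:, None]), axis=1)]
    r, p = stats.pearsonr(fix_t, nearest)
    return pct_eye, pct_mouth, r, p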
Somewhat surprisingly, activity in both primary auditory cortex and belt areas is influenced by
eye position. When the spatial tuning of primary auditory cortical neurons is measured with the eyes
gazing in different directions, ~30% of the neurons are affected by the position of the eyes (Werner-
Reiss et al. 2003). Similarly, when LFP-derived current-source density activity was measured from
the auditory cortex (both primary auditory cortex and caudal belt regions), eye position significantly
modulated auditory-evoked amplitude in about 80% of sites (Fu et al. 2004). These eye-position
effects occurred mainly in the upper cortical layers, suggesting that the signal is fed back from
another cortical area. One possible source is the frontal eye field in the frontal lobes, the medial portion of which generates relatively long saccades (Robinson and Fuchs 1969) and is interconnected with both the STS (Seltzer and Pandya 1989; Schall et al. 1995) and multiple regions of the auditory cortex (Schall et al. 1995; Hackett et al. 1999; Romanski et al. 1999).
It does not take a huge stretch of the imagination to link these auditory cortical processes to
the oculomotor strategy for looking at vocalizing faces. A dynamic, vocalizing face is a complex
sequence of sensory events, but one that elicits fairly stereotypical eye movements: we and other pri-
mates fixate on the eyes but then saccade to the mouth when it moves before saccading back to the
eyes. Is there a simple scenario that could link the proprioceptive eye position effects in the auditory
cortex with its face/voice integrative properties (Ghazanfar and Chandrasekaran 2007)? Reframing
(ever so slightly) the hypothesis of Schroeder and colleagues (Lakatos et al. 2007; Schroeder et al.
2008), one possibility is that the fixations at the onset of mouth movements send a signal to the auditory cortex, which resets the phase of an ongoing oscillation. This proprioceptive signal thus primes the auditory cortex to amplify or suppress (depending on its timing) a subsequent auditory signal originating from the mouth. Given that mouth movements precede the voiced components of both human (Abry et al. 1996) and monkey vocalizations (Ghazanfar et al. 2005; Chandrasekaran and Ghazanfar 2009), the temporal order of visual to proprioceptive to auditory signals is consistent with this idea. This hypothesis is also supported (although indirectly) by the finding that the sign of face/voice integration in the auditory cortex and the STS is influenced by the timing of mouth movements relative to the onset of the voice (Ghazanfar et al. 2005; Chandrasekaran and Ghazanfar 2009).

FIGURE 32.4  (a) Average fixation on eye region versus mouth region across three subjects while viewing a 30-s video of a vocalizing conspecific. Audio track had no influence on proportion of fixations falling onto mouth or eye region. Error bars represent SEM. (b) We also find that when monkeys do saccade to the mouth region, it is tightly correlated with onset of mouth movements (r = .997, p < .00001).
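The phase-resetting scenario outlined just before Figure 32.4 can also be illustrated with a toy simulation: an ongoing low-frequency oscillation has a random phase on each trial unless it is reset at “fixation onset,” and the amplitude of a later “auditory” response is scaled by the oscillation's excitability at the moment the sound arrives. The frequency, delays, and gain rule below are arbitrary choices made only to show how reset timing could translate into amplification or suppression; this is not a model drawn from the studies cited.

import numpy as np

def simulate_phase_reset(delay_s, freq_hz=8.0, n_trials=200, reset=True, seed=0):
    """Mean response gain for a sound arriving delay_s after fixation onset,
    with or without a phase reset of an ongoing oscillation at fixation.

    Excitability is modeled as (1 + cos(phase)) / 2, so a sound arriving at
    the preferred phase is amplified and one at the opposite phase is
    suppressed. Purely illustrative parameters.
    """
    rng = np.random.default_rng(seed)
    # Phase at fixation onset: 0 on every trial if reset, otherwise random.
    phase0 = np.zeros(n_trials) if reset else rng.uniform(0, 2 * np.pi, n_trials)
    phase_at_sound = phase0 + 2 * np.pi * freq_hz * delay_s
    gain = (1.0 + np.cos(phase_at_sound)) / 2.0
    return gain.mean()

# With a reset, the mean gain swings with the visual-to-auditory delay
# (1.0 at 0 ms, ~0.0 at half a cycle, 1.0 at a full cycle for 8 Hz);
# without a reset, random phases wash the modulation out (~0.5 at any delay).
for delay in (0.0, 0.0625, 0.125):
    print(delay, simulate_phase_reset(delay, reset=True),
          simulate_phase_reset(delay, reset=False))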

32.7  SOMATOSENSORY FEEDBACK DURING VOCAL COMMUNICATION


Numerous lines of both physiological and anatomical evidence demonstrate that at least some
regions of the auditory cortex respond to touch as well as sound (Schroeder and Foxe 2002; Fu et al.
2003; Kayser et al. 2005; Hackett et al. 2007a, 2007b; Lakatos et al. 2007; Smiley et al. 2007). Yet,
the sense of touch is not something we normally associate with vocal communication. It can, how-
ever, influence what we hear under certain circumstances. For example, kinesthetic feedback from
one’s own speech movements also integrates with heard speech (Sams et al. 2005). More directly,
if a robotic device is used to artificially deform the facial skin of subjects in a way that mimics the
deformation seen during speech production, then subjects actually hear speech differently (Ito et
al. 2009). Surprisingly, there is a systematic perceptual variation with speechlike patterns of skin
deformation that implicates a robust somatosensory influence on auditory processes under normal
conditions (Ito et al. 2009).
The somatosensory system’s influence on the auditory system may also occur during vocal learn-
ing. When a mechanical load is applied to the jaw, causing a slight protrusion, as subjects repeat
words (“saw,” “say,” “sass,” and “sane”), it can alter somatosensory feedback without changing
the acoustics of the words (Tremblay et al. 2003). Measuring adaptation in the jaw trajectory after
many trials revealed that subjects learn to change their jaw trajectories so that they are similar to
the preload trajectory—despite not hearing anything different. This strongly implicates a role for
somatosensory feedback that parallels the role for auditory feedback in guiding vocal production
(Jones and Munhall 2003, 2005). Indeed, the very same learning effects are observed with deaf
subjects when they turn their hearing aids off (Nasir and Ostry 2008).
Although the substrates for these somatosensory–auditory effects have not been explored, inter-
actions between the somatosensory system and the auditory cortex seem like a likely source for the
phenomena described above for the following reasons. First, many auditory cortical fields respond
to, or are modulated by, tactile inputs (Schroeder et al. 2001; Fu et al. 2003; Kayser et al. 2005).
Second, there are intercortical connections between somatosensory areas and the auditory cortex
(Cappe and Barone 2005; de la Mothe et al. 2006; Smiley et al. 2007). Third, the caudomedial
auditory area CM, where many auditory–tactile responses seem to converge, is directly connected
to somatosensory areas in the retroinsular cortex and the granular insula (de la Mothe et al. 2006;
Smiley et al. 2006). Oddly enough, a parallel influence of audition on somatosensory areas has also
been reported: neurons in the “somatosensory” insula readily and selectively respond to vocaliza-
tions (Beiser 1998; Remedios et al. 2009). Finally, the tactile receptive fields of neurons in auditory
cortical area CM are confined to the upper body, primarily the face and neck regions (areas consist-
ing of, or covering, the vocal tract) (Fu et al. 2003), and the primary somatosensory cortical (area 3b)
representation for the tongue (a vocal tract articulator) projects to auditory areas in the lower bank of
the lateral sulcus (Iyengar et al. 2007). All of these facts lend further credibility to the putative role
of somatosensory–auditory interactions during vocal production and perception.
Like humans, other primates also adjust their vocal output according to what they hear. For
example, macaques, marmosets (Callithrix jacchus), and cotton-top tamarins (Saguinus oedipus)
adjust the loudness, timing, and acoustic structure of their vocalizations depending on background
noise levels and patterns (Sinnott et al. 1975; Brumm et al. 2004; Egnor and Hauser 2006; Egnor
et al. 2006, 2007). The specific number of syllables and temporal modulations in heard conspecific
calls can also differentially trigger vocal production in tamarins (Ghazanfar et al. 2001, 2002).
Thus, auditory feedback is also very important for nonhuman primates, and altering such feedback
can influence neurons in the auditory cortex (Eliades and Wang 2008). At this time, however, no
experiments have been conducted to investigate whether somatosensory feedback plays a role in
influencing vocal feedback. The neurophysiological and neuroanatomical data described above sug-
gest that it is not unreasonable to think that it does.

32.8  EMERGENCE OF MULTISENSORY SYSTEMS FOR COMMUNICATION


The behavioral and neurobiological data and speculation described above raise the question of how such an integrated system might emerge ontogenetically. Although there are numerous studies on the
development of multisensory processes in humans (see Lewkowicz and Lickliter 1994 for review),
there are only a handful of reports for primates (Gunderson 1983; Gunderson et al. 1990; Adachi et
al. 2006; Batterson et al. 2008; Zangehenpour et al. 2008). Given that monkeys and humans develop
at different rates, it is important to know how this might influence the behavior and neural circuitry
underlying multisensory communication. Furthermore, there is only one neurobiological study of
multisensory integration in the developing primate (Wallace and Stein 2001). This study suggests
that although neurons in the newborn macaque monkey may respond to more than one modality,
they are unable to integrate them—that is, they do not produce enhanced responses to bimodal
stimulation as they do in adult monkeys. Taken together, these findings suggest that an interaction between developmental timing (heterochrony) and social experience may shape the neural circuits underlying both human and nonhuman primate vocal communication.
Three lines of evidence demonstrate that the rate of neural development in Old World monkeys
is faster than in humans and that, as a result, they are neurologically precocial relative to human
infants. First, in terms of overall brain size at birth, Old World monkeys are among the most pre-
cocial of all mammals (Sacher and Staffeldt 1974), possessing ~65% of their brain size at birth
compared to only ~25% for human infants (Sacher and Staffeldt 1974; Malkova et al. 2006). Second,
fiber pathways in the developing monkey brain are more heavily myelinated than in the human brain
at the same postnatal age (Gibson 1991), suggesting that postnatal myelination in the rhesus monkey
brain is about three to four times faster than in the human brain (Gibson 1991; Malkova et al. 2006).
All sensorimotor tracts are heavily myelinated by 2 to 3 months after birth in rhesus monkeys, but
not until 8 to 12 months after birth in human infants. Finally, at the behavioral level, the differential
patterns of brain growth in the two species lead to differential timing in the emergence of species-
specific motor, socioemotional, and cognitive abilities (Antinucci 1989; Konner 1991).
The heterochrony of neural and behavioral development across different primate species raises
the possibility that the development of multisensory integration may be different in monkeys rela-
tive to humans. In particular, Turkewitz and Kenny (1982) suggested that the neural limitations
imposed by the relatively slow rate of neural development in human infants may actually be advan-
tageous because the limitations may provide them with greater functional plasticity. This, in turn,
may make human infants initially more sensitive to a broader range of sensory stimulation and to
the relations among multisensory inputs. This theoretical observation has received empirical sup-
port from studies showing that infants go through a process of “perceptual narrowing” in their
processing of unisensory as well as multisensory information, that is, where initially they exhibit
broad sensory tuning, they later exhibit narrower tuning. For example, 4- to 6-month-old human
infants can match rhesus monkey faces and voices, but 8- to 10-month-old infants no longer do so
(Lewkowicz and Ghazanfar 2006). These findings suggest that as human infants acquire increas-
ingly greater experience with conspecific human faces and vocalizations—but none with hetero-
specific faces and vocalizations—their sensory tuning (and their neural systems) narrows to match
their early experience.
If a relatively immature state of neural development leaves a developing organism more “open”
to the effects of early sensory experience, then it stands to reason that the more advanced state of
neural development in monkeys might result in a different outcome. In support of this, a study of
infant vervet monkeys that was identical in design to the human infant study of cross-species mul-
tisensory matching (Lewkowicz and Ghazanfar 2006) revealed that, unlike human infants, they
exhibit no evidence of perceptual narrowing (Zangehenpour et al. 2008). That is, the infant vervet
monkeys could match faces and voices of rhesus monkeys despite the fact that they had no prior
experience with macaque monkeys and that they continued to do so well beyond the ages where
such matching ability declines in human infants (Zangehenpour et al. 2008). The reason for this
lack of perceptual narrowing may lie in the precocial neurological development of this Old World
monkey species.
These comparative developmental data reveal that although monkeys and humans may appear
to share similarities at the behavioral and neural levels, their different developmental trajectories
are likely to reveal important differences. It is important to keep this in mind when making claims
about homologies at either of these levels.

32.9  CONCLUSIONS
The overwhelming evidence from the studies reviewed here, and from numerous other studies across different domains of neuroscience, converges on the idea that the neocortex is fundamentally multisensory (Ghazanfar and Schroeder 2006). This is not terribly surprising given that the sensory experiences of humans and other animals are profoundly multimodal. This does not mean, however, that every cortical area is uniformly multisensory. Indeed, I hope that the role of the auditory cortex in vocal communication reviewed above illustrates that cortical areas may be weighted differently by “extra”-modal inputs depending on the task at hand and its context.

ACKNOWLEDGMENTS
The author gratefully acknowledges the scientific contributions and numerous discussions with
the following people: Chand Chandrasekaran, Kari Hoffman, David Lewkowicz, Joost Maier,
and Hjalmar Turesson. This work was supported by NIH R01NS054898 and NSF BCS-0547760
CAREER Award.
REFERENCES
Abry, C., M.-T. Lallouache, and M.-A. Cathiard. 1996. How can coarticulation models account for speech sen-
sitivity in audio-visual desynchronization? In Speechreading by humans and machines: Models, systems
and applications, ed. D. Stork and M. Henneke, 247–255. Berlin: Springer-Verlag.
Adachi, I., H. Kuwahata, K. Fujita, M. Tomonaga, and T. Matsuzawa. 2006. Japanese macaques form a cross-
modal representation of their own species in their first year of life. Primates 47: 350–354.
Antinucci, F. 1989. Systematic comparison of early sensorimotor development. In Cognitive structure and devel-
opment in nonhuman primates, ed. F. Antinucci, 67–85. Hillsday, NJ: Lawrence Erlbaum Associates.
Barnes, C. L., and D. N. Pandya. 1992. Efferent cortical connections of multimodal cortex of the superior tem-
poral sulcus in the rhesus-monkey. Journal of Comparative Neurology 318: 222–244.
Barraclough, N. E., D. Xiao, C. I. Baker, M. W. Oram, and D. I. Perrett. 2005. Integration of visual and auditory
information by superior temporal sulcus neurons responsive to the sight of actions. Journal of Cognitive
Neuroscience 17: 377–391.
Batterson, V. G., S. A. Rose, A. Yonas, K. S. Grant, and G. P. Sackett. 2008. The effect of experience on the
development of tactual–visual transfer in pigtailed macaque monkeys. Developmental Psychobiology
50: 88–96.
Beiser, A. 1998. Processing of twitter-call fundamental frequencies in insula and auditory cortex of squirrel
monkeys. Experimental Brain Research 122: 139–148.
Benevento, L. A., J. Fallon, B. J. Davis, and M. Rezak. 1977. Auditory–visual interactions in single cells
in the cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey.
Experimental Neurology 57: 849–872.
Bernstein, L. E., E. T. Auer, and S. Takayanagi. 2004. Auditory speech detection in noise enhanced by lipread-
ing. Speech Communication 44: 5–18.
Besle, J., A. Fort, C. Delpuech, and M. H. Giard. 2004. Bimodal speech: Early suppressive visual effects in
human auditory cortex. European Journal of Neuroscience 20: 2225–2234.
Bizley, J. K., F. R. Nodal, V. M. Bajo, I. Nelken, and A. J. King. 2007. Physiological and anatomical evidence
for multisensory interactions in auditory cortex. Cerebral Cortex 17: 2172–2189.
Bruce, C., R. Desimone, and C. G. Gross. 1981. Visual properties of neurons in a polysensory area in superior
temporal sulcus of the macaque. Journal of Neurophysiology 46: 369–384.
Brumm, H., K. Voss, I. Kollmer, and D. Todt. 2004. Acoustic communication in noise: Regulation of call char-
acteristics in a New World monkey. Journal of Experimental Biology 207: 443–448.
Cappe, C., and P. Barone. 2005. Heteromodal connections supporting multisensory integration at low levels of
cortical processing in the monkey. European Journal of Neuroscience 22: 2886–2902.
Cappe, C., G. Thut, V. Romei, and M. M. Murray. 2009. Selective integration of auditory–visual looming cues
by humans. Neuropsychologia 47: 1045–1052.
Chandrasekaran, C., and A. A. Ghazanfar. 2009. Different neural frequency bands integrate faces and voices
differently in the superior temporal sulcus. Journal of Neurophysiology 101: 773–788.
Cheney, D. L., and R. M. Seyfarth. 1982. How vervet monkeys perceive their grunts—Field playback experi-
ments. Animal Behaviour 30: 739–751.
De La Mothe, L. A., S. Blumell, Y. Kajikawa, and T. A. Hackett. 2006. Cortical connections of the auditory
cortex in marmoset monkeys: Core and medial belt regions. Journal of Comparative Neurology 496:
27–71.
Driver, J., and T. Noesselt. 2008. Multisensory interplay reveals crossmodal influences on ‘sensory-specific’
brain regions, neural responses, and judgments. Neuron 57: 11–23.
Egnor, S. E. R., and M. D. Hauser. 2006. Noise-induced vocal modulation in cotton-top tamarins (Saguinus
oedipus). American Journal of Primatology 68: 1183–1190.
Egnor, S. E. R., C. G. Iguina, and M. D. Hauser. 2006. Perturbation of auditory feedback causes system-
atic perturbation in vocal structure in adult cotton-top tamarins. Journal of Experimental Biology 209:
3652–3663.
Egnor, S. E. R., J. G. Wickelgren, and M. D. Hauser. 2007. Tracking silence: Adjusting vocal production to
avoid acoustic interference. Journal of Comparative Physiology A–Neuroethology Sensory Neural and
Behavioral Physiology 193: 477–483.
Eliades, S. J., and X. Q. Wang. 2008. Neural substrates of vocalization feedback monitoring in primate auditory
cortex. Nature 453: 1102–1107.
Ettlinger, G., and W. A. Wilson. 1990. Cross-modal performance: Behavioural processes, phylogenetic consid-
erations and neural mechanisms. Behavioural Brain Research 40: 169–192.
Evans, T. A., S. Howell, and G. C. Westergaard. 2005. Auditory–visual cross-modal perception of communica-
tive stimuli in tufted capuchin monkeys (Cebus apella). Journal of Experimental Psychology—Animal
Behavior Processes 31: 399–406.
Fitch, W. T. 1997. Vocal tract length and formant frequency dispersion correlate with body size in rhesus
macaques. Journal of the Acoustical Society of America 102: 1213–1222.
Fitch, W. T., and M. D. Hauser. 1995. Vocal production in nonhuman primates—Acoustics, physiology, and
functional constraints on honest advertisement. American Journal of Primatology 37: 191–219.
Fowler, C. A. 2004. Speech as a supramodal or amodal phenomenon. In The handbook of multisensory pro-
cesses, ed. G.A. Calvert, C. Spence, and B.E. Stein, 189–201. Cambridge, MA: MIT Press.
Fu, K. M. G., T. A. Johnston, A. S. Shah, L. Arnold, J. Smiley, T. A. Hackett, P. E. Garraghty, and C. E. Schroeder.
2003. Auditory cortical neurons respond to somatosensory stimulation. Journal of Neuroscience 23:
7510–7515.
Fu, K. M. G., A. S. Shah, M. N. O’Connell, T. McGinnis, H. Eckholdt, P. Lakatos, J. Smiley, and C. E. Schroeder.
2004. Timing and laminar profile of eye-position effects on auditory responses in primate auditory cor-
tex. Journal of Neurophysiology 92: 3522–3531.
Geschwind, N. 1965a. Disconnexion syndromes in animals and man, Part I. Brain 88: 237–294.
Geschwind, N. 1965b. Disconnexion syndromes in animals and man, Part II. Brain 88: 585–644.
Ghazanfar, A. A., C. Chandrasekaran, and N. K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience 28: 4457–4469.
Ghazanfar, A. A., and C. F. Chandrasekaran. 2007. Paving the way forward: Integrating the senses through
phase-resetting of cortical oscillations. Neuron 53: 162–164.
Ghazanfar, A. A., J. I. Flombaum, C. T. Miller, and M. D. Hauser. 2001. The units of perception in the antipho-
nal calling behavior of cotton-top tamarins (Saguinus oedipus): Playback experiments with long calls.
Journal of Comparative Physiology A – Neuroethology Sensory Neural and Behavioral Physiology 187:
27–35.
Ghazanfar, A. A., and N. K. Logothetis. 2003. Facial expressions linked to monkey calls. Nature 423:
937–938.
Ghazanfar, A. A., J. X. Maier, K. L. Hoffman, and N. K. Logothetis. 2005. Multisensory integration of dynamic
faces and voices in rhesus monkey auditory cortex. Journal of Neuroscience 25: 5004–5012.
Ghazanfar, A. A., K. Nielsen, and N. K. Logothetis. 2006. Eye movements of monkeys viewing vocalizing
conspecifics. Cognition 101: 515–529.
Ghazanfar, A. A., and D. Rendall. 2008. Evolution of human vocal production. Current Biology 18:
R457–R460.
Ghazanfar, A. A., and C. E. Schroeder. 2006. Is neocortex essentially multisensory? Trends in Cognitive
Sciences 10: 278–285.
Ghazanfar, A. A., D. Smith-Rohrberg, A. A. Pollen, and M. D. Hauser. 2002. Temporal cues in the antiphonal
long-calling behaviour of cottontop tamarins. Animal Behaviour 64: 427–438.
Ghazanfar, A. A., H. K. Turesson, J. X. Maier, R. Van Dinther, R. D. Patterson, and N. K. Logothetis. 2007.
Vocal tract resonances as indexical cues in rhesus monkeys. Current Biology 17: 425–430.
Gibson, K. R. 1991. Myelination and behavioral development: A comparative perspective on questions of
neoteny, altriciality and intelligence. In Brain maturation and cognitive development: Comparative
and cross-cultural perspectives, ed. K. R. Gibson and A. C. Petersen, 29–63. New York: Aldine de
Gruyter.
Gogate, L. J., A. S. Walker-Andrews, and L. E. Bahrick. 2001. The intersensory origins of word comprehen-
sion: An ecological–dynamic systems view. Developmental Science 4: 1–18.
Gordon, M. S., and L. D. Rosenblum. 2005. Effects of intrastimulus modality change on audiovisual time-to-
arrival judgments. Perception and Psychophysics 67: 580–594.
Gothard, K. M., F. P. Battaglia, C. A. Erickson, K. M. Spitler, and D. G. Amaral. 2007. Neural responses to facial
expression and face identity in the monkey amygdala. Journal of Neurophysiology 97: 1671–1683.
Gunderson, V. M. 1983. Development of cross-modal recognition in infant pigtail monkeys (Macaca nemes-
trina). Developmental Psychology 19: 398–404.
Gunderson, V. M., S. A. Rose, and K. S. Grantwebster. 1990. Cross-modal transfer in high-risk and low-risk
infant pigtailed macaque monkeys. Developmental Psychology 26: 576–581.
Hackett, T. A., L. A. de La Mothe, I. Ulbert, G. Karmos, J. Smiley, and C. E. Schroeder. 2007a. Multisensory
convergence in auditory cortex: II. Thalamocortical connections of the caudal superior temporal plane.
Journal of Comparative Neurology 502: 924–952.
Hackett, T. A., J. F. Smiley, I. Ulbert, G. Karmos, P. Lakatos, L. A. de La Mothe, and C. E. Schroeder. 2007b.
Sources of somatosensory input to the caudal belt areas of auditory cortex. Perception 36: 1419–1430.
Hackett, T. A., I. Stepniewska, and J. H. Kaas. 1999. Prefrontal connections of the parabelt auditory cortex in
macaque monkeys. Brain Research 817: 45–58.
Harries, M. H., and D. I. Perrett. 1991. Visual processing of faces in temporal cortex—Physiological evidence for
a modular organization and possible anatomical correlates. Journal of Cognitive Neuroscience 3: 9–24.
Hauser, M. D., N. Chomsky, and W. Fitch. 2002. The faculty of language: What is it, who has it, and how did
it evolve? Science 298: 1569–1579.
Hauser, M. D., C. S. Evans, and P. Marler. 1993. The role of articulation in the production of rhesus monkey, Macaca mulatta, vocalizations. Animal Behaviour 45: 423–433.
Hauser, M. D., and M. S. Ybarra. 1994. The role of lip configuration in monkey vocalizations—Experiments
using xylocaine as a nerve block. Brain and Language 46: 232–244.
Ito, T., M. Tiede, and D. J. Ostry. 2009. Somatosensory function in speech perception. Proceedings of the
National Academy of Sciences of the United States of America 106: 1245–1248.
Iyengar, S., H. Qi, N. Jain, and J. H. Kaas. 2007. Cortical and thalamic connections of the representations of
the teeth and tongue in somatosensory cortex of new world monkeys. Journal of Comparative Neurology
501: 95–120.
Izumi, A., and S. Kojima. 2004. Matching vocalizations to vocalizing faces in a chimpanzee (Pan troglodytes).
Animal Cognition 7: 179–184.
Jiang, J. T., A. Alwan, P. A. Keating, E. T. Auer, and L. E. Bernstein. 2002. On the relationship between face
movements, tongue movements, and speech acoustics. EURASIP Journal of Applied Signal Processing
1174–1188.
Jones, J. A., and K. G. Munhall. 2003. Learning to produce speech with an altered vocal tract: The role of audi-
tory feedback. Journal of the Acoustical Society of America 113: 532–543.
Jones, J. A., and K. G. Munhall. 2005. Remapping auditory–motor representations in voice production. Current
Biology 15: 1768–1772.
Jordan, K. E., E. M. Brannon, N. K. Logothetis, and A. A. Ghazanfar. 2005. Monkeys match the number of
voices they hear with the number of faces they see. Current Biology 15: 1034–1038.
Kayser, C., and N. K. Logothetis. 2009. Directed interactions between auditory and superior temporal cor-
tices and their role in sensory integration. Frontiers in Integrative Neuroscience 3: 7. doi: 10.3389/
neuro.07.007.2009.
Kayser, C., C. I. Petkov, M. Augath, and N. K. Logothetis. 2005. Integration of touch and sound in auditory
cortex. Neuron 48: 373–384.
Kayser, C., C. I. Petkov, M. Augath, and N. K. Logothetis. 2007. Functional imaging reveals visual modulation
of specific fields in auditory cortex. Journal of Neuroscience 27: 1824–1835.
Kayser, C., C. I. Petkov, and N. K. Logothetis. 2008. Visual modulation of neurons in auditory cortex. Cerebral
Cortex 18: 1560–1574.
Klin, A., W. Jones, R. Schultz, F. Volkmar, and D. Cohen. 2002. Visual fixation patterns during viewing of
naturalistic social situations as predictors of social competence in individuals with autism. Archives of
General Psychiatry 59: 809–816.
Konner, M. 1991. Universals of behavioral development in relation to brain myelination. In Brain maturation
and cognitive development: Comparative and cross-cultural perspectives, ed. K. R. Gibson and A. C.
Petersen, 181–223. New York: Aldine de Gruyter.
Kuhl, P. K., K. A. Williams, and A. N. Meltzoff. 1991. Cross-modal speech perception in adults and infants
using nonspeech auditory stimuli. Journal of Experimental Psychology: Human Perception and Performance 17: 829–840.
Kuraoka, K., and K. Nakamura. 2007. Responses of single neurons in monkey amygdala to facial and vocal
emotions. Journal of Neurophysiology 97: 1379–1387.
Lakatos, P., C.-M. Chen, M. N. O’Connell, A. Mills, and C. E. Schroeder. 2007. Neuronal oscillations and
multisensory interaction in primary auditory cortex. Neuron 53: 279–292.
Lakatos, P., A. S. Shah, K. H. Knuth, I. Ulbert, G. Karmos, and C. E. Schroeder. 2005. An oscillatory hier-
archy controlling neuronal excitability and stimulus processing in the auditory cortex. Journal of
Neurophysiology 94: 1904–1911.
Lansing, I. R., and G. W. McConkie. 2003. Word identification and eye fixation locations in visual and visual-
plus-auditory presentations of spoken sentences. Perception and Psychophysics 65: 536–552.
Lewkowicz, D. J., and A. A. Ghazanfar. 2006. The decline of cross-species intersensory perception in
human infants. Proceedings of the National Academy of Sciences of the United States of America 103:
6771–6774.
Lewkowicz, D. J., and R. Lickliter. 1994. The development of intersensory perception: Comparative perspec-
tives. Hillsdale, NJ: Lawrence Erlbaum Associates.
Liberman, A. M., and I. Mattingly. 1985. The motor theory revised. Cognition 21: 1–36.
Maier, J. X., C. Chandrasekaran, and A. A. Ghazanfar. 2008. Integration of bimodal looming signals through
neuronal coherence in the temporal lobe. Current Biology 18: 963–968.
Maier, J. X., J. G. Neuhoff, N. K. Logothetis, and A. A. Ghazanfar. 2004. Multisensory integration of looming
signals by Rhesus monkeys. Neuron 43: 177–181.
Malkova, L., E. Heuer, and R. C. Saunders. 2006. Longitudinal magnetic resonance imaging study of rhesus
monkey brain development. European Journal of Neuroscience 24: 3204–3212.
McGurk, H., and J. MacDonald. 1976. Hearing lips and seeing voices. Nature 264: 746–748.
Meltzoff, A. N., and M. Moore. 1997. Explaining facial imitation: A theoretical model. Early Development and
Parenting 6: 179–192.
Nasir, S. M., and D. J. Ostry. 2008. Speech motor learning in profoundly deaf adults. Nature Neuroscience 11:
1217–1222.
Noesselt, T., J. W. Rieger, M. A. Schoenfeld, M. Kanowski, H. Hinrichs, H.-J. Heinze, and J. Driver. 2007.
Audiovisual temporal correspondence modulates human multisensory superior temporal sulcus plus pri-
mary sensory cortices. Journal of Neuroscience 27: 11431–11441.
Oram, M. W., and D. I. Perrett. 1994. Responses of anterior superior temporal polysensory (STPa) neurons to
biological motion stimuli. Journal of Cognitive Neuroscience 6: 99–116.
Palombit, R. A., D. L. Cheney, and R. M. Seyfarth. 1999. Male grunts as mediators of social interaction with
females in wild chacma baboons (Papio cynocephalus ursinus). Behaviour 136: 221–242.
Parr, L. A. 2004. Perceptual biases for multimodal cues in chimpanzee (Pan troglodytes) affect recognition.
Animal Cognition 7: 171–178.
Patterson, M. L., and J. F. Werker. 2003. Two-month-old infants match phonetic information in lips and voice.
Developmental Science 6: 191–196.
Remedios, R., N. K. Logothetis, and C. Kayser. 2009. An auditory region in the primate insular cortex respond-
ing preferentially to vocal communication sounds. Journal of Neuroscience 29: 1034–1045.
Robinson, D. A., and A. F. Fuchs. 1969. Eye movements evoked by stimulation of frontal eye fields. Journal of
Neurophysiology 32: 637–648.
Romanski, L. M., B. B. Averbeck, and M. Diltz. 2005. Neural representation of vocalizations in the primate
ventrolateral prefrontal cortex. Journal of Neurophysiology 93: 734–747.
Romanski, L. M., J. F. Bates, and P. S. Goldman-Rakic. 1999. Auditory belt and parabelt projections to the
prefrontal cortex in the rhesus monkey. Journal of Comparative Neurology 403: 141–157.
Romanski, L. M., and A. A. Ghazanfar. 2009. The primate frontal and temporal lobes and their role in multi-
sensory vocal communication. In Primate neuroethology, ed. M. L. Platt and A. A. Ghazanfar. Oxford:
Oxford Univ. Press.
Rosenblum, L. D. 2005. Primacy of multimodal speech perception. In Handbook of speech perception, ed.
D. B. Pisoni and R. E. Remez, 51–78. Malden, MA: Blackwell.
Sacher, G. A., and E. F. Staffeldt. 1974. Relation of gestation time to brain weight for placental mammals:
Implications for the theory of vertebrate growth. American Naturalist 108: 593–615.
Sams, M., R. Mottonen, and T. Sihvonen. 2005. Seeing and hearing others and oneself talk. Cognitive Brain
Research 23: 429–435.
Schall, J. D., A. Morel, D. J. King, and J. Bullier. 1995. Topography of visual cortex connections with frontal
eye field in macaque: Convergence and segregation of processing streams. Journal of Neuroscience 15:
4464–4487.
Schroeder, C. E., and J. J. Foxe, 2002. The timing and laminar profile of converging inputs to multisensory
areas of the macaque neocortex. Cognitive Brain Research 14: 187–198.
Schroeder, C. E., P. Lakatos, Y. Kajikawa, S. Partan, and A. Puce. 2008. Neuronal oscillations and visual ampli-
fication of speech. Trends in Cognitive Science 12: 106–113.
Schroeder, C. E., R. W. Lindsley, C. Specht, A. Marcovici, J. F. Smiley, and D. C. Javitt. 2001. Somatosensory input
to auditory association cortex in the macaque monkey. Journal of Neurophysiology 85: 1322–1327.
Schwartz, J.-L., F. Berthommier, and C. Savariaux. 2004. Seeing to hear better: Evidence for early audio-visual
interactions in speech identification. Cognition 93: B69–B78.
Seltzer, B., and D. N. Pandya. 1989. Frontal-lobe connections of the superior temporal sulcus in the rhesus-
monkey. Journal of Comparative Neurology 281: 97–113.
Seltzer, B., and D. N. Pandya. 1994. Parietal, temporal, and occipital projections to cortex of the superior tem-
poral sulcus in the rhesus monkey: A retrograde tracer study. Journal of Comparative Neurology 343:
445–463.
Sinnott, J. M., W. C. Stebbins, and D. B. Moody. 1975. Regulation of voice amplitude by monkey. Journal of
the Acoustical Society of America 58: 412–414.
Smiley, J. F., T. A. Hackett, I. Ulbert, G. Karmas, P. Lakatos, D. C. Javitt, and C. E. Schroeder. 2007. Multisensory
convergence in auditory cortex: I. Cortical connections of the caudal superior temporal plane in macaque
monkeys. Journal of Comparative Neurology 502: 894–923.
Sugihara, T., M. D. Diltz, B. B. Averbeck, and L. M. Romanski. 2006. Integration of auditory and visual
communication information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience 26:
11138–11147.
Tremblay, S., D. M. Shiller, and D. J. Ostry. 2003. Somatosensory basis of speech production. Nature 423:
866–869.
Turkewitz, G., and P. A. Kenny. 1982. Limitations on input as a basis for neural organization and perceptual
development: A preliminary theoretical statement. Developmental Psychobiology 15: 357–368.
Van Wassenhove, V., K. W. Grant, and D. Poeppel. 2005. Visual speech speeds up the neural processing of
auditory speech. Proceedings of the National Academy of Sciences of the United States of America 102:
1181–1186.
Wallace, M. T., and B. E. Stein. 2001. Sensory and multisensory responses in the newborn monkey superior
colliculus. Journal of Neuroscience 21: 8886–8894.
Werner-Reiss, U., K. A. Kelly, A. S. Trause, A. M. Underhill, and J. M. Groh. 2003. Eye position affects activity
in primary auditory cortex of primates. Current Biology 13: 554–562.
Yehia, H., P. Rubin, and E. Vatikiotis-Bateson. 1998. Quantitative association of vocal-tract and facial behavior.
Speech Communication 26: 23–43.
Yehia, H. C., T. Kuratate, and E. Vatikiotis-Bateson. 2002. Linking facial animation, head motion and speech
acoustics. Journal of Phonetics 30: 555–568.
Zangehenpour, S., A. A. Ghazanfar, D. J. Lewkowicz, and R. J. Zatorre. 2008. Heterochrony and cross-species
intersensory matching by infant vervet monkeys. PLoS ONE 4: e4302.
33 Convergence of Auditory, Visual, and Somatosensory Information in Ventral Prefrontal Cortex
Lizabeth M. Romanski

CONTENTS
33.1 Introduction
33.2 Anatomical Innervation of Ventral Prefrontal Cortex
33.2.1 Visual Projections to Ventral Prefrontal Cortex
33.2.2 Auditory Projections to Prefrontal Cortex
33.2.3 Somatosensory Connections with Prefrontal Cortex
33.3 Physiological Responses in VLPFC Neurons
33.3.1 Visual Responses
33.3.2 Auditory Responses and Function in Prefrontal Cortex
33.3.3 Prefrontal Responses to Vocalizations
33.3.4 Somatosensory Responses
33.3.5 Multisensory Responses
33.3.6 Functional Considerations
References

33.1  INTRODUCTION
Our ability to recognize and integrate auditory and visual stimuli is the basis for many cognitive
processes but is especially essential in meaningful communication. Although many brain regions
contribute to recognition and integration of sensory signals, the frontal lobes both receive a mul-
titude of afferents from sensory association areas and have influence over a wide region of the
nervous system to govern behavior. Furthermore, the frontal lobes are special in that they have
been associated with language processes, working memory, planning, and reasoning, which all
depend on the recognition and integration of a vast network of signals. Research has also shown
that somatosensory afferents reach the frontal lobe and that in specific regions single cells encode
somatosensory signals. In this chapter we will focus on the ventrolateral prefrontal cortex (VLPFC),
also known as the inferior convexity in some studies, and describe the connectivity of VLPFC
with auditory, visual, and somatosensory cortical areas. This connectivity provides the circuitry
for prefrontal responses to these stimuli, which will also be described from previous research. The
potential function of combined auditory, visual, and somatosensory inputs will be described with
regard to communication and object recognition.

33.2  ANATOMICAL INNERVATION OF VENTRAL PREFRONTAL CORTEX


The prefrontal cortex receives a widespread array of afferents from cortical and subcortical areas.
These include sensory, motor, and association cortex and thalamic nuclei. The extensive innerva-
tion of the frontal lobe is nonetheless organized, and particular circuits have been investigated and
carefully characterized leading to a better understanding of frontal lobe function based on this con-
nectivity. Although many areas of the frontal lobe receive converging inputs, we will focus on the
multisensory innervation of the VLPFC.

33.2.1  Visual Projections to Ventral Prefrontal Cortex


Much of what we know about the cellular functions of the primate prefrontal cortex is based on
the processing of visual information. Thus, it is not surprising that many studies have examined
projections from visual association cortex to the primate prefrontal cortex. With regard to the fron-
tal lobe, early anatomical studies by Helen Barbas, Deepak Pandya, and their colleagues (Barbas
1988; Barbas and Mesulam 1981; Barbas and Pandya 1989; Chavis and Pandya 1976) examined the
innervation of the entire prefrontal mantle by visual association areas. These studies denoted some
specificity in the innervation of dorsal, ventral, and medial prefrontal cortices. Barbas was among
the first to note that basoventral prefrontal cortices were more strongly connected with extrastriate
ventral visual areas, which have been implicated in pattern recognition and feature discrimina-
tion, whereas medial and dorsal prefrontal cortices are more densely connected with medial and
dorsolateral occipital and parietal areas, which are associated with visuospatial functions (Barbas
1988). This dissociation was echoed by Bullier and colleagues (1996), who found some segregation
of inputs to PFC when paired injections of tracers were placed into temporal and parietal visual
processing regions. In their study, the visual temporal cortex projected mainly to area 45, located
ventrolaterally in the PFC, whereas the parietal cortex sent projections to both ventrolateral PFC
(area 45) and dorsolateral PFC (DLPFC) (areas 8a and 46) (Schall et al. 1995; Bullier et al. 1996).
Tracing and lesion studies by Ungerleider et al. (1989) showed that area TE projected specifically
to three ventral prefrontal targets including the anterior limb of the arcuate sulcus (area 45), the
inferior convexity just ventral to the principal sulcus (areas 46v and 12) and within the lateral orbital
cortex (areas 11, 12o). These projections are via the uncinate fasciculus (Ungerleider et al. 1989).
The selective connectivity of ventrolateral PFC areas 12 and 45, which contain object- and face-
selective neurons (O’Scalaidhe et al. 1997, 1999; Wilson et al. 1993), with inferotemporal areas TE
and TEO was specifically documented by Webster and colleagues (1994). Comparison of TE and
TEO connectivity in their study revealed a number of important differences, including the finding
that it is mainly area TE that projects to ventrolateral PFC and orbitofrontal areas 11, 12, and 13.
These orbital regions have also been associated with visual object functions.

33.2.2  Auditory Projections to Prefrontal Cortex


In early anatomical studies, lesion/degeneration techniques were used to reveal projections from
the caudal superior temporal gyrus (STG) to the periprincipalis, periarcuate, and inferior convexity
regions of the frontal lobe and from the middle and rostral STG to rostral principalis and orbital
regions (Pandya et al. 1969; Pandya and Kuypers 1969; Jones and Powell 1970; Chavis and Pandya
1976). Studies with anterograde and retrograde tracers that were aimed at determining the overall
connectivity between the temporal and frontal lobes brought additional specificity (Pandya and
Sanides 1973; Galaburda and Pandya 1983; Barbas 1988; Barbas and Mesulam 1981; Barbas and
Pandya 1989; Cipolloni and Pandya 1989). Studies of the periprincipalis and arcuate region showed
that the anterior and middle aspects of the principal sulcus, including areas 9, 10, and 46, were con-
nected with the middle and caudal STG (Barbas and Mesulam 1985; Petrides and Pandya 1988),
whereas area 8 receives projections from mostly caudal STG (Barbas and Mesulam 1981; Petrides
and Pandya 1988). Later studies confirmed the connection of the posterior STG with area 46 and
dorsal area 8, and of the middle STG with rostral–dorsal area 46, area 10, area 9, and area 12 (Petrides and
Pandya 1988; Barbas 1992).
Connections of ventrolateral prefrontal areas with auditory association cortex have been consid-
ered by several groups. Cytoarchitectonic analysis of the VLPFC suggested that the region labeled
by Walker as area 12 in the macaque monkey has characteristics similar to those of human area 47,
and was thus renamed area 47/12 in the macaque by Petrides and Pandya (1988). Analysis of the
connections of areas 45 and 47/12 in the VLPFC has shown that they receive innervation from the
STG, the inferotemporal cortex, and from multisensory regions within the superior temporal sulcus.
Combining physiological recording with anatomical tract tracing, Romanski and colleagues (1999)
analyzed the connections of physiologically defined areas of the belt and parabelt auditory cortex
and determined that the projections to prefrontal cortex are topographically arranged so that rostral
and ventral prefrontal cortex receive projections from the anterior auditory association cortex (areas
AL and anterior parabelt), whereas caudal prefrontal regions are innervated by the posterior audi-
tory cortex (areas CL and caudal parabelt; Figure 33.1). Together with recent auditory physiological
recordings from the lateral belt (Tian et al. 2001) and from the prefrontal cortex (Romanski and
Goldman-Rakic 2002; Romanski et al. 2005), these studies suggest that separate auditory streams
originate in the anterior and posterior auditory cortex and target anterior-ventrolateral object, and

[Figure 33.1 image: (a) lateral view of the macaque brain marking prefrontal areas 8b, 8a, 9, 46d, 46v, 45, 12vl, and 10 and auditory areas AI, CL, ML, and AL, with sulci cs, asd, asv, ls, and sts; (b) coronal sections through the frontal lobe showing areas 46, 45, 12vl, and 12o and sulci asd, ps, and los.]

FIGURE 33.1  Innervation of prefrontal cortex by auditory belt and parabelt injections. (a) Projections from
anterior auditory cortex to ventrolateral prefrontal cortex (VLPFC) are shown with black arrows and projec-
tions from caudal auditory cortex to dorsolateral prefrontal cortex (DLPFC) are shown in white. (b) Coronal
sections through the frontal lobe detail anatomical connections. Injections placed into anterior auditory belt
area AL resulted in projections to rostral 46, ventrolateral area 12vl, and lateral orbital cortex area 12o (shown
in black). Projections from caudal auditory cortex area CL and adjacent parabelt targeted caudal dorsal pre-
frontal cortex areas 46, area 8a, and part of area 45 (shown as white cells and fibers). Projections from ML
included some dorsal and ventral targets and are shown in gray. asd, dorsal ramus of arcuate sulcus; asv, ven-
tral ramus of arcuate sulcus; cs, central sulcus; ls, lateral sulcus; sts, superior temporal sulcus.
dorsolateral spatial domains in the frontal lobe, respectively (Romanski 2007), similar to those
of the visual system. Ultimately, this also implies that auditory and visual afferents target similar
regions of dorsolateral and ventrolateral prefrontal cortex (Price 2008). The convergence of audi-
tory and visual ventral stream inputs to the same VLPFC domain implies that they may be inte-
grated and combined to serve a similar function, that of object recognition.

33.2.3  Somatosensory Connections with Prefrontal Cortex


Previous studies have noted connections between the principal sulcus and inferior convexity with
somatosensory cortical areas (Barbas and Mesulam 1985), most notably SII and 7b (Cavada and
Goldman-Rakic 1989; Preuss and Goldman-Rakic 1989; Carmichael and Price 1995). Injections
that included the ventral bank of the principal sulcus and the anterior part of area 12 resulted in
strong labeling of perisylvian somatic cortex including the second somatosensory area (SII) and
insular cortex (Preuss and Goldman-Rakic 1989). Anterograde studies have confirmed this showing
that area SII has a projection to the inferior convexity and principal sulcus region of the macaque
frontal lobe (Cipolloni and Pandya 1999). This region of the PFC overlaps with the projection field
of auditory association cortex and visual extrastriate cortex.

33.3  PHYSIOLOGICAL RESPONSES IN VLPFC NEURONS


33.3.1  Visual Responses
Wilson et al. (1993) published a groundbreaking study revealing a physiological dissocia-
tion between dorsal and ventral prefrontal cortex (Figure 33.2). In this study, the authors showed
that DLPFC cells responded in a spatial working memory task, with single cells exhibiting selec-
tive cue and delay activity for discrete eccentric locations. In the same animals, it was shown that
electrode penetrations into VLPFC regions, which included the expanse of the inferior convexity
(areas 47/12 lateral, 12 orbital, and area 45), revealed single-unit responses to pictures of objects and

[Figure 33.2 image: lateral brain schematic of the dorsal stream (spatial vision) running from V1 through parietal cortex to DLPFC (areas 8a, 46) and the ventral stream (object vision) running from V1 through inferotemporal cortex to VLPFC (areas 10, 12, 45).]

FIGURE 33.2  Lateral brain schematic of visual pathways in nonhuman primate showing dorsal–spatial and
ventral–object visual streams that terminate in DLPFC and VLPFC, respectively. Wilson et al. (1993) showed
that neurons in DLPFC (black) respond during perception and memory of visuospatial information, whereas
neurons in VLPFC (gray) responded to object features including color, form, and type of visual stimulus. Later
studies by O’Scalaidhe et al. (1997, 1999) described “face cells” that were localized to the gray region of VLPFC,
in areas 12 and 45.
faces. These VLPFC cells did not respond in the spatial working memory task but did respond in
an object-fixation task and an object-conditional association task. Further electrophysiological and
neuroimaging studies have demonstrated face selectivity in this same area of VLPFC (O’Scalaidhe
et al. 1997, 1999; Tsao et al. 2008), confirming this functional domain separation.
Although these studies were the first to demonstrate an electrophysiological dissociation between
DLPFC and VLPFC, they were not the first to suggest a functional difference and to show the pref-
erence for object as opposed to spatial processing in the ventral prefrontal cortex. An earlier study
by Mishkin and Manning (1978) showed that lesions of the VLPFC in nonhuman primates interfere
with the processing of nonspatial information, including color and form. These ventral prefrontal
lesions had a severe and lasting impairment on the performance of three nonspatial tasks, whereas
lesions of the principal sulcus had only a transient effect (Mishkin and Manning 1978). Just a few
years earlier, Passingham (1975) had also suggested a dissociation between dorsal and ventral PFC.
In that study, rhesus monkeys were trained on delayed color matching and delayed spatial
alternation tasks. Lesions of the VLPFC resulted in an impairment only in the delayed color match-
ing task, whereas lesions of the DLPFC only impaired the delayed spatial alternation task. These
results, like the Wilson et al. study two decades later, demonstrated a double dissociation of dorsal
and ventral PFC and suggested a role in the processing of object features and recognition for the
VLPFC.
Further analysis of the properties of cells in the VLPFC was done by Joaquin Fuster and col-
leagues. In their electrophysiological analysis of ventral prefrontal neurons, they showed that single
cells are responsive to simple and complex visual stimuli presented at the fovea (Pigarev et al. 1979;
Rosenkilde et al. 1981). The foveal receptive field properties of these cells had first been shown in
studies by Suzuki and Azuma (1977), who examined receptive field properties of neurons across
the expanse of lateral prefrontal cortex. The receptive fields of neurons in DLPFC were found to
lie outside the fovea and favored the contralateral visual field, whereas neurons below the principal
sulcus in areas 12/47 and 45 were found to be driven best by visual stimuli shown within the fovea
(Suzuki and Azuma 1977). Hoshi et al. (2000) examined the spatial distribution of location-selective
and shape-selective neurons during cue, delay, and response periods, and found more location-
selective neurons in the posterior part of the lateral PFC, whereas more shape-selective neurons
were found in the anterior part, corresponding to area 12/47. Ninokura et al. (2004) found that cells
that responded selectively to the physical properties (color and shape) of objects were localized to
the VLPFC. These various studies fostered the notion that visual neurons in VLPFC were tuned to
nonspatial features including color, shape, and type of object, and had receptive fields representing
areas in and around the fovea.
Finally, studies from Goldman-Rakic and colleagues further demonstrated that neurons in the
VLPFC were not only responsive to object features, but that some neurons were highly special-
ized and were face-selective (Wilson et al. 1993; O’Scalaidhe et al. 1997, 1999). The face-selective
neurons were found in several discrete regions including an anterior location that appears to be
area 12/47, a posterior periarcuate location within area 45, and some penetrations into the orbital
cortex also yielded face cells. These single-unit responses were further corroborated with functional
magnetic resonance imaging (fMRI) data by Tsao and colleagues (2008). In their fMRI study, they
showed that three loci within the VLPFC of macaques were selectively activated by faces (Tsao
et al. 2008; Figure 33.3). These three locations correspond roughly to the same anterior, posterior,
and ventral/orbital locations that O’Scalaidhe et al. (1997, 1999) mapped as being face-responsive
in their single-unit recording studies. Demonstration by both methods of visual responsiveness and
face selectivity substantiates the notion that the VLPFC is involved in object and face processing.

33.3.2  Auditory Responses and Function in Prefrontal Cortex


The frontal lobe has long been linked with complex auditory function through its association with
language functions and Broca’s area. What we hear and say seems to be important to frontal lobe
[Figure 33.3 image: two coronal sections (+28.5 and +30.0) with thresholded activation maps.]

FIGURE 33.3  (See color insert.) Activation of macaque prefrontal cortex by faces in the study of Tsao et al.
(2008). Shown here are two coronal sections showing “face patches” in VLPFC (activations are yellow), delineated
with white arrows. (Reprinted from Tsao, D. Y. et al., Nat. Neurosci., 11, 877–879, 2008. With permission.)

neurons. In the human brain, the posterior aspects of Broca’s area are thought to be especially
involved in the phonetic and motor control of speech, whereas more anterior regions have been shown
to be activated during semantic processing, comprehension, and auditory working memory (Zatorre
et al. 1992; Paulesu et al. 1993; Buckner et al. 1995; Demb et al. 1995; Fiez et al. 1996; Stromswold
et al. 1996; Cohen et al. 1997; Gabrieli et al. 1998; Stevens et al. 1998; Price 1998; Posner et al. 1999;
Gelfand and Bookheimer 2003). Examination of prefrontal auditory function in nonhuman primates
has not received as much attention as visual prefrontal function. A few studies have investigated the
effects of large prefrontal lesions on behavioral task performance of auditory discrimination or mne-
monic processing of complex acoustic stimuli. In each of these four studies, relatively large lesions
of the lateral PFC were shown to cause an impairment in an auditory go/no-go task for food reward
(Weiskrantz and Mishkin 1958; Gross and Weiskrantz 1962; Gross 1963; Goldman and Rosvold
1970). This was taken as evidence of the PFC’s involvement in modality-independent processing
especially in tasks requiring inhibitory control (Weiskrantz and Mishkin 1958).
Despite the localization of language function in the human brain to ventral frontal lobe regions and
the demonstration that lesions of lateral PFC in nonhuman primates interfere with auditory discrimi-
nation, single-cell responses to acoustic stimuli have only been sporadically noted in the frontal lobes
of Old and New World monkeys (Benevento et al. 1977; Bodner et al. 1996; Newman and Lindsley
1976; Tanila et al. 1992, 1993; Wollberg and Sela 1980). However, a close look at these studies reveals
that few of the studies sampled neurons in ventrolateral and orbitofrontal regions. Most recordings in
the past have been confined to the dorsolateral surface of the frontal lobe where projections from sec-
ondary and tertiary auditory cortices are sparse. Only one early study recorded from the lateral orbital
region in the macaque cortex and found both auditory and visual responses to simple visual flashes
and to broadband auditory clicks (Benevento et al. 1977). Furthermore, none of the studies tested
neurons systematically with naturalistic and species-relevant acoustic stimuli. Recent approaches to
frontal lobe auditory function have utilized naturalistic stimuli, including species-specific vocaliza-
tions and have extended the area of investigation to orbital and ventral PFC regions.

33.3.3  Prefrontal Responses to Vocalizations


After establishing the areas of the prefrontal cortex that receive dense afferents from early auditory
cortical regions (Romanski et al. 1999a, 1999b), Romanski and Goldman-Rakic revealed a discrete
auditory responsive region in the macaque VLPFC (Romanski and Goldman-Rakic 2002). This
VLPFC region has neurons that respond to complex acoustic stimuli, including species-specific
vocalizations, and lies adjacent to the object- and face-selective region proposed previously
(O’Scalaidhe et al. 1997, 1999; Wilson et al. 1993). Although VLPFC auditory neurons have not
been thoroughly tested for directional selectivity, further examination has suggested that they
encode complex auditory features and thus respond to complex stimuli on the basis of similar acous-
tic features (Romanski et al. 2005; Averbeck and Romanski 2006).
Use of a large library of rhesus macaque vocalizations to test auditory selectivity in prefrontal
neurons has shown that VLPFC neurons are robustly responsive to species-specific vocalizations
(Romanski et al. 2005). A cluster analysis of these vocalization responses did not show a cluster-
ing of responses to vocalizations serving similar behavioral functions (e.g., food calls) but demonstrated that
neurons tend to respond to multiple vocalizations with similar acoustic morphology (Romanski
et al. 2005; Figure 33.4). Neuroimaging in rhesus monkeys has revealed a small ventral prefron-
tal locus that was active during presentation of complex acoustic stimuli including vocalizations
(Poremba and Mishkin 2007). Additional electrophysiological recording studies by Cohen and col-
leagues have suggested that prefrontal auditory neurons may also participate in the categorization
of species-specific vocalizations (Gifford et al. 2005). These combined data are consistent with a
role for VLPFC auditory neurons in a ventral auditory processing stream that analyzes the features
of auditory objects including vocalizations.
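
To make the logic of such a cluster analysis concrete, the following Python sketch groups call types by the similarity of the population firing rates they evoke. The rate matrix, call list, and clustering settings (correlation distance, average linkage, three clusters) are hypothetical illustrations and are not the actual analysis pipeline of Romanski et al. (2005).

# Sketch: cluster call types by the similarity of the neuronal responses they evoke.
# All firing rates are made-up numbers used only to illustrate the analysis.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

calls = ["coo", "grunt", "bark", "shrill bark", "girney",
         "gecker", "warble", "harmonic arch", "scream", "copulation scream"]

# rates[i, j] = hypothetical mean firing rate (spikes/s) of neuron i to call type j.
rng = np.random.default_rng(0)
rates = rng.gamma(shape=2.0, scale=5.0, size=(40, len(calls)))

# Distance between two call types = dissimilarity of the population responses they evoke.
dist = pdist(rates.T, metric="correlation")
tree = linkage(dist, method="average")
labels = fcluster(tree, t=3, criterion="maxclust")

# Calls assigned to the same cluster evoke similar responses; in the recorded data such
# clusters tended to track acoustic similarity rather than the behavioral meaning of calls.
for call, cluster_id in zip(calls, labels):
    print(f"{call}: cluster {cluster_id}")
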
Evidence for object-based auditory processing in the ventral frontal lobe of the human brain is
suggested by neuroimaging studies that have detected activation in the VLPFC not only by speech
stimuli but by nonspeech and music stimuli (Belin et al. 2000; Binder et al. 2000; Scott et al. 2000;
Zatorre et al. 2004) in auditory recognition tasks and voice recognition tasks (Fecteau et al. 2005).
The localization of an auditory object processing stream in the human brain to the very same ven-
tral prefrontal region in a nonhuman primate suggests a functional similarity between this area and
human language-processing regions located in the inferior frontal gyrus (Deacon 1992; Romanski
and Goldman-Rakic 2002).

33.3.4  Somatosensory Responses


Fewer studies have examined the responses of prefrontal neurons to somatosensory stimuli. This
may be partly attributable to the lack of an easy association between a known human function
for somatosensory stimuli and the frontal lobes, as there is for language and audition. One group,
however, has demonstrated responses to somatosensory stimuli in single lateral prefrontal neurons.
Recordings in the prefrontal cortex were made while macaque monkeys performed a somatosensory
discrimination task (Romo et al. 1999). Neurons were found whose discharge rates varied before
and during the delay period between the two stimuli, as a monotonic function of the base stimulus
frequency (Figure 33.5). These cells were localized specifically to the VLPFC, also known as the
inferior convexity (Romo et al. 1999) within the same general ventral prefrontal regions where
object-, face-, and auditory-responsive neurons have been recorded. The feature-based encoding of
these cells supports their role in an object-based ventral stream function. In addition to this demon-
stration of prefrontal neuronal function in a somatosensory task, there is an early lesion study that
noted an impairment in a somatosensory alternation task after large lateral prefrontal lesions but not
after parietal lesions (Ettlinger and Wegener 1958).
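
The monotonic encoding reported by Romo et al. (1999) can be pictured as a regression of delay-period firing rate on the base stimulus frequency: a reliably positive slope marks a positive monotonic neuron and a reliably negative slope a negative monotonic one. The Python sketch below illustrates that classification; the frequencies, firing rates, and significance criterion are assumptions for illustration and are not taken from the original study.

# Sketch: classify a neuron as positive or negative monotonic from its mean
# delay-period firing rate at each base vibrotactile frequency (hypothetical numbers).
import numpy as np
from scipy import stats

base_freqs = np.array([10, 14, 18, 22, 26, 30, 34])                 # base stimulus, Hz
mean_rates = np.array([12.0, 15.5, 19.0, 24.5, 28.0, 31.5, 36.0])   # spikes/s

slope, intercept, r_value, p_value, std_err = stats.linregress(base_freqs, mean_rates)

if p_value < 0.01 and slope > 0:
    label = "positive monotonic"
elif p_value < 0.01 and slope < 0:
    label = "negative monotonic"
else:
    label = "no significant monotonic signal"

print(f"slope = {slope:.2f} spikes/s per Hz (p = {p_value:.3g}) -> {label}")
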
In human neuroimaging studies, it has been shown that the ventral frontal lobe is activated by
somatosensory stimulation (Hagen et al. 2002). In their study, two discrete ventral frontal brain
regions were responsive to somatosensory stimulation including the posterior inferior frontal gyrus
and the orbitofrontal cortex. Additional neuroimaging studies have examined frontal lobe activation
during haptic shape perception and discrimination. A recent fMRI study found that several frontal
lobe sites were activated during haptic shape perception (Miquee et al. 2008) and during visuo-haptic
processing (Stilla and Sathian 2008). Most interesting is the demonstration of vibrotactile working
[Figure 33.4 image: (a) raster/spike density plots (spikes/s vs. time in ms) of one neuron's responses to 10 call types (submissive scream, copulation scream, gecker, bark, girney, harmonic arch, warble, coo, grunt, shrill bark); (b) spectrograms (frequency vs. time) of the submissive scream and copulation scream; (c) cluster dendrogram of the 10 calls.]

FIGURE 33.4  A vocalization-responsive cell in VLPFC. (a) Responses to 10 vocalization exemplars are shown in raster/spike density plots. The strongest responses were to
submissive scream and copulation scream vocalizations, which are similar in acoustic features, as shown in the spectrograms in panel (b). A cluster analysis
(shown in c) of mean firing rates to these calls shows that calls with similar acoustic features tend to evoke similar neuronal responses. (Modified from Romanski, L. M. et
al., J. Neurophysiol., 93, 734–747, 2005.)
[Figure 33.5 image: (a–f) spike density functions (firing rate in Hz vs. time in s) for six neurons; (g) number of neurons carrying a significant signal as a function of time (s).]

FIGURE 33.5  Single-neuron spike density functions from six different neurons. Dark bars above each plot
indicate times during which the neuron’s firing rate carried significant (P < .01) monotonic signal about base
stimulus. (a, c, e) Positive monotonic neurons. (b, d, f) Negative monotonic neurons. (g) Total number of
recorded neurons (during fixed 3-s delay period runs) carrying a significant signal about the base stimulus, as
a function of time relative to beginning of delay period. Individual neurons may participate in more than one
bin. Base stimulus period is shaded gray, and minimum and maximum number of neurons, during the first,
middle, and last seconds of delay period, respectively, are indicated with arrows. (Reprinted by permission
from Macmillan Publishers Ltd.: Romo, R. et al., Nature, 399, 470–473, 1999.)

memory activation of human VLPFC areas 47/12 and 45 by Kostopoulos et al. (2007). In their
fMRI study, the authors not only demonstrated activity of the VLPFC during a vibrotactile working
memory task but also showed functional connectivity with the secondary somatosensory cortex,
which was also active in this vibrotactile delayed discrimination task. The area activated, area 47
in the human brain, is analogous to monkey area 12/47, where face and vocalization responses have
been recorded (O’Scalaidhe et al. 1997, 1999; Romanski and Goldman-Rakic 2002; Romanski et al.
2005). The anatomical, electrophysiological, and neuroimaging data suggest that somatosensory
stimuli may converge in similar VLPFC regions where auditory and visual responsive neurons are
found and may combine to participate in object recognition.
33.3.5  Multisensory Responses


The anatomical, physiological, and behavioral data described above show that the ventral fron-
tal lobe receives afferents carrying information about auditory, visual, and somatosensory stimuli.
Furthermore, physiological studies indicate that VLPFC neurons prefer complex information and
are activated by stimuli with social communication information, that is, faces and vocalizations.
Although only one group has examined somatosensory responses in the prefrontal cortex thus far,
several imaging studies have shown activation of the ventral frontal lobe with haptic stimulation,
which also holds some importance in social communication and in object recognition.
Although many human neuroimaging studies have posited a role for the frontal lobes in the
integration of auditory and visual speech or communication information (Gilbert and Fiez 2004;
Hickok et al. 2003; Jones and Callan 2003; Homae et al. 2002), few studies have addressed the cel-
lular mechanisms underlying frontal lobe multisensory integration. An early study by Benevento
et al. (1977) made intracellular electrophysiological recordings in the lateral orbital cortex (area
12  orbital) and found that single cells were responsive to simple auditory and visual stimuli
(Benevento et al. 1977). Fuster and colleagues recorded from the lateral frontal cortex during an
audiovisual matching task (Fuster et al. 2000; Bodner et al. 1996). In this task, prefrontal cortex
cells responded selectively to tones, and most of them also responded to colors according to the
task rule (Fuster et al. 2000). Gaffan and Harrison (1991) determined the importance of ventral
prefrontal cortex in sensory integration by showing that lesions disrupt the performance of cross-
modal matching involving auditory and visual objects. Importantly, Rao et al. (1997) have described
integration of object and location information in single prefrontal neurons.
A recent study by Romanski and colleagues has documented multisensory responses to combined
auditory and visual stimuli in the VLPFC. In this study, rhesus monkeys were presented with mov-
ies of familiar monkeys vocalizing while single neurons were recorded from the VLPFC (Sugihara
et al. 2006). These movies were separated into audio and video streams, and neural responses
to the unimodal stimuli were compared to combined audiovisual responses. Interestingly, about
half of the neurons recorded in the VLPFC were multisensory in that they responded to both uni-
modal auditory and visual stimuli or responded differently to simultaneously presented audiovisual
stimuli than to either unimodal stimulus alone (Sugihara et al. 2006). As has been shown in the superior
colliculus, the STS, and auditory cortex, prefrontal neurons exhibited enhancement or suppression
(Figure 33.6), and, like the STS, suppression was more commonly observed than enhancement.
Multisensory responses were stimulus-dependent in that not all combinations of face-vocalization
stimuli elicited a multisensory response. Hence, our estimate of multisensory neurons is most likely
a lower bound. If the stimulus battery tested were large enough, we would expect that more neurons
would be shown to be multisensory rather than classified, by default, as unimodal visual. It was also interesting
that face/voice stimuli evoked multisensory responses more frequently than nonface/nonvoice com-
binations, as in auditory cortex (Ghazanfar et al. 2008) and in the STS (Barraclough et al. 2005).
This adds support to the notion that VLPFC is part of a circuit that is specialized for integrating face
and voice information rather than integrating all forms of auditory and visual stimuli generically.
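
The enhancement and suppression described here are commonly quantified by comparing the bimodal response with the stronger of the two unimodal responses. The Python sketch below computes such an index on hypothetical firing rates; the formula is a generic convention from the multisensory literature rather than necessarily the exact criterion used by Sugihara et al. (2006).

# Sketch: a conventional multisensory index computed from mean responses to the
# auditory-alone (A), visual-alone (V), and combined (AV) conditions. Firing rates
# are hypothetical; positive values indicate enhancement, negative values suppression.
def multisensory_index(a_rate: float, v_rate: float, av_rate: float) -> float:
    """Percent change of the bimodal response relative to the best unimodal response."""
    best_unimodal = max(a_rate, v_rate)
    return 100.0 * (av_rate - best_unimodal) / best_unimodal

# Two example cells loosely modeled on Figure 33.6: one enhanced, one suppressed.
example_cells = {
    "enhanced cell": (20.0, 12.0, 30.0),   # A, V, AV in spikes/s
    "suppressed cell": (15.0, 10.0, 8.0),
}

for name, (a, v, av) in example_cells.items():
    idx = multisensory_index(a, v, av)
    kind = "enhancement" if idx > 0 else "suppression"
    print(f"{name}: index = {idx:+.1f}% ({kind})")
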
Specialization for the integration of communication-relevant audiovisual stimuli in the frontal
lobe, and particularly in the VLPFC, is also apparent in the human brain. An fMRI study has shown
that area 47 in the human brain is active during the simultaneous presentation of gestures and
speech (Xu et al. 2009). In this study, Braun and colleagues found overlap of activation in area 47
when subjects viewed gestures or listened to a voicing of the phrase that fit the gesture. The region
of activation in this study of the human brain is homologous to the area recorded by Sugihara
et al., suggesting that this region of the VLPFC is specialized for the multisensory integration of
communication-relevant auditory and visual information, namely, gestures (i.e., facial) and vocal
sounds.
Thus, there is evidence that auditory, visual, and somatosensory information is reaching the
VLPFC, and is converging within areas 12/47 and 45 (Figure 33.7). Furthermore, this information
[Figure 33.6 image: (a) multisensory enhancement and (b) multisensory suppression; raster/spike density plots (spikes/s vs. time) for auditory (Aud), visual (Vis), and combined (AV) conditions, each with a bar graph of mean responses to A, V, and AV.]

FIGURE 33.6  Multisensory neuronal responses in prefrontal cortex. Responses of two single units are
shown in (a) and (b) as raster/spike density plots to auditory vocalization alone (Aud), face alone (Vis), and
both presented simultaneously (AV). A bar graph of mean response to these stimuli is shown at right depict-
ing auditory (dark gray), visual (white), and multisensory (light gray) responses. Cell in panel (a) exhibited
multisensory enhancement and cell in panel (b) showed multisensory suppression.

appears to be related most to communication. Although Romo et al. (1999) showed evidence of
somatosensory processing related to touch, the innervation of ventral prefrontal cortex includes
afferents from the face region of SII (Preuss and Goldman-Rakic 1989). This somatosensory face information
is arriving at ventral prefrontal regions that receive information about face identity, features, and
expression from areas TE and TPO (Webster et al. 1994; O’Scalaidhe et al. 1997, 1999), in addition

[Figure 33.7 image: lateral view of the macaque frontal lobe marking audiovisual-responsive cells (Sugihara et al. 2006), the somatosensory-responsive region (Romo et al. 1999), the auditory-responsive region (Romanski and Goldman-Rakic 2002), and the visual/face-responsive region (O'Scalaidhe et al. 1997), with sulci as, ps, ls, and sts.]

FIGURE 33.7  Auditory, visual, and somatosensory convergence in VLPFC is shown on a lateral brain sche-
matic of macaque frontal lobe. VLPFC location of vocalization-responsive area (dark gray), visual object- and
face-responsive area (light gray), somatosensory-responsive area (dashed line circle), audiovisual-responsive
cells (black dots) are all depicted on the prefrontal cortex of the macaque monkey in which they were recorded. as,
arcuate sulcus; ls, lateral sulcus; ps, principal sulcus; sts, superior temporal sulcus. (Data from Sugihara, T.
et al., J. Neurosci., 26, 11138–11147, 2006.)
to auditory inputs that carry information regarding species-specific vocalizations (Romanski et al.
2005).

33.3.6  Functional Considerations


Although a number of studies have examined the response of prefrontal neurons to face, vocaliza-
tion, and somatosensory stimuli during passive fixation tasks, it is expected that the VLPFC utilizes
these stimuli in more complex processes. There is no doubt that the context of a task will affect
the firing of VLPFC neurons. Nonetheless, face and vocalization stimuli are different from typical
simple sensory stimuli in that they already carry semantic meaning and emotional valence and need
no additional task contexts to make them relevant. A face or vocalization, even when presented in a
passive task, will be associated with previous experiences, emotions, and meanings that will evoke
responses in a number of brain areas that project to the VLPFC, whereas simple sensory stimuli
do not have innate associations and depend only on task contingencies to give them relevance.
Thus, responses to face, voice, and other communication-relevant stimuli in prefrontal neurons
are the sum total of experience with these stimuli in addition to any task or contextual information
presented.
The combination of somatosensory face or touch information, visual face information, and
vocalization information could play a number of roles. First, the general process of conjunction
allows for the combining of auditory, visual, and/or somatosensory stimuli for many known and, as
yet, unknown functions. Thus, the VLPFC may serve a general purpose in allowing complex stimuli
related to any of the modalities to be integrated. This may be especially relevant for the frontal
lobe when the information is to be remembered or operated on in some way. However, a function
more directly suited to the process of communication would be feedback control of articulation.
Auditory information that is coded phonologically and mouth or face movements perceived via
somatosensory input would be integrated, and then orofacial movements could be adjusted to alter
the production of sounds via a speech/vocalization output circuit. The posterior part of the inferior
frontal gyrus (Broca’s area) has been shown, via lesion analysis and neuroimaging, to play a role in
the production of this phonetic code, or articulatory stream. In contrast, the anterior inferior frontal
gyrus may integrate auditory, somatosensory, and visual perceptual information to produce this
stream (Papoutsi et al. 2009). The somatosensory feedback regarding positioning of the mouth and
face would play an important role in control of articulation. The visual face and auditory vocaliza-
tion information available to these neurons could provide further information from a speaker that
warrants a reply or could provide information about a hand or face during a gesture. Thus, a third
function for the combination of auditory, visual, and somatosensory information would be the per-
ception, memory, and execution of gestures that accompany speech and vocalizations.
The VLPFC may also be part of a larger circuit that has been called the mirror neuron system.
This system is purported to be involved in the perception and execution of gestures as occurs in
imitation (Rizzolatti and Craighero 2004). The VLPFC has reciprocal connections with many parts
of the mirror neuron circuit. Finally, convergence of auditory, visual, and haptic information could
also be used in face or object recognition especially when one sense is not optimal, and additional
information from other sensory modalities is needed to confirm identification. The convergence of
these sensory modalities and others may play additional functional roles during a variety of com-
plex cognitive functions.

REFERENCES
Averbeck, B. B., and L. M. Romanski. 2006. Probabilistic encoding of vocalizations in macaque ventral lateral
prefrontal cortex. Journal of Neuroscience 26: 11023–11033.
Barbas, H. 1988. Anatomic organization of basoventral and mediodorsal visual recipient prefrontal regions in
the rhesus monkey. Journal of Comparative Neurology 276: 313–342.
Barbas, H. 1992. Architecture and cortical connections of the prefrontal cortex in the rhesus monkey. Advances
in Neurology 57: 91–115.
Barbas, H., and D. N. Pandya. 1989. Architecture and intrinsic connections of the prefrontal cortex in the rhesus
monkey. Journal of Comparative Neurology 286: 353–375.
Barbas, H., and M. M. Mesulam. 1981. Organization of afferent input to subdivisions of area 8 in the rhesus
monkey. Journal of Comparative Neurology 200: 407–431.
Barbas, H., and M. M. Mesulam. 1985. Cortical afferent input to the principalis region of the rhesus monkey.
Neuroscience 15: 619–637.
Barraclough, N. E., D. Xiao, C. I. Baker, M. W. Oram, and D. I. Perrett. 2005. Integration of visual and auditory
information by superior temporal sulcus neurons responsive to the sight of actions. Journal of Cognitive
Neuroscience 17: 377–391.
Belin, P., R. J. Zatorre, P. Lafaille, P. Ahad, and B. Pike. 2000. Voice-selective areas in human auditory cortex.
Nature 403: 309–312.
Benevento, L. A., J. Fallon, B. J. Davis, and M. Rezak. 1977. Auditory–visual interaction in single cells in the
cortex of the superior temporal sulcus and the orbital frontal cortex of the macaque monkey. Experimental
Neurology 57: 849–872.
Binder, J. R., J. A. Frost, T. A. Hammeke, P. S. Bellgowan, J. A. Springer, J. N. Kaufman, and E. T. Possing.
2000. Human temporal lobe activation by speech and nonspeech sounds. Cerebral Cortex 10: 512–528.
Bodner, M., J. Kroger, and J. M. Fuster. 1996. Auditory memory cells in dorsolateral prefrontal cortex.
Neuroreport 7: 1905–1908.
Buckner, R. L., M. E. Raichle, and S. E. Petersen. 1995. Dissociation of human prefrontal cortical areas across
different speech production tasks and gender groups. Journal of Neurophysiology 74: 2163–2173.
Bullier, J., J. D. Schall, and A. Morel. 1996. Functional streams in occipito-frontal connections in the monkey.
Behavioural Brain Research 76: 89–97.
Carmichael, S. T., and J. L. Price. 1995. Sensory and premotor connections of the orbital and medial prefrontal
cortex of macaque monkeys. Journal of Comparative Neurology 363: 642–664.
Cavada, C., and P. S. Goldman-Rakic. 1989. Posterior parietal cortex in rhesus monkey: II. Evidence for
segregated corticocortical networks linking sensory and limbic areas with the frontal lobe. Journal of
Comparative Neurology 287: 422–445.
Chavis, D. A., and D. N. Pandya. 1976. Further observations on corticofrontal connections in the rhesus mon-
key. Brain Research 117: 369–386.
Cipolloni, P. B., and D. N. Pandya. 1989. Connectional analysis of the ipsilateral and contralateral afferent
neurons of the superior temporal region in the rhesus monkey. Journal of Comparative Neurology 281:
567–585.
Cipolloni, P. B., and D. N. Pandya. 1999. Cortical connections of the frontoparietal opercular areas in the rhesus
monkey. Journal of Comparative Neurology 403: 431–458.
Cohen, J. D., W. M. Perlstein, T. S. Braver, L. E. Nystrom, D. C. Noll, J. Jonides, and E. E. Smith. 1997.
Temporal dynamics of brain activation during a working memory task. Nature 386: 604–608.
Cohen, Y. E., F. Theunissen, B. E. Russ, and P. Gill. 2007. Acoustic features of rhesus vocalizations and their
representation in the ventrolateral prefrontal cortex. Journal of Neurophysiology 97: 1470–1484.
Deacon, T. W. 1992. Cortical connections of the inferior arcuate sulcus cortex in the macaque brain. Brain
Research 573: 8–26.
Demb, J. B., J. E. Desmond, A. D. Wagner, C. J. Vaidya, G. H. Glover, and J. D. Gabrieli. 1995. Semantic
encoding and retrieval in the left inferior prefrontal cortex: A functional MRI study of task difficulty and
process specificity. Journal of Neuroscience 15: 5870–5878.
Ettlinger, G., and J. Wegener. 1958. Somaesthetic alternation, discrimination and orientation after frontal and
parietal lesions in monkeys. The Quarterly Journal of Experimental Psychology 10: 177–186.
Fecteau, S., J. L. Armony, Y. Joanette, and P. Belin. 2005. Sensitivity to voice in human prefrontal cortex.
Journal of Neurophysiology 94: 2251–2254.
Fiez, J. A., E. A. Raife, D. A. Balota, J. P. Schwarz, M. E. Raichle, and S. E. Petersen. 1996. A positron emis-
sion tomography study of the short-term maintenance of verbal information. Journal of Neuroscience
16: 808–822.
Fuster, J. M., M. Bodner, and J. K. Kroger. 2000. Cross-modal and cross-temporal association in neurons of
frontal cortex. Nature 405: 347–351.
Gabrieli, J. D. E., R. A. Poldrack, and J. E. Desmond. 1998. The role of left prefrontal cortex in language and
memory. Proceedings of the National Academy of Sciences 95: 906–913.
Gaffan, D., and S. Harrison. 1991. Auditory–visual associations, hemispheric specialization and temporal-
frontal interaction in the rhesus monkey. Brain 114: 2133–2144.
Galaburda, A. M., and D. N. Pandya. 1983. The intrinsic architectonic and connectional organization of the
superior temporal region of the rhesus monkey. Journal of Comparative Neurology 221: 169–184.
Gelfand, J. R., and S. Y. Bookheimer. 2003. Dissociating neural mechanisms of temporal sequencing and pro-
cessing phonemes. Neuron 38: 831–842.
Ghazanfar, A. A., C. Chandrasekaran, and N. K. Logothetis. 2008. Interactions between the superior tempo-
ral sulcus and auditory cortex mediate dynamic face/voice integration in rhesus monkeys. Journal of
Neuroscience 28: 4457–4469.
Gilbert, A. M., and J. A. Fiez. 2004. Integrating rewards and cognition in the frontal cortex. Cognitive, Affective
and Behavioral Neuroscience 4: 540–552.
Goldman, P. S., and H. E. Rosvold. 1970. Localization of function within the dorsolateral prefrontal cortex of
the rhesus monkey. Experimental Neurology 27: 291–304.
Gross, C. G. 1963. A comparison of the effects of partial and total lateral frontal lesions on test performance by
monkeys. Journal of Comparative Physiological Psychology 56: 41–47.
Gross, C. G., and L. Weiskrantz. 1962. Evidence for dissociation of impairment on auditory discrimination
and delayed response following lateral frontal lesions in monkeys. Experimental Neurology 5: 453–476.
Hagen, M. C., D. H. Zald, T. A. Thornton, and J. V. Pardo. 2002. Somatosensory processing in the human infe-
rior prefrontal cortex. Journal of Neurophysiology 88: 1400–1406.
Hickok, G., B. Buchsbaum, C. Humphries, and T. Muftuler. 2003. Auditory–motor interaction revealed by fMRI:
Speech, music, and working memory in area spt. Journal of Cognitive Neuroscience 15: 673–682.
Homae, F., R. Hashimoto, K. Nakajima, Y. Miyashita, and K. L. Sakai. 2002. From perception to sentence com-
prehension: The convergence of auditory and visual information of language in the left inferior frontal
cortex. NeuroImage 16: 883–900.
Hoshi, E., K. Shima, and J. Tanji. 2000. Neuronal activity in the primate prefrontal cortex in the process of
motor selection based on two behavioral rules. Journal of Neurophysiology 83: 2355–2373.
Gifford, G. W., III, K. A. MacLean, M. D. Hauser, and Y. E. Cohen. 2005. The neurophysiology of functionally
meaningful categories: Macaque ventrolateral prefrontal cortex plays a critical role in spontaneous cat-
egorization of species-specific vocalizations. Journal of Cognitive Neuroscience 17: 1471–1482.
Jones, E. G., and T. P. Powell. 1970. An anatomical study of converging sensory pathways within the cerebral
cortex of the monkey. Brain 93: 793–820.
Jones, J. A., and D. E. Callan. 2003. Brain activity during audiovisual speech perception: An fMRI study of the
McGurk effect. Neuroreport 14: 1129–1133.
Kostopoulos, P., M. C. Albanese, and M. Petrides. 2007. Ventrolateral prefrontal cortex and tactile memory dis-
ambiguation in the human brain. Proceedings of the National Academy of Sciences of the United States
of America 104: 10223–10228.
Miquee, A., C. Xerri, C. Rainville, J. L. Anton, B. Nazarian, M. Roth, and Y. Zennou-Azogui. 2008. Neuronal
substrates of haptic shape encoding and matching: A functional magnetic resonance imaging study.
Neuroscience 152: 29–39.
Mishkin, M., and F. J. Manning. 1978. Non-spatial memory after selective prefrontal lesions in monkeys. Brain
Research 143: 313–323.
Newman, J. D., and D. F. Lindsley. 1976. Single unit analysis of auditory processing in squirrel monkey frontal
cortex. Experimental Brain Research 25: 169–181.
Ninokura, Y., H. Mushiake, and J. Tanji. 2004. Integration of temporal order and object information in the
monkey lateral prefrontal cortex. Journal of Neurophysiology 91: 555–560.
O’Scalaidhe, S. P., F. A. Wilson, and P. S. Goldman-Rakic. 1999. Face-selective neurons during pas-
sive viewing and working memory performance of rhesus monkeys: Evidence for intrinsic specialization
of neuronal coding. Cerebral Cortex 9: 459–475.
O’Scalaidhe, S. P., F. A. Wilson, and P. S. Goldman-Rakic. 1997. Areal segregation of face-processing neurons
in prefrontal cortex. Science 278: 1135–1138.
Pandya, D. N., and F. Sanides. 1973. Architectonic parcellation of the temporal operculum in rhesus monkey
and its projection pattern. Zeitschrift fuer Anatomie und Entwicklungsgeschichte 139: 127–161.
Pandya, D. N., and H. G. Kuypers. 1969. Cortico-cortical connections in the rhesus monkey. Brain Research
13: 13–36.
Pandya, D. N., M. Hallett, and S. K. Mukherjee. 1969. Intra- and interhemispheric connections of the neocor-
tical auditory system in the rhesus monkey. Brain Research 14: 49–65.
Papoutsi, M., J. A. de Zwart, J. M. Jansma, M. J. Pickering, J. A. Bednar, and B. Horwitz. 2009. From pho-
nemes to articulatory codes: An fMRI study of the role of Broca’s area in speech production. Cerebral
Cortex 19: 2156–2165.
Passingham, R. 1975. Delayed matching after selective prefrontal lesions in monkeys (Macaca mulatta). Brain
Research 92: 89–102.
Paulesu, E., C. D. Frith, and R. S. J. Frackowiak. 1993. The neural correlates of the verbal component of work-
ing memory. Nature 362: 342–345.
Petrides, M., and D. N. Pandya. 1988. Association fiber pathways to the frontal cortex from the superior tem-
poral region in the rhesus monkey. Journal of Comparative Neurology 273: 52–66.
Pigarev, I. N., G. Rizzolatti, and C. Schandolara. 1979. Neurons responding to visual stimuli in the frontal lobe
of macaque monkeys. Neuroscience Letters 12: 207–212.
Poremba, A., and M. Mishkin. 2007. Exploring the extent and function of higher-order auditory cortex in rhesus
monkeys. Hearing Research 229: 14–23.
Posner, M. I., Y. G. Abdullaev, B. D. McCandliss, and S. C. Sereno. 1999. Neuroanatomy, circuitry and plastic-
ity of word reading. Neuroreport 10: R12–R23.
Preuss, T. M., and P. S. Goldman-Rakic. 1989. Connections of the ventral granular frontal cortex of macaques
with perisylvian premotor and somatosensory areas: Anatomical evidence for somatic representation in
primate frontal association cortex. Journal of Comparative Neurology 282: 293–316.
Price, C. J. 1998. The functional anatomy of word comprehension and production. Trends in Cognitive Sciences
2: 281–288.
Price, J. L. 2008. Multisensory convergence in the orbital and ventrolateral prefrontal cortex. Chemosensory
Perception 1: 103–109.
Rao, S. C., G. Rainer, and E. K. Miller. 1997. Integration of what and where in the primate prefrontal cortex.
Science 276: 821–824.
Rizzolatti, G., and L. Craighero. 2004. The mirror–neuron system. Annual Review of Neuroscience 27: 169–192.
Romanski, L. M. 2007. Representation and integration of auditory and visual stimuli in the primate ventral
lateral prefrontal cortex. Cerebral Cortex 17 S1: i61–i69.
Romanski, L. M., and P. S. Goldman-Rakic. 2002. An auditory domain in primate prefrontal cortex. Nature
Neuroscience 5: 15–16.
Romanski, L. M., B. B. Averbeck, and M. Diltz. 2005. Neural representation of vocalizations in the primate
ventrolateral prefrontal cortex. Journal of Neurophysiology 93: 734–747.
Romanski, L. M., B. Tian, J. Fritz, M. Mishkin, P. S. Goldman-Rakic, and J. P. Rauschecker. 1999b. Dual streams
of auditory afferents target multiple domains in the primate prefrontal cortex. Nature Neuroscience 2:
1131–1136.
Romanski, L. M., J. F. Bates, and P. S. Goldman-Rakic. 1999a. Auditory belt and parabelt projections to the
prefrontal cortex in the rhesus monkey. Journal of Comparative Neurology 403: 141–157.
Romo, R., C. D. Brody, A. Hernandez, and L. Lemus. 1999. Neuronal correlates of parametric working mem-
ory in the prefrontal cortex. Nature 399: 470–473.
Rosenkilde, C. E., R. H. Bauer, and J. M. Fuster. 1981. Single cell activity in ventral prefrontal cortex of behav-
ing monkeys. Brain Research 209: 375–394.
Schall, J. D., A. Morel, D. J. King, and J. Bullier. 1995. Topography of visual cortex connections with frontal eye field
in macaque: Convergence and segregation of processing streams. Journal of Neuroscience 15: 4464–4487.
Scott, S. K., C. C. Blank, S. Rosen, and R. J. Wise. 2000. Identification of a pathway for intelligible speech in
the left temporal lobe. Brain 12: 2400–2406.
Stevens, A. A., P. S. Goldman-Rakic, J. C. Gore, R. K. Fulbright, and B. E. Wexler. 1998. Cortical dysfunction
in schizophrenia during auditory word and tone working memory demonstrated by functional magnetic
resonance imaging. Archives of General Psychiatry 55: 1097–1103.
Stilla, R., and K. Sathian. 2008. Selective visuo-haptic processing of shape and texture. Human Brain Mapping
29: 1123–1138.
Stromswold, K., D. Caplan, N. Alpert, and S. Rauch. 1996. Localization of syntactic comprehension by posi-
tron emission tomography. Brain & Language 52: 452–473.
Sugihara, T., M. D. Diltz, B. B. Averbeck, and L. M. Romanski. 2006. Integration of auditory and visual
communication information in the primate ventrolateral prefrontal cortex. Journal of Neuroscience 26:
11138–11147.
Suzuki, H., and M. Azuma. 1977. Prefrontal neuronal activity during gazing at a light spot in the monkey. Brain
Research 126: 497–508.
Tanila, H., S. Carlson, I. Linnankoski, and H. Kahila. 1993. Regional distribution of functions in dorsolateral
prefrontal cortex of the monkey. Behavioral Brain Research 53: 63–71.
Tanila, H., S. Carlson, I. Linnankoski, F. Lindroos, and H. Kahila. 1992. Functional properties of dorsolateral
prefrontal cortical neurons in awake monkey. Behavioral Brain Research 47: 169–180.
Tian, B., D. Reser, A. Durham, A. Kustov, and J. P. Rauschecker. 2001. Functional specialization in rhesus
monkey auditory cortex. Science 292: 290–293.
Tsao, D. Y., N. Schweers, S. Moeller, and W. A. Freiwald. 2008. Patches of face-selective cortex in the macaque
frontal lobe. Nature Neuroscience 11: 877–879.
Ungerleider, L. G., D. Gaffan, and V. S. Pelak. 1989. Projections from inferior temporal cortex to prefrontal
cortex via the uncinate fascicle in rhesus monkeys. Experimental Brain Research 76: 473–484.
Webster, M. J., J. Bachevalier, and L. G. Ungerleider. 1994. Connections of inferior temporal areas TEO and
TE with parietal and frontal cortex in macaque monkeys. Cerebral Cortex 4: 470–483.
Weiskrantz, L., and M. Mishkin. 1958. Effects of temporal and frontal cortical lesions on auditory discrimina-
tion in monkeys. Brain 80: 406–414.
Wilson, F. A., S. P. O’Scalaidhe, and P. S. Goldman-Rakic. 1993. Dissociation of object and spatial processing
domains in primate prefrontal cortex. Science 260: 1955–1958.
Wollberg, Z., and J. Sela. 1980. Frontal cortex of the awake squirrel monkey: Responses of single cells to visual
and auditory stimuli. Brain Research 198: 216–220.
Xu, J., P. J. Gannon, K. Emmorey, J. F. Smith, and A. R. Braun. 2009. Symbolic gestures and spoken language
are processed by a common neural system. Proceedings of the National Academy of Sciences of the
United States of America 106: 20664–20669.
Zatorre, R. J., A. C. Evans, E. Meyer, and A. Gjedde. 1992. Lateralization of phonetic and pitch discrimination
in speech processing. Science 256: 846–849.
Zatorre, R. J., M. Bouffard, and P. Belin. 2004. Sensitivity to auditory object features in human temporal neo-
cortex. The Journal of Neuroscience 24: 3637–3642.
34 A Multisensory Perspective on Human Auditory Communication
Katharina von Kriegstein

CONTENTS
34.1 Introduction........................................................................................................................... 683
34.2 The Auditory Perspective on Auditory Communication....................................................... 684
34.3 The Visual Perspective on Visual Communication............................................................... 685
34.4 The Multisensory Perspective on Auditory Communication................................................ 686
34.4.1 Improving Unisensory Recognition by Multisensory Learning................................ 687
34.4.1.1 Face Benefit: Auditory Recognition Is Improved after Voice–Face
Learning...................................................................................................... 687
34.4.1.2 Is the Face Benefit Caused by Greater Attention during Voice–Face
Learning?.................................................................................................... 688
34.4.1.3 Importance of a Common Cause for Rapid Learning Effects.................... 689
34.4.2 Auditory–Visual Model for Human Auditory Communication................................690
34.4.2.1 Visual Face Areas Are Behaviorally Relevant for Auditory
Recognition.................................................................................................690
34.4.3 A Multisensory Predictive Coding Framework for Auditory Communication........ 693
34.5 Conclusions and Future Directions........................................................................................ 694
References....................................................................................................................................... 695

34.1  INTRODUCTION
We spend a large amount of our time communicating with other people. Much of this communica-
tion occurs face to face, where the availability of sensory input from several modalities (e.g., audi-
tory, visual, tactile, olfactory) ensures a robust perception of information (e.g., Sumby and Pollack
1954; Gick and Derrick 2009). Robustness, in this case, means that the perception of a communi-
cation signal is veridical even when parts of the signal are noisy or occluded (Ay et al. 2007). For
example, if the auditory speech signal is noisy, then the concurrent availability of visual speech sig-
nals (e.g., lip movements and gestures) improves the perception of the speech information (Sumby
and Pollack 1954; Ross et al. 2007). The robustness of face-to-face communication pertains not only to speech recognition (Sumby and Pollack 1954; Ross et al. 2007), but also to other infor-
mation relevant for successful human interaction, for example, recognition of gender (Smith et al.
2007), emotion (de Gelder and Vroomen 1995; Massaro and Egan 1996), or identity (Schweinberger
et al. 2007).
Nevertheless, in our daily life there are also often situations when only a single modality is avail-
able, for example, when talking on the phone, listening to the radio, or when seeing another person
from a distance. Current models assume that perception in these unimodal tasks is based on and constrained to the unimodal sensory system. For example, in this view, solely the auditory system
is involved in the initial sensory analysis of the auditory speech signal during a telephone conver-
sation (see, e.g., Belin et al. 2004; Scott 2005; Hickok and Poeppel 2007). Similarly, it is assumed
that solely the visual system is involved in the initial sensory analysis of faces (Bruce and Young
1986; Haxby et al. 2000). In this chapter, I will review evidence that these models might need to be
extended; perception in human communication may always involve multisensory processing even
when our brains are processing only unimodal input (see, e.g., Hall et al. 2005; Pitcher et al. 2008;
von Kriegstein et al. 2008b). This involvement of multisensory processing might contribute to the
robustness of perception. I will start with a brief overview on mechanisms and models for auditory
speech and visual face processing from a modality-specific perspective. This will be followed by a
summary and discussion of recent behavioral and functional neuroimaging experiments in human
auditory communication that challenge the modality-specific view. They show that an interaction
between auditory and visual sensory processing can increase robustness and improve performance in auditory-only communication. I conclude with a view of how these findings can be explained by a model that unifies unimodal and multimodal recognition.

34.2  THE AUDITORY PERSPECTIVE ON AUDITORY COMMUNICATION


Given good listening conditions, auditory speech perception leads to reliable comprehension of what
is said. Auditory speech additionally reveals information about many other things, for example,
the identity (Sheffert et al. 2002), social and geographical background (Clopper and Pisoni 2004;
Thomas and Reaser 2004), or the emotional state of the speaker (Johnson et al. 1986; Scherer 1986).
Although all this information is relevant for successful human communication, in this chapter I will
focus mostly on two aspects: (1) recognition of what is said (speech recognition) and (2) recognition
of who is talking (speaker recognition).
Large areas in the human brain are dedicated to processing auditory speech. It is still a matter
of debate how speech-specific these areas are (Price et al. 2005; Nelken and Bar-Yosef 2008; von
Kriegstein et al. 2007; Bizley et al. 2009). Basic perceptual features of speech and nonspeech sounds
are pitch and timbre. Voice pitch is related to the vibration rate of the glottal folds. This information
is processed relatively early in the auditory hierarchy, that is, in the brainstem (inferior colliculus)
and close to the primary auditory cortex in Heschl’s gyrus (Griffiths et al. 2001; Patterson et al.
2002; Penagos et al. 2004; von Kriegstein et al. 2010). Timbre is an umbrella term, operationally defined as the perceptual difference between two sounds that have the same pitch, duration, and intensity
(American Standards Association 1960). It comprises such acoustic features as the spectral enve-
lope (i.e., the shape of the power spectrum) and the amplitude envelope (i.e., the shape of the ampli-
tude waveform) of the sound (Grey 1977; Iverson and Krumhansl 1993; McAdams et al. 1995). The
difference between the two speech sounds /a/ and /o/ (spoken with the same voice pitch, intensity,
and duration) is based on the different positions of the articulators (lips, tongue, etc.), which affects
the timbre of the sound. Moreover, an /a/ spoken by two different speakers (with the same voice pitch) also differs in timbre, for example, because the two speakers have distinct vocal tract sizes. In contrast to pitch, differential responses to timbre in nonspeech and
speech sounds have been reported further away from primary auditory cortex in superior temporal
gyrus (STG) and sulcus (STS) (Menon et al. 2002; Warren et al. 2005). Posterior STG/STS contains
regions that are more involved in processing certain aspects of timbre of speech sounds (i.e., those
reflecting the size of the speaker) in contrast to similar aspects of timbre in nonspeech sounds (i.e.,
those reflecting the size of the musical instrument or animal; von Kriegstein et al. 2007).
Bilateral posterior STG/STS has also been implicated in mapping acoustic signals into speech
sounds and speech sound categories, that is, phonemes (for review, see Hickok and Poeppel 2007;
Obleser and Eisner 2008). Speech processing is left-lateralized if the experimental design empha-
sizes understanding what is said, for example, if speech recognition tasks are contrasted with
speaker recognition tasks (but using the same auditory speech input) (Leff et al. 2008; Scott et al.
2000; von Kriegstein et al. 2003, 2008a). In contrast, right temporal lobe regions [temporal lobe
voice areas (TVA)] are more involved in extracting the nonlinguistic voice properties of the speech
signal such as speaker identity (Belin et al. 2000, 2002; von Kriegstein et al. 2003; von Kriegstein
and Giraud 2004). This left–right dichotomy is also supported by lesion studies that typically find
speech processing deficits after left-hemispheric lesions. In contrast, acquired phonagnosia, that is,
a deficiency in recognizing identity by voice, has been reported with right parietal and temporal
lobe damage (Van Lancker and Canter 1982; Van Lancker et al. 1989; Neuner and Schweinberger
2000; Lang et al. 2009). Whether the left–right dichotomy is only relative is still a matter of debate
(Hickok and Poeppel 2000). For example, although speech recognition can be impaired after left
hemispheric lesions (Boatman et al. 1995), it can also be impaired after right hemispheric lesions
in adverse listening conditions (Boatman et al. 2006). The functional view of hemispheric special-
ization might boil down to a specialization of different regions for different time windows in the
speech input. There is evidence that the right hemisphere samples over longer time windows than
the left hemisphere (Poeppel 2003; Boemio et al. 2005; Giraud et al. 2007; Abrams et al. 2008;
Overath et al. 2008). This implies that the relative specialization of the left hemisphere for speech
processing is a result of the highly variable nature of the acoustic input required for speech recog-
nition. In contrast, the relative specialization of the right hemisphere for speaker processing might
be a result of the relatively constant nature of speaker parameters, which also enable us to identify
others by voice (Lavner et al. 2000; Sheffert et al. 2002).
In addition to temporal lobe areas, there is evidence that motor regions (i.e., primary motor and
premotor cortex) play a role in the sensory analysis of speech sounds at the level of phonemes and syl-
lables (Liberman and Mattingly 1985; Watkins et al. 2003; D’Ausilio et al. 2009); however, whether
this involvement reflects a necessary sensory mechanism or other mechanisms necessary for spoken
language comprehension is still being debated (Hickok and Poeppel 2007; Scott et al. 2009).
At a higher level, one of the overarching goals of the sensory analysis of speech signals is to
understand spoken language or to recognize who is talking. The former involves a range of pro-
cessing steps from connecting speech sounds to words and sentences, to grammatical rules and
semantic processing. These processing steps involve an extended network of brain areas (see, e.g.,
Vigneau et al. 2006; Price 2000; Marslen-Wilson and Tyler 2007). For example, prefrontal areas
(BA 44/45) have been implicated in relatively complex language functions such as syntactic processing or work-
ing memory (Friederici 2002; Hickok and Poeppel 2007). Furthermore, semantic analysis might
involve several temporal lobe areas as well as an associative system comprising many widely dis-
tributed brain regions (Martin and Caramazza 2003; Barsalou 2008). One example of such seman-
tic analysis is the involvement of specific areas in the motor cortex for action words (Pulvermuller
et al. 2006; Hauk et al. 2008).
Moreover, the recognition of who is talking involves processing steps beyond sensory analysis
of speaker characteristics and voice identification, for example, associating a specific face or name
with the voice. This is thought to involve several extra-auditory areas, for example, supramodal
areas coding for person identity or visual areas that are involved in face identity processing (Ellis
et al. 1997; Gainotti et al. 2003; Tsukiura et al. 2006; von Kriegstein and Giraud 2006; Campanella
and Belin 2007).

34.3  THE VISUAL PERSPECTIVE ON VISUAL COMMUNICATION


A substantial amount of communication information is transmitted in the visual domain. A par-
ticularly important visual input is the face. Dynamic face information complements auditory infor-
mation about what is said (e.g., lip movements; Chandrasekaran et al. 2009). Moreover, the face
provides a reliable means for recognizing people as well as the emotion of the speaker. I will focus
here on face information although there are, of course, other types of visual information that play
important roles in communication. These include hand gestures during face-to-face communication
or text and emoticons in e-mail communication.
Similarly to the auditory modality, face processing is assumed to be separated into processing of
variable aspects of the face (e.g., expression, speech-related orofacial movements) and processing
of the more invariant aspects of the face (i.e., face identity) (Bruce and Young 1986; Burton et al.
1999; Haxby et al. 2000). This distinction was initially based on behavioral studies showing that
face movement or expression processing can be separately impaired from face-identity processing
(Bruce and Young 1986; Young et al. 1993). For example, the patient LM cannot speech-read from
moving faces but has intact face recognition (Campbell et al. 1997). In contrast, prosopagnosics,
that is, people who have a deficit in recognizing identity from the face, are thought to be unim-
paired in the recognition of dynamic aspects of the human face (Humphreys et al. 1993; Lander et
al. 2004; Duchaine et al. 2003).
The prevalent model of face processing assumes that aspects relevant for identity recognition
are processed in the fusiform face area (FFA) in the ventrotemporal cortex (Sergent et al. 1992;
Kanwisher et al. 1997; Haxby et al. 2000; Bouvier and Engel 2006). Recognition of face expres-
sion and face movement involves the mid/posterior STS (Puce et al. 1998; Pelphrey et al. 2005;
Thompson et al. 2007). However, not all studies support two entirely separate routes in
processing face identity and face dynamics, and the extent of specialization of the two areas for
dynamic versus invariant aspects of faces is still under debate (O’Toole et al. 2002; Calder and
Young 2005; Thompson et al. 2005; Fox et al. 2009).
Visual and visual association cortices have been described as the “core system” of face percep-
tion. In contrast, the “extended system” of face perception involves several nonvisual brain regions—
for example, the amygdala and anterior temporal lobe for processing social significance, emotion,
and person identity (Baron-Cohen et al. 2000; Haxby et al. 2000; Neuner and Schweinberger 2000;
Haxby et al. 2002; Kleinhans et al. 2009). Furthermore, in the model developed by Haxby and
colleagues, the extended system also comprises auditory cortices that are activated in response to
lipreading from faces (Calvert et al. 1997; Haxby et al. 2000).

34.4  THE MULTISENSORY PERSPECTIVE ON AUDITORY COMMUNICATION


There is growing evidence that sensory input of one modality can lead to neuronal responses or
altered processing in sensory areas of another modality (Calvert et al. 1997; Sathian et al. 1997;
Zangaladze et al. 1999; Hall et al. 2005; von Kriegstein et al. 2005; Besle et al. 2008). For example,
lipreading from visual-only videos of faces is associated with responses in auditory cortices (i.e., Heschl’s gyrus and planum temporale) even if no auditory input is available (Hall et al. 2005; Pekkola et al.
2005; Besle et al. 2008). Furthermore, if both the auditory and visual information of a speaker’s face
are available, then the neuronal dynamics in auditory sensory cortices are modulated by the visual
information (van Wassenhove et al. 2005; Arnal et al. 2009). In these studies, the amount of modula-
tion was dependent on how predictable the visual information was in relation to the auditory signal
(van Wassenhove et al. 2005; Arnal et al. 2009). For example, a visual /p/, a speech sound that is easy to distinguish visually from other speech sounds, led to faster responses to the corresponding auditory stimulus than the visually more difficult-to-distinguish /k/. Because visual information about
the facial movements precedes auditory information in time (Chandrasekaran et al. 2009), these
altered responses to auditory stimuli could reflect the transmission of predictive visual information
to the auditory cortices. This predictive information could be used to improve recognition by resolv-
ing auditory ambiguities. In this view, the alteration of responses in auditory cortices would provide
a benefit for processing of the auditory stimulus. Such a mechanism might be responsible for the
robustness of perception in multisensory situations.
The above studies have shown that input from one modality (e.g., visual) can influence responses
in the cortices of another input modality (e.g., auditory) and by that might improve behavioral per-
formance in multisensory situations. What, however, happens in unisensory situations, when only
one input modality is available? Does it have any behavioral relevance that the input modality (e.g.,
auditory) influences the activity in sensory cortices of another modality (e.g., visual) that does not
receive any direct sensory (i.e., visual) input? Recent research suggests that it does. For example,
activation of visual areas has been shown to improve recognition of speech information in auditory-
only situations, such as when talking on the phone (von Kriegstein and Giraud 2006; von Kriegstein
et al. 2006; von Kriegstein et al. 2008b). These findings show that, after a brief period of audiovisual learning, activation of visual association cortices (i.e., the FFA and the face-movement sensitive
STS) is correlated with behavioral benefits for auditory-only recognition. Such findings are at odds
with the above-described unisensory perspective on auditory-only communication, because they
imply that not only auditory sensory but also visual sensory areas are instrumental for auditory-
only tasks. In the following, I will review these behavioral and neuroimaging findings in detail and
discuss the implications for models of human auditory-only perception in human communication.

34.4.1  Improving Unisensory Recognition by Multisensory Learning


34.4.1.1  Face Benefit: Auditory Recognition Is Improved after Voice–Face Learning
Recent studies show that a brief period of prior face-to-face communication improves our ability
to identify a particular speaker by his/her voice in auditory-only conditions (Sheffert and Olson
2004; von Kriegstein et al. 2006; von Kriegstein et al. 2008b) as well as understanding what this
speaker is saying (von Kriegstein et al. 2008b). One of the earliest indications for such beneficial
speaker-specific effects of prior face-to-face communication came from a behavioral study on rec-
ognition of speakers in auditory-only situations (Sheffert and Olson 2004). In this study, subjects
were first trained to associate names with five speakers’ voices. Training was done in two groups.
One group (audiovisual) learned via previously recorded auditory–visual videos of the speaker’s
voice and face. The other group (auditory-only) learned by listening to the auditory track of the
same videos (the face was not visible). After training, both groups were tested on recognizing the
speakers by voice in auditory-only conditions. The results showed that the audiovisual learning was
more effective than the auditory-only learning. This beneficial effect of multisensory learning on
voice recognition has been reproduced in two further studies involving several control conditions
(von Kriegstein and Giraud 2006; von Kriegstein et al. 2008b; see Figure 34.1 for an example design
from one of the studies). Translated into everyday life, these findings would imply that auditory-only
recognition of voices of, for example, TV presenters, is easier than the recognition of voices of radio
speakers (whom one has never seen speaking), given the same acoustic quality and total amount of
exposure. Furthermore, it was shown that not only voice recognition is improved after audiovisual
learning, but also recognizing what is said (von Kriegstein et al. 2008b); previous voice–face video
training improved the recognition of words in auditory-only sentences more than a matched control
training that did not involve faces (Figure 34.1). As in the previous studies, the sentences during the
word recognition task were spoken by the same trained speakers but were different from the train-
ing sentences. In the following, we will term the behavioral improvement on auditory tasks after
a voice–face training, relative to a matched control training, the “face benefit” (Figure 34.1; von
Kriegstein et al. 2008b).
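Expressed as a simple worked calculation (using the group means shown in Figure 34.1), the face benefit is just the difference in auditory-only accuracy between the two training conditions:
\[
\text{face benefit} = \text{accuracy}_{\text{voice–face}} - \text{accuracy}_{\text{control}},
\]
\[
\text{e.g., } 94\% - 92\% = 2\% \text{ for speech recognition and } 82\% - 77\% = 5\% \text{ for speaker recognition.}
\]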
Besides speaker-specific face benefits, there are speaker-independent face benefits in human
auditory communication. For example, learning of foreign phonemes has been shown to be more
efficient when training is performed with audiovisual videos of a talking face, in contrast to training
without dynamic face information (Hardison 2003; Hazan et al. 2005; Hirata and Kelly, 2010). For
the visually well distinguishable phonemes /b/ and /p/, the face benefit is higher than for the visually less
salient speech sounds /l/ and /r/ (Hazan et al. 2005). This face benefit generalizes from the training
speaker to other speakers, that is, listening to and watching the language teacher will also improve
phoneme discrimination in auditory-only conditions for other speakers speaking that language. The
studies on speaker-independent face benefits in phoneme recognition use relatively long training
sessions (e.g., ca. 7 h in total for learning the consonants b, v, and p with videos of five speakers)
before testing for differences between the auditory–visual and auditory-only training conditions
[Figure 34.1 schematic: training (<2 min/speaker) with voice–face videos for three speakers and voice–occupation symbols for three others (speakers named Daniel, Nico, Peter, Ingo, Jan, Martin), followed by auditory-only speech and speaker recognition tests (% correct). Mean accuracies: speech 94% after voice–face vs. 92% after voice–occupation training (face benefit 2%); speaker 82% vs. 77% (face benefit 5%).]
FIGURE 34.1  Example of experimental design. Subjects were first trained on voices and names of six
different speakers. For three of these speakers, training was done with a voice–face video of the speaking
person (voice–face training). For the three other speakers training was done with the voice and a symbol for
the speaker’s occupation. In the subsequent test session, subjects performed a speech or speaker recognition
task on blocks of sentences spoken by previously trained speakers. Results show mean % correct recognition
over subjects. Face benefit is calculated as difference in performance after voice–face vs. voice–occupation
training. (Adapted from von Kriegstein, K. et al., Proc. Natl. Acad. Sci. U.S.A. 105, 6747–6752, 2008b.)

with phonemes spoken by a different set of speakers. In contrast, the speaker-specific face benefits
have been shown to develop very quickly. For example, Sheffert and Olson trained their subjects
with ca. 50 words from each of five speakers. Further studies showed that less than 2 min of training
per speaker already resulted in a significant face benefit [i.e., 9% for speaker recognition in the study
of von Kriegstein and Giraud 2006, and ca. 5% (speaker)/2% (speech) in the report of von Kriegstein
et al. 2008b]. Note, however, that the brief exposure times required for speaker-specific face benefits
seem to have their lower limits. Speaker recognition ability has been investigated after presentation
of only one sentence (mean duration 15 syllables/ca. 900 ms) and after three repetitions of a sentence (45 syllables/ca. 2.7 s) (Cook and Wilding 1997, 2001). For the one-sentence condition, voice
recognition was actually worse after voice–face exposure (in contrast to voice-only exposure). For
the three-sentence condition, voice recognition was the same after voice–face exposure (in contrast
to voice-only exposure). Thus, the beneficial effect of voice–face training for voice recognition in
auditory-only conditions seems to occur somewhere between 3 s and 2 min of training (Cook and
Wilding 1997, 2001; von Kriegstein and Giraud 2006; von Kriegstein et al. 2008b).

34.4.1.2  Is the Face Benefit Caused by Greater Attention during Voice–Face Learning?
One simple explanation for the face benefits could be that seeing people talking is much more excit-
ing and attention-grabbing than just listening to the audio track, even if it is additionally accom-
panied by a visual symbol (Figure 34.1). This increase in attention during training with videos
may lead to better performance during test in auditory-only conditions. However, there is strong
evidence against this possibility. First, in the Sheffert and Olson (2004) study, subjects additionally
performed an old/new recognition test on words spoken by the familiarized speakers or nonfamiliar
speakers. If subjects paid more attention during the voice–face training (in contrast to the voice-
only training), they should remember words from the voice–face training condition better (than
those from the voice-only training). However, there was no such difference in word memory for the
two training conditions. Second, in the von Kriegstein and Giraud (2006) study, subjects were addi-
tionally trained to recognize ringtones of cell phones. In one condition, subjects were trained with
videos of hands operating cell phones. In the control condition, subjects were trained with the brand
names of cell phones. Subsequently, ringtone recognition was tested in an auditory-only condition.
If training with videos was more attention-grabbing, then one would expect better recognition of
ringtones after training with videos in contrast to after training with brand names. However, there
was no such benefit for ringtone recognition. Third, probably the most compelling argument against
an attentional effect is that the face benefits for speech and speaker recognition are behaviorally
dissociable. This dissociability was shown in a study on developmental prosopagnosic subjects and
controls (von Kriegstein et al. 2008b). Developmental prosopagnosia is a lifelong inability to rec-
ognize other people by their face (McConachie 1976; Behrmann and Avidan 2005; Duchaine and
Nakayama 2005; Gruter et al. 2007). The perception of facial dynamics has been shown to be unim-
paired (Lander et al. 2004). In our study (von Kriegstein et al. 2008b), we trained prosopagnosics
and control subjects to associate six speakers’ voices with their names (see Figure 34.1). Training
was done in two conditions. In one condition (voice–face), subjects learned via previously recorded
auditory–visual videos of the speaker’s voice and face. In the control condition (voice–symbol),
subjects learned by listening to the auditory track of the same videos and seeing a visual symbol for
the occupation of the person. After training, all subjects were tested on two tasks in auditory-only
conditions. In one task, subjects recognized the speakers by voice (speaker recognition), in the other
task subjects recognized what was said (speech recognition). If the improvement in auditory-only
conditions by prior voice–face training (i.e., the face benefit) depends on attention, one would expect
that both groups have similar face benefits on the two tasks. This was not the case. Although prosopagnosics had a normal face benefit for speech recognition (comparable to that of controls), they had no face benefit for speaker recognition (unlike controls). This means that the face
benefit in speech recognition can be normal, whereas the face benefit in speaker recognition can
be selectively impaired. It suggests that the face benefits in speech and speaker recognition rely on
two distinct and specific mechanisms instead of one common attentional mechanism. I will explain
what these mechanisms might be in terms of brain processes in Section 34.4.2.

34.4.1.3  Importance of a Common Cause for Rapid Learning Effects


Multisensory learning improves unisensory recognition not only for human communication signals.
For example, Seitz et al. (2006) trained subjects to detect visual motion within a visual dot pattern.
There were two training conditions. In one condition, the visual motion was accompanied by mov-
ing sounds (audiovisual training). The other condition was a visual-only motion detection training.
The audiovisual training resulted in better performance (as compared to the visual-only training) on
motion detection in the visual-only test. The audiovisual training benefit occurred only if dots and
sounds moved in the same direction but not if they moved in opposing directions (Kim et al. 2008;
reviewed by Shams and Seitz 2008). These findings are compatible with the view that multisensory
training is beneficial for unisensory tasks if information in each modality is based on a common
cause, which has physically highly predictable consequences in the sensorium (von Kriegstein and
Giraud 2006). For example, when an object is moving it produces consequences in the auditory and
visual domain, which are not arbitrarily related, because they are caused by the same movement.
Similarly, if a foreign phoneme is learned in multisensory situations, then the vocal tract move-
ments of the speaker cause the acoustic properties of the speech sound. They are not arbitrarily
related either, because a certain speech sound is, at least in ecologically valid situations, caused
by a specific vocal tract movement. This common cause results in similar and tightly correlated
dynamics in the visual and auditory modality (Chandrasekaran et al. 2009). Not only movement, but also shape and other material properties, have expressions in the visual and auditory modalities
(Lakatos et al. 1997; Smith et al. 2005). For example, voices give information about the physical
characteristics of the speaker, such as body size, because the length of the vocal tract influences the
timbre of the voice and is correlated with body size (Smith et al. 2005). In contrast, other auditory–
visual events can be arbitrarily related. Ringtones and cell phones, for example, relate to a unique
ecologically valid multimodal source, but their association is arbitrary. The visual appearance of the
cell phone does not physically cause the characteristics of the ringtone and vice versa. We assume
that the rapid acquisition of face benefits and multisensory learning benefits occurs if the brain can
exploit already existing knowledge about the relationship between auditory and visual modalities
(von Kriegstein and Giraud 2006; von Kriegstein et al. 2008b). This would explain why there are
rapid learning benefits when auditory and visual information is tightly correlated, whereas there are
no such rapid learning benefits when they are arbitrarily related (Seitz et al. 2006; von Kriegstein
and Giraud 2006; Kim et al. 2008).

34.4.2  Auditory–Visual Model for Human Auditory Communication


Several types of mechanisms have been suggested to account for the behavioral
benefits in unisensory conditions after multisensory learning. The conventional view (“auditory-
only model”) would assume that the brain uses auditory-only processing capabilities for the sensory
analysis of auditory information. In this case, the face benefits (or other multisensory learning
benefits) could be explained by an increase in effectiveness of sensory processing in auditory areas.
Such a mechanism has been suggested previously (Seitz and Dinse 2007; Shams and Seitz 2008),
but to my knowledge has not been tested in detail yet. In contrast, the “audiovisual model” assumes
that the brain uses previously learned audiovisual speaker-specific information to improve recog-
nition in auditory-only conditions (von Kriegstein et al. 2008b). In this view, even without visual
input, face-processing areas use encoded knowledge about the visual orofacial kinetics of talking
and simulate a speaker to make predictions about the trajectory of what is heard (Figure 34.2). This
visual online simulation places helpful constraints on auditory perception to improve recognition by
resolving auditory ambiguities. This model implies that (1) visual face processing areas are involved
in auditory-only tasks and that (2) this involvement is behaviorally relevant. There is neuroimaging
evidence in support of the audiovisual model and I will review this evidence in the following.

34.4.2.1  Visual Face Areas Are Behaviorally Relevant for Auditory Recognition
Recent neuroimaging studies show that face-sensitive areas (STS and FFA) are involved in the rec-
ognition of auditory communication signals (von Kriegstein et al. 2005; von Kriegstein and Giraud
2006; von Kriegstein et al. 2006, 2008b). They suggest that the FFA is behaviorally relevant for
auditory-only speaker recognition, and that the face-movement sensitive STS is behaviorally rel-
evant for auditory-only speech recognition.

34.4.2.1.1  FFA and Speaker Recognition


Several studies focused on the blood oxygen level dependent (BOLD) responses in the FFA during
speaker recognition in auditory-only conditions. The FFA is more activated if subjects perform (1) a speaker task (in contrast to a speech task) for personally familiar speakers (in contrast to nonfamiliar speakers) (von Kriegstein et al. 2005, 2006); (2) a speaker task after voice–face learning (in contrast to a speaker task before voice–face learning) (von Kriegstein and Giraud 2006); (3) a speaker task after voice–face learning (in contrast to a speaker task after a matched control learning) (von Kriegstein and Giraud 2006; von Kriegstein et al. 2008b); and (4) a speaker task in contrast to a speech task after voice–face learning (in contrast to the same contrast after a matched control learning) (von Kriegstein et al. 2008b). In summary, FFA activation during auditory-only speaker
[Figure 34.2 schematic, panels (a) Training and (b) After Training: auditory and visual structural analysis feed vocal speech and facial speech nodes for speech recognition (left (fronto)temporal cortex; (left) pSTS) and voice identity and face identity nodes for person recognition (right m/a STS; right FFA), with interactions between the auditory and visual nodes.]

FIGURE 34.2  Audiovisual model for human communication. Schematic for processing of human commu-
nication signals during speech and speaker recognition. (a) Audiovisual input enters auditory and visual pre-
processing areas. These feed into two distinct networks, which process speech and speaker information. This
panel schematically depicts potential mechanisms during voice–face training (see Figure 34.1) as well as areas
potentially involved in this process. (b) Auditory-only input enters auditory preprocessing areas. For speech
recognition, facial and vocal speech areas interact while engaging concurrently with higher levels of speech
processing. Similarly, for speaker recognition, face and voice identity areas interact while engaging concur-
rently with higher levels of speaker identity processing. This panel schematically depicts potential mechanisms
during auditory testing after voice–face training (see Figure 34.1) as well as areas potentially involved in this
process. Note that interactions between boxes do not imply direct anatomical connections and that boxes may
represent more than one area, in particular for higher levels of speech and speaker recognition.

tasks is increased after prior voice–face experience and is task-specific. Activation of the FFA is
higher if subjects are asked to recognize the speaker in contrast to recognizing what is said, even
if the stimulus input for the two tasks is exactly the same. Figure 34.3 shows an example of FFA activity during speaker recognition before and after voice–face and voice–name learning. Note that
in contrast to the increased FFA activation after voice–face learning, the auditory voice region in the
right temporal lobe (TVA) shows similar activation increase for the two training conditions (Figure
34.3). This could be taken as evidence against the view that face benefits can be explained by an
increased effectiveness of auditory-only processing.
Not only does the level of activation change after a brief voice–face learning, but so does the functional connectivity of the FFA to other brain areas. When subjects recognize previously heard
voices of nonfamiliar people, the FFA is functionally connected to a frontoparietal network (von
Kriegstein and Giraud 2006). This pattern is similar to the connectivity pattern of the FFA, when
subjects are instructed to vividly imagine faces without any meaningful sensory input besides the
task instructions (Ishai et al. 2002; Mechelli et al. 2004). The connectivity changes dramatically
after a brief voice–face training. After training, the functional connectivity of the FFA to the fronto-
parietal network is decreased. In contrast, connectivity between FFA and auditory voice-sensitive
areas (TVA) increases (von Kriegstein and Giraud 2006). A similar pattern of connectivity between
FFA and TVA can also be found during recognition of personally familiar speakers’ voices (von
Kriegstein et al. 2005). The change in connectivity suggests that the FFA activation after voice–face
training results from a different mechanism than before training or during task-instructed imagery.
The more direct connectivity between FFA and TVA after voice–face learning is compatible with
the hypothesis that auditory and visual areas interact already at stages of sensory analysis as sug-
gested by the audiovisual model (von Kriegstein and Giraud 2006).
[Figure 34.3 bar plots: BOLD signal change during speaker recognition in auditory-only conditions, shown for the right TVA (left panel) and the fusiform face area (right panel), before and after voice–face and voice–name training.]

FIGURE 34.3  Blood oxygen level dependent (BOLD) responses in voice-sensitive (left panel) and face-sensitive (right panel) areas before and after different types of audiovisual training. In this study, control training involved learning of voice–name associations (instead of the voice–occupation symbol associations displayed in Figure 34.1). Note that the increase in activation in auditory voice areas (TVA) is similar for both training conditions. In contrast, responses in the fusiform face area increase only after voice–face training, not after voice–name training. Signal change here refers to a contrast between speaker recognition and ringtone recognition (for details, see von Kriegstein and Giraud 2006).

34.4.2.1.2  Face-Movement Sensitive Posterior STS and Speech Recognition


The face-movement sensitive posterior STS is also activated after a brief voice–face training (in
contrast to a matched control training), but only if the subjects’ task is to recognize what has been
said (in contrast to a speaker recognition task) (von Kriegstein et al. 2008b). In this study, the face-
movement sensitive posterior STS has been located with visual stimuli only (Figure 34.4, blue) and
has been shown to be distinct from STS areas that are involved in speech recognition in general
(Figure 34.4, green).

34.4.2.1.3  FFA and Face-Movement Sensitive STS Play Distinct Roles in Auditory Recognition
The task-specificity of the activation in FFA and face-movement sensitive STS suggests that these
two regions serve different roles in auditory speech and speaker perception. If these roles are within
the domain of sensory analysis, then one would expect that the amount of activation correlates
positively with performance on auditory recognition tasks. Recent research confirms this (von
Kriegstein et al. 2006; von Kriegstein et al. 2008b). It was found that subjects who profit most from the prior voice–face training when performing auditory-only tasks have a high activation level
in visual face-sensitive areas. Specifically, the activation level of the face-movement sensitive STS
correlates positively with the across-subjects face benefits in speech recognition (Figure 34.4, red),
whereas the activity in the FFA correlates positively with the across-subject face benefits in speaker
recognition (von Kriegstein et al. 2008b).
Furthermore, the behavioral dissociation of face benefits for speech and speaker recognition in
auditory-only conditions (see Section 34.4.1.2) is paralleled by a neuroanatomical dissociation. In
von Kriegstein et al.’s (2008b) study, both prosopagnosics and controls had a positive correlation
of the face benefit in speech recognition with the amount of STS activation. Controls also had
a positive correlation of the face benefit in speaker recognition with the amount of FFA activa-
tion. In contrast, in prosopagnosics there was no positive correlation of the face benefit in speaker
[Figure 34.4 brain sections at y = –45 and y = –51, showing contrasts labeled speech > object and speech > speaker, the visual face area localizer, and the correlation with the face benefit in the speech task.]

FIGURE 34.4  (See color insert.) Face-sensitive left STS (blue) is located in regions of STS that are distinct
from those that are responsive to auditory speech (green). Positive correlation of activity in STS with face
benefit in speech task (red) overlaps with the face area (overlap in purple) but not with the auditory area (green)
(for more details on specific contrasts used, see von Kriegstein et al. 2008b). y, MNI coordinate in anterior–
posterior direction.

recognition with the amount of FFA activation. The behavioral and neuroanatomical dissociation
is in accord with the audiovisual model (Figure 34.2). Speech and speaker recognition largely rest
on two different sets of audiovisual correlations. Speech recognition is based predominantly on fast
time-varying acoustic cues produced by the varying vocal tract shape (Fant 1960), and much of this
is visible on the speaker’s face (Yehia et al. 1998). Conversely, speaker recognition uses predomi-
nantly very slowly varying properties of the speech signal, such as the acoustic correlates of vocal tract length (Lavner et al. 2000). If the brain uses encoded visual information for processing
auditory-only speech, the behavioral improvement that is induced by voice–face training (i.e., the
face benefit) must be dissociable for speech and speaker recognition (von Kriegstein et al. 2008b).

34.4.3  A Multisensory Predictive Coding Framework for Auditory Communication


Which computational mechanism may underlie the audiovisual model? The model posits that an
internal simulation of facial features is instrumental in performing recognition tasks on auditory-
only speech. In this view, the brain internally represents a multisensory environment that enables
robust sensory perception in auditory-only conditions. This is comparable to external face simu-
lations used to improve speech recognition especially in the hearing-impaired; speech recogni-
tion during telephone conversations can be improved by external video simulations of an artificial
“talking face” (Siciliano et al. 2002). Such external simulation helps hearing-impaired listeners to
understand what is said. This creation of an artificial talking face uses a phoneme recognizer and
a face synthesizer to recreate the facial movements based on the auditory input. The audiovisual
model for auditory communication predicts that the human brain routinely uses a similar mecha-
nism: Auditory-only speech processing and speaker recognition is improved by internal simulation
of a talking face. How can such a model be explained in computational modeling terms? Recent
theoretical neuroscientific work has suggested that recognition can be modeled using a predic-
tive coding framework (Friston 2005). This framework assumes that efficient online recognition
of sensory signals is accomplished by a cortical hierarchy that is tuned to the prediction of sensory
signals. It is assumed that high levels of the hierarchy (i.e., further away from the sensory input)
provide predictions about the representation of information at a lower level of the hierarchy (i.e.,
closer to the sensory input). Each level contains a forward or generative model for the causes of
the sensory input and uses this model to generate predictions and constraints for the interpreta-
tion of the sensory input. Higher levels send predictions to the lower level, whereas the lower level
sends prediction errors to the higher level. One prerequisite to make such a mechanism useful is
that the brain learns regularities within the environment to efficiently predict the future sensory
input. Furthermore, these regularities should be adaptable to allow for changes in the regularities of
the environment. Therefore, predictive coding theories have been formulated in a Bayesian frame-
work. In this framework, predictions are based on previous sensory evidence and can have varying
degrees of certainty. In visual and sensory–motor processing, “internal forward models” have been
used to explain how the brain encodes complex sensory data by relatively few parameters (Wolpert
et al. 1995; Knill et al. 1998; Rao and Ballard 1999; Bar 2007; Deneve et al. 2007).
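As a minimal generic illustration of this kind of scheme (a two-level sketch in the spirit of the cited predictive coding accounts, not the specific model of any study reviewed here), the sensory input $u$ can be written as generated from a hidden cause $v$ with prior expectation $v_p$:
\[
u = g(v) + \epsilon_u, \qquad v = v_p + \epsilon_v ,
\]
and recognition corresponds to finding the cause that best balances the bottom-up prediction error against the top-down prior, for example by gradient descent on the precision-weighted errors:
\[
\hat{v} = \arg\min_{v} \left[ \frac{\big(u - g(v)\big)^{2}}{\sigma_u^{2}} + \frac{(v - v_p)^{2}}{\sigma_v^{2}} \right],
\qquad
\dot{\hat{v}} \;\propto\; \frac{g'(\hat{v})\,\big(u - g(\hat{v})\big)}{\sigma_u^{2}} - \frac{\hat{v} - v_p}{\sigma_v^{2}} .
\]
Here $u - g(\hat{v})$ is the prediction error passed up from the lower level, $\hat{v} - v_p$ is the deviation from the higher-level prediction, and the variances $\sigma_u^{2}$ and $\sigma_v^{2}$ implement the Bayesian weighting of sensory evidence against prior knowledge.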
Although predictive coding theories usually emphasize interaction between high and low levels,
a similar interaction might occur between sensory modalities. For example, the brain might use
audiovisual forward models, which encode the physical, causal relationship between a person talking
and its consequences for the visual and auditory input (von Kriegstein et al. 2008b). Critically, these
models encode the causal dependencies between the visual and auditory trajectories. Perception is
based on the “inversion” of models, that is, the brain identifies causes (e.g., Mr. Smith says “Hello”)
that explain the observed audiovisual input best. The changes in behavioral performance after a
brief voice–face experience suggest that the human brain can quickly and efficiently learn “a new
person” by adjusting key parameters in existing internal audiovisual forward models. Once param-
eters for an individual person are learned, auditory speech processing is improved because the brain
learned parameters of an audiovisual forward model with strong dependencies between internal
auditory and visual trajectories. The use of these models is reflected in an increased activation of
face-processing areas during auditory tasks. The audiovisual speaker model enables the system to
simulate visual trajectories (via the auditory trajectories) when there is no visual input. The talking
face simulation works best if the learned coupling between auditory and visual input is strong and
veridical. The visual simulation is fed back to auditory areas thereby improving auditory recogni-
tion by providing additional constraints. This mechanism can be used iteratively until the inversion
of the audiovisual forward model converges on a percept. In summary, this scheme suggests that
forward models encode and exploit dependencies in the environment and are used to improve recog-
nition in unisensory conditions by simulating the causes of the sensory input. Note that this chapter
focuses on the visual part of this simulation process. It is currently unclear whether motor processes
also play a role in this online simulation and whether the simulation proposed here is related
to simulation accounts underlying the motor theory of speech perception (Fischer and Zwaan 2008;
D’Ausilio et al. 2009).
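Purely as an illustration of what "inverting an audiovisual forward model" could mean computationally, the following toy sketch (with made-up linear mappings and variable names; it is not an implementation of the model proposed in the cited work) estimates a shared hidden cause from noisy auditory-only input and uses it both to simulate the missing visual trajectory and to reconstruct a cleaner auditory estimate:

```python
import numpy as np

# Toy sketch only (not the model used in the studies discussed above): a shared hidden
# "articulatory" trajectory x causes both an auditory trajectory a = x @ Ha.T and a visual
# trajectory v = x @ Hv.T. The mappings Ha and Hv stand in for previously learned
# audiovisual knowledge. Auditory-only recognition is treated as inversion of the forward
# model: estimate x from the noisy auditory input alone, which yields both an internally
# simulated visual ("talking face") trajectory and a de-noised auditory estimate.

rng = np.random.default_rng(0)
T, D = 50, 3                                    # time steps, hidden-state dimensionality
Ha = rng.normal(size=(4, D))                    # assumed (learned) auditory forward mapping
Hv = rng.normal(size=(2, D))                    # assumed (learned) visual forward mapping

x_true = np.cumsum(rng.normal(scale=0.1, size=(T, D)), axis=0)   # smooth hidden cause
a_clean = x_true @ Ha.T
a_obs = a_clean + rng.normal(scale=0.5, size=a_clean.shape)      # noisy auditory-only input

# Iterative inversion: gradient descent on the auditory prediction error, with a weak
# shrinkage prior on the hidden state as a crude stand-in for top-down constraints.
x_hat = np.zeros((T, D))
lr, prior_weight = 0.05, 0.1
for _ in range(300):
    pred_err = a_obs - x_hat @ Ha.T             # bottom-up (auditory) prediction error
    x_hat += lr * (pred_err @ Ha - prior_weight * x_hat)

v_sim = x_hat @ Hv.T                            # simulated visual trajectory (no visual input given)
a_model = x_hat @ Ha.T                          # model-based reconstruction of the auditory signal

print("simulated visual trajectory shape:", v_sim.shape)
print("mean squared error, raw noisy input:", round(float(np.mean((a_obs - a_clean) ** 2)), 3))
print("mean squared error, model estimate :", round(float(np.mean((a_model - a_clean) ** 2)), 3))
```

In this toy example the simulated visual trajectory is simply a byproduct of inverting the shared cause; in the scheme sketched above it would additionally be fed back as a constraint on the auditory interpretation.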

34.5  CONCLUSIONS AND FUTURE DIRECTIONS


In contrast to a modality-specific view on unimodal perception, recent research suggests that not
only auditory areas but also visual face-sensitive areas are behaviorally relevant for the sensory
analysis of auditory communication signals (von Kriegstein et al. 2005; von Kriegstein and Giraud
2006; von Kriegstein et al. 2006; von Kriegstein et al. 2008b). Speech recognition is supported by
selective recruitment of the face-sensitive STS, which is known to be involved in orofacial movement
processing (Puce et al. 1998; Thompson et al. 2007). Speaker recognition is supported by selective
recruitment of the FFA, which is involved in face-identity processing (Eger et al. 2004; Rotshtein
et al. 2005; Bouvier and Engel 2006). These findings challenge auditory-only models for speech
processing, because they imply that during large parts of ecologically valid social interactions, not
only auditory but also visual areas are involved in solving auditory tasks. For example, during a phone
conversation with personally familiar people (e.g., friends or colleagues), face sensitive areas will
be employed to optimally understand what the person is saying and to identify the other by his/her
voice. The same applies to less familiar people, given a brief prior face-to-face interaction.
The results have been explained by an audiovisual model couched in a predictive coding frame-
work (von Kriegstein and Giraud 2006; von Kriegstein et al. 2008b). This model assumes that the
brain routinely simulates talking faces in response to auditory input and that this internal audio­
visual simulation is used to actively predict and thereby constrain the possible interpretations of the
auditory signal. This mechanism leads to improved recognition in situations where only the audi-
tory modality is available.
Whether the audiovisual simulation scheme is a general principle of how unisensory tasks are
performed when one or more of the usual input modalities are missing is unclear. I assume that the
same principle also applies to other voice–face information that is correlated in the auditory and
visual domains, such as recognition of emotion from voice and face (de Gelder and Vroomen 1995;
Massaro and Egan 1996). Furthermore, the principle might even be applicable to noncommunica-
tion sensory signals with a (veridical or illusory) common cause such as the recognition of move-
ment trajectories of computer-animated dot patterns and moving sound sources (Seitz et al. 2006).
Neuroscientific research has focused on responses in visual sensory areas in auditory-only condi-
tions after a brief voice–face sensory experience. However, visual sensory areas could also play a
role for speakers for whom there is no specific voice–face sensory experience. For example, the
speaker-independent effect of foreign phoneme training (Hardison 2003; Hazan et al. 2005; Hirata
and Kelly, 2010) could be based on extrapolating the speaker-specific face model to other speakers.
Similar mechanisms might occur during development of the speech perception system in children.
The use of internal face models for speech and speaker recognition might be especially impor-
tant in situations in which there is uncertainty in the input modality. There are multiple sources
for uncertainty in human auditory communication. For example, a low level of experience with a
second language will likely result in a high level of uncertainty about the trajectory of the speech
signal. Furthermore, a high level of background noise will result in a high level of uncertainty about
the speech input. The use of an internal face simulation mechanism could increase robustness of
perception in these situations.

REFERENCES
Abrams, D. A., T. Nicol, S. Zecker, and N. Kraus. 2008. Right-hemisphere auditory cortex is dominant for cod-
ing syllable patterns in speech. J Neurosci 28: 3958–3965.
American Standards Association. 1960. Acoustical Terminology SI. New York: Association AS.
Arnal, L. H., B. Morillon, C. A. Kell, and A. L. Giraud. 2009. Dual neural routing of visual facilitation in
speech processing. J Neurosci 29: 13445–13453.
Ay, N., J. Flack, and D. C. Krakauer. 2007. Robustness and complexity co-constructed in multimodal signalling
networks. Philos Trans R Soc Lond B Biol Sci 362: 441–447.
Bar, M. 2007. The proactive brain: Using analogies and associations to generate predictions. Trends Cogn Sci
11: 280–289.
Baron-Cohen, S., H. A. Ring, E. T. Bullmore, S. Wheelwright, C. Ashwin, and S. C. R. Williams. 2000. The
amygdala theory of autism. Neurosci Biobehav Rev 24: 355–364.
Barsalou, L. W. 2008. Grounded cognition. Annu Rev Psychol 59: 617–645.
Behrmann, M., and G. Avidan. 2005. Congenital prosopagnosia: Face-blind from birth. Trends Cogn Sci
9: 180–187.
Belin, P., S. Fecteau, and C. Bedard. 2004. Thinking the voice: Neural correlates of voice perception. Trends
Cogn Sci 8: 129–135.
Belin, P., R. J. Zatorre, and P. Ahad. 2002. Human temporal-lobe response to vocal sounds. Brain Res Cogn
Brain Res 13: 17–26.
Belin, P., R. J. Zatorre, P. Lafaille, P. Ahad, and B. Pike. 2000. Voice-selective areas in human auditory cortex.
Nature 403: 309–312.
Besle, J., C. Fischer, A. Bidet-Caulet, F. Lecaignard, O. Bertrand, and M. H. Giard. 2008. Visual activation
and audiovisual interactions in the auditory cortex during speech perception: Intracranial recordings in
humans. J Neurosci 28: 14301–14310.
Bizley, J. K., K. M. Walker, B. W. Silverman, A. J. King, and J. W. Schnupp. 2009. Interdependent encoding of
pitch, timbre, and spatial location in auditory cortex. J Neurosci 29: 2064–2075.
Boatman, D., R. P. Lesser, and B. Gordon. 1995. Auditory speech processing in the left temporal lobe: An
electrical interference study. Brain Lang 51: 269–290.
Boatman, D. F., R. P. Lesser, N. E. Crone, G. Krauss, F. A. Lenz, and D. L. Miglioretti. 2006. Speech recogni-
tion impairments in patients with intractable right temporal lobe epilepsy. Epilepsia 47: 1397–1401.
Boemio, A., S. Fromm, A. Braun, and D. Poeppel. 2005. Hierarchical and asymmetric temporal sensitivity in
human auditory cortices. Nat Neurosci 8: 389–395.
Bouvier, S. E., and S. A. Engel. 2006. Behavioral deficits and cortical damage loci in cerebral achromatopsia.
Cereb Cortex 16: 183–191.
Bruce, V., and A. Young. 1986. Understanding face recognition. Br J Psychol 77: 305–327.
Burton, A. M., V. Bruce, and P. J. B. Hancock. 1999. From pixels to people: A model of familiar face recogni-
tion. Cogn Sci 23: 1–31.
Calder, A. J., and A. W. Young. 2005. Understanding the recognition of facial identity and facial expression.
Nat Rev Neurosci 6: 641–651.
Calvert, G. A., E. T. Bullmore, M. J. Brammer, R. Campbell, S. C. Williams, P. K. McGuire, P. W. Woodruff,
S. D. Iversen, and A. S. David. 1997. Activation of auditory cortex during silent lipreading. Science 276:
593–596.
Campanella, S., and P. Belin. 2007. Integrating face and voice in person perception. Trends Cogn Sci 11:
535–543.
Campbell, R., J. Zihl, D. Massaro, K. Munhall, and M. M. Cohen. 1997. Speechreading in the akinetopsic
patient, L.M. Brain 120 (Pt 10): 1793–1803.
Chandrasekaran, C., A. Trubanova, S. Stillittano, A. Caplier, and A. A. Ghazanfar. 2009. The natural statistics
of audiovisual speech. PLoS Comput Biol 5: e1000436.
Clopper, C. G., and D. B. Pisoni. 2004. Some acoustic cues for the perceptual categorization of American
English regional dialects. J Phon 32: 111–140.
Cook, S., and J. Wilding. 1997. Earwitness testimony: 2. Voices, faces and context. Appl Cogn Psychol 11:
527–541.
Cook, S., and J. Wilding. 2001. Earwitness testimony: Effects of exposure and attention on the face overshad-
owing effect. Br J Psychol 92: 617–629.
D’Ausilio, A., F. Pulvermuller, P. Salmas, I. Bufalari, C. Begliomini, and L. Fadiga. 2009. The motor somato-
topy of speech perception. Curr Biol 19: 381–385.
de Gelder, B., and J. Vroomen. 1995. The perception of emotions by ear and by eye, 289–311. Los Angeles:
Psychology Press.
Deneve, S., J. R. Duhamel, and A. Pouget. 2007. Optimal sensorimotor integration in recurrent cortical net-
works: A neural implementation of Kalman filters. J Neurosci 27: 5744–5756.
Duchaine, B., and K. Nakayama. 2005. Dissociations of face and object recognition in developmental proso­
pagnosia. J Cogn Neurosci 17: 249–261.
Duchaine, B. C., H. Parker, and K. Nakayama. 2003. Normal recognition of emotion in a prosopagnosic.
Perception 32: 827–838.
Eger, E., P. G. Schyns, and A. Kleinschmidt. 2004. Scale invariant adaptation in fusiform face-responsive
regions. Neuroimage 22: 232–242.
Ellis, H. D., D. M. Jones, and N. Mosdell. 1997. Intra- and inter-modal repetition priming of familiar faces and
voices. Br J Psychol 88 (Pt 1): 143–156.
Fant, G. 1960. Acoustic theory of speech production. Paris: Mouton.
Fischer, M. H., and R. A. Zwaan. 2008. Embodied language: A review of the role of the motor system in lan-
guage comprehension. Q J Exp Psychol (Colchester) 61: 825–850.
Fox, C. J., S. Y. Moon, G. Iaria, and J. J. Barton. 2009. The correlates of subjective perception of identity and
expression in the face network: An fMRI adaptation study. Neuroimage 44: 569–580.
Friederici, A. D. 2002. Towards a neural basis of auditory sentence processing. Trends Cogn Sci 6: 78–84.
Friston, K. 2005. A theory of cortical responses. Philos Trans R Soc Lond B Biol Sci 360: 815–836.
Gainotti, G., A. Barbier, and C. Marra. 2003. Slowly progressive defect in recognition of familiar people in a
patient with right anterior temporal atrophy. Brain 126: 792–803.
Gick, B., and D. Derrick. 2009. Aero-tactile integration in speech perception. Nature 462: 502–504.
Giraud, A. L., A. Kleinschmidt, D. Poeppel, T. E. Lund, R. S. Frackowiak, and H. Laufs. 2007. Endogenous cortical
rhythms determine cerebral specialization for speech perception and production. Neuron 56: 1127–1134.
Grey, J. M. 1977. Multidimensional perceptual scaling of musical timbres. J Acoust Soc Am 61: 1270–1277.
Griffiths, T. D., S. Uppenkamp, I. Johnsrude, O. Josephs, and R. D. Patterson. 2001. Encoding of the temporal
regularity of sound in the human brainstem. Nat Neurosci 4: 633–637.
Gruter, M., T. Gruter, V. Bell, P. W. Halligan, J. Horst, K. Sperling et al. 2007. Hereditary prosopagnosia: The first
case series. Cortex 43: 734–749.
Hall, D. A., C. Fussell, and A. Q. Summerfield. 2005. Reading fluent speech from talking faces: Typical brain
networks and individual differences. J Cogn Neurosci 17: 939–953.
Hardison, D. M. 2003. Acquisition of second-language speech: Effects of visual cues, context, and talker vari-
ability. Appl Psycholinguist 24: 495.
Hauk, O., Y. Shtyrov, and F. Pulvermuller. 2008. The time course of action and action–word comprehension in
the human brain as revealed by neurophysiology. J Physiol Paris 102: 50–58.
Haxby, J. V., E. A. Hoffman, and M. I. Gobbini. 2000. The distributed human neural system for face perception.
Trends Cogn Sci 4: 223–233.
Haxby, J. V., E. A. Hoffman, and M. I. Gobbini. 2002. Human neural systems for face recognition and social
communication. Biol Psychiatry 51: 59–67.
Hazan, V., A. Sennema, M. Iba, and A. Faulkner. 2005. Effect of audiovisual perceptual training on the percep-
tion and production of consonants by Japanese learners of English. Speech Commun 47: 360–378.
Hickok, G., and D. Poeppel. 2000. Towards a functional neuroanatomy of speech perception. Trends Cogn Sci
4: 131–138.
Hickok, G., and D. Poeppel. 2007. The cortical organization of speech processing. Nat Rev Neurosci 8:
393–402.
Hirata, Y., and S. D. Kelly. 2010. Effects of lips and hands on auditory learning of second language speech
sounds. J Speech Lang Hear Res 53: 298–310.
Humphreys, G. W., N. Donnelly, and M. J. Riddoch. 1993. Expression is computed separately from facial
identity, and it is computed separately for moving and static faces: Neuropsychological evidence.
Neuropsychologia 31: 173–181.
Ishai, A., J. V. Haxby, and L. G. Ungerleider. 2002. Visual imagery of famous faces: Effects of memory and
attention revealed by fMRI. Neuroimage 17: 1729–1741.
Iverson, P., and C. L. Krumhansl. 1993. Isolating the dynamic attributes of musical timbre. J Acoust Soc Am
94: 2595–2603.
Johnson, W. F., R. N. Emde, K. R. Scherer, and M. D. Klinnert. 1986. Recognition of emotion from vocal cues.
Arch Gen Psychiatry 43: 280–283.
Kanwisher, N., J. McDermott, and M. M. Chun. 1997. The fusiform face area: A module in human extrastriate
cortex specialized for face perception. J Neurosci 17: 4302–4311.
Kim, R. S., A. R. Seitz, and L. Shams. 2008. Benefits of stimulus congruency for multisensory facilitation of
visual learning. PLoS ONE 3: e1532.
Kleinhans, N. M., L. C. Johnson, T. Richards, R. Mahurin, J. Greenson, G. Dawson et al. 2009. Reduced neural
habituation in the amygdala and social impairments in autism spectrum disorders. Am J Psychiatry 166:
467–475.
Knill, D., D. Kersten, A. Yuille, and W. Richards. 1998. Introduction: A Bayesian formulation of visual percep-
tion. In Perception as Bayesian Inference, 1–21. Cambridge, MA: Cambridge Univ. Press.
Lakatos, S., S. McAdams, and R. Causse. 1997. The representation of auditory source characteristics: Simple
geometric form. Percept Psychophys 59: 1180–1190.
Lander, K., G. Humphreys, and V. Bruce. 2004. Exploring the role of motion in prosopagnosia: Recognizing,
learning and matching faces. Neurocase 10: 462–470.
Lang, C. J., O. Kneidl, M. Hielscher-Fastabend, and J. G. Heckmann. 2009. Voice recognition in aphasic and
non-aphasic stroke patients. J Neurol 256: 1303–1306.
Lavner, Y., I. Gath, and J. Rosenhouse. 2000. The effects of acoustic modifications on the identification of
familiar voices speaking isolated vowels. Speech Commun 30: 9–26.
Leff, A. P., T. M. Schofield, K. E. Stephan, J. T. Crinion, K. J. Friston, and C. J. Price. 2008. The cortical
dynamics of intelligible speech. J Neurosci 28: 13209–13215.
Liberman, A. M., and I. G. Mattingly. 1985. The motor theory of speech perception revised. Cognition 21:
1–36.
Marslen-Wilson, W. D., and L. K. Tyler. 2007. Morphology, language and the brain: The decompositional sub-
strate for language comprehension. Philos Trans R Soc Lond B Biol Sci 362: 823–836.
Martin, A., and A. Caramazza. 2003. Neuropsychological and neuroimaging perspectives on conceptual knowl-
edge: An introduction. Cogn Neuropsychol 20: 195–212.
Massaro, D. W., and P. B. Egan. 1996. Perceiving affect from the voice and the face. Psychon Bull Rev 3:
215–221.
McAdams, S., S. Winsberg, S. Donnadieu, G. Desoete, and J. Krimphoff. 1995. Perceptual scaling of syn-
thesized musical timbres—Common dimensions, specificities, and latent subject classes. Psychol Res
Psychol Forsch 58: 177–192.
McConachie, H. R. 1976. Developmental prosopagnosia. A single case report. Cortex 12: 76–82.
Mechelli, A., C. J. Price, K. J. Friston, and A. Ishai. 2004. Where bottom-up meets top-down: Neuronal interac-
tions during perception and imagery. Cereb Cortex 14: 1256–1265.
Menon, V., D. J. Levitin, B. K. Smith, A. Lembke, B. D. Krasnow, D. Glazer et al. 2002. Neural correlates of
timbre change in harmonic sounds. Neuroimage 17: 1742–1754.
Nelken, I., and O. Bar-Yosef. 2008. Neurons and objects: The case of auditory cortex. Front Neurosci 2: 107–113.
Neuner, F., and S. R. Schweinberger. 2000. Neuropsychological impairments in the recognition of faces, voices,
and personal names. Brain Cogn 44: 342–366.
O’Toole, A. J., D. A. Roark, and H. Abdi. 2002. Recognizing moving faces: A psychological and neural syn-
thesis. Trends Cogn Sci 6: 261–266.
Obleser, J., and F. Eisner. 2008. Pre-lexical abstraction of speech in the auditory cortex. Trends Cogn Sci 13(1):
14–19.
Overath, T., S. Kumar, K. von Kriegstein, and T. D. Griffiths. 2008. Encoding of spectral correlation over time
in auditory cortex. J Neurosci 28: 13268–13273.
Patterson, R. D., S. Uppenkamp, I. S. Johnsrude, and T. D. Griffiths. 2002. The processing of temporal pitch
and melody information in auditory cortex. Neuron 36: 767–776.
Pekkola, J., V. Ojanen, T. Autti, I. P. Jaaskelainen, R. Mottonen et al. 2005. Primary auditory cortex activation
by visual speech: An fMRI study at 3 T. Neuroreport 16: 125–128.
Pelphrey, K. A., J. P. Morris, C. R. Michelich, T. Allison, and G. McCarthy. 2005. Functional anatomy of bio-
logical motion perception in posterior temporal cortex: An FMRI study of eye, mouth and hand move-
ments. Cereb Cortex 15: 1866–1876.
Penagos, H., J. R. Melcher, and A. J. Oxenham. 2004. A neural representation of pitch salience in nonpri-
mary human auditory cortex revealed with functional magnetic resonance imaging. J Neurosci 24:
6810–6815.
Pitcher, D., L. Garrido, V. Walsh, and B. C. Duchaine. 2008. Transcranial magnetic stimulation disrupts the
perception and embodiment of facial expressions. J Neurosci 28: 8929–8933.
Poeppel, D. 2003. The analysis of speech in different temporal integration windows: Cerebral lateralization as
‘asymmetric sampling in time.’ Speech Commun 41: 245–255.
Price, C., G. Thierry, and T. Griffiths. 2005. Speech-specific auditory processing: Where is it? Trends Cogn Sci
9: 271–276.
Price, C. J. 2000. The anatomy of language: Contributions from functional neuroimaging. J Anat 197(Pt 3):
335–359.
Puce, A., T. Allison, S. Bentin, J. C. Gore, and G. McCarthy. 1998. Temporal cortex activation in humans view-
ing eye and mouth movements. J Neurosci 18: 2188–2199.
Pulvermuller, F., M. Huss, F. Kherif, F. M. D. P. Martin, O. Hauk, and Y. Shtyrov. 2006. Motor cortex maps
articulatory features of speech sounds. Proc Natl Acad Sci U S A 103: 7865–7870.
Rao, R. P., and D. H. Ballard. 1999. Predictive coding in the visual cortex: A functional interpretation of some
extra-classical receptive-field effects. Nat Neurosci 2: 79–87.
Ross, L. A., D. Saint-Amour, V. M. Leavitt, D. C. Javitt, and J. J. Foxe. 2007. Do you see what I am say-
ing? Exploring visual enhancement of speech comprehension in noisy environments. Cereb Cortex 17:
1147–1153.
Rotshtein, P., R. N. Henson, A. Treves, J. Driver, and R. J. Dolan. 2005. Morphing Marilyn into Maggie dissoci-
ates physical and identity face representations in the brain. Nat Neurosci 8: 107–113.
Sathian, K., A. Zangaladze, J. M. Hoffman, and S. T. Grafton. 1997. Feeling with the mind’s eye. Neuroreport
8: 3877–3881.
Scherer, K. R. 1986. Vocal affect expression: A review and a model for future research. Psychol Bull 99:
143–165.
Schweinberger, S. R., D. Robertson, and J. M. Kaufmann. 2007. Hearing facial identities. Q J Exp Psychol
(Colchester) 60: 1446–1456.
Scott, S. K. 2005. Auditory processing—Speech, space and auditory objects. Curr Opin Neurobiol 15:
197–201.
Scott, S. K., C. McGettigan, and F. Eisner. 2009. A little more conversation, a little less action—Candidate roles
for the motor cortex in speech perception. Nat Rev Neurosci 10: 295–302.
Scott, S. K., C. C. Blank, S. Rosen, and R. J. Wise. 2000. Identification of a pathway for intelligible speech in
the left temporal lobe. Brain 123(Pt 12): 2400–2406.
Seitz, A. R., and H. R. Dinse. 2007. A common framework for perceptual learning. Curr Opin Neurobiol 17:
148–153.
Seitz, A. R., R. Kim, and L. Shams. 2006. Sound facilitates visual learning. Curr Biol 16: 1422–1427.
Sergent, J., S. Ohta, and B. MacDonald. 1992. Functional neuroanatomy of face and object processing. A posi-
tron emission tomography study. Brain 115(Pt 1): 15–36.
Shams, L., and A. R. Seitz. 2008. Benefits of multisensory learning. Trends Cogn Sci 12: 411–417.
Sheffert, S. M., and E. Olson. 2004. Audiovisual speech facilitates voice learning. Percept Psychophys 66:
352–362.
Sheffert, S. M., D. B. Pisoni, J. M. Fellowes, and R. E. Remez. 2002. Learning to recognize talkers from natu-
ral, sinewave, and reversed speech samples. J Exp Psychol Hum Percept Perform 28: 1447–1469.
Siciliano, C., G. Williams, J. Beskow, and A. Faulkner. 2002. Evaluation of a multilingual synthetic talking face
as a communication aid for the hearing-impaired. Speech Hear Lang Work Prog 14: 51–61.
Smith, D. R. R., R. D. Patterson, R. Turner, H. Kawahara, and T. Irino. 2005. The processing and perception of
size information in speech sounds. J Acoust Soc Am 117: 305–318.
Smith, E. L., M. Grabowecky, and S. Suzuki. 2007. Auditory–visual crossmodal integration in perception of
face gender. Curr Biol 17: 1680–1685.
Sumby, W. H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. J Acoust Soc Am 26:
212–215.
Thomas, E. R., and J. Reaser. 2004. Delimiting perceptual cues used for the ethnic labeling of African American
and European American voices. J Socioling 8: 54–87.
Thompson, J. C., M. Clarke, T. Stewart, and A. Puce. 2005. Configural processing of biological motion in
human superior temporal sulcus. J Neurosci 25: 9059–9066.
Thompson, J. C., J. E. Hardee, A. Panayiotou, D. Crewther, and A. Puce. 2007. Common and distinct brain
activation to viewing dynamic sequences of face and hand movements. Neuroimage 37: 966–973.
Tsukiura, T., H. Mochizuki-Kawai, and T. Fujii. 2006. Dissociable roles of the bilateral anterior temporal lobe
in face–name associations: An event-related fMRI study. Neuroimage 30: 617–626.
Van Lancker, D. R., and J. G. Canter. 1982. Impairment of voice and face recognition in patients with hemi-
spheric damage. Brain Cogn 1: 185–195.
Van Lancker, D. R., J. Kreiman, and J. Cummings. 1989. Voice perception deficits: Neuroanatomical correlates
of phonagnosia. J Clin Exp Neuropsychol 11: 665–674.
van Wassenhove, V., K. W. Grant, and D. Poeppel. 2005. Visual speech speeds up the neural processing of audi-
tory speech. Proc Natl Acad Sci U S A 102: 1181–1186.
Vigneau, M., V. Beaucousin, P. Y. Herve, H. Duffau, F. Crivello, O. Houde et al. 2006. Meta-analyzing left hemi-
sphere language areas: Phonology, semantics, and sentence processing. Neuroimage 30: 1414–1432.
von Kriegstein, K., and A. L. Giraud. 2004. Distinct functional substrates along the right superior temporal
sulcus for the processing of voices. Neuroimage 22: 948–955.
von Kriegstein, K., and A. L. Giraud. 2006. Implicit multisensory associations influence voice recognition. PLoS
Biol 4: e326.
von Kriegstein, K., A. Kleinschmidt, and A. L. Giraud. 2006. Voice recognition and cross-modal responses to
familiar speakers’ voices in prosopagnosia. Cereb Cortex 16: 1314–1322.
von Kriegstein, K., R. D. Patterson, and T. D. Griffiths. 2008a. Task-dependent modulation of medial geniculate
body is behaviorally relevant for speech recognition. Curr Biol 18: 1855–1859.
von Kriegstein, K., E. Eger, A. Kleinschmidt, and A. L. Giraud. 2003. Modulation of neural responses to speech
by directing attention to voices or verbal content. Brain Res Cogn Brain Res 17: 48–55.
von Kriegstein, K., A. Kleinschmidt, P. Sterzer, and A. L. Giraud. 2005. Interaction of face and voice areas dur-
ing speaker recognition. J Cogn Neurosci 17: 367–376.
von Kriegstein, K., D. R. Smith, R. D. Patterson, D. T. Ives, and T. D. Griffiths. 2007. Neural representation of
auditory size in the human voice and in sounds from other resonant sources. Curr Biol 17: 1123–1128.
von Kriegstein, K., O. Dogan, M. Gruter, A. L. Giraud, C. A. Kell, T. Gruter et al. 2008b. Simulation of talk-
ing faces in the human brain improves auditory speech recognition. Proc Natl Acad Sci U S A 105:
6747–6752.
von Kriegstein, K., D. R. Smith, R. D. Patterson, S. J. Kiebel, and T. D. Griffiths. 2010. How the human brain
recognizes speech in the context of changing speakers. J Neurosci 30: 629–638.
Warren, J. D., A. R. Jennings, and T. D. Griffiths. 2005. Analysis of the spectral envelope of sounds by the
human brain. Neuroimage 24: 1052–1057.
Watkins, K. E., A. P. Strafella, and T. Paus. 2003. Seeing and hearing speech excites the motor system involved
in speech production. Neuropsychologia 41: 989–994.
Wolpert, D. M., Z. Ghahramani, and M. I. Jordan. 1995. An internal model for sensorimotor integration. Science
269: 1880–1882.
Yehia, H., P. Rubin, and E. Vatikiotis-Bateson. 1998. Quantitative association of vocal-tract and facial behavior.
Speech Commun 26: 23–43.
Young, A. W., F. Newcombe, E. H. de Haan, M. Small, and D. C. Hay. 1993. Face perception after brain injury.
Selective impairments affecting identity and expression. Brain 116 (Pt 4): 941–959.
Zangaladze, A., C. M. Epstein, S. T. Grafton, and K. Sathian. 1999. Involvement of visual cortex in tactile
discrimination of orientation. Nature 401: 587–590.
Section IX
Naturalistic Multisensory Processes:
Flavor
35  Multimodal Chemosensory Interactions and Perception of Flavor
John Prescott

CONTENTS
35.1 Introduction
35.2 Chemosensory Interactions and Integration
35.3 Associative Learning and Integration
35.4 Cross-Modal Chemosensory Binding
35.5 Attentional Processes in Binding
35.6 Analysis and Synthesis in Perception of Flavor
35.7 Investigating Cognitive Processes in Flavor Perception
35.8 Hedonic Implications of Chemosensory Integration
References

35.1  INTRODUCTION
Writing in the early nineteenth century, the gastronomic pioneer Brillat-Savarin was “tempted to
believe that smell and taste are in fact but a single sense, whose laboratory is the mouth and whose
chimney is the nose” (Brillat-Savarin 1825). Much of the subsequent history of perception research
in the chemical senses has, in contrast, been characterized by a focus on discrete sensory channels,
and their underlying anatomy and physiology. However, there has recently been renewed interest
in examining flavor as a functional perceptual system. This has been borne to some extent out of
a realization that in our everyday food experiences, we respond, perceptually and hedonically, not
to discrete tastes, odors, and tactile sensations, but to flavors constructed from a synthesis of these
sensory signals (Prescott 2004b).
This refocus regarding flavor is very much in line with the ecological approach to perception that
had been advocated by Gibson (1966). Gibson argued that the primary purpose of perception is to
seek out objects in our environment, particularly those that are biologically important. As such, the
physiological origin of sensory information is less salient than whether the information can be used in
object identification. Effectively, then, the key to successful perception is that sensory information
is interpreted as qualities that belong to the object itself. Within this context, flavor can be seen as
a functionally distinct sense that is cognitively “constructed” from the integration of distinct physi-
ologically defined sensory systems (such as olfaction and gustation) that are “functionally united
when anatomically separated” (Gibson 1966, p. 137) in order to identify and respond to objects that
are important to our survival, namely, foods.


35.2  CHEMOSENSORY INTERACTIONS AND INTEGRATION


Cross-modal sensory integration is frequently inferred from the influence of one modality on
responses to another. Commonly, this is an enhanced (sometimes supra-additive) response to infor-
mation from one sensory system due to concurrent input from another modality (Calvert et al. 1999).
For example, in a noisy environment, speech comprehension is improved if we see the speaker’s lip
movements (Sumby and Pollack 1954). Even information that is irrelevant to a task enhances neural
response to task-relevant stimuli and augments behavioral performance (Stein et al. 1988). There is
similarly evidence that tastes and odors, when encoded together as a flavor, interact to modify the
perception of one another.
The most obvious expression of odor–taste interactions is the widely observed attribution to
odors of qualities that are more usually associated with basic tastes (Burdach et al. 1984).
When asked to describe the odor of caramel or vanilla, most people will use the term “sweet-
smelling”; similarly, “sour” is used for the odor of vinegar (Stevenson and Boakes 2004). In one
descriptive analysis of characteristics of a wide range of odors (Dravnieks 1985), 65% of assessors
gave “sweetness” as an appropriate descriptor for the odor of vanillin, whereas 33% described the
odor of hexanoic acid as being sour. These descriptions appear to have many of the qualities of
synesthesia, in which a stimulus in one sensory modality reliably elicits a consistent corresponding
stimulus in another modality (Martino and Marks 2001; Stevenson et al. 1998). Whereas in other
modalities, synesthesia is a relatively uncommon event, the possession of taste properties by odors
is almost universal, particularly in the case of commonly consumed foods. In fact, for some odors,
taste qualities may represent the most consistent description used. Stevenson and Boakes (2004)
reported data showing that, over repeat testing, ratings of taste descriptors for odors (e.g., sweetness
of banana odor) were at least as reliable as ratings of the primary quality of the odor (i.e., banana).
This commonplace phenomenon could be dismissed as merely imprecise language (since highly
specific odor descriptors are elusive) or even metaphor, given that the odor name is likely to refer
to an object, which might also be sweet or sour. However, there are measurable consequences of
such odor taste qualities, in that these odors, when added to tastants in solution, can modify the
taste intensity. The most frequent finding is the ability of food odors such as strawberry or vanilla
to enhance the sweetness of sucrose solutions (Frank and Byram 1988; Frank et al. 1989). This phe-
nomenon is both taste- and odor-specific. For example, the sweet-smelling odor of strawberry will
enhance a sweet taste, but the odor of bacon will not. Conversely, a nonsweet taste, for example,
saltiness, will not be enhanced by strawberry (Frank and Byram 1988). Stevenson et al. (1999)
showed that the smelled sweetness of an odorant was the best predictor of that odorant’s ability to
enhance a sweet taste when they were presented together in solution. Similarly, the ability of food
odors to enhance saltiness in solution has been shown to be highly correlated with the extent to which
the foods themselves were judged to be salty (Lawrence et al. 2009).
Subsequently, these findings were extended by studies showing that odors added to tastants can
also suppress taste intensity. Prescott (1999) found that odors judged to be low in smelled sweetness
(peanut butter, oolong tea) suppressed sweetness when added to sucrose in solution, in contrast to
raspberry odor, which enhanced it. Stevenson et al. (1999) reported that sweet-smelling caramel
odor not only enhanced the sweetness of sucrose in solution but also suppressed the sourness of
a citric acid solution. Importantly, this latter effect parallels the pattern of interactions seen with
binary taste mixtures, in that the addition of sucrose would similarly suppress the sourness of
citric acid. Such findings provide evidence that odor taste properties reflect a genuine perceptual
phenomenon.
The ability of odors possessing smelled taste qualities to influence tastes has also been dem-
onstrated in paradigms using measures other than intensity ratings. Dalton et al. (2000) assessed
orthonasal (sniffed) detection thresholds for the odorant benzaldehyde, which has a cherry/almond
quality, while subjects held a sweet taste (saccharin) in the mouth. Detection thresholds for the odor
were significantly reduced compared with those for benzaldehyde alone, or in combination with either water
or a nonsweet taste (monosodium glutamate, a savory quality). The most plausible interpretation of
these findings is that the smelled sweetness of benzaldehyde and tasted sweetness of saccharin were
being integrated at subthreshold levels. Similar odor threshold effects have also been found using
a somewhat different experimental protocol in which both the odorant and tastant were presented
together in solution (Delwiche and Heffelfinger 2005).
Reciprocal effects of odors on tastes are also found. These include increases in the detection
accuracy of a sweet taste at around threshold in the presence of an orthonasally presented congruent
odorant (strawberry) as compared to one that was not sweet (ham) (Djordjevic et al. 2004), as well
as a similar effect using a priming procedure, in which the odorant preceded the taste presentation
(Prescott 2004b), showing that a sweet-smelling odor produced a greater change in detectability,
relative to no odor, than did another, nonsweet odorant. Similar priming effects at suprathreshold
levels have been demonstrated behaviorally in a study in which subjects were asked to identify a
taste quality sipped from a device that also simultaneously presented an odor—either congruent or
incongruent—orthonasally. Speed of naming of tastes during presentation of congruent odor/taste
pairs (sweet smelling cherry odor/sucrose; sour smelling grapefruit odor/citric acid) was faster rela-
tive to incongruent pairs (cherry odor/citric acid; grapefruit odor/sucrose), or neutral/control pairs
(either butanol or no odor plus either sucrose or citric acid) (White and Prescott 2007).

35.3  ASSOCIATIVE LEARNING AND INTEGRATION


The importance of taste-related odor properties for understanding sensory integration in flavors
derives principally from the fact that these effects are thought only to arise once the odor and taste
have been repeatedly experienced together as a mixture in the mouth, most typically in the context
of foods or beverages. This process has been repeatedly demonstrated experimentally. Novel odors
that have little or no smelled sweetness, sourness, or bitterness when sniffed take on these qualities
when repeatedly paired in solution with sweet, sour, or bitter tastes, respectively (Prescott 1999;
Stevenson et al. 1995, 1998, 1999; Yeomans et al. 2006, 2009). Recent studies have expanded these
findings beyond associative relationships of odors with tastes. Thus, odors paired with high fat
milks themselves became fattier smelling and were able to increase the perceived fattiness of the
milks, when added to the milks subsequent to conditioning (Sundquist et al. 2006).
Such acquired perceptual similarity has been seen as an example of a “learned synesthesia,” in
which qualities in one sensory system (olfaction) are able to evoke qualities in another (taste) only
as a result of their frequent co-occurrence (Stevenson et al. 1998). The nature of the change that the
odor undergoes has been explained in terms of increasing congruency (similarity) with the taste, in
that they possess qualities in common, as a result of coexposure (Frank et al. 1993; Schifferstein and
Verlegh 1996). Hence, the sweetness of a taste such as sucrose is seen to be more congruent with
the sweet-smelling odor of caramel than it is with the odor of bacon, which typically has no sweet
smell. It is only after this coexposure that the odor enhances a (now) congruent taste (Prescott 1999;
Stevenson et al. 1999). Thus, Frank et al. (1993) found that the degree of enhancement produced by
an odor for a particular taste was significantly correlated with ratings of the perceived similarity or
congruency of the odorant and tastant. This suggests, therefore, that whether an odor/taste combina-
tion is seen as congruent is dependent on prior association of the components as a combination.
Given the associative origin of these effects in the context of foods and beverages, we might
expect cross-cultural differences in the extent to which particular odors and tastes are judged as
congruent. For example, the odor of pumpkin is likely to smell sweeter in those cultures where
it is incorporated into desserts (e.g., United States) as compared to cultures where it is savoury.
Consistent with this, it has been reported that French and Vietnamese vary in their judgments of
odor/taste harmony—that is, the extent to which an odor and taste are seen as congruent (Nguyen
et al. 2002).
One explanatory model for these effects proposes that each experience of an odor always invokes
a search of memory for prior encounters with that odor. If, in the initial experience of the odor, it
was paired with a taste, a cross-modal configural stimulus—that is, a flavor—is encoded in mem-
ory. Subsequently sniffing the odor alone will evoke the most similar odor memory—the flavor—
that will include both the odor and the taste component. Thus, for example, sniffing caramel odor
activates memorial representations of caramel flavors, which includes a sweet taste component.
This results either in perceptions of smelled taste properties such as sweetness or, in the case of a
mixture, a perceptual combination of the memorial odor representation with the physically present
taste in solution (Stevenson and Boakes 2004; Stevenson et al. 1998).

35.4  CROSS-MODAL CHEMOSENSORY BINDING


In vision, features of a scene or object, such as form, color, and movement, are combined
to form a coherent perception. The neural processing of form can be shown to be independent of that
of color, but our perception is always that the two visual phenomena are bound seamlessly together.
To understand flavor perception, it is similarly crucial to know the mechanisms responsible for bind-
ing tastes, odors, and tactile sensations into one coherent, cross-modal percept.
Studies of interactions between visual, auditory, and somatosensory systems have demonstrated
the importance of spatial and/or temporal contiguity in facilitating cross-modal sensory integration
(Calvert et al. 1998; Driver and Spence 2000; Spence and Squire 2003). In flavors, the different
stimulus elements are associated temporally. However, although both gustatory and somatosensory
receptors are spatially located in the mouth, olfactory receptors are not. The question then arises of
how odors become bound to taste and touch. Central to this process is the olfactory location illu-
sion, in which the odor components of a food appear to originate in the mouth (Rozin 1982). Thus,
we never have a sense that the oranginess of orange juice is being perceived within the nose, even
if we are aware that it is an odor. This illusion is both strong and pervasive, despite the fact that we
are frequently presented with evidence of the importance of the olfactory component in flavors, for
example, through a blocked nose during a head cold. One common manifestation of this phenom-
enon is the interchangeability of chemosensory terms such as flavor and taste in common usage—
that is, we routinely fail to make a distinction between olfactory and taste qualities within flavors.
The location illusion itself may depend on both the spatial and temporal contiguity of the discrete
sensory inputs. von Bekesy (1964) illustrated the likely importance of temporal factors as potential
determinants of odor/taste integration by showing that the perceived location of an odor (mouth vs.
nose) and the extent to which an odor and taste were perceived as one sensation or two could be
manipulated by varying the time delay between the presentation of the odor and taste. With a time
delay of zero (simultaneous presentation), the apparent locus of the odor was the back of the mouth
and the odor/taste mixture was perceived as a single entity. When the odor preceded the taste, the
sensation was perceived as originating in the nose (see Figure 35.1). Although this report is consis-
tent with models of binding across other sensory modalities, von Bekesy (1964) did not provide suf-
ficient details to judge the reliability of his conclusions. The number of other studies addressing this
issue is also very limited. A demonstration that odor-induced taste enhancement can occur whether
the odor is presented orthonasally or retronasally, providing that the odor and taste are presented
simultaneously (Sakai et al. 2001) does suggest a key role for temporal synchrony in facilitating
integration. Pfeiffer et al. (2005) manipulated both spatial and temporal contiguity for the odor and
taste while assessing the threshold for benzaldehyde odor (almond/cherry) in the presence of a sub-
threshold sweet taste, but failed to find convincing evidence that either manipulation had an effect.
preliminary finding suggests that synchronicity judgments of odor and taste may be less sensitive to
onset discrepancies than other multimodal stimulus pairs, including audiovisual stimuli, and odors
and tastes, each paired with visual stimuli (Kobayakawa et al. 2009). One interpretation of such a
finding, if confirmed, together with the data of Sakai et al. (2001), would be that odor–taste binding
operates under less stringent requirements for spatiotemporal synchrony than multisensory integra-
tion within other sensory systems. In turn, binding under conditions in which there is a tolerance
for asynchrony might reflect the high adaptive significance of chemosensory binding. Alternatively,
at least in the case of temporal asynchrony, congruency between the odor and taste may be crucial.
Hence, it has been demonstrated that judgments of audiovisual asynchrony are more difficult when
the different modalities are bound by a common origin (Spence and Parise 2010).

FIGURE 35.1  Temporal and spatial determinants of odor/taste integration: the combination of smell and taste
into a single sensation. A varying time difference between the stimuli moves the locus of sensation from the tip
of the nose back to the throat and forward again to the tip of the tongue. (Reprinted from von Bekesy, G., J. Appl.
Physiol., 19, 369–373, 1964. Copyright, used with permission from The American Physiological Society.)

The olfactory location illusion is effectively an equivalent phenomenon to the auditory/visual
“ventriloquism effect” in that, like the ventriloquist’s voice, the location of the odor is captured
by other sensory inputs. The extent to which either concurrent taste or somatosensation, or both,
is chiefly responsible for the capture and referral of olfactory information to the oral cavity is not
known. However, the somatosensory system is more strongly implicated since it provides more
detailed spatial information than does taste (Lim and Green 2008). Moreover, in neuroimaging
studies, odors that are available to bind with tastes—that is, those presented retronasally (via the
mouth)—have been shown to activate the mouth area of the primary somatosensory cortex, whereas
the same odors presented via the nose do not (Small et al. 2005). This distinction, which occurs even
when subjects are unaware of the route of stimulation, suggests a likely neural correlate of the binding
process, and supports the idea that somatosensory input is the underlying mechanism.
In fact, our taste experiences may themselves be multimodal. Under most circumstances, taste
and tactile sensations in the mouth are so well integrated that we cannot begin to disentangle them,
and there is growing evidence that our everyday experiences of taste are themselves multisensory,
in that they involve somatosensory input (Green 2003; Lim and Green 2008). Taste buds are inner-
vated by somatosensory fibers (Whitehead et al. 1985) and various categories of somatosensory
stimuli are also capable of inducing taste sensations. Thus, it has been noted that about 25% of
fungiform papillae respond to tactile stimulation with fine wires by producing a taste quality (Cardello 1981).
More recently, tastes have been shown to be elicited by heated and cooled probes placed on areas
innervated by cranial nerves VII and IX, which subserve taste (Cruz and Green 2000), and by
the application of the prototypical “pure” irritant, capsaicin, to circumvallate papillae (Green and
Hayes 2003). Further evidence points to the ability of tactile stimulation to capture taste, presum-
ably by providing superior spatial information and enhancing localization (Delwiche et al. 2000;
Lim and Green 2008; Todrank and Bartoshuk 1991). Tactile information may therefore have an
important role in binding tastes, perhaps together with odors, both to one another and to a physical
stimulus such as a food.
The binding of odors to tastes and tactile stimuli may also rely on processing information about
the origins of odor stimulation. Orthonasally presented odors are more readily identified and have
lower thresholds than the same odors presented retronasally via the mouth (Pierce and Halpern
1996; Voirol and Daget 1986), and there is a strong suggestion that the two routes of stimulation are
processed with some independence. Thus, neuroimaging studies show different activation patterns
in cortical olfactory areas as a result of route of administration (Small et al. 2005). From an adaptive
point of view, this makes sense. Olfaction has been described (Rozin 1982) as the only dual sense
because it functions both to detect volatile chemicals in the air (orthonasal sniffing) and to classify
objects in the mouth as foods or not, and each of these roles has unique adaptive significance. Since
the mouth acts as the gateway to the gut, our chemical senses can be seen as part of a defense system
to protect our internal environment—once something is placed in the mouth, there is high survival
value in deciding whether consumption is appropriate. Sensory qualities (tastes, retronasal odors,
tactile qualities) occurring together in the mouth are therefore bound into a single perception, which
identifies a substance as a food (cf. Gibson 1966).

35.5  ATTENTIONAL PROCESSES IN BINDING


Even though an odor’s sniffed “taste” qualities and its ability to enhance that taste in solution are
highly correlated (Stevenson et al. 1999), demonstrating, for example, that a sweet-smelling odor
can enhance the sweetness of sucrose in solution appears to operate under some constraints. This
became evident from findings that whether an odor enhances a taste depends on the task requirements.
Thus, Frank et al. (1993) found that although strawberry odor enhanced the sweetness of
sucrose in solution when the subjects were asked to judge only sweetness, the enhancement was not
evident when other sensory qualities of these mixtures, such as sourness and fruitiness, were rated
as well. In addition, the sweetness of the strawberry/sucrose mixtures was suppressed when the sub-
jects rated total intensity of the mixture and then partitioned their responses into sweet, salty, sour,
bitter, and/or other tastes. Interestingly, these effects were also noted for some taste mixtures, in
which the elements are often judged as similar (e.g., sour/bitter taste mixtures), but not others with
dissimilar components (e.g., sweet/bitter mixtures; Frank et al. 1993). Similarly, significantly less
sweetness enhancement was found when subjects rated the odor as well as taste intensity of flavors
(sweetness plus strawberry or vanilla) than when they rated sweetness alone (Clark and Lawless
1994).
In attempting to explain such effects, Frank and colleagues (Frank 2003; Frank et al. 1993; van
der Klaauw and Frank 1996) suggested that, given perceptual similarity between an odor and taste,
the conceptual “boundaries” that the subject sets for a given complex stimulus will reflect the task
requirements. In the case of an odor/taste mixture in which the elements share a similar quality,
combining those elements is essentially optional. This explanation invokes the notion that integra-
tion of perceptually similar dimensions is determined by the attentional focus demanded in the task.
These effects of instructional sets are analogous to those seen in studies of cross-modal integration
of vision and hearing. For example, focusing on the overall similarity of visual or auditory stimulus
pairs, representing different stimulus dimensions, versus focusing on their component dimensions,
can influence whether the pairs are treated as interacting or separable dimensions (Melara et al.
1992). This suggests the possibility that the apparent influence of the number of rating scales on
odor/taste interactions results from the impact of these scales on how attention is directed toward
the odor and taste. In keeping with this view, van der Klaauw and Frank (1996) were able to elimi-
nate taste enhancement by directing subjects’ attention to the appropriate attributes in a taste/odor
mixture, even when they were only required to rate sweetness.

35.6  ANALYSIS AND SYNTHESIS IN PERCEPTION OF FLAVOR


These attentional effects appear to correspond to the differing modes of interaction that occur within
sensory modalities. The blending of odors to form entirely new odors is a commonplace occurrence
in flavor chemistry and perfumery (at least for odor mixtures with greater than two components; see
Laing and Willcox 1983), and hence is referred to as synthetic interaction (analogous to the blending
of light wavelengths). By contrast, the mixing of tastes is typically seen as an analytic process,
because individual taste qualities do not fuse to form new qualities and, like simultaneous auditory
tones, can be distinguished from one another in mixtures. A further category of interaction, namely,
fusion—the notion of sensations combined to form a single percept, rather than combining syntheti-
cally to form a new sensation—has also been proposed and applied to flavor perception (McBurney
1986).
The notion of fusion in flavor perception implies that the percept remains analyzable into its
constituent elements even when otherwise perceived as a whole. Thus, although our initial response
is to apple flavor—an effortless combining of all of its sensory qualities into a single percept—we
can, if required, switch between a synthetic approach to flavor and an analysis of the flavor ele-
ments. Hence, apple flavor can be both a synthetic percept and, with minimal effort, a collection of
tastes (sweet; sour), textures (crisp; juicy) and odor notes (lemony; acetone-like; honey) (see Figure
35.2). A more precise way of conceptualizing flavor therefore is that cross-modal sensory signals
are combined to produce a percept, rather than combining synthetically—in the way that odors
themselves do—to form a new sensation. During normal food consumption, we typically respond
to flavors synthetically—an approach reinforced by the olfactory illusion and by the extent to which
flavor components are congruent. As noted earlier, this implies a sharing of perceptual qualities,
for example, sweetness of a taste and of an odor, derived from prior experience of these qualities
together.
Conversely, analytic approaches to complex food or other flavor stimuli (e.g., wines) are often
used by trained assessors to provide a descriptive profile of discrete sensory qualities, as distinct
from an assessment of the overall flavor. Asking assessors to become analytical appears to produce
the same inhibitory effects on odor–taste interactions noted in studies by Frank et al. (1993) and
others. In one study using both trained descriptive panelists and untrained consumers (Bingham et
al. 1990), solutions of the sweet-smelling odorant maltol plus sucrose were rated as sweeter than a
solution of sucrose alone by the untrained consumers. In contrast, no such enhancement was found
in the ratings of those trained to adopt an analytical approach to the sensory properties of this
mixture.
In experimental paradigms, whether an odor/taste mixture is perceived analytically or syntheti-
cally can be determined by the responses required of the subject.

FIGURE 35.2  Synthetic and analytic views of a flavor. In each case, sensory signals are identical, but perception
differs: the whole flavor of apple (synthetic panel) versus a collection of sensory qualities (odor notes, tastes, and
textures, each rated 0–100 for three apples; analytic panel) on which different apples may vary.

Multiple ratings of appropriate
attributes force an analytical approach, whereas a single rating of a sensory quality that can apply
to both congruent odors and tastes (e.g., the tasted sweetness of sucrose and the smelled sweetness
of strawberry odor) encourages synthesis of the common quality from both sensory modalities.
The components of these flavors may not be treated separately when judged in terms of sweetness
or other single characteristics. When instructions require separation of the components, however,
this can be done—the components of a flavor are evaluated individually, and sweetness enhance-
ment is eliminated. In other words, rating requirements lead to different perceptual approaches
(analytical or synthetic) that, in turn, influence the degree of perceptual integration that occurs.
A recent study of odor mixtures has indicated that an analytical approach is similarly able to influ-
ence the integration of the individual mixture components, as shown in a reduction in the extent
to which subjects perceived a unique quality distinct from those of the components (Le Berre
et al. 2008).

35.7  INVESTIGATING COGNITIVE PROCESSES IN FLAVOR PERCEPTION


Thus, the concept of fusion suggests that flavor perception is highly dependent on both past expe-
rience with specific odor/taste combinations (the origin of congruence) and cognitive factors that
influence whether the flavor elements are combined or not. The most influential model of visual
binding proposes that individual visual features are only loosely associated during early stages of
processing, most likely by a common spatial location, but are bound to form a coherent perception
as a result of attention directed toward combining these features as aspects of the same object or
scene (Treisman 1998, 2006). Similarly, the configural account of odor/taste perceptual learning
(Stevenson and Boakes 2004) implies that when attention is directed toward a flavor, it is attended
to as a single compound or configuration, rather than a collection of elements. A configural explana-
tion for the ability of an odor to later summate with the taste to produce enhanced sweetness implies
an attentional approach that combines the odor and taste, rather than identifying them as separate
elements in the flavor. In other words, for a complete binding of flavor features via configural learn-
ing, synthesis of the elements via attending to the whole flavor is critical. The limited evidence
that exists suggests that the binding and joint encoding of odors, tastes, and tactile sensations are
automatic. This is indicated both by the finding that perceptual changes in odors after pairing with
tastes appear not to require conscious awareness on the part of the subject of the particular odor–
taste contingencies (Stevenson et al. 1998) and by data suggesting that a single coexposure of an odor
and a taste can result in transfer of the taste properties to the odor (Prescott et al. 2004). Thus, such
learning should be sensitive to manipulations in which attention is directed toward the identity of
their constituent elements.
One approach to examining the role of these factors has been to force subjects to adopt contrast-
ing attentional strategies (analytic vs. synthetic) while either experiencing or judging odor/taste
mixtures. If it is the case that odor/taste interactions can be influenced by the extent to which an
analytical or synthetic perceptual approach is taken during rating, then this suggests the possibility
that the extent to which the odors and tastes become integrated (as shown by increased perceptual
similarity) might similarly be determined by the way in which the components of the flavor are
associated during their joint exposure. In turn, any influence of odors on those tastes in solution
may similarly be modulated. Hence, an exposure strategy that emphasizes the distinctiveness of the
elements in the odor/taste mixture (an analytical perceptual strategy) should inhibit increases in the
taste properties of the odor, and the subsequent ability of the odor to influence tastes in solution. In
contrast, treating the elements as a synthetic whole is likely to encourage the blurring of the percep-
tual boundaries, fostering subsequent odor/taste interactions.
Consistent with this, pre-exposure of the elements of the specific odor–taste flavor compounds
that were later repeatedly associated—in Pavlovian terms, unconditional stimulus or conditional
stimulus pre-exposure—eliminated any change in the odors’ perceptual qualities following the pair-
ing (Stevenson and Case 2003). Thus, pre-exposed odors later paired with sweet or sour tastes did
not become sweeter or sourer smelling, whereas taste-paired odors that had not been pre-exposed
did. In contrast, initial attempts to disrupt configural integrity by directing attention toward the
elemental nature of the compound stimulus during associative pairing were unsuccessful. Neither
training subjects to distinguish the individual odor and taste components of flavors prior to learning
(Stevenson and Case 2003) nor predisposing subjects to adopt an analytical strategy by requiring
intensity ratings of these odor and taste components separately during their joint exposure (Prescott
et al. 2004) was initially successful in influencing whether odors paired with a sweet taste became
sweeter smelling. This is probably attributable to methodological factors.
If it is the case that odors and tastes are automatically coencoded as a flavor in the absence of task
demands that focus attention on the elements, then experimental designs in which odor and taste
elements appear together without such an attentional strategy are likely to predispose toward syn-
thesis. Hence, the analytical strategy used by Stevenson and Case (2003) was likely to be ineffective
since they asked subjects during the exposure to rate overall liking for the odor–taste compound,
an approach that may have encouraged integration of the elements. The analytical manipulation in
Prescott et al.’s (2004) study may not have influenced the development of smelled sweetness because
it took place after the initial pairing of the sweet taste and odor that occurred before the formal asso-
ciative process—that is, as the preconditioning measure in the pre–post design. As noted earlier, a
second study in Prescott et al.’s (2004) report demonstrated that a single odor–sweet taste coexpo-
sure can produce an odor that smells sweet.
More recently, it has been demonstrated that when such methodological considerations are
addressed, prior analytical training in which attention is explicitly directed toward the individual
elements in an odor and sweet taste mixture does inhibit the development of a sweet-smelling odor
(Prescott and Murphy 2009; see Figure 35.3a).

FIGURE 35.3  Changes in perceptual characteristics of odors and flavors as a function of odor–taste coex-
posure and attentional strategy. (a) Mean ratings of the smelled sweetness of odors increase after repeated
pairing with a sweet taste (sucrose) in solution, but only for a group using a strategy in which the odor and
taste elements are treated synthetically. In contrast, coexposure to the same odor–taste mixture when the odor
and taste are attended to analytically as distinct elements produces no such increase in smelled sweetness.
(Reprinted from Prescott, J., and Murphy, S., Q. J. Exp. Psychol., 62 (11), 2133–2140, 2009. Copyright, with
permission from Taylor & Francis.) (b) Mean ratings of the sweetness of a flavor composed of sucrose in
solution together with an odor that has previously been conditioned with this taste so that it smells sweet.
Despite this, enhancement is evident only in the group that treated the elements synthetically during their
association. (Adapted from Prescott, J. et al., Chem. Senses, 29, 331–334, 2004. With permission.)

In this study, subjects only ever received a particular
odor–taste combination under conditions in which they had been trained to respond to the combina-
tion in explicitly synthetic or analytical ways. Moreover, the fact that the training used different
odor/taste combinations than were later used in the conditioning procedure suggests that an atten-
tional approach (analytical or synthetic) was being induced in the subjects during training that was
then applied to new odor/taste combinations during conditioning.
The findings from this study have important theoretical implications, in that they are clearly
consistent with configural accounts of perceptual odor–taste learning and flavor representation
(Stevenson and Boakes 2004; Stevenson et al. 1998). Under conditions where attention is directed
toward individual stimulus elements during conditioning, the separate representation of these ele-
ments may be incompatible with learning of a configural representation. This explanation is sup-
ported by the demonstration that an analytical approach also acted to inhibit a sweet-smelling odor’s
ability to enhance a sweet taste when the odor/taste combination were evaluated in solution after
repeated pairing (Prescott et al. 2004; see Figure 35.3b). In other words, an analytical attentional
strategy can be shown to interfere with either the development of a flavor configuration resulting
from associative learning, or the subsequent ability of this configuration to combine with a physi-
cally present tastant.

35.8  HEDONIC IMPLICATIONS OF CHEMOSENSORY INTEGRATION


Likes and dislikes naturally arise from the integrated perception of flavor, since we are respond-
ing to substances that we have learned to recognize as foods and that are therefore biologically,
culturally, and socially valued. Initial (“gut”) responses to foods are almost always hedonic and
this naturally precedes accepting or rejecting the food. Hence, perhaps unique among multisensory
interactions, multisensory integration in the chemical senses is, to a greater or lesser extent, a pro-
cess that has an inherently hedonic dimension.
As with perceptual changes in odors, hedonic properties of flavors arise from associative learn-
ing. Although our initial responses to odors may or may not be strongly positively or negatively
valenced, tastes evoke emotions that are innate (Prescott 1998; Steiner et al. 2001). Because of
this hedonic valence, repeated pairing of odors with tastes produces not only a transfer of perceptual properties (odors acquiring taste-like qualities) but also a change in the hedonic character of the odor, and hence also of the flavor. Thus, repeated pairing of a novel odor that is hedonically neutral with a
liked sweet taste typically produces an increase in liking for that odor; conversely, pairing with a
bitter taste produces a disliked odor (Baeyens et al. 1990; Zellner et al. 1983). This form of learn-
ing, known as evaluative conditioning (EC; Levey and Martin 1975), is procedurally identical to
odor–taste perceptual learning. Nevertheless, evaluative and perceptual associative conditioning
can be distinguished by the fact that conditioned increases in the taste properties of odors can occur
without consistent changes in liking (Stevenson et al. 1995, 1998) and also by the reliance of odor–
taste evaluative, but not perceptual, learning on the motivational state of the individual. Hence, EC
is reduced or eliminated under conditions of satiation, whereas perceptual learning is unaffected
(Yeomans and Mobini 2006). EC, but not perceptual learning, also relies on the relative hedonic
value of the tastes. Although even relatively weak bitterness per se is universally negative (Steiner
et al. 2001), in adults there is variation in the extent to which sweetness is hedonically positive
(Pangborn 1970). However, when this is controlled for, by selecting “sweet likers”—commonly
defined as those whose hedonic responses tend to increase with increasing sweetener concentra-
tion—odors paired with sweet tastes reliably become more liked (Yeomans et al. 2006, 2009).
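To make the selection criterion concrete, a minimal sketch (assuming hypothetical ratings, a hypothetical function name, and a simple least-squares slope criterion not taken from the cited studies) classifies a participant as a "sweet liker" when liking ratings rise with sweetener concentration:

import numpy as np

def is_sweet_liker(concentrations, hedonic_ratings, slope_threshold=0.0):
    # Fit a straight line to liking ratings as a function of concentration;
    # a positive slope is taken to indicate a "sweet liker". The threshold
    # and scale are illustrative assumptions, not values from the literature.
    slope, _intercept = np.polyfit(concentrations, hedonic_ratings, 1)
    return slope > slope_threshold

conc = [2.5, 5.0, 10.0, 20.0]                    # hypothetical sucrose levels (% w/v)
print(is_sweet_liker(conc, [5, 12, 20, 31]))     # liking rises with concentration: True
print(is_sweet_liker(conc, [10, 4, -6, -18]))    # liking falls with concentration: False

In practice such classifications are made from each participant's own psychophysical ratings; the sketch merely illustrates the slope-based logic of the definition.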
A configural or holistic learning model of the type discussed earlier in relation to perceptual
changes in odors paired with tastes also accounts for odor–taste evaluative learning by propos-
ing that the configuration includes a hedonic component “supplied” by the taste, which is evoked
when the odor or flavor is experienced (De Houwer et al. 2001). This model is supported for EC by a finding for analytical versus synthetic attention that parallels the one obtained for perceptual learning. That is, training to identify the elemental nature of the odor–taste compound during

learning also eliminates the transfer of hedonic properties from the taste to the odor (Prescott and
Murphy 2009), suggesting that the formation of an odor–taste configuration that includes hedonic
values has been inhibited. Recent evidence also suggests that, even after learning, the hedonic
value of a flavor can be altered by the extent to which an analytical approach is taken to the fla-
vor. Comparisons between acceptability ratings alone and the same ratings followed by a series
of analytical ratings of flavor sensory qualities found a reduction of liking in the latter condition
(Prescott et al. 2011), suggesting that analytical approaches are inhibitory to liking even once that
liking has been established. The explanation for this effect is that, as with the similar effects on
perceptual learning reported by Prescott et al. (2004), an analytical attentional strategy is induced
by knowledge that the flavor is to be perceptually analyzed, reducing the configuration process
responsible for the transfer of hedonic properties. This finding joins a number of others indicating
that analytical cognitions are antagonistic toward the expression of likes and dislikes (Nordgren
and Dijksterhuis 2008).
An additional consequence of EC has been demonstrated in studies that have measured the
behavioral consequences of pairing an odor with a tastant that may be valued metabolically. A
considerable body of animal (Myers and Sclafani 2001) and human (Kern et al. 1993; Prescott
2004a; Yeomans et al. 2008b) literature shows that odor–taste pairing produces learned preferences especially effectively when the ingested tastant provides valued nutrients. This process
can be shown to be independent of the hedonic value of the tastant—for example, by comparing
conditioning of odors using sweet tastants that provide energy (e.g., sucrose) with those that do not
(Mobini et al. 2007). As with EC generally, this form of postingestive conditioning is sensitive to
motivational state and is maximized when conditioning and evaluation of learning take place under
relative hunger (Yeomans and Mobini 2006). It has also been recently demonstrated that a novel
flavor paired with ingested monosodium glutamate (MSG) not only increased in rated liking, even
when tested without added MSG, but also, relative to a non-MSG control, produced behavioral
changes including increases in ad libitum food intake and rated hunger after an initial tasting of the
flavor (Yeomans et al. 2008a).
Finally, one interesting behavioral consequence of odor–taste perceptual integration has been
a demonstration that a sweet-smelling odor significantly increased pain tolerance relative to a no-
odor control (Prescott and Wilkie 2007). Given that the effect was not seen with an equally pleasant but non-sweet-smelling odor, the conclusion drawn was that the odor sweetness was acting in an
equivalent manner to sweet tastes, which have been shown to have this same effect on pain (Blass
and Hoffmeyer 1991). Although the presumption is that such effects are also the result of the same
learned integration that produces the sweet smell and the ability to modify taste perceptions, the
crucial demonstration of this has yet to be carried out. It does suggest, however, that the process of
elicitation of a flavor representation by an odor may have broad behavioral as well as perceptual and
hedonic consequences.
There have been some recent attempts to explore the practical implications of odor–taste learn-
ing, opening opportunities to perhaps exploit its consequences. It has been shown, for example, that
the enhancement of tastes by congruent odors seen in model systems (i.e., solutions) also occurs in
foods, with bitter- and sweet-smelling odors enhancing their respective congruent tastes in milk
drinks (Labbe et al. 2006). Also consistent with data from model systems, these studies found that a sweet-smelling odor failed to enhance the sweetness of an unfamiliar beverage. Most
recently, an examination of the potential for odors from a range of salty foods to enhance saltiness
in solution (Lawrence et al. 2009) raised the possibility that such odors could be used to effectively
reduce the sodium content of foods without the concurrent loss of acceptability that typically occurs
(Girgis et al. 2003). Similarly, the finding that odors can take on fatlike properties after associa-
tive pairing with fats (Sundquist et al. 2006) might allow odors to partially substitute for actual fat
content in foods. These studies therefore point to an exciting prospect, in which research aimed at
understanding multisensory processes in flavor perception may lead to applications that ultimately
have important public health consequences.

REFERENCES
Baeyens, F., P. Eelen, O. Van den Bergh, and G. Crombez. 1990. Flavor–flavor and color–flavor conditioning in
humans. Learning and Motivation 21: 434–445.
Bingham, A. F., G. G. Birch, C. De Graaf, J. M. Behan, and K. D. Perring. 1990. Sensory studies with sucrose–
maltol mixtures. Chemical Senses 15(4): 447–456.
Blass, E. M., and L. B. Hoffmeyer. 1991. Sucrose as an analgesic for newborn infants. Pediatrics 87(2):
215–218.
Brillat-Savarin, J.-A. 1825. The physiology of taste, 1994 ed. London: Penguin Books.
Burdach, K. J., J. H. A. Kroeze, and E. P. Koster. 1984. Nasal, retronasal, and gustatory perception: An experi-
mental comparison. Perception & Psychophysics 36(3): 205–208.
Calvert, G. A., M. J. Brammer, E. T. Bullmore, R. Campbell, S. D. Iversen, and A. S. David. 1999. Response
amplification in sensory-specific cortices during crossmodal binding. NeuroReport 10: 2619–2623.
Calvert, G. A., M. J. Brammer, and S. D. Iversen. 1998. Crossmodal identification. Trends in Cognitive Sciences
2(7): 247–253.
Cardello, A. V. 1981. Comparison of taste qualities elicited by tactile, electrical and chemical stimulation of
single human taste papillae. Perception & Psychophysics 29: 163–169.
Clark, C. C., and H. T. Lawless. 1994. Limiting response alternatives in time–intensity scaling: An examination
of the halo-dumping effect. Chemical Senses 19(6): 583–594.
Cruz, A., and B. G. Green. 2000. Thermal stimulation of taste. Nature 403: 889–892.
Dalton, P., N. Doolittle, H. Nagata, and P. A. S. Breslin. 2000. The merging of the senses: Integration of sub-
threshold taste and smell. Nature Neuroscience 3: 431–432.
De Houwer, J., S. Thomas, and F. Baeyens. 2001. Associative learning of likes and dislikes: A review of
25 years of research on human evaluative conditioning. Psychological Bulletin 127(6): 853–869.
Delwiche, J. F., and A. L. Heffelfinger. 2005. Cross-modal additivity of taste and smell. Journal of Sensory
Studies 20: 512–525.
Delwiche, J. F., M. L. Lera, and P. A. S. Breslin. 2000. Selective removal of a target stimulus localized by taste
in humans. Chemical Senses 25: 181–187.
Djordjevic, J., R. J. Zatorre, and M. Jones-Gotman. 2004. Effects of perceived and imagined odors on taste
detection. Chemical Senses 29: 199–208.
Dravnieks, A. 1985. Atlas of odor character profiles. Philadelphia, PA: American Society for Testing and
Materials.
Driver, J., and C. Spence. 2000. Multisensory perception: Beyond modularity and convergence. Current Biology
10: R731–R735.
Frank, R. A. 2003. Response context affects judgments of flavor components in foods and beverages. Food
Quality and Preference 14: 139–145.
Frank, R. A., and J. Byram. 1988. Taste-smell interactions are tastant and odorant dependent. Chemical Senses
13(3): 445–455.
Frank, R. A., K. Ducheny, and S. J. S. Mize. 1989. Strawberry odor, but not red color, enhances the sweetness
of sucrose solutions. Chemical Senses 14(3): 371–377.
Frank, R. A., N. J. van der Klaauw, and H. N. J. Schifferstein. 1993. Both perceptual and conceptual factors
influence taste–odor and taste–taste interactions. Perception & Psychophysics 54(3): 343–354.
Gibson, J. J. 1966. The senses considered as perceptual systems. Boston: Houghton Mifflin Company.
Girgis, S., B. Neal, J. Prescott et al. 2003. A one-quarter reduction in the salt content of bread can be made
without detection. European Journal of Clinical Nutrition 57(4): 616–620.
Green, B. G. 2003. Studying taste as a cutaneous sense. Food Quality and Preference 14: 99–109.
Green, B. G., and J. E. Hayes. 2003. Capsaicin as a probe of the relationship between bitter taste and chemes-
thesis. Physiology & Behavior 79: 811–821.
Kern, D. L., L. McPhee, J. Fisher, S. Johnson, and L. L. Birch. 1993. The postingestive consequences of fat
condition preferences for flavors associated with high dietary fat. Physiology & Behavior 54: 71–76.
Kobayakawa, T., H. Toda, and N. Gotow. 2009. Synchronicity judgement of gustation and olfaction. Paper
presented at the Association for Chemoreception Sciences, Sarasota, FL.
Labbe, D., L. Damevin, C. Vaccher, C. Morgenegg, and N. Martin. 2006. Modulation of perceived taste by
olfaction in familiar and unfamiliar beverages. Food Quality and Preference 17: 582–589.
Laing, D. G., and M. E. Willcox. 1983. Perception of components in binary odour mixtures. Chemical Senses
7(3–4): 249–264.
Lawrence, G., C. Salles, C. Septier, J. Busch, and T. Thomas-Danguin. 2009. Odour–taste interactions: A way
to enhance saltiness in low-salt content solutions. Food Quality and Preference 20(3): 241–248.

Le Berre, E., T. Thomas-Danguin, N. Beno, G. Coureaud, P. Etievant, and J. Prescott. 2008. Perceptual process-
ing strategy and exposure influence the perception of odor mixtures. Chemical Senses 33: 193–199.
Levey, A. B., and I. Martin. 1975. Classical conditioning of human ‘evaluative’ responses. Behavior Research
& Therapy 13: 221–226.
Lim, J., and B. G. Green. 2008. Tactile interaction with taste localization: Influence of gustatory quality and
intensity. Chemical Senses 33: 137–143.
Martino, G., and L. E. Marks. 2001. Synesthesia: Strong and weak. Current Directions in Psychological
Science 10(2): 61–65.
McBurney, D. H. 1986. Taste, smell, and flavor terminology: Taking the confusion out of fusion. In
Clinical measurement of taste and smell, ed. H. L. Meiselman and R. S. Rivkin, 117–125. New York:
Macmillan.
Melara, R. D., L. E. Marks, and K. E. Lesko. 1992. Optional processes in similarity judgments. Perception &
Psychophysics 51(2): 123–133.
Mobini, S., L. C. Chambers, and M. R. Yeomans. 2007. Interactive effects of flavour–flavour and flavour–con-
sequence learning in development of liking for sweet-paired flavours in humans. Appetite 48: 20–28.
Myers, K. P., and A. Sclafani. 2001. Conditioned enhancement of flavor evaluation reinforced by intragastric
glucose: I. Intake acceptance and preference analysis. Physiology & Behavior 74: 481–493.
Nguyen, D. H., D. Valentin, M. H. Ly, C. Chrea, and F. Sauvageot. 2002. When does smell enhance taste? Effect
of culture and odorant/tastant relationship. Paper presented at the European Chemoreception Research
Organisation conference, Erlangen, Germany.
Nordgren, L. F., and A. P. Dijksterhuis. 2008. The devil is in the deliberation: Thinking too much reduces pref-
erence consistency. Journal of Consumer Research 36: 39–46.
Pangborn, R. M. 1970. Individual variation in affective responses to taste stimuli. Psychonomic Science 21(2):
125–126.
Pfeiffer, J. C., T. A. Hollowood, J. Hort, and A. J. Taylor. 2005. Temporal synchrony and integration of sub-
threshold taste and smell signals. Chemical Senses 30: 539–545.
Pierce, J., and B. Halpern. 1996. Orthonasal and retronasal odorant identification based upon vapor phase input
from common substances. Chemical Senses 21(5): 529–543.
Prescott, J. 1998. Comparisons of taste perceptions and preferences of Japanese and Australian consum-
ers: Overview and implications for cross-cultural sensory research. Food Quality and Preference 9(6):
393–402.
Prescott, J. 1999. Flavour as a psychological construct: Implications for perceiving and measuring the sensory
qualities of foods. Food Quality and Preference 10: 349–356.
Prescott, J. 2004a. Effects of added glutamate on liking for novel food flavors. Appetite 42(2): 143–150.
Prescott, J. 2004b. Psychological processes in flavour perception. In Flavour perception, ed. A. J. Taylor and
D. Roberts, 256–277. London: Blackwell Publishing.
Prescott, J., V. Johnstone, and J. Francis. 2004. Odor/taste interactions: Effects of different attentional strategies
during exposure. Chemical Senses 29: 331–340.
Prescott, J., and S. Murphy. 2009. Inhibition of evaluative and perceptual odour–taste learning by attention to
the stimulus elements. Quarterly Journal of Experimental Psychology 62: 2133–2140.
Prescott, J., K.-O. Kim, and S. M. Lee. 2011. Analytic approaches to evaluation modify hedonic responses.
Food Quality and Preference 22: 391–393.
Prescott, J., and J. Wilkie. 2007. Pain tolerance selectively increased by a sweet-smelling odor. Psychological
Science 18(4): 308–311.
Rozin, P. 1982. “Taste–smell confusions” and the duality of the olfactory sense. Perception & Psychophysics
31(4): 397–401.
Sakai, N., T. Kobayakawa, N. Gotow, S. Saito, and S. Imada. 2001. Enhancement of sweetness ratings of aspar-
tame by a vanilla odor presented either by orthonasal or retronasal routes. Perceptual and Motor Skills
92: 1002–1008.
Schifferstein, H. N. J., and P. W. J. Verlegh. 1996. The role of congruency and pleasantness in odor-induced
taste enhancement. Acta Psychologica 94: 87–105.
Small, D. M., J. C. Gerber, Y. E. Mak, and T. Hummel. 2005. Differential neural responses evoked by orthona-
sal versus retronasal odorant perception in humans. Neuron 47: 593–605.
Spence, C., and S. Squire. 2003. Multisensory integration: Maintaining the perception of synchrony. Current
Biology 13: R519–R521.
Spence, C., and C. Parise. 2010. Prior-entry: A review. Consciousness and Cognition 19: 364–379.
Stein, B. E., W. S. Huneycutt, and M. A. Meredith. 1988. Neurons and behavior: The same rules of multisen-
sory integration apply. Brain Research 448: 355–358.

Steiner, J. E., D. Glaser, M. E. Hawilo, and K. C. Berridge. 2001. Comparative expression of hedonic impact:
Affective reactions to taste by human infants and other primates. Neuroscience & Biobehavioral Reviews
25: 53–74.
Stevenson, R. J., and R. A. Boakes. 2004. Sweet and sour smells: The acquisition of taste-like qualities by odors.
In Handbook of multisensory processes, ed. G. Calvert, C. B. Spence, and B. Stein, 69–83. Cambridge,
MA: MIT Press.
Stevenson, R. J., R. A. Boakes, and J. Prescott. 1998. Changes in odor sweetness resulting from implicit learn-
ing of a simultaneous odor–sweetness association: An example of learned synesthesia. Learning and
Motivation 29: 113–132.
Stevenson, R. J., and T. I. Case. 2003. Preexposure to the stimulus elements, but not training to detect them,
retards human odour–taste learning. Behavioural Processes 61: 13–25.
Stevenson, R. J., J. Prescott, and R. A. Boakes. 1995. The acquisition of taste properties by odors. Learning &
Motivation 26: 1–23.
Stevenson, R. J., J. Prescott, and R. A. Boakes. 1999. Confusing tastes and smells: How odors can influence the
perception of sweet and sour tastes. Chemical Senses 24: 627–635.
Sumby, W. H., and I. Pollack. 1954. Visual contribution to speech intelligibility in noise. Journal of the Acoustical
Society of America 26: 212–215.
Sundquist, N., R. J. Stevenson, and I. R. J. Bishop. 2006. Can odours acquire fat-like properties? Appetite 47:
91–99.
Todrank, J., and L. M. Bartoshuk. 1991. A taste illusion: Taste sensation localised by touch. Physiology &
Behavior 50: 1027–1031.
Treisman, A. 1998. Feature binding, attention and object perception. Philosophical Transactions of the Royal
Society, London B 353: 1295–1306.
Treisman, A. 2006. How the deployment of attention determines what we see. Visual Cognition 14: 411–443.
van der Klaauw, N. J., and R. A. Frank. 1996. Scaling component intensities of complex stimuli: The influence
of response alternatives. Environment International 22(1): 21–31.
Voirol, E., and N. Daget. 1986. Comparative study of nasal and retronasal olfactory perception. Lebensmittel-
Wissenschaft und-Technologie 19: 316–319.
von Bekesy, G. 1964. Olfactory analogue to directional hearing. Journal of Applied Physiology 19: 369–373.
White, T. L., and J. Prescott. 2007. Chemosensory cross-modal Stroop effects: Congruent odors facilitate taste
identification. Chemical Senses 32: 337–341.
Whitehead, M. C., C. S. Beeman, and B. A. Kinsella. 1985. Distribution of taste and general sensory nerve end-
ings in fungiform papillae of the hamster. American Journal of Anatomy 173: 185–201.
Yeomans, M. R., N. Gould, S. Mobini, and J. Prescott. 2008a. Acquired flavor acceptance and intake facilitated
by monosodium glutamate in humans. Physiology & Behavior 93: 958–966.
Yeomans, M. R., M. Leitch, N. J. Gould, and S. Mobini. 2008b. Differential hedonic, sensory and behav-
ioral changes associated with flavor–nutrient and flavor–flavor learning. Physiology & Behavior 93:
798–806.
Yeomans, M. R., and S. Mobini. 2006. Hunger alters the expression of acquired hedonic but not sensory quali-
ties of food-paired odors in humans. Journal of Experimental Psychology: Animal Behavior Processes
32(4): 460–466.
Yeomans, M. R., S. Mobini, T. D. Elliman, H. C. Walker, and R. J. Stevenson. 2006. Hedonic and sensory char-
acteristics of odors conditioned by pairing with tastants in humans. Journal of Experimental Psychology:
Animal Behavior Processes 32(3): 215–228.
Yeomans, M. R., J. Prescott, and N. G. Gould. 2009. Individual differences in responses to tastes determine
hedonic and perceptual changes to odours following odour/taste conditioning. Quarterly Journal of
Experimental Psychology 62(8): 1648–1664.
Zellner, D. A., P. Rozin, M. Aron, and C. Kulish. 1983. Conditioned enhancement of human’s liking for flavor
by pairing with sweetness. Learning and Motivation 14: 338–350.
36 A Proposed Model of a Flavor Modality
Dana M. Small and Barry G. Green

CONTENTS
36.1 Introduction........................................................................................................................... 717
36.2 Flavor is Taste, Touch, and Smell.......................................................................................... 717
36.3 Oral Referral.......................................................................................................................... 720
36.3.1 Olfactory Referral...................................................................................................... 720
36.3.2 Taste Referral: Localization of Taste by Touch......................................................... 724
36.3.3 Shared Qualities between Olfaction and Taste.......................................................... 725
36.4 The Proposed Model.............................................................................................................. 725
36.4.1 Odor Objects.............................................................................................................. 725
36.4.1.1 Synthesis..................................................................................................... 725
36.4.1.2 Experience.................................................................................................. 727
36.4.2 Flavor Objects............................................................................................................ 727
36.4.3 Encoding of Flavor Objects....................................................................................... 728
36.5 Neural Mechanisms............................................................................................................... 729
36.5.1 The Binding Mechanism........................................................................................... 729
36.5.2 Neural Correlates of Flavor Object............................................................................ 731
36.6 Alternative Models................................................................................................................ 733
36.7 Summary............................................................................................................................... 733
References....................................................................................................................................... 733

36.1  INTRODUCTION
The perception of flavor occurs when a food or drink enters the mouth. Although the resulting per-
ception depends on inputs from multiple sensory modalities, it is experienced as a unitary percept of
a food or beverage. In this chapter the psychophysical characteristics and neural substrates of flavor
perception are reviewed within the context of a proposed model of a flavor modality in which the
diverse sensory inputs from the mouth and nose become integrated. More specifically, it is argued
that a binding mechanism in the somatomotor mouth area of the cortex brings taste, touch, and
smell together into a common spatial register and facilitates their perception as a coherent “flavor
object.” We propose that the neural representation of the flavor object is a distributed pattern of
activity across the insula, overlying operculum (including the somatomotor mouth region), orbitofrontal, piriform, and anterior cingulate cortex.

36.2  FLAVOR IS TASTE, TOUCH, AND SMELL


When we “taste,” we also touch the food or drink in our mouths and sense its odor, via retronasal
olfaction (Figure 36.1). The term flavor describes this multimodal experience. The gustatory sense
(i.e., taste) refers specifically to the sensations of sweet, sour, salty, bitter, savory (Bartoshuk 1991;


(Figure 36.1 graphic: schematic of the orthonasal and retronasal routes, labeling the nares, tongue, olfactory epithelium, and the paths of volatiles entering via each route.)

FIGURE 36.1  Orthonasal vs. retronasal olfaction. Schematic depiction of two routes of olfactory percep-
tion: orthonasal and retronasal. Odors sensed orthonasally enter the body through the nose (nares) and travel
directly to olfactory epithelium in nasal cavity. Odors sensed retronasally enter the mouth during eating and
drinking. Volatiles are released from food or drink and subsequently pass through the nasophyarynx at back
of oral cavity to enter nasal cavity and reach olfactory epithelium. (From Kringelbach, M.L., Berridge, K.C.,
eds., Oxford handbook: Pleasures of the brain, 2009. With permission from Oxford University Press, Inc.)

Chandrashekar et al. 2006), and perhaps fat (Chale-Rush et al. 2007; Gilbertson 1998; Gilbertson
et al. 1997). Each of the five major taste qualities serves to signal a specific class of nutrients or
potential threats: sweet signals energy in the form of calories, salty signals electrolytes, sour signals
low pH, savory (umami) signals proteins, and since most poisonous substances are bitter, bitter-
ness signals potential toxins (Scott and Plata-Salaman 1991). Thus, the sense of taste helps identify
physiologically beneficial nutrients and potentially harmful stimuli. Because taste receptors lie side
by side in the oral cavity with thermoreceptors, mechanoreceptors, and nociceptors, everything that
is tasted induces tactile and thermal sensations, and sometimes also chemesthetic sensations (e.g.,
burning and stinging; Green 2003; Simon et al. 2008). In addition, some taste stimuli can them-
selves evoke somatosensory sensations. For example, in moderate to high concentrations, salts and
acids can provoke chemesthetic sensations of burning, stinging, or pricking (Green and Gelhard
1989; Green and Lawless 1991). Consequently, even presumably “pure taste” stimuli can have an
oral somatosensory component.
The taste signal itself is carried from taste receptor cells in the oral cavity by cranial nerves
VII, IX, and X to the nucleus of the solitary tract in the brainstem, where taste inputs are joined
by oral somatosensory projections from the spinal trigeminal nucleus. The precise locations of the
trigeminal projections vary across species, but there is evidence (including in humans) of overlap
with gustatory areas (Whitehead 1990; Whitehead and Frank 1983), and of tracts that run within the
nucleus of the solitary tract that may facilitate cross-modal integration (Travers 1988; Figure 36.2).
Somatosensory input also reaches the nucleus of the solitary tract via the glossopharyngeal nerve,
which contains taste-sensitive, as well as mechano- and thermosensitive neurons (Bradley et  al.
1992). Overlapping representation of gustatory and somatosensory information also occurs in the

(Figure 36.2 graphic: glass brain schematic showing taste, somatosensory, and olfactory pathways, with labels for PO, ACC, AI, MI, VI, LOFC, MOFC, piriform cortex, amygdala, olfactory bulb, thalamus, NST, and cranial nerves V, VII, IX, and X.)

FIGURE 36.2  Oral sensory pathways. A glass brain schematic depiction of taste (black circles), somatosen-
sory (white circles), and olfactory (gray circles) pathways. Anatomical locations are only approximate and
connectivity is not exhaustive. Information from taste receptors on tongue is conveyed via the chorda tympani
(VII), glossophyarngeal nerve (IX), and vagus nerve (X) to rostral nucleus of the solitary tract (NST), which
then projects to thalamus. From here, taste information projects to mid insula (MI) and anterior insula and
overlying frontal operculum (AI). AI also projects to ventral insula (VI), medial orbitofrontal cortex (MOFC),
and lateral orbitofrontal cortex (LOFC). Somatosensory input reaches NST via glossopharyngeal nerve (IX)
and trigeminal nerve (V), which then project to thalamus. Oral somatosensory information is then relayed to
opercular region of postcentral gyrus (PO). Olfactory information is conveyed via cranial nerve I to olfactory
bulb, which projects to primary olfactory cortex, including piriform cortex (piri). Piriform cortex, in turn,
projects to VI and orbitofrontal cortex. Anterior cingulated cortex (ACC) and amygdala (Amyg) are also
strongly interconnected with insula and orbital regions representing taste, smell, and oral somatosensation.
(From Kringelbach, M.L., Berridge, K.C., eds., Oxford handbook: Pleasures of the brain, 2009. With permis-
sion from Oxford University Press, Inc.)

thalamus (Pritchard et al. 1989) and at the cortical level (Cerf-Ducastel et al. 2001; Pritchard et al.
1986). For example, the primary gustatory cortex contains nearly as many somatosensory-specific
as taste-specific neurons, in addition to bimodal neurons responding to both somatosensory and
taste stimulation (Kadohisa et al. 2004; Plata-Salaman et al. 1992, 1996; Smith-Swintosky et al.
1991; Yamamoto et al. 1985).
In sum, taste and oral somatosensation have distinct receptor mecha-
nisms, but their signals converge at virtually every level of the neuroaxis, suggestive of extensive
interaction.
Although taste and oral somesthesis provide critical information about the physicochemical
nature of ingested stimuli, it is the olfactory component of food that is required for flavor identifica-
tion (Mozell et al. 1969). The acts of chewing and swallowing release volatile molecules into the oral
cavity, which during exhalation traverse the epipharynx (also referred to as the nasopharynx) and
stimulate receptors on the olfactory epithelium. This process is referred to as retronasal olfaction
(Figure 36.1), in contrast to orthonasal olfaction, which occurs during inhalation through the nose.
Both orthonasal and retronasal olfactory signals are carried via cranial nerve I to the olfactory bulb,
which projects to the anterior olfactory nucleus, the olfactory tubercle, the piriform cortex, several
amygdaloid subnuclei, and rostral entorhinal cortex and thalamus. These areas, in turn, project to
additional amygdala subnuclei, the entorhinal, insula, and orbitofrontal cortex (OFC) (de Olmos et
al. 1978; Price 1973; Turner et al. 1978; Figure 36.2). Thus, olfactory information is carried to the

brain by distinct pathways and does not converge with gustation and oral somatosensation until
higher-order cortical regions, such as the insula and the OFC.
In summary, the perception of flavor depends on multiple distinct inputs that interact at several
levels in the central nervous system. How these interactions act to “bind” the signals into coherent
perceptions of flavor is currently unknown. Here, we propose a model in which the somatomotor
mouth area orchestrates this binding via a process that results in referral of olfactory sensations to
the oral cavity. It is worth noting that flavor percepts can also be influenced by visual inputs (Koza
et al. 2005) and by beliefs and expectations (de Araujo et al. 2003), which are factors that represent
top-down modulation of flavor. However, these types of cognitive effects are outside the scope of
the present chapter.

36.3  ORAL REFERRAL


Consistent with the concept of “proximity” proposed by Gestalt psychologists (Kohler 1929), it is
well known that stimuli that appear to originate from a common location are interpreted as having
a common source (Stein 1998). In the case of flavor, sensory mechanisms exist that cause all of
the perceptual components of flavor (taste, smell, and touch) to appear to arise from the oral cav-
ity (Green 2003; Hollingworth and Poffenberger 1917; Lim and Green 2008; Murphy et al. 1977;
Prescott 1999; Small and Prescott 2005; Todrank and Bartoshuk 1991). Here we argue that the func-
tion of these referral mechanisms is to bring the sensory components of flavor into a common spatial
register that facilitates their binding into a “flavor object.” This process may also be aided by the fact
that odors and tastes can share common sensory characteristics (e.g., some odors are perceived as
sweet) that blur the qualitative boundary between taste and smell and facilitate integration (Auvray
and Spence 2008).
Although several authors have proposed the existence of an object-based flavor system (Auvray
and Spence 2008; Green 2003; Prescott 1999; Small 2008; Small and Prescott 2005), the neuro-
physiology of the hypothesized system remains relatively unexplored. The model proposed here
holds that oral referral is required for the perception of flavor objects, and neural mechanisms that
mediate referral and flavor learning are posited. Because oral referral is central to the model, we
begin our discussion with the various forms of referral that have been described.

36.3.1  Olfactory Referral


As noted above there are two ways to smell: during inhalation through the nose (orthonasal olfac-
tion) and during exhalation through the mouth (retronasal olfaction) (Figure 36.1). Orthonasally
sensed odors appear to emanate from objects in the external world, whereas retronasally sensed
odors often appear to emanate from the oral cavity (Heilmann and Hummel 2004; Hummel et al.
2006; Murphy et al. 1977; Rozin 1982) and may be confused with taste (Ashkenazi and Marks 2004;
Hollingworth and Poffenberger 1917; Murphy et al. 1977; Murphy and Cain 1980). Although scien-
tists have been aware of the misattribution of smell as taste for some time (Titchener 1909), the first
systematic study was made by Murphy et al. (1977). In that study, Murphy and her colleagues inves-
tigated the nature of taste–odor interactions by asking subjects to estimate the intensity of tastes,
odors, and their mixtures. The results showed that the perceived intensity of a taste–odor mixture
roughly equalled the sum of the perceived intensities of the unmixed components, indicating that
tastes and odors were perceptually independent. However, subjects attributed approximately 80% of
the mixture’s intensity to the gustatory modality (Murphy et al. 1977). Specifically, taste intensity
ratings were higher when the nostrils were open compared to when they were pinched (a stylized
version of this finding is represented in Figure 36.3). Since the odor they used, ethyl butyrate, smells
sweet, they suggested the effect resulted from a combination of shared sensory properties (sweet)
and the misattribution of the retronasal olfactory sweet component to the taste system due to the
referral of the odor to the oral cavity. This and subsequent studies also ruled out the possibility that

(Figure 36.3 graphic: perceived taste magnitude with nostrils open plotted against perceived taste magnitude with nostrils closed, for saccharin solutions containing no ethyl butyrate or ethyl butyrate at 0.00133%, 0.00685%, or 0.037%, together with the line of identity.)

FIGURE 36.3  Taste–odor confusion. This figure is a stylized representation of data reported in Figure 4
of Murphy and colleagues (1977) (rendered with permission from Dr. Claire Murphy) and represents first
experimental demonstration of taste–odor confusion. Graph depicts perceived taste magnitude of mixtures
of ethyl butyrate and saccharin sipped when nostrils were open versus taste magnitude of mixtures sipped
when nostrils were closed (open symbols). The parameter is concentration of odorant ethyl butyrate. Closed
circles represent judgments of stimuli that contained no ethyl butyrate, only saccharin. Dashed line is the line
of identity.

referral could be attributed to the activation of taste cells by odors, because the chemicals that pro-
duce taste-like smells (e.g., strawberry smells sweet) do not taste sweet when sampled in the mouth
with the nares occluded (Murphy and Cain 1980; Sakai et al. 2001; Schifferstein and Verlegh 1996;
Stevenson et al. 2000b). Thus, the sweet quality of an odor occurs in the absence of the activation of
taste receptor cells, but when sensed retronasally may nevertheless be attributed to taste.
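To make the arithmetic of this pattern concrete (the numbers and the function below are illustrative assumptions, not Murphy et al.'s data), suppose the taste and the odor each have a perceived intensity of 10 when presented alone. Rough additivity puts the mixture near 20, and attributing about 80% of that total to taste yields a taste rating of roughly 16 with the nostrils open, compared with about 10 when the nostrils are pinched and the retronasal component is removed:

def perceived_taste(taste_alone, odor_alone, nostrils_open, taste_share=0.8):
    # Sketch of additivity plus misattribution: with nostrils pinched the
    # retronasal odor contributes nothing, so the taste rating is just the
    # taste component; with nostrils open a fixed share of the summed
    # mixture intensity is reported as taste. Both assumptions are
    # simplifications for illustration only.
    if not nostrils_open:
        return taste_alone
    return taste_share * (taste_alone + odor_alone)

print(perceived_taste(10, 10, nostrils_open=True))    # about 16: the odor inflates "taste"
print(perceived_taste(10, 10, nostrils_open=False))   # 10: the taste component only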
Indeed, it has been argued that orthonasal and retronasal olfaction represent two distinct modali-
ties. Inspired by a comment made by a friend that “I really love the taste (of Limburger cheese) if
only I can get it by my nose,” Rozin (1982) first proposed that olfaction is a dual-sense modality,
with one component (orthonasal olfaction) specialized for sensing objects in the world and the other
(retronasal olfaction) specialized for sensing objects in the mouth. Building upon Gibson’s proposal
that “tasting” and “smelling” are distinct perceptual systems that cut across receptor classes, Rozin
suggested that “the same olfactory stimulation may be perceived and evaluated in two qualitatively
different ways, depending on whether it was referred to the mouth or the external world.” In support
of this view, he found that subjects frequently reported disliking the smell, but liking the taste, of
certain foods (e.g., fish, eggs, and cheese). He also demonstrated that subjects had great difficulty
correctly identifying flavor stimuli that had first been learned via the orthonasal route. These data
are therefore consistent with the notion that olfactory stimuli arising from the mouth have different
sensory–perceptual properties than those originating in the external world. Rozin suggested that
these perceptual processes might be achieved by differential gating of inputs triggered by the pres-
ence of a palpable object in the mouth, or by the direction of movement of the odor across the olfac-
tory mucosa. Alternatively, he posited that it may be that odor information is not gated but rather
is combined with available oral inputs into an emergent percept in which the olfactory component
loses its separate identity.
After the publication of Rozin’s hypothesis, several investigators argued that the differences
between orthonasal and retronasal olfaction were primarily quantitative rather than qualitative.
This argument was based on evidence that retronasal stimulation by the same physical stimulus

tends to result in lower perceived intensity than orthonasal stimulation (Pierce and Halpern 1996;
Voirol and Daget 1986). Although it is clear that quantitative differences are present, there is
also more recent evidence supporting the duality hypothesis (Bender et al. 2009; Heilmann and
Hummel 2001; Hummel et al. 2006; Koza et al. 2005; Landis et al. 2005; Small et al. 2005; Sun and
Halpern 2005; Welge-Lussen et al. 2009). Of particular note, Hummel and his colleagues devised a
method for delivering odorants in the vapor phase via either the ortho- or retronasal routes (Figure
36.4). Critically, the method allows assessment of retronasal olfaction without stimulation of the
oral cavity (Heilmann and Hummel 2004). Two tubes are inserted into the subject’s nose under
endoscopic guidance so that one tube ends at the external nares (to achieve orthonasal delivery)
and the other tube at the epipharynx (to achieve retronasal delivery). The tubes are, in turn, con-
nected to a computer-controlled olfactometer that delivers pulses of odorant embedded in an odor-
less airstream. Using an electronic nose to measure the stimulus in the airspace below the olfactory
epithelium, the authors demonstrated that the maximum concentration and duration of the signal
was equivalent after delivery by either route (Hummel et al. 2006). Despite similar signals and
the absence of oral stimulation, the olfactory localization illusion was, in part, maintained (Figure
36.5). Subjects were more likely to report that the retronasal odors came from the back of the throat,
whereas orthonasal odors appeared to come from the nose. The mechanism(s) behind the olfac-
tory referral illusion remain unknown. However, this study ruled out intensity differences as a cue,
because the odors were titrated to equate perceived intensity. The finding also suggests that oral
stimulation is not required for at least some referral to occur, since the procedure involved neither
a gustatory nor somatosensory stimulus. However, in a subsequent investigation in which subjects
were asked to indicate whether the odor was delivered orthonasally or retronasally (rather than localize it
to the nose or mouth), trigeminal (chemesthetic) stimulation was found to be an important factor for
making the discrimination (Frasnelli et al. 2008). More work is therefore needed to determine the
degree to which odors can be referred to the mouth based on the direction of flow of the olfactory
stimulus.

FIGURE 36.4  (See color insert.) An MRI image showing tubing placement using methods described by
Heilmann and Hummel (2004). This sagittal brain section reveals placement of nasal cannulae at external
nares to achieve orthonasal delivery, and at nasopharynx to achieve retronasal delivery. Tubes appear white
and odor delivery is represented by small white dots. (Reproduced from Small, D.M. et al., Neuron, 47,
593–605, 2005. With permission from Elsevier.)

(Figure 36.5 graphic: bar graphs for Session 1 and Session 2 showing localization ratings for orthonasal and retronasal delivery of H2S and CO2, on an axis running from Nose (negative values) to Throat (positive values); asterisks mark significant differences.)

FIGURE 36.5  Odorant localization. Preliminary data from 20 subjects showing that orthonasal odor is
perceived as coming from front of nasal cavity and retronasal odor as coming from back of nasal/oral cavity
(throat). This perception occurred despite constant airflow through both routes at all times and no change
in air pressure or flow rate during switching between odor and pure air. Odorants were one pure olfactant
[hydrogen sulfide (H2S)] and one olfactory/chemesthetic stimulus with a significant trigeminal component
[carbon dioxide (CO2)]. Results represent mean ratings from 20 subjects. Error bars represent standard error
of the mean. Positive numbers indicate that subjects perceived odor at back of nasal/oral cavity (throat area),
whereas negative numbers indicate subjects perceived odor at front of the nose; the higher the numbers, the
more certain were subjects about their decision (range of scale from −50 to 0, and from 0 to 50). Data were
obtained in two sessions separated by at least 1 day. Stimuli of 200-ms duration were presented using air-dilution olfactometry (birhinal olfactometer OM6b; Burghart Instruments, Wedel, Germany). Thus, stimula-
tion was the same as used in fMRI study (t-test: *p < .05; ***p < .001). (Reproduced from Small, D.M. et al.,
Neuron, 47, 593–605, 2005. With permission from Elsevier.)

A possible mechanism by which such referral might occur is the direction of odorant flow across
the olfactory epithelium. Indeed, since the data supplied from the electronic nose indicated that the
physical stimulus arriving at the epithelium can be identical (at least for the measured parameters),
the primary difference between the routes in Hummel’s experiments was the direction of odorant
flow. Hummel and colleagues therefore suggested there may be a distinct organization of olfactory
receptor neurons in the back versus the more anterior portions of the nasal cavity. This hypothesis
is consistent with Mozell’s chromatographic model of olfaction, which postulates that the pattern of
odorant binding to receptors can lead to different odor perceptions (Mozell 1970). Further support
for the chromatographic model comes from a study by Sobel et al. (1999), which
showed that subtle differences in airflow patterns between the left and right nostrils can lead to dif-
ferent perceptual experiences.
Although neither taste nor oral somatosensation appear to be required for at least some degree
of referral to occur (Heilmann and Hummel 2004; Hummel et al. 2006; Small et al. 2005), fur-
ther study is needed to determine if stimulation of these modalities may nevertheless contribute to
referral.
In summary, the olfactory localization illusion, coupled with the fact that flavor identity is con-
veyed primarily by olfaction, leads to the perception that flavors come from the mouth. Despite
the fact that this illusion has a profound impact on flavor perception, the mechanisms that produce
it remain unknown. Possible mechanisms include spatiotemporal differences in odorant binding
across the olfactory epithelium during retro- versus orthonasal stimulation, and/or capture by tactile
and/or gustatory stimulation.

36.3.2  Taste Referral: Localization of Taste by Touch


Not only are retronasal odor sensations referred to the mouth and attributed to taste; taste sensations
can be referred to the location of tactile stimulation on the tongue (Green 2003; Lim and Green
2008; Todrank and Bartoshuk 1991; Figure 36.6). This illusion was first demonstrated by Todrank
and Bartoshuk (1991), who were motivated by the observation that during eating, taste sensations
seem to originate throughout the tongue even though the taste buds are located at specific loci (tip,
side, and back of the tongue and soft palate). The authors postulated that this effect might depend on
a capture-illusion similar to the ventriloquist effect, whereby one sensory modality dominates the
other (Tastevin 1937). Specifically, it was hypothesized that taste localization is dominated by touch
in a manner akin to the phenomenon of thermal referral (Green 1977), in which touch dominates
localization of thermal sensation. To test this prediction, Todrank and Bartoshuk asked subjects to
report the intensity of taste as a stimulus was painted onto the tongue along a path that traversed

(Figure 36.6 graphic: diagrams of the “veridical” condition, with the taste stimulus on the middle swab and deionized water (dH2O) on the two outer swabs, and the “referral” condition, with the taste stimulus on the two outer swabs and dH2O on the middle swab; a 1 cm scale is indicated.)

FIGURE 36.6  Taste localization by touch. Stimulus configuration used to measure referral of taste sensations to
site of tactile stimulation. On each trial, experimenter touched three saturated cotton swabs simultaneously to ante-
rior edge of tongue, producing identical tactile stimulation at each site. In veridical condition (top), taste stimulus
was delivered only on middle swab, with deionized water on two outer swabs. In referral condition, taste stimulus
was delivered only on two outer swabs, with deionized water on middle swab. In both conditions, subjects’ task
was to ignore any tastes on outer swabs and to rate intensity of taste perceived at middle swab. Significant taste
sensations were reported at middle swab in referral condition for all four taste stimuli tested (sucrose, NaCl, citric
acid, and quinine). (From Green, B.G., Food Qual. Prefer., 14, 99–109, 2003. With permission.)

regions of high and low taste bud density. When the path began in a region of low taste bud density,
taste sensations started out weak. As the path intersected regions of greater taste bud density, taste
sensations became stronger. However, when the path returned to low density regions the sensation
remained nearly as intense as it was in the high density region. The authors interpreted this result
to mean that the taste sensation was “captured” by the tactile stimulation of the swab and dragged
into the insensitive area. More recent work has corroborated this interpretation by finding that tastes
can be localized to a spatially adjacent tactile stimulus (Green 2003; Lim and Green 2008; Figure
36.6).
Although it is also true that tastes can be localized independently from touch (Delwiche et al.
2000; Lim and Green 2008; Shikata et al. 2000), we believe that referral of taste to touch helps to
create a coherent “perceptive field” onto which odors are also referred, thus providing the founda-
tion for a unitary flavor percept.

36.3.3  Shared Qualities between Olfaction and Taste


In addition to oral referral mechanisms, shared qualities between olfaction and taste promote the
integration of tastes and smells in flavors. Odors often have taste-like characteristics (Dravnieks
1985; Harper et al. 1968), which may be acquired by experience (Stevenson 2001; Stevenson and
Boakes 2004; Stevenson et al. 2000a; Stevenson and Prescott 1995; Stevenson et al. 1999). It has
been proposed that the existence of these shared qualities, coupled with olfactory referral, blurs
the boundary between taste and smell, which in turn facilitates the sensation of a unitary percept
(Auvray and Spence 2008; McBurney 1986).
In summary, there are at least three mechanisms that promote the integration of discrete sensory
inputs that are stimulated during feeding and drinking into a unitary flavor percept or object: olfac-
tory referral, taste referral to touch, and shared taste–odor qualities.

36.4  THE PROPOSED MODEL


The central tenet of the proposed model is that oral referral mechanisms play a critical role in
encoding flavor by helping to fuse multisensory inputs into a perceptual gestalt. This idea builds
upon, and has direct parallels with, the coding of “odor objects” as described by Haberly (2001) and
by Wilson and Stevenson (2003), and “odor–taste learning” described by Stevenson and Boakes
(2004). Therefore, a brief discussion of odor objects follows.

36.4.1  Odor Objects


Wilson and Stevenson (2003) suggest that although the peripheral olfactory system may be orga-
nized to emphasize analytical processing (Buck and Axel 1991), the primary function of olfactory
cortex is the experience-dependent synthesis of odorant components into unique, identifiable odor
objects. Critically, the neural representation of the odor object is distinct from the representation of
its sensory components, and it is the encoding of the entire pattern of activity that forms a percep-
tual gestalt. Wilson and Stevenson base this argument on what they view as two cardinal features of
olfactory perception: that it is (1) synthetic and (2) experience-bound.

36.4.1.1  Synthesis
With regard to synthesis, Wilson and Stevenson (2003) propose that odor elements combine to
produce novel odor qualities within which the odor elements are no longer discernible, and thus
that olfaction is a synthetic modality akin to color vision. Recognizing that these perceptual fea-
tures of olfaction are at odds with the analytical organization of the peripheral olfactory system,
Wilson and Stevenson argued that an experience-dependent synthesis of odor information from the
periphery occurs (Haberly 2001) that creates an emergent neural code in the cortex. Specifically,

they proposed that neurons in anterior piriform cortex receive signals about odorant features from
the olfactory bulb (analytical elements) and initially function as coincident feature detectors (Figure
36.7). The response properties of the cortical neurons then rapidly shift as stimulation continues,
resulting in an experience- and odorant-dependent neural signature within an ensemble of neurons,
the “odor object.” In support of this view, recent work from Wilson’s laboratory examined neural
and perceptual responses to a set of odorant mixture “morphs”—odor mixtures with one or more
components of a 10-component (stock) mixture either removed or replaced (Barnes et al. 2008).
Electrophysiological recordings from the rodent brain showed that the neural ensemble activity
in the piriform cortex, but not in the olfactory bulb, remained correlated when one of the compo-
nents was missing, resulting in rats being unable to discriminate the nine-element mixture from the
stock mixture. However, when a component was replaced, the piriform ensemble activity decor-
related and discrimination was possible. This suggests that neural ensembles in rodent piriform
cortex code odor quality and perform pattern completion to support perceptual stability of odor
objects. Similarly, in humans, Gottfried and colleagues used functional magnetic resonance imag-
ing (fMRI) to demonstrate a double dissociation of odor coding in the piriform cortex, with the
posterior piriform sensitive to the physiochemical features of odors (i.e., alcohol vs. aldehyde) and
not the odor quality (e.g., vegetable vs. fruit), and the anterior piriform sensitive to odor quality and
not physiochemical features (Gottfried et al. 2006b). This result indicates that it is the odor object,
and not the physical stimulus, that is represented past the initial cortical relay. Since it is likely that
conscious perception of odors in humans requires the OFC (Li et al. 2008), it is reasonable to con-
clude that olfactory perceptions are based on odor objects.
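A toy numerical sketch of this correlation logic (the vectors, weights, and noise levels below are invented for illustration and do not model the recorded data) treats the ensemble response to the stock mixture as a template and compares it, via Pearson correlation, with a response to a mixture missing one component and a response to a mixture with one component replaced:

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ensemble response to the 10-component stock mixture:
# one activity value per recorded neuron (arbitrary units).
stock = rng.random(50)

# Removing a component is assumed to weaken the pattern only slightly.
missing_one = 0.9 * stock + rng.normal(0.0, 0.02, size=stock.size)

# Replacing a component is assumed to recruit a partly different ensemble.
replaced_one = 0.3 * stock + 0.7 * rng.random(50)

def ensemble_correlation(a, b):
    # Pearson correlation between two ensemble activity vectors.
    return np.corrcoef(a, b)[0, 1]

print(ensemble_correlation(stock, missing_one))   # remains high: pattern is "completed"
print(ensemble_correlation(stock, replaced_one))  # much lower: pattern decorrelates

On this account, a downstream readout that thresholds such a correlation would treat the degraded mixture as the same odor object but the substituted mixture as a different one, which matches the behavioral pattern reported for the rats.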

(Figure 36.7 graphic: panels (a) and (b) schematically trace responses to isoamyl acetate (AA) and ethyl pentanoate (E7) from olfactory receptor neurons (ORN) through glomeruli and mitral/tufted (M/T) cells to anterior piriform cortex (aPCX), before and after learning.)

FIGURE 36.7  (See color insert.) Synthetic processing in anterior piriform cortex. This figure depicts model
of olfactory processing proposed by Wilson and Stevenson. Recent olfactory sensory physiology is consistent
with a view of olfactory bulb mitral cells serving a largely feature-detection role in odor processing and
neurons in anterior piriform cortex (aPCX) serving as synthetic processors, capable of learning unique com-
binations of feature input associated with specific odors. (a) In response to a novel odor, neurons of piriform
cortex function largely as coincidence detectors for coactive feature input from mitral and tufted (M/T) cells
[color-coded for type of feature input they receive from olfactory receptor neurons (ORN)]. As coincidence
detectors, they might not be efficient at discriminating different odors within their receptive fields. (b) After
rapid perceptual learning and plasticity of association and/or afferent synapses, single neurons of piriform
cortex respond to odors as a whole, which enables enhanced discrimination between odors within their recep-
tive fields and allows maintained responsiveness to partially degraded inputs. Odorants in this example are
isoamyl acetate (AA) and ethyl pentanoate (E7), although the model also applies to mixtures of multiple
odorants. (Figure and caption are reproduced from Wilson, D.A., and Stevenson, R.J., Trends Neurosci., 26,
243–247, 2003. With permission from Elsevier and from Don Wilson.)

However, the development of unique neural codes representing odors and odor mixtures does not
necessarily mean that odor objects are perceptually synthetic. Although studies of odor identifica-
tion in mixtures by Laing et al. (Laing and Francis 1989; Livermore and Laing 1996) have been
cited as evidence of synthesis (Wilson and Stevenson 2003), those results actually show a degree of
analytical processing that led Livermore and Laing (1996) to conclude that “. . . olfaction is neither
entirely analytic nor synthetic, but . . . contains elements of both” (p. 275). Thus, even though both
“expert” and novice subjects have difficulty identifying more than two or three odors in a mixture
(Livermore and Laing 1996), the ability to perceive at least some components rules out a purely
synthetic process. We therefore favor the view of Jinks and Laing (2001) that olfactory perception
is “configurational” in a manner similar to facial perception in vision (Rakover and Teucher 1997).
As those authors described it, configurational processing is based on perceptual fusion rather than
perceptual synthesis of odor qualities, which creates a gestalt in which “limited analysis” of mix-
ture components is possible. This view is also consistent with Gottfried’s conclusion that emerging
data in olfactory neuroscience support the conclusion “that the brain has simultaneous access to the
elemental and configural representations” (Gottfried 2009). As will be shown below, this concept
has also been applied to flavor perception.

36.4.1.2  Experience
There are many examples of experience dependence in the olfactory system (Dade et al. 1998;
Dalton et al. 2002; Li et al. 2006; Wilson et al. 2006). One particularly elegant example of olfactory
perceptual learning comes from Li and colleagues, who presented subjects with odor enantiomer
pairs (mirror image molecules) that were initially indistinguishable (Li et al. 2008). Subsequently,
they associated one member of the enantiomer pair with a shock. This resulted in perceptual learn-
ing in which subjects became able to distinguish the members of the pair and, consistent with
Wilson and Stevenson’s theory, this was accompanied by a divergence in neural response to the odor
pair in the anterior piriform cortex.
A second example of the role of experience in shaping olfactory perception, which is particularly
relevant to this chapter, is that when an odor is experienced with a taste, the odor later comes to
smell more like the taste with which it was experienced (Stevenson and Prescott 1995). This has
been termed the acquisition of taste-like properties by odors, and is described in depth in Chapter 35
by Prescott. It is likely that this form of perceptual learning plays an important role in the formation
of flavor objects.

36.4.2  Flavor Objects


As noted above, flavor perception has been described as resulting from a process of sensory fusion
(Auvray and Spence 2008; McBurney 1986). One can introspect and identify the olfactory compo-
nent of a flavor (e.g., strawberry) as well as the taste component (e.g., sweet and sour); however,
because some percepts (e.g., sweetness) are shared between the two components, the boundary between
what is odor and what is taste is not always discernible. Thus, consistent with our view of odor objects,
we propose that the elements of flavor are discernible yet fused. Unlike olfaction, which may promote
configural processes, taste appears to be primarily analytic (Breslin 2000); tastes do not mix to produce
novel percepts. Flavor percepts therefore arise from the binding of a distributed pattern of neural
processing whose distinct elements maintain their individual qualities to varying degrees (e.g., tastes more so than odor
objects). In addition, there is evidence that the response selectivity of bimodal (odor- and taste-
sensitive) neurons is shaped by the coactivation of unimodal inputs (Rolls 2007). It is therefore pro-
posed that, like the creation of odor objects, the creation of flavor objects depends on a distributed
pattern of neural activity that is sculpted by experience.
What might this pattern of neural activity look like? It is argued that it is a distributed circuit
including the neural representation of the odor object, unimodal taste cells, unimodal oral somato-
sensory cells, multimodal cells, and a "binding mechanism" (Figure 36.9). We propose that it is the
activation of the binding mechanism that mediates oral referral, and that the binding mechanism is
required to fuse flavor components into a flavor object. As such, retronasal olfaction has a privileged
role in the formation of flavor objects. That is, unless an odor has been experienced retronasally, it
is not incorporated into a flavor object. A prediction that follows from this line of reasoning is that
if Stevenson’s basic, taste–odor learning paradigm is repeated, but the conditioning trials are per-
formed with orthonasal rather than retronasal odor stimulation, then the odors should not acquire
taste-like properties. This experiment has yet to be carried out.

36.4.3  Encoding of Flavor Objects


Upon binding of the associated distributed responses, a flavor object is created and must be encoded
in memory. Although it is clear that the interaction between tastes and odors is experience-dependent,
the nature of the learning is currently unknown. There are several possibilities. First, odor objects,
consisting of the activity of unimodal olfactory cells, could—via associative learning—come to
acquire the ability to activate taste cells (Rescorla 1981). In this case, the connection between a
unimodal taste-responsive neuron and a unimodal smell-responsive neuron that fire together is
strengthened, so that the experience of the odor alone is able to cause the taste-responsive neuron
to fire. Based on perception, it is clear that this process would have to be asymmetrical, because
although some odors have taste-like characteristics, no tastes have odor-like characteristics. Such an
organization is unlikely because bimodal taste–odor neurons with congruent response profiles have
been identified, and clearly play a role in flavor processing (Rolls and Baylis 1994). A more likely
mechanism would therefore be Hebbian learning (Cruikshank and Weinberger 1996; Hebb 1949),
by which odors would acquire the ability to selectively activate bimodal neurons that are simulta-
neously stimulated by taste cells. This type of model has been proposed by Rolls, who argues that
unimodal inputs shape bimodal and multimodal cells by experience, and that the perception of
flavor is encoded by the bimodal cells (Rolls et al. 1996). However, a fundamental problem with
this model is that bimodal taste–odor neurons (with congruent responses to odors and tastes) fire
in response to presentation of unimodal odors and unimodal tastes (Rolls and Baylis 1994), yet the
perception of flavor is evoked only by odors, not by unimodal tastes.
The only mechanism that can reconcile flavor perception with the known physiology is one in
which the multimodal inputs from the oral cavity are encoded together as a flavor object via config-
ural learning (Stevenson et al. 2000a, 2000b). This is not to say that associative learning does not
occur in the flavor modality, as it clearly does (Yeomans et al. 2006). Rather, the argument is that
the initial encoding of the flavor object proceeds via configural learning. In contrast to associative
and Hebbian learning, which are based on the strengthening of connections between elements, configural
learning involves the encoding of the entire pattern of stimulation (Pearce 2002). In other words,
when a mixture is sampled by mouth, a unitary flavor is perceived rather than independent tastes
and odors, and it is this unitary percept that is encoded in memory.
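
To make the distinction concrete, the toy simulation below contrasts the two schemes in a few lines of Python. It is purely illustrative and not drawn from any of the studies cited here: the population sizes, learning rate, and activity patterns are arbitrary assumptions. Part (a) implements a Hebbian outer-product update that strengthens connections between co-active unimodal odor and taste units, so that the odor alone later reactivates a taste-like pattern; part (b) instead stores the entire multimodal pattern as a single configural trace that later inputs are matched against.

import numpy as np

rng = np.random.default_rng(0)
n_odor, n_taste = 8, 4                    # sizes of the toy unimodal populations (assumed)
odor = rng.random(n_odor)                 # activity evoked by a retronasally sensed odor
taste = rng.random(n_taste)               # activity evoked by the co-occurring taste

# (a) Hebbian/associative learning: strengthen odor-to-taste connections for
# co-active units across repeated taste-odor pairings.
W = np.zeros((n_taste, n_odor))
learning_rate = 0.5
for _ in range(20):
    W += learning_rate * np.outer(taste, odor)   # dW_ij proportional to post_i * pre_j
recalled_taste = W @ odor                        # the odor alone now drives the taste units

# (b) Configural learning: encode the whole multimodal pattern as one trace and
# retrieve by matching a probe against the stored configuration.
flavor_object = np.concatenate([odor, taste])

def configural_match(probe, memory):
    # cosine similarity between a probe pattern and the stored flavor object
    return float(probe @ memory / (np.linalg.norm(probe) * np.linalg.norm(memory)))

odor_only_probe = np.concatenate([odor, np.zeros(n_taste)])
print("Hebbian recall of taste pattern:", np.round(recalled_taste, 2))
print("Configural match, odor-only probe:", round(configural_match(odor_only_probe, flavor_object), 2))
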
The empirical foundation for the assertion that the encoding of flavor objects requires configural
processes comes from evidence that the enhancement of the taste-like properties of odors is highly
resistant to extinction and counterconditioning (Harris et al. 2004; Stevenson et al. 2000a, 2000b).
If odor–taste exposures strengthen the ability of the odor to activate a sensory representation of the taste (as
would be the case if associative mechanisms were at play), then repeated exposure to the odor with-
out the taste should lead to the extinction of this association (Rescorla 1981; Rescorla and Freeberg
1978), which does not occur. Counterconditioning is the process by which the association between
A and B is replaced by a new association between A and C. For example, in the first phase of the
experiment a subject learns that a cue “A” is associated with receipt of food “B”. Once this asso-
ciation is established (e.g., seeing A causes salivation), A is then paired with a new consequence
that opposes B (e.g., shock). Some stimuli, such as faces, are resistant to extinction but will display
counterconditioning (Baeyens et al. 1989). Stevenson and colleagues reasoned that if the acquisi-
tion of taste-like properties by odors is based on configural encoding, counterconditioning should
not be possible (Stevenson et al. 2000a). To test this possibility, they subjected taste–odor and
taste–color pairings to a counterconditioning paradigm. In a single conditioning session, subjects were
exposed to taste–odor and taste–color pairs. At least 24 h later, one taste–odor and one taste–color
pair underwent counterconditioning (e.g., the odor and the color were paired with new tastes). As
predicted, the odor maintained its original taste and did not acquire the new taste. In contrast, an
expectancy measure indicated that subjects expected the colored solution to taste like the counter-
conditioned taste rather than the originally conditioned taste. One caveat is that, to date, all of the
odors used in studies of odor acquisition of taste-like qualities have been rated as having perceptible
amounts of the target taste quality before the conditioning trials. Accordingly, it may be more accu-
rate to view the effect of taste–odor learning as an enhancement rather than an acquisition of taste-
like qualities. If so, it would not be surprising if pairing odors with other tastes failed to eliminate a
taste quality that the odor possessed before the original odor–taste pairing.
An obvious next question concerns the nature of odor–somatosensory learning. There are some
data to suggest that odors may acquire fat sensations after pairing with fat-containing milk
(Sundqvist et al. 2006). However, fat may be sensed via taste channels (Gilbertson 1998; Gilbertson
et al. 1997), and therefore may be perceived as qualities of odors via the same mechanism as other
taste qualities. Certainly, sniffed odors do not appear to evoke sensations of texture and tempera-
ture. It is likely, therefore, that although configural and synthetic processes may occur during taste–
odor perceptual learning, oral somatosensory contributions to the unitary flavor percept may not be
learned, instead undergoing sensory fusion rather than synthesis.
Notably, whereas a pure strawberry odor may result in the perception of sweetness, a pure sweet
solution, or the texture of a berry, never evokes the perception of strawberry. Together with refer-
ral, these observations further support the view that olfaction has a privileged role in the flavor
modality. Specifically, food identity, and thus perception of flavor objects, depends primarily on
the olfactory channel (Mozell et al. 1969). Although many different foods can be characterized as
predominantly sweet, predominantly salty, smooth, or crunchy, in nature there is only one food that
is predominantly “strawberry” and one food that is predominantly “peach.” Such an arrangement
has clear advantages because it enables organisms to learn to identify many different potential food
sources and to associate them with the presence of nutrients (e.g., sugars) or toxins. Moreover, the
duality of the olfactory modality allows key sensory signals about the sources of nutrients or toxins
to be incorporated into the odor percept during eating and drinking (retronasal olfaction), which
then enables them to be sensed at a distance (orthonasal olfaction). Indeed, although humans do not
normally use their noses to sniff out food sources, the ability to use orthonasal olfaction to locate a
food source is preserved (Porter et al. 2007).

36.5  NEURAL MECHANISMS


36.5.1  The Binding Mechanism
According to the proposed model, a neural substrate that orchestrates perceptual binding should
exist. Since we propose that binding depends on referral, the substrate should be selectively respon-
sive to retronasal odors. Also, activation of the binding mechanism should be independent of experi-
ence, but necessary for configurational learning to take place. Although there is no direct evidence
for a region that causes or controls such processes, there is evidence that such a mechanism might
exist in the somatomotor mouth area of the cortex. This evidence comes from an fMRI study
investigating the effect of odorant route (ortho- vs. retro-) on evoked neural responses (Small et al.
2005). In brief, four odors were presented to subjects orthonasally and retronasally according to
the procedure devised by Heilmann and Hummel described above, while subjects underwent fMRI
scanning. Three of these odors were nonfood odors (lavender, farnesol, and butanol), and one was
a food odor (chocolate). When the responses associated with orthonasal delivery were compared
to responses associated with retronasal delivery (and vice versa), there was very little differential
neural response if responses were collapsed across odorant type. The only significant finding was
that the somatomotor mouth area responded preferentially to retronasal compared to orthonasal
odors, regardless of odor identity (Figure 36.8). The response in this region was therefore suggested
to reflect olfactory referral to the oral cavity, which was documented to occur during retronasal, but
not orthonasal, stimulation.
It is not possible to know from this study whether the response in the somatomotor mouth area
was the result or the cause of referral. However, there are several factors that point to this region as
the likely locus of the binding mechanism. First, the somatomotor mouth region was the only area
to show a significant differential response to retronasal compared to orthonasal stimulation. Second,
responses there were independent of whether the odor represented a food or a nonfood stimulus.
Third, the perception of flavor consistently results in greater responses in this region than does the
perception of a tasteless solution (Cerf-Ducastel and Murphy 2001; de Araujo and Rolls 2004;
Marciani et al. 2006), indicating that it is active when flavor percepts are experienced. Fourth, since
it is argued that stimulus integration and configural encoding are dependent on oral referral, it fol-
lows that the binding mechanism should be localized in the cortical representation of the mouth. We
also note that the location of the binding mechanism in the somatomotor mouth area is consistent
with Auvray and Spence's suggestion that the formation of the flavor perceptual modality is depen-
dent on a higher-order cortical binding mechanism (Auvray and Spence 2008). In addition to the
initial binding, it is further predicted that neural computations in the somatomotor mouth area play
a “permissive” role in enabling the sculpting of multimodal neurons. Specifically, it is proposed that
unimodal taste and unimodal smell neurons located in the piriform and anterior dorsal insula sculpt
the profiles of bimodal taste/smell neurons located in the ventral anterior insula and the caudal OFC
only when there is concurrent activation of the binding substrate (and associated oral referral).
This model is consistent with the observations of subthreshold taste–odor summation. Whereas
subthreshold summation between orthonasally sensed odor and taste appears, like taste enhance-
ment, to be dependent on perceptual congruency (Dalton et al. 2000), subthreshold summation
between retronasally sensed odors and tastes occurs for both congruent and incongruent pairs
(Delwiche and Heffelfinger 2005). This suggests that experience is not required for summation of
subthreshold taste and retronasal olfactory signals. This observation is consistent with the proposed
model because all retronasal odors are predicted to give rise to a response in the somatomotor
mouth area. In contrast, orthonasal olfactory experiences do not activate the somatomotor mouth
area and are therefore not referred to the mouth. As a result, orthonasal olfactory inputs can only
integrate with other oral sensations by reactivating odor objects, which have been previously associ-
ated with flavor objects.

FIGURE 36.8  Preferential activation of somatomotor mouth area by retronasal compared to orthonasal
sensation of odors. Functional magnetic resonance imaging data from a study (Small et al. 2005) using the
Heilmann and Hummel (2004) method of odorant presentation to study brain response to orthonasal and
retronasal odors. Image represents a sagittal section of brain showing response in somatomotor mouth area to
retronasal vs. orthonasal sensation of same odors superimposed upon averaged anatomical scans. (Adapted
with permission from Small, D.M. et al., Neuron 47, 593–605, 2005.)

The role of the somatomotor mouth area in oral referral and in the creation of the flavor modal-
ity could be tested in a variety of ways. For example, one could record single-unit responses in the
somatomotor mouth area and the OFC in a taste–odor learning paradigm. In humans, one could
examine taste–odor learning in patients with specific damage to the somatomotor mouth region or
in healthy controls by using transcranial magnetic stimulation to induce temporary “lesions.” The
prediction in both cases would be that lesions disrupt oral referral and the enhancement of taste-like
properties by odors. Another possibility would be to use a combination of fMRI and network con-
nectivity models such as dynamic causal modeling (Friston et al. 2003; Friston and Price 2001) to
test whether response in the somatomotor mouth area to flavors influences responses in regions such
as the OFC, and to test whether the magnitude of this influence changes as a function of learning.
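
The logic of such a connectivity test can be illustrated without the full machinery of dynamic causal modeling. The Python toy below, which is our own simplified stand-in rather than an implementation of DCM, simulates a somatomotor mouth ("SMM") time series and an OFC time series whose coupling differs before and after learning, and then recovers that coupling with a lagged least-squares fit; the region names, coupling strengths, and noise level are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(1)

def simulate_session(coupling, n=200, noise=0.1):
    # toy two-region time series in which OFC activity follows lagged SMM input
    smm = rng.standard_normal(n)
    ofc = np.zeros(n)
    for t in range(1, n):
        ofc[t] = coupling * smm[t - 1] + noise * rng.standard_normal()
    return smm, ofc

def estimate_coupling(smm, ofc):
    # least-squares slope of OFC(t) regressed on SMM(t-1)
    x, y = smm[:-1], ofc[1:]
    return float(np.dot(x, y) / np.dot(x, x))

pre = estimate_coupling(*simulate_session(coupling=0.2))    # before taste-odor learning (assumed value)
post = estimate_coupling(*simulate_session(coupling=0.8))   # after learning (assumed value)
print(f"Estimated SMM-to-OFC influence: pre = {pre:.2f}, post = {post:.2f}")
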

36.5.2  Neural Correlates of Flavor Object


The binding mechanism in the somatomotor mouth area is proposed to comprise unimodal and
multimodal representations of taste, smell, and oral somatosensation that arise when a stimulus is
in the mouth. However, the current paucity of data on flavor processing necessitates a hypothetical
rather than an empirical description of the proposed network, which is depicted in Figure 36.9.
Certainly, odor object representations in the piriform cortex (Gottfried et al. 2006b; Wilson and
Stevenson 2004) and OFC (Gottfried et al. 2006a; Schoenbaum and Eichenbaum 1995a, 1995b)
are likely to be key components of the flavor network. In addition, regions with overlapping rep-
resentation of taste, odor, and oral somatosensation are likely to be critical. In humans, there is
evidence from functional neuroimaging studies of overlapping responses to taste, smell, and oral
somatosensation in the insula and overlying operculum (Cerf-Ducastel and Murphy 2001; de Araujo
and Rolls 2004; Marciani et al. 2006; Poellinger et al. 2001; Savic et al. 2000; Small et al. 1999,
2003; Verhagen et al. 2004; Zald et al. 1998), and in the OFC (Francis et al. 1999; Frank et al. 2003;
Gottfried et al. 2002a, 2002b, 2006a; Marciani et al. 2006; O’Doherty et al. 2000; Poellinger et al.
2001; Savic et al. 2000; Small et al. 1997, 1999, 2003; Sobel et al. 1998; Zald et al. 1998; Zald and
Pardo 1997; Zatorre et al. 1992). In accordance with these findings in humans, single-cell recording
studies in monkeys have identified both taste- and smell-responsive cells in the insula/operculum
(Scott and Plata-Salaman 1999) and OFC (Rolls and Baylis 1994; Rolls et al. 1996).

FIGURE 36.9  Proposed flavor network. A “glass” brain drawing depicting proposed flavor network as gray
circles. G, gustation; S, somatosensation; O, olfaction. Arrows indicate point of entry for sensory signal.
Dashed line box with GS represents gustatory (G) and somatosensory (S) relays in thalamus. Hatched region
indicates insular cortex. Bolded gray circle with S (somatosensory) indicates somatomotor mouth area. Note
that gustatory and somatosensory information are colocalized, except in somatomotor mouth area. Unitary
flavor percept is formed only when all nodes (gray circles) receive inputs. No single sensory channel (gusta-
tory, olfactory, or somatosensory) can invoke flavor object in isolation.
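
The gating rule stated in the caption of Figure 36.9 can be sketched as a small data structure. The node names and the input sets assigned to them below are our reading of the figure, offered only to illustrate the rule that a unitary flavor percept requires input at every node; they should not be taken as an exact transcription of the proposed network.

# Hypothetical node list; the input sets are assumptions for illustration only.
FLAVOR_NETWORK = {
    "piriform cortex": {"O"},
    "anterior insula": {"G", "S", "O"},
    "somatomotor mouth area": {"S"},
    "orbitofrontal cortex": {"G", "S", "O"},
}

def flavor_object_evoked(active_channels):
    # A unitary flavor percept arises only if every node receives at least one of its inputs.
    return all(inputs & active_channels for inputs in FLAVOR_NETWORK.values())

print(flavor_object_evoked({"G", "S", "O"}))   # stimulus in the mouth: all channels active -> True
print(flavor_object_evoked({"O"}))             # sniffed odor alone -> False
print(flavor_object_evoked({"G", "S"}))        # taste plus oral touch without odor -> False
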
Although not considered traditional chemosensory cortex, the anterior cingulate cortex receives
direct projections from the insula and the OFC (Carmichael and Price 1996; Vogt and Pandya 1987),
responds to taste and smell (de Araujo and Rolls 2004; de Araujo et al. 2003; Marciani et al. 2006;
O’Doherty et al. 2000; Royet et al. 2003; Savic et al. 2000; Small et al. 2001, 2003; Zald et al. 1998;
Zald and Pardo 1997), and shows supra-additive responses to congruent taste–odor pairs (Small et al.
2004). Therefore, it is possible that this region contributes to flavor processing. Moreover, a meta-
analysis of all independent studies of taste and smell confirmed large clusters of overlapping activa-
tion in the insula/operculum, OFC, and anterior cingulate cortex (Verhagen and Engelen 2006).
There is also evidence for supra-additive responses to the perception of congruent but not incon-
gruent taste–odor solutions in the anterodorsal insula/frontal operculum, anteroventral insula/
caudal OFC, frontal operculum, and anterior cingulate cortex (McCabe and Rolls 2007; Small
et al. 2004). Such supra-additive responses are thought to be a hallmark of multisensory integra-
tion (Calvert 2001; Stein 1998). The fact that the supra-additive responses in these regions are
experience-dependent strongly supports the possibility that these areas are key nodes of the distrib-
uted representation of the flavor object. In support of this possibility, unpublished work suggests
that there are differential responses to food versus nonfood odors, and that such responses occur in
the insula, operculum, anterior cingulate cortex, and OFC (Small et al., in preparation).
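
The supra-additivity criterion referred to above can be written as a one-line check: the response to the taste–odor mixture must exceed the sum of the two unisensory responses. The sketch below uses made-up parameter estimates purely to illustrate the comparison; the numbers are not taken from the cited studies.

def is_supra_additive(beta_mixture, beta_taste, beta_odor):
    # multisensory response exceeds the sum of the unisensory responses
    return beta_mixture > beta_taste + beta_odor

# hypothetical parameter estimates for a congruent and an incongruent taste-odor pair
print(is_supra_additive(1.8, 0.6, 0.7))   # congruent: 1.8 > 1.3 -> True
print(is_supra_additive(1.1, 0.6, 0.7))   # incongruent: 1.1 < 1.3 -> False
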
Finally, neuroimaging studies with whole brain coverage frequently report responses in similar
regions of the cerebellum (Cerf-Ducastel and Murphy 2001; Savic et al. 2002; Small et al. 2003;
Sobel et al. 1998; Zatorre et al. 2000) and amygdala (Anderson et al. 2003; Gottfried et al. 2002a,
2002b, 2006b; Small et al. 2003, 2005; Verhagen and Engelen 2006; Winston et al. 2005; Zald
et al. 1998; Zald and Pardo 1997) to taste and smell stimulation, although neither region shows
supra-additive responses to taste and smell (Small et al. 2004). We have elected not to include these
regions in the proposed network, but acknowledge that there is at least some empirical basis for
further investigation of their role in flavor processing.
One important but still unresolved question regarding the neurophysiology of flavor perception is
whether the process by which an odor object becomes part of a flavor percept results in changes to
the odor object (Wilson and Stevenson 2004). Preliminary work suggests that the taste-like proper-
ties of food odors are encoded in the same region of insula that encodes sweet taste, and not in the
piriform cortex or OFC (Veldhuizen et al. 2010). Subjects underwent fMRI scanning while being
exposed to a weak sweet taste (sucrose), a strong sweet taste, two sweet food odors (strawberry
and chocolate), and two sweet nonfood odors (rose and lilac). A region of insular cortex was identi-
fied that responded to taste and odor sweetness. This finding is consistent with a recent report that
insular lesions disrupt taste and odor-induced taste perception (Stevenson et al. 2008). Moreover, it
was found that the magnitude of insular response to food, but not nonfood odors, correlated with
perceived sweetness. The selectivity of the association between response and sweetness perception
strongly suggests that experience with an odor in the mouth as a food or flavor modifies neural
activity, and that this occurs in the insula, but not in the piriform cortex. This, in turn, suggests that
odor objects represented in the piriform cortex are not modified by flavor learning. In summary, it is
proposed that bimodal taste–odor neurons in the OFC and anterior insula are changed during simul-
taneous perception of taste and retronasally sensed odor, whereas piriform neurons are not. Thus,
we hypothesize that the flavor object comprises an unmodified odor object and modified bimodal
cells that become associated within a distributed pattern of activation during initial binding.
Another critical question for understanding neural encoding of flavor objects is whether the
entire active network is encoded or only a subset of key elements. For example, is activation of the
somatomotor mouth area required to reexperience the flavor percept? If not, what are the key ele-
ments? The answers to these questions are currently unknown. However, as discussed above, it is
possible that the taste signal is critical (Davidson et al. 1999; Snyder et al. 2007).

36.6  ALTERNATIVE MODELS


Two other neural models for configural encoding of unitary flavor percepts have been proposed.
First, Stevenson and Tomiczek (2007) consider the acquisition of taste-like properties in the con-
text of synesthesia. They propose that a multimodal representation of flavors exists in a distributed
network that includes the insula, amygdala, and OFC. This idea is similar to the proposed model.
However, instead of emphasizing a binding mechanism related to referral, they conceive of taste–
odor learning as an implicit synesthesia, with the odor as the inducer and the taste as the concurrent,
or illusory, perception. The model hinges on the fact that odors have two pathways to the orbital
cortex: a direct projection from the olfactory bulb and one reliant on a relay through the thalamus.
It is argued that the thalamocortical pathway, which receives purely olfactory input, allows the
multimodal representation activated by the direct pathway to be assigned as olfactory experience,
giving rise to the perception that the odor has a taste. The second model was proposed by Verhagen
and Engelen (2006), who, like us, highlight the importance of binding. However, they do not focus
on oral referral; instead, they suggest a role for the hippocampus, or a hippocampus-like mechanism, in binding,
and for the perirhinal cortex in the conscious perception of flavors (Verhagen 2007). Future research
will determine which of these models—if any—is correct.

36.7  SUMMARY
We propose that during tasting, retronasal olfactory, gustatory, and somatosensory stimuli form
a perceptual gestalt—the “flavor object”—the elements of which maintain their individual qualities
to varying degrees. The development and experience of this percept are dependent on oral refer-
ral, for which neural processing in the somatomotor mouth area is deemed critical. An as-yet-
unidentified neural mechanism within this region is hypothesized to bind the pattern of responses
elicited by flavor stimuli. When the binding mechanism is active, unimodal inputs shape the selec-
tivity of bimodal taste–odor neurons. Flavor objects are then encoded via configural learning as a
distributed pattern of response across the somatomotor mouth area, multiple regions of insula and
overlying operculum, orbitofrontal cortex, piriform cortex, and anterior cingulate cortex. It is these
functionally associated regions that constitute the neural basis of the proposed flavor modality.

REFERENCES
Anderson, A. K., K. Christoff, I. Stappen et al. 2003. Dissociated neural representations of intensity and valence
in human olfaction. Nat Neurosci 6: 196–202.
Ashkenazi, A., and L. E. Marks. 2004. Effect of endogenous attention on detection of weak gustatory and olfac-
tory flavors. Percept Psychophys 66: 596–608.
Auvray, M., and C. Spence. 2008. The multisensory perception of flavor. Conscious Cogn 17: 1016–1031.
Baeyens, F., P. Eelen, O. Van den Bergh et al. 1989. Acquired affective–evaluative value: Conservative but not
unchangeable. Behav Res Ther 27: 279–287.
Barnes, D. C., R. D. Hofacer, A. R. Zaman et al. 2008. Olfactory perceptual stability and discrimination. Nat
Neurosci 11: 1378–1380.
Bartoshuk, L. M. 1991. Taste, smell, and pleasure. In The hedonics of taste and smell, ed. R. C. Bolles, 15–28.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Bender, G., T. Hummel, S. Negoias et al. 2009. Separate signals for orthonasal vs. retronasal perception of food
but not nonfood odors. Behav Neurosci 123: 481–489.
Bradley, R. M., R. H. Smoke, T. Akin et al. 1992. Functional regeneration of glossopharyngeal nerve through
micromachined sieve electrode arrays. Brain Res 594: 84–90.
Breslin, P. A. 2000. Human gustation. In The neurobiology of taste and smell, ed. T. E. Finger and W. L. Silver,
423–461. San Diego, CA: Wiley-Liss, Inc.
Buck, L., and R. Axel. 1991. A novel multigene family may encode odorant receptors: a molecular basis for
odor recognition. Cell 65: 175–187.
Bult, J. H., R. A. de Wijk, and T. Hummel. 2007. Investigations on multimodal sensory integration: texture,
taste, and ortho- and retronasal olfactory stimuli in concert. Neurosci Lett 411: 6–10.
Calvert, G. A. 2001. Crossmodal processing in the human brain: Insights from functional neuroimaging studies.
Cereb Cortex 11: 1110–1123.
Carmichael, S. T., and J. L. Price. 1996. Connectional networks within the orbital and medial prefrontal cortex
of Macaque monkeys. J Comp Neurol 371: 179–207.
Cerf-Ducastel, B., and C. Murphy. 2001. fMRI activation in response to odorants orally delivered in aqueous
solutions. Chem Senses 26: 625–637.
Cerf-Ducastel, B., P. F. Van de Moortele, P. MacLeod et al. 2001. Interaction of gustatory and lingual somato-
sensory perceptions at the cortical level in the human: A functional magnetic resonance imaging study.
Chem Senses 26: 371–383.
Chale-Rush, A., J. R. Burgess, and R. D. Mattes. 2007. Evidence for human orosensory (taste?) sensitivity to
free fatty acids. Chem Senses 32: 423–431.
Chandrashekar, J., M. A. Hoon, N. J. Ryba et al. 2006. The receptors and cells for mammalian taste. Nature
444: 288–294.
Cruikshank, S. J., and N. M. Weinberger. 1996. Evidence for the Hebbian hypothesis in experience-dependent
physiological plasticity of the neocortex: A critical review. Brain Res Rev 22: 191–228.
Dade, L. A., M. Jones-Gotman, R. J. Zatorre et al. 1998. Human brain function during odor encoding and rec-
ognition. A PET activation study. Ann NY Acad Sci 855: 572–574.
Dalton, P., N. Doolittle, and P. A. Breslin. 2002. Gender-specific induction of enhanced sensitivity to odors.
Nat Neurosci 5: 199–200.
Dalton, P., N. Doolittle, H. Nagata et al. 2000. The merging of the senses: Integration of subthreshold taste and
smell. Nat Neurosci 3: 431–432.
Davidson, J. M., R. S. T. Linforth, T. A. Hollowood et al. 1999. Effect of sucrose on the perceived flavor inten-
sity of chewing gum. J Agric Food Chem 47: 4336–4340.
de Araujo, E., and E. T. Rolls. 2004. Representation in the human brain of food texture and oral fat. J Neurosci
24: 3086–3093.
de Araujo, E., E. T. Rolls, M. L. Kringelbach et al. 2003. Taste–olfactory convergence, and the representation
of the pleasantness of flavour in the human brain. Eur J Neurosci 18: 2059–2068.
de Olmos, J., H. Hardy, and L. Heimer. 1978. The afferent connections of the main and the accessory olfactory
bulb formations in the rat: An experimental HRP-study. J Comp Neurol 181: 213–244.
Delwiche, J. F., and A. L. Heffelfinger. 2005. Cross-modal additivity of taste and smell. J Sens Stud 20:
512–525.
Delwiche, J. F., M. F. Lera, and P. A. S. Breslin. 2000. Selective removal of a target stimulus localized by taste
in humans. Chem Senses 25: 181–187.
Dravnieks, A. 1985. Atlas of odor character profiles (ASTM Data series DS61). West Conshohocken, PA:
American Society for Testing and Materials.
Francis, S., E. T. Rolls, R. Bowtell et al. 1999. The representation of pleasant touch in the brain and its relation-
ship with taste and olfactory areas. Neuroreport 10: 435–459.
Frank, G. K., W. H. Kaye, C. S. Carter et al. 2003. The evaluation of brain activity in response to taste stimuli­—A
pilot study and method for central taste activation as assessed by event-related fMRI. J Neurosci Methods
131: 99–105.
Frasnelli, J., M. Ungermann, and T. Hummel. 2008. Ortho- and retronasal presentation of olfactory stimuli
modulates odor percepts. Chemosens Percept 1: 9–15.
Friston, K., L. Harrison, and W. D. Penny. 2003. Dynamic causal modelling. Neuroimage 19: 1273–1302.
Friston, K., and C. J. Price. 2001. Dynamic representations and generative models of brain function. Brain Res
Bull 54: 275–285.
Gilbertson, T. A. 1998. Gustatory mechanisms for the detection of fat. Curr Opin Neurobiol 8: 447–452.
Gilbertson, T. A., D. T. Fontenot, L. Liu et al. 1997. Fatty acid modulation of K+ channels in taste receptor cells:
Gustatory cues for dietary fat. Am J Physiol 272: C1203–C1210.
Gottfried, J. A. 2009. Function follows form: Ecological constraints on odor codes and olfactory percepts. Curr
Opin Neurobiol, in press.
Gottfried, J. A., R. Deichmann, J. S. Winston et al. 2002a. Functional heterogeneity in human olfactory cortex:
An event-related functional magnetic resonance imaging study. J Neurosci 22: 10819–10828.
Gottfried, J. A., J. O’Doherty, and R. J. Dolan. 2002b. Appetitive and aversive olfactory learning in humans
studied using event-related functional magnetic resonance imaging. J Neurosci 22: 10829–10837.
Gottfried, J. A., D. M. Small, and D. H. Zald. 2006a. The chemical senses. In The orbitofrontal cortex, ed. D. H.
Zald and S. L. Rauch, 125–171. New York: Oxford Univ. Press.
Gottfried, J. A., J. S. Winston, and R. J. Dolan. 2006b. Dissociable codes of odor quality and odorant structure
in human piriform cortex. Neuron 49: 467–479.
Green, B. G. 1977. Localization of thermal sensation: An illusion and synthetic heat. Percept Psychophys 22:
331–337.
Green, B. G. 2002. Studying taste as a cutaneous sense. Food Qual Prefer 14: 99–109.
Green, B. G. 2003. Studying taste as a cutaneous sense. Food Qual Prefer 14: 99–109.
Green, B. G., and B. Gelhard. 1989. Salt as an oral irritant. Chem Senses 14: 259–271.
Green, B. G., and H. T. Lawless. 1991. The psychophysics of somatosensory chemoreception in the nose and
mouth. In Smell and taste in health and disease, ed. T.V. Getchell, R. L. Doty, L. M. Bartoshuk, and J. B.
Snow, 235–253. New York: Raven Press.
Haberly, L. B. 2001. Parallel-distributed processing in olfactory cortex: New insights from morphological and
physiological analysis of neuronal circuitry. Chem Senses 26: 551–576.
Harper, R., D. G. Land, N. M. Griffiths et al. 1968. Odor qualities: A glossary of usage. Br J Psychol 59:
231–252.
Harris, J. A., F. L. Shand, L. Q. Carroll et al. 2004. Persistence of preference for a flavor presented in simulta-
neous compound with sucrose. J Exp Psychol Anim Behav Processes 30: 177–189.
Hebb, D. O. 1949. The organization of behavior. New York: Wiley.
Heilmann, S., and T. Hummel. 2004. A new method for comparing orthonasal and retronasal olfaction. Behav
Neurosci 118: 412–419.
Hollingworth, H. L., and A. T. Poffenberger. 1917. The sense of taste. New York: Moffat, Yard.
Hummel, T., S. Heilmann, B. N. Landis et al. 2006. Perceptual differences between chemical stimuli presented
through the ortho- or retronasal route. Flavor Fragrance J 21: 42–47.
Jinks, A., and D. G. Laing. 2001. The analysis of odor mixtures by humans: Evidence for a configurational
process. Physiol Behav 72: 51–63.
Kadohisa, M., E. T. Rolls, and J. V. Verhagen. 2004. Orbitofrontal cortex: Neuronal representation of oral tem-
perature and capsaicin in addition to taste and texture. Neuroscience 127: 207–221.
Kohler, W. 1929. Gestalt psychology. New York: Horace Liveright.
Koza, B. J., A. Cilmi, M. Dolese et al. 2005. Color enhances orthonasal olfactory intensity and reduces retro-
nasal olfactory intensity. Chem Senses 30: 643–649.
Kringelbach, M. L., and K. C. Berridge. 2009. Oxford handbook: Pleasures of the brain. Oxford: Oxford Univ.
Press.
Laing, D. G., and G. W. Francis. 1989. The capacity of humans to identify odors in mixtures. Physiol Behav
46: 809–814.
Landis, B. N., J. Frasnelli, J. Reden et al. 2005. Differences between orthonasal and retronasal olfactory func-
tions in patients with loss of the sense of smell. Arch Otolaryngol Head Neck Surg 131: 977–981.
Li, W., J. D. Howard, T. B. Parrish et al. 2008. Aversive learning enhances perceptual and cortical discrimina-
tion of indiscriminable odor cues. Science 319: 1842–1845.
Li, W., E. Luxenberg, T. Parrish et al. 2006. Learning to smell the roses: Experience-dependent neural plasticity
in human piriform and orbitofrontal cortices. Neuron 52: 1097–1108.
Lim, J., and B. G. Green. 2008. Tactile interaction with taste localization: Influence of gustatory quality and
intensity. Chem Senses 33: 137–143.
Livermore, A., and D. G. Laing. 1996. Influence of training and experience on the perception of multicompo-
nent odor mixtures. J Exp Psychol Hum Percept Perform 22: 267–277.
Marciani, L., J. C. Pfeiffer, J. Hort et al. 2006. Improved methods for fMRI studies of combined taste and aroma
stimuli. J Neurosci Methods 158: 186–194.
McBurney, D. H. 1986. Taste, smell and flavor terminology: Taking the confusion out of confusion. In Clinical
measurement of taste and smell, ed. H. L. Meiselman and R. S. Rivkin, 117–124. New York: Macmillan.
McCabe, C., and E. T. Rolls. 2007. Umami: A delicious flavor formed by convergence of taste and olfactory
pathways in the human brain. Eur J Neurosci 25: 1855–1864.
Mozell, M. M. 1970. Evidence for a chromatographic model of olfaction. J Gen Physiol 56: 46–63.
Mozell, M. M., B. P. Smith, P. E. Smith et al. 1969. Nasal chemoreception in flavor identification. Arch
Otolaryngol 90: 367–373.
Murphy, C., W. S. Cain, and L. M. Bartoshuk. 1977. Mutual action of taste and olfaction. Sens Processes 1:
204–211.
Murphy, C. A., and W. S. Cain. 1980. Taste and olfaction: Independence vs interaction. Physiol Behav 24:
601–605.
O’Doherty, J., E. T. Rolls, S. Francis et al. 2000. Sensory-specific satiety-related olfactory activation of the
human orbitofrontal cortex. Neuroreport 11: 399–403.
Pearce, J. M. 2002. Evaluation and development of a connectionist theory of configural learning. Anim Learn
Behav 30: 73–95.
Pierce, J., and B. P. Halpern. 1996. Orthonasal and retronasal odorant identification based upon vapor phase
input from common substances. Chem Senses 21: 529–543.
Plata-Salaman, C. R., T. R. Scott, and V. L. Smith-Swintosky. 1992. Gustatory neural coding in the monkey
cortex: l-Amino acids. J Neurophysiol 67: 1552–1561.
Plata-Salaman, C. R., V. L. Smith-Swintosky, and T. R. Scott. 1996. Gustatory neural coding in the monkey
cortex: Mixtures. J Neurophysiol 75: 2369–2379.
Poellinger, A., R. Thomas, P. Lio et al. 2001. Activation and habituation in olfaction—An fMRI study.
Neuroimage 13: 547–560.
Porter, J., B. Craven, R. M. Khan et al. 2007. Mechanisms of scent-tracking in humans. Nat Neurosci 10:
27–29.
Prescott, J. 1999. Flavour as a psychological construct: Implications for perceiving and measuring the sensory
qualities of foods. Food Qual Prefer 10: 349–356.
Price, J. L. 1973. An autoradiographic study of complementary laminar patterns of termination of afferent
fibers to the olfactory cortex. J Comp Neurol 150: 87–108.
Pritchard, T. C., R. B. Hamilton, J. R. Morse et al. 1986. Projections of thalamic gustatory and lingual areas in
the monkey, Macaca fascicularis. J Comp Neurol 244: 213–228.
Pritchard, T. C., R. B. Hamilton, and R. Norgren. 1989. Neural coding of gustatory information in the thalamus
of Macaca mulatta. J Neurophysiol 61: 1–14.
Rakover, S. S., and B. Teucher. 1997. Facial inversion effects: Parts and whole relationship. Percept Psychophys
59: 752–761.
Rescorla, R. A. 1981. Simultaneous associations. In Predictability, correlation, and contiguity, ed. P. Harzem
and M. D. Zeiler, 47–80. Chichester: Wiley.
Rescorla, R. A., and L. Freeberg. 1978. The extinction of within-compound flavor associations. Learn Motiv
9: 411–427.
Rolls, E. T. 2007. Sensory processing in the brain related to the control of food intake. Proc Nutr Soc 66:
96–112.
Rolls, E. T., and L. L. Baylis. 1994. Gustatory, olfactory, and visual convergence within the primate orbitofron-
tal cortex. J Neurosci 14: 5437–5452.
Rolls, E. T., H. D. Critchley, and A. Treves. 1996. Representation of olfactory information in the primate
orbitofrontal cortex. J Neurophysiol 75: 1982–1996.
Royet, J. P., J. Plailly, C. Delon-Martin et al. 2003. fMRI of emotional responses to odors: Influence of hedonic
valence and judgment, handedness, and gender. Neuroimage 20: 713–728.
Rozin, P. 1982. “Taste–smell confusions” and the duality of the olfactory sense. Percept Psychophys 31:
397–401.
Sakai, N., T. Kobayakawa, N. Gotow et al. 2001. Enhancement of sweetness ratings of aspartame by a vanilla
odor presented either by orthonasal or retronasal routes. Percept Mot Skills 92: 1002–1008.
Savic, I., B. Gulyas, and H. Berglund. 2002. Odorant differentiated pattern of cerebral activation: comparison
of acetone and vanillin. Hum Brain Mapp 17: 17–27.
Savic, I., B. Gulyas, M. Larsson et al. 2000. Olfactory functions are mediated by parallel and hierarchical pro-
cessing. Neuron 26: 735–745.
Schifferstein, H. N. J., and P. W. J. Verlegh. 1996. The role of congruency and pleasantness in odor-induced
taste enhancement. Acta Psychol 94: 87–105.
Schoenbaum, G., and H. Eichenbaum. 1995a. Information coding in the rodent prefrontal cortex: I. Single-
neuron activity in orbitofrontal cortex compared with that in piriform cortex. J Neurophysiol 74:
733–750.
Schoenbaum, G., and H. Eichenbaum. 1995b. Information coding in the rodent prefrontal cortex: II. Ensemble
activity in orbitofrontal cortex. J Neurophysiol 74: 751–762.
Scott, T. R., and C. R. Plata-Salaman. 1991. Coding of taste quality. In Smell and taste in health and disease,
ed. T. V. Getchell. New York: Raven Press.
Scott, T. R., and C. R. Plata-Salaman. 1999. Taste in the monkey cortex. Physiol Behav 67: 489–511.
Shikata, H., D. B. McMahon, and P. A. Breslin. 2000. Psychophysics of taste lateralization on anterior tongue.
Percept Psychophys 62: 684–694.
Simon, S. A., I. de Araujo, J. R. Stapleton et al. 2008. Multisensory processing of gustatory stimuli. Chemosens
Percept, in press.
Small, D. M. 2008. Flavor and the formation of category-specific processing in olfaction. Chemosens Percept
1: 136–146.
Small, D. M., J. Gerber, Y. E. Mak et al. 2005. Differential neural responses evoked by orthonasal versus retro-
nasal odorant perception in humans. Neuron 47: 593–605.
Small, D. M., M. D. Gregory, Y. E. Mak et al. 2003. Dissociation of neural representation of intensity and affec-
tive valuation in human gustation. Neuron 39: 701–711.
Small, D. M., M. Jones-Gotman, R. J. Zatorre et al. 1997. Flavor processing: More than the sum of its parts.
Neuroreport 8: 3913–3917.
Small, D. M., and J. Prescott. 2005. Odor/taste integration and the perception of flavor. Exp Brain Res 166:
345–357.
Small, D. M., J. Voss, Y. E. Mak et al. 2004. Experience-dependent neural integration of taste and smell in the
human brain. J Neurophysiol 92: 1892–1903.
Small, D. M., D. H. Zald, M. Jones-Gotman et al. 1999. Human cortical gustatory areas: A review of functional
neuroimaging data. Neuroreport 10: 7–14.
Small, D. M., R. J. Zatorre, A. Dagher et al. 2001. Changes in brain activity related to eating chocolate: From
pleasure to aversion. Brain 124: 1720–1733.
Smith-Swintosky, V. L., C. R. Plata-Salaman, and T. R. Scott. 1991. Gustatory neural coding in the monkey
cortex: stimulus quality. J Neurophysiol 66: 1156–1165.
Sobel, N., R. M. Khan, A. Saltman et al. 1999. Olfaction: The world smells different to each nostril. Nature
402: 35.
Sobel, N., V. Prabhakaran, J. E. Desmond et al. 1998. Sniffing and smelling: Separate subsystems in the human
olfactory cortex. Nature 392: 282–286.
Sobel, N., V. Prabhakaran, C. A. Hartley et al. 1998. Odorant-induced and sniff-induced activation in the cer-
ebellum of the human. J Neurosci 18: 8990–9001.
Stein, B. E. 1998. Neural mechanisms for synthesizing sensory information and producing adaptive behaviors.
Exp Brain Res 123: 124–135.
Stevenson, R. J. 2001. Associative learning and odor quality perception: How sniffing an odor mixture can alter
the smell of its parts. Learn Motiv 32: 154–177.
Stevenson, R. J., and R. A. Boakes. 2004. Sweet and sour smells: Learned synesthesia between the senses of
taste and smell. In The handbook of multisensory processes, ed. G. A. Calvert, C. Spence, and B. E. Stein,
69–83. Boston: MIT Press.
Stevenson, R. J., R. A. Boakes, and J. P. Wilson. 2000a. Counter-conditioning following human odor-taste and
color-taste learning. Learn Motiv 31: 114–127.
Stevenson, R. J., R. A. Boakes, and J. P. Wilson. 2000b. Resistance to extinction of conditioned odor percep-
tions: Evaluative conditioning is not unique. J Exp Psychol Learn Mem Cogn 26: 423–440.
Stevenson, R. J., L. A. Miller, and Z. C. Thayer. 2008. Impairments in the perception of odor-induced tastes and
their relationship to impairments in taste perception. J Exp Psychol Hum Percept Perform 34: 1183–1197.
Stevenson, R. J., and J. Prescott. 1995. The acquisition of taste properties by odors. Learn Motiv 26: 433–455.
Stevenson, R. J., J. Prescott, and R. A. Boakes. 1999. Confusing tastes and smells: how odours can influence
the perception of sweet and sour tastes. Chem Senses 24: 627–635.
Stevenson, R. J., and C. Tomiczek. 2007. Olfactory-induced synesthesias: A review and model. Psychol Bull
133: 294–309.
Sun, B. C., and B. P. Halpern. 2005. Identification of air phase retronasal and orthonasal odorant pairs. Chem
Senses 30: 693–706.
Sundqvist, N. C., R. J. Stevenson, and I. R. J. Bishop. 2006. Can odours acquire fat-like properties? Appetite
47: 91–99.
Snyder, D. J., C. J. Clark, F. A. Catalanotto et al. 2007. Oral anesthesia specifically impairs retronasal olfaction.
Chem Senses 32: A15.
Tastevin, J. 1937. En partant de l’experience d’Aristote. Encephale 1: 57–84, 140–158.
Titchener, E. B. 1909. A textbook of psychology. New York: Macmillan.
Todrank, J., and L. M. Bartoshuk. 1991. A taste illusion: Taste sensation localized by touch. Physiol Behav 50:
1027–1031.
Travers, J. B. 1988. Efferent projections from the anterior nucleus of the solitary tract of the hamster. Brain
Res 457: 1–11.
Turner, B. H., K. C. Gupta, and M. Mishkin. 1978. The locus and cytoarchitecture of the projection areas of the
olfactory bulb in Macaca mulatta. J Comp Neurol 177: 381–396.
Veldhuizen, M. G., D. Nachtigal, L. Teulings et al. 2010. The insular taste cortex contributes to odor quality
coding. Front Hum Neurosci 4: 58.
Verhagen, J. V. 2007. The neurocognitive bases of human multimodal food perception: Consciousness. Brain
Res Rev 53: 271–286.
Verhagen, J. V., and L. Engelen. 2006. The neurocognitive bases of human multimodal food perception: Sensory
integration. Neurosci Biobehav Rev 30: 613–650.
Verhagen, J. V., M. Kadohisa, and E. T. Rolls. 2004. Primate insular/opercular taste cortex: Neuronal repre-
sentations of the viscosity, fat texture, grittiness, temperature, and taste of foods. J Neurophysiol 92:
1685–1699.
Vogt, B. A., and D. Pandya. 1987. Cingulate cortex of the rhesus monkey: II. Cortical afferents. J Comp Neurol
262: 271–289.
Voirol, E., and N. Dagnet. 1986. Comparative study of nasal and retronasal olfactory perception. Food Sci
Technol 19: 316–319.
Welge-Lussen, A., J. Drago, M. Wolfensberger et al. 2005. Gustatory stimulation influences the processing of
intranasal stimuli. Brain Res 1038: 69–75.
Welge-Lussen, A., A. Husner, M. Wolfensberger et al. 2009. Influence of simultaneous gustatory stimuli on
orthonasal and retronasal olfaction. Neurosci Lett 454: 124–128.
Whitehead, M. C. 1990. Subdivisions and neuron types of the nucleus of the solitary tract that project to the
parabrachial nucleus in the hamster. J Comp Neurol 301: 554–574.
Whitehead, M. C., and M. E. Frank. 1983. Anatomy of the gustatory system in the hamster: Central projections
of the chorda tympani and the lingual nerve. J Comp Neurol 220: 378–395.
Wilson, D. A., M. Kadohisa, and M. L. Fletcher. 2006. Cortical contributions to olfaction: Plasticity and per-
ception. Semin Cell Dev Biol 17: 462–470.
Wilson, D. A., and R. J. Stevenson. 2003. The fundamental role of memory in olfactory perception. Trends
Neurosci 26: 243–247.
Wilson, D. A., and R. J. Stevenson. 2004. The fundamental role of memory in olfactory perception. Trends
Neurosci 25: 243–247.
Winston, J. S., J. A. Gottfried, J. M. Kilner et al. 2005. Integrated neural representations of odor intensity and
affective valence in human amygdala. J Neurosci 25: 8903–8907.
Yamamoto, T., N. Yuyama, T. Kato et al. 1985. Gustatory responses of cortical neurons in rats: II. Information
processing of taste quality. J Neurophysiol 53: 1370–1386.
Yeomans, M. R., S. Mobini, T. D. Elliman et al. 2006. Hedonic and sensory characteristics of odors conditioned
by pairing with tastants in humans. J Exp Psychol Anim Behav Processes 32: 215–228.
Zald, D. H., J. T. Lee, K. W. Fluegel et al. 1998. Aversive gustatory stimulation activates limbic circuits in
humans. Brain 121: 1143–1154.
Zald, D. H., and J. V. Pardo. 1997. Emotion, olfaction, and the human amygdala: amygdala activation during
aversive olfactory stimulation. Proc Natl Acad Sci U S A 94: 4119–4124.
Zatorre, R. J., M. Jones-Gotman, A. C. Evans et al. 1992. Functional localization and lateralization of human
olfactory cortex. Nature 360: 339–340.
Zatorre, R. J., M. Jones-Gotman, and C. Rouby. 2000. Neural mechanisms involved in odor pleasantness and
intensity judgments. Neuroreport 11: 2711–2716.
37 Assessing the Role of Visual and Auditory Cues in Multisensory Perception of Flavor

Massimiliano Zampini and Charles Spence

CONTENTS
37.1 Introduction
37.2 Multisensory Interactions between Visual and Flavor Perception
37.2.1 Role of Color Cues on Multisensory Flavor Perception
37.2.2 Color–Flavor Interactions: Possible Role of Taster Status
37.2.3 Color–Flavor Interactions: Possible Role of Learned Associations between Colors and Flavors
37.2.4 Color–Flavor Interactions: Neural Correlates
37.2.5 Interim Summary
37.3 Role of Auditory Cues in the Multisensory Experience of Foodstuffs
37.3.1 Effect of Sound Manipulation on the Perception of Crisps
37.3.2 Effect of Auditory Cues on the Perception of Sparkling Water
37.4 Conclusions
References

37.1  INTRODUCTION
Our perception of the objects and events that fill the world in which we live depends on the integration
of the sensory inputs that simultaneously reach our various sensory systems (e.g., vision, audition,
touch, taste, and smell). Perhaps the best-known examples of genuinely multisensory experiences
come from our perception and evaluation of food and drink. The average person would say that the
flavor of food derives primarily from its taste in the mouth. They are often surprised to discover that
there is a strong “nasal” role in the perception of flavor. In fact, it has been argued that the majority
of the flavor of food actually comes from its smell (e.g., Cain 1977; Murphy and Cain 1980; Rozin
1982).* Our perception of food and drink, however, is not simply a matter of combining gustatory
and olfactory food cues (although this is undoubtedly very important; Dalton et al. 2000). For
instance, our evaluation of the pleasantness of a particular foodstuff can be influenced not only by
what it looks, smells, and tastes like, but also what it sounds like in the mouth (think, for example, of
the auditory sensations associated with biting into a potato chip or a stick of celery; see Spence and
Zampini 2006, for a review). The feel of a foodstuff (i.e., its oral–somatosensory attributes) is also
very important; the texture, temperature, viscosity, and even the painful sensations we experience
when eating hot foods (e.g., chilli peppers) all contribute to our overall multisensory experience of
foodstuffs (e.g., Bourne 1982; Lawless et al. 1985; Tyle 1993). Flavor perception is also influenced
by the interactions taking place between oral texture and both olfactory and gustatory cues (see also
Bult et al. 2007; Christensen 1980a, 1980b; Hollowood et al. 2002). Given the multisensory nature
of our perception of food, it should come as little surprise that many studies have been conducted in
order to try and understand the relative contribution of each sense to our overall evaluation of food
(e.g., see Delwiche 2004; Spence 2002; Stevenson 2009; Stillman 2002). In this chapter, we review
the contribution of visual and auditory cues to the multisensory perception of food. Moreover, any
influence of the visual and auditory aspects of foods and drinks is likely to take place at different
stages of the food experience: visual cues are perceived while foodstuffs are still outside of the mouth,
whereas auditory cues are typically perceived only when we are actually consuming the food.

* For example, coffee and tea are indistinguishable (with both having a bitter taste) if drunk while holding one's nose
pinched shut. Whereas the taste of a lemon only actually consists of sour, sweet, and bitter components, most of the flavor
we normally associate with the taste of a lemon actually comes from the terpene aroma, one of the constituent chemicals
that stimulate the olfactory mucosa via the nasopharynx (i.e., retronasal olfaction). Odor molecules may reach the recep-
tors in the olfactory epithelium (i.e., the area located in the rear of the nasal cavity) traveling inward from the anterior
nares or through the posterior nares of the nasopharynx. Most typically, orthonasal olfaction occurs during respiratory
inhalation or sniffing, whereas retronasal olfaction occurs during respiratory exhalation or after swallowing. People
usually report experiencing odors as originating from the external world when perceived orthonasally, and as coming
from the mouth when perceived retronasally (Rozin 1982). Importantly, the latest cognitive neuroscience evidence has
highlighted the fact that somewhat different neural structures are used to process these two kinds of olfactory informa-
tion (Small et al. 2005, 2008; see also Koza et al. 2005).

37.2 MULTISENSORY INTERACTIONS BETWEEN


VISUAL AND FLAVOR PERCEPTION
37.2.1  Role of Color Cues on Multisensory Flavor Perception
Over the past 80 years or so, many researchers have been interested in the role of visual information
in the perception of foodstuffs (Moir 1936). It seems that the visual appearance of food and drink can
have a profound impact on our perception and evaluation of flavor. The role of color cues on people’s
flavor perception has been investigated in many different studies, although the majority of the research
has been published in food science journals rather than psychology or neuroscience (for reviews, see
Clydesdale 1993; Delwiche 2004; Spence et al. 2010; Stevenson 2009). The majority of these studies
have shown that people’s perception of a variety of different foods and drinks can be dramatically
modified by changing the color of food or drink items (e.g., DuBose et al. 1980; Duncker 1939; Garber
et al. 2000; Johnson and Clydesdale 1982; Morrot et al. 2001; Philipsen et al. 1995; Roth et al. 1988;
Stillman 1993; Wheatley 1973; Zampini et al. 2007; Zellner and Durlach 2003).
One of the most dramatic early empirical demonstrations of the strong link between color
and the pleasure we derive from food (and/or our appetitive responses to food) was reported by
Wheatley (1973). He described a situation in which a group of people ate a meal of steak, French
fries, and peas under color-masking lighting conditions. Halfway through the meal, normal lighting
was restored revealing that the steak was colored blue, the French fries had been colored green, and
the peas were red. According to Wheatley’s description, the mere sight of the food was sufficient to
induce nausea in many of his dinner guests. Such results, although anecdotal, do at least hint at the
powerful influence that visual cues can have over our appetitive responses.
Color has also been shown to exert a significant effect on our ability to recognize specific food-
stuffs. For example, in one oft-cited study, DuBose et al. (1980) presented participants with drinks
incorporating a variety of different color–flavor combinations (the flavored solutions were colored
either appropriately, inappropriately, or else were presented as colorless solutions). DuBose et al.
found that participants’ identification of the flavors of many of the drinks was significantly influ-
enced by their color. In particular, the participants were less accurate in identifying the flavor of
fruit-flavored beverages when they were unaware of the appropriate color. For instance, 40% of
the participants reported that a cherry-flavored beverage actually tasted of orange when it had
been inappropriately colored orange (compared to 0% orange-flavor responses when the drink was
appropriately colored red; a similar effect was reported for the lime-flavored beverage). Many other
researchers have reported a similar visual modulation of participants’ odor discrimination/identifi-
cation responses (e.g., Blackwell 1995; Davis 1981; Koza et al. 2005; Morrot et al. 2001; Stevenson
and Oaten 2008; Zellner et al. 1991; Zellner and Kautz 1990; Zellner and Whitten 1999).
Although the potential influence of color cues on people’s flavor identification responses is by
now well documented, the evidence regarding the impact of changes in color intensity on perceived
flavor intensity is rather less clear. For example, ambiguous results have been reported in studies in
which the participants had to rate the intensity of the flavor of solutions that varied in the intensity
of the color that had been added to the solutions (e.g., DuBose et al. 1980; Johnson and Clydesdale
1982; Johnson et al. 1983; see Clydesdale 1993, for a review). For example, DuBose et al. found
that overall flavor intensity was affected by color intensity, with more intense coloring resulting in
stronger flavor evaluation responses by participants for the orange-flavored, but not for the cherry-
flavored beverages, tested in their study. However, in other studies, the concentration of coloring
in the solutions did not influence participants’ ratings, regardless of whether the solutions were
appropriately or inappropriately colored (e.g., Alley and Alley 1998; Frank et al. 1989; Zampini et
al. 2007).
Researchers have also investigated the effect of varying the intensity of the color on the per-
ceived intensity of tastes and odors separately. For instance, the addition of a red coloring to cherry-
and strawberry-flavored sucrose solutions has been found to increase the perceived sweetness of
these solutions in certain studies (Johnson and Clydesdale 1982; Johnson et al. 1983). Maga (1974)
hypothesized that the influence of colors on sweetness perception in humans might be particularly
strong for colors that are typically associated with the natural ripening of fruits (e.g., yellow, red; see
also Lavin and Lawless 1998; Strugnell 1997). By contrast, researchers have reported that the addi-
tion of color has no effect on the perceived saltiness of foods such as soups (Gifford and Clydesdale
1986; Gifford et al. 1987; Maga 1974), perhaps because (in contrast to sweet foods) there are no
particular colors associated with the salt content of a food (i.e., salt is ubiquitous to many different
kinds, and hence colors, of food; see Maga 1974 and Lavin and Lawless 1998, on this point).
In one of the earliest studies to have been published in this area, Pangborn (1960) reported that
people reported green-colored pear nectar as being less sweet than colorless pear nectar. However,
Pangborn and Hansen (1963) failed to replicate these results. Although they found that green color-
ing had no effect on the perceived sweetness of pear nectar, its addition did give rise to an overall increase in sensitivity to sweetness. Similarly, for the pairing
of color with odor, Zellner and Kautz (1990) reported that solutions were rated as having a more
intense odor when color had been added to the solutions than when it was absent, regardless of the
appropriateness of the color–odor match. In fact, Zellner and Kautz noted that the participants in
their study simply refused to believe that colored and uncolored solutions of equal odor intensity
were actually equally strong.
The explanation for these contradictory results regarding the influence of variations in color
intensity on the perception of taste, odor, and flavor (i.e., odor + taste) intensity is far from obvi-
ous (see Shankar et al. 2010). For example, Chan and Kane-Martinelli (1997) reported that the
perceived flavor intensity for certain foods (such as chicken bouillon) was higher with the com-
mercially available color sample than when the samples were given in a higher-intensity color (see
also Clydesdale 1993, on this point). Note also that if the discrepancy between the intensity of the
color and the intensity of the flavor is too great, participants may experience a disconfirmation of
expectation (or some form of dissonance between the visually and gustatorily determined flavor
intensities) and the color and taste cues may no longer be linked (e.g., Clydesdale 1993; cf. Ernst and
Banks 2002; Yeomans et al. 2008).
Another potentially important issue in terms of assessing interactions between color and flavor
is the role of people’s awareness of the congruency of the color–flavor pairings used (Zampini et al.
2007). In fact, in most of the research that has been published to date on the effects of color cues on
human flavor perception, the participants were not explicitly informed that the flavors of the solu-
tions they were evaluating might not be paired with the appropriately colored solutions (e.g., see
DuBose et al. 1980; Johnson and Clydesdale 1982; Morrot et al. 2001; Oram et al. 1995; Philipsen
et al. 1995; Roth et al. 1988; Stillman 1993; Zellner and Durlach 2003). One might therefore argue
that the visual modulation of flavor perception reported in many of these previous studies simply
reflects a decisional bias introduced by the obvious variation in the color cues (cf. the literature on
the effectiveness of the color of medications on the placebo effect; e.g., de Craen et al. 1996; see also
Engen 1972), rather than a genuine perceptual effect (i.e., whereby the color cues actually modulate
the perception of flavor itself; although see also Garber et al. 2001, 2008, for an alternative perspec-
tive from the field of marketing). For example, if participants found it difficult to identify the flavor of the food or drink correctly on the basis of the gustatory and olfactory cues alone in flavor discrimination tasks, then they may simply have based their responses on the more easily discriminable color cues, in which case their judgments in these previous studies would have been driven by decisional rather than perceptual processes.
In their study, Zampini et al. (2007) tried to reduce any possible influence of response biases that
might emerge when studying color–flavor interactions by explicitly informing their participants that
the color–flavor link would often be misleading (i.e., that the solutions would frequently be presented
in an inappropriate color; cf. Bertelson and Aschersleben 1998). This experimental manipulation
was introduced in order to investigate whether the visual cues would still influence human flavor
perception when the participants were aware of the lack of any meaningful correspondence between
the color and the flavor of the solutions that they were tasting. The participants in Zampini et al.’s
study were presented with strawberry-, lime-, or orange-flavored fruit solutions, or with flavorless solutions, and were asked to identify the flavor of each solution. Each of the different flavors was associated equiprobably with each of the different colors (red, green, orange, and colorless). This meant that, for example, the strawberry-flavored solutions were just as likely to be colored red, green, or orange as to be presented as a colorless solution. Therefore, each of the solutions might have been colored either “appropriately” or “inappropriately” (the latter consisting of incongruently colored or colorless solutions). The participants were informed that they would often be tricked, in that the color of a solution would frequently not correspond to the flavor typically associated with that color.
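To make the balanced design described above concrete, the following short Python sketch reconstructs the logic of equiprobable color–flavor pairing; it is an illustration under our own naming, not the authors’ actual stimulus-generation code:

    import itertools
    import random

    # Every flavor is crossed with every color, so the color of a solution
    # carries no information about its flavor (16 equiprobable combinations).
    flavors = ["strawberry", "lime", "orange", "flavorless"]
    colors = ["red", "green", "orange", "colorless"]

    trials = list(itertools.product(flavors, colors))
    random.shuffle(trials)  # presented in a random order

    for flavor, color in trials:
        print(f"{flavor} solution presented as {color}")

Because each flavor appears equally often in each color, any systematic shift in identification responses toward the flavor typically associated with a given color can be attributed to the color cue itself.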
The most important finding to emerge from Zampini et al.’s (2007) study was that color information had a strong impact on flavor identification even when the participants had been informed that the colors of the drinks that they were testing were often misleading. In particular, flavors associated with appropriate colors (e.g., lime flavor–green color; orange flavor–orange color) or presented in colorless solutions were recognized far more accurately than flavors presented with an inappropriate coloring (i.e., lime-flavored drinks that were colored either red or orange; orange-flavored drinks that
were colored either green or red). These results therefore show that inappropriate coloring tends to
lead to impaired flavor discrimination responses, whereas appropriate coloring does not necessarily
improve the accuracy of participants’ flavor discrimination responses (at least when compared to
the flavor discrimination accuracy for the colorless solutions). Interestingly, however, no significant
effect of color was shown for the strawberry-flavored solutions. That is, the inappropriate coloring
of the strawberry-flavored solutions (i.e., when those solutions were colored green or orange) did not
result in a significant reduction in the participants’ ability to recognize the actual strawberry flavor.
One possible explanation for this result is that those flavors that are more strongly associated with
a particular color are more difficult to identify when presented in inappropriately colored solutions
(see Shankar et al. 2009). In fact, Zampini et al. (2007, Experiment 1; see Table 37.1) showed that
the link between color and a specific flavor was stronger for the orange- and green-colored solutions
than for the red-colored solutions. That is, the participants in their study more often matched the
orange color with the flavor of orange and the green color with the flavor of lime. By contrast, the
red color was associated with strawberry, raspberry, and cherry flavors.
Whatever the reason for the difference in the effect of the various colors on participants’ flavor
discrimination responses, it is important to note that Zampini et al.’s (2007) results nevertheless show
that people can still be misled by the inappropriate coloring of a solution even if they know that the
color does not provide a reliable guide to the flavor of the solution. By contrast, the participants in

TABLE 37.1
Flavors Most Frequently Associated with Each Colored Solution in Zampini et al.'s (2007, Experiment 1) Study

Color        Most Associated Flavors
Green        Lime (69%)a
Orange       Orange (91%)a
Yellow       Lemon (89%)a
Blue         Spearmint (86%)a
Gray         Black currant (53%), licorice (40%)a
Red          Strawberry (46%), raspberry (27%), cherry (27%)
Colorless    Flavorless (51%)a

Source: Zampini, M. et al., Food Qual. Prefer., 18, 975–984, 2007. With permission.
a Significant color–flavor association tested using χ2 analysis.

the majority of previous studies in this area (e.g., DuBose et al. 1980; Johnson and Clydesdale 1982;
Morrot et al. 2001; Oram et al. 1995; Philipsen et al. 1995; Roth et al. 1988; Stillman 1993; Zellner
and Durlach 2003) were not explicitly informed that the flavors of the solutions might not be paired
with the appropriately colored solutions. Zampini et al.’s results therefore suggest that the modula-
tory role of visual information on multisensory flavor perception is robust enough to override any
awareness that participants might have (e.g., as informed by the experimenter) concerning the lack
of congruency between the color and the flavor of the solutions that they taste. However, it would be
interesting in future research to investigate whether knowing that there is no meaningful relation-
ship between the color of the solutions and their flavor would modulate (i.e., reduce vs. enhance) the
influence of colors on flavor perception, as compared to the situation in which the participants are
not given any prior information about whether the colors are meaningfully related to the flavors.

37.2.2  Color–Flavor Interactions: Possible Role of Taster Status


Given recent interest in the consequences of individual differences in taster status on flavor percep-
tion (see Drewnowski 2003, for a review), Zampini and his colleagues (2008) wanted to investigate
whether any possible multisensory effects of visual (i.e., the colors of the solutions) and/or gustatory
(i.e., the presence vs. absence of fruit acids) cues on flavor perception might be affected by the taster
status of their participants. Previous research has demonstrated the existence of three subgroups
of tasters (nontasters, medium tasters, and supertasters), varying in their sensitivity to 6-n-propyl­
thiouracil (PROP; e.g., Bartoshuk et al. 1992) as well as to a variety of other tastants (e.g., Prescott
et al. 2001; Reed 2008).* Surprisingly, however, none of the previous studies that have investigated
individual differences in taste perception has as yet looked at the possible influence of taster status
on the visual modulation of (or dominance over) flavor perception.

* The individual differences in taste sensitivity most extensively studied are those for the bitterness intensity of PROP [and phenylthiocarbamide (PTC) in earlier work]. Supertasters, medium tasters, and nontasters rate the bitterness of PROP as very to intensely strong, moderate to strong, and weak, respectively. Research using taste solutions has identified other differences in the three taster groups (see Prescott et al. 2004). Different PROP taster groups reported different taste intensities and liking of other bitter, salty, sweet, and fat-containing substances. The three different PROP taster groups are known to possess corresponding genetic differences. In particular, studies of taste genetics have revealed the existence of multiple bitterness receptor genes (Kim et al. 2004; see also Bufe et al. 2005; Duffy 2007).

In Zampini et al.’s (2008) study, the taster status of each participant was initially assessed using suprathreshold PROP filter paper strips (see Bartoshuk et al. 1994). The participants had to place the PROP filter paper strips on their tongue and then rate the intensity of the sensation of bitterness that they experienced on a Labelled Magnitude Scale (e.g., Green et al. 1993). The participants were
then classified into one of three taster groups (nontasters, medium tasters, and supertasters) based on the following cutoff values: nontasters < 10.90; 10.91 < medium tasters < 61.48; supertasters > 61.49 (see also Essick et al. 2003, for a similar criterion). Zampini et al.’s findings revealed that the modulatory cross-modal effect of visual cues on people’s flavor identification responses was significantly more pronounced in the nontasters than in the medium tasters, who, in turn, were influenced to a greater
extent by visual cues on their flavor identification responses than were the supertasters (see Figure
37.1). In particular, the nontasters (and, to a lesser extent, medium tasters) identified the flavors of
the solutions significantly more accurately when they were colored appropriately than when they
were colored inappropriately (or else were presented as colorless solutions). By contrast, the super-
tasters identified the flavors of the solutions more accurately overall, and their performance was not
affected by the colors of the solutions.
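As a concrete illustration of the classification step just described, the cutoff values can be expressed as a simple rule. The following Python sketch is ours (the function name and the handling of ratings that fall exactly on a boundary are assumptions), not code taken from Zampini et al. (2008):

    def classify_taster(prop_bitterness_rating):
        """Assign PROP taster status from a Labelled Magnitude Scale rating,
        using the cutoffs reported above (nontasters < 10.90;
        10.91 < medium tasters < 61.48; supertasters > 61.49)."""
        if prop_bitterness_rating < 10.90:
            return "nontaster"
        elif prop_bitterness_rating < 61.48:
            return "medium taster"
        else:
            return "supertaster"

    # Example: a bitterness rating of 45 on the Labelled Magnitude Scale.
    print(classify_taster(45.0))  # -> medium taster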

FIGURE 37.1  Percentage of correct flavor identification responses for three groups of participants (nontasters, medium tasters, and supertasters) for black currant, orange, and flavorless solutions, plotted as a function of the color of the solutions (yellow, gray, orange, red, or colorless), in Zampini et al.’s (2008) study of the effects of color cues on multisensory flavor perception in humans. Black columns represent solutions to which fruit acids had been added and white columns represent solutions without fruit acids. Error bars represent between-participants standard errors of the means. (Reprinted from Zampini, M. et al., Food Qual. Prefer., 18, 975–984, 2007. With permission.)

Zampini et al.’s (2008) results are consistent with recent accounts of sensory dominance derived from studies of cross-modal interactions between tactile, visual, and auditory stimuli (see, e.g., Alais and Burr 2004; Ernst and Banks 2002). Ernst and Banks used the maximum likelihood
estimation (MLE) approach to argue that the contribution of a given sensory input to multisensory
perception is determined by weighting the sensor estimates in each sensory modality by the noise
(or variance) present in that modality. It could be argued that in Zampini et al.’s study, the estimates
of the flavors of the fruit-flavored solutions by the nontasters were simply more variable (i.e., their
judgments were less sensitive) than those of either the medium tasters or the supertasters. As a con-
sequence, given the presumably uniform levels of visual discriminability across these three groups
of participants, the MLE account would predict that nontasters should weigh the visual cues more
highly when making their responses than the medium tasters, who in turn should weigh the gusta-
tory cues less highly than the supertasters, just as we observed. It will be an interesting question
for future research to determine whether flavor discrimination responses can be modeled using the
MLE approach. It is important to note here that such an analysis may also be able to reveal whether
there are any underlying attentional biases (to weight information from one sensory modality more
highly than information from another modality) that may be present in the different taster groups
(cf. Battaglia et al. 2003). Moreover, it is interesting to consider at this point that although more than
100 studies examining visual contributions to flavor perception have been published over the past 80
years, Zampini et al.’s study represents the first attempt to take the taster status of participants into
consideration when analyzing their results. The results of Zampini et al.’s study clearly demonstrate
that taster status plays an important role in modulating the cross-modal contribution of visual cues
to flavor perception in fruit-flavored beverages.*
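To make the MLE weighting scheme referred to above concrete, the standard two-cue formulation (after Ernst and Banks 2002) can be written as follows for a visual estimate and a gustatory estimate of the same flavor attribute; the notation is an illustrative sketch of ours rather than a model actually fitted by Zampini et al.:

\[
\hat{S} = w_V \hat{S}_V + w_G \hat{S}_G, \qquad
w_V = \frac{1/\sigma_V^2}{1/\sigma_V^2 + 1/\sigma_G^2}, \qquad
w_G = 1 - w_V,
\]
\[
\sigma_{VG}^2 = \frac{\sigma_V^2 \, \sigma_G^2}{\sigma_V^2 + \sigma_G^2} \le \min(\sigma_V^2, \sigma_G^2).
\]

On this scheme, the noisier gustatory estimates presumed for the nontasters (a larger \(\sigma_G^2\)) translate directly into a larger visual weight \(w_V\), which is consistent with the graded pattern of visual dominance across the three taster groups described above.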

37.2.3  Color–Flavor Interactions: Possible Role of Learned Associations between Colors and Flavors
The influence of color on flavor perception may (and, some might say, must) be due to learned
associations between specific colors and particular flavors. Some of these associations are fairly universal, for example, the association between the color and flavor of ripe fruits (Maga 1974; see also Morrot et al. 2001). By contrast, other color–flavor associations might be context dependent and so might
be different in different parts of the world (see Duncker 1939; Lucchelli et al. 1978; Shankar et al.
2010; Spence 2002; Wheatley 1973). For instance, lemons are typically yellow in Europe, whereas
in Colombia they are mostly dark green. Therefore, a particular color–flavor pairing that seems
congruent to people in a certain part of the world may seem incongruent to those who live elsewhere
(cf. Demattè et al. 2006, 2009).
Seventy years ago, Duncker (1939) considered the role of individual differences in learning such
associations. The participants in his early study were presented with milk chocolate that had been
colored brown or white (a new color for chocolate at the time the study was conducted); both versions had the same flavor. Participants who had never seen white chocolate before reported that the white
chocolate had a different flavor to the brown-colored chocolate. The only participant who had come
across white chocolate before taking part in the study reported that the different colored chocolates
all tasted the same. Although it should be noted that this early study had a number of methodologi-
cal limitations (i.e., only a small number of participants were tested, not to mention the fact that no
statistical analysis of the data was reported), the results nevertheless highlight the possible impor-
tance of prior experience and knowledge in modulating color–flavor interactions.
A follow-up to Duncker’s (1939) seminal study has recently been conducted by Levitan et al. (2008; see also Shankar et al. 2009). The researchers investigated whether people’s prior beliefs concerning specific color–flavor associations would affect their ability to discriminate the flavor of colored sugar-coated chocolate sweets, Smarties (Nestlé), which served as the test stimuli. Smarties are readily available in eight different colors but in only two different flavors: the orange Smarties that are produced for the UK market contain orange-flavored chocolate, whereas all of the other colors contain unadulterated milk chocolate. By contrast, Smarties that have been produced for other markets all contain unadulterated milk chocolate, regardless of their color. Crucially, the participants were sometimes presented with pairs of Smarties that differed in their color but not in their flavor, with pairs that differed in both their color and their flavor, or else with pairs that differed in their flavor but not their color.

* However, it should also be noted that only a relatively small number of participants was tested in each taster category in Zampini et al.’s (2008) study (i.e., four nontasters, five medium tasters, and five supertasters), thus placing a caveat on any generalization from their findings. In future studies, taster status should therefore be assessed with much larger sample sizes.

In a preliminary questionnaire, a number of the participants in Levitan et al.’s (2008) study
stated their belief that certain non-orange (i.e., red and green) Smarties had a distinctive flavor
(which is incorrect), whereas other participants believed (correctly) that all the non-orange Smarties
tasted the same. In the first experiment, the participants were presented with all possible pairings of
orange, red, and green Smarties and were asked to judge whether a given pair of Smarties differed
in flavor by tasting them while either sighted or blindfolded. The results showed that people’s beliefs
concerning specific color–flavor associations for Smarties exerted a significant modulatory effect
on their flavor responses. In the sighted condition, those participants who believed that non-orange Smarties all taste the same were more likely to judge correctly that a red–green pairing of Smarties tasted identical than were those participants who believed that the differently colored Smarties had distinctive flavors; the latter group performed at a level that was significantly below chance (i.e., they reported that the red and green Smarties tasted different on the majority of trials). In other words, those participants who thought that there was a difference between the
flavors of the red and green Smarties did in fact judge the two Smarties as tasting different far more
frequently when compared with participants who did not hold such a belief in the sighted condition.
The results of Levitan et al.’s study are consistent with the results of the other studies presented in
this section in showing that food color can have a powerful cross-modal influence on people’s per-
ception of the flavor of food. However, Levitan et al.’s findings show that people’s beliefs about the
cross-modal color–flavor associations of specific foods can modulate this influence, and that such
cognitive influences can be robust and long-lasting despite extensive experience with the particular
food item concerned.*
In another recent study, Shankar et al. (2009) found that another variety of sugar-coated choco-
late candies (multicolored M&Ms, which are all physically identical in taste) were rated as having a
stronger chocolate flavor when they were labeled as “dark chocolate” than when they were labeled
as “milk chocolate.” Many other studies have found a similar effect of expectations produced by
labeling a stimulus before sampling on flavor perception (see Cardello 1994; Deliza and MacFie
1996; Lee et al. 2006; Yeomans et al. 2008; Zellner et al. 2004, for reviews). Shankar et al. have
also investigated whether the influence of expectations on flavor perception might be driven by color
information (see Levitan et al. 2008). In their study, participants were asked to evaluate how “choco-
latey” they found green- or brown-colored M&Ms. Participants rated the brown M&Ms as being
more “chocolatey” than the green ones. This result suggests that the color brown generates stronger
expectations of “chocolate” than the green color (cf. Duncker 1939). Finally, Shankar et al. studied
whether the expectations generated by color and by label interacted in their effects on mul-
tisensory flavor perception. The participants were again presented with brown- or green-colored
M&Ms and informed about the “chocolate category” (i.e., either “milk chocolate” or “dark choco-
late”) with each color–label combination (green–milk, brown–milk, green–dark, brown–dark)
presented in a randomized order. Brown-colored M&Ms were given a higher chocolatey rating
than green-colored M&Ms. Similarly, those labeled as “dark chocolate” were given higher ratings
than those labeled “milk chocolate.” However, no interaction between these colors and labels was
found, thus suggesting that these two factors exerted independent effects, implying that two distinct
associations were being retrieved from memory and then utilized (e.g., the color–flavor association and the label–flavor association). Shankar et al.’s findings therefore provide the first evidence that color can influence the flavor of a product whose flavor identity cannot be predicted by its color. In other words, the colors of the coatings of the M&Ms are independent of their taste (which is always chocolate).

* It is interesting to note that the participants in Levitan et al.’s (2008) study were able to maintain such inappropriate beliefs about differently colored Smarties tasting different, despite the objective evidence that people perceive no difference in their flavor, and the fact that they have presumably had extensive previous exposure to the fact that these colors provide no useful information in this foodstuff.

One final issue that remains unresolved here concerns the extent to which the influence of color
on flavor discrimination reflects a perceptual versus a more decisional effect, or whether instead
both perceptual and decisional factors may contribute to participants’ performance (see Spence et
al., submitted; and Zampini et al. 2007, on this point). If it is a purely perceptual effect, the par-
ticipant’s gustatory experience should be changed by viewing the color, that is, knowledge of the
color might improve the sensitivity of participants’ flavor discrimination responses by reducing the
variability of the multisensory flavor signal. Alternatively, however, according to the decisional
account, people should always have given the same gustatory response for a given color–flavor pair-
ing regardless of whether they were sighted or blindfolded; what may have changed instead are their decisional
criteria. In Levitan et al.’s (2008) study, the participants who were uncertain of their responses for
a given pair of Smarties might have biased their choice toward making different responses because
they could see that they had a different color. By contrast, those participants who already knew that
red and green Smarties were normally identical in taste might have been biased toward making a
same response. In the case of olfaction, Engen (1972) has already shown results consistent with the
claim that color can influence odor perception as a result of its effect on decisional mechanisms, but
this does not, of course, necessarily rule out a role for perceptual interactions as well, at least when
tested under the appropriate experimental conditions (see Zellner and Kautz 1990).
However, it is possible to hypothesize that a person’s beliefs about particular foods tasting dif-
ferent if they have a different color may paradoxically result in them actually tasting different.
Analogously, de Craen et al. (1996) discussed a number of findings showing that color cues modu-
late the effectiveness of medicines as well as placebo pills. Although the mechanism behind placebo
effects such as these is not as yet well understood, the effects themselves are nevertheless robust
(e.g., for a recent review, see Koshi and Short 2007). What is more, just as in Levitan et al.’s (2008)
Smarties experiment, there is at least some evidence that different people may hold different beliefs
about differently colored pills, and that these beliefs can carry over into the actual effects that the
differently colored placebo pills are shown to have (Lucchelli et al. 1978). Therefore, if people’s
beliefs about color and medication can affect their physical state (e.g., resulting in a genuine change
in their tolerance for pain, say, or in their ability to sleep), it would seem conceivable that a person’s
belief that a certain colored Smartie tasted distinctive (from a Smartie of a different color) might
give rise to the effect of it, paradoxically, actually tasting different to that person, despite there being
no physical difference in flavor.

37.2.4  Color–Flavor Interactions: Neural Correlates


The results discussed so far on the potential influences of visual cues on flavor perception are con-
sistent with the growing body of neurophysiological and electrophysiological data demonstrating
the intimate link between visual, olfactory, and gustatory flavor information at a neuronal level
(Osterbauer et al. 2005; Rolls 2004; Rolls and Baylis 1994; Small 2004; Small and Prescott 2005;
Verhagen and Engelen 2006). For instance, Osterbauer and his colleagues have used functional
neuroimaging to investigate how activity in the human orbitofrontal cortex (OFC) can be modu-
lated by the presentation of particular combinations of odors and colors. The participants in this
study had to smell different odors including lemon, strawberry, spearmint, and caramel that were
presented by means of a computer-controlled olfactometer. The odors were presented in isolation or
else together with a color. The participants wore prism glasses to see full-screen colors presented
onscreen outside the magnet bore. On some occasions the odor matched the color, such as when
the smell of lemon was presented with the color yellow, whereas at other times the odor and color
did not match, such as when spearmint odor was presented with the color brown. Osterbauer et al.’s
findings revealed that the presentation of appropriate odor–color combinations (e.g., odor of straw-
berry matched with red color) increased the brain activity seen in the OFC when compared with the
brain activation seen in the odor-alone conditions. By contrast, there was a suppression of neural
activity in the same area when inappropriate color–odor combinations were presented (e.g., when
the odor of strawberry was presented with a turquoise patch of color on the monitor; see also De
Araujo et al. 2003). Taken together, these results would appear to suggest that presenting an appro-
priate color–odor association may actually lead to increased neural activity in brain areas respon-
sible for processing olfactory stimuli, whereas presenting inappropriate color–odor associations can
suppress brain activity below that observed to the odors alone. The positive correlation between the
perceived congruency of color–odor pairs and the changes in the pattern of brain activation found
in Osterbauer et al.’s study (see also Skrandies and Reuther 2008), therefore, provides a neurophysi-
ological basis for the perceptual changes elicited by changing the color of food.

37.2.5  Interim Summary


Taken together, the results reviewed thus far demonstrate that visual information can have a dra-
matic impact on flavor perception and evaluation in humans. In particular, most of the studies have
shown that it is possible to impair flavor discrimination performance by coloring fruit-flavored
solutions inappropriately. The effect of color cues on human flavor perception can be explained by
the fact that visual information sets up an expectation regarding the flavor that is about to be experi-
enced. This expectation may originate from previous experiences with similar food stimuli that have contributed to building up such associations between the visual aspects and the experienced flavor
(see Shankar et al. 2010; Yeomans et al. 2008). Stevenson and his colleagues (e.g., Stevenson and
Boakes 2004; Stevenson et al. 1998) have suggested that any interaction taking place between gus-
tation and olfaction might be explained in terms of associative learning processes. Their findings
show that we are able to create strong links between odors and tastes that are repeatedly presented
together. It is possible to hypothesize that the strong correspondences between colors and flavors
may rely on a similar mechanism. The same foodstuffs are usually experienced first through their
visual appearance and then through their flavor. It is possible that, over the course of our lives, we learn to build up strong associations between the visual and flavor properties of foods that are systematically combined. Therefore, people who are presented first with the visual aspects of foods and drinks generate a series of expectations about the flavor that those foods and drinks should have. White and Prescott (2007) have put forward a similar explanation for their findings regarding the influence of odors on taste identification when the odors were presented in advance of the tastes.
In the previous section, a study was discussed in which participants’ beliefs on the color–flavor
association based on their previous experiences significantly modulated their responses (see Levitan
et al. 2008). In particular, participants who expected a difference between food products that were
colored differently were more likely to report a difference than those without any such prior belief.
Therefore, flavor perception might be considered as constituting a multisensory experience with
somewhat different rules than those regulating other multisensory interactions. Research suggests
that spatial coincidence and temporal synchrony are two of the key factors determining whether
multisensory integration will take place (at the single cell level) to give rise to the rich multisensory
perceptual objects that fill our everyday lives (for reviews, see Calvert et al. 2004). Given that the
cross-modal influence of visual cues on flavor perception occurs long before we taste foods and occurs
in different regions of space (i.e., food is only ever seen outside the oral cavity but tasted within it;
see Hutchings 1977), it would seem reasonable to suggest that expectancy plays a greater role than
the spatial and temporal rules (see Shankar et al. 2010). It might, for example, be less likely that
visual–flavor interactions would be influenced by the spatial and temporal rules of multisensory
integration (which might better help to explain the integration of the auditory, visual, and tactile, that is, the spatial, senses; the same rules might also explain the integration of olfactory/gustatory and oral–somatosen-
sory cues in the basic flavor percept). Therefore, we believe that the multisensory study of flavor
perception is particularly interesting for multisensory researchers precisely because the rules of
integration, and cross-modal influence, are likely to be somewhat different.
In the previous sections, we also discussed how individual differences can affect the nature
of the cross-modal visual–flavor interactions that are observed. In particular, visual influences on
multisensory flavor perception can be significantly modulated as a function of the taster status of
the participant. Visual dominance effects in multisensory flavor perception are more pronounced
in those participants who are less sensitive to gustatory cues (i.e., nontasters) than in supertasters,
who appear to enjoy the benefit of enhanced gustatory resolution. Therefore, taster status, although
often neglected in studies investigating color–flavor interactions, should certainly be considered
more carefully in any future research in this area. Finally, we have reviewed the role of expectancy
resulting from visual information in overall food perception.

37.3 ROLE OF AUDITORY CUES IN THE MULTISENSORY EXPERIENCE OF FOODSTUFFS
Most of the visual cues typically occur before our consumption of food and drink, whereas auditory
cues are typically only available at the moment of consumption (or mastication). Therefore, one
might expect the role of expectancy to be reduced when looking at the effect of sounds on the per-
ception of food. Certainly, visual and auditory cues provide information at distinct stages of eating.
In the second part of this chapter, we therefore briefly discuss the possible role that auditory cues
may play in the multisensory perception of foodstuffs.
Several studies have demonstrated the influential role that auditory information plays in our
perception of food (for a review, see Spence and Zampini 2006). For example, it has been shown
that people’s ratings of the pleasantness of many foods can be strongly influenced by the sounds
produced when people bite into them (e.g., Drake 1970; Vickers 1981, 1983; Vickers and Bourne
1976). Food sounds have a particularly noticeable influence on people’s perception of crispness that
is closely associated with pleasantness, especially in crunchy foods (i.e., crisps; e.g., Vickers 1983).
Taken together, these results therefore suggest that the perception of the crispness of (especially)
crunchy foods (e.g., crisps, biscuits, cereals, vegetables) is largely characterized by tactile, mechani-
cal, kinesthetic, and auditory properties (e.g., Vickers 1987).
Many foodstuffs produce particular sounds when we eat them. For instance, Drake (1963)
reported that the sounds produced by chewing or crushing a variety of different foodstuffs varied
in their amplitude, frequency, and temporal characteristics. Analysis of the auditory characteristics
of different foods has shown that crispy foods are typically higher in pitch than crunchy foods
(Vickers 1979). However, the role of auditory cues in the evaluation of food qualities (e.g., crisp-
ness) has been investigated by using different kinds of foodstuffs that might have different levels
of freshness (e.g., Christensen and Vickers 1981; Drake 1963; Seymour and Hamann 1988; Vickers
1984; Vickers and Bourne 1976; Vickers and Wasserman 1979). Those studies also clearly show
that despite the informational richness contained in the auditory feedback provided by biting into
and/or chewing food, people are typically unaware of the effect that such sounds have on their over-
all multi­sensory perception or evaluation of particular stimuli. In particular, Zampini and Spence
(2004, 2005) have shown that people’s perception and evaluation of different foodstuffs (e.g., potato
chips and sparkling water) can be modulated by changing the overall sound level or just the high-
frequency components (see also Chen et al. 2005; Masuda et al. 2008; Varela et al. 2006).

37.3.1  Effect of Sound Manipulation on the Perception of Crisps


Zampini and Spence (2004) studied the multisensory interactions between auditory, oral, tactile, mechanical, kinesthetic, and visual information in the rating of the perceived “crispness” and “freshness” of potato chips (or crisps), investigating whether the evaluation of crispness and freshness would be affected by modifying only the sounds produced during the biting action. In fact, the Pringles potato chips used in their experiment all have the same visual (i.e., shape) and oral–tactile (i.e., texture) characteristics. The participants in this study had to make a single
bite with their front teeth into a large number (180) of potato chips (otherwise known as crisps in
the United Kingdom) with their mouth placed directly above the microphone and then to spit the
crisp out (without swallowing) into a bowl placed on their lap. They then rated the crispness and
freshness of each potato chip using a computer-based visual analog scale. The participants could hear either the veridical sounds that they made when biting into a crisp, without any frequency adjustment, or else the biting sounds with the frequencies in the range of 2–20 kHz amplified or attenuated by 12 dB.
Furthermore, for each frequency manipulation, there was an attenuation of the overall volume of
0 (i.e., no attenuation), 20, or 40 dB. The results showed that the perception of both crispness and
freshness was affected by the modulation of the auditory cues produced during the biting action. In
particular, the potato chips were perceived as being both crisper and fresher when either the overall
sound level was increased, or when just the high frequency sounds (in the range of 2–20 kHz) were
selectively amplified (see Figure 37.2).
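To make the nature of this manipulation concrete, the following minimal offline sketch (in Python, using NumPy) applies the two factors described above to a recorded bite sound: a gain of ±12 dB on the spectral components above 2 kHz and an overall attenuation of 0, 20, or 40 dB. It is an illustrative reconstruction under our own naming, not the real-time hardware and software setup actually used in the study, and it treats everything above 2 kHz as the manipulated band (at a 44.1 kHz sampling rate this effectively corresponds to the 2–20 kHz range):

    import numpy as np

    def manipulate_bite_sound(signal, fs, high_gain_db=0.0, overall_atten_db=0.0,
                              cutoff_hz=2000.0):
        """Scale the spectral components above cutoff_hz by high_gain_db and
        then attenuate the whole signal by overall_atten_db."""
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
        spectrum[freqs >= cutoff_hz] *= 10.0 ** (high_gain_db / 20.0)  # high-frequency gain
        shaped = np.fft.irfft(spectrum, n=len(signal))
        return shaped * 10.0 ** (-overall_atten_db / 20.0)  # overall attenuation

    # Example: amplify the high frequencies by 12 dB, with no overall attenuation.
    fs = 44100
    t = np.arange(fs) / fs
    bite = np.random.randn(fs) * np.exp(-5.0 * t)  # stand-in for a recorded bite sound
    crisper_sounding = manipulate_bite_sound(bite, fs, high_gain_db=12.0)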

FIGURE 37.2  (a) Schematic view of the apparatus and participant in Zampini and Spence’s (2004) study. The door of the experimental booth was closed during the experiment and the response scale was viewed through the window in the left-hand side wall of the booth. Mean responses on the soft–crisp (b) and fresh–stale (c) response scales are shown for the three overall attenuation levels (0, −20, or −40 dB) against the three frequency manipulations (high frequencies attenuated, veridical auditory feedback, or high frequencies amplified). Error bars represent between-participants standard errors of the means. (Reprinted from Zampini, M., and Spence, C., J. of Sens. Stud., 19, 347–363, 2004. With permission.)

Given that the crisps in Zampini and Spence’s (2004) study were very similar to each other in
terms of their visual, tactile, and flavor attributes, the only perceptual aspect that varied during the
task was the sound (which, of course, also contributes to flavor). Therefore, participants may have
“felt” that the crisps had a different texture guided only by the sound, since the other senses always
received the same information. Additional evidence highlighting the powerful effect of auditory
cues on the overall perception of the crisps was that the majority of the participants (15 out of 20)
stated anecdotally on debriefing after the experiment that they believed the crisps to have been
selected from different packages. Additionally, the majority of the participants also reported that
the auditory information had been more salient than the oral tactile information, and this may also
help to account for the effects reported by Zampini and Spence. In fact, one of the fundamental laws
of multisensory integration that has emerged over the past few decades states that the sense that
provides the more reliable (or salient) information is the one that dominates, or modulates, percep-
tion in another sensory modality (e.g., Ernst and Banks 2002; Shimojo and Shams 2001; Welch and
Warren 1980). Alternatively, the sensory dominance effect might be explained by the suggestion that the human brain relies on the most attended sense (Spence and Shankar 2010). The role of attention in the
multisensory influence of auditory information on food perception is consistent with the results of a
study in which the participants had to try and detect weak solutions of sucrose or citric acid in a mix-
ture (Marks and Wheeler 1998). Participants were more accurate at detecting the tastant they were
attending to than the tastant they were not attending to (see also Ashkenazi and Marks 2004).
Marks and Wheeler suggested that our ability to detect a particular sensory quality (e.g., tastant or
flavor) may be modulated by selective attention toward (or away from) that quality. Therefore, in a
similar vein, one might suggest that increasing the overall loudness of the sounds produced when biting into crisps changes participants’ crispness perception by making the auditory information more pronounced than it would have been had crispness been judged solely from the texture in the mouth or from normal-level auditory cues. That is, partici-
pants’ attention would be directed toward this feature of the food by externally changing the relative
weighting of the sensory cues that signify this. Louder sounds are also presumably more likely to
capture a person’s attention than quieter sounds. However, at present, it is unclear how many of the
findings taken to support an attentional account of any sensory dominance effect can, in fact, be
better accounted for in terms of sensory estimates of stimulus attributes simply being more accu-
rate (i.e., less variable) in the dominant modality than those in the other modalities (e.g., Alais and
Burr 2004; Battaglia et al. 2003; Ernst and Banks 2002). Finally, it is important to note that these
explanations are not mutually exclusive. For example, Zampini and Spence’s (2004) results can be
accounted for either in terms of attentional capture or in terms of multisensory integration.

37.3.2  Effect of Auditory Cues on the Perception of Sparkling Water


In a follow-up study, Zampini and Spence (2005) studied the possible influence of auditory cues
in the perception and evaluation of the carbonation of water. Our perception of the carbonation of a beverage often relies on the integration of a variety of visual, oral–somatosensory, nociceptive, auditory, and even tactile cues that are provided by the bubbles (e.g.,
Chandrashekar et al. 2009; Vickers 1991; Yau and McDaniel 1992). Zampini and Spence (2005)
examined the relationship between the auditory cues produced by sparkling water and its perceived
level of carbonation both when carbonated water samples were assessed in a cup and when they
were assessed in the mouth. The carbonation sounds were modified by adopting the same experimen-
tal paradigm developed in their previous research on the perception of potato chips (Zampini and
Spence 2004). The sparkling water samples held in participants’ hands were judged to be more
carbonated when the overall sound level was increased and/or when the high-frequency components
(2–20 kHz) of the water sound were amplified. Interestingly, however, a subsequent experiment
failed to demonstrate any effect of these auditory manipulations on the perception of carbonation
and oral irritation from water samples that were held in the mouth. Taken together, these results
therefore show that auditory cues can modulate the perception of the carbonation of a water sample
held in the hand, but cannot modulate people’s perception of a water sample held in the mouth. This
might be because the perception of carbonation in the mouth is more dependent on oral–somatosen-
sory and/or nociceptive inputs than on auditory cues or, alternatively, because it is more important that
we correctly perceive stimuli once they have entered the oral cavity (see Koza et al. 2005). Once
again, these findings are consistent with the hypothesis that the modality dominating multisensory
perception (when the senses are put into conflict) is the most accurate and/or informative sense (e.g.,
see Ernst and Banks 2002).

37.4  CONCLUSIONS
The past few years have seen a rapid growth of interest in the multisensory aspects of food perception
(see Auvray and Spence 2008; Delwiche 2004; Prescott 1999, 2004; Stevenson 2009; Stevenson and
Tomiczek 2007; Stillman 2002; Verhagen and Engelen 2006, for reviews). The research reviewed here
highlights the profound effect that visual (i.e., color of food) and auditory cues (i.e., variations in the
overall sound level and variations in the spectral distribution of energy) can have on people’s percep-
tion of foodstuffs (such as potato chips and beverages). When people are asked to identify the flavors of foods and beverages, their responses can be influenced by the colors of those foods and beverages. In
particular, the identification of specific flavors has often been shown to be less accurate when they are
paired with an inappropriate color (e.g., DuBose et al. 1980; Zampini et al. 2007, 2008). Our percep-
tion of the flavor and physical characteristics of food and beverages can also be modulated by auditory
cues. For instance, it is possible to change the perceived crispness of crisps or the perceived fizziness
of a carbonated beverage (such as sparkling water) simply by modifying the sounds produced when
eating the crisps or produced by the bubbles of the sparkling water (Zampini and Spence 2004, 2005).
It is important to note that visual and auditory information are available at different stages of eat-
ing. Typically, visual (not to mention orthonasal olfactory and, on occasion, auditory) cues are avail-
able long before our ingestion of food (and before any other sensory cues associated with the food
are available). Therefore, visual cues (e.g., food colors) might be expected to create an expectancy
concerning the possible flavor of the food to be eaten (Hutchings 1977; Shankar et al. 2010). By
contrast, any role of expectancy might be reduced when considering the potential influence of audi-
tory cues on the perception of food. In fact, the sounds produced when biting into or chewing food
are available at the moment of consumption. Therefore, it is possible to hypothesize that the role of
multisensory integration is somewhat different when looking at the role of visual and auditory cues
in overall food perception. Given that visual cues are typically available long before a food is
consumed and outside the mouth, it is quite unlikely that visual–flavor interactions are modulated
by the spatial and temporal rules (i.e., greater multisensory interaction with spatial and temporal
coincidence between the stimuli; see Calvert et al. 2004, for a review). Therefore, visual influences
on multisensory flavor perception are better explained by looking at the role of expectancy than at
the role of the spatial and temporal rules, which might help us to understand the role of auditory
cues on food perception instead. However, some sounds might produce an expectancy effect as well.
For example, the sound of a food package being opened will normally precede the consumption of a
particular packaged food item (think only of the rattling of the crisps packet). Several researchers
have demonstrated that people’s expectations regarding what they are about to consume can also
have a significant effect on their perception of pleasantness of the food or drink itself (see Spence
et al., in press, for a recent review). It is also important to note that the visual and auditory con-
tribution to multisensory flavor perception typically takes place without people necessarily being
consciously aware that what they are seeing or hearing is influencing their overall flavor experience
(e.g., Zampini and Spence 2004, 2005). In Zampini et al.’s more recent research (e.g., Zampini et al. 2007,
2008), the participants were influenced by the inappropriate colors of the beverages that they were
evaluating even though they had been informed beforehand that there might be a lack of congruency
between the colors that they saw and the flavors that they were tasting. This shows, therefore, that
the effect was powerful enough to override participants’ awareness that color information might
mislead their identification of the flavors. The potential role of the sounds made when eating food
on food perception is often ignored by people. For example, most of the participants in Zampini and Spence’s (2004) study thought that the crisps were actually different (i.e., selected from different packages or having different levels of freshness and, therefore, of crispness). They seemed to be unaware of the fact that the experimenters had changed only the sounds produced when biting into the crisps and that the crisps themselves were not different. Nevertheless, the studies reported here are consistent with a growing number of neurophysiological and electrophysiological studies demonstrating close visual–flavor (Osterbauer et al. 2005; Small 2004; Small and Prescott 2005; Verhagen and Engelen 2006) and audiotactile
(Gobbelé et al. 2003; Kitagawa and Spence 2006; Levänen et al. 1998; Schroeder et al. 2001; von
Békésy 1957)* interactions at the neuronal level. Results such as these therefore help to emphasize
the limitations that may be associated with relying solely on introspection and verbal report (as is
often the case in commercial consumer testing settings) when trying to measure people’s perception
and evaluation of foodstuffs.

* However, it is important to note that, to the best of our knowledge, no neuroimaging studies have as yet been conducted to investigate the role of auditory cues on multisensory food perception (cf. Spence and Zampini 2006; Verhagen and Engelen 2006).

REFERENCES
Alais, D., and D. Burr. 2004. The ventriloquist effect results from near-optimal bimodal integration. Current
Biology 14: 257–262.
Alley, R. L., and T. R. Alley. 1998. The influence of physical state and color on perceived sweetness. Journal
of Psychology: Interdisciplinary and Applied 132: 561–568.
Ashkenazi, A., and L. E. Marks. 2004. Effect of endogenous attention on detection of weak gustatory and olfac-
tory flavors. Perception & Psychophysics 66: 596–608.
Auvray, M., and C. Spence. 2008. The multisensory perception of flavor. Consciousness & Cognition 17:
1016–1031.
Bartoshuk, L. M., V. B. Duffy, and I. J. Miller. 1994. PTC/PROP tasting: Anatomy, psychophysics, and sex
effects. Physiology & Behavior 56: 1165–1171.
Bartoshuk, L. M., K. Fast, T. A. Karrer, S. Marino, R. A. Price, and D. A. Reed. 1992. PROP supertasters and
the perception of sweetness and bitterness. Chemical Senses 17: 594.
Battaglia, P. W., R. A. Jacobs, and R. N. Aslin. 2003. Bayesian integration of visual and auditory signals for
spatial localization. Journal of the Optical Society of America A 20: 1391–1397.
Bertelson, P., and G. Aschersleben. 1998. Automatic visual bias of perceived auditory location. Psychonomic
Bulletin & Review 5: 482–489.
Blackwell, L. 1995. Visual clues and their effects on odour assessment. Nutrition and Food Science 5: 24–28.
Bourne, M. C. 1982. Food texture and viscosity. New York: Academic Press.
Bufe, B., P. A. Breslin, C. Kuhn et al. 2005. The molecular basis of individual differences in phenylthiocarbam-
ide and propylthiouracil bitterness perception. Current Biology 15: 322–327.
Bult, J. H. F., R. A. de Wijk, and T. Hummel. 2007. Investigations on multimodal sensory integration: Texture,
taste, and ortho- and retronasal olfactory stimuli in concert. Neuroscience Letters 411: 6–10.
Cain, W. S. 1977. History of research on smell. In Handbook of perception: Vol. 6a: Tasting and smelling, ed.
E. C. Carterette and M. P. Friedman, 197–229. New York: Academic Press.
Calvert, G., C. Spence, and B. E. Stein. 2004. The handbook of multisensory processing. Cambridge, MA: MIT
Press.
Cardello, A. V. 1994. Consumer expectations and their role in food acceptance. In Measurement of food
preferences, ed. H. J. H. MacFie, and D. M. H. Thomson, 253–297. London: Blackie Academic &
Professional.
Chan, M. M., and C. Kane-Martinelli. 1997. The effect of color on perceived flavour intensity and acceptance
of foods by young adults and elderly adults. Journal of the American Dietetic Association 97: 657–659.
Chandrashekar, J., D. Yarmolinsky, L. von Buchholtz et al. 2009. The taste of carbonation. Science 326:
443–445.

* However, it is important to note that, to the best of our knowledge, no neuroimaging studies have as yet been conducted
to investigate the role of auditory cues in multisensory food perception (cf. Spence and Zampini 2006; Verhagen and
Engelen 2006).

Chen, H., C. Karlsson, and M. Povey. 2005. Acoustic envelope detector for crispness assessment of biscuits.
Journal of Texture Studies 36: 139–156.
Christensen, C. M. 1980a. Effects of taste quality and intensity on oral perception of viscosity. Perception &
Psychophysics 28: 315–320.
Christensen, C. M. 1980b. Effects of solution viscosity on perceived saltiness and sweetness. Perception &
Psychophysics 28: 347–353.
Christensen, C. M., and Z. M. Vickers. 1981. Relationship of chewing sounds to judgments of food crispness.
Journal of Food Science 46: 574–578.
Clydesdale, F. M. 1993. Color as a factor in food choice. Critical Reviews in Food Science and Nutrition 33:
83–101.
Dalton, P., N. Doolittle, H. Nagata, and P. A. S. Breslin. 2000. The merging of the senses: Integration of sub-
threshold taste and smell. Nature Neuroscience 3: 431–432.
Davis, R. 1981. The role of nonolfactory context cues in odor identification. Perception & Psychophysics 30:
83–89.
De Araujo, I. E. T., E. T. Rolls, M. L. Kringelbach, F. McGlone, and N. Phillips. 2003. Taste–olfactory con-
vergence, and the representation of the pleasantness of flavour, in the human brain. European Journal of
Neuroscience 18: 2059–2068.
de Craen, A. J. M., P. J. Roos, A. L. de Vries, and J. Kleijnen. 1996. Effect of colour of drugs: Systematic review
of perceived effect of drugs and their effectiveness. British Medical Journal 313: 1624–1626.
Deliza, R., and H. MacFie. 1996. The generation of sensory expectation by external cues and its effect on sen-
sory perception and hedonic ratings: A review. Journal of Sensory Studies 11: 103–128.
Delwiche, J. 2004. The impact of perceptual interactions on perceived flavour. Food Quality and Preference
15: 137–146.
Demattè, M. L., D. Sanabria, and C. Spence. 2006. Crossmodal associations and interactions between olfaction
and vision. Chemical Senses 31: E50–E51.
Demattè, M. L., D. Sanabria, and C. Spence. 2009. Olfactory identification: When vision matters? Chemical
Senses 34: 103–109.
Drake, B. K. 1963. Food crunching sounds. An introductory study. Journal of Food Science 28: 233–241.
Drake, B. K. 1970. Relationships of sounds and other vibrations to food acceptability. Proceedings of the 3rd
International Congress of Food Science and Technology, pp. 437–445. August 9–14, Washington, DC.
Drewnowski, A. 2003. Genetics of human taste perception. In Human olfaction and gustation, 2nd ed., ed.
R. L. Doty, 847–860. New York: Marcel Dekker, Inc.
DuBose, C. N., A. V. Cardello, and O. Maller. 1980. Effects of colourants and flavourants on identification,
perceived flavour intensity, and hedonic quality of fruit-flavoured beverages and cake. Journal of Food
Science 45: 1393–1399, 1415.
Duffy, V. B. 2007. Variation in oral sensation: Implications for diet and health. Current Opinion in
Gastroenterology 23: 171–177.
Duncker, K. 1939. The influence of past experience upon perceptual properties. American Journal of Psychology
52: 255–265.
Engen, T. 1972. The effect of expectation on judgments of odour. Acta Psychologica 36: 450–458.
Ernst, M. O., and M. S. Banks. 2002. Humans integrate visual and haptic information in a statistically optimal
fashion. Nature 415: 429–433.
Essick, G. K., A. Chopra, S. Guest, and F. McGlone. 2003. Lingual tactile acuity, taste perception, and the den-
sity and diameter of fungiform papillae in female subjects. Physiology & Behavior 80: 289–302.
Frank, R. A., K. Ducheny, and S. J. S. Mize. 1989. Strawberry odor, but not red color, enhances the sweetness
of sucrose solutions. Chemical Senses 14: 371–377.
Garber Jr., L. L., E. M. Hyatt, and Ü. Ö. Boya. 2008. The mediating effects of the appearance of nondu-
rable consumer goods and their packaging on consumer behavior. In Product experience, ed. H. N. J.
Schifferstein and P. Hekkert, 581–602. London: Elsevier.
Garber Jr., L. L., E. M. Hyatt, and R. G. Starr Jr. 2000. The effects of food colour on perceived flavour. Journal
of Marketing Theory and Practice 8: 59–72.
Garber Jr., L. L., E. M. Hyatt, and R. G. Starr Jr. 2001. Placing food color experimentation into a valid con-
sumer context. Journal of Food Products Marketing 7: 3–24.
Gifford, S. R., and F. M. Clydesdale. 1986. The psychophysical relationship between colour and sodium chlo-
ride concentrations in model systems. Journal of Food Protection 49: 977–982.
Gifford, S. R., F. M. Clydesdale, and R. A. Damon Jr. 1987. The psychophysical relationship between colour
and salt concentration in chicken flavoured broths. Journal of Sensory Studies 2: 137–147.

Gobbelé, R., M. Schürmann, N. Forss, K. Juottonen, H. Buchner, and R. Hari. 2003. Activation of the human
posterior parietal and temporoparietal cortices during audiotactile interaction. Neuroimage 20: 503–511.
Green, B. G., G. S. Shaffer, and M. M. Gilmore. 1993. Derivation and evaluation of a semantic scale of oral
sensation with apparent ratio properties. Chemical Senses 18: 683–702.
Hollowood, T. A., R. S. T. Linforth, and A. J. Taylor. 2002. The effect of viscosity on the perception of flavour.
Chemical Senses 27: 583–591.
Hutchings, J. B. 1977. The importance of visual appearance of foods to the food processor and the consumer.
In Sensory properties of foods, ed. G. G. Birch, J. G. Brennan, and K. J. Parker, 45–57. London: Applied
Science Publishers.
Johnson, J. L., and F. M. Clydesdale. 1982. Perceived sweetness and redness in coloured sucrose solutions.
Journal of Food Science 47: 747–752.
Johnson, J. L., E. Dzendolet, and F. M. Clydesdale. 1983. Psychophysical relationships between sweetness and
redness in strawberry-drinks. Journal of Food Protection 46: 21–25.
Kim, U. K., P. A. Breslin, D. Reed, and D. Drayna. 2004. Genetics of human taste perception. Journal of Dental
Research 83: 448–453.
Kitagawa, N., and C. Spence. 2006. Audiotactile multisensory interactions in information processing. Japanese
Psychological Research 48: 158–173.
Koshi, E. B., and C. A. Short. 2007. Placebo theory and its implications for research and clinical practice: A
review of the recent literature. Pain Practice 7: 4–20.
Koza, B., A. Cilmi, M. Dolese, and D. Zellner. 2005. Color enhances orthonasal olfactory intensity and reduces
retronasal olfactory intensity. Chemical Senses 30: 643–649.
Lavin, J., and H. T. Lawless. 1998. Effects of colour and odor on judgments of sweetness among children and
adults. Food Quality and Preference 9: 283–289.
Lawless, H., P. Rozin, and J. Shenker. 1985. Effects of oral capsaicin on gustatory, olfactory and irritant sensa-
tions and flavor identification in humans who regularly or rarely consume chili pepper. Chemical Senses
10: 579–589.
Lee, L., S. Frederick, and D. Ariely. 2006. Try it, you’ll like it. Psychological Science 17: 1054–1058.
Levänen, S., V. Jousmäki, and R. Hari. 1998. Vibration-induced auditory-cortex activation in a congenitally
deaf adult. Current Biology 8: 869–872.
Levitan, C. A., M. Zampini, R. Li, and C. Spence. 2008. Assessing the role of colour cues and people’s beliefs
about colour–flavour associations on the discrimination of the flavour of sugar-coated chocolates.
Chemical Senses 33: 415–423.
Lucchelli, P. E., A. D. Cattaneo, and J. Zattoni. 1978. Effect of capsule colour and order of administration of
hypnotic treatments. European Journal of Clinical Pharmacology 13: 153–155.
Maga, J. A. 1974. Influence of colour on taste thresholds. Chemical Senses and Flavour 1: 115–119.
Marks, L. E., and M. E. Wheeler. 1998. Attention and the detectability of weak taste stimuli. Chemical Senses
23: 19–29.
Masuda, M., Y. Yamaguchi, K. Arai, and K. Okajima. 2008. Effect of auditory information on food recognition.
IEICE Technical Report 108(356): 123–126.
Moir, H. C. 1936. Some observations on the appreciation of flavour in food stuffs. Chemistry and Industry 55:
145–148.
Morrot, G., F. Brochet, and D. Dubourdieu. 2001. The colour of odors. Brain and Language 79: 309–320.
Murphy, C., and W. S. Cain. 1980. Taste and olfaction: Independence vs. interaction. Physiology & Behavior
24: 601–605.
Oram, N., D. G. Laing, I. Hutchinson et al. 1995. The influence of flavour and colour on drink identification by
children and adults. Developmental Psychobiology 28: 239–246.
Osterbauer, R. A., P. M. Matthews, M. Jenkinson, C. F. Beckmann, P. C. Hansen, and G. A. Calvert. 2005. Color
of scents: Chromatic stimuli modulate odor responses in the human brain. Journal of Neurophysiology
93: 3434–3441.
Pangborn, R. M. 1960. Influence of colour on the discrimination of sweetness. American Journal of Psychology
73: 229–238.
Pangborn, R. M., and B. Hansen. 1963. The influence of colour on discrimination of sweetness and sourness in
pear-nectar. American Journal of Psychology 76: 315–317.
Philipsen, D. H., F. M. Clydesdale, R. W. Griffin, and P. Stern. 1995. Consumer age affects response to sensory
characteristics of a cherry flavoured beverage. Journal of Food Science 60: 364–368.
Prescott, J. 1999. Flavour as a psychological construct: Implications for perceiving and measuring the sensory
qualities of foods. Food Quality and Preference 10: 349–356.

Prescott, J. 2004. Psychological processes in flavour perception. In Flavour perception, ed. A. J. Taylor and D.
Roberts, 256–278. London: Blackwell Publishing.
Prescott, J., N. Ripandelli, and I. Wakeling. 2001. Binary taste mixture interactions in PROP non-tasters,
medium-tasters and super-tasters. Chemical Senses 26: 993–1003.
Prescott, J., J. Soo, H. Campbell, and C. Roberts. 2004. Response of PROP taster groups to variations in sen-
sory qualities within foods and beverages. Chemical Senses 26: 993–1003.
Reed, D. R. 2008. Birth of a new breed of supertaster. Chemical Senses 33: 489–491.
Rolls, E. T. 2004. Smell, taste, texture, and temperature multimodal representations in the brain, and their rel-
evance to the control of appetite. Nutrition Reviews 62: S193–S204.
Rolls, E. T., and L. L. Baylis. 1994. Gustatory, olfactory, and visual convergence within the primate orbitofron-
tal cortex. Journal of Neuroscience 14: 5437–5452.
Roth, H. A., L. J. Radle, S. R. Gifford, and F. M. Clydesdale. 1988. Psychophysical relationships between perceived
sweetness and colour in lemon- and lime-flavoured drinks. Journal of Food Science 53: 1116–1119, 1162.
Rozin, P. 1982. “Taste–smell confusions” and the duality of the olfactory sense. Perception & Psychophysics
31: 397–401.
Schroeder, C. E., R. W. Lindsley, C. Specht, A. Marcovici, J. F. Smiley, and D. C. Javitt. 2001. Somatosensory
input to auditory association cortex in the macaque monkey. Journal of Neurophysiology 85: 1322–1327.
Seymour, S. K., and D. D. Hamann. 1988. Crispness and crunchiness of selected low moisture foods. Journal
of Texture Studies 19: 79–95.
Shankar, M. U., C. A. Levitan, J. Prescott, and C. Spence. 2009. The influence of color and label information
on flavor perception. Chemosensory Perception 2: 53–58.
Shankar, M. U., C. Levitan, and C. Spence. 2010. “Grape expectations”: Does higher-level knowledge mediate
the interpretation of multisensory cues? Consciousness & Cognition 19: 380–390.
Shimojo, S., and L. Shams. 2001. Sensory modalities are not separate modalities: Plasticity and interactions.
Current Opinion in Neurobiology 11: 505–509.
Skrandies, W., and N. Reuther. 2008. Match and mismatch of taste, odor, and color is reflected by electrical
activity in the human brain. Journal of Psychophysiology 22: 175–184.
Small, D. M. 2004. Crossmodal integration—insights from the chemical senses. Trends in Neurosciences 27:
120–123.
Small, D. M., J. C. Gerber, Y. E. Mak, and T. Hummel. 2005. Differential neural responses evoked by orthona-
sal versus retronasal odorant perception in humans. Neuron 47: 593–605.
Small, D. M., and J. Prescott. 2005. Odor/taste integration and the perception of flavour. Experimental Brain
Research 166: 345–357.
Small, D. M., M. G. Veldhuizen, J. Felsted, Y. E. Mak, and F. McGlone. 2008. Separable substrates for anticipa-
tory and consummatory food chemosensation. Neuron 57: 786–797.
Spence, C. 2002. The ICI report on the secret of the senses. London: The Communication Group.
Spence, C., C. Levitan, M. U. Shankar, and M. Zampini. 2010. Does food colour influence flavour identification
in humans? Chemosensory Perception 3: 68–84.
Spence, C., and M. U. Shankar. 2010. The influence of auditory cues on the perception of, and responses to,
food and drink. Journal of Sensory Studies 25: 406–430.
Spence, C., M. U. Shankar, and H. Blumenthal. In press. ‘Sound bites’: Auditory contributions to the percep-
tion and consumption of food and drink. To appear in Art and the senses, ed. F. Bacci and D. Melcher.
Oxford: Oxford Univ. Press.
Spence, C., and M. Zampini. 2006. Auditory contributions to multisensory product perception. Acta Acustica
united with Acustica 92: 1009–1025.
Stevenson, R. J., and R. A. Boakes. 2004. Sweet and sour smells: Learned synesthesia between the senses of
taste and smell. In The handbook of multisensory processes, ed. G.A. Calvert, C. Spence, and B. E. Stein,
69–83. Cambridge, MA: MIT Press.
Stevenson, R. J., R. A. Boakes, and J. Prescott. 1998. Changes in odor sweetness resulting from implicit learn-
ing of a simultaneous odor–sweetness association: An example of learned synesthesia. Learning and
Motivation 29: 113–132.
Stevenson, R. J. 2009. The psychology of flavour. Oxford: Oxford Univ. Press.
Stevenson, R. J., and M. Oaten. 2008. The effect of appropriate and inappropriate stimulus color on odor dis-
crimination. Perception & Psychophysics 70: 640–646.
Stevenson, R. J., and C. Tomiczek. 2007. Olfactory-induced synesthesias: A review and model. Psychological
Bulletin 133: 294–309.
Stillman, J. 1993. Colour influences flavour identification in fruit-flavoured beverages. Journal of Food Science
58: 810–812.

Stillman, J. A. 2002. Gustation: Intersensory experience par excellence. Perception 31: 1491–1500.
Strugnell, C. 1997. Colour and its role in sweetness perception. Appetite 28: 85.
Tyle, P. 1993. Effect of size, shape and hardness of particles in suspension on oral texture and palatability. Acta
Psychologica 84: 111–118.
Varela, P., J. Chen, C. Karlsson, and M. Povey. 2006. Crispness assessment of roasted almonds by an integrated
approach to texture description: Texture, acoustics, sensory and structure. Journal of Chemometrics 20:
311–320.
Verhagen, J. V., and L. Engelen. 2006. The neurocognitive bases of human multimodal food perception: Sensory
integration. Neuroscience and Biobehavioral Reviews 30: 613–650.
Vickers, Z. M. 1979. Crispness and crunchiness in foods. In Food texture and rheology, ed. P. Sherman, 145–
166. London: Academic Press.
Vickers, Z. M. 1981. Relationships of chewing sounds to judgments of crispness, crunchiness and hardness.
Journal of Food Science 47: 121–124.
Vickers, Z. M. 1983. Pleasantness of food sounds. Journal of Food Science 48: 783–786.
Vickers, Z. M. 1984. Crispness and crunchiness—A difference in pitch? Journal of Texture Studies 15:
157–163.
Vickers, Z. M. 1987. Crispness and crunchiness—Textural attributes with auditory components. In Food tex-
ture: Instrumental and sensory measurement, ed. H. R. Moskowitz, 45–66. New York: Marcel Dekker.
Vickers, Z. M. 1991. Sound perception and food quality. Journal of Food Quality 14: 87–96.
Vickers, Z. M., and M. C. Bourne. 1976. A psychoacoustical theory of crispness. Journal of Food Science 41:
1158–1164.
Vickers, Z. M., and S. S. Wasserman. 1979. Sensory qualities of food sounds based on individual perceptions.
Journal of Texture Studies 10: 319–332.
von Békésy, G. 1957. Neural volleys and the similarity between some sensations produced by tones and by skin
vibrations. Journal of the Acoustical Society of America 29: 1059–1069.
Welch, R. B., and D. H. Warren. 1980. Immediate perceptual response to intersensory discrepancy. Psychological
Bulletin 88: 638–667.
Wheatley, J. 1973. Putting colour into marketing. Marketing 67: 24–29.
White, T. L., and J. Prescott. 2007. Chemosensory cross-modal Stroop effects: Congruent odors facilitate taste
identification. Chemical Senses 32: 337–341.
Yau, N. J. N., and M. R. McDaniel. 1992. The effect of temperature on carbonation perception. Chemical
Senses 14: 337–348.
Yeomans, M., L. Chambers, H. Blumenthal, and A. Blake. 2008. The role of expectancy in sensory and hedonic
evaluation: The case of smoked salmon ice-cream. Food Quality and Preference 19: 565–573.
Zampini, M., D. Sanabria, N. Phillips, and C. Spence. 2007. The multisensory perception of flavour: Assessing
the influence of colour cues on flavour discrimination responses. Food Quality and Preference 18:
975–984.
Zampini, M., and C. Spence. 2004. The role of auditory cues in modulating the perceived crispness and stale-
ness of potato chips. Journal of Sensory Studies 19: 347–363.
Zampini, M., and C. Spence. 2005. Modifying the multisensory perception of a carbonated beverage using
auditory cues. Food Quality and Preference 16: 632–641.
Zampini, M., E. Wantling, N. Phillips, and C. Spence. 2008. Multisensory flavour perception: Assessing the
influence of fruit acids and colour cues on the perception of fruit-flavoured beverages. Food Quality and
Preference 18: 335–343.
Zellner, D. A., A. M. Bartoli, and R. Eckard. 1991. Influence of colour on odor identification and liking ratings.
American Journal of Psychology 104: 547–561.
Zellner, D. A., and P. Durlach. 2003. Effect of colour on expected and experienced refreshment, intensity, and
liking of beverages. American Journal of Psychology 116: 633–647.
Zellner, D. A., and M. A. Kautz. 1990. Colour affects perceived odor intensity. Journal of Experimental
Psychology: Human Perception and Performance 16: 391–397.
Zellner, D. A., and L. A. Whitten. 1999. The effect of colour intensity and appropriateness on color-induced
odor enhancement. American Journal of Psychology 112: 585–604.

It has become accepted in the neuroscience community that perception and performance are
quintessentially multisensory by nature. Using the full palette of modern brain imaging and
neuroscience methods, The Neural Bases of Multisensory Processes details current understanding of the
neural bases of these phenomena as studied across species, stages of development, and clinical statuses.

Organized thematically into nine subsections, the book is a collection of contributions by leading
scientists in the field. Chapters build generally from basic to applied, allowing readers to ascertain
how fundamental science informs the clinical and applied sciences.

Topics discussed include

• Anatomy, essential for understanding the neural substrates of multisensory processing
• Neurophysiological bases and how multisensory stimuli can dramatically change the
encoding processes for sensory information
• Combinatorial principles and modeling, focusing on efforts to gain a better mechanistic
handle on multisensory operations and their network dynamics
• Development and plasticity
• Clinical manifestations and how perception and action are affected by altered
sensory experience
• Attention and spatial representations

The last sections of the book focus on naturalistic multisensory processes in three separate contexts:
motion signals, multisensory contributions to the perception and generation of communication
signals, and how the perception of flavor is generated. The text provides a solid introduction for
newcomers and a strong overview of the current state of the field for experts.
