Multilevel rhythms in multimodal communication

royalsocietypublishing.org/journal/rstb

Review

Wim Pouw1,2, Shannon Proksch3, Linda Drijvers1,2,†, Marco Gamba4,†, Judith Holler1,2,†, Christopher Kello3,†, Rebecca S. Schaefer5,6,† and Geraint A. Wiggins7,8,†

1 Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands
2 Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
3 Cognitive and Information Sciences, University of California, Merced, CA, USA
4 Department of Life Sciences and Systems Biology, University of Turin, Turin, Italy
5 Health, Medical and Neuropsychology unit, Institute for Psychology, and 6 Academy for Creative and Performing Arts, Leiden University, Leiden, The Netherlands
7 Vrije Universiteit Brussel, Brussels, Belgium
8 Queen Mary University of London, UK

Cite this article: Pouw W, Proksch S, Drijvers L, Gamba M, Holler J, Kello C, Schaefer RS, Wiggins GA. 2021 Multilevel rhythms in multimodal communication. Phil. Trans. R. Soc. B 376: 20200334.
© 2021 The Author(s) Published by the Royal Society. All rights reserved.
[Figure 1 shows the levels discussed in this review (neural processes, (sensory) features, biomechanics, multimodal signals, communication, social interactions) and notes that the quasi-rhythmic structure of vocal and gestural signals can be intentionally mediated, can facilitate interaction and information processing, and can aid in turn-taking and avoid overlapping vocal calls.]

Figure 1. Multilevel rhythm in multimodal communication. Graphical overview of how each level contributes uniquely to the rhythms sustained in multimodal communication. Figures are adapted from [6,7] and inspired by Gilbert Gottlieb's (1929–2006) view on epigenesis.
lost in translation. In box 1, we have marked the terms and their senses that occur throughout our overview. This glossary also aims to capture a very general meaning of specialist terms offered to address a particular process in perception or production, or at a neural, structural or behavioural level. The definitions are as general as possible, for instance, so as to underline a continuity of the perception and production of multimodal signals, or so as to include phenomena not traditionally treated as multimodal in nature. For example, in sign languages, both the hands as well as facial and labial expressions are combined in complex utterances [9]. Such complex signs are designed to be received through one sensory channel and are thus unimodal by common definitions (but see [10]). In our view, signed languages are an example of a multimodal production in virtue of combining several otherwise independent signalling modes/signal features. Similarly, neural processes can be multimodal in our view, in virtue of coupling neural ensembles that independently would be tuned to differently structured information in the environment. Note that we cannot address all the rich and varied (temporal) functions of complex multimodal signalling [5,8,10]. But in our review, the common thread resonates with a recent overview by Halfwerk et al. [10], who suggest that multimodal signalling functions are not exhausted by simply (i) providing redundant backup information or (ii) combining multiple independent messages. Instead, what is central to the temporal functioning of multimodal systems is that the resulting perception or production of a signal 'is qualitatively different from the sum of the properties of its components' [10, p. 2], i.e. has emergent properties [8].

3. Neural level: multimodal neural-cognitive processes

Here, we present an overview of how temporal coupling in the production and perception of multimodal signals can be constrained by neural ensembles that are independently tuned towards specifically structured information in the
Box 1. Definitions.

General definition of phenomenon: a distinct measurable aspect of a system, which can be measured independently of other aspects.

term (context of term): example
- component (ethology): frequency or duration of a signal; an intellectual instance determining behaviour
- component (mathematics; electronic engineering (EE)): partial at frequency x; regions of energy concentration
- feature (EE; computer science (CS)): spectral centroid; signal onset/offset; duration of a signal
- feature (music; current paper): pitch; fundamental frequency; rhythm; harmony
environment. In their multimodal arrangement, they yield unique stabilities for tuning to the rhythms of multimodal communication. Furthermore, some neural ensembles are uniquely specialized to attune to multisensory information.

When integrating a cascade of sensory signals to form a unified, structured percept of the environment, the brain faces two challenges. First, integrating different sensory signals into a unified percept relies on solving the 'binding problem': whether signals need to be integrated or segregated. Second, these sensory signals require integration with prior and contextual knowledge to weigh their uncertainty.

The neural integration of multiple sensory signals is describable at several neural levels and measurable using wide-ranging methods (e.g. single-unit recordings, optogenetics, EEG, MEG, fMRI, combined with psychophysical experiments [11–13]). Although the potential multisensory integration mechanisms are debated, the integration likelihood of two signals seems highly dependent on the degree of spatio-temporal coherence between those signals: unisensory signals that are closer in time and space have a greater likelihood of being integrated (cf. [3] and §5). Both human and non-human animal research demonstrates that multisensory neurons in the superior colliculus respond more robustly to spatio-temporally congruent audio-visual cues than to individual sensory cues [14–16]. For example, in macaques (Macaca mulatta), single-unit activity measurements in one specific area in the superior temporal sulcus (anterior fundus) show unique sensitivity to facial displays
when temporally aligned with vocal information, while other areas (anterior medial) are sensitive to facial displays alone [17]. Behavioural evidence of multisensory integration is shown in the territorial behaviour of dart-poison frogs (Epipedobates femoralis), who aggress conspecifics more when auditory and visual cues are sufficiently spatio-temporally aligned [18]. Note, though, that multimodal temporal alignment need not entail synchronization but can specifically involve structured sequencing (i.e. alignment at a lag). This is evidenced by research on a flycatcher bird species (Monarcha castaneiventris), which is uniquely responsive to long-range-emitted song followed by seeing the plumage colour of potential territorial rivals, as opposed to their reversely ordered, synchronized or unimodal presentation [19].

[…]

coherent visual and auditory stimuli early in life [12]. Although the lower level and higher level multimodal integration mechanisms are not well understood, both feedback and feed-forward interactions between early and higher level cortices might be relevant for integration. Specifically, it has been hypothesized that synchronized neural oscillations provide a mechanism for multisensory binding and selecting information that matches across sensory signals [20]. Here, coherent oscillatory signals are thought to allow for functional connectivity between spatially distributed neuronal populations, where low-frequency neural oscillations provide temporal windows for cross-modal influences [21]. This synchronization can occur through neural entrainment and/or phase resetting, which might be relevant for phase reorganization of ongoing oscillatory activity, so that high-excitability phases align to the timing of relevant events [21]. New methods, such as rapid invisible frequency tagging [22–24], might clarify how multisensory signals are neurally integrated, and what the role of low-frequency oscillations is in this process over time. Moreover, novel approaches focusing on moment-to-moment fluctuations in oscillatory activity, combined with methods with an increased spatial resolution (e.g. ECoG/depth-electrode recordings), could significantly advance our knowledge of the role of oscillatory activity in routing and integrating multisensory information across different neural networks [21]. This will be especially relevant in more complex, higher level multimodal binding scenarios, such as (human) communication.

Communicative signals in naturalistic settings arguably include multiple features that work together to maximize their effectiveness. Different sensory modalities may operate at different timescales, with specific well-matched combinations of features across modalities, leading to common cross-modal mappings that are intuitively associated (e.g. visual size and auditory loudness, cf. [25–27]). Prominent well-matched cross-modal mappings (see §§4 and 5) are sensorimotor mappings: signals transmitted to, from, or within visuomotor and auditory-motor systems. Given the high sensitivity of the auditory system for periodic signals aligned with motor periodicities [28,29], auditory signals often entrain movement, with examples seen in various kinds of joint action (e.g. marching or other timed actions). Less commonly, visual signals serve this purpose, as seen in musical conductors. Moreover, perception of both auditory and visual rhythms shares neural substrates with the motor system in terms of timing mechanisms [30]. While the auditory modality seems better suited than the visual modality to guide movement [31], it appears that within different sensory modalities, different features may be better suited to cue movement [32]. For example, movement is most easily cued by discrete events in the auditory domain (e.g. beeps), followed by continuously moving objects in the visual domain (e.g. moving bars) [32]. For discrete visual stimuli (e.g. flashes) or continuous auditory stimuli (e.g. a siren), sensorimotor synchronization is less stable (see for similar results in audio-visual speech: [33]). By contrast with humans, rhesus macaques (Macaca mulatta) more easily synchronize to discrete visual cues [34], perhaps due to weaker audiomotor connections in the macaque brain [35]. These findings indi-

[…]

pelling arguments based on multimodal mating signals in a range of species, as reviewed by Halfwerk et al. [10], suggest that exactly this integration of signals, leading to a multimodal percept rather than a main and a secondary modality, is what makes them informative.

Behavioural and neural studies show that temporal structures in one sensory domain can affect processing in another. Examples are auditory [38] or even multisensory rhythmic cues such as music or a metronome [28] not only regularizing movement (i.e. changing motion trajectories as compared to uncued movements), but also entraining visual attention [37,39], by increasing visual sensitivity at time points predicted to be salient by an auditory stimulus. The neural underpinnings of such interactions are largely unclear. Music-cued versus non-cued movement leads to additional neural activation in motor areas, specifically the cerebellum [40,41], suggesting that the neural activations related to multimodal processing are synergetic. This may explain findings of enhanced learning with multimodal cues, for instance when auditory feedback of movement (or sonification) is provided [42,43]. Even when multimodal embedding of motor learning does not show clear behavioural increases, differences in learning-related neural plasticity were reported for novices learning a new motor sequence to music as compared to without [44], suggesting that the learning process is implemented qualitatively differently [45].

Taken together, different sensory modalities, and the features embedded in these signals, have different sensitivities for specific timescales, making some features especially suitable for cross-modal combinations. When investigating features that naturally combine, behavioural and neural responses emerge which amount to more than a simple addition of multiple processes.

4. Body level: multimodal signalling and peripheral bodily constraints

Understanding rhythmic multimodal communication also requires a still underdeveloped understanding of peripheral bodily constraints (henceforth biomechanics) in the production of multimodal signals. Here, we overview findings which show how multimodal signalling sometimes exploits physical properties of the body in the construction of temporally complex signals.
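As a toy illustration of the entrainment idea discussed above (an auditory rhythm pulling a movement rhythm into step), the following minimal coupled-oscillator sketch may help. It is our illustrative example of the general mechanism, not a model taken from the cited literature; the movement rate, cue rate and coupling constant are assumed values.

```python
import math

# Toy sketch of auditory-motor entrainment (illustrative parameter values;
# not a model fitted to any data discussed in the text).
def entrain(move_hz=1.8, cue_hz=2.0, coupling=5.0, dt=0.001, seconds=20.0):
    """Euler-integrate d(phi_m)/dt = omega_m + K*sin(phi_c - phi_m),
    with the auditory cue phase phi_c advancing at a fixed rate."""
    omega_m = 2.0 * math.pi * move_hz   # movement's natural rate (rad/s)
    omega_c = 2.0 * math.pi * cue_hz    # auditory cue rate (rad/s)
    phi_m, phi_c = 0.0, math.pi         # start in anti-phase
    for _ in range(int(seconds / dt)):
        phi_m += (omega_m + coupling * math.sin(phi_c - phi_m)) * dt
        phi_c += omega_c * dt
    # relative phase, wrapped to (-pi, pi]
    return math.atan2(math.sin(phi_c - phi_m), math.cos(phi_c - phi_m))

rel = entrain()
print(f"steady-state relative phase: {rel:.3f} rad")
```

Because the assumed coupling (5 rad/s) exceeds the frequency detuning (2π · 0.2 ≈ 1.26 rad/s), the movement oscillator phase-locks to the cue at a small constant lag of arcsin(Δω/K) ≈ 0.25 rad; with a weaker coupling (e.g. 1 rad/s) it would drift instead. This is one way to picture why entrainment requires cross-modal coupling that is strong relative to the mismatch in natural rates.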
Speech is putatively a superordinate mode of coordination between what were originally stable independent vocal and mandibular action routines [46]. In chimpanzees (Pan troglodytes), non-vocal lip smacking occurs in the theta range (approx. 3–8 Hz) typical of the speech envelope and labial kinematics of human speech [47]. Marmosets (Callithrix jacchus) occupy bistable modes of vocal-articulatory coordination, where mandibular oscillation is only synchronized at the characteristic theta range with vocal modulations at the final but not starting segments of the call [48]. Similarly, in the zebra finch (Taeniopygia guttata), respiratory pulses are timed with syrinx activity and rapid beak movements, the coordination of which is held to sustain the highly varied vocalization repertoire of this bird species [49].

[…]

peaks co-occurrent with emphatic stress in speech, which are tightly and dynamically coupled under adverse conditions, though with more temporal variability for more complex symbolizing gestures [51]. This coupling of gestures' acceleration-induced forces and speech can arise biomechanically from upper limb–respiratory coupling, e.g. by soliciting anticipatory muscle adjustments to stabilize posture during gesture [52], which also include respiratory-controlling muscles supporting speech-vocalization [53]. Comparable biomechanical interactions and synergies have been found in other animals, long before such associations were raised to explain aspects of human multimodal prosody. In brown-headed cowbirds (Molothrus ater), vocalizations are produced with specific respiratory-related abdominal muscle activity. Such modulations are reduced during vocalizing while moving the wings for visual displaying, even though air sac pressure is maintained. This suggests that visual displays in cowbirds biomechanically interact with respiratory dynamics supporting vocalization [54]. During their more vigorous wing displays, these birds are vocally silent, likely so as to avoid the biomechanical instability of singing and moving vigorously at the same time. Such biomechanical interactions are consistent with findings on the wing beats of flying bats (e.g. Pteronotus parnellii), which are synchronized with echo vocalizations due to locomotion–respiratory biomechanical synergies [55]. The echo vocalizations during a flight are often isochronously structured (at 6–12 Hz), and this rhythmic ability is attributed to locomotion–respiratory couplings as they share a temporal structure. However, isochrony (at 12–24 Hz) has also been observed in stationary bats when producing social vocalizations [56]. In this way, biomechanical stabilities from one domain may have scaffolded the rhythmic vocal capabilities that are sustained in social vocal domains [57].

Rhesus macaques assume different facial postures with particular vocalizations. Lips usually protrude when emitting coos or grunts (e.g. during mother–infant contact or group progression). During the emission of screams (e.g. copulation or threats), lips retract [58]. In macaques, facial gestures are associated with peculiar vocal tract shapes, which influence acoustic signals during phonation [59] and can be discriminated by conspecific listeners [60]. Relatedly, in humans, perceiving lip postures allows the perceiver to derive a /ba/ or /pa/ from an auditory signal. It is the auditory–visual–motor co-regularity that makes the visual or haptic perception of articulatory gestures possible in this classic McGurk effect [61]. Recently, a 'manual gesture McGurk effect' has been discovered [62]. When asked to detect a particular lexical stress in a uniformly stressed speech sequence, participants who see a hand gesture's beat timed with a particular speech segment tend to hear a lexical stress for that segment [62]. We think it is possible that the gesture–speech–respiratory link as reviewed previously is actually important for understanding the manual McGurk effect, as listeners attune to features of the visual–acoustic signal that are informative about such coordinated modes of production [63]. Similarly, communicative gestures can also influence the heard duration of musical notes. For example, the absolute duration of a percussive tone sounds longer to an audience member when seeing a long versus short percussion

[…]

they might be identified when properly related to biomechanics. For instance, hierarchical bodily representations of metre can be elicited in spontaneous music-induced movement, with different aspects of the metre embodied in hand, torso or full-arm movements [66]. Additionally, specific coordination patterns emerge between different body parts of interacting musicians during musical improvisation [67]. Thus, what one hears in music might be constrained to what body part can be optimally temporally aligned with a feature in the music.

To detect multimodal cues in this way may be very closely related to indexical signals, such as hearing the potential strength of a conspecific from vocal qualities [60,68]. Indexical signals are often the result of perceptual and/or morphological specialization to detect/convey features from multimodal couplings. For example, frogs (Physalaemus pustulosus) and frog-eating bats (Trachops cirrhosus) have learned to attune to frog calls in relation to the water ripples produced by the calling frog's vocal sac deformations [69]. Similarly, crested pigeons (Ocyphaps lophotes) are alarmed by the sounds of high-velocity wing beats of conspecifics, where the feathers turn out to have morphologically evolved to produce the aeroelastic flutter needed to sustain these unique alarm calls during fleeing locomotion [70]. In broad-tailed hummingbirds (Selasphorus platycercus), the characteristic high-speed courtship dives seem to be driven to attain exactly the right speeds to elicit sonification from aeroelastic flutter, which is synchronized with attaining the correct angle transition relative to the to-be-impressed perceiver, so that the gorget dramatically changes colour during sound production [71]. In sum, multimodal communication sometimes involves a specialized exploitation or attunement of physics that constrains particular (combined) modes of acting (with the environment).

Note that the multimodal information embedded in communicative acoustic signals can have impacts on complex communication in humans too. Speakers who cannot see but only hear each other tend to align patterns of postural sway, suggesting that vocal features are used to coordinate a wider embodied context [72]. These emergent coordinations are found to increase social affiliation and can align bodily processes [73]. For example, synchronized drumming in groups synchronizes physiology, aligning participants' heartbeats [74]. Further, visual observation of interpersonal synchronous movement between others may lead observers
to rate higher levels of rapport (liking) between the interacting individuals [75] and increase an audience's affective and aesthetic enjoyment of a group dance performance [76].

To conclude, we have overviewed examples of peripheral bodily constraints which influence the perception and production of multimodal signals across species. Specifically, these biomechanical processes mediate the temporal structuring of multimodal communicative signals.

5. Social level: complex rhythms in interactive multimodal communication

In this section, we overview how social interaction complexi-

[…]

also found quasi-rhythmic structure in sounds of speech, music and animal communication [77], and likewise for movements produced while talking, singing or performing music [78]. The multiscale character of these sounds and movements is readily illustrated in speech: phonemes of varying durations combine to create longer syllabic units with more variability in duration, which combine to form phrases with even more variability in length, and so on, thus creating quasi-rhythmicity at each timescale.

The durations of linguistic units like phonemes and syllables are difficult to measure in the acoustic speech signal, but they generally correspond to modulations in a specific feature of the acoustic signal, called the amplitude envelope. Within the amplitude envelope, units are expressed in terms of bursts and lulls of energy, and their temporal patterning can be distilled in the timing of bursts via peak amplitudes. Speech analysis [79] shows that smaller bursts cluster to form larger bursts, where larger bursts cluster to form even larger bursts, across timescales that roughly correspond with (phonemic, syllabic, phrasal) units of language. Musical recordings also exhibit degrees of multiscale structure whose specifics depend on the genre of music or type of speech performance [77]. Even recordings of animal vocalizations have been found to exhibit multiscale structure using those same analysis methods. While we do not have access to the underlying units, recordings of communicative vocalizations produced by killer whales were found to have a quasi-rhythmic structure across timescales surprisingly similar to human speech interactions [77].

Multiscale structure in speech and music is also multimodal. Analyses of sounds and movements in video recordings have found coordinated multiscale structures in the amplitudes of co-speech face, head and body movements [78], and the degree of coordination in speech sounds and movements depends on the communicative context. Studies of the rhythmic structure have also found that visual communicative signals are tightly coordinated with the acoustic signals of speech [3]. However, while gestures with a beating quality coincide closely with pitch peaks, on the semantic level object- or action-depicting gestures frequently precede corresponding lexical items by several hundred milliseconds [80]. Facial signals, too, can precede the speech they relate to [81]. Variable timing is most obvious if we consider multimodal utterances in their entirety, where speech is embedded in a rich infrastructure of visual signals coming from the hands, head, face, torso, etc. [3]. These different signals are typically not aligned in time but distributed over the entire length of utterances and beyond, with varying onsets and offsets.

Typically, multimodal utterances in human social interaction are produced within a scaffold of speaking turns. Sacks et al. [82] propose that interlocutors abide by a clear set of rules which, combined with linguistic information (semantics, pragmatics, prosody and syntax), afford precise timing of turns, yielding minimal gaps and overlaps. Indeed, quantitative cross-language analyses support this tight temporal coupling [83], in line with a putative 'interaction engine' providing cognitive-interactional predispositions for this human ability [84], though gestural turn exchanges in bonobos point towards an evolutionary precursor [85].

[…]

while decreasing the chance of overlap. This may be enhanced through temporal projections derived from linguistic information [87], but the rhythmical abilities grounding this mechanism are evolutionarily basic [88]. Wild non-human primates, like indris, gibbons and chimpanzees, show coordination during joint vocal output, suggesting the ability to coordinate to auditory rhythms [89,90]. The captive chimpanzee Ai was able to synchronize her keyboard tapping with an acoustic stimulus [91], and captive macaques can flexibly adjust their tapping in anticipation of the beat of a visual metronome [34]. Moreover, cotton-top tamarins and marmosets have been observed to avoid initiating calls, and to adjust their duration and onset, such that they avoid interfering noise [92].

However, conversational turn-taking is also characterized by temporal variation, including periods of overlap and gaps ranging up to hundreds of milliseconds [83,93]. The full breadth of factors influencing turn transition times remains opaque, but turn duration, syntactic complexity, word frequency and social action are some of them [94]. A coupled-oscillator turn-taking mechanism can accommodate this large variation in turn timing, since entrained interlocutors could begin speaking at any new anti-phased periodic cycle [86,88]. A recent study based on telephone interactions shows a quasi-rhythmic structure regulated by turn-by-turn negative autocorrelations [95]. The coupled-oscillator mechanism that may form the basis for dealing with quasi-rhythmicity at the interactional level may also govern communication in non-human species, such as the interactional synchronization of non-isochronous call patterns in the katydid species Mecopoda [96].

To conclude, the temporal organization of intentional communication is an intricate matter, characterized, on the one hand, by synchrony serving the amplification of signals or specific features/components thereof, as well as semantic enhancement and smooth coordination between interlocutors. On the other hand, the temporal organization is characterized by quasi-rhythmic, multiscale structure within and across modalities, serving complex communication and coordination patterns that are widespread in communicative animal vocalizations, human speech and even music.

6. Conclusion

We have argued that to understand communicative rhythms which characterize animal communication, a multimodal
perspective is necessary and multiple levels need to be examined. The current overview takes the first step towards a multilevel multimodal approach, showing how each level (neural, bodily and interactive) uniquely contributes to the communicative rhythms of animals. We think that when processes on these levels are understood, we can productively investigate why the rhythms of, for example, human conversation are so complexly varied. Though we have addressed the unique contributions at each level independently, the biggest challenge is understanding how the levels intersect.

A historic lesson in this regard comes from early theories about human vocalization. Early theories held that phonation was actively neurally driven, such that active muscle contractions would be needed to complete each vocal fold cycle [97].

[…]

place to exploit certain bodily capacities [101]. A recent integrative approach has been proposed in the understanding of beat perception and motor synchronization, where it is suggested that a network of biological oscillators is at play when moving to a rhythm, involving both more neurally passive dynamic biomechanics and neural processes [28]. Finally, social interactions allow for new rhythmic stabilities that are simply absent, or qualitatively different, in non-interactive set-ups [102]. Indeed, there are increasingly loud calls for understanding neural processes as sometimes softly assembling into a wider distributed multi-person system in social interactions [103–106]. The current contribution further underlines a call for such a multiscale investigation of the temporal rhythms of multimodal

[…]
References
1. Partan SR, Marler P. 2005 Issues in the classification 9. Sandler W. 2018 The body as evidence for the stimuli. J. Cogn. Neurosci. 1, 12–24. (doi:10.1162/
of multimodal communication signals. Am. Nat. nature of language. Front. Psychol. 9, 1782. (doi:10. jocn.1989.1.1.12)
166, 231–245. (doi:10.1086/431246) 3389/fpsyg.2018.01782) 16. Burnett LR, Stein BE, Perrault TJ, Wallace MT. 2007
2. Fröhlich M, Sievers C, Townsend SW, Gruber T, van 10. Halfwerk W, Varkevisser J, Simon R, Mendoza E, Excitotoxic lesions of the superior colliculus
Schaik CP. 2019 Multimodal communication and Scharff C, Riebel K. 2019 Toward testing for preferentially impact multisensory neurons and
language origins: integrating gestures and vocalizations. multimodal perception of mating signals. multisensory integration. Exp. Brain Res. 179,
Biol. Rev. 94, 1809–1829. (doi:10.1111/brv.12535) Front. Ecol. Evol. 7, 124. (doi:10.3389/fevo. 325–338. (doi:10.1007/s00221-006-0789-8)
3. Holler J, Levinson SC. 2019 Multimodal language 2019.00124) 17. Khandhadia AP, Murphy AP, Romanski LM, Bizley
processing in human communication. Trends 11. Stein BE, Meredith MA. 1993 The merging of the JK, Leopold DA. 2021 Audiovisual integration
Cogn. Sci. 23, 639–652. (doi:10.1016/j.tics.2019. senses. Cambridge, MA: MIT Press. in macaque face patch neurons. Curr. Biol.
05.006) 12. Xu J, Yu L, Stanford TR, Rowland BA, Stein BE. 2015 31, 1826–1835. (doi:10.1016/j.cub.2021.
4. Stein BE, Stanford TR. 2008 Multisensory What does a neuron learn from multisensory 01.102)
integration: current issues from the perspective of experience? J. Neurophysiol. 113, 883–889. (doi:10. 18. Narins PM, Grabul DS, Soma KK, Gaucher P, Hödl W.
the single neuron. Nat. Rev. Neurosci. 9, 255–266. 1152/jn.00284.2014) 2005 Cross-modal integration in a dart-poison frog.
(doi:10.1038/nrn2331) 13. Yu L, Cuppini C, Xu J, Rowland BA, Stein BE. 2019 Proc. Natl Acad. Sci. USA 102, 2425–2429. (doi:10.
5. Hebets EA, Papaj DR. 2005 Complex signal function: Cross-modal competition: the default computation 1073/pnas.0406407102)
developing a framework of testable hypotheses. for multisensory processing. J. Neurosci. 19. Uy JAC, Safran RJ. 2013 Variation in the temporal
Behav. Ecol. Sociobiol. 57, 197–214. (doi:10.1007/ 39, 1374–1385. (doi:10.1523/JNEUROSCI.1806- and spatial use of signals and its implications for
s00265-004-0865-7) 18.2018) multimodal communication. Behav. Ecol. Sociobiol.
6. Dimensions.Guide | Database of Dimensioned 14. Meredith MA, Stein BE. 1986 Visual, auditory, and 67, 1499–1511. (doi:10.1007/s00265-013-1492-y)
Drawings. See https://www.dimensions.guide somatosensory convergence on cells in superior 20. Senkowski D, Schneider TR, Foxe JJ, Engel AK.
(accessed on 1 May 2019). colliculus results in multisensory integration. 2008 Crossmodal binding through neural
7. Chilton J. 2020 Brain outline. Zenodo. (doi:10.5281/ J. Neurophysiol. 56, 640–662. (doi:10.1152/jn.1986. coherence: implications for multisensory processing.
zenodo.3925989) 56.3.640) Trends Neurosci. 31, 401–409. (doi:10.1016/j.tins.
8. Partan SR, Marler P. 1999 Communication goes 15. Stein BE, Meredith MA, Huneycutt WS, McDade L. 2008.05.002)
multimodal. Science 283, 1272–1273. (doi:10.1126/ 1989 Behavioral indices of multisensory integration: 21. Bauer A-KR, Debener S, Nobre AC. 2020
science.283.5406.1272) orientation to visual cues is affected by auditory Synchronisation of neural oscillations and cross-
modal influences. Trends Cogn. Sci. 24, 481–495. between speech and gestures in clear and degraded 51. Pouw W, Dixon JA. 2019 Entrainment and 8
(doi:10.1016/j.tics.2020.03.003) speech. J. Cogn. Neurosci. 30, 1086–1097. (doi:10. modulation of gesture–speech synchrony under
royalsocietypublishing.org/journal/rstb
22. Drijvers L, Spaak E, Jensen O. 2020 Rapid invisible frequency tagging reveals nonlinear integration of auditory and visual semantic information. bioRxiv 067454. (doi:10.1101/2020.04.29.067454)
23. Zhigalov A, Herring JD, Herpers J, Bergmann TO, Jensen O. 2019 Probing cortical excitability using rapid frequency tagging. Neuroimage 195, 59–66. (doi:10.1016/j.neuroimage.2019.03.056)
24. Duecker K, Gutteling TP, Herrmann CS, Jensen O. 2020 No evidence for entrainment: endogenous gamma oscillations and rhythmic flicker responses coexist in visual cortex. bioRxiv 279497. (doi:10.[…])
37. Pearce MT, Ruiz MH, Kapasi S, Wiggins GA, Bhattacharya J. 2010 Unsupervised statistical learning underpins computational, behavioural, and neural manifestations of musical expectation. Neuroimage 50, 302–313. (doi:10.1016/j.neuroimage.2009.12.019)
38. Repp BH. 2005 Sensorimotor synchronization: a review of the tapping literature. Psychon. Bull. Rev. 12, 969–992. (doi:10.3758/BF03206433)
39. Escoffier N, Herrmann CS, Schirmer A. 2015 Auditory rhythms entrain visual processes in the human brain: evidence from evoked oscillations and event- […]
52. Hodges PW, Richardson CA. 1997 Feedforward contraction of transversus abdominis is not influenced by the direction of arm movement. Exp. Brain Res. 114, 362–370. (doi:10.1007/pl00005644)
53. Pouw W, Harrison SJ, Esteve-Gibert N, Dixon JA. 2020 Energy flows in gesture-speech physics: the respiratory-vocal system and its coupling with hand gestures. J. Acoust. Soc. Am. 148, 1231–1247. (doi:10.1121/10.0001730)
54. Cooper BG, Goller F. 2004 Multimodal signals: […]
67. Walton AE, Washburn A, Langland-Hassan P, Chemero A, Kloos H, Richardson MJ. 2018 Creating time: social collaboration in music improvisation. Top. Cogn. Sci. 10, 95–119. (doi:10.1111/tops.12306)
68. Pisanski K, Cartei V, McGettigan C, Raine J, Reby D. 2016 Voice modulation: a window into the origins of human vocal control? Trends Cogn. Sci. 20, 304–318. (doi:10.1016/j.tics.2016.01.002)
69. Halfwerk W, Jones PL, Taylor RC, Ryan MJ, Page RA. 2014 Risky ripples allow bats and frogs to eavesdrop on a multisensory sexual display. Science 343, 413–416. (doi:10.1126/science.1244812)
71. Hogan BG, Stoddard MC. 2018 Synchronization of speed, sound and iridescent color in a hummingbird aerial courtship dive. Nat. Commun. 9, 5260. (doi:10.1038/s41467-018-07562-7)
72. Shockley K, Baker AA, Richardson MJ, Fowler CA. 2007 Articulatory constraints on interpersonal postural coordination. J. Exp. Psychol. Hum. Percept. Perform. 33, 201–208. (doi:10.1037/0096-1523.33.1.201)
73. Wiltermuth SS, Heath C. 2009 Synchrony and cooperation. Psychol. Sci. 20, 1–5. (doi:10.1111/j.1467-9280.2008.02253.x)
74. Gordon I, Gilboa A, Cohen S, Milstein N, Haimovich N, Pinhasi S, Siegman S. 2020 Physiological and behavioral synchrony predict group cohesion and performance. Sci. Rep. 10, 8484. (doi:10.1038/s41598-020-65670-1)
75. McEllin L, Knoblich G, Sebanz N. 2020 Synchronicities that shape the perception of joint action. Sci. Rep. 10, 15554. (doi:10.1038/s41598-020-72729-6)
76. Vicary S, Sperling M, Zimmermann Jv, Richardson DC, Orgs G. 2017 Joint action aesthetics. PLoS ONE 12, e0180101. (doi:10.1371/journal.pone.0180101)
77. Kello CT, Bella SD, Médé B, Balasubramaniam R. 2017 Hierarchical temporal structure in music, speech and animal vocalizations: jazz is like a conversation, humpbacks sing like hermit thrushes. J. R. Soc. Interface 14, 20170231. (doi:10.1098/rsif.2017.0231)
78. Alviar C, Dale R, Dewitt A, Kello C. 2020 Multimodal coordination of sound and movement in music and speech. Discourse Process. 57, 682–702. (doi:10.1080/0163853X.2020.1768500)
79. Falk S, Kello CT. 2017 Hierarchical organization in the temporal structure of infant-directed speech and […]
80. Ter Bekke M, Drijvers L, Holler J. 2020 The predictive potential of hand gestures during conversation: an investigation of the timing of gestures in relation to speech. PsyArXiv. (doi:10.31234/osf.io/b5zq7)
81. Kaukomaa T, Peräkylä A, Ruusuvuori J. 2013 Turn-opening smiles: facial expression constructing emotional transition in conversation. J. Pragmat. 55, 21–42. (doi:10.1016/j.pragma.2013.05.006)
82. Sacks H, Schegloff EA, Jefferson G. 1974 A simplest systematics for the organization of turn-taking for conversation. Language 50, 696–735. (doi:10.1353/lan.1974.0010)
[…] language: the interaction engine hypothesis. In Human language: from genes and brain to behavior (ed. P Hagoort), pp. 189–200. Cambridge, MA: MIT Press.
85. Fröhlich M, Kuchenbuch P, Müller G, Fruth B, Furuichi T, Wittig RM, Pika S. 2016 Unpeeling the layers of language: bonobos and chimpanzees engage in cooperative turn-taking sequences. Sci. Rep. 6, 25887. (doi:10.1038/srep25887)
86. Wilson M, Wilson TP. 2005 An oscillator model of the timing of turn-taking. Psychon. Bull. Rev. 12, 957–968. (doi:10.3758/BF03206432)
87. Levinson SC. 2016 Turn-taking in human communication—origins and implications for language processing. Trends Cogn. Sci. 20, 6–14. (doi:10.1016/j.tics.2015.10.010)
88. Takahashi DY, Narayanan DZ, Ghazanfar AA. 2013 Coupled oscillator dynamics of vocal turn-taking in monkeys. Curr. Biol. 23, 2162–2168. (doi:10.1016/j.cub.2013.09.005)
89. Geissmann T, Orgeldinger M. 2000 The relationship between duet songs and pair bonds in siamangs, Hylobates syndactylus. Anim. Behav. 60, 805–809. (doi:10.1006/anbe.2000.1540)
90. Gamba M, Torti V, Estienne V, Randrianarison RM, Valente D, Rovara P, Bonadonna G, Friard O, Giacoma C. 2016 The Indris have got rhythm! Timing and pitch variation of a primate song examined between sexes and age classes. Front. Neurosci. 10, 249. (doi:10.3389/fnins.2016.00249)
91. Hattori Y, Tomonaga M, Matsuzawa T. 2013 Spontaneous synchronized tapping to an auditory rhythm in a chimpanzee. Sci. Rep. 3, 1566. (doi:10.1038/srep01566)
92. Egnor SER, Wickelgren JG, Hauser MD. 2007 Tracking silence: adjusting vocal production to avoid […]
93. Holler J, Kendrick KH, Levinson SC. 2018 Processing language in face-to-face conversation: questions with gestures get faster responses. Psychon. Bull. Rev. 25, 1900–1908. (doi:10.3758/s13423-017-1363-z)
94. Roberts SG, Torreira F, Levinson SC. 2015 The effects of processing and sequence organization on the timing of turn taking: a corpus study. Front. Psychol. 6, 509. (doi:10.3389/fpsyg.2015.00509)
95. Pouw W, Holler J. 2020 Timing in conversation is dynamically adjusted turn by turn: evidence for lag-1 negatively autocorrelated turn taking times in telephone conversation. PsyArXiv. (doi:10.31234/osf.io/b98da)
97. Husson R, Garde ÉJ, Richard A. 1952 L'acoustique des salles du point de vue du chanteur et de l'orateur. Ann. Télécommunications 7, 58–74. (doi:10.1007/BF03017103)
98. Titze IR. 1973 The human vocal cords: a mathematical model. Phonetica 28, 129–170. (doi:10.1159/000259453)
99. Herbst CT, Stoeger AS, Frey R, Lohscheller J, Titze IR, Gumpenberger M, Fitch WT. 2012 How low can you go? Physical production mechanism of elephant infrasonic vocalizations. Science 337, 595–599. (doi:10.1126/science.1219712)
100. Peters G. 2002 Purring and similar vocalizations in mammals. Mamm. Rev. 32, 245–271. (doi:10.1046/j.1365-2907.2002.00113.x)
101. Fitch WT. 2018 The biology and evolution of speech: a comparative analysis. Annu. Rev. Linguist. 4, 255–279. (doi:10.1146/annurev-linguistics-011817-045748)
102. Tognoli E, Zhang M, Fuchs A, Beetle C, Kelso JAS. 2020 Coordination dynamics: a foundation for understanding social behavior. Front. Hum. Neurosci. 14, 317. (doi:10.3389/fnhum.2020.00317)
103. De Jaegher H, Di Paolo E, Gallagher S. 2010 Can social interaction constitute social cognition? Trends Cogn. Sci. 14, 441–447. (doi:10.1016/j.tics.2010.06.009)
104. Dumas G, Kelso JAS, Nadel J. 2014 Tackling the social cognition paradox through multi-scale approaches. Front. Psychol. 5, 882. (doi:10.3389/fpsyg.2014.00882)
105. Redcay E, Schilbach L. 2019 Using second-person neuroscience to elucidate the mechanisms of social interaction. Nat. Rev. Neurosci. 20, 495–505. (doi:10.1038/s41583-019-0179-4)
106. Bolis D, Schilbach L. 2020 ‘I interact therefore I am’: the self as a historical product of dialectical attunement. Topoi 39, 521–534. (doi:10.1007/s11245-018-9574-0)