
Brain & Language 122 (2012) 151–161


Temporal context in speech processing and attentional stream selection: A behavioral and neural perspective
Elana M. Zion Golumbic a,b,*, David Poeppel c, Charles E. Schroeder a,b

a Departments of Psychiatry and Neurology, Columbia University Medical Center, 710 W 168th St., New York, NY 10032, USA
b The Nathan Kline Institute for Psychiatric Research, 140 Old Orangeburg Road, Orangeburg, NY 10962, USA
c Department of Psychology, New York University, 6 Washington Place, New York, NY 10003, USA

Article history: Available online 29 January 2012

Keywords: Oscillations; Entrainment; Attention; Speech; Auditory; Time; Rhythm

Abstract

The human capacity for processing speech is remarkable, especially given that information in speech unfolds over multiple time scales concurrently. Similarly notable is our ability to filter out extraneous sounds and focus our attention on one conversation, epitomized by the 'Cocktail Party' effect. Yet, the neural mechanisms underlying on-line speech decoding and attentional stream selection are not well understood. We review findings from behavioral and neurophysiological investigations that underscore the importance of the temporal structure of speech for achieving these perceptual feats. We discuss the hypothesis that entrainment of ambient neuronal oscillations to speech's temporal structure, across multiple time-scales, serves to facilitate its decoding and underlies the selection of an attended speech stream over other competing input. In this regard, speech decoding and attentional stream selection are examples of 'Active Sensing', emphasizing an interaction between proactive and predictive top-down modulation of neuronal dynamics and bottom-up sensory input.

© 2012 Elsevier Inc. All rights reserved.

1. Introduction

In speech communication, a message is formed over time as small packets of information (sensory and linguistic) are transmitted and received, accumulating and interacting to form a larger semantic structure. We typically perform this task effortlessly, oblivious to the inherent complexity that on-line speech processing entails. It has proven extremely difficult to fully model this basic human capacity, and despite decades of research, the question of how a stream of continuous acoustic input is decoded in real-time to interpret the intended message remains a foundational problem.

The challenge of understanding mechanisms of on-line speech processing is aggravated by the fact that speech is rarely heard under ideal acoustic conditions. Typically, we must filter out extraneous sounds in the environment and focus only on a relevant portion of the auditory scene, as exemplified in the 'Cocktail Party Problem', famously coined by Cherry (1953). While many models have dealt with the issue of segregating multiple streams of concurrent input (Bregman, 1990), the specific attentional process of selecting one particular stream over all others, which is the cognitive task we face daily, has proven extremely difficult to model and is not well understood.

This is a truly interdisciplinary question, of fundamental importance in numerous fields, including psycholinguistics, cognitive neuroscience, computer science and engineering. Converging insights from research across these different disciplines have led to growing recognition of the centrality of the temporal structure of speech for adequate speech decoding, stream segregation and attentional stream selection (Giraud & Poeppel, 2011; Shamma, Elhilali, & Micheyl, 2011). According to this perspective, since speech is first and foremost a temporal stimulus, with acoustic content unfolding and evolving over time, decoding the momentary spectral content ('fine structure') of speech benefits greatly from considering its position within the longer temporal context of an utterance. In other words, the timing of events within a speech stream is critical for its decoding, attribution to the correct source, and ultimately to whether it will be included in the attentional 'spotlight' or forced to the background of perception.

Here we review current literature on the role of the temporal structure of speech in achieving these cognitive tasks. We discuss its importance for speech intelligibility from a behavioral perspective as well as possible neuronal mechanisms that represent and utilize this temporal information to guide on-line speech decoding and selective attention at a 'Cocktail Party'.

* Corresponding author. Address: Department of Neurology, Columbia University Medical Center, 710 W 168th St., RM 721, New York, NY 10032, USA. Fax: +1 212 305 5445. E-mail address: ezg2101@columbia.edu (E.M. Zion Golumbic).

0093-934X/$ - see front matter © 2012 Elsevier Inc. All rights reserved.
doi:10.1016/j.bandl.2011.12.010

2. The temporal structure of speech

2.1. The centrality of temporal structure to speech processing/decoding

At the core of most models of speech processing is the notion that speech tokens are distinguishable based on their spectral makeup (Fant, 1960; Stevens, 1998). Inspired by the tonotopic organization in the auditory system (Eggermont, 2001; Pickles, 2008), theories for speech decoding, be it by brain or by machine, have emphasized on-line spectral analysis of the input for the identification of basic linguistic units (primarily phonemes and syllables; Allen, 1994; French & Steinberg, 1947; Pavlovic, 1987). Similarly, many models of auditory stream segregation rely on frequency separation as the primary cue for stream segregation, attributing consecutive inputs with overlapping spectral structure to the same stream (Beauvois & Meddis, 1996; Hartmann & Johnson, 1991; McCabe & Denham, 1997). However, other models have recognized that relying purely on spectral cues is insufficient for robust speech decoding and stream segregation, especially under noisy conditions (Elhilali, Ma, Micheyl, Oxenham, & Shamma, 2009; McDermott, 2009; Moore & Gockel, 2002). These models emphasize the central role of temporal cues in providing a context within which the spectral content is processed.

Psychophysical studies clearly demonstrate the critical importance of speech's temporal structure for its intelligibility. For example, speech becomes unintelligible if it is presented at an unnaturally fast or slow rate, even if the fine structure is preserved (Ahissar et al., 2001; Ghitza & Greenberg, 2009), or if its temporal envelope is smeared or otherwise disrupted (Arai & Greenberg, 1998; Drullman, 2006; Drullman, Festen, & Plomp, 1994a, 1994b; Greenberg, Arai, & Silipo, 1998; Stone, Fullgrabe, & Moore, 2010). In particular, intelligibility of speech crucially depends on the integrity of the modulation spectrum's amplitude in the region between 3 and 16 Hz (Kingsbury, Morgan, & Greenberg, 1998). In the extreme case of complete elimination of temporal modulations (flat envelope), speech intelligibility is reduced to 5%, demonstrating that without temporal context, the fine structure alone carries very little information (Drullman et al., 1994a, 1994b). In contrast, high degrees of intelligibility can be maintained even with significant distortion to the spectral structure, as long as the temporal envelope is preserved. In a seminal study, Shannon, Zeng, Kamath, Wygonski, and Ekelid (1995) amplitude modulated white noise using temporal envelopes of speech, which preserved temporal cues and amplitude fluctuations but without any fine structure information. Listeners were able to correctly report the phonetic content from this amplitude-modulated noise, even when the envelope was extracted from as few as four frequency bands of speech. Speech intelligibility was also robust to other manipulations of fine structure, such as spectral rotation (Blesser, 1972) and time-reversal of short segments (Saberi & Perrott, 1999), further emphasizing the importance of temporal contextual cues as opposed to momentary spectral content for speech processing. Moreover, hearing-impaired patients using cochlear implants rely heavily on the temporal envelope for speech recognition, since the spectral content is significantly degraded by such devices (Chen, 2011; Friesen, Shannon, Baskent, & Wang, 2001).

These psychophysical findings raise an important question: What is the functional role of the temporal envelope of speech, and what contextual cues does it provide that render it crucial for intelligibility? In the next sections we discuss the temporal structure of speech and how it may relate to speech processing.

2.2. A hierarchy of time scales in speech

The temporal envelope of speech reflects fluctuations in speech amplitude at rates between 2 and 50 Hz (Drullman et al., 1994a, 1994b; Greenberg & Ainsworth, 2006; Rosen, 1992). It represents the timing of events in the speech stream, and provides the temporal framework in which the fine structure of linguistic content is delivered. The temporal envelope itself contains variations at multiple distinguishable rates which probably convey different speech features. These include phonetic segments, which typically occur at a rate of 20–50 Hz; syllables, which occur at a rate of 3–10 Hz; and phrases, whose rate is usually between 0.5 and 3 Hz (Drullman, 2006; Greenberg, Carvey, Hitchcock, & Chang, 2003; Poeppel, 2003; see Fig. 1).

Although the various speech features evolve over differing time scales, they are not processed in isolation; rather, speech comprehension involves integrating information across different time scales in a hierarchical and interdependent manner (Poeppel, 2003; Poeppel, Idsardi, & van Wassenhove, 2008). Specifically, longer-scale units are made up of several units of a shorter time-scale: words and phrases contain multiple syllables, and each syllable is comprised of multiple phonetic segments. Not only is this hierarchy a description of the basic structure of speech, but behavioral studies demonstrate that this structure alters perception and assists intelligibility. For example, Warren and colleagues (1990) found that presenting sequences of vowels at different time scales creates markedly different percepts. Specifically, when vowel duration was >100 ms (i.e. within the syllabic rate of natural speech) they maintained their individual identity and could be named. However, when vowels were 30–100 ms long they lost their perceptual identities and instead the entire sequence was perceived as a word, with its emergent percept bearing little resemblance to the actual phonemes it was comprised of. There are many other psychophysical examples showing that the interpretation of individual phonemes and syllables is affected by prior input (Mann, 1980; Norris, McQueen, Cutler, & Butterfield, 1997; Repp, 1982), supporting the view that these smaller units are decoded as part of a longer temporal unit rather than in isolation. Similar effects have been reported for longer time scales. Pickett and Pollack (1963) show that single words become more intelligible when they are presented within a longer phrase versus in isolation. Several linguistic factors also exert contextual influences over speech decoding and intelligibility, including

Fig. 1. Top: A 9-s long excerpt of natural, conversational speech, with its broad-band temporal envelope overlaid in red. The temporal envelope was extracted by calculating the RMS power of the broad-band sound wave, following the procedure described by Lalor and Foxe (2010). Bottom: A typical spectrum of the temporal envelope of speech, averaged over 8 tokens of natural conversational speech, each one 10 s long.
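The envelope-extraction procedure described in the Fig. 1 caption can be sketched in a few lines of standard signal processing. The code below is a minimal illustration, not the authors' exact code: the window length and all signal parameters are assumptions. A sliding-window RMS gives the broad-band envelope, and the FFT of the mean-removed envelope gives a modulation spectrum like the figure's bottom panel, here recovering the rate of a toy noise carrier modulated at a syllable-like 4 Hz.

```python
import numpy as np

def rms_envelope(x, fs, win_ms=20.0):
    """Broad-band temporal envelope: sliding-window RMS power of the waveform."""
    n = max(1, int(fs * win_ms / 1000.0))
    power = np.convolve(x ** 2, np.ones(n) / n, mode="same")  # running mean of x^2
    return np.sqrt(power)

def modulation_spectrum(env, fs):
    """Amplitude spectrum of the mean-removed envelope."""
    env = env - env.mean()
    freqs = np.fft.rfftfreq(len(env), d=1.0 / fs)
    return freqs, np.abs(np.fft.rfft(env)) / len(env)

# Toy 'speech-like' signal: white noise amplitude-modulated at a 4 Hz syllabic rate.
fs = 8000
t = np.arange(0, 2.0, 1.0 / fs)
rng = np.random.default_rng(0)
x = (1.0 + np.cos(2 * np.pi * 4 * t)) * rng.standard_normal(t.size)

env = rms_envelope(x, fs)
freqs, spec = modulation_spectrum(env, fs)
band = (freqs > 1) & (freqs < 20)             # speech-relevant modulation range
peak_hz = freqs[band][np.argmax(spec[band])]  # dominant modulation rate (~4 Hz here)
```

The same two steps (envelope, then spectrum of the envelope) apply unchanged to a real speech recording in place of the toy signal.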

semantic (Allen, 2005; Borsky, Tuller, & Shapiro, 1998; Connine, 1987; Miller, Heise, & Lichten, 1951), syntactic (Cooper, Paccia, & Lapointe, 1978; Cooper & Paccia-Cooper, 1980) and lexical (Connine & Clifton, 1987; Ganong, 1980) influences. Discussing the nature of these influences is beyond the scope of this paper; however, they emphasize the underlying hierarchical principle in speech processing. The momentary acoustic input is insufficient for decoding; rather, it is greatly affected by the temporal context it is embedded in, often relying on integrating information over several hundred milliseconds (Holt & Lotto, 2002; Mattys, 1997).

The robustness of speech processing to co-articulation, reverberation, and phonemic restorations provides additional evidence that a global percept of the speech structure does not simply rely on linear combination of all the smaller building blocks. Some have explicitly argued for a 'reversed hierarchy' within the auditory system (Harding, Cooke, & König, 2007; Nahum, Nelken, & Ahissar, 2008; Nelken & Ahissar, 2006), and specifically of time scales for speech perception (Cooper et al., 1978; Warren, 2004). This approach advocates for global pattern recognition akin to that proposed in the visual system (Bar, 2004; Hochstein & Ahissar, 2002). Along with a bottom-up process of combining low level features over successive stages, there is a dynamic and critical top-down influence, with higher-order units constraining and assisting in the identification of lower-order units. Recent models for speech decoding have proposed this can be achieved by a series of hierarchical processors which sample the input at different time scales in parallel, with the processors sampling longer timescales imposing constraints on processors of a finer time-scale (Ghitza, 2011; Ghitza & Greenberg, 2009; Kiebel, von Kriegstein, Daunizeau, & Friston, 2009; Tank & Hopfield, 1987).

2.3. Parsing and integrating – utilizing temporal envelope information

A core issue in the hierarchical perspective for speech processing is the need to parse the sequence correctly into smaller units, and to organize and integrate them in a hierarchical fashion in order to perform higher-order pattern recognition (Poeppel, 2003). This poses an immense challenge to the brain (as well as to computers): to convert a stream of continuously fluctuating input into a multi-scale hierarchical structure, and to do so on-line as information accumulates. It is suggested that the temporal envelope of the speech contains cues that assist in parsing the speech stream into smaller units. In particular, two basic units of speech which are important for speech parsing, the phrasal and syllabic levels, can be discerned and segmented from the temporal envelope.

The syllable has long been recognized as a fundamental unit in speech which is vital for parsing continuous speech (Eimas, 1999; Ferrand, Segui, & Grainger, 1996; Mehler, Dommergues, Frauenfelder, & Segui, 1981). The boundaries between syllables are evident in the fluctuations of envelope energy over time, with different "bursts" of energy indicating distinct syllables (Greenberg & Ainsworth, 2004; Greenberg et al., 2003; Shastri, Chang, & Greenberg, 1999). Thus, parsing into syllable-size packages can be achieved relatively easily based on the temporal envelope. Although syllabic parsing has been emphasized in the literature, some advocate that at the core of speech processing is an even longer time scale – that of phrases – which constitutes the backbone for integrating the smaller units of language – phonetic segments, syllables and words – into comprehensible linguistic streams (Nygaard & Pisoni, 1995; Shukla, Nespor, & Mehler, 2007; Todd, Lee, & O'Boyle, 2006). The acoustic properties for identifying phrasal boundaries are much less well defined than those of syllabic boundaries; however, it is generally agreed that prosodic cues are key for phrasal parsing (Cooper et al., 1978; Fodor & Ferreira, 1998; Friederici, 2002; Hickok, 1993). These may include metric characteristics of the language (Cutler & Norris, 1988; Otake, Hatano, Cutler, & Mehler, 1993) and more universal prosodic cues (Endress & Hauser, 2010) as well as cues such as stress, intonation and speeding up/slowing down of speech and the insertion of pauses (Arciuli & Slowiaczek, 2007; Lehiste, 1970; Wightman, Shattuck-Hufnagel, Ostendorf, & Price, 1992). Many of these prosodic cues are represented in the temporal envelope, mostly in subtle variation from the mean amplitude or rate of syllables (Rosen, 1992). For example, stress is reflected in higher amplitude and (sometimes) longer durations, and changes in tempo are reflected in shorter or longer pauses. Thus, the temporal envelope contains cues for speech parsing on both the syllabic and phrasal levels. It is suggested that the degraded intelligibility observed when the speech envelope is distorted is due to lack of cues by which to parse the stream and place the fine structure content into its appropriate context within the hierarchy of speech (Greenberg & Ainsworth, 2004, 2006).

2.4. Making predictions

One factor that might assist greatly in both phrasal and syllabic parsing – as well as hierarchical integration – is the ability to anticipate the onsets and offsets of these units before they occur. It is widely demonstrated that temporal predictions assist perception in more simple rhythmic contexts. Perceptual judgments are enhanced for stimuli occurring at an anticipated moment, compared to stimuli occurring randomly or at unanticipated times (Barnes & Jones, 2000). These studies have led to the 'Attention in Time' hypothesis (Jones, Johnston, & Puente, 2006; Large & Jones, 1999; Nobre, Correa, & Coull, 2007; Nobre & Coull, 2010), which posits that attention can be directed to particular points in time when relevant stimuli are expected, similar to the allocation of spatial attention. A more recent construction of this idea, the 'Entrainment Hypothesis' (Schroeder & Lakatos, 2009b), specifies temporal attention in rhythmic terms directly relevant to speech processing.

It follows that speech decoding would similarly benefit from temporal predictions regarding the precise timing of events in the stream as well as syllabic and phrasal boundaries. The concept of prediction in sentence processing, i.e. beyond the bottom-up analysis of the speech stream, has been developed in detail by Phillips and Wagers (2007) and others. In particular, the 'Entrainment Hypothesis' emphasizes utilizing rhythmic regularities to form predictions about upcoming events, which could be potentially beneficial for speech processing. Temporal prediction may also play a key role in the perception of speech in noise and in stream selection, discussed below (Elhilali & Shamma, 2008; Elhilali et al., 2009; Schroeder & Lakatos, 2009b).

Another cue that likely contributes to forming temporal predictions about upcoming speech events is congruent visual input. In many real-life situations auditory speech is accompanied by visual input of the talking face (an exception being a telephone conversation). It is known that congruent visual input substantially improves speech comprehension, particularly under difficult listening conditions (Campanella & Belin, 2007; Campbell, 2008; Sumby & Pollack, 1954; Summerfield, 1992). Importantly, these cues are predictive (Schroeder, Lakatos, Kajikawa, Partan, & Puce, 2008; van Wassenhove, Grant, & Poeppel, 2005). Visual cues of articulation typically precede auditory input by 100–150 ms (Chandrasekaran, Trubanova, Stillittano, Caplier, & Ghazanfar, 2009), though they remain perceptually linked to congruent auditory input at offsets of up to 200 ms (van Wassenhove, Grant, & Poeppel, 2007). These time frames allow ample opportunity for visual cues to influence the processing of upcoming auditory input. In particular, congruent visual input is critical for making temporal predictions, as it contains information about the temporal envelope of the acoustics (Chandrasekaran et al., 2009; Grant & Greenberg, 2001) as well as prosodic cues (Bernstein, Eberhardt, &

Demorest, 1989; Campbell, 2008; Munhall, Jones, Callan, Kuratate, & Vatikiotis-Bateson, 2004; Scarborough, Keating, Mattys, Cho, & Alwan, 2009). A recent study by Arnal, Wyart, and Giraud (2011) emphasizes the contribution of visual input to making predictions about upcoming speech input on multiple time-scales. Supporting this premise, from a neural perspective there is evidence that auditory cortex is activated by lip reading prior to arrival of auditory speech input (Besle et al., 2008), and that visual input brings about a phase-reset in auditory cortex (Kayser, Petkov, & Logothetis, 2008; Thorne, De Vos, Viola, & Debener, 2011).

To summarize, the temporal structure of speech is critical for speech processing. It provides contextual cues for transforming continuous speech into units on which subsequent linguistic processes can be applied for construction and retrieval of the message. It further contains predictive cues – from both the auditory and visual components of speech – which can facilitate rapid on-line speech processing. Given the behavioral importance of the temporal envelope, we now turn to discuss neural mechanisms that could represent and utilize this temporal information for speech processing and for the selection of a behaviorally relevant speech stream among concurrent competing input.

3. Neuronal tracking of the temporal envelope

3.1. Representation of temporal modulations in the auditory pathway

Given the importance of parsing and hierarchical integration for speech processing, how are these processes implemented from a neurobiologically mechanistic perspective? Some evidence supporting a link between the temporal windows in speech and inherent preferences of the auditory system can be found in studies of responses to amplitude modulated (AM) tones or noise and other complex natural sounds such as animal vocalizations. Physiological studies in monkeys demonstrate that envelope encoding occurs throughout the auditory system, from the cochlea to the auditory cortex, with slower modulation rates represented at higher stations along the auditory pathway (reviewed by Bendor & Wang, 2007; Frisina, 2001; Joris, Schreiner, & Rees, 2004). Importantly, neurons in primary auditory cortex demonstrate a preference for slow modulation rates, in ranges prevalent in natural communication calls (Creutzfeldt, Hellweg, & Schreiner, 1980; de Ribaupierre, Goldstein, & Yeni-Komshian, 1972; Funkenstein & Winter, 1973; Gaese & Ostwald, 1995; Schreiner & Urbas, 1988; Sovijärvi, 1975; Wang, Merzenich, Beitel, & Schreiner, 1995; Whitfield & Evans, 1965). In fact, these 'naturalistic' modulation rates are obvious in spontaneous as well as stimulus-evoked activity (Lakatos et al., 2005). Similar patterns of hierarchical sensitivity to AM rate along the human auditory system were reported by Giraud et al. (2000) using fMRI, showing that the highest-order auditory areas are most sensitive to AM rates between 4 and 8 Hz, corresponding to the syllabic rate in speech. Behavioral studies also show that human listeners are most sensitive to AM rates comparable to those observed in speech (Chi, Gao, Guyton, Ru, & Shamma, 1999; Viemeister, 1979). These converging findings demonstrate that the auditory system contains inherent tuning to the natural temporal statistical structure of communication calls, and the apparent temporal hierarchy along the auditory pathway could be potentially useful for speech recognition, in which information from longer time-scales informs and provides predictions for processing at shorter time scales.

It is also well established that presentation of AM tones or other periodic stimuli evokes a periodic steady-state response at the modulating frequency that is evident in local field potentials (LFP) in nonhuman subjects, as well as in intracranial EEG recordings (ECoG) and scalp EEG in humans, as an increase in power at that frequency. While the steady state response has most often been studied for modulation rates higher than those prevalent in speech (40 Hz, Galambos, Makeig, & Talmachoff, 1981; Picton, Skinner, Champagne, Kellett, & Maiste, 1987), it has also been observed for low-frequency modulations, in ranges prevalent in the speech envelope and crucial for intelligibility (1–16 Hz, Liégeois-Chauvel, Lorenzi, Trébuchon, Régis, & Chauvel, 2004; Picton et al., 1987; Rodenburg, Verweij, & van der Brink, 1972; Will & Berg, 2007). This auditory steady state response (ASSR) indicates that a large ensemble of neurons synchronize their excitability fluctuations to the amplitude modulation rate of the stimulus. A recent model which emphasizes tuning properties to temporal modulations along the entire auditory pathway successfully reproduced this steady state response as recorded directly from human auditory cortex (Dugué, Le Bouquin-Jeannès, Edeline, & Faucon, 2010).

The auditory system's inclination to synchronize neuronal activity according to the temporal structure of the input has raised the hypothesis that this response is related to entrainment of cortical neuronal oscillations to the external driving rhythms. Moreover, it is suggested that such neuronal entrainment is ideal for temporal coding of more complex auditory stimuli, such as speech (Giraud & Poeppel, 2011; Schroeder & Lakatos, 2009b; Schroeder et al., 2008). This hypothesis is motivated, in part, by the strong resemblance between the characteristic time scales in speech and other communication sounds and those of the dominant frequency bands of neuronal oscillations – namely the delta (1–3 Hz), theta (3–7 Hz) and beta/gamma (20–80 Hz) bands. Given these time-scale similarities, neuronal oscillations are well positioned for parsing continuous input at behaviorally relevant time scales and for hierarchical integration. This hypothesis is further developed in the next sections, based on evidence for LFP and spike 'tracking' of natural communication calls and speech.

3.2. Neuronal encoding of the temporal envelope of complex auditory stimuli

Several physiological studies in primary auditory cortex of cat and monkey have found that the temporal envelope of complex vocalizations is significantly correlated with the temporal patterns of spike activity in spatially distributed populations (Gehr, Komiya, & Eggermont, 2000; Nagarajan et al., 2002; Rotman, Bar-Yosef, & Nelken, 2001; Wang et al., 1995). These studies show bursts of spiking activity time-locked to major peaks in the vocalization envelope. Recently, it has been shown that the temporal pattern of LFPs is also phase-locked to the temporal structure of complex sounds, particularly in frequency ranges of 2–9 Hz and 16–30 Hz (Chandrasekaran, Turesson, Brown, & Ghazanfar, 2010; Kayser, Montemurro, Logothetis, & Panzeri, 2009). Moreover, in those studies the timing of spiking activity was also locked to the phase of the LFP, indicating that the two neuronal responses are inherently linked (see also Eggermont & Smith, 1995; Montemurro, Rasch, Murayama, Logothetis, & Panzeri, 2008). Critically, Kayser et al. (2009) show that spikes and LFP phase (<30 Hz) carry complementary information, and their combination proved to be the most informative about complex auditory stimuli. It has been suggested that the phase of the LFP, and in particular that of the underlying neuronal oscillations which generate the LFP, controls the timing of neuronal excitability and gates spiking activity so that it occurs only at relevant times (Elhilali, Fritz, Klein, Simon, & Shamma, 2004; Fries, 2009, 2007, 2008; Lakatos, Chen, O'Connell, Mills, & Schroeder, 2007; Lakatos et al., 2005; Mazzoni, Whittingstall, Brunel, Logothetis, & Panzeri, 2010; Schroeder & Lakatos, 2009a; Womelsdorf et al., 2007). According to this view, and as explicitly suggested by Buzsáki and Chrobak (1995), neuronal oscillations supply the temporal context for processing complex stimuli and serve to direct more intricate processing of the acoustic content, represented by spiking activity, to particular points in time, i.e.

peaks in the temporal envelope. In the next section we discuss how this framework may relate to the representation of natural speech in human auditory cortex.

3.3. Low-frequency tracking of speech envelope

Few studies have attempted to directly quantify neuronal dynamics to speech in humans, perhaps since traditional EEG/MEG methods often are geared toward strictly time-locked responses to brief stimuli or well defined events within a stream, rather than long continuous and fluctuating stimuli. Nonetheless, the advancement of single-trial analysis tools has allowed for some examination of the relationship between the temporal structure of speech and the temporal dynamics of neuronal activity while listening to it. Ahissar and Ahissar (2005) and Ahissar et al. (2001) compared the spectrum of the MEG response when subjects listened to sentences to the spectrum of the sentences' temporal envelope, and found that they were highly correlated, with a peak in the theta band (3–7 Hz). Critically, when speech was artificially compressed, yielding a faster syllabic rate, the theta-band peak in the neural spectrum shifted accordingly, as long as the speech remained intelligible. The match between the neural and acoustic signals broke down as intelligibility decreased with additional compression (see also Ghitza & Greenberg, 2009). This finding is important for two reasons: First, it shows that not only does the auditory system respond at similar frequencies to those prevalent in speech, but there is dynamic control and adjustment of the theta-band response to accommodate different speech rates (within a circumscribed range). Thus it is well positioned for encoding the temporal envelope of speech, given the great variability across speakers and tokens. Second, and critically, this dynamic is limited to within the frequency bands where speech intelligibility is preserved, demonstrating its specificity for speech processing. In contrast, using ECoG recordings, Nourski et al. (2009) found that gamma-band responses in human auditory cortex continue to track an auditory stimulus even past the point of non-intelligibility, suggesting that the higher frequency response is more stimulus-driven and is not critical for intelligibility.

More direct support for the notion that the phase of low-frequency neuronal activity tracks the temporal envelope of speech is found in recent work from several groups using scalp recorded EEG and MEG (Abrams, Nicol, Zecker, & Kraus, 2008; Aiken & Picton, 2008; Bonte, Valente, & Formisano, 2009; Lalor & Foxe, 2010; Luo, Liu, & Poeppel, 2010; Luo & Poeppel, 2007). Luo and colleagues (2007, 2010) show that the low-frequency phase pattern (1–7 Hz; recorded by MEG) is highly replicable across repetitions of the same speech token, and reliably distinguishes between dif-

tracking breaks down when speech is no longer intelligible (due to compression or distortion), coupled with known top-down influences of attention on the N1 (Hillyard, Hink, Schwent, & Picton, 1973; particularly relevant in this context is the N1 sensitivity to temporal attention: Lange, 2009; Lange & Röder, 2006; Sanders & Astheimer, 2008). Moreover, an entirely sensory-driven approach would greatly hinder speech processing, particularly in noisy settings (Atal & Schroeder, 1978), whereas an entrainment-based mechanism utilizing temporal predictions is more robust to filtering out irrelevant background noise (see Section 4).

Distinguishing between sensory-driven evoked responses and modulatory influences of intrinsic entrainment is not simple. However, one way to determine the relative contribution of entrainment is by comparing the tracking responses to rhythmic and non-rhythmic stimuli, or comparing speech-tracking responses at the beginning versus the end of an utterance, i.e. before and after temporal regularities have supposedly been established. It should be noted that the neuronal patterns hypothesized to underlie such entrainment (e.g. theta and gamma band oscillations) exist in the absence of any stimulation and are therefore intrinsic (e.g. Giraud et al., 2007; Lakatos et al., 2005).

3.4. Hierarchical representation of the temporal structure of speech

A point that has not yet been explored with regard to neuronal tracking of the speech envelope, but is of pressing interest, is how it relates to the hierarchical temporal structure of speech. As discussed above, in processing speech the brain is challenged to convert a single stream of complex fluctuating input into a hierarchical structure across multiple time scales. As mentioned above, the dominant rhythms of cortical neuronal oscillations – delta, theta and beta/gamma bands – strongly resemble the main time scales in speech. Moreover, ambient neuronal activity exhibits hierarchical cross-frequency coupling relationships, such that higher frequency activity tends to be nested within the envelope of a lower-frequency oscillation (Canolty & Knight, 2010; Canolty et al., 2006; Lakatos et al., 2005; Monto, Palva, Voipio, & Palva, 2008), just as spiking activity is nested within the phase of neuronal oscillations (Eggermont & Smith, 1995; Mazzoni et al., 2010; Montemurro et al., 2008), mirroring the hierarchical temporal structure of speech. Some have suggested that the similarity between the temporal properties of speech and neural activity is not coincidental but rather, that spoken communication evolved to fit constraints posed by the neuronal mechanisms of the auditory system and the motor articulation system (Gentilucci & Corballis, 2006; Liberman & Whalen, 2000). From an 'Active Sensing' perspective (Schroeder, Wilson, Radman, Scharfman, &
ferent tokens (illustrated in Fig. 2 top). Importantly, as in the stud- Lakatos, 2010), it is likely that the dynamics of motor system, artic-
ies by Ahissar et al. (2001), this effect was related to speech ulatory apparatus, and related cognitive operations constrain the
intelligibility and was not found for distorted speech. Similarly, evolution and development of both spoken language and auditory
distinct alpha phase patterns (8–12 Hz) were found to characterize processing dynamics. These observations have lead to suggestions
different vowels (Bonte et al., 2009). Further studies have directly that neuronal oscillations are utilized on-line as a means for sam-
shown that the temporal pattern of low-frequency auditory activ- pling the unfolding speech content at multiple temporal scales
ity tracks the temporal envelope of speech (Abrams et al., 2008; simultaneously and form the basis for adequately parsing and
Aiken & Picton, 2008; Lalor & Foxe, 2010). integrating the speech signal in the appropriate time windows
These scalp-recorded speech-tracking responses are likely dom- (Poeppel, 2003; Poeppel et al., 2008; Schroeder et al., 2008). There
inated by sequential evoked responses to large fluctuations in the is some physiological evidence for parallel processing in multiple
temporal envelope (e.g. word beginnings, stressed syllables), akin time scales in audition, emphasizing the theta and gamma bands
to the auditory N1 response which is primarily a theta-band phe- in particular (Ding & Simon, 2009; Theunissen & Miller, 1995), with
nomenon (see Aiken & Picton 2008). Thus, a major question with a noted asymmetry of the sensitivity to these two time scales in
regard to the above cited findings is whether speech-tracking is the right and left auditory cortices (Boemio, Fromm, Braun, & Poep-
entirely sensory-driven, reflecting continuous responses to the pel, 2005; Giraud et al., 2007; Poeppel, 2003). Further, the perspec-
auditory input, or whether it is influenced by neuronal entrain- tive of an array of oscillators parsing and processing information at
ment facilitating accurate speech tracking in a top-down fashion. multiple time scales has been mirrored in computational models
While this question has not been fully resolved empirically, the attempting to describe speech processing (Ahissar & Ahissar,
latter view is strongly supported by the findings that speech 2005; Ghitza & Greenberg, 2009) as well as beat perception in
156 E.M. Zion Golumbic et al. / Brain & Language 122 (2012) 151–161

Fig. 2. A schematic illustration of selective neuronal ‘entrainment’ to the temporal envelope of speech. On the left are the temporal envelopes derived from two different
speech tokens, as well as their combination in a ‘Cocktail Party’ simulation. Although the envelopes share similar spectral makeup, with dominant delta and theta band
modulations, they differ in their particular temporal/phase pattern (Pearson correlation r = 0.06). On the right is a schematic of the anticipated neural dynamics under the
Selective Entrainment Hypothesis, filtered between 1 and 20 Hz (zero-phase FIR filter). When different speech tokens are presented in isolation (top), the neuronal response
tracks the temporal envelope of the stimulus, as suggested by several studies. When the two speakers are presented simultaneously, (bottom) it is suggested that attention
enforces selective entrainment to the attended token. As a result, the neuronal response should be different when attending to different speakers in the same environment,
despite the identical acoustic input. Moreover, the dynamics of the response when attending to each speaker should be similar to those observed when listening to the same
speaker alone (with reduced/suppressed contribution of the sensory response to the unattended speaker).

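The nesting relationship at the heart of this hierarchical view can be made concrete with a small simulation. The sketch below is our illustration, not an analysis from any of the cited studies: it builds a synthetic signal whose 'theta'-band amplitude is modulated by a 'delta'-band phase, and quantifies that nesting with a mean-vector-length coupling index in the spirit of Canolty et al. (2006). All frequencies, gains, and durations are arbitrary choices for demonstration.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

fs = 100                                 # Hz; envelope-scale sampling rate
t = np.arange(0, 60, 1 / fs)             # 60 s of synthetic signal
rng = np.random.default_rng(0)

delta_phase = 2 * np.pi * 2 * t                                   # 2 Hz 'prosodic' cycle
x = np.sin(delta_phase)                                           # delta component
x += 0.5 * (1 + np.sin(delta_phase)) * np.sin(2 * np.pi * 5 * t)  # 5 Hz bursts nested in delta phase
x += 0.1 * rng.standard_normal(t.size)                            # measurement noise

def band(sig, lo, hi):
    """Zero-phase band-pass, so the filtering itself distorts no phases."""
    sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
    return sosfiltfilt(sos, sig)

phase = np.angle(hilbert(band(x, 1, 3)))      # delta-band phase
# Broadened 'theta' band (3.5-8 Hz) so the modulation sidebands of the
# 5 Hz bursts survive filtering and the amplitude modulation is retained.
amp = np.abs(hilbert(band(x, 3.5, 8)))        # theta-band amplitude
mvl = np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)
print(f"delta-phase/theta-amplitude coupling index: {mvl:.2f}")
```

A coupling index near zero would indicate no nesting; for this constructed signal it comes out well above zero, which is the signature that empirical phase–amplitude coupling analyses look for in cortical recordings.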
The findings that low-frequency neuronal activity tracks the temporal envelope of complex sounds and speech provide an initial proof-of-concept that this occurs. However, a systematic investigation of the relationship between different rhythms in the brain and different linguistic features is required, as well as investigation of the hierarchical relationships across frequencies.

4. Selective speech tracking – implications for the ‘Cocktail Party’ problem

4.1. The challenge of solving the ‘Cocktail Party’ problem

Studying the neural correlates of speech perception is inherently tied to the problem of selective attention since, more often than not, we listen to speech amid additional background sounds which we must filter out. Though auditory selective attention, as exemplified by the ‘Cocktail Party’ problem (Cherry, 1953), has been recognized for over 50 years as an essential cognitive capacity, its precise neuronal mechanisms remain unclear.

Tuning in to a particular speaker in a noisy environment involves at least two processes: identifying and separating different sources in the environment (stream segregation) and directing attention to the task-relevant one (stream selection). It is widely accepted that attention is necessary for stream selection, although there is an ongoing debate whether attention is required for stream segregation as well (Alain & Arnott, 2000; Cusack, Deeks, Aikman, & Carlyon, 2004; Sussman & Orr, 2007). Besides activating dedicated fronto-parietal ‘attentional networks’ (Corbetta & Shulman, 2002; Posner & Rothbart, 2007), attention influences neural activity in auditory cortex (Bidet-Caulet et al., 2007; Hubel, Henson, Rupert, & Galambos, 1959; Ray, Niebur, Hsiao, Sinai, & Crone, 2008; Woldorff et al., 1993). Moreover, attending to different attributes of a sound can modulate the pattern of the auditory neural response, demonstrating feature-specific top-down influence (Brechmann & Scheich, 2005). Thus, with regard to the ‘Cocktail Party’ problem, directing attention to the features of an attended speaker may serve to enhance their cortical representation and thus facilitate processing at the expense of other competing input.

What are those features, and how does attention operate, from a neural perspective, to enhance their representation? In a ‘Cocktail Party’ environment, speakers can be differentiated based on acoustic features (such as fundamental frequency (pitch), timbre, harmonicity, and intensity), spatial features (location, trajectory) and temporal features (mean rhythm, temporal envelope pattern). In order to focus auditory attention on a specific speaker, we most likely make use of a combination of these features (Fritz, Elhilali, David, & Shamma, 2007; Hafter, Sarampalis, & Loui, 2007; McDermott, 2009; Moore & Gockel, 2002). One way in which attention influences sensory processing is by sharpening neuronal receptive fields to enhance the selectivity for attended features, as found in the visual system (Reynolds & Heeger, 2009). For example, utilizing the tonotopic organization of the auditory system, attention could selectively enhance responses to an attended frequency while responsiveness at adjacent spectral frequencies is suppressed (Fritz, Elhilali, & Shamma, 2005; Fritz, Shamma, Elhilali, & Klein, 2003). Similarly, in the spatial domain, attention to a particular spatial location enhances responses in auditory neurons spatially tuned to the attended location, as well as cortical responses (Winkowski & Knudsen, 2006; Winkowski & Knudsen, 2008; Wu, Weissman, Roberts, & Woldorff, 2007). Given these tuning properties, if two speakers differ in their fundamental frequency and spatial location, biasing the auditory system towards these features could assist in selectively processing the attended speaker. Indeed, many current models emphasize frequency separation and spatial location as the main features underlying stream segregation (Beauvois & Meddis, 1996; Darwin & Hukin, 1999; Hartmann & Johnson, 1991; McCabe & Denham, 1997).
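The temporal feature class in the list above is easy to underestimate. A toy computation (ours, using arbitrary synthetic envelopes rather than data from any cited study) shows why it is informative: two quasi-rhythmic envelopes sharing the same mean 'syllabic' rate can still be nearly uncorrelated in time, so the moment-to-moment envelope pattern distinguishes speakers even when their average rhythm does not.

```python
import numpy as np

fs = 100                        # Hz
t = np.arange(0, 30, 1 / fs)    # 30 s 'utterances'
rng = np.random.default_rng(1)

def quasi_rhythmic_envelope():
    """~5 Hz 'syllabic' envelope whose phase drifts randomly over time."""
    drift = np.cumsum(rng.normal(0.0, 0.05, t.size))   # random phase walk
    return 1 + np.sin(2 * np.pi * 5 * t + 4 * np.pi * drift)

env_a = quasi_rhythmic_envelope()
env_b = quasi_rhythmic_envelope()
r = np.corrcoef(env_a, env_b)[0, 1]   # same mean rate, independent phase pattern
print(f"envelope correlation between the two 'speakers': r = {r:.2f}")
```

Across seeds the correlation hovers near zero even though both envelopes share the same dominant modulation rate, echoing the low envelope correlation between the two real speech tokens illustrated in Fig. 2.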
However, recent studies challenge the prevalent view that tonotopic and spatial separation are sufficient for perceptual stream segregation, since they fail to account for the observed influence of the relative timing of sounds on streaming percepts and neural responses (Elhilali & Shamma, 2008; Elhilali et al., 2009; Shamma et al., 2011). Since different acoustic streams evolve differently over time (i.e. have unique temporal envelopes), neuronal populations encoding the spectral and spatial features of a particular stream share common temporal patterns. In their model, Elhilali and colleagues (2008, 2009) propose that the temporal coherence among these neural populations serves to group them together and form a unified and distinct percept of an acoustic source. Thus, temporal structure has a key role in stream segregation. Furthermore, Elhilali and colleagues suggest that an effective system would utilize the temporal regularities within the sound pattern of a particular source to form predictions regarding upcoming input from that source, which are then compared to the actual input. This predictive capacity is particularly valuable for stream selection, as it allows directing attention not only to the particular spectral and spatial features of an attended speaker, but also to points in time when attended events are projected to occur, in line with the ‘Attention in Time’ hypothesis discussed above (Barnes & Jones, 2000; Jones et al., 2006; Large & Jones, 1999; Nobre & Coull, 2010; Nobre et al., 2007). Given the fundamental role of oscillations in controlling the timing of neuronal excitability, oscillatory entrainment is particularly well suited for implementing such temporal selective attention.

4.2. The Selective Entrainment Hypothesis

In an attempt to link this cognitive behavior to its underlying neurophysiology, Schroeder and Lakatos (2009b) recently proposed a Selective Entrainment Hypothesis, suggesting that the auditory system selectively tracks the temporal structure of only one behaviorally-relevant stream (Fig. 2, bottom). According to this view, the temporal regularities within a speech stream enable the system to synchronize the timing of high neuronal excitability with the predicted timing of events in the attended stream, thus selectively enhancing its processing. This hypothesis also implies that events from the unattended stream will often occur at a low-excitability phase, producing reduced neuronal responses and rendering them easier to ignore. The principle of Selective Entrainment has thus far been demonstrated using brief rhythmic stimuli in LFP recordings in monkeys as well as ECoG and EEG in humans (Besle et al., 2011; Lakatos, Karmos, Mehta, Ulbert, & Schroeder, 2008; Lakatos et al., 2005; Stefanics et al., 2010). In these studies, the phase of entrained oscillations at the driving frequency was set either in-phase or out-of-phase with the stimulus, depending on whether it was to be attended or ignored. These findings demonstrate the capacity of the brain to control the timing of neuronal excitability (i.e. the phase of the oscillation) based on task demands.

Extending the Selective Entrainment Hypothesis to naturalistic and continuous stimuli is not straightforward, due to their temporal complexity. Yet, although speech tokens share common modulation rates, they differ substantially in their precise time course and thus can be differentiated in their relative phases (see Fig. 2A). Selective Entrainment to speech would thus manifest in precise phase locking (tracking) to the temporal envelope of an attended speaker, with minimal phase locking to the unattended speaker. Initial evidence for selective entrainment to continuous speech was provided by Kerlin, Shahin, and Miller (2010), who showed that attending to different speakers in a scene manifests in a unique temporal pattern of delta and theta activity (a combination of phase and power). We have recently expanded on these findings (Zion Golumbic et al., 2010, in preparation) and demonstrated, using ECoG recordings in humans, that both delta and theta activity indeed selectively track the temporal envelope of an attended speaker, but not that of an unattended speaker nor the global acoustics of a simulated Cocktail Party.

5. Conclusion: Active Sensing

The hypothesis put forth in this paper is that dedicated neuronal mechanisms encode the temporal structure of naturalistic stimuli, and speech in particular. The data reviewed here suggest that this temporal representation is critical for speech parsing, decoding, and intelligibility, and may also serve as a basis for attentional selection of a relevant speech stream in a noisy environment. It is proposed that neuronal oscillations, across multiple frequency bands, provide the mechanistic infrastructure for creating an on-line representation of the multiple time scales in speech, and facilitate the parsing and hierarchical integration of smaller speech units for speech comprehension. Since speech unfolds over time, this representation needs to be created and constantly updated ‘on the fly’, as information accumulates. To achieve this task optimally, the system most likely does not rely solely on bottom-up input to shape the momentary representation, but rather also creates internal predictions regarding forthcoming stimuli to facilitate their processing. Computational models specifically emphasize the centrality of predictions for successful speech decoding and stream segregation (Denham & Winkler, 2006; Elhilali & Shamma, 2008). As noted above, multiple cues carry predictive information, including the intrinsic rhythmicity of speech, prosodic information and visual input, all of which contribute to forming expectations regarding upcoming events in the speech stream. In that regard, speech processing can be viewed as an example of ‘Active Sensing’, in which, rather than being passive and reactive, the brain is proactive and predictive in that it shapes its own internal activity in anticipation of probable upcoming input (Jones et al., 2006; Nobre & Coull, 2010; Schroeder et al., 2010).

Using prediction to facilitate neural coding of current and future events has often been suggested as a key strategy in multiple cognitive and sensory functions (Eskandar & Assad, 1999; Hosoya, Baccus, & Meister, 2005; Kilner, Friston, & Frith, 2007). ‘Active Sensing’ promotes a mechanistic approach to understanding how prediction influences speech processing in that it connects the allocation of attention to a particular speech stream with predictive ‘‘tuning’’ of auditory neuron ensemble dynamics to the anticipated temporal qualities of that stream. In speech, as in other rhythmic stimuli, entrainment of neuronal oscillations is well suited to serve Active Sensing, given the oscillations’ rhythmic nature and their role in controlling the timing of excitability fluctuations in the neuronal ensembles comprising the auditory system. The findings reviewed here support the notion that attention controls and directs entrainment to a behaviorally relevant stream, thus facilitating its processing over all other input in the scene. The rhythmic nature of oscillatory activity generates predictions as to when future events might occur, and thus allows for ‘sampling’ the acoustic environment with an appropriate rate and pattern. Of course, sensory processing cannot rely only on predictions, and when the real input arrives it is used to update the predictive model and create new predictions. One such mechanism, often used by engineers to model entrainment to a quasi-rhythmic source and proposed as a useful mechanism for speech decoding, is the phase-locked loop (PLL; Ahissar & Ahissar, 2005; Gardner, 2005; Ghitza, 2011; Ghitza & Greenberg, 2009). A PLL consists of a phase detector, which compares the phase of an external input with that of an ongoing oscillator, and adjusts the phase of the oscillator using feedback to optimize its synchronization with the external signal.
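The loop just described can be sketched in a few lines. The following toy simulation is a generic first-order PLL in discrete time, our illustration rather than the specific model of Ahissar and Ahissar or of Ghitza, with all parameter values chosen arbitrarily: an internal oscillator nudges its phase toward a quasi-rhythmic input whose rate drifts slowly, so that after convergence the oscillator's phase predicts the input's phase.

```python
import numpy as np

fs = 100                       # Hz
dt = 1 / fs
t = np.arange(0, 20, dt)

# Quasi-rhythmic input: nominal 5 Hz 'syllable' rate drifting slowly by +/- 0.3 Hz
inst_freq = 5 + 0.3 * np.sin(2 * np.pi * 0.1 * t)
input_phase = np.cumsum(2 * np.pi * inst_freq * dt)

osc_phase = np.zeros(t.size)
f0, gain = 4.5, 8.0            # detuned internal oscillator; feedback gain
for n in range(1, t.size):
    err = np.sin(input_phase[n - 1] - osc_phase[n - 1])          # phase detector
    osc_phase[n] = osc_phase[n - 1] + 2 * np.pi * f0 * dt + gain * err * dt

# Wrapped phase error; small values mean the oscillator anticipates the input
residual = np.angle(np.exp(1j * (input_phase - osc_phase)))
print(f"mean |phase error| over the final 5 s: {np.mean(np.abs(residual[-5 * fs:])):.2f} rad")
```

An array of such loops, one per time scale (delta, theta, gamma), is one concrete reading of the multi-scale oscillator machinery discussed in the sections above; note that the loop locks only while the feedback gain exceeds the frequency detuning, mirroring the circumscribed rate range over which neural speech tracking is observed.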
As discussed above, another source of predictive information for speech processing is visual input. It has been demonstrated that visual input brings about phase-resetting in auditory cortex (Kayser et al., 2008; Lakatos et al., 2009; Thorne et al., 2011), suggesting that viewing the facial movements of speech provides additional potent cues for adjusting the phase of entrained oscillations and optimizing temporal predictions (Schroeder et al., 2008). Luo et al. (2010) show that both auditory and visual cortices align phases with great precision in naturalistic audiovisual scenes. However, determining the degree of influence that visual input has on entrainment to speech, and its role in Active Sensing, invites additional investigation.

To summarize, the findings and concepts reviewed here support the idea of Active Sensing as a key principle governing perception in a complex natural environment. They support a view that the brain’s representations of stimuli, and particularly those of natural and continuous stimuli, emerge from an interaction between proactive, top-down modulation of neuronal activity and bottom-up sensory dynamics. In particular, they stress an important functional role for ambient neuronal oscillations, directed by attention, in facilitating selective speech processing and encoding its hierarchical temporal context, within which the content of the semantic message can be adequately appreciated.

Acknowledgments

The compilation and writing of this review paper was supported by NIH Grant 1 F32 MH093061-01 to EZG and MH060358.

References

Abrams, D. A., Nicol, T., Zecker, S., & Kraus, N. (2008). Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. Journal of Neuroscience, 28(15), 3958–3965.
Ahissar, E., & Ahissar, M. (2005). Processing of the temporal envelope of speech. In R. König, P. Heil, E. Budinger, & H. Scheich (Eds.), The auditory cortex: A synthesis of human and animal research (pp. 295–313). London: Lawrence Erlbaum Associates, Inc.
Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., & Merzenich, M. M. (2001). Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proceedings of the National Academy of Sciences of the United States of America, 98(23), 13367–13372.
Aiken, S. J., & Picton, T. W. (2008). Human cortical responses to the speech envelope. Ear and Hearing, 29(2), 139–157.
Alain, C., & Arnott, S. R. (2000). Selectively attending to auditory objects. Frontiers in Bioscience, 5, d202–d212.
Allen, J. B. (1994). How do humans process and recognize speech? IEEE Transactions on Speech and Audio Processing, 2(4), 567–577.
Allen, J. B. (2005). Articulation and intelligibility. San Rafael: Morgan and Claypool Publishers.
Arai, T., & Greenberg, S. (1998). Speech intelligibility in the presence of cross-channel spectral asynchrony. In Paper presented at the IEEE international conference on acoustics, speech and signal processing.
Arciuli, J., & Slowiaczek, L. M. (2007). The where and when of linguistic word-level prosody. Neuropsychologia, 45(11), 2638–2642.
Arnal, L. H., Wyart, V., & Giraud, A.-L. (2011). Transitions in neural oscillations reflect prediction errors generated in audiovisual speech. Nature Neuroscience, 14(6), 797–801.
Atal, B., & Schroeder, M. (1978). Predictive coding of speech signals and subjective error criteria. In Paper presented at the IEEE international conference on acoustics, speech, and signal processing (ICASSP ’78).
Bar, M. (2004). Visual objects in context. Nature Reviews Neuroscience, 5(8), 617–629.
Barnes, R., & Jones, M. R. (2000). Expectancy, attention, and time. Cognitive Psychology, 41(3), 254–311.
Beauvois, M. W., & Meddis, R. (1996). Computer simulation of auditory stream segregation in alternating-tone sequences. Journal of the Acoustical Society of America, 99(4), 2270–2280.
Bendor, D., & Wang, X. (2007). Differential neural coding of acoustic flutter within primate auditory cortex. Nature Neuroscience, 10(6), 763–771.
Bernstein, L. E., Eberhardt, S. P., & Demorest, M. E. (1989). Single-channel vibrotactile supplements to visual perception of intonation and stress. Journal of the Acoustical Society of America, 85(1), 397–405.
Besle, J., Fischer, C., Bidet-Caulet, A., Lecaignard, F., Bertrand, O., & Giard, M.-H. (2008). Visual activation and audiovisual interactions in the auditory cortex during speech perception: Intracranial recordings in humans. Journal of Neuroscience, 28(52), 14301–14310.
Besle, J., Schevon, C. A., Mehta, A. D., Lakatos, P., Goodman, R. R., McKhann, G. M., et al. (2011). Tuning of the human neocortex to the temporal dynamics of attended events. Journal of Neuroscience, 31(9), 3176–3185.
Bidet-Caulet, A., Fischer, C., Besle, J., Aguera, P., Giard, M., & Bertrand, O. (2007). Effects of selective attention on the electrophysiological representation of concurrent sounds in the human auditory cortex. Journal of Neuroscience, 27(35), 9252–9261.
Blesser, B. (1972). Speech perception under conditions of spectral transformation: I. Phonetic characteristics. Journal of Speech and Hearing Research, 15(1), 5–41.
Boemio, A., Fromm, S., Braun, A., & Poeppel, D. (2005). Hierarchical and asymmetric temporal sensitivity in human auditory cortices. Nature Neuroscience, 8(3), 389–395.
Bonte, M., Valente, G., & Formisano, E. (2009). Dynamic and task-dependent encoding of speech and voice by phase reorganization of cortical oscillations. Journal of Neuroscience, 29(6), 1699–1706.
Borsky, S., Tuller, B., & Shapiro, L. P. (1998). How to milk a coat: The effects of semantic and acoustic information on phoneme categorization. Journal of the Acoustical Society of America, 103(5), 2670–2676.
Brechmann, A., & Scheich, H. (2005). Hemispheric shifts of sound representation in auditory cortex with conceptual listening. Cerebral Cortex, 15(5), 578–587.
Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound. Cambridge, MA: MIT Press.
Buzsáki, G., & Chrobak, J. J. (1995). Temporal structure in spatially organized neuronal ensembles: A role for interneuronal networks. Current Opinion in Neurobiology, 5(4), 504–510.
Campanella, S., & Belin, P. (2007). Integrating face and voice in person perception. Trends in Cognitive Sciences, 11(12), 535–543.
Campbell, R. (2008). The processing of audio–visual speech: Empirical and neural bases. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 363(1493), 1001–1010.
Canolty, R., Edwards, E., Dalal, S., Soltani, M., Nagarajan, S., Kirsch, H., et al. (2006). High gamma power is phase-locked to theta oscillations in human neocortex. Science, 313(5793), 1626–1628.
Canolty, R. T., & Knight, R. T. (2010). The functional role of cross-frequency coupling. Trends in Cognitive Sciences, 14(11), 506–515.
Chandrasekaran, C., Trubanova, A., Stillittano, S., Caplier, A., & Ghazanfar, A. A. (2009). The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), e1000436.
Chandrasekaran, C., Turesson, H. K., Brown, C. H., & Ghazanfar, A. A. (2010). The influence of natural scene dynamics on auditory cortical activity. Journal of Neuroscience, 30(42), 13919–13931.
Chen, F. (2011). The relative importance of temporal envelope information for intelligibility prediction: A study on cochlear-implant vocoded speech. Medical Engineering & Physics, 33(8), 1033–1038.
Cherry, E. C. (1953). Some experiments on the recognition of speech, with one and with two ears. Journal of the Acoustical Society of America, 25, 975–979.
Chi, T., Gao, Y., Guyton, M. C., Ru, P., & Shamma, S. (1999). Spectro-temporal modulation transfer functions and speech intelligibility. Journal of the Acoustical Society of America, 106(5), 2719–2732.
Connine, C. M. (1987). Constraints on interactive processes in auditory word recognition: The role of sentence context. Journal of Memory and Language, 26(5), 527–538.
Connine, C. M., & Clifton, C. J. (1987). Interactive use of lexical information in speech perception. Journal of Experimental Psychology: Human Perception and Performance, 13(2), 291–299.
Cooper, W. E., & Paccia-Cooper, J. (1980). Syntax and speech. Cambridge, MA: Harvard University Press.
Cooper, W. E., Paccia, J. M., & Lapointe, S. G. (1978). Hierarchical coding in speech timing. Cognitive Psychology, 10(2), 154–177.
Corbetta, M., & Shulman, G. L. (2002). Control of goal-directed and stimulus-driven attention in the brain. Nature Reviews Neuroscience, 3(3), 201–215.
Creutzfeldt, O., Hellweg, F., & Schreiner, C. (1980). Thalamocortical transformation of responses to complex auditory stimuli. Experimental Brain Research, 39(1), 87–104.
Cusack, R., Deeks, J., Aikman, G., & Carlyon, R. P. (2004). Effects of location, frequency region, and time course of selective attention on auditory scene analysis. Journal of Experimental Psychology: Human Perception and Performance, 30(4), 643–656.
Cutler, A., & Norris, D. (1988). The role of strong syllables in segmentation for lexical access. Journal of Experimental Psychology: Human Perception and Performance, 14(1), 113–121.
Darwin, C. J., & Hukin, R. W. (1999). Auditory objects of attention: The role of interaural time differences. Journal of Experimental Psychology: Human Perception and Performance, 25(3), 617–629.
de Ribaupierre, F., Goldstein, M. H., & Yeni-Komshian, G. (1972). Cortical coding of repetitive acoustic pulses. Brain Research, 48, 205–225.
Denham, S. L., & Winkler, I. (2006). The role of predictive models in the formation of auditory streams. Journal of Physiology – Paris, 100(1–3), 154–170.
Ding, N., & Simon, J. Z. (2009). Neural representations of complex temporal modulations in the human auditory cortex. Journal of Neurophysiology, 102(5), 2731–2743.
Drullman, R. (2006). The significance of temporal modulation frequencies for speech intelligibility. In S. Greenberg & W. Ainsworth (Eds.), Listening to speech: An auditory perspective (pp. 39–48). Mahwah, NJ: Erlbaum.
Drullman, R., Festen, J. M., & Plomp, R. (1994a). Effect of reducing slow temporal modulations on speech reception. Journal of the Acoustical Society of America, 95(5), 2670–2680.
Drullman, R., Festen, J. M., & Plomp, R. (1994b). Effect of temporal envelope smearing on speech reception. Journal of the Acoustical Society of America, 95(2), 1053–1064.
Dugué, P., Le Bouquin-Jeannès, R., Edeline, J.-M., & Faucon, G. (2010). A physiologically based model for temporal envelope encoding in human primary auditory cortex. Hearing Research, 268(1–2), 133–144.
Eggermont, J. J. (2001). Between sound and perception: Reviewing the search for a neural code. Hearing Research, 157(1–2), 1–42.
Eggermont, J. J., & Smith, G. M. (1995). Synchrony between single-unit activity and local field potentials in relation to periodicity coding in primary auditory cortex. Journal of Neurophysiology, 73(1), 227–245.
Eimas, P. D. (1999). Segmental and syllabic representations in the perception of speech by young infants. Journal of the Acoustical Society of America, 105(3), 1901–1911.
Elhilali, M., Fritz, J. B., Klein, D. J., Simon, J. Z., & Shamma, S. A. (2004). Dynamics of precise spike timing in primary auditory cortex. Journal of Neuroscience, 24(5), 1159–1172.
Elhilali, M., Ma, L., Micheyl, C., Oxenham, A. J., & Shamma, S. A. (2009). Temporal coherence in the perceptual organization and cortical representation of auditory scenes. Neuron, 61(2), 317–329.
Elhilali, M., & Shamma, S. A. (2008). A cocktail party with a cortical twist: How cortical mechanisms contribute to sound segregation. Journal of the Acoustical Society of America, 124(6), 3751–3771.
Endress, A. D., & Hauser, M. D. (2010). Word segmentation with universal prosodic cues. Cognitive Psychology, 61(2), 177–199.
Eskandar, E. N., & Assad, J. A. (1999). Dissociation of visual, motor and predictive signals in parietal cortex during visual guidance. Nature Neuroscience, 2(1), 88–93.
Fant, G. (1960). Acoustic theory of speech production. The Hague: Mouton.
Ferrand, L., Segui, J., & Grainger, J. (1996). Masked priming of word and picture naming: The role of syllabic units. Journal of Memory and Language, 35, 708–723.
Fodor, J. D., & Ferreira, F. (1998). Reanalysis in sentence processing. Dordrecht: Kluwer.
French, N. R., & Steinberg, J. C. (1947). Factors governing the intelligibility of speech sounds. Journal of the Acoustical Society of America, 19, 90–119.
Friederici, A. D. (2002). Towards a neural basis of auditory sentence processing. Trends in Cognitive Sciences, 6(2), 78–84.
Fries, P. (2009). Neuronal gamma-band synchronization as a fundamental process in cortical computation. Annual Review of Neuroscience, 32(1), 209–224.
Fries, P., Nikolic, D., & Singer, W. (2007). The gamma cycle. Trends in Neurosciences,
Ghitza, O., & Greenberg, S. (2009). On the possible role of brain rhythms in speech perception: Intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica, 66(1–2), 113–126.
Giraud, A.-L., Kleinschmidt, A., Poeppel, D., Lund, T. E., Frackowiak, R. S. J., & Laufs, H. (2007). Endogenous cortical rhythms determine cerebral specialization for speech perception and production. Neuron, 56(6), 1127–1134.
Giraud, A.-L., Lorenzi, C., Ashburner, J., Wable, J., Johnsrude, I., Frackowiak, R., et al. (2000). Representation of the temporal envelope of sounds in the human brain. Journal of Neurophysiology, 84(3), 1588–1598.
Giraud, A.-L., & Poeppel, D. (2011). Speech perception from a neurophysiological perspective. In D. Poeppel, T. Overath, A. N. Popper, & R. R. Fay (Eds.), The human auditory cortex. Berlin: Springer.
Grant, K. W., & Greenberg, S. (2001). Speech intelligibility derived from asynchronous processing of auditory–visual information. In Paper presented at the proceedings of the workshop on audio–visual speech processing, Scheelsminde, Denmark.
Greenberg, S., & Ainsworth, W. (2006). Listening to speech: An auditory perspective. Mahwah, NJ: Erlbaum.
Greenberg, S., & Ainsworth, W. A. (2004). Speech processing in the auditory system. New York: Springer-Verlag.
Greenberg, S., Arai, T., & Silipo, R. (1998). Speech intelligibility derived from exceedingly sparse spectral information. In Int. conf. spoken lang. proc. (pp. 2803–2806).
Greenberg, S., Carvey, H., Hitchcock, L., & Chang, S. (2003). Temporal properties of spontaneous speech: A syllable-centric perspective. Journal of Phonetics, 31, 465–485.
Hafter, E., Sarampalis, A., & Loui, P. (2007). Auditory attention and filters. In W. Yost (Ed.), Auditory perception of sound sources. Springer-Verlag.
Harding, S., Cooke, M., & König, P. (2007). Auditory gist perception: An alternative to attentional selection of auditory streams? In Paper presented at the 4th international workshop on attention in cognitive systems, WAPCV, Hyderabad, India.
Hartmann, W. M., & Johnson, D. (1991). Stream segregation and peripheral channeling. Music Perception: An Interdisciplinary Journal, 9(2), 155–183.
Hickok, G. (1993). Parallel parsing: Evidence from reactivation in garden-path sentences. Journal of Psycholinguistic Research, 22(2), 239–250.
Hillyard, S., Hink, R., Schwent, V., & Picton, T. (1973). Electrical signs of selective attention in the human brain. Science, 182, 177–180.
Hochstein, S., & Ahissar, M. (2002). View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5), 791–804.
Holt, L. L., & Lotto, A. J. (2002). Behavioral examinations of the level of auditory processing of speech context effects. Hearing Research, 167(1–2), 156–169.
Hosoya, T., Baccus, S. A., & Meister, M. (2005). Dynamic predictive coding by the retina. Nature, 436(7047), 71–77.
Hubel, D. H., Henson, C. O., Rupert, A., & Galambos, R. (1959). Attention units in the auditory cortex. Science, 129(3358), 1279–1280.
30(7), 309–316. Jones, M. R., Johnston, H. M., & Puente, J. (2006). Effects of auditory pattern structure
Fries, P., Womelsdorf, T., Oostenveld, R., & Desimone, R. (2008). The effects of visual on anticipatory and reactive attending. Cognitive Psychology, 53(1), 59–96.
stimulation and selective visual attention on rhythmic neuronal Joris, P. X., Schreiner, C. E., & Rees, A. (2004). Neural processing of amplitude-
synchronization in macaque area V4. Journal of Neuroscience, 28, 4823–4835. modulated sounds. Physiological Reviews, 84(2), 541–577.
Friesen, L. M., Shannon, R. V., Baskent, D., & Wang, X. (2001). Speech recognition in Kayser, C., Montemurro, M. A., Logothetis, N. K., & Panzeri, S. (2009). Spike-phase
noise as a function of the number of spectral channels: Comparison of acoustic coding boosts and stabilizes information carried by spatial and temporal spike
hearing and cochlear implants. Journal of the Acoustical Society of America, patterns. Neuron, 61(4), 597–608.
110(2), 1150–1163. Kayser, C., Petkov, C. I., & Logothetis, N. K. (2008). Visual modulation of neurons in
Frisina, R. D. (2001). Subcortical neural coding mechanisms for auditory temporal auditory cortex. Cerebral Cortex, 18(7), 1560–1574.
processing. Hearing Reserach, 158(1–2), 1–27. Kerlin, J. R., Shahin, A. J., & Miller, L. M. (2010). Attentional gain control of ongoing
Fritz, J., Shamma, S., Elhilali, M., & Klein, D. J. (2003). Rapid task-related plasticity of cortical speech representations in a ‘‘cocktail party’’. Journal of Neuroscience,
spectrotemporal receptive fields in primary auditory cortex. Nature 30(2), 620–628.
Neuroscience, 6, 1216–1223. Kiebel, S. J., von Kriegstein, K., Daunizeau, J., & Friston, K. J. (2009). Recognizing
Fritz, J. B., Elhilali, M., David, S. V., & Shamma, S. A. (2007). Auditory attention – sequences of sequences. PLoS Computational Biology, 5(8), e1000464.
Focusing the searchlight on sound. Current Opinion in Neurobiology, 17(4), Kilner, J., Friston, K., & Frith, C. (2007). Predictive coding: An account of the mirror
437–455. neuron system. Cognitive Processing, 8(3), 159–166.
Fritz, J. B., Elhilali, M., & Shamma, S. A. (2005). Differential dynamic plasticity of A1 Kingsbury, B. E. D., Morgan, N., & Greenberg, S. (1998). Robust speech recognition
receptive fields during multiple spectral tasks. Journal of Neuroscience, 25(33), using the modulation spectrogram. Speech Communication, 25(1–3), 117–132.
7623–7635. Lakatos, P., Chen, C.-M., O’Connell, M. N., Mills, A., & Schroeder, C. E. (2007).
Funkenstein, H. H., & Winter, P. (1973). Responses to acoustic stimuli of units in the Neuronal oscillations and multisensory interaction in primary auditory cortex.
auditory cortex of awake squirrel monkeys. Experimental Brain Research, 18(5), Neuron, 53(2), 279–292.
464–488. Lakatos, P., Karmos, G., Mehta, A. D., Ulbert, I., & Schroeder, C. E. (2008). Entrainment
Gaese, B. H., & Ostwald, J. (1995). Temporal coding of amplitude and frequency of neuronal oscillations as a mechanism of attentional selection. Science,
modulation in the rat auditory cortex. European Journal of Neuroscience, 7(3), 320(5872), 110–113.
438–450. Lakatos, P., O’Connell, M. N., Barczak, A., Mills, A., Javitt, D. C., & Schroeder, C. E.
Galambos, R., Makeig, S., & Talmachoff, P. J. (1981). A 40-Hz auditory potential (2009). The leading sense: Supramodal control of neurophysiological context by
recorded from the human scalp. Proceedings of the National Academy of Sciences attention. Neuron, 64(3), 419–430.
of the United States of America, 78(4), 2643–2647. Lakatos, P., Shah, A. S., Knuth, K. H., Ulbert, I., Karmos, G., & Schroeder, C. E. (2005).
Ganong, W. r. (1980). Phonetic categorization in auditory word perception. Journal An oscillatory hierarchy controlling neuronal excitability and stimulus
of Experimental Psychology: Human Perception and Performance, 6(1), 110–125. processing in the auditory cortex. Journal of Neurophysiology, 94(3), 1904–1911.
Gardner, F. M. (2005). Phaselock techniques (3rd ed.). New York: Wiley. Lalor, E. C., & Foxe, J. J. (2010). Neural responses to uninterrupted natural speech can
Gehr, D. D., Komiya, H., & Eggermont, J. J. (2000). Neuronal responses in cat primary be extracted with precise temporal resolution. European Journal of Neuroscience,
auditory cortex to natural and altered species-specific calls. Hearing Research, 31(1), 189–193.
150(1–2), 27–42. Lange, K. (2009). Brain correlates of early auditory processing are attenuated by
Gentilucci, M., & Corballis, M. C. (2006). From manual gesture to speech: A gradual expectations for time and pitch. Brain and Cognition, 69(1), 127–137.
transition. Neuroscience and Biobehavioral Reviews, 30(7), 949–960. Lange, K., & Röder, B. (2006). Orienting attention to points in time improves
Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding stimulus processing both within and across modalities. Journal of Cognitive
guided by cascaded oscillators locked to the input rhythm. Frontiers in Neuroscience, 18(5), 715–729.
Psychology: Auditory Cognitive Neuroscience, 2, 130. doi:10.3389/ Large, E. W. (2000). On synchronizing movements to music. Human Movement
fpsyg.2011.00130. Science, 19(4), 527–566.
Large, E. W., & Jones, M. R. (1999). The dynamics of attending: How people track time-varying events. Psychological Review, 106(1), 119–159.
Lehiste, I. (1970). Suprasegmentals. Cambridge, MA: MIT Press.
Liberman, A. M., & Whalen, D. H. (2000). On the relation of speech to language. Trends in Cognitive Sciences, 4(5), 187–196.
Liégeois-Chauvel, C., Lorenzi, C., Trébuchon, A., Régis, J., & Chauvel, P. (2004). Temporal envelope processing in the human left and right auditory cortices. Cerebral Cortex, 14(7), 731–740.
Luo, H., Liu, Z., & Poeppel, D. (2010). Auditory cortex tracks both auditory and visual stimulus dynamics using low-frequency neuronal phase modulation. PLoS Biology, 8(8), e1000445.
Luo, H., & Poeppel, D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54(6), 1001–1010.
Mann, V. A. (1980). Influence of preceding liquid on stop-consonant perception. Perception & Psychophysics, 28(5), 407–412.
Mattys, S. (1997). The use of time during lexical processing and segmentation: A review. Psychonomic Bulletin & Review, 4(3), 310–329.
Mazzoni, A., Whittingstall, K., Brunel, N., Logothetis, N. K., & Panzeri, S. (2010). Understanding the relationships between spike rate and delta/gamma frequency bands of LFPs and EEGs using a local cortical network model. NeuroImage, 52(3), 956–972.
McCabe, S. L., & Denham, M. J. (1997). A model of auditory streaming. Journal of the Acoustical Society of America, 101(3), 1611–1621.
McDermott, J. H. (2009). The cocktail party problem. Current Biology, 19(22), R1024–R1027.
Mehler, J., Dommergues, J. Y., Frauenfelder, U., & Segui, J. (1981). The syllable's role in speech segmentation. Journal of Verbal Learning and Verbal Behavior, 20(3), 298–305.
Miller, G. A., Heise, G. A., & Lichten, W. (1951). The intelligibility of speech as a function of the context of the test materials. Journal of Experimental Psychology, 41(5), 329–335.
Montemurro, M. A., Rasch, M. J., Murayama, Y., Logothetis, N. K., & Panzeri, S. (2008). Phase-of-firing coding of natural visual stimuli in primary visual cortex. Current Biology, 18(5), 375–380.
Monto, S., Palva, S., Voipio, J., & Palva, J. M. (2008). Very slow EEG fluctuations predict the dynamics of stimulus detection and oscillation amplitudes in humans. Journal of Neuroscience, 28(33), 8268–8272.
Moore, B. C. J., & Gockel, H. (2002). Factors influencing sequential stream segregation. Acta Acustica united with Acustica, 88, 320–332.
Munhall, K. G., Jones, J. A., Callan, D. E., Kuratate, T., & Vatikiotis-Bateson, E. (2004). Visual prosody and speech intelligibility. Psychological Science, 15(2), 133–137.
Nagarajan, S. S., Cheung, S. W., Bedenbaugh, P., Beitel, R. E., Schreiner, C. E., & Merzenich, M. M. (2002). Representation of spectral and temporal envelope of twitter vocalizations in common marmoset primary auditory cortex. Journal of Neurophysiology, 87(4), 1723–1737.
Nahum, M., Nelken, I., & Ahissar, M. (2008). Low-level information and high-level perception: The case of speech in noise. PLoS Biology, 6(5), e126.
Nelken, I., & Ahissar, M. (2006). High-level and low-level processing in the auditory system: The role of primary auditory cortex. In P. L. Divenyi, S. Greenberg, & G. Meyer (Eds.), Dynamics of speech production and perception (pp. 343–354). Amsterdam: IOS Press.
Nobre, A. C., Correa, A., & Coull, J. T. (2007). The hazards of time. Current Opinion in Neurobiology, 17(4), 465–470.
Nobre, A. C., & Coull, J. T. (2010). Attention and time. Oxford, UK: Oxford University Press.
Norris, D., McQueen, J. M., Cutler, A., & Butterfield, S. (1997). The possible-word constraint in the segmentation of continuous speech. Cognitive Psychology, 34(3), 191–243.
Nourski, K. V., Reale, R. A., Oya, H., Kawasaki, H., Kovach, C. K., Chen, H., et al. (2009). Temporal envelope of time-compressed speech represented in the human auditory cortex. Journal of Neuroscience, 29(49), 15564–15574.
Nygaard, L., & Pisoni, D. (1995). Speech perception: New directions in research and theory. In J. Miller & P. Eimas (Eds.), Handbook of perception and cognition: Speech, language and communication (pp. 63–96). San Diego, CA: Academic Press.
Otake, T., Hatano, G., Cutler, A., & Mehler, J. (1993). Mora or syllable? Speech segmentation in Japanese. Journal of Memory and Language, 32(2), 258–278.
Pavlovic, C. V. (1987). Derivation of primary parameters and procedures for use in speech intelligibility predictions. Journal of the Acoustical Society of America, 82(2), 413–422.
Phillips, C., & Wagers, M. (2007). Relating structure and time in linguistics and psycholinguistics. In G. Gaskell (Ed.), Oxford handbook of psycholinguistics (pp. 739–756). Oxford, UK: Oxford University Press.
Pickett, J. M., & Pollack, I. (1963). Intelligibility of excerpts from fluent speech: Effects of rate of utterance and duration of excerpt. Language and Speech, 6(3), 151–164.
Pickles, J. O. (2008). An introduction to the physiology of hearing (3rd ed.). London: Emerald.
Picton, T. W., Skinner, C. R., Champagne, S. C., Kellett, A. J. C., & Maiste, A. C. (1987). Potentials evoked by the sinusoidal modulation of the amplitude or frequency of a tone. Journal of the Acoustical Society of America, 82(1), 165–178.
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as 'asymmetric sampling in time'. Speech Communication, 41, 245–255.
Poeppel, D., Idsardi, W. J., & van Wassenhove, V. (2008). Speech perception at the interface of neurobiology and linguistics. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 363(1493), 1071–1086.
Posner, M. I., & Rothbart, M. K. (2007). Research on attention networks as a model for the integration of psychological science. Annual Review of Psychology, 58(1), 1–23.
Ray, S., Niebur, E., Hsiao, S. S., Sinai, A., & Crone, N. E. (2008). High-frequency gamma activity (80–150 Hz) is increased in human cortex during selective attention. Clinical Neurophysiology, 119(1), 116–133.
Repp, B. H. (1982). Phonetic trading relations and context effects: New experimental evidence for a speech mode of perception. Psychological Bulletin, 92(1), 81–110.
Reynolds, J. H., & Heeger, D. J. (2009). The normalization model of attention. Neuron, 61(2), 168–185.
Rodenburg, M., Verweij, C., & van der Brink, G. (1972). Analysis of evoked responses in man elicited by sinusoidally modulated noise. International Journal of Audiology, 11(5–6), 283–293.
Rosen, S. (1992). Temporal information in speech: Acoustic, auditory and linguistic aspects. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 336(1278), 367–373.
Rotman, Y., Bar-Yosef, O., & Nelken, I. (2001). Relating cluster and population responses to natural sounds and tonal stimuli in cat primary auditory cortex. Hearing Research, 152(1–2), 110–127.
Saberi, K., & Perrott, D. R. (1999). Cognitive restoration of reversed speech. Nature, 398(6730), 760.
Sanders, L. D., & Astheimer, L. B. (2008). Temporally selective attention modulates early perceptual processing: Event-related potential evidence. Perception & Psychophysics, 70(4), 732–742.
Scarborough, R., Keating, P., Mattys, S. L., Cho, T., & Alwan, A. (2009). Optical phonetics and visual perception of lexical and phrasal stress in English. Language and Speech, 52, 135–175.
Schreiner, C. E., & Urbas, J. V. (1988). Representation of amplitude modulation in the auditory cortex of the cat. II. Comparison between cortical fields. Hearing Research, 32(1), 49–63.
Schroeder, C. E., & Lakatos, P. (2009a). The gamma oscillation: Master or slave? Brain Topography, 22(1), 24–26.
Schroeder, C. E., & Lakatos, P. (2009b). Low-frequency neuronal oscillations as instruments of sensory selection. Trends in Neurosciences, 32(1), 9–18.
Schroeder, C. E., Lakatos, P., Kajikawa, Y., Partan, S., & Puce, A. (2008). Neuronal oscillations and visual amplification of speech. Trends in Cognitive Sciences, 12(3), 106–113.
Schroeder, C. E., Wilson, D. A., Radman, T., Scharfman, H., & Lakatos, P. (2010). Dynamics of active sensing and perceptual selection. Current Opinion in Neurobiology, 20(2), 172–176.
Shamma, S. A., Elhilali, M., & Micheyl, C. (2011). Temporal coherence and attention in auditory scene analysis. Trends in Neurosciences, 34(3), 114–123.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
Shastri, L., Chang, S., & Greenberg, S. (1999). Syllable detection and segmentation using temporal flow neural networks. In Paper presented at the 14th international congress of phonetic sciences, San Francisco, CA.
Shukla, M., Nespor, M., & Mehler, J. (2007). An interaction between prosody and statistics in the segmentation of fluent speech. Cognitive Psychology, 54(1), 1–32.
Sovijärvi, A. (1975). Detection of natural complex sounds by cells in the primary auditory cortex of the cat. Acta Physiologica Scandinavica, 93(3), 318–335.
Stefanics, G., Hangya, B., Hernádi, I., Winkler, I., Lakatos, P., & Ulbert, I. (2010). Phase entrainment of human delta oscillations can mediate the effects of expectation on reaction speed. Journal of Neuroscience, 30(41), 13578–13585.
Stevens, K. N. (1998). Acoustic phonetics. Cambridge, MA: MIT Press.
Stone, M. A., Fullgrabe, C., & Moore, B. C. J. (2010). Relative contribution to speech intelligibility of different envelope modulation rates within the speech dynamic range. Journal of the Acoustical Society of America, 128(4), 2127–2137.
Sumby, W. H., & Pollack, I. (1954). Visual contribution to speech intelligibility in noise. Journal of the Acoustical Society of America, 26, 212–215.
Summerfield, Q. (1992). Lipreading and audio–visual speech perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences, 335(1273), 71–78.
Sussman, E. S., & Orr, M. (2007). The role of attention in the formation of auditory streams. Attention, Perception, & Psychophysics, 69(1), 136–152.
Tank, D. W., & Hopfield, J. J. (1987). Neural computation by concentrating information in time. Proceedings of the National Academy of Sciences of the United States of America, 84(7), 1896–1900.
Theunissen, F., & Miller, J. P. (1995). Temporal encoding in nervous systems: A rigorous definition. Journal of Computational Neuroscience, 2(2), 149–162.
Thorne, J. D., De Vos, M., Viola, F. C., & Debener, S. (2011). Cross-modal phase reset predicts auditory task performance in humans. Journal of Neuroscience, 31(10), 3853–3861.
Todd, N. P. M., Lee, C. S., & O'Boyle, D. J. (2006). A sensorimotor theory of speech perception: Implications for learning, organization and recognition. In S. Greenberg & W. A. Ainsworth (Eds.), Listening to speech: An auditory perspective (pp. 351–373). Mahwah, NJ: Lawrence Erlbaum Associates.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2005). Visual speech speeds up the neural processing of auditory speech. Proceedings of the National Academy of Sciences of the United States of America, 102(4), 1181–1186.
van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). Temporal window of integration in auditory–visual speech perception. Neuropsychologia, 45(3), 598–607.
Viemeister, N. F. (1979). Temporal modulation transfer functions based upon modulation thresholds. Journal of the Acoustical Society of America, 66(5), 1364–1380.
Wang, X., Merzenich, M. M., Beitel, R., & Schreiner, C. E. (1995). Representation of a species-specific vocalization in the primary auditory cortex of the common marmoset: Temporal and spectral characteristics. Journal of Neurophysiology, 74(6), 2685–2706.
Warren, R. (2004). The relation of speech perception to the perception of nonverbal auditory patterns. In S. Greenberg & W. A. Ainsworth (Eds.), Listening to speech: An auditory perspective (pp. 333–350). Mahwah, NJ: Lawrence Erlbaum Associates.
Warren, R., Bashford, J., & Gardner, D. (1990). Tweaking the lexicon: Organization of vowel sequences into words. Attention, Perception, & Psychophysics, 47(5), 423–432.
Whitfield, I. C., & Evans, E. F. (1965). Responses of auditory cortical neurons to stimuli of changing frequency. Journal of Neurophysiology, 28(4), 655–672.
Wightman, C. W., Shattuck-Hufnagel, S., Ostendorf, M., & Price, P. J. (1992). Segmental durations in the vicinity of prosodic phrase boundaries. Journal of the Acoustical Society of America, 91(3), 1707–1717.
Will, U., & Berg, E. (2007). Brain wave synchronization and entrainment to periodic acoustic stimuli. Neuroscience Letters, 424(1), 55–60.
Winkowski, D. E., & Knudsen, E. I. (2006). Top-down gain control of the auditory space map by gaze control circuitry in the barn owl. Nature, 439(7074), 336–339.
Winkowski, D. E., & Knudsen, E. I. (2008). Distinct mechanisms for top-down control of neural gain and sensitivity in the owl optic tectum. Neuron, 60(4), 698–708.
Woldorff, M. G., Gallen, C. C., Hampson, S. A., Hillyard, S. A., Pantev, C., Sobel, D., et al. (1993). Modulation of early sensory processing in human auditory cortex during auditory selective attention. Proceedings of the National Academy of Sciences, 90(18), 8722–8726.
Womelsdorf, T., Schoffelen, J.-M., Oostenveld, R., Singer, W., Desimone, R., Engel, A. K., et al. (2007). Modulation of neuronal interactions through neuronal synchronization. Science, 316(5831), 1609–1612.
Wu, C. T., Weissman, D. H., Roberts, K. C., & Woldorff, M. G. (2007). The neural circuitry underlying the executive control of auditory spatial attention. Brain Research, 1134, 187–198.
Zion Golumbic, E. M., Bickel, S., Cogan, G. B., Mehta, A. D., Schevon, C. A., Emerson, R. G., et al. (2010). Low-frequency phase modulations as a tool for attentional selection at a cocktail party: Insights from intracranial EEG and MEG in humans. In Paper presented at the Neurobiology of Language Conference, San Diego, CA.