
Hearing Research 348 (2017) 70–77

Research Paper

Indexing cortical entrainment to natural speech at the phonemic level: Methodological considerations for applied research

Giovanni M. Di Liberto a,*, Edmund C. Lalor a,b,**

a Trinity College Institute of Neuroscience, School of Engineering, and Trinity Centre for Bioengineering, Trinity College Dublin, Dublin 2, Ireland
b Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, NY, USA

* Corresponding author. Trinity Centre for Bioengineering, Trinity College Dublin, Dublin 2, Ireland.
** Corresponding author. Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, NY, USA.
E-mail addresses: diliberg@tcd.ie (G.M. Di Liberto), edlalor@tcd.ie (E.C. Lalor).
http://dx.doi.org/10.1016/j.heares.2017.02.015

Article info

Article history:
Received 10 October 2016
Received in revised form 15 February 2017
Accepted 17 February 2017
Available online 27 February 2017

Keywords:
Language impairment
Continuous speech
EEG
Categorical perception
Clinical research
Neuromarker

Abstract

Speech is central to human life. As such, any delay or impairment in receptive speech processing can have a profoundly negative impact on the social and professional life of a person. Thus, being able to assess the integrity of speech processing in different populations is an important goal. Current standardized assessment is mostly based on psychometric measures that do not capture the full extent of a person's speech processing abilities and that are difficult to administer in some subject groups. A potential alternative to these tests would be to derive "direct", objective measures of speech processing from cortical activity. One such approach was recently introduced and showed that it is possible to use electroencephalography (EEG) to index cortical processing at the level of phonemes from responses to continuous natural speech. However, a large amount of data was required for such analyses. This limits the usefulness of this approach for assessing speech processing in particular cohorts for whom data collection is difficult. Here, we used EEG data from 10 subjects to assess whether measures reflecting phoneme-level processing could be reliably obtained using only 10 min of recording time from each subject. This was done successfully using a generic modeling approach wherein the data from a training group composed of 9 subjects were combined to derive robust predictions of the EEG signal for new subjects. This allowed the derivation of indices of cortical activity at the level of phonemes and the disambiguation of responses to specific phonetic features (e.g., stop, plosive, and nasal consonants) with limited data. This objective approach has the potential to complement psychometric measures of speech processing in a wide variety of subjects.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Speech is central to human life. Over the past 30 years, neuroscience has provided tremendous insights into the neurobiology of language using brain imaging. As a result, it is now generally understood that speech is processed in a hierarchically organized system of functionally distinct cortical areas (Hickok and Poeppel, 2007; Okada et al., 2010; Peelle et al., 2010; Poeppel, 2014). However, much work remains to be done to elucidate the details of this system, particularly in the context of natural speech. This is an important issue in and of itself. However, it is also important in that there are significant numbers of people worldwide who suffer from some form of speech and language impairments. These can arise as a consequence of developmental disorders (Leonard, 2014) or from a decline in related cortical functions (e.g., through ageing, psychosis, injury; Kemper and Anagnopoulos, 2008; Mesulam et al., 2014; Ross et al., 2007). A better understanding of the underlying speech processing network and an ability to identify specific impairments within that network are crucial to developing clinically useful assessments of speech and language in these populations.

Speech and language impairment can disrupt the ability to understand auditory speech and to communicate efficiently in a number of ways, which correspond to different symptoms. In this context, standardized assessment of such impairments is usually pursued using a number of behavioral tests (e.g., non-verbal hearing, speech, and language tests; standardized tests of intelligence) (Ford and Dahinten, 2005; Gardner et al., 2006; Tomblin et al., 1996). However, these measures are inadequate at capturing the full extent of a person's impairment and should be considered only as one aspect of a comprehensive assessment
process (Flanagan et al., 1997; Mody and Belliveau, 2013). Furthermore, some of these measures cannot be derived for some groups, such as infants or participants with reading impairment or no reading skills.

A complementary approach is to "directly" investigate the causes that underpin such conditions, rather than evaluate "indirect" effects on specific behavioral markers. In this sense, neuroimaging provides an opportunity to derive measures directly related to the cortical processing of speech in the human brain. In particular, noninvasive, safe, functional brain measurements (EEG, MEG, fMRI, NIRS) have now been proven feasible for use with both children (starting at birth) and adults (Aslin and Mehler, 2005; Kuhl, 2010; Kuhl et al., 2005; McNealy et al., 2006). Neuroimaging research in speech perception has traditionally focused on neural activation patterns corresponding to the perception of minimal linguistic contrasts (e.g., how we distinguish "cat" from "mat") (Obleser et al., 2007; Peter et al., 2016; Salmelin, 2007), cortical responses to syntactic or semantic violations (Kutas and Hillyard, 1980; Lau et al., 2008), and the processing of the low-level acoustics of an incoming sound stimulus (Lakatos et al., 2005; Overath et al., 2008). However, the study of speech comprehension needs to account for how humans process continuous natural speech, which is a task performed efficiently by healthy people in their everyday life and is profoundly different from, for example, the perception of isolated syllables (Bonte et al., 2006).

Recent studies showed an innovative way to investigate continuous speech perception in humans, by indexing how cortical activity tracks the dynamics of that speech. This phenomenon of cortical entrainment has been demonstrated in humans for the amplitude envelope of speech using magnetoencephalography (MEG; Ahissar et al., 2001; Luo and Poeppel, 2007), electroencephalography (EEG; Aiken and Picton, 2008; Lalor and Foxe, 2010), and electrocorticography (ECoG; Nourski et al., 2009). And the effect has been quantified using a cross-correlation analysis between the speech envelope and the recorded neural data (Ahissar et al., 2001; Abrams et al., 2008; Nourski et al., 2009; Millman et al., 2015). However, this approach is ill-suited to the study of naturalistic stimuli (Crosse et al., 2016). The reason for this is that naturalistic stimuli vary in a non-random way, and so such stimuli are correlated with time-shifted versions of themselves. This leads to temporal smearing when cross-correlating the stimulus with shifted versions of the response. For this reason, system identification methods based on ridge regression have recently been applied in this context, and were shown to be effective for investigating the cortical processing of natural speech (Machens et al., 2004; Crosse et al., 2016). And, in turn, the ability to use more natural stimuli facilitates the design of more engaging paradigms. These issues, and others, are discussed in several recent reviews on the approaches for and applications of the speech-entrainment phenomenon (Ding and Simon, 2014; Wöstmann et al., 2016; Crosse et al., 2016).

In this context, a recent study (Di Liberto et al., 2015) introduced a framework for disentangling phoneme-level cortical responses from cortical activity elicited by low-level acoustics. Results from this study indicated that low-frequency cortical entrainment to speech features reflects more than a simple acoustic analysis of the stimulus, and that it also reflects phoneme-level processing. Therefore, this framework provides a potential methodology for investigating speech encoding under a variety of conditions and in a variety of cohorts. This could include research on the causes of speech impairments in particular cohorts by deriving direct indices of cortical activity at specific levels of the speech processing hierarchy using non-invasive EEG. However, short experimental times are preferable in applied research (Mirkovic et al., 2015), whereas Di Liberto et al., 2015 used a recording time of 72 min per subject, which may constitute an obstacle when studying particular cohorts (e.g., young children).

Here we introduce a modification to our previously introduced framework that allows for a significant reduction of the experimental time needed to derive such indices of phoneme-level cortical entrainment. Our previous framework involved relating different representations of a speech signal to ongoing EEG. In particular, it involved building a model for each subject that would map a specific speech representation to that subject's own EEG signal. This type of approach has previously been used to "decode" how attention is being deployed in so-called cocktail party environments (O'Sullivan et al., 2015; Mirkovic et al., 2015). The modification we make here follows an innovation introduced in these attention decoding studies (O'Sullivan et al., 2015; Mirkovic et al., 2015). Specifically, these authors showed that it was possible to decode attention for an individual subject using a generic model that was built from the data of other subjects. This led to a large reduction in how much data was needed from each subject to perform decoding (Mirkovic et al., 2015). Here, we seek to do something similar in the context of our approach for assessing phoneme-level speech processing. While it is known that not many electrodes are needed for this approach to be effective (by construction; see the forward modeling approach in Crosse et al., 2016), the ability to use the framework with small amounts of data from individual subjects is uncertain. To clarify this issue, an extensive analysis was conducted to assess the minimum experimental time needed to detect meaningful cortical responses. The goal of the analysis was to show that it is possible to utilize short data sets across multiple subjects to make inferences about speech processing in individual subjects. Specifically, we aimed to show that we can robustly index phoneme-level processing in the context of natural speech in cases of limited amounts of experimental data.

2. Material and methods

Ten healthy subjects (7 male) aged between 23 and 38 years participated in the experiment. The study was undertaken in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the School of Psychology at Trinity College Dublin. Each subject provided written informed consent. Subjects reported no history of hearing impairment or neurological disorder.

2.1. Stimuli and experimental procedure

Subjects undertook 28 trials, each of ~155 s in length, in which they were presented with an audiobook version of a classic work of fiction read by a male American English speaker. The trials preserved the storyline, with neither repetitions nor discontinuities. All stimuli were presented monophonically at a sampling rate of 44,100 Hz using Sennheiser HD650 headphones and Presentation software from Neurobehavioral Systems (http://www.neurobs.com). Testing was carried out in a dark room and subjects were instructed to maintain visual fixation on a crosshair centered on the screen for the duration of each trial, and to minimize eye blinking and all other motor activities.

2.2. Data acquisition and preprocessing

Electroencephalographic (EEG) data were recorded from 128 scalp electrodes (plus 2 mastoid channels). Data were filtered over the range 0–134 Hz and digitized with a sampling frequency of 512 Hz using a BioSemi Active Two system. Data were analyzed offline using MATLAB software (The MathWorks Inc.). EEG data were digitally filtered between 1 and 8 Hz using a Chebyshev Type 2 zero-phase filter.
In order to reduce processing time, all EEG data were then down-sampled to 64 Hz. EEG channels with a variance that exceeded three times that of the surrounding channels were labelled as bad channels and replaced by an estimate calculated using spherical spline interpolation (EEGLAB; Delorme and Makeig, 2004). All channels were then re-referenced to the average of the two mastoid channels with the goal of maximizing the EEG responses to the auditory stimuli (Luck, 2005).
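As a concrete illustration of this preprocessing chain, a minimal sketch in Python/SciPy is given below. The filter order, stop-band attenuation, and the median-based bad-channel criterion are illustrative assumptions; the study itself used MATLAB, and bad channels were repaired with EEGLAB's spherical spline interpolation, which additionally requires the electrode geometry.

```python
# Minimal sketch of the preprocessing described above (Python/SciPy),
# not the authors' MATLAB pipeline. Filter order and attenuation are
# illustrative assumptions.
import numpy as np
from scipy.signal import cheby2, sosfiltfilt, resample_poly

FS_IN, FS_OUT = 512, 64  # recording rate -> analysis rate (Hz)

def preprocess(eeg, fs_in=FS_IN, fs_out=FS_OUT):
    """Band-pass 1-8 Hz with a zero-phase Chebyshev Type 2 filter,
    then downsample. eeg: array of shape (n_samples, n_channels)."""
    # sosfiltfilt runs the filter forward and backward, cancelling the
    # phase delay (the paper's "zero-phase" filtering).
    sos = cheby2(N=4, rs=40, Wn=[1, 8], btype='bandpass',
                 fs=fs_in, output='sos')
    filtered = sosfiltfilt(sos, eeg, axis=0)
    # resample_poly applies an anti-aliasing filter before decimation.
    return resample_poly(filtered, up=fs_out, down=fs_in, axis=0)

def flag_bad_channels(eeg):
    """Flag channels whose variance exceeds 3x the median channel
    variance (a simplification of the 'surrounding channels' criterion).
    The flagged channels would then be spline-interpolated in EEGLAB."""
    var = eeg.var(axis=0)
    return np.where(var > 3 * np.median(var))[0]
```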
2.3. TRF computation

The method used here aims to derive a quantitative mapping between particular representations of a speech signal and the recorded EEG. This mapping is commonly known as a temporal response function (TRF). A TRF can be interpreted as a filter that describes how the brain transforms a stimulus feature into the continuous neural response. Because the mapping described is from stimulus to EEG signal, the resulting models are referred to as forward TRFs. Furthermore, as will become clear in the following section, the stimulus here is often represented as a multivariate feature vector. As such, we refer to our TRFs as multivariate TRFs (mTRFs) (Crosse et al., 2016). mTRFs were calculated using custom-written, publicly available software (http://www.mee.tcd.ie/lalorlab/resources.html).
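Conceptually, a forward mTRF reduces to regularized linear regression on a design matrix of time-lagged copies of the stimulus features. The sketch below illustrates that idea under simplified assumptions (a single continuous recording, lags expressed in samples); it is an independent illustration, not the released mTRF Toolbox implementation linked above.

```python
# Forward-TRF estimation via ridge regression on lagged stimuli.
# This mirrors the idea in Crosse et al., 2016, but is only a sketch.
import numpy as np

def lag_matrix(stim, min_lag, max_lag):
    """Design matrix of time-lagged stimulus copies.
    stim: (n_samples, n_features); lags in samples
    (e.g., 0..16 covers 0-250 ms at 64 Hz)."""
    n, f = stim.shape
    lags = range(min_lag, max_lag + 1)
    X = np.zeros((n, f * len(lags)))
    for i, lag in enumerate(lags):
        shifted = np.roll(stim, lag, axis=0)
        # zero the samples that np.roll wrapped around
        if lag > 0:
            shifted[:lag] = 0
        elif lag < 0:
            shifted[lag:] = 0
        X[:, i * f:(i + 1) * f] = shifted
    return X

def fit_trf(stim, eeg, min_lag, max_lag, lam):
    """Ridge solution w = (X'X + lam*I)^-1 X'y, one column per channel."""
    X = lag_matrix(stim, min_lag, max_lag)
    XtX = X.T @ X
    XtX[np.diag_indices_from(XtX)] += lam  # Tikhonov regularization
    return np.linalg.solve(XtX, X.T @ eeg)

def predict_eeg(stim, w, min_lag, max_lag):
    """Predict EEG from a (possibly new) stimulus and fitted weights."""
    return lag_matrix(stim, min_lag, max_lag) @ w
```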
2.4. Speech representations

Following Di Liberto et al., 2015, we estimated mTRFs based on four distinct representations of the speech stimulus:

1. Broadband amplitude envelope (E): This was calculated as $\mathrm{Env}(t) = |x_a(t)|$, with $x_a(t) = x(t) + j\hat{x}(t)$, where $x_a(t)$ is the complex analytic signal obtained as the sum of the original speech $x(t)$ and its Hilbert transform $\hat{x}(t)$ (see the code sketch after this list). The envelope of speech was then downsampled to the same sampling frequency as the EEG data, after applying a zero-phase anti-aliasing filter. A significant number of papers have been published in recent years based on relating the envelope of a continuous speech signal to neural data (Aiken and Picton, 2008; Ding and Simon, 2014; Millman et al., 2015; Nourski et al., 2009; Zion Golumbic et al., 2013).
2. Spectrogram (S): This was obtained by first filtering the speech stimulus into 16 frequency bands between 250 Hz and 8 kHz according to Greenwood's equation (equal distance on the basilar membrane; Greenwood, 1961) using Chebyshev type 2 filters (order 100), and then computing the amplitude envelope (as above) for each frequency band.
3. Phonetic features (F): This representation was computed using the Prosodylab-Aligner (Gorman et al., 2011) which, given a speech file and the corresponding orthographic transcription, partitions each word into phonemes from the American English International Phonetic Alphabet (IPA). It then performs forced alignment (Gorman et al., 2011) and returns the starting and ending time-points for each phoneme. This information was then converted into a multivariate time-series composed of indicator variables, which are binary arrays (one for each phoneme) that are active for the time-points in which phonemes occurred. Each phoneme was then converted into a space of 19 phonetic features (Mesgarani et al., 2014), which are a distinctive subset of those defined by Chomsky and Halle (1968) and which describe the articulatory and acoustic properties of the phonetic content of speech. In particular, the chosen features are related to the manner of articulation, the voicing of a consonant, the backness of a vowel, and the place of articulation. Each phoneme consists of a combination of distinct features; therefore this is a set of non-mutually exclusive descriptors.
4. Combination of phonetic features and spectrogram (FS): Finally, we propose a model that combines F and S. This was obtained by concatenating F and S into a single data matrix. This representation consists of 19 phonetic features and 16 frequency bands; therefore FS has 35 dimensions.
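To make two of these representations concrete, the following sketch shows how E and F might be assembled. The feature table and the alignment output format are hypothetical placeholders for the Prosodylab-Aligner output, not its actual interface.

```python
# Illustrative construction of the E (envelope) and F (phonetic feature)
# representations. Feature inventory and alignment format are assumptions.
import numpy as np
from scipy.signal import hilbert, resample_poly

def broadband_envelope(x, fs_audio, fs_eeg):
    """|analytic signal| = |x + j*Hilbert(x)|, then downsample to the
    EEG rate (resample_poly includes an anti-aliasing filter)."""
    env = np.abs(hilbert(x))
    return resample_poly(env, up=fs_eeg, down=fs_audio)

# Hypothetical table: phoneme -> indices of its active phonetic features
FEATURES = {'b': [0, 3], 'p': [0], 'm': [0, 5], 'a': [10, 12]}  # etc.
N_FEATURES = 19

def phonetic_feature_matrix(alignment, n_samples, fs_eeg):
    """alignment: list of (phoneme, t_start, t_end) tuples in seconds,
    as produced by a forced aligner. Returns (n_samples, 19) binary F."""
    F = np.zeros((n_samples, N_FEATURES))
    for phon, t0, t1 in alignment:
        s0, s1 = int(t0 * fs_eeg), int(t1 * fs_eeg)
        F[s0:s1, FEATURES[phon]] = 1  # features active during the phoneme
    return F

# FS is simply the column-wise concatenation of F and the 16-band
# spectrogram S: np.hstack([F, S]) -> 35 columns.
```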
2.5. Model evaluation

In order to quantify how well the EEG reflects the encoding of the various speech representations, we used a model-based analysis. The idea is to fit a model (i.e., an mTRF) that describes the forward mapping from a speech representation to the EEG and then to test that model by seeing how accurately it can predict EEG from a new trial. Specifically, we used a leave-one-out cross-validation approach, whereby an mTRF was trained on 27 trials and used to predict the EEG data from the remaining trial. This process was repeated until the data from all trials were predicted. EEG prediction accuracies were evaluated by determining a correlation coefficient (Pearson's r) between the actual and predicted EEG data on each electrode channel. A single prediction correlation value was then derived by averaging these correlations over our chosen set of electrodes of interest (this procedure is described further in the following section). Note that silent time intervals were removed from the correlation evaluation (the same intervals were removed from all speech representations).

For each participant, predictions of their EEG signals were derived using mTRFs that were fit on data from that specific subject (subject-specific models). This approach was compared to a subject-independent method, which consisted of using models obtained by averaging the subject-specific mTRFs obtained from all other subjects (generic models).

2.6. Model parameter selection

Three key considerations when carrying out model-based predictions of EEG data are 1) the channels to be predicted; 2) the choice of time-lags between stimulus and data to optimize prediction; and 3) the choice of the regularization parameter λ. In terms of channels, we focused on predicting channels that strongly reflect auditory cortical activity. In particular, a set of 12 electrodes from 2 bilateral areas of the fronto-central scalp with the highest prediction correlations was selected (6 on the left side of the scalp, and their symmetrical counterparts on the right; Di Liberto et al., 2015). The EEG prediction correlations were then averaged across these channels. While we have chosen to average across 12 channels to ensure some robustness in terms of our results, there is no requirement to predict data from so many channels. Indeed, in our previous study, we found no qualitative difference across these channels. And, indeed, it may be possible to derive effectively the same information from a single channel (properly referenced). In terms of time-lags, we first computed mTRFs using a broad time-window from −150 to 450 ms. Based on visual inspection of the average mTRFs across all subjects, this time interval was then restricted to lags from 0 to 250 ms, as no visible response was present outside this range. Finally, an important consideration when calculating the mTRFs is that of regularization. As extensively described by Crosse et al., 2016, the mTRF procedure is based on ridge (or Tikhonov) regression, which uses regularization to reduce overfitting by smoothing, in this case, across the time dimension. This parameter λ was optimized for each model, while the same overall optimal value was used across subjects and electrodes. The optimal λ values were 1, 10, and 10 for S, F, and FS, respectively.
Note that these values depend on the specific EEG recordings and stimuli. Hence, this parameter should be tuned for each specific dataset following the procedure described by Crosse et al., 2016.
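Putting the earlier sketches together, the evaluation procedure might be implemented as follows (reusing fit_trf and predict_eeg from the TRF sketch above). The lag range and the λ grid are illustrative assumptions; as just noted, λ should be tuned per dataset.

```python
# Leave-one-trial-out evaluation of a forward model.
import numpy as np

def loo_prediction_r(trials, lags=(0, 16), lam=10.0):
    """trials: list of (stim, eeg) pairs, one per ~155 s trial.
    Returns the Pearson r averaged over channels and folds."""
    rs = []
    for i in range(len(trials)):
        train = [t for j, t in enumerate(trials) if j != i]
        stim = np.vstack([s for s, _ in train])
        eeg = np.vstack([e for _, e in train])
        w = fit_trf(stim, eeg, *lags, lam)
        s_test, e_test = trials[i]
        pred = predict_eeg(s_test, w, *lags)
        # Pearson r per channel, then averaged over channels of interest
        r = [np.corrcoef(pred[:, c], e_test[:, c])[0, 1]
             for c in range(e_test.shape[1])]
        rs.append(np.mean(r))
    return np.mean(rs)

# lambda tuning: evaluate a grid and keep the value maximizing mean r
# best_lam = max([0.1, 1, 10, 100],
#                key=lambda L: loo_prediction_r(trials, lam=L))
```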

2.7. Multi-dimensional scaling analysis

The mTRF mapping functions produced by the above analysis can be informative about how particular speech features are represented in the EEG. One useful technique for analyzing these multivariate mapping functions is known as multidimensional scaling (MDS). This is an analytic vehicle which projects each data point into a location in a multi-dimensional space, such that the distance between points represents a measure of similarity. We used MDS analysis on our mTRFs and assessed how EEG responses to different speech features clustered in this MDS space using a k-means clustering algorithm. In our case, the data points fed to the MDS analysis were linear regression model weights, which are the output of the mTRF multivariate fit and correspond to EEG-phoneme mappings at all of the 12 EEG channels of interest for each subject. How well different clusters of phonetic-feature responses clustered in the MDS space was quantified using an F-Score measure (Rijsbergen, 1979). Because of the randomized nature of the k-means algorithm, the results reported below were averaged over 100 repetitions of this procedure.
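A rough Python equivalent of this analysis, using scikit-learn's MDS and k-means, could look like the sketch below. The study used custom MATLAB code, so the API and the binary class labeling here are assumptions for illustration.

```python
# Sketch of the MDS / k-means discriminability analysis.
import numpy as np
from sklearn.manifold import MDS
from sklearn.cluster import KMeans
from sklearn.metrics import f1_score

def feature_discriminability(trf_weights, labels, n_repeats=100):
    """trf_weights: (n_phonetic_features, n_lags * n_channels) rows of
    the fitted model; labels: binary array marking which rows belong to
    class A vs class B (e.g., vowels vs plosives).
    Returns the mean F-score of 2-means clustering in MDS space."""
    coords = MDS(n_components=2).fit_transform(trf_weights)
    scores = []
    for _ in range(n_repeats):            # k-means init is randomized
        pred = KMeans(n_clusters=2, n_init=10).fit_predict(coords)
        # cluster indices are arbitrary: score both labelings, keep best
        scores.append(max(f1_score(labels, pred),
                          f1_score(labels, 1 - pred)))
    return np.mean(scores)
```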

2.8. Statistical analyses

Unless otherwise stated, all statistical analyses were performed using a repeated-measures, one-way ANOVA to compare distributions of Pearson correlation values across models and to compare F-Score classifications across response intervals. The values reported use the convention F(df, df_error). Greenhouse-Geisser corrections were made if Mauchly's test of sphericity was not met. All post-hoc model comparisons were performed using Bonferroni-corrected paired t-tests.

3. Results

3.1. Neural evidence for phonetic processing in generic models

To investigate whether phoneme-level cortical activity can be indexed using a generic modeling approach, multivariate temporal response functions (mTRFs) (Crosse et al., 2016) were built to describe the linear mapping from speech to the scalp-recorded EEG signal. In particular, speech was represented using its acoustic envelope (E) and spectrogram (S), phonetic features (F), and a combination of acoustic spectrogram and phonetics (FS). mTRF models were built for each speech representation and used to build predictions of the EEG signal using cross-validation to avoid overfitting. The quality of these predictions, measured with Pearson's correlation, indexed how well the EEG reflects the processing of low- and high-level speech features. Fig. 1A,B show this result when the EEG predictions were derived using a subject-specific model, i.e., given a subject, the predictions of their EEG signal were obtained using a model fit on that same subject using cross-validation across trials. While no significant difference emerged between the S and F models, the model fit on the combination of the two (FS) produced the highest EEG prediction correlations (ANOVA: F(3.0,7.0) = 12.1, p = 0.004; post-hoc paired t-test comparisons of FS with all other models: p = 0.001, p = 0.005, p = 0.023 for E, S, F respectively). This result indicates that low-frequency EEG indexes the cortical entrainment to categorical phoneme-level features of speech (Di Liberto et al., 2015).

A similar analysis was conducted to assess whether the same effect emerged when using the generic modeling approach. In particular, the same processing steps as in the subject-specific case were performed with one important difference: when predicting the EEG for a given subject, we used models that were fit to data from all the other subjects. Because EEG responses vary across subjects as a result of cortical folding, EEG prediction correlations for generic models were expected to be lower than in the subject-specific approach. This was confirmed by the results in Fig. 1C,D (two-way ANOVA; effect of modeling approach: F(1,72) = 6.1, p = 0.016). However, crucially, the combined model FS still produced the best EEG predictions in this case (ANOVA: F(3.0,7.0) = 21.9, p = 0.001; post-hoc paired t-test comparisons of FS with all other models: p < 0.001, p < 0.001, p = 0.024 for E, S, F respectively). Again, this suggests that this modeling approach is sensitive to the effects of categorical phoneme-level processing, even when using generic models.

Fig. 1. EEG data were recorded while subjects listened to natural speech from an audiobook. Speech was represented using E and S (speech acoustics), F (phonetics), and FS (which combines acoustics and phonetics). Multivariate temporal response functions (mTRFs) were built to describe the mapping from each representation of speech to the EEG recording and used to predict the EEG signal with cross-validation (↑: greater than all others, p < 0.01; ↓: smaller than all others, p < 0.01; *p < 0.05). (A) Correlations between EEG and its predictions are shown for each subject and each of the 4 speech representations. The predictions were obtained using subject-specific models, i.e., trained and tested within each subject using cross-validation (subjects were re-arranged according to the performance of the FS model for visualization purposes). This figure corresponds to Di Liberto et al. (2015), Fig. 2B; it is not identical because of minor changes in the data preprocessing (e.g., down-sampling rate). (B) The same data are here shown grouped by the 4 speech representations. Each data point refers to a specific subject (a specific color saturation was assigned to each subject). (C) Correlations between EEG and its predictions using a generic model, i.e., trained on all subjects with the exception of the test subject. The subject arrangement is consistent with (A). (D) The same values obtained for generic models are here grouped by speech representation. Each data point refers to a specific subject and the colors match those shown in (B). Because prediction correlations were calculated using 72 min of data, chance level here is effectively zero.
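In code, the generic-model step amounts to averaging the subject-specific regression weights of the training subjects before prediction. A sketch, reusing fit_trf and predict_eeg from the earlier TRF example (lags and λ are placeholders):

```python
# Generic-model prediction: average the subject-specific TRF weights of
# all training subjects, then predict the left-out subject's EEG.
import numpy as np

def generic_prediction(all_subjects, test_idx, lags=(0, 16), lam=10.0):
    """all_subjects: list of (stim, eeg) pairs, one concatenated
    recording per subject. The test subject contributes no training
    data."""
    ws = [fit_trf(s, e, *lags, lam)
          for i, (s, e) in enumerate(all_subjects) if i != test_idx]
    w_generic = np.mean(ws, axis=0)      # average of the 9 other models
    s_test, e_test = all_subjects[test_idx]
    return predict_eeg(s_test, w_generic, *lags), e_test
```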
3.2. Generic models index phonetic processing for limited experimental time
In our previous paper (Di Liberto et al., 2015), we suggested that one could potentially derive an isolated measure of phoneme-level processing from the EEG prediction framework. In particular, subtracting the Pearson's correlations for the FS and S models (FS-S) should represent such an index of phoneme-level processing. Such a measure would be unsuited for clinical application if it required long experimental times. For this reason, we wished to investigate its robustness as a function of EEG recording time in both subject-specific and generic modeling approaches.
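The FS-S index itself is simple to compute once the per-subject prediction correlations are available. A minimal sketch, using the Wilcoxon signed rank test applied in this section (r_fs and r_s are assumed to be per-subject mean Pearson r values from the evaluation sketch above):

```python
# FS-S: difference between prediction correlations of the FS and S
# models, tested against zero across subjects.
import numpy as np
from scipy.stats import wilcoxon

def fs_s_index(r_fs, r_s):
    """Positive values indicate EEG variance explained by phonetic
    features over and above the spectrogram."""
    diff = np.asarray(r_fs) - np.asarray(r_s)
    stat, p = wilcoxon(diff)     # signed rank test against zero
    return diff.mean(), p
```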
The subject-specific FS-S measure showed a logarithmically increasing relationship as a function of testing time. Crucially, with less than 30 min of EEG data, the FS-S measure was not statistically significantly different from zero across subjects (Wilcoxon signed rank test, p > 0.05; Fig. 2A). A significant FS-S measure emerged only when the amount of recording data was greater than or equal to 30 min (Wilcoxon signed rank test, p < 0.05). Crucially, this issue, which hampers the clinical applicability of such an approach, did not apply to the generic models. The overarching rationale is that the use of data from many subjects produces a model that converges to an effective fit even for short recording times. For instance, since each model was averaged across 9 subjects, an experimental time of 10 min corresponded to 90 min of data in total, which is more data than was available for any subject-specific model. These considerations are confirmed by the results in Fig. 2A, which show the significant advantage of the generic modeling approach over the subject-specific approach when up to 40 min of data were available from each subject (paired Wilcoxon signed rank test, p = 0.001, p = 0.020, p = 0.014, p = 0.037, p = 0.064, p = 0.084, p = 0.064, respectively, from 10 to 70 min of recording data).

Fig. 2B provides further support for these considerations by showing how recording time impacts the mTRF model fit for a particular subject (S4). Qualitatively, the regression weights for the generic models required only 20–30 min of data to converge to a stable fit for both S and F. In contrast, the corresponding subject-specific models showed a more gradual and prolonged convergence. Importantly, the subject-specific and generic mTRFs converged to qualitatively similar results (please note that the x-axis was up-sampled and smoothed/interpolated for visualization purposes). This observation is confirmed by the numerical results in Fig. 1A,C; in fact the EEG prediction accuracies obtained for subject-specific and generic models for subject 4 (S4) were comparable. Intuitively, a generic model that does not resemble the corresponding subject-specific model may lead to poor EEG prediction accuracies for the generic model (for example, see S6).

These results highlight the applicability of the generic modeling approach in the study of single subjects with limited amounts of data.

Fig. 2. (A) The speech-specific neural index FS-S for increasing experimental time (minutes) is reported for the subject-specific models and compared with the result obtained using a generic model, i.e., trained on all subjects but the one used to evaluate the EEG prediction correlation. The shaded areas represent the standard error of the mean. The subject-specific modeling approach needs at least 30 min of data to be sensitive to the FS-S effect, while the generic model produces significant results also for short experimental times. Importantly, a significant difference emerges between the two approaches for recording times under or equal to 40 min (**p < 0.01, *p < 0.05), which means that it is advantageous to use a generic model when little training data is available. Also, the use of subject-specific models does not improve on the performance of a generic model, even when the whole 72 min dataset is used (p > 0.05). (B) The mTRF regression weights are shown for the S (frequency vs time-lags) and F (phonetic features vs time-lags) models for increasing experimental time (minutes). Given a selected subject (S4, see Fig. 1), this panel compares the mTRFs for its subject-specific model (fit on S4) and its generic model (fit on all the others). Smoothing/interpolation of the x-axis was performed for visual purposes.

3.3. Sensitivity of EEG to phonetic features for limited experimental time

In our previous study we showed that EEG-speech mTRF models could be examined in terms of how they reflected the processing of
different phonetic features (Di Liberto et al., 2015). And we suggested that another possibly useful way of assessing speech encoding was to quantify how well the mTRFs to different phonetic features could be discriminated. But how much experimental data is needed in order to carry out these types of discriminative analyses? This is important because it determines which phonetic features have an impact on the model performances for different amounts of experimental data. Also, it determines which phonetic features can be further studied, for example to infer possible differences between subject groups. Here, we addressed this question by quantifying the discriminability between phonetic feature groups in the mTRF models. Specifically, unsupervised multidimensional scaling (MDS) was applied to the subject-specific phonetic-feature mTRF models using the 12 bilateral electrodes of interest. This approach allowed us to build a geometric space in which the Euclidean distance between phonetic features corresponds to the similarity of their neural responses. In this space, k-means clustering (k = 2) was performed to study the pairwise discriminability between feature groups (vowels, semi-vowels, fricatives, plosives), which was quantified by calculating the F-scores (the harmonic mean of precision and recall) between the actual grouping and the result of the clustering.

Fig. 3 shows the evolution of the discriminability between phonetic feature categories with the amount of experimental data. Chance level for the F-score measure (which changes for each pair of feature groups) was calculated by repeating this same procedure
(MDS and F-Score) after randomly relabelling each phoneme occurrence and converting that into phonetic features. Each discriminability value reported in Fig. 3 was obtained by subtracting its chance level (shuffled over 50 randomly relabelled versions of the stimulus) from the F-score derived using the correct stimulus. Individual subject values and their mean are reported in the figure. In line with Di Liberto et al., 2015, EEG activity in response to vowels (vow) could be significantly discriminated from that to fricative (fri) and plosive (plo) consonants (paired Wilcoxon signed rank test, p < 0.05), while no significant difference between vowels and semi-vowels (semi) emerged. Importantly, these considerations held even when only small amounts of experimental data (10 min) were available. The individual subject data clarify that, when enough data are available, the significant effects for the comparisons vow-fri (from 30 min of data) and vow-plo (from 50 min of data) correspond to an above-chance discriminability for every single subject. Also, EEG activity to vowels was more discriminable from plosive than from fricative consonants (this difference was significant for all experiment durations with the exception of 20 min; paired Wilcoxon signed rank test, p < 0.05). Interestingly, the discriminability of classes within consonants reveals a different pattern. In particular, semi-vowels required at least 20 min of experimental time to emerge as different from plosive consonants. Also, weakly significant discriminability emerged between plosive and fricative consonants (p ≲ 0.05; with the exception of one data-point). In this case, a recording time of at least 60 (plo-semi) or 70 min (plo-fri and fri-semi) was required to achieve an above-chance result for every single subject.

Fig. 3. A measure of discriminability between phonetic features was derived from a multidimensional scaling analysis (MDS) on the phonetic-feature mTRF model. In both panels, the x-axis indicates the amount of recording data and the y-axis reports a discriminability score (F-Score). The comparison of each pair of feature-sets produced a distinct chance level, which was subtracted from the corresponding discriminability scores for visualization clarity. Empty gray circles indicate non-significant discrimination values (p > 0.05). The small dots indicate the results at the individual subject level. (A) Vowels were discriminable from fricative and plosive consonants, and this difference emerged with 10 min of data. Vowels and semi-vowels were not significantly discriminable at any training time. (B) Plosive consonants and semi-vowels were significantly discriminated when at least 20 min of data was used. Similarly, plosive and fricative consonants are significantly discriminable for all recording durations, with the exception of 30 min.

4. Discussion

Language impairments are disorders that affect the understanding and/or use of spoken or written language, and they carry the risk of poor social functioning, reduced independence and restricted employment opportunities (Clegg and Henderson, 1999; Paul, 2007; Reed, 2012). The disorder may involve the form of language (phonology, syntax, and morphology), its meaning (semantics), or its use (pragmatics), and includes deficits such as specific language impairment (SLI), aphasia, and dyslexia, among others. Early identification is crucial for improving long-term outcomes in many of these conditions, especially for early school-age children, who are less likely to have subsequent reading and academic problems if diagnosed early (Catts et al., 2002; Clark, 2010). In this context, the ability to noninvasively derive robust markers of natural speech processing at specific levels of the cortical hierarchy could be of great benefit for research in certain cohorts. Here we have investigated a number of practical considerations surrounding a recently introduced framework for indexing the encoding of natural speech at the level of phonemes (Di Liberto et al., 2015).

Firstly, it was shown that a generic model is capable of indexing the cortical entrainment to several speech representations of interest. However, it was found that the EEG prediction correlations were overall lower than in the subject-specific approach (Fig. 1). This is likely to be an effect of anatomical differences among individuals, causing differences in the EEG signals between subjects. This subject-specific information would be lost when averaging between subjects, hence producing lower EEG prediction correlations. Even though the prediction values were smaller overall than in the subject-specific case, the generic modeling approach produces a similar pattern of prediction accuracies. Moreover, a generic model built from a larger and reasonably homogeneous group would still encode cortical responses that are consistent across subjects. Potentially, such a framework could benefit from such a larger dataset, and may require even shorter recording times to produce meaningful results. Here, out of the four speech representations used, the combination of acoustic and phonetic features (FS) was the best at predicting the EEG signal, while the envelope of speech was the worst. This result is relevant because the broadband envelope of speech has been used in several recent studies on auditory perception (Aiken and Picton, 2008; Ding and Simon, 2014; Millman et al., 2015; Nourski et al., 2009; Zion Golumbic et al., 2013). And the generic modeling approach has been shown to be able to produce a significant neural index FS-S, which has been suggested to reflect speech processing at the level of phonemes (Di Liberto et al., 2015).

The results discussed so far suggest that a generic modeling approach can be used to index cortical entrainment to phonetic features of speech. In order for this approach to be feasible for applied research in particular cohorts, this study aimed at assessing how much experimental time it requires. Fig. 2 showed that subject-specific models are sensitive to recording duration and need at least 30 min of recording data to provide a significant index of phoneme-level activity (although, the more data the better),
which limits the applicability of this framework. In this context, a solution is offered by the generic modeling approach, which was effective even with only 10 min of recording data. Furthermore, Fig. 3 demonstrates that phonetic-feature groups such as vowels, fricative consonants, and plosive consonants are discriminable already after 10 min of recording time (e.g., vowels vs fricative consonants, vowels vs plosive consonants). Unsurprisingly, the ability to separate phonetic features increases with recording time, which highlights the importance of collecting as much experimental data as possible.

With the goal of minimizing the experimental duration and facilitating clinical application, there are other considerations that are important to clarify. Firstly, the mapping procedure at the core of this framework (mTRF) is performed independently for each single electrode. In this context, Di Liberto et al., 2015 observed a lack of topographical differences and found that the strongest EEG predictability measures emerged from fronto-central scalp sites. The choice of focusing on a set of electrodes of interest at those sites also allowed us to investigate weaker effects that emerge at a group level, such as the sensitivity of EEG to specific phonetic features. However, no qualitative differences emerged between electrodes of interest and, in this sense, the effectiveness of this approach would not suffer from a reduction of the electrode set, as long as the scalp areas of interest are covered. This confirms that it is possible to obtain similar results using only a few bilateral electrodes. However, the use of 16 or 32 scalp electrodes over the whole scalp surface may be important at the preprocessing stage, as it would facilitate artifact detection and channel interpolation to deal with noise and motor artifacts, which may be more problematic in specific cohorts (e.g., infants, older persons). Additionally, the use of a larger number of participants in each subject group may result in a further reduction of the amount of recording data needed to produce significant objective measures of speech perception.

One possible shortcoming of the above generic modeling approach is that it relies on testing an individual subject using a model fit to other subjects. For studies comparing groups (e.g., typically developing children vs children with dyslexia), this means combining data within each group to form separate generic models. This implicitly assumes a certain amount of homogeneity within each group, an assumption that is certainly problematic (Happé et al., 2006; Willems et al., 2016). In fact, Fig. 1 demonstrated that such variability affects the results even within the subject group of this study. Intuitively, subjects with dynamics more similar to the group (subject-specific model similar to the corresponding generic model) will be characterized by higher EEG prediction correlations, while the opposite will happen for subjects with peculiar mTRFs. In this sense, the generic modeling approach could be used as a tool to investigate the homogeneity within a subject group. This method is suitable for studying within-subject effects, and indices of such effects (i.e., FS-S), properly normalized, could be used to compare different subject groups. In this context, interpretation of the analysis outputs needs to take into account the choice of subject groups, as excessive within-group variability may hamper the fit of a generic model.

An alternative solution could be to define a unique generic model using a training control group, and to use that model to assess whether a new subject (e.g., a patient) belongs to the group or not. This approach has the advantage of placing no limitations on the recording times for the training control group. However, the effectiveness of such an approach, which assumes some degree of homogeneity in each of the subject groupings, could not be verified here, as it requires a dataset that includes at least two distinct subject groups. To this end, the measures of phonetic-feature discriminability (Fig. 3) may provide a quantitative way to assess the homogeneity within a subject group. In particular, such measures were effective for every single subject when enough recording time was available and, for selected feature groups, 20 or 30 min of data were sufficient. However, it remains unclear how the particularities of the subjects used to fit a model will affect the predictions it produces for the test subject. Future work incorporating neuropsychological metrics and behavioral assays of speech perception will aim to clarify how this factor impacts our proposed methodology and to investigate the effectiveness of this framework at detecting specific processing problems at an individual subject level.

In summary, we have defined a framework to investigate speech processing using "direct" measures of cortical activity recorded with EEG. Importantly, the feasibility of applying this framework using shorter testing times was demonstrated. The approach provides a number of novel dependent measures which can be used to assess speech processing in individual subjects in certain cohorts. In addition to an overall index of phonetic-level processing (FS-S), we introduced a methodology to assess speech processing at the level of specific phonetic features, which may be important for investigating the causes and effects of specific speech and language disorders. For instance, dyslexia, which has been linked to phonological deficits (Goswami, 2015), may be related to altered/impaired processing of specific phonetic features. Another example is the study of language development. The processing of speech into phonetic categories is known to gradually develop through infancy and childhood (Kuhl, 2004); however, this had previously only been investigated in the context of simple, discrete stimuli. The framework introduced here provides a new way to investigate such developmental processes in more naturalistic conditions.

Author contributions

The study was conceived by E.C.L. and G.D.L. G.D.L. programmed the tasks, and collected and analyzed the data. G.D.L. and E.C.L. wrote the manuscript.

Conflicts of interest

None declared.

Funding sources

This study was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship.

Acknowledgements

This study was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship (GOIPG, 2013-2017). The authors thank Denis Drennan, Emily Teoh, and Adam Bednar for useful discussions and comments on the manuscript.
References

Abrams, D.A., Nicol, T., Zecker, S., Kraus, N., 2008. Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. J. Neurosci. 28, 3958–3965.
Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., Merzenich, M.M., 2001. Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc. Natl. Acad. Sci. U. S. A. 98, 13367–13372.
Aiken, S.J., Picton, T.W., 2008. Human cortical responses to the speech envelope. Ear Hear. 29, 139–157.
Aslin, R.N., Mehler, J., 2005. Near-infrared spectroscopy for functional studies of brain activity in human infants: promise, prospects, and challenges. J. Biomed. Opt. 10, 011009–0110093.
Bonte, M., Parviainen, T., Hytönen, K., Salmelin, R., 2006. Time course of top-down and bottom-up influences on syllable processing in the auditory cortex. Cereb. Cortex 16, 115–123.
Catts, H.W., Fey, M.E., Tomblin, J.B., Zhang, X., 2002. A longitudinal investigation of reading outcomes in children with language impairments. J. Speech Lang. Hear. Res. 45, 1142–1157.
Chomsky, N., Halle, M., 1968. The Sound Pattern of English.
Clark, M.K., Kamhi, A.G., 2010. Language disorders (child language disorders). In: Stone, J.H., Blouin, M. (Eds.), International Encyclopedia of Rehabilitation.
Clegg, J., Henderson, J., 1999. Developmental language disorders: changing economic costs from childhood into adult life. Ment. Health Res. Rev. 6, 27–30.
Crosse, M.J., Di Liberto, G.M., Bednar, A., Lalor, E.C., 2016. The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Front. Hum. Neurosci. 10, 604.
Delorme, A., Makeig, S., 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21.
Di Liberto, G.M., O'Sullivan, J.A., Lalor, E.C., 2015. Low-frequency cortical entrainment to speech reflects phoneme-level processing. Curr. Biol. 25 (19), 2457–2465.
Ding, N., Simon, J.Z., 2014. Cortical entrainment to continuous speech: functional roles and interpretations. Front. Hum. Neurosci. 8.
Flanagan, D.P., Genshaft, J., Harrison, P.L., 1997. Contemporary Intellectual Assessment: Theories, Tests, and Issues. Guilford Press.
Ford, L., Dahinten, V.S., 2005. Use of intelligence tests in the assessment of preschoolers. Contemp. Intellect. Assess. 487–503.
Gardner, H., Froud, K., McClelland, A., van der Lely, H.K., 2006. Development of the Grammar and Phonology Screening (GAPS) test to assess key markers of specific language and literacy difficulties in young children. Int. J. Lang. Commun. Disord. 41, 513–540.
Gorman, K., Howell, J., Wagner, M., 2011. Prosodylab-Aligner: a tool for forced alignment of laboratory speech, vol. 39, p. 2.
Goswami, U., 2015. Sensory theories of developmental dyslexia: three challenges for research. Nat. Rev. Neurosci. 16, 43–54.
Greenwood, D.D., 1961. Auditory masking and the critical band. J. Acoust. Soc. Am. 33, 484–502.
Happé, F., Ronald, A., Plomin, R., 2006. Time to give up on a single explanation for autism. Nat. Neurosci. 9, 1218–1220.
Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402.
Kemper, S., Anagnopoulos, C., 2008. Language and aging. Annu. Rev. Appl. Linguist. 10, 37–50.
Kuhl, P.K., 2004. Early language acquisition: cracking the speech code. Nat. Rev. Neurosci. 5, 831–843.
Kuhl, P.K., 2010. Brain mechanisms in early language acquisition. Neuron 67, 713–727.
Kuhl, P.K., Coffey-Corina, S., Padden, D., Dawson, G., 2005. Links between social and linguistic processing of speech in preschool children with autism: behavioral and electrophysiological measures. Dev. Sci. 8, F1–F12.
Kutas, M., Hillyard, S.A., 1980. Reading senseless sentences: brain potentials reflect semantic incongruity. Science 207, 203–205.
Lakatos, P., Shah, A.S., Knuth, K.H., Ulbert, I., Karmos, G., Schroeder, C.E., 2005. An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. J. Neurophysiol. 94, 1904–1911.
Lalor, E.C., Foxe, J.J., 2010. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. Eur. J. Neurosci. 31, 189–193.
Lau, E.F., Phillips, C., Poeppel, D., 2008. A cortical network for semantics: (de)constructing the N400. Nat. Rev. Neurosci. 9, 920–933.
Leonard, L.B., 2014. Children with Specific Language Impairment. MIT Press.
Luck, S.J., 2005. An Introduction to the Event-related Potential Technique. MIT Press.
Luo, H., Poeppel, D., 2007. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54, 1001–1010.
Machens, C.K., Wehr, M.S., Zador, A.M., 2004. Linearity of cortical receptive fields measured with natural sounds. J. Neurosci. 24, 1089–1100.
McNealy, K., Mazziotta, J.C., Dapretto, M., 2006. Cracking the language code: neural mechanisms underlying speech parsing. J. Neurosci. 26, 7629–7639.
Mesgarani, N., Cheung, C., Johnson, K., Chang, E.F., 2014. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010.
Mesulam, M.M., Rogalski, E.J., Wieneke, C., Hurley, R.S., Geula, C., Bigio, E.H., Thompson, C.K., Weintraub, S., 2014. Primary progressive aphasia and the evolving neurology of the language network. Nat. Rev. Neurol. 10, 554–569.
Millman, R.E., Johnson, S.R., Prendergast, G., 2015. The role of phase-locking to the temporal envelope of speech in auditory perception and speech intelligibility. J. Cogn. Neurosci. 27, 533–545.
Mirkovic, B., Debener, S., Jaeger, M., De Vos, M., 2015. Decoding the attended speech stream with multi-channel EEG: implications for online daily-life applications. J. Neural Eng. 12, 046007.
Mody, M., Belliveau, J.W., 2013. Speech and language impairments in autism: insights from behavior and neuroimaging. North Am. J. Med. Sci. 5, 157.
Nourski, K.V., Reale, R.A., Oya, H., Kawasaki, H., Kovach, C.K., Chen, H., Howard III, M.A., Brugge, J.F., 2009. Temporal envelope of time-compressed speech represented in the human auditory cortex. J. Neurosci. 29, 15564–15574.
Obleser, J., Zimmermann, J., Van Meter, J., Rauschecker, J.P., 2007. Multiple stages of auditory speech perception reflected in event-related fMRI. Cereb. Cortex 17, 2251–2257.
Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I.H., Saberi, K., Serences, J.T., Hickok, G., 2010. Hierarchical organization of human auditory cortex: evidence from acoustic invariance in the response to intelligible speech. Cereb. Cortex 20, 2486–2495.
Overath, T., Kumar, S., von Kriegstein, K., Griffiths, T.D., 2008. Encoding of spectral correlation over time in auditory cortex. J. Neurosci. 28, 13268–13273.
O'Sullivan, J.A., Power, A.J., Mesgarani, N., Rajaram, S., Foxe, J.J., Shinn-Cunningham, B.G., Slaney, M., Shamma, S.A., Lalor, E.C., 2015. Attentional selection in a cocktail party environment can be decoded from single-trial EEG. Cereb. Cortex 25, 1697–1706.
Paul, R., 2007. Language Disorders from Infancy through Adolescence: Assessment & Intervention. Mosby Elsevier.
Peelle, J.E., Johnsrude, I.S., Davis, M.H., 2010. Hierarchical processing for speech in human auditory cortex and beyond. Front. Hum. Neurosci. 4, 51.
Peter, V., Kalashnikova, M., Santos, A., Burnham, D., 2016. Mature neural responses to infant-directed speech but not adult-directed speech in pre-verbal infants. Sci. Rep. 6, 34273.
Poeppel, D., 2014. The neuroanatomic and neurophysiological infrastructure for speech and language. Curr. Opin. Neurobiol. 28, 142–149.
Reed, V., 2012. An Introduction to Children with Language Disorders. Pearson.
Rijsbergen, C.J.V., 1979. Information Retrieval. Butterworth-Heinemann.
Ross, L.A., Saint-Amour, D., Leavitt, V.M., Molholm, S., Javitt, D.C., Foxe, J.J., 2007. Impaired multisensory processing in schizophrenia: deficits in the visual enhancement of speech comprehension under noisy environmental conditions. Schizophr. Res. 97, 173–183.
Salmelin, R., 2007. Clinical neurophysiology of language: the MEG approach. Clin. Neurophysiol. 118, 237–254.
Tomblin, J.B., Records, N.L., Zhang, X., 1996. A system for the diagnosis of specific language impairment in kindergarten children. J. Speech Lang. Hear. Res. 39, 1284–1294.
Willems, G., Jansma, B., Blomert, L., Vaessen, A., 2016. Cognitive and familial risk evidence converged: a data-driven identification of distinct and homogeneous subtypes within the heterogeneous sample of reading disabled children. Res. Dev. Disabil. 53–54, 213–231.
Wöstmann, M., Fiedler, L., Obleser, J., 2016. Tracking the signal, cracking the code: speech and speech comprehension in non-invasive human electrophysiology. Lang. Cogn. Neurosci. 1–15.
Zion Golumbic, E.M., Cogan, G.B., Schroeder, C.E., Poeppel, D., 2013. Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". J. Neurosci. 33, 1417–1426.
