Hearing Research
journal homepage: www.elsevier.com/locate/heares
Research Paper
Article history:
Received 10 October 2016
Received in revised form 15 February 2017
Accepted 17 February 2017
Available online 27 February 2017

Keywords: Language impairment; Continuous speech; EEG; Categorical perception; Clinical research; Neuromarker

Abstract

Speech is central to human life. As such, any delay or impairment in receptive speech processing can have a profoundly negative impact on the social and professional life of a person. Thus, being able to assess the integrity of speech processing in different populations is an important goal. Current standardized assessment is mostly based on psychometric measures that do not capture the full extent of a person's speech processing abilities and that are difficult to administer in some subject groups. A potential alternative to these tests would be to derive "direct", objective measures of speech processing from cortical activity. One such approach was recently introduced and showed that it is possible to use electroencephalography (EEG) to index cortical processing at the level of phonemes from responses to continuous natural speech. However, a large amount of data was required for such analyses. This limits the usefulness of this approach for assessing speech processing in particular cohorts for whom data collection is difficult. Here, we used EEG data from 10 subjects to assess whether measures reflecting phoneme-level processing could be reliably obtained using only 10 min of recording time from each subject. This was done successfully using a generic modeling approach wherein the data from a training group composed of 9 subjects were combined to derive robust predictions of the EEG signal for new subjects. This allowed the derivation of indices of cortical activity at the level of phonemes and the disambiguation of responses to specific phonetic features (e.g., stop, plosive, and nasal consonants) with limited data. This objective approach has the potential to complement psychometric measures of speech processing in a wide variety of subjects.

© 2017 Elsevier B.V. All rights reserved.
* Corresponding author. Trinity Centre for Bioengineering, Trinity College Dublin, Dublin 2, Ireland.
** Corresponding author. Department of Biomedical Engineering and Department of Neuroscience, University of Rochester, Rochester, NY, USA.
E-mail addresses: diliberg@tcd.ie (G.M. Di Liberto), edlalor@tcd.ie (E.C. Lalor).
http://dx.doi.org/10.1016/j.heares.2017.02.015
0378-5955/© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Speech is central to human life. Over the past 30 years, neuroscience has provided tremendous insights into the neurobiology of language using brain imaging. As a result, it is now generally understood that speech is processed in a hierarchically organized system of functionally distinct cortical areas (Hickok and Poeppel, 2007; Okada et al., 2010; Peelle et al., 2010; Poeppel, 2014). However, much work remains to be done to elucidate the details of this system, particularly in the context of natural speech. This is an important issue in and of itself. However, it is also important in that there are significant numbers of people worldwide who suffer from some form of speech and language impairment. These can arise as a consequence of developmental disorders (Leonard, 2014) or from a decline in related cortical functions (e.g., through ageing, psychosis, injury; Kemper and Anagnopoulos, 2008; Mesulam et al., 2014; Ross et al., 2007). A better understanding of the underlying speech processing network and an ability to identify specific impairment within that network are crucial to developing clinically useful assessments of speech and language in these populations.

Speech and language impairment can disrupt the ability to understand auditory speech and efficiently communicate in a number of ways, which correspond to different symptoms. In this context, standardized assessment of such impairments is usually pursued using a number of behavioral tests (e.g., non-verbal hearing, speech, and language tests; standardized tests of intelligence) (Ford and Dahinten, 2005; Gardner et al., 2006; Tomblin et al., 1996). However, these measures are inadequate at capturing the full extent of a person's impairment and should be considered only as one aspect of a comprehensive assessment
G.M. Di Liberto, E.C. Lalor / Hearing Research 348 (2017) 70–77
process (Flanagan et al., 1997; Mody and Belliveau, 2013). Furthermore, some of these measures cannot be derived for some groups such as infants or participants with reading impairment or no reading skills.

A complementary approach is to "directly" investigate the causes that underpin such conditions, rather than evaluate "indirect" effects on specific behavioral markers. In this sense, neuroimaging provides an opportunity to derive measures directly related to the cortical processing of speech in the human brain. In particular, noninvasive, safe, functional brain measurements (EEG, MEG, fMRI, NIRS) have now been proven feasible for use with both children (starting at birth) and adults (Aslin and Mehler, 2005; Kuhl, 2010; Kuhl et al., 2005; McNealy et al., 2006). Neuroimaging research in speech perception has traditionally focused on neural activation patterns corresponding to the perception of minimal linguistic contrasts (e.g., how we distinguish "cat" from "mat") (Obleser et al., 2007; Peter et al., 2016; Salmelin, 2007), cortical responses in cases of syntactic or semantic violation (Kutas and Hillyard, 1980; Lau et al., 2008), and the processing of the low-level acoustics of an incoming sound stimulus (Lakatos et al., 2005; Overath et al., 2008). However, the study of speech comprehension needs to account for how humans process continuous natural speech, which is a task performed efficiently by healthy people in their everyday life and is profoundly different from, for example, the perception of isolated syllables (Bonte et al., 2006).

Recent studies showed an innovative way to investigate continuous speech perception in humans, by indexing how cortical activity tracks the dynamics of that speech. This phenomenon of cortical entrainment has been demonstrated in humans for the amplitude envelope of speech using magnetoencephalography (MEG; Ahissar et al., 2001; Luo and Poeppel, 2007), electroencephalography (EEG; Aiken and Picton, 2008; Lalor and Foxe, 2010), and electrocorticography (ECoG; Nourski et al., 2009). And the effect has been quantified using a cross-correlation analysis between the speech envelope and the recorded neural data (Ahissar et al., 2001; Abrams et al., 2008; Nourski et al., 2009; Millman et al., 2015). However, this approach is ill-suited to the study of naturalistic stimuli (Crosse et al., 2016). The reason for this is that naturalistic stimuli vary in a non-random way and so such stimuli are correlated with time-shifted versions of themselves. This leads to temporal smearing when cross-correlating the stimulus with shifted versions of the response. For this reason, system identification methods based on ridge regression have been recently applied in this context, and were shown to be effective for investigating the cortical processing of natural speech (Machens et al., 2004; Crosse et al., 2016). And, in turn, the ability to use more natural stimuli facilitates the design of more engaging paradigms. These issues, and others, are discussed in several recent reviews on the approaches for and applications of the speech-entrainment phenomenon (Ding and Simon, 2014; Wöstmann et al., 2016; Crosse et al., 2016).

In this context, a recent study (Di Liberto et al., 2015) introduced a framework for disentangling phoneme-level cortical responses from cortical activity elicited by low-level acoustics. Results from this study indicated that low-frequency cortical entrainment to speech features reflects more than a simple acoustic analysis of the stimulus, and that it also reflects phoneme-level processing. Therefore, this framework provides a potential methodology for investigating speech encoding under a variety of conditions and in a variety of cohorts. This could include research on the causes of speech impairments in particular cohorts by deriving direct indices of cortical activity at specific levels of the speech processing hierarchy using non-invasive EEG. However, short experimental times are preferable in applied research (Mirkovic et al., 2015), whereas Di Liberto et al. (2015) used a recording time of 72 min per subject, which may constitute an obstacle when studying particular cohorts (e.g., young children).

Here we introduce a modification to our previously introduced framework that allows for a significant reduction of the experimental time needed to derive such indices of phoneme-level cortical entrainment. Our previous framework involved relating different representations of a speech signal to ongoing EEG. In particular, it involved building a model for each subject that would map a specific speech representation to that subject's own EEG signal. This type of approach has previously been used to "decode" how attention is being deployed in so-called cocktail party environments (O'Sullivan et al., 2015; Mirkovic et al., 2015). The modification we make here follows an innovation introduced in these attention decoding studies (O'Sullivan et al., 2015; Mirkovic et al., 2015). Specifically, these authors showed that it was possible to decode attention for an individual subject using a generic model that was built from the data from other subjects. This led to a large reduction in how much data was needed from each subject to perform decoding (Mirkovic et al., 2015). Here, we seek to do something similar in the context of our approach for assessing phoneme-level speech processing. While it is known that not many electrodes are needed for this approach to be effective (by construction; see the forward modeling approach in Crosse et al., 2016), the ability to use the framework with small amounts of data from individual subjects is uncertain. To clarify this issue, an extensive analysis was conducted to assess the minimum experimental time needed to detect meaningful cortical responses. The goal of the analysis was to show that it is possible to utilize short data sets across multiple subjects to make inferences about speech processing in individual subjects. Specifically, we aimed to show that we can robustly index phoneme-level processing in the context of natural speech in cases of limited amounts of experimental data.

2. Material and methods

Ten healthy subjects (7 male) aged between 23 and 38 years old participated in the experiment. The study was undertaken in accordance with the Declaration of Helsinki and was approved by the Ethics Committee of the School of Psychology at Trinity College Dublin. Each subject provided written informed consent. Subjects reported no history of hearing impairment or neurological disorder.

2.1. Stimuli and experimental procedure

Subjects undertook 28 trials, each of ~155 s in length, in which they were presented with an audiobook version of a classic work of fiction read by a male American English speaker. The trials preserved the storyline, with neither repetitions nor discontinuities. All stimuli were presented monophonically at a sampling rate of 44,100 Hz using Sennheiser HD650 headphones and Presentation software from Neurobehavioral Systems (http://www.neurobs.com). Testing was carried out in a dark room and subjects were instructed to maintain visual fixation for the duration of each trial on a crosshair centered on the screen, and to minimize eye blinking and all other motor activities.

2.2. Data acquisition and preprocessing

Electroencephalographic (EEG) data were recorded from 128 scalp electrodes (plus 2 mastoid channels). Data were filtered over the range 0–134 Hz, and digitized with a sampling frequency of 512 Hz using a BioSemi Active Two system. Data were analyzed offline using MATLAB software (The Mathworks Inc.). EEG data were digitally filtered between 1 and 8 Hz using a Chebyshev Type 2 zero-phase filter. In order to reduce the processing time, all EEG
data were then down-sampled to 64 Hz. EEG channels with a variance that exceeded three times that of the surrounding channels were labelled as bad channels, and replaced by an estimate calculated using spherical spline interpolation (EEGLAB; Delorme and Makeig, 2004). All channels were then re-referenced to the average of the two mastoid channels with the goal of maximizing the EEG responses to the auditory stimuli (Luck, 2005).

2.3. TRF computation

The method used here aims to derive a quantitative mapping between particular representations of a speech signal and the recorded EEG. This mapping is commonly known as a temporal response function (TRF). A TRF can be interpreted as a filter that describes how the brain transforms a stimulus feature into the continuous neural response. Because the mapping described is from stimulus to EEG signal, the resulting models are referred to as forward TRFs. Furthermore, as will become clear in the following section, the stimulus here is often represented as a multivariate feature vector. As such, we refer to our TRFs as multivariate TRFs (Crosse et al., 2016). mTRFs were calculated using custom-written, publicly available software (http://www.mee.tcd.ie/lalorlab/resources.html).

2.4. Speech representations

Following Di Liberto et al. (2015), we estimated mTRFs based on four distinct representations of the speech stimulus:

1. Broadband amplitude envelope (E): This was calculated as Env(t) = |xa(t)|, with xa(t) = x(t) + j·x̂(t), where xa(t) is the complex analytic signal formed from the original speech x(t) and its Hilbert transform x̂(t). The envelope of speech was then downsampled to the same sampling frequency as the EEG data, after applying a zero-phase anti-aliasing filter. A significant number of papers have been published in recent years based on relating the envelope of a continuous speech signal to neural data (Aiken and Picton, 2008; Ding and Simon, 2014; Millman et al., 2015; Nourski et al., 2009; Zion Golumbic et al., 2013).
2. Spectrogram (S): This was obtained by first filtering the speech stimulus into 16 frequency bands between 250 Hz and 8 kHz according to Greenwood's equation (equal distance on the basilar membrane; Greenwood, 1961) using Chebyshev type 2 filters (order 100), and then computing the amplitude envelope (as above) for each frequency band.
3. Phonetic features (F): This representation was computed using the Prosodylab-Aligner (Gorman et al., 2011) which, given a speech file and the corresponding orthographic transcription, partitions each word into phonemes from the American English International Phonetic Alphabet (IPA). It then performs forced alignment (Gorman et al., 2011) and returns the starting and ending time-points for each phoneme. This information was then converted into a multivariate time-series composed of indicator variables, which are binary arrays (one for each phoneme) that are active for the time-points in which phonemes occurred. Each phoneme was then converted into a space of 19 phonetic features (Mesgarani et al., 2014), a distinctive subset of those defined by Chomsky and Halle (1968) that describe the articulatory and acoustic properties of the phonetic content of speech. In particular, the chosen features are related to the manner of articulation, the voicing of a consonant, the backness of a vowel, and the place of articulation. Each phoneme consists of a combination of distinct features; therefore this is a set of non-mutually exclusive descriptors.
4. Finally, we propose a model that combines F and S (FS): This was obtained by concatenating F and S into a single data matrix. This representation consists of 19 phonetic features and 16 frequency bands; therefore FS has 35 dimensions.

2.5. Model evaluation

In order to quantify how well the EEG reflects the encoding of the various speech representations, we used a model-based analysis. The idea is to fit a model (i.e., an mTRF) that describes the forward mapping from a speech representation to the EEG and then to test that model by seeing how accurately it can predict EEG from a new trial. Specifically, we used a leave-one-out cross-validation approach, whereby an mTRF was trained on 27 trials and used to predict the EEG data from the remaining trial. This process was repeated until the data from all trials were predicted. EEG prediction accuracies were evaluated by determining a correlation coefficient (Pearson's r) between the actual and predicted EEG data on each electrode channel. A single prediction correlation value was then derived by averaging these correlations over our chosen set of electrodes of interest (this procedure is described further in the following section). Note that silent time intervals were removed from the correlation evaluation (the same intervals were removed from all speech representations).

For each participant, predictions of their EEG signals were derived using mTRFs that were fit on data from that specific subject (subject-specific models). This approach was compared to a subject-independent method, which consisted of using models obtained by averaging the subject-specific mTRFs obtained from all other subjects (generic models).

2.6. Model parameter selection

Three key considerations when carrying out model-based predictions of EEG data are 1) the channels to be predicted; 2) the choice of time-lags between stimulus and data to optimize prediction; and 3) the choice of the regularization parameter λ. In terms of channels, we focused on predicting channels that strongly reflect auditory cortical activity. In particular, a set of 12 electrodes from 2 bilateral areas of the fronto-central scalp with the highest prediction correlations were selected (6 on the left side of the scalp, and their symmetrical counterparts on the right; Di Liberto et al., 2015). The EEG prediction correlations were then averaged across these channels. While we have chosen to average across 12 channels to ensure some robustness in terms of our results, there is no requirement to predict data from so many channels. Indeed, in our previous study, we found no qualitative difference across these channels. And, indeed, it may be possible to derive effectively the same information from a single channel (properly referenced). In terms of time-lags, we first computed mTRFs using a broad time-window from −150 to 450 ms. Based on visual inspection of the average mTRFs across all subjects, this time interval was then restricted to lags from 0 to 250 ms, as no visible response was present outside this range. Finally, an important consideration when calculating the mTRFs is that of regularization. As extensively described by Crosse et al., 2016, the mTRF procedure is based on ridge (or Tikhonov) regression, which uses regularization to reduce overfitting by smoothing, in this case, across the time dimension. This parameter λ was optimized for each model, while the same overall optimal value was used across subjects and electrodes. The optimal λ values were 1, 10, and 10 for S, F, and FS respectively.
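The pipeline of Sections 2.3–2.6 can be sketched end to end. The toy example below is ours, not the authors' mTRF toolbox: a synthetic envelope-like stimulus drives noisy "EEG" through a made-up response function, a forward TRF is estimated with lagged ridge regression over 0–250 ms lags, and the held-out trial is scored with Pearson's r. All names and parameter values are illustrative.

```python
import numpy as np

def lagged_design(stim, lags):
    """Design matrix whose columns are time-lagged copies of the stimulus."""
    X = np.zeros((len(stim), len(lags)))
    for j, lag in enumerate(lags):
        X[lag:, j] = stim[:len(stim) - lag] if lag > 0 else stim
    return X

def fit_trf(stim, eeg, lags, lam):
    """Forward TRF via ridge (Tikhonov) regression: w = (X'X + lam*I)^-1 X'y."""
    X = lagged_design(stim, lags)
    return np.linalg.solve(X.T @ X + lam * np.eye(len(lags)), X.T @ eeg)

fs = 64                               # EEG down-sampled to 64 Hz (Section 2.2)
lags = np.arange(int(0.25 * fs) + 1)  # lags spanning 0-250 ms (Section 2.6)
rng = np.random.default_rng(0)

# Toy "TRF": a response peaking ~100 ms after the stimulus (illustrative only).
w_true = np.exp(-0.5 * ((lags / fs - 0.1) / 0.03) ** 2)

# Synthetic trials: a non-negative, envelope-like stimulus (standing in for
# |analytic signal|) convolved with the toy TRF, plus additive noise.
trials = []
for _ in range(4):
    stim = np.abs(rng.standard_normal(30 * fs))
    eeg = lagged_design(stim, lags) @ w_true + 0.5 * rng.standard_normal(30 * fs)
    trials.append((stim, eeg))

# Leave-one-out: fit on all trials but the last, predict the held-out trial,
# and score the prediction with Pearson's r (Section 2.5).
stim_tr = np.concatenate([s for s, _ in trials[:-1]])
eeg_tr = np.concatenate([e for _, e in trials[:-1]])
w = fit_trf(stim_tr, eeg_tr, lags, lam=1.0)
stim_te, eeg_te = trials[-1]
r = np.corrcoef(lagged_design(stim_te, lags) @ w, eeg_te)[0, 1]
print(f"prediction r = {r:.2f}")
```

In the multivariate case (S, F, FS) the design matrix simply gains one block of lagged columns per feature dimension; the ridge solution is unchanged.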
Unless otherwise stated, all statistical analyses were performed using a repeated-measures, one-way ANOVA to compare distributions of Pearson correlation values across models and to compare F-score classifications across response intervals. The values reported use the convention F(df, df_error). Greenhouse-Geisser corrections were made if Mauchly's test of sphericity was not met. All post-hoc model comparisons were performed using Bonferroni-corrected paired t-tests.

3. Results

3.1. Neural evidence for phonetic processing in generic models

To investigate whether phoneme-level cortical activity can be indexed using a generic modeling approach, multivariate temporal response functions (mTRFs) (Crosse et al., 2016) were built to describe the linear mapping from speech to the scalp-recorded EEG signal. In particular, speech was represented using its acoustic envelope (E) and spectrogram (S), phonetic features (F), and a combination of acoustic spectrogram and phonetics (FS). mTRF models were built for each speech representation and used to build predictions of the EEG signal using cross-validation to avoid overfitting. The quality of these predictions, measured with Pearson's correlation, indexed how well the EEG reflects the processing of low- and high-level speech features. Fig. 1A,B show this result when the EEG predictions were derived using a subject-specific model, i.e., given a subject, the predictions of their EEG signal were obtained using a model fit on that same subject using cross-validation across trials. While no significant difference emerged between the S and F models, the model fit on the combination of the two (FS) produced the highest EEG prediction correlations (ANOVA: F(3.0,7.0) = 12.1, p = 0.004; post-hoc paired t-test comparisons of FS with all other models: p = 0.001, p = 0.005, p = 0.023 for E, S, F respectively). This result indicates that low-frequency EEG indexes the cortical entrainment to categorical phoneme-level features of speech (Di Liberto et al., 2015).

A similar analysis was conducted to assess whether the same effect emerged when using the generic modeling approach. In particular, the same processing steps as in the subject-specific case were performed, with one important difference: when predicting the EEG for a given subject, we used models that were fit to data from all the other subjects. Because EEG responses vary across subjects as a result of cortical folding, EEG prediction correlations for generic models were expected to be lower than in the subject-specific approach. This was confirmed by the results in Fig. 1C,D (two-way ANOVA; effect of modeling approach: F(1,72) = 6.1, p = 0.016). However, crucially, the combined model FS still produced the best EEG predictions in this case (ANOVA: F(3.0,7.0) = 21.9, p = 0.001; post-hoc paired t-test comparisons of FS with all other models: p < 0.001, p < 0.001, p = 0.024 for E, S, F respectively). Again, this suggests that this modeling approach is sensitive to the effects of categorical phoneme-level processing, even when using generic models.

3.2. Generic models index phonetic processing for limited experimental time

In our previous paper (Di Liberto et al., 2015), we suggested that one could potentially derive an isolated measure of phoneme-level

Fig. 1. EEG data were recorded while subjects listened to natural speech from an audiobook. Speech was represented using E and S (speech acoustics), F (phonetics), and FS (which combines acoustics and phonetics). Multivariate temporal response functions (mTRF) were built to describe the mapping from each representation of speech to the EEG recording and used to predict the EEG signal with cross-validation (↑ greater than all others, p < 0.01; ↓ smaller than all others, p < 0.01; *p < 0.05). (A) Correlations between EEG and its predictions are shown for each subject and each of the 4 speech representations. The predictions were obtained using subject-specific models, i.e., trained and tested within each subject using cross-validation (subjects were re-arranged according to the performance of the FS model for visualization purposes). This figure corresponds to Di Liberto et al. (2015), Fig. 2B; it is not identical because of minor changes in the data preprocessing (e.g., down-sampling rate). (B) The same data are here shown grouped by the 4 speech representations. Each data point refers to a specific subject (a specific color saturation was assigned to each subject). (C) Correlations between EEG and its predictions using a generic model, i.e., trained on all subjects with the exception of the test subject. The subject arrangement is consistent with (A). (D) The same values obtained for generic models are here grouped by speech representation. Each data point refers to a specific subject and their colors match the ones shown in (B). Because prediction correlations were calculated using 72 min of data, chance level here is effectively zero.
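The generic-model comparison of Section 3.1 can be illustrated with synthetic data. The sketch below is ours and rests on our own assumptions (each "subject" shares a common response shape plus subject-specific variation, standing in for anatomical differences); it is not the study's data or code.

```python
import numpy as np

rng = np.random.default_rng(1)
n_lags, n = 17, 4000
lags = np.arange(n_lags)

# One shared stimulus and its lagged design matrix.
stim = np.abs(rng.standard_normal(n))
X = np.zeros((n, n_lags))
for j in lags:
    X[j:, j] = stim[:n - j]

def ridge(X, y, lam=1.0):
    """Ridge-regression estimate of a forward model."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Ten "subjects": a shared TRF shape plus subject-specific variation.
shared = np.exp(-0.5 * ((lags - 6.0) / 2.0) ** 2)
trfs = [shared + 0.2 * rng.standard_normal(n_lags) for _ in range(10)]
eeg = [X @ w + 0.5 * rng.standard_normal(n) for w in trfs]

test = 9
# Generic model: average of the models fit on all *other* subjects,
# then used to predict the held-out subject's EEG.
w_generic = np.mean([ridge(X, eeg[i]) for i in range(10) if i != test], axis=0)
r_generic = np.corrcoef(X @ w_generic, eeg[test])[0, 1]
# Subject-specific reference: a model fit on the test subject's own data.
r_specific = np.corrcoef(X @ ridge(X, eeg[test]), eeg[test])[0, 1]
print(f"generic r = {r_generic:.2f}, subject-specific r = {r_specific:.2f}")
```

As in Fig. 1, the generic model is expected to predict somewhat less accurately than the subject-specific one, because averaging discards the subject-specific component, while remaining well above chance.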
(MDS and F-Score) after randomly relabelling each phoneme occurrence and converting that into phonetic features. Each discriminability value reported in Fig. 3 was obtained by subtracting from the F-score derived using the correct stimulus its chance level (estimated over 50 randomly relabelled versions of the stimulus). Individual subject values and their mean are reported in the figure. In line with Di Liberto et al. (2015), EEG activity in response to vowels (vow) could be significantly discriminated from that to fricative (fri) and plosive (plo) consonants (paired Wilcoxon signed rank test, p < 0.05), while no significant difference between vowels and semi-vowels (semi) emerged. Importantly, these results held even when only small amounts of experimental data (10 min) were available. The individual subject data clarify that, when enough data are available, the significant effects for the comparisons vow-fri (from 30 min of data) and vow-plo (from 50 min of data) correspond to an above-chance discriminability for every single subject. Also, EEG activity to vowels was more discriminable from plosive than from fricative consonants (this difference was significant for all experiment durations with the exception of 20 min; paired Wilcoxon signed rank test, p < 0.05). Interestingly, the discriminability of classes within consonants reveals a different pattern. In particular, semi-vowels required at least 20 min of experimental time to emerge as different from plosive consonants. Also, weak significant discriminability emerged between plosive and fricative consonants (p ≲ 0.05; with the exception of one data-point). In this case, a recording time of at least 60 min (plo-semi) or 70 min (plo-fri and fri-semi) was required to achieve an above-chance result for every single subject.

4. Discussion

Language impairments are disorders that affect the understanding and/or use of spoken or written language, and they carry the risk of poor social functioning, reduced independence and restricted employment opportunities (Clegg and Henderson, 1999; Paul, 2007; Reed, 2012). The disorder may involve the form of language (phonology, syntax, and morphology), its meaning (semantics), or its use (pragmatics), and includes deficits such as specific language impairment (SLI), aphasia, and dyslexia, among others. Early identification is crucial for improving long-term outcomes in many of these conditions, especially for early-school-age children, who are less likely to have subsequent reading and academic problems if diagnosed early (Catts et al., 2002; Clark, 2010). In this context, the ability to derive noninvasively robust markers of natural speech processing at specific levels of the cortical hierarchy could be of great benefit for research in certain cohorts. Here we have investigated a number of practical considerations surrounding a recently introduced framework for indexing the encoding of natural speech at the level of phonemes (Di Liberto et al., 2015).

Firstly, it was shown that a generic model is capable of indexing the cortical entrainment to several speech representations of interest. However, it was found that overall the EEG prediction correlations were lower than in the subject-specific approach (Fig. 1). This is likely to be an effect of anatomical differences among individuals, causing differences in the EEG signals between subjects. This subject-specific information would be lost when averaging between subjects, hence producing lower EEG prediction correlations. Even though the prediction values were smaller overall than in the subject-specific case, the generic modeling approach produces a similar pattern of prediction accuracies. Moreover, the generic model of a larger and reasonably homogeneous group would still encode cortical responses that are consistent across subjects. Potentially, such a framework could benefit from such a larger dataset, and may require even shorter recording times to produce meaningful results. Here, out of the four speech representations used, the combination of acoustic and phonetic features (FS) is the best at predicting the EEG signal, while the envelope of speech is the worst. This result is relevant as the broadband envelope of speech has been used in several recent studies on auditory perception (Aiken and Picton, 2008; Ding and Simon, 2014; Millman et al., 2015; Nourski et al., 2009; Zion Golumbic et al., 2013). And the generic modeling approach has been shown to be able to produce a significant neural index FS-S, which has been suggested to reflect speech processing at the level of phonemes (Di Liberto et al., 2015).

The results discussed so far suggest that a generic modeling approach can be used to index cortical entrainment to phonetic features of speech. In order for this approach to be feasible for applied research in particular cohorts, this study aimed at assessing how much experimental time it requires. Fig. 2 showed that subject-specific models are sensitive to recording duration and need at least 30 min of recording data to provide a significant index of phoneme-level activity (although, the more data the better),
Fig. 3. A measure of discriminability between phonetic features was derived from a multidimensional scaling (MDS) analysis on the phonetic-features mTRF model. In both panels, the x-axis indicates the amount of recording data and the y-axis reports a discriminability score (F-score). The comparison of each pair of feature-sets produced a distinct chance level, which was subtracted from the corresponding discriminability scores for visualization clarity. Empty gray circles indicate non-significant discrimination values (p > 0.05). The small dots indicate the results at the individual-subject level. (A) Vowels were discriminable from fricative and plosive consonants, and this difference emerged with 10 min of data. Vowels and semi-vowels were not significantly discriminable at any training time. (B) Plosive consonants and semi-vowels were significantly discriminated when at least 20 min of data was used. Similarly, plosive and fricative consonants were significantly discriminable for all recording durations, with the exception of 30 min.
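The chance-level subtraction behind these F-scores can be sketched as a label-permutation baseline. The toy classifier and synthetic data below are ours, not the paper's MDS-based analysis: score the true class labels, then subtract the mean score over random relabellings, as the text describes.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy stand-ins for per-occurrence response vectors of two phonetic classes
# (e.g., vowels vs. plosives); the class separation here is synthetic.
vowels = rng.standard_normal((60, 5)) + 1.0
plosives = rng.standard_normal((60, 5)) - 1.0
X = np.vstack([vowels, plosives])
y = np.array([1] * 60 + [0] * 60)

def f_score(X, y):
    """F-score of a nearest-centroid classifier (in-sample, for illustration)."""
    c1, c0 = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)
    pred = (np.linalg.norm(X - c1, axis=1) < np.linalg.norm(X - c0, axis=1)).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return 2 * tp / (2 * tp + fp + fn)

true_score = f_score(X, y)
# Chance level: mean F-score over 50 random relabellings, as in the text.
chance = np.mean([f_score(X, rng.permutation(y)) for _ in range(50)])
discriminability = true_score - chance
print(f"discriminability = {discriminability:.2f}")
```

A discriminability near zero indicates that the true labels carry no more class structure than shuffled ones; each feature-set pair gets its own baseline because class sizes differ.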
which limits the applicability of this framework. In this context, a solution is offered by the generic modeling approach, which was effective even with only 10 min of recording data. Furthermore, Fig. 3 demonstrates that phonetic-feature groups such as vowels, fricative consonants, and plosive consonants are discriminable after only 10 min of recording time (e.g., vowels vs. fricative consonants, vowels vs. plosive consonants). Unsurprisingly, the ability to separate phonetic features increases with recording time, which highlights the importance of collecting as much experimental data as possible.
With the goal of minimizing the experimental duration and facilitating clinical application, there are other considerations that are important to clarify. Firstly, the mapping procedure at the core of this framework (mTRF) is performed independently for each single electrode. In this context, Di Liberto et al. (2015) observed a lack of topographical differences and that the strongest EEG predictability measures emerged from fronto-central scalp sites. The choice of focusing on a set of electrodes of interest in those sites also allowed us to investigate weaker effects that emerge at a group level, such as the sensitivity of EEG to specific phonetic features. However, no qualitative differences emerged between electrodes of interest and, in this sense, the effectiveness of this approach would not suffer from a reduction of the electrode set, as long as the scalp areas of interest are covered. This suggests that similar results can be obtained using only a few bilateral electrodes. However, the use of 16 or 32 electrodes over the whole scalp surface may be important at the preprocessing stage, as it would facilitate artifact detection and channel interpolation to deal with noise and motor artifacts, which may be more problematic in specific cohorts (e.g., infants, older persons). Additionally, the use of a larger number of participants in each subject group may further reduce the amount of recording data needed to produce significant objective measures of speech perception.
One possible shortcoming of the above generic modeling approach is that it relies on testing an individual subject using a model fit to other subjects. For studies comparing groups (e.g., typically developing children vs. children with dyslexia), this means combining data within each group to form separate generic models. This implicitly assumes a certain amount of homogeneity within each group, an assumption that is certainly problematic (Happé et al., 2006; Willems et al., 2016). In fact, Fig. 1 demonstrated that such variability affects the results even within the subject group of this study. Intuitively, subjects with dynamics more similar to the group (i.e., whose subject-specific model resembles the corresponding generic model) will be characterized by higher EEG prediction correlations, while the opposite will happen for subjects with idiosyncratic mTRFs. In this sense, the generic modeling approach could be used as a tool to investigate the homogeneity within a subject group. This method is suitable for studying within-subject effects, and indices of such effects (i.e., FS-S), properly normalized, could be used to compare different subject groups. In this context, interpretation of the analysis outputs needs to take into account the choice of subject groups, as excessive within-group variability may hamper the fit of a generic model.
An alternative solution could be to define a unique generic model using a training control group, and to use such a model to assess whether a new subject (e.g., a patient) belongs to the group. This approach has the advantage of placing no limitations on the recording times for the training control group. However, the effectiveness of such an approach, which assumes some degree of homogeneity in each of the subject groupings, could not be verified here, as it requires a dataset that includes at least two distinct subject groups. To this end, the measures of phonetic-feature discriminability (Fig. 3) may provide a quantitative way to assess the homogeneity within a subject group. In particular, such measures were effective for every single subject when enough recording time was available and, for selected feature groups, 20 or 30 min of data were sufficient. However, it remains unclear how the particularities of the subjects used to fit a model will affect the predictions it produces for the test subject. Future work incorporating neuropsychological metrics and behavioral assays of speech perception will aim to clarify how this factor impacts our proposed methodology and to investigate the effectiveness of this framework at detecting specific processing problems at an individual-subject level.
In summary, we have defined a framework to investigate speech processing using "direct" measures of cortical activity recorded with EEG. Importantly, the feasibility of applying this framework using shorter testing times was demonstrated. The approach provides a number of novel dependent measures of speech processing, which can be used to assess speech processing in individual subjects in certain cohorts. In addition to an overall index of phonetic-level processing (FS-S), we introduced a methodology to assess speech processing at the level of specific phonetic features, which may be important for investigating the causes and effects of specific speech and language disorders. For instance, dyslexia, which has been linked to phonological deficits (Goswami, 2015), may be related to altered or impaired processing of specific phonetic features. Another example is the study of language development. The processing of speech into phonetic categories is known to gradually develop through infancy and childhood (Kuhl, 2004); however, this had previously only been investigated in the context of simple, discrete stimuli. The framework introduced here provides a new way to investigate such developmental processes in more naturalistic conditions.

Author contributions

The study was conceived by E.C.L. and G.D.L. G.D.L. programmed the tasks, collected and analyzed the data. G.D.L. and E.C.L. wrote the manuscript.

Conflicts of interest

None declared.

Funding sources

This study was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship.

Acknowledgements

This study was supported by an Irish Research Council Government of Ireland Postgraduate Scholarship (GOIPG, 2013-2017). The authors thank Denis Drennan, Emily Teoh, and Adam Bednar for useful discussions and comments on the manuscript.

References

Abrams, D.A., Nicol, T., Zecker, S., Kraus, N., 2008. Right-hemisphere auditory cortex is dominant for coding syllable patterns in speech. J. Neurosci. 28, 3958–3965.
Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke, H., Merzenich, M.M., 2001. Speech comprehension is correlated with temporal response patterns recorded from auditory cortex. Proc. Natl. Acad. Sci. U. S. A. 98, 13367–13372.
Aiken, S.J., Picton, T.W., 2008. Human cortical responses to the speech envelope. Ear Hear. 29, 139–157.
Aslin, R.N., Mehler, J., 2005. Near-infrared spectroscopy for functional studies of brain activity in human infants: promise, prospects, and challenges. J. Biomed. Opt. 10, 011009–0110093.
Bonte, M., Parviainen, T., Hytönen, K., Salmelin, R., 2006. Time course of top-down
and bottom-up influences on syllable processing in the auditory cortex. Cereb. Cortex 16, 115–123.
Catts, H.W., Fey, M.E., Tomblin, J.B., Zhang, X., 2002. A longitudinal investigation of reading outcomes in children with language impairments. J. Speech Lang. Hear. Res. 45, 1142–1157.
Chomsky, N., Halle, M., 1968. The Sound Pattern of English.
Clark, M.K.K., A.G., 2010. Language disorders (child language disorders). In: Stone, J.H., Blouin, M. (Eds.), International Encyclopedia of Rehabilitation.
Clegg, J., Henderson, J., 1999. Developmental language disorders: changing economic costs from childhood into adult life. Ment. Health Res. Rev. 6, 27–30.
Crosse, M.J., Di Liberto, G.M., Bednar, A., Lalor, E.C., 2016. The multivariate temporal response function (mTRF) toolbox: a MATLAB toolbox for relating neural signals to continuous stimuli. Front. Hum. Neurosci. 10, 604.
Delorme, A., Makeig, S., 2004. EEGLAB: an open source toolbox for analysis of single-trial EEG dynamics including independent component analysis. J. Neurosci. Methods 134, 9–21.
Di Liberto, G.M., O'Sullivan, J.A., Lalor, E.C., 2015. Low-frequency cortical entrainment to speech reflects phoneme-level processing. Curr. Biol. 25 (19), 2457–2465.
Ding, N., Simon, J.Z., 2014. Cortical entrainment to continuous speech: functional roles and interpretations. Front. Hum. Neurosci. 8.
Flanagan, D.P., Genshaft, J., Harrison, P.L., 1997. Contemporary Intellectual Assessment: Theories, Tests, and Issues. Guilford Press.
Ford, L., Dahinten, V.S., 2005. Use of intelligence tests in the assessment of preschoolers. Contemp. Intellect. Assess. 487–503.
Gardner, H., Froud, K., McClelland, A., van der Lely, H.K., 2006. Development of the Grammar and Phonology Screening (GAPS) test to assess key markers of specific language and literacy difficulties in young children. Int. J. Lang. Commun. Disord. 41, 513–540.
Gorman, K., Howell, J., Wagner, M., 2011. Prosodylab-Aligner: a tool for forced alignment of laboratory speech, vol. 39, p. 2.
Goswami, U., 2015. Sensory theories of developmental dyslexia: three challenges for research. Nat. Rev. Neurosci. 16, 43–54.
Greenwood, D.D., 1961. Auditory masking and the critical band. J. Acoust. Soc. Am. 33, 484–502.
Happé, F., Ronald, A., Plomin, R., 2006. Time to give up on a single explanation for autism. Nat. Neurosci. 9, 1218–1220.
Hickok, G., Poeppel, D., 2007. The cortical organization of speech processing. Nat. Rev. Neurosci. 8, 393–402.
Kemper, S., Anagnopoulos, C., 2008. Language and aging. Annu. Rev. Appl. Linguist. 10, 37–50.
Kuhl, P.K., 2004. Early language acquisition: cracking the speech code. Nat. Rev. Neurosci. 5, 831–843.
Kuhl, P.K., 2010. Brain mechanisms in early language acquisition. Neuron 67, 713–727.
Kuhl, P.K., Coffey-Corina, S., Padden, D., Dawson, G., 2005. Links between social and linguistic processing of speech in preschool children with autism: behavioral and electrophysiological measures. Dev. Sci. 8, F1–F12.
Kutas, M., Hillyard, S.A., 1980. Reading senseless sentences: brain potentials reflect semantic incongruity. Science 207, 203–205.
Lakatos, P., Shah, A.S., Knuth, K.H., Ulbert, I., Karmos, G., Schroeder, C.E., 2005. An oscillatory hierarchy controlling neuronal excitability and stimulus processing in the auditory cortex. J. Neurophysiol. 94, 1904–1911.
Lalor, E.C., Foxe, J.J., 2010. Neural responses to uninterrupted natural speech can be extracted with precise temporal resolution. Eur. J. Neurosci. 31, 189–193.
Lau, E.F., Phillips, C., Poeppel, D., 2008. A cortical network for semantics: (de)constructing the N400. Nat. Rev. Neurosci. 9, 920–933.
Leonard, L.B., 2014. Children with Specific Language Impairment. MIT Press.
Luck, S.J., 2005. An Introduction to the Event-Related Potential Technique. MIT Press.
Luo, H., Poeppel, D., 2007. Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron 54, 1001–1010.
Machens, C.K., Wehr, M.S., Zador, A.M., 2004. Linearity of cortical receptive fields measured with natural sounds. J. Neurosci. 24, 1089–1100.
McNealy, K., Mazziotta, J.C., Dapretto, M., 2006. Cracking the language code: neural mechanisms underlying speech parsing. J. Neurosci. 26, 7629–7639.
Mesgarani, N., Cheung, C., Johnson, K., Chang, E.F., 2014. Phonetic feature encoding in human superior temporal gyrus. Science 343, 1006–1010.
Mesulam, M.M., Rogalski, E.J., Wieneke, C., Hurley, R.S., Geula, C., Bigio, E.H., Thompson, C.K., Weintraub, S., 2014. Primary progressive aphasia and the evolving neurology of the language network. Nat. Rev. Neurol. 10, 554–569.
Millman, R.E., Johnson, S.R., Prendergast, G., 2015. The role of phase-locking to the temporal envelope of speech in auditory perception and speech intelligibility. J. Cogn. Neurosci. 27, 533–545.
Mirkovic, B., Debener, S., Jaeger, M., De Vos, M., 2015. Decoding the attended speech stream with multi-channel EEG: implications for online daily-life applications. J. Neural Eng. 12, 046007.
Mody, M., Belliveau, J.W., 2013. Speech and language impairments in autism: insights from behavior and neuroimaging. North Am. J. Med. Sci. 5, 157.
Nourski, K.V., Reale, R.A., Oya, H., Kawasaki, H., Kovach, C.K., Chen, H., Howard 3rd, M.A., Brugge, J.F., 2009. Temporal envelope of time-compressed speech represented in the human auditory cortex. J. Neurosci. 29, 15564–15574.
Obleser, J., Zimmermann, J., Van Meter, J., Rauschecker, J.P., 2007. Multiple stages of auditory speech perception reflected in event-related fMRI. Cereb. Cortex 17, 2251–2257.
Okada, K., Rong, F., Venezia, J., Matchin, W., Hsieh, I.H., Saberi, K., Serences, J.T., Hickok, G., 2010. Hierarchical organization of human auditory cortex: evidence from acoustic invariance in the response to intelligible speech. Cereb. Cortex 20, 2486–2495.
Overath, T., Kumar, S., von Kriegstein, K., Griffiths, T.D., 2008. Encoding of spectral correlation over time in auditory cortex. J. Neurosci. 28, 13268–13273.
O'Sullivan, J.A., Power, A.J., Mesgarani, N., Rajaram, S., Foxe, J.J., Shinn-Cunningham, B.G., Slaney, M., Shamma, S.A., Lalor, E.C., 2015. Attentional selection in a cocktail party environment can be decoded from single-trial EEG. Cereb. Cortex 25, 1697–1706.
Paul, R., 2007. Language Disorders from Infancy through Adolescence: Assessment & Intervention. Mosby Elsevier.
Peelle, J.E., Johnsrude, I.S., Davis, M.H., 2010. Hierarchical processing for speech in human auditory cortex and beyond. Front. Hum. Neurosci. 4, 51.
Peter, V., Kalashnikova, M., Santos, A., Burnham, D., 2016. Mature neural responses to infant-directed speech but not adult-directed speech in pre-verbal infants. Sci. Rep. 6, 34273.
Poeppel, D., 2014. The neuroanatomic and neurophysiological infrastructure for speech and language. Curr. Opin. Neurobiol. 28, 142–149.
Reed, V., 2012. An Introduction to Children with Language Disorders. Pearson.
Rijsbergen, C.J.V., 1979. Information Retrieval. Butterworth-Heinemann.
Ross, L.A., Saint-Amour, D., Leavitt, V.M., Molholm, S., Javitt, D.C., Foxe, J.J., 2007. Impaired multisensory processing in schizophrenia: deficits in the visual enhancement of speech comprehension under noisy environmental conditions. Schizophr. Res. 97, 173–183.
Salmelin, R., 2007. Clinical neurophysiology of language: the MEG approach. Clin. Neurophysiol. 118, 237–254.
Tomblin, J.B., Records, N.L., Zhang, X., 1996. A system for the diagnosis of specific language impairment in kindergarten children. J. Speech Lang. Hear. Res. 39, 1284–1294.
Willems, G., Jansma, B., Blomert, L., Vaessen, A., 2016. Cognitive and familial risk evidence converged: a data-driven identification of distinct and homogeneous subtypes within the heterogeneous sample of reading disabled children. Res. Dev. Disabil. 53-54, 213–231.
Wöstmann, M., Fiedler, L., Obleser, J., 2016. Tracking the signal, cracking the code: speech and speech comprehension in non-invasive human electrophysiology. Lang. Cogn. Neurosci. 1–15.
Zion Golumbic, E.M., Cogan, G.B., Schroeder, C.E., Poeppel, D., 2013. Visual input enhances selective speech envelope tracking in auditory cortex at a "cocktail party". J. Neurosci. 33, 1417–1426.