Neuroscience Research
journal homepage: www.elsevier.com/locate/neures
A R T I C L E   I N F O

Article history:
Received 27 March 2018
Received in revised form 21 November 2018
Accepted 30 November 2018
Available online xxx

Keywords:
Cross modality
Scene decoding
Functional connectivity
Multivariate pattern analysis
fMRI

A B S T R A C T

Scene recognition plays an important role in spatial navigation and scene classification. It remains unknown whether the occipitotemporal cortex could represent the semantic association between scenes and the sounds of objects within them. In this study, we used functional magnetic resonance imaging (fMRI) and multivariate pattern analysis to assess whether different scenes could be discriminated based on the patterns evoked by sounds of objects within the scenes. We found that patterns evoked by scenes could be predicted from patterns evoked by sounds of objects within the scenes in the posterior fusiform area (pF), lateral occipital area (LO) and superior temporal sulcus (STS). A further functional connectivity analysis suggested significant correlations between pF, LO and the parahippocampal place area (PPA), but not between STS and the other three regions, under the scene and sound conditions. A distinct network for processing scenes and sounds was discovered using a seed-to-voxel analysis with STS as the seed. This study may provide a cross-modal channel of scene decoding through the sounds of objects within the scenes in the occipitotemporal cortex, which could complement the single-modal channels of scene decoding based on global scene properties or objects within the scenes.

https://doi.org/10.1016/j.neures.2018.11.009
0168-0102/© 2018 Elsevier B.V. and Japan Neuroscience Society. All rights reserved.
Please cite this article in press as: Wang, X., et al., Decoding natural scenes based on sounds of objects within scenes using multivariate
pattern analysis. Neurosci. Res. (2018), https://doi.org/10.1016/j.neures.2018.11.009
G Model
NSR-4243; No. of Pages 10 ARTICLE IN PRESS
2 X. Wang et al. / Neuroscience Research xxx (2018) xxx–xxx
Previous studies found that the superior temporal sulcus (STS) could not only process visual and auditory information from animals and man-made manipulable objects (tools), but also integrate audio-visual information (Beauchamp et al., 2004; Tyll et al., 2013; Venezia et al., 2017), which supported the view that integrating relevant information from multiple modalities is in the nature of the human brain (Mesulam, 1998; Liang et al., 2013). Recently, the STS has been proposed to be quite active and responsible for different processing mechanisms in distinct tasks (Hein and Knight, 2008). Accordingly, we speculate that STS may also be involved in processing the semantic relationship between scenes and object sounds.

Information integration across different sensory modalities contributes to object recognition (Beauchamp, 2005; Doehrmann et al., 2010). Visual and auditory information about objects activates modality-specific brain regions, which implies that multisensory convergence zones are not fixed but rather depend on object contents and modalities (Amedi et al., 2005). One recent study (Vetter et al., 2014) found that the early visual cortex could distinguish perceptual from imagery contents, probably because the actual sound stimuli induced people to imagine the corresponding category information; this finding supported the speculation that a perceptual integration mechanism exists in the human primary cortex in addition to higher-level cortex (Werner and Noppeney, 2010; Klemen and Chambers, 2012; Rohe and Noppeney, 2016).

Previous studies have suggested that there are two main channels in scene recognition: the spatial property-based channel (Renninger and Malik, 2004; Greene and Oliva, 2009) and the object-based channel (MacEvoy and Epstein, 2011; Stansbury et al., 2013), and these two channels are complementary. The finding that PPA and LOC could represent the scene in a distributed and complementary way supported the view that spatial layout and scene content are processed in different channels (Park et al., 2011). In addition, LO was proposed to provide a new object-based channel for scene decoding, which complemented the processing of spatial attributes in the PPA (MacEvoy and Epstein, 2011). Natural scene recognition is tuned not only by the objects within the scene but also by its global properties (Greene and Oliva, 2009). However, these channels are confined to the visual modality. It remains unclear whether other modalities, such as scene-relevant typical sounds, are beneficial to scene recognition.

Multivariate pattern analysis (MVPA) is capable of quantifying the activity patterns of each item within a category and is very sensitive to different categories (Haxby et al., 2001; Harrison and Tong, 2009). To our knowledge, no study to date has investigated to what extent the areas related to visual scenes, objects and audio-visual integration are involved in representing the semantic relationship between visual scenes and their sounds using MVPA. In the present study, we hypothesized that the sounds most closely linked to the objects within a scene could decode that scene. To test the hypothesis, we used functional magnetic resonance imaging (fMRI) to acquire blood oxygenation level dependent (BOLD) data while participants viewed four categories of scenes (indoor scenes vs outdoor scenes) and listened to eight categories of sounds (two sounds per scene). Four regions of interest (ROIs) were defined, and MVPA was then performed on the voxel-wise activity patterns of each stimulus to examine whether the patterns for the sounds could decode the scene patterns in all four ROIs. To explore the influence of scene openness, we divided the scenes into indoor and outdoor scenes and repeated the MVPA. Finally, functional connectivity (FC) analyses between all four ROIs were conducted to examine the functional integration in the scene and sound tasks.

2. Material and methods

2.1. Participants

Twenty-three healthy subjects participated in the study (all right-handed, 12 females, average age: 21.91 ± 2.81 years, range 18–26 years). Two subjects were excluded from further analyses due to excessive head movement during scanning, leaving a total of 21 effective subjects in the present study. All participants had no history of neurological or psychiatric diseases or auditory impairments, and had self-reported normal or corrected-to-normal hearing and vision. Written informed consent was obtained from all participants before the experiment, and the study was approved by the Institutional Review Board (IRB) of the Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University.

2.2. Experiment stimuli

All stimuli consisted of 32 color images of scenes and 64 sound clips of animals or man-made manipulable objects (tools) selected from the internet. There were four categories of scenes (indoor: kitchen and office; outdoor: street and grass); sample images are shown in Fig. 1A. The scene images were edited to 400 × 400 pixels with Adobe Photoshop using the same parameters, to avoid irrelevant factors. We chose eight categories of sound clips strongly associated with the objects in the scenes ("vroom" for the engine and "hoot" for the horn in the street, "sizzle" for the hot oil and "rat-tat" for the kitchen knife in the kitchen, "moo" for the cattle and "baa" for the sheep in the grass, "click" for the keyboard and "ringing" for the telephone in the office), and each category included eight sound clips. To reduce possible confounds evoked by vocalizations (Belin et al., 2000; Norman et al., 2006), none of the animal or tool sounds contained any vocalizations or vocal-related content. All sound stimuli were edited to 2.5-s duration, converted to one channel (mono, 44.1 kHz, 16-bit), and set to 80–83 dB C-weighted in both ears (Cool Edit Pro, Syntrillium Software Co., owned by Adobe). Sounds were presented to subjects binaurally. All stimuli were assessed by another 10 volunteers to make sure they were easy to recognize (average accuracy = 98.78%, standard deviation = 0.014).

2.3. Experimental design

A block design was adopted in the experiment (Fig. 1B). Scan sessions consisted of 4 experimental runs; each run lasted 7 min 46 s and comprised 12 blocks (4 scene blocks, 8 sound blocks) presented in a random order. Each block consisted of 8 different stimuli: a 2.5-s trial followed by a 0.5-s inter-stimulus interval (ISI), repeated 8 times with different stimuli. Following each block, a 4-s "select one from four" button press recorded which scene category the subject judged the stimuli they had seen or heard to belong to, followed by a 10-s white fixation on a gray background. Before the fMRI scan, subjects received a training session with exemplar stimuli to learn the stimulus categories; this training also helped ensure that subjects could identify the scene pictures and object sounds properly. Subjects were asked to silently name the item when each scene picture or sound clip appeared (e.g., when hearing the "vroom", one should silently name it "sound of the engine"). All blocks, as well as the trials within each block, were presented randomly across runs, with the same stimuli as in the first run.
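The timing above can be checked with a little arithmetic. The sketch below is a hypothetical reconstruction: it assumes each run opens with one extra 10-s fixation before the first block (not stated explicitly in the text), which is what makes the per-block figures add up to the reported 7 min 46 s.

```python
# Reconstruct the run length of the block design described in Section 2.3.
# Assumption (not stated in the paper): one initial 10-s fixation per run.
TRIAL_S = 2.5          # stimulus duration
ISI_S = 0.5            # inter-stimulus interval
TRIALS_PER_BLOCK = 8
BUTTON_S = 4           # "select one from four" response window
FIXATION_S = 10        # white fixation after each block
BLOCKS_PER_RUN = 12    # 4 scene blocks + 8 sound blocks

block_s = TRIALS_PER_BLOCK * (TRIAL_S + ISI_S) + BUTTON_S + FIXATION_S
run_s = FIXATION_S + BLOCKS_PER_RUN * block_s

print(f"{int(run_s // 60)} min {int(run_s % 60)} s")  # 7 min 46 s
```

Without the assumed lead-in fixation, the 12 blocks alone account for 456 s (7 min 36 s), 10 s short of the reported run length.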
Fig. 1. Experimental materials and paradigm. (A) Experimental materials. The sample pictures correspond to the 4 kinds of scenes (office, grass, street and kitchen); there are 32 scene images in total. The speaker icons represent the object sounds from the 4 scene categories; there are 8 kinds of sounds in total ("vroom" for the engine and "hoot" for the horn in the street, "sizzle" for the hot oil and "rat-tat" for the kitchen knife in the kitchen, "moo" for the cattle and "baa" for the sheep in the grass, "click" for the keyboard and "ringing" for the telephone in the office). (B) Block-design paradigm. The experiment was composed of 4 runs and lasted about 29 min. Task blocks were separated by 10-s rest blocks, and each task block contained 8 stimuli (shown for 2.5 s with a 0.5-s inter-stimulus interval) presented centrally, followed by a 4-s button press.
2.4. Data acquisition

The experimental scanning was conducted on a 3.0 T Siemens Skyra scanner with a 20-channel head coil at Yantai Affiliated Hospital of Binzhou Medical University. Foam pads and earplugs were used to reduce head motion and scanner noise. A high-resolution structural MR image set was collected using a T1-weighted 3D MPRAGE sequence (repetition time (TR) = 1900 ms, echo time (TE) = 2.52 ms, voxel size = 1 × 1 × 1 mm3, matrix size = 256 × 256, flip angle (FA) = 9°). A gradient-echo planar imaging (EPI) sequence (TR = 2000 ms, TE = 30 ms, voxel size = 3.1 × 3.1 × 4.0 mm3, matrix size = 64 × 64, slices = 33, slice thickness = 4 mm, slice gap = 0.6 mm, FA = 90°) was used for functional data collection. Stimulus presentation and behavioral response collection were performed with E-Prime 2.0 Professional (Psychology Software Tools, Pittsburgh, PA, USA) through an audio-visual somatosensory device with high-resolution glasses and headphones.

2.5. Data preprocessing

Data preprocessing was conducted using the SPM8 package (http://www.fil.ion.ucl.ac.uk/spm). To reach steady-state equilibrium, we discarded the first five functional images from each run; the remaining images were slice-time-corrected to the first image of the first run and motion-corrected by a realignment analysis. For each participant, the individual's own structural image was first coregistered to the mean functional image after motion correction, and the transformed structural image was then segmented into gray matter, white matter and cerebrospinal fluid (CSF) using a unified segmentation algorithm. The corrected images were spatially normalized to 3 × 3 × 3 mm3 in Montreal Neurological Institute (MNI) space using the spatial parameters generated by the segmentation analysis. For each individual subject, the functional data were spatially smoothed with a 6 × 6 × 6 mm full-width at half maximum (FWHM) Gaussian kernel. Note that the data for the classification analysis were not smoothed.

The fMRI data for each subject were preprocessed to remove low-frequency signal changes and minimize head-movement artifacts. A general linear model (GLM) was then constructed for the smoothed functional volumes in all runs to obtain the voxel-wise responses (β values) corresponding to each condition.

2.6. ROI definition

First, we obtained the brain regions activated by all experimental conditions versus rest at the group level, and then picked the anatomical masks in the AAL atlas (Lalli et al., 2012) (Fusiform_L, Fusiform_R, Occipital_Mid_L, Occipital_Mid_R, ParaHippocampal_L, ParaHippocampal_R, Temporal_Sup_L, Temporal_Sup_R) corresponding to pF, LO, PPA and STS with the WFU PickAtlas toolbox in SPM8. Finally, four ROIs (Fig. 2) were defined as the intersection of the activated regions (p < 0.05, FDR corrected) and the anatomical masks.

2.7. Data analysis

2.7.1. Univariate analysis

A univariate analysis was performed to quantify the percent signal change in the sound and scene conditions for each ROI. First, we used the MarsBaR toolbox (http://marsbar.sourceforge.net) to extract the time courses for the four ROIs in each condition, and then calculated the average signal change for each region in the sound and scene conditions separately. Finally, paired t-tests were conducted between the signal changes of the sound and scene conditions in each region to investigate whether activation differed between the two conditions.

2.7.2. Classification analysis using MVPA

MVPA was conducted to explore the relationship between the scenes and the associated sounds in all ROIs. We labeled the response patterns of sounds according to the scene categories (kitchen, grass, office, street); for example, the patterns of the engine sound were labeled "street", and the response patterns of scenes were labeled in the same way. A linear SVM classifier was then trained to classify patterns evoked by scene images based on the patterns evoked by sounds. The LibSVM toolkit was used to implement the four-way classification in the ROIs (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). Afterward, a one-sample t-test was conducted on the classification performances to test whether they were statistically significant (p < 0.05). In order
Fig. 2. Regions of interest. pF: posterior fusiform area; LO: lateral occipital area; PPA: parahippocampal place area; STS: superior temporal sulcus.
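The ROI definition in Section 2.6 amounts to a voxel-wise conjunction of the thresholded group activation map with an anatomical mask. Below is a minimal NumPy sketch of that conjunction; the array names are hypothetical, and it assumes the FDR-corrected p-values and the AAL mask are already resampled to the same MNI grid.

```python
import numpy as np

def define_roi(activation_p, anatomical_mask, alpha=0.05):
    """Keep voxels that are both significantly activated (FDR-corrected
    p-values assumed precomputed in `activation_p`) and inside the
    anatomical mask taken from the AAL atlas (boolean array)."""
    activated = activation_p < alpha
    return activated & anatomical_mask.astype(bool)

# Toy 3-voxel example: only the middle voxel is both active and in the mask.
p_vals = np.array([0.20, 0.01, 0.03])
mask = np.array([True, True, False])
print(define_roi(p_vals, mask))  # [False  True False]
```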
Fig. 4. Multivariate classification of scenes using sound-based decoders, and classification performance as a function of the number of voxels resampled. (A) We trained a pattern classifier to predict the pattern of the scene based on the patterns of sounds. (B) Classification accuracies for different ROI sizes; the optimal result was observed when the ROI contained all voxels. A one-sample t-test was used to test the statistical significance of the classification accuracies. Error bars denote the standard error of the mean. *p < 0.05, **p < 0.01, ***p < 0.001.
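The training scheme illustrated in Fig. 4A can be sketched as follows. The paper used LibSVM, so scikit-learn's `LinearSVC` below is a stand-in, and the data are synthetic: real input would be the per-stimulus β patterns from each ROI, and the per-subject accuracies fed to the t-test are made-up placeholder values.

```python
import numpy as np
from scipy import stats
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
scenes = ["kitchen", "grass", "office", "street"]
n_voxels = 100

# Synthetic patterns: sound (train) and scene (test) patterns share a
# scene-specific prototype plus noise, standing in for β estimates.
prototypes = {s: rng.normal(size=n_voxels) for s in scenes}

def make_patterns(n_per_class, noise):
    X, y = [], []
    for s in scenes:
        for _ in range(n_per_class):
            X.append(prototypes[s] + rng.normal(scale=noise, size=n_voxels))
            y.append(s)
    return np.array(X), np.array(y)

X_sound, y_sound = make_patterns(16, noise=2.0)  # 2 sounds x 8 clips per scene
X_scene, y_scene = make_patterns(8, noise=2.0)   # 8 images per scene

# Train on sound-evoked patterns labeled by the associated scene,
# test on scene-evoked patterns: four-way cross-modal decoding.
clf = LinearSVC(max_iter=10000).fit(X_sound, y_sound)
acc = (clf.predict(X_scene) == y_scene).mean()
print(f"cross-modal accuracy: {acc:.2f} (chance = 0.25)")

# Group-level significance: one-sample t-test of per-subject accuracies
# against the 25% chance level (accuracies here are invented).
subject_accs = np.array([0.31, 0.35, 0.28, 0.33, 0.30, 0.36, 0.29, 0.34])
t, p = stats.ttest_1samp(subject_accs, 0.25)
print(f"t = {t:.2f}, p = {p:.4f}")
```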
significant FC were found between pF and LO, PPA, STS (pF vs LO: t(20) = 20.62, p < 0.001; pF vs PPA: t(20) = 10.00, p < 0.001; pF vs STS: t(20) = -5.35, p < 0.001). There was also significant FC between LO and PPA, STS (LO vs PPA: t(20) = 7.87, p < 0.001; LO vs STS: t(20) = -3.92, p < 0.001).

In the sound task, we did not find significant FC between STS and pF, LO, PPA (STS vs pF: t(20) = 0.99, p = 0.33; STS vs LO: t(20) = -0.49, p = 0.63; STS vs PPA: t(20) = -1.66, p = 0.11). In contrast, significant FC was observed between pF and LO, PPA (pF vs LO: t(20) = 14.06, p < 0.001; pF vs PPA: t(20) = 7.74, p < 0.001). The FC was also significant between LO and PPA (LO vs PPA: t(20) = 5.75, p < 0.001).

The above ROI-to-ROI FC analyses found that STS showed no significant positive correlation with the other three regions in either the sound or the scene task. To explore the functional role of STS in the scene and sound tasks, we performed a further seed-to-voxel FC analysis using STS as the seed. More specific regions were found

Fig. 6. ROI-to-ROI FC analyses among the 4 ROIs (pF, LO, PPA and STS) in the sound and scene tasks. To show the connectivity more clearly, the ROIs are drawn as points at different locations, even though areas in both hemispheres were used in the FC analysis. A line between two regions stands for significant FC between those regions. Blue represents a negative correlation; red a positive correlation. L, left; R, right. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

4. Discussion

The principal finding of this study is that scene patterns can be decoded from the relevant sound-evoked information in pF, LO and STS. We successfully used the activity patterns of the two object sounds strongly associated with each scene to predict the scene patterns. However, a similar relationship between scene and sound was not observed in PPA. A further FC analysis using the four ROIs as seeds indicated that there was no significant FC between STS and the other three regions in the sound and scene tasks. By exploring the construction of scene patterns from the response patterns of sounds in pF, LO and STS, our findings provide evidence on the neural mechanism of scene recognition based on sound-evoked information.

4.1. Reconstructing scenes from object sounds using MVPA

In this study, we found that the patterns evoked by object sounds could significantly discriminate the corresponding patterns of the scenes in pF, LO and STS, but not in the PPA.

The ventral occipital-temporal cortex (VOTC) has been explored both in sighted and congenitally blind individuals, and the findings agree with the notion that this region represents objects in a multi-modal way (Bi et al., 2016). By investigating the relationship between actual visual stimuli and fMRI activity in early visual areas using quantitative receptive-field models, one previous study suggested that it was possible to reconstruct the visual stimuli from fMRI activity patterns (Kay et al., 2008). Another study successfully reconstructed complex natural scenes by developing a new Bayesian decoder based on fMRI signals in early and anterior visual areas (Naselaris et al., 2009). The existing literature also found that the patterns of a scene could be successfully predicted from the patterns of the within-scene objects in LO (MacEvoy and Epstein, 2011). Our study further showed that the patterns evoked by a scene could be decoded from the patterns of scene-relevant sounds using MVPA.

One recent study (Klemen and Chambers, 2012) suggested that LOC could decode object sounds because the sounds evoked
Fig. 7. Surface displays for seed-to-voxel FC analyses using STS as a seed in the scene and sound tasks separately. Threshold: voxel-level p = 0.001, FWE correction.
Table 3
Regions that showed significant functional connectivity with STS in the scene and sound tasks.
The table lists the MNI coordinates and t-values of the peak voxels for the selected clusters (≥ 30 voxels) showing significant FC with the STS. The regions homologous to the peak voxel are written in bold, and the other areas contained in each cluster are written in regular type. PreCG: precentral gyrus; PostCG: postcentral gyrus; ACC: anterior cingulate cortex; IC: insular cortex; CO: central opercular cortex; M/STG: middle/superior temporal gyrus; PCu: precuneus; PO: parietal operculum cortex; PT: planum temporale; SMA: supplementary motor cortex; SMG: supramarginal gyrus; PL: parietal lobule; HG: Heschl's gyrus; LOC: lateral occipital cortex; Amy: amygdala; M/SFG: middle/superior frontal gyrus; I/SPL: inferior/superior parietal lobule; MTG: middle temporal gyrus; pSTG: posterior superior temporal gyrus; Tha: thalamus; S/M/IFG: superior/middle/inferior frontal gyrus; PaCiG: paracingulate gyrus; FOC: orbital frontal cortex; Cereb: cerebellum. a/p/i: anterior/posterior/inferior; L/R: left/right.
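The ROI-to-ROI FC statistics reported in this article (the t(20) values) follow a standard recipe: correlate ROI-averaged time courses per subject, Fisher z-transform, and test the z-values against zero at the group level. A toy sketch under those assumptions (time courses are synthetic, variable names hypothetical):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_timepoints = 21, 200

# Toy per-subject time courses for two ROIs with a shared component,
# standing in for condition-specific ROI-averaged BOLD signals.
z_values = []
for _ in range(n_subjects):
    shared = rng.normal(size=n_timepoints)
    roi_a = shared + rng.normal(scale=0.8, size=n_timepoints)
    roi_b = shared + rng.normal(scale=0.8, size=n_timepoints)
    r = np.corrcoef(roi_a, roi_b)[0, 1]
    z_values.append(np.arctanh(r))          # Fisher z-transform

# One-sample t-test of the z-values against zero; with 21 subjects
# this has df = 20, mirroring the t(20) statistics in the text.
t, p = stats.ttest_1samp(z_values, 0.0)
print(f"t(20) = {t:.2f}, p = {p:.2g}")
```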
subjects' imagining of the corresponding objects; the authors used the fMRI data of imagined objects to successfully determine the identity of the objects, providing further evidence that imagery and actual perception may share the same neural mechanism (Kosslyn et al., 2001; Hubbard, 2010), which was also verified by successful scene decoding during perception and imagery (Johnson and Johnson, 2014). One previous study found that two subregions within the LOC, pF and LO, were involved in processing different aspects of visual recognition (Nordhjem et al., 2015). In our study, we speculated that one explanation for the successful scene decoding based on sounds may be that the sound stimuli induced subjects to imagine the corresponding objects (Klemen and Chambers, 2012; Vetter et al., 2014), together with the strong association between the objects and sounds within the scenes.

As opposed to LOC, we did not find the semantic relationship between sound and scene in PPA. That is to say, even though many studies have attested to the important role of PPA in scene recognition, responses to the scene in this area seemed to have nothing to do with responses to the corresponding sounds. Other studies demonstrated that PPA could also represent scene-relevant object information (Macevoy and Epstein, 2009; Linsley and MacEvoy, 2014) and take part in tuning object information (Macevoy and Epstein, 2009; Harel et al., 2013). Although subjects might imagine the objects and animals when hearing the sounds, activity patterns in PPA only included information about standalone objects, and this information disappeared when the objects appeared in a scene (MacEvoy and Epstein, 2011).

A growing number of studies hold the notion that integrating stimuli from multiple sensory modalities is the inherent nature
Fig. 8. Searchlight-based classification analysis maps across subjects. The surface displays the decoding accuracy of voxels with classification accuracies significantly higher than chance level (25%, p < 0.01, FDR correction). (A) Outlined regions are STS (dark blue), pF (light blue), PPA (purple) and LO (green). (B) Besides the four ROIs, some main regions are labeled, including the precentral gyrus, MFG and IFG. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
of the human brain (Ghazanfar and Schroeder, 2006; van Atteveldt et al., 2014), and multisensory processing may occur at various levels, from initial narrow mechanisms to highly flexible ones (Muckli et al., 2015). STS plays a central role in multisensory interaction and responds strongly to visual and auditory objects (Klemen and Chambers, 2012). Our study showed that STS could also represent the semantic relationship between scene and sound, and we speculated that one could imagine the animals and man-made manipulable objects (tools) contained in the scenes when hearing the sound stimuli, for example, "moo" and cattle in the grass, or "vroom" and an automobile engine in the street. Furthermore, the STS is believed to be capable of integrating the meaningful information of audio-visual objects into a coherent percept (Beauchamp et al., 2004; Venezia et al., 2017), and to take part in processing many diverse functions, for instance, theory of mind, audiovisual integration, motion processing, speech processing and face processing (Hein and Knight, 2008; Blank and von Kriegstein, 2013). Moreover, the same brain region might be responsible for different recognition functions under distinct task networks (Hein and Knight, 2008; Watson et al., 2014). Therefore, STS may also be involved in tuning the sound information in the scene.

Moreover, our study found significant differences in the signal change between the scenes and the scene-related sounds in all four regions. Specifically, the signal change for scene images was significantly greater than that for sounds in pF, LO and PPA, while the signal change for sounds was significantly higher than that for scenes in STS. In spite of these significant differences in signal change, different scenes could still be successfully discriminated based on the patterns of sounds of objects within the scenes in the ROIs, which showed that these brain regions could represent the sound-scene semantic relationship. One explanation may be that univariate analysis of fMRI data is not as sensitive as MVPA (Norman et al., 2006; Stelzer et al., 2013) to the small number of neurons that carry a little classification information.

4.2. Influence of scene openness

Previous studies suggested that scene openness might influence scene recognition: by comparing the fMRI activity of indoor and outdoor scenes, the neural signal induced by indoor scenes was found to be stronger than that induced by outdoor scenes in the PPA (Henderson et al., 2007; Kravitz et al., 2011). In this study, we further analyzed the indoor and outdoor scenes using MVPA and found that the openness of the scenes could affect the sound-based scene prediction in pF, PPA and STS, which was consistent with the previous studies. However, LO showed a great tolerance for scene openness in the present study. In addition, the classification analysis of indoor and outdoor scenes suggested that pF, PPA and STS might be able to represent the semantic relationship between outdoor scenes and the associated sounds, but failed to process the association between indoor scenes and the sounds.

4.3. FC in the scene and sound tasks

Using the ROI-to-ROI FC analyses, we found significant functional correlations between these four ROIs, which formed two distinct sub-networks in the scene task (a negative-correlation network consisting of pF, LO and STS; a positive-correlation network composed of pF, LO and PPA). PPA was shown to be more sensitive to scenes than to other visual stimuli (Epstein et al., 1999, 2003; Epstein et al., 2006), and to be involved in decoding the appearance and layout of scenes (Epstein et al., 1999). Previous studies (Macevoy and Epstein, 2009; MacEvoy and Epstein, 2011) found that brain regions including pF and LO could successfully classify the four categories of scenes, which indicated that these two regions were also engaged in scene classification. The aforementioned evidence was consistent with our FC analyses during scene processing. But an opposite situation was observed in the sound task, in which pF, LO and PPA formed one sub-network, while STS showed no functional association with the other three regions. The STS has been suggested to represent stimuli from the visual, auditory and audio-visual modalities (Beauchamp, 2005; Venezia et al., 2017). There remained a question, namely why STS was not
functionally connected with the other three regions in the sound task. We speculated that the reason was that only STS participated in representing the sound stimuli; it was therefore independent of the network formed by the other three ROIs, which mainly represented the visual scenes.

As STS showed an independent role in the ROI-to-ROI analyses, we further explored the brain regions that were related to STS in the sound and scene tasks. The common brain areas that functionally cooperated with STS in both tasks suggest that these regions might serve different functions in different tasks (Beauchamp et al., 2004; Hein and Knight, 2008; Venezia et al., 2017). In this study, the insular cortex showed significant FC with STS. The insular cortex has been demonstrated to play a role in cross-modal coincidence and matching (Calvert, 2001; Senkowski et al., 2007), and is involved in a number of different functions such as pain perception, speech production and social emotion processing (Nieuwenhuys, 2011; Gu et al., 2013). Part of the middle frontal gyrus has been shown to store and process working memory (Leung et al., 2002; Senkowski et al., 2007). One recent study (Whitney et al., 2010) indicated that the posterior middle temporal gyrus (pMTG) and inferior frontal gyrus (IFG) play an important role in semantic control, by using repetitive transcranial magnetic stimulation (rTMS) to obstruct processing in IFG and pMTG. Additional evidence (MacEvoy and Epstein, 2011) pointed out that LOC could represent the semantic relationship between scenes and scene-relevant objects and provide an object-based channel to decode the scene. Previous studies showed that the pos-

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. U1736219, No. 61860206010 and No. 61571327), the Shandong Provincial Natural Science Foundation of China (No. ZR2015HM081) and the Project of Shandong Province Higher Educational Science and Technology Program (J15LL01).

References

Amedi, A., von Kriegstein, K., van Atteveldt, N.M., Beauchamp, M., Naumer, M.J., 2005. Functional imaging of human crossmodal identification and object recognition. Exp. Brain Res. 166, 559–571.
Beauchamp, M.S., 2005. See me, hear me, touch me: multisensory integration in lateral occipital-temporal cortex. Curr. Opin. Neurobiol. 15, 145–153.
Beauchamp, M.S., Lee, K.E., Argall, B.D., Martin, A., 2004. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron 41, 809–823.
Belin, P., Zatorre, R.J., Lafaille, P., Ahad, P., Pike, B., 2000. Voice-selective areas in human auditory cortex. Nature 403, 309–312.
Bi, Y., Wang, X., Caramazza, A., 2016. Object domain and modality in the ventral visual pathway. Trends Cogn. Sci. (Regul. Ed.) 20, 282–290.
Blank, H., von Kriegstein, K., 2013. Mechanisms of enhancing visual-speech recognition by prior auditory information. NeuroImage 65, 109–118.
Calvert, G.A., 2001. Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cereb. Cortex 11, 1110–1123.
Chen, Y.W., Lin, C.J., 2006. Combining SVMs with various feature selection strategies. Studies in Fuzziness & Soft Computing 207, 315–324.
Doehrmann, O., Weigelt, S., Altmann, C.F., Kaiser, J., Naumer, M.J., 2010. Audiovisual functional magnetic resonance imaging adaptation reveals multisensory integration effects in object-related sensory cortices. J. Neurosci. 30, 3370–3379.
Drucker, D.M., Aguirre, G.K., 2009. Different spatial scales of shape similarity repre-
terior superior temporal gyrus could integrate the different types sentation in lateral and ventral LOC. Cereb. Cortex 19, 2269–2280.
Epstein, R., Kanwisher, N., 1998. A cortical representation of the local visual envi-
of within-modality and cross-modality information (Beauchamp
ronment. Nature 392, 598–601.
et al., 2004; Venezia et al., 2017). It is possible that the key assign- Epstein, R., Harris, A., Stanley, D., Kanwisher, N., 1999. The parahippocampal place
ment in the experiment offered some category information to these area: Recognition, navigation, or encoding? Neuron 23, 115–125.
brain areas, even though the data during the motor response were Epstein, R., Graham, K.S., Downing, P.E., 2003. Viewpoint-specific scene representa-
tions in human parahippocampal cortex. Neuron 37, 865–876.
not used for decoding. Considering that part of motor responses Epstein, R.A., Higgins, J.S., Parker, W., Aguirre, G.K., Cooperman, S., 2006. Cortical
can overlap in time with the responses induced by sound, the key correlates of face and scene inversion: a comparison. Neuropsychologia 44,
assignment can be improved by setting different motor responses 1145–1158.
Epstein, R.A., Parker, W.E., Feiler, A.M., 2007. Where am I now? Distinct roles for
to a same scene in future study, which may reduce the influ- parahippocampal and retrosplenial cortices in place recognition. J. Neurosci. 27,
ence of fixed motor responses on stimuli classification. In addition, 6141–6149.
the existence of the common brain regions may imply that these Ghazanfar, A.A., Schroeder, C.E., 2006. Is neocortex essentially multisensory? Trends
Cogn. Sci. (Regul. Ed.) 10, 278–285.
regions cooperated with STS during processing the sound and the Greene, M.R., Oliva, A., 2009. Recognition of natural scenes from global properties:
scene to represent the semantic relationship between both. And the seeing the forest without representing the trees. Cogn. Psychol. 58, 137–176.
connectivity between these areas can help us understand the infor- Gu, X., Hof, P.R., Friston, K.J., Fan, J., 2013. Anterior insular cortex and emotional
awareness. J. Comp. Neurol. 521, 3371–3388.
mation flowing during the sound-to-scene representation, which
Harel, A., Kravitz, D.J., Baker, C.I., 2013. Deconstructing visual scenes in cortex: gra-
will be our next work. dients of object and spatial layout information. Cereb. Cortex 23, 947–957.
Harrison, S.A., Tong, F., 2009. Decoding reveals the contents of visual working mem-
ory in early visual areas. Nature 458, 632–635.
5. Conclusions Haushofer, J., Livingstone, M.S., Kanwisher, N., 2008. Multivariate patterns in object-
selective cortex dissociate perceptual and physical shape similarity. PLoS Biol.
6, e187.
In this study, we explored the scene decoding based on the pat- Haxby, J.V., Gobbini, M.I., Furey, M.L., Ishai, A., Schouten, J.L., Pietrini, P., 2001.
terns of sounds of objects within the scenes and found that pF, Distributed and overlapping representations of faces and objects in ventral tem-
LO and STS could represent the semantic relationship between the poral cortex. Science 293, 2425–2430.
Hein, G., Knight, R.T., 2008. Superior temporal sulcus—it’s my area: or is it? Cognitive
scenes and the sounds of the associated objects. However, this Neuroscience. Journal of 20, 2125–2136.
semantic association was not observed in PPA. Furthermore, by Henderson, J.M., Larson, C.L., Zhu, D.C., 2007. Cortical activation to indoor versus
dividing the scenes into indoor and outdoor parts, we found that outdoor scenes: an fMRI study. Exp. Brain Res. 179, 75–84.
Hubbard, T.L., 2010. Auditory imagery: empirical findings. Psychol. Bull. 136,
LO was not sensitive to the openness of the scenes. The ROI-to-ROI 302–329.
FC analyses among four ROIs in the scene and sound tasks indi- Johnson, M.R., Johnson, M.K., 2014. Decoding individual natural scene representa-
cated that STS did not coordinate with the other three ROIs in both tions during perception and imagery. Front. Hum. Neurosci. 8, 59.
Kay, K.N., Thomas, N., Prenger, R.J., Gallant, J.L., 2008. Identifying natural images
tasks, and a further seed-to-voxel analysis using STS as the seed sug- from human brain activity. Nature 452, 352–355.
gested a distinct network in processing the scenes and sounds. In Klemen, J., Chambers, C.D., 2012. Current perspectives and methods in studying
summary, our study showed the existence of a cross-modal sound- neural mechanisms of multisensory interactions. Neurosci. Biobehav. Rev. 36,
111–133.
based channel for scene decoding in the occipitotemporal cortex,
Kosslyn, S.M., Ganis, G., Thompson, W.L., 2001. Neural foundations of imagery. Nat.
which could give a further access to understand the neural mech- Rev. Neurosci. 2, 635–642.
anism of scene recognition. Kravitz, D.J., Peng, C.S., Baker, C.I., 2011. Real-world scene representations in
high-level visual cortex: it’s the spaces more than the places. J. Neurosci. 31,
7322–7333.
Conflict of interest Lalli, S., Piacentini, S., Franzini, A., Panzacchi, A., Cerami, C., Messina, G., Ferré, F.,
Perani, D., Albanese, A., 2012. Epidural premotor cortical stimulation in pri-
mary focal dystonia: clinical and 18F-fluoro deoxyglucose positron emission
No conflict of interest. tomography open study. Mov. Disord. 27, 533–538.
Leung, H., Gore, J.C., Goldman-Rakic, P.S., 2002. Sustained mnemonic response in the human middle frontal gyrus during on-line storage of spatial memoranda. J. Cogn. Neurosci. 14, 659–671.
Liang, M., Mouraux, A., Hu, L., Iannetti, G.D., 2013. Primary sensory cortices contain distinguishable spatial patterns of activity for each sense. Nat. Commun. 4, 1979.
Linsley, D., MacEvoy, S.P., 2014. Evidence for participation by object-selective visual cortex in scene category judgments. J. Vis. 14, 19.
MacEvoy, S.P., Epstein, R.A., 2009. Decoding the representation of multiple simultaneous objects in human occipitotemporal cortex. Curr. Biol. 19, 943–947.
MacEvoy, S.P., Epstein, R.A., 2011. Constructing scenes from objects in human occipitotemporal cortex. Nat. Neurosci. 14, 1323–1329.
Malach, R., Reppas, J.B., Benson, R.R., Kwong, K.K., Jiang, H., Kennedy, W.A., Ledden, P.J., Brady, T.J., Rosen, B.R., Tootell, R.B., 1995. Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proc. Natl. Acad. Sci. U. S. A. 92, 8135–8139.
Mesulam, M.M., 1998. From sensation to cognition. Brain 121 (Pt 6), 1013–1052.
Muckli, L., Vizioli, L., Petro, L., De Martino, F., Vetter, P., 2015. Predictive coding of auditory and contextual information in early visual cortex: evidence from layer-specific fMRI brain reading. J. Vis. 15, 720.
Naselaris, T., Prenger, R.J., Kay, K.N., Oliver, M., Gallant, J.L., 2009. Bayesian reconstruction of natural images from human brain activity. Neuron 63, 902–915.
Nieuwenhuys, R., 2011. The insular cortex: a review. Prog. Brain Res. 195, 123–163.
Nordhjem, B., Curcic-Blake, B., Meppelink, A.M., Renken, R.J., de Jong, B.M., Leenders, K.L., van Laar, T., Cornelissen, F.W., 2015. Lateral and medial ventral occipitotemporal regions interact during the recognition of images revealed from noise. Front. Hum. Neurosci. 9, 678.
Norman, K.A., Polyn, S.M., Detre, G.J., Haxby, J.V., 2006. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci. 10, 424–430.
Park, S., Brady, T.F., Greene, M.R., Oliva, A., 2011. Disentangling scene content from spatial boundary: complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes. J. Neurosci. 31, 1333–1340.
Park, S., Konkle, T., Oliva, A., 2015. Parametric coding of the size and clutter of natural scenes in the human brain. Cereb. Cortex 25, 1792–1805.
Renninger, L.W., Malik, J., 2004. When is scene identification just texture recognition? Vision Res. 44, 2301–2311.
Rohe, T., Noppeney, U., 2016. Distinct computational principles govern multisensory integration in primary sensory and association cortices. Curr. Biol. 26, 509–514.
Said, C.P., Moore, C.D., Engell, A.D., Todorov, A., Haxby, J.V., 2010. Distributed representations of dynamic facial expressions in the superior temporal sulcus. J. Vis. 10, 11.
Senkowski, D., Saint-Amour, D., Kelly, S.P., Foxe, J.J., 2007. Multisensory processing of naturalistic objects in motion: a high-density electrical mapping and source estimation study. NeuroImage 36, 877–888.
Stansbury, D.E., Naselaris, T., Gallant, J.L., 2013. Natural scene statistics account for the representation of scene categories in human visual cortex. Neuron 79, 1025–1034.
Stelzer, J., Chen, Y., Turner, R., 2013. Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): random permutations and cluster size control. NeuroImage 65, 69–82.
Tyll, S., Bonath, B., Schoenfeld, M.A., Heinze, H.J., Ohl, F.W., Noesselt, T., 2013. Neural basis of multisensory looming signals. NeuroImage 65, 13–22.
van Atteveldt, N., Murray, M.M., Thut, G., Schroeder, C.E., 2014. Multisensory integration: flexible use of general operations. Neuron 81, 1240–1253.
Venezia, J.H., Vaden Jr., K.I., Rong, F., Maddox, D., Saberi, K., Hickok, G., 2017. Auditory, visual and audiovisual speech processing streams in superior temporal sulcus. Front. Hum. Neurosci. 11, 174.
Vetter, P., Smith, F.W., Muckli, L., 2014. Decoding sound and imagery content in early visual cortex. Curr. Biol. 24, 1256–1262.
Walther, D.B., Caddigan, E., Fei-Fei, L., Beck, D.M., 2009. Natural scene categories revealed in distributed patterns of activity in the human brain. J. Neurosci. 29, 10573–10581.
Walther, D.B., Chai, B., Caddigan, E., Beck, D.M., Fei-Fei, L., 2011. Simple line drawings suffice for functional MRI decoding of natural scene categories. Proc. Natl. Acad. Sci. U. S. A. 108, 9661–9666.
Watson, R., Latinus, M., Charest, I., Crabbe, F., Belin, P., 2014. People-selectivity, audiovisual integration and heteromodality in the superior temporal sulcus. Cortex 50, 125–136.
Werner, S., Noppeney, U., 2010. Distinct functional contributions of primary sensory and association areas to audiovisual integration in object categorization. J. Neurosci. 30, 2662–2675.
Whitfield-Gabrieli, S., Nieto-Castanon, A., 2012. Conn: a functional connectivity toolbox for correlated and anticorrelated brain networks. Brain Connect. 2, 125–141.
Whitney, C., Kirk, M., O'Sullivan, J., Ralph, M.A.L., Jefferies, E., 2010. The neural organization of semantic control: TMS evidence for a distributed network in left inferior frontal and posterior middle temporal gyrus. Cereb. Cortex, bhq180.
Please cite this article in press as: Wang, X., et al., Decoding natural scenes based on sounds of objects within scenes using multivariate
pattern analysis. Neurosci. Res. (2018), https://doi.org/10.1016/j.neures.2018.11.009