Decoding natural scenes based on sounds of objects within scenes using multivariate pattern analysis

Xiaojing Wang a,1, Jin Gu a,1, Junhai Xu a, Xianglin Li c, Junzu Geng d, Bin Wang c, Baolin Liu a,b,∗

a College of Intelligence and Computing, Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, 300350, PR China
b State Key Laboratory of Intelligent Technology and Systems, National Laboratory for Information Science and Technology, Tsinghua University, Beijing, 100084, PR China
c Medical Imaging Research Institute, Binzhou Medical University, Yantai, Shandong, 264003, PR China
d Department of Radiology, Yantai Affiliated Hospital of Binzhou Medical University, Yantai, Shandong, 264003, PR China

Article history: Received 27 March 2018; received in revised form 21 November 2018; accepted 30 November 2018; available online xxx.

Keywords: Cross modality; Scene decoding; Functional connectivity; Multivariate pattern analysis; fMRI

Abstract

Scene recognition plays an important role in spatial navigation and scene classification. It remains unknown whether the occipitotemporal cortex can represent the semantic association between scenes and the sounds of objects within them. In this study, we used functional magnetic resonance imaging (fMRI) and multivariate pattern analysis to assess whether different scenes could be discriminated based on the activity patterns evoked by sounds of objects within the scenes. We found that patterns evoked by scenes could be predicted from patterns evoked by sounds of objects within the scenes in the posterior fusiform area (pF), lateral occipital area (LO) and superior temporal sulcus (STS). A further functional connectivity analysis revealed significant correlations between pF, LO and the parahippocampal place area (PPA), but not between STS and the other three regions, under the scene and sound conditions. A distinct network for processing scenes and sounds was discovered using a seed-to-voxel analysis with STS as the seed. This study suggests a cross-modal channel of scene decoding through the sounds of objects within scenes in the occipitotemporal cortex, which could complement the single-modal channel of scene decoding based on global scene properties or objects within the scenes.

© 2018 Elsevier B.V. and Japan Neuroscience Society. All rights reserved.

1. Introduction

Scene recognition is one of the main aspects of perceiving the world: through it, people can recognize the surrounding environment quickly and accurately, and it plays a central role in navigating to a destination and in scene discrimination. Multiple brain areas have been shown to be engaged in scene recognition. Neuroimaging studies have demonstrated that the regions recruited in scene perception mainly include the parahippocampal place area (PPA) (Epstein and Kanwisher, 1998; Epstein et al., 2007; Walther et al., 2011) and the lateral occipital complex (LOC) (Malach et al., 1995; Park et al., 2011, 2015).

The PPA has been shown to be activated more selectively by scenes than by other visual stimuli, and especially by the spatial layout of a scene, suggesting that it represents scenes by encoding the spatial geometry of the local environment (Epstein and Kanwisher, 1998). Further evidence suggested that the PPA responds more sensitively to indoor than to outdoor scenes (Henderson et al., 2007). Another study (Walther et al., 2009) found that activity in PPA and LOC carried category information about natural scenes. Previous work suggests that the two subregions of LOC, pF and LO, may support different functions during visual recognition (Haushofer et al., 2008; Drucker and Aguirre, 2009). By exploring these two subregions, the lateral occipital area (LO) and the posterior fusiform area (pF), MacEvoy and colleagues suggested a neural mechanism by which patterns evoked by scenes could be predicted from the patterns evoked by within-scene objects in LO (MacEvoy and Epstein, 2011). Furthermore, by classifying scene patterns based on object patterns, some recent studies indicated that the regions mediating within-scene object information are not limited to LOC but may also include the PPA (Harel et al., 2013; Linsley and MacEvoy, 2014).

∗ Corresponding author at: College of Intelligence and Computing, Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, 300350, PR China. E-mail address: liubaolin@tsinghua.edu.cn (B. Liu).
1 Xiaojing Wang and Jin Gu contributed equally to this work.

https://doi.org/10.1016/j.neures.2018.11.009

Previous studies found that the superior temporal sulcus (STS) not only processes visual and auditory information from animals and man-made manipulable objects (tools), but also integrates audio-visual information (Beauchamp et al., 2004; Tyll et al., 2013; Venezia et al., 2017), supporting the view that integrating relevant information from multiple modalities is an inherent property of the human brain (Mesulam, 1998; Liang et al., 2013). Recently, the STS has been proposed to be active in, and responsible for, different processing mechanisms in distinct tasks (Hein and Knight, 2008). Accordingly, we speculate that STS may also be involved in processing the semantic relationship between scenes and object sounds.

Information integration across different sensory modalities contributes to object recognition (Beauchamp, 2005; Doehrmann et al., 2010). Visual and auditory information about objects can activate modality-specific brain regions, which implies that multisensory convergence zones are not fixed but depend on object content and modality (Amedi et al., 2005). One recent study (Vetter et al., 2014) found that the early visual cortex could distinguish perceived from imagined content, probably because the actual sound stimuli induced people to imagine the corresponding category information; this finding supports the speculation that a perceptual integration mechanism exists in the human primary cortex in addition to higher-order cortex (Werner and Noppeney, 2010; Klemen and Chambers, 2012; Rohe and Noppeney, 2016).

Previous studies have suggested that there are two main channels in scene recognition, a spatial property-based channel (Renninger and Malik, 2004; Greene and Oliva, 2009) and an object-based channel (MacEvoy and Epstein, 2011; Stansbury et al., 2013), and that these two channels are complementary. The finding that PPA and LOC represent scenes in a distributed and complementary way supports the view that spatial layout and scene content are processed in different channels (Park et al., 2011). In addition, LO was proposed to provide a new object-based channel for scene decoding, complementing the processing of spatial attributes in the PPA (MacEvoy and Epstein, 2011). Natural scene recognition is not only tuned by the objects within a scene but also influenced by its global properties (Greene and Oliva, 2009). However, these channels are confined to the visual modality. It remains unclear whether other modalities, such as scene-relevant typical sounds, are beneficial to scene recognition.

Multivariate pattern analysis (MVPA) is capable of quantifying the activity patterns of individual items within a category and is very sensitive to category differences (Haxby et al., 2001; Harrison and Tong, 2009). To our knowledge, no study to date has used MVPA to investigate to what extent the areas related to visual scenes, objects and audio-visual integration are involved in representing the semantic relationship between visual scenes and their associated sounds. In the present study, we hypothesized that the sounds most closely linked to the objects within a scene could decode that scene. To test this hypothesis, we used functional magnetic resonance imaging (fMRI) to acquire blood oxygenation level dependent (BOLD) data while participants viewed four categories of scenes (indoor vs outdoor) and listened to eight categories of sounds (two sounds per scene). Four regions of interest (ROIs) were defined, and MVPA was then performed on the voxel-wise activity patterns of each stimulus to examine whether the sound patterns could decode the scene patterns in all four ROIs. To explore the influence of scene openness, we divided the scenes into indoor and outdoor scenes and repeated the MVPA. Finally, functional connectivity (FC) analyses between all four ROIs were performed to examine functional integration in the scene and sound tasks.

2. Material and methods

2.1. Participants

Twenty-three healthy subjects participated in the study (all right-handed, 12 females, average age 21.91 ± 2.81 years, range 18–26 years). Two subjects were excluded from further analyses due to excessive head movement during scanning, leaving a total of 21 effective subjects. All participants had no history of neurological or psychiatric diseases or auditory impairments, and had self-reported normal or corrected-to-normal hearing and vision. Written informed consent was obtained from all participants before the experiment, and the study was approved by the Institutional Review Board (IRB) of Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University.

2.2. Experiment stimuli

The stimuli consisted of 32 color images of scenes and 64 sound clips of animals or man-made manipulable objects (tools) selected from the internet. There were four categories of scenes (indoor: kitchen and office; outdoor: street and grass); sample images are shown in Fig. 1A. The scene images were edited to 400 × 400 pixels with Adobe Photoshop using the same parameters, to avoid irrelevant factors. We chose eight categories of sound clips strongly associated with the objects in the scenes ("vroom" for the engine and "hoot" for the horn in the street, "sizzle" for the hot oil and "rat-tat" for the kitchen knife in the kitchen, "moo" for the cattle and "baa" for the sheep in the grass, "click" for the keyboard and "ringing" for the telephone in the office), and each category included eight sound clips. To reduce possible confounds evoked by vocalizations (Belin et al., 2000; Norman et al., 2006), none of the animal or tool sounds contained any vocalizations or vocal-related content. All sound stimuli were edited to 2.5-s duration and converted to one channel (mono, 44.1 kHz, 16-bit), 80–83 dB C-weighted in both ears (Cool Edit Pro, Syntrillium Software Co., owned by Adobe). Sounds were presented to subjects binaurally. All stimuli were assessed by another 10 volunteers to ensure they were easy to recognize (average accuracy = 98.78%, standard deviation = 0.014).

2.3. Experimental design

A block design was adopted in the experiment (Fig. 1B). Scan sessions consisted of 4 experimental runs, each lasting 7 min 46 s and comprising 12 blocks (4 scene blocks, 8 sound blocks) presented in random order. Each block consisted of 8 different stimuli: a 2.5-s trial followed by a 0.5-s inter-stimulus interval (ISI), repeated 8 times with different stimuli. Following each block, the subject had 4 s to press a "select one from four" button recording which scene category the stimuli they had just seen or heard belonged to, followed by a 10-s white fixation on a gray background. Before the fMRI scan, subjects received training with exemplar stimuli to learn the stimulus categories; this training also helped to ensure that subjects could identify the scene pictures and object sounds properly. Subjects were asked to silently name each scene picture or sound clip as it appeared (e.g., when hearing the "vroom", one should silently name it "sound of the engine"). The blocks, and the trials within each block, were presented in random order across runs with the same stimuli as in the first run.
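As a quick consistency check, these timing parameters can be assembled into the stated run length. The short Python sketch below reproduces the 7 min 46 s figure under the assumption, not stated explicitly above, of a 10-s fixation at the start of each run.

    # Consistency check of the run timeline. All figures come from Section 2.3,
    # except the assumed 10-s lead-in fixation at the start of the run.
    trial = 2.5 + 0.5               # 2.5-s stimulus + 0.5-s inter-stimulus interval
    block = 8 * trial + 4 + 10      # 8 trials, 4-s button press, 10-s fixation
    run = 10 + 12 * block           # assumed lead-in fixation + 12 blocks
    print(run)                      # 466.0 s = 7 min 46 s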

Please cite this article in press as: Wang, X., et al., Decoding natural scenes based on sounds of objects within scenes using multivariate
pattern analysis. Neurosci. Res. (2018), https://doi.org/10.1016/j.neures.2018.11.009
G Model
NSR-4243; No. of Pages 10 ARTICLE IN PRESS
X. Wang et al. / Neuroscience Research xxx (2018) xxx–xxx 3

Fig. 1. Experimental materials and paradigm. (A) Experimental materials. The sample pictures correspond to the 4 kinds of scenes (office, grass, street and kitchen); there are 32 scene images in total. The speaker icons represent the object sounds from the 4 scene categories; there are 8 kinds of sounds in total ("vroom" for the engine and "hoot" for the horn in the street, "sizzle" for the hot oil and "rat-tat" for the kitchen knife in the kitchen, "moo" for the cattle and "baa" for the sheep in the grass, "click" for the keyboard and "ringing" for the telephone in the office). (B) Block-design paradigm. The experiment was composed of 4 runs and lasted about 29 min. Task blocks were separated by 10-s rest blocks, and each task block contained 8 stimuli (shown for 2.5 s with a 0.5-s inter-stimulus interval) presented centrally, followed by a 4-s button press.

2.4. Data acquisition

Scanning was conducted on a 3.0 T Siemens Skyra scanner with a 20-channel head coil at Yantai Affiliated Hospital of Binzhou Medical University. Foam pads and earplugs were used to reduce head motion and scanner noise. A high-resolution structural MR image set was collected using a T1-weighted 3D MPRAGE sequence (repetition time (TR) = 1900 ms, echo time (TE) = 2.52 ms, voxel size = 1 × 1 × 1 mm3, matrix size = 256 × 256, flip angle (FA) = 9°). A gradient-echo planar imaging (EPI) sequence (TR = 2000 ms, TE = 30 ms, voxel size = 3.1 × 3.1 × 4.0 mm3, matrix size = 64 × 64, slices = 33, slice thickness = 4 mm, slice gap = 0.6 mm, FA = 90°) was used for functional data collection. Stimulus presentation and behavioral response collection were performed with E-Prime 2.0 Professional (Psychology Software Tools, Pittsburgh, PA, USA) through an audio-visual somatosensory device equipped with high-resolution glasses and headphones.

2.5. Data preprocessing

Data preprocessing was conducted using the SPM8 package (http://www.fil.ion.ucl.ac.uk/spm). To reach steady-state equilibrium, we discarded the first five functional images from each run; the remaining images were slice-time corrected to the first image of the first run and motion-corrected by realignment. For each participant, the individual structural image was first coregistered to the mean functional image after motion correction, and the transformed structural image was then segmented into gray matter, white matter and cerebrospinal fluid (CSF) using a unified segmentation algorithm. The corrected images were spatially normalized to 3 × 3 × 3 mm3 in Montreal Neurological Institute (MNI) space using the spatial parameters generated from the segmentation. For each subject, the functional data were spatially smoothed with a 6 × 6 × 6 mm full-width at half maximum (FWHM) Gaussian kernel; note that the data for the classification analysis were not smoothed.

The fMRI data for each subject were further preprocessed to remove low-frequency signal changes and minimize head-movement artifacts. A general linear model (GLM) was then constructed for the smoothed functional volumes in all runs to obtain the voxel-wise responses (β values) corresponding to each condition.
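As a rough illustration of this last step, the per-condition β estimation can be sketched with nilearn rather than the SPM8 pipeline actually used here; the file name and the event table below are hypothetical placeholders.

    import pandas as pd
    from nilearn.glm.first_level import FirstLevelModel

    # Hypothetical event table: one row per trial with columns
    # onset (s), duration (s) and trial_type (e.g. "engine_sound").
    events = pd.read_csv("run1_events.csv")

    # TR = 2 s as in the acquisition; no smoothing for the classification data.
    model = FirstLevelModel(t_r=2.0, hrf_model="spm", smoothing_fwhm=None)
    model = model.fit("run1_preprocessed.nii.gz", events=events)

    # One beta (effect-size) map per condition, to be masked by each ROI.
    beta_map = model.compute_contrast("engine_sound", output_type="effect_size")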
2.6. ROI definition

First, we obtained the brain regions activated by all experimental conditions versus rest at the group level, and then picked the anatomical masks in the AAL atlas (Lalli et al., 2012) (Fusiform L, Fusiform R, Occipital Mid L, Occipital Mid R, ParaHippocampal L, ParaHippocampal R, Temporal Sup L, Temporal Sup R) corresponding to pF, LO, PPA and STS with the WFU PickAtlas toolbox in SPM8. Finally, four ROIs (Fig. 2) were defined as the intersections of the activated regions (p < 0.05, FDR corrected) with the anatomical masks.

Fig. 2. Regions of interest. pF: posterior fusiform area; LO: lateral occipital area; PPA: parahippocampal place area; STS: superior temporal sulcus.

2.7. Data analysis

2.7.1. Univariate analysis

A univariate analysis was performed to quantify the percent signal change in the sound and scene conditions for each ROI. First, we used the MarsBaR toolbox (http://marsbar.sourceforge.net) to extract the time courses of the four ROIs in each condition, and then calculated the average signal change for each region in the sound and scene conditions separately. Finally, paired t-tests were conducted between the signal changes of the sound and scene conditions in each region to investigate whether activation differed between the two conditions.
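A minimal sketch of this ROI-level comparison is given below, assuming the per-subject percent signal changes have already been extracted from the ROI time courses (here this was done with MarsBaR); all variable names are hypothetical.

    import numpy as np
    from scipy import stats

    def percent_signal_change(condition_mean, baseline_mean):
        # Percent signal change of an ROI relative to the rest baseline.
        return 100.0 * (condition_mean - baseline_mean) / baseline_mean

    # psc_sound, psc_scene: one value per subject for a given ROI
    # (shape (21,) here); a paired t-test compares the two conditions.
    def compare_conditions(psc_sound, psc_scene):
        return stats.ttest_rel(psc_scene, psc_sound)   # returns (t, p)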


2.7.2. Classification analysis using MVPA

MVPA was conducted to explore the relationship between the scenes and the associated sounds in all ROIs. We labeled the response patterns of the sounds according to the scene categories (kitchen, grass, office, street); for example, the patterns of the engine sound were labeled "street", and the response patterns of the scenes were labeled in the same way. A linear SVM classifier was then used to classify the patterns evoked by the scene images based on the patterns evoked by the sounds. The LibSVM toolkit (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) was used to implement the four-way classification in the ROIs. Afterward, a one-sample t-test was conducted on the classification performances to test whether they were statistically significant (p < 0.05). To verify that the classification performances were reliable, we shuffled the labels, randomly assigned them to the training samples, and performed the 4-scene classification analysis with the same procedure on the "shuffled" data (Stelzer et al., 2013). Furthermore, we divided the scene categories into indoor and outdoor scenes and performed a paired classification analysis; a paired two-tailed t-test was used to examine the statistical distinction between the classification results of the scene sub-categories.

In addition, we performed within-cue classifications with the same method: a four-way discrimination of the scenes based on the patterns evoked by the scene images, and an eight-way discrimination of the object sounds based on the patterns evoked by the sounds.

To eliminate the influence of voxel number on the results (Walther et al., 2009; Said et al., 2010), we repeated the classification analysis while controlling the number of voxels using the F-score method (Chen and Lin, 2006). The F-score is a simple measure of the discriminability of a feature. We calculated the F-score of each voxel, selected a specified number of voxels (from 50 to 950) in descending order of F-score, and then applied the SVM classifier with the same classification procedure. ANOVA was conducted to test whether the classification accuracies differed significantly as the number of voxels increased in each ROI.
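The core of this pipeline (train on sound-evoked patterns labeled with their scene category, test on scene-evoked patterns) can be sketched compactly. The code below uses scikit-learn's libsvm-backed linear SVM in place of the LibSVM interface used in the paper; the F-score is a straightforward multi-class generalization of the two-class score of Chen and Lin (2006), and all array names are hypothetical.

    import numpy as np
    from sklearn.svm import SVC

    def f_scores(X, y):
        # Per-voxel discriminability: between-class scatter of the voxel
        # means divided by the summed within-class variances.
        classes = np.unique(y)
        grand = X.mean(axis=0)
        between = sum((X[y == c].mean(axis=0) - grand) ** 2 for c in classes)
        within = sum(X[y == c].var(axis=0, ddof=1) for c in classes)
        return between / (within + 1e-12)

    # X_sound: (n_sound_trials, n_voxels) beta patterns for sounds, labeled
    # by the scene each sound belongs to; X_scene: scene-evoked patterns.
    def scene_from_sound_accuracy(X_sound, y_sound, X_scene, y_scene,
                                  n_voxels=None):
        if n_voxels is not None:             # optional F-score voxel selection
            keep = np.argsort(f_scores(X_sound, y_sound))[::-1][:n_voxels]
            X_sound, X_scene = X_sound[:, keep], X_scene[:, keep]
        clf = SVC(kernel="linear", C=1.0)    # libsvm-backed linear SVM
        clf.fit(X_sound, y_sound)            # train on sound patterns
        return clf.score(X_scene, y_scene)   # test on scene patterns

    # Permutation baseline ("shuffled" data): shuffle the training labels
    # and rerun the identical classification procedure.
    def shuffled_accuracy(X_sound, y_sound, X_scene, y_scene, rng):
        return scene_from_sound_accuracy(X_sound, rng.permutation(y_sound),
                                         X_scene, y_scene)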

2.7.3. Functional connectivity analysis

Finally, we performed a ROI-to-ROI FC analysis using pF, LO, PPA and STS as seeds to investigate the functional coordination between the four ROIs in the sound and scene conditions. The data preprocessed in SPM8 were imported into the FC toolbox CONN (Whitfield-Gabrieli and Nieto-Castanon, 2012) (http://www.nitrc.org/projects/conn). Individual normalized anatomical images were segmented into white matter, gray matter and CSF. Confounding factors, including possible motion, physiological and other human factors, were defined as the principal components of the BOLD signals from the white matter and CSF, without regressing out the mean global brain signal. Pearson's correlation coefficients were calculated between the time courses of the seed region and all other voxels.
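The two ingredients CONN applies at this step, CompCor-style principal-component confounds from white matter and CSF and Pearson correlation between the cleaned time courses, can be expressed in a few lines of numpy. The sketch below assumes the ROI-mean and white-matter/CSF voxel time courses have already been extracted; all names are hypothetical.

    import numpy as np

    def compcor_confounds(wm_csf_ts, n_comp=5):
        # Principal components of white-matter/CSF voxel time courses
        # (n_timepoints, n_voxels), used as nuisance regressors.
        X = wm_csf_ts - wm_csf_ts.mean(axis=0)
        U, S, _ = np.linalg.svd(X, full_matrices=False)
        return U[:, :n_comp] * S[:n_comp]

    def regress_out(ts, confounds):
        # Residualize each time course against the confounds (plus intercept).
        C = np.column_stack([confounds, np.ones(len(ts))])
        beta, *_ = np.linalg.lstsq(C, ts, rcond=None)
        return ts - C @ beta

    # roi_ts: (n_timepoints, 4) mean time courses of pF, LO, PPA and STS.
    def roi_to_roi_fc(roi_ts, wm_csf_ts):
        clean = regress_out(roi_ts, compcor_confounds(wm_csf_ts))
        return np.corrcoef(clean.T)          # 4 x 4 Pearson correlation matrix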

3. Results

3.1. Behavioral performance

Subjects were asked to press the button after each block of stimuli and to silently name every item during scanning, to control their attention; behavioral performance was evaluated from the accuracy and reaction time of the button presses. The mean accuracy was 95.90% ± 0.046 (t(11) = 72.18, p < 0.001) and the mean response time was 1325.26 ± 138.51 ms (t(11) = 31.73, p < 0.001), indicating that subjects could identify the stimuli effectively. Table 1 shows the response times and accuracies for all conditions.

Table 1
Reaction time and accuracies of all conditions.

Conditions               Response time (ms)   Response accuracy (%)
Sounds    Cattle         1472.81 ± 115.12     98.62 ± 0.12
          Sheep          1303.93 ± 100.52     100.00 ± 0.00
          Wok            1239.91 ± 134.22     95.82 ± 0.12
          Kitchen knife  1226.52 ± 116.92     93.12 ± 0.21
          Horn           1374.02 ± 121.81     97.22 ± 0.11
          Engine         1540.22 ± 116.52     90.32 ± 0.21
          Telephone      1298.42 ± 84.52      100.00 ± 0.00
          Keyboard       1262.72 ± 87.73      84.73 ± 0.32
Pictures  Grass          1269.52 ± 122.82     100.00 ± 0.00
          Kitchen        1120.43 ± 102.31     95.83 ± 0.21
          Street         1532.93 ± 136.11     97.22 ± 0.14
          Office         1205.71 ± 97.72      98.62 ± 0.11

3.2. Univariate analysis

The average signal changes for all sounds and scenes were calculated separately using the MarsBaR package, as shown in Fig. 3. A one-sample t-test for each condition in the four regions showed that all signal changes were significantly above baseline, both in the sound condition (pF: 0.09%, t(20) = 8.58, p < 0.001; LO: 0.11%, t(20) = 9.61, p < 0.001; PPA: 0.04%, t(20) = 7.76, p < 0.001; STS: 0.61%, t(20) = 59.45, p < 0.001; FDR corrected) and in the scene condition (pF: 0.62%, t(20) = 23.06, p < 0.001; LO: 0.79%, t(20) = 29.20, p < 0.001; PPA: 0.42%, t(20) = 12.78, p < 0.001; STS: 0.09%, t(20) = 5.70, p < 0.001; FDR corrected).

Fig. 3. The average signal change for sound and scene in all ROIs. The signal change was calculated using the MarsBaR toolbox, and a paired t-test was performed between the sound and corresponding scene in each ROI. The error bars denote the standard error of the mean. *p < 0.05, **p < 0.01, ***p < 0.001.


Fig. 4. Multivariate classification of scenes using sound-based decoders, and classification performance as a function of the number of voxels resampled. (A) We trained a pattern classifier to predict the pattern of the scene based on the pattern of sounds. (B) Classification accuracies for different ROI sizes; the optimal result was observed when the ROI contained all voxels. A one-sample t-test was used to test the statistical significance of the classification accuracies. The error bars denote the standard error of the mean. *p < 0.05, **p < 0.01, ***p < 0.001.

Moreover, we performed paired t-tests on the signal changes in the sound and scene tasks for all four ROIs and found significant differences between the two conditions in each ROI. The signals in the scene condition were significantly greater than those in the sound condition in the visual cortex (pF: t(20) = 20.09, p < 0.001; PPA: t(20) = 13.05, p < 0.001; LO: t(20) = 25.82, p < 0.001), while STS showed a larger change in the sound task than in the scene task (t(20) = 23.73, p < 0.001).

3.3. Multivariate pattern classification analysis

The activity patterns for the object sounds in each ROI were used to decode the patterns of the scenes. Fig. 4A depicts the average classification accuracies in the four ROIs. A one-tailed t-test suggested that the classification performances were significantly above the chance level (25%) in three regions (pF: 29.68%, t(20) = 6.01, p < 0.001; LO: 29.91%, t(20) = 4.72, p < 0.001; STS: 27.70%, t(20) = 2.23, p < 0.05; FDR corrected). Thus, scene-evoked activity patterns could be predicted from sound-evoked patterns even though subjects were not told before the scan to pay attention to the objects in the scenes. However, the sound-evoked patterns in PPA could not predict the scene-evoked patterns (t(20) = 0.66, p = 0.52). To exclude a contribution of uncorrelated brain areas to scene decoding based on the object sounds, we examined the retrosplenial cortex (RSC) as a control area: it has been shown to be involved in scene decoding, but the literature has not suggested that it represents the semantic relationship between scenes and sounds. The classification accuracy in this control area was 25.92% (t(20) = 1.12, p = 0.27). In addition, the classifications based on the "shuffled" data failed in all ROIs (pF: t(20) = -0.64, p = 0.527; PPA: t(20) = -1.19, p = 0.249; LO: t(20) = 0.97, p = 0.334; STS: t(20) = -0.44, p = 0.665). Furthermore, we performed paired t-tests between the classification accuracies of the "true" and "shuffled" data (pF: t(20) = -5.58, p < 0.01; PPA: t(20) = -5.98, p < 0.001; LO: t(20) = 0.92, p = 0.632; STS: t(20) = -3.12, p = 0.04). These statistical tests verified that the successful decoding was reliable in pF, LO and STS.

The results of the within-cue classification analysis are shown in Table 2. The 8-way classification of sounds succeeded only in STS (t(20) = 5.819, p < 0.001), while the scene-to-scene classification accuracies were significantly above the chance level (25%) in all ROIs. Furthermore, to diminish the influence of the different ROI sizes, we used the F-score to select different numbers of voxels (from 50 to 950) in the 4 ROIs. As shown in Fig. 4B, the classification accuracy increased gradually with the number of voxels (except in STS, where accuracy peaked at 500 voxels, showed a slight decrease as more voxels were added, and did not increase again until 700 voxels). However, a one-way analysis of variance (ANOVA) showed no significant differences in classification accuracy as the number of voxels increased in any of the ROIs. The classification accuracies in these voxel-selected groups were significantly smaller than in the group that included all voxels (p < 0.05).

Table 2
The results of within-cue classification in each ROI.

      Object-object                        Scene-scene
      Accuracy (%)  SEM    t(20)  P        Accuracy (%)  SEM    t(20)  P
pF    13.082        1.211  0.481  0.636    32.242        1.054  6.871  <0.001
LO    12.823        1.186  0.272  0.788    33.383        1.111  7.544  <0.001
PPA   12.910        0.803  0.510  0.616    31.448        1.046  6.163  <0.001
STS   19.320        1.172  5.819  <0.001   28.373        1.030  3.274  0.004

One-sample t-tests of classification accuracies for within-cue stimuli in each ROI (two-tailed). In the 8-way classification of object sounds, decoding failed in pF, LO and PPA, which lie in the visual cortex; the scenes could be successfully decoded from scenes in all ROIs.

3.4. Differences between indoor and outdoor scenes

To explore the influence of the scene sub-categories on scene recognition, we divided the scenes into indoor and outdoor scenes and re-analyzed the data using MVPA as described above. A paired t-test was performed to compare the decoding performance for indoor scenes with that for outdoor scenes in each ROI. The classification accuracies for indoor scenes were significantly lower than those for outdoor scenes in pF (t(20) = 2.52, p < 0.05), PPA (t(20) = 4.04, p < 0.01) and STS (t(20) = 3.93, p < 0.01), whereas there was no significant difference in LO (t(20) = 1.03, p = 0.313). Fig. 5 shows the details of the statistical analysis.

Fig. 5. Classification results for indoor and outdoor scenes. The scenes were divided into indoor and outdoor scenes (indoor: kitchen and office; outdoor: grass and street), and the classification process was repeated. The paired t-test was performed between indoor and outdoor scenes. The error bars denote the standard error of the mean. *p < 0.05, **p < 0.01, ***p < 0.001.

3.5. FC analyses in the sound and scene tasks

To investigate the differences in integration among the four ROIs in the sound and scene tasks, we performed a ROI-to-ROI FC analysis (Fig. 6).

Fig. 6. ROI-to-ROI FC analyses among the 4 ROIs (pF, LO, PPA and STS) in the sound and scene tasks. To show the connectivity more clearly, the ROIs are drawn as points in different locations, even though both hemispheres were used in the FC analysis. A line between two regions stands for a significant FC between them; blue represents negative correlation and red positive correlation. L, left; R, right. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)


In the scene task, no significant FC was observed between STS and PPA (t(20) = -1.76, p = 0.093). In contrast, significant FC was found between pF and LO, PPA and STS (pF vs LO: t(20) = 20.62, p < 0.001; pF vs PPA: t(20) = 10.00, p < 0.001; pF vs STS: t(20) = -5.35, p < 0.001), and between LO and PPA and STS (LO vs PPA: t(20) = 7.87, p < 0.001; LO vs STS: t(20) = -3.92, p < 0.001).

In the sound task, we did not find significant FC between STS and pF, LO or PPA (STS vs pF: t(20) = 0.99, p = 0.33; STS vs LO: t(20) = -0.49, p = 0.63; STS vs PPA: t(20) = -1.66, p = 0.11). In contrast, significant FC was observed between pF and LO and PPA (pF vs LO: t(20) = 14.06, p < 0.001; pF vs PPA: t(20) = 7.74, p < 0.001), and the FC between LO and PPA was also significant (LO vs PPA: t(20) = 5.75, p < 0.001).

The above ROI-to-ROI FC analyses showed that STS had no significant positive correlation with the other three regions in either the sound or the scene task. To explore the functional role of STS in the scene and sound tasks, we performed a further seed-to-voxel FC analysis using STS as the seed. More specific regions were found to show significant FC with STS, including areas related to speech (insular cortex), visuospatial imagery and memory (middle frontal gyrus), semantics (posterior middle temporal gyrus, inferior frontal gyrus), object processing (LOC) and audio-visual integration (posterior superior temporal gyrus). Fig. 7 illustrates the brain regions significantly correlated with STS in both tasks (p < 0.001, FDR corrected), and details of these regions are given in Table 3.

Fig. 7. Surface displays for seed-to-voxel FC analyses using STS as a seed in the scene and sound tasks separately. Threshold: voxel-level p = 0.001, FWE correction.

Table 3
Regions that showed significant functional connectivity with STS in the scene and sound tasks.

Seed  Regions of FC                                        Voxels  MNI peak (x, y, z)  Peak t-value

Scene condition: positive correlations
STS   R STG: PreCG, PostCG, ACC, IC R, CO R, MTG R,        6089    63, -9, 3           16.76
      PCu, PO R, PT R, SMA, pSTG R, pSMG R, aSMG R,
      SPL, HG R, iLOC R, Amy, aSTG R, SFG, pMTG
      L STG: IC L, PreCG L, PostCG L, MTG L, SMG L,        3899    -48, -12, 0         26.40
      IPL, pSTG L, iLOC L, HG L, Tha, Amy L
      L Cereb8                                             137     -15, -60, -51       7.67
      L MFG                                                106     -36, 42, 30         6.48
      R IFG                                                35      42, 39, 6           4.45
      R Cereb8                                             34      15, -72, -54        8.03

Sound condition: positive correlations
STS   R STG: PreCG, PostCG, IC, CO, MFG, MTG, PO, PT,      11884   60, -24, 9          25.06
      IFG, SMA, ACC, pSTG, aSMG, pSMG, HG, SFG,
      iLOC L, PaCiG, aSTG, Amy L, pMTG R, FOC R
      L Cereb8: Cereb6 L, Cereb2 L, Cereb1 L, Cereb7 L     648     -24, -63, -51       9.28
      R Cereb8: Cereb6 R                                   336     15, -75, -51        7.71
      L SPL                                                82      -30, -57, 66        5.47

MNI coordinates and peak-voxel t-values are reported for clusters (≥ 30 voxels) showing significant FC with STS. The region containing the peak voxel is listed first; the other areas in the cluster follow. PreCG: precentral gyrus; PostCG: postcentral gyrus; ACC: anterior cingulate cortex; IC: insular cortex; CO: central opercular cortex; M/STG: middle/superior temporal gyrus; PCu: precuneus; PO: parietal operculum cortex; PT: planum temporale; SMA: supplementary motor area; SMG: supramarginal gyrus; HG: Heschl's gyrus; LOC: lateral occipital cortex; Amy: amygdala; S/M/IFG: superior/middle/inferior frontal gyrus; I/SPL: inferior/superior parietal lobule; Tha: thalamus; PaCiG: paracingulate gyrus; FOC: orbital frontal cortex; Cereb: cerebellum; a/p/i: anterior/posterior/inferior; L/R: left/right.

3.6. Searchlight analysis

To examine whether areas outside our pre-defined ROIs showed above-chance decoding accuracy, we performed a searchlight analysis to identify regions containing significant associations between scene patterns and the patterns evoked by object sounds. We defined a 5-mm-radius spherical mask centered on each voxel in the brain and conducted the scene-from-sounds classification procedure described in the MVPA classification section. Fig. 8 shows the results of this pattern classification analysis. Compared with the seed-to-voxel FC analysis, more voxels were found in the frontal cortex, although they were scattered. Apart from these areas, the voxels with significant classification accuracies for scenes were mainly distributed in the precentral gyrus, as also found in the FC analysis. Given the significant classification accuracies for scenes in these areas, it is worth exploring the connections between them during sound-to-scene decoding, which will be our future work.
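Schematically, the searchlight simply reuses the scene-from-sounds classifier on the local pattern around every voxel. The sketch below assumes voxel coordinates in millimeters and the scene_from_sound_accuracy function from the Section 2.7.2 sketch; all other names are hypothetical.

    import numpy as np

    # coords: (n_voxels, 3) voxel coordinates in mm.
    def searchlight_map(coords, X_sound, y_sound, X_scene, y_scene, radius=5.0):
        acc = np.zeros(len(coords))
        for i, center in enumerate(coords):
            # All voxels within a 5-mm-radius sphere around the center voxel.
            sphere = np.flatnonzero(
                np.linalg.norm(coords - center, axis=1) <= radius)
            acc[i] = scene_from_sound_accuracy(X_sound[:, sphere], y_sound,
                                               X_scene[:, sphere], y_scene)
        return acc   # accuracy map, later thresholded against chance (25%)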

4. Discussion

The principal finding of this study is that scene patterns can be decoded from the relevant sound-evoked information in pF, LO and STS: the activity patterns of the two object sounds strongly associated with each scene successfully predicted the scene patterns. However, a similar relationship between scene and sound was not observed in PPA. A further FC analysis using the four ROIs as seeds indicated that there was no significant FC between STS and the other three regions in the sound and scene tasks. By exploring the construction of scene patterns from the response patterns of sounds in pF, LO and STS, our findings provide evidence for a neural mechanism of scene recognition based on sound-evoked information.

4.1. Reconstructing scenes from object sounds using MVPA

In this study, we found that the patterns evoked by object sounds could significantly discriminate the corresponding scene patterns in pF, LO and STS, but not in the PPA.

The ventral occipitotemporal cortex (VOTC) has been explored both in sighted and congenitally blind individuals, and the findings agree with the notion that this region represents objects in a multi-modal way (Bi et al., 2016). By investigating the relationship between actual visual stimuli and fMRI activity in early visual areas using quantitative receptive-field models, one previous study showed that it is possible to reconstruct visual stimuli from fMRI activity patterns (Kay et al., 2008). Another study successfully reconstructed complex natural scenes by developing a new Bayesian decoder based on fMRI signals in early and anterior visual areas (Naselaris et al., 2009). The existing literature also shows that scene patterns can be predicted from the patterns of within-scene objects in LO (MacEvoy and Epstein, 2011). Our study, in turn, showed that the patterns evoked by a scene could be decoded from the patterns of scene-relevant sounds using MVPA.


One recent study (Klemen and Chambers, 2012) suggested that LOC could decode object sounds because the sounds evoked subjects' imagery of the corresponding objects; the fMRI data of the imagined objects were successfully used to determine the objects' identity, providing consistent evidence that imagery and actual perception may share the same neural mechanism (Kosslyn et al., 2001; Hubbard, 2010), a view further supported by successful scene decoding during both perception and imagery (Johnson and Johnson, 2014). One previous study found that the two subregions of LOC, pF and LO, are involved in processing different aspects of visual recognition (Nordhjem et al., 2015). In our study, we speculate that one explanation for the successful scene decoding based on the sounds may be that the sound stimuli induced imagery of the corresponding objects (Klemen and Chambers, 2012; Vetter et al., 2014), combined with the strong association between the objects and sounds within the scenes.

As opposed to LOC, we did not find a semantic relationship between sound and scene in the PPA. That is to say, even though many studies have attested to the important role of PPA in scene recognition, its responses to scenes seem to have nothing to do with its responses to the corresponding sounds. Other studies demonstrated that PPA can also represent scene-relevant object information (Macevoy and Epstein, 2009; Linsley and MacEvoy, 2014) and take part in tuning object information (Macevoy and Epstein, 2009; Harel et al., 2013). Although subjects might imagine the objects and animals when hearing the sounds, activity patterns in PPA only included information about standalone objects, and this information disappeared when the objects appeared within a scene (MacEvoy and Epstein, 2011).

A growing number of studies hold that integrating stimuli from multiple sensory modalities is an inherent property of the human brain (Ghazanfar and Schroeder, 2006; van Atteveldt et al., 2014), and multisensory processing may occur at various levels, from initially narrow mechanisms to highly flexible ones (Muckli et al., 2015). STS plays a central role in multisensory interaction and responds strongly to both visual and auditory objects (Klemen and Chambers, 2012). Our study showed that STS could also represent the semantic relationship between scene and sound, and we speculate that one could imagine the animals and man-made manipulable objects (tools) contained in the scenes when hearing the sound stimuli, for example, "moo" and cattle in the grass, or "vroom" and an automobile engine in the street. Furthermore, the STS is believed to be capable of integrating meaningful audio-visual object information into a coherent percept (Beauchamp et al., 2004; Venezia et al., 2017) and to take part in many diverse processes, for instance theory of mind, audiovisual integration, motion processing, speech processing and face processing (Hein and Knight, 2008; Blank and von Kriegstein, 2013). The same brain region may be responsible for different recognition functions in distinct task networks (Hein and Knight, 2008; Watson et al., 2014). Therefore, STS may also be involved in tuning the sound information in a scene.


Fig. 8. Searchlight-based classification analysis maps across subjects. The surface displays the decoding accuracy of voxels with classification accuracies significantly higher than chance level (25%, p < 0.01, FDR corrected). (A) Outlined regions are STS (dark blue), pF (light blue), PPA (purple) and LO (green). (B) Besides the four ROIs, some main regions are labeled, including the precentral gyrus, MFG and IFG. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)

Moreover, our study found significant differences in signal change between the scenes and the scene-related sounds in all four regions. Specifically, the signal change for scene images was significantly greater than that for sounds in pF, LO and PPA, while the signal change for sounds was significantly higher than that for scenes in STS. In spite of these differences in signal change, the different scenes could still be successfully discriminated in these ROIs based on the patterns evoked by the sounds of objects within the scenes, showing that these brain regions can represent the sound-scene semantic relationship. One explanation may be that univariate analysis of fMRI data is not as sensitive as MVPA (Norman et al., 2006; Stelzer et al., 2013) to the small numbers of neurons that carry a little classification information.

4.2. Influence of scene openness

Previous studies suggested that scene openness might influence scene recognition: comparing the fMRI activity evoked by indoor and outdoor scenes, the neural signal induced by indoor scenes was found to be stronger than that induced by outdoor scenes in the PPA (Henderson et al., 2007; Kravitz et al., 2011). In this study, we further analyzed the indoor and outdoor scenes using MVPA and found that the openness of the scenes affected the sound-based scene prediction in pF, PPA and STS, which is consistent with the previous studies. However, LO showed a great tolerance of scene openness in the present study. In addition, the classification analysis of indoor and outdoor scenes suggested that pF, PPA and STS may be able to represent the semantic relationship between outdoor scenes and their associated sounds, but failed to process the association between indoor scenes and the sounds.

4.3. FC in the scene and sound tasks

Using the ROI-to-ROI FC analyses, we found significant functional correlations between the four ROIs, which formed two distinct sub-networks in the scene task (a negative-correlation network consisting of pF, LO and STS, and a positive-correlation network composed of pF, LO and PPA). PPA has been shown to be more sensitive to scenes than to other visual stimuli (Epstein et al., 1999, 2003; Epstein et al., 2006) and to be involved in decoding the appearance and layout of scenes (Epstein et al., 1999). Previous studies (Macevoy and Epstein, 2009; MacEvoy and Epstein, 2011) found that the brain regions that could successfully classify the four categories of scenes included pF and LO, indicating that these two regions are also engaged in scene classification. This evidence is consistent with our FC analyses during scene processing. An opposite situation was observed in the sound task, in which pF, LO and PPA formed one sub-network while STS showed no functional associations with the other three regions. The STS has been suggested to represent stimuli from the visual, auditory and audio-visual modalities (Beauchamp, 2005; Venezia et al., 2017). This raises the question of why STS was not functionally connected with the other three regions in the sound task.


We speculate that the reason is that only STS participated in representing the sound stimuli; it was therefore independent of the network formed by the other three ROIs, which mainly represented the visual scenes.

As STS showed an independent role in the ROI-to-ROI analyses, we further explored the brain regions related to STS in the sound and scene tasks. Some common brain areas functionally cooperated with STS in both tasks, suggesting that these regions might serve different functions in different tasks (Beauchamp et al., 2004; Hein and Knight, 2008; Venezia et al., 2017). In this study, the insular cortex showed significant FC with STS. The insular cortex has been demonstrated to play a role in cross-modal coincidence detection and matching (Calvert, 2001; Senkowski et al., 2007), and is involved in a number of different functions such as pain perception, speech production and social emotion processing (Nieuwenhuys, 2011; Gu et al., 2013). Part of the middle frontal gyrus has been shown to store and process working memory (Leung et al., 2002; Senkowski et al., 2007). One recent study (Whitney et al., 2010) indicated that the posterior middle temporal gyrus (pMTG) and inferior frontal gyrus (IFG) play an important role in semantic control, using repetitive transcranial magnetic stimulation (rTMS) to disrupt processing in IFG and pMTG. Additional evidence (MacEvoy and Epstein, 2011) pointed out that LOC could represent the semantic relationship between scenes and scene-relevant objects and provide an object-based channel for scene decoding. Previous studies showed that the posterior superior temporal gyrus can integrate different types of within-modality and cross-modality information (Beauchamp et al., 2004; Venezia et al., 2017). It is possible that the key assignment in the experiment offered some category information to these brain areas, even though the data during the motor response were not used for decoding. Considering that part of the motor response can overlap in time with the response induced by sound, the key assignment could be improved in future studies by assigning different motor responses to the same scene, which may reduce the influence of fixed motor responses on stimulus classification. In addition, the existence of these common brain regions may imply that they cooperate with STS during sound and scene processing to represent the semantic relationship between the two; the connectivity between these areas can help us understand the information flow during sound-to-scene representation, which will be our next work.

5. Conclusions

In this study, we explored scene decoding based on the patterns evoked by the sounds of objects within scenes and found that pF, LO and STS could represent the semantic relationship between the scenes and the sounds of the associated objects. However, this semantic association was not observed in PPA. Furthermore, by dividing the scenes into indoor and outdoor categories, we found that LO was not sensitive to the openness of the scenes. The ROI-to-ROI FC analyses among the four ROIs in the scene and sound tasks indicated that STS did not coordinate with the other three ROIs in either task, and a further seed-to-voxel analysis using STS as the seed suggested a distinct network for processing the scenes and sounds. In summary, our study demonstrates the existence of a cross-modal sound-based channel for scene decoding in the occipitotemporal cortex, which offers further access to understanding the neural mechanism of scene recognition.

Conflict of interest

No conflict of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (No. U1736219, No. 61860206010 and No. 61571327), Shandong Provincial Natural Science Foundation of China (No. ZR2015HM081) and the Project of Shandong Province Higher Educational Science and Technology Program (J15LL01).

References

Amedi, A., von Kriegstein, K., van Atteveldt, N.M., Beauchamp, M., Naumer, M.J., 2005. Functional imaging of human crossmodal identification and object recognition. Exp. Brain Res. 166, 559–571.
Beauchamp, M.S., 2005. See me, hear me, touch me: multisensory integration in lateral occipital-temporal cortex. Curr. Opin. Neurobiol. 15, 145–153.
Beauchamp, M.S., Lee, K.E., Argall, B.D., Martin, A., 2004. Integration of auditory and visual information about objects in superior temporal sulcus. Neuron 41, 809–823.
Belin, P., Zatorre, R.J., Lafaille, P., Ahad, P., Pike, B., 2000. Voice-selective areas in human auditory cortex. Nature 403, 309–312.
Bi, Y., Wang, X., Caramazza, A., 2016. Object domain and modality in the ventral visual pathway. Trends Cogn. Sci. 20, 282–290.
Blank, H., von Kriegstein, K., 2013. Mechanisms of enhancing visual-speech recognition by prior auditory information. NeuroImage 65, 109–118.
Calvert, G.A., 2001. Crossmodal processing in the human brain: insights from functional neuroimaging studies. Cereb. Cortex 11, 1110–1123.
Chen, Y.W., Lin, C.J., 2006. Combining SVMs with various feature selection strategies. Stud. Fuzziness Soft Comput. 207, 315–324.
Doehrmann, O., Weigelt, S., Altmann, C.F., Kaiser, J., Naumer, M.J., 2010. Audiovisual functional magnetic resonance imaging adaptation reveals multisensory integration effects in object-related sensory cortices. J. Neurosci. 30, 3370–3379.
Drucker, D.M., Aguirre, G.K., 2009. Different spatial scales of shape similarity representation in lateral and ventral LOC. Cereb. Cortex 19, 2269–2280.
Epstein, R., Kanwisher, N., 1998. A cortical representation of the local visual environment. Nature 392, 598–601.
Epstein, R., Harris, A., Stanley, D., Kanwisher, N., 1999. The parahippocampal place area: recognition, navigation, or encoding? Neuron 23, 115–125.
Epstein, R., Graham, K.S., Downing, P.E., 2003. Viewpoint-specific scene representations in human parahippocampal cortex. Neuron 37, 865–876.
Epstein, R.A., Higgins, J.S., Parker, W., Aguirre, G.K., Cooperman, S., 2006. Cortical correlates of face and scene inversion: a comparison. Neuropsychologia 44, 1145–1158.
Epstein, R.A., Parker, W.E., Feiler, A.M., 2007. Where am I now? Distinct roles for parahippocampal and retrosplenial cortices in place recognition. J. Neurosci. 27, 6141–6149.
Ghazanfar, A.A., Schroeder, C.E., 2006. Is neocortex essentially multisensory? Trends Cogn. Sci. 10, 278–285.
Greene, M.R., Oliva, A., 2009. Recognition of natural scenes from global properties: seeing the forest without representing the trees. Cogn. Psychol. 58, 137–176.
Gu, X., Hof, P.R., Friston, K.J., Fan, J., 2013. Anterior insular cortex and emotional awareness. J. Comp. Neurol. 521, 3371–3388.
Harel, A., Kravitz, D.J., Baker, C.I., 2013. Deconstructing visual scenes in cortex: gradients of object and spatial layout information. Cereb. Cortex 23, 947–957.
Harrison, S.A., Tong, F., 2009. Decoding reveals the contents of visual working memory in early visual areas. Nature 458, 632–635.
Haushofer, J., Livingstone, M.S., Kanwisher, N., 2008. Multivariate patterns in object-selective cortex dissociate perceptual and physical shape similarity. PLoS Biol. 6, e187.
Haxby, J.V., Gobbini, M.I., Furey, M.L., Ishai, A., Schouten, J.L., Pietrini, P., 2001. Distributed and overlapping representations of faces and objects in ventral temporal cortex. Science 293, 2425–2430.
Hein, G., Knight, R.T., 2008. Superior temporal sulcus—it's my area: or is it? J. Cogn. Neurosci. 20, 2125–2136.
Henderson, J.M., Larson, C.L., Zhu, D.C., 2007. Cortical activation to indoor versus outdoor scenes: an fMRI study. Exp. Brain Res. 179, 75–84.
Hubbard, T.L., 2010. Auditory imagery: empirical findings. Psychol. Bull. 136, 302–329.
Johnson, M.R., Johnson, M.K., 2014. Decoding individual natural scene representations during perception and imagery. Front. Hum. Neurosci. 8, 59.
Kay, K.N., Naselaris, T., Prenger, R.J., Gallant, J.L., 2008. Identifying natural images from human brain activity. Nature 452, 352–355.
Klemen, J., Chambers, C.D., 2012. Current perspectives and methods in studying neural mechanisms of multisensory interactions. Neurosci. Biobehav. Rev. 36, 111–133.
Kosslyn, S.M., Ganis, G., Thompson, W.L., 2001. Neural foundations of imagery. Nat. Rev. Neurosci. 2, 635–642.
Kravitz, D.J., Peng, C.S., Baker, C.I., 2011. Real-world scene representations in high-level visual cortex: it's the spaces more than the places. J. Neurosci. 31, 7322–7333.
Lalli, S., Piacentini, S., Franzini, A., Panzacchi, A., Cerami, C., Messina, G., Ferré, F., Perani, D., Albanese, A., 2012. Epidural premotor cortical stimulation in primary focal dystonia: clinical and 18F-fluorodeoxyglucose positron emission tomography open study. Mov. Disord. 27, 533–538.


Leung, H., Gore, J.C., Goldman-Rakic, P.S., 2002. Sustained mnemonic response in the human middle frontal gyrus during on-line storage of spatial memoranda. J. Cogn. Neurosci. 14, 659–671.
Liang, M., Mouraux, A., Hu, L., Iannetti, G.D., 2013. Primary sensory cortices contain distinguishable spatial patterns of activity for each sense. Nat. Commun. 4, 1979.
Linsley, D., MacEvoy, S.P., 2014. Evidence for participation by object-selective visual cortex in scene category judgments. J. Vis. 14, 19.
Macevoy, S.P., Epstein, R.A., 2009. Decoding the representation of multiple simultaneous objects in human occipitotemporal cortex. Curr. Biol. 19, 943–947.
MacEvoy, S.P., Epstein, R.A., 2011. Constructing scenes from objects in human occipitotemporal cortex. Nat. Neurosci. 14, 1323–1329.
Malach, R., Reppas, J.B., Benson, R.R., Kwong, K.K., Jiang, H., Kennedy, W.A., Ledden, P.J., Brady, T.J., Rosen, B.R., Tootell, R.B., 1995. Object-related activity revealed by functional magnetic resonance imaging in human occipital cortex. Proc. Natl. Acad. Sci. U. S. A. 92, 8135–8139.
Mesulam, M.M., 1998. From sensation to cognition. Brain 121 (Pt 6), 1013–1052.
Muckli, L., Vizioli, L., Petro, L., De Martino, F., Vetter, P., 2015. Predictive coding of auditory and contextual information in early visual cortex: evidence from layer-specific fMRI brain reading. J. Vis. 15, 720.
Naselaris, T., Prenger, R.J., Kay, K.N., Oliver, M., Gallant, J.L., 2009. Bayesian reconstruction of natural images from human brain activity. Neuron 63, 902–915.
Nieuwenhuys, R., 2011. The insular cortex: a review. Prog. Brain Res. 195, 123–163.
Nordhjem, B., Curcic-Blake, B., Meppelink, A.M., Renken, R.J., de Jong, B.M., Leenders, K.L., van Laar, T., Cornelissen, F.W., 2015. Lateral and medial ventral occipitotemporal regions interact during the recognition of images revealed from noise. Front. Hum. Neurosci. 9, 678.
Norman, K.A., Polyn, S.M., Detre, G.J., Haxby, J.V., 2006. Beyond mind-reading: multi-voxel pattern analysis of fMRI data. Trends Cogn. Sci. 10, 424–430.
Park, S., Brady, T.F., Greene, M.R., Oliva, A., 2011. Disentangling scene content from spatial boundary: complementary roles for the parahippocampal place area and lateral occipital complex in representing real-world scenes. J. Neurosci. 31, 1333–1340.
Park, S., Konkle, T., Oliva, A., 2015. Parametric coding of the size and clutter of natural scenes in the human brain. Cereb. Cortex 25, 1792–1805.
Renninger, L.W., Malik, J., 2004. When is scene identification just texture recognition? Vision Res. 44, 2301–2311.
Rohe, T., Noppeney, U., 2016. Distinct computational principles govern multisensory integration in primary sensory and association cortices. Curr. Biol. 26, 509–514.
Said, C.P., Moore, C.D., Engell, A.D., Todorov, A., Haxby, J.V., 2010. Distributed representations of dynamic facial expressions in the superior temporal sulcus. J. Vis. 10, 11.
Senkowski, D., Saint-Amour, D., Kelly, S.P., Foxe, J.J., 2007. Multisensory processing of naturalistic objects in motion: a high-density electrical mapping and source estimation study. NeuroImage 36, 877–888.
Stansbury, D.E., Naselaris, T., Gallant, J.L., 2013. Natural scene statistics account for the representation of scene categories in human visual cortex. Neuron 79, 1025–1034.
Stelzer, J., Chen, Y., Turner, R., 2013. Statistical inference and multiple testing correction in classification-based multi-voxel pattern analysis (MVPA): random permutations and cluster size control. NeuroImage 65, 69–82.
Tyll, S., Bonath, B., Schoenfeld, M.A., Heinze, H.J., Ohl, F.W., Noesselt, T., 2013. Neural basis of multisensory looming signals. NeuroImage 65, 13–22.
van Atteveldt, N., Murray, M.M., Thut, G., Schroeder, C.E., 2014. Multisensory integration: flexible use of general operations. Neuron 81, 1240–1253.
Venezia, J.H., Vaden Jr., K.I., Rong, F., Maddox, D., Saberi, K., Hickok, G., 2017. Auditory, visual and audiovisual speech processing streams in superior temporal sulcus. Front. Hum. Neurosci. 11, 174.
Vetter, P., Smith, F.W., Muckli, L., 2014. Decoding sound and imagery content in early visual cortex. Curr. Biol. 24, 1256–1262.
Walther, D.B., Caddigan, E., Fei-Fei, L., Beck, D.M., 2009. Natural scene categories revealed in distributed patterns of activity in the human brain. J. Neurosci. 29, 10573–10581.
Walther, D.B., Chai, B., Caddigan, E., Beck, D.M., Fei-Fei, L., 2011. Simple line drawings suffice for functional MRI decoding of natural scene categories. Proc. Natl. Acad. Sci. U. S. A. 108, 9661–9666.
Watson, R., Latinus, M., Charest, I., Crabbe, F., Belin, P., 2014. People-selectivity, audiovisual integration and heteromodality in the superior temporal sulcus. Cortex 50, 125–136.
Werner, S., Noppeney, U., 2010. Distinct functional contributions of primary sensory and association areas to audiovisual integration in object categorization. J. Neurosci. 30, 2662–2675.
Whitfield-Gabrieli, S., Nieto-Castanon, A., 2012. Conn: a functional connectivity toolbox for correlated and anticorrelated brain networks. Brain Connect. 2, 125–141.
Whitney, C., Kirk, M., O'Sullivan, J., Ralph, M.A.L., Jefferies, E., 2010. The neural organization of semantic control: TMS evidence for a distributed network in left inferior frontal and posterior middle temporal gyrus. Cereb. Cortex, bhq180.
