
Acoustic scene classification: A comprehensive survey


Biyun Ding a, Tao Zhang a,*, Chao Wang a, Ganjun Liu a, Jinhua Liang b, Ruimin Hu c, Yulin Wu c, Difei Guo d

a School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
b Centre for Digital Music, Queen Mary University of London, London, UK
c National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan, China
d Tianjin Communication and Broadcasting Group Company Limited, Tianjin, China

A R T I C L E  I N F O

Keywords: Acoustic scene classification; Data augmentation; Datasets; DCASE; Deep learning; Environmental sound; Feature extraction; Modeling

A B S T R A C T

Acoustic scene classification (ASC) has gained significant interest recently due to its diverse applications. Various audio signal processing and machine learning methods have been proposed for ASC, and the volume and scope of ASC publications covering theories, algorithms, and applications have also expanded. However, no recent comprehensive survey exists to collect and organize this knowledge, which impedes researchers and the practical application of ASC. To fill this gap, we present an up-to-date overview of ASC methods, covering earlier works and recent advances. In this work, we first define a general framework for ASC, starting with a historical review of previous research in the ASC field. Then, we review core techniques for ASC that have achieved good performance. Focusing on machine learning based ASC systems, this work summarizes and groups the existing techniques in terms of data processing, feature acquisition, and modeling. Furthermore, we provide a summary of the available resources for ASC research and analyze the ASC tasks in the Detection and Classification of Acoustic Scenes and Events (DCASE) challenges. Finally, we discuss limitations of current ASC algorithms and open challenges for possible future developments toward practical applications of ASC systems.

1. Introduction

Acoustic scene classification (ASC) is the task of assigning a predefined semantic label to an audio stream recorded in a certain environment by analyzing the audio signal (Waldekar & Saha, 2020b). For instance, an ASC system may categorize an audio clip under predefined labels such as bus, office, or home (Virtanen, Plumbley, & Ellis, 2018). Generally, semantic labels stem from environmental sound categorizations that describe the ambiance of audio streams. An overview of an ASC system is shown in Fig. 1. ASC is usually treated as a single-label classification task; its multi-label counterpart is audio tagging. An acoustic scene refers to the entirety of sounds from various sources, typically merging real-world scenarios into a composite (Virtanen et al., 2018). Other terms for acoustic scenes include environments of sounds and acoustic/sonic/sound/audio environments.

Various potential applications have promoted ASC research, such as multimedia information retrieval (Virtanen et al., 2018), context awareness in smart devices (Vivek, Vidhya, & Madhanmohan, 2020; Chu et al., 2006; Hüwel, Adiloğlu, & Bach, 2020), smart homes/cities, bioacoustic scene analysis, and acoustic abnormal monitoring (Dong et al., 2018). ASC could enable several prospective technologies, such as automated sound archives assigning metadata to audio; smart devices

* Corresponding author.
E-mail addresses: beatrice@tju.edu.cn (B. Ding), zhangtao@tju.edu.cn (T. Zhang), wwchao@tju.edu.cn (C. Wang), ganjun_liu@tju.edu.cn (G. Liu), jinhua.liang@qmul.ac.uk (J. Liang), hrm@whu.edu.cn (R. Hu), wuyulin@whu.edu.cn (Y. Wu).

https://doi.org/10.1016/j.eswa.2023.121902
Received 6 December 2022; Received in revised form 20 August 2023; Accepted 27 September 2023
Available online 2 October 2023

leveraging ASC to adapt their functions to the surroundings (e.g., intelligent wearable devices, hearing aids, or robotic wheelchairs); information capture for intelligent decision-making (e.g., noise monitoring, classification of domestic activities (Ebbers, Keyser, & Haeb-Umbach, 2021), autonomous surveillance (Chandrakala & Jayalakshmi, 2019), and workplace scene analysis (Jati et al., 2021)); or acoustic abnormal monitoring for early warning (e.g., building health monitoring (Kawamoto & Hamamoto, 2020) and fiber optic security systems (He & Zhu, 2021)).

Similar to ASC, Sound Event Detection (SED) is also used to capture environment information. It recognizes what is happening in an audio signal and when it is happening (Xia et al., 2019; Mesaros et al., 2021). Its annotation consists of event classes with temporal information (i.e., onset and offset times of a sound event). In practice, SED aims to recognize in which temporal instances different sounds are active within an audio signal. The difference between ASC and SED lies in the annotation: ASC is a single-label task without temporal information, whereas SED is a multi-label task with temporal information. In some contexts, like multimedia retrieval systems (Vivek, Vidhya, & Madhanmohan, 2020), the lines between SED and ASC might blur. Also, ASC can be used to boost SED performance by providing prior information about the probability of certain events (Barchiesi et al., 2015). Therefore, there may be overlapping applications for both.

ASC has been an active research field for decades, and the perception of environmental sounds has garnered significant attention. Several important challenges, such as the Detection and Classification of Acoustic Scenes and Events (DCASE) challenge, have driven significant progress in ASC (Bai et al., 2020).

In comparison to speech and music signals, environmental audio signals often exhibit greater diversity and complexity. Therefore, ASC lags behind other audio domains, such as speaker identification and music classification, in its development. Various signal processing and machine learning methods have been proposed to address this low-performance problem, such as signal representation, data augmentation, feature learning, efficient convolutional neural network (CNN) based modeling (Lee, Lim, & Kwak, 2021), and post-processing methods.

Fig. 1. Overview of an ASC system.

The general framework of the ASC system is shown in Fig. 2 and includes data processing, feature acquisition, and modeling steps. These three stages have been the focal points for investigating various ASC techniques.

In the training stage, the input features are obtained from the training set by data processing and feature acquisition, and are then fed, along with the labels, to a model that is initialized randomly or by transfer learning. In the testing stage, the same input features extracted from the testing set are fed into the trained ASC model to obtain its outputs.

Data processing aims to enhance data quality or increase data quantity to boost ASC performance, and includes channel selection and data augmentation. Channel selection refers to selecting the signal channels that benefit ASC performance. Commonly used signal channels in ASC include mono, binaural (left and right), and mid/side channels (Lee et al., 2021), as well as signal variants generated by Harmonic Percussive Source Separation (HPSS) or Nearest Neighbor Filtering (NNF). Data augmentation aims to increase the amount of training data through techniques such as mixing, pitch shifting, time stretching, adding random noise, mixup, channel confusion, dynamic range compression, spectrum augmentation, spectral rolling, random erasing, Between-Class learning, spectrum correction, Generative Adversarial Networks (GAN), etc. Sometimes a combination of several data augmentation methods is employed (Hu et al., 2020; Hu et al., 2021).

Feature acquisition is used to obtain discriminative features for efficient classification and includes feature learning, handcrafted feature extraction, and the techniques of feature selection and dimensionality reduction. Common handcrafted features for ASC include Log Mel Band Energies (LMBE), their delta and delta-delta values, Mel Frequency Cepstral Coefficients (MFCC), Constant-Q-Transform (CQT) features, etc. Feature learning is usually implemented with deep learning methods, such as Deep Neural Networks (DNN), CNN, Visual Geometry Group (VGG) networks, Nonnegative Matrix Factorization (NMF), Restricted Boltzmann Machines (RBM), etc. In addition, some networks are designed specifically for feature learning in audio classification, such as L3-Net and SoundNet. Feature selection and dimensionality reduction aim to obtain a significant and compact feature set.

Modeling is generally implemented using machine learning methods, including supervised, unsupervised, semi-supervised, and self-supervised methods. Supervised methods are the most popular for ASC and cover both conventional machine learning and deep learning. Most earlier ASC works employed conventional machine learning methods for modeling, such as K-Nearest Neighbors (KNN), Gaussian Mixture Models (GMM), Support Vector Machines (SVM), Hidden Markov Models (HMM), random forests, and decision trees. Recently, deep learning methods have been successfully applied to ASC and have achieved excellent results, for example Time Delay Neural Networks (TDNN), Bidirectional Long Short-Term Memory (Bi-LSTM), Feedforward Neural Networks (FNN), CNNs (e.g., traditional CNN, VGG, Residual Network (ResNet), Inception, Xception, and MobileNetV2), Fully Convolutional Neural Networks (FCNN), and Convolutional Recurrent Neural Networks (CRNN). Furthermore, attention mechanisms, transfer learning, and multitask learning methods have also been introduced into ASC; typical examples include SoundNet, SubSpectralNet, DCASENet, CAA-Net, and PANNs. Unsupervised, semi-supervised, and self-supervised methods are studied as new directions for classification tasks to overcome the shortcoming of requiring a large amount of labeled data in deep learning. They have been successfully applied in image classification and natural language processing, and are gradually being introduced into ASC. Also, late fusion methods are used in ASC to improve performance (Ding et al., 2022; Paseddula & Gangashetty, 2021).

Beyond performance concerns, real-world implementations often run on devices with limited computational capacity (Martín-Morató et al., 2021), so systems with high model complexity pose a big challenge to wide use.

Fig. 2. General framework of the ASC system.
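To make the three-stage pipeline of Fig. 2 concrete, the following minimal Python sketch (an illustration only, not a system from the surveyed literature) extracts clip-level log Mel band energies with librosa for the feature acquisition stage and trains a linear SVM for the modeling stage; the `train_files` and `test_files` lists of (wav_path, scene_label) pairs are hypothetical placeholders.

```python
# Minimal ASC train/test sketch (illustrative, not from the surveyed systems):
# clip-level log Mel band energies (LMBE) + a linear SVM classifier.
import numpy as np
import librosa
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

def log_mel_band_energies(wav_path, sr=44100, n_mels=40):
    """Feature acquisition: clip-level log Mel band energies."""
    y, sr = librosa.load(wav_path, sr=sr, mono=True)  # data processing: mono channel
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                         hop_length=1024, n_mels=n_mels)
    return librosa.power_to_db(mel).mean(axis=1)      # average LMBE over frames

def build_dataset(file_label_pairs):
    """file_label_pairs: list of (wav_path, scene_label); a hypothetical input format."""
    feats = np.stack([log_mel_band_energies(path) for path, _ in file_label_pairs])
    labels = np.array([label for _, label in file_label_pairs])
    return feats, labels

def train_and_evaluate(train_files, test_files):
    """Training stage: features + labels -> model; testing stage: same features -> outputs."""
    x_train, y_train = build_dataset(train_files)
    model = SVC(kernel="linear").fit(x_train, y_train)  # modeling stage
    x_test, y_test = build_dataset(test_files)
    return accuracy_score(y_test, model.predict(x_test))
```

In a deep learning system, the clip-level SVM would typically be replaced by a CNN operating on the full time-frequency representation, but the overall train/test flow remains the same.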


To address the complexity problem, various model compression and simplification methods have been applied in ASC, such as parameter pruning and sharing, low-rank factorization, transferred/compact convolutional filters, knowledge distillation (Gou et al., 2021; Kim, Yang, Kim, & Chang, 2021; Kim, Hyun, Chung, & Kwak, 2021), and model simplification (Li et al., 2020).

Recently, ASC has trended from classification using multiple devices (which entails domain adaptation), the use of external data (which involves transfer learning), and open-set recognition toward low-complexity ASC. Numerous methods aim to enhance performance while reducing model complexity, and the challenge lies in optimizing performance within the bounds of a limited computational budget. At present, there are no clear guidelines on optimal ASC algorithms and on methods to reduce complexity. Although there are some literature reviews for ASC, most of them summarize ASC from only one or two perspectives, such as a review of deep learning for ASC (Abeßer, 2020), a survey on ASC and SED for autonomous surveillance (Chandrakala & Jayalakshmi, 2019), and a recent survey on preprocessing and classification techniques for acoustic scenes (Singh, Sharma, & Sur, 2023). Besides, some of them were published in an earlier period (Barchiesi et al., 2015; Dandashi & AlJaam, 2017), which means they cannot provide a summary and analysis of the latest progress. In this work, we present a comprehensive survey of the state-of-the-art methods for ASC from multiple perspectives.

The main contributions of this paper are: i) a systematic review of ASC, covering the unified description of ASC, environmental sound categorization for defining acoustic scenes, the development of ASC, and the current research; ii) a categorization grouping ASC methods according to their processing, allowing us to describe ASC methods independently of their particular application; and iii) suggestions for future research.

The following section introduces the systematic literature review methodology used in this work. Section 3 introduces the ASC background, including environmental sound categorization, the history of ASC, and problems in ASC. The core methods of ASC, namely data processing, feature acquisition, and modeling, are discussed in Section 4, Section 5, and Section 6, respectively. Then, we review open resources for ASC and the ASC tasks of the DCASE challenges in Section 7 and Section 8. Finally, Section 9 discusses open challenges and future work before Section 10 concludes this article. All related acronyms are listed in Table 1.

2. Methodology

This paper provides a comprehensive survey of ASC research from the perspectives of background, core technologies, open-source resources, challenges, and future development trends. Its content borrows extensively from many ASC-related publications (before August 2023). In this section, we detail the systematic literature review methodology used in this work to illustrate the survey scope, the retrieval databases, and the criteria for inclusion and exclusion.

To intuitively show the scope of the survey, we provide the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) flow diagram in Fig. 3. This diagram shows how many papers were retrieved from the given query, how each screening phase removed undesired ones, and eventually how many papers were included in this survey. In Fig. 3, there are 683 ASC-related studies identified from the retrieval databases, 165 documents screened from the DCASE community, and 161 included studies obtained by other methods. A total of 330 papers are cited in this work.

As shown in Fig. 3, we first identify scientific publications from commonly used digital databases, such as Scopus, Web of Science (abbreviated as Wos), IEEE Xplore (abbreviated as IEEE), and Springer Link (abbreviated as Springer). Because several different terms exist for the acoustic scene, we set the query strings according to these common terms. The query strings include acoustic scene classification, environmental sound classification, sound scene classification, and audio scene classification. We also adopt slightly different search methods according to the search settings of each database. For example, we search papers based on multiple query strings simultaneously through OR (satisfying one of them) in the Scopus and Wos databases, and based on a single query string in the IEEE and Springer databases. In addition, we use the automatic filtering tools provided by the databases to conduct a preliminary screening of the search results. Then, we integrate the search results of Scopus & Wos, IEEE, and Springer by removing duplicates. Next, we exclude records that are irrelevant to ASC based on human evaluation. Finally, 683 papers are obtained after integrating all the above search results and removing duplicates. After meta-analysis, 128 of them are included in this work.

Furthermore, we also collected 602 documents (i.e., papers in DCASE workshops and technical reports of ASC tasks in DCASE challenges) from the DCASE community. Then, 165 documents were screened by excluding papers irrelevant to ASC and low-ranking reports (i.e., low-performance ASC methods). Next, 41 of them were selected by meta-analysis as cited references. In addition, we identify 161 studies via other methods, such as reference tracking, citation searching, and searching with extra keywords (e.g., acoustic features, channel, data augmentation, deep learning, model compression, multitask, transfer learning, attention mechanism, and multiple devices). Among these studies, 104 are related literature on ASC, and 57 are related to specific techniques.

In summary, this work cites about 18.7 % (128/683) of the studies retrieved from the databases, which account for about 38.8 % (128/330) of the cited references. Moreover, cited papers from the DCASE

Table 1
List of important abbreviations in alphabetical order.
Abbr. Definition Abbr. Definition Abbr. Definition

AMFB Amplitude modulation filterbank GAN Generative Adversarial Networks NMF Nonnegative Matrix Factorization
ASA Auditory Scene Analysis GFB Gabor Filterbank NNF Nearest Neighbor Filtering
ASC Acoustic Scene Classification GFCC Gammatone Feature Cepstral Coefficients PCA Principal Component Analysis
ASM Acoustic Segment Model GMM Gaussian Mixture Modelling PCEN Per-Channel Energy Normalization
Bi-LSTM Bidirectional Long Short Term Memory HMM Hidden Markov Model RBM Restricted Boltzmann Machine
BOF Bag-Of-Frames HOG Histogram Of Gradient ResNet Residual Network
CASA Computational Auditory Scene Analysis HPSS Harmonic-Percussive Source Separation RNN Recurrent Neural Network
CASR Computational Auditory Scene Recognition ICA Independent Component Analysis RQA Recurrence Quantification Analysis
CELP Code Excited Linear Prediction KD Knowledge Distillation S2SAE Sequence To Sequence Auto-Encoder
CNN Convolutional Neural Network KNN K-Nearest Neighborhood SED Sound Event Detection
CQCC Constant-q Cepstral Coefficients LBP Local Binary Pattern SPD Subband Power Distribution
CQT Constant Q Transform LDA Latent Dirichlet Allocation STFT Short-Time Fourier Transform
CRNN Convolutional Recurrent Neural Network LMBE Log Mel Band Energies SVM Support Vector Machines
DNN Deep Neural Networks LPCC Linear Prediction Cepstral Coefficients TDNN Time Delay Neural Network
DRC Dynamic Range Compression MFCC Mel Frequency Cepstral Coefficients VGG Visual Geometry Group
FCNN Fully Convolutional Neural Network MLP Multilayer Perceptron
FNN Feedforward Neural Networks MP Matching Pursuit


Fig. 3. The PRISMA flow diagram of this work. Due to the limited image space, we abbreviate classification as C in the figure. For example, acoustic scene clas­
sification is abbreviated as acoustic scene C. Studies identified via other methods: reference tracking, citation searching, and keywords searching.

community account for about 12.4 % (41/330). Cited references obtained via other methods account for 48.8 % (161/330), of which references related to ASC account for 31.5 % (104/330) and references related to specific techniques account for 17.3 % (57/330).

To provide a global view of the ASC-related literature, we also display the distribution of ASC-related literature over the years in Fig. 4. From this figure, it can be seen that the number of ASC-related publications has increased over the years, especially since 2016. In particular, the number of publications from 1997 to 2005 was relatively small. Although there was a slight increase in the number of publications from 2006 to 2015, it remained small. The number of ASC-related publications from 2016 to 2023 has increased rapidly and is much higher than in the earlier periods. This indicates that ASC research has received widespread attention.

3. Background

Although ASC has attracted extensive attention, its research is still in its developing stages. In this section, we first introduce environmental sound categorization for defining acoustic scenes. Then, we summarize the development history of ASC in terms of three phases. Finally, we summarize and analyze the problems faced in the ASC field.

3.1. Environmental sound categorization

In the 1990s and 2000s, there was a growing body of literature about the categorization of everyday sounds. For instance, Gaver (Gaver, 1993) grouped sound events into three categories, solids, gasses, and liquids, based on the sound-producing materials. Moreover, two categorization principles (i.e., sound source similarity and the similarity between the events, or actions, causing the sound) were proposed (VanDerveer, 1979). Principles based on situational factors have also been highlighted (Marcell et al., 2000; Gygi, Kidd, & Watson, 2007), such as the sound's location or context and hedonic judgments associated with emotional responses. Houix et al. (Houix et al., 2012) found a close interaction between action and sound source, which is opposite to the independent relationship suggested by Gaver.

Guastavino identified two urban soundscapes in a free sorting task

Fig. 4. Distribution of ASC-related literature over the years. n_searching denotes the papers identified by searching keywords in the databases (Scopus, Wos, IEEE, and Springer). n_included denotes the references cited in this work. n_included_searching denotes the cited references that were identified from the databases; it is the intersection of n_searching and n_included. n_included_others denotes the cited references identified via other methods. Therefore, n_included is the sum of n_included_searching and n_included_others.


according to the absence or presence of human activity in the sound (Salamon, Jacoby, & Bello, 2014). We only list some typical classes at
recording related to hedonic judgments (Guastavino, 2007; Virtanen the low level in Fig. 5 because the acoustic environment taxonomy relies
et al., 2018). Their findings provided support for another form of on application scenarios. There are overlaps among urban, rural, wil­
interaction between categorization principles based on sound sources, derness, and underwater environments. Hence, the acoustic environ­
activities, and hedonic judgments. Guastavino also indicated that ment divide will not always be clear-cut. Despite all this, the proposed
sounds can be cross-classified according to different categorization taxonomy of acoustic environments is applicable and encourages the
principles depending on the sound’s context and the participants’ goals definition of acoustic scenes using coincident nomenclatures.
and theories. The extent to which participants rely on the acoustic
properties of the sound vs. the semantic properties of the sound source 3.2. The history of ASC
also varies for different types of sounds (Giordano, McDonnell, &
McAdams, 2010). We introduce key developments of ASC with aspects of three phases:
Cross-classification into different categorization schemes allows lis­ problem raise, primary exploration, and early in-depth research phases.
teners to get a full understanding of their environments from a goal- The first phase begins with the soundscape (Schafer, 1977), pioneering
oriented perspective (Virtanen et al., 2018). For instance, Brown et al. work in which researchers pay attention to acoustic environments.
(Brown, Kang, & Gjestland, 2011) proposed a taxonomy of acoustic Later, multiple ASC related concepts were proposed, and the automatic
environments based on principles of environment types, the presence or classification of environmental sounds received some attention. In the
absence of human activity, and sound sources. Salamon et al. (Salamon, second phase, the publication of “computational auditory scene anal­
Jacoby, & Bello, 2014) proposed a taxonomy of urban sounds using ysis” (Wang & Brown, 2006) and the study of comprehensively evalu­
principles of sound sources, actions, and their combinations. A rich ating computers’ and humans’ perception of audio context (Eronen
description of sound events across different descriptor layers (i.e., sound et al., 2006) started the early works of ASC. In this phase, ASC algo­
sources, actions, and contexts) was proposed (Virtanen et al., 2018). In rithms applied features from speech recognition areas and simple ma­
addition, influence factors of everyday sound categorization are also chine learning methods, as well as several ASC datasets were released.
studied, consisting of similarity (Lemaitre et al., 2010), person-related The third phase started from the DCASE2016 challenge (Eghbal-Zadeh
factors (expertise of the listeners, age (Berland et al., 2015), and pref­ et al., 2016), where deep learning methods were introduced into ASC
erences), and situational factors (mood, attention, activity) (Steffens, and achieved high performance. In this phase, ASC received widespread
Steele, & Guastavino, 2017). attention, and the ASC publications rapidly increased. For a more
Environmental sound categorization is critical for designing and concise description, we abbreviate the DCASE task1 (ASC task) to Dxx.
evaluating sound classification systems. It defines acoustic scenes and For example, DCASE 2013 task1 is abbreviated to D13. Similarly, DCASE
thus provides predefined labels for ASC tasks. However, there is a dif­ 2019 task 1A (resp. 1B) is abbreviated to D19A (resp. D19B).
ference between theoretical sound categorization and the definition of
sound classes for practical classification tasks. In specific sound classi­ 3.2.1. Problem raise phase (1977–2005)
fication tasks, real data do not always cover all sound categories. For In this phase, many ASC-related concepts were proposed and then
example, ASC tasks in the urban environment focus more on traffic and ASC studies were started. We review the development of this phase
human activities but less on natural sounds, which is very different from (shown in Fig. 6). Table 2 provides early ASC approaches evaluated on
a rural or wilderness environment. self-create databases in this phase. Here, ASC relates to perceptual
To overcome the problem above, we derive a taxonomy of acoustic studies and computational research (Barchiesi et al., 2015). The
environments shown in Fig. 5. This taxonomy is constructed in terms of perceptual studies refer to soundscape cognition (Dubois, Guastavino, &
types of environments—urban, rural, wilderness, and underwater Raimbault, 2006) by defining soundscapes as the auditory equivalent of
(Brown, Kang, & Gjestland, 2011). Acoustic environments generally landscapes (Schafer, 1977), aiming to understand the human cognitive
present mixtures of specific sound sources. Therefore, we further cate­ processes that enable our understanding of acoustic scenes. The
gorize them in terms of different principles. We first divide urban en­ computational research is called Computational Auditory Scene
vironments into three abstract levels: indoor, outdoor, and Recognition (CASR) (Peltonen et al., 2002), attempting to automatically
transportation, and then further group them in terms of types of places, perform ASC (Barchiesi et al., 2015). CASR relates to the area of
e.g., home, office, and public spaces (Virtanen et al., 2018; Chandrakala Computational Auditory Scene Analysis (CASA) (Brown & Cooke, 1994;
& Jayalakshmi, 2019; Bai et al., 2020). We divided rural, wilderness, Wang & Brown, 2006) and is particularly applied to the study of envi­
and underwater environments based on sound sources and actions ronmental sounds (Ellis, 1996).

Fig. 5. Taxonomy of acoustic environments inspired by previous studies and ASC tasks of DCASE challenges.
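As a sketch of how such a two-level taxonomy could be encoded when defining the label set of an ASC task, the snippet below lists a few example classes per environment type; the leaf classes are illustrative (drawn from the text and typical DCASE scene labels), not an exhaustive or authoritative taxonomy.

```python
# Illustrative encoding of the acoustic-environment taxonomy of Fig. 5.
# Leaf classes are examples only; a real ASC task defines its own label set.
ACOUSTIC_ENVIRONMENTS = {
    "urban": {
        "indoor": ["home", "office", "public spaces"],
        "outdoor": ["street traffic", "public square", "park"],
        "transportation": ["bus", "tram", "metro"],
    },
    # Rural, wilderness, and underwater environments are grouped by
    # sound sources and actions; the leaves below are assumed examples.
    "rural": ["farm animals", "agricultural machinery"],
    "wilderness": ["birdsong", "wind and rain"],
    "underwater": ["marine life", "shipping noise"],
}

def leaf_classes(taxonomy):
    """Flatten the taxonomy into the predefined label set used by an ASC task."""
    labels = []
    for node in taxonomy.values():
        if isinstance(node, dict):
            labels.extend(leaf_classes(node))
        else:
            labels.extend(node)
    return labels

print(leaf_classes(ACOUSTIC_ENVIRONMENTS))
```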


Fig. 6. Summary of key developments in problem raise phase of ASC (Schafer,1977; Bregman, 1990; Gaver, 1993; Brown & Cooke, 1994; Ellis, 1996; Sawhney &
Maes, 1997; Clarkson, Sawhney, & Pentland, 1998; Peltonen et al., 2001; Peltonen et al., 2002; Eronen et al., 2003). We marked important concepts and data with red
color. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

different listening levels. It is practical for designing functional and


Table 2 aesthetically stimulating acoustic environments (Droumeva, 2005).
ASC approaches evaluated on self-created databases in the problem raise phase.
After that, Peltonen et al. (Peltonen et al., 2001) conducted a
Ref. Acc. Approaches Classes Samples Dataset listening test to study the human ability to recognize everyday sounds,
Duration
achieving an accuracy of 70.0 % for 25 scenes. It also mentioned the
(audio
duration) importance of prominently identified sound events for ASC. One year
later, they used KNN and GMM for ASC (17 scenes), achieving 68.4 %
(Sawhney 68.0 % Filter-Banks 5 100 3 h (15 s)
& Maes, based features +
and 63.4 %, respectively (Peltonen et al., 2002). They proposed the
1997) nearest-neighbor concept of CASR, focusing on recognizing contexts, or environments,
(Peltonen 70.0 % Human 25 34 34 min (1 instead of discrete sound events. For example, Couvreur et al. (Couvreur
et al., evaluation (19 min) et al., 1998) used Linear Prediction Cepstral Coefficients (LPCC) features
2001) subjects)
and discrete HMMs to recognize five types of environmental noise
(Peltonen 68.4 % Band-energy, 17 209 104.5 min
et al., flux, roll-off, (30 s) events. Clarkson classified seven contexts using spectral energies and
2002) centroid, and HMM (Clarkson & Pentland, 1999). El-Maleh et al. (El-Maleh,
ZCR + KNN Samouelian, & Kabal, 1999) classified five environmental noise classes
63.4 % MFCC + GMM using line spectral features and a Gaussian classifier. Acoustic features
used in different audio information retrieval tasks (Foote, 1999) were
In the field of perceptual studies, Schafer (Schafer, 1977) first pro­ also introduced into ASC. Casey used a front-end system to transform
posed soundscape in 1977 in the context of acoustic ecology and has log-spectral energies into a low-dimensional representation with
grown to connect with other related fields, such as community noise, singular-value decomposition and independent component analysis
acoustics, and psychoacoustics. Then he published the famous book on (ICA) (Casey, 2001). There are also many CASR-related studies (Pelto­
soundscape (Schafer, 1993) in 1993, where he defined the soundscape as nen et al., 2002), such as noise classification (Couvreur et al., 1998),
the totality of all sounds in the listener’s dynamic environment. These speech/music discrimination (Carey, Parris, & Lloyd-Thomas, 1999),
studies have made a significant contribution to the study of environ­ sound source classification (Martin, 1999), content-based audio classi­
mental sound in the framework of acoustic ecology. In the same year, fication (Zhang & Kuo, 2001b), and classification of general audio (Li
Gaver (Gaver, 1993) attempted to construct a new framework for et al., 2001).
describing sound using audible sound source properties according to the Subsequently, Eronen et al. (Eronen et al., 2003) proposed an ASC
content of everyday sounds. algorithm using MFCC and GMM/HMM. Meanwhile, Gerhard (Gerhard,
On the other hand, Bergman (Bregman, 1990) proposed Auditory 2003) reviewed the history and current techniques of audio signal
Scene Analysis (ASA) in 1990 in his academic classic monograph classification in 2003. The author presented the background for under­
“auditory scene analysis”. It is a milestone piece of research on an standing the general research domain of audio signal classification (i.e.,
auditory cognitive mechanism and plays a critical guiding role in sound signal processing, spectral analysis, psychoacoustics, and auditory scene
processing. Based on ASA, CASA was first proposed by Brown in 1994 analysis), the basic elements of classification systems, and the state of
(Brown & Cooke, 1994), which referred to the computational analysis of the ASC research literature at that time.
an acoustic environment and the recognition of distinct sound events in
it. Then, Ellis (Ellis, 1996) proposed a prediction-driven CASA to alter­ 3.2.2. Primary exploration phase (2006–2015)
nate the data-driven approach (i.e., original CASA). The data-driven In this phase, many major ASC-related research topics and compe­
CASA uniquely converts the concrete input data to the more abstract titions emerged one after another following the publication of
output data using a set of operations. It has the shortcomings of “computational auditory scene analysis”. This phase is summarized in
neglecting sound context and cannot detect the presence of a sound Fig. 7. This phase started with the first study of comprehensively eval­
hidden by other components. In contrast, the prediction-driven CASA uating computers’ (accuracy for contexts/high-level classes: 58 % vs. 82
can analyze a complex sound mixture into a set of discrete components %) and humans’ (69 % vs. 88 %) perception of audio context (Eronen
(i.e., individual sound-producing events). et al., 2006). This study provides a baseline for evaluating an ASC sys­
Then, Sawhney et al. (Sawhney & Maes, 1997) first proposed ASC in tem. In the same year, the book “Computational Auditory Scene Anal­
1997, achieving an accuracy of 68.0 %. In the second year, they ysis: Principles, Algorithms, and Applications” (Wang & Brown, 2006)
(Clarkson, Sawhney, & Pentland, 1998) reported an ASA system that was published. Meanwhile, Dubois et al. (Dubois, Guastavino, &
classified sound objects with HMMs and detected scene changes by Raimbault, 2006) studied how individuals define their semantic cate­
clustering the space formed by the likelihoods of these sound object gory classification when they do not have prior knowledge.
HMMs. Next, Truax (Truax, 2001) treated listening as an active process In 2008, Tardieu et al. (Tardieu et al., 2008) demonstrated that
of interacting with the environment and distinguished it among several sound sources, human activities, and room effects (e.g., reverberation)


Fig. 7. Summary of key developments in the primary exploration phase of ASC (Wang & Brown, 2006; Eronen et al., 2006; Stiefelhagen, Bowers, & Fiscus, 2007;
Temko et al., 2007; Stiefelhagen et al., 2007; Stiefelhagen et al., 2008; Tardieu et al., 2008; Waibel & Stiefelhagen, 2009; Giannoulis et al., 2013a; Salamon, Jacoby,
& Bello, 2014; Piczak, 2015b; Rakotomamonjy & Gasso, 2015; Barchiesi et al., 2015). We marked important concepts and data with red color. (For interpretation of
the references to color in this figure legend, the reader is referred to the web version of this article.)

are not only the factors driving the formation of acoustic scene cate­ study.
gories but also the clues to scene classification. Next, Chu et al. (Chu,
Narayanan, & Kuo, 2009) mentioned that the classification accuracy fell 3.2.3. Early in-depth research phase (2016 to present)
rapidly with the increasing number of scene classes because of In this phase, ASC research has made revolutionary progress, and its
randomness, high variance, and other difficulties in processing envi­ accuracy has been significantly improved with the development of
ronmental sounds. It is supported by examples of accuracy limited to signal processing and deep learning methods. The development of this
around 92.0 % for 5 classes (Chu et al., 2006), 77.0 % for 11 classes phase is shown in Fig. 8.
(Malkin & Waibel, 2005), and 60.0 % for 13 or more classes (Peltonen Since 2016, the DCASE challenge and workshop have been held
et al., 2002; Eronen et al., 2006). annually, and the results and publications have provided explicit com­
Various countries and regions have released many major research parisons among ASC methods. Systematical analyses for the results of
plans about the auditory cognition abilities of machines related to the the D13 (Barchiesi et al., 2015; Stowell et al., 2015), D16 (Mesaros et al.,
ASC field. For example, Video Analysis and Content Extraction (VACE) 2018), D17 (Mesaros, Heittola, & Virtanen, 2018a), and D21A (Martín-
and Cognitive Assistant that Learns and Organizes (CALO) (Stiefelhagen, Morató et al., 2021) competitions are also provided. They demonstrated
Bowers, & Fiscus, 2007) were released in America. Computers in the that the emergence of deep learning brought a breakthrough for ASC,
Human Interaction Loop (CHIL) (Waibel & Stiefelhagen, 2009) and replacing traditional approaches (e.g., SVM and GMM) and achieving
Augmented Multi-party Interaction (AMI) were proposed in Europe. The higher classification accuracy.
major research plan on Cognitive Computing of Visual and Auditory Also, some overviews of ASC were proposed. For instance, Dandashi
Information was released by the National Science Foundation of China and AlJaam (Dandashi & AlJaam, 2017) presented a survey of the
(NSFC, 2019). advanced methods and datasets for audio-based classification. Virtanen
Furthermore, some related competitions have also been held, such as et al. (Virtanen et al., 2018) published the book “Computational analysis
CLEAR (Temko et al., 2007; Stiefelhagen et al., 2007; Stiefelhagen et al., of sound scenes and events” in 2018. This book is often considered a
2008), DCASE (Giannoulis et al., 2013a), and TRECVID (audio-visual beginner-friendly textbook of ASC and SED, which provides founda­
multimodal event detection) (Lee et al., 2021) challenges. They have tions, history, methods, applications, and further perspectives on ASC
attracted contributions to the audio domain. CLEAR evaluation is an and SED. Then, Abeßer systematically summarized ASC algorithms
international effort in CHIL to evaluate systems for the multimodal based on deep learning (Abeßer, 2020). Lee et al. (Lee et al., 2021)
perception of people, their activities, and interactions (Stiefelhagen summarized the commonly-used CNN for ASC in literature. They also
et al., 2007; Stiefelhagen et al., 2008). The DCASE challenge (Giannoulis used deep separable convolution to reduce model complexity. Martín-
et al., 2013a) was released by IEEE AASP in 2013 for the first time to Morató et al. (Martín-Morató et al., 2021) concluded that advanced
evaluate the existing environmental sound detection methods, as methods combine the best-performing methods with model compression
comprehensively summarized in (Barchiesi et al., 2015). methods for low-complexity ASC tasks. Recently, Vij et al. (Vij et al.
Additionally, various methods and datasets have been proposed to 2023) provided a short overview of ASC and SED studies. Singh, Sharma,
promote the development and application of ASC. For example, a noise- & Sur (Singh, Sharma, & Sur, 2023) provided a systematic review of pre-
robust classification method combined with ICA and Matching Pursuit processing and classification techniques for acoustic scenes.
(MP) was proposed, achieving an accuracy of 76.0 % (Mogi & Kasai, Various ASC algorithms have been proposed, achieving higher ac­
2013). Meanwhile, the EAR-IT project was proposed to realize street curacy. For instance, the accuracy of ASC methods on the LITIS dataset is
traffic flow monitoring and indoor energy use (i.e., smart city and smart improved from 97.1 % (Bisot et al., 2017a), 98.1 % (Zhang, Zhang, &
home) based on household monitoring through smart audio technology Wu, 2018), and 98.7 % (Phan et al., 2019), and finally to 98.9 % (Pham
(European Commision, 2014). Also, open ASC datasets, such as Urban­ et al., 2021) by observing Fig. 8. To further identify the state of pub­
Sound8k (baseline accuracy: 68.0 %) (Piczak, 2015a; Salamon, Jacoby, lished algorithms, we list the state-of-the-art approaches on common
& Bello, 2014), ESC-50 (baseline/human accuracy: 44.3 % vs. 81.3 %), ASC datasets in Table 3, where the trend of accuracy changes over time
ESC-10 (baseline/human accuracy: 72.7 % vs. 95.7 %) (Piczak, 2015b), is clarified. By observing Table 3, the accuracy of an ASC approach is
and LITIS (Rakotomamonjy & Gasso, 2015) datasets were released to affected by its techniques, such as data augmentation, features, classi­
fairly evaluate ASC algorithms. fiers, and late fusion. Early methods were mainly based on MFCC fea­
Following the previous studies, ASC has been developed to a certain tures and simple machine learning methods, such as random forest,
extent. ASC datasets and ASC-related theories have been gradually MLP, and SVM. Recently, features like LMBE, spectrogram, CQT, and log
presented. Nevertheless, audio scene analysis is still a new and under­ Mel spectrogram (or their variants) were adopted with CNN-based
explored research topic, and there are still many problems to further classifiers, such as ResNet, FCNN, and Xception. The ASC algorithms


Fig. 8. Summary of key developments in the early in-depth research phase of ASC (Aytar, Vondrick, & Torralba, 2016; Mesaros, Heittola, & Virtanen, 2016b;
Dandashi & AlJaam, 2017; Mesaros et al., 2018; Mesaros, Heittola, & Virtanen, 2018b; Virtanen et al., 2018; Kroos et al., 2019; Chen et al., 2019; Suh et al., 2020;
Abeßer, 2020; Lee et al., 2021; Kim et al., 2021; Schmid et al., 2022; Schmid et al., 2023). We marked important concepts and data with red color. (For interpretation
of the references to color in this figure legend, the reader is referred to the web version of this article.)

combining feature learning with a simple classifier (e.g., MLP) also driven methods in terms of implementation strategy (Ellis, 1996), cor­
perform well in some datasets (Arniriparian et al., 2018). Furthermore, responding to relationship 7. Naturally, Relationship 5 appears.
most high-performance approaches employ data augmentation and late Different from the data-driven CASA, the prediction-driven CASA is
fusion methods, especially for processing complex datasets (e.g., D17 to based on a set of discrete sound events in audio data. Therefore, the
D23). data-driven CASA for acoustic environment analysis corresponds to
In addition, we also list reported performance on common ASC BOF-based, raw waveform-based, and spectrogram-based ASC methods
datasets to intuitively show accuracy changes (see Fig. 9). Notably, we (relationships 8). The prediction-driven CASA for acoustic environment
do not report data on the D18C and D19C datasets (Wilkinghoff, 2021) analysis corresponds to the audio event-based ASC method (relation­
because of the limited space and their few published papers. From Fig. 9, ships 9).
the ASC accuracy is significantly improved. Although the ASC accuracy
on early datasets has been significantly improved to over 95 %, the 3.3. Problems
accuracy on recent datasets is still low (less than 90 %), such as D20A,
D21A, D22, and D23 datasets recorded by multiple devices. This is also Although the ASC field has made great progress, there are also some
due to the increase of task complexity (e.g., larger-scale data, multiple problems that need to be addressed. We analyze them from the perspectives of
devices, limited model complexity, and shorter segments). Compared acoustic scenes, datasets, classification performance, and deployment in the
with the D20A task, D21A only added a low-complexity constraint, real world.
slightly decreasing accuracy.
In summary, ASC research made great progress with significantly 1) Unclear definition of acoustic scenes. ASC requires a clear definition
improved accuracy. However, performance is still low for the increasing of acoustic scenes, as it needs predefined labels. However, it is hard
task complexity of ASC, as shown in Fig. 9. It means that ASC technol­ to define a generalized boundary for acoustic scenes. There are two
ogies are still not ideal, which leaves a large margin to develop. main reasons: (i) The acoustic signal is complex, multi-source,
overlapping, and unstructured, (ii) The acoustic scene definition
3.2.4. Summary and its category granularity depend on application scenarios.
To intuitively show the conceptual relationships of ASC-related Because the distribution of sound source components varies from
fields, we have summarized nine relationships in Fig. 10. Relationship application-to-application scenarios. For example, human activities
1 has been mentioned in (Barchiesi et al., 2015). ASC includes percep­ and traffic are very intensive in urban environments, distinct from
tual studies (i.e., soundscape cognition) and computational research (i. rural environments.
e., CASR). The authors further explained that CASR is a task associated 2) Lack of large-scale comprehensive datasets. As a complex multi-
with the area of CASA and is particularly applied to the study of envi­ classification task, a high-performance ASC system requires a
ronmental sounds (Barchiesi et al., 2015), which can infer relationship large-scale dataset for model training. However, the size and di­
2. Recently, ASC is considered CASR by default, which can be divided versity of the available data are currently limited. Large-scale
into four groups: audio event-based, BOF-based, raw waveform-based, comprehensive datasets are lacking for developing a specific ASC
and spectrogram-based methods (Ding et al., 2022), driving relation­ system, despite many existing ASC datasets.
ship 3. 3) Low classification performance. Compared with other audio fields,
CASA is the computational study of ASA, which was motivated partly the ASC’s performance is still deficient. There are mainly three rea­
by the demand for practical sound separation systems (Wang & Brown, sons: (i) The high diversity among acoustic features of the same
2006). It is obvious to obtain relationship 4. Moreover, Wang defined scenes and the similarity among acoustic features of different scenes
CASA as the procedure of a machine turning the ambient sound signal make it difficult to extract efficient features for ASC, (ii) Audio sig­
into a meaningful representation and listed its related areas as ASC, SED, nals lack structural information, making modeling difficult
and source separation (Wang, 2018). Therefore, we concluded rela­ compared with speech signals, and (iii) Mismatched data distribution
tionship 6. In addition, CASA is grouped into prediction-driven and data-


Table 3
State-of-the-art approaches on common ASC datasets. 'spec.' denotes spectrogram. 'top system' denotes the top-ranking system submitted to the corresponding DCASE challenge. 'DA' (resp. 'LF') indicates whether data augmentation (resp. late fusion) is used in the ASC algorithm.
Datasets Acc. Years DA Algorithm LF Ref.

D13 71.0 % 2013 MFCCs, classified with a bag-of-frames approach (top system) (Stowell et al., 2015)
88.0 % 2016 SoundNet embedding + SVM (Aytar, Vondrick, & Torralba, 2016)
97.0 % 2019 L3-Net M256 + MLP (Cramer et al., 2019)
D16 89.7 % 2016 (spec. + CNN) ⊕ (MFCC + Binaural I-vectors) (top system) (Eghbal-Zadeh et al., 2016)
93.0 % 2018 layer-wise SoundNet embedding √ (Singh et al., 2018)
97.4 % 2019 1D-LTP, MFCC + SVM (Aziz et al.,2019)
D17 87.1 % 2017 √ LMBE, spec. + MLP,RNN,CNN,SVM (top system) √ (Mun et al., 2017a)
91.7 % 2017 √ Log Mel spectrogram + CNN √ (Han et al., 2017)
91.1 % 2018 Feature learning by DCGAN ⊕ S2SAE + MLP √ (Arniriparian et al., 2018)
95.0 % 2022 Wavelet Scattering Transform (WST) features + RSC √ (Hajihashemi et al., 2022)
D18A 76.9 % 2018 √ LMBE + CNN (top system) √ (Sakashita & Aono, 2018)
79.8 % 2018 LMBE + Xception √ (Yanget al., 2018)
90.0 % 2020 Raw waveform + multi-temporal CNN (e2e) (Kumar et al., 2020)
93.0 % 2023 Mel spec. and TCCs + Fisher layers + SVM (Fisher network) (Venkatesh, Mulimani, & Koolagudi, 2023)
D18B 63.6 % 2018 (LMBE, LMBE + NNF) + Multi-input CNN (top system) √ (Nguyen & Pernkopf, (2018)
77.3 % 2020 √ multi-scale semantic features + Xception (Yang et al., 2020)
91.0 % 2023 Mel spec. and TCCs + Fisher layers + SVM (Fisher network) (Venkatesh, Mulimani, & Koolagudi, 2023)
D19A 85.1 % 2019 √ LMBE, CQT + CNN √ (Chen et al., 2019).
91.0 % 2021 Log Mel spec. + CRNN-ELU (Mulimani & Koolagudi, 2021)
92.0 % 2023 Mel spec. and TCCs + Fisher layers + SVM (Fisher network) (Venkatesh, Mulimani, & Koolagudi, 2023)
D19B 70.0 % 2019 √ LMBE + CNN (top system) √ (Kosmider, 2019)
75.0 % 2020 √ Log Mel spec. + CNN (Kosmider, 2020)
91.0 % 2023 Mel spec. and TCCs + Fisher layers + SVM (Fisher network) (Venkatesh, Mulimani, & Koolagudi, 2023)
D20A 74.2 % 2020 √ LMBE, Δ, ΔΔ + ResNet, Snapshot (top system) (Suh et al., 2020)
81.9 % 2020 √ LMBE, Δ, ΔΔ + ResNet; FCNN; fsFCNN (2-stage Ensemble) √ (Hu et al., 2021).
88.0 % 2023 √ multi-scale Mel spec. (ms2) + RQNet (Madhu & Suresh, 2023)
D20B 97.3 % 2020 √ Perceptually-weighted LMBE + RF-regularized CNNs (top system) (Koutini et al., 2020)
D21A 75.9 % 2021 √ LMBE + CNNBC-ResNet (top system) √ (Kim et al., 2021)
79.4 % 2021 √ LMFB + TSL using a large two-stage ASC system pruned LTH √ (Yang et al., 2021a)
88.0 % 2023 √ multi-scale Mel spec. (ms2) + RQNet (Madhu & Suresh, 2023)
D21B 95.1 % 2021 √ LMBE, CQTbark + CNN, EfficientNet, Swin Transformer (top system) √ (M. Wang, Chen, Xie, Chen, Liu, & Zhang, 2021)
D22 58.0 % 2022 √ LMBE + RF-regularized CNNs, PaSST transformer (top system) √ (Schmid et al., 2022)
D23 58.4 % 2023 √ LMBE + RF-regularized CNNs, PaSST transformer (top system) √ (Schmid et al., 2023)
ESC-10 95.7 % 2015 Human baseline (Piczak, 2015b)
72.7 % 2015 MFCC & ZCR + random forest (Computer baseline) (Piczak, 2015b)
92.2 % 2016 SoundNet embedding + SVM (Aytar, Vondrick, & Torralba, 2016)
95.8 % 2020 √ Log Mel spec. + TS-CNN10 (Wang et al., 2020)
97.0 % 2021 Log spec. + ESResNet (Guzhov et al., 2021)
ESC-50 81.3 % 2015 Human baseline (Piczak, 2015b)
44.3 % 2015 MFCC & ZCR + random forest (Computer baseline) (Piczak, 2015b)
74.2 % 2016 SoundNet embedding + SVM (Aytar, Vondrick, & Torralba, 2016)
94.0 % 2017 spectrum + mixture model (Baelde, Biernacki, & Greff R, 2017)
94.7 % 2020 √ Log Mel spec. + fine-tune PANN (Kong et al., 2020)
95.7 % 2021 √ Log Mel spec. + AST-P √ (Gong, Chung, & Glass, 2021)
LITIS 97.1 % 2017 DNN embedding + LR (Bisot et al., 2017a)
98.1 % 2018 CNN with temporal transformer module (TT-CNN) (Zhang et al., 2018)
98.7 % 2019 √ Log Mel spec. + CRNN with spatio-temporal attention pooling √ (Phan et al., 2019)
98.9 % 2021 √ Multi-spec. + encoder-decoder (Pham et al., 2021)
UrbanSound8k 68.0 % 2014 MFCC + random forest (baseline) (Salamon, Jacoby, & Bello, 2014)
73.6 % 2015 Spherical k-means embedding from log Mel spec. + random forest (Salamon & Bello, 2015)
88.0 % 2017 TEO-GTCC ⊕ GTCC √ (Agrawal et al., 2017)
88.5 % 2020 √ Log Mel spec. + TS-CNN10 (Wang et al., 2020)
96.7 % 2021 pre-trained VGGlish (Tsalera, Papadakis, & Samarakou, 2021)
99.0 % 2022 √ Log Mel spec. + deep CNN (Madhu, 2022)
MSOS 93.2 % 2019 Multi-view SoundNet embedding + MLP (Singh et al., 2019)
96.0 % 2020 √ Log Mel spec. + fine-tune PANN (Kong et al., 2020)

between source and target data, such as multiple devices, might performance of basic perceptual algorithms, for instance, by providing
cause low performance. appropriate preprocessing or context-dependent grammar for speech
4) Computational efficiency and practicability. At present, ASC recognition modules (Rakotomamonjy, 2017). Therefore, ASC can be
models with high performance are relatively complex. There are used as a first step for a device to automatically understand its envi­
problems with computational efficiency optimization and algorithm ronment, followed by a downstream task (e.g., speech enhancement),
adaptability when deploying ASC algorithms in the real world. For instead of the end goal of a system.
example, in the reality of embedded industrial applications, an ASC
system might be deployed in conditions of limited computational 4. Data processing
power and memory capacity (e.g., IoT embedded devices), where the
high performance of the ASC system is also required. Therefore, the The goal of data processing is to process the input data to enhance its
computational efficiency of the ASC system needs to be considered to quality or increasing its quantity, ultimately improving the performance
judge whether it is practical for this application. of the back-end classification. Frequently-used data processing methods
for ASC include channel selection and data augmentation. It is worth
In addition, knowledge of scene type can be helpful to boost the noting that the data augmentation method can be combined with


Fig. 9. Accuracies on commonly-used ASC datasets in state-of-the-art papers. The blue data dots present the accuracy of the baseline system. The green data dots
indicate the accuracy of the top system submitted to the corresponding DCASE challenge or the highest accuracies on other datasets. The highest accuracies on
DCASE ASC datasets are marked in red font. Here, DCASE ASC datasets are the development datasets for the DCASE ASC tasks. Ovals of the same color indicate that the datasets contain the same audio data but are used for different DCASE challenge ASC tasks (e.g., D20A and D21A). (For interpretation of the references to color in this
figure legend, the reader is referred to the web version of this article.)

channel selection. There exist many works combining data augmentation and channel selection, as reported in the technical reports of the DCASE challenge ASC tasks. For instance, Chen et al. (Chen et al., 2019) utilized average-difference channels (the average and difference of the binaural channels) along with a GAN-based data augmentation method for ASC, winning the DCASE2019 ASC competition. We will detail these two techniques and discuss their application in ASC below.

4.1. Channel selection

Generally, the audio input of ASC is monaural (Abeßer, 2020). However, multichannel and multimodal data settings represent opportunities to address complex real-world scene and event classification problems more effectively. Potentially complementary data streams can lead to more robust analysis when they are combined effectively using appropriate techniques at the input representation, feature, or decision level (Virtanen et al., 2018). Such techniques have been successfully realized in various multichannel audio and audio-visual scene analysis tasks.

The fusion of multichannel information focuses on merging the information from different channels to maximize the amount of information and better model acoustic scenes. For example, Zieliński (Zieliński, 2018) proposed three information fusions of five-channel surround sound (i.e., early fusion and late fusion variants 1/2) for ASC. In addition, separation algorithms are employed to generate additional signal channels/variants, such as the HPSS (Ono et al., 2008), NMF-based HPSS (Park, Shin, & Lee, 2017), and NNF (Nguyen & Pernkopf, 2018) algorithms. Furthermore, multichannel audio has been used to extract spatial information that supplements spectral features for ASC (Imoto & Ono, 2017; Abeßer, 2020; Kawamura et al., 2023).

Undeniably, multichannel fusion is effective in boosting ASC performance in some cases (Ding et al., 2022; Eghbal-zadeh et al., 2017; Zhang, Han, & Shi, 2020; Imoto, 2021). However, Waldekar et al. (Waldekar & Saha, 2020a) mentioned that it might be better to average binaural audio into mono, and Abeßer noted that current state-of-the-art ASC algorithms process either mono or stereo signals without a clear trend (Abeßer, 2020). We consider that the effectiveness of multichannel information fusion is affected by the task characteristics, audio features, classifier, and fusion strategy. Therefore, more research on channel selection and fusion in ASC is necessary.

Fig. 10. Conceptual relationships of ASC-related fields.

4.2. Data augmentation

Although deep learning has demonstrated remarkable performance on many complex tasks, it suffers from the requirement for large amounts of data (Summers & Dinneen, 2019). In ASC, the relative scarcity of labeled data and the uncertainty of the acoustic scene signal result in different statistical distributions of the training and testing data (Salamon & Bello, 2017). Therefore, deep learning models trained directly on the training data may generalize poorly (Yang et al., 2020). This problem can be mitigated by data augmentation (Abeßer, 2020). Data augmentation usually generates new training data through deformations and transformations of existing training recordings (Cui, Goel, & Kingsbury, 2015; Cao et al., 2019). It does not always operate directly on the original data (e.g., audio signals); it can also process features extracted from the original data, such as the spectrum.

Generally, data augmentation can be expressed by (1), which considers an arbitrary function f mapping one or two basic samples into a single new training sample. If f is a linear function, the data augmentation method is considered linearity-based; otherwise, it is non-linearity-based (Summers & Dinneen, 2019). Label-preserving data augmentation generates new training data from basic samples while preserving their original labels.

$(\tilde{x}, \tilde{y}) = f\big(\{(x_i, y_i)\}_{i=1}^{N}\big), \quad 1 \leq N < 3 \qquad (1)$

where $(x_i, y_i)$ and $(\tilde{x}, \tilde{y})$ are the sample-label vectors of the ith basic sample and of the new sample, respectively. We did not find improvement beyond N = 2 in (Summers & Dinneen, 2019), so N is less than 3 here. We summarize the commonly used data augmentation methods in ASC below.
information to supplement spectral features for ASC (Imoto & Ono,
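To make the formulation in (1) concrete, the following minimal NumPy sketch (not taken from any of the surveyed systems) shows two typical instances of f: mixup with N = 2, which is linearity-based but not label-preserving, and additive Gaussian noise with N = 1, which is linearity-based and label-preserving. The parameter values (alpha, snr_db) are illustrative assumptions.

import numpy as np

def mixup(x1, y1, x2, y2, alpha=0.4):
    # Mixup (N = 2): interpolate two samples and their one-hot labels with weight lam.
    lam = np.random.beta(alpha, alpha)
    x_new = lam * x1 + (1.0 - lam) * x2
    y_new = lam * y1 + (1.0 - lam) * y2
    return x_new, y_new

def add_gaussian_noise(x, y, snr_db=20.0):
    # Adding random noise (N = 1): label-preserving, linearity-based augmentation.
    signal_power = np.mean(x ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = np.random.normal(0.0, np.sqrt(noise_power), size=x.shape)
    return x + noise, y  # the label is unchanged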


1) Mixing audio (mixing) randomly mixes two audio clips from the same scene class in the training data (Hu et al., 2020).
2) Pitch shifting (pitch) randomly shifts the pitch of a training clip by an amount drawn from a uniform distribution (Hu et al., 2020; Salamon & Bello, 2017).
3) Time stretching (time) randomly changes the speed or duration of an audio signal, with the stretch factor drawn from a uniform distribution, without affecting the pitch (Salamon & Bello, 2017). It is also called speed change.
4) Adding random noise (noise) adds Gaussian noise to each training clip at random (Hu et al., 2020).
5) Mixup mixes two training samples from different scene classes with a random weight (Hu et al., 2020). It is implemented by \tilde{x} = \lambda x_1 + (1-\lambda) x_2 and \tilde{y} = \lambda y_1 + (1-\lambda) y_2. Here, (x_1, y_1) and (x_2, y_2) are two sample-label vectors drawn at random from the training data, (\tilde{x}, \tilde{y}) is the new sample-label pair after mixup, and \lambda \in [0, 1]. \lambda is obtained by sampling a beta distribution with parameter \alpha, \alpha \in (0, \infty) (Zhang et al., 2018). Hu et al. (Hu et al., 2020) use mixup with \alpha equal to 0.4.
6) Dynamic Range Compression (DRC) compresses the dynamic range of a sample (Salamon & Bello, 2017), mapping the natural dynamic range of a signal to a smaller range. For instance, Hu et al. (Hu et al., 2020) generated data of simulated devices by adding reverberation followed by DRC to audio collected with a real device.
7) Spectrum augment (SpecAugment) applies time and frequency masking to the spectrogram of the signal (Hu et al., 2020); it was originally proposed for automatic speech recognition (Park et al., 2019). A minimal sketch of this masking style is given after Table 4.
8) Spectral rolling (SpecRoll) randomly shifts spectrogram excerpts of training data over time (Koutini, Eghbal-zadeh & Widmer, 2019).
9) Random erasing (RandErasing) replaces random boxes in the feature representations of training data with random numbers (Zhong et al., 2020). It has been successfully applied in image classification, object detection, and person re-identification.
10) Between-Class (BC) mixes two training samples from different scene classes with a mixing ratio obtained by training a model with a KL loss (Tokozume, Ushiku & Harada, 2018). It is implemented by (2) and (3); a sketch in code follows this list. Experiments on ESC-50 verified its effectiveness.

\tilde{x} = \frac{p x_1 + (1-p) x_2}{\sqrt{p^2 + (1-p)^2}}, \quad \text{where } p = \frac{1}{1 + 10^{\frac{G_1 - G_2}{20}} \cdot \frac{1-\lambda}{\lambda}}   (2)

\tilde{y} = \lambda y_1 + (1-\lambda) y_2   (3)

where (x_1, y_1) and (x_2, y_2) are two sample-label vectors drawn at random from the training data, and (\tilde{x}, \tilde{y}) is the new sample-label pair produced by the BC method. G_1 and G_2 are the sound pressure levels of x_1 and x_2 [dB], and \lambda \sim U[0, 1].

11) Channel confusion (ChanConfusion) randomly swaps two channels in the input data (Hu et al., 2020).
12) Spectrum correction (SpecCorrect) computes correction coefficients from the spectra of n aligned pairs of recordings and then transforms all recordings using these coefficients, so that the corrected spectra look alike for all devices (Kosmider, 2019). Here, n can be quite small (e.g., 30) (Nguyen & Pernkopf, 2019a; Kosmider, 2019).
13) Combination generates new training data using several data augmentation techniques. For example, Hu et al. (Hu et al., 2020) combined more than five data augmentation methods for ASC, and Lasseck (Lasseck, 2018) applied combined local time stretching and frequency stretching for acoustic bird detection.
14) GAN-based augmentation generates additional training data with a generative adversarial network (Goodfellow et al., 2014). The GAN generally operates in the audio signal domain (Chen et al., 2019) or in the feature domain (Mun et al., 2017a). Moreover, several GAN variants have been applied for data augmentation, such as Conditional-GAN (Yang et al., 2019), EnvGAN (Madhu, 2022), ACGAN (Chen et al., 2021), CycleGAN (Kacprzak & Kowalczyk, 2021), and DCGAN (Bahmei, Birmingham, & Arzanpour, 2022). However, GANs have the drawback of being hard to train and tune.
15) Label smoothing spatial-mixup (ls-spatial-mixup) randomly selects and concatenates parts of the log Mel spectrograms of two samples with mixup (Yang et al., 2020). The label of the new sample is then smoothed by (4) and (5). The smoothed label aims to overcome the over-confidence problem of network training, while the spatial-mixup step generates meaningful virtual data.

\tilde{y} = (1-\beta)(\lambda y_1 + (1-\lambda) y_2) + \frac{\beta}{N} \mathbf{1}_N   (4)

\beta = \begin{cases} \beta_{\min}, & y_1 = y_2 \\ -4(\beta_{\max} - \beta_{\min})(\lambda - 0.5)^2 + \beta_{\max}, & y_1 \ne y_2 \end{cases}   (5)

where \beta \in [0, 1] is a smoothing parameter and \mathbf{1}_N in (4) is an N-dimensional all-one vector. Here, \lambda is generated from a beta distribution (as in mixup), and '+' denotes the addition operation.

16) LDA-based data augmentation (LDA-augment) generates new data according to the probability distributions of the key audio events of each acoustic scene detected by the topic model LDA (Latent Dirichlet Allocation). It can simulate acoustic scenes in real environments more effectively (Leng et al., 2020) and is more stable than pitch, time, and noise.
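The BC mixing rule in Eqs. (2)-(3) can be sketched as follows. This is a minimal illustration rather than the exact procedure of Tokozume et al.: in particular, the sound pressure levels G_1 and G_2 are approximated here from the RMS energy of each clip, which is an assumption.

import numpy as np

def bc_mix(x1, y1, x2, y2):
    # Between-Class (BC) mixing following Eqs. (2)-(3).
    lam = np.random.uniform(1e-6, 1.0)   # lambda ~ U(0, 1]; avoid division by zero
    # G1, G2: sound pressure levels [dB], approximated via RMS energy.
    g1 = 20.0 * np.log10(np.sqrt(np.mean(x1 ** 2)) + 1e-12)
    g2 = 20.0 * np.log10(np.sqrt(np.mean(x2 ** 2)) + 1e-12)
    p = 1.0 / (1.0 + 10.0 ** ((g1 - g2) / 20.0) * (1.0 - lam) / lam)
    x_new = (p * x1 + (1.0 - p) * x2) / np.sqrt(p ** 2 + (1.0 - p) ** 2)
    y_new = lam * y1 + (1.0 - lam) * y2   # soft label: BC is not label-preserving
    return x_new, y_new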


Table 4
Characteristics of the data augmentation methods used for ASC. A data type of audio/spectrum indicates that the augmentation is applied before/after feature extraction.

Data augmentation | Data type | Basic samples | Linearity-based | Label-preserving | Learning-based | Complexity | Ref.
mixing | Audio/spectrum | 2 | √ | √ | × | simple | (Hu et al., 2020)
pitch | Audio | 1 | × | √ | × | simple | (Hu et al., 2020; Salamon & Bello, 2017)
time | Audio | 1 | × | √ | × | simple | (Hu et al., 2020; Salamon & Bello, 2017)
noise | Audio | 1 | √ | √ | × | simple | (Hu et al., 2020)
mixup | Audio/spectrum | 2 | √ | × | × | simple | (Hu et al., 2020; Salamon & Bello, 2017; Zhang et al., 2018)
DRC | Audio | 1 | √ | √ | × | simple | (Hu et al., 2020; Salamon & Bello, 2017)
SpecAugment | Spectrum | 1 | × | √ | × | simple | (Hu et al., 2020; Park et al., 2019)
SpecRoll | Spectrum | 1 | × | √ | × | simple | (Koutini et al., 2019)
RandErasing | Spectrum | 1 | × | √ | × | simple | (Zhong et al., 2020)
BC | Audio | 2 | √ | × | √ | complex | (Tokozume et al., 2018)
ChanConfusion | Audio | 1 | × | √ | × | simple | (Hu et al., 2020)
SpecCorrect | Spectrum | 1 | × | √ | × | complex | (Nguyen, Pernkopf, & Kosmider, 2020; Kosmider, 2019; Kosmider, 2020)
Combination | Audio/spectrum | – | – | – | – | – | (Hu et al., 2020; Lasseck, 2018; Wang et al., 2023)
GAN | Audio/spectrum | 0 | × | – | √ | complex | (Goodfellow et al., 2014; Mun et al., 2017a)
Ls-spatial-mixup | Spectrum | 2 | × | × | × | complex | (Yang et al., 2020)
LDA-augment | Audio | 0 | – | – | √ | complex | (Leng et al., 2020)
mixing | Audio/spectrum | 2 | √ | √ | × | simple | (Hu et al., 2020; McDonnell & Gao, 2020)
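As a concrete reference for the masking-style augmentations listed above (SpecAugment, and in spirit RandErasing), the following minimal NumPy sketch zeroes out random frequency bands and time frames of a log Mel spectrogram. The mask counts and widths are illustrative assumptions, not values used in the surveyed systems.

import numpy as np

def spec_augment(log_mel, num_freq_masks=2, num_time_masks=2,
                 max_freq_width=8, max_time_width=20):
    # SpecAugment-style masking on a 2-D array of shape (n_mels, n_frames).
    spec = log_mel.copy()
    n_mels, n_frames = spec.shape
    for _ in range(num_freq_masks):
        width = np.random.randint(0, max_freq_width + 1)
        start = np.random.randint(0, max(1, n_mels - width))
        spec[start:start + width, :] = 0.0      # frequency mask
    for _ in range(num_time_masks):
        width = np.random.randint(0, max_time_width + 1)
        start = np.random.randint(0, max(1, n_frames - width))
        spec[:, start:start + width] = 0.0      # time mask
    return spec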

In addition, other methods that do not generate new data are sometimes also classified as data augmentation. In this paper, we regard such methods only as data processing methods that enhance data quality and thus improve performance. For instance, random cropping randomly crops the input data (i.e., the feature map) to a fixed length along the time axis during the training stage (Hu et al., 2020); it was combined with mixup along the temporal axis in (McDonnell & Gao, 2020).

To compare the data augmentation methods used in ASC intuitively, we list their characteristics in Table 4 in terms of data type, number of basic samples, linearity, label preservation, the need for learning, and complexity. As the table shows, the characteristics of the combination method are not fixed. GAN and LDA-augment generate data without requiring basic samples; only mixing, mixup, BC, and ls-spatial-mixup are based on two basic samples, and the remaining methods are based on a single sample.

In conclusion, data augmentation methods provide an effective solution to the large labeled-data requirements of deep learning. Various data augmentation methods have been verified on specific datasets. However, how to deploy these methods so that they perform well across diverse ASC tasks is still an area of active research. For example, Hu et al. (Hu et al., 2021) compared popular data augmentation methods to validate the effectiveness of their combination, and Han et al. (Han et al., 2023) compared several data augmentation methods to validate their proposed randmasking augment method on the D18A and D19A datasets. However, the effect of these methods may be uncertain when they are used with other ASC algorithms or on other datasets.

Furthermore, overfitting frequently happens when a large amount of augmented data is used (Xie et al., 2021). Therefore, not all augmentation methods are effective: samples augmented far away from their original ones jeopardize classification performance (Zheng, Mo, & Zhao, 2022). Appropriate data augmentation methods or sample filters (Zhang et al., 2018) are therefore required in ASC to ensure augmentation quality. Data augmentation also has the drawbacks of consuming large storage space and long training time, and its parameters require careful tuning to avoid excessive deformations that remove features critical for generalization.

Most of the augmentation methods mentioned above are offline, because their learning policies are isolated from their usage. Recently, online data augmentation (OnlineAugment) was proposed, which jointly optimizes data augmentation and target network training in an online manner (Tang et al., 2020). Compared with traditional offline methods, it is more efficient and adaptive, as it requires less domain knowledge and is easily applicable to new tasks. It can also be applied together with offline methods.

5. Feature acquisition

Feature acquisition refers to the process of extracting features from signals. It is crucial for the entire classification process, since a set of significant and compact features makes correct classification much easier.

5.1. Pre-processing

Pre-processing is applied before feature extraction to enhance certain characteristics of the incoming signal and maximize analysis performance in the later phases of the analysis system. It is achieved by reducing the effects of noise or emphasizing the target sounds in the signal (Virtanen et al., 2018), and is mainly implemented through noise reduction, framing and windowing, and time-frequency representation (a minimal sketch of the latter two steps follows this list).

1) Noise reduction. Both ASC and SED face the challenge that foreground sound events in acoustic scenes are often overshadowed by background noise (Abeßer, 2020). Noise reduction can be used to reduce environmental noise interference in audio analysis (Schröder et al., 2013), and several noise reduction methods have been proposed for ASC to remove irrelevant noise and enhance target-related information. For instance, Lostanlen et al. (Lostanlen et al., 2018) used per-channel energy normalization (PCEN) (Wang et al., 2017) to reduce stationary noise and enhance transient sound events in environmental audio, and Han et al. (Han, Park, & Lee, 2017) used background subtraction and median filtering over time to eliminate irrelevant noise from the environment and the recording device.
2) Framing and windowing. Audio signals are generally non-stationary, since the signal statistics change rapidly over time. Therefore, short-time processing is used to capture the signal in a quasi-stationary state: the audio signal is analyzed periodically in short segments referred to as analysis frames, which are transformed into a spectrum for further feature extraction (Virtanen et al., 2018). Analysis frames are obtained by framing, which slices the audio signal into fixed-length frames with a sliding window, and windowing, which smooths these frames by multiplying them with a window function.
3) Time-frequency representation. The audio signal is generally first transformed into a time-frequency representation that captures frequency information, such as the Short-Time Fourier Transform (STFT) spectrogram, Mel spectrogram, log Mel spectrogram, CQT spectrogram, or wavelet spectrogram. The Mel spectrogram is obtained by processing the STFT spectrogram with a Mel-scaled filter bank, which provides a more compact spectral representation of sounds than the STFT spectrogram (Abeßer, 2020). The log Mel spectrogram is obtained by both Mel filtering and logarithmic scaling. The wavelet spectrogram is the time-frequency representation of a signal obtained by wavelet transformation (Chen et al., 2021). Generally, the STFT is mainly used to analyze linear and stationary signals. Mel spectrograms and log Mel spectrograms are commonly used for ASC because both use a non-linear frequency scale motivated by human auditory perception. The CQT provides higher frequency resolution at lower frequencies and higher temporal resolution at higher frequencies. Wavelet analysis is suitable for analyzing non-stationary signals (Xie et al., 2021) and can assist other time-frequency representations when analyzing complex audio signals.
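A minimal librosa-based sketch of the framing/windowing and log Mel steps is given below. The file name is hypothetical, and the frame length, hop size, and Mel resolution are illustrative assumptions; the exact settings vary across the surveyed systems.

import librosa
import numpy as np

# Hypothetical clip; 10-s segments at 44.1 kHz are typical of recent DCASE ASC data.
y, sr = librosa.load("airport_example.wav", sr=44100, mono=True)

# Framing and windowing are handled inside the STFT:
# roughly 40-ms Hann-windowed frames with 50 % overlap.
n_fft = int(0.040 * sr)
hop_length = n_fft // 2

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=n_fft, hop_length=hop_length,
    n_mels=128, window="hann")
log_mel = librosa.power_to_db(mel, ref=np.max)   # log Mel spectrogram (LMBE-style input)
print(log_mel.shape)   # (n_mels, n_frames)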

5.2. Hand-crafted feature extraction

Hand-crafted features are obtained by processing audio signals with fixed, pre-defined signal transformations. They incorporate a priori knowledge of acoustics, sound perception, or specific properties of an audio scene. Therefore, hand-crafted features are usually interpretable and often bring interesting insight into the content and behavior of sound scenes (Virtanen et al., 2018). Moreover, they generally provide a compact data representation for efficient sound scene analysis approaches.

At present, the most frequently used hand-crafted features for ASC are mainly inspired by speech, music, or image processing, and include the time domain, frequency domain, cepstral, and spectrogram image-based features listed in Table 5.

1) Time domain features include the envelope, Zero-Crossing Rate (ZCR), short-time average energy, temporal waveform moments, and autocorrelation coefficients (Virtanen et al., 2018; Peltonen et al., 2002; Chu, Narayanan, & Kuo, 2009; Zhang & Kuo, 2001a). They are easily computed and are usually used together with other, more refined features (Chachada & Kuo, 2014).
2) Frequency domain features mainly include LMBE (Virtanen et al., 2018), band energy, band energy ratio, bandwidth, spectral shape, spectral centroid, spectral flatness, spectral roll-off point, spectral flux (Peltonen et al., 2002; Eronen et al., 2006), and spectral kurtosis (Sigtia et al., 2016; Dwyer, 1983). LMBE is the most popular feature and has obtained good results in the DCASE ASC tasks; it is powerful enough to be used on its own as the input for classification or feature learning (Virtanen et al., 2018). LMBE provides a compact and smooth representation of the local spectrum but neglects temporal changes in the spectrum. Therefore, delta features and the stacking of feature vectors from neighboring frames are employed to capture the temporal aspect of the features (Virtanen et al., 2018); for example, the first and second derivatives (Δ and ΔΔ) were used in ASC (Ding et al., 2022).
3) Cepstral features consist of MFCC (Chu, Narayanan, & Kuo, 2009; Zhao et al., 2011; Tsau, Kim, & Kuo, 2011; Agrawal et al., 2017), Code Excited Linear Prediction (CELP) (Tsau, Kim, & Kuo, 2011), Constant-Q Cepstral Coefficients (CQCC) (Virtanen et al., 2018), Gammatone Feature Cepstral Coefficients (GFCC) (Agrawal et al., 2017; Huang et al., 2020), and Recurrence Quantification Analysis (RQA) features (Roma, Nogueira, & Herrera, 2013; Battaglino et al., 2015). They estimate the rough shape of the spectrum of a signal. MFCCs are the most popular features, and their first and second derivatives are also applied. However, the performance of MFCCs on unstructured audio (e.g., environmental sounds (Tsau, Kim, & Kuo, 2011)) is limited, so MFCC features are often combined with other techniques. For example, time-frequency features obtained by the MP algorithm (Chu, Narayanan, & Kuo, 2009) were used to supplement MFCC features, and MiMFCC features (Eghbal-zadeh et al., 2017), RQA + MFCC (Giannoulis et al., 2013a), CELP + MFCC (Tsau, Kim, & Kuo, 2011), and MFCC + GFCC (Huang et al., 2020) have also been applied.
4) Spectrogram image-based features include HOG (Rakotomamonjy & Gasso, 2015), the Subband Power Distribution (SPD) (Dennis, Tran, & Chng, 2013), Local Binary Patterns (LBP) (Virtanen et al., 2018; Battaglino et al., 2015; Abidin, Togneri, & Sohel, 2017; Xie & Zhu, 2019), the Amplitude Modulation Filterbank (AMFB) (Schröder et al., 2017), and the Gabor Filterbank (GFB) (Schröder, Goetze, & Anemüller, 2015). They use computer vision techniques to characterize the texture and evolution of the time-frequency content of the sound field. The most common spectrogram image-based features in ASC are HOG, SPD, and LBP. HOG aims to capture relevant time-frequency structures for characterizing sound scenes. The SPD image can be used either as a feature (Dennis, Tran, & Chng, 2013) or as an intermediate representation for extracting other image-based features (Bisot, Essid, & Richard, 2015). LBP features capture texture and geometrical properties of a scene's spectrogram. Usually, multiple such features are used together in ASC, for instance, MFCCs + LBP (Battaglino et al., 2015), LBP + HOG (Schröder, Goetze, & Anemüller, 2015), and the combination of MFCCs, LBP, HOG, moments, etc. (Xie & Zhu, 2019).

Table 5
Hand-crafted features applied for ASC.

Feature type | Features
Time domain | Envelope (Virtanen et al., 2018); ZCR (Zhang & Kuo, 2001a; Peltonen et al., 2002; Chu, Narayanan, & Kuo, 2009); Short-time average energy (Zhang et al., 2023; Peltonen et al., 2002; Chu, Narayanan, & Kuo, 2009); Autocorrelation coefficients (Virtanen et al., 2018)
Frequency domain | LMBE, ΔLMBE, and ΔΔLMBE (Virtanen et al., 2018; Ding et al., 2022); Band energy, band energy ratio, bandwidth, spectral centroid, spectral flatness, spectral roll-off point, spectral flux, spectral kurtosis (Peltonen et al., 2002; Eronen et al., 2006; Chu, Narayanan, & Kuo, 2009)
Spectrogram image-based | HOG (Rakotomamonjy & Gasso, 2015; Virtanen et al., 2018); SPD / SPD + HOG (Dennis, Tran, & Chng, 2013; Bisot, Essid, & Richard, 2015; Virtanen et al., 2018); LBP (Battaglino et al., 2015; Abidin, Togneri, & Sohel, 2017; Virtanen et al., 2018; Xie & Zhu, 2019); AMFB (Schröder et al., 2017); GFB (Schröder, Goetze, & Anemüller, 2015; Schröder et al., 2017)
Cepstral features | CELP (Tsau, Kim, & Kuo, 2011); MFCC, ΔMFCC, and ΔΔMFCC (Chu, Narayanan, & Kuo, 2009; Zhao et al., 2011; Tsau, Kim, & Kuo, 2011; Agrawal et al., 2017); GFCC (Agrawal et al., 2017; Huang et al., 2020); CQCC (Virtanen et al., 2018); RQA (Roma, Nogueira, & Herrera, 2013)

Several publications have reviewed acoustic features for audio analysis. For instance, Peltonen et al. (Peltonen et al., 2002) reviewed acoustic features applied in ASC, Tsau et al. (Tsau, Kim, & Kuo, 2011) summarized features for content-based audio retrieval, and Geiger et al. (Geiger, Schuller, & Rigoll, 2013) discussed the common features in early ASC works. Chachada and Kuo (Chachada & Kuo, 2014) offered a qualitative survey of early ASC development and divided ASC methods into stationary and non-stationary; they considered the features in (Mitrović, Zeppelzauer, & Breiteneder, 2010), including rough features, cepstral features, MPEG-7 features, and auto-regression features, as stationary features.

In addition, MPEG-7 descriptors (Wang et al., 2006; Muhammad et al., 2010) and perceptual features (Richard, Sundaram, & Narayanan, 2013; Virtanen et al., 2018) have been applied to ASC. Occasionally, novel features are designed specifically for ASC. For example, Song et al. (Song, Han, & Deng, 2018) proposed auditory summary statistics as ASC features, inspired by neuroscience studies of sound texture perception. Bai et al. (Bai et al., 2019) used the Acoustic Segment Model (ASM) to obtain a finer segmentation and cover all acoustic scenes by excavating potentially similar units hidden in the auditory sense. Recently, Götz et al. (Götz et al., 2023) presented an approach that extracts low-dimensional representations from reverberant speech signals in a contrastive learning framework and demonstrated its effectiveness on the downstream tasks of acoustic parameter estimation and environment classification.

Also, combining a large variety of features is often used to improve performance (Virtanen et al., 2018). However, hand-crafted feature extraction requires expert knowledge. Moreover, it relies on predefined transformations and neglects some particularities of the signals, such as recording conditions and devices.
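To illustrate the cepstral features discussed above, the following minimal librosa sketch stacks MFCCs with their deltas and delta-deltas. The file name and the number of coefficients are illustrative assumptions rather than settings from the surveyed systems.

import librosa
import numpy as np

y, sr = librosa.load("scene_clip.wav", sr=44100, mono=True)   # hypothetical clip

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)   # cepstral envelope (20 x n_frames)
d1 = librosa.feature.delta(mfcc)                      # ΔMFCC
d2 = librosa.feature.delta(mfcc, order=2)             # ΔΔMFCC

# Stack into one hand-crafted feature matrix (60 x n_frames),
# usable as input to a GMM/SVM back-end or a CNN.
features = np.vstack([mfcc, d1, d2])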

5.3. Feature learning

Feature learning obtains features by processing audio signals with learnable signal transformations. It analyzes datasets to automatically determine which transformations should be applied to convert the input into features (Virtanen et al., 2018). In contrast to hand-crafted feature extraction, it requires a large amount of data from which to generalize rather than domain knowledge. It is generally applied in end-to-end (e2e) systems (Abrol & Sharma, 2020), which build the scene model directly from the raw signal or a time-frequency representation by learning both the features and the classifier. Typical examples include AclNet and AclSincNet (Huang et al., 2019), SoundNet (Aytar, Vondrick, & Torralba, 2016), a very deep CNN-based e2e system (Dai et al., 2017), a multi-temporal CNN-based e2e system (Kumar et al., 2020), and a general-purpose e2e audio embedding generator (Lopez-Meyer et al., 2021).

Moreover, feature learning is used to interpret the signal transformation as a learnable function, commonly regarded as a front-end that is jointly trained with the classification back-end (Chen, Zhang, & Yan, 2019). For instance, Kuncheva (Kuncheva, 2004) proposed supervised extensions of NMF to adapt the decomposition to the task at hand and learn better features. Li et al. (Y. Li, Liu, Wang, Zhang, & He, 2019) used deep embeddings to capture the characteristics of various acoustic scenes for acoustic scene clustering. Similarly, L3-Net (Cramer et al., 2019; Varma, Padmanabhan, & Dileep, 2021), SoundNet (Aytar, Vondrick, & Torralba, 2016; Singh et al., 2018; Singh, Rajan, & Bhavsar, 2019; Singh, Rajan, & Bhavsar, 2020; Singh, 2022), VGGish (Jansen et al., 2017), CliqueNets (Nguyen & Pernkopf, 2019a), CNNs (Heo, Jung, Shim, & Yu, 2019; Olvera, Vincent, & Gasso, 2022; Zhang, Liang, & Feng, 2022), VGG (Ren et al., 2017; Yao et al., 2019; Bahmei, Birmingham, & Arzanpour, 2022), and other pre-trained embedding models have also been used to learn features for ASC. The combination of acoustic and visual embeddings (Chen, Wang, & Zhang, 2022), multiview embeddings (Devalraju & Rajan, 2022), and features learned by different methods (Arniriparian et al., 2018; Zhang, Liang, & Ding, 2020) have also been proposed for ASC. Recently, Jiang et al. (Jiang et al., 2023) constructed an embedding space with multi-level distances based on the hierarchical similarity between acoustic scene classes, aiming to learn, from hand-crafted features that contain device information, high-level features that are less susceptible to device differences.

In addition, unsupervised feature learning is applied to derive semantically meaningful signal representations. Unsupervised feature learning techniques based on K-means (Salamon & Bello, 2015) or nonnegative matrix factorization (NMF) have been shown to be competitive with the best hand-crafted features (Bisot et al., 2016; Bisot et al., 2017a; Guo et al., 2017). Similarly, Lee et al. (Lee, Hyung, & Nam, 2013) used a sparse feature learning algorithm to learn features for ASC.

There are many successful cases of feature learning in ASC; we list most of them in Table 6. As the table shows, feature learning methods work well in ASC. However, the performance of ASC systems using feature learning is still not on par with well-established ASC systems built on hand-crafted features such as spectral features (Ding et al., 2022). This conclusion can be drawn by comparing the accuracy of the ASC systems using feature learning in Table 6 with the corresponding highest accuracies.
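The common "pre-trained embedding plus shallow classifier" pattern that recurs in Table 6 can be sketched as follows in PyTorch and scikit-learn. The small encoder below is only a stand-in for a pretrained audio embedding model (e.g., SoundNet, VGGish, or L3-Net); the tensor shapes and toy data are illustrative assumptions.

import numpy as np
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

class SmallEncoder(nn.Module):
    # Stand-in for a pretrained audio embedding model.
    def __init__(self, emb_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.proj = nn.Linear(32, emb_dim)

    def forward(self, x):                 # x: (batch, 1, n_mels, n_frames)
        h = self.features(x).flatten(1)
        return self.proj(h)               # (batch, emb_dim)

# Toy data standing in for log Mel spectrograms of labeled scene clips.
log_mels = torch.randn(32, 1, 128, 431)
labels = np.random.randint(0, 10, size=32)

encoder = SmallEncoder().eval()           # a real system would load pretrained weights here
with torch.no_grad():
    embeddings = encoder(log_mels).numpy()

clf = LinearSVC().fit(embeddings, labels)  # shallow classifier on the learned embeddings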

Table 6
Feature learning methods applied in ASC. Notably, the underlined data in the original table is better than the baseline accuracy and no more than 5 % below the highest accuracy.

Methods | Acc. | Dataset (highest acc.)
(mixup) AemNet-DW (e2e) (Lopez-Meyer et al., 2021) | 92.3 % | ESC-50 (95.7 %)
Pruned SoundNet embedding + 1D CNN (SVM ensemble) (Singh, Rajan, & Bhavsar, 2020) | 92.1 % | ESC-50 (95.7 %)
(DCGAN, pitch, noise) VGG-19 embedding + CNN-RNN (Bahmei, Birmingham, & Arzanpour, 2022) | 91.5 % | ESC-50 (95.7 %)
VGG embedding (LS, AM-softmax) + VGG (Yao et al., 2019) | 81.9 % | ESC-50 (95.7 %)
L3-Net M256 + MLP (Cramer et al., 2019) | 79.8 % | ESC-50 (95.7 %)
SoundNet embedding + SVM (Aytar, Vondrick, & Torralba, 2016) | 74.2 % | ESC-50 (95.7 %)
SoundNet embedding + SVM (Aytar, Vondrick, & Torralba, 2016) | 92.2 % | ESC-10 (97.0 %)
(DCGAN, pitch, noise) VGG-19 embedding + CNN-RNN (Bahmei, Birmingham, & Arzanpour, 2022) | 98.0 % | UrbanSound8k (99.0 %)
AemNet-DW (e2e) (Lopez-Meyer et al., 2021) | 83.6 % | UrbanSound8k (99.0 %)
L3-Net M256 + MLP (Cramer et al., 2019) | 79.3 % | UrbanSound8k (99.0 %)
Spherical k-means embedding from log Mel spec. + random forest (Salamon & Bello, 2015) | 73.6 % | UrbanSound8k (99.0 %)
Raw waveform + CNN-18 (Dai et al., 2017) | 71.8 % | UrbanSound8k (99.0 %)
Multi-view embeddings (MvLDAN) + SVM (Devalraju & Rajan, 2022) | 97.9 % | LITIS (98.9 %)
DNN embedding + logistic regression (Bisot et al., 2017a) | 97.1 % | LITIS (98.9 %)
NMF embedding from CQT spec. (Kernel PCA) + logistic regression (Bisot et al., 2016) | 95.6 % | LITIS (98.9 %)
Multi-view SoundNet embedding + MLP (Singh et al., 2019) | 93.2 % | MSOS (96.0 %)
(mixup) AemNet-DW (e2e) (Lopez-Meyer et al., 2021) | 91.6 % | D13 (97.0 %)
SoundNet embedding + SVM (Aytar, Vondrick, & Torralba, 2016) | 88.0 % | D13 (97.0 %)
RBM embedding from Mel spec. + L2-SVM (Lee, Hyung, & Nam, 2013) | 72.0 % | D13 (97.0 %)
LMBE + end-to-end 3D CNN (SeNoT-Net-L2) (Zhang, Han, & Shi, 2020) | 80.3 % | D19A (92.0 %)
(mixup, SpecCorrect) Feature learning by CNN + CNN (calibration by heated-up softmax) (Nguyen, Pernkopf, & Kosmider, 2020) | 70.1 % | D19B (91.0 %)
Layer-wise SoundNet embedding + SVM (maximum likelihood) (Singh et al., 2018) | 93.0 % | D16 (97.4 %)
Pruned SoundNet embeddings + 1D CNN (SVM ensemble) (Singh, Rajan, & Bhavsar, 2020) | 89.9 % | D16 (97.4 %)
Feature learning from CQT + nonnegative TDL (Bisot et al., 2017a) | 83.3 % | D16 (97.4 %)
(ConvNet + SNMF + HOG) + SVM (Rakotomamonjy, 2017) | 80.9 % | D16 (97.4 %)
Feature learning by DCGAN ⊕ S2SAE + MLP (optimal weighted sum) (Arniriparian et al., 2018) | 91.1 % | D17 (95.0 %)
(Feature learning; NMF + CQT) + (TNMF-AS + DNN-M) (average) (Bisot et al., 2017b) | 90.1 % | D17 (95.0 %)
Scalogram-DCNN with DCT-based temporal module (e2e) (Chen, Zhang, & Yan, 2019) | 87.4 % | D17 (95.0 %)
L3-Net embeddings + SVM (Varma, Padmanabhan, & Dileep, 2021) | 84.7 % | D17 (95.0 %)
VGG16 embedding (STFT + bump + morse) + GRNN (Margin Sampling Value, MSV) (Ren et al., 2017) | 80.9 % | D17 (95.0 %)
Multi-view embeddings (MvLDAN) + SVM (Devalraju & Rajan, 2022) | 77.1 % | D17 (95.0 %)
(SoundNet embeddings + PCA + ACDL) + non-linear SVM (late fusion) (Singh, 2022) | 76.1 % | D17 (95.0 %)
Raw waveform + multi-temporal CNN (e2e) (Kumar et al., 2020) | 90.0 % | D18A (93.0 %)
Deep CNN embedding + DNN (Pham et al., 2020) | 78.0 % | D18A (93.0 %)
(CNN embedding ⊕ CNN-GRU embedding) + SVM (average) (Heo et al., 2019) | 77.4 % | D18A (93.0 %)
3-channel LMBE + end-to-end 3D CNN (SeNoT-Net-L2) (Zhang, Han, & Shi, 2020) | 77.2 % | D18A (93.0 %)
Multi-view embeddings (MvLDAN) + KNN (Devalraju & Rajan, 2022) | 75.6 % | D18A (93.0 %)
Deep CNN embedding + DNN (Pham et al., 2020) | 66.9 % | D18B (91.0 %)
(mixup) CliqueNets embedding + CNN-MoE (Nguyen & Pernkopf, 2019a) | 64.7 % | D18B (91.0 %)
(domain adaptation) Feature learning by CNN-5 + softmax (Olvera, Vincent, & Gasso, 2022) | 60.1 % | D18B (91.0 %)
(random crop, mixup) Multi-level distance embedding learning (MEDL) + Trident ResNet (Jiang et al., 2023) | 72.6 % | D20A (88.0 %)
(image data augmentation) Acoustic embedding + visual ImageNet embeddings + SoftMax (Chen, Wang, & Zhang, 2022) | 94.6 % | D21B (95.1 %)


5.4. Features for recent deep learning models

According to recent publications, LMBE (Yang et al., 2020; Paseddula & Gangashetty, 2021; Hu et al., 2021) and MFCC (Bai et al., 2020; Huang et al., 2020), together with their deltas and delta-deltas, are the most powerful hand-crafted features for deep learning-based ASC systems. In addition, STFT spectrograms, Mel spectrograms (Lee, Hyung, & Nam, 2013; Alamir, 2021; Shim et al., 2022), log Mel spectrograms (Kong et al., 2020; Ren et al., 2019; Mulimani & Koolagudi, 2021), CQT spectrograms (Rakotomamonjy, 2017; Arniriparian et al., 2018), and their combinations (Ren et al., 2017) are often used directly as input to deep learning models. The wavelet spectrogram is also applied together with other spectrograms in some cases (Yang et al., 2019; Chen, Zhang, & Yan, 2019; Wu & Lee, 2020). Moreover, the raw waveform is used as the input of deep learning when building end-to-end ASC systems (Dai et al., 2017; Kumar et al., 2020; Lopez-Meyer et al., 2021). Occasionally, spectrogram image-based features such as SPD, LBP, and HOG are used as the input of deep learning models for feature learning (Dennis, Tran, & Chng, 2013; Rakotomamonjy & Gasso, 2015; Bisot, Essid, & Richard, 2015).

5.5. Feature selection and dimensionality reduction

With the creation of large-scale databases and the consequent demands on machine learning techniques, a large number of features are extracted or learned to better represent the target concept. However, not all features are significant, since many can be redundant or irrelevant to the target concept. Thus, feature selection and dimensionality reduction become necessary to reduce the feature dimensionality and retain only the information that is relevant for discriminating the target classes (Virtanen et al., 2018).

Feature selection is the process of seeking an appropriate feature subset, from a larger set of initial features, that can describe the application domain and even improve classification performance; it therefore directly affects the complexity and accuracy of classifiers. Generally, feature selection methods can be divided into filter and embedded approaches, depending on whether the selection takes place before, or is integrated into, the classification process (Virtanen et al., 2018; Özseven and Arpacioglu, 2023).

In addition, dimensionality reduction techniques are applied to cope with the potentially large dimensionality of the feature space. They are generally implemented through transformations such as Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), or the more recent bottleneck DNN (Virtanen et al., 2018). Dimensionality reduction by mapping features to a randomized low-dimensional feature space has also been employed.
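The following scikit-learn sketch illustrates the two operations described above: a filter-type feature selection step followed by a PCA transformation. The feature matrix is synthetic, and the numbers of selected features and components are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

# Toy feature matrix: 200 clips x 600 concatenated hand-crafted features.
X = np.random.randn(200, 600)
y = np.random.randint(0, 10, size=200)

# Filter-type feature selection: keep the 100 most class-discriminative features.
selector = SelectKBest(score_func=f_classif, k=100).fit(X, y)
X_selected = selector.transform(X)

# Dimensionality reduction by transformation: project onto 40 principal components.
pca = PCA(n_components=40).fit(X_selected)
X_reduced = pca.transform(X_selected)
print(X_reduced.shape)   # (200, 40)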
The attention mechanism highlights important information and en­
6. Modeling riches the representation. It can be used to estimate the contribution of
the feature vectors at each time step (Chorowski et al., 2015; Ren et al.,
Modeling is performed on the features extracted from audio signals, 2020), aiming to overcome the shortcoming that conventional global
including supervised (i.e., conventional machine learning and deep pooling methods cannot estimate the contribution of each time­
learning), unsupervised, semi-supervised, and self-supervised methods –frequency bin in feature maps (Ren et al., 2018). The existing attention
(S. Li, Gu, Luo, Chambers, & Wang, 2019). In addition, result post- methods can be divided into six categories: channel attention, spatial
processing is also applied to boost ASC performance. Therefore, we attention, temporal attention, branch channel, channel & spatial atten­
summarize modeling methods with aspects of these six categories. tion, and spatial & temporal attention (Guo et al., 2022).
The recent literature about ASC has shown its effectiveness. For
6.1. Conventional machine learning example, Ren et al. (Ren et al., 2020; Zhang, Liang, & Feng, 2022) used a
time–frequency attention mechanism to analyze the contribution of
Earlier ASC works relied on conventional machine learning methods, each time–frequency bin of the feature maps in the CNNs. Li et al. (Z. Li
such as KNN (Peltonen et al., 2002), GMM (Aucouturier, Defreville, & et al., 2019) incorporated Gated Linear Units (GLU) in several steps of
Pachet, 2007; Chu, Narayanan, & Kuo, 2009), SVM (Geiger, Schuller, & the feature learning part of the network (multi-level attention). Wang
Rigoll, 2013), HMM (Eronen et al., 2006), random forest (Piczak, et al. (Wang, Santoso, & Wang, 2017) used self-determination CNNs
2015a), and decision tree (Piczak, 2015a). They are usually combined (SD-CNNs) to identify frames with higher uncertainty due to over­
with hand-crafted features such as MFCCs to build the ASC model lapping sound events. Wang et al. (Wang, Zou, Chong, & Wang, 2020)
(Geiger, Schuller, & Rigoll, 2013; Chachada & Kuo, 2014)]. The accu­ proposed a novel parallel temporal-spectral attention mechanism for
racy of these methods is vastly limited when they are used to process CNN to enhance the temporal and spectral features by capturing the


Table 7
Comparison of different deep learning methods applied in ASC. Notably, Δ and ΔΔ represent the deltas and delta-deltas of a given feature; for example, MFCC + Δ + ΔΔ is the concatenation of MFCCs with their delta and delta-delta values. TL is transfer learning, Spec. is spectrogram, Acc. is accuracy, and AM-softmax is the additive margin softmax loss. In the original table, bold indicates the method achieving the highest accuracy and underlining indicates a difference of no more than 5 % from the highest accuracy.

Classifier | Specifications | Acc. (Dataset, highest acc.)
DNN | Smile6k features (Alamir, 2021; Eyben et al., 2010) | 84.2 % (D16, 97.4 %)
DNN | MFCC + Δ + ΔΔ (TL) (Mun et al., 2017b) | 86.3 % (D16, 97.4 %)
DNN | Frame- and channel-concatenated MFCCs (DNN + GMM) (Takahashi et al., 2017) | 88.2 % (D16, 97.4 %)
DNN | Concatenated NMF features and CQT representations (Bisot et al., 2017b) | 89.2 % (D17, 95.0 %)
DNN | LPCC, SCMC, and LMBE features (Kong et al., 2020) | 82.1 % (D17, 95.0 %); 72.3 % (D18A, 93.0 %); 58.6 % (D18B, 91.0 %)
RNN | Smile6k features (Li et al., 2017) | 80.2 % (D16, 97.4 %)
TDNN | Log Mel spec., AMFB and GFB (Schröder et al., 2017) | 76.5 % (D16, 97.4 %)
BLSTM | Deep Audio Feature (DAF) (Y. Li et al., 2018) | 82.1 % (D17, 95.0 %)
Traditional CNN | Log Mel spec. (Li et al., 2017) | 82.2 % (D16, 97.4 %)
Traditional CNN | 1D, 2D, and 3D CNN fusion (Yin, Shah, & Zimmermann, 2018) | 91 % (D16, 97.4 %)
Traditional CNN | (Conditional-GAN) KL, CQT and Mel spec. (Yang et al., 2019) | 94.3 % (D16, 97.4 %); 89.8 % (D17, 95.0 %)
Traditional CNN | Log Mel spec. (average) (Schröder et al., 2013) | 91.7 % (D17, 95.0 %)
Traditional CNN | Deep scattering spectrum (DSS) features (multi-level attention) (Z. Li et al., 2019) | 78.3 % (D18A, 93.0 %)
Traditional CNN | (Mixup) log Mel spec. (average) (Seo, Park, & Park, 2019) | 80.4 % (D19A, 92.0 %)
Traditional CNN | (Mixup, SpecCorrect) feature learning (calibration by heated-up softmax) (Nguyen, Pernkopf, & Kosmider, 2020) | 70.1 % (D19B, 91.0 %)
Traditional CNN | Transfer and adapt feature learning from log Mel spec. (Kumar, Khadkevich, & Fügen, 2018) | 83.5 % (ESC-50, 95.7 %)
Traditional CNN | TEO-GTCC, GTCC (weighted sum) (Agrawal et al., 2017) | 88.0 % (UrbanSound8K, 99 %)
Traditional CNN | Features extracted by a temporal transformer module (Zhang et al., 2018) | 98.1 % (LITIS, 98.9 %)
FCNN | LMBE, Δ, ΔΔ (Hu et al., 2020) | 76.9 % (D20A, 88.0 %)
CRNN | Log Mel spec. (Mulimani & Koolagudi, 2021) | 91.0 % (D19A, 92.0 %)
VGG | Spec. features + VGG (Eghbal-Zadeh et al., 2016; Simonyan & Zisserman, 2015) | 89.9 % (D16, 97.4 %)
VGG | Feature learning using VGG, label smoothing (LS), and AM-softmax (Yao et al., 2019) | 81.9 % (ESC-50, 95.7 %)
ResNet | (Mixup) transfer learning with a pre-trained ResNet model (Ye et al., 2019; He et al., 2016) | 74.7 % (D18A, 93.0 %)
ResNet | (Mixup) LMBE (late fusion) (Ding et al., 2022) | 87.6 % (D19A, 92.0 %)
ResNet | (Data augmentation combination) log Mel + Δ + ΔΔ (two-stage fusion) (Hu et al., 2021) | 81.9 % (D20A, 88.0 %)
ResNet | Log Mel spec. (average) (Liu et al., 2019) | 69.0 % (D20A, 88.0 %)
ResNet | (Mixup, device translator, etc.) log Mel spec., BC-ResNet (Kim et al., 2021) | 75.3 % (D21A, 88.0 %)
ResNet | Perceptually-weighted LMBE (Koutini et al., 2020) | 97.3 % (D20B, 97.3 %)
Inception | LMBE + Δ + ΔΔ (Lee et al., 2021; Szegedy et al., 2016) | 63.9 % (D20A, 88.0 %)
Xception | RGB image features (TL) (Lu et al., 2021; Chollet, 2017) | 75.3 % (ESC-50, 95.7 %)
MobileNetV2 | LMBE, Δ, ΔΔ (Hu et al., 2020; Sandler et al., 2018) | 96.7 % (D20B, 97.3 %)
SoundNet | Feature learning (Aytar, Vondrick, & Torralba, 2016) | 88.0 % (D13, 97.0 %); 74.2 % (ESC-50, 95.7 %); 92.2 % (ESC-10, 97.0 %)
DCASENet | Mel spec. (Jung et al., 2021) | 69.5 % (D20A, 88.0 %)
CAA-Net | Feature learning (Ren et al., 2020) | 68.0 % (D18B, 91.0 %); 56.1 % (D19B, 91.0 %)
PANNs | (Mixup and SpecAugment) log Mel spec. (TL) (Kong et al., 2020) | 96.0 % (MSOS, 96.0 %); 94.7 % (ESC-50, 95.7 %); 76.4 % (D19A, 92.0 %)
SubSpectralNet | Sub-spectrograms + CNN (Phaye, Benetos, & Wang, 2019) | 74.1 % (D18A, 93.0 %)
Fisher network | Mel spec. and TCCs + Fisher layers + SVM (Venkatesh, Mulimani, & Koolagudi, 2023) | 93.0 % (D18A, 93.0 %); 91.0 % (D18B, 91.0 %); 92.0 % (D19A, 92.0 %); 91.0 % (D19B, 91.0 %); 85.0 % (D20A, 88.0 %)
RQNet | Multi-scale Mel spec. (ms2) (Madhu & Suresh, 2023) | 88.0 % (D20A, 88.0 %)

6.2.2. Attention mechanism

The attention mechanism highlights important information and enriches the representation. It can be used to estimate the contribution of the feature vectors at each time step (Chorowski et al., 2015; Ren et al., 2020), aiming to overcome the shortcoming that conventional global pooling methods cannot estimate the contribution of each time-frequency bin in the feature maps (Ren et al., 2018). Existing attention methods can be divided into six categories: channel attention, spatial attention, temporal attention, branch attention, channel & spatial attention, and spatial & temporal attention (Guo et al., 2022).

The recent ASC literature has shown the effectiveness of attention. For example, Ren et al. (Ren et al., 2020; Zhang, Liang, & Feng, 2022) used a time-frequency attention mechanism to analyze the contribution of each time-frequency bin of the feature maps in CNNs. Li et al. (Z. Li et al., 2019) incorporated Gated Linear Units (GLU) in several steps of the feature learning part of the network (multi-level attention). Wang et al. (Wang, Santoso, & Wang, 2017) used self-determination CNNs (SD-CNNs) to identify frames with higher uncertainty due to overlapping sound events. Wang et al. (Wang, Zou, Chong, & Wang, 2020) proposed a parallel temporal-spectral attention mechanism for CNNs that enhances temporal and spectral features by capturing the importance of different time frames and frequency bands. Recently, Shim et al. (Shim et al., 2022) proposed an Attentive Max Feature Map (AMFM) to overcome the problem that attention mechanisms may discard too much information.
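The idea of replacing global pooling with attention over time-frequency bins can be sketched as follows in PyTorch. This is a minimal illustration of attention-weighted pooling on top of a small CNN backbone, not a reimplementation of any specific system cited above; the layer sizes and input shape are assumptions.

import torch
import torch.nn as nn

class AttentivePoolingASC(nn.Module):
    # Small CNN whose global pooling is replaced by learned time-frequency attention.
    def __init__(self, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(), nn.MaxPool2d(2))
        self.attention = nn.Conv2d(64, 1, kernel_size=1)   # one weight per time-frequency bin
        self.classifier = nn.Linear(64, n_classes)

    def forward(self, x):                       # x: (batch, 1, n_mels, n_frames)
        h = self.backbone(x)                    # (batch, 64, F, T)
        w = torch.softmax(self.attention(h).flatten(2), dim=-1)   # (batch, 1, F*T)
        h = (h.flatten(2) * w).sum(dim=-1)      # attention-weighted pooling -> (batch, 64)
        return self.classifier(h)

logits = AttentivePoolingASC()(torch.randn(4, 1, 128, 431))
print(logits.shape)   # (4, 10)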


6.2.3. Transfer learning

Transfer learning is used to improve a learner in one domain (the target domain) by transferring information from a related domain (the source domain) (Weiss, Khoshgoftaar, & Wang, 2016). It allows well-trained networks to be retrained on different datasets (Tsalera, Papadakis, & Samarakou, 2021). Therefore, it can not only accelerate the training process and thus reduce model development costs, but also overcome the limitation that a domain with few labeled data cannot use more complex knowledge discovery approaches.

Many ASC algorithms rely on well-proven neural network architectures from the computer vision domain or related sound domains and use transfer learning to fine-tune pre-trained models developed for related tasks. For instance, Huang et al. (Huang et al., 2019) fine-tuned models pre-trained on AudioSet. Similarly, Kong et al. (Kong et al., 2020) proposed PANNs pre-trained on AudioSet, which can be transferred to six audio classification tasks (e.g., ASC, SED, and audio tagging). Moreover, Kumar et al. (Kumar, Khadkevich, & Fügen, 2018) compared three transfer learning strategies for adapting a pre-trained deep CNN model to AED and ASC target tasks. There are many successful application cases, such as pre-trained SoundNet (Singh et al., 2018; Singh et al., 2019), ResNet (Ye et al., 2019), Xception (Lu et al., 2021), VGG16 (Ren et al., 2017), GoogLeNet, SqueezeNet, ShuffleNet, VGGish, and YAMNet (Tsalera, Papadakis, & Samarakou, 2021).

In the ASC tasks of the DCASE challenge, the use of publicly available external ASC datasets is explicitly allowed. However, most recent winning algorithms did not use transfer learning from these datasets but instead focused on the provided training datasets combined with data augmentation techniques (Abeßer, 2020).

6.2.4. Multitask learning

Multitask learning involves learning to solve multiple related classification tasks jointly with one network (Ruder, 2017). It improves generalization by leveraging the domain-specific information in the training signals of related tasks (Caruana, 1998); related terms include joint learning, learning to learn, and learning with auxiliary tasks. Typical examples of multitask learning include SoundNet, DCASENet, PANNs, and CAA-Net, which were proposed to address multiple audio classification tasks (e.g., SED and ASC (Bear, Nolasco, & Benetos, 2019; Komatsu, Imoto, & Togami, 2020)).

In particular, Imoto et al. (Imoto et al., 2020) jointly modeled the ASC and SED tasks by using soft labels of acoustic scenes to relate sound events to scenes. Heo et al. (Heo et al., 2019) used teacher-student learning to extract soft labels that better represent similarities across different scenes. In addition, two-stage multitask models have been applied in ASC (Xu et al., 2016; Abrol & Sharma, 2020; Shim et al., 2022), in which a deep learning model predicts fine classes as the main task and coarse (high-level) classes as an auxiliary task to boost the performance of the main task. Besides, Nwe et al. (Nwe, Dat, & Ma, 2017) grouped acoustic scenes by type of environment to formulate ASC as a multitask learning framework that predicts the most likely acoustic scene within each scene group. In summary, multitask learning has been successfully applied in ASC, but it is not common in state-of-the-art ASC algorithms (Abeßer, 2020).

6.3. Unsupervised learning

Unsupervised ASC algorithms (acoustic scene clustering) merge the audio recordings of the same scene class into a single cluster without using prior information or training classifiers (Y. Li et al., 2019). They are not limited by unlabeled, weakly labeled, or incorrectly labeled data in the ASC dataset, because they do not require reference labels. For example, Li et al. (Li & Wang, 2018) proposed an improved spectral clustering algorithm using a randomly sketched sparse subspace to classify unlabeled acoustic data or correct wrongly labeled acoustic data. Moreover, a streaming-based subspace clustering algorithm was proposed for the acoustic scene clustering of overwhelmingly high-volume data (Li et al., 2017), allowing the data to be clustered on the fly. Unlike these methods, Li et al. (Y. Li et al., 2019) jointly optimized feature learning (a CNN fed by a log Mel spectrum) and clustering iteration (Agglomerative Hierarchical Clustering) by integrating the two procedures into a single model with a unified loss function.

Although several studies have demonstrated the advantages and potential value of unsupervised learning in ASC, it is still inferior to supervised learning in terms of classification performance.

6.4. Semi-supervised learning

Semi-supervised learning can reduce the amount of labeled data required during the development of deep learning models by using a small amount of labeled data together with a large amount of unlabeled data (Van Engelen & Hoos, 2020). It is conceptually situated between supervised and unsupervised learning. It can be categorized into transductive and inductive learning according to the primary objective, or into wrapper methods, unsupervised preprocessing, and intrinsically semi-supervised methods according to the way the unlabeled data is used (Huang et al., 2006). Semi-supervised learning has recently been introduced into audio classification. For instance, Sascha and Estefanía proposed a FixMatch-based semi-supervised approach, which was verified on music, industrial sound, and ASC tasks; their experimental results demonstrated the potential of semi-supervised methods for audio data (Grollmisch & Cano, 2021).

6.5. Self-supervised learning

Self-supervised learning has gained popularity because it can avoid the cost of annotating large-scale datasets (Jaiswal et al., 2021). Self-supervised models learn unsupervised representations by solving pretext tasks and then use those representations for downstream tasks such as classification or regression (Tripathi & Mishra, 2021). Pretext tasks are pre-designed tasks for networks to solve, and unsupervised representations are learned by optimizing the objective functions of the pretext tasks, which can be predictive, generative, contrastive, or a combination of these. The supervision signal for pretext tasks is generated from the structure of the input data itself (Jing & Tian, 2021). Self-supervised learning is still under-explored in ASC, although it performs remarkably well in natural language processing and computer vision (Tripathi & Mishra, 2021). For instance, Tripathi and Mishra (Tripathi & Mishra, 2021) presented a self-supervised deep classifier for ASC in which identifying the type of data augmentation is defined as the pretext task, and the model trained on the pretext task is further fine-tuned to build the ASC model.

6.6. Result post-processing

The purpose of result post-processing is to process the prediction results of ASC systems to improve performance. It can be considered a kind of auxiliary task. We categorize it into three groups, as follows:

1) Late fusion refers to obtaining the final output by combining the outputs of a set of individual classifiers, organized in parallel, using a combination method (Santana et al., 2010). Late fusion is also called model ensemble. Combination methods are summarized in (Kuncheva, 2004), which details how several classifiers can be combined for performance improvement.

Late fusion makes the final decision for a test instance by fusing the predictions output by multiple well-trained ASC models (Dong et al., 2018; Ding et al., 2022). The fusion strategies include voting, averaging, weighted averaging, and stacking (Jiang, Shi, & Li, 2019). Voting takes the class with the most votes for each sample as the final prediction. Averaging obtains the final prediction by averaging the prediction probabilities of the different systems. Weighted averaging, a variant of averaging, averages the prediction probabilities with unequal weights. Stacking combines the prediction probabilities via a meta-classifier, such as a random forest (Liu et al., 2019), logistic regression, or softmax (Byttebier et al., 2021).
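The three simple fusion strategies above can be sketched as follows with NumPy. The probability outputs and system weights are toy values used only for illustration.

import numpy as np

# Class-probability outputs of three independently trained ASC models
# for the same 5 test clips and 10 scene classes (toy values).
probs_a = np.random.dirichlet(np.ones(10), size=5)
probs_b = np.random.dirichlet(np.ones(10), size=5)
probs_c = np.random.dirichlet(np.ones(10), size=5)

# Averaging: mean of the prediction probabilities.
avg_pred = np.mean([probs_a, probs_b, probs_c], axis=0).argmax(axis=1)

# Weighted averaging: unequal trust in the individual systems.
weights = np.array([0.5, 0.3, 0.2])
weighted = np.tensordot(weights, np.stack([probs_a, probs_b, probs_c]), axes=1)
weighted_pred = weighted.argmax(axis=1)

# Majority voting over the hard decisions of each system.
votes = np.stack([p.argmax(axis=1) for p in (probs_a, probs_b, probs_c)])
voted_pred = np.array([np.bincount(v, minlength=10).argmax() for v in votes.T])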


Generally, the averaging and voting methods are comparatively simple, while the stacking method is often more effective. Other strategies are also employed for late fusion, such as weighted sums (Paseddula & Gangashetty, 2021), late fusion using swarm intelligence (Ding et al., 2022), and late fusion based on CNNs and ensemble classifiers (Alamir, 2021). Moreover, Li et al. (Li et al., 2017) employed random forests, extremely randomized trees, AdaBoost, gradient tree boosting, and weighted average probabilities to ensemble multiple ASC models.

2) Calibration is another special result post-processing method, which processes the prediction result of a single ASC system for performance improvement (Kong et al., 2020). For example, Platt scaling directly optimizes the logarithmic loss criterion through a calibration based on logistic regression (Guo et al., 2017; Alamir, 2021), and Nguyen and Pernkopf (Nguyen, Pernkopf, & Kosmider, 2020) used a heated-up softmax to calibrate the predictions of the ASC model. Calibration focuses on the prediction of a single system; late fusion can be considered a special calibration process based on multiple classification systems.
3) Two-stage fusion fuses the prediction results of a 3-class (coarse) and a 10-class (fine) classifier using an ad-hoc score combination method (Hu et al., 2020; Hu et al., 2021). The two classifiers are generally trained separately. Two-stage fusion can also be integrated into model learning as the joint training of the two classification tasks; for instance, Shim et al. (Shim et al., 2022) jointly learned the 10-class and 3-class ASC tasks in one model to implement two-stage fusion, which is relatively low-complexity and competitive compared with two-stage fusion with separately trained classifiers.

7. Open resources for ASC

7.1. Open-source code libraries for ASC

There are several commonly used open-source projects and toolkits dedicated to the ASC research community.

The Environmental Audio Recognition System (EARS9) is an open-source project. It implements a CNN for live environmental audio processing and recognition on low-power SoC devices (Raspberry Pi 3 Model B). EARS features a background thread for audio capture and classification, and a Bokeh-server-based dashboard providing live visualization and audio streaming from the device to the browser.

dcase_util10 is a collection of utilities for the DCASE challenges, bundled into a standalone library so that it can be reused in other research projects. Most utilities are related to audio datasets: handling metadata and various forms of other structured data, and providing a standardized usage API for audio datasets from different sources.

sed_eval11 is an open-source Python toolbox that provides a standardized and transparent way to evaluate SED and ASC systems (Mesaros, Heittola, & Virtanen, 2016a).

openSMILE12 is an open-source feature extraction toolkit implemented in C++ (Eyben, Wöllmer, & Schuller, 2010). It uses feature extraction algorithms from the speech processing and music information retrieval communities. It can be used to extract Low-Level Descriptors (LLD) and apply various filters, functions, and transformations to them. Various hand-crafted features are supported, such as CHROMA and CENS features, loudness, MFCC, LPC, Perceptual Linear Predictive Cepstral Coefficients (PLPCC), line spectral frequencies, fundamental frequencies, formant frequencies, and zero-crossings.

7.2. Available datasets for ASC

An audio classification system must be tested on standardized audio datasets to evaluate its performance in a fair quantitative comparison with similar studies. Therefore, we give an overview of several available datasets for ASC baseline testing in Table 8. The AudioSet dataset is generally used to pre-train networks for transferring knowledge from SED to ASC tasks. Other datasets, such as LITIS Rouen, ESC-50, ESC-10, MSOS, UrbanSound8K, and the datasets of the DCASE ASC tasks, have been commonly applied in ASC.

All datasets for the DCASE ASC tasks refer to the development datasets by default, because reference labels are provided only for the development datasets. In addition, there may be overlaps between these ASC datasets. For example, TAU Urban Acoustic Scenes 2019 is a subset of TAU Urban Acoustic Scenes 2020 Mobile, and the audio files of MSOS come from the Freesound database,13 the ESC-50 dataset, and the Cambridge-MT Multitrack Download Library.14

8. DCASE Challenges

Competitions and community activities play a significant role in attracting excellent researchers and further promoting advanced methods in a research field (Gharib et al., 2018a), because of their benefits in terms of reproducibility of results, dissemination of open-source code libraries, and documentation of results. The CLEAR evaluations and the DCASE challenge are ASC-related challenges.

The CLEAR evaluations (i.e., CLEAR 2006 (Stiefelhagen et al., 2007) and CLEAR 2007 (Stiefelhagen et al., 2008)) involved acoustic, visual, and audio-visual analysis. They mainly included tracking tasks (faces, people, and vehicles), person identification tasks, head pose estimation, and acoustic scene analysis (events and environment, i.e., SED and ASC tasks). Although an ASC task was included, it accounted for only a small proportion: only one site participated in the ASC task of CLEAR 2006, and CLEAR 2007 dropped the ASC task and kept only the SED task.

The DCASE challenge is one of the active competitions in fields related to computational auditory analysis. It provides researchers with a fair comparison on publicly available datasets and makes research results easier for others to compare against. It has been launching ASC tasks annually since 2016. Therefore, we discuss ASC development and trends by analyzing the ASC task setups, results, and promising methods in DCASE.

8.1. Task setup of ASC tasks

The DCASE challenge started in 2013 and has been held annually since 2016. We report all DCASE ASC tasks in Fig. 11 in terms of task, number of scenes, dataset duration, segment duration, sampling rate, baseline accuracy, and highest accuracy on the evaluation dataset. As the figure shows, the dataset duration of the DCASE ASC tasks increases from 50 min to 64 h, and the segment duration ranges from 30 s for D13 and D16, to 10 s for D17-D21, and to 1 s for D22 and D23. In summary, the development trends of the ASC tasks include increasing dataset scale, external data, multiple devices, open-set classification, low complexity, audio-visual fusion, and shorter segments.

Notably, all ASC tasks since the D18C task allow the use of external data. Moreover, there have been ASC tasks involving multiple devices every year since 2018, such as D18B, D19B, D20A, D21A, D22, and D23. Their development datasets have the inclusion relationships shown in Fig. 12. It shows the continuous expansion of the development datasets in the DCASE ASC tasks, especially in the later stage, which is directly expanded on the

9 https://github.com/karolpiczak/EARS.
10 https://github.com/DCASE-REPO/dcase_util.
11 https://github.com/TUT-ARG/sed_eval.
12 https://opensmile.sourceforge.net/.
13 https://freesound.org/.
14 https://www.cambridge-mt.com/ms-mtk.htm.

18
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

Table 8 Table 8 (continued )


Available datasets for ASC. Notably, the data in brackets in the dataset duration Database name Classes Samples Dataset Usage, publications
column is the duration of an audio segment. Duration
Database name Classes Samples Dataset Usage, publications Scenes 2019
Duration Mobile
Dares G1 28 123 123 min Stress testing and TUT Urban 10 15,850 44 h (10 DCASE2019 Task1C (
developing recognition Acoustic s) Mesaros, Heittola, &
systems, (Computational) Scenes 2019 Virtanen, 2018b)
auditory scene analysis, Openset
Auditory cognition MSOS 5 2000 2.8 h (5 Making Sense of Sounds
research, and Hearing aid s) (MSOS) challenge (Kroos
validation (Van Grootel, et al., 2019)
Andringa, & Krijnders, HEAR-DS 14 10,226 28.4 h Hearing aid research data
2009) (10 s) for ASC (Martín-Morató
LITIS Rouen 19 3026 1513 min ASC based on image et al., 2021)
(30 s) features ( TAU Urban 10 23,040 64 h (10 DCASE2020&2021
Rakotomamonjy & Gasso, Acoustic s) Task1A (Heittola,
2015) Scenes 2020 Mesaros, & Virtanen,
YouTube-8 M 4716 >7M >27 M A large-scale video Mobile 2020)
classification benchmark TAU Urban 3 14,400 40 h (10 DCASE2020 Task1B (S.
(Abu-El-Haija et al., Acoustic s) Wang, Mesaros, Heittola,
2016) Scenes 2020 & Virtanen, 2021)
AudioSet 527 2.1 M 5800 h Audio event recognition ( 3Class
Gemmeke et al., 2017) TAU Urban 10 12,240 34 h (10 DCASE2021 Task1B (
ESC-50 50 200 166 min Environmental sound Audio Visual s) Gharib et al., 2018a)
(5 s) classification (Piczak, Scenes 2021
2015b) TAU Urban 10 230,350 64 h (1 s) DCASE2022&2023 Task1
ESC-10 10 400 33 min Environmental sound Acoustic (Mesaros, Heittola, &
(5 s) classification (the subset Scenes 2022 Virtanen, 2018b)
of ESC-50) Mobile
NYU Urban 10 8732 525 min Urban sound research
Sound8K (<=4s) (analysis, classification,
etc.) (Salamon, Jacoby, & original basis. The low-complexity ASC tasks are constrained by a
Bello, 2014) limited model complexity (e.g., 500 KB or 128 KB). The calculation rule
CHIME-Home 7 6137 409 min Sound source recognition of model complexity varies from ASC tasks. More details refer to DCASE
(4 s) in a domestic
environment (Foster
website.15
et al., 2015) We also report acoustic scenes for DACSE ASC tasks in Table 9. From
Freefield1010 7 7690 1282 min Audio recording archives this table, acoustic scenes are various before 2018, but scene classes
(Stowell & Plumbley, from 2018 are limited to urban acoustic scenes, mainly including three
2014)
major classes.
CICESE Sound 20 1367 92 min Scalable identification of
Events mixed environmental In general, the task complexity of ASC tasks increased over the years.
sounds, recorded from It is not only affected by dataset duration, segment duration, and scene
heterogeneous sources ( classes but also by the focus points of each ASC task (e.g., multiple de­
Beltrán, Chávez, & vices, use of external data, open set, low complexity, and audio-visual
Favela, 2015)
DCASE 2013 10 100 50 min DCASE2013 Task1 (
fusion). To quantify the task complexity, we attempt to calculate a
Scenes (30 s) Giannoulis et al., 2013b) task complexity factor based on these influence factors as follows.
DEMAND 18 18 (16 1.5 h (5 Multichannel
channels) min) environmental noise for (1) If CF , CC , CD , and CS present the task complexity factors of task
audio processing, such as
focus, the number of scene classes, datasets duration, and
source separation, signal
enhancement, and segment duration, The task complexity C is calculated by C =
acoustic noise CF ⋅CC ⋅CD ⋅CS .
suppression/removal ( (2) If CdF , CeF , CoF , ClF , and Cav
F correspond to task complexity factors of
Thiemann, Ito, & Vincent,
multiple devices, use of external data, open set, low complexity,
2013)
TUT Sound 15 1170 585 min DCASE2016 Task1 ( and audio-visual fusion, CF is their sum, but if it is calculated to 0,
Scenes 2016 (30 s) Mesaros, Heittola, & it will be assigned a value of 1; CdF is calculated by CdF =
Virtanen, 2016b) 5⋅(Ndevices − 1), where N devices is the number of recording de­
TUT Acoustic 15 4680 780 min DCASE2017 Task1 (
scenes 2017 (10 s) Mesaros et al., 2017) vices. Others is calculated by CeF , CoF , ClF , Cav F =
{
TUT Urban 10 86,400 24 h (10 DCASE2018 10, if the focus point exists
.
Acoustic s) Task1A&Task1C ( 0, others
Scenes 2018 Mesaros, Heittola, &
(3) CC and CD are equal to the number of scene classes and datasets
Virtanen, 2018b)
TUT Urban 10 10,080 28 h (10 DCASE2018 Task1B (
duration, respectively (e.g., CC = 10 for 10-class ASC or CD = 40
Acoustic s) Mesaros, Heittola, & when duration dataset is 40 h).
Scenes 2018 Virtanen, 2018b) (4) CS = 1 if segment duration is 30 s; CS = 3 if segment duration is
Mobile 10 s; CS = 30 if segment duration is 1. Because long-duration
TAU Urban 10 14,400 40 h (10 DCASE2019 Task1A (
sounds are generally most informative for ASC (Wu & Lee,
Acoustic s) Mesaros, Heittola, &
Scenes 2019 Virtanen, 2018b) 2020; Ebbers, Keyser, & Haeb-Umbach, 2021). Therefore, we
TAU Urban 10 16,560 46 h (10 DCASE2019 Task1B (
Acoustic s) Mesaros, Heittola, &
Virtanen, 2018b)

15
https://dcase.community/.

19
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

Fig. 11. The task setup of DCASE ASC tasks. The duration represents the duration of the development dataset and audio segment. The baseline and highest accuracies
are evaluated on the corresponding evaluation.

Fig. 12. The relationship between different development datasets of DCASE ASC tasks. The dataset for D22 and D23 has the same data with different segment
duration as D20A and D21A.

D21B.
Table 9
The acoustic scenes for DCASE ASC tasks.
9. Results of ASC tasks
DCASE Scene Acoustic Scenes
editions number
We also report the accuracies of the top-5 ranked systems of DCASE
D13 10 Bus, busystreet, office, open air market, park, quiet ASC tasks in Fig. 14. By observing this figure, we can draw the following
street, restaurant, supermarket, tube, tubestation conclusions:
D16-D17 15 Bus, cafe/restaurant, car, city center, forest path,
grocery store, home, lakeside beach, library, metro
station, office, residential area, train, tram, urban park (1) The accuracies of the top-5 systems for the D20B (3-class) and
D18-D23 10 Airport, shopping_mall, metro_station, D21B tasks (audio-visual) are more than 90 % due to their lower
street_pedestrian, public_square, street_traffic, tram, task complexity and advanced deep learning methods.
bus, metro, park
(2) The accuracy of the D13 task is very low because of the immature
D20B 3 Indoor (airport, metro_station, and shopping_mall),
outdoor (park, public_square, street_pedestrian, and ASC technologies. And the accuracy of D16 is higher because of
street_traffic), transportation (bus, metro, and tram) the introduction of deep learning methods into ASC.
(3) From Fig. 14 (a), the best accuracy of D16, D17, D18A, and D19A
is more than 80 %, while the highest accuracy of D18B, D19B,
consider that the shorter the segment duration, the higher the D20A, D21A, D22, and D23 is less than 80 % (even less than 70
task complexity. %). It might be caused by constraints of dataset size, multiple
devices, and low model complexity.
This calculation rule allows a qualitative comparison of task (4) The constraint of low complexity slightly affects ASC perfor­
complexity for DCASE ASC tasks shown in Fig. 13. This figure shows an mance by observing the results of D20A and D21A, where D21A
increasing trend of the overall task complexity over time. It is affected by only added a low-complexity constraint.
different influence factors, which is especially obvious in D20B and

20
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

Fig. 13. The task complexity of DCASE ASC tasks. Only the complexity of the entire task (C_tasks) uses the scale of the right Y-axis. It is noted that the D23 task is a
continuation of the D22 task, with modified memory limit and added measurement of energy consumption. Both D22 and D23 tasks focus on low-complexity
problems although there are subtle changes. With the calculation rule of task complexity, their task complexity factor are equal.

Fig. 14. The accuracies of the top-ranked systems submitted for ASC tasks in previous DCASE challenges. Notably, the vacancies in (b) mean no data.

21
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

(5) The 10 s segment classification outperforms the 1 s by comparing Also, this team was ranked 2nd in D18A and D19A. The team created by
D21A (10 s) with D22/D23 (1 s), where they contain the same Park and Mun et al. was ranked 3rd and 1st in D16 and D17 (Park et al.,
audio data with different segment durations. 2016; Mun et al., 2017a). The team formed by Yang et al. (Yang, Chen, &
(6) By observing the results of D21A and D22/D23 in Fig. 14 (a), the Tao, 2018) got 5th and 2nd place in D18A and D18B. The team of
accuracy of the ranked low system may be higher than that of the McDonnell and Gao was ranked 2nd, 3rd, and 3rd in D19B, D20A, and
ranked high system because the ranking criteria changed from D20B (Koutini et al., 2020; McDonnell & Gao, 2020; McDonnell &
macro-average accuracy to multiclass cross-entropy (e.g., D21A, UniSA, 2020). The team of Hu et al. was ranked 2nd in D20A and D20B
D21B, and D22) or weighted average rank of multiple criteria (e. (Hu et al., 2020; Yang et al., 2021a). Then, they and Wang et al. got 2nd
g., D23) since 2021. place in D21A and D21B together as a team (Q. Wang et al., 2021).
(7) DCASE challenges play a critical role in stimulating the devel­
opment of ASC, as reflected in the accuracy improvement of top- 9.1. The most promising methods in ASC competitions
ranked systems compared with baseline systems, ranging from 7
% to 30 % with an average of about 20 %. We further analyze the most promising methods in DCASE ASC
(8) The slight accuracy difference among the top 5 systems for the competitions. Particularly, we report specifications of top systems in
same task (especially obvious since 2019) demonstrates that DCASE ASC tasks and their accuracies on the development and evalu­
high-performance ASC methods are diverse. The method di­ ation datasets (Eval. vs. Dev.) in Table 10. From this table, the following
versity benefits ASC algorithm optimization under the constraints points can be observed: (i) CNN was gradually introduced into ASC, and
of task complexity and limited model complexity. the complexity of CNN increased, (ii) Data augmentation methods such
(9) From Fig. 14 (c), some ASC systems are overconfident and per­ as mixup and GAN have been applied to boost performance for the
formed much worse on the evaluation dataset than on the problem of requiring larger labeled data for model training in deep
development dataset, corresponding to a maximum drop of 17.5 learning (e.g., CNN), (iii) The top systems are stable and have good
%. Some systems performed much better on the evaluation generalization ability according to their slight accuracy difference be­
dataset than the development dataset, corresponding to a tween the development and evaluation datasets, (iv) The diversity of
maximum increment of 10.9 %. The top system of each task features for ASC was increased, from the initial RQA, MFCC features,
almost does not suffer from these two extreme problems; the and spectrograms to recent LMBE, as well as CQT features for supple­
accuracy gap between the development and evaluation datasets is menting LMBE features, (v) The effect of resampling and multichannel
less than 5.4 %. in the input stage on performance improvement cannot be observed
from this table, and (vi) Late fusion methods are effective for perfor­
Finally, we find that not a few teams have participated in the DCASE mance improvement because most top systems have used late fusion
ASC competition many times and ranked high by tracking the partici­ except for D13 task.
pation of top-ranked teams. For instance, the CP-JKU team participated Also, we summarize the promising ASC methods in terms of input,
in almost all ASC competitions (Eghbal-Zadeh et al., 2016; Koutini et al., sampling rate, date augmentation, features, classifier, and output by
2020; Lehner et al., 2017; Dorfer et al., 2018; Koutini, Eghbal-zadeh, analyzing DCASE technical reports. ASC systems commonly use
Widmer, & Kepler, 2019, Koutini, Jan, & Widmer, 2021; Schmid monaural, multichannel signals, or signal components obtained by
et al., 2022; Schmid et al., 2023), and won D16, D20B, D22, and D23. HPSS. The sampling rate is generally reset by resampling to adapt to the

Table 10
Top systems submitted for ASC tasks in previous DCASE challenges. CNN1 represents the use of traditional CNN architectures for classification. More details please
refer to (Roma, Nogueira, & Herrera, 2013; Eghbal-Zadeh et al., 2016; Mun et al., 2017a; Mesaros et al., 2018; Sakashita & Aono, 2018; Nguyen & Pernkopf, 2018;
Kosmider, 2019; Chen et al., 2019; Suh et al., 2020; Koutini et al., 2020; Kim et al., 2021; Schmid et al., 2022; Schmid et al., 2023).
DCASE Input Sampling Data Augmentation Features Classifier Output Eval. Dev.
ASC tasks Rate

D13 44.1 kHz RQA, MFCC SVM 76.0 % 71.0 %


D16 mono + 44.1 kHz MFCC, spectrogram i-vector, DCNN Averaging, 89.7 % 89.9 %
binaural calibration
D17 left, right, 22.05 kHz GAN LMBE, spectrogram MLP, RNN, CNN1, SVM Majority vote 83.3 % 87.1 %
mixed
D18A mono, 44.1 kHz mixup LMBE CNN1 Random forest 81.0 % 76.9 %
binaural
1
D18B 44.1 kHz LMBE and their nearest CNN Averaging vote 69.0 % 63.6 %
neighbor filtered version
1
D19A binaural 48 kHz GAN LMBE, CQT CNN Average vote 85.2 % 85.1 %
D19B 44.1 kHz SpecCorrect, LMBE CNN1 Soft-voting 75.3 % 70.0 %
specAugment, mixup
D20A 44.1 kHz temporal cropping, LMBE, deltas, delta-deltas ResNet, Snapshot Averaging, 76.5 % 74.2 %
mixup weighted
averaging
D20B stereo 22.05 kHz mixup Perceptually-weighted RF-regularized CNNs Averaging 96.5 % 97.3 %
LMBE
D21A 16 kHz mixup, time rolling, LMBE CNN1, BC-ResNet Maximum 76.1 % 75.9 %
specAugment likelihood
D21B 48 kHz mixup LMBE, CQTbark CNN1, EfficientNet, Swin Weighted vote 93.8 % 95.1 %
Transformer, ensemble
D22 32 kHz mixstyle, pitch shifting LMBE RF-regularized CNNs, Averaged logits 59.6 % 58.0 %
PaSST transformer
D23 32 kHz device impulse response LMBE RF-regularized CNNs, Averaged logits 58.7 % 58.4 %
augmentation, PaSST transformer
mixup,
freq-mixstyle,
pitch shifting

22
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

ASC system. Next, data augmentation methods are often employed, such reweighting (Huang et al., 2006), weighted local domain adaptive
as mixing, time shifting, pitch shifting, mixup, SpecAugment, frequency method (He & Zhu, 2021), and DIDA (Deng et al., 2022)), feature
shifting, random noise, and GAN. The popular features include LMBE, adaptation (feature alignment) (Wang et al., 2019), and model adapta­
MFCCs, CQT, raw waveform, deltas and delta-deltas of LMBE, HOG, etc. tion (Primus et al., 2019; Zhao et al., 2022). Also, it can be categorized
The classifier is generally based on CNNs (e.g., ResNet), SVM, GMM, or into supervised (Abeßer, 2020; Ren et al., 2020) and unsupervised (e.g.,
ensemble models. Finally, the output is usually processed by result post- MMD (Long et al., 2016), UADA (Gharib et al., 2018b), W-UADA
processing methods for higher performance. (Drossos, Magron, & Virtanen, 2019), DANN (Wang et al., 2019), MCD-
KD (Takeyama et al., 2021), and MTDA (Yang et al., 2021)) methods
9.2. Open challenges and future directions according to whether target domain data are labeled.
Abeßer (Abeßer, 2020) argued that combining approaches for
This section discusses open challenges and future directions. Chal­ domain adaptation and data augmentation jointly improved the
lenges are arose from developing and applying ASC algorithms in the robustness of ASC algorithms against changes in acoustic conditions. In
real world. Focusing on these challenges, we suggest several future di­ addition, the problem of multiple devices can also be solved from the
rections to address ASC problems. perspective of feature extraction and classification. For instance, Jiang
et al. (Jiang et al., 2023) employed the class hierarchy relationship of
9.3. Open challenges acoustic scenes to learn advanced features that are less susceptible to
differences in device information. Aryal and Lee (Aryal & Lee, 2023)
The large progress in the ASC field is reflected in the increasing proposed a frequency-aware convolutional neural network (FACNN) to
datasets and significantly improved accuracy. However, the four ASC solve the device mismatch problem by focusing on the frequency in­
problems mentioned in section 2.3 are still pending. Among these formation of the audio samples. However, the performance of multiple-
problems, the unclear definition of acoustic scene and the lack of large- device ASC tasks is still low, such as the D18B, D19B, D20A, D21A, D22,
scale comprehensive datasets are generalized problems that are hard to and D23 tasks. Therefore, the multiple-device problem is still an open
address by a specific method. Because it is hard to provide a clear and challenge to be addressed for ASC development.
uniform acoustic scene definition for ASC. ASC datasets are closely
related to acoustic scene definition. Hence, the lack of large-scale 2) Deep learning methods: Although various deep learning methods
comprehensive datasets is hard to address as well, although there are have been applied in ASC, there is still much space for performance
many ASC datasets. improvement. ASC algorithms can be optimized by novel deep
We thus mainly discuss the challenges raised by the rest of the learning methods, such as attention mechanisms, transfer learning,
problems (i.e., performance optimization and computational and multitask learning. The attention mechanism determines which
complexity) in this section. Once they are overcome, the development of part needs to be focused on by estimating the contribution of the
ASC will make a great leap. Subsequently, various excellent ASC tech­ input and thus boosts classification performance. Transfer learning
nologies would be applied to actual life and production, which can allows an ASC algorithm to retrain a pre-trained model by trans­
significantly enhance production efficiency and improve people’s ferring existing knowledge from other tasks for performance
quality of life. improvement. Multitask learning has achieved great success in task
assistance for ASC, such as the two-stage fusion by joint training and
9.3.1. Performance optimization multitask learning embeded SED.
According to the influence factors of performance, we discuss per­
formance optimization challenges from the following four perspectives: A quantitative analysis of performance improvement and theoretical
guidance is still underexplored, although some papers have validated
1) Multiple devices: Audio data is often recorded with distinct devices, the effectiveness of these deep learning methods for ASC. Therefore,
such as professional sound recording devices (Brezina & Jeseniˇcov́ a, more research is needed.
2018) and mobile devices (Goyal, Shukla, & Sarin, 2019). Varying
characteristics of recording devices lead to various qualities of audio 3) Model Interpretability: Occasionally, evaluation metrics calculated
data, such as sampling rate, amplitude, frequency response, and data on only predictions and ground-truth labels do not suffice to char­
distributions (Primus et al., 2019). However, it is challenging to train acterize the model. The demand for the reliability and interpret­
models on an audio dataset with different data distributions caused ability of a model arises when a mismatch occurs between the formal
by mismatched devices (Nguyen & Pernkopf, 2019b). Various objectives of supervised learning (testing performance) and the real-
methods, such as data augmentation (Nguyen & Pernkopf, 2019a; Hu world costs in a deployment setting. The desiderata of interpret­
et al., 2021; Morocutti et al., 2023), spectrum correction (Nguyen, ability research include trust, causality, transferability, informa­
Pernkopf, & Kosmider, 2020; Kosmider, 2020), and domain adap­ tiveness, and fair and ethical decision-making (Lipton, 2018).
tation (Primus et al., 2019; Yang, Wang, & Zou, 2021; Gharib et al.,
2018b), have been used to address this problem. Model interpretability helps to explain the methods of model pre­
diction and internal feature representation. It will be useful to improve
Data augmentation works by increasing extra training data con­ model performance, confidence, and robustness when deploying the
taining device-related information. For example, mixup is used to model in the real world. For example, Wu and Lee (Wu & Lee, 2019)
enlarge the training set recorded by devices with fewer samples to solve proposed a CNN visualization method called Grad-CAM (Zhang et al.,
the data imbalance problem between recording devices (Nguyen & 2021) to help people understand the relationship between the input and
Pernkopf, 2019a). Spectrum correction reduces the difference among output. However, it is still a challenge to develop methods that allow for
mismatched devices by transforming a given input spectrum to that of a a better interpretation of the model predictions and internal feature
reference (Nguyen, Pernkopf, & Kosmider, 2020). For instance, Kos­ representations (Abeßer, 2020). Also, how to use model interpretability
mider (Kosmider, 2020) used it to transform the distortion of all known to boost model performance and guide an efficient deployment of ASC
devices into the distortion of a chosen reference device by matching algorithms in the real world is another challenge that needs to be
their spectra. Domain adaptation adjusts samples, features, or models overcome.
from the source data to the target data to minimize their discrepancy and
thus improve performance. It can be divided into three groups: instance 4) Result post-processing: result post-processing has become an essen­
adaptation (e.g., instance weighting (Jiang & Zhai, 2007), sample tial part of state-of-the-art ASC algorithms recently (Abeßer, 2020).

23
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

However, there are still two primary challenges that need to be 2) Deployment based on collaborative end-edge-cloud computing: It
addressed. The first is how to improve performance steadily and might still be a big challenge to deploy a compressed deep model on
significantly with a general fusion method. Simple post-processing the end device with limited resources, especially in complex appli­
methods (e.g., averaging) can only slightly improve performance. cation scenarios. Although cloud computing can process
While complex methods are designed for special tasks, which are computation-intensive tasks (e.g., deep learning), it cannot guar­
difficult to widely use in ASC tasks. For example, the optimization- antee low latency throughout the whole process of data generation,
based late fusion method (Ding et al., 2022) requires various initial transmission, and execution (Wang et al., 2020). Focusing on the
ASC systems for optimization fusion. Two-stage fusion (Ren et al., issue, a novel computing paradigm, collaborative end-edge-cloud
2018; Hu et al., 2020; Hu et al., 2021) requires extra coarse class computing (Huang et al., 2018; Wang et al., 2020) was proposed,
labels and an auxiliary system. Second, result post-processing which has potential for the deployment of complex ASC systems. By
requiring multiple models is often not feasible in the real world, using this computing paradigm, the high-complexity computation
limited by available computational resources and real-time will be reasonably decomposed and dispatched separately to the end,
requirements. edge, and cloud for execution, reducing the execution delay of the
task while ensuring the classification performance (G. Li, Liu, Wang,
9.3.2. Computational complexity Dong, Zhao, & Feng, 2018; Zhang & Yan, 2023).
The real-world solutions need to operate on devices with limited
computational capacity (Martín-Morató et al., 2021). Moreover, real- However, how to reasonably decompose the ASC task and allocate
time processing requirements often demand fast model prediction the end, edge, and cloud resources is still a challenge. It involves the
with low latency. Therefore, we analyze computational complexity granularity determination of dividing tasks, computing resource evalu­
challenges from the following two perspectives. ation, and task scheduling optimization.

1) Model compression: Model compression refers to using a fast and 9.4. Future directions
compact model to approximate the function learned by a slower,
larger, but better performing model (Buciluǎ, Caruana, & Niculescu- As mentioned above, ASC faces four main problems. We present our
Mizil, 2006). It is also widely concerned in ASC (Mohaimenuzzaman thoughts on research directions with aspects of these problems and their
et al., 2023), especially as DACSE challenges have released low- challenges, as shown in Fig. 15. For the first problem, acoustic scenes
complexity tasks for four consecutive years. Cheng et al. (Cheng can be defined according to specific ASC applications combined with the
et al., 2018) summarized four types of model compression methods existing categorization of environmental sounds. Then, we summarize
as follows: (i) Parameter pruning and sharing remove inessential pa­ three directions for the ASC dataset problem: (i) Building/collecting
rameters in DNN without (little) performance loss (Gou et al., 2021). specific ASC datasets directly, which is intuitive and effective but
It can be subdivided into model quantization (e.g., 8/16-bit quanti­ complex and costly, (ii) Automatic audio tagging can be applied as an
zation (Vanhoucke, Senior, & Mao, 2011; Gupta et al., 2015; Kim, alternative solution, which efficiently reduces the cost of manual
Yoo, & Kwak, 2020), ternary quantization (2-bit) (Zhu et al., 2017; tagging for big data but relies on high-performance annotation algo­
Chang et al., 2018), and binarization (1 bit) (Courbariaux, Bengio, & rithms, (iii) Transfer learning (it requires less labeled data since it can
David, 2015)), pruning and sharing, and designing the structural transfer knowledge from other related tasks) or few-shot learning (it
matrix (Cheng et al., 2018). (ii) Low-rank factorization estimates the requires a few samples) can also be a feasible solution (Xie et al., 2023).
informative parameters of the deep models by using matrix/tensor We provide several suggestions for the low-performance problem in
decomposition (Cheng et al., 2018). (iii) Transferred/compact con­ terms of corresponding open challenges. First, we suggest using data
volutional filters remove inessential parameters by transferring or augmentation, spectrum correction, or domain adaptation for the chal­
compressing the convolutional filters (Gou et al., 2021). And (iv) lenge of multiple devices. Second, we advise extracting multi-view
Knowledge distillation (KD) (Gou et al., 2021; Kang et al., 2023; Tri­ features and using efficient classification methods (such as DNN and
pathi & Pandey, 2023) learns a smaller student model from a larger CNN) to optimize ASC performance from the perspective of deep
network with good performance, which is implemented by learning learning. Next, we recommend designing characterizable and inter­
the output class distributions of the teacher model via softened pretable model components/structures, e.g., the attention model. Also,
softmax. appropriate result post-processing methods are suggested to boost ASC
performance, for example, a novel optimization algorithm can be
Additionally, model simplification is also employed to compress a introduced into the late fusion for ASC (Zhao et al., 2023a; Zhao et al.,
model, implemented by reducing the model parameters (Pham et al., 2023b). Finally, the suggestion of designing a low-complexity ASC
2023; Liang, Zhang, & Feng, 2020), scaling the network dimension (Li model and using model compression is provided from the perspective of
et al., 2020), or using slim models, e.g., depth-wise separable CNNs model compression. Additionally, we suggest deploying ASC systems
(Drossos et al., 2020; Lee et al., 2021). MobileNetV2 and EfficientNet based on collaborative end-edge-cloud in the real world.
(Tan & Le, 2019) are successful examples. Also, some optimization Inspired by the application of reinforcement learning in speech
works to empower a small-size low-complexity CNN model are explored. recognition to enhance speech (Shen et al., 2019) or optimize supervised
For example. Seresht and Mohammadi (Seresht & Mohammadi, 2023) model training or adaptation methods (Kala & Shinozaki, 2018), rein­
proposed a novel global pooling method called Sparse Salient Region forcement learning is also suggested to be used in ASC. In addition,
Pooling (SSRP), which computes the channel descriptors using a sparse applications of ASC need to be explored, such as smart homes, smart
subset of features, and guides the model to effectively learn from the cities, driverless cars, virtual reality, etc. It is the key to improving the
more salient time–frequency regions. application value of ASC and stimulating the development of ASC more
Various model compression methods or their combinations (Yang efficiently.
et al., 2021a; Kim et al., 2021; Han, Mao, & Dally, 2016) have efficiently
reduced the model complexity of ASC systems. However, the model 10. Conclusion
compression method for ASC is selected based on the researchers’
experience. There is still a lack of guidance for researchers to choose a In recent years, a rapid increase of research papers have been pub­
fast, simple, and efficient model compression scheme for ASC. lished in the ASC field. It means ASC has raised widespread attention.
There are several reasons for achieving this progress. Firstly, various
available dataset, code repositories, and tutorial materials make it easier

24
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

Fig. 15. Research directions for the ASC field.

to start the research on ASC. Secondly, advanced audio signal processing directions. We hope this survey will give researchers a deeper under­
and deep learning techniques are proposed or introduced into ASC, such standing of various ASC technologies and the relationships among them,
as channel selection, data augmentation, feature acquisition, CNN as a springboard for future research.
network designing, transfer learning, multitask learning, and attention
mechanism, etc. Finally, the annual DCASE challenge stimulates and CRediT authorship contribution statement
guides researchers to address complex problems in the ASC study (e.g.,
large-scale datasets, use of external data, open set, low-complexity, Biyun Ding: Conceptualization, Methodology, Software, Investiga­
multiple devices, audio-visual fusion, and shorter segments). tion, Validation, Writing – original draft, Writing – review & editing.
The recent state-of-the-art algorithms have achieved relatively high Tao Zhang: Conceptualization, Resources, Supervision, Project admin­
performance on most available ASC datasets (refer to Table 2). For istration, Funding acquisition. Chao Wang: Methodology, Writing –
example, the highest accuracy of ASC reach 99.0 % on UrbanSound8K, review & editing. Ganjun Liu: Methodology, Writing – review & edit­
97.0 % on ESC-10, 95.7 % on ESC-50, 98.9 % on LITIS, 96.0 % on MSOS, ing. Jinhua Liang: Writing – review & editing. Ruimin Hu: Writing –
and over 90 % on most ASC datasets from the DCASE challenge (e.g., review & editing. Yulin Wu: Writing – review & editing. Difei Guo:
D13, D16, D17, D18A, D18B, D19A, D19B). Recently, the ASC method Writing – review & editing.
using multi-scale Mel spectrogram and designed RQNet network ach­
ieved the best accuracy of 88.0 % on the D21A datasets (focusing on
Declaration of Competing Interest
multiple devices and low complexity issues). However, the accuracy on
the newest D22 and D23 datasets (focusing multiple devices, low-
The authors declare that they have no known competing financial
complexity issues, and shorter segments issues) is still low (under
interests or personal relationships that could have appeared to influence
75.0 %). It indicates that complex ASC tasks still face low- performance
the work reported in this paper.
issues.
We have summarized four problems in ASC, including unclear defi­
nition of acoustic scenes, lack of large-scale comprehensive datasets, low Acknowledgment
classification performance, as well as computational efficiency and
practicability. Generally, performance optimization and computational This work was supported by the National Natural Science Foundation
complexity are commonly focused by researchers. In terms of perfor­ of China (grant number 62271344).
mance optimization issue, various core technologies including data
processing, feature acquisition, and modeling methods have been pro­ References
posed or introduced into ASC. And the computational complexity issue
Abeßer, J. (2020). A review of deep learning based methods for acoustic scene
can addressed from aspects of model complexity controlling (e.g., model classification. Applied Sciences, 10(6). https://doi.org/10.3390/app10062020
compression, model optimization) and the optimization of deployment Abidin, S., Togneri, R., & Sohel, F. (2017). In Enhanced LBP texture features from time
strategy. frequency representations for acoustic scene classification (pp. 626–630). IEEE.
Abrol, V., & Sharma, P. (2020). Learning hierarchy aware embedding from raw audio for
Summarily, this article provides a comprehensive survey for ASC, acoustic scene classification. IEEE/ACM Transactions on Audio, Speech, and Language
covering the fundamentals and practice till August 2023. We summarize Processing, 28, 1964–1973.
the development history of ASC. We also overview core technologies for Abu-El-Haija, S., Kothari, N., Lee J., Natsev, P., Toderici, G., Varadarajan, B., &
Vijayanarasimhan, S (2016). Youtube-8m: A large-scale video classification
ASC and discuss their promise and limitations. Then, we study the series benchmark. arXiv preprint arXiv: 1609.08675. http://research.google.com/
of ASC resources and challenges that are useful information for re­ youtube8m/.
searchers and engineers to start the ASC research. Finally, we analyze Agrawal, D. M., Sailor, H. B., Soni, M. H., & Patil, H. A. (2017). Novel TEO-based
Gammatone features for environmental sound classification. In In 2017 25th
open challenges for ASC and conclude with several potential future
European Signal Processing Conference (EUSIPCO) (pp. 1809–1813).

25
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

Alamir, M. A. (2021). A novel acoustic scene classification model using the late fusion of Casey, M. (2001). General sound classification and similarity in MPEG-7. Organised
convolutional neural networks and different ensemble classifiers. Applied Acoustics, Sound, 6(2), 153–164.
175, Article 107829. https://doi.org/10.1016/j.apacoust.2020.107829 Chachada, S., & Kuo, C. C. J. (2014). Environmental sound recognition: A survey.
Arniriparian, S., Freitag, M., Cummins, N., Gerczuk, M., Pugachevskiy, S., & Schuller, B. APSIPA Transactions on Signal and Information Processing, 3(1), Article e14.
(2018). A fusion of deep convolutional generative adversarial networks and https://doi.org/10.1017/ATSIP.2014.12.
sequence to sequence autoencoders for acoustic scene classification. In In 2018 26th Chandrakala, S., & Jayalakshmi, S. L. (2019). Environmental audio scene and sound
European Signal Processing Conference (EUSIPCO) (pp. 977–981). event recognition for autonomous surveillance: A survey and comparative studies.
Aryal, N., & Lee, S. W. (2023). Frequency-based CNN and attention module for acoustic ACM Computing Surveys, 52(3), 1–34. https://doi.org/10.1145/3322240
scene classification. Applied Acoustics, 210. https://doi.org/10.1016/j. Chang, Y., Wu, X., Zhang, S., & Yan, J. (2018). Ternary weighted networks with equal
apacoust.2023.109411 quantization levels. In 25th Asia-Pacific Conference on Communications (APCC) (pp.
Aucouturier, J. J., Defreville, B., & Pachet, F. (2007). The bag-of-frames approach to 126–130). https://doi.org/10.1109/APCC47188.2019.9026483
audio pattern recognition: A sufficient model for urban soundscapes but not for Chen, C., Wang, M., & Zhang, P. (2022). Audio-visual scene classification using a transfer
polyphonic music. The Journal of the Acoustical Society of America, 122(2), 881–891. learning based joint optimization strategy. arXiv preprint arXiv: 2204.11420.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations Chen, H., Liu, Z., Liu, Z., Zhang, P., & Yan, Y. (2019). Integrating the data augmentation
from unlabeled video. In 30th Annual Conference on Neural Information Processing scheme with various classifiers for acoustic scene modeling. Tech. Rep. DCASE2019
Systems (NIPS) (pp. 892–900). Challenge.
Aziz, S., Awais, M., Akram, T., Khan, U., Alhussein, M., & Aurangzeb, K. (2019). Chen, H., Zhang, P., & Yan, Y. (2019). An audio scene classification framework with
Automatic Scene Recognition through Acoustic Classification for Behavioral embedded filters and a DCT-Based temporal module. In 2019 IEEE International
Robotics. Electronics, 2019, 8(5), Article 483. https://doi.org/10.3390/ Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 835–839).
electronics8050483. https://doi.org/10.1109/ICASSP.2019.8683636
Baelde, M., Biernacki, C., & Greff, R. (2017). A mixture model-based real-time audio Chen, H., Liu, Z., Liu, Z., & Zhang, P. (2021). Long-term scalogram integrated with an
sources classification method. In In 2017 IEEE International Conference on Acoustics, iterative data augmentation scheme for acoustic scene classification. The Journal of
Speech and Signal Processing (ICASSP) (pp. 2427–2431). https://doi.org/10.1109/ the Acoustical Society of America, 149(6), 4198–4213.
ICASSP.2017.7952592 Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2018). Model compression and acceleration
Bahmei, B., Birmingham, E., & Arzanpour, S. (2022). CNN-RNN and Data Augmentation for deep neural networks: The principles, progress, and challenges. IEEE Signal
Using Deep Convolutional Generative Adversarial Network for Environmental Sound Processing Magazine, 35(1), 126–136.
Classification. IEEE Signal Processing Letters, 29, 682–686. Chollet, F. (2017). In Xception: Deep learning with depthwise separable convolutions (pp.
Bai, X., Du, J., Pan, J., Zhou, H. S., Tu, Y. H., & Lee, C. H. (2020). High-resolution 1251–1258). IEEE Computer Society.
attention network with acoustic segment model for acoustic scene classification. In Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., & Bengio, Y. (2015). Attention-
In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing based models for speech recognition. Advances in Neural Information Processing
(ICASSP) (pp. 656–660). https://doi.org/10.1109/ICASSP40776.2020.9053519 Systems, 28, 577–585.
Bai, X., Du, J., Wang, Z. R., & Lee, C. H. (2019). A Hybrid Approach to Acoustic Scene Chu, S., Narayanan, S., & Kuo, C. C. J. (2009). Environmental sound recognition with
Classification Based on Universal Acoustic Models. In Proc. Interspeech, 3619–3623. time-frequency audio features. IEEE Transactions on Audio, Speech, and Language
https://doi.org/10.21437/Interspeech.2019-2171. Processing, 17(6), 1142–1158.
Barchiesi, D., Giannoulis, D., Stowell, D., & Plumbley, M. D. (2015). Acoustic scene Chu, S., Narayanan, S., Kuo, C. C. J., & Mataric, M. J. (2006). Where am I? Scene
classification: Classifying environments from the sounds they produce. IEEE Signal recognition for mobile robots using audio features. In 2006 IEEE International
Processing Magazine, 32(3), 16–34. https://doi.org/10.1109/MSP.2014.2326181 conference on multimedia and expo (pp. 885–888). https://doi.org/10.1109/
Battaglino, D., Lepauloux, L., Pilati, L., & Evans, N. (2015). Acoustic context recognition ICME.2006.262661
using local binary pattern codebooks. In In 2015 IEEE Workshop on Applications of Clarkson, B., & Pentland, A. (1999). Unsupervised clustering of ambulatory audio and
Signal Processing to Audio and Acoustics (WASPAA) (pp. 1–5). https://doi.org/ video. In 1999 IEEE International Conference on Acoustics, Speech, and Signal
10.1109/WASPAA.2015.7336886 Processing. Proceedings (ICASSP) (Vol. 6, pp. 3037-3040). IEEE, https://doi.org/
Bear, H. L., Nolasco, I., & Benetos, E. (2019). Towards joint sound scene and polyphonic 10.1109/ICASSP.1999.757481.
sound event recognition. In Proc. Interspeech (pp. 4594-4598). https://doi.org/ Clarkson, B., Sawhney, N., & Pentland, A. (1998). Auditory context awareness via
10.21437/Interspeech.2019-2169. wearable computing. In Proceedings of The 1998 Workshop On Perceptual User
Beltrán, J., Chávez, E., & Favela, J. (2015). Scalable identification of mixed Interfaces (pp. 1–6).
environmental sounds, recorded from heterogeneous sources. Pattern Recognition Courbariaux, M., Bengio, Y., & David, J. P. (2015). Binaryconnect: Training deep neural
Letters, 68, 153–160. networks with binary weights during propagations. In Proceedings of the 28th
Berland, A., Gaillard, P., Guidetti, M., & Barone, P. (2015). Perception of everyday International Conference on Neural Information Processing Systems, 2, 3123-3131.
sounds: a developmental study of a free sorting task. PLoS One, 10(2), Article Couvreur, C., Fontaine, V., Gaunard, P., & Mubikangiey, C. G. (1998). Automatic
e0115557. https://doi.org/10.1371/journal.pone.0115557. classification of environmental noise events by hidden Markov models. Applied
Bisot, V., Essid, S., & Richard, G. (2015). HOG and subband power distribution image Acoustics, 54(3), 187–206.
features for acoustic scene classification. In In 2015 23rd European Signal Processing Cramer, J., Wu, H. H., Salamon, J., & Bello, J. P. (2019). Look, listen, and learn more:
Conference (EUSIPCO) (pp. 719–723). https://doi.org/10.1109/ Design choices for deep audio embeddings. In 2019 IEEE International Conference on
EUSIPCO.2015.7362477 Acoustics, Speech and Signal Processing (ICASSP) (pp. 3852–3856). IEEE. https://doi.
Bisot, V., Serizel, R., Essid, S., & Richard, G. (2016). In Acoustic scene classification with org/10.1109/ICASSP.2019.8682475.
matrix factorization for unsupervised feature learning (pp. 6445–6449). IEEE. Cui, X., Goel, V., & Kingsbury, B. (2015). Data augmentation for deep neural network
Bisot, V., Serizel, R., Essid, S., & Richard, G. (2017a). Feature learning with matrix acoustic modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing,
factorization applied to acoustic scene classification. IEEE/ACM Transactions on 23(9), 1469–1477.
Audio, Speech, and Language Processing, 25(6), 1216–1229. Dai, W., Dai, C., Qu, S., Li, J., & Das, S. (2017). Very deep convolutional neural networks
Bisot, V., Serizel, R., Essid, S., & Richard, G. (2017b). Nonnegative feature learning for raw waveforms. In 2017 IEEE international conference on acoustics, speech and
methods for acoustic scene classification. In In Proc. DCASE 2017 Workshop (pp. signal processing (ICASSP) (pp. 421–425). IEEE. https://doi.org/10.1109/
1142–1158). ICASSP.2017.7952190.
Bregman, A. S. (1990). Auditory scene analysis: The perceptual organization of sound ((1st Dandashi, A., & AlJaam, J. (2017). A survey on audio content-based classification. In
ed.),, p. 1).). The MIT Press (Chapter. 2017 International conference on computational science and computational intelligence
Brezina, P., & Jeseniˇcov́ a, S. (2018). Sound recording technologies and music education. (CSCI) (pp. 408–413). IEEE. https://doi.org/10.1109/CSCI.2017.69.
Ad Alta: Journal of Interdisciplinary Research, 8(2), 13–18. Deng, Z., Zhou, K., Li, D., He, J., Song, Y. Z., & Xiang, T. (2022). Dynamic instance
Brown, A. L., Kang, J., & Gjestland, T. (2011). Towards standardization in soundscape domain adaptation. IEEE Transactions on Image Processing, 31, 4585–4597. https://
preference assessment. Applied acoustics, 72(6), 387–392. https://doi.org/10.1016/j. doi.org/10.1109/TIP.2022.3186531
apacoust.2011.01.001 Dennis, J., Tran, H. D., & Chng, E. S. (2013). Image feature representation of the subband
Brown, G. J., & Cooke, M. (1994). Computational auditory scene analysis. Computer power distribution for robust sound event classification. IEEE Transactions on Audio,
Speech & Language, 8(4), 297–336. Speech, and Language Processing, 21(2), 367–377.
Buciluǎ, C., Caruana, R., & Niculescu-Mizil, A. (2006). Model Compression. In In Devalraju, D. V., & Rajan, P. (2022). Multiview Embeddings for Soundscape
Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery Classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30,
and data mining (pp. 535–541). https://doi.org/10.1145/1150402.1150464 1197–1206.
Byttebier, L., Desplanques, B., Thienpondt, J., Song, S., Demuynck, K., & Madhu, N. Ding, B., Zhang, T., Liu, G., Kong, L., & Geng, Y. (2022). Late fusion for acoustic scene
(2021). Small-Footprint acoustic scene classification through 8-Bit Quantization- classification using swarm intelligence. Applied Acoustics, 192, Article 108698.
Aware training and pruning of ResNet models. Tech. Rep. DCASE2021 Challenge. https://doi.org/10.1016/j.apacoust.2022.108698
Cao, B., Wang, N., Li, J., & Gao, X. (2019). Data Augmentation-Based Joint Learning for Dong, X., Yan, Y., Tan, M., Yang, Y., & Tsang, I. W. (2018). Late fusion via subspace
Heterogeneous Face Recognition. IEEE Transactions on Neural Networks and Learning search with consistency preservation. IEEE Transactions on Image Processing, 28(1),
Systems, 30(6), 1731–1743. 518–528. https://doi.org/10.1109/TIP.2018.2867747
Carey, M. J., Parris, E. S., & Lloyd-Thomas, H. (1999). A comparison of features for Dorfer, M., Lehner, B., Eghbal-zadeh, H., Christop, H., Fabian, P., & Gerhard, W. (2018).
speech music discrimination. In 1999 IEEE International Conference on Acoustics, Acoustic scene classification with fully convolutional neural networks and i-vectors.
Speech, and Signal Processing (ICASSP) (pp. 149–152). https://doi.org/10.1109/ Tech. Rep. DCASE2018 Challenge.
ICASSP.1999.758084 Drossos, K., Magron, P., & Virtanen, T. (2019). Unsupervised adversarial domain
Caruana, R. (1998). Multitask Learning. Autonomous Agents and Multi-Agent Systems, 27 adaptation based on the wasserstein distance for acoustic scene classification. In
(1), 95–133.

26
B. Ding et al. Expert Systems With Applications 238 (2024) 121902

2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S.,
(WASPAA) (pp. 259–263). IEEE. https://doi.org/10.1109/WASPAA.2019.8937231. Courville, A. & Bengio, Y. (2014). Generative Adversarial Nets. In Proceedings of the
Drossos, K., Mimilakis, S. I., Gharib, S., Li, Y., & Virtanen, T. (2020). Sound event 27th International Conference on Neural Information Processing Systems, 2, 2672-2680.
detection with depthwise separable and dilated convolutions. In 2020 International Götz, P., Tuna, C., Walther, A., & Habets, E. A. (2023). Contrastive Representation
Joint Conference on Neural Networks (IJCNN) (pp. 1–7). IEEE. https://doi.org/ Learning for Acoustic Parameter Estimation. In 2023 IEEE International Conference on
10.1109/IJCNN48605.2020.9207532. Acoustics, Speech and Signal Processing (ICASSP) (pp. 1–5). IEEE. https://doi.org/
Droumeva, M. (2005). Understanding immersive audio: a historical and socio-cultural 10.1109/ICASSP49357.2023.10095279.
exploration of auditory displays. In International Conference on Auditory Display Gou, J., Yu, B., Maybank, S. J., & Tao, D. (2021). Knowledge Distillation: A Survey.
(ICAD) (pp. 162-168). International Journal of Computer Vision, 129(6), 1789–1819. https://doi.org/
Dubois, D., Guastavino, C., & Raimbault, M. (2006). A cognitive approach to urban 10.1007/s11263-021-01453-z
soundscapes: Using verbal data to access everyday life auditory categories. Acta Goyal, A., Shukla, S. K., & Sarin, R. K. (2019). Identification of source mobile hand sets
Acustica United with Acustica, 92(6), 865–874. using audio latency feature. Forensic Science International, 298, 332–335.
Dwyer, R. (1983). Detection of non-Gaussian signals by frequency domain Kurtosis Grollmisch, S., & Cano, E. (2021). Improving semi-supervised learning for audio
estimation. In IEEE International Conference on Acoustics, Speech, and Signal Processing classification with FixMatch. Electronics, 10(15), 1807.
(ICASSP) (pp. 607–610). IEEE. https://doi.org/10.1109/ICASSP.1983.1172264. Guastavino, C. (2007). Categorization of environmental sounds. Canadian Journal of
Ebbers, J., Keyser, M. C., & Haeb-Umbach, R. (2021). Adapting sound recognition to a Experimental Psychology/Revue canadienne de psychologie expérimentale, 61(1), 54–63.
new environment via self-training. In 2021 29th European Signal Processing https://psycnet.apa.org/doi/10.1037/cjep2007006.
Conference (EUSIPCO) (pp. 1135–1139). IEEE. Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural
Eghbal-Zadeh, H., Lehner, B., Dorfer, M., & Widmer, G. (2016). CP-JKU submissions for networks. In International conference on machine learning (ICML) (pp. 1321–1330).
DCASE-2016: a hybrid approach using binaural i-vectors and deep convolutional PMLR.
neural networks. Tech. Rep. DCASE2016 Challenge. Guo, M. H., Xu, T. X., Liu, J. J., Liu, Z. N., Jiang, P. T., Mu, T. J., Zhang, S. H.,
Eghbal-zadeh, H., Lehner, B., Dorfer, M., & Widmer, G. (2017). A hybrid approach with Martin, R. R., Cheng, M. M., & Hu, S. M. (2022). Attention mechanisms in computer
multi-channel i-vectors and convolutional neural networks for acoustic scene vision: A survey. Computational Visual Media, 8, 331–368.
classification. In 2017 25th European Signal Processing Conference (EUSIPCO) (pp. Gupta, S., Agrawal, A., Gopalakrishnan, K., & Narayanan, P. (2015). Deep learning with
2749–2753). IEEE. limited numerical precision. In International conference on machine learning (pp.
Ellis, D. P. W. (1996). Pediction-driven computational auditory scene analysis [Doctoral 1737–1746). PMLR.
dissertation]. Doctoral dissertation, Columbia University. https://doi.org/10.7916/ Guzhov, A., Raue, F., Hees, J., & Dengel, A. (2021). ESResNet: Environmental Sound
D84J0N13 Classification Based on Visual Domain Models. In 2020 25th International Conference
El-Maleh, K., Samouelian, A., & Kabal, P. (1999). Frame level noise classification in on Pattern Recognition (ICPR) (pp. 4933–4940). IEEE. https://doi.org/10.1109/
mobile environments. In 1999 IEEE International Conference on Acoustics, Speech, ICPR48806.2021.9413035.
and Signal Processing (ICASSP), 1, 237-240. IEEE, https://doi.org/10.1109/ Gygi, B., Kidd, G. R., & Watson, C. S. (2007). Similarity and categorization of
ICASSP.1999.758106. environmental sounds. Perception & Psychophysics, 69(6), 839–855. https://doi.org/
Eronen, A. J., Peltonen, V. T., Tuomi, J. T., Klapuri, A. P., Fagerlund, S., Sorsa, T., et al. 10.3758/BF03193921
(2006). Audio-based context recognition. IEEE Transactions on Audio, Speech, and Hajihashemi, V., Gharahbagh, A. A., Cruz, P. M., Ferreira, M. C., Machado, J. J., &
Language Processing, 14(1), 321–329. Tavares, J. M. R. (2022). Binaural Acoustic Scene Classification Using Wavelet
Eronen, A., Tuomi, J., Klapuri, A., Fagerlund, S., Sorsa, T., Lorho, G., et al. (2003). Audio- Scattering, Parallel Ensemble Classifiers and Nonlinear Fusion. Sensors, 22(4), Article
based context awareness—Acoustic modeling and perceptual evaluation. In 2003 1535.
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) (pp. Han, J., Matuszewski, M., Sikorski, O., Sung, H., & Cho, H. (2023). Randmasking
529–532). IEEE. https://doi.org/10.1109/ICASSP.2003.1200023. Augment: A Simple and Randomized Data Augmentation For Acoustic Scene
Commision, E. (2014). EAR-IT: Using sound to picture the world in a new way. Retrieved Classification. In 2023 IEEE International Conference on Acoustics, Speech and Signal
from https://digital-strategy.ec.europa.eu/en/news/ear-it-using-sound-picture- Processing (ICASSP) (pp. 1–5). IEEE. https://doi.org/10.1109/
world-new-way. Accessed December 1, 2022. ICASSP49357.2023.10095001.
Eyben, F., Wöllmer, M., & Schuller, B. (2010). OpenSMILE: The munich versatile and fast Han, S., Mao, H., & Dally, W. J. (2016). Deep compression: Compressing deep neural
open-source audio feature extractor. In Proceedings of the 18th ACM international networks with pruning, trained quantization and Huffman coding. In 4th
conference on Multimedia (pp. 1459-1462). International Conference on Learning Representations (ICLR).
Rakotomamonjy, A., & Gasso, G. (2015). Histogram of Gradients of Time-Frequency Simonyan, K., & Zisserman, A. (2015). Very deep convolutional networks for large-scale
Representations for Audio Scene Classification. IEEE/ACM Transactions on Audio, image recognition. Computational and Biological Learning Society.
Speech, and Language Processing, 23(1), 142–153. Singh, A. (2022). 1-D CNN based Acoustic Scene Classification via Reducing Layer-wise
Ren, Z., Kong, Q., Han, J., Plumbley, M. D., & Schuller, B. W. (2019). Attention-based Dimensionality. arXiv preprint arXiv: 2204.00555.
atrous convolutional neural networks: Visualisation and understanding perspectives Singh, A., Rajan, P., & Bhavsar, A. (2019) Deep Multi-View Features from Raw Audio for
of acoustic scenes. In 44th IEEE International Conference on Acoustics, Speech, and Acoustic Scene Classification. In Proc. DCASE2019 Workshop (pp. 229-233).
Signal Processing (ICASSP) (pp. 56–60). https://doi.org/10.1109/ Singh, A., Rajan, P., & Bhavsar, A. (2020). SVD-based redundancy removal in 1-D CNNs
ICASSP.2019.8683434 for acoustic scene classification. Pattern Recognition Letters, 131, 383–389.
Ren, Z., Kong, Q., Han, J., Plumbley, M. D., & Schuller, B. W. (2020). CAA-Net: Singh, A., Thakur, A., Rajan, P., & Bhavsar, A. (2018). A layer-wise score level ensemble
Conditional atrous CNNs with attention for explainable device-robust acoustic scene framework for acoustic scene classification. In 2018 26th European Signal Processing
classification. IEEE Transactions on Multimedia, 23, 4131–4142. Conference (EUSIPCO) (pp. 837–841).
Ren, Z., Kong, Q., Qian, K., Plumbley, M. D., & Schuller, B. W. (2018). Attention-based Singh, V. K., Sharma, K., & Sur, S. N. (2023). A survey on preprocessing and classification
convolutional neural networks for acoustic scene classification. In Proc. DCASE2018 techniques for acoustic scene. Expert Systems with Applications, 229. https://doi.org/
Workshop (pp. 39–43). 10.1016/j.eswa.2023.120520
Ren, Z., Pandit, V., Qian, K., Yang, Z., Zhang, Z., & Schuller, B. (2017). Deep sequential Song, H., Han, J., & Deng, S. (2018). A compact and discriminative feature based on
image features for acoustic scene classification. In Proc. DCASE2017 Workshop (pp. auditory summary statistics for acoustic scene classification. Proc. Interspeech,
113–117). 3294–3298. https://doi.org/10.21437/Interspeech.2018-1299.
Richard, G., Sundaram, S., & Narayanan, S. (2013). An Overview on Perceptually Steffens, J., Steele, D., & Guastavino, C. (2017). Situational and person-related factors
Motivated Audio Indexing and Classification. Proceedings of the IEEE, 101(9), influencing momentary and retrospective soundscape evaluations in day-to-day life.
1939–1954. The Journal of the Acoustical Society of America, 141(3), 1414–1425.
Roma, G., Nogueira, W., & Herrera, P. (2013). Recurrence quantification analysis Stiefelhagen, R., Bernardin, K., Bowers, R., Garofolo, J., Mostefa, D., & Soundararajan, P.
features for environmental sound recognition. In 2013 IEEE Workshop on Applications (2007). The CLEAR 2006 evaluation. In International evaluation workshop on
of Signal Processing to Audio and Acoustics (pp. 1–4). https://doi.org/10.1109/ classification of events, activities and relationships (pp. 1–44). Berlin Heidelberg:
WASPAA.2013.6701890 Springer. https://doi.org/10.1007/978-3-540-69568-4_1.
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv Stiefelhagen, R., Bernardin, K., Bowers, R., Rose, R. T., Michel, M., & Garofolo, J. (2008).
preprint arXiv:1706.05098. The CLEAR 2007 evaluation. In International Evaluation Workshop on Classification of
Sakashita, Y., & Aono, M. (2018). Acoustic scene classification by ensemble of Events, Activities and Relationships (pp. 3-34). Berlin, Heidelberg: Springer Berlin
spectrograms based on adaptive temporal divisions. Tech. Rep. DCASE2018 Heidelberg, https://doi.org/10.1007/978-3-540-68585-2_1.
Challenge. Stiefelhagen, R., Bowers, R., & Fiscus, J. (2007). Multimodal Technologies for Perception of
Salamon, J., & Bello, J. P. (2015). Unsupervised feature learning for urban sound Humans (1st ed.), Berlin, Heidelberg, Germany: Springer. https://doi.org/10.1007/
classification. In 2015 IEEE International Conference on Acoustics, Speech and Signal 978-3-540-68585-2.
Processing (ICASSP) (pp. 171–175). https://doi.org/10.1109/ICASSP.2015.7177954 Stowell, D., & Plumbley, M. (2014). An Open Dataset for Research on Audio Field
Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data Recording Archives: freefield1010. In 53rd AES International Conference 2014:
augmentation for environmental sound classification. IEEE Signal processing letters, 24 Semantic Audio (pp. 80-86).
(3), 279–283. Stowell, D., Giannoulis, D., Benetos, E., Lagrange, M., & Plumbley, M. D. (2015).
Salamon, J., Jacoby, C., & Bello, J. P. (2014). A dataset and taxonomy for urban sound Detection and classification of acoustic scenes and events. IEEE Transactions on
research. In Proceedings of the 22nd ACM international conference on Multimedia (pp. Multimedia, 17(10), 1733–1746.
1041–1044). https://doi.org/10.1145/2647868.2655045 Suh, S., Park, S., Jeong, Y., & Lee, T. (2020). Designing acoustic scene classification
Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., & Chen, L. (2018). MobileNetV2: models with CNN variants. In Tech. Rep. DCASE2020 Challenge Task1.
Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on Summers, C., & Dinneen, M. J. (2019). Improved mixed-example data augmentation. In
computer vision and pattern recognition (pp. 4510–4520). 2019 IEEE Winter Conference on Applications of Computer Vision (WACV) (pp.
Santana, L. E., Silva, L., Canuto, A. M., Pintro, F., & Vale, K. O. (2010). A comparative 1262–1270). IEEE. https://doi.org/10.1109/WACV.2019.00139.
analysis of genetic algorithm and ant colony optimization to select attributes for an Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the
heterogeneous ensemble of classifiers. In IEEE congress on evolutionary computation Inception Architecture for Computer Vision. In Proceedings of the IEEE conference on
(pp. 1–8). https://doi.org/10.1109/CEC.2010.5586080 computer vision and pattern recognition (pp. 2818-2826).
Sawhney, N., & Maes, P. (1997). Situational Awareness from Environmental Sounds. Takahashi, G., Yamada, T., Ono, N., & Makino, S. (2017). Performance evaluation of
Project Rep. for Pattie Maes, 1–7. acoustic scene classification using DNN-GMM and frame-concatenated acoustic
Schafer, R. M. (1977). The tuning of the world. New York, NY: Knopf. features. In 2017 Asia-Pacific Signal and Information Processing Association Annual
Schafer, R. M. (1993). The soundscape: Our sonic environment and the tuning of the world Summit and Conference (APSIPA ASC) (pp. 1739–1743). IEEE. https://doi.org/
(1st ed., pp. 1–11). Simon and Schuster (Chapter). 10.1109/APSIPA.2017.8282314.
Schmid, F., Masoudian, S., Koutini, K., & Widmer, G. (2022). CPJKU submission to Takeyama, S., Komatsu, T., Miyazaki, K., Togami, M., & Ono, S. (2021). Robust acoustic
dcase22: Distilling knowledge for lowcomplexity convolutional neural networks scene classification to multiple devices using maximum classifier discrepancy and
from a patchout audio transformer. Tech. Rep, DCASE2022 Challenge. knowledge distillation. In 2020 28th European Signal Processing Conference
Schmid, F., Morocutti, T., Masoudian, S., Koutini, K., & Widmer, G. (2023). CP-JKU (EUSIPCO) (pp. 36–40). IEEE.
Submission to Dcase23: Efficient Acoustic Scene Classification with Cp-Mobile, Tan, M., & Le, Q. (2019). EfficientNet: Rethinking Model Scaling for Convolutional
DCASE2023 Challenge. Tech. Rep, DCASE2023 Challenge. Neural Networks. In International conference on machine learning (ICML) (pp.
Schröder, J., Goetze, S., & Anemüller, J. (2015). Spectro-Temporal Gabor Filterbank 6105–6114). PMLR.
Features for Acoustic Event Detection. IEEE/ACM Transactions on Audio, Speech, and Tang, Z., Gao, Y., Karlinsky, L., Sattigeri, P., Feris, R., & Metaxas, D. (2020).
Language Processing, 23(12), 2198–2208. OnlineAugment: Online Data Augmentation with Less Domain Knowledge. In
Schröder, J., Moritz, N., Anemüller, J., Goetze, S., & Kollmeier, B. (2017). Classifier Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28,
architectures for acoustic scenes and events: Implications for DNNs, TDNNs, and 2020, Proceedings, Part VII 16 (pp. 313-329). Springer International Publishing.
perceptual features from DCASE 2016. IEEE/ACM Transactions on Audio, Speech, and Tardieu, J., Susini, P., Poisson, F., Lazareff, P., & McAdams, S. (2008). Perceptual study
Language Processing, 25(6), 1304–1314. of soundscapes in train stations. Applied Acoustics, 69(12), 1224–1239.
Schröder, J., Moritz, N., Schädler, M. R., Cauchi, B., Adiloglu, K., Anemüller, J., et al. Temko, A., Malkin, R., Zieger, C., Macho, D., Nadeu, C., & Omologo, M. (2007). CLEAR
(2013). On the use of spectro-temporal features for the IEEE AASP challenge evaluation of acoustic event detection and classification systems. In International
‘detection and classification of acoustic scenes and events’. In 2013 IEEE Workshop Evaluation Workshop on Classification of Events, Activities and Relationships (pp.
on Applications of Signal Processing to Audio and Acoustics (pp. 1–4). https://doi.org/ 311–322). Berlin, Heidelberg: Springer.
10.1109/WASPAA.2013.6701868 Thiemann, J., Ito, N., & Vincent, E. (2013). The Diverse Environments Multi-channel
Seo, H., Park, J., & Park, Y. (2019). Acoustic scene classification using various pre- Acoustic Noise Database (DEMAND): A database of multichannel environmental
processed features and convolutional neural networks. In Proc. DCASE2019 noise recordings. In Proceedings of Meetings on Acoustics ICA2013. Acoustical Society of
Workshop (pp. 3–6). America, 19(1), Article 035081.
Seresht, H. R., & Mohammadi, K. (2023). Environmental Sound Classification With Low- Tokozume, Y., Ushiku, Y., & Harada, T. (2018). Learning from between-class examples
Complexity Convolutional Neural Network Empowered by Sparse Salient Region for deep sound recognition. In International Conference on Learning Representations
Pooling. IEEE Access, 11, 849–862. https://doi.org/10.1109/ACCESS.2022.3232807 (ICLR) (pp. 1-13).
Shen, Y. L., Huang, C. Y., Wang, S. S., Tsao, Y., Wang, H. M., & Chi, T. S. (2019). Tripathi, A. M., & Pandey, O. J. (2023). Divide and Distill: New Outlooks on Knowledge
Reinforcement Learning Based Speech Enhancement for Robust Speech Recognition. Distillation for Environmental Sound Classification. IEEE/ACM Transactions on
In 2019 IEEE International Conference on Acoustics, Speech and Signal Processing Audio, Speech, and Language Processing, 31, 1100–1113. https://doi.org/10.1109/
(ICASSP) (pp. 6750–6754). https://doi.org/10.1109/ICASSP.2019.8683648 TASLP.2023.3244507
Shim, H. J., Jung, J. W., Kim, J. H., & Yu, H. J. (2022). Attentive max feature map and Tripathi, A. M., & Mishra, A. (2021). Self-supervised learning for Environmental Sound
joint training for acoustic scene classification. In 2022 IEEE International Conference Classification. Applied Acoustics, 2021(182), Article 108183. https://doi.org/
on Acoustics, Speech and Signal Processing (ICASSP) (pp. 1036–1040). https://doi.org/ 10.1016/j.apacoust.2021.108183
10.1109/ICASSP43922.2022.9746091 Truax, B. (2001). Acoustic Communication (1st ed.). Greenwood Publishing Group
Sigtia, S., Stark, A. M., Krstulović, S., & Plumbley, M. D. (2016). Automatic (Chapter 1).
environmental sound recognition: Performance versus computational cost. IEEE/ Tsalera, E., Papadakis, A., & Samarakou, M. (2021). Comparison of pre-trained CNNs for
ACM Transactions on Audio, Speech, and Language Processing, 24(11), 2096–2107. audio classification using transfer learning. Journal of Sensor and Actuator.
Networks, 10(4), Article 72.
