Chunxi Wang¹, Maoshen Jia¹, Meiran Li¹, Changchun Bao¹, Wenyu Jin²
¹ Speech and Audio Signal Processing Laboratory, Faculty of Information Technology, Beijing University of Technology, Beijing, China
² AcousticDSP Consulting LLC, St Paul, MN, United States
arXiv:2309.13504v3 [eess.AS] 27 Dec 2023
2.1. Featurization
Before being fed into the neural network, noisy speech signals go through a featurization process to obtain a two-dimensional time-frequency representation. Various extracted features are integrated into the feature block for model training, aiming to effectively capture information about the acoustic space.
Similar to prior literature [11, 12], the Gammatone ERB filterbank is selected for generating the time-frequency representation, as it leads to low model complexity while preserving signal information that is relevant to this problem. Specifically, the Gammatone filterbank consists of 20 bands covering the range from 50 Hz to 2000 Hz. The audio is windowed with a Hann window of 64 samples and a hop size of 32 samples, followed by convolution with the filterbank. This convolution generates a spectral feature (20 × 1997). Additionally, the phase information of the audio is retained. The phase angles computed for each time-frequency bin are used to generate a phase feature, which is then truncated to include only the bands corresponding to frequencies below 500 Hz (i.e. 5 × 1997), as lower-frequency behavior generally carries more information about the room volume [6]. Furthermore, the first-order derivative of the phase coefficients along the frequency axis is also concatenated (i.e. 5 × 1997). The configuration of this feature set aligns with the "+Phase" model outlined in [16], which is shown to outperform other methods that rely solely on magnitude-based spectral features.

Overall, the proposed feature block has dimensions of 30 × 1997, where 30 represents the feature dimension F and 1997 represents the time dimension T.

Fig. 1: Proposed system architecture with the featurization process.

2.2. Model Architecture

While the attention-based model demonstrates impressive performance in audio classification tasks, its application in other domains, especially regression-related problems, remains unexplored. In this section, we propose a purely attention-based model following the Audio Spectrogram Transformer work in [21] for conducting blind room volume estimation tasks.

2.2.1. Audio Spectrogram Transformer

In the proposed system, as shown in Fig. 1, in order to better leverage the local information of the audio, feature blocks are divided into P patches, each of size 16 × 16. During this division, to maintain continuity in both the feature dimension and the time dimension, each patch overlaps its neighbors by 6 feature dimensions and 6 time dimensions (i.e. a stride of 10 in both directions). Consequently, the number of patches is determined as P = ⌈(F − 16)/10⌉ × ⌈(T − 16)/10⌉ = 398. For further processing of these patches, a linear projection layer is introduced. This layer flattens each 16 × 16 patch into a one-dimensional patch embedding of size 768, and is referred to as the patch embedding layer.

Because traditional Transformer architectures do not directly model the sequential order of input sequences, and these patches are not arranged in chronological order, trainable positional embeddings (which also have a dimension of 768) are incorporated into each patch embedding. This incorporation allows the model to grasp the spatial structure of the audio spectrogram and understand the positional relationships among different patches.

Similar to [21], this paper also prepends a [CLS] token to the beginning of the sequence and feeds the feature sequence into the Transformer. In the proposed system, encoding and feature extraction of the input sequence are achieved by utilizing only the encoder part of the original Transformer architecture [22]. We adjust the input and output dimensions of the encoder: the input is a sequence formed from a feature block of size 30 × 1997, while the output is a single label used for volume prediction. The output of the whole Transformer is then used as the feature representation of the two-dimensional audio feature block, which is subsequently mapped to labels for volume estimation through a linear layer with sigmoid activation.

2.2.2. ImageNet Pretraining

Compared to methods based on CNN architectures, one disadvantage of Transformer methods lies in their increased demand for training data [23]. One of the main challenges in blind room volume estimation is insufficient data, as publicly available RIR datasets with properly labelled room volume ground truth are highly limited. To alleviate this issue, we took the following two measures: 1) a synthetic RIR dataset based on the image-source model (which will be covered in Sec. 3.1), and 2) transfer learning.

More specifically, cross-modality transfer learning was applied to the proposed Transformer-based model. In this context, we leveraged a pretrained off-the-shelf Vision Transformer (ViT) model from ImageNet [24] for application within the proposed method. Prior to the transfer, necessary adjustments were made to ensure ViT's compatibility with the proposed architecture. Firstly, the ViT model was pretrained on three-channel images, which is distinct from the single-channel feature blocks used in the proposed model. Therefore, we calculated the average of the parameters across the three channels of the ViT model and then applied this averaged information. In addition, the so-called "Cut and bi-linear interpolate" method [21] was used to adjust the input size and manage positional encoding. Lastly, to adapt ViT for the task of blind room volume estimation, the final classification layer of ViT was reinitialized.
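The assembly of the 30 × 1997 feature block described in Sec. 2.1 can be sketched as follows. This is a minimal dimension-level illustration, assuming the Gammatone magnitude and phase matrices (`mag` and `phase`, both 20 × T) have already been computed by the filterbank analysis; the function name and the use of NumPy are illustrative, not part of the paper.

```python
import numpy as np

def build_feature_block(mag, phase, n_phase_bands=5):
    """Stack magnitude, low-band phase, and phase-derivative features.

    mag, phase: (20, T) arrays from the Gammatone analysis
    (20 ERB bands over 50-2000 Hz; 64-sample Hann window, hop 32).
    Only the lowest bands (below 500 Hz) of the phase are kept.
    """
    low_phase = phase[:n_phase_bands]                # 5 x T
    # first-order derivative of phase along the frequency axis
    dphase = np.diff(phase, axis=0)[:n_phase_bands]  # 5 x T
    return np.concatenate([mag, low_phase, dphase])  # 30 x T

# Illustrative placeholder inputs with the paper's dimensions
T = 1997
mag = np.random.randn(20, T)
phase = np.random.randn(20, T)
block = build_feature_block(mag, phase)
print(block.shape)  # (30, 1997)
```

The 20 + 5 + 5 stacking reproduces the feature dimension F = 30 stated in the text.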
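The patch count in Sec. 2.2.1 can be checked numerically. This sketch assumes the ceiling-based reading of the formula, P = ⌈(F − 16)/10⌉ × ⌈(T − 16)/10⌉, where the stride of 10 follows from the 16 × 16 patch size minus the overlap of 6; it reproduces the stated P = 398.

```python
import math

def num_patches(F, T, patch=16, stride=10):
    """Number of overlapping patch positions over an F x T feature block,
    with patch size 16 and stride 10 (overlap 6) along both axes."""
    per_f = math.ceil((F - patch) / stride)  # along the feature axis
    per_t = math.ceil((T - patch) / stride)  # along the time axis
    return per_f * per_t

print(num_patches(30, 1997))  # 398  (= 2 x 199)
```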
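The channel-averaging adjustment in Sec. 2.2.2 amounts to averaging the pretrained patch-embedding weights over the three RGB input channels so the projection accepts single-channel feature blocks. A minimal sketch with NumPy; the weight shape (768, 3, 16, 16) assumes the standard ViT-Base patch-embedding convolution, and the variable names are hypothetical.

```python
import numpy as np

# Stand-in for pretrained ViT patch-embedding weights:
# (embed_dim, in_channels, patch_h, patch_w)
w_rgb = np.random.randn(768, 3, 16, 16)

# Average across the three input channels, keeping the channel axis,
# so the layer now maps single-channel 16x16 patches to 768-dim embeddings.
w_mono = w_rgb.mean(axis=1, keepdims=True)
print(w_mono.shape)  # (768, 1, 16, 16)
```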
Fig. 2: The histogram illustrates the distribution of RIRs across various datasets based on their labelled volumes.

Fig. 3: Augmentation schemes applied to reverberant speech signals.