Chunxi Wang¹, Maoshen Jia¹, Meiran Li¹, Changchun Bao¹, Wenyu Jin²
¹ Speech and Audio Signal Processing Laboratory, Faculty of Information Technology, Beijing University of Technology, Beijing, China
² AcousticDSP Consulting LLC, St Paul, MN, United States
arXiv:2309.13504v3 [eess.AS] 27 Dec 2023
2.1. Featurization
Before being fed into the neural network, noisy speech signals go through a featurization process to obtain a two-dimensional time-frequency representation. Various extracted features are integrated into the feature block for model training, aiming to effectively capture information about the acoustic space.
Similar to prior literature [11, 12], the Gammatone ERB filterbank is selected for generating the time-frequency representation, as it leads to low model complexity while preserving signal information that is relevant to this problem. Specifically, the Gammatone filterbank consists of 20 bands covering the range from 50 Hz to 2000 Hz. The audio is windowed with a Hann window of 64 samples and a hop size of 32 samples, followed by convolution with the filterbank. This convolution generates a spectral feature (20 × 1997). Additionally, the phase information of the audio is retained. The phase angles computed for each time-frequency bin are used to generate a phase feature, which is then truncated to include only the bands corresponding to frequencies below 500 Hz (i.e. 5 × 1997), as lower-frequency behavior generally carries more information about the room volume [6]. Furthermore, the first-order derivative of the phase coefficients along the frequency axis is also concatenated (i.e. 5 × 1997). The configuration of this feature set aligns with the "+Phase" model outlined in [16], which is shown to outperform other methods that rely solely on magnitude-based spectral features.

Overall, the proposed feature block has dimensions of 30 × 1997, where 30 represents the feature dimension F and 1997 represents the time dimension T.

Fig. 1: Proposed system architecture with the featurization process.

2.2. Model Architecture

While the attention-based model demonstrates impressive performance in audio classification tasks, its application in other domains, especially regression-related problems, remains unexplored. In this section, we propose a purely attention-based model following the Audio Spectrogram Transformer work in [21] for conducting blind room volume estimation tasks.

2.2.1. Audio Spectrogram Transformer

In the proposed system, as shown in Fig. 1, in order to better leverage the local information of the audio, feature blocks are divided into P patches, each of size 16 × 16. During this division, to maintain continuity in both the feature dimension and the time dimension, each patch overlaps its neighbors by 6 feature dimensions and 6 time dimensions (i.e. a stride of 10 in both directions). Consequently, the number of patches is determined as P = ⌈(F − 16)/10⌉ × ⌈(T − 16)/10⌉ = 398. For further processing of these patches, a linear projection layer is introduced. This layer flattens each 16 × 16 patch into a one-dimensional patch embedding of size 768, and is referred to as the patch embedding layer.

Because traditional Transformer architectures do not directly model the sequential order of input sequences, and these patches are not arranged in chronological order, trainable positional embeddings (which also have a dimension of 768) are incorporated into each patch embedding. This incorporation allows the model to grasp the spatial structure of the audio spectrogram and understand the positional relationships among different patches.

Similar to [21], this paper also prepends a [CLS] token to the beginning of the sequence and feeds the feature sequence into the Transformer. In the proposed system, encoding and feature extraction of the input sequence are achieved by utilizing only the encoder part of the original Transformer architecture [22]. We adjust the input and output dimensions of the encoder: the input is a sequence formed from a feature block of size 30 × 1997, while the output is a single label used for volume prediction. The output of the whole Transformer is then used as the feature representation of the two-dimensional audio feature block, which is subsequently mapped to labels for volume estimation through a linear layer with sigmoid activation.

2.2.2. ImageNet Pretraining

Compared to methods based on CNN architectures, one disadvantage of Transformer methods lies in their increased demand for training data [23]. One of the main challenges in blind room volume estimation is insufficient data, as publicly available RIR datasets with properly labelled room volume ground truth are highly limited. To alleviate this issue, we took the following two measures: 1) a synthetic RIR dataset based on the image-source model (which will be covered in Sec. 3.1), and 2) transfer learning.

More specifically, cross-modality transfer learning was applied to the proposed Transformer-based model. In this context, we leveraged a pretrained off-the-shelf Vision Transformer (ViT) model from ImageNet [24] for application within the proposed method. Prior to the transfer, necessary adjustments were made to ensure ViT's compatibility with the proposed architecture. Firstly, the ViT model was pretrained on three-channel images, which is distinct from the single-channel feature blocks used in the proposed model. Therefore, we calculated the average of the parameters across the three channels of the ViT model and then applied this averaged information. In addition, the so-called "Cut and bi-linear interpolate" method [21] was used to adjust the input size and manage positional encoding. Lastly, to adapt ViT for the task of blind room volume estimation, the final classification layer of ViT was reinitialized.
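The assembly of the 30 × 1997 feature block described in Sec. 2.1 can be sketched as follows. This is a minimal dimension-level illustration, assuming the Gammatone magnitude and phase matrices (`mag` and `phase`, both 20 × T) have already been computed by the filterbank analysis; the function name and the use of NumPy are illustrative, not part of the paper.

```python
import numpy as np

def build_feature_block(mag, phase, n_phase_bands=5):
    """Stack magnitude, low-band phase, and phase-derivative features.

    mag, phase: (20, T) arrays from the Gammatone analysis
    (20 ERB bands over 50-2000 Hz; 64-sample Hann window, hop 32).
    Only the lowest bands (below 500 Hz) of the phase are kept.
    """
    low_phase = phase[:n_phase_bands]                # 5 x T
    # first-order derivative of phase along the frequency axis
    dphase = np.diff(phase, axis=0)[:n_phase_bands]  # 5 x T
    return np.concatenate([mag, low_phase, dphase])  # 30 x T

# Illustrative placeholder inputs with the paper's dimensions
T = 1997
mag = np.random.randn(20, T)
phase = np.random.randn(20, T)
block = build_feature_block(mag, phase)
print(block.shape)  # (30, 1997)
```

The 20 + 5 + 5 stacking reproduces the feature dimension F = 30 stated in the text.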
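The patch count in Sec. 2.2.1 can be checked numerically. This sketch assumes the ceiling-based reading of the formula, P = ⌈(F − 16)/10⌉ × ⌈(T − 16)/10⌉, where the stride of 10 follows from the 16 × 16 patch size minus the overlap of 6; it reproduces the stated P = 398.

```python
import math

def num_patches(F, T, patch=16, stride=10):
    """Number of overlapping patch positions over an F x T feature block,
    with patch size 16 and stride 10 (overlap 6) along both axes."""
    per_f = math.ceil((F - patch) / stride)  # along the feature axis
    per_t = math.ceil((T - patch) / stride)  # along the time axis
    return per_f * per_t

print(num_patches(30, 1997))  # 398  (= 2 x 199)
```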
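The channel-averaging adjustment in Sec. 2.2.2 amounts to averaging the pretrained patch-embedding weights over the three RGB input channels so the projection accepts single-channel feature blocks. A minimal sketch with NumPy; the weight shape (768, 3, 16, 16) assumes the standard ViT-Base patch-embedding convolution, and the variable names are hypothetical.

```python
import numpy as np

# Stand-in for pretrained ViT patch-embedding weights:
# (embed_dim, in_channels, patch_h, patch_w)
w_rgb = np.random.randn(768, 3, 16, 16)

# Average across the three input channels, keeping the channel axis,
# so the layer now maps single-channel 16x16 patches to 768-dim embeddings.
w_mono = w_rgb.mean(axis=1, keepdims=True)
print(w_mono.shape)  # (768, 1, 16, 16)
```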
Fig. 2: The histogram illustrates the distribution of RIRs across various datasets based on their labelled volumes.

Fig. 3: Augmentation schemes applied to reverberant speech signals.