
MS-SincResNet: Joint learning of 1D and 2D kernels using

multi-scale SincNet and ResNet for music genre classification


Pei-Chun Chang
Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
pcchang.cs05@nycu.edu.tw

Yong-Sheng Chen
Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
yschen@nycu.edu.tw

Chang-Hsing Lee
Department of Computer Science and Information Engineering, Chung Hua University, Hsinchu, Taiwan
chlee@chu.edu.tw
arXiv:2109.08910v1 [cs.SD] 18 Sep 2021

ABSTRACT
In this study, we propose a new end-to-end convolutional neural network, called MS-SincResNet, for music genre classification. MS-SincResNet appends a 1D multi-scale SincNet (MS-SincNet) to a 2D ResNet as the first convolutional layer in an attempt to jointly learn 1D kernels and 2D kernels during the training stage. First, an input music signal is divided into a number of fixed-duration (3 seconds in this study) music clips, and the raw waveform of each music clip is fed into the 1D MS-SincNet filter learning module to obtain three-channel 2D representations. The learned representations carry rich timbral, harmonic, and percussive characteristics compared with spectrograms, harmonic spectrograms, percussive spectrograms, and Mel-spectrograms. ResNet is then used to extract discriminative embeddings from these 2D representations. The spatial pyramid pooling (SPP) module is further used to enhance the feature discriminability, in terms of both the time and frequency aspects, to obtain the classification label of each music clip. Finally, a voting strategy is applied to summarize the classification results from all 3-second music clips. In our experimental results, we demonstrate that the proposed MS-SincResNet outperforms the baseline SincNet and many well-known hand-crafted features. Considering individual 2D representations, MS-SincResNet also yields competitive results with the state-of-the-art methods on the GTZAN dataset and the ISMIR2004 dataset. The code is available at https://github.com/PeiChunChang/MS-SincResNet.

CCS CONCEPTS
• Computing methodologies → Artificial intelligence; Neural networks; Learning latent representations.

KEYWORDS
music genre classification, convolutional neural networks, SincNet, ResNet

ACM Reference Format:
Pei-Chun Chang, Yong-Sheng Chen, and Chang-Hsing Lee. 2021. MS-SincResNet: Joint learning of 1D and 2D kernels using multi-scale SincNet and ResNet for music genre classification. In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR '21), August 21–24, 2021, Taipei, Taiwan. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3460426.3463619

1 INTRODUCTION
Automatic music genre classification (MGC) is an important task for multimedia retrieval systems. The task of MGC is to assign a proper music genre/type to a music signal. Traditional MGC systems typically consist of two main stages: feature extraction and classification. First, some discriminative features are extracted from the input music signal, and then a classifier is used to get the music genre label according to the extracted features. In general, a short-term representation describing the timbral characteristics of music signals is first extracted from every short time window (or frame). The most well-known timbral features include zero crossing rate (ZCR) [14], short time energy [32], spectral centroid/rolloff/flux [41], Mel-frequency cepstral coefficients (MFCC) [50], linear prediction coefficients (LPC) [48], discrete wavelet transform (DWT) coefficients [29], octave-based spectral contrast (OSC) coefficients [19], MPEG-7 normalized audio spectrum envelope (NASE) [22], etc. Then, several short-term features extracted from consecutive frames are aggregated to form long-term features representing the whole music signal. The most widely used approaches to aggregating short-term features include statistical moments [28, 35, 50], entropy or correlation [35], nonlinear time series analysis [35], autoregressive (AR) or multivariate autoregressive (MAR) models [33], modulation spectral analysis [25, 26, 28, 35], the bag of words (BoW) model [44, 52], the vector of locally aggregated descriptors (VLAD) [31, 34], etc. Given the extracted features representing the music signal, a number of supervised or unsupervised classification approaches have been developed for music genre classification [4, 14, 41, 50], including support vector machines (SVM) [18, 24], Gaussian mixture models (GMM) [21], principal component analysis (PCA) [20], linear discriminant analysis (LDA) [27], etc.

In recent years, with the remarkable success of deep learning techniques in computer vision applications, deep neural networks (DNNs) have also shown great success in speech/music classification or recognition tasks, such as speaker recognition [36, 43], music genre classification [6, 39], and speech emotion recognition [49]. In these tasks, deep learning provides a new way to extract discriminative embeddings from well-known hand-crafted acoustic features, such as i-vector content, for classification/recognition purposes. To this end, deep learning methods based on convolutional neural networks (CNNs) are the most widely used approach to obtain embeddings from such content, for example MFCC [46, 47, 53], OSC coefficients [54], or 2D representations like the audio spectrogram or chromagram [6, 39].

Figure 1: The proposed MS-SincResNet architecture for music genre classification (3-second raw waveform → multi-scale SincNet → ResNet → spatial pyramid pooling → two FC layers → estimated result). The audio waveform is first resampled to 16 kHz and divided into several overlapped 3-second music clips, and each clip is fed into the proposed network to get its classification label. In the 1D kernel learning stage, the multi-scale SincNet is designed to learn the 2D representations. Next, the learned 2D representations are fed into the 2D kernel learning module (ResNet-18 in this study) to extract embeddings across the time and frequency aspects simultaneously. Then, the spatial pyramid pooling module is used to obtain compact features from the output of the last convolutional layer (i.e., conv5_2) of the ResNet. Finally, two fully-connected layers are applied to obtain the music genre label. In this study, the classification results of all 3-second music clips segmented from the same song are summarized by the voting strategy to get the final classification label.

Bisharad et al. proposed a music genre classification system using a residual neural network (ResNet) based model [8]. Specifically, ResNet-18 is used to extract time-frequency features from the Mel-spectrogram of each 3-second music clip. By taking advantage of the recurrent neural network (RNN) for sequential data analysis, they also proposed a CNN with a gated recurrent unit (GRU) for music genre recognition [7]. First, they apply a CNN on the Mel-spectrogram to get the embedding of each 3-second music clip, and then apply an RNN on the successive time-aligned embeddings for music genre classification. Their experiments have shown that Mel-spectrograms are capable of providing consistent performance on the GTZAN and the MagnaTagATune datasets.

Ng et al. [39] proposed the FusionNet to combine the classification results obtained from a set of hand-crafted features, including timbre, rhythm, Mel-spectrogram, constant-Q spectrogram [17], harmonic spectrogram [12], percussive spectrogram [12], scatter transform spectrogram [1], and transfer feature [10]. They fed each feature into an individual feature coding network with NetVLAD [2] and self-attention to obtain the classification results. Finally, they tried all possible combinations among these 8 features using the sum rule to report the highest testing accuracy.

Rather than applying a 2D CNN on various hand-crafted spectrograms, some researchers try to directly apply a 1D CNN on the waveform of the speech/music signal to learn acoustic features [23, 40]. In these standard 1D CNN architectures, the learnable parameters are the kernel/filter coefficients. Typically, kernels with a large number of coefficients are needed in order to effectively characterize the timbral/rhythmic properties of the music signal, which often incurs a large computational cost during the training process. To reduce the number of learnable parameters, Ravanelli et al. proposed a new architecture, called SincNet, in which a set of SincNet filters is appended to the CNN structure as the first convolutional layer [43]. In fact, the SincNet filters are the inverse Fourier transform of rectangular (ideal) band-pass filters parameterized by the cut-off frequencies of Mel-scale band-pass filters. That is, these SincNet filters can be viewed as 1D kernels used for performing 1D convolutions on the raw waveform. Their experiments have shown that the learned SincNet filters can extract features like customized band-pass filters with faster convergence, fewer parameters, and interpretable kernels.

As stated above, the main stream of MGC approaches typically uses a 2D CNN with hand-crafted 2D representations (for example, spectrogram, Mel-spectrogram, harmonic spectrogram, percussive spectrogram, etc.) as input. We conjecture that if the input 2D representations can be learned from the training data, a better classification accuracy can be obtained. Based on this idea, we propose a new end-to-end CNN architecture, called MS-SincResNet, which can jointly learn both 1D kernels and 2D kernels for the MGC task. In the proposed network architecture, 1D multi-scale SincNet (MS-SincNet) filters are appended to the 2D ResNet structure as the first convolutional layer. Given a 1D raw waveform as input, MS-SincNet tries to learn variant 2D representations having different frequency resolutions. These learned 2D representations are stacked and then fed into the 2D ResNet, followed by a spatial pyramid pooling (SPP) module, to extract discriminative features for music genre classification. Our main contributions can be summarized as follows: (1) a multi-scale SincNet (MS-SincNet) is designed to learn 2D representations from the 1D raw waveform signal; (2) a new network architecture, called MS-SincResNet, which can jointly learn 1D kernels and 2D kernels, is proposed for the MGC purpose.

2 THE PROPOSED METHOD
As shown in Fig. 1, an end-to-end CNN architecture is designed to jointly learn 1D kernels and 2D kernels for music genre classification. First, each input music signal is resampled to 16 kHz and divided into fixed-duration (3 seconds) music clips with a hop size of 0.5 seconds. Each music clip is then fed into the proposed CNN model for the MGC purpose. The classification results of all music clips are summarized by using a voting strategy to get the classification label for the input music signal. For each music clip, we first exploit the multi-scale SincNet (MS-SincNet) with different kernel lengths to learn variant 2D representations. Then, the 2D ResNet and SPP module are used to extract discriminative features, followed by two fully-connected (FC) layers to obtain the music genre label from the learned 2D representations. Before describing the proposed method, we first give a short review of the original SincNet architecture proposed by Ravanelli et al. [43].
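As a brief illustration of the clip-level voting described above, the following is a minimal sketch (our own illustrative code, not taken from the released implementation) that aggregates per-clip predictions into a song-level label by majority vote.

```python
import torch

def vote_song_label(clip_logits):
    """Majority vote over per-clip predictions: `clip_logits` is assumed to be a
    (n_clips, n_genres) tensor of network outputs for one song."""
    clip_labels = clip_logits.argmax(dim=1)        # predicted genre per 3-second clip
    return torch.mode(clip_labels).values.item()   # most frequent genre wins
```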
2.1 SincNet
SincNet tries to discover interpretable and meaningful filters by introducing an additional 1D convolutional layer realized by sinc functions, followed by standard CNN layers. In general, it is straightforward to use rectangular (ideal) band-pass filters to decompose a signal into a number of frequency bands in the frequency domain. In fact, the frequency response of a band-pass filter can be written as the difference of two rectangular low-pass filters:

$$G_{f_1,f_2}(f) = \mathrm{rect}\left(\frac{f}{2f_2}\right) - \mathrm{rect}\left(\frac{f}{2f_1}\right) \quad (1)$$

where f1 and f2 (f2 > f1) are the low and high cut-off frequencies, and rect(·) is the frequency response of the rectangular low-pass filter defined as follows:

$$\mathrm{rect}(x) = \begin{cases} 0, & \text{if } |x| > 0.5, \\ 0.5, & \text{if } |x| = 0.5, \\ 1, & \text{if } |x| < 0.5. \end{cases} \quad (2)$$

By performing the inverse Fourier transform on the filter function G, we can get the impulse response of the filter, represented by the sinc function:

$$g_{f_1,f_2}[n] = 2f_2\,\mathrm{sinc}(2\pi f_2 n) - 2f_1\,\mathrm{sinc}(2\pi f_1 n), \quad n = 1, 2, \ldots, L, \quad (3)$$

where L is the filter length and the sinc function is defined as sinc(x) = sin(x)/x. In general, the sinc function is multiplied by a window function to smooth out the abrupt discontinuities of the sinc function:

$$g^{w}_{f_1,f_2}[n] = g_{f_1,f_2}[n] \cdot w[n] \quad (4)$$

In their study, the Hamming window is used, defined as follows:

$$w[n] = 0.54 - 0.46 \cdot \cos\left(\frac{2\pi n}{L}\right) \quad (5)$$

In the original SincNet architecture, the first convolution layer consists of 80 SincNet filters of length L=251 (i.e., each filter consists of 251 coefficients), followed by two standard convolutional layers with 60 filters of length 5. Layer normalization [5] was applied to the input raw waveform and to the output of each convolution layer. Finally, the classifier consists of three fully-connected layers having 2048 neurons, followed by batch normalization. To increase non-linearity, all the hidden layers are followed by a Leaky-ReLU activation function. The parameters of the SincNet filters were initialized with the cut-off frequencies derived from the Mel-scale decomposition.

In a standard 1D CNN [23, 40], the number of learnable parameters for each filter is L, the filter length. However, for each SincNet filter, only two parameters (f1 and f2), which represent the low and high cut-off frequencies of the band-pass filter, have to be learned during the training process. Therefore, compared with a standard 1D CNN, SincNet can achieve faster convergence using fewer parameters and provides interpretability of the neural network [43].
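To make Eqs. (3)-(5) concrete, the following is a minimal PyTorch-style sketch of how such a windowed sinc band-pass kernel bank can be constructed. The function names, the centred time index, and the illustrative initialization are our own choices under the stated assumptions; they are not taken from the authors' released code.

```python
import math
import torch

def sinc(x):
    # sinc(x) = sin(x) / x, with the removable singularity at x = 0 handled explicitly
    return torch.where(x == 0, torch.ones_like(x), torch.sin(x) / x)

def sinc_bandpass_bank(f1, f2, L):
    """Band-pass kernels of Eqs. (3)-(5), one kernel per (f1, f2) pair.
    f1 and f2 are assumed to be 1-D tensors of normalized cut-off frequencies
    (cycles per sample)."""
    n = torch.arange(L, dtype=torch.float32) - (L - 1) / 2   # centred time index, a common choice
    n = n.unsqueeze(0)                                       # (1, L)
    f1, f2 = f1.unsqueeze(1), f2.unsqueeze(1)                # (K, 1)
    g = 2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)        # Eq. (3)
    w = 0.54 - 0.46 * torch.cos(2 * math.pi * torch.arange(L, dtype=torch.float32) / L)  # Eq. (5)
    return g * w                                             # (K, L) windowed kernels, Eq. (4)

# Illustrative use: an 80-filter bank of length 251 (cut-offs here are placeholders,
# not the Mel-scale initialization used in the paper)
f1 = torch.linspace(30 / 16000, 0.4, 80)
f2 = f1 + 0.05
kernels = sinc_bandpass_bank(f1, f2, L=251)   # -> torch.Size([80, 251])
```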
Figure 2: Illustration of 1D kernel learning using the multi-scale SincNet filters. Three sets of SincNet filters with different filter lengths (L=251, 501, and 1001) are first used to learn three 2D representations having different frequency resolutions. Adaptive average pooling is then used to get compact 2D representations. Finally, we stack these compact 2D representations to obtain a three-channel image (3@160×1024) which is fed into ResNet for extracting embeddings.

Figure 3: Illustration of the spatial pyramid pooling module. The input is the 512@5×32 feature volume from the output of the last convolutional layer of ResNet-18. There are two branches that aggregate global and local features using average pooling on blocks of different sizes. After that, we flatten and concatenate these pooled values into a 2560-d compact feature vector.

2.2 Proposed Multi-scale SincResNet (MS-SincResNet)
Given the raw waveform of a music clip, the multi-scale SincNet (MS-SincNet), which consists of three sets of SincNet filters of different lengths, is first designed to learn 2D representations. The outputs of each set of SincNet filters are concatenated to form a 2D representation, and then all 2D representations are stacked and fed into the ResNet for extracting embeddings. Finally, a spatial pyramid pooling module is applied to summarize the information across the time and frequency aspects to obtain compact features for the MGC task. In this study, an end-to-end training strategy is used to fine-tune the parameters of the proposed MS-SincNet and ResNet. The parameters of MS-SincNet are initialized by Mel-scale cut-off frequencies, and the parameters of ResNet are initialized from a model pretrained on the ImageNet dataset.
2.2.1 Data preprocessing. In this study, the input music signal is first resampled to 16 kHz and divided into 3-second music clips with the hop size being 0.5 seconds. Hence, each clip consists of 48,000 samples. During the training process, two data augmentation methods, which will be described later, are applied to these music clips. Before being fed into the proposed MS-SincResNet for embedding extraction, each input audio waveform is normalized using the layer normalization operation [5].
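As a concrete illustration of this preprocessing step, the sketch below resamples a waveform and cuts it into overlapped clips; it is a minimal example under our own naming (the helper make_clips is hypothetical), not the authors' implementation.

```python
import torch
import torchaudio

def make_clips(waveform, orig_sr, target_sr=16000, clip_sec=3.0, hop_sec=0.5):
    """Resample a mono waveform to 16 kHz, cut it into overlapped 3-second clips
    with a 0.5-second hop, and layer-normalize each clip (Sec. 2.2.1 sketch)."""
    wav = torchaudio.functional.resample(waveform, orig_sr, target_sr)  # (1, T)
    clip_len = int(clip_sec * target_sr)        # 48,000 samples per clip
    hop_len = int(hop_sec * target_sr)          # 8,000-sample hop
    clips = wav.unfold(-1, clip_len, hop_len).squeeze(0)                # (n_clips, 48000)
    return torch.nn.functional.layer_norm(clips, (clip_len,))

# e.g. a 30-second GTZAN track resampled from 22,050 Hz yields 55 overlapped clips
```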
2.2.2 1D kernel learning stage. As shown in Fig. 2, the convolution operation in the 1D MS-SincNet filter learning stage can be represented as:

$$x^{s}_{k}[n] = x[n] * g^{s}_{k}[n], \quad k = 1, 2, \ldots, K, \; s = 1, 2, 3, \quad (6)$$

where K is the number of kernels, x[n] is an input music clip having N samples (N=48,000), and g^s_k[n] is the k-th convolutional kernel for scale s, called a SincNet filter in this study. To get a compact representation, we apply adaptive average pooling to each filter output (x^s_k[n] ∈ R^(1×N)) of (6) to obtain the 1024-d representation m^s_k[n] ∈ R^(1×1024):

$$m^{s}_{k}[n] = \mathrm{adaptiveAvgPool}(x^{s}_{k}[n]) \quad (7)$$

Then, for each scale, we concatenate the compact outputs of all kernels to get the corresponding 2D representation of the music clip, M^s ∈ R^(K×1024):

$$M^{s} = \begin{bmatrix} m^{s}_{1}[n] \\ m^{s}_{2}[n] \\ \vdots \\ m^{s}_{K}[n] \end{bmatrix} = \begin{bmatrix} m^{s}_{1}[1] & m^{s}_{1}[2] & \cdots & m^{s}_{1}[1024] \\ m^{s}_{2}[1] & m^{s}_{2}[2] & \cdots & m^{s}_{2}[1024] \\ \vdots & \vdots & \ddots & \vdots \\ m^{s}_{K}[1] & m^{s}_{K}[2] & \cdots & m^{s}_{K}[1024] \end{bmatrix} \quad (8)$$

In this study, the parameters (f1 and f2) of the SincNet filters were initialized using Mel-scale cut-off frequencies between [30, fs/2] Hz, where fs is the sampling frequency. Specifically, the values of the lower cut-off frequency f1 and the higher cut-off frequency f2 were initialized according to the Mel-scale cut-off frequencies. As a result, all the derived 2D representations M^s ∈ R^(K×1024) (s=1, 2, 3) can be considered as learned multi-scale Mel-spectrograms of the input music clip.

For each set of SincNet filters, 160 convolutional kernels (i.e., K=160) followed by 1D batch normalization and a ReLU non-linear activation function are applied to get the filter output. For MS-SincNet, three different kernel lengths (L = 251, 501, and 1001), corresponding to three scales (s=1, 2, 3), are applied to the input waveform to get three 2D representations. By stacking these 2D representations, we obtain a 3 × 160 × 1024 representation for each music clip.
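The sketch below puts Eqs. (6)-(8) together in PyTorch form: three sinc filter banks, adaptive average pooling to 1024 points, batch normalization, ReLU, and stacking into a three-channel map. It is a minimal sketch under the assumption that each bank module returns a (K, 1, L) kernel tensor (e.g., built as in the earlier sinc_bandpass_bank snippet); it is not the authors' exact module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSSincStage(nn.Module):
    """Sketch of the 1D kernel-learning stage (Eqs. (6)-(8))."""
    def __init__(self, banks, K=160):
        super().__init__()
        self.banks = nn.ModuleList(banks)                         # one bank per scale (L = 251, 501, 1001)
        self.bns = nn.ModuleList([nn.BatchNorm1d(K) for _ in banks])

    def forward(self, x):                                         # x: (B, 1, 48000) raw clips
        reps = []
        for bank, bn in zip(self.banks, self.bns):
            kernels = bank()                                      # assumed to return (K, 1, L) sinc kernels
            y = F.conv1d(x, kernels, padding=kernels.shape[-1] // 2)   # Eq. (6)
            y = F.adaptive_avg_pool1d(y, 1024)                    # Eq. (7): 1024-d output per kernel
            reps.append(F.relu(bn(y)))                            # one (B, K, 1024) map per scale, Eq. (8)
        return torch.stack(reps, dim=1)                           # (B, 3, K, 1024) three-channel representation
```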
2.2.3 2D kernel learning stage. ResNet is one of the most well-known backbone networks in deep neural networks [16]. Compared with prior network architectures, ResNet introduces shortcut connections to address the problem of vanishing gradients and extracts abundant semantics from the input data to build a robust classifier. In this paper, ResNet-18 pretrained on the ImageNet dataset is selected as our 2D kernel learning backbone network. We performed transfer learning on ResNet-18 to fine-tune the kernel parameters using music clips derived from the training set. The input to ResNet-18 is the three-channel 2D representation M^s (s=1, 2, 3) obtained from the 1D MS-SincNet learning stage, and the output of the last convolution layer (i.e., conv5_2) is a 512-channel 5 × 32 feature volume from which discriminative features will be extracted.

2.2.4 Spatial pyramid pooling. The spatial pyramid pooling (SPP) module is used to enhance the feature discriminability in terms of both the time and frequency aspects [15], as shown in Fig. 3. By performing global average pooling on each channel, and on each block obtained by dividing the channel into 2×2 blocks, we get the SPP features consisting of 512@1×1 (global features) and 512@2×2 (local features). Then, we flatten and concatenate all the feature values and feed them into two fully-connected layers to obtain the genre classification label.
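The following sketch illustrates these two stages together: a torchvision ResNet-18 stripped of its global pooling and classifier (so that a 3 × 160 × 1024 input yields the 512 × 5 × 32 feature volume mentioned above) followed by the SPP head producing the 2560-d vector (512 + 4 × 512). This is our own illustrative assembly, not the released model definition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

# ImageNet-pretrained ResNet-18 without its average pooling and fully-connected layers
backbone = nn.Sequential(*list(resnet18(pretrained=True).children())[:-2])

def spp_head(features):
    """SPP sketch: global (1x1) and local (2x2) average pooling over the
    512 x 5 x 32 volume, flattened and concatenated into a 2560-d vector."""
    g = F.adaptive_avg_pool2d(features, 1)                   # (B, 512, 1, 1) global features
    l = F.adaptive_avg_pool2d(features, 2)                   # (B, 512, 2, 2) local features
    return torch.cat([g.flatten(1), l.flatten(1)], dim=1)    # (B, 2560)

x = torch.randn(8, 3, 160, 1024)    # a batch of stacked 2D representations
feat = backbone(x)                  # -> (8, 512, 5, 32)
vec = spp_head(feat)                # -> (8, 2560); two FC layers would then give the genre logits
```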
2.3 Training strategy and data augmentation
In this study, the proposed MS-SincResNet architecture, including the 1D kernel and 2D kernel learning stages, is implemented with the PyTorch framework. A classification result is obtained for each music clip in the training stage, whereas in the testing stage the voting strategy is used to get the final classification label of the input music signal consisting of several music clips.

The SGD optimizer is used to tune both the 1D MS-SincNet parameters and the 2D ResNet parameters in the whole network model. In the training stage, a warm-up strategy with learning rate lr = 10^-5 is used for the first five epochs. After that, we set lr = 0.005 from the 6th epoch and halve it every 30 epochs.

To avoid overfitting, in each epoch we randomly select 4 music clips from each input music track to train the network. In addition, two data augmentation methods are used to enhance the variation of the training data. First, we multiply the amplitude of the music signal by a ratio randomly chosen within the interval [0.9, 1.1]. Second, we add zero-mean Gaussian noise (σ=0.02) to the signal.
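A minimal sketch of the augmentation and the learning-rate schedule described above is given below; the function names are ours and the schedule encodes our reading of the stated values (warm-up at 1e-5 for five epochs, then 0.005 halved every 30 epochs).

```python
import torch

def augment(clip, sigma=0.02):
    """Random amplitude scaling in [0.9, 1.1] plus additive zero-mean Gaussian noise."""
    gain = torch.empty(1).uniform_(0.9, 1.1)
    return clip * gain + sigma * torch.randn_like(clip)

def learning_rate(epoch):
    """Warm-up for the first five epochs, then 0.005 halved every 30 epochs."""
    if epoch < 5:
        return 1e-5
    return 0.005 * (0.5 ** ((epoch - 5) // 30))
```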
3 EXPERIMENTS
In this study, the GTZAN dataset [51] and the ISMIR2004 Audio Description Contest dataset [9] were used for performance evaluation. In this section, we give brief descriptions of these two datasets and then present the experimental results.

3.1 The GTZAN dataset
The GTZAN dataset consists of 1000 audio tracks covering ten music categories: Blues, Classical, Country, Disco, Hip Hop, Jazz, Metal, Popular, Reggae, and Rock. For each music genre, there are exactly 100 tracks, and all tracks were recorded at 22,050 Hz in mono 16-bit wav format. Similar to prior works, we used 10-fold cross-validation on the GTZAN dataset to evaluate the classification performance. For each fold, 900 tracks were randomly selected as the training set, and the remaining 100 tracks were used for testing. The performance is computed by averaging the classification results of these 10 folds.

3.2 The ISMIR2004 dataset
The ISMIR2004 dataset consists of 1458 music tracks, in which 729 music tracks are used for training and the others for testing. These music tracks are classified into six classes, including Classical, Electronic, Jazz/Blues, Metal/Punk, Rock/Pop, and World. The audio file format is 44.1 kHz, 16 bits per sample, stereo MP3. Among the 729 music tracks used for training, we randomly selected 1/10 as a validation set in an attempt to choose the best parameter set for evaluating the performance on the testing data set.

Figure 4: Visualization of different 2D representations: (a) spectrogram, (b) harmonic spectrogram, (c) percussive spectrogram, (d) Mel-spectrogram, and (e) 2D representations learned by the proposed MS-SincResNet (with filter lengths L=251, 501, and 1001).

Table 1: Ablation study of the proposed method with respect to SincNet filter initialization, filter number, filter length, and spatial pyramid pooling (SPP) on the GTZAN and ISMIR2004 datasets. The best classification accuracies on the GTZAN and ISMIR2004 datasets are 91.49% and 91.91%, respectively, obtained by MS-SincResNet with 160 learnable filters initialized with the Mel-scale decomposition and with the SPP module.

| Model | SincNet filter initialization | SincNet filter number (K) | Multi-scale SincNet filters | SincNet filter length (L) | SPP | GTZAN (%) | ISMIR2004 (%) |
|---|---|---|---|---|---|---|---|
| SincNet | Mel-scale | 80 | - | 251 | - | 73.98 | 81.34 |
| SincNet | Mel-scale | 80 | - | 501 | - | 76.37 | 79.70 |
| SincNet | Mel-scale | 80 | - | 1001 | - | 74.97 | 79.01 |
| SincNet | Mel-scale | 160 | - | 251 | - | 75.78 | 80.12 |
| SincNet | Mel-scale | 160 | - | 501 | - | 76.08 | 71.83 |
| SincNet | Mel-scale | 160 | - | 1001 | - | 76.38 | 78.33 |
| ResNet (Mel-spectrogram) | - | - | - | - | - | 85.49 | 88.34 |
| SincResNet | Mel-scale | 80 | - | (251, 251, 251) | - | 89.79 | 88.75 |
| SincResNet | Mel-scale | 80 | - | (501, 501, 501) | - | 89.78 | 87.93 |
| SincResNet | Mel-scale | 80 | - | (1001, 1001, 1001) | - | 89.69 | 86.56 |
| SincResNet | Mel-scale | 160 | - | (251, 251, 251) | - | 90.89 | 89.99 |
| SincResNet | Mel-scale | 160 | - | (501, 501, 501) | - | 90.79 | 90.12 |
| SincResNet | Mel-scale | 160 | - | (1001, 1001, 1001) | - | 91.08 | 90.26 |
| MS-SincResNet | Mel-scale | 80 | ✓ | (251, 501, 1001) | - | 90.18 | 87.24 |
| MS-SincResNet | Mel-scale | 80 | ✓ | (251, 501, 1001) | ✓ | 90.38 | 87.52 |
| MS-SincResNet | Mel-scale | 160 | ✓ | (251, 501, 1001) | - | 91.29 | 89.71 |
| MS-SincResNet | Mel-scale | 160 | ✓ | (251, 501, 1001) | ✓ | 91.49 | 91.91 |
3.3 Baseline setups
In this study, the original SincNet architecture with filter lengths L=251/501/1001, initialized using Mel-scale cut-off frequencies, is used as the baseline network. For the proposed SincResNet and MS-SincResNet, the setting of the SincNet filters also follows the original design [43]. In addition, we compared MS-SincNet having different filter lengths (MS-SincResNet) with the original single-scale SincNet filters followed by the ResNet architecture (SincResNet). To show the learning capability of the proposed MS-SincNet, the 2D representations learned using MS-SincNet and the hand-crafted Mel-spectrogram are individually fed into ResNet to compare their classification accuracy.

4 EXPERIMENTAL RESULTS
First, we compare the visualization of the 2D representations learned using MS-SincResNet with other 2D representations. Then, we show the ablation study to investigate the performance of the proposed MS-SincResNet approach. Finally, we compare the proposed MS-SincResNet approach with other competitive approaches on the GTZAN and the ISMIR2004 datasets.

4.1 The learned 2D representation
In this section, we compare the 2D representations learned using MS-SincResNet with other 2D representations, such as the spectrogram, harmonic spectrogram, percussive spectrogram, and Mel-spectrogram. The spectrogram and Mel-spectrogram are obtained by the short-term Fourier transform with a window size of 512 samples and a hop size of 128 samples. The harmonic spectrogram and percussive spectrogram are obtained with the harmonic-percussive source separation algorithm [13]. Each learned 2D representation is obtained by feeding a music clip to MS-SincNet using a different filter length. From Fig. 4, we can see that the 2D representations learned using the proposed MS-SincNet filters exhibit noticeable harmonic-related and percussive-related features, particularly for high-frequency components. Thus, it is expected that using the learned 2D representations as input to ResNet will yield better classification accuracy than the Mel-spectrogram input.

4.2 Ablation study
As shown in Table 1, for the baseline SincNet, the best classification accuracies on the GTZAN and the ISMIR2004 datasets are obtained by setting K=160, L=1001 (76.38%) and K=80, L=251 (81.34%), respectively. That is, it is hard to select one filter length L that achieves the best classification accuracy for all datasets. Also, we evaluated the classification results obtained by ResNet with the Mel-spectrogram as input. The classification accuracy is 85.49% and 88.34% on the GTZAN and the ISMIR2004 datasets, respectively. This shows that 2D ResNet with Mel-spectrogram input outperforms 1D SincNet with raw waveform input. By replacing the Mel-spectrogram with the 2D representation learned using a single-scale SincNet (denoted SincResNet), the classification accuracy is improved to 91.08% and 90.26% when K=160 and L=1001. This comparison shows that using SincNet to learn the 2D representation extracts more discriminative features and obtains better classification accuracy than hand-crafted Mel-spectrogram features. For MS-SincResNet, the best classification accuracy is 91.49% and 91.91% when SPP is incorporated in the network architecture. Compared with the best results obtained by the baseline SincNet, this is an improvement of 15.11 and 10.57 percentage points on the GTZAN and the ISMIR2004 datasets, respectively.

Fig. 5 compares the training loss curves of the proposed MS-SincResNet and the baseline SincNet on the GTZAN dataset. It demonstrates that the proposed MS-SincResNet converges faster and obtains better classification accuracy than the baseline SincNet.

Figure 5: The training loss curves of the proposed MS-SincResNet method (K=160 and L=(251, 501, 1001)) and the original SincNet (K=160 and L=251) over the training epochs.

Table 2: Comparison of the proposed method with the state-of-the-art methods on the GTZAN dataset.

| Method | GTZAN accuracy (%) |
|---|---|
| Bisharad et al. [7] | 85.36 |
| Bisharad et al. [8] | 82.00 |
| Raissi et al. [42] | 91.00 |
| Sugianto et al. [45] | 71.87 |
| Ashraf et al. [3] | 87.79 |
| Ng et al. [39] (FusionNet) | 96.50 |
| Liu et al. [30] | 93.90 |
| Nanni et al. [37] | 90.60 |
| Ours (MS-SincResNet) | 91.49 |

Table 3: Comparison of the proposed method with the state-of-the-art methods on the ISMIR2004 dataset.

| Method | ISMIR2004 accuracy (%) |
|---|---|
| Ng et al. [39] (FusionNet) | 92.46 |
| Nanni et al. [37] | 90.90 |
| Nanni et al. [38] | 90.20 |
| Costa et al. [11] | 87.10 |
| Ours (MS-SincResNet) | 91.91 |
4.3 Comparison with the state-of-the-art methods
Tables 2 and 3 compare the proposed MS-SincResNet with the state-of-the-art methods on the GTZAN and the ISMIR2004 datasets, respectively. The classification accuracy of the proposed MS-SincResNet method is 91.49% and 91.91%, respectively. From these two tables, we can see that the FusionNet proposed by Ng et al. [39] achieves the best classification accuracy. However, as stated in Sec. 1, FusionNet tries all possible combinations among 8 different features (timbre, rhythm, Mel-spectrogram, constant-Q spectrogram [17], harmonic spectrogram [12], percussive spectrogram [12], scatter transform spectrogram [1], and transfer feature [10]) using the sum rule to get the highest testing accuracy for each dataset. In fact, when considering one individual network, the best reported performance on the GTZAN dataset is 89.10% using the Mel-spectrogram, and on the ISMIR2004 dataset it is 87.38% using the transfer feature. That is, without fusion of different networks, our learned 2D representations always achieve better performance than the other hand-crafted 2D representations.

5 CONCLUSIONS
In this study, we proposed an end-to-end CNN architecture, called MS-SincResNet, which can jointly learn 1D kernels and 2D kernels for music genre classification. For 1D kernel learning, we use MS-SincNet filters to obtain variant 2D representations from the raw audio waveform rather than pre-computed hand-crafted features such as the Mel-spectrogram. Then, 2D kernel learning using ResNet-18 is used to extract embeddings from these learned 2D representations. The spatial pyramid pooling module is used to get compact features from the output of the last convolutional layer of ResNet-18. In the experiments, the proposed MS-SincResNet approach achieves classification accuracies of 91.49% and 91.91% on the GTZAN and ISMIR2004 datasets, which outperforms every hand-crafted 2D representation.

Inspired by the FusionNet [39], we can see that combining the classification results of different features often yields better performance than each individual feature. In the future, we will try to combine the classification results of variant networks in which the 2D representations are learned using SincNet with different sets of cut-off frequencies as the initialization of the band-pass filters. That is, in addition to the Mel-scale decomposition, the linear-scale decomposition or other frequency decomposition approaches, such as OSC and NASE, can also be considered.

ACKNOWLEDGMENTS
This work was supported in part by the Ministry of Science and Technology, Taiwan (MOST-108-2221-E-216-005 and MOST-108-2221-E-009-066-MY3).

REFERENCES
[1] Joakim Andén and Stéphane Mallat. 2014. Deep scattering spectrum. IEEE Transactions on Signal Processing 62, 16 (2014), 4114–4128.
[2] Relja Arandjelovic, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. 2016. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5297–5307.
[3] Mohsin Ashraf, Guohua Geng, Xiaofeng Wang, Farooq Ahmad, and Fazeel Abid. 2020. A Globally Regularized Joint Neural Architecture for Music Classification. IEEE Access 8 (2020), 220980–220989.
[4] Jean-Julien Aucouturier and Francois Pachet. 2003. Representing musical genre: A state of the art. Journal of New Music Research 32, 1 (2003), 83–93.
[5] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. 2016. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[6] Wenhao Bian, Jie Wang, Bojin Zhuang, Jiankui Yang, Shaojun Wang, and Jing Xiao. 2019. Audio-Based Music Classification with DenseNet and Data Augmentation. In Pacific Rim International Conference on Artificial Intelligence. Springer, 56–65.
[7] Dipjyoti Bisharad and Rabul Hussain Laskar. 2019. Music genre recognition using convolutional recurrent neural network architecture. Expert Systems 36, 4 (2019), e12429.
[8] Dipjyoti Bisharad and Rabul Hussain Laskar. 2019. Music Genre Recognition Using Residual Neural Networks. In TENCON 2019-2019 IEEE Region 10 Conference (TENCON). IEEE, 2063–2068.
[9] Pedro Cano, Emilia Gómez, Fabien Gouyon, Perfecto Herrera, Markus Koppenberger, Beesuan Ong, Xavier Serra, Sebastian Streich, and Nicolas Wack. 2006. ISMIR 2004 audio description contest. Music Technology Group of the Universitat Pompeu Fabra, Tech. Rep (2006).
[10] Keunwoo Choi, György Fazekas, Mark Sandler, and Kyunghyun Cho. 2017. Transfer learning for music classification and regression tasks. arXiv preprint arXiv:1703.09179 (2017).
[11] Yandre MG Costa, Luiz S Oliveira, and Carlos N Silla Jr. 2017. An evaluation of convolutional neural networks for music classification using spectrograms. Applied Soft Computing 52 (2017), 28–38.
[12] Jonathan Driedger, Meinard Müller, and Sascha Disch. 2014. Extending Harmonic-Percussive Separation of Audio Signals. In ISMIR. 611–616.
[13] Derry Fitzgerald. 2010. Harmonic/percussive separation using median filtering. In Proceedings of the International Conference on Digital Audio Effects (DAFx), Vol. 13.
[14] Fabien Gouyon, François Pachet, Olivier Delerue, et al. 2000. On the use of zero-crossing rate for an application of classification of percussive sounds. In Proceedings of the COST G-6 Conference on Digital Audio Effects (DAFX-00), Verona, Italy, Vol. 5.
[15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 37, 9 (2015), 1904–1916.
[16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[17] Nicki Holighaus, Monika Dörfler, Gino Angelo Velasco, and Thomas Grill. 2012. A framework for invertible, real-time constant-Q transforms. IEEE Transactions on Audio, Speech, and Language Processing 21, 4 (2012), 775–785.
[18] Yin-Fu Huang and Shih-Hao Wang. 2012. Movie genre classification using SVM with audio and video features. In International Conference on Active Media Technology. Springer, 1–10.
[19] Dan-Ning Jiang, Lie Lu, Hong-Jiang Zhang, Jian-Hua Tao, and Lian-Hong Cai. 2002. Music type classification by spectral contrast feature. In Proceedings of the IEEE International Conference on Multimedia and Expo, Vol. 1. IEEE, 113–116.
[20] Xin Jin and Rongfang Bie. 2006. Random Forest and PCA for Self-Organizing Maps based Automatic Music Genre Discrimination. In DMIN. 414–417.
[21] Chandanpreet Kaur and Ravi Kumar. 2017. Study and analysis of feature based automatic music genre classification using Gaussian mixture model. In 2017 International Conference on Inventive Computing and Informatics (ICICI). IEEE, 465–468.
[22] Hyoung-Gook Kim, Nicolas Moreau, and Thomas Sikora. 2004. Audio classification based on MPEG-7 spectral basis representations. IEEE Transactions on Circuits and Systems for Video Technology 14, 5 (2004), 716–725.
[23] Taejun Kim, Jongpil Lee, and Juhan Nam. 2018. Sample-level CNN architectures for music auto-tagging using raw waveforms. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 366–370.
[24] Gursimran Kour and Neha Mehan. 2015. Music genre classification using MFCC, SVM and BPNN. International Journal of Computer Applications 112, 6 (2015).
[25] Chang-Hsing Lee, Jau-Ling Shih, Kun-Ming Yu, and Hwai-San Lin. 2009. Automatic music genre classification based on modulation spectral analysis of spectral and cepstral features. IEEE Transactions on Multimedia 11, 4 (2009), 670–682.
[26] Chang-Hsing Lee, Jau-Ling Shih, Kun-Ming Yu, and Jung-Mau Su. 2007. Automatic music genre classification using modulation spectral contrast feature. In 2007 IEEE International Conference on Multimedia and Expo. IEEE, 204–207.
[27] Tao Li, Mitsunori Ogihara, and Qi Li. 2003. A comparative study on content-based music genre classification. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 282–289.
[28] Thomas Lidy and Andreas Rauber. 2005. Evaluation of feature extractors and psycho-acoustic transformations for music genre classification. In ISMIR. 34–41.
[29] Chien-Chang Lin, Shi-Huang Chen, Trieu-Kien Truong, and Yukon Chang. 2005. Audio classification and categorization based on wavelets and support vector machine. IEEE Transactions on Speech and Audio Processing 13, 5 (2005), 644–651.
[30] Caifeng Liu, Lin Feng, Guochao Liu, Huibing Wang, and Shenglan Liu. 2020. Bottom-up broadcast neural network for music genre classification. Multimedia Tools and Applications (2020), 1–19.
[31] Zhen Liu, Houqiang Li, Wengang Zhou, Ting Rui, and Qi Tian. 2015. Making residual vector distribution uniform for distinctive image representation. IEEE Transactions on Circuits and Systems for Video Technology 26, 2 (2015), 375–384.
[32] Lie Lu, Hong-Jiang Zhang, and Stan Z Li. 2003. Content-based audio classification and segmentation by using support vector machines. Multimedia Systems 8, 6 (2003), 482–492.
[33] Anders Meng, Peter Ahrendt, Jan Larsen, and Lars Kai Hansen. 2007. Temporal feature integration for music genre classification. IEEE Transactions on Audio, Speech, and Language Processing 15, 5 (2007), 1654–1664.
[34] Ionuţ Mironică, Ionuţ Cosmin Duţă, Bogdan Ionescu, and Nicu Sebe. 2016. A modified vector of locally aggregated descriptors approach for fast video classification. Multimedia Tools and Applications 75, 15 (2016), 9045–9072.
[35] F. Morchen, Alfred Ultsch, Michael Thies, and Ingo Lohken. 2005. Modeling timbre distance with temporal statistics from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing 14, 1 (2005), 81–90.
[36] Hannah Muckenhirn, Mathew Magimai Doss, and Sébastien Marcel. 2018. Towards directly modeling raw speech signal for speaker verification using CNNs. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 4884–4888.
[37] Loris Nanni, Yandre MG Costa, Diego Rafael Lucio, Carlos Nascimento Silla Jr, and Sheryl Brahnam. 2017. Combining visual and acoustic features for audio classification tasks. Pattern Recognition Letters 88 (2017), 49–56.
[38] Loris Nanni, Yandre MG Costa, Alessandra Lumini, Moo Young Kim, and Seung Ryul Baek. 2016. Combining visual and acoustic features for music genre classification. Expert Systems with Applications 45 (2016), 108–117.
[39] Wing WY Ng, Weijie Zeng, and Ting Wang. 2020. Multi-Level Local Feature Coding Fusion for Music Genre Recognition. IEEE Access 8 (2020), 152713–152727.
[40] Hyunsin Park and Chang D Yoo. 2020. CNN-based learnable gammatone filterbank and equal-loudness normalization for environmental sound classification. IEEE Signal Processing Letters 27 (2020), 411–415.
[41] Lawrence Rabiner. 1993. Fundamentals of Speech Recognition. (1993).
[42] Tina Raissi, Alessandro Tibo, and Paolo Bientinesi. 2018. Extended pipeline for content-based feature engineering in music genre recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2661–2665.
[43] Mirco Ravanelli and Yoshua Bengio. 2018. Speaker recognition from raw waveform with SincNet. In 2018 IEEE Spoken Language Technology Workshop (SLT). IEEE, 1021–1028.
[44] Li Su, Chin-Chia Michael Yeh, Jen-Yu Liu, Ju-Chiang Wang, and Yi-Hsuan Yang. 2014. A systematic evaluation of the bag-of-frames representation for music information retrieval. IEEE Transactions on Multimedia 16, 5 (2014), 1188–1200.
[45] Sugianto Sugianto and Suyanto Suyanto. 2019. Voting-based music genre classification using melspectogram and convolutional neural network. In 2019 International Seminar on Research of Information Technology and Intelligent Systems (ISRITI). IEEE, 330–333.
[46] Chun Pui Tang, Ka Long Chui, Ying Kin Yu, Zhiliang Zeng, and Kin Hong Wong. 2018. Music genre classification using a hierarchical long short term memory (LSTM) model. In Third International Workshop on Pattern Recognition, Vol. 10828. International Society for Optics and Photonics, 108281B.
[47] R. Thiruvengatanadhan. 2018. Music Genre Classification using MFCC and AANN. International Research Journal of Engineering and Technology (IRJET) (2018).
[48] Adam R Tindale, Ajay Kapur, George Tzanetakis, and Ichiro Fujinaga. 2004. Retrieval of percussion gestures using timbre classification techniques. In ISMIR.
[49] George Trigeorgis, Fabien Ringeval, Raymond Brueckner, Erik Marchi, Mihalis A Nicolaou, Björn Schuller, and Stefanos Zafeiriou. 2016. Adieu features? End-to-end speech emotion recognition using a deep convolutional recurrent network. In 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 5200–5204.
[50] George Tzanetakis and Perry Cook. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10, 5 (2002), 293–302.
[51] George Tzanetakis and Perry Cook. 2002. Musical genre classification of audio signals. IEEE Transactions on Speech and Audio Processing 10, 5 (2002), 293–302.
[52] Yonatan Vaizman, Brian McFee, and Gert Lanckriet. 2014. Codebook-based audio feature representation for music information retrieval. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22, 10 (2014), 1483–1493.
[53] S. Vishnupriya and K. Meenakshi. 2018. Automatic Music Genre Classification using Convolution Neural Network. In 2018 International Conference on Computer Communication and Informatics (ICCCI). IEEE, 1–4.
[54] Benedikt S Vogler and Amir Othman. 2016. Music genre recognition. benediktsvogler.com (2016).
