
MUSIC GENRE CLASSIFICATION USING DENSENET AND DATA AUGMENTATION
Dao Thi Le Thuy, Trinh Van Loan, Chu Ba Thanh, Nguyen Hieu Cuong
Abstract
It can be said that the automatic classification of musical genres plays a very important role in today's digital world, in which the creation, distribution, and enjoyment of musical works have undergone huge changes. With the number of music products increasing every day and the variety of musical genres extremely rich, storing, classifying, and searching these works manually is extremely difficult, if not impossible. Automatic classification of musical genres will contribute to making the impossible possible. The research presented in this paper proposes an appropriate deep learning model together with an effective data augmentation method to achieve high accuracy for music genre classification. Our results show that the DenseNet121 model, combined with data augmentation methods such as noise addition and echo generation, achieves a classification accuracy of 98.97% on the Small FMA dataset, even with the sampling frequency of this dataset lowered to 16000 Hz. This classification accuracy outperforms most existing results on the same Small FMA dataset.

Keywords: Music Genre Classification, Small FMA, DenseNet, CNN, GRU, Data Augmentation

I. Introduction

Today, advanced digital technology has dramatically changed the way people create, distribute,
enjoy and consume music products. On the one hand, the number of musical works of mankind is
extremely large and constantly increasing. On the other hand, the musical genres are also very rich.
Therefore, manually sorting, classifying, and searching for such musical works is an extremely
difficult if not impossible task. Computers along with machine learning and deep learning tools have
made it possible to do such jobs automatically. In addition, various music datasets have been developed and have been very helpful for research in this area.
In this paper, we will present the results of our research using deep learning tools and effective data
augmentation methods for music genre classification with Small FMA data. The rest of the paper is
organized as follows: Section II is for related work. Section III describes data augmentation methods
for Small FMA. Section IV presents the models used in our experiments. Section V is the results and
discussion. Finally, section VI is our conclusion.

II. Related work

FMA (Free Music Archive) was introduced by Defferrard et al. in [1] (Defferrard, M., Benzi, K., Vandergheynst, P., & Bresson, X. (2016). FMA: A dataset for music analysis. arXiv preprint arXiv:1612.01840). The Free Music Archive is a large-scale dataset for evaluating several tasks in Music Information Retrieval. It consists of 343 days of audio from 106,574 tracks by 16,341 artists across 14,854 albums, arranged in a hierarchical taxonomy of 161 genres. It provides full-length, high-quality audio, pre-computed features, and track- and user-level metadata, tags, and free-form text such as biographies. The authors define four subsets. Full: the complete dataset; Large: the full dataset with audio limited to 30-second clips extracted from the middle of the tracks (or the entire track if shorter than 30 seconds); Medium: a selection of 25,000 30-second clips having a single root genre; and Small: a balanced subset containing 8,000 30-second clips, 1,000 for each of 8 root genres. The eight root genres are Electronic, Experimental, Folk, Hip Hop, Instrumental, International, Pop, and Rock.
Due to hardware constraints, we used Small FMA in our research. For comparison with other research, this section also reviews recent studies that have used the same Small FMA dataset; more details are presented in Table XXXXX. Regarding the models used for MGC (Music Genre Classification) with Small FMA, the following can be seen. The authors in (Pimenta-Zanon, M.H., Bressan, G.M., & Lopes, F.M. (2021). Complex network-based approach for feature extraction and classification of musical genres. arXiv preprint arXiv:2110.04654) used a Bayesian network, and Chen and Liang (Chen, K., & Liang, B. (2020). Do user preference data benefit music genre classification tasks? In Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada) used an SVM for MGC. Most other research used deep neural networks for MGC. CNNs (Convolutional Neural Networks) were used in (Kostrzewa, D., Kaminski, P., & Brzeski, R. (2021, June). Music genre classification: Looking for the perfect network. In International Conference on Computational Science (pp. 55-67). Springer, Cham), (Matocha, M., & Zieliński, S.K. (2018). Music genre recognition using convolutional neural networks. Advances in Computer Science Research, 14, 125-142. DOI: 10.24427/acsr-2018-vol14-0008), and (Chillara, S., Kavitha, A.S., Neginhal, S.A., Haldia, S., & Vidyullatha, K.S. (2019). Music genre classification using machine learning algorithms: A comparison. Int. Res. J. Eng. Technol. (IRJET), 6(05), 851-858). Research in (Gunawan, A.A., & Suhartono, D. (2019). Music recommender system based on genre using convolutional recurrent neural networks. Procedia Computer Science, 157, 99-109), (Qin, Y., & Lerch, A. (2019, May). Tuning frequency dependency in music classification. In ICASSP 2019 (pp. 401-405). IEEE), and (Kostrzewa, D., Ciszynski, M., & Brzeski, R. (2022, July). Evolvable hybrid ensembles for musical genre classification. In Proceedings of the Genetic and Evolutionary Computation Conference Companion (pp. 252-255)) used CRNNs (Convolutional Recurrent Neural Networks). BBNNs (Bottom-up Broadcast Neural Networks) were used in (Heakl, A., Abdelgawad, A., & Parque, V. (2022). A study on broadcast networks for music genre classification. arXiv preprint arXiv:2208.12086). The authors in (Choudhary, V., & Vyas, A. CS543: Music genre recognition through audio samples) exploited LSTM, and DenseNet was used in (Bian, W., Wang, J., Zhuang, B., Yang, J., Wang, S., & Xiao, J. (2019, August). Audio-based music classification with DenseNet and data augmentation. In Pacific Rim International Conference on Artificial Intelligence (pp. 56-65). Springer, Cham). A Siamese neural network, which consists of twin networks that share weights and configuration, receive distinct inputs, and are optimized on a similarity score, was used in (Park, J., Lee, J., Park, J., Ha, J.-W., & Nam, J. (2018). Representation learning of music using artist labels. In Proc. ISMIR). For the feature parameters, most of the authors used the Mel spectrogram, as in [5, 6, 8, 10, 11, 16, 17]; others used spectrograms [1, 7, 14]. (Kostrzewa, D., Mazur, W., & Brzeski, R. (2022). Wide ensembles of neural networks in music genre classification. In International Conference on Computational Science (pp. 64-71). Springer, Cham) made use of 500 features provided with the FMA dataset. (Pimenta-Zanon, M.H., Bressan, G.M., & Lopes, F.M. (2021). Complex network-based approach for feature extraction and classification of musical genres. arXiv preprint arXiv:2110.04654) used EXAMINNER for feature extraction.

III. Proposed data augmentation methods


Among the studies listed in Table xx, only [8,12] used data augmentation, which was done by changing the pitch. The data augmentation methods performed in our study are adding noise, creating echo, and changing the pitch. Among these, changing the pitch proved less effective than the other two methods, as will be seen in the experimental section. The following is a description of the data augmentation methods we have implemented.

 Adding noise
For noise addition, Librosa was used [22]. White noise with an amplitude of 0.03 times the peak amplitude of the signal is added to the signal. The SNR (signal-to-noise ratio) in dB is calculated as SNR = 10 log10(Psignal/Pnoise), where Psignal is the signal power and Pnoise is the power of the added noise, under the assumption that any background noise already present in the original sound files can be ignored. On this basis, the average SNR over the 8000 sound files is about 19.38 dB. According to ICSI (International Computer Science Institute, https://www1.icsi.berkeley.edu/Speech/faq/speechSNR.html), an SNR of 30 dB is considered a clean signal. Thus, adding noise degrades the signal quality, but not by much in this case. Figure xx is an example of calculating the SNR and its average for a file.

3
Figure XX. Example of calculating the SNR and its average for a file.
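The noise-addition step and the SNR computation described above can be sketched as follows. This is a minimal illustration with NumPy; the function name and the use of a plain sample array (rather than a Librosa-loaded file) are our assumptions, not the authors' exact code:

```python
import numpy as np

def add_noise(y, noise_ratio=0.03):
    """Add white noise whose amplitude is `noise_ratio` of the signal's
    peak amplitude, and report the resulting SNR in dB as
    10 * log10(P_signal / P_noise)."""
    noise = noise_ratio * np.max(np.abs(y)) * np.random.randn(len(y))
    snr_db = 10 * np.log10(np.sum(y ** 2) / np.sum(noise ** 2))
    return y + noise, snr_db
```

For a full-scale sine wave, this noise level yields an SNR in the high-20s of dB, consistent with the ~19 dB average reported above for real music files, whose average power is lower relative to their peaks.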

 Creating echo
An echo effect causes a sound to repeat after a delay with diminishing volume, simulating a real echo. In our echo experiment, the sound signal is delayed by 250 ms and repeated 3 times; at each repetition, the amplitude of the delayed copy is multiplied by a factor of 0.25. To show the echo clearly, Figure XX illustrates the echo generated at the end of the sound file.

Figure XX. Echo generated at the end of the sound file.
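The echo generation just described (250 ms delay, 3 repetitions, gain 0.25 per repetition) can be sketched as follows; this is an illustrative NumPy implementation under those parameters, not the authors' exact code:

```python
import numpy as np

def add_echo(y, sr, delay_s=0.25, repeats=3, decay=0.25):
    """Mix `repeats` delayed copies into the signal; each copy starts
    `delay_s` seconds after the previous one and is `decay` times quieter."""
    d = int(delay_s * sr)
    out = np.concatenate([y, np.zeros(repeats * d)])  # room for the tail
    gain = 1.0
    for k in range(1, repeats + 1):
        gain *= decay
        out[k * d : k * d + len(y)] += gain * y
    return out
```

The output is extended by repeats × delay samples so the last echo is not truncated, which is why the echo is visible at the end of the file in Figure XX.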

 Changing pitch
Changing pitch is done by shifting up by a semitone and by a whole tone, using the function librosa.effects.pitch_shift from Librosa. To illustrate pitch change, Figure XX shows shifting up from the A5 note, whose frequency is 880 Hz. After shifting up a semitone, it becomes A#5, with a frequency of about 932.33 Hz; after shifting up a whole tone, it becomes B5, with a frequency of about 987.77 Hz.

Figure XX. Illustration of shifting up a semitone and a tone
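In equal temperament, shifting by n semitones multiplies the frequency by 2^(n/12), so the A5 frequencies above can be checked with a small computation (in practice the shift itself is applied to the waveform with librosa.effects.pitch_shift):

```python
def shift_frequency(f0, semitones):
    """Equal-tempered pitch shift: each semitone multiplies the
    frequency by 2**(1/12)."""
    return f0 * 2 ** (semitones / 12)

print(round(shift_frequency(880, 1), 2))  # 932.33 (A#5)
print(round(shift_frequency(880, 2), 2))  # 987.77 (B5)
```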

IV. Proposed models for our experiments

The models used in our experiments are DenseNet, CNN, and GRU. Three variants of DenseNet were used: DenseNet121, DenseNet169, and DenseNet201 [16]. The DenseNet models can be briefly described as follows.

DenseNet is considered one of the 7 best models for image classification using Keras [15]. In a traditional CNN, the input image is passed through the network to extract feature maps in turn, and finally the predicted labels are produced at the output; the forward pass is straightforward. Except for the first convolutional layer, whose input is the image to be recognized, each layer takes the previous layer's output and produces a feature map, which is then passed to the next convolutional layer. If the CNN has L layers, there are L connections, each linking one layer to the next. Fig. XXX illustrates the DenseNet architecture.

Figure XXX. Illustration of the DenseNet architecture: Input → Convolution → Dense Block 1 → Transition Layer (Convolution, Pooling) → Dense Block 2 → Transition Layer (Convolution, Pooling) → Dense Block 3 → Pooling → Linear → Prediction Output

DenseNet [17] introduces densely connected neural networks to obtain deeper insights along with efficient and accurate training and outputs. In DenseNet, in addition to the layer-to-layer connections of a conventional CNN, there is another special type of connection: each layer is connected to every other layer. If DenseNet has L layers, there will be L(L + 1)/2 direct connections. The input of a layer inside DenseNet is the concatenation of the feature maps of all previous layers.
The architecture of DenseNet contains dense blocks, where the dimensions of the feature maps remain
constant within a block, but the number of filters changes between them. Transition Layers are used to
connect dense blocks together.
As shown in Table 1 of [17], DenseNet121 has [6, 12, 24, 16] layers in its four dense blocks. The number 121 is obtained as 5 + (6 + 12 + 24 + 16) × 2 = 121, where 5 counts the initial (convolution, pooling) stage, the 3 transition layers, and the classification layer, and the factor of 2 arises because each dense layer consists of 2 convolutions (a 1×1 convolution and a 3×3 convolution). The same reasoning applies to DenseNet169 and DenseNet201.
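The layer-count arithmetic above can be checked for all three variants; the block configurations [6, 12, 24, 16], [6, 12, 32, 32], and [6, 12, 48, 32] are the standard ones from the DenseNet paper:

```python
# Standard dense-block configurations of the three DenseNet variants
blocks = {"DenseNet121": [6, 12, 24, 16],
          "DenseNet169": [6, 12, 32, 32],
          "DenseNet201": [6, 12, 48, 32]}

def depth(layers_per_block):
    # each dense layer = 1x1 conv + 3x3 conv            -> factor of 2
    # +5 = initial conv/pool + 3 transition layers + classification layer
    return 2 * sum(layers_per_block) + 5

for name, b in blocks.items():
    print(name, depth(b))  # 121, 169, 201
```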
The CNN model in this study inherits the CNN model used and presented in [18]. The same goes for the
GRU model that was used and presented in [18].

V. Results and Discussion

The sampling frequency of the original files is 44100 Hz, with 16 bits per sample. First, the DenseNet169 model is used on Small FMA at two sampling frequencies, 44100 Hz and 16000 Hz. At the same time, the data are augmented fourfold. 4 × DATA comprises the original data (8000 files), the noise-added original data (8000 files), the echoed original data (8000 files), and the echoed and noise-added original data (8000 files); thus, 4 × DATA = 4 × 8000 = 32000 files. The 32000 files are divided into 10 parts for cross-validation. One of the 10 parts is set aside and used for testing; the remaining 9 are used for training and validation in the ratio training:validation = 8:1. Thus, the number of folds for training and validation is 9.
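The split described above can be sketched as follows. This is a minimal illustration: the shuffling, the choice of which part is held out, and the function name are our assumptions, not the authors' exact code:

```python
import numpy as np

def nine_folds(n_files=32000, seed=0):
    """10-way split of file indices: one part is held out for testing;
    the other 9 parts rotate as the validation set, giving 9
    train/validation folds with train:val = 8:1."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(n_files), 10)
    test = parts[0]                       # fixed held-out test part
    for v in range(1, 10):                # 9 train/validation folds
        val = parts[v]
        train = np.concatenate([parts[j] for j in range(1, 10) if j != v])
        yield train, val, test
```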
The first experimental results are those for the DenseNet169 model at the two sampling frequencies, 44100 Hz and 16000 Hz, with 4 × DATA.
If the sampling frequency is 44100 Hz and the duration of each file is 30 seconds, each file contains 30 × 44100 = 1,323,000 samples. If the frame width used to compute the FFT for the Mel spectrum is 4096 samples and the frame shift is 2048 samples, the corresponding number of frames for a sound file is 646. Thus, each audio file is converted to an image of size 646 (number of Mel coefficients) × 646 (number of frames). In our study, the input for the DenseNet networks has a target shape of (224, 224, 3): (224, 224) is the image size and 3 is the number of R, G, B channels.
Accuracy, Precision, Recall, and f1-score in this paper are calculated according to [21].
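The paper does not detail how the square Mel-spectrogram image (646 × 646, or 230 × 230 at 16000 Hz) is mapped to the (224, 224, 3) DenseNet input; one plausible sketch, in which strided downsampling and channel replication are our assumptions, is:

```python
import numpy as np

def to_input_tensor(mel_db, size=224):
    """Downsample a square mel-spectrogram image to (size, size) by
    index striding, rescale to [0, 1], and stack 3 identical channels
    to match DenseNet's (224, 224, 3) input shape."""
    n = mel_db.shape[0]
    idx = np.linspace(0, n - 1, size).astype(int)
    img = mel_db[np.ix_(idx, idx)]
    img = (img - img.min()) / (img.max() - img.min() + 1e-9)
    return np.stack([img] * 3, axis=-1)
```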

Table x1. Accuracy and AUC for DenseNet169, fs = 44100 Hz, and 4 × DATA (Acc.: Accuracy, AUC: Area Under the Receiver Operating Characteristic curve [18,19])

Fold    Acc.    AUC
1       87.88   0.931
2       88.94   0.937
3       84.91   0.914
4       85.81   0.919
5       87.53   0.929
6       88.66   0.935
7       87.34   0.928
8       83.91   0.908
9       86.84   0.925
Aver. % 86.87   0.925

Table x2. Precision for DenseNet169, fs = 44100 Hz, and 4 × DATA (Elec.: Electronic, Exp.: Experimental, HH: Hip Hop, Instr.: Instrumental, Intl.: International)

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       88.1    83.21   91.24   92.29   87.98   92.66   79.35   88.22
2       87.32   82.97   88.33   96.25   89.09   92.71   84.24   91.14
3       84      76.04   90.31   89.35   82.99   92.58   75.54   89.75
4       83.65   77.94   89.6    92.41   85.91   89.8    79.34   88.13
5       85.5    83.66   89.1    92.11   89.57   92.22   80.05   88.32
6       91.17   80.97   91.38   90.4    90.55   93.06   82.66   89.49
7       91.82   81.36   91.39   93.21   88.8    92.45   75.77   85.94
8       85.03   75.66   88.29   88.86   83.46   87.95   75.32   86.65
9       86.76   76.92   90.43   93.1    86.6    94.69   80.05   87.21
Aver. % 87.04   79.86   90.01   92      87.22   92.01   79.15   88.32

Tables x2, x3, and x4 show that Precision achieves its highest value (in bold), 92.01%, for the International genre, while Recall and f1-score achieve their highest values for the Hip Hop genre. Precision, Recall, and f1-score all have their lowest values (in red) for the Pop genre.

Recall, DenseNet169, fs = 44100 Hz, 4 × DATA

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1 88.1 88.07 85.3 93.98 86.87 93.52 78.19 89.15
2 92.41 87.82 89.4 93.98 86.62 94.21 79.9 87.04
3 85.06 83.76 83.13 90.05 82.58 92.36 76.47 85.71
4 88.1 82.49 87.23 92.41 80.05 91.67 76.23 88.36
5 88.1 87.06 88.67 91.62 84.6 93.29 80.64 85.98
6 88.86 88.58 89.4 93.72 87.12 93.06 80.64 87.83
7 88.1 85.28 86.99 93.46 84.09 90.74 84.31 85.71
8 84.81 79.7 87.23 89.79 81.57 91.2 72.55 84.13
9 89.62 86.29 86.51 91.88 84.85 90.74 76.72 88.36
Aver. % 88.13 85.45 87.1 92.32 84.26 92.31 78.41 86.92
Table x3. Recall for Densenet169, fs = 44100 Hz, and 4 × DATA

f1-score, DenseNet169, fs = 44100 Hz, 4 × DATA

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1 88.1 85.57 88.17 93.13 87.42 93.09 78.77 88.68
2 89.79 85.33 88.86 95.1 87.84 93.46 82.01 89.04
3 84.53 79.71 86.57 89.7 82.78 92.47 76 87.69

4 85.82 80.15 88.4 92.41 82.88 90.72 77.75 88.24
5 86.78 85.32 88.89 91.86 87.01 92.75 80.34 87.13
6 90 84.61 90.38 92.03 88.8 93.06 81.64 88.65
7 89.92 83.27 89.14 93.33 86.38 91.59 79.81 85.83
8 84.92 77.63 87.76 89.32 82.5 89.55 73.91 85.37
9 88.17 81.34 88.42 92.49 85.71 92.67 78.35 87.78
Aver. % 87.56 82.55 88.51 92.15 85.7 92.15 78.73 87.6
Table x4. f1-score for Densenet169, fs = 44100 Hz, and 4 × DATA

If the sampling frequency is reduced to 16000 Hz, each file contains 30 × 16000 = 480,000 samples; with the frame width and frame shift kept as above, the corresponding number of frames for a sound file is approximately 234. To ensure that no frame extends beyond the end of a sound file, the number of frames was set to 230. Thus, each sound file is converted to an image of size 230 (number of Mel coefficients) × 230 (number of frames). The input shape for the DenseNet networks remains (224, 224, 3) as above. It can be seen that reducing the sampling frequency to 16000 Hz reduces the image size and significantly increases recognition accuracy, so our subsequent experiments were performed at a sampling frequency of 16000 Hz.
Table xx1. Accuracy and AUC for DenseNet169, fs = 16000 Hz, and 4 × DATA

Fold    Acc.    AUC
1       99.38   0.996
2       99.19   0.995
3       98.78   0.993
4       98.41   0.991
5       99.22   0.996
6       99.5    0.997
7       98.62   0.992
8       98.25   0.99
9       99.22   0.996
Aver. % 98.95   0.994

Table xx2. Precision for DenseNet169, fs = 16000 Hz, and 4 × DATA

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       99.24   99.74   99.04   99.48   99.24   99.31   99.02   100
2       98.75   97.98   99.52   99.22   99.24   100     99.5    99.21
3       97.98   97.99   99.76   99.21   98.72   98.85   98.79   98.94
4       98.97   97.72   96.92   99.74   97.97   99.07   99.02   97.88
5       99.23   99.24   99.76   98.96   99.23   98.18   99.75   99.47
6       100     99.49   99.27   100     99.25   99.31   99.27   99.47
7       98.74   98      98.54   98.96   99.22   99.07   98.06   98.42
8       99.22   97.45   98.33   97.7    98.22   98.38   97.8    98.94
9       98.99   98.49   100     99.21   98.74   99.31   99.02   100
Aver. % 99.01   98.46   99.02   99.16   98.87   99.05   98.91   99.15

Recall, DenseNet169, 4 × DATA, 16kHz
Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 99.24 98.98 99.52 99.21 99.24 100 99.26 99.47
2 99.75 98.73 99.28 99.74 99.24 99.54 98.04 99.21
3 98.48 98.98 98.55 98.95 97.22 99.54 99.75 98.68
4 97.72 97.97 98.55 99.48 97.73 98.84 99.02 97.88
5 98.23 99.75 99.04 99.74 98.23 99.77 99.26 99.74
6 98.99 99.75 98.8 99.74 99.75 99.54 99.51 100
7 99.24 99.49 97.83 99.21 96.46 99.07 99.02 98.68
8 96.71 96.95 99.28 100 97.73 98.38 98.04 98.94
9 98.99 99.49 99.28 99.21 98.74 99.77 98.77 99.47
Aver. % 98.59 98.9 98.9 99.48 98.26 99.38 98.96 99.12

Table xx3. Recall for DenseNet169, fs = 16000 Hz, and 4 × DATA

f1-score, DenseNet169, 4 × DATA, 16kHz


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 99.24 99.36 99.28 99.34 99.24 99.65 99.14 99.73
2 99.24 98.36 99.4 99.48 99.24 99.77 98.77 99.21
3 98.23 98.48 99.15 99.08 97.96 99.19 99.27 98.81
4 98.34 97.85 97.73 99.61 97.85 98.96 99.02 97.88
5 98.73 99.49 99.4 99.35 98.73 98.97 99.51 99.6
6 99.49 99.62 99.03 99.87 99.5 99.42 99.39 99.74
7 98.99 98.74 98.19 99.08 97.82 99.07 98.54 98.55
8 97.95 97.2 98.8 98.84 97.97 98.38 97.92 98.94
9 98.99 98.99 99.64 99.21 98.74 99.54 98.9 99.73
Aver. % 98.8 98.68 98.96 99.32 98.56 99.22 98.94 99.13
Table xx4. f1-score for DenseNet169, fs = 16000 Hz, and 4 × DATA

From Tables xx2, xx3, and xx4, Precision, Recall, and f1-score all have their highest values (in bold) for the Hip Hop genre. The lowest values (in red) are for Precision with the Experimental genre, and for Recall and f1-score with the Instrumental genre.

The following are the results for DenseNet121 with the sampling frequency of 16000 Hz,
image size (230×230), and 4 × DATA.

Table xxx1. Accuracy and AUC for DenseNet121, fs = 16000 Hz, and 4 × DATA

Fold    Acc.    AUC
1       98.88   0.994
2       99.09   0.995
3       98.59   0.992
4       98.88   0.994
5       99.59   0.998
6       99.59   0.998
7       98.56   0.992
8       98.75   0.993
9       98.81   0.993
Aver. % 98.97   0.994

Table xxx2. Precision for DenseNet121, fs = 16000 Hz, and 4 × DATA

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       99.24   97.01   99.04   99.22   98.47   99.3    99.75   98.94
2       99.25   97.75   99.76   100     98.74   98.62   98.75   100
3       98.48   97.02   98.55   99.74   98.21   99.31   98.77   98.67
4       98.74   98.99   98.08   99.74   99.23   99.31   98.28   98.68
5       99.23   98.25   100     99.48   99.75   100     100     100
6       100     99.24   99.76   100     98.75   99.31   99.76   100
7       98.98   98.73   96.93   98.2    99.23   99.31   99.01   98.14
8       99.74   97.01   98.57   99.21   98.73   98.38   99.5    98.94
9       98.49   97.74   99.03   99.21   98.48   99.54   99      98.95
Aver. % 99.13   97.97   98.86   99.42   98.84   99.23   99.2    99.15

Recall, DenseNet121, fs = 16kHz, 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 98.99 98.98 99.76 100 97.73 98.61 98.28 98.68
2 100 99.24 99.28 99.74 98.99 99.31 97.06 99.21
3 98.73 99.24 98.55 98.95 96.97 99.54 98.28 98.41
4 99.24 99.24 98.55 99.74 97.98 99.31 98.04 98.94
5 98.48 99.75 100 100 99.24 100 99.51 99.74
6 99.24 99.75 99.04 99.74 99.49 99.77 100 99.74
7 97.97 98.73 99.04 99.74 98.23 99.31 97.79 97.62
8 98.99 98.73 100 99.21 98.48 98.61 97.06 98.94
9 99.24 98.98 98.8 98.69 97.98 99.54 97.55 99.74
Aver. % 98.99 99.18 99.22 99.53 98.34 99.33 98.17 99
Table xxx3. Recall for DenseNet121, fs = 16000 Hz, and 4 × DATA

f1-score, DenseNet121, fs = 16kHz, 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 99.11 97.99 99.4 99.61 98.1 98.95 99.01 98.81
2 99.62 98.49 99.52 99.87 98.87 98.96 97.9 99.6
3 98.61 98.12 98.55 99.34 97.59 99.42 98.53 98.54
4 98.99 99.11 98.32 99.74 98.6 99.31 98.16 98.81
5 98.86 98.99 100 99.74 99.49 100 99.75 99.87
6 99.62 99.49 99.4 99.87 99.12 99.54 99.88 99.87
7 98.47 98.73 97.97 98.96 98.73 99.31 98.4 97.88
8 99.36 97.86 99.28 99.21 98.61 98.5 98.26 98.94
9 98.87 98.36 98.91 98.95 98.23 99.54 98.27 99.34
Aver. % 99.06 98.57 99.04 99.48 98.59 99.28 98.68 99.07
Table xxx4. f1-score for DenseNet121, fs = 16000 Hz, and 4 × DATA

For DenseNet121 with fs = 16000 Hz and 4 × DATA, Precision, Recall, and f1-score all reach their highest values for the Hip Hop genre. The lowest values are for Precision and f1-score with the Experimental genre, and for Recall with the Pop genre.

For DenseNet201 with the sampling frequency of 16000 Hz, image size (230×230), and 4 ×
DATA, the results are given in Tables XXXX
Table xxxx1. Accuracy and AUC for DenseNet201, fs = 16000 Hz, and 4 × DATA

Fold    Acc.    AUC
1       99.16   0.995
2       98.56   0.992
3       97.53   0.986
4       98.5    0.991
5       99.41   0.997
6       99.22   0.996
7       98.72   0.993
8       99.38   0.996
9       99.16   0.995
Aver. % 98.85   0.993

Table xxxx2. Precision for DenseNet201, fs = 16000 Hz, and 4 × DATA

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       98.25   98.48   99.76   98.96   99.24   99.54   100     98.94
2       98      96.99   98.32   99.48   99.22   99.77   98.02   98.68
3       97.46   95.31   96.9    99.74   97.69   98.38   98.73   96.11
4       98.24   95.54   99.04   98.7    98.73   99.77   99.24   98.68
5       99.75   99.24   99.27   100     98.99   99.31   99.51   99.21
6       98.73   98.99   99.52   99.22   99.24   99.77   99.02   99.21
7       98.49   98.73   99.76   99.48   98.98   99.3    96.16   98.93
8       99.75   98.5    99.76   99.48   100     99.31   98.3    100
9       99.49   97.76   98.57   99.48   99.49   99.77   99.5    99.2
Aver. % 98.68   97.73   98.99   99.39   99.06   99.44   98.72   98.77

Recall, DenseNet201, Fs = 16kHz, 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 99.24 98.98 99.28 99.21 99.24 99.31 98.77 99.21
2 99.24 98.22 98.55 100 96.46 99.54 97.3 99.21
3 97.22 97.97 97.83 99.21 95.96 98.61 95.34 98.15
4 98.99 97.97 99.52 99.74 98.23 98.84 95.59 99.21
5 100 98.98 98.8 99.48 99.24 99.77 99.26 99.74
6 98.73 99.24 99.28 100 98.48 99.54 99.02 99.47
7 99.24 98.98 98.55 99.48 98.23 98.84 98.28 98.15
8 100 100 99.04 99.48 98.23 99.77 99.26 99.21
9 98.73 99.49 100 100 98.23 100 97.79 98.94
Aver. % 99.04 98.87 98.98 99.62 98.03 99.36 97.85 99.03

Table xxxx3. Recall for DenseNet201, fs = 16000 Hz, and 4 × DATA

f1-score, DenseNet201, Fs = 16kHz, 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock

1 98.74 98.73 99.52 99.08 99.24 99.42 99.38 99.08
2 98.62 97.6 98.44 99.74 97.82 99.65 97.66 98.94
3 97.34 96.62 97.36 99.48 96.82 98.5 97.01 97.12
4 98.61 96.74 99.28 99.22 98.48 99.3 97.38 98.94
5 99.87 99.11 99.03 99.74 99.12 99.54 99.39 99.47
6 98.73 99.11 99.4 99.61 98.86 99.65 99.02 99.34
7 98.87 98.86 99.15 99.48 98.61 99.07 97.21 98.54
8 99.87 99.24 99.4 99.48 99.11 99.54 98.78 99.6
9 99.11 98.62 99.28 99.74 98.86 99.88 98.64 99.07
Aver. % 98.86 98.29 98.98 99.51 98.55 99.39 98.27 98.9
Table xxxx4. f1-score for DenseNet201, fs = 16000 Hz, and 4 × DATA

For DenseNet201 with fs = 16000 Hz and 4 × DATA, Recall and f1-score reach their highest values for the Hip Hop genre, while Precision reaches its highest value for the International genre. The lowest values are for Precision with the Experimental genre, and for Recall and f1-score with the Pop genre.

The results for the CNN model with image size (230 × 230), fs = 16000 Hz, and 4 × DATA are as follows.
Table xxxx1. Accuracy and AUC for CNN 230 × 230, fs = 16000 Hz, and 4 × DATA

Fold    Acc.    AUC
1       76.38   0.964
2       76.53   0.964
3       76.12   0.964
4       75.97   0.964
5       75.56   0.963
6       76.88   0.964
7       76.59   0.963
8       76.28   0.962
9       76.81   0.963
Aver. % 76.35   0.963

Table xxxx2. Precision for CNN 230 × 230, fs = 16000 Hz, and 4 × DATA (full file)

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       98.63   96.13   98.09   36.11   98.64   98.68   96.75   96.35
2       35.55   97.54   98.11   97.99   97.67   99      98.34   97.31
3       97.97   96.82   98.41   98.67   98.3    97.71   35.24   98.62
4       97.95   97.2    35.04   98.01   97.36   98.67   98.35   97.64
5       98.95   97.89   35.04   97.66   93.33   97.02   98.04   97.62
6       98.62   36.14   100     98.01   96.1    97.4    99      96.91
7       97.97   96.52   99.03   97.69   97.64   97.7    97.71   36.18
8       99.65   96.53   98.08   98.67   35.78   97.7    97.05   95.64
9       99.31   36.06   100     98.32   97.67   96.73   95.63   96.96
Aver. % 91.62   83.43   84.64   91.24   90.28   97.85   90.68   90.36

Recall, CNN 230 × 230, 16 kHz. 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock

1 71.75 68.25 77 99.75 72.5 74.75 74.5 72.5
2 97.75 69.5 78 73 73.5 74.25 74 72.25
3 72.5 68.5 77.25 74 72.25 74.75 98.5 71.25
4 71.75 69.5 97.75 73.75 73.75 74.25 74.5 72.5
5 70.75 69.75 97.5 73 73.5 73.25 75 71.75
6 71.25 98.75 77 74 74 75 74.5 70.5
7 72.25 69.25 76.25 74 72.25 74.5 74.75 99.5
8 71.5 69.5 76.5 74.25 98.75 74.5 74 71.25
9 72 98 75.75 73.25 73.25 74 76.5 71.75
Aver. % 74.61 75.67 81.44 76.56 75.97 74.36 77.36 74.81
Table xxxx3. Recall for CNN, fs = 16000 Hz, and 4 × DATA

f1-score, CNN 230 × 230, 16 kHz. 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 83.07 79.82 86.27 53.02 83.57 85.06 84.18 82.74
2 52.13 81.17 86.91 83.67 83.88 84.86 84.45 82.93
3 83.33 80.23 86.55 84.57 83.29 84.7 51.91 82.73
4 82.83 81.05 51.58 84.17 83.93 84.74 84.78 83.21
5 82.51 81.46 51.55 83.55 82.24 83.48 84.99 82.71
6 82.73 52.91 87.01 84.33 83.62 84.75 85.02 81.62
7 83.17 80.64 86.16 84.21 83.05 84.54 84.7 53.07
8 83.26 80.81 85.96 84.74 52.53 84.54 83.97 81.66
9 83.48 52.72 86.2 83.95 83.71 83.85 85 82.47
Aver. % 79.61 74.53 78.69 80.69 79.98 84.5 81 79.24
Table xxxx4. f1-score for CNN, fs = 16000 Hz, and 4 × DATA

For the CNN model, Precision and f1-score reach their highest values for the International genre, but Recall has its lowest value for this genre and its highest value for the Folk genre. Precision and f1-score both have their lowest values for the Experimental genre.

The results for the GRU model with image size (230 × 230), fs = 16000 Hz, and 4 × DATA are as follows.
Table xxxx1. Accuracy and AUC for GRU, fs = 16000 Hz, and 4 × DATA

Fold    Acc.    AUC
1       76.69   0.964
2       75.19   0.961
3       75.34   0.962
4       75.62   0.962
5       75.41   0.963
6       76.03   0.96
7       75.31   0.961
8       75.75   0.962
9       75.84   0.962
Aver. % 75.69   0.962

Table xxxx2. Precision for GRU, fs = 16000 Hz, and 4 × DATA

Fold    Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       99.33   95.85   97.81   36.15   97.6    99.33   97.73   97.97
2       97.93   96.45   35      96.97   95.39   97.69   96.43   97.61
3       98.26   93.81   34.98   98.33   95.33   98.03   95.92   98.26
4       97.3    98.18   35.08   97.04   95.75   98.99   97.08   98.63
5       96.98   95.16   34.79   96.72   97.25   98.65   96.82   98.96
6       99.3    96.17   96.58   97.03   96.66   35.75   97.08   97.6
7       98.62   96.11   35.06   97.98   95.07   97.39   96.42   97.6
8       99.32   93.6    99.36   98.33   96.97   98.68   34.59   97.57
9       98.97   96.15   99.04   98.32   96.69   98.32   34.94   97.95
Aver. % 98.45   95.72   63.08   90.76   96.3    91.43   83      98.02

Recall, GRU, Fs = 16kHz, 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 73.75 69.25 78 99.5 71.25 74 75.5 72.25
2 71 68 98.25 72 72.5 74 74.25 71.5
3 70.5 68.25 97.25 73.5 71.5 74.75 76.5 70.5
4 72 67.25 98.5 73.75 73.25 73.75 74.75 71.75
5 72.25 68.75 97.25 73.75 70.75 73.25 76 71.25
6 71.25 69 77.75 73.5 72.25 98.5 74.75 71.25
7 71.5 68 98.25 72.75 72.25 74.5 74 71.25
8 72.5 69.5 77.5 73.5 72 74.5 96.25 70.25
9 71.75 68.75 77.25 73 73 73.25 98 71.75
Aver. % 71.83 68.53 88.89 76.14 72.08 76.72 80 71.31
Table xxxx3. Recall for GRU, fs = 16000 Hz, and 4 × DATA

f1-score, GRU, Fs = 16kHz, 4 × DATA


Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 84.65 80.41 86.79 53.03 82.37 84.81 85.19 83.17
2 82.32 79.77 51.61 82.64 82.39 84.21 83.9 82.54
3 82.1 79.02 51.46 84.12 81.71 84.82 85.12 82.1
4 82.76 79.82 51.74 83.81 83 84.53 84.46 83.07
5 82.81 79.83 51.25 83.69 81.91 84.07 85.15 82.85
6 82.97 80.35 86.15 83.64 82.69 52.46 84.46 82.37
7 82.9 79.65 51.68 83.5 82.1 84.42 83.73 82.37
8 83.82 79.77 87.08 84.12 82.64 84.9 50.89 81.69
9 83.19 80.17 86.8 83.79 83.19 83.95 51.51 82.83
Aver. % 83.06 79.87 67.17 80.26 82.44 80.91 77.16 82.55
Table xxxx4. f1-score for GRU, fs = 16000 Hz, and 4 × DATA

For the GRU model, Precision and f1-score reach their highest values for the Electronic genre, while Recall reaches its highest value for the Folk genre. Precision and f1-score have their lowest values for the Folk genre, while Recall has its lowest value for the Experimental genre.

Figure xxx. Summary of model accuracy (%) at a 16000 Hz sampling rate with 4 × DATA: DenseNet121: 98.97; DenseNet169: 98.95; DenseNet201: 98.85; CNN: 76.35; GRU: 75.69.

Thus, DenseNet121 gives the highest accuracy with 4 × DATA, and this accuracy is superior to most of the studies given in Table XX, except [].
Figure xxx illustrates the classification results of DenseNet121 with 4 × DATA: the variation of Training Loss and Validation Loss over epochs, the Confusion Matrix, and the ROC Curves for Fold 6.

Figure xx a) Variation of Training Loss Figure xx b) Confusion Matrix Figure xx c) ROC Curves
and Validation Loss according to epoch

Experiments were also carried out using DenseNet169 with 3 × DATA, 2 × DATA, and no DA (data augmentation).

The following are the results for DenseNet169, fs = 16000 Hz, 230 × 230 Mel-spectrum images, and 3 × DATA (original data (8000 files), noise-added original data (8000 files), and echoed original data (8000 files)).

DenseNet169, 16kHz, 3 × Precision, DenseNet169, 3 × DATA, 16kHz
DATA Folds
Elec. Exp. Folk HH Instr. Intl. Pop Rock
Folds Acc AUC 1 95.07 94.96 97.72 98.68 98.4 96.93 96.52 98.24
1 97.08 0.983 2 97.31 97.03 97.36 99.02 95.22 99.08 96.26 95.9
2 97.17 0.984 3 95.59 90.39 94.86 96.12 95.38 96.63 94.81 96.85
3 95.13 0.972 4 97.62 93.24 97.67 96.7 94.32 97.55 94.86 96.52
4 96.08 0.978 5 98.6 92.47 94.65 96.14 94.9 97.23 97.15 96.17
5 95.92 0.976 6 95.36 91.93 97.02 99.67 95.19 97.22 95.8 96.9
6 96.17 0.978 7 96.96 89.86 97.37 97.09 95.85 97.45 91.35 94.81
7 95.17 0.972 8 96.94 93.04 94.06 96.42 94.29 95.36 96.96 92.46
8 94.92 0.971 9 95.65 88.51 95.39 97.71 95.13 97.51 94.64 95.45
9 95.04 0.972 Aver.
Aver. 0.97622 % 96.57 92.38 96.23 97.51 95.41 97.22 95.37 95.92
95.85
% 2
Table xxxx1. Accuracy and AUC Table xxxx2. Precision for DenseNet169, fs = 16000 Hz, and 3 × DATA
for DenseNet169, fs = 16000 Hz,
and 3 × DATA
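The AUC values in Table xxxx1 measure the area under the ROC curve per fold. For a binary genre-vs-rest decision, AUC equals the probability that a positive example is scored above a negative one, which can be computed directly from pairwise score comparisons. The sketch below illustrates this; the one-vs-rest framing is an assumption for illustration (the paper's AUC presumably comes from scikit-learn, which it cites).

```python
import numpy as np

def roc_auc(y_true, scores):
    """AUC = probability that a positive example is scored above a
    negative one, with ties counted as half."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    # Compare every positive score against every negative score
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

# Perfect separation of the two classes gives AUC = 1.0
auc_perfect = roc_auc([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
```

An uninformative scorer yields AUC near 0.5, which is why the values around 0.97 in Table xxxx1 indicate strong genre separation.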

Recall, DenseNet169, 3 × DATA, 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 96.01 97.06 98.04 97.7 98.71 96.93 94.86 97.21
2 96.01 95.96 96.41 99.02 96.14 98.77 96.92 97.91
3 93.69 93.38 96.41 97.38 92.93 96.63 93.84 96.52
4 95.35 96.32 95.75 96.07 96.14 97.55 94.86 96.52
5 93.36 94.85 98.37 98.03 95.82 96.93 93.49 96.17
6 95.68 96.32 95.75 97.7 95.5 96.63 93.84 97.91
7 95.35 94.49 96.73 98.36 96.46 93.87 90.41 95.47
8 94.68 93.38 98.37 97.05 95.5 94.48 87.33 98.26
9 95.02 96.32 94.77 98.03 94.21 96.01 90.75 95.12
Aver. % 95.02 95.34 96.73 97.7 95.71 96.42 92.92 96.79
Table xxxx3. Recall for DenseNet169, fs = 16000 Hz, and 3 × DATA

f1-score, DenseNet169, 3 × DATA, 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 95.54 96 97.88 98.19 98.56 96.93 95.68 97.72
2 96.66 96.49 96.88 99.02 95.68 98.92 96.59 96.9
3 94.63 91.86 95.62 96.74 94.14 96.63 94.32 96.68
4 96.47 94.76 96.7 96.38 95.22 97.55 94.86 96.52
5 95.9 93.65 96.47 97.08 95.36 97.08 95.29 96.17

6 95.52 94.08 96.38 98.68 95.35 96.92 94.81 97.4
7 96.15 92.11 97.05 97.72 96.15 95.63 90.88 95.14
8 95.8 93.21 96.17 96.73 94.89 94.92 91.89 95.27
9 95.33 92.25 95.08 97.87 94.67 96.75 92.66 95.29
Aver. % 95.78 93.82 96.47 97.6 95.56 96.81 94.11 96.34
Table xxxx4. f1-score for DenseNet169, fs = 16000 Hz, and 3 × DATA

The results for DenseNet169, Fs = 16 kHz, mel spectrum image 230 × 230, 2 × DATA: original data (8000
files) + noise-added original data (8000 files)

DenseNet169, 16kHz, 2 × DATA

Folds   Acc (%)   AUC
1       83.37     0.904
2       82.13     0.897
3       87.44     0.927
4       87.88     0.93
5       85.31     0.915
6       76.31     0.862
7       85.75     0.917
8       83.13     0.902
9       85.44     0.916
Aver.   84.08     0.907778
Table xxxx1. Accuracy and AUC for DenseNet169, fs = 16000 Hz, and 2 × DATA

Precision, DenseNet169, 2 × DATA, 16kHz

Folds   Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       84.31   73.74   83.76   94.91   81.25   85.92   70.65   91.81
2       79.17   69.9    90.45   92.59   80.69   87.3    74.3    83.06
3       84.4    82.26   89.42   95.89   84.52   88      84.57   89.84
4       89.86   77.51   88.71   94.64   87.17   91.1    85.54   87.96
5       82.21   75.9    87.37   93.95   83.4    86.47   81.76   91.28
6       77.31   67.74   80.95   92.45   69.67   82.2    60.8    77.96
7       86.12   75      90.61   93.21   85.46   91.1    75.14   88.71
8       81.02   67.96   88.77   94.86   82.73   88.21   73.86   87.1
9       84.04   80.2    89.06   97.12   85.39   85.29   73.54   88.2
Aver. % 83.16   74.47   87.68   94.4    82.25   87.29   75.57   87.32
Table xxxx2. Precision for DenseNet169, fs = 16000 Hz, and 2 × DATA

Recall, DenseNet169, 2 × DATA, 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 81.52 79.35 82.91 93.18 81.25 89.85 72.63 84.41
2 81.04 78.26 80.9 90.91 83.93 83.76 74.3 81.72
3 87.2 83.15 84.92 95.45 90.18 89.34 76.54 90.32
4 88.15 88.04 82.91 96.36 87.95 88.32 79.33 90.32
5 81.04 80.43 86.93 91.82 87.5 90.86 77.65 84.41
6 79.15 68.48 76.88 89.09 75.89 79.7 59.78 77.96
7 85.31 83.15 82.41 93.64 86.61 88.32 75.98 88.71
8 82.94 76.09 83.42 92.27 81.25 87.31 72.63 87.1
9 84.83 85.87 85.93 91.82 83.48 88.32 77.65 84.41
Aver. % 83.46 80.31 83.02 92.73 84.23 87.31 74.05 85.48
Table xxxx3. Recall for DenseNet169, fs = 16000 Hz, and 2 × DATA

f1-score, DenseNet169, 2 × DATA, 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 82.89 76.44 83.33 94.04 81.25 87.84 71.63 87.96
2 80.09 73.85 85.41 91.74 82.28 85.49 74.3 82.38
3 85.78 82.7 87.11 95.67 87.26 88.66 80.35 90.08
4 89 82.44 85.71 95.5 87.56 89.69 82.32 89.12
5 81.62 78.1 87.15 92.87 85.4 88.61 79.66 87.71
6 78.22 68.11 78.87 90.74 72.65 80.93 60.28 77.96
7 85.71 78.87 86.32 93.42 86.03 89.69 75.56 88.71
8 81.97 71.79 86.01 93.55 81.98 87.76 73.24 87.1
9 84.43 82.94 87.47 94.39 84.42 86.78 75.54 86.26
Aver. % 83.3 77.25 85.26 93.55 83.2 87.27 74.76 86.36
Table xxxx4. f1-score for DenseNet169, fs = 16000 Hz, and 2 × DATA

DenseNet169, Fs = 16 kHz, mel spectrum image 230 × 230, No DA (Data Augmentation) (8000 files for 8
genres)

DenseNet169, No DA, 16kHz

Folds   Acc (%)   AUC
1       61        0.78
2       62.88     0.789
3       59.88     0.773
4       62.75     0.789
5       60.62     0.776
6       60.62     0.777
7       60        0.774
8       63.25     0.791
9       63        0.79
Aver.   61.56     0.782
Table xxxx1. Accuracy and AUC for DenseNet169, fs = 16000 Hz, and No DA

Precision, DenseNet169, No DA, 16kHz

Folds   Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       55.32   61.54   65.69   75.86   55.17   65.48   39.08   73.91
2       56.91   59.43   72.22   69.15   58.18   72.73   47.67   69.3
3       57.72   56.07   78.67   75.29   50      61.05   41.79   63.39
4       59.69   62.75   75.29   69.89   53.1    68.29   47.62   67.86
5       56.1    63      70.65   75      60.44   62.64   38.04   61.42
6       57.6    63.04   71.59   72.83   50.79   63.75   40.74   66.38
7       52.03   55.96   76.25   70.1    55.17   67.5    44.3    62.93
8       64.95   59.5    64.71   72.83   52.46   76.62   50.67   67.54
9       55.88   60      79.27   74.71   62.11   73.75   38.3    67.57
Aver. % 57.36   60.14   72.7    72.85   55.27   67.98   43.13   66.7
Table xxxx2. Precision for DenseNet169, fs = 16000 Hz, and No DA

Recall, DenseNet169, No DA, 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 77.23 48.28 63.21 74.16 67.37 64.71 34 62.96
2 69.31 54.31 61.32 73.03 67.37 65.88 41 73.15

3 70.3 51.72 55.66 71.91 71.58 68.24 28 65.74
4 76.24 55.17 60.38 73.03 63.16 65.88 40 70.37
5 68.32 54.31 61.32 70.79 57.89 67.06 35 72.22
6 71.29 50 59.43 75.28 67.37 60 33 71.3
7 63.37 52.59 57.55 76.4 67.37 63.53 35 67.59
8 62.38 62.07 62.26 75.28 67.37 69.41 38 71.3
9 75.25 59.48 61.32 73.03 62.11 69.41 36 69.44
Aver. % 70.41 54.21 60.27 73.66 65.73 66.01 35.56 69.34
Table xxxx3. Recall for DenseNet169, fs = 16000 Hz, and No DA

f1-score, DenseNet169, No DA, 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 64.46 54.11 64.42 75 60.66 65.09 36.36 68
2 62.5 56.76 66.33 71.04 62.44 69.14 44.09 71.17
3 63.39 53.81 65.19 73.56 58.87 64.44 33.53 64.55
4 66.96 58.72 67.02 71.43 57.69 67.07 43.48 69.09
5 61.61 58.33 65.66 72.83 59.14 64.77 36.46 66.38
6 63.72 55.77 64.95 74.03 57.92 61.82 36.46 68.75
7 57.14 54.22 65.59 73.12 60.66 65.45 39.11 65.18
8 63.64 60.76 63.46 74.03 58.99 72.84 43.43 69.37
9 64.14 59.74 69.15 73.86 62.11 71.52 37.11 68.49
Aver. % 63.06 56.91 65.75 73.21 59.83 66.9 38.89 67.89
Table xxxx4. f1-score for DenseNet169, fs = 16000 Hz, and No DA

In summary, the MGC accuracy of the DenseNet169 model as a function of the augmented data size is
presented in Fig. xxx.

Figure xxx. MGC accuracy of the DenseNet169 model depending on the data size: No DA 61.56%,
2 × DATA 84.08%, 3 × DATA 95.85%, 4 × DATA 98.95%

Experiments with data augmentation using pitch shifting

3 × DATA2 = original data (8000 files) + noise-added original data (8000 files) + original data shifted up by a
semitone (8000 files)

The results show that accuracy decreases in comparison with ORG + ORGNoise + ECHO, and increases
only slightly in comparison with ORG + ORGNoise.
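Shifting a signal up by n semitones scales all of its frequencies by 2^(n/12) (about 1.0595 for one semitone). The paper cites librosa, whose `librosa.effects.pitch_shift` performs this shift while preserving duration; the crude resampling sketch below only illustrates the frequency-scaling step and, unlike librosa, shortens the clip, so it is an assumption-laden illustration rather than the paper's implementation.

```python
import numpy as np

def semitone_rate(n_steps):
    """Frequency ratio for a shift of n_steps semitones."""
    return 2.0 ** (n_steps / 12.0)

def naive_pitch_shift(x, n_steps):
    """Resample by the semitone ratio via linear interpolation.
    NOTE: this changes the signal length; a real pitch shifter
    (e.g. librosa.effects.pitch_shift) also time-stretches it back."""
    rate = semitone_rate(n_steps)
    old_idx = np.arange(len(x))
    new_len = int(round(len(x) / rate))
    new_idx = np.arange(new_len) * rate
    return np.interp(new_idx, old_idx, x)

x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 440 Hz tone, 16 kHz
up_one = naive_pitch_shift(x, 1)  # content near 466 Hz, slightly shorter
```

For the later experiments, n_steps = 1 corresponds to a semitone shift and n_steps = 2 to a whole tone.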

DenseNet169, 3 × DATA2, Semitone, fs = 16kHz

Folds   Acc (%)   AUC
1       84.92     0.913
2       87.79     0.93
3       82.63     0.9
4       88.04     0.931
5       86.79     0.924
6       84.42     0.91
7       82        0.896
8       79.87     0.884
9       82.87     0.901
Aver.   84.37     0.909889
Table xxxx1. Accuracy and AUC for DenseNet169, fs = 16000 Hz, and 3 × DATA2

Precision, DenseNet169, 3 × DATA2, Semitone, fs = 16kHz

Folds   Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       84.49   76.87   87.54   94.02   81.39   91.64   75.26   86.93
2       89.97   77.42   89.84   94.02   83.86   94.12   84.76   88.15
3       86.29   70.86   84.36   93.96   80.77   84.71   73.98   85.35
4       86.52   79.11   88.01   95.59   90.34   92.77   82.64   88.97
5       89.04   81.88   88.44   94.1    82.41   92.26   80.63   84.64
6       85.71   74.73   85.95   93.2    80.37   90.71   78.14   85.09
7       86.51   68.58   84.94   92.05   82.47   89.06   66.21   85.36
8       83.01   67.92   86.81   91.7    76.54   86.79   64.89   80.35
9       82.5    71.53   86.32   91.21   81.63   91.61   73.33   83.39
Aver. % 86      74.32   86.91   93.32   82.2    90.41   75.54   85.36
Table xxxx2. Precision for DenseNet169, fs = 16000 Hz, and 3 × DATA2

Recall, DenseNet169, 3 × DATA2, Semitone, fs = 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 85.05 79.41 87.25 92.79 82.96 90.8 73.97 85.71
2 86.38 88.24 89.54 92.79 85.21 93.25 78.08 88.15
3 85.71 78.68 84.64 91.8 81.03 88.34 68.15 81.18
4 91.69 84.93 91.18 92.46 84.24 90.49 81.51 87.11
5 89.04 83.09 84.97 94.1 85.85 91.41 78.42 86.41
6 87.71 77.21 83.99 94.43 84.24 89.88 74.66 81.53
7 83.06 74.63 86.6 91.15 81.67 87.42 66.44 83.28
8 84.39 73.16 81.7 86.89 79.74 88.65 62.67 79.79
9 87.71 75.74 86.6 91.8 77.17 87.12 71.58 83.97
Aver. % 86.75 79.45 86.27 92.02 82.46 89.71 72.83 84.13
Table xxxx3. Recall for DenseNet169, fs = 16000 Hz, and 3 × DATA2

f1-score, DenseNet169, 3 × DATA2, Semitone, fs = 16kHz

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 84.77 78.12 87.4 93.4 82.17 91.22 74.61 86.32
2 88.14 82.47 89.69 93.4 84.53 93.68 81.28 88.15
3 86 74.56 84.5 92.87 80.9 86.49 70.94 83.21
4 89.03 81.91 89.57 94 87.19 91.61 82.07 88.03
5 89.04 82.48 86.67 94.1 84.09 91.83 79.51 85.52
6 86.7 75.95 84.96 93.81 82.26 90.29 76.36 83.27
7 84.75 71.48 85.76 91.6 82.07 88.24 66.32 84.3
8 83.69 70.44 84.18 89.23 78.11 87.71 63.76 80.07
9 85.02 73.57 86.46 91.5 79.34 89.31 72.44 83.68
Aver. % 86.35 76.78 86.58 92.66 82.3 90.04 74.14 84.73
Table xxxx4. f1-score for DenseNet169, fs = 16000 Hz, and 3 × DATA2

DenseNet169, Fs = 16 kHz, mel spectrum image 230 × 230, 5 × DATA3 (40000 files for 8 genres):
ORG + ORGNoise + ECHO + ECHONoise + Shift Up Semitone (n_steps = 1)

DenseNet169, 5 × DATA3

Folds   Acc (%)   AUC
1       95.8      0.976
2       95.95     0.977
3       94.23     0.967
4       96.37     0.979
5       95.77     0.976
6       95.27     0.973
7       94.2      0.967
8       95.83     0.976
9       93.97     0.966
Aver.   95.27     0.973
Table xxxx1. Accuracy and AUC for DenseNet169, fs = 16000 Hz, and 5 × DATA3

Precision, DenseNet169, 5 × DATA3 (ORG + ORGNoise + ECHO + ECHONoise + Shift Up Semitone)

Folds   Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       96.71   92.23   94.82   99.15   95.09   98.22   94.99   95.42
2       97.47   87.82   96.46   99.36   94.55   98.61   97.28   97.07
3       94.85   90.71   92.65   96.44   94.29   97.8    94.07   93.08
4       96.54   95.09   95.66   99.14   96.51   98.21   96.71   93.45
5       95.59   94.07   96.03   99.79   94.8    98.02   95.76   92.5
6       96.32   92.64   94.82   97.7    93.69   98.03   94.28   94.81
7       95.09   89.45   96.46   97.62   93.16   98.17   90.8    93.66
8       97.05   93.47   95.1    98.3    95.33   97.08   95.16   95.24
9       95.83   88.58   94.47   95.78   93.83   98.79   94.02   91.12
Aver. % 96.16   91.56   95.16   98.14   94.58   98.1    94.79   94.04
Table xxxx2. Precision for DenseNet169, fs = 16000 Hz, and 5 × DATA3

Recall, DenseNet169, 5 × DATA3

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 96.71 94.81 94.63 98.1 95.09 97.83 93.37 95.99
2 96.71 96.41 95.66 97.68 95.71 97.64 94.89 92.99
3 96.13 91.62 93.8 97.05 94.48 96.46 90.15 94.39
4 97.1 96.61 95.66 96.84 96.11 97.05 94.51 97.19
5 96.32 95.01 95.04 98.95 93.25 97.24 94.13 96.39
6 96.13 95.41 94.63 98.73 94.07 97.83 90.53 95.19

7 93.62 94.81 95.66 95.36 94.68 95.28 89.77 94.79
8 95.55 94.21 96.28 97.47 96.11 98.03 92.99 96.19
9 93.42 94.41 95.25 95.78 93.25 96.06 89.39 94.59
Aver. % 95.74 94.81 95.18 97.33 94.75 97.05 92.19 95.3
Table xxxx3. Recall for DenseNet169, fs = 16000 Hz, and 5 × DATA3

f1-score, DenseNet169, 5 × DATA3

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 96.71 93.5 94.73 98.62 95.09 98.03 94.17 95.7
2 97.09 91.91 96.06 98.51 95.12 98.12 96.07 94.98
3 95.49 91.16 93.22 96.74 94.38 97.13 92.07 93.73
4 96.82 95.84 95.66 97.97 96.31 97.62 95.59 95.28
5 95.95 94.54 95.53 99.36 94.02 97.63 94.94 94.41
6 96.22 94 94.73 98.22 93.88 97.93 92.37 95
7 94.35 92.05 96.06 96.48 93.91 96.7 90.29 94.22
8 96.3 93.84 95.69 97.88 95.72 97.55 94.06 95.71
9 94.61 91.4 94.86 95.78 93.54 97.41 91.65 92.82
Aver. % 95.95 93.14 95.17 97.73 94.66 97.57 93.47 94.65
Table xxxx4. f1-score for DenseNet169, fs = 16000 Hz, and 5 × DATA3

With 5 × DATA3, DenseNet169 reaches 95.27% accuracy, higher than DenseNet121 with the same data
(94.67%) but lower than DenseNet169 with 4 × DATA (98.95%).

DenseNet121, Fs = 16 kHz, mel spectrum image 230 × 230, 5 × DATA3 (40000 files for 8 genres):
ORG + ORGNoise + ECHO + ECHONoise + Shift Up Pitch by a Semitone

DenseNet121, Fs = 16kHz, 5 × DATA3, n_steps = 1

Folds   Acc (%)   AUC
1       93.2      0.961
2       95.47     0.974
3       93.87     0.965
4       94.63     0.969
5       95.15     0.972
6       94.23     0.967
7       95.23     0.973
8       94.85     0.971
9       95.43     0.974
Aver.   94.67     0.969556
Table xxxx1. Accuracy and AUC for DenseNet121, fs = 16000 Hz, and 5 × DATA3

Precision, DenseNet121, Fs = 16kHz, 5 × DATA3, n_steps = 1

Folds   Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       95.67   88.44   93.96   96.04   92.95   95.28   90.82   92.79
2       96.5    92.05   93.91   99.35   92.6    98.41   95.69   95.59
3       95.5    91.57   93.56   95.21   91.87   96.97   91.67   94.83
4       96.15   92.16   94.06   97.85   94.67   98.17   91.89   92.49
5       95.78   89.04   96.65   97.65   95.62   97.82   94.04   95.2
6       95.89   89.35   93.78   96.83   94      98.18   92.35   93.98
7       96.48   91.73   95.39   98.51   92.86   98.61   94.29   94.23
8       95.22   91.11   95.79   98.31   93.71   98.41   91.15   95.56
9       95.26   92.04   95.28   97.85   96.81   97.43   94.48   94.65
Aver. % 95.83   90.83   94.71   97.51   93.9    97.7    92.93   94.37
Table xxxx2. Precision for DenseNet121, fs = 16000 Hz, and 5 × DATA3

Recall, DenseNet121, Fs = 16kHz, 5 × DATA3, n_steps = 1

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 94 91.62 93.18 97.26 91.62 95.47 89.96 92.79
2 95.94 94.81 95.66 97.26 94.68 97.64 92.42 95.59
3 94.39 93.21 92.98 96.41 92.43 94.49 91.67 95.59
4 96.71 93.81 94.83 96.2 94.48 95.08 92.23 93.79
5 96.52 94.01 95.45 96.41 93.66 97.24 92.61 95.39
6 94.78 95.41 93.39 96.62 92.84 95.67 91.48 93.79
7 95.55 93.01 94.01 97.47 95.71 97.44 93.75 94.99
8 96.32 92.02 94.01 98.1 94.48 97.44 91.67 94.99
9 97.1 94.61 95.87 96.2 93.05 96.85 93.94 95.79
Aver. % 95.7 93.61 94.38 96.88 93.66 96.37 92.19 94.75
Table xxxx3. Recall for DenseNet121, fs = 16000 Hz, and 5 × DATA3

f1-score, DenseNet121, Fs = 16kHz, 5 × DATA3, n_steps = 1

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 94.83 90 93.57 96.65 92.28 95.38 90.39 92.79
2 96.22 93.41 94.78 98.29 93.63 98.02 94.03 95.59
3 94.94 92.38 93.26 95.81 92.15 95.71 91.67 95.21
4 96.43 92.98 94.44 97.02 94.58 96.6 92.06 93.13
5 96.15 91.46 96.05 97.03 94.63 97.53 93.32 95.3
6 95.33 92.28 93.58 96.73 93.42 96.91 91.91 93.88
7 96.02 92.37 94.69 97.99 94.26 98.02 94.02 94.61
8 95.77 91.56 94.89 98.2 94.09 97.92 91.41 95.28
9 96.17 93.31 95.57 97.02 94.89 97.14 94.21 95.22
Aver. % 95.76 92.19 94.54 97.19 93.77 97.03 92.56 94.56
Table xxxx4. f1-score for DenseNet121, fs = 16000 Hz, and 5 × DATA3

DenseNet121, Fs = 16 kHz, mel spectrum image 230 × 230, 5 × DATA4 (40000 files for 8 genres):
ORG + ORGNoise + ECHO + ECHONoise + Shift Up Pitch by a Tone
DenseNet121, Fs = 16kHz, 5 × DATA4, n_steps = 2

Folds   Acc (%)   AUC
1       94.63     0.969
2       93.73     0.964
3       96.05     0.977
4       93.63     0.964
5       94.9      0.971
6       93.8      0.965
7       93.07     0.961
8       93.87     0.965
9       95.43     0.974
Aver.   94.35     0.967778
Table xxxx1. Accuracy and AUC for DenseNet121, fs = 16000 Hz, and 5 × DATA4

Precision, DenseNet121, Fs = 16kHz, 5 × DATA4, n_steps = 2

Folds   Elec.   Exp.    Folk    HH      Instr.  Intl.   Pop     Rock
1       95.76   90.82   94.74   97.9    93.33   98.4    92.41   93.99
2       94.19   89.75   94.74   97.02   93.28   97.21   91.67   92.4
3       95.64   93.74   94.96   98.3    97.04   98.17   94.91   96.01
4       95.51   88.89   95.15   97.05   91.02   96.06   92.72   92.93
5       95.96   91.6    92.77   98.06   94.68   98.6    93.49   94.38
6       95.69   88.24   95.32   97.45   92.4    96.83   93.55   91.49
7       94.52   83.91   96.07   96.23   93.54   97.02   91.71   92.84
8       95.36   86.78   95.76   96.41   94.02   95.9    93.79   93.65
9       95.61   90.02   94.99   98.93   95.41   98.6    96.32   94.07
Aver. % 95.36   89.31   94.94   97.48   93.86   97.42   93.4    93.53
Table xxxx2. Precision for DenseNet121, fs = 16000 Hz, and 5 × DATA4

Recall, DenseNet121, Fs = 16kHz, 5 × DATA4, n_steps = 2

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 96.13 94.81 92.98 98.31 94.48 96.65 89.96 93.99
2 94 92.61 92.98 96.2 93.66 96.06 89.58 94.99
3 97.49 95.61 97.31 97.68 93.87 94.88 95.27 96.39
4 94.58 91.02 93.18 97.26 93.25 96.06 91.67 92.18
5 96.52 95.81 95.45 95.78 94.68 97.24 89.77 94.19
6 94.58 92.81 92.56 96.84 92.02 96.26 90.72 94.79
7 93.42 91.62 90.91 97.05 91.82 96.26 90.15 93.59
8 95.36 93.01 93.39 96.41 93.25 96.65 88.64 94.59
9 96.91 95.41 94.01 97.26 93.46 96.85 94.13 95.39
Aver. % 95.44 93.63 93.64 96.98 93.39 96.32 91.1 94.46
Table xxxx3. Recall for DenseNet121, fs = 16000 Hz, and 5 × DATA4

f1-score, DenseNet121, Fs = 16kHz, 5 × DATA4, n_steps = 2

Folds Elec. Exp. Folk HH Instr. Intl. Pop Rock
1 95.95 92.77 93.85 98.11 93.9 97.52 91.17 93.99
2 94.09 91.16 93.85 96.61 93.47 96.63 90.61 93.68
3 96.55 94.66 96.12 97.99 95.43 96.5 95.09 96.2
4 95.04 89.94 94.15 97.15 92.12 96.06 92.19 92.56
5 96.24 93.66 94.09 96.91 94.68 97.92 91.59 94.28
6 95.14 90.47 93.92 97.14 92.21 96.54 92.12 93.11
7 93.97 87.6 93.42 96.64 92.67 96.64 90.93 93.21
8 95.36 89.79 94.56 96.41 93.63 96.27 91.14 94.12
9 96.25 92.64 94.5 98.09 94.42 97.72 95.21 94.73
Aver. % 95.4 91.41 94.27 97.23 93.61 96.87 92.23 93.99
Table xxxx4. f1-score for DenseNet121, fs = 16000 Hz, and 5 × DATA4

Thus, data augmentation by pitch shifting by one tone does not increase the recognition accuracy: for
DenseNet121, accuracy is 94.35% with n_steps = 2 versus 94.67% with n_steps = 1. This is also
consistent with the comment in [12]: "With a small change of pitch of a song, its classification still
works"; [12] shifted the pitch of songs by only half a tone.
VI. Conclusions

Our research results presented in this paper show that choosing an appropriate deep learning model
together with an efficient data augmentation method makes it possible to achieve MGC accuracy that
outperforms most of the available studies on the same Small FMA dataset. The music genre
classification problem in our case becomes an image recognition problem in which each musical
work is characterized by its mel spectrogram. Reducing the sampling frequency of the original audio
files from 44100 Hz to 16000 Hz reduces the size of the image to be recognized and increases the
recognition accuracy. Among the deep learning models evaluated, DenseNet121 gives the highest
recognition accuracy with 4 × augmented data. Our experiments also show that data augmentation by
echoing is more effective than pitch shifting in increasing recognition accuracy.
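The mel spectrogram images at the core of this formulation are built from a mel filterbank applied to the power spectrum. The sketch below constructs such a filterbank for 16000 Hz audio; it is a minimal illustration of what librosa's mel filter routine computes, and the HTK-style mel formula and the choice of 128 bands are assumptions for illustration, not the paper's exact settings.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

def mel_filterbank(sr=16000, n_fft=2048, n_mels=128):
    """Triangular filters centered on a mel-spaced frequency grid.
    Returns an array of shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.linspace(0, sr / 2, n_fft // 2 + 1)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    hz_pts = mel_to_hz(mel_pts)
    fb = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_pts[i], hz_pts[i + 1], hz_pts[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[i] = np.clip(np.minimum(rising, falling), 0, None)
    return fb

fb = mel_filterbank()
# A mel power spectrogram is then fb @ |STFT|**2, rendered as the input image.
```

Halving the sampling rate halves the frequency range the filterbank must cover, which is one way the 16000 Hz downsampling shrinks the image fed to the network.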

References
1. Defferrard, M., Benzi, K., Vandergheynst, P., & Bresson, X. (2016). FMA: A dataset for
music analysis. arXiv preprint arXiv:1612.01840
2. Pimenta-Zanon, M.H., Bressan, G.M. and Lopes, F.M., 2021. Complex Network-Based
Approach for Feature Extraction and Classification of Musical Genres. arXiv preprint
arXiv:2110.04654.
3. Ke Chen, Beici Liang, “Do User Preference Data Benefit Music Genre Classification
Tasks?”, in Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal,
Canada, 2020
4. Kostrzewa, D., Kaminski, P. and Brzeski, R., 2021, June. Music Genre Classification:
Looking for the Perfect Network. In International Conference on Computational Science (pp.
55-67). Springer, Cham.
5. Matocha, M. and Zieliński, S.K., 2018. Music Genre Recognition Using Convolutional
Neural Networks. Advances in Computer Science Research, vol. 14, pp. 125-142.
DOI: 10.24427/acsr-2018-vol14-0008
6. Chillara, S., Kavitha, A.S., Neginhal, S.A., Haldia, S. and Vidyullatha, K.S., 2019. Music
genre classification using machine learning algorithms: a comparison. Int. Res. J. Eng.
Technol.(IRJET), 6(05), pp.851-858.
7. Gunawan, A. A., & Suhartono, D. (2019). Music recommender system based on genre using
convolutional recurrent neural networks. Procedia Computer Science, 157, 99-109.
8. Qin, Y., & Lerch, A. (2019, May). Tuning Frequency Dependency in Music Classification. In
ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP) (pp. 401-405). IEEE.
9. Kostrzewa, D., Ciszynski, M., & Brzeski, R. (2022, July). Evolvable hybrid ensembles for
musical genre classification. In Proceedings of the Genetic and Evolutionary Computation
Conference Companion (pp. 252-255).
10.Heakl, Ahmed, Abdelrahman Abdelgawad, and Victor Parque. "A Study on Broadcast
Networks for Music Genre Classification." arXiv preprint arXiv:2208.12086 (2022).
11.Choudhary, V. and Vyas, A., CS543: Music Genre Recognition through Audio
Samples. small, 8(8), p.30.
12.Bian, W., Wang, J., Zhuang, B., Yang, J., Wang, S. and Xiao, J., 2019, August. Audio-based
music classification with DenseNet and data augmentation. In Pacific Rim International
Conference on Artificial Intelligence (pp. 56-65). Springer, Cham.
13.Park, Jiyoung, Jongpil Lee, Jangyeon Park, Jung-Woo Ha and Juhan Nam. “Representation
Learning of Music Using Artist Labels.” ISMIR (2018)
14.McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg,
and Oriol Nieto. “librosa: Audio and music signal analysis in python.” In Proceedings of the
14th python in science conference, pp. 18-25. 2015
15.https://www.it4nextgen.com/keras-image-classification-models, last accessed October
1, 2022
16.https://keras.io/api/applications/densenet/, last accessed October 1, 2022
17.Huang, Gao, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. "Densely
connected
convolutional networks." In Proceedings of the IEEE conference on computer vision and
pattern recognition, pp. 4700-4708. 2017.
18.Trinh Van, Loan, Thuy Dao Thi Le, Thanh Le Xuan, and Eric Castelli. "Emotional Speech
Recognition Using Deep Neural Networks." Sensors 22, no. 4 (2022): 1414.
19.James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning:
With Applications in R, 7th ed.; Springer: Berlin/Heidelberg, Germany, 2017.
20.Bhandari, A. AUC-ROC Curve in Machine Learning Clearly Explained, 16 June 2020.
Available online: https://www.analyticsvidhya.com/blog/2020/06/auc-roc-curve-machine-learning/
(accessed on October 1, 2021).
21.Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.;
Prettenhofer, P.; Weiss, R.; Dubourg, V.;et al. Scikit-learn: Machine learning in Python. J.
Mach. Learn. Res. 2011, 12, 2825–2830.
22.McFee, B.; Raffel, C.; Liang, D.; Ellis, D.P.; McVicar, M.; Battenberg, E.; Nieto, O. Librosa:
Audio and music signal analysis in python. In Proceedings of the 14th Python in science
conference, Austin, TX, USA, 6–12 July 2015; pp. 18–25.

Table XXXXX. Summary of related studies (No — Year — Reference — Corpus — Model — Parameters — Accuracy)

1. 2019 — Bian, W., Wang, J., Zhuang, B., Yang, J., Wang, S. and Xiao, J. Audio-based music
classification with DenseNet and data augmentation. PRICAI 2019, pp. 56-65. Springer, Cham. —
Corpus: GTZAN, FMA small. Model: DenseNet. Parameters: spectrogram (128 × 128), 10 segments,
each segment 2.56 s. Accuracy: 68.9 (FMA), 90.2 (GTZAN).

2. 2018 — Hassen, A.K., Janßen, H., Assenmacher, D., Preuss, M. and Vatolkin, I. Classifying Music
Genres Using Image Classification Neural Networks. Archives of Data Science, Series A, Vol. 5,
No. 1, p. 20. KIT Scientific Publishing. — Corpus: GTZAN. Model: ResNet18, NNet2. Parameters:
spectrograms; every track split into 3-second windows; FFT window size 1024 with Hann window and
hop size 512; image size 128 × 513 pixels per 3-second spectrogram; Fs = 22050 Hz. Accuracy
(avg. over 10 folds): 80.4 (NNet2), 84.7 (ResNet18).

3. 2021 — Rafi, Q.G., Noman, M., Prodhan, S.Z., Alam, S. and Nandi, D. Comparative Analysis of
Three Improved Deep Learning Architectures for Music Genre Classification. International Journal
of Information Technology and Computer Science, 13(2), pp. 1-14. — Corpus: GTZAN. Model:
CRNN-TF. Parameters: mel spectrogram; downsampling rate 12 kHz, hop size 256, FFT 512,
96 mel bins. Accuracy: 64.7.

4. 2021 — Pimenta-Zanon, M.H., Bressan, G.M. and Lopes, F.M. Complex Network-Based Approach
for Feature Extraction and Classification of Musical Genres. arXiv:2110.04654. — Corpus: GTZAN,
FMA small. Model: Bayes Network. Parameters: EXAMINNER feature extraction. Accuracy:
99.8 (GTZAN), 99.4 (FMA).

5. — Choudhary, V. and Vyas, A. CS543: Music Genre Recognition through Audio Samples. —
Corpus: FMA small, a subset of the FMA dataset (917 GB of licensed audio, 106,574 tracks, 161
genres; Defferrard et al., 2016); due to processing power limitations the small dataset is used, which
is like GTZAN data: 8,000 tracks of 30 s in 8 balanced genres (electronic, experimental, folk, hiphop,
instrumental, international, pop, rock). Model: TD (LSTM). Parameters: mel spectrogram.
Accuracy: 60.

6. 2020 — Ke Chen, Beici Liang. "Do User Preference Data Benefit Music Genre Classification
Tasks?" In Proc. of the 21st Int. Society for Music Information Retrieval Conf., Montréal, Canada. —
Corpus: GTZAN, FMA small. Model: SVM. Parameters: mel spectrogram. Accuracy:
0.8013 (GTZAN), 0.6148 (FMA).

7. 2019 — Yi, Y., Chen, K.Y. and Gu, H.Y. Mixture of CNN experts from multiple acoustic feature
domain for music genre classification. APSIPA ASC 2019, pp. 1250-1255. IEEE. — Corpus: FMA
small. Model: Mixture of experts (MoE); MoE with a recurrent neural network (RNN) with gated
recurrent unit (GRU) as a mixer model (MoER). Parameters: spectrogram. Accuracy: 55.88% (FMA),
86.40% (GTZAN).

8. 2021 — Kostrzewa, D., Kaminski, P. and Brzeski, R. Music Genre Classification: Looking for the
Perfect Network. ICCS 2021, pp. 55-67. Springer, Cham. — Corpus: FMA small. Model: CNN.
Parameters: mel spectrogram. Accuracy: 56.39.

9. 2018 — Matocha, M. and Zieliński, S.K. Music Genre Recognition Using Convolutional Neural
Networks. Advances in Computer Science Research, vol. 14, pp. 125-142. — Corpus: FMA small.
Model: CNN. Parameters: standard spectrograms (images) using a bank of 128 mel-frequency
filters. Accuracy: 0.605.

10. 2019 — Gunawan, A. A., & Suhartono, D. Music recommender system based on genre using
convolutional recurrent neural networks. Procedia Computer Science, 157, 99-109. — Corpus: FMA
small (only 7 genres, no International). Model: CRNN. Parameters: mel spectrogram; sampling rate
22050, window length 2048, hop length 512, Hanning window filter. Accuracy: Precision 0.712,
Recall 0.804.

11. 2019 — Qin, Y., & Lerch, A. Tuning Frequency Dependency in Music Classification. ICASSP
2019, pp. 401-405. IEEE. — Corpus: GTZAN, FMA small. Model: the CRNN model presented by
Choi (K Choi, G Fazekas, M Sandler, and K Cho, "Convolutional recurrent neural networks for
music classification," in ICASSP, 2017). Parameters: input is the log-amplitude mel spectrogram of
the down-sampled (12 kHz) audio as generated by the library librosa (96 mel bins, hop size 256
samples); resulting input dimensions 96 × 1366 (mel-frequency bands × time frames); data augmented
by changing the pitch. Accuracy: GTZAN 88.0%, FMA 61.9%.

12. 2018 — Park, Jiyoung, Jongpil Lee, Jangyeon Park, Jung-Woo Ha and Juhan Nam.
"Representation Learning of Music Using Artist Labels." ISMIR 2018, Paris, France. — Corpus:
GTZAN, FMA small. Model: Basic model (a widely used 1D-CNN model for music classification)
and a Siamese model (a Siamese neural network consists of twin networks that share weights and
configuration, providing unique inputs to the network and optimizing similarity scores); a supervised
feature learning approach using artist labels annotated in every single track as objective metadata.
Parameters: mel spectrogram with 128 bins in the input layer; each 30-second audio clip divided into
10 segments to match the model input size; 256-dimension features. Accuracy: Basic model —
GTZAN 0.7076, FMA 0.5687; Siamese model — GTZAN 0.7203, FMA 0.5673.

13. 2020 — Wang, Kun-Ching. "Robust audio content classification using hybrid-based SMD and
entropy-based VAD." Entropy 22, no. 2: 183. — Corpus: GTZAN, FMA small. Model: hierarchical
ACC method (algorithm of audio content classification). Parameters: hybrid feature-based SMD and
entropy-based VAD. Accuracy: GTZAN 93.6, FMA 94.5.

14. 2019 — Chillara, S., Kavitha, A.S., Neginhal, S.A., Haldia, S. and Vidyullatha, K.S. Music genre
classification using machine learning algorithms: a comparison. Int. Res. J. Eng. Technol. (IRJET),
6(05), pp. 851-858. — Corpus: FMA small. Model: CNN. Parameters: each audio signal converted
into a MEL spectrogram (a 2D representation with time on the x-axis and MEL frequency bins on
the y-axis); power spectrogram generated using STFT with sampling rate (sr) = 22050, window size
(n_fft) = 2048, hop length = 512, highest frequency (f_max) = 8000. Accuracy: 88.54%.

15. 2022 — Li, Xinyu. "HouseX: A Fine-grained House Music Dataset and its Potential in the Music
Industry." arXiv:2207.11690. — Corpus: GTZAN, FMA small.

16. 2022 — Heakl, Ahmed, Abdelrahman Abdelgawad, and Victor Parque. "A Study on Broadcast
Networks for Music Genre Classification." arXiv:2208.12086. — Corpus: GTZAN, FMA. Model:
Bottom-up Broadcast Neural Networks (BBNN); inspired by BBNN [22] and by the way in which the
ear uses distinct filters to analyze sound characteristics [9], [23], the authors propose broadcast
networks comprising multiple feature extractors that render different scales of feature maps.
Parameters: the mel spectrogram of the audio file fed to the model as input. Accuracy: 58.3.

17. 2022 — Kostrzewa, D., Ciszynski, M., & Brzeski, R. Evolvable hybrid ensembles for musical
genre classification. GECCO 2022 Companion, pp. 252-255. — Corpus: FMA small. Model: hybrid
ensembles formed from deep neural networks and classical classifiers (CRNN05). Parameters: mel
spectrogram. Accuracy: 50.00%.

18. 2022 — Kostrzewa, D., Mazur, W., & Brzeski, R. Wide Ensembles of Neural Networks in Music
Genre Classification. ICCS 2022, pp. 64-71. Springer, Cham. — Corpus: FMA small. Model: wide
ensemble En10 (the best model). Parameters: 500 features in the FMA dataset; the dataset split into
training, validation, and testing sets in ratio 80:10:10. Accuracy: 65.8.