Abstract—Music Emotion Recognition (MER) is an open problem in computer science. Current approaches differ in the way emotions are classified and in the way emotions are predicted. Machine Learning is an interesting possibility for solving this problem, since it does not require that we understand the specific relations between the input parameters and the predicted emotion in order to achieve good results; however, there are currently no models that achieve good performance in this field. To explore this problem further, a Deep Neural Network designed with Keras is trained on parameters extracted with Librosa and asked to predict emotions in a small database, in order to observe the limitations and advantages of Machine Learning methods applied to MER problems.

Index Terms—music, python, sentiment, background

I. INTRODUCTION

According to the literature, music is capable of inducing emotions, which is why a large number of studies have been conducted on the physiological responses and effects that music causes in a person, for example, the selection of a suitable piece of music in music therapy to improve mental health [3].

An important carrier of emotional expression in music is rhythm, within which lies the element of tempo: some studies have related fast tempo with high-arousal emotions and, conversely, slow tempo with lower-arousal emotions. In addition, the numerical value of tempo, that is to say, beats per minute (BPM), also influences emotional response, where a high tempo (150 BPM) is associated with emotions such as happiness and stress, whereas a slow tempo elicits emotions such as sadness and boredom [11]. Fig. 3 shows some emotions classified according to the circumplex model developed by James Russell.
A. Feature extraction
To begin tackling MER using a Deep Neural Network, we must first extract the input parameters from a song. Based on the reported qualities of music that affect emotional impact [11], the following parameters were chosen:
• An estimation of the fundamental frequency (F0) using a modification (pYIN) [9] of the YIN algorithm [4], where, according to [9], the probability that a period $\tau$ is the fundamental period $\tau_0$ is

$$P(\tau = \tau_0 \mid S, x_t) = \sum_{i=1}^{N} a(s_i, \tau)\, P(s_i)\, [Y(x_t, s_i) = \tau]$$

Fig. 3. Model with the classification of selected emotions according to their arousal and valence.
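The Results section lists tempo average and deviation and F0 average and deviation as the extracted parameters. A minimal sketch of computing these with Librosa follows; the loading defaults and frequency bounds are illustrative, since the exact settings are not stated in the text.

import numpy as np
import librosa

def extract_features(path):
    """Tempo and F0 statistics for one song."""
    y, sr = librosa.load(path)

    # Frame-wise tempo estimates in BPM (librosa.feature.tempo in newer
    # versions); aggregate=None keeps one estimate per frame rather than
    # collapsing to a single mean, so a deviation can be computed.
    tempos = librosa.beat.tempo(y=y, sr=sr, aggregate=None)

    # Frame-wise fundamental frequency from pYIN; NaN marks unvoiced frames.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'))
    f0 = f0[~np.isnan(f0)]

    return {'tempo_mean': float(np.mean(tempos)),
            'tempo_std': float(np.std(tempos)),
            'f0_mean': float(np.mean(f0)),
            'f0_std': float(np.std(f0))}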
Between these two layers, a hidden layer with 8 neurons and a ReLU activation function was added to map the inputs to the outputs.
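As a point of reference, this first architecture could be written in Keras roughly as follows. This is a reconstruction rather than the authors' published code: the four-feature input, the four emotion classes, and the sigmoid output (implied by the later comparison against a normalizing output) are assumptions.

from tensorflow import keras
from tensorflow.keras import layers

# Input: the four Librosa statistics; output: one unit per emotion label.
model = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(8, activation='relu'),     # the 8-neuron hidden layer
    layers.Dense(4, activation='sigmoid'),  # first scheme's output layer
])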
To train this model, parameters were extracted from a database of 48 songs, 12 of each label, with a different, hand-chosen set used for validation. The songs were manually labelled according to the emotion identified in each. This is a very small database, which is one of the main limitations of the methodology proposed herein.
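Assuming the extracted parameters and manual labels are stored in the .csv file mentioned in the Results section, the training arrays could be assembled as below; the file name, column names, and label order are hypothetical.

import pandas as pd
from tensorflow.keras.utils import to_categorical

# One row per song: four features plus a manually assigned emotion label.
data = pd.read_csv('features.csv')
X = data[['tempo_mean', 'tempo_std', 'f0_mean', 'f0_std']].to_numpy()

labels = ['angry', 'happy', 'sad', 'calm']
y = to_categorical(data['label'].map(labels.index), num_classes=4)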
For training, the loss function chosen is Multi-Class Cross Entropy, a standard loss function for classification problems, defined as

$$L = -\sum_{x} p(x) \cdot \log(q(x))$$

where p(x) is the real probability of an item being in class 'x', and q(x) is the predicted probability of an item being in class 'x'. This loss function was chosen because we are seeking to maximize performance on all classes, and Cross Entropy punishes a large mistake on any one class more heavily than small mistakes spread across classes.
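A quick numerical illustration of that last point, with made-up predictions for a four-class problem:

import numpy as np

def cross_entropy(p, q):
    # L = -sum_x p(x) * log(q(x))
    return -np.sum(p * np.log(q))

target = np.array([1.0, 0.0, 0.0, 0.0])  # the true class is the first one

confidently_wrong = np.array([0.01, 0.97, 0.01, 0.01])
mildly_wrong = np.array([0.40, 0.30, 0.20, 0.10])

print(cross_entropy(target, confidently_wrong))  # ~4.61: punished heavily
print(cross_entropy(target, mildly_wrong))       # ~0.92: much milder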
Additionally, training is done with the Adam stochastic gradient descent optimizer, which presents good results for a variety of problems. The choice of optimizer is not particularly important for our implementation, as the database is fairly limited. The model was trained with a batch size of 40 for 200 epochs, after initial testing to observe the training rate.
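Putting the stated choices together, the training call could look as follows, where X and y are the arrays from the loading sketch above and X_val/y_val stand for the hand-chosen validation set (the split itself is not specified in the text):

# Multi-class cross entropy, Adam optimizer, batch size 40, 200 epochs.
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history = model.fit(X, y,
                    batch_size=40,
                    epochs=200,
                    validation_data=(X_val, y_val))  # hand-chosen hold-out set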
D. Tests

The generated models were evaluated according to accuracy, and adjustments were made for further iterations. It was noticed that smaller batch sizes had a negative effect on accuracy (only slightly above random choice, at about 30%), so the batch size was increased to encompass a significant portion of the database.
More tests were carried out with varying numbers of neurons and hidden layers. It was found that the model was highly sensitive to the number of neurons, with accuracies growing as more were used, but was not particularly receptive to a change in the number of hidden layers.

In the second training scheme, the sigmoid output was therefore replaced with a softmax output. This function could be more desirable than a sigmoid output because it normalizes the classes' outputs to a probability distribution. The increase in neurons also required an increase in epochs to achieve a good value of the loss function; the final number of training epochs was chosen to be 3000, while the batch size remained fixed at 40.
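A sketch of what this second scheme could look like in Keras, under the same caveats as before; in particular, the width of the hidden layer is not stated in the text, so the 64 used here is purely illustrative.

from tensorflow import keras
from tensorflow.keras import layers

# Second scheme (reconstruction): a wider hidden layer and a softmax output
# that normalizes the class scores to a probability distribution.
model2 = keras.Sequential([
    keras.Input(shape=(4,)),
    layers.Dense(64, activation='relu'),    # width is an assumption
    layers.Dense(4, activation='softmax'),
])
model2.compile(optimizer='adam', loss='categorical_crossentropy',
               metrics=['accuracy'])
model2.fit(X, y, batch_size=40, epochs=3000,
           validation_data=(X_val, y_val))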
F. Recognition

Two generated models, the best from the first training scheme (musicModel626.h5) and the best from the second batch (musicModel895.h5), were chosen to undergo a series of validation tests with songs from outside the database, to see whether they could identify the songs in accordance with their training accuracies (62.6% and 89.5%, respectively). Both models can be downloaded from the GitHub page (https://github.com/tuptup9/MusicEmotionRecognition).

These songs were chosen from among other possible candidates that were considered, and contain a mixture of emotions, genres, and styles. A total of 11 tests were carried out for both models, with the test songs available for review in the code repository.
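Reproducing such a validation test with the published models could look like the sketch below; the test file name and the label order are assumptions, and extract_features is the helper from the feature-extraction sketch earlier.

import numpy as np
from tensorflow import keras

# Load one of the published models and classify a song outside the database.
model = keras.models.load_model('musicModel895.h5')

feats = extract_features('test_song.mp3')
x = np.array([[feats['tempo_mean'], feats['tempo_std'],
               feats['f0_mean'], feats['f0_std']]])

labels = ['angry', 'happy', 'sad', 'calm']
print(labels[int(np.argmax(model.predict(x)))])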
III. RESULTS

During the generation of the database, the tempo average, tempo deviation, fundamental frequency average, and fundamental frequency deviation were calculated using Librosa and saved as a .csv file for training the neural network. The average of these values for each category can be consulted in Table II.

A. Architecture 1

Multiple models were generated using the first iteration of our architecture, with accuracies ranging from 29.1% to 62.6%. During training, it was noticed that the loss function achieved most of its reduction within the first 100 epochs, and as such, 200 epochs was set as the final training value for this architecture. The models with this architecture achieved relatively low-to-medium accuracies, against an expected performance of 25% for a completely random decision. The best performing of these models achieved nearly twice this rate of success on database entries.

Fig. 4. An example of training using the first learning scheme. The top panel shows the loss function, the bottom panel shows the accuracy.

During validation, musicModel626.h5 managed to recognize 5 out of the 11 test cases, achieving a success rate of 45.5%: well above random chance, but not particularly accurate compared to a human guess. It was observed that the model managed to identify every occurrence of the 'angry' label and succeeded at some of the 'calm' and 'sad' labels, but failed to identify 'happy' in every trial.

Table I lists the emotion predicted by each model and the arousal of that emotion, compared to the true values. Highlighted in green are the cases where a model was successful in identifying the emotion, and in yellow those where the model identified the arousal correctly but not the emotion.

TABLE I
VALIDATION TESTS USING MODELS 626 AND 895.

Name | Emotion (626) | Emotion (895) | Arousal (626) | Arousal (895) | Real emotion | Real arousal
Down with the Sickness | Anger | Happiness | High | High | Anger | High
Traitor | Sad | Happiness | Low | High | Sad | Low
Lay all your love on me | Anger | Calm | High | Low | Anger | High
Super Trouper | Calm | Calm | Low | Low | Calm | Low
Nuestra Cancion | Anger | Anger | High | High | Sad | Low
Now we are free | Anger | Happy | High | High | Calm | Low
Don't go breaking my heart | Anger | Anger | High | High | Happy | High
Sarek | Happy | Calm | High | Low | Calm | Low
Mi primer millon | Sad | Anger | Low | High | Happy | High
Peace | Sad | Sad | Low | Low | Calm | Low
Beast in Black | Anger | Happy | High | High | Anger | High
Hey, soul sister | Anger | Anger | High | High | Happy | High

B. Architecture 2

Fig. 5. An example of training using the second learning scheme. The top panel shows the loss function, the bottom panel shows the accuracy.

More iterations were made using the proposed second architecture, now with much improved accuracies ranging