Music Emotion Recognition using Python
1st Rafael Martínez García Peña
School of Engineering and Sciences, ITESM
Monterrey, Mexico
A01274853@itesm.mx

2nd Sharon Elizabeth Esther Ramírez Lechuga
School of Engineering and Sciences, ITESM
Mexico City, Mexico
A01379035@itesm.mx

Abstract—Music Emotion Recognition (MER) is an open problem in computer science. Currently, approaches differ in the way emotions are classified and in the way emotions are predicted. Machine Learning is an interesting possibility for solving this problem, since it does not require that we necessarily understand the specific relations between the input parameters and the predicted emotion in order to achieve good results, but there are currently no models that achieve good performance in this field. To further explore this problem, a Deep Neural Network designed with Keras is trained on parameters extracted with Librosa and asked to predict emotions in a small database, in order to observe the limitations and advantages of Machine Learning methods applied to MER problems.
Index Terms—music, python, sentiment, background

I. INTRODUCTION
According to the literature, music is capable of inducing emotions, which is why a large number of studies have been conducted on the physiological responses and the effects that music causes in a person, for example, the selection of a suitable piece of music in music therapy to improve mental health [3].

Music emotion recognition (MER) is a discipline concerned with the ability to determine which features affect the emotional reception of a particular musical piece, such as whether a person would feel happiness, or horror, after their first contact with it [11]. Being able to properly identify the emotions that an average listener would feel when listening to a song is of interest to psychology, where open questions such as why songs seem to evoke feelings in listeners are yet to be answered, along with whether this has possible uses in the treatment of psychological disorders [7]. Furthermore, it is of interest to the field of computer science since it is a particularly hard problem to solve: there are few good results, and not much agreement on what the best method to tackle the problem is [8].

The melody is perhaps the most crucial part of any song. Melody is known as the horizontal succession of the perceptual quality of frequency, that is, a succession of pitches. Relationships between emotions and elements of a song, such as pitch range, have been found in the literature, where a wider range relates to high arousal emotions, such as joy or fear, while narrow ranges relate to low arousal emotions, such as sadness and calmness. Another important point related to emotional expression in music is the rhythm, within which is the tempo element; some studies have related fast tempo with high arousal emotions and, conversely, slow tempo with lower arousal emotions. In addition, the numerical value of tempo, that is to say, beats per minute (or BPM), also influences emotional response, where a high tempo (150 bpm) is associated with emotions such as happiness and stress, whereas a slow tempo elicits emotions such as sadness and boredom [11]. The following figure shows some emotions classified according to the circumplex model developed by James Russell.

Fig. 1. Circumplex model for basic emotions. Adapted from [1].

Classifying a song into any one specific emotion exhibits significant issues that other fields do not have. For one, emotions can be very subjective – classifying something as 'happy' or as 'exciting' can have a significant amount of nuance to it, and other methods of classification, such as dimensional models, which attempt to split an emotional impact into different axes, involve similarly ambiguous and subjective decisions (what is the dimensionality of human emotion?) [8]. There have been many efforts to apply machine learning methodologies to MER problems, but they have yet to achieve good results, and how to improve them is currently an open problem [11]. In order to explore the problem more deeply, this paper discusses the implementation of a simple Neural Network designed in Keras [2] that attempts to make simple, label-based predictions on a small database.

Machine learning, and neural networks in particular, are of interest to us as a way to tackle this problem because of their ability to map inputs that correlate to outputs without a known relationship. Since this problem is particularly difficult to model objectively with observed relations, a self-learning model with supervised training allows for the possibility of mapping variables in a way that was previously unobserved.
II. METHOD AND DATA

The following diagram shows the steps taken to carry out the project.

Fig. 2. Steps carried out in the methodology.

A. Feature extraction

To begin tackling MER using a Deep Neural Network, we must first extract the input parameters from a song. Based on the reported qualities of music that affect emotional impact [11], the following parameters were chosen:

• An estimation of fundamental frequency (F0) using a modification (pYIN) [9] of the YIN algorithm [4], where, according to [9], the probability that a period τ is the fundamental period τ0 is:

$P(\tau = \tau_0 \mid S, x_t) = \sum_{i=1}^{N} a(s_i, \tau)\, P(s_i)\, [Y(x_t, s_i) = \tau]$

• An estimation of tempo (BPM) following the method described in [5]. There are three stages to detecting beats per minute: the first is to measure the onset strength, then estimate the tempo from the onset correlation, and finally choose the onset strength peaks that roughly match the estimated tempo. In [5], a single function that combines the goals of a beat tracker is described; this function is defined as follows:

$C(\{t_i\}) = \sum_{i=1}^{N} O(t_i) + \alpha \sum_{i=2}^{N} F(t_i - t_{i-1}, \tau_p)$

From these key parameters are derived the functions to be implemented in the program, such as the different pitch and tempo values, in order to achieve the categorization of the songs. Obtaining these parameters is done using the Python library Librosa [10], which has multiple utilities to process music. In this way, a simplified model of the song can be given according to the acoustic parameters chosen to classify the songs, taking into account the emotions they may elicit.
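As an illustration of this extraction step, the following sketch shows how the four features reported later in Table II could be computed with Librosa and collected into a .csv file. This is not the exact script used for this work: the file names, the .csv column names, and the use of the dynamic tempo's spread as the "tempo deviation" are our assumptions.

# Sketch of the feature-extraction step, assuming librosa >= 0.8 and a folder of
# labelled audio files; file layout and column names are illustrative only.
import csv
import numpy as np
import librosa

def extract_features(path):
    y, sr = librosa.load(path, sr=22050, mono=True)

    # Global tempo estimate from the dynamic-programming beat tracker [5].
    tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

    # Frame-wise (dynamic) tempo; its spread is used here as the "tempo deviation".
    onset_env = librosa.onset.onset_strength(y=y, sr=sr)
    dyn_tempo = librosa.beat.tempo(onset_envelope=onset_env, sr=sr, aggregate=None)

    # Frame-wise F0 via pYIN [9]; unvoiced frames come back as NaN.
    f0, _, _ = librosa.pyin(y, fmin=librosa.note_to_hz('C2'),
                            fmax=librosa.note_to_hz('C7'), sr=sr)

    return [float(tempo), float(np.std(dyn_tempo)),
            float(np.nanmean(f0)), float(np.nanstd(f0))]

# Example: build the training .csv from files named "<label>_<title>.wav".
with open('features.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['label', 'tempo', 'tempo_dev', 'f0_mean', 'f0_dev'])
    for path in ['anger_song1.wav', 'calm_song1.wav']:   # hypothetical file names
        label = path.split('_')[0]
        writer.writerow([label] + extract_features(path))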
B. Database

A database of 48 songs was generated using 4 different emotions: Anger, Calm, Happy, Sad. These emotions were chosen because each one belongs to a different quadrant of the Circumplex model, that is to say, the four selected emotions have different arousal and valence, as can be seen in Fig. 3. This was done with the purpose of facilitating the recognition process of the program, since choosing emotions with the same arousal and valence could be a factor that increases errors during recognition; however, it would be interesting to develop in the future a program capable of recognizing emotions that are in the same quadrant of the Circumplex model.

Fig. 3. Model with the classification of the selected emotions according to their arousal and valence.

The songs were chosen to be from different genres, with and without vocals, and an effort was made to include potentially difficult songs so as to properly evaluate whether the model would perform well in real-life situations. Each label contains three songs selected from [6], where a list of validated music pieces used in previous studies can be found.

For a full list of songs used, the database can be consulted at https://github.com/tuptup9/MusicEmotionRecognition.

C. Training, first iteration

A neural network model with one hidden layer was created using Keras [2]. The input layer contains 8 neurons densely connected to the next layer, using the ReLU activation function:

$f(x) = \max(0, x)$

The output layer has 4 neurons, which classify the result into one of the following labels: anger, calm, happy, sad. The activation function for these neurons is the sigmoid function:

$f(x) = \frac{1}{1 + e^{-x}}$

Between these two layers, a hidden layer with 8 neurons and a ReLU activation function was added to map the inputs to the outputs.
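A minimal sketch of how this first architecture could be written with the Keras Sequential API is given below. The assumption that the network takes the four extracted features (tempo, tempo deviation, F0 mean, F0 deviation) as its input vector is ours; the paper does not state the exact input shape.

# Sketch of the first architecture (8 ReLU -> 8 ReLU -> 4 sigmoid), assuming a
# 4-dimensional feature vector per song; not the authors' exact script.
from tensorflow import keras
from tensorflow.keras import layers

def build_model_v1(n_features=4):
    model = keras.Sequential([
        layers.Dense(8, activation='relu', input_shape=(n_features,)),  # "input layer"
        layers.Dense(8, activation='relu'),                             # hidden layer
        layers.Dense(4, activation='sigmoid'),                          # anger/calm/happy/sad
    ])
    return model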

To train this model, parameters were extracted from a database of 48 songs, 12 of each label, with a different hand-chosen set used for validation. The songs were manually labelled in accordance with the emotion identified for each. This is a very small database, which is one of the main limitations of the methodology proposed herein.

For training, the loss function chosen is multi-class cross entropy, a standard loss function for classification problems, defined as

$L = -\sum_{x} p(x) \cdot \log(q(x))$

where p(x) is the real probability of an item being in class x, and q(x) is the predicted probability of an item being in class x. This loss function was chosen because we are seeking to maximize performance on all classes, and cross entropy punishes large mistakes across any class more than small mistakes in any one class.

Additionally, training is done with the Adam stochastic gradient descent optimizer, which presents good results for a variety of problems. The choice of optimizer is not particularly important for our implementation, as the database is fairly limited. The model was trained with a batch size of 40 and for 200 epochs, after initial testing to observe the training rate.
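Put together, the training configuration described here could look roughly as follows. The features.csv layout, the label ordering, and the saved file name are carried over from the earlier illustrative sketches and are assumptions, not published details.

# Rough training setup matching the description: multi-class cross entropy, the Adam
# optimizer, batch size 40, 200 epochs. Feature/label loading assumes the illustrative
# features.csv layout from the extraction sketch.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

data = np.genfromtxt('features.csv', delimiter=',', skip_header=1, dtype=str)
X = data[:, 1:].astype(float)                      # tempo, tempo_dev, f0_mean, f0_dev
label_names = ['anger', 'calm', 'happy', 'sad']
y = keras.utils.to_categorical([label_names.index(l) for l in data[:, 0]], num_classes=4)

model = keras.Sequential([
    layers.Dense(8, activation='relu', input_shape=(X.shape[1],)),
    layers.Dense(8, activation='relu'),
    layers.Dense(4, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X, y, batch_size=40, epochs=200)
model.save('musicModel.h5')                        # file name is illustrative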
D. Tests

The models generated were evaluated according to accuracy, and adjustments were made for further iterations. It was noticed that smaller batch sizes had a negative effect on accuracy (just slightly above choosing at random, about 30%), so the batch size was increased to encompass a significant portion of the database.

More tests were carried out with multiple numbers of neurons and hidden layers. It was found that the model was highly sensitive to the number of neurons, with accuracies growing as more were used, but was not particularly receptive to a change in the number of hidden layers.

E. Training, second iteration

Another batch of models was trained using a new architecture, this time using 15 input neurons with the ReLU activation function, densely connected to a hidden layer containing 30 neurons with a sigmoid activation function. Finally, the output layer's activation function was changed to a softmax function:

$\sigma(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}$

This function could be more desirable than a sigmoid output because it normalizes the classes' output into a probability distribution.

The increase in neurons also required an increase in epochs to achieve a good value of the loss function. The final number of training epochs was chosen to be 3000. The batch size remained fixed at 40.
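A sketch of this second architecture is shown below, under the same assumed 4-feature input as before; only the layer sizes, activations, epoch count, and batch size come from the text, the rest is illustrative.

# Sketch of the second architecture (15 ReLU -> 30 sigmoid -> 4 softmax) trained for
# 3000 epochs with batch size 40; X and y as prepared in the previous sketch.
from tensorflow import keras
from tensorflow.keras import layers

model_v2 = keras.Sequential([
    layers.Dense(15, activation='relu', input_shape=(4,)),   # assumed 4 input features
    layers.Dense(30, activation='sigmoid'),
    layers.Dense(4, activation='softmax'),                   # probability distribution over labels
])
model_v2.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model_v2.fit(X, y, batch_size=40, epochs=3000)
model_v2.save('musicModel_v2.h5')                            # file name is illustrative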

F. Recognition

Two generated models, the best from the first training scheme (musicModel626.h5) and the best from the second batch (musicModel895.h5), were chosen to undergo a series of validation tests with songs from outside the database, to see whether they could successfully identify the songs in accordance with their training accuracies (62.6% and 89.5%, respectively). Both of these models can be downloaded from the GitHub page (https://github.com/tuptup9/MusicEmotionRecognition).

These songs were chosen from other possible candidates that were considered for the database, containing a mixture of emotions, genres, and styles. 11 tests were carried out for both models, with the test songs available for review in the code repository.
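The recognition step can then be reduced to loading a saved model and taking the most probable label for a new song's feature vector, roughly as in the following sketch; extract_features is the hypothetical helper from the feature-extraction sketch, not a function published with this paper.

# Sketch of recognition with a saved model; extract_features is the hypothetical
# helper from the feature-extraction sketch above.
import numpy as np
from tensorflow import keras

label_names = ['anger', 'calm', 'happy', 'sad']
model = keras.models.load_model('musicModel626.h5')

features = np.array([extract_features('test_song.wav')])   # shape (1, 4)
probabilities = model.predict(features)[0]
print('Predicted emotion:', label_names[int(np.argmax(probabilities))])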
III. RESULTS

During the generation of the database, the tempo average, tempo deviation, fundamental frequency average, and fundamental frequency deviation were calculated using Librosa and saved as a .csv file for training the neural network. The average of these values for each category can be consulted in Table II.

A. Architecture 1

Multiple models were trained using the first iteration of our architecture, with accuracies ranging from 29.1% to 62.6%. During training, it was noticed that the loss function achieved most of its reduction within the first 100 epochs, and as such, 200 epochs was set as the final training value for this architecture. The models with this architecture achieved relatively low-to-medium accuracies, with the expected performance of a completely random decision being 25%. The best performing of these models achieved nearly twice this rate of success on database entries.

Fig. 4. An example of training using the first learning scheme. The top panel shows the loss function, the bottom panel shows the accuracy.

During validation, musicModel626.h5 managed to recognize 5 out of the 11 test cases, achieving a success rate of 45.5%, well above random chance, but not particularly accurate compared to a human guess. It was observed that the model managed to identify every occurrence of the 'angry' label, succeeded at some of the 'calm' and 'sad' labels, but failed to identify 'happy' in every trial.

B. Architecture 2

More iterations were made using the proposed second architecture, now with much improved accuracies ranging from 60.4% to 89.5%. During training, the loss function was found to decrease at a much slower rate, and as such the number of epochs was increased until a more linear behaviour was seen, at around 3000 epochs. While the accuracies were much higher, there was significant concern that the network was overfitting and would perform poorly during validation.

Fig. 5. An example of training using the second learning scheme. The top panel shows the loss function, the bottom panel shows the accuracy.

This worry proved to be correct: with musicModel895.h5 recognizing only 3 out of 11 test cases, for a success rate of 27.3%, it is difficult to conclude that it performed any better than random chance. This model managed to identify calmness the most reliably, but with such low success rates it is impossible to conclude that it was anything more than chance.

C. Validation

Both models were validated with 11 songs; the full results can be seen in Table I. This table shows the predicted emotion and the arousal of that emotion compared to the true values. Highlighted in green are the cases where a model was successful in identifying the emotion, and in yellow where the model identified the arousal correctly but not the emotion.

TABLE I
VALIDATION TESTS USING MODELS 626 AND 895

Name | Predicted emotion (626) | Predicted emotion (895) | Predicted arousal (626) | Predicted arousal (895) | Real emotion | Real arousal
Down with the Sickness | Anger | Happiness | High | High | Anger | High
Traitor | Sad | Happiness | Low | High | Sad | Low
Lay all your love on me | Anger | Calm | High | Low | Anger | High
Super Trouper | Calm | Calm | Low | Low | Calm | Low
Nuestra Cancion | Anger | Anger | High | High | Sad | Low
Now we are free | Anger | Happy | High | High | Calm | Low
Don't go breaking my heart | Anger | Happy | High | High | Happy | High
Sarek | Anger | Calm | High | Low | Calm | Low
Mi primer millon | Sad | Anger | Low | High | Happy | High
Peace | Sad | Sad | Low | Low | Calm | Low
Beast in Black | Anger | Happy | High | High | Anger | High
Hey, soul sister | Anger | Anger | High | High | Happy | High

TABLE II
AVERAGE OF THE FEATURES EXTRACTED FOR EACH EMOTION

Emotion | Tempo (beat tracker) | Dynamic tempo | Mean fundamental frequency (F0) | Std. dev. fundamental frequency (F0)
Anger | 115.41 | 15.442 | 160.69 | 92.783
Calm | 130.75 | 14.413 | 111.30 | 64.265
Happiness | 138.74 | 20.457 | 124.87 | 72.102
Sadness | 117.12 | 16.851 | 126.77 | 73.195

Model895.h5 recognizing 3 out of 11 test cases, with a success


rate of 27.3% it is difficult to conclude that it performed any
IV. D ISCUSSION
better than random chance. This model managed to identify
calmness the most reliably, but with such low success rates There were several noticed trends during testing of different
it is impossible to conclude that it was anything more than architectures, and during validation. The first insight is that
chance. the neural network has trouble finding good relations between
the input features and the outputs with few neurons (see fig. 4,
C. Validation the bottom graph is the accuracy, and it varies wildly between
Both models were validated with 11 songs, the full results epochs even with the loss function being very low), while
can be seen in table I. This table shows the predicted emotion increasing the neurons heavily improves this relationship
(see fig. 5, the accuracy increases much more reliably as with the responses people have to each song, thus validating
the loss function decreases), but introduces severe overfitting the effectiveness of the program in the selected region.
problems as observed in our validation tests (see tab. I).
R EFERENCES
When testing with our program we noticed that the program
sometimes identified a happy song as if it were an angry [1] Saikat Basu, Nabakumar Jana, Arnab Bag, M Mahadevappa, Jayanta
Mukherjee, Somesh Kumar, and Rajlakshmi Guha. Emotion recogni-
song, this same behavior could also be observed with the tion based on physiological signals using valence-arousal model. In
emotions of sadness and calm. One of our hypotheses is that 2015 Third International Conference on Image Information Processing
the confusion could be due to the arousal of the emotions (ICIIP), pages 50–55. IEEE, 2015.
[2] François Chollet et al. Keras. https://keras.io, 2015.
since the confusion was between high arousal emotions (anger [3] Ian Daly, Duncan Williams, James Hallowell, Faustina Hwang, Alexis
and happiness) and between low arousal emotions (sadness Kirke, Asad Malik, James Weaver, Eduardo Miranda, and Slawomir J.
and calm). Fig. 3 shows the classification of the selected Nasuto. Music-induced emotions can be predicted from a combination
of brain activity and acoustic features. Brain and Cognition, 101:1–11,
emotions, highlighting in green the high arousal emotions 2015.
and in red the low arousal emotions. [4] Alain De Cheveigné and Hideki Kawahara. Yin, a fundamental fre-
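This arousal-level reading of the results can be reproduced mechanically: since each of the four labels sits in a known half of the circumplex model, a predicted emotion can be collapsed to a predicted arousal class, which is how the arousal columns of Table I can be derived from the emotion columns. A small sketch of that mapping follows; the example prediction lists are placeholders, not the published results.

# Collapsing emotion labels to arousal classes, as in the arousal columns of Table I.
AROUSAL = {'anger': 'high', 'happy': 'high', 'calm': 'low', 'sad': 'low'}

def arousal_accuracy(predicted, real):
    hits = sum(AROUSAL[p] == AROUSAL[r] for p, r in zip(predicted, real))
    return hits / len(real)

predicted = ['anger', 'sad', 'anger', 'calm']    # hypothetical model output
real      = ['happy', 'sad', 'anger', 'sad']     # hypothetical ground truth
print(f'Arousal accuracy: {arousal_accuracy(predicted, real):.1%}')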
This observation of confusion between high- and low-arousal emotions leads us to believe that the current training scheme requires two changes to achieve higher accuracies across a more general selection of songs:
• The database must be increased to include more examples, to better distinguish between edge cases.
• The features used must be expanded to include more diverse information about a particular song.

Even with these observations, however, and with the significant limitations the neural network encountered, it is of note that musicModel626.h5 performs significantly better than chance at identifying music: it can successfully determine the emotion elicited by a song half the time, and this success increases for arousal determination (it successfully determined whether a song was high- or low-arousal 8 out of 11 times – a 72.7% success rate observable in Table I). This indicates that tempo and fundamental frequency are necessary for emotion recognition, but they are not sufficient.
V. CONCLUSIONS
In conclusion, with the results obtained, it was observed that the selection of musical features is decisive for the classification of songs: the more information the program has about the features, the greater the probability of an accurate recognition. On the other hand, it is crucial to mention that geographical location and local culture are key factors that influence the classification and recognition of emotions, since the musical genres, as well as the emotions that songs can induce, change according to the culture of each region. It is therefore important to emphasize that the database should be adapted to the region in which the program will be used; with a more expansive list of songs for training plus enhanced feature extraction, it is likely that the proposed model will improve its ability to discriminate between different emotions.

In the future, it could be interesting to make the pertinent modifications to the program, as well as to increase the database in terms of the number of songs included, in order to carry out a study comparing the results of the program with the responses people have to each song, thus validating the effectiveness of the program in the selected region.

REFERENCES

[1] Saikat Basu, Nabakumar Jana, Arnab Bag, M. Mahadevappa, Jayanta Mukherjee, Somesh Kumar, and Rajlakshmi Guha. Emotion recognition based on physiological signals using valence-arousal model. In 2015 Third International Conference on Image Information Processing (ICIIP), pages 50–55. IEEE, 2015.
[2] François Chollet et al. Keras. https://keras.io, 2015.
[3] Ian Daly, Duncan Williams, James Hallowell, Faustina Hwang, Alexis Kirke, Asad Malik, James Weaver, Eduardo Miranda, and Slawomir J. Nasuto. Music-induced emotions can be predicted from a combination of brain activity and acoustic features. Brain and Cognition, 101:1–11, 2015.
[4] Alain De Cheveigné and Hideki Kawahara. YIN, a fundamental frequency estimator for speech and music. The Journal of the Acoustical Society of America, 111(4):1917–1930, 2002.
[5] Daniel P. W. Ellis. Beat tracking by dynamic programming. Journal of New Music Research, 36(1):51–60, 2007.
[6] Seyedeh Maryam Fakhrhosseini and Myounghoon Jeon. Affect/emotion induction methods. In Emotions and Affect in Human Factors and Human-Computer Interaction, pages 235–253. Elsevier, 2017.
[7] Patrik N. Juslin and Petri Laukka. Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of New Music Research, 33:217–238, 2004.
[8] Miroslav Malik, Sharath Adavanne, Konstantinos Drossos, Tuomas Virtanen, Dasa Ticha, and Roman Jarina. Stacked convolutional and recurrent neural networks for music emotion recognition. June 2017.
[9] Matthias Mauch and Simon Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 659–663, 2014.
[10] Brian McFee, Alexandros Metsai, Matt McVicar, Stefan Balke, Carl Thomé, Colin Raffel, Frank Zalkow, Ayoub Malek, Dana, Kyungyun Lee, Oriol Nieto, Dan Ellis, Jack Mason, Eric Battenberg, Scott Seyfarth, Ryuichi Yamamoto, viktorandreevichmorozov, Keunwoo Choi, Josh Moore, Rachel Bittner, Shunsuke Hidaka, Ziyao Wei, nullmightybofo, Darío Hereñú, Fabian-Robert Stöter, Pius Friesch, Adam Weiss, Matt Vollrath, Taewoon Kim, and Thassilo. librosa/librosa: 0.8.1rc2, May 2021.
[11] Renato Panda, Ricardo Manuel Malheiro, and Rui Pedro Paiva. Audio features for music emotion recognition: A survey. IEEE Transactions on Affective Computing, pages 1–1, October 2020.

For access to the code, database, and validation tests, the repository for the project can be found at https://github.com/tuptup9/MusicEmotionRecognition.