
Available online at www.sciencedirect.com

ScienceDirect

Procedia Computer Science 167 (2020) 16–25
www.elsevier.com/locate/procedia

International Conference on Computational Intelligence and Data Science (ICCIDS 2019)
Musical instrument emotion recognition using deep recurrent neural network

Sangeetha Rajesh a,*, N J Nalini b

a K.J. Somaiya Institute of Management Studies and Research, Mumbai 400077, India
b Annamalai University, Chidambaram 608002, India

Abstract

Music is one of the most effective media to convey emotion. Emotion recognition in music is the process of identifying the emotion from music clips. In this paper, a novel approach is proposed to recognize emotion by classes of musical instruments using deep learning techniques. A music dataset is collected for four instrument types: string, percussion, woodwind, and brass. These instrument types are grouped into four emotions: happy, sad, neutral, and fear. Mel frequency cepstral coefficients (MFCC), chroma energy normalized statistics (CENS), chroma short time Fourier transform (STFT), spectral features (spectral centroid, bandwidth, rolloff), and the temporal feature ZCR are extracted from the instrumental music dataset. Based on the extracted features, recurrent neural networks (RNN) are trained to recognize the emotion. The performance of the RNN is compared with baseline machine learning classification algorithms. The results indicate that MFCC features with a deep RNN give better performance for instrument emotion recognition. They also show that the instrument class plays an important role in the emotion induced by the music.

© 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the International Conference on Computational Intelligence and Data Science (ICCIDS 2019).

* Corresponding author. Tel.: +919821860564.
E-mail address: rajesh.sangeetha@gmail.com

1877-0509 © 2020 The Authors. Published by Elsevier B.V.
This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Peer-review under responsibility of the scientific committee of the International Conference on Computational Intelligence and Data Science (ICCIDS 2019).
10.1016/j.procs.2020.03.178

Keywords: Instrument emotion recognition; Support vector machines; Deep learning; Recurrent neural networks; MFCC; CENS.

1. Introduction

“Music is the shorthand of emotion”, as Leo Tolstoy put it. Music helps people feel relaxed and is one of the
most effective media to convey emotion rapidly. It also helps in experiencing many emotions such as happiness,
excitement, and sadness, to name a few. Music information retrieval (MIR) has attracted increasing attention from the
research community and the music industry in recent years. The major reason is the tremendous growth of digital music data and its
ubiquitous nature. Digital music can be organized by various means such as artist, genre, instrument, mood or
emotion. However, organizing online music repositories based on emotion is a meaningful way. Recognizing
the perceived emotion from music is an interesting and challenging area in MIR. The issues related to the subjectivity
associated with emotions are addressed by various disciplines [1, 2, 3]. Another challenge in music emotion
recognition (MER) is that how music represents emotion is still enigmatic. Which elements in music induce a specific
emotion in the listener is not yet fully known to the research community.
One of the primary motives for listening to music is to feel emotions. Our emotional states affect our performance
in every task, and music becomes a cue to evoke specific kinds of emotion. The ubiquitous nature of music is amplified
through mobile devices, radio stations, and recommender systems. Recommending songs to listeners based on
emotion is one of the main requirements of music recommender systems [4]. With the increase in digital music
data, the identification and annotation of emotion has become a tedious task. This results in the need to automate the
recognition of emotion content from the song. Another application of music emotion recognition is music therapy, a
flourishing field in healthcare. Music therapy is used to decrease tiredness, anxiety, and breathlessness in cancer
patients [5]. A music signal comprises the singing voice and the instrumental music. Numerous research works have
been conducted in the area of music emotion recognition (MER) where the recognition was based on both singing
and instrumental music clips. However, emotion recognition from instrumental music is a relatively unexplored area in MIR.
Based on the literature review conducted, the following research questions have been formulated.

 Do monophonic instrumental music clips provide any cues to recognize emotion?
 Is the recognition of emotion centered on the type of instrument class?
 Do advanced machine learning techniques improve the performance of IER when compared to the baseline
algorithms?
 Which acoustic features contain the necessary information for recognizing emotion from instrumental music
clips?

In this paper, a deep learning system using recurrent neural networks is proposed to recognize the emotion
from music clips of four instrument classes: string, woodwind, percussion, and brass. Four emotion categories,
happy, sad, neutral and fear, are selected from different quadrants of the widely used valence-arousal plane of
emotions, as shown in Fig. 1.

Fig. 1. Arousal-valence emotion model.

This paper is organized as follows: Section 2 discusses related work on instrument emotion
recognition, followed by a description of the dataset in Section 3. Section 4 details the various acoustic features employed
in this work, and Section 5 explains the proposed IER system structure. Results are discussed in Section 6, and
Section 7 concludes the paper and mentions the future scope.

2. Previous works

Music emotion recognition is a field of pattern recognition and artificial intelligence. Since music has the ability
to evoke emotion, many technical and psychological studies have been conducted in the area of MER. Two main
processes in pattern recognition techniques are feature extraction and recognition. Feature selection and extraction
play a vital role in pattern recognition. The learning of the model is based on the features extracted from the music
clips. In feature selection, the important features that contribute to the success of the application need to be
identified. In this scenario, the features that accurately recognize the emotion from the music clip need to be
selected. Research in MER is fundamentally about feature selection and pattern recognition algorithms.
In [6], authors have proposed a model for emotion recognition from vocal features, instrumental features and a
combination of both vocal and instrumental. A feedback artificial neural network has been trained with timbre
features, which changes the state until it reaches equilibrium. The authors have demonstrated that timbre features
contain the emotion-related information in music signals and the combined vocal and instrumental signals had
shown more accuracy when compared to the individual signals.
Jianyu et al. [7] have proposed a smoothed rank SVM to predict the arousal-valence ranking of music clips. The
authors have annotated the dataset for experimental music emotion recognition. 56-dimensional feature vectors have
been extracted, including various timbre and cepstral features. The computational efficiency of the system depends
on the feature vectors, and the time complexity is also directly proportional to the feature vector size. In [8], a
genetic algorithmic approach is used for feature selection. This method reduced the computational time, which is a
major issue in many pattern recognition tasks. Music clips are classified into eight categories using the SVM
algorithm, and the highest accuracy is achieved with 80% crossover probability and 20 generations. Many researchers
have employed SVM, one of the preeminent classification algorithms, for music emotion recognition. However,
artificial neural networks brought a dramatic change in pattern recognition applications. Tong L et al. [9] have
proposed a convolutional neural network (CNN) for MER. The audio signal is represented as a spectrogram and used
as the input to the CNN, which resulted in decent accuracy. In [10], the authors have proposed an MER system
using SVM, auto-associative neural networks (AANN) and radial basis function neural networks (RBFNN).
Residual phase features have been utilized with the popular MFCC features for recognition of emotion. Score level
fusion has been performed on the results due to the complementary nature of these features. SVM outperformed the
other two techniques for emotion recognition.
From the literature review, it is observed that research on music emotion recognition has been performed
based on two approaches, dimensional and categorical. Bai et al. [11] have proposed a dimensional music emotion
recognition model to predict the V-A values using SVR, RFR and RNN, among which SVR has shown the
optimum performance. SVM has been utilized by Lin et al. [12] for two-level music emotion recognition.
In the first level, the music clips are classified into various genres, and in the second phase the emotion is recognized.
The subjectivity of MER has been a topic of discussion among researchers in MIR. This has been addressed by
Deng et al. [13], who proposed a support vector regression algorithm for categorical MER that determines the top two
dominant emotions in a music clip. Chen et al. [14] have proposed a Deep Gaussian Process (GP) for MER. They
have concluded that calm and anger emotions are more difficult to model with a deep GP. Hierarchical SVM-based
classifiers have been employed to recognize happy, tense, sad and peaceful emotions [15]. The Revised Gene
Expression Programming (RGEP) algorithm has been utilized by Zhang et al. [16] for web music emotion
recognition; RGEP has shown better performance than SVM and GEP. In [17], different acoustic features are
evaluated for the music emotion recognition task using SVM with polynomial and radial basis function (RBF) kernels. The authors
demonstrated that spectral features are better suited for MER than rhythm, dynamics and harmony features, and that
the polynomial kernel works better for MER than the RBF kernel. A few challenges and research gaps identified
from the literature review are given below.

 Previous studies on music emotion recognition have mainly identified the emotion categories happy, sad
and angry. In this work, the categories are extended by including neutral and fear emotions in the music clips.
 In addition, previous works focused on recognizing the emotion from music clips containing both
vocal and instrumental content. The novelty of this work is in recognizing the emotion from monophonic instrumental
music clips and analysing the type of instrument that better invokes a particular emotion using machine learning
techniques.
 Deep learning algorithms are seldom used in instrument emotion identification.
 Many psychological studies have been done on the perceived emotion from different instrumental music clips
[11]. The intended emotion of instrumental music clips is an area which needs more focus.

3. Dataset

The musical instrument dataset includes 800 instrumental music clips of four emotions (happy, sad, neutral and
fear) played with instruments of various categories (percussion, string, woodwind, and brass). For each emotion, 200
monophonic instrument music clips of the four instruments mentioned in Table 1 are collected. For the purpose of this
work, the collected music clips are 15 seconds in duration. Each clip is in 16-bit mono WAV format with a 44.1 kHz
sampling frequency. Based on [18], the musical instruments are categorized into particular emotions. Table 1 shows
the instrument types, the instruments used in this work and the related emotions.

Table 1. Instruments classes and respective emotion classes.


Instrument Class Instruments Emotions
String instrument Violin Sad
Percussion instrument Piano Happy
Woodwind instrument Flute Neutral
Brass Instrument Trumpet Fear
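
As an illustration, a minimal sketch of loading the clips described above with librosa follows; the directory layout and file-naming scheme are assumptions for illustration only, not the authors' setup.

```python
# A hedged sketch of loading the 15 s, 44.1 kHz mono clips described above.
# The directory layout and file-naming scheme are assumptions, not the authors' setup.
import os
import librosa

DATA_DIR = "instrument_emotion_dataset"   # hypothetical location of the 800 clips

clips, labels = [], []
for fname in sorted(os.listdir(DATA_DIR)):
    if not fname.endswith(".wav"):
        continue
    # e.g. "happy_piano_001.wav" -> label "happy_piano" (assumed naming convention)
    label = "_".join(fname.split("_")[:2])
    signal, sr = librosa.load(os.path.join(DATA_DIR, fname), sr=44100, mono=True)
    clips.append(signal)
    labels.append(label)

print(len(clips), "clips loaded at", sr, "Hz")
```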

4. Acoustic features

Features provide an efficient computational way to represent audio data for the application they are designed for.
Table 2 lists the features extracted from the instrumental music clips in this work.

Table 2. Features extracted from instrumental music clips.


Feature #coefficients
MFCC 39
CENS 12
Chroma stft 12
Spectral centroid 1
Spectral bandwidth 1
Spectral flux 1
ZCR 1

4.1. MFCC

MFCC is the most prevalent feature vector used in speech and music related pattern recognition applications. The
extraction process of MFCC is detailed in [19]. MFCCs have shown prominent performance in speech and
music emotion recognition [6]; however, they are hardly used in instrument emotion recognition. These features align
closely with human perception because they take the perceptual sensitivity of human hearing to different frequencies
into consideration. To compute MFCC, the music clips are split into frames of 20 ms with a shift of 10 ms. 39 MFCC
features are extracted from each frame, comprising the static, derivative (delta) and acceleration coefficients.
The python_speech_features package is used to extract the features. The MFCC features extracted from four different
instrument clips, illustrating the related emotions, are shown in Fig. 2.

Fig. 2. MFCC features from samples of emotion-instrument combinations: (a) happy-piano (b) sad-violin (c) neutral-flute (d) fear-trumpet
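
As an illustration, a minimal sketch of the 39-dimensional MFCC extraction described above is given below; the clip name and the nfft setting are assumptions, while the 20 ms frame length, 10 ms shift, and static + delta + acceleration layout follow the text.

```python
# Minimal sketch of 39-dimensional MFCC extraction with python_speech_features.
# The clip name is hypothetical; nfft=1024 is an assumption to cover 20 ms frames at 44.1 kHz.
import numpy as np
import librosa
from python_speech_features import mfcc, delta

signal, sr = librosa.load("happy_piano_001.wav", sr=44100, mono=True)

static = mfcc(signal, samplerate=sr, winlen=0.020, winstep=0.010, numcep=13, nfft=1024)
d1 = delta(static, 2)        # first-order (delta) coefficients
d2 = delta(d1, 2)            # second-order (acceleration) coefficients

features = np.hstack([static, d1, d2])   # shape: (num_frames, 39)
print(features.shape)
```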

4.2. Chroma Energy Normalized Statistics(CENS)

CENS features are used in music retrieval applications because of their robustness to variations in timbre and dynamics. These
features are highly related to harmonic content [20]. Chroma features capture the short-time energy of the signal spread
across 12 chroma bins, identifying spectral components that differ by a musical octave. CENS represents
tempo and timbre variations in an efficient way [8]. These features are strongly correlated with the harmonic
progression in the music signal, which makes them suitable for MIR tasks. 12 CENS features are extracted for each
frame using the librosa package in Python. The extracted features for the four emotion-instrument combinations are shown in Fig. 3.

Fig. 3. CENS features from samples of emotion-instrument combination (a) happy-piano (b) sad-violin (c) neutral-flute (d) fear-trumpet
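
A minimal librosa sketch of the per-frame CENS extraction is shown below; the clip name is hypothetical and the default librosa frame parameters are an assumption.

```python
# Minimal sketch of 12-dimensional CENS extraction with librosa (default frame parameters).
import librosa

signal, sr = librosa.load("sad_violin_001.wav", sr=44100, mono=True)   # hypothetical clip
cens = librosa.feature.chroma_cens(y=signal, sr=sr)                    # shape: (12, num_frames)
frames = cens.T                                                        # one 12-dim vector per frame
print(frames.shape)
```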

4.3. Chroma STFT

Chroma STFT is obtained by performing the FFT on the music samples and projecting the resulting spectrum onto a
chromagram, in which the vertical axis represents the musical pitch classes (notes).

4.4. Spectral Centroid

The spectral centroid is allied with the extent of the brightness of an audio signal [21, 22, 23, 24]. It represents a measure of
the timbre and spectral components of the music signal. It is computed using

\mathrm{Spectral\ centroid} = \frac{\sum_{k=1}^{N} k\,F[k]}{\sum_{k=1}^{N} F[k]} \quad (1)

where F[k] is the amplitude of the k-th frequency bin.

4.5. Spectral bandwidth

Order-p spectral bandwidth is computed:


1
p p
Spectral Bandwidth= (∑k S(k)(f(k)-fc ) ) (2)
where S(k): spectral magnitude.

4.6. Spectral Rolloff

Spectral rolloff is the frequency below which a specified percentage (85%) of the total spectral energy lies. The temporal
feature, zero crossing rate (ZCR), is computed as

Z_n = \frac{1}{2} \sum_{m} \bigl|\,\mathrm{sgn}[x(m)] - \mathrm{sgn}[x(m-1)]\,\bigr|\, w(n-m) \quad (3)

\mathrm{sgn}[x(n)] = \begin{cases} 1, & x(n) \geq 0 \\ -1, & x(n) < 0 \end{cases} \quad (4)

where w(n) is a rectangular window of length n.
The spectral features (spectral centroid, bandwidth, and rolloff) and the temporal feature ZCR are combined with Chroma
STFT at the score level. Spectral and temporal parameters play a key role in determining the affective dimensions
(arousal and valence) of instrumental sounds [18]. Chroma features have proved their usefulness in many music
analysis tasks [20]. The spectral centroid features extracted from the four emotion music samples are shown in
Fig. 4.

Fig. 4. Spectral centroid features (a) happy-piano (b) sad-violin (c) neutral-flute (d) fear-trumpet
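
A hedged librosa sketch of the per-frame Chroma STFT, spectral (centroid, bandwidth, rolloff) and temporal (ZCR) features is given below; the clip name and the simple frame-wise stacking are assumptions for illustration, since the paper combines these features at the score level rather than by concatenation.

```python
# Hedged sketch: per-frame Chroma STFT, spectral and ZCR features with librosa.
# The clip name is hypothetical; stacking into one matrix is for illustration only,
# since the paper combines these features at the score level.
import numpy as np
import librosa

signal, sr = librosa.load("neutral_flute_001.wav", sr=44100, mono=True)

chroma    = librosa.feature.chroma_stft(y=signal, sr=sr)                         # (12, num_frames)
centroid  = librosa.feature.spectral_centroid(y=signal, sr=sr)                   # Eq. (1)
bandwidth = librosa.feature.spectral_bandwidth(y=signal, sr=sr, p=2)             # Eq. (2)
rolloff   = librosa.feature.spectral_rolloff(y=signal, sr=sr, roll_percent=0.85)
zcr       = librosa.feature.zero_crossing_rate(signal)                           # Eqs. (3)-(4)

features = np.vstack([chroma, centroid, bandwidth, rolloff, zcr]).T              # (num_frames, 16)
print(features.shape)
```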

5. System structure

Recurrent neural networks were introduced to handle sequence and time series data and are well suited for various
speech and music-related applications. They comprise internal memory to cope with sequence-related
information: in an RNN, each node takes the current input along with the learning from the previous step. In this work, an
RNN with bidirectional LSTM (Long Short-Term Memory) and LSTM layers is proposed. The LSTM layers
increase the recall capability of the RNN nodes, and the bidirectional LSTM efficiently handles the forward and
backward flow of information in the network. Fig. 5 depicts the deep RNN architecture designed for instrument
emotion detection.

Fig. 5. Recurrent Neural Network Architecture designed for IER

The designed RNN architecture consists of one input layer; four hidden layers comprising two bidirectional
LSTM layers, each with 64 nodes, and two LSTM layers with 32 and 16 nodes; and an output dense layer with four
nodes for the four emotion classes. ReLU activation is used for the hidden layers, and softmax is used for the output layer,
which yields the probability for each class. Dropout layers are used in between the hidden layers to reduce the
chance of overfitting.
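
A minimal Keras sketch of the described architecture is given below (two bidirectional LSTM layers with 64 units, LSTM layers with 32 and 16 units, dropout between hidden layers, ReLU activations and a softmax output); the input shape, dropout rate, optimizer and loss are assumptions not specified in the paper.

```python
# Hedged Keras sketch of the deep RNN in Fig. 5. Input shape, dropout rate,
# optimizer and loss are assumptions; layer sizes follow the text.
from tensorflow import keras
from tensorflow.keras import layers

NUM_FRAMES, NUM_FEATURES, NUM_CLASSES = 1498, 39, 4   # e.g. ~1498 10-ms-shift frames of 39-dim MFCCs (assumed)

model = keras.Sequential([
    keras.Input(shape=(NUM_FRAMES, NUM_FEATURES)),
    layers.Bidirectional(layers.LSTM(64, activation="relu", return_sequences=True)),
    layers.Dropout(0.3),
    layers.Bidirectional(layers.LSTM(64, activation="relu", return_sequences=True)),
    layers.Dropout(0.3),
    layers.LSTM(32, activation="relu", return_sequences=True),
    layers.Dropout(0.3),
    layers.LSTM(16, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()
```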

5.1. Instrument emotion recognition with RNN

The deep recurrent neural network is trained with three kinds of features: MFCC, CENS, and the combined
Chroma and spectral features including spectral centroid, bandwidth, rolloff, and ZCR. For each instrument, four
emotion models are generated. The trained instrument emotion models are tested with the features extracted from the
test music clips and evaluated in terms of recognition rate. The confusion matrices for the four emotion-instrument
combinations are shown in Tables 3-6.
Table 3. IER performance of piano (in %).

               Happy-Piano   Sad-Piano   Neutral-Piano   Fear-Piano
Happy-Piano    75.6          8.5         7.3             8.6
Sad-Piano      12.6          62.3        13.3            11.8
Neutral-Piano  13.2          12.9        64.7            9.2
Fear-Piano     14.3          7.6         6.9             71.2

Table 4. IER performance of violin (in %).

                Happy-Violin   Sad-Violin   Neutral-Violin   Fear-Violin
Happy-Violin    61.4           7.6          14.8             16.
Sad-Violin      5.6            81.2         6.8              6.4
Neutral-Violin  11.5           5.9          75.3             7.3
Fear-Violin     13.4           12.1         6.1              68.4

Table 5. IER performance of flute (in %).

               Happy-Flute   Sad-Flute   Neutral-Flute   Fear-Flute
Happy-Flute    68.9          9.2         10.2            11.7
Sad-Flute      11.8          74.3        9.7             4.2
Neutral-Flute  6.3           9.7         78.7            5.3
Fear-Flute     19.3          13.5        10.             56.4

Table 6. IER performance of trumpet (in %).

                 Happy-Trumpet   Sad-Trumpet   Neutral-Trumpet   Fear-Trumpet
Happy-Trumpet    75.6            8.1           6.5               9.8
Sad-Trumpet      9.1             69.0          13.2              8.7
Neutral-Trumpet  8.3             14.1          67.2              10.4
Fear-Trumpet     8.4             3.2           4.1               84.3
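
Continuing the Keras sketch in Section 5, the following hedged snippet illustrates how such a model might be trained and how row-normalised confusion matrices like Tables 3-7 can be produced; the arrays X_train, y_train, X_test, y_test and the training settings (epochs, batch size, validation split) are assumptions, not the authors' values.

```python
# Continues the Keras sketch above: `model` is the compiled RNN, X_* are
# (num_clips, NUM_FRAMES, NUM_FEATURES) feature sequences, y_* are one-hot emotion labels.
# Epochs, batch size and validation split are assumptions.
import numpy as np
from sklearn.metrics import confusion_matrix

model.fit(X_train, y_train, validation_split=0.1, epochs=50, batch_size=16)

y_pred = np.argmax(model.predict(X_test), axis=1)
y_true = np.argmax(y_test, axis=1)

cm = confusion_matrix(y_true, y_pred)
print(100.0 * cm / cm.sum(axis=1, keepdims=True))  # row-normalised percentages, as in Tables 3-7
```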

From the results, it is observed that the percussion instrument piano shows a better recognition rate for happy than for sad,
neutral and fear emotions. Similarly, violin works better for the sad emotion than for happy, neutral and fear emotions. Based
on the results obtained, emotion and instrument are grouped into four classes: happy-piano, sad-violin, neutral-flute
and fear-trumpet. Deep recurrent neural networks are trained with the features extracted from the training dataset
music clips. The generated emotion-instrument models are evaluated with the features extracted from the test music
clips. The recognition performance of instrument emotion recognition using MFCC with RNN is shown in Table 7.

Table 7. IER performance using MFCC with RNN (in %).

               Happy-Piano   Sad-Violin   Neutral-Flute   Fear-Trumpet
Happy-Piano    85.6          4.5          3.6             6.3
Sad-Violin     3.2           91.0         3.6             2.2
Neutral-Flute  4.3           6.5          87.3            1.9
Fear-Trumpet   3.5           1.6          1.5             93.4

Overall performance: 89.3 %

5.2. Instrument emotion recognition using SVM

SVM is a traditional machine learning technique which has shown superior performance in many
pattern recognition tasks. It estimates the hyperplane that divides the data into two classes. In this work, emotion
models are created using an SVM classifier with the RBF kernel. The models are trained with various kernel
functions (linear, RBF, and polynomial), of which the RBF kernel shows the best performance. The features extracted
from the test music clips are tested against each of the trained instrument emotion recognition models using the SVM
classifier. The recognition accuracy achieved is shown in Table 8.
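
A hedged scikit-learn sketch of the SVM baseline with the RBF kernel is given below; the clip-level feature vectors X_train_vec/X_test_vec (for example, per-clip means of the frame-level features) and the integer labels y_train/y_test are assumptions about how the frame-level features are summarised, which the paper does not specify.

```python
# A minimal SVM baseline with an RBF kernel (scikit-learn). X_*_vec are hypothetical
# fixed-length clip-level vectors (e.g. frame-level feature means); y_* are integer labels.
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X_train_vec, y_train)
print("Recognition rate: %.1f %%" % (100.0 * clf.score(X_test_vec, y_test)))
```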

6. Results and Discussion

This study explores the deep learning techniques and acoustic features adequate for the task of recognizing
emotion from monophonic instrumental music clips. The instruments are grouped into different instrument classes
based on the emotion they convey, as mentioned in [11], and on the results obtained from the individual instrument and
emotion models. Table 8 summarizes the performance of musical instrument emotion recognition using the deep
learning technique (recurrent neural networks) and the traditional machine learning algorithm (support vector machines) for
the four emotions happy, sad, neutral and fear, using music clips of the four instrument classes string, woodwind,
percussion, and brass. The deep recurrent neural network using MFCC features gives higher performance when
compared with the RNN using CENS and the combined Chroma, spectral and temporal features. The results of the RNN are
also compared with the baseline machine learning technique. The experimental results show that deep recurrent
neural networks outperform support vector machines for instrument emotion recognition. More specifically, it is
observed from the results that the percussion instrument piano works better for happy than for sad or fear emotions,
whereas the string instrument violin shows higher accuracy for the sad emotion than for happy, neutral and fear emotions.
The neutral emotion model achieves high accuracy when tested with flute music clips. Finally, the brass instrument
trumpet works better for the fear emotion than for sad or neutral emotions.

Table 8. Recognition results of IER using RNN and SVM (in %).

                     MFCC            CENS            Chroma + Spectral features + ZCR
Emotion-Instrument   RNN    SVM      RNN    SVM      RNN    SVM
Happy-piano          85.6   82.3     59.2   55.7     71.2   69.3
Sad-violin           91.0   88.2     75.1   73.1     85.2   82.6
Neutral-flute        87.3   83.4     65.3   59.2     78.2   72.1
Fear-trumpet         93.4   89.0     54.5   46.6     76.3   74.9
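
The overall recognition rates quoted in the conclusion appear to be the means of the four per-class rates in Table 8 (a reading of the reported numbers, not stated explicitly by the authors); for example, for MFCC with RNN:

(85.6 + 91.0 + 87.3 + 93.4) / 4 ≈ 89.3 %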
The comparison of the performance of the instrument emotion recognition system using RNN with the features MFCC,
CENS and the combined Chroma, spectral and temporal features for the four emotion-instrument combinations is shown in
Fig. 6.

Fig.6. Recognition rate of IER using RNN with four emotion-instrument groups

7. Conclusion

A novel approach for instrument emotion recognition from monophonic music clips is proposed in this paper.
The dataset contains music clips of four instrument types: string, percussion, woodwind, and brass. The collected
instrument music clips are grouped into four emotions based on the emotion each instrument generally conveys. MFCC, CENS
and combined Chroma, spectral and temporal features are extracted from the music clips. Four models for the emotion-instrument
types happy-piano, sad-violin, neutral-flute, and fear-trumpet are built with the extracted features. The traditional
machine learning technique SVM and the deep learning technique RNN are employed to train on the extracted features.
The results indicate that the MFCC feature with RNN gives better performance for instrument emotion recognition, with
a recognition rate of 89.3%, whereas MFCC with SVM results in a recognition rate of 85.7%. Furthermore, the
results also show that the combined features contain emotion-related information about the instrumental music clips: an
accuracy of 77.7% and 74.7% is achieved when the emotion models are tested using the combined features with RNN
and SVM, respectively. Using CENS features, 63.5% and 58.7% recognition accuracy is achieved with RNN and
SVM. The results of this work state that the use of deep recurrent neural networks improves the
performance of instrument emotion recognition from instrumental music clips. The results also evidence that the
musical instrument plays a role in determining the emotion from monophonic instrumental music clips.

In future work, emotion recognition from polyphonic instrumental music needs to be explored to obtain a more general
picture. It is a challenging task because of the interference between multiple instruments. The scope of research in this
arena lies in the various acoustic features and deep learning algorithms appropriate for instrument emotion
recognition from polyphonic music signals.

References

[1] Thayer RE. (1989) “The biopsychology of mood and arousal.” Oxford University Press, New York.
[2] Yang, Yi-Hsuan, and Chen, Homer H. (2011) “Music emotion recognition.” CRC Press.
[3] Zhu Bin, and Zhang Kejun. (2010) “Music emotion recognition system based on improved GA-BP.” In International Conference On
Computer Design and Applications.
[4] R Cai, C. Zhang, C. Wang, L. Zhang, and W-Y. Ma. (2007) “Musicsense: contextual music recommendation using emotional allocation
modeling.” In International Conference on Multimedia 553–556.
[5] Ramirez, R., Planas J., Escud, N., Mercade J., and Farriols C. (2018) “EEG-Based Analysis of the Emotional Effect of Music Therapy on
Palliative Care Cancer Patients.” Frontiers in Psychology 9:1-7.
[6] Mokhsin, M. B., Rosli N. B., Zambri S., Ahmad N. D., and Rahah S. (2014) “Automatic music emotion classification using artificial neural
network based on vocal and instrumental sound timbres.” Journal of Computer Science 10 (12): 2584–2592.
[7] Jianyu, F., Kivac T., Miles T., and Philippe P. (2017) “Ranking based emotion recognition for experimental music.” in International society
for Music Information Retrieval Conference.
[8] Mahesh, B. (2015) “Emotion recognition and emotion based classification of Audio using Genetic algorithm-an optimized approach.” In
International Conference on Industrial Instrumentation and control.
[9] Tong L., Li H., Liangkai M., and Dongwei G. (2018) “Audio-based deep music emotion recognition.” In International Conference on
Computer Aided Design, Manufacturing, Modelling and Simulation.
[10]N. J. Nalini, and S. Palanivel. (2016) “Music emotion recognition: The combined evidence of MFCC and residual phase.” Egyptian
Informatics Journal 17:1–10.
[11]Junjie, Bai, Jun Peng, Jinliang Shi, Dedong Tang, Ying Wu, Jianqing Li and Kan Luo (2016) “Dimensional music emotion recognition by
valence-arousal regression.” in International conference on cognitive informatics & cognitive computing.
[12]Chingshun, Lin, Mingyu Liu, Weiwei Hsiung, and Jhihsiang Jhang (2016) “Music emotion recognition based on two level support vector
classification.”, in International conference on machine learning and cybernetics.
[13]Yongli, Deng, Yuanyuan Lv, Mingliang Liu, and Qiyong Lu (2015) “Regression approach to categorical music emotion recognition.” In
IEEE International Conference on Progress in Informatics and Computing (PIC) .
[14]Sih-Huei, Chen, Yuan-Shan Lee, Wen-Chi Hsieh, and Jia-Ching Wang (2015) “Music Emotion recognition using Deep Gaussian Process.” in
APSIPA Annual Summit and Conference.
[15]Wei-Chun, Chiang, Jeen-Shing Wang, and Yu-Liang Hsu (2014) “ A music emotion recognition algorithm with hierarchical SVM based
classifiers.” In IEEE International Symposium on Computer, Consumer and Control.
[16]Zhang K., Sun S. (2013) “Web music emotion recognition based on higher effective gene expression programming.” Neurocomputing, 105,
100-106.
[17]Song, Y., Dixon S., and Pearce M.T. (2012) “Evaluation of Musical Features for Emotion Classification.” In International society for Music
Information Retrieval Conference.
[18]Liu, X., Xu Y., Alter K., and Tuomainen J. (2018) “Emotional Connotations of Musical Instrument Timbre in Comparison With Emotional
Speech Prosody: Evidence From Acoustics and Event-Related Potentials.” Frontiers in Psychology, 9.
[19]Sangeetha, R, and Nalini N J, (2019) “Singer identification using MFCC and CRP features with support vector machines.” in International
Conference on Computational Intelligence and Pattern Recognition.
[20]Meinard, Müller, Frank Kurth, and Michael Clausen.(2005) “Audio Matching via Chroma-Based Statistical Features.” in International
Conference on Music Information Retrieval 288–295.
[21]Peeters, G., Giordano B. L., Susini P., Misdariis N., and McAdams S. (2011) “The timbre toolbox: Extracting audio descriptors from musical
signals.” Journal of the Acoustical Society of America 130 (5):9-16.
[22]Lerch, A. (2012) “An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics.” John Wiley &
Sons, Hoboken, NJ, USA.
[23]Shaila, D A. (2012) “Speech and Audio processing”. Wiley India Publication.
[24]Osmalsky, J., M. D. Van, and J.J Embrechts. (2014) “Performances of low level audio classifiers for large scale music similarity.” In
International conference on systems, signals and Image proceedings, IEEE Xplore press 91-94.
