Predicting emotion from music videos: exploring the relative contribution of visual and auditory information to affective responses

Phoebe Chua, Dimos Makris, Dorien Herremans, Senior Member, IEEE, Gemma Roig, Member, IEEE, and Kat Agres, Member, IEEE

• P. Chua is with the National University of Singapore, Singapore. E-mail: pchua@nus.edu.sg
• K. Agres is with the Yong Siew Toh Conservatory of Music, National University of Singapore, Singapore, 117376. E-mail: katagres@nus.edu.sg
• D. Herremans and D. Makris are with the Singapore University of Technology and Design, Singapore.
• G. Roig is with the Goethe University Frankfurt, Germany.
Manuscript received 000; revised 000.

Abstract—Although media content is increasingly produced, distributed, and consumed in multiple combinations of modalities, how
individual modalities contribute to the perceived emotion of a media item remains poorly understood. In this paper we present
MusicVideos (MuVi), a novel dataset for affective multimedia content analysis to study how the auditory and visual modalities contribute
to the perceived emotion of media. The data were collected by presenting music videos to participants in three conditions: music, visual,
and audiovisual. Participants annotated the music videos for valence and arousal over time, as well as the overall emotion conveyed. We
present detailed descriptive statistics for key measures in the dataset and the results of feature importance analyses for each condition.
Finally, we propose a novel transfer learning architecture to train Predictive models Augmented with Isolated modality Ratings (PAIR) and
demonstrate the potential of isolated modality ratings for enhancing multimodal emotion recognition. Our results suggest that perceptions
of arousal are influenced primarily by auditory information, while perceptions of valence are more subjective and can be influenced by
both visual and auditory information. The dataset is made publicly available.

Index Terms—Multimodal Modelling, Emotion Prediction, Multimodal Emotion Prediction, Dataset, Long Short-Term Memory, Affective
Computing.

1 INTRODUCTION

How do we perceive, integrate, and interpret affective information that is conveyed through multiple sensory modalities simultaneously, and how can we model and predict the effects of visual and auditory information on perceived emotion states? The strong connection between music and emotions [1] as well as video and emotions [2] is well-studied and intuitive – indeed, research has found that many people cite emotion regulation as the primary reason they listen to music [3]. As multimedia content becomes easier to create and access, music can increasingly be used as a tool to shape the emotional narratives in video [4]; likewise, visual displays can shape our perception of a melody's affective content [5]. The development of affective multimedia content analysis methods to predict and understand the influence of different media modalities on emotion is hence a key step in the development of affect-aware technologies and has utility in a wide range of applications including content delivery, indexing and search [6].

The defining feature of multimedia content is its use of multiple modalities to convey information: for example, we might easily imagine a video that plays text (captions), auditory (voices and music) and visual information simultaneously. Publishers have begun to exploit the flexibility of multimedia content by releasing their work on multiple platforms, each tailored to a different modality. For example, contemporary podcasts are frequently recorded with video then released on both audio and video streaming platforms. In turn, users can choose to consume multimedia content in various modalities or combinations of modalities based on their needs and preferences [7]. Of the various forms of multimedia content available, music-related content is consumed in an especially wide range of modalities. It can be accessed through audio-only platforms or in audio-visual form (music videos), and is frequently used as background music for videos watched on mute, just to name a few scenarios. Although music is typically understood as an "art of sound" [8], video and audio music streaming currently account for similar proportions of the total music streaming volume, with the International Federation of the Phonographic Industry (IFPI) reporting in their latest Global Music Report [9] that video streaming made up 47% of music streaming consumption globally in 2019.

While a great deal of work has been dedicated to music and video emotion recognition, to our knowledge, few studies have acknowledged that multimedia content can be flexibly consumed in different combinations of modalities, which may have differing effects on the emotions of the user. In this paper we present MuVi, a novel dataset for affective multimedia content analysis which contains music videos annotated with both dimensional and discrete measures of emotion in three conditions: audiovisual (original music videos), music (music-only) and visual (muted music videos). Each annotation is accompanied by annotator metadata
relevant to the perception of emotion in music, including gender, song familiarity and musical training. We provide a comprehensive descriptive analysis of the dataset's main features and examine whether modality and demographic factors influence the perceived emotion of stimuli. Then, we investigate the relative contribution of audio and visual modalities to the prediction of arousal and valence in audiovisual stimuli, and discuss the most important auditory and visual features. Finally, we propose a novel model architecture that leverages knowledge transfer from isolated modality ratings for multimodal emotion recognition - a Predictive model Augmented with Isolated modality Ratings (PAIR).

The paper is organized as follows: Motivation for this study and related work on emotion recognition systems for media content are covered in Section 2. Data collection procedures and key features of the MuVi dataset, including dataset analysis and visualization, are covered in Sections 3 and 4. In Section 5, we present PAIR, our novel architecture for multimodal emotion recognition, along with implementation details. Sections 6 and 7 detail the experimental setup and evaluation of our approach. Finally, in Section 8, we conclude and discuss some avenues for future work.

2 BACKGROUND

2.1 Models of emotion
Emotion is generally defined as a collection of psychological states that include subjective experience, expressive behavior and peripheral physiological responses [10]. While there are multiple ways to understand and measure emotions, the two most frequently used in affective computing research are the dimensional [11] and categorical [12] approaches. The dimensional approach represents emotions as points in a continuous space defined by a set of affective dimensions. Intuitively, it implies that emotion can be understood as a linear combination of the dimensions. A commonly employed dimensional model for Music Emotion Recognition (MER) is the circumplex model proposed by Russell [11], which consists of two dimensions: arousal (indicating emotional intensity) and valence (indicating pleasant versus unpleasant emotions). This valence-arousal (VA) model is easily applied across different domains and can be used to collect both static and dynamic emotion annotations - the difference being that static annotations generally reflect the overall emotion or mood conveyed by a media item, while dynamic annotations reflect the emotions conveyed by a media item as it unfolds over time. However, the VA model can lack the coherence of categorical approaches that define emotions with semantic terms [13].

Categorical approaches represent emotions with a set of discrete labels such as "joy" or "fear", based on the underlying hypothesis that there are a finite number of distinct basic emotions [12]. Basic emotion theories typically hypothesize that there are between 6 and 15 emotion categories, while recent work suggests that there are up to 27 distinct varieties of general emotional experiences [13]. Zentner et al. [1] proposed the Geneva Emotional Music Scale (GEMS) as a domain-specific model of emotion to more accurately capture the range of aesthetic emotions that can be induced by music. Compared to mainstream emotion models, most of the emotions in GEMS are positive. The model also contains categories that are not always included in traditional approaches to the classification of emotions, such as "transcendence" and "nostalgia," which help to better capture the range of emotion states induced by music. The full GEMS-45 scale contains 45 terms, but shorter variants of the scale (GEMS-25 and GEMS-9) were later defined through factor analysis. Categorical approaches may be fairly rich, due to the abstract and complex concepts expressed by natural language; they may also be more intuitive, especially when we want to describe the overall emotions conveyed by a song or video. However, they often do not lend themselves readily to dynamic emotion annotations.

To integrate dimensional and categorical approaches to understanding and measuring emotion, some have suggested that discrete emotions represent positions in an affective space defined by dimensions such as arousal and valence [14], [15]. Empirical tests of this relationship with emotionally evocative videos [13] have found that affective dimension judgements are able to explain the variance in categorical judgements and vice versa, and that categories seem to have more semantic value than affective dimensions. However, to our knowledge, similar studies investigating the relative contributions of music and video content to emotion induction have not been conducted.

2.2 Perception of multimedia content
An understanding of how we weigh and integrate affective information from different sensory modalities can inform the design of computational models for multimedia content analysis. At the same time, computational models can provide hints as to how humans perceive affective multimedia content [16]. With regards to music multimedia content in particular, we are interested in the relative influence of audio and visual input on the perceived emotion of the content. Empirical studies on the influence of visual information on the evaluation of music performance have consistently replicated significant effects [17], [18], suggesting that visual information influences the way we perceive sound and music.

More broadly, studies have examined the ways in which music influences the perception of visual scenes, as well as the reverse relationship – the ways in which visual information influences the cognitive processing of music. Boltz [19] proposed that music influences visual perception by providing an interpretative framework for story comprehension. For example, mood-congruent music could be utilized by artistic directors to heighten emotional impact or clarify the meaning of ambiguous scenes in a visual story, while mood-incongruent music could be used to convey subtler meanings such as irony [5]. Although comparatively little work has been done to investigate the way visual information influences the interpretation of music content, empirical findings suggest that music videos help maintain listeners' interest when songs are relatively ambiguous [20]. Both perspectives hint at how the construction of a meaningful narrative is an important component of the way we consume media. Hence, one way to think about multimedia might be as a form of expression that provides multiple information channels through which artists can convey meaning to their audience, and conversely by which audiences can make
inferences about the artists' intent. Indeed, Sun and Lull [21] argue that one of the main appeals of music videos is that the visual scenes enrich a song's meaning and underlying message.

2.3 Datasets for affective multimedia content analysis
The development of computational models for emotion recognition and understanding has motivated the collection of novel datasets, usually based on audio, visual, or audiovisual stimuli that are accompanied by (dynamic) time-series annotations of emotion or (static) emotion labels. While there are many dimensions along which to label affective computing datasets, here we focus on four key factors: emotion measurement, temporal nature, domain and features.

Emotion measurement refers to the categorical and dimensional approaches described earlier, while temporal nature refers to whether the modeling target is a single, static label capturing the overall emotion of the content, or a vector that captures the dynamically changing emotion as it unfolds over time. Across datasets, emotion is most frequently measured using continuous valence and arousal scales, though some datasets include additional dimensions such as dominance [28] or use categorical measures [24].

Domain refers to the modality in which stimuli are annotated; here, we focus on music, visual and audiovisual stimuli. Datasets using music-based stimuli have largely focused on music consumed in an audio-only format [23], [30], [31]. One notable exception is the Database for Emotion Analysis using Physiological Signals (DEAP), which used music videos as the underlying stimuli [28], although the focus of the dataset is predicting emotion labels from a combination of data-based features extracted from the underlying stimuli and physiological data. Datasets based on multimodal stimuli include those based on film repositories [32], [27], videos collected "in the wild" [33], [34] and those provided by the Audiovisual Emotion Challenge (AVEC). We draw particular attention to the Ryerson Audio-Visual Database of Speech and Song (RAVDESS) [29], which consists of spoken and song utterances vocalized by professional actors to express emotions based on Ekman's theory of basic emotions [12]. In contrast to the other datasets, the stimuli in RAVDESS were presented and annotated in three conditions: audio-only, video-only and audiovisual.

Features that can be used for emotion recognition are also a key component of each dataset. Reviews have summarized extracted features relevant to affect detection in the audio modality, such as intensity (loudness, energy), timbre (MFCC) and rhythm (tempo, regularity) features [31], and in the video modality, such as colour, lighting key, motion intensity and shot length [35], [36]. Features that can capture complex latent dimensions in the data, such as the audio embeddings generated by the VGGish model [37], [38], [39], are also becoming increasingly popular. Features may be provided in the dataset or extracted from the source data if it is available. Apart from data-based features, however, the subjective nature of emotion perception suggests that it may be useful to build personalized models that account for characteristics of the user in addition to only the characteristics of the media content being consumed [40], [41]. Several of the datasets in Table 1 provide demographic and contextual information such as the annotator's age, gender, familiarity with the media and current mood.

A comprehensive review of databases that utilize only one modality or other combinations of modalities is out of the scope of this paper, but we refer interested readers to recent survey studies (e.g., [35] for multimodal emotion recognition, [30] for music and [42] for audiovisual). Table 1 instead highlights several of the prominent datasets relevant to music and audiovisual emotion recognition in order to show that affective computing datasets are typically focused on a single domain, restricting researchers' ability to study the relative contribution of individual modalities to multimodal emotion recognition.

Name | Domain | Emotion measurement (static) | Emotion measurement (dynamic) | # items | # annotations per item | Features
AMG1608 [22] | Music | Valence, Arousal | - | 1,608 | 15-32 | Data-based (MFCC, tonal, spectral, temporal)
DEAM [23] | Music | - | Valence, Arousal | 1,802 | 5-10 | Annotator metadata (mood, familiarity, liking, personality etc.), source music data
Emotify [24] | Music | GEMS | - | 400 | 16-48 | Annotator mood, metadata (age, gender, mother tongue), liking
Moodswings [25] | Music | - | Valence, Arousal | 240 | 7-23 | Data-based (MFCC, spectral), Echo Nest Timbre features
FilmStim [26] | AV | Arousal, etc. | - | 70 | 44-56 | Source video data
LIRIS-ACCEDE [27] | AV | Valence, Arousal | Valence, Arousal | 9,800 | NA (rankings provided) | Source video data
DEAP [28] | AV | Valence, Arousal, Dominance | - | 120 | 14-48 | EEG data, frontal face video of annotators, links to MVs
RAVDESS [29] | Music, Visual, AV | Basic emotions, intensity | - | 2,452 | 10 | Source recordings
MuVi (the present dataset) | Music, Visual, AV | GEMS | Valence, Arousal | 81 | 5-9 | Data-based (audio, visual), annotator metadata (gender, familiarity, musical training etc.)

TABLE 1: Comparison of existing datasets for emotion recognition based on music, visual and audiovisual stimuli.

The similarities in the literature on music and video emotion recognition present a clear opportunity to bring together these two lines of research and better understand how individuals respond emotionally to media, and specifically music, presented in multimodal formats. To our knowledge, the dataset collected in this research, the MuVi dataset, is the first published dataset that contains multimodal stimuli annotated with both static and time-series measures of emotion in individual, isolated modalities (music, visual) as well as the original multimodal format (audiovisual). In other words, we have three sets of stimuli: the original music videos (audiovisual modality), the music clips (music modality) and muted video clips (visual modality). The presence of both static and dynamic annotations for each stimulus provides more multifaceted information about its affective content and enables us to study how discrete emotion labels might map onto continuous measures of valence and arousal [15]. Finally, we provide the anonymized profile and demographic information of annotators.

MuVi has several advantages over RAVDESS, the only other dataset we are aware of that contains isolated modality ratings. Firstly, while RAVDESS uses short utterances lasting several seconds as stimuli, MuVi uses longer 60s excerpts from music videos. As such, while the stimuli in RAVDESS are only rated with static emotion labels, the stimuli in MuVi are accompanied by both static overall emotion labels and dynamic time-series annotations. Additionally, while a media item is only shown once per participant in MuVi (ignoring modality), it is unclear whether a participant would be exposed to the same utterance in different modalities in RAVDESS - a concern since repeated presentation might influence the perceived emotion of an utterance [29].

3 DATASET COLLECTION
In this section, we describe the experimental protocol for the collection of the MuVi dataset.

3.1 Stimuli selection
The MuVi dataset consists of music videos (MVs) annotated in three modalities: music-only, visual (i.e., muted video only), and audiovisual. We selected a corpus of 81 music videos from the LAKH MIDI Dataset [43], a collection of MIDI files matched to entries in the Million Song Dataset [44]. Entries in the Million Song Dataset contain additional information about each music video including audio features
and metadata, making it a valuable resource for users of MuVi who may be interested in obtaining additional features. As there were some MVs in which the song was preceded by a silent video intro, we manually scanned each MV to determine the timestamp at which music began playing, then obtained excerpts by taking the first minute of each MV beginning from that timestamp.
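To make this excerpt-creation step concrete, the following is a minimal sketch of how such 60-second excerpts could be cut with the ffmpeg command-line tool. The file names, output directory, and the CSV of manually determined music-onset timestamps are hypothetical illustrations, not the authors' released preprocessing scripts.

import csv
import subprocess
from pathlib import Path

# onsets.csv (hypothetical): one row per music video, e.g. "video_id,onset_seconds"
with open("onsets.csv", newline="") as f:
    for row in csv.DictReader(f):
        src = Path("videos") / f"{row['video_id']}.mp4"
        dst = Path("excerpts") / f"{row['video_id']}_60s.mp4"
        # Seek to the manually determined music onset and keep the next 60 seconds.
        # Re-encoding (rather than "-c copy") avoids cutting on a non-keyframe.
        subprocess.run(
            ["ffmpeg", "-y", "-ss", row["onset_seconds"], "-i", str(src),
             "-t", "60", str(dst)],
            check=True,
        )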
3.2 Procedure
We recruited 48 participants (31 males) aged between 19 and 35 (mean = 23.35y, std = 3.28y). All participants provided informed consent according to the procedures of the SUTD Institutional Review Board. The study was approved by the Institutional Review Board under SUTD-IRB 20-315. Prior to the experiment, participants filled out a survey that included demographic questions and music listening preferences. Once in the lab, participants were informed of the listening experimental protocol and the meaning of the arousal-valence scale used for annotations. For each quadrant of the arousal-valence space, participants could listen to a short (14s-17s) music clip that was typically rated as belonging to that quadrant. An experimenter was present to answer any questions.

Each participant was then given two practice trials to familiarize themselves with the annotation interface before beginning the actual task. During the session, each participant annotated between 30 and 36 media items (not inclusive of the practice trials) in a randomly chosen media modality, and to ensure the reliability of ratings, the same media item would not be shown to a participant twice, even if in a different modality. We compensated participants with a $10 SGD voucher at the end of the session. The final dataset contains a total of 1,494 annotations (music = 350, visual = 349, audiovisual = 342). Each one-minute excerpt received between five and nine annotations in each modality, with the median number of annotations for each media item being six.

3.2.1 Annotation process
For each media item, participants first completed the dynamic (continuous) annotation task. To date, continuous annotations of arousal and valence for video stimuli have typically been collected using a slider placed outside the video frame [33], [34]. However, this design might result in participants having to repeatedly shift their attention between the stimuli and their current position on the slider. Superimposing the arousal-valence scale over the video stimuli might mitigate this issue. To investigate which design participants would prefer, we tested two versions of the dynamic annotation interface in the video-only and music video conditions: with the 2D valence-arousal axes overlaid over the video (Figure 1) or located beside the video. Each version was randomly selected with 50% probability. Participants were asked to indicate which annotation interface they preferred at the end of the experiment.

For the dynamic annotation task, participants were asked to rate the emotion they thought the media item was trying to convey (rather than how they felt as they watched or listened to the media item) by moving their mouse over a 2D graph of the arousal-valence space as the item played. Figure 1 shows an example of our annotation interface, which displays the axes labels and an emoticon for each quadrant of the arousal-valence space to help participants remember where different perceived emotions lie on the axes.
Fig. 1. Dynamic annotation interface (overlaid mode). Participants moved their cursor, indicated by the black square, to match the arousal and valence of media items during playback. The interface also provided information about their current coordinates in the 2D space.

To collect the dynamic annotations, we used a script that sampled a participant's cursor location every 0.5 seconds. However, the actual sampling frequency varied slightly depending on the lab computer's browser, Internet connection speed and CPU load. For consistency, we resampled the annotations to maintain a sampling interval as close to 0.5 s (2 Hz) as possible; the published annotations have a mean sampling rate of 2 Hz, with a standard deviation of 0.0177 s in the sampling interval.
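As an illustration of this resampling step, below is a minimal pandas sketch. It assumes the raw cursor log is a CSV with a timestamp column in seconds plus the arousal and valence coordinates; the file and column names are hypothetical rather than MuVi's documented schema.

import pandas as pd

# Raw cursor samples: irregular timestamps (seconds), arousal and valence in [-1, 1].
raw = pd.read_csv("annotation_log.csv")  # columns: time_s, arousal, valence (hypothetical)
raw.index = pd.to_timedelta(raw["time_s"], unit="s")

# Resample onto a regular 500 ms grid (2 Hz): average samples falling into each bin,
# then interpolate bins that received no cursor sample.
regular = (
    raw[["arousal", "valence"]]
    .resample("500ms")
    .mean()
    .interpolate(method="linear")
)
print(regular.head())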
Once the media item finished playing, participants indicated whether or not they had watched or listened to the media item before, and completed the static (discrete) annotation task. The static annotation task consisted of selecting the terms that they felt described the media item's overall emotion. The terms were taken from GEMS-28 [45], an extended version of the GEMS-25 scale, resulting in a total of 27 possible emotion labels. As shown in Table 2, the labels can be grouped into nine different categories which in turn can be condensed into three "superfactors". Both these categories and superfactors were also displayed to participants. Participants had to select at least one label per media item, with no upper limit on the number of labels they could choose; the median number of labels chosen was four.

Superfactor | Category | Emotion labels
Sublimity | Wonder | Moved, Allured, Filled with wonder
Sublimity | Transcendence | Fascinated, Overwhelmed, Feeling of transcendence
Sublimity | Peacefulness | Serene, Calm, Soothed
Sublimity | Tenderness | Tender, Affectionate, Mellow
Sublimity | Nostalgia | Nostalgic, Sentimental, Dreamy
Vitality | Power | Strong, Energetic, Triumphant
Vitality | Joyful Activation | Animated, Bouncy, Joyful
Unease | Sadness | Sad, Tearful, Blue
Unease | Tension | Tense, Agitated, Nervous

TABLE 2: GEMS-28 emotion terms, together with their categories and superfactors, as per [1].

3.3 Feature extraction
For each media item in our dataset, we extracted a set of audio and video features. To match the rate at which dynamic arousal-valence annotations were collected, time-varying acoustic features were extracted from the underlying music videos in non-overlapping 500ms windows.

Audio feature extraction was performed with openSMILE [46], a popular open-source library for audio feature extraction. Specifically, we used the "emobase" configuration file to extract a set of 988 low-level descriptors (LLDs) including MFCC, pitch, spectral, zero-crossing rate, loudness and intensity statistics, many of which have been shown to be effective for identifying emotion in music [38], [39], [47], [48]. Many other configurations are available in openSMILE, but we provide the "emobase" set of acoustic features since it is well-documented and was designed for emotion recognition applications [49].
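For readers who want to reproduce this step, a minimal sketch using the audEERING opensmile Python wrapper is shown below. The 500 ms windowing loop and file paths are our own illustration under those assumptions, not the authors' released extraction code.

import opensmile
import pandas as pd

# "emobase" functionals: the set of low-level-descriptor statistics per analysis window.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.emobase,
    feature_level=opensmile.FeatureLevel.Functionals,
)

def emobase_windows(wav_path, clip_seconds=60, hop=0.5):
    """Extract emobase functionals over non-overlapping 500 ms windows."""
    rows = []
    t = 0.0
    while t < clip_seconds:
        rows.append(smile.process_file(wav_path, start=t, end=t + hop))
        t += hop
    return pd.concat(rows)

features = emobase_windows("excerpt_audio.wav")
print(features.shape)  # one row per 500 ms window of the 60 s excerpt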
Six types of visual features are provided: color, lighting key, facial expressions, scenes, objects, and actions. With the exception of action features, visual features were extracted from still frames sampled from the underlying music videos at 500ms intervals. Features related to color [50] and lighting [51] have been used in a number of video emotion recognition studies, based on the insight that these visual elements are often manipulated by filmmakers to convey a chosen mood or induce emotional reactions. These insights are supported by experiments showing that the hue and saturation of a color have a significant influence on induced arousal and valence [52]. Similarly, there is evidence that the facial expressions of individuals in a scene influence the emotion perceived by a viewer [53]. We use OpenCV^1 to extract hue and saturation histograms as well as "lighting key" information. Facial expressions were extracted using a pre-trained VGG19 [54] model.
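A minimal sketch of the colour-related part of this step with OpenCV is given below. The 16-bin histogram size and the median-brightness proxy used for the lighting key are illustrative choices, not necessarily the exact settings used for MuVi.

import cv2
import numpy as np

def colour_features(frame_bgr, bins=16):
    """Hue/saturation histograms and a simple lighting-key proxy for one still frame."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hue_hist = cv2.calcHist([hsv], [0], None, [bins], [0, 180]).flatten()
    sat_hist = cv2.calcHist([hsv], [1], None, [bins], [0, 256]).flatten()
    hue_hist /= hue_hist.sum() + 1e-8      # normalise to a distribution
    sat_hist /= sat_hist.sum() + 1e-8
    # Lighting key approximated here by median brightness times brightness spread (V channel).
    v = hsv[..., 2].astype(np.float32) / 255.0
    lighting_key = np.median(v) * v.std()
    return np.concatenate([hue_hist, sat_hist, [lighting_key]])

cap = cv2.VideoCapture("excerpt.mp4")
cap.set(cv2.CAP_PROP_POS_MSEC, 500)        # sample a still frame at the 500 ms mark
ok, frame = cap.read()
if ok:
    print(colour_features(frame).shape)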
Recent techniques have begun to recognize the importance of contextual information such as scenes, objects and actions for image and video emotion recognition [55], [56]. Intuitively, certain contexts may tend to be associated with certain affective states: for example, a beach scene or images of pets might be associated with happy emotions/positive affect, while arguments might be associated with anger/negative affect. Scene classifications were extracted using a ResNet50 model [57] trained on the Places365 dataset [58], while Yolov5^2 was used for object detection. To obtain the action features, we first employed the popular open-source shot detection method SceneDetect^3 to decompose each music video into a sequence of scenes. Then, the I3D model described in [59] was used to compute the probability of each action class in each scene.

1. https://opencv.org
2. https://github.com/ultralytics/yolov5
3. https://github.com/Breakthrough/PySceneDetect
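For the shot-decomposition step just described, a minimal sketch using the PySceneDetect v0.6-style API is shown below; the subsequent I3D action scoring is only indicated by a comment, since that model comes from a separate codebase and is not reproduced here.

from scenedetect import detect, ContentDetector

# Split the music video into shots/scenes using content-based shot detection.
scene_list = detect("excerpt.mp4", ContentDetector())

for start, end in scene_list:
    print(f"Scene from {start.get_seconds():.2f}s to {end.get_seconds():.2f}s")
    # Each (start, end) segment would then be passed to an I3D action classifier
    # (not shown here) to obtain per-scene action-class probabilities.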
4 DATASET ANALYSIS

4.1 Distribution of emotion annotations
In this section, we present high-level descriptive statistics for both the AV annotations and GEMS emotion labels.

4.1.1 Arousal-valence annotations
The overall distribution of arousal-valence values across participants, sampled at 2Hz intervals, is displayed in Figure 2. Across the entire dataset, the mean arousal value is 0.163 (music = 0.250, audiovisual = 0.192, visual = 0.044), while the mean valence is 0.073 (music = 0.177, audiovisual = 0.115, visual = -0.075).

Fig. 2. Distribution of arousal-valence annotations for each modality, sampled at 2Hz.

4.1.2 GEMS emotion labels
Across the entire dataset, participants selected an average of 4.06 labels per media stimulus (music = 3.92, audiovisual = 4.31, visual = 3.96). Each media stimulus was labeled with an average of 15.83 unique labels in total across participants (music = 15.36, audiovisual = 16.56, visual = 15.57).

Across the dataset, the most frequently selected emotion labels were "energetic", which was selected for 38.82% of the media stimuli and makes up 9.55% of the labels, and "strong", which was selected for 30.83% of the media stimuli and makes up 7.58% of the labels. Both "strong" and "energetic" belong to the Power category, which was also the most frequently selected category, with 55.3% of discrete annotations containing at least one Power label. The least frequently selected labels were "soothed" and "feeling of transcendence". The distribution of GEMS emotion labels selected might be a reflection of the media stimuli used, which all belonged to the pop music genre and are hence generally more upbeat.

GEMS label | Music | Audiovisual | Visual | Total % | Songs %
Wonder
  Moved | 54 | 65 | 63 | 3.03% | 12.33%
  Filled with wonder | 49 | 65 | 68 | 3.03% | 12.33%
  Allured | 50 | 64 | 55 | 2.82% | 11.45%
Transcendence
  Fascinated | 79 | 106 | 71 | 4.27% | 17.34%
  Overwhelmed | 59 | 64 | 52 | 2.92% | 11.86%
  Feeling of transcendence | 36 | 39 | 33 | 1.80% | 7.32%
Peacefulness
  Serene | 36 | 35 | 33 | 1.73% | 7.05%
  Calm | 54 | 51 | 53 | 2.63% | 10.70%
  Soothed | 30 | 33 | 25 | 1.47% | 5.96%
Tenderness
  Tender | 37 | 60 | 54 | 2.52% | 10.23%
  Affectionate | 73 | 104 | 90 | 4.45% | 18.09%
  Mellow | 41 | 30 | 44 | 1.92% | 7.79%
Nostalgia
  Nostalgic | 105 | 112 | 101 | 5.30% | 21.54%
  Sentimental | 79 | 99 | 87 | 4.42% | 17.95%
  Dreamy | 85 | 87 | 74 | 4.10% | 16.67%
Power
  Strong | 147 | 164 | 144 | 7.58% | 30.83%
  Energetic | 193 | 202 | 178 | 9.55% | 38.82%
  Triumphant | 48 | 51 | 52 | 2.52% | 10.23%
Joyful Activation
  Animated | 43 | 77 | 57 | 2.95% | 11.99%
  Bouncy | 113 | 141 | 105 | 5.98% | 24.32%
  Joyful | 103 | 106 | 79 | 4.80% | 19.51%
Sadness
  Sad | 96 | 102 | 98 | 4.93% | 20.05%
  Tearful | 33 | 35 | 41 | 1.82% | 7.38%
  Blue | 56 | 46 | 54 | 2.60% | 10.57%
Tension
  Tense | 113 | 110 | 107 | 5.50% | 22.36%
  Agitated | 59 | 70 | 57 | 3.10% | 12.60%
  Nervous | 41 | 49 | 45 | 2.25% | 9.15%
Total | 1,912 | 2,167 | 1,920 | 5,999 (overall) |
Average labels selected | 3.92 | 4.31 | 3.96 | 4.06 (overall) |

TABLE 3: Distribution of GEMS emotion labels for each modality.

We also computed the pairwise Pearson's correlation of the emotion labels selected by each participant for each stimulus. A heatmap of the results is displayed in Figure 3. The strongest positive correlations were between sad-tearful (r = 0.419, p < 0.001) and bouncy-energetic (r = 0.378, p < 0.001). One possible interpretation of the results is that these labels were frequently selected together. A second possibility is that the labels were selected by different people to indicate the same emotion for a particular media file, which might suggest partially redundant labels [24]. Indeed, the pair sad-tearful belongs to the same GEMS category "Sadness", while bouncy-energetic belongs to the same superfactor "Vitality" (encompassing Power and Joyful Activation), as shown in Table 2. The strongest negative correlations were observed between energetic-sad (r = -0.255, p < 0.001) and joyful-sad (r = -0.206, p < 0.001), which is logical as the terms are antonyms semantically.

Fig. 3. Pearson's correlations between categorical GEMS emotion labels.
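The label-correlation analysis behind Figure 3 can be reproduced from the released static annotations with a few lines of pandas; a sketch is given below. The indicator-matrix construction assumes one row per participant-stimulus annotation with a delimiter-separated column of selected GEMS labels, which is an assumption about how the data are loaded rather than a documented MuVi schema.

import pandas as pd

# annotations: one row per (participant, stimulus) with the selected GEMS labels.
annotations = pd.read_csv("static_annotations.csv")           # hypothetical file name
labels = annotations["gems_labels"].str.split(";")            # e.g. "sad;tearful;blue"

# Build a binary indicator matrix: rows = annotations, columns = the 27 GEMS terms.
indicator = (
    labels.explode()
    .str.strip()
    .pipe(pd.get_dummies)
    .groupby(level=0)
    .max()
)

# Pairwise Pearson correlations between label columns (the matrix shown as a heatmap).
corr = indicator.corr(method="pearson")
print(corr.loc["sad", "tearful"], corr.loc["bouncy", "energetic"])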
4.2 Effect of annotation interface, media modality and annotator profile on perceived emotion
We performed a series of analyses to examine whether the annotation interface, modality, and demographic and profile characteristics of participants influence the perceived emotion of media stimuli, as measured by dynamic arousal-valence ratings as well as overall discrete emotion labels.

4.2.1 Arousal-valence annotations
A visual inspection of the full time-series arousal-valence annotations reveals interesting differences between the average emotion trajectories across modalities. Although the annotations appear to be similar across all modalities for some stimuli, on the whole, annotations for the audiovisual modality tend to resemble those for the music modality. First, for each stimulus, we average the arousal annotations to obtain mean arousal ratings for the music, visual and audiovisual modalities. Then, we compute the Pearson's correlation between each pair of modalities. For arousal, the highest correlations were obtained for the music-audiovisual modality pair (mean = 0.731, sd = 0.309), followed by visual-audiovisual (mean = 0.479, sd = 0.453), then music-visual (mean = 0.411, sd = 0.436). We observe the same pattern for valence, where the highest correlations were again for the music-audiovisual modality pair (mean = 0.433, sd = 0.423), followed by visual-audiovisual (mean = 0.372, sd = 0.495), then music-visual (mean = 0.084, sd = 0.577).

Given that the arousal and valence annotations are time-series data, individual data points are not independent. As such, in order to perform the following set of statistical analyses, we pre-process all annotations by taking the median arousal and valence of each annotation sequence. A Kolmogorov-Smirnov test of goodness-of-fit indicates that median arousal (D = 0.273, p < 0.001) and valence (D = 0.229, p < 0.001) are not normally distributed. Hence, non-parametric statistical tests were used for the following analyses.

Overlay type: Mann-Whitney U tests indicate that there is no significant difference between median arousal (U = 1.15×10^5, p = 0.2798) and valence (U = 1.14×10^5, p = 0.2456) for side-by-side versus overlaid annotation interfaces, which are relevant to the visual and audiovisual modalities. Hence, we combine the overlay types in our remaining analyses. The majority of participants (31 out of 49, 63.27%) indicated a preference for the annotation interface with the 2D valence-arousal axes located beside the video.

Media modality: The Kruskal-Wallis one-way ANOVA was run to compare arousal-valence annotations for stimuli presented in the music, visual, and audiovisual modalities. The results were significant for both arousal (H = 79.45, p < 0.001) and valence (H = 100.38, p < 0.001), indicating that there is a significant difference between median emotion annotations for at least two of the modalities. Arousal is highest in the music modality (median = 0.2957), followed by the audiovisual modality (median = 0.2087), and lowest in the visual modality (median = 0.0). Similarly, valence is highest in the music modality (mean = 0.2087), followed by the audiovisual modality (mean = 0.1130), and lowest in the visual modality (mean = -0.0087). This suggests that the modality in which a media item is consumed exerts an effect on its perceived arousal and valence.

4.2.2 GEMS emotion labels
A contingency Chi-square test was used to assess whether the frequency distribution of emotion labels is significantly different for media stimuli presented in different modalities. Additionally, we investigated whether extra-musical factors (such as gender, musical training, and familiarity with the media item) and characteristics of the annotation interface (whether it was overlaid on the media stimulus or placed on the side) would influence the frequency of each emotion label's selection. We found that neither modality nor overlay type significantly influenced the frequency with which emotion labels were selected. However, the test statistic was significant for gender (χ² = 46.36, df = 26, p = 0.008), formal musical training (χ² = 93.43, df = 26, p < 0.001), and familiarity (χ² = 263.26, df = 26, p < 0.001).
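All of the tests used in Sections 4.2.1 and 4.2.2 are available in scipy.stats; the sketch below illustrates the corresponding calls on hypothetical arrays of per-annotation medians and a hypothetical label-by-group contingency table (the variable contents are placeholders, not the MuVi data).

import numpy as np
from scipy import stats

# Per-annotation median arousal, split by condition (placeholder arrays).
arousal_music = np.random.uniform(-1, 1, 350)
arousal_visual = np.random.uniform(-1, 1, 349)
arousal_av = np.random.uniform(-1, 1, 342)

# Normality check: Kolmogorov-Smirnov against a fitted normal distribution.
all_arousal = np.concatenate([arousal_music, arousal_visual, arousal_av])
D, p_norm = stats.kstest(all_arousal, "norm", args=(all_arousal.mean(), all_arousal.std()))

# Overlay type: Mann-Whitney U test between two independent groups of medians.
U, p_overlay = stats.mannwhitneyu(arousal_av[:170], arousal_av[170:])

# Media modality: Kruskal-Wallis one-way ANOVA across the three conditions.
H, p_modality = stats.kruskal(arousal_music, arousal_visual, arousal_av)

# GEMS labels: contingency chi-square on a (27 labels x groups) frequency table.
table = np.random.randint(5, 50, size=(27, 2))   # e.g. label counts for two gender groups
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)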
We manually inspected the difference between expected and observed frequencies of emotion labels for each of these three groups and found that for gender (Figure 4), labels under the GEMS superfactor Vitality such as "energetic", "animated" and "bouncy" tended to show the greatest deviation from the expected frequencies according to the Chi-square test. Interestingly, this difference was not consistent across gender: female participants selected the label "energetic" less frequently than male participants, while selecting "animated" and "bouncy" more frequently.

Fig. 4. Distribution of GEMS emotion labels selected by female (left) and male (right) participants.

For 'years of musical training' (Figure 5), the label "nostalgic" showed the greatest deviation from the expected frequencies according to the Chi-square test. Participants with three years or less of musical training selected the label "nostalgic" less frequently than expected, even though on average they were more likely to have seen or listened to the media stimuli previously.

Fig. 5. Left: Distribution of GEMS emotion labels selected by participants with little (0-3 years) musical training; Right: participants with more than 3 years of musical training.

For familiarity (Figure 6), the label "tense" shows the greatest deviation from the expected frequencies according to the Chi-square test. Participants who were unfamiliar with the media item selected labels from the GEMS category Tension such as "tense" and "nervous" much more frequently than expected, perhaps because they were less sure what to expect. They also selected the label "nostalgic" less frequently than expected, a result that is fairly intuitive since nostalgia is generally induced by familiarity. Finally, participants who were familiar with the media items selected labels from the GEMS category Sadness such as "sad" and "blue" more frequently than expected.

Fig. 6. Left: Distribution of GEMS emotion labels selected by participants who were exposed to the media stimuli for the first time; Right: participants who had previously viewed or listened to the media item.

5 A PREDICTIVE MODEL AUGMENTED WITH ISOLATED MODALITY RATINGS (PAIR)
In this section, we present a novel approach for multimodal time-series emotion recognition, termed a Predictive model Augmented with Isolated modality Ratings (PAIR). We first discuss models for unimodal emotion recognition relevant to the music and visual modalities, and culminate in a discussion of models for multimodal emotion recognition relevant to the audiovisual modality.

While sophisticated modelling techniques are undoubtedly required to improve state-of-the-art performance on emotion recognition tasks, our focus here is to investigate whether an approach that utilizes parameters pre-trained on isolated modality ratings can lead to better performance on a multimodal (audiovisual) emotion recognition task. As
such, all our models are based on long short-term memory (LSTM) networks [60]. LSTMs are a popular variant of the recurrent neural network (RNN) [61] that can retain 'memory' of information seen previously over arbitrarily long intervals [62], and have been widely used to model time-series data for mood prediction and classification tasks (e.g., [63], [64], [65]).

5.1 Unimodal architectures
Unimodal systems can be viewed as building blocks for a multimodal emotion recognition framework [35]. In this case, we model arousal-valence annotations from the two unimodal conditions (music, visual) in MuVi. We term these annotations isolated modality ratings, as they capture the emotion conveyed by the individual modalities that make up multimedia content in isolation. We trained separate models for arousal and valence in both isolated modalities. Isolated music modality ratings were predicted using audio features only, while isolated visual modality ratings were predicted using visual features only, matching the information that human participants would have had access to during the experiment.

5.2 Multimodal architectures
In this section, we model arousal-valence annotations from the multimodal (audiovisual) condition in MuVi, exhaustively considering all possible combinations of features. First, we examine model performance when using only audio features, and when using only visual features. Next, we compare the performance of three different model architectures for audio and visual feature fusion (denoted as A1, A2 and PAIR). A1 and A2 utilize early (feature-level) fusion and late (decision-level) fusion respectively, two of the most commonly-employed feature fusion methods [35]. The proposed PAIR architecture builds on the unimodal models described above. Figure 7 presents the following architectures, which are described in detail below:
• Early fusion (A1): In this architecture, we apply an early fusion method by concatenating the audio and visual features to form a single input feature vector. The feature vector is then fed to the LSTM blocks.
• Late fusion (A2): In this architecture, we apply a late fusion method by feeding the two input modalities (i.e., audio and visual features) to separate LSTM blocks. Then, the outputs of both LSTMs are concatenated and fed to a final, fully-connected layer for prediction.
• PAIR: This architecture is identical to A2, except that the LSTM blocks are initialized with pre-trained weights from the corresponding unimodal models (trained on ratings from the isolated music and visual modalities). Then, the weights are fine-tuned on ratings from the multimodal audiovisual modality. The motivation for this architecture is twofold: firstly,
from a computational perspective, the amount of data in MuVi is relatively small for training deep learning models. Additionally, there is evidence from neuroimaging studies that the integration of emotional information from auditory and visual channels can be attributed to the interaction between unimodal auditory and visual cortices plus higher-order supramodal cortices, rather than to direct crosstalk between the auditory and visual cortices [66]. As such, we hypothesized that we may obtain better performance on a multimodal affect recognition task by mimicking this neural architecture and initializing a multimodal model using weights pre-trained on relatively 'pure' isolated modality ratings.

Fig. 7. Illustrated workflow for the proposed multimodal architectures for the audiovisual modality: a) Early Fusion (A1), b) Late Fusion (A2), and c) the proposed PAIR. PAIR's LSTM blocks are initialized with the pre-trained weights (w) from the corresponding music and visual unimodal networks.

5.3 Implementation details
In all architectures (i.e., unimodal and multimodal), an LSTM block consists of two stacked LSTM layers, each of size 256, with dropout of 0.2 to prevent overfitting. The output of the LSTM layers is fed to a fully connected layer with a "tanh" activation that outputs the final prediction. This value lies in the range of [−1, 1], matching that of the arousal and valence annotations. In the Late Fusion (A2) and PAIR architectures, we have an additional fully connected layer before the final predictive layer, with 256 hidden units and "relu" activation.

During the optimisation phase, the Adam [67] optimizer was used with a learning rate of 0.0001, while Mean Squared Error (MSE) was used as the loss function. Since the annotation and the feature extraction process are both dynamic, based on timesteps of 0.5s, our benchmark models predict arousal and valence at the next time step by taking the information of previous time-steps as input (i.e., the sequence length of the LSTM). After experimenting, we found that the optimal sequence length for both arousal and valence across all modalities is 4 timesteps (i.e., 2 seconds). The models were implemented using the Tensorflow 2.x [68] deep learning framework, and are available online^4.

4. https://github.com/AMAAI-Lab/MuVi
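To make the architecture and the weight-transfer step concrete, below is a minimal Keras sketch that follows the description in Sections 5.2 and 5.3 (layer sizes, dropout, activations, optimizer, loss and sequence length as stated above). It is an illustrative reconstruction rather than the released implementation, and the feature dimensionalities are placeholders.

import tensorflow as tf
from tensorflow.keras import layers, models

def lstm_block(input_dim, name):
    # Two stacked LSTM layers of size 256 with dropout 0.2 (sequence length 4, i.e. 2 s).
    inp = layers.Input(shape=(4, input_dim), name=f"{name}_in")
    x = layers.LSTM(256, return_sequences=True, dropout=0.2, name=f"{name}_lstm1")(inp)
    x = layers.LSTM(256, dropout=0.2, name=f"{name}_lstm2")(x)
    return inp, x

def build_unimodal(input_dim, name):
    # Unimodal network: LSTM block followed by a tanh output in [-1, 1].
    inp, x = lstm_block(input_dim, name)
    out = layers.Dense(1, activation="tanh", name=f"{name}_out")(x)
    model = models.Model(inp, out, name=f"{name}_unimodal")
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return model

def build_pair(audio_dim, visual_dim, audio_pretrained=None, visual_pretrained=None):
    # Late-fusion topology (A2); becomes PAIR when unimodal weights are copied in.
    a_in, a_feat = lstm_block(audio_dim, "audio")
    v_in, v_feat = lstm_block(visual_dim, "visual")
    x = layers.Concatenate()([a_feat, v_feat])
    x = layers.Dense(256, activation="relu")(x)        # extra FC layer used in A2/PAIR
    out = layers.Dense(1, activation="tanh")(x)
    model = models.Model([a_in, v_in], out, name="pair")
    for src in (audio_pretrained, visual_pretrained):
        if src is not None:
            for layer in src.layers:
                try:
                    model.get_layer(layer.name).set_weights(layer.get_weights())
                except ValueError:
                    pass  # layers that only exist in the unimodal model (e.g. its output head)
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
    return model

In this sketch, the unimodal networks would first be fit on the isolated music and visual ratings; build_pair(...) then copies their LSTM weights by layer name, after which fine-tuning proceeds on the audiovisual ratings.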
6 EXPERIMENTAL SETUP
In this section, we introduce our experimental design to evaluate our predictive models and examine the effectiveness of the novel transfer learning approach that the PAIR architecture utilizes. Moreover, we introduce equivalent linear models as baselines, not only for comparison with the proposed PAIR architecture, but also for finding the most influential features.

6.1 Baseline: Linear Models
We implemented linear LASSO regression models as a simple baseline predictive model for valence and arousal. In addition to investigating the relative importance of audio and visual features for predicting emotion in audiovisual media, we are also interested in examining the individual features that are most predictive of arousal and valence across modalities. In a review of music emotion recognition methods, [31] found that the use of too many features generally results in performance degradation. Feature selection methods such as manual selection or removal of highly correlated features can help reduce computational complexity and improve model performance by mitigating overfitting issues. Additionally, all possible combinations of modalities and corresponding feature sets are examined. We use the least absolute shrinkage and selection operator (LASSO) as our baseline linear predictive model, as it can also be used for feature selection [69]. The LASSO penalizes the sum of the absolute values of the coefficients (L1 regularization), and if subsets of features are highly correlated, the model tends to 'select' one feature from the pool while shrinking the remaining coefficients to zero. Cross-validation was performed to determine optimal values for the complexity parameter alpha, which controls the strength of the regularization.
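As a sketch of this baseline using scikit-learn's cross-validated LASSO (the feature matrix and target vector below are placeholders, and the exact preprocessing used for MuVi may differ):

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# X: one row per 500 ms window (audio and/or visual features); y: gold-standard arousal or valence.
X = np.random.randn(2000, 988)        # placeholder feature matrix
y = np.random.uniform(-1, 1, 2000)    # placeholder target

# LassoCV selects the regularization strength alpha by internal cross-validation.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, max_iter=10000))
model.fit(X, y)

lasso = model.named_steps["lassocv"]
print("chosen alpha:", lasso.alpha_)
print("non-zero coefficients:", np.sum(lasso.coef_ != 0))   # features 'selected' by L1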
6.2 Predictive Model Evaluation Metrics
To account for variance in the media items and fully utilize the information provided by the dataset, the PAIR and linear models are evaluated using k-fold cross-validation. We partitioned the media items into five folds using the KFold iterator in sklearn^5 with the random state set to 42, iteratively holding out each fold as a validation set for the model trained on the remaining four folds.

5. https://scikit-learn.org/stable/modules/cross_validation.html#k-fold

Regarding metrics, we modified the approach taken by [23] to design benchmarks for the MediaEval Emotion in Music task, and we use two evaluation metrics to benchmark the performance of baseline methods: root mean square error (RMSE) and Lin's Concordance Correlation Coefficient (CCC). Intuitively, the RMSE is an indicator of how well the predicted emotion matches the gold-standard or "true" emotion of a media item, while the CCC (inter-rater agreement, or the extent to which the human raters provide similar ratings across trials) was initially designed as a reproducibility index and quantifies the agreement between two vectors by measuring their difference from the 45-degree line through the origin. As such, it captures both the pattern of association and systematic differences between two sets of points. The CCC for two vectors (r_1, r_2) is calculated as:

CCC(r_1, r_2) = \frac{2\,\mathrm{Corr}(r_1, r_2)\,\sigma_{r_1}\sigma_{r_2}}{\sigma_{r_1}^2 + \sigma_{r_2}^2 + (\mu_{r_1} - \mu_{r_2})^2}    (1)

RMSE and predicted CCC are computed for each media item in each validation set. The mean and standard deviations of each evaluation metric are then computed across all cross-validation folds.

A gold-standard annotation sequence to be used as the modelling target was computed for each media item in each modality using the Evaluator Weighted Estimator (EWE) approach [70]. As its name suggests, the EWE is a weighted average which gives more reliable annotations a higher weight. The weight of an annotation r_j is calculated as the correlation between r_j and the unweighted average of all annotations \bar{r}. It is plausible that the dataset contains unreliable annotations with w_j < 0; in the following analyses we deal with unreliable annotations by setting their weight to zero such that they are effectively not taken into account when calculating the gold-standard annotation sequence.

w_j = \mathrm{Corr}(r_j, \bar{r})    (2)

r_{EWE} = \frac{1}{\sum_j w_j} \sum_j w_j r_j    (3)
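Equations (1)-(3) translate directly into a few lines of NumPy; the following sketch implements the CCC and the EWE gold standard, including the clipping of negative weights described above (array shapes and the example data are illustrative).

import numpy as np

def ccc(r1, r2):
    """Lin's Concordance Correlation Coefficient between two rating sequences, Eq. (1)."""
    r1, r2 = np.asarray(r1, float), np.asarray(r2, float)
    corr = np.corrcoef(r1, r2)[0, 1]
    return (2 * corr * r1.std() * r2.std()) / (
        r1.var() + r2.var() + (r1.mean() - r2.mean()) ** 2
    )

def ewe(annotations):
    """Evaluator Weighted Estimator over a (num_annotators, num_timesteps) array, Eqs. (2)-(3)."""
    annotations = np.asarray(annotations, float)
    mean_rating = annotations.mean(axis=0)
    weights = np.array([np.corrcoef(a, mean_rating)[0, 1] for a in annotations])
    weights = np.clip(weights, 0.0, None)   # unreliable annotations (w_j < 0) get weight 0
    return weights @ annotations / weights.sum()

# Example: gold standard for one stimulus and agreement of one annotator with it.
ratings = np.random.uniform(-1, 1, size=(6, 120))   # 6 annotators, 60 s at 2 Hz
gold = ewe(ratings)
print(ccc(ratings[0], gold))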
6.3 Human inter-rater agreement
We also calculate inter-rater agreement, which provides a useful auxiliary metric for comparison with the predictive model results. Inter-rater agreement indicates the extent to which the perception of emotional content of a media item is similar between human raters, and may serve as a measure of the quality of the human annotations of the dataset. We calculate the inter-rater agreement as CCC(r_j, r_EWE), i.e., the CCC of each annotation r_j with the corresponding EWE for the media stimulus, where the EWE was calculated without r_j so as to avoid overstating inter-rater agreement. Agreement was low to moderate. Across the entire dataset, the mean CCC for arousal was 0.34 ± 0.32 and the mean CCC for valence was 0.26 ± 0.32.

Modality | Arousal | Valence
Music | 0.4062 ± 0.32 | 0.2385 ± 0.31
Visual | 0.2839 ± 0.32 | 0.3115 ± 0.32
Audiovisual | 0.3369 ± 0.32 | 0.2384 ± 0.32

TABLE 5: Evaluation of inter-rater agreement by calculating the CCC of each annotation for all modalities (mean ± standard deviation).

The results in Table 5 are in line with previous empirical studies [40] which show that raters tend to demonstrate greater agreement in their perceptions of arousal in music as compared to valence, possibly because auditory cues relevant to arousal such as tempo and pitch are more salient and the perception of arousal is hence less subjective than that of valence. There was also greater agreement for arousal compared to valence in the audiovisual condition.

7 RESULTS AND DISCUSSION

7.1 Predicting valence and arousal

7.1.1 Baseline: LASSO models
Starting with the baselines, Table 4 summarises the results of the LASSO models in different modality configurations. As expected, the best model performance was observed for arousal in the music modality. A t-test indicates that the simple linear model produced results that were not significantly different from human-level performance as determined by the annotators' inter-rater agreement (t = 1.82, p > 0.05). Similarly, linear model performance was not significantly different from human-level performance when predicting arousal in the audiovisual modality with audio features only, and with both audio and visual features. However, none of the linear models were able to predict valence accurately enough to match human-level performance. These results agree with existing empirical work on music emotion recognition, which typically achieves much higher accuracy for arousal compared to valence [71].

Modality | Features | Arousal RMSE | Arousal CCC | Valence RMSE | Valence CCC
Music | Audio | 0.2675 ± 0.1075 | 0.3417 ± 0.2909 | 0.1573 ± 0.1029 | 0.1196 ± 0.1950
Visual | Visual | 0.4110 ± 0.1571 | 0.0785 ± 0.1579 | 0.3506 ± 0.1283 | 0.0415 ± 0.1332
Audiovisual | Audio | 0.2721 ± 0.1145 | 0.3095 ± 0.3218 | 0.4106 ± 0.1330 | 0.0260 ± 0.1419
Audiovisual | Visual | 0.3649 ± 0.1744 | 0.0772 ± 0.1475 | 0.3834 ± 0.1427 | 0.0438 ± 0.1210
Audiovisual | Audio and Visual | 0.2783 ± 0.1231 | 0.3070 ± 0.3127 | 0.3831 ± 0.1417 | 0.0503 ± 0.1680

TABLE 4: Evaluation of the LASSO linear models in terms of RMSE and CCC (mean ± standard deviation). Note that lower values of RMSE, and higher values of CCC, reflect higher predictive accuracy.

A visual inspection of the linear model predictions in Figure 8 reveals that even though the predicted annotations are not significantly different from those of human raters, they look unrealistic and are much less smooth, possibly because the model does not account for features at the preceding timesteps.

Fig. 8. LASSO model predictions of arousal and valence ratings in the visual (left), audiovisual (center) and music (right) modalities for the media item 'Living On A Prayer' by Bon Jovi. Each colored line in the background represents an individual participant's rating, while the bold red (arousal) and teal (valence) lines represent the Evaluator Weighted Estimator rating along with its standard deviation. The dashed black lines represent the model's predicted rating at that timestep.

7.1.2 PAIR (LSTM) models
Table 6 shows the results of the proposed LSTM models for all modality-feature combinations examined using the linear models, and for the three additional architectures (A1, A2, PAIR) described in Section 5.2. Generally, the LSTM models improve over the linear model's results in all modality-feature combinations. As with the linear models, the best performance was observed when predicting arousal in the music modality, and predictions of arousal were more accurate than predictions of valence. The worst performance was observed in the visual modality, in which only visual features were available (i.e., a unimodal architecture).

Modality | Features | Arousal RMSE | Arousal CCC | Valence RMSE | Valence CCC
Music | Audio | 0.0973 ± 0.0814 | 0.4080 ± 0.2631 | 0.1331 ± 0.1225 | 0.1741 ± 0.2142
Visual | Visual | 0.1625 ± 0.0853 | 0.1417 ± 0.2075 | 0.2340 ± 0.1341 | 0.0675 ± 0.2294
Audiovisual | Audio | 0.1109 ± 0.0853 | 0.3715 ± 0.2455 | 0.1970 ± 0.1264 | 0.0497 ± 0.1523
Audiovisual | Visual | 0.2434 ± 0.1513 | 0.0827 ± 0.2231 | 0.2767 ± 0.1493 | 0.0449 ± 0.1859
Audiovisual | Audio and Visual (A1) | 0.1285 ± 0.0969 | 0.3177 ± 0.2824 | 0.2264 ± 0.1347 | 0.0734 ± 0.2136
Audiovisual | Audio and Visual (A2) | 0.1372 ± 0.1041 | 0.3230 ± 0.2882 | 0.2392 ± 0.1390 | 0.0708 ± 0.2191
Audiovisual | Audio and Visual (PAIR) | 0.1075 ± 0.0847 | 0.4102 ± 0.2488 | 0.2126 ± 0.1268 | 0.0886 ± 0.2198

TABLE 6: Evaluation of the LSTM models (unimodal and multimodal) in terms of RMSE and CCC (mean ± standard deviation), featuring the A1, A2 and novel PAIR architectures in the audiovisual modality.

Next, we compare the A1 (early fusion), A2 (late fusion) and PAIR architectures for multimodal emotion recognition. For both arousal and valence, the early and late fusion architectures improve over the linear model's results in terms of RMSE, but not in terms of CCC. However, the PAIR architecture initialized with weights pre-trained on the unimodal LSTM networks produces a noticeable improvement in arousal prediction in terms of CCC. Paired t-tests show that PAIR is significantly different (better) than the linear models for predicting arousal in terms of CCC (t(40) = 2.81, p < 0.01) and RMSE (t(40) = −13.0, p < 0.001), and for predicting valence in terms of CCC (t(40) = −11.7, p < 0.001), highlighting the effectiveness of our proposed transfer learning approach leveraging isolated modality ratings. Similarly, the PAIR model predictions in Figure 9 appear to be less erratic than the linear model predictions, and more closely follow the contours of the human participants' ratings.

Fig. 9. PAIR model predictions of arousal and valence ratings in the audiovisual modality for Bon Jovi's 'Living On A Prayer'. Similarly to Fig. 8, each colored line in the background represents an individual participant's rating, while the bold red (arousal) and teal (valence) lines represent the Evaluator Weighted Estimator rating along with its standard deviation. The dashed black lines represent the model's predicted rating at that timestep.

7.2 Most influential features
LASSO performs feature selection by shrinking the coefficients of variables that are less significant to zero [69]. We report the features with the largest absolute coefficient values
12

Arousal Valence
Modality Features
RMSE CCC RMSE CCC
Music Audio 0.0973 ± 0.0814 0.4080 ± 0.2631 0.1331 ± 0.1225 0.1741 ± 0.2142
Visual Visual 0.1625 ± 0.0853 0.1417 ± 0.2075 0.2340 ± 0.1341 0.0675 ± 0.2294
Audiovisual Audio 0.1109 ± 0.0853 0.3715 ± 0.2455 0.1970 ± 0.1264 0.0497 ± 0.1523
Audiovisual Visual 0.2434 ± 0.1513 0.0827 ± 0.2231 0.2767 ± 0.1493 0.0449 ± 0.1859
Audiovisual Audio and Visual (A1) 0.1285 ± 0.0969 0.3177 ± 0.2824 0.2264 ± 0.1347 0.0734 ± 0.2136
Audiovisual Audio and Visual (A2) 0.1372 ± 0.1041 0.3230 ± 0.2882 0.2392 ± 0.1390 0.0708 ± 0.2191
Audiovisual Audio and Visual (PAIR) 0.1075 ± 0.0847 0.4102 ± 0.2488 0.2126 ± 0.1268 0.0886 ± 0.2198
TABLE 6
Evaluation of the LSTM models (unimodal and mulitimodal) in terms of RMSE and CCC (mean ± standard deviation) featuring A1, A2 and the novel
PAIR architectures in the audiovisual modality.

Modality Features Arousal Valence information by indicating attributes of the communicator


MFCC, loudness, Line spectral such as accessibility. The tempo conveyed through cinematic
Music Audio line spectral frequencies, techniques such as camera and subject movement has also
frequencies loudness, MFCC
been found to be a highly expressive aspect of videos [36].
Lighting, scene, It might be possible that it is not the action of stretching
Actions, lighting,
Visual Visual hue and saturation, one’s arm that conveys affective information, but the action
scene
objects, actions
classification model is recognizing latent dimensions such as
Intensity, MFCC, open gestures and slowed motion that convey anticipation
MFCC, loudness,
line spectral
Audiovisual Audio F0, line spectral or affective states such as relaxedness that correspond to
frequencies,
frequencies
zero crossing rate lowered arousal and valence.
Actions, lighting, The sign of the coefficient for “sneezing” is also negative.
Actions, hue,
Audiovisual Visual
lighting, scene
hue and saturation, In the video scenes, “sneezing” seemed to correspond
scene generally to shots centering on a single actor in which facial
Audio &
MFCC, loudness, Lighting, actions, expressions were clearly visible. One possible reason why the
Audiovisual actions, line spectral hue and saturation, feature could be associated with lower arousal and valence
Visual
frequencies, F0 intensity
is because the actors depicted were usually trying to express
TABLE 7 seriousness or intensity. The reader may ask: if this were the
Most predictive features for Valence and Arousal according to linear case, why was the model assigning much higher weights to
models.
Modality      Features          Arousal                                  Valence
Music         Audio             MFCC, loudness,                          Line spectral frequencies,
                                line spectral frequencies                loudness, MFCC
Visual        Visual            Actions, lighting, scene                 Lighting, scene, hue and saturation,
                                                                         objects, actions
Audiovisual   Audio             MFCC, loudness, line spectral            Intensity, MFCC, F0,
                                frequencies, zero crossing rate          line spectral frequencies
Audiovisual   Visual            Actions, hue, lighting, scene            Actions, lighting, hue and saturation,
                                                                         scene
Audiovisual   Audio & Visual    MFCC, loudness, actions,                 Lighting, actions, hue and saturation,
                                line spectral frequencies, F0            intensity
TABLE 7
Most predictive features for valence and arousal according to the linear models.
When annotators were presented with audiovisual stimuli, the variance in arousal values seems to be explained largely by audio features, while the variance in valence can be largely explained by visual features. This is supported by the observation that models trained on target labels from the audiovisual modality with audio features only outperformed those trained with visual features only when predicting arousal, but not when predicting valence.

Across the media modalities that included visual features, the LASSO regression models consistently estimated coefficients of relatively large magnitude for several action classes extracted using the I3D model [59], including "stretching arm" and "sneezing". We manually inspected the video scenes where these actions were classified as highly likely. The sign of the coefficient for "stretching arm" is negative, indicating that the action is associated with decreased arousal and valence. In the video scenes, "stretching arm" seemed to correspond generally to movement of the arms in the context of actions such as embraces or dancing. Additionally, most of the scenes were shot at least partially in slow motion. Previous work on affective gestures [72] has demonstrated that the spatial extension of movements [73] and openness of arm arrangements [74] express emotionally relevant information by indicating attributes of the communicator such as accessibility. The tempo conveyed through cinematic techniques such as camera and subject movement has also been found to be a highly expressive aspect of videos [36]. It may be that it is not the action of stretching one's arm itself that conveys affective information, but rather that the action classification model is recognizing latent dimensions such as open gestures and slowed motion that convey anticipation, or affective states such as relaxedness, which correspond to lowered arousal and valence.

The sign of the coefficient for "sneezing" is also negative. In the video scenes, "sneezing" seemed to correspond generally to shots centering on a single actor in which facial expressions were clearly visible. One possible reason why this feature could be associated with lower arousal and valence is that the actors depicted were usually trying to express seriousness or intensity. The reader may ask: if this were the case, why did the model assign much higher weights to facial actions than to facial expression features, which were also available? Action features were extracted from scenes that could stretch over longer durations, while facial expressions were extracted from still frames sampled from the underlying video at regular intervals. Consequently, predicted facial expressions could change significantly between consecutive sampled frames, while predicted action features tended to be more stable. The stability of the action features might hence be a better match to the human annotations of arousal and valence, which tend to change smoothly over time. This finding further justifies the use of time-series models that can incorporate information from previous time-steps for emotion recognition.
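The contrast between frame-sampled and scene-level feature streams can be quantified directly; the sketch below shows one simple way to compare the frame-to-frame variability of two streams and to smooth a noisy one. The differencing measure and window size are illustrative choices, not the procedure used in our analyses.

# Illustrative check of the stability argument above: compare the average
# frame-to-frame change of a feature stream sampled per still frame with one
# derived from longer scene segments. The window size is an arbitrary choice.
import numpy as np

def mean_abs_step(series):
    """Average absolute change between consecutive time steps."""
    series = np.asarray(series, float)
    return np.mean(np.abs(np.diff(series, axis=0)))

def smooth(series, window=5):
    """Simple moving average, one option for matching slowly varying ratings."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, float), kernel, mode="same")

# A stream with a lower mean_abs_step changes more smoothly over time and is
# therefore closer in character to continuous valence/arousal annotations.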
8 CONCLUSION
In this paper, we explore the relative contribution of audio and visual information to perceived emotion in media items. To do so, we provide a new dataset of music videos labeled with both continuous and discrete measures of emotion in both isolated and combined modalities. We then used the dataset to train two types of emotion recognition models: interpretable linear predictive models for feature importance analysis, and more sophisticated LSTM models. We trained three LSTM models for audiovisual emotion recognition: early fusion (A1), late fusion (A2), and a novel transfer learning model pre-trained on isolated modality ratings, which we named PAIR.
Notably, we find that transfer effects from isolated modalities can enhance the performance of a multimodal model through transfer learning, a result that could not be shown with previous datasets due to their lack of isolated modality ratings.

Our analyses show that affective multimedia content analysis is a complex problem for several reasons, including the subjectivity of perceived emotion, as evidenced by low to moderate agreement between human annotators using both continuous and discrete measures of emotion [34], [23], and the association between factors such as gender and familiarity with perceived emotion [75]. Specifically, we find that gender, familiarity, and musical training have a significant influence on the emotion labels selected by participants. In the current study, we randomly assigned stimuli to participants and focused on modeling the average emotion ratings reported. However, future work could examine methods to integrate profile and demographic information in addition to data-based features to build personalized models, for example through model adaptation [41] or other techniques.

Our work investigated several visual context features rarely used in affect detection, including scene, object, and action features. We find evidence that action features are an important source of affective information, and that scene-based features might be more appropriate for time-series emotion recognition than features extracted from windows of fixed duration. These results add to a growing body of work showing that people make use of contextual information when making affective judgements [76].

Further, although the visual modality is generally considered to be dominant [77], [18], we find that the auditory modality explains most of the variance in arousal for multimedia stimuli, while both the auditory and visual modalities contribute to explaining the variance in valence. Future work should investigate the impact of incongruent information across modalities, and explore which features cause the user's responses to be oriented more towards the visual or the auditory modality when multimedia stimuli are presented, as this may clarify when visual or auditory dominance occurs. A model that is able to fully capture the emotional influence of each modality may even be used as a conditional generative model for music [78], [79], [80] or video. Additionally, although we have begun to examine feature importance, there is still room for future work on feature engineering. For example, extracting more extensive deep pretrained representations of latent audio and video dimensions might enable models to take into account more complex information.

The isolated modality ratings revealed several interesting patterns. Compared to listening to music alone, we find that the addition of a visual modality in music videos generally produces slightly lower arousal and valence ratings and lower interrater agreement for arousal. In line with previous work [30], agreement for valence was lower than agreement for arousal in the music and audiovisual modalities. In the visual modality, however, agreement for valence was higher than agreement for arousal. In fact, the visual modality produced the lowest agreement in arousal but the highest agreement in valence. Taken together with the feature importance results above, the data suggest that visual information contributes to people's perceptions of valence more than arousal.

Our results also have several practical implications. Convergent evidence across multiple analyses strongly suggests that in music videos, our perception of arousal is largely determined by the music, while both the music and the visual narrative contribute to our perception of valence. Additionally, an individual's impression of the emotion conveyed by a song could be very different depending on the modality in which it is consumed. Content platforms may find it useful to incorporate modality-related information into their search and recommendation algorithms. Similarly, music video producers might benefit from a greater awareness of each modality's contribution to the emotion conveyed by their content, and of the different modalities in which their content can be consumed. Of course, music videos are a subset of audiovisual content and, more broadly, multimodal content [35]. While music videos usually involve the music being produced first, with the visual narrative subsequently designed around the song, other scenarios are also possible during audiovisual content production: the video may be shot first, with music selection occurring subsequently, or the video and music production may proceed in parallel and inform one another. Investigating isolated modality ratings in other types of audiovisual and multimodal content is an important avenue for future work. We hope that our research and dataset will provide a useful stepping stone towards advancing the field of affective computing through the study of isolated modalities.

ACKNOWLEDGMENTS
This work was supported by the RIE2020 Advanced Manufacturing and Engineering (AME) Programmatic Fund (No. A20G8b0102), Singapore, as well as the Singapore Ministry of Education, Grant no. MOE2018-T2-2-161.

REFERENCES
[1] M. Zentner, D. Grandjean, and K. R. Scherer, "Emotions evoked by the sound of music: characterization, classification, and measurement," Emotion, vol. 8, no. 4, p. 494, 2008.
[2] Y. Baveye, J.-N. Bettinelli, E. Dellandréa, L. Chen, and C. Chamaret, "A large video database for computational models of induced emotion," in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 2013, pp. 13–18.
[3] A. J. Lonsdale and A. C. North, "Why do we listen to music? A uses and gratifications analysis," British Journal of Psychology, vol. 102, no. 1, pp. 108–134, 2011.
[4] A. J. Cohen, "Music as a source of emotion in film," in Handbook of Music and Emotion: Theory, Research, Applications, P. N. Juslin and J. Sloboda, Eds. Oxford University Press, 2010.
[5] M. G. Boltz, B. Ebendorf, and B. Field, "Audiovisual interactions: The impact of visual information on music perception and memory," Music Perception, vol. 27, no. 1, pp. 43–59, 2009.
[6] P. Lamere, "Social tagging and music information retrieval," Journal of New Music Research, vol. 37, no. 2, pp. 101–114, 2008.
[7] M.-Y. Chung and H.-S. Kim, "College students' motivations for using podcasts," Journal of Media Literacy Education, vol. 7, no. 3, pp. 13–28, 2015.
[8] S. Davies, Musical Meaning and Expression. Cornell University Press, 1994.
[9] "IFPI issues annual global music report," Aug. 2020. [Online]. Available: https://www.ifpi.org/ifpi-issues-annual-global-music-report/
[10] J. J. Gross and L. Feldman Barrett, "Emotion generation and emotion regulation: One or two depends on your point of view," Emotion Review, vol. 3, no. 1, pp. 8–16, 2011.
[11] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
[12] P. Ekman, "An argument for basic emotions," Cognition & Emotion, vol. 6, no. 3-4, pp. 169–200, 1992.
[13] A. S. Cowen and D. Keltner, "Self-report captures 27 distinct categories of emotion bridged by continuous gradients," Proceedings of the National Academy of Sciences, vol. 114, no. 38, pp. E7900–E7909, 2017.
[14] J. A. Russell, "Core affect and the psychological construction of emotion," Psychological Review, vol. 110, no. 1, p. 145, 2003.
[15] G. Paltoglou and M. Thelwall, "Seeing stars of valence and arousal in blog posts," IEEE Transactions on Affective Computing, vol. 4, no. 1, pp. 116–123, 2012.
[16] Y. E. Kim, E. M. Schmidt, R. Migneco, B. G. Morton, P. Richardson, J. Scott, J. A. Speck, and D. Turnbull, "Music emotion recognition: A state of the art review," in Proc. ISMIR, vol. 86, 2010, pp. 937–952.
[17] F. Platz and R. Kopiez, "When the eye listens: A meta-analysis of how audio-visual presentation enhances the appreciation of music performance," Music Perception: An Interdisciplinary Journal, vol. 30, no. 1, pp. 71–83, 2012.
[18] C.-J. Tsay, "Sight over sound in the judgment of music performance," Proceedings of the National Academy of Sciences, vol. 110, no. 36, pp. 14580–14585, 2013.
[19] M. G. Boltz, "Musical soundtracks as a schematic influence on the cognitive processing of filmed events," Music Perception, vol. 18, no. 4, pp. 427–454, 2001.
[20] M. E. Goldberg, A. Chattopadhyay, G. J. Gorn, and J. Rosenblatt, "Music, music videos, and wear out," Psychology & Marketing, vol. 10, no. 1, pp. 1–13, 1993.
[21] S.-W. Sun and J. Lull, "The adolescent audience for music videos and why they watch," Journal of Communication, 1986.
[22] Y.-A. Chen, Y.-H. Yang, J.-C. Wang, and H. Chen, "The AMG1608 dataset for music emotion recognition," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 693–697.
[23] A. Aljanaki, Y.-H. Yang, and M. Soleymani, "Developing a benchmark for emotional analysis of music," PLoS ONE, vol. 12, no. 3, p. e0173392, 2017.
[24] A. Aljanaki, F. Wiering, and R. C. Veltkamp, "Studying emotion induced by music through a crowdsourcing game," Information Processing & Management, vol. 52, no. 1, pp. 115–128, 2016.
[25] E. M. Schmidt and Y. E. Kim, "Modeling musical emotion dynamics with conditional random fields," in ISMIR, Miami (Florida), USA, 2011, pp. 777–782.
[26] A. Schaefer, F. Nils, X. Sanchez, and P. Philippot, "Assessing the effectiveness of a large database of emotion-eliciting films: A new tool for emotion researchers," Cognition and Emotion, vol. 24, no. 7, pp. 1153–1172, 2010.
[27] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen, "Deep learning vs. kernel methods: Performance for emotion prediction in videos," in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2015, pp. 77–83.
[28] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "DEAP: A database for emotion analysis using physiological signals," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2011.
[29] S. R. Livingstone and F. A. Russo, "The Ryerson audio-visual database of emotional speech and song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English," PLoS ONE, vol. 13, no. 5, p. e0196391, 2018.
[30] Y.-H. Yang and H. H. Chen, "Machine recognition of music emotion: A review," ACM Transactions on Intelligent Systems and Technology (TIST), vol. 3, no. 3, pp. 1–30, 2012.
[31] X. Yang, Y. Dong, and J. Li, "Review of data features-based music emotion recognition methods," Multimedia Systems, vol. 24, no. 4, pp. 365–389, 2018.
[32] Y. Baveye, C. Chamaret, E. Dellandréa, and L. Chen, "Affective video content analysis: A multidisciplinary insight," IEEE Transactions on Affective Computing, vol. 9, no. 4, pp. 396–409, 2017.
[33] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions," in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 2013, pp. 1–8.
[34] D. Ong, Z. Wu, Z.-X. Tan, M. Reddan, I. Kahhale, A. Mattek, and J. Zaki, "Modeling emotion in complex stories: The Stanford emotional narratives dataset," IEEE Transactions on Affective Computing, 2019.
[35] S. Poria, E. Cambria, R. Bajpai, and A. Hussain, "A review of affective computing: From unimodal analysis to multimodal fusion," Information Fusion, vol. 37, pp. 98–125, 2017.
[36] S. Wang and Q. Ji, "Video affective content analysis: A survey of state-of-the-art methods," IEEE Transactions on Affective Computing, vol. 6, no. 4, pp. 410–430, 2015.
[37] J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
[38] H. T. P. Thao, B. Balamurali, D. Herremans, and G. Roig, "AttendAffectNet: Self-attention based networks for predicting affective responses from movies," in 2020 25th International Conference on Pattern Recognition (ICPR). IEEE, 2021, pp. 8719–8726.
[39] H. T. P. Thao, B. T. Balamurali, G. Roig, and D. Herremans, "AttendAffectNet–emotion prediction of movie viewers using multimodal fusion with self-attention," Sensors, vol. 21, no. 24, p. 8356, Dec. 2021. [Online]. Available: http://dx.doi.org/10.3390/s21248356
[40] Y.-H. Yang, Y.-C. Lin, Y.-F. Su, and H. H. Chen, "A regression approach to music emotion recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 448–457, 2008.
[41] J.-C. Wang, Y.-H. Yang, H.-M. Wang, and S.-K. Jeng, "Personalized music emotion recognition via model adaptation," in Proceedings of the 2012 Asia Pacific Signal and Information Processing Association Annual Summit and Conference. IEEE, 2012, pp. 1–7.
[42] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen, "LIRIS-ACCEDE: A video database for affective content analysis," IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015.
[43] C. Raffel, "Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching," Ph.D. dissertation, Columbia University, 2016.
[44] T. Bertin-Mahieux, D. P. Ellis, B. Whitman, and P. Lamere, "The million song dataset," in Proceedings of the 12th International Society for Music Information Retrieval, 2011.
[45] A. Lykartsis, A. Pysiewicz, H. von Coler, and S. Lepa, "The emotionality of sonic events: Testing the Geneva emotional music scale (GEMS) for popular and electroacoustic music," in The 3rd International Conference on Music & Emotion, Jyväskylä, Finland, June 11–15, 2013. University of Jyväskylä, Department of Music, 2013.
[46] F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: The Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
[47] J. L. Zhang, X. L. Huang, L. F. Yang, Y. Xu, and S. T. Sun, "Feature selection and feature learning in arousal dimension of music emotion by using shrinkage methods," Multimedia Systems, vol. 23, no. 2, pp. 251–264, 2017.
[48] K. W. Cheuk, Y.-J. Luo, B. Balamurali, G. Roig, and D. Herremans, "Regression-based music emotion prediction using triplet neural networks," in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–7.
[49] F. Eyben, F. Weninger, F. Gross, and B. Schuller, "Recent developments in openSMILE, the Munich open-source multimedia feature extractor," in Proceedings of the 21st ACM International Conference on Multimedia, 2013, pp. 835–838.
[50] Z. Rasheed and M. Shah, "Movie genre classification by exploiting audio-visual features of previews," in Object Recognition Supported by User Interaction for Service Robots, vol. 2. IEEE, 2002, pp. 1086–1089.
[51] Z. Rasheed, Y. Sheikh, and M. Shah, "On the use of computable features for film classification," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 1, pp. 52–64, 2005.
[52] L. Wilms and D. Oberfeld, "Color and emotion: Effects of hue, saturation, and brightness," Psychological Research, vol. 82, no. 5, pp. 896–914, 2018.
[53] J. Li, S. Roy, J. Feng, and T. Sim, "Happiness level prediction with sequential inputs via multiple regressions," in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, pp. 487–493.
[54] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[55] C. Chen, Z. Wu, and Y.-G. Jiang, "Emotion in context: Deep semantic feature fusion for video emotion recognition," in Proceedings of the 24th ACM International Conference on Multimedia, 2016, pp. 127–131.
[56] J. Lee, S. Kim, S. Kim, J. Park, and K. Sohn, "Context-aware emotion recognition networks," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10143–10152.
[57] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[58] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 6, pp. 1452–1464, 2017.
[59] J. Carreira and A. Zisserman, "Quo vadis, action recognition? A new model and the Kinetics dataset," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[60] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[61] R. J. Williams and D. Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[62] C. Olah, "Understanding LSTM networks," 2015. [Online]. Available: https://colah.github.io/posts/2015-08-Understanding-LSTMs
[63] F. Weninger, F. Eyben, and B. Schuller, "On-line continuous-time music mood regression with deep recurrent neural networks," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 5412–5416.
[64] E. Coutinho, G. Trigeorgis, S. Zafeiriou, and B. Schuller, "Automatically estimating emotion in music with deep long-short term memory recurrent neural networks," in CEUR Workshop Proceedings, vol. 1436, 2015.
[65] H. T. P. Thao, D. Herremans, and G. Roig, "Multimodal deep models for predicting affective responses evoked by movies," in ICCV Workshops, 2019, pp. 1618–1627.
[66] B. Kreifelts, T. Ethofer, W. Grodd, M. Erb, and D. Wildgruber, "Audiovisual integration of emotional signals in voice and face: An event-related fMRI study," NeuroImage, vol. 37, no. 4, pp. 1445–1456, 2007.
[67] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[68] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
[69] R. Muthukrishnan and R. Rohini, "LASSO: A feature selection technique in predictive modeling for machine learning," in 2016 IEEE International Conference on Advances in Computer Applications (ICACA). IEEE, 2016, pp. 18–20.
[70] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, "Primitives-based evaluation and estimation of emotions in speech," Speech Communication, vol. 49, no. 10-11, pp. 787–800, 2007.
[71] S. Ehrlich, K. R. Agres, C. Guan, and G. Cheng, "A closed-loop, music-based brain-computer interface for emotion mediation," PLoS ONE, vol. 14, no. 3, 2019, doi: 10.1371/journal.pone.0213516.
[72] D. Glowinski, N. Dael, A. Camurri, G. Volpe, M. Mortillaro, and K. Scherer, "Toward a minimal representation of affective gestures," IEEE Transactions on Affective Computing, vol. 2, no. 2, pp. 106–118, 2011.
[73] H. G. Wallbott, "Bodily expression of emotion," European Journal of Social Psychology, vol. 28, no. 6, pp. 879–896, 1998.
[74] A. Mehrabian, Nonverbal Communication. Routledge, 2017.
[75] Y.-H. Yang, Y.-F. Su, Y.-C. Lin, and H. H. Chen, "Music emotion recognition: The role of individuality," in Proceedings of the International Workshop on Human-Centered Multimedia, 2007, pp. 13–22.
[76] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, "Context based emotion recognition using EMOTIC dataset," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 42, no. 11, pp. 2755–2766, 2019.
[77] M. I. Posner, M. J. Nissen, and R. M. Klein, "Visual dominance: An information-processing account of its origins and significance," Psychological Review, vol. 83, no. 2, p. 157, 1976.
[78] D. Herremans and E. Chew, "MorpheuS: Generating structured music with constrained patterns and tension," IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 510–523, 2017.
[79] H. H. Tan and D. Herremans, "Music FaderNets: Controllable music generation based on high-level features via low-level feature modelling," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2020.
[80] D. Makris, K. R. Agres, and D. Herremans, "Generating lead sheets with affect: A novel conditional seq2seq framework," in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.

Phoebe Chua received her M.Sc. in Information Systems and Analytics at the National University of Singapore, and is currently pursuing a PhD in Information Systems and Analytics with a focus on computational social science at the National University of Singapore under the mentorship of Professor Desmond C. Ong. Her research interests include understanding emotion in the contexts of interpersonal relationships and aesthetic experiences.

Dr. Dimos (Dimosthenis) Makris is an active researcher in the field of Music Information Retrieval. His PhD research includes A.I. applications for music generation using symbolic data, dataset creation, and track separation/instrument recognition tasks. After his PhD, he worked as a postdoctoral researcher at the Singapore University of Technology and Design under the supervision of Assistant Professor Dorien Herremans. He also has experience in the music industry as a Technical Director of Mercury Orbit Music, an A.I. music generation start-up company (2017-19, 2021-now), and as a Recording Engineer/Producer for over seven years. His current research interests include Deep Learning architectures for conditional music generation and exploring efficient encoding representations. Finally, he holds a diploma in Music Theory and is an active piano player.

Dorien Herremans (M'12, SM'17) is an Assistant Professor at the Singapore University of Technology and Design. In 2015, she was awarded the individual Marie-Curie Fellowship for Experienced Researchers, and worked at the Centre for Digital Music, Queen Mary University of London. Prof. Herremans received her PhD in Applied Economics from the University of Antwerp. After graduating as a business engineer in management information systems at the University of Antwerp in 2005, she worked as a Drupal consultant and was an IT lecturer at Les Roches University in Bluche, Switzerland. Prof. Herremans' research focuses on the intersection of machine learning/optimization and digital music/audio. She is a Senior Member of the IEEE and co-organizer of the First International Workshop on Deep Learning and Music as part of IJCNN, as well as guest editor for Springer's Neural Computing and Applications and the Proceedings of Machine Learning Research. She was featured on the Singapore 100 Women in Technology list in 2021.

Gemma Roig has been a professor in the computer science department at Goethe University Frankfurt since January 2020. Before that, she was an assistant professor at the Singapore University of Technology and Design. She conducted her postdoctoral work at MIT in the Center for Brains, Minds and Machines. She pursued her doctoral degree in Computer Vision at ETH Zurich. Her research aim is to build computational models of human vision and cognition to understand their underlying principles, and to use those models to build applications of artificial intelligence.
Kat R. Agres (M'17) is an Assistant Professor at the Yong Siew Toh Conservatory of Music (YSTCM) at the National University of Singapore (NUS), with a joint appointment at Yale-NUS College. Previously, she founded the Music Cognition group at the Institute of High Performance Computing, A*STAR, in Singapore. Kat received her PhD in Psychology (with a graduate minor in Cognitive Science) from Cornell University in 2013, and holds a bachelor's degree in Cognitive Psychology and Cello Performance from Carnegie Mellon University. Her postdoctoral research was conducted at Queen Mary University of London, in the areas of Music Cognition and Computational Creativity. She has received funding from the National Institutes of Health (NIH), the European Commission's Future and Emerging Technologies (FET) program, and the Agency for Science, Technology and Research (Singapore), amongst others, to support her research. Currently, Kat's research explores a range of topics including music technology for healthcare and well-being, music perception and cognition, computational modeling of learning and memory, and automatic music generation. She has presented her work in over fifteen countries across four continents, and remains an active cellist in Singapore.