
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 22, NO. 2, FEBRUARY 2020

Exploring Discriminative Representations for Image Emotion Recognition With CNNs

Wei Zhang, Xuanyu He, and Weizhi Lu

Manuscript received January 28, 2018; revised December 12, 2018 and May 22, 2019; accepted June 28, 2019. Date of publication July 16, 2019; date of current version January 24, 2020. This work was supported in part by the National Key Research and Development Plan of China under Grant 2017YFB1300205 and in part by the Major Research Program of Shandong Province under Grant 2018CXGC1503. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Wenwu Zhu. (Corresponding author: Xuanyu He.) The authors are with the School of Control Science and Engineering, Shandong University, Jinan 250000, China (e-mail: david.zhang@sdu.edu.cn; hexiffer@outlook.com; wzlu@sdu.edu.cn). Digital Object Identifier 10.1109/TMM.2019.2928998

Abstract—Image emotion recognition aims to automatically categorize the emotion conveyed by an image. The potential of deep representation has been demonstrated in recent research on image emotion recognition. To better understand how CNNs work in emotion recognition, we investigate the deep features by visualizing them in this work. This study shows that the deep models mainly rely on the image content but miss the image style information, such as color, texture, and shapes, which are low-level visual features but are vital for evoking emotions. To form a more discriminative representation for emotion recognition, we propose a novel CNN model that learns and integrates the content information from the high layers of the deep network with the style information from the lower layers. The uncertainty of image emotion labels is also investigated in this paper. Rather than using the emotion labels for training directly, as in previous work, a new loss function is designed by including the emotion labeling quality to optimize the proposed inference model. Extensive experiments on benchmark datasets are conducted to demonstrate the superiority of the proposed representation.

Index Terms—Image emotion classification, discriminative representation, emotional inference, deep learning, convolutional neural networks.

I. INTRODUCTION
EMOTIONAL image analysis has attracted increasing attention in the computer vision research community [1]–[6]. Compared to the recognition of objective properties of images [7]–[13], analyzing images at the affect level is more difficult because of the subjectivity and complexity of emotions [4]. Image emotion refers to the different emotional reactions that people have to different images. Fig. 1 shows examples from [1] that elicit or express different emotions. Developing an algorithm to recognize the emotion of an image is challenging. First, this task involves affect-level inference about images, and emotion is strongly related to many visual features [4], so it is difficult to design a comprehensive representation that covers all factors. Moreover, some images, such as abstract paintings, cannot easily be subjected to precise rules. It is thus unsurprising that the extraction of discriminative emotional features has become a key problem.

Fig. 1. Emotional image samples in [1].

Many approaches have been proposed to address this challenging issue. For example, Borth et al. [14] built a SentiBank system consisting of 1,200 concepts and associated classifiers, where each concept is made of an adjective strongly indicating an emotion and a noun corresponding to an object or scene. In the field of image emotion recognition, hand-crafted features were earlier attempts that were based on the intuition and observation of how people perceive the emotion of an image or artwork. These typical features include color, texture, composition, balance, harmony, variety and movement [2], [15], [16]. Nevertheless, hand-crafted features are unable to fully exploit the connection between visual cues and evoked emotions, as the limited types of features can hardly cover all important factors related to image emotion [17].

Recently, deep convolutional neural networks (CNNs) have been applied to the image emotion recognition task [1], [17]. Instead of designing features manually, CNNs are capable of learning the representations of an image in an end-to-end manner.


Fig. 2. Emotional image from [1] (left) and its style representation (right) generated following [18].

The learned representations of an image make the target-object information increasingly explicit along the processing hierarchy [19]. Experimental results suggest that deep CNN features outperform hand-crafted features for image emotion recognition [1], [17], showing the outstanding potential of deep representations in this challenging task. However, Alameda-Pineda et al. [20] note that deep features are not good enough for emotion recognition, particularly on abstract paintings. That is because, in addition to the image semantics, emotions may also be conveyed by mid- and low-level visual features such as texture, color and shapes [2], [17], [20].

To explore how typical CNNs that are designed for object recognition work in emotion recognition, we investigate the deep features along the processing hierarchy and visualize the representations that are encoded at different layers in the network. Our study shows that the deep models recognize emotion mainly by inferring the semantic-level representation of an image. This may explain the success of deep features in emotion recognition, as semantics is one of the most important cues for image emotion [2]. On the other hand, it is also found that lower-level visual features are lost along the processing hierarchy. In some cases, people pay less attention to the content of images, and non-figurative components like color and texture may elicit stronger emotional responses than the content [20]. From work on image style transfer [18], [21], [22], it is known that color, texture, and shapes can be used to describe the style of an image. As shown in Fig. 2, the image style can be represented by pixel-level visual features, which may play an important role in evoking emotion. Therefore, it is also crucial to study the relationship between the image style and the evoked emotions.

In this paper, we propose a novel CNN model to learn discriminative representations of an image for emotion recognition. Fig. 3 gives an overview of the proposed network. Specifically, the proposed representation consists of features of the image content learned from the high layers of the deep network and features of the image style learned from the lower layers. In addition, as stated in Section III-C, image emotion labeling is a highly subjective task, and it is difficult to reach a consensus among people. It is observed that such uncertainty over emotion labels affects the training of the inference model and degrades the recognition accuracy, as illustrated in Table III. To solve this problem, the emotion labeling quality should be considered during training. In this paper, a new loss function is designed that includes the label uncertainty to optimize the deep model. To validate the proposed approach, we conduct emotion recognition experiments on various benchmark datasets. Extensive experimental results show that the proposed method produces significant gains in emotion recognition accuracy.

II. RELATED WORK

Currently, there is increasing research interest in developing computational models for image emotion recognition. In this section, we review existing work on hand-crafted features and deep models designed for the emotional analysis and recognition of images.

A. Hand-Crafted Features for Emotion Recognition

Previous work mainly focused on investigating the role of visual descriptions in predicting the emotion conveyed by an image to observers. For example, Yanulevskaya et al. [15] proposed an emotion categorization approach using Gabor and Wiccest features that aims to model perceptual surface texture in images.

Fig. 3. Network architecture. The proposed model consists of two parts: the base CNN for content representation and the Gram matrix for style representations.
We adopt ResNet with 152 layers as the base CNN architecture. The detailed convolutional layers are omitted and only the changes of feature maps are shown
in the architecture. Gram matrix computation is performed on the feature maps of the lower layers to generate the style representation of the input image. The
detailed procedure for style representations is illustrated in Fig. 5. At the end, content representation and style representation are combined to produce a hybrid
representation, followed by the classification layer for emotion inference.

Machajdik et al. [2] introduced a unified emotion recognition framework by combining low-level visual features and high-level concepts from psychology and art theory, including composition, color variance, and content. Zhao et al. [16] attempted to capture the emotional information of balance, harmony, variety and movement, and classified images by extracting visual features according to these art principles. Considering the relevance of colors and their combinations, Sartori et al. [23] designed visual features, named Group Lasso, to analyze emotions in abstract paintings. Alameda-Pineda et al. [20] introduced non-linear matrix completion (NLMC), an advanced multi-label learning framework, to recognize emotions in abstract paintings. Compared to features extracted from CNN models, these hand-crafted features mainly focus on low-level visual cues, while high-level semantics have not been exploited sufficiently.

B. Deep Features for Emotion Recognition

CNNs have been applied to image emotion recognition and have yielded state-of-the-art performance because of their outstanding capability of extracting high-level representations [1], [17]. Zhao et al. [24] comprehensively summarized representative approaches to emotion feature extraction and discussed some future research directions. You et al. [1] established a benchmark deep model based on AlexNet [8]; to help with training, a large-scale dataset for image emotion recognition was collected, in which eight emotion categories are defined to label the images. To obtain more emotional image samples, Zhao et al. [25] proposed a method to adapt image emotions with generative adversarial networks. Rao et al. [17] proposed a multiple-instance learning (MIL) framework, named MldrNet, to unify representations of multi-scale patches of an image; three different CNN models are applied to extract representations at different levels, and impressive emotion recognition results are reported. Yang et al. [26] proposed a framework to discover and leverage the affective regions of an image to produce an emotional prediction of the image. Similar to [26], Yang et al. [27] used localized information and proposed a weakly supervised coupled CNN with two branches that utilizes both holistic and localized information. Zhu et al. [28] proposed a bidirectional recurrent neural network (RNN) to integrate the learned features from different layers of a CNN for visual emotion recognition. By utilizing texture representations and deep metric learning, Yang et al. [29] proposed a framework for both retrieval and classification. Considering that one image usually evokes multiple emotions, Yang et al. [30] developed a multi-task framework by learning label distributions. However, the relationship between deep representations and image emotions, and the reason why deep CNNs work so remarkably well on this task, have not been explored well. This paper intends to illustrate the mechanism by visualizing representations and to provide an inference model that extracts discriminative representations of both image content and style for emotional analysis. In addition, we discuss the issue of emotion labeling uncertainty and introduce a novel loss function for optimization, which is different from [30].

III. EMOTIONAL INFERENCE MODEL

As introduced in Section I, hybrid visual features should be considered and combined for image emotion recognition. In this work, we adopt the ResNet [10] with 152 layers as the base CNN, as shown in Fig. 3. The convolutional layers mostly have 3 × 3 filters, the smallest size that can capture the notions of left/right, up/down and center. Downsampling is performed directly by convolutional layers that have a stride of 2.

The base network ends with a global average pooling layer and outputs a vector with a size of 2048, and we extract the content representation from the last fully connected layer. Style representations are extracted from the shallow layers: the outputs of the downsampling operations are chosen for computing the Gram matrices, which are able to capture the visual details of an image, including color, texture and shapes [18], [21], [22]. We empirically use the first three outputs of the downsampling operations. Style representations are produced by fully connected layers following the Gram matrix computation. As shown in Fig. 5, these different-level style representations are concatenated to generate a final style representation. At the end of the proposed model, the content and style representations are concatenated to generate a hybrid representation of the input image. In the following, Section III-A explains how deep CNNs that are designed for object classification recognize image emotion and describes how the content representation for emotion recognition is extracted. In Section III-B, we introduce style representations that include the missing low- and mid-level features. Finally, considering the emotion-labeling quality, a novel loss function is designed for optimization in Section III-C.
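To make the data flow of Fig. 3 concrete, the following sketch outlines one possible PyTorch realization of the hybrid representation. It is a minimal illustration rather than the authors' code: the class and function names (HybridEmotionNet, gram_matrix) are ours, the choice of the three lower stage outputs, the Gram normalization, and the style vector size (style_dim) are assumptions, and the exact layer selection and fully connected sizes in the paper may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

def gram_matrix(fmap):
    # fmap: (B, N, H, W) feature maps -> (B, N, N) Gram matrix, Eq. (3): G = X X^T
    b, n, h, w = fmap.shape
    x = fmap.reshape(b, n, h * w)                           # vectorize each of the N feature maps
    return torch.bmm(x, x.transpose(1, 2)) / (n * h * w)    # one plausible normalization

class HybridEmotionNet(nn.Module):
    """Content representation from the deepest layer plus style representations
    from Gram matrices of three lower stages, concatenated for classification."""
    def __init__(self, num_emotions=8, style_dim=256):
        super().__init__()
        base = models.resnet152(weights=None)
        self.stem = nn.Sequential(base.conv1, base.bn1, base.relu, base.maxpool)  # F1 (64 channels)
        self.stage1, self.stage2 = base.layer1, base.layer2                       # F2 (256), F3 (512)
        self.stage3, self.stage4 = base.layer3, base.layer4                       # high layers
        self.pool = nn.AdaptiveAvgPool2d(1)
        # one fully connected layer per used stage maps the flattened Gram matrix to S_i
        self.style_fc = nn.ModuleList([nn.Linear(c * c, style_dim) for c in (64, 256, 512)])
        self.classifier = nn.Linear(2048 + 3 * style_dim, num_emotions)

    def forward(self, x):
        f1 = self.stem(x)
        f2 = self.stage1(f1)
        f3 = self.stage2(f2)
        styles = [fc(gram_matrix(f).flatten(1)) for f, fc in zip((f1, f2, f3), self.style_fc)]
        content = self.pool(self.stage4(self.stage3(f3))).flatten(1)   # 2048-d content representation
        hybrid = torch.cat([content] + styles, dim=1)                  # hybrid representation
        return self.classifier(hybrid)                                 # emotion logits
```

The style vector size and normalization above are illustrative choices; only the overall structure (content vector plus Gram-based style vectors feeding a single classification layer) follows Fig. 3.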

A. Content Representation

We extract the content representation from the deepest layer of the base CNN, since the features in this layer include semantic information. In this section, we discuss the content representation with visualization. It is found that visual details that are important for emotion recognition are missing in the content representation, since the base CNN is designed for object classification.

Fig. 4. Visualizing the features in the convolutional layers. For each row, the images from left to right are the input image and the representations in the 3rd, 6th, 9th and 12th convolutional layers, respectively.

Given one sample (x, y), where x is the image and y is the associated label, x is encoded with a representation in each layer of the CNN by the filter responses to that image. To understand how typical CNNs recognize emotions, we visualize the representations that are encoded at different layers in the network. Gradient descent is performed on a randomly initialized image to find another image that matches the feature maps of the original one, following [18], [19]. Let x and c be the original image and the image that is generated for visualizing the representation, and let X ∈ R^{H×W×N} and C ∈ R^{H×W×N} be the respective feature maps of x and c, where H, W and N are the height, width and number of the feature maps. We define the squared-error loss between x and c as

L(x, c) = \frac{1}{2} \sum_{i}^{N} (X_i - C_i)^2.    (1)

The derivative of the loss with respect to the feature maps equals

\frac{\partial L}{\partial X_i} = X_i - C_i.    (2)

Thus, we can change the initially random image c using standard error back-propagation until it generates the same response in a certain layer of the CNN as the original image x.
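A minimal sketch of this inversion procedure is given below, assuming a PyTorch feature extractor. The function name and the optimization settings (number of steps, learning rate, plain gradient descent on the pixels) are illustrative assumptions rather than the authors' implementation; extract_features stands for any truncated CNN that returns the feature maps of the chosen layer.

```python
import torch

def invert_features(extract_features, image, steps=500, lr=0.1):
    """Find an image whose feature maps match those of `image` (Eqs. (1)-(2)).
    `extract_features` is any callable returning the feature maps of the chosen
    convolutional layer; its definition is assumed here."""
    with torch.no_grad():
        target = extract_features(image)                  # X: feature maps of the original image x
    recon = torch.rand_like(image, requires_grad=True)    # randomly initialized image c
    optimizer = torch.optim.SGD([recon], lr=lr)           # gradient descent on the pixels
    for _ in range(steps):
        optimizer.zero_grad()
        feats = extract_features(recon)                   # C: feature maps of the current estimate
        loss = 0.5 * ((feats - target) ** 2).sum()        # squared-error loss, Eq. (1)
        loss.backward()                                   # gradient of Eq. (2) back-propagated to c
        optimizer.step()
    return recon.detach()
```

For example, a truncated backbone such as torch.nn.Sequential(*list(resnet.children())[:5]) could serve as extract_features for one depth of interest.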
As shown in Fig. 4, shallow layers maintain photographically faithful representations of the image, though with increasing fuzziness. Along the processing hierarchy of the network, it is observed that the visual details are ignored in the high layers. Therefore, another representation that could include the missing visual details is needed for image emotion recognition.

Fig. 5. Illustration of extracting style representation by Gram matrix.

B. Style Representation

We extract style representations from the shallow layers, since visual details are still available there. However, using the feature maps from the shallow layers to describe the image style directly leads to redundant information.

That is because these feature maps include the low-level visual features as well as the image content [18], [21], [22], [31]. Therefore, we employ the correlation between the feature maps, named the Gram matrix, as the style representation, which preserves the style of the image only by capturing the low-level visual features [18].

Each feature map X_i of size m × m in a convolutional layer is first vectorized into a vector of length M = m × m, and the vectorized feature maps can then be stored in a matrix X ∈ R^{N×M}, where N indicates the number of feature maps. Therefore, the Gram matrix G ∈ R^{N×N} of the convolutional layer can be written as

G = X X^T.    (3)

To be specific, every element G_{ij} of the Gram matrix is the inner product between the vectorized feature maps X_i and X_j in the layer:

G_{ij} = \sum_{k} X_{ik} X_{jk}.    (4)

After normalization, the Gram matrix is transformed to a style representation S by a fully connected layer that has N neurons, as shown in Fig. 5. We also visualize the style representation S from the shallow layers in Fig. 2. Different from the content representation, the style representation provides a description of the low-level visual features that is not related to the content of the image. To include more information about the style, a set of style representations {S_1, S_2, ...} from different layers is concatenated and aggregated by a fully connected layer.

C. Loss Function

The performance of a trained model relies on the quality and number of the labeled examples. In current practice, the labels are mostly assumed to be unambiguous and accurate. However, this assumption does not hold in image emotion recognition because labeling an image with a specific category is highly subjective. Different people may have different feelings about the same image, so it is difficult to reach a consensus for most images. As stated in Section IV-A, when labeling an image, an emotion that receives at least three votes out of five is regarded as the ground truth label for that image [1], [17]. However, such uncertainty over emotion labels affects the training quality of the network and degrades the recognition accuracy, as illustrated in Table III. To address this problem, we introduce a novel loss function for image emotion recognition.

As introduced above, the CNN finally extracts a representation of the image x using convolutional layers and fully connected layers. The output of the last fully connected layer is transformed into a probability distribution with a softmax function:

p_i = \frac{e^{a_i}}{\sum_{j=1}^{n} e^{a_j}}, \quad i = 1, \ldots, n,    (5)

where a_i is the output of the last fully connected layer, e is the base of the natural logarithm and n indicates the number of emotion categories. Then, the CNN can be optimized based on the cross-entropy loss:

L_1 = -\sum_{i}^{n} y_i \log(p_i),    (6)

where y = \{ y_i \mid y_i \in \{0, 1\}, i = 1, \ldots, n, \sum_{i=1}^{n} y_i = 1 \} indicates the ground truth label of the image.

Considering the different labeling qualities of the images, we describe the quality of each image by a vector v, defined by the number of votes that each emotion received in the process of data collection [1]:

v = \{ v_i \mid v_i = \text{votes received}, \; i = 1, \ldots, n \}.    (7)

Every element v_i in v means that the ith emotion received v_i votes when the samples were collected. The emotions are arranged in the order Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear and Sadness. For example, an image with three votes for Anger, one vote for Excitement and one vote for Sadness has v = [0, 3, 0, 0, 0, 1, 0, 1]^T. Based on this definition, another loss can be introduced as

L_2 = \sum_{i}^{n} v_i \left( p_i - \frac{v_i}{n_v} \right)^2,    (8)

where n_v is the total number of votes and 1/n_v is used for normalization. This loss function is proposed to make the probability distribution p approach v numerically, as a method of label smoothing, since the manual annotations of the image emotions are the results of multiple participants. Moreover, v_i also serves as a modulating factor so that the loss function can focus more on the samples of high quality. Finally, we define the total loss function as the weighted sum of L_1 and L_2:

Loss = -\sum_{i}^{n} y_i \log(p_i) + \alpha \sum_{i}^{n} v_i \left( p_i - \frac{v_i}{n_v} \right)^2,    (9)

where α is an empirical constant.
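A compact sketch of this objective is given below. It follows Eqs. (5)–(9) as reconstructed above, so the exact placement of the 1/n_v normalization and the batch averaging are assumptions; the function name and tensor layout are ours, and alpha defaults to 2 following the ablation in Table IV.

```python
import torch
import torch.nn.functional as F

def emotion_loss(logits, labels, votes, alpha=2.0):
    """Total loss of Eq. (9): cross-entropy (Eq. (6)) plus the vote-weighted quality term (Eq. (8)).
    logits: (B, n) outputs of the last fully connected layer
    labels: (B,)   ground-truth emotion indices
    votes:  (B, n) votes per emotion gathered during labeling, e.g.
            [0, 3, 0, 0, 0, 1, 0, 1] for 3 Anger, 1 Excitement, 1 Sadness votes."""
    votes = votes.float()
    p = F.softmax(logits, dim=1)                       # probability distribution, Eq. (5)
    l1 = F.cross_entropy(logits, labels)               # Eq. (6), averaged over the batch
    n_v = votes.sum(dim=1, keepdim=True)               # total number of votes n_v per image
    l2 = (votes * (p - votes / n_v) ** 2).sum(dim=1)   # Eq. (8): v_i modulates the squared error
    return l1 + alpha * l2.mean()                      # Eq. (9); alpha = 2 performs best in Table IV
```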

IV. EXPERIMENTAL RESULTS

In this section, we conduct experiments on the benchmark datasets for evaluation.

A. Datasets

1) Image Emotion Dataset: The Image Emotion Dataset consists of approximately 23,000 emotional images collected from Flickr and Instagram using eight emotions as the search keywords [1], i.e., Amusement, Anger, Awe, Contentment, Disgust, Excitement, Fear and Sadness. These images were then labeled by Amazon Mechanical Turk (AMT) workers; five AMT workers were assigned to verify the emotion of each image. They were asked to answer a question such as "Do you feel anger seeing this image?" and to choose YES or NO for each image. Images that received at least three YESes were kept as well-labeled samples.

2) IAPS-Subset: This is a subset of the International Affective Picture System (IAPS) [32], which has been widely used in affective image classification. Among all IAPS images, 395 images were labeled with the above-mentioned eight emotion categories by computing the arousal and valence values of these images [33].

3) ArtPhoto: This dataset was obtained by using the emotion categories as search terms on an art sharing site [2]. The photos were taken by professional artists who attempted to evoke a certain emotion in the viewer of the photograph. In total, there are 806 images, and each is assigned to one of the eight aforementioned emotion categories.

4) Abstract Paintings: There are 228 abstract paintings in this dataset. To obtain ground truth emotion labels for the samples, the images were peer-rated in a web survey in which the participants could select the best-fitting emotional category from the eight specified above [2]. For each image, the category with the most votes was selected as the ground truth emotion.

5) MART: This dataset contains only abstract paintings produced by 80 different professional artists. In total, 500 abstract paintings were collected from the electronic archive of the Museum of Modern and Contemporary Art of Trento and Rovereto (MART) [15]. The labeling of the emotional perception was done using the relative score method in [23]. Differing from the above four datasets, only two emotion labels, positive or negative, are used to classify the images.

The above five datasets can be categorized as large-scale (Image Emotion Dataset) vs. small-scale (IAPS-Subset, ArtPhoto, Abstract Paintings and MART), and as internet images (Image Emotion Dataset, IAPS-Subset, and ArtPhoto) vs. abstract paintings (Abstract Paintings and MART).

B. Implementation Details

The implementation of the classification follows the practice in [1], [10]. The images are randomly split into a training set (80%, 18,074 images) and a testing set (20%, 4,520 images) [1]. Each image is resized with its shorter side randomly sampled in [256, 480] for scale augmentation, and a 224 × 224 crop is then randomly sampled from the image or its horizontal flip [10]. We initialize the weights using models pre-trained on ImageNet [34]. We use stochastic gradient descent with a mini-batch size of 64. The learning rate starts from 0.001 and is divided by 10 after every 10 epochs, and the models are trained for up to 50 epochs. We use a weight decay of 1e-4 and a momentum of 0.9. In testing, we adopt the standard 10-crop testing [10]: the images are resized such that the shorter side is 256, and 224 × 224 crops are sampled.

TABLE I. EVALUATION OF NETWORK ARCHITECTURES ON IMAGE EMOTION DATASET (%)

C. Ablation Study

1) Network Architecture: Deep CNNs have led to a series of breakthroughs in image classification [8]–[10], [35], [36]. Recent evidence [9], [10], [35] has revealed that the network depth is of crucial importance, and the leading results [9], [10], [35] on the challenging ImageNet dataset [34] all suggest the use of deep models. Inspired by the significance of depth, our first attempt is to study the influence of the network depth on emotion analysis. In the experiments, the VGG net [9] and the ResNet [10] are chosen because of their good performance on visual classification tasks and their relatively consistent structures. We train the deep models starting from the weights pre-trained on the ImageNet dataset [34]. As shown in Table I, the performance improves as the depth increases, and the deepest ResNet-152 yields the best result. In addition, it is observed that the VGG nets (i.e., VGG-16 and VGG-19) achieve better performance than ResNet-18, which has more convolutional layers. This is because the VGG nets have two more fully connected layers, which provide redundant parameters to build up high-level concepts for emotion reasoning. Hence, we tried to add more fully connected layers to various ResNets. From Table I, the experimental results show that additional fully connected layers are helpful. However, as the network goes deeper, the improvement from the additional fully connected layers becomes limited. Therefore, we empirically adopt ResNet-152 as the base network.

2) Style Representation: The style representations extracted from ResNet-152 are investigated. In CNNs, there are often many layers producing output maps of the same size, and we say these layers are in the same network stage (see F_i in Fig. 3). In our work, we compute the Gram matrix G_i using the output F_i of the last layer in each stage for the style representation S_i. As shown in Table II, the performance degrades obviously when the style representations are used without the content representation. This proves that the content of the image is important for CNNs to recognize the emotion. Moreover, it is also observed that the combination of the content representation C and the style representations {S_1, S_2, S_3} yields the best emotion recognition result. On the other hand, style representations from high layers (i.e., S_4 and S_5) degrade the performance. Since the Gram matrix of the high-level features has a larger size (e.g., 2048 × 2048) than that from a low layer (e.g., 256 × 256), the number of neurons in the fully connected layers of S_4 or S_5 is much larger than those of S_1, S_2 and S_3, which may lead to overfitting considering the limited training samples. As a result, the combination of the style representations from high layers and the content representation (e.g., C+S_4) degrades the performance.
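For reference, the training and testing protocol of Section IV-B can be sketched as follows. The torchvision transforms and the StepLR scheduler are standard approximations of the described procedure, not the authors' code, and the placeholder model stands for the hybrid network.

```python
import random
import torch
from torch import nn, optim
from torchvision import transforms
from torchvision.transforms import functional as TF

# Scale augmentation of Section IV-B: shorter side resized to a random value in
# [256, 480], then a random 224 x 224 crop and a random horizontal flip.
train_transform = transforms.Compose([
    transforms.Lambda(lambda img: TF.resize(img, random.randint(256, 480))),
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Standard 10-crop testing: shorter side resized to 256, then the four corner
# crops, the center crop, and their horizontal flips are evaluated.
test_transform = transforms.Compose([
    transforms.Resize(256),
    transforms.TenCrop(224),
    transforms.Lambda(lambda crops: torch.stack([TF.to_tensor(c) for c in crops])),
])

# Optimizer and schedule from Section IV-B; a DataLoader with batch_size=64
# completes the setup.
model = nn.Linear(2048, 8)  # placeholder; substitute the hybrid emotion model
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9, weight_decay=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)  # lr divided by 10 every 10 epochs
```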

TABLE II. EVALUATION OF STYLE REPRESENTATIONS ON IMAGE EMOTION DATASET (%)

TABLE III. EMOTION CLASSIFICATION ACCURACY WITH DIFFERENT LABELING QUALITY ON IMAGE EMOTION DATASET (%)

TABLE IV. EVALUATION OF LOSS FUNCTION ON IMAGE EMOTION DATASET (%)

TABLE V. COMPARISON WITH THE STATE-OF-THE-ART METHODS ON IMAGE EMOTION DATASET (%)

3) Labeling Quality and Loss: As stated in Section III-C, the subjectivity of emotion is one of the main challenges in image emotion recognition. Experiments are conducted to analyze the influence of the emotion-labeling quality on the recognition accuracy, as shown in Table III, where the ResNet-152 is trained on the Image Emotion Dataset. Apparently, the recognition accuracy tends to improve as the consensus over the image emotion labels increases. Therefore, the labeling quality of image emotion is taken into account, and a new loss function is introduced, as shown in Eq. (9). Both the ResNet-152 and the proposed model are evaluated. When the weight α is equal to 0, the networks are trained with only the cross-entropy loss L_1. As shown in Table IV, training the ResNet-152 and the proposed model with the proposed loss (Eq. (9)) improves the performance. Moreover, when the weight α is equal to 2, the proposed model yields the best result.

D. Comparison to the State of the Art

1) Comparison on the Image Emotion Dataset: To verify the effectiveness of our model, we compare it with the state-of-the-art methods in Table V. Compared with Zhao et al. [16], which is a method based on hand-crafted features, our representation yields obviously better results. As shown in Table V, the proposed method outperforms deep CNNs that are designed for the object classification task, such as AlexNet [8], the VGG-19 net [9] and ResNet-152 [10]. This method benefits from the proposed hybrid content and style representation of images, as well as the more appropriate loss for model optimization. Compared with the deep models designed for image emotion recognition, including MldrNet [17], Yang et al. [30], Yang et al. [29], WSCNet [27], Bi-GRN [28] and AR [26], our method leads to a comparable performance based on the style representations. A number of emotion recognition examples are visualized in Fig. 6. It is observed that the proposed method can correctly recognize images whose emotions are falsely classified by the base CNN.

2) Comparison on Small-Scale Datasets: To further evaluate the effectiveness of our model, we test the proposed method on the small datasets, including IAPS-Subset, ArtPhoto, Abstract Paintings and MART. The image samples from each category were randomly split into five batches, and 5-fold cross validation was used to produce the results, except for MART, where 10-fold validation was used. As the emotion category Anger contains only 8 and 3 samples in the IAPS-Subset and Abstract Paintings, respectively, it is insufficient to perform the 5-fold cross validation; thus, the true positive rates for Anger on these two datasets are not reported, following [1], [2], [17]. On IAPS-Subset, Abstract Paintings and ArtPhoto, our model outperforms the state-of-the-art methods including Machajdik and Hanbury [2], Zhao et al. [16] and MldrNet [17], as shown in Fig. 7.

Fig. 6. Emotional image samples classified by the proposed method and the base network. The ground truth emotion is shown below each image.

Fig. 7. Comparison with the state-of-the-art methods on three small datasets, including Machajdik and Hanbury [2], Zhao et al. [16], and MldrNet [17].

TABLE VI. COMPARISON WITH THE STATE-OF-THE-ART METHODS ON MART DATASET (%)

On the other hand, the proposed method outperforms those methods that focus on the emotional analysis of abstract paintings, including Group Lasso [23], NLMC [20], AlexNet [8], and MldrNet [17], as shown in Table VI. From Table VI and Fig. 7, it is also found that the combination of the content representation and the style representation from deep CNNs is helpful for emotion recognition in abstract paintings.

V. CONCLUSIONS

In this paper, we proposed a novel CNN model to learn and integrate the content representation and the style representation of an image for emotion analysis. The visualization of the representations and the experimental results showed that combining the style representation benefits the recognition of image emotion. Moreover, this study showed that the quality of the labels influences the performance, since the annotation of image emotion is subjective. When building the datasets, the dataset authors selected the emotion that received a majority of votes as the ground truth, which leads to differences in labeling quality. Therefore, a novel loss function was introduced for optimizing deep models. The experimental results demonstrated that the proposed loss can improve the performance of deep networks in the field of image emotion recognition.

REFERENCES

[1] Q. You, J. Luo, H. Jin, and J. Yang, "Building a large scale dataset for image emotion recognition: The fine print and the benchmark," in Proc. 13th AAAI Conf. Artif. Intell., 2016, pp. 308–314.
[2] J. Machajdik and A. Hanbury, "Affective image classification using features inspired by psychology and art theory," in Proc. 18th ACM Int. Conf. Multimedia, 2010, pp. 83–92.
[3] D. Joshi et al., "Aesthetics and emotions in images," IEEE Signal Process. Mag., vol. 28, no. 5, pp. 94–115, Sep. 2011.
[4] W. Wang, Y. Yu, and J. Zhang, "Image emotional classification: Static vs. dynamic," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2004, vol. 7, pp. 6407–6411.
[5] W. Wang, Y. Yu, and S. Jiang, "Image retrieval by emotional semantics: A study of emotional space and feature extraction," in Proc. IEEE Int. Conf. Syst., Man Cybern., 2006, vol. 4, pp. 3534–3539.
[6] X. He and W. Zhang, "Emotion recognition by assisted learning with convolutional neural networks," Neurocomputing, vol. 291, pp. 187–194, 2018.
[7] W. Zhang, X. Yu, and X. He, "Learning bidirectional temporal cues for video-based person re-identification," IEEE Trans. Circuits Syst. Video Technol., vol. 28, no. 10, pp. 2768–2776, Oct. 2018.
[8] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.
[9] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent., 2015.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
[11] W. Zhang, B. Ma, K. Liu, and R. Huang, "Video-based pedestrian re-identification by adaptive spatio-temporal appearance model," IEEE Trans. Image Process., vol. 26, no. 4, pp. 2042–2054, Apr. 2017.
[12] Y. Lu, C. Yuan, W. Zhu, and X. Li, "Structurally incoherent low-rank nonnegative matrix factorization for image classification," IEEE Trans. Image Process., vol. 27, no. 11, pp. 5248–5260, Nov. 2018.
[13] W. Zhang, Q. Chen, W. Zhang, and X. He, "Long-range terrain perception using convolutional neural networks," Neurocomputing, vol. 275, pp. 781–787, 2018.
[14] D. Borth, T. Chen, R. Ji, and S.-F. Chang, "SentiBank: Large-scale ontology and classifiers for detecting sentiment and emotions in visual content," in Proc. 21st ACM Int. Conf. Multimedia, 2013, pp. 459–460.
[15] V. Yanulevskaya et al., "Emotional valence categorization using holistic image features," in Proc. 15th IEEE Int. Conf. Image Process., 2008, pp. 101–104.
[16] S. Zhao et al., "Exploring principles-of-art features for image emotion recognition," in Proc. 22nd ACM Int. Conf. Multimedia, 2014, pp. 47–56.
[17] T. Rao, M. Xu, and D. Xu, "Learning multi-level deep representations for image emotion classification," 2016, arXiv:1611.07145.
[18] L. A. Gatys, A. S. Ecker, and M. Bethge, "Image style transfer using convolutional neural networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2414–2423.
[19] A. Mahendran and A. Vedaldi, "Understanding deep image representations by inverting them," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 5188–5196.
[20] X. Alameda-Pineda, E. Ricci, Y. Yan, and N. Sebe, "Recognizing emotions from abstract paintings using non-linear matrix completion," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 5240–5248.
[21] M. Elad and P. Milanfar, "Style transfer via texture synthesis," IEEE Trans. Image Process., vol. 26, no. 5, pp. 2338–2351, May 2017.
[22] Y. Li, N. Wang, J. Liu, and X. Hou, "Demystifying neural style transfer," in Proc. Int. Joint Conf. Artif. Intell., 2017, pp. 2230–2236.
[23] A. Sartori, D. Culibrk, Y. Yan, and N. Sebe, "Who's afraid of Itten: Using the art theory of color combination to analyze emotions in abstract paintings," in Proc. 23rd ACM Int. Conf. Multimedia, 2015, pp. 311–320.
[24] S. Zhao et al., "Affective image content analysis: A comprehensive survey," in Proc. Int. Joint Conf. Artif. Intell., 2018, pp. 5534–5541.
[25] S. Zhao, C. Lin, P. Xu, S. Zhao, and K. Guo, "CycleEmotionGAN: Emotional semantic consistency preserved CycleGAN for adapting image emotions," in Proc. AAAI Conf. Artif. Intell., 2019, pp. 491–498.
[26] J. Yang et al., "Visual sentiment prediction based on automatic discovery of affective regions," IEEE Trans. Multimedia, vol. 20, no. 9, pp. 2513–2525, Sep. 2018.
[27] J. Yang, D. She, Y.-K. Lai, P. L. Rosin, and M.-H. Yang, "Weakly supervised coupled networks for visual sentiment analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 7584–7592.
[28] X. Zhu et al., "Dependency exploitation: A unified CNN-RNN approach for visual emotion recognition," in Proc. Int. Joint Conf. Artif. Intell., 2017, pp. 3595–3601.
[29] J. Yang, D. She, Y.-K. Lai, and M.-H. Yang, "Retrieving and classifying affective images via deep metric learning," in Proc. 32nd AAAI Conf. Artif. Intell., 2018, pp. 491–498.
[30] J. Yang, D. She, and M. Sun, "Joint image emotion classification and distribution learning via deep convolutional neural network," in Proc. 26th Int. Joint Conf. Artif. Intell., 2017, pp. 3266–3272.
[31] T. Li et al., "BeautyGAN: Instance-level facial makeup transfer with deep generative adversarial network," in Proc. ACM Multimedia Conf., 2018, pp. 645–653.
[32] P. J. Lang et al., "International affective picture system (IAPS): Instruction manual and affective ratings," The Center for Research in Psychophysiology, University of Florida, Gainesville, FL, USA, 1999.
[33] J. A. Mikels et al., "Emotional category data on images from the International Affective Picture System," Behav. Res. Methods, vol. 37, no. 4, pp. 626–630, 2005.
[34] O. Russakovsky et al., "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211–252, Dec. 2015.
[35] C. Szegedy et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2015, pp. 1–9.
[36] Z. Zhang, P. Cui, H. Li, X. Wang, and W. Zhu, "Billion-scale network embedding with iterative random projection," in Proc. IEEE Int. Conf. Data Mining, 2018, pp. 787–796.
