
Image Sentiment Analysis Using Convolutional Neural Network

Akshi Kumar and Arunima Jaiswal
Delhi Technological University, Delhi, India
akshikumar@dce.ac.in, arunimajaiswal@gmail.com

Abstract. Visual media is one of the most powerful channels for expressing emotions and sentiments. Social media users increasingly use multimedia such as images and videos to express their opinions, views, and experiences. Sentiment analysis of this vast user-generated visual content can aid in better extraction of user sentiments, which motivated us to focus on image sentiment analysis. Significant advances have been made in this area; however, much remains to be explored in visual sentiment analysis using deep learning techniques. In our study, we design a visual sentiment framework using a convolutional neural network. For experimentation, we use Flickr images for training and Twitter images for testing. The results show that the proposed visual sentiment framework using a convolutional neural network improves performance in analyzing the sentiments associated with images.

Keywords: Sentiment analysis · Deep learning · Convolutional neural network

1 Introduction

These days, due to the abundant volume of opinion-rich web data accessible via the Internet, a large portion of recent research is directed at an area of web mining called Sentiment Analysis [1]. It is defined as the process of computationally classifying and categorizing sentiments expressed in a piece of multimedia web data (both textual and non-textual), especially in order to determine the polarity of the writer's attitude towards a particular topic, product, etc. [2]. It is a way to evaluate written or spoken language in order to determine the degree of the expression, whether favorable, unfavorable, or neutral [3]. Also referred to as Opinion Mining, the idea of sentiment analysis is to search for opinions, identify the sentiments involved, and classify them based on polarity. A huge measure of heterogeneous information is produced by the users of social media via the Internet, which is analyzed for efficient decision making. The era of the Internet has drastically modified the way individuals express views, opinions, and sentiments [3]. This is essentially done via blogs, online reviews, forums, social media, feedback, surveys, etc. These days, people depend more on social networking sites like Twitter, Facebook, Flickr, and Instagram for appropriate decision making and for sharing their opinions and views, which in turn generates an enormous volume of sentiment-rich data, often expressed in the form of texts, images, audio, videos, and mixtures of images and text. We can thus say that the masses rely on such online user-generated multimedia web content for opinions. Images, as a part of this multimedia web data, help convey, express, communicate, comprehend, and illustrate different levels of people's opinions or sentiments to their viewers, which marks the increasing significance of image sentiment analysis, or image sentiment prediction.

© Springer International Publishing AG, part of Springer Nature 2018
A. Abraham et al. (Eds.): ISDA 2017, AISC 736, pp. 464–473, 2018.
https://doi.org/10.1007/978-3-319-76348-4_45
Scrutiny of this multimedia web data, especially sentiment analysis of textual and visual data, promises a better understanding of human behavior, as such data conveys emotions and opinions more clearly. To date, most sentiment analysis work has covered textual content only. Understanding the emotions and sentiments of visual media content has attracted increasing attention in research and practical applications, yet little progress has been made in determining and estimating the emotions and sentiments of visual user-generated online content. It is thus an emerging area of research, and there is great scope for exploring the sentiments of visual multimedia content. The main motivation of this work is the explosive growth of social media and online visual content, which has encouraged research on large-scale social multimedia analysis. We therefore choose as our problem statement the analysis of sentiments associated with images using a deep learning algorithm, the convolutional neural network. Multimedia messages, including videos and images that encapsulate strong sentiments, can strengthen the sentiment or opinion conveyed in the content and thus influence the audience more effectively [12]. Understanding the opinions or sentiments expressed in visual content will significantly benefit social media communication and facilitate broad applications in education, finance, advertisement, entertainment, health, etc. [13]. We can say that the sentiments (both textual and non-textual) of social media broadly influence the thoughts and views of the public; for example, perceptions of the U.S. economy and stock market are influenced by the changing sentiments of Twitter users.
We aim to explore the application of a deep learning algorithm, the convolutional neural network, to visual media for determining its sentiments accurately. A major problem arises in situations where the image and the accompanying text express conflicting emotions, which necessitates a proper framework for determining the sentiments of visual media such as images.
The rest of the paper is organized as follows. Section 2 discusses the related work in the area of image sentiment analysis. Section 3 explains the fundamental concepts of Convolutional Neural Networks. Section 4 describes the dataset collection, system architecture, and experimentation. Section 5 presents the experimental results. Section 6 concludes the paper.

2 Related Work

Most of the past literature on sentiment analysis has focused on text analysis [4–7]. Sentiment-based models have been demonstrated to be beneficial in various analytical uses such as human behavior prediction, business, and political science [8, 22–24]. In comparison to text-based opinion mining or sentiment analysis, modeling of sentiments based on images has been much less studied.
One early effort [9] suggested designing a "large-scale visual sentiment ontology" based on Adjective Noun Pairs (ANPs). Borth et al. [10] proposed a more tractable approach that models sentiment-related visual concepts as a mid-level representation to fill the gap. These concepts are ANPs, such as "happy dog" and "beautiful sky", that merge the sentimental strength of adjectives with the detectability of nouns. Although such ANP concepts do not directly express emotions or sentiments, they were learned from strong co-occurrence relationships with the emotion tags of web photos, and thus are valuable as effective statistical cues for detecting emotions depicted in images.
The study by Krizhevsky et al. [11] trained a deep convolutional neural network to classify 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest and obtained improved results. The authors of [12] proposed a novel framework for visual sentiment prediction of images using deep convolutional neural networks and experimented on data obtained from two popular microblogs, Twitter and Tumblr. Jindal et al. [13] discuss the applicability of an image sentiment prediction framework using convolutional neural networks on a Flickr image dataset. The authors of [14] designed a framework for image sentiment analysis of Flickr images using a CNN; they also developed new strategies to tackle the noisy nature of the large-scale image samples taken, and their results show the improved performance of the CNN. Cai et al. [15] describe the applicability of CNNs for learning both textual and visual features for sentiment analysis on a Twitter dataset; their study shows that combining text and images yields improved results. The work in [16] explores the use of hyperparameters from a very deep CNN for analyzing image sentiments on a Twitter dataset, and the results indicate that their model exhibits improved performance.

3 Deep Learning Using Convolutional Neural Network

Deep learning (DL) was proposed by G.E. Hinton in 2006 and is a branch of machine learning based on deep neural networks [17]. A neural network (NN) works much like the human brain, comprising numerous neurons that form an impressive network. DL is a family of networks containing many algorithms such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks, Recursive Neural Networks, Deep Belief Networks, etc. NNs are very advantageous in text generation, word representation estimation, vector representation, sentence modeling, sentence classification, and feature presentation [18]. The application of DL algorithms is increasing enormously for three prime reasons: enhanced chip processing abilities, comprehensively lower hardware costs, and noteworthy improvements in machine learning algorithms [19].
In our study, we focus on the use of CNNs. A CNN is a type of feed-forward artificial NN in which the individual neurons are arranged so that they respond to overlapping regions in the visual field. CNNs were inspired by biological processes (the connectivity pattern between neurons is inspired by the organization of the animal visual cortex) and are variations of the multilayer perceptron (MLP), designed to use minimal amounts of pre-processing. They comprise several layers of receptive fields [20]: small collections of neurons that process portions of the input image. The outputs of these collections are then tiled so that their input regions overlap, in order to attain a higher-resolution representation of the original image, and this process is repeated for every such layer [13]. Tiling permits CNNs to tolerate translation of the input image [12, 13].
There are four key operations in a ConvNet:

3.1 Convolution

ConvNets derive their name from the "convolution" operator. The major purpose of the convolution step is to extract features from the input image: a small filter (kernel) slides over the image and computes element-wise products with local patches, producing a feature map.
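As an illustrative sketch (not the paper's implementation), a single-channel "valid" convolution can be written in plain NumPy; the kernel values here are a hypothetical horizontal-difference filter:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D convolution of a single-channel image with a small kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # Element-wise multiply the patch with the kernel and sum
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge_kernel = np.array([[1.0, -1.0]])  # hypothetical difference filter
feature_map = conv2d(image, edge_kernel)
print(feature_map.shape)  # (4, 3)
```

In a real CNN layer, many such kernels are learned during training, each producing its own feature map.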

3.2 Non-Linearity (ReLU)

ReLU stands for Rectified Linear Unit and is a non-linear operation. ReLU is an element-wise operation (applied per pixel) that replaces all negative pixel values in the feature map with zero. The prime intent of ReLU is to introduce non-linearity into the ConvNet.
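The element-wise operation described above reduces to a single maximum per value; a minimal sketch:

```python
import numpy as np

def relu(feature_map):
    # Element-wise: replace every negative value with 0
    return np.maximum(feature_map, 0)

fm = np.array([[-3.0, 1.5], [2.0, -0.5]])
print(relu(fm))  # negatives become 0, positives pass through unchanged
```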

3.3 Pooling or Sub Sampling

Spatial Pooling reduces the dimensionality of each feature map but retains the most
important information. Spatial Pooling can be of different types: Max, Average, Sum
etc.
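Max pooling, the variant most commonly used with CNNs, can be sketched as follows (window size and stride are typical defaults, not values specified by the paper):

```python
import numpy as np

def max_pool(fm, size=2, stride=2):
    """Max pooling: keep only the largest value in each window."""
    oh = (fm.shape[0] - size) // stride + 1
    ow = (fm.shape[1] - size) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = fm[i * stride:i * stride + size,
                        j * stride:j * stride + size]
            out[i, j] = window.max()  # most important (largest) activation
    return out

fm = np.array([[1.0, 3.0, 2.0, 4.0],
               [5.0, 6.0, 7.0, 8.0],
               [3.0, 2.0, 1.0, 0.0],
               [1.0, 2.0, 3.0, 4.0]])
print(max_pool(fm))  # 4x4 feature map reduced to 2x2
```

Average and sum pooling differ only in replacing `window.max()` with `window.mean()` or `window.sum()`.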

3.4 Classification (Fully Connected Layer)


The "Fully Connected" layer is a traditional MLP that uses a softmax activation function in the output layer. The term "fully connected" denotes that every neuron in the previous layer is connected to every neuron in the next layer. The output of the convolutional and pooling layers represents the high-level features of the input image. The prime purpose of the Fully Connected layer is to use these features to classify the input image into one of several classes based on the training dataset.
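A minimal sketch of a fully connected layer with softmax output (the feature dimension, weights, and three-class setup are illustrative assumptions, not the paper's trained parameters):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability before exponentiating
    e = np.exp(z - z.max())
    return e / e.sum()

def fully_connected(features, weights, bias):
    """Dense layer followed by softmax over the output classes."""
    return softmax(weights @ features + bias)

rng = np.random.default_rng(0)
features = rng.normal(size=8)       # pooled high-level features
weights = rng.normal(size=(3, 8))   # 3 classes: positive, neutral, negative
bias = np.zeros(3)
probs = fully_connected(features, weights, bias)
print(probs)  # non-negative class probabilities that sum to 1
```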
In our study, we employ a CNN trained by the back-propagation method. The convolution and pooling layers serve as feature extractors from the input image, whereas the Fully Connected layer behaves as a classifier. Training the CNN means optimizing all the weights and parameters so that images from the training set are classified correctly. Whenever a new input image arrives at the CNN, it undergoes the forward-propagation step and a probability for each class is obtained as output. If the training set is large enough, the network will generalize well to new images and will classify them into the correct categories.
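As a toy illustration of the back-propagation idea (not the paper's Caffe training setup), one gradient-descent update for the final softmax layer under cross-entropy loss can be written directly; the feature vector, learning rate, and iteration count are hypothetical:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(1)
features = rng.normal(size=8)       # output of the feature-extractor layers
weights = np.zeros((3, 8))          # softmax layer for 3 sentiment classes
target = np.array([1.0, 0.0, 0.0])  # one-hot label: "positive"
lr = 0.5

for _ in range(100):
    probs = softmax(weights @ features)
    # Gradient of cross-entropy loss w.r.t. the weights: (p - y) outer f
    grad = np.outer(probs - target, features)
    weights -= lr * grad            # gradient-descent update

print(softmax(weights @ features).argmax())  # 0, the target class
```

In the full network, the same chain-rule computation propagates the error back through the pooling and convolutional layers as well.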

4 System Architecture and Experimentation

We implemented the CNN using a deep learning framework called "Caffe" [21]. It was developed by Berkeley AI Research (BAIR) and community contributors, and is under active development by the Berkeley Vision and Learning Center (BVLC).
Our proposed system classifies the sentiment of an image based on the generation of Adjective Noun Pairs (ANPs) [9, 10]. ANPs are generated for this purpose because they can easily be correlated with sentiments. The ANPs are generated with the help of a convolutional neural network that has been trained rigorously for this particular labelling task. Image recognition and labelling through CNNs are topics that have been extensively researched; image labelling for sentiment analysis, however, is an approach still in its nascent state, and we aim to develop it into a novel method of image sentiment classification. The system can broadly be categorized into two phases (as shown in Fig. 1): the ANP generation phase and the sentiment predictor phase. The ANP generation phase mainly comprises the neural network, while the sentiment predictor phase comprises a Support Vector Machine. Additional work has also been done to analyze the generated ANPs through textual sentiment analysis libraries, and the results were compared with the sentiment predicted through the SVM.

Fig. 1. Generating sentiment probabilities (image taken as input → Phase I: ANP generation → Phase II: sentiment predictor → calculating sentiment probabilities)

The system takes an image as input and outputs a probability distribution over the sentiment of that image (positive, negative, or neutral). The image is pre-processed to a specific resolution and is then passed through the multiple layers of the neural network (convolutional, ReLU, pooling, and fully connected). The ANPs generated in Phase I serve as input to Phase II, which outputs the sentiment probabilities. This step can be performed by two methods: sentiment prediction using a Support Vector Machine (SVM), and sentiment prediction through textual analysis of the ANPs. In our study, we implemented both methods in order to compare them and determine which is best suited for sentiment classification. The prime focus was on generating ANPs that closely represent the sentiment of the image. The Support Vector Machine was selected for the classification task because it works well in scenarios where the number of input features is high; in our work, the features are the ANPs generated in the first phase (2,089 input features), and the output is the sentiment probabilities. The SVM performs quite well in predicting sentiments on the basis of the ANPs. Our main aim was to generate labels, or ANPs, that are closely related to the sentiments. For the textual method, a string of the top ANPs was passed through a textual sentiment analyzer and the corresponding sentiment probabilities were obtained.
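A minimal sketch of the SVM-based predictor using the scikit-learn library (which the paper employs); the random placeholder data and its 50-dimensional shape stand in for the real 2,089-dimensional ANP vectors, whose exact training configuration is not detailed here:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(42)
# Placeholder feature vectors: one row per image, one column per ANP score
X_train = rng.random((200, 50))          # 50 columns stand in for 2,089 ANPs
y_train = rng.integers(0, 3, size=200)   # 0 = negative, 1 = neutral, 2 = positive

clf = SVC(probability=True)              # enable per-class probability estimates
clf.fit(X_train, y_train)

probs = clf.predict_proba(X_train[:1])   # sentiment probabilities for one image
print(probs.shape)  # (1, 3)
```

Setting `probability=True` makes `SVC` fit an internal calibration so that `predict_proba` returns the per-class probabilities the system needs, rather than only hard labels.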
Our proposed framework is divided into five modules. The first module takes an image from the user as input; the output of each module serves as the input to the next; and the last module outputs the desired results. Each module performs a specific task, ranging from preprocessing of the dataset to sentiment prediction. This modular approach helps in building a better system and simplifies the system design process. The proposed framework is depicted in Fig. 2.

Fig. 2. Proposed visual sentiment framework (pre-processing of the data and designing of the CNN → text sentiment predictor and SVM sentiment predictor → image sentiment classifier)

Data preprocessing involves processing of the datasets. Our system makes use of two databases. The Flickr database is used for training and testing the convolutional neural network, i.e., the ANP classifier. The Twitter database is used for training the support vector machine and for testing both the SVM and text sentiment predictors; it consists of 800 images and their corresponding tweets, with each image labelled with its corresponding sentiment. For each ANP, at most 1,000 images tagged with it were downloaded, which resulted in approximately one million images for 3,316 ANPs. To train the visual sentiment concept (ANP) classifiers, we first filtered out the ANPs associated with fewer than 120 images; consequently, 2,089 ANPs with 867,919 images were left after the filtration process. For each ANP, 20 images were randomly selected for testing, while the others were used for training, ensuring at least 100 training images per ANP. The ANP tags from Flickr users were used as labels for each image. All images, whether used for training or testing, were normalized to 256 × 256 without keeping the aspect ratio. The Python pandas library was used for shaping, merging, re-shaping, and slicing the datasets. The final dataset consists of approximately 800,000 images for the proposed 2,089 ANPs. The CNN for ANP detection was built through the Caffe deep learning framework; we used the Python bindings of Caffe to build this network, which comprises eight main layers, five convolutional and three fully connected. The output of the CNN is then used as input for the SVM, which is implemented using the scikit-learn Python library. The text sentiment analyzer was implemented in Python using the natural language toolkit NLTK, which is used to extract features from the text; the Twitter dataset was used for testing and training the text sentiment predictor. In the system implementation, the top ANPs generated were formed into a string, constructed as follows: for each of the top ANPs, the rank/probability of the ANP is normalized to an integer weight greater than 1; the adjective of the ANP is added to the string x times, where x is the weight of the ANP, while the noun is added once per ANP. The final string is passed through the text sentiment predictor and the sentiment probabilities are obtained.
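The string-construction step described above can be sketched as follows; the example ANPs, their probabilities, and the `scale` factor used for normalizing probabilities to integer weights are illustrative assumptions, since the paper does not give the exact normalization rule:

```python
def build_anp_string(anp_probs, scale=10):
    """Weight each adjective by its ANP's normalized rank/probability.

    anp_probs: list of ((adjective, noun), probability) for the top ANPs.
    """
    words = []
    for (adjective, noun), prob in anp_probs:
        weight = max(2, round(prob * scale))  # normalize to an integer > 1
        words.extend([adjective] * weight)    # repeat the adjective 'weight' times
        words.append(noun)                    # the noun is added once per ANP
    return " ".join(words)

top_anps = [(("happy", "dog"), 0.31), (("beautiful", "sky"), 0.22)]
print(build_anp_string(top_anps))
# happy happy happy dog beautiful beautiful sky
```

Repeating the adjective amplifies its contribution when the string is scored by the textual sentiment analyzer, mirroring the ANP's confidence.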
All training, testing, and experimentation were done on a MacBook (2.7 GHz dual-core Intel Core i5 processor, Turbo Boost up to 3.1 GHz, 3 MB shared L3 cache, 256 GB PCIe-based onboard SSD, 8 GB of 1867 MHz LPDDR3 onboard memory, Intel Iris Graphics 6100 with 1536 MB, macOS Sierra v10.12.2). The time taken to train the model was approximately 13 days.

5 Result Analysis and Conclusion

This section shows the results obtained after passing images through our system. The images were selected to cover a diverse range of sentiments in order to push the limits of the system. For each input image, we show the mid-level representation of the ANPs and the final graph obtained on executing the system, comprising a comparison between the two techniques used for sentiment prediction and the final label produced for that image. We tested our system on three types of data: the first comprised only the text, i.e., the tweets; the second contained only the images associated with those tweets; and the third consisted of the combination of the tweets and their corresponding images. We observed that the third method gives the most desirable results, as shown in Fig. 3. We used accuracy as the parameter for evaluating the classification of images/text. Accuracy is defined as the percentage of images/text that have been correctly labelled by the system; in our context, it is the number of images out of the 800 Twitter images/texts that have been correctly labelled.

Fig. 3. Comparison between text, image, and text + image
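The accuracy measure defined above can be computed directly from predicted and true labels; the labels below are illustrative:

```python
def accuracy(predicted, actual):
    """Percentage of items whose predicted label matches the true label."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

predicted = ["positive", "negative", "neutral", "positive"]
actual    = ["positive", "negative", "positive", "positive"]
print(accuracy(predicted, actual))  # 75.0
```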
In this study, we propose a visual sentiment concept classification model based on deep convolutional neural networks, trained using Caffe. We performed a comparative analysis of applying basic text analysis versus an SVM to the generated Adjective Noun Pairs (ANPs), and found that the ANPs, when passed through the SVM, yield better results for visual sentiment classification. Another comparison was made between the results of textual sentiment analysis, visual sentiment analysis, and combined textual + visual sentiment analysis, to explore which method generates the most desirable results for correctly classifying sentiments. The performance evaluation shows that the newly trained deep CNN model works significantly better for both annotation and retrieval than shallower models that employ independent binary SVM classifiers.

6 Conclusion

Visual sentiment analysis is a very challenging task and is still in its infancy; earlier works focused mainly on textual sentiment analysis. In our study, we focused on image sentiment analysis using convolutional neural networks. We used the concept of averaging for generating sentiment probabilities, which is a simple approach and might not ensure the most optimized and accurate results; other approaches could yield more optimized solutions and more accurate results. The experimental results show that a properly trained deep learning CNN model can give better results and could thus be used to address the challenging problem of visual multimedia mining. Future work includes incorporating concept localization into the deep CNN model, improving the network structure by leveraging concept relations, and applying other deep learning models to further application domains and data sources for detecting emotions and predicting visual sentiment.

References

1. Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2),
1–135 (2008)
2. Liu, B.: Sentiment Analysis: Mining Opinions, Sentiments, and Emotions. Cambridge
University Press, Chicago (2015)
3. Kumar, A., Teeja, M.S.: Sentiment analysis: A perspective on its past, present and future. Int.
J. Intell. Syst. Appl. 4(10), 1–14 (2012)
4. Esuli, A., Sebastiani, F.: SentiWordnet: a publicly available lexical resource for opinion
mining. In: Proceedings of LREC (2006)
5. Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas, A.: Sentiment strength detection
in short informal text. J. Am. Soc. Inf. Sci. Technol. 62(2), 419–442 (2011)

6. Vonikakis, V., Winkler, S.: Emotion-based sequence of family photos. In: Proceedings of the
20th ACM International Conference on Multimedia, pp. 1371–1372. ACM, Japan (2012)
7. Yanulevskaya, V., Uijlings, J., Bruni, E., Sartori, A., Zamboni, E., Bacci, F., Sebe, N.: In the
eye of the beholder: employing statistical analysis and eye tracking for analyzing abstract
paintings. In: Proceedings of the 20th ACM International Conference on Multimedia, pp.
349–358. ACM, Japan (2012)
8. Wang, X., Jia, J., Hu, P., Wu, S., Tang, J., Cai, L.: Understanding the emotional impact of
images. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1369–
1370. ACM, Japan (2012)
9. Aradhye, H., Toderici, G., Yagnik, J.: Video2text: Learning to annotate video content. In:
IEEE International Conference on Data Mining Workshops, ICDMW 2009, pp. 144–151.
IEEE, USA (2009)
10. Borth, D., Ji, R., Chen, T., Breuel, T., Chang, S. F.: Large-scale visual sentiment ontology
and detectors using adjective noun pairs. In: Proceedings of the 21st ACM International
Conference on Multimedia, pp. 223–232. ACM, Spain (2013)
11. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional
neural networks. In: Advances in neural information processing systems, pp. 1097–1105.
ACM, USA (2012)
12. Xu, C., Cetintas, S., Lee, K. C., Li, L. J.: Visual sentiment prediction with deep convolutional
neural networks. arXiv preprint arXiv:1411.5731 (2014)
13. Jindal, S., Singh, S.: Image sentiment analysis using deep convolutional neural networks with
domain specific fine tuning. In: 2015 International Conference on Information Processing
(ICIP), pp. 447–451. IEEE, India (2015)
14. You, Q., Luo, J., Jin, H., Yang, J.: Robust image sentiment analysis using progressively
trained and domain transferred deep networks. In: AAAI, pp. 381–388. ACM, USA (2015)
15. Cai, G., Xia, B.: Convolutional neural networks for multimedia sentiment analysis. In: Natural
Language Processing and Chinese Computing, pp. 159–167. Springer, Cham (2015)
16. Islam, J., Zhang, Y.: Visual Sentiment Analysis for Social Images Using Transfer Learning
Approach. In: IEEE International Conferences on Big Data and Cloud Computing
(BDCloud), Social Computing and Networking (SocialCom), Sustainable Computing and
Communications (SustainCom) (BDCloud-SocialCom-SustainCom), pp. 124–130. IEEE,
USA (2016)
17. Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., Fei-Fei, L. Imagenet: A large-scale
hierarchical image database. In: Computer Vision and Pattern Recognition, CVPR 2009, pp.
248–255. IEEE, USA (2009)
18. Zhang, Y., Er, M. J., Venkatesan, R., Wang, N., Pratama, M.: Sentiment classification using
comprehensive attention recurrent models. In: Neural Networks (IJCNN), pp. 1562–1569.
IEEE, Canada (2016)
19. Jiang, Y. G., Ye, G., Chang, S. F., Ellis, D., Loui, A. C.: Consumer video understanding: A
benchmark database and an evaluation of human and machine performance. In: Proceedings
of the 1st ACM International Conference on Multimedia Retrieval, p. 29. ACM, Italy (2011)
20. Ain, Q.T., Ali, M., Riaz, A., Noureen, A., Kamran, M., Hayat, B., Rehman, A.: Sentiment
analysis using deep learning techniques: a review. Int. J. Adv. Comput. Sci. Appl. 8(6), 424–
433 (2017)
21. Jindal, S., Singh, S.: Image sentiment analysis using deep convolutional neural networks with
domain specific fine tuning. In: Information Processing (ICIP), pp. 447–451. IEEE, India
(2015)
22. Kumar, A., Khorwal, R., Chaudhary, S.: A survey on sentiment analysis using swarm
intelligence. Indian J. Sci. Technol. 9(39), 1–7 (2016)

23. Kumar, A., Sebastian, T.M.: Sentiment analysis on twitter. Int. J. Comput. Sci. Issues 9(4),
372–378 (2012)
24. Kumar, A., Sebastian, T. M.: Machine learning assisted sentiment analysis. In: Proceedings
of International Conference on Computer Science & Engineering (ICCSE 2012), pp. 123–
130. IAENG, UAE (2012)
