Dr. Dabóczi Tamás
Head of Department
Application of new machine learning
algorithms in text processing
MSc Thesis
Máté Fejes
December 2017
Abstract
Abstract II
Declaration
Fejes Máté
Acknowledgements
Contents
Abstract
Abstract II
Declaration
Acknowledgements
1 Introduction
  1.1 Goals
  1.2 Film Industry Statistics
  1.3 Twitter as Social Media
2 Theoretical Overview
  2.1 Machine Learning Basics
  2.2 Artificial Neural Networks
    2.2.1 Convolutional Layers
    2.2.2 Recurrent Neural Networks
  2.3 Language Modelling & Embedding
    2.3.1 N-gram Model
    2.3.2 Continuous Space Language Models & Embedding
3 Related Work
4 Datasets
  4.1 Movie Dataset
    4.1.1 Filtering for Relevant Search Results
  4.2 Twitter Data
    4.2.1 Cleaning the Text Data
5 Creating Emotion Labeling
  5.1 Selecting the Set of Emoticons
  5.2 Emoticon Embedding
6 Emotion Analysis
  6.1 Baseline Models
  6.2 Neural Networks
    6.2.1 Embedding Layer Settings
    6.2.2 Neural Network Architectures
7 Income Prediction
  7.1 Creating Features Using Film Features
  7.2 Feature Generation Based on Twitter Data
  7.3 Income Prediction with Different Feature Sets
Bibliography
1 Introduction
In this section we define our goals, and describe the financial and social envi-
ronment we use as the source of our data.
1.1 Goals
For this thesis we attempt to predict the box office earnings of movies based
on information collected on Twitter before the premiere, focusing on the
sentiment expressed in the textual data. We do so by labelling tweets based on
emotions, a task which involves natural language processing (NLP) and
classification methods. After aggregating emotions for each movie, regression
models are built from the resulting information and other features.
We try to outperform the models published in the related literature by
using recent advances in NLP for sentiment analysis, such as neural word
embeddings.
Figure 1: Global box office income.
The Hollywood film industry is turning to Chinese and other oriental au-
diences, because they don’t require as much exposure to marketing to watch
a film as the European and American moviegoers [1].
In conclusion, it is more important than ever to use efficient and
cost-effective marketing techniques in the film industry. One way to measure
the effect of pre-release advertisements is by monitoring social media, such as
Twitter or Facebook.
1.3 Twitter as Social Media
Twitter is one of the four biggest social platforms by number of monthly
active users. With 310 million users, it is the fourth most popular after
Facebook, Tumblr and Instagram.
It can be defined as a microblogging website with a social network framing.
Users can write short status updates (tweets) of at most 140 characters, and
subscribe to the posts of other users. A person’s Twitter feed consists of the
tweets of the people they follow (are subscribed to).
It is possible to mention other users in your tweets using their handles by
writing @username in the text. One can re-post (retweet) tweets, optionally
adding their own remark to the original text. The retweet appears on followers’
feeds, together with the handle of the original poster. So-called hashtags are
a way to link your tweet to certain topics, such as current events, products
or ideas. This can be done by adding the right hashtag to your tweet, e.g.
#SomethingCurrent. Apart from plain text, emoticons, pictures and links
may also be added to a tweet.
Figure 3: Devices owned by moviegoers [1].
2 Theoretical Overview
Machine learning methods are built on the assumption that, given a sufficient
amount of data, we can build models that generalise well.
Machine learning methods try to explore systematic relationships in data,
and find functions approximating reality as closely as possible. Data in this
context means a collection of data points, where each point is described by
the same set of categorical or numerical features ($x$). Training supervised
classification and regression models requires data in which each data point is
labeled with either a class or a numeric value. This label is also called the
target. The goal of these models is to correctly determine the label based on
the features. This is done by tuning the model’s parameters ($h$), on which the
model depends when predicting the target by calculating $\hat{y} = \hat{y}(x, h)$.
We find the optimal value of $h$ by training on part of the data, minimizing
some kind of error function. This error function should reflect how far the
model’s prediction ($\hat{y}$) is from the correct label ($y$). The error is
then $E = f_{error}(y, \hat{y})$, where $f_{error}$ is the error function.
If both the model and the error function are differentiable, then one way
to train the model is by using gradient descent. This iterative optimization
method calculates the gradient of the error function, and changes the model’s
parameters accordingly, with step size $\lambda$ in each iteration (Equation 1).
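The iteration described above can be illustrated with a minimal sketch. It assumes a one-parameter model ŷ = h·x, a mean squared error function and the common update rule h ← h − λ·∂E/∂h; the data, the step size and the name `gradient_descent` are illustrative choices, not taken from the thesis.

```python
# Minimal gradient-descent sketch (illustrative assumptions throughout):
# model y_hat = h * x, error E = mean((y_hat - y)^2).

def gradient_descent(xs, ys, step_size=0.01, iterations=500):
    h = 0.0  # arbitrary starting value for the single parameter
    for _ in range(iterations):
        # dE/dh = mean(2 * (h*x - y) * x)
        grad = sum(2.0 * (h * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        h -= step_size * grad  # step against the gradient
    return h

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
h_fit = gradient_descent(xs, ys)
print(round(h_fit, 3))
```

With this toy data the parameter converges towards 2, the slope that generated the labels. A step size that is too large would make the iteration diverge instead.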
neurons working in parallel, each receiving the same inputs (the outputs of the
previous layer).
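$$
\frac{\partial E_1}{\partial w_{ij}^{(1)}} =
\overbrace{\frac{dE_1}{dy_1^{(2)}}}^{f'_{error}(y_1^{(2)})} \cdot
\underbrace{
\overbrace{\frac{dy_1^{(2)}}{ds_1^{(2)}}}^{\varphi'(s_1^{(2)})} \cdot
\overbrace{\frac{ds_1^{(2)}}{dy_j^{(1)}}}^{w_{j1}^{(2)}}
}_{\text{repeats}} \cdot
\overbrace{\frac{dy_j^{(1)}}{ds_j^{(1)}}}^{\varphi'(s_j^{(1)})} \cdot
\overbrace{\frac{ds_j^{(1)}}{dw_{ij}^{(1)}}}^{x_i}
$$
Using this derivative, we can update $w_{ij}^{(1)}$ so that the error becomes smaller.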
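A small numerical check can make the chain rule concrete. The sketch below assumes a network with two inputs, two hidden sigmoid neurons and one sigmoid output neuron, and a squared error 0.5·(y − target)²; these choices, the weight values and all function names are illustrative assumptions. It compares the product of the chain-rule factors with a finite-difference estimate of the same derivative.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def forward(x, w1, w2):
    """Return the hidden outputs y1 and the network output y2."""
    hidden = len(w2)
    s1 = [sum(w1[i][j] * x[i] for i in range(len(x))) for j in range(hidden)]
    y1 = [sigmoid(s) for s in s1]
    s2 = sum(w2[j] * y1[j] for j in range(hidden))
    return y1, sigmoid(s2)

def error(y2, target):
    return 0.5 * (y2 - target) ** 2

def analytic_gradient(x, w1, w2, target, i, j):
    y1, y2 = forward(x, w1, w2)
    return ((y2 - target)            # derivative of the error function
            * y2 * (1.0 - y2)        # sigmoid derivative at the output neuron
            * w2[j]                  # weight from hidden neuron j to the output
            * y1[j] * (1.0 - y1[j])  # sigmoid derivative at hidden neuron j
            * x[i])                  # input i

def numeric_gradient(x, w1, w2, target, i, j, eps=1e-6):
    def error_with(delta):
        shifted = [row[:] for row in w1]
        shifted[i][j] += delta
        return error(forward(x, shifted, w2)[1], target)
    # Central finite difference
    return (error_with(eps) - error_with(-eps)) / (2.0 * eps)

x = [0.5, -1.0]
w1 = [[0.1, 0.4], [-0.3, 0.2]]  # w1[i][j]: input i -> hidden neuron j
w2 = [0.7, -0.5]                # w2[j]: hidden neuron j -> output neuron
analytic = analytic_gradient(x, w1, w2, target=1.0, i=0, j=1)
numeric = numeric_gradient(x, w1, w2, target=1.0, i=0, j=1)
print(analytic, numeric)
```

The two printed values should agree to many decimal places, which is a common way to sanity-check a hand-derived gradient.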
Up to this point we summarized the core idea behind neural networks,
which is quite old. Next we will elaborate on some more recent developments
in this field.
Figure 7: Error backpropagation.
The idea of convolutional neural networks was proposed in 1998 in [20], but
it only really gained popularity in 2012, after Alex Krizhevsky, Ilya Sutskever
and Geoffrey Hinton won the ILSVRC (ImageNet Large-Scale Visual Recognition
Challenge) using a deep convolutional network [19]. From this point on,
architectures using convolution became increasingly popular in image
processing, and later in sound processing.
A convolutional filter is a sliding window function applied to a matrix (or
vector, or tensor). As seen in Figure 8, the filter (also known as the kernel)
slides over the matrix, and at each step computes the sum of the element-wise
product of the kernel and the image pixels it overlaps. In image processing,
various 2D filters are widely used, such as Gaussian blur, edge enhancement
and so on. For these applications the values of the kernel are well defined.
The idea behind convolutional neural networks is to let the neural network
learn the kernel values instead of predefining them.
By doing this, we significantly decrease the number of trainable parameters
compared to a fully connected neural layer. Instead of training as many weights
as there are pixels, we only need to train as many weights as there are fields
in the kernel. At the same time we get a few additional hyper-parameters, such
as the size and stride (step size) of the filters.
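The sliding-window operation can be sketched in a few lines. The function below is an illustrative, unoptimized version without padding; the image, the kernel values and the name `convolve2d` are assumptions made for the example. As in most neural network libraries, the kernel is not flipped, so strictly speaking this is a cross-correlation.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and sum the element-wise products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    output = []
    for r in range(0, len(image) - k_rows + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k_cols + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        output.append(row)
    return output

image = [[1, 2, 3, 0],
         [4, 5, 6, 0],
         [7, 8, 9, 0],
         [0, 0, 0, 0]]
kernel = [[1, 0],
          [0, -1]]  # hypothetical 2x2 kernel with 4 weights
feature_map = convolve2d(image, kernel)
print(feature_map)  # [[-4, -4, 3], [-4, -4, 6], [7, 8, 9]]
```

Here the 3×3 output is produced with only 4 kernel weights, whereas a fully connected layer mapping the 16 pixels to 9 outputs would need 16 × 9 = 144 weights.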
Figure 9: Convolutional neural network.
Recurrent neurons take advantage of the fact that in certain datasets, for
example time series, consecutive instances may be correlated.
To utilize this additional information, recurrent neurons use their output
from the previous step as an additional input in the current step (Figure 11).
This way they have an implicit memory of all the previous inputs (similar to an
IIR¹ filter).
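The feedback loop can be illustrated with a single recurrent neuron. This is a minimal sketch assuming a tanh activation and hand-picked weights; the function name and parameter values are hypothetical. Feeding in a single impulse followed by zeros shows that the first input keeps influencing later outputs, with a decaying effect.

```python
import math

def run_recurrent_neuron(inputs, w_in=1.0, w_rec=0.5, bias=0.0):
    """Each output depends on the current input and on the previous output."""
    output = 0.0
    outputs = []
    for x in inputs:
        output = math.tanh(w_in * x + w_rec * output + bias)
        outputs.append(output)
    return outputs

# An impulse at the first step, followed by zero inputs
impulse_response = run_recurrent_neuron([1.0, 0.0, 0.0, 0.0, 0.0])
print([round(v, 4) for v in impulse_response])
```

The outputs never reach exactly zero, but they shrink at every step.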
Deep neural networks suffer from the vanishing gradient problem: the gradient
(calculated via backpropagation and used to update the weights) decreases with
distance from the output layer [15]. Recurrent layers can be thought of as
having as many layers as the length of the input sequence. Because of the
vanishing gradient, the starting elements of the sequence (the beginning of a
sentence) will not have as much effect on the outcome as later elements. In
other words, a simple recurrent neuron does not really have long-term memory.
This was solved by the introduction of the Long Short-Term Memory unit,
LSTM for short (Figure 12) [16, 30]. In LSTM units the cell’s inner state and
its previous output are handled separately. The inner (cell) state $C_t$ can
only be updated through gates, which preserves long-term memories. We use an
example to illustrate the importance of these gates. Let’s take the sentence
“He likes his ice-cream, but she likes hers better”. The LSTM unit receives
the words of this sentence one at a time. After the word “He”, the cell state is
supposed to represent the subject’s gender somehow, so that the correct pronouns
can be predicted. But once we get to the second part of the sentence, a
new subject appears with a different gender.
To use the correct pronouns, the cell should first forget the “saved” gender.
This is done through the forget gate, $f_t$. Here $f_t$ is a vector with values
between 0 and 1, $\sigma$ is the sigmoid function (Figure 5), $W_f$ is the
weight matrix of the forget gate and $h$ is the output vector. By taking the
element-wise product of $f_t$ and $C_{t-1}$, the previous cell state, we either
keep or forget (to some extent) certain elements of the previous state, in this
case the gender of the subject. To update the cell state, we compute $C'_t$
(Equation 5) and add parts of it to the cell state (in our case the female
gender) depending on the value of $i_t$ (Equation 4), creating $C_t$ (Equation 6).
¹ Infinite Impulse Response
Finally, we create an output based on the cell state $C_t$ (Equation 8), the
previous output $h_{t-1}$, and the current input $x_t$ (Equation 7).
$$h_t = o_t \ast \tanh(C_t) \tag{8}$$
Some of the best results obtained with recurrent neural networks were
achieved using LSTM units. The LSTM unit described above is one of the
simplest variants. There are many others, e.g. using “peephole connections”,
or coupled forget and input gates.
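The gate mechanism described above can be sketched as a single time step. The code assumes the widely used basic LSTM formulation, in which every gate is a sigmoid of an affine function of the concatenated previous output and current input; this is an assumption about the form of the equations, and the parameter names (`W_f`, `b_f`, and so on) and toy values are hypothetical.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def affine(weights, bias, vector):
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    z = h_prev + x_t  # list concatenation: [h_{t-1}, x_t]
    f_t = [sigmoid(s) for s in affine(params["W_f"], params["b_f"], z)]  # forget gate
    i_t = [sigmoid(s) for s in affine(params["W_i"], params["b_i"], z)]  # input gate
    c_cand = [math.tanh(s) for s in affine(params["W_c"], params["b_c"], z)]
    # Element-wise: keep part of the old state, add part of the candidate
    c_t = [f * c + i * cc for f, c, i, cc in zip(f_t, c_prev, i_t, c_cand)]
    o_t = [sigmoid(s) for s in affine(params["W_o"], params["b_o"], z)]  # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def make_params(forget_bias, input_bias):
    zeros = [[0.0, 0.0]]  # one cell; the gate input is [h_{t-1}, x_t]
    return {"W_f": zeros, "b_f": [forget_bias],
            "W_i": zeros, "b_i": [input_bias],
            "W_c": zeros, "b_c": [0.0],
            "W_o": zeros, "b_o": [0.0]}

# Forget gate close to 1: the stored state survives the step
_, c_keep = lstm_step([1.0], [0.0], [0.8], make_params(20.0, -20.0))
# Forget gate close to 0: the stored state is erased
_, c_forget = lstm_step([1.0], [0.0], [0.8], make_params(-20.0, -20.0))
print(c_keep, c_forget)
```

In the ice-cream example, the second case corresponds to dropping the stored gender when the new subject appears.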
2.3 Language Modelling & Embedding
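$$P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-n}, \dots, w_{i-1}) \tag{9}$$
$$P(w_i \mid w_{i-n}, \dots, w_{i-1}) = \frac{\mathrm{count}(w_{i-n}, \dots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-n}, \dots, w_{i-1})} \tag{10}$$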
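The count-based estimate in Equation 10 can be sketched as follows. The toy corpus and the function names are illustrative assumptions; a practical model would also need smoothing for unseen word sequences, which is omitted here.

```python
from collections import Counter

def ngram_counts(tokens, length):
    """Count every run of `length` consecutive tokens."""
    return Counter(tuple(tokens[i:i + length])
                   for i in range(len(tokens) - length + 1))

def conditional_probability(tokens, context, word):
    """Estimate P(word | context) as count(context, word) / count(context)."""
    context = tuple(context)
    joint = ngram_counts(tokens, len(context) + 1)[context + (word,)]
    context_count = ngram_counts(tokens, len(context))[context]
    return joint / context_count if context_count else 0.0

tokens = "the film was good and the film was long".split()
p_good = conditional_probability(tokens, ["film", "was"], "good")
print(p_good)  # 0.5
```

In this corpus "film was" occurs twice and is followed by "good" once, so the estimate is 0.5.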
Continuous language models are a big step from N-grams, because they repre-
sent words differently. Most pre-neural network natural language models treat
words as atomic units: there is no notion of similarity between words, as these
are represented as indices in a vocabulary.
Continuous models use word embeddings, a technique to transform the
word indices to dense vector representations by using semantic information
implicitly. Words are first represented as vectors with as many dimensions as
the size of the vocabulary (the number of unique words in the corpus). These
vectors have zero values in every dimension except the one corresponding to
the given word. This is called one-hot representation.
The different algorithms convert these one-hot vectors to vectors with fewer
dimensions and continuous values. This may be done by using word-to-word
or word-to-document relations. One of the oldest methods, Latent Semantic
Analysis (LSA), uses the latter. Starting with a collection of documents,
it builds a matrix with columns representing documents and rows representing
different words. Each element of the matrix holds the number of occurrences of
a word in a document, or the corresponding TF-IDF² score. One can define
similarities between row vectors (words) and use column vectors to represent
documents. These can be used in solving information retrieval problems
[28]. By using singular value decomposition, the vectors may be condensed
while retaining most of the information. This method (like other methods that
use documents as contextual information) captures semantic relatedness (e.g.
“boat” – “water”), while we would often rather capture semantic similarity
(e.g. “boat” – “ship”).
Capturing semantic similarity is done by using the neighbours of a word as
context, since similar words probably occur in similar environments. So we take
every word in the corpus, and log the n words before it and after it, creating
our training dataset, where the label is the given word (Figure 13).
² Term frequency–inverse document frequency (TF-IDF) is a numerical statistic
that is intended to reflect how important a word is to a document in a
collection or corpus.
Figure 13: Creating training instances for Neural Language Model [23].
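Building such training instances can be sketched as below. The sketch assumes that each instance is a (neighbour, word) pair, with the neighbour as the input feature and the word itself as the label; the function name and the example sentence are illustrative.

```python
def make_training_pairs(tokens, window):
    """Pair every word with each neighbour at most `window` positions away."""
    pairs = []
    for t, word in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((tokens[t + j], word))  # (feature, label)
    return pairs

pairs = make_training_pairs(["i", "liked", "the", "film"], window=1)
print(pairs)
```

Words at the edges of the text simply get fewer neighbours, so no padding is needed in this version.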
Skip-gram
Using this dataset we can train a neural network with an input layer the size of
the vocabulary, a hidden layer whose neuron count is the preferred embedded
vector length, and an output layer that is again as big as the number of unique
words in the corpus. Using a softmax activation we create a classifier, which
learns to return a word if its neighbour is the input feature (Figure 14). We
maximize the following average log probability:
$$L_{\text{skip-gram}} = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t) \tag{11}$$
Notice that we have created a bottleneck with the hidden layer, thus
compressing the contextual information into a relatively small vector. This
vector, the output values of the hidden layer (with no activation function), is
the representation of the current label word [4].
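Because the hidden layer has no activation function and its input is a one-hot vector, computing the hidden output amounts to selecting one row of the input weight matrix. The sketch below illustrates this with a hypothetical three-word vocabulary and a 2-dimensional embedding; the weight values are made up.

```python
def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def hidden_layer(one_hot_vector, weights):
    """Linear hidden layer: one-hot vector times a (vocabulary x dimension) matrix."""
    dimension = len(weights[0])
    return [sum(one_hot_vector[i] * weights[i][d] for i in range(len(weights)))
            for d in range(dimension)]

vocabulary = ["film", "good", "bad"]
weights = [[0.1, 0.2],   # row for "film"
           [0.3, 0.4],   # row for "good"
           [0.5, 0.6]]   # row for "bad"
embedding = hidden_layer(one_hot(vocabulary.index("good"), len(vocabulary)), weights)
print(embedding)  # [0.3, 0.4]
```

This is why trained embeddings can be stored and used as a simple lookup table.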
In Figure 14 we can see the neuron counts used when creating word2vec
[23, 25], an embedding tool by Google. With an architecture of this size,
training the network means fitting 300 × 10000 × 2 = 6 million weights, which is compu-
Figure 14: Embedding with Neural Language Model.
The CBOW algorithm [25] is similar to skip-gram, but uses the sum of the
neighbouring words’ one-hot vectors as the input of the neural model. It
maximizes the function in Equation 12.
$$L = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_t^{*}) \tag{12}$$
where
$$w_t^{*} = \sum_{-c \le j \le c,\, j \ne 0} w_{t+j} \tag{13}$$
Word2vec uses both models. The CBOW architecture predicts the current
word based on the context, and skip-gram predicts surrounding words given
the current word (Figure 15).
Figure 15: CBOW and skip-gram [25].
3 Related Work
income. Their predictions were correct for 36.9% of the movies in the test
dataset, and 75.2% of the films were classified less than two categories away
from the correct category [35]. Joshi et al. used a linear regression model with
movie metadata and sentiment features, where the latter were extracted from
pre-release critiques using n-gram models. Their best combination of features
achieved an r² score⁴ of 0.671 [18]. Predictions based on classic quality
factors are not reliable enough to use in practical applications, but with the
use of user-generated data this threshold might be crossed. With the birth of
micro-blogging came an increased number of electronically documented human
interactions, and a way to get direct insight into the thoughts of many
people. Ishii et al. model human interactions within society with a stochastic
process [17]. Using only the marketing budget over time as input, their
model generates a dynamic popularity variable, which they validated against
the number of blog posts about the particular movies in the Japanese
blogosphere [27]. Box office predictions have also been made using user activity
on the Wikipedia pages of films [24]. Mestyán et al. used measurements of the
number of views, users, edits and collaborative rigor on 312 movies’ Wikipedia
pages. Using a simple linear regression model, they were able to make
predictions with a coefficient of determination of 0.925 one month before the
premiere. In a novel approach, Asur and Huberman predict movie revenue based on
the number of Twitter mentions regarding 24 movies [3]. They anticipated the
income for the opening weekends of movies with tweets from the night before,
achieving an r² score of 0.97. In other work, Wong et al. advise us to be
sceptical of Twitter’s financial predictive ability [41]. Using a sample of 34
movies, they compare ratings from IMDb and Rotten Tomatoes to the sentiment of
the tweets mentioning those movies, and arrive at the conclusion that
there is a noteworthy bias towards positivity in the emotions Twitter users
display. In a similar approach, Oghina et al. use Twitter and YouTube activity
to predict the ratings on IMDb [29].
⁴ Coefficient of determination: the proportion of the variance in the dependent
variable that is predictable from the independent variable(s).
4 Datasets
In this section we elaborate on the collection of data, and review some char-
acteristics of the used datasets.
To collect relevant tweets for each film and be able to use them for income
prediction, we had to ensure that these movies met certain criteria. We dropped
all films with a missing title, income, or premiere date. The merged movie
dataset contains films from as long ago as 1922 (Figure 16), but Twitter was
only launched on 15 July 2006, and the number of users was relatively
low in the first few years. For this reason we used movies released after 2008,
when Twitter hit 6 million monthly active users.
Figure 16: Distribution of film releases in time.
⁵ The IMDb5000 dataset was replaced by the TMDb5000 dataset on Kaggle’s website
due to legal reasons, and is no longer available online.
Movie titles are often expressions used in everyday language, so by simply
searching for tweets containing the title, we would obtain a high number of
irrelevant posts. One option would have been to search for tweets containing
the official hashtags of the films, but to the best of our knowledge there is no
available data listing these, so this option was not viable. Our solution was to
collect tweets that contained the word “movie” besides the movie title, and to
use only films whose titles consist of at least two words.
However, even with these constraints, a significant portion of the tweets
collected for lesser-known movies were unrelated to said movies. In Figure 17
we see the distribution of income for movies, and also the distribution of the
logarithm of income. Based on these histograms we decided to only use films
with at least $100,000 gross income. It is safe to presume that films generating
less revenue have a smaller footprint on social media. After collecting the
tweets, we retroactively removed films with fewer than 50 tweets. Finally, we
were left with 988 movies.
Figure 18: Distribution of number of tweets in films.
In order to get a first look at how important tweets are with regard to
the success of a film, we calculated the correlation between the number of
tweets regarding a movie and its income (Figure 19). For comparison, we did
the same with the production budget and the income. The low
correlation between the income and the tweet count is partly due to Twitter
having been less commonly used when the earlier movies premiered. For 2009
the correlation was 0.1, but it increases over time, and reaches 0.7 by 2016.
Figure 19: Correlation between income of movies and budget of movies, corre-
lation between income of movies and number of tweets mentioning the films.
To see how the tweet count for a single movie develops over time, we collected
tweets about the movie Rogue One: A Star Wars Story over a longer time
period. The effect of new information regarding the movie can be seen clearly
in Figure 20. New announcements create a spike in tweets, followed by an
exponential decay. Nearing the premiere there is a slower but steady exponential
rise in the frequency of tweets.
To see any global characteristics of tweeting behavior over time, we aligned
the tweets of different films by the time until the premiere, as seen in Figure 21.
An exponential rise in interest can be observed approaching the opening date,
and a few blurred local maxima can also be found. Looking closer at the
distribution, we can observe the daily periodicity of the number of tweets.
The collected tweet texts have a lot of elements besides words, such as handles,
hashtags, or URLs. These can help us predict the income of movies (Figure 35),
Figure 20: Number of tweets in time about Rogue One before the premiere.
Figure 21: Number of tweets for all films (aligned to premiere date).
on whether the tweet contains an image link, YouTube link, or any link at all.
We registered the number of handles and hashtags the tweet contained, then
removed those as well.
Numbers occurring in tweets might carry valuable information, as they
could be a numerical rating of the movie. However, it is hard to distinguish
between a score and other occurrences of numbers, so we did not use this
information, and removed all numbers. We also got rid of all non-alphanumeric
characters, apart from a handful used for punctuation: ,.!?’". The remaining
punctuation characters were padded with whitespace to make them form separate
words. These two steps were performed while keeping a list of ASCII emoticons
(specified in Section 5) intact and in place.
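The two cleaning steps can be sketched with regular expressions. This is a simplified illustration, not the actual implementation: it uses the ASCII apostrophe instead of the typographic one, assumes plain Latin letters, and does not protect emoticons, which would need to be masked before these substitutions. The name `clean_text` is hypothetical.

```python
import re

KEPT_PUNCTUATION = ",.!?'\""  # assumption: ASCII variants of the kept characters

def clean_text(text):
    """Remove numbers and other symbols, then split off kept punctuation."""
    text = re.sub(r"\d+", " ", text)  # drop all numbers
    other = "[^A-Za-z\\s" + re.escape(KEPT_PUNCTUATION) + "]"
    text = re.sub(other, " ", text)   # drop remaining non-alphanumeric symbols
    kept = "([" + re.escape(KEPT_PUNCTUATION) + "])"
    text = re.sub(kept, r" \1 ", text)  # pad punctuation with whitespace
    return text.split()

words = clean_text("Loved it!!! 10/10, can't wait & see")
print(words)
```

Every exclamation mark becomes a separate word here, which is the effect that makes the later collapsing of repeated characters necessary.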
The models we use for emotion classification need an input format with a
fixed number of words. This number was determined based on the distribution
of the word counts of tweets. Tweets longer than this were truncated, shorter
ones padded. Because each punctuation character becomes a separate word, a tweet
with a sequence of exclamation marks can have a very high number of words,
and the necessary trimming (done in order to achieve the specified number of
words) could result in losing important words. We wanted to avoid
this, so we replaced all repeating non-alphanumeric characters with a single
instance of the character. Excessive repetition of characters within words is
also a frequent phenomenon among Twitter users (e.g. “haappyyyyy”, “yayyy”). By
transforming these to an almost correct form, we reduced the number of distinct
words. This was done by replacing characters repeated more than twice
with a single copy (e.g. “haapppyyyy” → “haapy”).
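The normalization rules above translate into short regular expressions. The following is a sketch under assumptions: the function names and the padding token `<pad>` are hypothetical, and the real pipeline may have padded on a different side or with a different token.

```python
import re

def collapse_punctuation(text):
    """Replace any run of one repeated non-alphanumeric character with one copy."""
    return re.sub(r"([^A-Za-z0-9\s])\1+", r"\1", text)

def collapse_repeats(word):
    """Characters repeated more than twice in a row are reduced to one copy."""
    return re.sub(r"(.)\1{2,}", r"\1", word)

def fix_length(words, length, pad_token="<pad>"):
    """Truncate or pad a word list to exactly `length` items."""
    return words[:length] + [pad_token] * max(0, length - len(words))

print(collapse_punctuation("wow!!!! really??"))  # wow! really?
print(collapse_repeats("haapppyyyy"))            # haapy
print(fix_length(["great", "film"], 4))
```

Note that a doubled letter such as the "aa" in the example is left untouched, which is why the result is only almost correct.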
If a certain emotion is relatively often associated with one or a few movies,
our emotion classification model may overfit by learning that the presence of
these titles implies that emotion. By removing the titles altogether, we would
lose the relative position information between words or expressions and the
title. Our solution was to replace all occurrences of the titles in the text with
the word “film”.
At this point in the text processing we extracted two additional features
from the text: the number of words and the ratio of capital letters. We
then replaced all capital letters with their lowercase counterparts. Twitter is
notorious for bad spelling, but hopefully, thanks to the continuous
space word representations, misspelled versions of words will be represented
by vectors similar to those of the correct forms, provided those misspellings
are common enough. To filter out the ones that are not, we removed all words
with fewer than 40 occurrences in the 10 million tweets.
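These steps can be sketched as three small helpers. Several details are assumptions made for the illustration: titles are matched case-insensitively, the capital ratio is taken relative to the number of letters, and all function names are hypothetical.

```python
import re
from collections import Counter

def replace_title(text, title):
    """Replace every occurrence of the movie title with the word 'film'."""
    return re.sub(re.escape(title), "film", text, flags=re.IGNORECASE)

def capital_ratio(text):
    """Share of uppercase letters among all letters (0.0 if there are none)."""
    letters = [ch for ch in text if ch.isalpha()]
    return sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0

def drop_rare_words(tweets, min_count):
    """Remove words occurring fewer than `min_count` times in the whole corpus."""
    counts = Counter(word for tweet in tweets for word in tweet)
    return [[w for w in tweet if counts[w] >= min_count] for tweet in tweets]

text = replace_title("Rogue One was GREAT", "rogue one")
print(text, capital_ratio(text))
print(drop_rare_words([["good", "film"], ["good", "flim"]], min_count=2))
```

The capital ratio has to be computed before lowercasing, since that information is lost afterwards.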
The vocabulary assembled during cleaning the textual data contains 62,000
different words.
5 Creating Emotion Labeling
tweets regarding each film, and the gross domestic income of those films. The
values can be seen in Table 1 and in Figure 22.
Figure 22: Occurrence ratio of emoticons with positive and negative correla-
tions with gross domestic income of films. Each point represents one film.
The correlations are small, but the fact that smiling emoticons have positive
correlation, while surprised and sad emoticons have negative correlation, is
promising.
The next question is which emoticons to use, and how to group them. As there
is no strict, closed set of ASCII emoticons, we searched for a wide variety
of emoticons, most of which are probably never used, and filtered them by
a minimum necessary number of occurrences. We created our preliminary
set of emoticons by combining the possible building blocks in every possible
way. For horizontal emoticons that are rotated +90 degrees, such as “:)”,
we defined the building blocks as seen in Table 2. We did the same with
horizontal emoticons rotated −90 degrees (e.g. “(:”), and vertical emoticons
(e.g. “O.O”). This set contained 5,260 possible emoticons.

Part      Possible characters
Eyebrow   >,), ,(,<,},’
Eye       :,;,X,x,B,8,=
Nose      -,,,’-,
Mouth     ),)),D,P,b,S,(,},{,],[,@,o,O,0, /,|,L,X,#,&
Table 2: Building blocks of left-to-right horizontal emoticons.
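Combining building blocks in every possible way is a Cartesian product. The sketch below uses small illustrative subsets of the character sets rather than the full lists in Table 2, so the resulting count differs from the 5,260 reported above. Mirroring is shown as one possible way of obtaining the −90 degree variants; that choice is an assumption.

```python
from itertools import product

# Illustrative subsets only; an empty string means the part is omitted.
EYEBROWS = ["", ">"]
EYES = [":", ";", "="]
NOSES = ["", "-"]
MOUTHS = [")", "(", "D", "P"]

MIRROR = {"(": ")", ")": "(", "<": ">", ">": "<",
          "{": "}", "}": "{", "[": "]", "]": "["}

def left_to_right_emoticons():
    """All eyebrow + eye + nose + mouth combinations."""
    return ["".join(parts) for parts in product(EYEBROWS, EYES, NOSES, MOUTHS)]

def mirrored(emoticon):
    """Reverse the emoticon and flip bracket-like characters."""
    return "".join(MIRROR.get(ch, ch) for ch in reversed(emoticon))

candidates = left_to_right_emoticons()
print(len(candidates))  # 2 * 3 * 2 * 4 = 48
print(mirrored(":)"))   # (:
```

Most generated candidates never occur in real text, which is why the frequency-based filtering is needed afterwards.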
Determining if an emoticon is present in a tweet can be difficult, because
emoticons often aren’t separated from words by spaces. This is particularly
problematic if the word is attached to a part of an emoticon that is a letter,
e.g. “:Dgood times”. We did not want to lose these “conjoint” emoticons, but
also did not want to identify part of the text as an emoticon by accident. So
we searched for the emoticons with and without whitespace padding on each
side. If the number of occurrences of the non-padded version was much higher
than that of the padded version, we assumed that this character sequence was
used as a part of the text, rather than as an emoticon. By removing emoticons
with a “padded count” lower than 30, and a non-padded/padded ratio higher
than 5, we were left with 50 emoticons. A further 9 were removed by
hand, because they were not emoticons. The remaining set contained multiple
instances of the same emoticons, with different letters capitalized. These were
merged, finally leaving 31 emoticons.
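The filtering rule can be sketched as follows. Two interpretations are assumed here and may differ from the original implementation: an occurrence counts as padded only if it has whitespace (or the text boundary) on both sides, with all other occurrences counted as non-padded, and an emoticon is dropped if it fails either threshold. The function names are hypothetical.

```python
def count_occurrences(tweets, emoticon):
    """Return (padded, non_padded) occurrence counts of a non-empty emoticon."""
    padded = non_padded = 0
    for tweet in tweets:
        text = " " + tweet + " "  # treat the text boundaries as whitespace
        start = text.find(emoticon)
        while start != -1:
            end = start + len(emoticon)
            if text[start - 1].isspace() and text[end].isspace():
                padded += 1
            else:
                non_padded += 1
            start = text.find(emoticon, end)
    return padded, non_padded

def keep_emoticon(padded, non_padded, min_padded=30, max_ratio=5.0):
    """Keep frequent emoticons that mostly occur separated from words."""
    if padded < min_padded:
        return False
    return non_padded / padded <= max_ratio

tweets = ["great film :D", ":Dgood times", "xD lol"]
print(count_occurrences(tweets, ":D"))  # (1, 1)
```

With thresholds scaled down for such a tiny example, `keep_emoticon(1, 1, min_padded=1)` would keep ":D".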
5.2 Emoticon Embedding
Our original goal was to somehow extract the 6 basic emotions used for emotion
analysis. The more frequent emoticons we found may not cover all 6 emotions,
and we were also not certain how to arrange them into these groups. The 31
emoticons available at this point were far too many to use as distinct
emotions, and their occurrence counts are very unevenly distributed. Some
are very similar, and probably should not be distinguished from each other.
Grouping the emoticons was inevitable, but we wanted to avoid doing this
manually. However, to be able to apply any kind of clustering algorithm, we
need a measure of similarity between the emoticons.
Our solution to the problem was to treat the emoticons as if they were
words, and use continuous space vectors to represent them. We assume that
emoticons are similar to words in the sense that similar ones occur in similar
contexts. So by using Gensim’s [33] word2vec embedding module on the
subset of our tweet text data that contains emoticons, we created embedded
vector representations of our emoticons. A similar method was used in
[8] to create embeddings for Unicode emoticons based on their descriptions. The
word2vec model’s parameters were chosen by selecting the combination that
assigned the most similar embedded vectors to emoticons we considered to
have similar meaning. The similarity of the vectors was monitored by reducing
the number of dimensions to two with t-SNE⁶, and scatter plotting the
two-dimensional projections of the embedded vectors. The context of each word
was the set of the closest 7 words, and the number of embedding dimensions
was chosen to be 5.
⁶ t-SNE projects high-dimensional vectors into a lower-dimensional space while
attempting to preserve relative distances [22].
We used K-means clustering to create 9 clusters based on the 5-dimensional
embedded vectors of the emoticons. We created 9 clusters instead of 6, so
that the few outlier emoticons without similar elements would not hinder the
clustering process. As can be seen in Figure 23, some clusters are easier to
interpret than others. But we must keep in mind that the similarities observed
by the word2vec model can also be influenced by the style of a person’s writing,
which may even be of help when trying to predict the income of movies. The
results of the embedding may be somewhat compromised if, contrary to our
assumption, the emoticons written in a tweet do not summarize the feeling
conveyed by the text, but rather differ from it. In the case of sarcasm, for
example, the emoticon may be the only indicator of the real emotions: “the
film was sooo good :/”. The cluster sizes seen in Figure 23 do not include
occurrences of emoticons that are stuck to words, only the well-separated
ones. Based on the whole count we selected the 5 biggest clusters: 1, 3, 5, 7, 8
(from here on referred to as emotion classes 1, 2, 3, 4, 5), and created labels
for the Twitter data based on these (Table 3).
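For readers unfamiliar with K-means, its two alternating steps can be sketched in plain Python. This is a generic illustration, not the implementation used for the thesis: it uses 2-dimensional toy points and two fixed starting centroids instead of the 5-dimensional emoticon vectors and 9 clusters, and the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, initial_centroids, iterations=10):
    centroids = [list(c) for c in initial_centroids]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        labels = [min(range(len(centroids)),
                      key=lambda k: squared_distance(point, centroids[k]))
                  for point in points]
        # Update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                centroids[k] = [sum(values) / len(members)
                                for values in zip(*members)]
    return labels, centroids

points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
labels, centroids = k_means(points, initial_centroids=[[0.0, 0.0], [5.0, 5.0]])
print(labels, centroids)
```

An isolated point far from all others tends to end up in a cluster of its own, which is in line with the reasoning above for requesting more clusters than the number of emotions sought.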
To see if using multiple emotion classes helps more in predicting income
than using 2 classes, we had to create a grouping of emoticons with two classes
by hand. In Figure 24 these two artificially chosen classes can be seen, along
with the emoticons we couldn’t fit into either class.
The two classes are visibly well separated, even though we created these
classes independently of the two-dimensional representation of the embedded
vectors. To see if these two classes are easily separable by some subset of the
five embedding dimensions, or if we only see this nice separation thanks to
t-SNE, we plotted the same figure using the 5-dimensional embeddings
directly. We examined all dimensions pairwise, and noticed that the classes
were somewhat separated along the (0,2) and (3,4) dimension pairs (Figure 25),
but not as distinctly as in the figure we made with t-SNE. From this we
concluded that the (0,2,3,4) dimensions of the current word2vec embedding may
have something to do with the positivity of words/emoticons, and that t-SNE
seems to be a good method to reduce this multidimensional information to 2
dimensions. We added this 2-class labeling to our Twitter dataset to use for
sentiment analysis. The number of labeled tweets can be seen in Table 3.
Figure 23: The 5-dimensional continuous space vector representations of
emoticons embedded with t-SNE into 2 dimensions, colored by the containing
clusters created with K-means clustering. The size of the dots in the plot
corresponds to the count of unambiguous occurrences of emoticons, and in the
legend to the size of the clusters.
Figure 24: The 5 dimensional continuous space vector representations of emoti-
cons embedded with t-SNE into 2 dimensions, colored by the containing man-
ually selected 2 clusters. The size of the dots corresponds to the count of
unambiguous occurrences of emoticons, and in the legend to the size of the
clusters.
6 Emotion Analysis
We trained a few simpler models on the 5-class labels for later comparison
against the neural network based models, using two different text
representations. The one-hot encoding format represents each tweet with a vector
the length of the vocabulary, with ones at the indices of the contained words.
We also tried using the pretrained word embeddings by concatenating the
embedded vectors of the contained words in sequence for each tweet. The results
are reported in Table 4. The K-neighbours and Support Vector Classifier
models did not finish training within reasonable time when using the sparser,
one-hot encoded input format.
Classification model       Multiclass accuracy
                           one-hot encoded text   word2vec embedded text
Logistic Regression        0.221                  0.224
K-Neighbors                -                      0.325
Support Vector Machine     -                      0.307
Decision Tree              0.356                  0.309
Random Forest              0.431                  0.402
Gaussian Naive Bayes       0.296                  0.292
Table 4: Multiclass accuracy for simpler models.
6.2 Neural Networks
Based on recent trends, neural networks (NNs) are the most popular models
for emotion analysis. One of the problems with NNs is the very
high number of hyper-parameters to tune. One must choose, for instance,
the number and types of layers constructing the network, and the number of
neurons in each of the layers. When using embedding layers, one must also face
another set of questions, detailed in the following section.
Figure 27: Performance of different word embedding methods with the
1 dimensional convolution model.
We started our search for usable neural models by experimenting with models
that different tutorials and blogs reported as the best, modifying them to fit
our input format. We tried many different architectures, mostly using 4
different types of layers: dense, recurrent, 2 dimensional convolution and
1 dimensional convolution. We tried different combinations of these types, and
it seemed that less complex networks worked better with our data and labels.
All neural networks were implemented using Keras [7] with a TensorFlow [2]
backend.
The model referred to as “logreg” is a one layer dense network with sigmoid
activation. The input of this network is the same as the input of the logistic
regression in Section 6.1: the embedded vectors concatenated to form a single
vector. This model differs from the baseline model in that the weights in the
embedding matrix may change during training. In our experiments this resulted
in much better performance.
The other models’ architectures can be seen in Table 5, Table 6, Table 7 and
Table 8. When training on the 2 class labels, the dense output layers had
2 neurons.
Layer type                    Output shape          Number of parameters
Embedding                     (None, 20, 5)         310030
Reshape                       (None, 20, 5, 1)      0
Conv2D (10 (2,5) kernels)     (None, 19, 1, 10)     110
Conv2D (10 (5,1) kernels)     (None, 15, 1, 10)     510
Conv2D (10 (5,1) kernels)     (None, 11, 1, 10)     510
Flatten                       (None, 110)           0
Dense                         (None, 128)           14208
Dense (softmax activation)    (None, 5)             645
Table 7: The layout of the “2dim conv3” network.
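The output shapes and parameter counts in Table 7 can be checked by hand. The
sketch below does this arithmetic in plain Python, assuming convolutions with
stride 1, no padding and one bias per kernel, and dense layers with one bias
per neuron; the helper names are made up for this example. Under the further
assumption that the embedding layer holds nothing but 5 weights per vocabulary
entry, its 310030 parameters would correspond to 62006 entries.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return (n_in + 1) * n_out


def valid_length(length, kernel):
    """Output length of a stride 1 convolution without padding."""
    return length - kernel + 1


# Input after Reshape: 20 words x 5 embedding dimensions x 1 channel.
h1 = valid_length(20, 2)   # (2,5) kernel spans the full embedding width
h2 = valid_length(h1, 5)   # (5,1) kernel
h3 = valid_length(h2, 5)   # (5,1) kernel
flat = h3 * 1 * 10         # Flatten: height x width x channels

conv1 = conv2d_params(2, 5, 1, 10)
conv2 = conv2d_params(5, 1, 10, 10)
conv3 = conv2d_params(5, 1, 10, 10)
dense1 = dense_params(flat, 128)
output = dense_params(128, 5)

embedding_rows = 310030 // 5   # assumes 5 weights per vocabulary entry
```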
In Figure 28 and Figure 29 we can see the accuracy measured on the test
dataset during training. All lines start from the accuracy of a totally random
classifier: 0.2 for 5 classes and 0.5 for 2 classes. Interestingly, the models
trained on 2 class labels overfit much faster, while in the case of 5 classes
it seems that the 1 dimensional convolution model might still be improving on
the test set even after 300 iterations over the whole data.
In the case of 5 classes, more complex models perform significantly better
than the “dense1” model, which could mean that these models were able to
utilize the information of how words follow each other. The top 3 models,
“LSTM1”, “1dim conv” and “2dim conv3”, do not differ much in performance
after 100 epochs. The “1dim conv” model learns more slowly, but it was the
first model we succeeded in tuning to a performance this high, so we used
it for labeling our tweets.
When using the 2 class labels, the difference between the performance
of complex and simpler neural models was not as big, but still visible. The
“1dim conv” and “LSTM1” models compete for the first place, while the models
using 2 dimensional convolution overfit quite quickly on the training data.
We also used the “1dim conv” model for creating the binary labeling for tweets.
Figure 28: Performance of neural models on the validation dataset with 5 class labels. Each marked data point corresponds to
10 iterations over the whole dataset. The whole figure covers 300 iterations.
Figure 29: Performance of neural models on the validation dataset with 2 class labels. After the first 14 points, each marked
data point corresponds to 1 iteration over the whole dataset. The whole figure covers 20 iterations.
7 Income Prediction
In this section we aggregate the tweet features for films, and compare them
by their ability to improve a prediction based on movie meta-data.
The problem of regression can be simplified to classification by defining
income categories. However, we wanted to avoid the problem of finding the
best way to divide the interval of incomes, so we chose regression instead.
The merged IMDb5000 & TMDb5000 dataset has 30 different features, not
counting the title and the gross domestic income. Most of these features are
user created, such as IMDb score, number of votes, or number of likes on the
director’s Facebook page, which are very useful when predicting income. Our
task, however, was to predict the income based on pre-release data.
Unfortunately, these features were not recorded before the premieres, so we
treat these values as unknown. Even the films’ budget is problematic, since we
saw in Section 1.2 that marketing costs account for a large portion of the
expenses, and marketing spending does not stop at the premiere date.
The features from the dataset that we can safely presume to know before the
premiere are the following:
• Duration (min)
• Production companies
• Genres
• Content rating
The number of categories is too high for most features to simply one-hot
encode. Luckily, the distribution of the number of occurrences seems
exponential for every feature, meaning that a lot of categories are very rare,
and most films belong to at least one of a rather small subset of categories.
For production companies we used the 150 most frequent ones. Certain production
companies often collaborate, so we were able to further reduce the number of
features by capitalizing on these correlations. We did this using PCA
(Principal Component Analysis), taking the first 5 columns of the transformed
data matrix.
We used similar methods for actors and genres. We used PCA only on
actors who acted in at least 6 movies found in our datasets, but for genres
we ran Principal Component Analysis using all categories. Content ratings
have a rather small number of categories, so we simply one-hot encoded the
feature. We did not transform the duration and premiere time values at all.
The created feature set will be tested against the Twitter derived features in
Section 7.3.
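The first step of this encoding, keeping only the most frequent categories and
marking their presence for each film, can be sketched as follows. The function
names and the toy company lists are hypothetical, and the subsequent
dimensionality reduction with PCA is not shown; this is an illustration of the
idea rather than the exact procedure used.

```python
from collections import Counter


def top_k_categories(films, k):
    """Return the k categories that occur in the most films."""
    counts = Counter(cat for cats in films for cat in set(cats))
    return [cat for cat, _ in counts.most_common(k)]


def multi_hot(categories, kept):
    """One indicator per kept category; rare categories are dropped."""
    present = set(categories)
    return [1 if cat in present else 0 for cat in kept]


# Toy data: production companies per film (hypothetical names).
films = [["A", "B"], ["A"], ["A", "C"], ["B"]]

kept = top_k_categories(films, k=2)
encoded = [multi_hot(cats, kept) for cats in films]
```

With 150 kept production companies, each film would be described by a 150
element indicator vector before the reduction to 5 principal components.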
7.2 Feature Generation Based on Twitter Data
After preprocessing the tweets and calculating emotion scores for each, we
have a dataset with features describing each tweet. To acquire a dataset
that characterizes movies, we must summarize the tweet features for each
film. The easiest way to do this is by summing or averaging all features of
tweets regarding a certain movie. But by doing this, we lose all time based
information, and also some knowledge about the distribution of features.
On the other hand, by not aggregating the features into a dense enough
format, e.g. counting the number of tweets for every day separately, we would
create a huge number of features, which would not help any regression model.
We wanted a balanced solution. Our best idea was to create a rather
small number of time-bins, within which we would average features. Fixed
width bins are not suitable for this purpose, because the frequency of tweets
changes radically in time: an exponential rise can be observed nearing the
premiere. As can be seen in Figure 30, we created time-bins that adapt to
this distribution by exponentially decreasing their width nearing the release
date. The number of tweets falling into these bins is visible on the lower
part of Figure 30, where it can be observed that the uneven bins divide the
tweets quite evenly.
Figure 30: The definition of time-bins with uneven width and the number of
tweets in them.
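One possible way to construct such uneven time-bins is sketched below. The
observation window of 60 days, the 4 bins and the halving ratio are
illustrative assumptions, not the values used in this work, and the function
names are hypothetical. Bin widths shrink geometrically towards the release
date, and tweets are assigned to bins by the number of days before the
premiere.

```python
def exponential_bin_edges(horizon_days, n_bins, ratio=0.5):
    """Bin edges in days before release, from the farthest edge down to 0.

    Each bin is `ratio` times as wide as the previous (earlier) one, and the
    widths are scaled so that they add up to horizon_days.
    """
    weights = [ratio ** k for k in range(n_bins)]
    scale = horizon_days / sum(weights)
    edges = [float(horizon_days)]
    for weight in weights:
        edges.append(edges[-1] - weight * scale)
    edges[-1] = 0.0  # guard against rounding error at the release date
    return edges


def bin_index(days_before_release, edges):
    """Index of the bin containing a tweet, or None if it is outside the window."""
    for i in range(len(edges) - 1):
        if edges[i + 1] <= days_before_release < edges[i]:
            return i
    return None


def average_in_bins(tweets, edges):
    """Average a feature per bin; tweets are (days_before_release, value) pairs."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for days, value in tweets:
        i = bin_index(days, edges)
        if i is not None:
            sums[i] += value
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


edges = exponential_bin_edges(horizon_days=60, n_bins=4)
averages = average_in_bins([(30, 1.0), (29, 3.0), (1, 0.5)], edges)
```

Empty bins are reported as None here; how missing bins should be handled in a
real feature matrix is a separate design decision.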
We have two basic feature sets regarding tweets: the emotion scores and
the features extracted from the text during preprocessing. We will refer to
the latter as statistical Twitter features. Within these statistical features we
defined a few subgroups, and used different techniques to aggregate them.
For the features regarding the presence of links in the tweet (video, image,
all URL), we calculated their ratio among all tweets and within time-bins.
The text lengths and capitalized letter ratios were averaged for each film. For
the number of hashtags and handles we created categories to separate very
different types of tweets. The distributions of the two are quite similar, hence
we used the same boundaries to sort them into classes (Table 10).
Count category    Number of occurrences in tweet
0                 0
1                 1-3
2                 4 or more
Table 10: Thresholds for hashtag and handle categories.
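The thresholds of Table 10 and the resulting category ratios among a film’s
tweets can be expressed as a short sketch. The function names are hypothetical
and the input is a toy list of per-tweet hashtag counts.

```python
def count_category(n):
    """Map a hashtag or handle count to the categories of Table 10."""
    if n == 0:
        return 0
    if n <= 3:
        return 1
    return 2


def category_ratios(counts):
    """Ratio of each count category among the given tweets."""
    ratios = [0.0, 0.0, 0.0]
    if not counts:
        return ratios
    for n in counts:
        ratios[count_category(n)] += 1
    return [r / len(counts) for r in ratios]


# Hashtag counts of four tweets about one film (toy data).
ratios = category_ratios([0, 1, 3, 4])
```

The same function applied to the tweets of a single time-bin gives the binned
variant of the feature.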
Tweet features            Aggregation method
                          without bins                    with bins
Statistical features
  #hashtag count          - average                       - average in bins
  @handle count           - ratios of count categories    - ratios of count categories
                            among tweets                    among tweets in bins
  has URL                 - ratio among tweets            - ratio among tweets in bins
  has image link
  has Youtube link
  text length             - average                       - average in bins
  CAPS ratio
  existing                number of tweets                number of tweets in bins
Emotion features
  emotion class 1 score   - average                       - average
  emotion class 2 score   - median                        - median
  emotion class 3 score                                   - average of thresholded
  emotion class 4 score                                     scores
  emotion class 5 score
  positivity score
  negativity score
Table 11: Summarizing the aggregation of features.
bins as well. These features created for each film are, as far as we can tell,
independent of the number of tweets belonging to those films. To visualize
any significant characteristics of these created emotion features in different
income ranges, we averaged the average time-binned scores for lower and higher
income movies. The scores are shown for the 5 class emotions in Figure 31,
and for the positive label in Figure 32.
Figure 31: The 5 class time-binned emotion scores averaged for all films with
lower income than $5 million and higher income than $80 million.
Figure 32: The positive time-binned emotion scores averaged for all films with
lower income than $5 million and higher income than $80 million.
7.3 Income Prediction with Different Feature Sets
The evaluation of features derived from tweets was done by comparing the
performance of a regression model trained on feature sets containing only
movie meta-data with that of regression models that incorporate Twitter
features too. We used scikit-learn’s Gradient Boosted Tree as our regression
model, so we could monitor the trained model’s feature importance property.
This gives us insight into the usefulness of each feature regarding the correct
prediction of income. We measured the performance of the models using mean
squared error as our metric.
Figure 33: Predictions of models trained on the normal income values, and on
log of income values using all movie meta-data features. On the X axis we see
the test film dataset sorted by income.
Our regression model penalizes the same relative error more at higher label
values, so the model will try to fit high income films more closely and
somewhat neglect lower income movies. Because of the shape of the income
distribution (relatively few, very high grossing movies, see Figure 17), this
is problematic. We compared training on the logarithm of the label values with
training on the original labels. The results, seen in Figure 33, show that with
the log labels we fit the lower grossing movies much better, while prediction
quality on high income movies remains similar. Hence we will use the log of
income values to train our model.
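A small worked example illustrates the argument. Assume two hypothetical
films with incomes of $1 million and $100 million, both overestimated by 20%.
With squared error on the raw labels, the error of the larger film dominates by
four orders of magnitude, while on log labels the two errors are equal. The
numbers and function names are invented for this sketch.

```python
import math


def squared_error(actual, predicted):
    return (actual - predicted) ** 2


def log_squared_error(actual, predicted):
    return (math.log(actual) - math.log(predicted)) ** 2


low_income, high_income = 1e6, 1e8   # hypothetical incomes in dollars
overestimate = 1.2                   # both predictions are 20% too high

raw_low = squared_error(low_income, low_income * overestimate)
raw_high = squared_error(high_income, high_income * overestimate)

log_low = log_squared_error(low_income, low_income * overestimate)
log_high = log_squared_error(high_income, high_income * overestimate)

raw_ratio = raw_high / raw_low       # how much more the high income film weighs
```

On the log scale the squared error depends only on the ratio of prediction to
actual value, which is why low and high grossing films are weighted equally.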
The parameters of the model were tuned while training solely on movie
meta-data features. After experimenting with training fewer trees with bigger
depth and a larger number of shallower trees, we found the best settings to be
50 estimators with a maximum depth of 3. We always used 20% of the films
as a test dataset.
Measuring the performance for different feature set combinations was done
using 500-fold cross validation. We used this many folds because the MSE
(mean squared error) value fluctuated immensely when using different
train-test splits. To visualize this, we plotted the distributions of the test
errors for a few feature sets in Figure 34.
Figure 34: Distributions of Mean Squared Error values for log of income from
500 fold cross validations on different feature sets.
In Figure 37 we can see that we succeeded in enhancing our baseline model
by adding emotion based features. The 5 emotion classes helped more than
the binary emotions, but the difference is not as great as we had hoped. It
is also somewhat disappointing that the statistical Twitter features boost the
performance far more than any emotion feature. We could not improve the
performance further by combining the two. One may argue, though, that some of
these features also carry emotional charge, e.g. the capital letter ratio. We
trimmed the number of features in the best emotion based feature sets (for both
types of classes) and also in the feature set involving binned statistics. By
keeping only the 30 features deemed most important (25 in the case of the
2 class emotion features), we achieved considerable improvement in all cases.
The best feature set we managed to assemble was created by selecting the most
important 30 features of the set containing the pre-release movie features and
the binned statistical features. The feature importances can be seen in
Figure 35.
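Trimming a feature set by importance amounts to ranking the features and
keeping the top N columns. The sketch below shows this on toy data; the
feature names and importance values are invented for the example, and in
practice the importances would come from the trained regression model.

```python
def select_top_features(importances, n):
    """Names of the n features with the highest importance values."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def keep_columns(rows, feature_names, kept):
    """Restrict each row of a feature matrix to the kept features."""
    indexes = [feature_names.index(name) for name in kept]
    return [[row[i] for i in indexes] for row in rows]


# Hypothetical importances of four features.
importances = {
    "duration": 0.40,
    "tweets_bin_1": 0.35,
    "caps_ratio": 0.20,
    "genre_pc1": 0.05,
}
feature_names = list(importances)

kept = select_top_features(importances, n=2)
trimmed = keep_columns([[1, 2, 3, 4], [5, 6, 7, 8]], feature_names, kept)
```

The model is then retrained on the trimmed matrix, so the reported improvement
refers to a model that never sees the discarded features.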
Predictions were made using the movie feature set, the best emotional
feature set and the overall best feature set. These can be seen in Figure 36.
Figure 35: The feature importance values of the best performing feature set.
In order to gain clear insight into the relative importance of different
emotions, we trained the same gradient boosted tree model we used before
solely on the binned and thresholded 5 class emotion features. The feature
importance values were recorded and averaged over 500 different training
sessions. The results can be seen in Figure 38. To assure ourselves that the
importance of binned emotion features is not only an effect of the number of
tweets in those bins, we did the same using only the tweet bin count features.
These feature importances can also be seen in Figure 38, as the last row of the
matrix. The visualized values seem to imply that the number of tweets in bins
and the ratio of emotions in bins carry independent information. However, the
earlier examined predictive power of different feature set combinations shows
us that our model could not utilize this.
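Averaging the recorded importances over training sessions is an element-wise
mean over the per-session importance vectors. A minimal sketch with two
invented sessions instead of 500 (the function name is hypothetical):

```python
def average_importances(sessions):
    """Element-wise mean of feature importance vectors from several sessions."""
    n = len(sessions)
    return [sum(values) / n for values in zip(*sessions)]


# Importances of two features recorded in two training sessions (toy values).
mean_importances = average_importances([[0.5, 0.5], [0.25, 0.75]])
```

Averaging over many sessions reduces the effect of the random train-test split
on any single importance value.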
Figure 36: Predictions made using 3 different feature sets and the real income
figures.
Figure 37: Performance of regression model trained on different feature sets. MSE scores calculated for log of incomes.
Figure 38: The feature importances of the 5 class emotions and number of
tweets in different time-bins.
8 Conclusions & Future Work
References
[1] MPAA Theatrical Market-Statistics 201. April 2016.
[2] Martı́n Abadi, Ashish Agarwal, Paul Barham, et al. TensorFlow: Large-
scale machine learning on heterogeneous systems, 2015. Software available
from tensorflow.org.
[3] Sitaram Asur and Bernardo A. Huberman. Predicting the future with
social media. CoRR, abs/1003.5699, 2010.
[4] Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Jauvin.
A neural probabilistic language model. Journal of Machine Learning
Research, 3:1137–1155, 2003.
[5] Tim Brody, Stevan Harnad, and Leslie Carr. Earlier web usage statistics
as predictors of later citation impact. Journal of the American Society
for Information Science and Technology, 57(8):1060–1072, 2006.
[6] Carlos Castillo, Mohammed El-Haddad, Jürgen Pfeffer, and Matt Stempeck.
Characterizing the life cycle of online news stories using social
media reactions. In Proceedings of the 17th ACM Conference on Computer
Supported Cooperative Work & Social Computing, CSCW ’14,
pages 211–223, New York, NY, USA, 2014. ACM.
[7] François Chollet et al. Keras. https://github.com/fchollet/keras,
2015.
[8] Ben Eisner, Tim Rocktäschel, Isabelle Augenstein, Matko Bosnjak, and
Sebastian Riedel. emoji2vec: Learning emoji representations from their
description. CoRR, abs/1609.08359, 2016.
[9] Paul Ekman. Basic emotions. In T. Dalgleish and M. Power (eds.),
Handbook of Cognition and Emotion, 1999.
[10] Gunther Eysenbach. Can tweets predict citations? Metrics of social
impact based on Twitter and correlation with traditional metrics of scientific
impact. J Med Internet Res, 13(4):e123, Dec 2011.
[11] Stephen Follows. How films make money, 2016.
[12] Daniel Gayo-Avello. “I wanted to predict elections with Twitter and all
I got was this lousy paper” - A balanced survey on election prediction
using Twitter data. CoRR, abs/1204.6441, 2012.
[13] Daniel Gayo-Avello, Panagiotis Metaxas, and Eni Mustafaraj. Limits of
electoral predictions using twitter, 2011.
[14] David Guthrie, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks.
A closer look at skip-gram modelling, 2006.
[15] Sepp Hochreiter. The vanishing gradient problem during learning
recurrent neural nets and problem solutions. International Journal of
Uncertainty, Fuzziness and Knowledge-Based Systems, 6(02):107–116, 1998.
[16] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory.
Neural Comput., 9(8):1735–1780, November 1997.
[18] Mahesh Joshi, Dipanjan Das, Kevin Gimpel, and Noah A Smith. Movie
reviews and revenues: An experiment in text regression. In Human
Language Technologies: The 2010 Annual Conference of the North American
Chapter of the Association for Computational Linguistics, pages 293–296.
Association for Computational Linguistics, 2010.
[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet
classification with deep convolutional neural networks. In F. Pereira,
C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural
Information Processing Systems 25, pages 1097–1105. Curran Associates,
Inc., 2012.
[20] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-
based learning applied to document recognition. In Proceedings of the
IEEE, pages 2278–2324, 1998.
[22] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using
t-sne. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
[24] Márton Mestyán, Taha Yasseri, and János Kertész. Early prediction of
movie box office success based on wikipedia activity big data. PLOS ONE,
8(8):1–8, 08 2013.
[25] Tomas Mikolov, Greg Corrado, Kai Chen, and Jeffrey Dean. Efficient
Estimation of Word Representations in Vector Space. September 2013.
[26] Tomas Mikolov, Greg Corrado, Kai Chen, Jeffrey Dean, and Ilya
Sutskever. Distributed Representations of Words and Phrases and their
Compositionality. October 2013.
[27] Gilad Mishne and Natalie Glance. Predicting movie sales from blogger
sentiment. In Proceedings of AAAI-CAAW-06, the Spring Symposia on
Computational Approaches to Analyzing Weblogs, Stanford, US, January
2006.
[28] Bhaskar Mitra and Nick Craswell. Neural text embeddings for information
retrieval. In Proceedings of the Tenth ACM International Conference on
Web Search and Data Mining, WSDM ’17, pages 813–814, New York,
NY, USA, 2017. ACM.
[29] Andrei Oghina, Mathias Breuss, Manos Tsagkias, and Maarten de Rijke.
Predicting imdb movie ratings using social media. In Proceedings
of the 34th European Conference on Advances in Information Retrieval,
ECIR’12, pages 503–507, Berlin, Heidelberg, 2012. Springer-Verlag.
[31] Raj Kumar Pan and Sitabhra Sinha. The statistical laws of popularity:
universal properties of the box-office dynamics of motion pictures. New
Journal of Physics, 12(11):115004, 2010.
[32] Tobias Preis, Daniel Reith, and H. Eugene Stanley. Complex dynamics
of our economic life on different scales: insights from search engine
query data. Philosophical Transactions of the Royal Society of London A:
Mathematical, Physical and Engineering Sciences, 368(1933):5707–5719,
2010.
[33] Radim Řehůřek and Petr Sojka. Software Framework for Topic Modelling
with Large Corpora. In Proceedings of the LREC 2010 Workshop on New
Challenges for NLP Frameworks, pages 45–50, Valletta, Malta, May 2010.
ELRA. http://is.muni.cz/publication/884893/en.
[34] Irene Roozen. The impact of emotional appeal and the media context on
the effectiveness of commercials for not-for-profit and for-profit brands.
19:198–214, 07 2013.
[35] Ramesh Sharda and Dursun Delen. Predicting box-office success of
motion pictures with neural networks. Expert Systems with Applications,
30(2):243–254, 2006.
[36] Xin Shuai, Alberto Pepe, and Johan Bollen. How the scientific community
reacts to newly submitted preprints: Article downloads, twitter mentions,
and citations. PLOS ONE, 7(11):1–8, 11 2012.
[38] Manos Tsagkias, Wouter Weerkamp, and Maarten de Rijke. Predicting
the volume of comments on online news stories. In Proceedings of the 18th
ACM Conference on Information and Knowledge Management, CIKM
’09, pages 1765–1768, New York, NY, USA, 2009. ACM.
[40] Andranik Tumasjan, Timm Sprenger, Philipp Sandner, and Isabell Welpe.
Predicting elections with twitter: What 140 characters reveal about
political sentiment, 2010.
[41] Felix Ming Fai Wong, Soumya Sen, and Mung Chiang. Why watching
movie tweets won’t tell the whole story? In Proceedings of the 2012 ACM
Workshop on Workshop on Online Social Networks, WOSN ’12, pages
61–66, New York, NY, USA, 2012. ACM.