
MSc THESIS ASSIGNMENT
Máté Fejes (FW4ITK)

for a final-year MSc student in electrical engineering
Application of new machine learning methods in text analysis

Revenue forecasting is an extremely important economic task. For some products and
services, revenue depends heavily on the information spreading in the community, whether
true or false. Typical examples of such products are books and films. Nowadays social
media carries ever greater weight in the information circulating in society.
A significant part of this information can be extracted by analysing the published texts.
The goal of the thesis is to forecast expected revenue based on information appearing in
social media (Twitter). The task is of course not easy, and it is an open question how
accurate a forecast can be given. In some cases it is not trivial to find out what exactly
a given post refers to, it is not simple to decide what emotional attitude it shows, and
it is not obvious what effect negatively charged posts have on revenue. (According to
some, negative publicity is still publicity: "it doesn't matter, as long as they talk
about it".)
Tasks of the student during the thesis work:
• Review the literature on methods for processing information that appears in social
media.
• Develop a method suitable for characterizing the relevance of social media (Twitter)
messages to a given product.
• Examine how the emotional content related to the product can be extracted from the
messages.
• Examine which major characteristics primarily affect revenue, e.g. the frequency of
the messages, their emotional charge, or the temporal course of the processes.
• Evaluate the developed methods and the results obtained.
Department consultant: Dr. Béla Pataki, associate professor

External consultant: András Benczúr, head of laboratory, MTA SZTAKI Informatics
Laboratory
Budapest, 2 March 2017
……………………
Dr. Tamás Dabóczi
head of department
Application of new machine learning
algorithms in text processing
MSc Thesis
Máté Fejes
Budapest University of Technology and Economics
Faculty of Electrical Engineering and Informatics
Department of Measurement and Information Systems
Supervisor: András Benczúr

University consultant: Dr. Béla Pataki
December 2017
Abstract
The prediction of success is an extremely important task from an economic
viewpoint. The income of some products or services is heavily affected by
rumors spreading in the community, regardless of whether they are true or false.
Recently, social media has been gaining more and more importance in the
information flow of our society. A significant portion of this information may
be harvested by analyzing the text originating in these media.
The goal of this thesis is to create an income prediction model using
information from social media, such as Twitter. We generate features for films
based on characteristics of tweets written about these movies. To extract the
emotional aspects of tweets, we use emoticons found in the texts to define
emotion classes, and classify tweets into these classes. Finally, an income
prediction model is trained using conventional features and features derived from
social media.
Kivonat (Hungarian abstract)
Predicting revenue is an extremely important task from an economic point of view.
The revenue of certain products and services depends strongly on the rumors
circulating about them, regardless of their truth.
A significant part of the information exchange in our society takes place through
social media, so analysing the texts that appear there can yield considerable
knowledge. The goal of this thesis is to build a model suitable for predicting the
revenue of films based on Twitter posts. To extract the emotional charge of tweets,
we define emotion classes based on the emoticons found in the text, and then evaluate
all tweets with respect to these classes. We create indicators from the
characteristics of the tweets, and use them to build the model for predicting the
revenue of films.
Student declaration
I, the undersigned Máté Fejes, final-year student, hereby declare that I prepared this
thesis myself, without any unauthorized help, and that I used only the listed
sources (literature, tools, etc.). Every part that I have taken from another source,
either word for word or rephrased with the same meaning, is clearly marked with a
reference to the source. I consent to BME VIK publishing the basic data of this work
(author(s), title, English and Hungarian abstract, year of preparation, name(s) of the
consultant(s)) in a publicly accessible electronic form, and the full text of the work
through the internal network of the university (or for authenticated users).
I declare that the submitted work and its electronic version are identical. For
theses classified with the permission of the dean, the text of the thesis becomes
accessible only after 3 years.
Budapest, 17 December 2017
Máté Fejes
Acknowledgements
I would like to thank my advisors: Róbert Pálovics for guiding me throughout
this work, and Domokos Kelen for giving me much needed advice and vast
amounts of corrections regarding my English. I thank the whole Informatics
Laboratory at SZTAKI for letting me use their hardware for my research. I
would also like to thank my university advisor Béla Pataki for providing me
with sound advice from time to time. I am grateful for the help and support Zsanett
Szuda offered me during my work. I would like to thank my parents and my
grandparents, who supported me throughout this work and the previous years
of studying.
Contents
Abstract
Abstract II
Declaration
Acknowledgements
1 Introduction
1.1 Goals
1.2 Film Industry Statistics
1.3 Twitter as Social Media
2 Theoretical Overview
2.1 Machine Learning Basics
2.2 Artificial Neural Networks
2.2.1 Convolutional Layers
2.2.2 Recurrent Neural Networks
2.3 Language Modelling & Embedding
2.3.1 N-gram Model
2.3.2 Continuous Space Language Models & Embedding
3 Related Work
4 Datasets
4.1 Movie Dataset
4.1.1 Filtering for Relevant Search Results
4.2 Twitter Data
4.2.1 Cleaning the Text Data
5 Creating Emotion Labeling
5.1 Selecting the Set of Emoticons
5.2 Emoticon Embedding
6 Emotion Analysis
6.1 Baseline Models
6.2 Neural Networks
6.2.1 Embedding Layer Settings
6.2.2 Neural Network Architectures
7 Income Prediction
7.1 Creating Features Using Film Features
7.2 Feature Generation Based on Twitter Data
7.3 Income Prediction with Different Feature Sets
8 Conclusions & Future Work
Bibliography
1 Introduction
In this section we define our goals, and describe the financial and social
environment we use as the source of our data.
1.1 Goals
In this thesis we attempt to predict the box office earnings of movies based on
information collected from Twitter before the premiere, focusing on the
sentiment of the textual data. We do so by labelling tweets based on
emotions, a task which includes natural language processing (NLP) and
classification methods. After aggregating emotions for each movie, regression
models are built from the resulting information and other features.
We try to outperform the models published in the related literature by
using recent advances in NLP for sentiment analysis, such as neural word
embeddings.
1.2 Film Industry Statistics
The film industry is a multi-billion dollar industry that continues to grow to
this day. Movies made an astounding 38.3 billion dollars in movie theaters
worldwide in 2015, 29% of which came from the US and Canada (Figure 1)
[1]. Currently the motion picture industry contributes roughly 3% of the GDP
of the US.
Many movies do not break even and actually lose money. It is especially
hard to make big blockbuster movies profitable, which is mostly due to the
high marketing costs necessary for the film to succeed in the US and western
Europe (Figure 2) [11].
Figure 1: Global box office income.
The Hollywood film industry is turning to Chinese and other Asian
audiences, because they don't require as much exposure to marketing to watch
a film as European and American moviegoers do [1].
In conclusion, it is more important than ever to use efficient and
cost-effective marketing techniques in the film industry. One way to measure the
effect of pre-release advertisements is by monitoring social media, such as
Twitter or Facebook.
Figure 2: The distribution of an average movie's cost among different types of
expenses [11].
1.3 Twitter as Social Media
Twitter is one of the four biggest social platforms, based on the number of
monthly active users. With 310 million users, it is the fourth most popular after
Facebook, Tumblr and Instagram.
It can be defined as a microblog website with a social network framing.
Users can write short status updates of at most 140 characters, called tweets, and
subscribe to the posts of other users. A person's Twitter feed consists of the
tweets of the people they follow (are subscribed to).
It is possible to mention other users in your tweets using their handles by
writing @username in the text. One can re-post (retweet) tweets, optionally
adding their own remark to the original text. This appears on the followers' feeds,
also containing the handle of the original poster. So-called hashtags are a
way to link your tweet to certain topics, such as current events, products
or ideas. This can be done by adding the right hashtag to your tweet, e.g.
#SomethingCurrent. Apart from plain text, emoticons, pictures and links
may also be added to a tweet.
Figure 3: Devices owned by moviegoers [1].
Compared to Facebook, the format of user generated content is more
strictly defined, which is favorable for natural language
processing. Twitter's main focus is text entries, making it a medium where
people and organizations voice their opinions or thoughts.
In this environment a number of interesting social phenomena can be
observed, such as the propagation of a topic through the network, conflicts due to
differences of opinion and their resolution, or the clustering of society.
A promising fact with regard to movie income prediction is that a high
percentage of moviegoers have mobile electronic devices, compared to the average
population (Figure 3). Owning such a device makes it much simpler to post
thoughts about a certain movie.
2 Theoretical Overview
In this section we address the theoretical background of the methods used in this
thesis. We summarize the theory of artificial neural networks and language
modelling.
2.1 Machine Learning Basics
Machine learning methods are built on the assumption that given a sufficient
amount of data, we can build models that generalise well.
Machine learning methods try to explore systematic relationships in data,
and find functions approximating reality as well as possible. Data in this
context means a collection of data points, where each point is described by
the same type of categorical or numerical features ($x$). Training supervised
classification and regression models requires data where each data point is
labeled either with a class or a numeric value. This label is also called the target. The
goal of these models is to correctly determine the label based on the features.
This is done by tuning the model's parameters ($h$), on which the model depends
when predicting the target by calculating $\hat{y} = \hat{y}(x, h)$. We find the optimal
value of $h$ by training on part of the data, minimizing some kind of error
function. This error function should reflect how far the model's prediction ($\hat{y}$)
is from the correct label ($y$). The error is then $E = f_{error}(y, \hat{y})$, where $f_{error}$ is
the error function.
If both the model and the error function are differentiable, then one way
to train the model is by using gradient descent. This iterative optimization
method calculates the gradient of the error function, and moves the model's
parameters against it, with step size $\lambda$ in each iteration (1).
$$h_{n+1} = h_n - \lambda \cdot \nabla f(h_n) \tag{1}$$
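The gradient descent iteration can be illustrated with a minimal sketch. The function name `gradient_descent`, the toy error function $f(h) = (h - 3)^2$ and the chosen step size are assumptions made for this illustration only; they are not taken from the thesis.

```python
def gradient_descent(grad, h0, step=0.1, iterations=100):
    """Repeatedly move the parameter h against the gradient of the error function."""
    h = h0
    for _ in range(iterations):
        h = h - step * grad(h)  # one update step with step size `step`
    return h


# Toy error function f(h) = (h - 3)^2, whose gradient is 2 * (h - 3).
# Its minimum is at h = 3, so the iteration should approach that value.
h_opt = gradient_descent(lambda h: 2.0 * (h - 3.0), h0=0.0)
```

With a step size that is too large the iteration can overshoot and diverge, which is why the step size is treated as a tunable hyper-parameter.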

2.2 Artificial Neural Networks
Artificial neural networks mimic the way the human brain processes
information. Our brain undoubtedly has an advantage over traditional
computers in some respects: it is more power efficient and is better at certain
tasks that require higher levels of abstraction. The building block of this
bio-computer is the neuron. It is made up of the cell body (soma) and many
extensions which connect neurons to one another. The extensions handling the
incoming signals are the dendrites, and the one that emits the output is the
axon. The incoming signals are weighted by the strength of the given
connection, and summed in the soma. If a certain threshold is reached, the neuron
will fire, sending an output signal through the axon [21].
Figure 4: Artificial neuron.
An artificial neuron (Figure 4) does something similar. The incoming
signals $x_i$ are multiplied by the weights $w_i$, and summed. The activation function
(Figure 5), which must be a differentiable function (to be able to compute
Equation 2), transforms this sum in some way, e.g. scales the output to the
[0,1] interval. The computed value $y$ will be the input for other neurons. We
can compose "layered" networks (Figure 6), where a layer is composed of
neurons working in parallel, each working with the same inputs (the outputs of the
previous layer).
Figure 5: Activation functions.
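The computation of a single artificial neuron and of a layer of such neurons can be sketched as follows. The sigmoid is used here only as one possible activation function, and the names `neuron` and `dense_layer` are hypothetical helpers introduced for this illustration.

```python
import math


def sigmoid(s):
    """Squash the weighted sum into the (0, 1) interval."""
    return 1.0 / (1.0 + math.exp(-s))


def neuron(inputs, weights, activation=sigmoid):
    """Weighted sum of the inputs followed by the activation function."""
    s = sum(x * w for x, w in zip(inputs, weights))
    return activation(s)


def dense_layer(inputs, weight_rows, activation=sigmoid):
    """A layer: several neurons that all receive the same inputs."""
    return [neuron(inputs, weights, activation) for weights in weight_rows]


# The weighted sum of the first neuron is 1.0 * 0.5 + 2.0 * (-0.25) = 0.0.
y = neuron([1.0, 2.0], [0.5, -0.25])
layer_out = dense_layer([1.0, 2.0], [[0.5, -0.25], [1.0, 1.0]])
```

Stacking several calls to `dense_layer`, each fed with the output of the previous one, gives the layered structure described above.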
The architecture of a neural net is defined, among other things, by the
number of layers (also known as the depth), and the number of neurons in each
layer.
The first layer is often called the input layer, the last is known as the
output layer. The additional layers residing between the two are called
hidden layers. A one-layer network is called a single layer perceptron, a
network with a number of layers between 1 and 3 is called a multilayer
perceptron, and a network with more hidden layers is called a deep learning
network.
Figure 6: Fully connected dense network.
The knowledge of an artificial neural network is defined by the
weights assigned to the edges of the network. These must be adjusted during
training (Section 2.1). The choice of the error function depends on whether
we're trying to teach the network to do binary or multi-class classification,
or regression. A simple choice for regression is $f_{error} = (y - \hat{y})^2$, the squared
error.
The updating of weights is usually done by some kind of modified gradient
descent method. To change weights that aren't in the last layer, we must use
what is called error backpropagation. We take advantage of the fact that we can
calculate the dependency between the error and any given weight by applying
the chain rule (Equation 2). To illustrate this, let's imagine an architecture
with two layers (Figure 7). The red route is the way $w_{ij}^{(1)}$ influences the first
output ($y_1^{(2)}$). The derivative is expanded in Equation 2, where $s$ denotes the
weighted sum of a neuron's inputs, $\varphi$ is the activation function, and the indices have
the following meaning: $w_{from,to}^{(layer\,number)}$.
$$
\frac{\partial E_1}{\partial w_{ij}^{(1)}} =
\overbrace{\frac{dE_1}{dy_1^{(2)}}}^{f'_{error}(y_1^{(2)})} \cdot
\underbrace{
\overbrace{\frac{dy_1^{(2)}}{ds_1^{(2)}}}^{\varphi'(s_1^{(2)})} \cdot
\overbrace{\frac{\partial s_1^{(2)}}{\partial y_j^{(1)}}}^{w_{j1}^{(2)}}
}_{\text{repeats}} \cdot
\overbrace{\frac{dy_j^{(1)}}{ds_j^{(1)}}}^{\varphi'(s_j^{(1)})} \cdot
\overbrace{\frac{\partial s_j^{(1)}}{\partial w_{ij}^{(1)}}}^{x_i}
\tag{2}
$$
Using this derivative, we can update $w_{ij}^{(1)}$ so the error becomes smaller.
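As a sanity check of the chain rule, the following sketch computes the derivative of the error with respect to one first-layer weight as a product of the factors along the route from that weight to the output, and compares it with a numerical finite-difference estimate. The network size, the weight values, the sigmoid activation and the squared error are assumptions chosen for the illustration; `forward` and `grad_w1` are hypothetical helper names.

```python
import math


def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))


def sigmoid_prime(s):
    y = sigmoid(s)
    return y * (1.0 - y)


def forward(x, w1, w2):
    """w1[i][j]: weight from input i to hidden neuron j; w2[j]: hidden j to the output."""
    s1 = [sum(x[i] * w1[i][j] for i in range(len(x))) for j in range(len(w2))]
    y1 = [sigmoid(s) for s in s1]
    s2 = sum(y1[j] * w2[j] for j in range(len(w2)))
    y2 = sigmoid(s2)
    return s1, y1, s2, y2


def error(y_true, y_pred):
    return (y_true - y_pred) ** 2


def grad_w1(x, w1, w2, y_true, i, j):
    """Chain rule: dE/dy2 * dy2/ds2 * ds2/dy1_j * dy1_j/ds1_j * ds1_j/dw1_ij."""
    s1, y1, s2, y2 = forward(x, w1, w2)
    d_error = -2.0 * (y_true - y2)  # derivative of the squared error
    return d_error * sigmoid_prime(s2) * w2[j] * sigmoid_prime(s1[j]) * x[i]


x = [0.5, -1.0]
w1 = [[0.1, 0.4], [-0.3, 0.2]]
w2 = [0.7, -0.5]
y_true = 1.0

analytic = grad_w1(x, w1, w2, y_true, i=0, j=1)

# Central finite difference on the same weight, for comparison.
eps = 1e-6
w1_plus = [row[:] for row in w1]
w1_minus = [row[:] for row in w1]
w1_plus[0][1] += eps
w1_minus[0][1] -= eps
numeric = (error(y_true, forward(x, w1_plus, w2)[3])
           - error(y_true, forward(x, w1_minus, w2)[3])) / (2 * eps)
```

The two values should agree up to numerical precision, which is a common way to test a hand-written backpropagation routine.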
Figure 7: Error backpropagation.
Up to this point we have summarized the core idea behind neural networks,
which is quite old. Next we will elaborate on some more recent developments
in this field.
2.2.1 Convolutional Layers
The idea of convolutional neural networks was proposed in 1998 in [20], but
it only really gained popularity in 2012, after Alex Krizhevsky, Ilya Sutskever
and Geoffrey Hinton won ILSVRC (ImageNet Large-Scale Visual Recognition
Challenge) by using a deep convolutional network [19]. From this point on,
architectures using convolution became increasingly popular in image
processing, and later in sound processing.
A convolutional filter is a sliding window function applied to a matrix (or
vector, or tensor). As seen in Figure 8, the filter (also known as the kernel)
slides over the matrix, and at each step computes the sum of the
element-wise product of the overlapping kernel and image pixels. In image processing,
different 2D filters are widely used, such as Gaussian blur, edge enhancement
and so on. For these applications the values of the kernel are predefined.
Figure 8: Convolutional filter.
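The sliding window computation can be sketched in a few lines. This is a minimal illustration that only visits positions where the kernel fully overlaps the image and, as is common in neural network libraries, does not flip the kernel; the function name `convolve2d` and the example values are assumptions made for this sketch.

```python
def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and return, for each position,
    the sum of the element-wise products of the overlapping values."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(0, len(image) - k_rows + 1, stride):
        row = []
        for c in range(0, len(image[0]) - k_cols + 1, stride):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(k_rows) for j in range(k_cols)))
        out.append(row)
    return out


image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2]]
kernel = [[1, 0],
          [0, -1]]
feature_map = convolve2d(image, kernel)
# feature_map == [[-4, -4, 2], [-4, -4, 4]]
```

In a convolutional layer the four kernel values above would be trainable weights rather than fixed numbers.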
The idea behind convolutional neural networks is to let the neural network
learn the kernel values instead of predefining them.
By doing this, we significantly decrease the number of trainable parameters
compared to a fully connected neural layer. Instead of training as many weights
as there are pixels, we only need to train a number of weights equal to the number of
fields in the kernel. At the same time we get a few additional hyper-parameters,
such as the size and stride (step size) of the filters.
Figure 9: Convolutional neural network.
An often used technique is to repeatedly use pooling layers, e.g. max
pooling (Figure 10) after the convolution (Figure 9), thus compressing the
extracted features (features being the output of the filtering procedure). This
can reduce the dimensions of the features considerably. This part of the
network may be viewed as feature extraction, after which a densely connected
layer can do the classification.
Figure 10: Max pooling.
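Max pooling can be sketched in the same style: each output value is the largest element of a window of the feature map. The window size and stride of 2 and the function name `max_pool2d` are assumptions made for this illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep only the maximum of every size x size window of the feature map."""
    out = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            row.append(max(feature_map[r + i][c + j]
                           for i in range(size) for j in range(size)))
        out.append(row)
    return out


pooled = max_pool2d([[1, 3, 2, 4],
                     [5, 6, 1, 0],
                     [7, 2, 9, 8],
                     [1, 0, 3, 4]])
# pooled == [[6, 4], [7, 9]]
```

With these settings a 4 × 4 feature map is compressed to 2 × 2, a fourfold reduction in the number of values passed on to the next layer.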
Convolution layers don't exclusively work on two dimensional data: filters
can have as many dimensions as we want.
2.2.2 Recurrent Neural Networks
Recurrent neurons take advantage of the fact that in certain datasets, for
example time series, consecutive instances may be correlated.
To utilize this additional information, recurrent neurons use their output
from the previous step as an additional input in the current step (Figure 11).
This way they have an implicit memory of all the previous inputs (similar to an
infinite impulse response (IIR) filter).
Figure 11: Recurrent neuron unrolled.
Deep neural networks suffer from the vanishing gradient problem: the gradient
(calculated via backpropagation and used to update weights) decreases with distance from
the output layer [15]. Recurrent layers can be thought of as having as many
layers as the length of the input sequence. Because of the vanishing gradient,
the starting elements of the sequence (the beginning of a sentence) will not
have as much effect on the outcome as later elements. In other words, a simple
recurrent neuron does not really have long-term memory.
This was solved by the introduction of the Long Short Term Memory unit,
LSTM for short (Figure 12) [16, 30]. In LSTM units the cell's inner state and
previous output are handled separately. The cell state ($C_t$) can only be
updated through gates, which preserves long-term memories. We use an
example to illustrate the importance of these gates. Let's take the sentence
"He likes his ice-cream, but she likes hers better". The LSTM unit receives
the words of this sentence one at a time. After the word "He", the cell state is
supposed to represent the subject's gender somehow, thus being able to predict
the correct pronouns. But once we get to the second part of the sentence, a
new subject appears with a different gender.
Figure 12: LSTM unit.
To use the correct pronouns, the cell should forget the "saved" gender first.
This is done through the forget gate, $f_t$. Since
$$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f), \tag{3}$$
$f_t$ is a vector with values between 0 and 1, with $\sigma$ being the sigmoid function
(Figure 5), $W_f$ the weight matrix of the forget gate and $h$ the output vector.
By taking the element-wise product of $f_t$ and $C_{t-1}$, the previous cell state, we either keep
or forget (to some extent) certain elements of the previous state, the gender of
the subject in this case. To update the cell state, we compute the candidate state $C'_t$ (Equation 5),
and add parts of it to the cell state (in our case the female gender) depending
on the value of the input gate $i_t$ (Equation 4), creating $C_t$ (Equation 6).
$$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i) \tag{4}$$
$$C'_t = \tanh(W_C [h_{t-1}, x_t] + b_C) \tag{5}$$
$$C_t = f_t * C_{t-1} + i_t * C'_t \tag{6}$$
Finally, we create an output $h_t$ based on the cell state $C_t$ (Equation 8), the
previous output $h_{t-1}$, and the current input $x_t$ (Equation 7).
$$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o) \tag{7}$$
$$h_t = o_t * \tanh(C_t) \tag{8}$$
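One step of such an LSTM unit can be sketched with plain Python lists. The function `lstm_step`, the `params` dictionary and its keys are hypothetical names introduced for this illustration, and the all-zero weights in the usage example are chosen only because they make the result easy to verify by hand.

```python
import math


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def tanh(v):
    return [math.tanh(a) for a in v]


def affine(W, v, b):
    """Return W v + b for a matrix W (list of rows) and vectors v and b."""
    return [sum(w * a for w, a in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    z = h_prev + x_t                                       # concatenation [h_{t-1}, x_t]
    f_t = sigmoid(affine(params["Wf"], z, params["bf"]))   # forget gate
    i_t = sigmoid(affine(params["Wi"], z, params["bi"]))   # input gate
    c_cand = tanh(affine(params["Wc"], z, params["bc"]))   # candidate cell state
    # New cell state: keep part of the old state, add part of the candidate.
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_cand)]
    o_t = sigmoid(affine(params["Wo"], z, params["bo"]))   # output gate
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]     # new output
    return h_t, c_t


# Hidden size 2, input size 1. With all-zero weights every gate equals 0.5
# and the candidate state is 0, so the cell state is simply halved.
zeros = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
params = {"Wf": zeros, "Wi": zeros, "Wc": zeros, "Wo": zeros,
          "bf": [0.0, 0.0], "bi": [0.0, 0.0], "bc": [0.0, 0.0], "bo": [0.0, 0.0]}
h_t, c_t = lstm_step(x_t=[1.0], h_prev=[0.0, 0.0], c_prev=[1.0, -2.0], params=params)
```

Processing a sentence amounts to calling `lstm_step` once per word vector, feeding the returned output and cell state into the next call.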
Some of the best results obtained with recurrent neural networks
were achieved by utilizing LSTM units. The LSTM unit described above
is one of the simplest variants. There are many others, e.g. using
"peephole connections", or coupled forget and input gates.
2.3 Language Modelling & Embedding
Representing human language with mathematical tools has interested
many researchers for a long time. A large number of statistical models were
developed before neural network language models, which are a subject of active
ongoing research.
2.3.1 N-gram Model
A statistical language model is a probability distribution over sequences of
words. To an $m$ word long sequence a model assigns the probability $P(w_1, \ldots, w_m)$.
These models have many applications, such as speech recognition, handwriting
recognition, machine translation and information retrieval.
The simplest are the n-gram models [14]. Here we observe a text through an
$n$ word (token) long moving window. We assume that each word's probability
only depends on the preceding $n-1$ words:
$$P(w_1, \ldots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \ldots, w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) \tag{9}$$
The conditional probability can be estimated from the number of occurrences of the
corresponding $n$ and $n-1$ word long sequences:
$$P(w_i \mid w_{i-(n-1)}, \ldots, w_{i-1}) = \frac{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1}, w_i)}{\mathrm{count}(w_{i-(n-1)}, \ldots, w_{i-1})} \tag{10}$$
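The count-based estimate of the conditional probability can be sketched as follows. The function name `ngram_probability` and the toy sentence are assumptions made for the illustration, and no smoothing is applied, so unseen contexts simply get probability 0.

```python
from collections import Counter


def ngram_probability(tokens, context, word):
    """Estimate P(word | context) as count(context + word) / count(context)."""
    n = len(context) + 1
    # Counts of all n word long and (n - 1) word long windows in the text.
    ngrams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 2))
    context = tuple(context)
    if contexts[context] == 0:
        return 0.0
    return ngrams[context + (word,)] / contexts[context]


tokens = "the movie was great and the movie was long".split()
p_trigram = ngram_probability(tokens, ["movie", "was"], "great")  # 1 of 2 occurrences
p_bigram = ngram_probability(tokens, ["the"], "movie")            # 2 of 2 occurrences
```

In the toy text "movie was" is followed by "great" once out of two occurrences, so the trigram estimate is 0.5, while "the" is always followed by "movie".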
2.3.2 Continuous Space Language Models & Embedding
Continuous language models are a big step forward from n-grams, because they
represent words differently. Most pre-neural network natural language models treat
words as atomic units: there is no notion of similarity between words, as these
are represented as indices in a vocabulary.
Continuous models use word embeddings, a technique to transform the
word indices to dense vector representations by using semantic information
implicitly. Words are first represented as vectors with as many dimensions as
the size of the vocabulary (the number of unique words in the corpus). These
vectors have zero values in every dimension except the one corresponding to
the given word. This is called one-hot representation.
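The one-hot representation can be illustrated with a short sketch. The helper names `build_vocabulary` and `one_hot` and the toy corpus are assumptions made for this example.

```python
def build_vocabulary(tokens):
    """Map each unique word to an index, in order of first appearance."""
    vocab = {}
    for token in tokens:
        vocab.setdefault(token, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Vector with as many dimensions as the vocabulary, 1 only at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


vocab = build_vocabulary("the movie was great the end".split())
vec = one_hot("was", vocab)
# vocab has 5 unique words and "was" has index 2, so vec == [0, 0, 1, 0, 0]
```

Any two distinct one-hot vectors are orthogonal, which is why this representation carries no notion of similarity between words.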
The different algorithms convert these one-hot vectors to vectors with fewer
dimensions and continuous values. This may be done by using word-to-word
or word-to-document relations. One of the oldest methods, Latent Semantic
Analysis (LSA), uses the latter. Starting from a collection of documents,
it builds a matrix whose columns represent documents and whose rows represent
words. For each element of the matrix it counts the number of
occurrences of the word in the document, or computes the TF-IDF score (footnote 2). One can
define similarities between row vectors (words) and use column vectors to
represent documents. These can be used in solving information retrieval problems
[28]. By using singular value decomposition the vectors may be condensed
while retaining most of the information. This method (like other methods
using documents as contextual information) captures semantic relatedness (e.g.
“boat” – “water”), while we would often rather capture semantic similarity
(e.g. “boat” – “ship”).
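The word-document matrix used by LSA can be sketched as follows. The text does not state which TF-IDF variant is meant, so this sketch assumes one common form, the raw term count multiplied by $\log(N / df)$, where $N$ is the number of documents and $df$ the number of documents containing the word. The function name `tfidf_matrix` and the toy documents are likewise assumptions.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return {word: row}, where each row has one TF-IDF score per document."""
    counts = [Counter(doc.split()) for doc in documents]
    vocab = sorted(set(word for c in counts for word in c))
    n_docs = len(documents)
    matrix = {}
    for word in vocab:
        df = sum(1 for c in counts if word in c)  # documents containing the word
        idf = math.log(n_docs / df)
        matrix[word] = [c[word] * idf for c in counts]
    return matrix


docs = ["boat on the water", "ship on the water", "the movie"]
m = tfidf_matrix(docs)
# "the" occurs in every document, so its whole row is zero.
# "boat" occurs only in the first document, so only that entry is non-zero.
```

The rows of this matrix are the word vectors and its columns are the document vectors; LSA would then apply singular value decomposition to it, a step omitted from this sketch.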
Capturing semantic similarity is done by using the neighbours of a word as
context. Similar words probably occur in the same environment. So we take
every word in the corpus, and log the n words before it and after it, creating
our training dataset, where the label is the given word (Figure 13).
Footnote 2: Term frequency–inverse document frequency is a numerical statistic that is
intended to reflect how important a word is to a document in a collection or corpus.
Figure 13: Creating training instances for Neural Language Model [23].
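Creating the training instances from a tokenized text can be sketched as follows. The function name `context_pairs` and the window size are assumptions made for the illustration; each pair consists of a word and one of its neighbours.

```python
def context_pairs(tokens, window=2):
    """Pair every word with each neighbour that is at most `window` positions away."""
    pairs = []
    for t, word in enumerate(tokens):
        for j in range(-window, window + 1):
            if j != 0 and 0 <= t + j < len(tokens):
                pairs.append((word, tokens[t + j]))
    return pairs


pairs = context_pairs("we watched the movie".split(), window=1)
# [("we", "watched"), ("watched", "we"), ("watched", "the"),
#  ("the", "watched"), ("the", "movie"), ("movie", "the")]
```

Words at the edges of the text have fewer neighbours, so they contribute fewer pairs.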
Skip-gram
Using this dataset we can train a neural network with an input layer the size of
the vocabulary, a hidden layer with a neuron count equal to the preferred embedded
vector length, and an output layer, again as big as the number of unique words
in the corpus. Using a softmax activation we create a classifier, which learns to
return a word if its neighbour is the input feature (Figure 14). We maximize
the following average log probability, where $T$ is the number of words in the
corpus and $c$ is the size of the context window:
$$L_{skip\text{-}gram} = \frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\; j \ne 0} \log p(w_{t+j} \mid w_t) \tag{11}$$
Notice that we have created a bottleneck with the hidden layer, thus
compressing the contextual information into a relatively small vector. This vector,
the output values of the hidden layer (with no activation function), is the
representation of the current label word [4].
In Figure 14 we can see the neuron counts used when creating word2vec
[23, 25], an embedding tool by Google. With an architecture this size, training
the network means fitting 300 × 10000 × 2 = 6 million weights, which is
computationally expensive. To circumvent this problem, skip-gram uses hierarchical
softmax or so-called negative sampling instead of the full softmax [26].
Figure 14: Embedding with Neural Language Model.
Continuous Bag of Words
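The CBOW algorithm [25] is similar to skip-gram, but uses the sum of the
neighbouring words' one-hot vectors as the input of the neural model. It maximizes
the function in Equation 12, where the context vector $w_t^*$ is defined in Equation 13.
$$L = \frac{1}{T} \sum_{t=1}^{T} \log p(w_t \mid w_t^*) \tag{12}$$
$$w_t^* = \sum_{-c \le j \le c,\; j \ne 0} w_{t+j} \tag{13}$$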
Word2vec uses both models. The CBOW architecture predicts the current
word based on the context, and skip-gram predicts surrounding words given
the current word (Figure 15).
Figure 15: CBOW and skip-gram [25].
By using embeddings based on word-to-word relations, the vectors
corresponding to similar words point in similar directions. E.g. the difference vector
of the plural and of the singular form of a noun is quite similar for a lot of
words (dogs − dog ≈ cats − cat). The other often used example is the following:
queen ≈ king − man + woman (where each word is the vector for that word).
Based on this, the meaning of a phrase or sentence might be represented by
the sum of vectors of the composing words.
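The vector arithmetic behind such analogies can be illustrated with a toy sketch. The three-dimensional vectors below are hypothetical values invented for this example (real embeddings are learned and have hundreds of dimensions), and `cosine` and `closest_word` are helper names introduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity: 1 means the vectors point in the same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def closest_word(target, vectors, exclude):
    """Return the word whose vector is most similar to the target vector."""
    candidates = [w for w in vectors if w not in exclude]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


# Hypothetical toy embeddings, not learned from any corpus.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.15, 0.85],
    "movie": [0.5, 0.5, 0.5],
}

# king - man + woman, computed element by element.
target = [k - m + w for k, m, w in
          zip(vectors["king"], vectors["man"], vectors["woman"])]
best = closest_word(target, vectors, exclude={"king", "man", "woman"})
```

Excluding the three input words from the candidates is the usual convention when evaluating analogies, since the result vector tends to stay close to its inputs.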
3 Related Work
Making predictions based on user generated content on social media has a
tremendous amount of literature. A very exciting and timely example is
using Twitter to predict electoral outcomes [40]; however, this approach has its biases and
limitations [13, 12]. Interesting studies have appeared regarding the use of
social media indicators to predict the scientific impact of research articles, e.g.
short-term web usage (number of downloads from the pre-print sharing web
site arXiv) [5] and Twitter mentions [10]. A recent work shows that
Twitter mentions and arXiv downloads follow two distinct temporal patterns
of activity, but the volume of Twitter mentions is statistically correlated
with arXiv downloads and early citations [36]. Preis et al. found a
connection between weekly transaction volumes of S&P 500 companies and weekly
Google search volumes of the corresponding company names [32]. There are other
examples of using social media streams to predict news
popularity in terms of the number of user-generated comments [38, 39] or the number
of news visitors [6].
The motion picture industry is a subject of many studies, which may partly
be due to the interesting statistical behaviors it shows. One can observe a
log-normal distribution of the gross income of theaters, and a bimodal distribution
of the number of theaters screening a movie [31]. Based on 70 years of data
regarding the American movie market, Sreenivasan states that more original
movies tend to earn more (where the originality is based on keywords from
IMDb, the Internet Movie Database) [37]. Predicting the financial success of films is a
challenging problem. Sharda and Delen attempted to do so by training a neural network on data
using features regarding quality and popularity, collected before the
premieres. They classified movies into nine categories based on their predicted
income. Their predictions were correct for 36.9% of the movies in the test
dataset, and 75.2% of the films were classified less than two categories away from
the correct category [35]. Joshi et al. used a linear regression model with
movie meta-data and sentiment features, where the latter were extracted from
pre-release critiques using n-gram models. Their best combination of features
achieved an r2 score of 0.671 [18], where the r2 score (coefficient of determination)
is the proportion of the variance in the dependent variable that is predictable
from the independent variable(s). Predictions based on classic quality
factors are not reliable enough to use in practical applications, but with the use
of user generated data this threshold might be crossed.
With the birth of micro-blogging came an increased number of electronically
documented human interactions, and a way to get direct insight into the thoughts of many
people. Ishii et al. model human interactions within society with a
stochastic process [17]. By using only the marketing budget over time as input, their
model generates a dynamic popularity variable, which they validated against
the number of blog posts about the particular movies in the Japanese
blogosphere [27]. Box-office predictions have also been made using user activity on
the Wikipedia pages of films [24]. Mestyán et al. used measurements of the
number of views, users, edits and collaborative rigor on 312 movies' Wikipedia
pages. Using a simple linear regression model, they were able to make
predictions with a coefficient of determination of 0.925 one month before the premiere.
In a novel approach, Asur and Huberman predict movie revenue based on the
number of Twitter mentions regarding 24 movies [3]. They anticipated the
income for the opening weekends of movies with tweets from the night before,
achieving an r2 score of 0.97. In other work, Wong et al. advise us to be
sceptical of Twitter's financial predictive ability [41]. By using a sample of 34
movies, they compare ratings from IMDb and Rotten Tomatoes to the
sentiment of the tweets mentioning those movies, and arrive at the conclusion that
there is a noteworthy bias towards positivity in the emotions Twitter users
display. In a similar approach, Oghina et al. use Twitter and YouTube activity
to predict the ratings on IMDb [29].
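Since several of the cited results are reported as r2 scores, a minimal sketch of how the coefficient of determination is computed may be helpful. The function name `r2_score` and the example numbers are assumptions made for this illustration and are unrelated to the cited studies.

```python
def r2_score(y_true, y_pred):
    """1 minus the ratio of the residual sum of squares to the total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Residual sum of squares: 0.01 + 0.01 + 0.04 + 0.04 = 0.10
# Total sum of squares around the mean 2.5: 2.25 + 0.25 + 0.25 + 2.25 = 5.0
score = r2_score([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

A score of 1 corresponds to perfect predictions, while a model that always predicts the mean of the targets scores 0.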
4 Datasets
In this section we elaborate on the collection of data, and review some
characteristics of the datasets used.
4.1 Movie Dataset
We created a movie dataset by merging the IMDb5000 and the TMDb5000

datasets5 from Kaggle, which both contain titles, release dates, domestic gross
income figures from the USA, and further information regarding roughly 5000
movies.
With the use of this data we chose which films to collect tweets about,
and as described in Section 6.1, we used the meta-data features to create a
baseline model for income prediction. Considering we use NLP techniques
later on, we only collected tweets labeled as English. Most English tweets are
written in the USA, so we also used income figures from this country, rather
than worldwide earnings, arguing that this way there should be a stronger
connection between the two. The two datasets contain almost the same set of
movies: 4700 of them are found in both. The film properties in the two sets
are somewhat orthogonal, so taking the union of the two is beneficial to our
cause. Also, by merging the two datasets, we were able to fill missing values
for features contained by both.
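The merge described above can be sketched as follows. This is only a minimal illustration: the join key (a case-insensitive title), the field names and the toy records are our assumptions, not the actual column names of the Kaggle datasets.

```python
def merge_movie_records(first_rows, second_rows, key="title"):
    """Union of two movie tables; a value missing in one source is filled from the other."""
    merged = {}
    for row in list(first_rows) + list(second_rows):
        # Hypothetical join key: the lowercased, stripped title.
        target = merged.setdefault(row[key].strip().lower(), {})
        for field, value in row.items():
            # Keep the first non-missing value seen for each field.
            if target.get(field) is None and value is not None:
                target[field] = value
    return list(merged.values())


# Toy records (made up for illustration only).
imdb_rows = [{"title": "Some Film", "gross": 1000000, "content_rating": None}]
tmdb_rows = [{"title": "some film", "gross": None, "content_rating": "PG-13", "runtime": 120}]
movies = merge_movie_records(imdb_rows, tmdb_rows)
```

Each merged record carries the union of the fields, and a field present in both sources keeps the first non-missing value.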
4.1.1 Filtering for Relevant Search Results
To collect relevant tweets for each film and be able to use them for income
prediction, we had to ensure that these movies met certain criteria. We dropped
all films with missing title, income, or premiere date. The merged movie
dataset contains films from as long ago as 1922 (Figure 16), but Twitter was
only launched on 15 July 2006, and the number of users was relatively
low in the first few years. For this reason we used movies released after 2008,
when Twitter hit 6 million monthly active users.

Footnote 5: The IMDb5000 was replaced by the TMDb5000 dataset on Kaggle’s
website for legal reasons, and is no longer available online.

Figure 16: Distribution of film releases in time.
Movie titles are often expressions used in everyday language, so by simply
searching for tweets containing the title, we would obtain a high number of
irrelevant posts. One option would have been to search for tweets containing
the official hashtags of the films, but to the best of our knowledge there is
no available data listing these, so this option was not viable. Our solution
was to collect tweets that contained the word “movie” besides the movie title,
and to use only films with titles of at least two words.
However, even with these constraints, a significant portion of the tweets
collected for less known movies were unrelated to said movies. In Figure 17
we see the distribution of income for movies, and also the distribution of
the logarithm of income. Based on these histograms we decided to only use
films with at least $100,000 gross income. It is safe to presume that films
generating less revenue have a smaller footprint on social media. After
collecting the tweets, we retroactively removed films with fewer than 50
tweets. Finally, we were left with 988 movies.

Figure 17: Distribution of films’ gross domestic income.
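The selection criteria of this section can be summarized in a short sketch. The record layout (`title`, `gross`, `premiere`, `tweet_count`) is hypothetical, and we read “released after 2008” as a premiere year greater than 2008, which is an assumption.

```python
from datetime import date


def is_usable(film, min_income=100_000, min_tweets=50):
    """Apply the film selection criteria to one (hypothetical) film record."""
    # Drop films with a missing title, income, or premiere date.
    if not film.get("title") or film.get("gross") is None or film.get("premiere") is None:
        return False
    return (
        film["premiere"].year > 2008             # assumed reading of "after 2008"
        and len(film["title"].split()) >= 2      # titles of at least two words
        and film["gross"] >= min_income          # at least $100,000 gross income
        and film.get("tweet_count", 0) >= min_tweets
    )


# Toy records (made up for illustration only).
films = [
    {"title": "Some Long Title", "gross": 2_000_000, "premiere": date(2012, 5, 1), "tweet_count": 300},
    {"title": "Single", "gross": 2_000_000, "premiere": date(2012, 5, 1), "tweet_count": 300},
    {"title": "Old Classic Film", "gross": 2_000_000, "premiere": date(1995, 5, 1), "tweet_count": 300},
    {"title": "Tiny Indie Film", "gross": 50_000, "premiere": date(2012, 5, 1), "tweet_count": 300},
    {"title": "Quiet Release Here", "gross": 2_000_000, "premiere": date(2012, 5, 1), "tweet_count": 10},
    {"title": "No Income Known", "gross": None, "premiere": date(2012, 5, 1), "tweet_count": 300},
]
usable_titles = [film["title"] for film in films if is_usable(film)]
```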
4.2 Twitter Data
Our dataset of tweets contains 10 million tweets in English (according to
Twitter’s language labeling), regarding the 988 movies mentioned before
(Section 4.1), written in the 128 days preceding the premieres of each movie.
The distribution of the number of tweets can be seen in Figure 18.

Figure 18: Distribution of number of tweets in films.
In order to get a first look at how important tweets are with regard to
the success of the film, we calculated the correlation between the number of
tweets regarding a movie and its income (Figure 19). For comparison,
we did the same with the production budget and the income. The low
correlation between the income and the tweet count is partly due to Twitter
having been less commonly used when the earlier movies premiered. For 2009
the correlation was 0.1, but it increases over time and reaches 0.7 by 2016.

Figure 19: Correlation between income of movies and budget of movies, and
correlation between income of movies and number of tweets mentioning the films.
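A per-year correlation of this kind can be sketched as below. We assume the Pearson correlation coefficient here; the text does not name the coefficient, and the field names (`year`, `tweet_count`, `gross`) are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences (undefined if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


def correlation_by_year(films):
    """Correlation between tweet count and income, computed separately per premiere year."""
    by_year = {}
    for film in films:
        by_year.setdefault(film["year"], []).append(film)
    return {
        year: pearson([f["tweet_count"] for f in group], [f["gross"] for f in group])
        for year, group in by_year.items()
        if len(group) >= 2  # a single film has no defined correlation
    }
```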
To see how the tweet count for a single movie develops in time, we collected
tweets about the movie Rogue One: A Star Wars Story for a longer time
period. The effect of new information regarding the movie can be seen clearly
in Figure 20. New announcements create a spike in tweets, with an exponential
decay after. Nearing the premiere there is a slower, but steady exponential
rise in the frequency of tweets.
To see any global characteristics of tweeting behavior in time, we aligned
the tweets of different films by the time until the premiere, as seen in Figure 21.
An exponential rise in interest can be observed approaching the opening date,
and a few blurred local maxima can also be found. By looking closer at the
distribution we can observe the daily periodicity of the number of tweets.
4.2.1 Cleaning the Text Data
The collected tweet texts have a lot of elements besides words, such as handles,
hashtags, or URLs. These can help us predict the income of movies (Figure 35),
but hinder natural language processing models. We generated features based
on the presence of these elements, then removed them from the text.

Figure 20: Number of tweets in time about Rogue One before the premiere.

Figure 21: Number of tweets for all films (aligned to premiere date).
Duplicates can be found in the dataset, which is partly due to tweets
mentioning more than one film. In these cases separate instances of the tweets
were assigned to each film. To keep these while removing real duplicates, we
only removed tweets that were written at the same time, have the exact same
text, and were collected for the same film.
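The duplicate rule above amounts to keying each tweet on the triple (time, text, film). A minimal sketch, with hypothetical field names:

```python
def drop_duplicates(tweets):
    """Remove tweets that share timestamp, text and film; keep the first occurrence."""
    seen = set()
    unique = []
    for tweet in tweets:
        key = (tweet["created_at"], tweet["text"], tweet["film"])
        if key not in seen:
            seen.add(key)
            unique.append(tweet)
    return unique
```

A tweet mentioning two films is stored once per film, so its two instances differ in the `film` field and both survive.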
To prevent any encoding errors we removed all non-ASCII characters first.
Next we examined the URLs present in the tweets. Most often they were
links to either images, or YouTube videos. We created three features based
on whether the tweet contains an image link, YouTube link, or any link at all.
We registered the number of handles and hashtags the tweet contained, then
removed those as well.
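The link, handle and hashtag features can be sketched with a few regular expressions. The patterns and the heuristics for recognizing image and YouTube links are our assumptions for illustration; the exact rules used in this work are not reproduced here.

```python
import re

# Assumed patterns: URLs with a scheme, or scheme-less pic.twitter.com links.
URL_RE = re.compile(r"(?:https?://|pic\.twitter\.com/)\S+")
HANDLE_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#\w+")
IMAGE_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif")


def extract_features(text):
    """Return (features, cleaned_text) for one raw tweet text."""
    urls = URL_RE.findall(text)
    features = {
        "has_url": bool(urls),
        "has_youtube_link": any("youtube.com" in u or "youtu.be" in u for u in urls),
        "has_image_link": any(
            "pic.twitter.com" in u or u.lower().endswith(IMAGE_SUFFIXES) for u in urls
        ),
        "handle_count": len(HANDLE_RE.findall(text)),
        "hashtag_count": len(HASHTAG_RE.findall(text)),
    }
    # Drop non-ASCII characters, then remove the elements counted above.
    cleaned = text.encode("ascii", "ignore").decode("ascii")
    for pattern in (URL_RE, HANDLE_RE, HASHTAG_RE):
        cleaned = pattern.sub(" ", cleaned)
    return features, " ".join(cleaned.split())
```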
Numbers occurring in tweets might carry valuable information, as they
could be a numerical rating of the movie. However, it is hard to distinguish
between a score and other number occurrences, therefore we did not use this
information, and removed all numbers. We also got rid of all non-alphanumeric
characters, apart from a handful used for punctuation: , . ! ? ’ ". The
remaining punctuation characters were padded with whitespace to make them
form separate words. These two steps were performed while keeping a list of
ASCII emoticons (specified in Section 5) intact and in place.
The models we use for emotion classification need an input format with
fixed number of words. This number was determined based on the distribution
of the word counts of tweets. Tweets longer than this were truncated, shorter
ones padded. By making each punctuation character a separate word, a tweet
with a sequence of exclamation marks can have a very high number of words,
and the necessary trimming (done in order to achieve the specified number of
words) would possibly result in losing important words. We wanted to avoid
this, and thus replaced all repeating non-alphanumeric characters with only
one instance of the character. Using excessively repeated characters is also
a frequent phenomenon among Twitter users (e.g. “haappyyyyy”, “yayyy”). By
transforming these to an almost correct form, we reduced the number of distinct
words. This was done by replacing the characters repeated more than twice
with only one copy (e.g. “haapppyyyy” → “haapy”).
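The two repetition rules and the punctuation padding can be expressed with three regular expression substitutions. This is a simplified sketch: unlike the procedure described above, it does not protect the ASCII emoticons, and it assumes the apostrophe has already been normalized to its ASCII form.

```python
import re


def normalize_repetitions(text):
    """Collapse repeated characters and pad punctuation with spaces."""
    # Repeated non-alphanumeric, non-space characters -> a single instance ("!!!" -> "!").
    text = re.sub(r"([^\w\s])\1+", r"\1", text)
    # Letters repeated more than twice -> a single copy ("yyyy" -> "y"); doubles are kept.
    text = re.sub(r"([A-Za-z])\1{2,}", r"\1", text)
    # Pad the retained punctuation so that each mark becomes a separate word.
    text = re.sub(r"([,.!?'\"])", r" \1 ", text)
    return " ".join(text.split())
```

With these rules “haapppyyyy” becomes “haapy”, as in the example above, because the double “aa” is kept while “ppp” and “yyyy” shrink to one character.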
If a certain emotion is relatively often associated with one or a few movies,
our emotion classification model may overfit by learning that the presence of
these titles implies that emotion. By removing the titles altogether, we would
lose the relative position information between words or expressions and the
title. Our solution was to replace all occurrences of the titles in the text with
the word “film”.
At this point in the text processing we extracted two additional features
from the text: the number of words and the ratio of capital letters. We
then replaced all capital letters with their lowercase counterparts. Twitter is
notorious for poor spelling, but hopefully, thanks to the continuous
space word representations, the misspelled versions of words will be represented
by similar vectors, if those misspellings are common enough. To filter
out the ones that are not, we removed all words with fewer than 40 occurrences
in the 10 million tweets.
The vocabulary assembled during cleaning the textual data contains 62,000
different words.
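The capital letter ratio and the rare word filter can be sketched as follows. Whether the ratio is taken over letters only or over all characters is not stated above; the sketch assumes letters only.

```python
from collections import Counter


def caps_ratio(text):
    """Share of uppercase letters among all letters (assumed definition)."""
    letters = [c for c in text if c.isalpha()]
    return sum(c.isupper() for c in letters) / len(letters) if letters else 0.0


def build_vocabulary(tokenized_tweets, min_count=40):
    """Words that occur at least `min_count` times in the whole corpus."""
    counts = Counter(word for tweet in tokenized_tweets for word in tweet)
    return {word for word, n in counts.items() if n >= min_count}


def filter_rare_words(tokenized_tweets, vocabulary):
    """Drop every word that is not part of the vocabulary."""
    return [[w for w in tweet if w in vocabulary] for tweet in tokenized_tweets]
```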
5 Creating Emotion Labeling
In this section we assign emotion labels to a subset of our tweets, based on
emoticons found in the texts.
Methods analyzing emotions can be divided into two main groups based
on how many different emotions we consider. Sentiment analysis assigns a one
dimensional score, determining the positivity/neutrality/negativity of an item.
Emotion analysis on the other hand tries to analyze data in a more nuanced
way, using more dimensions. These commonly are: anger, disgust, sadness,
happiness, surprise and fear [9], but sometimes different groups are used.
Marketing studies have shown that these emotions influence brand recognition
and income in a non-trivial way [34]. A funny advertisement, for instance,
may be remembered years after it aired, but people have a harder time recall-
ing the brand or product it was promoting. On the other hand, viewers shown
an ad with a greater sadness component will have a better recollection of the
brand/product.
Movies are products, franchises might even be interpreted as brands.
Tweets written pre-release are most likely direct or indirect reactions to
marketing activities such as trailers, pictures, or news about a new film
(Figure 20). For these reasons we decided to emphasize emotion analysis in
this work.
Labeled datasets with 6 emotion classes are hard to access, so we created
our own labeling using the ASCII emoticons found in tweets. People often use
emoticons to emphasize their feelings on Twitter. We presume that the feeling
represented by an emoticon is somewhat consistent with the emotion conveyed
by the text. So to speak, the emoticon summarizes the emotional aspect of
the tweet. To see whether our presumptions have a basis, we handpicked a few
emoticons and calculated the correlation between their occurrence ratios in
tweets regarding each film, and the gross domestic income of those films. The
values can be seen in Table 1 and in Figure 22.

Emoticon   Correlation
:D          0.141
**          0.083
:-D         0.071
=))         0.054
:O         -0.027
O.O        -0.040
:/         -0.041

Table 1: Correlation between emoticons’ occurrence ratios in tweets related to
films, and the films’ income.
Figure 22: Occurrence ratio of emoticons with positive and negative correla-
tions with gross domestic income of films. Each point represents one film.
The correlations are small, but the fact that smiling emoticons have positive
correlation, while surprised and sad emoticons have negative correlation, is
promising.
5.1 Selecting the Set of Emoticons
The next question is what emoticons to use, and how to group them. As there
is no strict, closed set of ASCII emoticons, we searched for a wide variety
of emoticons, most of which are probably never used, and filtered them by
a minimum necessary number of occurrences. We created our preliminary
set of emoticons by combining the possible building blocks in every possible
way. For horizontal emoticons that are rotated +90 degrees, such as “:)”,
we defined the building blocks as seen in Table 2. We did the same with
horizontal emoticons rotated −90 degrees (e.g. “(:”), and vertical emoticons
(e.g. “O.O”). This set contained 5,260 possible emoticons.

Part      Possible characters
Eyebrow   >,), ,(,<,},’
Eye       :,;,X,x,B,8,=
Nose      -,,,’-,
Mouth     ),)),D,P,b,S,(,},{,],[,@,o,O,0, /,|,L,X,#,&

Table 2: Building blocks of left to right horizontal emoticons.
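Combining the building blocks “in every possible way” is a Cartesian product. The sketch below uses a small illustrative subset of the parts (not the full lists of Table 2), with the empty string standing for an omitted eyebrow or nose; both choices are our assumptions.

```python
from itertools import product

# Illustrative subsets only; "" marks an optional part that is left out.
EYEBROWS = ["", ">"]
EYES = [":", ";", "=", "8"]
NOSES = ["", "-"]
MOUTHS = [")", "(", "D", "P", "/"]


def left_to_right_emoticons():
    """All eyebrow + eye + nose + mouth combinations, as a set of strings."""
    return {"".join(parts) for parts in product(EYEBROWS, EYES, NOSES, MOUTHS)}


candidates = left_to_right_emoticons()
```

The subset above yields 2 × 4 × 2 × 5 = 80 candidates; the mirrored and vertical families would be generated the same way from their own part lists.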
Determining if an emoticon is present in a tweet can be difficult, because
often they aren’t separated from words by spaces. This is particularly prob-
lematic if the word is attached to a part of an emoticon that is a character,
e.g. “:Dgood times”. We did not want to lose these “conjoint” emoticons, but
also did not want to identify part of the text as an emoticon by accident. So
we searched for the emoticons with and without whitespace padding on each
side. If the number of occurrences of the non-padded version was much higher
than that of the padded version, we assumed that this character sequence was
used as a part of the text, rather than as an emoticon. By removing emoticons
with a “padded count” lower than 30, and with a non-padded/padded ratio higher
than 5, we were left with 50 emoticons. An additional 9 were removed by
hand, because they were not emoticons. The remaining set contained multiple
instances of the same emoticons, with different letters capitalized. These were
merged, finally leaving 31 emoticons.
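The padded versus non-padded filter can be sketched as below. Two points are our interpretation rather than a statement of the exact procedure: the “non-padded” count is taken as the occurrences that touch other text, and a candidate is dropped if either threshold is violated.

```python
import re


def emoticon_counts(emoticon, tweets):
    """Return (padded, attached) occurrence counts of a character sequence."""
    # Padded: delimited by whitespace or by the start/end of the tweet.
    padded_re = re.compile(r"(?:^|\s)" + re.escape(emoticon) + r"(?=\s|$)")
    padded = sum(len(padded_re.findall(t)) for t in tweets)
    total = sum(t.count(emoticon) for t in tweets)
    return padded, total - padded


def select_emoticons(candidates, tweets, min_padded=30, max_ratio=5):
    """Keep candidates that are frequent when padded and rarely glued to text."""
    kept = []
    for emoticon in candidates:
        padded, attached = emoticon_counts(emoticon, tweets)
        if padded > 0 and padded >= min_padded and attached / padded <= max_ratio:
            kept.append(emoticon)
    return kept
```

A sequence such as “xD” that mostly appears inside ordinary words gets a low padded count and a high ratio, so it is filtered out.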
5.2 Emoticon Embedding
Our original goal was to somehow extract the 6 basic emotions used for emotion
analysis. The more frequent emoticons we found may not cover all 6 emotions,
and we were also not certain how to arrange them into these groups. The 31
emoticons available at this point were far too many to use as distinct
emotions, and their occurrence counts are very unevenly distributed. Some
are very similar, and probably should not be distinguished from each other.
Grouping the emoticons was inevitable, but we wanted to avoid doing this
manually. However, to be able to apply any kind of clustering algorithm, we
need a measure of similarity between the emoticons.
Our solution to the problem was to treat the emoticons as if they were
words, and use continuous space vectors to represent them. We assume that
emoticons are similar to words in the sense that similar ones occur in simi-
lar contexts. So by using Gensim’s [33] word2vec embedding module on the
subset of our tweet text data that contains emoticons, we created embedded
vector representations of our emoticons. A similar method has been used in
[8], creating embeddings for Unicode emoticons based on their description. The
word2vec model’s parameters were chosen by selecting the combination that
assigned the most similar embedded vectors to emoticons we considered to
have similar meaning. The similarity of the vectors was monitored by reducing
the number of dimensions to two with t-SNE (see footnote 6), and scatter
plotting the two dimensional projections of the embedded vectors. The context
of each word was the set of the closest 7 words, and the number of embedding
dimensions was chosen to be 5.

Footnote 6: t-SNE projects high-dimensional vectors into a lower dimensional
space while attempting to preserve relative distances [22].
We used K-means clustering to create 9 clusters based on the 5 dimensional
embedded vectors of the emoticons. We created 9 clusters instead of 6, so
that the few outlier emoticons without similar elements would not hinder the
clustering process. As can be seen in Figure 23, some clusters are easier to
interpret than others. But we must keep in mind that the similarities observed
by the word2vec model can also be influenced by the style of a person’s writing,
which may even be of help when trying to predict the income of movies. The
results of the embedding may be somewhat compromised if, contrary to our
assumption, emoticons written in the tweet do not summarize the feeling
conveyed by the text, but rather differ from it. In the case of sarcasm, for
example, the emoticon may be the only indicator of the real emotions: “the
film was sooo good :/”. The cluster sizes seen in Figure 23 do not include
occurrences of the emoticons which are stuck to words, only the well separated
ones. Based on the whole count we selected the 5 biggest clusters: 1, 3, 5, 7, 8
(from here on referred to as emotion classes 1, 2, 3, 4, 5), and created labels
for the Twitter data based on these (Table 3).
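For readers unfamiliar with K-means, the following is a minimal sketch of the plain Lloyd iteration on 5 dimensional vectors. It is not the implementation used in this work: the random initialization, the fixed iteration count and the toy vectors are all assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's algorithm; returns (assignment, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumed initialization: k random points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: every point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: every centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Toy 5 dimensional "emoticon vectors": two well separated groups.
vectors = [
    [0.0, 0.0, 0.0, 0.0, 0.0], [0.1, 0.0, 0.0, 0.0, 0.0], [0.0, 0.1, 0.0, 0.0, 0.0],
    [5.0, 5.0, 5.0, 5.0, 5.0], [5.1, 5.0, 5.0, 5.0, 5.0], [5.0, 5.1, 5.0, 5.0, 5.0],
]
labels, centers = kmeans(vectors, k=2)
```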
To see if using multiple emotion classes helps more in predicting income
than using 2 classes, we had to create a grouping of emoticons with two classes
by hand. In Figure 24 these two artificially chosen classes can be seen, along
with the emoticons we couldn’t fit into either class.
The two classes are visibly well separated, even though we created these
classes independently of the two dimensional representation of the embedded
vectors. To see if these two classes are easily separable by some subset of the
five embedding dimensions, or if we only see this nice separation thanks to
t-SNE, we plotted the same figure but using the 5 dimensional embeddings
directly. We examined all dimensions pairwise, and noticed that the classes
were somewhat separated along the (0,2) and (3,4) dimension pairs (Figure 25),
but not as distinctly as in the figure we made with t-SNE. From this we
concluded that the current word2vec embedding’s (0,2,3,4) dimensions may
have something to do with the positivity of words/emoticons, and that t-SNE
seems to be a good method to reduce this multidimensional information to 2
dimensions. We added this 2 class labeling to our Twitter dataset to use for
sentiment analysis. The number of labeled tweets can be seen in Table 3.

Figure 23: The 5 dimensional continuous space vector representations of
emoticons embedded with t-SNE into 2 dimensions, colored by the containing
clusters created with K-means clustering. The size of the dots in the plot
corresponds to the count of unambiguous occurrences of emoticons, and in the
legend to the size of the clusters.
Figure 24: The 5 dimensional continuous space vector representations of emoti-
cons embedded with t-SNE into 2 dimensions, colored by the containing man-
ually selected 2 clusters. The size of the dots corresponds to the count of
unambiguous occurrences of emoticons, and in the legend to the size of the
clusters.
Class                Number of labeled tweets
5 emotion classes
  Class 1            128,553
  Class 2              5,619
  Class 3              5,576
  Class 4             16,000
  Class 5             16,786
2 emotion classes
  Positive           142,012
  Negative            23,062

Table 3: Number of labeled tweets for each emotion class.
Figure 25: The emoticon embeddings plotted using only two 2 dimensional
subsets of their 5 dimensions, colored by the containing manually selected
2 clusters. The size of the dots corresponds to the count of unambiguous
occurrences of emoticons, and in the legend to the size of the clusters.
6 Emotion Analysis
In this section we compare the ability of different classification models to
predict the emoticon based labels defined in Section 5.2 from texts represented
as an array of embedded or one-hot encoded words. The best performing model
is used to score all unlabeled tweets.
We removed the emoticons left in the Twitter texts and created a new embedding
matrix with Gensim’s word2vec module, using the same parameters
we found best for embedding emoticons (window size = 7, embedding
dimensions = 5). These embeddings are used for both baseline models and neural
networks.
Machine learning models need a fixed number of input features. In our case
the training instances were tweets, where every word was assigned a unique
index. The number of words in a tweet is the width of that data point. To have
a unified input format, we chose a number of words, over which we truncate
tweets, and under which we would pad them with a chosen padding word.
By choosing a very big number, we would not lose any words from longer
tweets, but the ratio of actual information in the input would be quite low,
negatively affecting the training of our models. By cutting tweets too short,
this information ratio would be higher, but we would also lose many possibly
important words. Based on the distribution of the number of words in tweets
seen in Figure 26, we set the number of input words to 20. The padding
character’s embedded vector was defined as all zeros.
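Bringing every tweet to the fixed length of 20 words can be sketched as follows. We assume that index 0 is reserved for the padding word (whose embedded vector is all zeros) and that padding is appended at the end of the tweet; neither detail is specified above.

```python
def to_fixed_length(word_indices, length=20, pad_index=0):
    """Truncate or pad a list of word indices to exactly `length` entries."""
    clipped = list(word_indices)[:length]
    return clipped + [pad_index] * (length - len(clipped))
```

For example, a 3 word tweet is followed by 17 padding indices, while a 25 word tweet loses its last 5 words.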
The labels we created have a rather uneven distribution in number of oc-
currences, so we sampled them to have the same number of training instances.
For the 5 class labels this meant 5,500 labeled tweets for each class, and 23,000
for the 2 class labels. A random classifier would achieve respectively 0.2 and
0.5 multiclass accuracy. The train-test cut was done using a 0.8/0.2 ratio.
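The class balancing and the 0.8/0.2 split can be sketched as below. Random undersampling without replacement and a fixed random seed are our assumptions; the text only states that the classes were sampled to equal size.

```python
import random


def balanced_split(texts, labels, per_class, test_ratio=0.2, seed=0):
    """Sample `per_class` items from every class, then split into (train, test)."""
    rng = random.Random(seed)
    by_class = {}
    for text, label in zip(texts, labels):
        by_class.setdefault(label, []).append(text)
    sample = []
    for label, items in by_class.items():
        # Raises ValueError if a class has fewer than `per_class` items.
        sample += [(text, label) for text in rng.sample(items, per_class)]
    rng.shuffle(sample)
    n_test = round(len(sample) * test_ratio)
    return sample[n_test:], sample[:n_test]
```

With 5 classes and `per_class=5500` this gives 27,500 labeled tweets, of which 22,000 are used for training and 5,500 for testing.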
Figure 26: Distribution of number of words in tweets.
6.1 Baseline Models
We trained a few simpler models on the 5 class labels for later comparison
against the neural network based models, using two different text
representations. The one-hot encoding format represents each tweet with a
vector the length of the vocabulary, with ones at the indexes of the contained
words. We also tried using the pretrained word embeddings by concatenating the
embedded vectors of the contained words in sequence for each tweet. The results
are reported in Table 4. The K-Neighbors and the Support Vector Machine
models did not finish training within reasonable time while using the sparser,
one-hot encoded input format.
                         Multiclass accuracy
Classification model     one-hot encoded text   word2vec embedded text
Logistic Regression      0.221                  0.224
K-Neighbors              -                      0.325
Support Vector Machine   -                      0.307
Decision Tree            0.356                  0.309
Random Forest            0.431                  0.402
Gaussian Naive Bayes     0.296                  0.292

Table 4: Multiclass accuracy for simpler models.
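The two input representations compared in Table 4 can be illustrated with a short sketch. The tiny vocabulary and the 2 dimensional embedding matrix below are toy values; in our setting the one-hot vector has the length of the vocabulary and the concatenated vector has 20 × 5 = 100 entries.

```python
def one_hot_encode(word_indices, vocabulary_size):
    """Vocabulary-length vector with ones at the indexes of the contained words."""
    vector = [0] * vocabulary_size
    for index in word_indices:
        vector[index] = 1
    return vector


def concatenate_embeddings(word_indices, embedding_matrix):
    """Embedded vectors of the words, concatenated in the order of the tweet."""
    return [value for index in word_indices for value in embedding_matrix[index]]
```

The one-hot form discards word order and repetition, whereas the concatenated form keeps the sequence of words.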
6.2 Neural Networks
Based on recent trends, neural networks (NNs) are the most popular models
when performing emotion analysis. One of the problems with NNs is the very
high number of hyper-parameters to tune. One must choose, for instance,
the number and types of layers constructing the network, and the number of
neurons in each of the layers. Using embedding layers, one must also face
another set of questions, detailed in the following section.
6.2.1 Embedding Layer Settings
Embedding layers function as a lookup table, where each word is assigned a
vector with the length of the number of embedding dimensions. This table is
the embedding matrix, with the size of vocabularySize × embeddedVectorSize.
Its cells are most often initialized using random values. These values get fitted
through backpropagation when training the whole network. The embedding
matrix can also be initialized with predefined weights, possibly originating
from some other model (for example word2vec). These weights can be chosen
to be fixed during training, or further trained through backpropagation.
We compared the performance of fixed and trainable predefined weights,
as well as randomly initialized weights, using the network referred to as
“1dim conv” in Section 6.2.2. The results are shown in Figure 27. The
pretrained fixed embeddings only perform better at the very start of
the training; the embedding matrices with trainable weights work much
better in the long term, almost independently of the initial weights. Based on
these experiments, we decided to use trainable embedding weights initialized
with pre-trained embeddings.
Figure 27: Performance of different word embedding methods with 1 dimen-
sional convolution model.
6.2.2 Neural Network Architectures
We started our search for usable neural models by experimenting with models
that were found to be the best in different tutorials and blogs, modifying them
to fit our current input format. We tried many different architectures, mostly
using 4 different types of layers: dense, recurrent, 2 dimensional convolution
and 1 dimensional convolution. We tried different combinations of these types;
it seemed that less complex networks worked better with our data and labels.
All neural networks were implemented using Keras [7] with a TensorFlow [2]
backend.
The model referred to as “logreg” is a one layer dense network with sigmoid
activation. The input of this network is the same as the input of the logistic
regression in Section 6.1: the embedded vectors concatenated to form a vector.
This model differs from the baseline model in that the weights in the
embedding matrix may change during training. In our experiments this resulted
in much better performance.
The other models’ architectures can be seen in Tables 5, 6, 7 and 8.
When training on the 2 class labels, the dense output layers
had 2 neurons.
Layer type                   Output shape   Number of parameters
Embedding                    (20, 5)        310030
Conv1D (10 (5) kernels)      (16, 10)       260
Conv1D (10 (5) kernels)      (12, 10)       510
Flatten                      (120)
Dense                        (128)          15488
Dense (softmax activation)   (5)            645

Table 5: The layout of the “1dim conv” network.

Layer type                   Output shape   Number of parameters
Reshape                      (20, 5, 1)
Conv2D (100 (3,5) kernels)   (18, 1, 100)   1600
Reshape                      (18, 100)
MaxPooling1                  (1, 100)
Flatten                      (100)

Table 6: The layout of the “2dim conv” network.
The network that is referred to as “dense1” is a 1 layer dense network,
where the input format is the one-hot encoded format of tweets, also used for
the baseline models. While the other models may use information about the
sequence of words, this model can only make predictions based on the words’
presence or absence.
Layer type                   Output shape        Number of parameters
Embedding                    (None, 20, 5)       310030
Reshape                      (None, 20, 5, 1)
Conv2D (10 (2,5) kernels)    (None, 19, 1, 10)   110
Flatten                      (None, 110)
Dense                        (None, 128)         14208
Dense (softmax activation)   (None, 5)           645

Table 7: The layout of the “2dim conv3” network.

Layer type   Output shape   Number of parameters
Embedding    (20, 5)        310030
LSTM         (100)          42400

Table 8: The layout of the “LSTM1” network.
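The parameter counts of Table 5 and Table 8 can be reproduced with the standard formulas for embedding, convolutional, dense and LSTM layers. The sketch assumes convolutions without padding and with stride 1, and biases in every layer. The number of embedding rows, 62,006, is inferred from 310030 / 5 and is close to the vocabulary size of roughly 62,000 reported in Section 4.2.1.

```python
def conv1d_params(kernel_size, in_channels, filters):
    return kernel_size * in_channels * filters + filters  # weights + biases


def conv1d_length(input_length, kernel_size):
    return input_length - kernel_size + 1  # no padding, stride 1 (assumed)


def dense_params(inputs, units):
    return inputs * units + units


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)


embedding_rows, embedding_dim, words = 62006, 5, 20

# "1dim conv" network (Table 5)
embedding = embedding_rows * embedding_dim
conv_a = conv1d_params(5, embedding_dim, 10)
length_a = conv1d_length(words, 5)
conv_b = conv1d_params(5, 10, 10)
length_b = conv1d_length(length_a, 5)
flattened = length_b * 10
hidden = dense_params(flattened, 128)
output = dense_params(128, 5)

# "LSTM1" network (Table 8)
lstm = lstm_params(embedding_dim, 100)
```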
In Figure 28 and Figure 29 we can see the accuracy measured on the test
dataset during training. All lines start from the accuracy of a totally random
classifier: 0.2 for 5 classes and 0.5 for 2 classes. Interestingly, the models
trained on the 2 class labels overfit much faster, while in the case of 5
classes it seems that the 1 dimensional convolution model might still be
improving on the test set even after 300 iterations over the whole data.
In the case of 5 classes, more complex models perform significantly better
than the “dense1” model, which could mean that these models were able to
utilize the information of how words follow each other. The top 3 models,
“LSTM1”, “1dim conv” and “2dim conv3”, do not differ much in performance
after 100 epochs. The “1dim conv” model learns more slowly, but this was
the first model we managed to tune to a performance this high, so we used
it for labeling our tweets.
When using the 2 class labels, the difference between the performance
of complex and simpler neural models was not as big, but still visible. The
“1dim conv” and “LSTM1” models compete for the first place, while the models
using 2 dimensional convolution overfit quite quickly on the training data.
We also used the “1dim conv” model for creating the binary labeling for tweets.

Figure 28: Performance of neural models on the validation dataset with 5 class
labels. Each marked data point corresponds to 10 iterations over the whole
dataset. The whole figure contains 300 iterations.

Figure 29: Performance of neural models on the validation dataset with 2 class
labels. After the first 14 points, each marked data point corresponds to
1 iteration over the whole dataset. The whole figure contains 20 iterations.
7 Income Prediction
In this section we aggregate the tweet features for films, and compare them
by their ability to improve a prediction based on movie meta-data.
The problem of regression can be simplified to classification by defining
income categories. However, we wanted to avoid the problem of finding the
best way to divide the interval of incomes, so we chose regression instead.
7.1 Creating Features Using Film Features
The merged IMDb5000 & TMDb5000 dataset has 30 different features, not
counting the title and the gross domestic income. Most of these features are
user created, such as IMDb score, number of votes, or number of likes on the
director’s Facebook page, which are very useful when predicting income. Our
task, however, was to predict the income based on pre-release data.
Unfortunately, these features were not recorded before the premieres, so we
treat these values as unknown. Even the films’ budget is problematic, since we
saw in Section 1.2 that marketing costs account for a large portion of the
expenses, and marketing spending does not stop at the premiere date.
The features from the dataset we can safely presume we know before the
premiere are the following:
• Premiere date (unix timestamp)
• Duration (min)
• Production companies
• First 3 actors on the cast list
• Genres
• Content rating
The latter 4 are clearly categorical features, documented in quite different
formats. This is summed up in Table 9. With the exception of content rating,
the categories are not mutually exclusive, e.g. a single film can fall into 8
genre categories at most.
Name                 Number of    Maximum number   Use top   Used number of
                     categories   per film                   PCA columns
Production company   4,051        26               150       5
Actor                4,776        3                359       5
Genre                23           8                23        5
Content rating       13           1                13

Table 9: A few properties of the categorical features and parameters of the
feature reduction method.
The number of categories is too high for most features to merely one-hot
encode. Luckily, the distribution of the number of occurrences seems
exponential for every feature, meaning that a lot of categories are very rare,
and most films are in at least one of a rather small subset of categories. For
production companies we used the 150 most frequent ones. Certain production
companies often collaborate, so we were able to further reduce the number of
features by capitalizing on these correlations. We did this using PCA (see
footnote 7), taking the first 5 columns of the transformed data matrix.
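The first half of this reduction, keeping only the most frequent categories and encoding their presence per film, can be sketched as below. The field name `companies` and the toy records are hypothetical, and the subsequent PCA step is not shown.

```python
from collections import Counter


def top_categories(films, field, top_n):
    """The `top_n` most frequent categories of a multi-valued feature."""
    counts = Counter(category for film in films for category in film[field])
    return [category for category, _ in counts.most_common(top_n)]


def multi_hot(film, field, categories):
    """One 0/1 column per kept category; several ones per film are possible."""
    present = set(film[field])
    return [1 if category in present else 0 for category in categories]


# Toy records (made up for illustration only).
films = [
    {"companies": ["A", "B"]},
    {"companies": ["A"]},
    {"companies": ["A", "C", "B"]},
]
kept = top_categories(films, "companies", top_n=2)
rows = [multi_hot(film, "companies", kept) for film in films]
```

The resulting 0/1 matrix (one row per film, 150 columns for production companies in our case) is the input on which the PCA is computed.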
We used similar methods for actors and genres. We used PCA only on
actors who acted in at least 6 movies found in our datasets, but for genres
we ran Principal Component Analysis using all categories. Content ratings
have a rather small number of categories, so we simply one-hot encoded the
feature. We did not transform the duration and premiere time values at all.
The created feature set will be tested against the Twitter derived features in
Section 7.3.
Footnote 7: Principal Component Analysis
7.2 Feature Generation Based on Twitter Data
After preprocessing the tweets and calculating emotion scores for each, we
have a dataset with features describing each tweet. To acquire a dataset
that characterizes movies, we must summarize the tweet features for each
film. The easiest way to do this is by summing or averaging all features of
tweets regarding a certain movie. But by doing this, we lose all time based
information, and also some knowledge about the distribution of features.
On the other hand, by not aggregating the features into a dense enough
format, e.g. summing the number of tweets for every day separately, we would
create a huge number of features, which would not help any regression model.
We wanted a balanced solution. Our best idea was to create a rather
small number of time-bins, within which we would average features. Fixed
width bins are not suitable for this purpose, because the frequency of tweets
changes radically in time, as an exponential rise can be observed nearing the
premiere. As can be seen in Figure 30, we created time-bins that adapt to
this distribution by exponentially decreasing their width nearing the release
date. The number of tweets falling into these bins is visible in the lower part
of Figure 30, where it can be observed that the uneven bins divide them quite
evenly.
Figure 30: The definition of time-bins with uneven width and the number of
tweets in them.
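Time-bins of exponentially decreasing width can be sketched as follows. The exact bin edges are those shown in Figure 30; the halving scheme, the number of bins and the field names below are only assumptions that illustrate the idea.

```python
def halving_bin_edges(total_days=128, n_bins=7):
    """Bin edges in days before the premiere: 128, 64, 32, ..., 2, 0 (assumed scheme)."""
    return [total_days // 2 ** i for i in range(n_bins)] + [0]


def bin_index(days_before, edges):
    """Index of the bin of a tweet written `days_before` days ahead of the premiere."""
    last = len(edges) - 2
    for i in range(last):
        if days_before > edges[i + 1]:
            return i
    return last


def mean_per_bin(tweets, feature, edges):
    """Average of one tweet feature inside every time-bin (None for empty bins)."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for tweet in tweets:
        i = bin_index(tweet["days_before"], edges)
        sums[i] += tweet[feature]
        counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


edges = halving_bin_edges()
```

Because the bins shrink as the premiere approaches, the exponentially growing tweet volume is spread over them far more evenly than with fixed width bins.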
We have two basic feature sets regarding tweets: the emotion scores and
the features extracted from the text during preprocessing. We will refer to
the latter as statistical Twitter features. Within these statistical features we
defined a few subgroups, and used different techniques to aggregate them.
For the features regarding the presence of links in the tweet (video, image,
all URL), we calculated their ratio among tweets altogether and in time-bins.
The text lengths and capitalized letter ratios were averaged for each film. For
the number of hashtags and handles we created categories to separate very
different types of tweets. The distributions of the two are quite similar, hence
we used the same boundaries to sort them into classes (Table 10).

Count category   Number of occurrences in tweet
0                0
1                1-3
2                4 or more

Table 10: Thresholds for hashtag and handle categories.
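The thresholds of Table 10 and the resulting per-film ratios can be sketched as:

```python
def count_category(n):
    """Map a hashtag or handle count to its category: 0, 1-3, or 4 and above."""
    if n == 0:
        return 0
    return 1 if n <= 3 else 2


def category_ratios(counts):
    """Share of a film's tweets falling into each of the three count categories."""
    categories = [count_category(n) for n in counts]
    return [categories.count(c) / len(categories) for c in (0, 1, 2)]
```

For a film whose tweets contain 0, 0, 1, 3, 4, 10, 2 and 0 hashtags, the three ratios are 0.375, 0.375 and 0.25.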
The emotion scores were treated similarly to the statistical features. We
calculated the mean and the median for the whole time period and the separate
bins as well. These features created for each film are, as far as we can tell,
independent of the number of tweets belonging to those films. To visualize
any significant characteristics of these created emotion features in different
income ranges, we averaged the time-binned average scores for lower and higher
income movies. The scores are shown for the 5 class emotions in Figure 31,
and for the positive label in Figure 32.

                          Aggregation method
Tweet features            without bins                   with bins
Statistical features
  #hashtag count          - average                      - average in bins
  @handle count           - ratios of count categories   - ratios of count categories
                            among tweets                   among tweets in bins
  has URL,                - ratio among tweets           - ratio among tweets
  has image link,                                          in bins
  has YouTube link
  text length,            - average                      - average in bins
  CAPS ratio
  existing                number of tweets               number of tweets in bins
Emotion features
  emotion class 1 score,  - average                      - average
  positivity score,       - median                       - median
  negativity score                                       - average of thresholded
                                                           scores

Table 11: Summarizing the aggregation of features.
Figure 31: The 5 class time-binned emotion scores averaged for all films with
lower income than $5 million and higher income than $80 million.
Figure 32: The positive time-binned emotion scores averaged for all films with
lower income than $5 million and higher income than $80 million.
We also tried aggregating in a different way by first thresholding the
emotion scores, then calculating the occurrence ratio of the resulting binary
values. Our reasoning was that this could capture different, more useful
aspects of the score distributions than a simple average or median. The
different aggregation methods used on the features are summarized in Table 11.
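The difference between the three aggregates can be illustrated with a small sketch. The threshold of 0.5 and the function name are assumptions made for this example only; the scores are invented.

```python
from statistics import mean, median


def aggregate_scores(scores, threshold=0.5):
    """Aggregate the per-tweet scores of one emotion class for one film or bin."""
    if not scores:
        raise ValueError("no scores to aggregate")
    # Binary indicator: does the tweet express the emotion strongly enough?
    above = [1 if s >= threshold else 0 for s in scores]
    return {
        "mean": mean(scores),
        "median": median(scores),
        "thresholded_ratio": sum(above) / len(above),
    }


# Two invented score lists with the same mean but a different shape.
polarized = aggregate_scores([0.1, 0.2, 0.9, 0.95, 0.15])
uniform = aggregate_scores([0.46] * 5)
```

Both lists have a mean of about 0.46, yet the thresholded ratio is 0.4 for the polarized list and 0 for the uniform one, so the thresholded feature separates cases that the plain average cannot.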
7.3 Income Prediction with Different Feature Sets
The features derived from tweets were evaluated by comparing the
performance of a regression model trained on movie meta-data only with that
of regression models that also incorporate Twitter features.
We used scikit-learn’s gradient boosted trees as our regression model, so we
could monitor the trained model’s feature importance property. This gives us
insight into the usefulness of each feature for the correct prediction of
income. We measured the performance of the models using mean squared
error as our metric.
Figure 33: Predictions of models trained on the normal income values, and on
log of income values using all movie meta-data features. On the X axis we see
the test film dataset sorted by income.
Our regression model penalizes the same relative error more at higher label
values. The model will therefore try to fit high income films more closely,
and somewhat neglect lower income movies. Because of the shape of the income
distribution (relatively few, very high grossing movies, see Figure 17), this
is problematic. We compared training on the logarithm of the label values with
training on the original labels. The results, seen in Figure 33, show that
with the log labels we fit the lower grossing movies much better, while
prediction quality on high income movies remains similar. Hence we use the log
of income values to train our model.
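A short numerical sketch makes the effect concrete. It assumes a squared-error objective and two invented films, each overestimated by the same factor of 1.5; none of these numbers come from our dataset.

```python
import math


def squared_error(y_true, y_pred):
    """Squared error on the original income scale."""
    return (y_true - y_pred) ** 2


def squared_log_error(y_true, y_pred):
    """Squared error after taking the natural log of both values."""
    return (math.log(y_true) - math.log(y_pred)) ** 2


# Invented incomes: a $1 million and a $100 million film.
low_true, high_true = 1e6, 100e6
factor = 1.5  # both predictions are 50% too high

raw_low = squared_error(low_true, low_true * factor)
raw_high = squared_error(high_true, high_true * factor)
log_low = squared_log_error(low_true, low_true * factor)
log_high = squared_log_error(high_true, high_true * factor)

# On the raw scale the high income film dominates the loss by a factor
# of (100)^2; on the log scale both films contribute equally.
raw_ratio = raw_high / raw_low
```

On the original scale the same relative mistake costs 10,000 times more for the larger film, while on the log scale both errors equal (ln 1.5)^2, so low income films are no longer neglected.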
The parameters of the model were tuned while training solely on movie
meta-data features. After experimenting with both fewer, deeper trees and a
larger number of shallower trees, we found the best settings to be
50 estimators with a maximum depth of 3. We always used 20% of the films
as a test dataset.
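The setup described above can be sketched with scikit-learn as follows. Only the 50 estimators, the maximum depth of 3, the 20% test share and the log-income target are taken from the text; the synthetic feature matrix, the random seeds and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the film feature matrix (assumption, not real data).
rng = np.random.default_rng(0)
n_films, n_features = 200, 6
X = rng.normal(size=(n_films, n_features))
# Synthetic log-income that depends mostly on the first feature.
log_income = (15.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1]
              + rng.normal(scale=0.3, size=n_films))

# Hold out 20% of the films as a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, log_income, test_size=0.2, random_state=0)

model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)
model.fit(X_train, y_train)

# MSE on the log of income, and one importance value per feature.
test_mse = mean_squared_error(y_test, model.predict(X_test))
importances = model.feature_importances_
```

The `feature_importances_` attribute is normalized to sum to 1, which makes the values comparable across models trained on the same feature set.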
The performance of different feature set combinations was measured
using 500-fold cross validation. We used this many folds because the mean
squared error (MSE) value fluctuated immensely between different train-test
splits. To visualize this, we plotted the distributions of the test errors for
a few feature sets in Figure 34.
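One way to obtain such a distribution of test errors is repeated random splitting, sketched below. That the 500 folds were random 80/20 splits is an assumption of this sketch; the data is synthetic and only 20 splits are used to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for film features and log-income (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 5))
y = 15.0 + X[:, 0] + rng.normal(scale=0.5, size=150)

# Each split holds out a random 20% of the films as a test set.
splitter = ShuffleSplit(n_splits=20, test_size=0.2, random_state=0)
model = GradientBoostingRegressor(n_estimators=50, max_depth=3, random_state=0)

# scikit-learn reports the negated MSE so that higher is better.
neg_mse = cross_val_score(model, X, y, cv=splitter,
                          scoring="neg_mean_squared_error")
mse_scores = -neg_mse

mean_mse, std_mse = mse_scores.mean(), mse_scores.std()
```

Comparing feature sets by the mean over many splits, with the spread reported alongside it, is less sensitive to a single lucky or unlucky split.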
Figure 34: Distributions of mean squared error values for log of income from
500-fold cross validations on different feature sets.
In Figure 37 we can see that we succeeded in enhancing our baseline model
by adding emotion based features. The 5 emotion classes helped more than
the binary emotions, but the difference is not as great as we had hoped for. It
is also somewhat disappointing that the statistical Twitter features boost the
performance far more than any emotion feature. We could not improve the
performance further by combining the two. One may argue though that some of
these statistical features also carry emotional charge, e.g. the capital
letter ratio. We trimmed the number of features in the best emotion based
feature sets (for both types of classes) and also in the feature set involving
binned statistics. By keeping only the 30 features deemed most important (25
in case of the 2 class emotion features), we achieved considerable improvement
in all cases. The best feature set we managed to assemble was created by
selecting the 30 most important features of the set containing the pre-release
movie features and the binned statistical features. The feature importances
can be seen in Figure 35.
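The trimming step can be sketched as ranking features by importance and keeping the top ones. The helper name, the synthetic data with 40 features and the seeds are assumptions of this example; only the model settings and the choice of 30 kept features follow the text.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def select_top_features(X, y, k, random_state=0):
    """Return the column indices of the k most important features."""
    model = GradientBoostingRegressor(
        n_estimators=50, max_depth=3, random_state=random_state)
    model.fit(X, y)
    # Sort indices by descending importance and keep the first k.
    order = np.argsort(model.feature_importances_)[::-1]
    return np.sort(order[:k])


# Synthetic data: 40 features, only two of which carry signal.
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 40))
y = 15.0 + 2.0 * X[:, 3] + 1.0 * X[:, 17] + rng.normal(scale=0.3, size=300)

top = select_top_features(X, y, k=30)
X_reduced = X[:, top]  # feature matrix used for retraining
```

In this sketch the ranking is computed on whatever data is passed in. Computing it on the training portion only would avoid leaking information from the test films into the selection.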
Predictions were made using the movie feature set, the best emotional
feature set and the overall best feature set. These can be seen in Figure 36.
Figure 35: The feature importance values of the best performing feature set.
In order to gain clear insight into the relative importance of different
emotions, we trained the same gradient boosted tree model we used before
solely on the binned and thresholded 5 class emotion features. The feature
importance values were recorded and averaged over 500 different training
sessions. The results can be seen in Figure 38. To assure ourselves that the
importance of binned emotion features is not only an effect of the number of
tweets in those bins, we did the same using only the tweet bin count features.
These feature importances can also be seen in Figure 38, as the last row of
the matrix. The visualized values seem to imply that the number of tweets in
bins and the ratio of emotions in bins carry independent information. However,
the earlier examined predictive power of different feature set combinations
shows us that our model could not utilize this.
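Averaging importances over repeated training sessions can be sketched as follows. The 6 time-bins, the emotion-major column ordering, the 10 runs and the synthetic data are assumptions of this example; in the experiment above 500 sessions were averaged.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def mean_importances(X, y, n_runs=10, test_size=0.2):
    """Average feature importances over several random training subsets."""
    totals = np.zeros(X.shape[1])
    for seed in range(n_runs):
        # Each session trains on a different random 80% of the films.
        X_train, _, y_train, _ = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        model = GradientBoostingRegressor(
            n_estimators=50, max_depth=3, random_state=seed)
        model.fit(X_train, y_train)
        totals += model.feature_importances_
    return totals / n_runs


# Synthetic features: 5 emotions x 6 bins, stored emotion by emotion.
n_emotions, n_bins = 5, 6
rng = np.random.default_rng(3)
X = rng.random(size=(200, n_emotions * n_bins))
# Only the last emotion in the last bin carries signal in this toy data.
y = 15.0 + 2.0 * X[:, 29] + rng.normal(scale=0.2, size=200)

# One row per emotion, one column per time-bin, as in a heat map.
importance_matrix = mean_importances(X, y).reshape(n_emotions, n_bins)
```

Since each session yields importances that sum to 1, the averaged matrix also sums to 1, so rows and columns can be compared directly.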
Figure 36: Predictions made using 3 different feature sets and the real income
figures.
Figure 37: Performance of regression model trained on different feature sets. MSE scores calculated for log of incomes.
Figure 38: The feature importances of the 5 class emotions and number of
tweets in different time-bins.
8 Conclusions & Future Work
We successfully gathered a large number of tweets written about films with
known gross domestic income in the US. We cleaned these, creating a large
dataset for natural language processing, and simultaneously extracted text
format features. Using contextual information, we created embeddings of ASCII
emoticons and used these to create emotion classes. We extracted emotional
information from tweets in the form of scores for these classes, and
summarized it alongside the text format features for each movie. Based on
these and movie meta-data features, we trained regression models to predict
the income of movies.
Using two different kinds of emotion classification (5 classes and 2
classes), we compared their ability to predict income figures.
In our experiments we found that it helps to use the more fine-grained emotion
classes, but the difference was not large enough to support strong statements.
The accuracy of our predictions could be further improved by gathering
more training data (for example about newly released movies) or by combining
our current model with a model trained for outlier detection. Enhancing
our sentiment analysis model could also help. This may be done, for example,
by capitalizing on the emotional aspects of the text format features, using
a better subset and grouping of emoticons for labeling, or training on
pre-labeled texts. Due to the high number of misspelled words on Twitter, we
did not try widely used pre-trained embeddings that were trained on longer
texts with better spelling (e.g. books). But if we manage to replace commonly
misspelled words with their correct form, this becomes an option to be
considered.
