Ahmad Et Al. - 2020 - Movie Revenue Prediction Based On Purchase Intenti

Information Processing and Management 57 (2020) 102278
Movie Revenue Prediction Based on Purchase Intention Mining Using YouTube Trailer Reviews
Ibrahim Said Ahmad (a), Azuraliza Abu Bakar (b,⁎), Mohd Ridzwan Yaakub (b)
(a) Department of Information Technology, Faculty of Computer Science and Information Technology, Bayero University Kano, 700241 Kano, Nigeria
(b) Center for Artificial Intelligence Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, 43600 UKM Bangi, Selangor, Malaysia
ARTICLE INFO
Keywords: Box-office; Movie revenue; Sentiment analysis; Data mining; Machine learning; Trailer reviews
ABSTRACT
The increase in acceptability and popularity of social media has made extracting information from the data generated on social media an emerging field of research. An important branch of this field is predicting future events using social media data. This paper is focused on predicting box-office revenue of a movie by mining people's intention to purchase a movie ticket, termed purchase intention, from trailer reviews. Movie revenue prediction is important due to risks involved in movie production despite the high cost involved in the production. Previous studies in this domain focus on the use of Twitter data and IMDB reviews for the prediction of movies that have already been released. In this paper, we build a model for movie revenue prediction prior to the movie's release using YouTube trailer reviews. Our model consists of novel methods of calculating purchase intention, positive-to-negative sentiment ratio, and like-to-dislike ratio for movie revenue prediction. Our experimental results prove the superiority of our approach compared to three baseline approaches and achieved a relative absolute error of 29.65%.
1. Introduction
In the U.S., the motion picture industry produces approximately 500 movies a year, garnering, on average, $60 million of investment capital per film. Despite the large capital investment needed for movie production, the success, or profitability, of a movie is largely uncertain (Lash, Fu, Wang, & Zhao, 2015). Additionally, statistical analysis of historical Chinese movie market data revealed that of all the movies released in the first half of 2013, only a fraction made profits (Duan, Ding, & Liu, 2017). Thus, designing effective models to predict the future market performance of upcoming movies will benefit producers, sponsors, and theatres.
Previous studies have investigated numerous predictive indicators for movie revenue prediction, with unsatisfying results
(Chen, Li, Yao, & Zhou, 2019; Court, Gillen, McKenzie, & Plott, 2018; Zhou, Zhang, & Yi, 2017). The uncertainties of movie box-office
revenues can be linked to the production and distribution of movies, ranging from the actors, directors, and budgets before the release
and Word-of-Mouth (WoM) marketing, and screen arrangement after the release (Ghiassi, Lio, & Moon, 2015). The majority of the
studies make predictions based on easily accessible movie metadata, such as genres, budgets, or by referring to market performances
of similar movies in history. Although critic reviews and blog contents are abundantly available, they have not been exploited until
recently with the advances in natural language processing. Therefore, this paper is designed to answer the following research
questions:
⁎ Corresponding author.
E-mail addresses: isahmad.it@buk.edu.ng (I.S. Ahmad), azuraliza@ukm.edu.my (A.A. Bakar), ridzwanyaakub@ukm.edu.my (M.R. Yaakub).
https://doi.org/10.1016/j.ipm.2020.102278
Received 2 September 2019; Received in revised form 20 February 2020; Accepted 21 April 2020
0306-4573/ © 2020 Elsevier Ltd. All rights reserved.
1 How can we accurately extract movie ticket purchase intention from movie reviews?
2 Can sentiment analysis of YouTube trailer reviews and movie metadata be used for early prediction of movie revenue?
To address these questions, we crawled movie trailer reviews of twenty-nine movies released in 2016 and 2017. We also extracted metadata and financial information of the movies from the Box Office Mojo website. We used the extracted data to prepare a movie revenue prediction dataset. The dataset consists of six independent variables and one dependent variable. Three of the independent variables, namely purchase intention, weighted positive-to-negative sentiment ratio, and weighted like-to-dislike ratio, are calculated using novel methods.
Our experimental results show that the proposed movie reviews purchase intention mining approach is effective, achieving an
accuracy of 0.90. The results also show that purchase intention, weighted positive-to-negative sentiment ratio, and weighted like-to-
dislike ratio features are highly correlated with movie revenue. Consequently, our model achieved better performance than three
baseline models with a relative absolute error of 29.65%. In summary, the main contributions of this article are:
• This article is among the first studies to use purchase intention mining from reviews for the prediction of business performance, in our case, movie revenue.
• We generate a purchase intention lexicon for purchase intention mining from movie reviews.
• We extract, pre-process, and analyse a dataset from YouTube trailer reviews of twenty-nine movies released in 2016 and 2017.
This dataset will be made available to the public for further research on movie revenue prediction using social media data.
• We compare the effectiveness of four different machine learning algorithms, namely, Linear Regression, Support Vector Machines
(SVM), Multilayer Perceptron Neural Network (MLP-NN), and Random Forest (RF), for movie revenue prediction using movie
reviews.
The remainder of this paper is organised as follows: Section 2 presents a review of previous studies. Section 3 presents the
proposed method, while in Section 4, the experimental setup is provided. Then, in Section 5, the experimental results are presented.
Section 6 presents the conclusion and discussion. Finally, the implications of the study and future works are highlighted in Section 7.
2. Related Work
Since the development of web 2.0, which is focused on user-generated content, websites are being developed to be as interactive
as possible. It has also led to the development of social media platforms. As such, the contemporary web is a massive source of data
through user-generated content shared on the internet (Ahmad, Bakar, & Yaakub, 2015; Al-Moslmi, Albared, Al-Shabi, Omar, &
Abdullah, 2018; Al-Moslmi, Omar, Abdullah, & Albared, 2017). These data include reviews from web pages, tweets, and Facebook
posts, which are collectively referred to as big data. Analysing this data is a well-researched area (Awwalu, Bakar, & Yaakub, 2019;
Yaakub, Li, & Zhang, 2014). One of the most prominent research focuses in big data research is business intelligence.
Previous studies in this field include stock market prediction, product reputation prediction model, and sales prediction model.
Jin et al. (2016) employed the Kalman filter method to forecast the trends of consumer requirements and a Bayesian method to
compare products. The objective was to help designers understand changes in customer requirements and their competitive ad-
vantages. Farhadloo et al. (2016) proposed a Bayesian method for modelling customer satisfaction from online reviews. Their ap-
proach applied cluster analysis to transform unstructured data into a semi-structured form, without the need to determine aspects a
priori. The Bayesian model was able to predict individual aspect ratings, and then, considered the overall rating of each review as the
weighted sum of the individual ratings of the aspects. This value was used to predict the total customer satisfaction. The accuracy of
this approach in finding the significant aspects was 88.3%. The average R² values for the predicted total customer satisfaction rating
using this model ranged between 0.892 and 0.999.
In movie revenue prediction, one of the most common ways of identifying the success of a movie is by calculating its performance
in the box-office. Thus, predicting the box-office success of a movie is one of the most popular methods of predicting how successful a
movie will be. The related literature is discussed under two sections: movie success prediction using metadata, and movie success
prediction using social media. Figure 1 shows the budget and box-office revenue of 10 movies released in 2016 and 2017. It can be
observed that some movies like King Arthur and Monster Trucks were box-office flops despite a budget of over 100 million dollars.
2.1. Movie Revenue Prediction using Metadata
Predicting movie success has generally been viewed as a challenging task by the stakeholders. It has been termed a “wild guess” (Litman, 1983). Notwithstanding, due to the huge investment required in movie making, predicting the success of a movie is a well-researched area. Most commonly, WoM, critics' reviews, casting, time of release, Motion Picture Association of America (MPAA) rating, genre, and other pre-release data were investigated to determine how they affect the success of a movie (Ghiassi et al., 2015; Lee & Chang, 2009; Ru, Li, Liu, & Chai, 2018; Sharda & Delen, 2006; Zhang, Luo, & Yang, 2009).
Later studies used the vast movie data available online, most commonly on IMDB, to forecast the success of a movie. Sharda and Delen (2006) used a multi-layer perceptron neural network to forecast the box-office success of movies. They treated the prediction problem as a classification problem by classifying the box-office success of movies into nine categories, from blockbuster to flop. Zhang et al. (2009) improved on the previous study by using a multi-layer BP neural network to make predictions. Their classification was divided into six categories. Ghiassi et al. (2015) also used a variant of Neural Network (NN), called the dynamic neural network, to predict the box-office success of a movie. Their approach was able to improve the prediction accuracy by 32.8% compared to existing models. They showed that by adding prediction budget, advertisement, and seasonality variables to their model, it was able to achieve an accuracy of 94.1%.
Fig. 1. The budget and box-office returns of 10 movies released in 2016 and 2017
2.2. Movie Revenue Prediction using Social Media
One of the earliest studies related to predicting movie success using social media data was by Asur and Huberman (2010). They used linear regression (LR) to predict the success of the opening week of a movie using related tweets. They found a strong correlation between the number of tweets, the sentiment polarity of the tweets, and box-office success. This study opened the way for other researchers to improve on it. Lu, Wang, and Maciejewski (2014) explored the influence of a movie sentiment score from tweets, in conjunction with other pre-release data such as the genre, budget, and rating, on the opening weekend gross of a movie. The movie sentiment score was calculated as positive tweets divided by the sum of positive and negative tweets. Their dataset was composed of tweets about 24 movies extracted from Twitter over a period of three months. Their prediction was based on multiple linear regression, with the conclusion that social media visual analytics is a relatively effective method for predicting business performance. However, there are numerous challenges in applying it to all domains of business intelligence. T. Liu, Ding, Chen, Chen, and Guo (2016) adopted a similar approach but introduced a novel variable, known as the purchase intention. The purchase intention is defined as the intention of an individual to buy a product or service. They modelled the purchase intention from tweets using a support vector machine. Their dataset consisted of five million tweets regarding 57 movies. They also used LR and SVM for their predictions. They found that social media data can be correlated to box-office revenues and that mining purchase intention from users can lead to a more accurate prediction.
Gaikar, Marakarkandy, and Dasgupta (2015) experimented with a Fuzzy Inference System (FIS) on tweets to predict the performance of Bollywood movies. The FIS used sentiment score, actor/actress rating, and hype factor for its prediction. Their dataset consisted of 10,269 tweets regarding 14 movies that were due to be released within a week. Their findings indicated that pre-release sentiments and hype factors are of vital importance to box-office success.
Lipizzi, Iandoli, and Marquez (2016) proposed an in-depth exploration method for social media data to make movie success predictions. This method consisted of sentiment, traffic, social, and conversational variables extracted from the dataset. The dataset contained 2 million tweets about 22 movies within the 72-hour period during the weekend of their release. They showed that combining traffic metrics with social network or conversational indicators yielded better accuracy than combining the traffic metrics with sentiment analysis. They stated that the importance of sentiment analysis in box-office prediction is overstated. However, this observation could have been affected by the duration and period of their extracted data.
Rajput, Sapkal, and Sinha (2017) proposed a method for tackling polarity shift in sentiment polarity calculation. They did so by computing a final sentiment value that aggregates the sentiment analysis value of tweets related to specified movies with the value of their reverse sentiment analysis. The prediction was based on multivariate LR consisting of six independent variables, namely, sentiment polarity, hype, actor, holiday effect, genre, and sequel. Their approach was able to improve prediction accuracy.
Bhattacharjee, Sridhar, and Dutta (2017) investigated whether the polarity of social media content about Bollywood movies can be used to understand the box-office performance of movies. The independent variables used in their study were a sentimental score, a cumulative sentiment score, and a cumulative negative score. The sentimental score was calculated as the total number of sentiment words divided by the total number of all the words in the document. Their data consisted of social media content from seven Bollywood movies released between 2013 and 2014. They showed that there was a positive association between the sentiment polarity of social media and the box-office success of Bollywood movies. A summary of related works is presented in Table 1.
Table 1
Summary of previous works
| Reference | Dataset | Period | Algorithm | Accuracy |
|---|---|---|---|---|
| (Ru et al., 2018) | Genre, WoM, Distributor, Country | Daily | Deep learning | MAPE: 30.1% |
| (Choudhery & Leung, 2017) | Tweets: 6 movies (SA) | Weekend | Polynomial regression | |
| (Zhou et al., 2017) | Genre, Budget, WoM, Rating, Participants, Duration | Gross | NN | 88.60% |
| (Xiao, Li, Chen, Zhao, & Xu, 2017) | | Weekend | LR | MAPE: 1.2042 |
| (Duan et al., 2017) | Budget, WoM, Awards, Screens, Duration | Weekend/Gross | Gaussian Copula Regression | r2: 0.824, MAE: 54.7; r2: 0.907, MAE: 5.26 |
| (Bhattacharjee et al., 2017) | Tweets: 7 movies (SA) | Discrete values | LR | R2: 0.97 |
| (Shim & Pourhomayoun, 2017) | Tweets (SA), Budget, Screens, Weather | Daily | LR | |
| (T. Liu et al., 2016) | Tweets (Post-rate, PI, SA), Screens, Star power, Director power | Weekend/Gross | LR & SVM | r2adj: 0.95, RAE: 0.28; r2adj: 0.70, RAE: 0.71 |
| (Ghiassi et al., 2015) | Genre, Rating, Screens, Star power, Competition, Sequel, Special effects | Gross | NN | 94.1% |
| (Lash et al., 2015) | Genre, Star power, Season, Team cohesion | Gross | Logistic Regression | 0.771 |
| (Du, Xu, & Huang, 2014) | Tweets (SA, Tweet-rate, Comment rate, Formula) | | LR, SVM, & NN | MAPE: 0.1837 |
| (Asur & Huberman, 2010) | Tweets (SA, Tweet-rate, PNratio) | | LR | R2adj: 0.8 |
| (Lu et al., 2014) | Tweets (SA, Star power, Tweet-rate, Tweet sentiment), Genre, Budget | Weekend | LR | MRAE: 0.285 |
| (Zhang et al., 2009) | Genre, Season, Screens, Competition, Advertisement | Discrete | NN | 82.9% |
| (Sharda & Delen, 2006) | Genre, Rating, Screens, Star power, Competition, Sequel, Special effects | Discrete | NN | 0.752 |
Table 1 shows that these studies have used movie-specific metadata for making movie revenue predictions. The most commonly used prediction algorithms are variants of neural networks. The metadata used the most are movie genre, movie budget, the popularity of the leading star, and the MPAA rating of the movie. However, starting in 2010, studies began to use sentiment analysis for movie revenue prediction. Movie tweets are usually used for these predictions, and features such as the total number of tweets and the sentiments contained in the tweets are used as the independent variables. Only the study by T. Liu et al. (2016) was found to have extracted people's intention to purchase a movie ticket (PI) from tweets for their prediction, which showed promising results. As such, this current study aims to improve the purchase intention mining approach for predicting movie revenue using YouTube trailer reviews.
3. Proposed Method
We tackle the problem of movie revenue prediction by focusing on sentiment analysis, specifically by extracting more features from YouTube trailer reviews for the prediction. Our proposed method has the following novel contributions:
1 To be able to predict movie revenue before its theatrical release, we propose the use of YouTube trailer reviews for the prediction.
2 We propose a new feature called weighted like-to-dislike ratio (WLDratio) to represent the like function on YouTube.
3 We also propose a new method of representing the sentiments feature, called Weighted Positive-to-Negative sentiment ratio (WPNratio).
4 We also propose a new approach for calculating the purchase intention (PI) feature, called Movie Review Purchase Intention (MRPI).
Consequently, there are two broad tasks involved in our proposed model: first, extracting relevant data from YouTube trailer reviews, and second, using that data for movie revenue prediction. Thus, we propose a novel two-stage model for predicting box-office revenue from YouTube trailer reviews before the theatrical release of a movie. Stage 1 is the MRPI approach, which involves the extraction of purchase intention, while Stage 2 is the box-office prediction stage, which involves the use of the purchase intention variable together with other novel variables for box-office revenue prediction. The model is shown in Figure 2 and further discussed in the Experimental Setup section.
Fig. 2. Model of Box-office prediction using trailer reviews
4. Experimental Setup
4.2. Stage 1: Movie Reviews Purchase Intention Mining Approach
In this stage, we extract purchase intention from movie reviews using the proposed MRPI approach. The approach involves two
tasks. First to develop a purchase intention lexicon, and second, to develop an algorithm that uses the purchase intention lexicon to
classify movie reviews as to whether they signify purchase intention or not. The details of each task are further explained in the
following subsections.
4.2.1. Movie Reviews Purchase Intention Lexicon

The identification of purchase intention can be modeled by first identifying a set of words that show a strong likelihood of purchase. Therefore, a Movie Reviews Purchase Intention Lexicon (MRPIL) was developed from movie reviews. MRPIL was developed in two steps as follows:
1 Generate seed words from movie reviews: The method used to generate the seed words is illustrated in Figure 2. First, the data is pre-processed by removing stop-words, lemmatization, and then POS tagging using the TextBlob library in Python. TF-IDF was used to rank all the words. The top 200 bi-grams were selected from the data. The resulting data was then manually annotated as to whether it signifies purchase intention or not. The bi-grams assigned to purchase intention are selected as the seed lexicon in MRPIL.
2 Expand the seed terms using synonyms: MRPIL was expanded by including synonyms with identical meaning to the initial seed terms. This was done using synonyms from thesaurus.com. The synonyms of each word are added, then synonyms of the synonyms are also added, continuing until no new synonyms are identified.
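The expansion in step 2 amounts to a transitive closure over a synonym relation. The following is a minimal Python sketch of that idea, not the authors' implementation: it assumes a hypothetical in-memory `synonyms` mapping in place of lookups on thesaurus.com, and the function name and sample terms are illustrative only.

```python
from collections import deque


def expand_lexicon(seed_terms, synonyms):
    """Return seed_terms plus every term reachable through the synonym map.

    `synonyms` is a hypothetical dict {term: iterable of synonyms}; the paper
    looked synonyms up on thesaurus.com instead.
    """
    lexicon = set(seed_terms)
    queue = deque(seed_terms)
    while queue:
        term = queue.popleft()
        for candidate in synonyms.get(term, ()):
            if candidate not in lexicon:  # only unseen synonyms are queued
                lexicon.add(candidate)
                queue.append(candidate)
    return lexicon


# Illustrative data only, not taken from MRPIL.
toy_synonyms = {
    "watch": ["see", "view"],
    "see": ["watch", "catch"],
    "catch": [],
}
expanded = expand_lexicon(["watch"], toy_synonyms)
```

The loop stops once no new synonyms are found, which mirrors the stopping condition described in step 2. In practice, unrestricted closure over a thesaurus can drift in meaning, so a manual check of the added terms may be needed.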
4.2.2. Purchase Intention Mining Algorithm

A Purchase Intention Mining (PIM) algorithm was proposed based on MRPIL. The algorithm works by comparing each review from a dataset with MRPIL. If any term in the MRPIL is contained in the review, then the review is classified as 1, meaning it signifies purchase intention; otherwise, it is classified as 0, meaning it does not signify purchase intention. The total purchase intention coefficient of a document is calculated as the total number of reviews that signify purchase intention. Examples of YouTube trailer reviews with purchase intention terms are presented in Figure 3. The PIM algorithm is presented in Algorithm 1.
Fig. 3. Examples of reviews with purchase intention indicators
Algorithm 1
PIM algorithm.
Require:
• The purchase intention lexicon PIL
• The set R of k movie trailer reviews from YouTube
Steps:
1. PI ← 0
2. For each review ri in R (i = 1, …, k) do
3. PIi ← 0
4. If ri contains any term in PIL then
5. PIi ← 1
6. PI ← PI + 1
7. End If
8. End For
9. Return PI
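Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions: matching is done by case-insensitive substring search, which is a simplification (the paper pre-processes and lemmatizes reviews first), and the function name, lexicon, and reviews are hypothetical.

```python
def mine_purchase_intention(reviews, lexicon):
    """Flag each review (1 or 0) and count the reviews that signal purchase intention."""
    terms = [term.lower() for term in lexicon]
    flags = []
    for review in reviews:
        text = review.lower()
        # A review is flagged as soon as any lexicon term occurs in it.
        flags.append(1 if any(term in text for term in terms) else 0)
    return flags, sum(flags)


# Illustrative lexicon and reviews, not taken from MRPIL or the dataset.
toy_lexicon = ["can't wait", "will watch"]
toy_reviews = [
    "I can't wait for this movie!",
    "The trailer looks boring.",
    "I will watch it on opening day.",
]
flags, total_pi = mine_purchase_intention(toy_reviews, toy_lexicon)
```

Here `flags` holds the per-review classification and `total_pi` corresponds to the total purchase intention coefficient described above.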
4.2.3. Evaluation of MRPI

We evaluate the effectiveness of MRPI using the movie review dataset by Ding, Cai, Liu, and Shi (2018). The dataset consists of 5,544 instances, divided into 4,432 for training, 554 for development, and 558 for testing. The performance of the MRPI approach was measured according to the accuracy, precision, and recall performance metrics. These metrics are the most popular evaluation metrics used in classification problems. The metrics are briefly explained as follows:
1 Accuracy is a simple evaluation measure calculated as the ratio of correctly predicted values to the total values. The equation is given by:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \qquad (1)$$
2 Precision is calculated as the ratio of correctly predicted positive values to the total predicted positive values. Precision tells us how much of the data classified as positive is classified correctly. Precision is given by:
$$\text{Precision} = \frac{TP}{TP + FP} \qquad (2)$$
3 Recall is the ratio of correctly predicted positive values to all values in the actual positive class. Recall tells us how much of the actual positive data is classified correctly; it is given by:
$$\text{Recall} = \frac{TP}{TP + FN} \qquad (3)$$
Where TP stands for true-positive, TN stands for true-negative, FP stands for false-positive, and FN stands for false-negative.
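As a worked illustration of Equations 1 to 3, the short Python sketch below computes the three metrics from confusion-matrix counts. The counts are made up for the example and are not results from the paper; returning 0.0 when a denominator is zero is a convention chosen here, not something the paper specifies.

```python
def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Guard against empty denominators (assumed convention: report 0.0).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical counts for illustration only.
accuracy, precision, recall = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

With these counts, accuracy is 85/100, precision is 40/45, and recall is 40/50.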
4.3. Sentiment Analysis
We used sentiment analysis to extract sentiments from YouTube trailer reviews. The step consists of three tasks: data cleaning and pre-processing, feature selection, and sentiment classification. We carried out the sentiment analysis using the TextBlob library for Python.
4.3.1. Dataset
The dataset used in this research consists of sentiments extracted from movie trailer reviews. The movie trailer reviews were
crawled from YouTube (www.youtube.com). YouTube is the biggest video sharing website in the world (Burgess & Green, 2018). The
reviews crawled are of twenty-nine movies released in the years 2016 and 2017 posted on the official channel of the movie pro-
duction company. We used an online tool called YouTube Comment Scrapper (www.ytcomments.klostermann.ca) for extracting the
reviews.
The twenty-nine movies included in the dataset were selected at random but under three categories. The first category is movies that made a profit of over $400,000,000. The second category is movies that made a profit of less than $200,000,000. Lastly, the third category is made up of movies that were not able to make a return-on-investment (ROI). The dataset extracted from twenty-nine movies is limited when compared to previous studies, specifically studies that used movie metadata for box-office revenue prediction. Notwithstanding, the number of movies selected in this research is still sufficient to draw reasonable conclusions and is similar to the number of movies used in other prominent studies (Du et al., 2014; Lipizzi, Iandoli, & Marquez, 2016; T. Liu et al., 2016) that have achieved good results. The returns of a movie were calculated using Equation 4:
$$\text{returns} = \text{box-office revenue} - \text{budget} \qquad (4)$$
Where returns represents the coefficient of profit or loss, budget is the movie budget, and box-office revenue is the total revenue generated from the box-office.
Information about movie budget and box-office revenue is acquired from Box Office Mojo (https://www.boxofficemojo.com/).
Box Office Mojo is one of the biggest websites that store detailed information about movies.
4.3.2. Data Cleaning and Pre-processing

Reviews are usually noisy, consisting of misspellings, symbols, and other textual jargon. In order to perform sentiment analysis efficiently and with high accuracy, data cleaning and pre-processing are necessary. The difference between data cleaning and data pre-processing is that data cleaning is done to correct noisy data, whereas data pre-processing is not necessarily done to clean noisy data but to transform the data into a more useful form for sentiment analysis. The following tasks were done in the data cleaning step: correction of misspellings, removal of special characters, and replacing emoticon characters with the emotion words they represent.
On the other hand, data pre-processing is the transformation of data into a more readable form for the data mining process. The following popular data pre-processing tasks were done: removal of stop-words, lemmatization, and n-gram notation.
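Some of the cleaning and pre-processing tasks listed above can be sketched in plain Python. This is a simplified illustration, not the authors' pipeline: the emoticon map and stop-word list are tiny hypothetical stand-ins, and spelling correction and lemmatization are omitted because they need an external resource such as TextBlob.

```python
import re

# Hypothetical, deliberately tiny resources; a real pipeline would use fuller lists.
EMOTICONS = {":)": "happy", ":(": "sad"}
STOP_WORDS = {"the", "a", "is", "this", "to", "i"}


def clean(text):
    """Replace emoticons with emotion words, then drop special characters."""
    for emoticon, word in EMOTICONS.items():
        text = text.replace(emoticon, f" {word} ")
    text = re.sub(r"[^a-z\s']", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, n=2):
    """Remove stop-words and return the remaining tokens as n-grams."""
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


cleaned = clean("This trailer is AMAZING!!! :)")
bigrams = preprocess(cleaned)
```

Emoticons are replaced before special characters are removed, since the removal step would otherwise destroy them.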
4.3.3. Feature Selection

Feature selection is a feature reduction technique in sentiment analysis used to select the most informative features from a dataset. We used the information gain algorithm, which is a very popular feature ranking algorithm (Agarwal & Mittal, 2013; Khoshgoftaar, Gao, Napolitano, & Wald, 2014; Schouten, Frasincar, & Dekker, 2016), to calculate the importance of each individual feature in our reviews. The value of information gain is within the range [0, 1], with 1 as most important and 0 as least important. Therefore, all features having an information gain value greater than 0 were selected.
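For readers unfamiliar with information gain, the sketch below shows the standard entropy-based definition for a binary (present/absent) feature. It is an illustrative sketch only: the paper does not state which implementation was used, and the labels and feature vectors here are made up.

```python
import math


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_present, labels):
    """Entropy of the labels minus the entropy left after splitting on a binary feature."""
    total = len(labels)
    remainder = 0.0
    for value in (True, False):
        subset = [lab for f, lab in zip(feature_present, labels) if f == value]
        if subset:
            remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Illustrative data: one feature that separates the classes, one that does not.
labels = ["pos", "pos", "neg", "neg"]
ig_informative = information_gain([True, True, False, False], labels)
ig_useless = information_gain([True, False, True, False], labels)
```

Under the selection rule described above, the first feature (gain 1) would be kept and the second (gain 0) dropped.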
4.3.4. Sentiment Classification

We used the ‘sentiment’ property of the TextBlob library to determine the polarity of each review. In sentiment analysis, the polarity of a review is a number within the range [-1.0, 1.0] that signifies whether the review is positive or negative: -1.0 signifies negativity, while 1.0 signifies positivity. Reviews with a polarity greater than 0 were classified as positive, while those with a polarity less than 0 were classified as negative. Finally, the percentages of positive and negative reviews for each movie were determined. Table 2 shows the first ten movies in our dataset and their total number of reviews.
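The thresholding step can be sketched as follows. The function takes polarity scores as input so that it runs without TextBlob; with TextBlob installed, each score could be obtained from `TextBlob(review).sentiment.polarity`. Treating a polarity of exactly 0 as neutral is an assumption made here, suggested by the Neutral Reviews column in Table 2, and the scores below are made up.

```python
def sentiment_percentages(polarities):
    """Return the percentage of positive, negative, and neutral reviews."""
    total = len(polarities)
    positive = sum(1 for p in polarities if p > 0)
    negative = sum(1 for p in polarities if p < 0)
    neutral = total - positive - negative  # polarity exactly 0 (assumption)
    return tuple(100 * count / total for count in (positive, negative, neutral))


# Illustrative polarity scores, not taken from the dataset.
pos_pct, neg_pct, neu_pct = sentiment_percentages([0.5, 0.8, -0.3, 0.0, 0.0])
```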
4.4. Stage 2: Box-office Revenue Prediction
In the box-office prediction stage, purchase intention, review sentiments, and movie metadata were used in the proposed box-office revenue prediction method. The method has two steps, namely dataset preparation and prediction. The steps are fully explained in the following sub-sections.
Table 2
Sentiment analysis result for first ten movies in the dataset.
| SN | Name | Positive Reviews (%) | Negative Reviews (%) | Neutral Reviews (%) | Number of reviews |
|---|---|---|---|---|---|
| 1 | Beauty and the Beast | 40 | 12 | 48 | 19924 |
| 2 | Despicable Me 3 | 27 | 11 | 62 | 9858 |
| 3 | Guardians of the Galaxy Vol. 2 | 38 | 7 | 55 | 13938 |
| 4 | It | 25 | 20 | 55 | 46356 |
| 5 | Justice League | 35 | 17 | 48 | 51097 |
| 6 | Spider-Man: Homecoming | 35 | 12 | 53 | 8728 |
| 7 | Star Wars: The Last Jedi | 30 | 12 | 58 | 65057 |
| 8 | Thor: Ragnarok | 33 | 11 | 56 | 45106 |
| 9 | Wonder Woman | 41 | 14 | 45 | 16277 |
| 10 | All Eyez on Me | 33 | 14 | 53 | 2263 |
Table 3
Number of review classes and their weights
| Class | Range | Weight |
|---|---|---|
| 1 | >10,000 | 4 |
| 2 | >1,000 | 2 |
| 3 | <1,000 | 1 |
4.4.1. Box-office Dataset

We prepared a box-office prediction dataset that is made up of six independent variables and one dependent variable. The
dependent variable is the box-office revenue of the movies while the independent variables are budget, reviews_count, views_count,
WPNratio, WLDratio, and PI. A sample of the dataset is shown in Table 4, and the independent variables are explained in the
subsequent sub-sections.
1 Budget
The budget variable represents the production budget of a movie. Previous studies have shown that the movie budget is highly correlated with box-office revenue. Movie budget was used because the official budget of a movie is usually released early, before the movie's release. It was extracted from Box Office Mojo.
2 Reviews_count
Reviews_count represents the total number of reviews generated by a movie trailer uploaded on the official YouTube channel of the production company of the movie.
3 Views_count
Views_count represents the total number of times a movie trailer uploaded on the official YouTube channel of the production company of the movie is viewed.
4 WPNratio
WPNratio stands for weighted positive-to-negative sentiment ratio. It represents a weighted ratio of the percentage of positive reviews to the percentage of negative reviews of a movie. This is a novel variable, introduced because the existing approaches used a positive-to-negative ratio that does not take into account the total number of reviews from which the ratio is derived. Therefore, a weight is assigned to the ratio according to the total number of reviews. This is done by assigning classes to the movies based on the number of reviews they generated. The classes are illustrated in Table 3.
The proposed WPNratio formula is shown in Equation 5:
$$\text{WPNratio} = \frac{\text{positive sentiments}}{\text{negative sentiments}} \times c \qquad (5)$$
Where c stands for the class of the movie according to its reviews.
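A small sketch of how Equation 5 could be applied is given below. It rests on two assumptions that the text does not settle: that c is the weight listed for the movie's class in Table 3 (rather than the class number itself), and that counts of exactly 1,000 or 10,000, which Table 3 leaves unspecified, fall into the lower class. The function names are hypothetical.

```python
def review_weight(review_count):
    """Map a review count to the Table 3 weight.

    Counts of exactly 1,000 and 10,000 are not covered by the table; this
    sketch assigns them to the lower class (an assumption).
    """
    if review_count > 10_000:
        return 4
    if review_count > 1_000:
        return 2
    return 1


def weighted_ratio(numerator, denominator, review_count):
    """Ratio scaled by the review-count weight (assumed reading of c)."""
    return numerator / denominator * review_weight(review_count)


# Beauty and the Beast in Table 2: 40% positive, 12% negative, 19,924 reviews.
wpn = weighted_ratio(40, 12, 19_924)
```

Under these assumptions the unweighted ratio 40/12 is multiplied by 4, since the trailer has more than 10,000 reviews.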
Table 4
Sample of dataset.
| SN | Name | Genre | Rating | Budget | LDratio | Views | Duration | Reviews | Ratio | PI | Gross | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | All Eyez on Me | Drama | R | 4.00E+07 | 30.00 | 4,534,834 | 7 | 2263 | 2.36 | 61 | 4.E+07 | 5.E+06 |
| 2 | Atomic Blonde | Thriller | R | 3.00E+07 | 10.56 | 9,483,826 | 10 | 1946 | 1.79 | 37 | 1.E+08 | 7.E+07 |
| 3 | Deepwater Horizon | Action Drama | PG-13 | 1.10E+08 | 14.21 | 13,561,983 | 11 | 1490 | 1.71 | 38 | 1.E+08 | 1.E+07 |
| 4 | The BFG | Adventure | PG | 1.40E+08 | 13.64 | 9,247,733 | 15 | 987 | 1.42 | 32 | 2.E+08 | 4.E+07 |
| 5 | Ferdinand | Animation | PG | 1.11E+08 | 10.00 | 6,642,780 | 24 | 1548 | 2.91 | 23 | 3.E+08 | 2.E+08 |
| 6 | Jigsaw | Horror | R | 1.00E+07 | 14.63 | 17,567,694 | 8 | 7404 | 1.47 | 188 | 1.E+08 | 9.E+07 |
| 7 | Me Before You | Romance | PG-13 | 2.00E+07 | 15.33 | 8,084,422 | 11 | 1264 | 2.50 | 83 | 2.E+08 | 2.E+08 |
| 8 | Power Rangers | Action Adventure | PG-13 | 1.00E+08 | 10.65 | 44,259,866 | 10 | 19087 | 1.67 | 412 | 1.E+08 | 4.E+07 |
| 9 | The Shallows | Horror | PG-13 | 1.70E+07 | 9.00 | 21,859,793 | 14 | 5044 | 2.18 | 80 | 1.E+08 | 1.E+08 |
| 10 | Why Him? | Comedy | R | 3.80E+07 | 17.22 | 7,609,129 | 13 | 1409 | 2.69 | 17 | 1.E+08 | 8.E+07 |
5 WLDratio
WLDratio stands for weighted like-to-dislike ratio. It represents a weighted ratio of the total number of likes to the total number of dislikes a movie trailer received on the official YouTube channel of the production company of the movie. This is also a novel variable, introduced to take into account the total number of reviews from which the ratio is derived. This is done by assigning classes to the movies based on the number of reviews they generated. The weight is assigned in a similar way to WPNratio using Table 3. The proposed WLDratio formula is shown in Equation 6:
$$\text{WLDratio} = \frac{\text{total likes}}{\text{total dislikes}} \times c \qquad (6)$$
Where c stands for the class of the movie according to its reviews.
6 PI
PI stands for purchase intention and it is a variable that represents people's intention to purchase a movie ticket. The PI variable is
computed using the proposed MRPI mining approach.
4.4.2. Prediction Algorithms

Regression analysis was done for the prediction of box-office revenue. Regression analysis was chosen instead of classification
because some information will be lost if the problem is treated as a classification problem. Four prediction algorithms, namely
multiple linear regression, support vector machine, polynomial regression, and random forest were employed to predict box-office
revenue using YouTube trailer reviews. These four algorithms are the most commonly found in movie revenue prediction as shown in
Table 1 and related work.
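To make the regression setting concrete, the following sketch fits a multiple linear regression by ordinary least squares using only the Python standard library. It is an illustration under stated assumptions, not the implementation used in this study: the function names `fit_mlr` and `predict`, the toy feature values, and the choice of solving the normal equations by Gauss-Jordan elimination are all hypothetical, and the other three algorithms are not shown.

```python
from typing import List, Sequence


def fit_mlr(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) b = X^T y.

    Returns [intercept, coef_1, ..., coef_p].
    """
    rows = [[1.0, *map(float, r)] for r in X]  # prepend the intercept column
    p = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda k: abs(A[k][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for k in range(p):
            if k != col:
                factor = A[k][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][p] for i in range(p)]


def predict(coefs: Sequence[float], x: Sequence[float]) -> float:
    """Apply the fitted model to one feature vector."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))


# Toy data generated from y = 1 + 2*x1 + 3*x2 (hypothetical, not from Table 4).
X_toy = [[1, 0], [0, 1], [1, 1], [2, 1], [3, 5]]
y_toy = [3, 4, 6, 8, 22]
coefs = fit_mlr(X_toy, y_toy)
print([round(c, 6) for c in coefs])  # approximately [1, 2, 3]
```

In practice the feature vector would hold variables such as budget, WPNratio, WLDratio, and PI, and the target would be the box-office revenue.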
4.4.3. Evaluation
To evaluate the effectiveness of our approach, three baseline methods were used. The baseline methods are:
1 (Asur & Huberman, 2010) employed a linear regression model to predict box-office performance from movie tweets. The features
used for the prediction are the number of tweets (tweet-rate) and the positive-to-negative sentiment ratio (PNratio).
2 (Bhattacharjee et al., 2017) employed a linear regression model to predict box-office performance from movie tweets. The features
used for the prediction are the cumulative positive sentiment score and the cumulative negative sentiment score.
3 (Choudhery & Leung, 2017) employed a polynomial regression model to predict box-office performance from
movie reviews. The features used for the prediction are the tweet count, the percentage of positive tweets, and the percentage of negative
tweets.
Three error metrics, namely Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and Relative Absolute Error (RAE), were used with
k-fold cross-validation to evaluate the accuracy of our prediction model. K-fold cross-validation is a validation technique in
which the data is divided into k groups, each called a fold. The data is divided in such a way that the folds do not overlap
and are approximately the same size. Each fold is then used in turn to test a model trained on the remaining folds. The error metrics used are briefly
explained below:
1 Mean Absolute Error (MAE) is a measure of the average magnitude of prediction errors ignoring their direction. It is given by
Equation 7:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right| \tag{7}$$
Where y_i represents the actual value, ŷ_i the predicted value, and n the number of predictions.
2 Root Mean Square Error (RMSE) is a measure of how close a fitted line is to the data points. It is the standard deviation of the
prediction errors, called residuals. RMSE is expressed in the same unit as the dependent variable. The smaller the RMSE value, the
better the prediction accuracy. RMSE can be calculated using Equation 8:
$$\mathrm{RMSE} = \sqrt{1 - r^{2}}\; SD_y \tag{8}$$
Where r is the correlation coefficient between the actual and predicted values and SD_y is the standard deviation of the actual values.
3 Relative Absolute Error (RAE) is computed as the ratio of the total absolute error of the model to the total absolute error of a trivial or naive
model that always predicts the mean of the actual values. RAE can be calculated using Equation 9:
$$\mathrm{RAE} = \frac{\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|}{\sum_{i=1}^{n}\left|y_i - \bar{y}\right|} \tag{9}$$
Where y_i represents the actual value, ŷ_i the predicted value, and ȳ the mean of the actual values.
Table 5
MRPI evaluation results
        Accuracy  Precision  Recall
Test 1  0.89      0.9        0.9
Test 2  0.89      0.80       1.00
Test 3  0.84      0.92       0.85
Test 4  0.95      1.00       0.90
Test 5  0.92      1          0.86
All     0.90      0.89       0.93
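The evaluation procedure of Section 4.4.3 can be sketched in Python as follows. This is a hedged illustration, not the code used in this study: the function names, the toy values, and the unshuffled contiguous folds are assumptions. Note also that `rmse` below computes the root of the mean squared residual directly rather than through the correlation form of Equation 8, and `rae` returns a fraction that must be multiplied by 100 to obtain a percentage.

```python
import math
from typing import Iterator, List, Sequence, Tuple


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error (Equation 7)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error, computed directly from the residuals."""
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def rae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Relative absolute error (Equation 9), as a fraction."""
    mean_actual = sum(actual) / len(actual)
    numerator = sum(abs(p - a) for a, p in zip(actual, predicted))
    denominator = sum(abs(a - mean_actual) for a in actual)
    return numerator / denominator


def k_fold_indices(n: int, k: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k non-overlapping folds.

    Fold sizes differ by at most one. Indices are not shuffled here.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Hypothetical actual and predicted revenues (arbitrary units).
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
print(mae(actual, predicted))        # 0.25
print(rae(actual, predicted) * 100)  # 25.0 (percent)
print([len(test) for _, test in k_fold_indices(10, 3)])  # [4, 3, 3]
```

An RAE below 100% means the model has a smaller total absolute error than the naive mean predictor.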
5. Results
5.1. Experimental results of MRPI
To test the effectiveness of MRPI, we performed 5 experiments using the movie review dataset of Ding et al. (2018). The results
are presented in Table 5. We randomly selected 5 samples of 20 instances each from the dataset to generate 5 test sets for the
experiment. The purpose of this is to check that our approach performs consistently across different portions of the dataset. Across
the 5 tests, accuracy ranged from 0.84 to 0.95, precision from 0.80 to 1.00, and recall from 0.85 to 1.00. Test case 4 achieved the best
accuracy with 0.95 and, together with test case 5, the best precision with 1.00, while test case 2 has the best recall with 1.00.
Test case 3 has the lowest accuracy and recall with 0.84 and 0.85 respectively, while test case 2 has the lowest precision with
0.80.
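For reference, the three measures reported in Table 5 can be computed from confusion-matrix counts as sketched below. The standard definitions of accuracy, precision, and recall are assumed, and the function name and the counts in the example are hypothetical; they are not taken from the experiments.

```python
from typing import Tuple


def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (accuracy, precision, recall) from confusion-matrix counts.

    tp: reviews correctly labelled as showing purchase intention
    fp: reviews wrongly labelled as showing purchase intention
    fn: reviews with purchase intention that were missed
    tn: reviews correctly labelled as showing no purchase intention
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Hypothetical test set of 20 reviews.
print(classification_metrics(tp=9, fp=1, fn=1, tn=9))  # (0.9, 0.9, 0.9)
```

A precision of 1.00 therefore means that no review was wrongly flagged, and a recall of 1.00 means that no review with purchase intention was missed.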
5.2. Correlation of Features used in Prediction with Box-office
The correlation of PNratio, review count, budget, views, WLDratio, WPNratio, and PI with box-office revenue was calculated to assess the
strength of the relationship between each feature and the box-office revenue. The result is presented in Table 6. The correlation
coefficient ranges from -1 to 1, with -1 indicating a perfect negative relationship and 1 indicating a perfect positive relationship. A
value of 0 shows no relationship. Among the features, WPNratio and PI are the most correlated with the box-office revenue, with coefficients of 0.79
and 0.90 respectively.
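A correlation coefficient with the range described above can be computed as in the following sketch. The Pearson coefficient is assumed here, since the type of coefficient is not named in this section; the function name and the toy values are likewise hypothetical.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy feature values and revenues (hypothetical).
feature = [1.0, 2.0, 3.0, 4.0]
revenue_up = [2.0, 4.0, 6.0, 8.0]    # rises exactly with the feature
revenue_down = [8.0, 6.0, 4.0, 2.0]  # falls exactly with the feature
print(pearson_r(feature, revenue_up))    # 1.0
print(pearson_r(feature, revenue_down))  # -1.0
```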
5.3. Performance of Individual Features in Box-office Prediction
Six individual features, namely budget, review count, PNratio, WPNratio, WLDratio, and PI, were each tested to determine how
well they predict box-office revenue on their own. The results of the experiment are presented in Table 7. PI has the best result,
having achieved the lowest MAE, RMSE, and RAE values using three prediction algorithms. This supports the view that people do indicate
their purchase intentions in YouTube trailer reviews. WPNratio and review count achieved the next-lowest errors, with RAE values of 51.1% and 59.0% respectively.
The result for review count suggests that movies that are widely talked about will most likely perform well at the box office.
5.4. Performance of Proposed Box-office Prediction Model in Relation to Baseline Approaches
In this section, we analyze the performance of the proposed model in relation to the baseline approaches. Table 8 shows that the
proposed model outperforms the baseline approaches in terms of r, MAE, RMSE, and RAE. The proposed approach performed better
because of the added purchase intention, WPNratio, and WLDratio features, which are highly correlated with box-office revenue. The result
also indicates that the proposed WPNratio is superior to PNratio in modelling user sentiments for box-office prediction.
Table 6
Correlation of individual features with the box-office revenue generated
Feature       Correlation Coefficient (Box-office)
PNratio       0.48
Review count  0.73
Budget        0.71
Views         0.75
WLDratio      0.70
WPNratio      0.79
PI            0.90
Table 7
Result of Box-office prediction using individual features
Feature       MAE (Billion $)  RMSE (Billion $)  RAE (%)
Budget        0.2176           0.2940            62.4
Review count  0.2058           0.2894            59.0
PNratio       0.3092           0.3822            88.7
WPNratio      0.1783           0.2687            51.1
WLDratio      0.2312           0.3670            66.3
PI            0.1100           0.1810            34.4
Table 8
Result of Box-office prediction for the proposed and baseline approaches (MAE and RMSE in billion US dollars)
Approach    Correlation coefficient  MAE (Billion $)  RMSE (Billion $)  RAE (%)
Baseline 1  0.7783                   0.1748           0.2557            50.14
Baseline 2  0.4453                   0.2915           0.3579            83.64
Baseline 3  0.6097                   0.2375           0.3536            68.12
Proposed    0.9164                   0.1034           0.1608            29.65
5.5. Performance of Prediction Algorithms in Box-office Prediction
The performance of four prediction algorithms, namely MLR, SVM, MLP-NN and RF, in box-office prediction is presented in this
section. Four experiments, using the proposed approach, baseline approach 1, baseline approach 2, and baseline approach 3
respectively, were used for the evaluation. Across the experiments, MLR generally achieved the best performance, with the lowest or near-lowest errors on the
MAE, RMSE, and RAE metrics. The next algorithm in terms of low prediction errors is SVM, followed by RF. Multilayer
perceptron NN achieved the worst results overall; however, the difference between the results of RF and multilayer
perceptron NN is not large. The results are illustrated in Figures 4 to 6 and discussed in the following subsections.
5.5.1. Experiment 1: Performance of Prediction Algorithms in Box-office Prediction using Proposed Approach
The performance of MLR, SVM, multilayer perceptron NN and RF in box-office revenue prediction using the proposed approach
is evaluated in this experiment. The prediction error was measured using the MAE, RMSE, and RAE error metrics. The
results are shown in Table 9. First, it can be seen that MLR has the highest correlation coefficient with box-office
revenue and the lowest errors; SVM with a linear kernel has the next-lowest errors. Multilayer perceptron NN has the worst results on MAE, RMSE, and RAE.
Fig. 4. Performance of prediction algorithms towards the proposed approach based on RAE
[Bar chart: Error (RAE), axis 0 to 100, by algorithm (SVM, MLR, MLP NN, RF); series: Proposed, Baseline 1, Baseline 2, Baseline 3]
Fig. 5. Performance of prediction algorithms towards the proposed approach based on RMSE
[Bar chart: Error (RMSE), axis 0 to 0.45, by algorithm (SVM, MLR, MLP NN, RF)]
Fig. 6. Performance of prediction algorithms towards the proposed approach based on MAE
[Bar chart: Error (MAE), axis 0 to 0.35, by algorithm (SVM, MLR, MLP NN, RF)]
Table 9
Results of the proposed approach with baseline prediction algorithms
Algorithm  Correlation coefficient  MAE (Billion $)  RMSE (Billion $)  RAE (%)
SVM        0.9005                   0.1076           0.1735            30.88
MLR        0.9164                   0.1034           0.1608            29.65
MLP-NN     0.8569                   0.1320           0.2049            37.87
RF         0.9097                   0.1134           0.1736            35.66
5.5.2. Experiment 2: Performance of Prediction Algorithms in Box-office Prediction using Baseline Approach 1
This experiment was done to evaluate the performance of MLR, SVM, multilayer perceptron NN and RF in box-office revenue
prediction using baseline approach 1, in terms of the MAE, RMSE, and RAE error metrics. The results are presented in Table 10. The results obtained show that MLR
performed the best having the lowest prediction errors with all the error metrics used. MLR is followed by SVM, and then RF.
Multilayer perceptron NN has the worst performance results with the highest values in all the evaluation metrics used.
Table 10
Results of baseline approach 1 with baseline prediction algorithms
Algorithm  Correlation coefficient  MAE (Billion $)  RMSE (Billion $)  RAE (%)
SVM        0.7121                   0.1867           0.2850            53.56
MLR        0.7783                   0.1748           0.2557            50.14
MLP-NN     0.7319                   0.2106           0.2879            60.41
RF         0.7733                   0.1633           0.2532            46.84
5.5.3. Experiment 3: Performance of Prediction Algorithms in Box-office Prediction using Baseline Approach 2
In this experiment, MAE, RMSE, and RAE were used to evaluate the performance of four prediction algorithms, namely MLR, SVM,
multilayer perceptron NN and RF, using baseline approach 2. The results are presented in Table 11. The performance results show that MLR and SVM performed
the best, with MLR having slightly lower prediction errors. Multilayer perceptron NN and RF performed poorly, with multilayer
perceptron NN having the worst results. However, it is important to note that all the algorithms performed poorly in this experiment,
which suggests that the approach is not suitable for box-office revenue prediction.
Table 11
Results of baseline approach 2 with baseline prediction algorithms
Algorithm  Correlation coefficient  MAE (Billion $)  RMSE (Billion $)  RAE (%)
SVM        0.324                    0.2995           0.3778            85.93
MLR        0.4453                   0.2915           0.3579            83.64
MLP-NN     0.1956                   0.3267           0.3901            93.72
RF         0.2291                   0.2900           0.4025            83.19
5.5.4. Experiment 4: Performance of Prediction Algorithms in Box-office Prediction using Baseline Approach 3
This is the last experiment in this section and it was done to evaluate the performance of MLR, SVM, multilayer perceptron NN
and RF in box-office revenue prediction using baseline approach 3. The results of the experiment are shown in Table 12. The
results obtained show that RF and MLR performed fairly well, with RF having slightly lower prediction errors. SVM and multilayer
perceptron NN performed poorly, with the latter having the worst results.
6. Conclusion and Discussion
The results of the experiments conducted in this paper can be discussed according to the two research questions. The first
question is “How can we accurately extract movie ticket purchase intention from movie reviews?” The performance of the proposed MRPI,
measured using accuracy, precision and recall, has demonstrated the feasibility and strength of the approach in identifying purchase
intention from movie reviews. The novelty of MRPI lies in it being the first lexicon-based approach to this task. It can also be deduced that the
developed lexicon, PIL, sufficiently represents the terms people use to indicate their intention to purchase a movie ticket in online
movie reviews. These key findings have important applications in marketing and business intelligence. Previous studies have shown the
effectiveness of lexicon-based social media analysis in the medical domain (S. Liu & Lee, 2019), in cyber intelligence (Power, Keane,
Nolan, & O'Neill, 2017), and in the stock-market domain (Li & Shah, 2017). Therefore, the findings of the experiment further strengthen
the case for using social media data to solve practical real-world problems.
The second question is “Can sentiment analysis of YouTube trailer reviews and movie metadata be used for early prediction of
movie revenue?” Previous studies have shown that the early prediction of movie success is possible by the use of pre-release movie
metadata (Lash et al., 2015), and also by the use of social media content like Twitter (T. Liu et al., 2016). The results obtained in this
research have shown that YouTube trailer reviews are an effective source of data for movie revenue prediction. Trailers
are normally released early, months before a movie's release, so the reviews they attract allow revenue prediction to be done early and
necessary decisions to be taken promptly. Interestingly, movies are not the only commercial entities marketed on YouTube. Several
manufacturers of products like mobile phones, laptops, and cars usually introduce their upcoming products by releasing a short video
of the product and its specifications on YouTube prior to the product's release. Therefore, the proposed method of extracting and processing
YouTube reviews has a wide range of applications.
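Table 12
Results of baseline approach 3 with baseline prediction algorithms
Algorithm  Correlation coefficient  MAE (Billion $)  RMSE (Billion $)  RAE (%)
SVM        0.7583                   0.1807           0.2658            51.85
MLR        0.7887                   0.1687           0.2456            48.39
MLP-NN     0.7402                   0.2116           0.2776            60.72
RF         0.803                    0.1708           0.2580            49.00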
Additionally, the proposed WPNratio has a higher correlation coefficient than PNratio (Asur & Huberman, 2010) and the percentages of
positive and negative reviews (Choudhery & Leung, 2017) used in previous studies, thereby increasing the prediction accuracy. This
supports the researchers' claim that the number of reviews needs to be taken into consideration when calculating the impact of online
sentiments for movie revenue prediction. It also has practical applications in other domains, not necessarily sentiment analysis,
wherever information extracted from a population needs to be weighted by the sample size. Additionally, the proposed WLDratio has
improved the prediction accuracy significantly. Like WPNratio, it can also be applied in other domains. Finally, among the
four prediction algorithms used, the two that achieved the lowest prediction errors are linear models, that is, MLR and SVM (linear
kernel). This suggests that the relationship between movie reviews and movie revenue is approximately linear and that linear models are well suited to
movie revenue prediction.
In conclusion, this paper has demonstrated novel techniques of movie revenue prediction using YouTube trailer reviews.
7. Implications of the study and Future work
The findings in this research have shown the effectiveness of the proposed approaches in box-office revenue prediction. This
research has practical implications as it can be used for business intelligence and decision making. Additionally, the method uses
trailer reviews posted before a movie's release. Therefore, it gives room for movie producers to make changes to their movie or the
release plans based on people's sentiments about the trailer. For example, the Sonic the Hedgehog movie had its main character redesigned
because of the negative reviews the original trailer received on the initial design of the Sonic character. The new design
has generated many positive reviews and people have shown keen interest in the movie because of the new design (Allen, 2019).
Future studies could include the use of deep learning techniques to improve the performance of the MRPI approach because deep
learning is effective on complex problems like natural language processing. The reply feature in YouTube reviews can also be
investigated to determine whether replies can improve the prediction accuracy. Additionally, the difference (if any) between the
predictions for different movie industries could be investigated. The number of days after a movie's trailer release, or the number of reviews,
sufficient for accurate box-office prediction could also be investigated.
Author Statement
All persons who meet authorship criteria are listed as authors, and all authors certify that they have participated sufficiently in the
work to take public responsibility for the content, including participation in the concept, design, analysis, writing, or revision of the
manuscript. Furthermore, each author certifies that this material or similar material has not been and will not be submitted to or
published in any other publication before its appearance in Information Processing and Management.
Authors’ contributions
Azuraliza Abu Bakar: Conceptualization, Methodology, Supervision, Project administration, Writing – Reviewing and Editing.
Ibrahim Said Ahmad: Conceptualization, Methodology, Investigation, Data curation, Writing – Original Draft
Mohd Ridzwan Yaakub: Supervision, Funding acquisition, Resources, Validation, Writing – Reviewing and Editing.
Acknowledgment
This work is supported by the Fundamental Research Grant Scheme (FRGS/1/2017/ICT02/UKM/02/4) of the ‘Universiti
Kebangsaan Malaysia’ (UKM), and Regional Cluster for Research and Publication (RCRP-2016-002).
References
Agarwal, B., & Mittal, N. (2013). Optimal Feature Selection for Sentiment Analysis. International Conference on Intelligent Text Processing and Computational Linguistics
(pp. 13–24). Springer. https://doi.org/10.1007/978-3-642-37256-8_2.
Ahmad, S. R., Bakar, A. A., & Yaakub, M. R. (2015). Metaheuristic algorithms for feature selection in sentiment analysis. 2015 Science and Information Conference (SAI)
(pp. 222–226). IEEE. https://doi.org/10.1109/SAI.2015.7237148.
Al-Moslmi, T., Albared, M., Al-Shabi, A., Omar, N., & Abdullah, S. (2018). Arabic senti-lexicon: Constructing publicly available language resources for Arabic sentiment
analysis. Journal of Information Science, 44(3), 345–362. https://doi.org/10.1177/0165551516683908.
Al-Moslmi, T., Omar, N., Abdullah, S., & Albared, M. (2017). Approaches to Cross-Domain Sentiment Analysis: A Systematic Literature Review. IEEE Access, 5,
16173–16192. https://doi.org/10.1109/ACCESS.2017.2690342.
Allen, K. (2019, November 12). The “Sonic the Hedgehog” movie tries again with a new trailer, and people finally like it. CNN Entertainment. Retrieved from
https://edition.cnn.com/2019/11/12/entertainment/sonic-hedgehog-movie-redesign-trnd/index.html.
Asur, S., & Huberman, B. A. (2010). Predicting the Future with Social Media. 2010 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology, 1 (pp. 492–499). https://doi.org/10.1109/WI-IAT.2010.63.
Awwalu, J., Bakar, A. A., & Yaakub, M. R. (2019). Hybrid N-gram model using Naïve Bayes for classification of political sentiments on Twitter. Neural Computing and
Applications, 1–14. https://doi.org/10.1007/s00521-019-04248-z.
Bhattacharjee, B., Sridhar, A., & Dutta, A. (2017). Identifying the causal relationship between social media content of a Bollywood movie and its box-office success-a
text mining approach. International Journal of Business Information Systems, 24(3), 344–368. https://doi.org/10.1504/IJBIS.2017.082039.
Burgess, J., & Green, J. (2018). YouTube : online video and participatory culture. Polity Press. Retrieved from https://books.google.com.my/books?hl=en&lr=&id=
mg1rDwAAQBAJ&oi=fnd&pg=PT5&dq=YouTube&ots=RBsMRDm1pM&sig=i7SiYAtiHplbWTul7NmyGaF6CXc&redir_esc=y#v=onepage&q=YouTube&f=
false.
Chen, X., Li, X., Yao, D., & Zhou, Z. (2019). Seeking the support of the silent majority: are lurking users valuable to UGC platforms. Journal of the Academy of Marketing
Science. https://doi.org/10.1007/s11747-018-00624-8.
Choudhery, D., & Leung, C. K. (2017). Social Media Mining. Proceedings of the 21st International Database Engineering & Applications Symposium on - IDEAS 2017 (pp. 20–
29). ACM Press. https://doi.org/10.1145/3105831.3105854.
Court, D., Gillen, B., McKenzie, J., & Plott, C. R. (2018). Two information aggregation mechanisms for predicting the opening weekend box office revenues of films:
Boxoffice Prophecy and Guess of Guesses. Economic Theory, 65(1), 25–54. https://doi.org/10.1007/s00199-017-1036-1.
Ding, X., Cai, B., Liu, T., & Shi, Q. (2018). Domain adaptation via tree kernel based maximum mean discrepancy for user consumption intention identification. IJCAI
International Joint Conference on Artificial Intelligence, 2018-July (pp. 4026–4032). https://doi.org/10.24963/ijcai.2018/560.
Du, J., Xu, H., & Huang, X. (2014). Box office prediction based on microblog. Expert Systems with Applications, 41(4), 1680–1689. https://doi.org/10.1016/J.ESWA.
2013.08.065.
Duan, J., Ding, X., & Liu, T. (2017). A Gaussian copula regression model for movie box-office revenues prediction. Science China Information Sciences, 60(9), 092103.
https://doi.org/10.1007/s11432-015-0905-6.
Gaikar, D. D., Marakarkandy, B., & Dasgupta, C. (2015). Using Twitter data to predict the performance of Bollywood movies. Industrial Management & Data Systems,
115(9), 1604–1621. https://doi.org/10.1108/IMDS-04-2015-0145.
Ghiassi, M., Lio, D., & Moon, B. (2015). Pre-production forecasting of movie revenues with a dynamic artificial neural network. Expert Systems with Applications, 42(6),
3176–3193. https://doi.org/10.1016/J.ESWA.2014.11.022.
Khoshgoftaar, T. M., Gao, K., Napolitano, A., & Wald, R. (2014). A comparative study of iterative and non-iterative feature selection techniques for software defect
prediction. Information Systems Frontiers, 16(5), 801–822. https://doi.org/10.1007/s10796-013-9430-0.
Lash, M. T., Fu, S., Wang, S., & Zhao, K. (2015). Early Prediction of Movie Success What, Who, and When. Social Computing, Behavioral-Cultural Modeling, and Prediction
(pp. 345–349). https://doi.org/10.1007/978-3-319-16268-3_41.
Lee, K. J., & Chang, W. (2009). Bayesian belief network for box-office performance: A case study on Korean movies. Expert Systems with Applications, 36(1), 280–291.
https://doi.org/10.1016/j.eswa.2007.09.042.
Li, Q., & Shah, S. (2017). Learning Stock Market Sentiment Lexicon and Sentiment-Oriented Word Vector from StockTwits. Proceedings of the 21st Conference on
Computational Natural Language Learning (CoNLL 2017) (pp. 301–310). Vancouver: Association for Computational Linguistics. Retrieved from https://www.
aclweb.org/anthology/K17-1031.
Lipizzi, C., Iandoli, L., & Marquez, J. E. R. (2016). Combining structure, content and meaning in online social networks: The analysis of public’s early reaction in social
media to newly launched movies. Technological Forecasting and Social Change, 109, 35–49. https://doi.org/10.1016/j.techfore.2016.05.013.
Litman, B. R. (1983). Predicting Success of Theatrical Movies: An Empirical Study. The Journal of Popular Culture, 16(4), 159–175. https://doi.org/10.1111/j.0022-
3840.1983.1604_159.x.
Liu, S., & Lee, I. (2019). Extracting features with medical sentiment lexicon and position encoding for drug reviews. Health Information Science and Systems, 7(11)
https://doi.org/10.1007/s13755-019-0072-6.
Liu, T., Ding, X., Chen, Y., Chen, H., & Guo, M. (2016). Predicting movie Box-office revenues by exploiting large-scale social media content. Multimedia Tools and
Applications, 75(3), 1509–1528. https://doi.org/10.1007/s11042-014-2270-1.
Lu, Y., Wang, F., & Maciejewski, R. (2014). Business Intelligence from Social Media: A Study from the VAST Box Office Challenge. IEEE Computer Graphics and
Applications, 34(5), 58–69. https://doi.org/10.1109/MCG.2014.61.
Power, A., Keane, A., Nolan, B., & O'Neill, B. (2017). A lexical database for public textual cyberbullying detection. Revista de Lenguas Para Fines Específicos, 23(2),
157–186. https://doi.org/10.20420/rlfe.2017.177.
Rajput, P., Sapkal, P., & Sinha, S. (2017). Box Office Revenue Prediction Using Dual Sentiment Analysis. International Journal of Machine Learning and Computing, 7(5),
https://doi.org/10.18178/ijmlc.2017.7.4.623.
Ru, Y., Li, B., Liu, J., & Chai, J. (2018). An effective daily box office prediction model based on deep neural networks. Cognitive Systems Research, 52, 182–191. https://
doi.org/10.1016/J.COGSYS.2018.06.018.
Schouten, K., Frasincar, F., & Dekker, R. (2016). An information gain-driven feature study for aspect-based sentiment analysis. Lecture Notes in Computer Science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 9612 (pp. 48–59). https://doi.org/10.1007/978-3-319-41754-7_5.
Sharda, R., & Delen, D. (2006). Predicting box-office success of motion pictures with neural networks. Expert Systems with Applications, 30(2), 243–254. https://doi.
org/10.1016/J.ESWA.2005.07.018.
Shim, S., & Pourhomayoun, M. (2017). Predicting Movie Market Revenue Using Social Media Data. 2017 IEEE International Conference on Information Reuse and
Integration (IRI) (pp. 478–484). IEEE. https://doi.org/10.1109/IRI.2017.68.
Xiao, J., Li, X., Chen, S., Zhao, X., & Xu, M. (2017). An inside look into the complexity of box-office revenue prediction in China. International Journal of Distributed
Sensor Networks, 13(1), 155014771668484. https://doi.org/10.1177/1550147716684842.
Yaakub, M. R., Li, Y., & Zhang, J. (2014). Integration of Sentiment Analysis into Customer Relational Model: The Importance of Feature Ontology and Synonym.
Procedia Technology, 11(Iceei), 495–501. https://doi.org/10.1016/j.protcy.2013.12.220.
Zhang, L., Luo, J., & Yang, S. (2009). Forecasting box office revenue of movies with BP neural network. Expert Systems with Applications, 36(3), 6580–6587. https://doi.
org/10.1016/J.ESWA.2008.07.064.
Zhou, Y., Zhang, L., & Yi, Z. (2017). Predicting movie box-office revenues using deep neural networks. Neural Computing and Applications, 1–11. https://doi.org/10.
1007/s00521-017-3162-x.