conducted by
University of Mumbai
through
Submitted by
MMS
I hereby declare that the project work entitled “Predictive Analysis to Measure
Movie Acceptability”, submitted to Rizvi Institute of Management Studies &
Research, is a record of original work done by me under the guidance of Prof. Sanjay
Gupta of Rizvi Institute of Management Studies and Research, and that this project work is
submitted in partial fulfillment of the requirements for the award of the degree of
Masters in Management Studies. The results embodied in it have not been
submitted to any other University or Institute for the award of any Degree or Diploma.
I further certify that I have no objection and grant the rights to:
Roll No: 14
Place: Mumbai
Certificate
This is to certify that Mr. Asfaq Nazir Kazani, a student of Rizvi Institute of
Management Studies and Research, of MMS IV, bearing Roll No. 14 and specializing
in Systems, has successfully completed the project titled “Predictive Analysis to
Measure Movie Acceptability” under the guidance of Prof. Sanjay Gupta in partial
fulfillment of the requirements of the Masters of Management Studies by the University
of Mumbai for the academic year 2016 – 2018.
I wish to take this opportunity to express my sincere gratitude to all those who helped
me in some way or the other in the completion of my project.
I would like to thank my project guide, Prof. Sanjay Gupta, for his constant support,
encouragement and guidance, without which the successful completion of this project
would have been impossible.
I would like to specially thank Dr. Kalim Khan for always being supportive and
inspiring me throughout the completion of my project.
______________________
Roll No: M- 14
MMS - Systems
Batch (2016-2018)
Executive Summary
The film industry is one of the biggest contributors to the entertainment industry and
is characterized by unpredictable success and failure; as the saying goes, ‘Hollywood is
the land of hunch and the wild guess’. Research points to the unpredictability of product
demand and suggests that 25% of a movie's total revenue is earned in the first 2
weeks of release. Litman was the first to develop a multiple regression model to
predict the commercial success of movies.
By 2014, the U.S. film industry’s revenue had reached 564 billion U.S. dollars, and the
entertainment industry is expected to grow to more than US$669 billion in the
next four years. Yet it is estimated that 80% of the film industry's profits over the
past 10 years were generated by 6% of the films released, while 78% of movies lost
money over the same period. A lot of research has therefore been conducted to predict the
success of Hollywood movies.
The Indian movie industry differs from its counterparts in many ways, especially in
producing movies in many different languages and in its low ticket prices. In 2013, the
Indian film industry generated 1.68 billion U.S. dollars out of its total revenue of 2.07
billion from overseas and domestic box office collections.
According to the Deloitte and ASSOCHAM report, theatrical revenue in India is
predicted to increase from $1.78 billion to $2.13 billion by FY2018. That
makes theatrical circuits account for some 74% of total income.
The Bollywood industry works very differently from Hollywood. In Bollywood,
a lot of importance is normally given to parameters such as celebrity appeal and
the movie album, which are an integral part of the movie itself, unlike in
Hollywood. This project looks into the details of movie watching by splitting
the research into two main components. The first is exploring the variables that
influence how frequently people watch movies. The second is developing a model to
predict the success or failure of movies, which is the deciding factor for
producers and distributors in whether or not to invest in a movie.
Objectives:
Conclusion
Appendices
7.3 Rshiny
Annexures
Bibliography
Predictive Analysis to Measure Movie Acceptability Asfaq Nazir Kazani
Chapter 1. Introduction
1.1 Introduction
With the emergence of big data, the world has changed in an unprecedented way. Due
to the many advantages of data analysis, many things that were previously impossible
to analyze and predict are now easier and more intuitive to predict. The entertainment
industry is no exception. Over the years, data analysis and its tremendous power have
been the root cause of the transformation of the entertainment industry's operating
model.
In a fast-growing and booming industry such as the movie industry, data analysis has
opened up many important new ways to analyze past data, make creative and
marketing decisions, and accurately predict the fate of upcoming movie releases.
One of the most important elements of predictive analytics is the collection and
storage of large amounts of data. The film industry has been showing rapid growth. In
the past few years, the growth rate has increased significantly due to the rise of
alternative distribution platforms such as online platforms and mobile platforms. The
rich data in the film industry makes it an exciting area for data analysts and
statisticians. Previously, the film industry relied on knowledge of certain industry
trends, basic rules of thumb, conventional wisdom and intuition to predict the
success or failure of particular movies. This approach has never been very accurate or
reliable, as has been observed in many fields over the years.
At present, with the emergence of big data and exciting opportunities brought about
by data mining and analysis, the industry is actively developing a new, improved and
reliable method to accurately predict the success and failure of a particular movie.
Stakeholders in most of the world’s major industries are turning to data scientists and
analysts to help them increase their success rate through data analysis. At present,
many major movie studios in the world are actively implementing data analysis as the
main means to measure the success of many of their projects.
1.2.1 The Basics of Data Analytics for the Motion Picture Industry
The main factor that helps determine the likelihood of a particular movie's success is
knowledge of what interests people and arouses their curiosity. This knowledge can be
obtained, to some extent, by analyzing a variety of online sources, including search
engine results, video views and comments, ratings on expert websites, and social media
content. Past success records of other movies of the same type or on the same subject
can also be brought into the analysis to improve accuracy. The main goal is to
predict the box office receipts of a particular movie accurately using relevant types
of data. To conduct this analysis, analysts need a large amount of information,
including the past records of movies by the same director and production company,
other movies of the same type, similar casts, story types, and different approaches
to marketing and promotion. In addition, there are other influential factors such as
the reception of trailer announcements, social media buzz and reviews on public
forums.
Past, present and upcoming releases are first classified and segmented using
cluster analysis techniques.
Similarity checks are used to determine the degree of similarity between the plot in
question and other movies perceived as similar.
The model derived from the above steps and past data is used to arrive at an
approximate estimate of the net return of the particular movie in question.
Based on the above factors, as well as other intrinsic factors such as awareness,
interest and curiosity, an accurate and reliable statistical model is built.
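The similarity check described above can be sketched in a few lines. This is a minimal illustration only, written in Python for brevity rather than in the R used for this project. The movie titles, the four features and the choice of cosine similarity are assumptions made for the example, not the method of any particular studio.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Hypothetical feature vectors: [action, comedy, drama, star power (0-1)]
past_movies = {
    "Movie A": [1.0, 0.0, 0.2, 0.9],
    "Movie B": [0.0, 1.0, 0.3, 0.4],
    "Movie C": [0.9, 0.1, 0.1, 0.8],
}
new_movie = [1.0, 0.0, 0.1, 0.85]

def rank_by_similarity(target, catalogue):
    """Return (title, score) pairs, most similar first."""
    scores = {title: cosine_similarity(target, feats)
              for title, feats in catalogue.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranking = rank_by_similarity(new_movie, past_movies)
for title, score in ranking:
    print(f"{title}: {score:.3f}")
```

The past movies that rank highest would then serve as the comparison set when estimating the new movie's likely earnings.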
Although the above process is only a rough overview, data scientists can obtain more
precise and finer results in a variety of ways. With the availability of large amounts of
data and the abundance of today's sophisticated tools, technology and data processing
platforms, predictors can achieve surprising levels of accuracy. The first step in
achieving this goal is to ensure that the right audience is targeted.
To achieve this, individual fans can be treated as potential customers, and the data
can then be studied carefully to determine which of them are most likely to
influence the opinions of others. In this regard, it is helpful to consider both potential
audiences and theater owners, who may have their own strategies for scheduling
specific films to increase occupancy and profits. Factors such as demographics
also play an important role in this regard.
Another factor that should be included in the analysis to obtain more accurate
results is seasonality. The fate of a film released to coincide with a
specific event, such as a major holiday or a weekend, needs to be studied in
detail. Usually, several different movies are released in a given week, and
people are likely to watch only one of them. It is therefore necessary to study
ratios, such as a movie's share of that week's audience, rather than absolute values
in order to obtain accurate results. This not only drives profits, but also plays an
important role in the marketing and promotion of movies.
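The point about ratios versus absolute values can be made concrete with a small sketch. The release weeks, titles and gross figures below are invented for illustration, and the example is written in Python rather than R purely for brevity.

```python
from collections import defaultdict

def weekly_shares(releases):
    """Return each movie's share of the total gross in its release week.

    `releases` is a list of (week, title, gross) tuples.
    """
    totals = defaultdict(float)
    for week, _title, gross in releases:
        totals[week] += gross
    return {title: gross / totals[week]
            for week, title, gross in releases
            if totals[week] > 0}

# Hypothetical data: two movies compete in week 1, one opens alone in week 2
releases = [
    (1, "Movie A", 60.0),
    (1, "Movie B", 40.0),
    (2, "Movie C", 50.0),
]

shares = weekly_shares(releases)
for title, share in shares.items():
    print(f"{title}: {share:.0%} of its release week")
```

In this made-up data, Movie C earns less than Movie A in absolute terms but captures its entire release week, which is the kind of difference that absolute figures alone would hide.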
1.2.3 Critical success drivers for leveraging the prowess of data analysis
With the astonishing growth of data on digital platforms, data analysis experts will
continue to propose new and innovative methods of using this information to predict
the success or failure of a movie ever more accurately. The ability to predict the
success of an upcoming project with high accuracy has clear advantages. It allows
movie producers to fine-tune their promotion and other decisions before the film is
released in order to attract more revenue.
For producers, such a precise forecasting system helps secure the right amount of
investment capital. For investors, this kind of analysis provides a deep
understanding of the break-even point and economic consequences of a particular
project and helps smooth decision-making. Through data analysis, each team that
plays a role in a particular project can be better equipped with the knowledge and
insight needed for important decisions.
At the most basic level, data analysis in the film industry involves processing
large amounts of past data, identifying patterns in it, juxtaposing them with
currently available data points, and using the knowledge gained to improve
performance and make better decisions. Accurate revenue forecasts
can pave the way for sound financial and marketing plans for future projects.
There are many important nuances in this regard, all of which are related to a robust
and unbiased assessment of the qualitative and quantitative factors of time-tested
statistical significance that affect the fate of a particular movie project. These factors
are varied, and may include genre, release date, location, actors, budget, marketing
work, marketing budget, and other differences. All of these factors can also affect
each other and ultimately have a huge impact on the likelihood of success.
For example, a major movie studio might look at very specific data, such as the past
records of all movies of a particular type released over the past few years that
feature one or two specific stars. The data can then be sorted chronologically and
analyzed to find any relevant trends. These trends can then be considered in
conjunction with external factors that also affect revenue, such as economic factors,
competing events at the time of release, and weather conditions.
After this process, a generic model can be built to predict the likely box office
fortunes of any type of movie project. An important addition to this approach is
to draw on information from movie reviews, ratings, user reviews, and
social media content. This can help establish patterns and levels of engagement,
and the potential of a particular genre or cast member to generate interest and
curiosity, or even viewer fatigue.
Data mining, text analysis and statistics allow business users to create predictive
intelligence by discovering patterns and relationships in structured and unstructured
data. Structured data includes fields such as age, gender, marital status, income and
sales. Unstructured data is text such as call center notes, social media content or
other open text, from which content and sentiment need to be extracted before they
can be used in the model building process.
Define the project outcomes, deliverables, scope of the effort and business objectives,
and identify the data sets that are going to be used.
Data Mining for predictive analytics prepares data from multiple sources for analysis.
This provides a complete view of the customer interactions.
Data Analysis is the process of inspecting, cleaning, transforming, and modeling data
with the objective of discovering useful information, arriving at conclusions.
1.4.4 Statistics:
Statistical analysis makes it possible to validate assumptions and hypotheses and to
test them using standard statistical models.
1.4.5 Modeling:
1.4.6 Deployment:
Predictive model deployment provides the option to deploy the analytical results into
the everyday decision-making process to get results, reports and output by automating
decisions based on the modeling.
Models are managed and monitored to review the model performance to ensure that it
is providing the results expected.
Predictive analysis applications in health care can determine the patients who are at
the risk of developing certain conditions such as diabetes, asthma and other lifetime
illnesses. The clinical decision support systems incorporate predictive analytics to
support medical decision making at the point of care.
Predictive analytics can also help to identify the most effective combination of
product versions, marketing material, communication channels and timing that should
be used to target a given consumer.
1.5.8 Underwriting
Predictive analytics can help with underwriting by predicting the chances of
illness, default or bankruptcy. It can also streamline the process of
customer acquisition by predicting the future risk behavior of a customer using
application-level data.
Companies that want to use data-driven decision-making need a lot of relevant
data from a range of activities, and such large datasets are sometimes difficult to
obtain. Even if a company has enough data, critics argue that when it comes to
predicting human behavior, computers and algorithms cannot account for all the
variables that may affect customer buying patterns, from weather changes to emotions
to relationships.
Time also plays an important role in how these technologies work. Although a
model may be successful at some point, customer behavior changes over time, so the
model must be updated. The 2008-2009 financial crisis illustrates the importance of
time considerations: flawed models predicted the likelihood of mortgage customers
repaying their loans without considering the possibility that home
prices might fall.
2.1 What Is R?
R is a language and environment for statistical computing and graphics. It is a
GNU project that is similar to the S language and environment developed by John
Chambers and colleagues at Bell Labs (formerly AT&T, now Lucent Technologies).
R can be thought of as a different implementation of S. Although there are some
important differences, much of the code written for S runs unaltered under R.
One of R's strengths is the ease with which well-designed, publication-quality plots
can be produced, including mathematical symbols and formulae where needed. Great care
has been taken over the defaults for the minor design choices in graphics, but the
user retains full control.
R is available as free software in source code form under the terms of the Free Software
Foundation's GNU General Public License. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.
Website: https://www.r-project.org/
https://cran.r-project.org/bin/windows/base/
CRAN is a network of ftp and web servers around the world that store identical, up-
to-date, versions of code and documentation for R.
2.2.1 Advantages
2.2.2 Disadvantages
R has a steep learning curve. It takes a while to get used to the power of R, but
the curve is no steeper than for other statistical languages. R is not so easy
to use for beginners. R has several easy-to-use graphical user interfaces (GUIs)
that offer point-and-click interaction, but they generally lack the polish of
commercial products.
Documentation is sometimes patchy and terse, and can be impenetrable to
non-statisticians. However, some very high-standard books are
increasingly filling the documentation gaps.
The quality of some packages is less than perfect, although if a
package is useful to many people, it will quickly develop into a very
robust product through collaborative efforts.
There is, in general, no one to
complain to if something doesn’t work. R is a software application that many
people freely devote their own time to developing. Problems are usually dealt
with quickly on the open mailing lists, and bugs disappear with lightning
speed. Users who do require it can purchase support from a number of vendors
internationally.
Many R commands give little thought to memory management, so R can quickly
consume all available memory. This may be a limitation when doing data
mining. There are various solutions, including the use of 64-bit operating
systems, which can access much more memory than 32-bit ones.
R is a text-based program in which commands are entered at a prompt and executed one
by one. It is evolving towards more graphical interfaces, where a code editor
interacts with installed packages and presents commands and plots through the interface
(Valero-Mora & Ledesma, 2012). In addition, RStudio, a code editor that interfaces
with R on Windows, MacOS, and Linux platforms, has become
popular. Kilburn (2015) mentioned that RStudio is R-based commercial software that
provides additional functions related to predictive modeling and data analysis.
The above picture represents the four parts of RStudio. First, the script pane is
where data is imported and code is written. Second, the Environment pane shows the
variables that exist in a given data set. Third, the console is where all commands
are run. Finally, the graphical output pane displays the plots produced by the
commands run in the console.
There are other free user interfaces for R, such as Rattle, Red-R and RKWard, which
make the software more accessible to its users.
The relevance of the forecast varies from software to software. R is mainly built for
running complex data science algorithms, and it also provides good packages for
predictive analysis. It supports data visualization through graphs and charts.
Usually there are 3 types of predictive modelling in R:
Propensity modeling,
Clustering modeling,
Collaborative filtering (Strickland, 2015).
First, propensity modeling predicts a customer's future behavior toward the company.
Second, cluster modeling is used to segment customers and classify them into
different groups. Finally, collaborative filtering uses user feedback to
generate recommendations. It allows the development of User-User and Item-Item
collaborative filtering algorithms.
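A minimal sketch of User-User collaborative filtering is given below. It is written in Python for brevity rather than in R, and the users, movies, ratings and the use of cosine similarity as the weighting scheme are all assumptions chosen for illustration, not the approach described by Strickland (2015).

```python
import math

# Hypothetical ratings on a 1-5 scale; each user has rated only some movies
ratings = {
    "user1": {"Movie A": 5, "Movie B": 4, "Movie C": 1},
    "user2": {"Movie A": 4, "Movie B": 5, "Movie D": 5},
    "user3": {"Movie A": 1, "Movie C": 5, "Movie D": 2},
}

def user_similarity(u, v):
    """Cosine similarity computed over the movies both users have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][m] * ratings[v][m] for m in common)
    norm_u = math.sqrt(sum(ratings[u][m] ** 2 for m in common))
    norm_v = math.sqrt(sum(ratings[v][m] ** 2 for m in common))
    return dot / (norm_u * norm_v)

def predict(user, movie):
    """Similarity-weighted average of other users' ratings for `movie`.

    Returns None when no other user has rated the movie.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for other in ratings:
        if other == user or movie not in ratings[other]:
            continue
        weight = user_similarity(user, other)
        weighted_sum += weight * ratings[other][movie]
        weight_total += weight
    return weighted_sum / weight_total if weight_total > 0 else None

predicted = predict("user1", "Movie D")
print(f"Predicted rating of Movie D for user1: {predicted:.2f}")
```

Because user1's ratings resemble user2's far more than user3's, the prediction is pulled towards user2's rating. An Item-Item variant would compute similarities between movies instead of between users.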
Since its introduction, R has evolved to make it easier for users to
build predictive models. To see how analytical models respond in practice, it is best
to link them directly to the market execution system.
In short, companies that hold huge databases try to predict customer behavior
through statistical analysis and domain knowledge. Smith (2014) believes that it is
becoming more common to use R for marketing data analysis based on customers' habits
and background.
2.4.1 RStudio
https://www.rstudio.com/
2.5 RShiny
Shiny is an R package that makes it easy to build interactive Web applications directly
from R. Building and deploying Shiny applications relies on reactive programming,
which helps make applications more efficient, robust, and correct. Shiny components
can also be used to build interactive plots and to create interactive documents.
Shiny makes it easy for R users to turn analytics into an interactive Web application
that anyone can use. These applications allow you to specify input parameters using
friendly controls such as sliders, pull-down menus, and text fields; and they can easily
combine any number of outputs, such as figures, tables, and summaries.
Predictive analysis is the process of using data analysis to make predictions based
on data. The process uses data together with analysis, statistics, and machine
learning techniques to create predictive models that forecast future events.
Predictive analytics starts with a business goal: to use data to reduce waste, save
time, or reduce costs. The process feeds heterogeneous, often massive, data sets into
models. These models can produce clear, actionable results to support that goal,
such as less wasted material, less stocked inventory, and manufactured products
that meet specifications.
We are all familiar with predictive models for weather forecasting. An important
industry application of predictive models is energy load forecasting, which
predicts energy demand. In this situation, energy producers, grid operators, and traders
need to accurately predict energy loads and make decisions to manage grid loads. A
large amount of data is available, and using predictive analytics, grid operators can
translate this information into actionable insights.
Typically, the workflow for a predictive analytics application follows these basic
steps:
1. Import data from varied sources, such as web archives, databases, and
spreadsheets.
Data sources include energy load data in a CSV file and national weather data
showing temperature and dew point.
Linear and logistic regression are usually the first algorithms people learn in
predictive modeling. Because of their popularity, many analysts even consider them
to be the only forms of regression. Those who are slightly more involved think they
are the most important of all forms of regression analysis.
The fact is that there are countless forms of regression that can be implemented. Each
form has its own importance and specific conditions that are best for the application.
Regression analysis estimates the relationship between two or more variables. Let's
use a simple example to understand this:
Suppose you want to estimate the company's sales growth rate based on current
economic conditions. Recent company data shows that sales growth is about two and
a half times that of economic growth. Using this insight, we can predict the company's
future sales based on current and past information.
There are various regression techniques available for prediction. These techniques are
mainly driven by three indicators (the number of independent variables, the type of
dependent variable, and the shape of the regression line). We will discuss them in
detail in the following sections.
Creative analysts who feel the need to combine the above parameters can even devise
new forms of regression that have not been used
before. But before we get started, let us look at the most common regressions:
Linear regression is one of the most widely known modeling techniques and is usually
among the first few topics people pick up while learning predictive modeling. In
this technique, the dependent variable is continuous, the independent variable(s) can
be continuous or discrete, and the nature of the regression line is linear.
Linear Regression establishes a relationship between dependent variable (Y) and one
or more independent variables (X) using a best fit straight line (also known as
regression line).
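As a minimal sketch of fitting such a best-fit straight line, the example below applies ordinary least squares to a single independent variable. It is written in Python for brevity rather than in R, and the data points are invented to match the earlier illustration in which sales growth is about two and a half times economic growth.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical data: economic growth (%) and company sales growth (%)
economic_growth = [1.0, 2.0, 3.0, 4.0]
sales_growth = [2.5, 5.0, 7.5, 10.0]

intercept, slope = fit_line(economic_growth, sales_growth)

# Predict sales growth if the economy grows by 5%
forecast = intercept + slope * 5.0
print(f"slope={slope:.2f}, intercept={intercept:.2f}, forecast={forecast:.2f}")
```

The slope is the estimated change in Y for a one-unit change in X, and the fitted line is what is used to predict Y for new values of X.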
Stepwise regression is used when we deal with multiple independent variables. In
this technique, the selection of independent variables is done with the help of an
automatic process, which involves no human intervention.
Ridge Regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). Under multicollinearity, even though the
ordinary least squares (OLS) estimates are unbiased, their variances are large, which
can push the estimated values far from the true values. By adding a degree of bias
to the regression estimates, ridge regression reduces the standard errors.
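The shrinkage effect can be illustrated with a small sketch that solves the ridge normal equations, (X'X + lambda*I) beta = X'y, for two highly correlated predictors. The example is in Python for brevity rather than R, the data values are invented, and the predictors and response are assumed to be centred so that no intercept is needed. Setting lambda to zero gives the OLS solution.

```python
def ridge_two_predictors(x1, x2, y, lam):
    """Solve (X'X + lam*I) beta = X'y for two centred predictors, no intercept."""
    a = sum(v * v for v in x1) + lam          # x1'x1 + lam
    b = sum(u * v for u, v in zip(x1, x2))    # x1'x2
    d = sum(v * v for v in x2) + lam          # x2'x2 + lam
    c1 = sum(u * v for u, v in zip(x1, y))    # x1'y
    c2 = sum(u * v for u, v in zip(x2, y))    # x2'y
    det = a * d - b * b                       # determinant of the 2x2 system
    return ((d * c1 - b * c2) / det, (a * c2 - b * c1) / det)

# Hypothetical centred data: x2 is almost a copy of x1 (multicollinearity)
x1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
x2 = [-2.1, -0.9, 0.0, 1.1, 1.9]
y = [-4.2, -1.9, 0.1, 2.0, 4.0]

ols = ridge_two_predictors(x1, x2, y, lam=0.0)
ridge = ridge_two_predictors(x1, x2, y, lam=1.0)

print("OLS coefficients:  ", ols)
print("Ridge coefficients:", ridge)
```

With these made-up numbers the OLS coefficients are split unevenly between the two near-identical predictors, while the ridge coefficients are pulled closer together and have a smaller overall magnitude. In R, the glmnet package fits ridge regression when its alpha argument is set to 0.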
Forecasts of the global movie industry's health for the next few years show that global
box office revenue is expected to increase from approximately 38 billion U.S. dollars
in 2016 to nearly 50 billion U.S. dollars in 2020. The United States is the world’s
third-largest film market by number of tickets sold each year, behind only China
and India. As of 2016, just under $1.2 billion in movie tickets were sold in the
United States, and there were about 5,800 movie theaters in the country.
According to a recent survey, 13% of Americans watch movies in cinemas about once a
month, 7% go several times a month, and 31% go less than once a
year. Considering that 52% of American adults prefer to watch movies at home, this is
a considerable share.
In 2016, 733 movies were released in North America, and drama was the most commonly
released genre in the region. So far, the region's most successful movie
franchise is the Marvel Cinematic Universe, which has generated more than $4 billion
in box office revenue in North America alone. "Iron Man," "Marvel's The Avengers,"
"Spider-Man," and "The Incredible Hulk" are just a few examples of franchise movies.
"Star Wars: The Force Awakens" is the highest-grossing 3D movie in North America,
with a total lifetime gross of approximately $937 million.
The Hindi film industry contributes 43% of the Indian film industry's revenue.
Other interesting trends include gains by international/foreign films in the Indian
market, the entry of international studios through acquisitions and cooperation, the
rise of regional films, digital adoption across the entire value chain, organic and
inorganic growth through diversification, and the emergence of alternative income
streams, all of which are expected to shape the contours of the Indian film industry.
In addition, rising per capita incomes, a growing middle class, increasing demand in
second-tier and third-tier cities, diversification into international markets,
increased use of supplementary income streams, and increased use of visual effects
(VFX) in films may drive market growth in the country.
The success of low-budget Indian films such as "Mandy", "No One Killed Jessica",
"Peepli Live", "Kahaani" and so on has made large-scale budget movies gain huge
profits.
The total revenue is 138 billion Indian rupees. By 2018, revenue is estimated to reach
162 billion Indian rupees, of which theater revenues account for the major portion.
Bollywood can no longer be considered a marginal industry.
Below are a few trends that are disrupting “business as usual” in Bollywood:
Box office sales are plateauing even though average ticket prices have
gone up.
Revenue from mobile, online and home video rights is increasing (box
office sales still comprise 74% of overall movie revenue).
The rise of multiplexes has led to an increase in per-ticket realization and to
cost-effective screening in theatres with low seating capacities.
Multiplexes have also improved the film-going experience for Indian audiences.
Digitization of films by studios such as Real Image and UFO films has
enabled faster distribution, from 500 prints per day to 1500 prints per day. As a
result, revenue generated in the first week of release is almost 60-80% of
the total theatrical revenue.
It is now widely accepted that a film will not get a good opening weekend unless it is
heavily promoted. These days, a film’s success heavily depends on its intelligent
omni-channel marketing campaigns.
A typical marketing mix for most film promotions looks like this:
With the promotion of television, radio, and social media sites (such as Facebook,
Instagram, Twitter, and mobile games), marketing of most movies has become
increasingly digital.
Without proper marketing, a movie with a fantastic plot, actors, scenes
and special effects may not be able to attract an audience. A successful campaign
requires a thoroughly designed plan that focuses not only on the types of marketing
channel but also on how they connect. One surprising fact is that a typical Indian
movie requires a six-month plan and 18 months of execution, while a Hollywood movie
requires 36 months of planning and 12 months of execution.
All of these trends give the industry ample potential to benefit from
analytical frameworks and applications. The analysis can take these trends in the form
of data points, such as the amount spent on each marketing vehicle and the time gap
between any two activities, to model how the audience's memory of a campaign is
retained or decays.
The advent of new technologies and the arrival of the “big data” era have enabled the
non-traditional industries such as the film industry to gain the benefits of analysis.
In addition, the latest trends in this industry have brought a large number of
platforms for connecting with consumers. This situation requires a more systematic
approach to increasing the efficiency and effectiveness of promotion on each platform
in order to achieve the greatest possible return from as little
investment as possible.
In addition, the past decade has seen not only a greater diversity of entertainment
consumption devices, but also growth in audience groups now accustomed to
customized content. Analyzing customers for better targeting, personalizing
content for specific media, and so on offer endless opportunities to benefit from
analytics.
All in all, we think that this industry is like a gold mine waiting for a
new framework to get the most out of it. Hence, predictive analysis for the film
industry is becoming popular these days.
The case study is about visualizing and forecasting the main variables responsible for
a movie's success rate, and helping to determine the revenue of upcoming movies with
the help of a model created from the historical data collected.
The data has been collected from data.world and kaggle.com. It consists of 28
variables for 5043 movies, spanning 100 years and 66 countries.
Source: Kaggle
Some of the R packages used in mining or cleaning the data are:
1. glmnet
Extremely efficient procedures for fitting the entire lasso or elastic-net
regularization path for linear regression, logistic and multinomial regression
models, Poisson regression and the Cox model. Two recent additions are the
multiple-response Gaussian, and the grouped multinomial regression.
2. tidyr
An evolution of 'reshape2'. It's designed specifically for data tidying (not
general reshaping or aggregating) and works well with 'dplyr' data pipelines.
3. dplyr
A fast, consistent tool for working with data frame like objects, both in
memory and out of memory.
4. ggplot2
A system for 'declaratively' creating graphics, based on “The Grammar of
Graphics”. You provide the data, tell 'ggplot2' how to map variables to
aesthetics, what graphical primitives to use, and it takes care of the details.
5. lubridate
Functions to work with date-times and time-spans: fast and user friendly
parsing of date-time data, extraction and updating of components of a date-
time (years, months, days, hours, minutes, and seconds), algebraic
manipulation on date-time and time-span objects. The 'lubridate' package has a
consistent and memorable syntax that makes working with dates easy and fun.
6. pscl
Bayesian analysis of item-response theory (IRT) models, roll call analysis;
computing highest density regions; maximum likelihood estimation of zero-
inflated and hurdle models for count data; goodness-of-fit measures for
GLMs; data sets used in writing and teaching at the Political Science
Computational Laboratory; seats-votes curves.
7. ROCR
ROC graphs, sensitivity/specificity curves, lift charts, and precision/recall
plots are popular examples of trade-off visualizations for specific pairs of
performance measures. ROCR is a flexible tool for creating cutoff-
parameterized 2D performance curves by freely combining two from over 25
performance measures (new performance measures can be added using a
standard interface).
8. coefplot
Plots the coefficients from model objects. This very quickly shows the user the
point estimates and confidence intervals for fitted models.
9. shiny
Makes it incredibly easy to build interactive web applications with R.
Automatic “reactive” binding between inputs and outputs and extensive
prebuilt widgets make it possible to build beautiful, responsive, and powerful
applications with minimal effort.
10. DT
Data objects in R can be rendered as HTML tables using the JavaScript library
'DataTables' (typically via R Markdown or Shiny). The 'DataTables' library
has been included in this R package. The package name 'DT' is an
abbreviation of 'DataTables'.
11. shinythemes
Themes for use with Shiny. Includes several Bootstrap themes, which
are packaged for use with Shiny applications.
Steps
First, using the revenue figures given as “gross” and “budget”, I computed the
profitability ratio (gross divided by budget). For the binomial model, a movie
is coded “1” (hit) when the ratio is above the quartile value and “0” (flop)
when it is below; for the multinomial model, it is coded “1”, “2” or “3”.
Because genre is a categorical variable, I converted the genres into 14
numeric categories.
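The labelling steps above can be sketched as follows. This is a minimal illustration on made-up numbers, not the project data: the toy `movies` data frame, the `Hit` column name, and the choice of the lower and upper quartiles as cut-offs are assumptions made for the example.

```r
# Toy data standing in for the "gross", "budget" and first-genre columns (values are made up)
movies <- data.frame(
  gross  = c(300, 120, 80, 45, 500, 60),
  budget = c(100, 100, 100, 50, 100, 120),
  Gen1   = c("Action", "Comedy", "Drama", "Horror", "Sci", "Western"),
  stringsAsFactors = FALSE
)

# Profitability ratio = gross revenue divided by budget
movies$Profitibility_Ratio <- movies$gross / movies$budget

# Assumed cut-offs: lower and upper quartile of the ratio
q <- quantile(movies$Profitibility_Ratio, probs = c(0.25, 0.75))

# Binomial label: 1 = hit (ratio above the upper quartile), 0 = flop
movies$Hit <- ifelse(movies$Profitibility_Ratio > q[["75%"]], 1, 0)

# Multinomial label: 1 = above the upper quartile, 2 = in between, 3 = the rest
movies$Result <- ifelse(movies$Profitibility_Ratio > q[["75%"]], 1,
                        ifelse(movies$Profitibility_Ratio > q[["25%"]], 2, 3))

# Genre as a numeric code via a named lookup vector; genres not in the list become NA
genre_codes <- c(Action = 1, Adventure = 2, Animation = 3, Biography = 4,
                 Comedy = 5, Crime = 6, Documentary = 7, Drama = 8, Family = 9,
                 Fantasy = 10, Horror = 11, Mystery = 12, Romance = 13, Sci = 14)
movies$Genre <- unname(genre_codes[movies$Gen1])

print(movies)
```

A named lookup vector is used here instead of `switch` because it returns `NA` for an unlisted genre, so the result always has one value per row.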
Output
Binomial
Multinomial
Observations:
1. The accuracy of the model is 0.99; the outcome is hit or flop, which is
represented by “1” and “0”.
2. Aspect ratio and profitability ratio play an important role in deciding the
future of a movie and in predicting it.
3. The density plot (Figure 6-13: Graph based on hits & Flops (1)) shows the
dividing line between hits and flops based on revenue.
4. (Figure 6-14:Coefficient Plot(1)) shows the coefficients of the variables, i.e. the
hit/flop result and the profitability ratio in relation to the revenue of the film.
6. (Figure 6-15: Invlogic Output) shows which variables play an important role in
deciding the revenue of the film. It clearly shows that aspect ratio and
profitability ratio play an important role, along with actor Facebook likes,
the number of user reviews, and the IMDb score.
7. (Figure 6-16: Coefficient Plot (2)) shows how the different genres relate to
different aspects of the available data.
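As a reading aid for the inverse-logit output: logistic regression coefficients are on the log-odds scale, and the inverse logit 1/(1+exp(-x)) maps a log-odds value to a number between 0 and 1. The sketch below uses made-up values, not the fitted coefficients of this project; `plogis` is the equivalent function built into base R.

```r
# Inverse logit: maps a log-odds value to the interval (0, 1)
invlogit <- function(x) {
  1 / (1 + exp(-x))
}

# Hypothetical log-odds values, chosen only for illustration
log_odds <- c(-2, 0, 2)
probs <- invlogit(log_odds)
print(round(probs, 3))   # 0.119 0.500 0.881

# Base R's plogis() gives the same result
stopifnot(isTRUE(all.equal(probs, plogis(log_odds))))
```

A log-odds value of 0 maps to 0.5, so a transformed coefficient above 0.5 points to a positive association with the outcome and one below 0.5 to a negative association.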
Suggestions
1. Since we have so many variables, we can also use lasso or ridge regression;
both shrink the coefficients, and lasso additionally selects the best-fit variables.
3. The results indicate that the more reviews a movie has, the more likely it
is to be a hit.
5. The ratings and the aspect ratio play an important part in the revenue in
the given data.
6. We can also take the weighted average of the number of Facebook likes of
the actors of a particular film and observe whether the lead actor or the
supporting actor plays the more important role in the revenue of the movie.
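The lasso suggestion can be sketched with the glmnet package listed earlier. The example below runs on simulated data, not the movie data: the predictor names `var1` to `var10`, the sample size, and the choice of `lambda.1se` are assumptions made for illustration.

```r
library(glmnet)

set.seed(1)
n <- 200

# Simulated predictors; only var1 and var2 actually influence the outcome
x <- matrix(rnorm(n * 10), nrow = n, ncol = 10)
colnames(x) <- paste0("var", 1:10)
y <- rbinom(n, 1, plogis(2 * x[, 1] - 2 * x[, 2]))

# Cross-validated logistic regression with a lasso penalty
# (alpha = 1 is lasso; alpha = 0 would be ridge)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)

# Coefficients at the more conservative lambda; lasso sets some of them exactly to zero
coef_vec <- as.matrix(coef(cv_fit, s = "lambda.1se"))[, 1]
selected <- names(coef_vec)[coef_vec != 0]
print(selected)
```

With ridge (`alpha = 0`) all coefficients are shrunk but none is set exactly to zero, so only lasso performs variable selection in this sense.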
Conclusion
The film industry plays a very important role in the U.S. and Indian economies.
Consequently, over the past three decades, film-related research has become
an important topic for many researchers. Because film is one of the highest-risk
industries, box office revenue (BOR) forecasting can play an important role in the economy.
In the real world, a movie's box office is determined by many factors that are
difficult to collect, such as the current economic situation, the latest news, the
popularity of the actors, and the number of competing movies. The film market is very
competitive, and predicting a movie's box office is a daunting challenge.
Future Work
Currently, the data comes only from IMDb. In future work, adding data from
other platforms such as Hotstar, Amazon Prime Video, Netflix and
BookMyShow could be of great value.
It is always hard to find the right balance between only analyzing the surface of a
problem and being stuck in details. The approach to viewer rating prediction is very
straightforward and can be easily applied by anyone. The box office prediction
requires more experience, because of its dependency on the user to be able to interpret
the different visualizations of the data in the right way and infer the right conclusions
from it. Visual Analytics is not widespread in the field of movie success prediction, so
the challenge was a good way to fuel interest in this research topic. I am very
confident Visual Analytics will become more and more integrated in our lives for the
great possibilities it enables. The human mind is very good at extracting information
from visualizations and thus can analyze data quicker and in a more complex way,
particularly in our times, when the amount of data we produce is skyrocketing.
Appendices
Code:
6.1 Binomial
library(glmnet)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(pscl)
library(useful)
library(nnet)
options(scipen = 999)
getwd()
setwd("C:/Users/kazani/Desktop/r/")
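# Data Cleaning
# Assumption: the data-loading step is not shown in the original listing;
# the file name below is hypothetical
h1 <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE)
View(h1)
head(h1)
sum(is.na(h1))
# Split the genres column into at most three genre columns
h1.sep1 <- separate(h1, genres, c("Gen1", "Gen2", "Gen3"), extra = "drop", fill = "right")
# Drop columns that are not used in the model
h1.sep1$plot_keywords <- NULL
h1.sep1$movie_imdb_link <- NULL
h1.sep1$color <- NULL
# Remove rows with missing values
h1.o1 <- na.omit(h1.sep1)
str(h1.o1)
# Profitability ratio = gross revenue / budget
h1.o1$Profitibility_Ratio <- h1.o1$gross / h1.o1$budget
summary(h1.o1$Profitibility_Ratio)
# Result: 1 if the ratio is above 1.9812, 2 if it is above 1.0207, otherwise 3
h1.o1$Result <- ifelse(h1.o1$Profitibility_Ratio > 1.9812, 1,
                       ifelse(h1.o1$Profitibility_Ratio > 1.0207, 2, 3))
a <- as.data.frame(table(h1.o1$Gen1))
# Map the first genre to a numeric code; genres outside this list become NA
genre_codes <- c(Action = 1, Adventure = 2, Animation = 3, Biography = 4,
                 Comedy = 5, Crime = 6, Documentary = 7, Drama = 8, Family = 9,
                 Fantasy = 10, Horror = 11, Mystery = 12, Romance = 13, Sci = 14)
h1.o1$Genre <- unname(genre_codes[h1.o1$Gen1])
# Keep only the columns used for modelling
Hollywood1 <- h1.o1[, c(2,3,7,8,14,18,22,25,26,27,28,29,30)]
head(Hollywood1)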
#Logistic Regression
# model, test, misClasificError, p, prf and auc are created by steps that are
# not shown in this listing
summary(model)
pR2(model)
colnames(Hollywood1)
#Fitting Data
fitted.results <- predict(model, newdata = subset(test, select = 1:13), type = 'probs')
print(paste('Accuracy', 1 - misClasificError))
library(ROCR)
head(p)
plot(prf)
auc
# Density of gross revenue with a reference line at 100,000,000
p <- ggplot(Hollywood1, aes(x = gross)) +
  geom_density(fill = "grey", color = "grey") +
  geom_vline(xintercept = 100000000)
library(coefplot)
coefplot(model)
# Inverse logit: maps a log-odds value to the interval (0, 1)
invlogit <- function(x) {
  1 / (1 + exp(-x))
}
invlogit(model$coefficients)
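The listing above uses `model`, `test` and `misClasificError` without showing how they are created. A minimal sketch of how such objects are commonly produced for a binary outcome is given below. It runs on simulated data; the 80/20 split, the 0.5 cut-off, the `glm` call and the column names are assumptions made for illustration, not the code used in this project.

```r
set.seed(42)
n <- 500

# Simulated stand-in for the movie data: two predictors and a binary Result (1 = hit, 0 = flop)
sim <- data.frame(budget = runif(n, 1, 100), imdb_score = runif(n, 1, 10))
sim$Result <- rbinom(n, 1, plogis(-4 + 0.8 * sim$imdb_score - 0.01 * sim$budget))

# Assumed 80/20 split into training and test sets
train_idx <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_idx, ]
test <- sim[-train_idx, ]

# Binomial logistic regression of Result on all other columns
model <- glm(Result ~ ., family = binomial(link = "logit"), data = train)
summary(model)

# Predicted probabilities on the test set, converted to 0/1 with an assumed 0.5 cut-off
fitted.results <- predict(model, newdata = test, type = "response")
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)

# Share of misclassified test rows; accuracy is its complement
misClasificError <- mean(fitted.results != test$Result)
print(paste("Accuracy", 1 - misClasificError))
```

Evaluating on rows that were held out of the fit keeps the reported accuracy from being inflated by the training data.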
6.2 Multinomial
library(glmnet)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(pscl)
library(useful)
options(scipen = 999)
getwd()
setwd("C:/Users/kazani/Desktop/r/")
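# Data Cleaning
# Assumption: the data-loading step is not shown in the original listing;
# the file name below is hypothetical
h2 <- read.csv("movie_metadata.csv", stringsAsFactors = FALSE)
View(h2)
head(h2)
sum(is.na(h2))
# Split the genres column into at most three genre columns
h2.sep <- separate(h2, genres, c("Gen1", "Gen2", "Gen3"), extra = "drop", fill = "right")
# Drop columns that are not used in the model
h2.sep$plot_keywords <- NULL
h2.sep$movie_imdb_link <- NULL
h2.sep$color <- NULL
# Remove rows with missing values
h2.o <- na.omit(h2.sep)
str(h2.o)
# Profitability ratio = gross revenue / budget
h2.o$Profitibility_Ratio <- h2.o$gross / h2.o$budget
summary(h2.o$Profitibility_Ratio)
a <- as.data.frame(table(h2.o$Gen1))
# Map the first genre to a numeric code; genres outside this list become NA
genre_codes <- c(Action = 1, Adventure = 2, Animation = 3, Biography = 4,
                 Comedy = 5, Crime = 6, Documentary = 7, Drama = 8, Family = 9,
                 Fantasy = 10, Horror = 11, Mystery = 12, Romance = 13, Sci = 14)
h2.o$Genre <- unname(genre_codes[h2.o$Gen1])
sum(is.na(h2.o$Gen1))
is.numeric(h2.o$Genre)
# Keep only the columns used for modelling
Hollywood <- h2.o[, c(2,3,7,8,14,18,22,25,26,27,28,29)]
head(Hollywood)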
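#Logistic Regression
# model, test, misClasificError, prf and auc are created by steps that are
# not shown in this listing
summary(model)
pR2(model)
fitted.results <- predict(model, newdata = subset(test, select = 1:12), type = 'response')
print(paste('Accuracy', 1 - misClasificError))
library(ROCR)
plot(prf)
auc
# Density of gross revenue with a reference line at 100,000,000
p <- ggplot(Hollywood, aes(x = gross)) +
  geom_density(fill = "grey", color = "grey") +
  geom_vline(xintercept = 100000000)
library(coefplot)
coefplot(model)
# Inverse logit: maps a log-odds value to the interval (0, 1)
invlogit <- function(x) {
  1 / (1 + exp(-x))
}
invlogit(model$coefficients)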
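For an outcome with more than two classes, such as a “1, 2, 3” coding, one common option is `multinom` from the nnet package. The sketch below is a hedged illustration on R's built-in `iris` data as a stand-in for the movie data; the split size and the use of `iris` are assumptions, and this is not the model fitted in this project.

```r
library(nnet)

set.seed(7)

# Built-in iris data as a stand-in: Species has three classes
train_idx <- sample(seq_len(nrow(iris)), size = 100)
train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Multinomial logistic regression of the three-class outcome on all other columns
model <- multinom(Species ~ ., data = train, trace = FALSE)

# type = "probs" returns one probability per class and row;
# type = "class" returns the most likely class
probs <- predict(model, newdata = test, type = "probs")
pred_class <- predict(model, newdata = test, type = "class")

# Share of test rows whose predicted class matches the true class
accuracy <- mean(pred_class == test$Species)
print(paste("Accuracy", accuracy))
```

Each row of `probs` sums to 1, and the predicted class is the column with the largest probability.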
6.3 Rshiny
library(shiny)
library(DT)
library(ggplot2)
library(shinythemes)
options(scipen = 999)
getwd()
library(ROCR)
setwd("C:/Users/kazani/Desktop/r/")
H$X<- NULL
a<- as.data.frame(table(H$Gen1))
b<-as.data.frame(table(H$Result))
attach(H)
attach(N)
formatStyle(c("Result", "Genre","budget","gross","movie_title",
"title_year","Profitibility_Ratio"),
backgroundColor = "lightblue"))
output$mytable1 = DT::renderDataTable({N})
output$mytable2 = DT::renderDataTable({a})
output$mytable3 = DT::renderDataTable({b})}
tabPanel("Data Cleaned",
DT::dataTableOutput("mytable"),
tags$br(),tags$br(),
DT::dataTableOutput("mytable2"),tags$br(),tags$br(),
DT::dataTableOutput("mytable3")
),
tabPanel("Raw Data",
DT::dataTableOutput("mytable1")),
navlistPanel(
"Regression",
tags$br(),tags$br(),
),
tags$br(),tags$br(),
href="https://github.com/ashkazani/ranalytics/blob/master/binomial"),tags$br(),tags$br(),
a("Multinomial Code ", href="https://github.com/ashkazani/ranalytics/blob/master/multinomial"),tags$br(),tags$br()),
a("Link1", href="https://data.world/popculture/imdb-5000-movie-dataset"),tags$br(),tags$br(),
a("Link2", href="https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset/data"),tags$br(),tags$br(),
a("Link3", href="https://www.kaggle.com/nazimamzz/imdb-dataset-of-5000-movie-posters/data"),tags$br(),tags$br()
) )
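The Shiny listing above is fragmentary, so the overall shape of such an app may be easier to follow from a minimal skeleton. The sketch below is an assumption-laden illustration, not the project's app: the toy data frame `N`, the panel title and the single “Raw Data” tab are made up, and only the output id `mytable1` is taken from the listing.

```r
library(shiny)
library(DT)

# Hypothetical stand-in for the raw movie data shown in the app
N <- data.frame(
  movie_title = c("Movie A", "Movie B"),
  gross = c(300, 80),
  budget = c(100, 100)
)

# User interface: one tab with an interactive table
ui <- fluidPage(
  titlePanel("Movie data"),
  tabsetPanel(
    tabPanel("Raw Data", DT::dataTableOutput("mytable1"))
  )
)

# Server: renders the data frame into the table with the matching output id
server <- function(input, output) {
  output$mytable1 <- DT::renderDataTable({ N })
}

# Build the app object; in an interactive session it can be started with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

The id passed to `dataTableOutput` in the UI has to match the name assigned on `output` in the server, otherwise the table stays empty.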