Asfaq Kazani

A Project Report on
Predictive Analysis to Measure Movie Acceptability
in partial fulfillment of the requirements of
Master of Management Studies
conducted by
University of Mumbai
through
Rizvi Institute of Management Studies & Research
under the guidance of
Prof. Sanjay Gupta
Submitted by
Asfaq Nazir Kazani
MMS
Batch: 2016 – 2018

Undertaking
I hereby declare that the project work entitled “Predictive Analysis to Measure
Movie Acceptability”, submitted to Rizvi Institute of Management Studies &
Research, is a record of original work done by me under the guidance of Prof. Sanjay
Gupta of Rizvi Institute of Management Studies and Research, and that this project work is
submitted in partial fulfillment of the requirements for the award of the degree of
Master of Management Studies. The results embodied in this report have not been
submitted to any other University or Institute for the award of any Degree or Diploma.
I further certify that I have no objection and grant Rizvi Institute of Management
Studies & Research or Mumbai University the right to publish any chapter or project,
if they deem fit, in journals, magazines or newspapers without any further permission.
Name: Asfaq Nazir Kazani
Roll No: 14
Class: MMS (2016-2018)
Place: Mumbai
Certificate
This is to certify that Mr. Asfaq Nazir Kazani, a student of Rizvi Institute of
Management Studies and Research, of MMS IV bearing Roll No. 14 and specializing
in Systems, has successfully completed the project titled
“Predictive Analysis to Measure Movie Acceptability”
under the guidance of Prof. Sanjay Gupta in partial fulfillment of the requirements
of Master of Management Studies by University of Mumbai for the academic year
2016 – 2018.
Prof. Sanjay Gupta, Project Guide
Prof. Umar Farooq, Academic Coordinator
Dr. Kalim Khan, Director

Acknowledgement
I wish to take this opportunity to express my sincere gratitude to all those who helped
me in some way or the other in the completion of my project.
I would like to thank my project guide, Prof. Sanjay Gupta, for his constant support,
encouragement and guidance, without which the successful completion of this project
would have been impossible.
I would like to specially thank Dr. Kalim Khan for always being supportive and
inspiring me throughout the completion of my project.
It gives me immense pleasure to express my sincere gratitude to Rizvi
Institute of Management Studies and Research for providing me with the valuable and
necessary infrastructure to carry out my project work.
I would like to thank the University of Mumbai for giving me an opportunity to present
my skills in the form of this project, which will not only be useful for my
academic profile but will also help me in seeking employment and
in facing the growing competition at the corporate level.
______________________
Asfaq Nazir Kazani
Roll No: M- 14
MMS - Systems
Batch (2016-2018)
Executive Summary
The film industry is one of the biggest contributors to the entertainment industry, and
it is characterized by the unpredictability of its successes and failures: ‘Hollywood is the
land of hunch and the wild guess’. Research points to the unpredictability of product
demand and suggests that 25% of the total revenue of a movie is earned in the first 2
weeks of release. Litman was the first to develop a multiple regression model to
predict the commercial success of movies.
By 2014, the U.S. film industry’s revenue had reached 564 billion U.S. dollars. It is
expected that the entertainment industry will grow to more than US$669 billion in the
next four years. It is estimated that 80% of the profits of the film industry over
the past 10 years were generated by 6% of the films released, while 78% of movies lost
money over the same period. Therefore, a lot of research has been conducted to predict the
success of Hollywood movies.
The Indian movie industry differs from its counterparts in many ways, especially in
producing movies in many different languages and in its low ticket prices. In 2013,
1.68 billion U.S. dollars of the Indian film industry's total revenue of 2.07
billion came from overseas and domestic box office collections.
According to the Deloitte and ASSOCHAM report, theatrical
revenue in India is predicted to increase from $1.78 billion to $2.13 billion by FY2018, so
that theatrical circuits would account for some 74% of total income.
Bollywood works very differently from Hollywood. In Bollywood,
a lot of importance is normally given to parameters such as
celebrity appeal and the movie album, which are an integral part of the movie
itself, unlike in Hollywood. This project looks into the details of
movie watching by splitting the research into two main components. The first is
exploring the variables that influence how frequently people watch movies. The second is
developing a model to predict the success or failure of a movie, which is the
deciding factor for producers and distributors in whether or not to invest in
it.
Objectives:
• To understand the concept of predictive analysis.
• To understand the best technological platforms used for predictive analysis,
i.e. R and RShiny.
• To understand different models for predictive analysis.
Table of Contents
Chapter 1. Introduction
1.1 Introduction
1.2 What has changed in the entertainment industry?
1.3 What is Predictive Analysis?
1.4 Predictive Analytics Process
1.5 Applications of Predictive Analytics
1.6 Benefits of Predictive Analytics
1.7 Drawbacks and Criticism
Chapter 2. Predictive Analysis Implementation through “R”
2.1 What Is R?
2.2 The R Environment
2.3 Advantages of using R statistical software for predictive modelling
2.4 Development Environment to Implement “R”
2.5 RShiny
Chapter 3. Predictive Analysis Process
3.1 Predictive Analytics Workflow
Chapter 4. Predictive Analysis Models
4.1 Why do we use Regression Analysis?
4.2 How many types of regression techniques do we have?
Chapter 5. Film Industry
5.1 Hollywood Industry Facts
5.2 Bollywood Industry Facts
5.3 Need for Predictive Analysis
Chapter 6. Case Study
Conclusion
Appendices
7.1 Binomial
7.2 Multinomial
7.3 RShiny
Annexures
Bibliography
Predictive Analysis to Measure Movie Acceptability Asfaq Nazir Kazani
Chapter 1. Introduction
Rizvi Institute of Management Studies and Research 1

1.1 Introduction
With the emergence of big data, the world has changed in an unprecedented way. Thanks
to the many advantages of data analysis, many things that were previously impossible
to analyze are now easier and more intuitive to predict. The entertainment
industry is no exception. Over the years, data analysis and its tremendous power have
been a major driver of the transformation of the entertainment industry's operating
model.
In a fast-growing and booming industry such as the movie industry, data analysis has
opened up many important new ways to analyze past data, make creative and
marketing decisions, and accurately predict the fate of upcoming movie releases.
1.2 What has changed in the entertainment industry?
One of the most important elements of predictive analytics is the collection and
storage of large amounts of data. The film industry has been showing rapid growth. In
the past few years, the growth rate has increased significantly due to the rise of
alternative distribution platforms such as online platforms and mobile platforms. The
rich data in the film industry makes it an exciting area for data analysts and
statisticians. Previously, the film industry relied on knowledge of certain industry
trends, basic rules of thumb, conventional wisdom and intuition to predict the
success or failure of particular movies. This approach has never been very accurate or
reliable, even though it has been used in many fields for many years.
At present, with the emergence of big data and exciting opportunities brought about
by data mining and analysis, the industry is actively developing a new, improved and
reliable method to accurately predict the success and failure of a particular movie.
Stakeholders in most of the world’s major industries are turning to data scientists and
analysts to help them increase their success rate through data analysis. At present,
many major movie studios in the world are actively implementing data analysis as the
main means to measure the success of many of their projects.

1.2.1 The Basics of Data Analytics for the Motion Picture Industry
The main factor that helps determine the likelihood of a particular movie's success is
knowledge about what interests people and arouses their curiosity. By analyzing a
variety of online resources, including search engine results, video views
and comments, ratings on expert websites, and social media content, this knowledge can
be obtained to some extent. Analysis of the past performance of other movies of
the same genre or on the same subject can also be brought into the
analysis to improve the results. The main goal is to be able to accurately
predict the box office receipts of a particular movie using relevant types of data. To
conduct this analysis, analysts need a large amount of important
information, including past records of movies by the same director and production company,
other movies of the same genre, similar casts, story types, and different approaches
to marketing and promotion. In addition to these important factors, there are other
influential factors such as the reception of trailers and announcements, social media buzz
and reviews on public forums.
A typical pathway for analyses of this kind can be the following:
• Past, current and upcoming releases are first classified and segmented
using cluster analysis techniques.
• Similarity checks are used to determine the degree of similarity between the movie in
question and other movies that are perceived as similar.
• The model derived from the above steps, together with the past data collected, is used
to arrive at an approximate estimate of the net collection of the particular movie in question.
• Based on the above factors, an accurate and reliable statistical model is built that also
accounts for other intrinsic factors such as awareness, interest and curiosity.
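The similarity check and the rough estimate described above can be illustrated with a small nearest-neighbour sketch. This is an illustrative Python example, not the method or the R code used in this project: the feature set (budget, star power, trailer buzz), all figures and the function names `scale` and `estimate_collection` are assumptions made up for the example.

```python
from math import dist  # Euclidean distance, available from Python 3.8

# Hypothetical past movies: (budget, star_power, trailer_buzz, net_collection).
# Every number here is invented purely for illustration.
past_movies = [
    (80.0, 9.0, 8.0, 210.0),
    (20.0, 4.0, 3.0, 25.0),
    (60.0, 7.0, 9.0, 150.0),
    (15.0, 3.0, 5.0, 18.0),
]

def scale(features, maxima):
    """Divide each feature by its maximum so no single feature dominates."""
    return [f / m for f, m in zip(features, maxima)]

def estimate_collection(new_features, history, k=2):
    """Average the collections of the k most similar past movies."""
    maxima = [max(row[i] for row in history) for i in range(len(new_features))]
    target = scale(new_features, maxima)
    # Smaller distance between scaled feature vectors means more similar movies.
    ranked = sorted(history, key=lambda row: dist(scale(row[:-1], maxima), target))
    nearest = ranked[:k]
    return sum(row[-1] for row in nearest) / k

estimate = estimate_collection((70.0, 8.0, 8.0), past_movies)
print(estimate)
```

With these invented figures, the two closest past movies are the first and the third, so the estimate is simply the average of their collections. A real analysis would use many more movies and features and would validate the estimate against held-out data.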

1.2.2 Getting Fine-tuned Results
Although the above process is only a rough overview, data scientists can obtain more
precise and finer results in a variety of ways. With the availability of large amounts of
data and the abundance of today's sophisticated tools, technology and data processing
platforms, predictors can achieve surprising levels of accuracy. The first step in
achieving this goal is to ensure that the right audience is targeted.
To achieve this goal, individual fans can be treated as potential customers, and the
data can then be studied carefully to determine which potential customers are most likely to
influence the opinions of others. In this regard, it is helpful to consider both potential
audiences and theater owners, who may have their own strategies for scheduling
specific films to increase their occupancy and profits. Factors such as demographics
also play an important role in this regard.
To obtain more accurate results, another factor that should be included in the
analysis is seasonality. The fate of a film released at a time that coincides with a
specific event, such as a major holiday or a weekend, needs to be studied in
detail. Usually, a given week will see the release of several different movies, and
people are more likely to watch only one movie in that week. Therefore,
it is necessary to study each movie's share of the week's collections rather than its
absolute collections in order to obtain accurate results. This not only drives profits,
but also plays an important role in the marketing and promotion of movies.
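The point about ratios can be made concrete with a short sketch. This is an illustrative Python example with invented figures; the movie names and the function `market_shares` are assumptions for the example, not data from this study.

```python
# Hypothetical collections (in crore) of three movies released in the same week.
weekly_gross = {"Movie X": 42.0, "Movie Y": 18.0, "Movie Z": 60.0}

def market_shares(gross_by_movie):
    """Convert absolute collections into each movie's share of the week's total."""
    total = sum(gross_by_movie.values())
    return {title: gross / total for title, gross in gross_by_movie.items()}

shares = market_shares(weekly_gross)
for title, share in shares.items():
    print(f"{title}: {share:.0%} of the week's collections")
```

Comparing shares rather than absolute collections makes a crowded holiday week comparable with a quiet week, because each movie is measured against the competition it actually faced.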

1.2.3 Critical success drivers for leveraging the prowess of data analysis
With the astonishing growth of data on digital platforms and the exponential growth
of daily data, data analysis experts will continue to propose new and innovative
methods to use this information to achieve more and more accurate predictions of the
success or failure of a movie. The ability to predict the success of an upcoming
project with high accuracy has clear advantages. It allows movie producers to
fine-tune their promotional decisions before the film is released in order to attract more revenue.
For producers, an accurate forecasting system is useful in
securing the right investment capital. For investors, this kind of analysis means a deep
understanding of the break-even point and economic consequences of a particular
project, and it helps smooth decision-making. Through data analysis, each team that
plays a role in a particular project can be better equipped with the knowledge and
insight needed to make important decisions.
At the most basic level, the data analysis process in the film industry must
process large amounts of past data, identify patterns in collections,
juxtapose them with the data points currently available, and use the acquired knowledge to
improve performance and make better decisions. Accurate revenue forecasts
can pave the way for sound financial and marketing plans for future projects.
There are many important nuances in this regard, all of which relate to a robust
and unbiased assessment of the qualitative and quantitative factors, of proven
statistical significance, that affect the fate of a particular movie project. These factors
are varied, and may include genre, release date, location, actors, budget, marketing
effort, marketing budget, and others. All of these factors can also affect
each other and ultimately have a huge impact on the likelihood of success.

For example, a major movie studio might look at very specific data, such as past
records of all movies of a particular genre that were released over the past few years and
featured one or two specific stars. The data can then be sorted chronologically and analyzed to
find any relevant trends. These trends can then be considered in conjunction with
other external factors that also have an impact on revenue, such as economic factors,
competing events during release, and weather conditions.
After this process, a generic model can be built to predict the likely box office fortunes
of any type of movie project. An important addition to this data analysis approach is
to obtain information from critics' reviews, ratings, user reviews, and
social media content. This can help establish the pattern and degree of audience
engagement, and the potential of a particular genre or particular cast member
to generate interest and curiosity, or even viewer fatigue.
1.3 What is Predictive Analysis?
Predictive analysis is a branch of advanced analytics that is used to make predictions about unknown
future events. It uses techniques from data mining, statistics, modeling, machine
learning, and artificial intelligence to analyze current data and predict the future. It
brings together management, information technology, and the modeling of business processes.
The patterns found in historical and transactional data can be used to identify
future risks and opportunities. Predictive analytic models capture the relationships
among multiple factors to assess the risk associated with a particular set of conditions and
assign it a score or weight. By successfully applying predictive analytics, companies can
effectively interpret big data and benefit from it.
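A minimal sketch of how a model can combine several weighted factors into a single score is shown below. It is an illustrative Python example of a logistic score; the factor names, weights and intercept are invented assumptions and are not estimates from this project.

```python
from math import exp

# Hypothetical weights: positive values raise the score, negative values lower it.
weights = {"star_power": 0.6, "trailer_buzz": 0.4, "competing_releases": -0.8}
intercept = -5.0  # assumed baseline, chosen only for illustration

def success_score(factors):
    """Combine weighted factors and squash the result into a 0-1 score."""
    z = intercept + sum(weights[name] * value for name, value in factors.items())
    return 1.0 / (1.0 + exp(-z))

strong = success_score({"star_power": 9, "trailer_buzz": 8, "competing_releases": 1})
weak = success_score({"star_power": 3, "trailer_buzz": 2, "competing_releases": 3})
print(round(strong, 3), round(weak, 3))
```

In practice the weights would be estimated from historical data, for example by fitting a logistic regression, rather than chosen by hand as they are here.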

Data mining and text analytics, together with statistics, allow business users to create predictive
intelligence by discovering patterns and relationships in structured and unstructured
data. Structured data includes fields such as age, gender, marital
status, income and sales. Unstructured data is textual data such as call center notes, social
media content or other types of open text, from which content and sentiment need to be
extracted before they can be used in the model building process.
Predictive analytics enables organizations to become proactive and forward-looking,
anticipating outcomes and behavior based on data rather than on hunches or
assumptions. Prescriptive analysis goes further: it suggests actions that benefit
from the forecast and provides decision options that take the forecast and its
impact into account.
Figure 1-1: Predictive Analytics Value Chain

Source: https://www.predictiveanalyticstoday.com/what-is-predictive-analytics/

1.4 Predictive Analytics Process
Figure 1-2: Predictive analysis Process
Source: https://www.predictiveanalyticstoday.com/what-is-predictive-analytics/
1.4.1 Define Project:
Define the project outcomes, deliverables, scope of the effort and business objectives,
and identify the data sets that are going to be used.
1.4.2 Data Collection:
Data Mining for predictive analytics prepares data from multiple sources for analysis.
This provides a complete view of the customer interactions.
1.4.3 Data Analysis:
Data Analysis is the process of inspecting, cleaning, transforming, and modeling data
with the objective of discovering useful information and arriving at conclusions.
1.4.4 Statistics:
Statistical Analysis makes it possible to validate assumptions and hypotheses and to test them
using standard statistical models.

1.4.5 Modeling:
Predictive Modeling provides the ability to automatically create accurate predictive
models about the future. There are also options to choose the best solution through
multi-model evaluation.
1.4.6 Deployment:
Predictive Model Deployment provides the option to deploy the analytical results into
the everyday decision-making process to get results, reports and output by automating
the decisions based on the modeling.
1.4.7 Model Monitoring:
Models are managed and monitored to review the model performance to ensure that it
is providing the results expected.
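The modeling and monitoring steps above can be sketched in a few lines. This is an illustrative Python example, not the workflow implemented in this project: the data points, the single predictor (marketing spend), the tolerance of 10 and the function names `fit_line` and `mean_abs_error` are all assumptions made for the example.

```python
# Hypothetical records: (marketing spend, opening-week collection), both in crore.
records = [(5, 22), (8, 31), (10, 41), (12, 46), (15, 61), (18, 70), (20, 82), (25, 99)]

# Data collection and analysis: keep the last two records aside as a hold-out set.
train, test = records[:6], records[6:]

def fit_line(points):
    """Modeling: ordinary least squares for a line y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_abs_error(model, points):
    """Monitoring: average absolute gap between predicted and actual values."""
    slope, intercept = model
    return sum(abs(y - (slope * x + intercept)) for x, y in points) / len(points)

model = fit_line(train)
holdout_error = mean_abs_error(model, test)
needs_retraining = holdout_error > 10.0  # hypothetical tolerance for this sketch
print(model, holdout_error, needs_retraining)
```

Checking the error on data the model has not seen is what the monitoring step relies on: once that error exceeds the agreed tolerance, the model is refitted on more recent data.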

1.5 Applications of Predictive Analytics
1.5.1 Customer relationship management (CRM)
Predictive analysis applications are used to achieve CRM objectives such as
marketing campaigns, sales, and customer services. Analytical customer relationship
management can be applied throughout the customer's life cycle, from
acquisition and relationship growth to retention and win-back.
1.5.2 Health Care
Predictive analysis applications in health care can identify patients who are at
risk of developing certain conditions such as diabetes, asthma and other lifelong
illnesses. Clinical decision support systems incorporate predictive analytics to
support medical decision making at the point of care.
1.5.3 Collection Analytics
Predictive analytics applications optimize the allocation of collection resources by
identifying the most effective collection agencies, contact strategies and legal actions,
increasing recovery while reducing collection costs.
1.5.4 Cross Sell
Predictive analytics applications analyze customers' spending, usage and other
behavior, leading to efficient cross-selling, that is, selling additional products to current
customers of an organization that offers multiple products.
1.5.5 Fraud detection
Predictive analytics applications can detect inaccurate credit applications, fraudulent
transactions (both offline and online), identity theft and false insurance claims.
1.5.6 Risk management
Predictive analytics applications predict the best portfolio to maximize return under the
capital asset pricing model, and support probabilistic risk assessment to yield accurate
forecasts.

1.5.7 Direct Marketing
Predictive analytics can also help to identify the most effective combination of
product versions, marketing material, communication channels and timing that should
be used to target a given consumer.
1.5.8 Underwriting
Predictive analytics can help underwriters assess risk by predicting the chances of
illness, default or bankruptcy. Predictive analytics can streamline the process of
customer acquisition by predicting the future risk behavior of a customer using
application-level data.
Figure 1-3: Predictive Analysis
Source: https://www.predictiveanalyticstoday.com/what-is-predictive-analytics/

1.6 Benefits of Predictive Analytics
In its various forms (predictive modeling, decision analysis and optimization,
transaction analysis and predictive search), predictive analysis can be applied to a range of
business strategies, and has always been a key player in search advertising and
recommendation engines. These techniques can provide managers and
executives with decision-making tools to influence upselling, sales and revenue forecasting,
manufacturing optimization and even new product development. Although useful and
helpful, predictive analysis is not for everyone.
1.7 Drawbacks and Criticism
Companies that want to use data-driven decision-making need a lot of relevant
data from a range of activities, and such large datasets are sometimes difficult to obtain. Even
if a company has enough data, critics argue that when it comes to predicting
human behavior, computers and algorithms cannot account for all the variables that may
affect customer buying patterns, from weather changes to emotions to relationships.
Time also plays an important role in how these techniques work. Although a
model may be successful at some point, customer behavior changes over time, so the
model must be updated. The 2008-2009 financial crisis illustrates the importance of
time considerations: flawed models predicted the likelihood of
mortgage customers repaying their loans without considering the possibility that home
prices might fall.

Chapter 2. Predictive Analysis Implementation through “R”

2.1 What Is R?
R is a language and environment for statistical computing and graphics. It is a
GNU project that is similar to the S language and environment developed by John
Chambers and colleagues at Bell Labs (formerly AT&T, now Lucent Technologies).
R can be thought of as a different implementation of S. Although there are some
important differences, much of the code written for S runs unaltered under R.
R provides a wide variety of statistical techniques (linear and nonlinear modeling, classical
statistical tests, time series analysis, classification, clustering) and graphical techniques, and is
highly extensible. The S language is often the tool of choice for research in statistical
methodology, and R provides an open source route to participation in that activity.
One of the strengths of R is the ease with which well-designed, publication-quality plots
can be produced, including mathematical symbols and formulas where needed. Great care
has been taken over the defaults for the minor design choices in graphics, but the user
retains full control.
R is available as free software in source code form under the terms of the Free Software
Foundation's GNU General Public License. It compiles and runs on a wide variety of UNIX
platforms and similar systems (including FreeBSD and Linux), Windows and macOS.
Website: https://www.r-project.org/
Website for Downloading R
https://cran.r-project.org/bin/windows/base/
CRAN is a network of FTP and web servers around the world that store identical,
up-to-date versions of code and documentation for R.

2.2 The R Environment
R is an integrated suite of software facilities for data manipulation, calculation and
graphical display. It includes
• An effective data handling and storage facility,
• A suite of operators for calculations on arrays, in particular matrices,
• A large, coherent, integrated collection of intermediate tools for data analysis,
• Graphical facilities for data analysis and display either on-screen or on hardcopy,
• A well-developed, simple and effective programming language which includes
conditionals, loops, user-defined recursive functions and input and output
facilities.
The term “environment” is intended to characterize it as a fully planned and coherent
system, rather than an incremental accretion of very specific and inflexible tools, as is
frequently the case with other data analysis software.
Many users think of R as a statistics system, but it is better thought of as an environment
within which statistical techniques are implemented. R can be extended (easily) via packages. There
are about eight packages supplied with the R distribution and many more are available
through the CRAN family of Internet sites covering a very wide range of modern
statistics.

2.2.1 Advantages
• R is the most comprehensive statistical analysis package available. It contains
all standard statistical tests, models and analyses, and provides a
comprehensive language for managing and manipulating data. New
technologies and ideas often appear first in R.
• R's graphics capabilities are excellent, providing a fully programmable
graphical language that surpasses most other statistical and graphical software
packages.
• The validity of the R software is ensured through openly validated
and comprehensive governance, as documented by the R Foundation
for Statistical Computing (2008). Since R is open source, unlike closed-source
software, it has been reviewed by many internationally renowned statisticians
and computational scientists.
• R is free and open source software, allowing anyone to use it and, importantly,
to modify it. R is issued under the GNU General Public License and copyright is
held by The R Foundation for Statistical Computing. R has no license
restrictions (other than ensuring our freedom to use it at our own discretion),
and so we can run it anywhere and at any time, and even sell it under the
conditions of the license.
• Anyone is welcome to provide bug fixes, code enhancements, and new
packages. The wealth of quality packages available for R is a testament
to this approach to software development and sharing.
• R has more than 4,800 packages available from multiple repositories,
specializing in topics such as econometrics, data mining, spatial analysis and
bioinformatics.
• R is cross-platform. R can run on many operating systems and different
hardware. It is widely used on GNU/Linux, Macintosh and Microsoft
Windows and can run on 32-bit and 64-bit processors.
• R works with many other tools: it can import data from CSV files, SAS
and SPSS, or directly from Microsoft Excel, Microsoft Access,
Oracle, MySQL, and SQLite. It can also generate graphical output in PDF,
JPG, PNG and SVG formats, as well as tabular output for LaTeX and HTML.

2.2.2 Disadvantages
• R has a steep learning curve. It takes a while to get used to the power of R, but
the curve is no steeper than for other statistical languages.
• R is not so easy to use for beginners. There are several easy-to-use graphical
user interfaces (GUIs) for R that offer point-and-click interaction, but they
generally lack the polish of commercial offerings.
• Documentation is sometimes patchy and terse, and impenetrable to
non-statisticians. However, some very high-standard books are
increasingly filling the documentation gaps.
• The quality of some packages is less than perfect, although if a
package is useful to many people, it will quickly develop into a very
robust product through collaborative efforts.
• There is, in general, no one to
complain to if something doesn’t work. R is a software application that many
people freely devote their own time to developing. Problems are usually dealt
with quickly on the open mailing lists, and bugs disappear with lightning
speed. Users who do require it can purchase support from a number of vendors
internationally.
• Many R commands give little thought to memory management, so R can quickly
consume all available memory. This may be a limitation when doing data
mining. There are various solutions, including the use of a 64-bit operating
system, which can address more memory than a 32-bit one.

2.3 Advantages of using R statistical software for predictive modelling
Predictive modeling is a data-driven, inductive form of modeling that large companies
continue to use to gain insight into future trends and risks. Modeling based on data
extraction, cleansing and analysis helps predict the value of the target variable
(Fortuny, Martens, & Provost, 2013). Most analysis software is developed
to help understand an organization's situation from the trends indicated by
relevant factors. One software package that helps with prediction is R, which aggregates and
estimates the target variables based on different factors (Varian, 2014). The software
supports the development of a wide range of forecasting models.
2.3.1 Easy user interface
R is a text-based program in which commands are entered at a prompt and executed one
by one. It is evolving towards a more graphical interface in which a code editor
interacts with installed packages and presents the output of commands through the interface
(Valero-Mora & Ledesma, 2012). In addition, RStudio, a code
editor that interfaces with R on the Windows, macOS and Linux platforms, has become
popular. Kilburn (2015) mentioned that RStudio is R-based commercial software that
provides functions related to predictive modeling, data analysis and
more.

Figure 2-1: User interface of R Studio

Source: https://www.projectguru.in/publications/advantages-r-predictive-modelling/
The above picture shows the four parts of RStudio. First, the script pane is
where data is imported and code is written. Second, the Environment pane shows the
variables that exist in a given data set. Third, the R console is where all the commands
are run. Finally, the graphical output pane displays the plots produced by the
commands run in the console.
There are other user interfaces for R, such as Rattle, Red-R and RKWard, which
are also free to use.
2.3.2 Availability of different types of predictive modelling techniques
The range of forecasting capabilities varies from software to software. R is mainly
built for running complex data science algorithms, and it also provides good packages
for predictive analysis. It supports data visualization through charts and other
graphical representations. Usually there are 3 types of predictive modelling in R:
• Propensity modeling,
• Clustering modeling,
• Collaborative filtering (Strickland, 2015).

First, propensity modeling predicts a customer's likely future behavior toward the
company. Second, cluster modeling is used to segment customers into different groups.
Finally, collaborative filtering uses user feedback to generate recommendations. R
allows the development of both User-User and Item-Item collaborative filtering
algorithms.
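To make the Item-Item idea concrete, the following is a minimal sketch in base R. The rating matrix, the user and movie names, and the helper predict_rating() are all hypothetical and invented for illustration; real recommenders would typically use a dedicated package and far more data.

```r
# Hypothetical user x movie rating matrix (0 means "not rated"); values are made up
ratings <- matrix(
  c(5, 4, 0, 1,
    4, 5, 1, 0,
    1, 0, 5, 4,
    0, 1, 4, 5),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("user", 1:4), paste0("movie", 1:4))
)

# Cosine similarity between every pair of movies (columns)
item_norms <- sqrt(colSums(ratings^2))
item_sim <- crossprod(ratings) / outer(item_norms, item_norms)

# Predict a rating as the similarity-weighted average of the user's own ratings
predict_rating <- function(user, item) {
  r <- ratings[user, ]
  rated <- r > 0
  sum(item_sim[item, rated] * r[rated]) / sum(item_sim[item, rated])
}

pred <- predict_rating("user1", "movie3")
pred
```

Here user1 liked movie1 and movie2 but disliked movie4, which is the movie most similar to movie3, so the predicted rating for movie3 comes out low.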
Figure 2-2: Weather forecasting using predictive modelling of R
Source: https://www.projectguru.in/publications/advantages-r-predictive-modelling/
Since its introduction, R has evolved to make it easier for users to build predictive
models. To see how an analytical model responds in practice, it is best to link it
directly to the market execution system.

2.3.3 Companies using R for predicting consumer behavior
In short, companies that generate huge databases try to predict customer behavior
through statistical analysis and knowledge. Smith (2014) believes that it is becoming
more common to use R in marketing data analysis based on customers' habits and
background.
In addition, the financial and insurance industries are leading users of advanced
statistical analysis, which they use to develop new transaction, pricing and
optimization strategies (Mcneil, Martinez-miranda, Engelhardt, & Shanahan, 2013). R
also plays a strategic role in weather forecasting, climate change detection, and
wartime casualty estimation in unstable regions (Fraley, Raftery, & Gneiting, 2011).

2.4 Development Environment to Implement “R”
2.4.1 RStudio
The workspace tab shows all the active objects so far.
The files tab shows all the files and folders in your default workspace as if you
were on a PC/Mac window. The plots tab will show all your graphs. The packages tab
will list a series of packages or add-ons needed to run certain processes. For
additional info see the help tab.
The console is where you can type commands and see output.
Figure 2-3: Opening Screen of RStudio
RStudio is a free and open-source integrated development environment (IDE) for R, a
programming language for statistical computing and graphics. RStudio was founded
by JJ Allaire, creator of the programming language ColdFusion. Hadley Wickham is
the Chief Scientist at RStudio. RStudio is available in two editions: RStudio Desktop,
where the program is run locally as a regular desktop application, and RStudio Server,
which is accessed through a web browser while running on a remote server. Prepackaged
distributions of RStudio Desktop are available for Windows, macOS, and Linux.
RStudio is written in the C++ programming language and uses the Qt framework for its
graphical user interface. Work on RStudio started around December 2010, and the first
public beta version (v0.92) was officially announced in February 2011. Version 1.0
was released on 1 November 2016. Download RStudio:
https://www.rstudio.com/

2.4.2 Reasons to Use RStudio
• Exploratory data analysis in R usually generates a bunch of charts, only a few
of which are of interest. Any plot in the Plots pane graphics device can be
exported afterwards as pdf / svg / postscript (as well as .png, .jpeg, etc.) at
any reasonable size you choose. There has always been some difference between
the plot shown in the Plots pane and the plot after export, but this is steadily
improving (v0.97 sets the default export size to the dimensions of the current
pane). It is fair to say that when you move from exploratory data analysis to
high-quality graphics, you should set up your graphics device explicitly.
• It runs locally on your operating system and also has a server version.
• It also has a free, open-source edition.
• RStudio supports version control and organizes code into projects.
• Its manipulate package lets you change plot parameters dynamically.
• Code completion, which is always convenient.
• It is under active development, yet stable.
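Setting up a graphics device explicitly, as the first point recommends for high-quality graphics, can be sketched as follows. The file name and the plotted numbers are placeholders chosen for illustration; pdf() and dev.off() are standard base R functions.

```r
# Hypothetical output file in the session's temporary directory
out_file <- file.path(tempdir(), "budget_vs_gross.pdf")

# Open a PDF device with an explicit size (in inches), draw, then close the device
pdf(out_file, width = 7, height = 5)
plot(c(10, 20, 30), c(15, 45, 50),
     xlab = "Budget (made-up values)", ylab = "Gross (made-up values)",
     main = "Explicit graphics device example")
invisible(dev.off())

file.exists(out_file)
```

Because the size is fixed in the call to pdf(), the exported figure no longer depends on the dimensions of the Plots pane.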
2.5 RShiny
Shiny is an R package that makes it easy to build interactive Web applications directly
from R. Building and deploying Shiny applications relies on reactive programming,
which helps make applications more efficient, robust, and correct. Shiny can also be
used to build interactive plots and to create interactive documents that contain Shiny
components.
Shiny makes it easy for R users to turn analytics into an interactive Web application
that anyone can use. These applications allow you to specify input parameters using
friendly controls such as sliders, pull-down menus, and text fields; and they can easily
combine any number of outputs, such as figures, tables, and summaries.

No HTML or JavaScript knowledge is required. If you have some experience with R, you
are only minutes away from combining R's statistical functions with the simplicity of
a web page:
Figure 2-4: RShiny Example
Source: http://blog.rstudio.com/2012/11/08/introducing-shiny/
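As an illustration of a slider input driving a plot output, a minimal Shiny application might look like the sketch below. It assumes the shiny package is installed; the input id "bins", the output id "hist" and the use of the built-in faithful data set are choices made for this example only and are not taken from the project's own app.

```r
library(shiny)

# User interface: one slider (input) and one plot (output)
ui <- fluidPage(
  titlePanel("Histogram demo"),
  sidebarLayout(
    sidebarPanel(
      sliderInput("bins", "Number of bins:", min = 5, max = 50, value = 20)
    ),
    mainPanel(plotOutput("hist"))
  )
)

# Server logic: the plot is redrawn automatically whenever input$bins changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(faithful$eruptions, breaks = input$bins,
         main = "Eruption duration", xlab = "Minutes")
  })
}

# Build the app object; in an interactive session start it with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

The app object is only constructed here, so the script does not block; calling runApp(app) from the console opens it in a browser.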

Chapter 3. Predictive Analysis Process

Predictive analysis is the process of using data to make predictions. The process
uses data together with analysis, statistics, and machine learning techniques to
create predictive models that forecast future events.
The term "predictive analysis" describes applying statistical or machine learning
techniques to create quantitative predictions about the future. Typically, supervised
machine learning techniques are used to predict a future value (how long can the
machine run before it needs to be maintained?) or to estimate a probability (how
likely is the customer to default on the loan?).
Predictive analytics starts with a business goal: use data to reduce waste, save time,
or reduce costs. The process feeds heterogeneous, often massive, data sets into
models. These models can produce clear, actionable results to support that goal,
such as less material waste, less stocked inventory, and manufactured products
that meet specifications.

3.1 Predictive Analytics Workflow
We are all familiar with predictive models for weather forecasting. An important
industry application of predictive models is energy load forecasting, which is used to
predict energy demand. In this situation, energy producers, grid operators, and traders
need to predict energy loads accurately in order to make decisions about managing
loads in the grid. A large amount of data is available, and using predictive analytics,
grid operators can translate this information into actionable insights.
Figure 3-1: Predictive analytics workflow.
Source: https://in.mathworks.com/discovery/predictive-analytics.html

3.1.1 Step-by-Step Workflow for Predicting Energy Loads
Typically, the workflow for a predictive analytics application follows these basic
steps:
1. Import data from varied sources, such as web archives, databases, and
spreadsheets.
Data sources include energy load data in a CSV file and national weather data
showing temperature and dew point.
2. Clean the data by removing outliers and combining data sources.
Identify data spikes, missing data, or anomalous points to remove from the
data. Then aggregate different data sources together – in this case, creating a
single table including energy load, temperature, and dew point.
3. Develop an accurate predictive model based on the aggregated data using
statistics, curve fitting tools, or machine learning.
Energy forecasting is a complex process with many variables, so you might
choose to use neural networks to build and train a predictive model. Iterate
through your training data set to try different approaches. When the training is
complete, you can try the model against new data to see how well it performs.
4. Integrate the model into a load forecasting system in a production
environment.
Once you find a model that accurately forecasts the load, you can move it into
your production system, making the analytics available to software programs
or devices, including web apps, servers, or mobile devices.
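The four steps above can be sketched in base R as follows. All data here are simulated, since the actual CSV file and weather feed are not available, and a plain linear model is swapped in for the neural network mentioned in step 3 purely to keep the example short. The names load_data, weather, energy and forecast_load() are hypothetical.

```r
set.seed(1)
n <- 200

# Step 1: import. Two simulated sources stand in for the CSV file and weather data
weather <- data.frame(hour = 1:n,
                      temperature = runif(n, 5, 35),
                      dew_point = runif(n, 0, 20))
load_data <- data.frame(
  hour = 1:n,
  load = 500 + 12 * weather$temperature + 4 * weather$dew_point + rnorm(n, sd = 20)
)
load_data$load[c(10, 50)] <- NA   # two missing readings
load_data$load[100] <- 5000       # one data spike

# Step 2: clean and combine into a single table
energy <- merge(load_data, weather, by = "hour")
energy <- energy[!is.na(energy$load), ]
limit <- median(energy$load) + 5 * mad(energy$load)
energy <- energy[energy$load <= limit, ]

# Step 3: fit on a training subset and check the error on held-out rows
train_idx <- sample(nrow(energy), size = floor(0.7 * nrow(energy)))
fit <- lm(load ~ temperature + dew_point, data = energy[train_idx, ])
holdout <- energy[-train_idx, ]
rmse <- sqrt(mean((predict(fit, holdout) - holdout$load)^2))

# Step 4: wrap the model in a function that a production system could call
forecast_load <- function(temperature, dew_point) {
  unname(predict(fit, data.frame(temperature = temperature, dew_point = dew_point)))
}
forecast_load(25, 10)
```

The cut-off of five median absolute deviations in step 2 is an arbitrary choice for this sketch; in practice the outlier rule would be chosen after inspecting the data.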
Figure 3-2: Predictive analytics application example
Source: https://in.mathworks.com/discovery/predictive-analytics.html

Chapter 4. Predictive Analysis Models

Predictive modeling uses mathematical and computational methods to predict events
or outcomes. These models predict the outcome of a future state or time based on
changes in the model input. Using an iterative process, you can use the training
dataset to develop a model and then test and validate it to determine the accuracy of
its predictions. You can try different machine learning methods to find the most
effective model.
Linear and logistic regression are usually the first algorithms people learn in
predictive modeling. Because of their popularity, many analysts even consider them
to be the only forms of regression. Those who are slightly more involved think they
are the most important of all forms of regression analysis.
The fact is that there are countless forms of regression that can be implemented. Each
form has its own importance and specific conditions in which it is best applied.

4.1 Why do we use Regression Analysis?
Regression analysis estimates the relationship between two or more variables. Let's
use a simple example to understand this:
Suppose you want to estimate the company's sales growth rate based on current
economic conditions. Recent company data shows that sales growth is about two and
a half times that of economic growth. Using this insight, we can predict the company's
future sales based on current and past information.
There are many benefits to using regression analysis.
They are as follows:
1. It indicates the significant relationships between the dependent variable and
the independent variables.
2. It indicates the strength of impact of multiple independent variables on a
dependent variable.
Regression analysis also allows us to compare the effects of variables measured on
different scales, such as the impact of price changes and the number of promotional
activities. These benefits help market researchers/data analysts/data scientists to
eliminate variables and evaluate the best set of variables for building predictive
models.

4.2 How many types of regression techniques do we have?
There are various regression techniques available for prediction. These techniques are
mainly driven by three metrics (the number of independent variables, the type of
dependent variable, and the shape of the regression line). We will discuss them in
detail in the following sections.
Figure 4-1: Types of Regression
Source: https://www.analyticsvidhya.com/blog/2015/08/comprehensive-guide-regression/
If you feel the need to use a combination of the above parameters, you can even
devise new regressions that have not been used before. Before doing so, let us look
at the most common regressions:

4.2.1 Linear Regression
It is one of the most widely known modeling techniques. Linear regression is usually
among the first few topics which people pick while learning predictive modeling. In
this technique, the dependent variable is continuous, the independent variable(s) can
be continuous or discrete, and the nature of the regression line is linear.
Linear Regression establishes a relationship between a dependent variable (Y) and one
or more independent variables (X) using a best fit straight line (also known as the
regression line).
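As a small illustration, the sketch below fits a best fit straight line with R's built-in lm() function. The data are simulated so that sales growth is roughly two and a half times economic growth, echoing the example in section 4.1; the variable names and the noise level are assumptions made for this example.

```r
set.seed(42)

# Simulated data: sales growth is about 2.5 times economic growth, plus noise
economic_growth <- runif(40, min = 0, max = 6)
sales_growth <- 2.5 * economic_growth + rnorm(40, sd = 1)

# Fit the regression line Y = a + b * X
fit <- lm(sales_growth ~ economic_growth)
coef(fit)

# Predict sales growth for an economic growth rate of 4%
predict(fit, newdata = data.frame(economic_growth = 4))
```

The estimated slope should land close to the 2.5 used to generate the data, which is what "best fit" means here in the least squares sense.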
4.2.2 Logistic Regression
Logistic regression is used to find the probability of event=Success and event=Failure.
We should use logistic regression when the dependent variable is binary (0/ 1, True/
False, Yes/ No) in nature. Here the predicted probability p ranges from 0 to 1 and it
can be represented by the following equation:
logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + ... + bkXk
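A minimal sketch of fitting such a model in R uses glm() with family = binomial. The data are simulated: imdb_score is used as a hypothetical predictor and the coefficients -6 and 1 are arbitrary values chosen only to generate plausible hit/flop outcomes.

```r
set.seed(7)
n <- 300

# Simulated predictor and binary outcome (1 = hit, 0 = flop)
imdb_score <- runif(n, 3, 9)
p_hit <- plogis(-6 + 1 * imdb_score)   # true probabilities used for simulation
hit <- rbinom(n, 1, p_hit)

# Fit the logistic regression: the log-odds of a hit are linear in imdb_score
fit <- glm(hit ~ imdb_score, family = binomial)
coef(fit)

# Predicted probabilities always lie between 0 and 1
prob <- predict(fit, newdata = data.frame(imdb_score = c(4, 8)), type = "response")
prob
```

With type = "response" the predictions are probabilities; omitting it returns values on the log-odds scale instead.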
4.2.3 Polynomial Regression
A regression equation is a polynomial regression equation if the power of the
independent variable is more than 1, for example y = a + b*x^2.
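The effect of adding a squared term can be seen in the short sketch below, which uses simulated data with a known quadratic relationship. The coefficients 1 and 2 and the noise level are arbitrary choices for illustration.

```r
set.seed(3)

# Simulated data following y = 1 + 2 * x^2 plus noise
x <- seq(-3, 3, length.out = 60)
y <- 1 + 2 * x^2 + rnorm(60, sd = 0.5)

# A straight line cannot capture the curve; adding the squared term can
linear_fit <- lm(y ~ x)
quad_fit <- lm(y ~ x + I(x^2))

r2_linear <- summary(linear_fit)$r.squared
r2_quad <- summary(quad_fit)$r.squared
c(linear = r2_linear, quadratic = r2_quad)
```

The I() wrapper tells R to treat x^2 as arithmetic rather than as formula syntax.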
4.2.4 Stepwise Regression
This form of regression is used when we deal with multiple independent variables. In
this technique, the selection of independent variables is done with the help of an
automatic process, which involves no human intervention.
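One way to try this in R is the built-in step() function, which adds or drops variables based on AIC. The sketch below uses simulated data in which only x1 and x2 truly affect y; backward elimination is just one of the available directions, and the variable names are hypothetical.

```r
set.seed(11)
n <- 200

# Simulated data: y depends on x1 and x2; x3 and x4 are pure noise
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 3 * d$x1 - 2 * d$x2 + rnorm(n)

# Start from the full model and let step() drop variables automatically
full <- lm(y ~ ., data = d)
selected <- step(full, direction = "backward", trace = 0)
formula(selected)
```

Because selection is driven by AIC, a noise variable can occasionally survive, so the result is best treated as a starting point rather than a final model.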

4.2.5 Ridge Regression
Ridge Regression is a technique used when the data suffers from multicollinearity
(independent variables are highly correlated). In multicollinearity, even though the
ordinary least squares (OLS) estimates are unbiased, their variances are large, which
can push the estimated values far from the true values. By adding a degree of bias to
the regression estimates, ridge regression reduces the standard errors.
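The shrinkage can be illustrated with the closed-form ridge solution, written here in base R. The data are simulated with two nearly collinear predictors, the penalty value lambda = 10 is arbitrary, and ridge_coef() is a hypothetical helper written for this sketch; packages such as glmnet would normally be used instead.

```r
set.seed(5)
n <- 100

# Two nearly collinear predictors
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)
y <- 1 * x1 + 1 * x2 + rnorm(n)

# Standardize predictors and center the response so no intercept is needed
X <- scale(cbind(x1, x2))
yc <- y - mean(y)

# Closed-form ridge estimate: (X'X + lambda * I)^(-1) X'y
ridge_coef <- function(X, y, lambda) {
  drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

ols <- ridge_coef(X, yc, lambda = 0)     # lambda = 0 reproduces least squares
ridge <- ridge_coef(X, yc, lambda = 10)  # a positive lambda shrinks the estimates
rbind(ols = ols, ridge = ridge)
```

With lambda = 0 the function returns the ordinary least squares coefficients, so the two rows show directly how much the penalty pulls the estimates toward zero.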
4.2.6 Lasso Regression
Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and Selection
Operator) also penalizes the absolute size of the regression coefficients. In addition,
it is capable of reducing the variability and improving the accuracy of linear
regression models.
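Because the penalty is on the absolute size of the coefficients, lasso can set some of them exactly to zero. The sketch below shows this with a small coordinate descent routine in base R. It is a teaching sketch under simplifying assumptions (standardized predictors, centered response, fixed number of iterations); lasso_cd() and soft_threshold() are hypothetical helpers, and the glmnet package is the usual tool in practice.

```r
# Soft-thresholding operator: shrinks z toward zero by gamma
soft_threshold <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)

# Coordinate descent for (1 / (2n)) * sum((y - X b)^2) + lambda * sum(abs(b))
lasso_cd <- function(X, y, lambda, n_iter = 200) {
  n <- nrow(X)
  beta <- rep(0, ncol(X))
  for (iter in seq_len(n_iter)) {
    for (j in seq_len(ncol(X))) {
      # Residual that ignores the contribution of predictor j
      partial_resid <- y - X[, -j, drop = FALSE] %*% beta[-j]
      rho <- sum(X[, j] * partial_resid) / n
      beta[j] <- soft_threshold(rho, lambda) / (sum(X[, j]^2) / n)
    }
  }
  beta
}

set.seed(9)
n <- 200
X <- scale(matrix(rnorm(n * 4), n, 4))   # four standardized predictors
y <- 2 * X[, 1] + rnorm(n)               # only the first one matters
y <- y - mean(y)

b_unpenalized <- lasso_cd(X, y, lambda = 0)
b_lasso <- lasso_cd(X, y, lambda = 0.5)
rbind(unpenalized = b_unpenalized, lasso = b_lasso)
```

With lambda = 0 the routine converges to the least squares fit, while a positive lambda both shrinks the first coefficient and removes the noise predictors.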
4.2.7 ElasticNet Regression
ElasticNet is a hybrid of the Lasso and Ridge Regression techniques. It is trained with
both L1 and L2 penalties as regularizers. Elastic-net is useful when there are multiple
features which are correlated: Lasso is likely to pick one of these at random, while
elastic-net is likely to pick both.

Chapter 5. Film Industry

5.1 Hollywood Industry Facts
The outlook for the global movie industry over the next few years is healthy: global
box office revenue is expected to increase from approximately 38 billion U.S. dollars
in 2016 to nearly 50 billion U.S. dollars in 2020. Measured by tickets sold per year,
the United States is the world’s third-largest film market, behind only China and
India. As of 2016, less than $1.2 billion in movie tickets were sold in the United
States. As of 2016, there were about 5,800 movie theaters in the United States.
According to a recent survey, 13% of Americans go to the movies about once a month,
7% go to the cinema several times a month, and 31% go less than once a year.
Considering that 52% of American adults prefer to watch movies at home, this is
a considerable share.
Movie entertainment is a big business in the United States. It is estimated that by
2019, the revenue from the movie entertainment business will reach US$35.3 billion.
Among film studios, Buena Vista had the highest income in 2016, holding the
largest market share, about 26%, and generating the highest box office revenue, more
than 3 billion US dollars. Warner Bros. accounted for nearly 17% of the total box
office in North America, while 20th Century Fox accounted for about 13% of the
market. Warner Bros., Universal Pictures and Miramax have each received the Academy
Award for Best Picture four times.
In 2016, 733 movies were released in North America, and drama was the most commonly
released genre in the region. So far, the region's most successful movie
franchise is the Marvel Cinematic Universe, which has generated more than $4 billion
in box office revenue in North America alone. "Iron Man," "Marvel's The Avengers,"
"Spider-Man," and "The Incredible Hulk" are just a few examples of movies in the
franchise. "Star Wars: The Force Awakens" is the highest-grossing 3D movie in North
America, with a total lifetime gross of approximately $937 million.

5.2 Bollywood Industry Facts
According to a recent report released by Deloitte entitled “Indywood - Indian Film
Industry”, the Indian film industry is expected to reach US$3.7 billion by 2020 at a
compound annual growth rate (CAGR) of 11%. In terms of the number of films
produced, India is already the world's largest film industry, but it lags behind
market leaders such as the United States in revenue. India produces an
average of nearly 1,500 to 2,000 films per year, but its screen density is only 1 screen
per 96,300 residents, while the United States has 1 screen per 7,800 residents and
China has 1 screen per 45,000 residents. The industry’s current box office revenue
totals 2.1 billion U.S. dollars. While the domestic box office contributes most of the
revenue (74%), other areas such as cable and satellite rights and online/digital
aggregation revenues are the fastest growing segments. In fact, cable and satellite
rights and online/digital aggregation revenues are expected to grow at a CAGR of
about 15% during FY15-20.
The Hindi film industry contributes 43% of the Indian film industry's revenue.
Other interesting trends include gains by international/foreign films in the Indian
market, access to international studios through acquisitions and cooperation, the rise
of regional films, digital adoption across the entire value chain, organic and inorganic
growth through diversification, and the emergence of alternative income streams. These
trends are expected to shape the contours of the Indian film industry. In addition,
rising per capita incomes, a growing middle class, increasing demand in second-tier and
third-tier cities, diversification into international markets, greater use of
supplementary income streams, and greater use of visual effects (VFX) in films
may drive market growth in the country.
The success of low-budget Indian films such as "Mandy", "No One Killed Jessica",
"Peepli Live" and "Kahaani" has shown that huge profits are not limited to
big-budget movies.

The total revenue is 138 billion Indian rupees. By 2018, the estimated income is 162
billion Indian rupees, of which theatrical revenues account for a major portion.
Bollywood can no longer be considered a marginal industry.
Similar to banking or fast-moving consumer goods, the marketing of a movie is a
major expense for the industry, and it is no different from a planned product
promotion campaign. The industry must manage its spending and introduce
data-informed decisions to identify the channels that can lead to the highest ROMI
(Return on Marketing Investment).
Below are a few trends that are disrupting “business as usual” in Bollywood:
• Box office sales are plateauing even though the average ticket prices have
gone up
• Revenue from mobile & online / home video rights is increasing (box
office sales still comprise 74% of the overall movie revenue)
Figure 5-1: Bollywood Revenue
Source: KPMG India Analysis

• The rise of multiplexes has led to an increase in per-ticket realization and to
cost-effective screening by providing theatres with low seating capacities. Also,
these multiplexes have improved the film-going experience for Indian audiences.
• Digitization of films by studios such as Real Image and UFO films has
enabled faster distribution, from 500 prints per day to 1500 prints per day. As a
result, revenue generated from the first week of release is almost 60-80% of
the total theatrical revenue.
• Filmmakers now have an elaborate marketing mix which includes
television, radio, digital advertising, print, OOH, social media, mobile apps,
YouTube videos etc.
• A shift from film promotion by spreading awareness prior to release to
audience engagement using creative campaigns across social media. For
example: marketers of the film ‘PK’ posted behind-the-scenes clips on
Facebook, and selfies from the promotion tours were put out on Twitter. Besides
this, a special app ‘MyPKPoster’ was also launched to let the audience try out
creative versions of PK.
• Increasing competition from international films (Hollywood) as well
as regional films, whose budgets and production finesse are improving
consistently.
• Rise of the “content driven” audience. The average filmgoer is becoming
increasingly interested in stories that are relevant to the Zeitgeist.

Marketing Promotion in Films
It is now widely accepted that a film will not get a good opening weekend unless it is
heavily promoted. These days, a film’s success depends heavily on intelligent
omni-channel marketing campaigns.
A typical marketing mix for most film promotions looks like this:
Figure 5-2: Marketing Mix
Source: Analytics Vidhya
With promotion across television, radio, and social media sites (such as Facebook,
Instagram, Twitter, and mobile games), the marketing of most movies has become
increasingly digital.
Without proper marketing, a movie with a fantastic plot, actors, scenes
and special effects may not be able to attract an audience. A successful campaign
requires a thoroughly designed plan that focuses not only on the types of marketing
channel but also on how they connect. One surprising fact is that a typical Indian
movie requires six months of planning and 18 months of execution, while a Hollywood
movie requires 36 months of planning and 12 months of execution.

All of these trends give the industry considerable potential to benefit from
analytical frameworks and applications. The analysis can take these trends in the form
of data points, such as the amount spent on each marketing vehicle and the time gap
between any two activities, to model the audience's memory retention/decay factor.
5.3 Need for Predictive Analysis
The advent of new technologies and the arrival of the “big data” era have enabled
non-traditional industries such as the film industry to gain the benefits of analysis.
In addition, the latest trends seen in this industry have brought a large number of
platforms for connecting with consumers. This situation requires a more systematic
approach to increase the efficiency and effectiveness of promotion through each
platform in order to achieve the greatest possible return from as little investment as
possible.
In addition, the past decade has not only increased the diversity of entertainment
consumption devices, but also enlarged the audience groups that are now accustomed to
customized content. Analysis of customers for better positioning, personalized
content on specific media, and similar uses offers endless opportunities to benefit
from analytics.
All in all, we think that such an industry is like a gold mine waiting for a
new framework to get the most out of it. Hence, predictive analysis for the film
industry is becoming popular these days.

Chapter 6. Case Study

The case study is about visualizing and forecasting the main variables responsible for
a movie's success and helping to determine the revenue of upcoming movies with
the help of a model created from the historic data collected.
The data has been collected from data.world and kaggle.com. The data consists of 28
variables for 5043 movies, spanning 100 years and 66 countries.
Figure 6-1: Raw Data (1)
Source: Kaggle

Figure 6-2: Raw Data (2)
Source: Kaggle
Things done in the new data file:
• NA or null values have been removed
• Unwanted data has been removed
Some of the R packages used in mining or cleaning the data are:
1. glmnet
Extremely efficient procedures for fitting the entire lasso or elastic-net
regularization path for linear regression, logistic and multinomial regression
models, Poisson regression and the Cox model. Two recent additions are the
multiple-response Gaussian, and the grouped multinomial regression.
2. tidyr
An evolution of 'reshape2'. It's designed specifically for data tidying (not
general reshaping or aggregating) and works well with 'dplyr' data pipelines.
3. dplyr
A fast, consistent tool for working with data frame like objects, both in
memory and out of memory.
4. ggplot2
A system for 'declaratively' creating graphics, based on "The Grammar of
Graphics". You provide the data, tell 'ggplot2' how to map variables to
aesthetics, what graphical primitives to use, and it takes care of the details.
5. lubridate
Functions to work with date-times and time-spans: fast and user friendly
parsing of date-time data, extraction and updating of components of a date-
time (years, months, days, hours, minutes, and seconds), algebraic
manipulation on date-time and time-span objects. The 'lubridate' package has a
consistent and memorable syntax that makes working with dates easy and fun.

6. pscl
Bayesian analysis of item-response theory (IRT) models, roll call analysis;
computing highest density regions; maximum likelihood estimation of zero-
inflated and hurdle models for count data; goodness-of-fit measures for
GLMs; data sets used in writing and teaching at the Political Science
Computational Laboratory; seats-votes curves.
7. ROCR
ROC graphs, sensitivity/specificity curves, lift charts, and precision/recall
plots are popular examples of trade-off visualizations for specific pairs of
performance measures. ROCR is a flexible tool for creating cutoff-
parameterized 2D performance curves by freely combining two from over 25
performance measures (new performance measures can be added using a
standard interface).
8. coefplot
Plots the coefficients from model objects. This very quickly shows the user the
point estimates and confidence intervals for fitted models.
9. shiny
Makes it incredibly easy to build interactive web applications with R.
Automatic "reactive" binding between inputs and outputs and extensive
prebuilt widgets make it possible to build beautiful, responsive, and powerful
applications with minimal effort.
10. DT
Data objects in R can be rendered as HTML tables using the JavaScript library
'DataTables' (typically via R Markdown or Shiny). The 'DataTables' library
has been included in this R package. The package name 'DT' is an
abbreviation of 'DataTables'.

11. shinythemes
Themes for use with Shiny. Includes several Bootstrap themes, which
are packaged for use with Shiny applications.
Figure 6-3: Cleaned Data (1)

Figure 6-4: Cleaned Data (2)

Steps
• First, from the revenue given as “gross” and the “budget”, I computed the
profitability ratio (gross divided by budget). For the binomial model the ratio
was converted to “1” (hit) when it is above the quartile cutoff value and “0”
(flop) when it is below; for the multinomial model it was converted to the
classes “1”, “2” and “3”.
• The genre can be coded numerically, so the genres were converted into 14
numbered categories.
• I have also performed logistic regression, both binomial and multinomial,
and have shared screenshots of the console output. Along with that I have
shared screenshots of the plots: the coefficient plot, and the graph which
divides the “hits” and “flops” based on the revenue. I have also shared the
invlogit screenshot, which tells us which variables contribute more to the
model. The output also shows the accuracy of the model.
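The steps above can be sketched on simulated data as follows. This is not the project's actual script (which appears in the Appendices) and the numbers are invented: the data frame movies, the use of the upper quartile as the cutoff, and the choice of imdb_score and num_user_for_reviews as predictors are all assumptions made for illustration. The profitability ratio itself is deliberately left out of the predictors here, because the hit/flop label is derived from it.

```r
set.seed(2018)
n <- 500

# Simulated stand-in for the movie data (all values are made up)
movies <- data.frame(
  budget = runif(n, 1e6, 2e8),
  imdb_score = runif(n, 3, 9),
  num_user_for_reviews = rpois(n, 300)
)
movies$gross <- movies$budget *
  exp(rnorm(n, mean = 0.15 * (movies$imdb_score - 6), sd = 0.8))

# Step 1: profitability ratio and hit/flop label (assumed cutoff: upper quartile)
movies$profitability_ratio <- movies$gross / movies$budget
cutoff <- quantile(movies$profitability_ratio, 0.75)
movies$Result <- ifelse(movies$profitability_ratio > cutoff, 1, 0)

# Step 2: binomial logistic regression on a training split
train <- movies[1:350, ]
test <- movies[351:500, ]
model <- glm(Result ~ imdb_score + num_user_for_reviews,
             data = train, family = binomial)

# Step 3: accuracy on the held-out rows, using 0.5 as the decision threshold
pred <- ifelse(predict(model, newdata = test, type = "response") > 0.5, 1, 0)
accuracy <- mean(pred == test$Result)
accuracy
```

Because only a quarter of the simulated movies are labelled as hits, accuracy should be read alongside the share of flops rather than on its own.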

Output
Figure 6-5: Accuracy output
Binomial
Figure 6-6: Binomial Output

Figure 6-7: Graph based on hits & Flops (1)
Figure 6-8: Coefficient Plot (1)

Multinomial
Figure 6-10: Multinomial Output
Figure 6-9: Invlogic Output
Figure 6-11: Coefficient Plot (2)

Figure 6-12: RShiny App
Observations:
1. The accuracy of the model is 0.99 for the hit/flop outcome, which is
represented by “1” and “0”.
2. Aspect ratio and profitability ratio play an important role in deciding and
predicting the future of the movies.
3. The density plot (Figure 6-7: Graph based on hits & Flops (1)) shows the
line between hits and flops based on the revenue.
4. (Figure 6-8: Coefficient Plot (1)) shows the result variable, i.e. hit or flop,
and the other variable, profitability ratio, in relation to the revenue of the film.
5. (Figure 6-10: Multinomial Output) shows that the different genre categories,
i.e. Action, Adventure, Animation, Biography, Comedy, Crime, Documentary,
Drama, Family, Fantasy, Horror, Mystery, Romance and Sci-Fi, play an important
role in the revenue of the film.

6. (Figure 6-9: Invlogic Output) shows which variables play an important role in
deciding the revenue of the film. It clearly shows that aspect ratio and
profitability ratio play an important role, along with actor Facebook likes,
the number of users who reviewed the film, and the IMDb score.
7. (Figure 6-11: Coefficient Plot (2)) shows the relation of the different genres
to different aspects of the data available.

Suggestions
1. Since we have so many variables, we can also use lasso or ridge regression
to help choose the best-fitting variables.
2. The critics' reviews can also play an important role.
3. There is an indication that the more reviews a movie has, the more likely it
is to be a hit.
4. Since there are so many genres, we can also perform K-means clustering to
segregate the revenue based on the genre. This can also be done for
directors or actors.
5. The ratings and the aspect ratio play an important part in the revenue in
the given data.
6. We can also take the weighted average of the number of Facebook likes of
the actors of a particular film and observe whether the lead actor or the
supporting actor plays an important role in the revenue of the movie.
7. Unsupervised learning can also be applied to the given data,
since we do not know exactly which variable is the most important and
every variable is different from the others.
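As an illustration of suggestion 4, the sketch below runs K-means clustering with R's built-in kmeans() function. The data are simulated (two made-up groups of films with different budget and gross levels, in millions), and the choice of two clusters is an assumption for this example; on the real data the number of clusters would need to be explored.

```r
set.seed(123)

# Simulated films: 50 small productions and 50 large ones (values in millions)
films <- data.frame(
  budget = c(rnorm(50, 10, 2), rnorm(50, 100, 15)),
  gross  = c(rnorm(50, 20, 5), rnorm(50, 250, 40))
)

# Standardize so both variables contribute equally, then cluster into two groups
clusters <- kmeans(scale(films), centers = 2, nstart = 10)

# Size of each cluster and the average budget and gross within it
table(clusters$cluster)
aggregate(films, by = list(cluster = clusters$cluster), FUN = mean)
```

The same approach could be applied per genre, director or actor by clustering on summaries computed for each of them.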

Conclusion

The film industry plays a very important role in the U.S. economy and the Indian
economy. Therefore, over the past three decades, research related to film has become
an important topic for many scientists. Because film is one of the highest-risk
industries, box office revenue (BOR) forecasting can play an important role in the
economy.
In the real world, a movie's box office can be determined by many things that are
difficult to collect, such as the current economic situation, the latest news, the
popularity of the actors, and the number of competing movies. The film market is very
competitive and predicting movie box office is a daunting challenge.
Future Work
Currently, the data comes only from IMDB. In future work, adding data
from other platforms such as Hotstar, Amazon Prime Video, Netflix and
BookMyShow could be of great importance and value.
It is always hard to find the right balance between only analyzing the surface of a
problem and being stuck in details. The approach to viewer rating prediction is very
straightforward and can be easily applied by anyone. The box office prediction
requires more experience, because it depends on the user being able to interpret
the different visualizations of the data in the right way and to draw the right
conclusions from them. Visual Analytics is not widespread in the field of movie
success prediction, so the challenge was a good way to fuel interest in this research
topic. I am very confident that Visual Analytics will become more and more integrated
in our lives because of the great possibilities it enables. The human mind is very good
at extracting information from visualizations and thus can analyze data quicker and in
a more complex way, particularly in our times, when the amount of data we produce is
skyrocketing.

Appendices

Code:
6.1 Binomial
library(glmnet)
library(tidyr)
library(dplyr)
library(ggplot2)
library(dplyr)
library(lubridate)
library(pscl)
library(useful)
library(nnet)
options(scipen = 999)
getwd()
setwd("C:/Users/kazani/Desktop/r/")
# Data Cleaning
h.1<- read.csv("movie_metadata1.csv", sep=",")
View(h1)
str(h1.o)
head(h1)
sum(is.na(h1))
h1.sep1<- separate(h.1,genres,c("Gen1","Gen2","Gen3"))
h1.sep1$plot_keywords<-NULL
h1.sep1$movie_imdb_link<-NULL
h1.sep1$color<-NULL
h1.o1<-na.omit(h1.sep1)

h1.date<-subset(h.1, title_year > "1990-01-01" & title_year <"2017-12-31")
h1.o1$Profitibility_Ratio<- (h1.o1$gross/h1.o1$budget)
summary(h1.o1$Profitibility_Ratio)
h1.o1$Result<- ifelse(h1.o1$Profitibility_Ratio
>1.9812,1,ifelse(h1.o1$Profitibility_Ratio>1.0207,2,3))
a<- as.data.frame(table(h1.o1$Gen1))
h1.o1$Genre<- NA
h1.o1$Genre<-
unlist(sapply(h1.o1$Gen1,switch,'Action'=1,'Adventure'=2,'Animation'=3,'Biography'
=4,'Comedy'=5,'Crime'=6,
'Documentary'=7,'Drama'=8,'Family'=9,'Fantasy'=10,'Horror'=11,'Mystery'=12,'Roma
nce'=13,'Sci'=14))
summary(h1.o$Profitibility_Ratio)
?unlist()
head(Hollywood1)
Hollywood1<-h1.o1[, c(2,3,7,8,14,18,22,25,26,27,28,29,30)]
# Multinomial logistic regression
train <- Hollywood1[1:1500, ]
test <- Hollywood1[1501:2495, ]
model <- multinom(Result ~ ., data = train)
summary(model)
pR2(model)
colnames(Hollywood1)

# Fitting data: predict the most likely class for each test row
fitted.results <- predict(model, newdata = test, type = "class")
misClasificError <- mean(as.character(fitted.results) != as.character(test$Result),
                         na.rm = TRUE)
print(paste('Accuracy', 1 - misClasificError))

library(ROCR)
# ROCR handles two-class problems only, so class 1 is scored against the rest
p <- predict(model, newdata = test, type = "probs")
head(p)
pr <- prediction(p[, "1"], as.integer(test$Result == 1))
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

# Density of gross revenue with a reference line at 100,000,000
gross_plot <- ggplot(Hollywood1, aes(x = gross)) +
  geom_density(fill = "grey", color = "grey") +
  geom_vline(xintercept = 100000000)
print(gross_plot)

library(coefplot)
coefplot(model)
# Inverse logit: converts log-odds to probabilities
invlogit <- function(x){1/(1 + exp (-x))}
invlogit(coef(model))
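The three-way labelling and the accuracy figure used in the listing above can be checked by hand on a few invented films. The following is a minimal sketch in base R only. The helper name `label_result`, the budget and gross figures, and the `predicted` vector are assumptions made for illustration and do not come from the dataset; only the two thresholds are taken from the listing.

```r
# Label films by profitability ratio, using the thresholds from the listing:
# 1 = ratio above 1.9812, 2 = ratio above 1.0207, 3 = otherwise.
# label_result is a hypothetical helper, not part of the original code.
label_result <- function(gross, budget, hit = 1.9812, breakeven = 1.0207) {
  ratio <- gross / budget
  ifelse(ratio > hit, 1, ifelse(ratio > breakeven, 2, 3))
}

# Invented figures (in arbitrary units) for five films
budget <- c(100, 50, 80, 20, 60)
gross  <- c(250, 60, 40, 39, 200)
actual <- label_result(gross, budget)
print(actual)  # 1 2 3 2 1

# Hypothetical model output for the same five films
predicted <- c(1, 2, 2, 2, 3)

# Confusion matrix with all three classes shown, even if one is never predicted
confusion <- table(Predicted = factor(predicted, levels = 1:3),
                   Actual = factor(actual, levels = 1:3))
print(confusion)

# Accuracy = correctly classified films / all films
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)  # 0.6
```

A confusion matrix is more informative than a single accuracy figure for a three-class problem, because it shows which classes are mistaken for each other.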


6.2 Multinomial
# Note: this listing labels each film as 1 (ratio above 1.9812) or 0 and
# fits a binomial logistic regression with glm().
library(glmnet)
library(tidyr)
library(dplyr)
library(ggplot2)
library(lubridate)
library(pscl)
library(useful)
getwd()

# Data Cleaning
h2 <- read.csv("movie_metadata1.csv", sep = ",")
View(h2)
str(h2)
head(h2)
sum(is.na(h2))
# Split genres on non-alphanumeric characters (the separate() default)
h2.sep <- separate(h2, genres, c("Gen1", "Gen2", "Gen3"))
h2.sep$plot_keywords <- NULL
h2.sep$movie_imdb_link <- NULL
h2.sep$color <- NULL
h2.o <- na.omit(h2.sep)
# Assuming title_year holds a numeric year (h2.date is not used further below)
h2.date <- subset(h2, title_year >= 1990 & title_year <= 2017)

# Profitability ratio = gross / budget (column name spelling kept as is)
h2.o$Profitibility_Ratio <- h2.o$gross / h2.o$budget
summary(h2.o$Profitibility_Ratio)
h2.o$Result <- ifelse(h2.o$Profitibility_Ratio > 1.9812, 1, 0)
a <- as.data.frame(table(h2.o$Gen1))

# Encode the first genre as a number; genres not listed here become NA
genre_codes <- c(Action = 1, Adventure = 2, Animation = 3, Biography = 4,
                 Comedy = 5, Crime = 6, Documentary = 7, Drama = 8,
                 Family = 9, Fantasy = 10, Horror = 11, Mystery = 12,
                 Romance = 13, Sci = 14)
h2.o$Genre <- unname(genre_codes[as.character(h2.o$Gen1)])
sum(is.na(h2.o$Gen1))
is.numeric(h2.o$Genre)

Hollywood <- h2.o[, c(2,3,7,8,14,18,22,25,26,27,28,29)]
head(Hollywood)
# Logistic Regression
train <- Hollywood[1:1500, ]
test <- Hollywood[1501:2495, ]
model <- glm(Result ~ ., family = binomial(link = 'logit'), data = train)
summary(model)
pR2(model)

# Classify a test film as 1 when its predicted probability exceeds 0.5
fitted.results <- predict(model, newdata = test, type = 'response')
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != test$Result, na.rm = TRUE)
print(paste('Accuracy', 1 - misClasificError))

library(ROCR)
p <- predict(model, newdata = test, type = "response")
pr <- prediction(p, test$Result)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc

# Density of gross revenue. Note: with these limits the reference line at
# 100,000,000 lies outside the plotted range.
gross_plot <- ggplot(Hollywood, aes(x = gross)) +
  geom_density(fill = "grey", color = "grey") +
  geom_vline(xintercept = 100000000) +
  scale_x_continuous(name = "Revenue", limits = c(0, 10000000))
print(gross_plot)

library(coefplot)
coefplot(model)
# Inverse logit: converts log-odds to probabilities
invlogit <- function(x){1/(1 + exp (-x))}
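Both listings take rows 1 to 1500 for training and define `invlogit` without showing what it is applied to. The sketch below, in base R only, illustrates two related points on synthetic data: a random train/test split, which avoids depending on the row order of the CSV file, and the use of `invlogit` on the model's log-odds to recover the predicted probabilities. The data frame `toy`, its coefficients, and the 120/80 split are invented for illustration; the column names `budget` and `imdb_score` are used only as plausible examples and the results say nothing about the real dataset.

```r
set.seed(1)

# Inverse logit, as defined in the listings: log-odds -> probability
invlogit <- function(x) 1 / (1 + exp(-x))

# Synthetic data: 200 invented films with an invented relationship
n <- 200
toy <- data.frame(
  budget = runif(n, 1, 100),
  imdb_score = runif(n, 1, 10)
)
toy$Result <- rbinom(n, 1, invlogit(-3 + 0.5 * toy$imdb_score))

# Random split: 120 rows for training, the remaining 80 for testing
train_idx <- sample(seq_len(n), size = 120)
train <- toy[train_idx, ]
test <- toy[-train_idx, ]

# Binomial logistic regression, as in the glm() listing
model <- glm(Result ~ budget + imdb_score,
             family = binomial(link = "logit"), data = train)

# Predicted probabilities and accuracy at a 0.5 cut-off
prob <- predict(model, newdata = test, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)
accuracy <- mean(pred == test$Result)
print(paste("Accuracy", round(accuracy, 3)))

# invlogit applied to the log-odds reproduces the predicted probabilities
log_odds <- predict(model, newdata = test, type = "link")
print(all.equal(invlogit(log_odds), prob))
```

Applying `invlogit` to the whole linear predictor, rather than to a single coefficient, is what yields a probability for a given film; a single coefficient is a change in log-odds and is usually easier to read as an odds ratio via `exp()`.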

6.3 Rshiny
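library(shiny)
library(DT)
library(ggplot2)
library(shinythemes)
library(ROCR)
getwd()

# H: cleaned data shown on the "Data Cleaned" tab; N: raw data
H <- read.csv("hollywood1.csv", sep = ",")
N <- read.csv("movie_metadata1.csv", sep = ",")
H$X <- NULL  # drop the X column (typically the row index added by write.csv)
genre_counts <- as.data.frame(table(H$Gen1))
result_counts <- as.data.frame(table(H$Result))

server = function(input, output, session) {
  # Cleaned data with currency formatting and highlighted key columns
  output$mytable <- DT::renderDataTable(
    DT::datatable(H) %>%
      formatCurrency("gross", "$", digits = 0) %>%
      formatCurrency("budget", "$", digits = 0) %>%
      formatStyle(c("Result", "Genre", "budget", "gross", "movie_title",
                    "title_year", "Profitibility_Ratio"),
                  backgroundColor = "lightblue"))
  output$mytable1 = DT::renderDataTable({N})
  output$mytable2 = DT::renderDataTable({genre_counts})
  output$mytable3 = DT::renderDataTable({result_counts})
  # Handler for the "Download Table" button (the file name is illustrative)
  output$mydownload <- downloadHandler(
    filename = "hollywood_cleaned.csv",
    content = function(file) write.csv(H, file, row.names = FALSE))
}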
ui = navbarPage(
  theme = shinytheme("sandstone"),
  title = h4("Movie Revenue Prediction ", span("By Asfaq Kazani")),
  tabPanel("Data Cleaned",
    DT::dataTableOutput("mytable"),
    downloadButton(outputId = "mydownload", label = "Download Table"),
    tags$br(), tags$br(),
    DT::dataTableOutput("mytable2"), tags$br(), tags$br(),
    DT::dataTableOutput("mytable3")
  ),
  tabPanel("Raw Data",
    DT::dataTableOutput("mytable1")),
  tabPanel("Output & Plots",
    navlistPanel(
      "Regression",
      tabPanel("Binomial Regression", h2(" Summary of the model"),
        tags$img(src='movies1.png', height=500, width = 550),
        tags$img(src='movies2.png', height=150, width = 550), tags$br(), tags$br(),
        tags$img(src='movies3.png', height=150, width = 650), tags$br(), tags$br(),
        tags$img(src='Rplot1.png', height=500, width = 550), tags$br(), tags$br(),
        tags$img(src='Rplot2.png', height=500, width = 550), tags$br(), tags$br(),
        tags$img(src='Rplot3.png', height=500, width = 550), tags$br(), tags$br()
      ),
      tabPanel("Multinomial Regression", h2(" Summary of the model"),
        tags$img(src='m1.png', height=450, width = 750),
        tags$img(src='mm1.png', height=500, width = 550), tags$br(), tags$br(),
        tags$img(src='mm2.png', height=500, width = 550), tags$br(), tags$br()
      )
    )
  ),
  tabPanel("Code",
    a("Binomial Code ",
      href="https://github.com/ashkazani/ranalytics/blob/master/binomial"),
    tags$br(), tags$br(),
    a("Multinomial Code ",
      href="https://github.com/ashkazani/ranalytics/blob/master/multinomial"),
    tags$br(), tags$br()),
  tabPanel("Raw Data Source",
    h4("Raw Data Taken from Data World & Kaggle"), tags$br(), tags$br(),
    a("Link1", href="https://data.world/popculture/imdb-5000-movie-dataset"),
    tags$br(), tags$br(),
    a("Link2", href="https://www.kaggle.com/carolzhangdc/imdb-5000-movie-dataset/data"),
    tags$br(), tags$br(),
    a("Link3", href="https://www.kaggle.com/nazimamzz/imdb-dataset-of-5000-movie-posters/data"),
    tags$br(), tags$br()
  )
)
shinyApp(ui = ui, server = server)

Annexures

List of Figures
Figure No. Title
Figure 1:1 Predictive Analytics Value Chain
Figure 1:2 Predictive Analysis Process
Figure 1:4 Predictive Analysis
Figure 2:1 User Interface of RStudio
Figure 2:2 Weather Forecasting Using Predictive Modelling of R
Figure 2:3 Opening Screen of RStudio
Figure 2:4 RShiny Example
Figure 3:1 Predictive Analytics Workflow
Figure 3:2 Predictive Analytics Application Example
Figure 4:1 Types of Regression
Figure 5:1 Bollywood Revenue
Figure 5:2 Marketing Mix
Figure 6:1 Raw Data (1)
Figure 6:2 Raw Data (2)
Figure 6:3 Cleaned Data (1)
Figure 6:4 Cleaned Data (2)
Figure 6:5 Accuracy Output
Figure 6:6 Binomial Output
Figure 6:7 Graph Based on Hits & Flops (1)
Figure 6:8 Coefficient Plot (1)
Figure 6:9 Invlogit Output
Figure 6:10 Multinomial Output
Figure 6:11 Coefficient Plot (2)
Figure 6:12 RShiny App
