Final Project

Introduction
• The movie theater industry is facing severe competition from
internet and video streaming.
• Questions we want to answer through this project:
  • What factors affect the ratings of movies?
  • Is there any single factor, like critic ratings, that attracts the
    public to watch and rate a movie?
  • Does the length of a movie influence a person’s movie rating?
  • Can the revenue of a movie be determined by critics’ ratings?
  • Is there any agreement between movie watcher ratings and
    critic ratings?
  • Is there any correlation between the number of votes for a movie
    on IMDB and the followers for that movie on a social media
    platform like Twitter?
Data Description
• Data Source: IMDB, Twitter
• From IMDB, we created a dataset with the following
variables: Title, Genre, Description, Director, Actors,
Year, Runtime, Rating, Votes, Revenue, Metascore.
• We cleaned the dataset by removing records with unknown values
and then created categorical columns for ratings,
votes, critic ratings and genre.
• From Twitter, we extracted the number of
followers for ten of the movies.
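A minimal sketch of how a categorical rating column could be derived, assuming the Bad / Average / Good thresholds quoted in the exploratory slides (below 5, 5 to 7.5, above 7.5). The toy vector and the names `rating` and `rating_c` are hypothetical, and the handling of values exactly on a boundary is an assumption.

```r
# Toy ratings for illustration; the real values would come from the IMDB data.
rating <- c(4.2, 5.0, 6.8, 7.5, 7.6, 8.3)

# Bad: below 5, Average: 5 to 7.5 (inclusive, by assumption), Good: above 7.5.
rating_c <- factor(
  ifelse(rating < 5, "Bad",
         ifelse(rating <= 7.5, "Average", "Good")),
  levels = c("Bad", "Average", "Good"),
  ordered = TRUE
)

table(rating_c)
```

Storing the column as an ordered factor matters for the later models: R applies polynomial contrasts to ordered factors by default, which is why coefficients for such columns are labelled `.L`, `.Q` and `.C` in model summaries.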
Exploratory Data Analysis and Visualization
• The distribution of revenue generated is extremely right-skewed:
most of the mass sits at low revenue.
• Most of the movies generated revenue below 100 million dollars in 2016.
• 83% of movies are in the Average category, with rating between 5 and 7.5.
• 13% of movies are in the Good category, with rating above 7.5.
• Only 4% of movies are in the Bad category, with rating below 5.
Visualization (cont’d)
-Correlations
-Cohen’s Kappa
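As an illustration of the agreement measure named on this slide, Cohen's kappa can be computed directly from a square cross-tabulation of viewer and critic categories. The sketch below uses base R only; the counts are made up for illustration and the function name `cohen_kappa` is hypothetical, not taken from the project code.

```r
# Hypothetical agreement table: rows = viewer rating category,
# columns = critic (Metascore) category. Counts are invented.
lv <- c("Bad", "Average", "Good")
agree <- matrix(c(5,  3,  0,
                  6, 60, 14,
                  0,  7,  5),
                nrow = 3, byrow = TRUE,
                dimnames = list(viewer = lv, critic = lv))

cohen_kappa <- function(tab) {
  p     <- tab / sum(tab)                    # joint proportions
  p_obs <- sum(diag(p))                      # observed agreement
  p_exp <- sum(rowSums(p) * colSums(p))      # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

kappa_hat <- cohen_kappa(agree)
kappa_hat
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.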
Technical Approach and Evaluation
• Models: GLM, LOGLM and proportional odds models.
• Visualization: Mosaic plots and correspondence
analysis
• Evaluation of Models: To find the best-fitting model, we
compared candidate models with the anova function.
• Twitter: Correlation of votes from the IMDB dataset and
followers on Twitter
Association Between Variables
-Runtime and Genre
• There is positive association
  • between Long Runtime and Action
    and Adventure movies.
  • between Short Runtime and
    Animation and Horror movies.
• Thus, action and adventure movies
are usually longer than the other
types of movies.
• Overall, there is no strong
association between genre of a
movie and its runtime.
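The cell-level statements on these association slides correspond to the sign of the Pearson residuals that a shaded mosaic plot displays. A minimal base-R sketch follows; the counts, genre labels and runtime labels are invented for illustration and are not the project data.

```r
# Made-up counts for illustration only: genre by runtime category.
tab <- matrix(c(10, 25, 30,
                20, 15,  5,
                18, 20, 12),
              nrow = 3, byrow = TRUE,
              dimnames = list(Genre   = c("Action", "Animation", "Drama"),
                              Runtime = c("Short", "Medium", "Long")))

test <- chisq.test(tab)      # Pearson chi-squared test of independence
round(test$residuals, 2)     # (observed - expected) / sqrt(expected)

# Base-graphics mosaic plot; shade = TRUE colours cells by residual size.
mosaicplot(tab, shade = TRUE, main = "Genre by runtime (toy data)")
```

A positive residual marks a cell with more movies than independence would predict, and the overall test summarises whether those departures add up to a significant association.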
Association Between Variables
-Rating and Genre
• According to the mosaic plot
above, there is positive
association
  • between Bad Rating and Action and
    Adventure movies.
  • between Good Rating and
    Animation movies.
• Overall, the genre of a movie has
no association with the rating the
movie received.
Association Between Variables
-Votes and Metascore
• There is strong association
between Average Metascore and
Very High Votes.
• Overall, there is no strong
association between Metascore
and Votes.
Association Between Variables
-Rating and Metascore
• There is an association between
the rating and the Metascore a movie
received.
Association Between Variables
-Rating and Movie Length
Check for Residuals

Proportional Model
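library(MASS)  # polr()
library(car)   # Anova()
# Proportional odds model for the rating category
model1 <- polr(Rating_c ~ Votes_c + Metascore_c + Movie_length,
               data = IMDB_Movie_Data)
Anova(model1)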
## Analysis of Deviance Table (Type II tests)
##
## Response: Rating_c
## LR Chisq Df Pr(>Chisq)
## Votes_c 8.9549 3 0.029897 *
## Metascore_c 13.0058 2 0.001499 **
## Movie_length 2.7611 2 0.251446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Proportional Model (cont’d)
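library(VGAM)  # vglm(), cumulative()
# Proportional odds for every predictor
IMDB_vglm1 <- vglm(Rating_c ~ Votes_c + Metascore_c + Movie_length,
                   data = IMDB_Movie_Data,
                   family = cumulative(parallel = TRUE))
# Partial proportional odds: parallel assumption relaxed for Movie_length only
IMDB_vglm2 <- vglm(Rating_c ~ Votes_c + Metascore_c + Movie_length,
                   data = IMDB_Movie_Data,
                   family = cumulative(parallel = FALSE ~ Movie_length))
coef(IMDB_vglm1, matrix = TRUE)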
## logit(P[Y<=1]) logit(P[Y<=2])
## (Intercept) 1.4511646 1.8263330
## Votes_c.L -1.0213357 -1.0213357
## Votes_c.Q -0.6081794 -0.6081794
## Votes_c.C 0.1124382 0.1124382
## Metascore_c.L -1.0348567 -1.0348567
## Metascore_c.Q -0.8508894 -0.8508894
## Movie_length.L -0.2946856 -0.2946856
## Movie_length.Q 0.5376338 0.5376338
coef(IMDB_vglm2, matrix=TRUE)
## logit(P[Y<=1]) logit(P[Y<=2])
## (Intercept) 1.4400052 1.92034416
## Votes_c.L -1.0316357 -1.03163568
## Votes_c.Q -0.5831384 -0.58313835
## Votes_c.C 0.1237770 0.12377695
## Metascore_c.L -1.0555081 -1.05550805
## Metascore_c.Q -0.8411716 -0.84117158
## Movie_length.L -0.3536775 0.01926485
## Movie_length.Q 0.4705363 0.80360852
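To make the difference between the two fits explicit, the sketch below copies the printed coefficient matrix of IMDB_vglm2 and flags the terms whose slope differs between the two cumulative logits. The object names `term_names`, `coefs` and `tol` are hypothetical, and the tolerance is an assumption that only absorbs rounding in the printed output.

```r
# Coefficient matrix copied from the coef(IMDB_vglm2, matrix = TRUE) output.
term_names <- c("(Intercept)", "Votes_c.L", "Votes_c.Q", "Votes_c.C",
                "Metascore_c.L", "Metascore_c.Q",
                "Movie_length.L", "Movie_length.Q")
coefs <- matrix(c( 1.4400052,  1.92034416,
                  -1.0316357, -1.03163568,
                  -0.5831384, -0.58313835,
                   0.1237770,  0.12377695,
                  -1.0555081, -1.05550805,
                  -0.8411716, -0.84117158,
                  -0.3536775,  0.01926485,
                   0.4705363,  0.80360852),
                ncol = 2, byrow = TRUE,
                dimnames = list(term_names,
                                c("logit(P[Y<=1])", "logit(P[Y<=2])")))

# A term obeys proportional odds when both cumulative logits share one slope.
tol <- 1e-6   # allows for rounding in the printed output
non_parallel <- abs(coefs[, 1] - coefs[, 2]) > tol
names(which(non_parallel))
```

Only the intercepts (which are the two cutpoints and always differ) and the two Movie_length contrasts are flagged, which is what `parallel = FALSE ~ Movie_length` requests.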
model2 <- polr(Rating_c ~ Votes_c * Metascore_c, data = IMDB_Movie_Data)
Anova(model2)
## Analysis of Deviance Table (Type II tests)
##
## LR Chisq Df Pr(>Chisq)
## Votes_c 8.921 3 0.0303599 *
## Metascore_c 14.004 2 0.0009103 ***
## Votes_c:Metascore_c 16.141 6 0.0130191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(model1, model2)
## Likelihood ratio tests of ordinal regression models
##
## Model Resid. df Resid. Dev Test Df
## 1 Votes_c + Metascore_c + Movie_length 189 188.9996
## 2 Votes_c * Metascore_c 185 175.6201 1 vs 2 4
## LR stat. Pr(Chi)
## 1
## 2 13.3795 0.009562905
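The likelihood ratio statistic in this table is simply the difference between the two residual deviances, referred to a chi-squared distribution with degrees of freedom equal to the difference in residual df. A short sketch that reproduces the printed numbers from the values above (variable names are hypothetical):

```r
# Values copied from the anova(model1, model2) output.
dev1 <- 188.9996; df1 <- 189   # Votes_c + Metascore_c + Movie_length
dev2 <- 175.6201; df2 <- 185   # Votes_c * Metascore_c

lr_stat <- dev1 - dev2         # likelihood ratio statistic
lr_df   <- df1 - df2           # difference in residual degrees of freedom
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

c(LR = lr_stat, df = lr_df, p = p_value)
```

One caveat worth keeping in mind: model2 drops Movie_length while adding the interaction, so the two models are not strictly nested, and the chi-squared reference distribution should be read as approximate here.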
Generalized Linear Model
-Model 1
## Call:
## glm(formula = Rating_c ~ Votes + Metascore_c + Movie_length,
## family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3771 -0.5550 -0.4189 -0.3364 2.4515
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.883e+00 3.209e-01 -5.867 4.45e-09 ***
## Votes 6.052e-06 1.978e-06 3.060 0.00222 **
## Metascore_c.L 9.790e-01 3.318e-01 2.951 0.00317 **
## Metascore_c.Q 8.933e-01 4.438e-01 2.013 0.04415 *
## Movie_length.L 4.332e-01 4.487e-01 0.965 0.33431
## Movie_length.Q -2.592e-01 4.660e-01 -0.556 0.57805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178.42 on 197 degrees of freedom
## Residual deviance: 154.00 on 192 degrees of freedom
## AIC: 166
##
## Number of Fisher Scoring iterations: 5
-Model 2
## Call:
## glm(formula = Rating_c ~ Votes * Metascore_c + Movie_length,
##
## -1.4373 -0.6051 -0.4105 -0.1841 2.5897
##
## Coefficients:
## (Intercept) -1.483e+00 3.492e-01 -4.247 2.17e-05 ***
## Votes -8.469e-06 1.008e-05 -0.840 0.4007
## Metascore_c.L 9.029e-01 4.719e-01 1.913 0.0557 .
## Metascore_c.Q -8.432e-01 6.723e-01 -1.254 0.2098
## Movie_length.L 2.826e-01 4.600e-01 0.614 0.5390
## Movie_length.Q -4.262e-01 4.956e-01 -0.860 0.3898
## Votes:Metascore_c.L 1.871e-06 4.087e-06 0.458 0.6471
## Votes:Metascore_c.Q 4.414e-05 2.415e-05 1.828 0.0676 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## AIC: 158.62
##
-Model 3
## Call:
## glm(formula = Rating_c ~ Votes * Movie_length + Metascore_c,
##
## -1.2385 -0.6107 -0.4031 -0.3075 2.5017
##
## Coefficients:
## (Intercept) -1.492e+00 3.737e-01 -3.993 6.52e-05 ***
## Votes -2.179e-05 2.640e-05 -0.826 0.40904
## Movie_length.L 1.355e+00 6.379e-01 2.124 0.03366 *
## Movie_length.Q 1.155e-01 6.363e-01 0.182 0.85592
## Metascore_c.L 9.130e-01 3.387e-01 2.695 0.00703 **
## Metascore_c.Q 9.366e-01 4.501e-01 2.081 0.03746 *
## Votes:Movie_length.L -6.016e-05 5.601e-05 -1.074 0.28277
## Votes:Movie_length.Q -3.316e-05 3.250e-05 -1.020 0.30764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
##
## AIC: 165.26
##
-Evaluation
## Analysis of Deviance Table
##
## Model 1: Rating_c ~ Votes + Metascore_c + Movie_length
## Model 2: Rating_c ~ Votes * Metascore_c + Movie_length
## Model 3: Rating_c ~ Votes * Movie_length + Metascore_c
## Resid. Df Resid. Dev Df Deviance
## 1 192 154.00
## 2 190 142.62 2 11.3779
## 3 190 149.26 0 -6.6375
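The deviance table above carries no p-values, and its last row compares Model 3 with Model 2, which have the same residual df. A sketch of a more direct reading, using only numbers printed in the summaries, tests each interaction model against Model 1 (which is nested in both) and recomputes the AIC values. Variable names are hypothetical, and the AIC shortcut assumes an ungrouped 0/1 response, where the deviance equals minus twice the log-likelihood.

```r
# Residual deviances and residual df copied from the three GLM summaries.
n   <- 198                                   # null df 197 + 1
dev <- c(m1 = 154.00, m2 = 142.62, m3 = 149.26)
rdf <- c(m1 = 192,    m2 = 190,    m3 = 190)

k   <- n - rdf                               # number of estimated coefficients
aic <- dev + 2 * k                           # deviance = -2 * logLik for 0/1 data

# Likelihood ratio tests of each interaction model against Model 1.
p_m2 <- pchisq(dev[["m1"]] - dev[["m2"]], df = rdf[["m1"]] - rdf[["m2"]],
               lower.tail = FALSE)
p_m3 <- pchisq(dev[["m1"]] - dev[["m3"]], df = rdf[["m1"]] - rdf[["m3"]],
               lower.tail = FALSE)

aic
c(m2_vs_m1 = p_m2, m3_vs_m1 = p_m3)
```

The recomputed AIC values match the printed ones (166, 158.62, 165.26). On these numbers the Votes by Metascore interaction improves on Model 1 at the 1% level, while the Votes by Movie_length interaction does not reach the 5% level.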
Linear Regression
## Call:
## lm(formula = Revenue..Millions. ~ Runtime..Minutes. + Rating +
## Votes + Metascore, data = imdb1)
##
## Residuals:
## -170.33 -22.93 -7.15 8.09 352.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.361e+01 3.825e+01 2.186 0.0300 *
## Runtime..Minutes. -4.767e-01 2.624e-01 -1.817 0.0708 .
## Rating -7.301e+00 7.051e+00 -1.036 0.3017
## Votes 8.125e-04 5.152e-05 15.772 <2e-16 ***
## Metascore 2.967e-01 3.192e-01 0.929 0.3538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.39 on 193 degrees of freedom
## Multiple R-squared: 0.5873, Adjusted R-squared: 0.5787
## F-statistic: 68.65 on 4 and 193 DF, p-value: < 2.2e-16
Simple Linear Model
-Model 1
# Fit a simple linear model with interaction between Votes and Metascore
rating_lm1 <- lm(Rating ~ Votes*Metascore + Runtime, data = IMDB_Movie_Data)
summary(rating_lm1)
##
## Call:
## lm(formula = Rating ~ Votes * Metascore + Runtime, data = IMDB_Movie_Data)
##
## Residuals:
## -2.55822 -0.27029 0.02195 0.35984 1.89278
##
## Coefficients:
## (Intercept) 3.651e+00 3.018e-01 12.097 < 2e-16 ***
## Votes 1.026e-06 2.073e-06 0.495 0.621121
## Metascore 2.874e-02 3.179e-03 9.041 < 2e-16 ***
## Runtime 1.022e-02 2.619e-03 3.901 0.000132 ***
## Votes:Metascore 2.229e-08 3.185e-08 0.700 0.484874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5953 on 193 degrees of freedom
## Multiple R-squared: 0.5567, Adjusted R-squared: 0.5475
Simple Linear Model
-Model 2
# Fit a linear model using Votes, Metascore and Runtime as inputs (no interaction)
rating_lm2 <- lm(Rating ~ Votes + Metascore + Runtime, data = IMDB_Movie_Data)
summary(rating_lm2)
##
## Call:
## lm(formula = Rating ~ Votes + Metascore + Runtime, data = IMDB_Movie_Data)
##
## Residuals:
## -2.5595 -0.2735 0.0331 0.3563 1.9169
##
## Coefficients:
## (Intercept) 3.598e+00 2.915e-01 12.340 < 2e-16 ***
## Votes 2.435e-06 4.946e-07 4.923 1.82e-06 ***
## Metascore 3.018e-02 2.422e-03 12.460 < 2e-16 ***
## Runtime 9.899e-03 2.576e-03 3.843 0.000165 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5945 on 194 degrees of freedom
## Multiple R-squared: 0.5556, Adjusted R-squared: 0.5487
Simple Linear Model
-Evaluation
# Run a Type II Anova on the chosen model (rating_lm2)
Anova(rating_lm2)
## Anova Table (Type II tests)
##
## Response: Rating
## Sum Sq Df F value Pr(>F)
## Votes 8.564 1 24.231 1.821e-06 ***
## Metascore 54.868 1 155.242 < 2.2e-16 ***
## Runtime 5.219 1 14.766 0.0001648 ***
## Residuals 68.567 194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
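The preference for the model without the interaction can be checked from the printed summaries alone. When a single term is dropped, the partial F statistic equals the squared t value of that term, and the adjusted R-squared shows whether the extra term earns its degree of freedom. The sketch below uses only rounded values from the two summaries; the variable and function names are hypothetical.

```r
# Numbers copied from summary(rating_lm1) and summary(rating_lm2).
t_inter <- 0.700          # t value of Votes:Metascore in rating_lm1
df_res  <- 193            # residual df of rating_lm1

# Dropping a single term: the partial F statistic is the squared t value.
f_stat <- t_inter^2
p_drop <- pf(f_stat, df1 = 1, df2 = df_res, lower.tail = FALSE)

# Adjusted R-squared: 1 - (1 - R2) * (n - 1) / residual df
adj_r2 <- function(r2, n, df_res) 1 - (1 - r2) * (n - 1) / df_res
adj <- c(lm1 = adj_r2(0.5567, 198, 193),
         lm2 = adj_r2(0.5556, 198, 194))

p_drop
round(adj, 4)
```

The interaction is far from significant, and the adjusted R-squared is slightly higher for rating_lm2 (0.5487 against 0.5475), which is consistent with choosing the simpler model.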
Comparison with Twitter
cor(imdbsubset$Votes, imdbsubset$TwitterFollowers)
## [1] 0.9801678
The plot shows that Twitter
followers are mostly
consistent with Votes on
IMDB. A correlation of 0.98
supports this
conclusion.
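Because the Twitter comparison rests on a small sample, it helps to attach a test and an interval to the correlation. The sketch below assumes that imdbsubset holds the ten movies for which follower counts were extracted (n = 10 is taken from the data description, not from the code), and uses the standard t test and Fisher z interval for a Pearson correlation.

```r
r <- 0.9801678   # correlation reported above
n <- 10          # assumed: the ten movies with Twitter follower counts

# t statistic for H0: rho = 0, with n - 2 degrees of freedom
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_val  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Approximate 95% confidence interval via the Fisher z-transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

c(t = t_stat, p = p_val)
ci
```

With the raw data available, `cor.test(imdbsubset$Votes, imdbsubset$TwitterFollowers)` reports the same test and interval in one call.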
Conclusion
• Movie watchers’ ratings are significantly associated with Votes and
critic ratings.
• In the categorical analyses, the length and genre of a movie do not have any
significant effect on movie watchers’ ratings; in the linear model,
Runtime in minutes is a significant predictor of Rating.
• Critic ratings are not a significant predictor of movie revenue. It is
Votes that significantly predict a movie’s revenue.
• There is not much agreement between movie
watcher/user ratings and critic ratings except in the case
of Bad Movies.
• There is strong correlation between votes on IMDB and
followers on Twitter.
