
Introduction

• The movie theater industry is facing severe competition from
internet and video streaming.
• Questions we want to answer through this project:
    • What factors affect the ratings of movies?
    • Is there any single factor, like critic ratings, that attracts the
    public to watch and rate a movie?
    • Does the length of a movie influence a person’s rating of it?
    • Can the revenue of a movie be predicted from critics’ ratings?
    • Is there any agreement between movie watcher ratings and
    critic ratings?
    • Is there any correlation between the number of votes for a movie
    on IMDB and the number of followers for that movie on a social media
    platform like Twitter?
Data Description
• Data Source: IMDB, Twitter
• From IMDB, we created a dataset with the following
variables: Title, Genre, Description, Director, Actors,
Year, Runtime, Rating, Votes, Revenue, Metascore.
• We cleaned the dataset by removing records with unknown
values and then created categorical columns for ratings,
votes, critic ratings and genre.
• From Twitter, we extracted the number of
followers for ten of the movies.
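The cleaning and binning step described above can be sketched roughly as follows. This is a minimal illustration, not the code used for the project: the toy `movies` data frame is made up, the rating thresholds (5 and 7.5) are taken from the category definitions in the slides, the treatment of ratings exactly equal to 5 or 7.5 is an assumption, and the quartile cut points and the labels "Low", "Medium" and "High" for votes are hypothetical.

```r
# Toy stand-in for the IMDB data (values are made up for illustration).
movies <- data.frame(
  Title  = c("A", "B", "C", "D", "E"),
  Rating = c(4.2, 6.1, 7.5, 8.3, NA),
  Votes  = c(1200, 56000, 310000, 890000, 40000)
)

# Remove records with unknown values.
movies <- na.omit(movies)

# Bad: below 5, Average: 5 to 7.5, Good: above 7.5.
# Assumption: the boundary values 5 and 7.5 are counted as Average.
rating_class <- function(r) {
  out <- ifelse(r < 5, "Bad", ifelse(r <= 7.5, "Average", "Good"))
  factor(out, levels = c("Bad", "Average", "Good"), ordered = TRUE)
}
movies$Rating_c <- rating_class(movies$Rating)

# Hypothetical four-level vote category based on sample quartiles.
movies$Votes_c <- cut(movies$Votes,
                      breaks = quantile(movies$Votes, probs = seq(0, 1, 0.25)),
                      labels = c("Low", "Medium", "High", "Very High"),
                      include.lowest = TRUE, ordered_result = TRUE)

print(movies)
```

Storing the categories as ordered factors matters later: it is why the model output reports polynomial contrasts such as `.L` and `.Q` for these variables.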
Exploratory Data Analysis and
Visualization

• The distribution of revenue generated is extremely right-skewed.
• Most of the movies generated revenue below 100 million dollars in 2016.

• 83% of movies are in the Average category, with rating between 5 and 7.5.
• 13% of movies are in the Good category, with rating above 7.5.
• Only 4% of movies are in the Bad category, with rating below 5.
Exploratory Data Analysis and
Visualization (cont’d)
Exploratory Data Analysis and
Visualization (cont’d)
-Correlations
Exploratory Data Analysis and
Visualization (cont’d)
-Cohen’s Kappa
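Cohen's kappa measures agreement between two raters beyond what chance alone would produce: kappa = (p_obs - p_exp) / (1 - p_exp), where p_obs is the observed share of matching classifications and p_exp is the share expected from the marginal totals. A minimal base-R sketch is given below. The helper `cohen_kappa` and the toy `user` and `critic` vectors are hypothetical and are not the project's data.

```r
# Cohen's kappa for two raters who classify the same items.
cohen_kappa <- function(a, b) {
  lv  <- union(levels(factor(a)), levels(factor(b)))
  tab <- table(factor(a, levels = lv), factor(b, levels = lv))
  p   <- tab / sum(tab)                      # joint proportions
  p_obs <- sum(diag(p))                      # observed agreement
  p_exp <- sum(rowSums(p) * colSums(p))      # agreement expected by chance
  (p_obs - p_exp) / (1 - p_exp)
}

# Hypothetical rating categories for eight movies.
user   <- c("Bad", "Average", "Average", "Good", "Average", "Bad", "Good", "Average")
critic <- c("Bad", "Average", "Good", "Good", "Bad", "Bad", "Average", "Average")

k_hat <- cohen_kappa(user, critic)
print(k_hat)
```

A value near 0 indicates agreement no better than chance, and a value of 1 indicates perfect agreement.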
Technical Approach and
Evaluation
• Models: GLM, log-linear (loglm) and proportional odds models.
• Visualization: Mosaic plots and correspondence
analysis
• Evaluation of Models: To select the best-fitting model, we
used the anova function.
• Twitter: Correlation of votes from the IMDB dataset and
followers on Twitter
Association Between
Variables
-Runtime and Genre

• There is a positive association
between Long Runtime and Action
and Adventure movies.
• There is a positive association
between Short Runtime and
Animation and Horror movies.
• Thus, action and adventure movies
are usually longer than the other
types of movies.
• Overall, there is no strong
association between the genre of a
movie and its runtime.
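One way to read "positive association" from a two-way table is through the standardized residuals of a chi-square test of independence: a positive residual means the cell holds more movies than independence would predict, which is also what the shading in a mosaic plot encodes. The sketch below uses base R only. The counts are hypothetical and are not taken from the IMDB data.

```r
# Hypothetical Runtime x Genre counts (not the project's data).
tab <- matrix(c(30, 10,  5,
                12, 20, 18),
              nrow = 2, byrow = TRUE,
              dimnames = list(Runtime = c("Long", "Short"),
                              Genre   = c("Action", "Animation", "Horror")))

# Chi-square test of independence between the two factors.
test <- chisq.test(tab)

# Standardized residuals: positive = more movies than expected under independence.
print(round(test$stdres, 2))
print(test$p.value)

# A shaded mosaic plot of the same table could be drawn with:
# mosaicplot(tab, shade = TRUE)
```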
Association Between
Variables
-Rating and Genre
• According to the mosaic plot
above, there is a positive
association between Bad
Rating and Action and
Adventure movies.
• There is a positive association
between Good Rating and
Animation movies.
• Overall, the genre of a movie has
no strong association with the
rating it received.
Association Between
Variables
-Votes and Metascore

• There is a strong association
between Average Metascore and
Very High Votes.
• Overall, there is no strong
association between Metascore
and Votes.
Association Between
Variables
-Rating and Metascore

• There is an association between
the rating and the Metascore a movie
received.
Association Between
Variables
-Rating and Movie Length

Check for Residuals


Proportional Model
library(MASS)  # polr()
library(car)   # Anova()
model1 <- polr(Rating_c ~ Votes_c + Metascore_c + Movie_length, data = IMDB_Movie_Data)
Anova(model1)
## Analysis of Deviance Table (Type II tests)
##
## Response: Rating_c
## LR Chisq Df Pr(>Chisq)
## Votes_c 8.9549 3 0.029897 *
## Metascore_c 13.0058 2 0.001499 **
## Movie_length 2.7611 2 0.251446
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Proportional Model (cont’d)
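library(VGAM)  # vglm(), cumulative()
# Proportional odds for all predictors
IMDB_vglm1 <- vglm(Rating_c ~ Votes_c + Metascore_c + Movie_length, data = IMDB_Movie_Data,
                   family = cumulative(parallel = TRUE))
# Proportional odds relaxed for Movie_length only
IMDB_vglm2 <- vglm(Rating_c ~ Votes_c + Metascore_c + Movie_length, data = IMDB_Movie_Data,
                   family = cumulative(parallel = FALSE ~ Movie_length))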
coef(IMDB_vglm1, matrix=TRUE)
## logit(P[Y<=1]) logit(P[Y<=2])
## (Intercept) 1.4511646 1.8263330
## Votes_c.L -1.0213357 -1.0213357
## Votes_c.Q -0.6081794 -0.6081794
## Votes_c.C 0.1124382 0.1124382
## Metascore_c.L -1.0348567 -1.0348567
## Metascore_c.Q -0.8508894 -0.8508894
## Movie_length.L -0.2946856 -0.2946856
## Movie_length.Q 0.5376338 0.5376338
coef(IMDB_vglm2, matrix=TRUE)
## logit(P[Y<=1]) logit(P[Y<=2])
## (Intercept) 1.4400052 1.92034416
## Votes_c.L -1.0316357 -1.03163568
## Votes_c.Q -0.5831384 -0.58313835
## Votes_c.C 0.1237770 0.12377695
## Metascore_c.L -1.0555081 -1.05550805
## Metascore_c.Q -0.8411716 -0.84117158
## Movie_length.L -0.3536775 0.01926485
## Movie_length.Q 0.4705363 0.80360852
Proportional Model (cont’d)
model2 <- polr(Rating_c ~ Votes_c * Metascore_c, data = IMDB_Movie_Data)
Anova(model2)
## Analysis of Deviance Table (Type II tests)
##
## Response: Rating_c
## LR Chisq Df Pr(>Chisq)
## Votes_c 8.921 3 0.0303599 *
## Metascore_c 14.004 2 0.0009103 ***
## Votes_c:Metascore_c 16.141 6 0.0130191 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Proportional Model (cont’d)
anova(model1, model2)
## Likelihood ratio tests of ordinal regression models
##
## Response: Rating_c
## Model Resid. df Resid. Dev Test Df
## 1 Votes_c + Metascore_c + Movie_length 189 188.9996
## 2 Votes_c * Metascore_c 185 175.6201 1 vs 2 4
## LR stat. Pr(Chi)
## 1
## 2 13.3795 0.009562905
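The likelihood ratio statistic in the table above is the difference between the two residual deviances, referred to a chi-square distribution whose degrees of freedom equal the difference in residual degrees of freedom. The sketch below reproduces that arithmetic from the printed values only. The variable names are illustrative. Note that model2 drops Movie_length while adding the interaction, so the two models are not strictly nested and the chi-square reference should be read as approximate.

```r
# Residual deviances and degrees of freedom as printed by anova(model1, model2).
dev_additive    <- 188.9996   # Votes_c + Metascore_c + Movie_length
dev_interaction <- 175.6201   # Votes_c * Metascore_c
df_diff <- 189 - 185

# LR statistic = drop in deviance; p-value from the upper chi-square tail.
lr_stat <- dev_additive - dev_interaction
p_value <- pchisq(lr_stat, df = df_diff, lower.tail = FALSE)

print(c(lr_stat = lr_stat, p_value = p_value))
```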
Generalized Linear Model
-Model 1
## Call:
## glm(formula = Rating_c ~ Votes + Metascore_c + Movie_length,
## family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3771 -0.5550 -0.4189 -0.3364 2.4515
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.883e+00 3.209e-01 -5.867 4.45e-09 ***
## Votes 6.052e-06 1.978e-06 3.060 0.00222 **
## Metascore_c.L 9.790e-01 3.318e-01 2.951 0.00317 **
## Metascore_c.Q 8.933e-01 4.438e-01 2.013 0.04415 *
## Movie_length.L 4.332e-01 4.487e-01 0.965 0.33431
## Movie_length.Q -2.592e-01 4.660e-01 -0.556 0.57805
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178.42 on 197 degrees of freedom
## Residual deviance: 154.00 on 192 degrees of freedom
## AIC: 166
##
## Number of Fisher Scoring iterations: 5
Generalized Linear Model
-Model 2
## Call:
## glm(formula = Rating_c ~ Votes * Metascore_c + Movie_length,
## family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4373 -0.6051 -0.4105 -0.1841 2.5897
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.483e+00 3.492e-01 -4.247 2.17e-05 ***
## Votes -8.469e-06 1.008e-05 -0.840 0.4007
## Metascore_c.L 9.029e-01 4.719e-01 1.913 0.0557 .
## Metascore_c.Q -8.432e-01 6.723e-01 -1.254 0.2098
## Movie_length.L 2.826e-01 4.600e-01 0.614 0.5390
## Movie_length.Q -4.262e-01 4.956e-01 -0.860 0.3898
## Votes:Metascore_c.L 1.871e-06 4.087e-06 0.458 0.6471
## Votes:Metascore_c.Q 4.414e-05 2.415e-05 1.828 0.0676 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178.42 on 197 degrees of freedom
## Residual deviance: 142.62 on 190 degrees of freedom
## AIC: 158.62
##
## Number of Fisher Scoring iterations: 8
Generalized Linear Model
-Model 3
## Call:
## glm(formula = Rating_c ~ Votes * Movie_length + Metascore_c,
## family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2385 -0.6107 -0.4031 -0.3075 2.5017
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.492e+00 3.737e-01 -3.993 6.52e-05 ***
## Votes -2.179e-05 2.640e-05 -0.826 0.40904
## Movie_length.L 1.355e+00 6.379e-01 2.124 0.03366 *
## Movie_length.Q 1.155e-01 6.363e-01 0.182 0.85592
## Metascore_c.L 9.130e-01 3.387e-01 2.695 0.00703 **
## Metascore_c.Q 9.366e-01 4.501e-01 2.081 0.03746 *
## Votes:Movie_length.L -6.016e-05 5.601e-05 -1.074 0.28277
## Votes:Movie_length.Q -3.316e-05 3.250e-05 -1.020 0.30764
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 178.42 on 197 degrees of freedom
## Residual deviance: 149.26 on 190 degrees of freedom
## AIC: 165.26
##
## Number of Fisher Scoring iterations: 7
Generalized Linear Model
-Evaluation
## Analysis of Deviance Table
##
## Model 1: Rating_c ~ Votes + Metascore_c + Movie_length
## Model 2: Rating_c ~ Votes * Metascore_c + Movie_length
## Model 3: Rating_c ~ Votes * Movie_length + Metascore_c
## Resid. Df Resid. Dev Df Deviance
## 1 192 154.00
## 2 190 142.62 2 11.3779
## 3 190 149.26 0 -6.6375
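The deviance table compares Model 1 with Model 2 on 2 degrees of freedom, but Models 2 and 3 have the same residual degrees of freedom and are not nested, so the third row (Df 0) does not give a usable test. For those two, the AIC values reported in the summaries are the more natural comparison. The sketch below recomputes both quantities from the printed output. It assumes a 0/1 response per row, in which case AIC equals the residual deviance plus twice the number of coefficients. The data frame `glm_fits` is illustrative.

```r
# Values copied from the three glm summaries above.
glm_fits <- data.frame(
  model    = c("Model 1", "Model 2", "Model 3"),
  deviance = c(154.00, 142.62, 149.26),
  resid_df = c(192, 190, 190)
)
n_obs <- 198  # null deviance is reported on 197 degrees of freedom

# Number of estimated coefficients, then AIC = deviance + 2k
# (valid for ungrouped binary data, assumed here).
glm_fits$n_coef <- n_obs - glm_fits$resid_df
glm_fits$aic    <- glm_fits$deviance + 2 * glm_fits$n_coef
print(glm_fits)

# Nested comparison: Model 1 vs Model 2 (adds the Votes:Metascore_c terms).
lr_12 <- 154.00 - 142.62
p_12  <- pchisq(lr_12, df = 2, lower.tail = FALSE)
print(c(lr_12 = lr_12, p_12 = p_12))
```

On both criteria, Model 2 comes out ahead: it has the lowest AIC of the three and a significantly lower deviance than Model 1.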
Linear Regression
## Call:
## lm(formula = Revenue..Millions. ~ Runtime..Minutes. + Rating +
## Votes + Metascore, data = imdb1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -170.33 -22.93 -7.15 8.09 352.53
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.361e+01 3.825e+01 2.186 0.0300 *
## Runtime..Minutes. -4.767e-01 2.624e-01 -1.817 0.0708 .
## Rating -7.301e+00 7.051e+00 -1.036 0.3017
## Votes 8.125e-04 5.152e-05 15.772 <2e-16 ***
## Metascore 2.967e-01 3.192e-01 0.929 0.3538
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.39 on 193 degrees of freedom
## Multiple R-squared: 0.5873, Adjusted R-squared: 0.5787
## F-statistic: 68.65 on 4 and 193 DF, p-value: < 2.2e-16
Simple Linear Model
-Model 1
# Fit a simple linear model with interaction between Votes and Metascore
rating_lm1 <- lm(Rating ~ Votes*Metascore + Runtime, data = IMDB_Movie_Data)
summary(rating_lm1)
##
## Call:
## lm(formula = Rating ~ Votes * Metascore + Runtime, data = IMDB_Movie_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.55822 -0.27029 0.02195 0.35984 1.89278
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.651e+00 3.018e-01 12.097 < 2e-16 ***
## Votes 1.026e-06 2.073e-06 0.495 0.621121
## Metascore 2.874e-02 3.179e-03 9.041 < 2e-16 ***
## Runtime 1.022e-02 2.619e-03 3.901 0.000132 ***
## Votes:Metascore 2.229e-08 3.185e-08 0.700 0.484874
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5953 on 193 degrees of freedom
## Multiple R-squared: 0.5567, Adjusted R-squared: 0.5475
## F-statistic: 60.6 on 4 and 193 DF, p-value: < 2.2e-16
Simple Linear Model
-Model 2
# Fit a linear model using Votes, Metascore and Runtime as inputs (no interaction)
rating_lm2 <- lm(Rating ~ Votes + Metascore + Runtime, data = IMDB_Movie_Data)
summary(rating_lm2)
##
## Call:
## lm(formula = Rating ~ Votes + Metascore + Runtime, data = IMDB_Movie_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5595 -0.2735 0.0331 0.3563 1.9169
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.598e+00 2.915e-01 12.340 < 2e-16 ***
## Votes 2.435e-06 4.946e-07 4.923 1.82e-06 ***
## Metascore 3.018e-02 2.422e-03 12.460 < 2e-16 ***
## Runtime 9.899e-03 2.576e-03 3.843 0.000165 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5945 on 194 degrees of freedom
## Multiple R-squared: 0.5556, Adjusted R-squared: 0.5487
## F-statistic: 80.85 on 3 and 194 DF, p-value: < 2.2e-16
Simple Linear Model
-Evaluation
# Run a Type II Anova on the chosen model
Anova(rating_lm2)
## Anova Table (Type II tests)
##
## Response: Rating
## Sum Sq Df F value Pr(>F)
## Votes 8.564 1 24.231 1.821e-06 ***
## Metascore 54.868 1 155.242 < 2.2e-16 ***
## Runtime 5.219 1 14.766 0.0001648 ***
## Residuals 68.567 194
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
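To make the fitted coefficients of rating_lm2 concrete, the sketch below plugs them into the regression equation for a single movie. The coefficients are copied from the summary above. The input values (100,000 votes, Metascore 70, runtime 120 minutes) are hypothetical, and `predict_rating` is an illustrative helper rather than part of the original analysis.

```r
# Coefficients of rating_lm2 as printed in its summary.
coefs <- c(intercept = 3.598, Votes = 2.435e-06,
           Metascore = 3.018e-02, Runtime = 9.899e-03)

# Linear predictor: intercept + sum of coefficient * input value.
predict_rating <- function(votes, metascore, runtime) {
  unname(coefs["intercept"] +
         coefs["Votes"] * votes +
         coefs["Metascore"] * metascore +
         coefs["Runtime"] * runtime)
}

# Hypothetical movie: 100,000 votes, Metascore 70, 120 minutes.
pred <- predict_rating(votes = 1e5, metascore = 70, runtime = 120)
print(round(pred, 2))
```

With the model object available, `predict(rating_lm2, newdata = ...)` would give the same point estimate without rounding the coefficients.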
Comparison with Twitter

cor(imdbsubset$Votes, imdbsubset$TwitterFollowers)
## [1] 0.9801678

The plot shows that Twitter
follower counts are largely
consistent with Votes on
IMDB for the ten movies
sampled. A correlation of 0.98
supports this conclusion.
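Because the correlation rests on a small sample, it can help to check how strong the evidence is. The sketch below computes the usual t statistic for a Pearson correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2), from the reported value. It assumes n = 10, based on the statement that followers were collected for ten movies, and it assumes the usual conditions for this test hold.

```r
r <- 0.9801678   # correlation reported above
n <- 10          # assumption: the ten movies with Twitter follower counts

# t statistic for H0: true correlation is zero, on n - 2 degrees of freedom.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

print(c(t_stat = t_stat, p_value = p_value))
```

With the raw data, `cor.test(imdbsubset$Votes, imdbsubset$TwitterFollowers)` would report the same test along with a confidence interval.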
Conclusion
• The ratings given by movie watchers are affected by Votes and
Critic ratings.
• The length or genre of a movie does not have any
significant effect on movie watchers’ ratings.
• Critic ratings are not a significant predictor of movie revenue.
It is Votes that significantly affect a movie’s revenue.
• There is not much agreement between movie
watcher/user ratings and critic ratings, except in the case
of Bad Movies.
• There is a strong correlation between votes on IMDB and
followers on Twitter.
