• The movie theater industry is facing severe competition from the internet and video streaming.
• Questions we want to answer through this project:
  • What factors affect the ratings of movies?
  • Is there any single factor, such as critic ratings, that attracts the public to watch and rate a movie?
  • Does the length of a movie influence a person’s rating of it?
  • Can the revenue of a movie be determined by critics’ ratings?
  • Is there any agreement between movie watcher ratings and critic ratings?
  • Is there any correlation between the number of votes for a movie on IMDB and the number of followers for that movie on a social media platform like Twitter?

Data Description
• Data source: IMDB, Twitter
• From IMDB, we created a dataset with the following variables: Title, Genre, Description, Director, Actors, Year, Runtime, Rating, Votes, Revenue, Metascore.
• We cleaned the dataset by removing entries with unknown values and then created categorical columns for ratings, votes, critic ratings and genre.
• From Twitter, we extracted the number of followers for ten of the movies.

Exploratory Data Analysis and Visualization
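The categorical rating column mentioned under Data Description could be built along the following lines. This is a minimal sketch, not the project's actual code: the ratings are invented, the thresholds are the ones quoted on this slide (below 5, 5 to 7.5, above 7.5), and treating a rating of exactly 5 or exactly 7.5 as Average is an assumption.

```r
# Toy ratings, invented for illustration (not the project's data).
rating <- c(4.2, 5.0, 6.8, 7.5, 7.6, 8.3)

# Thresholds follow the slides: Bad < 5, Average 5 to 7.5, Good > 7.5.
# Assumption: the boundary values 5 and 7.5 both count as Average.
rating_label <- ifelse(rating < 5, "Bad",
                       ifelse(rating <= 7.5, "Average", "Good"))

# An ordered factor records that Bad < Average < Good,
# which ordinal models such as polr() require of the response.
Rating_c <- factor(rating_label,
                   levels = c("Bad", "Average", "Good"),
                   ordered = TRUE)

table(Rating_c)              # counts per category
prop.table(table(Rating_c))  # shares per category
```

When an ordered factor is used as a predictor, R applies polynomial contrasts by default, which is why coefficient names in the model output carry suffixes such as `.L`, `.Q` and `.C` (linear, quadratic, cubic).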
• 83% of movies are in the Average category, with rating between 5 and 7.5.
• 13% of movies are in the Good category, with rating above 7.5.
• Only 4% of movies are in the Bad category, with rating below 5.
• The distribution of revenue generated is extremely right-skewed: most movies sit at the low end, with a long tail of high earners.
• Most of the movies generated revenue below 100 million dollars in 2016.

Exploratory Data Analysis and Visualization (cont’d) - Correlations

Exploratory Data Analysis and Visualization (cont’d) - Cohen’s Kappa

Technical Approach and Evaluation
• Models: GLM, LOGLM and proportional models.
• Visualization: mosaic plots and correspondence analysis.
• Evaluation of models: to find the best-fitting model we used the anova function.
• Twitter: correlation of votes from the IMDB dataset and followers on Twitter.

Association Between Variables - Runtime and Genre
• There is a positive association between Long Runtime and Action and Adventure movies.
• There is a positive association between Short Runtime and Animation and Horror movies.
• Thus, action and adventure movies are usually longer than the other types of movies.
• Overall, there is no strong association between the genre of a movie and its runtime.

Association Between Variables - Rating and Genre
• According to the mosaic plot, there is a positive association between Bad Rating and Action and Adventure movies.
• There is a positive association between Good Rating and Animation movies.
• Overall, the genre of a movie has no association with the rating it received.

Association Between Variables - Votes and Metascore
• There is a strong association between Average Metascore and Very High Votes.
• Overall, there is no strong association between Metascore and Votes.

Association Between Variables - Rating and Metascore
• There is an association between the rating and the Metascore a movie received.

Association Between Variables - Rating and Movie Length
Check for Residuals
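For pairs of categorical variables like those in the association slides above, one common way to check residuals is to look at the Pearson residuals of a chi-square test of independence. A positive residual marks a cell with more movies than independence would predict (a "positive association"), a negative one marks fewer. The sketch below uses invented counts and hypothetical category labels; it is not the project's data and not necessarily the check the authors ran.

```r
# Invented counts for illustration only (rows: runtime class, columns: genre).
counts <- matrix(c(30, 10,  5,
                   20, 25, 15,
                   10, 15, 30),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Runtime = c("Short", "Medium", "Long"),
                                 Genre   = c("Animation", "Drama", "Action")))

test <- chisq.test(counts)

test$p.value              # overall test of independence
round(test$expected, 1)   # counts expected if runtime and genre were independent
round(test$residuals, 2)  # Pearson residuals: (observed - expected) / sqrt(expected)
```

Base R's `mosaicplot(counts, shade = TRUE)` colors cells by these same residuals, so reading a shaded mosaic plot and reading this residual table should lead to the same cells.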
Proportional Model
library(MASS)  # polr()
library(car)   # Anova()
model1 <- polr(Rating_c ~ Votes_c + Metascore_c + Movie_length, data = IMDB_Movie_Data)
Anova(model1)
## Analysis of Deviance Table (Type II tests)
##
## Response: Rating_c
##              LR Chisq Df Pr(>Chisq)
## Votes_c        8.9549  3   0.029897 *
## Metascore_c   13.0058  2   0.001499 **
## Movie_length   2.7611  2   0.251446
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Proportional Model (cont’d)
library(VGAM)  # vglm(), cumulative()
IMDB_vglm1 <- vglm(Rating_c ~ Votes_c + Metascore_c + Movie_length,
                   data = IMDB_Movie_Data, family = cumulative(parallel = TRUE))
IMDB_vglm2 <- vglm(Rating_c ~ Votes_c + Metascore_c + Movie_length,
                   data = IMDB_Movie_Data, family = cumulative(parallel = FALSE ~ Movie_length))
coef(IMDB_vglm1, matrix = TRUE)
##                logit(P[Y<=1]) logit(P[Y<=2])
## (Intercept)         1.4511646      1.8263330
## Votes_c.L          -1.0213357     -1.0213357
## Votes_c.Q          -0.6081794     -0.6081794
## Votes_c.C           0.1124382      0.1124382
## Metascore_c.L      -1.0348567     -1.0348567
## Metascore_c.Q      -0.8508894     -0.8508894
## Movie_length.L     -0.2946856     -0.2946856
## Movie_length.Q      0.5376338      0.5376338
coef(IMDB_vglm2, matrix = TRUE)
##                logit(P[Y<=1]) logit(P[Y<=2])
## (Intercept)         1.4400052     1.92034416
## Votes_c.L          -1.0316357    -1.03163568
## Votes_c.Q          -0.5831384    -0.58313835
## Votes_c.C           0.1237770     0.12377695
## Metascore_c.L      -1.0555081    -1.05550805
## Metascore_c.Q      -0.8411716    -0.84117158
## Movie_length.L     -0.3536775     0.01926485
## Movie_length.Q      0.4705363     0.80360852

Proportional Model (cont’d)
model2 <- polr(Rating_c ~ Votes_c * Metascore_c, data = IMDB_Movie_Data)
Anova(model2)
## Analysis of Deviance Table (Type II tests)
##
## Response: Rating_c
##                     LR Chisq Df Pr(>Chisq)
## Votes_c                8.921  3  0.0303599 *
## Metascore_c           14.004  2  0.0009103 ***
## Votes_c:Metascore_c   16.141  6  0.0130191 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Proportional Model (cont’d)
anova(model1, model2)
## Likelihood ratio tests of ordinal regression models
##
## Response: Rating_c
##                                  Model Resid. df Resid. Dev   Test    Df
## 1 Votes_c + Metascore_c + Movie_length       189   188.9996
## 2                Votes_c * Metascore_c       185   175.6201 1 vs 2     4
##   LR stat.     Pr(Chi)
## 1
## 2  13.3795 0.009562905

Generalized Linear Model - Model 1
## Call:
## glm(formula = Rating_c ~ Votes + Metascore_c + Movie_length,
##     family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.3771  -0.5550  -0.4189  -0.3364   2.4515
##
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)
## (Intercept)    -1.883e+00  3.209e-01  -5.867 4.45e-09 ***
## Votes           6.052e-06  1.978e-06   3.060  0.00222 **
## Metascore_c.L   9.790e-01  3.318e-01   2.951  0.00317 **
## Metascore_c.Q   8.933e-01  4.438e-01   2.013  0.04415 *
## Movie_length.L  4.332e-01  4.487e-01   0.965  0.33431
## Movie_length.Q -2.592e-01  4.660e-01  -0.556  0.57805
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 178.42  on 197  degrees of freedom
## Residual deviance: 154.00  on 192  degrees of freedom
## AIC: 166
##
## Number of Fisher Scoring iterations: 5

Generalized Linear Model - Model 2
## Call:
## glm(formula = Rating_c ~ Votes * Metascore_c + Movie_length,
##     family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.4373  -0.6051  -0.4105  -0.1841   2.5897
##
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)
## (Intercept)         -1.483e+00  3.492e-01  -4.247 2.17e-05 ***
## Votes               -8.469e-06  1.008e-05  -0.840   0.4007
## Metascore_c.L        9.029e-01  4.719e-01   1.913   0.0557 .
## Metascore_c.Q       -8.432e-01  6.723e-01  -1.254   0.2098
## Movie_length.L       2.826e-01  4.600e-01   0.614   0.5390
## Movie_length.Q      -4.262e-01  4.956e-01  -0.860   0.3898
## Votes:Metascore_c.L  1.871e-06  4.087e-06   0.458   0.6471
## Votes:Metascore_c.Q  4.414e-05  2.415e-05   1.828   0.0676 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 178.42  on 197  degrees of freedom
## Residual deviance: 142.62  on 190  degrees of freedom
## AIC: 158.62
##
## Number of Fisher Scoring iterations: 8

Generalized Linear Model - Model 3
## Call:
## glm(formula = Rating_c ~ Votes * Movie_length + Metascore_c,
##     family = binomial, data = IMDB_Movie_Data)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.2385  -0.6107  -0.4031  -0.3075   2.5017
##
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)
## (Intercept)          -1.492e+00  3.737e-01  -3.993 6.52e-05 ***
## Votes                -2.179e-05  2.640e-05  -0.826  0.40904
## Movie_length.L        1.355e+00  6.379e-01   2.124  0.03366 *
## Movie_length.Q        1.155e-01  6.363e-01   0.182  0.85592
## Metascore_c.L         9.130e-01  3.387e-01   2.695  0.00703 **
## Metascore_c.Q         9.366e-01  4.501e-01   2.081  0.03746 *
## Votes:Movie_length.L -6.016e-05  5.601e-05  -1.074  0.28277
## Votes:Movie_length.Q -3.316e-05  3.250e-05  -1.020  0.30764
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 178.42  on 197  degrees of freedom
## Residual deviance: 149.26  on 190  degrees of freedom
## AIC: 165.26
##
## Number of Fisher Scoring iterations: 7

Generalized Linear Model - Evaluation
## Analysis of Deviance Table
##
## Model 1: Rating_c ~ Votes + Metascore_c + Movie_length
## Model 2: Rating_c ~ Votes * Metascore_c + Movie_length
## Model 3: Rating_c ~ Votes * Movie_length + Metascore_c
##   Resid. Df Resid. Dev Df Deviance
## 1       192     154.00
## 2       190     142.62  2  11.3779
## 3       190     149.26  0  -6.6375

Linear Regression
## Call:
## lm(formula = Revenue..Millions. ~ Runtime..Minutes. + Rating +
##     Votes + Metascore, data = imdb1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -170.33  -22.93   -7.15    8.09  352.53
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        8.361e+01  3.825e+01   2.186   0.0300 *
## Runtime..Minutes. -4.767e-01  2.624e-01  -1.817   0.0708 .
## Rating            -7.301e+00  7.051e+00  -1.036   0.3017
## Votes              8.125e-04  5.152e-05  15.772   <2e-16 ***
## Metascore          2.967e-01  3.192e-01   0.929   0.3538
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 58.39 on 193 degrees of freedom
## Multiple R-squared:  0.5873, Adjusted R-squared:  0.5787
## F-statistic: 68.65 on 4 and 193 DF,  p-value: < 2.2e-16

Simple Linear Model - Model 1
# Fit a linear model with an interaction between Votes and Metascore
rating_lm1 <- lm(Rating ~ Votes * Metascore + Runtime, data = IMDB_Movie_Data)
summary(rating_lm1)
##
## Call:
## lm(formula = Rating ~ Votes * Metascore + Runtime, data = IMDB_Movie_Data)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -2.55822 -0.27029  0.02195  0.35984  1.89278
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)     3.651e+00  3.018e-01  12.097  < 2e-16 ***
## Votes           1.026e-06  2.073e-06   0.495 0.621121
## Metascore       2.874e-02  3.179e-03   9.041  < 2e-16 ***
## Runtime         1.022e-02  2.619e-03   3.901 0.000132 ***
## Votes:Metascore 2.229e-08  3.185e-08   0.700 0.484874
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5953 on 193 degrees of freedom
## Multiple R-squared:  0.5567, Adjusted R-squared:  0.5475
## F-statistic:  60.6 on 4 and 193 DF,  p-value: < 2.2e-16

Simple Linear Model - Model 2
# Fit a linear model using Votes, Metascore and Runtime as inputs (no interaction)
rating_lm2 <- lm(Rating ~ Votes + Metascore + Runtime, data = IMDB_Movie_Data)
summary(rating_lm2)
##
## Call:
## lm(formula = Rating ~ Votes + Metascore + Runtime, data = IMDB_Movie_Data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -2.5595 -0.2735  0.0331  0.3563  1.9169
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.598e+00  2.915e-01  12.340  < 2e-16 ***
## Votes       2.435e-06  4.946e-07   4.923 1.82e-06 ***
## Metascore   3.018e-02  2.422e-03  12.460  < 2e-16 ***
## Runtime     9.899e-03  2.576e-03   3.843 0.000165 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5945 on 194 degrees of freedom
## Multiple R-squared:  0.5556, Adjusted R-squared:  0.5487
## F-statistic: 80.85 on 3 and 194 DF,  p-value: < 2.2e-16

Simple Linear Model - Evaluation
# Run a Type II Anova on the chosen model
Anova(rating_lm2)
## Anova Table (Type II tests)
##
## Response: Rating
##           Sum Sq  Df F value    Pr(>F)
## Votes      8.564   1  24.231 1.821e-06 ***
## Metascore 54.868   1 155.242 < 2.2e-16 ***
## Runtime    5.219   1  14.766 0.0001648 ***
## Residuals 68.567 194
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparison with Twitter
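The comparison on this slide comes down to a correlation between two count vectors, one value per movie. A minimal sketch of that calculation follows. All numbers are invented for ten hypothetical movies, the variable names are placeholders, and the slides do not say which correlation coefficient was used, so both Pearson and Spearman are shown.

```r
# Invented numbers for ten hypothetical movies (not the project's data).
imdb_votes <- c(750000, 480000, 160000, 60000, 390000,
                55000, 260000, 2500, 7000, 190000)
twitter_followers <- c(910000, 560000, 150000, 80000, 400000,
                       70000, 300000, 5000, 9000, 210000)

pearson_r  <- cor(imdb_votes, twitter_followers)                       # linear association
spearman_r <- cor(imdb_votes, twitter_followers, method = "spearman")  # rank agreement

pearson_r
spearman_r

# With only ten movies the interval around r is worth reporting alongside it.
cor.test(imdb_votes, twitter_followers)$conf.int
```

Because Pearson's coefficient is driven by the largest values, one or two very popular movies can dominate a ten-point sample; comparing it with the rank-based Spearman coefficient is a cheap sanity check.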
• Twitter follower counts are mostly consistent with Votes on IMDB. A correlation of 0.98 supports this conclusion.

Conclusion
• The ratings given by movie watchers are affected by Votes and critic ratings.
• The length or genre of a movie does not have any significant effect on movie watchers’ ratings.
• Critic ratings are not a significant predictor of movie revenue. It is the Votes that significantly affect a movie’s revenue.
• There is not much agreement between movie watcher/user ratings and critic ratings except in the case of Bad movies.
• There is a strong correlation between votes on IMDB and followers on Twitter.
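The model comparisons in the Proportional Model and Generalized Linear Model slides rest on differences in residual deviance between nested models, referred to a chi-square distribution. As a worked check, the arithmetic can be reproduced from the printed numbers alone. The variable names below are placeholders, and the inputs are the rounded values shown in the output.

```r
# Residual deviances and residual df as printed by anova(model1, model2).
dev_additive    <- 188.9996; df_additive    <- 189
dev_interaction <- 175.6201; df_interaction <- 185

lr_stat <- dev_additive - dev_interaction   # likelihood ratio statistic (printed: 13.3795)
lr_df   <- df_additive - df_interaction     # extra parameters in the larger model (printed: 4)
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
p_value                                     # about 0.0096, in line with the printed Pr(Chi)

# Same arithmetic for the nested logistic models (GLM Model 1 vs Model 2).
# The slide prints the deviance drop (11.3779) but no p-value.
glm_stat <- 154.00 - 142.62
glm_df   <- 192 - 190
glm_p    <- pchisq(glm_stat, df = glm_df, lower.tail = FALSE)
glm_p                                       # roughly 0.003
```

The Model 2 versus Model 3 row of the GLM evaluation table has 0 degrees of freedom because neither model is nested in the other, so that row does not define a likelihood ratio test. The printed AIC values (166, 158.62 and 165.26) are one way to compare those models instead.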