
Big Data

(6CS030)

A1: Resit Coursework

Student ID: [WLV ID]
Student Name: Robin KC
Cohort/Batch: 4
Submitted to: Jnaneshwar Bohara
Submitted on: <dd-mm-yy>
Table of Contents
1. Introduction to Big Data
2. Information of Datasets
2.1 CSV datasets
2.2 JSON dataset
3. Data Cleaning
3.1.1 Cleaning the “IMDB_Details.csv” file
3.1.2 Data Visualization of ‘IMDB_Movies.csv’ file
3.2.1 Cleaning the “IMDB_Rating.csv” file
3.2.2 Data Visualization of ‘IMDB_Rating.csv’ file
4. Data Analysis
4.1 Data Analysis Using Oracle
4.2 Data Analysis Using MongoDB
4.2.1 Data Visualization of ‘movie_dataset.json’ file
4.2.2 Data analysis of ‘movie_dataset.json’ file
4.3 Data Analysis Using Hadoop
4.4 Data Analysis Using Apache Spark
4.4.1 Data analysis using CSV files in Apache Spark
4.4.2 Data analysis using the JSON file in Apache Spark
5. Advantages of using Oracle, MongoDB and Hadoop for big data
6. Disadvantages of using Oracle, MongoDB and Hadoop for big data
7. Conclusion and Recommendations
8. References
9. Appendix
1. Introduction to Big Data

2. Information of Datasets
For this coursework, I have taken two CSV datasets and one JSON dataset related to
movies from Kaggle. The chosen datasets cover the details of the movies as well as
the votes and ratings of the respective movies.
2.1 CSV datasets
The following two CSV datasets, IMDB_Details.csv and IMDB_Rating.csv, have been
taken for analyzing the data using Oracle, Hadoop and Spark.
o IMDB_Details.csv:
This dataset contains movie details such as the title, genre, language, actors
and description. It contains 85,855 records of various movies.

o IMDB_Rating.csv:
This dataset contains the votes and ratings given to the movies by both user
and non-user voters. It also contains 85,855 records of various movies.
For this coursework, I have retrieved only 2,000 rows from each CSV file for
analyzing the data. Both CSV files were taken from the link below.
https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
2.2 JSON dataset
I have taken movie_dataset.json as a semi-structured dataset, which is examined
using some MongoDB queries.

3. Data Cleaning
Before applying queries to examine the datasets, they must be cleaned so that the
data analysis produces more precise results. Therefore, the datasets were cleaned
first and the unwanted data was removed.
3.1.1 Cleaning the “IMDB_Details.csv” file
During the data cleaning process, I applied the following techniques.
 Removal of unnecessary columns
This dataset contains a large number of columns. I kept only the required
columns and removed the unnecessary ones.
Before removing unnecessary columns:

After removing unnecessary columns:

 Applying a filter to various columns:

A filter is applied to remove blank values from the respective columns.
Applying the filter to the language column:

Applying the filter to the production_company column:

Applying the filter to the description column:

Applying the filter to the reviews_from_user column:

Applying the filter to the reviews_from_critics column:

 Removing hidden rows and columns

After cleaning, the dataset looks as below.

Finally, the columns were renamed for convenience.

Now, this dataset is ready for data analysis.

3.1.2 Data Visualization of ‘IMDB_Movies.csv’ file


Some of the missing data in the CSV file can be observed in the following graph.

Here, “Almost Human” has no data at all, which indicates a missing value.
After handling the missing data, the data appears as below.

3.2.1 Cleaning the “IMDB_Rating.csv” file
Originally, this dataset looks as below.

First of all, unnecessary columns were removed.

There were no other problems in this dataset, so only the columns were renamed.

Now, it is ready for the data analysis process using some queries.

3.2.2 Data Visualization of ‘IMDB_Rating.csv’ file


Here, a line graph is plotted between Title_ID and Top1000_Voters_Vote, which
indicates that there is no missing data.

4. Data Analysis
4.1 Data Analysis Using Oracle
The IMDB_Movie_Details.csv and IMDB_Rating.csv files are used in Oracle for data
analysis.
4.1.1 Import Dataset
Step I: Import IMDB_Movie_Details.csv file in Oracle

Step II: Set table name as “MOVIE_DETAILS”

Step III: Choose all columns

Step IV: Column Definition

Step V: Finish

All five steps are repeated to import the IMDB_Rating.csv file and create another
table named ‘MOVIE_RATING’ in Oracle.

4.1.2 Some SQL Queries
Some SQL queries were applied to examine both datasets.
Code: Count total movies
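Since the query itself appears only as a screenshot, a minimal sketch of it, assuming the details table was imported as MOVIE_DETAILS (as in the import steps above):

    SELECT COUNT(*) AS Total_Movies
    FROM MOVIE_DETAILS;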

Output:

The total number of movies available in IMDB_Movie_Details.csv is 1716, which is
shown by the COUNT function.

Code: Show total votes
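A minimal sketch of this query, assuming the rating table was imported as MOVIE_RATING with the column names used elsewhere in this report:

    SELECT SUM(User_Voters_Votes + Non_User_Voters_Votes) AS Total_Votes
    FROM MOVIE_RATING;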

Output:

Here, the total votes are calculated using the SUM function, in which
User_Voters_Votes and Non_User_Voters_Votes are added together and displayed
as Total_Votes.
Code: Apply ROLLUP
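A sketch of a ROLLUP query; the grouping columns Language and Genre are illustrative assumptions about the MOVIE_DETAILS table:

    SELECT Language, Genre, COUNT(*) AS Movie_Count
    FROM MOVIE_DETAILS
    GROUP BY ROLLUP (Language, Genre);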

Output:

Code: Apply CUBE
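A sketch of a CUBE query over the same assumed columns, which additionally produces subtotals for Genre alone:

    SELECT Language, Genre, COUNT(*) AS Movie_Count
    FROM MOVIE_DETAILS
    GROUP BY CUBE (Language, Genre);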

Output:

4.2 Data Analysis Using MongoDB
4.2.1 Data Visualization of ‘movie_dataset.json’ file
The dataset movie_dataset.json is used for analyzing semi-structured data. Let’s look
at some of the data using data visualization.

Here, we can see the IMDB Rating of different movies.

Here, we can see the number of votes obtained by different movies.

4.2.2 Data analysis of ‘movie_dataset.json’ file
Now, the dataset is imported using the mongoimport command in order to analyze the data.
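A sketch of the import command, assuming the file is a JSON array in the current directory (drop --jsonArray if the file is newline-delimited JSON):

    mongoimport --db mongodb_coursework --collection movie --file movie_dataset.json --jsonArray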

After successfully importing the dataset, switch to the ‘mongodb_coursework’
database and check the ‘movie’ collection.
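For example, in the mongo shell:

    use mongodb_coursework
    show collections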

This confirms the creation of the ‘movie’ collection. Some queries were then applied
to examine the JSON dataset using the mongo shell.
1. Count total movies
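A sketch of the count query:

    db.movie.count()    // or db.movie.countDocuments({}) in newer shells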

Here, 1000 movies are available in the given JSON file.


2. Apply findOne() function
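A sketch of the query:

    db.movie.findOne()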

It shows the record of a single document.


3. Using the distinct() function
The different genres of movie available in the collection are shown using the distinct() function.
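A sketch of the query, assuming the genre field in the JSON documents is named Genre:

    db.movie.distinct("Genre")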

These are the genres of the movies.

4. Show the documents based on some criteria
4.1 Search the movie “God father”

4.2 Search the movie “The Martian”

4.3 Search for the word “good” using a regular expression

This shows that the word ‘good’ is present in the ‘Overview’ field of six
documents.
4.4 Search for the word “good” using a case-insensitive regular expression

This shows that the word ‘good’ is present in the ‘Overview’ field of seven
documents. A sketch of these find() queries is shown below.
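A minimal sketch of queries 4.1 to 4.4, assuming the title field is named Series_Title (the exact field names depend on the JSON file; Overview is the field mentioned above):

    db.movie.find({ Series_Title: "God father" })      // 4.1 (title field name assumed)
    db.movie.find({ Series_Title: "The Martian" })     // 4.2
    db.movie.find({ Overview: /good/ })                // 4.3 regular expression
    db.movie.find({ Overview: /good/i })               // 4.4 case-insensitive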

5. Apply update() function


5.1 Before updating the movie “Green Book”

5.2 Update the title of the movie to “The Green Book” instead of “Green Book” (a sketch of this command follows step 5.3)

5.3 After updating the title of the movie to “The Green Book”
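A sketch of the update command used in step 5.2, again assuming the title field is named Series_Title (updateOne() is the modern equivalent of update()):

    db.movie.update(
        { Series_Title: "Green Book" },
        { $set: { Series_Title: "The Green Book" } }
    )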

6. Using the aggregate pipeline


The aggregate pipeline is generally useful for statistical data analysis, and I have
used it to show the movies whose IMDB_Rating is greater than 8.
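A minimal sketch of such a pipeline, using a $match stage on the IMDB_Rating field:

    db.movie.aggregate([
        { $match: { IMDB_Rating: { $gt: 8 } } }
    ])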

Output:

7. Create a new collection using the aggregate function


7.1 Create the new collection ‘movieProject’
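A sketch of one way to do this, assuming a $project stage (with illustrative fields) followed by $out to write the results into movieProject:

    db.movie.aggregate([
        { $project: { Series_Title: 1, IMDB_Rating: 1, _id: 0 } },   // projected fields assumed
        { $out: "movieProject" }
    ])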

7.2 Show new collection

Here, we can see movieProject as a new collection.

7.3 Apply the find() function to the new collection

These are some of the queries used for analyzing the semi-structured data.

4.3 Data Analysis Using Hadoop
The IMDB_Movies.csv file is used to analyze the data using Hadoop. Before the
Hadoop process, I removed some of the unnecessary columns from the CSV file,
which then looks as below.

The data analysis using Hadoop is done on the University server via the PuTTY app.
The various steps involved in the Hadoop process are listed below; a sketch of the
corresponding commands follows the list.
1. Import the files

The IMDB_Movies.csv and CountReview.java files are transferred to the
university desktop via the WinSCP app. CountReview.java is used to display the
total number of reviews provided by users for each movie; its source code is
included in the Appendix.
2. Compile the Java file

3. Produce the jar file

4. Create the input directory on HDFS

Here, input_movies_file is the input directory on HDFS that is used to store the
input files.
5. Put the CSV file into the input directory

The IMDB_Movies.csv file is put into the input_movies_file directory.


6. Run the MapReduce program

This command performs the map and reduce phases of the program.

7. View the result

8. Retrieve the output files from HDFS
The output can be copied out of Hadoop onto the local device using the
command below.

9. View the retrieved output file
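A sketch of the commands behind steps 2 to 9, assuming the compiled classes are packaged as CountReview.jar and the output directory is named output_movies_file (both names are illustrative):

    # 2-3. Compile the Java file against the Hadoop libraries and package it
    javac -classpath "$(hadoop classpath)" CountReview.java
    jar -cvf CountReview.jar *.class

    # 4-5. Create the input directory on HDFS and put the CSV file into it
    hdfs dfs -mkdir input_movies_file
    hdfs dfs -put IMDB_Movies.csv input_movies_file

    # 6. Run the MapReduce program
    hadoop jar CountReview.jar CountReview input_movies_file output_movies_file

    # 7. View the result
    hdfs dfs -cat output_movies_file/part-r-00000

    # 8-9. Retrieve the output files from HDFS and view them locally
    hdfs dfs -get output_movies_file
    cat output_movies_file/part-r-00000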

4.4 Data Analysis Using Apache Spark
For this part, I initially created a new cluster named ‘Spark’ in the Databricks
Community Edition.

Then, I created a notebook named ‘Spark_Coursework’ for writing some queries.

I have also uploaded the two CSV files and one JSON file as shown below.

4.4.1 Data analysis using CSV files in Apache Spark
 Load the two CSV files into separate data frames
The two CSV files, IMDB_Movies.csv and IMDB_Rating.csv, are uploaded to the
Databricks file system, from where Spark can read them. The DataFrame objects
for IMDB_Movies.csv and IMDB_Rating.csv are df_Movies and df_Rating. A
sketch of these steps in PySpark is shown after the list below.

 Show data of IMDB_Movies.csv file

 Show data of IMDB_Rating.csv file

 Display the data of title “Bad Girl” from ‘IMDB_Movies.csv’ file

 Display the voting and rating data for title_id “tt0021635” from the
‘IMDB_Rating.csv’ file

 Create a temporary view

 Join them as if they were two SQL tables

 Display some data of title “Bad Girl” from both SQL tables

 Count total votes by adding User_Voters_Votes and Non_User_Voters_Votes
from IMDB_Rating.csv file
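A minimal sketch of the steps above in PySpark, assuming the notebook language is Python, the files were uploaded under /FileStore/tables, and the movies table carries Title and Title_ID columns (the paths and column names are assumptions):

    # Load the two CSV files into separate DataFrames
    df_Movies = spark.read.csv("/FileStore/tables/IMDB_Movies.csv", header=True, inferSchema=True)
    df_Rating = spark.read.csv("/FileStore/tables/IMDB_Rating.csv", header=True, inferSchema=True)

    # Show the data of both files
    df_Movies.show(5)
    df_Rating.show(5)

    # Display the data for one title / one title_id
    df_Movies.filter(df_Movies.Title == "Bad Girl").show()
    df_Rating.filter(df_Rating.Title_ID == "tt0021635").show()

    # Create temporary views and join them as if they were two SQL tables
    df_Movies.createOrReplaceTempView("movies")
    df_Rating.createOrReplaceTempView("ratings")
    spark.sql("""
        SELECT m.Title, r.User_Voters_Votes, r.Non_User_Voters_Votes
        FROM movies m JOIN ratings r ON m.Title_ID = r.Title_ID
        WHERE m.Title = 'Bad Girl'
    """).show()

    # Count total votes by adding the two vote columns
    spark.sql("""
        SELECT SUM(User_Voters_Votes + Non_User_Voters_Votes) AS Total_Votes
        FROM ratings
    """).show()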

4.4.2 Data analysis using the JSON file in Apache Spark

Here, I have used the movie_dataset.json file to test some queries. A sketch of
these steps is shown after the list below.
 Create data frame object ‘df_movies_info’

 Show data

 Show schema

 Show all movies having IMDB_Rating > 8
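A sketch of these steps, assuming the JSON file was uploaded under /FileStore/tables and is line-delimited (add multiLine=True if it is a single JSON array):

    # Create the DataFrame object 'df_movies_info' from the JSON file
    df_movies_info = spark.read.json("/FileStore/tables/movie_dataset.json")

    # Show the data and the inferred schema
    df_movies_info.show(5)
    df_movies_info.printSchema()

    # Show all movies having IMDB_Rating > 8
    df_movies_info.filter(df_movies_info.IMDB_Rating > 8).show()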

5. Advantages of using Oracle, MongoDB and Hadoop for big data


Databases | Advantages
Oracle    |
MongoDB   |
Hadoop    |

6. Disadvantages of using Oracle, MongoDB and Hadoop for big data


Databases | Disadvantages
Oracle    |
MongoDB   |
Hadoop    |

7. Conclusion and Recommendations

8. References

9. Appendix
CountReview.java
It contains the CountReview class, in which the CountReviewMapper class is defined.

This is the csvReducer class, which is used to perform the reduce phase of the program.

This is the main class, which is used to configure and run the job.
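Since the original listings are included only as screenshots, the following is a minimal sketch of what such a program could look like; the CSV column positions, the header check and the exact class layout are assumptions, not the original source.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CountReview {

        // Mapper: emits (movie title, user review count) for each CSV line
        public static class CountReviewMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Naive split; assumes the kept columns contain no embedded commas
                String[] fields = value.toString().split(",");
                // Column positions (title = 0, reviews_from_users = 5) are assumptions
                if (fields.length > 5 && !fields[0].equals("Title")) {
                    try {
                        int reviews = Integer.parseInt(fields[5].trim());
                        context.write(new Text(fields[0]), new IntWritable(reviews));
                    } catch (NumberFormatException e) {
                        // Skip rows with a missing or non-numeric review count
                    }
                }
            }
        }

        // Reducer: sums the review counts per movie title
        public static class csvReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable v : values) {
                    total += v.get();
                }
                context.write(key, new IntWritable(total));
            }
        }

        // Driver: configures and runs the MapReduce job
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "count reviews");
            job.setJarByClass(CountReview.class);
            job.setMapperClass(CountReviewMapper.class);
            job.setReducerClass(csvReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }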

