(6CS030)
2. Information on the Datasets
I have taken two CSV datasets and one JSON dataset related to movies from Kaggle for this coursework. The chosen datasets cover the details of the movies as well as the votes and ratings of the respective movies.
2.1 CSV datasets
The following two CSV datasets, i.e., IMDB_Details.csv and IMDB_Rating.csv, have been taken for analyzing the data using Oracle, Hadoop and Spark.
o IMDB_Details.csv:
This dataset contains the details of each movie, such as the movie title, genre, language, actors and description. It contains records for 85,855 movies.
o IMDB_Rating.csv:
This dataset contains the votes and ratings of the movies by both user and non-user voters. It also contains records for 85,855 movies.
For this coursework, I have retrieved only 2,000 rows from each CSV file for analyzing the data. Both CSV files were taken from the link below.
https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
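The 2,000-row extracts can be produced with a short script. Below is a minimal sketch in Python, assuming the full files sit beside the script; the output file name is illustrative:

```python
import csv

def take_rows(src, dest, limit=2000):
    """Copy the header plus the first `limit` data rows of a CSV file."""
    with open(src, newline="", encoding="utf-8") as fin, \
         open(dest, "w", newline="", encoding="utf-8") as fout:
        reader = csv.reader(fin)
        writer = csv.writer(fout)
        writer.writerow(next(reader))  # copy the header row unchanged
        for i, row in enumerate(reader):
            if i >= limit:
                break
            writer.writerow(row)

# Example call (file names are illustrative):
# take_rows("IMDB_Details.csv", "IMDB_Details_2000.csv")
```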
2.2 JSON dataset
I have taken movie_dataset.json as a semi-structured dataset, which is examined using some MongoDB queries.
3. Data Cleaning
Before applying queries to examine a dataset, the dataset must be cleaned so that the analysis gives more precise results. Therefore, the datasets were cleaned first and unwanted data was removed.
3.1.1 Cleaning the “IMDB_Details.csv” file
During the data cleaning process, I applied the following techniques.
Removal of unnecessary columns
This dataset contains a large number of columns. I kept only certain columns and removed the unnecessary ones.
Before removing unnecessary columns:
Applying a filter to the production_company column:
Applying a filter to the reviews_from_critics column:
Removing hidden rows and columns
Now, this dataset is ready for data analysis.
Here, "Almost Human" has no data, which indicates a missing value.
After handling the missing data, the data should appear as below.
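The cleaning steps above (dropping unnecessary columns and handling missing values) can be sketched in plain Python. The column list and the "N/A" placeholder here are assumptions for illustration, not the exact choices made in the spreadsheet:

```python
# Illustrative list of columns to keep; the real selection was done in a spreadsheet.
KEEP = ["title", "genre", "language", "actors", "description"]

def clean(rows, keep=KEEP, placeholder="N/A"):
    """Keep only the wanted columns and replace empty or missing cells."""
    cleaned = []
    for row in rows:
        cleaned.append({k: (row.get(k) or placeholder) for k in keep})
    return cleaned

# A made-up row mimicking the "Almost Human" case with missing values:
rows = [{"title": "Almost Human", "genre": "", "language": "English",
         "actors": "", "description": "", "budget": "1000"}]
print(clean(rows))  # unwanted 'budget' column dropped, blanks become 'N/A'
```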
3.2.1 Cleaning the “IMDB_Rating.csv” file
Originally, this dataset looked as below.
There were no problems in this dataset, so only the columns were renamed.
4. Data Analysis
4.1 Data Analysis Using Oracle
The IMDB_Movie_Details.csv and IMDB_Rating.csv files are used in Oracle for data analysis.
4.1.1 Import Dataset
Step I: Import IMDB_Movie_Details.csv file in Oracle
Step II: Set table name as “MOVIE_DETAILS”
Step IV: Column Definition
Step V: Finish
All five steps are repeated to import the IMDB_Rating.csv file and create another table named 'MOVIE_RATING' in Oracle.
4.1.2 Some SQL Queries
Some SQL queries were applied to examine both datasets.
Code: Count total movies
Output:
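Oracle itself is not reproduced here, but a counting query of this kind can be sketched with SQLite as a stand-in; the table name follows the report, while the sample rows are invented:

```python
import sqlite3

# In-memory SQLite database standing in for the Oracle schema.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE MOVIE_DETAILS (title TEXT, genre TEXT)")
cur.executemany("INSERT INTO MOVIE_DETAILS VALUES (?, ?)",
                [("Almost Human", "Drama"),
                 ("Green Book", "Biography"),
                 ("Bad Girl", "Drama")])

# Count the total number of movies in the table.
cur.execute("SELECT COUNT(*) FROM MOVIE_DETAILS")
print(cur.fetchone()[0])  # 3 for these invented rows
```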
4.2 Data Analysis Using MongoDB
4.2.1 Data Visualization of ‘movie_dataset.json’ file
The dataset movie_dataset.json is used for analyzing semi-structured data. Let's look at some of the data using data visualization.
4.2.2 Data analysis of ‘movie_dataset.json’ file
Now, the dataset is imported using the mongoimport command in order to analyze the data. The output confirms the creation of the 'movie' collection. Then some queries were applied to examine the JSON dataset using the mongo shell.
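Before importing, the file can be sanity-checked by parsing it with Python's json module. This sketch assumes the file holds a JSON array of documents (mongoimport also accepts newline-delimited JSON, in which case each line would be parsed separately); the inline sample stands in for movie_dataset.json:

```python
import json

# Invented sample standing in for the contents of movie_dataset.json.
sample = '[{"title": "Green Book"}, {"title": "Roma"}]'

docs = json.loads(sample)        # raises ValueError if the JSON is malformed
print(len(docs))                 # number of documents that would be imported
print(docs[0]["title"])          # spot-check one field
```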
1. Count total movies
These are the genres of the movies.
4. Show the documents based on some criteria
4.1 Search the movie "God father"
4.4 Search the word "good" using a regular expression, case-insensitively
It shows that the word 'good' is present six times in the 'Overview' field of the whole document.
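The case-insensitive search can be mirrored with Python's re module; the overview texts below are invented, but the pattern behaves like a MongoDB $regex query with the 'i' option:

```python
import re

# Invented 'Overview' texts standing in for the documents in the collection.
overviews = [
    "A good story about friendship.",
    "GOOD acting, good pacing.",
    "Nothing remarkable.",
]

# The word 'good' as a whole word, matched regardless of case.
pattern = re.compile(r"\bgood\b", re.IGNORECASE)
matches = sum(len(pattern.findall(text)) for text in overviews)
print(matches)  # 3 occurrences across these invented texts
```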
5.2 Update the title of the movie to "The Green Book" instead of "Green Book"
5.3 After updating the title of the movie to "The Green Book"
Output:
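The rename corresponds to an update with a matching filter, like MongoDB's updateOne with $set. A plain-Python stand-in, where the collection is just a list of dicts:

```python
# Toy 'collection' of documents; the titles are from the report, the rest invented.
movies = [{"title": "Green Book"}, {"title": "Roma"}]

# Update the first document whose title matches, as updateOne with $set would.
for doc in movies:
    if doc["title"] == "Green Book":
        doc["title"] = "The Green Book"
        break

print([d["title"] for d in movies])  # ['The Green Book', 'Roma']
```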
7.3 Apply the find() function to the new collection
4.3 Analysis of Data using Hadoop
The IMDB_Movies.csv file is used to analyze the data using Hadoop. Before the Hadoop process, I removed some of the unnecessary columns from the CSV file, which then looks as below.
The analysis of data using Hadoop is done on the University server via the PuTTY app. The various steps involved in the Hadoop process are mentioned below.
1. Import the file
4. Create the input directory on HDFS
8. View the result
9. Retrieve the output files from HDFS
The output or result can be copied out of Hadoop onto the local device using the command below.
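The overall shape of the Hadoop job (see CountReview.java in the appendix) can be sketched as a plain-Python map/shuffle/reduce over (key, value) pairs; the input records and the counted field are illustrative:

```python
from collections import defaultdict

# Invented input lines, e.g. "genre,reviews" records from the cleaned CSV.
records = ["Drama,10", "Comedy,5", "Drama,7"]

# Map phase: emit a (key, value) pair from each input line.
mapped = []
for line in records:
    genre, reviews = line.split(",")
    mapped.append((genre, int(reviews)))

# Shuffle phase: group all values by key, as Hadoop does between map and reduce.
grouped = defaultdict(list)
for key, value in mapped:
    grouped[key].append(value)

# Reduce phase: aggregate the values for each key.
result = {key: sum(values) for key, values in grouped.items()}
print(result)  # {'Drama': 17, 'Comedy': 5}
```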
4.4 Data Analysis Using Apache Spark
For this part, I initially created a new cluster named 'Spark' in Databricks Community Edition.
I have also uploaded two CSV files and one JSON file, as shown below.
4.4.1 Data analysis using csv files in Apache Spark
Load the two csv files into separate data frames
The two CSV files, IMDB_Movies.csv and IMDB_Rating.csv, are loaded into the HDFS file system, which can also be read by Spark. The DataFrame objects for IMDB_Movies.csv and IMDB_Rating.csv are df_Movies and df_Rating.
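Spark itself is not reproduced here, but the load step can be sketched in plain Python, reading CSV text into a list of dicts as a toy stand-in for a DataFrame; the inline data is invented:

```python
import csv
import io

# Invented CSV content standing in for IMDB_Movies.csv.
movies_csv = "Title,Genre\nBad Girl,Drama\nGreen Book,Biography\n"

def load_frame(text):
    """Read CSV text into a list of dicts, one per row (a toy 'data frame')."""
    return list(csv.DictReader(io.StringIO(text)))

df_movies = load_frame(movies_csv)
print(df_movies[0]["Title"])  # Bad Girl
```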
Display the data of title “Bad Girl” from ‘IMDB_Movies.csv’ file
Display some data of title “Bad Girl” from both SQL tables
Count total votes by adding User_Voters_Votes and Non_User_Voters_Votes
from IMDB_Rating.csv file
Show data
Show schema
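The vote total is the element-wise sum of the two columns. A minimal Python sketch with invented numbers (the column names follow the report):

```python
# Invented rows standing in for IMDB_Rating.csv.
ratings = [
    {"Title": "Bad Girl", "User_Voters_Votes": 120, "Non_User_Voters_Votes": 30},
    {"Title": "Green Book", "User_Voters_Votes": 200, "Non_User_Voters_Votes": 50},
]

# Add the two vote columns into a new Total_Votes column, row by row.
for row in ratings:
    row["Total_Votes"] = row["User_Voters_Votes"] + row["Non_User_Voters_Votes"]

print([(r["Title"], r["Total_Votes"]) for r in ratings])
```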
Show all movies having IMDB_Rating > 8
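The filter corresponds to a simple per-row comparison; sketched in Python with invented ratings:

```python
# Invented rows standing in for the joined movie data.
movies = [
    {"Title": "Green Book", "IMDB_Rating": 8.2},
    {"Title": "Bad Girl", "IMDB_Rating": 5.9},
    {"Title": "Roma", "IMDB_Rating": 8.6},
]

# Keep only the movies whose rating exceeds 8, as the Spark filter does.
top = [m["Title"] for m in movies if m["IMDB_Rating"] > 8]
print(top)  # ['Green Book', 'Roma']
```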
7. Conclusion and Recommendations
8. References
9. Appendix
CountReview.java
It contains the CountReview class, in which the CountReviewMapper is created.
This is the csvReducer class, which is used to perform the reduce step.