
Big Data

(6CS030)

A1: Resit Coursework

Student ID: [WLV ID]
Student Name: Robin KC
Cohort/Batch: 4
Submitted to: Jnaneshwar Bohara
Submitted on: <dd-mm-yy>
Table of Contents
1. Introduction to Big Data
2. Information of Datasets
2.1 CSV datasets
2.2 JSON dataset
3. Data Cleaning
3.1.1 Cleaning the “IMDB_Details.csv” file
3.1.2 Data Visualization of ‘IMDB_Movies.csv’ file
3.2.1 Cleaning the “IMDB_Rating.csv” file
3.2.2 Data Visualization of ‘IMDB_Rating.csv’ file
4. Data Analysis
4.1 Data Analysis Using Oracle
4.2 Data Analysis Using MongoDB
4.2.1 Data Visualization of ‘movie_dataset.json’ file
4.2.2 Data analysis of ‘movie_dataset.json’ file
4.3 Data Analysis Using Hadoop
4.4 Data Analysis Using Apache Spark
4.4.1 Data analysis using CSV files in Apache Spark
4.4.2 Data analysis using the JSON file in Apache Spark
5. Advantages of using Oracle, MongoDB and Hadoop for big data
6. Disadvantages of using Oracle, MongoDB and Hadoop for big data
7. Conclusion and Recommendations
8. References
9. Appendix
1. Introduction to Big Data

2. Information of Datasets
For this coursework, I have taken two CSV datasets and one JSON dataset related to
movies from Kaggle. The chosen datasets cover the details of the movies as well as
the votes and ratings of the respective movies.
2.1 CSV datasets
The following two CSV datasets, IMDB_Details.csv and IMDB_Rating.csv, have been
taken for analyzing the data using Oracle, Hadoop and Spark.
o IMDB_Details.csv:
This dataset contains movie details such as the title, genre, language, actors
and description. It contains 85,855 records of various movies.

o IMDB_Rating.csv:
This dataset contains the votes and ratings given to the movies by both user
and non-user voters. It also contains 85,855 records of various movies.
For this coursework, I have retrieved only 2,000 rows from each CSV file for
analyzing the data. Both CSV files were taken from the link below.
https://www.kaggle.com/stefanoleone992/imdb-extensive-dataset
2.2 JSON dataset
I have taken movie_dataset.json as a semi-structured dataset, which is examined
using some MongoDB queries.

3. Data Cleaning
Before applying queries to examine the datasets, they must be cleaned so that the
data analysis produces more precise results. Therefore, the datasets were cleaned
first and the unwanted data was removed.
3.1.1 Cleaning the “IMDB_Details.csv” file
During the data cleaning process, I applied the following techniques.
 Removal of unnecessary columns
This dataset contains a large number of columns. I kept only the required
columns and removed the unnecessary ones.
Before removing unnecessary columns:

After removing unnecessary columns:

 Applying a filter to various columns:

A filter is applied to remove blank values from the respective columns.
Applying the filter to the language column:

Applying the filter to the production_company column:

Applying the filter to the description column:

Applying the filter to the reviews_from_user column:

Applying the filter to the reviews_from_critics column:

 Removing hidden rows and columns

After cleaning, the dataset looks as below.

Finally, the columns were renamed for convenience.

Now, this dataset is ready for data analysis.

3.1.2 Data Visualization of ‘IMDB_Movies.csv’ file


Some of the missing data in the CSV file can be observed in the following graph.

Here, “Almost Human” has no data at all, which indicates a missing value.
After handling the missing data, the data appears as below.

3.2.1 Cleaning the “IMDB_Rating.csv” file
Originally, this dataset looks as below.

First of all, unnecessary columns were removed.

There were no other problems in this dataset, so only the columns were renamed.

Now, it is ready for the data analysis process using some queries.

3.2.2 Data Visualization of ‘IMDB_Rating.csv’ file


Here, a line graph is plotted between Title_ID and Top1000_Voters_Vote, which
indicates that there is no missing data.

4. Data Analysis
4.1 Data Analysis Using Oracle
The IMDB_Movie_Details.csv and IMDB_Rating.csv files are used in Oracle for data
analysis.
4.1.1 Import Dataset
Step I: Import IMDB_Movie_Details.csv file in Oracle

Step II: Set table name as “MOVIE_DETAILS”

Step III: Choose all columns

Step IV: Column Definition

Step V: Finish

All five steps are repeated to import the IMDB_Rating.csv file and create another
table named ‘MOVIE_RATING’ in Oracle.

4.1.2 Some SQL Queries
Some SQL queries were applied to examine both datasets.
Code: Count total movies
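Since the query itself appears only as a screenshot, a minimal sketch of it, assuming the details table was imported as MOVIE_DETAILS (as in the import steps above):

    SELECT COUNT(*) AS Total_Movies
    FROM MOVIE_DETAILS;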

Output:

The total number of movies available in IMDB_Movie_Details.csv is 1716, which is
shown by the COUNT function.

Code: Show total votes
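A minimal sketch of this query, assuming the rating table was imported as MOVIE_RATING with the column names used elsewhere in this report:

    SELECT SUM(User_Voters_Votes + Non_User_Voters_Votes) AS Total_Votes
    FROM MOVIE_RATING;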

Output:

Here, the total votes are calculated using the SUM function, in which
User_Voters_Votes and Non_User_Voters_Votes are added together and displayed
as Total_Votes.
Code: Apply ROLLUP
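A sketch of a ROLLUP query; the grouping columns Language and Genre are illustrative assumptions about the MOVIE_DETAILS table:

    SELECT Language, Genre, COUNT(*) AS Movie_Count
    FROM MOVIE_DETAILS
    GROUP BY ROLLUP (Language, Genre);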

Output:

Code: Apply CUBE
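A sketch of a CUBE query over the same assumed columns, which additionally produces subtotals for Genre alone:

    SELECT Language, Genre, COUNT(*) AS Movie_Count
    FROM MOVIE_DETAILS
    GROUP BY CUBE (Language, Genre);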

Output:

4.2 Data Analysis Using MongoDB
4.2.1 Data Visualization of ‘movie_dataset.json’ file
The dataset movie_dataset.json is used for analyzing semi-structured data. Let’s look
at some of the data using data visualization.

Here, we can see the IMDB Rating of different movies.

Here, we can see the number of votes obtained by different movies.

4.2.2 Data analysis of ‘movie_dataset.json’ file
Now, the dataset is imported using the mongoimport command in order to analyze the data.
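A sketch of the import command, assuming the file is a JSON array in the current directory (drop --jsonArray if the file is newline-delimited JSON):

    mongoimport --db mongodb_coursework --collection movie --file movie_dataset.json --jsonArray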

After successfully importing the dataset, switch to the ‘mongodb_coursework’
database and check the ‘movie’ collection.
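For example, in the mongo shell:

    use mongodb_coursework
    show collections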

This confirms the creation of the ‘movie’ collection. Some queries were then applied
to examine the JSON dataset using the mongo shell.
1. Count total movies
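A sketch of the count query:

    db.movie.count()    // or db.movie.countDocuments({}) in newer shells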

Here, 1000 movies are available in the given JSON file.


2. Apply findOne() function
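A sketch of the query:

    db.movie.findOne()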

It shows the record of a single document.


3. Using the distinct() function
The different genres of movie available in the collection are shown using the distinct() function.
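A sketch of the query, assuming the genre field in the JSON documents is named Genre:

    db.movie.distinct("Genre")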

These are the genres of the movies.

4. Show the documents based on some criteria
4.1 Search the movie “God father”

4.2 Search the movie “The Martian”

4.3 Search for the word “good” using a regular expression

This shows that the word ‘good’ is present in the ‘Overview’ field of six
documents.
4.4 Search for the word “good” using a case-insensitive regular expression

This shows that the word ‘good’ is present in the ‘Overview’ field of seven
documents. A sketch of these find() queries is shown below.
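A minimal sketch of queries 4.1 to 4.4, assuming the title field is named Series_Title (the exact field names depend on the JSON file; Overview is the field mentioned above):

    db.movie.find({ Series_Title: "God father" })      // 4.1 (title field name assumed)
    db.movie.find({ Series_Title: "The Martian" })     // 4.2
    db.movie.find({ Overview: /good/ })                // 4.3 regular expression
    db.movie.find({ Overview: /good/i })               // 4.4 case-insensitive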

5. Apply update() function


5.1 Before updating the movie “Green Book”

5.2 Update the title of the movie to “The Green Book” instead of “Green Book” (a sketch of this command follows step 5.3)

5.3 After updating the title of the movie to “The Green Book”
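A sketch of the update command used in step 5.2, again assuming the title field is named Series_Title (updateOne() is the modern equivalent of update()):

    db.movie.update(
        { Series_Title: "Green Book" },
        { $set: { Series_Title: "The Green Book" } }
    )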

6. Using the aggregate pipeline


The aggregate pipeline is generally useful for statistical data analysis, and I have
used it to show the movies whose IMDB_Rating is greater than 8.
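A minimal sketch of such a pipeline, using a $match stage on the IMDB_Rating field:

    db.movie.aggregate([
        { $match: { IMDB_Rating: { $gt: 8 } } }
    ])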

Output:

7. Create a new collection using the aggregate function


7.1 Create the new collection ‘movieProject’
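A sketch of one way to do this, assuming a $project stage (with illustrative fields) followed by $out to write the results into movieProject:

    db.movie.aggregate([
        { $project: { Series_Title: 1, IMDB_Rating: 1, _id: 0 } },   // projected fields assumed
        { $out: "movieProject" }
    ])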

7.2 Show new collection

Here, we can see movieProject as a new collection.

7.3 Apply the find() function to the new collection

These are some of the queries used for analyzing the semi-structured data.

4.3 Data Analysis Using Hadoop
The IMDB_Movies.csv file is used to analyze the data using Hadoop. Before the
Hadoop process, I removed some of the unnecessary columns from the CSV file,
which then looks as below.

The data analysis using Hadoop is done on the University server via the PuTTY app.
The various steps involved in the Hadoop process are listed below; a sketch of the
corresponding commands follows the list.
1. Import the files

The IMDB_Movies.csv and CountReview.java files are transferred to the
university desktop via the WinSCP app. CountReview.java is used to display the
total number of reviews provided by users for each movie; its source code is
included in the Appendix.
2. Compile the Java file

3. Produce the jar file

4. Create the input directory on HDFS

Here, input_movies_file is the input directory on HDFS that is used to store the
input files.
5. Put the CSV file into the input directory

The IMDB_Movies.csv file is put into the input_movies_file directory.


6. Run the MapReduce program

This command performs the map and reduce phases of the program.

7. View the result

8. Retrieve the output files from HDFS
The output can be copied out of Hadoop onto the local device using the
command below.

9. View the retrieved output file
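A sketch of the commands behind steps 2 to 9, assuming the compiled classes are packaged as CountReview.jar and the output directory is named output_movies_file (both names are illustrative):

    # 2-3. Compile the Java file against the Hadoop libraries and package it
    javac -classpath "$(hadoop classpath)" CountReview.java
    jar -cvf CountReview.jar *.class

    # 4-5. Create the input directory on HDFS and put the CSV file into it
    hdfs dfs -mkdir input_movies_file
    hdfs dfs -put IMDB_Movies.csv input_movies_file

    # 6. Run the MapReduce program
    hadoop jar CountReview.jar CountReview input_movies_file output_movies_file

    # 7. View the result
    hdfs dfs -cat output_movies_file/part-r-00000

    # 8-9. Retrieve the output files from HDFS and view them locally
    hdfs dfs -get output_movies_file
    cat output_movies_file/part-r-00000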

4.4 Data Analysis Using Apache Spark
For this part, I initially created a new cluster named ‘Spark’ in the Databricks
Community Edition.

Then, I created a notebook named ‘Spark_Coursework’ for writing some queries.

I have also uploaded the two CSV files and one JSON file as shown below.

4.4.1 Data analysis using CSV files in Apache Spark
 Load the two CSV files into separate data frames
The two CSV files, IMDB_Movies.csv and IMDB_Rating.csv, are uploaded to the
Databricks file system, from where Spark can read them. The DataFrame objects
for IMDB_Movies.csv and IMDB_Rating.csv are df_Movies and df_Rating. A
sketch of these steps in PySpark is shown after the list below.

 Show data of IMDB_Movies.csv file

 Show data of IMDB_Rating.csv file

 Display the data of title “Bad Girl” from ‘IMDB_Movies.csv’ file

 Display the voting and rating data for title_id “tt0021635” from the
‘IMDB_Rating.csv’ file

 Create a temporary view

 Join them as if they were two SQL tables

 Display some data of title “Bad Girl” from both SQL tables

 Count total votes by adding User_Voters_Votes and Non_User_Voters_Votes
from IMDB_Rating.csv file
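A minimal sketch of the steps above in PySpark, assuming the notebook language is Python, the files were uploaded under /FileStore/tables, and the movies table carries Title and Title_ID columns (the paths and column names are assumptions):

    # Load the two CSV files into separate DataFrames
    df_Movies = spark.read.csv("/FileStore/tables/IMDB_Movies.csv", header=True, inferSchema=True)
    df_Rating = spark.read.csv("/FileStore/tables/IMDB_Rating.csv", header=True, inferSchema=True)

    # Show the data of both files
    df_Movies.show(5)
    df_Rating.show(5)

    # Display the data for one title / one title_id
    df_Movies.filter(df_Movies.Title == "Bad Girl").show()
    df_Rating.filter(df_Rating.Title_ID == "tt0021635").show()

    # Create temporary views and join them as if they were two SQL tables
    df_Movies.createOrReplaceTempView("movies")
    df_Rating.createOrReplaceTempView("ratings")
    spark.sql("""
        SELECT m.Title, r.User_Voters_Votes, r.Non_User_Voters_Votes
        FROM movies m JOIN ratings r ON m.Title_ID = r.Title_ID
        WHERE m.Title = 'Bad Girl'
    """).show()

    # Count total votes by adding the two vote columns
    spark.sql("""
        SELECT SUM(User_Voters_Votes + Non_User_Voters_Votes) AS Total_Votes
        FROM ratings
    """).show()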

4.4.2 Data analysis using the JSON file in Apache Spark

Here, I have used the movie_dataset.json file to test some queries. A sketch of
these steps is shown after the list below.
 Create data frame object ‘df_movies_info’

 Show data

 Show schema

 Show all movies having IMDB_Rating > 8
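A sketch of these steps, assuming the JSON file was uploaded under /FileStore/tables and is line-delimited (add multiLine=True if it is a single JSON array):

    # Create the DataFrame object 'df_movies_info' from the JSON file
    df_movies_info = spark.read.json("/FileStore/tables/movie_dataset.json")

    # Show the data and the inferred schema
    df_movies_info.show(5)
    df_movies_info.printSchema()

    # Show all movies having IMDB_Rating > 8
    df_movies_info.filter(df_movies_info.IMDB_Rating > 8).show()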

5. Advantages of using Oracle, MongoDB and Hadoop for big data


Databases | Advantages
Oracle    |
MongoDB   |
Hadoop    |

6. Disadvantages of using Oracle, MongoDB and Hadoop for big data


Databases | Disadvantages
Oracle    |
MongoDB   |
Hadoop    |

7. Conclusion and Recommendations

8. References

9. Appendix
CountReview.java
It contains the CountReview class, in which the CountReviewMapper class is defined.

This is the csvReducer class, which is used to perform the reduce phase of the program.

This is the main class, which is used to configure and run the job.
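Since the original listings are included only as screenshots, the following is a minimal sketch of what such a program could look like; the CSV column positions, the header check and the exact class layout are assumptions, not the original source.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CountReview {

        // Mapper: emits (movie title, user review count) for each CSV line
        public static class CountReviewMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Naive split; assumes the kept columns contain no embedded commas
                String[] fields = value.toString().split(",");
                // Column positions (title = 0, reviews_from_users = 5) are assumptions
                if (fields.length > 5 && !fields[0].equals("Title")) {
                    try {
                        int reviews = Integer.parseInt(fields[5].trim());
                        context.write(new Text(fields[0]), new IntWritable(reviews));
                    } catch (NumberFormatException e) {
                        // Skip rows with a missing or non-numeric review count
                    }
                }
            }
        }

        // Reducer: sums the review counts per movie title
        public static class csvReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int total = 0;
                for (IntWritable v : values) {
                    total += v.get();
                }
                context.write(key, new IntWritable(total));
            }
        }

        // Driver: configures and runs the MapReduce job
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "count reviews");
            job.setJarByClass(CountReview.class);
            job.setMapperClass(CountReviewMapper.class);
            job.setReducerClass(csvReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }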

