
Official (Closed) - Non Sensitive

Table of Contents
1. Project Objectives
2. Data Preparation
3. Exploratory Data Analysis
4. Visualisations
5. Dashboard

DVST2020AprAssignment

1. Project Objectives
• Provide a list of primary business/research questions you are trying to answer with your
visualisation(s).
• Background on the business research question as to why the target audience or others should
care.
• The questions may evolve over the course of the project. Update this part by adding or removing
new business/research questions you consider necessary in the course of your analysis.

Based on the dataset that was available for this assignment, a key business value would be to design
one or more dashboards that help users search for suitable content based on their stated preferences.
The preferences could be specified using a number of filters. A secondary objective would be to carry
out exploratory data analysis and, via visualisations, discover any interesting insights that can be
gleaned from the dataset provided.

Table 1-1: Sample of test.csv dataset

Based on the assignment scope, meta-data and data provided, the potential business questions that
were identified included the following:

a. Pertaining to Movies/TV Shows


i. Proportion of Movies/TV Shows Available
ii. Movies/TV Shows by Date Added to Netflix
iii. Movies/TV Shows by Year of Release

DVST April 2020 - Page 2 - Last Update: 3 May 2020


SDDA

iv. Lead Time (In Years) between Year of Release and Year Added to Netflix
v. Movies by Duration (in Hours)
vi. TV Shows by Duration (in Seasons)
vii. Movies/TV Shows by Genre
viii. Movies/TV Shows by Content Rating
ix. Has Netflix been increasingly focusing on Movies/TV Shows in recent years
b. Pertaining to Geography of Production
i. Movies/TV Shows produced by Country of Production
ii. Movies/TV Shows produced by Region of Production
iii. Movies/TV Shows produced in ASEAN
iv. Top N Countries where Movies/TV Shows were produced
c. Pertaining to Directors
i. Movies/TV Shows by Director
ii. Top N Directors of Movies/TV Shows
iii. Most Prolific Directors by Movie
iv. Most Prolific Directors by Genre
v. Which Movies/TV Shows have the Most Directors involved in the Production
vi. Which Director has collaborated with the Most Actors
d. Pertaining to Actors
i. Movies/TV Shows by Actor
ii. Top N Actors of Movies/TV Shows
iii. Most Popular First Name for Actors
iv. Most Prolific Actors by Movie
v. Most Prolific Actors by Genre
vi. Which Movies/TV Shows have the Most Actors in their Cast List
vii. Which Actors have collaborated with the Most Directors

2. Data Preparation
• Describe the state of the data (e.g. is it clean, is it complete). Do you expect to do substantial
data clean-up?
• How will you prepare your data so that it is suitable for exploration and analysis?
Describe the steps involved in detail.

a. The data source was made available to us in csv format. Considerable effort was expended
to transform the data and make it accessible for analysis. After conversion from csv to xls
format, the following initial observations were made:

i. Dataset File Structure: The file contained 6,235 rows (observations) with 12 columns
(variables). According to the metadata provided with the assignment, the columns are
listed below. I also did an initial exploration of possible data issues and rectified them
where appropriate.

Column # | Column Name | Description | Remarks & Data Transformation Done
1 | show_id | Unique ID for every movie / TV show | Checked and verified that there are no duplicate show_id values. This field could be useful to link content from different data sources.
2 | type | Identifier - A movie or TV show | Either Movie or TV Show. A pie chart could be a good candidate for visualisation in aggregate. Alternatively, to look at the distribution over the years, bar/column charts would be more appropriate.
3 | title | Title of the movie / TV show | Some titles (7 records) were dates, parts of a date or time, or purely numeric. Where possible, the data were verified against the Netflix search engine or the Internet and amended accordingly.
4 | director | Director of the Movie | Some blank fields were found (1,969 records). It may not be possible to update the fields manually as the number of records is large. The column was split into individual Directors, and each Director’s name was further split into First and Last Names. This makes more data available for analysis, for example to find which Director was most prolific by their involvement in TV/movie direction.
5 | cast | Actors involved in the movie / show | Some blank fields were found (1,969 records). It may not be possible to update the fields manually as the number of records is large.
6 | country | Country where the movie / show was produced | Some blank fields were found (476 records) and some fields have more than 1 country listed. The Country field was split so that the countries can be analysed separately. Countries were further grouped into Regions. Both Soviet Union and West Germany were updated with new Country names.
7 | date_added | Date it was added on Netflix | Some blank fields were found (11 records). Where possible, the data were verified against the Netflix search engine or the Internet and amended manually.
8 | release_year | Actual release year of the movie / TV show | Converted this field from Numeric (Default) to Date.
9 | rating | TV Rating of the movie / TV show | Some blank fields were found (10 records). Where possible, the data were verified against the Netflix search engine or the Internet and amended accordingly.
10 | duration | Total Duration - in minutes (movie) or number of seasons (TV show) | Stored as strings instead of numeric values.
11 | listed_in | Genre of movie / TV show | Some fields have more than 1 genre listed. The field was split into specific genres to make more data available for analysis.
12 | description | Summarized description of movie / TV show |

Table 2-1 – Metadata provided and observations/actions done
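
The checks in the table (no duplicate show_id, number of blank fields per column) were done in Excel and Tableau. As a rough illustration only, the same checks could be sketched in Python as below. The sample rows, `duplicate_ids` and `blank_counts` are hypothetical names and data made up for this sketch; they are not part of the assignment files.

```python
from collections import Counter

# Hypothetical sample rows; the real file has 6,235 rows and 12 columns.
rows = [
    {"show_id": "1", "director": "Director A", "country": "United States"},
    {"show_id": "2", "director": "", "country": "India, United States"},
    {"show_id": "3", "director": "", "country": ""},
]


def duplicate_ids(records, key="show_id"):
    """Return the IDs that appear more than once."""
    counts = Counter(record[key] for record in records)
    return sorted(value for value, n in counts.items() if n > 1)


def blank_counts(records):
    """Count blank (empty or whitespace-only) values per column."""
    blanks = Counter()
    for record in records:
        for column, value in record.items():
            if not (value or "").strip():
                blanks[column] += 1
    return dict(blanks)


dupes = duplicate_ids(rows)
blanks = blank_counts(rows)
print(dupes, blanks)
```

An empty list of duplicates corresponds to the "no duplicate show_id" remark, and the per-column blank counts correspond to the record counts quoted in the Remarks column.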

b. Column Names in the Dataset: Some of the column names were converted to initial caps for
visual aesthetics and readability. Some of the data were transformed as per the above table to
enable more data to be made available for analysis.

c. Data Type: A scan of all the columns in the Tableau Data Sources tab was also done to ensure
that columns are of the appropriate data types (e.g. Numeric Fields, Geographic Data, Date
fields converted from String to Date)
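
The type conversion itself was done in the Tableau Data Sources tab. For readers who prefer code, a minimal sketch of the equivalent conversions is given below. It assumes that date_added is written like "September 9, 2019" and that duration is written like "90 min" or "2 Seasons"; both formats are assumptions for illustration, and month-name parsing with `%B` depends on an English locale.

```python
from datetime import date, datetime


def parse_date_added(text, fmt="%B %d, %Y"):
    """Convert a date string to a date object; blanks become None."""
    text = (text or "").strip()
    return datetime.strptime(text, fmt).date() if text else None


def parse_duration(text):
    """Split a duration string into (number, unit), e.g. (90, 'min')."""
    number, unit = text.strip().split(maxsplit=1)
    return int(number), unit.lower()


print(parse_date_added("September 9, 2019"))
print(parse_duration("90 min"), parse_duration("2 Seasons"))
```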

d. Custom Split-CrossTab-Pivoting: For those attributes (i.e. Genre, Country, Director, Cast) that
contain more than 1 categorical value in the same column, a process of split-crosstab-pivot
was done for each field so as to enlarge the pool of categorical data available for
analysis. The Tableau split function was only able to generate a maximum of 10 split columns, but
this is acceptable as I have computed that this covers over 99% of the data recoverable
from splits. This process also generates a lot of blank rows, which may be a
performance drain on Tableau. These were removed by using Excel to filter out the blank lines
(e.g. rows with a Show_ID but a blank Country). This process enriches the analysis in areas
such as Director, Actor, Country in which production was located and the type of Genre.
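
The split-crosstab-pivot step was carried out with Tableau and Excel. A minimal Python sketch of the same idea (one output row per Show_ID and value, with blanks skipped) is shown below, together with a helper that estimates how much data survives a cap on the number of split columns. The function names, the comma separator and the sample records are assumptions made for this illustration.

```python
def split_to_long(records, id_key, list_key, sep=","):
    """Turn a multi-valued column into one row per (id, value) pair."""
    long_rows = []
    for record in records:
        for item in (record.get(list_key) or "").split(sep):
            item = item.strip()
            if item:  # skip blanks so no empty rows are generated
                long_rows.append({id_key: record[id_key], list_key: item})
    return long_rows


def split_coverage(records, list_key, max_splits=10, sep=","):
    """Share of values kept when only the first max_splits values are used."""
    total = kept = 0
    for record in records:
        items = [i for i in (record.get(list_key) or "").split(sep) if i.strip()]
        total += len(items)
        kept += min(len(items), max_splits)
    return kept / total if total else 1.0


# Hypothetical sample records
sample = [
    {"show_id": "1", "country": "India, United States"},
    {"show_id": "2", "country": ""},
]
print(split_to_long(sample, "show_id", "country"))
print(split_coverage(sample, "country", max_splits=10))
```

Skipping blank items at split time avoids the blank rows that otherwise have to be filtered out afterwards.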

e. Grouping of Countries: For Countries, I also did a grouping of countries in broad regions (6 –
Africa, Americas, ASEAN, Asia, Europe and Oceania) to enrich the analysis.
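
The grouping was done with Tableau Groupings. As a sketch only, the same country-to-region lookup could be expressed as a dictionary. The mapping below is a small illustrative subset chosen for this example; the full mapping used in the project is not reproduced here, and the "Unknown" fallback is an assumption.

```python
from collections import Counter

# Illustrative subset only; ASEAN is kept separate from Asia as in the report.
REGION_BY_COUNTRY = {
    "United States": "Americas",
    "Canada": "Americas",
    "India": "Asia",
    "United Kingdom": "Europe",
    "France": "Europe",
    "Singapore": "ASEAN",
    "Thailand": "ASEAN",
    "Australia": "Oceania",
    "Nigeria": "Africa",
}


def region_of(country):
    """Look up the region of a country; unmapped names fall back to 'Unknown'."""
    return REGION_BY_COUNTRY.get(country.strip(), "Unknown")


def titles_per_region(countries):
    """Count how many entries fall into each region."""
    return Counter(region_of(country) for country in countries)


print(titles_per_region(["Singapore", "Thailand", "India", "Atlantis"]))
```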

f. Some of the names have special characters in them, so I progressively verified the names against
the Netflix search engine or the Internet and amended them manually. I may not have been able to
amend all the names with special characters.

g. Using Calculated Fields: There were some interesting data on the duration of movies which
were then categorised using calculated fields to enhance the analysis for insights.
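
The categorisation was done with Tableau Calculated Fields. A minimal sketch of a comparable binning rule is given below, assuming half-hour bands that include their lower bound (so 90 minutes falls into "1.5-2 hrs"). The band width, the boundary rule and the function name `duration_band` are assumptions for illustration, not the actual calculated field.

```python
def duration_band(minutes, width=30):
    """Return the half-hour band label for a movie duration in minutes."""
    lower = (minutes // width) * width
    upper = lower + width
    # The :g format drops trailing zeros, so 60 minutes prints as "1" and 90 as "1.5"
    return f"{lower / 60:g}-{upper / 60:g} hrs"


for minutes in (20, 75, 90):
    print(minutes, duration_band(minutes))
```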

h. The data model looked like the following:

Table 2-1: Data Model for the Project

i. With the data model, we would be able to reference:

• A movie/TV show to many Actors in the Cast List
• A movie/TV show to many Countries in the Country List
• A movie/TV show to many Directors in the Director List
• A movie/TV show to many Genres in the Genre List
• A movie/TV show to a Connected Graph Pair of Director & Actor

ii. Before the data transformation, the original data available for Cast, Country,
Director and Genre looked like the following:

Table 2-2: Example of Fields with multiple Directors (DirectorList), Actors (Castlist) and Countries
(CountryList)

Table 2-3 – Example of Genre (Listed_in)

3. Exploratory Data Analysis


• Describe the process to perform univariate and multivariate analysis on your dataset.
• Using descriptive analytics techniques (i.e. statistical analysis, correlation analysis and basic
visualisations etc), document your findings.

• What insights did you gain?

a. Based on a number of basic descriptive visualisations, quite a number of columns were suited
for categorical analysis and transformation. I found the following observations most
interesting in terms of insights for this project. Please see the Data Story entitled
“Exploratory Data Analysis - Insights I Learnt about the Netflix Dataset”.

Table 3-1: Data Story Introduction Page

i. Pertaining to Movies/TV Shows


1) The content in the dataset was not homogenous. About two-thirds (68.4%, 4,265
titles) of the content are Movies.

Table 3-2: Content Type

2) Most of the movies had a duration of 1-1.5 hrs or 1.5-2 hrs (Side note: This was
enabled through the use of Calculated Fields to categorise the numerical data)

Table 3-3: Movie by Duration (hrs)

3) This is a boxplot of the distribution of Movies in the dataset:

Table 3-4: Boxplot on Movies

4) Over the years, there has been a steady increase in the proportion of movies added
to Netflix

Table 3-5: Stacked Bar of Movie-TV Shows
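
The stacked bar was built in Tableau. As a rough sketch of the underlying calculation, the proportion of Movies per year added could be computed as below. The (year, type) pairs are hypothetical sample data and `movie_share_by_year` is a name introduced only for this example.

```python
from collections import Counter, defaultdict


def movie_share_by_year(records):
    """Return {year: share of titles added that year which are Movies}."""
    counts = defaultdict(Counter)
    for year, kind in records:
        counts[year][kind] += 1
    return {
        year: by_type["Movie"] / sum(by_type.values())
        for year, by_type in sorted(counts.items())
    }


# Hypothetical (year added, type) pairs
records = [
    (2018, "Movie"), (2018, "TV Show"),
    (2019, "Movie"), (2019, "Movie"), (2019, "Movie"), (2019, "TV Show"),
]
shares = movie_share_by_year(records)
print(shares)

# Consistency check on the overall figure quoted in the report
overall_movie_pct = round(4265 / 6235 * 100, 1)
print(overall_movie_pct)
```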

5) In absolute numbers, more Movies were also added to the offering each year in
recent years. In addition, some months are “hotter” than others in terms of
additions. You can use the filter to focus on Movies, TV Shows or both.

Table 3-6: Movie/TV Show by Year of Release

6) About one-third (2,174 titles, 34.9%) of the titles were added to Netflix with a lead
time of 1 to 3 years.

Table 3-7: Lead Time (Years) between Year of Release and the Year the content was
added to Netflix
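
The lead time is the year a title was added to Netflix minus its year of release. A minimal sketch of that calculation is given below, assuming the 1-to-3-year band includes both ends (the exact boundary rule is not stated in the report). The sample (release year, year added) pairs are hypothetical.

```python
def lead_time_years(release_year, year_added):
    """Years between the release of a title and its addition to Netflix."""
    return year_added - release_year


# Hypothetical (release_year, year_added) pairs
titles = [(2016, 2019), (2018, 2018), (2010, 2017)]
lead_times = [lead_time_years(released, added) for released, added in titles]

# Share of titles with a lead time of 1 to 3 years (inclusive, by assumption)
share = sum(1 for lt in lead_times if 1 <= lt <= 3) / len(lead_times)
print(lead_times, share)

# Consistency check on the figure quoted in the report
reported_pct = round(2174 / 6235 * 100, 1)
print(reported_pct)
```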

ii. Pertaining to Geography of Production


1) The top 5 countries by number of productions located there are the United States, India,
United Kingdom, Canada and France. (Side note: The regional view was enabled through the
use of Groupings to group countries into Regions. The Americas can be further divided
into North, South and Central America, but it is advisable to keep the color range
to between 3 and 5 entities for visual impact.)

Table 3-8: Regions with the Most Production Located In

2) Interestingly, ASEAN as a whole had 206 titles in the dataset. Most of the production
(in descending order) was done in Thailand, Indonesia, Philippines, Singapore and
Malaysia. The ranking of the countries is different if you toggle the (content) Type
filter.

Table 3-9: Production located in ASEAN Countries

3) Singapore had 30 productions located in the country between 2009 and 2018

Table 3-10: Details Table of Singapore-located Productions

iii. Pertaining to Directors


1) Steven Spielberg, Martin Scorsese and Quentin Tarantino were ranked among the
Top 5 Directors who had collaborated with the largest number of Actors in their
productions.

Table 3-11: Directors who Collaborated with the Most Actors
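
The ranking came from the Director and Actor tables produced by the data preparation. As an illustration of the counting logic only, the sketch below counts distinct Actors per Director across titles. Counting distinct names (rather than appearances) is an assumption, and the titles and names are hypothetical placeholders.

```python
from collections import defaultdict


def actors_per_director(titles):
    """Rank directors by the number of distinct actors they have worked with."""
    collaborators = defaultdict(set)
    for title in titles:
        for director in title["directors"]:
            collaborators[director].update(title["cast"])
    # Sort by count (descending), then by name so ties are ordered predictably
    return sorted(
        ((director, len(actors)) for director, actors in collaborators.items()),
        key=lambda pair: (-pair[1], pair[0]),
    )


# Hypothetical titles
titles = [
    {"directors": ["Director A"], "cast": ["Actor 1", "Actor 2"]},
    {"directors": ["Director A", "Director B"], "cast": ["Actor 2", "Actor 3"]},
]
ranking = actors_per_director(titles)
print(ranking)
```

Swapping the roles of the two lists gives the mirror-image ranking of Actors by the number of Directors they have worked with.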

iv. Pertaining to Actors

1) The 5 most popular Actors’ first names were Michael, David, John, Paul and James.
You can toggle the Type filter to see the most popular Actors’ first names for
Movies and TV Shows. You can also toggle to see the Top N (i.e. 5, 10, 15 etc, in
increments of 5) most popular Actors’ first names.

Table 3-12: Most Popular Actors’ First Name
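
The first names came from splitting each Actor's name during data preparation. A minimal sketch of a Top N count is shown below. It assumes the first whitespace-separated token is the first name and that each distinct Actor is counted once; both are assumptions, and the names are hypothetical.

```python
from collections import Counter


def top_first_names(actors, n=5):
    """Return the n most common first names among distinct actor names."""
    distinct = {name.strip() for name in actors if name.strip()}
    counts = Counter(name.split()[0] for name in distinct)
    # Sort by count (descending), then alphabetically so ties are stable
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))[:n]


# Hypothetical actor names; the repeated entry is counted once
actors = ["Michael A", "Michael B", "David C", "Michael A"]
print(top_first_names(actors, n=2))
```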

2) James Franco was the actor who collaborated with the most Directors.

Table 3-13: Actors who Collaborated with the Most Directors

Table 3-14: Connected Graph between Directors and Actors


You can use the filter to visualise the collaboration between Directors and their Actors in a
production. Currently, this works for productions located in the United States only.

The feature is inspired by this YouTube clip: https://www.youtube.com/watch?v=lZS-BhAYPT8&t=114s.
After the ETL processing of Directors and Actors, I was able to generate the
relationships, i.e. Director => Actor(s). Besides being able to find out which Actors, or how many
Actors, a Director has worked with, I used the Connected Graph to visualise this relationship.
One drawback is the number of connections to generate and the coordinates to compute in order to
place the nodes. For the moment, the focus is on Directors and Actors for Productions located in the
United States.
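
To make the two costs mentioned above concrete (generating the connections and computing node coordinates), a small sketch is given below. It builds Director => Actor pairs and places the nodes evenly on a circle. The circular layout is only one simple option chosen for illustration and is not necessarily the layout used in the Tableau workbook; the names and sample data are hypothetical.

```python
import math


def build_edges(titles):
    """Return the distinct (director, actor) pairs across all titles."""
    edges = set()
    for title in titles:
        for director in title["directors"]:
            for actor in title["cast"]:
                edges.add((director, actor))
    return sorted(edges)


def circular_layout(nodes, radius=1.0):
    """Place each distinct node evenly on a circle and return {node: (x, y)}."""
    nodes = sorted(set(nodes))
    step = 2 * math.pi / len(nodes) if nodes else 0.0
    return {
        node: (radius * math.cos(i * step), radius * math.sin(i * step))
        for i, node in enumerate(nodes)
    }


# Hypothetical title
titles = [{"directors": ["Director A"], "cast": ["Actor 1", "Actor 2"]}]
edges = build_edges(titles)
layout = circular_layout([name for edge in edges for name in edge])
print(edges)
print(layout)
```

The number of edges grows with directors times cast size per title, which is why restricting the view to one country keeps the graph manageable.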

4. Visualisations
• Identify the core findings and insights that help to answer the business/research questions
identified earlier. Create the visualisations accordingly.
• What are the different visualisations you have considered? Justify the design decisions you
made using visualisation and design principles.
• Describe each visualisation by highlighting the business/research questions it answers.
• Explain how to interpret the visualisations in order to answer the business/research questions
(e.g. interactive elements).

a. When designing the visualisations, the 4 Ws were considered:

b. Who:

i. The visualisation is targeted at the end-user who might be interested in finding out what
content is available for viewing based on a number of preferences (enabled by
Filters). The preferences could take many forms, e.g. Title, Year of Release, Genre,
Director, Actor etc.
ii. The dashboard would likely take the shape of either an exploratory or an information
dashboard.

c. Where:

i. The interface can be in the form of desktop, tablet or mobile. However, based on the
questions derived, I decided to optimise the design and visual real estate for
desktop dimensions.

d. Why:

i. As the dataset has many dimensions for analysis, it could be quite challenging for an
interested user to look for interesting content to view. It is hoped that a
visualisation dashboard would help facilitate this search for interesting content to
watch.

e. What:

Business Questions

Content: Analysis on the content
Geography: Analysis on the Geography where the production was located
Movie Direction: Analysis on the Directors who directed the productions
Cast: Analysis on the Actors in the production

Table 4-1: Overall plan in grouping the business questions to shape the
analysis and design of the Dashboard(s)

i. Based on the business questions collected and generated, a grouping of 4 key categories
is possible to focus the questions in facilitating the analysis and design of the dashboards.

Table 4-2: Mapping of Visualisation done to Business Questions listed

5. Dashboard
• From your library of visualisations created earlier, identify suitable ones to create a dashboard.
Recall that dashboards should incorporate only visualisations related to one main topic.
• Depending on your business/research questions, create multiple dashboards to aid you in
performing your video recorded data story telling presentation subsequently.

Using the Netflix portal as inspiration, the following dashboards were designed:
a) Search – allowing a user to find suitable content based on his/her preferences. Netflix.com
currently has a search field which allows you to input a Title, People (Actors or
Directors) or Genre you wish to search for.
b) Movies – interesting visual filters related to Movies. Netflix currently has a Movies
page with a listing of recommendations.
c) TV Shows – interesting visual filters related to TV Shows. Netflix currently has a TV
Shows page with a listing of recommendations.

Table 5-1: Netflix Search


d) An example of the use case is as follows:
1) Click on Genre – Sci-Fi & Fantasy
2) Click on Actor – Keanu Reeves
3) You can find a list of Sci-Fi & Fantasy Movies starring Keanu Reeves
4) You can click on any of the Movies to reveal URL links to IMDB for more details on the
Movie, Director or Actor
5) On the IMDB search page, click on the movie title of interest

Table 5-2: Netflix Search for Keanu Reeves
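
The use case above is implemented with Tableau filter actions. As a sketch of the filtering logic only, the snippet below combines a Genre filter and an Actor filter. The catalogue entries are hypothetical placeholders (including the titles and the "Other Genre" label), and `search` is a name made up for this example.

```python
def search(titles, genre=None, actor=None):
    """Return titles matching every filter that is set (None means no filter)."""
    return [
        title["title"]
        for title in titles
        if (genre is None or genre in title["genres"])
        and (actor is None or actor in title["cast"])
    ]


# Hypothetical catalogue entries
catalogue = [
    {"title": "Title A", "genres": ["Sci-Fi & Fantasy"], "cast": ["Keanu Reeves"]},
    {"title": "Title B", "genres": ["Other Genre"], "cast": ["Keanu Reeves"]},
    {"title": "Title C", "genres": ["Sci-Fi & Fantasy"], "cast": ["Actor X"]},
]
print(search(catalogue, genre="Sci-Fi & Fantasy", actor="Keanu Reeves"))
```

Each additional filter narrows the result, which mirrors clicking Genre first and Actor second in the dashboard.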

Table 5-3: IMDB Search Page

Table 5-4: Movies

Table 5-5: TV Shows

Filters were added to show different categories and the Top N items, together with links to Wikipedia
(Countries and Regions) and IMDB (Titles, Directors, Actors) for more information.
