
Capstone Project-4 Submission

NETFLIX MOVIES & TV SHOWS


CLUSTERING

Suvendu Dey- suvendudey271@gmail.com


Abhishek Kumar- abhishekkumar@gmail.com
Ranjit Biswal- ranjit.python@gmail.com

GitHub Links:

Suvendu Dey: https://github.com/devsuvendu/Netflix-Movies-and-TV-Shows-Clustering.git

Abhishek Kumar: https://github.com/Developer-AD/NETFLIX-MOVIES-AND-TV-SHOWS-CLUSTERING

Ranjit Biswal: https://github.com/Ranjitcnb/NETFLIX-MOVIES-AND-TV-SHOWS-CLUSTERING

Abstract

Netflix is a company that manages a large collection of TV shows and movies, streaming it online anytime. This business is profitable because users make a monthly payment to access the platform. However, customers can cancel their subscriptions at any time. Therefore, the company must keep users hooked on the platform and not lose their interest. This is where recommendation systems start to play an important role; providing valuable suggestions to users is essential.

Problem Statement

This dataset consists of TV shows and movies available on Netflix as of 2019. The dataset was collected from Flixable, a third-party Netflix search engine.

In 2018, they released an interesting report which shows that the number of TV shows on Netflix has nearly tripled since 2010. The streaming service's number of movies has decreased by more than 2,000 titles since 2010, while its number of TV shows has nearly tripled.

In this project, you are required to do:
1. Exploratory Data Analysis
2. Understanding what type of content is available in different countries
3. Is Netflix increasingly focused on TV rather than movies in recent years?
4. Clustering similar content by matching text-based features.

Introduction

Netflix's recommendation system helps increase its popularity among service providers: it helps increase the number of items sold, offers a diverse selection of items, and increases user satisfaction as well as user loyalty to the company, and it is very helpful in getting a better understanding of what the user wants. It is then easier to get the user to make better decisions from a wide variety of movie products. With over 139 million paid subscribers (a total viewer pool of 300 million) across 190 countries, 15,400 titles across its regional libraries, and 112 Emmy Award nominations in 2018, Netflix is the world's leading Internet television network and the most valued streaming service in the world. The amazing digital success story of Netflix is incomplete without a mention of its recommender systems, which focus on personalization. There are several methods to create a list of recommendations according to your preferences: you can use Collaborative Filtering or Content-based Filtering for recommendation.

Objective

Netflix Recommender recommends Netflix movies and TV shows based on a user's favorite movie or TV show. It uses a Natural Language Processing (NLP) model and a K-Means Clustering model to make these recommendations. These models use information about movies and TV shows, such as their plot descriptions and genres, to make suggestions. The motivation behind this project is to develop a deeper understanding of recommender systems and to create a model that can perform clustering on comparable material by matching text-based attributes, and specifically to think about how Netflix creates algorithms to tailor content based on user interests and behavior.

Data Description

The dataset provided contains 7787 rows and 12 columns.

Attribute information (the columns in the dataset):
● Show id: Unique identifier of the record in the dataset
● Type: Whether it is a TV show or a movie
● Title: Title of the show or movie
● Director: Director of the TV show or movie
● Cast: The cast of the movie or TV show
● Country: The list of countries in which a show/movie is released or watched
● Date added: The date on which the content was onboarded on the Netflix platform
● Release year: Year of release of the show/movie
● Rating: The rating informs about the suitability of the content for a specific age group
● Duration: Specified in minutes for movies and in number of seasons for TV shows
● Listed in: This column specifies the category/genre of the content
● Description: A short summary of the storyline of the content

Approach

As the problem statement says, to understand what type of content is available in different countries and whether Netflix is increasingly focused on TV rather than movies in recent years, we have to cluster similar content by matching text-based features. For that we used Affinity Propagation, Agglomerative Clustering, and K-means Clustering.

Tools Used

The whole project was done using Python, in Google Colaboratory. The following libraries were used for analyzing the data, visualizing it, and building the clustering model:
● Pandas: Extensively used to load and wrangle the dataset.
● Matplotlib: Used for visualization.
● Seaborn: Used for visualization.
● Nltk: A toolkit built for working with NLP.
● Datetime: Used for analyzing the date variable.
● Warnings: For filtering and ignoring the warnings.
● NumPy: For some math operations in predictions.
● Wordcloud: Visual representation of text data.
● Sklearn: For the purpose of analysis and prediction.

Table1. The above table shows the dataset in the form of a Pandas DataFrame.

Steps Involved

The following steps are involved in the project:

1. Handling missing values:

We replace blank countries with the mode (most common) country. It is better to keep the director column, because it can be fascinating to look at a specific filmmaker's movies; as a result, we substitute its null values with the word 'unknown' for further analysis. There are very few null entries in the date_added field, so we delete those rows.

2. Duplicate values treatment:

Duplicate values do not contribute anything to the accuracy of results. Our dataset does not contain any duplicate values.
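A minimal pandas sketch of these two treatment steps. The column names follow the dataset description above, and the tiny inline frame is a stand-in for the real 7787-row file:

```python
import pandas as pd

# Tiny stand-in for the real dataset (the actual file has 7787 rows, 12 columns).
df = pd.DataFrame({
    "title": ["A", "B", "C", "D", "D"],
    "country": ["United States", None, "India", "United States", "United States"],
    "director": ["X", None, "Y", "Z", "Z"],
    "date_added": ["2019-01-01", "2019-02-01", None, "2019-03-01", "2019-03-01"],
})

# 1. Handling missing values: blank countries -> the mode (most common) country,
#    missing directors -> 'unknown', and drop the few rows with no date_added.
df["country"] = df["country"].fillna(df["country"].mode()[0])
df["director"] = df["director"].fillna("unknown")
df = df.dropna(subset=["date_added"])

# 2. Duplicate values treatment: duplicates add nothing, so drop them.
df = df.drop_duplicates()
```

On the real data the same three lines apply unchanged; only the loading step differs.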

3. Natural Language Processing (NLP) Model:

For the NLP portion of this project, I will first convert all plot descriptions to word vectors so they can be processed by the NLP model. Then, the similarity between all word vectors will be calculated using cosine similarity (which measures the angle between two vectors, resulting in a score between -1 and 1, corresponding to complete opposites or perfectly similar vectors). Finally, I will extract the 5 movies or TV shows with the most similar plot description to a given movie or TV show.

4. Exploratory Data Analysis:

Exploratory Data Analysis (EDA), as the name suggests, is used to analyze and investigate datasets and summarize their main characteristics, often employing data visualization methods. It helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions. It also helps to understand the relationships between the variables (if any), which is useful for feature engineering. It helps to understand the data well before making any assumptions, to identify obvious errors, to better understand patterns within the data, to detect outliers and anomalous events, and to find interesting relations among the variables.

After mounting our drive and fetching and reading the given dataset, we performed the Exploratory Data Analysis. In this step we analyzed the data to understand how the content is distributed in the dataset, its type, and details such as which countries are watching more and which type of content is in demand.

Explorations and visualizations are as follows:
I. Proportion of type of content
II. Country-wise count of content
III. Total releases for the last 10 years
IV. Type- and rating-wise content count
V. Top 10 genres in movie content
VI. Top 20 actors on Netflix
VII. Length distribution of movies
VIII. Season-wise distribution of TV shows
IX. Count of content appropriate for different ages
X. Age-appropriate content count in the top 10 countries with maximum content
XI. Proportion of movies and TV shows appropriate for different ages
XII. Season-wise distribution of TV shows
XIII. Longest TV shows
XIV. Top 10 topics on Netflix
XV. Extracting the features and creating the document-term matrix
XVI. Topic modeling using LDA and LSA
XVII. Most important features of each topic
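The cosine-similarity step can be sketched with scikit-learn. The four plot strings here are hypothetical stand-ins for the real description column:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical plot descriptions standing in for the real 'description' column.
plots = [
    "a detective hunts a serial killer in the city",
    "a detective investigates a killer stalking the city",
    "two friends open a bakery in a small town",
    "a chef opens a restaurant in a quiet village",
]

# Convert descriptions to word vectors, then score every pair by the cosine
# of the angle between them (1 = same direction, 0 = no shared terms).
vectors = TfidfVectorizer(stop_words="english").fit_transform(plots)
sim = cosine_similarity(vectors)

# Titles most similar to plots[0], excluding itself (top 5 in the real project).
ranked = sim[0].argsort()[::-1][1:]
```

Here `ranked[0]` is the second detective plot, since it shares the most weighted terms with the first.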

5. Missing or Null Value Treatment:

In datasets, missing values arise due to numerous reasons, such as data-collection or handling errors. We checked for the null values present in our data, and the dataset does contain null values. In order to handle them, some columns and some of the null values are dropped.

6. Hypothesis from the Data Visualized:

Hypothesis testing is done to confirm our observation about the population using sample data, within the desired error level. Through hypothesis testing, we can determine whether we have enough statistical evidence to conclude if the hypothesis about the population is true or not. We have performed hypothesis testing to get insights on the duration of movies and content with respect to different variables.

7. TF-IDF Vectorization:

TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common algorithm to transform text into a meaningful representation of numbers, which is then used to fit a machine learning algorithm for prediction. We have also utilized PCA, because it can help us improve performance at a very low cost in model accuracy. Other benefits of PCA include reduction of noise in the data, feature selection (to a certain extent), and the ability to produce independent, uncorrelated features of the data. So, it is essential to transform our text with the TF-IDF vectorizer, then convert it into an array so that we can fit it into our model.

● Finding the number of clusters

The goal is to separate groups with similar characteristics and assign them to clusters. We used the Elbow method and the Silhouette score to do so, and we have determined that 28 clusters should be an optimal number of clusters.

● Fitting into the model

In this task, we have implemented a K-means clustering algorithm. K-means is a technique for data clustering that may be used for unsupervised machine learning. It is capable of classifying unlabeled data into a predetermined number of clusters (k) based on similarities.

8. Data Preprocessing:

Removing punctuation: Punctuation does not carry any meaning in clustering, so removing it helps to get rid of unhelpful parts of the data, or noise.

Removing stop-words: Stop-words are basically a set of commonly used words in any language, not just English. If we remove the words that are very commonly used in a given language, we can focus on the important words instead.
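The punctuation and stop-word removal described here, together with the stemming applied next, might look like the following sketch. The inline stop-word set is a small stand-in for NLTK's full English list (`nltk.corpus.stopwords.words("english")`, which needs a one-time download in Colab):

```python
import string

from nltk.stem import PorterStemmer

# Small inline subset standing in for NLTK's English stop-word list,
# so the sketch runs without downloading the corpus.
stop_words = {"the", "a", "an", "is", "are", "was", "were", "in", "of", "and", "to"}
stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    # Remove punctuation -- it carries no meaning for clustering.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Drop stop-words, then reduce each remaining word to its stem.
    return " ".join(
        stemmer.stem(w) for w in text.lower().split() if w not in stop_words
    )
```

Applied to every description before vectorization, this leaves only stemmed content words.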
Stemming: Stemming is the process of removing part of a word, or reducing a word to its stem or root. We apply stemming to reduce words to their basic form or stem, which may or may not be a legitimate word in the language.

9. Clustering:

Clustering (also called cluster analysis) is the task of grouping similar instances into clusters. More formally, clustering is the task of grouping the population of unlabeled data points into clusters in such a way that data points in the same cluster are more similar to each other than to data points in other clusters. The clustering task is probably the most important in unsupervised learning, since it has many applications, for example:

• Data analysis: often a huge dataset contains several large clusters; analyzing them separately, you can come to interesting insights.
• Anomaly detection: as we saw before, data points located in regions of low density can be considered anomalies.
• Semi-supervised learning: clustering approaches often help to automatically label partially labeled data for classification tasks.
• Indirect clustering tasks (tasks where clustering helps to gain good results): recommender systems, search engines, etc.
• Direct clustering tasks: customer segmentation, image segmentation, etc.

Building a clustering model

Clustering models allow you to categorize records into a certain number of clusters. This can help you identify natural groups in your data. Clustering models focus on identifying groups of similar records and labeling the records according to the group to which they belong. This is done without the benefit of prior knowledge about the groups and their characteristics; in fact, you may not even know exactly how many groups to look for. This is what distinguishes clustering models from the other machine-learning techniques: there is no predefined output or target field for the model to predict. These models are often referred to as unsupervised learning models, since there is no external standard by which to judge the model's classification performance.

10. Topic Modeling:

• Latent Semantic Analysis (LSA)

LSA, which stands for Latent Semantic Analysis, is one of the foundational techniques used in topic modeling. The core idea is to take a matrix of documents and terms and try to decompose it into two separate matrices:
• A document-topic matrix
• A topic-term matrix

Therefore, learning latent topics with LSA amounts to a matrix decomposition of the document-term matrix using Singular Value Decomposition. It is typically used as a dimension-reduction or noise-reducing technique.

• Latent Dirichlet Allocation (LDA)

LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.

Figure1. Visualization of clusters and Silhouette Coefficient.
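Both techniques are available in scikit-learn; here is a small sketch on toy documents. The documents and component counts are illustrative, not taken from the project:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# Toy documents standing in for the Netflix descriptions.
docs = [
    "space crew explores a distant planet",
    "astronauts travel to a distant galaxy",
    "a lawyer defends a client in court",
    "a judge presides over a murder trial",
]

# LSA: SVD on the document-term matrix factors it into
# document-topic and topic-term matrices.
tfidf = TfidfVectorizer(stop_words="english").fit_transform(docs)
doc_topic = TruncatedSVD(n_components=2, random_state=42).fit_transform(tfidf)

# LDA: a generative probabilistic model over topic mixtures,
# fitted on raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=42).fit(counts)
topic_mix = lda.transform(counts)  # per-document topic probabilities
```

Each row of `topic_mix` sums to 1, reflecting LDA's view of a document as a mixture of topic probabilities.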

11. Cluster Model Implementation

1. Affinity Propagation
2. Agglomerative Clustering
3. K-means Clustering

1. Affinity Propagation

Affinity Propagation (AP) is a graph-based clustering algorithm similar to k-means or k-medoids, which does not require the number of clusters to be estimated before running the algorithm. Affinity Propagation finds "exemplars", i.e. members of the input set that are representative of clusters. We used Euclidean distance as the affinity estimator; the number of clusters we obtained is shown in Figure2.

Figure2. Silhouette score and visualization.

2. Agglomerative Clustering

Agglomerative clustering is the most common type of hierarchical clustering, used to group objects into clusters based on their similarity. Pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects.
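Both algorithms are available in scikit-learn. A sketch on toy blob data, standing in for the vectorized text features:

```python
from sklearn.cluster import AffinityPropagation, AgglomerativeClustering
from sklearn.datasets import make_blobs

# Toy 2-D data standing in for the high-dimensional TF-IDF features.
X, _ = make_blobs(n_samples=60, centers=3, random_state=42)

# Affinity Propagation chooses its own "exemplars" -- no cluster count needed.
ap = AffinityPropagation(random_state=42).fit(X)

# Agglomerative clustering successively merges the closest pairs of clusters
# (Euclidean distance by default) until the requested number remains.
agg = AgglomerativeClustering(n_clusters=3).fit(X)
```

On well-separated blobs both methods recover the three groups; on sparse text features the choice of distance matters much more.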

12. K-means Clustering

K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known or labelled outcomes. K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster.

How the K-means algorithm works:

To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids. It halts creating and optimizing clusters when either:
• The centroids have stabilized: there is no change in their values because the clustering has been successful.
• The defined number of iterations has been reached.

In other words, the K-means algorithm is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups, where each data point belongs to only one group.

Figure3. Ideal clustering.

We created sample data using make_blobs and used a range of n_clusters values to specify the number of clusters we wanted to use in K-means.

Silhouette score and visualization:
For clusters = 2 the average silhouette score is: 0.7049787496083262
For clusters = 3 the average silhouette score is: 0.5882004012129721
For clusters = 4 the average silhouette score is: 0.6505186632729437
For clusters = 5 the average silhouette score is: 0.56376469026194
For clusters = 6 the average silhouette score is: 0.4504666294372765
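The experiment described here can be sketched as follows. The toy make_blobs data is illustrative, so the scores will differ from the project's own listing above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Sample data from make_blobs, as in the experiment above.
X, _ = make_blobs(n_samples=500, centers=4, random_state=1)

# Average silhouette score for each candidate number of clusters.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
```

The k with the highest average silhouette score is the preferred cluster count for this data.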

13. Silhouette Coefficient or Silhouette Score (meaning)

The Silhouette Coefficient, or silhouette score, is a metric used to measure the goodness of a clustering technique. Its value ranges from -1 to 1; a value of 1 means clusters are well apart from each other and clearly distinguished. The silhouette score is used to evaluate the quality of clusters created by clustering algorithms such as K-means, in terms of how well samples are clustered with other samples that are similar to each other. The silhouette score is calculated for each sample of each cluster. To calculate it for each observation/data point, the following distances need to be found for that observation:

1. Silhouette's Coefficient:

If the ground-truth labels are not known, the evaluation must be performed using the model itself. The Silhouette Coefficient is an example of such an evaluation, where a higher Silhouette Coefficient score corresponds to a model with better-defined clusters. The Silhouette Coefficient is determined for each sample and is composed of two scores:

● Mean distance between the observation and all other data points in the same cluster. This distance is also called the mean intra-cluster distance and is denoted by a.
● Mean distance between the observation and all data points of the next nearest cluster. This distance is also called the mean nearest-cluster distance and is denoted by b.

The Silhouette Coefficient s for a single sample is then given as:

s = (b - a) / max(a, b)

2. Elbow Curve:

The Elbow Curve is one of the most popular methods to determine the optimal value of k. The elbow curve uses the sum of squared distances (SSE) between the data points and their assigned cluster centroids to choose an ideal value of k.
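The per-sample computation of a, b, and s can be checked against scikit-learn's silhouette_samples on toy data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples

X, _ = make_blobs(n_samples=30, centers=2, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# For one observation: a = mean intra-cluster distance,
# b = mean distance to the nearest other cluster, s = (b - a) / max(a, b).
i = 0
same = X[labels == labels[i]]
other = X[labels != labels[i]]
a = np.mean([np.linalg.norm(X[i] - p) for p in same if not np.array_equal(p, X[i])])
b = np.mean([np.linalg.norm(X[i] - p) for p in other])
s = (b - a) / max(a, b)

# sklearn computes the same per-sample score:
s_sklearn = silhouette_samples(X, labels)[i]
```

With only two clusters the "nearest other cluster" is unambiguous, so the manual value matches sklearn's exactly.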

Conclusion:

1. In this project, we worked on a text clustering problem wherein we had to classify/group the Netflix shows into clusters such that the shows within a cluster are similar to each other and the shows in different clusters are dissimilar to each other.
2. The dataset contained about 7787 records and 12 attributes. We began by dealing with the dataset's missing values, followed by Exploratory Data Analysis (EDA).
3. It was found that Netflix hosts more movies than TV shows on its platform, and that the total number of shows has been growing over the years. The majority of the shows were produced in the United States.
4. Once we obtained the required insights from the EDA, we started pre-processing the text data by removing punctuation and stop-words and by stemming the words. The data was then passed through the TF-IDF Vectorizer, since we are conducting text-based clustering and the model requires vectorized input to predict the desired results.
5. It was decided to cluster the data based on the attributes: director, cast, country, genre, and description. These attributes were pre-processed and then vectorized using the TF-IDF vectorizer.
6. Through TF-IDF vectorization, we created a total of 20000 attributes. We used Principal Component Analysis (PCA) to reduce the dimensionality. 4000 components were able to capture more than 80% of the variance, and hence the number of components was restricted to 4000. We first built clusters using the K-means clustering algorithm, and the optimal number of clusters came out to be 6. This was obtained through the elbow method and silhouette score analysis.
7. Then clusters were built using the Agglomerative clustering algorithm, and the optimal number of clusters came out to be 12. This was obtained after visualizing the dendrogram.
8. A content-based recommender system was built using the similarity matrix obtained after applying cosine similarity. This recommender system will make 10 recommendations to the user based on the type of show they watched.
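The component-selection step in point 6 can be sketched as follows, with random data standing in for the real 20000-column TF-IDF matrix (where 4000 components captured about 80% of the variance):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random data standing in for the dense array built from the TF-IDF matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

pca = PCA().fit(X)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that captures at least 80% of the variance.
n_components = int(np.searchsorted(cumvar, 0.80) + 1)
```

A second `PCA(n_components=n_components)` fit then produces the reduced features passed to the clustering algorithms.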
