
National Institute of Statistics and

Applied Economics
Master of Information Systems
and Intelligent Systems

Data Analysis Project

Analysis of the popular YouTube channels in Morocco using the PCA method

Prepared by:

ABDELLATIF AHAMMAD

Supervised by :

Prof. Wafaa EL HANNOUN

Academic Year : 2021/2022


Data Analysis Project Abdellatif Ahammad

Introduction
When studying functions in mathematics, you will probably ask yourself at some point: we can draw a function of two or even three variables, but how can we draw a function of more than three variables?

This is an important question, and it can be very hard to find a way of representing such functions that the human mind can understand. Thanks to dimension reduction techniques such as PCA (Principal Component Analysis), we can reduce the data to a lower dimension and analyze it with the usual methods that we already know. Statistically based techniques are well suited to this kind of dimension reduction problem.

In this report, we will discover the magic behind the PCA method and then apply it, using the R language, to a YouTube dataset containing some of the channels watched in Morocco, in order to see which channels are the most popular in Morocco.


1 - Theory behind PCA


1.1 - Introduction
The PCA method is a linear method for reducing the dimension of data, from a high number of dimensions (3, 4, or more) to a simple space that we are familiar with, such as the 2-dimensional Euclidean plane (x, y).

1.2 - How PCA works


In order to discover the magic behind the PCA method, we will use this example dataset, which contains 3 columns of marks (from 0 to 100) for the following subjects: Math, English and Art.

Student   Math   English   Art
1         90     60        90
2         90     90        30
3         60     60        90
4         30     30        30

This data can be represented as a matrix A with 3 columns (one per subject), since we are only interested in the quantitative columns.


● Calculate the mean of the matrix

The mean of matrix A would be :

Calculating the mean of each column is simple: we take the sum of the column's values and divide it by the total number of individuals.
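The mean step can be sketched in a few lines of plain Python. This is only an illustration: it uses the four students listed in the table above, and the function names `column_means` and `center` are my own choices, not part of the original work.

```python
# Marks of the four students listed in the table above: (Math, English, Art)
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 90],
    [30, 30, 30],
]


def column_means(rows):
    """Mean of each column: the column sum divided by the number of rows."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def center(rows):
    """Subtract each column's mean from its values (deviation scores)."""
    means = column_means(rows)
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


means = column_means(marks)
centered = center(marks)
print(means)  # [67.5, 60.0, 60.0]
```

After centering, every column sums to zero, which is what the covariance computation relies on.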

● Compute the covariance matrix of the whole dataset

We can compute the covariance of two variables X and Y using the following formula.

This gives us a 3×3 matrix, because the number of features that we have is 3 (Math, English, Art).
To apply it, transform the raw scores of matrix A into deviation scores by subtracting the mean of each column.
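As a rough sketch, the covariance matrix of the example can be computed in plain Python as follows. I assume here the sample covariance, which divides by n - 1; passing `ddof=0` divides by n instead. The names `covariance`, `covariance_matrix` and `ddof` are hypothetical helpers chosen for this illustration.

```python
# Marks of the four students listed in the table above: (Math, English, Art)
marks = [
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 90],
    [30, 30, 30],
]


def covariance(xs, ys, ddof=1):
    """Covariance of two equally long lists.

    ddof=1 divides by n - 1 (sample covariance, an assumption here),
    ddof=0 divides by n (population covariance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (n - ddof)


def covariance_matrix(rows, ddof=1):
    """Covariance of every pair of columns: one row and one column per feature."""
    columns = list(zip(*rows))
    return [[covariance(a, b, ddof) for b in columns] for a in columns]


cov = covariance_matrix(marks)
for row in cov:
    print(row)
```

The diagonal holds the variance of each subject, and the matrix is symmetric because the covariance of X and Y equals the covariance of Y and X.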


● Compute Eigenvectors and corresponding Eigenvalues

Eigenvalues are associated with eigenvectors in linear algebra, and both terms are used in the analysis of linear transformations. Eigenvalues are a special set of scalar values associated with a system of linear equations, most often written as a matrix equation; they are also called characteristic roots. An eigenvector is a non-zero vector that changes at most by a scalar factor when the linear transformation is applied to it, and the corresponding factor that scales the eigenvector is called its eigenvalue.
In other words:

Let A be a square matrix, ν a non-zero vector and λ a scalar that satisfy

Aν = λν

Then λ is called the eigenvalue associated with the eigenvector ν of A.

The eigenvalues of A are the roots of the characteristic equation

det(A − λI) = 0

where I is the identity matrix. We first write out the matrix A − λI and simplify it; we can calculate the determinant afterwards.

Now that we have our simplified matrix, we can find the determinant:


As you can see, we need to find λ. After solving the equation, we find:

Now we can calculate the eigenvectors corresponding to the above eigenvalues. To see how to calculate the eigenvectors, you can check this website:

https://study.com/academy/lesson/eigenvalues-eigenvectors-definition-equation-examples.html

The result would be something like this:
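In practice the eigenvalues and eigenvectors are rarely computed by hand. The sketch below uses NumPy on the four students listed in the example table and assumes the sample covariance (division by n - 1), so its numbers may differ from the ones obtained by hand in this report.

```python
import numpy as np

# Marks of the four students listed in the example table: (Math, English, Art)
marks = np.array(
    [[90, 60, 90], [90, 90, 30], [60, 60, 90], [30, 30, 30]], dtype=float
)

# Covariance matrix of the columns (np.cov divides by n - 1 by default)
cov = np.cov(marks, rowvar=False)

# eigh is designed for symmetric matrices such as a covariance matrix.
# It returns the eigenvalues in ascending order and the matching
# unit-length eigenvectors as the columns of the second result.
eigenvalues, eigenvectors = np.linalg.eigh(cov)

for value, vector in zip(eigenvalues, eigenvectors.T):
    # Each pair satisfies cov @ v = lambda * v
    print(round(float(value), 2), np.round(vector, 3),
          np.allclose(cov @ vector, value * vector))
```

A quick sanity check is that the eigenvalues add up to the trace of the covariance matrix, i.e. the total variance of the three subjects.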

● Sort the eigenvectors by decreasing eigenvalues and choose the k eigenvectors with the largest eigenvalues to form an n × k dimensional matrix.

We started with the goal of reducing the dimensionality of our feature space, i.e., projecting the feature space via PCA onto a smaller subspace, where the eigenvectors will form the axes of this new feature subspace. However, the eigenvectors only define the directions of the new axes, since they all have the same unit length 1.


So, in order to decide which eigenvector(s) we want to drop for our lower-dimensional subspace, we have to take a look at the corresponding eigenvalues. Roughly speaking, the eigenvectors with the lowest eigenvalues bear the least information about the distribution of the data, and those are the ones we want to drop.

The common approach is to rank the eigenvectors from highest to lowest corresponding eigenvalue and choose the top k eigenvectors. So, after sorting the eigenvalues in decreasing order, we have:

For our simple example, where we are reducing a 3-dimensional feature space to a 2-dimensional feature subspace, we combine the two eigenvectors with the highest eigenvalues to construct our n × k = 3 × 2 dimensional eigenvector matrix B.
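The sorting and selection step can be sketched with NumPy as follows. The helper name `top_k_eigenvectors` is hypothetical, the data are the four students listed in the example table, and the sample covariance (n - 1) is an assumption.

```python
import numpy as np


def top_k_eigenvectors(cov, k):
    """Return the k largest eigenvalues and an n x k matrix of their eigenvectors."""
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # argsort gives ascending order, so reverse it and keep the first k indices
    order = np.argsort(eigenvalues)[::-1][:k]
    return eigenvalues[order], eigenvectors[:, order]


# Marks of the four students listed in the example table: (Math, English, Art)
marks = np.array(
    [[90, 60, 90], [90, 90, 30], [60, 60, 90], [30, 30, 30]], dtype=float
)
cov = np.cov(marks, rowvar=False)

top_values, B = top_k_eigenvectors(cov, k=2)
print(B.shape)  # (3, 2)
# Share of the total variance kept by the two selected axes
print(top_values.sum() / np.trace(cov))
```

Each column of B is one of the retained eigenvectors, so B has one row per original feature and one column per new axis.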

● Transform the samples onto the new subspace

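In the last step, we use the 3×2 dimensional matrix B that we just computed to transform our samples onto the new subspace via the equation

y = Bᵀ × x

where x is a sample (a vector of its 3 original values), y is its image in the new 2-dimensional subspace, and Bᵀ is the transpose of the matrix B.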
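The projection itself is a single matrix product. In the sketch below the samples are centered before being projected, which is my assumption (it is the usual convention). Because the samples are stored as rows, applying the transpose of B to every sample is written `centered @ B`.

```python
import numpy as np

# Marks of the four students listed in the example table: (Math, English, Art)
marks = np.array(
    [[90, 60, 90], [90, 90, 30], [60, 60, 90], [30, 30, 30]], dtype=float
)

# Center the data, then build B from the two leading eigenvectors
centered = marks - marks.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1][:2]
B = eigenvectors[:, order]          # shape (3, 2)

# One row per student, one column per new axis
projected = centered @ B            # shape (4, 2)
print(projected.shape)
print(np.round(projected, 2))
```

The variance of each new column equals the eigenvalue of the axis it was projected on, which is why the eigenvalues measure how much information each axis keeps.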


1.3 - Conclusion

As we have seen, these are the main steps of the PCA method, which can reduce a dataset of n dimensions to a small number of dimensions such as 2. This is interesting because it lets us plot the data and then analyze it and extract its features.


2 - Preparing and collecting Dataset


2.1-Introduction

As a subject for applying PCA, I chose the data of some Moroccan YouTube channels, to see which channels are the most popular (a popular channel is not necessarily just one that has a lot of subscribers; we have to take into account other things such as likes, comments, ...).

Problem: there are no ready-made dataset files.

For this reason, I had to create a dataset myself. I found some websites that contain the data, but it has to be scraped from HTML into a CSV file to be usable in R.

2.2-Data collection
This website lists the 1000 most watched YouTube channels in Morocco:
https://hypeauditor.com/top-youtube-all-morocco/?p=1

The data on this website is presented like this:

figure 1 : top 1000 youtube channel watched in Morocco


We need to get it as a CSV file, so I extracted the data from this website using the following Python script:

# Python program to scrape the website
# and save the channels to a CSV file
import csv

import requests
from bs4 import BeautifulSoup

# Column names, in the order the cells appear in each table row
FIELDS = ['id', 'Chanel', 'Category', 'total_subscribers',
          'subs_Morocco', 'views', 'likes', 'comments']

channels = []  # a list to store channels
for page in range(1, 21):
    URL = "https://hypeauditor.com/top-youtube-all-morocco/?p={}".format(page)
    r = requests.get(URL)
    soup = BeautifulSoup(r.content, 'html5lib')
    table = soup.find('tbody', attrs={'class': 'tbody'})
    for row in table.find_all('tr', attrs={'class': 'tr'}):
        channel = {}
        i = 0  # index of the next field to fill
        for data in row.find_all('td', attrs={'class': 'td'}):
            # Empty cells are skipped, except the Category cell (i == 2),
            # which is stored even when it is empty
            if i < len(FIELDS) and (len(data.text) != 0 or i == 2):
                channel[FIELDS[i]] = data.text
                i += 1
        channels.append(channel)

# Write all the scraped channels to a single CSV file
filename = 'youtube_maroc.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, FIELDS)
    w.writeheader()
    for channel in channels:
        w.writerow(channel)

The scraped data looks like this:

figure 2: dataset file


2.3- Data cleaning

As you can see, there are some missing values, and the data is not numerical because it contains suffixes such as "k" and "M". We therefore have to clean it and create a new CSV file, and for that reason I wrote this Python script:

import pandas as pd


def to_number(value):
    """Convert a count written with a 'k' or 'm' suffix (e.g. '12k', '3M') to a float."""
    res = str(value).lower()
    if 'k' in res:
        return float(res.split('k')[0]) * 1000
    if 'm' in res:
        return float(res.split('m')[0]) * 1000000
    return float(res)


df = pd.read_csv("./youtube_maroc.csv")
print(df.columns)

# I keep only the first 10 channels because, with more than 10,
# the graphical representation of individuals is not clear
df = df[:10]

# remove individuals with null values
# (pandas reads '' and 'N/A' as NaN, so those rows are dropped with dropna)
df = df.dropna(subset=['views', 'likes', 'comments'])
df = df[df.views != '0']
df = df[(df.likes != 'N/A') & (df.comments != 'N/A')]
# work on a copy so that the column assignments below are safe
df = df.copy()

# convert 'k' and 'm' suffixes to numeric values
for column in ['views', 'likes', 'comments', 'subs_Morocco', 'total_subscribers']:
    df[column] = df[column].apply(to_number)

# get a clean name for channels by taking the part before @
df['Chanel'] = df['Chanel'].apply(lambda name: str(name).lower().split('@')[0])

col = ['Chanel', 'Category', 'views', 'likes', 'comments',
       'subs_Morocco', 'total_subscribers']
df = df[col]
df = df.set_index('Chanel')
df.to_csv("cleaned_fixed_youtube_maroc_10.csv")
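The suffix conversion at the heart of the cleaning script can be isolated and checked on its own. The version below is only a sketch with a hypothetical name, `parse_count`. Unlike the script, it looks only at the last character and returns None for missing values, which are my own choices.

```python
def parse_count(text):
    """Convert a count such as '12.5k' or '3M' to a float.

    Returns None for missing values ('' or 'N/A'); this handling is an
    assumption made for this sketch.
    """
    cleaned = str(text).strip().lower()
    if cleaned in ("", "n/a"):
        return None
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = cleaned[-1]
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)


print(parse_count("12.5k"))  # 12500.0
print(parse_count("3M"))     # 3000000.0
print(parse_count("870"))    # 870.0
print(parse_count("N/A"))    # None
```

Checking only the last character avoids misreading a value that happens to contain a "k" or an "m" elsewhere in the string.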

You can download the final data from here:

https://drive.google.com/drive/folders/1oCWL7zIyG3xmjs_uA-drVyk3CxGFKBQY?usp=sharing

The output file: cleaned_fixed_youtube_maroc_10.csv.

figure 3: cleaned dataset file

Now we can start our work based on this sample of the dataset.


3 - Data Observation and basic analyses :


3.1- Introduction
As you can see, the list is already ranked by number of subscribers. I also tried to complete the category of each channel. The only thing missing here is the number of dislikes, but is it that important?

● Does the number of dislikes influence the results?

In the beginning, the YouTube algorithm handled dislikes in much the same way as likes, which is not logical. Recently, however, YouTube hid the public dislike count and now displays only the like count, which suggests that they consider dislikes less significant in terms of data.

To be sure, I ran a PCA analysis on trending videos, and the result was the following.

figure 4: PCA graph of variables (like,dislike,comments,views)


The only thing that we can say from this graph is that dislikes do not contribute much to either axis (dim1, dim2), so it is acceptable not to have them in this dataset.

3.2- Basic statistics

# import libraries
library("FactoMineR")
library("factoextra")
# set the working directory
setwd("/your_path/")
# load the dataset
youtube.data = read.table('data_yt.csv', sep = ',', check.names = F,
                          header = T, dec = '.')

# display column names
names(youtube.data)

# create a subset by removing the Category column,
# since it contains only qualitative data
youtube.data = subset(youtube.data, select = -c(Category))
# display a few lines from our data
head(youtube.data)
# display basic statistics on the dataset
summary(youtube.data)

the output :

figure 5: data summary output


4 - Applying the PCA method to the dataset
4.1 - PCA analysis

● Set up the PCA


# scale.unit = T -> normalize (standardize) the data
# quanti.sup -> supplementary quantitative columns
# quali.sup -> supplementary qualitative columns
# (here columns 1 and 2 of auto.data are declared qualitative)
youtube.acp = PCA(auto.data, scale.unit = T, graph = F, quali.sup = c(1:2))

● Are 2 dimensions enough to represent this dataset?

The way to answer this question is to look at the percentage of explained variance of each dimension. As you can see in the scree plot, the first dimension alone explains 83.4% of the variance and the second dimension 14.3%, i.e. 97.7% together, so 2 dimensions are more than enough to represent this dataset.

figure 8: Scree plot
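The reasoning behind the scree plot can be sketched in plain Python: each eigenvalue is turned into a percentage of the total, and we keep adding dimensions until the cumulative percentage is high enough. The function names and the 90% threshold below are hypothetical choices for illustration; only the two percentages come from the scree plot.

```python
from itertools import accumulate


def explained_variance_percent(eigenvalues):
    """Percentage of the total variance carried by each eigenvalue."""
    total = sum(eigenvalues)
    return [100 * value / total for value in eigenvalues]


def components_needed(percentages, threshold):
    """Number of leading dimensions whose cumulative percentage reaches the threshold."""
    for count, cumulative in enumerate(accumulate(percentages), start=1):
        if cumulative >= threshold:
            return count
    return len(percentages)


# Percentages read from the scree plot for the first two dimensions
reported = [83.4, 14.3]
print(round(sum(reported), 1))            # 97.7
print(components_needed(reported, 90.0))  # 2

# Illustrative eigenvalues (made up) to show the conversion to percentages
print(explained_variance_percent([4.0, 1.0]))  # [80.0, 20.0]
```

With a threshold of 90%, the first dimension alone (83.4%) is not enough, but the first two together are.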


● Analyze the contribution of variables

If we look at the summary of the PCA result, we find:

figure 6: summary of PCA data (eigenvalues,variance)

figure 7: summary of PCA data (variables contribution)

From this data, we can see how much each variable contributes to creating each dimension. Since we are interested in only two dimensions, we can see that views, comments, likes and total_subscribers are the main contributors to Dim1, while for the second dimension the number of subscribers from Morocco is the main contributor.
We can also get this information from these graphs:

figure 9: Contribution of variables to dim1

figure 10: Contribution of variables to dim2
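To give an idea of where these contribution percentages come from, here is a minimal NumPy sketch. It assumes the usual definition for a standardized PCA: the contribution of a variable to a dimension is its squared coordinate divided by the eigenvalue of that dimension, which simplifies to the squared entry of the unit eigenvector. The data are random placeholders, not the YouTube dataset, and `variable_contributions` is a hypothetical name.

```python
import numpy as np


def variable_contributions(data):
    """Percent contribution of each variable (rows) to each dimension (columns).

    A standardized PCA diagonalizes the correlation matrix. The coordinate of a
    variable on a dimension is sqrt(eigenvalue) * eigenvector entry, so
    coordinate**2 / eigenvalue reduces to the squared eigenvector entry.
    """
    correlation = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(correlation)
    order = np.argsort(eigenvalues)[::-1]   # largest eigenvalue first
    return 100 * eigenvectors[:, order] ** 2


# Placeholder data: 10 individuals, 3 variables (random, for illustration only)
rng = np.random.default_rng(0)
data = rng.random((10, 3))

contributions = variable_contributions(data)
print(np.round(contributions, 1))
print(contributions.sum(axis=0))  # each dimension's contributions add up to 100
```

Because every eigenvector has unit length, the contributions of all the variables to one dimension always sum to 100%.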


● Analyze the contribution of individuals

figure 11: Contribution of individuals to dim1

figure 12: Contribution of individuals to dim2


In these graphs, we can see which individuals contribute the most to each dimension.

For the first dimension, we can see that 1 (“saad lamjarred”) is the top contributor, with more than 87%, while the others contribute only small amounts.

For the second dimension, however, 1 (“saad lamjarred”) does not contribute at all, while 2 (“choufTV”) is the top contributor with more than 35%, followed by 3 (“cuisine Halima filali”) with less than 20%, and 19 (“Syblus”) and 9 (“Jamal alpha”) with less than 10% each.
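The same idea applies to individuals. Assuming equal weights, the contribution of an individual to a dimension is its squared coordinate divided by the sum of the squared coordinates of all individuals on that dimension. The sketch below uses made-up numbers in which the first row is far larger than the others, to mimic a channel that dominates the first dimension; `individual_contributions` is a hypothetical name.

```python
import numpy as np


def individual_contributions(data):
    """Percent contribution of each individual (rows) to each dimension (columns)."""
    standardized = (data - data.mean(axis=0)) / data.std(axis=0)
    # SVD of the standardized data: the rows of vt are the principal axes,
    # ordered from the largest to the smallest singular value
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    coordinates = standardized @ vt.T
    squared = coordinates ** 2
    return 100 * squared / squared.sum(axis=0)


# Made-up data: the first individual is much larger than the others on every variable
data = np.array(
    [[100, 100, 100], [1, 2, 1], [2, 1, 2], [1, 1, 2], [2, 2, 1], [1, 2, 2]],
    dtype=float,
)

contributions = individual_contributions(data)
print(np.round(contributions[:, 0], 1))  # contributions to the first dimension
```

An individual with extreme values on the variables that drive a dimension takes most of that dimension's contribution, which is the pattern seen for the top contributor in the graphs above.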

● Analyze the quality of representation of each variable

These graphs present the correlations of the variables in our dataset.

figure 13: correlation circle


figure 14: cos2 of variables to dim1 and dim2

In these graphs, we can clearly see that views, comments, likes and total subscribers are the variables that are represented with good quality on the first dimension, while the number of Moroccan subscribers is represented on the second one.

● Analyze the quality of representation of each individual

Based on the first graph (see the figures below), we can see which channels are popular in Morocco. In that graph, "saad lamjarred" is well represented on Dim1, which tells us that he has a lot of views, comments, likes and total subscribers compared with his number of Moroccan subscribers; this also explains why he is not well represented on Dim2. On the other hand, the channel "ChoufTV" has a large number of Moroccan subscribers compared with its views, comments and likes.

The other channels, such as "asmaa beauty", "cuisine halima lfilali" and "syblus", are reasonably well represented on both axes.

The second and third graphs give the same results, but this time we do not target any specific channel; we focus on the category that the channels belong to. As you can notice, the number of channels that belong to “daily vlogs” is higher than for the other categories, although some of them are not well represented. For “music & dance”, we notice a high level of diversity between channels: for example, “saad lamjarred” is well represented while “Hatim amour” is not represented with the same quality. That is normal, because the music & dance field mostly attracts people who are interested in one particular artist, whom we call fans. We can also conclude that the “daily vlogs” category is more active and attractive for a lot of people (especially women), and that its channels more or less share their viewers, so a person who watches “Asmaa beauty” probably also watches “cuisine Halima filali”.

For the “news & politics” channels, we see the same thing as in the music field: “chouf Tv” is well represented, especially on dim2, while the others form a small group that is not well represented on either dim1 or dim2.


figure 15: individuals representation in scatter plot (channels)

figure 16: confidence ellipses around the categories of Category


figure 17: confidence ellipses around the categories of Category and Channels


figure 18: individuals representation in scatter plot (categories)

4.2 - Conclusion
When we analyze this dataset, we get a rough idea of what and whom Moroccan people watch the most. From this small analysis, we could say that the top 5 YouTubers in this dataset of 25 YouTubers are:
● saad lamjarred
● Chouf TV
● cuisine Halima filali
● Jamal alpha
● baraka lbaraka

The top categories that have more than one channel are:

● music & dance
● daily vlogs
● news & politics


5 - Simple real-world application of PCA
5.1 - Introduction
In this chapter, I want to make things clearer and more accessible to everybody, even to people who know nothing about PCA or programming. The objective is to create something that ranks the selected YouTubers from a dataset. To do that, we need a language that is easier to deploy than R, and for that reason we will use Python.

5.2 - PCA Implementation in Python


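import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("./samples/yt_20.csv")
# keep only the quantitative columns
select = ['views', 'likes', 'comments', 'subs_Morocco',
          'total_subscribers']
dfnorm = df[select]
# standardize the variables (zero mean, unit variance)
dfnorm = StandardScaler().fit_transform(dfnorm)
acp = PCA(svd_solver='full')
coord = acp.fit_transform(dfnorm)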

These lines set up the PCA method in Python. There is more code, but these lines are the core of what we need, and if we draw the scree plot and the other plots, we get the same results as with R.
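As a hedged illustration of how the scree plot values can be read from the fitted scikit-learn object, the sketch below repeats the same setup on random placeholder data (not the YouTube dataset) and prints the percentage of variance explained by each dimension.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Placeholder values standing in for the five numeric columns
# (random numbers, for illustration only)
rng = np.random.default_rng(0)
data = rng.random((8, 5))

# Same setup as above: standardize, then fit a full PCA
scaled = StandardScaler().fit_transform(data)
acp = PCA(svd_solver='full')
coord = acp.fit_transform(scaled)

# Percentage of variance explained by each dimension (the scree plot values)
percent = 100 * acp.explained_variance_ratio_
print(np.round(percent, 1))

# Coordinates of the individuals on the first two dimensions
print(np.round(coord[:, :2], 2))
```

The `explained_variance_ratio_` attribute plays the role of the eigenvalue table in R, and the first two columns of `coord` are what the scatter plot of individuals displays.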

Examples :


5.3 - Proposed solution

After writing some code, I created this small platform, which ranks any selected channels from the dataset and presents the result in a nice way.

Feel free to try it and judge whether it is efficient or not.

https://acpyoutube.herokuapp.com/


General Conclusion
To conclude, we can say that PCA is one of the easiest and most effective methods for analyzing data, even with a lot of variables. Of course, there are also many other methods that give the same results or even better ones. The remaining difficulty is to give these results a meaningful interpretation, which is hard and error-prone because of the human factor or errors in the chosen data itself.

In this small project, we applied the PCA method to Moroccan YouTube data and obtained some interesting results. I personally found them very interesting because they raised a lot of questions in my mind, such as: which gender watches YouTube and engages with its content more, men or women? Answering this question requires more data about users, which is practically impossible to get unless it comes from YouTube itself. Still, we can formulate a hypothesis from what we have right now: “older women watch and engage more with daily vlogs YouTube content”. It could be true or false, but it remains a hypothesis to reject or accept.
