Applied Economics
Master of Information Systems
and Intelligent Systems
Realised By :
ABDELLATIF AHAMMAD
Supervised by :
Introduction
When studying functions in mathematics, you will probably ask yourself at some
point: we can draw a function of two or even three variables, but how can we
draw a function of more than three variables?
This is an important question. It can be very hard to find a way to represent
such functions that the human mind can understand. Thanks to dimensionality
reduction techniques such as PCA (Principal Component Analysis), we can project
the data onto a lower-dimensional space and analyze it with the tools we already
know. Statistically based techniques are well suited to this kind of dimension
reduction problem.
In this report, we look at how the PCA method works and then apply it, using the
R language, to a YouTube dataset of channels watched in Morocco, in order to see
which channels are the most popular in Morocco.
Data Analysis Project Abdellatif Ahammad
1 90 60 90
2 90 90 30
3 60 60 90
4 30 30 30
To calculate the mean of each column, we simply divide the sum of the
individuals' values by 5, which is the total number of individuals.
We can then compute the covariance of two variables X and Y using the
covariance formula.
Doing this for every pair of variables gives us a 3x3 matrix, because we have
3 features (math, art, English).
Transform the raw scores from matrix X into deviation scores for matrix A.
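As a minimal sketch of these steps (column means, deviation scores, covariance matrix), the following Python snippet uses NumPy, which is an assumption here since the report does not show this computation in code. It uses only the four rows of scores that appear in the table above, so it divides by 4 rather than by the 5 individuals of the full example, and it assumes the sample covariance (divisor n - 1); a population covariance would divide by n instead.

```python
import numpy as np

# The four score rows visible in the table above (three features each).
X = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 90],
    [30, 30, 30],
], dtype=float)

n = X.shape[0]

# Mean of each column: sum of the individuals divided by their number.
means = X.sum(axis=0) / n

# Deviation scores: subtract each column mean from the raw scores.
A = X - means

# Covariance matrix (assumption: sample covariance, divisor n - 1).
cov = A.T @ A / (n - 1)

print("column means:", means)
print("covariance matrix:")
print(cov)
```

The result is a symmetric 3x3 matrix: the diagonal holds the variance of each feature and the off-diagonal entries hold the covariance of each pair of features.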
Eigenvalues are associated with eigenvectors in linear algebra. Both terms are
used in the analysis of linear transformations. Eigenvalues are the special set of
scalar values associated with a system of linear equations, most often written
as a matrix equation. Eigenvalues are also called characteristic roots.
An eigenvector is a non-zero vector that is changed at most by a scalar factor
when the linear transformation is applied, and the corresponding factor that
scales the eigenvector is called its eigenvalue.
In other words, for a square matrix M, a non-zero vector v is an eigenvector
with eigenvalue lambda when M v = lambda v.
As you can see, we need to find lambda. After solving the equation,
we find:
https://study.com/academy/lesson/eigenvalues-eigenvectors-definition-equation-examples.html
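The relation between a matrix, its eigenvectors and its eigenvalues can be checked numerically. The sketch below is an illustration in Python with NumPy (an assumption; the report does not show how its eigenvalues were computed). It builds the covariance matrix of the four score rows that appear earlier in this report, with an n - 1 divisor, and verifies for every eigenpair that multiplying the matrix by the eigenvector gives the same result as scaling the eigenvector by its eigenvalue.

```python
import numpy as np

# The four score rows visible earlier in the report.
X = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 90],
    [30, 30, 30],
], dtype=float)

# 3x3 sample covariance matrix (columns are the variables).
C = np.cov(X, rowvar=False)

# eigh is intended for symmetric matrices. It returns the eigenvalues in
# ascending order and the unit-length eigenvectors as columns of `vecs`.
vals, vecs = np.linalg.eigh(C)

for lam, v in zip(vals, vecs.T):
    # C @ v and lam * v should be the same vector.
    print(lam, np.allclose(C @ v, lam * v))
```

Because a covariance matrix is symmetric, its eigenvalues are real and its eigenvectors can be chosen orthogonal to each other, which is what makes them usable as new axes.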
We started with the goal of reducing the dimensionality of our feature space,
i.e., projecting the feature space via PCA onto a smaller subspace, where the
eigenvectors form the axes of this new feature subspace. However, the
eigenvectors only define the directions of the new axes, since they all have
the same unit length 1.
So, in order to decide which eigenvector(s) we want to drop for our
lower-dimensional subspace, we have to look at the corresponding eigenvalues
of the eigenvectors. Roughly speaking, the eigenvectors with the lowest
eigenvalues bear the least information about the distribution of the data, and
those are the ones we want to drop.
For our simple example, where we are reducing a 3-dimensional feature space
to a 2-dimensional feature subspace, we combine the two eigenvectors
with the highest eigenvalues to construct our n×k (here 3×2) eigenvector
matrix B.
In the last step, we use the transpose of B, a 2×3 matrix, to transform our
samples onto the new subspace via the equation y = Bᵀ x, where x is a sample
of deviation scores and y is its projection.
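The selection and projection steps can be sketched as follows. This is an illustration in Python with NumPy, not the report's own code; it assumes the four score rows that appear earlier in this report and a sample covariance (divisor n - 1). The names `k`, `order` and `Y` are introduced only for this example.

```python
import numpy as np

# The four score rows visible earlier in the report.
X = np.array([
    [90, 60, 90],
    [90, 90, 30],
    [60, 60, 90],
    [30, 30, 30],
], dtype=float)

# Deviation scores and covariance matrix.
A = X - X.mean(axis=0)
C = A.T @ A / (len(X) - 1)

# Eigen-decomposition, then sort the eigenpairs by decreasing eigenvalue.
vals, vecs = np.linalg.eigh(C)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Keep the k eigenvectors with the highest eigenvalues: B is 3 x 2.
k = 2
B = vecs[:, :k]

# Project every sample: each column of A.T is one sample x, and y = B^T x.
Y = (B.T @ A.T).T   # one row per individual, one column per new axis

print(B.shape, Y.shape)
print("share of variance kept:", vals[:k].sum() / vals.sum())
```

The variance of the projected data along each new axis equals the corresponding eigenvalue, which is why dropping the eigenvector with the smallest eigenvalue loses the least information.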
1.3-Conclusion
These are the main steps of the PCA method, which can reduce a dataset of n
dimensions to a small number of dimensions such as 2. This is useful because
it lets you plot your data, analyze the plot and extract the main features of
the data.
For this reason, I had to create a dataset myself. I found some websites that
contain relevant data, but it has to be scraped from HTML into a CSV file to be
usable in R.
2.2-Data collection
This website lists the 1000 most-watched YouTube channels in Morocco:
https://hypeauditor.com/top-youtube-all-morocco/?p=1
We need this data as a CSV file, so I extracted it from the website
using the following Python script:
            i += 1
        elif i == 7:
            chanel['comments'] = data.text
            i += 1
        elif i == 2:
            chanel['Category'] = data.text
            i += 1
    chanels.append(chanel)

# write all scraped channels to a CSV file
filename = 'youtube_maroc.csv'
with open(filename, 'w', newline='') as f:
    w = csv.DictWriter(f, ['id', 'Chanel', 'Category',
                           'total_subscribers', 'subs_Morocco', 'views', 'likes', 'comments'])
    w.writeheader()
    # the list is named `chanels` above, so iterate over that name
    for chanel in chanels:
        w.writerow(chanel)
As you can see, there are some missing values, and the data is not numeric
because it contains "k" and "M" suffixes. We therefore have to clean it and
create a new CSV file, which is what the following Python script does:
import pandas as pd

df = pd.read_csv("./youtube_maroc.csv")
print(df.columns)

chanels = []

# Keep only the first 10 rows: with more than 10 individuals,
# the plot of individuals is no longer readable.
df = df[:10]

# Remove individuals with null values.
df = df[(df.views != '0') & (df.views != "")]
df = df[(df.likes != 'N/A') & (df.likes != "")]
df = df[(df.comments != 'N/A') & (df.comments != "")]


def to_number(el):
    """Convert values with a 'k' or 'm' suffix to numeric values."""
    res = str(el).lower()
    if 'k' in res:
        return float(res.split('k')[0]) * 1000
    elif 'm' in res:
        return float(res.split('m')[0]) * 1000000
    return float(res)


# Apply the same conversion to every numeric column.
likes = [to_number(el) for el in df.likes]
total_subscribers = [to_number(el) for el in df.total_subscribers]
subs_Morocco = [to_number(el) for el in df.subs_Morocco]
views = [to_number(el) for el in df.views]
comments = [to_number(el) for el in df.comments]

# (The rest of the script, including the loop that fills `chanels` and the
# code that writes the new CSV file, is not reproduced here.)
https://drive.google.com/drive/folders/1oCWL7zIyG3xmjs_uA-drVyk3CxGFKBQY?usp=sharing
Now we can start our work based on this sample of the dataset.
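Before moving on, the suffix conversion used in the cleaning step can be checked in isolation. The function below is a slightly more defensive sketch of that idea, not the script used for this report: the name `to_number`, the handling of missing values and the sample strings are assumptions made for illustration.

```python
def to_number(text):
    """Convert a count such as '1.2k' or '3M' to a float.

    Returns None for missing values ('', 'N/A', 'nan').
    """
    value = str(text).strip().lower()
    if value in ("", "n/a", "nan"):
        return None
    multipliers = {"k": 1_000, "m": 1_000_000}
    suffix = value[-1]
    if suffix in multipliers:
        # Only the last character is treated as a unit suffix.
        return float(value[:-1]) * multipliers[suffix]
    return float(value)


# Hypothetical sample values, in the format found on the scraped page.
samples = ["1.2k", "3M", "250", "N/A"]
print([to_number(s) for s in samples])
```

Looking only at the last character avoids misreading a value that happens to contain a "k" or an "m" elsewhere, and returning None for missing values lets the caller decide whether to drop the row.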
In the beginning, the YouTube algorithm treated likes and dislikes in much the
same way, which is not logical. Recently, YouTube hid the public dislike count
and now shows only the number of likes, which suggests that dislikes were not
found to be very significant in terms of data.
To be sure, I ran an analysis using the PCA method on trending videos, and the
result was the following.
The only thing we can say from this is that dislikes contribute little to
either axis (Dim1, Dim2), so it is acceptable not to have them in this
dataset.
# import libraries
library("FactoMineR")
library("factoextra")
# set the working directory
setwd("/your_path/")
# load the dataset
youtube.data <- read.table("data_yt.csv", sep = ",", check.names = FALSE,
                           header = TRUE, dec = ".")
names(youtube.data)
The output:
From this data, we can see how much each variable contributes to each
dimension. Since we are interested in only two dimensions, we can see that
views, comments, likes and total_subscribers are the main contributors to
Dim1, while the number of subscribers from Morocco is the main contributor to
Dim2.
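To make the notion of contribution concrete, here is a minimal sketch of how the contribution of each variable to each dimension can be computed for a PCA on standardized data. It is written in Python with NumPy as an illustration, not taken from the R session of this report, and the numbers are hypothetical values for six channels, not the report's dataset. It relies on the usual definition: the contribution (in %) of a variable to a dimension is 100 times the squared coordinate of that variable in the unit eigenvector of the dimension.

```python
import numpy as np

# Hypothetical values for six channels (NOT the report's data).
names = ["views", "likes", "subs_Morocco"]
X = np.array([
    [120.0, 11.0, 3.0],
    [80.0, 9.0, 9.0],
    [75.0, 6.0, 4.0],
    [40.0, 5.0, 8.0],
    [30.0, 2.0, 2.0],
    [10.0, 1.0, 6.0],
])

# PCA on standardized data is the eigen-decomposition of the correlation matrix.
R = np.corrcoef(X, rowvar=False)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]          # largest eigenvalue first
vals, vecs = vals[order], vecs[:, order]

# Contribution (%) of variable j to dimension k: squared j-th entry of
# the k-th unit eigenvector, times 100. Each column sums to 100.
contrib = 100 * vecs ** 2

for name, row in zip(names, contrib):
    print(f"{name:>14}: Dim1 {row[0]:5.1f} %   Dim2 {row[1]:5.1f} %")
```

Variables that are strongly correlated with each other tend to share the contribution to the same dimension, while a variable that is weakly correlated with the rest tends to dominate a dimension of its own.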
We can also get this information from these graphs:
In these graphs we can see which individuals contribute the most to each
dimension.
For the first dimension, individual 1 ("saad lamjarred") is the top contributor,
with more than 87%, while the others contribute only small amounts.
For the second dimension, individual 1 ("saad lamjarred") does not contribute
at all, while 2 ("choufTV") is the top contributor with more than 35%,
followed by 3 ("cuisine Halima filali") with less than 20%, then 19 ("Syblus")
and 9 ("Jamal alpha") with less than 10% each.
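The contribution of an individual to a dimension can be sketched in the same spirit. The snippet below is an illustration in Python with NumPy under stated assumptions: the values are hypothetical (not the report's data), every individual has the same weight, and the data is standardized with the population standard deviation. With n individuals, the contribution (in %) of individual i to dimension k is then 100 times its squared coordinate on that dimension, divided by n times the eigenvalue of the dimension.

```python
import numpy as np

# Hypothetical values for six channels (NOT the report's data).
# Columns: views, likes, subs_Morocco. The first channel is far above the others.
X = np.array([
    [900.0, 95.0, 5.0],
    [80.0, 9.0, 9.0],
    [75.0, 6.0, 4.0],
    [40.0, 5.0, 8.0],
    [30.0, 2.0, 2.0],
    [10.0, 1.0, 6.0],
])
n = X.shape[0]

# Standardize each column (population standard deviation, divisor n).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigen-decomposition of the correlation matrix, largest eigenvalue first.
R = Z.T @ Z / n
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# Coordinates of the individuals on the first two dimensions.
F = Z @ vecs[:, :2]

# Contribution (%) of each individual to each dimension; columns sum to 100.
contrib = 100 * F ** 2 / (n * vals[:2])

print(np.round(contrib, 1))
```

An individual whose values are far from the average on the variables that define a dimension ends up with a very large share of that dimension, which is the pattern described above for the first dimension.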
These graphs present the correlation of each variable in our dataset with the
dimensions:
In those graphs, we can clearly see that views, comments, likes and total
subscribers are the variables that are well represented on the first
dimension, while the number of Moroccan subscribers is well represented on the
second one.
Based on the first graph (see next pages), we can see which channels are
popular in Morocco. In that graph, "saad lamjarred" is well represented on
Dim1, which suggests that he has a lot of
For the others, we see that they are reasonably well represented on both axes,
for example "asmaa beauty", "cuisine Halima filali", "syblus"...
For the second and third graphs, we find the same results, but this time we do
not target any specific channel and focus instead on the category the channels
belong to. The number of channels in "daily vlogs" is higher than in the other
categories, although some of them are not well represented.
For "music & dance", we notice a high level of diversity between channels: for
example, "saad lamjarred" is well represented while "Hatim amour" is not
represented with the same quality. That is normal, because a music & dance
channel mainly attracts people who are interested in one particular artist,
that is, fans.
We can also conclude that the "daily vlogs" category is more active and
attracts a lot of people (especially women), and that its channels more or
less share their viewers, so a person who watches "Asmaa beauty" probably also
watches "cuisine Halima filali".
For the "news & politics" channels, we see the same thing as in the music
field: "chouf Tv" is well represented, especially on Dim2, while the others
form a small group that is not well represented on either Dim1 or Dim2.
figure 17: confidence ellipses around the categories of Category and Channels
4.2-Conclusion
When we analyze this dataset, we get a rough idea of what and whom Moroccan
people watch the most. From this small analysis, we can say that the top 5
YouTubers in this dataset, which contains 25 YouTubers, are:
● saad lamjarred
● Chouf TV
● cuisine Halima filali
● Jamal alpha
● baraka lbaraka
and for the top main categories that have more than one channel we have
These lines set up the PCA method in Python. There is more code, but these
lines are the core of what we need, and the scree plot and the other plots
come out the same as the ones produced with R.
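The Python code itself is not reproduced in this report, so the following is only a sketch of what such a setup could look like. It assumes NumPy and a singular value decomposition rather than whatever library was actually used, and it runs on hypothetical values for six channels, not on the report's data. It produces the percentages of explained variance that a scree plot displays, together with the coordinates of the individuals.

```python
import numpy as np

# Hypothetical values for six channels (NOT the report's data).
X = np.array([
    [120.0, 11.0, 3.0],
    [80.0, 9.0, 9.0],
    [75.0, 6.0, 4.0],
    [40.0, 5.0, 8.0],
    [30.0, 2.0, 2.0],
    [10.0, 1.0, 6.0],
])
n = X.shape[0]

# Standardize each variable (mean 0, population standard deviation 1).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# SVD of the standardized data; singular values come back in decreasing order.
U, S, Vt = np.linalg.svd(Z, full_matrices=False)

# Eigenvalues of the correlation matrix and percentage of variance per dimension.
eigenvalues = S ** 2 / n
explained = 100 * eigenvalues / eigenvalues.sum()

# Coordinates of the individuals on the principal axes.
scores = U * S

print("explained variance (%):", np.round(explained, 1))
print("coordinates on Dim1 and Dim2:")
print(np.round(scores[:, :2], 2))
```

Plotting `explained` as a bar chart gives the scree plot, and plotting the first two columns of `scores` gives the map of individuals.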
Examples :
After writing some more code, I created this small platform, which can rank
any selected channels from the dataset and present the result in a clear way.
https://acpyoutube.herokuapp.com/
General Conclusion
To conclude, PCA is one of the easiest and most effective methods for
analyzing data, even with a lot of variables. Of course, there are many other
methods that give the same results or even better ones. The only remaining
difficulty is to give these results a meaningful interpretation, which is hard
and error-prone because of the human factor and because of errors in the
chosen data itself.
In this small project, we applied the PCA method to Moroccan YouTube data and
obtained some results that I personally found very interesting, because they
raised a lot of questions, such as: who watches YouTube and engages with its
content more, men or women? Answering this question requires more data about
users, which is almost impossible to obtain unless it comes from YouTube
itself. We can still make predictions from what we have right now, and I put
forward the following hypothesis: "older women watch and engage more with
daily vlogs content on YouTube". It may be true or false, but it remains a
hypothesis to be accepted or rejected.