You are on page 1of 22

11/13/2022 Data Mining

Project
Advanced Statistics

Contents:
Clustering:
Digital Ads Data:

The ads24x7 is a Digital Marketing company which has now got seed funding of $10 Million. They are
expanding their wings in Marketing Analytics. They collected data from their Marketing Intelligence team
and now wants you (their newly appointed data analyst) to segment type of ads based on the features
provided. Use Clustering procedure to segment ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000

CPC = Total Cost (spend) / Number of Clicks

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100

The Data Dictionary and the detailed description of the formulas for CPM, CPC and CTR are given in the
sheet 2 of the Clustering Clean Ads_Data1 Excel File.

Perform the following in given order:

• Read the data and perform basic analysis such as printing a few rows (head and tail), info, data
summary, null values duplicate values, etc.
• Treat missing values in CPC, CTR and CPM using the formula given. You may refer to the Bank_KMeans
Case Study to understand the coding behind treating the missing values using a specific formula. You have
to basically create an user defined function and then call the function for imputing.
• Check if there are any outliers.
• Do you think treating outliers is necessary for K-Means clustering? Based on your judgement decide
whether to treat outliers and if yes, which method to employ. (As an analyst your judgement may be
different from another analyst).
• Perform z-score scaling and discuss how it affects the speed of the algorithm.
• Perform clustering and do the following:
• Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean distance.
• Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm.
• Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
• Profile the ads based on optimum number of clusters using silhouette score and your domain
understanding
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks, spend, revenue, CPM,
CTR, & CPC based on Device Type. Make bar plots.]
• Conclude the project by providing summary of your learnings.

PCA:
PCA FH (FT): Primary census abstract for female headed households excluding institutional households
(India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female Headed Household Excluding
Institutional Household. The Indian Census has the reputation of being one of the best in the world. The
first Census in India was conducted in the year 1872. This was conducted at different points of time in
different parts of the country. In 1881 a Census was taken for the entire country simultaneously. Since
then, Census has been conducted every ten years, without a break. Thus, the Census of India 2011 was the
fifteenth in this unbroken series since 1872, the seventh after independence and the second census of the
third millennium and twenty first century. The census has been uninterruptedly continued despite of

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
2|Page
Advanced Statistics

several adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of India is
conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary Census
Abstract which is important publication of 2011 Census gives basic information on Area, Total Number of
Households, Total Population, Scheduled Castes, Scheduled Tribes Population, Population in the age group
0-6, Literates, Main Workers and Marginal Workers classified by the four broad industrial categories,
namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other Workers
and also non-Workers. The characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and rural-urban residence. Census
2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867
Villages.
The data collected has so many variables thus making it difficult to find useful details without using Data
Science Techniques. You are tasked to perform detailed EDA and identify Optimum Principal Components
that explains the most variance in data. Use Sklearn only.
Data file - PCA India Data Census.xlsx

• Read the data and perform basic checks like checking head, info, summary, nulls, and duplicates, etc.

• Perform detailed Exploratory analysis by creating certain questions like (i) Which state has highest
gender ratio and which has the lowest? (ii) Which district has the highest & lowest gender ratio?
(Example Questions). Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M,
TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F,
MAINWORK_M, MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M,
MAIN_HH_F, MAIN_OT_M, MAIN_OT_F

• We choose not to treat outliers for this case. Do you think that treating outliers for this case is
necessary?

• Scale the Data using z-score method. Does scaling have any impact on outliers? Compare boxplots before
and after scaling and comment.

• Perform all the required steps for PCA (use sklearn only) Create the covariance Matrix Get eigen values
and eigen vector.

• Identify the optimum number of PCs (for this project, take at least 90% explained variance). Show Scree
plot.

• Compare PCs with Actual Columns and identify which is explaining most variance. Write inferences about
all the principal components in terms of actual variables.

• Write linear equation for first PC.

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
3|Page
Advanced Statistics

Solutions:
Clustering:
Digital Ads Data:

The ads24x7 is a Digital Marketing company which has now got seed funding of $10 Million. They are
expanding their wings in Marketing Analytics. They collected data from their Marketing Intelligence team
and now wants you (their newly appointed data analyst) to segment type of ads based on the features
provided. Use Clustering procedure to segment ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000

CPC = Total Cost (spend) / Number of Clicks

CTR = Total Measured Clicks / Total Measured Ad Impressions x 100

The Data Dictionary and the detailed description of the formulas for CPM, CPC and CTR are given in the
sheet 2 of the Clustering Clean Ads_Data1 Excel File.

Perform the following in given order:

Part 1 - Clustering: Read the data and perform basic analysis such as printing a few rows (head and tail),
info, data summary, null values duplicate values, etc.
Ans.:

• Top 5 rows of the Data frame:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
4|Page
Advanced Statistics

• Bottom 5 rows of the Data frame:

• Basic info about the Data frame:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
5|Page
Advanced Statistics

• Data frame Summary:

• Data frame Null-value check:

• Data frame duplicate value check:

Total duplicate values: 0

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
6|Page
Advanced Statistics

Part 1 - Clustering: Treat missing values in CPC, CTR and CPM using the formula given.
Ans.:

Part 1 - Clustering: Check if there are any outliers. Do you think treating outliers is necessary for K-
Means clustering? Based on your judgement decide whether to treat outliers and if yes, which method
to employ. (As an analyst your judgement may be different from another analyst).
Ans.:

After treating Outliers :

• Yes, we need to treat Outliers as K-means Clustering is sensitive to outliers.

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
7|Page
Advanced Statistics

Part 1 - Clustering: Perform z-score scaling and discuss how it affects the speed of the algorithm.
Ans.:

• Without Scaling data, the algorithm may be biased towards higher value.

Part 1 - Clustering: Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean
distance.
Ans.:

Part 1 - Clustering: Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means
algorithm.
Ans.:
The WSS value for 1 cluster is 232713.00000000006
The WSS value for 2 clusters is 135274.9268314021
The WSS value for 3 clusters is 100590.2395311129
The WSS value for 4 clusters is 71656.59481682391
The WSS value for 5 clusters is 45771.31324276951
The WSS value for 6 clusters is 37438.815811017026
The WSS value for 7 clusters is 30149.7112338386
The WSS value for 8 clusters is 23382.874391416677
The WSS value for 9 clusters is 18790.993332464503
The WSS value for 10 clusters is 16544.499210561502

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
8|Page
Advanced Statistics

#So, from the Dendrogram we can say optimum number of clusters: ' 5 '

Part 1 - Clustering: Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
Ans.:
The Silhouette scores for 2 clusters is 0.43093038125940913
The Silhouette scores for 3 clusters is 0.4169029019588384
The Silhouette scores for 4 clusters is 0.4859045662423113
The Silhouette scores for 5 clusters is 0.5484421685630947
The Silhouette scores for 6 clusters is 0.5554079926857388
The Silhouette scores for 7 clusters is 0.5882973964429631
The Silhouette scores for 8 clusters is 0.6005106775133303
The Silhouette scores for 9 clusters is 0.6298955511943023
The Silhouette scores for 10 clusters is 0.6296839903311501

#So, from the Silhouette scores we can say optimum number of clusters: ' 5 '

Part 1 - Clustering: Profile the ads based on optimum number of clusters using silhouette score and your
domain understanding [Hint: Group the data by clusters and take sum or mean to identify trends in
Clicks, spend, revenue, CPM, CTR, & CPC based on Device Type. Make bar plots].
Ans.:
array ([1, 1, 1, ..., 3, 1, 3])

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
9|Page
Advanced Statistics

• Total Count per cluster:

Cluster Profiling:

Part 1 - Clustering: Conclude the project by providing summary of your learnings.


Ans.:
• The dataset has 25857 rows and 19 columns.

•The missing values in CPC, CTR and CPM are treated by using the formulae given and writing a user-
defined function, and calling it.

•We check for outliers; we can see there are outliers in the variables.

•Dendrogram is the visualization and linkage are for computing the distances and merging the clusters
from n to 1.

•The output of Linkage is visualized by Dendrogram.

•We will create linkage using Ward’s method and run linkage function on the usable columns of the data.

•The linkage now stores the various distance at which the n clusters are sequentially merged into a single
cluster.

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
10 | P a g e
Advanced Statistics

•using fit – transform function and viewing the output - The data frame is now stored in an array.

•Using this array, we can now perform k-means

•The one requirement before we run the k-means algorithm, is to know how many clusters we require as
output

•We map the elbow plot using wss values

•From the plot we have following observations:

•When we move from k=1 to k=2, we see that there is a significant drop in the value, also when we move
from k=2 to k=3, k=3 to k=4 there is a significant drop as well.

•But from k=4 to k=5, k=5 to k=6, the drop in values reduces significantly.

•In other words, the wss is not significantly dropping beyond 5,

•So, 5 is optimal number of clusters.

PCA:
PCA FH (FT): Primary census abstract for female headed households excluding institutional households
(India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female Headed Household Excluding
Institutional Household. The Indian Census has the reputation of being one of the best in the world. The
first Census in India was conducted in the year 1872. This was conducted at different points of time in
different parts of the country. In 1881 a Census was taken for the entire country simultaneously. Since
then, Census has been conducted every ten years, without a break. Thus, the Census of India 2011 was the
fifteenth in this unbroken series since 1872, the seventh after independence and the second census of the
third millennium and twenty first century. The census has been uninterruptedly continued despite of
several adversities like wars, epidemics, natural calamities, political unrest, etc. The Census of India is
conducted under the provisions of the Census Act 1948 and the Census Rules, 1990. The Primary Census
Abstract which is important publication of 2011 Census gives basic information on Area, Total Number of
Households, Total Population, Scheduled Castes, Scheduled Tribes Population, Population in the age group
0-6, Literates, Main Workers and Marginal Workers classified by the four broad industrial categories,
namely, (i) Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other Workers
and also non-Workers. The characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and rural-urban residence. Census
2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and 6,40,867
Villages.
The data collected has so many variables thus making it difficult to find useful details without using Data
Science Techniques. You are tasked to perform detailed EDA and identify Optimum Principal Components
that explains the most variance in data. Use Sklearn only.
Data file - PCA India Data Census.xlsx

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
11 | P a g e
Advanced Statistics

Part 2 - PCA: Read the data and perform basic checks like checking head, info, summary, nulls, and
duplicates, etc.

Ans.:
• Top 5 rows of the Data frame:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
12 | P a g e
Advanced Statistics

Bottom 5 rows of the Data frame:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
13 | P a g e
Advanced Statistics

• Basic info about the Data frame:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
14 | P a g e
Advanced Statistics

• Data frame Summary:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
15 | P a g e
Advanced Statistics

• Data frame Null-value check:

• Data frame duplicate value check:

Total duplicate values: 0

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
16 | P a g e
Advanced Statistics

Part 2 - PCA: Perform detailed Exploratory analysis by creating certain questions like (i) Which state has
highest gender ratio and which has the lowest? (ii) Which district has the highest & lowest gender ratio?
(Example Questions). Pick 5 variables out of the given 24 variables below for EDA: No_HH, TOT_M,
TOT_F, M_06, F_06, M_SC, F_SC, M_ST, F_ST, M_LIT, F_LIT, M_ILL, F_ILL, TOT_WORK_M, TOT_WORK_F,
MAINWORK_M, MAINWORK_F, MAIN_CL_M, MAIN_CL_F, MAIN_AL_M, MAIN_AL_F, MAIN_HH_M,
MAIN_HH_F, MAIN_OT_M, MAIN_OT_F

Ans.: Choose these 5 variables for EDA: 'No_HH', 'TOT_M', 'TOT_F', 'TOT_WORK_M', 'TOT_WORK_F'.

• Univariate Analysis:

• From the Univariate Analysis we can say all variables are Left Skewed here and all are having
Outliers.

• Bi-variate Analysis:

• From the Bivariate Analysis we can say all variables are Positively Co-related to each other.

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
17 | P a g e
Advanced Statistics

Part 2 - PCA: We choose not to treat outliers for this case. Do you think that treating outliers for this
case is necessary?

Ans.: Outliers treatment is not necessary unless the they are the result from a processing mistake or
wrong measurement. True outliers must be kept in the data.

Part 2 - PCA: Scale the Data using z-score method. Does scaling have any impact on outliers? Compare
boxplots before and after scaling and comment.

Ans.: Presence of outliers before z-score method :

Scaled Data frame after applying Z-score :

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
18 | P a g e
Advanced Statistics

Presence of outliers after z-score method :

#So, we can clearly see that scaling have no impact on outliers.

Part 2 - PCA: Perform all the required steps for PCA (use sklearn only) Create the covariance Matrix Get
eigen values and eigen vector.

Ans.: covariance Matrix:

Eigen vectors:

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
19 | P a g e
Advanced Statistics

Eigen values:

Part 2 - PCA: Identify the optimum number of PCs (for this project, take at least 90% explained variance).
Show Scree plot.

Ans.:

Cumulative explained variance ratio to find a cut off for selecting the number of PCs:

• For this project, we need to consider at least 90% explained variance, so cut off for selecting the
number of PCs is: '6'

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
20 | P a g e
Advanced Statistics

Part 2 - PCA: Compare PCs with Actual Columns and identify which is explaining most variance. Write
inferences about all the principal components in terms of actual variables.

Ans.: Compare how the original features influence various PCs:

#Heatmap,Compare how the original features influence various PCs

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
21 | P a g e
Advanced Statistics

#Extract the required number of PCs (6 in our case):

#Check for presence of correlations among the PCs:

Part 2 - PCA: Write linear equation for first PC.

Ans.: PC1 = a1x1 + a2x2 + a3X3 +a4X4 + ……. + a57x57

Proprietary content. ©Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.
22 | P a g e

You might also like