
Data Mining Business Report

P L Lohitha

07/11/22

CONTENTS

S.NO Section

1 Problem 1
2 EDA
3 Hierarchical clustering
4 K-Means clustering
5 Summary
6 Problem 2
7 EDA
8 PCA

Problem 1
Digital Ads Data:

ads24x7 is a digital marketing company that has now received seed funding of $10 million. It is
expanding into marketing analytics. The company collected data from its Marketing Intelligence
team and now wants you (its newly appointed data analyst) to segment the types of ads based on the
features provided. Use a clustering procedure to segment the ads into homogeneous groups.

The following three features are commonly used in digital marketing:

CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign
Spend refers to the 'Spend' Column in the dataset and the Number of Impressions refers to the
'Impressions' Column in the dataset. 
CPC = Total Cost (spend) / Number of Clicks.  Note that the Total Cost (spend) refers to the
'Spend' Column in the dataset and the Number of Clicks refers to the 'Clicks' Column in the dataset. 
CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that the Total
Measured Clicks refers to the 'Clicks' Column in the dataset and the Total Measured Ad Impressions
refers to the 'Impressions' Column in the dataset. 

Perform the following in given order:

 Read the data and perform basic analysis such as printing a few rows (head and tail), info,
data summary, null values, duplicate values, etc.
 Treat missing values in CPC, CTR and CPM using the formulas given. You basically have to
create a user-defined function and then call the function for imputing.
 Check if there are any outliers.
 Do you think treating outliers is necessary for K-Means clustering? Based on your judgement,
decide whether to treat outliers and, if yes, which method to employ. (As an analyst, your
judgement may be different from another analyst's.)
 Perform z-score scaling and discuss how it affects the speed of the algorithm.
 Perform clustering and do the following:
 Perform hierarchical clustering by constructing a dendrogram using WARD and Euclidean distance.
 Make an elbow plot (up to n=10) and identify the optimum number of clusters for the k-means algorithm.
 Print silhouette scores for up to 10 clusters and identify the optimum number of clusters.
 Profile the ads based on optimum number of clusters using silhouette score and your domain
understanding
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks, spend,
revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
 Conclude the project by providing summary of your learnings.

Observations:
 The required libraries are imported and the data is loaded into the code file.
 The first 5 rows of the dataset are displayed.

 The last 5 rows of the dataset are displayed.

 The dataset has 23066 rows and 19 columns.


(23066, 19)

 The dataset has 6 float columns, 7 integer columns and 6 object-type columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23066 entries, 0 to 23065
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 23066 non-null object
1 InventoryType 23066 non-null object
2 Ad - Length 23066 non-null int64
3 Ad- Width 23066 non-null int64
4 Ad Size 23066 non-null int64
5 Ad Type 23066 non-null object
6 Platform 23066 non-null object
7 Device Type 23066 non-null object
8 Format 23066 non-null object
9 Available_Impressions 23066 non-null int64
10 Matched_Queries 23066 non-null int64
11 Impressions 23066 non-null int64
12 Clicks 23066 non-null int64
13 Spend 23066 non-null float64
14 Fee 23066 non-null float64
15 Revenue 23066 non-null float64
16 CTR 18330 non-null float64
17 CPM 18330 non-null float64
18 CPC 18330 non-null float64
dtypes: float64(6), int64(7), object(6)
memory usage: 3.3+ MB

 The columns CTR, CPM and CPC each have 4,736 missing values (18,330 non-null entries out of 23,066).
 The summary of the data is shown below.

 The null-value count per column confirms that only CTR, CPM and CPC contain nulls.
Timestamp 0
InventoryType 0
Ad - Length 0
Ad- Width 0
Ad Size 0
Ad Type 0
Platform 0
Device Type 0
Format 0
Available_Impressions 0
Matched_Queries 0
Impressions 0
Clicks 0
Spend 0
Fee 0
Revenue 0
CTR 4736
CPM 4736
CPC 4736
dtype: int64

 There are no duplicate values in the data.

 To treat the missing values in CPM, CTR and CPC, we applied the formulas given in the problem
statement. After imputation, no null values remain.
Timestamp 0
InventoryType 0
Ad - Length 0
Ad- Width 0
Ad Size 0
Ad Type 0
Platform 0
Device Type 0
Format 0
Available_Impressions 0
Matched_Queries 0
Impressions 0
Clicks 0
Spend 0
Fee 0
Revenue 0
CTR 0
CPM 0
CPC 0
dtype: int64
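
The imputation step can be sketched as a small user-defined function. This is a minimal illustration, assuming a pandas DataFrame with the column names shown in the listings above; the function name `impute_ad_metrics`, the sample values and the handling of zero denominators are assumptions, not the report's actual code.

```python
import numpy as np
import pandas as pd


def impute_ad_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CTR, CPM and CPC using the formulas from the problem statement."""
    out = df.copy()
    # Assumption: treat zero denominators as missing so the division gives NaN, not inf.
    impressions = out["Impressions"].replace(0, np.nan)
    clicks = out["Clicks"].replace(0, np.nan)
    # fillna only touches rows where the metric is missing; existing values are kept.
    out["CTR"] = out["CTR"].fillna(out["Clicks"] / impressions * 100)
    out["CPM"] = out["CPM"].fillna(out["Spend"] / impressions * 1000)
    out["CPC"] = out["CPC"].fillna(out["Spend"] / clicks)
    return out


# Hypothetical sample rows, not taken from the dataset.
sample = pd.DataFrame({
    "Impressions": [1000, 2000, 500],
    "Clicks": [10, 40, 5],
    "Spend": [5.0, 8.0, 1.0],
    "CTR": [1.0, np.nan, np.nan],
    "CPM": [5.0, np.nan, np.nan],
    "CPC": [0.5, np.nan, np.nan],
})
filled = impute_ad_metrics(sample)
print(filled[["CTR", "CPM", "CPC"]])
```

Using `fillna` rather than overwriting the columns keeps the values that were already present and only computes the missing ones.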

 There are outliers in the data for the columns 'Ad Size', 'Available_Impressions',
'Matched_Queries', 'Impressions', 'Clicks', 'Spend', 'Fee', 'Revenue', 'CTR', 'CPM', 'CPC' .

 K-means clustering is very sensitive to outliers because the cluster centroids are means, so
treating outliers is an important step for reliable results.
 Outliers are treated by capping: values above the upper limit are replaced with the upper limit
and values below the lower limit are replaced with the lower limit.
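
The capping rule described above might look like the following. The report does not state how the limits were computed; this sketch assumes the common IQR rule (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR), and the function name and sample values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to the nearest limit."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series.clip(lower=lower, upper=upper)


# Hypothetical values with one obvious outlier.
spend = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
capped = cap_outliers_iqr(spend)
print(capped.tolist())
```

Capping keeps every row in the dataset, which matters here because dropping rows would remove ads from the segmentation altogether.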

 We can infer from the above graph that only a few variables are correlated with each other.

 There is a strong positive correlation between the variables ‘impressions’, ‘available
impressions’, and ‘matched queries’.
 We create a new data frame containing only the fields needed for further processing.
 After creating the new data frame, we scale the data using the z-score technique.
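
As a rough illustration of z-score scaling, the sketch below uses scikit-learn's `StandardScaler` on a tiny made-up array. The values are hypothetical; in the report the input would be the numeric clustering features.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two hypothetical features on very different scales.
X = np.array([[1000.0, 0.5],
              [2000.0, 1.5],
              [3000.0, 2.5]])

# Each column is transformed to (x - mean) / std, using the population std (ddof=0).
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled)
```

After scaling, both columns contribute on the same footing to Euclidean distances, which is what K-means and Ward linkage rely on.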

 We scale the data so that every feature has zero mean and unit variance. This prevents features
with large ranges from dominating the Euclidean distances, which helps generate good-quality
clusters and can help distance-based algorithms converge in fewer iterations.
 We performed hierarchical clustering and plotted a dendrogram using Ward's linkage method;
the dendrogram suggests that there are 3 clusters.

 Since the full dendrogram is hard to read, we truncate it to show only 8 clusters.
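
A minimal sketch of Ward linkage with a truncated dendrogram, using SciPy on synthetic data. The data, the `truncate_mode="lastp"` setting and the cut at 3 clusters are assumptions for illustration; the report does not show its exact calls.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic data: three well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 10], [-10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

# Ward linkage on Euclidean distances.
Z = linkage(X, method="ward", metric="euclidean")

# Show only the last 8 merged clusters; no_plot=True returns the layout without drawing.
tree = dendrogram(Z, truncate_mode="lastp", p=8, no_plot=True)

# Cut the tree into 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels, return_counts=True))
```

Dropping `no_plot=True` (with matplotlib available) draws the truncated dendrogram instead of only returning its layout.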

 The ideal number of clusters is 3 according to hierarchical clustering.
 Next, we perform k-means clustering on the same data and identify the optimum
number of clusters.

 From the elbow curve we can see that 5 clusters are ideal; we verify this by
calculating the silhouette score.
 After calculating the silhouette score for 2 to 10 clusters (the score is not defined for a
single cluster), we can conclude that 5 clusters give the highest silhouette score, 0.49.
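
The elbow and silhouette checks can be sketched as below. The data is synthetic (four separated blobs), so the chosen number of clusters differs from the report's; the variable names and `KMeans` settings are assumptions. Note that the silhouette score needs at least two clusters, so it is only computed from k = 2.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic data: four well-separated groups of 50 points each.
rng = np.random.default_rng(42)
centers = np.array([[0, 0], [8, 8], [-8, 8], [8, -8]])
X = np.vstack([c + rng.normal(scale=0.6, size=(50, 2)) for c in centers])

wss = {}  # within-cluster sum of squares per k (values for the elbow plot)
sil = {}  # silhouette score per k
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    wss[k] = model.inertia_
    if k >= 2:  # silhouette is undefined for a single cluster
        sil[k] = silhouette_score(X, model.labels_)

best_k = max(sil, key=sil.get)
print("k with the highest silhouette score:", best_k)
```

Plotting `wss` against k gives the elbow curve, while `sil` gives a numeric criterion that can confirm or contradict the visual reading of the elbow.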

 The above visualization shows the clusters formed by the model for each feature. We can also
infer from the graph that the features ‘available impressions’, ‘impressions’, and ‘matched
queries’ show the largest differences between the clusters, so these features can be used to
segregate the ads based on their attributes.
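
Cluster profiling of the kind suggested in the problem hint can be sketched with a pandas group-by. The `cluster` column name and all values below are hypothetical; in practice the labels would come from the fitted K-means model.

```python
import pandas as pd

# Hypothetical ads with a cluster label already attached.
ads = pd.DataFrame({
    "Device Type": ["Mobile", "Desktop", "Mobile", "Desktop"],
    "cluster": [0, 0, 1, 1],
    "Clicks": [10, 20, 30, 40],
    "Spend": [1.0, 2.0, 3.0, 4.0],
    "Revenue": [1.5, 2.5, 3.5, 4.5],
})

# Mean of each metric per cluster and device type.
profile = ads.groupby(["cluster", "Device Type"])[["Clicks", "Spend", "Revenue"]].mean()
print(profile)
```

Calling `profile.plot(kind="bar")` on such a table is one way to produce the bar plots mentioned in the hint.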

Summary:
 ads24x7 is a digital marketing company that wants its ads segmented into homogeneous groups.
We do this by clustering the given data and grouping the ads accordingly.

 We first loaded the libraries, imported the data and started with EDA. The data frame has 23066 rows
and 19 columns.

 There are no duplicate rows. The missing values in CTR, CPM and CPC were imputed using the given
formulas, and the outliers were treated before clustering because clustering is very sensitive to outliers.

 After checking the correlation between the variables, we performed hierarchical clustering by constructing a
dendrogram using Ward's linkage method, which suggested that 3 clusters would be optimum.

 We also performed k-means clustering. We fitted the model for different numbers of clusters, computed the
within-cluster sum of squares for each and plotted the values as an elbow curve, which suggests that 5 clusters
would be optimum.

 We then calculated the silhouette score and observed that 5 clusters give the maximum silhouette score (0.49).

 Impressions, available impressions and matched queries are the features that differ most between the
clusters.

 We can therefore conclude that the ads should be classified into 5 segments based on impressions,
available impressions and matched queries, and that marketing campaigns should be designed with these
5 groups in mind.

Problem 2
PCA FH (FT): Primary census abstract for female headed households excluding institutional
households (India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female Headed
Household Excluding Institutional Household. The Indian Census has the reputation of being one of
the best in the world. The first Census in India was conducted in the year 1872. This was conducted
at different points of time in different parts of the country. In 1881 a Census was taken for the entire
country simultaneously. Since then, Census has been conducted every ten years, without a break.
Thus, the Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh after
independence and the second census of the third millennium and twenty first century. The census
has continued uninterrupted despite several adversities like wars, epidemics, natural
calamities, political unrest, etc. The Census of India is conducted under the provisions of the Census
Act 1948 and the Census Rules, 1990. The Primary Census Abstract, which is an important publication of
the 2011 Census, gives basic information on Area, Total Number of Households, Total Population,
Scheduled Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main
Workers and Marginal Workers classified by the four broad industrial categories, namely, (i)
Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other Workers and
also Non-Workers. The characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and rural-urban residence.
Census 2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and
6,40,867 Villages.
The data collected has so many variables that it is difficult to find useful details without using
data science techniques. You are tasked to perform detailed EDA and identify the optimum principal
components that explain the most variance in the data. Use sklearn only.

Observations:
 The required libraries are imported and the data is loaded into the code file.
 The first 5 rows of the dataset are displayed.

 The last 5 rows of the dataset are displayed.

 The dataset has 640 rows and 61 columns.
(640, 61)

 There are 59 integer variables and 2 object variables

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State Code 640 non-null int64
1 Dist.Code 640 non-null int64
2 State 640 non-null object
3 Area Name 640 non-null object
4 No_HH 640 non-null int64
5 TOT_M 640 non-null int64
6 TOT_F 640 non-null int64
7 M_06 640 non-null int64
8 F_06 640 non-null int64
9 M_SC 640 non-null int64
10 F_SC 640 non-null int64
11 M_ST 640 non-null int64
12 F_ST 640 non-null int64
13 M_LIT 640 non-null int64
14 F_LIT 640 non-null int64
15 M_ILL 640 non-null int64
16 F_ILL 640 non-null int64
17 TOT_WORK_M 640 non-null int64
18 TOT_WORK_F 640 non-null int64
19 MAINWORK_M 640 non-null int64
20 MAINWORK_F 640 non-null int64
21 MAIN_CL_M 640 non-null int64
22 MAIN_CL_F 640 non-null int64
23 MAIN_AL_M 640 non-null int64
24 MAIN_AL_F 640 non-null int64
25 MAIN_HH_M 640 non-null int64
26 MAIN_HH_F 640 non-null int64
27 MAIN_OT_M 640 non-null int64
28 MAIN_OT_F 640 non-null int64
29 MARGWORK_M 640 non-null int64
30 MARGWORK_F 640 non-null int64
31 MARG_CL_M 640 non-null int64
32 MARG_CL_F 640 non-null int64
33 MARG_AL_M 640 non-null int64
34 MARG_AL_F 640 non-null int64
35 MARG_HH_M 640 non-null int64
36 MARG_HH_F 640 non-null int64
37 MARG_OT_M 640 non-null int64
38 MARG_OT_F 640 non-null int64
39 MARGWORK_3_6_M 640 non-null int64
40 MARGWORK_3_6_F 640 non-null int64
41 MARG_CL_3_6_M 640 non-null int64
42 MARG_CL_3_6_F 640 non-null int64
43 MARG_AL_3_6_M 640 non-null int64
44 MARG_AL_3_6_F 640 non-null int64
45 MARG_HH_3_6_M 640 non-null int64
46 MARG_HH_3_6_F 640 non-null int64
47 MARG_OT_3_6_M 640 non-null int64
48 MARG_OT_3_6_F 640 non-null int64
49 MARGWORK_0_3_M 640 non-null int64
50 MARGWORK_0_3_F 640 non-null int64
51 MARG_CL_0_3_M 640 non-null int64
52 MARG_CL_0_3_F 640 non-null int64
53 MARG_AL_0_3_M 640 non-null int64
54 MARG_AL_0_3_F 640 non-null int64
55 MARG_HH_0_3_M 640 non-null int64
56 MARG_HH_0_3_F 640 non-null int64
57 MARG_OT_0_3_M 640 non-null int64
58 MARG_OT_0_3_F 640 non-null int64
59 NON_WORK_M 640 non-null int64
60 NON_WORK_F 640 non-null int64
dtypes: int64(59), object(2)
memory usage: 305.1+ KB

 We need to convert the State Code and Dist.Code columns to object type, since they are identifiers rather than numeric measures.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State Code 640 non-null object
1 Dist.Code 640 non-null object
2 State 640 non-null object
3 Area Name 640 non-null object
4 No_HH 640 non-null int64
5 TOT_M 640 non-null int64
6 TOT_F 640 non-null int64
7 M_06 640 non-null int64
8 F_06 640 non-null int64
9 M_SC 640 non-null int64
10 F_SC 640 non-null int64
11 M_ST 640 non-null int64
12 F_ST 640 non-null int64
13 M_LIT 640 non-null int64
14 F_LIT 640 non-null int64
15 M_ILL 640 non-null int64
16 F_ILL 640 non-null int64
17 TOT_WORK_M 640 non-null int64
18 TOT_WORK_F 640 non-null int64
19 MAINWORK_M 640 non-null int64
20 MAINWORK_F 640 non-null int64
21 MAIN_CL_M 640 non-null int64
22 MAIN_CL_F 640 non-null int64
23 MAIN_AL_M 640 non-null int64
24 MAIN_AL_F 640 non-null int64
25 MAIN_HH_M 640 non-null int64
26 MAIN_HH_F 640 non-null int64
27 MAIN_OT_M 640 non-null int64
28 MAIN_OT_F 640 non-null int64
29 MARGWORK_M 640 non-null int64
30 MARGWORK_F 640 non-null int64
31 MARG_CL_M 640 non-null int64
32 MARG_CL_F 640 non-null int64
33 MARG_AL_M 640 non-null int64
34 MARG_AL_F 640 non-null int64
35 MARG_HH_M 640 non-null int64
36 MARG_HH_F 640 non-null int64
37 MARG_OT_M 640 non-null int64
38 MARG_OT_F 640 non-null int64
39 MARGWORK_3_6_M 640 non-null int64
40 MARGWORK_3_6_F 640 non-null int64
41 MARG_CL_3_6_M 640 non-null int64
42 MARG_CL_3_6_F 640 non-null int64
43 MARG_AL_3_6_M 640 non-null int64
44 MARG_AL_3_6_F 640 non-null int64
45 MARG_HH_3_6_M 640 non-null int64
46 MARG_HH_3_6_F 640 non-null int64
47 MARG_OT_3_6_M 640 non-null int64
48 MARG_OT_3_6_F 640 non-null int64
49 MARGWORK_0_3_M 640 non-null int64
50 MARGWORK_0_3_F 640 non-null int64
51 MARG_CL_0_3_M 640 non-null int64
52 MARG_CL_0_3_F 640 non-null int64
53 MARG_AL_0_3_M 640 non-null int64
54 MARG_AL_0_3_F 640 non-null int64
55 MARG_HH_0_3_M 640 non-null int64
56 MARG_HH_0_3_F 640 non-null int64
57 MARG_OT_0_3_M 640 non-null int64
58 MARG_OT_0_3_F 640 non-null int64
59 NON_WORK_M 640 non-null int64
60 NON_WORK_F 640 non-null int64
dtypes: int64(57), object(4)
memory usage: 305.1+ KB
 There are no null values in the data.
State Code 0
Dist.Code 0
State 0
Area Name 0
No_HH 0
TOT_M 0
TOT_F 0
M_06 0
F_06 0
M_SC 0
F_SC 0
M_ST 0
F_ST 0
M_LIT 0
F_LIT 0
M_ILL 0
F_ILL 0
TOT_WORK_M 0
TOT_WORK_F 0
MAINWORK_M 0
MAINWORK_F 0
MAIN_CL_M 0
MAIN_CL_F 0
MAIN_AL_M 0
MAIN_AL_F 0
MAIN_HH_M 0
MAIN_HH_F 0
MAIN_OT_M 0
MAIN_OT_F 0
MARGWORK_M 0
MARGWORK_F 0
MARG_CL_M 0
MARG_CL_F 0
MARG_AL_M 0
MARG_AL_F 0
MARG_HH_M 0
MARG_HH_F 0
MARG_OT_M 0
MARG_OT_F 0
MARGWORK_3_6_M 0
MARGWORK_3_6_F 0
MARG_CL_3_6_M 0
MARG_CL_3_6_F 0
MARG_AL_3_6_M 0
MARG_AL_3_6_F 0
MARG_HH_3_6_M 0
MARG_HH_3_6_F 0
MARG_OT_3_6_M 0
MARG_OT_3_6_F 0
MARGWORK_0_3_M 0
MARGWORK_0_3_F 0
MARG_CL_0_3_M 0
MARG_CL_0_3_F 0
MARG_AL_0_3_M 0
MARG_AL_0_3_F 0
MARG_HH_0_3_M 0
MARG_HH_0_3_F 0
MARG_OT_0_3_M 0
MARG_OT_0_3_F 0
NON_WORK_M 0
NON_WORK_F 0
dtype: int64

 There are no duplicate values in the data.

 The summary of the data is shown below.

Variable count mean std min 25% 50% 75% max
No_HH 640.0 51222.871875 48135.405475 350.0 19484.00 35837.0 68892.00 310450.0
TOT_M 640.0 79940.576563 73384.511114 391.0 30228.00 58339.0 107918.50 485417.0
TOT_F 640.0 122372.084375 113600.717282 698.0 46517.75 87724.5 164251.75 750392.0
M_06 640.0 12309.098438 11500.906881 56.0 4733.75 9159.0 16520.25 96223.0
F_06 640.0 11942.300000 11326.294567 56.0 4672.25 8663.0 15902.25 95129.0
M_SC 640.0 13820.946875 14426.373130 0.0 3466.25 9591.5 19429.75 103307.0
F_SC 640.0 20778.392188 21727.887713 0.0 5603.25 13709.0 29180.00 156429.0
M_ST 640.0 6191.807813 9912.668948 0.0 293.75 2333.5 7658.00 96785.0
F_ST 640.0 10155.640625 15875.701488 0.0 429.50 3834.5 12480.25 130119.0
M_LIT 640.0 57967.979688 55910.282466 286.0 21298.00 42693.5 77989.50 403261.0
F_LIT 640.0 66359.565625 75037.860207 371.0 20932.00 43796.5 84799.75 571140.0
M_ILL 640.0 21972.596875 19825.605268 105.0 8590.00 15767.5 29512.50 105961.0
F_ILL 640.0 56012.518750 47116.693769 327.0 22367.00 42386.0 78471.00 254160.0
TOT_WORK_M 640.0 37992.407813 36419.537491 100.0 13753.50 27936.5 50226.75 269422.0
TOT_WORK_F 640.0 41295.760938 37192.360943 357.0 16097.75 30588.5 53234.25 257848.0
MAINWORK_M 640.0 30204.446875 31480.915680 65.0 9787.00 21250.5 40119.00 247911.0
MAINWORK_F 640.0 28198.846875 29998.262689 240.0 9502.25 18484.0 35063.25 226166.0
MAIN_CL_M 640.0 5424.342188 4739.161969 0.0 2023.50 4160.5 7695.00 29113.0
MAIN_CL_F 640.0 5486.042188 5326.362728 0.0 1920.25 3908.5 7286.25 36193.0
MAIN_AL_M 640.0 5849.109375 6399.507966 0.0 1070.25 3936.5 8067.25 40843.0
MAIN_AL_F 640.0 8925.995312 12864.287584 0.0 1408.75 3933.5 10617.50 87945.0
MAIN_HH_M 640.0 883.893750 1278.642345 0.0 187.50 498.5 1099.25 16429.0
MAIN_HH_F 640.0 1380.773438 3179.414449 0.0 248.75 540.5 1435.75 45979.0
MAIN_OT_M 640.0 18047.101562 26068.480886 36.0 3997.50 9598.0 21249.50 240855.0
MAIN_OT_F 640.0 12406.035938 18972.202369 153.0 3142.50 6380.5 14368.25 209355.0
MARGWORK_M 640.0 7787.960938 7410.791691 35.0 2937.50 5627.0 9800.25 47553.0
MARGWORK_F 640.0 13096.914062 10996.474528 117.0 5424.50 10175.0 18879.25 66915.0
MARG_CL_M 640.0 1040.737500 1311.546847 0.0 311.75 606.5 1281.00 13201.0
MARG_CL_F 640.0 2307.682813 3564.626095 0.0 630.25 1226.0 2659.25 44324.0
MARG_AL_M 640.0 3304.326562 3781.555707 0.0 873.50 2062.0 4300.75 23719.0
MARG_AL_F 640.0 6463.281250 6773.876298 0.0 1402.50 4020.5 9089.25 45301.0
MARG_HH_M 640.0 316.742188 462.661891 0.0 71.75 166.0 356.50 4298.0
MARG_HH_F 640.0 786.626562 1198.718213 0.0 171.75 429.0 962.50 15448.0
MARG_OT_M 640.0 3126.154687 3609.391821 7.0 935.50 2036.0 3985.25 24728.0
MARG_OT_F 640.0 3539.323438 4115.191314 19.0 1071.75 2349.5 4400.50 36377.0
MARGWORK_3_6_M 640.0 41948.168750 39045.316918 291.0 16208.25 30315.0 57218.75 300937.0
MARGWORK_3_6_F 640.0 81076.323438 82970.406216 341.0 26619.50 56793.0 107924.00 676450.0
MARG_CL_3_6_M 640.0 6394.987500 6019.806644 27.0 2372.00 4630.0 8167.00 39106.0
MARG_CL_3_6_F 640.0 10339.864063 8467.473429 85.0 4351.50 8295.0 15102.00 50065.0
MARG_AL_3_6_M 640.0 789.848438 905.639279 0.0 235.50 480.5 986.00 7426.0
MARG_AL_3_6_F 640.0 1749.584375 2496.541514 0.0 497.25 985.5 2059.00 27171.0
MARG_HH_3_6_M 640.0 2743.635938 3059.586387 0.0 718.75 1714.5 3702.25 19343.0
MARG_HH_3_6_F 640.0 5169.850000 5335.640960 0.0 1113.75 3294.0 7502.25 36253.0
MARG_OT_3_6_M 640.0 245.362500 358.728567 0.0 58.00 129.5 276.00 3535.0
MARG_OT_3_6_F 640.0 585.884375 900.025817 0.0 127.75 320.5 719.25 12094.0
MARGWORK_0_3_M 640.0 2616.140625 3036.964381 7.0 755.00 1681.5 3320.25 20648.0
MARGWORK_0_3_F 640.0 2834.545312 3327.836932 14.0 833.50 1834.5 3610.50 25844.0
MARG_CL_0_3_M 640.0 1392.973438 1489.707052 4.0 489.50 949.0 1714.00 9875.0
MARG_CL_0_3_F 640.0 2757.050000 2788.776676 30.0 957.25 1928.0 3599.75 21611.0
MARG_AL_0_3_M 640.0 250.889062 453.336594 0.0 47.00 114.5 270.75 5775.0
MARG_AL_0_3_F 640.0 558.098438 1117.642748 0.0 109.00 247.5 568.75 17153.0
MARG_HH_0_3_M 640.0 560.690625 762.578991 0.0 136.50 308.0 642.00 6116.0
MARG_HH_0_3_F 640.0 1293.431250 1585.377936 0.0 298.00 717.0 1710.75 13714.0
MARG_OT_0_3_M 640.0 71.379688 107.897627 0.0 14.00 35.0 79.00 895.0
MARG_OT_0_3_F 640.0 200.742188 309.740854 0.0 43.00 113.0 240.00 3354.0
NON_WORK_M 640.0 510.014063 610.603187 0.0 161.00 326.0 604.50 6456.0
NON_WORK_F 640.0 704.778125 910.209225 5.0 220.50 464.5 853.50 10533.0

 We create a new data frame, dropping the columns that are not required for PCA.

 Uttar Pradesh has the highest gender ratio and Dadra and Nagar Haveli has the lowest
gender ratio, for both males and females.

 Mumbai Suburban has the highest gender ratio and Dibang Valley has the lowest
gender ratio among the districts.

 Since the number of districts is very high, the visualisation cannot be displayed clearly.

 Uttar Pradesh has the highest literacy ratio and Dadra and Nagar Haveli has the lowest literacy
ratio, for both males and females.

 Uttar Pradesh has the highest workers ratio and Dadra and Nagar Haveli has the lowest
workers ratio, for both males and females.

 Uttar Pradesh has the highest agricultural labourers ratio and Dadra and Nagar Haveli has the
lowest agricultural labourers ratio, for both males and females.

 Next we check for outliers in the data, because PCA is very sensitive to outliers.

 There are multiple outliers in the data. PCA is very sensitive to outliers, and even a single
outlier can change the principal components, so I am choosing to treat the outliers for this
data. Values beyond the upper or lower limit are replaced with the corresponding limit.

 After treating the outliers, we scale the data using the z-score method to standardize it.

 Scaling has no effect on outliers, but outliers can be detected from the scaled values.

 After scaling the data, we check the correlation between the variables of the data frame.
There is a strong correlation between some of the variables.

 We need to confirm that the correlations are statistically significant before proceeding
with PCA, using Bartlett's test of sphericity:
H0: Correlations are not significant
H1: There are significant correlations
Reject H0 if p-value < 0.05
 We calculated the Bartlett sphericity statistic and its p-value is 0. Since this is less than 0.05,
we reject H0; there is significant correlation, so we can proceed with PCA.
 To confirm the adequacy of the sample size, we calculate the KMO value. A value above 0.7 is
acceptable and a value below 0.5 is not acceptable.
 The KMO value is 0.93, so it is acceptable and we can proceed with PCA.
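
The report does not say which library produced these two values. As an illustration only, both statistics can be computed directly from the correlation matrix with NumPy and SciPy; the function names and the synthetic data below are assumptions.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(X):
    """Bartlett's test: chi-square statistic and p-value for H0 'correlation matrix is identity'."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)


def kmo_overall(X):
    """Overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(inv))
    partial = -inv / np.outer(d, d)          # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)    # off-diagonal mask
    r2 = (R[off] ** 2).sum()
    a2 = (partial[off] ** 2).sum()
    return r2 / (r2 + a2)


# Synthetic data: six variables driven by one shared factor plus noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(300, 1))
X = factor + 0.5 * rng.normal(size=(300, 6))

stat, p_value = bartlett_sphericity(X)
kmo = kmo_overall(X)
print(stat, p_value, kmo)
```

For strongly correlated data the p-value can be so small that it prints as 0. The KMO computation inverts the correlation matrix, so it assumes that matrix is not singular.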
 We then fit and transform the PCA model, initially keeping all 57 components.
array([[ 0.14922158, 0.15916917, 0.15820921, ..., 0.14136961,
0.14762899, 0.14210263],
[-0.11548673, -0.08023879, -0.09371751, ..., 0.03510934,
-0.04912234, -0.03984815],
[ 0.1015276 , -0.03866173, 0.0289595 , ..., -0.10217491,
-0.12667281, -0.02854464],
...,
[ 0.00112879, -0.00673066, 0.02298648, ..., -0.01159627,
0.05608352, -0.00610478],
[ 0.00070908, 0.04637872, 0.00402434, ..., 0.01406358,
-0.07729171, -0.00056173],
[-0.00461221, -0.00370327, 0.00963954, ..., 0.00227908,
0.00539901, 0.00130606]])
 Next we check the eigenvalues of all 57 components.
array([3.56488638e+01, 7.64357559e+00, 3.76919551e+00, 2.77722349e+00,
1.90694892e+00, 1.15490310e+00, 9.87726707e-01, 4.64629906e-01,
3.96708513e-01, 3.22346888e-01, 2.73207369e-01, 2.35647574e-01,
1.81401107e-01, 1.69243770e-01, 1.38592325e-01, 1.31505852e-01,
1.03809666e-01, 9.55333831e-02, 8.58580407e-02, 8.09138742e-02,
6.60179067e-02, 6.30797999e-02, 4.82756124e-02, 4.59506197e-02,
4.37747566e-02, 3.19339710e-02, 2.86194563e-02, 2.75481445e-02,
2.34340044e-02, 2.20296816e-02, 1.87487040e-02, 1.59004895e-02,
1.39957919e-02, 1.18916465e-02, 1.11133495e-02, 9.07842645e-03,
7.25127869e-03, 6.27213692e-03, 4.95541908e-03, 4.60667097e-03,
3.45902033e-03, 2.18408510e-03, 2.13514664e-03, 1.92111328e-03,
1.43840980e-03, 1.09968912e-03, 9.65752052e-04, 8.62630267e-04,
6.51634478e-04, 5.76658846e-04, 4.35790607e-04, 3.70037468e-04,
3.06660171e-04, 2.07854170e-04, 1.38286484e-04, 8.97034441e-05,
4.61745385e-05])
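
A minimal sketch of how eigenvalues like these are obtained with scikit-learn and turned into cumulative explained variance. The data is synthetic and the 90% threshold is an assumption for illustration, not the report's criterion.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: 8 observed variables generated from 2 latent factors plus noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 8))

X_scaled = StandardScaler().fit_transform(X)
pca = PCA().fit(X_scaled)  # keep all components initially

eigenvalues = pca.explained_variance_                   # sorted in decreasing order
cumulative = np.cumsum(pca.explained_variance_ratio_)   # cumulative share of variance

# Smallest number of components whose cumulative share reaches the assumed 90% threshold.
n_keep = int(np.argmax(cumulative >= 0.90) + 1)
print(eigenvalues.round(3), cumulative.round(3), n_keep)
```

Because `StandardScaler` divides by the population standard deviation while `PCA` uses the sample covariance, the eigenvalues sum to slightly more than the number of features (by a factor of n / (n - 1)).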
 The covariance matrix is calculated below
array([[1. , 0.91127279, 0.97149267, ..., 0.65174157, 0.76720055,
0.7966374 ],
[0.91127279, 1. , 0.97859043, ..., 0.73168646, 0.86481243,
0.78948116],
[0.97149267, 0.97859043, 1. , ..., 0.7107652 , 0.83833472,
0.81336875],
...,
[0.65174157, 0.73168646, 0.7107652 , ..., 1. , 0.76129967,
0.71962667],
[0.76720055, 0.86481243, 0.83833472, ..., 0.76129967, 1. ,
0.90083619],
[0.7966374 , 0.78948116, 0.81336875, ..., 0.71962667, 0.90083619,
1. ]])

 Transposing the PCA components gives the eigenvectors as columns, one per principal component (12 components are shown below).
array([[ 0.14922158, -0.11548673, 0.1015276 , 0.07681409, -0.01209003,
0.08255794, 0.10689589, -0.09951296, 0.02609778, 0.06812864,
-0.0586205 , -0.02177543],
[ 0.15916917, -0.08023879, -0.03866173, 0.05297633, -0.04234376,
0.07366681, -0.12408501, -0.10886983, 0.03285504, -0.04842824,
0.02949081, -0.04766829],
[ 0.15820921, -0.09371751, 0.0289595 , 0.07002217, -0.02292653,
0.08281204, -0.01029127, -0.11527589, 0.03640371, -0.02246575,
-0.02015258, -0.0428273 ],
[ 0.15634043, -0.02034061, -0.07441918, 0.02851986, -0.08033939,
0.09237947, -0.20080697, -0.13294526, 0.13840682, -0.15723774,
-0.00916557, -0.146674 ],
[ 0.1568144 , -0.01431023, -0.06822314, 0.01639807, -0.07832648,
0.08001002, -0.20341137, -0.139343 , 0.16571649, -0.14503123,
-0.02557186, -0.14463103],
[ 0.14335015, -0.07966701, -0.03761902, 0.01021041, -0.16789316,
0.05096945, -0.04039897, 0.18916926, -0.53174333, -0.09845631,
-0.19462968, -0.12262118],
[ 0.14353705, -0.08709832, 0.02134973, 0.01624416, -0.15809156,
0.05456754, 0.05398985, 0.17736326, -0.51506329, -0.06583908,
-0.25036558, -0.11452469],
[ 0.01884873, 0.06910144, 0.32382724, 0.09114279, 0.41841183,
-0.23180881, -0.35523838, -0.07163216, -0.11301919, -0.00838594,
-0.0824946 , -0.05551678],
[ 0.01787797, 0.06731586, 0.33870545, 0.07955449, 0.4159652 ,
-0.21454239, -0.32767705, -0.07839145, -0.13603111, -0.02861308,
-0.08142959, -0.05122301],
[ 0.15515239, -0.10598636, -0.03210704, 0.08918669, -0.01403251,
0.081378 , -0.06706185, -0.10288631, -0.0174454 , 0.00057329,
0.02382055, 0.03467198],
[ 0.14544984, -0.13323356, -0.00513336, 0.12541201, 0.02908422,
0.1022068 , 0.01349177, -0.12707401, 0.00097933, 0.12340681,
-0.01479861, 0.08753506],
[ 0.1545511 , -0.00945956, -0.04705352, -0.03466478, -0.10407302,
0.03795699, -0.24309747, -0.09103651, 0.12950534, -0.15515273,
0.04694997, -0.22031064],
[ 0.15828347, -0.02179345, 0.07934454, -0.01057813, -0.11033167,
0.01398577, -0.03698859, -0.05363198, 0.03034449, -0.14827181,
0.00754684, -0.16441159],
[ 0.15407627, -0.12091195, -0.0011159 , 0.06904579, -0.02310352,
0.0358025 , -0.08540362, -0.04578257, -0.02462895, 0.09006288,
0.11158062, 0.0362932 ],
[ 0.14252995, -0.07600253, 0.19412998, 0.11105656, -0.01893052,
-0.01658672, 0.17425777, -0.06837732, 0.072908 , -0.01450426,
-0.13880219, 0.03461047],
[ 0.14193201, -0.16669997, 0.01982148, 0.10018791, -0.04322541,
0.01805394, -0.08732635, -0.05206017, -0.05125101, 0.12495614,
0.14241509, 0.05347519],
[ 0.12573163, -0.14224991, 0.20997642, 0.13301329, -0.054674 ,
-0.05195118, 0.14903607, -0.07720208, 0.09720121, 0.06229771,
-0.18270277, 0.11503845],
[ 0.11169244, 0.04255228, 0.03313125, 0.07885146, -0.30337639,
-0.2935043 , -0.28879016, 0.42581303, -0.0210675 , 0.21067673,
0.32765363, -0.10568573],
[ 0.08303496, 0.09589258, 0.1888222 , 0.2650219 , -0.25792534,
-0.26991402, 0.0262944 , 0.19774214, 0.20599541, -0.30804863,
-0.21591822, 0.09956266],
[ 0.11929067, -0.05334228, 0.22583087, -0.12137878, -0.25313081,
-0.0233356 , -0.11070105, 0.0363001 , 0.10411389, 0.38208405,
-0.07043905, -0.15175049],
[ 0.09008881, -0.07246688, 0.35656643, -0.02098921, -0.19921997,
-0.05655819, 0.12568925, 0.05016856, 0.16615054, 0.2039094 ,
-0.23071378, 0.14047405],
[ 0.14184969, -0.10183528, -0.10220234, -0.02196919, -0.06081182,
-0.14286889, -0.06468864, -0.11788673, -0.27774368, -0.21273456,
0.13161045, 0.26121991],
[ 0.13388011, -0.11325661, 0.02161302, -0.04543644, -0.0230627 ,
-0.31847365, 0.23118776, -0.24852423, -0.12512191, 0.00842445,
0.10649846, -0.07437718],
[ 0.1227618 , -0.2036023 , -0.02814398, 0.14702469, 0.06990677,
0.07121365, -0.00776822, -0.07730861, -0.11135006, 0.14358174,
0.21358002, 0.2380841 ],
[ 0.1168656 , -0.20589888, 0.06903375, 0.15591746, 0.10677437,
0.03388487, 0.09129161, -0.08264689, -0.04130694, 0.16077765,
0.08804174, 0.26487045],
[ 0.15665637, 0.07903864, -0.06868497, -0.07857186, 0.06581161,
0.07865492, -0.05722289, 0.03831204, 0.10143156, 0.03894095,
-0.10161767, 0.03946556],
[ 0.14869489, 0.10881279, 0.10495656, 0.01578813, 0.07762414,
0.09915551, 0.15271912, 0.05683757, 0.04488718, -0.14334494,
0.01602309, -0.0953364 ],
[ 0.08816344, 0.2715224 , -0.10474484, 0.15710396, -0.01800453,
-0.03273765, -0.00294181, -0.05997628, -0.00729524, 0.25389695,
-0.02680659, 0.00639141],
[ 0.06516026, 0.27539755, -0.03632536, 0.28502411, -0.05515214,
-0.03178707, 0.06348769, -0.03542371, -0.01298311, -0.09009164,
0.14122895, 0.08278884],
[ 0.1272781 , 0.15657864, 0.0704345 , -0.25059413, -0.04720013,
0.07974782, -0.09344179, 0.01684894, -0.01888328, 0.11682358,
0.07391181, 0.07830835],
[ 0.11588826, 0.13504767, 0.25998651, -0.15379789, -0.01264328,
0.11762488, 0.09222418, 0.03280765, -0.05168412, -0.148361 ,
0.23199585, 0.10020847],
[ 0.14536607, 0.04097368, -0.14434657, -0.16753968, 0.00557458,
-0.16997996, -0.05567003, 0.03363506, 0.04601489, -0.10345429,
-0.13520726, 0.27585568],
[ 0.14230182, 0.00668481, -0.09383805, -0.15146925, 0.04361632,
-0.31959562, 0.18400519, -0.13319507, -0.009807 , 0.02981446,
0.07433996, -0.19261057],
[ 0.15087675, -0.07344039, -0.13141498, 0.02119534, 0.1451087 ,
0.01823245, -0.02139323, 0.17805166, 0.06086733, 0.0091236 ,
0.01991064, 0.10234219],
[ 0.14801846, -0.08836101, -0.05388345, 0.05996115, 0.19075649,
0.00240871, 0.09974423, 0.25190766, 0.08117148, 0.01431973,
0.12688446, -0.13244493],
[ 0.15790761, -0.04404402, -0.06687743, 0.03931895, -0.0598864 ,
0.10337666, -0.15317971, -0.14984177, 0.08935349, -0.15522534,
-0.04523389, -0.10574031],
[ 0.15583101, -0.09238317, -0.05871826, 0.04613025, -0.02247554,
0.11746706, -0.09836715, -0.12124342, 0.01321602, -0.00205965,
0.03746853, -0.07240161],
[ 0.15764021, 0.06620762, -0.06017243, -0.09131505, 0.05907845,
0.07238086, -0.06421911, 0.04254566, 0.11865173, 0.02711453,
-0.05877036, 0.0488893 ],
[ 0.1495015 , 0.08965133, 0.1257919 , 0.01886534, 0.06434924,
0.07089589, 0.14288847, 0.07205396, 0.06838015, -0.18267649,
0.07137054, -0.05881623],
[ 0.0947852 , 0.26126801, -0.09655088, 0.13159069, -0.01388688,
-0.04137688, -0.01126399, -0.0366386 , 0.037604 , 0.25511706,
0.00305617, 0.02623711],
[ 0.06715842, 0.26669101, -0.01825633, 0.29284517, -0.06101878,
-0.04936682, 0.05963742, -0.01288921, 0.02690756, -0.13595046,
0.17579363, 0.09843767],
[ 0.12818439, 0.14983097, 0.07819427, -0.2503371 , -0.05866475,
0.0731517 , -0.09594801, 0.02898646, -0.00080806, 0.10856889,
0.09174 , 0.06642587],
[ 0.11395923, 0.12064763, 0.28323496, -0.14304544, -0.02538622,
0.09486752, 0.08953895, 0.06295586, -0.02844888, -0.16406229,
0.25494868, 0.11124873],
[ 0.14510769, 0.03676265, -0.14251113, -0.16600189, 0.00331494,
-0.17463445, -0.05548298, 0.03264548, 0.03762013, -0.10739709,
-0.10073182, 0.27838852],
[ 0.14102942, -0.00368515, -0.08935617, -0.14259884, 0.04167758,
-0.34396998, 0.17735371, -0.12126702, -0.02248656, 0.01619689,
0.11692876, -0.20104553],
[ 0.15092232, -0.0777393 , -0.13068659, 0.01988712, 0.13279387,
0.01582574, -0.02259074, 0.16680123, 0.06982819, 0.00796007,
0.04505291, 0.10296503],
[ 0.14753416, -0.10114106, -0.05848926, 0.0600874 , 0.17059608,
-0.00485718, 0.07857288, 0.2224764 , 0.08868533, 0.00635225,
0.15277396, -0.10679669],
[ 0.14298675, 0.13683939, -0.10356452, -0.01822291, 0.0942929 ,
0.11104532, -0.02590217, 0.018268 , -0.00476824, 0.10656481,
-0.2841336 , -0.01112016],
[ 0.13378373, 0.16641612, 0.03342285, 0.0059541 , 0.11235112,
0.18588236, 0.17850035, -0.00407236, -0.0239809 , 0.00863649,
-0.15517828, -0.17161971],
[ 0.06296394, 0.28188148, -0.1202934 , 0.20894141, -0.01807012,
-0.00459955, 0.00947356, -0.11585956, -0.13405721, 0.18025429,
-0.11074858, -0.05974346],
[ 0.05674058, 0.28754091, -0.08809749, 0.2404994 , -0.03629271,
0.0220235 , 0.0664972 , -0.09544746, -0.13424113, 0.04261022,
0.01570896, 0.0321015 ],
[ 0.11910165, 0.18234077, 0.02617609, -0.24041564, 0.01698094,
0.10938653, -0.0828577 , -0.04866015, -0.09335608, 0.16042119,
-0.00215884, 0.07506802],
[ 0.11304417, 0.17711216, 0.16477413, -0.18940781, 0.04753801,
0.18900563, 0.10968562, -0.07017643, -0.13770348, -0.04543535,
0.1269854 , 0.04833875],
[ 0.14213963, 0.05292484, -0.14441938, -0.16755357, 0.01418678,
-0.14968946, -0.05078585, 0.03888177, 0.07535478, -0.08411327,
-0.23982379, 0.2548551 ],
[ 0.14136961, 0.03510934, -0.10217491, -0.16901995, 0.04750424,
-0.23385789, 0.19468631, -0.15104151, 0.03806949, 0.08511215,
-0.04765465, -0.15385683],
[ 0.14762899, -0.04912234, -0.12667281, 0.02403566, 0.19178951,
0.02290434, -0.01633823, 0.23257906, 0.01366747, 0.03342147,
-0.088225 , 0.11227486],
[ 0.14210263, -0.03984815, -0.02854464, 0.05740164, 0.24976544,
0.04283359, 0.17525208, 0.32586916, 0.0509088 , 0.02365377,
-0.0234607 , -0.20968519]])

 Checking the explained variance for all the principal components.


Explained variance ratio = (eigenvalue of each PC) / (sum of eigenvalues of all PCs)
array([6.24441446e-01, 1.33888289e-01, 6.60229147e-02, 4.86470891e-02,
3.34029704e-02, 2.02297994e-02, 1.73014629e-02, 8.13866529e-03,
6.94892379e-03, 5.64637229e-03, 4.78562250e-03, 4.12770833e-03,
3.17750294e-03, 2.96454958e-03, 2.42764517e-03, 2.30351534e-03,
1.81837655e-03, 1.67340548e-03, 1.50392785e-03, 1.41732362e-03,
1.15639919e-03, 1.10493400e-03, 8.45617224e-04, 8.04891611e-04,
7.66778221e-04, 5.59369722e-04, 5.01311201e-04, 4.82545623e-04,
4.10480504e-04, 3.85881758e-04, 3.28410688e-04, 2.78520087e-04,
2.45156553e-04, 2.08299401e-04, 1.94666401e-04, 1.59021779e-04,
1.27016642e-04, 1.09865556e-04, 8.68013375e-05, 8.06925096e-05,
6.05897475e-05, 3.82574118e-05, 3.74001838e-05, 3.36510796e-05,
2.51958296e-05, 1.92626466e-05, 1.69165450e-05, 1.51102177e-05,
1.14143210e-05, 1.01010143e-05, 7.63350323e-06, 6.48174183e-06,
5.37159674e-06, 3.64086663e-06, 2.42228792e-06, 1.57128566e-06,
8.08813873e-07])
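The ratio defined above can be sketched in a few lines of plain Python. This is only an illustration: the function name `explained_variance_ratio` and the four eigenvalues are hypothetical and are not taken from the report's data.

```python
def explained_variance_ratio(eigenvalues):
    """Divide each eigenvalue by the sum of all eigenvalues."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


# Hypothetical eigenvalues, sorted from largest to smallest (illustrative only).
eigenvalues = [4.0, 2.0, 1.0, 1.0]
ratios = explained_variance_ratio(eigenvalues)

# The ratios always sum to 1, so each one reads as a share of total variance.
print(ratios)
```

Because the ratios sum to 1, the first value in the report's array (about 0.624) means the first PC alone accounts for roughly 62% of the total variance.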
 The scree plot suggests a suitable number of principal components: the explained
variance ratio is almost flat after the 12th component, so we consider 12 principal
components.

 Checking the cumulative explained variance ratio to find a cut-off for selecting the
number of PCs.
array([0.62444145, 0.75832974, 0.82435265, 0.87299974, 0.90640271,
0.92663251, 0.94393397, 0.95207264, 0.95902156, 0.96466793,
0.96945356, 0.97358126, 0.97675877, 0.97972332, 0.98215096,
0.98445448, 0.98627285, 0.98794626, 0.98945019, 0.99086751,
0.99202391, 0.99312884, 0.99397446, 0.99477935, 0.99554613,
0.9961055 , 0.99660681, 0.99708936, 0.99749984, 0.99788572,
0.99821413, 0.99849265, 0.99873781, 0.99894611, 0.99914077,
0.99929979, 0.99942681, 0.99953668, 0.99962348, 0.99970417,
0.99976476, 0.99980302, 0.99984042, 0.99987407, 0.99989927,
0.99991853, 0.99993544, 0.99995055, 0.99996197, 0.99997207,
0.9999797 , 0.99998619, 0.99999156, 0.9999952 , 0.99999762,
0.99999919, 1. ])
 From the cumulative explained variance, we take 12 principal components, which together
explain about 97.4% of the variance in the data.
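The cut-off can also be found programmatically. The sketch below uses only the first 12 explained variance ratios printed earlier; the helper name `components_for` and the 0.97 threshold are assumptions made for illustration, not part of the report's code.

```python
from bisect import bisect_left
from itertools import accumulate

# First 12 explained variance ratios, copied from the output above.
ratios = [
    6.24441446e-01, 1.33888289e-01, 6.60229147e-02, 4.86470891e-02,
    3.34029704e-02, 2.02297994e-02, 1.73014629e-02, 8.13866529e-03,
    6.94892379e-03, 5.64637229e-03, 4.78562250e-03, 4.12770833e-03,
]

# Running total of the ratios: entry i is the variance explained by PCs 1..i+1.
cumulative = list(accumulate(ratios))


def components_for(cumulative, threshold):
    """Smallest number of PCs whose cumulative ratio reaches the threshold."""
    idx = bisect_left(cumulative, threshold)  # cumulative is non-decreasing
    if idx == len(cumulative):
        raise ValueError("threshold is not reached by the given components")
    return idx + 1


n_components = components_for(cumulative, 0.97)
print(n_components, round(cumulative[-1], 4))
```

With a 0.97 threshold this returns 12, matching the choice made from the scree plot; a lower threshold such as 0.90 would be met with fewer components.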
 Checking how the original features contribute to each PC by constructing a plot.

 Comparing how the original features influence the various PCs by constructing a heat map.
 Next, we fit the PCA model with n_components = 12 and transform the original data.

 Next we check the correlation among the principal components using a heat map. Distinct
components should have a correlation of 0 with each other, which verifies the PCA.
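The zero-correlation check can be sketched numerically as well as with a heat map. The example below is an assumption-laden illustration: it runs on synthetic data (not the report's dataset), keeps 3 components instead of 12, and computes PCA through a NumPy eigen-decomposition of the covariance matrix rather than whatever library call the report itself used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 6 features, the last 3 correlated with the first 3.
base = rng.normal(size=(200, 3))
mixed = base @ rng.normal(size=(3, 3)) + 0.1 * rng.normal(size=(200, 3))
X = np.hstack([base, mixed])

# Standardize each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via eigen-decomposition of the covariance matrix.
cov = np.cov(Z, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # re-order from largest to smallest

n_keep = 3                                  # illustrative; the report keeps 12
components = eigvecs[:, order[:n_keep]]     # one column of loadings per PC
scores = Z @ components                     # transformed data (PC scores)

# Correlation between PC scores: off-diagonal entries should be ~0.
corr = np.corrcoef(scores, rowvar=False)
max_off_diagonal = np.abs(corr - np.eye(n_keep)).max()
print(max_off_diagonal)
```

The largest off-diagonal value is at the level of floating-point noise, which is what the heat map of PC correlations is expected to show as an identity-like pattern.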

 The linear equation for the first PC is:

PC1 = No_HH*0.149222 + TOT_M*0.159169 + TOT_F*0.158209 + M_06*0.156340 +
F_06*0.156814 + M_SC*0.143350 + F_SC*0.143537 + M_ST*0.018849 + F_ST*0.017878 +
M_LIT*0.155152 + F_LIT*0.145450 + M_ILL*0.154551 + F_ILL*0.158283 +
TOT_WORK_M*0.154076 + TOT_WORK_F*0.142530 + MAINWORK_M*0.141932 +
MAINWORK_F*0.125732 + MAIN_CL_M*0.111692 + MAIN_CL_F*0.083035 +
MAIN_AL_M*0.119291 + MAIN_AL_F*0.090089 + MAIN_HH_M*0.141850 +
MAIN_HH_F*0.133880 + MAIN_OT_M*0.122762 + MAIN_OT_F*0.116866 +
MARGWORK_M*0.156656 + MARGWORK_F*0.148695 + MARG_CL_M*0.088163 +
MARG_CL_F*0.065160 + MARG_AL_M*0.127278 + MARG_AL_F*0.115888 +
MARG_HH_M*0.145366 + MARG_HH_F*0.142302 + MARG_OT_M*0.150877 +
MARG_OT_F*0.148018 + MARGWORK_3_6_M*0.157908 + MARGWORK_3_6_F*0.155831 +
MARG_CL_3_6_M*0.157640 + MARG_CL_3_6_F*0.149501 + MARG_AL_3_6_M*0.094785 +
MARG_AL_3_6_F*0.067158 + MARG_HH_3_6_M*0.128184 + MARG_HH_3_6_F*0.113959 +
MARG_OT_3_6_M*0.145108 + MARG_OT_3_6_F*0.141029 + MARGWORK_0_3_M*0.150922 +
MARGWORK_0_3_F*0.147534 + MARG_CL_0_3_M*0.142987 + MARG_CL_0_3_F*0.133784 +
MARG_AL_0_3_M*0.062964 + MARG_AL_0_3_F*0.056741 + MARG_HH_0_3_M*0.119102 +
MARG_HH_0_3_F*0.113044 + MARG_OT_0_3_M*0.142140 + MARG_OT_0_3_F*0.141370 +
NON_WORK_M*0.147629 + NON_WORK_F*0.142103 + bias
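To show how such an equation is evaluated, the sketch below computes the weighted sum for only the first three terms of PC1. The loadings are copied from the equation; the row of feature values is hypothetical, and the helper `pc_score` is an illustrative name. It assumes the features have already been standardized (centred), in which case the bias term can be taken as 0.

```python
# Loadings of the first three features in PC1, copied from the equation above.
pc1_loadings = {"No_HH": 0.149222, "TOT_M": 0.159169, "TOT_F": 0.158209}

# Hypothetical standardized (z-scored) values for one row; illustrative only.
row = {"No_HH": 1.0, "TOT_M": -0.5, "TOT_F": 2.0}


def pc_score(loadings, values, bias=0.0):
    """Weighted sum of feature values; bias defaults to 0 (centred features)."""
    return sum(loadings[name] * values[name] for name in loadings) + bias


# Contribution of these three features to the PC1 score of this row.
partial_pc1 = pc_score(pc1_loadings, row)
print(partial_pc1)
```

The full PC1 score for a row is the same sum taken over all 57 features listed in the equation.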
