P L Lohitha
07/11/22
—
Data mining
—
Business Report
CONTENTS
1 Problem 1
2 EDA
3 Hierarchical clustering
4 K-Means clustering
5 Summary
6 Problem 2
7 EDA
8 PCA
Problem 1
Digital Ads Data:
ads24x7 is a Digital Marketing company which has now got seed funding of $10 Million. They are
expanding their wings in Marketing Analytics. They collected data from their Marketing Intelligence
team and now want you (their newly appointed data analyst) to segment the types of ads based on the
features provided. Use a clustering procedure to segment ads into homogeneous groups.
CPM = (Total Campaign Spend / Number of Impressions) * 1,000. Note that the Total Campaign
Spend refers to the 'Spend' Column in the dataset and the Number of Impressions refers to the
'Impressions' Column in the dataset.
CPC = Total Cost (spend) / Number of Clicks. Note that the Total Cost (spend) refers to the
'Spend' Column in the dataset and the Number of Clicks refers to the 'Clicks' Column in the dataset.
CTR = Total Measured Clicks / Total Measured Ad Impressions x 100. Note that the Total
Measured Clicks refers to the 'Clicks' Column in the dataset and the Total Measured Ad Impressions
refers to the 'Impressions' Column in the dataset.
Read the data and perform basic analysis such as printing a few rows (head and tail), info,
data summary, null values, duplicate values, etc.
Treat missing values in CPC, CTR and CPM using the formulas given. You basically have to
create a user-defined function and then call the function for imputing.
Check if there are any outliers.
Do you think treating outliers is necessary for K-Means clustering? Based on your judgement
decide whether to treat outliers and if yes, which method to employ. (As an analyst your
judgement may be different from another analyst).
Perform z-score scaling and discuss how it affects the speed of the algorithm.
Perform clustering and do the following:
Perform Hierarchical by constructing a Dendrogram using WARD and Euclidean distance.
Make Elbow plot (up to n=10) and identify optimum number of clusters for k-means algorithm.
Print silhouette scores for up to 10 clusters and identify optimum number of clusters.
Profile the ads based on optimum number of clusters using silhouette score and your domain
understanding
[Hint: Group the data by clusters and take sum or mean to identify trends in clicks, spend,
revenue, CPM, CTR, & CPC based on Device Type. Make bar plots.]
Conclude the project by providing summary of your learnings.
Observations:
The required libraries are imported and the data is loaded into the code file.
The first 5 rows of the data set are displayed.
The dataset has 6 float columns, 7 integer columns and 6 object-type columns.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23066 entries, 0 to 23065
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Timestamp 23066 non-null object
1 InventoryType 23066 non-null object
2 Ad - Length 23066 non-null int64
3 Ad- Width 23066 non-null int64
4 Ad Size 23066 non-null int64
5 Ad Type 23066 non-null object
6 Platform 23066 non-null object
7 Device Type 23066 non-null object
8 Format 23066 non-null object
9 Available_Impressions 23066 non-null int64
10 Matched_Queries 23066 non-null int64
11 Impressions 23066 non-null int64
12 Clicks 23066 non-null int64
13 Spend 23066 non-null float64
14 Fee 23066 non-null float64
15 Revenue 23066 non-null float64
16 CTR 18330 non-null float64
17 CPM 18330 non-null float64
18 CPC 18330 non-null float64
dtypes: float64(6), int64(7), object(6)
memory usage: 3.3+ MB
The columns CTR, CPM and CPC each have only 18,330 non-null values out of 23,066 rows, so each has missing values.
The summary of the data is given below.
The null counts confirm 4,736 missing values in each of CTR, CPM and CPC:
Timestamp 0
InventoryType 0
Ad - Length 0
Ad- Width 0
Ad Size 0
Ad Type 0
Platform 0
Device Type 0
Format 0
Available_Impressions 0
Matched_Queries 0
Impressions 0
Clicks 0
Spend 0
Fee 0
Revenue 0
CTR 4736
CPM 4736
CPC 4736
dtype: int64
To treat the missing values, we applied the formulas given in the problem statement to the
columns CPM, CTR and CPC. After imputation there are no null values left:
Timestamp 0
InventoryType 0
Ad - Length 0
Ad- Width 0
Ad Size 0
Ad Type 0
Platform 0
Device Type 0
Format 0
Available_Impressions 0
Matched_Queries 0
Impressions 0
Clicks 0
Spend 0
Fee 0
Revenue 0
CTR 0
CPM 0
CPC 0
dtype: int64
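The imputation described above can be sketched as a small user-defined function. This is a minimal illustration, not the report's actual code: the function name `impute_ad_metrics`, the toy values, and the choice to treat zero denominators as missing are all assumptions. Only the three formulas and the column names come from the problem statement.

```python
import numpy as np
import pandas as pd


def impute_ad_metrics(df: pd.DataFrame) -> pd.DataFrame:
    """Fill missing CTR, CPM and CPC from Clicks, Impressions and Spend."""
    out = df.copy()
    # Assumption: a zero denominator is treated as missing, so the division
    # yields NaN instead of inf (the report does not say how it handled this).
    impressions = out["Impressions"].replace(0, np.nan)
    clicks = out["Clicks"].replace(0, np.nan)

    # CTR = Clicks / Impressions * 100
    out["CTR"] = out["CTR"].fillna(out["Clicks"] / impressions * 100)
    # CPM = Spend / Impressions * 1000
    out["CPM"] = out["CPM"].fillna(out["Spend"] / impressions * 1000)
    # CPC = Spend / Clicks
    out["CPC"] = out["CPC"].fillna(out["Spend"] / clicks)
    return out


# Toy data (made up): the second row has the three metrics missing.
ads = pd.DataFrame({
    "Impressions": [1000, 2000],
    "Clicks": [10, 40],
    "Spend": [5.0, 8.0],
    "CTR": [1.0, np.nan],
    "CPM": [5.0, np.nan],
    "CPC": [0.5, np.nan],
})
filled = impute_ad_metrics(ads)
print(filled[["CTR", "CPM", "CPC"]])
```

Only the missing entries are overwritten; values already present in CTR, CPM and CPC are left untouched.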
There are outliers in the data for the columns 'Ad Size', 'Available_Impressions',
'Matched_Queries', 'Impressions', 'Clicks', 'Spend', 'Fee', 'Revenue', 'CTR', 'CPM' and 'CPC'.
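K-Means clustering is very sensitive to outliers: each centroid is the mean of its cluster, so a few
extreme points can pull it away from the bulk of the data. Treating them is therefore an important
step for accurate results.
Outliers are treated by replacing each one with the upper or lower limit value, whichever it exceeds.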
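Capping at an upper and lower limit can be sketched as follows. The report does not state how the limits were computed; the sketch assumes the common rule of Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and the function name `cap_outliers_iqr` and the sample values are hypothetical.

```python
import pandas as pd


def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those limits."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # clip() replaces values below `lower` with `lower` and above `upper` with `upper`
    return series.clip(lower=lower, upper=upper)


# Made-up example: 500.0 is an obvious outlier.
spend = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 500.0])
capped = cap_outliers_iqr(spend)
print(capped.tolist())
```

Capping keeps every row in the data set, which matters here because dropping rows would discard whole ads rather than a single extreme measurement.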
We can infer from the above graph that only a few pairs of variables are correlated
with each other.
We scale the data so that all features are on a comparable scale and no single feature dominates
the distance calculation purely because of its units. This in turn helps us to generate good-quality
clusters and improves the accuracy of the clustering algorithms.
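A minimal sketch of z-score scaling with scikit-learn's `StandardScaler` is shown below. The two-column array is made up to mimic one large-valued feature and one small-valued feature.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up features on very different scales (e.g. impressions vs. a ratio).
X = np.array([[1000.0, 0.5],
              [2000.0, 1.5],
              [3000.0, 2.5]])

# Each column is transformed to (x - mean) / std, using the population std.
X_scaled = StandardScaler().fit_transform(X)

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns contribute equally to Euclidean distances, which is what the clustering algorithms below rely on.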
We performed hierarchical clustering and plotted a dendrogram using Ward's
linkage method with Euclidean distance.
According to the dendrogram, the ideal number of clusters is 3.
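The hierarchical step can be sketched with SciPy. The data here is synthetic (three made-up groups of points), and `no_plot=True` is used only so that the sketch runs without a plotting backend; the report's own dendrogram was drawn from the scaled ads data.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Synthetic stand-in for the scaled ads data: three tight groups in 2-D.
rng = np.random.default_rng(0)
X_scaled = np.vstack([
    rng.normal(loc=0.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.3, size=(20, 2)),
    rng.normal(loc=(0.0, 5.0), scale=0.3, size=(20, 2)),
])

# Ward linkage on Euclidean distances, as in the problem statement.
Z = linkage(X_scaled, method="ward", metric="euclidean")

# Compute the dendrogram layout without drawing it.
tree = dendrogram(Z, no_plot=True)

# Cut the tree into at most 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(np.unique(labels))
```

For a data set with tens of thousands of rows, passing `truncate_mode="lastp"` together with `p` to `dendrogram` keeps the plot readable by showing only the last merges.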
Next we will perform k-means clustering for the same data and identify the optimum
number of clusters.
From the elbow curve, 5 clusters appear ideal, and we can verify that by
calculating the silhouette score.
After calculating the silhouette score for 2 to 10 clusters (the score is not defined for a
single cluster), we can conclude that 5 clusters has the highest silhouette score, 0.49.
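The elbow and silhouette checks can be sketched together. The data is synthetic (three well-separated blobs), so the best number of clusters here is 3 rather than the 5 found in the report; the variable names `wss`, `sil` and `best_k` are illustrative.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the scaled data: three well-separated blobs.
X_scaled, _ = make_blobs(
    n_samples=300,
    centers=[(0, 0), (10, 10), (-10, 10)],
    cluster_std=1.0,
    random_state=1,
)

wss = {}  # within-cluster sum of squares per k, for the elbow plot
sil = {}  # silhouette score per k (needs at least 2 clusters)
for k in range(1, 11):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X_scaled)
    wss[k] = model.inertia_
    if k >= 2:
        sil[k] = silhouette_score(X_scaled, model.labels_)

best_k = max(sil, key=sil.get)
print("k with the highest silhouette score:", best_k)
```

Plotting `wss` against `k` gives the elbow curve, and `best_k` is the value the silhouette criterion would select.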
The above visualization shows the clusters formed by the model for each feature. We can also
infer from the graph that the features 'Available_Impressions', 'Impressions' and
'Matched_Queries' show the largest differences between the clusters. As a result, these
features can be used to segregate ads based on ad attributes.
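Cluster profiling by device type, as suggested in the problem hint, can be sketched with a pandas `groupby`. The column name `cluster`, the device labels and all numbers are made up for illustration.

```python
import pandas as pd

# Made-up ads with a cluster label attached after fitting K-Means.
ads = pd.DataFrame({
    "cluster": [0, 0, 0, 1, 1],
    "Device Type": ["Mobile", "Mobile", "Desktop", "Mobile", "Desktop"],
    "Clicks": [10, 30, 20, 300, 500],
    "Spend": [1.0, 3.0, 2.0, 40.0, 60.0],
})

# Mean of each metric per (cluster, device type) pair.
profile = ads.groupby(["cluster", "Device Type"])[["Clicks", "Spend"]].mean()
print(profile)
```

Calling `profile["Clicks"].unstack().plot(kind="bar")` would then give one group of bars per cluster, split by device type.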
Summary:
ads24x7 is a Digital Marketing company that wanted us to segment its ads into homogeneous groups.
We can do that by clustering the given data and grouping the ads accordingly.
Initially we loaded all the libraries, imported the data and started with EDA. The data frame has 23066 rows
and 19 columns.
The columns CTR, CPM and CPC had missing values, which were imputed using the given formulas. There are no
duplicate values in the data, but there are outliers, and they were treated before clustering because
clustering is very sensitive to outliers.
After checking the correlation of the variables, we performed hierarchical clustering by constructing a
dendrogram using Ward's linkage method, which indicated that 3 clusters would be optimum.
We also performed K-Means clustering. We fitted the model for each number of clusters, computed the
within-cluster sum of squares and plotted it as an elbow curve, from which 5 clusters appear optimum.
We then calculated the silhouette score and observed that 5 clusters has the maximum silhouette score.
Impressions, available impressions and matched queries are the features that differ most between the
clusters.
So we can conclude that marketing campaigns should be designed around 5 groups, taking Impressions,
available impressions and matched queries into account.
The ads should be classified into 5 different segments based on Impressions, available
impressions and matched queries.
Problem 2
PCA FH (FT): Primary census abstract for female headed households excluding institutional
households (India & States/UTs - District Level), Scheduled tribes - 2011 PCA for Female Headed
Household Excluding Institutional Household. The Indian Census has the reputation of being one of
the best in the world. The first Census in India was conducted in the year 1872. This was conducted
at different points of time in different parts of the country. In 1881 a Census was taken for the entire
country simultaneously. Since then, Census has been conducted every ten years, without a break.
Thus, the Census of India 2011 was the fifteenth in this unbroken series since 1872, the seventh after
independence and the second census of the third millennium and twenty first century. The census
has continued uninterrupted despite several adversities like wars, epidemics, natural
calamities, political unrest, etc. The Census of India is conducted under the provisions of the Census
Act 1948 and the Census Rules, 1990. The Primary Census Abstract, which is an important publication of the
2011 Census, gives basic information on Area, Total Number of Households, Total Population,
Scheduled Castes, Scheduled Tribes Population, Population in the age group 0-6, Literates, Main
Workers and Marginal Workers classified by the four broad industrial categories, namely, (i)
Cultivators, (ii) Agricultural Laborers, (iii) Household Industry Workers, and (iv) Other Workers and
also Non-Workers. The characteristics of the Total Population include Scheduled Castes, Scheduled
Tribes, Institutional and Houseless Population and are presented by sex and rural-urban residence.
Census 2011 covered 35 States/Union Territories, 640 districts, 5,924 sub-districts, 7,935 Towns and
6,40,867 Villages.
The data collected has so many variables thus making it difficult to find useful details without using
Data Science Techniques. You are tasked to perform detailed EDA and identify Optimum Principal
Components that explains the most variance in data. Use Sklearn only.
Observations:
The required libraries are imported and the data is loaded into the code file.
The first 5 rows of the data set are displayed.
The dataset has 640 rows and 61 columns.
(640, 61)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State Code 640 non-null int64
1 Dist.Code 640 non-null int64
2 State 640 non-null object
3 Area Name 640 non-null object
4 No_HH 640 non-null int64
5 TOT_M 640 non-null int64
6 TOT_F 640 non-null int64
7 M_06 640 non-null int64
8 F_06 640 non-null int64
9 M_SC 640 non-null int64
10 F_SC 640 non-null int64
11 M_ST 640 non-null int64
12 F_ST 640 non-null int64
13 M_LIT 640 non-null int64
14 F_LIT 640 non-null int64
15 M_ILL 640 non-null int64
16 F_ILL 640 non-null int64
17 TOT_WORK_M 640 non-null int64
18 TOT_WORK_F 640 non-null int64
19 MAINWORK_M 640 non-null int64
20 MAINWORK_F 640 non-null int64
21 MAIN_CL_M 640 non-null int64
22 MAIN_CL_F 640 non-null int64
23 MAIN_AL_M 640 non-null int64
24 MAIN_AL_F 640 non-null int64
25 MAIN_HH_M 640 non-null int64
26 MAIN_HH_F 640 non-null int64
27 MAIN_OT_M 640 non-null int64
28 MAIN_OT_F 640 non-null int64
29 MARGWORK_M 640 non-null int64
30 MARGWORK_F 640 non-null int64
31 MARG_CL_M 640 non-null int64
32 MARG_CL_F 640 non-null int64
33 MARG_AL_M 640 non-null int64
34 MARG_AL_F 640 non-null int64
35 MARG_HH_M 640 non-null int64
36 MARG_HH_F 640 non-null int64
37 MARG_OT_M 640 non-null int64
38 MARG_OT_F 640 non-null int64
39 MARGWORK_3_6_M 640 non-null int64
40 MARGWORK_3_6_F 640 non-null int64
41 MARG_CL_3_6_M 640 non-null int64
42 MARG_CL_3_6_F 640 non-null int64
43 MARG_AL_3_6_M 640 non-null int64
44 MARG_AL_3_6_F 640 non-null int64
45 MARG_HH_3_6_M 640 non-null int64
46 MARG_HH_3_6_F 640 non-null int64
47 MARG_OT_3_6_M 640 non-null int64
48 MARG_OT_3_6_F 640 non-null int64
49 MARGWORK_0_3_M 640 non-null int64
50 MARGWORK_0_3_F 640 non-null int64
51 MARG_CL_0_3_M 640 non-null int64
52 MARG_CL_0_3_F 640 non-null int64
53 MARG_AL_0_3_M 640 non-null int64
54 MARG_AL_0_3_F 640 non-null int64
55 MARG_HH_0_3_M 640 non-null int64
56 MARG_HH_0_3_F 640 non-null int64
57 MARG_OT_0_3_M 640 non-null int64
58 MARG_OT_0_3_F 640 non-null int64
59 NON_WORK_M 640 non-null int64
60 NON_WORK_F 640 non-null int64
dtypes: int64(59), object(2)
memory usage: 305.1+ KB
We need to convert the State Code and Dist.Code columns to object type, since they are identifiers rather than quantities.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 640 entries, 0 to 639
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 State Code 640 non-null object
1 Dist.Code 640 non-null object
2 State 640 non-null object
3 Area Name 640 non-null object
4 No_HH 640 non-null int64
5 TOT_M 640 non-null int64
6 TOT_F 640 non-null int64
7 M_06 640 non-null int64
8 F_06 640 non-null int64
9 M_SC 640 non-null int64
10 F_SC 640 non-null int64
11 M_ST 640 non-null int64
12 F_ST 640 non-null int64
13 M_LIT 640 non-null int64
14 F_LIT 640 non-null int64
15 M_ILL 640 non-null int64
16 F_ILL 640 non-null int64
17 TOT_WORK_M 640 non-null int64
18 TOT_WORK_F 640 non-null int64
19 MAINWORK_M 640 non-null int64
20 MAINWORK_F 640 non-null int64
21 MAIN_CL_M 640 non-null int64
22 MAIN_CL_F 640 non-null int64
23 MAIN_AL_M 640 non-null int64
24 MAIN_AL_F 640 non-null int64
25 MAIN_HH_M 640 non-null int64
26 MAIN_HH_F 640 non-null int64
27 MAIN_OT_M 640 non-null int64
28 MAIN_OT_F 640 non-null int64
29 MARGWORK_M 640 non-null int64
30 MARGWORK_F 640 non-null int64
31 MARG_CL_M 640 non-null int64
32 MARG_CL_F 640 non-null int64
33 MARG_AL_M 640 non-null int64
34 MARG_AL_F 640 non-null int64
35 MARG_HH_M 640 non-null int64
36 MARG_HH_F 640 non-null int64
37 MARG_OT_M 640 non-null int64
38 MARG_OT_F 640 non-null int64
39 MARGWORK_3_6_M 640 non-null int64
40 MARGWORK_3_6_F 640 non-null int64
41 MARG_CL_3_6_M 640 non-null int64
42 MARG_CL_3_6_F 640 non-null int64
43 MARG_AL_3_6_M 640 non-null int64
44 MARG_AL_3_6_F 640 non-null int64
45 MARG_HH_3_6_M 640 non-null int64
46 MARG_HH_3_6_F 640 non-null int64
47 MARG_OT_3_6_M 640 non-null int64
48 MARG_OT_3_6_F 640 non-null int64
49 MARGWORK_0_3_M 640 non-null int64
50 MARGWORK_0_3_F 640 non-null int64
51 MARG_CL_0_3_M 640 non-null int64
52 MARG_CL_0_3_F 640 non-null int64
53 MARG_AL_0_3_M 640 non-null int64
54 MARG_AL_0_3_F 640 non-null int64
55 MARG_HH_0_3_M 640 non-null int64
56 MARG_HH_0_3_F 640 non-null int64
57 MARG_OT_0_3_M 640 non-null int64
58 MARG_OT_0_3_F 640 non-null int64
59 NON_WORK_M 640 non-null int64
60 NON_WORK_F 640 non-null int64
dtypes: int64(57), object(4)
memory usage: 305.1+ KB
There are no null values in the data.
State Code 0
Dist.Code 0
State 0
Area Name 0
No_HH 0
TOT_M 0
TOT_F 0
M_06 0
F_06 0
M_SC 0
F_SC 0
M_ST 0
F_ST 0
M_LIT 0
F_LIT 0
M_ILL 0
F_ILL 0
TOT_WORK_M 0
TOT_WORK_F 0
MAINWORK_M 0
MAINWORK_F 0
MAIN_CL_M 0
MAIN_CL_F 0
MAIN_AL_M 0
MAIN_AL_F 0
MAIN_HH_M 0
MAIN_HH_F 0
MAIN_OT_M 0
MAIN_OT_F 0
MARGWORK_M 0
MARGWORK_F 0
MARG_CL_M 0
MARG_CL_F 0
MARG_AL_M 0
MARG_AL_F 0
MARG_HH_M 0
MARG_HH_F 0
MARG_OT_M 0
MARG_OT_F 0
MARGWORK_3_6_M 0
MARGWORK_3_6_F 0
MARG_CL_3_6_M 0
MARG_CL_3_6_F 0
MARG_AL_3_6_M 0
MARG_AL_3_6_F 0
MARG_HH_3_6_M 0
MARG_HH_3_6_F 0
MARG_OT_3_6_M 0
MARG_OT_3_6_F 0
MARGWORK_0_3_M 0
MARGWORK_0_3_F 0
MARG_CL_0_3_M 0
MARG_CL_0_3_F 0
MARG_AL_0_3_M 0
MARG_AL_0_3_F 0
MARG_HH_0_3_M 0
MARG_HH_0_3_F 0
MARG_OT_0_3_M 0
MARG_OT_0_3_F 0
NON_WORK_M 0
NON_WORK_F 0
dtype: int64
The summary of data is clearly specified below.
                count          mean           std    min       25%      50%        75%       max
No_HH           640.0  51222.871875  48135.405475  350.0  19484.00  35837.0   68892.00  310450.0
TOT_M           640.0  79940.576563  73384.511114  391.0  30228.00  58339.0  107918.50  485417.0
F_SC            640.0  20778.392188  21727.887713    0.0   5603.25  13709.0   29180.00  156429.0
M_LIT           640.0  57967.979688  55910.282466  286.0  21298.00  42693.5   77989.50  403261.0
F_LIT           640.0  66359.565625  75037.860207  371.0  20932.00  43796.5   84799.75  571140.0
M_ILL           640.0  21972.596875  19825.605268  105.0   8590.00  15767.5   29512.50  105961.0
F_ILL           640.0  56012.518750  47116.693769  327.0  22367.00  42386.0   78471.00  254160.0
TOT_WORK_M      640.0  37992.407813  36419.537491  100.0  13753.50  27936.5   50226.75  269422.0
TOT_WORK_F      640.0  41295.760938  37192.360943  357.0  16097.75  30588.5   53234.25  257848.0
MAINWORK_M      640.0  30204.446875  31480.915680   65.0   9787.00  21250.5   40119.00  247911.0
MAINWORK_F      640.0  28198.846875  29998.262689  240.0   9502.25  18484.0   35063.25  226166.0
MARGWORK_F      640.0  13096.914062  10996.474528  117.0   5424.50  10175.0   18879.25   66915.0
MARGWORK_3_6_M  640.0  41948.168750  39045.316918  291.0  16208.25  30315.0   57218.75  300937.0
MARGWORK_3_6_F  640.0  81076.323438  82970.406216  341.0  26619.50  56793.0  107924.00  676450.0
MARGWORK_0_3_M  640.0   2616.140625   3036.964381    7.0    755.00   1681.5    3320.25   20648.0
We create a new data frame that drops the few columns that are not required
for PCA.
Uttar Pradesh has the highest gender ratio and Dadra and Nagar Haveli has the lowest
gender ratio, for both males and females.
Among districts, Mumbai Suburban has the highest gender ratio and Dibang Valley has the
lowest.
Since the number of districts is very high, the visualisation cannot be displayed clearly.
Uttar Pradesh has the highest literacy ratio and Dadra and Nagar Haveli has the lowest literacy
ratio, for both males and females.
Uttar Pradesh has the highest workers ratio and Dadra and Nagar Haveli has the lowest
workers ratio, for both males and females.
Uttar Pradesh has the highest agricultural labourers ratio and Dadra and Nagar Haveli has the
lowest agricultural labourers ratio, for both males and females.
Now we have to check for outliers in the data, because PCA is
very sensitive to outliers.
There are multiple outliers in the data, so we need to treat them because PCA is very
sensitive to outliers. A single outlier can change the principal components, so I
am choosing to treat the outliers for this specific data. Here I replace each outlier with
the upper or lower limit value.
After treating the outliers, we need to scale the data using the z-score method to
standardize it.
Scaling does not remove outliers, but the z-scores can be used to detect them.
After scaling the data, let's check the correlation between the variables of the data frame.
There is good correlation between a few variables.
We need to confirm the statistical significance of the correlations. They should be
significant for us to proceed with PCA.
H0: Correlations are not significant
H1: There are significant correlations
Reject H0 if p-value < 0.05
We ran Bartlett's test of sphericity and the p-value is 0. Since it is less than 0.05, we
reject H0. As there is evidence of significant correlation, we can proceed
with PCA.
To confirm sampling adequacy, we calculate the KMO value. A value above 0.7 is
acceptable and a value below 0.5 is not acceptable.
The KMO value is 0.93, so it is acceptable and we can proceed with PCA.
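Both checks can be reproduced from the correlation matrix alone. The sketch below is a hand-rolled version using NumPy and SciPy, not necessarily the routine used for this report. The function names `bartlett_sphericity` and `kmo_overall` and the synthetic data are assumptions. Bartlett's statistic is -(n - 1 - (2p + 5)/6) * ln|R| with p(p - 1)/2 degrees of freedom, and KMO compares squared correlations with squared partial correlations.

```python
import numpy as np
from scipy.stats import chi2


def bartlett_sphericity(X):
    """Return (chi-square statistic, p-value) for Bartlett's test of sphericity."""
    n, p = X.shape
    R = np.corrcoef(X, rowvar=False)
    _, logdet = np.linalg.slogdet(R)  # log-determinant, numerically stable
    statistic = -(n - 1 - (2 * p + 5) / 6) * logdet
    dof = p * (p - 1) / 2
    return statistic, chi2.sf(statistic, dof)


def kmo_overall(X):
    """Return the overall Kaiser-Meyer-Olkin measure of sampling adequacy."""
    R = np.corrcoef(X, rowvar=False)
    R_inv = np.linalg.inv(R)
    d = np.sqrt(np.diag(R_inv))
    partial = -R_inv / np.outer(d, d)      # partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)  # mask selecting off-diagonal entries
    r2 = np.sum(R[off] ** 2)
    a2 = np.sum(partial[off] ** 2)
    return r2 / (r2 + a2)


# Synthetic data: five variables driven by one common factor plus noise.
rng = np.random.default_rng(0)
factor = rng.normal(size=(300, 1))
X = factor + 0.5 * rng.normal(size=(300, 5))

stat, p_value = bartlett_sphericity(X)
kmo = kmo_overall(X)
print(f"Bartlett p-value: {p_value:.3g}, KMO: {kmo:.2f}")
```

With many highly collinear variables the correlation matrix can be close to singular, in which case the plain inverse used here may be unstable.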
We then fit the PCA model and transform the data, initially keeping all 57 components. The component matrix is shown below.
array([[ 0.14922158, 0.15916917, 0.15820921, ..., 0.14136961,
0.14762899, 0.14210263],
[-0.11548673, -0.08023879, -0.09371751, ..., 0.03510934,
-0.04912234, -0.03984815],
[ 0.1015276 , -0.03866173, 0.0289595 , ..., -0.10217491,
-0.12667281, -0.02854464],
...,
[ 0.00112879, -0.00673066, 0.02298648, ..., -0.01159627,
0.05608352, -0.00610478],
[ 0.00070908, 0.04637872, 0.00402434, ..., 0.01406358,
-0.07729171, -0.00056173],
[-0.00461221, -0.00370327, 0.00963954, ..., 0.00227908,
0.00539901, 0.00130606]])
We check the eigenvalues of all 57 components.
array([3.56488638e+01, 7.64357559e+00, 3.76919551e+00, 2.77722349e+00,
1.90694892e+00, 1.15490310e+00, 9.87726707e-01, 4.64629906e-01,
3.96708513e-01, 3.22346888e-01, 2.73207369e-01, 2.35647574e-01,
1.81401107e-01, 1.69243770e-01, 1.38592325e-01, 1.31505852e-01,
1.03809666e-01, 9.55333831e-02, 8.58580407e-02, 8.09138742e-02,
6.60179067e-02, 6.30797999e-02, 4.82756124e-02, 4.59506197e-02,
4.37747566e-02, 3.19339710e-02, 2.86194563e-02, 2.75481445e-02,
2.34340044e-02, 2.20296816e-02, 1.87487040e-02, 1.59004895e-02,
1.39957919e-02, 1.18916465e-02, 1.11133495e-02, 9.07842645e-03,
7.25127869e-03, 6.27213692e-03, 4.95541908e-03, 4.60667097e-03,
3.45902033e-03, 2.18408510e-03, 2.13514664e-03, 1.92111328e-03,
1.43840980e-03, 1.09968912e-03, 9.65752052e-04, 8.62630267e-04,
6.51634478e-04, 5.76658846e-04, 4.35790607e-04, 3.70037468e-04,
3.06660171e-04, 2.07854170e-04, 1.38286484e-04, 8.97034441e-05,
4.61745385e-05])
The covariance matrix is calculated below
array([[1. , 0.91127279, 0.97149267, ..., 0.65174157, 0.76720055,
0.7966374 ],
[0.91127279, 1. , 0.97859043, ..., 0.73168646, 0.86481243,
0.78948116],
[0.97149267, 0.97859043, 1. , ..., 0.7107652 , 0.83833472,
0.81336875],
...,
[0.65174157, 0.73168646, 0.7107652 , ..., 1. , 0.76129967,
0.71962667],
[0.76720055, 0.86481243, 0.83833472, ..., 0.76129967, 1. ,
0.90083619],
[0.7966374 , 0.78948116, 0.81336875, ..., 0.71962667, 0.90083619,
1. ]])
Transposing the PCA components gives the eigenvectors, with one row per original variable.
array([[ 0.14922158, -0.11548673, 0.1015276 , 0.07681409, -0.01209003,
0.08255794, 0.10689589, -0.09951296, 0.02609778, 0.06812864,
-0.0586205 , -0.02177543],
[ 0.15916917, -0.08023879, -0.03866173, 0.05297633, -0.04234376,
0.07366681, -0.12408501, -0.10886983, 0.03285504, -0.04842824,
0.02949081, -0.04766829],
[ 0.15820921, -0.09371751, 0.0289595 , 0.07002217, -0.02292653,
0.08281204, -0.01029127, -0.11527589, 0.03640371, -0.02246575,
-0.02015258, -0.0428273 ],
[ 0.15634043, -0.02034061, -0.07441918, 0.02851986, -0.08033939,
0.09237947, -0.20080697, -0.13294526, 0.13840682, -0.15723774,
-0.00916557, -0.146674 ],
[ 0.1568144 , -0.01431023, -0.06822314, 0.01639807, -0.07832648,
0.08001002, -0.20341137, -0.139343 , 0.16571649, -0.14503123,
-0.02557186, -0.14463103],
[ 0.14335015, -0.07966701, -0.03761902, 0.01021041, -0.16789316,
0.05096945, -0.04039897, 0.18916926, -0.53174333, -0.09845631,
-0.19462968, -0.12262118],
[ 0.14353705, -0.08709832, 0.02134973, 0.01624416, -0.15809156,
0.05456754, 0.05398985, 0.17736326, -0.51506329, -0.06583908,
-0.25036558, -0.11452469],
[ 0.01884873, 0.06910144, 0.32382724, 0.09114279, 0.41841183,
-0.23180881, -0.35523838, -0.07163216, -0.11301919, -0.00838594,
-0.0824946 , -0.05551678],
[ 0.01787797, 0.06731586, 0.33870545, 0.07955449, 0.4159652 ,
-0.21454239, -0.32767705, -0.07839145, -0.13603111, -0.02861308,
-0.08142959, -0.05122301],
[ 0.15515239, -0.10598636, -0.03210704, 0.08918669, -0.01403251,
0.081378 , -0.06706185, -0.10288631, -0.0174454 , 0.00057329,
0.02382055, 0.03467198],
[ 0.14544984, -0.13323356, -0.00513336, 0.12541201, 0.02908422,
0.1022068 , 0.01349177, -0.12707401, 0.00097933, 0.12340681,
-0.01479861, 0.08753506],
[ 0.1545511 , -0.00945956, -0.04705352, -0.03466478, -0.10407302,
0.03795699, -0.24309747, -0.09103651, 0.12950534, -0.15515273,
0.04694997, -0.22031064],
[ 0.15828347, -0.02179345, 0.07934454, -0.01057813, -0.11033167,
0.01398577, -0.03698859, -0.05363198, 0.03034449, -0.14827181,
0.00754684, -0.16441159],
[ 0.15407627, -0.12091195, -0.0011159 , 0.06904579, -0.02310352,
0.0358025 , -0.08540362, -0.04578257, -0.02462895, 0.09006288,
0.11158062, 0.0362932 ],
[ 0.14252995, -0.07600253, 0.19412998, 0.11105656, -0.01893052,
-0.01658672, 0.17425777, -0.06837732, 0.072908 , -0.01450426,
-0.13880219, 0.03461047],
[ 0.14193201, -0.16669997, 0.01982148, 0.10018791, -0.04322541,
0.01805394, -0.08732635, -0.05206017, -0.05125101, 0.12495614,
0.14241509, 0.05347519],
[ 0.12573163, -0.14224991, 0.20997642, 0.13301329, -0.054674 ,
-0.05195118, 0.14903607, -0.07720208, 0.09720121, 0.06229771,
-0.18270277, 0.11503845],
[ 0.11169244, 0.04255228, 0.03313125, 0.07885146, -0.30337639,
-0.2935043 , -0.28879016, 0.42581303, -0.0210675 , 0.21067673,
0.32765363, -0.10568573],
[ 0.08303496, 0.09589258, 0.1888222 , 0.2650219 , -0.25792534,
-0.26991402, 0.0262944 , 0.19774214, 0.20599541, -0.30804863,
-0.21591822, 0.09956266],
[ 0.11929067, -0.05334228, 0.22583087, -0.12137878, -0.25313081,
-0.0233356 , -0.11070105, 0.0363001 , 0.10411389, 0.38208405,
-0.07043905, -0.15175049],
[ 0.09008881, -0.07246688, 0.35656643, -0.02098921, -0.19921997,
-0.05655819, 0.12568925, 0.05016856, 0.16615054, 0.2039094 ,
-0.23071378, 0.14047405],
[ 0.14184969, -0.10183528, -0.10220234, -0.02196919, -0.06081182,
-0.14286889, -0.06468864, -0.11788673, -0.27774368, -0.21273456,
0.13161045, 0.26121991],
[ 0.13388011, -0.11325661, 0.02161302, -0.04543644, -0.0230627 ,
-0.31847365, 0.23118776, -0.24852423, -0.12512191, 0.00842445,
0.10649846, -0.07437718],
[ 0.1227618 , -0.2036023 , -0.02814398, 0.14702469, 0.06990677,
0.07121365, -0.00776822, -0.07730861, -0.11135006, 0.14358174,
0.21358002, 0.2380841 ],
[ 0.1168656 , -0.20589888, 0.06903375, 0.15591746, 0.10677437,
0.03388487, 0.09129161, -0.08264689, -0.04130694, 0.16077765,
0.08804174, 0.26487045],
[ 0.15665637, 0.07903864, -0.06868497, -0.07857186, 0.06581161,
0.07865492, -0.05722289, 0.03831204, 0.10143156, 0.03894095,
-0.10161767, 0.03946556],
[ 0.14869489, 0.10881279, 0.10495656, 0.01578813, 0.07762414,
0.09915551, 0.15271912, 0.05683757, 0.04488718, -0.14334494,
0.01602309, -0.0953364 ],
[ 0.08816344, 0.2715224 , -0.10474484, 0.15710396, -0.01800453,
-0.03273765, -0.00294181, -0.05997628, -0.00729524, 0.25389695,
-0.02680659, 0.00639141],
[ 0.06516026, 0.27539755, -0.03632536, 0.28502411, -0.05515214,
-0.03178707, 0.06348769, -0.03542371, -0.01298311, -0.09009164,
0.14122895, 0.08278884],
[ 0.1272781 , 0.15657864, 0.0704345 , -0.25059413, -0.04720013,
0.07974782, -0.09344179, 0.01684894, -0.01888328, 0.11682358,
0.07391181, 0.07830835],
[ 0.11588826, 0.13504767, 0.25998651, -0.15379789, -0.01264328,
0.11762488, 0.09222418, 0.03280765, -0.05168412, -0.148361 ,
0.23199585, 0.10020847],
[ 0.14536607, 0.04097368, -0.14434657, -0.16753968, 0.00557458,
-0.16997996, -0.05567003, 0.03363506, 0.04601489, -0.10345429,
-0.13520726, 0.27585568],
[ 0.14230182, 0.00668481, -0.09383805, -0.15146925, 0.04361632,
-0.31959562, 0.18400519, -0.13319507, -0.009807 , 0.02981446,
0.07433996, -0.19261057],
[ 0.15087675, -0.07344039, -0.13141498, 0.02119534, 0.1451087 ,
0.01823245, -0.02139323, 0.17805166, 0.06086733, 0.0091236 ,
0.01991064, 0.10234219],
[ 0.14801846, -0.08836101, -0.05388345, 0.05996115, 0.19075649,
0.00240871, 0.09974423, 0.25190766, 0.08117148, 0.01431973,
0.12688446, -0.13244493],
[ 0.15790761, -0.04404402, -0.06687743, 0.03931895, -0.0598864 ,
0.10337666, -0.15317971, -0.14984177, 0.08935349, -0.15522534,
-0.04523389, -0.10574031],
[ 0.15583101, -0.09238317, -0.05871826, 0.04613025, -0.02247554,
0.11746706, -0.09836715, -0.12124342, 0.01321602, -0.00205965,
0.03746853, -0.07240161],
[ 0.15764021, 0.06620762, -0.06017243, -0.09131505, 0.05907845,
0.07238086, -0.06421911, 0.04254566, 0.11865173, 0.02711453,
-0.05877036, 0.0488893 ],
[ 0.1495015 , 0.08965133, 0.1257919 , 0.01886534, 0.06434924,
0.07089589, 0.14288847, 0.07205396, 0.06838015, -0.18267649,
0.07137054, -0.05881623],
[ 0.0947852 , 0.26126801, -0.09655088, 0.13159069, -0.01388688,
-0.04137688, -0.01126399, -0.0366386 , 0.037604 , 0.25511706,
0.00305617, 0.02623711],
[ 0.06715842, 0.26669101, -0.01825633, 0.29284517, -0.06101878,
-0.04936682, 0.05963742, -0.01288921, 0.02690756, -0.13595046,
0.17579363, 0.09843767],
[ 0.12818439, 0.14983097, 0.07819427, -0.2503371 , -0.05866475,
0.0731517 , -0.09594801, 0.02898646, -0.00080806, 0.10856889,
0.09174 , 0.06642587],
[ 0.11395923, 0.12064763, 0.28323496, -0.14304544, -0.02538622,
0.09486752, 0.08953895, 0.06295586, -0.02844888, -0.16406229,
0.25494868, 0.11124873],
[ 0.14510769, 0.03676265, -0.14251113, -0.16600189, 0.00331494,
-0.17463445, -0.05548298, 0.03264548, 0.03762013, -0.10739709,
-0.10073182, 0.27838852],
[ 0.14102942, -0.00368515, -0.08935617, -0.14259884, 0.04167758,
-0.34396998, 0.17735371, -0.12126702, -0.02248656, 0.01619689,
0.11692876, -0.20104553],
[ 0.15092232, -0.0777393 , -0.13068659, 0.01988712, 0.13279387,
0.01582574, -0.02259074, 0.16680123, 0.06982819, 0.00796007,
0.04505291, 0.10296503],
[ 0.14753416, -0.10114106, -0.05848926, 0.0600874 , 0.17059608,
-0.00485718, 0.07857288, 0.2224764 , 0.08868533, 0.00635225,
0.15277396, -0.10679669],
[ 0.14298675, 0.13683939, -0.10356452, -0.01822291, 0.0942929 ,
0.11104532, -0.02590217, 0.018268 , -0.00476824, 0.10656481,
-0.2841336 , -0.01112016],
[ 0.13378373, 0.16641612, 0.03342285, 0.0059541 , 0.11235112,
0.18588236, 0.17850035, -0.00407236, -0.0239809 , 0.00863649,
-0.15517828, -0.17161971],
[ 0.06296394, 0.28188148, -0.1202934 , 0.20894141, -0.01807012,
-0.00459955, 0.00947356, -0.11585956, -0.13405721, 0.18025429,
-0.11074858, -0.05974346],
[ 0.05674058, 0.28754091, -0.08809749, 0.2404994 , -0.03629271,
0.0220235 , 0.0664972 , -0.09544746, -0.13424113, 0.04261022,
0.01570896, 0.0321015 ],
[ 0.11910165, 0.18234077, 0.02617609, -0.24041564, 0.01698094,
0.10938653, -0.0828577 , -0.04866015, -0.09335608, 0.16042119,
-0.00215884, 0.07506802],
[ 0.11304417, 0.17711216, 0.16477413, -0.18940781, 0.04753801,
0.18900563, 0.10968562, -0.07017643, -0.13770348, -0.04543535,
0.1269854 , 0.04833875],
[ 0.14213963, 0.05292484, -0.14441938, -0.16755357, 0.01418678,
-0.14968946, -0.05078585, 0.03888177, 0.07535478, -0.08411327,
-0.23982379, 0.2548551 ],
[ 0.14136961, 0.03510934, -0.10217491, -0.16901995, 0.04750424,
-0.23385789, 0.19468631, -0.15104151, 0.03806949, 0.08511215,
-0.04765465, -0.15385683],
[ 0.14762899, -0.04912234, -0.12667281, 0.02403566, 0.19178951,
0.02290434, -0.01633823, 0.23257906, 0.01366747, 0.03342147,
-0.088225 , 0.11227486],
[ 0.14210263, -0.03984815, -0.02854464, 0.05740164, 0.24976544,
0.04283359, 0.17525208, 0.32586916, 0.0509088 , 0.02365377,
-0.0234607 , -0.20968519]])
Checking the cumulative explained variance ratio to find a cut off for selecting the
number of PCs.
array([0.62444145, 0.75832974, 0.82435265, 0.87299974, 0.90640271,
0.92663251, 0.94393397, 0.95207264, 0.95902156, 0.96466793,
0.96945356, 0.97358126, 0.97675877, 0.97972332, 0.98215096,
0.98445448, 0.98627285, 0.98794626, 0.98945019, 0.99086751,
0.99202391, 0.99312884, 0.99397446, 0.99477935, 0.99554613,
0.9961055 , 0.99660681, 0.99708936, 0.99749984, 0.99788572,
0.99821413, 0.99849265, 0.99873781, 0.99894611, 0.99914077,
0.99929979, 0.99942681, 0.99953668, 0.99962348, 0.99970417,
0.99976476, 0.99980302, 0.99984042, 0.99987407, 0.99989927,
0.99991853, 0.99993544, 0.99995055, 0.99996197, 0.99997207,
0.9999797 , 0.99998619, 0.99999156, 0.9999952 , 0.99999762,
0.99999919, 1. ])
From the cumulative explained variance, we take 12 principal components, which together explain
about 97% of the variance in the data.
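The cut-off can be checked programmatically. In this sketch, `components_needed` is a hypothetical helper and `reported` holds the first 13 cumulative ratios printed above. The second half shows where such values come from in scikit-learn, using random data as a stand-in for the scaled census variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def components_needed(cumulative_ratio, threshold):
    """Smallest number of components whose cumulative ratio reaches the threshold."""
    return int(np.searchsorted(cumulative_ratio, threshold) + 1)


# First 13 cumulative explained variance ratios, copied from the output above.
reported = np.array([
    0.62444145, 0.75832974, 0.82435265, 0.87299974, 0.90640271,
    0.92663251, 0.94393397, 0.95207264, 0.95902156, 0.96466793,
    0.96945356, 0.97358126, 0.97675877,
])
n_keep = components_needed(reported, 0.97)
print("components needed for 97%:", n_keep)

# How the eigenvalues and cumulative ratios are obtained (stand-in data).
rng = np.random.default_rng(0)
X_scaled = StandardScaler().fit_transform(rng.normal(size=(100, 6)))
pca = PCA().fit(X_scaled)
eigenvalues = pca.explained_variance_
cumulative = np.cumsum(pca.explained_variance_ratio_)
```

The same helper shows that a 90% threshold would need only 5 components, so the choice of 12 reflects how much variance the analyst wants to retain.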
Checking as to how the original features matter to each PC by constructing a plot.
Comparing how the original features influence various PCs by constructing a heat map.
Now we fit the PCA model again with n_components set to 12 and transform the scaled data.
Next, we check the correlation among the principal components using a heat map. The
components should have zero correlation with each other, which verifies the PCA.
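The zero-correlation check can also be done numerically instead of by eye. This sketch uses synthetic correlated data and 4 components purely for illustration; the report itself keeps 12 components from the 57 census variables.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic correlated data: 3 base variables plus 5 noisy mixtures of them.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
mixed = base @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(200, 5))
X_scaled = StandardScaler().fit_transform(np.hstack([base, mixed]))

# Refit with a reduced number of components and get the component scores.
pca = PCA(n_components=4)
scores = pca.fit_transform(X_scaled)

# Correlation matrix of the scores; off-diagonal entries should be ~0.
corr = np.corrcoef(scores, rowvar=False)
max_abs = np.abs(corr - np.eye(corr.shape[0])).max()
print("largest off-diagonal correlation:", max_abs)
```

Passing `corr` to a heat map function gives the visual version of the same check, with ones on the diagonal and zeros elsewhere.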
The linear equation for the first PC is a weighted sum of the scaled variables, with the PC1 loadings as the weights.
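A small helper can assemble such an equation from a vector of loadings. The function `pc_equation` is hypothetical. The three weights are the first PC1 loadings from the eigenvector output above, and pairing them with No_HH, TOT_M and TOT_F is an assumption based on the column order of the data frame, which the report does not state explicitly.

```python
def pc_equation(loadings, feature_names, label="PC1", decimals=3):
    """Format a principal component as a signed weighted sum of features."""
    terms = []
    for weight, name in zip(loadings, feature_names):
        sign = "-" if weight < 0 else "+"
        terms.append(f"{sign} {abs(weight):.{decimals}f} * {name}")
    body = " ".join(terms)
    # Drop a leading "+ " so that the equation starts cleanly.
    if body.startswith("+ "):
        body = body[2:]
    return f"{label} = {body}"


# First three PC1 loadings from the report; the feature pairing is assumed.
equation = pc_equation(
    [0.14922158, 0.15916917, 0.15820921],
    ["No_HH", "TOT_M", "TOT_F"],
)
print(equation)
```

In the full equation the sum runs over all 57 scaled variables, using the first column of the eigenvector matrix as the weights.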