Project on Statistical Methods for Decision Making
by Rohit Yadav

DECLARATION

I certify that:
a. The work contained in this project has been done by me under the guidance of my supervisor.
b. The work has not been submitted to any other Institute for any degree or diploma.
c. I have followed the guidelines provided by the Institute in preparing the project report.
d. I have conformed to the norms and guidelines given in the Ethical Code of Conduct of the Institute.
e. Whenever I have used materials (data, theoretical analysis, figures, and text) from other sources, I have given due credit to them by citing them in the text of the report and giving their details in the references. Further, I have taken permission from the copyright owners of the sources, whenever necessary.

Name of Student: Rohit Yadav
Signature of Student

Table of Contents

Executive Summary
Introduction
Problem 1
1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.3 Perform one-way ANOVA for variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result.
1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.
1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
1.7 Explain the business implications of performing ANOVA for this particular case study.
Problem 2
2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
2.3 Comment on the comparison between the covariance and the correlation matrices from this data. [on scaled data]
2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [Hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components? What do the eigenvectors indicate?
2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]

1. Importance of Variance in Analytics

Advancement of technologies has made it easy for all organizations to collect and store a vast amount of data. Each data point is different from the others, and understanding, analyzing and generating actionable insight from the data has become essential for organizations to remain ahead of their competition. Understanding data is to a large extent synonymous with understanding the various sources of variability in the data. Variability is an inherent property of an attribute (random variable). Take, for example, the "height" of a person.
The fact that not all adult human beings are of the same height is due to variability in the height distribution. To predict the height of a randomly selected adult human being with any degree of accuracy, it is important to understand which factors, if any, are responsible for the difference in height. Sources of variability are of two primary types: systematic and error (random or chance). When one or more factors can be identified as contributing to the variability, it is known as a systematic source of variation. One systematic source of variation of height is gender. Typically (on average) adult males are taller than adult females. When the variability cannot be attributed to any known source, it is known as error. If there is a height difference between two adult twin siblings of the same sex, the difference is purely due to chance or error. Any analytical or prediction problem tries to identify as many systematic sources of variation as possible.

2. What is Analysis of Variance?

The formal definition of Analysis of Variance (ANOVA)
ANOVA is a statistical technique that assumes that the observed response is coming from more than one population and tests the hypothesis that at least one population mean is different from the rest. The basic concept of ANOVA is to separate the total variability in a dataset into two types: the variability that can be attributed to specified causes and the variability that can be attributed to chance or error.

The objective of the ANOVA
Analysis of Variance (ANOVA) is a hypothesis testing technique that is used to determine whether the means of more than two populations are identical.
The underlying assumption is that the heterogeneity or variability in the data is due to the fact that the data is coming from more than two different normal populations whose variance is the same. This technique is used in various problems, such as comparing yields of a crop from several varieties of seeds, the gasoline mileage of various types of automobiles, satisfaction scores of customers with respect to mobile network services in different locations, etc. This technique has application in various fields such as sociology, economics, marketing, laboratory experiments, etc.

Experimental design is the plan used to collect the data. The basic purpose of setting up an experiment is to observe the impact of one or more factors on the observed variable. A factor is an independent explanatory variable with several levels. Each level of the factor represents a different population. The response is the dependent variable, which is continuous and assumed to follow a normal distribution.

Types of ANOVA
We have discussed two types of Analysis of Variance problems in detail:
a) One-way ANOVA: when the response depends on a single factor.
b) Two-way ANOVA: when the response depends on two factors, which may or may not interact with each other.

Assumptions for ANOVA
1. All populations under consideration have a normal distribution.
2. All populations under consideration have equal variances.
3. The sample is a random sample, i.e. the observations are collected independently of each other.

Problem 1
Salary is hypothesized to depend on educational qualification and occupation. To understand the dependency, the salaries of 40 individuals [SalaryData.csv] are collected and each person's educational qualification and occupation are noted. Educational qualification is at three levels: High school graduate, Bachelor, and Doctorate. Occupation is at four levels: Administrative and clerical, Sales, Professional or specialty, and Executive or managerial.
A different number of observations are in each education-occupation combination. [Assume that the data follows a normal distribution. In reality, the normality assumption may not always hold if the sample size is small.]

1.1 State the null and the alternate hypothesis for conducting one-way ANOVA for both Education and Occupation individually.
Solution:
Hypothesis for Education:
Null Hypothesis H0: The mean salary is the same across all 3 categories of education (Doctorate, Bachelors, HS-grad).
Alternate Hypothesis Ha: The mean salary is different in at least one category of education.
Hypothesis for Occupation:
Null Hypothesis H0: The mean salary is the same across all 4 categories of occupation (Adm-clerical, Sales, Prof-specialty, Exec-managerial).
Alternate Hypothesis Ha: The mean salary is different in at least one category of occupation.

1.2 Perform one-way ANOVA for Education with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
                df        sum_sq       mean_sq         F        PR(>F)
C(Education)   2.0  1.026955e+11  5.134773e+10  30.95628  1.257709e-08
Residual      37.0  6.137256e+10  1.658718e+09       NaN           NaN
Figure No. 1 Education
We performed the one-way ANOVA in Python. Since the p-value = 1.257709e-08 is less than the significance level alpha = 5%, we reject the null hypothesis and conclude that the mean salary differs significantly for at least one category of education.

1.3 Perform one-way ANOVA for variable Occupation with respect to the variable 'Salary'. State whether the null hypothesis is accepted or rejected based on the ANOVA results.
Solution:
                 df        sum_sq       mean_sq         F    PR(>F)
C(Occupation)   3.0  1.125878e+10  3.752928e+09  0.884144  0.458508
Residual       36.0  1.528092e+11  4.244701e+09       NaN       NaN
Figure No. 2 Occupation
We performed the one-way ANOVA in Python. We fail to reject the null hypothesis, as the p-value = 0.458508 is greater than the significance level alpha = 5%.
We conclude that there is no statistically significant difference in the mean salaries across the 4 categories of occupation.

1.4 If the null hypothesis is rejected in either (1.2) or in (1.3), find out which class means are significantly different. Interpret the result.
Solution:

1.5 What is the interaction between the two treatments? Analyze the effects of one variable on the other (Education and Occupation) with the help of an interaction plot.
Solution:
[Figure No. 3 Line Plot: salary by Occupation (Adm-clerical, Sales, Prof-specialty, Exec-managerial), one line per Education level (Doctorate, Bachelors, HS-grad)]
I. People with HS-grad education do not reach Exec-managerial positions; they hold only Adm-clerical, Sales and Prof-specialty jobs, and they earn less than people with Bachelors and Doctorate education.
II. People with Bachelors and Doctorate education earn almost the same salary, ranging from 175000 to 250000, in the Adm-clerical and Sales occupations.
III. People with Bachelors education in Prof-specialty occupations earn less than people with Bachelors education in Adm-clerical and Sales occupations.
IV. People with Bachelors education in Sales earn more than people with Bachelors education in Prof-specialty, whereas people with Doctorate education in Sales earn less than people with Doctorate education in Prof-specialty.
V. People with Bachelors education in Prof-specialty occupations earn less than people with Bachelors education and occupation as
VI. Sales people with Bachelors or Doctorate education earn the same salary, and earn more than Sales people with HS-grad education.
VII. Adm-clerical people with HS-grad education earn the lowest salary when compared to those with Bachelors and Doctorate education.
VIII. Among Prof-specialty people, those with Doctorate education earn the maximum salaries and those with HS-grad education earn the minimum.
IX.
There are no people with HS-grad education who hold an Exec-managerial occupation.

1.6 Perform a two-way ANOVA based on the Education and Occupation (along with their interaction Education*Occupation) with the variable 'Salary'. State the null and alternative hypotheses and state your results. How will you interpret this result?
Solution:
Null Hypothesis H0: The effect of the independent variable Education on the mean salary does not depend on the independent variable Occupation (there is no interaction effect between the two independent variables Education and Occupation).
Alternate Hypothesis Ha: There is an interaction effect between the independent variable Education and the independent variable Occupation on the mean salary.
                              df        sum_sq       mean_sq          F        PR(>F)
C(Education)                 2.0  1.026955e+11  5.134773e+10  72.211958  5.466264e-12
C(Occupation)                3.0  5.519946e+09  1.839982e+09   2.587626  7.211586e-02
C(Education):C(Occupation)   6.0  3.634909e+10  6.058182e+09   8.519815  2.232500e-05
Residual                    29.0  2.062102e+10  7.110697e+08        NaN           NaN
Figure No. 4 Interaction
From the above table we can see that there is a significant interaction between the variables Education and Occupation. As the p-value = 2.232500e-05 is less than the significance level alpha = 5%, we reject the null hypothesis. Hence we can say that there is an interaction effect between Education and Occupation on the mean salary.

1.7 Explain the business implications of performing ANOVA for this particular case study.
Solution:
From the analysis of variance and the interaction plot, we can observe that education combined with occupation results in higher salaries. It can clearly be seen that people with Doctorate education draw the maximum salary and people with HS-grad education earn the least. From the above analysis of variance, we can say that salary depends on educational qualification and on the combination of education and occupation, while Occupation on its own did not show a significant effect at the 5% level in the one-way test (p-value = 0.458508).
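The one-way and two-way tables in Problem 1 have the column layout of statsmodels' anova_lm output, so the workflow can be sketched as below. This is a minimal sketch under stated assumptions, not the report's actual code: the data frame is synthetic (SalaryData.csv is not reproduced here), the column names Education, Occupation and Salary are taken from the table labels, and Tukey's HSD is one common choice for the pairwise comparison asked for in 1.4 (the report does not state which method it used).

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in for SalaryData.csv: these numbers are made up for
# illustration and are NOT the report's data (balanced, 4 people per cell).
rng = np.random.default_rng(0)
mean_by_education = {"HS-grad": 75_000, "Bachelors": 165_000, "Doctorate": 210_000}
occupations = ["Adm-clerical", "Sales", "Prof-specialty", "Exec-managerial"]
rows = [
    {"Education": edu, "Occupation": occ, "Salary": mean + rng.normal(0, 20_000)}
    for edu, mean in mean_by_education.items()
    for occ in occupations
    for _ in range(4)
]
df = pd.DataFrame(rows)

# One-way ANOVA: Salary explained by a single categorical factor.
anova_education = anova_lm(ols("Salary ~ C(Education)", data=df).fit())
anova_occupation = anova_lm(ols("Salary ~ C(Occupation)", data=df).fit())

# Two-way ANOVA: both main effects plus the Education x Occupation interaction.
two_way_model = ols(
    "Salary ~ C(Education) + C(Occupation) + C(Education):C(Occupation)", data=df
).fit()
anova_two_way = anova_lm(two_way_model)

# Post-hoc pairwise comparison of the Education levels (Tukey's HSD, alpha = 5%).
tukey_education = pairwise_tukeyhsd(
    endog=df["Salary"], groups=df["Education"], alpha=0.05
)

print(anova_education)
print(anova_occupation)
print(anova_two_way)
print(tukey_education.summary())
```

In each table the PR(>F) column is compared against alpha = 0.05, exactly as in the solutions above. The Tukey summary lists every pair of Education levels with a reject flag, which is the kind of output question 1.4 asks to interpret.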
Problem 2
The dataset Education - Post 12th Standard.csv contains information on various colleges. You are expected to do a Principal Component Analysis for this case study according to the instructions given. The data dictionary of the 'Education - Post 12th Standard.csv' can be found in the following file: Data Dictionary.xlsx.

2.1 Perform Exploratory Data Analysis [both univariate and multivariate analysis to be performed]. What insight do you draw from the EDA?
Solution:
Univariate Analysis
[Figure No. 5 Histogram]
Observation:
• There are 17 numeric fields in the dataset.
• Outstate, Room Board, Grad Rate and Top25perc are approximately normally distributed.
• Scaling needs to be done because the distance between the minimum and maximum values is too vast.
• The mean Grad Rate is 65.4.
• The mean PhD is 72.66.

Multivariate Analysis
Pair Plot
The pair plot shows the relationship between the variables in the form of scatterplots and the distribution of each variable in the form of a histogram.
[Figure No. 6 Pair Plot]
Correlation Plot
[Figure No. 7 Correlation Plot]
From the correlation plot, we can see that there is a very strong correlation between Apps, Accept, Enroll and F Undergrad. Correlation values near 1 or -1 indicate strong positive or strong negative correlation, respectively. Correlation values near 0 indicate that the variables are not linearly correlated with each other.

2.2 Is scaling necessary for PCA in this case? Give justification and perform scaling.
Solution:
Yes. Scaling is necessary for PCA in this case because the variables are measured on very different scales. PCA is driven by variance, so without scaling the variables with the largest values would dominate the components; scaling brings all variables onto the same base.
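A minimal sketch of the scaling step, assuming scikit-learn's StandardScaler (z-score scaling); the report does not say which scaler it used. The three columns are synthetic stand-ins on deliberately different scales, and only the row count of 777 mirrors the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up columns on very different scales (NOT the college dataset itself);
# only the row count, 777, mirrors the report.
rng = np.random.default_rng(42)
n_rows = 777
raw = pd.DataFrame({
    "Apps": rng.gamma(shape=2.0, scale=1500.0, size=n_rows),
    "Expend": rng.normal(loc=9660.0, scale=5000.0, size=n_rows),
    "GradRate": rng.normal(loc=65.0, scale=17.0, size=n_rows),
})

# z-score scaling: subtract each column's mean, divide by its standard deviation.
scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# StandardScaler divides by the population std (ddof=0); describe() reports the
# sample std (ddof=1), so scaled columns show sqrt(n / (n - 1)), not exactly 1.
print(scaled.describe().loc[["mean", "std"]].round(6))

# On scaled data, the sample covariance matrix equals the correlation matrix of
# the raw data up to the same n / (n - 1) factor.
cov_scaled = scaled.cov()
corr_raw = raw.corr()
print(np.allclose(cov_scaled, corr_raw * n_rows / (n_rows - 1)))
```

For 777 rows, sqrt(777 / 776) is about 1.000644, which would account for a standard deviation slightly above 1 in a describe() of the scaled data. The last check is also the short answer to the covariance-versus-correlation comparison on scaled data: after z-scoring, the two matrices carry the same information.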
Before Scaling
[Table No. 1 Describe: count, mean, std, min, 25%, 50%, 75% and max for each numeric variable before scaling; count = 777.0 for every variable]

After Scaling
[Table No. 2 Describe scaled: the same statistics after scaling; every variable has a mean of approximately 0 and a std of 1.000644]

2.3 Comment on the comparison between the covariance and the correlation matrices from this data.
[on scaled data]
Solution:

2.4 Check the dataset for outliers before and after scaling. What insight do you derive here?
Solution:
Before Scaling
[Figure: Before Scaling]
After Scaling
[Figure: After Scaling (Scaled)]

2.5 Extract the eigenvalues and eigenvectors. [Using Sklearn PCA Print Both]
Solution:
Eigenvectors
[Figure No. 10 Eigen vectors and Figure No. 11 eigenvector 2: printed 17 x 17 array of eigenvectors]

Eigen Values
array([5.45052162, 4.48360686, 1.17466761, 1.00820573, 0.93423123,
       0.84849117, 0.6057878 , 0.58787222, 0.53061262, 0.4043029 ,
       0.31344588, 0.22061096, 0.16779415, 0.1439785 , 0.08802464,
       0.03672545, 0.02302787])
Figure No. 12 Eigen values

Explained Variance Ratio
array([0.32020628, 0.26340214, 0.06900917, 0.05922989, 0.05488405,
       0.04984701, 0.03558871, 0.03453621, 0.03117234, 0.02375192,
       0.01841426, 0.01296041, 0.00985754, 0.00845842, 0.00517126,
       0.00215754, 0.00135284])
Figure No. 13 Explained Variance Ratio

2.6 Perform PCA and export the data of the Principal Component (eigenvectors) into a data frame with the original features.
Solution:
[Table No. 3 Dataframe with PCs: one row per original feature (Apps, Accept, Enroll, Top10perc, Top25perc, FUndergrad, PUndergrad, Outstate, RoomBoard, Books, Personal, PhD, Terminal, SFRatio, percalumni, Expend, GradRate), one column per principal component]

2.7 Write down the explicit form of the first PC (in terms of the eigenvectors. Use values with two places of decimals only). [Hint: write the linear equation of PC in terms of eigenvectors and corresponding features]
Solution:

2.8 Consider the cumulative values of the eigenvalues. How does it help you to decide on the optimum number of principal components?
What do the eigenvectors indicate?
Solution:
array([0.32020628, 0.58360843, 0.65261759, 0.71184748, 0.76673154,
       0.81657854, 0.85216726, 0.88670347, 0.91787581, 0.94162773,
       0.96004199, 0.9730024 , 0.98285994, 0.99131837, 0.99648962,
       0.99864716, 1.        ])
Figure No. 14 Cumulative values
The cumulative values show that the first 7 PCs alone capture about 85% of the variance in the data, which saves time and lets us focus on the required PCs for further analysis.
Eigenvectors represent directions. Think of plotting the data on a multidimensional scatterplot; an individual eigenvector is then a particular "direction" in that scatterplot. Eigenvalues represent magnitude, or importance: the amount of variance in the data along the corresponding direction.

2.9 Explain the business implication of using the Principal Component Analysis for this case study. How may PCs help in the further analysis? [Hint: Write Interpretations of the Principal Components Obtained]
Solution:
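As a cross-check on the cumulative values discussed in 2.8, the sketch below recomputes them from the explained variance ratios reported in 2.5 and finds the smallest number of components that reaches a chosen share of variance. The 85% target is an assumption taken from the wording of the 2.8 solution, and the variable names are illustrative.

```python
import numpy as np

# Explained variance ratios as reported in section 2.5 (17 components).
# With scikit-learn these come from the explained_variance_ratio_ attribute
# of a fitted PCA object.
explained_variance_ratio = np.array([
    0.32020628, 0.26340214, 0.06900917, 0.05922989, 0.05488405,
    0.04984701, 0.03558871, 0.03453621, 0.03117234, 0.02375192,
    0.01841426, 0.01296041, 0.00985754, 0.00845842, 0.00517126,
    0.00215754, 0.00135284,
])

# Running total of the variance share captured by the first k components.
cumulative = np.cumsum(explained_variance_ratio)

# Smallest k whose cumulative share reaches the target (assumed 85% threshold).
target = 0.85
n_components = int(np.argmax(cumulative >= target)) + 1

print(np.round(cumulative, 4))
print(n_components)  # 7
```

The first six components reach about 81.7% and the seventh brings the total to about 85.2%, which is why 7 is the cut-off for an 85% target. A different target would give a different count, so the threshold is a judgment call rather than a property of the data.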
