Predictive Modeling - Model Solution

Proprietary content. © Great Learning. All Rights Reserved. Unauthorized use or distribution prohibited.

Table of Content

Problem - 1
  The Problem in hand
  A. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.
  B. Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Do you think scaling is necessary in this case?
  C. Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE.
  D. Inference: Basis on these predictions, what are the business insights and recommendations.
Problem 2: Logistic-LDA
  The Problem in hand
  A. Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.
  B. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).
  C. Performance Metrics and Final Model
  D. Inference: insights and recommendations

Problem - 1

The Problem in hand

You are hired by a company, Gem Stones Co Ltd, which is a cubic zirconia manufacturer. You are provided with the dataset containing the prices and other attributes of almost 27,000 cubic zirconia (which is an inexpensive diamond alternative with many of the same qualities as a diamond). The company is earning different profits on different price slots. You have to help the company in predicting the price for the stone on the basis of the details given in the dataset, so it can distinguish between higher profitable stones and lower profitable stones so as to have a better profit share. Also provide them with the best 5 attributes.

A. Read the data and do exploratory data analysis.
Describe the data briefly. (Check the null values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.

B. Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Do you think scaling is necessary in this case? (8 marks)

C. Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE. (8 marks)

D. Inference: Basis on these predictions, what are the business insights and recommendations. (6 marks)

Dataset for Problem 1 (data dictionary):

- Carat: Carat weight of the cubic zirconia.
- Cut: Describes the cut quality of the cubic zirconia. Quality in increasing order: Fair, Good, Very Good, Premium, Ideal.
- Color: Colour of the cubic zirconia, with D being the best and J the worst.
- Clarity: Cubic zirconia clarity refers to the absence of Inclusions and Blemishes. (In order from Best to Worst, FL = flawless, I3 = level 3 inclusions) FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3.
- Depth: The Height of a cubic zirconia, measured from the Culet to the table, divided by its average Girdle Diameter.
- Table: The Width of the cubic zirconia's Table expressed as a Percentage of its Average Diameter.
- Price: The Price of the cubic zirconia.
- X: Length of the cubic zirconia in mm.
- Y: Width of the cubic zirconia in mm.
- Z: Height of the cubic zirconia in mm.

A. Read the data and do exploratory data analysis. Describe the data briefly. (Check the null values, Data types, shape, EDA). Perform Univariate and Bivariate Analysis.

Let's get started. Load the required packages, set the working directory and load the data file. The dataset has 26967 rows and 10 features. Cut, color and clarity are object types, price (the dependent variable) is integer type and all others are float64 type. Let's start the data exploration step with the head function to look at the first 5 rows.
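The checks named in the question (shape, data types, null values), together with a duplicate count, can be sketched as follows. This is only an illustration under stated assumptions: the file name used in the original notebook is not known, so five sample rows (as reconstructed from the head() output that follows) are read from an in-memory string instead of a CSV file, and the variable names are hypothetical.

```python
import io

import pandas as pd

# Five sample rows reconstructed from the head() output in this report.
# The original file name is not known, so the rows are read from an
# in-memory string; with the real file this would be pd.read_csv(<path>).
SAMPLE_CSV = """carat,cut,color,clarity,depth,table,x,y,z,price
0.30,Ideal,E,SI1,62.1,58.0,4.27,4.29,2.66,499
0.33,Premium,G,IF,60.8,58.0,4.42,4.46,2.70,984
0.90,Very Good,E,VVS2,62.2,60.0,6.04,6.12,3.78,6289
0.42,Ideal,F,VS1,61.6,56.0,4.82,4.80,2.96,1082
0.31,Ideal,F,VVS1,60.4,59.0,4.35,4.43,2.65,779
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

shape = df.shape                            # (rows, columns)
dtypes = df.dtypes                          # data type of each column
null_counts = df.isnull().sum()             # missing values per column
n_duplicates = int(df.duplicated().sum())   # fully duplicated rows

print("Shape:", shape)
print(dtypes)
print(null_counts)
print("Duplicate rows:", n_duplicates)
print(df.head())

# Duplicate rows, if any, can be removed like this.
df = df.drop_duplicates().reset_index(drop=True)
```

On the full dataset the same calls would report the 26967 rows, the null values and the duplicate records discussed in this section; the five-row sample has neither nulls nor duplicates.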
First five rows of the dataset:

   id  carat  cut        color  clarity  depth  table  x     y     z     price
0  1   0.30   Ideal      E      SI1      62.1   58.0   4.27  4.29  2.66  499
1  2   0.33   Premium    G      IF       60.8   58.0   4.42  4.46  2.70  984
2  3   0.90   Very Good  E      VVS2     62.2   60.0   6.04  6.12  3.78  6289
3  4   0.42   Ideal      F      VS1      61.6   56.0   4.82  4.80  2.96  1082
4  5   0.31   Ideal      F      VVS1     60.4   59.0   4.35  4.43  2.65  779

The dataset does have null values in the "depth" feature. Section B covers the imputation of missing values.

[Output: null-value count per column (carat, cut, color, clarity, depth, table, x, y, z, price); the counts are not legible in the source]

The dataset has duplicate records, and all of these records are deleted from the data set.

As per the dataset description of the independent features, carat, y and z are skewed, and the same is verified by the following distribution plots. Also, all continuous features have outliers, and hence these need to be treated.

Distribution and box plots of all continuous features are provided below:

[Figures: distribution plot and box plot for each continuous feature, including carat, depth, table, x, y, z and price]

The cut variable has 5 unique values, color has 7 unique values and clarity has 8 unique values.

Continuous variables show a distribution with the object variables; as an example, the following plot shows the distribution with respect to carat and clarity. Most of the possible outliers in carat are for clarity I1.

[Figure: carat by clarity]

The pair plot among the following variables shows a strong relationship between carat, x, y and z.
Also, a strong correlation exists between the dependent variable, i.e. price, and carat.

[Figure: pair plot and correlation plot of the continuous variables]

B. Impute null values if present, also check for the values which are equal to zero. Do they have any meaning or do we need to change them or drop them? Do you think scaling is necessary in this case?

"depth" has null values. Outliers were observed through the box plot, and hence the null values were imputed.

"x", "y" and "z" have zero values. However, these features are highly correlated with one another: "y" is highly correlated with "x", and "z" is highly correlated with "x" and "y". Hence they are dropped from the dataset.

Among the remaining continuous independent variables, i.e. "carat", "depth" and "table", "carat" has values in decimals. So the features have different scales, and hence scaling is required for this dataset. The StandardScaler function is used to scale the continuous features.

As the data set has outliers, it is capped using the Q1 - 1.5*IQR and Q3 + 1.5*IQR logic. The following box plot confirms that outliers in the continuous features have been treated.

[Figure: box plots of the continuous features after outlier treatment]

C. Encode the data (having string values) for Modelling. Data Split: Split the data into test and train (70:30). Apply Linear regression. Performance Metrics: Check the performance of Predictions on Train and Test sets using Rsquare, RMSE.

"cut", "color" and "clarity" are encoded to create dummies. The drop_first=True option is used, so one level of each variable serves as the baseline.

After scaling of the continuous independent variables, encoding of the object variables and treatment of outliers of all continuous variables, the dataset is split into test and train sets.

The intercept coefficient is 704.2960048622108. The other beta coefficients are listed below.
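The preprocessing and modelling steps described in sections B and C can be sketched as below. This is a minimal illustration, not the original notebook: the data is synthetic (only the column names and category levels follow the report), median imputation is an assumption because the imputation value is not stated, and all variable names are hypothetical. The sketch follows the order given in the text (scale, cap, encode, then split); fitting the scaler on the training split only is a common alternative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-in data (NOT the real dataset); only the column names
# and category levels follow the report.
df = pd.DataFrame({
    "carat": rng.gamma(2.0, 0.4, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
    "color": rng.choice(list("DEFGHIJ"), n),
    "clarity": rng.choice(
        ["I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"], n),
    "depth": rng.normal(61.7, 1.4, n),
    "table": rng.normal(57.5, 2.2, n),
})
df["x"] = 6.5 * np.cbrt(df["carat"])
df["y"] = df["x"] + rng.normal(0.0, 0.02, n)
df["z"] = 0.62 * df["x"]
df["price"] = 4000.0 * df["carat"] + rng.normal(0.0, 300.0, n)
df.loc[:19, "depth"] = np.nan  # inject a few nulls so there is something to impute

num_cols = ["carat", "depth", "table"]

# 1. Impute nulls in depth (the median is an assumption, see lead-in).
df["depth"] = df["depth"].fillna(df["depth"].median())

# 2. Drop the highly correlated dimension columns.
df = df.drop(columns=["x", "y", "z"])

# 3. Scale the continuous predictors.
df[num_cols] = StandardScaler().fit_transform(df[num_cols])

# 4. Cap outliers at Q1 - 1.5*IQR and Q3 + 1.5*IQR.
for col in num_cols:
    q1, q3 = df[col].quantile([0.25, 0.75])
    iqr = q3 - q1
    df[col] = df[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

# 5. Dummy-encode the object columns, dropping the first level of each.
X = pd.get_dummies(df.drop(columns="price"),
                   columns=["cut", "color", "clarity"],
                   drop_first=True, dtype=float)
y = df["price"]

# 6. 70:30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# 7. Fit linear regression and compute R-square and RMSE on both sets.
model = LinearRegression().fit(X_train, y_train)
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
rmse_train = float(np.sqrt(mean_squared_error(y_train, model.predict(X_train))))
rmse_test = float(np.sqrt(mean_squared_error(y_test, model.predict(X_test))))

print("Intercept:", model.intercept_)
for name, coef in zip(X.columns, model.coef_):
    print(f"The coefficient for {name} is {coef}")
print("R-square train/test:", r2_train, r2_test)
print("RMSE train/test:", rmse_train, rmse_test)
```

Because the data is synthetic, the printed coefficients and metrics will not match the values reported in this document; only the sequence of steps is meant to carry over.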
The coefficient for carat is 3715.38977663443
The coefficient for depth is -9.73886550852991
The coefficient for table is -40.905564176855094
The coefficient for cut_Good is 545.678362608045
The coefficient for cut_Ideal is 770.691745638859
The coefficient for cut_Premium is 701.3738339905831
The coefficient for cut_Very Good is 687.4907314316608
The coefficient for color_E is -183.4175510906739
The coefficient for color_F is -266.7480582118906
The coefficient for color_G is -437.6749683805701
The coefficient for color_H is -854.4690974983275
The coefficient for color_I is -1311.60508015202
The coefficient for color_J is -1912.1882580266902
The coefficient for clarity_IF is 4098.9807 (rounded value from the OLS output below; the full-precision value is not legible)
The coefficient for clarity_SI1 is 2547.7498704772
The coefficient for clarity_SI2 is 1709.1898658791172
The coefficient for clarity_VS1 is 3388.518462041414
The coefficient for clarity_VS2 is 3084.287413105855
The coefficient for clarity_VVS1 is 3861.3436717870954
The coefficient for clarity_VVS2 is 3827.1265582716005

Performance metrics comparison on Train and Test data set:

Parameter   Train         Test
R Square    0.939         0.9409610887221302
RMSE        853.6801097   844.1071243110644

Final model output is as follows:

OLS Regression Results

Model:             OLS
Method:            Least Squares
Date:              Thu, 19 Mar 2020
Time:              16:32:22
No. Observations:  18853
Df Residuals:      18832
Df Model:          20
Covariance Type:   nonrobust

(The Adj. R-squared, F-statistic, Prob (F-statistic), Log-Likelihood, AIC and BIC values are not legible in the source. In the table below, "-" marks a value that is not legible; the t column is not legible for any row.)

                 coef         std err   P>|t|    [0.025       0.975]
Intercept        704.2960     65.727    0.000    575.456      833.126
carat            3715.3898    7.374     0.000    3701.329     3729.451
depth            -9.7389      7.281     0.181    -24.010      4.532
table            -40.9056     8.337     0.000    -57.247      -24.564
cut_Good         545.6784     43.470    0.000    460.473      630.884
cut_Very_Good    687.4907     41.441    0.000    606.262      768.719
cut_Ideal        770.6917     42.748    0.000    686.902      854.482
cut_Premium      701.3738     41.520    0.000    619.991      782.756
color_J          -1912.1883   33.272    0.000    -1977.406    -1846.973
color_E          -183.4176    22.914    0.000    -228.332     -138.503
color_F          -266.7481    23.379    0.000    -312.573     -220.923
color_G          -437.6750    22.685    0.000    -482.140     -393.210
color_H          -854.4691    24.277    0.000    -            -
color_I          -1311.6051   26.990    0.000    -            -
clarity_IF       4098.9807    66.389    0.000    -            -
clarity_SI1      2547.7499    56.961    0.000    -            -
clarity_SI2      1709.1899    57.278    0.000    -            -
clarity_VS1      3388.5185    58.069    0.000    -            -
clarity_VS2      3084.2874    57.267    0.000    -            -
clarity_VVS1     3861.3437    61.192    0.000    -            -
clarity_VVS2     3827.1266    59.665    0.000    -            -

Omnibus: 4288.05, Prob(Omnibus): 0.000. The remaining diagnostics (Durbin-Watson, Jarque-Bera, Skew, Kurtosis, Cond. No.) are not reliably legible.

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Performance metrics on Train and Test data set:

            Train     Test
R Square    0.939     0.9409
RMSE        853.68    844.107

D. Inference: Basis on these predictions, what are the business insights and recommendations.

As per the model output, the following features are significant:

- Continuous features: Carat and Table
- Categorical features: Cut (Good, Very Good, Ideal and Premium), Clarity (IF, SI1, SI2, VS1, VS2, VVS1, VVS2), Color (E, F, G, H, I, J)

Carat has a positive impact on the price, so the higher the carat, the higher the price.
Table has a negative impact on the price.
Cut has a positive impact on the price.
Ideal has the highest beta coefficient, and hence if the cut quality is Ideal, the price would be higher than for all other cut categories.
All Color categories have a negative impact on the price, with "J" having the largest (most negative) beta coefficient and hence the highest negative impact.
All levels under Clarity have a positive impact on the price, with "IF" having the highest positive coefficient.
These categorical effects are measured relative to the level dropped during encoding (Fair for cut, D for color and I1 for clarity).
The high beta coefficient for the intercept indicates that we need to consider more features to improve the model performance.

Problem 2: Logistic-LDA

The Problem in hand

You are hired by a tour and travel agency which deals in selling holiday packages. You are provided details of 872 employees of a company. Among these employees, some opted for the package and some didn't. You have to help the company in predicting whether an employee will opt for the package or not on the basis of the information given in the data set. Also, find out the important factors on the basis of which the company will focus on particular employees to sell their packages.

A. Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it. Perform Univariate and Bivariate Analysis. Do exploratory data analysis.

B. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

C. Performance Metrics: Check the performance of Predictions on Train and Test sets using Accuracy, Confusion Matrix, Plot ROC curve and get ROC_AUC score for each model. Compare both the models and write inference which model is best/optimized. (8 marks)

D. Inference: Basis on these predictions, what are the insights and recommendations. (6 marks)

Dataset for Problem 2: Holiday_Package.csv

Data Dictionary for Holiday_Package:

- Holiday_Package: Opted for Holiday Package yes/no?
- Salary: Employee salary
- age: Age in years
- edu: Years of formal education
- no_young_children: The number of young children (younger than 7 years)
- no_older_children: Number of older children (older than 7 years)
- foreign: foreigner Yes/No

A. Data Ingestion: Read the dataset. Do the descriptive statistics and do null value condition check, write an inference on it? Perform Univariate and Bivariate Analysis.
Do exploratory data analysis

Loaded the required packages and read the data set. The dataset has 872 rows and 7 features including the target feature.

Descriptive statistics of all the features are as follows ("-" marks a value that is not legible in the source):

                    count  unique  top  freq  mean      std      min   25%    50%      75%      max
Holliday_Package    872    2       no   471   NaN       NaN      NaN   NaN    NaN      NaN      NaN
Salary              872    NaN     NaN  NaN   47729.2   23418.7  1322  35324  41903.5  53460.5  236961
age                 872    NaN     NaN  NaN   39.9553   10.5517  20    32     39       48       62
educ                872    NaN     NaN  NaN   9.30734   3.03626  1     8      9        12       21
no_young_children   872    NaN     NaN  NaN   0.311927  0.61287  0     0      0        0        3
no_older_children   872    NaN     NaN  NaN   0.982798  1.08679  0     0      1        2        6
foreign             872    2       no   -     NaN       NaN      NaN   NaN    NaN      NaN      NaN

Key takeaways from the descriptive statistics:

- There are two types of employees, i.e. foreigner and not foreigner.
- Salary of the employees varies from 1322 to 236961.
- Education varies from 1 to 21 years.
- Age varies from 20 to 62 years.
- no_young_children and no_older_children are appearing as integer fields. To get an independent variable for children in both the categories = 0, these two variables are converted into object type.
- ... dispersion, which is evident from the following histogram plot.

[Figure: histogram plot]

So, the final dataset which would be considered has 4 categorical variables (including the dependent variable) and 3 continuous features, as shown below:

RangeIndex: 872 entries, 0 to 871
Data columns (total 7 columns):
Holliday_Package     872 non-null object
Salary               872 non-null int64
age                  872 non-null int64
educ                 872 non-null int64
no_young_children    872 non-null object
no_older_children    872 non-null object
foreign              872 non-null object
dtypes: int64(3), object(4)

The data set does not have any null values, and no duplicate records are available in the given dataset.

Salary has outliers. An IQR cap would be used to handle the outliers, i.e. the lower range capped at Q1 - 1.5*IQR and the upper range capped at Q3 + 1.5*IQR.

Educ also has outliers; however, it is difficult to confirm whether these values are really outliers or simply extreme values, and hence these values are left as they are in the model.
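The IQR capping rule described above can be sketched as a small helper. The function name cap_iqr and the salary values are hypothetical: only the smallest and largest salaries (1322 and 236961) come from the report, and the remaining values are made up so that the arithmetic is easy to follow.

```python
import pandas as pd


def cap_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Hypothetical salaries: only the minimum (1322) and maximum (236961)
# are taken from the report; the other values are illustrative.
salary = pd.Series(
    [1322, 30000, 35000, 40000, 42000, 45000, 50000, 55000, 60000, 236961],
    name="Salary",
)

capped = cap_iqr(salary)
print(capped.tolist())
```

For these ten values Q1 is 36250 and Q3 is 53750, so the IQR is 17500 and the caps are 36250 - 26250 = 10000 and 53750 + 26250 = 80000. The two extreme salaries are pulled in to those caps, and the other values are left unchanged.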
[Figure: pair plot of Salary, age and educ]

No correlation is observed between Salary, age and educ.

No pattern is visible when all the continuous variables are studied with the dependent variable.

Key observations from the given pair plot with hue as the dependent variable are as follows:

- The data set has more data points pertaining to no_young_children and no_older_children equal to 0 and 1, and all these records correspond to holiday package as yes, which means individuals with no_young_children and no_older_children have opted for the holiday package.
- As the number of young children increases to 2 or 3, the total number of employees opting for the holiday package is lower.
- However, in the case of older children the pattern is different, even when the number of older children equals 3.

B. Encode the data (having string values) for Modelling. Data Split: Split the data into train and test (70:30). Apply Logistic Regression and LDA (linear discriminant analysis).

Before doing the data split, variable types were checked. As there are many columns with the type as object, these variables were encoded. Then the target variable was captured into a separate vector for the training and test data sets. Then the dataset was split into train and test in the ratio of 70:30.

1. Logistic Model with Performance Metrics

The initial logistic model was built without any specific parameter setting:

    LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                       intercept_scaling=1, l1_ratio=None, max_iter=100,
                       multi_class='warn', n_jobs=None, penalty='l2',
                       random_state=None, solver='warn', tol=0.0001, verbose=0,
                       warm_start=False)

Important features were identified using the RFE function.
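The encode, split and RFE steps described above can be sketched as follows. This is an illustration under stated assumptions rather than the original notebook: the data is synthetic (only the column names and the row count follow the report), the yes/no columns are mapped to 0/1 because the exact encoding used is not stated, and the liblinear solver is chosen here only because it runs without scaling on this small data set.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 872  # same number of rows as the report's dataset

# Synthetic stand-in data (NOT the real Holiday_Package data);
# the value ranges are illustrative.
df = pd.DataFrame({
    "Salary": rng.integers(1322, 236962, n),
    "age": rng.integers(20, 63, n),
    "educ": rng.integers(1, 22, n),
    "no_young_children": rng.integers(0, 4, n),
    "no_older_children": rng.integers(0, 7, n),
    "foreign": rng.choice(["no", "yes"], n),
    "Holliday_Package": rng.choice(["no", "yes"], n),
})

# Encode the yes/no string columns as 0/1 (an assumed encoding).
for col in ["foreign", "Holliday_Package"]:
    df[col] = df[col].map({"no": 0, "yes": 1})

# Capture the target in a separate vector, then split 70:30.
X = df.drop(columns="Holliday_Package")
y = df["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

# Recursive feature elimination down to one feature gives a full ranking
# (1 = eliminated last, i.e. ranked most important by the estimator).
rfe = RFE(estimator=LogisticRegression(solver="liblinear"),
          n_features_to_select=1)
rfe.fit(X_train, y_train)

ranking = pd.Series(rfe.ranking_, index=X.columns).sort_values()
print(ranking)
```

With 872 rows, a 70:30 split gives 610 training and 262 test records, which matches the support totals in the classification reports. The ranking printed by this sketch comes from random data and will not match the ranking reported in this document.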
Ranking in decreasing order of importance is as follows:

- Foreign
- No_young_children
- Educ
- No_older_children
- Age
- Salary

AUC on the train and test dataset is .601.

[Figure: ROC curve for the logistic model]

Recall for class 1 on the train dataset is .21 and for test it is .25. Recall for class 1 is considered because our interest is to predict who opts for the holiday package.

Classification report on train data set

              precision  recall  f1-score  support
0             0.56       0.86    0.68      326
1             0.58       0.21    0.31      284
accuracy                         0.56      610
macro avg     0.57       0.54    0.49      610
weighted avg  0.57       0.56    0.51      610

Classification report on test data set

              precision  recall  f1-score  support
0             0.59       0.87    0.70      145
1             0.60       0.25    0.35      117
accuracy                         0.59      262
macro avg     0.60       0.56    0.53      262
weighted avg  0.60       0.59    0.55      262

Accuracy for the train dataset is .56 and for test it is .59.

As all the metric values for both the train and test data sets are within +/- 10% deviation, the model is valid.

2. LDA Model with Performance Metrics

The initial LDA model was created without any specific parameter setting:

    LinearDiscriminantAnalysis(n_components=None, priors=None, shrinkage=None,
                               solver='svd', store_covariance=False, tol=0.0001)

AUC for the train and test data set is .739.

[Figure: ROC curve for the LDA model]

Recall for the train and test data set is .56.

Classification report on train data set
              precision  recall  f1-score  support
0             0.67       0.78    0.72      326
1             0.69       0.56    0.61      284
accuracy                         0.68      610
macro avg     0.68       0.67    0.67      610
weighted avg  0.68       0.68    0.67      610

Classification report on test data set

              precision  recall  f1-score  support
0             0.66       0.71    0.69      145
1             0.61       0.56    0.58      117
accuracy                         0.64      262
macro avg     0.64       0.63    0.63      262
weighted avg  0.64       0.64    0.64      262

Model score for the train data set is .68 and for test it is .64.

As all the metric values for both the train and test data sets are within +/- 10% deviation, the model is valid.

C. Final Model: Compare both the models and write an inference which model is best/optimized.

As the organization's interest is in predicting who opts for the package, recall for class 1 is used as the comparison criterion on both the train and the test data sets. From the following consolidated table of performance metrics, LDA has the best recall rate.

Model      AUC             Recall          Accuracy
           Train   Test    Train   Test    Train   Test
Logistic   .601    .601    .21     .25     .56     .59
LDA        .739    .739    .56     .56     .68     .64

Based on this criterion, the LDA model is preferred. As the values of all performance metrics are not high, considering some more features could help to improve the model.

D. Inference: Basis on these predictions, what are the insights and recommendations.

Next steps for business:

- The travel and tour company should design a package for foreigners, as their reason for travel is not just limited to vacation tours.
- Individuals with no young children have a high probability of accepting the package, and hence something unique should be designed for them as well.
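The model comparison in section C (AUC, recall and accuracy on train and test for both models, then choosing by recall) can be sketched as below. This is a minimal illustration on synthetic data, not the original notebook: the feature values, the way the synthetic target is generated and all variable names are assumptions, so the numbers it prints will differ from the consolidated table above.

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
n = 872

# Synthetic, already-encoded features (NOT the real data).
X = pd.DataFrame({
    "Salary": rng.integers(1322, 236962, n),
    "age": rng.integers(20, 63, n),
    "educ": rng.integers(1, 22, n),
    "no_young_children": rng.integers(0, 4, n),
    "no_older_children": rng.integers(0, 7, n),
    "foreign": rng.integers(0, 2, n),
})

# Synthetic target loosely tied to a few features so the models have signal.
logit = 0.8 * X["foreign"] - 0.9 * X["no_young_children"] + 0.05 * (X["educ"] - 9)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

models = {
    "Logistic": LogisticRegression(solver="liblinear"),
    "LDA": LinearDiscriminantAnalysis(),
}

rows = []
for name, model in models.items():
    model.fit(X_train, y_train)
    for split, X_s, y_s in [("train", X_train, y_train), ("test", X_test, y_test)]:
        pred = model.predict(X_s)
        proba = model.predict_proba(X_s)[:, 1]  # probability of class 1
        rows.append({
            "model": name,
            "split": split,
            "auc": roc_auc_score(y_s, proba),
            "recall": recall_score(y_s, pred),      # recall for class 1
            "accuracy": accuracy_score(y_s, pred),
        })

metrics = pd.DataFrame(rows)
print(metrics)

# Pick the model with the higher class-1 recall on the test set.
test_metrics = metrics[metrics["split"] == "test"]
best_model = test_metrics.loc[test_metrics["recall"].idxmax(), "model"]
print("Model with the best test recall:", best_model)
```

Selecting on test recall mirrors the criterion stated in the report; a different business objective would call for a different selection metric.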
