MODELLING PROJECT
PGP-DSBA, University of Texas at Austin, USA
1.1. Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis. (8 marks)
Data Type of the attributes:
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 26967 non-null float64
1 cut 26967 non-null object
2 color 26967 non-null object
3 clarity 26967 non-null object
4 depth 26270 non-null float64
5 table 26967 non-null float64
6 x 26967 non-null float64
7 y 26967 non-null float64
8 z 26967 non-null float64
9 price 26967 non-null int64
dtypes: float64(6), int64(1), object(3)
Summary statistics of the numeric attributes:

index  count    mean         std          min    25%    50%     75%     max
carat  26967.0  0.798375     0.477745     0.2    0.4    0.7     1.05    4.5
depth  26270.0  61.745147    1.412860     50.8   61.0   61.8    62.5    73.6
price  26967.0  3939.518115  4024.864666  326.0  945.0  2375.0  5360.0  18818.0
table  26967.0  57.456080    2.232068     49.0   56.0   57.0    59.0    79.0
x      26967.0  5.729854     1.128516     0.0    4.71   5.69    6.55    10.23
y      26967.0  5.733569     1.166058     0.0    4.71   5.71    6.54    58.9
z      26967.0  3.538057     0.720624     0.0    2.9    3.52    4.04    31.8

Unique levels of the categorical attributes:

index   cut  color  clarity
unique  5    7      8
In the given data set, the mean and median values do not differ much for most attributes. The minimum values of "x", "y" and "z" are zero, which indicates faulty records: dimensionless or two-dimensional diamonds are not possible, so these rows need to be filtered out as clearly faulty data entries. There are three object (categorical) attributes: 'cut', 'color' and 'clarity'.
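The zero-dimension filtering described above can be sketched as follows. A tiny synthetic frame stands in here for the actual diamonds data (26967 rows), so the column values are illustrative:

```python
import pandas as pd

# Synthetic stand-in for the diamonds data.
df = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1, 0.5],
    "x": [4.3, 5.7, 0.0, 5.1],   # length in mm; 0.0 is a faulty entry
    "y": [4.3, 5.7, 6.6, 0.0],   # width in mm
    "z": [2.7, 3.5, 4.1, 3.2],   # depth (height) in mm
})

# A diamond cannot have a zero dimension, so treat x == 0, y == 0 or
# z == 0 as faulty records and drop those rows.
mask = (df[["x", "y", "z"]] == 0).any(axis=1)
df_clean = df[~mask].reset_index(drop=True)
```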
3: Outlier Treatment
Before: [boxplots of the variables before outlier treatment]
After: [boxplots of the variables after outlier treatment]
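The report does not show the code behind the before/after boxplots; a common approach, assumed here, is IQR-based capping of values beyond the whiskers:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Tiny example: 100.0 lies far outside the whiskers and gets capped.
s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
capped = cap_outliers_iqr(s)
```

Applied column-wise, this reproduces the shrunken whiskers visible in the "After" plots without dropping any rows.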
4: Univariate Analysis
Skewness of the attributes:
carat 0.917116
cut -0.718839
color -0.189002
clarity 0.552317
depth -0.195101
table 0.480895
x 0.398233
y 0.394542
z 0.395750
price 1.157774
dtype: float64
There is a significant number of outliers present in some variables. The distributions of some quantitative features like "carat" and the target feature "price" are heavily right-skewed.
We can also observe very strong multicollinearity in the data set: the variance inflation factors are far above the range they should ideally fall in, which is 1 to 3.
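The skewness figures above come from pandas' skew(), and multicollinearity is typically quantified with variance inflation factors from statsmodels. A minimal sketch with synthetic data (two nearly identical columns standing in for the strongly collinear dimension features):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x = rng.normal(5.7, 1.1, n)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(0, 0.05, n),    # nearly identical to x -> collinear
    "depth": rng.normal(61.7, 1.4, n),  # independent feature
})

# Right-skew check used in the report:
skew = df.skew()

# VIF per feature; a constant column is added so VIFs are computed
# against an intercept-augmented design matrix.
X = df.assign(const=1.0)
vif = {c: variance_inflation_factor(X.values, i)
       for i, c in enumerate(X.columns) if c != "const"}
```

Collinear columns produce very large VIFs, while the independent column stays near 1.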
5: Bi-variate Analysis
PAIRPLOTS
It can be inferred that most features correlate with the price of the diamond. The notable exception is "depth", which has a negligible correlation (<1%).
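The feature-to-price correlations referenced above can be read off directly from the correlation matrix. A sketch with synthetic data (price driven by carat, depth unrelated, mirroring the report's finding):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
carat = rng.uniform(0.2, 2.0, n)
df = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.7, 1.4, n),            # unrelated to price
    "price": 4000 * carat + rng.normal(0, 200, n),
})

# Pearson correlation of every numeric feature with the target.
corr_with_price = df.corr()["price"].drop("price").sort_values(ascending=False)
```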
VIOLIN PLOTS
color I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
D 25 38 1039 669 369 804 121 276
E 53 87 1249 849 625 1202 342 509
F 67 182 1088 749 672 1106 360 498
G 67 340 998 777 1075 1205 507 681
H 81 149 1080 792 593 802 288 306
I 48 69 724 467 480 600 183 194
J 21 26 386 258 272 373 38 66
cut I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
Premium 106 115 1808 1441 996 1692 307 415
Very Good 43 132 1653 1046 887 1253 386 627
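The two tables above are counts of clarity levels per color and per cut; they can be reproduced with pd.crosstab. A sketch on a toy frame (the real frame has far more rows):

```python
import pandas as pd

df = pd.DataFrame({
    "color":   ["D", "D", "E", "E", "E", "G"],
    "cut":     ["Premium", "Ideal", "Premium", "Very Good", "Premium", "Ideal"],
    "clarity": ["SI1", "VS2", "SI1", "SI2", "VS2", "IF"],
})

# Rows: one category; columns: clarity levels; cells: counts.
by_color = pd.crosstab(df["color"], df["clarity"])
by_cut = pd.crosstab(df["cut"], df["clarity"])
```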
1.2. Impute null values if present, and also check for values which are equal
to zero. Do they have any meaning, or do we need to change them or
drop them? Check for the possibility of combining the sub-levels of an
ordinal variable and take action accordingly. Explain why you are
combining these sub-levels, with appropriate reasoning. (5 marks)
1.3. Encode the data (having string values) for modelling. Split the data into
train and test (70:30). Apply linear regression using scikit-learn.
Perform checks for significant variables using an appropriate method
from statsmodels. Create multiple models and check the performance of
predictions on the train and test sets using R-square, RMSE and adjusted
R-square. Compare these models and select the best one with
appropriate reasoning. (12 marks)
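The pipeline in 1.3 (ordinal encoding of the grades, a 70:30 split, a scikit-learn fit, and R²/RMSE on both sets) can be sketched as follows. The encoding map and the synthetic data are illustrative assumptions, not the report's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
})
df["price"] = 4000 * df["carat"] + rng.normal(0, 300, n)

# Ordinal-encode the categorical grade (worst -> best), since cut is ordered.
cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
df["cut"] = df["cut"].map(cut_order)

X = df[["carat", "cut"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = r2_score(y_test, model.predict(X_test))
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
```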
COEFFICIENTS FOR EACH OF THE INDEPENDENT ATTRIBUTES
• The intercept for the model is -5010.045513455747.
• R-square on training data - 0.9312284418457293.
• R-square on testing data - 0.9316264665894696.
• RMSE on training data - 906.9014636213853.
• RMSE on testing data - 911.2934219690869.
LINEAR REGRESSION USING STATSMODELS
Intercept -5010.045513
carat 8887.318954
cut 113.306810
color 273.221091
clarity 436.893466
depth 35.287476
table -15.081966
x -1349.126528
y 1561.207366
z -968.908191
dtype: float64
z            -968.9082    139.077    -6.967    0.000    -1241.511    -696.305
==============================================================================
Omnibus:                    2652.212   Durbin-Watson:                   2.004
Prob(Omnibus):                 0.000   Jarque-Bera (JB):             9565.278
Skew:                          0.690   Prob(JB):                         0.00
Kurtosis:                      6.206   Cond. No.                     1.04e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 1.04e+04. This might indicate that there
are strong multicollinearity or other numerical problems.

RMSE - 911.2934219690871
1.4. Inference: Based on these predictions, what are the business insights
and recommendations? (5 marks)
• The Gem Stones company should treat 'carat', 'cut', 'color', 'clarity' and
width ('y') as the most important features for predicting price, in order to
distinguish higher-profit stones from lower-profit stones and improve its
profit share.
• As the model shows, the higher the width ('y') of a stone, the higher its
price, so stones with greater width should be classed as higher-profit stones.
• The 'Premium' cut diamonds are the most expensive, followed by the
'Very Good' cut; these should also be classed as higher-profit stones.
• Diamonds with clarity 'VS1' and 'VS2' are the most expensive, so these two
categories likewise belong among the higher-profit stones.
• For length ('x'), the model indicates that the greater the length of a stone,
the lower its price, so longer stones are less profitable.
• Similarly, the greater the height ('z') of a stone, the lower its price: if a
diamond's height is too large it appears 'dark', because it no longer returns
an attractive amount of light. Stones with a higher 'z' are therefore also
lower in profitability.
PROBLEM 2
LOGISTIC REGRESSION AND LDA
Data Dictionary:
2.1. Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate
and Bivariate Analysis. Do exploratory data analysis. (5 marks)
Null values:
Holliday_Package 0
Salary 0
age 0
educ 0
no_young_children 0
no_older_children 0
foreign 0
dtype: int64
UNIVARIATE ANALYSIS:
BIVARIATE ANALYSIS:
Outlier Treatment
Before:
After:
PAIRPLOTS:
Correlation Map:
Based on the descriptive statistics, we can see that the data contains 872 observations
with no missing values. The mean salary is around 48,000, and the mean age is around
40. The minimum and maximum values for salary and age seem reasonable.
The univariate analysis shows that the majority of employees did not opt for the holiday
package. The salary and age distributions are slightly skewed to the right. The bivariate
analysis shows that employees who opted for the holiday package tend to have higher
salaries and are slightly older on average.
The correlation analysis shows that there is a weak positive correlation between salary
and age.
2.2. Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis). (7 marks)
Here, we first encoded the categorical variables "foreign" and "Holliday_Package" into
numerical values for modelling purposes. We then split the data into train and test sets
with a 70:30 ratio using the train_test_split function.
We then fitted two models, one using logistic regression and the other using LDA, on
the training data.
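The encoding, split and the two fits described above can be sketched as follows; the synthetic data is an illustrative stand-in (the real frame has 872 rows and a few more columns):

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "Salary": rng.normal(48000, 15000, n),
    "age": rng.integers(20, 62, n),
    "foreign": rng.choice(["yes", "no"], n),
})
# Synthetic target loosely tied to salary, echoing the report's findings.
df["Holliday_Package"] = np.where(
    df["Salary"] + rng.normal(0, 20000, n) > 48000, "yes", "no")

# Encode the two string columns as 0/1 (data is deliberately not scaled,
# per the assignment).
for col in ["foreign", "Holliday_Package"]:
    df[col] = (df[col] == "yes").astype(int)

X = df[["Salary", "age", "foreign"]]
y = df["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
```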
Logistic Regression:
Accuracy on Train set for Logistic Regression: 0.5426229508196722
Accuracy on Test set for Logistic Regression: 0.5343511450381679
Confusion Matrix on Train set for Logistic Regression: [[331 0]
[279 0]]
Confusion Matrix on Test set for Logistic Regression: [[140 0]
[122 0]]
ROC AUC Score on Train set for Logistic Regression:
0.5922641284691768
ROC AUC Score on Test set for Logistic Regression:
0.6319672131147541
Accuracy on Train set for LDA: 0.6655737704918033
Accuracy on Test set for LDA: 0.6564885496183206
Confusion Matrix on Train set for LDA: [[254 77]
[127 152]]
Confusion Matrix on Test set for LDA: [[107 33]
[ 57 65]]
I calculated the performance metrics for both the logistic regression and LDA models
on the training and test sets, using the accuracy score and confusion matrix to evaluate
their performance. I also plotted the ROC curve and calculated the ROC AUC score for
both models.
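The metrics quoted above (accuracy, confusion matrix, ROC AUC) come from scikit-learn; a sketch with hypothetical labels and class-1 probabilities:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical true labels, hard predictions and class-1 probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.8, 0.7, 0.4, 0.9, 0.3])

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
auc = roc_auc_score(y_true, y_prob)    # uses probabilities, not hard labels
```

Note that ROC AUC is computed from the predicted probabilities, which is why it can differ between two models whose hard-label accuracies are similar.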
2.4. Inference: Based on these predictions, what are the business insights and
recommendations? (5 marks)
Based on the performance metrics, the accuracy scores of the two models look broadly
comparable on the test set. However, the confusion matrices show that the logistic
regression model predicts only the majority class (it never predicts that an employee
will opt for the holiday package), whereas the LDA model correctly identifies a
substantial number of employees who do opt for it. This makes the LDA model the
more useful of the two for targeting likely buyers.
Therefore, we recommend the tour and travel agency to focus on the employees who
are more likely to opt for the holiday package, as identified by the LDA model. The
agency can use the factors such as salary, age, and number of children to target the right
employees for their holiday packages. They can also offer incentives or discounts to
employees who are on the fence about opting for the package.
To solve this problem, I used both logistic regression and linear discriminant analysis
(LDA). First, I pre-processed the data, which involved checking for missing values,
outliers, and other inconsistencies. Then, I split the data into training and testing sets
to evaluate the models' performance.
For LDA, I used the same independent variables and the "Holliday_Package" variable
as our categorical dependent variable. LDA aimed to find the linear combination of
independent variables that best separates the two classes. I evaluated the LDA model's
performance using metrics such as confusion matrix, accuracy, and F1 score.
To determine the important factors on the basis of which the company can focus on
particular employees to sell their packages, I examined the coefficients or weights of
the independent variables in the logistic regression model and the LDA model. I could
have also used feature selection techniques such as recursive feature elimination (RFE)
or principal component analysis (PCA) to identify the most important variables.
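The RFE option mentioned above can be sketched with scikit-learn. The data here is synthetic (two informative features plus three noise features), so the selected columns are illustrative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 300
informative = rng.normal(size=(n, 2))  # two predictive features
noise = rng.normal(size=(n, 3))        # three irrelevant features
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Recursively drop the weakest feature (smallest |coefficient|)
# until only two remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
selected = selector.support_  # boolean mask of the kept features
```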
Overall, the process of building and evaluating these models requires a combination of
statistical and machine learning techniques, along with careful data pre-processing and
feature selection. The insights gained from these models can help the travel agency to
target the right employees with their holiday packages, leading to higher sales and
customer satisfaction.
END OF REPORT