MODELLING PROJECT
PGP-DSBA, University of Texas at Austin, USA
1.1. Read the data and do exploratory data analysis. Describe the data
briefly. (Check the null values, Data types, shape, EDA, duplicate
values). Perform Univariate and Bivariate Analysis. (8 marks)
Data Type of the attributes:
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 carat 26967 non-null float64
1 cut 26967 non-null object
2 color 26967 non-null object
3 clarity 26967 non-null object
4 depth 26270 non-null float64
5 table 26967 non-null float64
6 x 26967 non-null float64
7 y 26967 non-null float64
8 z 26967 non-null float64
9 price 26967 non-null int64
dtypes: float64(6), int64(1), object(3)
Summary statistics of the numeric attributes:

index  count    mean         std          min    25%    50%     75%     max
carat  26967.0  0.798375     0.477745     0.2    0.4    0.7     1.05    4.5
depth  26270.0  61.745147    1.412860     50.8   61.0   61.8    62.5    73.6
price  26967.0  3939.518115  4024.864666  326.0  945.0  2375.0  5360.0  18818.0
table  26967.0  57.456080    2.232068     49.0   56.0   57.0    59.0    79.0
x      26967.0  5.729854     1.128516     0.0    4.71   5.69    6.55    10.23
y      26967.0  5.733569     1.166058     0.0    4.71   5.71    6.54    58.9
z      26967.0  3.538057     0.720624     0.0    2.9    3.52    4.04    31.8

Unique levels of the categorical attributes:

index   cut  color  clarity
unique  5    7      8
In the given data set, the mean and median values do not differ much for most attributes. The minimum values of "x", "y" and "z" are zero, which indicates faulty records: dimensionless or two-dimensional diamonds are not possible, so these rows need to be filtered out as clearly faulty data entries. There are three object (categorical) attributes: 'cut', 'color' and 'clarity'.
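The zero-dimension filtering described above can be sketched as follows. A tiny synthetic frame stands in here for the actual diamonds data (26967 rows), so the column values are illustrative:

```python
import pandas as pd

# Synthetic stand-in for the diamonds data.
df = pd.DataFrame({
    "carat": [0.3, 0.7, 1.1, 0.5],
    "x": [4.3, 5.7, 0.0, 5.1],   # length in mm; 0.0 is a faulty entry
    "y": [4.3, 5.7, 6.6, 0.0],   # width in mm
    "z": [2.7, 3.5, 4.1, 3.2],   # depth (height) in mm
})

# A diamond cannot have a zero dimension, so treat x == 0, y == 0 or
# z == 0 as faulty records and drop those rows.
mask = (df[["x", "y", "z"]] == 0).any(axis=1)
df_clean = df[~mask].reset_index(drop=True)
```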
3: Outlier Treatment
Before: [boxplots of the variables before outlier treatment]
After: [boxplots of the variables after outlier treatment]
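The report does not show the code behind the before/after boxplots; a common approach, assumed here, is IQR-based capping of values beyond the whiskers:

```python
import pandas as pd

def cap_outliers_iqr(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return s.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Tiny example: 100.0 lies far outside the whiskers and gets capped.
s = pd.Series([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])
capped = cap_outliers_iqr(s)
```

Applied column-wise, this reproduces the shrunken whiskers visible in the "After" plots without dropping any rows.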
4: Univariate Analysis
Skewness of the attributes:
carat 0.917116
cut -0.718839
color -0.189002
clarity 0.552317
depth -0.195101
table 0.480895
x 0.398233
y 0.394542
z 0.395750
price 1.157774
dtype: float64
There is a significant number of outliers present in some variables. The distributions of some quantitative features like "carat" and the target feature "price" are heavily right-skewed.
We can also observe very strong multicollinearity in the data set: the variance inflation factors are far above the range they should ideally fall in, which is 1 to 3.
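The skewness figures above come from pandas' skew(), and multicollinearity is typically quantified with variance inflation factors from statsmodels. A minimal sketch with synthetic data (two nearly identical columns standing in for the strongly collinear dimension features):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
n = 200
x = rng.normal(5.7, 1.1, n)
df = pd.DataFrame({
    "x": x,
    "y": x + rng.normal(0, 0.05, n),    # nearly identical to x -> collinear
    "depth": rng.normal(61.7, 1.4, n),  # independent feature
})

# Right-skew check used in the report:
skew = df.skew()

# VIF per feature; a constant column is added so VIFs are computed
# against an intercept-augmented design matrix.
X = df.assign(const=1.0)
vif = {c: variance_inflation_factor(X.values, i)
       for i, c in enumerate(X.columns) if c != "const"}
```

Collinear columns produce very large VIFs, while the independent column stays near 1.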
5: Bi-variate Analysis
PAIRPLOTS
It can be inferred that most features correlate with the price of the diamond. The notable exception is "depth", which has a negligible correlation (<1%).
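The feature-to-price correlations referenced above can be read off directly from the correlation matrix. A sketch with synthetic data (price driven by carat, depth unrelated, mirroring the report's finding):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 300
carat = rng.uniform(0.2, 2.0, n)
df = pd.DataFrame({
    "carat": carat,
    "depth": rng.normal(61.7, 1.4, n),            # unrelated to price
    "price": 4000 * carat + rng.normal(0, 200, n),
})

# Pearson correlation of every numeric feature with the target.
corr_with_price = df.corr()["price"].drop("price").sort_values(ascending=False)
```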
VIOLIN PLOTS
color I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
D 25 38 1039 669 369 804 121 276
E 53 87 1249 849 625 1202 342 509
F 67 182 1088 749 672 1106 360 498
G 67 340 998 777 1075 1205 507 681
H 81 149 1080 792 593 802 288 306
I 48 69 724 467 480 600 183 194
J 21 26 386 258 272 373 38 66
cut I1 IF SI1 SI2 VS1 VS2 VVS1 VVS2
Premium 106 115 1808 1441 996 1692 307 415
Very Good 43 132 1653 1046 887 1253 386 627
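The two tables above are counts of clarity levels per color and per cut; they can be reproduced with pd.crosstab. A sketch on a toy frame (the real frame has far more rows):

```python
import pandas as pd

df = pd.DataFrame({
    "color":   ["D", "D", "E", "E", "E", "G"],
    "cut":     ["Premium", "Ideal", "Premium", "Very Good", "Premium", "Ideal"],
    "clarity": ["SI1", "VS2", "SI1", "SI2", "VS2", "IF"],
})

# Rows: one category; columns: clarity levels; cells: counts.
by_color = pd.crosstab(df["color"], df["clarity"])
by_cut = pd.crosstab(df["cut"], df["clarity"])
```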
1.2. Impute null values if present, and also check for values which are equal
to zero. Do they have any meaning, or do we need to change them or
drop them? Check for the possibility of combining the sub-levels of an
ordinal variable and take action accordingly. Explain why you are
combining these sub-levels, with appropriate reasoning. (5 marks)
1.3. Encode the data (having string values) for modelling. Split the data into
train and test (70:30). Apply linear regression using scikit-learn.
Perform checks for significant variables using an appropriate method
from statsmodels. Create multiple models and check the performance of
predictions on the train and test sets using R-square, RMSE and adjusted
R-square. Compare these models and select the best one with
appropriate reasoning. (12 marks)
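The pipeline in 1.3 (ordinal encoding of the grades, a 70:30 split, a scikit-learn fit, and R²/RMSE on both sets) can be sketched as follows. The encoding map and the synthetic data are illustrative assumptions, not the report's exact code:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "carat": rng.uniform(0.2, 2.0, n),
    "cut": rng.choice(["Fair", "Good", "Very Good", "Premium", "Ideal"], n),
})
df["price"] = 4000 * df["carat"] + rng.normal(0, 300, n)

# Ordinal-encode the categorical grade (worst -> best), since cut is ordered.
cut_order = {"Fair": 0, "Good": 1, "Very Good": 2, "Premium": 3, "Ideal": 4}
df["cut"] = df["cut"].map(cut_order)

X = df[["carat", "cut"]]
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

model = LinearRegression().fit(X_train, y_train)
r2_train = model.score(X_train, y_train)
r2_test = r2_score(y_test, model.predict(X_test))
rmse_test = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
```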
COEFFICIENTS FOR EACH OF THE INDEPENDENT ATTRIBUTES
• The intercept for the model is -5010.045513455747.
• R-square on training data - 0.9312284418457293.
• R-square on testing data - 0.9316264665894696.
• RMSE on training data - 906.9014636213853.
• RMSE on testing data - 911.2934219690869.
LINEAR REGRESSION USING STATSMODELS
Intercept -5010.045513
carat 8887.318954
cut 113.306810
color 273.221091
clarity 436.893466
depth 35.287476
table -15.081966
x -1349.126528
y 1561.207366
z -968.908191
dtype: float64
z            -968.9082    139.077    -6.967    0.000    -1241.511    -696.305
==============================================================================
Omnibus:                    2652.212   Durbin-Watson:                   2.004
Prob(Omnibus):                 0.000   Jarque-Bera (JB):             9565.278
Skew:                          0.690   Prob(JB):                         0.00
Kurtosis:                      6.206   Cond. No.                     1.04e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is
correctly specified.
[2] The condition number is large, 1.04e+04. This might indicate that there
are strong multicollinearity or other numerical problems.

RMSE - 911.2934219690871
1.4. Inference: Based on these predictions, what are the business insights
and recommendations? (5 marks)
• The Gem Stones company should treat 'carat', 'cut', 'color', 'clarity' and
width ('y') as the most important features for predicting price, in order to
distinguish higher-profit stones from lower-profit stones and improve its
profit share.
• As the model shows, the higher the width ('y') of a stone, the higher its
price, so stones with greater width should be classed as higher-profit stones.
• The 'Premium' cut diamonds are the most expensive, followed by the
'Very Good' cut; these should also be classed as higher-profit stones.
• Diamonds with clarity 'VS1' and 'VS2' are the most expensive, so these two
categories likewise belong among the higher-profit stones.
• For length ('x'), the model indicates that the greater the length of a stone,
the lower its price, so longer stones are less profitable.
• Similarly, the greater the height ('z') of a stone, the lower its price: if a
diamond's height is too large it appears 'dark', because it no longer returns
an attractive amount of light. Stones with a higher 'z' are therefore also
lower in profitability.
PROBLEM 2
LOGISTIC REGRESSION AND LDA
Data Dictionary:
2.1. Data Ingestion: Read the dataset. Do the descriptive statistics and do
null value condition check, write an inference on it. Perform Univariate
and Bivariate Analysis. Do exploratory data analysis. (5 marks)
Null values:
Holliday_Package 0
Salary 0
age 0
educ 0
no_young_children 0
no_older_children 0
foreign 0
dtype: int64
UNIVARIATE ANALYSIS:
BIVARIATE ANALYSIS:
Outlier Treatment
Before:
After:
PAIRPLOTS:
Correlation Map:
Based on the descriptive statistics, we can see that the data contains 872 observations
with no missing values. The mean salary is around 48,000, and the mean age is around
40. The minimum and maximum values for salary and age seem reasonable.
The univariate analysis shows that the majority of employees did not opt for the holiday
package. The salary and age distributions are slightly skewed to the right. The bivariate
analysis shows that employees who opted for the holiday package tend to have higher
salaries and are slightly older on average.
The correlation analysis shows that there is a weak positive correlation between salary
and age.
2.2. Do not scale the data. Encode the data (having string values) for
Modelling. Data Split: Split the data into train and test (70:30). Apply
Logistic Regression and LDA (linear discriminant analysis). (7 marks)
Here, we first encoded the categorical variables "foreign" and "Holliday_Package" into
numerical values for modelling purposes. We then split the data into train and test sets
with a 70:30 ratio using the train_test_split function.
We then fitted two models, one using logistic regression and the other using LDA, on
the training data.
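The encoding, split and the two fits described above can be sketched as follows; the synthetic data is an illustrative stand-in (the real frame has 872 rows and a few more columns):

```python
import numpy as np
import pandas as pd
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n = 400
df = pd.DataFrame({
    "Salary": rng.normal(48000, 15000, n),
    "age": rng.integers(20, 62, n),
    "foreign": rng.choice(["yes", "no"], n),
})
# Synthetic target loosely tied to salary, echoing the report's findings.
df["Holliday_Package"] = np.where(
    df["Salary"] + rng.normal(0, 20000, n) > 48000, "yes", "no")

# Encode the two string columns as 0/1 (data is deliberately not scaled,
# per the assignment).
for col in ["foreign", "Holliday_Package"]:
    df[col] = (df[col] == "yes").astype(int)

X = df[["Salary", "age", "foreign"]]
y = df["Holliday_Package"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=1)

logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
```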
Logistic Regression:
Accuracy on Train set for Logistic Regression: 0.5426229508196722
Accuracy on Test set for Logistic Regression: 0.5343511450381679
Confusion Matrix on Train set for Logistic Regression: [[331 0]
[279 0]]
Confusion Matrix on Test set for Logistic Regression: [[140 0]
[122 0]]
ROC AUC Score on Train set for Logistic Regression:
0.5922641284691768
ROC AUC Score on Test set for Logistic Regression:
0.6319672131147541
Accuracy on Train set for LDA: 0.6655737704918033
Accuracy on Test set for LDA: 0.6564885496183206
Confusion Matrix on Train set for LDA: [[254 77]
[127 152]]
Confusion Matrix on Test set for LDA: [[107 33]
[ 57 65]]
I calculated the performance metrics for both the logistic regression and LDA models
on the training and test sets, using the accuracy score and confusion matrix to evaluate
their performance. I also plotted the ROC curve and calculated the ROC AUC score for
both models.
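The metrics quoted above (accuracy, confusion matrix, ROC AUC) come from scikit-learn; a sketch with hypothetical labels and class-1 probabilities:

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

# Hypothetical true labels, hard predictions and class-1 probabilities.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 1, 1, 1, 0, 1, 0])
y_prob = np.array([0.1, 0.2, 0.6, 0.8, 0.7, 0.4, 0.9, 0.3])

acc = accuracy_score(y_true, y_pred)
cm = confusion_matrix(y_true, y_pred)  # rows: true class, cols: predicted
auc = roc_auc_score(y_true, y_prob)    # uses probabilities, not hard labels
```

Note that ROC AUC is computed from the predicted probabilities, which is why it can differ between two models whose hard-label accuracies are similar.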
2.4. Inference: Based on these predictions, what are the business insights and
recommendations? (5 marks)
Based on the performance metrics, the accuracy scores of the two models look broadly
comparable on the test set. However, the confusion matrices show that the logistic
regression model predicts only the majority class (it never predicts that an employee
will opt for the holiday package), whereas the LDA model correctly identifies a
substantial number of employees who do opt for it. This makes the LDA model the
more useful of the two for targeting likely buyers.
Therefore, we recommend the tour and travel agency to focus on the employees who
are more likely to opt for the holiday package, as identified by the LDA model. The
agency can use the factors such as salary, age, and number of children to target the right
employees for their holiday packages. They can also offer incentives or discounts to
employees who are on the fence about opting for the package.
To solve this problem, I used both logistic regression and linear discriminant analysis
(LDA). First, I pre-processed the data, which involved checking for missing values,
outliers, and other inconsistencies. Then, I split the data into training and testing sets
to evaluate the models' performance.
For LDA, I used the same independent variables and the "Holliday_Package" variable
as our categorical dependent variable. LDA aimed to find the linear combination of
independent variables that best separates the two classes. I evaluated the LDA model's
performance using metrics such as confusion matrix, accuracy, and F1 score.
To determine the important factors on the basis of which the company can focus on
particular employees to sell their packages, I examined the coefficients or weights of
the independent variables in the logistic regression model and the LDA model. I could
have also used feature selection techniques such as recursive feature elimination (RFE)
or principal component analysis (PCA) to identify the most important variables.
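The RFE option mentioned above can be sketched with scikit-learn. The data here is synthetic (two informative features plus three noise features), so the selected columns are illustrative:

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(5)
n = 300
informative = rng.normal(size=(n, 2))  # two predictive features
noise = rng.normal(size=(n, 3))        # three irrelevant features
X = np.hstack([informative, noise])
y = (informative[:, 0] + informative[:, 1] > 0).astype(int)

# Recursively drop the weakest feature (smallest |coefficient|)
# until only two remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
selector.fit(X, y)
selected = selector.support_  # boolean mask of the kept features
```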
Overall, the process of building and evaluating these models requires a combination of
statistical and machine learning techniques, along with careful data pre-processing and
feature selection. The insights gained from these models can help the travel agency to
target the right employees with their holiday packages, leading to higher sales and
customer satisfaction.
END OF REPORT