Group 2 Report On Regression Model - Group2

REPORT ON LINEAR REGRESSION MODEL
1. Interpret your model. Discuss if it makes sense. Remember also that it is not a good idea to depend
solely on p-values.
2. Also perform a full residual analysis. Comment on the residual histogram plotted.
3. Report prediction error with prediction.
B+
Submitted By:
Group 2
Swati Jindal - P16023
Rishabh Jain - P16021
Vamsi Manyam - P16032
Vikram Khanna - P16027
Subhajit Roy - P16047
Report on Regression model
Final Regression Model (Model 7)
Gross Margin=25.1126 + 0.4838*Nfull + 1.68312*Npart + 0.13594*hourspw +0.18105*start
Interpretation of the Model

Dependent Variable
i) Gross Margin: It is the difference between net sales and cost of goods sold and is a measure
of profitability of a business. In the given model, it is dependent upon the following
independent variables.
Independent Variable
i) Nfull: Number of full time workers per day on average. Keeping all other variables
constant, an increase of one full time worker increases the margin by 0.4838 units.
ii) Npart: Number of part time workers per day on average. Keeping all other variables
constant, an increase of one part time worker increases the margin by 1.683 units.
iii) Hourspw: Number of hours worked per worker. Keeping all other variables constant, an
increase of one hour worked per worker increases the margin by 0.1359 units.
iv) Start: Years since start of business. Keeping all other variables constant, an increase of
one year since the business started increases the margin by 0.181 units.
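The "keeping all other variables constant" reading can be checked directly from the fitted equation. The sketch below is illustrative only: the helper predict_margin and the two example stores are hypothetical and not part of the original analysis; only the coefficients come from the report.

```r
# Coefficients of the final model (Model 7), copied from the report.
coefs <- c(intercept = 25.1126, nfull = 0.4838, npart = 1.68312,
           hourspw = 0.13594, start = 0.18105)

# Hypothetical helper (not in the original analysis): evaluates the fitted equation.
predict_margin <- function(nfull, npart, hourspw, start) {
  unname(coefs["intercept"] + coefs["nfull"] * nfull + coefs["npart"] * npart +
           coefs["hourspw"] * hourspw + coefs["start"] * start)
}

# Two made-up stores that differ only by one part time worker.
base  <- predict_margin(nfull = 2, npart = 1, hourspw = 20, start = 40)
extra <- predict_margin(nfull = 2, npart = 2, hourspw = 20, start = 40)

# The difference in predicted margin equals the npart coefficient.
effect_npart <- extra - base
print(effect_npart)
```

The same comparison with any other single variable changed by one unit returns that variable's coefficient.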
Model Development Process

1/ Identifying the Outliers
As a part of building the regression model, the first step is to check for outliers and deal with
them. We started with box plots, which flag any value above 3rd Quartile + (1.5*Inter-quartile
range) or below 1st Quartile - (1.5*Inter-quartile range) as an outlier. However, for some
variables the box plots flagged a lot of points, nearly 30 observations for one particular
variable. Removing that many observations from a data set of 400 observations did not seem to
be a good idea. So we used the 99th percentile value as the cut-off instead: any value above the
99th percentile was considered to be an outlier. Accordingly, we removed a total of 24
observations that were outliers.
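The two outlier rules described above can be compared side by side. This is a minimal sketch on synthetic, right-skewed data (the variable x is made up and is not one of the clothing-store variables), meant only to show why the box-plot rule can flag many more points than a 99th percentile cut-off.

```r
set.seed(1)
# Synthetic right-skewed data standing in for one variable (400 observations).
x <- rlnorm(400, meanlog = 4, sdlog = 0.6)

# Box-plot rule: flag values more than 1.5 * IQR beyond the quartiles.
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
box_flag <- x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr

# Percentile rule: flag values above the 99th percentile.
p99 <- quantile(x, 0.99)
pct_flag <- x > p99

cat("Box-plot rule flags:", sum(box_flag), "\n")
cat("99th percentile rule flags:", sum(pct_flag), "\n")
```

With 400 distinct values the percentile rule always flags the top four observations, whereas the number flagged by the box-plot rule depends on how skewed the variable is.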
1st Model: All the variables
Margin ~ nown, nfull, npart, naux, hoursw, hourspw, inv1, inv2, ssize, start
Adj R square- 0.3259
Indian Institute of Management Nagpur

2/ Computation of Variance Inflation Factor

After removing the outliers, the second step in building the regression model is to check for
multicollinearity between the variables. The correlation coefficient between hoursw and
hourspw is 0.808, which is high. When the VIF is calculated on the data without outliers,
hoursw has the highest value, nearly 50 (Appendix B), which is far above the threshold of 5.
So we removed the variable hoursw from the model. After removing this variable, no other
variable has a VIF above 5, so multicollinearity is no longer an issue.
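The VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on synthetic data; the variable workers and all simulated values are assumptions made for illustration, and the report itself used vif() from the car package on the real data.

```r
set.seed(2)
n <- 375
# Synthetic stand-ins: total hours is close to workers * hours per worker.
workers <- runif(n, 2, 10)
hourspw <- rnorm(n, 20, 5)
hoursw  <- workers * hourspw + rnorm(n, 0, 5)
start   <- rnorm(n, 40, 10)
X <- data.frame(workers, hourspw, hoursw, start)

# VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing predictor j on the others.
vif_manual <- sapply(names(X), function(v) {
  f <- reformulate(setdiff(names(X), v), response = v)
  1 / (1 - summary(lm(f, data = X))$r.squared)
})
print(round(vif_manual, 2))
```

In this simulation hoursw is almost determined by the other predictors, so its VIF is large, while start is unrelated to the rest and stays close to 1.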
2nd model: hoursw Removed (High VIF)
Margin ~ nown, nfull, npart, naux, hourspw, inv1, inv2, ssize, start
Adj R square - 0.3148
3/ P-Value Computation
For all the individual variables null hypothesis is considered as that the coefficient of that
respective variable is zero.
H0:β=0
For all the individual variables alternative hypothesis is considered as that the coefficient of
that respective variable is not zero.
HA: β≠0
After removing the outliers and dropping the hoursw variable on the basis of its VIF, the next
step was to identify the significance of the remaining variables. This was done in a stepwise
procedure using the p-values of the respective variables. With a confidence level of 95%
(alpha = 0.05), if the p-value is less than alpha then we reject the null hypothesis, i.e. the
variable is significant; otherwise it is not. At each step we dropped the variable with the
highest p-value if that p-value was greater than 0.05. We have to conduct this test multiple
times as removing one variable can affect the p-values of the other variables.
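The stepwise procedure described above can be sketched as a loop that refits the model after every removal. The data below are synthetic (x1 to x4 and y are made-up names, with only x1 and x2 truly related to y), so this illustrates the mechanics rather than reproducing the report's results.

```r
set.seed(3)
n <- 375
# Synthetic data: y depends on x1 and x2 only; x3 and x4 are pure noise.
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- 25 + 2 * d$x1 + 1.5 * d$x2 + rnorm(n, sd = 4)

alpha <- 0.05
fit <- lm(y ~ x1 + x2 + x3 + x4, data = d)
repeat {
  coef_table <- summary(fit)$coefficients
  keep <- rownames(coef_table) != "(Intercept)"
  # Named vector of p-values for the slope terms only.
  pvals <- setNames(coef_table[keep, "Pr(>|t|)"], rownames(coef_table)[keep])
  # Stop when nothing is left or every remaining term is significant.
  if (length(pvals) == 0 || max(pvals) <= alpha) break
  # Drop the least significant term and refit before checking again.
  worst <- names(pvals)[which.max(pvals)]
  fit <- update(fit, as.formula(paste(". ~ . -", worst)))
}
kept <- attr(terms(fit), "term.labels")
print(kept)
```

Only one variable is removed per pass because, as noted above, each removal changes the p-values of the variables that remain.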
In this model the inv2 variable has the highest p-value (0.745), which is more than the
alpha value of 0.05. So going by the hypothesis testing, we removed this variable for the 3rd
model.
3rd model: Inv2 Removed
Margin~nown+nfull+npart+naux+hourspw+inv1+ssize+start
Adjusted R square = 0.3164
In this model ssize has the highest p-value (.677) which is more than alpha. So going by the
hypothesis testing, we removed this variable for 4th model.
4th model: Ssize Removed
Margin~nown+nfull+npart+naux+hourspw+inv1+start

In this model nown has the highest p-value (0.238) which is more than alpha. So going by the
hypothesis testing, we removed this variable for the 5th model.
5th model: nown removed
Margin~nfull+npart+naux+hourspw+inv1+start
In this model inv1 has the highest p-value (0.234) which is more than alpha. So going by the
hypothesis testing, we removed this variable for the 6th model.
6th model: inv1 removed
Margin~nfull+npart+naux+hourspw+start
In this model naux has the highest p-value (0.096) which is more than alpha. So going by the
hypothesis testing, we removed this variable for the 7th model.
7th model: naux removed
margin~nfull+npart+hourspw+start
In this model nfull has the highest p-value (0.069) which is more than alpha. So going by the
hypothesis testing, we removed this variable for the 8th model.
8th model: nfull removed
margin~npart+hourspw+start
Comparison of models:
Various regression models can be compared using the adjusted R square value. A high
adjusted R square means the spread of the residuals is narrow and the error in prediction is
small. However, we cannot always select the model solely on the basis of this value. If there is
only a minute difference between the values of two models but one of them needs many more
input variables, then it is better to go with the one which needs less data. It will be the more
practical choice because of the ease of calculation/prediction.
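Adjusted R square penalises R square for the number of predictors: adjusted R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). As a quick check, the sketch below reproduces the adjusted value for Model 1 from the figures printed in its summary in Appendix B; the variable names are illustrative.

```r
# Values printed in the Model 1 summary (Appendix B).
r2 <- 0.3439            # multiple R-squared
df_resid <- 364         # residual degrees of freedom
p <- 10                 # number of predictors in Model 1
n <- df_resid + p + 1   # observations used in the fit

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(round(adj_r2, 4))
```

The result agrees with the adjusted R square of 0.3259 reported for Model 1.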
Conclusion:
After multiple iterations, the regression equation with nfull, npart, naux, hourspw, inv1 and
start as variables (Model 5) has the highest adjusted R square among the models fitted
without hoursw, 0.3172. Another regression model with nfull, npart, hourspw and start as
variables (Model 7) has an adjusted R square of 0.3132. Even though the adjusted R square of
Model 7 is lower, the reduction is very small and this model requires less data.

So depending on the adjusted R square value and the amount of data required (4
independent variables) for our regression model, we selected Model 7 as the final regression
model. Its adjusted R square of 0.3132 means the estimated model explains about 31% of the
sample variation in gross margin, after adjusting for the number of predictors. Adjusted R
square is a better basis for comparison than R square because R square never decreases
when a variable is added, whether or not that variable is useful.
A low R square means it is difficult to predict the outcome (that is, gross margin)
from the available data. There might be other variables, not in the data, which could explain it.
F-test:
The F-test compares our regression model with the intercept-only model, i.e. the model
without any variable. It gives the overall significance of the regression equation. The null
hypothesis is that all the slope coefficients are zero, and the alternative hypothesis is that at
least one of them is not zero. If the p-value is below the chosen significance level, we reject
the null hypothesis and the regression model with our selected variables is significant overall.
Otherwise we fail to reject the null hypothesis, and there is no evidence that our model fits
better than the intercept-only model.
For the Model 7 regression equation the p-value is 2.2e-16, which is less than the alpha value
(0.05). So the null hypothesis is rejected and the regression equation is significant.
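The overall F statistic can be recovered from R square as F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). The sketch below does this for Model 1, the only model whose F statistic line is reproduced in Appendix B; the variable names are illustrative, and the same formula applies to Model 7 with its own R square and degrees of freedom.

```r
# Values printed in the Model 1 summary (Appendix B).
r2 <- 0.3439      # multiple R-squared
p <- 10           # numerator degrees of freedom (number of predictors)
df_resid <- 364   # denominator (residual) degrees of freedom

# Overall F statistic: explained variation per predictor over residual variation per df.
f_stat <- (r2 / p) / ((1 - r2) / df_resid)

# Upper-tail probability of the F distribution gives the p-value.
p_value <- pf(f_stat, p, df_resid, lower.tail = FALSE)

print(round(f_stat, 2))
print(p_value)
```

The statistic matches the 19.08 on 10 and 364 DF printed for Model 1, up to rounding of R square.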
Predicting the gross profit margin for the given data:

We predict the gross profit margin, with the corresponding prediction interval, for a store with
2 owners, 5 full time workers, 2 part time workers and 1 auxiliary worker, a total of 300 work
hours resulting in 300/(2+5+2+1) = 30 work hours per worker, a $60000 investment in shop
premises and a $10000 investment in automation, a shop size of 500 sq. m. and a shop that is
90 years old. Only nfull, npart, hourspw and start enter Model 7, so the other characteristics
do not affect the prediction.
Using the designed regression model:
Gross Margin= 25.1126+0.4838*Nfull +1.68312*Npart+0.13594*hourspw+0.18105*start
Gross Margin= 25.1126 + 0.4838*5+ 1.68312*2 + 0.13594*30 + 0.18105*90
= 51.27054
95% Prediction Interval = 51.27 ± 1.96*SQRT(1.176^2 + 4.29^2)
= 51.27 ± 1.96*SQRT(1.382 + 18.4)
= 51.27 ± 8.72
= (42.55, 59.99)
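The prediction arithmetic can be reproduced in a few lines. The sketch below assumes that 1.176 is the standard error of the fitted value and that 4.29 is the residual standard error of Model 7; the report does not label either number, so both readings are assumptions. It also shows the interval with a t quantile on 370 degrees of freedom as an alternative to 1.96.

```r
# Model 7 coefficients and the new store's values (1 for the intercept).
coefs <- c(25.1126, 0.4838, 1.68312, 0.13594, 0.18105)
x_new <- c(1, 5, 2, 30, 90)   # intercept, nfull, npart, hourspw, start

point <- sum(coefs * x_new)   # predicted gross margin

# Assumed meaning of the two numbers used in the report:
se_fit <- 1.176               # standard error of the fitted value (assumption)
sigma  <- 4.29                # residual standard error of Model 7 (assumption)
se_pred <- sqrt(se_fit^2 + sigma^2)

half_z <- 1.96 * se_pred                  # normal approximation, as in the report
half_t <- qt(0.975, df = 370) * se_pred   # t quantile on the residual df

cat("Point prediction:", round(point, 5), "\n")
cat("95% PI (z):", round(point - half_z, 2), "to", round(point + half_z, 2), "\n")
cat("95% PI (t):", round(point - half_t, 2), "to", round(point + half_t, 2), "\n")
```

With 370 degrees of freedom the t quantile is only slightly larger than 1.96, so the two intervals are nearly identical.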

Appendix A
Finding Quantiles to remove outliers:
quantile(cloths$nown,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.000000 1.295225 2.000000 3.000000 3.005000
quantile(cloths$nfull,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.9231 2.0664 4.3590 5.0000 6.0050
quantile(cloths$npart,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.2833 2.0000 3.0000 4.0000 5.0000
quantile(cloths$naux,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.3333 1.3673 2.0000 3.0000 4.0000
quantile(cloths$hoursw,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
80.00 145.25 250.00 313.05 359.13
quantile(cloths$inv1,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
20000.00 62269.23 292857.20 293178.63 350250.00
quantile(cloths$inv2,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
10000.00 22859.85 80000.00 200750.00 300500.00
quantile(cloths$ssize,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
80.00 190.00 360.50 454.46 600.50
quantile(cloths$start,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
37.00 42.00 80.00 88.01 90.00
quantile(cloths$hourspw,c(0.25,0.5,0.75,0.90,.99))
25% 50% 75% 90% 99%
13.54120 17.74459 24.30298 28.91297 37.75375

Appendix B
Modelling the regression equation:
Model 1:
cloth=read.csv("C:\\Users\\Vamsi Manyam\\Desktop\\clothing.csv")
model1=lm(margin~nown+nfull+npart+naux+hoursw+hourspw+inv1+inv2+ssize+start,data=cloth)
summary(model1)
Call:
lm(formula = margin ~ nown + nfull + npart + naux + hoursw +
hourspw + inv1 + inv2 + ssize + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7050 -2.3052 0.5483 2.6236 15.5357
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.611e+01 4.087e+00 8.837 < 2e-16 ***
nown -7.548e-01 7.155e-01 -1.055 0.29220
nfull -1.482e+00 7.728e-01 -1.918 0.05591 .
npart -1.911e-01 7.587e-01 -0.252 0.80131
naux -2.765e+00 9.490e-01 -2.913 0.00380 **
hoursw 7.715e-02 2.910e-02 2.652 0.00836 **
hourspw -3.379e-01 1.782e-01 -1.896 0.05873 .
inv1 6.966e-06 3.623e-06 1.923 0.05530 .
inv2 -4.232e-06 9.636e-06 -0.439 0.66082
ssize -2.913e-04 3.298e-03 -0.088 0.92966
start 1.861e-01 1.906e-02 9.764 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.252 on 364 degrees of freedom

Multiple R-squared: 0.3439, Adjusted R-squared: 0.3259
F-statistic: 19.08 on 10 and 364 DF, p-value: < 2.2e-16
library(car)
vif(lm(margin~nown+nfull+npart+naux+hoursw+hourspw+inv1+inv2+ssize+start,data=cloth))
nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
2.034203 8.822122 3.500458 1.844885 49.878834 32.115259 1.518805 1.183078 1.688677 1.127065
Model 2:
model2=lm(margin~nown+nfull+npart+naux+hourspw+inv1+inv2+ssize+start,data=cloth)
vif(model2)
nown nfull npart naux hourspw inv1 inv2 ssize start
1.031140 1.144159 1.166476 1.038341 1.520605 1.390207 1.180971 1.626014 1.124763
summary(model2)
Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + inv2 + ssize + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7391 -2.1082 0.4661 2.5180 17.4720
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.614e+01 1.616e+00 16.180 < 2e-16 ***
nown 5.775e-01 5.136e-01 1.124 0.26160
nfull 4.295e-01 2.806e-01 1.531 0.12669
npart 1.452e+00 4.416e-01 3.287 0.00111 **
naux -1.101e+00 7.178e-01 -1.534 0.12590
hourspw 1.233e-01 3.909e-02 3.154 0.00175 **
inv1 4.171e-06 3.495e-06 1.193 0.23351
inv2 -3.154e-06 9.707e-06 -0.325 0.74546
ssize 1.393e-03 3.263e-03 0.427 0.66964
start 1.838e-01 1.920e-02 9.575 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model 3:
model3=lm(margin~nown+nfull+npart+naux+hourspw+inv1+ssize+start,data=cloth)
summary(model3)
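Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + ssize + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7463 -2.1432 0.4844 2.4914 17.3387
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.613e+01 1.613e+00 16.198 < 2e-16 ***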
nown 5.814e-01 5.129e-01 1.134 0.25769
nfull 4.242e-01 2.798e-01 1.516 0.13035
npart 1.458e+00 4.406e-01 3.309 0.00103 **
naux -1.105e+00 7.169e-01 -1.541 0.12408
hourspw 1.227e-01 3.901e-02 3.146 0.00179 **
inv1 3.819e-06 3.319e-06 1.151 0.25064
ssize 1.356e-03 3.257e-03 0.416 0.67733
start 1.833e-01 1.911e-02 9.591 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model 4:
model4=lm(margin~nown+nfull+npart+naux+hourspw+inv1+start,data=cloth)
summary(model4)
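Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7640 -2.0986 0.3987 2.5074 17.5196
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.601e+01 1.587e+00 16.394 < 2e-16 ***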
nown 6.029e-01 5.097e-01 1.183 0.237641
nfull 4.353e-01 2.782e-01 1.565 0.118537
npart 1.505e+00 4.255e-01 3.536 0.000458 ***
naux -1.072e+00 7.117e-01 -1.506 0.132843
hourspw 1.308e-01 3.381e-02 3.870 0.000129 ***
inv1 4.061e-06 3.265e-06 1.244 0.214364
start 1.828e-01 1.905e-02 9.595 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model 5:
model5=lm(margin~nfull+npart+naux+hourspw+inv1+start,data=cloth)
summary(model5)
Call:
lm(formula = margin ~ nfull + npart + naux + hourspw + inv1 +
start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.799 -2.170 0.479 2.566 17.310
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.678e+01 1.449e+00 18.483 < 2e-16 ***
nfull 4.237e-01 2.782e-01 1.523 0.128581
npart 1.540e+00 4.247e-01 3.626 0.000329 ***
naux -1.095e+00 7.118e-01 -1.538 0.124926
hourspw 1.348e-01 3.366e-02 4.004 7.53e-05 ***
inv1 3.892e-06 3.263e-06 1.193 0.233788
start 1.809e-01 1.899e-02 9.524 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model 6:
model6=lm(margin~nfull+npart+naux+hourspw+start,data=cloth)
summary(model6)
Call:
lm(formula = margin ~ nfull + npart + naux + hourspw + start,
data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.8686 -2.0959 0.4684 2.6209 16.7507
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.62383 1.44390 18.439 < 2e-16 ***
nfull 0.52359 0.26541 1.973 0.0493 *
npart 1.66624 0.41158 4.048 6.29e-05 ***
naux -1.18136 0.70849 -1.667 0.0963 .
hourspw 0.13976 0.03342 4.182 3.61e-05 ***
start 0.18062 0.01900 9.504 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model 7:
model7=lm(margin~nfull+npart+hourspw+start,data=cloth)
summary(model7)
Call:
lm(formula = margin ~ nfull + npart + hourspw + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-15.1194 -2.2302 0.3886 2.5829 17.2256
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.11266 1.12676 22.287 < 2e-16 ***
nfull 0.48380 0.26497 1.826 0.0687 .
npart 1.68312 0.41245 4.081 5.50e-05 ***
hourspw 0.13594 0.03342 4.068 5.81e-05 ***
start 0.18105 0.01905 9.505 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Model 8:
model8=lm(margin~npart+hourspw+start,data=cloth)
summary(model8)
Call:
lm(formula = margin ~ npart + hourspw + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-15.143 -2.237 0.417 2.492 18.011
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.85285 1.05461 24.514 < 2e-16 ***
npart 1.76478 0.41130 4.291 2.28e-05 ***
hourspw 0.13976 0.03346 4.177 3.69e-05 ***
start 0.18154 0.01911 9.502 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1


Appendix C
Plots:
Multicollinearity
Figure: Total hours vs hours per worker (x-axis: number of hours worked per worker, 10 to 40; y-axis: total number of hours worked, 100 to 600)
Box plots with outliers





Histogram showing the residual distribution for model-7:
Figure: Plot of residuals (x-axis: residuals of regression model, -20 to 20; y-axis: frequency, 0 to 150)
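A residual analysis usually goes beyond the histogram. The sketch below shows some typical checks on a model fitted to synthetic data; the data frame d, its simulated values and the reduced formula are assumptions made so that the example runs on its own, and on the real data the same calls would be applied to model7.

```r
set.seed(4)
n <- 375
# Synthetic data loosely shaped like the final model (illustrative values only).
d <- data.frame(npart = runif(n, 1, 4), start = runif(n, 20, 90))
d$margin <- 25 + 1.7 * d$npart + 0.18 * d$start + rnorm(n, sd = 4.3)

fit <- lm(margin ~ npart + start, data = d)
res <- residuals(fit)

# Histogram bin counts (plot = FALSE returns the counts without drawing).
h <- hist(res, breaks = 20, plot = FALSE)

# Shapiro-Wilk test: a small p-value suggests the residuals are not normal.
sw <- shapiro.test(res)

# Crude check for non-constant variance: do absolute residuals grow with fitted values?
spread_cor <- cor(abs(res), fitted(fit))

cat("Mean residual:", signif(mean(res), 3), "\n")
cat("Shapiro-Wilk p-value:", signif(sw$p.value, 3), "\n")
cat("cor(|residual|, fitted):", round(spread_cor, 3), "\n")
```

A roughly symmetric, bell-shaped histogram centred on zero, together with no visible trend of residual spread against fitted values, is what the linear model assumptions lead one to expect.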

Residual standard errors reported in the model summaries (Appendix B):
Model 1: 4.252 on 364 degrees of freedom
Model 2: 4.287 on 365 degrees of freedom
Model 3: 4.282 on 366 degrees of freedom
Model 4: 4.277 on 367 degrees of freedom
Model 5: 4.279 on 368 degrees of freedom
Model 6: 4.282 on 369 degrees of freedom
Model 7: 4.292 on 370 degrees of freedom
Model 8: 4.305 on 371 degrees of freedom