LINEAR REGRESSION MODEL
1. Interpret your model. Discuss if it makes sense. Remember also that it is not a good idea to depend
solely on p-values.
2. Also perform a full residual analysis. Comment on the residual histogram plotted.
3. Report prediction error with prediction.
B+
Submitted By:
Group 2
For each individual variable, the null hypothesis is that its coefficient is zero, and the
alternative hypothesis is that its coefficient is not zero.
H0: β = 0
HA: β ≠ 0
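As a small worked illustration of this test, the t value that R reports for a coefficient is the estimate divided by its standard error. The sketch below uses the nfull row of the Model 7 summary in Appendix B; the variable names are illustrative only.

```r
# Values copied from the nfull row of the Model 7 summary (Appendix B).
estimate  <- 0.48380
std_error <- 0.26497

# Under H0: beta = 0, the test statistic is estimate / standard error.
t_value <- estimate / std_error
print(round(t_value, 3))   # 1.826, matching the "t value" column
```

The p-value in the summary is then the two-sided tail probability of this t value under the t distribution with the model's residual degrees of freedom.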
After removing the hoursw variable on the basis of its VIF value, and removing outliers, the
next step was to identify which of the remaining variables are significant. This was done in a
stepwise (backward elimination) procedure using the p-value of each variable. With a
significance level (alpha) of 0.05, i.e. a 95% confidence level, a p-value less than alpha lets
us reject the null hypothesis, meaning the variable is significant; otherwise it is not. At each
step, if any variable had a p-value greater than 0.05, we dropped the variable with the highest
p-value from the regression model. The test has to be repeated after every removal, because
removing one variable can change the p-values of the other variables.
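The removal procedure described above can be sketched as a loop. This is a minimal illustration on simulated data, not the code used for the report: the data frame dat, the columns x1 to x4 and y, and the seed are all assumptions made for the example.

```r
# Simulated stand-in data; the column names x1..x4 and y are hypothetical.
set.seed(1)
n <- 200
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 2 + 1.5 * dat$x1 - 1.0 * dat$x2 + rnorm(n)   # x3 and x4 are pure noise

alpha <- 0.05
predictors <- c("x1", "x2", "x3", "x4")

repeat {
  fit <- lm(reformulate(predictors, response = "y"), data = dat)
  # p-values of the slope coefficients only (the intercept row is skipped)
  pvals <- coef(summary(fit))[predictors, "Pr(>|t|)"]
  names(pvals) <- predictors
  if (max(pvals) <= alpha || length(predictors) == 1) break
  # drop only the single variable with the highest p-value, then refit
  predictors <- setdiff(predictors, names(which.max(pvals)))
}

print(predictors)
print(round(pvals, 4))
```

Refitting after each single removal is the point of the loop: the p-values of the remaining variables are recomputed every time, which is why only one variable is dropped per step.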
In the 2nd model (hoursw removed), inv2 has the highest p-value, 0.74, which is greater than
the alpha value of 0.05. Going by the hypothesis test, we removed this variable to obtain the
3rd model.
3rd model: inv2 removed
margin~nown+nfull+npart+naux+hourspw+inv1+ssize+start
Adjusted R Square = 0.3164
In the 3rd model, ssize has the highest p-value (0.677), which is greater than alpha. Going by
the hypothesis test, we removed this variable to obtain the 4th model.
4th model: ssize removed
margin~nown+nfull+npart+naux+hourspw+inv1+start
Adjusted R Square = 0.318
In the 4th model, nown has the highest p-value (0.238), which is greater than alpha. Going by
the hypothesis test, we removed this variable to obtain the 5th model.
5th model: nown removed
margin~nfull+npart+naux+hourspw+inv1+start
Adjusted R Square = 0.3172
In the 5th model, inv1 has the highest p-value (0.234), which is greater than alpha. Going by
the hypothesis test, we removed this variable to obtain the 6th model.
6th model: inv1 removed
margin~nfull+npart+naux+hourspw+start
Adjusted R Square = 0.3165
In the 6th model, naux has the highest p-value (0.096), which is greater than alpha. Going by
the hypothesis test, we removed this variable to obtain the 7th model.
7th model: naux removed
margin~nfull+npart+hourspw+start
Adjusted R Square = 0.3132
In the 7th model, nfull has the highest p-value (0.068), which is greater than alpha. Going by
the hypothesis test, we removed this variable to obtain the 8th model.
8th model: nfull removed
margin~npart+hourspw+start
Adjusted R Square = 0.3089
Comparison of models:
Regression models can be compared using the adjusted R Square value. A higher adjusted
R Square means the model explains a larger share of the variation in the response, so the
residuals are less spread out and the prediction error is smaller. However, we cannot always
select a model solely on this value. If the difference between two models is minute but one of
them needs many more variables, it is better to go with the model that needs fewer variables:
it requires less data to be collected and is easier to use for calculation and prediction.
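To make the comparison concrete, adjusted R Square can be computed by hand as 1 - (1 - R^2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of variables. The sketch below checks this against the value summary() reports; the simulated data and the helper adj_r2 are assumptions for illustration.

```r
# Simulated stand-in data; the column names are hypothetical.
set.seed(2)
n <- 150
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
dat$y <- 1 + 0.8 * dat$x1 + 0.5 * dat$x2 + rnorm(n)

adj_r2 <- function(fit) {
  r2 <- summary(fit)$r.squared
  n  <- nobs(fit)
  p  <- length(coef(fit)) - 1        # number of variables, excluding the intercept
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

small <- lm(y ~ x1 + x2, data = dat)
large <- lm(y ~ x1 + x2 + noise, data = dat)

# The hand computation should agree with the value summary() reports.
print(c(manual = adj_r2(small), reported = summary(small)$adj.r.squared))

# Plain R Square never decreases when a variable is added; adjusted R Square can.
print(c(r2_small = summary(small)$r.squared, r2_large = summary(large)$r.squared))
print(c(adj_small = adj_r2(small), adj_large = adj_r2(large)))
```

The (n - 1)/(n - p - 1) factor is the penalty: an extra variable has to reduce 1 - R^2 by more than the factor grows, otherwise adjusted R Square falls.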
Conclusion:
After multiple iterations, the regression equation with nfull, npart, naux, hourspw, inv1 and
start as variables (Model 5) has an adjusted R Square of 0.3172, close to the highest value
obtained (0.318, Model 4). Another regression model, with nfull, npart, hourspw and start as
variables (Model 7), has an adjusted R Square of 0.3132. Although the adjusted R Square of
Model 7 is lower, the reduction is very small and the model needs fewer variables.
So, weighing the adjusted R Square value against the number of variables required (4
independent variables), we selected Model 7 as the final regression model. Its adjusted
R Square of 0.3132 means that the estimated model explains roughly 31% of the sample
variation in Y (gross margin), after adjusting for the number of variables. Adjusted R Square
is a better basis for comparison than R Square because R Square never decreases when a
variable is added, whereas adjusted R Square penalizes variables that add little explanatory
power.
This low R Square value means it is difficult to predict the outcome (gross margin) from the
available data. There may be other variables, not in this data set, that explain more of the
variation.
F-test:
The F-test compares our regression model with the intercept-only model, i.e. the model
without any variables. It gives the overall significance of the regression equation. For the
F-test, the null hypothesis is that all the coefficients are zero (no variable is significant),
and the alternative hypothesis is that at least one coefficient is not zero. If the p-value is
below our chosen significance level, we reject the null hypothesis and conclude that the
regression model with our selected variables is significant. Otherwise we fail to reject the
null hypothesis, and there is no evidence that the model is better than the intercept-only
model.
For the Model 7 regression equation, the p-value is 2.2e-16, which is less than the alpha
value (0.05). So the null hypothesis is rejected and the regression equation is significant.
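The comparison with the intercept-only model can be carried out explicitly with anova() on two nested fits. The sketch below uses simulated data with hypothetical column names, and also recovers the same p-value from the F statistic stored in the model summary.

```r
# Simulated stand-in data; the column names are hypothetical.
set.seed(3)
n <- 120
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 5 + 1.2 * dat$x1 + rnorm(n)

full <- lm(y ~ x1 + x2, data = dat)
null <- lm(y ~ 1, data = dat)        # intercept-only model

# anova() on nested models performs the F-test of H0: all slopes are zero.
cmp <- anova(null, full)
print(cmp)

# The same F statistic is stored in the summary of the full model.
f <- summary(full)$fstatistic        # named vector: value, numdf, dendf
p_value <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)
print(unname(p_value))
```

The last line of summary(full) prints this F statistic and p-value, which is where the figure quoted for the final model comes from.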
Appendix A
quantile(cloths$nown,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.000000 1.295225 2.000000 3.000000 3.005000
quantile(cloths$nfull,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.9231 2.0664 4.3590 5.0000 6.0050
quantile(cloths$npart,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.2833 2.0000 3.0000 4.0000 5.0000
quantile(cloths$naux,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.3333 1.3673 2.0000 3.0000 4.0000
quantile(cloths$hoursw,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
80.00 145.25 250.00 313.05 359.13
quantile(cloths$inv1,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
20000.00 62269.23 292857.20 293178.63 350250.00
quantile(cloths$inv2,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
10000.00 22859.85 80000.00 200750.00 300500.00
quantile(cloths$ssize,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
80.00 190.00 360.50 454.46 600.50
quantile(cloths$start,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
37.00 42.00 80.00 88.01 90.00
quantile(cloths$hourspw,c(0.25,0.5,0.75,0.90,.99))
25% 50% 75% 90% 99%
13.54120 17.74459 24.30298 28.91297 37.75375
Appendix B
Model 1:
cloth=read.csv("C:\\Users\\Vamsi Manyam\\Desktop\\clothing.csv")
model1=lm(margin~nown+nfull+npart+naux+hoursw+hourspw+inv1+inv2+ssize+start,data=cloth)
summary(model1)
Call:
lm(formula = margin ~ nown + nfull + npart + naux + hoursw +
hourspw + inv1 + inv2 + ssize + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7050 -2.3052 0.5483 2.6236 15.5357
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.611e+01  4.087e+00   8.837  < 2e-16 ***
nown        -7.548e-01  7.155e-01  -1.055  0.29220
nfull       -1.482e+00  7.728e-01  -1.918  0.05591 .
npart       -1.911e-01  7.587e-01  -0.252  0.80131
naux        -2.765e+00  9.490e-01  -2.913  0.00380 **
hoursw       7.715e-02  2.910e-02   2.652  0.00836 **
hourspw     -3.379e-01  1.782e-01  -1.896  0.05873 .
inv1         6.966e-06  3.623e-06   1.923  0.05530 .
inv2        -4.232e-06  9.636e-06  -0.439  0.66082
ssize       -2.913e-04  3.298e-03  -0.088  0.92966
start        1.861e-01  1.906e-02   9.764  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
library(car)
vif(lm(margin~nown+nfull+npart+naux+hoursw+hourspw+inv1+inv2+ssize+start,data=cloth))
      nown     nfull     npart      naux    hoursw   hourspw      inv1      inv2     ssize     start
  2.034203  8.822122  3.500458  1.844885 49.878834 32.115259  1.518805  1.183078  1.688677  1.127065
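As a cross-check on the vif() output above, a VIF can be computed without the car package: regress each variable on the other variables and take 1 / (1 - R^2). The sketch below does this on simulated data in which one column is built to be nearly a multiple of another; the data, the column names and the helper vif_manual are assumptions for illustration, not the clothing data.

```r
# Simulated stand-in data: hours_total is constructed to be almost a multiple
# of hours_each, mimicking two strongly related variables. Names are hypothetical.
set.seed(4)
n <- 200
dat <- data.frame(hours_each = rnorm(n, mean = 20, sd = 5),
                  staff      = rnorm(n, mean = 5,  sd = 1))
dat$hours_total <- 6 * dat$hours_each + rnorm(n, sd = 3)

vif_manual <- function(data, predictors) {
  sapply(predictors, function(v) {
    others <- setdiff(predictors, v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)                     # VIF_j = 1 / (1 - R_j^2)
  })
}

vif_all     <- vif_manual(dat, c("hours_each", "staff", "hours_total"))
vif_dropped <- vif_manual(dat, c("hours_each", "staff"))   # one of the pair removed
print(round(vif_all, 2))
print(round(vif_dropped, 2))
```

Dropping one of the two related columns brings the remaining VIF values back close to 1, which is the same pattern seen when hoursw is removed to form Model 2.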
Model 2:
model2=lm(margin~nown+nfull+npart+naux+hourspw+inv1+inv2+ssize+start,data=cloth)
vif(model2)
      nown     nfull     npart      naux   hourspw      inv1      inv2     ssize     start
  1.031140  1.144159  1.166476  1.038341  1.520605  1.390207  1.180971  1.626014  1.124763
summary(model2)
Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + inv2 + ssize + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7391 -2.1082 0.4661 2.5180 17.4720
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Model 3:
model3=lm(margin~nown+nfull+npart+naux+hourspw+inv1+ssize+start,data=cloth)
summary(model3)
Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + ssize + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7463 -2.1432 0.4844 2.4914 17.3387
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.613e+01  1.613e+00  16.198  < 2e-16 ***
nown         5.814e-01  5.129e-01   1.134  0.25769
nfull        4.242e-01  2.798e-01   1.516  0.13035
npart        1.458e+00  4.406e-01   3.309  0.00103 **
naux        -1.105e+00  7.169e-01  -1.541  0.12408
hourspw      1.227e-01  3.901e-02   3.146  0.00179 **
inv1         3.819e-06  3.319e-06   1.151  0.25064
ssize        1.356e-03  3.257e-03   0.416  0.67733
start        1.833e-01  1.911e-02   9.591  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model 4:
model4=lm(margin~nown+nfull+npart+naux+hourspw+inv1+start,data=cloth)
summary(model4)
Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.7640 -2.0986 0.3987 2.5074 17.5196
Coefficients:
Estimate Std. Error t value Pr(>|t|)
Model 5:
model5=lm(margin~nfull+npart+naux+hourspw+inv1+start,data=cloth)
summary(model5)
Call:
lm(formula = margin ~ nfull + npart + naux + hourspw + inv1 +
start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.799 -2.170 0.479 2.566 17.310
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.678e+01  1.449e+00  18.483  < 2e-16 ***
nfull        4.237e-01  2.782e-01   1.523 0.128581
npart        1.540e+00  4.247e-01   3.626 0.000329 ***
naux        -1.095e+00  7.118e-01  -1.538 0.124926
hourspw      1.348e-01  3.366e-02   4.004 7.53e-05 ***
inv1         3.892e-06  3.263e-06   1.193 0.233788
start        1.809e-01  1.899e-02   9.524  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model 6:
model6=lm(margin~nfull+npart+naux+hourspw+start,data=cloth)
summary(model6)
Call:
lm(formula = margin ~ nfull + npart + naux + hourspw + start,
data = cloth)
Residuals:
Min 1Q Median 3Q Max
-14.8686 -2.0959 0.4684 2.6209 16.7507
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.62383    1.44390  18.439  < 2e-16 ***
nfull        0.52359    0.26541   1.973   0.0493 *
npart        1.66624    0.41158   4.048 6.29e-05 ***
naux        -1.18136    0.70849  -1.667   0.0963 .
hourspw      0.13976    0.03342   4.182 3.61e-05 ***
Model 7:
model7=lm(margin~nfull+npart+hourspw+start,data=cloth)
summary(model7)
Call:
lm(formula = margin ~ nfull + npart + hourspw + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-15.1194 -2.2302 0.3886 2.5829 17.2256
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.11266    1.12676  22.287  < 2e-16 ***
nfull        0.48380    0.26497   1.826   0.0687 .
npart        1.68312    0.41245   4.081 5.50e-05 ***
hourspw      0.13594    0.03342   4.068 5.81e-05 ***
start        0.18105    0.01905   9.505  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model 8:
model8=lm(margin~npart+hourspw+start,data=cloth)
summary(model8)
Call:
lm(formula = margin ~ npart + hourspw + start, data = cloth)
Residuals:
Min 1Q Median 3Q Max
-15.143 -2.237 0.417 2.492 18.011
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.85285    1.05461  24.514  < 2e-16 ***
npart        1.76478    0.41130   4.291 2.28e-05 ***
hourspw      0.13976    0.03346   4.177 3.69e-05 ***
start        0.18154    0.01911   9.502  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Appendix C
Plots:
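[Figure: Multicollinearity (scatter plot; axis labels not recoverable; axis ranges roughly 10 to 40 and 100 to 400)]
[Figure: Plot of residuals (histogram; vertical axis: Frequency, 0 to 150; horizontal axis: -20 to 20)]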
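A residual histogram of the kind listed in this appendix can be produced along the following lines. This is a sketch on simulated data with hypothetical names, not the code that generated the original plots; with the real data the fitted model object would take the place of fit.

```r
# Simulated stand-in data; the column names are hypothetical.
set.seed(5)
n <- 300
dat <- data.frame(x = rnorm(n))
dat$y <- 3 + 2 * dat$x + rnorm(n, sd = 4)

fit <- lm(y ~ x, data = dat)
res <- resid(fit)

# Histogram of residuals: a roughly symmetric, bell-shaped histogram centred
# on zero is what the normal-errors assumption leads us to expect.
h <- hist(res, breaks = 20, main = "Plot of residuals", xlab = "Residual")

# Least-squares residuals from a model with an intercept average to zero.
print(mean(res))

# Residuals against fitted values, to look for curvature or changing spread.
plot(fitted(fit), res, xlab = "Fitted value", ylab = "Residual")
abline(h = 0, lty = 2)
```

When run non-interactively, R writes the two plots to its default graphics device (a PDF file) rather than to the screen.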