
REPORT ON LINEAR REGRESSION MODEL
1. Interpret your model. Discuss if it makes sense. Remember also that it is not a good idea to depend
solely on p-values.
2. Also perform a full residual analysis. Comment on the residual histogram plotted.
3. Report prediction error with prediction.

B+

Submitted By:

Group 2

Swati Jindal - P16023
Rishabh Jain - P16021
Vamsi Manyam - P16032
Vikram Khanna - P16027
Subhajit Roy - P16047
Report On Linear Regression Model 2

Report on Regression model

Final Regression Model (Model 7)

Gross Margin=25.1126 + 0.4838*Nfull + 1.68312*Npart + 0.13594*hourspw +0.18105*start

Interpretation of the Model


Dependent Variable
i) Gross Margin: It is the difference between net sales and cost of goods sold and is a measure
of profitability of a business. In the given model, it is dependent upon the following
independent variables.
Independent Variable
i) Nfull: Average number of full-time workers per day. Holding all other variables
constant, one additional full-time worker is associated with an increase of 0.4838 units in
the margin. This coefficient has a p-value of 0.069 in Model 7, so it is significant only at
the 10% level.
ii) Npart: Average number of part-time workers per day. Holding all other variables
constant, one additional part-time worker is associated with an increase of 1.683 units in
the margin.
iii) Hourspw: Number of hours worked per worker (total hours worked divided by the total
number of workers). Holding all other variables constant, one additional hour per worker is
associated with an increase of 0.1359 units in the margin.
iv) Start: Years since the start of the business. Holding all other variables constant, one
additional year since the business started is associated with an increase of 0.181 units in
the margin.

Model Development Process


1/ Identifying the Outliers
The first step in building the regression model is to check for outliers and deal with
them. We started with box plots, which flag as outliers any values above
3rd Quartile + (1.5*Inter-quartile range) or below 1st Quartile - (1.5*Inter-quartile range).
However, for some variables the box plots flagged a large number of points, nearly 30
observations for one particular variable. Removing that many observations from a data set
of 400 observations did not seem to be a good idea. We therefore used the 99th percentile of
each variable as the cut-off: any value above the 99th percentile was treated as an outlier.
Accordingly, we removed a total of 24 outlying observations.
1st Model: All the variables
Margin ~ nown + nfull + npart + naux + hoursw + hourspw + inv1 + inv2 + ssize + start
Adjusted R-squared = 0.3259
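
The two outlier rules described in step 1 can be compared on a single variable. The following is a minimal sketch in R, not the code used for the report: the vector x is simulated (an assumption, since the clothing data set is not reproduced here), and the names x, upper_fence and p99 are illustrative only.

```r
# Simulated right-skewed variable standing in for one column of the data
# (assumption: these are NOT the clothing data).
set.seed(1)
x <- rexp(400, rate = 1 / 100)

# Box-plot rule: flag values outside Q1 - 1.5*IQR or Q3 + 1.5*IQR
q <- quantile(x, c(0.25, 0.75))
iqr <- q[2] - q[1]
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr
n_boxplot <- sum(x < lower_fence | x > upper_fence)

# Percentile rule: flag values above the 99th percentile
p99 <- quantile(x, 0.99)
n_p99 <- sum(x > p99)

cat("Flagged by box-plot rule:       ", n_boxplot, "\n")
cat("Flagged by 99th percentile rule:", n_p99, "\n")
```

On skewed data the box-plot rule tends to flag many more points than the percentile rule, which is the trade-off discussed in step 1.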

Indian Institute of Management Nagpur



2/ Computation of Variance Inflation Factor

After removing the outliers, the second step is to check for multicollinearity between the
variables. The correlation coefficient between hoursw and hourspw is 0.808, which is high.
Computing the VIF on the data with outliers removed, hoursw has a VIF of nearly 50 (49.88),
well above the threshold of 5; hourspw (32.12) and nfull (8.82) also exceed it. We therefore
removed hoursw from the model. After removing this variable, no other variable has a VIF
above 5 (the largest is 1.63), so multicollinearity is no longer an issue.
2nd model: hoursw removed (high VIF)
Margin ~ nown + nfull + npart + naux + hourspw + inv1 + inv2 + ssize + start
Adjusted R-squared = 0.3148
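
To make the VIF step concrete, the sketch below computes VIF by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. It is only an illustration under stated assumptions: the data are simulated, the variable workers and the helper vif_manual are hypothetical, and the report itself used vif() from the car package (Appendix B).

```r
# Simulated predictors (assumption: NOT the clothing data).
set.seed(2)
n <- 375
workers <- sample(3:12, n, replace = TRUE)       # hypothetical staff count
hourspw <- rnorm(n, mean = 20, sd = 5)           # hours per worker
hoursw  <- hourspw * workers + rnorm(n, sd = 2)  # total hours, nearly determined by the two above
X <- data.frame(workers, hourspw, hoursw)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2)
vif_manual <- function(X) {
  sapply(names(X), function(v) {
    fit <- lm(reformulate(setdiff(names(X), v), response = v), data = X)
    1 / (1 - summary(fit)$r.squared)
  })
}

vif_before <- vif_manual(X)
vif_after  <- vif_manual(X[, c("workers", "hourspw")])  # hoursw dropped

print(round(vif_before, 2))
print(round(vif_after, 2))
```

As in the report, dropping the variable that is nearly a combination of the others brings the remaining VIF values close to 1.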
3/ P-Value Computation
For each individual variable, the null hypothesis is that the coefficient of that variable
is zero.
H0: β = 0

For each individual variable, the alternative hypothesis is that the coefficient of that
variable is not zero.
HA: β ≠ 0
After removing the outliers and dropping hoursw on the basis of its VIF, the next step was
to assess the significance of the remaining variables. This was done stepwise using the
p-values of the respective variables. At a 95% confidence level (significance level alpha =
0.05), if the p-value is less than alpha we reject the null hypothesis, i.e. the variable is
significant; otherwise we fail to reject it. At each step we dropped the variable with the
highest p-value if that p-value was greater than 0.05. The test has to be repeated after
every removal, because removing one variable can change the p-values of the others.
In the 2nd model, inv2 has the highest p-value (0.745), which is higher than alpha (0.05).
Going by the hypothesis test, we removed this variable for the 3rd model.
3rd model: inv2 removed
Margin ~ nown + nfull + npart + naux + hourspw + inv1 + ssize + start
Adjusted R-squared = 0.3164
In this model ssize has the highest p-value (0.677), which is more than alpha. Going by the
hypothesis test, we removed this variable for the 4th model.
4th model: ssize removed
Margin ~ nown + nfull + npart + naux + hourspw + inv1 + start
Adjusted R-squared = 0.318


In this model nown has the highest p-value (0.238), which is more than alpha. Going by the
hypothesis test, we removed this variable for the 5th model.
5th model: nown removed
Margin ~ nfull + npart + naux + hourspw + inv1 + start
Adjusted R-squared = 0.3172
In this model inv1 has the highest p-value (0.234), which is more than alpha. Going by the
hypothesis test, we removed this variable for the 6th model.
6th model: inv1 removed
Margin ~ nfull + npart + naux + hourspw + start
Adjusted R-squared = 0.3165
In this model naux has the highest p-value (0.096), which is more than alpha. Going by the
hypothesis test, we removed this variable for the 7th model.
7th model: naux removed
Margin ~ nfull + npart + hourspw + start
Adjusted R-squared = 0.3132
In this model nfull has the highest p-value (0.069), which is more than alpha. Going by the
hypothesis test, we removed this variable for the 8th model.
8th model: nfull removed
Margin ~ npart + hourspw + start
Adjusted R-squared = 0.3089
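
The repeated "drop the variable with the highest p-value and refit" procedure of step 3 can be written as a loop. The sketch below is a hedged illustration on simulated data, not the authors' script (they refitted each model by hand, see Appendix B); the variables x1 to x4 and y are hypothetical, with x3 and x4 constructed as pure noise.

```r
# Simulated data (assumption): y depends on x1 and x2 only.
set.seed(3)
n <- 375
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
dat$y <- 25 + 2 * dat$x1 + 1.5 * dat$x2 + rnorm(n, sd = 4)

alpha <- 0.05
predictors <- c("x1", "x2", "x3", "x4")

repeat {
  fit <- lm(reformulate(predictors, response = "y"), data = dat)
  coefs <- summary(fit)$coefficients
  # p-values of the slopes (column 4 is Pr(>|t|); row 1 is the intercept)
  pvals <- setNames(coefs[-1, 4], rownames(coefs)[-1])
  worst <- names(which.max(pvals))
  # Stop when every remaining variable is significant, or only one is left
  if (pvals[worst] <= alpha || length(predictors) == 1) break
  cat("Dropping", worst, "with p-value", round(pvals[worst], 3), "\n")
  predictors <- setdiff(predictors, worst)
}

cat("Retained predictors:", predictors, "\n")
```

The model is refitted after every removal because, as noted above, dropping one variable changes the p-values of the others.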

Comparison of models:
The regression models can be compared using the adjusted R-squared value. A higher
adjusted R-squared means the residuals are less widely spread and the prediction error is
smaller. However, we cannot always select a model solely on this value. If two models
differ only minutely in adjusted R-squared but one of them requires many more variables,
it is better to go with the one that needs less data. The simpler model is preferable
because it is easier to calculate and to use for prediction.
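
As a check on the comparison, the adjusted R-squared values can be recomputed from the multiple R-squared values reported in Appendix B using adj R^2 = 1 - (1 - R^2)(n - 1)/(n - p - 1). This is a small sketch; n = 375 is inferred from the residual degrees of freedom in Appendix B (for example 370 plus 5 coefficients for Model 7), and the function name adj_r2 is illustrative.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 375  # inferred from the residual degrees of freedom in Appendix B
models <- data.frame(
  model = c(1, 5, 7, 8),
  r2    = c(0.3439, 0.3282, 0.3205, 0.3144),  # Multiple R-squared, Appendix B
  p     = c(10, 6, 4, 3)                      # number of predictors
)
models$adj_r2 <- round(adj_r2(models$r2, n, models$p), 4)
print(models)
```

The recomputed values agree with the adjusted R-squared figures quoted for Models 1, 5, 7 and 8.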

Conclusion:
After multiple iterations, among the models fitted after removing hoursw, the 4th model
(nown, nfull, npart, naux, hourspw, inv1 and start) has the highest adjusted R-squared,
0.318, followed by the 5th model (nfull, npart, naux, hourspw, inv1 and start) at 0.3172.
The regression model with nfull, npart, hourspw and start as variables has an adjusted
R-squared of 0.3132 (Model 7). Even though the adjusted R-squared of Model 7 is lower, the
reduction is very small and this model requires comparatively little data.


So, based on the adjusted R-squared value and the amount of data required (4 independent
variables), we selected Model 7 as the final regression model. Its adjusted R-squared of
0.3132 means that the estimated model explains about 31% of the sample variation in gross
margin, after adjusting for the number of predictors. Adjusted R-squared is a better basis
for comparison than R-squared because it penalises adding variables, whereas R-squared
never decreases when a variable is added.
A low R-squared means that it is difficult to predict the outcome (gross margin) from the
available data. There may be other variables, not in the data set, that could explain it.
F-test:
The F-test compares our regression model with the intercept-only model, i.e. the model
without any predictor variable. It gives the overall significance of the regression
equation. If the p-value is below the chosen significance level, the regression model with
our selected variables is significant. Otherwise we fail to reject the null hypothesis, and
there is no evidence that the model fits better than the intercept-only model. For the
F-test the null hypothesis is that all slope coefficients are zero, and the alternative
hypothesis is that at least one coefficient is non-zero.
For the Model 7 regression equation the p-value is less than 2.2e-16, which is below alpha
(0.05). So the null hypothesis is rejected and the regression equation is significant.
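
The overall F-statistic can be reproduced from the R-squared of Model 7 as F = (R^2 / p) / ((1 - R^2) / (n - p - 1)). The sketch below uses the figures reported in Appendix B (R-squared 0.3205, 4 predictors, 370 residual degrees of freedom); the variable names are illustrative.

```r
r2 <- 0.3205   # Multiple R-squared of Model 7 (Appendix B)
p  <- 4        # number of predictors
n  <- 375      # observations: 370 residual df + 5 coefficients

# F = explained variation per predictor / unexplained variation per residual df
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability of the F distribution with (p, n - p - 1) df
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

cat("F-statistic:", round(f_stat, 2), "on", p, "and", n - p - 1, "DF\n")
cat("p-value below 0.05:", p_value < 0.05, "\n")
```

The result matches the F-statistic of 43.63 on 4 and 370 DF shown in the Model 7 summary.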

Predicting the gross profit margin for the given data:

Prediction of the gross profit margin, with the corresponding prediction interval, for a
store with 2 owners, 5 full-time workers, 2 part-time workers and 1 auxiliary worker, a
total of 300 work hours resulting in 300/(2+5+2+1) = 30 work hours per worker, a $60000
investment in shop premises and a $10000 investment in automation, with a shop size of
500 sq. m. and a 90-year-old shop.
Using the final regression model:
Gross Margin = 25.1126 + 0.4838*Nfull + 1.68312*Npart + 0.13594*hourspw + 0.18105*start
Gross Margin = 25.1126 + 0.4838*5 + 1.68312*2 + 0.13594*30 + 0.18105*90
= 51.27054

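95% Prediction Interval = 51.27 ± 1.96*SQRT(1.176^2 + 4.29^2)
= 51.27 ± 1.96*SQRT(1.383 + 18.404)
= 51.27 ± 8.72
= (42.55, 59.99)
Here 4.29 is the residual standard error of Model 7 (Appendix B) and 1.176 is the standard
error of the fitted value.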
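
The arithmetic of the prediction and its interval can be reproduced directly from the Model 7 coefficients. This is a sketch of the calculation shown above, not of the original script; the names coefs, store, se_fit and sigma are illustrative, and the values 1.176 and 4.29 are taken from the report as given.

```r
# Model 7 coefficients (Appendix B)
coefs <- c(intercept = 25.1126, nfull = 0.4838, npart = 1.68312,
           hourspw = 0.13594, start = 0.18105)

# The store described above; hourspw = total hours / total workers
store <- c(intercept = 1, nfull = 5, npart = 2,
           hourspw = 300 / (2 + 5 + 2 + 1), start = 90)

pred <- sum(coefs * store[names(coefs)])

se_fit <- 1.176  # standard error of the fitted value, as used in the report
sigma  <- 4.29   # residual standard error of Model 7
half_width <- 1.96 * sqrt(se_fit^2 + sigma^2)
interval <- c(lower = pred - half_width, upper = pred + half_width)

print(round(c(fit = pred, interval), 2))
```

With the fitted model object available, predict(model7, newdata, interval = "prediction") gives the same kind of interval, using a t quantile in place of 1.96.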


Appendix A

Finding Quantiles to remove outliers:

quantile(cloths$nown,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.000000 1.295225 2.000000 3.000000 3.005000

quantile(cloths$nfull,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.9231 2.0664 4.3590 5.0000 6.0050

quantile(cloths$npart,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.2833 2.0000 3.0000 4.0000 5.0000

quantile(cloths$naux,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
1.3333 1.3673 2.0000 3.0000 4.0000

quantile(cloths$hoursw,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
80.00 145.25 250.00 313.05 359.13

quantile(cloths$inv1,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
20000.00 62269.23 292857.20 293178.63 350250.00

quantile(cloths$inv2,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
10000.00 22859.85 80000.00 200750.00 300500.00

quantile(cloths$ssize,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
80.00 190.00 360.50 454.46 600.50

quantile(cloths$start,c(.25,.75,.95,.99,.995))
25% 75% 95% 99% 99.5%
37.00 42.00 80.00 88.01 90.00

quantile(cloths$hourspw,c(0.25,0.5,0.75,0.90,.99))
25% 50% 75% 90% 99%
13.54120 17.74459 24.30298 28.91297 37.75375


Appendix B

Modelling the regression equation:

Model 1:

cloth=read.csv("C:\\Users\\Vamsi Manyam\\Desktop\\clothing.csv")
model1=lm(margin~nown+nfull+npart+naux+hoursw+hourspw+inv1+inv2+ssize+start,data=cloth)
summary(model1)

Call:
lm(formula = margin ~ nown + nfull + npart + naux + hoursw +
hourspw + inv1 + inv2 + ssize + start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-14.7050 -2.3052 0.5483 2.6236 15.5357

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.611e+01 4.087e+00 8.837 < 2e-16 ***
nown -7.548e-01 7.155e-01 -1.055 0.29220
nfull -1.482e+00 7.728e-01 -1.918 0.05591 .
npart -1.911e-01 7.587e-01 -0.252 0.80131
naux -2.765e+00 9.490e-01 -2.913 0.00380 **
hoursw 7.715e-02 2.910e-02 2.652 0.00836 **
hourspw -3.379e-01 1.782e-01 -1.896 0.05873 .
inv1 6.966e-06 3.623e-06 1.923 0.05530 .
inv2 -4.232e-06 9.636e-06 -0.439 0.66082
ssize -2.913e-04 3.298e-03 -0.088 0.92966
start 1.861e-01 1.906e-02 9.764 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.252 on 364 degrees of freedom


Multiple R-squared: 0.3439, Adjusted R-squared: 0.3259
F-statistic: 19.08 on 10 and 364 DF, p-value: < 2.2e-16

library(car)
vif(lm(margin~nown+nfull+npart+naux+hoursw+hourspw+inv1+inv2+ssize+start,data=cloth))
nown nfull npart naux hoursw hourspw inv1 inv2 ssize start
2.034203 8.822122 3.500458 1.844885 49.878834 32.115259 1.518805 1.183078 1.688677 1.127065

Model 2:
model2=lm(margin~nown+nfull+npart+naux+hourspw+inv1+inv2+ssize+start,data=cloth)
vif(model2)
nown nfull npart naux hourspw inv1 inv2 ssize start
1.031140 1.144159 1.166476 1.038341 1.520605 1.390207 1.180971 1.626014 1.124763

summary(model2)

Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + inv2 + ssize + start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-14.7391 -2.1082 0.4661 2.5180 17.4720

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.614e+01 1.616e+00 16.180 < 2e-16 ***
nown 5.775e-01 5.136e-01 1.124 0.26160
nfull 4.295e-01 2.806e-01 1.531 0.12669
npart 1.452e+00 4.416e-01 3.287 0.00111 **
naux -1.101e+00 7.178e-01 -1.534 0.12590
hourspw 1.233e-01 3.909e-02 3.154 0.00175 **
inv1 4.171e-06 3.495e-06 1.193 0.23351
inv2 -3.154e-06 9.707e-06 -0.325 0.74546
ssize 1.393e-03 3.263e-03 0.427 0.66964
start 1.838e-01 1.920e-02 9.575 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.287 on 365 degrees of freedom


Multiple R-squared: 0.3313, Adjusted R-squared: 0.3148
F-statistic: 20.09 on 9 and 365 DF, p-value: < 2.2e-16

Model 3:

model3=lm(margin~nown+nfull+npart+naux+hourspw+inv1+ssize+start,data=cloth)
summary(model3)

Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + ssize + start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-14.7463 -2.1432 0.4844 2.4914 17.3387

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.613e+01 1.613e+00 16.198 < 2e-16 ***
nown 5.814e-01 5.129e-01 1.134 0.25769
nfull 4.242e-01 2.798e-01 1.516 0.13035
npart 1.458e+00 4.406e-01 3.309 0.00103 **
naux -1.105e+00 7.169e-01 -1.541 0.12408
hourspw 1.227e-01 3.901e-02 3.146 0.00179 **
inv1 3.819e-06 3.319e-06 1.151 0.25064
ssize 1.356e-03 3.257e-03 0.416 0.67733
start 1.833e-01 1.911e-02 9.591 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.282 on 366 degrees of freedom


Multiple R-squared: 0.3311, Adjusted R-squared: 0.3164
F-statistic: 22.64 on 8 and 366 DF, p-value: < 2.2e-16

Model 4:

model4=lm(margin~nown+nfull+npart+naux+hourspw+inv1+start,data=cloth)
summary(model4)

Call:
lm(formula = margin ~ nown + nfull + npart + naux + hourspw +
inv1 + start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-14.7640 -2.0986 0.3987 2.5074 17.5196

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.601e+01 1.587e+00 16.394 < 2e-16 ***
nown 6.029e-01 5.097e-01 1.183 0.237641
nfull 4.353e-01 2.782e-01 1.565 0.118537
npart 1.505e+00 4.255e-01 3.536 0.000458 ***
naux -1.072e+00 7.117e-01 -1.506 0.132843
hourspw 1.308e-01 3.381e-02 3.870 0.000129 ***
inv1 4.061e-06 3.265e-06 1.244 0.214364
start 1.828e-01 1.905e-02 9.595 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.277 on 367 degrees of freedom


Multiple R-squared: 0.3307, Adjusted R-squared: 0.318
F-statistic: 25.91 on 7 and 367 DF, p-value: < 2.2e-16

Model 5:

model5=lm(margin~nfull+npart+naux+hourspw+inv1+start,data=cloth)
summary(model5)

Call:
lm(formula = margin ~ nfull + npart + naux + hourspw + inv1 +
start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-14.799 -2.170 0.479 2.566 17.310

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.678e+01 1.449e+00 18.483 < 2e-16 ***
nfull 4.237e-01 2.782e-01 1.523 0.128581
npart 1.540e+00 4.247e-01 3.626 0.000329 ***
naux -1.095e+00 7.118e-01 -1.538 0.124926
hourspw 1.348e-01 3.366e-02 4.004 7.53e-05 ***
inv1 3.892e-06 3.263e-06 1.193 0.233788
start 1.809e-01 1.899e-02 9.524 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.279 on 368 degrees of freedom


Multiple R-squared: 0.3282, Adjusted R-squared: 0.3172
F-statistic: 29.96 on 6 and 368 DF, p-value: < 2.2e-16

Model 6:

model6=lm(margin~nfull+npart+naux+hourspw+start,data=cloth)
summary(model6)

Call:
lm(formula = margin ~ nfull + npart + naux + hourspw + start,
data = cloth)

Residuals:
Min 1Q Median 3Q Max
-14.8686 -2.0959 0.4684 2.6209 16.7507

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.62383 1.44390 18.439 < 2e-16 ***
nfull 0.52359 0.26541 1.973 0.0493 *
npart 1.66624 0.41158 4.048 6.29e-05 ***
naux -1.18136 0.70849 -1.667 0.0963 .
hourspw 0.13976 0.03342 4.182 3.61e-05 ***
start 0.18062 0.01900 9.504 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.282 on 369 degrees of freedom


Multiple R-squared: 0.3256, Adjusted R-squared: 0.3165
F-statistic: 35.63 on 5 and 369 DF, p-value: < 2.2e-16

Model 7:

model7=lm(margin~nfull+npart+hourspw+start,data=cloth)
summary(model7)

Call:
lm(formula = margin ~ nfull + npart + hourspw + start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-15.1194 -2.2302 0.3886 2.5829 17.2256

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.11266 1.12676 22.287 < 2e-16 ***
nfull 0.48380 0.26497 1.826 0.0687 .
npart 1.68312 0.41245 4.081 5.50e-05 ***
hourspw 0.13594 0.03342 4.068 5.81e-05 ***
start 0.18105 0.01905 9.505 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.292 on 370 degrees of freedom


Multiple R-squared: 0.3205, Adjusted R-squared: 0.3132
F-statistic: 43.63 on 4 and 370 DF, p-value: < 2.2e-16

Model 8:

model8=lm(margin~npart+hourspw+start,data=cloth)
summary(model8)

Call:
lm(formula = margin ~ npart + hourspw + start, data = cloth)

Residuals:
Min 1Q Median 3Q Max
-15.143 -2.237 0.417 2.492 18.011

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 25.85285 1.05461 24.514 < 2e-16 ***
npart 1.76478 0.41130 4.291 2.28e-05 ***
hourspw 0.13976 0.03346 4.177 3.69e-05 ***
start 0.18154 0.01911 9.502 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.305 on 371 degrees of freedom


Multiple R-squared: 0.3144, Adjusted R-squared: 0.3089
F-statistic: 56.71 on 3 and 371 DF, p-value: < 2.2e-16


Appendix C

Plots:

Multicollinearity

Total hours vs hours per worker
[Scatter plot: x-axis, number of hours worked per worker; y-axis, total number of hours worked]

Box plots with outliers


Histogram showing the residual distribution for model-7:

[Histogram titled "Plot of residuals": x-axis, residuals of regression model (-20 to 20); y-axis, frequency (0 to 150)]
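
A residual histogram of this kind, together with the other usual residual checks, can be produced as sketched below. This is only an illustration under stated assumptions: the data are simulated and the model y ~ x is hypothetical, so the output says nothing about the residuals of Model 7 itself; with the real data, fit would be replaced by model7.

```r
# Simulated data and model (assumption: NOT the clothing data or Model 7)
set.seed(4)
n <- 375
dat <- data.frame(x = rnorm(n))
dat$y <- 25 + 2 * dat$x + rnorm(n, sd = 4)
fit <- lm(y ~ x, data = dat)
res <- residuals(fit)

# Histogram of residuals: should look roughly symmetric around zero
hist(res, breaks = 20, main = "Plot of residuals",
     xlab = "residuals of regression model")

# Normal Q-Q plot: points close to the line suggest approximate normality
qqnorm(res)
qqline(res)

# Residuals vs fitted values: look for curvature or a funnel shape
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Shapiro-Wilk test of normality of the residuals
sw <- shapiro.test(res)
print(sw)
```

When run non-interactively with Rscript, the plots are written to Rplots.pdf in the working directory.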
