
Kristin Fridgeirsdottir Data Analytics for Leaders

Break-Out Session 2b
The purpose of this workshop is to use Excel to build linear regression models. The overall
goal is to build good explanatory and/or prediction models based on a chosen data set.

Cost Forecasting

1. Before you start

Microsoft Excel has a built-in feature to perform a regression analysis. This feature is
available in the Analysis ToolPak add-in. First, check that the option Data Analysis is available
under the Data tab. If it is not there, select the File tab, then Options, and from the list on the
left-hand side select Add-Ins. Press the Go… button at the bottom (next to the Excel Add-ins
selection). In the ensuing dialog box, tick Analysis ToolPak and then click OK.

The data can be found in “Heating Cost.xlsx”. The spreadsheet contains heating cost data for
20 small houses in different geographical regions, together with details on local average
minimum external temperature, inches of insulation in the house, the age of the central
heating equipment and the number of windows.

The objective of the regression analysis in this workshop is to discover if these variables
explain the differences in the heating costs for the 20 houses, and hence if the approach
would be useful for predicting heating costs for other similar properties.

2. Summarising and describing data sets

You can compute summary statistics for a data set by using appropriate functions, such as:
• AVERAGE(B4:B23), which yields the average heating cost for all the houses;
• MIN(B4:B23), which yields the minimum heating cost for all the houses;
• MAX(B4:B23), which yields the maximum heating cost for all the houses;
• STDEV.S(B4:B23), which yields the standard deviation in heating costs; etc.

     A       B              C                     D                     E     F
1    HEATING COST DATA
2
3    House   Heating Cost   Minimum Temperature   Insulation (inches)   Age   Windows
4    1       250            35                    3                     6     10
5    2       360            29                    4                     10    1
6    3       165            36                    7                     3     9
7    4       43             60                    6                     9     8
8    5       92             65                    5                     6     8
9    6       200            30                    5                     5     9
10   7       355            10                    6                     7     14
11   8       290            7                     10                    10    9
12   9       230            21                    9                     11    11
13   10      120            55                    2                     5     9
14   11      73             54                    12                    4     11
15   12      205            48                    5                     1     10
16   13      400            20                    5                     15    12
17   14      320            39                    4                     7     10
18   15      72             60                    8                     6     8
19   16      272            20                    5                     8     10
20   17      94             58                    7                     3     10
21   18      190            40                    8                     11    11
22   19      235            27                    9                     8     14
23   20      139            30                    7                     5     9

If you enter, for instance, =AVERAGE(B4:B23) in cell B24, the average heating cost will be
displayed in that cell. By selecting the cell and dragging the fill handle to the adjacent cells,
the formula will be copied automatically and will display the average minimum temperature,
insulation, age and number of windows.

Alternatively, you can use Data\Data Analysis\Descriptive Statistics to get a variety of
summary statistics automatically. In the Descriptive Statistics dialog box, specify:
• Input Range: B3:F23 (by selecting the region with the mouse);
• Select Labels in First Row;
• Select New Worksheet Ply with the name “Descriptive Statistics”;
• Select Summary Statistics.

The resulting spreadsheet is shown below (you may need to reformat the cells and column
widths).

                     Heating Cost   Minimum Temperature   Insulation (inches)   Age     Windows

Mean                 205.25         37.2                  6.35                  7       9.65
Standard Error       23.67          3.89                  0.55                  0.75    0.60
Median               202.5          35.5                  6                     6.5     10
Mode                 #N/A           60                    5                     6       10
Standard Deviation   105.86         17.41                 2.48                  3.34    2.66
Sample Variance      11206.09       303.12                6.13                  11.16   7.08
Kurtosis             -0.97          -1.05                 0.05                  0.36    5.65
Skewness             0.21           0.02                  0.45                  0.48    -1.48
Range                357            58                    10                    14      13
Minimum              43             7                     2                     1       1
Maximum              400            65                    12                    15      14
Sum                  4105           744                   127                   140     193
Count                20             20                    20                    20      20
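
If you prefer to double-check these figures outside Excel, the short Python sketch below computes the same summary statistics with pandas. It is only a sketch: the file name and column headings come from the workshop data set, but the header=2 argument is an assumption about where the headings sit in the worksheet, so adjust it if your copy differs.

import pandas as pd

# Read the workshop data; header=2 assumes the column headings are in the third
# spreadsheet row, as in the layout shown above (adjust if necessary).
df = pd.read_excel("Heating Cost.xlsx", header=2)

# Equivalents of AVERAGE, MIN, MAX and STDEV.S for the heating cost column
print(df["Heating Cost"].mean())   # 205.25
print(df["Heating Cost"].min())    # 43
print(df["Heating Cost"].max())    # 400
print(df["Heating Cost"].std())    # sample standard deviation, about 105.86

# Equivalent of Data\Data Analysis\Descriptive Statistics
print(df.describe())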

3. Correlation Analysis

You can compute correlation statistics for a data set by using the following function:
CORREL(B4:B23,C4:C23), which yields the correlation between the heating cost and the
minimum outside temperature. A correlation coefficient indicates the level of linear
association between a pair of variables. In this case, the correlation between the heating cost
and the minimum outside temperature is –0.81, implying a rather strong negative correlation
in the sense that if the outside temperature is low, the heating cost is high and vice-versa.

Again, you can use an automated tool by selecting Data\Data Analysis\Correlation.

In the Correlation dialog box, specify:
• Input Range: B3:F23;
• Grouped By: Columns, so that Excel knows that each column represents a variable;
• Select Labels in First Row;
• Select New Worksheet Ply with the name “Correlation Analysis”.

A new spreadsheet with the following correlation matrix should appear.

                      Heating Cost   Minimum Temperature   Insulation (inches)   Age    Windows
Heating Cost          1.00
Minimum Temperature   -0.81          1.00
Insulation (inches)   -0.26          -0.10                 1.00
Age                   0.54           -0.49                 0.06                  1.00
Windows               0.10           -0.26                 0.31                  0.03   1.00

Notice the strong negative correlation between heating cost and the minimum temperature,
and the moderate positive correlation between heating cost and the age of the heating
installation. Also notice that the independent (explanatory) variables (minimum temperature,
insulation, age and windows) are sometimes noticeably correlated with each other, for
example minimum temperature and age (-0.49).
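
The same correlations can be reproduced in a couple of lines of Python; a minimal sketch, reusing the df data frame from the earlier sketch:

# Equivalent of CORREL for a single pair of variables (about -0.81)
print(df["Heating Cost"].corr(df["Minimum Temperature"]))

# Equivalent of Data\Data Analysis\Correlation: the full correlation matrix
cols = ["Heating Cost", "Minimum Temperature", "Insulation (inches)", "Age", "Windows"]
print(df[cols].corr().round(2))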

4. Scatter Plots

Scatter plots are of great help in identifying the strength, nature and direction of relationships
between variables. In particular, they can highlight non-linear relationships, which will not
necessarily be apparent from the correlation values. Since the observed correlation, -0.81,
between the heating cost and the minimum outside temperature suggests a strong (linear)
relationship, let us examine their scatter plot:
• Select the Heating Cost and Minimum Temperature columns, cells B3:C23.
• On the Insert tab, under Charts, select Scatter and choose the first chart option.
• To put Temperature on the x-axis and Cost on the y-axis, go to the Design tab under Chart Tools, choose Select Data, then click Edit. Let the X values refer to column C and the Y values to column B.
• Under Chart Layouts, select the first option, which adds titles to the axes.
• Rename the titles in the chart by clicking on them.

[Scatter plot: “Cost & Temperature” – Heating Cost (y-axis) against Minimum Temperature (x-axis)]

The scatter plot confirms the rather strong, linear relationship between heating cost and
temperature, with heating cost declining as the temperature increases. Similar scatter plots can
be examined for other pairs of variables.
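
The same chart can also be drawn outside Excel; a minimal matplotlib sketch, again reusing df from the earlier sketches:

import matplotlib.pyplot as plt

# Temperature on the x-axis and heating cost on the y-axis, as in the Excel chart
plt.scatter(df["Minimum Temperature"], df["Heating Cost"])
plt.title("Cost & Temperature")
plt.xlabel("Temperature")
plt.ylabel("Cost")
plt.show()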

5. Simple Linear Regression Analysis

A regression analysis estimates the linear equation that ‘best fits’ a set of data, in the sense
that it minimises the residual scatter. Let us perform a regression analysis of heating cost as a
function of the temperature, i.e. heating cost = a + b × (temperature) + e

• Select Data\Data Analysis\Regression;
• Specify Input Y Range as B3:B23, this is the dependent variable;
• Specify Input X Range as C3:C23, this is the independent (explanatory) variable;
• Select Labels;
• Select New Worksheet Ply, with the name “Regression 1”;
• Under the heading Residuals, select all four options (Residuals, Standardized Residuals, Residual Plots and Line Fit Plots).

The results consist of different sections:
• Summary output, containing summary statistics for the regression as a whole, of which Adjusted R Square (R²) and Standard Error (the standard deviation of the residuals) are the most important;
• ANOVA (Analysis of Variance), which can be ignored when performing a regression analysis;
• a table with the actual regression model;
• Residual Output, containing the predicted values from the regression model for each of the observations in the data set (how are they calculated?), the prediction errors (residuals, which tell us how far the predicted value is from the observation), and the standardized prediction errors (residuals standardized by their standard deviation);
• a Residual Plot;
• a Line Fit Plot.

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.81
R Square 0.66
Adjusted R Square 0.64
Standard Error 63.55
Observations 20

ANOVA
df SS MS F Significance F
Regression 1 140215 140215 34.72 0.00
Residual 18 72701 4039
Total 19 212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             388.80         34.24            11.35    0.00      316.86      460.74
Minimum Temperature   -4.93          0.84             -5.89    0.00      -6.69       -3.17

RESIDUAL OUTPUT

Observation Predicted Heating Cost Residuals Standard Residuals


1 216.11 33.89 0.55
2 245.71 114.29 1.85
3 211.17 -46.17 -0.75
4 92.75 -49.75 -0.80
5 68.08 23.92 0.39
6 240.78 -40.78 -0.66
7 339.46 15.54 0.25
8 354.26 -64.26 -1.04
9 285.18 -55.18 -0.89
10 117.42 2.58 0.04
11 122.36 -49.36 -0.80
12 151.96 53.04 0.86
13 290.12 109.88 1.78
14 196.37 123.63 2.00
15 92.75 -20.75 -0.34
16 290.12 -18.12 -0.29
17 102.62 -8.62 -0.14
18 191.43 -1.43 -0.02
19 255.58 -20.58 -0.33
20 240.78 -101.78 -1.65

The suggested regression equation is:

heating cost = 388.80 – 4.93 × (temperature) + e

The slope, -4.93, has a t-value of -5.89 (bigger than 2 in absolute terms) and a very small p-
value (smaller than our 5% significance level). The coefficient related to the temperature
variable is therefore significantly different from zero, which can also be seen from the
confidence interval [-6.69; -3.17] which does not include zero. We may conclude that there is
a significant effect of temperature on heating cost.

The regression model is able to explain 64% of the variability in heating cost in terms of
differences between outside temperatures (Adjusted R²). The standard error of the regression
is 63.55, implying that if we want to make a prediction with roughly 95% confidence, we
should subtract and add 127.10 (= 2 × 63.55) to the prediction to obtain an approximate
prediction interval. For instance, for an outside temperature of 50, the predicted heating cost
is 388.80 – 4.93 × 50 = 142.30, so the heating cost should be in the region of
[142.30 – 127.10; 142.30 + 127.10] = [15.20; 269.40].
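
The regression and the rough prediction interval can be checked with statsmodels; a minimal sketch, reusing df from the earlier sketches:

import statsmodels.api as sm

# heating cost = a + b x (temperature) + e
X = sm.add_constant(df["Minimum Temperature"])
model = sm.OLS(df["Heating Cost"], X).fit()
print(model.summary())   # coefficients, t-stats, p-values, adjusted R-squared

# Approximate 95% interval for an outside temperature of 50:
# prediction +/- 2 x standard error of the regression
se = model.mse_resid ** 0.5                                               # about 63.55
pred = model.params["const"] + model.params["Minimum Temperature"] * 50   # about 142.30
print(pred - 2 * se, pred + 2 * se)                                       # roughly [15, 269]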

The Regression tool also displays several charts (you may have to move them to make them visible):
• The Line Fit Plot shows actual costs and predicted costs, plotted for different values of temperature. This plot is identical to the scatter plot of cost and temperature we constructed earlier, with the predicted points superimposed. The regression line is shown as points rather than as a line. This can be changed by clicking on the predicted points and clicking on Format Selection\Line Color\Solid Line (or by double clicking on the points and then selecting Line Color\Solid Line).
• The Residual Plot shows the forecast errors versus temperature. If this plot exhibits an obvious pattern, it would suggest that the model is ill-specified. Ideally, the residuals should be random. Residual plots are also useful for spotting outliers – data points much further from the regression line than others. (Both plots are also sketched in Python after this list.)
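
Both diagnostic charts can be reproduced with matplotlib as well; a minimal sketch, reusing df, the plt import and the fitted model from the previous sketches:

# Residual Plot: forecast errors versus temperature; look for patterns and outliers
plt.scatter(df["Minimum Temperature"], model.resid)
plt.axhline(0, color="grey")
plt.xlabel("Temperature")
plt.ylabel("Residual")
plt.show()

# Line Fit Plot: actual and predicted costs against temperature
plt.scatter(df["Minimum Temperature"], df["Heating Cost"], label="Actual")
plt.scatter(df["Minimum Temperature"], model.fittedvalues, label="Predicted")
plt.xlabel("Temperature")
plt.ylabel("Cost")
plt.legend()
plt.show()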

6. Multiple Linear Regression Analysis

By adding extra independent (explanatory) variables in the regression model, we may be able
to improve our predictions of heating cost. However, including extra explanatory variables
may also cause problems such as multicollinearity¹. We therefore have to find the best
possible regression model for the purpose of predicting heating costs using one or more
explanatory variables.

Let us perform a regression analysis of heating cost as a function of all the available
explanatory variables, i.e.

heating cost = a + b × (temperature) + c × (insulation) + d × (age) + f × (windows) + e

• Select Data\Data Analysis\Regression;
• Specify Input Y Range as B3:B23, this is the dependent variable;
• Specify Input X Range as C3:F23, these are the explanatory variables (they should be in adjacent columns; you may need to move columns);
• Select Labels;
• Select New Worksheet Ply, with the name “Regression 2”;
• Under the heading Residuals, select all four options (Residuals, Standardized Residuals, Residual Plots and Line Fit Plots).

The suggested regression equation is (see the regression results below):

heating cost = 424.74 – 4.57 × (temperature) – 14.91 × (insulation) + 6.13 × (age) + 0.24 × (windows) + e

The Adjusted R² has increased from 64% to 75%, indicating that we are now able to explain
more of the variability in heating cost. The standard error of the predictions has also decreased
from 63.55 to 52.72, enabling more accurate predictions. However, although the coefficients
related to temperature and insulation are found to be significantly different from zero, the
coefficients related to the age of the installation and the number of windows are not. These
variables should therefore be excluded from the model and the model re-analysed. The
residual plots and line fit plots also need to be examined.

The modified regression equation is (see regression results):

heating cost = 490.29 – 5.15 × (temperature) – 14.72 × (insulation) + e

¹ Multicollinearity is an issue that can come up in multiple regression. It refers to the situation where some of the
independent (explanatory) variables are correlated and thus bring similar information into the regression model.
This correlation makes the individual coefficient estimates less precise. Hence, it makes sense to delete one of the
correlated variables from the model; by doing so, the significance of the other should improve. We will discuss
this in more detail in the next lecture.
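
The full and reduced models can be fitted in the same way; a minimal statsmodels sketch, reusing df from the earlier sketches:

import statsmodels.api as sm

# Full model: temperature, insulation, age and windows
X_full = sm.add_constant(df[["Minimum Temperature", "Insulation (inches)", "Age", "Windows"]])
full = sm.OLS(df["Heating Cost"], X_full).fit()
print(full.rsquared_adj)        # about 0.75
print(full.pvalues.round(3))    # Age and Windows are not significant at the 5% level

# Reduced model: drop the insignificant variables and re-fit
X_reduced = sm.add_constant(df[["Minimum Temperature", "Insulation (inches)"]])
reduced = sm.OLS(df["Heating Cost"], X_reduced).fit()
print(reduced.params.round(2))  # roughly 490.29, -5.15 and -14.72
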
SUMMARY OUTPUT

Regression Statistics
Multiple R 0.90
R Square 0.80
Adjusted R Square 0.75
Standard Error 52.72
Observations 20

ANOVA
df SS MS F Significance F
Regression 4 171227 42807 15.40 0.00
Residual 15 41689 2779
Total 19 212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             424.74         79.23            5.36     0.00      255.86      593.61
Minimum Temperature   -4.57          0.83             -5.53    0.00      -6.33       -2.81
Insulation (inches)   -14.91         5.14             -2.90    0.01      -25.86      -3.95
Age                   6.13           4.17             1.47     0.16      -2.77       15.02
Windows               0.24           4.95             0.05     0.96      -10.31      10.80

RESIDUAL OUTPUT

Observation Predicted Heating Cost Residuals Standard Residuals


1 259.20 -9.20 -0.20
2 294.04 65.96 1.41
3 176.38 -11.38 -0.24
4 118.08 -75.08 -1.60
5 91.75 0.25 0.01
6 245.88 -45.88 -0.98
7 335.88 19.12 0.41
8 307.13 -17.13 -0.37
9 264.65 -34.65 -0.74
10 176.30 -56.30 -1.20
11 26.18 46.82 1.00
12 139.32 65.68 1.40
13 353.59 46.41 0.99
14 232.13 87.87 1.88
15 69.89 2.11 0.05
16 310.22 -38.22 -0.82
17 76.05 17.95 0.38
18 192.69 -2.69 -0.06
19 219.57 15.43 0.33
20 216.07 -77.07 -1.65

SUMMARY OUTPUT

Regression Statistics
Multiple R 0.88
R Square 0.78
Adjusted R Square 0.75
Standard Error 52.98
Observations 20

ANOVA
df SS MS F Significance F
Regression 2 165195 82597 29.42 0.00
Residual 17 47721 2807
Total 19 212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             490.29         44.41            11.04    0.00      396.59      583.98
Minimum Temperature   -5.15          0.70             -7.34    0.00      -6.63       -3.67
Insulation (inches)   -14.72         4.93             -2.98    0.01      -25.13      -4.31

RESIDUAL OUTPUT

Observation Predicted Heating Cost Residuals Standard Residuals


1 265.89 -15.89 -0.32
2 282.07 77.93 1.56
3 201.86 -36.86 -0.74
4 92.98 -49.98 -1.00
5 81.95 10.05 0.20
6 262.20 -62.20 -1.24
7 350.48 4.52 0.09
8 307.06 -17.06 -0.34
9 249.68 -19.68 -0.39
10 177.61 -57.61 -1.15
11 35.57 37.43 0.75
12 169.50 35.50 0.71
13 313.70 86.30 1.72
14 230.57 89.43 1.78
15 63.55 8.45 0.17
16 313.70 -41.70 -0.83
17 88.57 5.43 0.11
18 166.55 23.45 0.47
19 218.78 16.22 0.32
20 232.76 -93.76 -1.87
