Break-Out Session 2b
The purpose of this workshop is to use Excel to build linear regression models. The overall
goal is to build good explanatory and/or prediction models based on a chosen data set.
Cost Forecasting
Microsoft Excel has a built-in feature to perform a regression analysis. This feature is
available in the Analysis ToolPak add-in. First, check that the option Data Analysis is available
under the Data tab. If it is not there, select the File tab, then Options, and from the list on the
left-hand side select Add-Ins. Press the Go… button at the bottom (next to the Excel Add-ins
selection). In the ensuing dialog box, tick Analysis ToolPak (and Analysis ToolPak - VBA) and then click OK.
The data can be found in “Heating Cost.xlsx”. The spreadsheet contains heating cost data for
20 small houses in different geographical regions, together with details on local average
minimum external temperature, inches of insulation in the house, the age of the central
heating equipment and the number of windows.
The objective of the regression analysis in this workshop is to discover if these variables
explain the differences in the heating costs for the 20 houses, and hence if the approach
would be useful for predicting heating costs for other similar properties.
You can compute summary statistics for a data set by using appropriate functions, such as:
=AVERAGE(B4:B23), which yields the average heating cost for all the houses;
=MIN(B4:B23), which yields the minimum heating cost for all the houses;
=MAX(B4:B23), which yields the maximum heating cost for all the houses;
=STDEV.S(B4:B23), which yields the (sample) standard deviation of the heating costs; etc.
      A       B              C                     D                     E     F
 1    HEATING COST DATA
 2
 3    House   Heating Cost   Minimum Temperature   Insulation (inches)   Age   Windows
 4    1       250            35                    3                     6     10
 5    2       360            29                    4                     10    1
 6    3       165            36                    7                     3     9
 7    4       43             60                    6                     9     8
 8    5       92             65                    5                     6     8
 9    6       200            30                    5                     5     9
10    7       355            10                    6                     7     14
11    8       290            7                     10                    10    9
12    9       230            21                    9                     11    11
13    10      120            55                    2                     5     9
14    11      73             54                    12                    4     11
15    12      205            48                    5                     1     10
16    13      400            20                    5                     15    12
17    14      320            39                    4                     7     10
18    15      72             60                    8                     6     8
19    16      272            20                    5                     8     10
20    17      94             58                    7                     3     10
21    18      190            40                    8                     11    11
22    19      235            27                    9                     8     14
23    20      139            30                    7                     5     9
If you enter, for instance, =AVERAGE(B4:B23) in cell B24, the average heating cost will be
displayed in that cell. By selecting the cell and dragging its fill handle across the adjacent
cells (C24 to F24), the formula will automatically be copied, and will display the average
temperature, insulation, age and number of windows.
The resulting spreadsheet is shown below (you may need to reformat the cells and column
widths).
[Figure: the spreadsheet with the averages of Heating Cost, Minimum Temperature, Insulation (inches), Age and Windows added in row 24]
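As a cross-check outside Excel, the same summary statistics can be sketched in Python. This is only an illustration, assuming the 20 data rows are typed in by hand; the names DATA, NAMES, columns and summary are not part of the workbook. The function statistics.stdev computes the sample standard deviation, which is the quantity STDEV.S returns.

```python
from statistics import mean, stdev

# Heating cost data from rows 4 to 23 of the worksheet, one tuple per house:
# (heating cost, minimum temperature, insulation in inches, age, windows)
DATA = [
    (250, 35, 3, 6, 10), (360, 29, 4, 10, 1), (165, 36, 7, 3, 9),
    (43, 60, 6, 9, 8), (92, 65, 5, 6, 8), (200, 30, 5, 5, 9),
    (355, 10, 6, 7, 14), (290, 7, 10, 10, 9), (230, 21, 9, 11, 11),
    (120, 55, 2, 5, 9), (73, 54, 12, 4, 11), (205, 48, 5, 1, 10),
    (400, 20, 5, 15, 12), (320, 39, 4, 7, 10), (72, 60, 8, 6, 8),
    (272, 20, 5, 8, 10), (94, 58, 7, 3, 10), (190, 40, 8, 11, 11),
    (235, 27, 9, 8, 14), (139, 30, 7, 5, 9),
]
NAMES = ["Heating Cost", "Minimum Temperature", "Insulation", "Age", "Windows"]

# Transpose the rows so that each column name maps to its 20 values.
columns = dict(zip(NAMES, zip(*DATA)))

# One set of summary statistics per column, mirroring AVERAGE, MIN, MAX, STDEV.S.
summary = {
    name: {
        "average": mean(values),
        "min": min(values),
        "max": max(values),
        "stdev": stdev(values),  # sample standard deviation (n - 1 divisor)
    }
    for name, values in columns.items()
}

for name, stats in summary.items():
    print(f"{name:<20} avg={stats['average']:.2f} min={stats['min']} "
          f"max={stats['max']} stdev={stats['stdev']:.2f}")
```

The "average" entries correspond to the values that appear in row 24 after the formula has been copied across.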
3. Correlation Analysis
You can compute correlation statistics for a data set by using the following function:
=CORREL(B4:B23,C4:C23), which yields the correlation between the heating cost and the
minimum outside temperature. A correlation coefficient indicates the level of linear
association between a pair of variables. In this case, the correlation between the heating cost
and the minimum outside temperature is -0.81, implying a rather strong negative correlation
in the sense that if the outside temperature is low, the heating cost tends to be high, and vice versa.
Again, you can use an automated tool by selecting Data\Data Analysis\Correlation. The
following dialog box should appear:
In the Correlation dialog box, specify:
Input Range: B3:F23;
Grouped By: Columns, so that Excel knows that each column represents a variable;
Select Labels in First Row;
Select New Worksheet Ply with the name “Correlation Analysis”.
Notice the high correlation between heating cost and the minimum temperature (negative) and
the age of the heating installation (positive). Also notice the sometimes high correlations
between the independent (explanatory) variables (min temp, insulation, age, and windows)
themselves.
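The correlation matrix produced by the Correlation tool can be approximated with a short Python sketch. It assumes the data are typed in by hand and uses a hand-written Pearson correlation (the helper correl and the names DATA, NAMES and matrix are illustrative, not Excel objects).

```python
from math import sqrt

# (heating cost, minimum temperature, insulation in inches, age, windows)
DATA = [
    (250, 35, 3, 6, 10), (360, 29, 4, 10, 1), (165, 36, 7, 3, 9),
    (43, 60, 6, 9, 8), (92, 65, 5, 6, 8), (200, 30, 5, 5, 9),
    (355, 10, 6, 7, 14), (290, 7, 10, 10, 9), (230, 21, 9, 11, 11),
    (120, 55, 2, 5, 9), (73, 54, 12, 4, 11), (205, 48, 5, 1, 10),
    (400, 20, 5, 15, 12), (320, 39, 4, 7, 10), (72, 60, 8, 6, 8),
    (272, 20, 5, 8, 10), (94, 58, 7, 3, 10), (190, 40, 8, 11, 11),
    (235, 27, 9, 8, 14), (139, 30, 7, 5, 9),
]
NAMES = ["Cost", "Temp", "Insulation", "Age", "Windows"]


def correl(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


columns = list(zip(*DATA))  # one tuple of 20 values per variable
matrix = [[correl(a, b) for b in columns] for a in columns]

# Print the full symmetric matrix, one row per variable.
print(" " * 12 + "".join(f"{name:>12}" for name in NAMES))
for name, row in zip(NAMES, matrix):
    print(f"{name:<12}" + "".join(f"{value:12.2f}" for value in row))
```

The first row of the matrix holds the correlations of heating cost with each explanatory variable; the remaining off-diagonal entries are the correlations among the explanatory variables themselves.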
4. Scatter Plots
Scatter plots are of great help in identifying the strength, nature and direction of relationships
between variables. In particular, they can highlight non-linear relationships, which will not
necessarily be apparent from the correlation values. Since the observed correlation, -0.81,
between the heating cost and the minimum outside temperature suggests a strong (linear)
relationship, let us examine their scatter plot:
Select the Heating Cost and Minimum Temperature columns, cells B3:C23.
In the Insert tab, under Charts, select Scatter and choose the first graph option.
In order to have Temperature on the x-axis and Cost on the y-axis, go to the Design tab
under Chart Tools, choose Select Data and then click on Edit. Let the X values refer
to column C and the Y values to column B.
Under Chart Layouts, select the first option, which adds titles to the axes.
Rename the titles in the graph by clicking on them.
[Figure: scatter plot "Cost & Temperature", with Temperature on the x-axis (0 to 70) and Cost on the y-axis (0 to 450)]
The scatter plot confirms the rather strong, linear relationship between heating cost and
temperature, with heating cost declining as the temperature increases. Similar scatter plots can
be examined for other pairs of variables.
A regression analysis estimates the linear equation that ‘best fits’ a set of data, in the sense
that it minimises the residual scatter. Let us perform a regression analysis of heating cost as a
function of the temperature, i.e. heating cost = a + b (temperature) + e
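The least-squares estimates of a and b, and the quantities that Excel reports in its regression output, can be sketched in a few lines of Python. This is an illustrative calculation from the standard formulas, assuming the cost and temperature columns are typed in by hand; the variable names are not taken from Excel.

```python
# Heating cost (column B) and minimum temperature (column C) for the 20 houses.
COST = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
        73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
TEMP = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
        54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

n = len(COST)
mean_x = sum(TEMP) / n
mean_y = sum(COST) / n

# Least-squares slope b and intercept a of: heating cost = a + b (temperature) + e
sxx = sum((x - mean_x) ** 2 for x in TEMP)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(TEMP, COST))
b = sxy / sxx
a = mean_y - b * mean_x

# ANOVA decomposition: total = regression + residual sum of squares.
fitted = [a + b * x for x in TEMP]
ss_total = sum((y - mean_y) ** 2 for y in COST)
ss_resid = sum((y - f) ** 2 for y, f in zip(COST, fitted))
ss_regr = ss_total - ss_resid

r_square = ss_regr / ss_total
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)
std_error = (ss_resid / (n - 2)) ** 0.5      # residual standard error, 18 df
f_stat = (ss_regr / 1) / (ss_resid / (n - 2))

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print(f"R Square = {r_square:.2f}, Adjusted R Square = {adj_r_square:.2f}")
print(f"Standard Error = {std_error:.2f}, F = {f_stat:.2f}")
print(f"SS regression = {ss_regr:.0f}, residual = {ss_resid:.0f}, total = {ss_total:.0f}")
```

The printed values can be compared line by line with the Regression Statistics and ANOVA blocks of the Excel output.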
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.81
R Square             0.66
Adjusted R Square    0.64
Standard Error       63.55
Observations         20

ANOVA
             df    SS        MS        F        Significance F
Regression   1     140215    140215    34.72    0.00
Residual     18    72701     4039
Total        19    212916
RESIDUAL OUTPUT
The slope, -4.93, has a t-value of -5.89 (in absolute terms bigger than 2) and a very small
p-value (smaller than our significance level of 5%). The coefficient related to the temperature
variable is therefore significantly different from zero, which can also be seen from the 95%
confidence interval [-6.69; -3.17], which does not include zero. We may conclude that there is
a significant effect of temperature on heating cost.
The regression model is able to explain 64% of the variability in heating cost in terms of
differences between outside temperatures (Adjusted R²). The standard error of the forecasts is
63.55, implying that if we want to make a prediction with 95% confidence, we should
subtract and add 127.10 (= 2 × 63.55) to the prediction to obtain an approximate prediction
interval. For instance, for an outside temperature of 50, the point prediction is
388.80 - 4.93 × 50 = 142.30, so we predict the heating cost to lie in the region of
[142.30 - 127.10; 142.30 + 127.10] = [15.20; 269.40].
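The t-value, the confidence interval for the slope and the rule-of-thumb prediction interval can be reproduced with the sketch below. Two assumptions are made: the critical value T_CRIT = 2.101 (the two-sided 5% point of the t-distribution with 18 degrees of freedom) is hard-coded from a standard table rather than computed, and the interval around the prediction uses the handout's 2 × standard error rule. Because the code keeps unrounded coefficients, its prediction at a temperature of 50 is about 142.09 rather than the 142.30 obtained from the rounded slope -4.93.

```python
from math import sqrt

COST = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
        73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
TEMP = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
        54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
T_CRIT = 2.101  # assumption: tabulated t value for 18 df, 95% two-sided

n = len(COST)
mean_x = sum(TEMP) / n
mean_y = sum(COST) / n
sxx = sum((x - mean_x) ** 2 for x in TEMP)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(TEMP, COST))
b = sxy / sxx                 # slope
a = mean_y - b * mean_x       # intercept

# Residual standard error (the "Standard Error" of the regression output).
ss_resid = sum((y - (a + b * x)) ** 2 for x, y in zip(TEMP, COST))
std_error = sqrt(ss_resid / (n - 2))

# Standard error, t-value and 95% confidence interval of the slope.
se_slope = std_error / sqrt(sxx)
t_value = b / se_slope
ci_low = b - T_CRIT * se_slope
ci_high = b + T_CRIT * se_slope

# Point prediction at a temperature of 50 and the 2 x standard error band.
prediction = a + b * 50
band = 2 * std_error
low, high = prediction - band, prediction + band

print(f"slope = {b:.2f}, t = {t_value:.2f}, 95% CI = [{ci_low:.2f}; {ci_high:.2f}]")
print(f"prediction at 50 = {prediction:.2f}, interval = [{low:.2f}; {high:.2f}]")
```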
The Regression tool also displays several charts (you may have to move them to make them
visible):
The Line Fit Plot (see below) shows actual costs and predicted costs, plotted for
different values of temperature. This plot is identical to the scatter plot of cost and
temperature we constructed earlier, with the predicted points superimposed. The
regression line is shown as points rather than as a line. This can be changed by
clicking on the predicted points, and clicking on Format Selection\Line Color\Solid
Line (or by double clicking on the points and then selecting Line Color\Solid Line).
The Residual Plot shows the forecast errors versus temperature. If this plot exhibits an
obvious pattern, it would suggest that the model is ill-specified. Ideally, the residuals
should be random. Residual plots are also useful for spotting outliers, i.e. data points
much further from the regression line than others.
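The numbers behind the Residual Plot can be listed with a short Python sketch. It refits the simple model from the hand-typed data and prints each house's residual; the cut-off of 2 standard errors used to flag possible outliers is an illustrative assumption, not a rule taken from the Excel tool.

```python
COST = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
        73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
TEMP = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
        54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

n = len(COST)
mean_x = sum(TEMP) / n
mean_y = sum(COST) / n
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(TEMP, COST))
     / sum((x - mean_x) ** 2 for x in TEMP))
a = mean_y - b * mean_x

# Residual = actual cost minus predicted cost (the forecast error).
residuals = [y - (a + b * x) for x, y in zip(TEMP, COST)]
std_error = (sum(r * r for r in residuals) / (n - 2)) ** 0.5

# House (numbered from 1) with the largest absolute residual.
largest_house = max(range(n), key=lambda i: abs(residuals[i])) + 1

# Assumed screening rule: flag residuals beyond 2 standard errors.
flagged = [i + 1 for i, r in enumerate(residuals) if abs(r) > 2 * std_error]

for house, (x, r) in enumerate(zip(TEMP, residuals), start=1):
    print(f"house {house:2d}  temperature {x:2d}  residual {r:8.2f}")
print("largest absolute residual: house", largest_house)
print("flagged beyond 2 standard errors:", flagged)
```

Plotting the residuals against temperature, as Excel does, remains the better way to spot a pattern; the listing only makes the individual forecast errors explicit.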
6. Multiple Linear Regression Analysis
By adding extra independent (explanatory) variables in the regression model, we may be able
to improve our predictions of heating cost. However, including extra explanatory variables
may also cause problems such as multicollinearity (see footnote 1). We therefore have to find the best
possible regression model for the purpose of predicting heating costs using one or more
explanatory variables.
Let us perform a regression analysis of heating cost as a function of all the available
explanatory variables, i.e.
heating cost = a + b1 (temperature) + b2 (insulation) + b3 (age) + b4 (windows) + e
The Adjusted R² has increased from 64% to 75%, indicating that we are now able to explain
more of the variability in heating cost. The standard error of the predictions has also decreased,
from 63.55 to 52.72, enabling more accurate predictions. However, although the
coefficients related to the temperature and the insulation are found to be significantly different
from zero, the coefficients related to the age of the installation and the number of windows
are not. Therefore, these variables should be excluded from the model and the model
re-analysed. The residual plots and line fit plots also need to be examined.
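The comparison between the full model and a reduced model can be sketched in Python by solving the least-squares normal equations directly. This is a minimal illustration rather than what Excel does internally: the helpers solve and fit are hypothetical names, the data are typed in by hand, and the reduced model is assumed to keep only temperature and insulation, the two variables found to be significant.

```python
# (heating cost, minimum temperature, insulation in inches, age, windows)
DATA = [
    (250, 35, 3, 6, 10), (360, 29, 4, 10, 1), (165, 36, 7, 3, 9),
    (43, 60, 6, 9, 8), (92, 65, 5, 6, 8), (200, 30, 5, 5, 9),
    (355, 10, 6, 7, 14), (290, 7, 10, 10, 9), (230, 21, 9, 11, 11),
    (120, 55, 2, 5, 9), (73, 54, 12, 4, 11), (205, 48, 5, 1, 10),
    (400, 20, 5, 15, 12), (320, 39, 4, 7, 10), (72, 60, 8, 6, 8),
    (272, 20, 5, 8, 10), (94, 58, 7, 3, 10), (190, 40, 8, 11, 11),
    (235, 27, 9, 8, 14), (139, 30, 7, 5, 9),
]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit(predictors):
    """Least-squares fit of heating cost on the given column indices of DATA."""
    y = [row[0] for row in DATA]
    rows = [[1.0] + [row[c] for c in predictors] for row in DATA]  # 1.0 = intercept
    n, k = len(y), len(predictors)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k + 1)] for i in range(k + 1)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(k + 1)]
    coef = solve(xtx, xty)  # normal equations: (X'X) coef = X'y
    fitted = [sum(c * v for c, v in zip(coef, r)) for r in rows]
    mean_y = sum(y) / n
    ss_total = sum((v - mean_y) ** 2 for v in y)
    ss_resid = sum((v - f) ** 2 for v, f in zip(y, fitted))
    r_square = 1 - ss_resid / ss_total
    return {
        "coef": coef,
        "ss_resid": ss_resid,
        "r_square": r_square,
        "adj_r_square": 1 - (1 - r_square) * (n - 1) / (n - k - 1),
        "std_error": (ss_resid / (n - k - 1)) ** 0.5,
    }


full = fit([1, 2, 3, 4])   # temperature, insulation, age, windows
reduced = fit([1, 2])      # temperature and insulation only

for label, model in (("full", full), ("reduced", reduced)):
    print(label, [round(c, 2) for c in model["coef"]],
          f"adj R2 = {model['adj_r_square']:.2f}",
          f"std error = {model['std_error']:.2f}")
```

Comparing the Adjusted R² and the standard error of the two fits shows how much is lost by dropping age and windows; the coefficient tests themselves still have to be read from the Excel output.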
Footnote 1: Multicollinearity is an issue that can come up in multiple regression. It refers to the situation when some of the
independent (explanatory) variables are correlated and thus bring similar pieces of information into the regression
model. This correlation makes the regression model, in fact, less accurate. Hence, it makes sense to delete one
of these correlated variables from the model; by doing so, the significance of the other should improve. We
will discuss this in more detail in the next lecture.
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.90
R Square             0.80
Adjusted R Square    0.75
Standard Error       52.72
Observations         20

ANOVA
             df    SS        MS       F        Significance F
Regression   4     171227    42807    15.40    0.00
Residual     15    41689     2779
Total        19    212916
RESIDUAL OUTPUT
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.88
R Square             0.78
Adjusted R Square    0.75
Standard Error       52.98
Observations         20

ANOVA
             df    SS        MS       F        Significance F
Regression   2     165195    82597    29.42    0.00
Residual     17    47721     2807
Total        19    212916
RESIDUAL OUTPUT