Break-Out Session 2b

Kristin Fridgeirsdottir Data Analytics for Leaders
The purpose of this workshop is to use Excel to build linear regression models. The overall
goal is to build good explanatory and/or prediction models based on a chosen data set.
Cost Forecasting
1. Before you start
Microsoft Excel has a built-in feature to perform a regression analysis. This feature is
available in the Analysis ToolPak. First, check that the option Data Analysis is available
under the Data tab. If it is not there, select the File tab, then Options, and from the list on the
left-hand side select Add-Ins. Press the Go… button at the bottom (next to the Excel Add-ins
selection). In the ensuing dialog box, tick the Analysis ToolPak check box and then click OK.
The data can be found in “Heating Cost.xlsx”. The spreadsheet contains heating cost data for
20 small houses in different geographical regions, together with details on local average
minimum external temperature, inches of insulation in the house, the age of the central
heating equipment and the number of windows.
The objective of the regression analysis in this workshop is to discover if these variables
explain the differences in the heating costs for the 20 houses, and hence if the approach
would be useful for predicting heating costs for other similar properties.
2. Summarising and describing data sets
You can compute summary statistics for a data set by using appropriate functions, such as:
• =AVERAGE(B4:B23), which yields the average heating cost for all the houses;
• =MIN(B4:B23), which yields the minimum heating cost for all the houses;
• =MAX(B4:B23), which yields the maximum heating cost for all the houses;
• =STDEV.S(B4:B23), which yields the sample standard deviation of the heating costs; etc.
     A       B              C                     D                     E     F
 1   HEATING COST DATA
 2
 3   House   Heating Cost   Minimum Temperature   Insulation (inches)   Age   Windows
 4   1       250            35                    3                     6     10
 5   2       360            29                    4                     10    1
 6   3       165            36                    7                     3     9
 7   4       43             60                    6                     9     8
 8   5       92             65                    5                     6     8
 9   6       200            30                    5                     5     9
10   7       355            10                    6                     7     14
11   8       290            7                     10                    10    9
12   9       230            21                    9                     11    11
13   10      120            55                    2                     5     9
14   11      73             54                    12                    4     11
15   12      205            48                    5                     1     10
16   13      400            20                    5                     15    12
17   14      320            39                    4                     7     10
18   15      72             60                    8                     6     8
19   16      272            20                    5                     8     10
20   17      94             58                    7                     3     10
21   18      190            40                    8                     11    11
22   19      235            27                    9                     8     14
23   20      139            30                    7                     5     9
If you enter, for instance, =AVERAGE(B4:B23) in cell B24, the average heating cost will be
displayed in that cell. By selecting the cell and dragging its fill handle across the adjacent
cells C24 to F24, the formula will automatically be copied, and will display the average
temperature, insulation, age and number of windows.
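The column summaries described above can also be cross-checked outside Excel. The following is a minimal sketch in Python using only the standard-library statistics module; the dictionary name columns and the choice to type the data in by hand are assumptions made for illustration and are not part of the workshop material.

```python
import statistics

# Data typed in from the worksheet (rows 4 to 23, columns B to F).
columns = {
    "Heating Cost": [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                     73, 205, 400, 320, 72, 272, 94, 190, 235, 139],
    "Minimum Temperature": [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
                            54, 48, 20, 39, 60, 20, 58, 40, 27, 30],
    "Insulation (inches)": [3, 4, 7, 6, 5, 5, 6, 10, 9, 2,
                            12, 5, 5, 4, 8, 5, 7, 8, 9, 7],
    "Age": [6, 10, 3, 9, 6, 5, 7, 10, 11, 5,
            4, 1, 15, 7, 6, 8, 3, 11, 8, 5],
    "Windows": [10, 1, 9, 8, 8, 9, 14, 9, 11, 9,
                11, 10, 12, 10, 8, 10, 10, 11, 14, 9],
}

for name, values in columns.items():
    # Counterparts of AVERAGE, MIN, MAX and STDEV.S for one column.
    print(
        f"{name:20s} mean={statistics.mean(values):7.2f} "
        f"min={min(values):3d} max={max(values):3d} "
        f"stdev={statistics.stdev(values):7.2f}"
    )
```

Looping over the columns plays the same role as dragging the formula from column B across to column F.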
Alternatively, you can use Data\Data Analysis\Descriptive Statistics to get a variety of
summary statistics automatically. In the Descriptive Statistics dialog box, specify:
• Input Range: B3:F23 (by selecting the region with the mouse);
• Select Labels in First Row;
• Select New Worksheet Ply with the name “Descriptive Statistics”;
• Select Summary Statistics.
The resulting spreadsheet is shown below (you may need to reformat the cells and column
widths).
Statistic            Heating Cost   Minimum Temperature   Insulation (inches)   Age     Windows
Mean                 205.25         37.2                  6.35                  7       9.65
Standard Error       23.67          3.89                  0.55                  0.75    0.60
Median               202.5          35.5                  6                     6.5     10
Mode                 #N/A           60                    5                     6       10
Standard Deviation   105.86         17.41                 2.48                  3.34    2.66
Sample Variance      11206.09       303.12                6.13                  11.16   7.08
Kurtosis             -0.97          -1.05                 0.05                  0.36    5.65
Skewness             0.21           0.02                  0.45                  0.48    -1.48
Range                357            58                    10                    14      13
Minimum              43             7                     2                     1       1
Maximum              400            65                    12                    15      14
Sum                  4105           744                   127                   140     193
Count                20             20                    20                    20      20
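To see where several of these figures come from, the sketch below recomputes part of the Heating Cost column of the summary in Python. It is an illustration only: the names heating_cost and summary are assumptions, and mode, kurtosis and skewness are left out to keep the example short.

```python
import math
import statistics

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]

n = len(heating_cost)
std_dev = statistics.stdev(heating_cost)  # sample standard deviation (n - 1)

summary = {
    "Mean": statistics.mean(heating_cost),
    "Standard Error": std_dev / math.sqrt(n),  # standard error of the mean
    "Median": statistics.median(heating_cost),
    "Standard Deviation": std_dev,
    "Sample Variance": statistics.variance(heating_cost),
    "Range": max(heating_cost) - min(heating_cost),
    "Minimum": min(heating_cost),
    "Maximum": max(heating_cost),
    "Sum": sum(heating_cost),
    "Count": n,
}

for label, value in summary.items():
    print(f"{label:20s}{value:12.2f}")
```

The Standard Error reported here is the standard error of the mean (standard deviation divided by the square root of the count), which is a different quantity from the Standard Error shown later in the regression output.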
3. Correlation Analysis
You can compute correlation statistics for a data set by using the following function:
=CORREL(B4:B23,C4:C23), which yields the correlation between the heating cost and the
minimum outside temperature. A correlation coefficient indicates the strength and direction
of the linear association between a pair of variables. In this case, the correlation between
the heating cost and the minimum outside temperature is –0.81, implying a rather strong
negative correlation in the sense that if the outside temperature is low, the heating cost
tends to be high, and vice versa.
Again, you can use an automated tool by selecting Data\Data Analysis\Correlation, which
opens the Correlation dialog box.
In the Correlation dialog box, specify:
• Input Range: B3:F23;
• Grouped By: Columns, so that Excel knows that each column represents a variable;
• Select Labels in First Row;
• Select New Worksheet Ply with the name “Correlation Analysis”.
A new spreadsheet with the following correlation matrix should appear.
                      Heating Cost   Minimum Temperature   Insulation (inches)   Age     Windows
Heating Cost          1.00
Minimum Temperature   -0.81          1.00
Insulation (inches)   -0.26          -0.10                 1.00
Age                   0.54           -0.49                 0.06                  1.00
Windows               0.10           -0.26                 0.31                  0.03    1.00
Notice the strong negative correlation between heating cost and the minimum temperature
(–0.81) and the moderate positive correlation between heating cost and the age of the
heating installation (0.54). Also notice that the independent (explanatory) variables
(min temp, insulation, age, and windows) are correlated among themselves, most visibly
minimum temperature and age (–0.49).
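The correlation matrix can be reproduced from the raw data. The sketch below is one possible illustration in plain Python: the helper name pearson and the dictionary data are assumptions made for the example, and the function implements the usual Pearson formula that CORREL also computes.

```python
import math

data = {
    "Heating Cost": [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                     73, 205, 400, 320, 72, 272, 94, 190, 235, 139],
    "Minimum Temperature": [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
                            54, 48, 20, 39, 60, 20, 58, 40, 27, 30],
    "Insulation (inches)": [3, 4, 7, 6, 5, 5, 6, 10, 9, 2,
                            12, 5, 5, 4, 8, 5, 7, 8, 9, 7],
    "Age": [6, 10, 3, 9, 6, 5, 7, 10, 11, 5,
            4, 1, 15, 7, 6, 8, 3, 11, 8, 5],
    "Windows": [10, 1, 9, 8, 8, 9, 14, 9, 11, 9,
                11, 10, 12, 10, 8, 10, 10, 11, 14, 9],
}


def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


names = list(data)
# Lower-triangular matrix, the same layout as the Excel output.
matrix = {
    (row, col): pearson(data[row], data[col])
    for i, row in enumerate(names)
    for col in names[: i + 1]
}

for i, row in enumerate(names):
    cells = " ".join(f"{matrix[(row, col)]:6.2f}" for col in names[: i + 1])
    print(f"{row:20s} {cells}")
```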
4. Scatter Plots
Scatter plots are of great help in identifying the strength, nature and direction of relationships
between variables. In particular, they can highlight non-linear relationships, which will not
necessarily be apparent from the correlation values. Since the observed correlation, -0.81,
between the heating cost and the minimum outside temperature suggests a strong (linear)
relationship, let us examine their scatter plot:
• Select the Heating Cost and Minimum Temperature columns, cells B3:C23.
• In the Insert tab, under Charts, select Scatter and choose the first graph option.
• In order to have Temperature on the x-axis and Cost on the y-axis, go to the Design tab
under Chart Tools, choose Select Data and then click on Edit. Let the X values refer
to column C and the Y values to column B.
• Under Chart Layouts select the first option, which gives you titles on the axes.
• Rename the titles in the graph by clicking on them.
[Figure: scatter plot titled “Cost & Temperature”, with Cost on the y-axis (0 to 450) and Temperature on the x-axis (0 to 70).]
The scatter plot confirms the rather strong, linear relationship between heating cost and
temperature, with heating cost declining as the temperature increases. Similar scatter plots can
be examined for other pairs of variables.
5. Simple Linear Regression Analysis
A regression analysis estimates the linear equation that ‘best fits’ a set of data, in the sense
that it minimises the sum of the squared residuals. Let us perform a regression analysis of
heating cost as a function of the temperature, i.e. heating cost = a + b × (temperature) + e,
where e is the error term:
• Select Data\Data Analysis\Regression;
• Specify Input Y Range as B3:B23, this is the dependent variable;
• Specify Input X Range as C3:C23, this is the independent (explanatory) variable;
• Select Labels;
• Select New Worksheet Ply, with the name “Regression 1”;
• Under the heading Residuals, select all four options (Residuals, Standardized
Residuals, Residual Plots and Line Fit Plots).
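The fit that the Regression tool performs for a single explanatory variable can be written out in a few lines. The sketch below is an illustration in plain Python using the standard closed-form least-squares formulas; the variable names are assumptions for the example and are not taken from Excel.

```python
heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

n = len(heating_cost)
mean_x = sum(temperature) / n
mean_y = sum(heating_cost) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in temperature)
syy = sum((y - mean_y) ** 2 for y in heating_cost)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(temperature, heating_cost))

slope = sxy / sxx                    # b in: heating cost = a + b * temperature + e
intercept = mean_y - slope * mean_x  # a
r_square = sxy ** 2 / (sxx * syy)    # share of the cost variability explained

print(f"intercept = {intercept:.2f}")
print(f"slope     = {slope:.2f}")
print(f"R Square  = {r_square:.2f}")
```

With one explanatory variable, R Square is simply the square of the correlation coefficient between the two columns.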
The results consist of different sections:
• Summary output, containing summary statistics for the regression as a whole, of
which Adjusted R Square (adjusted R²) and Standard Error (the standard deviation of the
residuals) are the most important;
• ANOVA (Analysis of Variance), which can be ignored for the purposes of this workshop;
• a table with the actual regression model (the estimated coefficients);
• Residual Output, containing the predicted values from the regression model for each
of the observations in the data set (how are they calculated?), the prediction errors
(residuals, which tell us how far each observation is from its predicted value), and the
standardized prediction errors (residuals, standardized by their standard deviation);
• a Residual Plot;
• a Line Fit Plot.
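As a hint for the question raised in the Residual Output item, the sketch below shows one way the three residual columns can be computed in plain Python. It assumes that the Standard Residuals column divides each residual by the sample standard deviation of the residuals (n - 1 in the denominator); this assumption reproduces the values printed in the handout, but it is not documented in the workshop text.

```python
import math

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

n = len(heating_cost)
mean_x = sum(temperature) / n
mean_y = sum(heating_cost) / n
sxx = sum((x - mean_x) ** 2 for x in temperature)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(temperature, heating_cost))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predicted value: plug each observed temperature into the fitted equation.
predicted = [intercept + slope * x for x in temperature]
# Residual: observed cost minus predicted cost.
residuals = [y - p for y, p in zip(heating_cost, predicted)]

# Assumption: standardise by the sample standard deviation of the residuals.
# The residuals have mean zero, so no centring is needed.
residual_sd = math.sqrt(sum(r ** 2 for r in residuals) / (n - 1))
standard_residuals = [r / residual_sd for r in residuals]

rows = zip(predicted, residuals, standard_residuals)
for i, (p, r, z) in enumerate(rows, start=1):
    print(f"{i:2d} {p:8.2f} {r:8.2f} {z:6.2f}")
```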
SUMMARY OUTPUT

Regression Statistics
Multiple R           0.81
R Square             0.66
Adjusted R Square    0.64
Standard Error       63.55
Observations         20

ANOVA
             df   SS       MS       F       Significance F
Regression   1    140215   140215   34.72   0.00
Residual     18   72701    4039
Total        19   212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             388.80         34.24            11.35    0.00      316.86      460.74
Minimum Temperature   -4.93          0.84             -5.89    0.00      -6.69       -3.17
RESIDUAL OUTPUT

Observation   Predicted Heating Cost   Residuals   Standard Residuals
1             216.11                   33.89       0.55
2             245.71                   114.29      1.85
3             211.17                   -46.17      -0.75
4             92.75                    -49.75      -0.80
5             68.08                    23.92       0.39
6             240.78                   -40.78      -0.66
7             339.46                   15.54       0.25
8             354.26                   -64.26      -1.04
9             285.18                   -55.18      -0.89
10            117.42                   2.58        0.04
11            122.36                   -49.36      -0.80
12            151.96                   53.04       0.86
13            290.12                   109.88      1.78
14            196.37                   123.63      2.00
15            92.75                    -20.75      -0.34
16            290.12                   -18.12      -0.29
17            102.62                   -8.62       -0.14
18            191.43                   -1.43       -0.02
19            255.58                   -20.58      -0.33
20            240.78                   -101.78     -1.65
The suggested regression equation is:
heating cost = 388.80 – 4.93 × (temperature) + e
The slope, –4.93, has a t-value of –5.89 (in absolute terms bigger than 2) and a very small
p-value (smaller than our significance level of 5%). The coefficient related to the temperature
variable is therefore significantly different from zero, which can also be seen from the 95%
confidence interval [–6.69; –3.17], which does not include zero. We may conclude that there is
a significant effect of temperature on heating cost.
The regression model is able to explain 64% of the variability in heating cost in terms of
differences between outside temperatures (Adjusted R²). The standard error of the forecasts is
63.55, implying that if we want to make a prediction with approximately 95% confidence, we
should subtract and add 127.10 (= 2 × 63.55) to the prediction to obtain a rough prediction
interval. For instance, for an outside temperature of 50, the point prediction is
388.80 – 4.93 × 50 = 142.30, so we predict the heating costs to be in the region of
[142.30 – 127.10; 142.30 + 127.10] = [15.20; 269.40].
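The significance check and the rough interval above can be retraced numerically. The sketch below is an illustration in plain Python and rests on two labelled assumptions: the critical value T_CRITICAL = 2.101 is taken from a standard t table (two-sided 5%, 18 degrees of freedom) rather than computed, and the interval uses the same "plus or minus two standard errors" rule of thumb as the text. Because the sketch keeps unrounded coefficients, its prediction at 50 degrees is close to, but not exactly, the 142.30 obtained from the rounded equation.

```python
import math

heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]

n = len(heating_cost)
mean_x = sum(temperature) / n
mean_y = sum(heating_cost) / n
sxx = sum((x - mean_x) ** 2 for x in temperature)
sxy = sum((x - mean_x) * (y - mean_y)
          for x, y in zip(temperature, heating_cost))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Standard error of the regression: residual sum of squares over n - 2.
ss_residual = sum((y - (intercept + slope * x)) ** 2
                  for x, y in zip(temperature, heating_cost))
standard_error = math.sqrt(ss_residual / (n - 2))

# Standard error, t statistic and 95% confidence interval of the slope.
slope_se = standard_error / math.sqrt(sxx)
t_stat = slope / slope_se
T_CRITICAL = 2.101  # assumption: t table value, two-sided 5%, 18 d.f.
slope_ci = (slope - T_CRITICAL * slope_se, slope + T_CRITICAL * slope_se)

# Rule-of-thumb interval for a house at 50 degrees: prediction +/- 2 s.e.
prediction = intercept + slope * 50
interval = (prediction - 2 * standard_error, prediction + 2 * standard_error)

print(f"t Stat = {t_stat:.2f}, 95% CI = [{slope_ci[0]:.2f}; {slope_ci[1]:.2f}]")
print(f"prediction = {prediction:.2f}, "
      f"interval = [{interval[0]:.2f}; {interval[1]:.2f}]")
```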
The Regression tool also displays several charts (you may have to move them to make them
visible):
• The Line Fit Plot (see below) shows actual costs and predicted costs, plotted for
different values of temperature. This plot is identical to the scatter plot of cost and
temperature we constructed earlier, with the predicted points superimposed. The
regression line is shown as points rather than as a line. This can be changed by
clicking on the predicted points, and clicking on Format Selection\Line Color\Solid
Line (or by double clicking on the points and then selecting Line Color\Solid Line).
• The Residual Plot shows the forecast errors versus temperature. If this plot exhibits an
obvious pattern, it would suggest that the model is ill-specified. Ideally, the residuals
should be random. Residual plots are also useful for spotting outliers, that is, data points
much further from the regression line than others.
6. Multiple Linear Regression Analysis
By adding extra independent (explanatory) variables in the regression model, we may be able
to improve our predictions of heating cost. However, including extra explanatory variables
may also cause problems such as multicollinearity (see footnote 1). We therefore have to find the best
possible regression model for the purpose of predicting heating costs using one or more
explanatory variables.
Let us perform a regression analysis of heating cost as a function of all the available
explanatory variables, i.e.
heating cost = a + b × (temperature) + c × (insulation) + d × (age) + f × (windows) + e
where e is again the error term.
• Select Data\Data Analysis\Regression;
• Specify Input Y Range as B3:B23, this is the dependent variable;
• Specify Input X Range as C3:F23, these are the explanatory variables (they should be
in adjacent columns, so you may need to move columns);
• Select Labels;
• Select New Worksheet Ply, with the name “Regression 2”;
• Under the heading Residuals, select all four options (Residuals, Standardized
Residuals, Residual Plots and Line Fit Plots).
The suggested regression equation is (see the regression results below):
heating cost = 424.74 – 4.57 × (temperature) – 14.91 × (insulation)
+ 6.13 × (age) + 0.24 × (windows) + e
The Adjusted R² has increased from 64% to 75%, indicating that we are now able to explain
more of the variability in heating cost. Also, the standard error of the predictions has
decreased from 63.55 to 52.72, enabling more accurate predictions. However, although the
coefficients related to the temperature and the insulation are found to be significantly
different from zero, the coefficients related to the age of the installation (p-value 0.16)
and the number of windows (p-value 0.96) are not. Therefore, these variables should be
excluded from the model and the model re-analysed. The residual plots and line fit plots
also need to be examined.
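The two quantities compared in this paragraph can be derived from the sums of squares in the ANOVA section of each regression output. The sketch below is a minimal illustration in Python; the function name fit_summary and its parameter names are assumptions made for the example, and the inputs are the rounded Residual and Total sums of squares reported for the simple and the four-variable model.

```python
import math


def fit_summary(ss_residual, ss_total, n_obs, n_predictors):
    """Return (adjusted R Square, standard error) from ANOVA sums of squares."""
    df_residual = n_obs - n_predictors - 1
    df_total = n_obs - 1
    # Adjusted R Square compares residual and total variance per degree of freedom.
    adjusted_r2 = 1 - (ss_residual / df_residual) / (ss_total / df_total)
    # Standard Error is the square root of the residual mean square.
    standard_error = math.sqrt(ss_residual / df_residual)
    return adjusted_r2, standard_error


simple = fit_summary(ss_residual=72701, ss_total=212916,
                     n_obs=20, n_predictors=1)
multiple = fit_summary(ss_residual=41689, ss_total=212916,
                       n_obs=20, n_predictors=4)

for label, (adj_r2, se) in (("simple", simple), ("multiple", multiple)):
    print(f"{label:9s} Adjusted R Square = {adj_r2:.2f}, Standard Error = {se:.2f}")
```

Unlike R Square, the adjusted version charges a penalty for every extra explanatory variable, which is why it is the preferred figure when comparing models with different numbers of variables.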
The modified regression equation is (see regression results):
heating cost = 490.29 – 5.15 × (temperature) – 14.72 × (insulation) + e
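Both the four-variable and the two-variable equations can be reproduced outside Excel. The sketch below is one possible illustration in plain Python: it builds the normal equations of least squares and solves them by Gaussian elimination. The helper names solve and least_squares are assumptions for the example, and this is not a description of how Excel computes its results internally.

```python
heating_cost = [250, 360, 165, 43, 92, 200, 355, 290, 230, 120,
                73, 205, 400, 320, 72, 272, 94, 190, 235, 139]
temperature = [35, 29, 36, 60, 65, 30, 10, 7, 21, 55,
               54, 48, 20, 39, 60, 20, 58, 40, 27, 30]
insulation = [3, 4, 7, 6, 5, 5, 6, 10, 9, 2,
              12, 5, 5, 4, 8, 5, 7, 8, 9, 7]
age = [6, 10, 3, 9, 6, 5, 7, 10, 11, 5,
       4, 1, 15, 7, 6, 8, 3, 11, 8, 5]
windows = [10, 1, 9, 8, 8, 9, 14, 9, 11, 9,
           11, 10, 12, 10, 8, 10, 10, 11, 14, 9]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        known = sum(aug[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (aug[row][n] - known) / aug[row][row]
    return x


def least_squares(y, *predictors):
    """Return [intercept, coefficient 1, ...] minimising squared residuals."""
    rows = [[1.0, *values] for values in zip(*predictors)]  # constant + predictors
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * target for r, target in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)


full_model = least_squares(heating_cost, temperature, insulation, age, windows)
reduced_model = least_squares(heating_cost, temperature, insulation)

print("full model:   ", [round(c, 2) for c in full_model])
print("reduced model:", [round(c, 2) for c in reduced_model])
```

Note that dropping age and windows changes the remaining coefficients as well, because the model is re-estimated rather than truncated.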
Footnote 1: Multicollinearity is an issue that can come up in multiple regression. It refers to the situation in which some of the
independent (explanatory) variables are correlated with each other and thus bring similar pieces of information into the regression
model. This correlation in fact makes the regression model less accurate. Hence, it makes sense to delete one
of these correlated variables from the model; by doing so, the significance of the other should improve. We
will discuss this in more detail in the next lecture.
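SUMMARY OUTPUT

Regression Statistics
Multiple R           0.90
R Square             0.80
Adjusted R Square    0.75
Standard Error       52.72
Observations         20

ANOVA
             df   SS       MS      F       Significance F
Regression   4    171227   42807   15.40   0.00
Residual     15   41689    2779
Total        19   212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             424.74         79.23            5.36     0.00      255.86      593.61
Minimum Temperature   -4.57
Insulation (inches)   -14.91         5.14             -2.90    0.01      -25.86      -3.95
Age                   6.13           4.17             1.47     0.16      -2.77       15.02
Windows               0.24           4.95             0.05     0.96      -10.31      10.80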
RESIDUAL OUTPUT

Observation   Predicted Heating Cost   Residuals   Standard Residuals
1             259.20                   -9.20       -0.20
2             294.04                   65.96       1.41
3             176.38                   -11.38      -0.24
4             118.08                   -75.08      -1.60
5             91.75                    0.25        0.01
6             245.88                   -45.88      -0.98
7             335.88                   19.12       0.41
8             307.13                   -17.13      -0.37
9             264.65                   -34.65      -0.74
10            176.30                   -56.30      -1.20
11            26.18                    46.82       1.00
12            139.32                   65.68       1.40
13            353.59                   46.41       0.99
14            232.13                   87.87       1.88
15            69.89                    2.11        0.05
16            310.22                   -38.22      -0.82
17            76.05                    17.95       0.38
18            192.69                   -2.69       -0.06
19            219.57                   15.43       0.33
20            216.07                   -77.07      -1.65
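SUMMARY OUTPUT

Regression Statistics
Multiple R           0.88
R Square             0.78
Observations         20

ANOVA
             df   SS       MS      F       Significance F
Regression   2    165195   82597   29.42   0.00
Residual     17   47721    2807
Total        19   212916

                      Coefficients   Standard Error   t Stat   P-value   Lower 95%   Upper 95%
Intercept             490.29         44.41            11.04    0.00      396.59      583.98
Minimum Temperature   -5.15
Insulation (inches)   -14.72         4.93             -2.98    0.01      -25.13      -4.31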
RESIDUAL OUTPUT

Observation   Predicted Heating Cost   Residuals   Standard Residuals
1             265.89                   -15.89      -0.32
2             282.07                   77.93       1.56
3             201.86                   -36.86      -0.74
4             92.98                    -49.98      -1.00
5             81.95                    10.05       0.20
6             262.20                   -62.20      -1.24
7             350.48                   4.52        0.09
8             307.06                   -17.06      -0.34
9             249.68                   -19.68      -0.39
10            177.61                   -57.61      -1.15
11            35.57                    37.43       0.75
12            169.50                   35.50       0.71
13            313.70                   86.30       1.72
14            230.57                   89.43       1.78
15            63.55                    8.45        0.17
16            313.70                   -41.70      -0.83
17            88.57                    5.43        0.11
18            166.55                   23.45       0.47
19            218.78                   16.22       0.32
20            232.76                   -93.76      -1.87