
Guide for Windows Excel 2010 Regression Modelling with Analysis ToolPak

James W. Taylor
The purpose of this guide is to explore linear regression using Excel. This note consists of the following sections:

    Summarising and describing a multi-variable data set
    Correlation analysis
    Scatter plots
    Simple regression
    Multiple regression
    Excel's regression functions
    Exercises

First, check that Excel's statistical add-in, Data Analysis, is attached to Excel. From the set of tabs at the top of your screen, click on the Data tab. If Data Analysis is attached, it will be available as an option in the Analysis group towards the top right of your screen.

If Data Analysis is not one of the options, you need to attach it by working through the following steps:
(i) Click on the File tab, near the top left of the screen, and then select Options.
(ii) Click Add-Ins (on the left of the screen), and then in the Manage box (at the bottom of the screen), select Excel Add-ins. If this is not one of the options in the dialog box, you need to install the add-ins from your Microsoft Excel installation disc.
(iii) Click Go.
(iv) In the Add-Ins box (shown on the right here), select three add-ins: Analysis ToolPak, Analysis ToolPak VBA, and Solver. Then click OK. If you get prompted that the Analysis ToolPak Add-in is not currently installed on your computer, click Yes to install it.
(v) After you load these add-ins, the Solver and Analysis ToolPak commands are available in the Analysis group on the Data tab.

1. SUMMARISING & DESCRIBING A MULTI-VARIABLE DATA SET

The Excel file ElectricityConsumption.xls contains monthly observations from January 2004 to July 2012 for the following variables:

ELEC   Residential electricity sales (kWh) per customer in a mid-Atlantic U.S. city
C66    Cooling degree hours at base temperature 66 degrees (a measure of summer heat) [1]
C76    Cooling degree hours at base temperature 76 degrees (a measure of summer heat)
H55    Heating degree hours at base temperature 55 degrees (a measure of winter cold) [2]
DINC   Disposable income per household ($)
AIRC   Proportion of households with air conditioning

The ultimate aim is to build a forecasting model for residential electricity consumption.
      A        B        C      D      E       F       G
 1    MONTH    ELEC     C66    C76    H55     DINC    AIRC
 2    Jan-04   681.7    20     0      10148   34825   0.698
 3    Feb-04   620.3    0      0      12504   34934   0.701
 4    Mar-04   590.8    20     0      9300    35050   0.705
 5    Apr-04   538.0    14     0      5333    35172   0.708
 6    May-04   513.4    559    3      2846    35302   0.712
 7    Jun-04   575.5    1601   83     282     35438   0.716
 8    Jul-04   1019.3   5348   833    1       35583   0.72
 9    Aug-04   1203.9   7416   1547   0       35734   0.724
10    Sep-04   1176.7   6887   1287   0       35892   0.728
11    Oct-04   723.0    2975   398    155     36056   0.731
12    Nov-04   519.0    427    5      1812    36222   0.735
13    Dec-04   604.9    9      0      5779    36391   0.739

Use the Analysis ToolPak Descriptive Statistics tool to get summary statistics for all 6 variables in one sequence of operations:
(i) From the main Excel menu, click on the Data tab.
(ii) From the Analysis group, select Data Analysis.
(iii) In the resulting dialog box, select Descriptive Statistics.

In the Descriptive Statistics dialog box, specify:
(i) Input Range as the range containing values and variable names: B1:G104.
(ii) Click the Labels in First Row checkbox.
(iii) Output options as New Worksheet Ply with the name Summary.
(iv) Click the Summary Statistics checkbox.
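As a rough cross-check outside Excel, most of the figures reported by the Descriptive Statistics tool can be reproduced with Python's standard library. The sketch below uses only the twelve 2004 observations of ELEC listed in the data extract, so its numbers will not match a Summary worksheet built from all 103 months. The dictionary name and layout are illustrative choices, not part of the guide.

```python
import statistics

# ELEC values for Jan-04 to Dec-04, copied from the data extract.
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]

summary = {
    "Mean": statistics.mean(elec),
    "Median": statistics.median(elec),
    "Standard Deviation": statistics.stdev(elec),  # sample (n - 1) version
    "Sample Variance": statistics.variance(elec),
    "Minimum": min(elec),
    "Maximum": max(elec),
    "Sum": sum(elec),
    "Count": len(elec),
}
summary["Range"] = summary["Maximum"] - summary["Minimum"]
# Standard error of the mean: standard deviation divided by sqrt(n).
summary["Standard Error"] = summary["Standard Deviation"] / len(elec) ** 0.5

for name, value in summary.items():
    print(f"{name:20s} {value:12.3f}")
```

Repeating the calculation for the other five columns gives the equivalent of the one-pass output that Excel writes to the Summary worksheet.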

[1] The cooling degree hours at base temperature T is:

$$\sum_{i \ge 1} i \, n_i$$

where n_i is the number of hours in the month at temperature T+i.

[2] The heating degree hours at base temperature T is:

$$\sum_{i \ge 1} i \, n_i$$

where n_i is the number of hours in the month at temperature T-i.
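To make the two footnote definitions concrete, here is a small Python sketch of both calculations. It assumes hourly temperatures recorded as whole degrees; the function names and the sample readings are hypothetical and are not taken from the data set.

```python
from collections import Counter

def cooling_degree_hours(hourly_temps, base):
    """Sum of i * n_i, where n_i counts the hours at temperature base + i."""
    counts = Counter(t - base for t in hourly_temps if t > base)
    return sum(i * n_i for i, n_i in counts.items())

def heating_degree_hours(hourly_temps, base):
    """Sum of i * n_i, where n_i counts the hours at temperature base - i."""
    counts = Counter(base - t for t in hourly_temps if t < base)
    return sum(i * n_i for i, n_i in counts.items())

# Hypothetical readings for six hours (whole degrees).
temps = [64, 66, 70, 78, 78, 50]

c66 = cooling_degree_hours(temps, 66)  # 4 + 12 + 12
c76 = cooling_degree_hours(temps, 76)  # 2 + 2
h55 = heating_degree_hours(temps, 55)  # 5
print(c66, c76, h55)
```

Hours at or below the base temperature contribute nothing to the cooling measure, which is why C76 is always smaller than or equal to C66 for the same month.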

2. CORRELATION ANALYSIS

(i) Return to the Data worksheet.
(ii) From the main Excel menu, click on the Data tab.
(iii) From the Analysis group, select Data Analysis.
(iv) In the resulting dialog box, select Correlation.

In the Correlation dialog box, specify:
(i) Input Range: as B1:G104 (don't include the MONTH column).
(ii) Grouped By: as Columns, so that Excel knows that each column is a variable.
(iii) The Labels in First Row checkbox should be ticked.
(iv) Output options: as New Worksheet Ply with the name Correlations.
(v) Click OK.

The correlation matrix below should result. Correlation coefficients for pairs of variables indicate the levels of linear association between them, e.g. ELEC and C76 have a correlation of 0.94, so that as C76 rises, ELEC tends to rise.
        ELEC    C66     C76     H55     DINC    AIRC
ELEC    1.00    0.92    0.94   -0.36    0.14    0.14
C66     0.92    1.00    0.95   -0.65    0.02    0.02
C76     0.94    0.95    1.00   -0.52    0.01    0.01
H55    -0.36   -0.65   -0.52    1.00   -0.04   -0.05
DINC    0.14    0.02    0.01   -0.04    1.00    0.94
AIRC    0.14    0.02    0.01   -0.05    0.94    1.00

You should get the same value using the Excel function =CORREL, e.g. =CORREL(B2:B104,D2:D104) for ELEC and C76. Note any variables strongly correlated with ELEC, and any strong inter-correlations between the potential explanatory variables C66, C76, H55, DINC and AIRC.
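For readers who want to see what the correlation coefficient actually computes, the following Python sketch implements the Pearson formula and builds a small correlation matrix. It uses three of the variables and only the twelve 2004 observations from Section 1, so the values will differ from the 103-month matrix; the function name correl simply mirrors the Excel function for readability.

```python
from math import sqrt

def correl(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Jan-04 to Dec-04 values copied from the data extract in Section 1.
data = {
    "ELEC": [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
             1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9],
    "C76": [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0],
    "H55": [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779],
}

names = list(data)
matrix = {a: {b: correl(data[a], data[b]) for b in names} for a in names}

# Print the matrix in the same layout as Excel's Correlation output.
print(" " * 6 + "".join(f"{b:>8s}" for b in names))
for a in names:
    print(f"{a:6s}" + "".join(f"{matrix[a][b]:8.2f}" for b in names))
```

The matrix is symmetric with ones on the diagonal, which is why only one triangle needs to be inspected.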

3. SCATTER PLOTS

Scatter plots are of great help in identifying the strength, nature and direction of relationships between pairs of variables. In particular, they can highlight non-linear relationships, which will not necessarily be apparent from the correlation values. Since the observed correlation, 0.94, between ELEC and C76 suggests a relationship, let's examine their scatter plot.

(i) Return to the Data worksheet.
(ii) Copy the ELEC column of data to column K.
(iii) Copy C76 to column J (Excel plots the left-hand column on the horizontal axis).
(iv) Highlight the new C76 and ELEC columns (columns J and K), as shown in the screen dump below.

From the main Excel menu, click on the Insert tab. From the Charts group, select Scatter with no lines as highlighted above.

Dealing with charts is somewhat cumbersome in Excel 2010. A simple way to insert axis titles and chart titles is to use Excel's text box option, which is also highlighted in the screen dump above. After a little work, the chart can look something like this. The scatter plot confirms the strong, positive linear relationship, with ELEC increasing as C76 increases.
[Figure: scatter plot of ELEC (vertical axis, 0 to 1600) against C76 (horizontal axis, 0 to 2000).]

4. SIMPLE REGRESSION

Regression analysis produces the estimated linear equation that best fits a set of data. By best fitting we mean the line (or linear model) for which there is least residual scatter, i.e. the smallest sum of squared residuals.

(i) Return to the Data worksheet.
(ii) From the main Excel menu, click on the Data tab.
(iii) From the Analysis group, select Data Analysis.
(iv) In the resulting dialog box, select Regression.

Complete the Regression dialog box by specifying:
(i) Input Y Range as B1:B104, i.e. ELEC as dependent variable.
(ii) Input X Range as D1:D104, i.e. C76 as independent variable.
(iii) Check the Labels box, as the first entries in each cell range are labels.
(iv) Specify Output options as New Worksheet Ply, with the name Regression1.
(v) Under the heading Residuals, select Residuals, Residual Plots & Line Fit Plots.
(vi) Then click OK.
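The Regression tool fits the least-squares line. A minimal Python sketch of the same calculation follows, using the standard closed-form formulas for the slope and intercept of a one-variable regression. It is fitted to the twelve 2004 observations from Section 1 only, so its coefficients will differ from those Excel reports for all 103 months; the function name fit_line is illustrative.

```python
def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Jan-04 to Dec-04 values copied from the data extract in Section 1.
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]

intercept, slope = fit_line(c76, elec)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(c76, elec)]

print(f"ELEC = {intercept:.2f} + {slope:.3f} * C76")
print(f"Sum of squared residuals: {sum(r * r for r in residuals):.1f}")
```

Any other choice of intercept and slope would give a larger sum of squared residuals on these twelve points, which is the sense in which the fitted line has the least residual scatter.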

4.1 REGRESSION ANALYSIS - INTERPRETING NUMERICAL OUTPUT


SUMMARY OUTPUT

Regression Statistics
Multiple R           0.936601141
R Square             0.877221698
Adjusted R Square    0.876006071
Standard Error       84.01563552
Observations         103

ANOVA
              df     SS             MS             F              Significance F
Regression    1      5093652.918    5093652.918    721.6209201    8.45675E-48
Residual      101    712921.3281    7058.627011
Total         102    5806574.246

              Coefficients   Standard Error   t Stat         P-value        Lower 95%      Upper 95%
Intercept     632.1967321    9.685863338      65.2700446     2.09858E-84    612.9825852    651.410879
C76           0.538125757    0.020032227      26.86300281    8.45675E-48    0.498387209    0.577864305

The 1st part of the output contains summary statistics for the regression as a whole, including R Square and the residual standard deviation (called Standard Error). Ignore the 2nd part, which displays ANOVA (Analysis of Variance) calculations. The 3rd part of the output indicates that the best fitting linear model has the equation:

ELEC = 632.20 + 0.538*C76

and that the slope, 0.538, has a t-stat of 26.86 and a very small p-value. The variable C76 is therefore significantly explaining some of the variation in ELEC; R Square indicates that it accounts for about 88% of it. The 4th part shows predicted values for each of the observations, and the residuals.
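The entries in the regression output are linked by simple arithmetic, and checking a few of them is a useful way to learn what each one means. The Python sketch below recomputes several statistics from the sums of squares, coefficient and standard error reported in the SUMMARY OUTPUT. The C76 value of 1000 used for the prediction is a hypothetical input chosen for illustration.

```python
from math import sqrt

# Figures copied from the SUMMARY OUTPUT.
n = 103
ss_regression, ss_residual, ss_total = 5093652.918, 712921.3281, 5806574.246
df_regression, df_residual = 1, 101
intercept, slope = 632.1967321, 0.538125757
se_slope = 0.020032227
lower_95, upper_95 = 0.498387209, 0.577864305

r_square = ss_regression / ss_total
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / df_residual
ms_residual = ss_residual / df_residual
standard_error = sqrt(ms_residual)  # residual standard deviation
f_stat = (ss_regression / df_regression) / ms_residual
t_slope = slope / se_slope
# Critical t value implied by the reported 95% interval for the slope.
t_critical = (upper_95 - lower_95) / (2 * se_slope)

print(f"R Square          {r_square:.6f}")
print(f"Adjusted R Square {adjusted_r_square:.6f}")
print(f"Standard Error    {standard_error:.4f}")
print(f"F                 {f_stat:.4f}")
print(f"t Stat (C76)      {t_slope:.4f}")
print(f"Implied t value   {t_critical:.4f}")

# Hypothetical input: predicted ELEC for a month with C76 = 1000.
predicted = intercept + slope * 1000
print(f"Predicted ELEC at C76 = 1000: {predicted:.1f}")
```

With a single explanatory variable, F equals the square of the slope's t Stat, which is why the Significance F and the P-value for C76 are identical in the output.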

4.2 REGRESSION - INTERPRETING EXCEL'S GRAPHICAL OUTPUT

The Regression tool puts one chart on top of another. Click on the top chart so that it becomes the active chart, and then move it down. The Line Fit Plot shows actual ELEC and predicted ELEC, plotted for different values of C76. The regression line (called Predicted ELEC in the legend) is shown as points rather than as a line. This can be changed by formatting the points.

[Figure: C76 Line Fit Plot. ELEC and Predicted ELEC (vertical axis, 0 to 2000) against C76 (horizontal axis, 0 to 2000).]

The Residual Plot shows residuals plotted against the value of the C76 variable. Check that the residuals do not display an obvious pattern. Ideally, residuals should look random: showing no systematic pattern, being of much the same average size throughout, and not increasing in size as X (here C76) increases. Residual plots are also useful for spotting outliers.
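The predicted values and residuals behind these plots come from applying the fitted equation to each month. The Python sketch below does this for the twelve 2004 observations from Section 1, using the coefficients reported for the full sample, and flags months whose residual exceeds twice the residual standard deviation. The two-standard-deviation cut-off is a common rule of thumb chosen here for illustration; the guide itself does not prescribe a threshold.

```python
# Coefficients and residual standard deviation from the Regression1 output.
intercept, slope = 632.1967321, 0.538125757
standard_error = 84.01563552

# Jan-04 to Dec-04 values copied from the data extract in Section 1.
months = ["Jan-04", "Feb-04", "Mar-04", "Apr-04", "May-04", "Jun-04",
          "Jul-04", "Aug-04", "Sep-04", "Oct-04", "Nov-04", "Dec-04"]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]

rows = []
for month, x, y in zip(months, c76, elec):
    predicted = intercept + slope * x
    residual = y - predicted          # actual minus predicted
    flagged = abs(residual) > 2 * standard_error
    rows.append((month, predicted, residual, flagged))
    note = "  <-- large residual" if flagged else ""
    print(f"{month:8s}{predicted:10.1f}{residual:10.1f}{note}")
```

With these inputs only Aug-04 is flagged, with a residual of about -261: the line over-predicts consumption in the month with the highest C76.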

[Figure: C76 Residual Plot. Residuals (vertical axis) against C76 (horizontal axis, 0 to 2000).]

5. MULTIPLE REGRESSION

Can the ELEC predictions be improved if other possible explanatory variables are brought into the model? This section contains a brief description of the way Excel's regression can be extended from simple regression (ELEC on C76) to multiple regression (ELEC on two or more variables). The purpose is to find the best equation for predicting ELEC from one or more of the independent variables. Let's regress ELEC on the other five variables.

(i) Return to the Data worksheet.
(ii) From the main Excel menu, click on the Data tab.
(iii) From the Analysis group, select Data Analysis.
(iv) In the resulting dialog box, select Regression.

In the Regression dialog box, specify:
(i) Input Y Range as B1:B104, i.e. ELEC as dependent variable.
(ii) Input X Range as C1:G104, i.e. five explanatory variables.
(iii) Check the Labels checkbox.
(iv) Specify Output options: as New Worksheet Ply, with the name Regression2.
(v) Under the heading Residuals, select Residuals, Residual Plots & Line Fit Plots.
(vi) Then click OK.
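To illustrate how adding an explanatory variable changes the fit, the Python sketch below regresses ELEC on C76 alone and then on C76 and H55 together. It is a simplified stand-in for Excel's five-variable regression: it uses two predictors and only the twelve 2004 observations from Section 1, so its coefficients are not those of the Regression2 output. The helper names are illustrative.

```python
def centered(values):
    """Return the mean-centered values and their mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values], mean

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def r_square(y, fitted):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, fitted))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1 - ss_res / ss_tot

# Jan-04 to Dec-04 values copied from the data extract in Section 1.
elec = [681.7, 620.3, 590.8, 538.0, 513.4, 575.5,
        1019.3, 1203.9, 1176.7, 723.0, 519.0, 604.9]
c76 = [0, 0, 0, 0, 3, 83, 833, 1547, 1287, 398, 5, 0]
h55 = [10148, 12504, 9300, 5333, 2846, 282, 1, 0, 0, 155, 1812, 5779]

y, mean_y = centered(elec)
x1, mean_1 = centered(c76)
x2, mean_2 = centered(h55)

# Simple regression: ELEC on C76.
b_simple = dot(x1, y) / dot(x1, x1)
a_simple = mean_y - b_simple * mean_1
fit_simple = [a_simple + b_simple * u for u in c76]

# Multiple regression: solve the 2x2 normal equations on centered data.
s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
s1y, s2y = dot(x1, y), dot(x2, y)
det = s11 * s22 - s12 ** 2
b1 = (s1y * s22 - s2y * s12) / det
b2 = (s2y * s11 - s1y * s12) / det
a = mean_y - b1 * mean_1 - b2 * mean_2
fit_multi = [a + b1 * u + b2 * v for u, v in zip(c76, h55)]

r2_simple = r_square(elec, fit_simple)
r2_multi = r_square(elec, fit_multi)
print(f"ELEC on C76:         R Square = {r2_simple:.4f}")
print(f"ELEC on C76 and H55: R Square = {r2_multi:.4f}")
```

R Square cannot fall when a variable is added to the model, so when comparing the simple and multiple regressions it is worth looking at Adjusted R Square and the individual t-stats as well.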