You are on page 1of 4

Hutcheson, G. D. (2011). Ordinary Least-Squares Regression. n L. !outinho and G. D. Hutcheson, "he S#G\$ Dictionary o% &uantitati'e !anage(ent Research.

)ages 22*-22+.

Ordinary Least-Squares Regression

ntroduction
Ordinary least-squares (OLS) regression is a generalized linear modelling technique that may be used to model a single response variable which has been recorded on at least an interval scale. The technique may be applied to single or multiple e planatory variables and also categorical e planatory variables that have been appropriately coded.

,ey -eatures
!t a very basic level" the relationship between a continuous response variable (#) and a continuous e planatory variable (\$) may be represented using a line o% best-%it" where # is predicted" at least to some e tent" by \$. &% this relationship is linear" it may be appropriately represented mathematically using the straight line equation '# ( ) * + '" as shown in ,igure (this line was computed using the least-squares procedure. see /yan" -001). The relationship between variables # and \$ is described using the equation o% the line o% best %it with ) indicating the value o% # when \$ is equal to zero (also 2nown as the intercept) and + indicating the slope o% the line (also 2nown as the regression coe%%icient). The regression coe%%icient + describes the change in # that is associated with a unit change in \$. !s can be seen %rom ,igure -" + only provides an indication o% the average e pected change (the observed data are scattered around the line)" ma2ing it important to also interpret the con%idence intervals %or the estimate (the large sample 034 two-tailed appro imation o% the con%idence intervals can be calculated as + 5 -.06 s.e. +). &n addition to the model parameters and con%idence intervals %or +" it is use%ul to also have an indication o% how well the model %its the data. 7odel %it can be determined by comparing the observed scores o% # (the values o% # %rom the sample o% data) with the e pected values o% # (the values o% # predicted by the regression equation). The di%%erence between these two values (the deviation" or residual as it is also called) provides an indication o% how well the model predicts each data point. !dding up the deviances %or all the data points a%ter they have been squared (this basically removes negative deviations) provides a simple measure o% the degree to which the data deviates %rom the model overall. The sum o% all the squared residuals is 2nown as the residual sum o% squares (/SS) and provides a measure o% model-%it %or an OLS regression model. ! poorly %itting model will deviate mar2edly %rom the data and will consequently have a relatively large /SS" whereas a good-%itting model will not deviate mar2edly %rom the data and will consequently have a relatively small /SS (a per%ectly %itting model will have an /SS equal to zero" as there will be no deviation between observed and e pected values o% #). &t is important to understand how the /SS statistic (or the deviance as it is also 2nown. see !gresti"-006" pages 96-97) operates as it is used to determine the signi%icance o% individual and groups o% variables in a regression model. ! graphical illustration o% the residuals %or a simple regression model is provided in ,igure 8. 9etailed e amples o% calculating deviances %rom residuals %or null and simple regression models can be %ound in :utcheson and 7outinho" 8;;<.

The deviance is an important statistic as it enables the contribution made by e planatory variables to the prediction o% the response variable to be determined. &% by adding a variable to the model" the deviance is greatly reduced" the added variable can be said to have had a large e%%ect on the prediction o% # %or that model. &%" on the other hand" the deviance is not greatly reduced" the added variable can be said to have had a small e%%ect on the prediction o% # %or that model. The change in the deviance that results %rom the e planatory variable being added to the model is used to determine the signi%icance o% that variable's e%%ect on the prediction o% # in that model. To assess the e%%ect that a single e planatory variable has on the prediction o% #" one simply compares the deviance statistics be%ore and a%ter the variable has been added to the model. ,or a simple OLS regression model" the e%%ect o% the e planatory variable can be assessed by comparing the /SS statistic %or the %ull regression model (# ( ) * + ) with that %or the null model (# ( )). The di%%erence in deviance between the nested models can then be tested %or signi%icance using an ,-test computed %rom the %ollowing equation.

F df

df

p+q

,df

p+q

df p df p+q RSS p+q

/ df

p+q

where p represents the null model" # ( )" p+q represents the model # ( ) * + " and df are the degrees o% %reedom associated with the designated model. &t can be seen %rom this equation that the ,-statistic is simply based on the di%%erence in the deviances between the two models as a %raction o% the deviance o% the %ull model" whilst ta2ing account o% the number o% parameters. &n addition to the model-%it statistics" the /-square statistic is also commonly quoted and provides a measure that indicates the percentage o% variation in the response variable that is =e plained' by the model. /-square" which is also 2nown as the coe%%icient o% multiple determination" is de%ined as

R8 =

and basically gives the percentage o% the deviance in the response variable that can be accounted %or by adding the e planatory variable into the model. !lthough /-square is widely used" it will always increase as variables are added to the model (the deviance can only go down when additional variables are added to a model). One solution to this problem is to calculate an ad>usted /-square statistic (/ 8a) which ta2es into account the number o% terms entered into the model and does not necessarily increase as more terms are added. !d>usted /-square can be derived using the %ollowing equation

R8 a

k - R8 = R n k 8

where n is the number o% cases used to construct the model and k is the number o% terms in the model (not including the constant).

#n e.a(/0e o% si(/0e OLS regression

! simple OLS regression model with a single e planatory variable can be illustrated using the e ample o% predicting ice cream sales given outdoor temperature (?oteswara" -01;). The model %or this relationship

(calculated using so%tware) is &ce cream consumption ( ;.8;1 * ;.;;@ temperature. The parameter %or ) (;.8;1) indicates the predicted consumption when temperature is equal to zero. &t should be noted that although the parameter ) is required to ma2e predictions o% ice cream consumption at any given temperature" the prediction o% consumption at a temperature o% zero might be o% limited use%ulness" particularly when the observed data does not include a temperature o% zero in it's range (predictions should only be made within the limits o% the sampled values). The parameter + indicates that %or each unit increase in temperature" ice cream consumption increases by ;.;;@ units. The signi%icance o% the relationship between temperature and ice cream consumption can be estimated by comparing the deviance statistics %or the two nested models in the table below. one that includes temperature and one that does not. This di%%erence in deviance can be assessed %or signi%icance using the ,-statistic.
!ode0 consumption ( a consumption ( ) * + temperature de'iance (RSS) ;.-833 ;.;3;; d% 80 8< change in de'iance ;.;133 --statistic )-'a0ue

A8.8<

B.;;;-

On the basis o% this analysis" outdoor temperature would appear to be signi%icantly related to ice cream consumption with each unit increase in temperature being associated with an increase o% ;.;;@ units in ice cream consumption. Csing these statistics it is a simple matter to also compute the /-square statistic %or this model" which is ;.;133D;.-833" or ;.6;. Temperature Ee plainsF 6;4 o% the deviance in ice cream consumption (i.e." when temperature is added to the model" the deviance in the # variable is reduced by 6;4).

OLS regression 1ith (u0ti/0e e./0anatory 'aria20es

The OLS regression model can be e tended to include multiple e planatory variables by simply adding additional variables to the equation. The %orm o% the model is the same as above with a single response variable (#)" but this time # is predicted by multiple e planatory variables (\$ - to \$@). # ( ) * +-\$- * +8\$8 * +@\$@ The interpretation o% the parameters () and +) %rom the above model is basically the same as %or the simple regression model above" but the relationship cannot now be graphed on a single scatter plot. ) indicates the value o% # when all vales o% the e planatory variables are zero. Gach + parameter indicates the average change in # that is associated with a unit change in \$" whilst controlling %or the other e planatory variables in the model. 7odel-%it can be assessed through comparing deviance measures o% nested models. ,or e ample" the e%%ect o% variable \$@ on # in the model above can be calculated by comparing the nested models # ( ) * +-\$- * +8\$8 * +@\$@ # ( ) * +-\$- * +8\$8 The change in deviance between these models indicates the e%%ect that \$ @ has on the prediction o% # when the e%%ects o% \$- and \$8 have been accounted %or (it is" there%ore" the unique e%%ect that \$ @ has on # a%ter ta2ing into account \$- and \$8). The overall e%%ect o% all three e planatory variables on # can be assessed by comparing the models # ( ) * +-\$- * +8\$8 * +@\$@ # ( ). The signi%icance o% the change in the deviance scores can be assessed through the calculation o% the ,statistic using the equation provided above (these are" however" provided as a matter o% course by most so%tware pac2ages). !s with the simple OLS regression" it is a simple matter to compute the /-square statistics.

#n e.a(/0e o% (u0ti/0e OLS regression

! multiple OLS regression model with three e planatory variables can be illustrated using the e ample %rom the simple regression model given above. &n this e ample" the price o% the ice cream and the average income o% the neighbourhood are also entered into the model. This model is calculated as &ce cream consumption ( ;.-01 H -.;AA price * ;.;@@ income * ;.;;@ temperature. The parameter %or ) (;.-01) indicates the predicted consumption when all e planatory variables are equal to zero. The + parameters indicate the average change in consumption that is associated with each unit increase in the e planatory variable. ,or e ample" %or each unit increase in price" consumption goes down by -.;AA units. The signi%icance o% the relationship between each e planatory variable and ice cream consumption can be estimated by comparing the deviance statistics %or nested models. The table below shows the signi%icance o% each o% the e planatory variables (shown by the change in deviance when that variable is removed %rom the model) in a %orm typically used by so%tware (when only one parameter is assessed" the ,-statistic is equivalent to the t-statistic (, ( It) which is o%ten quoted in statistical output).
de'iance change coe%%icient /rice inco(e te(/erature residua0s ;.;;8 ;.;-;.;<8 ;.;@3 86 ,-"86( -.361 ,-"86( 1.01@ ,-"86( 6;.838 ;.888 ;.;;0 B;.;;;d% --'a0ue )-'a0ue

Jithin the range o% the data collected in this study" temperature and income appear to be signi%icantly related to ice cream consumption.

3onc0usion
OLS regression is one o% the ma>or techniques used to analyse data and %orms the basis o% many other techniques (%or e ample !KOL! and the Meneralised linear models" see /uther%ord" 8;;-). The use%ulness o% the technique can be greatly e tended with the use o% dummy variable coding to include grouped e planatory variables (see :utcheson and 7outinho" 8;;<" %or a discussion o% the analysis o% e perimental designs using regression) and data trans%ormation methods (see" %or e ample" ,o " 8;;8). OLS regression is particularly power%ul as it relatively easy to also chec2 the model asumption such as linearity" constant variance and the e%%ect o% outliers using simple graphical methods (see :utcheson and So%roniou" -000).