Getting Started with Regression

[Figure: scatter plot of Sales Prices against Predicted Values]

Presented By: Prepared For:

Tim Wilmath, MAI Hillsborough County Property Appraisers Office

History of Regression

Francis Galton created Regression Analysis in 1885 while attempting to predict a person's height based on the height of his or her parents.

Galton found that children born to tall parents tended to be shorter than their parents, and children born to short parents tended to be taller than their parents. Both groups of children regressed toward the mean height of all children.

In 1922, a PhD student by the name of Casper G. Haas suggested using regression for farm land valuation.

What is truly remarkable is that Mr. Haas was using a technique that required a significant amount of calculation, work that today is done by computer programs in seconds. Mr. Haas did these calculations by hand. Looking at this excerpt, it is remarkable how little the nomenclature and the statistical output have changed in more than 85 years.

Uses of Regression
Predicting the Weather
Predicting Election Results
Predicting Sales Prices

What is Regression?
When Regression Analysis is used to predict sales prices or establish assessments, it becomes an Automated Sales Comparison Approach.

Steps in Regression
1. Data Exploration and cleanup
2. Specifying the model
3. Calibrating the model
4. Interpreting the results

Data Exploration & Cleanup


Is there a pattern suggesting a relationship between variables?

[Figure: scatter plot of SALES PRICE (0 to 800,000) against HEATED AREA (0 to 7,000)]

Note the outliers. These will adversely affect our final values if we don't deal with them now.

Because of the potential for extreme values to influence the mean, modelers often remove or trim extreme values.
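One way to trim extreme values can be sketched as follows. This is only an illustration: the 1.5 x interquartile-range rule on price per square foot is a common convention assumed here, not a method stated in the presentation, and the sales data and the function name `trim_outliers` are hypothetical.

```python
from statistics import quantiles

def trim_outliers(sales, k=1.5):
    """Keep sales whose price per square foot lies within k * IQR of the quartiles."""
    ppsf = [price / area for area, price in sales]
    q1, _, q3 = quantiles(ppsf, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [sale for sale, rate in zip(sales, ppsf) if low <= rate <= high]

# Hypothetical (heated area, sale price) pairs; the last one is an extreme value.
sales = [
    (1000, 80000), (1200, 93000), (1500, 115000), (1800, 135000),
    (2000, 150000), (2200, 168000), (2500, 190000), (3000, 222000),
    (3500, 262500), (1200, 700000),
]
trimmed = trim_outliers(sales)
print(len(sales), "sales before trimming,", len(trimmed), "after")
```

Trimming on price per square foot rather than raw price keeps large but fairly priced homes in the sample; other rules (percentile cut-offs, manual review) are equally plausible.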

Model Specification
Specifying the model means picking the appropriate equation and the variables that will be used. Models can be:
Additive - Most common for residential properties
Multiplicative - Often used for land valuation
Hybrid - Most advanced

We are going to use an Additive Model in this presentation

Regression Components
Dependent Variable: Sales Price
Independent Variables: Size, Age, Location, Condition, Lot size, Construction Quality, Amenities

Simple Regression
Simple Regression includes one Dependent Variable (sales price) and only one Independent Variable - such as Square Footage.
[Figure: scatter plot of SALES PRICE (0 to 500,000) against HEATED AREA (0 to 5,000)]

Using this model, a 1,000 sf home would be valued at $75,000.

Simple Regression
Simple Regression using only size as the independent variable will predict sales prices; however, it will treat all homes of the same size equally.

1,000 square feet - $75,000

1,000 square feet - $75,000?

Multiple Regression
We know square footage is an important variable but what other variables should we include and how do we decide?

Effective Age
Actual Age
View

Correlation Analysis
Pearson's Correlation tells you the degree of relationship between variables.

Correlations (N = 1367 for every pair)
                                  SALEPRICE   BLDSIZE   BEDROOMS   DOCK
SALEPRICE   Pearson Correlation   1           0.855     0.557      0.142
            Sig. (2-tailed)       .           0         0          0
BLDSIZE     Pearson Correlation   0.855       1         0.659      0.062
            Sig. (2-tailed)       0           .         0          0.021
BEDROOMS    Pearson Correlation   0.557       0.659     1          0.037
            Sig. (2-tailed)       0           0         .          0.176
DOCK        Pearson Correlation   0.142       0.062     0.037      1
            Sig. (2-tailed)       0           0.021     0.176      .

Notice the high correlation between sales price and size.
Very little correlation between sales price and dock.

Correlation Analysis also helps identify Collinearity, which is a correlation between two independent variables. For example, the living area of a home is highly correlated with the number of bedrooms. It would only be necessary to have one of these variables in the model.
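The correlation coefficient itself is simple to compute. The sketch below implements the standard Pearson formula in plain Python; the sample values are hypothetical and are not the 1,367 sales behind the table above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(spread_x * spread_y)

# Hypothetical sample of five sales.
saleprice = [95000, 120000, 150000, 180000, 240000]
bldsize = [1100, 1500, 1900, 2300, 3100]
bedrooms = [2, 3, 3, 4, 5]

r_price_size = pearson(saleprice, bldsize)
r_size_bedrooms = pearson(bldsize, bedrooms)  # a high value here signals collinearity
print(round(r_price_size, 3), round(r_size_bedrooms, 3))
```

Checking the correlation between pairs of independent variables, as in the last call, is how collinear candidates such as building size and bedroom count can be spotted before modeling.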

Regression Equations
Y = mx + b

Y = b_0 + b_1 X_1 + b_2 X_2 + ... + b_K X_K

Running Regression
Statistical Software makes using Regression much easier,
performing the necessary calculations quickly and accurately.
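For a single independent variable, the calculation the software performs can be sketched in a few lines. This is a minimal least-squares fit of Y = mx + b, assuming hypothetical sales that lie exactly on a line; the function name `fit_simple` is illustrative and does not belong to any statistics package.

```python
def fit_simple(x, y):
    """Return (b, m) minimizing the squared error of y = m * x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    m = (sum((a - mean_x) * (c - mean_y) for a, c in zip(x, y))
         / sum((a - mean_x) ** 2 for a in x))
    return mean_y - m * mean_x, m

# Hypothetical sales that lie exactly on price = 5000 + 75 * area.
area = [1000, 1500, 2000, 2500]
price = [5000 + 75 * a for a in area]

constant, per_sf = fit_simple(area, price)
print(constant, per_sf)
```

Real sales never fall exactly on a line, so the fitted constant and slope are the values that make the total squared miss as small as possible.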

Let's Run This!

Regression Results
Model 1

The output tells us how well our model is working.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .855(a)   .732       .731                25406.53266545
a Predictors: (Constant), BLDSIZE

The closer the Adj. R-Square is to 1, the better.

And it gives us the coefficients (or adjustments).

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   6838.585     2195.717             3.115     .002
BLDSIZE      75.068       1.231        .855    60.997    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

$6,838 + Bldsize x $75.07 = Property Value

The adjusted R2 statistic measures the share of total variation in sale prices explained by the Regression Model, adjusted for the number of variables used. It typically ranges from 0.00 to 1.00, with 1.00 being the desired value. A high number, say 0.910, means that approximately 91% of the variation in value can be explained by the model.
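The adjustment can be written out directly. The sketch below uses the standard textbook formula, where `n` is the number of sales and `k` the number of independent variables; the example inputs are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-square for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical model: R-square of 0.90 from 101 sales and 5 variables.
print(round(adjusted_r2(0.90, 101, 5), 4))
```

With a large sample and few variables the adjustment is small, which is why R Square and Adjusted R Square in the output differ only in the third decimal place.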

Regression Results
The output includes the coefficient and the Constant.

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   6838.585     2195.717             3.115     .002
BLDSIZE      75.068       1.231        .855    60.997    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

The Constant represents the portion of value that is not explained by the variables included in the model.
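Applying the Model 1 output to a property is a one-line calculation. The sketch below uses the Constant and BLDSIZE coefficient from the Coefficients table; the 2,000 sf home and the function name `model1_value` are hypothetical.

```python
CONSTANT = 6838.585  # (Constant) from the Model 1 Coefficients table
PER_SF = 75.068      # BLDSIZE coefficient from the same table

def model1_value(bldsize):
    """Predicted value under Model 1: Constant + Bldsize x coefficient."""
    return CONSTANT + PER_SF * bldsize

print(f"${model1_value(2000):,.0f}")
```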

Running Regression
Let's add another variable to the model, say Land Size.

Let's run this model and see if the results improve.

Regression Results
Model 2
Our Adj. R2 went up from .731 to .801!

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .895(a)   .801       .801                21864.78975921
a Predictors: (Constant), LANDSF, BLDSIZE

We also have new coefficients (or adjustments).

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   6119.232     1889.914             3.238     .001
BLDSIZE      72.660       1.065        .828    68.237    .000
LANDSF       .382         .017         .266    21.887    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

$6,119 + Bldsize x $72.66 + Landsf x $0.382 = Property Value

Running Regression
Let's add Age to the model.

If Age is significant to value, the model should improve. Let's run it.

Regression Results
Model 3
Our Adj. R2 went up from .801 to .832!

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .912(a)   .832       .832                20114.04445033
a Predictors: (Constant), AGE, LANDSF, BLDSIZE

Notice the age coefficient is negative.

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   22855.587    2036.809             11.221    .000
BLDSIZE      67.276       1.037        .767    64.856    .000
LANDSF       .444         .017         .309    26.868    .000
AGE          -630.763     39.991       -.189   -15.773   .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

$22,855 + Bldsize x $67.28 + Landsf x $0.44 + Age x ($630.76) = Property Value

Running Regression
Let's add Building Quality to the model.

We may have a problem. Let's run it and see.

Regression Results
Model 4

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .924(a)   .854       .853                18784.15717760
a Predictors: (Constant), QUAL, LANDSF, AGE, BLDSIZE

Our Adj. R2 went up from .832 to .853 after adding quality, but notice the constant is now negative - that's not good!

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   -45723.503   5199.675             -8.794    .000
BLDSIZE      59.808       1.103        .681    54.234    .000
LANDSF       .445         .015         .309    28.831    .000
AGE          -605.886     37.388       -.182   -16.205   .000
QUAL         26110.420    1842.475     .171    14.171    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

What do we do with this quality adjustment?

Regression Results
Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   -45723.503   5199.675             -8.794    .000
BLDSIZE      59.808       1.103        .681    54.234    .000
LANDSF       .445         .015         .309    28.831    .000
AGE          -605.886     37.388       -.182   -16.205   .000
QUAL         26110.420    1842.475     .171    14.171    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

Quality          Resulting Adjustment
1 - Fair         1 x $26,110 = $26,110
2 - Average      2 x $26,110 = $52,220
3 - Good         3 x $26,110 = $78,330
4 - Excellent    4 x $26,110 = $104,440
5 - Superior     5 x $26,110 = $130,550

This doesn't make sense because the codes 1, 2, 3, etc. were not meant to be a rank. The model simply multiplies the code by $26,110, which forces equal steps between quality grades.

A Note about Data Types


There are 3 primary types of property characteristics:
Continuous: Based on a size or measurement. Examples: Square Footage or Lot Size
Discrete: Specific pre-defined value. Examples: Roof Material, Building Quality
Binary: Either the item is present or not. Examples: Corner Location, Lakefront Location

Transformations
To solve the problem, we need to convert the discrete variable Quality into individual binary variables, which allows Regression to distinguish each type:

Quality BECOMES:
Fair         Yes/No
Average      Yes/No
Good         Yes/No
Excellent    Yes/No
Superior     Yes/No

Running Regression
Now that we have transformed the variable Quality, we can put it back in the model.

Notice we left Average out. It serves as the base against which the other quality grades are measured.
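The transformation of Quality into binary variables can be sketched as follows. The code-to-grade mapping follows the 1 to 5 quality codes in the presentation, and the variable names match the predictors in the regression output (FAIR, GOOD, EXCEL, SUPERIOR); the function name `quality_dummies` is hypothetical.

```python
QUALITY_NAMES = {1: "FAIR", 2: "AVERAGE", 3: "GOOD", 4: "EXCEL", 5: "SUPERIOR"}

def quality_dummies(code, base="AVERAGE"):
    """Return one 0/1 indicator per quality grade, omitting the base grade."""
    return {name: int(code == c) for c, name in QUALITY_NAMES.items() if name != base}

print(quality_dummies(3))  # a Good quality home
print(quality_dummies(2))  # an Average home: every indicator is 0
```

An Average home gets zeros everywhere, so its quality is absorbed into the constant and every other grade's coefficient reads as a difference from Average.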

Regression Results
Model 5
Our Adj. R2 went up from .832 to .869.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .933(a)   .870       .869                17717.09739523
a Predictors: (Constant), SUPERIOR, EXCEL, AGE, FAIR, GOOD, LANDSF, BLDSIZE

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   35633.753    1922.792             18.532    .000
BLDSIZE      58.537       1.045        .667    56.031    .000
LANDSF       .419         .016         .291    26.342    .000
AGE          -625.742     35.363       -.188   -17.695   .000
FAIR         -25511.289   8693.178     -.031   -2.935    .003
GOOD         21095.623    1838.228     .127    11.476    .000
EXCEL        75844.967    12720.934    .059    5.962     .000
SUPERIOR     305671.839   18494.059    .169    16.528    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

These Quality adjustments are all relative to Average.

Running Regression
Let's transform Neighborhood into binary variables and add it to the model.

Notice we left out the Base Neighborhood (the most typical).

Regression Results
Model 6
Our Adj. R2 went up from .869 to .874.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .936(a)   .875       .874                17391.93018134
a Predictors: (Constant), NB211006, BLDSIZE, EXCEL, FAIR, SUPERIOR, NB211002, NB211001, NB211005, AGE, LANDSF, GOOD, NB211003

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   40799.859    2299.668             17.742    .000
BLDSIZE      56.000       1.143        .638    48.980    .000
LANDSF       .423         .016         .294    25.753    .000
AGE          -671.493     37.221       -.201   -18.041   .000
FAIR         -33476.331   8602.963     -.041   -3.891    .000
GOOD         17371.495    2023.937     .105    8.583     .000
EXCEL        72617.618    12567.147    .057    5.778     .000
SUPERIOR     313444.055   18313.237    .173    17.116    .000
NB211001     14199.881    2321.457     .070    6.117     .000
NB211002     -3514.034    1657.862     -.025   -2.120    .034
NB211003     -1483.623    1244.877     -.015   -1.192    .234
NB211005     4044.357     2266.186     .021    1.785     .075
NB211006     1915.755     2601.773     .008    .736      .462
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

These Neighborhood adjustments are all relative to our Base Neighborhood.

Running Regression
Multiplicative Transformations combine two variables into one:
Square Footage x Quality = SQFT1 ... SQFT5 (one variable per quality grade)
This reflects the fact that quality may contribute greater value in larger homes and less value in smaller homes. In other words, without combining these variables, all Good Quality homes get the same adjustment regardless of their size. Let's add this new combined variable to the model.

Since we combined SF and Quality, we remove them as stand-alone variables.
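A sketch of how the combined variables might be built is shown below. It assumes that SQFT1 through SQFT5 correspond to quality codes 1 (Fair) through 5 (Superior), which the output's variable names suggest but the presentation does not state outright; the function name `sqft_by_quality` is hypothetical.

```python
def sqft_by_quality(bldsize, quality_code):
    """Place building size in the slot matching the quality code; 0 elsewhere."""
    return {f"SQFT{c}": (bldsize if c == quality_code else 0) for c in range(1, 6)}

# A hypothetical 1,800 sf home of Good quality (code 3).
print(sqft_by_quality(1800, 3))
```

Because each home has a non-zero value in exactly one of the five variables, each coefficient becomes a per-square-foot rate for that quality grade.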

Regression Results
Model 7
Our Adj. R2 went up from .874 to .879.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .938(a)   .880       .879                17065.96846831
a Predictors: (Constant), SQFT5, SQFT4, AGE, NB211002, SQFT2, SQFT1, NB211006, NB211001, NB211005, LANDSF, NB211003, SQFT3

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   43999.158    2299.663             19.133    .000
LANDSF       .418         .016         .291    25.996    .000
AGE          -660.473     36.505       -.198   -18.092   .000
NB211001     10975.273    2335.844     .054    4.699     .000
NB211002     -3611.418    1624.028     -.026   -2.224    .026
NB211003     -1250.573    1221.119     -.013   -1.024    .306
NB211005     6350.688     2243.206     .033    2.831     .005
NB211006     1923.311     2554.324     .008    .753      .452
SQFT1        21.119       8.533        .026    2.475     .013
SQFT2        53.673       1.169        .723    45.916    .000
SQFT3        63.139       1.074        .964    58.806    .000
SQFT4        77.267       3.557        .210    21.720    .000
SQFT5        108.100      2.941        .356    36.759    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

Notice the adjustments went from fixed dollar amounts to per square foot.

Advanced Transformations
Exponential transformations - Raise a variable to a power:
Land Size ^ .75 = LAND75
Reflects the principle of diminishing returns. The unit price of land tends to decrease as size increases. Without this transformation, every square foot of land would get the same adjustment, regardless of lot size. Raising land size to the power of .75 reflects the curve shown below.
[Figure: SINGLE FAMILY LOT PRICES, PRICE PER SF ($2.40 to $2.85) plotted against LOT SIZE (5,000 to 30,000)]
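The effect of the .75 power can be seen with a short sketch. The lot sizes below are hypothetical; the point is only that the transformed value per square foot falls as the lot gets larger.

```python
def land75(land_sf):
    """Land size raised to the power of .75."""
    return land_sf ** 0.75

# Transformed units per square foot shrink as lot size grows.
for lot in (5000, 10000, 20000, 30000):
    print(lot, round(land75(lot), 1), round(land75(lot) / lot, 4))
```

Since the regression assigns one coefficient to LAND75, a lot twice as large receives less than twice the land adjustment.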

Running Regression
Let's add our new transformed land variable to the model.

Regression Results
Model 8
Our Adj. R2 went up from .879 to .881.

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .939(a)   .882       .881                16919.04533480
a Predictors: (Constant), LAND75, NB211005, NB211001, SQFT4, NB211002, SQFT5, SQFT1, AGE, SQFT2, NB211006, NB211003, SQFT3

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   40782.649    2277.915             17.903    .000
AGE          -731.178     36.549       -.219   -20.005   .000
NB211001     10061.900    2314.108     .050    4.348     .000
NB211002     -3196.888    1609.968     -.023   -1.986    .047
NB211003     -1646.847    1211.025     -.017   -1.360    .174
NB211005     6714.691     2224.018     .035    3.019     .003
NB211006     -5595.936    2625.622     -.024   -2.131    .033
SQFT1        30.298       8.324        .038    3.640     .000
SQFT2        51.834       1.167        .698    44.421    .000
SQFT3        60.732       1.081        .927    56.177    .000
SQFT4        71.516       3.559        .194    20.094    .000
SQFT5        104.644      2.937        .345    35.625    .000
LAND75       12.233       .459         .314    26.668    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

Running Regression
Let's add garages, pools, and baths just to round out our model.

Regression Results
Model 9
Our Adj. R2 went up from .881 to .895.

Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .947(a)   .897       .895                15854.87728402

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   29680.695    2885.532             10.286    .000
AGE          -705.817     38.491       -.212   -18.337   .000
NB211001     12374.064    2176.815     .061    5.684     .000
NB211002     -1094.891    1527.977     -.008   -.717     .474
NB211003     -938.838     1136.671     -.010   -.826     .409
NB211005     12639.946    2139.489     .066    5.908     .000
NB211006     852.109      2535.266     .004    .336      .737
SQFT1        31.388       7.815        .039    4.016     .000
SQFT2        44.166       1.365        .595    32.349    .000
SQFT3        52.939       1.265        .808    41.857    .000
SQFT4        60.447       3.561        .164    16.974    .000
SQFT5        94.723       2.943        .312    32.186    .000
LAND75       11.788       .433         .303    27.240    .000
BATHS        7714.093     1338.204     .076    5.765     .000
POOL         13359.275    1184.469     .105    11.279    .000
GARAGE       10.750       3.137        .038    3.427     .001
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

Regression Results
Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   35633.753    1922.792             18.532    .000
BLDSIZE      58.537       1.045        .667    56.031    .000
LANDSF       .419         .016         .291    26.342    .000
AGE          -625.742     35.363       -.188   -17.695   .000
FAIR         -25511.289   8693.178     -.031   -2.935    .003
GOOD         21095.623    1838.228     .127    11.476    .000
EXCEL        75844.967    12720.934    .059    5.962     .000
SUPERIOR     305671.839   18494.059    .169    16.528    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

The Beta value is the standardized coefficient: the change in sale price, measured in standard deviations, for a one-standard-deviation change in the variable. Because it is unit-free, it allows the relative contribution of variables measured in different units to be compared.

Regression Results
The significance of each variable to the model can be determined by looking at the t values.

Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   29680.695    2885.532             10.286    .000
AGE          -705.817     38.491       -.212   -18.337   .000
NB211001     12374.064    2176.815     .061    5.684     .000
NB211002     -1094.891    1527.977     -.008   -.717     .474
NB211003     -938.838     1136.671     -.010   -.826     .409
NB211005     12639.946    2139.489     .066    5.908     .000
NB211006     852.109      2535.266     .004    .336      .737
SQFT1        31.388       7.815        .039    4.016     .000
SQFT2        44.166       1.365        .595    32.349    .000
SQFT3        52.939       1.265        .808    41.857    .000
SQFT4        60.447       3.561        .164    16.974    .000
SQFT5        94.723       2.943        .312    32.186    .000
LAND75       11.788       .433         .303    27.240    .000
BATHS        7714.093     1338.204     .076    5.765     .000
POOL         13359.275    1184.469     .105    11.279    .000
GARAGE       10.750       3.137        .038    3.427     .001
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

Rule of Thumb: t scores should be 2.0 or greater in absolute value.
NB211002, NB211003, and NB211006 are insignificant.
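The rule of thumb can be applied mechanically. The sketch below copies the t values from the Model 9 output and flags every variable whose absolute t value falls below 2.0; the dictionary name `T_VALUES` is just a label chosen for this example.

```python
# t values for the Model 9 variables, as reported in the regression output.
T_VALUES = {
    "AGE": -18.337, "NB211001": 5.684, "NB211002": -0.717, "NB211003": -0.826,
    "NB211005": 5.908, "NB211006": 0.336, "SQFT1": 4.016, "SQFT2": 32.349,
    "SQFT3": 41.857, "SQFT4": 16.974, "SQFT5": 32.186, "LAND75": 27.240,
    "BATHS": 5.765, "POOL": 11.279, "GARAGE": 3.427,
}

# The sign only shows direction, so compare the absolute value to 2.0.
insignificant = [name for name, t in T_VALUES.items() if abs(t) < 2.0]
print(insignificant)
```

Taking the absolute value matters: AGE has a large negative t value and is one of the most significant variables in the model.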

Regression Results
Coefficients(a)
Model 1      B            Std. Error   Beta    t         Sig.
(Constant)   35633.753    1922.792             18.532    .000
BLDSIZE      58.537       1.045        .667    56.031    .000
LANDSF       .419         .016         .291    26.342    .000
AGE          -625.742     35.363       -.188   -17.695   .000
FAIR         -25511.289   8693.178     -.031   -2.935    .003
GOOD         21095.623    1838.228     .127    11.476    .000
EXCEL        75844.967    12720.934    .059    5.962     .000
SUPERIOR     305671.839   18494.059    .169    16.528    .000
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

The t-statistic is calculated by dividing the coefficient of a variable by its standard error. For example: for the variable BLDSIZE, the t-statistic is calculated as follows: 58.537 / 1.045 = 56.0
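The same division can be checked for other rows of the table. The sketch below recomputes t for three variables from their B and Std. Error values; small differences from the printed t values come from rounding in the displayed coefficients.

```python
def t_statistic(coefficient, std_error):
    """t value: the coefficient divided by its standard error."""
    return coefficient / std_error

# (B, Std. Error) pairs taken from the Coefficients table.
rows = {"BLDSIZE": (58.537, 1.045), "AGE": (-625.742, 35.363), "FAIR": (-25511.289, 8693.178)}
for name, (b, se) in rows.items():
    print(name, round(t_statistic(b, se), 3))
```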

Regression Results
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .947(a)   .897       .895                15854.87728402

The Standard Error of the Estimate in the regression model tells us how much a value estimate will typically vary from the actual sale price. This number alone is meaningless unless related to the average sales price in the sale sample. Dividing the Standard Error by the Average Sales Price produces the Coefficient of Variation (COV).

$15,854 / $134,043 = 11.8% COV
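The COV calculation is a single division, sketched here with the two figures from the presentation; the function name `cov` is hypothetical.

```python
def cov(std_error, mean_sale_price):
    """Coefficient of Variation: standard error as a share of the average sale price."""
    return std_error / mean_sale_price

print(f"{cov(15854, 134043):.1%}")
```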

Regression Options
Enter is the default regression method in most statistical software programs. This method includes all variables entered by the modeler. Stepwise multiple regression automatically eliminates redundant or insignificant variables.
Coefficients(a)
Model 4      B            Std. Error   Beta    t         Sig.
(Constant)   28624.283    2584.025             11.077    .000
AGE          -697.862     37.689       -.209   -18.516   .000
NB211001     12794.553    2071.093     .063    6.178     .000
NB211005     13302.885    1969.163     .069    6.756     .000
SQFT1        31.406       7.797        .039    4.028     .000
SQFT2        44.305       1.354        .597    32.723    .000
SQFT3        53.134       1.249        .811    42.525    .000
SQFT4        60.544       3.557        .164    17.023    .000
SQFT5        94.884       2.924        .313    32.446    .000
LAND75       11.891       .393         .305    30.243    .000
BATHS        7732.836     1332.987     .076    5.801     .000
POOL         13317.394    1179.165     .105    11.294    .000
GARAGE       10.586       3.047        .037    3.474     .001
B, Std. Error: Unstandardized Coefficients. Beta: Standardized Coefficients.
a Dependent Variable: SALEPRIC

Notice that Stepwise Regression kicked out the neighborhoods that had low t-scores.

Creating New Assessments


Once you have calibrated your model, the Regression software allows you to predict the new values (or assessments) using the coefficients (or adjustments) you created.

Reviewing Ratio Statistics


Once the new assessments are created using our final model, we can review the accuracy of our new values using traditional ratio statistics.

Ratio Statistics for ASSESS Unstandardized Predicted Value / SALEPRIC
Weighted Mean                                 1.000
Price Related Differential                    1.008
Coefficient of Dispersion                     .079
Coefficient of Variation (Mean Centered)      11.1%
Coefficient of Variation (Median Centered)    11.2%
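Three of these ratio statistics can be sketched with their conventional definitions: the weighted mean is total assessed value over total sale price, the Price Related Differential is the mean ratio over the weighted mean, and the Coefficient of Dispersion is the average absolute deviation from the median ratio divided by the median ratio. The assessed values and sale prices below are hypothetical, and the function name `ratio_statistics` is illustrative.

```python
from statistics import mean, median

def ratio_statistics(assessed, sale_prices):
    """Weighted mean ratio, PRD, and COD (as a fraction) for paired values."""
    ratios = [a / s for a, s in zip(assessed, sale_prices)]
    weighted_mean = sum(assessed) / sum(sale_prices)
    med = median(ratios)
    return {
        "weighted_mean": weighted_mean,
        "prd": mean(ratios) / weighted_mean,
        "cod": mean([abs(r - med) for r in ratios]) / med,
    }

# Hypothetical predicted assessments and the matching sale prices.
assessed = [90000, 200000, 330000]
sale_prices = [100000, 200000, 300000]
stats = ratio_statistics(assessed, sale_prices)
print({name: round(value, 3) for name, value in stats.items()})
```

The COD is returned as a fraction, matching the .079 shown in the output above; multiply by 100 to express it as a percentage.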

Valuing the Population


Valuing the population requires transforming the same variables you used in the model, then applying the coefficients to those variables. This can be done internally within some CAMA systems, using Microsoft Excel or other spreadsheet software, or within the regression software.

Valuing the population is one of the most difficult aspects of regression modeling because changes in the physical attributes of any one parcel often require re-running the entire model and re-calculating values.
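Applying the final coefficients to one parcel might look like the sketch below. The coefficients are copied from the Model 9 output, but several details are assumptions made for illustration: that SQFT1 through SQFT5 map to quality codes 1 through 5, that GARAGE is measured in square feet, that BATHS is a count, that POOL is a 0/1 indicator, and that any neighborhood without its own NB variable is the base neighborhood. The parcel record and its field names are hypothetical.

```python
CONSTANT = 29680.695
MODEL9 = {
    "AGE": -705.817, "NB211001": 12374.064, "NB211002": -1094.891,
    "NB211003": -938.838, "NB211005": 12639.946, "NB211006": 852.109,
    "SQFT1": 31.388, "SQFT2": 44.166, "SQFT3": 52.939, "SQFT4": 60.447,
    "SQFT5": 94.723, "LAND75": 11.788, "BATHS": 7714.093,
    "POOL": 13359.275, "GARAGE": 10.750,
}

def value_parcel(parcel):
    """Transform one parcel's raw attributes, then apply the Model 9 coefficients."""
    x = {name: 0.0 for name in MODEL9}
    x["AGE"] = parcel["age"]
    nb = "NB" + parcel["neighborhood"]
    if nb in x:  # assumed: a neighborhood with no NB variable is the base
        x[nb] = 1.0
    x[f"SQFT{parcel['quality']}"] = parcel["bldsize"]  # assumed code-to-variable mapping
    x["LAND75"] = parcel["land_sf"] ** 0.75
    x["BATHS"] = parcel["baths"]
    x["POOL"] = 1.0 if parcel["pool"] else 0.0
    x["GARAGE"] = parcel["garage_sf"]  # assumed to be square feet
    return CONSTANT + sum(MODEL9[name] * x[name] for name in MODEL9)

# Hypothetical parcel in an assumed base neighborhood.
parcel = {"age": 20, "neighborhood": "211004", "quality": 2, "bldsize": 1500,
          "land_sf": 10000, "baths": 2, "pool": False, "garage_sf": 400}
print(f"${value_parcel(parcel):,.0f}")
```

The transformations applied to each parcel must match the ones used when the model was calibrated; otherwise the coefficients are multiplied against the wrong quantities.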

Conclusion
Predicting assessments using Regression requires the appraiser to:
Explore data to determine relationships and clean up outliers
Specify which model and variables will be used
Transform variables and run regression
Review results, modify or add variables
Create predicted assessments and review ratio statistics
Value the population using final coefficients

The End
[Figure: scatter plot of SALE PRICES (0 to 500,000) against Predicted Values (0 to 500,000)]