Regression Analysis Predicts House Prices

MMGT6012
Business Tools for Management
TOPIC 4: Introduction to Regression

Dr. Matthew Beck
ITLS, Business School
The University of Sydney Page 1

4. Introduction to Regression
Class Activity
Graph the data in the table
Draw what you think is the straight line that fits best
Using this line:
– What do you predict X will be if Y = 4?
– How close is this prediction to the actual value?

Y    X
4    7
3    5
3    4
1    2
15   8
1    2
3    4
1    2
2    3
3    5

Class Activity
Graph the data in the table
Draw what you think is the straight line that fits best:
– What do you predict X will be if Y = 5?
– How close is this prediction to the actual value?
Now draw any line that best fits the data:
– What do you now predict X will be if Y = 5?
– How close is this prediction to the actual value?

Y    X
21   1
20   2
22   3
24   4
26   5
28   6
23   7
22   8
22   9
18   10
19   11
20   12

Class Activity
Finally, graph the data in the table
Draw what you think is the straight line that fits best:
– How well does the line fit the data?
Can you work out the equation of this line?

Y    X
12   5
18   8
12   5
10   4
12   5
14   6
16   7
2    0
8    3
14   6

What is Linear Regression?
Process of fitting a straight (linear) line that best fits the data
Estimating the equation for that straight line:

– Constant (what does Y equal when X equals zero)
– Slope (how much does Y go up or down as X changes)
– Some error (our line is not perfect)
How did you try to fit the line?
– What mental process did you use to fit your straight lines?

What is Linear Regression?

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

[Figure: the regression line drawn on X and Y axes, with Intercept = β0 and Slope = β1. At a given Xi, the gap between the Observed Value of Y for Xi and the Predicted Value of Y for Xi (the point on the line) is the Random Error εi for this Xi value.]

Objective is to minimise all errors!

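"Minimise all errors" is usually made precise as least squares: choose the intercept and slope that make the sum of squared errors as small as possible. The sketch below assumes that criterion (the slide does not name it) and applies it to the third Class Activity table; `fit_line` is a hypothetical helper name, not something from the course materials.

```python
# Data from the third Class Activity table (Y and X columns as listed).
ys = [12, 18, 12, 10, 12, 14, 16, 2, 8, 14]
xs = [5, 8, 5, 4, 5, 6, 7, 0, 3, 6]


def fit_line(xs, ys):
    """Return (b0, b1): the intercept and slope that minimise the sum of squared errors."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx           # slope
    b0 = y_bar - b1 * x_bar    # intercept: the line passes through (x_bar, y_bar)
    return b0, b1


b0, b1 = fit_line(xs, ys)
# Sum of squared errors: the quantity least squares makes as small as possible.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
print(f"Y = {b0:.2f} + {b1:.2f}X, sum of squared errors = {sse:.6f}")
```

For this table every point lies on one straight line, so the squared errors sum to essentially zero; with real data the same formulas give the line with the smallest total squared error.
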
Assumptions of Regression
Linear relationship between X and Y
No multicollinearity
– Independent variables are not correlated with each other
Normality of Error
– Error values (ε) are normally distributed for any given value of X
Homoscedasticity
– The probability distribution of the errors has constant variance
Independence of Errors
– Error values are statistically independent

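Clean Your Data!

Y    X
4    7
3    5
3    4
1    2
15   8
1    2
3    4
1    2
2    3
3    5

Fitted line: y = 1.5859x - 3.0606
[Figure: scatter plot of the table with the fitted line; X axis 0 to 10, Y axis 0 to 16]

Clean Your Data!

Y    X
4    7
3    5
3    4
1    2
5    8
1    2
3    4
1    2
2    3
3    5

Fitted line: y = 0.6263x - 0.0303
[Figure: scatter plot of the corrected table with the fitted line; X axis 0 to 10, Y axis 0 to 16]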
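
The effect of the single changed value (Y = 15 in the first table, Y = 5 in the second) can be checked by fitting both tables. This is a minimal sketch assuming an ordinary least-squares fit; `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


xs = [7, 5, 4, 2, 8, 2, 4, 2, 3, 5]
ys_raw = [4, 3, 3, 1, 15, 1, 3, 1, 2, 3]    # first table: Y = 15 at X = 8
ys_clean = [4, 3, 3, 1, 5, 1, 3, 1, 2, 3]   # second table: Y = 5 at X = 8

raw_intercept, raw_slope = fit_line(xs, ys_raw)
clean_intercept, clean_slope = fit_line(xs, ys_clean)

print(f"Before cleaning: y = {raw_slope:.4f}x + ({raw_intercept:.4f})")
print(f"After cleaning:  y = {clean_slope:.4f}x + ({clean_intercept:.4f})")
```

One changed observation out of ten moves the slope from about 1.59 to about 0.63, which is why the data should be checked before any line is fitted.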

Simple Linear Regression
Only one independent variable, X
Relationship between X and Y is described by a linear function
Changes in Y are assumed to be caused by changes in X

The Linear Regression Function
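In the population the regression model is:

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

– $Y_i$ = Dependent Variable
– $\beta_0$ = Constant
– $\beta_1$ = Slope Coefficient
– $X_i$ = Independent Variable
– $\varepsilon_i$ = Error Term
– $\beta_0 + \beta_1 X_i$ is the Linear Component; $\varepsilon_i$ is the Random Component

In the sample the regression model is:

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$

– $\hat{Y}_i$ = Estimated Y-value
– $\hat{\beta}_0$ = Estimated Constant
– $\hat{\beta}_1$ = Estimated Coefficient
– $X_i$ = Observed X-value
– The model is linear in the parameters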

Constant:
– The average value of Y when X is equal to zero
Slope coefficient:
– The average change in Y for a one unit change in X
Error:
– The difference between the observed Y and the predicted Y
– Also called the residual

Simple Regression - Example
Some facts about the property market:
– Currently ~9 million homes in Australia
– Average size of 245 m² (largest in world)
– Average number of residents = 2.6 (down from 3.1 in 1976)
– Average price in 2014 of $571,500
– Prices rose in Sydney by 12.2% in the year
– Accounts for $1.5 trillion of household debt

Thinking about buying a home:
– Relationship between house price and house size
Is a house fairly valued or not?
– Buy for yourself or buy as an investment
A random sample of 10 houses is selected:
– Dependent variable (Y) = house price in $1000s
– Independent variable (X) = house size in square metres

Price   Size
735     130
936     149
837     158
924     174
597     102
657     144
1215    218
972     223
957     132
765     158

1. Is the data clean?
2. What might the relationship look like?
3. How consistent might the relationship be?

[Figure: scatter plot of House Price ($'000) against house size; X axis 0 to 250, Y axis 0 to 1400]

1. Where is the straight line of best fit?
2. What is the linear equation for this straight line?

[Figure: scatter plot of House Price ($'000) against house size with the fitted line Y = 3.6665x + 277.26; the slope (3.6665) and the constant (277.26) are labelled. X axis 0 to 250, Y axis 0 to 1400]

𝑌 = 277.26 + 3.67𝑋
𝑃𝑟𝑖𝑐𝑒 = 277.26 + 3.67(𝑆𝑖𝑧𝑒)
The constant is the average value of Y when X = 0

– Can a house ever have a size of 0?
– So what does the constant actually tell us here?

𝑌 = 277.26 + 3.67𝑋
𝑃𝑟𝑖𝑐𝑒 = 277.26 + 3.67(𝑆𝑖𝑧𝑒)
The slope is the average change in Y when X changes by 1 unit

– As SIZE goes up by one unit (one square metre)
– House PRICE rises by 3.67 units ($3,670) on average
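
The constant and slope can be reproduced from the ten sampled houses. The sketch below assumes an ordinary least-squares fit; `predict_price` and the 150 m² example house are hypothetical and used only for illustration.

```python
# Sample of 10 houses from the slide: price in $1000s (Y) and size in square metres (X).
prices = [735, 936, 837, 924, 597, 657, 1215, 972, 957, 765]
sizes = [130, 149, 158, 174, 102, 144, 218, 223, 132, 158]

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
s_xx = sum((x - x_bar) ** 2 for x in sizes)

b1 = s_xy / s_xx          # slope: average change in price per extra square metre
b0 = y_bar - b1 * x_bar   # constant: fitted price at size 0 (outside the data)


def predict_price(size_m2):
    """Predicted price in $1000s for a house of the given size."""
    return b0 + b1 * size_m2


example_size = 150  # hypothetical house, chosen only for illustration
print(f"Price = {b0:.2f} + {b1:.2f}(Size)")
print(f"Predicted price at {example_size} m2: {predict_price(example_size):.1f} ($'000)")
print(f"Ten extra square metres add about {10 * b1:.1f} ($'000) on average")
```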

Model Performance
We estimate average impacts of X on Y:

– We may be interested in knowing how good these averages are
– How well we can explain changes in Y by the different values of X

Model Performance
Coefficient of Determination:
– Also called the R-Square (R²) value
The portion of the total variation in the dependent variable that is explained by variation in the independent variable

$0 \le R^2 \le 1$

[Figure: scatter plot of House Price ($'000) against house size with the fitted line Y = 277.26 + 3.67X and R² = 0.59; X axis 0 to 250, Y axis 0 to 1400]
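
The R² of 0.59 can be recomputed as one minus the share of the variation in price that the line leaves unexplained. This is a sketch under the assumption of a least-squares fit to the ten sampled houses; the variable names are illustrative.

```python
prices = [735, 936, 837, 924, 597, 657, 1215, 972, 957, 765]   # Y, $1000s
sizes = [130, 149, 158, 174, 102, 144, 218, 223, 132, 158]     # X, square metres

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
      / sum((x - x_bar) ** 2 for x in sizes))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in sizes]
sst = sum((y - y_bar) ** 2 for y in prices)               # total variation in Y
sse = sum((y - f) ** 2 for y, f in zip(prices, fitted))   # variation left unexplained
r_squared = 1 - sse / sst

print(f"R-squared = {r_squared:.2f}")
```

Read this as: about 59% of the variation in price across these ten houses is accounted for by differences in size, and the rest is left in the error term.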

R² and the Correlation Statistic (r)
[Figure: four scatter plots, each with the fitted line Ŷi = b0 + b1Xi]
– R² = 1, r = +1
– R² = 1, r = -1
– R² = .8, r = +0.9
– R² = 0, r = 0
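
With one independent variable, R² is the square of the correlation r, so r carries the direction of the relationship and R² does not. The sketch below checks this on the house sample; `correlation` is a hypothetical helper, and the mirrored price series is made up purely to show a negative r.

```python
import math

prices = [735, 936, 837, 924, 597, 657, 1215, 972, 957, 765]   # Y, $1000s
sizes = [130, 149, 158, 174, 102, 144, 218, 223, 132, 158]     # X, square metres


def correlation(xs, ys):
    """Pearson correlation coefficient r between xs and ys."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


r_up = correlation(sizes, prices)
# Hypothetical mirrored data: same scatter, but Y falls as X rises.
r_down = correlation(sizes, [-p for p in prices])

print(f"r = {r_up:+.2f}, r squared = {r_up ** 2:.2f}")
print(f"r = {r_down:+.2f}, r squared = {r_down ** 2:.2f}")
```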

Measuring Error
The standard error of the regression line:
– Represents the average error around the regression line
– How wrong the regression model is on average (in the units Y is measured in)
– Smaller values are better because it means the data are closer to the line

[Figure: two scatter plots of Y against X]
Measuring Error
The standard error is relative to the units of Y:
– The size of the error is relative to the size of the Y variable
We assume the error term is normal:
– About 95% of observations are within + or – 2 standard deviations of the mean!!
We can not only predict Y:
– We can also give a range of Y in which we are 95% confident the true Y lies
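
A rough 95% range follows from the "2 standard deviations" rule: prediction plus or minus twice the standard error of the regression. The sketch below assumes the usual formula, the square root of SSE / (n - 2), and ignores the extra uncertainty in the estimated constant and slope, so it is an approximation rather than an exact prediction interval. The 150 m² house is hypothetical.

```python
import math

prices = [735, 936, 837, 924, 597, 657, 1215, 972, 957, 765]   # Y, $1000s
sizes = [130, 149, 158, 174, 102, 144, 218, 223, 132, 158]     # X, square metres

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices))
      / sum((x - x_bar) ** 2 for x in sizes))
b0 = y_bar - b1 * x_bar

# Standard error of the regression: typical size of a residual, in $1000s.
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sizes, prices))
s = math.sqrt(sse / (n - 2))

example_size = 150  # hypothetical house, chosen only for illustration
predicted = b0 + b1 * example_size
low, high = predicted - 2 * s, predicted + 2 * s

print(f"Standard error of the regression: {s:.1f} ($'000)")
print(f"Predicted price: {predicted:.0f}, rough 95% range: {low:.0f} to {high:.0f} ($'000)")
```

The range is wide relative to the prediction, which reflects the moderate fit of the line to only ten houses.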

Class Activity
1. Predict the price for the above home:

– What range of prices are you 95% confident the true price will be in?
2. Is this apartment one you would invest in given your model?

Class Activity

Issues with Prediction
Can only use regression to predict like things:

– Are the houses in our data set representative of the house we are trying to
predict a value of Y (price) for?
– Can only predict Y within the range of the X values we have
[Figure: scatter plot of House Price ($'000) against house size; X axis 0 to 250, Y axis 0 to 1400. Annotation: Cannot predict values for large houses or small houses]

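
One way to enforce the "within the range of the X values we have" rule is to refuse predictions outside the sampled sizes. This is a minimal sketch using the rounded coefficients from the slides; `predict_price` and the two test sizes are hypothetical.

```python
sizes = [130, 149, 158, 174, 102, 144, 218, 223, 132, 158]   # sampled X values, m2

B0, B1 = 277.26, 3.67                     # rounded constant and slope from the slides
X_MIN, X_MAX = min(sizes), max(sizes)     # range of house sizes actually observed


def predict_price(size_m2):
    """Predicted price in $1000s, only for sizes inside the sampled range."""
    if not X_MIN <= size_m2 <= X_MAX:
        raise ValueError(
            f"size {size_m2} m2 is outside the sampled range {X_MIN}-{X_MAX} m2"
        )
    return B0 + B1 * size_m2


print(predict_price(150))        # inside the range: a prediction is returned
try:
    predict_price(300)           # larger than any sampled house
except ValueError as err:
    print("Refused:", err)
```
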
Measuring Error and Model Performance
[Figure: five scatter plots with fitted lines, X axis 0 to 500, each with more scatter around the line than the one before]

Fitted line              R²       Standard Error
y = 0.5014x + 9.6693     0.9994   0.000759380428589896
y = 0.5005x + 9.5696     0.9529   0.007062134927559
y = 0.4658x + 15.539     0.3818   0.03764192536044
y = 0.6328x - 19.306     0.0806   0.1357083240931
y = 0.522x + 30.018      0.0085   not shown

As we have more error in our model it gets harder to fit a line
What do you think it means if the best fitting line has NO SLOPE?
– Think about what a flat line tells you about Y as X goes up or down…
The slope is the average impact of X on Y:

– If that average has NO SLOPE what impact does X have on Y?

Testing the Impact of X on Y
If only there was some way of testing an average against a fixed value…
One Sample t Test:
– Test an average against a fixed value
We want to test the average impact of X against zero impact:
– H0: β1 = 0
– H1: β1 ≠ 0
If the slope is equal to zero (i.e., β1 = 0) then:
– Y = β0 + β1X1
Becomes:
– Y = β0

One Sample t Test:
– The formula is actually quite simple!

$t = \dfrac{b_1 - \beta_1}{S_{b_1}}, \qquad \text{d.f.} = n - 2$

– b1 = regression slope coefficient
– β1 = hypothesized slope (i.e. 0)
– Sb1 = standard error of the slope coefficient

Calculate t and compare it to the critical value of 1.96

[Figure: the same five scatter plots with fitted lines, X axis 0 to 500, now used to test each slope]

Fitted line              R²       Standard Error          t
y = 0.5014x + 9.6693     0.9994   0.000759380428589896    660.263 (compare to critical of 1.96)
y = 0.5005x + 9.5696     0.9529   0.007062134927559       not shown
y = 0.4658x + 15.539     0.3818   0.03764192536044        not shown
y = 0.6328x - 19.306     0.0806   not shown               not shown
y = 0.522x + 30.018      0.0085   not shown               not shown
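
The t statistic can be recomputed from the numbers printed on the charts. The sketch below assumes the "Standard Error" under each chart is the standard error of the slope, which is consistent with the reported t = 660.263 for the first chart. It uses the four slope and standard error pairs given on the Measuring Error and Model Performance slides.

```python
# (slope, standard error) pairs as printed on the charts; the fifth chart shows no standard error.
charts = [
    (0.5014, 0.000759380428589896),
    (0.5005, 0.007062134927559),
    (0.4658, 0.03764192536044),
    (0.6328, 0.1357083240931),
]

HYPOTHESISED_SLOPE = 0.0   # H0: the slope is zero
CRITICAL_VALUE = 1.96      # critical value used on the slides

t_values = [(b1 - HYPOTHESISED_SLOPE) / se for b1, se in charts]

for (b1, se), t in zip(charts, t_values):
    verdict = "reject H0" if abs(t) > CRITICAL_VALUE else "do not reject H0"
    print(f"slope = {b1:.4f}, t = {t:.3f}: {verdict}")
```

The first value comes out very slightly above 660.263, presumably because the chart shows the slope rounded to four decimal places.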

Class Activity
Does house size have a significant impact on house price?
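
One way to approach this question is to apply the slope t test to the ten sampled houses. The sketch assumes a least-squares fit and the standard formula for the standard error of the slope, the standard error of the regression divided by the square root of the sum of squared deviations of X. With n = 10 there are 8 degrees of freedom, for which standard t tables give a two-tailed 5% critical value of about 2.306; the slides' 1.96 is the large-sample value.

```python
import math

prices = [735, 936, 837, 924, 597, 657, 1215, 972, 957, 765]   # Y, $1000s
sizes = [130, 149, 158, 174, 102, 144, 218, 223, 132, 158]     # X, square metres

n = len(sizes)
x_bar = sum(sizes) / n
y_bar = sum(prices) / n
s_xx = sum((x - x_bar) ** 2 for x in sizes)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(sizes, prices)) / s_xx
b0 = y_bar - b1 * x_bar

sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(sizes, prices))
s = math.sqrt(sse / (n - 2))      # standard error of the regression
se_b1 = s / math.sqrt(s_xx)       # standard error of the slope coefficient
t = (b1 - 0) / se_b1              # test against a hypothesised slope of zero

print(f"b1 = {b1:.2f}, Sb1 = {se_b1:.2f}, t = {t:.2f}, d.f. = {n - 2}")
print("Exceeds 1.96:", abs(t) > 1.96, "| exceeds 2.306:", abs(t) > 2.306)
```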
