25/10/2018

Mathematics as a Tool
Data Management 06
(Linear Regression and Correlation)

[Figure: Poverty Incidence against Family Size shown three ways: data displayed as a table, data displayed as a scatter plot, and a best-fit line drawn through the data. Axes: Family Size (0 to 9) and Poverty Incidence (0 to 60).]

Best Fit Line

• Makes possible relationships between data sets more apparent
• Can also be used for prediction, either by interpolation or extrapolation
• There are different methods to find a best-fit line
  - One method: Method of Least Squares
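Method of Least Squares*

Given n data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$,

we want to find an equation of a line $y = \beta_1 x + \beta_0$

such that the sum of the squares of the residuals is minimized.

$$\beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$

$$\beta_0 = \frac{\sum x^2 \sum y - \sum x \sum xy}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{or} \quad \beta_0 = \frac{\sum y - \beta_1 \sum x}{n} = \bar{y} - \beta_1 \bar{x}$$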

* Labels for parameters (i.e., 𝜷𝟎 , 𝜷𝟏 ) are different from the book.
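
As a rough sketch of how these formulas can be evaluated, the Python function below computes the slope and intercept from the sums. It is not part of the original slides: the name `least_squares_line` and the four sample points are hypothetical, chosen so the answer is easy to check by hand.

```python
def least_squares_line(xs, ys):
    """Return (beta1, beta0) for the least squares line y = beta1*x + beta0."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    beta1 = (n * sum_xy - sum_x * sum_y) / denom
    beta0 = (sum_y - beta1 * sum_x) / n  # same as y_bar - beta1 * x_bar
    return beta1, beta0


# Illustrative (made-up) points that lie exactly on y = 2x + 1.
beta1, beta0 = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(beta1, beta0)  # 2.0 1.0
```

The intercept uses the second form of the formula, which reuses the slope that has already been computed.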

ILLUSTRATION

[Figure: scatter plot of Poverty Incidence against Family Size with the least squares line y = 6.5798x + 0.7536]

According to the best fit line, what could be the poverty incidence for a family of 9?
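
Answering this amounts to substituting x = 9 into the best-fit equation. A minimal sketch in Python, with hypothetical function and constant names and the coefficients taken from the line shown in the illustration:

```python
# Coefficients of the best-fit line y = 6.5798x + 0.7536 from the illustration.
BETA_1 = 6.5798  # slope
BETA_0 = 0.7536  # intercept


def predicted_poverty_incidence(family_size):
    """Evaluate the best-fit line at the given family size."""
    return BETA_1 * family_size + BETA_0


estimate = predicted_poverty_incidence(9)
print(round(estimate, 2))  # 59.97
```

Whether this counts as interpolation or extrapolation depends on whether a family size of 9 lies inside the range of the observed data.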

Will the least squares line always fit well?

[Figure: four scatter plots with their least squares lines: y = 6.5798x + 0.7536; y = 7.0476x + 17.786; y = -5.7662x + 69.393; y = -1.5011x + 24.945]

Linear Correlation Coefficient

• The linear correlation coefficient, r, is a measure of the degree of ‘fitness’ of the least squares line.

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2} \cdot \sqrt{n \sum y^2 - \left(\sum y\right)^2}}$$

• Range of values is from -1 to +1
  r = -1 or +1 → perfect linear relationship
  r = 0 → no linear relationship
• *We can say that a strong linear relationship exists if r > 0.7 or r < -0.7
* Use with caution
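
The formula for r and the 0.7 rule of thumb can be sketched in Python as follows. This is an illustration only: the function names `correlation_coefficient` and `is_strong` and both small data sets are made up for the example, and the 0.7 threshold carries the same "use with caution" warning as on the slide.

```python
from math import sqrt


def correlation_coefficient(xs, ys):
    """Linear correlation coefficient r computed from the raw sums."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")

    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt(n * sum_x2 - sum_x ** 2) * sqrt(n * sum_y2 - sum_y ** 2)
    if denominator == 0:
        raise ValueError("r is undefined when all x or all y values are equal")
    return numerator / denominator


def is_strong(r, threshold=0.7):
    """Rule of thumb from the slide: strong if r > 0.7 or r < -0.7."""
    return abs(r) > threshold


# Made-up points on a falling straight line: r should be -1 up to rounding.
r_falling = correlation_coefficient([1, 2, 3, 4], [8, 6, 4, 2])

# Made-up points that rise but do not lie exactly on a line.
r_scattered = correlation_coefficient([1, 2, 3, 4], [2, 4, 5, 8])

print(r_falling, is_strong(r_falling))
print(r_scattered, is_strong(r_scattered))
```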
ILLUSTRATION

For the line y = 6.5798x + 0.7536 (R² = 0.981), r = 0.9904, computed from

$$r = \frac{n \sum xy - \sum x \sum y}{\sqrt{n \sum x^2 - \left(\sum x\right)^2} \cdot \sqrt{n \sum y^2 - \left(\sum y\right)^2}}$$

[Figure: the four scatter plots with their least squares lines and R² values: y = 6.5798x + 0.7536 (R² = 0.981); y = 7.0476x + 17.786 (R² = 0.5148); y = -5.7662x + 69.393 and y = -1.5011x + 24.945 (R² = 0.3309 and R² = 0.09)]

NOTE: In Excel, you can obtain the equation of the least squares line and the value of R².
EXERCISE

• Find the least squares line relating the HDI rank (x) of an ASEAN country to its CO2 emissions per capita (y), and compute the correlation coefficient of this line.

(x) HDI Rank
or the Human Development Index Rank, considers a country’s life expectancy, education levels, and wealth.

(y) CO2 emissions per capita
measures the human-originated carbon dioxide emissions stemming from the burning of fossil fuels, gas flaring and the production of cement, divided by midyear population.

ASEAN Country    HDI Rank    CO2 emissions per capita (2013 levels, in tonnes)
Brunei              30          18.9
Cambodia           143           0.4
Indonesia          113           1.9
Laos               138           0.3
Malaysia            59           8.0
Myanmar            145           0.2
Philippines        116           1.0
Singapore            5           9.4
Thailand            87           4.5
Vietnam            115           1.7

Extracted from http://hdr.undp.org/en/data

USING MS EXCEL

[Figure: Excel scatter plot of CO2 emissions per capita (tonnes) against HDI Rank for the same table, with trendline y = -0.1057x + 14.678 and R² = 0.7426]

Calculator

EXERCISE*

* Variable name in calculator may be different or reversed!

$\beta_1 = 2.7303$, $\beta_0 = -3.3164$, $r = 0.9937$
$\beta_1 = 3.2120$, $\beta_0 = -1.0924$, $r = 0.9864$
$\beta_1 = 3.1296$, $\beta_0 = -5.5472$, $r = 0.9985$

EXERCISE

$\beta_1 = 3.1296$, $\beta_0 = -5.5472$, $r = 0.9985$

a. Find the mean of the stride lengths ($\bar{x}$) and speeds ($\bar{y}$) found in the given data set.
b. Is the point ($\bar{x}$, $\bar{y}$) on the least-squares line?
c. Predict the speed of a camel whose stride length is 2.7 m.
d. If a camel’s average speed is found to be 8 m/s, what could be its stride length based on the least squares model?
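
Parts (c) and (d) use the same line in opposite directions: (c) evaluates it, and (d) solves it for x. The sketch below assumes that 3.1296 is the slope and -5.5472 is the intercept of the camel model (the subscripts are not legible in the extracted slide), with stride length in metres and speed in m/s. The function names are hypothetical.

```python
# Assumed reading of the calculator output: slope first, intercept second.
BETA_1 = 3.1296   # slope (assumption)
BETA_0 = -5.5472  # intercept (assumption)


def predict_speed(stride_length_m):
    """Speed in m/s predicted by the line for a given stride length in m."""
    return BETA_1 * stride_length_m + BETA_0


def stride_for_speed(speed_m_per_s):
    """Invert the line: stride length in m that the model pairs with a speed."""
    return (speed_m_per_s - BETA_0) / BETA_1


print(f"{predict_speed(2.7):.2f}")     # 2.90
print(f"{stride_for_speed(8.0):.2f}")  # 4.33
```

If the calculator reports the two coefficients in the opposite order, the constants would need to be swapped before using either function.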
EXERCISE

An aerobic exercise instructor remembers the data given in the following table, which shows the recommended maximum exercise heart rates for individuals of the given ages.

Age (x years)                          20    40    60
Maximum heart rate (y beats/minute)   170   153   136

a. Find the linear correlation coefficient for the data.
b. What is the significance of the value found in (a)?
c. Find the equation of the least squares line.
d. Use the equation from (c) to predict the maximum exercise heart rate for a person who is 72.
e. Is the procedure in part (d) an example of interpolation or extrapolation?

Correlation does not imply causation!

[Figure: scatter plot of Poverty Incidence against Family Size]

Note: While correlation may indicate a relationship between two variables, that relationship may not actually exist, and correlation alone does not show that one variable causes the other.

Spurious correlations exist

http://tylervigen.com/
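EXTRA: Deriving the least squares line $y = \beta_1 x + \beta_0$

$$\beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2}$$

$$\beta_0 = \frac{\sum x^2 \sum y - \sum x \sum xy}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{or} \quad \beta_0 = \frac{\sum y - \beta_1 \sum x}{n} = \bar{y} - \beta_1 \bar{x}$$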
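
One standard route to the formulas for the slope and intercept is sketched below. It assumes the usual calculus argument, since the slide's own derivation steps are not reproduced in the text: write the sum of squared residuals as a function S of the two parameters, set both partial derivatives to zero, and solve the resulting pair of equations.

```latex
\begin{align*}
S(\beta_0, \beta_1) &= \sum_{i=1}^{n} \left( y_i - \beta_1 x_i - \beta_0 \right)^2 \\
\frac{\partial S}{\partial \beta_0} &= -2 \sum \left( y_i - \beta_1 x_i - \beta_0 \right) = 0
  \;\Longrightarrow\; \sum y = \beta_1 \sum x + n \beta_0 \\
\frac{\partial S}{\partial \beta_1} &= -2 \sum x_i \left( y_i - \beta_1 x_i - \beta_0 \right) = 0
  \;\Longrightarrow\; \sum xy = \beta_1 \sum x^2 + \beta_0 \sum x \\
\beta_0 &= \frac{\sum y - \beta_1 \sum x}{n} = \bar{y} - \beta_1 \bar{x} \\
\beta_1 &= \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left( \sum x \right)^2}
\end{align*}
```

The fourth line comes from solving the first condition for the intercept. Substituting it into the second condition and multiplying through by n gives the last line.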

EXTRA: Calculator Exercise


For the given data set, find the required measures.

45 89 95 95 98 99 101 106 106 110

a. Mean
b. Sample standard deviation
c. Sample variance
d. Population standard deviation
e. Population variance
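
One way to check calculator answers for this exercise is Python's standard `statistics` module, sketched below. `stdev` and `variance` divide by n - 1 (sample measures), while `pstdev` and `pvariance` divide by n (population measures). The variable name `data` is only for illustration.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

# The data set given in the exercise.
data = [45, 89, 95, 95, 98, 99, 101, 106, 106, 110]

results = {
    "a. Mean": mean(data),
    "b. Sample standard deviation": stdev(data),
    "c. Sample variance": variance(data),
    "d. Population standard deviation": pstdev(data),
    "e. Population variance": pvariance(data),
}

for label, value in results.items():
    print(f"{label}: {value:.4f}")
```

The sample and population values differ only in the divisor, so the sample variance is the population variance multiplied by n / (n - 1).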
