Mathematics as a Tool
Data Management 06
(Linear Regression and Correlation)
[Figure: two panels plotting Poverty Incidence (vertical axis, 0 to 60) against a horizontal axis running from 0 to 9. Left panel: data displayed as a scatter plot. Right panel: a best-fit line drawn through the data.]
ILLUSTRATION
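\[ \beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \]
\[ \beta_0 = \frac{\sum x^2 \sum y - \sum x \sum xy}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{or} \quad \beta_0 = \frac{\sum y - \beta_1 \sum x}{n} = \bar{y} - \beta_1 \bar{x} \]
[Figure: Poverty Incidence (vertical axis) against Family Size (horizontal axis), with best-fit line y = 6.5798x + 0.7536.]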
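The slope and intercept formulas can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function name `fit_line` and the sample points are assumptions chosen so the result is easy to check by hand.

```python
def fit_line(xs, ys):
    """Return (beta0, beta1) for the line y = beta0 + beta1 * x,
    computed with the sum formulas for the least squares line."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")

    beta1 = (n * sum_xy - sum_x * sum_y) / denom   # slope
    beta0 = (sum_y - beta1 * sum_x) / n            # intercept
    return beta0, beta1


# Hypothetical points that lie exactly on y = 1 + 2x
beta0, beta1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(beta0, beta1)  # 1.0 2.0
```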
According to the best fit line, what could be the poverty incidence for a family of 9?
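One way to answer this is to substitute x = 9 into the fitted line y = 6.5798x + 0.7536 shown in the illustration. The short sketch below does that; the names `slope`, `intercept`, and `predict` are illustrative only.

```python
# Coefficients read off the fitted line y = 6.5798x + 0.7536
slope = 6.5798
intercept = 0.7536


def predict(family_size):
    """Poverty incidence predicted by the best-fit line."""
    return slope * family_size + intercept


poverty_at_9 = predict(9)
print(round(poverty_at_9, 2))  # 59.97
```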
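[Figure: scatter plots with least squares lines; the equations shown include y = 7.0476x + 17.786, y = -5.7662x + 69.393, and y = -1.5011x + 24.945.]
The correlation coefficient r measures the ‘fitness’ of the least squares line.
\[ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - \left(\sum x\right)^2\right) \cdot \left(n \sum y^2 - \left(\sum y\right)^2\right)}} \]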
• Range of values is from -1 to +1
  r = -1 or +1: perfect linear relationship
  r = 0: no linear relationship
• *We can say that a strong linear relationship exists if r > 0.7 or r < -0.7
* Use with caution
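The correlation coefficient and the 0.7 rule of thumb can be sketched as below. The function names `pearson_r` and `describe` and the sample data are hypothetical; the threshold is the cautionary guideline from the slide, not a strict cutoff.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r computed with the sum formula."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    denom = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    if denom == 0:
        raise ValueError("r is undefined when x or y has no variation")
    return (n * sum_xy - sum_x * sum_y) / denom


def describe(r, threshold=0.7):
    """Apply the rule of thumb: |r| above the threshold counts as strong."""
    if abs(r) > threshold:
        return "strong linear relationship"
    return "not strong by the 0.7 rule of thumb"


# Hypothetical data with a nearly linear upward trend
r = pearson_r([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
print(describe(r))  # strong linear relationship
```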
25/10/2018
ILLUSTRATION
[Figure: four scatter plots with least squares lines and R² values. y = 6.5798x + 0.7536, R² = 0.981; y = 7.0476x + 17.786, R² = 0.5148; y = -5.7662x + 69.393 and y = -1.5011x + 24.945, with the R² values 0.3309 and 0.09 shown.]
\[ r = \frac{n \sum xy - \sum x \sum y}{\sqrt{\left(n \sum x^2 - \left(\sum x\right)^2\right) \cdot \left(n \sum y^2 - \left(\sum y\right)^2\right)}} \]
r = 0.9904
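For a least squares line with one predictor, R² is the square of r, so the two values on the slides can be cross-checked. The sketch below assumes that relationship and takes the sign of r from the slope; the helper name `r_from_r_squared` is illustrative.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Recover r from R^2: the magnitude is sqrt(R^2), the sign follows the slope."""
    return math.copysign(math.sqrt(r_squared), slope)


# First illustration: r = 0.9904 squares to the reported R^2 of 0.981
print(round(0.9904 ** 2, 3))  # 0.981

# Second panel: y = 7.0476x + 17.786 with R^2 = 0.5148
r_second = r_from_r_squared(0.5148, 7.0476)
print(round(r_second, 3))  # 0.717
```

Because the reported R² is rounded, taking its square root may differ from the exact r in the last decimal place.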
NOTE: In Excel, you can obtain the equation of the least squares line and the value of R².
USING MS EXCEL
… its CO2 emissions per capita (y), and compute the correlation coefficient of this line.

Country      x      CO2 emissions per capita (y)
Cambodia     143    0.4
Indonesia    113    1.9
Laos         138    0.3

[Figure: scatter plot with least squares line y = -0.1057x + 14.678, R² = 0.7426.]
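As a worked step, the correlation coefficient can be read off the chart values, assuming the R² shown belongs to the fitted line y = -0.1057x + 14.678. Since R² is the square of r and the slope is negative, r takes the negative root:

```latex
\[ r = -\sqrt{R^2} = -\sqrt{0.7426} \approx -0.8617 \]
```

By the 0.7 rule of thumb this would count as a strong negative linear relationship, subject to the same caution as before.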
β₁ = 2.7303, β₀ = -3.3164, r = 0.9937
β₁ = 3.1296, β₀ = -5.5472, r = 0.9985
β₁ = 3.2120, β₀ = -1.0924, r = 0.9864

a. Find the mean of the stride lengths (x̄) and speeds (ȳ) found in the given data set.
b. Is the point (x̄, ȳ) on the least-squares line?
c. Predict the speed of a camel whose stride length is 2.7 m.
d. If a camel’s average speed is found to be 8 m/s, what could be its stride length based on the least squares model?
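The steps in (a) to (d) can be sketched as below. The stride and speed values are made-up placeholders, not the data set from the exercise, and the helper `fit_line` is an illustrative name; only the procedure is meant to carry over.

```python
from statistics import mean

# Hypothetical stride lengths (m) and speeds (m/s); NOT the exercise data
strides = [2.5, 3.0, 3.2, 3.4, 3.5]
speeds = [2.3, 3.9, 4.4, 5.0, 5.5]


def fit_line(xs, ys):
    """Return (beta0, beta1) of the least squares line via the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    beta0 = (sum_y - beta1 * sum_x) / n
    return beta0, beta1


beta0, beta1 = fit_line(strides, speeds)

# (a) means of x and y
x_bar, y_bar = mean(strides), mean(speeds)

# (b) the line passes through (x_bar, y_bar) because beta0 = y_bar - beta1 * x_bar
on_line = abs((beta0 + beta1 * x_bar) - y_bar) < 1e-9

# (c) predicted speed for a stride length of 2.7 m
speed_at_2_7 = beta0 + beta1 * 2.7

# (d) solve 8 = beta0 + beta1 * x for the stride length x
stride_for_8 = (8 - beta0) / beta1

print(on_line, speed_at_2_7, stride_for_8)
```

Part (d) inverts the fitted line algebraically. Since 8 m/s may lie outside the observed speeds, the result is an extrapolation and should be read with care.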
Age (x years)                          20     40     60
Maximum heart rate (y beats/minute)    170    153    136

d. Use the equation from (c) to predict the maximum exercise heart rate for a person who is 72.
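A possible way to work part (d), assuming part (c) asked for the least squares line and using only the three data pairs visible in the table above. The helper name `fit_line` is illustrative.

```python
ages = [20, 40, 60]            # x, years
max_rates = [170, 153, 136]    # y, beats per minute


def fit_line(xs, ys):
    """Return (beta0, beta1) of the least squares line via the sum formulas."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    beta0 = (sum_y - beta1 * sum_x) / n
    return beta0, beta1


beta0, beta1 = fit_line(ages, max_rates)
predicted = beta0 + beta1 * 72

print(round(beta1, 2), round(beta0, 1))  # -0.85 187.0
print(round(predicted, 1))               # 125.8
```

With these three points the fitted line is y = -0.85x + 187, which gives about 125.8 beats per minute at age 72. That age lies outside the tabulated range, so the prediction is an extrapolation.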
Note: A strong correlation may suggest a relationship between two variables, but it does not establish that one causes the other; the association may be coincidental.
http://tylervigen.com/
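\[ \beta_1 = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - \left(\sum x\right)^2} \]
\[ \beta_0 = \frac{\sum x^2 \sum y - \sum x \sum xy}{n \sum x^2 - \left(\sum x\right)^2} \quad \text{or} \quad \beta_0 = \frac{\sum y - \beta_1 \sum x}{n} = \bar{y} - \beta_1 \bar{x} \]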
a. Mean
b. Sample standard deviation
c. Sample variance
d. Population standard deviation
e. Population variance
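The five measures listed can be computed with Python's standard `statistics` module, as in this minimal sketch. The data values are hypothetical, chosen so that the population results come out as whole numbers.

```python
import statistics

# Hypothetical data set, used only to illustrate the five measures
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean_value = statistics.mean(data)        # a. mean
sample_sd = statistics.stdev(data)        # b. sample standard deviation (divides by n - 1)
sample_var = statistics.variance(data)    # c. sample variance (divides by n - 1)
pop_sd = statistics.pstdev(data)          # d. population standard deviation (divides by n)
pop_var = statistics.pvariance(data)      # e. population variance (divides by n)

print(mean_value, pop_var, pop_sd)        # mean 5, population variance 4, population sd 2
print(round(sample_var, 4), round(sample_sd, 4))
```

The sample versions divide the sum of squared deviations by n - 1 and the population versions divide by n, so the sample values are always slightly larger for the same data.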