Professional Documents
Culture Documents
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 1 / 58
This week: Regression and Correlation
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 2 / 58
Overview
Scatterplot and Correlation
Least-Squares Line
Regression Diagnostic
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 3 / 58
Introduction
SLR Examples
I predict salary from years of experience
I Relationship between child’s height and their head circumference
I predict car speed from gas mileage
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 4 / 58
Example: Height and Weight
Variables:
Height Weight
60 84
62 95
64 140
66 155
68 119
70 175
72 145
74 197
76 150
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 5 / 58
Scatterplot and Correlation
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 6 / 58
Relationship among quantitative variables
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 7 / 58
Scatterplot
A scatterplot, also called a scattergraph or scatter diagram, is a plot
of the data points in a set. It plots data that takes two variables into
account at the same time. Here are some examples of scatterplots:
Example : Student’s Heights and their Weights
R-code:
plot(Weight~Height, xlab="Height", pch=18, cex=3,
cex.lab=1.8, col="blue")
R-output:
200
180
160
Weight
140
120
100
80
60 65 70 75
Height
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 8 / 58
Pearson correlation coefficient
Source: Dr.
http://www.pythagorasandthat.co.uk/scatter-
Fode Tounkara graphs
Statistics For Data Science May 26, 2022 9 / 58
Pearson correlation coefficient
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 10 / 58
Pearson correlation coefficient
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 11 / 58
Example 1: Height and Weight
R-code:
cor(Height,Weight, method="pearson")
R-output:
[1] 0.7586069
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 12 / 58
Least square regression
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 13 / 58
Least square regression (1)
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 14 / 58
Least square regression (2)
Least square regression is the technique of finding the line (in other words,
finding b1 and b2 ) that minimizes sum of the squared deviations,
n
X
(yi − b1 − b2 xi )2
i=1
b1 = ȳ − b2 x̄
Pn Pn
i=1 (xi − x̄)(yi − ȳ) xi yi − nx̄ȳ
b2 = Pn 2
= Pi=1
n 2 2
i=1 (xi − x̄) i=1 xi − nx̄
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 15 / 58
Example:
Therefore,
b2 = 208.13
101.08 = 2.059062 ≈ 2.059 and
b1 = 3.39 − 2.059062 ∗ 1 = 1.330938 ≈ 1.331
The least square regression line is: y = 1.331 + 2.059x
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 16 / 58
Some comments:
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 17 / 58
Least square estimates in R
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 18 / 58
Scatterplot with the regression Line
200
180
160
Weight
140
120
100
80
60 65 70 75
Height
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 19 / 58
Linear Regression under Normal distribution
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 20 / 58
Linear Regression under Normal distribution
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 21 / 58
Some assumptions:
(Y |X = x) ∼ N (β1 + β2 x, σ 2 )
The conditional mean of Y is a linear function of X
The conditional variance of Y , (σ 2 ) is constant
yi ’s are independent
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 22 / 58
Parameter estimation
The conditional distribution of Y is assumed to be Normal.
E[Yi |Xi = xi ] = β1 + β2 xi
var[Yi |Xi = xi ] = σ 2
The likelihood function of (Y1 = y1 , Y2 = y2 , ..., Yn = yn ) will be a function
of (x1 , x2 , ..., xn ), β1 , β2 and σ 2 =⇒
n
1 X
L(β1 , β2 , σ 2 |data) = (2πσ 2 )−n/2 exp[− (yi − β1 − β2 xi )2 ]
2σ 2 i=1
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 24 / 58
Properties of estimators of regression
parameters (1)
If we had the population level data, we would have been able to calculate
the “true" intercept and slope
I Population parameters: β1 and β2
If we keep taking random samples and keep calculating the intercept and
the slope we will get different values(likely)
I Estimators: B1 and B2 ← these two are random variables.
Recall: µ is the parameter, X̄ is the estimator(random variable) and x̄ is
the value from our sample ie. estimate.
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 25 / 58
Properties of estimators of regression
parameters (2)
B1 = Ȳ − B2 x̄
Pn
(x − x̄)(Yi − Ȳ )
Pn i
B2 = i=1 2
i=1 (xi − x̄)
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 26 / 58
Properties of estimators of regression
parameters (3)
B2 is a linear combinations of Yi ’s (which are bunch of Normal variables).
So is B1 .
Then both B1 and B2 follows Normal distribution.
B1 and B2 are unbiased estimators of β1 and β2
I E[B1 ] = β1
I E[B2 ] = β2
σ2
var[B2 ] = Pn 2
i=1 (xi − x̄)
We can write,
σ2
B2 ∼ N β2 , Pn 2
i=1 (xi − x̄)
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 27 / 58
Confidence interval/t-test for the slope
An unbiased estimator of σ 2 is
n
1 X
S2 = (Yi − b1 − b2 xi )2
n − 2 i=1
S2 = sum((y-y_pred)ˆ2)/8
SE_B2 = sqrt(S2/sum((x-mean(x))ˆ2))
SE_B2
## [1] 0.1022622
B2 −β2
Using T = q S2
as a test statistic which follows a t(df =n−2)
Pn
(xi −x̄)2
i=1
distribution we can conduct any test of hypothesis like
H0 : β2 = 0
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 31 / 58
Significance test for the Slope
Hypothesis
The pvalue is =⇒ pvalue = P | T |>| tobs |
The test statistic and the pvalue can be calculated using R.
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 32 / 58
Example 1: Height and Weight
R-code:
# Data set 1: Height versus Head Circum of 11 childs
Height = c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
Circ = c(17.5, 17.1, 17.1,17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)
Dat1 = data.frame(Height,Circ)
model1=lm(Height~Circ)
coefficients(summary(model1))
R-output:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.4931689 0.72968486 17.121321 3.558749e-08
Height 0.1827324 0.02756116 6.630072 9.590386e-05
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 33 / 58
Confidence Interval (C.I.)
To test whether there is no linear relationship between Y and x
(H0 : β = 0), another approach consists of constructing a C.I. for β.
A confidence interval for β is computed using the following formula:
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 34 / 58
Example 1: Height and Weight
R-output:
2.5 % 97.5 %
(Intercept) 10.8425070 14.1438307
Height 0.1203848 0.2450801
2 Test H0 : β = 0 at level 5 %.
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 35 / 58
Analysis of Variance (ANOVA)
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 36 / 58
Analysis of Variance (ANOVA)
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 37 / 58
ANOVA TABLE
The null hypothesis is: H0 : The regression line dosn’t fit well the
data
The usual way of setting out this test is to use an ANOVA table
Source of Degrees of Sum of squares Mean squares F
Variation Freedom (df) (SS) (MS)
RSS
Regression 1 RSS 1
RSS
1
F = ESS
n−2
ESS
Residual n-2 ESS n−2
Total n-1 TSS
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 38 / 58
Coefficient of Determination (or R2 )
I 1 ≤ R2 ≤ 1
I R2 near 1 suggests a good fit to the data.
I R2 = 1, ALL points fall exactly on the line (you can predict without error).
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 39 / 58
Coefficient of determination (R2 )
Pn
Total sum of square (TSS) = i=1 (yi − ȳ)2
TSS can be written as the sum of two terms:
Pn
I Error/Residual sum of square (ESS) = (yi − b1 − b2 xi )2
2
P n i=1 2
I Regression sum of square (RSS) = b2 i=1
(xi − x̄)
## [1] 437.009
ESS = sum((y-y_pred)ˆ2)
ESS
## [1] 8.456399
RSS = TSS - ESS
R2= RSS/TSS
R2
## [1] 0.9806494
R2 = 0.9806 =⇒ 98.06% variation in Y can be explained by the model/by
√in X.
the variation
√
r = R2 = 0.9806 = 0.9903 (why should “r" be +ve)
So, there a is strong +ve relationship between X and Y
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 41 / 58
Example: Heights (x) and Weights (y) of student
Find SSreg, SSE and R2 for this example.
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 42 / 58
ANOVA TABLE in R: Heights and Weight data
anova(model1)
Response: Circ
Df Sum Sq Mean Sq F value Pr(>F)
Height 1 0.39993 0.39993 43.958 9.59e-05 ***
Residuals 9 0.08188 0.00910
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that the F-statistic is equal to 43.958 with a p-value of 9.59e-05. Thus,
there is strong evidence that β is not equal to zero, and hence the least-squares
line fits well the data.
SSreg
The coefficient of determination R2 = SSreg+RSS 0.39993
= 0.39993+0.0818 = 0.83,
indicating that 83% of the variability in the response is explained by the
explanatory variable.
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 43 / 58
Regression Diagnostic
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 44 / 58
Conditions for Linear Regression
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 45 / 58
Condtion 1: Linearity Assumption
Circ
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 46 / 58
Condtion 2: Constant variance Assumption
Circ
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 47 / 58
Condtion 3: Normality assumption
Theoretical Quantiles
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 48 / 58
Practice Questions
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 49 / 58
Consider the following Regression output
Call:
lm(formula = Circ ~ Height)
Residuals:
Min 1Q Median 3Q Max
-0.16148 -0.05842 -0.01831 0.06442 0.12989
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.49317 0.72968 17.12 3.56e-08 ***
Height 0.18273 0.02756 6.63 9.59e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 50 / 58
Questions:
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 51 / 58
What have we Learn?
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 52 / 58
SLR in R and Python
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 53 / 58
Example 1:Height and Weight of 9 students
Variables
y: Height
x: Weight
Data in R
Height=c(60, 62, 64, 66, 68, 70, 72, 74, 76)
Weight=c(84, 95, 140, 155, 119, 175, 145, 197, 150)
Data in Python
Height=[60, 62, 64, 66, 68, 70, 72, 74, 76]
Weight=[84, 95, 140, 155, 119, 175, 145, 197, 150]
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 54 / 58
Regression in R
Data in R
Height=c(60, 62, 64, 66, 68, 70, 72, 74, 76)
Weight=c(84, 95, 140, 155, 119, 175, 145, 197, 150)
##
## Call:
## lm(formula = Height ~ Weight)
##
## Coefficients:
## (Intercept) Weight
## 51.8864 0.1151
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 55 / 58
Regression in R
summary(fit.lm)
##
## Call:
## lm(formula = Height ~ Weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0000 -2.0284 -0.8206 2.4170 6.8490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.88644 5.38322 9.639 2.73e-05 ***
## Weight 0.11510 0.03736 3.080 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.815 on 7 degrees of freedom
## Multiple R-squared: 0.5755, Adjusted R-squared: 0.5148
## F-statistic: 9.489 on 1 and 7 DF, p-value: 0.0178
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 56 / 58
Regression in Python
Data in R
Height=[60, 62, 64, 66, 68, 70, 72, 74, 76]
Weight=[84, 95, 140, 155, 119, 175, 145, 197, 150]
y=Height
x=Weight
Dr. Fode Tounkara Statistics For Data Science May 26, 2022 57 / 58
view model summary
print(model.summary())