
Statistics For Data Science

Topic: Simple Linear Regression

Dr. Fode Tounkara

May 26, 2022

This week: Regression and Correlation

Overview
Scatterplot and Correlation

Least-Squares Line

Statistical Inference about the Regression Slope

Analysis of Variance (ANOVA)

Regression Diagnostics

Introduction

Regression is a method for studying the relationship between two or more quantitative variables.

Simple linear regression (SLR):
- One quantitative dependent variable, Y: the response (dependent) variable
- One quantitative independent variable, X: the explanatory (predictor) variable

SLR examples:
- predict salary from years of experience
- relationship between a child's height and their head circumference
- predict car speed from gas mileage
Example: Height and Weight

Variables:

X: Height (measured in inches)

Y: Weight (measured in pounds)

Data: SRS of 9 students

Height Weight
60 84
62 95
64 140
66 155
68 119
70 175
72 145
74 197
76 150

Scatterplot and Correlation

Relationship among quantitative variables

Suppose we have two quantitative variables, X (say, income) and Y (say, expense).

We want to check whether there is any relationship between them.

Let (x1, x2, ..., xn) and (y1, y2, ..., yn) be the two corresponding data vectors.
- x1 is the income of the first individual and y1 is his/her expense.

A visual display of these two vectors can be obtained by drawing a scatter plot: plotting the yi's against the xi's, for i = 1, 2, ..., n.

Recall: a scatter plot suggests the direction and magnitude of the correlation
between X and Y

Scatterplot
A scatterplot, also called a scattergraph or scatter diagram, is a plot
of the data points in a set; it displays two variables at the same time.
Here is an example:
Example: Students' Heights and their Weights
R-code:
plot(Weight~Height, xlab="Height", pch=18, cex=3,
cex.lab=1.8, col="blue")

R-output: [scatterplot of Weight (y-axis, 80–200) against Height (x-axis, 60–75)]

Pearson correlation coefficient

[Examples of scatter graphs; source: http://www.pythagorasandthat.co.uk/scatter-graphs]
Pearson correlation coefficient

Think of a hypothetical line that goes through the points.

Direction of the line:
- the line is going upward =⇒ the correlation is positive.
- the line is going downward =⇒ the correlation is negative.

Closeness of the points to the line suggests the strength of the correlation:
- points closely clustered around the line =⇒ strong correlation
- points not so close to the line =⇒ moderate/weak correlation

If the points look totally random =⇒ no relationship between X and Y

Pearson correlation coefficient

The correlation coefficient, r, measures the linear relationship between two variables.

It's a unit-free number which ranges from -1 to 1.
- r = −1 =⇒ perfect negative correlation (all the points are exactly on a downward line)
- r = 1 =⇒ perfect positive correlation (all the points are exactly on an upward line)
- r = 0 =⇒ zero correlation

The correlation coefficient calculated from a sample is

$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $$
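
As a numerical check, here is a small R sketch computing r directly from this formula for the Height/Weight data introduced earlier; it agrees with R's built-in cor().

R-code:
Height = c(60, 62, 64, 66, 68, 70, 72, 74, 76)
Weight = c(84, 95, 140, 155, 119, 175, 145, 197, 150)
# numerator: sum of cross-deviations; denominator: root of the product of sums of squares
r = sum((Height - mean(Height)) * (Weight - mean(Weight))) /
    sqrt(sum((Height - mean(Height))^2) * sum((Weight - mean(Weight))^2))
r  # 0.7586069, same as cor(Height, Weight, method="pearson")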

Example 1: Height and Weight

R-code:
cor(Height,Weight, method="pearson")

R-output:
[1] 0.7586069

Interpretation: There is a strong positive relationship between students' heights and their weights.

Least square regression

Least square regression (1)

Let y = b1 + b2 x be the equation of the hypothetical line that we think goes through the points.

(yi − b1 − b2 xi) is the deviation of yi from the line.

Source: https://medium.com/statistical- guess/least- square- regression- 34aaef3f76ec

Least square regression (2)

Least squares regression is the technique of finding the line (in other words,
finding b1 and b2) that minimizes the sum of squared deviations,

$$ \sum_{i=1}^{n} (y_i - b_1 - b_2 x_i)^2 $$

Differentiating this expression with respect to b1 and b2 and equating to zero gives:

$$ b_1 = \bar{y} - b_2 \bar{x} $$

$$ b_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2} $$

Example:

   x      y     (x − x̄)   (x − x̄)²   (y − ȳ)   (y − ȳ)²   (x − x̄)(y − ȳ)
  3.9    8.9      2.9       8.41       5.51     30.360        15.979
  2.6    7.1      1.6       2.56       3.71     13.764         5.936
  2.4    4.6      1.4       1.96       1.21      1.464         1.694
  4.1   10.7      3.1       9.61       7.31     53.436        22.661
 −0.2    1.0     −1.2       1.44      −2.39      5.712         2.868
  5.4   12.6      4.4      19.36       9.21     84.824        40.524
  0.6    3.3     −0.4       0.16      −0.09      0.008         0.036
 −5.6  −10.4     −6.6      43.56     −13.79    190.164        91.014
 −1.1   −2.3     −2.1       4.41      −5.69     32.376        11.949
 −2.1   −1.6     −3.1       9.61      −4.99     24.900        15.469
 x̄ = 1  ȳ = 3.39        Σ = 101.08            Σ = 437.009    Σ = 208.13

Therefore, b2 = 208.13 / 101.08 = 2.059062 ≈ 2.059 and
b1 = 3.39 − 2.059062 × 1 = 1.330938 ≈ 1.331.

The least squares regression line is: y = 1.331 + 2.059x
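
The same estimates can be reproduced in R with a few lines; a quick sketch using the formulas above:

R-code:
x = c(3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1)
y = c(8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6)
b2 = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # 2.059062
b1 = mean(y) - b2 * mean(x)                                     # 1.330938
c(b1, b2)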

Some comments:

Least squares regression doesn't require any distributional assumption.

It is essentially how we would fit a linear regression line if we had population-level data.

I found this page online, which explains the concept of least squares interactively: https://setosa.io/ev/ordinary-least-squares-regression/. On the second graph of this page, try changing the intercept or slope value and see what happens graphically.

Least square estimates in R

Example 1: Height and Weight


R-code:
Height = c(60, 62, 64, 66, 68, 70, 72, 74, 76)
Weight = c(84, 95, 140, 155, 119, 175, 145, 197, 150)
model1 = lm(Weight ~ Height)
coef(model1)
R-output:
(Intercept)      Height
       -200           5
The least-squares estimates are: b1 = -200 and b2 = 5.
Thus, the least-squares equation is: y = −200 + 5x.

Scatterplot with the regression Line

[Scatterplot of Weight (y-axis) against Height (x-axis) with the fitted least-squares line, titled "Scatterplot of Height vs. Weight"]
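
A sketch of R code that would produce a figure like this one, assuming Height, Weight, and model1 from the previous slide:

R-code:
plot(Weight ~ Height, pch=18, col="blue",
     main="Scatterplot of Height vs. Weight")
abline(model1, col="red")  # overlay the fitted least-squares line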

Linear Regression under Normal distribution

Linear Regression under Normal distribution

Let's take a look at a hypothetical example (it may look a bit unrealistic).

X represents income category ($10K, $20K, etc.)
- For the sake of this example, say income can only be $10K or $20K or ... and nothing in between.

For each category of X, we have the expenses of 10,000 different individuals.

In total, we have 50,000 individuals in our population.
Some assumptions:
- (Y | X = x) ∼ N(β1 + β2 x, σ²)
- The conditional mean of Y is a linear function of X.
- The conditional variance of Y, σ², is constant.
- The yi's are independent.
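
To make these assumptions concrete, here is a small simulation sketch; the parameter values (β1 = 2, β2 = 0.6, σ = 3) are made up for illustration:

R-code:
set.seed(1)
x = rep(c(10, 20, 30, 40, 50), each = 100)        # income categories, in $K
y = rnorm(length(x), mean = 2 + 0.6 * x, sd = 3)  # (Y | X = x) ~ N(2 + 0.6x, 3^2)
plot(y ~ x)  # the mean of y rises linearly in x; the spread is the same in each column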

Parameter estimation
The conditional distribution of Y is assumed to be Normal.
E[Yi |Xi = xi ] = β1 + β2 xi
var[Yi |Xi = xi ] = σ 2
The likelihood function of (Y1 = y1, Y2 = y2, ..., Yn = yn) will be a function
of (x1, x2, ..., xn), β1, β2 and σ²:

$$ L(\beta_1, \beta_2, \sigma^2 \mid \text{data}) = (2\pi\sigma^2)^{-n/2} \exp\Big[ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2 \Big] $$

For any given σ², this likelihood is maximized when $\sum_{i=1}^{n} (y_i - \beta_1 - \beta_2 x_i)^2$ is minimized.

Hence, the optimization becomes the same as least squares regression (which does not involve any Normality assumption).

Therefore,

$$ \hat{\beta}_2 = b_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}, \quad \text{and} \quad \hat{\beta}_1 = b_1 = \bar{y} - b_2 \bar{x} $$
Interpretation of regression parameters

β1 represents the expected value of Y when X = 0.

β2 represents the change in the expected value of Y for a 1-unit increase in X.

Using our example:
- β1 is the average expense when income = 0.
- β2 is the change in the average expense when income increases by 1 unit (the unit here is $1000).

Properties of estimators of regression
parameters (1)

If we had the population-level data, we would have been able to calculate the "true" intercept and slope.
- Population parameters: β1 and β2

Instead, we observe a sample and calculate estimates of those parameters.
- Estimates: b1 and b2

If we keep taking random samples and keep calculating the intercept and the slope, we will (likely) get different values.
- Estimators: B1 and B2 ← these two are random variables.

Recall: µ is the parameter, X̄ is the estimator (a random variable), and x̄ is the value from our sample, i.e., the estimate.

Properties of estimators of regression
parameters (2)

We can re-write the equations of the estimators:

$$ B_1 = \bar{Y} - B_2 \bar{x} $$

$$ B_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

Technical note: since we are dealing with a bunch of conditional distributions, Y is the random variable here and x is treated as a fixed constant.

B2 can be expressed as

$$ B_2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})Y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

Properties of estimators of regression
parameters (3)
B2 is a linear combination of the Yi's (which are Normal random variables), and so is B1.

Then both B1 and B2 follow a Normal distribution.

B1 and B2 are unbiased estimators of β1 and β2:
- E[B1] = β1
- E[B2] = β2

var[B1] and var[B2] can be calculated and are functions of σ²:

$$ \text{var}[B_2] = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} $$

We can write

$$ B_2 \sim N\Big(\beta_2, \; \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\Big) $$
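
Both properties (unbiasedness and the variance formula) can be checked by simulation; a sketch with made-up values β1 = 1, β2 = 2, σ = 1:

R-code:
set.seed(42)
x = 1:20
B2_sim = replicate(5000, {
  y = 1 + 2 * x + rnorm(20)                      # one sample from the model
  sum((x - mean(x)) * y) / sum((x - mean(x))^2)  # slope estimate B2
})
mean(B2_sim)              # about 2: B2 is unbiased for beta2
var(B2_sim)               # close to the theoretical variance below
1 / sum((x - mean(x))^2)  # sigma^2 / sum((x - xbar)^2) = 0.0015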

Confidence interval/t-test for the slope
An unbiased estimator of σ² is

$$ S^2 = \frac{1}{n-2} \sum_{i=1}^{n} (Y_i - b_1 - b_2 x_i)^2 $$

Then, using the definition of the t-distribution,

$$ \frac{B_2 - \beta_2}{\sqrt{S^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2}} \sim t_{(n-2)} $$

Then the 100(1 − α)% confidence interval for β2 is

$$ B_2 \pm t_{(1-\alpha/2),(df=n-2)} \cdot SE(B_2), \quad \text{where } SE(B_2) = \sqrt{\frac{S^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}} $$

For the earlier numeric example: b2 = 2.06, SE(B2) = 0.1023, and t0.975,(8) = 2.306, so the 95% CI for β2 is 2.06 ± 2.306 × 0.1023 = (1.824, 2.296).
Using R
x = c(3.9, 2.6, 2.4, 4.1, -0.2, 5.4, 0.6, -5.6, -1.1, -2.1)
y = c(8.9, 7.1, 4.6, 10.7, 1.0, 12.6, 3.3, -10.4, -2.3, -1.6)

y_pred = 1.331 + 2.059*x

S2 = sum((y - y_pred)^2)/8

SE_B2 = sqrt(S2/sum((x - mean(x))^2))
SE_B2

## [1] 0.1022622
Using

$$ T = \frac{B_2 - \beta_2}{\sqrt{S^2 / \sum_{i=1}^{n}(x_i - \bar{x})^2}} $$

as a test statistic, which follows a t(df = n−2) distribution, we can conduct any test of hypothesis, such as

H0 : β2 = 0

(or some other value)
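
Continuing in R, the 95% CI and the two-sided test of H0 : β2 = 0 can be computed by hand; the values match the lm() output on the next slide:

R-code:
b2 = 2.059062
t_crit = qt(0.975, df = 8)        # 2.306
b2 + c(-1, 1) * t_crit * SE_B2    # 95% CI, about (1.82, 2.30)
t_obs = b2 / SE_B2                # 20.135, test statistic for H0: beta2 = 0
2 * pt(-abs(t_obs), df = 8)       # two-sided p-value, about 3.86e-08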


Using R
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.6727 -0.3960 0.1155 0.6541 1.3931
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.3309 0.3408 3.905 0.00451 **
## x 2.0591 0.1023 20.135 3.86e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.028 on 8 degrees of freedom
## Multiple R-squared: 0.9806, Adjusted R-squared: 0.9782
## F-statistic: 405.4 on 1 and 8 DF, p-value: 3.864e-08
Significance test for the Slope
Hypothesis:

H0 : β = 0 (x has no effect on Y) against HA : β ≠ 0

The test statistic:

$$ T = \frac{b - 0}{se(b)} \sim t_{n-2} \quad \text{when } H_0 \text{ is true} $$

The rejection region: at significance level α, we reject H0 if |t_obs| > |t_{α/2}|.

The p-value is P(|T| > |t_obs|).

The test statistic and the p-value can be calculated using R.
Example 2: Child's Height and Head Circumference

R-code:
# Data set: Height versus head circumference of 11 children
Height = c(27.75, 24.5, 25.5, 26, 25, 27.75, 26.5, 27, 26.75, 26.75, 27.5)
Circ = c(17.5, 17.1, 17.1, 17.3, 16.9, 17.6, 17.3, 17.5, 17.3, 17.5, 17.5)
Dat1 = data.frame(Height, Circ)
model1 = lm(Circ ~ Height)
coefficients(summary(model1))

R-output:
              Estimate  Std. Error   t value     Pr(>|t|)
(Intercept) 12.4931689  0.72968486 17.121321 3.558749e-08
Height       0.1827324  0.02756116  6.630072 9.590386e-05

H0 : β = 0 (Height has no effect on head circumference).

=⇒ b = 0.183; se(b) = 0.028; t_obs = 6.63; p-value = 9.590 × 10⁻⁵

At level α = 0.05, since the p-value = 9.590 × 10⁻⁵ < 0.05, we reject the null
hypothesis that height has no effect on head circumference.

Confidence Interval (C.I.)
To test whether there is no linear relationship between Y and x
(H0 : β = 0), another approach consists of constructing a C.I. for β.

A confidence interval for β is computed using the following formula:

$$ b \pm t^* se(b) = \big( b - t^* se(b) \;;\; b + t^* se(b) \big) $$

where
- t* is the t-critical value associated with the confidence level, determined from a t-distribution with df = n − 2;
- b is the least-squares estimate of the population slope, and se(b) is the standard error of b.

Since the alternative is two-sided (HA : β ≠ 0), we reject H0 if the C.I. doesn't contain zero.

We can use the R function confint() to compute the lower and upper limits of the C.I. for β.

Example 2: Child's Height and Head Circumference

1. Find the 95% C.I. for the regression slope (β).

R-code:
confint(model1, level=0.95)

R-output:
                 2.5 %     97.5 %
(Intercept) 10.8425070 14.1438307
Height       0.1203848  0.2450801

The 95% C.I. for the slope β is: (0.1204 ; 0.2451).

2. Test H0 : β = 0 at level 5%.

Since the C.I. (0.1204 ; 0.2451) doesn't contain 0, we reject H0.

Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA)

ANOVA can be used to assess the SLR goodness of fit.

The ANOVA method decomposes the total variability in the response variable (TSS) into two parts: variability explained by the regression (RSS) and unexplained variability (ESS).
- TSS = Σᵢ (yi − ȳ)², with df = n − 1
- RSS = Σᵢ (ŷi − ȳ)², with df = 1
- ESS = Σᵢ (yi − ŷi)², with df = n − 2

We can show that: TSS = RSS + ESS.

To have a good-fitting model, ESS should be small and RSS close to TSS.

We can perform an F-test which compares MSReg = RSS/1 and MSE = ESS/(n−2).

ANOVA TABLE

The null hypothesis is H0 : the regression line doesn't fit the data well.

The usual way of setting out this test is to use an ANOVA table:

Source of     Degrees of      Sum of squares   Mean squares
Variation     Freedom (df)    (SS)             (MS)            F
Regression    1               RSS              RSS/1           F = (RSS/1) / (ESS/(n−2))
Residual      n−2             ESS              ESS/(n−2)
Total         n−1             TSS

By comparing the F-statistic to the appropriate F-distribution, we (or the computer) can get a p-value.

Using R, we can get all ANOVA table components.

Coefficient of Determination (or R2 )

The proportion of total variability that is explained by the regression model is called the coefficient of determination (R²):
- R² = RSS/TSS = 1 − ESS/TSS
- 0 ≤ R² ≤ 1
- R² near 1 suggests a good fit to the data.
- R² = 1: ALL points fall exactly on the line (you can predict without error).

In SLR, squaring the correlation coefficient (r²) gives the coefficient of determination R² for the SLR model.

Coefficient of determination (R2 )
Total sum of squares: $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$

TSS can be written as the sum of two terms:
- Error/residual sum of squares: $ESS = \sum_{i=1}^{n} (y_i - b_1 - b_2 x_i)^2$
- Regression sum of squares: $RSS = b_2^2 \sum_{i=1}^{n} (x_i - \bar{x})^2$

It can be shown that

$$ TSS = RSS + ESS $$

The coefficient of determination (R²) is defined as

$$ R^2 = \frac{RSS}{TSS} $$

R² represents the proportion of variation in Y that can be explained by the model.

For simple linear regression (only one X variable), $r^2 = R^2$, i.e., $r = \pm\sqrt{R^2}$.
Example
For our numeric example,

TSS = sum((y - mean(y))^2)
TSS

## [1] 437.009

ESS = sum((y - y_pred)^2)
ESS

## [1] 8.456399

RSS = TSS - ESS
R2 = RSS/TSS
R2

## [1] 0.9806494
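
From the same pieces we can also compute the ANOVA F-statistic by hand; the values match the summary(lm(y ~ x)) output shown earlier (F = 405.4, p-value = 3.86e-08):

R-code:
n = length(y)
F_stat = (RSS/1) / (ESS/(n - 2))          # 405.4
pf(F_stat, 1, n - 2, lower.tail = FALSE)  # p-value: 3.86e-08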
R² = 0.9806 =⇒ 98.06% of the variation in Y can be explained by the model, i.e., by the variation in X.

r = √R² = √0.9806 = 0.9903 (why should r be positive here? Because the slope is positive.)

So, there is a strong positive relationship between X and Y.
Example: Heights (x) and Weights (y) of student
Find SSreg, SSE and R2 for this example.

 xi    yi     ŷi    (ŷi − ȳ)   (ŷi − ȳ)²   (yi − ŷi)   (yi − ŷi)²
 60    84    100      −40        1600         −16          256
 62    95    110      −30         900         −15          225
 64   140    120      −20         400          20          400
 66   155    130      −10         100          25          625
 68   119    140        0           0         −21          441
 70   175    150       10         100          25          625
 72   145    160       20         400         −15          225
 74   197    170       30         900          27          729
 76   150    180       40        1600         −30          900
                              SSreg = 6000             SSE = 4426

SSreg = 6000, SSE = 4426 and SST = 10 426.

R² = 6000 / 10426 = 0.58 =⇒ 58% of the variability in Weight is explained by Height.
The other 42% is “Error”.
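
A quick R check of these sums, assuming the Height and Weight vectors and the fitted line y = −200 + 5x from earlier:

R-code:
Weight_hat = -200 + 5 * Height
SSreg = sum((Weight_hat - mean(Weight))^2)  # 6000
SSE = sum((Weight - Weight_hat)^2)          # 4426
SSreg / (SSreg + SSE)                       # R-squared: 0.5755, about 0.58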

ANOVA TABLE in R: Height and Head Circumference data

anova(model1)

Analysis of Variance Table

Response: Circ
          Df  Sum Sq Mean Sq F value   Pr(>F)
Height     1 0.39993 0.39993  43.958 9.59e-05 ***
Residuals  9 0.08188 0.00910
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see that the F-statistic is equal to 43.958, with a p-value of 9.59e-05. Thus,
there is strong evidence that β is not equal to zero, and hence the least-squares
line fits the data well.

The coefficient of determination is R² = SSreg/(SSreg + SSE) = 0.39993/(0.39993 + 0.08188) = 0.83,
indicating that 83% of the variability in the response is explained by the
explanatory variable.

Regression Diagnostics

Conditions for Linear Regression

The residuals êᵢ = yi − ŷi can be used to check the regression conditions.

Condition 1: A linear model is appropriate for the data: no curvature, no influential points, no group with a different pattern.
- Check: scatterplot; plot of residuals against x or ŷ

Condition 2: The variation in the error terms is constant.
- Check: scatterplot; residual plot

Condition 3: The errors are normally distributed.
- Check: histogram, boxplot, normal quantile plot of residuals

Condition 4: The errors are independent (this essentially equates to independent observations).

Let's use the Height and Circumference data from Example 2 to check the
regression conditions (a combined R sketch follows).
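
A minimal R sketch for these checks, assuming model1 = lm(Circ ~ Height) from Example 2; here the residuals are plotted against the fitted values, one standard choice:

R-code:
resid = residuals(model1)                       # residuals e_i = y_i - y_hat_i
plot(Circ ~ Height, pch=18, col="blue")         # condition 1: look for a straight-line pattern
plot(resid ~ fitted(model1), ylab="Residuals")  # condition 2: constant spread around 0?
abline(h=0, col="red")
qqnorm(resid, pch=18, col="blue")               # condition 3: points near the line suggest normality
qqline(resid, col="red")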

Condition 1: Linearity Assumption

Graph: scatter plot of Height against Circ:

[Scatterplot of Height (y-axis, 24.5–27.5) against Circ (x-axis, 16.9–17.6)]

Condition 2: Constant Variance Assumption

Graph: scatter plot of the residuals against Circ:

[Plot of residuals (y-axis, −0.15–0.10) against Circ (x-axis, 16.9–17.6)]

Condition 3: Normality Assumption

Graph: QQ-plot of residuals.

qqnorm(resid, pch=18, cex=3, lwd=3, cex.lab=1.5, col="blue")
qqline(resid, col="red")

[Normal Q-Q plot of the sample quantiles of the residuals against the theoretical quantiles, with the qqline reference line]

Practice Questions

Consider the following Regression output

Call:
lm(formula = Circ ~ Height)

Residuals:
Min 1Q Median 3Q Max
-0.16148 -0.05842 -0.01831 0.06442 0.12989

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.49317 0.72968 17.12 3.56e-08 ***
Height 0.18273 0.02756 6.63 9.59e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09538 on 9 degrees of freedom


Multiple R-squared: 0.8301, Adjusted R-squared: 0.8112
F-statistic: 43.96 on 1 and 9 DF, p-value: 9.59e-05

Questions:

Using the above output, answer the following questions:

1. Find the correlation coefficient (r). Comment on the strength and direction of the linear relationship between the variables.
2. What are the estimated intercept and slope of the regression line? Write down the least-squares line equation.
3. Write in words the interpretation of the slope.
4. Test the hypothesis that there is no linear relationship between the child's height and their head circumference. What is the p-value?
5. Find and interpret the value of the coefficient of determination.

What have we Learned?

Know how to use a scatterplot to analyse a relationship between two quantitative variables.

Know how to use a regression line to describe a linear relationship between an explanatory variable and a response variable.

Understand the meaning of the slope of the least-squares regression line.

Know how to perform inference about the regression slope.

Understand how to use an ANOVA table to assess a regression's goodness of fit.

Know what the coefficient of determination is and how to compute it.

Understand how to use R to fit the regression line, perform the t-test and F-test, and check the regression assumptions.

SLR in R and Python

Example 1: Height and Weight of 9 students

Variables
y: Height
x: Weight
Data in R
Height=c(60, 62, 64, 66, 68, 70, 72, 74, 76)
Weight=c(84, 95, 140, 155, 119, 175, 145, 197, 150)

Data in Python
Height=[60, 62, 64, 66, 68, 70, 72, 74, 76]
Weight=[84, 95, 140, 155, 119, 175, 145, 197, 150]

Regression in R

Data in R
Height=c(60, 62, 64, 66, 68, 70, 72, 74, 76)
Weight=c(84, 95, 140, 155, 119, 175, 145, 197, 150)

Use lm(y~x) to perform a SLR in R


fit.lm=lm(Height~Weight)
fit.lm

##
## Call:
## lm(formula = Height ~ Weight)
##
## Coefficients:
## (Intercept) Weight
## 51.8864 0.1151

Regression in R
summary(fit.lm)

##
## Call:
## lm(formula = Height ~ Weight)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0000 -2.0284 -0.8206 2.4170 6.8490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 51.88644 5.38322 9.639 2.73e-05 ***
## Weight 0.11510 0.03736 3.080 0.0178 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.815 on 7 degrees of freedom
## Multiple R-squared: 0.5755, Adjusted R-squared: 0.5148
## F-statistic: 9.489 on 1 and 7 DF, p-value: 0.0178

Regression in Python

Data in Python
Height=[60, 62, 64, 66, 68, 70, 72, 74, 76]
Weight=[84, 95, 140, 155, 119, 175, 145, 197, 150]

y=Height
x=Weight

Import the statsmodels module

import statsmodels.api as sm

Add a constant to the predictor variable, then fit the linear regression model

#add constant to predictor variable
x = sm.add_constant(x)
model = sm.OLS(y, x).fit()

view model summary

print(model.summary())

## OLS Regression Results


## ========================================================================
## Dep. Variable: y R-squared:
## Model: OLS Adj. R-squared:
## Method: Least Squares F-statistic:
## Date: Thu, 26 May 2022 Prob (F-statistic):
## Time: 13:33:26 Log-Likelihood: -
## No. Observations: 9 AIC:
## Df Residuals: 7 BIC:
## Df Model: 1
## Covariance Type: nonrobust
## ========================================================================
## coef std err t P>|t| [0.025
## ------------------------------------------------------------------------
## const 51.8864 5.383 9.639 0.000 39.157
## x1 0.1151 0.037 3.080 0.018 0.027
## ========================================================================
## Omnibus: 1.604 Durbin-Watson:
## Prob(Omnibus): 0.448 Jarque-Bera (JB):
## Skew:                      0.721   Prob(JB):
