
General notes

E(y|x) = β0 + β1x
• Equation for the regression line in the graph
• Note that the equation has the same format as y = mx + b

yi = β0 + β1xi + εi
• What is epsilon (εi)?
o The vertical distance between the point on the regression line at xi and the actual scatter-plot point at xi
o Hence: yi = E(y|xi) + εi

Four assumptions regarding the epsilons (εi):

1. E(εi) = 0, and hence E(yi|xi) = β0 + β1xi
o The underlying relationship between x and y is linear.
2. Var(εi) = σε²
o Constant variance of the epsilons; the variance doesn’t change with, for example, the x values.
3. εi ~ N(0, σε²)
o The ~ sign means “distributed as”
o N(0, σε²) means Normal(mean, variance): the epsilons are normally distributed with mean 0 and variance σε².
4. The epsilons are uncorrelated with one another.
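These four assumptions can be checked numerically with a quick simulation. Below is a minimal Python sketch (Python rather than the R used later in these notes; the parameter values here are made up purely for illustration):

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration
rng = np.random.default_rng(0)
beta0, beta1, sigma, n = 2.0, 0.5, 1.0, 10_000

x = rng.uniform(0, 10, size=n)
eps = rng.normal(0.0, sigma, size=n)   # iid N(0, sigma^2): assumptions 1-4
y = beta0 + beta1 * x + eps            # y_i = beta0 + beta1*x_i + eps_i

# The simulated epsilons should average ~0 with variance ~sigma^2
mean_eps, var_eps = eps.mean(), eps.var(ddof=1)
```

With n this large, `mean_eps` lands near 0 and `var_eps` near σε² = 1, matching assumptions 1 and 2.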

Examples of regressions that violate the assumptions:

Consider a random sample of n observations:

yhat = b0 + b1x
yi = b0 + b1xi + ei
yi = yhat_i + ei

What is correlation?
 Two variables are correlated if there is a positive or negative linear relationship
 1 = perfectly positive correlation, -1 = perfectly negative, 0 = no correlation
• corr(x, y) = ρx,y = covariance(x, y) / (σx σy)
What is covariance?
 How x and y vary together
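The relationship between covariance and correlation can be sanity-checked in a few lines of Python (the data below are made up; y is roughly 2x, so the correlation should be near +1):

```python
import numpy as np

# Toy data (hypothetical)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

cov_xy = np.cov(x, y, ddof=1)[0, 1]                # sample covariance
corr = cov_xy / (x.std(ddof=1) * y.std(ddof=1))    # corr = cov / (sx * sy)
```

`np.corrcoef(x, y)[0, 1]` computes the same quantity directly and agrees with the ratio above.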

List of equations
b0 = ybar − b1(xbar)
ybar = b0 + b1(xbar)
b1 = sx,y / sx² = SSxy / SSxx = rx,y · sy / sx

o sx,y = sample covariance of the x’s and y’s

o sx² = sample variance of the x’s

o SSxy = total variability in the x’s and y’s combined

o SSxx = total variability in the x’s
rx,y = sx,y / (sx sy) = b1 sx / sy
o rx,y = sample correlation

R² = (rx,y)²
SSR = R² × SST
SSE=SST −SSR
MSE = SSE / (n−2) = se²
se = √MSE
SSxx = sx² × (n−1)
sb1 = se / √SSxx
sy² = SST / (n−1) → n = SST / sy² + 1
o This will be useful in problems where we’re not given n
SSyy = SST = ∑(yi − ybar)²

SSE = (1−R²) × SST; also SSE = MSE × (n−2)
MSR = SSR / dfR
o This is for the ANOVA table
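The whole equation list can be bundled into one helper. This is a Python sketch (not part of the course materials) that computes each quantity from raw data, using the same names as the list above:

```python
import numpy as np

def simple_regression(x, y):
    """Compute the simple-regression quantities from the equation list."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    SSxx = ((x - xbar) ** 2).sum()          # total variability in the x's
    SSyy = ((y - ybar) ** 2).sum()          # = SST
    SSxy = ((x - xbar) * (y - ybar)).sum()
    b1 = SSxy / SSxx                        # slope
    b0 = ybar - b1 * xbar                   # intercept
    r = SSxy / np.sqrt(SSxx * SSyy)         # sample correlation
    R2 = r ** 2
    SST = SSyy
    SSR = R2 * SST
    SSE = SST - SSR
    MSE = SSE / (n - 2)
    se = np.sqrt(MSE)
    sb1 = se / np.sqrt(SSxx)
    return dict(n=n, b0=b0, b1=b1, r=r, R2=R2, SST=SST, SSR=SSR,
                SSE=SSE, MSE=MSE, se=se, sb1=sb1)
```

Running it on any dataset lets you confirm the identities in the list, e.g. SSR + SSE = SST and b1 = r·sy/sx.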

Sample midterm – notes/walkthrough


(blue text indicates commentary/explanation)

A dataset contains n observations of y vs x where:


sx² = 1.09
sy² = 36.552
b0 = 42.59
b1 = −3.835
xbar = 9.937
SST = 3618.648

1. Find rx,y, R², sb1, se, SSR, SSE, MSE. (21 points)

rx,y = b1 sx / sy = (−3.835 × √1.09) / √36.552 = −0.66225
R² = (rx,y)² = (−0.66225)² = 0.4386
SSR = R² × SST = 0.4386 × 3618.648 = 1587.057
SSE = SST − SSR = 3618.648 − 1587.057 = 2031.59
MSE = SSE / (n−2) = 2031.59 / (n−2)
To find n:
n = SST / sy² + 1 = 3618.648 / 36.552 + 1 = 100

Hence:
MSE = 2031.59 / (100−2) = 20.731
se = √MSE = √20.731 = 4.553
sb1 = se / √SSxx = 4.553 / √(sx² × (n−1)) = 4.553 / √(1.09 × 99) = 0.4383
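As a sanity check, the Question 1 arithmetic can be reproduced in a few lines of Python (values copied from the problem statement):

```python
import math

# Given summary statistics
sx2, sy2 = 1.09, 36.552
b1, SST = -3.835, 3618.648

r = b1 * math.sqrt(sx2) / math.sqrt(sy2)   # sample correlation
R2 = r ** 2
n = round(SST / sy2 + 1)                   # recover n from SST and sy^2
SSR = R2 * SST
SSE = SST - SSR
MSE = SSE / (n - 2)
se = math.sqrt(MSE)
sb1 = se / math.sqrt(sx2 * (n - 1))        # SSxx = sx^2 * (n-1)
```

Each computed value matches the hand calculation above to rounding.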

2. Is there evidence that β1 is less than −2? Use α = 0.05


The wording of the question (“is there evidence…”) indicates that it is asking us to do a hypothesis test. We’re testing whether β1 is less than −2.
H0: β1 = −2
HA: β1 < −2
Our goal is to figure out if our test statistic (which we’ll calculate using the formula for t) is in the rejection region (RR). If it is, we reject the null hypothesis (H0). If the test statistic is outside the RR, we fail to reject H0. The rejection region is the following:
RR: t <−t α
In other words: if our t is less than −t α , we reject H0. Let’s start with calculating t :
t = (b1 − β*) / sb1 = (−3.835 − (−2)) / 0.4383 = −4.187 → this is our test statistic, t.

What is β*? It’s the value of β1 that we’re testing against in our hypothesis test (in this case, −2).

Now, we find t α :
Figure out if it’s a one-tailed or two-tailed test. If HA has “<” or “>” it’s one-tailed. If HA has “≠” it’s
two-tailed. In this case, we have a one-tailed test. That means that we’re looking for the bottom
portion of the data under the curve, as opposed to the middle data. Below is an example of one and
two-tailed tests where α =0.05 in both:

Whenever n (our sample size) is ≥ 30, that means that we can use t and z interchangeably.
Since n=100, we’ll use a z-table.
We know that α = 0.05. That means that we’re looking for the bottom (since it’s one-tailed) 95%, or 0.95, of the area under the curve. To find the correct value for zα (aka tα), we look for 0.95 inside the z-table. We find that 0.95 falls exactly between 0.9495 and 0.9505, whose z-values are 1.64 and 1.65 respectively. Thus, we take the average of the two and find a z-value of 1.645. This is our tα.
tα = 1.645
Now, all that’s left is to compare our t and −tα:
RR: t < −tα
−4.187 < −1.645
Since −4.187 < −1.645, we reject H0 in favor of HA at α = 0.05. Thus, there is evidence at α = 0.05 that β1 < −2.
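The same test can be expressed in Python; the critical value 1.645 is the one-tailed z at α = 0.05 found above:

```python
# One-tailed test of H0: beta1 = -2 vs HA: beta1 < -2 (values from Question 1)
b1, beta_star, sb1 = -3.835, -2.0, 0.4383
t_crit = 1.645                      # one-tailed z at alpha = 0.05

t = (b1 - beta_star) / sb1          # test statistic
reject_H0 = t < -t_crit             # rejection region: t < -t_alpha
```

Here `t` comes out near −4.187, which falls in the rejection region, so `reject_H0` is True.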
3. Construct a 95% CI for Beta1.
This is the structure for this kind of confidence interval:
b1 ± tα/2 × sb1

We already know b1 and sb1, so all that’s left is to find tα/2.

Since n is large, we can use z instead of t and look for z α / 2


Because we're looking for a 95% confidence interval, we know that α =0.05 (since 1−0.95=0.05 ).
α =0.05
If α =0.05 , then α /2=0.025
α / 2=0.025
If α /2=0.025 , then we’re looking for the bottom 97.5% of what’s under the curve. See the drawing
below:

We’re looking for z 0.025. To find it, we look inside the z-table for 0.975.
The z-value that corresponds with 0.975 is 1.96. That means that z 0.025 aka z α / 2 aka t α / 2=1.96.
z 0.025 =1.96=t α /2
Now, we just plug that back into the CI formula:
b 1 ± t α /2 × sb 1

−3.835 ± 1.96 × 0.4383 = (−4.694, −2.976)
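Plugging the numbers in, a quick Python check of the interval endpoints:

```python
# 95% CI for beta1: b1 +/- z_{alpha/2} * s_b1
b1, z_half, sb1 = -3.835, 1.96, 0.4383

half_width = z_half * sb1
lo, hi = b1 - half_width, b1 + half_width   # roughly (-4.694, -2.976)
```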

4. Construct a 90% CI for muy, the true population mean of the y values and interpret the CI.
Almost exactly the same as #3, except the variables are a bit different in this CI formula:
ybar ± tα/2 × (sy / √n)
Let’s start with finding t α/ 2. Since it’s a 90% CI, α =0.1 (because 1−0.9=0.1)
If α =0.1, then α /2=0.05
α =0.1
α / 2=0.05
To find t α / 2, we look for z α/ 2=z 0.05, which can be found by looking for 0.95 in the z-table.
Why 0.95? Because if z α / 2=z 0.05, then we need the bottom 1−0.05=0.95 under the curve.
z 0.05=1.645=t α /2

Next, we find the remaining variables.


sy² = 36.552 → sy = √sy² = √36.552 = 6.0458
√n = √100 = 10
ybar =b0 +b 1 ( xbar )=42.59+ (−3.835 ×9.937 )=4.4816

CI: 4.4816 ± 1.645 × (6.0458 / 10) = (3.487, 5.476)
We are 90% confident that μ y is in the above interval. This is because if we repeat the process many
times, then approximately 90% of the resulting intervals would include μ y .
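The full Question 4 computation, reproduced in Python from the given values:

```python
import math

# 90% CI for mu_y using the values derived above
b0, b1, xbar = 42.59, -3.835, 9.937
sy = math.sqrt(36.552)              # sample sd of y, ~6.0458
n = 100
z_half = 1.645                      # z for alpha/2 = 0.05

ybar = b0 + b1 * xbar               # ybar recovered from the fitted line
half_width = z_half * sy / math.sqrt(n)
lo, hi = ybar - half_width, ybar + half_width
```

`ybar` comes out near 4.4816 and the interval near (3.487, 5.476), matching the hand calculation.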

Consider the following regression performed in RStudio.

> model1a <- lm(mpg~wt, data = mtcars)


> summary(model1a)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.046 on 30 degrees of freedom


Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10

Based upon the regression output in RStudio, answer the following questions.

5. Is there evidence at alpha = .05 of a negative linear relationship between mpg and wt?
This is another hypothesis test. We’re trying to figure out if there is a negative linear relationship. Beta
is an indicator of that. If β is zero, there is no linear relationship. If β is positive, there is a positive
linear relationship. If β is negative, there is a negative linear relationship. Hence…
Below are the null and alternate hypotheses:
H0: β 1=0
HA: β 1< 0

Test statistic: t = b1 / sb1

These values can be found in the RStudio output we’re given. Below is a template for information from a simple regression in RStudio:

If we apply this to our current example, we find that:


b1 = −5.3445
sb1 = 0.5591

t = b1 / sb1 = −5.3445 / 0.5591 = −9.559

We also know that the p-value appears in the bottom-right corner of the coefficients table.
p-value = 1.29e-10 ≈ 0
Note that R reports a two-tailed p-value; for our one-tailed test we would halve it, which is still ≈ 0.

Recall that our rejection region is where the p-value < α.

In our case, p-value ≈ 0 and α = 0.05.
Since the one-tailed p-value of ≈ 0 is less than α = 0.05, we reject H0 in favor of HA. There is evidence that β1 < 0 at α = 0.05, which indicates evidence of a negative linear relationship.
6. What proportion of variability in mpg is explained by wt?
To answer this, we must know that the proportion of variability explained is represented by R².
We are given R² in the RStudio output where it says: “Multiple R-squared: 0.7528”
R² = 0.7528

7. What is the value of the sample standard deviation of the residuals? Provide an interpretation of this value
within the context of the problem.
To answer this, we must know that the sample standard deviation of the residuals is represented by se .
We’re given se where it says: “Residual standard error: 3.046”
se =3.046
We would expect approximately 95% of the observed y-values (mpg’s) to lie within a band extending 2se above and below the regression line.

All this means is that we expect 95% of the values to lie within 2 standard deviations of the line itself (this makes sense because we know that, for a standard bell curve, about 95% of the data lies within 2 standard deviations of the mean).

8. What is the sample correlation between mpg and wt?


To answer this, we must know that sample correlation is represented by r x , y . All we need to do is solve
for that value using the following equation:

rx,y = (√R²) × sign(b1)

What is sign( b1) ? It’s either -1 or 1, depending on if b 1 has a negative or positive sign in front of it.
In this case, since b 1=−5.3445, sign ( b 1) =−1.

rx,y = (√R²) × sign(b1) = (√0.7528) × (−1) = −0.8676
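The sign trick from Question 8 in Python form:

```python
import math

# Recover r from R^2, using the slope's sign: r = sqrt(R^2) * sign(b1)
R2, b1 = 0.7528, -5.3445

sign_b1 = -1 if b1 < 0 else 1
r = math.sqrt(R2) * sign_b1    # negative slope -> negative correlation
```

`r` comes out near −0.8676, agreeing with the hand calculation.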

8 (really 9). Complete an ANOVA table as we discussed in class.


Structure of an ANOVA table:
Source df SS MS
Regression Given in R-studio SSR MSR
Residual error n-2 SSE MSE
Total n-1 SST
MSE = se² = (3.046)² = 9.2781
MSE = SSE / (n−2) → SSE = MSE × (n−2) = 9.2781 × 30 = 278.343
How do we know that n−2 = 30? In the RStudio output, we’re told that there are “30 degrees of freedom.” Recall that the residual degrees of freedom equal n−2. If n−2 = 30, then n must be 32.
Now that we know that n = 32, we can also find n−1.
n−1 = 32−1 = 31
SSE = (1−R²) × SST → SST = SSE / (1−R²) = 278.343 / (1−0.7528) = 1125.983
SSR=SST −SSE=1125.983−278.343=847.64
Intersection of Regression and df:
In the R-studio information, we’re told “F-statistic: 91.38 on 1 and 30 DF”
When it says “on 1,” that tells us that the value in the top-left corner of the ANOVA table is 1.
MSR = SSR divided by the Regression df (the top-left box in the table) = 847.64 / 1 = 847.64

Hence, here is the completed ANOVA table:


Source df SS MS
Regression 1 847.64 847.64
Residual error 30 278.343 9.2781
Total 31 1125.983
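The whole ANOVA table can be rebuilt from just three numbers in the R summary (se, R², and the residual df), as a Python check:

```python
# From the R output: "Residual standard error: 3.046 on 30 degrees of freedom"
# and "Multiple R-squared: 0.7528"; Regression df = 1 ("on 1 and 30 DF")
se, R2, df_resid, df_reg = 3.046, 0.7528, 30, 1

MSE = se ** 2                  # ~9.2781
SSE = MSE * df_resid           # ~278.343
SST = SSE / (1 - R2)           # ~1125.98
SSR = SST - SSE                # ~847.64
MSR = SSR / df_reg
F = MSR / MSE                  # matches "F-statistic: 91.38" to rounding
```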
