
+

Stat 608 Chapter 2 (and 5)

Simple Linear Regression


+
Introduction: Simple Linear Regression Models
+
Simple Linear Regression (SLR)
Models
We always draw a scatterplot first to obtain an
idea of the strength (measured by correlation),
form (e.g., linear, quadratic, exponential), and
direction (positive or negative, in the linear
case) of any relationship that exists between two
variables.

On the next slide we see a strong (correlation close to 1), positive (as x increases, y increases), linear (straight-line) relationship between the percentage of voters in each state who voted for President Obama in 2008 and in 2012.
+
Simple Linear Regression (SLR)
Models
[Scatterplot: % Obama 2012 (vertical axis, 30 to 70) versus % Obama 2008 (horizontal axis, 40 to 70), one point per state]
+
Simple Linear Regression (SLR)
Models
When data are collected in pairs, the standard notation used to designate this is
(x1, y1), (x2, y2), . . . , (xn, yn),
where x1 denotes the first value of the X-variable and y1 denotes the first value of the Y-variable. The X-variable is called the explanatory variable, while the Y-variable is called the response variable.
+
Simple Linear Regression (SLR)
Models
For example, the first few entries in the election data set above were
(37.7, 40.8), (38.8, 38.4), (38.8, 36.9),
meaning 37.7% of Alabama voters voted for President Obama in 2008, while 40.8% did so in 2012, and so on.
+
Simple Linear Regression (SLR)
Models
We could use a matched pairs t-test to test whether the same percentage of voters voted for President Obama in 2008 and 2012 in the 50 states, on average. If the percentages were indeed equal, what would that mean for our regression model?

Aside: I wouldn't be able to conclude, just from the average percentages being equal, that the percentage of the country voting for Obama had stayed the same. Why not?
+
S.L.R. Models
The regression of Y on X is linear if

E(Y | X = x) = β0 + β1x,

where the unknown parameters β0 and β1 determine the intercept and the slope, respectively, of a specific straight line.

Note: the parabolic model

E(Y | X = x) = β0 + β1x²

is also a linear model, even though it graphs a parabola, because the mean of Y given X is related to the parameters via a linear function.
+
S.L.R. Models
+
S.L.R. Models
There will almost certainly be some variation in Y due strictly to random phenomena that cannot be measured, predicted, or explained. This variation is called random error, and is denoted ei.

In other words, the difference between what actually occurs and what our model predicts is the error. Our model doesn't explain everything; the amount of variability in our response variable that remains unexplained is measured by the error.
+
S.L.R. Models
Suppose that Y1, Y2, . . . , Yn are independent realizations of the random variable Y that are observed at the values x1, x2, . . . , xn of a random variable X. If the regression of Y on X is linear, then for i = 1, 2, . . . , n,

Yi = β0 + β1xi + ei,

where ei is the random error in Yi and is such that E(ei | X) = 0.
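As a quick sketch of the model Yi = β0 + β1xi + ei (the coefficients, x-values, and error standard deviation below are hypothetical, chosen only for illustration, not taken from the slides):

```python
import random

random.seed(608)  # fixed seed so the sketch is reproducible

beta0, beta1 = -4.0, 1.0               # hypothetical true intercept and slope
x = [40 + 0.5 * i for i in range(50)]  # 50 fixed x-values, one per "state"

# Y_i = beta0 + beta1 * x_i + e_i, with E(e_i | X) = 0
errors = [random.gauss(0, 2) for _ in x]
y = [beta0 + beta1 * xi + ei for xi, ei in zip(x, errors)]

# the random errors average out near zero, as the model assumes
mean_error = sum(errors) / len(errors)
```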
+
Continued
Notice that the random error term does not depend on x, nor does it contain any information about Y (otherwise it would be a systematic error).

For now, we assume the errors have constant variance:

Var(ei | X) = σ²
+
Introduction to Least Squares
Don't get these ideas confused:
+
Introduction to Least Squares
+Deriving And Interpreting
Parameter Estimates
+
Least Squares: Derive
parameter estimates
Goal: Minimize

RSS(β0, β1) = Σ (yi − β0 − β1xi)²

with respect to β0 and β1.

We have two equations with two unknowns. Set both partial derivatives equal to 0 and solve.
+
Least Squares: Derive
parameter estimates
First: take the derivative with respect to the intercept, and solve for the y-intercept:

β̂0 = ȳ − β̂1x̄

The least squares regression line goes through the point (x̄, ȳ).
+
Least Squares: Derive
parameter estimates
Second: take the derivative with respect to the slope, and solve for the slope:

β̂1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
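As a numeric sketch (using only the three data pairs quoted earlier, so the fit is illustrative rather than meaningful), the two estimates and the through-the-means property can be checked directly:

```python
# First few (% Obama '08, % Obama '12) pairs from the slides;
# a tiny illustration -- the real fit uses all 50 states
x = [37.7, 38.8, 38.8]
y = [40.8, 38.4, 36.9]

n = len(x)
xbar = sum(x) / n
ybar = sum(y) / n

# slope: cross-deviation sum over squared x-deviation sum
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
# intercept: forces the line through the point of means
b0 = ybar - b1 * xbar

fit_at_xbar = b0 + b1 * xbar  # equals ybar
```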
+
Example: R Code

election <- read.csv("/Users/Elizabeth//stat608/sp13/pct_obama.csv")
my.lm <- lm(election$Obama_12 ~ election$Obama_08)
my.lm

For more information about this linear model, type


summary(my.lm)
anova(my.lm)

To plot your x and y variables on a scatterplot, use:


plot(x,y)

For more information, syntax, and sometimes examples on a function


(e.g. the lm function), type:
?lm
+
Example: SAS Code
data election;
  infile "/Users/Elizabeth//stat608/sp13/pct_obama.csv";
  input state $ Pollclose $ Electoral Obama12 Romney12 Pred $ Obama08 Mccain08;
run;

proc reg data=election;
  model Obama12 = Obama08;
run;

proc gplot data=election;
  plot y*x;
run;
+
Example: Election Data
Explanatory variable: % voting for Obama in '08
Response variable: % voting for Obama in '12

Coefficients:
(Intercept) election$Obama_08
-4.461 1.042

Interpret the slope in context.


+
Least Squares: Matrix Setup
+
Derivatives of Matrices
Derivatives of matrices are defined as partials
with respect to the variable vector. For example:
+
Derivatives of Matrices
The previous example was a scalar-by-vector derivative; see this Wikipedia page:

http://en.wikipedia.org/wiki/Matrix_calculus#
Derivatives_with_matrices

It can also be found in the section on first-order derivatives in The Matrix Cookbook, equation (61), p. 9 (see eCampus).
+
Derivatives of Quadratic Forms

The gradient and Hessian of the quadratic form xᵀAx:

∂(xᵀAx)/∂x = (A + Aᵀ)x,   ∂²(xᵀAx)/∂x∂xᵀ = A + Aᵀ


+ Least Squares: Derive
parameter estimates in matrix
notation
+
Linear Models: Inference
+ 28

George Box

"All models are wrong, but some are useful."
+ 29

Inference
Objective: Develop hypothesis tests and
confidence intervals for the slope and
intercept of a least squares regression model
1. Assumptions: Next chapter
2. Bias of Estimates
3. Variability of Estimates
In English: Is the percentage of each state that voted for Obama in 2008 linearly related to the percentage that did so in 2012?
+
Review: Expectation and
Variance of Random Variables
Recall the definition of variance:

Var(X) = E[(X − E[X])²]

Covariance is similar. It is the numerator of correlation:

Cov(X, Y) = E[(X − E[X])(Y − E[Y])]
+
Expectation and Variance of
Random Variables
Assume X and Y are random variables, and a and b are constants. Then:

E(aX + bY) = aE(X) + bE(Y)
Var(aX + bY) = a²Var(X) + b²Var(Y) + 2ab Cov(X, Y)
+
Correlation
Correlation is unit-free: changing from meters to feet, for example, yields the same correlation.

Correlation does not change if the explanatory and


response variables are exchanged.

Correlation is always between -1 and 1.

Correlation closer to -1 or +1 indicates a stronger


relationship between the explanatory and response
variables; correlation closer to 0 indicates a weaker
association.
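These three properties can be checked numerically; the data below are made up for the sketch:

```python
def corr(x, y):
    """Sample correlation: cross-deviation sum over the product of deviation norms."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [37.7, 44.5, 54.0, 61.0]  # hypothetical small data set
y = [40.8, 43.6, 52.0, 62.1]

r = corr(x, y)
r_rescaled = corr([xi * 3.2808 for xi in x], y)  # change of units: same r
r_swapped = corr(y, x)                           # swap x and y: same r
```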
+
Correlation

When both x and y are above average, or both are below average, a positive value is contributed to the covariance sum. When exactly one is above average, a negative value is contributed.

When most values are in the bottom left and top right of the plot, the correlation is positive.

Sample covariance is the numerator of sample correlation.

+
Correlation
[Scatterplot: % Obama 2012 versus % Obama 2008, one point per state]

The correlation between the percents of states voting for Obama in 2008 and 2012 is r = 0.98.

This indicates a strong positive relationship between 2008 and 2012.
+
Expectation and Variance of
Matrices
Assume x is a vector of random variables. Then:
+
Variance / Covariance Matrix
+
Correlation Matrix

To calculate the correlation matrix, divide by the appropriate standard deviations elementwise.
+
Expectation and Variance of
Matrices
Assume X and Y are matrices of random
variables; x and y are vectors of random
variables; and A, B, and C are constant matrices.
Then:
+
Quadratic Form
+
Expectation of Quadratic Form
It can be shown that

E[xᵀAx] = tr(AΣ) + μᵀAμ,

where μ and Σ are the expected value and variance-covariance matrix of x, respectively, and tr denotes the trace of a matrix.

This result only depends on the existence of μ and Σ; in particular, normality of x is not required.
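The identity can be verified exactly on a small discrete example (a made-up two-point distribution for a 2-vector x, and a made-up symmetric matrix A):

```python
# x takes each of two values with probability 1/2 (hypothetical distribution)
outcomes = [((1.0, 2.0), 0.5), ((3.0, 0.0), 0.5)]
A = [[2.0, 1.0], [1.0, 3.0]]  # a symmetric 2x2 matrix

def quad(A, v):
    """The quadratic form v'Av."""
    return sum(v[i] * A[i][j] * v[j] for i in range(2) for j in range(2))

# left side: E[x'Ax] computed directly from the distribution
lhs = sum(p * quad(A, v) for v, p in outcomes)

# mu and Sigma of x
mu = [sum(p * v[i] for v, p in outcomes) for i in range(2)]
Exx = [[sum(p * v[i] * v[j] for v, p in outcomes) for j in range(2)]
       for i in range(2)]
Sigma = [[Exx[i][j] - mu[i] * mu[j] for j in range(2)] for i in range(2)]

# right side: tr(A Sigma) + mu'A mu
tr_ASigma = sum(A[i][k] * Sigma[k][i] for i in range(2) for k in range(2))
rhs = tr_ASigma + quad(A, mu)
```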
+ 41

Usual Assumptions

In matrix notation:

y = Xβ + e,   E(e | X) = 0,   Var(e | X) = σ²I
+
Bias

The bias of a statistic is the difference between its


expected value and the parameter it should estimate.
+ 43

Inferences About the Slope and


Intercept
Unbiased Estimates
+ Inferences About the Slope and 44

Intercept
Variance of the estimates
+
Inferences About the Slope and
Intercept
What does it mean to take the expectation of the
sample slope?
+ Inferences About the Slope 46

Distribution of slope:

Since the errors are normally distributed, yi = β0 + β1xi + ei is also normally distributed, given x. Since the sample slope is a linear combination of the yi's, it is also normally distributed.
+ Inferences About the Slope 47

Of course, σ is unknown, so we estimate it using the sample standard deviation. Then our test statistic has the t-distribution (with n − k degrees of freedom in general, k being the number of columns of X).
+
Inferences About the Slope

The standard error is the standard deviation of a statistic (often estimated by the sample standard deviation).

The margin of error is half the width of a confidence interval.

Don't mix up the critical value with the test statistic calculated from the data! Look up the critical value in a table, or use R.

σ², the variance of the errors, is estimated by MSE. So the square root of MSE, RMSE, is s in the equation above.
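A sketch of the whole chain on made-up data: fit the line, estimate σ by s = √MSE, and form the t statistic for the slope:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical data
y = [2.1, 3.9, 6.2, 8.1, 9.9]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
b0 = ybar - b1 * xbar

# s = sqrt(MSE), with n - k = n - 2 degrees of freedom in simple regression
rss = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = (rss / (n - 2)) ** 0.5

se_b1 = s / sxx ** 0.5  # standard error of the sample slope
t_stat = b1 / se_b1     # compare to a t distribution with n - 2 df
```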
+ Inferences About the Slope 49

For our election example, R output gives:

Coefficients:       Estimate Std. Error t value Pr(>|t|)
(Intercept)         -5.13608    1.72705  -2.974  0.00459 **
states50$Obama_08    1.05610    0.03363  31.401  < 2e-16 ***

Confidence interval for the slope:
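From the printed estimate and standard error, the 95% interval can be assembled by hand; the critical value t(0.025, 48) ≈ 2.0106 is supplied from a table here (in R, qt(0.975, 48)):

```python
b1, se_b1 = 1.05610, 0.03363  # estimate and SE from the R output above
t_crit = 2.0106               # t critical value, 95% confidence, 48 df

lower = b1 - t_crit * se_b1
upper = b1 + t_crit * se_b1
# note the interval contains 1: a slope of exactly 1 is plausible
```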
+
Inferences About the Slope

Hypothesis Test for whether the slope = 0:


+
More Inference
+ 52

Objectives: Regression Line

Develop hypothesis tests and confidence


intervals for the regression line:
1. Confidence intervals for the mean
2. Prediction intervals for individuals
In English: If 48% of a state voted for
Obama in 2008, what percentage voted
for him in 2012?
+ 53

Slope vs. regression line


+ 54

Slope vs. regression line


+ 55

Slope vs. regression line


Confidence interval for the slope:
If x increases by 1 unit, by how much does y increase or decrease?
Deals with the sampling distribution of the sample slope from one sample to another.
Can use the CLT to say the sample slope is approximately normal for large n.

Confidence interval for the regression line (mean):
For a given value of x, what is the mean value of y?
Deals with the sampling distribution of the sample mean from one sample to another.
Can use the CLT to say the sample mean is approximately normal for large n.

Prediction interval for the regression line:
For a given value of x, what are the individual values of y?
Deals with the distribution of individuals about the regression line.
The CLT is NOT useful in this case. We need to know the distribution of the errors.
+ 56

Confidence Intervals for the


Population Regression Line

We consider the problem of finding a confidence interval for the unknown population mean at a given value of X, which we shall denote by x*.

First, recall that the population mean at X = x* is given by

E(Y | X = x*) = β0 + β1x*
+ 57

Confidence Intervals for the


Population Regression Line
An estimator of this unknown quantity is the value of the estimated regression equation at X = x*, namely:

ŷ* = β̂0 + β̂1x*
+ 58

How many Populations are


Sampled?

One for each unique value of X = x*


How many populations were sampled in
the election data?
+ 59

How Many Means and Variances


do we Have?
___ means
___ variances
However, we assume that all of these populations have the same variance, σ².
+ 60

Variance of mean

Var(β̂0 + β̂1x*) = σ² (1/n + (x* − x̄)² / Σ (xi − x̄)²)
+ 61

Estimated Variance of mean

Replace σ² with its estimate s² = MSE:

s² (1/n + (x* − x̄)² / Σ (xi − x̄)²)
+ 62

Confidence Intervals for the


Population Regression Line
A 100(1 − α)% confidence interval for the population mean of Y at X = x* is given by

β̂0 + β̂1x* ± t(α/2, n−2) · s · √(1/n + (x* − x̄)² / Σ (xi − x̄)²),

where t(α/2, n−2) is the 100(1 − α/2)th quantile of the t-distribution with n − 2 degrees of freedom.
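A sketch of how the interval's margin of error behaves, on made-up x-values with s and the critical value assumed by hand:

```python
x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical x-values
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

s = 0.15        # sqrt(MSE), assumed for the sketch
t_crit = 3.182  # t(0.025, 3) for n - 2 = 3 df, from a table

def margin(xstar):
    """Half-width of the CI for the mean of Y at X = x*."""
    return t_crit * s * (1.0 / n + (xstar - xbar) ** 2 / sxx) ** 0.5

m_center = margin(xbar)  # narrowest at x* = xbar
m_edge = margin(5.0)     # wider as x* moves away from xbar
```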
+ 63

Confidence Intervals for the


Population Regression Line
Does the confidence interval for the mean y-value get
narrower as n gets bigger?

What happens to the width of the confidence interval as


your desired x-value is farther from the mean for x?
+ 64

Prediction Intervals for the Actual


Value of Y
+ 65

Prediction Intervals for the


actual value of Y
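The prediction interval for an individual response adds the error variance σ² to the variance of the estimated mean, so (in the same notation as the confidence interval for the mean) its standard form is:

```latex
\hat{\beta}_0 + \hat{\beta}_1 x^* \;\pm\; t_{\alpha/2,\,n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^n (x_i - \bar{x})^2}}
```

The extra 1 under the square root is the individual error variance; it does not shrink as n grows, which is why prediction intervals stay wide and why the result rests on the normality of the errors rather than the CLT.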
+
Dummy (Indicator) Variable Regression
+ 67

Dummy Variable Regression


Experiment: Does increasing calcium intake
decrease blood pressure?

For placebo group:

For calcium group:


+ 68

Dummy Variable Regression


For placebo group (x = 0): E(Y | x) = β0

For calcium group (x = 1): E(Y | x) = β0 + β1

+ 69

Dummy Variable Regression

21 total men; 10
took the calcium
supplement, and 11
took the placebo.
+ 70

Dummy Variable Regression

What if we had the intercept plus two variables, one an


indicator of taking the placebo and one an indicator of
taking the calcium supplement?

The columns of X must be linearly independent; that is,


the matrix X must be of full rank.
+ 71

Dummy Variable Regression


+ 72

Dummy Variable Regression


+ 73

Dummy Variable Regression


+ 74

Dummy Variable Regression

my.lm<-lm(y~x)
summary(my.lm)

Output:

            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.2727     2.2266  -0.122    0.904
x             5.2727     3.2267   1.634    0.119

Calcium group mean = 5, Placebo group mean = -0.27.


Difference between means = 5.27, as expected.
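The equality the slide points out is a general fact about a single 0/1 dummy with an intercept; a sketch on made-up blood-pressure decreases (not the slides' raw data):

```python
calcium = [7.0, 5.0, 3.0]   # x = 1 group (hypothetical values)
placebo = [1.0, -1.0, 0.0]  # x = 0 group (hypothetical values)

x = [1.0] * len(calcium) + [0.0] * len(placebo)
y = calcium + placebo
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar

mean_cal = sum(calcium) / len(calcium)
mean_pla = sum(placebo) / len(placebo)
# intercept = mean of the x = 0 group; slope = difference between group means
```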
+ 75

Dummy Variable Regression


+
Geometry
+ 77

Geometric Interpretation of
Let V be the vector space spanned by the columns of our design matrix X.
Then ŷ is the projection of the vector y down into the vector space V.
Is ŷ in V?
+ 78

Geometric Interpretation of

Projection (Hat) Matrix: H = X(XᵀX)⁻¹Xᵀ, so that ŷ = Hy

Projection is like a shadow of the real thing.


+ 79

Geometric Interpretation of
+ 80

Geometric Interpretation of
+ 81

Geometric Interpretation of

The dot product of any vector in V with the residual (error) vector is 0:
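This orthogonality can be checked on a tiny made-up design matrix (intercept plus one predictor), solving the normal equations with an explicit 2×2 inverse:

```python
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]  # hypothetical design matrix
y = [1.0, 2.0, 2.0]

# normal equations (X'X) b = X'y
XtX = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(2)]
det = XtX[0][0] * XtX[1][1] - XtX[0][1] * XtX[1][0]
b = [(XtX[1][1] * Xty[0] - XtX[0][1] * Xty[1]) / det,
     (XtX[0][0] * Xty[1] - XtX[1][0] * Xty[0]) / det]

resid = [yi - (b[0] * row[0] + b[1] * row[1]) for row, yi in zip(X, y)]
# each column of X (hence every vector in V) is orthogonal to the residuals
dots = [sum(row[i] * e for row, e in zip(X, resid)) for i in range(2)]
```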


+ 82

Geometric Interpretation of

Is it true that
What is
+
ANOVA Table
+
ANalysis Of VAriance (ANOVA)

The original sample variance of y is the total variance,


measuring the total amount of variability in the data
set.

The ANOVA table breaks that variability into two parts:


one explained by x, and the leftovers.
+ 85

ANalysis Of VAriance (ANOVA)


ANOVA Table
We're considering two sources of variation using this regression model:

Variation that can be explained by the model

Variation due to randomness or other variables not considered.

Data = Fit + Residual

yi = (β̂0 + β̂1xi) + êi

The regression model, whether using dummy variables or not, is not really about variances; the point of a regression model is to study how y changes with x.
+ 87

ANOVA Table

SST = RSS + SSReg
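The decomposition can be verified numerically on made-up data:

```python
x = [1.0, 2.0, 3.0, 4.0]  # hypothetical data
y = [1.2, 1.9, 3.2, 3.7]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) \
     / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)               # total
rss = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # residual (error)
ssreg = sum((yh - ybar) ** 2 for yh in yhat)          # regression (model)
```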


+
ANOVA Table

All else held constant:

If the strength of the correlation increases,


_____________ decreases.

If the slope of the line gets steeper, _______________


increases.
+
ANOVA Table

Source               Degrees of   Sums of   Mean      F-          P-
                     Freedom      Squares   Squares   Statistic   value
Regression (Model)   1
Residual (Error)     n − 2
Total                n − 1