
# Simple Linear Regression

## Introduction: Simple Linear Regression (SLR) Models
We always draw a scatterplot first to obtain an idea of the strength (measured by correlation), form (e.g., linear, quadratic, exponential), and direction (positive or negative, in the linear case) of any relationship that exists between two variables.

On the next slide we see a strong (correlation close to 1), positive (as x increases, y increases), linear (straight line) relationship between the percentage of voters in each state who voted for President Obama in 2008 and in 2012.
[Scatterplot: % Obama 2008 (x-axis) vs. % Obama 2012 (y-axis), showing a tight, positive linear trend.]
## Simple Linear Regression (SLR) Models

When data are collected in pairs, the standard notation used to designate this is

(x1, y1), (x2, y2), . . . , (xn, yn),

where x1 denotes the first value of the X-variable and y1 denotes the first value of the Y-variable. The X-variable is called the explanatory variable, while the Y-variable is called the response variable.
For example, the first few entries in the election data set above were

(37.7, 40.8), (38.8, 38.4), (38.8, 36.9),

meaning 37.7% of Alabama voters voted for President Obama in 2008, while 40.8% did so in 2012, and so on.
We could use a matched pairs t-test to test whether the same percentage of voters voted for President Obama in 2008 and 2012 in the 50 states, on average. If the percentages were indeed equal, what would that mean for our regression model?

Aside: I wouldn't be able to conclude, just from the average percentages being equal, that the percentage of the country voting for Obama had stayed the same. Why not?
## S.L.R. Models

The regression of Y on X is linear if

$$E(Y \mid X = x) = \beta_0 + \beta_1 x,$$

where the unknown parameters $\beta_0$ and $\beta_1$ determine the intercept and the slope, respectively, of a specific straight line.

Note: the parabolic model

$$E(Y \mid X = x) = \beta_0 + \beta_1 x + \beta_2 x^2$$

is also a linear model, even though it graphs a parabola, because the mean of Y given X is related to the parameters via a linear function.
There will almost certainly be some variation in Y due strictly to random phenomena that cannot be measured, predicted, or explained. This variation is called random error, and is denoted $e_i$.

In other words, the difference between what actually occurs and our model is the error. Our model doesn't explain everything; the amount of variability in our response variable that remains unexplained is measured by the error.
Suppose that $Y_1, Y_2, \ldots, Y_n$ are independent realizations of the random variable Y that are observed at the values $x_1, x_2, \ldots, x_n$ of a random variable X. If the regression of Y on X is linear, then for $i = 1, 2, \ldots, n$,

$$Y_i = \beta_0 + \beta_1 x_i + e_i,$$

where $e_i$ is the random error in $Y_i$ and is such that $E(e_i \mid X) = 0$.
Notice that the random error term does not depend on x, nor does it contain any information about Y (otherwise it would be a systematic error).

For now, we assume the errors have constant variance:

$$\text{Var}(e_i \mid X) = \sigma^2.$$
## Introduction to Least Squares

Don't get these ideas confused: the true (population) regression line, with unknown parameters $\beta_0$ and $\beta_1$, versus the estimated (least squares) line, with estimates $\hat\beta_0$ and $\hat\beta_1$ computed from the data.

## Deriving and Interpreting Parameter Estimates
## Least Squares: Derive Parameter Estimates

Goal: minimize the residual sum of squares

$$\text{RSS}(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$

with respect to $\beta_0$ and $\beta_1$.

Taking the partial derivative with respect to each parameter, we have two equations with two unknowns. Set both equal to 0 and solve.
First: take the derivative with respect to the intercept and solve for the y-intercept:

$$\frac{\partial \text{RSS}}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i) = 0 \quad\Longrightarrow\quad \hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}.$$

The least squares regression line goes through the point $(\bar{x}, \bar{y})$.
Second: take the derivative with respect to the slope and solve for the slope:

$$\frac{\partial \text{RSS}}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i (y_i - \beta_0 - \beta_1 x_i) = 0 \quad\Longrightarrow\quad \hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$
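As a quick check of these closed-form estimates, here is a minimal pure-Python sketch (the toy data are made up for illustration; the course examples use R):

```python
# Closed-form least squares estimates for simple linear regression:
#   slope     = sum((x_i - xbar)(y_i - ybar)) / sum((x_i - xbar)^2)
#   intercept = ybar - slope * xbar

def least_squares(x, y):
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope

# Toy data lying exactly on the line y = 1 + 2x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 9.0]
b0, b1 = least_squares(x, y)
print(b0, b1)  # 1.0 2.0

# The fitted line passes through (xbar, ybar), as derived above.
xbar, ybar = sum(x) / len(x), sum(y) / len(y)
print(abs((b0 + b1 * xbar) - ybar) < 1e-12)  # True
```

With perfectly linear data the estimates recover the generating line exactly, which makes the formulas easy to sanity-check by hand.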
## Example: R Code

```r
my.lm <- lm(election$Obama_12 ~ election$Obama_08)
my.lm

summary(my.lm)
anova(my.lm)
```

To plot your x and y variables on a scatterplot, use:

```r
plot(x, y)
```

To see the help page for an R function (e.g. the lm function), type:

```r
?lm
```
## Example: SAS Code

```sas
data election;
  infile '/Users/Elizabeth//stat608/sp13/pct_obama.csv';
  input state $ Pollclose $ Electoral Obama12 Romney12 Pred $ Obama08
        Mccain08;
run;

proc reg data=election;
  model Obama12 = Obama08;
run;

proc gplot data=election;
  plot y*x;
run;
```
## Example: Election Data

Explanatory variable: % voting for Obama in 2008
Response variable: % voting for Obama in 2012

```
Coefficients:
      (Intercept)  election$Obama_08
           -4.461              1.042
```

Interpret the slope in context.
## Least Squares: Matrix Setup

In matrix notation the model is $y = X\beta + e$, where $y$ is the $n \times 1$ vector of responses, $X$ is the $n \times 2$ design matrix whose columns are a column of 1's and the $x_i$'s, $\beta = (\beta_0, \beta_1)^T$, and $e$ is the vector of errors.

## Derivatives of Matrices

Derivatives of matrices are defined as partials with respect to the variable vector. For example:

$$\frac{\partial}{\partial \beta}\big(a^T \beta\big) = a, \qquad \frac{\partial}{\partial \beta}\big(\beta^T A \beta\big) = (A + A^T)\,\beta.$$
The previous example was a derivative of a scalar with respect to a vector. See:
http://en.wikipedia.org/wiki/Matrix_calculus#Derivatives_with_matrices

It can also be found in the section on first-order derivatives in the Matrix Cookbook, (61) p. 9 (see eCampus).
## Least Squares: Derive Parameter Estimates in Matrix Notation

Minimize $\text{RSS}(\beta) = (y - X\beta)^T (y - X\beta)$. Setting the derivative with respect to $\beta$ equal to zero gives the normal equations

$$X^T X \hat\beta = X^T y, \qquad \hat\beta = (X^T X)^{-1} X^T y.$$
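A small sketch of the normal equations for the $2 \times 2$ case, solving $(X^T X)\hat\beta = X^T y$ with an explicit inverse and checking that it matches the scalar formulas (Python, made-up data):

```python
# Normal equations for SLR: X is the n x 2 matrix [1, x_i], so
# X^T X = [[n, sum x], [sum x, sum x^2]] and X^T y = [sum y, sum xy].
# For 2 x 2 systems we can invert X^T X explicitly.

def lstsq_matrix(x, y):
    n = len(x)
    sx = sum(x)
    sxx = sum(xi * xi for xi in x)
    sy = sum(y)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    det = n * sxx - sx * sx            # determinant of X^T X
    b0 = (sxx * sy - sx * sxy) / det   # intercept
    b1 = (n * sxy - sx * sy) / det     # slope
    return b0, b1

x = [1.0, 2.0, 3.0, 4.0]
y = [2.1, 3.9, 6.2, 7.8]
b0, b1 = lstsq_matrix(x, y)

# Agrees with the scalar derivation: slope = Sxy / Sxx, b0 = ybar - b1*xbar.
xbar, ybar = sum(x) / 4, sum(y) / 4
slope = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
         / sum((xi - xbar) ** 2 for xi in x))
print(abs(b1 - slope) < 1e-9, abs(b0 - (ybar - slope * xbar)) < 1e-9)  # True True
```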
## Linear Models: Inference

> "All models are wrong, but some are useful." (George Box)

## Inference

Objective: develop hypothesis tests and confidence intervals for the slope and intercept of a least squares regression model.

1. Assumptions: next chapter
2. Bias of estimates
3. Variability of estimates

In English: is the percentage of each state that voted for Obama in 2008 linearly related to the percentage in 2012?
## Review: Expectation and Variance of Random Variables

Recall the definition of variance:

$$\text{Var}(Y) = E\big[(Y - E(Y))^2\big].$$

Covariance is similar. It is the numerator of correlation:

$$\text{Cov}(X, Y) = E\big[(X - E(X))(Y - E(Y))\big], \qquad \text{Corr}(X, Y) = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\,\text{Var}(Y)}}.$$
Assume X and Y are random variables, and a and b are constants. Then:

$$E(aX + b) = a\,E(X) + b, \qquad \text{Var}(aX + b) = a^2\,\text{Var}(X),$$
$$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y) + 2\,\text{Cov}(X, Y).$$
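Two standard identities, $E(aX + b) = aE(X) + b$ and $\text{Var}(aX + b) = a^2\text{Var}(X)$, can be verified exactly on a small discrete distribution; a Python sketch with a made-up distribution:

```python
# Verify E(aX+b) = aE(X)+b and Var(aX+b) = a^2 Var(X) exactly,
# using a small discrete distribution given as (value, probability) pairs.

def E(dist):
    return sum(p * v for v, p in dist)

def Var(dist):
    mu = E(dist)
    return sum(p * (v - mu) ** 2 for v, p in dist)

X = [(1, 0.2), (2, 0.5), (4, 0.3)]       # made-up distribution
a, b = 3, -1
aXb = [(a * v + b, p) for v, p in X]     # distribution of aX + b

print(abs(E(aXb) - (a * E(X) + b)) < 1e-12)       # True
print(abs(Var(aXb) - a * a * Var(X)) < 1e-12)     # True
```

Because the distribution is finite, both sides are computed exactly (up to floating-point rounding) rather than estimated by simulation.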
## Correlation

Correlation is unit-free (with no measurement error, changing from meters to feet yields the same correlation).

Correlation does not change if the explanatory and response variables are exchanged.

Correlation closer to -1 or +1 indicates a stronger linear relationship between the explanatory and response variables; correlation closer to 0 indicates a weaker association.
When both x and y are above average, or both are below average, a positive value is contributed to the sum. When exactly one is above average, a negative value is contributed to the sum.

When most values are in the bottom left and top right of the plot, the correlation is positive.
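The unit-free and symmetry claims above can be checked numerically; a Python sketch with made-up data (the `corr` function implements the usual sample correlation):

```python
# Sample correlation: r = Sxy / sqrt(Sxx * Syy). It is symmetric in
# its arguments and invariant to linear rescaling (e.g. meters -> feet).
import math

def corr(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x = [1.0, 2.0, 3.0, 5.0]          # made-up data
y = [2.0, 2.5, 4.0, 6.0]
feet = [xi * 3.2808 for xi in x]  # change of units: meters -> feet

print(abs(corr(x, y) - corr(y, x)) < 1e-12)    # True: symmetric
print(abs(corr(x, y) - corr(feet, y)) < 1e-9)  # True: unit-free
```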

[Scatterplot: % Obama 2008 vs. % Obama 2012.] Here r = 0.98, indicating a strong positive relationship between the 2008 and 2012 percentages.
## Expectation and Variance of Matrices

Assume $x$ is a vector of random variables. Then:

$$E(x) = \big(E(x_1), \ldots, E(x_n)\big)^T, \qquad \text{Var}(x) = E\big[(x - E(x))(x - E(x))^T\big].$$
## Variance / Covariance Matrix

The $(i, j)$ entry of $\text{Var}(x)$ is $\text{Cov}(x_i, x_j)$; the diagonal entries are the variances.

## Correlation Matrix

To calculate the correlation matrix, divide by the appropriate standard deviations elementwise.
Assume $X$ and $Y$ are matrices of random variables; $x$ and $y$ are vectors of random variables; and $A$, $B$, and $C$ are constant matrices. Then:

$$E(AXB + C) = A\,E(X)\,B + C, \qquad \text{Var}(Ax) = A\,\text{Var}(x)\,A^T.$$
It can be shown that

$$E(x^T A x) = \mu^T A \mu + \text{tr}(A \Sigma),$$

where $\mu$ and $\Sigma$ are the expected value and variance-covariance matrix of $x$, respectively, and tr denotes the trace of a matrix.

This result only depends on the existence of $\mu$ and $\Sigma$; in particular, normality of $x$ is not required.
## Usual Assumptions

$E(e_i \mid X) = 0$, $\text{Var}(e_i \mid X) = \sigma^2$, and the errors are independent.

In matrix notation: $E(e \mid X) = 0$ and $\text{Var}(e \mid X) = \sigma^2 I$.
## Bias

The bias of a statistic is the difference between its expected value and the parameter it should estimate:

$$\text{Bias}(\hat\theta) = E(\hat\theta) - \theta.$$
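A concrete illustration of bias: enumerate every equally likely size-2 sample from a tiny made-up population and compare the $n$ and $n - 1$ versions of the sample variance (Python sketch):

```python
# Bias = E(estimator) - parameter. Enumerating all size-2 samples
# (with replacement, equally likely) from a tiny population shows that
# dividing by (n-1) gives an unbiased variance estimate while dividing
# by n underestimates the population variance.
from itertools import product

pop = [1.0, 2.0, 3.0]
mu = sum(pop) / len(pop)
sigma2 = sum((v - mu) ** 2 for v in pop) / len(pop)  # population variance = 2/3

def s2(sample, denom):
    m = sum(sample) / len(sample)
    return sum((v - m) ** 2 for v in sample) / denom

samples = list(product(pop, repeat=2))  # all 9 equally likely samples
e_unbiased = sum(s2(s, 1) for s in samples) / len(samples)  # denom n-1 = 1
e_biased = sum(s2(s, 2) for s in samples) / len(samples)    # denom n   = 2

print(abs(e_unbiased - sigma2) < 1e-12)  # True: zero bias
print(e_biased < sigma2)                 # True: biased downward
```

Because the expectation is computed by exact enumeration rather than simulation, the zero-bias result holds exactly, not just approximately.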
## Inferences About the Slope and Intercept

Unbiased estimates: $E(\hat\beta_0) = \beta_0$ and $E(\hat\beta_1) = \beta_1$.

Variance of the estimates:

$$\text{Var}(\hat\beta_1 \mid X) = \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad \text{Var}(\hat\beta_0 \mid X) = \sigma^2\left(\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right).$$
What does it mean to take the expectation of the sample slope?
## Inferences About the Slope

Distribution of the slope:

Since the errors are normally distributed, $y_i = \beta_0 + \beta_1 x_i + e_i$ is also normally distributed, given $x$. Since the sample slope is a linear combination of the $y_i$'s, it is also normally distributed:

$$\hat\beta_1 \mid X \sim N\!\left(\beta_1,\; \frac{\sigma^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right).$$
Of course, $\sigma$ is unknown, so we estimate it using the sample standard deviation. Then our test statistic

$$t = \frac{\hat\beta_1 - \beta_1}{\text{se}(\hat\beta_1)}$$

has the t-distribution (with $n - k$ degrees of freedom in general, $k$ being the number of columns of $X$).
The standard error is the standard deviation of a statistic (often estimated using the sample standard deviation).

The margin of error is half the width of a confidence interval.

Don't mix up the critical value with the test statistic calculated from the data! Look up the critical value in a table, or use R.

$\sigma^2$, the variance of the errors, is estimated by MSE. So the square root of MSE, RMSE, is $s$ in the equation above.
R output (50-state data):

```
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)                                      0.00459 **
states50$Obama_08   1.05610    0.03363  31.401  < 2e-16 ***
```

Confidence interval for the slope: $\hat\beta_1 \pm t_{\alpha/2,\,n-2}\,\text{se}(\hat\beta_1)$.
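The printed t value is just Estimate divided by Std. Error, and the interval endpoints follow from the formula; a Python sketch (the critical value 2.01 is an assumed approximation of `qt(0.975, 48)` in R):

```python
# Reproduce the t statistic from the R summary output: t = Estimate / SE.
estimate, std_error = 1.05610, 0.03363
t = estimate / std_error
print(round(t, 1))  # 31.4, matching the printed t value

# Rough 95% CI for the slope: estimate +/- t_crit * SE.
# t_crit = 2.01 is an assumed approximation; in R use qt(0.975, 48).
t_crit = 2.01
lo, hi = estimate - t_crit * std_error, estimate + t_crit * std_error
print(lo, hi)
```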
## More Inference

Develop hypothesis tests and confidence intervals for the regression line:

1. Confidence intervals for the mean
2. Prediction intervals for individuals

In English: if 48% of a state voted for Obama in 2008, what percentage voted for him in 2012?

## Slope vs. Regression Line

Confidence interval for the slope:
- If x increases by 1 unit, by how much does y increase or decrease?
- Deals with the sampling distribution of the sample slope from one sample to another.
- Can use the CLT to say the sample slope is approximately normal for large n.

Confidence interval for the regression line (mean):
- For a given value of x, what is the mean value of y?
- Deals with the sampling distribution of the sample mean from one sample to another.
- Can use the CLT to say the sample mean is approximately normal for large n.

Prediction interval for the regression line:
- For a given value of x, what are the individual values of y?
- Deals with the distribution of individuals about the regression line.
- The CLT is NOT useful in this case. We need to know the distribution of the errors.
## Confidence Intervals for the Population Regression Line

We consider the problem of finding a confidence interval for the unknown population mean at a given value of X, which we shall denote by $x^*$. This mean is given by

$$E(Y \mid X = x^*) = \beta_0 + \beta_1 x^*.$$
An estimator of this unknown quantity is the value of the estimated regression equation at $X = x^*$, namely:

$$\hat{y}^* = \hat\beta_0 + \hat\beta_1 x^*.$$
## How Many Populations Were Sampled?

One for each unique value of $X = x^*$. How many populations were sampled in the election data?
## How Many Means and Variances Do We Have?

___ means
___ variances

However, we assume that the variance is the same, $\sigma^2$, for every population.

Variance of mean
+ 61

+ 62

A 100(1 − α)% confidence interval for $E(Y \mid X = x^*)$ is

$$\hat{y}^* \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}},$$

where $t_{\alpha/2,\,n-2}$ is the $100(1 - \alpha/2)$th quantile of the t-distribution with $n - 2$ degrees of freedom.
Does the confidence interval for the mean y-value get narrower as n gets bigger?

What happens to the width of the confidence interval as your desired x-value moves farther from the mean of x?
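Both questions can be answered numerically from the half-width factor $\sqrt{1/n + (x^* - \bar{x})^2 / S_{xx}}$; a Python sketch with made-up x values:

```python
# The CI half-width is proportional to sqrt(1/n + (x* - xbar)^2 / Sxx):
# it grows as x* moves away from xbar and shrinks as n grows.
import math

x = [float(i) for i in range(10)]   # made-up x values 0..9, xbar = 4.5
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

def se_factor(xstar):
    return math.sqrt(1.0 / n + (xstar - xbar) ** 2 / sxx)

# Wider as x* moves away from the mean of x:
print(se_factor(4.5) < se_factor(7.0) < se_factor(9.0))  # True

# Narrower as n grows: duplicating every observation doubles n and Sxx,
# halving the quantity under the square root.
print(math.sqrt(1.0 / (2 * n) + (7.0 - xbar) ** 2 / (2 * sxx)) < se_factor(7.0))  # True
```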
## Prediction Intervals for the Actual Value of Y

$$\hat{y}^* \pm t_{\alpha/2,\,n-2}\; s\,\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}}.$$
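Comparing the prediction-interval factor with the confidence-interval factor at several $x^*$ values shows the prediction interval is always wider (Python sketch, made-up x values):

```python
# The PI factor sqrt(1 + 1/n + ...) always exceeds the CI factor
# sqrt(1/n + ...): individual values vary more than the mean does.
import math

x = [2.0, 4.0, 5.0, 7.0, 9.0]   # made-up x values
n = len(x)
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)

for xstar in (2.0, xbar, 10.0):
    ci = math.sqrt(1.0 / n + (xstar - xbar) ** 2 / sxx)        # CI factor
    pi = math.sqrt(1.0 + 1.0 / n + (xstar - xbar) ** 2 / sxx)  # PI factor
    assert pi > ci  # the extra "1 +" term widens the prediction interval
print("prediction interval factor exceeds the CI factor at every x*")
```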
## Dummy (Indicator) Variable Regression

Experiment: does increasing calcium intake decrease blood pressure?

Code the treatment with an indicator variable: $x_i = 1$ if subject $i$ took the calcium supplement and $x_i = 0$ for the placebo group. For the placebo group,

$$E(Y \mid x = 0) = \beta_0,$$

while for the calcium group, $E(Y \mid x = 1) = \beta_0 + \beta_1$.

21 total men: 10 took the calcium supplement, and 11 took the placebo.
What if we had the intercept plus two variables, one an indicator of taking the placebo and one an indicator of taking the calcium supplement?

The columns of X must be linearly independent; that is, the matrix X must be of full rank. Here the two indicator columns would sum to the intercept column, so X would not be of full rank.

```r
my.lm <- lm(y ~ x)
summary(my.lm)
```

Output:

```
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.2727     2.2266  -0.122    0.904
x             5.2727     3.2267   1.634    0.119
```

Calcium group mean = 5, placebo group mean = -0.27. Difference between means = 5.27, as expected.
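The same fact can be reproduced in pure Python: regressing on a 0/1 indicator recovers the group means (the response values below are made up, not the calcium-study data):

```python
# With a single 0/1 indicator, least squares reproduces the group means:
# intercept = mean of the x = 0 (placebo) group,
# slope     = difference between the group means.

def least_squares(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    return ybar - b1 * xbar, b1

placebo = [1.0, -2.0, 0.0, 3.0]   # made-up responses, x = 0
calcium = [7.0, 4.0, 6.0, 5.0]    # made-up responses, x = 1
x = [0.0] * len(placebo) + [1.0] * len(calcium)
y = placebo + calcium

b0, b1 = least_squares(x, y)
m0 = sum(placebo) / len(placebo)  # 0.5
m1 = sum(calcium) / len(calcium)  # 5.5
print(abs(b0 - m0) < 1e-12, abs(b1 - (m1 - m0)) < 1e-12)  # True True
```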
## Geometry

## Geometric Interpretation of $\hat{y}$

Let $V$ be the vector space spanned by the columns of our design matrix $X$. Then $\hat{y} = X\hat\beta$ is the projection of the vector $y$ down into the vector space $V$. Is $\hat{y}$ in $V$?
Projection is like a shadow of the real thing.
Any vector in $V$ multiplied by the error vector is 0:

$$X^T \hat{e} = 0.$$
Is it true that $\hat{y}^T \hat{e} = 0$? What is $y^T \hat{e}$?
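The orthogonality claim can be checked numerically: the residual vector from a least squares fit is orthogonal to both columns of X (Python sketch, made-up data):

```python
# The residuals are orthogonal to every column of X: sum(e_hat) = 0
# (the column of 1's) and sum(x_i * e_hat_i) = 0 (the x column).

def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 4.0, 6.0, 7.0]   # made-up data
y = [1.2, 2.9, 4.1, 6.8, 7.1]
b0, b1 = fit(x, y)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(abs(sum(resid)) < 1e-9)                                # True: 1^T e_hat = 0
print(abs(sum(xi * ei for xi, ei in zip(x, resid))) < 1e-9)  # True: x^T e_hat = 0
```

This is exactly the geometric statement that the residual vector is perpendicular to the subspace V spanned by the columns of X.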
## ANOVA Table

## ANalysis Of VAriance (ANOVA)

The original sample variance of y is the total variance, measuring the total amount of variability in the data set.

The ANOVA table breaks that variability into two parts: one explained by x, and the leftovers.
We're considering two sources of variation using this regression model:

Data = Fit + Residual

$$y_i = (\hat\beta_0 + \hat\beta_1 x_i) + \hat{e}_i$$

The regression model, whether using dummy variables or not, is not really about variances; the point of a regression model is to study how y changes with x.
If the strength of the correlation increases, _____________ decreases.

If the slope of the line gets steeper, _______________ increases.
| Source | Degrees of Freedom | Sums of Squares | Mean Squares | F-Statistic | P-value |
| --- | --- | --- | --- | --- | --- |
| Regression (Model) | 1 | $\sum_i (\hat{y}_i - \bar{y})^2$ | SSreg / 1 | MSreg / MSE | $P(F_{1,\,n-2} > F)$ |
| Residual (Error) | $n - 2$ | $\sum_i (y_i - \hat{y}_i)^2$ | SSE / (n − 2) | | |
| Total | $n - 1$ | $\sum_i (y_i - \bar{y})^2$ | | | |
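The table's decomposition can be verified numerically: SST = SSreg + SSE for any least squares fit (Python sketch, made-up data):

```python
# ANOVA identity for simple linear regression: SST = SSreg + SSE,
# with degrees of freedom n - 1 = 1 + (n - 2).

def fit(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    return ybar - b1 * xbar, b1

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # made-up data
y = [2.0, 2.8, 4.1, 4.9, 6.2, 6.8]
b0, b1 = fit(x, y)
ybar = sum(y) / len(y)
yhat = [b0 + b1 * xi for xi in x]

sst = sum((yi - ybar) ** 2 for yi in y)              # Total
ssreg = sum((yh - ybar) ** 2 for yh in yhat)         # Regression (Model)
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat)) # Residual (Error)

print(abs(sst - (ssreg + sse)) < 1e-9)  # True: Total = Model + Error
```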