
APPLIED REGRESSION ANALYSIS

EPSY 581 / PSYC 581


Note 1

Chapter 2
• Begin with Review
– Mean, Variance, etc. (one variable)
– Covariance, Correlation (two variables)
• Simple Linear Regression
– Notation
– Conceptualization with Example
– Population vs. Sample
– Definitions
– Least Squares Estimate
– Assumptions
– Types of Errors
– Partitioning the Sum of Squares
– Hypothesis Testing

Goal of Empirical Research
• Draw inferences about some population(s)
of interest based on observations of just a
subset or sample from the whole population.
• Then, generalize from sample to population.

Process of Statistical Analysis

[Diagram: Population -> Random Sample -> Describe (Sample Statistics) -> Make Inferences about the Population]

Defining the Problem

Before you begin any analysis, you should complete certain tasks.
• Outline the purpose of the study.
• Document the study questions.
• Define the population of interest.
• Determine the need for sampling.
• Define the data collection protocol.

Cereal Example
[Figure: a box of Rise n Shine cereal labeled 15 ounces]

Defining the Problem
The purpose of the study is to determine
whether the cereal boxes contain 15 ounces of
cereal.

The study question is whether the average amount of cereal in the boxes is equal to 15 ounces.

Sample
[Figure: a sample of Rise n Shine cereal boxes]

Assumption for this Course
– The sample drawn is representative of the
population.
• In other words, the sample characteristics should
reflect the characteristics of the population as a
whole.

Describing Your Data


• The goals when you are describing data are
to
– screen for unusual data values
– inspect the spread and shape of continuous
variables
– characterize the central tendency
– draw preliminary conclusions about your data.
• Descriptive Statistics

Parameters and Statistics
Statistics are used to approximate population
parameters.
                      Population      Sample
                      Parameters      Statistics
Mean                  $\mu$           $\bar{X}$
Variance              $\sigma^2$      $s^2$
Standard Deviation    $\sigma$        $s$


Distributions
When you examine the distribution of values
for the variable, you can find out
– the range of possible data values
– the frequency of data values
– whether the data values accumulate in the middle
of the distribution or at one end.

“Typical Values” in a Distribution
– Mean: the sum of all the values in the data set divided by the number of values

$$\bar{X} = \frac{1}{N}\sum_{i=1}^{N} X_i$$

– Median: the middle value (also known as the 50th percentile)

– Mode: the most common or frequent data value



Sample Variance

$$s_x^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2 = \frac{\sum (X-\bar{X})^2}{N-1} \qquad (2.1)$$

$$SXX = \sum_{i=1}^{N}(X_i-\bar{X})^2 = \sum_{i=1}^{N}X_i^2 - N\bar{X}^2 = (N-1)s_x^2 = \sum_{i=1}^{N}X_i^2 - \frac{\left(\sum_{i=1}^{N}X_i\right)^2}{N} \qquad (2.2)$$

$$s_x^2 = \frac{1}{N-1}SXX$$
Standard Deviation (SD)

$$s_x = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i-\bar{X})^2} = \sqrt{\frac{1}{N-1}SXX}$$

$$SD_X = s_x$$

$$SXX = \sum_{i=1}^{N}(X_i-\bar{X})^2$$


Computing the Variance


X (N=5)      X - \bar{X}      (X - \bar{X})^2
5            -10              100
10           -5               25
15           0                0
20           5                25
25           10               100
Sum: 75      0                250
Mean: 15                      Var: 62.5
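
The totals in the table plug directly into equations (2.1) and (2.2). As a quick check that uses only the numbers above, the definitional and computational forms of SXX agree:

```latex
\begin{align*}
s_x^2 &= \frac{SXX}{N-1} = \frac{250}{5-1} = 62.5,
  \qquad s_x = \sqrt{62.5} \approx 7.91 \\
SXX &= \sum X_i^2 - \frac{\left(\sum X_i\right)^2}{N}
     = (25 + 100 + 225 + 400 + 625) - \frac{75^2}{5}
     = 1375 - 1125 = 250
\end{align*}
```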
Point Estimates

$\bar{X}$ estimates $\mu$

$s$ estimates $\sigma$


Variability is about the Spread

Percentiles
Quartiles break your data up into quarters.

98
95
92
        75th Percentile = 91 (third quartile)
90
85
81
        50th Percentile = 80
79
70
63
        25th Percentile = 59 (first quartile)
55
47
42

The Spread of a Distribution: Dispersion

Measure               Definition
range                 the difference between the maximum and minimum data values
interquartile range   the difference between the 25th and 75th percentiles
variance              a measure of dispersion of the data around the mean
standard deviation    a measure of dispersion expressed in the same units of
                      measurement as your data (the square root of the variance)
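
Applied to the twelve values on the Percentiles slide (maximum 98, minimum 42, 75th percentile 91, 25th percentile 59), the first two measures work out as follows:

```latex
\begin{align*}
\text{range} &= 98 - 42 = 56 \\
\text{interquartile range} &= 91 - 59 = 32
\end{align*}
```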

The MEANS Procedure
General form of the MEANS procedure:

PROC MEANS DATA=SAS-data-set <options>;
    VAR variables;
RUN;
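
As a minimal sketch of this general form, the step below reads the five X values from the Computing the Variance slide into a temporary data set and requests N, the mean, the variance, and the standard deviation. The data set name work.scores is an assumption made for illustration only.

```sas
/* Hypothetical data set holding the five X values from the variance slide */
data work.scores;
    input X @@;          /* @@ reads several observations from one line */
    datalines;
5 10 15 20 25
;
run;

/* Request N, mean, variance, and standard deviation for X */
proc means data=work.scores n mean var std;
    var X;
run;
```

For these values the reported mean should be 15 and the variance 62.5, matching the hand computation on the slide.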


Picturing Distributions: Histogram
• Each bar in the histogram represents a group of values (a bin).
• The height of the bar is the percent of values in the bin (Frequency Histogram).
• The area of the bar is the percent of values in the bin (Relative Frequency Histogram).

[Figure: histogram with PERCENT on the vertical axis and bins on the horizontal axis]

The Normal Distribution


The Normal Distribution


The normal distribution
• is symmetric. If you draw a line down the center, you get the same shape on either side.
• is fully characterized by the mean and standard deviation. Given those two parameters, you know all there is to know about the distribution.
• is bell shaped.
• has mean = median = mode.

The red line on each of the following graphs represents the shape of the normal distribution with the mean and variance estimated from the sample data.

Characteristics of the Bell Curve
[Figure: bell curve with the peak, flanks, and tails labeled; horizontal axis from -4 to 4]

The UNIVARIATE Procedure


General form of the UNIVARIATE procedure:

PROC UNIVARIATE DATA=SAS-data-set <options>;
    VAR variables;
    ID variable;
    HISTOGRAM variables </ options>;
    PROBPLOT variables </ options>;
RUN;
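
A hedged sketch of how this form might be filled in, assuming the course data set sasuser.b_fitness (described in the Fitness Example part of these notes) is available in your session. The choice of Oxygen_Consumption as the analysis variable and Name as the ID variable is illustrative.

```sas
/* Assumes sasuser.b_fitness exists; variable choices are illustrative */
proc univariate data=sasuser.b_fitness;
    var Oxygen_Consumption;
    id Name;                                /* label extreme observations by member name */
    histogram Oxygen_Consumption / normal;  /* overlay a normal curve fitted to the sample */
    probplot Oxygen_Consumption
        / normal(mu=est sigma=est);         /* reference line from the sample mean and SD */
run;
```

The NORMAL option on HISTOGRAM draws the curve with the mean and standard deviation estimated from the sample, which is the overlay described on the Normal Distribution slide.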
Graphical Displays of Distributions
• You can produce three kinds of plots for
examining the distribution of your data
values:
– histograms
– box plots
– normal probability plots.


Box-and-Whisker Plots
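[Figure: box-and-whisker plot; labels from top to bottom]
– largest point within 1.5 interquartile ranges (IQR) of the box
– the mean, denoted by a +
– the 75th percentile
– the 50th percentile (median)
– the 25th percentile
– smallest point within 1.5 interquartile ranges (IQR) of the box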

The BOXPLOT Procedure
General form of the BOXPLOT procedure:

PROC BOXPLOT DATA=SAS-data-set;
    PLOT analysis-variable*group-variable </options>;
RUN;
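
A minimal sketch of this form, again assuming sasuser.b_fitness is available. PROC BOXPLOT expects the input to be sorted by the group variable, so the data are sorted first. The output data set name work.fitness_sorted and the choice of Gender as the group variable are assumptions for illustration.

```sas
/* Sort by the group variable before calling PROC BOXPLOT */
proc sort data=sasuser.b_fitness out=work.fitness_sorted;
    by Gender;
run;

/* One box per gender; schematic style draws whiskers to the most
   extreme points within 1.5 IQR of the box and flags points beyond */
proc boxplot data=work.fitness_sorted;
    plot Oxygen_Consumption*Gender / boxstyle=schematic;
run;
```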


Exploratory Data Analysis with Two Variables

Objectives
– Examine the relationship between two
continuous variables using a scatter plot.
– Quantify the degree of linearity between
two continuous variables using correlation
statistics.
– Understand potential misuses of the correlation
coefficient.
– Obtain Pearson correlation coefficients using
the CORR procedure.
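
For the first objective, one way to draw a scatter plot is sketched below. PROC SGPLOT is not named in these notes; it is assumed here as one available SAS plotting procedure, and the variable pair is taken from the fitness data set described later in the notes.

```sas
/* Assumption: PROC SGPLOT is available (SAS 9.2 or later) */
proc sgplot data=sasuser.b_fitness;
    scatter x=Performance y=Oxygen_Consumption;  /* predictor on X, response on Y */
run;
```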


Scatter Plots

Overview

[Diagram: correlation relates one continuous variable to another continuous variable]

Example of Two Continuous Variables

[Figure: Height (in.) and Weight (lb.): is there a relationship?]

Relationships between Continuous Variables

[Figure: four scatter plots, numbered 1 to 4]


Correlation

[Figure: three scatter plots]
– Plot of Weight by Height (Height 60 to 75, Weight 90 to 210)
– Plot of Errors by Study Time (Study Time 0 to 400, Errors 0 to 30)
– Plot of SAT-V by Toe Size (Toe Size 1.5 to 1.9, SAT-V 400 to 700)

Pearson Correlation Coefficient

[Figure: correlation coefficient scale from -1 to 1. Values near -1 are STRONG negative, values near 0 are weak, and values near 1 are STRONG positive.]

Misuses of the Correlation Coefficient

Strong correlation does not mean causation: a strong correlation between height (in.) and weight (lb.) does not show that one causes the other.


SAT Example
Average SAT Score versus Percent Taking Test
[Figure: scatter plot of SCORE (800 to 1100) against PCTAKING (0 to 80)]

Missing Another Type of Relationship
Curvilinear Relationship


The CORR Procedure


General form of the CORR procedure:

PROC CORR DATA=SAS-data-set <options>;
    VAR variables;
    WITH variables;
RUN;
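
A hedged sketch of this form using the fitness data described later in these notes, assuming sasuser.b_fitness is available. The VAR variables form the columns of the correlation table and the WITH variable forms the row, so the output is a single row of Pearson correlations with Oxygen_Consumption.

```sas
/* Assumes sasuser.b_fitness exists; variable names come from the
   variable list in the Fitness Example section of these notes */
proc corr data=sasuser.b_fitness;
    var Runtime Age Weight Run_Pulse Rest_Pulse Maximum_Pulse Performance;
    with Oxygen_Consumption;
run;
```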

Sample Covariance

$$s_{xy} = \frac{\sum (X_i-\bar{X})(Y_i-\bar{Y})}{N-1} \qquad (2.3)$$
• Linear relationship or association
– Degree that X and Y co-vary from their respective means
• Direction of linear association
– Sign (+ or -) for direction

• Affected by scales of X’s and Y’s



Cross-Product and Sample Covariance

$$SXY = \sum (X_i-\bar{X})(Y_i-\bar{Y}) = (N-1)s_{xy} = \sum_{i=1}^{N} X_iY_i - \frac{\left(\sum_{i=1}^{N}X_i\right)\left(\sum_{i=1}^{N}Y_i\right)}{N} \qquad (2.4)$$

$$s_{xy} = \frac{1}{N-1}SXY$$
Pearson product-moment correlation coefficient

$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{s_{xy}}{SD_X \, SD_Y} = \frac{SXY}{\sqrt{(SXX)(SYY)}}$$

• Standardizes linear association
– Size or magnitude of correlation

$$-1 \le r_{xy} \le 1$$

Disadvantages of Pearson’s Correlation

• Validity of $r_{xy}$ depends on sample size
• Designed for linear association
• Does not indicate causality
• Primarily for relationships between continuous variables (interval or ratio scales)
• Affected by outliers

Pearson’s Correlation and Sample Size

Here are the limits within which 80% of sample r’s will fall
when the true correlation (i.e., in the population) is zero:


Simple Linear Regression

Simple Linear Regression Analysis
• The objectives of simple linear regression
are to
– assess the significance of the predictor variable
in explaining the variability or behavior of the
response variable
– predict the values of the response variable
given the values of the predictor variable.


Fitness Example

• In exercise physiology, an objective measure of
aerobic fitness is how fast the body can
absorb and use oxygen (oxygen
consumption).
• Subjects participated in a predetermined
exercise run of 1.5 miles.
• Measurements of oxygen consumption as
well as several other continuous
measurements such as age, pulse, and weight
were recorded.
• The researchers are interested in determining
whether any of these other variables can help
predict oxygen consumption.

Variables in sasuser.b_fitness
• Name: name of the member
• Gender: gender of the member
• Runtime: time to run 1.5 miles (in minutes)
• Age: age of the member (in years)
• Weight: weight of the member (in kilograms)
• Oxygen_Consumption: a measure of the ability to use oxygen in the blood stream
• Run_Pulse: pulse rate at the end of the run
• Rest_Pulse: resting pulse rate
• Maximum_Pulse: maximum pulse rate during the run
• Performance: a measure of overall fitness
Fitness Example
PREDICTOR: Performance
RESPONSE: Oxygen_Consumption


Outcome = Systematic component + Residual

Goal 1: Identify the systematic components and determine how they fit the data

Goal 2: Assess how well we did by examining the magnitude of the residuals

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Here Y is the outcome, X is the predictor, and $\beta_0$ and $\beta_1$ are the "population parameters" or "regression coefficients" to be estimated.

Simple Linear Regression Model
[Figure: regression line of Response (Y) against Predictor (X); a 1-unit increase in X corresponds to a change of $\beta_1$ units in Y]


Simple Linear Regression Model


[Figure: Response (Y) plotted against Predictor (X), showing the unknown population relationship and the vertical deviation of an observed Y from the line]

Simple Linear Regression
• Used to test association between two variables
• Accounts for (predicts) the variance in an
interval dependent variable based on an interval,
dichotomous, or dummy independent variable.
• By estimating a straight line through the
corresponding X-Y data points we can estimate
the magnitude of the relationship between X and
Y.
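
As a sketch of what estimating that straight line could look like in SAS, the step below regresses the response on the predictor from the Fitness Example. PROC REG is not introduced in these notes; it is assumed here as one standard SAS procedure for fitting a line, and the data set sasuser.b_fitness is assumed to be available.

```sas
/* Assumption: PROC REG is used for the fit; it is not covered in these notes */
proc reg data=sasuser.b_fitness;
    model Oxygen_Consumption = Performance;  /* response = predictor */
run;
quit;
```

The parameter estimates in the output correspond to the intercept and the slope for Performance in the model Y = β0 + β1 X + ε.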


Clarifying the terminology


Term           Definition                          Synonyms
Outcome        Variable whose behavior we are      Dependent variable, Response,
               trying to explain                   Criterion, Y
Predictor      Variable we are using to explain    Independent variable, Predictor,
               the variation in the outcome        Covariate, X
Relationship   How two variables relate to each    Association, Correlation,
               other, without implying causality   Covariation
                                                   (Association is not the same
                                                   as causation)
Next Lecture
• Next Lecture: SAS. Bring your laptop if
possible.

• Reading assignments:
– the MEANS, UNIVARIATE, etc. procedures (Base
SAS Guide)
– Chapters 1 and 2
