L01

APPLIED REGRESSION ANALYSIS
EPSY 581 / PSYC 581

Note 1
Chapter 2
• Begin with Review
– Mean, Variance, etc. (one variable)
– Covariance, Correlation (two variables)
• Simple Linear Regression
– Notation
– Conceptualization with Example
– Population vs. Sample
– Definitions
– Least Squares Estimate
– Assumptions
– Types of Errors
– Partitioning the Sum of Squares
– Hypothesis Testing
Goal of Empirical Research
• Draw inferences about some population(s)
of interest based on observations of just a
subset or sample from the whole population.
• Then, generalize from sample to population.
Process of Statistical Analysis

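[Diagram: a random sample is drawn from the population; the sample is described with sample statistics; the sample statistics are used to make inferences about the population.]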
Defining the Problem
Before you begin any analysis, you should
complete certain tasks.
• Outline the purpose of the study.
• Document the study questions.
• Define the population of interest.
• Determine the need for sampling.
• Define the data collection protocol.
Cereal Example
[Figure: a box of Rise n Shine cereal labeled 15 ounces.]
Defining the Problem
The purpose of the study is to determine
whether the cereal boxes contain 15 ounces of
cereal.
The study question is whether the average
amount of cereal in the boxes is equal to 15
ounces.
Sample
[Figure: a sample of Rise n Shine cereal boxes.]
Assumption for this Course
– The sample drawn is representative of the
population.
• In other words, the sample characteristics should
reflect the characteristics of the population as a
whole.
Describing Your Data

• The goals when you are describing data are
to
– screen for unusual data values
– inspect the spread and shape of continuous
variables
– characterize the central tendency
– draw preliminary conclusions about your data.
• Descriptive Statistics
Parameters and Statistics
Statistics are used to approximate population
parameters.
                      Population      Sample
                      Parameters      Statistics
Mean                  $\mu$           $\bar{X}$
Variance              $\sigma^2$      $s^2$
Standard Deviation    $\sigma$        $s$
Distributions
When you examine the distribution of values
for the variable, you can find out
– the range of possible data values
– the frequency of data values
– whether the data values accumulate in the middle
of the distribution or at one end.
“Typical Values” in a Distribution
– Mean: the sum of all the values in the data set
divided by the number of values
\[ \bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \]
– Median: the middle value (also known as the 50th
percentile)
– Mode: the most common or frequent data value
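The three "typical values" can be computed directly. The following is a minimal Python sketch; the data values are hypothetical and were chosen only so that one value repeats.

```python
from statistics import median, mode

# Hypothetical data values, chosen only for illustration.
values = [2, 3, 3, 5, 7]

x_bar = sum(values) / len(values)   # mean: sum of the values divided by N
middle = median(values)             # median: middle value of the ordered data
most_common = mode(values)          # mode: most frequent data value

print(x_bar, middle, most_common)
```

Here the mean (4.0) is pulled above the median (3) by the largest value, which is one reason to look at more than one measure of central tendency.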
Sample Variance
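\[ s_x^2 = \frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2 = \frac{\sum (X - \bar{X})^2}{N-1} \tag{2.1} \]
\[ SXX = \sum_{i=1}^{N} (X_i - \bar{X})^2 = \sum_{i=1}^{N} X_i^2 - N\bar{X}^2 \]
\[ SXX = (N-1)\, s_x^2 = \sum_{i=1}^{N} X_i^2 - \frac{\left( \sum_{i=1}^{N} X_i \right)^2}{N} \tag{2.2} \]
\[ s_x^2 = \frac{1}{N-1}\, SXX \]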
Standard Deviation (SD)
\[ s_x = \sqrt{\frac{1}{N-1} \sum_{i=1}^{N} (X_i - \bar{X})^2} = \sqrt{\frac{1}{N-1}\, SXX} \]
\[ SD_X = s_x \]
\[ SXX = \sum_{i=1}^{N} (X_i - \bar{X})^2 \]
Computing the Variance

X (N=5)      $X - \bar{X}$    $(X - \bar{X})^2$
5            -10              100
10           -5               25
15           0                0
20           5                25
25           10               100
Sum: 75      0                250
Mean: 15                      Var: 62.5
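The table can be checked numerically. This Python sketch uses the five X values from the slide and computes SXX two ways, by the definitional sum of squared deviations and by the computational form (2.2), before dividing by N - 1 as in (2.1). The variable names are illustrative.

```python
# X values from the "Computing the Variance" slide (N = 5).
x = [5, 10, 15, 20, 25]
n = len(x)
x_bar = sum(x) / n

# Definitional form: SXX is the sum of squared deviations from the mean.
sxx_definitional = sum((xi - x_bar) ** 2 for xi in x)

# Computational form (2.2): SXX = sum(X_i^2) - (sum X_i)^2 / N.
sxx_computational = sum(xi ** 2 for xi in x) - sum(x) ** 2 / n

variance = sxx_definitional / (n - 1)   # sample variance (2.1)
sd = variance ** 0.5                    # standard deviation

print(x_bar, sxx_definitional, sxx_computational, variance, round(sd, 3))
```

Both forms give SXX = 250, so the variance is 250 / 4 = 62.5, matching the slide, and the standard deviation is about 7.906.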
Point Estimates
[Figure: each sample statistic estimates the corresponding population parameter.]
Variability is about the Spread
Percentiles
Quartiles break your data up into quarters.
98
95
92
   75th percentile = 91 (third quartile)
90
85
81
   50th percentile = 80
79
70
63
   25th percentile = 59 (first quartile)
55
47
42
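The quartiles on the slide can be reproduced with a short Python sketch. It assumes each quartile is taken as the median of one half of the ordered data; that assumption reproduces the slide's values, but other percentile definitions (including some software options) can give slightly different answers.

```python
from statistics import median

# Data values from the Percentiles slide.
data = [98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42]

ordered = sorted(data)
half = len(ordered) // 2

# Assumption: quartiles are medians of the lower and upper halves.
# The split below works as written for an even number of values.
q1 = median(ordered[:half])   # 25th percentile (first quartile)
q2 = median(ordered)          # 50th percentile (median)
q3 = median(ordered[half:])   # 75th percentile (third quartile)

print(q1, q2, q3)
```

With the twelve values above this gives 59, 80, and 91.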
The Spread of a Distribution: Dispersion
Measure               Definition
range                 the difference between the maximum and
                      minimum data values
interquartile range   the difference between the 25th and 75th
                      percentiles
variance              a measure of dispersion of the data around
                      the mean
standard deviation    a measure of dispersion expressed in the
                      same units of measurement as your data
                      (the square root of the variance)
The MEANS Procedure
General form of the MEANS procedure:
PROC MEANS DATA=SAS-data-set <options>;
   VAR variables;
RUN;
Picturing Distributions: Histogram
• Each bar in the histogram represents a
group of values (a bin).
• The height of the bar is the percent of values in
the bin (Frequency Histogram).
• The area of the bar is the percent of values in
the bin (Relative Frequency Histogram).
[Figure: histogram with bins on the horizontal axis and PERCENT on the vertical axis.]
The Normal Distribution
The normal distribution
– is symmetric. If you draw a line down the center, you get
the same shape on either side.
– is fully characterized by the mean and standard deviation.
Given those two parameters, you know all there is to
know about the distribution.
– is bell shaped.
– has mean = median = mode.
The red line on each of the following graphs
represents the shape of the normal distribution with
the mean and variance estimated from the sample
data.
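The properties listed above can be illustrated with Python's standard-library `statistics.NormalDist`. This is only a sketch: the mean and standard deviation below are hypothetical values, not estimates from any data set in these notes.

```python
from statistics import NormalDist

# Hypothetical parameter values, assumed only for this illustration.
mean, sd = 15.0, 0.5
dist = NormalDist(mu=mean, sigma=sd)

# Symmetry: the density is the same at equal distances from the mean.
left, right = dist.pdf(mean - 0.3), dist.pdf(mean + 0.3)

# Mean = median: half of the probability lies below the mean.
below_mean = dist.cdf(mean)

# Any probability follows from the two parameters, for example the
# share of values within one standard deviation of the mean.
within_one_sd = dist.cdf(mean + sd) - dist.cdf(mean - sd)

print(left, right, below_mean, round(within_one_sd, 4))
```

The last quantity is about 0.68 for any normal distribution, whatever the mean and standard deviation are.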
Characteristics of the Bell Curve
[Figure: bell curve with the peak, flanks, and tails labeled; horizontal axis from -4 to 4.]
The UNIVARIATE Procedure

General form of the UNIVARIATE procedure:
PROC UNIVARIATE DATA=SAS-data-set <options>;
   VAR variables;
   ID variable;
   HISTOGRAM variables </ options>;
   PROBPLOT variables </ options>;
RUN;
Graphical Displays of
Distributions
• You can produce three kinds of plots for
examining the distribution of your data
values:
– histograms
– box plots
– normal probability plots.
Box-and-Whisker Plots
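[Figure: box-and-whisker plot with the following parts labeled, from top to bottom.]
– largest point within 1.5 I.Q. (interquartile ranges) of the box
– the 75th percentile
– the 50th percentile (median)
– the 25th percentile
– smallest point within 1.5 I.Q. (interquartile ranges) of the box
The mean is denoted by a +.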
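The whisker rule can be sketched in Python. The example reuses the twelve values from the Percentiles slide and assumes quartiles are computed as medians of the lower and upper halves; the names `lower_fence` and `upper_fence` are illustrative labels, not terms from the slides.

```python
from statistics import median

# Data values reused from the Percentiles slide.
data = sorted([98, 95, 92, 90, 85, 81, 79, 70, 63, 55, 47, 42])
half = len(data) // 2

# Assumption: quartiles are medians of the lower and upper halves
# (works as written for an even number of values).
q1, q3 = median(data[:half]), median(data[half:])
iqr = q3 - q1

# Fences sit 1.5 interquartile ranges beyond the edges of the box.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points inside the fences.
lower_whisker = min(v for v in data if v >= lower_fence)
upper_whisker = max(v for v in data if v <= upper_fence)
outside = [v for v in data if v < lower_fence or v > upper_fence]

print(iqr, lower_fence, upper_fence, lower_whisker, upper_whisker, outside)
```

For these data the interquartile range is 32, the fences are 11 and 139, and no point falls outside them, so the whiskers simply reach the minimum (42) and maximum (98).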
The BOXPLOT Procedure
General form of the BOXPLOT procedure:
PROC BOXPLOT DATA=SAS-data-set;
   PLOT analysis-variable*group-variable </ options>;
RUN;
Exploratory Data Analysis with Two Variables
Objectives
– Examine the relationship between two
continuous variables using a scatter plot.
– Quantify the degree of linearity between
two continuous variables using correlation
statistics.
– Understand potential misuses of the correlation
coefficient.
– Obtain Pearson correlation coefficients using
the CORR procedure.
Scatter Plots
[Figure: scatter plot; the horizontal axis is X.]
Overview
[Diagram: correlation relates one continuous variable to another continuous variable.]
Example of Two Continuous Variables
[Figure: is Height (in.) related to Weight (lb.)?]
Relationships between Continuous Variables
[Figure: four scatter plots, numbered 1 to 4, each showing a different relationship between two continuous variables.]
Correlation
[Figure: three scatter plots. Plot of Weight by Height (Height 60 to 75, Weight 90 to 210); Plot of Errors by Study Time (Study Time 0 to 400, Errors 0 to 30); Plot of SAT-V by Toe Size (Toe Size 1.5 to 1.9, SAT-V 400 to 700).]
Pearson Correlation Coefficient
[Figure: the correlation coefficient scale from -1 to 1. Values near -1 are STRONG negative, values near 0 are weak, and values near 1 are STRONG positive.]
Misuses of the Correlation
Coefficient
Strong correlation does not mean that one variable causes the other.
[Figure: Height (in.) and Weight (lb.).]
SAT Example
Average SAT Score versus Percent Taking Test
[Figure: scatter plot of SCORE (800 to 1100) against PCTAKING (0 to 80).]
Missing Another Type of Relationship
Curvilinear Relationship
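A small Python sketch shows how the correlation coefficient can miss a curvilinear relationship. The data are hypothetical: Y is completely determined by X (Y = X squared), yet the Pearson correlation, computed here as SXY / sqrt(SXX * SYY), is zero because the relationship is not linear.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation from the sums SXY, SXX, and SYY."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: a perfect, but curved, relationship.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)
```

A scatter plot of these points would make the relationship obvious, which is why plotting the data before computing a correlation is worthwhile.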
The CORR Procedure

General form of the CORR procedure:
PROC CORR DATA=SAS-data-set <options>;
   VAR variables;
   WITH variables;
RUN;
Sample Covariance
\[ s_{xy} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N-1} \tag{2.3} \]
• Linear relationship or association
– Degree that X and Y co-vary from their respective means
• Direction of linear association
– Sign (+ or -) for direction
• Affected by scales of X’s and Y’s
Cross-Product and Sample Covariance
\[ SXY = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = (N-1)\, s_{xy} \]
\[ SXY = \sum_{i=1}^{N} X_i Y_i - \frac{\left( \sum_{i=1}^{N} X_i \right) \left( \sum_{i=1}^{N} Y_i \right)}{N} \tag{2.4} \]
\[ s_{xy} = \frac{1}{N-1}\, SXY \]
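The cross-product and covariance formulas can be checked with a Python sketch. The heights and weights are hypothetical values made up for illustration. The sketch computes SXY by the definitional form and by the computational form (2.4), then shows that the covariance changes when X is re-expressed in different units.

```python
# Hypothetical heights (in.) and weights (lb.), for illustration only.
height_in = [60, 63, 66, 69, 72]
weight_lb = [110, 125, 150, 160, 190]

def cross_product(x, y):
    """SXY by the definitional form: sum of (X - mean)(Y - mean)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))

def cross_product_shortcut(x, y):
    """SXY by the computational form (2.4)."""
    n = len(x)
    return sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

n = len(height_in)
sxy = cross_product(height_in, weight_lb)
sxy_shortcut = cross_product_shortcut(height_in, weight_lb)
s_xy = sxy / (n - 1)                     # sample covariance (2.3)

# Rescaling X from inches to centimeters rescales the covariance too.
height_cm = [h * 2.54 for h in height_in]
s_xy_cm = cross_product(height_cm, weight_lb) / (n - 1)

print(sxy, sxy_shortcut, s_xy, round(s_xy_cm, 3))
```

Both forms give SXY = 585, so the covariance is 146.25 in inch-pound units and about 371.475 once height is in centimeters. The sign is the same, but the size depends on the measurement scale.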
Pearson product-moment
correlation coefficient
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{s_{xy}}{SD_X \, SD_Y} = \frac{SXY}{\sqrt{(SXX)(SYY)}} \]
• Standardizes linear association
– Size or magnitude of correlation
\[ -1 \le r_{xy} \le 1 \]
Disadvantages of Pearson’s Correlation
• Validity of $r_{xy}$ depends on sample size
• Designed for linear association
• Does not indicate causality
• Primarily for relationship between continuous
variables (interval or ratio scales)
• Affected by outliers
Pearson’s Correlation and Sample Size
Here are the limits within which 80% of sample r’s will fall
when the true correlation (i.e., in the population) is zero:
[Table of limits by sample size not reproduced.]
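The dependence on sample size can be approximated by simulation. The Python sketch below is not the source of the slide's limits; it draws X and Y independently (so the true correlation is zero) and reports the 10th and 90th percentiles of the simulated sample r's. The sample sizes, number of replications, and seed are arbitrary assumptions.

```python
import random
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation from the sums SXY, SXX, and SYY."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def central_80_limits(n, reps=2000, seed=581):
    """Simulated limits containing the middle 80% of sample r's."""
    rng = random.Random(seed)
    rs = []
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [rng.gauss(0, 1) for _ in range(n)]   # independent of x
        rs.append(pearson_r(x, y))
    rs.sort()
    cut = reps // 10
    return rs[cut], rs[reps - cut - 1]

# Sample sizes are arbitrary choices for the illustration.
limits = {n: central_80_limits(n) for n in (5, 20, 100)}
for n, (low, high) in limits.items():
    print(n, round(low, 2), round(high, 2))
```

The limits narrow as N grows: with only a handful of observations, sizeable sample correlations can arise by chance even when the population correlation is zero.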
Simple Linear Regression
Simple Linear Regression Analysis
• The objectives of simple linear regression
are to
– assess the significance of the predictor variable
in explaining the variability or behavior of the
response variable
– predict the values of the response variable
given the values of the predictor variable.
Fitness Example
• In exercise physiology, an objective measure of
aerobic fitness is how fast the body can
absorb and use oxygen (oxygen
consumption).
• Subjects participated in a predetermined
exercise run of 1.5 miles.
• Measurements of oxygen consumption as
well as several other continuous
measurements such as age, pulse, and weight
were recorded.
• The researchers are interested in determining
whether any of these other variables can help
predict oxygen consumption.
Variables in sasuser.b_fitness
• Name                 name of the member
• Gender               gender of the member
• Runtime              time to run 1.5 miles (in minutes)
• Age                  age of the member (in years)
• Weight               weight of the member (in kilograms)
• Oxygen_Consumption   a measure of the ability to use oxygen in the blood stream
• Run_Pulse            pulse rate at the end of the run
• Rest_Pulse           resting pulse rate
• Maximum_Pulse        maximum pulse rate during the run
• Performance          a measure of overall fitness
Fitness Example
PREDICTOR: Performance
RESPONSE: Oxygen_Consumption
Outcome = Systematic component + Residual
Goal 1: Identify the systematic components and determine how
they fit the data
Goal 2: Assess how well we did by examining the magnitude of the
residuals
\[ Y = \beta_0 + \beta_1 X + \varepsilon \]
– $Y$: outcome
– $X$: predictor
– $\beta_0$, $\beta_1$: “population parameters” or “regression coefficients” to be estimated
Simple Linear Regression Model
[Figure: regression line of Response (Y) on Predictor (X); a 1-unit increase in X corresponds to a change of $\beta_1$ units in Y.]
Simple Linear Regression Model

[Figure: the unknown population relationship between Response (Y) and Predictor (X); a deviation labeled Y-Y in the original is marked.]
Simple Linear Regression
• Used to test association between two variables
• Accounts for (predicts) the variance in an
interval dependent variable based on an interval,
dichotomous, or dummy independent variable.
• By estimating a straight line through the
corresponding X-Y data points we can estimate
the magnitude of the relationship between X and
Y.
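As a preview of how a straight line is estimated through X-Y data points, the Python sketch below applies the standard least squares formulas, slope = SXY / SXX and intercept = mean of Y minus slope times mean of X. These formulas have not yet been introduced at this point in the notes, and the data values are hypothetical.

```python
# Hypothetical predictor (X) and outcome (Y) values, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates of the slope and the intercept.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

# Outcome = systematic component + residual.
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

print(round(b0, 3), round(b1, 3), [round(e, 2) for e in residuals])
```

The slope b1 estimates how much Y changes for a 1-unit change in X, and the residuals show how far each observed Y lies from the fitted line. With least squares the residuals sum to zero apart from rounding error.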
Clarifying the terminology

Term           Definition                              Synonyms
Outcome (Y)    Variable whose behavior we are          Dependent variable, Response,
               trying to explain                       Criterion
Predictor (X)  Variable we are using to explain        Independent variable, Predictor,
               the variation in the outcome            Covariate
Relationship   How two variables relate to each        Association, Correlation,
               other, without implying causality       Covariation

Association is not the same as causation.
Next Lecture
• Next Lecture: SAS. Bring your laptop if
possible
• Reading assignments:
– the means, univariate, etc. procedures (Base
SAS Guide)
– Chapters 1 and 2