(Not) Relationships among Variables
• Descriptive stats (e.g., mean, median, mode, standard deviation)
  describe a sample of data
• z-test and/or t-test for a single population parameter (e.g., mean)
  infer the true value of a single variable
  ex: mean # of random digits that people can memorize
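The single-parameter inference above can be sketched as a by-hand one-sample t-test. The digit-span scores and the null value of 7 are made up for illustration:

```python
from math import sqrt

# One-sample t-test: is the mean # of digits memorized equal to 7?
scores = [7, 9, 6, 8, 7, 8, 6, 9]   # hypothetical sample
mu_0 = 7.0                          # hypothesized population mean

n = len(scores)
mean = sum(scores) / n
# sample standard deviation (n - 1 in the denominator)
s = sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))

# t = (sample mean - null value) / standard error
t = (mean - mu_0) / (s / sqrt(n))
# compare t to the critical value from a t table with n - 1 = 7 df
```
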
Relationships among Variables
• But relationships among two or more variables are the crucial
  feature of almost all scientific research.
• Examples:
  How does the perception of a stimulus vary with the physical
  intensity of that stimulus?
  How does the attitude towards the President vary with the
  socioeconomic properties of the survey respondent?
  How does the performance on a mental task vary with age?
Relationships among Variables
• More Examples:
  How does depression vary with the number of traumatic experiences?
  How does undergraduate drinking vary with performance in
  quantitative courses?
  How does memory performance vary with attention span?
  etc.
• We’ve already learned a few ways to analyze relationships
  among 2 variables.
Relationships among Two Variables: Chi-Square
• Chi-Square test of independence (2-way contingency table)
• compare observed cell frequencies to the cell frequencies you’d
  expect if the two variables were independent.
• ex:
  • X = geographical region: West Coast, Midwest, East Coast
  • Y = favorite color: red, blue, green
• Note: both variables are categorical
Relationships among Two Variables: Chi-Square
• Observed frequencies:

          West Coast  Midwest  East Coast  total
  Red         49        30        18         97
  Blue        52        32        20        104
  Green      130        62       107        299
  total      231       124       145        500

• Expected frequencies (if the two variables are independent):

  expected cell frequency = (row total)(column total) / grand total

          West Coast  Midwest  East Coast  total
  Red       44.81      24.06     28.13       97
  Blue      48.05      25.79     30.16      104
  Green    138.14      74.15     86.71      299
  total    231        124       145         500
Relationships among Two Variables: Chi-Square
• Comparing the observed and expected frequencies:

  χ² = Σ over all cells (observed − expected)² / expected ≈ 17.97

• if this exceeds the critical value, reject H₀ that the
  2 variables are independent (unrelated)
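The chi-square computation above can be sketched directly from the slide's table; this reproduces the expected-frequency formula and the statistic by hand:

```python
# Chi-square test of independence, from the slide's contingency table.
# Rows = favorite color (Red, Blue, Green); columns = region.
observed = [
    [49, 30, 18],    # Red
    [52, 32, 20],    # Blue
    [130, 62, 107],  # Green
]

row_totals = [sum(row) for row in observed]        # 97, 104, 299
col_totals = [sum(col) for col in zip(*observed)]  # 231, 124, 145
grand_total = sum(row_totals)                      # 500

# expected cell = (row total)(column total) / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# chi-square: sum over all cells of (observed - expected)^2 / expected
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
# chi2 ≈ 17.97; df = (rows-1)(cols-1) = 4, and the .05 critical
# value is 9.49, so we reject H0 of independence
```
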
Relationships among Two Variables: z, t tests
• z-test and/or t-test for difference of population means
• compare values of one variable (Y) for 2 different
  levels/groups of another variable (X)
• ex:
  • X = age: young people vs. old people
  • Y = # of random digits one can memorize
  • Q: Is the mean # of digits the same for the 2 age groups?
Relationships among Two Variables: ANOVA
• ANOVA
• compare values of one variable (Y) for 3+ different
  levels/groups of another variable (X)
• ex:
  • X = age: young people, middle-aged, old people
  • Y = # of random digits one can memorize
  • Q: Is the mean # of digits the same for all 3 age groups?
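The ANOVA question above can be sketched as a by-hand one-way F test; the digit-span scores for the three groups are made up for illustration:

```python
# One-way ANOVA: does mean # of digits memorized differ across 3 age groups?
# Scores are hypothetical.
groups = {
    "young":  [7, 8, 6, 9, 7],
    "middle": [6, 7, 5, 7, 6],
    "old":    [5, 6, 4, 6, 5],
}

k = len(groups)                           # number of groups
n = sum(len(v) for v in groups.values())  # total sample size
grand_mean = sum(x for v in groups.values() for x in v) / n

# between-groups sum of squares: group sizes times squared deviations
# of group means from the grand mean
ss_between = sum(len(v) * (sum(v) / len(v) - grand_mean) ** 2
                 for v in groups.values())
# within-groups sum of squares: squared deviations from each group's mean
ss_within = sum((x - sum(v) / len(v)) ** 2
                for v in groups.values() for x in v)

# F = mean square between / mean square within
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
# compare to the F critical value with (k-1, n-k) degrees of freedom
```
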
[Figure: overlaid frequency histograms of # digits memorized (0–15)
for the young and old groups]
Relationships among Two Variables: z, t & ANOVA
NOTE: for z/t tests for differences, and for ANOVA, there are a
small number of possible values for one of the variables (X)
z, t: [Figure: frequency histograms of # digits memorized (0–15)
for the young and old groups]
ANOVA: [Figure: frequency histograms of # digits memorized (0–15)
for the young, middle, and old groups]
Relationships among Two Variables: many values of X?
What about when there are many possible values of BOTH variables?
Maybe they’re even continuous (rather than discrete)?
[Figure: scatter plot]
Correlation and Simple Linear Regression will be used to analyze
the relationship between two such variables (scatter plot)
Correlation: Scatter Plots
[Figure: scatter plot]
Does it look like there is a relationship?
Correlation
• measures the direction and strength of a
linear relationship between two variables
• that is, it answers in a general way the
question: “as the values of one variable
change, how do the corresponding values
of the other variable change?”
Linear Relationship
• linear relationship: y = a + bx (straight line)
[Figures: example plots, labeled Linear and Not (strictly) Linear]
Correlation Coefficient: r
• sign: direction of relationship
• magnitude (number): strength of relationship
• −1 ≤ r ≤ 1
• r = 0 is no linear relationship
• r = −1 is “perfect” negative correlation
• r = +1 is “perfect” positive correlation
• Notes:
  Symmetric measure (you can exchange X and Y and get the same value)
  Measures linear relationship only
[Figure: example scatter plots at various values of r]
Correlation Coefficient: r
• Formula:

  r = [1/(n−1)] Σ [(x_i − x̄)/s_x] [(y_i − ȳ)/s_y]

  (a sum of products of standardized values)

• alt. formula (ALEKS):

  r = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / [(n−1) s_x s_y]
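Both formulas for r can be sketched on a small sample; the (x, y) values are made up for illustration, and the two formulas agree:

```python
from math import sqrt

# Hypothetical paired sample
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)

x_bar, y_bar = sum(x) / n, sum(y) / n
# sample standard deviations (n - 1 in the denominator)
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# definition: average product of standardized values
r_std = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y)
            for xi, yi in zip(x, y)) / (n - 1)

# ALEKS alternative: (sum of x_i y_i - n*x_bar*y_bar) / ((n-1) s_x s_y)
r_alt = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar) \
        / ((n - 1) * s_x * s_y)
# both give the same r; swapping x and y also gives the same r (symmetric)
```
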
Correlation: Examples
[Figures: example scatter plots]
Population: undergraduates
• Others?
Correlation: Interpretation
• Correlation ≠ Causation!
• When 2 variables are correlated, the causality may be:
  X → Y
  X ← Y
  Z → X & Y (“lurking” third variable)
  or a combination of the above
• Examples: ice cream & murder, violence & video games,
  SAT verbal & math, booze & GPA
• Inferring causation requires consideration of: how the data were
  gathered (e.g., experiment vs. observation), other relevant
  knowledge, logic...
Simple Linear Regression
• PREDICTING one variable (Y) from another (X)
  No longer symmetric like correlation
  One variable is used to “explain” another variable

  X Variable: Independent Variable, Explaining Variable,
  Exogenous Variable, Predictor Variable
  Y Variable: Dependent Variable, Response Variable,
  Endogenous Variable, Criterion Variable
Simple Linear Regression
• idea: find a line (linear function) that best fits the scattered
  data points
• this will let us characterize the relationship between X & Y, and
  predict new values of Y for a given X value.

Reminder: (Simple) Linear Function Y = a + bX
We are interested in this to model the relationship between an
independent variable X and a dependent variable Y
  intercept: a (the line crosses the Y axis at (0, a))
  slope: b
[Figure: line Y = a + bX with intercept a and slope b]

Simple Linear Regression
If we had errorless predictions, all data points would fall right
on the line Y = a + bX.
[Figure: data points lying exactly on a line]
[Figures: scatter plots of Y vs. X with candidate regression lines]
A guess at the location of the regression line
Another guess at the location of the regression line
(same slope, different intercept)
Another guess at the location of the regression line
(same intercept, different slope)
Another guess at the location of the regression line
(different intercept and slope, same “center”)
We will end up being reasonably confident that the true regression
line is somewhere in the indicated region.
Estimated Regression Line
[Figure: scatter plot with the estimated regression line;
errors/residuals drawn as vertical distances from each point
to the line]
• error terms have to be drawn vertically
• error/residual for point i: e_i = y_i − ŷ_i
• Equation of the Regression Line: Ŷ = a + bX
• ŷ_i = “y hat”: the predicted value of Y for x_i
Estimating the Regression Line
• Idea: find the formula for the line that minimizes the squared
  errors
  error: distance between actual data point and predicted value
• Y = a + bX, also written Y = b₀ + b₁X
• b₁ = slope of regression line
• b₀ = Y intercept of regression line

  b₁ (slope) = Σ_{i=1}^{N} (X_i − X̄)(Y_i − Ȳ) / Σ_{i=1}^{N} (X_i − X̄)²

  ALEKS (using the correlation coefficient): b₁ = r (s_y / s_x)

  alt: b₁ = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / [(n−1) s_x²]

  b₀ (Y intercept) = Ȳ − b₁ X̄
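The least-squares formulas above can be sketched on a small made-up sample; this computes b₁ and b₀ by hand and then the residuals e_i = y_i − ŷ_i:

```python
# Least-squares slope and intercept from the slide's formulas.
# Hypothetical paired sample.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# b1 = sum (X_i - X_bar)(Y_i - Y_bar) / sum (X_i - X_bar)^2
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) \
     / sum((xi - x_bar) ** 2 for xi in x)
# b0 = Y_bar - b1 * X_bar (the line passes through (X_bar, Y_bar))
b0 = y_bar - b1 * x_bar

# predicted values and residuals
y_hat = [b0 + b1 * xi for xi in x]
residuals = [yi - yh for yi, yh in zip(y, y_hat)]  # e_i = y_i - y_hat_i
```

Note that unlike r, this is not symmetric: regressing x on y gives a different line than regressing y on x.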