Biostatistics Notes

JHU SON
Spring 2017 course
Essential Statistics, Instructor's Copy, Moore/Notz/Fligner, 2nd edition
Transcribed course lecture by Janna Stephens, PhD, RN

Correlation and Simple Linear Regression

Define explanatory and response variables


Response variable - A variable that measures an outcome of a study.
(Dependent variable, or outcome variable)
Explanatory variable - A variable that may explain or influence changes in
a response variable. (independent variable, or predictor)
* Studies may show that a change in one or more explanatory variables can
CAUSE a change in the response variable.
Ex. To study the correlation between the amount of time spent studying
biostatistics and the final exam grade: the explanatory variable is the amount
of time studying; the final grade is the response variable.

* We can't always say that the explanatory variable causes the response
variable; it may help predict it, but there are other variables involved that
should be accounted for.
Ex. Does the amount of time studying cause a good grade on the exam? It may
help, but other factors matter too, such as previous exposure to math, comfort
with math, years since the last math course, hours of sleep the previous
night, etc.

Construct and interpret scatterplots


The most useful graph for displaying the relationship between two quantitative
variables is the scatterplot.

Scatterplot - A plot that displays the relationship between two quantitative
variables measured on the same individuals. If one of the variables is an
explanatory variable, it should be represented on the horizontal axis (x-axis).

To CONSTRUCT a scatterplot:
1. Decide which variable goes on which axis (typically, the explanatory
variable goes on the x-axis).
2. Label and scale the axes.
3. Plot the individual values.

To INTERPRET scatterplots, follow the basic strategy of data analysis: look
for patterns and important departures from those patterns. Describe the
overall pattern by the direction, form, and strength of the relationship.
Look for outliers: an outlier is an individual value that falls outside the
overall pattern of the relationship.
Direction:
Positive association - Description for two variables when above-average
values of one tend to accompany above-average values of the other, and
below-average values also tend to occur together.
Negative association - Description for two variables when above-average
values of one tend to accompany below-average values of the other, and
vice versa.
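
The definitions of positive and negative association can be checked numerically. The sketch below is only an illustration: the function name association_direction and the data are made up, and counting points on each side of the means is a rough stand-in for what you would normally judge by eye from the scatterplot.

```python
from statistics import mean

def association_direction(xs, ys):
    """Rough check of direction: compare each point with the two means."""
    x_bar, y_bar = mean(xs), mean(ys)
    # "same": x and y are both above average or both below average.
    same = sum(1 for x, y in zip(xs, ys) if (x - x_bar) * (y - y_bar) > 0)
    # "opposite": one is above average while the other is below average.
    opposite = sum(1 for x, y in zip(xs, ys) if (x - x_bar) * (y - y_bar) < 0)
    if same > opposite:
        return "positive"
    if opposite > same:
        return "negative"
    return "no clear direction"

# Made-up illustration data: hours studied and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]

print(association_direction(hours, grade))        # grades rise with hours
print(association_direction(hours, grade[::-1]))  # same grades in reverse order
```

This only describes direction; form and strength still need to be read from the plot.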

Calculate and interpret correlation


Correlation - Denoted by r. Measures the direction and strength of the
linear relationship between two quantitative variables.

r is always a number between -1 (perfect negative relationship) and 1
(perfect positive relationship). A perfect linear relationship (r exactly -1
or 1) is very rare; you can get close by measuring constructs that
theoretically hang together.
r>0 indicates positive association
r<0 indicates negative association
Values of r near 0 indicate a very weak linear relationship.
Strength of the linear relationship increases as r moves away from 0
toward -1 or 1.
The extreme values r=-1 and r=1 occur only in the case of a perfect
linear relationship.
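
These notes do not write out the formula for r. A minimal sketch, assuming the standard sample formula (sum the products of the standardized x and y values, then divide by n - 1), is shown below. The function name and the data are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up illustration data: hours studied and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]

print(round(correlation(hours, grade), 3))        # close to 1: strong positive
print(round(correlation(hours, grade[::-1]), 3))  # close to -1: strong negative
print(round(correlation([1, 2, 3, 4], [3, 5, 7, 9]), 3))  # points exactly on a line
```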

Notes on Correlation:
1. Correlation makes no distinction between explanatory and response
variables (it doesn't matter which variable you call x or y).
2. r has no units and does not change when we change the units of
measurement of x, y, or both. (Ex. You can measure weight in pounds or
kilograms, or height in cm or inches, and it won't change the correlation
between height and weight.)
3. Positive r indicates positive association between the variables, and
negative r indicates negative association.
4. The correlation r is always a number between -1 and 1.
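
Notes 1 and 2 can be checked with a short sketch. The heights and weights below are made up, and the correlation function assumes the standard sample formula, which these notes do not spell out.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up heights (inches) and weights (pounds) for five people.
height_in = [60, 64, 66, 69, 72]
weight_lb = [115, 130, 150, 160, 185]

# The same measurements converted to centimeters and kilograms.
height_cm = [h * 2.54 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = correlation(height_in, weight_lb)
r_metric = correlation(height_cm, weight_kg)   # note 2: units changed
r_swapped = correlation(weight_lb, height_in)  # note 1: x and y swapped

print(r_imperial, r_metric, r_swapped)  # all three agree up to rounding error
```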

Quantify the linear relationship between an explanatory variable (x) and
response variable (y).
Use a regression line to predict values of (y) for values of (x).
Regression line - A straight line that describes how a response variable y
changes as an explanatory variable x changes. You can use a regression line
to predict the value of y for a given value of x.
The regression line has the form y-hat = a + bx, where:
x is the value of the explanatory variable
y-hat is the predicted value of the response variable for a given
value of x.
b is the slope:
o Slope - Denoted by b in the straight-line equation of the form y
= a + bx, the amount by which y changes when x increases by
one unit.
a is the intercept:
o Intercept - Denoted by a in the straight-line equation of the
form y = a + bx, the value of y when x = 0.
Since we are trying to predict y, we want the regression line to be as close
as possible to the data points in the vertical (y) direction.
Least-squares regression line (LSRL) - The line that makes the sum of the
squares of the vertical distances of the data points from the line as small
as possible. Its equation is y-hat = a + bx, with slope b = r(sy/sx) and
intercept a = y-bar - b(x-bar), where sx and sy are the standard deviations
of the two variables, x-bar and y-bar are their means, and r is their
correlation.
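
As a worked illustration, the least-squares line can be computed from the summary statistics using slope b = r(sy/sx) and intercept a = y-bar - b(x-bar). The sketch below uses made-up data, and the function names are hypothetical, not from the course.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r from standardized x and y values."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

def least_squares_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x using b = r*sy/sx, a = y-bar - b*x-bar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Made-up illustration data: hours studied and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]

a, b = least_squares_line(hours, grade)
print(f"y-hat = {a:.2f} + {b:.2f}x")

# Predict the grade for a student who studies 3.5 hours (inside the data range).
print(round(a + b * 3.5, 1))
```

For this made-up data the slope is about 7.09, so each additional hour of studying goes with a predicted grade about 7 points higher. The line always passes through the point (x-bar, y-bar).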

Calculate and interpret residuals.


Residual - The difference between an observed value of the response
variable and the value predicted by the regression line.
So, the residual = observed y - predicted y, OR residual = y - y-hat

We want to see how tightly grouped points are to the regression line. So we
look at each data point and draw a vertical line from the data point to the
regression line. The signed lengths of these lines are the residuals. Then, we
plot that information on a residual plot.

Residual plot - A scatterplot of the regression residuals against the
explanatory variable.
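
A minimal sketch of computing residuals (observed y minus predicted y) is given below, using made-up data. The helper fit_line is hypothetical; it computes the least-squares slope directly from the deviations, which gives the same line as the r(sy/sx) form. The printed (x, residual) pairs are the points you would put on a residual plot.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustration data: hours studied and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]

a, b = fit_line(hours, grade)

# residual = observed y - predicted y
residuals = [y - (a + b * x) for x, y in zip(hours, grade)]

for x, res in zip(hours, residuals):
    print(x, round(res, 2))  # one point of the residual plot per line

# For a least-squares line the residuals add up to zero (up to rounding).
print(abs(sum(residuals)) < 1e-9)
```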

Outliers and Influential points:


Influential - Description given to an observation for a statistical calculation
if removing it would markedly change the result of the calculation.
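
To see what "markedly change the result" can look like, the sketch below fits a least-squares line with and without one extreme observation. The data, including the extra point at x = 20, are made up for illustration, and fit_line is a hypothetical helper.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustration data: hours studied and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]

# One made-up student far out in the x direction: 20 hours, grade of 60.
hours_with = hours + [20]
grade_with = grade + [60]

a_without, b_without = fit_line(hours, grade)
a_with, b_with = fit_line(hours_with, grade_with)

print(f"without the point: slope = {b_without:.2f}")
print(f"with the point:    slope = {b_with:.2f}")
```

In this made-up example a single point pulls the slope from about 7 down to roughly 0, so the point is influential.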

Recall that an outlier is an observation that lies far away from the other
observations.
Outliers in the y direction have large residuals
Outliers in the x direction are often influential for the least-squares
regression line, meaning that the removal of such points would
markedly change the equation of the line.
Also, we discussed previously how correlation (r) describes the strength of
a straight-line relationship. In the regression setting, this description is
r^2: the square of the correlation is the fraction of the variation in the
values of y that is explained by the least-squares regression of y on x.
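
The "fraction of variation explained" reading of r^2 can be verified numerically. The sketch below uses made-up data and hypothetical names; it compares r squared with 1 minus (variation left in the residuals) / (total variation in y).

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustration data: hours studied and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]

a, b = fit_line(hours, grade)
x_bar, y_bar = mean(hours), mean(grade)

# Total variation in y, and the variation left over after fitting the line.
total_variation = sum((y - y_bar) ** 2 for y in grade)
residual_variation = sum((y - (a + b * x)) ** 2 for x, y in zip(hours, grade))
fraction_explained = 1 - residual_variation / total_variation

# Correlation r computed directly from the deviations.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, grade))
s_xx = sum((x - x_bar) ** 2 for x in hours)
r = s_xy / (s_xx * total_variation) ** 0.5

print(round(r ** 2, 4), round(fraction_explained, 4))  # the two values agree
```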

Describe cautions about correlation and regression.


1. Both describe linear relationships (they require that the scatterplot show
a linear pattern).
2. Both are affected by outliers (susceptible to influential observations).
3. Always plot the data before interpreting.
4. Beware of extrapolation.
Extrapolation - The use of a regression line for prediction far
outside the range of values of the explanatory variable x that you
used to obtain the line. Such predictions are often not accurate.
5. Beware of lurking variables
Lurking variable - A variable that is not among the explanatory or
response variables in a study and yet may influence the
interpretation of relationships among those variables.
Correlation does not imply causation!
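
The extrapolation caution (item 4) can be illustrated with a short sketch. Everything here is hypothetical: the data are made up, and the exam is assumed to be graded out of 100. The line is fitted on 1 to 6 hours of studying and then used, inappropriately, to predict the grade for 20 hours.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

# Made-up illustration data: hours studied (1 to 6) and final exam grade.
hours = [1, 2, 3, 4, 5, 6]
grade = [52, 60, 63, 71, 80, 88]
MAX_GRADE = 100  # assumption: the exam is scored out of 100

a, b = fit_line(hours, grade)

def predict(x):
    """Predicted grade y-hat for x hours of studying."""
    return a + b * x

inside = predict(3.5)  # within the range of x used to fit the line
outside = predict(20)  # far outside that range: extrapolation

print(round(inside, 1), inside <= MAX_GRADE)
print(round(outside, 1), outside <= MAX_GRADE)  # exceeds the maximum grade
```

The prediction inside the observed range is plausible, while the extrapolated one is an impossible grade, which is why such predictions should not be trusted.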