
Chapter 2 Looking at Data – Relationships

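2.1 Scatterplots (P. 105-112)

A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables.

2.2 Correlation

The correlation measures the direction and strength of the linear relationship between two quantitative variables. It is defined as

$$ r = \frac{1}{n-1} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right) $$

where $\bar{x}$ and $s_x$ are the mean and standard deviation of the x values, and $\bar{y}$ and $s_y$ are those of the y values.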
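As a minimal sketch of the definition above, the correlation can be computed by standardizing each x and y value, multiplying the pairs, and dividing the total by n − 1. The function name `correlation` and the data values are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r: sum of products of standardized values over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1 divisor)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
r = correlation(x, y)
print(round(r, 4))  # 0.7746
```

Here r is positive and fairly close to 1, which suggests a moderately strong positive linear association in this small made-up data set.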

Properties of Correlation:
1. Correlation requires that both variables be quantitative.
2. Correlation does not change with the units of measurement of x, y, or both.
3. −1 ≤ r ≤ 1. Positive r indicates positive association and negative r indicates negative association.
4. Values of r close to −1 or 1 indicate a strong linear relationship. Values of r close to 0 indicate a weak linear relationship. Correlation measures the strength of only the linear relationship. Correlation does not describe curved relationships no matter how strong they are.
5. Correlation can be strongly affected by outliers. Interpret the value of r with caution when outliers appear in the scatterplot.
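Several of these properties can be checked numerically. The sketch below uses small made-up data sets (all names and values are hypothetical) to illustrate that r is unchanged by a change of units, that a perfectly curved relationship can give r = 0, and that one outlier can change r sharply.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r, computed from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# All data below are made up for illustration.

# Units: converting heights from metres to centimetres leaves r unchanged.
height_m = [1.5, 1.6, 1.7, 1.8, 1.9]
weight_kg = [50, 58, 63, 70, 80]
height_cm = [100 * h for h in height_m]
r_m = correlation(height_m, weight_kg)
r_cm = correlation(height_cm, weight_kg)

# Curved relationships: y = x^2 is an exact relationship, yet r is 0 here.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [x ** 2 for x in x_curve]
r_curve = correlation(x_curve, y_curve)

# Outliers: one added point pulls r from 1 down to about 0.14.
x_line = [1, 2, 3, 4, 5]
y_line = [1, 2, 3, 4, 5]
r_clean = correlation(x_line, y_line)
r_outlier = correlation(x_line + [6], y_line + [0])

print(r_m, r_cm, r_curve, r_clean, r_outlier)
```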

2.3 Least-Squares Regression

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. It is determined by fitting a line to data, i.e., drawing a line that comes as close as possible to the data points on the scatterplot. Given a set of (x, y) observations, the regression line can be described in a compact mathematical form

$$ \hat{y} = a + bx $$

where

$$ b = r \frac{s_y}{s_x} \quad \text{and} \quad a = \bar{y} - b\bar{x} $$

are the slope and intercept of the line.
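The slope and intercept formulas translate directly into code. The following is a minimal sketch, assuming made-up data; the helper names `correlation` and `fit_line` are hypothetical.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Sample correlation r, computed from standardized values."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def fit_line(xs, ys):
    """Return (a, b) for y-hat = a + b*x, with b = r*s_y/s_x and a = y_bar - b*x_bar."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)
    a = mean(ys) - b * mean(xs)
    return a, b

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)
print(f"y-hat = {a:.2f} + {b:.2f} x")  # y-hat = 2.20 + 0.60 x
```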

About the regression line
1. The regression line is also called the line of best fit. It is the straight line that best fits the data in the sense that the sum of the squares of the vertical distances of the data points from the line is as small as possible. Hence the term "least-squares regression".
2. The regression line always passes through the point $(\bar{x}, \bar{y})$. That is, $\bar{y} = a + b\bar{x}$.
3. The slope b of the regression line is an estimate of the rate of change of the y variable with respect to the x variable.
4. The intercept a is the predicted value of y when x = 0.
5. The regression line can be used to predict the value of y for any given value of x by substituting this x value into the equation of the line. However, extrapolation beyond the range of the observed x-values is risky.
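The facts about the regression line can be verified on a small example. This sketch, using made-up data and hypothetical helper names, checks that the fitted line passes through the point of means, that nudging the slope or intercept only increases the sum of squared vertical distances, and shows a prediction inside and outside the observed x range. The slope is computed as a ratio of sums, which is algebraically the same as r times s_y / s_x.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx  # algebraically equal to r * s_y / s_x
    a = y_bar - b * x_bar
    return a, b

def sum_sq_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = fit_line(x, y)

# The line passes through (x_bar, y_bar).
passes_through_means = abs((a + b * mean(x)) - mean(y)) < 1e-9

# Any nearby line has a larger sum of squared errors.
best = sum_sq_errors(a, b, x, y)
nearby = [
    sum_sq_errors(a + da, b + db, x, y)
    for da in (-0.1, 0, 0.1)
    for db in (-0.1, 0, 0.1)
    if (da, db) != (0, 0)
]

# Prediction within the data range, and a risky extrapolation far outside it.
prediction_inside = a + b * 3.5
prediction_outside = a + b * 50  # computable, but the data say nothing about x = 50

print(passes_through_means, best, min(nearby), prediction_inside, prediction_outside)
```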

Variation

The square of the correlation, $r^2$, is the fraction of the variation in the values of y that is explained by its relationship to x. Specifically,

$$ r^2 = \frac{\text{variance of predicted values } \hat{y}}{\text{variance of observed values } y} $$
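The variance-ratio form of r squared can be checked against the square of the correlation itself. A minimal sketch, assuming made-up data and hypothetical variable names:

```python
from statistics import mean, variance

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
s_yy = sum((yi - y_bar) ** 2 for yi in y)

# Least-squares line and its predicted values.
b = s_xy / s_xx
a = y_bar - b * x_bar
y_hat = [a + b * xi for xi in x]

# r^2 as a ratio of variances, and as the square of the correlation.
r_squared_from_variances = variance(y_hat) / variance(y)
r = s_xy / (s_xx * s_yy) ** 0.5
r_squared_from_r = r ** 2

print(round(r_squared_from_variances, 4), round(r_squared_from_r, 4))  # 0.6 0.6
```

In this example about 60% of the variation in y is explained by the straight-line relationship with x.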

Residual

A residual is the difference between an observed value of the response variable and the value predicted by the regression line.

Residual = observed ($y$) − predicted ($\hat{y}$)

The sum of the least-squares residuals is always 0. A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
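As a short illustration, the residuals of a least-squares fit can be computed point by point and summed. The data and names below are hypothetical; the (x, residual) pairs are the points one would draw in a residual plot.

```python
from statistics import mean

# Made-up data for illustration only.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

# Least-squares slope and intercept.
x_bar, y_bar = mean(x), mean(y)
b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sum(
    (xi - x_bar) ** 2 for xi in x
)
a = y_bar - b * x_bar

# Residual = observed y minus predicted y-hat.
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
residual_plot_points = list(zip(x, residuals))  # residuals against the explanatory variable

print([round(e, 2) for e in residuals])
print(abs(sum(residuals)) < 1e-9)  # True: the residuals sum to 0 up to rounding
```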

Outliers and Influential observations in regression

An outlier is an observation that lies outside the overall pattern of the other observations. Some outliers have large residuals, but others do not (see Figures 2.22 and 2.23). An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. Points that are outliers in the x direction of a scatterplot are often influential for the least-squares regression line.
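One way to see influence is to fit the line with and without a suspect point. The sketch below uses made-up data in which the last point is an outlier in the x direction; the names are hypothetical. Removing that point changes the slope markedly, even though its own residual is smaller than that of some ordinary points.

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    return y_bar - b * x_bar, b

# Made-up data; the last point (15, 3) is an outlier in the x direction.
x_all = [1, 2, 3, 4, 5, 15]
y_all = [2, 4, 5, 4, 5, 3]

a_all, b_all = fit_line(x_all, y_all)            # fit using every point
a_sub, b_sub = fit_line(x_all[:-1], y_all[:-1])  # fit with the outlier removed

# Residuals from the fit that includes the outlier.
residuals = [yi - (a_all + b_all * xi) for xi, yi in zip(x_all, y_all)]

print(f"slope with outlier:    {b_all:.3f}")
print(f"slope without outlier: {b_sub:.3f}")
print(f"outlier residual:      {residuals[-1]:.3f}")
```

Because the outlying point pulls the line toward itself, a small residual does not rule out a point being influential.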

Lurking Variable

A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.