# DESCRIPTIVE AND INFERENTIAL STATISTICS – LECTURES – summer 2013/14

Lecture 10
REGRESSION AND SAMPLE
CORRELATION
Predrag Spasojevic

INTRODUCTION
• Many engineering and scientific problems are concerned with
determining a relationship between a set of variables.
• For example, in a chemical process we might be interested in the relationship between:
 the output of the process,
 the temperature at which it occurs,
 the amount of catalyst employed.
• Knowledge of such a relationship would enable us to predict
the output for various values of temperature and amount of
catalyst.

LINEAR REGRESSION LINE
• In many situations, there is a single response variable $Y$, the dependent variable.
 $Y$ depends on the values of a set of input variables $x_1, \ldots, x_r$, called independent variables.
• The simplest type of relationship is a linear relationship. That is, for some constants $\beta_0, \beta_1, \ldots, \beta_r$ the equation
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r \qquad (1)$$
would hold.
• If this were the relationship between $Y$ and the $x_i$, $i = 1, \ldots, r$, then it would be possible (once the $\beta_i$ were learned) to predict the response exactly for any set of input values.

• In practice, such precision is almost never attainable.
 The most that one can expect is that Equation 1 would be valid subject to random error.
• The explicit relationship is:
$$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + e \qquad (2)$$
where $e$, representing the random error, is assumed to be a r.v. having mean 0.
• This relationship is called a linear regression equation.

• The linear regression equation describes the regression of $Y$ on the set of independent variables $x_1, \ldots, x_r$.
• The quantities $\beta_0, \beta_1, \ldots, \beta_r$ are called the regression coefficients, and must usually be estimated from a set of data.
• A simple regression equation is a regression equation containing a single independent variable $x$ (the input level):
$$Y = \alpha + \beta x + e$$
where $Y$ is the response and $e$, representing the random error, is a random variable having mean 0 and variance $\sigma^2$.
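To make the role of the random error $e$ concrete, the following Python sketch draws responses from the simple regression model. The values `alpha = 2.0`, `beta = 0.5`, `sigma = 1.0`, the input levels, the normal distribution of $e$, and the name `simulate_responses` are all illustrative assumptions, not part of the lecture; the model itself only requires $e$ to have mean 0 and variance $\sigma^2$.

```python
import random


def simulate_responses(xs, alpha, beta, sigma, rng):
    """Draw one response Y = alpha + beta*x + e for every input level x."""
    # Assumption: e is normally distributed; the model only fixes its mean (0)
    # and its variance (sigma squared).
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


rng = random.Random(0)  # fixed seed so the run is reproducible
xs = [float(i % 10) for i in range(10000)]  # hypothetical input levels 0..9
ys = simulate_responses(xs, alpha=2.0, beta=0.5, sigma=1.0, rng=rng)

# Recover the random errors and check that they average out close to 0.
errors = [y - (2.0 + 0.5 * x) for x, y in zip(xs, ys)]
mean_error = sum(errors) / len(errors)
print(f"mean error over {len(errors)} draws: {mean_error:.4f}")
```

Individual responses scatter around the line $\alpha + \beta x$, while the average error stays close to 0, which is what the assumption "mean 0" expresses.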

• EX. 1: Consider the following 10 data pairs $(x_i, y_i)$, $i = 1, \ldots, 10$, relating $y$, the percent yield of a laboratory experiment, to $x$, the temperature at which the experiment was run.

| $i$ | $x_i$ | $y_i$ |
|-----|-------|-------|
| 1   | 100   | 45    |
| 2   | 110   | 52    |
| 3   | 120   | 54    |
| 4   | 130   | 63    |
| 5   | 140   | 62    |
| 6   | 150   | 68    |
| 7   | 160   | 75    |
| 8   | 170   | 76    |
| 9   | 180   | 92    |
| 10  | 190   | 88    |

• A plot of $y_i$ versus $x_i$, called a scatter diagram, is given in Fig. 1. It seems that a simple linear regression model would be appropriate.

LEAST SQUARES ESTIMATORS OF THE REGRESSION PARAMETERS
• Suppose that the responses $Y_i$ corresponding to the input values $x_i$, $i = 1, \ldots, n$, are observed and used to estimate $\alpha$ and $\beta$ in a simple linear regression model.
• If $A$ is the estimator of $\alpha$ and $B$ the estimator of $\beta$, then the estimator of the response corresponding to the input variable $x_i$ would be $A + B x_i$.
• The actual response is $Y_i$, so the squared difference is $(Y_i - A - B x_i)^2$.

• The sum of the squared differences between the estimated responses and the actual response values, call it $SS$, is:
$$SS = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$
• The method of least squares:
 chooses as estimators of $\alpha$ and $\beta$ the values of $A$ and $B$ that minimize $SS$.
• So, to determine these estimators, we differentiate $SS$ first with respect to $A$ and then with respect to $B$, as follows:
$$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i), \qquad \frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i)$$
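The quantity $SS$ can be evaluated directly for any candidate line $A + Bx$. The sketch below does this for the temperature and yield data of Example 1; the function name `sum_of_squares` and the two candidate lines are illustrative choices, not estimates taken from the lecture.

```python
def sum_of_squares(xs, ys, a, b):
    """SS = sum over i of (Y_i - A - B*x_i)^2 for the candidate line A + B*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


# Temperature (x) and percent yield (y) from Example 1.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

# Two hand-picked candidate lines (illustrative guesses, not estimates).
ss_first = sum_of_squares(xs, ys, a=-5.0, b=0.5)
ss_second = sum_of_squares(xs, ys, a=0.0, b=0.45)
print(ss_first, ss_second)
```

The first candidate gives the smaller $SS$, so by the least squares criterion it is the better of these two lines; the method of least squares looks for the pair $(A, B)$ for which no other line does better.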

• Setting these partial derivatives equal to zero yields the normal equations for the minimizing values $A$ and $B$:
$$\sum_{i=1}^{n} Y_i = nA + B \sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i Y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$$
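The normal equations are two linear equations in the two unknowns $A$ and $B$, so they can also be solved numerically. The following sketch does so with Cramer's rule for the data of Example 1; the function name `solve_normal_equations` is a hypothetical helper, and Cramer's rule is just one convenient way to solve a 2×2 system.

```python
def solve_normal_equations(xs, ys):
    """Solve the two normal equations for (A, B) using Cramer's rule."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Normal equations:
    #   sum_y  = n * A     + sum_x  * B
    #   sum_xy = sum_x * A + sum_xx * B
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# Temperature (x) and percent yield (y) from Example 1.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

a, b = solve_normal_equations(xs, ys)
print(f"A = {a:.4f}, B = {b:.4f}")
```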

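• Let
$$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$
• By the method of substitution:
 the first normal equation gives
$$A = \bar{Y} - B \bar{x}$$
 substituting this value of $A$ into the second normal equation gives
$$\sum_{i=1}^{n} x_i Y_i = (\bar{Y} - B \bar{x})\, n \bar{x} + B \sum_{i=1}^{n} x_i^2$$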

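• By the usual rearrangement of the second normal equation:
$$B \left( \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right) = \sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}, \qquad \text{so} \qquad B = \frac{\sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}$$
• and the fact that
$$n \bar{Y} = \sum_{i=1}^{n} Y_i$$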

• So we get the following proposition:
• The least squares estimators of $\beta$ and $\alpha$ corresponding to the data set $x_i, Y_i$, $i = 1, \ldots, n$, are, respectively,
$$B = \frac{\sum_{i=1}^{n} x_i Y_i - \bar{x} \sum_{i=1}^{n} Y_i}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2}, \qquad A = \bar{Y} - B \bar{x}$$
• The straight line $A + Bx$ is called the estimated regression line.
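As a worked illustration of the proposition, the sketch below computes $B = (\sum x_i Y_i - \bar{x} \sum Y_i) / (\sum x_i^2 - n \bar{x}^2)$ and $A = \bar{Y} - B \bar{x}$ for the data of Example 1 and evaluates the estimated regression line at one input value. The function name `least_squares_line` and the temperature 155 are assumptions chosen for the example only.

```python
def least_squares_line(xs, ys):
    """Return (A, B), the least squares estimators of alpha and beta."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - x_bar * sum(ys)
    denominator = sum(x * x for x in xs) - n * x_bar ** 2
    b = numerator / denominator
    a = y_bar - b * x_bar
    return a, b


# Temperature (x) and percent yield (y) from Example 1.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

a, b = least_squares_line(xs, ys)
# Estimated response at a hypothetical temperature of 155.
predicted_yield = a + b * 155
print(f"A = {a:.4f}, B = {b:.4f}, estimated yield at 155: {predicted_yield:.2f}")
```

Because $A = \bar{Y} - B\bar{x}$, the estimated regression line always passes through the point $(\bar{x}, \bar{Y})$.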

• EX. 2: The raw material used in the production of a certain synthetic fiber is stored in a location without humidity control. Measurements of
 the relative humidity in the storage location and
 the moisture content of a sample of the raw material
were taken over 15 days, with the following data (in percentages) resulting.

• Calculating the least squares estimators by the last proposition, the estimated regression line of moisture content on relative humidity in the storage location is the line shown in the following Figure.

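THE COEFFICIENT OF DETERMINATION
• Notation: If we let
$$S_{xY} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}$$
$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2$$
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n \bar{Y}^2$$
• the least squares estimators can be expressed as
$$B = \frac{S_{xY}}{S_{xx}}, \qquad A = \bar{Y} - B \bar{x}$$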

• Suppose we measure the amount of variation in the set of response values $Y_1, \ldots, Y_n$ corresponding to the set of input values $x_1, \ldots, x_n$.
• A standard measure in statistics of the amount of variation in a set of values $Y_1, \ldots, Y_n$ is:
$$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$
• If all the $Y_i$ are equal, and thus are all equal to $\bar{Y}$, then $S_{YY}$ would equal 0.

• The variation in the values of the $Y_i$ arises from two factors:
 First: the input values $x_i$ are different, so the response variables $Y_i$ all have different mean values.
 Second: even when the differences in the input values are taken into account, each of the response variables $Y_i$ has variance $\sigma^2$ and thus will not exactly equal the predicted value at its input $x_i$.

• How much of the variation in the values of the response variables is due to the different input values?
• How much is due to the inherent variance of the responses even when the input values are taken into account?
• Answer: note that the quantity
$$SS_R = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$
measures the remaining amount of variation in the response values after the different input values have been taken into account.

• Thus, $S_{YY} - SS_R$ represents the amount of variation in the response variables that is explained by the different input values.
• The quantity $R^2$ defined by
$$R^2 = \frac{S_{YY} - SS_R}{S_{YY}} = 1 - \frac{SS_R}{S_{YY}}$$
represents the proportion of the variation in the response variables that is explained by the different input values.

• $R^2$ is called the coefficient of determination.
• $0 \le R^2 \le 1$.
• A value of $R^2$ near 1: most of the variation of the response data is explained by the different input values.
• A value of $R^2$ near 0: little of the variation is explained by the different input values.
• The value of $R^2$ is an indicator of how well the regression model fits the data, with a value near 1 indicating a good fit, and one near 0 indicating a poor fit.
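The coefficient of determination can be computed step by step from the definitions. The sketch below does this for the temperature and yield data of Example 1, taking $SS_R$ as the sum of squared residuals about the least squares line and $R^2 = (S_{YY} - SS_R)/S_{YY}$; the variable names are illustrative.

```python
# Temperature (x) and percent yield (y) from Example 1.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Sums of squares and cross products about the means.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares estimators.
b = s_xy / s_xx
a = y_bar - b * x_bar

# Variation left over after the input values are taken into account.
ss_r = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Proportion of the variation explained by the different input values.
r_squared = (s_yy - ss_r) / s_yy
print(f"S_YY = {s_yy:.2f}, SS_R = {ss_r:.2f}, R^2 = {r_squared:.4f}")
```

For these data $R^2$ comes out close to 1, which agrees with the impression from the scatter diagram that a simple linear regression model is appropriate.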

THE SAMPLE CORRELATION COEFFICIENT
• For any data set consisting of the paired values $(x_i, y_i)$, $i = 1, \ldots, n$, one can compute a statistic that measures the association between the individual values of the paired data.
• That statistic is called the sample correlation coefficient and is defined by:
$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

• The sample correlation coefficient is always between −1 and 1.
• If $|r| = 1$, then the linear relationship between the r.v.'s $X$ and $Y$ is perfect: all data pairs lie exactly on a straight line.
• So, the closer the absolute value of $r$ is to 1, the stronger the linear correlation.
• If the correlation coefficient is positive, the relationship is direct: larger values of $x$ tend to go with larger values of $y$.
• If the correlation coefficient is negative, the relationship is inverse: larger values of $x$ tend to go with smaller values of $y$.
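The properties above can be checked on small data sets. The sketch below computes $r$ as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations; the function name `sample_correlation` and the two small exactly linear data sets are made up for illustration, while the third data set is the one from Example 1.

```python
import math


def sample_correlation(xs, ys):
    """Sample correlation coefficient r of the paired values (x_i, y_i)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data lying exactly on a line: |r| should equal 1.
xs_small = [1, 2, 3, 4]
r_increasing = sample_correlation(xs_small, [2 * x + 1 for x in xs_small])
r_decreasing = sample_correlation(xs_small, [-3 * x for x in xs_small])

# Temperature (x) and percent yield (y) from Example 1.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]
r_example = sample_correlation(xs, ys)

print(r_increasing, r_decreasing, round(r_example, 4))
```

The increasing line gives $r = 1$, the decreasing line gives $r = -1$, and the Example 1 data give a positive value close to 1, that is, a strong direct linear relationship.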

THE COEFFICIENT OF DETERMINATION AND THE SAMPLE CORRELATION COEFFICIENT
• Consider data pairs $(x_i, Y_i)$, $i = 1, \ldots, n$, of response values $Y_1, \ldots, Y_n$ corresponding to the set of input values $x_1, \ldots, x_n$.
• The sample correlation coefficient $r$ of these data pairs, in the notation of slide 17 ($S_{xY}$, $S_{xx}$, $S_{YY}$), is:
$$r = \frac{S_{xY}}{\sqrt{S_{xx} S_{YY}}}$$
• Upon using the identity
$$SS_R = \frac{S_{xx} S_{YY} - S_{xY}^2}{S_{xx}}$$

• we see that:
$$r^2 = \frac{S_{xY}^2}{S_{xx} S_{YY}} = \frac{S_{xx} S_{YY} - SS_R \, S_{xx}}{S_{xx} S_{YY}} = 1 - \frac{SS_R}{S_{YY}} = R^2$$
• So,
$$|r| = \sqrt{R^2}$$
• The sign of $r$ is the same as that of $B$.
• The above gives additional meaning to the sample correlation coefficient.

• For instance, if a data set has its sample correlation coefficient $r$ equal to 0.9, then this implies that
 a simple linear regression model for these data explains 81 percent (since $R^2 = 0.9^2 = 0.81$) of the variation in the response values.
 That is, 81 percent of the variation in the response values is explained by the different input values.
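The relation $r^2 = R^2$, and the fact that $r$ has the same sign as $B$, can be checked numerically. The sketch below does this for the data of Example 1 and for a second, hypothetical data set obtained by listing the same yields in reverse order, which produces a negative slope. The helper name `fit_summary` is illustrative; $R^2$ is computed from the residuals and $r$ from the sums of squares, so the agreement is not built in.

```python
import math


def fit_summary(xs, ys):
    """Return (B, R^2, r) for a simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    # R^2 from the residual sum of squares.
    ss_r = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_r / s_yy
    # r from the sums of squares and cross products.
    r = s_xy / math.sqrt(s_xx * s_yy)
    return b, r_squared, r


# Temperature (x) and percent yield (y) from Example 1.
xs = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
ys = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

b_up, r2_up, r_up = fit_summary(xs, ys)
# Hypothetical second data set: the same yields in reverse order.
b_down, r2_down, r_down = fit_summary(xs, list(reversed(ys)))

print(f"B = {b_up:.4f}, r = {r_up:.4f}, r^2 = {r_up ** 2:.4f}, R^2 = {r2_up:.4f}")
print(f"B = {b_down:.4f}, r = {r_down:.4f}, r^2 = {r_down ** 2:.4f}, R^2 = {r2_down:.4f}")
```

In both cases $r^2$ and $R^2$ agree up to rounding error, and $r$ is positive exactly when the estimated slope $B$ is positive.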