
Linear Regression

Bivariate Data
Bivariate Data: Consists of the values of two different response variables that are obtained from the same population of interest
Three combinations of variable types:
1. Both variables are qualitative (attribute)
2. One variable is qualitative (attribute) and the other is quantitative (numerical)
3. Both variables are quantitative (numerical)

Two Quantitative Variables


1. Expressed as ordered pairs: (x, y)
2. x: input variable, independent variable; y: output variable, dependent variable

Scatter Diagram: A plot of all the ordered pairs of bivariate data on a coordinate axis system. The input variable x is plotted on the horizontal axis, and the output variable y is plotted on the vertical axis.
Note: Use scales so that the range of the y-values is equal to or slightly less than the range of the x-values. This creates a window that is approximately square.

Presentation of Bivariate Data


[Figure: Regression Plot of Weight (vertical axis) against Height (horizontal axis), with fitted line Y = 2.31464 + 1.28722X and r = 0.559]

Example
Example: In a study involving children's fear related to being hospitalized, the age and the score each child made on the Child Medical Fear Scale (CMFS) are given in the table below:

Age (x)   CMFS (y)   Age (x)   CMFS (y)
   8         31         7         28
   9         25         6         47
   9         40         6         42
  10         27         8         37
  11         35         9         35
   9         29        12         16
   8         25        15         12
   9         34        13         23
   8         44        10         26
  11         19        10         36

Construct a scatter diagram for these data.

Solution
age = input variable, CMFS = output variable
[Figure: scatter diagram titled "Child Medical Fear Scale", with Age (6 to 15) on the horizontal axis and CMFS (10 to 50) on the vertical axis]

Linear Correlation
Measures the strength of a linear relationship between two variables
- As x increases, no definite shift in y: no correlation
- As x increases, a definite shift in y: correlation
- Positive correlation: x increases, y increases
- Negative correlation: x increases, y decreases
- If the ordered pairs follow a straight-line path: linear correlation

Example: No Correlation
As x increases, there is no definite shift in y:
[Figure: scatter diagram of Output against Input showing no correlation]

Example: Positive Correlation


As x increases, y also increases:
[Figure: scatter diagram of Output against Input showing positive correlation]

Example: Negative Correlation


As x increases, y decreases:
[Figure: scatter diagram of Output against Input showing negative correlation]

Please Note
- Perfect positive correlation: all the points lie along a line with positive slope
- Perfect negative correlation: all the points lie along a line with negative slope
- If the points lie along a horizontal or vertical line: no correlation
- If the points exhibit some other nonlinear pattern: no linear relationship, no correlation
- Need some way to measure correlation

Pearson's Product Moment Correlation


Coefficient of Linear Correlation: r, measures the strength of the linear relationship between two variables

Pearson's Product Moment Formula:

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y}$$

Notes:
- $-1 \le r \le +1$
- r = +1: perfect positive correlation
- r = -1: perfect negative correlation

Alternate Formula for r


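$$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$$

SS(x) = sum of squares for x:
$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n}$$

SS(y) = sum of squares for y:
$$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n}$$

SS(xy) = sum of squares for xy:
$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n}$$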
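A short sketch of why the alternate formula agrees with the product moment formula. It assumes the usual sample standard deviation, $s_x = \sqrt{\sum (x - \bar{x})^2 / (n - 1)}$, which the slides use without restating:

$$\sum (x - \bar{x})^2 = \sum x^2 - \frac{(\sum x)^2}{n} = SS(x), \qquad \sum (x - \bar{x})(y - \bar{y}) = \sum xy - \frac{\sum x \sum y}{n} = SS(xy)$$

$$(n - 1)\, s_x s_y = (n - 1) \sqrt{\frac{SS(x)}{n - 1}} \sqrt{\frac{SS(y)}{n - 1}} = \sqrt{SS(x)\,SS(y)}$$

$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{(n - 1)\, s_x s_y} = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}}$$

The alternate form needs only the five column totals n, $\sum x$, $\sum y$, $\sum x^2$, $\sum y^2$ and $\sum xy$, which is why it is convenient for hand calculation.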

Example
Example: The table below presents the weight (in thousands of pounds) x and the gasoline mileage (miles per gallon) y for ten different automobiles. Find the linear correlation coefficient:

          x       y      x^2      y^2       xy
        2.5      40     6.25     1600    100.0
        3.0      43     9.00     1849    129.0
        4.0      30    16.00      900    120.0
        3.5      35    12.25     1225    122.5
        2.7      42     7.29     1764    113.4
        4.5      19    20.25      361     85.5
        3.8      32    14.44     1024    121.6
        2.9      39     8.41     1521    113.1
        5.0      15    25.00      225     75.0
        2.2      14     4.84      196     30.8
Sum    34.1     309   123.73    10665   1010.9

Completing the Calculation for r


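$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 123.73 - \frac{(34.1)^2}{10} = 7.449$$

$$SS(y) = \sum y^2 - \frac{(\sum y)^2}{n} = 10665 - \frac{(309)^2}{10} = 1116.9$$

$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 1010.9 - \frac{(34.1)(309)}{10} = -42.79$$

$$r = \frac{SS(xy)}{\sqrt{SS(x)\,SS(y)}} = \frac{-42.79}{\sqrt{(7.449)(1116.9)}} = -0.47$$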
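The hand calculation above can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the helper names `sum_of_squares` and `pearson_r` are illustrative and do not come from the slides.

```python
from math import sqrt

# Weight (thousands of pounds) and mileage (mpg) from the example table.
x = [2.5, 3.0, 4.0, 3.5, 2.7, 4.5, 3.8, 2.9, 5.0, 2.2]
y = [40, 43, 30, 35, 42, 19, 32, 39, 15, 14]


def sum_of_squares(u, v):
    """SS(uv) = sum(u*v) - (sum u)(sum v)/n; gives SS(u) when v is u."""
    n = len(u)
    return sum(a * b for a, b in zip(u, v)) - sum(u) * sum(v) / n


def pearson_r(u, v):
    """Linear correlation coefficient via the alternate (SS) formula."""
    return sum_of_squares(u, v) / sqrt(sum_of_squares(u, u) * sum_of_squares(v, v))


ss_x = sum_of_squares(x, x)
ss_y = sum_of_squares(y, y)
ss_xy = sum_of_squares(x, y)
r = pearson_r(x, y)

print(f"SS(x) = {ss_x:.3f}, SS(y) = {ss_y:.1f}, SS(xy) = {ss_xy:.2f}, r = {r:.2f}")
```

The negative sign of SS(xy) carries through to r: heavier cars in this sample tend to have lower mileage.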

Please Note
- r is usually rounded to the nearest hundredth
- r close to 0: little or no linear correlation
- As the magnitude of r increases, towards -1 or +1, there is an increasingly stronger linear correlation between the two variables
- r can also be estimated from the scatter diagram, provided the window is approximately square. This is useful for checking calculations.

Linear Regression
Regression analysis finds the equation of the line that best describes the relationship between two variables.
One use of this equation: to make predictions.

Models or Prediction Equations


Some examples of various possible relationships:
- Linear: $\hat{y} = b_0 + b_1 x$
- Quadratic: $\hat{y} = a + bx + cx^2$
- Exponential: $\hat{y} = a(b^x)$
- Logarithmic: $\hat{y} = a \log_b x$

Note: What would a scatter diagram look like to suggest each relationship?

Method of Least Squares


Equation of the best-fitting line: $\hat{y} = b_0 + b_1 x$
Predicted value: $\hat{y}$
Least squares criterion: Find the constants $b_0$ and $b_1$ such that the sum

$$\sum (y - \hat{y})^2 = \sum \bigl(y - (b_0 + b_1 x)\bigr)^2$$

is as small as possible

Illustration
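Observed and predicted values of y:

[Figure: an observed point (x, y), the line $\hat{y} = b_0 + b_1 x$, the predicted point $(x, \hat{y})$ on the line, and the vertical distance $y - \hat{y}$ between them]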

The Line of Best Fit Equation


The equation is determined by:
- $b_0$: y-intercept
- $b_1$: slope

Values that satisfy the least squares criterion:

$$b_1 = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sum (x - \bar{x})^2} = \frac{SS(xy)}{SS(x)}$$

$$b_0 = \frac{\sum y - b_1 \sum x}{n} = \bar{y} - b_1 \bar{x}$$
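The slides state these values without derivation. A sketch of the standard calculus argument, offered as background rather than as part of the original presentation: write the sum to be minimized as $S(b_0, b_1) = \sum (y - b_0 - b_1 x)^2$ and set both partial derivatives to zero.

$$\frac{\partial S}{\partial b_0} = -2 \sum (y - b_0 - b_1 x) = 0 \;\Rightarrow\; \sum y = n b_0 + b_1 \sum x \;\Rightarrow\; b_0 = \frac{\sum y - b_1 \sum x}{n}$$

$$\frac{\partial S}{\partial b_1} = -2 \sum x (y - b_0 - b_1 x) = 0 \;\Rightarrow\; \sum xy = b_0 \sum x + b_1 \sum x^2$$

Substituting the expression for $b_0$ into the second equation gives

$$\sum xy - \frac{\sum x \sum y}{n} = b_1 \left( \sum x^2 - \frac{(\sum x)^2}{n} \right) \;\Rightarrow\; b_1 = \frac{SS(xy)}{SS(x)}$$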

Example
Example: A recent article measured the job satisfaction of subjects with a 14-question survey. The data below represents the job satisfaction scores, y, and the salaries, x, for a sample of similar individuals:
x   23   31   33   22   24   35   29   37
y   12   17   20   13   15   18   17   21

1) Draw a scatter diagram for this data
2) Find the equation of the line of best fit


Finding b1 & b0
Preliminary calculations needed to find b1 and b0:
         x      y     x^2     xy
        23     12     529    276
        31     17     961    527
        33     20    1089    660
        22     13     484    286
        24     15     576    360
        35     18    1225    630
        29     17     841    493
        37     21    1369    777
Sum    234    133    7074   4009

Line of Best Fit


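$$SS(x) = \sum x^2 - \frac{(\sum x)^2}{n} = 7074 - \frac{234^2}{8} = 229.5$$

$$SS(xy) = \sum xy - \frac{\sum x \sum y}{n} = 4009 - \frac{(234)(133)}{8} = 118.75$$

$$b_1 = \frac{SS(xy)}{SS(x)} = \frac{118.75}{229.5} = 0.5174$$

$$b_0 = \frac{\sum y - b_1 \sum x}{n} = \frac{133 - (0.5174)(234)}{8} = 1.4902$$

(The value 1.4902 comes from carrying the unrounded slope 118.75/229.5 through the calculation.)

Solution 1) Equation of the line of best fit: $\hat{y} = 1.49 + 0.517x$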
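The fit can be reproduced, and the least squares criterion spot-checked, with a short Python sketch. The function names `fit_line` and `sse` are illustrative, and the nudge sizes are arbitrary choices made only to show that moving away from the fitted coefficients increases the sum of squared errors.

```python
# Salary (x) and job satisfaction score (y) from the example.
salary = [23, 31, 33, 22, 24, 35, 29, 37]
satisfaction = [12, 17, 20, 13, 15, 18, 17, 21]


def fit_line(x, y):
    """Return (b0, b1) using b1 = SS(xy)/SS(x) and b0 = (sum y - b1 sum x)/n."""
    n = len(x)
    ss_x = sum(v * v for v in x) - sum(x) ** 2 / n
    ss_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = ss_xy / ss_x
    b0 = (sum(y) - b1 * sum(x)) / n
    return b0, b1


def sse(x, y, b0, b1):
    """Sum of squared differences between observed y and predicted b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


b0, b1 = fit_line(salary, satisfaction)
best = sse(salary, satisfaction, b0, b1)

# Nudge the intercept and slope in every direction; each nudged line
# should have a larger sum of squared errors than the fitted line.
nudged = [
    sse(salary, satisfaction, b0 + d0, b1 + d1)
    for d0 in (-0.1, 0.0, 0.1)
    for d1 in (-0.01, 0.0, 0.01)
    if (d0, d1) != (0.0, 0.0)
]

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, SSE = {best:.4f}")
print("fitted line beats every nudged line:", all(s > best for s in nudged))
```

Note that `fit_line` carries the unrounded slope into the intercept, which matches the precision advice for hand calculation.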


Scatter Diagram
Solution 2)
[Figure: scatter diagram titled "Job Satisfaction Survey", with Salary (21 to 37) on the horizontal axis and Job Satisfaction (12 to 22) on the vertical axis]

Please Note
- Keep at least three extra decimal places while doing the calculations to ensure an accurate answer
- When rounding off the calculated values of $b_0$ and $b_1$, always keep at least two significant digits in the final answer
- The slope $b_1$ represents the predicted change in y per unit increase in x
- The y-intercept is the value of y where the line of best fit intersects the y-axis
- The line of best fit will always pass through the point $(\bar{x}, \bar{y})$
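The last point follows directly from the intercept formula $b_0 = \bar{y} - b_1 \bar{x}$. Evaluating the line at $x = \bar{x}$:

$$\hat{y}(\bar{x}) = b_0 + b_1 \bar{x} = (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} = \bar{y}$$

As a check against the job satisfaction example, $\bar{x} = 234/8 = 29.25$ and $\bar{y} = 133/8 = 16.625$, and

$$1.4902 + (0.5174)(29.25) \approx 16.62$$

which agrees with $\bar{y}$ up to the rounding of the coefficients.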


Making Predictions
1. One of the main purposes for obtaining a regression equation is for making predictions
2. For a given value of x, we can predict a value of $\hat{y}$
3. The regression equation should be used to make predictions only about the population from which the sample was drawn
4. The regression equation should be used only to cover the sample domain on the input variable. You can estimate values outside the domain interval, but use caution and use values close to the domain interval.
5. Use current data. A sample taken in 1987 should not be used to make predictions in 1999.
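Points 2 and 4 can be illustrated with the job satisfaction line $\hat{y} = 1.49 + 0.517x$, whose sample salaries run from 22 to 37. The sketch below is hypothetical: the function `predict` and its `margin` parameter (how far outside the sample domain still counts as "close") are assumptions for illustration, not something the slides define.

```python
# Coefficients of the fitted line from the job satisfaction example.
B0, B1 = 1.49, 0.517

# Smallest and largest salary values observed in the sample.
X_MIN, X_MAX = 22, 37


def predict(x, margin=2.0):
    """Return (y_hat, note) for input x.

    `margin` is an assumed tolerance for how far outside the sample
    domain a prediction is still attempted with caution.
    """
    y_hat = B0 + B1 * x
    if X_MIN <= x <= X_MAX:
        note = "inside sample domain"
    elif X_MIN - margin <= x <= X_MAX + margin:
        note = "just outside sample domain: use caution"
    else:
        note = "far outside sample domain: not reliable"
    return y_hat, note


for x in (30, 38, 60):
    y_hat, note = predict(x)
    print(f"x = {x}: y_hat = {y_hat:.2f} ({note})")
```

Returning the note alongside the prediction, rather than refusing to predict, mirrors the slide's advice: extrapolation is allowed near the domain but should be flagged.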
