You are on page 1of 65

Correlation &

Regression
Correlation

Finding the relationship between two


quantitative variables without being able to
infer causal relationships

Correlation is a statistical technique used to


determine the degree to which two variables
are related
Scatter diagram
• Rectangular coordinate
• Two quantitative variables
• One variable is called independent (X) and
the second is called dependent (Y)
• Points are not joined
Y
* *
*
X
Possible Relationships between X and Y in a Scatter
Plot
Example

Wt. 67 69 85 83 74 81 97 92 114 85
(kg)
SBP 120 125 140 160 130 180 150 140 200 130
(mmHg)
Wt. 67 69 85 83 74 81 97 92 114 85
SBP(mmHg) (kg)
SBP 120 125 140 160 130 180 150 140 200 130
(mmHg)

220
200

180
160

140
120

100
80 wt (kg)
60 70 80 90 100 110 120

Scatter diagram of weight and systolic blood pressure


Scatter plots

The pattern of data is indicative of the type of


relationship between the two variables:
 positive relationship

 negative relationship

 no relationship
Positive relationship
18

16

14

12
Height in CM

10

0
0 10 20 30 40 50 60 70 80 90
Age in Weeks
Negative relationship

Reliability

Age of Car
No relation
Correlation Coefficient

Statistic showing the degree of relation


between two variables
Simple Correlation Coefficient (r)

 It is also called as Karl Pearson's


Correlation Coefficient.

 It measures the nature and strength


between two variables of the
quantitative type.
The sign of r denotes the nature of
association

while the value of r denotes the


strength of association.
 If the sign is +ve this means the relation
is direct (an increase in one variable is
associated with an increase in the
other variable and a decrease in one
variable is associated with a
decrease in the other variable).

 While if the sign is -ve this means an


inverse or indirect relationship (which
means an increase in one variable is
associated with a decrease in the other).
 The value of r ranges between ( -1) and ( +1)
 The value of r denotes the strength of the
association as illustrated by the following
diagram.

strong intermediate weak weak intermediate strong

1- -0.75 -0.25 0 0.25 0.75 1


indirect Direct
perfect perfect
correlation correlation
no relation
If r = Zero this means no association or
correlation between the two variables.

If 0 < r < 0.25 = weak correlation.

If 0.25 ≤ r < 0.75 = intermediate correlation.

If 0.75 ≤ r < 1 = strong correlation.

If r = l = perfect correlation.
How to compute the simple correlation
coefficient (r)

 xy   x y
r n

x 
2
(  x) 2
 
.  y 
2
(  y) 2


 n  n 
  
:Example
A sample of 6 children was selected, data about their
age in years and weight in kilograms was recorded as
shown in the following table . It is required to find the
correlation between age and weight.

Weight Age serial


(Kg) (years) No
12 7 1
8 6 2
12 8 3
10 5 4
11 6 5
13 9 6
These 2 variables are of the quantitative type,
one variable (Age) is called the independent and
denoted as (X) variable and the other (weight)
is called the dependent and denoted as (Y)
variable to find the relation between age and
weight.
Compute the simple correlation coefficient using
the following formula:

 xy   x y
r n

x 
2
(  x) 2
 
.  y 
2
(  y) 2


 n  n 
  
Weight Age
Serial
Y2 X2 xy (Kg) (years)
.n
(y) (x)
144 49 84 12 7 1

64 36 48 8 6 2

144 64 96 12 8 3

100 25 50 10 5 4

121 36 66 11 6 5

169 81 117 13 9 6

=y2∑ =x2∑ xy=∑ =y ∑ =x∑ Total


742 291 461 66 41
41  66
461 
r 6
 (41) 2   (66) 2 
291  .742  
 6  6 

r = 0.759
Interpretation of r:
Strong direct correlation
EXAMPLE: Relationship between Anxiety and
Test Scores
Anxiety Test X2 Y2 XY
)X( score (Y)
10 2 100 4 20
8 3 64 9 24
2 9 4 81 18
1 7 1 49 7
5 6 25 36 30
6 5 36 25 30
X = 32∑ Y = 32∑ X2 = 230∑ Y2 = 204∑ XY=129∑
Calculating Correlation Coefficient

(6)(129)  (32)(32) 774  1024


r   .94
 6(230)  32  6(204)  32 
2 2
(356)( 200)

r = - 0.94

Interpretation of r:
Indirect strong correlation
Regression Analysis

 In statistical modelling, regression analysis is a


statistical process for estimating the relationships
among variables.
 Uses a variable (x) to predict some outcome
variable (y).
 Tells us how values in y change as a function of
changes in values of x.
Correlation and Regression

 Correlation describes the nature and the


strength of relationship between two variables

 Regression tells us the type of this relationship

 So, regression is all about establishing this


relationship
Concern

??How to get this relationship


Two Different Estimating Lines Fitted to the Same Three
Observed Data Points, Showing Errors in Both Cases

Graph (a) Graph (b)


Summing Y – Y cap Y – Y cap
the Errors 8-6 = 2 8-2= 6
of the
Two 1-5 = -4 1-5 = -4
Estimating 6-4 = 2 6-8 = -2
Lines Total error = 0 Total error = 0
Summing the Absolute Values of the
Errors of the Two Estimating Lines

Graph (a) Graph (b)


|Y – Y cap| |Y – Y cap|
|8-6 |= 2 |8-2 |= 6
|1-5| = 4 |1-5 |= 4
|6-4 |= 2 |6-8 |= 2
Total error = 8 Total error = 12
Two Different Estimating Lines Fitted to the Same Three
Observed Data Points, Showing Errors in Both Cases

Graph (a) Graph (b)


|Y – Y cap| |Y – Y cap|
|4-4 |= 0 |4-5 |= 1
|7-3| = 4 |7-4 |= 3
|2-2 |= 0 |2-3 |= 1
Total error = 4 Total error = 5
Applying the Least-Square
Criterion to the Estimating Lines

Graph (a) Graph (b)


(Y – Y cap)2 (Y – Y cap)2
(4-4 )2= 0 (4-5 )2= 1
(7-3)2 = 16 (7-4 )2= 9
(2-2 )2= 0 (2-3)2= 1
Total error = 16 Total error = 11
From the last two examples , it is clear that the
method of least squares to find the best fitted
line is in agreement with what is expected

Now, we will see that how can we find this best


fitted line, mathematically
Regression
 Calculates the “best-fit” line for a certain set of data
The regression line makes the sum of the squares
of the residuals smaller than for any other line
Regression minimizes residuals

220

200

180

160

140

120

100

80
Wt (kg)
60 70 80 90 100 110 120
Get the constants for the
regression line
Now we see how to calculate the constants a: y intercept, where the
regression line will meet Y axis and slope i.e. gradient of the straight
line

We can find the values of a and b using the following normal


.equations
From y = a + bx ........(1)
Taking sum of both sides, we get
By using the least squares method (a procedure that
minimizes the vertical deviations of plotted points
surrounding a straight line) we are
able to construct a best fitting straight line to the
scatter diagram points and then formulate a
regression equation in the form of:
Regression Equation
SBP(mmHg)
220

 Regression equation 200

180

describes the 160

regression line 140

120

mathematically 100

80
Wt (kg)
60 70 80 90 100 110 120

 Intercept
 Slope
Linear Equations
Y
ŷY = baX + bX
a
Change
b = S lo p e in Y
C h a n g e in X
a = Y -in te r c e p t
X
Using the Least-Square Method:
Problem I
The director of a sanitation department is interested in the
relationship between the age of a garbage truck and the
annual repair expense she could expect to incur. In order to
determine this relationship, the director has accumulated
information concerning four of the trucks the city currently
.owns Truck Number Age of the Truck (in Repair Expense during
years) (X) last year in Hundreds
of $ (Y)
101 5 7

102 3 7

103 3 6

104 1 4
b = 0.75
a = 3.75
If the city has a truck that is 4 years old,
predict the annual repair expense for the
.same
Answer: $ 675.00
Hours studying and grades
Regressing grades on hours


Linear Regression


90.00 Final grade in course = 59.95 + 3.17 * study
R-Square = 0.88


80.00  

70.00  

2.00 4.00 6.00 8.00 10.00

Number of hours spent studying

Predicted final grade in class =


59.95 + 3.17*(number of hours you study per week)
Predicted final grade in class = 59.95 + 3.17*(hours of study)

…Predict the final grade of


 Someone who studies for 12 hours
 Final grade = 59.95 + (3.17*12)
 Final grade = 97.99

 Someone who studies for 1 hour:


 Final grade = 59.95 + (3.17*1)
 Final grade = 63.12
Problem 1
A sample of 6 persons was selected the
value of their age ( x variable) and their
weight is demonstrated in the following
table. Find the regression equation and
what is the predicted weight when age is
8.5 years.
Weight (y) Age (x) .Serial no
12 7 1
8 6 2
12 8 3
10 5 4
11 6 5
13 9 6
Answer

Y2 X2 xy Weight (y) Age (x) .Serial no


144 49 84 12 7 1
64 36 48 8 6 2
144 64 96 12 8 3
100 25 50 10 5 4
121 36 66 11 6 5
169 81 117 13 9 6

742 291 461 66 41 Total


Regression equation

ŷ (x)  4.675  0.92x

Value of Y when X = 8.5

ŷ (8.5)  4.675  0.92 * 8.5  12.50Kg


Value of Y when X = 7.5
ŷ (7.5)  4.675  0.92 * 7.5  11.58Kg
Problem 2
B.P Age B.P Age
(y) (x) (y) (x)
128 46 120 20
The following are the 136 53 128 43
age (in years) and
systolic blood 146 60 141 63
pressure of 20 124 20 126 26
apparently healthy 143 63 134 53
adults. 130 43 128 31
124 26 136 58
121 19 132 46
126 31 140 58
123 23 144 70
Find the correlation between age
and blood pressure using simple
and Spearman's correlation
coefficients, and comment.
Find the regression equation?
What is the predicted blood
pressure for a man aging 25 years?
x2 xy y x Serial
400 2400 120 20 1
1849 5504 128 43 2
3969 8883 141 63 3
676 3276 126 26 4
2809 7102 134 53 5
961 3968 128 31 6
3364 7888 136 58 7
2116 6072 132 46 8
3364 8120 140 58 9
4900 10080 144 70 10
x2 xy y x Serial
2116 5888 128 46 11
2809 7208 136 53 12
3600 8760 146 60 13
400 2480 124 20 14
3969 9009 143 63 15
1849 5590 130 43 16
676 3224 124 26 17
361 2299 121 19 18
961 3906 126 31 19
529 2829 123 23 20
41678 114486 2630 852 Total
 x y
 xy 
n 114486 
852  2630
b1  = 20  0.4547
(  x) 2
852 2

x  n
2 41678 
20

ŷ =112.13 + 0.4547 x

for age 25
B.P = 112.13 + 0.4547 * 25=123.49 = 123.5 mm hg
Checking the Estimating Equation
 A crude way to verify the accuracy of the estimating equation
is to determine the graph of the sample points

 A more sophisticated method comes from one of the


mathematical properties of a line fitted by the method of least
squares, i.e. the individual positive and negative errors must
sum to zero

 When both of them are satisfied, we can be reasonably


certain that we haven’t committed any serious mathematical
mistakes in determining the estimating equation
The Standard Error of Estimate
The standard error  of the estimate (Se) is a
measure of the accuracy of predictions made
.with the regression line
It measures the scatteredness (variability) of the
.observed points from the line of regression
It is given by

Where Y and Y-cap are the observed and the


estimated value of the dependent variable,
.respectively
Short-Cut Method for Finding the Standard
Error of Estimate

Find the standard error of estimate for the


.problem I
Answer: $ 86.6
Interpreting the Standard Error of Estimate
±1Se, ±2Se and ±3Se Bounds Around the Regression Line
(Measured on Y-Axis)*Note that Se is measured along the Y-axis, rather than
perpendicularly from the regression line
Approximate Prediction Intervals
We can use Se to make probability statement about the
interval around an estimated value of Y, within which
.the actual value of Y lies
In problem I, the estimating regression equation is
X 0.75
ŷ + 3.75 =
So, if the department has a 4 year old truck, we predict
.that it will have an annual repair expense of $675
Recall that Se = $86.6
Combining these two pieces of information, we are
roughly 68% confident that the actual repair expense
will be within ±1Se fromŷ
Calculating the upper and the lower
limits of prediction
Y^ +1Se = 675 + 1 (86.60)
$761.60 =
Y^ -1Se = 675 - 1 (86.60)
$588.40 =
i.e., we are roughly 68% confident that the actual
repair expense will be within $588.40 and
$761.60

Similar statements could be made for 2Se and


Some more Problems on Correlation
and Regression
Problem 1
Limca Company is studying the effect of its latest advertising
campaign. People chosen at random were called and asked how
many cans of Limca they had purchased in the past week and
how many Limca advertisements they had either read or seen in
the past week. The followings are the results
X (No. of 3 7 4 2 0 4 1 2
Ads.)
Y ( cans 11 18 9 4 7 6 3 8
purchased
)

Calculate the correlation coefficient between X and Y and interpret the


.result
Problem 2
The motor pool wants to know if it costs more to maintain cars that are
driven more often.
i) Calculate intercept and slope
ii) Predict Y if X=20000

Car Number Miles Driven (X) Repair Costs (Y)


80000 72000
29000 9000
53000 39000
13000 12000
45000 19500
Problem 3
Cost accountants often estimate overhead based on the level of
production. At the Standard Knitting Co., they have collected
information on overhead expenses and units produced at different
plants, and want to estimate a regression equation to predict
.future overhead
Overhead 191 170 272 155 280 173 234 116 153 178

Units 40 42 53 35 56 39 48 30 37 40

.Develop the regression equation for the cost accountants


.Predict overhead cost when 50 units are produced
.Calculate the standard error of estimate
Coefficient of Determination
The coefficient of determination, denoted R2 or r2 and pronounced
"R squared", is a number that indicates the proportion of the
variance in the dependent variable that is predictable from the
.independent variable

The coefficient of determination is the square of the correlation


coefficient (r) between predicted y scores and actual y scores; thus,
it ranges from 0 to 1
This coefficient of determination is also the proportion of the total
.variation in Y that is explained by the independent variables

.It is also called as goodness of fit of a regression equation


STEPS IN LINEAR REGRESSION

State the hypotheses .1


Gather the data  .2
4. Compute the regression equation 
5. Examine tests of statistical significant and measures
of association 
6. Relate statistical findings to the hypothesis. Accept or
reject the null hypothesis 
7. Reject, accept or revise the original hypothesis
Make suggestions for research design and management
.aspects of the problem
Example
The motor pool wants to know if it costs more to maintain cars
.that are driven more often
Ho: there is no relationship between mileage and maintenance
costs
Ha: maintenance costs are affected by car mileage
 
Dependent variable: Y is the cost in dollars of yearly
maintenance on a motor vehicle 
Independent variable: X is the yearly mileage on the same motor
vehicle
Data are gathered on each car in the motor pool, regarding
number of miles driven in a given year, and maintenance costs
 .for that year. Here is a sample of the data collected
Car Number Miles Driven (X) Repair Costs (Y)

1 80,000 $1,200

2 29,000 $150

3 53,000 $650

4 13,000 $200

5 45,000 $325
:The regression equation is computed as
Y = 50 + .03 X
For example, if X=50,000 then Y = 50 + .03 (50,000) = $1,550
a=50 or the cost of maintenance when X=0; if there is no mileage on
the car, then the yearly cost of maintenance=$50
b=.03 the value that Y increases for each unit increase in X; for each
extra mile driven (X), the cost of yearly maintenance increases by
$.03
s.e.b = .0005; the value of b divided by s.e.b=60.0; the t-table
indicates that the b coefficient of X is statistically significant (it is
related to Y)
r2=.90 we can explain 90% of the variance in repair costs for
different vehicles if we know the vehicle mileage for each car

Conclusion: Reject the null hypothesis of no relationship and


 .accept the research hypothesis, that mileage affects repair costs

You might also like