You are on page 1of 65

Correlation &

Regression
Correlation

Finding the relationship between two


quantitative variables without being able to
infer causal relationships

Correlation is a statistical technique used to


determine the degree to which two variables
are related
Scatter diagram
• Rectangular coordinate
• Two quantitative variables
• One variable is called independent (X) and
the second is called dependent (Y)
• Points are not joined
Possible Relationships between X and Y in a Scatter
Plot
Example
SBP(mmHg)

220
200

180
160

140
120

100
80 wt (kg)
60 70 80 90 100 110 120

Scatter diagram of weight and systolic blood pressure


Scatter plots

The pattern of data is indicative of the type of


relationship between the two variables:
 positive relationship

 negative relationship

 no relationship
Positive relationship
18

16

14

12
Height in CM

10

0
0 10 20 30 40 50 60 70 80 90
Age in Weeks
Negative relationship

Reliability

Age of Car
No relation
Correlation Coefficient

Statistic showing the degree of relation


between two variables
Simple Correlation Coefficient (r)

 It is also called as Karl Pearson's


Correlation Coefficient.

 It measures the nature and strength


between two variables of the
quantitative type.
The sign of r denotes the nature of
association

while the value of r denotes the


strength of association.
 If the sign is +ve this means the relation
is direct (an increase in one variable is
associated with an increase in the
other variable and a decrease in one
variable is associated with a
decrease in the other variable).

 While if the sign is -ve this means an


inverse or indirect relationship (which
means an increase in one variable is
associated with a decrease in the other).
 The value of r ranges between ( -1) and ( +1)
 The value of r denotes the strength of the
association as illustrated by the following
diagram.

strong intermediate weak weak intermediate strong

1- -0.75 -0.25 0 0.25 0.75 1


indirect Direct
perfect perfect
correlation correlation
no relation
If r = Zero this means no association or
correlation between the two variables.

If 0 < r < 0.25 = weak correlation.

If 0.25 ≤ r < 0.75 = intermediate correlation.

If 0.75 ≤ r < 1 = strong correlation.

If r = l = perfect correlation.
How to compute the simple correlation
coefficient (r)

 xy   x y
r n
 ( x) 2
  ( y) 
2
x 
2 .  y 
2 
 n  n 
  
:Example
A sample of 6 children was selected, data about their
age in years and weight in kilograms was recorded as
shown in the following table . It is required to find the
correlation between age and weight.

serial Age Weight


No (years) (Kg)
1 7 12
2 6 8
3 8 12
4 5 10
5 6 11
6 9 13
These 2 variables are of the quantitative type, one
variable (Age) is called the independent and
denoted as (X) variable and the other (weight)
is called the dependent and denoted as (Y) variable
to find the relation between age and weight.
Compute the simple correlation coefficient using the
following formula:

 xy   x y
r n
 ( x) 2
  ( y) 
2
x 
2 .  y 
2 
 n  n 
  
Age Weight
Serial
(years) (Kg) xy X2 Y2
.n
(x) (y)
1 7 12 84 49 144
2 6 8 48 36 64
3 8 12 96 64 144
4 5 10 50 25 100
5 6 11 66 36 121
6 9 13 117 81 169
Total =x∑ =y∑ xy=∑ =x2∑ =y2∑
41 66 461 291 742
41  66
461 
r 6
 (41)2   (66)2 
291  .742  
 6  6 

r = 0.759
Interpretation of r:
Strong direct correlation
EXAMPLE: Relationship between Anxiety and
Test Scores
Anxiety Test X2 Y2 XY
)X( score (Y)
10 2 100 4 20
8 3 64 9 24
2 9 4 81 18
1 7 1 49 7
5 6 25 36 30
6 5 36 25 30
X = 32∑ Y = 32∑ X2 = 230∑ Y2 = 204∑ XY=129∑
Calculating Correlation Coefficient

(6)(129)  (32)(32) 774  1024


r   .94
6(230)  32 6(204)  32 
2 2
(356)(200)

r = - 0.94

Interpretation of r:
Indirect strong correlation
Regression Analysis

 In statistical modelling, regression analysis is a


statistical process for estimating the relationships
among variables.
 Uses a variable (x) to predict some outcome
variable (y).
 Tells us how values in y change as a function of
changes in values of x.
Correlation and Regression

 Correlation describes the nature and the


strength of relationship between two variables

 Regression tells us the type of this relationship

 So, regression is all about establishing this


relationship
Concern

??How to get this relationship


Two Different Estimating Lines Fitted to the Same Three
Observed Data Points, Showing Errors in Both Cases

Graph (a) Graph (b)


Summing Y – Y cap Y – Y cap
the Errors
8-6 = 2 8-2= 6
of the
Two 1-5 = -4 1-5 = -4
Estimating 6-4 = 2 6-8 = -2
Lines
Total error = 0 Total error = 0
Summing the Absolute Values of the
Errors of the Two Estimating Lines

Graph (a) Graph (b)


|Y – Y cap| |Y – Y cap|
|8-6 |= 2 |8-2 |= 6
|1-5| = 4 |1-5 |= 4
|6-4 |= 2 |6-8 |= 2
Total error = 8 Total error = 12
Two Different Estimating Lines Fitted to the Same Three
Observed Data Points, Showing Errors in Both Cases

Graph (a) Graph (b)


|Y – Y cap| |Y – Y cap|
|4-4 |= 0 |4-5 |= 1
|7-3| = 4 |7-4 |= 3
|2-2 |= 0 |2-3 |= 1
Total error = 4 Total error = 5
Applying the Least-Square
Criterion to the Estimating Lines

Graph (a) Graph (b)


(Y – Y cap)2 (Y – Y cap)2
(4-4 )2= 0 (4-5 )2= 1
(7-3)2 = 16 (7-4 )2= 9
(2-2 )2= 0 (2-3)2= 1
Total error = 16 Total error = 11
From the last two examples , it is clear that the
method of least squares to find the best fitted
line is in agreement with what is expected

Now, we will see that how can we find this best


fitted line, mathematically
Regression
 Calculates the “best-fit” line for a certain set of data
The regression line makes the sum of the squares
of the residuals smaller than for any other line
Regression minimizes residuals

220

200

180

160

140

120

100

80
Wt (kg)
60 70 80 90 100 110 120
Get the constants for the
regression line
Now we see how to calculate the constants a: y intercept, where the
regression line will meet Y axis and slope i.e. gradient of the straight
line

We can find the values of a and b using the following normal


.equations
From y = a + bx ........(1)
Taking sum of both sides, we get
By using the least squares method (a procedure that
minimizes the vertical deviations of plotted points
surrounding a straight line) we are
able to construct a best fitting straight line to the
scatter diagram points and then formulate a
regression equation in the form of:
Regression Equation
SBP(mmHg)
220
 Regression equation 200

describes the regression line 180

mathematically 160

140

120

 Intercept 100

80
Wt (kg)
 Slope 60 70 80 90 100 110 120
Linear Equations
Y
ŷY = bX
a +bX
a
Change
b = Slope in Y
Change in X
a = Y-intercept
X
Using the Least-Square Method
Problem I
The director of a sanitation department is interested in the
relationship between the age of a garbage truck and the annual
repair expense she could expect to incur. In order to determine
this relationship, the director has accumulated information
.concerning four of the trucks the city currently owns
If the city has a truck that is 4 years old, predict the annual repair
.expense for the same
Truck Number Age of the Truck (in Repair Expense during
years) (X) last year in Hundreds
of $ (Y)
A 5 7
B 3 7
C 3 6
D 1 4
b = 0.75
a = 3.75
Therefore the regression equation for y on
:x is given by
y = 3.75 + 0.75 x
Therefore, when x = 4
y = 675.00
Hours studying and grades
Regressing grades on hours


Linear Regression


90.00 Final grade in course = 59.95 + 3.17 * study
R-Square = 0.88


80.00  

70.00  

2.00 4.00 6.00 8.00 10.00

Number of hours spent studying

Predicted final grade in class =


59.95 + 3.17*(number of hours you study per week)
Predicted final grade in class = 59.95 + 3.17*(hours of study)

…Predict the final grade of


 Someone who studies for 12 hours
 Final grade = 59.95 + (3.17*12)
 Final grade = 97.99

 Someone who studies for 1 hour:


 Final grade = 59.95 + (3.17*1)
 Final grade = 63.12
Problem 1
A sample of 6 persons was selected the
value of their age ( x variable) and their
weight is demonstrated in the following
table. Find the regression equation and
what is the predicted weight when age is
8.5 years.
.Serial no Age (x) Weight (y)
1 7 12
2 6 8
3 8 12
4 5 10
5 6 11
6 9 13
Answer

Serial Age (x) Weight (y) xy X2 Y2


.no
1 7 12 84 49 144
2 6 8 48 36 64
3 8 12 96 64 144
4 5 10 50 25 100
5 6 11 66 36 121
6 9 13 117 81 169

Total 41 66 461 291 742


Regression equation

ŷ (x)  4.675  0.92x

Value of Y when X = 8.5

ŷ (8.5)  4.675  0.92 * 8.5  12.50Kg


Value of Y when X = 7.5
ŷ (7.5)  4.675  0.92 * 7.5  11.58Kg
Problem 2
Age B.P Age B.P
(x) (y) (x) (y)
20 120 46 128
The following are the
age (in years) and 43 128 53 136
systolic blood 63 141 60 146
pressure of 20 26 126 20 124
apparently healthy 53 134 63 143
adults.
31 128 43 130
58 136 26 124
46 132 19 121
58 140 31 126
70 144 23 123
Find the correlation between age
and blood pressure using simple
and Spearman's correlation
coefficients, and comment.
Find the regression equation?
What is the predicted blood
pressure for a man aging 25 years?
Serial x y xy x2
1 20 120 2400 400
2 43 128 5504 1849
3 63 141 8883 3969
4 26 126 3276 676
5 53 134 7102 2809
6 31 128 3968 961
7 58 136 7888 3364
8 46 132 6072 2116
9 58 140 8120 3364
10 70 144 10080 4900
Serial x y xy x2
11 46 128 5888 2116
12 53 136 7208 2809
13 60 146 8760 3600
14 20 124 2480 400
15 63 143 9009 3969
16 43 130 5590 1849
17 26 124 3224 676
18 19 121 2299 361
19 31 126 3906 961
20 23 123 2829 529
Total 852 2630 114486 41678
 x y
 xy 
n 114486 
852  2630
b1  = 20  0.4547
(  x) 2
852 2

x  n
2 41678 
20

ŷ =112.13 + 0.4547 x

for age 25
B.P = 112.13 + 0.4547 * 25=123.49 = 123.5 mm hg
Checking the Estimating Equation
 A crude way to verify the accuracy of the estimating equation
is to determine the graph of the sample points

 A more sophisticated method comes from one of the


mathematical properties of a line fitted by the method of least
squares, i.e. the individual positive and negative errors must
sum to zero

 When both of them are satisfied, we can be reasonably


certain that we haven’t committed any serious mathematical
mistakes in determining the estimating equation
The Standard Error of Estimate
The standard error  of the estimate (Se) is a measure
of the accuracy of predictions made with the regression
.line
It measures the scatteredness (variability) of the
.observed points from the line of regression
It is given by

Where Y and Y-cap are the observed and the estimated


.value of the dependent variable, respectively
Short-Cut Method for Finding the Standard
Error of Estimate

Find the standard error of estimate for the


.problem I
Answer: $ 86.6
Interpreting the Standard Error of Estimate
±1Se, ±2Se and ±3Se Bounds Around the Regression Line
(Measured on Y-Axis)*Note that Se is measured along the Y-axis,
rather than perpendicularly from the regression line
Approximate Prediction Intervals
We can use Se to make probability statement about the interval
around an estimated value of Y, within which the actual value of Y
.lies
In problem I, the estimating regression equation is
X 0.75 + 3.75 =
So, ŷ
if the department has a 4 year old truck, we predict that it will
.have an annual repair expense of $675
Recall that Se = $86.6
Combining these two pieces of information, we are roughly 68%
confident that the actual repair expense will be within ±1Se from


Calculating the upper and the lower
limits of prediction
Y^ +1Se = 675 + 1 (86.60)
$761.60 =
Y^ -1Se = 675 - 1 (86.60)
$588.40 =
i.e., we are roughly 68% confident that the actual repair
expense will be within $588.40 and $761.60

.Similar statements could be made for 2Se and 3Se


Some more Problems on Correlation
and Regression
Problem 1
Limca Company is studying the effect of its latest advertising
campaign. People chosen at random were called and asked
how many cans of Limca they had purchased in the past week
and how many Limca advertisements they had either read or
seen in the past week. The followings are the results
X (No. of 3 7 4 2 0 4 1 2
Ads.)
Y ( cans 11 18 9 4 7 6 3 8
purchased
)

Calculate the correlation coefficient between X and Y and interpret the


.result
Problem 2
The motor pool wants to know if it costs more to maintain cars that are
driven more often.
i) Calculate intercept and slope
ii) Predict Y if X=20000

Car Number Miles Driven (X) Repair Costs (Y)


80000 72000
29000 9000
53000 39000
13000 12000
45000 19500
Problem 3
Cost accountants often estimate overhead based on the level of
production. At the Standard Knitting Co., they have collected
information on overhead expenses and units produced at different
plants, and want to estimate a regression equation to predict future
.overhead

Overhead 191 170 272 155 280 173 234 116 153 178

Units 40 42 53 35 56 39 48 30 37 40

.Develop the regression equation for the cost accountants


.Predict overhead cost when 50 units are produced
.Calculate the standard error of estimate
Coefficient of Determination
The coefficient of determination, denoted R2 or r2 and pronounced "R
squared", is a number that indicates the proportion of the variance in the
.dependent variable that is predictable from the independent variable

The coefficient of determination is the square of the correlation


coefficient (r) between predicted y scores and actual y scores; thus, it
ranges from 0 to 1
This coefficient of determination is also the proportion of the total
.variation in Y that is explained by the independent variables

.It is also called as goodness of fit of a regression equation


STEPS IN LINEAR REGRESSION

State the hypotheses .1


Gather the data  .2
4. Compute the regression equation 
5. Examine tests of statistical significant and measures
of association 
6. Relate statistical findings to the hypothesis. Accept or
reject the null hypothesis 
7. Reject, accept or revise the original hypothesis
Make suggestions for research design and management
.aspects of the problem
Example
The motor pool wants to know if it costs more to maintain cars that
.are driven more often
Ho: there is no relationship between mileage and maintenance costs
Ha: maintenance costs are affected by car mileage
 
Dependent variable: Y is the cost in dollars of yearly maintenance
on a motor vehicle 
Independent variable: X is the yearly mileage on the same motor
vehicle
Data are gathered on each car in the motor pool, regarding number
of miles driven in a given year, and maintenance costs for that year.
 .Here is a sample of the data collected
Car Number Miles Driven (X) Repair Costs (Y)

1 80,000 $1,200

2 29,000 $150

3 53,000 $650

4 13,000 $200

5 45,000 $325
:The regression equation is computed as
Y = 50 + .03 X
For example, if X=50,000 then Y = 50 + .03 (50,000) = $1,550
a=50 or the cost of maintenance when X=0; if there is no mileage on
the car, then the yearly cost of maintenance=$50
b=.03 the value that Y increases for each unit increase in X; for each
extra mile driven (X), the cost of yearly maintenance increases by
$.03
s.e.b = .0005; the value of b divided by s.e.b=60.0; the t-table
indicates that the b coefficient of X is statistically significant (it is
related to Y)
r2=.90 we can explain 90% of the variance in repair costs for
different vehicles if we know the vehicle mileage for each car

Conclusion: Reject the null hypothesis of no relationship and


 .accept the research hypothesis, that mileage affects repair costs

You might also like