
url: http://www.cscst.edu.ph email: information@cscst.ph
Tel. No. (032) 416 - 6501 Fax No. (032) 416 - 6501

Certificate No.: AJA03/6952

CORRELATION AND REGRESSION ANALYSIS

The basic purpose behind correlation is to find out if two variables are related to one another. If
the variables are related, regression then allows the use of the relationship in the prediction of
one variable given a score on the other variable.

In most graduate level programs an applicant who wishes to enroll is required to submit scores
from the Graduate Record Examination (GRE). After submitting one’s application for admission
to the graduate program, the admission team will look at the application materials and estimate
how well they expect an enrollee will do in the program when admitted.
1. This prediction may occur informally (just a sort of best guess), or formally where
correlation and regression procedures are used.
2. Using formal procedures, the admission committee calculates the correlation coefficient
using the GRE scores submitted by students attending the program, and the grade point
averages (GPA) earned by those students.
This correlation is used as both a descriptive and an inferential statistic.
1. As a descriptive statistic, the correlation informs the admissions committee about the
relationship between GRE scores and GPAs.
2. As an inferential statistic, the correlation coefficient can be used to make a decision
about whether there is a statistically significant relationship between GRE scores and
GPAs.
3. If the relationship is significant, a regression equation can be generated and will then
be able to take the submitted GRE score, and predict GPA.
Working with this correlation involves learning
1. how to calculate a correlation coefficient,
2. how to evaluate both its significance and strength,
3. how to test it for statistical significance, and
4. how to construct and use a regression equation.

Correlation Values
A correlation coefficient is a number that ranges from -1.0 up through 0 to a maximum value of
+1.0.
 The correlation indicates how closely the relative positions of two or more variables agree with
one another. Or stated another way, the correlation indicates the correspondence, or lack of
correspondence between the relative positions of two or more variables.

88 Correlation and Regression Summer 2009


Direction & Magnitude
Correlation coefficients indicate both the direction of the relationship and its magnitude.

1. If a correlation is negative, it indicates that the high values on the first variable are
related to low values on the second variable, and low values on the first variable go
with high values on the second.
2. If the correlation is positive, then low values on the first variable go with low values on
the second variable, and high values on the first variable, in general, go with high
values on the second variable.

Of course, this direction is given by the sign (either + or -) of the calculated correlation.

Magnitudes are measured by comparing the pattern the data makes with a straight line.

1. If the pattern perfectly matches a straight line then the magnitude of the correlation is
either +1.0 or -1.0 depending on the direction.
2. If the pattern of the relationship doesn't fit a linear pattern at all, then the magnitude of
the correlation is zero (0), indicating that there is no linear relationship between the two
variables in question.

[Figure: scatter plots graphically showing the relationships. Panels: high negative correlation
(correlation = -.75), high positive correlation (correlation = .75), low correlation
(correlation = 0.0), and perfect positive correlation (correlation = 1.0).]

Notes in Statistics Chapter 7


89
Compiled by: Nolasco K. Malabago Correlation and Regression Analysis
The strength and direction of the coefficient of correlation are summarized in the following
diagram.

[Diagram: a scale running from -1.00 through -0.50, 0.00 and +0.50 to +1.00. Negative
correlations lie to the left of 0.00 and positive correlations to the right. Reading from left
to right, the labels are: perfect negative correlation (-1.00), strong negative correlation,
moderate negative correlation (around -0.50), weak negative correlation, no correlation (0.00),
weak positive correlation, moderate positive correlation (around +0.50), strong positive
correlation, and perfect positive correlation (+1.00).]

Pearson Correlation (Pearson’s Product Moment Coefficient of Correlation)


There is one major correlation coefficient called the Pearson product-moment correlation
named after Karl Pearson. Its symbol, if calculated using sample data, is r (which stands for
regression).
The Pearson Correlation is a measure of straight line association between two variables.
Remember: Correlation measures the relative position congruence between two variables. Z scores are
the best measures of relative position. Pearson correlations are based on the concept of
the product of the z scores.
1. If the z scores for the x and y variables are both positive or both negative, their
product will be positive.
2. If exactly one of the z scores in the (x, y) pair is negative, their product will be negative.

The Pearson correlation is simply the average of the cross-products of the z scores in a bivariate
data set. The equations below give the population parameter and sample statistic formulas.

Equation of the population correlation (rho):
\[ \rho_{xy} = \frac{\sum z_x z_y}{N} \]

Equation of the sample statistic:
\[ r_{xy} = \frac{\sum z_x z_y}{n - 1} \]
These equations simply show that each subject's scores on the x and y variables are converted to z
scores, the z scores are multiplied together, the products are added up, and the sum is divided by
either the population size or the sample size minus one to calculate the correlation.

\[ r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^{2} - (\sum x)^{2}\right]\left[n\sum y^{2} - (\sum y)^{2}\right]}} \]

This equation is the calculation formula for the sample correlation. The calculation formula for
rho is obtained by simply substituting the population standard deviations for the sample
standard deviations.
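The two routes to the sample correlation described above (averaging z-score cross-products, and the calculation formula built from sums) can be sketched in Python. This is only an illustrative sketch: the function names are invented here, and the data are the eight pairs of counts from the student-teacher problem that appears later in these notes.

```python
import math

def pearson_from_z(xs, ys):
    """Sample r as the sum of z-score cross-products divided by n - 1."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Sample standard deviations (n - 1 in the denominator).
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    cross = sum(((x - mean_x) / s_x) * ((y - mean_y) / s_y)
                for x, y in zip(xs, ys))
    return cross / (n - 1)

def pearson_from_sums(xs, ys):
    """Sample r from the calculation formula that uses raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Counts recorded by the two student teachers (from the problem in these notes).
teacher_1 = [1, 2, 4, 1, 4, 2, 3, 4]
teacher_2 = [2, 0, 5, 2, 4, 1, 2, 0]

r_z = pearson_from_z(teacher_1, teacher_2)
r_sums = pearson_from_sums(teacher_1, teacher_2)
print(round(r_z, 3), round(r_sums, 3))  # both round to 0.371
```

Both functions should agree up to floating-point rounding, which is a convenient check on hand calculations.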



Hypothesis Testing
Though correlations can be used to describe relationships between variables, a common use is
to test the null hypothesis that states there is no relationship between the two variables.
Testing this hypothesis answers the question "Is it significant?" and, if the correlation is
significant, leads to a second question: "Is the found correlation important?"

Is it significant?
When there is no relationship between the variables in the population,
1. The correlation (rho) would be equal to zero, and the derivation of the sampling
distribution is straightforward.
2. Repeatedly sampling from the population using the same sample size creates a
sampling distribution of sample correlations having a zero mean.
The critical values for the sample correlation depend on whether directional or non-directional
tests are conducted, and on the degrees of freedom.
 As the sample size and degrees of freedom increase, the sample correlation becomes a
better and better estimate of the population parameter (rho = 0), so it is increasingly
difficult for the sample correlation to differ from the population parameter of zero by chance alone.

The appropriate test statistic follows the t distribution with n - 2 degrees of freedom.
\[ t = \frac{r\sqrt{n - 2}}{\sqrt{1 - r^{2}}} \]
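The t statistic for a correlation can be computed directly from r and n. The helper below is a hypothetical sketch (its name is not from the notes). It applies the formula to r = 0.371 with n = 8, the values from the student-teacher problem later in these notes. The critical value still has to be looked up in a t table for n - 2 degrees of freedom.

```python
import math

def correlation_t(r, n):
    """t statistic for testing rho = 0, with n - 2 degrees of freedom."""
    if n <= 2:
        raise ValueError("at least three pairs are needed")
    if abs(r) >= 1:
        raise ValueError("t is undefined for a perfect correlation")
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# r and n from the student-teacher problem in these notes.
t_teachers = correlation_t(0.371, 8)
print(f"t = {t_teachers:.3f} with {8 - 2} degrees of freedom")
```

The computed t is then compared with the tabled critical value for the chosen significance level and for a directional or non-directional test.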
Is the found correlation important?
Beyond finding that the two variables significantly correlate and rejecting the null hypothesis, it
is often important to consider the importance of the found correlation.
"Importance" is assessed by calculating the coefficient of determination, the proportion of
variance the two variables share. The coefficient of determination is simply found by squaring
the correlation coefficient.
\[ \text{Coefficient of Determination} = (r_{xy})^{2} \]

Remember this!
There are no rules for determining whether the percentage of variance shared is important
beyond what researchers know from the literature. For example, if we could discover a measure
that shares variance with suicide attempts, the finding would be quite important, because no
measure has so far correlated well with attempting suicide.

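The Population and Sample Standard Deviation

Formula for population standard deviation:
\[ \sigma = \sqrt{\frac{\sum (x - \mu)^{2}}{N}} = \sqrt{\frac{\sum x^{2} - \frac{(\sum x)^{2}}{N}}{N}} \]

Formula for sample standard deviation:
\[ S = \sqrt{\frac{\sum (x - \bar{x})^{2}}{n - 1}} = \sqrt{\frac{\sum x^{2} - \frac{(\sum x)^{2}}{n}}{n - 1}} \]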
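As a rough illustration of the difference between the two standard deviations, the sketch below computes both from raw sums and compares them with Python's `statistics.pstdev` and `statistics.stdev`. The function names are made up for this example, and the scores are the Teacher_1 counts from the problem in these notes.

```python
import math
import statistics

def population_sd(values):
    """Population standard deviation: divide the sum of squares by N."""
    n = len(values)
    sum_sq = sum(v * v for v in values) - sum(values) ** 2 / n
    return math.sqrt(sum_sq / n)

def sample_sd(values):
    """Sample standard deviation: divide the sum of squares by n - 1."""
    n = len(values)
    sum_sq = sum(v * v for v in values) - sum(values) ** 2 / n
    return math.sqrt(sum_sq / (n - 1))

scores = [1, 2, 4, 1, 4, 2, 3, 4]  # Teacher_1 counts from the notes

pop_sd = population_sd(scores)
samp_sd = sample_sd(scores)
print(round(pop_sd, 3), round(samp_sd, 3))  # 1.218 1.302

# The standard library gives the same answers.
print(round(statistics.pstdev(scores), 3), round(statistics.stdev(scores), 3))
```

The sample value is always the larger of the two because it divides the same sum of squares by n - 1 rather than N.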

Problem
Two student teachers observe eight children in a classroom situation where children are
supposed to be silently working at their desks. Each student teacher independently records the
number of times the child speaks without getting permission from the teacher to do so. Is there
a significant correlation between the two student teachers?
Solution:
Step 1: Is it significant?

        Teacher_1           Teacher_2
        x       x^2         y       y^2       xy
        1        1          2        4         2
        2        4          0        0         0
        4       16          5       25        20
        1        1          2        4         2
        4       16          4       16        16
        2        4          1        1         2
        3        9          2        4         6
        4       16          0        0         0
Total  21       67         16       54        48
Mean    2.625               2.000
S       1.302               1.773

\[ r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^{2} - (\sum x)^{2}\right]\left[n\sum y^{2} - (\sum y)^{2}\right]}} \]
\[ r_{xy} = \frac{8(48) - (21)(16)}{\sqrt{\left[8(67) - (21)^{2}\right]\left[8(54) - (16)^{2}\right]}} \]
\[ r_{xy} = 0.371 \]

The Simple Lines
Two points __________________.

Probably no one has any difficulty filling in the words "make a line" in the preceding
sentence. As everyone learns in elementary school, two points define a line: any two points
completely describe a single line through them, and that line can be written with a simple
formula,

\[ y = bx + c \]
where: y = the value on the y-axis
       x = the value on the x-axis
       b = the slope of the line
       c = the constant or intercept

The slope of the line is simply the change the line makes between the two points on the y-axis
divided by the change made between the two points on the x-axis:
\[ b = \frac{\Delta y}{\Delta x} \]
This change in value is often abbreviated with a symbol that looks like a small triangle and is
called delta. Thus, the slope is often defined as delta y divided by delta x.
The constant is defined as the point where the line crosses the y-axis, that is, the value of y when x = 0.

Consider this!
Given the slope of the line and a single point, it is easy to find the constant. Simply take the x value at the point
and subtract it from zero. This gives the change the line must make from this point to get the x value to zero.
Now multiply that value by the slope. The newly calculated value is the amount the line must change on the
y-axis as x moves from its value at the point to zero. Adding this number to the y value at the point
gives the constant.
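The slope and constant of the line through two points can be computed exactly as described above. The function name and the two points below are hypothetical and chosen only for illustration.

```python
def line_through(p1, p2):
    """Return (slope, constant) of the line through two (x, y) points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no finite slope")
    slope = (y2 - y1) / (x2 - x1)        # delta y divided by delta x
    # Move from x1 to zero: the change in x is (0 - x1); multiply it by the
    # slope and add the result to y1 to get the value of y at x = 0.
    constant = y1 + slope * (0 - x1)
    return slope, constant

slope, constant = line_through((1, 3), (3, 7))
print(slope, constant)  # 2.0 1.0
```

For the points (1, 3) and (3, 7) the line is y = 2x + 1, which passes through both points.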



Understanding Regression
A regression line is not defined by points at each (x, y) pair. It is calculated so that it is the single
best line representing all the data values that are scattered in a swarm shown in the scatter
plots.

Regression lines are derived so that the distances between the observed values and the regression
line (each such distance is called a residual), when squared and summed across all the values, give
the smallest possible total. Thus, the values on the y-axis for the regression line are not the
observed values themselves but expected (predicted) values. To differentiate real from expected
values, statisticians put what they call a hat (^) above the expected variable.
Thus, the simple regression line formula includes a y-hat.
\[ \hat{y} = bx + c \]

The slope of the regression line is given by the formula
\[ b = r\,\frac{S_y}{S_x} \]
where S_y and S_x are the standard deviations of y and x. The constant is calculated using the formula
\[ c = \bar{y} - b\bar{x} \]
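The slope and constant formulas can be tried on a small data set. The sketch below uses hypothetical data and invented function names. It also computes the sum of squared residuals so the "smallest possible value" property can be checked against a slightly different line.

```python
import math

def fit_line(xs, ys):
    """Return (b, c, r) for the regression line y-hat = b*x + c."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    sp_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sp_xy / math.sqrt(ss_x * ss_y)
    s_x = math.sqrt(ss_x / (n - 1))      # sample standard deviation of x
    s_y = math.sqrt(ss_y / (n - 1))      # sample standard deviation of y
    b = r * s_y / s_x                    # slope
    c = mean_y - b * mean_x              # constant (intercept)
    return b, c, r

def sum_squared_residuals(xs, ys, b, c):
    """Sum of squared distances between observed and predicted y values."""
    return sum((y - (b * x + c)) ** 2 for x, y in zip(xs, ys))

# Hypothetical illustration data, not taken from the notes.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b, c, r = fit_line(xs, ys)
best = sum_squared_residuals(xs, ys, b, c)
nudged = sum_squared_residuals(xs, ys, b + 0.1, c)
print(round(b, 3), round(c, 3))  # 0.6 2.2
print(best < nudged)             # True
```

Any other slope or constant gives a larger sum of squared residuals than the fitted line does.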

Outliers
An outlier is a value that is very different from the others in a data set.
Outliers can have a large effect on the correlation coefficient and the regression equation.
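A small made-up example shows how strongly a single outlier can change the correlation. The data and the function name are hypothetical: five points lie exactly on a line, and then one extra point far from that line is added.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation from the calculation formula using raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) *
                            (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

# Five hypothetical points that lie exactly on the line y = x.
xs = [1, 2, 3, 4, 5]
ys = [1, 2, 3, 4, 5]
r_clean = pearson_r(xs, ys)

# The same data with one outlier, the point (6, 0), added.
r_outlier = pearson_r(xs + [6], ys + [0])

print(round(r_clean, 3), round(r_outlier, 3))  # 1.0 0.143
```

One point out of six is enough here to turn a perfect correlation into a weak one, which is why scatter plots should be inspected before interpreting r.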

Regression Analysis

Problem:
From the Quantitative (Quant) and Verbal scores of the Graduate Record Examination
given in the table, is there a significant relationship (correlation) between the Verbal and
Quantitative scores?
If there is, then construct the best prediction equation (the equation of the regression line) using
Verbal scores to predict Quantitative scores.

Quantitative   660 490 560 510 670 580 620 610 352 480
               522 576 666 680 456 580 390 615 590 580
Verbal         640 530 520 500 580 540 600 595 370 500
               580 495 658 710 500 579 410 520 495 580

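Solution:
With x = Verbal and y = Quantitative,
\[ r_{xy} = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^{2} - (\sum x)^{2}\right]\left[n\sum y^{2} - (\sum y)^{2}\right]}} \]
\[ r_{xy} = \frac{20(6216168) - (10902)(11187)}{\sqrt{\left[20(6063480) - (10902)^{2}\right]\left[20(6414881) - (11187)^{2}\right]}} \]
\[ r_{xy} = 0.85663465 \]
Coefficient of Determination (r_xy)^2 = 0.733822932
There is a strong positive correlation between the Verbal and Quantitative scores.
The Regression Equation:
\[ \hat{y} = bx + c \]
where: ŷ = quantitative
       x = verbal
\[ b = r\,\frac{S_y}{S_x} = 0.85663465\left(\frac{91.0275}{79.7363}\right) = 0.97794 \]
\[ c = \bar{y} - b\bar{x} \]
Carrying the unrounded slope (b ≈ 0.9779346) through the calculation,
\[ c = 559.35 - 0.9779346(545.1) \approx 26.2779 \]

Thus, the regression equation is Quant = 0.97794*Verbal + 26.2779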

                    Quantitative              Verbal
                    y          y^2            x          x^2            xy
                    660        435600         640        409600         422400
                    490        240100         530        280900         259700
                    560        313600         520        270400         291200
                    510        260100         500        250000         255000
                    670        448900         580        336400         388600
                    580        336400         540        291600         313200
                    620        384400         600        360000         372000
                    610        372100         595        354025         362950
                    352        123904         370        136900         130240
                    480        230400         500        250000         240000
                    522        272484         580        336400         302760
                    576        331776         495        245025         285120
                    666        443556         658        432964         438228
                    680        462400         710        504100         482800
                    456        207936         500        250000         228000
                    580        336400         579        335241         335820
                    390        152100         410        168100         159900
                    615        378225         520        270400         319800
                    590        348100         495        245025         292050
                    580        336400         580        336400         336400
Total               11187      6414881        10902      6063480        6216168
Mean                559.35                    545.1
Standard Deviation  91.0275                   79.7363
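The hand calculation can be cross-checked with a short script. This is only a sketch with variable names chosen here. It reuses the twenty score pairs from the table, with Verbal as x and Quantitative as y.

```python
import math

quant = [660, 490, 560, 510, 670, 580, 620, 610, 352, 480,
         522, 576, 666, 680, 456, 580, 390, 615, 590, 580]
verbal = [640, 530, 520, 500, 580, 540, 600, 595, 370, 500,
          580, 495, 658, 710, 500, 579, 410, 520, 495, 580]

n = len(verbal)
sum_x, sum_y = sum(verbal), sum(quant)
sum_x2 = sum(x * x for x in verbal)
sum_y2 = sum(y * y for y in quant)
sum_xy = sum(x * y for x, y in zip(verbal, quant))

# Correlation from the calculation formula.
numerator = n * sum_xy - sum_x * sum_y
r = numerator / math.sqrt((n * sum_x2 - sum_x ** 2) *
                          (n * sum_y2 - sum_y ** 2))

# Sample standard deviations and means.
s_x = math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))
s_y = math.sqrt((sum_y2 - sum_y ** 2 / n) / (n - 1))
mean_x, mean_y = sum_x / n, sum_y / n

# Regression of Quantitative (y) on Verbal (x).
b = r * s_y / s_x
c = mean_y - b * mean_x

# t statistic for testing rho = 0 with n - 2 degrees of freedom.
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# These should agree with the notes: r about 0.8566, b about 0.9779,
# c about 26.28, t about 7.04.
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
print(f"b = {b:.6f}, c = {c:.4f}, t = {t:.4f}")
```

Keeping the slope unrounded until the end avoids the small drift in the constant that appears when intermediate results are rounded.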



The Summary:

• The Summary tests the correlation coefficient and calculates the estimates for
the regression equation.
• The correlation coefficient is equal to 0.8566.
• The t value for the slope is equal to 7.04444 with a p value less than 0.05, and thus the
correlation coefficient is significantly different from zero.
• The regression equation is: the expected value for Quant = 0.97794*Verbal + 26.2779.
• The square of the correlation coefficient indicates that 73.38% of the variance is
accounted for using Verbal scores to predict Quantitative scores.
• To produce a 95% confidence interval around the expected score, researchers would add
and subtract 1.96 standard errors of estimate to the predicted values.
• Fitting a linear model to describe the relationship between Quant and Verbal gives the
fitted equation Quant = 26.2779 + 0.977935*Verbal.
• The inner bounds show 95.0% confidence limits for the mean of many new values of
Quant at given values of Verbal. The outer bounds show 95.0% confidence limits for a
single new value of Quant at given values of Verbal.

Problem 1

Each summer, many schools using the Montessori Approach conduct what is called kindergarten
roundup. Parents with children, who will be kindergartners the following school year, are asked
to attend the roundup. The future students are given speech and hearing screening tests as well
as academic screening tests. The academic screening tests are used to alert parents as to
whether their children are ready to profit from academic experiences. The following data set
contains children's screening scores (Screen), and the same children's achievement scores
(Achieve) at the end of their actual kindergarten year. Use the Regression Analysis/Simple
Regression procedure to determine if Screen is significantly correlated with Achieve.
Screen 1 3 3 3 5 6.5 6.5 8 9 10
Achieve 46 52 41 92 59 48 66 82 41 60

Problem 2
Use the Simple Regression procedure to determine the regression equation for the following
data where First is used to predict Second. Assuming the correlation is significant, what is the
predicted value for Second if the subject earned a score of 80 on First?

First 72 57 50 71 75 75 62 63 76 85 93 64 40 55 77
Second 55 46 30 73 76 68 50 53 83 90 82 60 34 60 81

Rank – Order Correlation

Another measure of correlation is used when the data are ordinal (ranked): Spearman's
rank-order correlation coefficient (Spearman's rho). This was introduced by Charles Edward
Spearman (1863-1945), a British statistician. The formula is,

\[ r_s = 1 - \frac{6\sum d^{2}}{n(n^{2} - 1)} \]
where: d is the difference between the ranks of each pair
       n is the number of paired observations

Spearman's rank-order correlation coefficient ranges from -1.00 through 0 to +1.00, with
+1.00 indicating perfect agreement between the ranks and -1.00 indicating that the rankings are
perfectly inverse (the highest rank in one list is paired with the lowest in the other).
An r_s of 0 indicates that there is no relationship between the ranks.

Problem:
Given the rankings by the Teenagers and Senior Citizens of the selected Television
Programs (shown in the table), find Spearman's rank-order correlation and
test the null hypothesis (level of significance = 0.05) that the correlation is 0.

Television Program    Ranking by Teenagers    Ranking by Senior Citizens
TV Patrol                      3                          2
It Might Be You                2                          4
Yes Yes Show                   5                          5
Lovers in Paris                1                          1
Hiram                          4                          3

Solution:

Television Program    Ranking by Teenagers    Ranking by Senior Citizens    Difference Between Ranks (d)    Difference Squared (d^2)
TV Patrol                      3                          2                              1                              1
It Might Be You                2                          4                             -2                              4
Yes Yes Show                   5                          5                              0                              0
Lovers in Paris                1                          1                              0                              0
Hiram                          4                          3                              1                              1
                                                                                                             Σd^2 = 6

\[ r_s = 1 - \frac{6\sum d^{2}}{n(n^{2} - 1)} = 1 - \frac{6(6)}{5(5^{2} - 1)} = 0.7 \]
In this sample, the value indicates strong positive agreement between the rankings of the
selected TV programs by the Teenagers and the Senior Citizens.

\[ t = r_s\sqrt{\frac{n - 2}{1 - r_s^{2}}} = 0.7\sqrt{\frac{5 - 2}{1 - (0.7)^{2}}} = 1.6977 \]

The decision rule calls for the acceptance of the null hypothesis if the computed value of t is
less than the table value at the 0.05 level of significance (one-tailed).
Decision: at df = 3, α = 0.05, one-tailed, the table value t = 2.353 is greater than the computed
t = 1.6977, so the null hypothesis is accepted (not rejected): the positive agreement is not
statistically significant.
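The Spearman calculation and its t test can be sketched as follows. The function name is hypothetical, the shortcut formula assumes there are no tied ranks, and the critical value 2.353 is the table value quoted in these notes rather than something the code derives.

```python
import math

def spearman_rho(ranks_a, ranks_b):
    """Spearman's rank-order correlation (shortcut formula, no tied ranks)."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Rankings from the table: TV Patrol, It Might Be You, Yes Yes Show,
# Lovers in Paris, Hiram.
teenagers = [3, 2, 5, 1, 4]
senior_citizens = [2, 4, 5, 1, 3]

n = len(teenagers)
r_s = spearman_rho(teenagers, senior_citizens)
t = r_s * math.sqrt((n - 2) / (1 - r_s ** 2))

critical_t = 2.353  # table value from the notes: df = 3, alpha = 0.05, one-tailed
reject_null = t > critical_t

print(round(r_s, 4), round(t, 4), reject_null)  # 0.7 1.6977 False
```

With only five programs, even a correlation of 0.7 is not large enough to reject the null hypothesis.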
