Sta 212
Example – Linear regression of patients' age and blood pressure. A study is conducted
involving 10 patients to investigate the relationship between patients' age and their blood
pressure.
Obs  Age (x)  BP (y)  xy  x²
(working table of the ten observations; the column totals give ∑x = 491, ∑y = 1410, ∑xy = 71566, ∑x² = 26157)

x̄ = ∑x / n = 491/10 = 49.1
ȳ = ∑y / n = 1410/10 = 141
β₁ = (n ∑xy − ∑x ∑y) / (n ∑x² − (∑x)²) = (10 × 71566 − 491 × 1410) / (10 × 26157 − 491²) = 23350/20489 ≈ 1.14
β₀ = ȳ − β₁x̄ = 141 − 1.14 × 49.1 = 85.026

so the fitted model is Ŷ = 85.026 + 1.14x.
Applying each age value to the fitted regression model gives the estimated blood pressure
(Ŷ), from which the coefficient of determination (R²) is computed as follows:

Obs  Age (x)  BP (y)  Ŷ  Ŷ−ȳ  (Ŷ−ȳ)²  y−Ŷ  (y−Ŷ)²  y−ȳ  (y−ȳ)²
1 35 112 124.926 -16.074 258.373 -12.926 167.081 -29 841
2 40 128 130.626 -10.374 107.620 -2.626 6.896 -13 169
3 38 130 128.346 -12.654 160.124 1.654 2.736 -11 121
4 44 138 135.186 -5.814 33.803 2.814 7.919 -3 9
5 67 158 161.406 20.406 416.405 -3.406 11.601 17 289
6 64 162 157.986 16.986 288.524 4.014 16.112 21 441
7 59 140 152.286 11.286 127.374 -12.286 150.946 -1 1
8 69 175 163.686 22.686 514.655 11.314 128.007 34 1156
9 25 125 113.526 -27.474 754.821 11.474 131.653 -16 256
10 50 142 142.026 1.026 1.053 -0.026 0.001 1 1
Total 491 1410 1410 0.000 2662.750 0.000 622.950 0 3284
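From the column totals, R² = ∑(Ŷ−ȳ)² / ∑(y−ȳ)² = 2662.750/3284 ≈ 0.811, so roughly 81% of the variation in blood pressure is explained by age. The sketch below (variable names are my own, not from the source) reproduces the fit and R² from the ten (age, BP) pairs; note the worked table rounds the slope to 1.14, which gives the intercept 85.026, while the unrounded least-squares values are β₁ ≈ 1.1396 and β₀ ≈ 85.04.

```python
# Least-squares fit of blood pressure on age for the 10 patients above.
ages = [35, 40, 38, 44, 67, 64, 59, 69, 25, 50]
bps  = [112, 128, 130, 138, 158, 162, 140, 175, 125, 142]

n = len(ages)
sx, sy = sum(ages), sum(bps)                       # 491, 1410
sxy = sum(x * y for x, y in zip(ages, bps))        # sum of xy
sxx = sum(x * x for x in ages)                     # sum of x^2

# Slope and intercept from the normal equations.
b1 = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b0 = sy / n - b1 * sx / n

# Coefficient of determination: explained variation / total variation.
y_bar = sy / n
y_hat = [b0 + b1 * x for x in ages]
ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)
ss_tot = sum((y - y_bar) ** 2 for y in bps)
r2 = ss_reg / ss_tot

print(round(b1, 4), round(b0, 2), round(r2, 3))
```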
SIMPLE CORRELATION
A correlation exists between two variables when the values of one variable are somehow
associated with the values of the other variable. When you see a pattern in the data you say there
is a correlation in the data. Patterns can be linear, exponential, logarithmic, or periodic; note
that not all patterns are linear. To see the pattern, you can draw a scatter plot of the data. Remember to read graphs
from left to right, the same as you read words. If the graph goes up the correlation is positive and
if the graph goes down the correlation is negative. The words “weak”, “moderate”, and “strong”
are used to describe the strength of the relationship between the two variables.
a. Strong, positive correlation between x and y. The points lie close to a straight line, with y
increasing as x increases.
b. Weak, positive correlation between x and y. The trend shown is that y increases as x
increases, but the points are not close to a straight line.
c. No correlation between x and y; the points are distributed randomly on the graph.
d. Weak, negative correlation between x and y. The trend shown is that y decreases as x
increases, but the points do not lie close to a straight line.
e. Strong, negative correlation between x and y. The points lie close to a straight line, with y decreasing as x
increases.
Correlation can have a value:
1. 1 is a perfect positive correlation
2. 0 is no correlation (the values don't seem linked at all)
3. -1 is a perfect negative correlation
The value shows how strong the correlation is (not how steep the line is), and whether it is positive or
negative. In statistics, three types of correlation are commonly used: the Pearson correlation, the Kendall
rank correlation and the Spearman correlation.
The Pearson correlation coefficient is given by the following equation:

r = ∑(xᵢ − x̄)(yᵢ − ȳ) / √( ∑(xᵢ − x̄)² ∑(yᵢ − ȳ)² )

where the sums run over i = 1, …, n, x̄ is the mean of the x values and ȳ is the mean of the y values.
Example – Correlation of statistics and science tests. A study is conducted involving 10 students
to investigate the association between their statistics and science test scores. The question here is:
is there a relationship between the marks gained by the 10 students in the statistics and science tests?
Student degree in Statistic and science
Students 1 2 3 4 5 6 7 8 9 10
Statistics 20 23 8 29 14 12 11 21 17 18
Science 20 25 11 24 23 16 12 21 22 26
Note: the marks are out of 30.
Suppose that (x) denotes the statistics marks and (y) the science marks.
Calculating the means (x̄, ȳ):

x̄ = ∑x / n = 173/10 = 17.3,  ȳ = ∑y / n = 200/10 = 20

where the mean of the statistics marks is x̄ = 17.3 and the mean of the science marks is ȳ = 20.
Calculating the equation terms:

Statistics (x)  Science (y)  x−x̄  (x−x̄)²  y−ȳ  (y−ȳ)²  (x−x̄)(y−ȳ)
20 20 2.7 7.29 0 0 0
23 25 5.7 32.49 5 25 28.5
8 11 -9.3 86.49 -9 81 83.7
29 24 11.7 136.89 4 16 46.8
14 23 -3.3 10.89 3 9 -9.9
12 16 -5.3 28.09 -4 16 21.2
11 12 -6.3 39.69 -8 64 50.4
21 21 3.7 13.69 1 1 3.7
17 22 -0.3 0.09 2 4 -0.6
18 26 0.7 0.49 6 36 4.2
173 200 0 356.1 0 252 228
r = ∑(x−x̄)(y−ȳ) / √( ∑(x−x̄)² ∑(y−ȳ)² ) = 228 / (√356.1 × √252)
  = 228 / (18.8706 × 15.8745) = 228 / 299.5614 = 0.761
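The hand computation above can be checked in a few lines of Python (standard library only; variable names are illustrative, not from the source):

```python
# Pearson correlation for the 10 students' statistics and science marks.
stats_marks   = [20, 23, 8, 29, 14, 12, 11, 21, 17, 18]
science_marks = [20, 25, 11, 24, 23, 16, 12, 21, 22, 26]

n = len(stats_marks)
mx = sum(stats_marks) / n        # mean of x: 17.3
my = sum(science_marks) / n      # mean of y: 20.0

# Sums of cross-products and squared deviations, as in the worked table.
sxy = sum((x - mx) * (y - my) for x, y in zip(stats_marks, science_marks))
sxx = sum((x - mx) ** 2 for x in stats_marks)
syy = sum((y - my) ** 2 for y in science_marks)

r = sxy / (sxx * syy) ** 0.5
print(round(r, 3))
```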
Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to
measure the degree of association between two variables. It was developed by Spearman, hence
the name. The test makes no assumptions about the distribution of the data and is the
appropriate correlation analysis when the variables are measured on a scale that is at least ordinal.
The following formula is used to calculate the Spearman rank correlation coefficient:

ρ = 1 − 6∑dᵢ² / (n(n² − 1))
Where:
ρ = Spearman rank correlation coefficient
dᵢ = the difference between the ranks of corresponding values Xᵢ and Yᵢ
n = number of values in each data set.
The Spearman correlation coefficient, ρ , can take values from +1 to -1. A ρ of +1 indicates a
perfect association of ranks, a ρ of zero indicates no association between ranks and a ρ of -1
indicates a perfect negative association of ranks. The closer ρ is to zero, the weaker the association
between the ranks.
An example of calculating Spearman's correlation
To calculate a Spearman rank-order correlation coefficient on data without any ties, use the
following data:
Students 1 2 3 4 5 6 7 8 9 10
Statistics 20 23 8 29 14 12 11 21 17 18
Science 20 25 11 24 23 16 12 21 22 26
Statistics  Science  Rank (x)  Rank (y)  d  d²
20 20 4 7 3 9
23 25 2 2 0 0
8 11 10 10 0 0
29 24 1 3 2 4
14 23 7 4 3 9
12 16 8 8 0 0
11 12 9 9 0 0
21 21 3 6 3 9
17 22 6 5 1 1
18 26 5 1 4 16
Total 173 200    ∑d² = 48
where d = the absolute difference between the ranks and d² = that difference squared. Then
substitute into the main equation as follows:
ρ = 1 − 6∑dᵢ² / (n(n² − 1))
ρ = 1 − (6 × 48) / (10(10² − 1)); ρ = 1 − 288/990; ρ = 1 − 0.2909; ρ = 0.71
Hence, we have ρ = 0.71; this indicates a strong positive relationship between the ranks
individuals obtained in the statistics and science exams. This means the higher you ranked in
statistics, the higher you ranked in science, and vice versa. So the Pearson correlation
coefficient r = 0.761 and Spearman's correlation ρ = 0.71 for the same data, which means the
correlation coefficients for both techniques are approximately equal.
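The rank-and-difference procedure above can be sketched in Python (a standard-library sketch for tie-free data, as in the example; the helper `ranks` is my own):

```python
# Spearman rank correlation for the tie-free marks in the example.
stats_marks   = [20, 23, 8, 29, 14, 12, 11, 21, 17, 18]
science_marks = [20, 25, 11, 24, 23, 16, 12, 21, 22, 26]

def ranks(values):
    # Rank 1 = largest value, matching the worked table (no ties assumed).
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

rx, ry = ranks(stats_marks), ranks(science_marks)
d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))    # sum of squared rank differences

n = len(stats_marks)
rho = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rho, 2))
```

With ties, the average of the tied ranks would be used instead, so this helper only suits tie-free data.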
INFERENCE FOR REGRESSION AND CORRELATION
How do you really say you have a correlation? Can you test to see if there really is a correlation?
Of course, the answer is yes. The hypothesis test for correlation is as follows:
Hypothesis Test for Correlation:
1. State the random variables in words.
x = independent variable
y = dependent variable
2. State the null and alternative hypotheses and the level of significance
H0: ρ = 0 (There is no correlation)
Ha: ρ ≠ 0 (There is a correlation)
Or
Ha: ρ < 0 (There is a negative correlation)
Or
Ha: ρ > 0 (There is a positive correlation)
Also, state your α level here.
3. State and check the assumptions for the hypothesis test. The assumptions for the hypothesis
test are the same assumptions for regression and correlation.
4. Find the test statistic and p-value. The test statistic is

t_c = r / √( (1 − r²) / (n − 2) ) = r√(n − 2) / √(1 − r²)

5. Conclusion. This is where you write reject H0 or fail to reject H0. The rule is: if the p-value <
α, then reject H0. If the p-value ≥ α, then fail to reject H0.
6. Interpretation
This is where you interpret the conclusion of the test in real-world terms. The conclusion for
a hypothesis test is that you either have enough evidence to show Ha is true, or you do not
have enough evidence to show Ha is true.
Method 1: Using the p-value
The p-value is calculated using the t-distribution with n − 2 degrees of freedom.
The value of the test statistic, t, is shown in the computer or calculator output along with the
p-value. The test statistic t has the same sign as the correlation coefficient r.
The p-value is the combined area in both tails.
If the p-value is less than the significance level (α = 0.05):
Decision: Reject the null hypothesis.
Conclusion: "There is sufficient evidence to conclude that there is a significant linear
relationship between x and y because the correlation coefficient is significantly different from zero."
If the p-value is NOT less than the significance level (α = 0.05):
Decision: DO NOT REJECT the null hypothesis.
Conclusion: "There is insufficient evidence to conclude that there is a significant linear
relationship between x and y because the correlation coefficient is NOT significantly different from
zero."
E.g. Consider an example where the line of best fit is ŷ = −173.51 + 4.83x, with r = 0.6631 and
n = 11 data points.
Can the regression line be used for prediction? Given a third exam score (x value), can we use
the line to predict the final exam score (predicted y value)?
H0: ρ = 0
Ha: ρ ≠ 0
α = 0.05
The p-value is 0.026 (from your calculator or from computer software). The p-value, 0.026, is
less than the significance level of α = 0.05.
Decision: Reject the null hypothesis H0.
Conclusion: There is sufficient evidence to conclude that there is a significant linear
relationship between the third exam score (x) and the final exam score (y) because the
correlation coefficient is significantly different from zero.
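For this example the test statistic can be computed directly from r and n. In the sketch below the two-tailed critical value t(0.025, df = 9) = 2.262 is a value I have taken from a standard t table (hard-coded, not computed); the p-value itself would come from software, as the text says.

```python
import math

# Test statistic for H0: rho = 0, using r = 0.6631 and n = 11 from the example.
r, n = 0.6631, 11
t_c = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Two-tailed critical value for alpha = 0.05 with df = n - 2 = 9,
# taken from a t table (an assumption hard-coded here).
t_crit = 2.262

# H0 is rejected when |t_c| exceeds the critical value.
print(round(t_c, 2), abs(t_c) > t_crit)
```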
Method 2: Using a table of critical values
The 95% Critical Values of the Sample Correlation Coefficient table can be used to give you a
good idea of whether the computed value of r is significant or not. Compare r to the appropriate
critical value in the table. If r is not between the positive and negative critical values, then the
correlation coefficient is significant. If r is significant, then you may want to use the line for
prediction.
We will always use a significance level of 5%, α = 0.05.
Example:
Suppose you computed r = 0.801 using n = 10 data points. df = n − 2 = 10 − 2 = 8. The critical
values associated with df = 8 are −0.632 and +0.632. If r < the negative critical value or r
> the positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is significant
and the line may be used for prediction.
FITTING A STRAIGHT LINE BY THE METHOD OF LEAST SQUARES
Let (xᵢ, yᵢ), i = 1, 2, …, n be the n sets of observations and let the assumed relation be y = ax + b.
Now we have to select a and b so that the straight line is the best fit to the data.
As explained earlier, the residual at x = xᵢ is
dᵢ = yᵢ − f(xᵢ) = yᵢ − (axᵢ + b),  i = 1, 2, …, n
and the sum of the squared residuals is
E = ∑dᵢ² = ∑[yᵢ − (axᵢ + b)]²
Setting the partial derivatives ∂E/∂a and ∂E/∂b to zero gives
a ∑xᵢ² + b ∑xᵢ = ∑xᵢyᵢ   (1)
a ∑xᵢ + nb = ∑yᵢ   (2)
Since xᵢ, yᵢ are known, equations (1) and (2) give two equations in a and b. Solve for a and b
from (1) and (2) to obtain the best fit y = ax + b.
Note:
Equations (1) and (2) are called normal equations.
Dropping the suffix i from (1) and (2), the normal equations are
a ∑x + nb = ∑y and a ∑x² + b ∑x = ∑xy,
which are obtained by taking ∑ on both sides of y = ax + b, and by multiplying both sides of
y = ax + b by x and then taking ∑.
Transformations like X = (x − a)/h, Y = (y − b)/h reduce the linear equation y = ax + b to the form
Y = AX + B. Hence, a linear fit in one coordinate system is again a linear fit in the other.
Example 1:
By the method of least squares, find the straight line that best fits the data given below.
x 5 10 15 20 25
y 16 19 23 26 30
Solution:
Let the straight line be y = ax + b. The normal equations are
a ∑x² + b ∑x = ∑xy   (1)
a ∑x + nb = ∑y   (2)
x  y  x²  xy
5 16 25 80
10 19 100 190
15 23 225 345
20 26 400 520
25 30 625 750
Total 75 114 1375 1885
Using the transformed variables X = (x − 15)/5 and Y = (y − 23)/5, the normal equations become
A ∑X² + B ∑X = ∑XY   (5)
and, since ∑X = 0 here, they give A = ∑XY/∑X² = 7/10 = 0.7 and B = ∑Y/n = −0.2/5 = −0.04.
i.e.,
(y − 23)/5 = 0.7((x − 15)/5) − 0.04
y − 23 = 0.7x − 10.5 − 0.2
Hence the line of best fit is y = 0.7x + 12.3.
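Solving the normal equations for this data in code (a minimal sketch; variable names are mine) confirms the fitted line y = 0.7x + 12.3:

```python
# Least-squares line y = a*x + b for the data of Example 1,
# obtained by solving the normal equations (1) and (2).
xs = [5, 10, 15, 20, 25]
ys = [16, 19, 23, 26, 30]

n = len(xs)
sx, sy = sum(xs), sum(ys)                   # 75, 114
sxx = sum(x * x for x in xs)                # 1375
sxy = sum(x * y for x, y in zip(xs, ys))    # 1885

# Solve: a*sxx + b*sx = sxy  and  a*sx + n*b = sy
a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
b = (sy - a * sx) / n

print(a, b)
```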
For a polynomial regression of order m, the standard error of the estimate is

s_(y/x) = √( S_r / (n − (m + 1)) )   (3)

where S_r is the sum of the squares of the residuals.
Example: Fit a second-order polynomial to the data in the first two columns of the table below.
Computations for an error analysis of the quadratic least-squares fit:
xᵢ  yᵢ  (yᵢ − ȳ)²  (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)²
0 2.1 544.44 0.14332
1 7.7 314.47 1.00286
2 13.6 140.03 1.08158
3 27.2 3.12 0.80491
4 40.9 239.22 0.61951
5 61.1 1272.11 0.09439
Σ 152.6 2513.39 3.74657
Solution
From the given data,
m = 2,  n = 6,  ∑xᵢ = 15,  ∑xᵢ² = 55,  ∑xᵢ³ = 225,  ∑xᵢ⁴ = 979
∑yᵢ = 152.6,  ∑xᵢyᵢ = 585.6,  ∑xᵢ²yᵢ = 2488.8,  x̄ = 2.5,  ȳ = 25.433
Therefore, the simultaneous linear equations are
6a₀ + 15a₁ + 55a₂ = 152.6
15a₀ + 55a₁ + 225a₂ = 585.6
55a₀ + 225a₁ + 979a₂ = 2488.8
Solving these equations through a technique such as Gauss elimination gives a₀ = 2.47857,
a₁ = 2.35929 and a₂ = 1.86071. Therefore, the least-squares quadratic equation for this case is
y = 2.47857 + 2.35929x + 1.86071x²
The standard error of the estimate based on the regression polynomial is
s_(y/x) = √( 3.74657 / (6 − (2 + 1)) ) = 1.1175
The coefficient of determination is
r² = (2513.39 − 3.74657) / 2513.39 = 0.99851
and the correlation coefficient is r = 0.99925.
These results indicate that 99.851 percent of the original uncertainty has been explained by the
model. This result supports the conclusion that the quadratic equation represents an excellent fit.
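The 3×3 system above can be solved with a small Gauss-elimination routine. This is a standard-library sketch of the technique the text names, not code from the source:

```python
# Naive Gauss elimination with partial pivoting, applied to the normal
# equations of the quadratic least-squares fit.
def gauss_solve(A, b):
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]   # augmented matrix
    for k in range(n):
        # Partial pivoting: bring the row with the largest pivot to row k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        # Eliminate column k from the rows below.
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

A = [[6, 15, 55], [15, 55, 225], [55, 225, 979]]
b = [152.6, 585.6, 2488.8]
a0, a1, a2 = gauss_solve(A, b)
print(round(a0, 5), round(a1, 5), round(a2, 5))
```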