
TARABA STATE UNIVERSITY, JALINGO

STA212 LABORATORY FOR INFERENCE II (2 UNITS) LECTURE NOTES


LINEAR REGRESSION AND CORRELATION
When comparing two different variables, two questions come to mind: "Is there a relationship
between the two variables?" and "How strong is that relationship?" These questions can be answered
using regression and correlation. Regression answers whether there is a relationship (these notes
explore linear relationships only), and correlation answers how strong the linear relationship is.
The independent variable, also called the explanatory variable or predictor variable, is the x -
value in the equation. The independent variable is the one that you use to predict what the other
variable is. The dependent variable depends on what independent value you pick. It also
responds to the explanatory variable and is sometimes called the response variable.
The population equation looks like:
y = β₀ + β₁x
β₀ = y-intercept
β₁ = slope
ŷ is used to predict y

Assumptions of the regression line:


a) The set ( x , y ) of ordered pairs is a random sample from the population of all such possible (
x , y ) pairs.
b) For each fixed value of x , the y -values have a normal distribution. All of the y distributions
have the same variance, and for a given x -value, the distribution of y values has a mean that
lies on the least squares line. You also assume that for a fixed y , each x has its own normal
distribution. This is difficult to figure out, so you can use the following to determine if you
have a normal distribution.
i. Look to see if the scatter plot has a linear pattern.
ii. Examine the residuals to see if there is randomness in the residuals. If there is a
pattern to the residuals, then there is an issue in the data.

SIMPLE LINEAR REGRESSION


We consider modelling the relationship between the dependent variable and one independent
variable. When there is only one independent variable in the linear regression model, the model
is termed a simple linear regression model. When there is more than one independent variable in
the model, the linear model is termed a multiple linear regression model.
The linear model is
y = β₀ + β₁x + ε
where
 x = independent variable
 y = dependent variable
 β₁ = slope of the regression line
 β₀ = intercept of the regression line with the y-axis
 n = number of cases or individuals
 ∑xy = sum of the products of the dependent and independent variables
 ∑x = sum of the independent variable
 ∑y = sum of the dependent variable
 ∑x² = sum of squares of the independent variable
The least-squares estimates are
β₁ = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
β₀ = ȳ − β₁x̄

Example – linear regression of patients' age and blood pressure. A study is conducted
involving 10 patients to investigate the relationship between patients' age and their blood
pressure.

Obs    x (Age)   y (BP)   xy      x²
1      35        112      3920    1225
2      40        128      5120    1600
3      38        130      4940    1444
4      44        138      6072    1936
5      67        158      10586   4489
6      64        162      10368   4096
7      59        140      8260    3481
8      69        175      12075   4761
9      25        125      3125    625
10     50        142      7100    2500
Total  491       1410     71566   26157

∑x = 491, ∑y = 1410, ∑xy = 71566, ∑x² = 26157


Calculating the means x̄ and ȳ:
x̄ = ∑x / n = 491 / 10 = 49.1
ȳ = ∑y / n = 1410 / 10 = 141
β₁ = (n∑xy − ∑x∑y) / (n∑x² − (∑x)²)
β₁ = (10 × 71566 − 491 × 1410) / (10 × 26157 − (491)²)
β₁ = (715660 − 692310) / (261570 − 241081)
β₁ = 23350 / 20489 = 1.140
β₀ = ȳ − β₁x̄ = 141 − 1.140 × 49.1
β₀ = 141 − 55.974 = 85.026

Then substitute the regression coefficients into the regression model:
Estimated blood pressure: Ŷ = 85.026 + 1.140 × age

Interpretation of the equation:

The constant (intercept) β₀ = 85.026 is the estimated blood pressure at age zero.
The regression coefficient β₁ = 1.140 indicates that as age increases by one year, blood
pressure increases by 1.140.
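
The estimates above can be reproduced in a few lines of code. Below is a minimal Python sketch (not part of the original notes) that computes β₁ and β₀ from the raw age/blood-pressure data using the summation formulas:

```python
# Minimal sketch: least-squares slope and intercept for the age/BP data above.
x = [35, 40, 38, 44, 67, 64, 59, 69, 25, 50]            # age
y = [112, 128, 130, 138, 158, 162, 140, 175, 125, 142]  # blood pressure

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)

# beta1 = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2)
beta1 = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
beta0 = sum_y / n - beta1 * sum_x / n  # beta0 = y_bar - beta1 * x_bar

# beta1 ≈ 1.140; beta0 ≈ 85.04 (the notes round beta1 to 1.140 first, giving 85.026)
print(f"beta1 = {beta1:.3f}, beta0 = {beta0:.3f}")
```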

Applying the values of age to the regression model gives the estimated blood pressures (Ŷ),
from which the coefficient of determination (R²) is calculated as follows:

Obs    x (Age)  y (BP)  Ŷ        Ŷ−ȳ      (Ŷ−ȳ)²    y−Ŷ      (y−Ŷ)²   y−ȳ   (y−ȳ)²
1      35       112     124.926  -16.074  258.373   -12.926  167.081  -29   841
2      40       128     130.626  -10.374  107.620   -2.626   6.896    -13   169
3      38       130     128.346  -12.654  160.124   1.654    2.736    -11   121
4      44       138     135.186  -5.814   33.803    2.814    7.919    -3    9
5      67       158     161.406  20.406   416.405   -3.406   11.601   17    289
6      64       162     157.986  16.986   288.524   4.014    16.112   21    441
7      59       140     152.286  11.286   127.374   -12.286  150.946  -1    1
8      69       175     163.686  22.686   514.655   11.314   128.007  34    1156
9      25       125     113.526  -27.474  754.821   11.474   131.653  -16   256
10     50       142     142.026  1.026    1.053     -0.026   0.001    1     1
Total  491      1410    1410     0.000    2662.750  0.000    622.950  0     3284

Equation of the ANOVA table for simple linear regression

Source of Variation   Sums of Squares   df    Mean Square      F
Regression            ∑(Ŷ−ȳ)²           1     SSreg / 1        MSreg / MSres
Residual              ∑(y−Ŷ)²           n−2   SSres / (n−2)
Total                 ∑(y−ȳ)²           n−1

Calculating the ANOVA table values for simple linear regression

Source of Variation   Sums of Squares   df   Mean Square           F
Regression            2662.75           1    2662.75/1 = 2662.75   2662.75/77.869 = 34.195
Residual              622.95            8    622.95/8 = 77.869
Total                 3284              9

Calculating the coefficient of determination (R²):

R² = Explained Variation / Total Variation = Regression Sum of Squares (SSR) / Total Sum of Squares (SST)
Then substitute the values from the ANOVA table:
R² = 2662.75 / 3284 = 0.810
We can say that 81% of the variation in blood pressure is explained by age.
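
As a check on the tables above, here is a hedged Python sketch that rebuilds the ANOVA quantities, F and R² from the data and the rounded coefficients; small differences from the hand calculation come only from rounding:

```python
# Sketch: ANOVA decomposition and R^2 for the fitted line Y_hat = b0 + b1*x.
x = [35, 40, 38, 44, 67, 64, 59, 69, 25, 50]
y = [112, 128, 130, 138, 158, 162, 140, 175, 125, 142]
b0, b1 = 85.026, 1.140            # coefficients as rounded in the notes

n = len(x)
y_bar = sum(y) / n
y_hat = [b0 + b1 * xi for xi in x]

ss_reg = sum((yh - y_bar) ** 2 for yh in y_hat)           # regression SS
ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # residual SS
ss_tot = sum((yi - y_bar) ** 2 for yi in y)               # total SS

F = (ss_reg / 1) / (ss_res / (n - 2))
r2 = ss_reg / ss_tot
print(f"SSreg = {ss_reg:.2f}, SSres = {ss_res:.2f}, SStot = {ss_tot:.0f}")
print(f"F = {F:.3f}, R^2 = {r2:.3f}")  # about 34.2 and 0.81
```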

SIMPLE CORRELATION
A correlation exists between two variables when the values of one variable are somehow
associated with the values of the other variable. When you see a pattern in the data you say there
is a correlation in the data. Patterns can be exponential, logarithmic, or periodic, all are linear
patterns. To see this pattern, you can draw a scatter plot of the data. Remember to read graphs
from left to right, the same as you read words. If the graph goes up the correlation is positive and
if the graph goes down the correlation is negative. The words “weak”, “moderate”, and “strong”
are used to describe the strength of the relationship between the two variables.
a. Strong positive correlation between x and y. The points lie close to a straight line with y
increasing as x increases.
b. Weak, positive correlation between x and y. The trend shown is that y increases as x
increases but the points are not close to a straight line
c. No correlation between x and y; the points are distributed randomly on the graph.
d. Weak, negative correlation between x and y. The trend shown is that y decreases as x
increases but the points do not lie close to a straight line
e. Strong, negative correlation. The points lie close to a straight line, with y decreasing as x
increases
Correlation can have a value:
1. 1 is a perfect positive correlation
2. 0 is no correlation (the values don't seem linked at all)
3. -1 is a perfect negative correlation
The value shows how good the correlation is (not how steep the line is), and if it is positive or
negative. Usually, in statistics, there are three types of correlations: Pearson correlation, Kendall
rank correlation and Spearman correlation.
The Pearson correlation coefficient is given by the following equation:
r = ∑(xᵢ − x̄)(yᵢ − ȳ) / √[∑(xᵢ − x̄)² ∑(yᵢ − ȳ)²]
where x̄ is the mean of the x values and ȳ is the mean of the y values.
Example – correlation of statistics and science tests. A study is conducted involving 10 students
to investigate the association between statistics and science tests. The question here is: is
there a relationship between the marks gained by the 10 students in the statistics and science tests?

Students' marks in statistics and science
Students     1   2   3   4   5   6   7   8   9   10
Statistics   20  23  8   29  14  12  11  21  17  18
Science      20  25  11  24  23  16  12  21  22  26
Note: the marks are out of 30.
Let (x) denote the statistics marks and (y) the science marks.
Calculating the means (x̄, ȳ):
x̄ = ∑x / n = 173 / 10 = 17.3,  ȳ = ∑y / n = 200 / 10 = 20
So the mean of the statistics marks is x̄ = 17.3 and the mean of the science marks is ȳ = 20.
Calculating the equation parameters:

x     y     x−x̄    (x−x̄)²   y−ȳ   (y−ȳ)²   (x−x̄)(y−ȳ)
20    20    2.7    7.29     0     0        0
23    25    5.7    32.49    5     25       28.5
8     11    -9.3   86.49    -9    81       83.7
29    24    11.7   136.89   4     16       46.8
14    23    -3.3   10.89    3     9        -9.9
12    16    -5.3   28.09    -4    16       21.2
11    12    -6.3   39.69    -8    64       50.4
21    21    3.7    13.69    1     1        3.7
17    22    -0.3   0.09     2     4        -0.6
18    26    0.7    0.49     6     36       4.2
173   200   0      356.1    0     252      228

∑(x−x̄)² = 356.1,  ∑(y−ȳ)² = 252,  ∑(x−x̄)(y−ȳ) = 228


Calculating the Pearson correlation coefficient:
r = ∑(x−x̄)(y−ȳ) / √[∑(x−x̄)² ∑(y−ȳ)²] = 228 / (√356.1 × √252)
  = 228 / (18.8706 × 15.8745) = 228 / 299.5614 = 0.761
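
As a quick check, the Pearson coefficient can be computed directly; the following Python sketch mirrors the hand calculation above:

```python
# Sketch: Pearson correlation for the statistics (x) and science (y) marks.
x = [20, 23, 8, 29, 14, 12, 11, 21, 17, 18]
y = [20, 25, 11, 24, 23, 16, 12, 21, 22, 26]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))  # 228
sxx = sum((a - x_bar) ** 2 for a in x)                      # 356.1
syy = sum((b - y_bar) ** 2 for b in y)                      # 252

r = sxy / (sxx * syy) ** 0.5
print(f"r = {r:.3f}")  # 0.761
```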
Spearman rank correlation: Spearman rank correlation is a non-parametric test that is used to
measure the degree of association between two variables. It was developed by Spearman, and thus
it is called the Spearman rank correlation. The test makes no assumptions about the distribution
of the data and is the appropriate correlation analysis when the variables are measured on a
scale that is at least ordinal.
The following formula is used to calculate the Spearman rank correlation coefficient:
ρ = 1 − 6∑dᵢ² / [n(n² − 1)]
Where:
ρ = Spearman rank correlation coefficient
dᵢ = the difference between the ranks of corresponding values Xᵢ and Yᵢ
n = number of values in each data set
The Spearman correlation coefficient, ρ, can take values from +1 to −1. A ρ of +1 indicates a
perfect positive association of ranks, a ρ of zero indicates no association between ranks, and a
ρ of −1 indicates a perfect negative association of ranks. The closer ρ is to zero, the weaker
the association between the ranks.
An example of calculating Spearman's correlation
To calculate a Spearman rank-order correlation coefficient on data without any ties, use the
following data:
Students     1   2   3   4   5   6   7   8   9   10
Statistics   20  23  8   29  14  12  11  21  17  18
Science      20  25  11  24  23  16  12  21  22  26

Calculating the parameters of the Spearman rank equation:

Statistics   Science   Rank           Rank
(marks)      (marks)   (statistics)   (science)   |d|   d²
20           20        4              7           3     9
23           25        2              2           0     0
8            11        10             10          0     0
29           24        1              3           2     4
14           23        7              4           3     9
12           16        8              8           0     0
11           12        9              9           0     0
21           21        3              6           3     9
17           22        6              5           1     1
18           26        5              1           4     16
173          200                                        48

where d = the absolute difference between ranks and d² = the difference squared.
Then substitute into the main equation as follows:
ρ = 1 − 6∑dᵢ² / [n(n² − 1)]
ρ = 1 − (6 × 48) / [10(10² − 1)] = 1 − 288/990 = 1 − 0.2909 = 0.71

Hence, we have ρ = 0.71; this indicates a strong positive relationship between the ranks
individuals obtained in the statistics and science exams. This means the higher you ranked in
statistics, the higher you ranked in science, and vice versa. The Pearson correlation
coefficient r = 0.761 and Spearman's correlation ρ = 0.71 for the same data, which means the
correlation coefficients for both techniques are approximately equal.
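
The Spearman calculation is equally short in code. The sketch below ranks each list (rank 1 = highest mark, as in the table above; the simple ranking function assumes no ties, as in this example) and applies the formula:

```python
# Sketch: Spearman rank correlation for the same marks (no tied values).
x = [20, 23, 8, 29, 14, 12, 11, 21, 17, 18]   # statistics
y = [20, 25, 11, 24, 23, 16, 12, 21, 22, 26]  # science

def ranks(values):
    # Rank 1 = largest value; simple indexing works because there are no ties.
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]

rx, ry = ranks(x), ranks(y)
n = len(x)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))  # 48
rho = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))
print(f"sum d^2 = {sum_d2}, rho = {rho:.2f}")  # 48, 0.71
```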
INFERENCE FOR REGRESSION AND CORRELATION
How do you really say you have a correlation? Can you test to see if there really is a correlation?
Of course, the answer is yes. The hypothesis test for correlation is as follows:
Hypothesis Test for Correlation:
1. State the random variables in words.
x = independent variable
y = dependent variable
2. State the null and alternative hypotheses and the level of significance
H0: ρ = 0 (There is no correlation)
Ha: ρ ≠ 0 (There is a correlation)
Or
Ha: ρ < 0 (There is a negative correlation)
Or
Ha: ρ > 0 (There is a positive correlation)
Also, state your α level here.
3. State and check the assumptions for the hypothesis test. The assumptions for the hypothesis
test are the same assumptions for regression and correlation.
4. Find the test statistic and p-value:
t = r / √[(1 − r²)/(n − 2)]
with degrees of freedom df = n − 2.

5. Conclusion: This is where you write reject H0 or fail to reject H0. The rule is: if the
p-value < α, then reject H0. If the p-value ≥ α, then fail to reject H0.
6. Interpretation: This is where you interpret the conclusion of the test in real-world terms.
The conclusion for a hypothesis test is that you either have enough evidence to show Ha is
true, or you do not have enough evidence to show Ha is true.
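
If software is available, steps 4 and 5 reduce to a few lines. The sketch below (assuming SciPy is installed) computes the test statistic and two-tailed p-value for the Pearson example from earlier (r = 0.761, n = 10):

```python
# Sketch: t-test for a correlation coefficient (two-tailed).
from scipy import stats

r, n = 0.761, 10                          # from the Pearson example above
t = r / ((1 - r ** 2) / (n - 2)) ** 0.5   # test statistic with df = n - 2
p = 2 * stats.t.sf(abs(t), df=n - 2)      # combined area in both tails
print(f"t = {t:.3f}, p = {p:.4f}")        # t ≈ 3.32, p ≈ 0.011: reject H0 at α = 0.05
```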

TESTS CONCERNING CORRELATION AND REGRESSION COEFFICIENTS


The correlation coefficient, r, tells us about the strength and direction of the linear
relationship between X1 and X2.
The sample data are used to compute r, the correlation coefficient for the sample. If we had data
for the entire population, we could find the population correlation coefficient. But because we
have only sample data, we cannot calculate the population correlation coefficient. The sample
correlation coefficient, r , is our estimate of the unknown population correlation coefficient.
ρ = population correlation coefficient (unknown)
r = sample correlation coefficient (known; calculated from sample data)
The hypothesis test lets us decide whether the value of the population correlation coefficient ρ is
"close to zero" or "significantly different from zero". We decide this based on the sample
correlation coefficient r and the sample size n.
If the test concludes that the correlation coefficient is significantly different from zero, we say
that the correlation coefficient is "significant."
Conclusion: There is sufficient evidence to conclude that there is a significant linear
relationship between X1 and X2 because the correlation coefficient is significantly different
from zero.
What the conclusion means: There is a significant linear relationship between X1 and X2. If the
test concludes that the correlation coefficient is not significantly different from zero (it is
close to zero), we say that the correlation coefficient is "not significant".
Performing the Hypothesis Test
 Null Hypothesis: H0: ρ = 0
 Alternate Hypothesis: Ha: ρ ≠ 0
What the Hypotheses Mean in Words
Null Hypothesis H0: The population correlation coefficient IS NOT significantly different from
zero. There IS NOT a significant linear relationship (correlation) between X1 and X2 in the
population.
Alternate Hypothesis Ha: The population correlation coefficient IS significantly different from
zero. There IS a significant linear relationship (correlation) between X1 and X2 in the population.
Drawing a Conclusion
There are two methods of making the decision concerning the hypothesis. The two methods are
equivalent and give the same result. The test statistic to test this hypothesis is:
t = r / √[(1 − r²)/(n − 2)]
or, equivalently,
t = r√(n − 2) / √(1 − r²)
Method 1: Using the p-value
 The p-value is calculated using the t-distribution with n − 2 degrees of freedom.
 The value of the test statistic, t, is shown in the computer or calculator output along with
the p-value. The test statistic t has the same sign as the correlation coefficient r.
 The p-value is the combined area in both tails.
If the p-value is less than the significance level (α = 0.05):
Decision: Reject the null hypothesis.
Conclusion: "There is sufficient evidence to conclude that there is a significant linear
relationship between X1 and X2 because the correlation coefficient is significantly different
from zero."
If the p-value is NOT less than the significance level (α = 0.05):
Decision: DO NOT REJECT the null hypothesis.
Conclusion: "There is insufficient evidence to conclude that there is a significant linear
relationship between X1 and X2 because the correlation coefficient is NOT significantly
different from zero."
E.g. Consider an example where the line of best fit is ŷ = −173.51 + 4.83x, with r = 0.6631 and
n = 11 data points.
Can the regression line be used for prediction? Given a third exam score (x value), can we use
the line to predict the final exam score (predicted y value)?
 H0: ρ = 0
 Ha: ρ ≠ 0
 α = 0.05
The p-value is 0.026 (from your calculator or from computer software). The p-value, 0.026, is
less than the significance level of α = 0.05.
Decision: Reject the null hypothesis H0.
Conclusion: There is sufficient evidence to conclude that there is a significant linear
relationship between the third exam score (x) and the final exam score (y) because the
correlation coefficient is significantly different from zero.
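
The quoted p-value of 0.026 can be reproduced from r and n alone; here is a short sketch (again assuming SciPy):

```python
# Sketch: reproducing the p-value for r = 0.6631, n = 11 (df = 9).
from scipy import stats

r, n = 0.6631, 11
t = r * (n - 2) ** 0.5 / (1 - r ** 2) ** 0.5  # equivalent form of the test statistic
p = 2 * stats.t.sf(abs(t), df=n - 2)
print(f"t = {t:.3f}, p = {p:.3f}")  # t ≈ 2.66, p ≈ 0.026 < 0.05, so reject H0
```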
Method 2: Using a table of critical values
The 95% Critical Values of the Sample Correlation Coefficient table can be used to give you a
good idea of whether the computed value of r is significant or not. Compare r to the appropriate
critical value in the table. If r is not between the positive and negative critical values, then
the correlation coefficient is significant. If r is significant, then you may want to use the
line for prediction.
We will always use a significance level of 5%, α = 0.05.
Example:
Suppose you computed r = 0.801 using n = 10 data points. df = n − 2 = 10 − 2 = 8. The critical
values associated with df = 8 are −0.632 and +0.632. If r < the negative critical value or
r > the positive critical value, then r is significant. Since r = 0.801 and 0.801 > 0.632, r is
significant and the line may be used for prediction.
FITTING A STRAIGHT LINE BY THE METHOD OF LEAST SQUARES
Let (xᵢ, yᵢ), i = 1, 2, …, n be the n sets of observations and let the assumed relation be
y = ax + b. Now we have to select a and b so that the straight line is the best fit to the data.
As explained earlier, the residual at x = xᵢ is
dᵢ = yᵢ − f(xᵢ) = yᵢ − (axᵢ + b), i = 1, 2, …, n
E = ∑ᵢ dᵢ² = ∑ᵢ [yᵢ − (axᵢ + b)]²

By the principle of least squares, E is minimum when

∂E/∂a = 0 and ∂E/∂b = 0

i.e. 2∑[yᵢ − (axᵢ + b)](−xᵢ) = 0 and 2∑[yᵢ − (axᵢ + b)](−1) = 0

i.e. ∑(xᵢyᵢ − axᵢ² − bxᵢ) = 0 and ∑(yᵢ − axᵢ − b) = 0

i.e. a∑xᵢ² + b∑xᵢ = ∑xᵢyᵢ  (1)

and a∑xᵢ + nb = ∑yᵢ  (2)

Since xᵢ, yᵢ are known, equations (1) and (2) give two equations in a and b. Solve for a and b
from (1) and (2) and obtain the best fit y = ax + b.
Note:
 Equations (1) and (2) are called normal equations.
 Dropping the suffix i from (1) and (2), the normal equations are
a∑x + nb = ∑y and a∑x² + b∑x = ∑xy
which are obtained by taking ∑ on both sides of y = ax + b, and by taking ∑ on both sides after
multiplying both sides of y = ax + b by x.
 Transformations like X = (x − a)/h, Y = (y − b)/h reduce the linear equation y = ax + b to the
form Y = AX + B. Hence, a linear fit in one system of coordinates is again a linear fit in the
other.
Example 1:
By the method of least squares, find the straight line of best fit to the data given below.
x   5   10   15   20   25
y   16  19   23   26   30
Solution:
Let the straight line be y = ax + b.

The normal equations are a∑x + 5b = ∑y  (1)

a∑x² + b∑x = ∑xy  (2)

To calculate ∑x, ∑x², ∑y, ∑xy we form the table below.

x      y     x²     xy
5      16    25     80
10     19    100    190
15     23    225    345
20     26    400    520
25     30    625    750
Total  75    114    1375   1885

The normal equations become 75a + 5b = 114  (1)

1375a + 75b = 1885  (2)
To eliminate b, multiply (1) by 15:
1125a + 75b = 1710  (3)
Equation (2) − (3) gives 250a = 175, or a = 0.7; hence b = 12.3.
Hence, the best fitting line is y = 0.7x + 12.3.
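
For comparison, the normal equations can also be set up and solved numerically; below is a minimal NumPy sketch for this example:

```python
# Sketch: solving the normal equations for Example 1 with NumPy.
import numpy as np

x = np.array([5, 10, 15, 20, 25])
y = np.array([16, 19, 23, 26, 30])

# a*sum(x^2) + b*sum(x) = sum(xy)   (1)
# a*sum(x)   + n*b      = sum(y)    (2)
A = np.array([[np.sum(x ** 2), np.sum(x)],
              [np.sum(x),      len(x)]])
rhs = np.array([np.sum(x * y), np.sum(y)])
a, b = np.linalg.solve(A, rhs)
print(f"y = {a:.1f}x + {b:.1f}")  # y = 0.7x + 12.3
```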
Alternatively, let X = (x − x_mid)/h = (x − 15)/5, Y = (y − y_mid)/h = (y − 23)/5
and let the line in the new variables be Y = AX + B.

x      y     X    X²   Y     XY
5      16    -2   4    -1.4  2.8
10     19    -1   1    -0.8  0.8
15     23    0    0    0     0
20     26    1    1    0.6   0.6
25     30    2    4    1.4   2.8
Total  75    114  0    10    -0.2   7

The normal equations are A∑X + 5B = ∑Y  (4)

A∑X² + B∑X = ∑XY  (5)

Since ∑X = 0, equation (4) gives 5B = −0.2 → B = −0.04

and equation (5) gives 10A = 7 → A = 0.7
The equation is Y = 0.7X − 0.04

i.e. (y − 23)/5 = 0.7 × (x − 15)/5 − 0.04 → y − 23 = 0.7x − 10.5 − 0.2

i.e. y = 0.7x + 12.3

which is the same equation as obtained before.
Example 2:
Fit a straight line to the data given below. Also estimate the value of y at x = 2.5.
x   0   1    2    3    4
y   1   1.8  3.3  4.5  6.3
Solution:
Here n = 5, ∑x = 10, ∑y = 16.9, ∑x² = 30, ∑xy = 47.1. Substituting into the normal equations,
we get

10a + 5b = 16.9  (1)
30a + 10b = 47.1  (2)
Solving (1) and (2), we get a = 1.33, b = 0.72.
Hence, the equation is y = 1.33x + 0.72
y (at x = 2.5) = 1.33(2.5) + 0.72 = 4.045

POLYNOMIAL AND REGRESSION PLANE


Some data, although exhibiting a marked pattern, are poorly represented by a straight line.
One method to accomplish this objective is to use transformations. An alternative method is
to fit polynomials to the data using polynomial regression.
The least squares procedure can be readily extended to fit the data to a higher-order polynomial.
For example, suppose that we fit a second-order polynomial, or quadratic:
y = a₀ + a₁x + a₂x² + ε
For this case the sum of the squares of the residuals is:
Sr = ∑ᵢ (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)²  (1)
We take the derivative of (1) with respect to each of the unknown coefficients of the
polynomial, as in
∂Sr/∂a₀ = −2∑(yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)
∂Sr/∂a₁ = −2∑xᵢ(yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)
∂Sr/∂a₂ = −2∑xᵢ²(yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)
These equations can be set equal to zero and rearranged to develop the following set of normal
equations (2):
(n)a₀ + (∑xᵢ)a₁ + (∑xᵢ²)a₂ = ∑yᵢ
(∑xᵢ)a₀ + (∑xᵢ²)a₁ + (∑xᵢ³)a₂ = ∑xᵢyᵢ
(∑xᵢ²)a₀ + (∑xᵢ³)a₁ + (∑xᵢ⁴)a₂ = ∑xᵢ²yᵢ
where all summations are from i = 1 through n. Note that the above equations are linear and have
three unknowns: a₀, a₁ and a₂. The coefficients of the unknowns can be calculated directly from
the observed data.
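
In code, these three normal equations form a 3×3 linear system that any solver can handle. Below is a hedged NumPy sketch that builds and solves the system for arbitrary data (the function name is ours, not from the notes):

```python
# Sketch: normal equations for a quadratic least-squares fit.
import numpy as np

def fit_quadratic(x, y):
    """Return (a0, a1, a2) solving the quadratic normal equations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.array([[len(x),          x.sum(),         (x ** 2).sum()],
                  [x.sum(),         (x ** 2).sum(),  (x ** 3).sum()],
                  [(x ** 2).sum(),  (x ** 3).sum(),  (x ** 4).sum()]])
    rhs = np.array([y.sum(), (x * y).sum(), (x ** 2 * y).sum()])
    return np.linalg.solve(A, rhs)  # a0, a1, a2
```

For the worked example below, fit_quadratic returns the same coefficients as Gauss elimination.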
The two-dimensional case can be easily extended to an mth-order polynomial as:
y = a₀ + a₁x + a₂x² + … + aₘxᵐ + ε
The foregoing analysis can be easily extended to this general case. Thus, we recognize that
determining the coefficients of an mth-order polynomial is equivalent to solving a system of
m + 1 simultaneous linear equations. For this case, the standard error is formulated as
S_(y/x) = √[Sr / (n − (m + 1))]  (3)

Example: Fit a second-order polynomial to the data in the first two columns of the table below.
Computations for an error analysis of the quadratic least-squares fit

xᵢ   yᵢ     (yᵢ − ȳ)²   (yᵢ − a₀ − a₁xᵢ − a₂xᵢ²)²
0    2.1    544.44      0.14332
1    7.7    314.47      1.00286
2    13.6   140.03      1.08158
3    27.2   3.12        0.80491
4    40.9   239.22      0.61951
5    61.1   1272.11     0.09439
∑    152.6  2513.39     3.74657

Solution
From the given data,
m = 2, n = 6, ∑xᵢ = 15, ∑xᵢ² = 55, ∑xᵢ³ = 225, ∑xᵢ⁴ = 979
∑yᵢ = 152.6, ∑xᵢyᵢ = 585.6, ∑xᵢ²yᵢ = 2488.8, x̄ = 2.5, ȳ = 25.433
Therefore, the simultaneous linear equations are
[ 6    15    55  ] [a₀]   [ 152.6 ]
[ 15   55    225 ] [a₁] = [ 585.6 ]
[ 55   225   979 ] [a₂]   [ 2488.8]
Solving these equations through a technique such as Gauss elimination gives a₀ = 2.47857,
a₁ = 2.35929 and a₂ = 1.86071. Therefore, the least-squares quadratic equation for this case is
y = 2.47857 + 2.35929x + 1.86071x²
The standard error of the estimate based on the regression polynomial is
S_(y/x) = √[3.74657 / (6 − 3)] = 1.12
The coefficient of determination is
r² = (2513.39 − 3.74657) / 2513.39 = 0.99851
and the correlation coefficient is r = 0.99925.
These results indicate that 99.851 percent of the original uncertainty has been explained by the
model. This result supports the conclusion that the quadratic equation represents an excellent fit.
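
These results are easy to verify with a library routine; the sketch below uses numpy.polyfit and then recomputes S_(y/x) and r² from the residuals:

```python
# Sketch: verifying the quadratic fit and its error statistics with NumPy.
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([2.1, 7.7, 13.6, 27.2, 40.9, 61.1])

a2, a1, a0 = np.polyfit(x, y, 2)        # polyfit returns highest power first
y_hat = a0 + a1 * x + a2 * x ** 2

sr = np.sum((y - y_hat) ** 2)           # residual sum of squares, ≈ 3.74657
st = np.sum((y - y.mean()) ** 2)        # total sum of squares, ≈ 2513.39
syx = np.sqrt(sr / (len(x) - 3))        # standard error, m + 1 = 3 coefficients
r2 = (st - sr) / st
print(f"a0 = {a0:.5f}, a1 = {a1:.5f}, a2 = {a2:.5f}")  # 2.47857, 2.35929, 1.86071
print(f"Syx = {syx:.2f}, r^2 = {r2:.5f}")              # 1.12, 0.99851
```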
