
INTRODUCTION TO REGRESSION ANALYSIS

Introduction
The emphasis of this course is on understanding and modelling linear relationships.

Review of categorical methods for examining relationships
A set of data is said to be categorical if the data is separable into categories
that are mutually exclusive, for example, race, sex, age groups, educational
level. Analysis of categorical data generally involves the use of data tables.
To examine if two categorical variables are related we use the χ2 test.

Example: School Admission Rates

Admit Not Admit Total Applicants


Male 233 324 557
Female 88 194 282
Total 321 518 839

Is there gender bias on school admission?

Solution

H0 : There is no gender bias in school admissions.

H1 : There is gender bias in school admissions.
Test Statistic: χ² = Σ (O − E)²/E

Rejection Criteria: Testing at the 5% significance level, we reject H0 if χ² > χ²₀.₀₅(v) = χ²₀.₀₅(1) = 3.84

Test Statistic:

χ² = Σ (O − E)²/E = 8.9482

Conclusion: Since χ² > 3.84, we reject H0 and conclude that there is gender
bias, implying a dependence relationship between gender and admission.
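The table calculation can be checked with a few lines of Python. This is a minimal sketch using only the standard library; `scipy.stats.chi2_contingency` would do the same job, though note that for 2×2 tables it applies a continuity correction by default.

```python
# Chi-square test of independence for the admissions table.
# Expected count = (row total * column total) / grand total.
observed = [[233, 324],   # Male: admit, not admit
            [88, 194]]    # Female: admit, not admit

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / grand_total
        chi2 += (o - e) ** 2 / e

print(round(chi2, 4))  # ≈ 8.948, well above the 3.84 critical value
```

Since the computed statistic exceeds χ²₀.₀₅(1) = 3.84, the code reproduces the rejection of H0.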

Regression Analysis and Correlation


Regression analysis is a statistical methodology concerned with describing
and evaluating the relationship between a given variable, called the
dependent variable, and one or more variables called explanatory or
independent variable(s).
For ease of reference we will label the dependent variable Y and the independent
variable X. There are other names for the dependent and independent variables:

Y X
Predictand Predictor
Regressand Regressor
Endogenous Exogenous
Target Control

Note: Descriptive techniques generally look at one variable at a time, while
regression analysis looks at the relationship between variables.

Defn: Correlation
Correlation is the intensity or strength of the relationship between two
variables.

Note: Correlation does not measure how two variables are related but mea-
sures the strength of their relationship.

When to Apply Regression Analysis
There are conditions which must be satisfied before we can apply regression
analysis

1. The variables of concern should be related to each other, otherwise the
idea of regression collapses.

2. One variable should change in response to the other, that is, there should
be a dependence relationship.

How do we check on these requirements? This is often done by:

1. Constructing a scatter plot

2. Calculating the correlation coefficient of the variables

3. Cross-tabulations

Scatter Plot
A scatter plot is a plot of one variable against another. We may use three
variables to get a three-dimensional plot. By looking at the scatter plot, we
get the visual impression about the relationship between the variables, that
is, whether the variables are linearly related or otherwise.

[Figure: scatter plot of Y against X, omitted.]
The scatter diagram shows that X and Y are linearly related.

Correlation
In correlation both X and Y are random variables and are of equal interest.
We want to determine whether or not there is a linear association between
these two random variables. The most often used measure of linear association
between two random variables is the Pearson product moment correlation
coefficient, ρ. There are other measures of association, namely the Spearman
and Kendall's tau correlation coefficients. The parameter ρ is defined in
terms of the covariance between X and Y , where covariance is a measure of
the manner in which X and Y vary together.

Defn: Covariance
Let X and Y be random variables with means µX and µY respectively. The
covariance between X and Y , denoted Cov(X, Y ) is given by:

Cov(X, Y ) = E[(X − µX)(Y − µY)]
           = E(XY ) − E(X)E(Y )
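The equivalence of the definitional form and the shortcut form can be checked numerically. The data below are made up purely for illustration, and the plug-in (sample) means stand in for the theoretical expectations:

```python
# Illustrating Cov(X, Y) = E[(X - muX)(Y - muY)] = E(XY) - E(X)E(Y)
# using plug-in (sample) means on a small made-up data set.
x = [6.0, 9.0, 2.0, 5.0, 8.0]
y = [25.0, 36.0, 12.0, 23.0, 32.0]
n = len(x)

mean_x = sum(x) / n
mean_y = sum(y) / n

# Definition: average of the cross-deviations
cov_def = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n

# Shortcut: E(XY) - E(X)E(Y)
cov_short = sum(xi * yi for xi, yi in zip(x, y)) / n - mean_x * mean_y

print(cov_def, cov_short)  # the two forms agree
```

Here small x goes with small y and large with large, so the covariance comes out positive, as the note below describes.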

Note

1. If small values of X tend to be associated with small values of Y and
large values of X with large values of Y , this implies positive covariance.

2. If the reverse to (1) above is true, that is, small values of X tend to be
associated with large values of Y and vice-versa then we have negative
covariance.

Covariance is unbounded, that is, it can assume any real value. Also, its
magnitude depends on the units of X and Y , so it is not directly
interpretable. To correct this problem, we divide the covariance by
√(Var(X)Var(Y )) to form Pearson's correlation coefficient:

ρ = Cov(X, Y ) / √(Var(X)Var(Y ))

ρ lies between -1 and 1 inclusive.

Note
1. ρ = 1 implies a perfect positive correlation between X and Y .
2. ρ = −1 implies a perfect negative correlation between X and Y .
3. ρ = 0 implies no correlation, that is, X and Y are uncorrelated. This
simply tells us that there is no linear association between X and Y . It
does not mean that X and Y are unrelated: if a relationship exists between
the variables, it is not linear, for example a quadratic relationship.

Estimating ρ (Pearson’s Correlation Coefficient)


Cov(X, Y ) and ρ are theoretical (population) parameters. Neither can be
calculated without knowledge of the probability distribution of the pair of
variables (X, Y ).
E(XY ), E(X) and E(Y ) can be estimated by replacing each theoretical mean
by its sample counterpart. Thus we estimate E(XY ) by Σ xi yi /n, E(X) by
Σ xi /n and E(Y ) by Σ yi /n, where all sums run over i = 1, . . . , n.

Since Cov(X, Y ) = E(XY ) − E(X)E(Y ), by substitution the estimated
covariance becomes

Cov(X, Y ) = Σ xi yi /n − (Σ xi /n)(Σ yi /n)
           = [n Σ xi yi − (Σ xi)(Σ yi)] / n²

Similarly, using Var(X) = E(X²) − E²(X),

Var(X) = Σ xi² /n − (Σ xi /n)² = [n Σ xi² − (Σ xi)²] / n²
Var(Y ) = Σ yi² /n − (Σ yi /n)² = [n Σ yi² − (Σ yi)²] / n²

Thus

ρ̂ = r = [n Σ xi yi − (Σ xi)(Σ yi)] / √{[n Σ xi² − (Σ xi)²][n Σ yi² − (Σ yi)²]}

(the factors of n² in the numerator and denominator cancel).

Interpretation of r is the same as the interpretation of ρ, that is,

r ranges from −1 to 1
r = 1 implies perfect positive correlation
r = −1 implies perfect negative correlation
r = 0 implies no linear association

Since r lies between −1 and 1, the following are guidelines on the
interpretation of r:
0.7 ≤ |r| ≤ 1 implies strong or high correlation
0.5 ≤ |r| < 0.7 implies moderate correlation
0 < |r| < 0.5 implies weak correlation

Example 1.1 The following data set relates maize usage to the number
of animals on 10 farms surveyed.

Maize Used (Y ) Number of Animals (X)


25 6
36 9
12 2
23 5
20 5
29 7
27 6
32 8
18 3
28 7

Calculate the Pearson’s product moment correlation coefficient (r) and in-
terpret your result.

Solution

r = [n Σ xi yi − (Σ xi)(Σ yi)] / √{[n Σ xi² − (Σ xi)²][n Σ yi² − (Σ yi)²]}

From the table, the sample statistics are: n = 10, Σ xi = 58, Σ yi = 250,
Σ xi yi = 1584, Σ xi² = 378, Σ yi² = 6696.

Therefore r is given by

r = [10(1584) − (58)(250)] / √{(10(378) − 58²)(10(6696) − 250²)}
  = 1340 / √((416)(4460)) = 0.9838 (4 d.p.)

Comment: There is a high positive correlation between the number of
animals (X) and maize used (Y ).
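As a quick check, r can be recomputed directly from the table's columns (a pure-Python sketch; the sums are formed from the printed x and y values):

```python
import math

# Pearson's r computed from the tabulated farm data
# (Y = maize used, X = number of animals).
x = [6, 9, 2, 5, 5, 7, 6, 8, 3, 7]
y = [25, 36, 12, 23, 20, 29, 27, 32, 18, 28]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 4))  # ≈ 0.9838
```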

We can go further and test whether Pearson's correlation coefficient is
insignificant (zero) or significant (not zero). We do this by comparing the
calculated Pearson's correlation coefficient value, r, with the tabulated
critical value (correlation coefficient), that is, Table 13 in the Department
of Statistics' Statistical Tables.

Example 1.2 Using Example 1.1, test the significance of the calculated
Pearson’s correlation coefficient.

Solution
H0 : ρ = 0
H1 : ρ ≠ 0

Test statistic: r

Rejection criteria: Testing at the 5% level of significance, we reject H0 if
|r| > rcrit. In this case, n = 10, hence degrees of freedom v = n − 1 = 9,
thus rcrit = 0.602.

Test statistic: r = 0.9838

Conclusion: Since r > 0.602, we reject H0 and conclude that at the 5%
significance level Pearson's correlation coefficient is significant, that is,
not zero; hence maize usage is indeed related to the number of animals on
the farm.
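The notes test significance by comparing r with a tabulated critical value. An equivalent, widely used check (assuming bivariate normality) converts r to a t statistic with n − 2 degrees of freedom; a sketch, with the two-sided 5% critical value t₀.₀₂₅(8) = 2.306 taken from standard t tables:

```python
import math

# t-based check of H0: rho = 0 (an alternative to the notes' table lookup).
# Under H0 and bivariate normality, t = r * sqrt(n - 2) / sqrt(1 - r^2)
# follows a t distribution with n - 2 degrees of freedom.
def r_to_t(r, n):
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

r, n = 0.9838, 10               # Pearson's r computed from the farm data table
t = r_to_t(r, n)
t_crit = 2.306                  # two-sided 5% critical value of t(8)
print(round(t, 2), t > t_crit)  # a very large t, so H0 is rejected
```

Both routes lead to the same conclusion: the correlation is significantly different from zero.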

The other important measure of association is the Spearman's rank
correlation coefficient.

Spearman’s Rank Correlation Coefficient, rs


rs = 1 − 6 Σ di² / (n(n² − 1))

where di = pi − qi , pi = rank of xi and qi = rank of yi

Manual Procedure

1. Order all n values of xi by size and assign rank numbers pi .

2. Order all n values of yi by size and assign rank numbers qi .

3. Replace the xi , yi values in the original pairs of values by the rank
numbers pi , qi .

4. Calculate the differences, di = pi − qi , within the pairs of (3).

5. Obtain the squares di² of the di from (4).

6. Obtain the sum, Σ di², of the squares from (5).

7. Establish the rs value using the formula

rs = 1 − 6 Σ di² / (n(n² − 1))

Interpretation of rs is the same as the Pearson’s correlation coefficient.

Example 1.3 Calculate the Spearman's correlation coefficient between maize
used (Y ) and number of animals (X) from Example 1.1.

rs = 1 − 6 Σ di² / (n(n² − 1))

Maize Used (Y )  Rank of Y (qi )  Number of Animals (X)  Rank of X (pi )  di = pi − qi

25    5    6   5.5    0.5
36   10    9   10     0
12    1    2   1      0
23    4    5   3.5   −0.5
20    3    5   3.5    0.5
29    8    7   7.5   −0.5
27    6    6   5.5   −0.5
32    9    8   9      0
18    2    3   2      0
28    7    7   7.5    0.5

Σ di² = 1.5. Thus

rs = 1 − 6(1.5)/(10(10² − 1)) = 1 − 0.0091 = 0.9909
Comment: High positive correlation
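The manual procedure above can be sketched in Python. The `average_ranks` helper is ours, not from the notes; it implements the usual convention of giving tied observations the average of the ranks they occupy (so the two farms with 5 animals share rank 3.5, and so on):

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over any run of tied values.
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j + 2) / 2          # positions i..j carry ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

x = [6, 9, 2, 5, 5, 7, 6, 8, 3, 7]            # number of animals
y = [25, 36, 12, 23, 20, 29, 27, 32, 18, 28]  # maize used

p = average_ranks(x)                           # ranks of X
q = average_ranks(y)                           # ranks of Y
d2 = sum((pi - qi) ** 2 for pi, qi in zip(p, q))
n = len(x)
rs = 1 - 6 * d2 / (n * (n ** 2 - 1))
print(d2, round(rs, 4))  # 1.5 and 0.9909
```

This reproduces Σ di² = 1.5 and rs = 0.9909 from the worked example.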

We can also test whether the Spearman's correlation coefficient is
significant or not. We do this by comparing the calculated Spearman's
correlation value, rs , with the tabulated critical value (Spearman's rank
correlation coefficient, Table 14 in the Department of Statistics'
Statistical Tables).

Example 1.4 Test the significance of the calculated Spearman's correlation
coefficient in Example 1.3.

Solution
H0 : ρs = 0
H1 : ρs ≠ 0

Test statistic: rs

Rejection criteria: We reject H0 if |rs | > rcrit . In our case n = 10 and,
testing at α = 0.05, we reject H0 if |rs | > 0.648

Test statistic: rs = 0.9909

Conclusion: Since rs > 0.648, we reject H0 and conclude that the corre-
lation between maize usage and number of animals is significant.

Correlation Matrix
Correlation is defined for two variables; however, in some cases we have many
variables. Suppose that we have p variables, X1 , X2 , ..., Xp . Then one can
express the different correlation coefficients in a matrix, known as a
correlation matrix, as follows:
 
    ⎡ 1       ρx1x2   . . .  ρx1xp ⎤
    ⎢ ρx2x1   1       . . .  ρx2xp ⎥
ρ = ⎢   .       .       .      .   ⎥
    ⎣ ρxpx1   ρxpx2   . . .  1     ⎦
ρxi xj for i, j = 1, 2, ..., p, represents the correlation between variables Xi and
Xj . We can tell which variables are correlated by examining the correlation
matrix.

Note: The correlation between Xk and itself is always 1.

The sample correlation matrix is given by

    ⎡ 1       rx1x2   . . .  rx1xp ⎤
    ⎢ rx2x1   1       . . .  rx2xp ⎥
R = ⎢   .       .       .      .   ⎥
    ⎣ rxpx1   rxpx2   . . .  1     ⎦
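Building R from raw data amounts to computing the pairwise Pearson r for every pair of columns; a small pure-Python sketch (the three variables below are made up for illustration):

```python
import math

def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sx2 = sum(a * a for a in x)
    sy2 = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt(
        (n * sx2 - sx ** 2) * (n * sy2 - sy ** 2)
    )

def correlation_matrix(columns):
    """Build the p x p sample correlation matrix R from p variables."""
    p = len(columns)
    return [[pearson_r(columns[i], columns[j]) for j in range(p)]
            for i in range(p)]

# Three made-up variables (columns), n = 5 observations each.
data = [
    [1.0, 2.0, 3.0, 4.0, 5.0],
    [2.1, 3.9, 6.2, 8.1, 9.8],
    [9.0, 7.2, 5.1, 2.8, 1.1],
]
R = correlation_matrix(data)
# The diagonal entries are 1 and R is symmetric, as the definition requires.
```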

Activity 1.1
1. In the field of organizational psychology, extensive study has been made
of different leadership styles. One researcher refers to two extremes as
authoritarian versus democratic; another refers to task-oriented versus
people-oriented; yet others have their own labels for these qualities.
Whatever the label, do these different styles affect the morale of the
subordinates? To address this issue, a researcher established a ranking
scale for worker morale, based on interviews, and grouped the workers
into low, acceptable and high morale categories. These were cross-
classified against the leadership style of the supervisor. The following
contingency table summarizes the results

LEADERSHIP STYLE
WORKER MORALE Authoritarian Democratic
Low 10 5
Acceptable 8 12
High 6 9

Analyze these data to determine if different leadership styles affect the


morale of the subordinates.
2. What do you understand by the terms Regression and Correlation?
3. The table below gives observations on mathematics achievement test
score (xi ) and calculus grades (yi ) for ten independently selected college
freshmen.
xi 39 43 21 64 57 47 28 75 34 52
yi 65 78 52 82 92 89 73 98 56 75

(a) Draw a scatterplot for these data.


(b) On the basis of the scatterplot, is there an association between X
and Y ?
(c) Calculate Pearson’s correlation coefficient and comment.
(d) Calculate Spearman’s Correlation coefficient for the above data.
Interpret your results.

(e) Test the significance of Pearson's and Spearman's correlation
coefficients at the α = 0.01 level of significance.

