
Linear regression and correlation

Simple linear regression and correlation

 Data are frequently given in pairs where one variable is dependent on the other.
E.g.
1. Weight and height
2. House rent and income
3. Yield and fertilizer
4. Systolic blood pressure (SBP) and body mass
index
The linear regression model assumes that there is a linear, or
"straight line," relationship between the dependent variable and
each predictor.
linear regression and correlation cont’d..

 Linear regression is used to model the value of a dependent scale variable based on its linear relationship to one or more predictors.
 It is usually desirable to express their relationship by finding an appropriate mathematical equation.
 To form the equation, collect the data on these two variables (dependent and independent).
linear regression and correlation cont’d..

A) Simple linear regression


 The scatter diagram helps to choose the curve that
best fits the data. The simplest type of curve is a
straight line whose equation is given by:
Ŷ = α + boXi
This equation is a point estimate of:
Y = α + βXi
– bo = the sample regression coefficient of Y on X.
– β = the population regression coefficient of Y on X.
 Y on X means Y is the dependent variable and X is
the independent one.
linear regression and correlation cont’d..

 The model is linear because increasing the value of the X predictor by 1 unit increases the value of the dependent variable by bo units. Note that α is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.

linear regression and correlation cont’d...

 Regression is a method of estimating the numerical


relationship between variables.
– For example, we would like to know the mean or expected weight of factory workers of a given height, and what increase in weight is associated with a unit increase in height.
 The purpose of a regression equation is to use one
variable to predict another.

 How is the regression equation determined?


linear regression and correlation cont’d...

If we want to investigate the nature of this relationship, we need to do three things:
– Make sure that the relationship is linear.
– Find a way to determine the equation linking the two variables, i.e. get the values of α and bo
(α = constant, bo = regression coefficient).
– See if the relationship is statistically significant, i.e. whether it is present in the population.
linear regression and correlation cont’d...

 Is the relationship linear?


– One way of investigating the linearity of the
relationship is to examine the scatter plot, such as
that in Figure 1.
– The points in the scatter plot seem to cluster along
a straight line (shown dotted). This suggests a
linear relationship between BMI and HIP. So far,
so good
– We can write the equation of this straight line as:
BMI = α + bo*HIP
linear regression and correlation cont’d...

Figure 1: A scatter plot of body mass index against hip circumference, for a sample of 412 women in a diet and health cohort study. The scatter of values appears to be distributed around a straight line; that is, the relationship between these two variables appears to be broadly linear. The residual or error term, e, is marked for one subject.
linear regression and correlation cont’d...

Figure 2: Scatter plot indicating the relationship between the heights of oldest sons and their fathers' heights
linear regression and correlation cont’d...

BMI = α + bo*HIP

This equation is known as the simple regression equation. (Why?) The variable on the left-hand side of the equation, BMI, is known as the outcome, response or dependent variable.

The dependent variable must be metric (scale). The equation gives us the mean value of BMI for any specified HIP measurement. In other words, it would tell us (if we knew α and bo) what the mean body mass index would be for all those women with some particular HIP measurement.
linear regression and correlation cont’d...

BMI = α + bo*HIP

The variable on the right-hand side of the equation, HIP, is known as the predictor, explanatory or independent variable, or the covariate. The independent variable can be of any type: nominal, ordinal or metric. This is the variable that's doing the 'causing'. It is changes in hip circumference that cause body mass index to change in response, but not the other way round.
linear regression and correlation cont’d...

 If the independent variable is categorical, it needs to be recoded to binary (dummy) variables or other types of contrast variables.
 Basically we have four ways of recoding categorical variables for linear regression:
– Dummy coding (the common and mostly
used),
– Effects coding,
– Orthogonal coding, and
– Criterion coding (also known as criterion
scaling).
Recoding of categorical variables to binary
(dummy) variables
 Dummy coding: is used when a researcher wants to
compare other groups of the predictor variable with
one specific group of the predictor variable.
 Often, this specific group is called the reference
group or category.
 It is important to note that dummy coding can be
used with two or more categories.
Recoding cont’d…

 Dummy coding in regression is analogous to simple


independent t-testing or one-way Analysis of
Variance (ANOVA) procedures in that dummy coding
allows the researcher to explore mean differences
by comparing the groups of the categorical variable.

 In order to do this with regression, we must separate


our predictor variable groups in a manner that allows
them to be entered into the regression.
Recoding cont’d…

 For the demonstration of dummy coding, a fictional data set was created consisting of one continuous outcome variable (DV_score), one categorical predictor variable (IV_group), and 15 cases (see Table 1).
 The predictor variable contains three groups: experimental/treatment 1 (value label 1), experimental/treatment 2 (value label 2), and control (value label 3).
Table 1: Initial data
Case DV_score IV_group
1 1 1
2 3 1
3 5 1
4 7 1
5 9 1
6 8 2
7 10 2
8 12 2
9 14 2
10 16 2
11 22 3
12 24 3
13 26 3
14 28 3
15 30 3
Table 2:Dummy coding example data:
Case DV_score IV_group Dummy 1 Dummy 2
1 1 1 1 0
2 3 1 1 0
3 5 1 1 0
4 7 1 1 0
5 9 1 1 0
6 8 2 0 1
7 10 2 0 1
8 12 2 0 1
9 14 2 0 1
10 16 2 0 1
11 22 3 0 0
12 24 3 0 0
13 26 3 0 0
14 28 3 0 0
15 30 3 0 0
Table 2:Dummy coding cont’d…

 To accomplish this, we would create two new 'dummy' variables in our data set, labeled dummy 1 and dummy 2 (see Table 2).

 To represent membership in a group, each case is coded 1 on the dummy variable for its own group and 0 on all other dummy variables.
Table 2:Dummy coding cont’d…

 When creating dummy variables, it is only necessary


to create k – 1 dummy variables where k indicates
the number of categories of the predictor variable.
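
As a minimal illustration (a sketch in Python with pandas, not part of the original example output), the two dummy variables of Table 2 can be created from IV_group, leaving the control group (value 3) as the reference category:

import pandas as pd

# Fictional data from Table 1: 15 cases, groups 1 and 2 = treatments, 3 = control
data = pd.DataFrame({
    "DV_score": [1, 3, 5, 7, 9, 8, 10, 12, 14, 16, 22, 24, 26, 28, 30],
    "IV_group": [1]*5 + [2]*5 + [3]*5,
})

# k - 1 = 2 dummy variables; group 3 (control) is the reference category
data["dummy1"] = (data["IV_group"] == 1).astype(int)
data["dummy2"] = (data["IV_group"] == 2).astype(int)

print(data)  # reproduces the coding shown in Table 2

pandas.get_dummies(data["IV_group"], drop_first=True) would give an equivalent k − 1 coding, although it drops the first category (treatment 1) rather than the control group as the reference.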
Table 2:Dummy coding cont’d…

Choosing a reference category:
 The control group represents a lack of treatment and therefore is easily identifiable as the reference category.
 The reference category should have some clear distinction. However, much research is done without a control group. In those instances, the choice of reference category is generally arbitrary, but some scholars (e.g., Garson, 2006) have suggested guidelines for choosing the reference category:
Table 2:Dummy coding cont’d…

 First, using categories such as miscellaneous or other is not recommended, because of the lack of specificity in those types of categorizations (Garson).
 Second, the reference category should not be a category with few cases, for obvious reasons related to sample size and error (Garson).
 Third, some researchers choose to use a middle category, because they believe it represents the best choice for comparison, rather than comparing against the extremes.
Table 2:Dummy coding cont’d…
 In the analysis, the predictor variable would not be
entered into the regression and instead the dummy
variables would take its place.

The results indicate a significant model, F(2, 12) =


57.17, p < 0.001. (Table 3)
Table 3: ANOVA(c)
Model                Sum of Squares    df    Mean Square      F        Sig.
1 Regression 653.333 1 653.333 13.923 .003a
Residual 610.000 13 46.923
Total 1263.333 14
2 Regression 1143.333 2 571.667 57.167 .000b
Residual 120.000 12 10.000
Total 1263.333 14
a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
c. Dependent Variable: DV_score
Table 2:Dummy coding cont’d…
 Table 4 provides R, R², and adjusted R²; the final model (model 2) was able to account for about 91% of the variance.

Table 4: Model Summary
Model     R        R Square    Adjusted R Square    Std. Error of the Estimate
1 .719a .517 .480 6.850
2 .951b .905 .889 3.162
a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
Table 2:Dummy coding cont’d…
 Table 5 provides the unstandardized regression coefficients (B), the intercept (constant), and the standardized regression coefficients (β), which we can use to develop the model.
Table 5: Coefficients(a)
Model            B (Unstandardized)   Std. Error   Beta (Standardized)    t       Sig.    95% CI for B (Lower, Upper)
1 (Constant) 19.000 2.166 8.771 .000 14.320 23.680
dummy 1 -14.000 3.752 -.719 -3.731 .003 -22.106 -5.894
2 (Constant) 26.000 1.414 18.385 .000 22.919 29.081
dummy 1 -21.000 2.000 -1.079 -10.500 .000 -25.358 -16.642
dummy 2 -14.000 2.000 -.719 -7.000 .000 -18.358 -9.642
a. Dependent Variable: DV_score
Table 2:Dummy coding cont’d…

 Now, because dummy variables were used to


compare experimental 1 (M = 5.00, SD = 3.16) and
experimental 2 (M = 12.00, SD = 3.16) to the control
(M = 26.00, SD = 3.16), the intercept term is equal to
the mean of the reference category (i.e. the control
group).
Descriptives (DV_score by group)

treatment 1: Mean = 5.00 (Std. Error = 1.414), Std. Deviation = 3.162
treatment 2: Mean = 12.00 (Std. Error = 1.414), Std. Deviation = 3.162
control:     Mean = 26.00 (Std. Error = 1.414), Std. Deviation = 3.162
Table 2:Dummy coding cont’d…

 Each regression coefficient represents the amount of


deviation of the group identified in the dummy
variable from the mean of the reference category .
 So, some simple mathematics allows us to see that the
regression coefficient for dummy 1 (representing
experimental 1) is 5 – 26 = -21. Also, the regression
coefficient for dummy 2 (representing experimental 2)
is 12 – 26 = -14.
 All of this results in the regression equation:
Ŷ= 26.00 + (-21 * dummy 1) + (-14 * dummy 2)
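
The arithmetic above can be checked with a short least-squares sketch (assuming numpy; the data are those of Tables 1 and 2). The fitted intercept equals the control-group mean and the dummy coefficients equal the group-mean differences:

import numpy as np

# DV_score and the dummy variables from Table 2
y = np.array([1, 3, 5, 7, 9, 8, 10, 12, 14, 16, 22, 24, 26, 28, 30], dtype=float)
dummy1 = np.array([1]*5 + [0]*10, dtype=float)
dummy2 = np.array([0]*5 + [1]*5 + [0]*5, dtype=float)

# Design matrix: constant, dummy 1, dummy 2
X = np.column_stack([np.ones_like(y), dummy1, dummy2])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [26.0, -21.0, -14.0]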
linear regression and correlation cont’d..

 The method of least squares
– The difference between the given score Y and the predicted score Ŷ is known as the error of estimation. The regression line, or the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σei²) is minimized. That is, of all possible curves, the one with minimum Σei² is the least-squares regression curve which best fits the given data.
linear regression and correlation cont’d..

 Estimating α and bo– the method of ordinary least


squares (OLS)
– The second problem is to find a method of getting
the values of the sample coefficients α and bo,
which will give us a line that fits the scatter of
points better than any other line, and which will
then enable us to write down the equation linking
the variables.
linear regression and correlation cont’d...
 The most popular method used for this calculation is
called ordinary least squares, or OLS. This gives us the
values of α and bo, and the straight line that best fits the
sample data.
 Roughly speaking, ‘best’ means the line that is, on
average, closer to all of the points than any other line.
Look at Figure 1.

 e has been shown just for one of the points. If all of these residuals are squared and then added together, to give the term Σe², then the 'best' straight line is the one for which this sum, Σe², is smallest. Hence the name ordinary 'least squares'.
linear regression and correlation cont’d...

(Figure 1 repeated: body mass index against hip circumference for the 412 women, with the residual or error term, e, marked for one subject.)
linear regression and correlation cont’d...

 The least-squares regression line for the set of observations (X1,Y1), (X2,Y2), (X3,Y3), . . ., (Xn,Yn) has the equation:
Ŷ = α + boXi
 The values ‘α’ and ‘bo’ in the equation are constants,
i.e., their values are fixed. The constant ‘α’ indicates
the value of y when x = 0. It is also called the y
intercept. The value of ‘bo’ shows the slope of the
regression line and gives us a measure of the change
in y for a unit change in x.
linear regression and correlation cont’d..
 This slope (bo) is frequently termed the regression coefficient of Y on X. If we know the values of 'α' and 'bo', we can easily compute the value of Ŷ for any given value of X.
 The constants 'α' and 'bo' are determined by solving simultaneously the equations (normal equations):
ΣY = αn + boΣX
ΣXY = αΣX + boΣX²
which give:
α = Ȳ − boX̄
linear regression and correlation cont’d...

n XY   X  Y  ( X  X )(Y  Y )
b = n X  ( X )
2 2 = (X  X ) 2
linear regression and correlation cont’d...
Example 1: Heights of 10 fathers(X) together with their
oldest sons (Y) are given below (in inches). Find the
regression of Y on X.

Father (X) oldest son (Y) product (XY) X²


63 65 4095 3969
64 67 4288 4096
70 69 4830 4900
72 70 5040 5184
65 64 4160 4225
67 68 4556 4489
68 71 4828 4624
66 63 4158 4356
70 70 4900 4900
71 72 5112 5041
Total 676 679 45967 45784
linear regression and correlation cont’d...

n XY   X  Y  ( X  X )(Y  Y )
b = =
n X  ( X )
2 2
(X  X ) 2

 = y  bx
10(45967)  (676 x 679) 459670  459004
b= 10(45784)  (676) 2 = 457840  456976

666
b = = 0.77
864
linear regression and correlation cont’d...
α = 679/10 − 0.77 × 676/10 = 67.9 − 52.05 = 15.85

Therefore, Ŷ = 15.85 + 0.77 X or


Height of oldest son = 15.85 + 0.77*height of
father
The regression coefficient of Y on X (i.e., 0.77) tells us the
change in Y due to a unit change in X.

e.g. Estimate the height of the oldest son for a father’s


height of 70 inches.
Height of oldest son (Ŷ) = 15.85 + 0.77 (70) = 69.75 inches.
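
The same estimates can be reproduced numerically from the Example 1 data (a sketch assuming numpy):

import numpy as np

# Heights (inches) of 10 fathers (X) and their oldest sons (Y) from Example 1
X = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71], dtype=float)
Y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72], dtype=float)
n = len(X)

# Slope and intercept from the least-squares formulas above
b = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (n * np.sum(X**2) - np.sum(X)**2)
a = np.mean(Y) - b * np.mean(X)

print(round(b, 2), round(a, 2))   # b ≈ 0.77; a ≈ 15.8 (the slides round b to 0.77 first, giving 15.85)
print(round(a + b * 70, 2))       # predicted height of the oldest son for a 70-inch father ≈ 69.75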
linear regression and correlation cont’d...
Explained, unexplained (error), total variations
 If all the points on the scatter diagram fall on the
regression line we could say that the entire variance
of Y is due to variations in X.
– Explained variation = Σ(Ŷ − Ȳ)²
linear regression and correlation cont’d...
 The measure of the scatter of points away from the
regression line gives an idea of the variance in Y that
is not explained with the help of the regression
equation.
– Unexplained variation = Σ(Y - Ŷ)²
 The variation of the Y’s about their mean can also be
computed.
– Total variation = Σ(Y − Ȳ)²
linear regression and correlation cont’d..
 Total variation = Explained variation + unexplained
variation
 The ratio of the explained variation to the total
variation measures how well the linear regression
line fits the given pairs of scores. It is called the
coefficient of determination, and is denoted by r².

r² = explained variation / total variation
linear regression and correlation cont’d..
 The explained variation is never negative and is never
larger than the total variation. Therefore, r² is always
between 0 and 1. If the explained variation equals 0,
r² = 0.
 If r² is known, then r = ±√r²
– The sign of r is the same as the sign of bo from the
regression equation.
 Since r² is between 0 and 1, r is between -1 and +1.
– Thus, r is known as Karl Pearson’s Coefficient of
Linear correlation
linear regression and correlation cont’d..
 Linear Correlation (Karl Pearson’s Coefficient of
Linear correlation) (r):-
– measures the degree of linear correlation
between two variables (e.g. X and Y).
– This correlation coefficient is a pure number, independent of the units in which the variables are expressed.
– It also tells us the direction of the slope of a
regression line (positive or negative).
linear regression and correlation cont’d..

 Population Correlation Coefficient: ρ
 Sample Correlation Coefficient: r
 r is positive if higher values of one variable are associated with higher values of the other variable, and negative if one variable tends to be lower as the other gets higher.
 A correlation of around zero indicates that there is no linear relationship between the values of the two variables.

linear regression and correlation cont’d..

 In essence r is a measure of the scatter of the


points around an underlying linear trend: the
greater the spread of the points the lower
the correlation

linear regression and correlation cont’d..

This line shows a perfect linear relationship between two variables. It is a perfect positive correlation (r = 1).
linear regression and correlation cont’d..

A perfect linear relationship; however, a negative correlation (r = −1).
linear regression and correlation cont’d..

A weak positive correlation (r might be around 0.40).
linear regression and correlation cont’d...

No linear association between the variables (r ≈ 0).
linear regression and correlation cont’d...

 Strength of relationship
– Correlations from 0 to 0.25 (or 0 to –0.25) indicate little or no relationship;
– those from 0.25 to 0.50 (or –0.25 to –0.50) indicate a fair degree of relationship;
– those from 0.50 to 0.75 (or –0.50 to –0.75) a moderate to good relationship; and
– those greater than 0.75 (or –0.75 to –1.00) indicate a very good to excellent relationship.
linear regression and correlation cont’d...

 The absolute value of the correlation coefficient


indicates the strength, with larger absolute values
indicating stronger relationships.
linear regression and correlation cont’d...

 Significance Test for Pearson Correlation


– H0: ρ = 0
– HA : ρ ≠ 0

tcal = r√(n − 2) / √(1 − r²)

with n − 2 degrees of freedom
linear regression and correlation cont’d..
 Its formula is:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

 Properties
– −1 ≤ r ≤ 1
– r is a pure number without unit
– If r is close to +1 → a strong positive relationship
– If r is close to −1 → a strong negative relationship
– If r = 0 → no linear correlation
linear regression and correlation cont’d..

 Determine the value of 'r' for the scores in example 1 above:
r = 0.7776 ≈ 0.78
 For Pearson's correlation coefficient to be appropriately used, both variables must be metric continuous and also approximately Normally distributed.
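
For the same Example 1 data, Pearson's r and the corresponding t statistic can be checked with a short sketch (assuming scipy is available):

import numpy as np
from scipy import stats

# Father and son heights from Example 1
X = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71], dtype=float)
Y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72], dtype=float)
n = len(X)

r, p_value = stats.pearsonr(X, Y)
print(round(r, 4))   # ≈ 0.7776, i.e. about 0.78

# Equivalent t statistic with n - 2 degrees of freedom
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(round(t, 2), round(p_value, 4))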
linear regression and correlation cont’d..

 Assumptions in correlation
– The assumptions needed to make inferences about the correlation coefficient are that the sample was randomly selected and that the two variables, X and Y, vary together in a joint distribution that is normally distributed (called the bivariate normal distribution).
linear regression and correlation cont’d...
 Spearman’s rank correlation coefficient
– If either (or both) of the variables is ordinal, then
Spearman’s rank correlation coefficient (usually
denoted ρs in the population and r in the sample)
is appropriate. (if there is extreme value)
– This is a non-parametric measure.
– As with Pearson’s correlation coefficient,
Spearman’s correlation coefficient varies from –1
to +1,
linear regression and correlation cont’d...

 This correlation coefficient is applied to the ranks in two paired samples (not to the original scores).
 The formula for computing the rank correlation is:

rs = 1 − 6Σdi² / [n(n² − 1)]
linear regression and correlation cont’d..

– List the n pairs of ranks, X and Y.
– Find the differences (di) between the ranks.
– Square these differences and add the squares (Σdi²).
– Compute rs.
linear regression and correlation cont’d...
Example: Six paintings were ranked by two judges. Calculate
the rank correlation coefficient.
Painting First judge Second judge di di²
(X) (Y)

A 2 2 0 0
B 1 3 -2 4
C 4 4 0 0
D 5 6 -1 1
E 6 5 1 1
F 3 1 2 4

Σdi² = 10, n = 6.

rs = 1 − 6Σdi² / [n(n² − 1)]

rs = 1 − 6(10) / [6(6² − 1)] = 1 − 60 / (6 × 35) = 1 − 0.29 = 0.71
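
A quick check of this worked example (a sketch assuming scipy is available):

from scipy import stats

# Ranks given to the six paintings by the two judges
judge1 = [2, 1, 4, 5, 6, 3]
judge2 = [2, 3, 4, 6, 5, 1]

rs, p_value = stats.spearmanr(judge1, judge2)
print(round(rs, 2))   # ≈ 0.71

# The same result from the rank-difference formula rs = 1 - 6*Σd² / [n(n² - 1)]
d2 = sum((x - y) ** 2 for x, y in zip(judge1, judge2))
n = len(judge1)
print(round(1 - 6 * d2 / (n * (n**2 - 1)), 2))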
linear regression and correlation cont’d...

 The significance of the association is assessed using a t-test in the same way as described for the Pearson correlation coefficient:

tcal = rs√(n − 2) / √(1 − rs²)

with n − 2 degrees of freedom
Multiple linear regression
– Multivariate analysis refers to the analysis of data
that takes into account a number of explanatory
variables and one outcome variable
simultaneously.
– It allows for the efficient estimation of measures
of association while controlling for a number of
confounding factors.
– All types of multivariate analyses involve the
construction of a mathematical model to
describe the association between independent
and dependent variables.
Multiple linear regression cont’d…
 Multiple linear regression (we often refer to this method as multiple regression) is an extension of the most fundamental model describing the linear relationship between two variables.

 Multiple regression is a statistical technique that is used to measure and describe the function relating two (or more) predictor (independent) variables to a single response (dependent) variable.
Multiple linear regression cont’d…
 Regression equation for a linear relationship:
A linear relationship of n predictor variables,
denoted as:
X1, X2, . . ., Xn
to a single response variable, denoted (Y)
is described by the linear equation involving several
variables.
The general linear equation (model) is:
Y = α + b1X1 + b2X2 + . . . + bnXn

Multiple linear regression cont’d…

• Where:
– The regression coefficients (or b1 . . . bn ) represent
the independent contributions of each explanatory
variable to the prediction of the dependent variable.

– X1 . . . Xn represent the individual’s particular set of


values for the independent variables.
– n shows the number of independent predictor
variables.

Multiple linear regression cont’d…
 Assumptions
1. First of all, as it is evident in the name multiple
linear regression, it is assumed that the relationship
between the dependent variable and each
continuous explanatory variable is linear. We can
examine this assumption for any variable, by
plotting (i.e., by using bivariate scatter plots) the
residuals (the difference between observed values
of the dependent variable and those predicted by
the regression equation) against that variable.

Multiple linear regression cont’d…

 Any curvature in the pattern will indicate that a non-linear relationship is more appropriate; if so, transformation of the explanatory variable or use of an analogous non-parametric method may be considered.
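
A minimal sketch of this residual check (assuming numpy and matplotlib; the data below are made up purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical explanatory variable and outcome (assumed truly linear here)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Fit a simple linear regression and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# A curved pattern in this plot would suggest a non-linear relationship
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Explanatory variable")
plt.ylabel("Residual")
plt.show()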

Multiple linear regression cont’d…

2. It is assumed in multiple regression that the


residuals should follow a normal distribution and
have the same variability throughout the range.
3. The observations (explanatory variables) are
independent.

Multiple linear regression cont’d…
 Predicted and Residual Scores
– The regression line expresses the best prediction
of the dependent variable (Y), given the
independent variables (X).
– However, nature is rarely (if ever) perfectly
predictable, and usually there is substantial
variation of the observed points around the fitted
regression line.
– The deviation of a particular point from the
regression line (its predicted value) is called the
residual value.
Multiple linear regression cont’d…
 Residual Variance and R-square
– The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction.
– For example, if there is no relationship between
the X and Y variables, then the ratio of the
residual variability of the Y variable to the
original variance is equal to 1.0.

Multiple linear regression cont’d…

– If X and Y are perfectly related then there is no


residual variance and the ratio of variance would
be 0.
– In most cases, the ratio would fall somewhere
between these extremes, that is, between 0 and
1.
– One minus this ratio is referred to as R-square or
the coefficient of determination.

Multiple linear regression cont’d…

– This value is immediately interpretable in the


following manner. If we have an R-square of 0.6 then
we know that the variability of the Y values around
the regression line is 1- 0.6 times the original
variance.
– In other words, we have explained 60% of the
original variability, and are left with 40% residual
variability.
– Ideally, we would like to explain most if not all of the
original variability.

Multiple linear regression cont’d…

– The R-square value is an indicator of how well the


model fits the data
– An R-square close to 1.0 indicates that we have
accounted for almost all of the variability with the
variables specified in the model.

Multiple linear regression cont’d…
N.B. A) The sources of variation in regressions are:
i) Due to regression
ii) Residual (about regression)
B) The sum of squares due to regression (SSR)
over the total sum of squares (TSS) is the
proportion of the variability accounted for by the
regression model.
Therefore, the percentage variability accounted for
or explained by the regression is 100 times this
proportion.

Multiple linear regression cont’d…
 Interpreting the multiple Correlation Coefficient (R)
– Customarily, the degree to which two or more
predictors (independent or X variables) are
related to the dependent (Y) variable is expressed
in the multiple correlation coefficient R, which is
the square root of R-square.
– The multiple correlation coefficient R assumes values between 0 and 1. This is because no meaning can be given to the direction of the correlation in the multivariate case. (Why?)

Multiple linear regression cont’d…

– The larger R is, the more closely correlated the


predictor variables are with the outcome
variable.
– When R=1, the variables are perfectly correlated
in the sense that the outcome variable is a linear
combination of the others.
– When the outcome variable is not linearly related
to any of the predictor variables, R will be very
small, but not zero.

Multiple linear regression cont’d…
 Multicollinearity
– This is a common problem in many multivariate
correlation analyses.
– Imagine that you have two predictors (X variables)
of a person's height:
1. weight in pounds and
2. weight in ounces.
Trying to decide which one of the two measures is a
better predictor of height would be rather silly.

Multiple linear regression cont’d…

 Collinearity (or multicollinearity) is the undesirable situation where the correlations among the independent variables are strong.
 Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables.
 When several eigenvalues are close to zero, the variables are highly intercorrelated and small changes in the data values may lead to large changes in the estimates of the coefficients (see the following table).
Multiple linear regression cont’d…
A condition index greater than 15 indicates a possible problem and an index greater
than 30 suggests a serious problem with collinearity
Collinearity Diagnostics(a)
Columns: Model, Dimension, Eigenvalue, Condition Index, and Variance Proportions for (Constant), height of mother (cms)(X2), monthly family income (Birr)(X5), period of gestation (days)(X6), and age of mother (years)(X3)
1 1 1.999 1.000 .00 .00
2 .001 58.071 1.00 1.00
2 1 2.845 1.000 .00 .00 .01
2 .154 4.294 .00 .00 .43
3 .000 104.138 1.00 1.00 .56
3 1 3.829 1.000 .00 .00 .01 .00
2 .170 4.741 .00 .00 .42 .00
3 .000 116.493 .84 .19 .57 .03
4 7.58E-005 224.782 .16 .81 .00 .97
4 1 4.806 1.000 .00 .00 .00 .00 .00
2 .171 5.308 .00 .00 .41 .00 .00
3 .023 14.410 .00 .00 .13 .00 .90
4 .000 132.931 .87 .17 .45 .03 .03
5 7.10E-005 260.214 .13 .83 .01 .96 .06
a. Dependent Variable: birth weight of the child (kgs)(X1)
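
Eigenvalues and condition indices like those in the table can be approximated with a short sketch (assuming numpy; the scaling convention follows the usual Belsley-style diagnostics, and the two predictors below are invented to illustrate near-collinearity):

import numpy as np

def condition_indices(X):
    # Add the constant column and scale every column to unit length,
    # then take the eigenvalues of the cross-products matrix (largest first)
    Xc = np.column_stack([np.ones(len(X)), X])
    Xs = Xc / np.linalg.norm(Xc, axis=0)
    eigvals = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]
    return eigvals, np.sqrt(eigvals[0] / eigvals)

# Hypothetical predictors: x2 is almost an exact copy of x1 (near collinearity)
rng = np.random.default_rng(1)
x1 = rng.normal(50, 5, 200)
x2 = x1 + rng.normal(0, 0.01, 200)

eigvals, ci = condition_indices(np.column_stack([x1, x2]))
print(ci)   # a condition index well above 30 flags a serious collinearity problem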

Multiple linear regression cont’d…

– When there are very many variables involved, it is


often not immediately apparent that this problem
exists, and it may only manifest itself after several
variables have already been entered into the
regression equation.
– Nevertheless, when this problem occurs it means
that at least one of the predictor variables is
(practically) completely redundant with other
predictors.

Multiple linear regression cont’d…

 The Partial Correlations:


– The Partial Correlations procedure computes
partial correlation coefficients that describe the
linear relationship between two variables while
controlling for the effects of one or more
additional variables.

Multiple linear regression cont’d…
 Example:
 A popular radio talk show host has just received the latest
government study on public health care funding and has
uncovered a startling fact: As health care funding
increases, disease rates also increase! Cities that spend
more actually seem to be worse off than cities that spend
less!
 The data in the government report yield a high, positive
correlation between health care funding and disease rates
-- which seems to indicate that people would be much
healthier if the government simply stopped putting money
into health care programs.
Multiple linear regression cont’d…

 But is this really true? It certainly isn't likely that


there's a causal relationship between health care
funding and disease rates. Assuming the numbers are
correct, are there other factors that might create the
appearance of a relationship where none actually
exists? (Health funding Data)
– To obtain partial correlations, from the menus choose:
Analyze > Correlate > Partial
Multiple linear regression cont’d…

Correlations (cells contain zero-order Pearson correlations)

Control variables: none
  Health care funding (amount per 100) vs. reported diseases (rate per 10,000):              r = .737, Sig. (2-tailed) = .000, df = 48
  Health care funding (amount per 100) vs. visits to health care providers (rate per 10,000): r = .964, Sig. (2-tailed) = .000, df = 48
  Reported diseases (rate per 10,000) vs. visits to health care providers (rate per 10,000):  r = .762, Sig. (2-tailed) = .000, df = 48

Control variable: visits to health care providers (rate per 10,000)
  Health care funding (amount per 100) vs. reported diseases (rate per 10,000):              r = .013, Sig. (2-tailed) = .928, df = 47
Multiple linear regression cont’d…

 In this example, the Partial Correlations table shows both the zero-order correlations (correlations without any control variables) of all three variables and the partial correlation of the first two variables controlling for the effects of the third variable.
Multiple linear regression cont’d…

The zero-order correlation between health care funding and disease rates is, indeed, both fairly high (0.737) and statistically significant (p < 0.001). (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…
The partial correlation controlling for the rate of visits to health care providers, however, is negligible (0.013) and not statistically significant (p = 0.928). (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…

 One interpretation of this finding is that the observed


positive "relationship" between health care funding and
disease rates is due to underlying relationships between
each of those variables and the rate of visits to health
care providers:
 Disease rates only appear to increase as health care
funding increases because more people have access to
health care providers when funding increases, and
doctors and hospitals consequently report more
occurrences of diseases since more sick people come to
see them.
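
The partial correlation reported above can also be recovered by hand from the zero-order correlations (a sketch; the small difference from 0.013 comes only from rounding the printed correlations):

import math

# Zero-order correlations from the output above
r_fd = 0.737   # funding vs. reported diseases
r_fv = 0.964   # funding vs. visits to providers
r_dv = 0.762   # diseases vs. visits to providers

# First-order partial correlation of funding and diseases, controlling for visits
r_fd_v = (r_fd - r_fv * r_dv) / math.sqrt((1 - r_fv**2) * (1 - r_dv**2))
print(round(r_fd_v, 3))   # ≈ 0.014, essentially the 0.013 shown in the table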
Multiple linear regression cont’d…

Going back to the zero-order correlations, you can see that both health care funding rates and reported disease rates are highly positively correlated with the control variable, the rate of visits to health care providers. (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…

Removing the effects of this variable reduces the correlation between the other two variables to almost zero. It is even possible that controlling for the effects of some other relevant variables might actually reveal an underlying negative relationship between health care funding and disease rates. (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…

 The Partial Correlations procedure is only


appropriate for scale variables.
 If you have categorical (nominal or ordinal) data, use
the Crosstabs procedure. Layer variables in
Crosstabs are similar to control variables in Partial
Correlations.

Multiple linear regression cont’d…
 Linear Regression Variable Selection Methods
 Method selection allows you to specify how
independent variables are entered into the analysis.
Using different methods, you can construct a variety
of regression models from the same set of variables.
– Enter (Regression): A procedure for variable
selection in which all variables in a block are
entered in a single step.

Linear Regression Variable Selection Methods
cont’d…
– Stepwise: At each step, the independent variable
not in the equation which has the smallest
probability of F is entered, if that probability is
sufficiently small. Variables already in the
regression equation are removed if their
probability of F becomes sufficiently large. The
method terminates when no more variables are
eligible for inclusion or removal.
– Remove: A procedure for variable selection in
which all variables in a block are removed in a
single step.
Linear Regression Variable Selection Methods cont’d…

– Backward Elimination: A variable selection


procedure in which all variables are entered into the
equation and then sequentially removed. The
variable with the smallest partial correlation with the
dependent variable is considered first for removal. If
it meets the criterion for elimination, it is removed.
After the first variable is removed, the variable
remaining in the equation with the smallest partial
correlation is considered next. The procedure stops
when there are no variables in the equation that
satisfy the removal criteria.
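
A simplified sketch of backward elimination (assuming statsmodels and a pandas DataFrame df; it removes on coefficient p-values, whereas SPSS uses the partial-correlation/F criterion described above):

import statsmodels.api as sm

def backward_eliminate(df, outcome, predictors, p_remove=0.10):
    # Repeatedly drop the predictor with the largest p-value until all
    # remaining p-values are below p_remove (a simplified criterion)
    remaining = list(predictors)
    while remaining:
        X = sm.add_constant(df[remaining])
        model = sm.OLS(df[outcome], X).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < p_remove:
            return model              # every remaining predictor meets the criterion
        remaining.remove(worst)       # eliminate the weakest predictor and refit
    return None

# Hypothetical call, e.g.: backward_eliminate(df, "X1", ["X2", "X3", "X4", "X5", "X6"])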
Linear Regression Variable Selection Methods
cont’d…
– Forward Selection: A stepwise variable selection
procedure in which variables are sequentially
entered into the model. The first variable considered
for entry into the equation is the one with the largest
positive or negative correlation with the dependent
variable. This variable is entered into the equation
only if it satisfies the criterion for entry. If the first
variable is entered, the independent variable not in
the equation that has the largest partial correlation is
considered next. The procedure stops when there are
no variables that meet the entry criterion.
Multiple linear regression cont’d…

 Example on multiple regression


– The data for multiple regression were taken from a
survey of women attending an antenatal clinic.

The objectives of the study were to identify the


factors responsible for low birth weight and to
predict women 'at risk' of having a low birth
weight baby.

Multiple linear regression cont’d…

 Notations:
BW = Birth weight (kgs) of the child =X1
HEIGHT = Height of mother (cms) = X2
AGEMOTH = Age of mother (years) = X3
AGEFATH = Age of father (years) = X4
FAMINC = Monthly family income (Birr) = X5
GESTAT = Period of gestation (days) = X6
Multiple linear regression cont’d…
 Answer the following questions based on the above
data
1. Check the association of each predictor with the
dependent variable.
2. Fit the full regression model
3. Fit the condensed regression model
4. What do you understand from your answers in parts 1,
2 and 3 ?

Multiple linear regression cont’d…
5. What is the proportion of variability accounted for
by the regression?
6. Compute the multiple correlation coefficient
7. Predict the birth weight of a baby born alive from a
woman aged 30 years and with the following
additional characteristics;
– height of mother =170 cm
– age of father =40 years
– monthly family income = 600 Birr
– period of gestation = 275 days

Multiple linear regression cont’d…
8. Estimate the birth weight of a baby born alive from
a woman with the same characteristics as in “7"
but with a mother's age of 49 years.
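
As an illustration of questions 2 and 5 to 7 (a sketch assuming statsmodels and that the antenatal survey data are available in a pandas DataFrame with the column names defined in the Notations slide; the file name below is hypothetical, since the data themselves are not included in these slides):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file holding the antenatal clinic survey data
anc = pd.read_csv("anc_survey.csv")   # columns: BW, HEIGHT, AGEMOTH, AGEFATH, FAMINC, GESTAT

# Question 2: fit the full regression model
full = smf.ols("BW ~ HEIGHT + AGEMOTH + AGEFATH + FAMINC + GESTAT", data=anc).fit()
print(full.summary())

# Questions 5 and 6: proportion of variability explained (R²) and the multiple correlation R
print(full.rsquared, full.rsquared ** 0.5)

# Question 7: predicted birth weight for the stated characteristics
new_case = pd.DataFrame({"HEIGHT": [170], "AGEMOTH": [30], "AGEFATH": [40],
                         "FAMINC": [600], "GESTAT": [275]})
print(full.predict(new_case))

For question 8, the same prediction would be repeated with AGEMOTH set to 49.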

Thank you!
