
Linear regression and correlation

Simple linear regression and correlation

 Data are frequently given in pairs where one variable is dependent on the other.
E.g.
1. Weight and height
2. House rent and income
3. Yield and fertilizer
4. Systolic blood pressure (SBP) and body mass
index
The linear regression model assumes that there is a linear, or
"straight line," relationship between the dependent variable and
each predictor.
linear regression and correlation cont’d..

 Linear regression is used to model the value of a dependent scale variable based on its linear relationship to one or more predictors.
 It is usually desirable to express their relationship by finding an appropriate mathematical equation.
 To form the equation, collect the data on these two variables (dependent and independent).
linear regression and correlation cont’d..

A) Simple linear regression


 The scatter diagram helps to choose the curve that
best fits the data. The simplest type of curve is a
straight line whose equation is given by:
Ŷ = α + boXi
This equation is a point estimate of:
Y = α + βXi
– bo = the sample regression coefficient of Y on X.
– β = the population regression coefficient of Y on X.
 Y on X means Y is the dependent variable and X is
the independent one.
linear regression and correlation cont’d..

 The model is linear because increasing the value of the X predictor by 1 unit increases the value of the dependent variable by bo units. Note that α is the intercept, the model-predicted value of the dependent variable when the value of every predictor is equal to 0.

linear regression and correlation cont’d...

 Regression is a method of estimating the numerical


relationship between variables.
– For example, we would like to know the mean or expected weight of factory workers of a given height, and what increase in weight is associated with a unit increase in height.
 The purpose of a regression equation is to use one
variable to predict another.

 How is the regression equation determined?


linear regression and correlation cont’d...

If we want to investigate the nature of this relationship, we need to do three things:
– Make sure that the relationship is linear.
– Find a way to determine the equation linking the two variables, i.e. get the values of α and bo
(α = constant, bo = regression coefficient).
– See if the relationship is statistically significant, i.e. whether it is present in the population.
linear regression and correlation cont’d...

 Is the relationship linear?


– One way of investigating the linearity of the
relationship is to examine the scatter plot, such as
that in Figure 1.
– The points in the scatter plot seem to cluster along
a straight line (shown dotted). This suggests a
linear relationship between BMI and HIP. So far,
so good
– We can write the equation of this straight line as:
BMI = α + bo*HIP
linear regression and correlation cont’d...

Figure 1: A scatter plot of body mass index against hip circumference, for a sample of 412 women in a diet and health cohort study. The scatter of values appears to be distributed around a straight line; that is, the relationship between these two variables appears to be broadly linear. The residual or error term, e, is marked for one subject.
linear regression and correlation cont’d...

Figure 2: Scatter plot indicating the relationship between the heights of oldest sons and their fathers' heights
linear regression and correlation cont’d...

BMI = α + bo*HIP

This equation is known as the simple regression equation. (Why?) The variable on the left-hand side of the equation, BMI, is known as the outcome, response or dependent variable.

The dependent variable must be metric (scale). The equation gives us the mean value of BMI for any specified HIP measurement. In other words, it would tell us (if we knew α and bo) what the mean body mass index would be for all those women with some particular HIP measurement.
linear regression and correlation cont’d...

BMI = α + bo*HIP

The variable on the right-hand side of the equation, HIP, is known as the predictor, explanatory or independent variable, or the covariate. The independent variable can be of any type: nominal, ordinal or metric. This is the variable that's doing the 'causing'. It is changes in hip circumference that cause body mass index to change in response, but not the other way round.
linear regression and correlation cont’d...

 If the independent variable is categorical, it needs to be recoded to binary (dummy) variables or other types of contrast variables.
 Basically we have four ways of recoding categorical variables for linear regression:
– Dummy coding (the common and mostly
used),
– Effects coding,
– Orthogonal coding, and
– Criterion coding (also known as criterion
scaling).
Recoding of categorical variables to binary
(dummy) variables
 Dummy coding: is used when a researcher wants to
compare other groups of the predictor variable with
one specific group of the predictor variable.
 Often, this specific group is called the reference
group or category.
 It is important to note that dummy coding can be
used with two or more categories.
Recoding cont’d…

 Dummy coding in regression is analogous to simple


independent t-testing or one-way Analysis of
Variance (ANOVA) procedures in that dummy coding
allows the researcher to explore mean differences
by comparing the groups of the categorical variable.

 In order to do this with regression, we must separate


our predictor variable groups in a manner that allows
them to be entered into the regression.
Recoding cont’d…

 For the demonstration of dummy coding, a fictional data set was created consisting of one continuous outcome variable (DV_score), one categorical predictor variable (IV_group), and 15 cases (see Table 1).
 The predictor variable contains three groups: experimental/treatment 1 (value label 1), experimental/treatment 2 (value label 2), and control (value label 3).
Table 1: Initial data
Case DV_score IV_group
1 1 1
2 3 1
3 5 1
4 7 1
5 9 1
6 8 2
7 10 2
8 12 2
9 14 2
10 16 2
11 22 3
12 24 3
13 26 3
14 28 3
15 30 3
Table 2:Dummy coding example data:
Case DV_score IV_group Dummy 1 Dummy 2
1 1 1 1 0
2 3 1 1 0
3 5 1 1 0
4 7 1 1 0
5 9 1 1 0
6 8 2 0 1
7 10 2 0 1
8 12 2 0 1
9 14 2 0 1
10 16 2 0 1
11 22 3 0 0
12 24 3 0 0
13 26 3 0 0
14 28 3 0 0
15 30 3 0 0
Table 2:Dummy coding cont’d…

 To accomplish this, we would create two new 'dummy' variables in our data set, labeled dummy 1 and dummy 2 (see Table 2).

 To represent membership in a group, each case is coded 1 on the dummy variable for its own group and 0 on all other dummy variables.
Table 2:Dummy coding cont’d…

 When creating dummy variables, it is only necessary


to create k – 1 dummy variables where k indicates
the number of categories of the predictor variable.
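
As a minimal illustration (a sketch in Python with pandas, not part of the original example output), the two dummy variables of Table 2 can be created from IV_group, leaving the control group (value 3) as the reference category:

import pandas as pd

# Fictional data from Table 1: 15 cases, groups 1 and 2 = treatments, 3 = control
data = pd.DataFrame({
    "DV_score": [1, 3, 5, 7, 9, 8, 10, 12, 14, 16, 22, 24, 26, 28, 30],
    "IV_group": [1]*5 + [2]*5 + [3]*5,
})

# k - 1 = 2 dummy variables; group 3 (control) is the reference category
data["dummy1"] = (data["IV_group"] == 1).astype(int)
data["dummy2"] = (data["IV_group"] == 2).astype(int)

print(data)  # reproduces the coding shown in Table 2

pandas.get_dummies(data["IV_group"], drop_first=True) would give an equivalent k − 1 coding, although it drops the first category (treatment 1) rather than the control group as the reference.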
Table 2:Dummy coding cont’d…

Choosing a reference category:
 The control group represents a lack of treatment and therefore is easily identifiable as the reference category.
 The reference category should have some clear distinction. However, much research is done without a control group. In those instances, the choice of reference category is generally arbitrary, but some scholars (e.g., Garson, 2006) have suggested guidelines for choosing the reference category:
Table 2:Dummy coding cont’d…

 First, using categories such as miscellaneous or other is not recommended, because of the lack of specificity in those types of categorizations (Garson).
 Second, the reference category should not be a category with few cases, for obvious reasons related to sample size and error (Garson).
 Third, some researchers choose to use a middle category, because they believe it represents the best choice for comparison, rather than comparing against the extremes.
Table 2:Dummy coding cont’d…
 In the analysis, the predictor variable would not be
entered into the regression and instead the dummy
variables would take its place.

The results indicate a significant model, F(2, 12) =


57.17, p < 0.001. (Table 3)
Table 3: ANOVA(c)
Model                Sum of Squares    df    Mean Square      F        Sig.
1 Regression 653.333 1 653.333 13.923 .003a
Residual 610.000 13 46.923
Total 1263.333 14
2 Regression 1143.333 2 571.667 57.167 .000b
Residual 120.000 12 10.000
Total 1263.333 14
a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
c. Dependent Variable: DV_score
Table 2:Dummy coding cont’d…
 Table 4 provides R, R², and adjusted R²; the final model (model 2) was able to account for about 91% of the variance.

Table 4: Model Summary
Model     R        R Square    Adjusted R Square    Std. Error of the Estimate
1 .719a .517 .480 6.850
2 .951b .905 .889 3.162
a. Predictors: (Constant), dummy 1
b. Predictors: (Constant), dummy 1, dummy 2
Table 2:Dummy coding cont’d…
 Table 5 provides the unstandardized regression coefficients (B), the intercept (constant), and the standardized regression coefficients (β), which we can use to develop the model.
Table 5: Coefficients(a)
Model            B (Unstandardized)   Std. Error   Beta (Standardized)    t       Sig.    95% CI for B (Lower, Upper)
1 (Constant) 19.000 2.166 8.771 .000 14.320 23.680
dummy 1 -14.000 3.752 -.719 -3.731 .003 -22.106 -5.894
2 (Constant) 26.000 1.414 18.385 .000 22.919 29.081
dummy 1 -21.000 2.000 -1.079 -10.500 .000 -25.358 -16.642
dummy 2 -14.000 2.000 -.719 -7.000 .000 -18.358 -9.642
a. Dependent Variable: DV_score
Table 2:Dummy coding cont’d…

 Now, because dummy variables were used to


compare experimental 1 (M = 5.00, SD = 3.16) and
experimental 2 (M = 12.00, SD = 3.16) to the control
(M = 26.00, SD = 3.16), the intercept term is equal to
the mean of the reference category (i.e. the control
group).
Descriptives (DV_score by group)

treatment 1: Mean = 5.00 (Std. Error = 1.414), Std. Deviation = 3.162
treatment 2: Mean = 12.00 (Std. Error = 1.414), Std. Deviation = 3.162
control:     Mean = 26.00 (Std. Error = 1.414), Std. Deviation = 3.162
Table 2:Dummy coding cont’d…

 Each regression coefficient represents the amount of


deviation of the group identified in the dummy
variable from the mean of the reference category .
 So, some simple mathematics allows us to see that the
regression coefficient for dummy 1 (representing
experimental 1) is 5 – 26 = -21. Also, the regression
coefficient for dummy 2 (representing experimental 2)
is 12 – 26 = -14.
 All of this results in the regression equation:
Ŷ= 26.00 + (-21 * dummy 1) + (-14 * dummy 2)
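
The arithmetic above can be checked with a short least-squares sketch (assuming numpy; the data are those of Tables 1 and 2). The fitted intercept equals the control-group mean and the dummy coefficients equal the group-mean differences:

import numpy as np

# DV_score and the dummy variables from Table 2
y = np.array([1, 3, 5, 7, 9, 8, 10, 12, 14, 16, 22, 24, 26, 28, 30], dtype=float)
dummy1 = np.array([1]*5 + [0]*10, dtype=float)
dummy2 = np.array([0]*5 + [1]*5 + [0]*5, dtype=float)

# Design matrix: constant, dummy 1, dummy 2
X = np.column_stack([np.ones_like(y), dummy1, dummy2])

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # approximately [26.0, -21.0, -14.0]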
linear regression and correlation cont’d..

 The method of least squares
– The difference between the given score Y and the predicted score Ŷ is known as the error of estimation. The regression line, or the line which best fits the given pairs of scores, is the line for which the sum of the squares of these errors of estimation (Σei²) is minimized. That is, of all possible curves, the one with minimum Σei² is the least-squares regression curve which best fits the given data.
linear regression and correlation cont’d..

 Estimating α and bo– the method of ordinary least


squares (OLS)
– The second problem is to find a method of getting
the values of the sample coefficients α and bo,
which will give us a line that fits the scatter of
points better than any other line, and which will
then enable us to write down the equation linking
the variables.
linear regression and correlation cont’d...
 The most popular method used for this calculation is
called ordinary least squares, or OLS. This gives us the
values of α and bo, and the straight line that best fits the
sample data.
 Roughly speaking, ‘best’ means the line that is, on
average, closer to all of the points than any other line.
Look at Figure 1.

 e has been shown just for one of the points. If all of these residuals are squared and then added together, to give the term Σe², then the 'best' straight line is the one for which this sum, Σe², is smallest. Hence the name ordinary 'least squares'.
linear regression and correlation cont’d...

(Figure 1 repeated: body mass index against hip circumference for the 412 women, with the residual or error term, e, marked for one subject.)
linear regression and correlation cont’d...

 The least-squares regression line for the set of observations (X1,Y1), (X2,Y2), (X3,Y3), . . ., (Xn,Yn) has the equation:
Ŷ = α + boXi
 The values ‘α’ and ‘bo’ in the equation are constants,
i.e., their values are fixed. The constant ‘α’ indicates
the value of y when x = 0. It is also called the y
intercept. The value of ‘bo’ shows the slope of the
regression line and gives us a measure of the change
in y for a unit change in x.
linear regression and correlation cont’d..
 This slope (bo) is frequently termed the regression coefficient of Y on X. If we know the values of 'α' and 'bo', we can easily compute the value of Ŷ for any given value of X.
 The constants 'α' and 'bo' are determined by solving simultaneously the equations (normal equations):
ΣY = αn + boΣX
ΣXY = αΣX + boΣX²
which give:
α = Ȳ − boX̄
linear regression and correlation cont’d...

n XY   X  Y  ( X  X )(Y  Y )
b = n X  ( X )
2 2 = (X  X ) 2
linear regression and correlation cont’d...
Example 1: Heights of 10 fathers(X) together with their
oldest sons (Y) are given below (in inches). Find the
regression of Y on X.

Father (X) oldest son (Y) product (XY) X²


63 65 4095 3969
64 67 4288 4096
70 69 4830 4900
72 70 5040 5184
65 64 4160 4225
67 68 4556 4489
68 71 4828 4624
66 63 4158 4356
70 70 4900 4900
71 72 5112 5041
Total 676 679 45967 45784
linear regression and correlation cont’d...

n XY   X  Y  ( X  X )(Y  Y )
b = =
n X  ( X )
2 2
(X  X ) 2

 = y  bx
10(45967)  (676 x 679) 459670  459004
b= 10(45784)  (676) 2 = 457840  456976

666
b = = 0.77
864
linear regression and correlation cont’d...
α = 679/10 − 0.77 × 676/10 = 67.9 − 52.05 = 15.85

Therefore, Ŷ = 15.85 + 0.77 X or


Height of oldest son = 15.85 + 0.77*height of
father
The regression coefficient of Y on X (i.e., 0.77) tells us the
change in Y due to a unit change in X.

e.g. Estimate the height of the oldest son for a father’s


height of 70 inches.
Height of oldest son (Ŷ) = 15.85 + 0.77 (70) = 69.75 inches.
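
The same estimates can be reproduced numerically from the Example 1 data (a sketch assuming numpy):

import numpy as np

# Heights (inches) of 10 fathers (X) and their oldest sons (Y) from Example 1
X = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71], dtype=float)
Y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72], dtype=float)
n = len(X)

# Slope and intercept from the least-squares formulas above
b = (n * np.sum(X * Y) - np.sum(X) * np.sum(Y)) / (n * np.sum(X**2) - np.sum(X)**2)
a = np.mean(Y) - b * np.mean(X)

print(round(b, 2), round(a, 2))   # b ≈ 0.77; a ≈ 15.8 (the slides round b to 0.77 first, giving 15.85)
print(round(a + b * 70, 2))       # predicted height of the oldest son for a 70-inch father ≈ 69.75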
linear regression and correlation cont’d...
Explained, unexplained (error), total variations
 If all the points on the scatter diagram fall on the
regression line we could say that the entire variance
of Y is due to variations in X.
– Explained variation = Σ(Ŷ − Ȳ)²
linear regression and correlation cont’d...
 The measure of the scatter of points away from the
regression line gives an idea of the variance in Y that
is not explained with the help of the regression
equation.
– Unexplained variation = Σ(Y - Ŷ)²
 The variation of the Y’s about their mean can also be
computed.
– Total variation = Σ(Y − Ȳ)²
linear regression and correlation cont’d..
 Total variation = Explained variation + unexplained
variation
 The ratio of the explained variation to the total
variation measures how well the linear regression
line fits the given pairs of scores. It is called the
coefficient of determination, and is denoted by r².

r² = explained variation / total variation
linear regression and correlation cont’d..
 The explained variation is never negative and is never
larger than the total variation. Therefore, r² is always
between 0 and 1. If the explained variation equals 0,
r² = 0.
 If r² is known, then r = ±√r²
– The sign of r is the same as the sign of bo from the
regression equation.
 Since r² is between 0 and 1, r is between -1 and +1.
– Thus, r is known as Karl Pearson’s Coefficient of
Linear correlation
linear regression and correlation cont’d..
 Linear Correlation (Karl Pearson’s Coefficient of
Linear correlation) (r):-
– measures the degree of linear correlation
between two variables (e.g. X and Y).
– This correlation coefficient is a pure number, independent of the units in which the variables are expressed.
– It also tells us the direction of the slope of a
regression line (positive or negative).
linear regression and correlation cont’d..

 Population Correlation Coefficient: ρ
 Sample Correlation Coefficient: r
 r is positive if higher values of one variable are associated with higher values of the other variable, and negative if one variable tends to be lower as the other gets higher.
 A correlation of around zero indicates that there is no linear relationship between the values of the two variables.

linear regression and correlation cont’d..

 In essence r is a measure of the scatter of the


points around an underlying linear trend: the
greater the spread of the points the lower
the correlation

linear regression and correlation cont’d..

This line shows a perfect linear relationship between two variables. It is a perfect positive correlation (r = 1).
linear regression and correlation cont’d..

A perfect linear relationship; however, a negative correlation (r = −1).
linear regression and correlation cont’d..

A weak positive correlation (r might be around 0.40).
linear regression and correlation cont’d...

No linear association between the variables (r ≈ 0).
linear regression and correlation cont’d...

 Strength of relationship
– Correlations from 0 to 0.25 (or 0 to –0.25) indicate little or no relationship;
– those from 0.25 to 0.50 (or –0.25 to –0.50) indicate a fair degree of relationship;
– those from 0.50 to 0.75 (or –0.50 to –0.75) a moderate to good relationship; and
– those greater than 0.75 (or –0.75 to –1.00) indicate a very good to excellent relationship.
linear regression and correlation cont’d...

 The absolute value of the correlation coefficient


indicates the strength, with larger absolute values
indicating stronger relationships.
linear regression and correlation cont’d...

 Significance Test for Pearson Correlation


– H0: ρ = 0
– HA : ρ ≠ 0

tcal = r√(n − 2) / √(1 − r²)

with n − 2 degrees of freedom
linear regression and correlation cont’d..
 Its formula is:
r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}

 Properties
– −1 ≤ r ≤ 1
– r is a pure number without unit
– If r is close to +1 → a strong positive relationship
– If r is close to −1 → a strong negative relationship
– If r = 0 → no linear correlation
linear regression and correlation cont’d..

 Determine the value of 'r' for the scores in example 1 above:
r = 0.7776 ≈ 0.78
 For Pearson's correlation coefficient to be appropriately used, both variables must be metric continuous and also approximately Normally distributed.
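
For the same Example 1 data, Pearson's r and the corresponding t statistic can be checked with a short sketch (assuming scipy is available):

import numpy as np
from scipy import stats

# Father and son heights from Example 1
X = np.array([63, 64, 70, 72, 65, 67, 68, 66, 70, 71], dtype=float)
Y = np.array([65, 67, 69, 70, 64, 68, 71, 63, 70, 72], dtype=float)
n = len(X)

r, p_value = stats.pearsonr(X, Y)
print(round(r, 4))   # ≈ 0.7776, i.e. about 0.78

# Equivalent t statistic with n - 2 degrees of freedom
t = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)
print(round(t, 2), round(p_value, 4))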
linear regression and correlation cont’d..

 Assumptions in correlation
– The assumptions needed to make inferences about the correlation coefficient are that the sample was randomly selected and that the two variables, X and Y, vary together in a joint distribution that is normally distributed (called the bivariate normal distribution).
linear regression and correlation cont’d...
 Spearman’s rank correlation coefficient
– If either (or both) of the variables is ordinal, then
Spearman’s rank correlation coefficient (usually
denoted ρs in the population and r in the sample)
is appropriate. (if there is extreme value)
– This is a non-parametric measure.
– As with Pearson’s correlation coefficient,
Spearman’s correlation coefficient varies from –1
to +1,
linear regression and correlation cont’d...

 This correlation coefficient is applied to the ranks in two paired samples (not to the original scores).
 The formula for computing the rank correlation is:

rs = 1 − 6Σdi² / [n(n² − 1)]
linear regression and correlation cont’d..

– List the n pairs of ranks, X and Y.
– Find the differences (di) between the ranks.
– Square these differences and add the squares (Σdi²).
– Compute rs.
linear regression and correlation cont’d...
Example: Six paintings were ranked by two judges. Calculate
the rank correlation coefficient.
Painting First judge Second judge di di²
(X) (Y)

A 2 2 0 0
B 1 3 -2 4
C 4 4 0 0
D 5 6 -1 1
E 6 5 1 1
F 3 1 2 4

Σdi² = 10, n = 6.

rs = 1 − 6Σdi² / [n(n² − 1)]

rs = 1 − 6(10) / [6(6² − 1)] = 1 − 60 / (6 × 35) = 1 − 0.29 = 0.71
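
A quick check of this worked example (a sketch assuming scipy is available):

from scipy import stats

# Ranks given to the six paintings by the two judges
judge1 = [2, 1, 4, 5, 6, 3]
judge2 = [2, 3, 4, 6, 5, 1]

rs, p_value = stats.spearmanr(judge1, judge2)
print(round(rs, 2))   # ≈ 0.71

# The same result from the rank-difference formula rs = 1 - 6*Σd² / [n(n² - 1)]
d2 = sum((x - y) ** 2 for x, y in zip(judge1, judge2))
n = len(judge1)
print(round(1 - 6 * d2 / (n * (n**2 - 1)), 2))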
linear regression and correlation cont’d...

 The significance of the association is assessed using a t-test in the same way as described for the Pearson correlation coefficient:

tcal = rs√(n − 2) / √(1 − rs²)

with n − 2 degrees of freedom
Multiple linear regression
– Multivariate analysis refers to the analysis of data
that takes into account a number of explanatory
variables and one outcome variable
simultaneously.
– It allows for the efficient estimation of measures
of association while controlling for a number of
confounding factors.
– All types of multivariate analyses involve the
construction of a mathematical model to
describe the association between independent
and dependent variables.
Multiple linear regression cont’d…
 Multiple linear regression (we often refer to this method as multiple regression) is an extension of the most fundamental model describing the linear relationship between two variables.

 Multiple regression is a statistical technique that is used to measure and describe the function relating two (or more) predictor (independent) variables to a single response (dependent) variable.
Multiple linear regression cont’d…
 Regression equation for a linear relationship:
A linear relationship of n predictor variables,
denoted as:
X1, X2, . . ., Xn
to a single response variable, denoted (Y)
is described by the linear equation involving several
variables.
The general linear equation (model) is:
Y = α + b1X1 + b2X2 + . . . + bnXn

Multiple linear regression cont’d…

• Where:
– The regression coefficients (or b1 . . . bn ) represent
the independent contributions of each explanatory
variable to the prediction of the dependent variable.

– X1 . . . Xn represent the individual’s particular set of


values for the independent variables.
– n shows the number of independent predictor
variables.

Multiple linear regression cont’d…
 Assumptions
1. First of all, as it is evident in the name multiple
linear regression, it is assumed that the relationship
between the dependent variable and each
continuous explanatory variable is linear. We can
examine this assumption for any variable, by
plotting (i.e., by using bivariate scatter plots) the
residuals (the difference between observed values
of the dependent variable and those predicted by
the regression equation) against that variable.

Multiple linear regression cont’d…

 Any curvature in the pattern will indicate that a non-linear relationship is more appropriate; if so, transformation of the explanatory variable or use of an analogous non-parametric method may be considered.
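
A minimal sketch of this residual check (assuming numpy and matplotlib; the data below are made up purely for illustration):

import numpy as np
import matplotlib.pyplot as plt

# Hypothetical explanatory variable and outcome (assumed truly linear here)
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=100)

# Fit a simple linear regression and compute the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# A curved pattern in this plot would suggest a non-linear relationship
plt.scatter(x, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Explanatory variable")
plt.ylabel("Residual")
plt.show()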

Multiple linear regression cont’d…

2. It is assumed in multiple regression that the


residuals should follow a normal distribution and
have the same variability throughout the range.
3. The observations (explanatory variables) are
independent.

Multiple linear regression cont’d…
 Predicted and Residual Scores
– The regression line expresses the best prediction
of the dependent variable (Y), given the
independent variables (X).
– However, nature is rarely (if ever) perfectly
predictable, and usually there is substantial
variation of the observed points around the fitted
regression line.
– The deviation of a particular point from the
regression line (its predicted value) is called the
residual value.
Multiple linear regression cont’d…
 Residual Variance and R-square
– The smaller the variability of the residual values
around the regression line relative to the overall
variability, the better is our prediction.
– For example, if there is no relationship between
the X and Y variables, then the ratio of the
residual variability of the Y variable to the
original variance is equal to 1.0.

Multiple linear regression cont’d…

– If X and Y are perfectly related then there is no


residual variance and the ratio of variance would
be 0.
– In most cases, the ratio would fall somewhere
between these extremes, that is, between 0 and
1.
– One minus this ratio is referred to as R-square or
the coefficient of determination.

Multiple linear regression cont’d…

– This value is immediately interpretable in the


following manner. If we have an R-square of 0.6 then
we know that the variability of the Y values around
the regression line is 1- 0.6 times the original
variance.
– In other words, we have explained 60% of the
original variability, and are left with 40% residual
variability.
– Ideally, we would like to explain most if not all of the
original variability.

Multiple linear regression cont’d…

– The R-square value is an indicator of how well the


model fits the data
– An R-square close to 1.0 indicates that we have
accounted for almost all of the variability with the
variables specified in the model.

Multiple linear regression cont’d…
N.B. A) The sources of variation in regressions are:
i) Due to regression
ii) Residual (about regression)
B) The sum of squares due to regression (SSR)
over the total sum of squares (TSS) is the
proportion of the variability accounted for by the
regression model.
Therefore, the percentage variability accounted for
or explained by the regression is 100 times this
proportion.

Multiple linear regression cont’d…
 Interpreting the multiple Correlation Coefficient (R)
– Customarily, the degree to which two or more
predictors (independent or X variables) are
related to the dependent (Y) variable is expressed
in the multiple correlation coefficient R, which is
the square root of R-square.
– The multiple correlation coefficient R assumes values between 0 and 1. This is because no meaning can be given to the direction of the correlation in the multivariate case. (Why?)

Multiple linear regression cont’d…

– The larger R is, the more closely correlated the


predictor variables are with the outcome
variable.
– When R=1, the variables are perfectly correlated
in the sense that the outcome variable is a linear
combination of the others.
– When the outcome variable is not linearly related
to any of the predictor variables, R will be very
small, but not zero.

Multiple linear regression cont’d…
 Multicollinearity
– This is a common problem in many multivariate
correlation analyses.
– Imagine that you have two predictors (X variables)
of a person's height:
1. weight in pounds and
2. weight in ounces.
Trying to decide which one of the two measures is a
better predictor of height would be rather silly.

Multiple linear regression cont’d…

 Collinearity (or multicollinearity) is the undesirable situation where the correlations among the independent variables are strong.
 Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables.
 When several eigenvalues are close to zero, the variables are highly intercorrelated and small changes in the data values may lead to large changes in the estimates of the coefficients (see the following table).
Multiple linear regression cont’d…
A condition index greater than 15 indicates a possible problem and an index greater
than 30 suggests a serious problem with collinearity
Collinearity Diagnostics(a)
Columns: Model, Dimension, Eigenvalue, Condition Index, and Variance Proportions for (Constant), height of mother (cms)(X2), monthly family income (Birr)(X5), period of gestation (days)(X6), and age of mother (years)(X3)
1 1 1.999 1.000 .00 .00
2 .001 58.071 1.00 1.00
2 1 2.845 1.000 .00 .00 .01
2 .154 4.294 .00 .00 .43
3 .000 104.138 1.00 1.00 .56
3 1 3.829 1.000 .00 .00 .01 .00
2 .170 4.741 .00 .00 .42 .00
3 .000 116.493 .84 .19 .57 .03
4 7.58E-005 224.782 .16 .81 .00 .97
4 1 4.806 1.000 .00 .00 .00 .00 .00
2 .171 5.308 .00 .00 .41 .00 .00
3 .023 14.410 .00 .00 .13 .00 .90
4 .000 132.931 .87 .17 .45 .03 .03
5 7.10E-005 260.214 .13 .83 .01 .96 .06
a. Dependent Variable: birth weight of the child (kgs)(X1)
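
Eigenvalues and condition indices like those in the table can be approximated with a short sketch (assuming numpy; the scaling convention follows the usual Belsley-style diagnostics, and the two predictors below are invented to illustrate near-collinearity):

import numpy as np

def condition_indices(X):
    # Add the constant column and scale every column to unit length,
    # then take the eigenvalues of the cross-products matrix (largest first)
    Xc = np.column_stack([np.ones(len(X)), X])
    Xs = Xc / np.linalg.norm(Xc, axis=0)
    eigvals = np.linalg.eigvalsh(Xs.T @ Xs)[::-1]
    return eigvals, np.sqrt(eigvals[0] / eigvals)

# Hypothetical predictors: x2 is almost an exact copy of x1 (near collinearity)
rng = np.random.default_rng(1)
x1 = rng.normal(50, 5, 200)
x2 = x1 + rng.normal(0, 0.01, 200)

eigvals, ci = condition_indices(np.column_stack([x1, x2]))
print(ci)   # a condition index well above 30 flags a serious collinearity problem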

Multiple linear regression cont’d…

– When there are very many variables involved, it is


often not immediately apparent that this problem
exists, and it may only manifest itself after several
variables have already been entered into the
regression equation.
– Nevertheless, when this problem occurs it means
that at least one of the predictor variables is
(practically) completely redundant with other
predictors.

Multiple linear regression cont’d…

 The Partial Correlations:


– The Partial Correlations procedure computes
partial correlation coefficients that describe the
linear relationship between two variables while
controlling for the effects of one or more
additional variables.

Multiple linear regression cont’d…
 Example:
 A popular radio talk show host has just received the latest
government study on public health care funding and has
uncovered a startling fact: As health care funding
increases, disease rates also increase! Cities that spend
more actually seem to be worse off than cities that spend
less!
 The data in the government report yield a high, positive
correlation between health care funding and disease rates
-- which seems to indicate that people would be much
healthier if the government simply stopped putting money
into health care programs.
Multiple linear regression cont’d…

 But is this really true? It certainly isn't likely that


there's a causal relationship between health care
funding and disease rates. Assuming the numbers are
correct, are there other factors that might create the
appearance of a relationship where none actually
exists? (Health funding Data)
– To obtain partial correlations, from the menus choose:
Analyze > Correlate > Partial
Multiple linear regression cont’d…

Correlations (cells contain zero-order Pearson correlations)

Control variables: none
  Health care funding (amount per 100) vs. reported diseases (rate per 10,000):              r = .737, Sig. (2-tailed) = .000, df = 48
  Health care funding (amount per 100) vs. visits to health care providers (rate per 10,000): r = .964, Sig. (2-tailed) = .000, df = 48
  Reported diseases (rate per 10,000) vs. visits to health care providers (rate per 10,000):  r = .762, Sig. (2-tailed) = .000, df = 48

Control variable: visits to health care providers (rate per 10,000)
  Health care funding (amount per 100) vs. reported diseases (rate per 10,000):              r = .013, Sig. (2-tailed) = .928, df = 47
Multiple linear regression cont’d…

 In this example, the Partial Correlations table shows both the zero-order correlations (correlations without any control variables) of all three variables and the partial correlation of the first two variables controlling for the effects of the third variable.
Multiple linear regression cont’d…

The zero-order correlation between health care funding and disease rates is, indeed, both fairly high (0.737) and statistically significant (p < 0.001). (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…
The partial correlation controlling for the rate of visits to health care providers, however, is negligible (0.013) and not statistically significant (p = 0.928). (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…

 One interpretation of this finding is that the observed


positive "relationship" between health care funding and
disease rates is due to underlying relationships between
each of those variables and the rate of visits to health
care providers:
 Disease rates only appear to increase as health care
funding increases because more people have access to
health care providers when funding increases, and
doctors and hospitals consequently report more
occurrences of diseases since more sick people come to
see them.
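
The partial correlation reported above can also be recovered by hand from the zero-order correlations (a sketch; the small difference from 0.013 comes only from rounding the printed correlations):

import math

# Zero-order correlations from the output above
r_fd = 0.737   # funding vs. reported diseases
r_fv = 0.964   # funding vs. visits to providers
r_dv = 0.762   # diseases vs. visits to providers

# First-order partial correlation of funding and diseases, controlling for visits
r_fd_v = (r_fd - r_fv * r_dv) / math.sqrt((1 - r_fv**2) * (1 - r_dv**2))
print(round(r_fd_v, 3))   # ≈ 0.014, essentially the 0.013 shown in the table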
Multiple linear regression cont’d…

Going back to the zero-order correlations, you can see that both health care funding rates and reported disease rates are highly positively correlated with the control variable, the rate of visits to health care providers. (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…

Removing the effects of this variable reduces the correlation between the other two variables to almost zero. It is even possible that controlling for the effects of some other relevant variables might actually reveal an underlying negative relationship between health care funding and disease rates. (The Correlations table above is repeated on this slide.)
Multiple linear regression cont’d…

 The Partial Correlations procedure is only


appropriate for scale variables.
 If you have categorical (nominal or ordinal) data, use
the Crosstabs procedure. Layer variables in
Crosstabs are similar to control variables in Partial
Correlations.

Multiple linear regression cont’d…
 Linear Regression Variable Selection Methods
 Method selection allows you to specify how
independent variables are entered into the analysis.
Using different methods, you can construct a variety
of regression models from the same set of variables.
– Enter (Regression): A procedure for variable
selection in which all variables in a block are
entered in a single step.

Linear Regression Variable Selection Methods
cont’d…
– Stepwise: At each step, the independent variable
not in the equation which has the smallest
probability of F is entered, if that probability is
sufficiently small. Variables already in the
regression equation are removed if their
probability of F becomes sufficiently large. The
method terminates when no more variables are
eligible for inclusion or removal.
– Remove: A procedure for variable selection in
which all variables in a block are removed in a
single step.
Linear Regression Variable Selection Methods cont’d…

– Backward Elimination: A variable selection


procedure in which all variables are entered into the
equation and then sequentially removed. The
variable with the smallest partial correlation with the
dependent variable is considered first for removal. If
it meets the criterion for elimination, it is removed.
After the first variable is removed, the variable
remaining in the equation with the smallest partial
correlation is considered next. The procedure stops
when there are no variables in the equation that
satisfy the removal criteria.
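
A simplified sketch of backward elimination (assuming statsmodels and a pandas DataFrame df; it removes on coefficient p-values, whereas SPSS uses the partial-correlation/F criterion described above):

import statsmodels.api as sm

def backward_eliminate(df, outcome, predictors, p_remove=0.10):
    # Repeatedly drop the predictor with the largest p-value until all
    # remaining p-values are below p_remove (a simplified criterion)
    remaining = list(predictors)
    while remaining:
        X = sm.add_constant(df[remaining])
        model = sm.OLS(df[outcome], X).fit()
        pvals = model.pvalues.drop("const")
        worst = pvals.idxmax()
        if pvals[worst] < p_remove:
            return model              # every remaining predictor meets the criterion
        remaining.remove(worst)       # eliminate the weakest predictor and refit
    return None

# Hypothetical call, e.g.: backward_eliminate(df, "X1", ["X2", "X3", "X4", "X5", "X6"])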
Linear Regression Variable Selection Methods
cont’d…
– Forward Selection: A stepwise variable selection
procedure in which variables are sequentially
entered into the model. The first variable considered
for entry into the equation is the one with the largest
positive or negative correlation with the dependent
variable. This variable is entered into the equation
only if it satisfies the criterion for entry. If the first
variable is entered, the independent variable not in
the equation that has the largest partial correlation is
considered next. The procedure stops when there are
no variables that meet the entry criterion.
Multiple linear regression cont’d…

 Example on multiple regression


– The data for multiple regression were taken from a
survey of women attending an antenatal clinic.

The objectives of the study were to identify the


factors responsible for low birth weight and to
predict women 'at risk' of having a low birth
weight baby.

Multiple linear regression cont’d…

 Notations:
BW = Birth weight (kgs) of the child =X1
HEIGHT = Height of mother (cms) = X2
AGEMOTH = Age of mother (years) = X3
AGEFATH = Age of father (years) = X4
FAMINC = Monthly family income (Birr) = X5
GESTAT = Period of gestation (days) = X6
Multiple linear regression cont’d…
 Answer the following questions based on the above
data
1. Check the association of each predictor with the
dependent variable.
2. Fit the full regression model
3. Fit the condensed regression model
4. What do you understand from your answers in parts 1,
2 and 3 ?

Multiple linear regression cont’d…
5. What is the proportion of variability accounted for
by the regression?
6. Compute the multiple correlation coefficient
7. Predict the birth weight of a baby born alive from a
woman aged 30 years and with the following
additional characteristics;
– height of mother =170 cm
– age of father =40 years
– monthly family income = 600 Birr
– period of gestation = 275 days

Multiple linear regression cont’d…
8. Estimate the birth weight of a baby born alive from
a woman with the same characteristics as in “7"
but with a mother's age of 49 years.
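
As an illustration of questions 2 and 5 to 7 (a sketch assuming statsmodels and that the antenatal survey data are available in a pandas DataFrame with the column names defined in the Notations slide; the file name below is hypothetical, since the data themselves are not included in these slides):

import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file holding the antenatal clinic survey data
anc = pd.read_csv("anc_survey.csv")   # columns: BW, HEIGHT, AGEMOTH, AGEFATH, FAMINC, GESTAT

# Question 2: fit the full regression model
full = smf.ols("BW ~ HEIGHT + AGEMOTH + AGEFATH + FAMINC + GESTAT", data=anc).fit()
print(full.summary())

# Questions 5 and 6: proportion of variability explained (R²) and the multiple correlation R
print(full.rsquared, full.rsquared ** 0.5)

# Question 7: predicted birth weight for the stated characteristics
new_case = pd.DataFrame({"HEIGHT": [170], "AGEMOTH": [30], "AGEFATH": [40],
                         "FAMINC": [600], "GESTAT": [275]})
print(full.predict(new_case))

For question 8, the same prediction would be repeated with AGEMOTH set to 49.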

Thank you!
