
Divya Krishnamohan

Student ID: 200292988

CORRELATION
& REGRESSION
EXERCISE

Biodiversity & Conservation


Master’s Skills
(BLGY5000)


Correlation and Regression Exercise

Q1 Following is a list of hypothetical examples of the types of analysis for which
one might use each of the methods mentioned (dependent variable is denoted by
an asterisk*; details of variables are within parentheses):

Pearson’s product moment correlation:

Testing for an association between leaf area and root starch concentration in a clonal tree
species (both variables continuous and normally distributed).

Rank correlation (Spearman and/or Kendall)

Testing for an association between the diversity of flowering plant species (ranked) and the
number of visiting pollinator species within the study area (normal distribution of variables
not required).

Linear regression

Testing the relationship between the degree of nuptial shading* (continuous measure,
residuals normally distributed) and the availability of mates in a species of fish (linearly
related to dependent variable).

Logistic regression

Testing the effect of genetic distance on the sex* of sterile offspring (binomial distribution) in
five hybrid species pairs.

Analysis of covariance

Testing the effect of soil permeability (ranked), litter depth and soil type (covariate) on the
rate of ant re-colonisation* (residuals normally distributed).
Q2

Dr. William Kunin collected data on the abundance of 33 different species of
North American ducks and geese as well as the number of chewing lice
(Mallophaga) species recorded on them. The data pertaining to 32 of these
species (excluding species code name “snwgos”, for which the data of concern to
this analysis are missing) will be used in the following analysis.

In order to determine whether the number of Mallophaga species associated with
a duck species is affected by how common the host is, data on the diversity of
Mallophaga, as well as two measures of duck abundance will be used. The two
measures of duck abundance are:

i. Number of sites used in the Christmas Bird Count at which the species
was recorded (abbreviated as CBC circles)
ii. Number seen in the Christmas Bird Count (abbreviated as CBC
number).

a) Testing for a significant relationship between the two measures of host
abundance:
In order to perform a correlation, data have to meet the assumptions of the type
of correlation test being performed. As a rule, a parametric test, such as
Pearson’s product-moment correlation, gives a more powerful result than its non-
parametric counterparts, Kendall’s tau_b/ Spearman’s rank. Pearson’s correlation
assumes a normal distribution of both variables.

A histogram of the data sets CBC number and CBC circles reveals a strong skew
in the case of the former, and a moderate skew in the latter. (Refer Fig. 1)
Usually, a logarithmic transformation is applied to correct strong skews while a
square root transformation is applied to correct moderate skews. Both
transformations were applied respectively to the data and the normality assessed
by means of the Shapiro-Wilk Test (used as there are fewer than 50 cases).
(Refer Fig. 2, Table 1)
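
The effect of the two transformations on skew can be illustrated with a minimal sketch. The counts below are made-up values chosen only to be strongly right-skewed (they are not the CBC data), and the helper name skewness is an assumption introduced for illustration. The sketch compares the skewness of the raw, square root transformed and log10 transformed values; it does not reproduce the Shapiro-Wilk test itself.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical, made-up counts (NOT the CBC data): strongly right-skewed.
counts = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]

raw_skew = skewness(counts)
sqrt_skew = skewness([math.sqrt(c) for c in counts])
log_skew = skewness([math.log10(c) for c in counts])

# A value near zero indicates a roughly symmetric distribution.
print(f"raw: {raw_skew:.3f}  sqrt: {sqrt_skew:.3f}  log10: {log_skew:.3f}")
```

In this illustration the square root reduces the skew only partly, while the logarithm removes it, which is consistent with using the stronger transformation for the more strongly skewed variable.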

[Figure: two histograms (y-axis: Frequency). CBC number: Mean = 194135.47,
Std. Dev. = 358003.762, N = 32. CBC circles: Mean = 375.66, Std. Dev. = 262.151,
N = 32]
Fig. 1. Histograms showing the distribution of untransformed data – CBC
numbers and CBC circles.

[Figure: two histograms (y-axis: Frequency). log10 (CBC number): Mean = 4.8193,
Std. Dev. = 0.70313, N = 32. square root (CBC circles): Mean = 18.0935,
Std. Dev. = 7.05973, N = 32]
Fig. 2. Histograms showing the distribution of log transformed CBC numbers
and square root transformed CBC circles.

Table 1. Shapiro-Wilk’s test of normality on untransformed and transformed
variables – CBC number and CBC circles.

Tests of Normality (Shapiro-Wilk)

Variable              Statistic   df   Sig.
CBC number            .513        32   .000
log10 (CBC number)    .948        32   .124
CBC circles           .922        32   .024
sqrt (CBC circles)    .947        32   .121

It is apparent that the transformations applied have helped normalise the data.
The results of a Pearson’s product-moment correlation are described in the table
below.

Table 2. Pearson’s product-moment correlation for sqrt (CBC circles) and
log10 (CBC numbers).

Correlations

                                         sqrt(CBC circle)   log(CBC number)
sqrt(CBC circle)   Pearson Correlation   1                  .625(**)
                   Sig. (2-tailed)                          .000
                   N                     32                 32
log(CBC number)    Pearson Correlation   .625(**)           1
                   Sig. (2-tailed)       .000
                   N                     32                 32

** Correlation is significant at the 0.01 level (2-tailed).

The Pearson’s correlation reveals a significant positive correlation between
the two variables (r=0.625, N=32, P<0.001). However, no literature was found
supporting the use of two different transformations on the variables involved in
a single correlation analysis. A more conservative approach was therefore taken:
a non-parametric test, Spearman’s rank correlation (which makes no assumptions
about the normality of the distributions), was used to determine whether or not
there truly is a significant correlation between CBC number and CBC circles.

Table 3. Spearman’s Rank Correlation for CBC circles and CBC number

Correlations (Spearman's rho)

                                        CBC circles   CBC number
CBC circles   Correlation Coefficient   1.000         .540(**)
              Sig. (2-tailed)           .             .001
              N                         32            32
CBC number    Correlation Coefficient   .540(**)      1.000
              Sig. (2-tailed)           .001          .
              N                         32            32

** Correlation is significant at the 0.01 level (2-tailed).

The results of a Spearman’s rank correlation (rho=0.540, P=0.001) concur with
the Pearson’s correlation test, i.e., there is a significant and positive correlation
between CBC circles and CBC number.
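
One reason the rank-based test sidesteps the question of which transformation to apply is that ranks are unchanged by any monotonic transformation. The following is a minimal sketch of that property, using made-up values with no ties (not the CBC data); the function names pearson, ranks and spearman are assumptions introduced here, and Spearman's rho is computed as Pearson's r on the ranks.

```python
import math


def pearson(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 (smallest) upwards; this sketch assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman's rho as Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical, made-up values (NOT the CBC data).
circles = [40, 95, 150, 260, 410, 620, 900]
numbers = [2_500, 1_200, 30_000, 18_000, 90_000, 400_000, 1_300_000]

sqrt_circles = [math.sqrt(c) for c in circles]
log_numbers = [math.log10(n) for n in numbers]

r_raw = pearson(circles, numbers)
r_transformed = pearson(sqrt_circles, log_numbers)
rho_raw = spearman(circles, numbers)
rho_transformed = spearman(sqrt_circles, log_numbers)

print(f"Pearson r:    raw {r_raw:.3f}, transformed {r_transformed:.3f}")
print(f"Spearman rho: raw {rho_raw:.3f}, transformed {rho_transformed:.3f}")
```

In this sketch Pearson's r changes when the variables are transformed, whereas Spearman's rho is identical before and after.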

b) Separately testing the effect of each measure of host abundance on
the diversity of Mallophaga species:

As the variables are counts that can be treated as approximately continuous,
an appropriate test for determining whether there is an association between a
measure of host abundance and Mallophaga diversity is a linear regression
analysis.

Assumptions of Linear Regression Analysis:

• Independent (x) variables are measured without error: this is assumed to be
true.
• Errors in dependent (y) variable are normally distributed: a normality
test of the residuals of the dependent variable, Mallophaga species,
reveals that the errors are normally distributed (Shapiro-Wilk=0.966,
df=32, P=0.395).
• Variance in the dependent variable is constant: a residual plot of the
Mallophaga species variable reveals homoscedastic variance. (Refer
Fig. 3)

[Figure: residual plot matrix of Observed, Predicted and Std. Residual values.
Dependent Variable: Mallophaga spp. Model: Intercept]
Fig. 3. Residual plot of Mallophaga species variable showing a homoscedastic
distribution of variance.

• Relationship between dependent and independent variables is linear: a
scatter plot of the data reveals that, in an untransformed state, the
relationship is not linear in the case of Mallophaga and CBC number, or of
Mallophaga and CBC circles. (Refer Fig. 4)

[Figure: two scatter plots of Mallophaga species (y-axis) against CBC number
and against CBC circles (x-axes)]
Fig. 4. Scatter plots showing the relationship between dependent variable
Mallophaga species and independent variables CBC number and CBC circles.

A logarithmic transformation of the independent variable CBC number helps to
increase linearity. (Refer Fig. 5)

[Figure: scatter plot of Mallophaga species (y-axis) against log(CBC number)
(x-axis, 3.00 to 6.50)]
Fig. 5. Scatter plot showing the relationship between dependent variable
Mallophaga species and log transformed independent variable, CBC number.

NB: Transforming CBC circles and even the Mallophaga species variable doesn’t
help increase the linearity of the relationship between these two variables.

Analysis of dependent variable Mallophaga species with independent
variable CBC circles:

A linear regression analysis between these two variables is not appropriate, as
CBC circles does not have a linear relationship with Mallophaga species.
Moreover, a linear regression is meaningful only if the two variables are
correlated. A Pearson’s correlation between Mallophaga species and square root
transformed CBC circles (both variables are normally distributed) reveals no
significant correlation between the variables (r=0.241, N=32, P=0.184).

Linear regression analysis of dependent variable Mallophaga species with
independent variable CBC number:

A linear regression analysis of the dependent variable Mallophaga species with
the independent variable CBC number (log transformed to increase linearity)
reveals a significant association between the two variables (F=4.502, df=1,30,
P=0.042). CBC number explains 13% of the variance in the dependent variable
(R2=0.130).

Backward multiple regression analysis of dependent variable Mallophaga
species with independent variables CBC circles and CBC number:

A backward multiple regression analysis is a method of regression that allows for
the inclusion of more than one independent variable. The analysis helps
determine which of the variables should be included in the final model of ‘best fit’.
All the factors are first entered together; the analysis then takes a step
“backwards” by removing the factor assessed as least significant to the model
and running the reduced model, repeating until no remaining factor meets the
removal criterion. Choosing the best fit model depends on two values: the R2
value (the proportion of variation in the dependent variable that is explained
by that model) and the significance of the model, denoted by the P value of the
test statistic.

The following table summarises the models tested by a backward multiple
regression; also indicated is the criterion for factor removal.

Table 4. Summary of variables entered and removed in the backward multiple
regressions performed on variables Mallophaga, CBC circles and log CBC
number.

Variables Entered/Removed (b)

Model   Variables Entered                   Variables Removed   Method
1       CBC circles, log (CBC number) (a)   .                   Enter
2       .                                   CBC circles         Backward (criterion: Probability of F-to-remove >= .100)

a. All requested variables entered.
b. Dependent Variable: Mallophaga sp

The backward multiple regression analysis revealed that of the two possible
models, (Model 1: Dependent variable – Mallophaga species, independent
variables – log CBC Numbers and CBC circles; and Model 2: Dependent
variable – Mallophaga species, independent variable – log CBC number), Model
1 had an R2 value marginally higher than Model 2. (R2=0.133 and R2=0.130,
respectively).

However, Model 2 was seen to be significant (F=4.502, df=1,30, P=0.042), while
Model 1 was non-significant (F=2.232, df=2,29, P=0.125). Refer Table 5.

A review of the adjusted R2 values (a more conservative estimate of the variation
explained, as it attempts to correct the R2 value to give a better indication of the
“goodness of fit”) reveals that Model 2 explains 10.1% of the variation and Model
1, only 7.4%.
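
The R2, adjusted R2 and F values quoted for the two models can be checked by hand from the sums of squares reported in the ANOVA table (Table 5). The following is a minimal sketch of that arithmetic, assuming the usual definitions R2 = SS regression / SS total and adjusted R2 = 1 - (1 - R2)(n - 1)/(n - k - 1), where n is the number of cases and k the number of predictors; the function and variable names are introduced here for illustration only.

```python
def r_squared(ss_regression, ss_total):
    """Proportion of the total variation explained by the model."""
    return ss_regression / ss_total


def adjusted_r_squared(r2, n, k):
    """R2 penalised for the number of predictors k, given n cases."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(ss_regression, ss_total, n, k):
    """Regression mean square divided by residual mean square."""
    ss_residual = ss_total - ss_regression
    return (ss_regression / k) / (ss_residual / (n - k - 1))


SS_TOTAL = 88.875  # total sum of squares, from Table 5
N_CASES = 32       # number of host species analysed

# model name -> (regression sum of squares, number of predictors), from Table 5
models = {"Model 1": (11.855, 2), "Model 2": (11.596, 1)}

results = {}
for name, (ss_reg, k) in models.items():
    r2 = r_squared(ss_reg, SS_TOTAL)
    adj = adjusted_r_squared(r2, N_CASES, k)
    f = f_statistic(ss_reg, SS_TOTAL, N_CASES, k)
    results[name] = (r2, adj, f)
    print(f"{name}: R2={r2:.3f}  adjusted R2={adj:.3f}  F={f:.3f}")
```

The second predictor in Model 1 raises R2 only slightly but costs a degree of freedom, which is why the adjusted R2 is lower for Model 1 than for Model 2.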

From these statistics it is evident that in spite of a marginally lower R2 value,
Model 2 is actually the model representing the best fit.

Table 5. The ANOVA table for the backward multiple regression

ANOVA (c)

Model                Sum of Squares   df   Mean Square   F       Sig.
1       Regression   11.855           2    5.927         2.232   .125(a)
        Residual     77.020           29   2.656
        Total        88.875           31
2       Regression   11.596           1    11.596        4.502   .042(b)
        Residual     77.279           30   2.576
        Total        88.875           31

a. Predictors: (Constant), CBC circles, log (CBC number)
b. Predictors: (Constant), log (CBC number)
c. Dependent Variable: Mallophaga sp

A possible explanation for this occurrence is the inter-correlation of the
independent variables within Model 1. The phenomenon of ‘multicollinearity’ or
‘collinearity’ is an undesirable occurrence that manifests when two or more
independent variables have a high degree of correlation. While this phenomenon
is usually said to occur in variables that possess a correlation of over 50%, the
possibility of collinearity weakening the regression analysis can be assessed by
running a collinearity diagnostics test. The results of the test are described in
Table 6.

The eigenvalues of a collinearity diagnostic test provide an indication of how
many distinct dimensions there are among the independent variables. Of the
three possible eigenvalues in Model 1, one value is close to zero. This could
indicate that the variables are fairly highly inter-correlated and small changes in
the data values may lead to large changes in the estimates of the coefficients.

Table 6. Results of the collinearity diagnostics test on Models 1 and 2

Collinearity Diagnostics (a)

                                                   Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   log CBC number   CBC circles
1       1           2.789        1.000             .00          .00              .02
        2           .204         3.695             .02          .01              .65
        3           .007         20.069            .98          .99              .33
2       1           1.990        1.000             .01          .01
        2           .010         13.999            .99          .99

a. Dependent Variable: Mallophaga spp

The condition index (the square root of the ratio of the largest eigenvalue to
each successive eigenvalue) indicates a possible problem with collinearity when
it is greater than 15, as seen in Model 1, dimension 3.
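
The condition indices can be recomputed from the eigenvalues in Table 6 as a check on this definition. The following is a minimal sketch; the function name condition_indices and the constant THRESHOLD are assumptions introduced here. Because the eigenvalues in the table are rounded to three decimal places, the recomputed indices only approximately match the tabulated ones.

```python
import math


def condition_indices(eigenvalues):
    """Square root of the ratio of the largest eigenvalue to each eigenvalue."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / value) for value in eigenvalues]


THRESHOLD = 15  # rule-of-thumb cut-off used in the text

# Eigenvalues as reported (rounded) in Table 6.
model_1 = condition_indices([2.789, 0.204, 0.007])
model_2 = condition_indices([1.990, 0.010])

for name, indices in (("Model 1", model_1), ("Model 2", model_2)):
    flagged = [round(ci, 2) for ci in indices if ci > THRESHOLD]
    print(name, [round(ci, 2) for ci in indices], "above threshold:", flagged)
```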

Further, the backward multiple regression analysis reveals that the excluded
variable, CBC circles, has a test statistic and significance indicating that it
makes no significant additional contribution to the model (t=0.312, P=0.757,
where t tests the null hypothesis that the regression coefficient is zero).

Table 7. The Coefficients table generated by a linear regression analysis of
Mallophaga species and log transformed CBC numbers.

Coefficients (a)

                   Unstandardized Coefficients   Standardized Coefficients
                   B         Std. Error          Beta                        t       Sig.
(Constant)         -1.005    1.996                                           -.503   .618
log(CBC number)    .870      .410                .361                        2.122   .042

a. Dependent Variable: Mallophaga sp



Conclusion:

On performing a backward multiple regression on the data, it was determined
that a model comprising the dependent variable Mallophaga species and the
independent variable CBC number (log transformed to increase linearity) was
the model of best fit. The independent variable CBC circles was excluded from
the model on the basis of possible collinearity. Further, CBC circles showed
neither a linear relationship nor a significant correlation with the dependent
variable.

The analysis of the selected model revealed that Mallophaga diversity is
significantly affected by duck host abundance (as measured by CBC number)
(F=4.502, df=1,30, P=0.042).

The coefficients table (refer Table 7) may be used to predict the diversity of
Mallophaga given a specific abundance of duck hosts.
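
As an illustration of such a prediction, the unstandardized coefficients in Table 7 give the fitted line Mallophaga species = -1.005 + 0.870 × log10(CBC number). The following is a minimal sketch of applying it; the function name predict_mallophaga and the example abundance of 100,000 are assumptions chosen for illustration, and predictions are only reasonable within the range of abundances in the data (log10 values of roughly 3 to 6.5, judging from the axis of Fig. 5).

```python
import math

INTERCEPT = -1.005  # (Constant), unstandardized B from Table 7
SLOPE = 0.870       # log(CBC number), unstandardized B from Table 7


def predict_mallophaga(cbc_number):
    """Predicted number of Mallophaga species for a given CBC number."""
    return INTERCEPT + SLOPE * math.log10(cbc_number)


# Hypothetical example: a host species with 100,000 birds counted.
example = predict_mallophaga(100_000)
print(f"Predicted Mallophaga species: {example:.3f}")

# Each tenfold increase in CBC number adds SLOPE species to the prediction.
tenfold_gain = predict_mallophaga(1_000_000) - predict_mallophaga(100_000)
print(f"Gain per tenfold increase: {tenfold_gain:.3f}")
```

Given the low R2 of the model, any single prediction of this kind carries considerable uncertainty.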

It may be noted however, that the regression is a weak one as the R2 value of the
model indicates that only 13% of the Mallophaga variance is explained. It is
evident that there are other, more significant factors that influence Mallophaga
diversity; however, these factors lie outside the scope of this analysis.
