
Divya Krishnamohan Student ID: 200292988

CORRELATION & REGRESSION EXERCISE

Biodiversity & Conservation Master’s Skills (BLGY5000)

Correlation and Regression Exercise
Q1 Following is a list of hypothetical examples of the types of analysis for which one might use each of the methods mentioned (dependent variable is denoted by an asterisk*; details of variables are within parentheses):

Pearson’s product moment correlation:
Testing for an association between leaf area and root starch concentration in a clonal tree species (both variables continuous and normally distributed).

Rank correlation (Spearman and/or Kendall)
Testing for an association between the diversity of flowering plant species (ranked) and the number of visiting pollinator species within the study area (normal distribution of variables not required).

Linear regression
Testing the relationship between the degree of nuptial shading* (continuous measure, residuals normally distributed) and the availability of mates in a species of fish (linearly related to dependent variable).

Logistic regression
Testing the effect of genetic distance on the sex* of sterile offspring (binomial distribution) in five hybrid species pairs.

Analysis of covariance
Testing the effect of soil permeability (ranked), litter depth and soil type (covariate) on the rate of ant re-colonisation* (residuals normally distributed).

Q2


Dr. William Kunin collected data on the abundance of 33 different species of North American ducks and geese, as well as the number of chewing lice (Mallophaga) species recorded on them. Data for 32 of these species are used in the following analysis (the species with code name “snwgos” is excluded because the data relevant to this analysis are missing). To determine whether the number of Mallophaga species associated with a duck species is affected by how common the host is, data on the diversity of Mallophaga and two measures of duck abundance are used. The two measures of duck abundance are:
i. Number of sites used in the Christmas Bird Count at which the species was recorded (abbreviated as CBC circles)
ii. Number seen in the Christmas Bird Count (abbreviated as CBC number)

a) Testing for a significant relationship between the two measures of host abundance:
In order to perform a correlation, data have to meet the assumptions of the type of correlation test being performed. As a rule, a parametric test, such as Pearson’s product-moment correlation, is more powerful than its nonparametric counterparts (Kendall’s tau_b and Spearman’s rank) when its assumptions are met. Pearson’s correlation assumes a normal distribution of both variables. Histograms of the two data sets reveal a strong skew in CBC number and a moderate skew in CBC circles. (Refer Fig. 1) Usually, a logarithmic transformation is applied to correct strong skews while a square root transformation is applied to correct moderate skews. Accordingly, a log10 transformation was applied to CBC number and a square root transformation to CBC circles, and normality was assessed by means of the Shapiro-Wilk test (used as there are fewer than 50 cases). (Refer Fig. 2, Table 1)
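The effect of the two transformations on skew can be illustrated with a short sketch. The counts below are hypothetical right-skewed values chosen for illustration, not the Christmas Bird Count data, and the moment-based skewness coefficient is used here only as a rough stand-in for inspecting a histogram.

```python
import math

def sample_skewness(values):
    """Moment-based skewness g1 = m3 / m2**1.5; positive means a right skew."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical right-skewed counts (assumption: NOT the real CBC number data).
counts = [1000, 2000, 5000, 10000, 20000, 50000, 100000, 1500000]

raw_skew = sample_skewness(counts)
sqrt_skew = sample_skewness([math.sqrt(c) for c in counts])
log_skew = sample_skewness([math.log10(c) for c in counts])

print(f"raw: {raw_skew:.2f}  sqrt: {sqrt_skew:.2f}  log10: {log_skew:.2f}")
```

For data like these the square root reduces the skew only slightly while the logarithm reduces it substantially, which is in line with reserving the logarithmic transformation for strongly skewed variables.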

[Figure 1: two frequency histograms. Panel 1: CBC number (Mean = 194135.47, Std. Dev. = 358003.762, N = 32). Panel 2: CBC circles (Mean = 375.66, Std. Dev. = 262.151, N = 32).]

Fig. 1. Histograms showing the distribution of untransformed data – CBC numbers and CBC circles.

[Figure 2: two frequency histograms. Panel 1: log10 (CBC number) (Mean = 4.8193, Std. Dev. = 0.70313, N = 32). Panel 2: square root (CBC circles) (Mean = 18.0935, Std. Dev. = 7.05973, N = 32).]

Fig. 2. Histograms showing the distribution of log transformed CBC numbers and square root transformed CBC circles.


Table 1. Shapiro-Wilk’s test of normality on untransformed and transformed variables – CBC number and CBC circles.
Tests of Normality (Shapiro-Wilk)
Variable              Statistic   df   Sig.
CBC number            .513        32   .000
log10 (CBC number)    .948        32   .124
CBC circles           .922        32   .024
sqrt (CBC circles)    .947        32   .121

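It is apparent that the transformations applied have helped normalise the data: both untransformed variables depart significantly from normality (P<0.05), whereas neither transformed variable does (P>0.05). The results of a Pearson’s product-moment correlation are described in the table below.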
Table 2. Pearson’s product-moment correlation for sqrt (CBC circles) and log10 (CBC numbers).
Correlations
                                         sqrt(CBC circle)   log(CBC number)
sqrt(CBC circle)   Pearson Correlation   1                  .625(**)
                   Sig. (2-tailed)                          .000
                   N                     32                 32
log(CBC number)    Pearson Correlation   .625(**)           1
                   Sig. (2-tailed)       .000
                   N                     32                 32

** Correlation is significant at the 0.01 level (2-tailed).

The Pearson’s correlation reveals a significant positive correlation between the two variables (r=0.625, N=32, P<0.001). However, no literature was found supporting the use of two different transformations on the variables involved in a single correlation analysis. A more conservative approach was therefore also taken: Spearman’s rank correlation, a nonparametric test that makes no assumptions about the normality of the distributions, was used to determine whether or not there truly is a significant correlation between CBC numbers and CBC circles.
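As a rough check on the significance reported in Table 2, the correlation coefficient can be converted to a t statistic with N − 2 degrees of freedom, t = r·sqrt(N − 2)/sqrt(1 − r²). The sketch below does this with the reported r and N; the two-tailed P value is obtained by numerically integrating the t density with Simpson's rule, which is an implementation choice made here to stay within the standard library, not the method used by the statistics package.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed P value; integrates the density from 0 to |t| (Simpson's rule)."""
    upper = abs(t)
    h = upper / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return max(0.0, 1.0 - 2.0 * area)

def t_from_r(r, n):
    """t statistic for the null hypothesis of zero correlation (df = n - 2)."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

r, n = 0.625, 32  # values reported in Table 2
t_stat = t_from_r(r, n)
p_value = two_tailed_p(t_stat, n - 2)
print(f"t = {t_stat:.3f}, df = {n - 2}, two-tailed P = {p_value:.5f}")
```

The same two helper functions can be applied to any of the other correlation coefficients reported in this exercise.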


Table 3. Spearman’s Rank Correlation for CBC circles and CBC number
Correlations (Spearman's rho)
                                        CBC circles   CBC number
CBC circles   Correlation Coefficient   1.000         .540(**)
              Sig. (2-tailed)           .             .001
              N                         32            32
CBC number    Correlation Coefficient   .540(**)      1.000
              Sig. (2-tailed)           .001          .
              N                         32            32

** Correlation is significant at the 0.01 level (2-tailed).

The results of a Spearman’s rank correlation (rho=0.540, P=0.001) concur with the Pearson’s correlation test, i.e., there is a significant and positive correlation between CBC circles and CBC number.
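Spearman’s rho is simply Pearson’s correlation computed on the ranks of the data, which is why it needs no assumption of normality. The following sketch shows the calculation on a handful of hypothetical values (not the CBC data); the function names are illustrative.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Pearson's product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

# Hypothetical values for illustration only.
circles = [10, 20, 30, 40, 50]
numbers = [200, 100, 4000, 3000, 50000]
rho = spearman_rho(circles, numbers)
print(f"rho = {rho:.3f}")
```

Because only the ranks enter the calculation, the very large value in the second list has no more influence than any other observation.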

b) Separately testing the effect of each measure of host abundance on the diversity of Mallophaga species:
As the variables are continuous count data, the most appropriate test for determining whether there is an association between a measure of host abundance and Mallophaga diversity is a linear regression analysis.
Assumptions of Linear Regression Analysis:
• Independent (x) variables are measured without error – This is assumed to be true.
• Errors in dependent (y) variable are normally distributed – A normality test of the residuals of the dependent variable, Mallophaga species, reveals that the errors are normally distributed (Shapiro-Wilk=0.966, df=32, P=0.395).


• Variance in the dependent variable is constant – A residual plot of the Mallophaga species variable reveals a homoscedastic variance. (Refer Fig. 3)

[Figure 3: matrix of residual plots for dependent variable Mallophaga spp (Observed, Predicted and Std. Residual plotted against one another). Model: Intercept.]

Fig. 3. Residual plot of Mallophaga species variable showing a homoscedastic distribution of variance.

• Relationship between dependent and independent variables is linear – A scatter plot of the data reveals that in an untransformed state, the relationship is not linear in the case of Mallophaga and CBC numbers, as well as Mallophaga and CBC circles. (Refer Fig. 4)


[Figure 4: two scatter plots. Panel 1: Mallophaga species (y axis) against CBC number (x axis). Panel 2: Mallophaga species (y axis) against CBC circles (x axis).]

Fig. 4. Scatter plots showing the relationship between dependent variable Mallophaga species and independent variables CBC number and CBC circles.

A logarithmic transformation of the independent variable CBC number helps to increase linearity. (Refer Fig. 5)

[Figure 5: scatter plot of Mallophaga species (y axis) against log(CBC number) (x axis).]

Fig. 5. Scatter plot showing the relationship between dependent variable Mallophaga species and log transformed independent variable, CBC number.

NB: Transforming CBC circles and even the Mallophaga species variable doesn’t help increase the linearity of the relationship between these two variables.

Analysis of dependent variable Mallophaga species with independent variable CBC circles:


A linear regression analysis between these two variables is redundant as CBC circles does not have a linear relationship with Mallophaga species. Moreover, a linear regression is meaningful only if the two variables are correlated. A Pearson’s correlation between Mallophaga and square root transformed CBC circles (both variables are normally distributed) reveals that there is no significant correlation between the variables (r=0.241, N=32, P=0.184).

Linear regression analysis of dependent variable Mallophaga species with independent variable CBC number:
A linear regression analysis of the dependent variable Mallophaga species on the independent variable CBC number (log transformed to increase linearity) reveals a significant association between the two variables (F=4.502, df=1, 30, P=0.042). CBC number explains 13% of the variance in the dependent variable (R²=0.130).
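For reference, the F statistic and R² of a simple linear regression both come from partitioning the total sum of squares into a regression part and a residual part. The sketch below shows that calculation on a small hypothetical data set (not the Mallophaga data); the function and key names are illustrative.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with R squared and F."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_total = sum((b - my) ** 2 for b in y)

    slope = sxy / sxx
    intercept = my - slope * mx
    ss_regression = slope * sxy            # variation explained by the line
    ss_residual = ss_total - ss_regression
    df_regression, df_residual = 1, n - 2

    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": ss_regression / ss_total,
        "f": (ss_regression / df_regression) / (ss_residual / df_residual),
        "df": (df_regression, df_residual),
    }

# Hypothetical data for illustration only.
fit = simple_regression([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(fit)
```

With a single predictor the regression has 1 degree of freedom and the residual has N − 2, so a model fitted to 32 species has 1 and 30 degrees of freedom.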

Backward multiple regression analysis of dependent variable Mallophaga species with independent variables CBC circles and CBC number:
A backward multiple regression analysis is a method of regression that allows for the inclusion of more than one independent variable. The analysis helps determine which of the variables should be included in the final model of ‘best fit’. This is achieved by first fitting a model containing all the factors and then taking a step “backwards” by removing the factor that is assessed as the least significant to the operation of the model. Choosing the best fit model is dependent on two values: the R² value (representing the amount of variation in the dependent variable that is explained by that model) and the significance of the model, denoted by the P value of the test statistic.


The following table summarises the models tested by a backward multiple regression; also indicated is the criterion for factor removal.
Table 4. Summary of variables entered and removed in the backward multiple regressions performed on variables Mallophaga, CBC circles and log CBC number.
Variables Entered/Removed (b)
Model   Variables Entered                    Variables Removed   Method
1       CBC circles, log (CBC number) (a)    .                   Enter
2       .                                    CBC circles         Backward (criterion: Probability of F-to-remove >= .100)

a. All requested variables entered.
b. Dependent Variable: Mallophaga sp
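The removal step can be sketched from the sums of squares reported in the ANOVA table (Table 5): the F-to-remove for a variable is the drop in regression sum of squares when it is left out, divided by the residual mean square of the fuller model. The P value used below is the one reported for CBC circles later in this analysis (P=.757), not one computed here, and the variable names are illustrative.

```python
import math

# Values taken from the ANOVA table (Table 5).
ss_regression_full = 11.855      # Model 1: log (CBC number) + CBC circles
ss_regression_reduced = 11.596   # Model 2: log (CBC number) only
ms_residual_full = 2.656         # residual mean square of Model 1

# One variable is dropped, so the numerator has 1 degree of freedom.
f_to_remove = (ss_regression_full - ss_regression_reduced) / 1 / ms_residual_full
t_equivalent = math.sqrt(f_to_remove)  # with 1 df, F equals t squared

reported_p = 0.757        # significance reported for CBC circles
removal_threshold = 0.100  # criterion from Table 4
remove_variable = reported_p >= removal_threshold

print(f"F-to-remove = {f_to_remove:.3f}, t = {t_equivalent:.3f}, remove: {remove_variable}")
```

The square root of this F value agrees with the t statistic reported for the excluded variable, which is a useful consistency check on the tables.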

The backward multiple regression analysis compared two possible models. Model 1 had dependent variable Mallophaga species and independent variables log CBC number and CBC circles; Model 2 had dependent variable Mallophaga species and independent variable log CBC number only. Model 1 had an R² value marginally higher than Model 2 (R²=0.133 and R²=0.130, respectively). However, Model 2 was significant (F=4.502, df=1, 30, P=0.042), while Model 1 was non-significant (F=2.232, df=2, 29, P=0.125). Refer Table 5. A review of the adjusted R² values (a more conservative estimate of the variation explained, as it corrects the R² value for the number of predictors to give a better indication of the “goodness of fit”) reveals that Model 2 explains 10.1% of the variation and Model 1 only 7.4%.
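The R², adjusted R² and F values quoted above can be reproduced from the sums of squares in Table 5. The sketch below assumes the usual adjustment, adjusted R² = 1 − (1 − R²)(N − 1)/(N − k − 1), where k is the number of predictors; the function name is illustrative.

```python
def model_summary(ss_regression, ss_residual, n, k):
    """Return (R squared, adjusted R squared, F) for a model with k predictors."""
    ss_total = ss_regression + ss_residual
    r_squared = ss_regression / ss_total
    adjusted = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
    f_stat = (ss_regression / k) / (ss_residual / (n - k - 1))
    return r_squared, adjusted, f_stat

# Sums of squares from Table 5; N = 32 species.
model_1 = model_summary(11.855, 77.020, n=32, k=2)
model_2 = model_summary(11.596, 77.279, n=32, k=1)

for name, (r2, adj, f) in (("Model 1", model_1), ("Model 2", model_2)):
    print(f"{name}: R2 = {r2:.3f}, adjusted R2 = {adj:.3f}, F = {f:.3f}")
```

The adjustment penalises Model 1 for its extra predictor, which is why its adjusted value falls below that of Model 2 even though its unadjusted R² is slightly higher.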


From these statistics it is evident that, in spite of a marginally lower R² value, Model 2 is actually the model representing the best fit.
Table 5. The ANOVA table for the backward multiple regression
ANOVA(c)
Model              Sum of Squares   df   Mean Square   F       Sig.
1     Regression   11.855           2    5.927         2.232   .125(a)
      Residual     77.020           29   2.656
      Total        88.875           31
2     Regression   11.596           1    11.596        4.502   .042(b)
      Residual     77.279           30   2.576
      Total        88.875           31

a. Predictors: (Constant), CBC circles, log (CBC number)
b. Predictors: (Constant), log (CBC number)
c. Dependent Variable: Mallophaga sp

A possible explanation for this occurrence is the inter-correlation of the independent variables within Model 1. ‘Multicollinearity’ or ‘collinearity’ is an undesirable phenomenon that arises when two or more independent variables are highly correlated with one another. While it is usually said to occur between variables with a correlation coefficient above 0.5, the possibility of collinearity weakening the regression analysis can be assessed by running a collinearity diagnostics test. The results of the test are described in Table 6. The eigenvalues of a collinearity diagnostic test provide an indication of how many distinct dimensions there are among the independent variables. Of the three eigenvalues in Model 1, one is close to zero. This could indicate that the variables are fairly highly inter-correlated, so that small changes in the data values may lead to large changes in the estimates of the coefficients.


Table 6. Results of the collinearity diagnostics test on Models 1 and 2
Collinearity Diagnostics (a)
                                                   Variance Proportions
Model   Dimension   Eigenvalue   Condition Index   (Constant)   log CBC number   CBC circles
1       1           2.789        1.000             .00          .00              .02
        2           .204         3.695             .02          .01              .65
        3           .007         20.069            .98          .99              .33
2       1           1.990        1.000             .01          .01
        2           .010         13.999            .99          .99

a. Dependent Variable: Mallophaga spp

The condition index (the square root of the ratio of the largest eigenvalue to each successive eigenvalue) indicates a possible problem with collinearity if it is greater than 15, as seen in Model 1, dimension 3. Further, the backward multiple regression analysis reveals that the excluded variable, CBC circles, has a test statistic and significance indicating no significant linear relationship with the dependent variable once log CBC number is in the model (t=0.312, P=0.757, where t tests the null hypothesis that the regression coefficient is zero).
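The condition indices in Table 6 can be recomputed from the eigenvalues using the definition given above. Because the eigenvalues in the table are rounded to three decimal places, the values from this sketch differ slightly from the printed condition indices; the threshold of 15 is the one used in this analysis.

```python
import math

def condition_indices(eigenvalues):
    """Square root of the ratio of the largest eigenvalue to each eigenvalue."""
    largest = max(eigenvalues)
    return [math.sqrt(largest / value) for value in eigenvalues]

# Eigenvalues as printed (rounded) in Table 6.
model_1 = condition_indices([2.789, 0.204, 0.007])
model_2 = condition_indices([1.990, 0.010])

THRESHOLD = 15  # rule of thumb used in this analysis
for name, indices in (("Model 1", model_1), ("Model 2", model_2)):
    flagged = [round(ci, 2) for ci in indices if ci > THRESHOLD]
    print(name, [round(ci, 2) for ci in indices], "flagged:", flagged)
```

Only the third dimension of Model 1 exceeds the threshold; Model 2 stays just below it.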

Table 7. The Coefficients table generated by a linear regression analysis of Mallophaga species and log transformed CBC numbers.
Coefficients (a)
                   Unstandardized Coefficients   Standardized Coefficients
                   B        Std. Error           Beta                        t       Sig.
(Constant)         -1.005   1.996                                            -.503   .618
log(CBC number)    .870     .410                 .361                        2.122   .042

a. Dependent Variable: Mallophaga sp


Conclusion: On performing a backward multiple regression on the data, it was determined that a model comprising the dependent variable Mallophaga species and the independent variable CBC number (log transformed to increase linearity) was the model of best fit. The independent variable CBC circles was excluded from the analysis on the basis of possible collinearity. Further, the relationship of CBC circles with the dependent variable was seen to be non-linear, and the two variables were not significantly correlated. The analysis of the selected model revealed that Mallophaga diversity is significantly affected by duck host abundance as measured by CBC number (F=4.502, df=1, 30, P=0.042). The coefficients table (refer Table 7) may be used to predict the diversity of Mallophaga given a specific abundance of duck hosts. It may be noted, however, that the regression is a weak one, as the R² value of the model indicates that only 13% of the Mallophaga variance is explained. It is evident that there are other, more important factors that influence Mallophaga diversity; however, these factors lie outside the scope of this analysis.
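As an illustration of how the coefficients in Table 7 might be used for prediction, the sketch below evaluates the fitted line, Mallophaga species = −1.005 + 0.870 × log(CBC number). It assumes the logarithm in Table 7 is base 10, in keeping with the log10 transformation shown in Fig. 2; the function name and the example count are hypothetical.

```python
import math

INTERCEPT = -1.005  # (Constant) from Table 7
SLOPE = 0.870       # coefficient for log(CBC number) from Table 7

def predict_mallophaga(cbc_number):
    """Predicted number of Mallophaga species for a given CBC number.

    Assumes a base-10 logarithm, as in the transformed histogram (Fig. 2).
    """
    if cbc_number <= 0:
        raise ValueError("CBC number must be positive to take its logarithm")
    return INTERCEPT + SLOPE * math.log10(cbc_number)

# Hypothetical host species counted 100,000 times (log10 = 5).
example = predict_mallophaga(100_000)
print(f"Predicted Mallophaga species: {example:.3f}")
```

Given that the model explains only 13% of the variance, any such prediction carries a wide margin of error and should be read as a rough expectation rather than a precise estimate.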