
16/05/12

ecology.msu.montana.edu/labdsv/R/labs/lab7/lab7.html

Principal Components Analysis and Redundancy Analysis


After our experience fitting individual species models to specific gradients, you may be struck by the enormity of the task of analyzing numerous species this way, and the problems inherent in summarizing the vast amounts of data generated by the procedures. As an alternative, vegetation ecologists have long sought efficient ways to summarize entire vegetation data sets, either with respect to specific environmental data, or simply to summarize the distribution of sample points in a low-dimensional space where alternative environmental explanations can be explored. In our first attempts, we will employ both approaches. First, we will use Principal Components Analysis (PCA) to reduce the distribution of sample points to 2 or 3 dimensions, and plot the results in "ordinations." Depending on how successful we are at reducing the data set, we can seek patterns among the distribution of plots in ordination space, and explore possible environmental correlates with these. It's important to remember, however, that at most we can establish correlation among the variables, not causation. PCA is an eigenvector technique that attempts to explain as much variance as possible on each of a series of orthogonal vectors spanning the vegetation space (see Kent and Coker pp. 186-203 for a gentle presentation; alternatively Legendre and Legendre 1998 pp. 391-424 for a more thorough treatment). Eigenanalysis is based in linear or matrix algebra, and has a wide range of uses. Our particular interest is in finding an alternative description of our vegetation data in a low-dimensional space. The first vector is chosen to account for as much variance in plot dispersion from the centroid as possible; the second is chosen by the same criterion but subject to the constraint of being orthogonal to the first, and so on.
The approach can be likened to regression by least squares, except that the residuals from the vector are measured perpendicular to the vector rather than perpendicular to the x or y axis. In general, PCA is performed on a correlation or variance/covariance matrix, although equivalently, a matrix of sums-of-squares-and-cross-products can be used. In R, however, the functions we will use convert the basic vegetation data into the appropriate forms, usually requiring only an argument to be specified in the function. There are two ways to perform PCA in R: princomp() and prcomp(). Essentially, they compute the same values (technically, princomp() computes an eigenanalysis and prcomp() computes a singular value decomposition). To make all of the ordination labs more similar, the LabDSV library includes a function called pca() that I will use in the following. pca() simply calls prcomp(), but includes different conventions for plotting and analysis that we will use with other ordination techniques in later labs. There are subtle differences in the S-Plus and R princomp() functions, but they're essentially the same (at least for recent versions). It is worth noting, however, that the princomp() function calculates the population variance/covariance (or correlation), rather than the sample variance/covariance; accordingly, values in the cross-products matrix are divided by N, rather than N-1. So, we first have to load the LabDSV library, if you haven't already. If the libraries are installed properly
> library(labdsv)

will do the trick. (See R for Ecologists if you don't remember how to install and access libraries in S-Plus or R.) The pca() function accepts taxon matrices or dataframes, computing loadings for each column and scores for each row in the matrix or dataframe. The loadings are the contribution of the column vector to each of the eigenvectors. A large positive loading means that that column (species in our case) is positively correlated with that eigenvector; a large negative value means negative correlation; and small values mean that the species is unrelated to that eigenvector. Often, we're more interested in the position of the sample plots in the reduced-dimensional space; these positions are given by the scores, or coordinates along the eigenvectors. The pca() function allows you to specify a number of parameters concerning the calculation. The first is whether you want to use a correlation or covariance matrix. Again, this is a subject too detailed to expound on adequately in HTML, but generally, PCA is sensitive to the scale of measurement of the data. If all the data are not measured on the same scale, using covariance means that the result will be determined mostly by the variable with the largest values, as it will have the highest variance. Using a correlation matrix treats all variables the same (standardized to mean=0 and std. dev.=1). Even if all species were measured on the same scale (e.g. percent cover), to prevent the dominant species from determining the results, you probably want to use correlation. In pca(), this means specifying cor=TRUE in the function call. In addition, by default pca() will generate one eigenvector for every column (species), and scores for each row (plot) on each of the eigenvectors. Generally, the vast majority of the variance is described on the first few eigenvectors, and we can save space by calculating scores for only the first few eigenvectors.
This can be specified by including dim=n, where n equals the number of dimensions you want scores and loadings for. So, for example
pca.1 <- pca(veg,cor=TRUE,dim=10)

calculates a principal components analysis of the Bryce Canyon vegetation, using a correlation matrix and only calculating scores for the first 10 eigenvectors. The results of a PCA are complex, including:

variance explained by eigenvector
cumulative variance explained by eigenvector
species loadings by eigenvector
plot scores by eigenvector

Variance Explained
To see the variance and cumulative variances explained by eigenvector, type


summary(pca.1,dim=10)

The output looks like this:


Importance of components:
numeric matrix: 3 rows, 169 columns.
                       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
Standard deviation     3.6528 2.7722 2.6602 2.5460 2.3862
Proportion of Variance 0.0790 0.0455 0.0419 0.0384 0.0337
Cumulative Proportion  0.0790 0.1244 0.1663 0.2047 0.2383
                       Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
Standard deviation     2.3146 2.1179 2.0671 1.9078  1.8580
Proportion of Variance 0.0317 0.0265 0.0253 0.0215  0.0204
Cumulative Proportion  0.2700 0.2966 0.3219 0.3434  0.3638
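As a cross-check, the proportions reported by summary() can be recomputed by hand from the pca object. This is a minimal sketch, assuming the pca.1 object created above, and assuming (as in the LabDSV pca() code listed at the end of this lab) that the object stores the component standard deviations as sdev and the total variance as totdev:

```r
evals <- pca.1$sdev^2            # eigenvalues = variance on each axis
prop  <- evals / pca.1$totdev    # proportion of total variance per axis
cumsum(prop)                     # cumulative proportion; ~0.36 by axis 10
```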

For reasons known only to R/S, the top line is the standard deviation (rather than the variance) associated with each component. The next two lines are the items of primary interest, however: proportion of total variance, and cumulative proportion of variance. The dim=10 means to list only the first 10 dimensions; the default is to list ALL of them. Notice that the variance explained is in order from highest to lowest by eigenvector; this is by design. Notice also that the first ten eigenvectors explain a total of 0.3638 of the total variance of the dataset. The number of significant eigenvectors is at most min(#plots,#species)-1. Most of the eigenvectors after 10 explain a minute portion of the variance. We can look at this same information graphically with the varplot() function.
> varplot.pca(pca.1)

Hit Return to Continue

The first panel shows the variance associated with each vector, and the second shows the cumulative variance as a fraction of the total.

Species Loadings

To see the species loadings, use the loadings.pca() function as follows:
loadings.pca(pca.1,dim=5)

A partial output looks like the following:


> loadings.pca(pca.1)
       Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
junost
ameuta
arcpat  0.157 -0.103
arttri -0.249  0.117  0.154
atrcan -0.243  0.104  0.154
berfre  0.143
ceamar  0.134
cerled
cermon
chrdep
chrnau
chrpar
chrvis -0.200
eurlan
juncom  0.100  0.132
pacmyr  0.145  0.111  0.154
. . . . . . . . . . . . . . . . . .

By default, loadings.pca() suppresses small values to emphasize the more important ones. Accordingly, you can see that eigenvector 1 is positively correlated with arcpat, ceamar, juncom, and pacmyr, while negatively associated with chrvis (along with many more species not included in this excerpt). The dim=5 means only list the first five components; the default is 5. It is often the case that loadings or scores from two different computers or programs will come out with the same values but with opposite signs. The orientation of eigenvectors is arbitrary, and the sign is only meaningful with respect to other values on the same axis.

Plot Scores

To see a similar output for plot scores, use the scores.pca() function.
> scores.pca(pca.1,dim=5)

Which looks like:

      Comp.1  Comp.2  Comp.3  Comp.4  Comp.5
1    -0.3035 -0.5823  0.4965  0.4463  0.0005
2     0.1070 -0.8640  0.7710  1.0702 -0.1001
3    -0.3252 -0.6319  0.5978  0.8163 -0.0339
4    -0.3010 -0.5509  0.4413  0.3563  0.0722
5    -3.0513  0.5863 -0.0317  0.6690  0.0280
6    -0.4590 -0.6566  0.6296  0.5624 -0.2395
7    -2.9882  0.6913 -0.2507  0.1927  0.0176
8    -1.0572 -0.1465  0.1346  0.3223  0.1678
9     0.5911 -1.0911  0.7925  0.6488 -0.0342
10   -3.7677  1.0924 -0.7487 -0.1485 -0.1281
. . . . . . . . . . . . . . . . . .
156   0.2751 -0.6590  0.4640 -1.2841 -0.2770
157   0.7556 -0.9812  0.5450 -1.0195 -0.2163
158   0.3201 -0.7444  0.3037 -0.6636  0.0575
159   0.8182 -0.9662  0.4277 -1.1058 -0.1866
160   0.3892 -0.6970  0.2636 -0.6259  0.0825

The dim=5 should be familiar by now. These scores are what is typically plotted in a PCA "ordination." This is easily done with the plot() function. The LabDSV library includes a special version of the plot function for PCA. In S-Plus and R the default plot function plots the barplot of variances, but ecologists are usually more interested in the plot scores. The LabDSV plot function for PCA is designed to simplify plotting, minimize typing, and, most important, attempt to scale the axes identically so that we can compare distances within the ordination without distortion. Unfortunately, R/S does not always understand the exact aspect ratio of computer monitors, and so plots may not be exactly square. Generally, they're close enough, and hardcopy output will be exact. The plot() function requires a PCA as its first argument. The default is to plot axes 1 and 2, but any of the axes computed can be plotted by specifying x= and y=. In addition, you can add a title. For example (showing R output)
> plot(pca.1,title="Bryce Canyon")
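To view other pairs of axes, the same call can be repeated with the axis arguments. This is a sketch only; the x= and y= argument names follow the description in the text above:

```r
# plot axis 3 against axis 1 instead of the default axes 1 and 2
plot(pca.1, x=1, y=3, title="Bryce Canyon, axes 1 and 3")
```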

Environmental Analysis

Of course the whole point of getting the first few axes of variation is to begin an analysis of the vegetation/environment relation. There are again several alternatives for achieving this. First, we will employ a technique called "Redundancy Analysis," which is described in detail by Legendre and Legendre (1998). In a nutshell, the approach is to use linear regression to replace the observed values of vegetation abundance in each plot with their fitted values, regressing each species in turn on specified environmental variables. This sounds like a lot of work, but in R it's actually pretty easy. First, we want to center the species abundances around their means and standardize their variance. We can do this fairly directly, by
veg.center <- apply(veg,2,function(x){(x-mean(x))/sd(x)})

As you remember from Lab 1, the apply function applies a function to each column (if we specify 2 as the second argument; rows if we specify 1) in turn. The function(x){(x-mean(x))/sd(x)} says to subtract the mean value of each column from every value in that column, and then divide by the standard deviation of that column. If you don't divide by the standard deviation, you will get results equivalent to calculating the PCA on a covariance matrix, rather than a correlation matrix, but that may be preferred in some cases. Alternatively, we can use a built-in function called scale.
veg.center <- scale(veg,center=TRUE,scale=TRUE)

The center=TRUE argument means to center the columns by their mean, and the scale=TRUE means divide by the standard deviation. Later, we may wish to perform the PCA on the variance/covariance matrix of the centered vegetation matrix as opposed to the correlation matrix. The scale function is a little more graceful about missing values and other issues, and so might be preferable. Next, we want to specify which environmental variables we want to use, and to center them as well. For example:
env <- cbind(elev,av,slope)
env.scale <- scale(env,center=TRUE,scale=TRUE)

will create a new matrix with elevation, aspect value, and slope in it, which we then center and scale to unit standard deviation. In this case, since the environmental data represent the independent variables in the respective regressions, and since they were generally measured on different scales, we want to both center and scale to unit standard deviation to simplify interpretation of the regression coefficients. Now, we want to regress each species (column in veg.center) against the centered and scaled environmental variables. We could use apply to regress each column in turn, but perhaps surprisingly, the R/S linear regression function lm does this automatically if the dependent variable is a matrix. So:
veg.lm <- lm(veg.center~env.scale)

Now, we create the matrix of fitted values:


veg.fit <- fitted(veg.lm)

It's possible to view the regression coefficients of the models, but every species is a separate regression, and the details are generally overwhelming. It's sometimes worth scanning, however, to see if the same variables are judged significant for a large number of species. Finally, we calculate the PCA on the fitted values:
> rda <- pca(veg.fit,cor=TRUE)
> varplot.pca(rda,dim=5)

Notice how all of the variation is accounted for in only three eigenvectors. This is necessarily so, as the fitted values subjected to the PCA were the output of a linear regression with only three independent variables. We can view the ordination exactly as any other ordination:
> plot(rda,title="Redundancy Analysis")
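The steps above can be collected into a small convenience function. This is a sketch only; rda.pca is a hypothetical name for illustration, not part of LabDSV:

```r
# sketch: redundancy analysis = PCA of abundances fitted to environment
rda.pca <- function(veg, env, dim=5) {
    veg.center <- scale(veg, center=TRUE, scale=TRUE)  # standardize species
    env.scale  <- scale(env, center=TRUE, scale=TRUE)  # standardize environment
    veg.fit <- fitted(lm(veg.center ~ env.scale))      # fitted abundances
    pca(veg.fit, cor=TRUE, dim=dim)                    # eigenanalysis of fitted values
}
```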

Alternatives to Redundancy Analysis

Remembering Lab 4 and Lab 5, we can use what we learned about Generalized Linear Models (GLM) and Generalized Additive Models (GAM) to do the analysis. Rather than analyzing a species response along environmental gradients, we'll be analyzing the distribution of environmental variables (and later species distributions) along ordination axes. Traditionally, this was done primarily by correlation or linear regression, but we can do better
than that here, I think. In our example, the first two axes (pca.1$scores[,1] and pca.1$scores[,2]) describe only about 12% of the total variance, but we'll start with a 2-dimensional approach and go from there. Based on our previous results, we might hypothesize that elevation plays a big role in the distribution of vegetation in Bryce Canyon. Let's look:
> elev.pca.glm <- glm(elev~pca.1$scores[,1]+pca.1$scores[,2])
> summary(elev.pca.glm)

Call:
glm(formula = elev ~ pca.1$scores[, 1] + pca.1$scores[, 2])

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1123.26   -361.72     19.48    396.99    964.19

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         7858.56      39.89  196.98  < 2e-16 ***
pca.1$scores[, 1]    -56.33      10.96   -5.142 8.01e-07 ***
pca.1$scores[, 2]   -105.72      14.44   -7.323 1.19e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 254655.1)

    Null deviance: 60370369  on 159  degrees of freedom
Residual deviance: 39980847  on 157  degrees of freedom
AIC: 2450.7

Number of Fisher Scoring iterations: 2

If you don't recall the syntax and form of the glm command, review Lab 4 first. This command is telling us that elev is negatively related to pca.1$scores[,1] and negatively related to pca.1$scores[,2], and that the relation to pca.1$scores[,2] is stronger than to pca.1$scores[,1], as determined by looking at the coefficients (-105.72 vs -56.33) or t-values (-7.323 vs -5.142). In addition, there are several other items of interest in the summary. First, because we did not specify otherwise, the family was taken to be Gaussian, meaning we expected normal errors unrelated to the expected value. This is correct for a variable like elevation, which is unbounded on either end (for all practical purposes), and represents a continuous value rather than a count. You should recall from Lab 4 that fitting a GLM with a Gaussian error model is equivalent to a least-squares regression. Accordingly, we could have done:
elev.pca.lm <- lm(elev~pca.1$scores[,1]+pca.1$scores[,2])
> summary(elev.pca.lm)

Call:
lm(formula = elev ~ pca.1$scores[, 1] + pca.1$scores[, 2])

Residuals:
     Min        1Q    Median        3Q       Max
-1123.26   -361.72     19.48    396.99    964.19

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)         7858.56      39.89  196.98  < 2e-16 ***
pca.1$scores[, 1]    -56.33      10.96   -5.142 8.01e-07 ***
pca.1$scores[, 2]   -105.72      14.44   -7.323 1.19e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 504.6 on 157 degrees of freedom
Multiple R-Squared: 0.3377,     Adjusted R-squared: 0.3293
F-statistic: 40.03 on 2 and 157 DF,  p-value: 8.927e-15
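The claim that the lm() R^2 matches the GLM D^2 can be verified directly from the two model objects. A sketch, assuming the elev.pca.glm and elev.pca.lm fits above:

```r
# D^2 = proportion of null deviance explained by the GLM
1 - (elev.pca.glm$deviance / elev.pca.glm$null.deviance)
# R^2 from the equivalent least-squares fit; the two values agree
summary(elev.pca.lm)$r.squared
```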

Notice that the regression coefficients, standard errors, and t-values are all the same. The linear model explains the output in terms of variance (with multiple R^2 and an F test) instead of deviance, but the model is identical. The R^2 reported is identical to the D^2 value we would calculate from the GLM. To see the model,
> require(mgcv)   # to load the gam() function
> require(akima)  # to load the interp() function

And then
> plot(pca.1)

> contour(interp(pca.1$scores[,1],pca.1$scores[,2],fitted(elev.pca.glm)),add=TRUE,col=2,labcex=0.8)

Let's take this command apart to understand it. First, contour() is a command to plot contour lines on a gridded surface. It just so happens that our contours are straight lines, but that's because we fit a linear model. contour() expects to be given a grid of values, but we only have points. To generate the grid from the points, we use interp(), which interpolates the points to all locations in the grid. interp() expects three values:

the X axis coordinates
the Y axis coordinates
the Z axis values for the grid or surface

We supply the pca.1$scores as the X and Y axes, and the fitted values of the model at each of those points as the Z axis. The fitted values (remember) are given by fitted(elev.pca.glm), which we nest in the interp() command. Finally, we specify add=TRUE to add the lines and avoid erasing the points beneath, and a color with col=2. Voila! As we knew from our interpretation of the coefficients, elevation decreases slightly along the X axis, and declines more steeply along the Y axis. It's tempting to contour the actual elevations themselves (substituting elev for fitted(elev.pca.glm) in the above interp command), but the results are generally poor. See for yourself:
contour(interp(pca.1$scores[,1],pca.1$scores[,2],elev),add=T,col=3)

The problem is that the actual elevations are too variable at a small scale, and need to be generalized to plot. More importantly, we need some idea of how well the ordination correlates to elevation, and by fitting a formal model we get much better quantification than we get by analysis of the raw data by eye. Nonetheless, the contours of the raw data give us the impression that the relation of elevation to the axes is non-linear (see the curve for the 7000 foot contour), and that perhaps we should fit a better model. Let's try:
> elev.pca2.glm <- glm(elev~pca.1$scores[,1]+I(pca.1$scores[,1]^2)+pca.1$scores[,2]+I(pca.1$scores[,2]^2))
> summary(elev.pca2.glm)

Call:
glm(formula = elev ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) +
    pca.1$scores[, 2] + I(pca.1$scores[, 2]^2))

Deviance Residuals:
     Min       1Q   Median      3Q     Max
-1222.5   -323.1     60.6   258.5   930.7

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             7628.726     64.476 118.320  < 2e-16 ***
pca.1$scores[, 1]       -117.975     17.264  -6.834 1.78e-10 ***
I(pca.1$scores[, 1]^2)    20.813      3.495   5.956 1.67e-08 ***
pca.1$scores[, 2]        -65.096     14.337  -4.540 1.12e-05 ***
I(pca.1$scores[, 2]^2)    -6.040      3.050  -1.980   0.0494 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 203030.2)

    Null deviance: 60370369  on 159  degrees of freedom
Residual deviance: 31469682  on 155  degrees of freedom
AIC: 2416.4

Number of Fisher Scoring iterations: 2

Clearly, the new model fits better, as indicated below


> anova(elev.pca.glm,elev.pca2.glm,test="Chisq")
Analysis of Deviance Table

Model 1: elev ~ pca.1$scores[, 1] + pca.1$scores[, 2]
Model 2: elev ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) +
         pca.1$scores[, 2] + I(pca.1$scores[, 2]^2)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1       157   39980847
2       155   31469682  2  8511165  7.89e-10

Are both axes necessarily quadratic? pca.1$scores[,2] is not convincingly quadratic. Let's check:
> anova(elev.pca2.glm,test="Chisq")
Analysis of Deviance Table

Model: gaussian, link: identity
Response: elev
Terms added sequentially (first to last)

                       Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                                     159   60370369
pca.1$scores[, 1]       1  6732272       158   53638097  8.492e-09
I(pca.1$scores[, 1]^2)  1 16519266       157   37118831  1.878e-19
pca.1$scores[, 2]       1  4852932       156   32265899  1.013e-06
I(pca.1$scores[, 2]^2)  1   796217       155   31469682  4.767e-02

Then again, I guess a reduction in deviance of 796217 is worth a degree of freedom after all! To get a clean plot and overlay,
> plot(pca.1)
> contour(interp(pca.1$scores[,1],pca.1$scores[,2],fitted(elev.pca2.glm)),add=T,col=2)

As you can see, the actual relation of elevation to the ordination is highly non-linear, with high elevations at the extremes of the X axis. To evaluate other environmental variables, we do equivalent analyses, adjusting the model as suitable. For example, to analyze slope, which is bounded at 0 on the low end, we might want to fit a Poisson regression to avoid negative fitted values.
> slope.pca.glm <- glm(slope~pca.1$scores[,1]+pca.1$scores[,2],family=poisson)
> summary(slope.pca.glm)

Call:
glm(formula = slope ~ pca.1$scores[, 1] + pca.1$scores[, 2],
    family = poisson)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.3070  -1.7456  -1.1582  -0.1780  14.2356

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        1.63525    0.03523  46.420  < 2e-16 ***
pca.1$scores[, 1] -0.03384    0.01050  -3.224  0.00126 **
pca.1$scores[, 2]  0.05057    0.01303   3.881 0.000104 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1145.3  on 159  degrees of freedom
Residual deviance: 1121.3  on 157  degrees of freedom
AIC: 1606.4

Number of Fisher Scoring iterations: 6

While the z-values for the coefficients are reasonably large (and statistically significant), the model has a very poor fit (D^2=0.02). Does this mean that slope is not related to the ordination, or simply that it's so strongly non-linear that we don't see it? Possibly, we have fit the wrong model. Repeating the analysis with a Gaussian model gives us:
> test <- glm(slope~pca.1$scores[,1]+pca.1$scores[,2])
> summary(test)

Call:
glm(formula = slope ~ pca.1$scores[, 1] + pca.1$scores[, 2])

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.4880  -3.5196  -2.4581  -0.4257  59.4864

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         5.2125     0.6998   7.449 5.89e-12 ***
pca.1$scores[, 1]  -0.1462     0.1922  -0.761    0.448
pca.1$scores[, 2]   0.2375     0.2532   0.938    0.350
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for gaussian family taken to be 78.3473)

    Null deviance: 12415  on 159  degrees of freedom
Residual deviance: 12301  on 157  degrees of freedom
AIC: 1156.8

Number of Fisher Scoring iterations: 2

> range(fitted(test))
[1] 2.300952 6.962712
> 1-(12301/12415)
[1] 0.0091824

Apparently, we don't have a problem with the boundedness of the fitted values (you can plot them and see, I won't bother here), but the fit is even worse now. Since we don't really have an a priori expectation of the shape of the relation, let's go straight to the GAM and let the data tell us.
> slope.pca.gam <- gam(slope~s(pca.1$scores[,1])+s(pca.1$scores[,2]),family=poisson)
> summary(slope.pca.gam)

Family: poisson
Link function: log

Formula:
slope ~ s(pca.1$scores[, 1]) + s(pca.1$scores[, 2])

Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.53611    0.03834   40.06   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Approximate significance of smooth terms:
                       edf Est.rank Chi.sq  p-value
s(pca.1$scores[, 1]) 8.012    9.000 107.01  < 2e-16 ***
s(pca.1$scores[, 2]) 8.832    9.000  73.45 3.20e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

R-sq.(adj) = -0.00758   Deviance explained = 17.9%
UBRE score = 5.0988   Scale est. = 1   n = 160
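The fitted smooths themselves can be drawn with the standard mgcv plotting method. A sketch, assuming the slope.pca.gam fit above; pages= and se= are standard plot.gam arguments:

```r
# partial effect of each ordination axis, with +/- 2 SE bands, on one page
plot(slope.pca.gam, pages=1, se=TRUE)
```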

The plot is certainly non-linear, and the fit is somewhat better


> 1-(940.1238/1145.295)
[1] 0.1791427

It certainly appears to me strongly unimodal in both axes, so let's try the quadratic GLM to see.
> test <- glm(slope~pca.1$scores[,1]+I(pca.1$scores[,1]^2)+pca.1$scores[,2]+I(pca.1$scores[,2]^2),family=poisson)
> summary(test)

Call:
glm(formula = slope ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) +
    pca.1$scores[, 2] + I(pca.1$scores[, 2]^2), family = poisson)

Deviance Residuals:
    Min       1Q   Median      3Q      Max
-3.6240  -1.8180  -0.9448  0.0715  13.0301

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)             1.957678   0.062405  31.370  < 2e-16 ***
pca.1$scores[, 1]       0.050336   0.017972   2.801  0.00510 **
I(pca.1$scores[, 1]^2) -0.019874   0.003944  -5.039 4.67e-07 ***
pca.1$scores[, 2]       0.037873   0.016818   2.252   0.0243 *
I(pca.1$scores[, 2]^2) -0.011350   0.003570  -3.180  0.00147 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 1145.3  on 159  degrees of freedom
Residual deviance: 1085.2  on 155  degrees of freedom
AIC: 1574.3

Number of Fisher Scoring iterations: 6

> anova(test,test="Chisq")
Analysis of Deviance Table

Model: poisson, link: log
Response: slope
Terms added sequentially (first to last)

                       Df Deviance Resid. Df Resid. Dev  P(>|Chi|)
NULL                                     159    1145.30
pca.1$scores[, 1]       1     9.01       158    1136.29  2.687e-03
I(pca.1$scores[, 1]^2)  1    38.33       157    1097.96  5.974e-10
pca.1$scores[, 2]       1     1.54       156    1096.41       0.21
I(pca.1$scores[, 2]^2)  1    11.19       155    1085.22  8.218e-04

Apparently it's quadratic in X but not Y. We can test the improvement by Chi-square of the deviance, at least as a heuristic.
> anova(test,slope.pca.gam,test="Chisq")
Analysis of Deviance Table

Model 1: slope ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) +
         pca.1$scores[, 2] + I(pca.1$scores[, 2]^2)
Model 2: slope ~ s(pca.1$scores[, 1]) + s(pca.1$scores[, 2])
  Resid. Df Resid. Dev     Df Deviance  P(>|Chi|)
1   155.000    1085.22
2   142.156     940.12 12.844   145.10  1.641e-24

The indication is that the GAM is better, but we have to take this with a grain of salt in this case. More to the point, all of these models are poor, so we're discriminating among a set of very low-power models. An analysis of aspect value (not shown, try it yourself) also results in a model of very poor fit.

General Surface Plotter

Plotting a variable as a surface on an ordination is such a common task that it is worthwhile to create a general ordination surface routine. Therefore, the LabDSV library includes not only a plotting routine, but a surfacing routine. It employs GAM to fit the surface (so you don't have to know a priori the shape of the response), and reports back D^2 for the fit. To use it, simply plot first, then surf(). E.g.
> plot(pca.1)
> surf(pca.1,elev)

To assess the quality of the fit in more detail than the simple D^2 and GCV score, the surf function will produce the plots of the fitted values and standard errors that we saw in Lab 4 in a separate window. For example, the call above produces


In this case the model is Gaussian, and so the values on the y axis are in the same units (feet). You can see that the model is usually estimating to within +/- 250 feet. For models which are Poisson or binomial, the y value is the log or logit respectively. Just like plot(), you can specify dimensions other than 1 or 2 (but they MUST match the dimensions of the previous plot), specify the color of the contours (with col=), and change the size of the contour labels (with labcex=). You can plot as many surfaces in a single plot as makes sense. For example:
> plot(pca.1)
> surf(pca.1,elev,labcex=1.0)
> surf(pca.1,slope,col=4,labcex=1.0)

And, you can add annotation with the text(x,y,text,...) command as desired. It's easiest to show an example.
> text(-5,6,"Elevation",col=2,adj=0)
> text(-5,5,"Slope",col=4,adj=0)

The x and y coordinates come first, then the text desired in quotes, the color you want, and optionally, text justification (adj=0 means left-justified, adj=0.5 means centered, and adj=1 means right-justified). In the next few labs we'll use the surf() function repeatedly.

Conclusion

Are we left to conclude that the vegetation is extremely complex (requiring 20 axes to get half the variance described) and that most of our environmental variables have little relation to the vegetation? Thankfully not. The problem is that PCA operates on a correlation or covariance matrix, and these structures are really not appropriate for vegetation data that spans a large amount of environmental variability, or that exhibits much beta diversity. Again, it's more complicated than I can explain briefly in HTML, but the literature is quite full of good explications on the subject. As a first alternative, we can still employ eigenanalysis, but on a more suitable matrix. If we use a distance (or dissimilarity) matrix instead of correlation or covariance, we get Principal Coordinates Analysis (PCoA or PCO), the subject of the next lab, Lab 7. On the other hand, PCA is widely employed (often inappropriately). More importantly, it allowed us to work through many areas of ordination analysis that we will draw on in future labs.
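The contrast drawn above — eigenanalysis of a covariance matrix (PCA) versus eigenanalysis of a double-centered distance matrix (PCoA) — can be sketched in a few lines. The sketch below uses Python/NumPy rather than R, with a made-up sites-by-species matrix; with plain Euclidean distances the two methods agree exactly, which is why the freedom to swap in an ecological dissimilarity (e.g. Bray-Curtis) is what makes PCoA the more flexible tool:

```python
import numpy as np

# Hypothetical sites-by-species abundance matrix (made-up numbers).
X = np.array([[5., 3., 0., 0.],
              [4., 4., 1., 0.],
              [0., 1., 4., 5.],
              [0., 0., 5., 4.]])
n = len(X)

# PCA: eigenanalysis of the sample covariance matrix of centered data.
Xc = X - X.mean(axis=0)
pca_evals = np.linalg.eigvalsh(Xc.T @ Xc / (n - 1))[::-1]

# PCoA: eigenanalysis of the double-centered (Gower-centered) matrix
# built from squared inter-sample distances. With Euclidean distances
# this recovers exactly the PCA eigenvalues; a different dissimilarity
# measure is what changes the answer.
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
J = np.eye(n) - np.ones((n, n)) / n     # centering matrix
G = -0.5 * J @ D2 @ J
pcoa_evals = np.linalg.eigvalsh(G)[::-1] / (n - 1)

print(np.round(pca_evals, 4))
print(np.round(pcoa_evals, 4))
```

The two printed eigenvalue vectors match, confirming that PCoA on a Euclidean distance matrix is just PCA by another route.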

DON'T HAVE THE LABDSV LIBRARY?


It's easiest to install it from CRAN, but to simply use the functions, cut and paste them from below.

Functions Used In This Lab


pca()

The pca function is simply a wrapper for the prcomp function that allows users to easily limit the dimensionality of the solution. In addition, it assigns the output of the function the class "pca," which is what allows the functions below to be called in a general way. This is an example of the object-orientation of S-Plus/R code, and simplifies life for programmers and users.
pca <- function (mat, cor = FALSE, dim = min(nrow(mat), ncol(mat)))
{
    tmp <- prcomp(mat, retx = TRUE, center = TRUE, scale = cor)
    out <- list()
    out$scores <- tmp$x[, 1:dim]
    out$loadings <- tmp$rotation[, 1:dim]
    out$sdev <- tmp$sdev[1:dim]
    out$totdev <- sum(tmp$sdev^2)

    class(out) <- "pca"
    return(out)
}


The cor=FALSE sets the default to the covariance matrix, rather than the correlation matrix, and is over-ruled by specifying cor=TRUE in the function call. The dim=min(nrow(mat), ncol(mat)) allows you to control the number of dimensions retained. The default retains them all.

loadings.pca()

The loadings.pca function is a simple modification to the default loadings function to control some defaults. In practice, it makes the R function behave more similarly to the S-Plus version, which I prefer. It also makes it easy to control the number of dimensions presented.
loadings.pca <- function (x, dim = length(x$sdev), digits = 3, cutoff = 0.1)
{
    if (dim > ncol(x$loadings)) {
        cat("Only", ncol(x$loadings), "axes available\n")
        dim <- ncol(x$loadings)
    }
    cat("\nLoadings:\n")
    cx <- format(round(x$loadings[, 1:dim], digits = digits))
    cx[abs(x$loadings[, 1:dim]) < cutoff] <- substring("        ", 1, nchar(cx[1, 1]))
    print(cx, quote = FALSE)
    invisible()
}

The dim= argument controls how many columns or dimensions to print, the digits= argument controls how many digits are printed for each loading, and the cutoff= argument controls how big a loading has to be to be printed. The routine is borrowed from the R summary.princomp() function.

scores.pca()

The scores.pca() function prints the scores (coordinates) of the sample plots on the eigenvectors.
scores.pca <- function (x, labels = NULL, dim = length(x$sdev))
{
    if (dim > length(x$sdev)) {
        cat("Only", length(x$sdev), "axes available\n")
        dim <- length(x$sdev)
    }
    if (!is.null(labels)) {
        cbind(labels, x$scores[, 1:dim])
    }
    else {
        x$scores[, 1:dim]
    }
}

The labels= argument allows you to specify a plot label or sample identifier (as a vector of strings) if desired. The dim= argument specifies how many dimensions to print.

plot.pca()

The plot.pca() function is an example of object-oriented programming in S-Plus/R. When you enter
> plot(pca.1)

the generic plot function notices that pca.1 has class "pca." It then looks for a function called plot.pca() to plot the object. The connection between the class of an object and the suffix of the functions establishes an interface that is invisible to users, but greatly increases the power and utility of the language.
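The dispatch mechanism itself is simple enough to mimic in a few lines. Here is a hypothetical Python sketch (not LabDSV code) of the same idea: a generic function looks up a method named for the object's class and falls back to a default, just as R's plot() looks for plot.pca() before plot.default():

```python
# S3-style dispatch sketch: the generic looks up "<generic>_<class>"
# (R uses "<generic>.<class>") and falls back to a default method.
def plot(obj, **kw):
    method = globals().get("plot_" + obj.get("class", ""), plot_default)
    return method(obj, **kw)

def plot_pca(obj, **kw):
    # The "pca" method knows the object carries ordination scores.
    return "pca ordination of %d points" % len(obj["scores"])

def plot_default(obj, **kw):
    return "default scatterplot"

print(plot({"class": "pca", "scores": [0.1, -0.3, 0.7]}))
print(plot({"class": "lm"}))
```

The caller never names the method directly; tagging the object with a class is all that is needed to route it to the right routine.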
plot.pca <- function (x, ax = 1, ay = 2, col = 1, title = "", pch = 1, ...)
{
    if (class(x) != "pca")
        stop("You must specify an object of class pca")
    oldpin <- par("pin")
    par(pin = c(min(oldpin[1], oldpin[2]), min(oldpin[1], oldpin[2])))
    xlim <- range(x$scores[, ax])
    ylim <- range(x$scores[, ay])
    tol <- 0.04
    midx <- 0.5 * (xlim[2] + xlim[1])
    midy <- 0.5 * (ylim[2] + ylim[1])
    if (xlim[2] - xlim[1] > ylim[2] - ylim[1]) {
        xlim <- midx + (1 + tol) * 0.5 * c(-1, 1) * (xlim[2] - xlim[1])
        ylim <- midy + (1 + tol) * 0.5 * c(-1, 1) * (xlim[2] - xlim[1])
    }
    else {
        xlim <- midx + (1 + tol) * 0.5 * c(-1, 1) * (ylim[2] - ylim[1])
        ylim <- midy + (1 + tol) * 0.5 * c(-1, 1) * (ylim[2] - ylim[1])
    }
    plot(x$scores[, ax], x$scores[, ay], xlim = xlim, ylim = ylim,
        col = col, xlab = paste("PCA", ax), ylab = paste("PCA", ay),
        pch = pch, main = title)
    par(pin = oldpin)
    invisible()
}

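The xlim/ylim arithmetic near the middle of plot.pca() squares up the plotting region: both axes receive the larger of the two data spans, padded by tol and centered on each axis's midpoint. A Python restatement of that arithmetic (a sketch for clarity, not LabDSV code):

```python
def equal_limits(xs, ys, tol=0.04):
    """Return square x and y limits: both axes get the larger of the
    two data spans, padded by tol, centered on each axis's midpoint,
    so one data unit covers the same distance on x and y."""
    xlo, xhi = min(xs), max(xs)
    ylo, yhi = min(ys), max(ys)
    midx, midy = 0.5 * (xlo + xhi), 0.5 * (ylo + yhi)
    half = (1 + tol) * 0.5 * max(xhi - xlo, yhi - ylo)
    return (midx - half, midx + half), (midy - half, midy + half)

xlim, ylim = equal_limits([0, 10], [3, 5])
print(xlim, ylim)   # both spans equal (1 + 0.04) * 10 = 10.4
```

Because both returned intervals have the same width, drawing them in a square device region (the par("pin") trick) yields an aspect ratio of 1.0.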
The section in the middle starting tol <- 0.04 and ending just before the call to plot calculates x and y limits that result in a plot with an aspect ratio of 1.0. It's not actually necessary in R (which has an asp=1.0 argument in its plot() function), but works in both S-Plus and R. It's borrowed fairly directly from the eqscplot() function in the MASS library from Venables and Ripley, but actually uses a different convention in the plot.

surf.pca()
surf.pca() adds contour lines for a variable onto an existing plot of a PCA. It is another example of object-orientation in S-Plus/R. Note the two lines

require(mgcv)
require(akima)

These make sure that the mgcv library (which contains the gam() function) and the akima library (which contains the interp() function) are already loaded in your S-Plus/R session.
surf.pca <- function (pca, var, x = 1, y = 2, col = 2, family = gaussian,
    gof = F, labcex = 0.8, ...)
{
    require(mgcv)
    require(akima)
    if (missing(pca)) {
        stop("You must specify a list object from pca()")
    }
    if (missing(var)) {
        stop("You must specify a variable to surface")
    }
    xval <- pca$scores[, x]
    yval <- pca$scores[, y]
    if (is.logical(var)) {
        tmp <- gam(var ~ s(xval) + s(yval), family = binomial)
    }
    else {
        tmp <- gam(var ~ s(xval) + s(yval), family = family)
    }
    if (gof != F) {
        x11()
        plot.gam(tmp)
        readline("Hit Return to Continue\n")
        dev.off()
    }
    contour(interp(xval, yval, fitted(tmp)), add = T, col = col,
        labcex = labcex, ...)
}

varplot.pca()

The varplot() function plots the variances of the first dim eigenvectors as a barplot. It replaces the default plot() function for princomp(), as the default plot() function for PCA now plots the sample scores instead of the variances.
varplot.pca <- function (pca, dim = 10)
{
    var <- pca$sdev^2
    barplot(var[1:dim], ylab = "Variance")
    readline("Hit Return to Continue\n")
    barplot(cumsum(var/pca$totdev)[1:dim], ylab = "Cumulative Variance")
}
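The arithmetic inside varplot.pca() — per-axis variance as sdev^2 and cumulative proportion against the total — is easy to check by hand. A small Python sketch of the same computation, with hypothetical standard deviations:

```python
import numpy as np

def variance_summary(sdev, dim=10):
    """Per-axis variances (sdev^2) and cumulative proportion of total
    variance, truncated to the first `dim` axes -- the two quantities
    that varplot.pca() draws as barplots."""
    var = np.asarray(sdev, dtype=float) ** 2
    d = min(dim, len(var))
    return var[:d], np.cumsum(var / var.sum())[:d]

var, cum = variance_summary([2.0, 1.0, 0.5, 0.5])
print(var)   # per-axis variances
print(cum)   # cumulative proportion, reaching 1.0 on the last axis
```

The cumulative curve is the usual screen for deciding how many ordination axes are worth interpreting.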

