Principal Components Analysis and Redundancy Analysis

After our experience fitting individual species models to specific gradients, you may be struck by the enormity of the task of analyzing numerous species this way, and the problems inherent in summarizing the vast amounts of data generated by the procedures. As an alternative, vegetation ecologists have long sought efficient ways to summarize entire vegetation data sets, either with respect to specific environmental data, or simply to summarize the distribution of sample points in a low dimensional space where alternative environmental explanations can be explored. In our first attempts, we will employ both approaches. First, we will use Principal Components Analysis (PCA) to reduce the distribution of sample points to 2 or 3 dimensions, and plot the results in "ordinations." Depending on how successful we are at reducing the data set, we can seek patterns among the distribution of plots in ordination space, and explore possible environmental correlates with these. It's important to remember, however, that at most we can establish correlation among the variables, not causation.

PCA is an eigenvector technique that attempts to explain as much variance as possible on each of a series of orthogonal vectors spanning the vegetation space (see Kent and Coker Pp. 186- 203 for a gentle presentation; alternatively Legendre and Legendre 1998 Pp. 391-424 for a more thorough treatment). Eigenanalysis is based in linear or matrix algebra, and has a wide range of uses. Our particular interest is in finding an alternative description of our vegetation data in a low dimensional space. The first vector is chosen to account for as much variance in plot dispersion from the centroid as possible; the second is chosen by the same criterion but subject to the constraint of being orthogonal the first, and so on. The approach can be likened to regression by least squares, except that the residuals from the vector are measured perpendicular to the vector rather than perpendicular to the x or y axis.

In general, PCA is performed on a correlation or variance/covariance matrix, although equivalently, a matrix of sums-of-squares and-cross-products can be used. In R, however, the functions we will use convert the basic vegetation data into the appropriate forms, usually requiring only an argument to be specified in the function.

There are two ways to perform PCA in R: princomp() and prcomp(). Essentially, they compute the same values (technically, princomp computes an eigenanalysis and prcomp computes a singular value decomposition). To make all of the ordination labs more similar, the LabDSV package includes a function called pca() that I will use in the following. pca() simply calls prcomp, but includes different conventions for plotting and analysis that we will use with other ordination techniques in later labs.

So, we first have to load the LabDSV package, if you haven't already. If the packages are installed properly

will do the trick.

(see R for Ecologists if you don't remember how to install and access packages in R)

The pca() function accepts taxon matrices or dataframes, computing loadings for each column and scores for each row in the matrix or dataframe. The loadings are the contribution of the column vector to each of the eigenvectors. A large positive component means that that column (species in our case) is positively correlated with that eigenvector; a large negative values is negative correlation; and small values mean that the species is unrelated to that eigenvector. Often, we're more interested in the position of the sample plots in the reduced-dimensional space; these positions are given by the scores, or coordinates along the eigenvectors.

The pca() function allows you to specify a number of parameters concerning the calculation, The first is whether you want to use a correlation or covariance matrix. Again, this is a subject too detailed to expound on adequately in HTML, but generally, PCA is sensitive to the scale of measurement of the data. If all the data are not measured on the same scale, using covariance means that the result will be determined mostly by the variable with the largest values, as it will have the highest variance. Using a correlation matrix treats all variables the same (standardized to mean=0 and std. dev.=1). Even if all species were measured on the same scale (e.g. percent cover), to prevent the dominant species from determining the results, you probably want to use correlation. In pca(), this means specifying cor=TRUE in the function call. In addition, by default pca() will generate one eigenvector for every column (species), and scores for each row (plot) on each of the eigenvectors. Generally, the vast majority of the variance is described on the first few eigenvectors, and we can save space by only calculating the scores for only the first few eigenvectors. This can be specified by including dim=n where n equals the number of dimensions you want scores and loadings for. So, for example

pca.1 <- pca(veg,cor=TRUE,dim=10)

calculates a principal components analysis of the Bryce Canyon vegetation, using a correlation matrix and only calculating scores for the first 10 eigenvectors. The results of a PCA are complex, including:

Variance Explained

To see the variance and cumulative variances explained by eigenvector, type
The output looks like this:
Importance of components:
numeric matrix: 3 rows, 169 columns. 
                          Comp. 1    Comp. 2    Comp. 3   Comp. 4    Comp. 5 
    Standard deviation 3.65275545 2.77224856 2.66016722 2.5459628 2.38617490
Proportion of Variance 0.07895043 0.04547552 0.04187272 0.0383546 0.03369131
 Cumulative Proportion 0.07895043 0.12442594 0.16629866 0.2046533 0.23834456

                          Comp. 6   Comp. 7    Comp. 8    Comp. 9   Comp. 10 
    Standard deviation 2.31459860 2.1179390 2.06713660 1.90778099 1.85804579
Proportion of Variance 0.03170039 0.0265424 0.02528434 0.02153626 0.02042801
 Cumulative Proportion 0.27004496 0.2965874 0.32187170 0.34340796 0.36383598

For reasons known only to R, the top line is the standard deviation (rather than the variance) associated with each component. The next two lines are the items of primary interest, however; proportion of total variance, and cumulative proportion of variance. The dim=10 means to list only the first 10 dimensions; the default is to list ALL of them. Notice that the variance explained is in order from highest to lowest by eigenvector; this is by design. Notice also that the first ten eigenvectors explain a total of 0.3638 of the total variance of the dataset. The number of significant eigenvectors is at most min(#plots, #species)-1. Most of the eigenvectors after 100 explain a minute portion of the variance.

We can look at this same information graphically with the varplot() function.


Hit Return to Continue

The first panel shows the variance associated with each vector, and the second shows the cumulative variance as a fraction of the total.

Species Loadings

To see the species loadings, use the loadings.pca() function as follows:


A partial output looks like the following:

       Comp. 1 Comp. 2 Comp. 3 Comp. 4 Comp. 5 
arcpat  0.157                          -0.103 
arttri          0.219  -0.147           0.154
atrcan          0.203  -0.144           0.154
ceamar  0.134                               
chrvis -0.200                                 
juncom  0.100                   0.132        
pacmyr  0.145           0.111   0.154        
 .        .     .        .       .       .   
 .        .     .        .       .       .   
 .        .     .        .       .       .         

By default, loadings.pca() suppresses small values to emphasize the more important ones. Accordingly, you can see that eigenvector 1 is positively correlated with arcpat, ceamar, juncom, and pacmyr, while negatively associated with chrvis (along with many more species not included in this excerpt). The dim=5 means only list the first five components; the default is 5.

It is often the case that loadings or scores from two different computers or programs will come out with the same values but with opposite signs. The orientation of eigenvectors is arbitrary, and the sign is only meaningful with respect to other values on the same axis.

Plot Scores

To see a similar output for plot scores use the scores.pca() function.


Which looks like:

            Comp.1      Comp.2        Comp.3       Comp.4        Comp.5
  1   -0.303522539 -0.58234332  0.4964863489  0.446266093  0.0005421279
  2    0.106991234 -0.86395259  0.7710460128  1.070231898 -0.1001168298
  3   -0.325220583 -0.63188537  0.5977936134  0.816265090 -0.0339421175
  4   -0.301010236 -0.55085413  0.4413011878  0.356285781  0.0722309177
  5   -3.051327186  0.58628755 -0.0316738283  0.669026942  0.0280048460
  6   -0.459006189 -0.65657134  0.6296331674  0.562444071 -0.2394758056
  7   -2.988245257  0.69125840 -0.2506843565  0.192719353  0.0176306898
  8   -1.057166708 -0.14648668  0.1346286599  0.322338430  0.1677946043
  9    0.591108689 -1.09106087  0.7925182118  0.648834725 -0.0341675553
  10  -3.767674331  1.09240362 -0.7486658969 -0.148511254 -0.1280649255
   .        .            .            .            .            .
   .        .            .            .            .            .
   .        .            .            .            .            .
  156  0.275149769 -0.65904715  0.4639857495 -1.284139318 -0.2769789761
  157  0.755560214 -0.98124047  0.5450052352 -1.019513009 -0.2163151320
  158  0.320133102 -0.74440896  0.3037215946 -0.663608326  0.0574919737
  159  0.818248378 -0.96619683  0.4276821265 -1.105798668 -0.1866257489
  160  0.389202051 -0.69695648  0.2635548583 -0.625856568  0.0825366027

The dim=5 should be familiar by now.

These scores are what is typically plotted in a PCA "ordination." This is easily done with the plot() function. The LabDSV library includes a special version of the plot function for PCA. In S-Plus and R the default plot function plots the barplot of variances, but ecologists are usually more interested in the plot scores. The LabDSV plot function for PCA is designed to simplify plotting, minimizes typing, and most important, attempts to scale the axes identically so that we can compare distances within the ordination without distortion. Unfortunately, R/S does not always understand the exact aspect ratio of computer monitors, and so plots may not be exactly square. Generally, they're close enough, and hardcopy output will be exact.

The plot() function requires a PCA as its first argument. The default is to plot axes 1 and 2, but any of the axes computed can be plotted by specifying x= and y=. In addition, you can add a title. For example (showing R output)

plot(pca.1,title="Bryce Canyon")

Environmental Analysis

Of course the whole point of getting the first few axes of variation is to begin an analysis of the vegetation/environment relation. There are again several alternatives for achieving this. First, we will employ a technique called "Redundancy Analysis", which is described in detail by Legendre and Legendre (1998). In a nutshell, the approach is to use linear regression to replace the observed values of vegetation abundance in each plot with their fitted values, regressing each plot in turn on specified environmental variables. This sounds like a lot of work, but in R it's actually pretty easy.

First, we want to center the species abundances around their means and standardize their variance. We can do this fairly directly, by <- apply(veg,2,function(x){(x-mean(x))/sd(x)})
As you remember from Lab 1, the apply function applies a function to each column (if we specify 2 as the second argument, rows if we specify 1) in turn. The function(x){x-mean(x)} says to subtract the mean value of each column from every value in that column, and then divide by the standard deviation of that column. If you don't divide by the standard deviation, you will get results equivalent to calculating the PCA on a covariance matrix, rather than a correlation matrix, but that may be preferred in some cases. Alternatively, we can use a built-in function called scale. <- scale(veg,center=TRUE,scale=TRUE)
The center=TRUE argument means to center the columns by their mean, and the scale=TRUE means divide by the standard deviation. Later, we may wish to perform the PCA on the variance/covariance matrix of the centered vegetation matrix as opposed to the correlation matrix. The scale function is a little more graceful about missing values and other issues, and so might be preferable.

Next, we want to specify which environmental variables we want to use, and to center them as well. For example:

env <- cbind(elev,av,slope)
env.scale <- scale(env,center=TRUE,scale=TRUE)
will create a new matrix with elevation, aspect value, and slope in in it, which we then center and scale to unit standard deviation. In this case, since the environmental data represent the independent variables in the respective regression, and since they were generally measured on different scales, we want to both center and scale to unit standard deviation to simplify interpretation of the regression coefficients. Now, we want to regress each species (column in against the centered and scaled environmental variables. We could use apply to regress each column in turn, but perhaps surprisingly, the R/S linear regression function lm does this automatically if the dependent variable is a matrix. So:
veg.lm <- lm(
Now, we create the matrix of fitted values: <- fitted(veg.lm)
It's possible to view the regression coefficients of the models, but every species is a separate regression, and the details are generally overwhelming. It's sometimes worth scanning, however, to see if the same variables are judged significant for a large number of species.

Finally, we calculate the PCA on the fitted values:

rda <- pca(,cor=TRUE)

Notice how all of the variation is accounted for in only three eigenvectors. This is necessarily so, as the fitted values subjected to the PCA were the output of a linear regression with only three independent variables. We can view the ordination exactly as any other ordination:

plot(rda,title="Redundancy Analysis")

Alternatives to Redundancy Analysis

Remembering Lab 4 and Lab 5, we can use what we learned about Generalized Linear Models (GLM) and Generalized Additive Models (GAM) to do the analysis. Rather than analyzing a species response along environmental gradients, we'll be analyzing the distribution of environmental variables (and later species distributions) along ordination axes. Traditionally, this was done primarily by correlation or linear regression, but we can do better than that here I think.

In our example, the first two axes (pca.1$scores[,1] and pca.1$scores[,2]) describe only about 12% of the total variance, but we'll start with a 2-dimensional approach and go from there. Based on our previous results, we might hypothesize that elevation plays a big role in the distribution of vegetation in Bryce Canyon. Let's look:

elev.pca.glm <- glm(elev~pca.1$scores[,1]+pca.1$scores[,2])
glm(formula = elev ~ pca.1$scores[, 1] + pca.1$scores[, 2])
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1123.26   -361.72     19.48    396.99    964.19
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        7858.56      39.89 196.982  < 2e-16 ***
pca.1$scores[, 1]   -56.33      10.96  -5.142 8.01e-07 ***
pca.1$scores[, 2]  -105.72      14.44  -7.323 1.19e-11 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 254655.1)
    Null deviance: 60370369  on 159  degrees of freedom
Residual deviance: 39980847  on 157  degrees of freedom
AIC: 2450.7
Number of Fisher Scoring iterations: 2

If you don't recall the syntax and form of the glm command, review Lab 4 first. This command is telling us that elev is negatively related to pca.1$scores[,1] and negatively related to pca.1$scores[,2], and that the relation to pca.1$scores[,2] is stronger than to pca.1$scores[,1], as determined by looking at the coefficients ( -105 vs 56) or t-values ( -7.323277 vs 5.141676)

In addition, there are several other items of interest in the summary. First, because we did not specify otherwise, the family was taken to be Gaussian, meaning we expected normal errors unrelated to the expected value. This is correct for a variable like elevation, which is unbounded on either end (for all practical purposes), and represents a continuous value rather than a count. You should recall from Lab 4 that fitting a GLM with a Gaussian error model is equivalent to a least-squares regression. Accordingly, we could have done:

elev.pca.lm <- lm(elev~pca.1$scores[,1]+pca.1$scores[,2])
lm(formula = elev ~ pca.1$scores[, 1] + pca.1$scores[, 2])
     Min       1Q   Median       3Q      Max
-1123.26  -361.72    19.48   396.99   964.19
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)        7858.56      39.89 196.982  < 2e-16 ***
pca.1$scores[, 1]   -56.33      10.96  -5.142 8.01e-07 ***
pca.1$scores[, 2]  -105.72      14.44  -7.323 1.19e-11 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 504.6 on 157 degrees of freedom
Multiple R-Squared: 0.3377,     Adjusted R-squared: 0.3293
F-statistic: 40.03 on 2 and 157 DF,  p-value: 8.927e-15
Notice that the regression coefficients, standard errors, and t-values are all the same. The linear model explains the output in terms of variance (with multiple R^2 and an F test) instead of deviance, but the model is identical. The R^2 reported is identical to the D^2 value we would calculate from the GLM.

To see the model,

require(mgcv)           # to load the gam() function
require(akima)          # to load the interp() function
And then

Let's take this command apart to understand it. First, contour() is a command to plot contour lines on gridded surface. It just so happens that our contours are straight lines, but that's because we fit a linear model. contour() expects to be given a grid of values, but we only have points. To generate the grid from the points, we use interp(), which interpolates the points to all locations in the grid. interp() expects three values:

We supply the pca.1$scores as the X and Y axes, and the fitted values of the model at each of those points as the Z axis. The fitted values (remember) are given by fitted(elev.pca.glm), which we nest in the interp() command. Finally, we specify add=TRUE to add the lines and avoid erasing the points beneath, and a color with col=2. Voila! As we knew from our interpretation of the coefficients, elevation decreases slightly along the X axis, and declines more steeply along the Y axis.

It's tempting to contour the actual elevations themselves (substituting elev for fitted(elev.pca.glm) in the above interp command, but the results are generally poor. See for yourself:


The problem is that the actual elevations are too variable at a small scale, and need to be generalized to plot. More importantly, we need some idea of how well the ordination correlates to elevation, and by fitting a formal model we get much better quantification than we get by analysis of the raw data by eye.

Nonetheless, the contours of the raw data give us the impression that the relation of elevation to the axes is non-linear (see the curve for the 7000 foot contour), and that perhaps we should fit a better model. Let's try:

elev.pca2.glm <- glm(elev~pca.1$scores[,1]+I(pca.1$scores[,1]^2)+
glm(formula = elev ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) +
    pca.1$scores[, 2] + I(pca.1$scores[, 2]^2))
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1222.5   -323.1     60.6    258.5    930.7
                       Estimate Std. Error t value Pr(>|t|)
(Intercept)            7628.726     64.476 118.320  < 2e-16 ***
pca.1$scores[, 1]      -117.975     17.264  -6.834 1.78e-10 ***
I(pca.1$scores[, 1]^2)   20.813      3.495   5.956 1.67e-08 ***
pca.1$scores[, 2]       -65.096     14.337  -4.540 1.12e-05 ***
I(pca.1$scores[, 2]^2)   -6.040      3.050  -1.980   0.0494 *
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 203030.2)
    Null deviance: 60370369  on 159  degrees of freedom
Residual deviance: 31469682  on 155  degrees of freedom
AIC: 2416.4
Number of Fisher Scoring iterations: 2
Clearly, the new model fits better, as indicated below
Analysis of Deviance Table
Model 1: elev ~ pca.1$scores[, 1] + pca.1$scores[, 2]
Model 2: elev ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) + pca.1$scores[,
    2] + I(pca.1$scores[, 2]^2)
  Resid. Df Resid. Dev  Df Deviance P(>|Chi|)
1       157   39980847
2       155   31469682   2  8511165  7.89e-10

Are both axes necessarily quadratic? pca.1$scores[,2] is not convincingly quadratic. Let's check:

Analysis of Deviance Table
Model: gaussian, link: identity
Response: elev
Terms added sequentially (first to last)
                        Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                                      159   60370369
pca.1$scores[, 1]        1  6732272       158   53638097 8.492e-09
I(pca.1$scores[, 1]^2)   1 16519266       157   37118831 1.878e-19
pca.1$scores[, 2]        1  4852933       156   32265899 1.013e-06
I(pca.1$scores[, 2]^2)   1   796217       155   31469682 4.767e-02

Then again, I guess a reduction in deviance of 796217 is worth a degree of freedom after all!. To get a clean plot and overlay,


As you can see, the actual relation of elevation to the ordination is highly non-linear, with high elevations at the extremes of the X axis.

To evaluate other environmental variables, we do equivalent analyses, adjusting the model as suitable. For example, to analyze slope, which is bounded at 0 on the low end, we might want to fit a Poisson regression to avoid fitted values with a negative slope.

slope.pca.glm <- glm(slope~pca.1$scores[,1]+pca.1$scores[,2],family=poisson)
glm(formula = slope ~ pca.1$scores[, 1] + pca.1$scores[, 2],
    family = poisson)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.3070  -1.7456  -1.1582  -0.1780  14.2356
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)        1.63525    0.03523  46.420  < 2e-16 ***
pca.1$scores[, 1] -0.03384    0.01050  -3.224 0.001263 **
pca.1$scores[, 2]  0.05057    0.01303   3.880 0.000104 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1145.3  on 159  degrees of freedom
Residual deviance: 1121.3  on 157  degrees of freedom
AIC: 1606.4
Number of Fisher Scoring iterations: 6

While the z-values for the coefficients are reasonably large (and statistically significant), the model has a very poor fit (D^2=0.02). Does this mean that slope is not related to the ordination, or simply that it's so strongly non-linear that we don't see it. Possibly, we have fit the wrong model. Repeating the analysis with a Gaussian model gives us:

test <- glm(slope~pca.1$scores[,1]+pca.1$scores[,2])
glm(formula = slope ~ pca.1$scores[, 1] + pca.1$scores[, 2])
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-5.4880  -3.5196  -2.4581  -0.4257  59.4864
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)         5.2125     0.6998   7.449 5.89e-12 ***
pca.1$scores[, 1]  -0.1462     0.1922  -0.761    0.448
pca.1$scores[, 2]   0.2375     0.2532   0.938    0.350
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for gaussian family taken to be 78.3473)
    Null deviance: 12415  on 159  degrees of freedom
Residual deviance: 12301  on 157  degrees of freedom
AIC: 1156.8
Number of Fisher Scoring iterations: 2
[1] 2.300952 6.962712
[1] 0.00918244
Apparently, we don't have a problem with the boundedness of the fitted values (you can plot them and see, I won't bother here), but the fit is even worse now. Since we don't really have an a priori expectation of the shape of the relation, let's go straight to the GAM and let the data tell us.

slope.pca.gam <- gam(slope~s(pca.1$scores[,1])+s(pca.1$scores[,2]),
Family: poisson
Link function: log
slope ~ s(pca.1$scores[, 1]) + s(pca.1$scores[, 2])
Parametric coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.53611    0.03834   40.06   <2e-16 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Approximate significance of smooth terms:
                       edf Est.rank Chi.sq  p-value
s(pca.1$scores[, 1]) 8.012    9.000 107.01  < 2e-16 ***
s(pca.1$scores[, 2]) 8.832    9.000  73.45 3.20e-12 ***
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-sq.(adj) =  -0.00758   Deviance explained = 17.9%
UBRE score = 5.0988  Scale est. = 1         n = 160

The plot is certainly non-linear, and the fit is somewhat better

[1] 0.1791427
It certainly appears to me strongly unimodal in both axes, so let's try the quadratic GLM to see.
test <- glm(slope~pca.1$scores[,1]+I(pca.1$scores[,1]^2)+
glm(formula = slope ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) +
    pca.1$scores[, 2] + I(pca.1$scores[, 2]^2), family = poisson)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-3.62409  -1.81801  -0.94481   0.07151  13.03008
                        Estimate Std. Error z value Pr(>|z|)
(Intercept)             1.957678   0.062405  31.370  < 2e-16 ***
pca.1$scores[, 1]       0.050336   0.017972   2.801  0.00510 **
I(pca.1$scores[, 1]^2) -0.019874   0.003944  -5.039 4.67e-07 ***
pca.1$scores[, 2]       0.037873   0.016818   2.252  0.02433 *
I(pca.1$scores[, 2]^2) -0.011350   0.003570  -3.180  0.00147 **
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
    Null deviance: 1145.3  on 159  degrees of freedom
Residual deviance: 1085.2  on 155  degrees of freedom
AIC: 1574.3
Number of Fisher Scoring iterations: 6
Analysis of Deviance Table
Model: poisson, link: log
Response: slope
Terms added sequentially (first to last)
                        Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                                      159    1145.30
pca.1$scores[, 1]        1     9.01       158    1136.29 2.687e-03
I(pca.1$scores[, 1]^2)   1    38.33       157    1097.96 5.974e-10
pca.1$scores[, 2]        1     1.54       156    1096.41      0.21
I(pca.1$scores[, 2]^2)   1    11.19       155    1085.22 8.218e-04
Apparently it's quadratic in X but not Y. We can test the improvement by Chi-square of the deviance, at least as a heuristic.
Analysis of Deviance Table
Model 1: slope ~ pca.1$scores[, 1] + I(pca.1$scores[, 1]^2) + pca.1$scores[,
    2] + I(pca.1$scores[, 2]^2)
Model 2: slope ~ s(pca.1$scores[, 1]) + s(pca.1$scores[, 2])
  Resid. Df Resid. Dev      Df Deviance P(>|Chi|)
1   155.000    1085.22
2   142.156     940.12  12.844   145.10 1.641e-24
The indication is that the GAM is better, but we have to take this with a grain of salt in this case. More to the point, all of these models are poor, so we're discriminating among a set of very low-power models. An analysis of aspect value (not shown, try it yourself) also results in a model of very poor fit.

General Surface Plotter

Plotting a variable as a surface on an ordination is such a common task that it is worthwhile to create a general ordination surface routine. Therefore, the LabDSV library includes not only a plotting routine, but a surfacing routine. It employs GAM to fit the surface (so you don't have to know a priori the shape of the response, and reports back D^2 for the fit. To use it, simply plot first, then surf(). E.g.

Just like plot() you can specify dimensions other than 1 or 2 (but they MUST match the dimensions of the previous plot), specify the color of the contours (with col=), and change the size of the contour labels (with labcex=). You can plot as many surfaces in a single plot as makes sense. For example


And, you can add annotation with the text(x,y,text,...) command as desired. It's easiest to show an example.


The x and y coordinates come first, then the text desired in quotes, the color you want, and optionally, test justification (adj=0 means left-justified, adj=0.5 means center, and adj=1 means right justified).

In the next few labs we'll use the surf() function repeatedly.


Are we left to conclude that the vegetation is extremely complex (requiring 20 axes to get half the variance described) and that most of our environmental variables have little relation to the vegetation? Thankfully not. The problem is that PCA operates on a correlation or covariance matrix, and that these structures are really not appropriate for vegetation data that spans a large amount of environmental variability, or that exhibits much beta diversity. Again, it's more complicated than I can explain briefly in HTML, but the literature is quite full of good explications on the subject.

As a first alternative, we can still employ eigen analysis, but on a more suitable matrix. If we use a distance (or dissimilarity) matrix instead of correlation or covariance, we get Principal Coordinates Analysis (PCoA or PCO), the subject of the next lab, Lab 7.

On the other hand, PCA is widely employed (often inappropriately). More importantly, it allowed us to work through many areas of ordination analysis that we will draw on in future labs.


It's easiest to install it from CRAN , but to simply use the functions, cut and paste them from below.

Functions Used In This Lab


The pca function is simply a wrapper for the princomp function that allows users to easily limit the dimensionality of the solution. In addition, it assigns the output of the function the class "pca," which is what allows the functions below to be called in a general way. This is an example of the object-orientation of S-Plus/R code, and simplifies life for programmers and users.
pca <- function (mat, cor = FALSE, dim = min(nrow(mat), ncol(mat)))
    tmp <- prcomp(mat, retx = TRUE, center = TRUE, scale = cor)
    out <- list()
    out$scores <- tmp$x[, 1:dim]
    out$loadings <- tmp$rotation[, 1:dim]
    out$sdev <- tmp$sdev[1:dim]
    out$totdev <- sum(tmp$sdev^2)
    class(out) <- "pca"
The cor=FALSE sets the default to the covariance matrix, rather then the correlation matrix, and is over-ruled by specifying cor=TRUE in the function call. The dim=min(nrow(mat), ncol(mat)) allows you to control the number of dimensions retained. The default retains them all.


The loadings.pca function is a simple modification to the default loadings function to control some defaults. In practice, it makes the R function behave more similarly to the S-Plus version which I prefer. It also makes it easy to control the number of dimensions presented.
loadings.pca <- function (x, dim = length(x$sdev), digits = 3, cutoff = 0.1)
    if (dim > ncol(x$loadings)) {
        cat("Only", ncol(x$loadings), "axes available\n")
        dim <- ncol(x$loadings)
    cx <- format(round(x$loadings[, 1:dim], digits = digits))
    cx[abs(x$loadings[, 1:dim]) < cutoff] <- substring("       ",
        1, nchar(cx[1, 1]))
    print(cx, quote = FALSE)
The dim= argument controls how many columns or dimensions to print, the digits= argument controls how many digits are printed for each loading, and the cutoff= argument controls how big a loading has to be to be printed. The routine is borrowed from the R summary.princomp() function.


The scores.pca() function prints the scores (coordinates) of the sample plots on the eigenvectors.
scores.pca <- function (x, labels = NULL, dim = length(x$sdev))
    if (dim > length(x$sdev)) {
        cat("Only", length(x$sdev), " axes available\n")
        dim <- length(x$sdev)
    if (!is.null(labels)) {
        cbind(labels, x$scores[, 1:dim])
    else {
        x$scores[, 1:dim]
The labels= argument allows you to specify a plot label of sample identifier (as a vector of strings) if desired. The dim= argument specifies how many dimensions to print.


The plot.pca() function is an example of object-oriented programming in S-Plus/R. When you enter
the generic plot function notices that pca.1 has class "pca." It then looks for a function called plot.pca() to plot the object. The connection between the class of an object and the suffix of the functions establishes an interface that is invisible to users, but greatly increases the power and utility of the language.
plot.pca <- function (x, ax = 1, ay = 2, col = 1, title = "", pch = 1, ...) 
    if (class(x) != "pca") 
        stop("You must specify a an object of class pca")
    plot(x$scores[, ax], x$scores[, ay], asp = 1, col = col, 
        xlab = paste("PCA", ax), ylab = paste("PCA", ay), pch = pch, 
        main = title)
The section in the middle starting tol=0.04 and ending just before plot calculates x and y limits that result in a plot with an aspect ratio of 1.0. It's not actually necessary in R (which has an asp=1.0 argument in its plot() function), but works in both S-Plus and R. It's borrowed fairly directly from the eqscplot() function in the MASS library from Venables and Ripley, but actually uses a different convention in the plot.


surf.pca() adds contour lines for a variable onto an existing plot of a PCA. It is another example of object-orientation in S-Plus/R. Note the two lines
These make sure that the mgcv library (which contains the gam() function, and the akima library (which contains the interp() function) are already loaded in your S-Plus/R session.
surf.pca <- function (ord, var, ax = 1, ay = 2, col = 2, 
    labcex = 0.8, family = gaussian, 
    thinplate = TRUE, grid = 50, gamma = 1, ...) 
    if (class(ord) != "pca") 
        stop("You must specify an object of class pca")
    if (missing(var)) {
        stop("You must specify a variable to surface")
    x <- ord$scores[, ax]
    y <- ord$scores[, ay]
    if (is.logical(var)) {
        if (thinplate) {
            tmp <- gam(var ~ s(x, y), family = binomial, gamma = gamma)
        else {
            tmp <- gam(var ~ s(x) + s(y), family = binomial, 
                gamma = gamma)
    else {
        if (thinplate) {
            tmp <- gam(var ~ s(x, y), family = family, gamma = gamma)
        else {
            tmp <- gam(var ~ s(x) + s(y), family = family, gamma = gamma)
    new.x <- seq(min(x), max(x), len = grid)
    new.y <- seq(min(y), max(y), len = grid)
    xy.hull <- chull(x, y)
    xy.hull <- c(xy.hull, xy.hull[1])
    new.xy <- expand.grid(x = new.x, y = new.y)
    inside <- as.logical(pip(new.xy$x, new.xy$y, x[xy.hull], 
    fit <- predict(tmp, type = "response", newdata =
    fit[!inside] <- NA
    contour(x = new.x, y = new.y, z = matrix(fit, nrow = grid), 
        add = TRUE, col = col)
    d2 <- (tmp$null.deviance - tmp$deviance)/tmp$null.deviance
    cat(paste("D^2 = ", formatC(d2, width = 4), "\n"))


The varplot() function plots the variances of the first dim= eigenvectors as a barplot. It replaces the default plot() function for princomp(), as the default plot(),/tt> function for PCA now plots the sample scores instead of the variances.
varplot.pca <- function(pca,dim=10)  
    var <- pca$sdev^2
    readline("Hit Return to Continue\n")
    barplot(cumsum(var/pca$totdev)[1:dim],ylab="Cumulative Variance")

Broken-Stick Model to Choose Number of Dimensions

brokenstick <- function (n) 
    stick <- rep(0,n)
    for (i in 1:n) {
        for (j in i:n) stick[i] <- stick[i] + 1/j