Principal Coordinates Analysis

While PCA has proven useful in a large number of statistical studies in other fields, it is now largely of historic interest in vegetation ecology. The assumption that species exhibit monotonic (in fact, strictly linear) responses to the environment means that it's really only suitable for VERY short gradients. PCA is probably best suited to studying within-community variation of a single community type, rather than variation among communities.

One significant improvement is Principal Coordinates Analysis (PCO or PCoA), which is an eigen-analysis of a distance or dissimilarity matrix. In contrast to PCA, with PCO you can employ a broader range of distance or dissimilarity coefficients, including ones that ignore joint absences. Base R provides a function for calculating distances, the dist() function, which offers a fairly narrow range of distances (euclidean, manhattan, binary, canberra, maximum, and minkowski). However, the vegan library provides the vegdist() function, and the LabDSV library provides the dsvdis() function as alternatives that provide many more indices, including those commonly used in vegetation ecology.

Table 1. Distances and Dissimilarities

Distance, Dissimilarity, or Index    dist method    vegan method    LabDSV index
euclidean                            X              X
manhattan                            X              X
binary                               X                              steinhaus¹
sorensen                                                            X¹
canberra                             X              X
bray-curtis                                         bray            bray/curtis
gower                                               X
kulczynski                                          X
ochiai                                                              X
ruzicka                                                             X
roberts                                                             X
Chi-Square                                                          chisq
morisita                                            X
mountford                                           X
horn                                                X
minkowski                            X

¹ converts to presence/absence

Table 1 lists the distance, dissimilarity, or index functions available in base R and the two libraries. The header row gives the argument name each function expects (method or index). The first column gives the name of the index as commonly known; X indicates the index is available under that name, while another name indicates that the index is available in that function but is called something slightly different. In all cases, the first argument to the function is the vegetation matrix (transformed appropriately beforehand if desired), and the second argument is the distance, dissimilarity, or index. Since R allows function arguments to be specified by position as well as by name, you don't have to remember whether the specific function uses "metric", "method", or "index", as long as it's the second argument supplied.

For example:

result <- dist(veg,method="canberra")     
result <- dist(veg,"canberra")

result <- vegdist(veg,method="bray")      
result <- vegdist(veg,"bray")

result <- dsvdis(veg,index="ruzicka")     
result <- dsvdis(veg,"ruzicka")
are all correct function calls for base R, vegan, and LabDSV respectively. My convention is to name the result "dis" with an extension that specifies which distance, dissimilarity, or index I used. For example:

dis.euc <- dist(veg,'euclidean')
dis.bray <- vegdist(veg,method="bray")
dis.ruz <- dsvdis(veg,index="ruzicka")
R provides a function called cmdscale() to calculate what it calls "classical multi-dimensional scaling", a synonym for principal coordinates analysis. To make our work easier and more comparable to other techniques, we will use a LabDSV function called pco(), which simply calls cmdscale() with specific arguments and provides more convenient plotting routines.
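
To see what pco() does under the hood, the equivalent base R call would be the following (a minimal sketch; dis.euc is the Euclidean distance matrix computed above):

euc.cmd <- cmdscale(dis.euc, k=2, eig=TRUE)   # what pco(dis.euc) calls internally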

In contrast to PCA, the pco() function only calculates two dimensions by default. More dimensions can be specified with the k=n argument, where n is the number of dimensions desired. Finally, rather than having loadings and scores, pco() stores the plot coordinates as points and the eigenvalues as eig. Accordingly,

euc.pco <- pco(dis.euc,k=10)
says "calculate the first ten eigenvalues of the veg.euc distance matrix, and store the plot coordinates and eigenvalues." The reduction in dimensionality is shown by the relative values of the eigenvalues, which we can plot as shown
barplot(euc.pco$eig)

As you can see, the first axis accounts for most of the variability, and the first two or three account for a large fraction. Had we calculated all the eigenvalues, we would have found that the first 149 are positive, with a sum of 1469.485. Accordingly, the first two axes account for

(euc.pco$eig[1]+euc.pco$eig[2])/1469.485
[1] 0.4790171
or almost half of the total.
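
Rather than hard-coding the total, you can calculate it directly from the full set of eigenvalues (a sketch; k=nrow(veg)-1 requests the maximum possible number of axes):

euc.eig <- pco(dis.euc, k=nrow(veg)-1)$eig   # all eigenvalues
tot <- sum(euc.eig[euc.eig > 0])             # sum of the positive eigenvalues
sum(euc.eig[1:2])/tot                        # fraction on the first two axes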

To plot the ordination, just use the plot() function.

plot(euc.pco)

While Euclidean distance has a nice intuitive or familiar feel (it is, after all, how we see the world), it often does not work well with vegetation data. Like the correlation coefficient we used in PCA, Euclidean distance suffers a number of problems over high beta-diversities. While it certainly achieved a great reduction in dimensionality here (almost half in two dimensions), it did not succeed in spreading out the points very clearly.
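
A toy example illustrates one of the problems (the matrix is invented for illustration): with Euclidean distance, a plot can appear closer to a plot with which it shares no species than to a plot containing the same species at a different abundance.

toy <- rbind(plotA=c(10,0), plotB=c(1,0), plotC=c(0,1))
dist(toy)            # Euclidean: plotB is much closer to plotC than to plotA
vegdist(toy,"bray")  # Bray-Curtis: plotB and plotC are maximally dissimilar (1.0)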

Many ecologists suggest Manhattan distance as an alternative.

dis.mht <- dist(veg,"manhattan")
mht.pco <- pco(dis.mht,k=10)
plot(mht.pco)

Finally, let's compare the binary version.

dis.bin <- dist(veg,"binary")
bin.pco <- pco(dis.bin,k=10)
plot(bin.pco)

Qualitatively the shape is similar, with a little more dispersion. How different are the ordinations really? One way to compare is to look at the respective eigenvalues. To do this graphically,

plot(euc.pco$eig/sum(euc.pco$eig),type="b",xlab="Axis Number",
     ylab="Fraction of Sum")
lines(mht.pco$eig/sum(mht.pco$eig),type="b",col=2)
lines(bin.pco$eig/sum(bin.pco$eig),type="b",col=3)
text(8.0,0.45,'Euclidean')
text(8.0,0.4,"Manhattan",col=2)
text(8.0,0.35,"binary",col=3)

If any of the basic graphics commands above are unfamiliar, review R for Ecologists. As you can see, all three ordinations explain essentially the same amount of information on the first few axes, and in that sense are similar. It's important to understand, however, that the information they're summarizing is different, and they are not all saying the same thing.
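
One way to see that the underlying information differs is to correlate the distances themselves (a sketch; c() flattens each "dist" object to a plain vector):

cor(c(dis.euc), c(dis.mht))   # quantitative metrics compared with each other
cor(c(dis.euc), c(dis.bin))   # quantitative compared with presence/absence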

Using the binary ordination as an example, let's look at the distribution of species in the ordination space. We can use the points() function as modified by the LabDSV library.

plot(bin.pco,title="Berberis repens")
points(bin.pco,veg$berrep>0)

We can also plot the probability of occurrence of Berberis repens using the LabDSV function surf() (if you are unfamiliar with the LabDSV surf function, see Lab 7). The code for the PCO-specific surface plotter, surf.pco(), is listed under "Functions Used In This Lab" at the bottom of this page.

plot(bin.pco,title="Berberis repens")
surf(bin.pco,veg$berrep>0,family=binomial)
I used the family=binomial argument to get a logistic GAM, since veg$berrep>0 evaluates to presence/absence.

Family: binomial 
Link function: logit 

Formula:
var ~ s(x, y)


Estimated degrees of freedom:
10.7  total = 11.73 

UBRE score: -0.5798227
D^2 =  0.7994 

We can also examine the distribution of environmental variables on the ordination using the LabDSV surf function.

plot(bin.pco,title="Elevation")
surf(bin.pco,elev)
Family: gaussian 
Link function: identity 

Formula:
var ~ s(x) + s(y)

Estimated degrees of freedom:
 5.990454 4.841457   total =  11.83191 

GCV score:  57233.38 
D^2 =  0.8699 
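
The D^2 printed at the end of each fit is the fraction of deviance explained by the GAM, analogous to R^2 in an ordinary regression. surf() computes it from the fitted gam object tmp (see the surf.pco() listing at the bottom of this page):

d2 <- (tmp$null.deviance - tmp$deviance)/tmp$null.deviance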

We can fit other environmental variables similarly, and compare the fits by analysis of D^2. Our D^2 for elevation was 0.870, which is really pretty high. The best we could do with PCA was 0.610. Let's look at some others.

plot(bin.pco)
surf(bin.pco,elev)
surf(bin.pco,slope,col=3)
surf(bin.pco,av,col=4)
I haven't included the full output from the surf commands to save space, but the respective D^2s are 0.870, 0.163, and 0.028 for elevation, slope, and aspect value respectively. Slope probably does matter in the ordination, but aspect value clearly does not.

The plot is pretty busy, and since aspect value doesn't matter we probably should not have included it. As you can see, the relationship with elevation is quite a bit stronger than for slope (0.87 vs 0.16). In a similar fashion, we can test which of the PCOs is better with regard to possible environmental explanation.

plot(euc.pco,title="Euclidean Distance Elevation")
surf(euc.pco,elev)

Family: gaussian 
Link function: identity 

Formula:
var ~ s(x) + s(y)

Estimated degrees of freedom:
 8.583779 1.282128   total =  10.86591 

GCV score:  248765.1 
D^2 =  0.4272 

plot(mht.pco,title="Manhattan Distance Elevation")
surf(mht.pco,elev)

Family: gaussian 
Link function: identity 

Formula:
var ~ s(x) + s(y)

Estimated degrees of freedom:
 8.554883 1.000016   total =  10.5549 

GCV score:  189264.8 
D^2 =  0.5624 
plot(bin.pco,title="Binary Distance Elevation")
Family: gaussian 
Link function: identity 

Formula:
var ~ s(x) + s(y)

Estimated degrees of freedom:
 5.990454 4.841457   total =  11.83191 

GCV score:  57233.38 
D^2 =  0.8699 
surf(bin.pco,elev)

These results (and other similar examples not shown to save space) are in the table below.

Variable        Euclidean    Manhattan    binary    Bray/Curtis
elevation       0.427        0.562        0.870     0.846
slope           0.008        0.013        0.163     0.204
aspect value    0.059        0.099        0.028     0.048

As you can see, the metrics/dissimilarities generally agreed on the effect sizes of the variables, with elevation strongly dominant; aspect value was second for the distance metrics (Euclidean and Manhattan), and slope was second for the dissimilarities (binary and Bray/Curtis). I also tried the Roberts and Ruzicka dissimilarities, but the results were almost identical to Bray/Curtis and aren't presented here.

In addition, the dissimilarity PCOs generally had the strongest signal (though not for aspect value). It's not apparent here on a single data set, but the relative merits of the metrics/dissimilarities differ depending on the beta-diversity (or gradient widths) of the data set. PCO of a euclidean distance matrix is equivalent to PCA, and is best suited to VERY short gradients or VERY low beta-diversity. As beta-diversity increases, the balance shifts from Bray/Curtis to binary as the best index.
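
To assemble a table like the one above programmatically, rather than reading each D^2 off the surf() printout, something like the following sketch works (d2() is a hypothetical helper mirroring the D^2 calculation in surf.pco() below; euc.pco, mht.pco, bin.pco, elev, slope, and av are assumed from earlier in the lab):

library(mgcv)
d2 <- function(ord, var) {
    x <- ord$points[,1]
    y <- ord$points[,2]
    fit <- gam(var ~ s(x,y))   # thinplate GAM, as in surf.pco() by default
    (fit$null.deviance - fit$deviance)/fit$null.deviance
}
ords <- list(euclidean=euc.pco, manhattan=mht.pco, binary=bin.pco)
round(sapply(ords, function(o)
    c(elevation=d2(o,elev), slope=d2(o,slope), "aspect value"=d2(o,av))), 3)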

Functions Used In This Lab

pco()

pco <- function(dis, k=2)
{
        # classical multidimensional scaling of the distance matrix,
        # keeping the eigenvalues and tagging the result as class "pco"
        tmp <- cmdscale(dis, k=k, eig=TRUE)
        class(tmp) <- "pco"
        return(tmp)
}

plot.pco()

The plot.pco() function is an example of object-oriented programming in R. If you look at the function just above (pco()), the line class(tmp) <- "pco" establishes the output of the pco function as an object of class "pco." When you enter a command like
plot(euc.pco)
R first determines that object euc.pco has class "pco." It then looks for a function called plot.pco(). If it finds one, it uses that function instead of the generic plot() function.
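
You can see the pieces of this dispatch mechanism directly (a quick sketch):

class(euc.pco)   # "pco"
methods(plot)    # plot.pco appears among the available plot methods
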
plot.pco <- function (x, ax = 1, ay = 2, col = 1, title = "", pch = 1, ...) 
{
    if (missing(x)) {
        stop("You must specify an object of class pco from pco()")
    }
    plot(x$points[, ax], x$points[, ay], asp = 1, col = col, 
        xlab = paste("PCO", ax), ylab = paste("PCO", ay), pch = pch, 
        main = title, ...)
    invisible()
}

points.pco()

Like the plot.pco() function just above, points.pco() is another example of the object-orientation of R code. When you enter
points(bin.pco,veg$berrep>0)
the points() function notices that bin.pco is an object of class "pco" and looks for a function called points.pco().
points.pco <- function (x, which, ax = 1, ay = 2, col = 2, pch = 1, cex = 1,
    ...)
{
    if (missing(x)) {
        stop("You must specify an object of class pco from pco()")
    }
    if (missing(which)) {
        stop("You must specify a logical subscript")
    }
    points(x$points[, ax][which], x$points[, ay][which], col = col,
        pch = pch, cex = cex, ...)
}

surf.pco()

Yet another example of object-orientation in R code. See plot.pco() just above for more detail if you are unfamiliar with this concept.
surf.pco <- function (ord, var, ax = 1, ay = 2, thinplate = TRUE, col = 2, 
    labcex = 0.8, family = gaussian, grid = 50, gamma = 1, ...) 
{
    if (class(ord) != "pco") 
        stop("You must supply an object of class pco from pco()")
    if (missing(var)) {
        stop("You must specify a variable to surface")
    }
    x <- ord$points[, ax]
    y <- ord$points[, ay]
    if (any(is.na(var))) {
        cat("Omitting plots with missing values \n")
        x <- x[!is.na(var)]
        y <- y[!is.na(var)]
        var <- var[!is.na(var)]
    }
    if (is.logical(var)) {
        if (thinplate) 
            tmp <- gam(var ~ s(x, y), gamma = gamma, family = binomial)
        else tmp <- gam(var ~ s(x) + s(y), family = binomial, 
            gamma = gamma)
    }
    else {
        if (thinplate) 
            tmp <- gam(var ~ s(x, y), gamma = gamma, family = family)
        else tmp <- gam(var ~ s(x) + s(y), family = family, gamma = gamma)
    }
    new.x <- seq(min(x), max(x), len = grid)
    new.y <- seq(min(y), max(y), len = grid)
    xy.hull <- chull(x, y)
    xy.hull <- c(xy.hull, xy.hull[1])
    new.xy <- expand.grid(x = new.x, y = new.y)
    inside <- as.logical(pip(new.xy$x, new.xy$y, x[xy.hull], 
        y[xy.hull]))
    fit <- predict(tmp, type = "response", newdata = as.data.frame(new.xy))
    fit[!inside] <- NA
    contour(x = new.x, y = new.y, z = matrix(fit, nrow = grid), 
        add = TRUE, col = col)
    print(tmp)
    d2 <- (tmp$null.deviance - tmp$deviance)/tmp$null.deviance
    cat(paste("D^2 = ", formatC(d2, width = 4), "\n"))
}
ellip.pco()

The ellip.pco() function draws an ellipse around each cluster or group of plots on an existing PCO ordination plot, following the same object-oriented pattern as the functions above; a hypothetical usage sketch follows the listing.

ellip.pco <- function (ord, overlay, ax = 1, ay = 2, cols = c(2, 3, 4, 5, 
    6, 7), ltys = c(1, 2, 3), ...) 
{
    if (class(ord) != "pco") 
        stop("You must pass an object of class pco")
    require(cluster)
    if (inherits(overlay, c("partana", "pam", "clustering"))) {
        overlay <- overlay$clustering
        if (min(overlay) < 0 || (length(table(overlay)) != max(overlay))) {
            cat("WARNING: renumbering clusters to consecutive integers\n")
            overlay <- match(overlay, sort(unique(overlay)))
        }
    }
    else if (is.logical(overlay)) 
        overlay <- as.numeric(overlay)
    else if (is.factor(overlay)) 
        overlay <- as.numeric(overlay)
    pass <- 1
    layer <- 0
    lty <- ltys[pass]
    for (i in 1:max(overlay, na.rm = TRUE)) {
        x <- ord$points[, ax][overlay == i & !is.na(overlay)]
        y <- ord$points[, ay][overlay == i & !is.na(overlay)]
        pts <- chull(x, y)
        layer <- layer + 1
        if (layer > length(cols)) {
            layer <- 1
            pass <- min(pass + 1, length(ltys))
        }
        col <- cols[layer]
        lty = ltys[pass]
        x <- as.matrix(cbind(x[pts], y[pts]))
        elp <- ellipsoidhull(x)
        print(elp)
        lines(predict(elp),col=col)
    }
}
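
A hypothetical usage, assuming the plots have been clustered (clust is an illustrative name; pam() is partitioning around medoids from the cluster package, and dis.bray is the Bray-Curtis dissimilarity matrix from earlier):

library(cluster)
clust <- pam(dis.bray, k=5)   # hypothetical: partition the plots into 5 groups
plot(bin.pco)
ellip.pco(bin.pco, clust)     # one ellipse per cluster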