### Lab 13 — Cluster Analysis

Cluster analysis is a multivariate analysis that attempts to form groups or "clusters" of objects (sample plots in our case) that are "similar" to each other but which differ among clusters. The exact definition of "similar" is variable among algorithms, but has a generic basis. The methods of forming clusters also vary, but follow a few general blueprints.

#### Similarity, Dissimilarity and Distance

Similarity is a characterization of the ratio of the number of attributes two objects share in common compared to the total list of attributes between them. Objects which have everything in common are identical, and have a similarity of 1.0. Objects which have nothing in common have a similarity of 0.0. As we have discussed previously (see Lab 8), there is a large number of similarity indices proposed and employed, but the concepts are common to all.

Dissimilarity is the complement of similarity, and is a characterization of the number of attributes two objects have uniquely compared to the total list of attributes between them. In general, dissimilarity can be calculated as 1 - similarity.

Distance is a geometric conception of the proximity of objects in a high dimensional space defined by measurements on the attributes. We've covered distance in detail under "ordination by PCO", and I refer you to that discussion for more details. Remember that R calculates distances with the dist function, and uses "euclidean", "manhattan", or "binary" as the "metric." The vegan package provides vegdist(), and labdsv provides dsvdis which together provide a large number of possible indices and metrics. Similar to the way in which these indices and metrics influenced ordination results, they similarly influence cluster analyses.

In practice, distances and dissimilarities are sometimes used interchangeably. They have quite distinct properties, however. Dissimilarities are bounded [0,1]; once plots have no species in common they can be no more dissimilar. Distances are unbounded on the upper end; plots which have no species in common have distances that depend on the number and abundance of species in the plots, and is thus variable.

#### Cluster Algorithms

Cluster algorithms are classified by two characteristics: (1) hierarchical vs fixed-cluster, and (2) if hierarchical, agglomerative vs divisive. We will explore agglomerative hierarchical clusters first, followed by fixed cluster and fuzzy fixed-cluster methods next.

In agglomerative hierarchical cluster analysis, sample plots all start out as individuals, and the two plots most similar (or least dissimilar) are fused to form the first cluster. Subsequently, plots are continually fused one-by-one in order of highest similarity (or equivalently lowest dissimilarity) to the plot or cluster to which they are most similar. The hierarchy is determined by the cluster at a height characterized by the similarity at which the plots fused to form the cluster. Eventually, all plots are contained in the final cluster at similarity 0.0

Agglomerative cluster algorithms differ in the calculation of similarity when more than one plot is involved; i.e. when a plot is considered for merger with a cluster containing more than one plot. Of all the algorithms invented, we will limit our consideration to those available in R, which are also those most commonly used. In R there are multiple methods:

• single --- nearest neighbor
• complete --- furthest neighbor or compact
• ward --- Ward's minimum variance method
• mcquitty --- McQuitty's method
• average --- average similarity
• median --- median (as opposed to average) similarity
• centroid --- geometric centroid
• flexible --- flexible Beta
In the "connected" or "single" linkage, the similarity of a plot to a cluster is determined by the maximum similarity of the plot to any of the plots in the cluster. The name "single linkage" comes because the plot only needs to be similar to a single member of the cluster to join. Single linkage clusters can be long and branched in high-dimensional space, and are subject to a phenomenon called "chaining", where a single plot is continually added to the tail of the biggest cluster. On the other hand, single linkage clusters are topologically "clean."

In the average linkage, the similarity of a plot to a cluster is defined by the mean similarity of the plot to all the members of the cluster. In contrast to single linkage, a plot needs to be relatively similar to all the members of the cluster to join, rather than just one. Average linkage clusters tend to be relatively round or ellipsoid. Median or centroid approaches tend to give similar results.

In the "complete" linkage, or "compact" algorithm, the similarity of a plot to a cluster is calculated as the minimum similarity of the plot to any member of the cluster. Similar to the single linkage algorithm, the probability of a plot joining a cluster is determined by a single other member of a cluster, but now it is the least similar, not the most similar. Complete linkage clusters tend to be very tight and spherical, thus the alternative name "compact."

#### Hierarchical Cluster Analysis in R

In R, we typically use the hclust() function to perform hierarchical cluster analysis. hclust() will calculate a cluster analysis from either a similarity or dissimilarity matrix, but plots better when working from a dissimilarity matrix. We can use any dissimilarity object from dist(), vegdist(), or dsvdis().

Give the hclust() function the dissimilarity object as the first argument, and the method or metric as the second explicit argument. E.g.

democlust<-hclust(demodist,"average")
To see the cluster analysis, simply use the plot().
plot(democlust)

The hierarchical cluster analysis is drawn as a "dendrogram", where each fusion of plots into a cluster is shown as a horizontal line, and the plots are labeled at the bottom of the graph (although often illegibly in dense graphs).

The cluster analysis can be "sliced" horizontally to produce unique clusters either by specifying a similarity or the number of clusters desired. For example, to get 5 clusters, use

democut<-cutree(democlust,k=5)
To cut the tree at a specific similarity, specify the explicit "h" argument second with the specified similarity (or "height").
democut2<-cutree(democlust,h=0.65)
Then, to label the dendrogram with the group IDs, use
plot(democlust, labels = as.character(democut))

Given the clusters, you can use the cluster IDs (in our case democut) as you would any categorical variable. For example,

table(berrep,democut)
      1  2  3  4 5
0 39 21 14 17 1
0.5  0 47  0  0 0
3  0 20  0  0 0
37.5  0  1  0  0 0

Where you can see that Berberis repens only occurs in cluster 2, but that cluster 2 also contains 21 plots without Berberis.

You can also perform environmental analyses of the clusters using the various plotting techniques we have developed. For example, to look at the distribution of plot elevations within clusters, we can do a boxplot as follows (assuming that the site data are already attached):

boxplot(elev~democut)

As another example, I'll use complete on the Bray/Curtis dissimilarity calculated in a previous lab and then follow with const()

cl.bc <- hclust(dis.bc,"complete")

Notice how in the "complete" dendrogram clusters tend to hang from the 1.0 dissimilarity line. This is because the similarity of each cluster to the others is defined by the least similar pairs among the two, which is often complete dissimilarity. If we cut this at 0.99, we get 8 clusters that are distinct.

cluster.bc <- cutree(cl.bc,h=0.99)
const(bryceveg,cluster.bc,min=0.2)

          1    2    3    4    5    6    7    8
ameuta 0.34 0.21  .    .    .    .    .    .
arcpat 0.71 0.89 0.23  0.5  .    .    .   0.25
arttri  .    .    .    .   0.85  .    .   0.25
atrcan  .    .    .    .   0.92  .    .    .
ceamar 0.52 0.84  .    .    .    .    .    .
cermon  .   0.28 0.94  .    .    .    .   1.00
.     .    .    .    .    .    .    .    .
.     .    .    .    .    .    .    .    .
.     .    .    .    .    .    .    .    .
senmul 0.60 0.47  .    .    .   0.78 0.28  .
sphcoc  .    .    .    .   0.64  .    .    .
swerad 0.26 0.32  .    .    .    .    .    .
taroff  .    .    .    .    .    .   0.42  .
towmin  .    .    .    1.0  .    .    .    .
tradub  .    .    .    .   0.21 0.21  .    .


#### Flexible-Beta

In recent years many ecologists have expressed a preference for a hierarchical clustering algorithm call "flexible Beta" developed by Australian ecologists Lance and Williams (1966). Lance and Williams determined that since all hierarchical agglomerative algorithms operate similarly, but differ in the calculation of multi-member dissimilarity, that it was possible to generalize the algorithms to a single algorithm with four parameters, alpha_1, alpha_2, beta, and gamma. Legendre and Legendre (1998) present a table (Table 8.8) that gives the L&W parameters to recreate many of the classical algorithms. Of more interest here is a special case where alpha_1 = alpha_2 = 0.625, beta = -0.25, and gamma = 0. Observe that not only are alpha_1 and alpha_2 equal, but that it is also the case that beta = 1 - (alpha_1 + alpha_2). This set of parameters gives a good intermediate result. To use flexible-beta in R you must load the package "cluster." This package contains several other functions we will want.
library(cluster)
The function we want to use is called agnes() (an abbreviation for agglomerative nesting). To perform a flexible-beta using agnes(), the call is a little more complicated than we have seen for other clustering algorithms). We need to specify alpha_1, alpha_2, beta, and gamma in the call, using the par.method argument.
demoflex <- agnes(dis.bc,method='flexible',par.method=c(0.625,0.625,-0.25))
The defaults, however, are that alpha_1 = alpha_2, beta = 1 - (alpha_1 + alpha_2), and gamma = 0. So we can simply specify alpha_1 and get the desired results.
demoflex <- agnes(dis.bc,method='flexible',par.method=0.625)

Alternatively, we can write a simple function to facilitate running flexible-beta specifying just beta.

flexbeta <- function (dis,beta)
{
alpha <- (1-beta)/2
out <- agnes(dis,meth='flex',par.method=alpha)
out
}

demoflex <- flexbeta(dis.bc,-0.25)

The plot function for an object of class "agnes" has two panels. First, the "banner plot" is given, where the height of fusions is shown as a white bar on a red background. The second plot gives the more traditional dendrogram. Both plots include the "agglomerative coefficient" = mean(1-(first merger/last merger)). An object of class "agnes' can be converted to an object of class "hclust" with the as.hclust() function, e.g.

demoflex.hcl <- as.hclust(demoflex)

Converting to an object of class "hclust" makes the results behave more similarly to other clustering results from hclust() and simplifies some analyses.

#### Fixed-cluster Algorithms or Partitions

In fixed-cluster algorithms (e.g. c-means, k-means or k-medoids), the number of clusters is specified a priori by the analyst, and the clusters are formed from either an initial guess at the cluster centers, or from random initial conditions. The clusters are not hierarchical, but rather form a partition of the data. The "goodness-of-fit" criterion (also called "stress") is generally some measure of within-cluster homogeneity versus among-cluster heterogeneity, often measured by the distance of each plot to the center of the cluster to which it belongs, compared to the average distance to other clusters.

In practice, you often don't know the number of clusters a priori, and the approach adopted is to cluster at a range of values, comparing the stress values to find the best partition. Often, clustering with the same number of clusters but a different initial guess will lead to a different final partition, so replicates at each level are often required.

The original function for fixed-cluster analysis was called "k-means" and operated in a Euclidean space. Kaufman and Rousseeuw (1990) created a function called "partitioning around medoids" which operates with any of a broad range of dissimilarities/distance. To perform fixed-cluster analysis in R we use the pam() function from the cluster library. pam uses a distance matrix output from any of our distance functions, or a raw vegetation matrix (invoking dist() on the fly). I've had better luck explicitly creating a distance matrix first, and then submitting it to pam.

demopam <- pam(dis.bc,k=5)
attributes(demopam)

$names: [1] "medoids" "id.med" "clustering" "objective" "isolation" [6] "clusinfo" "silinfo" "diss" "call"$class:
[1] "pam"       "partition"

$Call: pam(testdist, k = 5, diss = T)  The plots which serve as cluster centers (or medoids) are indicated by demopam$medoids
demopam$medoids [1] "50064" "50007" "50115" "50171" "50100" The cluster membership for each plot is given by demopam$clustering
50001 50002 50003 50004 50005 50006 50007 50008 50009 50010 50011 50012 50013
1     1     1     1     2     1     2     2     1     2     2     2     1
50014 50015 50016 50017 50018 50019 50020 50021 50022 50023 50024 50025 50026
2     2     1     2     2     2     2     2     1     2     2     2     1
50027 50028 50029 50030 50031 50032 50033 50034 50035 50036 50037 50038 50039
.     .     .     .     .     .     .     .     .     .     .     .     .
.     .     .     .     .     .     .     .     .     .     .     .     .
.     .     .     .     .     .     .     .     .     .     .     .     .
50156 50157 50158 50159 50160 50161 50162 50163 50164 50165 50166 50167 50168
4     4     4     4     4     4     4     4     4     4     4     4     4
50169 50170 50171 50172
4     4     4     4

There is a special plotting function for outputs from the cluster library, including pam(). In R, just plot the pam object.
plot(demopam)

The plot is called a "Silhouette Plot", and shows for each cluster:

• the number of plots per cluster = number of horizontal lines, also given in the right hand column,
• the means similarity of each plot to its own cluster minus the mean similarity to the next most similar cluster (given by the length of the lines) with the mean in the right hand column, and
• the average silhouette width
Plots which fit well within their cluster have a large positive Silhouette width; those which fit poorly have a small positive or even a negative Silhouette.

The $clustering values can be used just as the "cut" values from the hierarchical cluster analysis before. For example, boxplot(elev~demopam$clustering)

The two algorithms can be compared as follows:

table(democut,demopam$clustering)  1 2 3 4 5 1 2 0 0 36 1 2 43 40 4 2 0 3 0 0 0 0 14 4 0 0 17 0 0 5 0 1 0 0 0  which shows that both algorithms found a similar cluster structure, but that pam() broke cluster 2 of democut into two clusters. #### References Lance, G.N., and W.T. Williams (1966). A General Theory of Classificatory Sorting Strategies, I. Hierarchical Systems. Computer J. 9:373-380. Legendre, P. and L. Legendre (1998). Numerical Ecology. Elsevier. #### Library optpart Library optpart is available at CRAN #### Miscellaneous Scripts Here's a short script to help compare clustering results. You'll have to specify the xlim and ylim to accommodate yyour resultsm but otherwise it should work. plot(partana(stride,dis)$ratio,silhouette(stride,dis)$sil_width, xlim=c(n,m),ylim=c(n,m),type='n') text(partana(stride,dis)$ratio,silhouette(stride,dis)$sil_width, as.character(seq,col=n ) . . . . . . . . . text(partana(stride,dis)$ratio,silhouette(stride,dis)\$sil_width,
as.character(seq,col=n )