Dissimilarity is the complement of similarity, and is a characterization of the number of attributes two objects have uniquely compared to the total list of attributes between them. In general, dissimilarity can be calculated as 1 - similarity.
Distance is a geometric conception of the proximity of objects in a high dimensional space defined by measurements on the attributes. We've covered distance in detail under "ordination by PCO", and I refer you to that discussion for more details. Remember that R calculates distances with the dist function, and uses "euclidean", "manhattan", or "binary" as the "metric." The vegan package provides vegdist(), and labdsv provides dsvdis which together provide a large number of possible indices and metrics. Similar to the way in which these indices and metrics influenced ordination results, they similarly influence cluster analyses.
In practice, distances and dissimilarities are sometimes used interchangeably. They have quite distinct properties, however. Dissimilarities are bounded [0,1]; once plots have no species in common they can be no more dissimilar. Distances are unbounded on the upper end; plots which have no species in common have distances that depend on the number and abundance of species in the plots, and is thus variable.
In agglomerative hierarchical cluster analysis, sample plots all start out as individuals, and the two plots most similar (or least dissimilar) are fused to form the first cluster. Subsequently, plots are continually fused one-by-one in order of highest similarity (or equivalently lowest dissimilarity) to the plot or cluster to which they are most similar. The hierarchy is determined by the cluster at a height characterized by the similarity at which the plots fused to form the cluster. Eventually, all plots are contained in the final cluster at similarity 0.0
Agglomerative cluster algorithms differ in the calculation of similarity when more than one plot is involved; i.e. when a plot is considered for merger with a cluster containing more than one plot. Of all the algorithms invented, we will limit our consideration to those available in R, which are also those most commonly used. In R there are multiple methods:
In the average linkage, the similarity of a plot to a cluster is defined by the mean similarity of the plot to all the members of the cluster. In contrast single linkage, a plot needs to be relatively similar to all the members of the cluster to join, rather than just one. Average linkage clusters tend to relatively round or ellipsoid. Median or centroid approaches tend to give similar results.
In the "complete" linkage, or "compact" algorithm, the similarity of a plot to a cluster is calculated as the minimum similarity of the plot to any member of the cluster. Similar to the single linkage algorithm, the probability of a plot joining a cluster if determined by a single other member of a cluster, but now it is the least similar, not the most similar. Complete linkage clusters tend to be very tight and spherical, thus the alternative name "compact."
Give the hclust() function the dissimilarity object as the first argument, and the method or metric as the second explicit argument. E.g.
> democlust<-hclust(demodist,"average")To see the cluster analysis, simply use the plot().
> plot(democlust)
The hierarchical cluster analysis is drawn as a "dendrogram", where each fusion of plots into a cluster is shown as a horizontal line, and the plots are labeled at the bottom of the graph (although often illegibly in dense graphs).
The cluster analysis can be "sliced" horizontally to produce unique clusters either by specifying a similarity or the number of clusters desired. For example, to get 5 clusters, use
democut<-cutree(democlust,k=5)To cut the tree at a specific similarity, specify the explicit "h" argument second with the specified similarity (or "height").
democut2<-cutree(democlust,h=0.65)Then, to label the dendrogram with the group IDs, use
plot(democlust, labels = as.character(democut))
Given the clusters, you can use the cluster IDs (in out case democut) as you would any categorical variable. For example,
>table(berrep,democut)
1 2 3 4 5
0 39 21 14 17 1
0.5 0 47 0 0 0
3 0 20 0 0 0
37.5 0 1 0 0 0
Where you can see that Berberis repens only occurs in cluster 2, but
that cluster 2 also contains 21 plots without Berberis.
You can also perform environmental analyses of the clusters using the various plotting techniques we have developed. For example, to look at the distribution of plot elevations within clusters, we can do a boxplot as follows (assuming that the site data are already attached):
> boxplot(elev~democut)
As another example, I'll use complete on the Bray/Curtis dissimilarity calculated in a previous lab and then follow with const()
> cl.bc <- hclust(dis.bc,"complete")
Notice how in the "complete" dendrogram clusters tend to hang from the 1.0 dissimilarity line. This is because the similarity of each cluster to the others is defined by the least similar pairs among the two, which is often complete dissimilarity. If we cut this at 0.99, we get 8 clusters that are distinct.
> cluster.bc <- cutree(cl.bc,h=0.99)
> const(bryceveg,cluster.bc,min=0.2)
1 2 3 4 5 6 7 8
ameuta 0.34 0.21 . . . . . .
arcpat 0.71 0.89 0.23 0.5 . . . 0.25
arttri . . . . 0.85 . . 0.25
atrcan . . . . 0.92 . . .
ceamar 0.52 0.84 . . . . . .
cermon . 0.28 0.94 . . . . 1.00
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
senmul 0.60 0.47 . . . 0.78 0.28 .
sphcoc . . . . 0.64 . . .
swerad 0.26 0.32 . . . . . .
taroff . . . . . . 0.42 .
towmin . . . 1.0 . . . .
tradub . . . . 0.21 0.21 . .
> library(cluster)The function we want to use is called agnes() (an abbreviation for agglomerative nesting). To perform a flexible-beta using agnes(), the call is a little more complicated than we have seen for other clustering algorithms). We need to specify alpha_1, alpha_2, beta, and gamma in the call, using the par.method argument.
demoflex <- agnes(dis.bc,method='flexible',par.method=c(0.625,0.625,-0.25))The defaults, however, are that alpha_1 = alpha_2, beta = 1 - alpha_1 - alpha_2, and gamm = 0. So we can simply specify alpha_1 and get the desired results.
demoflex <- agnes(dis.bc,method='flexible',par.method=0.625)The plot function for an object of class "agnes" has tow panels. First, the "banner plot" is given, where the height of fusions is shown as a white bar on a red background. The second plot gives the more traditional dendrogram. Both plots include the "agglomerative coefficient" = mean(1-(first merger/last merger)). N object of class "agnes' can be converted to an object of classs "hclust" with the as.hclust() function, e.g. demoflex.hcl <- as.hclust(demoflex)
In practice, you often don't know the number of clusters a priori, and the approach adopted is to cluster at a range of values, comparing the stress values to find the best partition. Often, clustering with the same number of clusters but a different initial guess will lead to a different final partition, so replicates at each level are often required.
The original function for fixed-cluster analysis was called "k=means" and operated in a Euclidean space. Kaufman and Rousseeuw (1990) created a function called "partitioning around medoids" whch operates with any of a broqd range of dissimilarities/distance. To perform fixed-cluster analysis in R we use the pam() function from the cluster library. pam uses a distance matrix output from any of our distance functions, or a raw vegetation matrix (invoking dist() on the fly). I've had better luck explicitly creating a distance matrix first, and then submitting it to pam.
> demopam <- pam(dis.bc,k=5) > attributes(demopam) $names: [1] "medoids" "id.med" "clustering" "objective" "isolation" [6] "clusinfo" "silinfo" "diss" "call" $class: [1] "pam" "partition" $Call: pam(testdist, k = 5, diss = T)The plots which serve as cluster centers (or medoids) are indicated by demopam$medoids
> demopam$medoids [1] "50064" "50007" "50115" "50171" "50100"The cluster membership for each plot is given by demopam$clustering
50001 50002 50003 50004 50005 50006 50007 50008 50009 50010 50011 50012 50013
1 1 1 1 2 1 2 2 1 2 2 2 1
50014 50015 50016 50017 50018 50019 50020 50021 50022 50023 50024 50025 50026
2 2 1 2 2 2 2 2 1 2 2 2 1
50027 50028 50029 50030 50031 50032 50033 50034 50035 50036 50037 50038 50039
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
50156 50157 50158 50159 50160 50161 50162 50163 50164 50165 50166 50167 50168
4 4 4 4 4 4 4 4 4 4 4 4 4
50169 50170 50171 50172
4 4 4 4
There is a special plotting function for outputs from the cluster library,
including pam(). In R, just plot the pam object.
> plot(demopam)
The plot is called a "Silhouette Plot", and shows for each cluster:
The $clustering values can be used just as the "cut" values from the hierarchical cluster analysis before. For example,
> boxplot(elev~demopam$clustering)
The two algorithms can be compared as follows:
> table(democut,demopam$clustering) 1 2 3 4 5 1 2 0 0 36 1 2 43 40 4 2 0 3 0 0 0 0 14 4 0 0 17 0 0 5 0 1 0 0 0which shows that both algorithms found a similar cluster structure.
And here's a set of functions designed to simplify analysis of multiple classification levels from either an hclust() or a pam() here