Dissimilarity is the complement of similarity, and is a characterization of the number of attributes two objects have uniquely compared to the total list of attributes between them. In general, dissimilarity can be calculated as 1 - similarity.
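As a minimal sketch of that calculation, the code below computes a Jaccard similarity and its complement for two made-up presence/absence plots. The vectors `plot_a` and `plot_b` are hypothetical and are not part of the lab data.

```r
# Two hypothetical plots described by presence (1) / absence (0) of six species
plot_a <- c(1, 1, 1, 0, 0, 0)
plot_b <- c(0, 1, 1, 1, 0, 0)

shared <- sum(plot_a == 1 & plot_b == 1)   # species found in both plots
total  <- sum(plot_a == 1 | plot_b == 1)   # species found in either plot

similarity    <- shared / total            # Jaccard similarity
dissimilarity <- 1 - similarity            # its complement

# dist() with method = "binary" computes the same Jaccard dissimilarity
d_binary <- as.numeric(dist(rbind(plot_a, plot_b), method = "binary"))
```

Here the hand calculation and `dist()` should agree, since the "binary" method is the proportion of mismatched species among those present in at least one plot.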

Distance is a geometric conception of the proximity of objects in a high
dimensional space defined by measurements on the attributes. We've covered
distance in detail under "ordination by PCO", and I refer you to that discussion
for more details. Remember that R calculates distances with the `dist()`
function, using "euclidean", "manhattan", or "binary" (among others) as the `method`. The
vegan package provides `vegdist()`, and labdsv provides `dsvdis()`,
which together provide a large number of possible indices and metrics.
Just as these indices and metrics influenced ordination results, they also
influence cluster analyses.

In practice, distances and dissimilarities are sometimes used interchangeably. They have quite distinct properties, however. Dissimilarities are bounded on [0,1]; once plots have no species in common they can be no more dissimilar. Distances are unbounded on the upper end; plots which have no species in common have distances that depend on the number and abundance of species in the plots, and are thus variable.
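To make the contrast concrete, here is a small sketch using made-up abundances. Each pair of plots shares no species, but one pair is sparse and the other abundant. The helper functions `bray()` and `euclid()` are written here for illustration only; they are not functions from vegan or labdsv.

```r
# Hypothetical abundance data: within each pair, the plots share no species
sparse_a <- c(1, 2, 0, 0)
sparse_b <- c(0, 0, 2, 1)
rich_a   <- c(10, 20, 0, 0)
rich_b   <- c(0, 0, 20, 10)

# Bray/Curtis dissimilarity written out in base R (illustrative helper)
bray <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Euclidean distance via dist() (illustrative helper)
euclid <- function(x, y) as.numeric(dist(rbind(x, y), method = "euclidean"))

bray_sparse <- bray(sparse_a, sparse_b)
bray_rich   <- bray(rich_a, rich_b)
euc_sparse  <- euclid(sparse_a, sparse_b)
euc_rich    <- euclid(rich_a, rich_b)
```

Both Bray/Curtis values sit at the upper bound, while the Euclidean distance grows with the abundances even though the plots are equally "different" in composition.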

In agglomerative hierarchical cluster analysis, sample plots all start out as individuals, and the two plots most similar (or least dissimilar) are fused to form the first cluster. Subsequently, plots are fused one at a time, in order of highest similarity (or equivalently lowest dissimilarity), to the plot or cluster to which they are most similar. The hierarchy is defined by the height of each cluster, which is the similarity at which its plots fused. Eventually, all plots are contained in the final cluster at similarity 0.0.
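The fusion sequence can be inspected directly on a toy example. The sketch below uses a made-up dissimilarity matrix for four plots (A to D) and average linkage; the `merge` and `height` components of the result record which plots or clusters fused and at what dissimilarity.

```r
# Hypothetical dissimilarities among four plots
m <- matrix(c(0.0, 0.2, 0.7, 0.9,
              0.2, 0.0, 0.6, 0.8,
              0.7, 0.6, 0.0, 0.3,
              0.9, 0.8, 0.3, 0.0),
            nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
d <- as.dist(m)

toy <- hclust(d, method = "average")

toy$merge    # negative entries are single plots, positive entries earlier clusters
toy$height   # dissimilarity at which each fusion occurred
```

In this example A and B fuse first, then C and D, and the two pairs fuse last at the mean of the four between-pair dissimilarities.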

Agglomerative cluster algorithms differ in the calculation of similarity when more than one plot is involved; i.e. when a plot is considered for merger with a cluster containing more than one plot. Of all the algorithms invented, we will limit our consideration to those available in R, which are also those most commonly used. In R there are multiple methods:

- single --- nearest neighbor
- complete --- furthest neighbor or compact
- ward --- Ward's minimum variance method (spelled "ward.D" or "ward.D2" in current versions of `hclust()`)
- mcquitty --- McQuitty's method
- average --- average similarity
- median --- median (as opposed to average) similarity
- centroid --- geometric centroid
- flexible --- flexible Beta

In the average linkage, the similarity of a plot to a cluster is defined by the mean similarity of the plot to all the members of the cluster. In contrast to single linkage, a plot needs to be relatively similar to all the members of the cluster to join, rather than just one. Average linkage clusters tend to be relatively round or ellipsoid. Median or centroid approaches tend to give similar results.

In the "complete" linkage, or "compact" algorithm, the similarity of a plot to a cluster is calculated as the minimum similarity of the plot to any member of the cluster. Similar to the single linkage algorithm, the probability of a plot joining a cluster is determined by a single other member of a cluster, but now it is the least similar, not the most similar. Complete linkage clusters tend to be very tight and spherical, thus the alternative name "compact."
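The linkage rules differ only in how the dissimilarities from a candidate plot to the members of a cluster are summarized. A minimal sketch, using three made-up dissimilarities (the vector `to_members` is hypothetical):

```r
# Hypothetical dissimilarities from one candidate plot to three cluster members
to_members <- c(0.25, 0.40, 0.85)

single_link   <- min(to_members)    # most similar member decides
average_link  <- mean(to_members)   # every member contributes equally
complete_link <- max(to_members)    # least similar member decides
```

Expressed as dissimilarities, the maximum corresponds to the minimum similarity used by complete linkage, which is why a plot has to be close to every member before it can join a "compact" cluster.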

Give the `hclust()` function the dissimilarity object as the first
argument, and the method or metric as the second explicit argument. E.g.

democlust<-hclust(demodist,"average")

To see the cluster analysis, simply plot the result:

plot(democlust)

The hierarchical cluster analysis is drawn as a "dendrogram", where each fusion of plots into a cluster is shown as a horizontal line, and the plots are labeled at the bottom of the graph (although often illegibly in dense graphs).

The cluster analysis can be "sliced" horizontally to produce unique clusters either by specifying a similarity or the number of clusters desired. For example, to get 5 clusters, use

democut<-cutree(democlust,k=5)

To cut the tree at a specific similarity, give the similarity (or "height") as the explicit `h` argument instead:

democut2<-cutree(democlust,h=0.65)

Then, to label the dendrogram with the group IDs, use

plot(democlust, labels = as.character(democut))
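The two ways of slicing a tree can be compared on a toy example. This sketch builds a small average-linkage tree from made-up dissimilarities (not the lab data) and cuts it once by number of clusters and once by height.

```r
# Hypothetical dissimilarities among four plots
m <- matrix(c(0.0, 0.2, 0.7, 0.9,
              0.2, 0.0, 0.6, 0.8,
              0.7, 0.6, 0.0, 0.3,
              0.9, 0.8, 0.3, 0.0),
            nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))
toy <- hclust(as.dist(m), method = "average")

by_k <- cutree(toy, k = 2)     # ask for two clusters
by_h <- cutree(toy, h = 0.5)   # cut wherever the tree crosses height 0.5

table(by_k, by_h)              # cross-tabulate the two partitions
```

Because the last fusion in this toy tree happens above 0.5 and the others below it, both cuts give the same two groups; on real data a height generally does not map so neatly onto a chosen number of clusters.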

Given the clusters, you can use the cluster IDs (in our case democut) as you would any categorical variable. For example,

table(berrep,democut)

          democut
    berrep  1  2  3  4  5
      0    39 21 14 17  1
      0.5   0 47  0  0  0
      3     0 20  0  0  0
      37.5  0  1  0  0  0

You can also perform environmental analyses of the clusters using the various plotting techniques we have developed. For example, to look at the distribution of plot elevations within clusters, we can do a boxplot as follows (assuming that the site data are already attached):

boxplot(elev~democut)

As another example, I'll use `complete` linkage on the Bray/Curtis
dissimilarity calculated in a previous lab,
and then follow with `const()` to tabulate species constancy by cluster.

cl.bc <- hclust(dis.bc,"complete")

Notice how in the "complete" dendrogram clusters tend to hang from the 1.0 dissimilarity line. This is because the similarity of each cluster to the others is defined by the least similar pairs among the two, which is often complete dissimilarity. If we cut this at 0.99, we get 8 clusters that are distinct.

cluster.bc <- cutree(cl.bc,h=0.99)

const(bryceveg,cluster.bc,min=0.2)

               1    2    3    4    5    6    7    8
    ameuta  0.34 0.21    .    .    .    .    .    .
    arcpat  0.71 0.89 0.23  0.5    .    .    . 0.25
    arttri     .    .    .    . 0.85    .    . 0.25
    atrcan     .    .    .    . 0.92    .    .    .
    ceamar  0.52 0.84    .    .    .    .    .    .
    cermon     . 0.28 0.94    .    .    .    . 1.00
    .          .    .    .    .    .    .    .    .
    .          .    .    .    .    .    .    .    .
    .          .    .    .    .    .    .    .    .
    senmul  0.60 0.47    .    .    . 0.78 0.28    .
    sphcoc     .    .    .    . 0.64    .    .    .
    swerad  0.26 0.32    .    .    .    .    .    .
    taroff     .    .    .    .    .    . 0.42    .
    towmin     .    .    .  1.0    .    .    .    .
    tradub     .    .    .    . 0.21 0.21    .    .

library(cluster)

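The function we want to use is called `agnes()`. For flexible-beta clustering, specify `method='flexible'` and give alpha_1, alpha_2, and beta in `par.method`:

demoflex <- agnes(dis.bc,method='flexible',par.method=c(0.625,0.625,-0.25))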

The defaults, however, are that alpha_1 = alpha_2, beta = 1 - (alpha_1 + alpha_2),
and gamma = 0. So we can simply specify alpha_1 and get the desired results.

demoflex <- agnes(dis.bc,method='flexible',par.method=0.625)

Alternatively, we can write a simple function to facilitate running flexible-beta specifying just beta.

    flexbeta <- function (dis,beta)
    {
        # flexible-beta with alpha_1 = alpha_2 = (1-beta)/2 and gamma = 0
        alpha <- (1-beta)/2
        out <- agnes(dis,method='flexible',par.method=alpha)
        out
    }

demoflex <- flexbeta(dis.bc,-0.25)

demoflex.hcl <- as.hclust(demoflex)

Converting to an object of class "hclust" makes the results behave more similarly to
other clustering results from `hclust()` and simplifies some analyses.
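As a quick check of the parameterization described above, the sketch below runs `agnes()` both ways and compares the fusion heights. It uses a subset of the built-in `USArrests` data purely as a stand-in for a vegetation dissimilarity matrix; that choice, and the variable names, are assumptions for illustration.

```r
library(cluster)

# Stand-in dissimilarities from a built-in data set (not the lab's vegetation data)
d <- dist(scale(USArrests[1:15, ]))

beta  <- -0.25
alpha <- (1 - beta) / 2      # 0.625

short_form <- agnes(d, method = "flexible", par.method = alpha)
long_form  <- agnes(d, method = "flexible", par.method = c(alpha, alpha, beta))

# The two specifications should describe the same tree
same_heights <- isTRUE(all.equal(short_form$height, long_form$height))

# Convert to class "hclust" so that cutree() and friends can be used
flex_hcl <- as.hclust(short_form)
groups   <- cutree(flex_hcl, k = 3)
```

If `same_heights` is `TRUE`, giving the single alpha value is just a shorthand for the three-parameter form.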

In practice, you often don't know the number of clusters
*a priori*, and the approach adopted is to cluster at a range
of values, comparing the stress values to find the best partition.
Often, clustering with the same number of clusters but a different
initial guess will lead to a different final partition, so
replicates at each level are often required.

The original function for fixed-cluster analysis was called "k-means" and
operated in a Euclidean space. Kaufman and Rousseeuw (1990) created a function
called "partitioning around medoids" which operates with any of a broad range of
dissimilarities/distance.
To perform fixed-cluster analysis in R we use the `pam()`
function from the `cluster` library.
`pam` uses a distance matrix output from any of our distance functions,
or a raw vegetation matrix (invoking `dist()` on the fly).
I've had better luck explicitly creating a distance matrix first, and
then submitting it to `pam`.
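One way to explore a range of cluster numbers is to loop over k and record a summary of each partition. The sketch below uses the average silhouette width reported by `pam()` as that summary, which is one possible criterion chosen here for illustration rather than the only reasonable one. The built-in `USArrests` data stand in for a vegetation dissimilarity matrix.

```r
library(cluster)

# Stand-in dissimilarities from a built-in data set (not the lab's vegetation data)
d <- dist(scale(USArrests))

ks <- 2:6

# Average silhouette width for each number of clusters
avg_width <- sapply(ks, function(k) pam(d, k = k)$silinfo$avg.width)
names(avg_width) <- ks

# Number of clusters with the largest average silhouette width
best_k <- ks[which.max(avg_width)]
```

A plot of `avg_width` against `ks` is often easier to judge than the single best value, since several values of k may perform about equally well.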

demopam <- pam(dis.bc,k=5)

attributes(demopam)

    $names:
    [1] "medoids"    "id.med"     "clustering" "objective"  "isolation"
    [6] "clusinfo"   "silinfo"    "diss"       "call"

    $class:
    [1] "pam"       "partition"

    $Call:
    pam(testdist, k = 5, diss = T)

demopam$medoids

[1] "50064" "50007" "50115" "50171" "50100"

The cluster membership for each plot is given by

demopam$clustering

    50001 50002 50003 50004 50005 50006 50007 50008 50009 50010 50011 50012 50013
        1     1     1     1     2     1     2     2     1     2     2     2     1
    50014 50015 50016 50017 50018 50019 50020 50021 50022 50023 50024 50025 50026
        2     2     1     2     2     2     2     2     1     2     2     2     1
    50027 50028 50029 50030 50031 50032 50033 50034 50035 50036 50037 50038 50039
        .     .     .     .     .     .     .     .     .     .     .     .     .
        .     .     .     .     .     .     .     .     .     .     .     .     .
        .     .     .     .     .     .     .     .     .     .     .     .     .
    50156 50157 50158 50159 50160 50161 50162 50163 50164 50165 50166 50167 50168
        4     4     4     4     4     4     4     4     4     4     4     4     4
    50169 50170 50171 50172
        4     4     4     4

plot(demopam)

The plot is called a "Silhouette Plot", and shows for each cluster:

- the number of plots per cluster = number of horizontal lines, also given in the right hand column,
- the mean similarity of each plot to its own cluster minus the mean similarity to the next most similar cluster (given by the length of the lines) with the mean in the right hand column, and
- the average silhouette width
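The silhouette width of a single plot can be worked out by hand and checked against `silhouette()` from the cluster package. In terms of dissimilarities it is (b - a) / max(a, b), where a is the mean dissimilarity to the other members of the plot's own cluster and b is the mean dissimilarity to the nearest other cluster. The sketch below again uses `USArrests` as stand-in data, and picks a plot from the largest cluster so that its own cluster is guaranteed to have other members.

```r
library(cluster)

# Stand-in dissimilarities from a built-in data set (not the lab's vegetation data)
d  <- dist(scale(USArrests[1:20, ]))
dm <- as.matrix(d)
cl <- pam(d, k = 3)$clustering

# Pick the first plot in the largest cluster
largest <- as.integer(names(which.max(table(cl))))
i <- which(cl == largest)[1]

own <- cl == cl[i]
own[i] <- FALSE                       # exclude the plot itself
a <- mean(dm[i, own])                 # mean dissimilarity to its own cluster

others <- setdiff(unique(cl), cl[i])
b <- min(sapply(others, function(k) mean(dm[i, cl == k])))  # nearest other cluster

s_manual <- (b - a) / max(a, b)

# silhouette() on a clustering vector returns rows in the original plot order
s_pkg <- unname(silhouette(cl, d)[i, "sil_width"])
```

Values near 1 indicate a plot that fits its cluster well, values near 0 a plot on the boundary between two clusters, and negative values a plot that is closer on average to another cluster.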

The $clustering values can be used just as the "cut" values from the hierarchical cluster analysis before. For example,

boxplot(elev~demopam$clustering)

The two algorithms can be compared as follows:

table(democut,demopam$clustering)

    democut  1  2  3  4  5
          1  2  0  0 36  1
          2 43 40  4  2  0
          3  0  0  0  0 14
          4  0  0 17  0  0
          5  0  1  0  0  0


    plot(partana(stride,dis)$ratio, silhouette(stride,dis)$sil_width,
         xlim=c(n,m), ylim=c(n,m), type='n')
    text(partana(stride,dis)$ratio, silhouette(stride,dis)$sil_width,
         as.character(seq), col=n)
    . . . . . . . . .
    text(partana(stride,dis)$ratio, silhouette(stride,dis)$sil_width,
         as.character(seq), col=n)