### Lab 1 — Reading and Viewing Ecological Data in R

#### Introduction

Before beginning detailed multivariate analysis of a community data set, it's important to thoroughly familiarize yourself with the the characteristics of the the community and site data. Simple, efficient graphical means are often the best way to begin. In lab one we will cover data exploration in R, one of the best packages available for EDA (exploratory data analysis). In the following, I will intersperse commentary and example R code as necessary. R code will be distinguished in a different font for emphasis; you may need to adjust your browser preferences to get a good font. Sections of R code and output will be set off in a colored box like:
Example R code
and output

To facilitate the learning process, I will employ an example data set available at:

This file, bryceveg.R, contains all the species abundance data from a community study of Bryce Canyon National Park, with the species abundance recorded by cover class (see below).

Bryce Canyon National Park is a 14000 ha park in southern Utah with vegetation representative of a transition from the cordilleran flora of the central Rocky Mountains to the Colorado Plateau. The Park vegetation ranges from small areas of badlands and sagebrush steppe at low elevations through an extensive pinyon-juniper woodland, to ponderosa pine savanna at mid-elevations, reaching closed mixed conifer forest at high elevations.

The dataset contains 160 0.1 acre circular sample plots, where the cover of all vascular plant species (except trees) was estimated by eye according to the following scale:

coderange (%)mid-pointpresence/absencenominal
+ (present)0010 .2
T0-10.510.5
11-53.011.0
25-2515.012.0
325-5037.513.0
450-7562.514.0
575-9585.015.0
695-10097.516.0

The abundance of trees was estimated by basal area (cross-sectional area of stems at breast height), but is not included in the data set because it is more representative of successional status than environmental relations. The cover scale is commonly referred to as the "Pfister" scale after R.D. Pfister, who first used it for forest vegetation analysis in the western U.S. (Pfister et al. 1977). It is similar to the Braun Blanquet or Domin scale commonly used in Europe. There are 169 vascular plant species in the data set.

Before beginning the analysis, let's get the data into R. Assuming R is installed on your computer, the first thing we want to do is create a workspace. Create a directory where you would like to work, cd to that directory, and do the following:

• Shift-Click on the file above to copy it into your chosen directory
• Start R
• in unix or linux, use an xterm to cd to the chosen directory, and type R
• in windoze, double-click on the R icon

The first objective is to convert the ascii community data file to an R object. Assuming that the files have been copied to the same directory you started R in:

If the file is in another directory, specify the path to the file (either full or relative). For example, under unix/linux,

or under Windows

or OPTIONALLY under Windows

Note that the slashes go forward rather than backward, or that backslashes need to be doubled. This is a legacy of R/S-Plus's development under unix, and you'll just have to remember it.

Alternatively, you can avoid the whole mess and use your file chooser and mouse with

#### Viewing the Data

Now that we've got the data into a data frame, we can examine the characteristics of the distribution of the data. First, we need to know there are 160 plots and 169 species in the data set. R could tell us that if we asked:
dim(veg)          # to get the dimensions of the data set
names(veg)        # to get the columns names (species in our case)
row.names(veg)    # to get the row names (plots in our case)


There are several things about the species we may wish to know:

• in how many plots does each species occur?
• what is the mean cover of each species when it occurs (not averaging zeros for plots where it is absent)?
• is the mean abundance of species correlated with the number of plots in which it occurs?
• how many species occur in each plot?
• is the total abundance of vegetation correlated with the number of species in a plot?

Here's how we can use R to answer each of these in turn. (See the "R for Ecologists" webpage for explanations of the R code we use below). The community data were recorded in what I called the "nominal" transformation above. To answer many of these questions we want to convert to the "mid-point" transformation instead.

#### Transformation of Vegetation Data

If the original dataframe is strictly numeric, and the transformation is a simple mathematical function, then we can simply operate on the whole dataframe at once. For example, to square the values in veg to get more emphasis on dominant species we could simply
vegsq <- veg^2
More commonly we want to de-emphasize dominant species, and could instead use a square root transformation.
vegsqrt <- sqrt(veg)
In our case, however, we want to convert classes to midpoints, and there is no strictly mathematical function to do this. We have a couple of choices, however. In recent versions of R, we can simply create another data frame, called cover, and assign new values with the assignment operator.
cover <- veg              # to create a copy of the veg data frame
cover[veg==1] <- 3.0      # to convert class 1 to midpoint of 3.0 percent
cover[veg==2] <- 15.0     # to convert class 2 to midpoint of 15.0 percent
cover[veg==3] <- 37.5     # to convert class 3 to midpoint of 37.5 percent
cover[veg==4] <- 62.5     # to convert class 4 to midpoint of 62.5 percent
cover[veg==5] <- 85.0     # to convert class 5 to midpoint of 85.0 percent
cover[veg==6] <- 97.5     # to convert class 6 to midpoint of 97.5 percent


An alternative is to load the labdsv library, and use the function vegtrans() as follows. (See the "R for Ecologists" webpage for details on loading libraries or packages.) First, create a vector of the existing values. Then create a vector of the transformed values. The lengths of these vectors must match, but you don't need to specify a conversion for zero. Then, call the function passing (1) the original dataframe, (2) the original value vector, and (3) the transformed value vector.

library(labdsv)
x <- c(0.2,0.5,1.0,2.0,3.0,4.0,5.0,6.0)
y <- c(0.2,0.5,3.0,15.0,37.5,62.5,85.0,97.5)
cover <- vegtrans(veg,x,y)

You can confirm the transformation worked by looking at a piece of the new dataframe.
cover[1:15,1:10]
   junost ameuta arcpat arttri atrcan berfre ceamar cerled cermon chrdep
1       0    0.0    3.0    0.0      0      0    0.5    0.0      0      0
2       0    0.5    0.5    0.0      0      0    0.0    0.0      0      0
3       0    0.0    3.0    0.0      0      0    0.5    0.0      0      0
4       0    0.5    3.0    0.0      0      0    0.5    0.0      0      0
5       0    0.0   62.5    0.0      0      0    0.5    0.0      0      0
6       0    0.5    3.0    0.0      0      0    3.0    0.0      0      0
7       0    0.0   62.5    0.0      0      0    3.0    0.0      0      0
8       0    0.0   15.0    0.0      0      0    0.0    0.0      0      0
9       0    0.0    0.0    0.0      0      0    0.0    0.0      0      0
10      0    0.0   85.0    0.0      0      0    0.5    0.0      0      0
11      0    0.0   15.0    0.0      0      0    0.5    0.5      0      0
12      0    0.0   15.0    0.5      0      0    0.5    0.0      0      0
13      0    0.0    0.5    0.0      0      0    0.5    0.0      0      0
14      0    0.0   37.5    0.0      0      0    0.5    0.0      0      0
15      0    0.5   62.5    0.0      0      0    0.5    0.0      0      0

The vegtrans() function will warn you about any values that you failed to specify a transformation for and set those values to NA.

Next, we calculate the vectors of interest.

#### How many plots does each species occur in?

spc.pres<-apply(veg>0,2,sum) # to get number of presences for each species. # Note that the first part of the function # call (veg>0) evaluates to TRUE/FALSE or 1/0), # and it is the sum of ones and zeros that # gets calculated. plot(sort(spc.pres)) # to see a plot of the cumulative empirical density # function (CEDF) for species presences
We can label the plot with a title and more interesting (and informative) labels using the main=, xlab= and ylab= arguments. For example:

plot(sort(spc.pres), main="Cumulative Distribution of Species Occurrences", xlab='Cumulative Count of Species',ylab='Number of Plots')

The figure above portrays a fairly typical distribution of species abundances. Note that the distribution is seriously skewed. We can correct for that by putting the Y axis on a log scale to achieve a semi-log plot.

plot(sort(spc.pres),log='y', main="Cumulative Distribution of Species Occurrences", xlab='Cumulative Count of Species',ylab='Number of Plots')

Notice how the Y axis is now log-scaled but retains the units in the original scale. If we had just plotted the log species abundances we would get

plot(sort(log(spc.pres)), main="Cumulative Distribution of Species Occurrences", xlab='Cumulative Count of Species',ylab='Number of Plots')

Notice here that the graph itself has not changed, but the Y axis is now scaled 0-4 (log of spc.pres), instead of 1-100. Generally, we want to transform the axis and keep the units in the original scale.

Many people are more familiar with looking at such distributions as histograms, so a short explanation is probably in order. Along the X axis is the number of species at or below a specific Y value. For example, approximately 100 out of 169 species occur in 10 or fewer plots; 130 of the species occur in 20 or fewer plots. To see the exact distributions, use the seq() function as follows:

seq(1,ncol(veg))[sort(spc.pres)==10]
[1] 103 104 105 106 107 108

which shows that first 102 species occur less than 10 times, species 103-108 occur exactly 10 times, and species 109-169 occur more often than 10 times. Let's look at this expression a little closer. The second half (the logical subscript) we have seen before, i.e. [sort(spc.pres)==10] means "those species which occur 10 times". For example, to see their names,

spc.pres[spc.pres==10]
agrdas asthum astmeg erifla eriumb tradub
10     10     10     10     10     10

To see their position in the sorted list, the seq() (sequence) function is used to generate a vector, in this case with all values from 1 to 169.
seq(1,ncol(veg))[sort(spc.pres)==10]
means, "give me the positions along the list from 1 to 169 of the sorted species abundances for those species that occur exactly 10 times."

Generally stated, most species occur infrequently, and a few are quite common. Only two species (Carex rossii and Symphoricarpos oreophilus) occur in half or more of the sample plots. To see the actual distribution, with species names attached, simply enter

sort(spc.pres)
 chrdep shearg arcuva atrcon agrscr phlpra anemul artcar astchi astten dessop
1      1      1      1      1      1      1      1      1      1      1

echtri erieat erisub genaff gerric heddru ivesab leueri ligpor litinc lygspi
1      1      1      1      1      1      1      1      1      1      1
.                         .                            .
.                         .                            .
.                         .                            .
tetcan erirac artarb juncom chrvis oryhym sithys pacmyr ceamar purtri berrep
35     36     37     39     41     43     47     48     59     64     68

74     75     81     87
where ellipses represent omitted material.

For comparison, the following figure is the same data in histogram form, plotted with

hist(spc.pres)

Whittaker (19xx) suggested that species abundance distributions sometimes follow a log-normal distribution. We can easily convert the plot to log distributions as shown below.

hist(log(spc.pres))

The figure below shows the histogram of the natural log of the spc.pres distribution.

Clearly, even after log transformation, the number of species with very few occurrences is higher than we would expect for any sort of normal distribution.

#### What is the mean cover of each species when it occurs (not averaging zeros for plots where it is absent)?

Here we simply use the apply() function to calculate the sums of the columns of the matrix veg, and then divide by the number of plots in which that species occurs.
tmp <- apply(cover,2,sum)
spc.mean <- tmp/spc.pres      # to get the average cover for each species

Notice how we can divide a vector of numbers (tmp) by another vector of numbers (spc.pres) to get yet a third vector (spc.mean). All three vectors should be the same length of course.

plot(sort(spc.mean),main="Cumulative Species Abundance Distribution", xlab="Cumulative Number of Species",ylab="Mean Abundance")

to see the ECDF of species mean abundances

Notice that this distribution is even more extreme than the previous ECDF for occurrence. Most species only occur at very low abundance, while relatively few are abundant (only four species with average cover > 10%). Perhaps more interesting than the simple distribution of mean abundance is the following.

#### Is the mean abundance of species correlated with the number of plots they occur in?

To examine this question we can plot the species mean abundance (spc.mean) as a function of the species occurrences (spc.pres) as follows.
plot(spc.pres,spc.mean)
to examine the relationship between species distribution and abundance. To see which dot is which species we can use the identify() function. identify() needs at least two arguments, the name of the variable used as the X axis and the name of the variable used as the y axis. For example

identify(spc.pres,spc.mean)

then click next to a point and its number will appear. If you use a third argument, you can get the name of the species printed next to the dot. For example

identify(spc.pres,spc.mean,names(veg))

will print the name of the species next to its dot, as names(veg) is the R function to get the columns names of the the veg matrix.

Click the second mouse button to stop listing points. The numbers of the selected points are listed in the order selected at the command line, and can be stored in a vector as follows:

listpts <- identify(spc.pres,spc.mean,names(veg))

The distribution shown is actually quite interesting. Most of the widespread species (Carex rossii, Symphoricarpos oreophilus, Senecio multifida and Berberis repens) have low mean abundance. Only Arctostaphylos patula is both widespread and abundant. In contrast, some other species (e.g. Artemisia arbuscula/nova, Stipa comata, and Quercus gambelii) are much less widespread, but abundant (if not dominant) when present. Arguably, species which are widespread or abundant (if not both), give an area's vegetation much of its character, while those species which are locally distributed may highlight sites of specific interest and add the interesting detail to the study of vegetation. We'll cover a little more of this discussion with some new functions below.

#### Is the total abundance of vegetation correlated with the number of species in a plot?

To calculate this, we will again use the apply function, this time applied to the rows rather than columns of data frame veg.
plt.pres<-apply(cover>0,1,sum)  # to calculate the number of species in each plot
plot(sort(plt.pres))            # to see the ECDF of number of species/plot


Surprisingly, perhaps, the number of species per plot ranges from 3 to 27 species per 0.1 acre. Bryce Canyon includes some rapidly eroding, xeric badlands where very few species can survive, leading to the plots with relatively few species. Alternatively, the most species rich sites are generally xeric shrubfields dominated by Artemisia arbuscula/nova or harsh woodland sites dominated by Picea pungens and Juniperus communis. Many Rocky Mountain ecologists would be surprised to see Picea pungens described as a harsh woodland species, but that describes much of its distribution in Bryce Canyon.

To plot the total cover on each plot, first calculate the vector of plot cover as follows:

plt.sum <- apply(cover,1,sum) # to calculate the total cover on each plot
and then plot as:
plot(sort(plt.sum))

Again, perhaps surprisingly, total cover per plot ranges from a minimum of 3% to just greater than 100% (107.5). As the species are estimated individually and then summed, it is easily possible to get more than 100% cover in a single plot, but few plots in Bryce Canyon achieve this high a total cover. In addition, using the cover class midpoint as the estimated cover in the sums is very likely biased too high (the discussion is too detailed to get into here), and yet the values are still quite low. Bryce Canyon is dominated by xeric sites with relatively low total plant species abundance.

Finally, to answer our question on the relation between total cover and number of species/plot,

plot(plt.pres,plt.sum)
to see the relationship between number of species/plot and total cover

Apparently, there is very little relationship between the number of species per plot and the total cover. This is not completely surprising since we already established that one of the most species-rich communities was a harsh site woodland. Still it is somewhat surprising that the sites with highest cover have fewer than the average number of species per plot.

mean(plt.pres)
[1] 13.68125
plt.pres[plt.sum>100]
18 19 20 90
12 13  9 12

Undoubtedly there are many other questions we could ask about the community data, but let's store our results in a list for easier plotting in the future, and then turn our attention to the site data.

spc.dat<-list(spc.pres=spc.pres,spc.mean=spc.mean,plt.pres=plt.pres,
plt.sum=plt.sum)


Storing our calculated vectors in a list will make them more easily accessible in the future, but is not really required, as R always saves all calculated data in the workspace, and asks whether to save your workspace when you exit R.

#### Species Abundance Calculations

I believe that the analyses we just performed (which I will call abundance/occurrence analysis) are highly informative and should be computed and plotted for any community dataset before beginning detailed analyses of the community. In later labs, I will demonstrate that the characteristics apparent in this plot are highly influential on the statistical power and suitability of many analyses.

Any time you identify a series of plots or analyses that you are likely to repeat, or that others are likely to use, you should think about writing a function to automate the process, and then distribute that function, possibly as an R package. Accordingly, I have developed an a function to repeat a variation on the analyses we just performed, called abuocc() which is in the labdsv package. To repeat the analysis, (since we have already loaded labdsv), simply enter

abuocc(cover)

and hit return as prompted. Two things you should note: (1) we could have simply done this from the beginning, but then you wouldn't have learned nearly as much R, and (2) the plots are organized slightly differently. For example, in several of the plots the Y axis is now log-scaled. In addition, many of the plots now go from maximum to minimum. This is in accordance with many studies of "niche theory," and is probably more easily visualized. To understand the differences in R, simply study the function code appended below.

#### Next Lab

In Lab 2 we will explore the site data from Bryce Canyon.

#### Functions included in this lab

abuocc <- function(veg,minabu=0)
{
spc.plt <- apply(veg>minabu,1,sum)
plt.spc <- apply(veg>minabu,2,sum)

if (minabu==0) {
mean.abu <- apply(veg,2,sum)/plt.spc
} else {
mean.abu <- rep(0,ncol(veg))
for (i in 1:ncol(veg)) {
}
}
mean.abu[is.na(mean.abu)] <- 0

plot(rev(sort(plt.spc[plt.spc>minabu])),log="y",xlab="Species Rank",
ylab="Number of Plots", main="Species Occurrence")

plot(rev(sort(spc.plt)),xlab="Plot Rank",ylab="Number of Species",
main="Species/Plot")

plot(plt.spc[mean.abu>minabu],mean.abu[mean.abu>minabu],log="y",
xlab="Number of Plots",ylab="Mean Abundance",
main="Abundance vs Occurrence")
yorn <- readline("Do you want to identify individual species? Y/N :")
if (yorn == 'Y' || yorn == 'y')
identify(plt.spc[mean.abu>minabu],mean.abu[mean.abu>minabu],names(veg))

plot(spc.plt,apply(veg,1,sum),xlab="Number of Species/Plot",
ylab="Total Abundance")
yorn <- readline("Do you want to identify individual plots? Y/N :")
if (yorn == 'Y' || yorn == 'y')
identify(spc.plt,apply(veg,1,sum),labels=row.names(veg))

invisible(list(spc.plt=spc.plt,plt.spc=plt.spc,mean=mean.abu))
}

matrify <- function (mat)
{
if (ncol(data) != 3)
stop("data frame must have three column format")
plt <- mat[,1]
spc <- mat[,2]
abu <- mat[,3]
plt.codes <- levels(factor(plt))
spc.codes <- levels(factor(spc))
taxa <- matrix(0,nrow=length(plt.codes),ncol=length(spc.codes))
row <- match(plt,plt.codes)
col <- match(spc,spc.codes)
for (i in 1:length(abu)) {
taxa[row[i],col[i]] <- abu[i]
}
taxa <- data.frame(taxa)
names(taxa) <- spc.codes
row.names(taxa) <- plt.codes
taxa
}


reconcile <- function (taxa, site)
{
if (identical(row.names(taxa), row.names(site))) {
cat("You're good to go\n")
}
else {
extra <- nrow(taxa) - sum(row.names(taxa) %in% row.names(site))
if (extra > 0)
cat(paste("You have", extra, "plots in taxa not in site\n"))
extra <- nrow(site) - sum(row.names(site) %in% row.names(taxa))
if (extra > 0)
cat(paste("You have", extra, "plots in site not in taxa\n"))
taxa <- taxa[order(row.names(taxa)), ]
site <- site[order(row.names(site)), ]
taxa <- taxa[row.names(taxa) %in% row.names(site), ]
site <- site[row.names(site) %in% row.names(taxa), ]
out <- list(taxa = taxa, site = site)
invisible(out)
}
}