Variables are assigned a value in an assignment statement, which in R has the variable name to the left of a left-pointing arrow (typed as a "less than" sign followed by a "dash") with the value to the right of the arrow. For example,
number.species <- 137
Recent versions of R allow you to use the = sign for an assignment, i.e.
number.species = 137
but I will stick with the older, more elegant arrow.
R allows the creation of variables which contain numeric values (both integers and floating point or real numbers), characters, or special characters interpreted as "logical" values. For example
x <- 1.2345
small.value <- 1.0e-10
species.name <- 'Pinus contorta'
conifer <- TRUE
Notice that real or floating point numbers can be entered with just a decimal point, or in exponential notation, where 1.0e-10 means 1.0 x 10^-10. Notice also that character variables, called "strings," should be entered in quotes (single or double, it doesn't matter as long as they match). Finally, note that the word TRUE is NOT surrounded by quotes. This is not the WORD TRUE, but rather the VALUE TRUE. Logical variables can only take the values TRUE or FALSE.
Unlike many programming languages (e.g. FORTRAN or C) you do not have to tell R what kind of value (integer, real, or character) a variable will contain; it can tell when the variable is assigned. R will only allow the appropriate operations to be performed on a variable. For example
species.name + 37
Error in species.name + 37 : non-numeric argument to binary operator
R did not allow me to add 37 to species.name because species.name was a character variable.
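If you want to see which type R has inferred for a variable, the class() function reports it. The following is a minimal sketch using the variables assigned above; the is.numeric() guard is just one illustrative way to avoid the error, not the only one.

```r
# Variables from the examples above
x <- 1.2345
species.name <- 'Pinus contorta'
conifer <- TRUE

# class() reports the type R inferred at assignment
class(x)             # "numeric"
class(species.name)  # "character"
class(conifer)       # "logical"

# is.numeric() can guard an arithmetic operation
if (is.numeric(species.name)) {
  species.name + 37
} else {
  message("species.name is not numeric, so it cannot be added to 37")
}
```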
Vectors are often read in as data or produced as the result of analysis, but you can produce one simply using the c() function, which stands for "combine." For example
demo.vector <- c(1,4,2,6,12)
produces a vector of length 5 with the values 1, 4, 2, 6, 12.
Individual items within a vector or matrix can be identified by subscript (numbered 1 - n), which is indicated by a number (or numeric variable) within square brackets. For example, if the number of species per plot is stored in a vector spc.plt, then
spc.plt[37] = the number of species in plot 37
Matrices are specified in the order "row, column", so that
veg[23,48] = row 23, column 48 in matrix veg
Individual rows or columns within a matrix can be referred to by implied subscript, where the value of the desired row or column is specified, but other values are omitted. For example,
veg[,3] = third column of matrix veg
represents the third column of matrix veg, as the row number before the comma was omitted. Similarly,
veg[5,] = row 5 of matrix veg
represents row 5, as the column after the comma was omitted. In addition, a number of specialized subscripts can be used.
veg[] = all rows and columns of matrix veg
spcplt[a:b] = spcplt[a] through spcplt[b]
spcplt[-a] = all of vector spcplt except spcplt[a]
veg[a:b,c:d] = a submatrix of veg from row a to b and column c to d
It's even possible to specify specific subsets of rows and columns that are not adjacent.
spcplt[c(1,7,10),c(3,6,12)] = a submatrix consisting of rows 1, 7 and 10, and columns
3, 6, and 12 from matrix spcplt.
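The subscript forms described above can be tried on a small example. The vector and matrix below are invented purely for illustration (they are not the real spc.plt or veg data), so treat this as a sketch of the notation rather than a real analysis.

```r
# Hypothetical data for illustration only
spc.plt <- c(9, 14, 12, 8, 16)
veg <- matrix(1:20, nrow = 4, ncol = 5)   # 4 plots (rows) by 5 species (columns)

spc.plt[2]             # second element
spc.plt[2:4]           # elements 2 through 4
spc.plt[-1]            # everything except the first element

veg[3, 2]              # row 3, column 2
veg[, 3]               # third column, returned as a vector
veg[2, ]               # second row, returned as a vector
veg[1:2, 4:5]          # submatrix: rows 1 to 2, columns 4 to 5
veg[c(1, 4), c(2, 5)]  # non-adjacent rows and columns
```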
Data frames can be accessed exactly as can matrices, but can also be accessed by data frame and column or field name, without knowing the column number for a specific data item. For example, in the Bryce dataset, there is a column labeled "elev" that holds the elevation of each sample plot. This column can be accessed as bryce$elev, where "bryce" is the name of the data frame, "elev" is the name of the field or column of interest, and the "$" is a separator to distinguish data frame from field. If you are routinely working with one or a few data frames, R can be told the name(s) of the data frames in an "attach" statement, and the data frame name and separator can be omitted. For example, if we give the command
attach(bryce)
we can specify the field "elev" simply as "elev" rather than "bryce$elev." This is more concise notation, but means that we cannot have a variable with the same name as a field in a data frame that is attached. Data frames are extraordinarily useful in R.
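As a small sketch of these access styles, the data frame below is a made-up stand-in for bryce (the values are invented, and the slope column is an assumption for illustration). I also show with(), a base R function not discussed above, which evaluates an expression inside a data frame without attaching it and so avoids the name-clash problem.

```r
# Hypothetical stand-in for the bryce data frame (values are invented)
bryce <- data.frame(elev = c(2100, 2350, 2500),
                    slope = c(5, 12, 30))

bryce$elev          # column by data frame and field name
bryce[, 1]          # same column by matrix-style subscript
bryce[2, "elev"]    # row 2 of the elev column

# with() evaluates an expression inside the data frame without attaching it
with(bryce, mean(elev))
```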
For the time being, I'll give a very simple example, using the spc.plt vector above and the names of the veg data frame.
list.demo <- list(species_per_plot=spc.plt,species_names=names(veg))
list.demo
$species_per_plot:
50001 50002 50003 50004 50005 50006 50007 50008 50009 50010 50011 50012 50013
9 14 12 8 16 11 12 8 8 16 19 18 9
50014 50015 50016 50017 50018 50019 50020 50021 50022 50023 50024 50025 50026
14 19 8 10 12 13 9 15 6 13 18 16 12
50027 50028 50029 50030 50031 50032 50033 50034 50035 50036 50037 50038 50039
19 13 6 13 19 10 15 16 13 16 15 9 27
. . . . . . . . . . . . .
. . . . . . . . . . . . .
. . . . . . . . . . . . .
50156 50157 50158 50159 50160 50161 50162 50163 50164 50165 50166 50167 50168
6 5 6 6 7 4 10 13 3 12 4 5 16
50169 50170 50171 50172
10 10 8 12
$species_names:
[1] "ACHMIL" "AGOGLA" "AGRCRI" "AGRDAS" "AGRSCR" "AGRSMI" "AMEUTA" "ANEMUL"
[9] "ANTROS" "APOAND" "ARAHOL" "ARAPEN" "ARCPAT" "ARCUVA" "AREFEN" "ARTARB"
. . . . . . . .
. . . . . . . .
. . . . . . . .
[153] "SPC475" "SPC476" "SPC477" "SPC478" "SPC479" "SPC.70" "SPHCOC" "STAPIN"
[161] "STETEN" "STICOM" "STILET" "STIPIN" "STRCOR" "SWERAD" "SYMORE" "TAROFF"
[169] "TETCAN" "THAFEN" "TOWMIN" "TRADUB" "VALACU" "VICAME"
Notice how I assigned a name to the list components before the equal sign, and
the component itself following the equal sign.
In this case, the first component species_per_plot has 160 numbers (each with
the plot identifier attached), and the second item has 174 strings.
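Once a list has named components, they can be pulled back out by name or by position. Here is a minimal sketch with shortened, invented versions of the two components, so the values are assumptions rather than the real data.

```r
# Hypothetical, shortened versions of the two components
list.demo <- list(species_per_plot = c(9, 14, 12),
                  species_names = c("ACHMIL", "AGOGLA"))

list.demo$species_per_plot       # component by name
list.demo[["species_names"]]     # same idea with double brackets
list.demo[[2]][1]                # first element of the second component
names(list.demo)                 # names of the components
length(list.demo$species_names)  # number of strings in that component
```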
For example
spc.plt[!is.na(spc.plt)]
That's complicated enough to merit some discussion. The R function to identify a missing value is
is.na( )
so that to say all of a vector except missing values, we set a logical test to be true when values are not missing. Since the R operator for "not" is !, the correct test is
!is.na( )
and to specify which vector we're testing for missing values, we put the vector in parentheses as follows:
!is.na(spc.plt)
Accordingly, the full expression is
spc.plt[!is.na(spc.plt)]
While the symbol for a missing value in a vector or matrix is NA, using
spc.plt[spc.plt!="NA"]
will NOT work.
We can use the missing value test on any vector as necessary. For example, the vector of elevations, except where the number of species per plot is missing, is
elev[!is.na(spc.plt)]
This use of missing values is critical to R because all of the vectors or matrices used together in an operation must have the same number of elements. So, if there are missing values in any field we're using in a calculation, the same record (row) must be omitted from all the other fields as well. In a later lab I'll demonstrate how to create a "mask" that we can use to simplify working with vectors or matrices with missing values.
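To make the missing value test concrete, here is a short sketch with two invented vectors of equal length (the values are assumptions, not the real data). It also shows why comparing against the string "NA" fails: the comparison itself returns NA for the missing element, so that element is carried through instead of being dropped.

```r
# Hypothetical vectors; the third plot has no species count
spc.plt <- c(9, 14, NA, 8)
elev    <- c(1300, 1640, 1840, 1730)

is.na(spc.plt)             # TRUE only for the missing element
spc.plt[!is.na(spc.plt)]   # species counts with the missing value dropped
elev[!is.na(spc.plt)]      # elevations for the same, non-missing plots

# Comparing with the string "NA" yields NA for the missing element,
# so the missing value is NOT removed
spc.plt[spc.plt != "NA"]
```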
Vector operators can be applied to every row or column of a matrix to produce a vector with the apply command. For example:
spcmax <- apply(veg,2,max)
creates a vector "spcmax" with the maximum value for each species in its respective position. The apply operator is employed as:
apply("matrix name",1(rowwise) or 2(columnwise),vector operator)
so that
pltsum <- apply(veg,1,sum)
creates a vector of total species abundance in each plot. The vector is as long as the number of rows in matrix veg. If the function to be applied doesn't exist, it can be created on the fly as follows:
pltspc <- apply(veg,2,function(x){sum(x>0)})
where function(x){sum(x>0)} counts the number of plots where species x is greater than 0, and x is assigned to each column (species) in turn. Remembering that R works directly on matrices and vectors, we can simplify the apply() example above to
pltspc <- apply(veg>0,2,sum)
where the veg>0 converts the veg matrix to a matrix of TRUE and FALSE, and the sum() function treats TRUE as 1 and FALSE as 0.
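The apply() calls above can be checked on a tiny matrix. The abundance values below are invented for illustration (3 plots by 4 species), so this is only a sketch of how the pieces fit together.

```r
# Hypothetical abundance matrix: 3 plots (rows) by 4 species (columns).
# matrix() fills column by column, so each line below is one species.
veg <- matrix(c(0, 2, 5,
                1, 0, 0,
                3, 3, 0,
                0, 0, 0), nrow = 3)

spcmax <- apply(veg, 2, max)       # maximum abundance of each species
pltsum <- apply(veg, 1, sum)       # total abundance in each plot
pltspc <- apply(veg > 0, 2, sum)   # number of plots each species occurs in

# The "on the fly" function gives the same counts
pltspc2 <- apply(veg, 2, function(x) { sum(x > 0) })
all(pltspc == pltspc2)
```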
triang <- matrix[row(matrix) > col(matrix)]
The easiest way is to format the data in columns, with column headings, and blanks
or tabs between. For example:
plot elev aspect slope text
1 1300 240 30 loam
2 1640 170 20 clay.loam
3 1840 NA 24 silty.clay.loam
. . . . .
. . . . .
. . . . .
100 1730 70 15 sandy.loam
The columns do not need to be straight, but multi-word variables like "clay loam" need to be connected or put in quotes. The R convention (but it is just a convention) is to connect with a period, as shown above. It CANNOT be connected with "$". Recent versions of R also allow connections with "_". The above file (if named "site.dat" for instance) could be read with the read.table command as follows:
site <- read.table('site.dat',header=TRUE,row.names=1)
The resulting data frame would be named "site", and the columns would be named exactly as in the data file. The row.names=1 tells R that the first column is the sample identifier, and not data. In the absence of that specifier, R would assign consecutive integer sample IDs. That seems satisfactory, but it is much easier to ensure that your data in different files and data frames match up if you make sure to employ the actual sample IDs from your data sheets.
Note that the value for aspect in the third plot is NA. This is a missing value code, and will cause R to treat that value as missing, rather than as the literal text NA. It's possible to use other codes as missing values if you specify them in the read.table command. For example, suppose in your data set you used -999 as the missing value code. To tell R to set -999 to missing, add the na.strings= argument as follows:
site <- read.table('site.dat',header=TRUE,row.names=1,na.strings="-999")
Alternatively, data can be organized as in traditional spreadsheet "csv" comma delimited files, as follows:
plot,elev,aspect,slope,text
1,1300,240,30,loam
2,1640,170,20,clay.loam
3,1840,90,24,silty.clay.loam
. . . . .
. . . . .
. . . . .
100,1730,70,15,sandy.loam
In which case it would be read:
site <- read.table('site.dat',header=TRUE,row.names=1,sep=",")
to tell R that the values were separated by commas. Alternatively, you can use
site <- read.csv('site.dat',header=TRUE)
to read the file, as read.csv() calls read.table() with the appropriate parameters as defaults.
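If you want to experiment with these commands without creating a file first, read.table() and read.csv() will also parse a character string passed through the text= argument. The sketch below uses only the first three rows of the example layout, and the -999 missing value code in the second half is the hypothetical one from the discussion above.

```r
# Inline copy of a few rows in the whitespace-delimited layout
site.txt <- "plot elev aspect slope text
1 1300 240 30 loam
2 1640 170 20 clay.loam
3 1840 NA 24 silty.clay.loam"

# text= lets read.table() parse a string instead of a file
site <- read.table(text = site.txt, header = TRUE, row.names = 1)
str(site)            # column names and types
is.na(site$aspect)   # the NA in plot 3 is read as missing

# The same rows, comma delimited, with -999 as the missing value code
site.csv <- "plot,elev,aspect,slope,text
1,1300,240,30,loam
2,1640,170,20,clay.loam
3,1840,-999,24,silty.clay.loam"
site2 <- read.csv(text = site.csv, row.names = 1, na.strings = "-999")
is.na(site2$aspect)
```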
In cases where column headings are absent, the file can be read with header=FALSE and names can be entered separately with the names command. For example:
names(site) <- c("plot","elev","aspect","slope","text")
Row names (such as plot IDs) can also be added if desired, using the row.names() function in a similar way.
The beauty of the read.table() function is the way it handles variables. In versions of R before 4.0.0, if any value in a column is alphabetic, it treats the column as composed of "factors," or categorical variables; from R 4.0.0 on, such columns are read as character strings unless you add stringsAsFactors=TRUE to the read.table command. There is NEVER a reason to convert categorical variables to integer or numeric codes. However, if you already have categorical variables coded as integers, you can explain that to R with the factor() function after you read the data in.
This turns out to be a common enough problem to deserve some discussion. Let's say that you have a data frame (called site), with a column for soil parent material (called pm), and 1=granite and 2 = limestone. R will think that parent material is a quantitative vector, and can be added and subtracted for example. Worse, you have to remember forever that 1=granite and 2=limestone. The correct thing to do is to convert the data.
site$pm[site$pm==1] <- 'granite'
site$pm[site$pm==2] <- 'limestone'
site$pm <- factor(site$pm)
The first two lines do a substitution using a logical subscript (which we discussed above). The third line converts the resulting vector to a factor. If the values had been 'granite' and 'limestone' all along R would have known that the column was a factor, but when you convert a field from one type to another you need to tell R.
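As an alternative sketch, the same conversion can be done in a single call using the levels= and labels= arguments of factor(). The small data frame here is invented for illustration; only the coding (1=granite, 2=limestone) comes from the discussion above.

```r
# Hypothetical site data frame with parent material coded as integers
site <- data.frame(pm = c(1, 2, 2, 1))

# Map code 1 to 'granite' and code 2 to 'limestone' in a single call
site$pm <- factor(site$pm, levels = c(1, 2),
                  labels = c("granite", "limestone"))

levels(site$pm)   # the category names R now knows about
table(site$pm)    # number of plots on each parent material
```

This form has the advantage that the coding is written down in one place, and any value that is not 1 or 2 becomes NA rather than silently surviving as a stray category.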
I don't discuss loading R packages until later in this file, but it's worth noting here that if you have loaded package 'foreign' there are additional useful file reading functions. One particularly useful function is read.dbf() which allows you to read DBase (or XBase) files directly. There are also functions for importing data from SAS, SPSS, Systat, and other software packages. Finally, it is possible, although more difficult, to read Excel .xls files. Excel is exceptionally sloppy about data formatting and storage, and the best advice is simply not to attempt to read .xls files. Rather, using Excel (or OpenOffice Calc) export the spreadsheet to a .csv file, and use read.csv(). This path allows you to edit the data in the .csv file before reading it in, and avoids a huge number of issues later on.
In Windows, the graphics window is controlled by the windows() command; in unix/linux, the X11 window is controlled by the x11() function. The size of the window is specified in inches as arguments to the function. For example, to get a window 8 inches wide by 6 inches tall
x11(height=6,width=8) or windows(height=6,width=8)
This is simple, except that you can't control the location. You can, however, move the window with your mouse. As long as you don't resize it you are fine.
Simply type
?Devices or help(Devices)
to get a list of available devices and their names (note the capital D on Devices). Each of the devices has options that can be set to control plot size, orientation (landscape or portrait), font size, etc.
It is tempting in Windows to save a plotted graph to file by right clicking on it and specifying a format and name to save under. Do not do this. These files are really ugly when included in a document. Take the time to open an appropriate device (try pdf for vectors or png for raster (image) plots) and replot the figure. It's definitely worth the time and effort. As I will show below you can save all the plotting commands in a function, edit it until it's perfect, and then plot to any device.
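A minimal sketch of plotting to a pdf device follows. The data values and the output file name are invented for illustration, and I write to R's temporary directory so that nothing in your working directory is overwritten; substitute your own path as needed.

```r
# Hypothetical data and output file name
elev <- c(1300, 1640, 1840, 1730)
spc  <- c(9, 14, 12, 8)
outfile <- file.path(tempdir(), "elev_plot.pdf")

pdf(outfile, width = 8, height = 6)   # open the device; size is in inches
plot(elev, spc, xlab = "Elevation", ylab = "Species per plot")
dev.off()                             # close the device to finish writing the file
```

The file is not complete until dev.off() is called, so forgetting that last line is the usual reason for an empty or unreadable pdf.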
If your machine is on the internet, R has routines available to automatically install or update packages from CRAN.
Alternatively, if you have a copy of the package you want to install on a CD or USB device, in the Packages menu select "Install package(s) from local zip files" and browse to and select the package you want to install.