WS2M User's Manual

Ws2m.exe estimates the total number of types in a finite collection. It was developed to estimate the number of species in a collection of identified individuals, and we use the language of that application to explain the program. (We believe Ws2m can be used on many other sorts of data, but these uses remain untested. Ws2m contains a sample-with-replacement option so that you can test it yourself.)

Ws2m takes in a data set and produces a series of statistics for a randomly ordered data set like it. At each step of the series, a larger proportion of the data set is used to calculate both Fisher's alpha and estimates of the number of species. Ws2m uses a large (and user-controllable) variety of estimators to produce the estimates. It also reports the number of individuals used up to that point and the actual number of species obtained so far in the collection. It can report species-abundance distributions and Jaccard indices. It has more tools than we know how to use yet. However, it will not take out the trash at your grandmother's apartment.

The nature of the data: three alternatives

1. A series of samples with known sample abundances (PRN file). Each sample contains a number of individuals. The species of each individual in each sample is known.
2. Presence/absence matrix. Same as type 1 except without abundances. Each cell has a one (1) if the species appeared in the sample, a zero (0) if it was absent. Obviously, you can convert data of type 1 to this type if, for any reason, you wish to.
3. A vector (VEC file). There are two vectors in a VEC file: the list of sample sizes and the list of each species' proportional abundance. Useful when you do not know the detailed composition of each sample, but do know the abundance proportions of the entire set. We discuss vector files below.
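
The conversion from an abundance matrix to a presence/absence matrix mentioned above can be sketched in a few lines of Python. This is an illustration only, not part of Ws2m; the function name to_presence_absence is hypothetical, and the matrix is assumed to be a list of samples (rows), each a list of counts per species (columns).

```python
def to_presence_absence(matrix):
    """Replace every positive count with 1 and every zero with 0."""
    return [[1 if count > 0 else 0 for count in row] for row in matrix]


# Two samples (rows) by three species (columns); values are illustrative.
abundances = [
    [3, 0, 12],
    [0, 0, 1],
]
print(to_presence_absence(abundances))
```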

The output

The output file with the extension .out. This contains all the results in ASCII format. The output file is permanent until you overwrite it. Using your favorite graphics package, you can display anything you wish from the output files. The following table explains some suffixes on column headings.

X_BV	Bootstrapped Variance. Reports the variance in X among runs. X can be a counted value, such as the number of species observed ("Sobs"), or an estimated value, such as "Chao1". X_BV values will be 0 if you do not shuffle sample order or resample individuals, or if you only do one run. Note that this does not compute the variance formulas published in the literature for the various estimators. See also "Other notes" below.
X_RD	Mean deviation of X from zeta across multiple runs. Here zeta is the total number of species actually present in the entire original data set. This is most useful for testing estimators on data sets which you assume represent the real world. An ideal estimator (which doesn't exist yet) would have an X_RD of 0 asymptotically. Even better would be to attain an X_RD of 0 with only a single sample.
X_RPD	Like _RD, but expressed as percentage of zeta.
X_RPO	Percentage of runs on which X exceeded zeta. 50 would indicate that the estimator produces over and underestimates with equal frequency.
X_QPC	Percentage of samples which were within 5% of the final X value for all samples. The ideal estimator would have X_QPC of 100% (meaning even the estimate with 1 sample was within 5%).

The summary file with the extension .sum. This contains the averages of the results at the last step (the one with the largest number of individuals), also in ASCII format.
The display. A graph of some results.

Display controls: Graphing; Copy; Overlay new plot on existing plot.

A vector file with the extension .vec. Sample sizes and abundances. Produced by Ws2m if you ask for it in Advanced Options (see below). Or you can make one yourself.
A file of species-abundance results. Permanent text file produced when you request SAD graphing.
A text file of the shuffled matrix at the last run. Permanent text file produced when you request shuffling of individuals (see below). Useful for checking and confidence building.
A debug file. Permanent text file produced if anything unexpected happens during execution (e.g., one of the non-linear regression procedures does not converge).
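
To make the column suffixes in the output file concrete, here is a minimal Python sketch of how the X_BV, X_RD, X_RPD, X_RPO and X_QPC summaries could be computed from their definitions above. It is not Ws2m's code. The function names are hypothetical, and two details are assumptions: the variance is taken as the population variance (so a single run gives 0), and the final step itself is counted when computing X_QPC.

```python
from statistics import mean, pvariance


def run_summaries(estimates, zeta):
    """Summarize one statistic X at one step.

    estimates: the value of X from each run.
    zeta: number of species actually present in the whole data set.
    """
    deviations = [x - zeta for x in estimates]
    return {
        "BV": pvariance(estimates),            # variance of X among runs (assumed population form)
        "RD": mean(deviations),                # mean deviation from zeta
        "RPD": 100.0 * mean(deviations) / zeta,  # same, as a percentage of zeta
        "RPO": 100.0 * sum(1 for x in estimates if x > zeta) / len(estimates),
    }


def qpc(series, tolerance=0.05):
    """Percentage of steps whose X lies within 5% of the final X.

    series: the value of X after 1, 2, ..., all samples.
    """
    final = series[-1]
    close = sum(1 for x in series if abs(x - final) <= tolerance * abs(final))
    return 100.0 * close / len(series)


print(run_summaries([30, 34, 38, 34], zeta=34))
print(qpc([10, 19.5, 20, 20]))
```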

Preparing the data file for input

In each case, the data file must be a pure ASCII file. Use a text editor to be sure. If the data file will not load properly, use a hex editor to get rid of the non-ASCII bytes.

PRN format. Types 1 & 2, above; the data occur as a series of samples ("quadrats").

First line is the name of the data set. ASCII only, but any name you want is OK. It does not have to be the same as the file name. Each line ends with a carriage return.
Second line has two control parameters. They are separated from each other by one (or more) spaces. The first parameter is the number of species (i.e., columns in the matrix); the second, the number of samples (i.e., rows in the matrix).
Subsequent lines contain the data matrix. Columns are the species (but they are not named). Rows are the samples (but they are not numbered). In each cell of the matrix, put the number of individuals of that column's species collected in that sample. (Or put a 1 or 0, if this is a presence/absence data matrix.) Separate the cells with a space (nothing else: no tabs, no semi-colons, nothing).

Butterflies
34 6
1 1 42 53 44 9 1 837 4 1 7 5 9 2 2 2 50 13 1 1 1 11 14 5 5 2 0 0 0 0 0 0 0 0
19 6 0 188 126 36 0 7 5 1 23 29 43 50 3 3 257 26 9 0 5 5 9 5 24 10 1 2 2 3 0 0 0 0
2 0 0 44 474 18 0 167 1 0 9 8 19 21 2 9 552 30 12 0 7 9 13 15 19 6 1 0 0 1 0 0 0 0
6 0 0 175 293 14 0 16 4 0 0 20 12 6 24 20 50 7 15 3 3 0 2 25 4 2 2 0 1 1 1 0 0 0
27 0 0 52 743 31 1 363 1 0 10 7 11 25 4 11 1041 18 38 21 18 12 19 11 15 3 1 1 0 3 0 0 0 0
4 0 0 128 146 38 0 18 6 1 9 7 13 4 19 1 132 2 11 1 3 0 6 31 45 3 0 0 0 0 0 0 0 0

Data matrix input files must use the extension prn. Thus, the example could be called Butterflies.prn.
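
Before loading a file into Ws2m, it can help to check it against the PRN rules described above. The following Python sketch is one way to do that. It is not part of Ws2m, and the function name parse_prn is hypothetical. As a convenience (and an assumption about nothing more than this sketch), it skips blank lines; whether Ws2m itself tolerates them is not stated here.

```python
def parse_prn(text):
    """Parse PRN-style text into (name, matrix); matrix rows are samples."""
    if not text.isascii():
        raise ValueError("file contains non-ASCII characters")
    if "\t" in text:
        raise ValueError("cells must be separated by spaces, not tabs")
    # Assumption of this sketch: blank lines are ignored.
    lines = [line for line in text.splitlines() if line.strip()]
    name = lines[0].strip()
    # Line two: number of species (columns), then number of samples (rows).
    n_species, n_samples = (int(token) for token in lines[1].split())
    matrix = [[int(token) for token in line.split()] for line in lines[2:]]
    if len(matrix) != n_samples:
        raise ValueError(f"expected {n_samples} samples, found {len(matrix)}")
    for number, row in enumerate(matrix, start=1):
        if len(row) != n_species:
            raise ValueError(
                f"sample {number}: expected {n_species} cells, found {len(row)}"
            )
    return name, matrix


demo = "Demo\n3 2\n1 0 4\n0 2 0\n"
name, matrix = parse_prn(demo)
print(name, len(matrix), "samples,", sum(map(sum, matrix)), "individuals")
```

The three numbers printed at the end correspond to the statistics Ws2m reports after loading a file: samples, species (the row length), and individuals.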

VEC format. The data occur as vectors of sample sizes and proportional abundances. The first two lines are the same as in PRN files. Each subsequent line contains one or two numbers. The first number on line three is the total number of individuals in one of the samples; the second is the proportional abundance of the first species in the entire set of samples. The next line has the total number of individuals in a second sample, followed by the proportional abundance of the second species in the entire set of samples, and so on. The result is a pair of vectors that need not have the same length. Neither the sample size vector nor the abundance vector need be ordered in any way. But make sure that the number of elements in the first vector (sample sizes) agrees with the second number in line two, and that the number of elements in the second vector (abundances) agrees with the first number in line two.

Vector input files must use the extension vec. Load them with Load Input File.
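
If you have a full abundance matrix and want to see what the two vectors of a VEC file would contain, they can be derived as in the Python sketch below. This is an illustration, not Ws2m's own routine; the function name vectors_from_matrix is hypothetical, and the sketch only computes the vectors (it does not write the file layout).

```python
def vectors_from_matrix(matrix):
    """Return (sample_sizes, proportions) for a samples-by-species matrix.

    sample_sizes: total individuals in each sample (row sums).
    proportions: each species' share of all individuals (column sums / total).
    """
    sample_sizes = [sum(row) for row in matrix]
    total = sum(sample_sizes)
    n_species = len(matrix[0])
    species_totals = [sum(row[j] for row in matrix) for j in range(n_species)]
    proportions = [count / total for count in species_totals]
    return sample_sizes, proportions


sizes, props = vectors_from_matrix([[1, 0, 3], [2, 2, 0]])
print(sizes)
print(props)
```

Note that the two vectors have different lengths here (two samples, three species), which is the situation the VEC description allows for.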

Running WS2M

Run WS2M.EXE on a Windows 95/98/2000/NT machine. Some users have reported that the program also works on their Macintosh under Windows emulation.
Click Load Input File and select the data file you want to analyze. Ws2m will load it and report several statistics: the number of samples; the number of species; and the number of individuals.
Select the Actions tab.
The most important choice: Sample With Replacement. If you check this, then Ws2m will be estimating the number of species actually in the data set itself, NOT the number in the sampling universe from which you got it. This option should be used only to test Ws2m and its methods. (If you check Make Exchangeable samples, Ws2m will not perform replacement regardless of whether this option is checked.)
Crucial notes for dealing with presence-absence matrices:

Check Shuffle Incidences Only (on the Advanced Options tab). This option prevents any shuffled cell from containing more than one individual. It must not be checked for any other matrix form.
With a presence/absence matrix, you cannot maintain sample sizes unless you are sampling with replacement (in which case you are merely checking the methods to see if they will accurately report the number of species in your sample). If you want to estimate what is actually out there, check Make Exchangeable under the Actions tab and do not check Sample With Replacement.

Shuffle Individuals (among samples) removes any habitat heterogeneity (i.e. intrinsic differences between times and/or places of the samples). Each step of the procedure will then estimate the number of species in the entire sampling universe.
Shuffle Sample Order corrects for unequal sample sizes by randomizing the order of quadrats on each run.
Make Samples IID (Independent and Identically Distributed) makes each sample artificially equal in number of individuals. There is very little difference in practice between the estimates you will get if you check this or retain the actual sample sizes (provided you ask the estimator to run 20 or more times).
# Runs tells the program to repeat the procedure a certain number of times and output the average estimates.
Random Seed lets you repeat the whole procedure precisely by resetting this value to any fixed integer (say 42) before each set of runs. If you use the same seed and options, and if your machine and program are operating properly, you will obtain exactly the same estimates every time.
Name the Output File(s) to save them (default = temp). Ws2m automatically adds the extension .out.
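
The interplay of Shuffle Sample Order, # Runs and Random Seed can be pictured with the following Python sketch. It is not Ws2m's implementation: the function name mean_accumulation is hypothetical, Python's own random number generator stands in for whatever Ws2m uses, and only the observed species count is tracked at each step.

```python
import random


def mean_accumulation(matrix, runs=20, seed=42):
    """Average the species accumulation curve over randomly ordered runs.

    matrix: samples (rows) by species (columns), counts or 0/1 values.
    Returns the mean number of species observed after 1, 2, ... samples.
    """
    rng = random.Random(seed)  # a fixed seed makes the whole procedure repeatable
    n_samples = len(matrix)
    totals = [0.0] * n_samples
    for _ in range(runs):
        order = list(range(n_samples))
        rng.shuffle(order)  # randomize the order of samples for this run
        seen = set()
        for step, index in enumerate(order):
            seen.update(j for j, count in enumerate(matrix[index]) if count > 0)
            totals[step] += len(seen)
    return [total / runs for total in totals]


samples = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
print(mean_accumulation(samples, runs=50, seed=42))
```

Calling the function twice with the same seed and options returns identical curves, which mirrors the repeatability described for Random Seed.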

The Estimation Methods

The Jackknife (Jack# or JK#) Methods of Burnham & Overton (B&O or BO): Jack k is the kth-order jackknife estimator, as described in Burnham & Overton (1978) and Smith and van Belle (1984). BO is the actual step-by-step estimate based on the first four JK orders calculated; Ws2m chooses among them according to B&O's method, as described in Burnham & Overton (1978). This includes their interpolation method (which you may deselect under the Advanced Options tab). The output file also reports the average order used for BO at each step. Both JK and BO estimators use incidence information only.
Methods of Chao and Lee: Chao1, Chao2, ACE, and ICE: A suite of Chao estimates: Chao1 (Chao 1984); Chao2 (Chao 1987); the abundance-based coverage estimator (ACE; Chao and Lee 1992); and the incidence-based coverage estimator (ICE; Lee and Chao 1994). ACE and ICE are identical if you analyze a presence/absence matrix. Chao1 and ACE use abundance information only, while Chao2 and ICE use incidence information only. ACE and ICE, as presented in the original papers, occur in more than one form. The different forms involve a revised calculation of the CV, the coverage estimate, both, or neither (i.e., no recalculation). Choose the one you want under the Advanced Options tab. To generate their estimates, ACE and ICE require specification of the positive integer Rare/Infrequent Cutoff (RIcut) (see Chao et al. 1993). ACE uses the first RIcut abundance classes (species with exactly 1, 2, ..., RIcut individuals). ICE uses the first RIcut incidence classes (species occurring in exactly 1, 2, ..., RIcut samples). Ten is the default value of RIcut in Ws2m, but you may alter it under the Advanced Options tab. To use all incidence or abundance classes, set RIcut to a high number (higher than the highest possible incidence/abundance).
Fisher's alpha (FAlpha; Fisher et al. 1943): Does not estimate the number of species. Instead, it reports Fisher's well known, sample-size independent index of diversity.
The Bootstrap (Boot): The bootstrap is a general resampling technique which was applied to the problem of species diversity estimation by Smith and van Belle (1984). It uses incidence information only.
Michaelis-Menten: Using non-linear regression, one can fit the data directly to an M-M equation. However, the literature of diversity estimation generally has preferred the Eadie-Hofstee formula instead (Colwell and Coddington, 1994). Based on E-H, a maximum likelihood method produces estimates of P and a. WS2M calculates both estimators if you request the Michaelis-Menten fit in Advanced Options. It calculates only the Eadie-Hofstee otherwise. In the output tables, MMm is based on Eadie-Hofstee and MMFm is the direct non-linear fit. Note that we have found E-H to behave very poorly when it makes estimates using small amounts of data. Often it even returns large negative estimates of diversity.
New, extrapolation-based estimators: Three new, Pareto-style, extrapolation-based estimators of the asymptote of the collector's curve. We intend them to replace the Michaelis-Menten function and other extrapolation estimators. The methods are unpublished at present. However, we have had excellent results with them using both abundance and presence/absence inputs. Compare them to the others using your own data set (and choosing 'shuffle individuals with replacement'). For us, they seem to lack the extreme negative biases at small N that we have seen in B&O, Chao1 and Chao2. And often they seem to converge faster than ICE.
The F5 equation is: [equation not reproduced in this copy]
Heterogeneities present in data (which sometimes show up as jumps in the collector's curve) cause poor fitting of extrapolation-based estimators such as F5. You will find that F5 performs well when individuals are shuffled or sample order is shuffled for many (more than 10) runs. If you are fitting the S and N vectors by hand from the output file, try 0.15 as the initial value of q, and adapt the initial value from that point.
The F3 equation is: [equation not reproduced in this copy]
The F6 equation is: [equation not reproduced in this copy]
Notes on new extrapolation-based estimators:
All these extrapolation-based estimators require non-linear fitting. We have incorporated this into Ws2m. Ws2m's nonlinear curve fitting uses the Levenberg-Marquardt method as described in Press et al. 1986.
Because nonlinear fitting is a trial and error procedure (which would be quite daunting to fully automate), all extrapolation methods will occasionally give erratic results. These appear as values of -1. Simply reject any results of -1. (Or, go to the output file and fit the formulae to the S and N vectors by hand. Note: the erratic behavior is uncommon enough that you may never even see it.)
No matter how many runs are specified, extrapolation-based estimators are generally calculated only once, on each step of the averaged species accumulation curve. The sole exception: if you ask for a separate output file for each run. Then the extrapolations are made for each run separately.
Log-normal estimator: This fits a log-normal distribution to the abundance data and calculates its integral as suggested by Frank Preston.
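
For readers who want to see what "abundance information" and "incidence information" mean in practice, the Python sketch below computes three of the simpler estimators in their classic published forms: Chao1 = Sobs + F1^2/(2 F2), Chao2 = Sobs + Q1^2/(2 Q2), and the first-order jackknife Jack1 = Sobs + Q1 (m - 1)/m, where F1 and F2 are the numbers of species with exactly one and two individuals, Q1 and Q2 the numbers of species found in exactly one and two samples, and m the number of samples. This is not Ws2m's code and may differ from it in detail. The function names are hypothetical, and the fallback used when F2 or Q2 is zero (the bias-corrected form) is an assumption of this sketch.

```python
def _species_totals(matrix):
    """Return (abundance, incidence) per species for a samples-by-species matrix."""
    n_species = len(matrix[0])
    abundance = [sum(row[j] for row in matrix) for j in range(n_species)]
    incidence = [sum(1 for row in matrix if row[j] > 0) for j in range(n_species)]
    return abundance, incidence


def _chao(s_obs, ones, twos):
    """Classic Chao form; bias-corrected fallback when 'twos' is 0 (assumption)."""
    if twos > 0:
        return s_obs + ones * ones / (2.0 * twos)
    return s_obs + ones * (ones - 1) / 2.0


def chao1(matrix):
    abundance, _ = _species_totals(matrix)
    s_obs = sum(1 for a in abundance if a > 0)
    return _chao(s_obs, abundance.count(1), abundance.count(2))  # singletons, doubletons


def chao2(matrix):
    _, incidence = _species_totals(matrix)
    s_obs = sum(1 for q in incidence if q > 0)
    return _chao(s_obs, incidence.count(1), incidence.count(2))  # uniques, duplicates


def jack1(matrix):
    _, incidence = _species_totals(matrix)
    s_obs = sum(1 for q in incidence if q > 0)
    m = len(matrix)  # number of samples
    return s_obs + incidence.count(1) * (m - 1) / m


samples = [[1, 0, 2, 0], [1, 1, 0, 0], [0, 0, 3, 1]]
print(chao1(samples), chao2(samples), jack1(samples))
```

Chao1 looks only at the pooled abundances, while Chao2 and Jack1 look only at how many samples each species occurs in, which is why the latter two can also be applied to a presence/absence matrix.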

Other notes

The Advanced Options tab contains more of our toys. Ignore them, or ask for more information if you are curious. The Force # Individuals option sets a reduced total sample size for a data set; the default of 0 uses the sample size actually obtained. The Shuffle Only Pooled option is also our toy, at least for now. This option leaves local abundance distributions intact. Making samples exchangeable permits us to use de Finetti's theorem. This will be further developed because it leads to confidence intervals.

By the way, we have not implemented the 'variances' reported in the literature. They are not what most biologists would like to have. Here is what we want: if a method yields an estimate of S species, we want to know how that estimate compares to the true diversity. That is, we want to be able to say that true diversity lies between S-e and S+e with 95% confidence. But published confidence intervals do not tell us that. What they say is that, given these data, the estimate will lie between S-e and S+e 95% of the time. Because the estimates retain a bias in many cases, the truth will often lie well outside the 95% confidence interval. We are working on this problem.

Final Bug
When you finish Ws2m and exit, Windows usually complains. You will see an error message. It apparently comes from a problem in the published nonlinear regression routine that we incorporated. Pain to fix, and we have other stuff to do. Ignore it. No harm; no foul.

Updates
URL: EEBWEB.ARIZONA.EDU/DIVERSITY
There you will find the latest version of these notes, the latest stable Ws2m and the latest beta version of Ws2m. If you have time, let Will Turner (wturner@u.arizona.edu) and Mike Rosenzweig (scarab@u.arizona.edu) know about your experiences with this package.

Literature Cited

Burnham, K. P. and W. S. Overton. 1978. Estimation of the size of a closed population when capture probabilities vary among individuals. Biometrika 65:625-633.

Chao, A. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11:265-270.

Chao, A. 1987. Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783-791.

Chao, A. and S.-M. Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87:210-217.

Chao, A., M.-C. Ma and M.C.K. Yang. 1993. Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 80:193-201.

Colwell, R.K. and J.A. Coddington. 1994. Estimating terrestrial biodiversity through extrapolation. Philosophical Transactions of the Royal Society of London B 345:101-118.

Fisher, R.A., A.S. Corbet and C.B. Williams. 1943. The relation between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology 12:42-58.

Lee, S.-M. and A. Chao. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50:88-97.

Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. 1986. Numerical recipes: the art of scientific computing. Cambridge University Press, Cambridge.

Preston, F.W. 1948. The commonness, and rarity, of species. Ecology 29:254-283.
Preston, F.W. 1960. Time and space and the variation of species. Ecology 41:611-627.
Preston, F.W. 1962a. The canonical distribution of commonness and rarity. Ecology 43:185-215.
Preston, F.W. 1962b. The canonical distribution of commonness and rarity. Ecology 43:410-432.

Smith, E.P. and G. van Belle. 1984. Nonparametric estimation of species richness. Biometrics 40:119-129.