WS2M User's Manual

Ws2m.exe estimates the total number of types in a finite collection. It was developed to estimate the number of species in a collection of identified individuals, and we will use the language of that application to explain the program. (We believe Ws2m can be used on many other sorts of data, but these uses remain untested. Ws2m contains a sample-with-replacement option to allow you to test it yourself.)

Ws2m takes in a data set and produces a series of statistics for a randomly ordered data set like it. At each step of the series, an enlarged proportion of the data set is used to calculate both Fisher's alpha and estimates of the number of species. Ws2m uses a large (and user-controllable) variety of estimators to produce the estimates. It also reports the number of individuals used to that point and the actual number of species so far obtained in the collection. It can report species-abundance distributions and Jaccard indices. It has more tools than we know how to use yet. However, it will not take out the trash at your grandmother's apartment.
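
The stepwise procedure described above can be pictured with a short Python sketch. It is an illustration only, not Ws2m's code: the function name accumulation_curve and the toy matrix are hypothetical, and it assumes that one "run" means shuffling the sample order and then adding samples one at a time.

```python
import random

def accumulation_curve(samples, seed=None):
    """Return (individuals so far, species so far) after each added sample.

    `samples` is a list of rows, one abundance per species in each row.
    The sample order is shuffled first to mimic one randomly ordered run.
    """
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)

    seen = set()          # column indices of species observed so far
    individuals = 0
    curve = []
    for idx in order:
        row = samples[idx]
        individuals += sum(row)
        seen.update(j for j, count in enumerate(row) if count > 0)
        curve.append((individuals, len(seen)))
    return curve

# Toy data: 3 samples (rows) by 4 species (columns).
samples = [
    [3, 0, 1, 0],
    [0, 2, 1, 0],
    [1, 0, 0, 4],
]
curve = accumulation_curve(samples, seed=1)
for n, s in curve:
    print(n, s)
```

Whatever the shuffled order, the last step uses every sample, so the final pair is always the total number of individuals and the total number of species observed.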

The nature of the data: three alternatives

  1. A series of samples with known sample abundances (PRN file). Each sample contains a number of individuals. The species of each individual in each sample is known.
  2. Presence/absence matrix. Same as previous except without abundances. Each cell has a one (1) if the species appeared in the sample, a zero (0) if it was absent. Obviously, you can convert data of type 1 to this type if, for any reason, you wish to.
  3. A vector (VEC file). There are two vectors in a VEC file: the list of sample sizes and the list of each species' proportional abundance. Useful when you do not know the detailed composition of each sample, but do know the abundance proportions of the entire set. We discuss vector files below.
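
Converting data of type 1 to type 2 is a one-line operation. The sketch below is a minimal illustration; the function name to_presence_absence is hypothetical and not part of Ws2m.

```python
def to_presence_absence(matrix):
    """Replace every positive abundance with 1 and leave zeros as 0."""
    return [[1 if count > 0 else 0 for count in row] for row in matrix]

abundances = [
    [3, 0, 1],
    [0, 0, 7],
]
print(to_presence_absence(abundances))
```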

The output

  1. The output file with the extension .out. This contains all the results in ASCII format. The output file is permanent until you overwrite it. Using your favorite graphics package, you can display anything you wish from the output files. The following table explains some suffixes on column headings.
    X_BV Bootstrapped Variance. Reports the variance in X among runs. X can be a counted value, such as the number of species observed, "Sobs" or an estimated value, such as "Chao1". X_BV values will be 0 if you do not shuffle sample order or resample individuals, or if you only do one run. Note that this does not compute the variance formulas for various estimators in the published literature. See also "Other notes" below.
    X_RD Mean deviation of X from zeta across multiple runs. zeta is the total number of species actually present in the entire original data set. This is most useful for testing estimators on data sets which you assume represent the real world. An ideal estimator (which doesn't exist yet) would have an X_RD of 0 asymptotically. Even better would be to attain an X_RD of 0 with only a single sample.
    X_RPD Like _RD, but expressed as percentage of zeta.
    X_RPO Percentage of runs on which X exceeded zeta. 50 would indicate that the estimator produces over and underestimates with equal frequency.
    X_QPC Percentage of samples which were within 5% of the final X value for all samples. The ideal estimator would have X_QPC of 100% (meaning even the estimate with 1 sample was within 5%).

  2. The summary file with the extension .sum. This contains the averages of the results at the last step (the one with the largest number of individuals), also in ASCII format.
  3. The display. A graph of some results.
        You change which of the results you graph by selecting them under the Graphing tab. The graphing options also allow you to display such properties as species-abundance distributions (and fit some SAD models) and Jaccard indices. Try them. (Note: we know about the display bug in the shuffled SAD. It is intermittent, and does not affect the file report of the results. A pain to fix, and we have not yet done it.)
        The display is linked to the output so that you need not rerun the program to graph other results from the output file. (Of course, you cannot display results you did not request from the previous run.) While results are in the program, you may change the graphing options and redisplay the graph any number of times.
        The display is ephemeral; the display engine loses its link to the output as soon as a new data set is loaded or the program is closed. However, you can save the display by using the Copy option below it. Copy puts the display on the Windows clipboard. From the clipboard, you can paste it into a graphics program or wordprocessor and save it.
        You may also maintain a display, run the same data or another set, and overlay the second (third, fourth, etc.) display on the first. Check Overlay new plot on existing plot at the bottom of the Graphing options page.
  4. A vector file with the extension .vec. Sample sizes and abundances. Produced by Ws2m if you ask for it in Advanced Options (see below). Or you can make one yourself.
  5. A file of species-abundance results. Permanent text file produced when you request SAD graphing.
  6. A text file of the shuffled matrix at the last run. Permanent text file produced when you request shuffling of individuals (see below). Useful for checking and confidence building.
  7. A debug file. Permanent text file produced if anything unexpected happens during execution (e.g., one of the non-linear regression procedures does not converge).
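
To make the _RD, _RPD and _RPO suffixes in the output file concrete, here is a small Python sketch that computes them from a list of per-run estimates. It is our reading of the definitions given for those suffixes, not Ws2m's code: the function name run_summaries is hypothetical, and we assume the deviation is taken as X minus zeta.

```python
def run_summaries(estimates, zeta):
    """Summarize one estimator's values across runs against the true richness zeta."""
    n_runs = len(estimates)
    rd = sum(x - zeta for x in estimates) / n_runs                 # mean deviation from zeta
    rpd = 100.0 * rd / zeta                                        # same, as a percentage of zeta
    rpo = 100.0 * sum(1 for x in estimates if x > zeta) / n_runs   # percent of runs above zeta
    return {"RD": rd, "RPD": rpd, "RPO": rpo}

# Four hypothetical runs of one estimator on a data set with 34 species.
stats = run_summaries([30.0, 36.0, 35.0, 33.0], zeta=34)
print(stats)
```

In this made-up example two of the four runs exceed zeta, so RPO is 50, while the mean deviation is slightly negative.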

Preparing the data file for input

In each case, the data file must be a pure ASCII file. Use a text editor to be sure. If the data file will not load properly, use a hex editor to get rid of the non-ASCII bytes.
  1. PRN format. Types 1 & 2, above; the data occur as a series of samples ("quadrats").
    1. First line is the name of the data set. ASCII only, but any name you want is OK. It does not have to be the same as the file name. Each line ends with a carriage return.
    2. Second line has two control parameters. They are separated from each other by one (or more) spaces. The first parameter is the number of species (i.e., columns in the matrix); the second, the number of samples (i.e., rows in the matrix).
    3. Subsequent lines contain the data matrix. Columns are the species (but they are not named). Rows are the samples (but they are not numbered). In each cell of the matrix, put the number of individuals of that column's species collected in that sample. (Or put a 1 or 0, if this is a presence/absence data matrix.) Separate the cells with a space (nothing else: no tabs, no semi-colons, nothing).

      Example of a matrix with 34 species collected from 6 samples:

      Butterflies
      34 6
      1 1 42 53 44 9 1 837 4 1 7 5 9 2 2 2 50 13 1 1 1 11 14 5 5 2 0 0 0 0 0 0 0 0
      19 6 0 188 126 36 0 7 5 1 23 29 43 50 3 3 257 26 9 0 5 5 9 5 24 10 1 2 2 3 0 0 0 0
      2 0 0 44 474 18 0 167 1 0 9 8 19 21 2 9 552 30 12 0 7 9 13 15 19 6 1 0 0 1 0 0 0 0
      6 0 0 175 293 14 0 16 4 0 0 20 12 6 24 20 50 7 15 3 3 0 2 25 4 2 2 0 1 1 1 0 0 0
      27 0 0 52 743 31 1 363 1 0 10 7 11 25 4 11 1041 18 38 21 18 12 19 11 15 3 1 1 0 3 0 0 0 0
      4 0 0 128 146 38 0 18 6 1 9 7 13 4 19 1 132 2 11 1 3 0 6 31 45 3 0 0 0 0 0 0 0 0

    4. Data matrix input files must use the extension prn. Thus, the example could be called Butterflies.prn.

  2. VEC format. The data occur as vectors of sample sizes and proportional abundances. The first two lines are the same as prn files. The next lines contain one or two numbers. The first number on line three will be the total number of individuals in one of the samples. The second will be the proportional abundance of the first species in the entire set of samples. The next line has the total number of individuals in a second sample followed by the proportional abundance of the second species in the entire set of samples. Etc. The result is a pair of vectors that may not have the same length. Neither the sample size vector nor the abundance vector need be ordered in any way. But make sure the number of elements in the first vector (sample sizes) agrees with the second number in line two, and that the number of elements in the second vector agrees with the first number in line two.

            One way to see a vector file example is to run the sample prn file above. Check the option in Advanced Options, and one of its output files will have the extension vec, and be the vector file corresponding to that prn file. Vector input files must use the extension vec. Thus the vector output file from the example is called Butterflies.vec.
            To use the vector input method, click Load input file and then open the menu under file type in the browse window. Choose vector file input. The display of files to input will switch from prn files to vec files.
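
If you prepare input files with a script, it can help to check them before loading. The Python sketch below reads text in the PRN layout described above and derives the two vectors a VEC file holds. It is a checking aid under stated assumptions, not Ws2m's own reader: the names parse_prn and to_vectors are hypothetical, and we assume "proportional abundance" means each species' share of all individuals.

```python
def parse_prn(text):
    """Parse PRN-style text: name line, 'species samples' line, then the matrix."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    name = lines[0].strip()
    n_species, n_samples = (int(tok) for tok in lines[1].split())
    matrix = [[int(tok) for tok in ln.split()] for ln in lines[2:]]
    if len(matrix) != n_samples:
        raise ValueError(f"expected {n_samples} rows, found {len(matrix)}")
    for i, row in enumerate(matrix, start=1):
        if len(row) != n_species:
            raise ValueError(f"row {i}: expected {n_species} cells, found {len(row)}")
    return name, matrix

def to_vectors(matrix):
    """Return (sample sizes, proportional abundance of each species)."""
    sample_sizes = [sum(row) for row in matrix]
    total = sum(sample_sizes)
    species_totals = [sum(col) for col in zip(*matrix)]
    proportions = [t / total for t in species_totals]
    return sample_sizes, proportions

# A tiny made-up data set: 3 species (columns), 2 samples (rows).
demo = "Demo\n3 2\n4 0 1\n2 3 0\n"
name, matrix = parse_prn(demo)
sizes, props = to_vectors(matrix)
print(name, sizes, props)
```

The two vectors have different lengths whenever the number of samples differs from the number of species, which is why a VEC file's lines may hold one or two numbers.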

Running WS2M

The Estimation Methods

The Jackknife (Jack# or JK#) Methods of Burnham & Overton (B&O or BO)
Jack k is the kth-order jackknife estimator, as described in Burnham & Overton (1978) and Smith and van Belle (1984).
BO is the actual step-by-step estimate based on the first four JK orders calculated; Ws2m chooses among them according to B&O's method, as described in Burnham & Overton (1978). This includes their interpolation method (which you may deselect under the Advanced Options tab). The output file also reports the average order used for BO at each step.
Both JK and BO estimators use incidence information only.
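
For readers who want to check a value by hand, here is a Python sketch of the first- and second-order jackknife in the form commonly given for incidence data (see Colwell and Coddington 1994). It does not reproduce B&O's selection or interpolation among orders, and the function names are hypothetical.

```python
def incidence_counts(matrix):
    """Number of samples in which each species (column) occurs."""
    return [sum(1 for v in col if v > 0) for col in zip(*matrix)]

def jackknife(matrix, order=1):
    """First- or second-order jackknife richness estimate from a sample matrix."""
    m = len(matrix)                                  # number of samples
    occ = [k for k in incidence_counts(matrix) if k > 0]
    s_obs = len(occ)                                 # species observed
    q1 = occ.count(1)                                # species found in exactly one sample
    q2 = occ.count(2)                                # species found in exactly two samples
    if order == 1:
        return s_obs + q1 * (m - 1) / m
    if order == 2:
        if m < 2:
            raise ValueError("second-order jackknife needs at least two samples")
        return s_obs + q1 * (2 * m - 3) / m - q2 * (m - 2) ** 2 / (m * (m - 1))
    raise ValueError("only orders 1 and 2 are sketched here")

# Toy data: 4 samples by 5 species; the last species was never observed.
toy = [
    [1, 0, 2, 0, 0],
    [1, 1, 0, 0, 0],
    [3, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
]
print(jackknife(toy, 1), jackknife(toy, 2))
```

Only the pattern of zeros and non-zeros matters here, which is what "incidence information only" means.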

 
Methods of Chao and Lee: Chao1, Chao2, ACE, and ICE
A suite of Chao estimates: Chao1 (Chao 1984); Chao2 (Chao 1987); the abundance-based coverage estimator (ACE; Chao and Lee 1992); and the incidence-based coverage estimator (ICE; Lee and Chao 1994). ACE and ICE are identical if you analyze a presence/absence matrix.

 
Chao1 and ACE use abundance information only, while Chao2 and ICE use incidence information only.
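
The difference between abundance and incidence information shows up clearly in a small Python sketch of the classic Chao1 and Chao2 formulas. This is an illustration, not Ws2m's code; the function names are hypothetical, and falling back to the bias-corrected form when there are no doubletons is our assumption, since the formula is otherwise undefined.

```python
def chao(s_obs, singles, doubles):
    """Classic Chao lower-bound estimate from singleton and doubleton counts."""
    if doubles > 0:
        return s_obs + singles ** 2 / (2 * doubles)
    # Assumed fallback: bias-corrected form with zero doubletons.
    return s_obs + singles * (singles - 1) / 2

def chao1(matrix):
    """Abundance-based: uses species with exactly 1 or 2 individuals overall."""
    totals = [sum(col) for col in zip(*matrix)]
    observed = [t for t in totals if t > 0]
    return chao(len(observed), observed.count(1), observed.count(2))

def chao2(matrix):
    """Incidence-based: uses species found in exactly 1 or 2 samples."""
    occ = [sum(1 for v in col if v > 0) for col in zip(*matrix)]
    observed = [k for k in occ if k > 0]
    return chao(len(observed), observed.count(1), observed.count(2))

# Toy data: 4 samples by 5 species; the last species was never observed.
toy = [
    [1, 0, 2, 0, 0],
    [1, 1, 0, 0, 0],
    [3, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
]
print(chao1(toy), chao2(toy))
```

On this toy matrix the two estimates differ because the third species has two individuals but occurs in only one sample.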

 
ACE and ICE, as presented in the original papers, occur in more than one form. The different forms involve a revised calculation of the CV, the coverage estimate, both, or neither (i.e., no recalculation). Choose the one you want under the Advanced Options tab.

 
To generate their estimates, ACE and ICE require specification of the positive integer Rare/Infrequent Cutoff (RIcut) (see Chao et al. 1993). ACE uses the first RIcut abundance classes (species with exactly 1, 2, ..., RIcut individuals). ICE uses the first RIcut incidence classes (species occurring in exactly 1, 2, ..., RIcut samples). Ten is the default value of RIcut in Ws2m, but you may alter it under the Advanced Options tab. To use all incidence or abundance classes, set RIcut to a high number (higher than the highest possible incidence/abundance).
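
The role of RIcut can be seen in a Python sketch of ACE in its basic form, without any of the revised CV or coverage calculations mentioned above. It is not Ws2m's code, and the function name ace and the example abundances are hypothetical. ICE has the same structure with incidence counts in place of abundances, plus a correction for the number of samples that is not shown here.

```python
def ace(abundances, ricut=10):
    """Basic abundance-based coverage estimator.

    `abundances` holds the total number of individuals of each species.
    Species with at most `ricut` individuals are treated as rare.
    """
    observed = [n for n in abundances if n > 0]
    rare = [n for n in observed if n <= ricut]
    s_rare = len(rare)
    s_abund = len(observed) - s_rare
    if s_rare == 0:
        return float(s_abund)
    n_rare = sum(rare)                    # individuals belonging to rare species
    f1 = rare.count(1)                    # singletons
    if f1 == n_rare:
        raise ValueError("coverage estimate is zero: every rare species is a singleton")
    coverage = 1 - f1 / n_rare
    spread = sum(n * (n - 1) for n in rare)
    # Squared coefficient of variation of the rare species, floored at zero.
    gamma2 = max(s_rare / coverage * spread / (n_rare * (n_rare - 1)) - 1, 0.0)
    return s_abund + s_rare / coverage + f1 / coverage * gamma2

counts = [25, 12, 1, 1, 2, 3, 1, 4]
print(ace(counts))            # default cutoff of ten
print(ace(counts, ricut=2))   # a lower cutoff moves species into the abundant group
```

Changing ricut changes which species feed the coverage and CV terms, so the estimate can shift noticeably on small data sets.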

 
Fisher's alpha (FAlpha; Fisher et al. 1943)
Does not estimate the number of species. Instead, it reports Fisher's well-known, sample-size-independent index of diversity.
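
Fisher's alpha has no closed form; it is the value of alpha that satisfies S = alpha ln(1 + N/alpha) for the observed S species and N individuals. The Python sketch below solves that equation by bisection. It is only an illustration: the function name is hypothetical, and we do not know which numerical method Ws2m itself uses.

```python
import math

def fishers_alpha(n_individuals, n_species, tol=1e-10):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection."""
    if not 0 < n_species < n_individuals:
        raise ValueError("need 0 < S < N for a finite positive alpha")

    def predicted(alpha):
        return alpha * math.log1p(n_individuals / alpha)

    lo, hi = 1e-9, 1.0
    while predicted(hi) < n_species:      # predicted() rises with alpha toward N
        hi *= 2
    while hi - lo > tol * hi:
        mid = (lo + hi) / 2
        if predicted(mid) < n_species:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha = fishers_alpha(1000, 50)
print(alpha)
```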

 
The Bootstrap (Boot)
The bootstrap is a general resampling technique which was applied to the problem of species diversity estimation by Smith and van Belle (1984). It uses incidence information only.
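
In the form usually quoted from Smith and van Belle (1984), the bootstrap estimate adds to the observed richness, for each observed species, the chance that it would be missed in m samples drawn with replacement. A minimal Python sketch follows; the function name is hypothetical and this is not Ws2m's code.

```python
def bootstrap_richness(matrix):
    """Bootstrap richness estimate from a sample-by-species matrix."""
    m = len(matrix)                                   # number of samples
    occurrences = [sum(1 for v in col if v > 0) for col in zip(*matrix)]
    observed = [k for k in occurrences if k > 0]
    # k / m is the proportion of samples that contain the species.
    return len(observed) + sum((1 - k / m) ** m for k in observed)

# Toy data: 4 samples by 5 species; the last species was never observed.
toy = [
    [1, 0, 2, 0, 0],
    [1, 1, 0, 0, 0],
    [3, 0, 0, 1, 0],
    [1, 0, 0, 1, 0],
]
print(bootstrap_richness(toy))
```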

 
Michaelis-Menten

The classic extrapolation-based estimator of the asymptote of the collector's curve. 
    S = P(N/(N + a))
where
    S is the number of species in the subset of samples
    N is the number of individuals in the subset of samples
    P is the estimated number of species
    a is the half-saturation constant and measures the curvature of the collector's curve
Using non-linear regression, one can fit the data directly to an M-M equation. However, the literature of diversity estimation has generally preferred the Eadie-Hofstee formula instead (Colwell and Coddington 1994). Based on E-H, a maximum likelihood method produces estimates of P and a. Ws2m calculates both estimators if you request the Michaelis-Menten fit in Advanced Options; otherwise it calculates only the Eadie-Hofstee estimate. In the output tables, MMm is based on Eadie-Hofstee and MMFm is the direct non-linear fit. Note that we have found E-H to behave very poorly when it makes estimates from small amounts of data. Often it even returns large negative estimates of diversity.
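
The Eadie-Hofstee idea is to rewrite S = P(N/(N + a)) as S = P - a(S/N), which is a straight line in S against S/N with intercept P and slope -a. The Python sketch below fits that line by ordinary least squares. Note the substitution: Ws2m uses a maximum likelihood method based on E-H, so this simpler regression is only a stand-in to show the transformation, and the function name is hypothetical.

```python
def eadie_hofstee_fit(ns, ss):
    """Least-squares fit of S = P - a * (S / N); returns (P, a)."""
    xs = [s / n for n, s in zip(ns, ss)]             # S / N
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ss) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ss))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope                          # P, a

# Synthetic points lying exactly on an M-M curve with P = 50 and a = 20.
ns = [10, 20, 40, 80, 160]
ss = [50 * n / (n + 20) for n in ns]
p_hat, a_hat = eadie_hofstee_fit(ns, ss)
print(p_hat, a_hat)
```

With real collector's curves the points do not fall on a line, and with few points the fitted slope can be positive, which is one way to picture how the large negative estimates mentioned above can arise.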

 
New, extrapolation-based estimators
Three new, Pareto-style, extrapolation-based estimators of the asymptote of the collector's curve. We intend them to replace the Michaelis-Menten function and other extrapolation estimators. The methods are unpublished at present. However, we have had excellent results with them using both abundance and presence/absence inputs. Compare them to the others using your own data set (and choosing 'shuffle individuals with replacement'). For us, they seem to lack the extreme negative biases at small N that we have seen in B&O, Chao1, and Chao2. And often they seem to converge faster than ICE.
 
The F5 equation is

    S = P^(1-(N^(-q(N^q))))
where
    S is the number of species in the subset of samples
    N is the number of individuals in the subset of samples
    P is the estimated number of species
    q is the parameter of curvature of the collector's curve
 
Heterogeneities present in data (which sometimes show up as jumps in the collector's curve) cause poor fitting of extrapolation-based estimators such as F5. You will find that F5 performs well when individuals are shuffled or sample order is shuffled for many (more than 10) runs. If you are fitting the S and N vectors by hand from the output file, try 0.15 as the initial value of q, and adapt the initial value from that point.
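
If you do fit F5 by hand, a coarse search over q near 0.15 is one way to get started. The Python sketch below is an illustration under stated assumptions, not Ws2m's Levenberg-Marquardt routine: we read the exponent in the F5 equation as -q times N^q, we fit in log space (where ln S is proportional to ln P for a fixed q), and the names f5 and fit_f5 are hypothetical.

```python
import math

def f5(n, p, q):
    """F5 curve, reading the equation as S = P ** (1 - N ** (-q * N ** q))."""
    return p ** (1 - n ** (-q * n ** q))

def fit_f5(ns, ss, q_grid):
    """Grid search over q; for each q, ln P has a closed-form least-squares value."""
    log_s = [math.log(s) for s in ss]
    best = None
    for q in q_grid:
        g = [1 - n ** (-q * n ** q) for n in ns]      # exponent of P at each N
        denom = sum(gi * gi for gi in g)
        if denom == 0:
            continue
        log_p = sum(gi * li for gi, li in zip(g, log_s)) / denom
        sse = sum((li - gi * log_p) ** 2 for gi, li in zip(g, log_s))
        if best is None or sse < best[0]:
            best = (sse, math.exp(log_p), q)
    return best[1], best[2]                           # P, q

# Synthetic curve generated from F5 itself with P = 80 and q = 0.15.
ns = [1, 10, 50, 200, 1000, 5000]
ss = [f5(n, 80, 0.15) for n in ns]
p_hat, q_hat = fit_f5(ns, ss, [i / 100 for i in range(5, 31)])
print(p_hat, q_hat)
```

Under this reading the curve gives S = 1 at N = 1 and approaches P as N grows. A fit in log space weights the points differently from a fit on S itself, so treat the result as a starting value for a proper non-linear fit.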
 
The F3 equation is

   S = P^(1-(N^(-q ln N)))
where parameters are defined as for F5. We believe that, for small sample sizes, F3 may have a greater negative bias than F5. It also may not work very well for presence/absence matrices. But it is better than any published extrapolation method.
 
The F6 equation is

   S = P^(1-(a^(1-(N^q))))
F6 has two parameters of curvature, q and a.
If you are fitting the S and N vectors by hand from the output file, try 0.2 and 0.4 as initial values of q and a, and adapt the initial values from that point.

You may ask Ws2m not to calculate the Pareto-style estimators. Just check the appropriate box in Advanced Options.
 

 
Notes on new extrapolation-based estimators:
All these extrapolation-based estimators require non-linear fitting. We have incorporated this into Ws2m. Ws2m's nonlinear curve fitting uses the Levenberg-Marquardt method as described in Press et al. 1986.
    Because nonlinear fitting is a trial and error procedure (which would be quite daunting to fully automate), all extrapolation methods will occasionally give erratic results. These appear as values of -1. Simply reject any results of -1. (Or, go to the output file and fit the formulae to the S and N vectors by hand. Note: the erratic behavior is uncommon enough that you may never even see it.)
    No matter how many runs are specified, extrapolation-based estimators are generally calculated only once, on each step of the averaged species accumulation curve. The sole exception: if you ask for a separate output file for each run. Then the extrapolations are made for each run separately.

 
Log-normal estimator
This fits a log-normal distribution to the abundance data and calculates its integral as suggested by Frank Preston.

 

Other notes

The Advanced Options tab contains more of our toys. Ignore them or ask for more information if you are curious. The Force # Individuals option sets a reduced total sample size for a data set. The default of 0 uses the sample size actually obtained. The Shuffle Only Pooled option is also our toy, at least for now. This option leaves local abundance distributions intact. Making samples exchangeable permits us to use de Finetti's theorem. This will be further developed because it leads to confidence intervals.

By the way, we have not implemented 'variances' reported in the literature. They are not what most biologists would like to have. Here is what we want: if a method yields an estimate of S species, we want to know how that estimate compares to the true diversity. That is, we want to be able to say that true diversity lies between S+e and S-e with 95% confidence. But published confidence intervals do not tell that. What they say is that given these data, the estimate will lie between S+e and S-e 95% of the time. Because the estimates retain a bias in many cases, the truth will often lie well outside the 95% confidence interval. We are working on this problem.

Final Bug
When you finish Ws2m and exit, Windows usually complains. You will see an error message. It apparently comes from a problem in the published nonlinear regression routine that we incorporated. Pain to fix, and we have other stuff to do. Ignore it. No harm; no foul.

Updates
URL: EEBWEB.ARIZONA.EDU/DIVERSITY
There you will find the latest version of these notes, the latest stable Ws2m and the latest beta version of Ws2m. If you have time, let Will Turner (wturner@u.arizona.edu) and Mike Rosenzweig (scarab@u.arizona.edu) know about your experiences with this package.

Literature Cited

Burnham, K. P. and W. S. Overton. 1978. Estimation of the size of a closed population when capture probabilities vary among individuals. Biometrika 65:625.

Chao, A. 1984. Nonparametric estimation of the number of classes in a population. Scandinavian Journal of Statistics 11:265-270.

Chao, A. 1987. Estimating the population size for capture-recapture data with unequal catchability. Biometrics 43:783.

Chao, A. and S.-M. Lee. 1992. Estimating the number of classes via sample coverage. Journal of the American Statistical Association 87:210-217.

Chao, A., M.-C. Ma and M.C.K. Yang. 1993. Stopping rules and estimation for recapture debugging with unequal failure rates. Biometrika 80:193-201.

Colwell, R.K. & Coddington, J.A. 1994. Estimating terrestrial biodiversity through extrapolation. Philosophical Trans. Royal Society of London, B, 345, 101-118.

Fisher, R.A., A.S. Corbet and C.B. Williams. 1943. The relations between the number of species and the number of individuals in a random sample of an animal population. Journal of Animal Ecology 12:42-58.

Lee, S.-M. and A. Chao. 1994. Estimating population size via sample coverage for closed capture-recapture models. Biometrics 50:88-97.

Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. 1986. Numerical recipes: the art of scientific computing. Cambridge University Press, Cambridge.

Preston, F.W. 1948. The commonness, and rarity, of species. Ecology 29:254-283.
Preston, F.W. 1960. Time and space and the variation of species. Ecology 41:785-790.
Preston, F.W. 1962a. The canonical distribution of commonness and rarity. Ecology 43:185-215.
Preston, F.W. 1962b. The canonical distribution of commonness and rarity. Ecology 43:410-432.

Smith, E.P. and G. van Belle. 1984. Nonparametric estimation of species richness. Biometrics 40:119-129.