WS2M User's Manual
Ws2m.exe estimates the total number of types in a finite
collection. It was developed to estimate the number of species in a collection
of identified individuals, and we will use the language of that application
to explain the program. (We believe Ws2m can be used on many other sorts
of data but these uses remain untested. Ws2m contains a sample-with-replacement
option to allow you to test it yourself.)
Ws2m takes in a data set and produces a series of statistics for a randomly
ordered data set like it. At each step of the series, an enlarged proportion
of the data set is used to calculate both Fisher's alpha and estimates
of the number of species. Ws2m uses a large (and user-controllable) variety
of estimators to produce the estimates. It also reports the number of individuals
used to that point and the actual number of species so far obtained in
the collection. It can report species-abundance distributions and Jaccard
indices. It has more tools than we know how to use yet. However, it will
not take out the trash at your grandmother's apartment.
The nature of the data: three alternatives
-
A series of samples with known sample abundances (PRN file). Each
sample contains a number of individuals. The species of each individual
in each sample is known.
-
Presence/absence matrix. Same as previous except without abundances.
Each cell has a one (1) if the species appeared in the sample, a zero (0)
if it was absent. Obviously, you can convert data of type 1 to this type
if, for any reason, you wish to.
-
A vector (VEC file). There are two vectors in a VEC file: the list
of sample sizes and the list of each species' proportional abundance. Useful
when you do not know the detailed composition of each sample, but do know
the abundance proportions of the entire set. We discuss vector files below.
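The conversion from the first data type to the second can be sketched in a few lines. This is only an illustration, not part of Ws2m; the function name `to_presence_absence` and the small matrix are made up for the example.

```python
def to_presence_absence(matrix):
    """Convert an abundance matrix (rows = samples, columns = species) to 0/1."""
    return [[1 if count > 0 else 0 for count in row] for row in matrix]


# Hypothetical data: 2 samples, 4 species.
abundances = [
    [3, 0, 12, 1],
    [0, 0, 7, 2],
]
incidence = to_presence_absence(abundances)
print(incidence)  # [[1, 0, 1, 1], [0, 0, 1, 1]]
```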
The output
- The output file with the extension .out. This contains all
the results in ASCII format. The output file is permanent until you overwrite
it. Using your favorite graphics package, you can display anything you wish
from the output files. The following table explains some suffixes on column
headings.
X_BV | Bootstrapped Variance. Reports the variance in X among runs. X can be a counted value, such as the number of species observed ("Sobs"), or an estimated value, such as "Chao1". X_BV values will be 0 if you do not shuffle sample order or resample individuals, or if you only do one run. Note that Ws2m does not compute the variance formulas published for the various estimators. See also "Other notes" below.
X_RD | Mean deviation of X from zeta across multiple runs. zeta is the total number of species actually present in the entire original data set. This is most useful for testing estimators on data sets which you assume represent the real world. An ideal estimator (which doesn't exist yet) would have an X_RD of 0 asymptotically. Even better would be to attain an X_RD of 0 with only a single sample.
X_RPD | Like X_RD, but expressed as a percentage of zeta.
X_RPO | Percentage of runs on which X exceeded zeta. A value of 50 would indicate that the estimator produces overestimates and underestimates with equal frequency.
X_QPC | Percentage of samples which were within 5% of the final X value for all samples. The ideal estimator would have an X_QPC of 100% (meaning even the estimate with 1 sample was within 5%).
- The summary file with the extension .sum. This contains the
averages of the results at the last step (the one with the largest number
of individuals), also in ASCII format.
- The display. A graph of some results.
You change which of the results you graph by selecting them
under the Graphing tab. The graphing options also allow you to display
such properties as species-abundance distributions (and fit some SAD models)
and Jaccard indices. Try them. (Note: we know about the display bug in the shuffled
SAD. It is intermittent, and does not affect the file report of the results.
A pain to fix, and we have not yet done it.)
The display is linked to the output so that you need not
rerun the program to graph other results from the output file. (Of course, you
cannot display results you did not request from the previous run.) While results
are in the program, you may change the graphing options and redisplay the graph
any number of times.
The display is ephemeral; the display engine loses its link
to the output as soon as a new data set is loaded or the program is closed.
However, you can save the display by using the Copy option below it.
Copy puts the display on the Windows clipboard. From the clipboard, you
can paste it into a graphics program or wordprocessor and save it.
You may also maintain a display, run the same data or another
set, and overlay the second (third, fourth, etc.) display on the first. Check
Overlay new plot on existing plot at the bottom of the Graphing
options page.
- A vector file with the extension .vec. Sample sizes and abundances.
Produced by Ws2m if you ask for it in Advanced Options (see below). Or you
can make one yourself.
- A file of species-abundance results. Permanent text file produced
when you request SAD graphing.
- A text file of the shuffled matrix at the last run. Permanent text
file produced when you request shuffling of individuals (see below). Useful
for checking and confidence building.
- A debug file. Permanent text file produced if anything unexpected
happens during execution (e.g., one of the non-linear regression procedures
does not converge).
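The column suffixes used in the .out file can be illustrated with a short calculation over the values of one statistic X across runs. This is a sketch of the definitions as described in the suffix table, not Ws2m's own code. It assumes that X_BV is the sample variance (n - 1 denominator) and that X_RD is a signed mean deviation; the function name `suffix_stats` is hypothetical.

```python
from statistics import mean, variance


def suffix_stats(values, zeta):
    """Summarize one statistic X over several runs.

    values: the value of X on each run at a given step.
    zeta:   the number of species actually present in the whole data set.
    """
    # Assumption: sample variance; a single run has no spread, so report 0.
    bv = variance(values) if len(values) > 1 else 0.0
    # Assumption: signed deviation, so over- and underestimates can cancel.
    rd = mean(v - zeta for v in values)
    rpd = 100.0 * rd / zeta
    rpo = 100.0 * sum(1 for v in values if v > zeta) / len(values)
    return {"BV": bv, "RD": rd, "RPD": rpd, "RPO": rpo}


# Hypothetical estimates from four runs, with 30 species truly present.
stats = suffix_stats([28, 30, 32, 34], zeta=30)
print(stats)
```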
Preparing the data file for input
In each case, the data file must be a pure ASCII file. Use a text editor
to be sure. If the data file will not load properly, use a hex editor to
get rid of the non-ASCII bytes.
-
PRN format. Types 1 & 2, above; the data occur as a series of
samples ("quadrats").
-
First line is the name of the data set. ASCII only, but any name you want
is OK. It does not have to be the same as the file name. Each line ends
with a carriage return.
-
Second line has two control parameters. They are separated from each other
by one (or more) spaces. The first parameter is the number of species (i.e.,
columns in the matrix); the second, the number of samples (i.e.,
rows in the matrix).
-
Subsequent lines contain the data matrix. Columns are the species (but
they are not named). Rows are the samples (but they are not numbered).
In each cell of the matrix, put the number of individuals of that column's
species collected in that sample. (Or put a 1 or 0, if this is a presence/absence
data matrix.) Separate the cells with a space (nothing else: no tabs, no
semi-colons, nothing).
Example of a matrix with 34 species collected from 6 samples:
Butterflies
34 6
1 1 42 53 44 9 1 837 4 1 7 5 9 2 2 2 50 13 1 1 1 11 14 5 5 2 0 0 0 0 0 0 0 0
19 6 0 188 126 36 0 7 5 1 23 29 43 50 3 3 257 26 9 0 5 5 9 5 24 10 1 2 2 3 0 0 0 0
2 0 0 44 474 18 0 167 1 0 9 8 19 21 2 9 552 30 12 0 7 9 13 15 19 6 1 0 0 1 0 0 0 0
6 0 0 175 293 14 0 16 4 0 0 20 12 6 24 20 50 7 15 3 3 0 2 25 4 2 2 0 1 1 1 0 0 0
27 0 0 52 743 31 1 363 1 0 10 7 11 25 4 11 1041 18 38 21 18 12 19 11 15 3 1 1 0 3 0 0 0 0
4 0 0 128 146 38 0 18 6 1 9 7 13 4 19 1 132 2 11 1 3 0 6 31 45 3 0 0 0 0 0 0 0 0
-
Data matrix input files must use the extension prn. Thus, the example
could be called Butterflies.prn.
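To check a PRN file before loading it, a small reader along the following lines may help. It is a sketch based only on the layout described above (name line, then species and sample counts, then the matrix) and is not Ws2m's own loader. The function name `parse_prn` and the demo data are hypothetical. The reader flattens all tokens after line two, so it accepts matrix rows that wrap across several lines; whether Ws2m itself accepts wrapped rows is not stated in this manual.

```python
def parse_prn(text):
    """Return (name, matrix) with matrix[sample][species] as integer counts."""
    lines = text.strip().splitlines()
    name = lines[0].strip()
    n_species, n_samples = (int(tok) for tok in lines[1].split())
    # Flatten every remaining token so wrapped rows still parse.
    values = [int(tok) for line in lines[2:] for tok in line.split()]
    if len(values) != n_species * n_samples:
        raise ValueError(
            f"expected {n_species * n_samples} cells, found {len(values)}"
        )
    matrix = [
        values[i * n_species:(i + 1) * n_species] for i in range(n_samples)
    ]
    return name, matrix


# Hypothetical file: 4 species, 3 samples, second row wrapped.
demo = "Demo\n4 3\n5 0 2 1\n0 0 7\n3\n1 1 0 0\n"
name, matrix = parse_prn(demo)
n_individuals = sum(sum(row) for row in matrix)
print(name, len(matrix), len(matrix[0]), n_individuals)
```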
-
VEC format. The data occur as vectors of sample sizes and proportional
abundances. The first two lines are the same as prn files. The next lines
contain one or two numbers. The first number on line three will be the
total number of individuals in one of the samples. The second will be the
proportional abundance of the first species in the entire set of samples.
The next line has the total number of individuals in a second sample followed
by the proportional abundance of the second species in the entire set of
samples. Etc. The result is a pair of vectors that may not have the same
length. Neither the sample size vector nor the abundance vector need be
ordered in any way. But make sure the number of elements in the first vector
(sample sizes) agrees with the second number in line two, and that the
number of elements in the second vector agrees with the first number in
line two.
One way to see a vector file example is to run the sample prn file above. Check the option in Advanced Options, and one of the output files will have the extension vec; it is the vector file corresponding to that prn file. Vector input files must use the extension vec. Thus the vector output file from the example is called Butterflies.vec.
To use the vector input
method, click Load input file and then open the menu under file
type in the browse window. Choose vector file input. The display of files
to input will switch from prn files to vec files.
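If you want to build a vector file yourself from a full data matrix, the two vectors can be computed as row sums and pooled column proportions. The sketch below follows the description of the VEC layout given above. The number formatting (six decimals) and the way lines are written once one vector runs out are assumptions, so compare the result with a vec file that Ws2m itself produces. The function name `matrix_to_vec_lines` is hypothetical.

```python
from itertools import zip_longest


def matrix_to_vec_lines(name, matrix):
    """Build the lines of a VEC file from matrix[sample][species] counts."""
    n_samples = len(matrix)
    n_species = len(matrix[0])
    sample_sizes = [sum(row) for row in matrix]
    total = sum(sample_sizes)
    proportions = [
        sum(row[j] for row in matrix) / total for j in range(n_species)
    ]
    # First two lines are the same as in a PRN file.
    lines = [name, f"{n_species} {n_samples}"]
    # Pair the two vectors line by line; the shorter one simply runs out.
    for size, prop in zip_longest(sample_sizes, proportions):
        fields = []
        if size is not None:
            fields.append(str(size))
        if prop is not None:
            fields.append(f"{prop:.6f}")  # assumed number format
        lines.append(" ".join(fields))
    return lines


demo = [[5, 0, 2, 1], [0, 0, 7, 3], [1, 1, 0, 0]]
print("\n".join(matrix_to_vec_lines("Demo", demo)))
```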
Running WS2M
-
Run WS2M.EXE on a Windows 95/98/2000/NT machine. Some users have reported
that the program also works on their Macintosh under Windows emulation.
-
Click Load Input File and select the data file you want to analyze.
Ws2m will load it and report several statistics: the number of samples;
the number of species; and the number of individuals.
-
Select the Actions tab.
-
The most important choice: Sample With Replacement. If you check
this, then Ws2m will be estimating the number of species actually in the
data set itself, NOT the number in the sampling universe from which you
got it. This option should be used only to test Ws2m and its methods. (If
you check Make Exchangeable samples, Ws2m will not perform replacement
regardless of whether this option is checked.)
-
Crucial notes for dealing with presence-absence matrices:
-
Check Shuffle Incidences Only (on the Advanced Options tab).
This option prevents any shuffled cell from containing more than one individual.
It must not be checked for any other matrix form.
-
With a presence/absence matrix, you cannot maintain sample sizes unless
you are sampling with replacement (in which case you are merely checking
the methods to see if they will accurately report the number of species
in your sample). If you want to estimate what is actually out there, check
Make Exchangeable under the Actions tab and do not check Sample With
Replacement.
-
Shuffle Individuals (among samples) removes any habitat heterogeneity
(i.e. intrinsic differences between times and/or places of the samples).
Each step of the procedure will then estimate the number of species in
the entire sampling universe.
-
Shuffle Sample Order corrects for unequal sample sizes by randomizing
the order of quadrats on each run.
-
Make Samples IID (Independent and Identically Distributed)
makes each sample artificially equal in number of individuals. There is
very little difference in practice between the estimates you will get if
you check this or retain the actual sample sizes (provided you ask the
estimator to run 20 or more times).
-
# Runs tells the program to repeat the procedure a certain number
of times and output the average estimates.
-
Random Seed lets you repeat the whole procedure precisely
by resetting this value to any fixed integer (say 42) before each set of
runs. If you use the same seed and options, and if your machine and program
are operating properly, you will obtain exactly the same estimates every
time.
-
Name the Output File(s) to save them (default = temp). Ws2m automatically
adds the extension .out.
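The interaction of Shuffle Sample Order, # Runs and Random Seed can be pictured with a small simulation. This is a sketch of the general idea (randomize the order of samples on each run, accumulate species, average over runs), not Ws2m's actual procedure or random number generator. The names `accumulation_runs` and `mean_sobs` are hypothetical.

```python
import random


def accumulation_runs(matrix, n_runs, seed):
    """Return one collector's curve per run as lists of (individuals, Sobs)."""
    rng = random.Random(seed)  # a fixed seed makes the runs repeatable
    n_species = len(matrix[0])
    curves = []
    for _ in range(n_runs):
        order = list(range(len(matrix)))
        rng.shuffle(order)  # randomize the order of samples for this run
        seen = [False] * n_species
        individuals = 0
        curve = []
        for idx in order:
            row = matrix[idx]
            individuals += sum(row)
            for j, count in enumerate(row):
                if count > 0:
                    seen[j] = True
            curve.append((individuals, sum(seen)))
        curves.append(curve)
    return curves


def mean_sobs(curves):
    """Average the observed species count at each step over all runs."""
    n_steps = len(curves[0])
    return [sum(c[k][1] for c in curves) / len(curves) for k in range(n_steps)]


demo = [[5, 0, 2, 1], [0, 0, 7, 3], [1, 1, 0, 0]]
runs = accumulation_runs(demo, n_runs=20, seed=42)
print(mean_sobs(runs))
```

Calling `accumulation_runs` twice with the same seed and options returns identical curves, which is the behavior the Random Seed option is meant to give you.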
The Estimation Methods
-
The Jackknife (Jack# or JK#) Methods of Burnham & Overton (B&O
or BO)
-
Jack k is the kth-order jackknife estimator, as described
in Burnham & Overton (1978) and Smith and van Belle (1984).
-
BO is the actual step-by-step estimate based on the first four JK orders
calculated; Ws2m chooses among them according to B&O's method, as described
in Burnham & Overton (1978). This includes their interpolation method
(which you may deselect under the Advanced Options tab). The output
file also reports the average order used for BO at each step.
-
Both JK and BO estimators use incidence information only.
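For reference, the first- and second-order jackknife formulas in their commonly cited incidence form can be written out as follows. This is a sketch of the textbook formulas, not Ws2m's code, and it does not include B&O's order-selection or interpolation steps. The function names are hypothetical; Q1 and Q2 are the numbers of species found in exactly one and exactly two samples, and m is the number of samples.

```python
def incidence_counts(matrix):
    """Number of samples in which each species occurs."""
    n_species = len(matrix[0])
    return [sum(1 for row in matrix if row[j] > 0) for j in range(n_species)]


def jackknife(matrix, order=1):
    """First- or second-order jackknife richness estimate (incidence form)."""
    m = len(matrix)
    occ = incidence_counts(matrix)
    s_obs = sum(1 for k in occ if k > 0)
    q1 = sum(1 for k in occ if k == 1)  # species found in exactly one sample
    q2 = sum(1 for k in occ if k == 2)  # species found in exactly two samples
    if order == 1:
        return s_obs + q1 * (m - 1) / m
    if order == 2:
        if m < 2:
            raise ValueError("second-order jackknife needs at least 2 samples")
        return s_obs + q1 * (2 * m - 3) / m - q2 * (m - 2) ** 2 / (m * (m - 1))
    raise ValueError("this sketch implements orders 1 and 2 only")


demo = [[5, 0, 2, 1], [0, 0, 7, 3], [1, 1, 0, 0]]
print(jackknife(demo, order=1), jackknife(demo, order=2))
```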
-
Methods of Chao and Lee: Chao1, Chao2, ACE, and ICE
-
A suite of Chao estimates: Chao1 (Chao 1984); Chao2 (Chao 1987); the abundance-based
coverage estimator (ACE; Chao and Lee 1992); and the incidence-based coverage
estimator (ICE; Lee and Chao 1994). ACE and ICE are identical if you analyze
a presence/absence matrix.
-
Chao1 and ACE use abundance information only, while Chao2 and ICE use incidence
information only.
-
ACE and ICE, as presented in the original papers, occur in more than one
form. The different forms involve a revised calculation of the CV, the
coverage estimate, both, or neither (i.e., no recalculation). Choose
the one you want under the Advanced Options tab.
-
To generate their estimates, ACE and ICE require specification of the positive
integer Rare/Infrequent Cutoff (RIcut) (see Chao et al. 1993).
ACE uses the first RIcut abundance classes (species with exactly
1, 2, ..., RIcut individuals). ICE uses the first RIcut incidence
classes (species occurring in exactly 1, 2, ..., RIcut samples).
Ten is the default value of RIcut in Ws2m, but you may alter it
under the Advanced Options tab. To use all incidence or abundance
classes, set RIcut to a high number (higher than the highest possible
incidence/abundance).
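The difference between abundance-based and incidence-based inputs is easiest to see in the two simplest estimators. The sketch below uses the classic forms, Sobs + F1^2/(2 F2) for Chao1 and Sobs + Q1^2/(2 Q2) for Chao2. The fallback used when there are no doubletons or duplicates (the bias-corrected form) is an assumption of this sketch, and Ws2m may handle that case differently. ACE and ICE are not shown. All function names are hypothetical.

```python
def chao_estimate(s_obs, singles, doubles):
    """Classic Chao lower-bound form: Sobs + f1^2 / (2 * f2)."""
    if doubles > 0:
        return s_obs + singles ** 2 / (2 * doubles)
    # Assumption: bias-corrected form when the doubles class is empty.
    return s_obs + singles * (singles - 1) / 2


def chao1(matrix):
    """Abundance-based: uses pooled counts of individuals per species."""
    totals = [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
    s_obs = sum(1 for t in totals if t > 0)
    f1 = sum(1 for t in totals if t == 1)  # singletons
    f2 = sum(1 for t in totals if t == 2)  # doubletons
    return chao_estimate(s_obs, f1, f2)


def chao2(matrix):
    """Incidence-based: uses the number of samples each species occurs in."""
    occ = [sum(1 for row in matrix if row[j] > 0) for j in range(len(matrix[0]))]
    s_obs = sum(1 for k in occ if k > 0)
    q1 = sum(1 for k in occ if k == 1)  # uniques
    q2 = sum(1 for k in occ if k == 2)  # duplicates
    return chao_estimate(s_obs, q1, q2)


demo = [[5, 0, 2, 1], [0, 0, 7, 3], [1, 1, 0, 0]]
print(chao1(demo), chao2(demo))
```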
-
Fisher's alpha (FAlpha; Fisher et al. 1943)
-
Does not estimate the number of species. Instead, it reports Fisher's well-known, sample-size-independent index of diversity.
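Fisher's alpha is defined implicitly by S = alpha * ln(1 + N/alpha), where S is the number of species and N the number of individuals, so it has to be found numerically. A minimal sketch using bisection follows. The solver, tolerance and function name are choices made for this example and say nothing about how Ws2m solves the equation.

```python
import math


def fisher_alpha(n_individuals, n_species, tol=1e-10):
    """Solve S = alpha * ln(1 + N / alpha) for alpha by bisection."""
    if not 0 < n_species < n_individuals:
        raise ValueError("need 0 < S < N for a finite positive alpha")

    def excess(alpha):
        # Positive when this alpha predicts more species than were observed.
        return alpha * math.log1p(n_individuals / alpha) - n_species

    lo, hi = 1e-9, 1.0
    while excess(hi) < 0:  # expand the bracket until it contains the root
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if excess(mid) < 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return (lo + hi) / 2


alpha = fisher_alpha(1000, 50)
print(alpha)
```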
-
The Bootstrap (Boot)
-
The bootstrap is a general resampling technique which was applied to the
problem of species diversity estimation by Smith and van Belle (1984).
It uses incidence information only.
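In its usual closed form, the bootstrap estimate adds to Sobs, for each observed species, the probability that the species would be missed in m samples drawn with replacement: Sobs + sum over species of (1 - p_k)^m, where p_k is the proportion of samples containing species k. The sketch below implements that closed form; the function name is hypothetical, and Ws2m's implementation may differ in detail.

```python
def bootstrap_richness(matrix):
    """Closed-form bootstrap estimate from a samples-by-species matrix."""
    m = len(matrix)
    n_species = len(matrix[0])
    occ = [sum(1 for row in matrix if row[j] > 0) for j in range(n_species)]
    observed = [k for k in occ if k > 0]
    s_obs = len(observed)
    # (1 - p_k)^m: chance that species k is absent from m resampled samples.
    return s_obs + sum((1 - k / m) ** m for k in observed)


demo = [[5, 0, 2, 1], [0, 0, 7, 3], [1, 1, 0, 0]]
print(bootstrap_richness(demo))
```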
-
Michaelis-Menten
The classic extrapolation-based estimator of the asymptote of the collector's
curve.
S = P(N/(N + a))
where
S is the number of species in the subset
of samples
N is the number of individuals in the subset
of samples
P is the estimated number of species
a is the half-saturation constant and measures
the curvature of the collector's curve
-
-
Using non-linear regression, one can fit the data directly to an M-M equation.
However, the literature of diversity estimation generally has preferred
the Eadie-Hofstee formula instead (Colwell and Coddington, 1994). Based
on E-H, a maximum likelihood method produces estimates of P and
a. WS2M calculates both estimators if you request the Michaelis-Menten fit in Advanced Options. It calculates only the Eadie-Hofstee otherwise. In the output tables, MMm is based on Eadie-Hofstee and MMFm is the direct non-linear fit. Note that we have found E-H to behave very poorly when it makes estimates using small amounts of data. Often it even returns large negative estimates of diversity.
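The Eadie-Hofstee form rearranges the M-M equation to S = P - a(S/N), so a straight-line fit of S against S/N gives P as the intercept and -a as the slope. The sketch below uses ordinary least squares on that line. It is a simplification for illustration: it is not the maximum likelihood method mentioned above, and the function name and test data are hypothetical.

```python
def eadie_hofstee_fit(n_values, s_values):
    """Estimate (P, a) by least squares on S = P - a * (S / N)."""
    xs = [s / n for s, n in zip(s_values, n_values)]
    ys = list(s_values)
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, -slope  # P is the intercept, a is minus the slope


# Noise-free curve generated from P = 100, a = 50.
n_values = [10, 20, 50, 100, 200, 500]
s_values = [100 * n / (n + 50) for n in n_values]
p_hat, a_hat = eadie_hofstee_fit(n_values, s_values)
print(p_hat, a_hat)
```

With noise-free data the line is exact and the fit recovers P and a. With few, noisy points the S/N axis is poorly determined, which is one way to picture the erratic small-sample behavior noted above.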
-
New, extrapolation-based estimators
-
Three new, Pareto-style, extrapolation-based estimators of the asymptote
of the collector's curve. We intend them to replace the Michaelis-Menten
function and other extrapolation estimators. The methods are unpublished at present.
However, we have had excellent results with them using both abundance and
presence/absence inputs. Compare them to the others using your own data set (and choosing 'shuffle individuals with replacement'). For us, they seem to lack
the extreme negative biases at small N that we have seen in B&O, Chao1,
and Chao2. They also often seem to converge faster than ICE.
-
-
The F5 equation is
S = P^(1-(N^(-q(N^q))))
where
S is the number of species in the subset
of samples
N is the number of individuals in the subset
of samples
P is the estimated number of species
q is the parameter of curvature of the collector's
curve
-
-
Heterogeneities present in data (which sometimes show up as jumps in the
collector's curve) cause poor fitting of extrapolation-based estimators
such as F5. You will find that F5 performs well when individuals are shuffled
or sample order is shuffled for many (more than 10) runs. If you are fitting
the S and N vectors by hand from the output file, try 0.15 as the initial
value of q, and adapt the initial value from that point.
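If you want to fit the S and N vectors by hand, one simple approach exploits the fact that, for a fixed q, the F5 equation is linear in ln P: ln S = (1 - N^(-q N^q)) ln P. The sketch below scans a grid of q values and solves for ln P by least squares at each one. This is a swapped-in technique chosen for simplicity: it minimizes error in log space and is not the Levenberg-Marquardt fit that Ws2m performs, so its answers can differ on noisy data. The function names and the grid are hypothetical.

```python
import math


def f5(n, p, q):
    """F5 collector's curve: S = P^(1 - N^(-q * N^q))."""
    return p ** (1.0 - n ** (-q * n ** q))


def fit_f5(n_values, s_values, q_grid=None):
    """Return (P, q) minimizing squared error in ln S over a grid of q."""
    if q_grid is None:
        q_grid = [i / 100 for i in range(1, 61)]  # 0.01 .. 0.60
    log_s = [math.log(s) for s in s_values]
    best = None
    for q in q_grid:
        w = [1.0 - n ** (-q * n ** q) for n in n_values]
        denom = sum(x * x for x in w)
        if denom == 0:
            continue
        # Least squares through the origin for ln S = w * ln P.
        log_p = sum(x * y for x, y in zip(w, log_s)) / denom
        sse = sum((y - log_p * x) ** 2 for x, y in zip(w, log_s))
        if best is None or sse < best[0]:
            best = (sse, math.exp(log_p), q)
    _, p_hat, q_hat = best
    return p_hat, q_hat


# Noise-free curve generated from P = 200, q = 0.15.
n_values = [1, 5, 10, 50, 100, 500, 1000, 5000]
s_values = [f5(n, 200.0, 0.15) for n in n_values]
print(fit_f5(n_values, s_values))
```

The grid result can also serve as a starting point for a proper non-linear fit in the statistics package of your choice.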
-
-
The F3 equation is
S = P^(1-(N^(-q ln N)))
where parameters are defined as for F5. We believe that, for small
sample sizes, F3 may have a greater negative bias than F5. It also may
not work very well for presence/absence matrices. But it is better than
any published extrapolation method.
-
-
The F6 equation is
S = P^(1-(a^(1-(N^q))))
F6 has two parameters of curvature, q and a.
If you are fitting the S and N vectors by hand from the output file,
try 0.2 and 0.4 as initial values of q and a, and adapt the
initial values from that point.
You may ask Ws2m not to calculate the Pareto-style estimators. Just check the appropriate box in Advanced Options.
-
-
Notes on new extrapolation-based estimators:
-
All these extrapolation-based estimators require non-linear fitting. We
have incorporated this into Ws2m. Ws2m's nonlinear curve fitting uses the
Levenberg-Marquardt method as described in Press et al. 1986.
-
Because nonlinear fitting is a trial and error procedure
(which would be quite daunting to fully automate), all extrapolation methods
will occasionally give erratic results. These appear as values of -1. Simply
reject any results of -1. (Or, go to the output file and fit the formulae
to the S and N vectors by hand. Note: the erratic behavior is uncommon
enough that you may never even see it.)
-
No matter how many runs are specified, extrapolation-based
estimators are generally calculated only once, on each step of the averaged species
accumulation curve. The sole exception: if you ask for a separate output file for each run. Then the extrapolations are made for each run separately.
-
Log-normal estimator
-
This fits a log-normal distribution to the abundance data and calculates
its integral as suggested by Frank Preston.
Other notes
The Advanced Options tab contains more of our toys. Ignore them
or ask for more information if you are curious. The Force # Individuals
option sets a reduced total sample size for a data set. The default of 0 uses
the sample size actually obtained. The Shuffle Only Pooled option is
also our toy, at least for now. This option leaves local abundance distributions
intact. Making samples exchangeable permits us to use de Finetti's theorem.
This will be further developed because it leads to confidence intervals.
By the way, we have not implemented the 'variances' reported in the
literature. They are not what most biologists would like to
have. Here is what we want: if a method yields an estimate of S
species, we want to know how that estimate compares to the true diversity.
That is, we want to be able to say that true diversity lies between
S-e and S+e with 95% confidence. But published confidence intervals
do not tell us that. What they say is that, given these data, the
estimate will lie between S-e and S+e 95% of the time. Because the
estimates retain a bias in many cases, the truth will often lie well outside
the 95% confidence interval. We are working on this problem.
Final Bug
When you finish Ws2m and exit, Windows usually complains. You will
see an error message. It apparently comes from a problem in the published
nonlinear regression routine that we incorporated. Pain to fix, and we
have other stuff to do. Ignore it. No harm; no foul.
Updates
URL: EEBWEB.ARIZONA.EDU/DIVERSITY
There you will find the latest version of these notes, the latest stable
Ws2m and the latest beta version of Ws2m. If you have time, let Will Turner
(wturner@u.arizona.edu) and Mike Rosenzweig (scarab@u.arizona.edu) know
about your experiences with this package.
Literature Cited
Burnham, K. P. and W. S. Overton. 1978. Estimation of the size of a closed
population when capture probabilities vary among individuals.
Biometrika 65:625.
Chao, A. 1984. Nonparametric estimation of the number of classes in
a population. Scandinavian Journal of Statistics 11:265-270.
Chao, A. 1987. Estimating the population size for capture-recapture
data with unequal catchability. Biometrics 43:783.
Chao, A. and S.-M. Lee. 1992. Estimating the number of classes via sample
coverage. Journal of the American Statistical Association
87:210-217.
Chao, A., M.-C. Ma and M.C.K. Yang. 1993. Stopping rules and estimation
for recapture debugging with unequal failure rates.
Biometrika 80:193-201.
Colwell, R. K. and J. A. Coddington. 1994. Estimating terrestrial biodiversity
through extrapolation. Philosophical Transactions of the Royal Society of
London B 345:101-118.
Fisher, R.A., A.S. Corbet and C.B. Williams. 1943. The relations between
the number of species and the number of individuals in a random sample
of an animal population. Journal of Animal Ecology 12:42-58.
Lee, S.-M. and A. Chao. 1994. Estimating population size via sample
coverage for closed capture-recapture models. Biometrics
50:88-97.
Press, W. H., B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling.
1986. Numerical recipes: the art of scientific computing. Cambridge
University Press, Cambridge.
Preston, F.W. 1948. The commonness, and rarity, of species. Ecology
29:254-283.
Preston, F.W. 1960. Time and space and the variation of species. Ecology
41:785-790.
Preston, F.W. 1962a. The canonical distribution of commonness and rarity.
Ecology 43:185-215.
Preston, F.W. 1962b. The canonical distribution of commonness and rarity.
Ecology 43:410-432.
Smith, E.P. and G. van Belle. 1984. Nonparametric estimation of species
richness. Biometrics 40:119-129.