Recent developments in geographic information systems and their application to conservation biology open doors to exciting new synthetic analyses. Exploration of these possibilities, however, is limited by the quality of information available: most biodiversity data are incomplete and characterized by biased sampling. Inferential procedures that provide robust and reliable predictions of species' geographic distributions thus become critical to biodiversity analyses. In this contribution, models of species' ecological niches are developed using an artificial-intelligence algorithm, and projected onto geography to predict species' distributions. To test the validity of this approach, I used North American Breeding Bird Survey data, with large sample sizes for many species. I omitted randomly selected states from model building, and tested models using the omitted states. For the 34 species tested, all predictions were highly statistically significant (all P < 0.001), indicating excellent predictive ability. This inferential capacity opens doors to many synthetic analyses based on primary point occurrence data.
Predicción de Áreas de Distribución de Especies con Base en Modelaje de Nichos Ecológicos
Resumen. Avances recientes en los sistemas de información geográfica y su aplicación en la biología de la conservación presentan la posibilidad de análisis nuevos y sintéticos. La exploración de estas posibilidades, sin embargo, se ve limitada por la calidad de la información disponible: la gran mayoría de los datos respecto a la diversidad biológica son incompletos y sesgados. Por eso, los procedimientos de inferencia que proveen predicciones robustas y confiables de las distribuciones de especies se hacen importantes para los análisis de la biodiversidad. En esta contribución, se desarrollan modelos de los nichos ecológicos por medio de un algoritmo de inteligencia artificial, y se proyectan en la geografía para predecir las distribuciones geográficas de las especies. Para probar el método, se usan los datos del North American Breeding Bird Survey, con tamaños de muestra grandes. Se construyeron modelos con base en 30 estados estadounidenses seleccionados al azar, y se probaron los modelos con base en los 20 estados restantes. Las 34 especies que se analizaron mostraron un alto grado de significancia estadística (todas P < 0.001), lo cual indica un alto grado de predictividad. Esta capacidad de inferencia abre la puerta a varios análisis sintéticos con base en puntos conocidos de ocurrencia de especies.
INTRODUCTION
Many geographic applications have been developed in recent years that offer exciting new possibilities for understanding biological diversity (e.g., Scott et al. 1996). Geographic information systems (GIS) make it possible to build maps of species richness and endemism, to prioritize areas for conservation based on principles such as complementarity, and to assess the completeness of existing protected areas networks (e.g., Peterson et al. 2000). One of the most notable examples of these possibilities is that of Gap Analysis, an integrative program that links distributional information with information on land use and protection to identify priorities for conservation action (Scott et al. 1996). The success of such programs and approaches, however, depends critically on the quality of distributional information available, which has proven to be a weak link in the process (Krohn 1996).
Biodiversity information nevertheless exists in a difficult, fragmented system: sampling documents presence but rarely absence; sampling is rarely systematically planned so as to permit detailed statistical analysis; and institutional holdings separate specimens in different countries and regions (Peterson et al. 1998). Although occurrence data are now beginning to become much more available thanks to innovative, Internet-based technological developments (e.g., Vieglais 1999), the need for development of inferential approaches to interpreting biodiversity information is clear (Soberón 1999). Hence, in this contribution, I develop detailed statistical tests of an artificial-intelligence-based approach designed to predict species' geographic distributions.
MODELING ECOLOGICAL NICHES AND PREDICTING GEOGRAPHIC DISTRIBUTIONS
The fundamental ecological niche of a species is a critical determinant of its distribution; as such, it is defined in multidimensional ecological space (MacArthur 1972). Several distinct interpretations of ecological niches exist: most relevant to the present contribution is that of Grinnell (1917), who focused on the conjunction of ecological conditions within which a species is able to maintain populations without immigration. Hutchinson (1959) provided the valuable distinction between the fundamental niche, which is the range of theoretical possibilities, and the realized niche, which is that part actually occupied given interactions with other species, such as competition. Although it can be argued that only the realized niche is observable in nature, examining species across their entire geographic distributions allows their distributional possibilities to be observed against varied community backgrounds, and thus a view of the fundamental ecological niche can be assembled (Peterson et al. 1999).
Several approaches have been used to model species' fundamental ecological niches. The very simplest is BIOCLIM (Nix 1986), which involves tallying species' occurrences in categories for each environmental dimension, trimming the extreme 5% of the distribution along each ecological dimension, and taking the niche as the conjunction of the trimmed ranges to produce a decision rule. BIOCLIM suffers generally from high rates of commission error, or overprediction (Stockwell and Peterson, unpubl.). Other investigators have applied logistic regression to the challenge of combining environmental variables into predictions of presence and absence (e.g., Austin et al. 1990).
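The envelope logic described for BIOCLIM can be illustrated with a minimal sketch. It is not the BIOCLIM implementation itself: the variable names (`temp`, `precip`), the function names, and the choice to split the 5% trim evenly between the two tails are assumptions made only for illustration.

```python
from typing import Dict, List, Tuple

Envelope = Dict[str, Tuple[float, float]]


def percentile(sorted_vals: List[float], q: float) -> float:
    """Linearly interpolated percentile (q in [0, 100]) of a pre-sorted list."""
    if len(sorted_vals) == 1:
        return sorted_vals[0]
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def fit_envelope(occurrences: List[Dict[str, float]],
                 trim_total: float = 5.0) -> Envelope:
    """Trimmed range of each environmental variable at known occurrences.

    Assumption: the trimmed percentage is split evenly between both tails.
    """
    tail = trim_total / 2.0
    envelope: Envelope = {}
    for var in occurrences[0]:
        vals = sorted(o[var] for o in occurrences)
        envelope[var] = (percentile(vals, tail), percentile(vals, 100.0 - tail))
    return envelope


def predict_present(envelope: Envelope, site: Dict[str, float]) -> bool:
    """Decision rule: present only if every variable lies in its trimmed range."""
    return all(lo <= site[var] <= hi for var, (lo, hi) in envelope.items())


# Hypothetical occurrence records, for illustration only.
occurrences = [{"temp": float(t), "precip": 2.0 * t} for t in range(101)]
envelope = fit_envelope(occurrences)
print(envelope)
print(predict_present(envelope, {"temp": 50.0, "precip": 100.0}))
```

Because the rule is a conjunction of independent ranges, it admits every combination of values inside the resulting rectangular envelope, including combinations at which the species was never recorded, which is one plausible route to overprediction.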
The Genetic Algorithm for Rule-set Prediction (GARP) includes several distinct algorithms in an iterative, artificial-intelligence-based approach (Stockwell and Noble 1992, Stockwell and Peters 1999). Here, individual algorithms (e.g., BIOCLIM, logistic regression) are used to produce component “rules” in a broader rule-set, and hence portions of the species' distribution may be determined as within or without its niche, based on different rules from several algorithms. As such, GARP is a superset of other approaches, and should always have greater predictive ability than any one of them. Initial testing of GARP has indicated excellent predictive ability and insensitivity to BIOCLIM's problems with dimensionality of environmental data (Peterson and Cohoon 1999, Peterson et al., in press a, b; Stockwell and Peterson, in press).
Two general types of error enter into such niche modeling and geographic prediction efforts (Fielding and Bell 1997). First, omission of areas actually inhabited represents a failure of the modeling effort to extend to all ecological conditions under which the species is able to maintain populations. Second, commission error is that of including areas actually uninhabited; this error includes two components: real commission error, in which combinations of ecological conditions not actually within the species' niche are included, and apparent commission error, which results from species' absences owing to interspecific interactions (the difference between realized and fundamental niches, MacArthur 1972), as well as to historical factors, such as limited colonization ability, speciation patterns, and local extinction (Peterson et al. 1999). In this sense, apparent commission error represents a real feature of species' distributional ecology: not all habitable areas are inhabited (Peterson et al. 1999). The purpose of the present contribution is to put the GARP algorithm to a rigorous test with North American birds.
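The two error types can be made concrete with a short sketch that computes them from parallel lists of predictions and observations. The function name and the toy data are hypothetical. Note that counts of this kind cannot separate real from apparent commission error, because an observed absence says nothing about why the species is absent.

```python
from typing import List, Tuple


def error_rates(predicted: List[bool], observed: List[bool]) -> Tuple[float, float]:
    """Return (omission rate, commission rate) for a set of sites.

    Omission: observed presences that the model predicted absent.
    Commission: observed absences that the model predicted present.
    """
    presences = sum(observed)
    absences = len(observed) - presences
    omitted = sum(1 for p, o in zip(predicted, observed) if o and not p)
    committed = sum(1 for p, o in zip(predicted, observed) if p and not o)
    omission = omitted / presences if presences else float("nan")
    commission = committed / absences if absences else float("nan")
    return omission, commission


# Hypothetical predictions and observations at five sites.
predicted = [True, True, False, False, True]
observed = [True, False, True, False, True]
print(error_rates(predicted, observed))
```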
METHODS
Distributional data for four genera of passerine birds (Catharus, Dendroica, Toxostoma, and Vireo) were selected for analysis based on their high species richness, distribution in regions well covered by the North American Breeding Bird Survey (http://www.mbr-pwrc.usgs.gov/bbs/bbs.html), and ease of detection in visual/auditory surveys. The Breeding Bird Survey data offer relatively uniform coverage of the continent, avoiding some of the challenges presented by museum specimen data in terms of uneven sampling (Peterson et al. 1998). Data points were extracted as presences (in any year) or absences (in all years) at the level of routes, and reduced to unique latitude-longitude combinations for each species. Thirty U.S. states were chosen at random for model development (“training data,” regardless of whether the species had been recorded from the state); data from the remaining 20 states were set aside for statistical testing of models (“test data,” Fig. 1); this ratio of training and test sample sizes was chosen so that in general more than 10 and 30 occurrence points would be available for testing and training models, respectively. This scheme is reasonable, given that the probability of detection of a particular species in one state in no way affects the probability of its detection in another state.
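The partitioning scheme can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the record layout `(species, state, latitude, longitude)`, the function name, the placeholder state codes, and the fixed random seed are all hypothetical.

```python
import random
from typing import Iterable, List, Set, Tuple

Record = Tuple[str, str, float, float]   # (species, state, latitude, longitude)
Point = Tuple[str, float, float]         # (species, latitude, longitude)


def split_by_state(records: Iterable[Record], states: List[str],
                   n_train: int = 30, seed: int = 0):
    """Assign whole states at random to training; the rest become test states.

    Using sets reduces the records to unique latitude-longitude
    combinations per species.
    """
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    train_states: Set[str] = set(rng.sample(sorted(states), n_train))
    train: Set[Point] = set()
    test: Set[Point] = set()
    for species, state, lat, lon in records:
        target = train if state in train_states else test
        target.add((species, lat, lon))
    return train_states, train, test


# Hypothetical data: one presence record per placeholder state, plus a duplicate.
states = [f"S{i:02d}" for i in range(50)]
records = [("Toxostoma rufum", s, 30.0 + i, -100.0 + i) for i, s in enumerate(states)]
records.append(records[0])
train_states, train, test = split_by_state(records, states)
print(len(train_states), len(train), len(test))
```

Splitting by state rather than by individual point keeps the test data geographically separate from the training data.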
Four species of Catharus, 18 of Dendroica, 7 of Toxostoma, and 12 of Vireo were available in the data set, although 7 species had to be omitted because they did not occur in both training and test data sets (Catharus minimus, Dendroica chrysoparia, Toxostoma longirostre, T. redivivum, Vireo altiloquus, V. atricapillus, and V. flavoviridis). Training data were analyzed for North America with a 50 × 50 km pixel resolution using the web-based GARP facility (http://biodi.sdsc.edu/), including coverages of (1–4) mean and standard deviation of annual mean precipitation and annual mean temperature, (5) life zones, (6) wetlands, (7) vegetation types, and (8) soil types. Variable combinations predicted present by the GARP rules were identified in the test states and used to predict species' occurrences in those states. Resulting geographic predictions were exported as ASCII raster grid files for use in ArcView (version 3.1).
In ArcView, the test occurrence data were overlain on the predictions for the 20 test states. Numbers of points correctly and incorrectly predicted by GARP were used as observed values. Expected numbers were taken as the test sample size multiplied by the proportional area predicted present in test states. A chi-square analysis for each species was used to assess model significance. To permit visualization of the ecological niche model, I crossed the eight ecological coverages (Combine option in ArcView) with the distributional prediction to produce a table of predicted presences and absences across environmental combinations.
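The significance test can be sketched in a few lines, using the Brown Thrasher counts reported in the Results as input. The function name is hypothetical, and the sketch assumes a plain one-degree-of-freedom chi-square without continuity correction; the exact statistic computed in the original analysis may differ in detail.

```python
import math
from typing import Tuple


def chi_square_test(n_test: int, n_correct: int,
                    prop_area_present: float) -> Tuple[float, float, float]:
    """Chi-square test (1 df) of a prediction against a random model.

    Returns (expected number correct, chi-square statistic, P-value).
    """
    if not 0.0 < prop_area_present < 1.0:
        raise ValueError("proportional area predicted present must be in (0, 1)")
    # Expected counts: test sample size times proportional area predicted
    # present (correct) or absent (incorrect).
    exp_correct = n_test * prop_area_present
    exp_incorrect = n_test - exp_correct
    obs_incorrect = n_test - n_correct
    chi2 = ((n_correct - exp_correct) ** 2 / exp_correct
            + (obs_incorrect - exp_incorrect) ** 2 / exp_incorrect)
    # Upper-tail probability of a chi-square variate with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return exp_correct, chi2, p_value


# Brown Thrasher: 741 test points, 715 correctly predicted, 39% of area predicted present.
expected, chi2, p = chi_square_test(741, 715, 0.39)
print(f"expected correct = {expected:.0f}, chi-square = {chi2:.1f}, P = {p:.1e}")
```

The closed form `erfc(sqrt(chi2 / 2))` applies only to one degree of freedom; it is used here so that the sketch needs nothing beyond the standard library.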
RESULTS
Ecological niche models for each species showed restriction relative to the universe of ecological combinations available across North America. For example, the Brown Thrasher (Toxostoma rufum) was modeled to focus its distribution in relatively warm, yet relatively dry portions of the continent (Fig. 2). Similar visualizations of ecological distributions were developed for other ecological dimensions, and for each species.
Geographic distributions for all 34 species in the analysis were highly significantly predicted in the test states. For example, of 741 test points for Brown Thrashers, 715 were correctly predicted, even though only 39% of the 20-state test area was predicted present (Fig. 3), and so only 290 points would have been correctly predicted by a random model. Although most range limits are accurately delimited in the distributional prediction, an area of overprediction runs from the southwestern extreme of the species' distribution south into northern Mexico; here, other Toxostoma species are present, and this is therefore another example of stability of ecological niches on evolutionary time scales (Peterson et al. 1999). This model was statistically significant at P < 10^-222, and hence it is highly likely that the model is accurately evaluating dimensions of the species' ecological requirements. Significance levels for all species ranged between 10^-245 and 10^-3 (Table 1).
Relationships between model quality and sample size (Fig. 4) were strong (simple linear regression, P < 0.05). Small chi-square values were associated with small sample sizes in both training and test data sets. In this sense, although all models were highly statistically significant, building truly predictive models may often require 100 or more occurrence points in this particular geographic scenario and with these particular ecological coverages.
DISCUSSION
GARP modeling approaches were able to predict species' occurrences at high levels of statistical significance in every species tested. This result parallels those obtained for 25 species of tropical birds in Mexico (Peterson et al., in press a), and suggests the generality of this tool. Comparisons with other algorithms are under development, but GARP appears to outperform each quite consistently (Stockwell and Peterson, unpubl.). For example, a BIOCLIM model of Toxostoma rufum distribution omitted more than twice as many of the test points as the GARP model discussed above, and overpredicted severely in the northwestern portion of the species' distribution (Peterson, unpubl.), making for a model that is clearly less predictive.
Development of robust algorithms for predicting species' geographic distributions based on point occurrence data opens doors to many exciting approaches and analyses. Although the present paper focuses on the well-known bird fauna of North America, applications are feasible for any taxon in any region on Earth. On the most basic level, then, given certain requirements as to sample size (Peterson et al. 1998), locality information from the approximately 3 × 10^9 specimens in the world's natural history museums can be put to use in developing distributional hypotheses for many tens of thousands of species, providing a first view of species' distributions worldwide. Diverse tests have confirmed that GARP is able to build highly predictive models even given the spatial biases inherent in specimen data (Peterson et al. 1999, Peterson et al., in press a, b; Stockwell and Peterson, in press).
Such an understanding has much to offer to workers in organismal biology and species conservation. Species' distributions can be modeled to produce first-pass hypotheses that may be the only usable information for many rare and poorly known taxa. Intensively managed species' potential distributions can identify sites at which reintroduction programs could be focused. Overlaying many species' predicted distributions allows prediction of community composition for any site in the region analyzed, and such cross-species predictions allow identification of conservation priorities (Peterson et al. 2000) or assessment of environmental impacts.
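The overlay of many species' predicted distributions can be sketched as a simple stacking of presence maps. The representation below (each species mapped to a set of predicted-present pixel coordinates), the function names, and the toy pixel sets are assumptions for illustration only.

```python
from typing import Dict, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) of a grid cell


def community_at(predictions: Dict[str, Set[Pixel]], site: Pixel) -> List[str]:
    """Species whose predicted distribution includes the given pixel."""
    return sorted(sp for sp, pixels in predictions.items() if site in pixels)


def richness_map(predictions: Dict[str, Set[Pixel]]) -> Dict[Pixel, int]:
    """Number of species predicted present in each pixel."""
    counts: Dict[Pixel, int] = {}
    for pixels in predictions.values():
        for px in pixels:
            counts[px] = counts.get(px, 0) + 1
    return counts


# Hypothetical predicted-present pixels for three species.
predictions = {
    "Toxostoma rufum": {(0, 0), (0, 1), (1, 1)},
    "Vireo olivaceus": {(0, 1), (1, 1)},
    "Catharus guttatus": {(1, 1), (2, 2)},
}
print(community_at(predictions, (1, 1)))
print(richness_map(predictions))
```

A predicted community assembled this way inherits any commission error in the single-species models, so it is best read as a hypothesis of potential rather than realized composition.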
Acknowledgments
Research for this contribution was assisted greatly by advice and assistance from David R. B. Stockwell, and provision of data by Bruce Peterjohn. Financial support was provided by the National Science Foundation.
LITERATURE CITED
TABLE 1. Summary of distributional predictions and significance tests for 34 species in the genera Catharus, Dendroica, Toxostoma, and Vireo. n_train = training sample size, n_test = test sample size, % correct = percentage of test points correctly predicted, A_present = area predicted present (in 50 × 50 km pixel units), and A_absent = area predicted absent (in pixel units).