Administrative, political, and natural history constraints on the design of research studies in ecology often result in small data sets. In this paper, I identify some problems associated with small data sets and describe a contingent process for data analysis. I argue that exploring small data sets is heuristic and can be a valuable first step in the formulation of biologically interesting hypotheses.
Wildlife studies, by their nature and especially those involving species with sparse distributions, often are characterised by small sample size, although notable exceptions exist, e.g. Berger 1986, Geist 1971, Sinclair 1977, McCullough 1979, Clutton-Brock & Ball 1987. Bears Ursus sp. and wolves Canis lupus in both Europe and North America are examples of sensitive, top level species for which political concerns and low population numbers appear to contribute to limited data collection and often insufficiently replicated data sets. Similarly, mountain lions Felis concolor in North America are difficult to study because of their position as top carnivores and their wide-ranging habits. These species often operate over larger spatial and temporal scales, and often require longer-term studies over larger spatial extents to effectively capture the relevant dynamics. However, studies with limited sample sizes can be extremely useful. Studies involving experimental techniques to breed endangered species held in captivity for subsequent release of adults and progeny into the wild (red wolves Canis niger (Moore 1990, Jenks & Wayne 1992), condors Gymnogyps californianus (Kiff, Mesta & Wallace 1996), whooping cranes Grus americana (Longmire, Gee, Hardekopf & Mark 1992, Cannon 1996)) are valuable even though sample size is generally small. Unfortunately, in some studies there is a tendency to generalise small sample size results beyond the appropriate boundaries.
In addition to natural history constraints, political sensitivity, changing administrative priorities of funding sources, and funding constraints, viz., two-year master's level graduate studies, influence the length and intensity with which most research projects are conducted, often resulting in smaller data sets for a wide variety of studies. For example, Weatherhead (1986) found that the more than 300 studies he reviewed lasted an average of only 2.5 years. Further, Tilman (1989) reported that only 1.7% of field studies reported in the journal Ecology lasted 5 years or more, and only 1% of 180 papers that involved experimentation lasted 5 years or more; a large number (N = 72) lasted less than one year (see May 1994). Although during the past 10 years there has been an increased appreciation for studies of longer duration and larger spatial scale, the problem will continue to plague ecology and the wildlife profession. In the two years (1988–1989) I served as Associate Editor of the Journal of Wildlife Management, the most common complaint from reviewers involved small sample size.
My purpose here is to present one approach to the small sample size problem. My objective was to develop a multiple regression model to predict barn owl Tyto alba reproductive success from several habitat characteristics; however, the use of automated multiple regression model selection techniques, e.g. stepwise selection, often hides important aspects of data. Additionally, and apparently not generally recognised, there is a general tendency for ecologists to “overemphasize the potential role of significance testing in.... scientific practice” (Yoccoz 1991) to the detriment of biological understanding. Yoccoz (1991) further stated that “most biologists and other users of statistical methods seem still to be unaware that significance testing by itself (italics mine) sheds little light on the questions they are posing”. Thus, I adopted the philosophy of model selection recommended by Henderson & Velleman (1981) as an alternative approach. The approach is deliberately contingent: results at each step in the progressive analyses were evaluated, and the decisions about how to proceed to the next step were based on results from the preceding step and on biological insight (e.g. Myers 1990, Neter, Wasserman & Kutner 1985).
Analytical problems with small data sets
Small data sets pose structural problems for the investigator:
1) It is difficult to evaluate assumptions of the analyses, including the forms of the relationships between the response and explanatory variables.
2) Evaluation of any chosen model is ambiguous. Characteristics of collinearity, outliers, and influential points that interfere in model selection for data sets of reasonable size are even more problematic with small data sets because they are more difficult to assess.
3) When the sample size is small or when the number of fitted parameters is a moderate to large fraction of the sample size, most model selection procedures will lead to models that appear to have high explanatory power and that select as significant explanatory variables that are not truly related to the response (Freedman 1983, Freedman & Pee 1989, Hurvich & Tsai 1989).
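The third problem can be illustrated with a small simulation. The sketch below is not part of the original analysis: it assumes pure-noise data (standard normal response and predictors, no real relationship), a sample size of 11, and a screening rule that simply picks the predictor most strongly correlated with the response. The function name `spurious_rate` and all parameter names are hypothetical.

```python
import numpy as np

# Two-tailed 5% critical value of Pearson's r for n = 11 (9 degrees of
# freedom), taken from standard tables. It is only valid for n = 11.
R_CRIT_N11 = 0.602


def spurious_rate(n=11, k=6, reps=2000, r_crit=R_CRIT_N11, seed=0):
    """Fraction of pure-noise data sets in which the best of k candidate
    predictors appears 'significantly' correlated with the response."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(reps):
        y = rng.standard_normal(n)        # response: noise only
        X = rng.standard_normal((n, k))   # predictors: unrelated noise
        r = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(k)])
        if np.max(np.abs(r)) > r_crit:    # keep only the best-looking one
            hits += 1
    return hits / reps


one = spurious_rate(k=1)
six = spurious_rate(k=6)
print(f"one candidate predictor:  {one:.3f}")
print(f"best of six candidates:   {six:.3f}")
```

With a single candidate the rate should sit near the nominal 5%; screening six candidates and keeping the best one should push it well above that, even though none of the variables is related to the response.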
There are additional concerns to be aware of when analysing small data sets. One risk involves the repeated analyses of the same set of data in a search for models that fit the data well. Consequently, a model may be fitting the random variation in the data set on which it is based, rather than the underlying biological relationship. As a result, the predictive ability of the model for a second, similar data set may be less than for the data upon which it was built (Neter et al. 1985, Maurer 1986). Prediction bias is increased when the number of observations is small with respect to the number of predictors (Magnusson 1983, Verbyla 1986). Maurer (1986), Rotenberry (1986), and more recently Anderson & Burnham (1998) have discussed in detail the problems associated with predictability of wildlife habitat models.
However, small data sets can be valuable for generating realistic hypotheses and testable models. All ecological models attempt to simplify the complexity of nature, usually using easily measured variables in equations that represent ecological relationships. But simplification of analytical models necessitates trade-offs between generality, precision, and realism (Levins 1968). In statistical correlation models, this trade-off may not apply. However, correlation points to pattern in ecology and to interesting questions for which we seek explanation.
No single criterion determines the ‘best’ model. Rather, model evaluation takes into account all criteria, as well as biological insight. Different criteria address different aspects of model goodness of fit:
1) One measure of goodness of fit is the coefficient of multiple determination, R², the proportion of the variation of the response variable that is jointly explained by the explanatory variables included in the model, together with its adjusted form. Whereas R² never decreases with the addition of an explanatory variable to the model, the adjusted R² increases only if an added explanatory variable results in an improved fit of the model to the data (Zar 1996). This is especially important when sample size is small and the number of explanatory variables in the model is relatively large, because the adjustment is then considerable (Sokal & Rohlf 1995). Among competing models, the one with the largest adjusted R² is favoured.
2) The error or residual mean square, s², another measure of goodness of fit, expresses variation in the residuals from the model (Myers 1990). Because the residual is the difference between observed value and the value predicted by the model, the model with the smallest s² is favoured. In selecting a ‘best’ model, balance between increased bias due to underfitting, i.e. failing to include important explanatory variables, and increased variance due to overfitting, i.e. incorporating unnecessary variables, must be achieved.
3) Mallows' Cp expresses variance plus bias, and thus is useful in discriminating between competing models (Myers 1990). The model with the smallest Cp is favoured.
4) When the sample size is small, the data set cannot be split for validation purposes. The PRESS (Prediction Sum of Squares) statistic is a criterion that can be used instead as a form of validation (Myers 1990). Each observation is set aside in turn, and a model is fit to the remaining observations in the sample. Using this model, the deleted response is estimated, and the PRESS residual (the difference between the deleted response and its estimate) is computed. The PRESS statistic is computed as the sum of squared PRESS residuals over all observations. Among competing models, the model with the smallest PRESS statistic is favoured.
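The four criteria can be computed directly from an ordinary least-squares fit. The following is a minimal sketch, not the SAS or SYSTAT procedure used in the study: the data are synthetic (randomly generated, 11 observations, 3 predictors), the function name `criteria` is hypothetical, and Cp is computed with the common textbook form SSE_p / s²_full - (n - 2p), where p counts the intercept. PRESS is obtained from the leverages rather than by refitting the model n times; the two routes give the same value.

```python
import numpy as np


def criteria(X, y, s2_full):
    """Adjusted R^2, residual mean square, Mallows' Cp and PRESS for the
    least-squares model y ~ intercept + columns of X."""
    A = np.column_stack([np.ones(len(y)), X])   # add intercept column
    n, p = A.shape                              # p includes the intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    sse = float(resid @ resid)
    sst = float(np.sum((y - y.mean()) ** 2))
    s2 = sse / (n - p)                          # residual mean square
    adj_r2 = 1.0 - s2 / (sst / (n - 1))
    cp = sse / s2_full - (n - 2 * p)
    # Leverages h_ii from a QR decomposition; the PRESS residual for
    # observation i equals e_i / (1 - h_ii).
    Q, _ = np.linalg.qr(A)
    h = np.sum(Q ** 2, axis=1)
    press = float(np.sum((resid / (1.0 - h)) ** 2))
    return {"adj_r2": adj_r2, "s2": s2, "cp": cp, "press": press}


# Synthetic illustration only: these are NOT the barn owl data.
rng = np.random.default_rng(1)
X = rng.standard_normal((11, 3))
y = 2.0 + 1.5 * X[:, 0] + rng.standard_normal(11)

# s^2 of the full model is needed as the yardstick for Cp.
s2_full = criteria(X, y, s2_full=1.0)["s2"]
full = criteria(X, y, s2_full)
reduced = criteria(X[:, [0]], y, s2_full)
print("full model:   ", full)
print("x0-only model:", reduced)
```

By construction, Cp for the full model equals its number of parameters (here 4), so Cp is informative only for comparing subsets against that benchmark.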
Table 1. Observed and predicted reproductive success of barn owls nesting in empty cisterns, and mean values for six habitat variables measured within a 1-km radius of each nest site in southwestern Oklahoma (USA), 1977–1981.
Ault (1982) collected data on barn owl reproductive success from 11 nest sites over a 5-year period in Jackson County, near Eldorado, Oklahoma, USA (Table 1). Six habitat variables were measured within a 1-km radius of each nest site. Nest sites were found in the bottom of empty cisterns (dry water wells). The variables he measured included: 1) road length (km); 2) edge length (measured as the linear distance (km) of contiguous habitats not including road edges); 3) area (ha) of wheat or sorghum (hereafter termed grain); 4) mesquite Prosopis glandulosa; 5) herbland; and 6) an index of habitat interspersion (obtained by counting the number of discrete units of each cover type within the 1-km radius circle of each cistern nest site and then summing the number of units of each type across all cover types).
To avoid pseudoreplication (Hurlbert 1984), I used the cistern, not individual owls, as the sample unit. Regression analysis was performed using PROC REG in SAS and SYSTAT. Analysis proceeded as follows:
1) Reproductive success (number of young fledged) was plotted against each of the six explanatory variables to assess the nature (linear or non-linear) of the relationships and discover unusual points.
2) The set of explanatory variables was explored for collinearity.
3) All possible regressions were fit and screened as potential candidate models based on adjusted R², Mallows' Cp, and s² statistics.
4) Based on adjusted R², Cp, and s², a subset of models was selected for further evaluation, and the PRESS statistic was computed for each candidate model.
5) Residuals from each model were examined to evaluate the assumptions of linearity, normality, and homoscedasticity and to identify potential outliers. Partial regression plots and tests of significance of regression coefficients were used to assess the need for each variable. The susceptibility of any given model to influential data points was examined using various influence diagnostics.
6) One model was chosen to represent the data, based on the four criteria (adjusted R², Cp, s² and PRESS statistics), tests of significance, adherence to assumptions, susceptibility of each model to influential data points, and biological insight.
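Step 3 of the procedure, fitting all possible regressions, can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the PROC REG run: the data are synthetic, the function name `all_subsets` is hypothetical, and only adjusted R², s² and Cp are tabulated. Steps 4 to 6 (PRESS, residual diagnostics, biological judgement) are deliberately left to the analyst.

```python
from itertools import combinations

import numpy as np


def sse_of(A, y):
    """Residual sum of squares for the least-squares fit of y on A."""
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def all_subsets(X, y):
    """Fit every non-empty subset of the columns of X (plus an intercept)
    and return one record of screening statistics per model."""
    n, k = X.shape
    ones = np.ones((n, 1))
    sst = float(np.sum((y - y.mean()) ** 2))
    s2_full = sse_of(np.hstack([ones, X]), y) / (n - k - 1)
    rows = []
    for size in range(1, k + 1):
        for cols in combinations(range(k), size):
            A = np.hstack([ones, X[:, list(cols)]])
            p = size + 1                      # parameters incl. intercept
            sse = sse_of(A, y)
            s2 = sse / (n - p)
            rows.append({
                "vars": cols,
                "adj_r2": 1.0 - s2 / (sst / (n - 1)),
                "s2": s2,
                "cp": sse / s2_full - (n - 2 * p),
            })
    return rows


# Synthetic illustration only: 11 sites, 5 candidate variables, and a
# response driven by the first variable alone.
rng = np.random.default_rng(2)
X = rng.standard_normal((11, 5))
y = 3.0 * X[:, 0] + 0.3 * rng.standard_normal(11)

models = all_subsets(X, y)
best = max(models, key=lambda m: m["adj_r2"])
for m in sorted(models, key=lambda m: -m["adj_r2"])[:3]:
    print(m["vars"], round(m["adj_r2"], 3), round(m["cp"], 2))
```

With five candidate variables there are 2⁵ - 1 = 31 models to screen. Ranking by a single statistic, as in the last lines, is only a convenience for display; the text's point is that no single criterion should decide the choice.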
The scatter plots (Fig. 1) illustrate one source of ambiguity in small data sets: individual observations may be highly influential. Snedecor & Cochran (1967: 175) state that “even the direction of inclination.... may elude you if r is between -0.3 and +0.3.” This type of ambiguity may be most evident in Figure 1C. By omitting one or two arbitrarily chosen data points (see arrows in Fig. 1), the relationship between reproductive success and a given explanatory variable can appear positive or negative, linear or non-linear (e.g. Fig. 1B, D), or even disappear (e.g. Fig. 1C, D). For this analysis, all data points were considered valid.
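The sensitivity of a correlation to individual points can be checked by recomputing r with each observation omitted in turn. The sketch below uses invented numbers chosen to make the effect obvious (ten points with exactly zero correlation plus one extreme point); they are not the barn owl data, and the function name `leave_one_out_r` is hypothetical.

```python
import numpy as np


def leave_one_out_r(x, y):
    """Pearson's r recomputed with each observation omitted in turn."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = np.ones(len(x), dtype=bool)
    out = []
    for i in range(len(x)):
        keep[i] = False                       # drop observation i
        out.append(np.corrcoef(x[keep], y[keep])[0, 1])
        keep[i] = True                        # restore it
    return np.array(out)


# Invented illustration: the first ten points are uncorrelated by
# construction; the eleventh point (20, 20) is an extreme observation.
x = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 20]
y = [1, 2, 3, 4, 5, 5, 4, 3, 2, 1, 20]

r_full = np.corrcoef(x, y)[0, 1]
r_loo = leave_one_out_r(x, y)
print(f"r with all 11 points: {r_full:.2f}")
print(f"r without the extreme point: {r_loo[-1]:.2f}")
print(f"range of leave-one-out r: {r_loo.min():.2f} to {r_loo.max():.2f}")
```

Here a strong apparent correlation rests entirely on one observation, which is the kind of ambiguity the scatter plots expose.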
Table 2. Pearson correlation coefficients for the six independent variables.
No relationship between edge length and observed reproductive success (r = 0.23) was apparent in the scatterplot. Mesquite area exhibited a weak, negative linear relationship (r = -0.66); log-transformed herbland area (r = 0.53), grain area (r = 0.36), and interspersion (r = 0.49) exhibited weak, positive linear relationships. Herbland area was logarithmically transformed to achieve a more linear relationship. Road length (see Fig. 1A) appeared most highly correlated with observed reproductive success (r = 0.91).
Table 3. Candidate models and model selection criteria for barn owl reproductive success.
Simple correlation identified a potential collinearity problem (Table 2). A large negative correlation (-0.89) existed between mesquite area and log-transformed herbland area. Moderately large correlations were found for mesquite area and grain area (-0.77), mesquite area and road length (-0.76), road length and log-transformed herbland area (0.71), and edge length and interspersion (0.72).
A multiple linear regression with all six explanatory variables was fit to the data. Variance inflation factors, which represent the inflation of the variances of the regression coefficients due to correlation among the explanatory variables, were large (>10) for mesquite area and log-transformed herbland area. Another measure of multicollinearity is the condition number. It is calculated as the ratio of the largest to the smallest eigenvalue of the correlation matrix for the explanatory variables. More formally, the condition number is given as $\phi = \lambda_{\max} / \lambda_{\min}$, where $\lambda_{\max}$ and $\lambda_{\min}$ are the largest and smallest eigenvalues (Myers 1990). The condition number for these data was 3,706. The large variance proportions indicated that collinearity appeared to involve the intercept, mesquite, grain, and to a lesser extent log-transformed herbland areas. Consequently, as a first step mesquite area was dropped as an explanatory variable. The multiple linear regression was re-fit without mesquite area; no further evidence of collinearity was found.
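Both collinearity diagnostics can be obtained from the correlation matrix of the explanatory variables. The sketch below follows the definition given in the text (ratio of the largest to the smallest eigenvalue of the correlation matrix) and uses the standard result that the variance inflation factors are the diagonal elements of the inverse correlation matrix. The data are invented for illustration, the function name `collinearity_diagnostics` is hypothetical, and statistical packages may scale the matrix differently, so values need not match package output.

```python
import numpy as np


def collinearity_diagnostics(X):
    """Variance inflation factors and condition number computed from the
    correlation matrix of the explanatory variables (columns of X)."""
    R = np.corrcoef(X, rowvar=False)
    vif = np.diag(np.linalg.inv(R))      # VIF_j = 1 / (1 - R_j^2)
    eig = np.linalg.eigvalsh(R)          # eigenvalues of a symmetric matrix
    return vif, float(eig.max() / eig.min())


# Invented illustration with 11 observations:
#   x1 is a simple trend, x2 is x1 plus a small alternating perturbation
#   (so x1 and x2 are nearly redundant), x3 is unrelated to x1.
x1 = np.arange(11, dtype=float)
x2 = x1 + 0.1 * np.array([1, -1] * 5 + [1], dtype=float)
x3 = (x1 - 5.0) ** 2
X = np.column_stack([x1, x2, x3])

vif, cond = collinearity_diagnostics(X)
print("VIF:", np.round(vif, 1))
print("condition number:", round(cond, 1))
```

In this example the two nearly redundant variables should show variance inflation factors far above the conventional threshold of 10, while the third stays close to 1.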
Four models were identified as potential candidates based on the first three criteria (Table 3). A larger adjusted R², a smaller Cp, a smaller s², or a smaller PRESS generally indicates a better model. No one model provided the best performance on all four criteria. The model with road length alone provided a reasonable fit to the data. Road length (x) was positively associated with reproductive success (y) according to the following equation:
Multicollinearity exists when the explanatory variables are redundant, and it causes regression coefficients to be unstable (Myers 1990): even small changes in the explanatory variables produce large changes in coefficient estimates. Note, however, that a lack of evidence of collinearity does not confirm that collinearity problems no longer exist. Although other models, with additional variables, were better in terms of adjusted R², Cp, s², or PRESS statistics, support for the inclusion of additional variables was insufficient, as indicated by tests of significance for regression coefficients and partial regression plots. In addition, although the model with road length alone appeared to suffer from the influence of individual data points, more complex models were as susceptible or more so. The interpretation of the association between reproductive success and road length is biologically meaningful. Bar ditches (ditches along the roadsides) with relatively high densities of small mammals were associated with each road segment, and fence posts lined most roads, providing owl hunting perches. Accordingly, I chose the most parsimonious model.
The analysis suggests that reproductive success and road length are positively associated. However, this conclusion should not be accepted uncritically. Because the database is small, there is little guarantee that the linear association between road length and reproductive success would be evident in another data set. In addition, the power to detect associations of even moderate strength is low. Relationships with interspersion, amount of area in grain, and amount of area in mesquite may be non-linear; however, in this paper I focused on linear relationships. Possible non-linearities were not explored for the following reasons. There are hints of associations between reproductive success and other explanatory variables for which form and strength are determined by a single data point (e.g. see Fig. 1B, D). When sample size is small, the estimates of the likely size of chance error in the regression results are imprecise and there is scant basis for checking model assumptions like linearity, normality, and homoscedasticity. In addition, the nature of this data set is observational, not experimental, and cannot reliably determine the mechanisms involved. There may well be unobserved and unmeasured variables that influence barn owl reproductive success. Consequently, care should be taken not to interpret the results as a confirmation of associations without additional and corroborating data.
One of the most profitable uses of small data sets is to generate interesting questions and hypotheses for future studies. The patterns uncovered may suggest general conclusions that allow one to devise experimental field studies. Even further, a ‘logical tree’ (Platt 1964) can be employed where a hierarchy of groups of hypotheses can be sequentially tested in a mechanistic approach (Price 1986) to discover why the pattern occurs. It is my observation that this step-down approach is sometimes preceded by intriguing results from often-small data sets that have been subjected to pattern analyses of some sort. This, perhaps, is the strongest reason why it is fruitful to explore limited data sets. Exploratory data analysis methods can prove most helpful and may point to scale-sensitive effects that need to be addressed.
Small sample sizes will continue to plague ecologists. I argue that these kinds of data are important, but that extreme care must be taken in both the analyses employed and the interpretations made. I suggest that contingent data analysis procedures promote conservative interpretations and can be used heuristically to illuminate patterns and interesting questions in ecology. As Tukey (1980; cited in Yoccoz 1991) suggested: “finding the question is often more important than finding the answer”. I have illustrated one possible approach here.
I thank S. Durham for sage advice and help in preparation of the manuscript. I also thank my former student J.W. Ault, III for the use of his data set, and G.E. Belovsky for helpful comments on an original draft of the manuscript. Comments provided by A. Mysterud, A. Loison, and N.G. Yoccoz were most helpful to me in clarifying my message. I very much appreciate the time and effort they spent with the manuscript. The original research was funded by the Oklahoma (USA) Department of Wildlife Conservation.