The North American Breeding Bird Survey (BBS) provides data for >420 bird species at multiple geographic scales over 5 decades. Modern computational methods have facilitated the fitting of complex hierarchical models to these data. It is easy to propose and fit new models, but little attention has been given to model selection. Here, we discuss and illustrate model selection using leave-one-out cross-validation and the Bayesian Predictive Information Criterion (BPIC). Cross-validation is enormously computationally intensive; we thus evaluate the performance of the Watanabe-Akaike Information Criterion (WAIC) as a computationally efficient approximation to the BPIC. Our evaluation is based on analyses of 4 models as applied to 20 species covered by the BBS. Model selection based on BPIC provided no strong evidence of one model being consistently superior to the others; for 14/20 species, none of the models emerged as superior. For the remaining 6 species, a first-difference model of population trajectory was always among the best fitting. Our results show that WAIC is not reliable as a surrogate for BPIC. Development of appropriate model sets and their evaluation using BPIC is an important innovation for the analysis of BBS data.

## INTRODUCTION

The North American Breeding Bird Survey (BBS) provides 50 yr of data for bird populations in the United States and Canada, monitoring population change for >420 species at multiple geographic scales (Sauer et al. 2013, 2017). The data are counts of all birds seen within 400 m, or heard from any distance, by a skilled observer at each of 50 roadside stops along designated routes.

BBS counts are analyzed as overdispersed Poisson random variables with a log-linear model for the effects of explanatory variables. Explanatory variables include population effects, observer effects, and overdispersion effects (“noise” in Figure 1). Population effects account for spatial and temporal variability in the counts. Observer effects account for differences among survey participants, and for temporal change in individuals' count rates. The latter may include effects due to increased familiarity with survey methods and route locations, or changes relating to the aging of observers.

Overdispersion effects are modeled as observation-specific, mean-zero random variables, additive on the log scale of the expected count. These are included to relax the restrictive relation between the mean and variance of the Poisson distribution, allowing for temporally stable variation in counts not explained by other components of the model. Having modeled the biologically irrelevant effects of observers and accounted for temporally stable overdispersion effects, the remaining variation in counts can reasonably be assumed to describe patterns of population change.
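To make the structure of this count model concrete, here is a minimal simulation sketch; the single stratum, effect sizes, and normal scales are invented for illustration and are not fitted BBS values.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # number of simulated counts

# Hypothetical effect sizes, illustrative only.
stratum_effect = 1.5                       # baseline log abundance
observer_effect = rng.normal(0, 0.3, n)    # differences among observers
noise = rng.normal(0, 0.2, n)              # overdispersion: mean-zero on the log scale

# Explanatory variables are additive on the log scale of the expected count.
log_lambda = stratum_effect + observer_effect + noise
counts = rng.poisson(np.exp(log_lambda))

# The mean-zero noise term relaxes the Poisson mean = variance restriction:
# the simulated counts have variance exceeding their mean.
sample_mean, sample_var = counts.mean(), counts.var()
```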

BBS counts are not intended to provide actual population numbers, and naive use of BBS counts to estimate population size or to compare populations through space and time is misguided, misleading, and pointless (Sauer and Link 2004). Rather, the usefulness of BBS data lies in model-based estimation of an index to relative population size and its patterns through time and space. When the focus is on temporal pattern, the modeled pattern of variation in this index is referred to as the population trajectory. In combination with population size surveys, BBS trajectories can be used to predict population sizes from areas of overlap and areas covered only by BBS data (Runge et al. 2009, Millsap et al. 2013, Zimmerman et al. 2015, Runge and Sauer in press).

Selection of a model for estimating population change is a crucial component of the analysis of BBS data: the mathematical model used shapes our perception of population change. The model must be flexible enough to capture genuine signals in the data, but simple enough to filter irrelevant variation.

The widely cited and comprehensive analysis conducted by the U.S. Geological Survey (USGS; Sauer et al. 2014, www.mbr-pwrc.usgs.gov/bbs/) describes the population trajectory for an individual species as an additive effect on the log scale of the expected count, having the form:

γ_s(t) = α_s + β_s(t − t*) + ε_{s,t}.  (1)

Here, *t* denotes the year of the count and t* is a reference year; α_s and β_s are spatially varying parameters reflecting baseline abundance and long-term pattern of change, indexed by geographic strata *s*; ε_{s,t} is a mean-zero normal random variable. The strata are defined by the intersections of U.S. states or Canadian provinces with Bird Conservation Regions (Sauer et al. 2013), and we focus only on the trajectory component of the model. The model contains additional parameters that describe other effects on counts. Model (1) is referred to as the linearly trending random year effects model.

It is important to recognize that model (1) does not impose a strictly log-linear pattern on the population trajectory. The log-linear component of the model, α_s + β_s(t − t*), provides a baseline pattern from which fitted values may depart. Given sufficient data, the latent variables ε_{s,t} are precisely estimated, allowing estimates of γ_s(t) to depart from the strictly linear form. This is illustrated in Figure 2, which displays survey-wide summary estimates of γ_s(t) and its linear component for the Carolina Wren (*Thryothorus ludovicianus*), converted back to the original count scale. Note, for instance, that a population decline of ∼40% associated with the severe winter of 1976–1977 (Link and Sauer 2007) is very much in evidence; this pattern has not been obscured by the use of the log-linear model.

On the other hand, model (1) allows for efficient estimation of a long-term trend, even with relatively weak data. Consequently, model (1) has served well as an omnibus model for population trajectories in the multispecies USGS analysis.

Nevertheless, alternative models might be considered, especially when intensively studying individual species. For instance, a model with the form:

γ_s(t) = γ_s(t − 1) + ε_{s,t},  (2)

referred to as the “first-difference model,” might more faithfully reproduce nonlinear patterns in trend. Here, once again, ε_{s,t} are independent, mean-zero normal random variables; the model is typically parameterized with α_s = γ_s(t*), the baseline abundance for stratum *s* in reference year t*.

Using the language of Bayesian modeling, model (1) shrinks year effects toward a long-term trend line, and model (2) shrinks year effects toward adjacent year values. Given a strong signal from the data, the amount of shrinkage is small and the same general estimated pattern is obtained by the 2 models. If we were to overlay the fitted trajectory from model (2) on the fitted trajectory from model (1) in Figure 2, the 2 would closely coincide (Link and Sauer 2016:1754). This is because of the high quality of BBS data for Carolina Wrens; the species is widespread, abundant, and easily detectable.

For other species, trajectory models (1) and (2) might not agree, and the question arises of how we should select among models. Other models might be considered, such as a first-difference model with stratum-specific drift:

γ_s(t) = γ_s(t − 1) + β_s + ε_{s,t},  (3)

which might be considered a compromise between models (1) and (2). Models with higher-order autoregressive structure might also be considered. The question of model selection goes further, including choices for how we model observer and overdispersion effects. Even once the general form of a model has been decided upon, decisions are required with regard to hierarchical structure. For instance, in models (1) and (2), one might treat parameters α_s as unrelated, or as random effects sampled from common distributions.
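To make the differences among these trajectory forms concrete, the following sketch simulates one stratum's trajectory under the linearly trending, first-difference, and drift forms; all parameter values are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 49          # years of data, 1966-2014
t_star = 24     # index of the reference year t*
alpha, beta, sd = 1.0, 0.02, 0.1   # invented stratum parameters
years = np.arange(T)

# Model (1): linearly trending random year effects; year effects
# scatter around (and shrink toward) the long-term trend line.
gamma1 = alpha + beta * (years - t_star) + rng.normal(0, sd, T)

# Model (2): first differences; each year effect shrinks toward the
# previous year's value (a random walk), anchored so gamma(t*) = alpha.
gamma2 = np.cumsum(rng.normal(0, sd, T))
gamma2 += alpha - gamma2[t_star]

# Compromise: first differences with stratum-level drift beta.
gamma3 = np.cumsum(beta + rng.normal(0, sd, T))
gamma3 += alpha - gamma3[t_star]
```

With weak data the posteriors pull the fitted trajectories toward these respective baselines, which is why the choice among the forms can matter.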

How should we select among models? We have heard practitioners suggest that one model is superior to another, based on its narrower confidence intervals—but confidence intervals are only relevant under the assumption the associated model is correct. We have also heard practitioners suggest that excessively parameterized models might be preferred, on the grounds that less precisely estimated trajectories might somehow better suggest associations between population change and candidate explanatory variables.

What is needed is a formal mechanism for model selection (Hooten and Hobbs 2015, Link and Sauer 2016). Akaike's information criterion (AIC) has been widely used and energetically advocated (Burnham and Anderson 2002), but is known to favor unnecessarily complex models (Barker and Link 2015); furthermore, its extension to hierarchical models is difficult. The deviance information criterion (DIC) is similar to AIC and has been widely used in analysis of hierarchical models, but recent research has called its usefulness into question (Gelman et al. 2014). The gold standard for model selection remains cross-validation, but this is highly computationally intensive.

In this paper, we report on a comparison of 4 models applied to a suite of 20 species. We had 2 goals for conducting and reporting these analyses. First, we were interested in the model set per se. Three of the models were based on the linearly trending random year effects trajectory (equation (1)), but with varying levels of hierarchical structure. We were interested in whether adding structure to the usual USGS model would improve analyses. The fourth model used the first-difference trajectory (equation (2)); we were interested in knowing whether the more flexible structure offered by the first-difference model improved analysis. The suite of 20 species that we selected varied from rare, low-abundance species to very common species and included species with consistent population trends and species undergoing population fluctuations; our primary interest was in determining whether our model selection procedure selected different models as better estimating trends among this diverse set of species and population trajectories.

Naturally, we cannot draw general conclusions about the relative merits of these models for application to other species. At best, the results reported here may be suggestive of general tendencies. More important is the notion of a model set, and available tools for choosing among the models. We envision the omnibus model used in USGS analyses being replaced by an omnibus model set and practical model-selection criteria.

Our second goal was to investigate the performance of a new model-selection criterion, the Watanabe-Akaike information criterion (WAIC; Watanabe 2010, 2013, Gelman et al. 2014). The virtue of WAIC lies in its asymptotic equivalence to leave-one-out cross-validation (LOOCV), despite being much more easily calculated. We thus performed LOOCV analyses for our primary assessment of the model set, and then compared results based on WAIC.

## METHODS

### Leave-one-out Cross-validation

Leave-one-out cross-validation (LOOCV) of a model *M* involves setting aside a single observation Y_i from a data set **Y**, fitting *M* based on the reduced data set **Y**_{−i}, and predicting the value Y_i based on its covariate vector **X**_i and the fitted model. We can repeat the process for all observations *i*, or for some subset of the observations, and compare the predictions with the observed values. Basing the predictions on reduced data sets **Y**_{−i} avoids problems of overfitting.

Let **θ**_M denote the unknown parameter vector for model *M*, and denote the distribution function of Y_i under model *M* as f_M(y | **θ**_M, **X**_i). Let [**θ**_M | **Y**_{−i}] denote the posterior distribution for **θ**_M from a Bayesian analysis using the reduced data set **Y**_{−i}. Then, LOOCV prediction of Y_i is based on:

f_M(y | **Y**_{−i}, **X**_i) = ∫ f_M(y | **θ**_M, **X**_i) [**θ**_M | **Y**_{−i}] d**θ**_M,  (4)

that is, f_M(y | **θ**_M, **X**_i) averaged against the posterior distribution. Note that we are careful to distinguish f_M(y | **Y**_{−i}, **X**_i), the estimated distribution function (a function of *y*), from f_M(Y_i | **Y**_{−i}, **X**_i), its calculated value at the observed value Y_i. The quantity f_M(Y_i | **Y**_{−i}, **X**_i) is sometimes referred to as the conditional predictive ordinate of Y_i under model *M*.

For the discrete data of the BBS, f_M(Y_i | **Y**_{−i}, **X**_i) is the estimated probability of observation Y_i, based on the model *M*, the covariate vector **X**_i, and all of the other observations. Calculation of this estimate in a Markov chain Monte Carlo (MCMC) analysis of **Y**_{−i} is straightforward; one need only calculate the expected value of Y_i (say, λ_i) and monitor the Poisson probability of Y_i (i.e. exp(−λ_i) λ_i^{Y_i} / Y_i!) as a derived parameter. The posterior mean of this derived parameter is f_M(Y_i | **Y**_{−i}, **X**_i).
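As a concrete sketch of the conditional predictive ordinate calculation (the posterior draws below are invented, standing in for an actual reduced-data MCMC fit):

```python
import numpy as np
from math import lgamma

def log_poisson_pmf(y, lam):
    # log Pr(Y = y | lambda), computed on the log scale for stability
    return y * np.log(lam) - lam - lgamma(y + 1)

def cpo(y_i, lam_draws):
    """Conditional predictive ordinate of a held-out count y_i: the
    posterior mean of its Poisson probability, where lam_draws are MCMC
    draws of E(Y_i) from the model fitted WITHOUT observation i."""
    return np.exp(log_poisson_pmf(y_i, np.asarray(lam_draws))).mean()

# Invented posterior draws, centered near 5, standing in for a real fit.
rng = np.random.default_rng(7)
lam_draws = rng.gamma(shape=20.0, scale=0.25, size=4000)
f_i = cpo(6, lam_draws)       # estimated probability of the held-out count
B_i = -2 * np.log(f_i)        # this observation's contribution to BPIC
```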

The Bayesian predictive information criterion is defined as:

BPIC^M = −2 ∑_i log f_M(Y_i | **Y**_{−i}, **X**_i).

(Note that the multiplicative factor −2 is not always included; it puts BPIC on the deviance scale, meaning that smaller values are favored.) BPIC provides a convenient summary of LOOCV and an objective basis for model comparison. A bare ranking of values BPIC^M across models is not entirely satisfying, as one is left with the question of whether the differences in values are of practical or even statistical significance. Link and Sauer (2016) proposed a statistic, Z_B, as a measure of statistical significance. Noting that the difference in BPIC between 2 models is a sum of observation-specific differences in the values B_i^M = −2 log f_M(Y_i | **Y**_{−i}, **X**_i), they define Z_B as the sample mean value of these differences divided by its estimated standard error. Under a null hypothesis that the 2 models being compared are equal as representations of the data-generating model (i.e. they have equal Kullback-Leibler divergences), Z_B is treated as approximately standard normal.

BPIC is computationally intensive, requiring a complete reanalysis of the dataset for each set-aside value Y_i. For the Mourning Dove (*Zenaida macroura*), the BBS dataset through 2014 had 95,394 counts by 15,381 observers in 154 geographic strata (Pardieck et al. 2015), for which a full analysis using MCMC took ∼3 hr. Repeating this 95,394 times would take more than 30 years.

It is possible to save time on LOOCV by restricting attention to a subset of the Y_i, by strategically reducing the length of the Markov chains used in MCMC analysis, and by using parallel processing on multicore computers. Doing so for 500 randomly selected Mourning Dove observations, we were able to reduce the time needed for calculation of BPIC to 10 days. This is still too long a time to be practical, prompting our interest in the Watanabe-Akaike information criterion (WAIC), which is easily and quickly calculated and is asymptotically equivalent to BPIC. WAIC has been applied to complex model selection studies using BBS data (e.g., Gorzo et al. 2016).
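Once the per-observation contributions B_i are in hand, the Z_B statistic is a short computation. A minimal sketch with simulated contributions (the arrays below are invented, not BBS output):

```python
import numpy as np

def z_b(B1, B2):
    """Z_B of Link and Sauer (2016): the mean per-observation BPIC
    difference divided by its estimated standard error. Negative values
    favor the first model; |Z_B| > 1.96 is treated as a significant
    difference under the approximate standard normal null."""
    d = np.asarray(B1) - np.asarray(B2)
    return d.mean() / (d.std(ddof=1) / np.sqrt(d.size))

# Simulated contributions for 2 models over 588 subsampled observations
# (12 observations per year for 49 years).
rng = np.random.default_rng(42)
B_model1 = rng.gamma(2.0, 2.0, 588)
B_model2 = B_model1 + rng.normal(0.3, 1.0, 588)  # model 1 fits better on average
z = z_b(B_model1, B_model2)
```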

### The Watanabe-Akaike Information Criterion

The Watanabe-Akaike information criterion, like the BPIC, is based on an estimate of the predictive distribution of individual observations. Using the same notation as used at equation (4), let:

f_M(y | **Y**, **X**_i) = ∫ f_M(y | **θ**_M, **X**_i) [**θ**_M | **Y**] d**θ**_M,  (8)

and let b_M(y | **Y**, **X**_i) denote the posterior variance of log f_M(y | **θ**_M, **X**_i), computed under the full-data posterior [**θ**_M | **Y**]. Note that equation (8) is the same as equation (4), except that the complete data set **Y** is used instead of the reduced data set **Y**_{−i}. This means that the values f_M(Y_i | **Y**, **X**_i) and b_M(Y_i | **Y**, **X**_i) can be computed for all *i* with a single analysis of the dataset; there is no need to repeat the Mourning Dove analysis 95,394 times, for example. The term b_M(y | **Y**, **X**_i) can be thought of as a bias correction in estimating log f_M(y | **θ**_M, **X**_i) based on **Y** rather than **Y**_{−i}. WAIC is defined as:

WAIC^M = −2 ∑_i [ log f_M(Y_i | **Y**, **X**_i) − b_M(Y_i | **Y**, **X**_i) ].

A statistic, Z_W (analogous to Z_B), is based on differences between models in the values W_i = −2 [ log f_M(Y_i | **Y**, **X**_i) − b_M(Y_i | **Y**, **X**_i) ] (Link and Sauer 2016). Z_W is the mean difference divided by its estimated standard error.

### Data, Models, and Analyses

We compared the fit of 4 models to BBS survey-wide data for 20 species using BPIC, and compared the performance of WAIC with BPIC in making these assessments. The 20 species (Table 1) were selected to represent a cross-section of species in regard to breeding distribution, abundance along BBS routes, patterns of population change, and data quality (Sauer et al. 2014); BBS data from 1966 to 2014 were included in the analysis (Pardieck et al. 2015). The intent of this selection was to provide case studies of species for which we believe model selection might provide meaningful results. In particular, models with linearly trending year effects may be better suited for species undergoing consistent population change, while first-difference year effect models might better fit species undergoing population fluctuations. To examine this, we selected the Wild Turkey (*Meleagris gallopavo*), Eurasian Collared-Dove (*Streptopelia decaocto*), Bank Swallow (*Riparia riparia*), McCown's Longspur (*Rhynchophanes mccownii*), and Lark Bunting (*Calamospiza melanocorys*) as species that have experienced consistent and dramatic population changes, and chose the Sora (*Porzana carolina*), Cliff Swallow (*Petrochelidon pyrrhonota*), Red-breasted Nuthatch (*Sitta canadensis*), Carolina Wren, Eastern Bluebird (*Sialia sialis*), Western Bluebird (*Sialia mexicana*), Wood Thrush (*Hylocichla mustelina*), Pine Siskin (*Spinus pinus*), and Henslow's Sparrow (*Ammodramus henslowii*) as representative of species that have experienced population fluctuations. We also examined the Lincoln's Sparrow (*Melospiza lincolnii*), Tricolored Blackbird (*Agelaius tricolor*), Eastern Meadowlark (*Sturnella magna*), and Western Meadowlark (*Sturnella neglecta*), because these species have historically proven difficult to analyze due to large variances and apparent lack of model fit.
Finally, we included the Least Bittern (*Ixobrychus exilis*), a wetland species that is rarely encountered on BBS routes, and the Mourning Dove, one of the most frequently encountered birds on BBS routes.

## TABLE 1.

Species used to compare Watanabe-Akaike Information Criterion (WAIC) and Bayesian Predictive Information Criterion (BPIC) for model selection from North American Breeding Bird Survey data for the contiguous United States and southern Canada, 1966–2014. Trend is the estimated geometric mean rate of population change (% per year) from U.S. Geological Survey analyses (model *T*, linearly trending random year effects); 95% CI is the credible interval defined by the 2.5^{th} and 97.5^{th} percentiles of the posterior distribution.

All of the models considered treated counts, conditional on their expected values, as independent Poisson random variables, with explanatory variables additive on the log scale of the expected count. All included an effect, η, for the observer's first year of service on a route. Observer effects were modeled as mean-zero normal random variables, with stratum-specific precision parameters. All of the models included count-specific overdispersion effects, modeled as mean-zero normal random variables with precision τ^ε. Mean-zero normal random variables associated with temporal change (ε_{s,t} in equations (1) and (2)) were allowed to vary by stratum. We performed Bayesian analysis, assigning gamma priors to precision parameters and vague normal priors to all others.

Three of the models that we considered used linearly trending random year effects to model the population trajectory (equation (1)). These models were distinguished by the amount of hierarchical structure imposed on α_s, β_s, and the stratum-specific variances of observer effects. In model *T*, these were all assigned independent vague priors. Model *S* was like model *T*, except that α_s and β_s were treated as independent normal random effects, varying by stratum. Model *V* was like model *S*, except that the stratum-specific variances of observer effects were modeled as lognormal random effects, varying by stratum. Model *T* has been used since 2010 as the omnibus model for BBS analyses (Sauer and Link 2011); models *S* and *V* add hierarchical structure, similarly to the suggestions of Smith et al. (2014). The fourth model, model *D*, was identical to model *T*, except that the population trajectory was described by the first-difference year effect model (equation (2)).

We fitted the models using MCMC as implemented in program JAGS (Plummer 2003) through package R2jags (Su and Yajima 2015) in R 3.3.2 (R Core Team 2016). A preliminary analysis for each species and model was performed using Markov chains of length 10,000. The final states of these analyses were saved as starting values for subsequent LOOCV analyses, obviating the need for burn-in for those analyses.

The number of observations in most BBS datasets is prohibitively large for full LOOCV analyses. It suffices to compare models based on their fit to a large subset of observations. We could have sampled these observations completely at random, had we been interested in comparing the overall fit of the models. However, our interest was in the fit of the trajectory portion of the model. Therefore, for each species, we randomly selected 12 observations per year for each of the 47–49 years of data analyzed. We computed conditional predictive ordinates f_M(Y_i | **Y**_{−i}, **X**_i) from each of the 4 models for each of these observations, and generated values of BPIC^M and the statistic Z_B in summary. Calculation of the conditional predictive ordinates thus typically involved 49 × 12 × 4 = 2,352 MCMC analyses. In the interest of time, we reduced the Markov chain length to 10,000 for these analyses, noting that there was little difference between the values computed based on the first 5,000 vs. the second 5,000 iterations. For 4 species (Western Bluebird, McCown's Longspur, Lark Bunting, and Tricolored Blackbird), we omitted the first 1–2 years of data from the LOOCV analysis due to low data quality (the BBS was not fully implemented over most of the survey area until 1968; Sauer et al. 2013). Next, we performed a single analysis for each species and model, computing f_M(Y_i | **Y**, **X**_i) and b_M(Y_i | **Y**, **X**_i) for the same indices *i* as used for calculating BPIC^M, thus obtaining comparable observations of WAIC^M and Z_W.
The key advantage of computing both BPIC and WAIC is that both can be decomposed to evaluate the fit of single observations. Examination of single observations permits in-depth comparison of the approaches; it also provides complete flexibility in selecting the components of model fit to consider when comparing models in complex time series of data (Link and Sauer 2016).

## RESULTS

### Comparisons of Models by BPIC

Models *T* and *D* have nearly identical hierarchical structure, differing only in the form of the trajectory (model *T* having linearly trending random year effects, and model *D* allowing greater flexibility in pattern through random first differences in year effects). For the 20 species considered, there was a nearly even split between these models alone (Table 2): *T* was favored over *D* 11/20 times. In considering the full model set, model *D* was top-ranked in 6/20 cases.

## TABLE 2.

Bayesian Predictive Information Criterion (BPIC) values for 4 models (*T*, *D*, *S*, and *V*) applied to North American Breeding Bird Survey data for 20 species. See Figure 3 for descriptions of models *T* and *D*. Model *S* was like model *T*, except that slopes and intercepts were treated as independent normal random effects, varying by strata. Model *V* was like model *S*, except that stratum-specific variances in observer effects were treated as lognormally distributed random effects. Values are based on 12 randomly selected observations per survey year for each species; the smallest BPIC value (best model fit) is underlined for each species.

Models *T*, *S*, and *V* share the same trajectory, but, among these, model *T* has the least and model *V* the most hierarchical structure. Restricting attention to these 3 models, *T* and *S* were each preferred in 5/20 cases, and *V* in 10/20 cases.

As noted previously, the raw ranking of BPIC values is not completely satisfactory; one might reasonably ask whether differences in BPIC values are of practical or statistical significance. We addressed the question of statistical significance using the statistic Z_B, judging the fit of a model to be superior to that of another if the Z_B value was less than −1.96 (Table 3). For 14/20 species, none of the models emerged as superior. In the 6 remaining cases, the Z_B statistic tended to define 2 groups, in which models *D* and *V* were indistinguishable but were preferred over the other 2 models.

## TABLE 3.

Groupings of models (*T*, *D*, *S*, and *V*; see Figure 3 and Table 2 for model descriptions) based on the statistic Z_B (see Figure 5 for definition) used to analyze North American Breeding Bird Survey data for 6 species. For a given species, models occurring in the same column had values |Z_B| < 1.96, and rows follow the ranking of models by BPIC from best (top) to worst (bottom). For the remaining 14 species in Tables 1 and 2, |Z_B| < 1.96 for all comparisons among the 4 models. * Note that for comparing models *T* and *D* for the Lincoln's Sparrow, Z_B = 1.95.

The benefit of formal model-selection procedures becomes clear upon inspection of Figure 3. Over the last 30 years, fitted population trajectories for Lincoln's Sparrows using models *D* and *T* are in reasonably close agreement, but there is some disagreement in the indices for the early years, with model *D* suggesting an increase and model *T* a decrease in population. The long-term trend (geometric mean rate of annual change over 49 years) is −0.39% (95% CI: −1.58% to 0.71%) under model *T*, and 0.28% (95% CI: −0.77% to 1.07%) under model *D*. Model *D* assigns a 73% posterior probability to a larger population in the final year than in the first year; model *T* assigns only a 24% probability. While there is considerable overlap in the credible intervals of both the trajectories and the trend estimates, such that we would not assert that the 2 analyses are contradictory, it is surely desirable to be able to distinguish between the models on the basis of an objective evaluation. The results presented in Tables 2 and 3 indicate that model *D* is to be favored.

### WAIC as an Approximation to BPIC

WAIC is asymptotically equivalent to BPIC, but can be implemented with a single MCMC analysis, and hence has a greatly reduced computational burden. We were interested in evaluating the value of WAIC as a surrogate for BPIC in analyses of BBS data. We thus computed WAIC values for each of the 4 models and 20 species (Table 4).

## TABLE 4.

Watanabe-Akaike Information Criterion (WAIC) values for 4 models (*T*, *D*, *S*, and *V*; see Figure 3 and Table 2 for model descriptions) applied to North American Breeding Bird Survey data for 20 species. Values are based on the same 12 randomly selected observations per survey year for each species as used in analysis for Table 2; the smallest WAIC value (best model fit) is underlined for each species.

Surprisingly, rankings of models by WAIC were not consistent with those by BPIC (Tables 2 and 4, Figure 4). While the slope of the regression line was positive (β̂ = 0.299, 95% CI: 0.087 to 0.511, *P* = 0.007), the *R*^{2} value was only 0.089.

Similar results were obtained when evaluating the statistic Z_W as a surrogate for Z_B (Figure 5). Once again, the slope of the regression line was positive (β̂ = 0.200, 95% CI: 0.054 to 0.346, *P* = 0.008), but the *R*^{2} value was only 0.057.

BPIC and WAIC are both based on estimates of −2 log f_M(Y_i | **θ**_M, **X**_i). Denote these by B_i = −2 log f_M(Y_i | **Y**_{−i}, **X**_i) and W_i = −2 [ log f_M(Y_i | **Y**, **X**_i) − b_M(Y_i | **Y**, **X**_i) ], respectively. In an effort to understand the discrepancy between conclusions based on BPIC = ∑_i B_i and WAIC = ∑_i W_i, we plotted observation-specific values of W_i against B_i for each of the 20 species and 4 models considered. Figure 6 displays the results for the Wood Thrush BBS data evaluated with model *T*; this pattern is typical of what was observed for all species and models. From its definition, we see that B_i decreases as a function of the probability of Y_i. For the highest-probability observations (∼60% had B_i < 5), the agreement between B_i and W_i is good. However, for unlikely observations under the model, W_i is consistently too small relative to B_i, and increasingly so as the probability of the observation decreases. The bias correction term in W_i is too small, especially for unlikely observations, which are the ones that contribute most to the totals.
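This kind of observation-level comparison can be probed in a setting where both quantities are cheap to compute. The sketch below uses a conjugate Gamma–Poisson model (an invented stand-in for the BBS hierarchical models, chosen so that reduced-data and full-data posteriors are trivial to sample) and computes each observation's LOOCV contribution B_i and WAIC contribution W_i, including one outlying count.

```python
import numpy as np
from math import lgamma

rng = np.random.default_rng(3)

def log_pois(y, lam):
    # log Poisson pmf, stable on the log scale
    return y * np.log(lam) - lam - lgamma(y + 1)

# Invented counts with one unlikely observation; Gamma(a0, b0) prior on lambda.
y = np.array([4, 5, 6, 5, 4, 5, 20])
a0, b0, S = 1.0, 0.1, 5000

B = np.empty(y.size)   # LOOCV (BPIC) contributions B_i
W = np.empty(y.size)   # WAIC contributions W_i
for i in range(y.size):
    mask = np.arange(y.size) != i
    # reduced-data posterior draws -> conditional predictive ordinate
    lam_red = rng.gamma(a0 + y[mask].sum(), 1 / (b0 + mask.sum()), S)
    B[i] = -2 * np.log(np.exp(log_pois(y[i], lam_red)).mean())
    # full-data posterior draws -> lppd and posterior-variance correction
    lam_full = rng.gamma(a0 + y.sum(), 1 / (b0 + y.size), S)
    ll = log_pois(y[i], lam_full)
    W[i] = -2 * (np.log(np.exp(ll).mean()) - ll.var(ddof=1))
```

Plotting W against B for such simulated data is the toy analog of Figure 6; the outlying count dominates both totals, and it is there that any divergence between the two criteria matters most.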

## DISCUSSION

The North American Breeding Bird Survey is unparalleled as a source of spatial and temporal information on population change for most North American bird species. Statistical innovations in its analysis and the care taken in maintaining careful protocols for data collection and curation enhance its value to science and management. However, the complex structure of its data dictates a need for caution in model development and assessment. The development of MCMC and its easy implementation in programs such as JAGS have made it very easy to suggest new hierarchical models and to tweak existing models. Because our ability to produce and fit models has outstripped our ability (or desire) to formally critique and compare them, we need to consider how best to provide credible and defensible estimates of population change while still being open to modeling innovations. Development and comparison of model sets that provide reasonable alternatives for model structure are important steps in implementing a diversity of models while still permitting comparisons among them. The results presented here are a first step in exploring alternative approaches in model selection within such a model set.

Our model set was chosen to represent the current candidate models that a BBS analyst would be likely to employ, with the *T*, *S*, and *V* models sharing the same linearly trending random year effects trajectory but differing in the hierarchical nature of other model components. This comparison is timely; although model *T* is the one used in the analysis on the USGS Results and Analysis Website (Sauer et al. 2014), Sauer et al. (2017) recommend the use of model *S* because it facilitates expansion of the BBS analysis to additional regions where very limited data exist from the early years of the survey, and Smith et al. (2014) recommend use of model *V* for the BBS in Canada. Link and Sauer (2016) note that the first-difference model (model *D*) might provide more realistic results in cases where models using the slope–year effect parameterization might lead to overprediction of population change. The 20 species that we selected for our comparative analysis were chosen to provide a variety of life-history, distribution, and population change situations to compare the models.

Analyses using BPIC indicated that, for the majority of the 20 species that we thought would be likely to be sensitive to choice of model, no model was clearly superior. Although the ranking process inevitably produces a model with the smallest BPIC, the quantitative comparison using the Z_B statistic found a significant difference in BPIC value for only 25% of the species. Of these 6 species, Wild Turkeys and Eurasian Collared-Doves dramatically increased in population size and Bank Swallows steeply declined over the 1966–2014 interval. The population trajectory for Carolina Wrens was notably jagged, with large declines related to severe winters (Figure 2). Mourning Doves are one of the most commonly encountered species on BBS routes, hence the ability to discriminate among models for this species was not surprising. The final species, the Lincoln's Sparrow, showed dramatic differences in population trajectory in the early years depending on the model chosen (Figure 3). Although our limited selection of species does not allow us to make generalizations about appropriate models for the analysis of data for the 420 species presently monitored by the BBS, we suggest that these results have implications both for the present analysis of BBS data and for future directions in evaluations of appropriate models for BBS data analysis. First, our results suggest that, within this model set, the current BBS analysis model (model *T*) generally performs quite well across the large variety of sampling situations associated with the 20 analyzed species. Second, in the cases where model *T* was not preferred, model *D* or model *V* was preferred. Our inclination is to suggest that model *D*, which presents an alternative formulation of change over time, is likely to be a superior model to model *V*, which retains the slope–year effect structure of models *T* and *S* but adds additional hierarchical structure. In the case of the Lincoln's Sparrow, it is likely that both models can accommodate limited data in certain strata in different ways, with model *D* providing more realistic estimates of change during periods with limited data.

The tendency for no clear best model to be determined for analysis of our selection of species is reflected in summary results for the species. For the 20 species, credible intervals for long-term trends at the survey-wide scale showed general overlap among the 4 models, except those for Wild Turkeys and Eurasian Collared-Doves, for which trends tended to be lower for the *D* model and higher for the *V* model than for the *T* and *S* models.

Comparison of BPIC results with WAIC results indicated that WAIC is not a reasonable surrogate for BPIC in model selection, and we caution against its use. Similar findings have been reported by Vehtari et al. (2017). This is a disappointing result, as WAIC is much easier and less time-consuming to compute, and is conceptually similar to BPIC. We investigated the use of WAIC as a screening surrogate for BPIC, using it to identify species for which models might differ, which could then be subjected to confirmatory BPIC analysis. However, that approach does not seem feasible without a better understanding of when WAIC fails. Because observation-specific values of both BPIC and WAIC can be computed, it may be possible to empirically model the relationship between BPIC and WAIC and "correct" WAIC. Although such an approach may seem like more trouble than it is worth, the large difference in computational cost between the 2 approaches makes further investigation of ways to improve WAIC tempting. The development of computationally efficient approximations to BPIC remains an area of active research (see Vehtari et al. 2017).
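The computational appeal of WAIC comes from its pointwise construction: it needs only the matrix of log-likelihood values already produced by a single MCMC run, with no refitting. As a minimal sketch, assuming a matrix of pointwise log-likelihoods from the analyst's own sampler (not the authors' code), WAIC on the deviance scale can be computed as:

```python
import numpy as np

def waic(log_lik):
    """WAIC from an (n_draws, n_obs) matrix of pointwise log-likelihoods.

    Deviance scale: WAIC = -2 * (lppd - p_waic), using the
    variance-based effective-parameter penalty. For real posterior
    samples a log-sum-exp formulation would be numerically safer.
    """
    # log pointwise predictive density: log of the posterior-mean likelihood
    lppd = np.sum(np.log(np.mean(np.exp(log_lik), axis=0)))
    # effective number of parameters: posterior variance of the log-likelihood
    p_waic = np.sum(np.var(log_lik, axis=0, ddof=1))
    return -2.0 * (lppd - p_waic)
```

Because both WAIC and BPIC decompose into per-observation contributions, a sketch like this is also the natural starting point for the empirical "correction" of WAIC mentioned above.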

We believe that cross-validation remains the best tool for objective examination of models. Much additional work is needed to establish how best to implement a BPIC-based comparison of models. Because it is not feasible, or perhaps even desirable, to compute BPIC for all observations, the question of what component of model fit is of interest must be explicitly addressed and a design established to select samples to meet this goal. Because the BBS has been constantly expanding in area and consistency of coverage, a primary concern in summary analysis is not to allow long-term results to be determined primarily by extrapolation from intervals with the most data, that is, through estimation of a slope in the slope–year effects models. To address that concern, we balanced the random sample of observations for computing BPIC by year to limit the influence of the greater numbers of observations in recent years. However, the choice of observations to sample for BPIC can also address other goals. For example, a primary emphasis for modeling might be a subinterval of the larger time series, such as a time period when a species underwent an unusual fluctuation or when an important environmental change occurred, and thus the primary interest is in appropriately modeling the effect of the intervention on the population. More discussion on this issue can be found in Link and Sauer (2016).

Finally, we view the model set for BBS analyses as not being well established. The model set in this paper is a logical one, containing the commonly used BBS models. It is realistic, in that there is real interest in determining which of these models is in some sense best. Although we observed general similarity in the results from these models, controversy still exists regarding which model is best; even small differences in trend can have management implications, and practitioners tend to focus on point estimates that can appear quite different even though credible intervals may overlap. The results presented here should reassure practitioners that there are consistencies in current BBS modeling activities, but should also provide a cautionary note that alternative models will always be in development, and that future BBS analyses will almost certainly be based on model selection in the context of a set of candidate models.

## ACKNOWLEDGMENTS

The BBS is the product of thousands of volunteers and a hierarchy of organizers, coordinators, and data managers. We thank all of them. We also thank Keith Pardieck and Adam Smith for thoughtful reviews of the manuscript. Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

**Funding statement:** This work was partially funded by the U.S. Geological Survey (USGS) Status and Trends Program. Keith Pardieck was the project manager. This paper has been peer reviewed and approved for publication consistent with USGS Fundamental Science Practices (http://pubs.usgs.gov/circ/1367).

**Ethics statement:** This research was conducted in compliance with all applicable USGS study design and review protocols.

**Author contributions:** W.A.L. formulated the questions, developed analysis approaches, conducted analyses, and wrote the paper. J.R.S. collaborated in the development of analysis approaches, conducted analyses, and cowrote the paper. D.K.N. conducted analyses and cowrote the paper.

**Data deposits:** BBS data are deposited at ftp://ftpext.usgs.gov/pub/er/md/laurel/BBS/Archivefiles/Version2015v0/