Information on the spatial distribution of soil pH is essential for assessing soil quality and soil productivity. Digital soil mapping (DSM) is commonly used to predict soil characteristics over various types of landscapes. Over the past decade, researchers have made progress using machine learning techniques to provide reliable predictions of soil properties with limited data. DSM studies often use a single learning approach, which is constructed with a machine learner that systematically extracts soil–environment relationships from a large database, whereby a fitted model is used to predict soil information in an unmapped area. The practice of using an ensemble learning approach, especially one that combines several base learners, has rarely been tested in DSM. We developed a workflow for using an ensemble learning algorithm to predict soil properties for the Thompson-Okanagan region of British Columbia, Canada. Here, we focused on soil pH and tested a variety of base learners. Base learners with high prediction accuracies were then used to construct a SuperLearner (SL) to extract the complex relationships between soil properties and environmental variables. The fitted SL was then used to predict soil properties at 25 m spatial resolution at three depth intervals (0–5, 5–15, and 15–30 cm). Prediction accuracies were assessed using an independent test dataset, which indicated that the SL had a similar prediction accuracy to the best individual base learners. Using the heterogeneous ensemble learning approach with a weighted average stacked generalization process eliminated the need to choose the best base learner.
Digital soil mapping (DSM) has increasingly applied novel machine learning techniques to predict the spatial distribution of soil properties and types (Brungard et al. 2015; Heung et al. 2016; Khaledian and Miller 2020). Machine learning algorithms have the potential to quantify the high-dimensional and nonlinear relationships between the environmental predictors and soil response variables over diverse ecosystem types. With improvements in computer technology (Rossiter 2018) and machine learning algorithms over the past decade, more powerful learners were designed to process larger datasets using a larger number of environmental variables. Examples of such algorithms have included, but are not limited to, generalized linear regression (GLM; Hastie and Pregibon 1992), stepwise regression (STEP; Hastie and Pregibon 1992), and lasso and elastic net regularized generalized linear regression (GLMNET; Friedman et al. 2010), which are capable of processing nonlinear relationships for both categorical and continuous data (Simon et al. 2011). Furthermore, the use of tree-based learners, such as the classification and regression trees (CART; Breiman et al. 1984), has led to the development of predictive modelling techniques that are effective in capturing the hierarchical relationships between predictors. The CART approaches also form the basis of more advanced, tree-based learners such as CART with bagging (Breiman 1996a), the cubist learner (Quinlan 1992, 1993), and the random forest (RF) model (Breiman 2001).
The availability of numerous machine learning algorithms has encouraged model comparison studies, and these studies have shown that by using the same input data, different learners could generate digital soil maps that are drastically different from one another (Brungard et al. 2015; Heung et al. 2016). Hence, it has been recommended that model comparison should be carried out as part of best practice in DSM (Heung et al. 2016). In addition to a diverse array of machine learners, DSM practitioners have also investigated the use of ensemble models by extending the application of the “bagging” concept proposed in Breiman (1996b). Here, multiple models are built on bootstrapped samples of the training data and integrated into a single predictive model to improve the model predictions in comparison to predictions made using only one model (Rokach 2010).
Building an ensemble of models using a single type of learner (i.e., homogeneous ensemble learning) has been of interest in the DSM literature when predicting the spatial distribution of soil categorical data and continuous data. Studies such as Heung et al. (2017) have applied a bootstrapping routine for k-nearest neighbours, multinomial logistic regression, and logistic model trees for mapping soil classes, and Padarian et al. (2017) used a bootstrapping of CART for predicting a variety of soil properties across six depth intervals. Furthermore, polygon disaggregation approaches such as DSMART, which uses the See5 tree-based algorithm (Odgers et al. 2014), and its subsequent implementation using the RF algorithm (Chaney et al. 2016), both operate on a similar principle.
Although homogeneous ensemble learning methods have been tested to some extent in DSM (e.g., Heung et al. 2017; Padarian et al. 2017), there has been considerably less attention on modelling approaches that combine multiple types of learners (i.e., heterogeneous ensemble learning). Within the DSM literature, Malone et al. (2014) tested a variety of model averaging approaches using equal weight averaging, Bates–Granger averaging, Bayesian model averaging, and Granger–Ramanathan averaging, which showed that model averaging had the potential to improve map accuracy. Subsequently, O'Rourke et al. (2016) applied a similar model averaging approach and evaluated the improved accuracy of portable visible, near-infrared, and X-ray fluorescence spectrometers. More recently, multiple studies in France have evaluated the use of model averaging techniques in DSM (Román Dobarco et al. 2017; Caubet et al. 2019; Chen et al. 2020).
Within the machine learning literature, stacked generalization is a type of ensemble learning and model averaging approach. As with other model averaging techniques, stacked generalization operates on the concept that multiple predictive learners (i.e., “base learners”) are aggregated into a combined learner, using a combiner algorithm (i.e., “meta-learner”), whereby the expectation is that the combined model has a higher predictive performance (Wolpert 1992). Here, the meta-learner evaluates the predictive performance of the individual base learners and builds an optimal combination. Stacked generalization was first proposed and tested for categorical data (Wolpert 1992) and later adapted into regression stacking for continuous data (Breiman 1996a).
An example of an ensemble learning approach that uses stacked generalization is the SuperLearner (SL), which was first proposed by van der Laan et al. (2007) and further evaluated by Polley and van der Laan (2010). The SL is unique in that it uses a variety of different base learners and a cost function based on cross-validation to create a heterogeneous ensemble. For example, its first implementation (van der Laan et al. 2007) combined modelling methods such as regression trees, RF, least angle regression, logistic regression, and adaptive regression splines. The construction of an SL includes two steps. First, the ensemble learning algorithm uses cross-validation to evaluate the performance of the base learners. In the second step, a cost function is applied, based on the cross-validation results, to calculate a weighted average prediction from the base learners. Whereas model averaging incorporates the predictions of all base learners, the SL also calculates a weighted combination function using least squares regression but includes the additional constraint that all model weights are positive — a constraint that is ensured by removing all learners that have a negative weight (i.e., non-negative least squares). Subsequently, all model weights are normalized. Using an SL can mitigate bias, noise, and uncertainties from individual base learners, and may improve the overall accuracy of the prediction (Polley and van der Laan 2010).
Although the use of SL has rarely been demonstrated in DSM applications, its use was suggested by Hengl and MacMillan (2019). Applying the stacked generalization approach to ensemble learning and the mapping of continuous soil attributes is relatively uncommon in DSM, with the recent exception of Taghizadeh-Mehrjardi et al. (2021), who used a stacked generalization process to predict soil organic carbon. Hence, the objectives of this study were to (1) evaluate and compare a set of base learners, (2) test the potential of using the ensemble learning approach with stacked generalization to extract the relationships between soil properties and environmental variables derived from a digital elevation model (DEM), and (3) compare and assess the use of the ensemble learner with the individual base learners for mapping the spatial distribution of soil pH at multiple depth increments for the Thompson-Okanagan region of British Columbia, Canada.
Materials and methods
This study modelled soil pH by using multiple base learners, e.g., GLM, STEP, generalized linear model with lasso or elastic net regularization (GLMNET_lasso, GLMNET_ridge, and GLMNET enet), support vector machine–radial (SVMR), k-nearest neighbours (kNN), RF, and extreme gradient boosting (XGBoost), and compared them against an ensemble learning approach with stacked generalization. In all cases, point data were spatially intersected with raster layers representing soil–environmental variables to create the training data. The training data were then used to fit the base learners and a stacked generalization process was applied using the SL algorithm to create the ensemble prediction of soil pH. As the goal of this study was to examine the effectiveness of the ensemble learning process with a continuous soil attribute, we statistically compared the performance of SL to the base learners. Figure 1 outlines the general workflow of using an ensemble learning algorithm with stacked generalization for DSM purposes.
This study was carried out in the Thompson-Okanagan region of British Columbia, Canada (map sheet NTS 092INE; Fig. 2). This area is located in the southern interior of British Columbia and includes the Thompson Plateau, the Fraser Plateau, and the Shuswap Highland physiographic subdivisions (Young et al. 1992). It spans latitudes 50.5°N to 51.0°N and longitudes 120.0°W to 121.0°W, and is approximately 4350 km2 in size. The elevation ranges from 318 m to 2088 m above mean sea level, and the area includes the following biogeoclimatic zones: Bunchgrass Zone, Interior Douglas-fir Zone, Montane Spruce Zone, Sub-Boreal Spruce Zone, Interior Cedar–Hemlock Zone, Engelmann Spruce–Subalpine Fir Zone, and Alpine Tundra Zone (Lloyd et al. 1990). The combination of these biogeoclimatic zones results in a variety of ecosystems, and the soil maps of the Thompson-Okanagan region include 99 soil associations for the study area. Each soil association was linked to one or more soil subgroups from the Canadian System of Soil Classification, of which there were 31 in total (Young et al. 1992). Digitized conventional soil polygon maps for the study area were accessed and downloaded from the BC Soil Information Finder Tool (B.C. Ministry of Agriculture and B.C. Ministry of Environment 2018).
Due to the climatic conditions and topography, the landscape features in the study area represent a combination of grassland regions, transitional regions, and dry interior forested regions (Klenner et al. 2008). The majority of the grasslands occupy catchment areas near Kamloops Lake and the Thompson River at lower elevations. Grasslands cover 40% of the region and occur between elevations of 230 m and 800 m. The climate is warm and dry in the grassland region; the dominant vegetation consists of grasses and sedges, and the dominant soils are Chernozems. At higher elevations, away from the water basins, the climate is colder and wetter; vegetation transitions into forests dominated by ponderosa pine (Pinus ponderosa), Douglas fir (Pseudotsuga menziesii), lodgepole pine (Pinus contorta), white spruce (Picea glauca), and Engelmann spruce (Picea engelmannii) at increasingly higher elevations (>800 m; Lloyd et al. 1990; Moore et al. 2010). In the grassland region, the mean annual temperature is 7.9 °C and the mean annual precipitation is 285 mm, whereas in the forest region the mean annual temperature is 5.0 °C and the mean annual precipitation is 476 mm (Lloyd et al. 1990). Most of the grasslands are maintained as pasture to support livestock, and forestry is the other primary industry in the area.
A conditioned Latin hypercube sampling approach (Minasny and McBratney 2006) was used to select 300 sample locations based on topographic variables (Fig. 2). To ensure accessibility, locations were constrained to lie within 200 m of the road network, which included paved, logging, and gravel roads. There were 15 of the selected locations that were not sampled, as 10 locations were inaccessible due to road conditions and five locations had either exposed bedrock or shallow soils over bedrock, not meeting the minimum thickness of soil for the first depth interval. Thus, 285 profiles were sampled, with total depths ranging from 10 cm to 45 cm. Among the 285 sampled profiles, 278 had soil depth greater than 30 cm, four had soil depth between 15 cm and 30 cm, and three were shallower than 15 cm.
Fieldwork was carried out in the summer of 2015 using a field sheet based on the second edition of the "Field Manual for Describing Terrestrial Ecosystems" (B.C. Ministry of Forests and Range and B.C. Ministry of Environment 2010). At each field location, mineral samples were collected from individual horizons. In total, 845 soil samples were collected from 285 field locations. All mineral soil samples were air dried and passed through a 2 mm sieve. The fine fraction of each air-dried soil sample was analysed for pH (water). Lab analyses were carried out in the B.C. Ministry of Environment Analytical Chemical Research Laboratory. The pH was measured with a pH/ion conductivity meter using a water solution at a ratio of 1:2 (Kalra and Maynard 1991).
Because the soil samples were collected on a horizon basis, soil pH data were converted into standard depth increments (0–5, 5–15, and 15–30 cm), based on the specifications of GlobalSoilMap.net products (Arrouays et al. 2014), using the equal-area spline function (Bishop et al. 1999) from the ithir package in the R statistical language (Malone 2017). The 0–5, 5–15, and 15–30 cm depth increments had pH values for 285, 282, and 278 sample locations, respectively. In addition, descriptive statistics were calculated for the soil pH values using the JMP 13.0 software. Analysis of variance was also carried out at the three standard depth increments to compare the pH values with respect to the samples acquired from the forest-dominated (F), mixed forest and grass (FG), and grass-dominated (G) landscapes. Surface vegetation type (F, FG, or G) was classified for each sample location based on the observation of vegetation cover in the field.
To build the training dataset and predict the soil pH, 17 environmental variables (Table 1) were derived at a 25 m spatial resolution from a DEM (B.C. Ministry of Sustainable Resource Management 2002) and used as predictors. The DEM was pre-processed using sequential mean filters with window sizes of 3 × 3, 3 × 3, and 5 × 5 cells to reduce noise and anomalies in the rasters (Heung et al. 2014). Environmental variables were then calculated to represent local-scale topography (e.g., elevation, slope, aspect, and curvature), landscape-scale topography (e.g., multiresolution index of valley bottom flatness), climatic characteristics (e.g., diurnal anisotropic heating and diffused insolation), and hydrological characteristics (e.g., topographic wetness index). All variables were calculated using the System for Automated Geoscientific Analysis (SAGA) software (SAGA Development Core Team 2011; Conrad et al. 2015) and projected using the Albers equal-area conic projection system using the NAD83 datum.
Covariates derived from a 25 m spatial resolution DEM used for modelling soil pH.
The following sections provide a brief description of the base learners; however, readers are encouraged to refer to the references provided in Table 2 for detailed descriptions.
Summary of all the learners and corresponding hyperparameters.
The GLM assumes that the regression function is linear in its inputs, which are comprised of independent environmental variables, and takes the following form:
where y is the dependent response variable (soil attribute), x is the independent predictor variable, N is the number of predictor variables, β0 is the intercept, βi is the partial regression coefficient for each predictor variable, and ϵ is the error term. The ordinary least squares (OLS) method determines the coefficients of the independent variables as well as the intercept value by minimizing the sum of squared residuals (eq. 3). As a result of its structure, GLM often has an interpretable description of how the predictor variables influence the target variable (Hastie et al. 2009). The R package glm was used to develop the GLM learner (Dobson 2002).
The STEP is a type of multiple regression technique that selects the best-fitted combination of independent variables to predict the dependent variable. The process includes forward addition and backward removal of predictors based on the Akaike information criterion. The R package step with backward removal was used to develop the STEP learner (Venables and Ripley 2002).
The GLMNET is an extension of the GLM model; however, it applies a shrinkage and/or regularization approach to minimize the number of predictors within the model (Hastie and Qian 2016; Hastie et al. 2016). The shrinkage method used by GLMNET is controlled by the alpha hyperparameter: when alpha = 0, ridge regression (Ridge) is employed; when alpha = 1, lasso regression (LASSO) is employed; and when 0 < alpha < 1, elastic net regression (ENET), a hybrid (i.e., mixing) of ridge and lasso, is employed. A full description of these methods is beyond the scope of this study, but readers may refer to Friedman et al. (2010) for more details. The use of GLMNET is commonly seen in medical studies and biological science, and it has been especially popular in epigenome-wide association studies (Horvath 2013; Knight et al. 2016), but it is less common in DSM. The R package glmnet was used to develop the GLMNET learner (Friedman et al. 2010). Previously, soil organic carbon (r2 = 0.50) and four other important soil nutrients were predicted using GLMNET in India (Sirsat et al. 2018), and Li et al. (2020) used GLMNET with multiple environmental variables to estimate soil thickness (concordance correlation coefficient (CCC) = 0.76) in Henan Province, China. Most recently, Taghizadeh-Mehrjardi et al. (2021) used GLMNET to predict 13 soil properties in Iran.
The SVMR was proposed by Vapnik et al. (1997) to use a nonlinear transformation technique to project the original input into hyperspace and then generate a linear regression in this newly developed multidimensional feature space. In this study, a radial basis function kernel was used to create the regression function. kNN is a supervised learner that uses a nonparametric method to predict the value in the target cell based on the values of the k closest neighbouring observations in feature space (Kuhn 2008). The RF learner uses an ensemble of individual tree-based models and is derived from the CART model (Breiman 2001). The individual trees are trained using a bootstrap sample of the training data and additional randomness is incorporated into the model because the variables used to generate the binary splits at each node of each tree are drawn using a random subset of the predictor variables (Breiman 2001). XGBoost uses a gradient boosting framework to build a strong learner from several weak learners and uses many decision trees to make the prediction. The uniqueness of XGBoost is the construction of new decision trees, which are based on the prediction errors of the previous tree model to minimize the prediction error of the final prediction. Therefore, the final predictions are an ensemble of several decision trees (Chen et al. 2015; Chen and Guestrin 2016).
Stacked generalization using SL
The SL algorithm is an ensemble learner that uses the stacked generalization concept (Polley and van der Laan 2010). Here, a model intercept is not included, and the coefficients, which represent the weights of the weighted combination of the learners, cannot be negative and must sum to 1. The following equation describes the weighted combination function:
where Yobs represents the observed value, represents the predicted value from base learner k, and αk represents the weight of that base learner's predicted value. When estimating the weights, a non-negative least squares regression approach is applied with the aim to minimize the mean square error (MSE). To ensure the non-negative coefficient constraint, all base learners that have negative coefficients following the MSE minimization are removed from the SL model. Then, to ensure that all coefficients sum to 1, the remaining coefficients are normalized. The SL and stacked generalization approach is distinguished from model averaging approaches based on these additional constraints. This approach differs from Granger–Ramanathan model averaging, which does not include the non-negative coefficient constraint.
All the base learners were available using the caret package (Kuhn 2008) and the SuperLearner package (Polley et al. 2019) in the R statistical software (R Development Core Team 2012). The caret package was used to facilitate optimization of the hyperparameters of the base learners, whereas the SuperLearner package integrated the stacked generalization process of the predictions (van der Laan and Dudoit 2003). The SuperLearner package is particularly useful for the ensemble learning approach as the package compiles a library of base learners from the existing R packages, including the caret package. The SL has been previously evaluated and tested in biostatistical studies (van der Laan and Dudoit 2003; van der Laan et al. 2007). Using the outputs predicted by the base learners and their associated weights, the SL may be used to generate a map of a target soil variable.
Model training and testing
Every sample location was spatially intersected with the 17 environmental variable layers to create the full observational dataset. Random holdback cross-validation was used and thus the full dataset was partitioned into a training dataset (70%) and a test dataset (30%). The training data were fed into each base learner to evaluate the relationship between soil pH and environmental variables; this relationship was then used to predict soil pH for all locations in the study area. A nested cross-validation was applied to build and test the SL and base learners. Within the training dataset (70%), a 10-fold cross-validation procedure was used to optimize the model hyperparameters and to determine the weights of the individual base learners used to build the SL (Polley et al. 2019). The independent test dataset (30%) was used to calculate the accuracy metrics.
The prediction accuracies of both the base learners and the SL were quantified using MSE, Lin's CCC, and bias. Here, the accuracy metrics were calculated using only the independent test data (30%) from the nested cross-validation procedure, which were not used for fitting the SL or optimizing the model hyperparameters. Mean square error is defined as the mean of the square of the difference between the observed values and predicted values, and is a measure of global model uncertainty (Schluchter 2005); CCC measures the agreement between the observed values and the predicted values, and is a measure of model accuracy (Lin 1989); and bias is calculated as the difference between the mean of the predictions and the mean of the observed values (Bellon-Maurel et al. 2010).
We repeated the nested cross-validation procedure 30 times to ensure the stability and reliability of the results, and we reported the mean value and standard deviation for each accuracy metric (Engebretsen and Bohlin 2019; Fig. 1). Furthermore, the accuracy metrics from these 30 repeats were used as the basis for statistical comparisons between the individual base learners and the SL using a comparison of means with a control using Dunnett's test with α = 0.05.
Spatial prediction using SL
After 30 repeats of the nested cross-validation procedure, the process resulted in 30 fitted SL models with the corresponding 30 sets of fitted base learners. The SL that yielded the highest CCC value was then used in the spatial prediction, where the weighted combination of selected base learners was calculated during the training process. The spatial prediction process with SL had two steps. The first was to individually produce the spatial predictions of all the fitted base learners with the 17 topographic raster layers. In the second step, the outputs of the base learners were then used as inputs to the weighted combination to produce the final SL output. All maps were produced at a 25 m spatial resolution.
Results and discussion
The mean value of soil pH was highest at the 15–30 cm depth increment and lowest at the 0–5 cm increment (Table 3). Vegetation type, which had been determined by observation in the field, had a strong influence on pH (Table 4). At both the 0–5 and 5–15 cm depth increments, pH was significantly different between each of the vegetation types, with pH being highest for grass (G), intermediate for forest intermixed with grass (FG), and lowest for forest (F). At the 15–30 cm depth increment, there were no significant differences in pH among the three vegetation types. These significant effects in the surface horizons can be partly attributed to the increase in effective moisture and leaching in the forested environments, which tend to occur at higher elevations in the study area (Young et al. 1992; Jobbágy and Jackson 2003). Deeper in the profile, where parent material is expected to exert a greater influence on soil properties, the effect of vegetation is muted, and fewer significant effects due to vegetation are expected.
Summary of descriptive statistics for the soil pH data at three standard depth intervals.
Analysis of variance and analysis of mean results for the difference in pH among vegetation types.
Topography can influence soil pH by controlling water flow, material redistribution, and microclimate (Moore et al. 1993). In our study, pH values were significantly correlated with two topographic variables (channel network base level and elevation) at all three standard depth increments, and the correlations were all negative (Table 5): pH tended to be higher at lower channel network base level values and at lower elevations. These results also reflect the increased leaching intensity at higher elevations. Similar trends of pH being related to topographic variables have been observed in several previous studies (Moore et al. 1993; Smith et al. 2002; Zhang et al. 2019). Several studies have found an increase in soil pH at downslope positions (Brubaker et al. 1993; Chen et al. 1997; Zhang et al. 2019) and Chen et al. (1997) found that topographic variables, such as aspect and slope, were controlling factors of the spatial distribution of soil pH in the mountainous area of east Taiwan.
Pearson correlation coefficients between pH at three depths and environmental variables.
The effects of vegetation and topography on soil pH are likely interdependent. Overall, soil pH was higher in the grass-covered area, which was at lower elevations and was generally dry and hot, and soil pH was lower in the forested area, which was at higher elevations with cooler temperatures and humid conditions. Chytrý et al. (2007) observed that soil pH decreased with increasing precipitation in the mountain area in southern Siberia. The higher amount of precipitation at higher elevations, where the forest is, increases the rate of leaching in the soil, which increases the concentration of H+ ions and thus decreases the pH. Lower soil pH was also reported at higher elevations in an oak woodland–conifer forest in the western United States (Dahlgren et al. 1997) and pine forests in the eastern United States.
Evaluation of base learners
The prediction performance and accuracy assessments for each base learner and SL are summarized in Table 6. Mean square error, CCC, and bias were calculated based on the 30 repeats for each depth interval. Random forest and XGBoost performed consistently well with mean CCC ranging from 0.60 to 0.63 for all three depth increments; furthermore, the standard deviation values were consistently low as they ranged from 0.03 to 0.05, which indicates stability in the models.
Overall error rate of the ensemble learners and base learners using 30 repeats of random holdback cross-validation for soil pH at three depth intervals.
At the 0–5 cm depth increment, RF and XGBoost were the two best performing base learners with CCC = 0.60; however, SVMR and GLMNET_ENET performed similarly with CCC = 0.58 and 0.56, respectively. Furthermore, the MSE and bias were also similar, ranging from MSE = 0.47 to 0.59 and bias = –0.06 to –0.02. With the exception of the kNN base learner, all other learners performed similarly at the 5–15 cm depth increment where the CCC ranged from 0.61 to 0.63 and the MSE and bias shared similar values. At the 15–30 cm depth increment, a similar pattern was observed in which the kNN learner performed the worst with CCC = 0.29 in comparison to the other learners, where the CCC ranged from 0.57 to 0.63.
In Taghizadeh-Mehrjardi et al. (2021), a higher performance of the RF and XGBoost learners was similarly observed. It is possible that this may be attributed to the fact that both models are ensemble machine learning models themselves—RF uses a bagging framework while XGBoost uses a boosting framework, and furthermore both models are tree-based models. Comparing the other learners, GLMNET and STEP showed a better prediction accuracy than GLM, which indicated that variable selection and the regularization process, such as the one used in the GLMNET model, may have reduced the prediction errors of the linear regression learners. Previous studies showed that when comparing both linear and nonlinear models, the linear models were the least effective while the decision tree learners were the most effective (Khaledian and Miller 2020; Taghizadeh-Mehrjardi et al. 2021). We observed this to be the case only at the 0–5 cm depth increment and not at the 5–15 cm and 15–30 cm depth increments, thus suggesting that nonlinear relationships may be present between soil pH and the environment for surficial soils.
Evaluation of SL
In fitting the SL, the base learners were weighted for each depth increment and are summarized in Table 7. Here, we reported the mean model weight from the 30 repeats and the corresponding standard deviation values for each base learner. It is important to note that the SL applies a non-negative least squares framework in estimating the model weights and hence the models that do not meet the non-negative condition were assigned a weight of 0. Overall, GLMNET_ENET, SVMR, RF, and XGBoost were consistently used in fitting the SL. Based on the low standard deviation values for the model weights, the SL appears to be stable when selecting the base learners as well as when calculating their corresponding weights.
Summary of weights for each base learner over 10 nested cross-validations.
It is also interesting to note that of the linear models, GLMNET_ENET performed the best and was included in the SL while all other models were assigned weights close to 0. In comparison, models with drastically different structures, such as SVMR, RF, and XGBoost, were consistently included in the SL. This observation shows that the SL selects diverse models rather than models with similar structures, such as the linear models. Although kNN is also distinct from the other models, it was weighted close to 0, which is due to the fact that it was consistently the poorest performing model across all depth increments. These findings were consistent with Polley et al. (2019), who suggested that a diverse set of base learners, including both linear and nonlinear base learners, should be used to fit the ensemble learning algorithm instead of using only similar base learners or testing only a few base learners.
The external test data showed that, based on the mean of the 30 iterations, the SL had CCC values of 0.56, 0.61, and 0.62 for the 0–5, 5–15, and 15–30 cm depth increments, respectively. Based on the CCC, the SL performed 14.3% better than the GLM learner at the 0–5 cm depth increment while performing approximately twice as effectively as kNN (Table 6). With respect to MSE, the SL had a significantly lower MSE of 0.55 when compared to the GLM model (1.41) at the 0–5 cm depth increment and had significantly lower MSE of 0.46 when compared to the kNN model (0.74) at the 5–15 cm increment. When using bias to assess the difference between the mean predictions of all the learners and the mean of the observed values, there were no significant differences. In general, the SL did not show significant improvements in accuracy, global uncertainty, and bias, which contradicted our original expectations based on Polley and van der Laan (2010) and Taghizadeh-Mehrjardi et al. (2021).
In a recent study, Taghizadeh-Mehrjardi et al. (2021) used 14 models to construct an ensemble learner and their overall finding was that the ensemble learner outperformed all base learners. Whereas this study showed that except for the kNN model, the range of accuracy metrics was fairly similar across all models, the range of accuracy metrics varied far more in Taghizadeh-Mehrjardi et al. (2021). A possible recommendation would be that DSM practitioners should first perform a comprehensive comparison of base learners, and if the results are inconsistent, the application of SL may be warranted despite the cost of additional computation and model interpretability.
We used the SL to predict the spatial distribution of soil pH at three depth intervals over the Thompson-Okanagan region (Fig. 3). The greatest spatial variation in pH occurred at the 0–5 cm depth increment (Fig. 3a), in which it is clear that soil pH is highest near the Thompson River and Kamloops Lake, and lowest at the boundary of the mountain regions of the study area. A similar spatial pattern was also observed at the 5–15 and 15–30 cm depths (Figs. 3b and 3c), in which the soil pH decreases with increasing elevation, and with distance from the stream network. This is also revealed by the correlation coefficient analysis, which showed that soil pH had a strong negative relationship with elevation (Table 5), and the predicted map of soil pH shows that soil has lower pH in the forest at higher elevations (Fig. 3). The low pH in the forest region could partly be related to the high organic matter content in the forest floor. A second pattern shows that soil pH is lower at the 0–5 cm depth and higher at the 15–30 cm depth. This could be the effect of accumulated base cations from the parent material.
In Fig. 4, we present a contour line for the pH value measured in water () of 5.99, which closely approximates a pH measured in CaCl2 () of 5.5, for the 5–15 cm depth interval. The contour line is shown in relation to the soil mapping boundaries for the study area. In the Canadian System of Soil Classification, a value of 5.5 is a diagnostic criterion for the separation of the Eutric Brunisol great group ( > 5.5) from the Dystric Brunisol great group ( < 5.5). Because of the declining pH values with increasing elevation in the study area, the contour line at of 5.99 was expected to occur near the upper elevation boundary for map units where Eutric Brunisols occur and near the lower elevation boundary for map units containing Dystric Brunisols.
There was considerable variability throughout the study area, and Brunisols did not occur in all map units; in the southwestern portion of the study area, the location of the = 5.99 contour line generally followed the expected behaviour by occurring near or slightly above the upper elevation where Eutric Brunisols were mapped (Fig. 5). In the north and northeastern portions of the study area, the = 5.99 contour at times appeared at a higher elevation than expected. Despite the variability observed in our map results, information from the Canadian Soil Information System (Agriculture and Agri-Food Ottawa, Ontario 2000) confirms that the soil units mapped in the vicinity of our contour line have pH values in the B horizon that generally agree with the placement of the contour line. These results were consistent with our pedological understanding of the region and accuracy and reliability of our pH predictions, but also point to areas where the pH predictions could be improved.
To further investigate the predicted outputs of the SL in comparison to its constituent base learners, Figs. 6–8 show a close-up region of the study area for each base learner with respect to the SL and for each depth increment. It should be noted that these figures show the predictions for the single repeat of the nested cross-validation that resulted in the highest CCC, in which the CCC values were 0.70, 0.71, and 0.66 for the 0–5, 5–15, and 15–30 cm depth increments, respectively. The spatial patterns are similar between the base learners and SL within each depth increment, except for kNN, where spatial patterning was less obvious, and a relatively uniform distribution of pH was predicted across the study area, which may account for the model's lower accuracy. Similar to when the accuracy metrics were investigated, when the spatial patterns between the base learners were similar, there appeared to be a limited influence on the spatial patterns when using the SL. This, again, suggests that a soil mapper should carry out a visual assessment of the base learners prior to investing additional time and computation into the SL.
For the surface horizon (0–5 cm), this study showed that the variation in topography had a direct influence on the spatial distribution of pH, and the linear regression learners with variable reduction achieved a prediction accuracy similar to the regression tree learners. As shown in the results, more powerful learners such as RF and GLMNET were more effective than GLM and kNN. On the other hand, this study only used topographic variables as environmental variables. Adding additional variables to represent vegetation, climate, and parent material may have improved the prediction accuracy.
Other studies have shown that increasing sample size and using additional environmental variables derived from hyperspectral images can improve the accuracy of prediction (Lagacherie et al. 2019). We recognize the potential limitation of using only topographic variables as predictors, and we considered leveraging satellite imagery, such as Landsat and Sentinel 2 images, to represent vegetative patterns in the study area. However, in the study area, vegetative and climatic patterns are largely controlled by topography; lower elevations of the region are dominated by grasslands where it is warmer and drier, and higher elevations are dominated by dry interior forests where moisture is higher. Furthermore, an important limitation of satellite data is that disturbance to natural vegetation patterns, primarily caused by fire and pine beetles, but also by clear-cutting, is substantial in this area. Satellite images show present vegetation patterns; considering that the time frame during which the soils developed, and as the result of the pH of the original parent material being altered by pedogenic processes, is over 200 times as long as the period of significant human activity in the area, we believe that the topography serves as a more reliable indicator of the distribution of soil forming processes across the landscape.
Previous studies (Seibert et al. 2007; Tu et al. 2018) showed that the use of topographic indices alone, as in this study, had the potential to effectively map the spatial distribution of soil properties, including pH. Tu et al. (2018) suggested that at a local scale, soil pH in the upper profile has the strongest correlation with topographic indices compared to the lower profile. However, in this study, both the individual base learners and a stacked ensemble learner had higher prediction accuracies in the lower profile. Other researchers have found that parent material and vegetation-related indices, such as rooting depth, have a stronger influence on soil pH than topography (Reuter et al. 2008; Gruba and Socha 2016; Zhang et al. 2019).
This paper presented the use of SL, an ensemble learning algorithm with stacked generalization, to map the spatial distribution of soil pH at three depth intervals. The approach was applied to examine the use of topographic indices to map the spatial distribution of soil pH in the dry forest ecosystem in the Thompson-Okanagan region of British Columbia, Canada. The DSMs of soil pH at three depth intervals provide the first full-coverage map for the area, and the workflow may be further applied to map the spatial distribution of other soil properties. Elevation and surface vegetation types have a strong influence on the distribution of soil pH, with pH around 7.5 near the valley basin on the grassland and around 5.5 near the mountain tops in the forested area. Soil pH was higher in the 15–30 cm depth increment, compared to the shallower depth increments, which is likely the result of leaching, seasonal fluctuations of the water table, and sodium-rich parent material. Additional spatial information, such as vegetation, water, or parent material data, should be considered as additional predictor variables in future studies.
This study demonstrated how to use an ensemble learning algorithm with stacked generalization in DSM studies. When using a machine learning approach in data analysis and prediction, learner selection is often a challenge, especially when more than two learners show promising preliminary results. The ensemble learning approach with stacked generalization provides the option to combine the results from multiple learners to create an integrated mapping output with relatively stable performance. The SuperLearner package provides a solution that allows DSM practitioners to test many learners at the same time and to help them identify the most effective learners. Contrary to results from other studies, the SL does not necessarily outperform all the base learners, but it does provide near-optimal performance. We suggest that DSM practitioners should first carry out a comprehensive comparison of base learners, and if the model outputs yield substantially different results, the SL may provide the means for improving predictions.
The authors are thankful for the field and lab assistance provided by Carson Li and M.J. Jeon. The authors are also grateful for the support from the B.C. Ministry of Forests, Lands, Natural Resource Operations and Rural Development, and Simon Fraser University.