Although British Columbia (BC), Canada, has a rich history of producing conventional soil maps (CSMs) between 1925 and 2000, the province still lacks a detailed soil map with a comprehensive coverage due to the cost and time required to develop such a product. This study builds on previous digital soil mapping (DSM) research in BC and develops provincial-scale maps. Soil taxonomic classes (e.g., great groups and order) and parent material classes were mapped at a 100 m spatial resolution for BC (944 735 km2). Training points were generated from detailed and semi-detailed soil survey maps. The training points were intersected with 26 topographic indices for mapping parent materials with an additional 9 climatic and vegetation indices for mapping soil classes. The soil–environmental relationships were inferred using the random forest (RF) classifier. The fitted models were used to predict 23 soil great groups, 9 soil orders, and 10 parent material classes. Accuracy assessments were performed using n = 14 570 validation points for parent material classes and n = 14 316 validation points for soil classes, acquired from the BC Soil Information System. The accuracy rates for soil great groups, orders, and parent material classes were 55%, 62%, and 69%, respectively, and kappa coefficients were 0.37, 0.41, and 0.59, respectively. This study demonstrated that when RF was trained using CSMs, the accuracy for the resulting DSM was higher than the original CSM. To assess prediction uncertainties, ignorance uncertainty maps were developed using class-probability layers generated by the RF models.
There is an increasing need for the Province of British Columbia (BC), and elsewhere in the world, to provide reliable information on soil patterns to help define and realize the benefits of sustainable resource use and to ensure environmental protection (Carré et al. 2007; Hartemink and McBratney 2008). A close linkage between soil properties and plant productivity underpins the application of precise and accurate soil information for the purposes of forest harvest planning (Brown 1973; Valentine 1986), agricultural production (Bertrand et al. 1991), and grazing management (Teague et al. 2011). The position of soil — at the physical interface between the lithosphere, atmosphere, and hydrosphere — accounts for its control over hydrologic processes, such as rainfall interception, and the partitioning of the received water between infiltration and surface runoff (Wilding and Lin 2006). These landscape processes have a key role in efforts to improve watershed management, protection, and restoration and our ability to respond to natural disasters, such as flooding and drought. Furthermore, the fundamental role of the soil in ecosystem processes provides the context for soil information as a critical component of efforts to respond to natural or anthropogenic disturbances, such as wildfire, insect outbreak, and climate change (Lal 2004).
According to Anderson and Smith (2011) and McKeague and Stobbe (1978), soil survey activities in BC started in the 1920s. An early example of a conventional soil map (CSM) for BC was produced by Kelley and Spilsbury (1939). Soil Survey Report Number 1 provided descriptions of the soils for the Lower Fraser Valley region of the province, and their association with the climatic, topographic, and economic factors that affected agricultural potential. The report contained a detailed description on the trend of agricultural development and settlement in the survey area, and where the most fertile areas were. The identification of areas suitable for farming was the main purpose of CSM in those years. Over the next five decades, more than 100 survey projects were completed in BC. The production of that series of extensive CSMs ended once the major areas with agriculture potential were mapped. One of the last extensive projects was prepared for the Stikine–Iskut area by Fenger and Kowall (1992). Since the late 1990s, some soil maps have been produced for smaller areas, and usually in response to a specific resource management project or information need. Currently, most of the digitized CSM reports are readily available through the Canadian Soil Information System (CanSIS) (Agriculture and Agri-Food Canada 2021).
To leverage the revolutionary advances in the acquisition of environmental data, computational ability, and geographic information systems (GIS), the application of digital soil mapping (DSM) techniques has steadily progressed over the last decade in BC. Early projects included the development of predictive ecological maps for a portion (3 million ha) of the Cariboo Forest Region of central BC, using fuzzy classification systems (MacMillan et al. 2007) — an approach that was later extended to the remainder of the region (MacMillan et al. 2010). To further test the applications of fuzzy inference approaches and extend their applicability toward the disaggregation and refinement of legacy CSMs, Smith et al. (2012) proposed the use of a weights-of-evidence technique to inform the process of defining the fuzzy rule sets used in the ArcSIE software (Shi et al. 2009; Shi 2010). In Smith et al. (2012), the soil classes were mapped for the Trout Creek region of the Okanagan Basin and in Smith et al. (2016) DSMs were produced for soil classes and soil attributes for the remainder of the Okanagan Basin in accordance with GlobalSoilMap.net specifications.
In addition to the use of fuzzy classification techniques, machine-learning techniques have been tested for the Lower Fraser Valley (Heung et al. 2014, 2016) and the Okanagan–Kamloops regions (Heung et al. 2017) of BC. In Heung et al. (2014), a framework for data mining the soil–environmental relationships from detailed legacy CSMs was evaluated for the mapping of soil parent material using the random forest (RF) classifier. A similar approach was used by Bulmer et al. (2016) to produce a map of parent material and a limited set of soil classes by mosaicking a series of regional maps. Because of the successful application of the RF classifier, it was subsequently used to produce exposed bedrock maps for the Tulameen–Princeton region of southcentral BC (Scarpone et al. 2017). Furthermore, the RF algorithm was also compared against a generalized linear model and regression kriging approaches for the purposes of mapping soil thickness for that same region (Scarpone et al. 2016).
To further explore the application of machine-learning techniques for producing DSMs in BC, Heung et al. (2016) presented an overview and comparison of a variety of machine-learning techniques for mapping soil classes for the Lower Fraser Valley. In that study, 10 machine-learning techniques were compared, where it was observed that support vector machine with radial basis function and k-nearest neighbours produced results with the highest accuracy; however, the classification and regression trees (CART) with bagging, logistic model trees, and RF classifiers performed comparably well. Although the support vector machine produced the best model, Heung et al. (2016) indicated that the learner was computationally demanding and the process of optimizing the hyperparameter values was challenging. With respect to the k-nearest neighbours learner, its accuracy was also high in Heung et al. (2016); however, the spatial patterns that were generated from those predictions were inconsistent with the pedological knowledge of the authors.
Further exploration was done for the Okanagan–Kamloops region of BC, where Heung et al. (2017) presented a comparison of DSMs derived from legacy soil pits and CSM polygons for mapping soil classes. There, it was observed that the DSMs produced using CSM polygons as training data were more accurate in comparison to maps produced by using soil pedon data. Furthermore, additional comparisons between single-model learners and ensemble-model learners were performed, where it was observed that ensemble-model learners consistently produced more accurate results with the RF classifier performing consistently well — regardless of the type of training data used. The effectiveness of the RF classification algorithm was consistent with other model comparison studies within the DSM literature (e.g., Brungard et al. 2015; Taghizadeh-Mehrjardi et al. 2015). Furthermore, RF has the added benefit in being computationally efficient when optimizing its main hyperparameter, mtry, whereby the number of potential hyperparameter values is constrained to the number of predictors (Heung et al. 2016). Lastly, the model has the added ability of generating uncertainty maps (Heung et al. 2017).
Despite the progress made in DSM research in BC, many of the developments have been limited to the production of regional maps and as a result, the availability of a high-resolution digital soil data set for the entirety of the province remains limited. Currently, the only uniform, province-wide, soil map is the Soil Landscapes of Canada (SLC) Version 2.2 data set, which was produced at a 1:1 000 000 scale and provided by CanSIS. The SLC data set was created by digitizing a combination of provincial- and regional-scale CSMs and is available in the polygon format (Schut et al. 2011). The portion of the SLC that covers BC consists of 2651 multicomponent polygons with an average polygon size of 380 km2 (Geng et al. 2010); as such, its usefulness remains limited when spatially explicit and detailed soil information is needed. Hence, the objectives of this study were as follows: (1) to demonstrate and extend the use of the RF classifier toward the development of provincial DSMs of soil parent material classes and soil taxonomic classes (great groups and orders) at a 100 m spatial resolution and (2) to validate the predictions using the BC Soil Information System (BCSIS) data set — a georeferenced repository of soil information.
Materials and methods
The methods used for this study were adopted from previous regional-scale DSM research that was carried out for the mapping of soil taxonomic class (great groups and orders) and parent material for the Lower Fraser Valley (Heung et al. 2014, 2016) and Okanagan–Kamloops regions (Heung et al. 2017) of BC.
Soil-forming environment of British Columbia
British Columbia is the western-most province of Canada, located between 48–60°N and 114–139°W and with an areal extent of 944 735 km2 (9.5% of Canada; Fig. 1). The peak elevation of the province is located on Fairweather Mountain at 4671 m (above mean sea level). British Columbia lies within the Western Cordillera of North America, and its physiography was described in Church and Ryder (2010) as a broad, complex system of mountains and plateaus. The complex physiography and geology reflect BC’s location on the western edge of North America, where for more than 350 million years, numerous geologic terranes have collided with the ancestral continent through the process of plate tectonics (Monger 1997). A series of northwest- to southeast-trending mountain ranges have resulted from the tectonic activity, while intervening periods of erosion and volcanism led to the presence of gently sloping plateaus that dominate large portions of central BC. The geology is highly diverse at local scales, while the broad patterns in rock type reflect the character of the accreted landmasses, as well as the tectonic and geomorphic processes of orogeny, volcanism, metamorphism, erosion, and deposition.
Holland (1976) outlined three broad mountain systems that occupy the majority of BC, in addition to a small portion (10%) of the provincial landmass occupied by the Interior Plains of North America. The western system of mountains includes several ranges where intrusive igneous rocks are the dominant bedrock type. The Interior Plateau has a wide diversity of rock types but includes a large area of flat lying and folded lava known as the plateau basalts. To the east of the Interior Plateau, the Eastern System consists primarily of sediments that were deposited into the ocean off the coast of the ancestral continent and were subsequently uplifted as a result of the tectonic activity. These mountain ranges include the Rocky Mountains, where the continental divide marks the provincial boundary in the southeast. The Eastern System adjoins the Alberta Plateau in the northeastern part of the province, is dominated by sedimentary rocks, and is generally flat in topography.
Church and Ryder (2010) provide a description of land-forming processes in relation to the spatial scale of the resulting landforms. The large landscape features (observable at scales of >100 km) and bedrock characteristics result from the tectonic processes of orogenesis and volcanism associated with the movement of continental plates, punctuated by long intervening periods of erosion. For at least the past 2 000 000 years, glaciers advanced and receded throughout BC with the last advance ending approximately 10 000 years B.P. The influence of ice in shaping the broad landscape is observable at scales up to 1000 km, while the contemporary geomorphic processes of erosion, landslides, flooding, and deposition continue to shape the land into features visible on maps at finer scale. Soil-forming processes within pedons can be considered to operate at localized scales and influence the properties of the surface to a depth of ~1 m within the unconsolidated surficial (parent) material.
Because of the highly variable topography of the Coast and Rocky Mountain Ranges and the proximity to the Pacific Ocean, BC experiences a combination of maritime and continental climate patterns. The wettest parts of BC are located along the Pacific Coast and especially on the windward slopes of Vancouver Island, Haida Gwaii (formerly the Queen Charlotte Islands) and the Coast Mountains as a result of orographic precipitation. The driest parts of the province are in southcentral BC, behind the rain shadow and leeward of the Coast Mountains. In the Great Plains region of northeastern BC, the continental climate results in the coldest temperatures in BC. A system of biogeoclimatic ecosystem classification (BEC) based on climate and vegetation type is used in BC to describe the ecological character of 14 biogeoclimatic zones (Pojar et al. 1991).
The Coastal Western Hemlock (CWH) biogeoclimatic zone has abundant rainfall, with a mean annual precipitation of 2228 mm, which ranges from 1000 to 4400 mm across the province. Here, the vegetation is comprised of a mixture of western hemlock (Tsuga heterophylla), Douglas-fir (Pseudotsuga menziesii), and western redcedar (Thuja plicata) tree species. The soils of the CWH zone are typically Humo-Ferric Podzols, which transition to Ferro-Humic Podzols with increasing precipitation and decreasing temperatures (Pojar et al. 1991). In southcentral BC, the Bunchgrass (BG) zone is the warmest and driest biogeoclimatic zone in BC, with a mean annual temperature of 10.0 °C and 242 mm of mean annual precipitation. The soils of the BG zone are typically Brown Chernozems, which transition to Dark Brown and Black Chernozems with decreasing temperatures (Nicholson et al. 1991). The Boreal White and Black Spruce (BWBS) zone in northeast BC has a continental climate and experiences a mean annual temperature of −1.4 °C; furthermore, this zone is also dry with a mean annual precipitation of 327 mm. The vegetation in the BWBS is comprised of predominantly black spruce (Picea mariana), which transitions to a mixture of white spruce (Picea glauca) and trembling aspen (Populus tremuloides) with increasing elevation. For the well-drained areas of the BWBS zone, the soils are mostly Gray Luvisols, while in the poorly drained lowlands, Organic Cryosols, Luvic Gleysols, and Organic soils are common. Where soils are developed from lacustrine clays over marine shale (e.g., Dawson Creek and Fort St. John), Solonetzic soils may be found (DeLong et al. 1991).
A suite of 35 environmental variables were derived from a combination of land cover classification, climate, digital elevation, and ecosystem classification data sets — all of which were resampled to a 100 m spatial resolution (Table 1). For mapping soil parent material, only topographic indices were used because in Heung et al. (2014) and Bulmer et al. (2016) it was observed that topographic indices were well correlated with the distribution of soil parent material. Given the glacial history of BC, many of the soil parent materials were transported across the landscape, which then resulted in the distinct topographic features that were produced due to the deposition of sediments. For mapping soil classes, vegetation and climatic indices were included in addition to the topographic indices.
List of 35 topographic, climatic, and vegetation predictors at a 100 m spatial resolution.
Topographic indices were mostly derived from BC’s Terrain Resource Information Management (TRIM) digital elevation model (DEM) (B.C. Ministry of Sustainable Resource Management 2002). The DEM was originally developed from a triangulated irregular network, derived from TRIM mass-points and break-lines, and aggregated to a 100 m spatial resolution. The 100 m DEM was acquired from HectaresBC.org (HectaresBC 2012), which provides a repository of freely available digital data layers for BC. To minimize the effects of spatially noncorrelated noise on the calculation of topographic indices, the original DEM was filtered using three consecutive mean filters of 3 × 3, 3 × 3, and 5 × 5 pixels (MacMillan et al. 2007).
All DEM-derived topographic indices were calculated in the System for Automated Geoscientific Analysis (SAGA) (SAGA Development Team 2011). The topographic indices were selected based on their ability to represent local-scale morphometry (e.g., slope, aspect, and curvature), landscape-scale morphometry (e.g., slope position and multiresolution valley bottom flatness index), hydrologic characteristics (e.g., slope-length factor and topographic wetness index), and landscape exposure (e.g., skyview factor and visible sky factor).
In addition, two distance-based metrics were included as predictors: distance-to-nearest-stream and distance-to-nearest-lake. The distance-to-nearest-stream layer was calculated from polyline mapping from HectaresBC.org, while the distance-to-nearest-lake layer was calculated from the Freshwater Atlas for BC data set (Integrated Land Management Bureau 2010). The distance-to-nearest-stream was used because it was previously found to be an important predictor when mapping soils developed from fluvial sediments (Heung et al. 2014). The distance-to-nearest-lake was included for this study due to the potential importance of the predictor for mapping soils developed from lacustrine sediments as well as the mapping of hydromorphic soils.
To represent the vegetation patterns of the province, the GeoBase Land Cover Circa 2000 (LCC2000) product was included. The LCC2000 data layer was derived from the classification of Landsat 5 and Landsat 7 orthoimagery, produced at a 30 m spatial resolution (Olthof et al. 2009), and aggregated to a 100 m spatial resolution using a nearest neighbour approach. The use of interpreted remote sensing products as a covariate in DSM has been done in multiple studies, such as Collard et al. (2014), which used the Corine Land Cover 2006 data sets to map the soils of northern France; Cheney et al. (2016), which used the 2006 National Land Cover Database to map soil properties of continental United States; and Hengl et al. (2017), which used the GlobCover30 data set to produce a global soil map product.
In addition, digitized BEC zone and subzone maps were included to provide information on vegetation patterns and ecological characteristics. Each of the 14 BEC zones is associated with a distinct climax plant species, where subzones are further delineated based on broad-scale moisture and temperature characteristics (Meidinger and Pojar 1991). The BEC zone and subzone maps were originally in the polygon format but were rasterized to a 100 m spatial resolution using the ArcGIS 10.7.1 software.
Climate data were acquired from Climate BC (Wang et al. 2012) and accessed through HectaresBC.org. Mean annual temperature, mean annual precipitation, degree days <0 °C, climatic moisture deficit, reference evaporation, and frost-free days were included as climatic indices. In addition, the provincial BEC maps describing patterns of vegetation and ecology were closely associated with observed patterns of climate variability in BC; hence, the BEC data were also a climate-related predictor.
Development of training data
Based on previous research, it was concluded that the most appropriate modelling approach would be to develop a training data set using CSM data via the sampling of single-component map units using an area-weighted approach. In Heung et al. (2014), three methods of sampling the training areas were tested for the prediction of soil parent materials for the Lower Fraser Valley: (1) a by-polygon approach, where a set number of sample points were obtained from each polygon; (2) an equal-class approach, where a set number of sample points were obtained from each class; and (3) an area-weighted approach, where the number of sample points per class was proportional to the areal extent of each class. Heung et al. (2016) later sought to examine the effects of class imbalance on classification and in addition to including the three sampling methods from Heung et al. (2014), an area-weighted approach with random oversampling (ROS) was included. Additionally, Bulmer et al. (2016) compared the area-weighted method to a sampling approach that integrated expert knowledge in the sampling procedure.
Among the sampling methods, it was shown that the area-weighted approach performed consistently better than the other sampling approaches except for the area-weighted approach with ROS. Based on Heung et al. (2016), ROS resulted in small improvements in terms of prediction accuracy; however, the procedure was computationally demanding and it was therefore concluded that its implementation was not feasible for large data sets. Furthermore, the choice of using CSMs as training data was deemed more appropriate because in Heung et al. (2017), higher accuracies were achieved when using CSMs as training data in comparison to pedon data for the Okanagan–Kamloops region.
Digitized CSMs, in the polygon format, were obtained from the BC Soil Information Finder Tool (B.C. Ministry of Agriculture and B.C. Ministry of Environment 2018). The digitized data consisted of 113 CSMs, which were originally mapped at scales between 1:20 000 and 1:125 000. In BC, the large-scale CSMs were typically developed for agriculturally intensive regions of the province, including the Lower Fraser Valley and the Okanagan Valley, whereas the small-scale CSMs were typically developed for areas with high relief, forest vegetation, and lower populations.
The digitized CSMs consisted of 1035 named soil series and 10 soil parent material classes. To simplify the CSM legend and provide a sufficient number of validation points for validating the accuracy of each individual class, the soil series were aggregated to the soil great group and soil order levels of the Canadian System of Soil Classification (Soil Classification Working Group et al. 1998). The simplification of the CSM legend resulted in 23 soil great groups and 9 soil orders. To reduce the classification uncertainty of the training data, only single-component mapping units of soil great group and parent material were retained as training areas. Within BC, the coverage by single-component map units occupied 12.3% (116 293 km2) and 11.1% (104 888 km2) for soil great groups and soil parent material of the province’s extent, respectively. It should be noted that when a map unit was described as a “single-component” unit, the CSMs were identifying the dominant soil type or parent materials — other soils may have been present within the polygon but were not identified due to the map scale.
To reduce the size of the training data set and to overcome the computational limitations when training a machine learner, 500 000 points were generated within the single-component polygons using the area-weighted approach described in Heung et al. (2014). Here, the number of points to be generated for each class was weighted by the areal extent of the single-component map units for each class. Table 2 shows the percentage of the 500 000 training points that were generated for each class. The values for each environmental variable were extracted for each training point and the resulting dataframe was then used to train the machine learner. The dominant parent material classes included till and colluvium, which consisted of 47% and 29% of the parent material training data, respectively, while the dominant soil great groups included Humo-Ferric Podzols and Gray Luvisols, which consisted of 33% and 30% of the soil great group training data, respectively (Table 2).
Soil orders, great groups, and parent material classes for BC with corresponding percentages of training and validation data.
Random forest classifier
Based on previous model comparison studies in BC (Heung et al. 2016, 2017), the RF algorithm was used for this study. RF is based on an ensemble of classification trees that are grown from a randomized bootstrap sample of the training data set (Breiman 2001). The bagging procedure in the RF algorithm is designed to mitigate the impact of high model variance, which is common for tree-based learners (Breiman 2001). The resulting prediction made by the RF ensemble is based on a majority-vote combination function using the individual trees. Furthermore, the algorithm can evaluate variable importance by measuring the mean decrease in Gini (MDG) impurity when an individual variable is removed from the prediction process. If the removal of a particular predictor variable leads to a large decrease in the Gini index, it is assigned a greater importance than predictors whose removal results in a smaller decrease in Gini.
For the purposes of this study, the implementation of the RF model and its parameterization were performed using a combination of the caret (Kuhn 2008) and randomForest (Liaw and Wiener 2002) packages within the R statistical software (R Development Core Team 2012). The parameterization of the RF models was performed through 20 replicates of fivefold cross-validation as described in Heung et al. (2014).
Assessment of predictions
The accuracy of the predicted soil parent material and soil great group maps was assessed using soil pedon data acquired from the BCSIS database (Sondheim and Suttie 1983). The BCSIS data set is a repository of georeferenced soil observations acquired by multiple provincial and federal agencies for various projects (e.g., terrestrial ecosystem mapping, and habitat, soil, and bioresource monitoring) and compiled by the BC Ministry of Environment. Because most of the data points in the BCSIS were contributed prior to the widespread availability of global positioning systems, the positional accuracy of individual points was variable; furthermore, the positional uncertainty for each sample location was unknown. Due to the potential spatial inaccuracy associated with the use of legacy soil observations, the BCSIS data were cleaned to ensure that only observations with the highest spatial accuracy were used for validation purposes. The data cleaning procedure consisted of four steps: (1) the removal of duplicated soil observations; (2) the removal of observations that were not located within the terrestrial landmass of BC; (3) the removal of observations, where its spatial position was located outside of the project boundary for which it was collected; and (4) the removal of observations, where its field-measured elevation was not within 30 m of the elevation extracted from the provincial DEM.
To further account for spatial inaccuracies of the BCSIS data, as well as to account for situations where a validation point was located along a pixel boundary, predictions were considered to be valid if the validation point matched the prediction within a radius of 1 pixel (i.e., 100 m radius). The BCSIS validation data set used for this study consisted of 14 316 points for soil great groups and orders, and 14 570 points for soil parent materials. To assess soil classes at the “order” level of taxonomy, predictions made at the great group level were aggregated. The by-class distribution of the validation data points for soil taxonomic classes and soil parent material classes is shown in Table 2. To compare the accuracy of the CSMs to the DSMs, the validation points were used to assess the single-component map units, where the validation points that coincided with those map units were then used to assess the accuracy of the DSM. The number of validation points that coincided with the single-component map units were 9336 points for soil great groups and orders, and 9856 points for soil parent materials.
The overall agreement between the validation points and the predictions, C, is calculated as
where p is the proportion of correctly classified pixels for the jth class. The overall agreement was used as the main accuracy metric for this study. The overall agreement values were further decomposed into additional accuracy metrics, which included the quantity disagreement (Q) and allocation disagreement (A), using the following relationship:
represents the amount of disagreement in the proportion of each soil class between the prediction and validation data sets. In eq. 2, pi+ and p+i represent the row and column totals of the error matrix, respectively, expressed as proportions of the population, for the ith class for j number of soil classes. Values of Q range from 0 to 1, where values near 0 represent high agreement in the proportions of coverage for each class while values near 1 represent high disagreement. In eq. 2, A, calculated as
represents the amount of disagreement in the spatial allocation of classes between the prediction and validation data sets. Values of A range from 0 to 1, where values near 0 represent high agreement in the spatial allocation for each class while values near 1 represent high disagreement.
Lastly, Cohen’s kappa coefficient, κ, was also calculated to account for the by-chance agreement between the observed and predicted classes. The coefficient is formulated as follows:
where P(A) represents the overall agreement fraction and P(E) represents the expected agreement fraction between the observed and predicted soil classes. Given that the by-chance agreement is accounted for in κ, the coefficient is expected to be consistently lower than C. Values of κ range from 0 to 1, where values near 0 represent low agreement while values near 1 represent high agreement between the observed and predicted classes.
One advantage of using the RF algorithm, and ensemble-modeling techniques in general, is their ability to produce maps of model uncertainty based on the constituent models of the ensemble. For each pixel, the RF algorithm can estimate the probability of occurrence (i.e., proportion of votes) for each class based on the ensemble of decision trees. In Heung et al. (2017), the soil class probability surfaces facilitated a visual assessment of the model results and also identified regions of the landscape where the transition of one soil type to another might occur.
Using the set of class-probability surfaces, ignorance uncertainty surfaces may be produced, which measure how evenly distributed the class-probability surfaces are across all classes. As the class probabilities become more evenly dispersed, the ignorance uncertainty increases; conversely, as the constituent models of the ensemble converge their predictions toward a particular class, the ignorance uncertainty decreases (Leung et al. 1993; Zhu 1997). The ignorance uncertainty is estimated using an entropy measure, H, and was calculated as follows (Zhu 1997):
where Pk is the proportion of instances where pixel x is classified as soil class k and n is the number of classes. Values of H range from 0 to 1, where values near 0 represent low ignorance uncertainty while values near 1 represent high uncertainty in classification. The use of the entropy measure for providing an overall estimation of uncertainty has previously been used for fuzzy inference classification approaches (e.g., Leung et al. 1993; Goodchild et al. 1994; Zhu 1997) and later extended toward ensemble-modelling approaches (e.g., Kempen et al. 2009; Heung et al. 2017; Blackford et al. 2021).
Results and discussion
Assessment of predictions
Provincial-scale digital soil maps of soil parent materials and soil great groups were produced at a 100 m spatial resolution and are shown in Figs. 2 and 3, respectively. Using the BCSIS validation data set, the accuracies for the provincial soil maps were 55% (κ = 0.37) and 62% (κ = 0.41) for the soil great group and soil order classes, respectively, and 69% (κ = 0.59) for the soil parent material class (Table 3). Due to the highly imbalanced class distribution within the training and validation data, the difference between the overall accuracies and the kappa coefficients were quite substantial given the ability of the kappa coefficient to capture class imbalances more effectively.
Accuracy assessment of overall predictions, single-component map units, and predictions coinciding with single-component map units, using overall agreement (C), quantity disagreement (Q), and allocation disagreement (A).
When assessing the accuracy of the single-component map units from the CSM and comparing the accuracy to the predicted pixels that coincided with those map units, it was observed that the DSM approach was able to produce a refined soil map with higher accuracy rates. The DSM approach resulted in increases in accuracy rates by 2% and 6% for soil orders and great groups, respectively, and an increase by 10% for soil parent material classes (Table 3). These results show how the use of CSMs as training data, under a DSM framework, may also serve to refine the original CSM. Previous studies that have shown this effect have included Collard et al. (2014), where a 1:250 000 CSM was improved using various machine-learning approaches; Yang et al. (2011), where a 1:20 000 CSM was improved using a fuzzy inference system and expert knowledge approach; and Heung et al. (2017), where multiple CSMs were improved using a variety of machine-learning approaches. Given the limited number of examples that demonstrate the capacity of DSM approaches in the refinement of CSMs, future DSM research should consider additional comparative analyses.
Soil parent materials
The allocation disagreement accounted for a larger proportion of the overall disagreement in comparison to the quantity disagreement (Table 3). The high accuracy observed for the soil parent material map was most likely due to the abundance and clustering of validation points that were identified as marine sediments. In terms of the training data, the marine class was a minority class and represented 1% of the training data set; yet, it accounted for 41% of the validation data set and an accuracy rate of 79% (Table 2). The clustering occurred primarily along southern Vancouver Island, where extensive soil data were acquired and where marine sediments are common. Glacial till, the majority class, provincially, was well predicted with an accuracy rate of 76%; however, an inspection of the parent material confusion matrix (Table 4) also indicates that the till class was overpredicted at the cost of other classes such as lacustrine, organic, and marine.
Confusion matrix between 14 570 observation points and predicted soil parent material using a random forest model.
The lacustrine and eolian classes were poorly predicted, at least partly, due to a combination of limited training and validation points (Table 2). Within BC, the likelihood of observing a pure eolian parent material is very low because it typically occurs as a 0.1–1.0 m thick veneer over other parent material classes such as glacio-fluvial, glacio-lacustrine, glacio-marine, and till materials. As a result, the eolian materials often have little impact on the overall soil profile characteristics compared to the underlying materials. When compared to previous regional-scale mapping efforts in the Lower Fraser Valley region of the province, the accuracy rate of the glacio-marine class (20%) was considerably lower for this study in comparison to the 80% accuracy that was reported in Heung et al. (2014). Overall, the distribution of glacio-marine sediments is closely associated with marine sediments, where glacio-marine sediments are located at higher elevations due to isostatic rebound (Luttmerding 1981); however, given the size and geographic extent of the provincial-scale training data set, the subtle differences in elevation between the two classes may have been masked. Within the CSMs, the occurrence of glacio-marine materials was restricted to the Lower Fraser Valley, while the CSMs for the adjacent region of southern Vancouver Island identified a large distribution of marine materials, which likely would have contributed to the masking of glacio-marine presence in the training data. However, it should also be noted that the detailed soil surveys for Vancouver Island did not differentiate between glacio-marine and marine parent materials, potentially contributing to additional model uncertainty.
Soil taxonomic class
The soil great group and order maps had higher quantity disagreement rates at 27% and 22%, respectively, in comparison to the allocation disagreement (Table 3). The reduction in the quantity disagreement by 5% between the taxonomic levels was likely due to the simplification of the legend from soil great groups (C = 55%) to orders, where the misclassification of validation points between soil great groups within the same order was eliminated. The accuracy of the soil great group map was largely influenced by the effective prediction of Dystric Brunisols, which accounted for 33% of the validation data set (Table 2) and had an accuracy rate of 90% (Table 5). Despite the Dystric Brunisols not being the clear majority class in comparison to Humo-Ferric Podzols that accounted for 33% of the training data, an inspection of the confusion matrix indicates that the Humo-Ferric Podzols were preferentially misclassified as Dystric Brunisols (Table 5). The misclassification between these two great groups was not unexpected because Dystric Brunisols are closely associated with and may even develop into Humo-Ferric Podzols under mild and humid environments — especially in southern BC (Smith et al. 2011).
Confusion matrix between 14 316 observation points and predicted soil great group using a random forest model.
Overall, the soil types that developed through hydromorphic processes (i.e., Organic and Gleysol soils) were poorly predicted (Tables 5 and 6). The Luvic Gleysol, Humic Gleysol, and Gleysol great groups had accuracy rates of 16%, 13%, and 3%, respectively, and Fibrisol, Humisol, and Mesisol great groups had accuracy rates of 28%, 21%, and 8%, respectively. The Gleysol soils were most effectively mapped where the landscape was influenced by large-scale fluvial features — this is particularly true for the Lower Fraser Valley region of BC, where the soils were developed from the deltaic deposits that are found at low elevations due to the Fraser River (Fig. 3 inset C). However, the overall low accuracy rates of these hydromorphic soils may have been due to several factors. Firstly, the CSMs often included these soils as part of multicomponent map units because of their localized and often limited occurrence in the landscape compared to other soils. Considering that the training data set we used only consisted of single-component map units, the representation of these soils was decreased. Furthermore, the map scale of the CSMs would have influenced the ability of the surveyor to delineate single-component instances of the hydromorphic soils at a 100 m spatial resolution. Lastly, an additional factor would also have been related to the 100 m spatial resolution, where the topographic variables derived from the 100 m DEM may not have highlighted the landscape features and hydromorphic processes with which these soils were associated. Each of these factors could have also contributed to the poor accuracy rates of Regosol soils due to their formation on highly disturbed sites (i.e., along floodplains and the bottom of steep slopes). Despite the large number of validation points (n = 14 316), the Folisol, Organic Cryosol, and Solod great groups were not represented within the validation data set. Solod soils are rarely found in BC, while Folisol soils are typically found in association with Podzol soils. Organic Cryosols have a large distribution, regionally, in the Fort Nelson area of the province (Fig. 3 inset A); however, the validation data for that region were sparse (Fig. 1).
Confusion matrix between 14 316 observation points and predicted soil order using a random forest model.
Variable importance plots were derived from the RF model outputs and were based on the MDG index, where variables with a high MDG value represented variables with greater importance (Fig. 4).
Soil parent material
Elevation, slope height, distance-to-nearest-stream, and topographic ruggedness index were identified as the four most important variables by the RF algorithm (Fig. 4). Elevation was particularly effective in distinguishing the parent materials that occur on mountainous and high-relief environments, where there was a clear separation of colluvial materials at the highest elevations and slope heights, which transitioned to till as the elevation decreased (Fig. 5). Elevation also plays a role in distinguishing fluvial and glacio-fluvial materials. Glacio-fluvial materials were typically mapped at lower elevations; however, the key distinguishing variables were the slope height and the distance-to-nearest-stream.
Soil taxonomic class
The variable importance plot for the prediction of soil great groups highlighted the strong influence of bioclimatic variables (Fig. 4). Mean annual precipitation was identified as the most important variable, followed by seven other bioclimatic variables; in contrast, the topographic variables that represented the local-scale variability of the environment were less important. Although we recognized elevation as a topographic covariate, it is inherently linked to climate due to the relationship between elevation and temperature via the environmental lapse rate. We believe that these variables were most effective in capturing the large-scale soil patterns of the province and the mapping of Brunisolic, Chernozemic, Luvisolic, and Podzolic soil orders, which had accuracy rates that ranged from 64% to 88%, and were markedly higher than the other soil orders, which had accuracy rates of <19%.
By inspecting the boxplots of the highest ranking climatic variables, based on the training data set, their relationship with the distribution of soil types was evident (Fig. 6). For example, Chernozemic and Podzolic soils, both recognized as “zonal soils”, occupied distinct regions of covariate space, where Chernozemic soils were found in regions of the province with the lowest mean annual precipitation and the highest climatic moisture deficit, whereas the converse was true for Podzolic soils. The influence of climate on the distribution of the soil great groups was also captured in the training data, where increases in moisture (low values of climatic moisture deficit) and decreases in temperature (high values of degree days <0 °C) facilitated the transition of Brown Chernozems, to Dark Brown Chernozems, and to Black Chernozems as a result of the influence of climate on decomposition rates (Pennock et al. 2011). Similarly, the distinction between the great groups within the Podzolic order also appeared to be controlled by substantial differences in mean annual precipitation, where Ferro-Humic Podzols are located in wetter regions in comparison to Humo-Ferric Podzols.
Although the Brunisolic and Luvisolic soils, both considered to be “intrazonal” soil orders, were predicted at respectable accuracy rates, their separation in covariate space, in comparison to the Podzolic and Chernozemic soils, was less clear. In particular, the distinction between Eutric and Dystric Brunisols, the two most common great groups from the Brunisolic order, was very subtle in that they were similar in terms of temperature range and topographic indices; however, their distributions were more obviously influenced by precipitation, where Dystric Brunisols (commonly associated with Podzolic soils) were located in moister regions of the province and at higher elevations.
In addition to the climatic variables, the vegetation indices were also determined to be important based on the RF algorithm (Fig. 4 and Table 7). Similar to the climatic variables, the “zonal soils” were strongly influenced by the vegetation patterns, where, for example, the transitions between Brown Chernozems to Dark Brown and Black Chernozems mirrored the transitions between BG to Ponderosa Pine and Interior Douglas-fir zones. In comparison, the Humo-Ferric Podzols were more closely associated with the Engelmann Spruce and Sub-Alpine zones, while the Ferro-Humic Podzols were predominantly restricted to the CWH zone — the wettest zone in the province. In comparison, the Canadian Land Cover Classification Circa 2000 data set was far less important than the BEC zone or subzone for predicting soil class. The information provided by an automated classification of individual pixels for land cover was influenced by short-range and temporal effects, such as natural disturbance, forest harvest, and agricultural practices, which are undoubtedly important factors affecting soil; however, the information was not specifically incorporated into the soil survey process and hence, not captured in the training data set.
Relative occurrence of training data points found under each biogeoclimatic ecosystem classification (BEC) zone and separated by soil great group (see Table 2 for great group codes).
Class-probability surfaces for soil parent material (Fig. 7) and soil great groups (Fig. 8) were generated based on the 500 constituent decision trees of the RF algorithm. Spatial representations of ignorance uncertainty values are presented in Fig. 9; furthermore, Fig. 10 shows the ignorance uncertainty with respect to the validation points.
Soil parent material
Using the parent material probability surfaces, we were able to visualize the transitions in parent material, where fluvial materials were dominant in valleys with moderate to low slopes and moving toward higher slope positions, the dominant parent material transitioned to glacio-fluvial, to till, and to colluvium (Fig. 7). In cases where slopes of the valley bottom rapidly increased, the fluvial materials transitioned directly into colluvial materials. Overall, the combination of colluvium and till probability rasters was able to capture the major physiographic features of the province, where colluvial materials were most extensively mapped along the Coast, Columbia, Northern, and Rocky Mountains, and till was most extensively mapped along the Interior and Central Plateaus and The Great Plains regions of the province (Valentine et al. 1978).
The overall patterns of the prediction uncertainties, represented using the ignorance uncertainty measure, are shown in Fig. 9A. The ignorance uncertainty measure (H) had a mean of 0.33 with a standard deviation of 0.19. The spatial patterns of the ignorance uncertainty surface indicated that low-relief terrains resulted in the highest uncertainty (i.e., H > 0.70). Low-lying areas of coastal BC were particularly challenging to predict due to a combination of complex geomorphic processes, which resulted in the deposition of fluvial, glacio-fluvial, and till materials — all of which may occupy similar regions of variables space (Fig. 5); furthermore, the presence of marine and glacio-marine materials also complicated the prediction. Regions of the province that consist of wide valleys and terraced features were also challenging to predict due to the presence of fluvial, glacio-fluvial, till, and glacio-lacustrine materials. An example of such a landscape includes the Rocky Mountain Trench, where a similar combination of parent materials has been identified in Clague (1975). Lastly, the combination of organic and fluvial parent materials, east of the Fort Nelson region, resulted in high values of ignorance uncertainty.
Soil taxonomic class
Large-scale transitions in soil great groups were clearly visible using the probability surfaces. The influence of orographic precipitation was most clear for the mountain slopes of Coastal BC, where the presence of Ferro-Humic Podzols was most dominant along the windward side of the Coast Mountains, and proceeding eastward, there was an abrupt transition to Humo-Ferric Podzols (Fig. 8). In contrast, as the drier air moves across the Interior Plateau and then up and over the Columbia Mountains, the resulting orographic influence produces less precipitation than along the Coast; hence, the development of Ferro-Humic Podzols is limited, while Humo-Ferric Podzols become more common.
For large portions of BC’s Interior Plateau, where there is less precipitation, Gray Luvisolic soils dominate the predictions. It was of interest that the areas with high probabilities of being predicted as Gray Luvisolic soils also coincided with the prediction of till as parent material. The main process involved with the formation of Luvisolic soils, lessivage, depends on the initial presence of clay particles within the glacial sediments. Although we did not use geology as an independent input for our predictions, it is well known in BC that the bedrock of the Coast Mountains has a very high component of acidic intrusive rocks that are expected to weather to sand during pedogenesis. In contrast, the Interior Plateau is made up of a more diverse assemblage of rock types, including volcanic, sedimentary, and metamorphic rocks that produce medium- and fine-textured surficial materials upon weathering. Given that the parent material predictions were derived only using topographic indices, we suggest that the large-scale topographic characteristics of the province (i.e., the transition from Coast Mountains to Interior Plateau) coincide not only with reduced precipitation, but also with the presence of more clay in the soil parent materials. Therefore, two of the most important characteristics of the pedogenic environment distinguishing Luvisols from Podzols occur to the east of this transition, but not to the west.
In summary, high precipitation levels in the western parts of the Coast Mountains favour the development of Podzols over Luvisols, while medium–fine textured soils on the Plateau to the east of the Coast Mountains favour the development of Luvisols over Brunisols. The Regosolic great group was the third-most extensively predicted soil class for the province; however, their distributions were mainly predicted based on topography and tied to specific geographic features. In southern BC, the highest probability values were observed at high elevations with steep slopes along the Coast and Columbia Mountains, where erosion processes inhibit pedogenic processes and also within valley bottoms, where fluvial materials are deposited. However, at high latitudes, Regosolic soils were commonly associated with the presence of cirque — as a result of ongoing alpine glaciation.
The ignorance uncertainty, H, for the soil great group had a mean of 0.36 with a standard deviation of 0.17 (Fig. 10b). Unlike the parent material predictions, where high uncertainty values were found in closer association with landscape features throughout the province, the areas of high uncertainty values for the soil great group map were primarily concentrated in regions of the province where CSMs were not available — especially the northwestern region of the province. A second observation was that the northern portions of BC also experienced low mean annual temperatures and areas that had moderate to high uncertainty values (i.e., H > 0.5) coincided with mean annual temperatures of <−3 °C. As a result, it is likely that the higher uncertainty values for the region were due to the lack of training data that represented the coldest part of the province. In addition, pixels where H > 0.6 also appeared to be related to topography and slope position for the (sub-)alpine landscapes, where the greatest uncertainty values were observed along lower backslopes. Although these regions were typically classified as Humo-Ferric Podzols, the probability values for Gray Luvisols, Humo-Ferric Podzols, Regosols, and Sombric Brunisols were similar and often found in association with each other (Valentine et al. 1978; Trowbridge 1994).
Incorrect predictions for soil taxonomic class were often associated with low ignorance uncertainty values (Fig. 10b), in contrast to soil parent materials. The lowest ignorance uncertainty values were associated with correct prediction for Dystric Brunisols, Ferro-Humic Podzols, and Gray Luvisols. Ferro-Humic Podzols had a distinct feature space for mean annual precipitation, but this was not evident for Dystric Brunisols or Gray Luvisols. Dystric Brunisols were well represented in the training data set, accounting for 31% of all points. The highest ignorance uncertainty was for the correct predictions of Dark Brown Chernozems.
For the parent material validation points with correct predictions, ignorance uncertainty values were lower, except for eolian and fluvial materials (Fig. 10a). The lowest ignorance uncertainty values were obtained for correct predictions of colluvium, lacustrine, and glacio-marine. Two of these parent material classes (colluvium and glacio-marine) had a distinct feature space for at least one predictor in Fig. 5, and they were all relatively sparse in the training data set, with a combined representation of less than 5%.
Even though there was a lack of training and validation data across large regions of the province, and especially Northwestern BC, the resulting maps were consistent with our pedological understanding of the environment. In terms of the parent material map, one of the surprising outcomes of the map was in the prediction of glacio-marine for the northeastern part of Haida Gwaii, where training data for that parent material class were derived exclusively from the Lower Fraser Valley in southwestern BC. Although there were no validation points located on Haida Gwaii, the extrapolation of that parent material class was consistent with a previous geological reconnaissance study for the same region (Clague et al. 1982). In terms of the prediction of soil great groups for the (sub-)alpine regions of Northwestern BC, the predicted soils were also consistent with their spatial patterns as described in the literature (Valentine et al. 1978).
This study used the BCSIS data set to validate the model predictions and assumed that the soil parent material and taxonomic units were correctly identified in the data set. However, such assumptions were also a limiting factor and a source of uncertainty in this study because there were many instances in the BCSIS data set where there were no corresponding analytical measurements to support the decisions made by the surveyors. For example, the distinction between Dystric Brunisols and Eutric Brunisols is based on a pH threshold of 5.5 using 0.01 mol/L CaCl2 for the B horizon, whereby the pH may or may not have been measured. A similar uncertainty may be introduced in identifying soil parent materials as well due to inconsistencies in how till and colluvium or fluvial and glacio-fluvial materials are differentiated by different surveyors. Furthermore, it should be recognized that there were also inconsistencies between the detailed soil surveys whereby the glacio-marine materials were not differentiated from the marine materials in the southern Vancouver Island surveys. Lastly, due to the high relief of BC, we also recognize that validation points were most likely to be located in agriculturally intensive regions and where access was good. Hence, this could possibly explain the underrepresentation of colluvium observations where accessibility may be limited and the potential overrepresentation of fluvial observations is found in agricultural regions (i.e., Lower Fraser Valley).
To improve upon this study, alternative mapping approaches that are focused on the prediction of “azonal” soils (e.g., Organics, Gleysols, and Regosols) could be tested. For example, Bulmer et al. (2016) used data from a wetland inventory of BC to simply assign, or “burn in” organic soils to known wetland locations. The amount of information contained in our training data was very limited for Gleysols and Regosols because either their occurrence was so localized that they were simply not mapped by surveyors, given the mapping scale, or they were recognized as part of a multicomponent map unit. Another issue related to the spatial resolution of the topographic indices was that the effective mapping of the azonal soils may require a finer resolution DEM, which would be more effective in capturing the small-scale topographic variability and be more reflective of the localized, pedogenic processes for these soils. A potential solution may be to use a suite of topographic indices derived at multiple spatial resolutions (Smith et al. 2006; Behrens et al. 2010) or the application of wavelet transformation to spatially decompose the topographic indices (Taghizadeh-Mehrjardi et al. 2021); however, such an approach would still require a DEM with finer resolution than the one that was available for this study.
This study aimed to extend the existing DSM approaches that were previously tested in BC for the development of provincial-scale maps of soil taxonomic classes and soil parent material classes using the RF classification algorithm. The key findings are summarized as follows:
By training a machine learner using single-component map units from a CSM, the resulting DSM had improved accuracy rates. The DSM approach increased accuracy rates by 10%, 6%, and 2% when mapping soil parent material classes, soil great group, and soil order, respectively. These results were consistent with similar studies such as Collard et al. (2014) and Yang et al. (2011) and reinforce the findings of Heung et al. (2017).
When mapping soil taxonomic classes, it was observed that zonal (Podzols and Chernozems) and intrazonal soils (Brunisols and Luvisols) were far more effectively mapped in comparison to azonal soils (Gleysols, Regosols, and Organics). We believe that the ineffective mapping of azonal soils was, in part, due to a combination of the spatial resolution of the environmental variables and the mapping scale of the CSMs from which the training data were derived.
Through the examination of variable importance plots, it was observed that the predicted distribution of soil taxonomic classes was heavily influenced by climatic variables and, to a lesser degree, land cover data. Based on the covariate boxplots, it was observed that mean annual precipitation was particularly important in distinguishing the Podzolic, Chernozemic, and Luvisolic soil orders.
Overall, this study, in conjunction with Heung et al. (2014, 2016, 2017) and Bulmer et al. (2016), demonstrates the long-term value of CSMs in BC and their potential to be refined using DSM and machine-learning techniques. The overall framework of training models using legacy maps may have the potential to be adopted for the refinement of legacy ecological or geological maps that share a similar data structure to CSMs.
The authors would like to thank the hard work and dedication of soil surveyors in British Columbia that left a legacy of maps that formed the basis of this study.
B.H., conceptualization, data curation, formal analysis, methodology, validation, visualization, and writing (original draft); C.B., formal analysis, methodology, and writing (original draft, review, and editing); M.S. and J.Z., writing (original draft, review, and editing).