Covariates used in spatial prediction of soil properties.
Brazilian semi-arid region is a recent frontier in the country for agribusiness. The objective of this study is to apply pedometric tools to zoning areas with distinct potential and limitations to agricultural purposes. The research was set in three main steps: (i) to compile a database with all complete profile data collection; (ii) to analyze the vertical variability of soil properties and select a set of soil key properties useful to define the land potential and limitations; and (iii) to classify the area according to potential for agriculture, considering a medium technological level of the farmers. The quantitative methods applied are supported by geographic information systems (GIS) and spatial statistics. The soil data compilation was based on legacy data, with corresponding topographical data and information from remote sensing images of the area. Tree-based and geostatistical algorithms were applied to predict the spatial variability of the soil key properties. The definition of management zones was based on Iso Cluster and Maximum Likelihood Classification tools. The results pointed three different management zones according to risks of salinization and requirements for irrigation control. The approach showed to be a simple and useful way to select and recommend primarily potential areas for agriculture based on soil properties.
- AQP package
- soil-depth functions
- soil key properties
- zoning management areas
Soils are essential to human survival, and they provide a wide diversity of ecosystem services. Among them, soils are at the base of food production, carbon storage and cycles to atmosphere, water availability, turning biomass into valuable nutrients and degrading toxic elements, and supporting biodiversity . Soil surveys are a relevant source of soil information and environmental data for many uses in agriculture and, otherwise, are fundamental for evaluating land capability and preventing degradation . Ramalho Filho and Pereira  highlight the importance of assessing regional conditions and soil properties to recommend adequate land-use management.
However, in Brazil, even though this is well-known, the soil class information is not commonly taken into account in planning or land-use zoning. One of the problems is the small scale of available maps from systematic soil surveys (1:250,000–1:1,000,000) in the country; this small scale is not suitable for recommendations regarding soil management at the farm level . This limits the implementation of land-use zoning and soil sustainable management practices and policies.
The northeast region of Brazil presents complex and heterogeneous natural features in relation to rainfall, soil, and vegetation. It represents a huge challenge for the use and management of soil and water in sustainable agricultural systems. The soil degradation observed in some areas of this region is due to irregular rainfall, soil fertility conditions, and population pressures in a typically fragile environment .
Despite of the environmental variability, the northeast region of Brazil has a unique biome with a particular vegetation type (named “caatinga”). The majority of soils present shallow superficial horizons and profiles and very low organic carbon contents [6, 7, 8]. The rainfall deficit together with an inappropriate soil management for agricultural use can lead to desertification processes and to increase soil salinity in many areas, further degrading the land. Also common is the presence of rocks on the soil mass and/or surface layers and soils with high clay activity, both implying strong limitations for tillage. Thus, knowing the distribution of soil classes and properties is a requirement for land-use zoning.
The manipulation and analysis of a large soil profile collection, in the existing datasets, can be difficult due to the wide variability of horizon depths and thickness, as well as the variability among and between soil classes and orders. Addressing this issue, studies about the variability of soil properties along with depth were developed by [9, 10, 11, 12, 13, 14, 15]. Beaudette et al.  introduced the algorithm for quantitative pedology (AQP) package, which gathers tools to produce standard profile sketches highlighting differences in soil properties between horizons. It allows to standardize profile sketches, according to Munsell color chart, and dataset harmonization by applying the slice-wise algorithm, among other useful tools to express variations inside a given soil profile collection.
Quantitative studies have been developed in soil science through numerical modeling relationships between environmental variables and soil, which are applied to a geographic dataset to create a preliminary or predictive map [17, 18]. All these techniques are known as pedometrics , which represents an interface between statics and soil science, supporting predictive analysis related with soil classes and attributes [19, 20]. Based in pedometric concepts, the digital soil mapping (DSM) allows to study the relationships between soil and forming factors represented by spectral bands from remote sensing data, surface numerical models, and thematic maps . Remote sensing data deserves an important place in digital soil mapping, particularly in flat landscapes, where morphometric covariates have lesser influence in soil formation .
Landsat spectral bands, particularly in the short-wave infrared range (SWIR), are commonly used to represent the environmental covariates of parent material and/or soil. Different mineral assemblages will have different spectral reflectances, which may be separable by analyzing bands 1–5 and 7 [22, 23]. The normalized difference vegetation index (NDVI) values range between −1 and +1. High positive values usually indicate the occurrence of dense green vegetation, pointing to an appropriate supply of water and nutrients. Low values express limited photosynthetic activity, and negative ones correspond to sparse lacking ground coverage . The soil enhancement ratios of Landsat spectral band ratios 3:2, 3:7, and 5:7 have been interpreted to accentuate carbonate radicals, ferrous iron, and hydroxyl radicals, respectively, in exposed soil and geologic materials .
The identification of homogeneous zones of management consists in grouping areas whose environmental characteristics, as relief and soil, are similar. The design of homogeneous zones must be done according to the soil management, considering as well the risk of degradation, and it contributes to the sustainable land-use planning. Sánchez and Silva  pointed out that management areas present similar response when applied similar agronomic practices and they are subject to the same risks and limitations of agricultural use.
The cluster analysis deals with segmentation of a set of N objects into clusters (groups) in a fashion that the same type of datasets falls in a cluster that is different from those with dissimilar datasets . The results of cluster analysis reveal internal data structure and improve understanding of data. There are many cluster algorithms that are available for data partitioning into. This analysis was used to delineate management zone methods , and the authors prefer to apply fuzzy cluster algorithms, and they are commonly used to define management zones for precision agriculture. To develop management zones , the cluster analysis was conducted by Management Zone Analyst 1.0.1 (Agricultural Research Services, University of Missouri, Columbia), with a combination of variables (elevation-Electrical Conductivity-EC, ECah-%Na; elevation-ECah-%Na, elevation-pH 1:1-%Na).
The hypothesis of this study is that the use of soil legacy data allied with modern tools that are able to analyze large datasets can assist the definition of soil potential and limitations to agricultural practices and thus assist the land-use planning and zoning. The work was developed from a legacy dataset of soil surveys made by the Companhia de Desenvolvimento do Vale do São Francisco (CODEVASF)  in the Brazilian semiarid region.
The specific goals of this study are (i) to identify soil key properties for definition of different management zones based on area databased and literature; (ii) to harmonize the dataset according predefined properties; (iii) to produce soil key property maps through digital soil mapping; and (iv) to create a map zoning the area according the potential to agricultural uses. This approach was used to establish the main trends from the profile collection by means of soil-depth functions and to select soil key properties adequate for the area (flatland landscape and semiarid climate), in this way, separating the areas according to their agriculture potentials.
2. Materials and methods
2.1 Study area and dataset
The study was conducted in an area located between 9°53′0″ and 9°36′30″ S and 40°34′30″ and 40°23′30″W (with 34,437.82 hectares), in the municipality of Juazeiro in Bahia State, northeast region of Brazil (Figure 1).
The biome in the region studied was the caatinga , an ecoregion characterized by xeromorphic vegetation with broadleaf thorny shrubs and trees that shed their leaves seasonally, cactus plants, and sparse arid-adapted grasses . The climate is characterized as semiarid with dry winter and rainy summer, and the coldest month mean temperature remains above 18°C (BSwh’ in the Köppen classification). The average annual rainfall is 400 mm, with the rainy season extending from November to April (highest precipitation in March); and average annual temperature is around 26°C. The xerothermic indexes are between 200 and 150, comprising of 7 to 8 dry months.
At the study site, a portion of the original vegetation has been removed, and signs of land degradation such as bare soils are common. The landscape is mainly compounded by flat surfaces, with maximum slopes of 8%, and plateaus are common in the region. Geology comprises of limestone rocks from the caatinga formation, gneiss-granite rocks of the Caraiba-Paramirim complex, and recent colluvium/alluvium sediments .
The soils types in the northeast region of Brazil show large variation, according to parent material and relief, from shallow and with high content of basic cations (Ca, Mg, K, and Na) to deep and leached profiles . Dominant soil classes according to the World Reference Base for Soil Resources  are Vertisols, Cambisols, and Planosols, with smaller extension of Regosols, Acrisols, and Luvisols.
2.2 Soil input data and covariates
The soil properties used in this study were sand, clay, and cation exchange capacity (CEC), determined according to . The selection of soil properties was based on two main conditions: (i) presenting variability according to depth and ii) having importance to agricultural management (e.g., clay or sodium (Na) content).
From the original dataset, with 523 profiles, the ones with complete morphological description and analytical data were selected, performed by Companhia de Desenvolvimento do Vale do São Francisco (CODEVASF) in 1989 . So, the input dataset comprised of the topsoil (0–20 cm) and subsoil (20–60 cm) layers for 290 soil profiles from a soil survey (legacy data). The sampling was performed according to the requirements of national soil survey service, correspondent to the detailed soil survey level [37, 38].
As covariates, Landsat 5 TM spectral bands were used in this study (Table 1), from September 1989, available in
|Covariates||Spatial resolution (m)||Spectral range (μm)*||References|
|Band 1||30||0.450–0.515||[23, 39, 40, 41, 42]|
|NDVI (Band 4 - Band 3)/(Band 4 + Band 3)||30||—||[11, 23, 41, 42, 43, 44]|
|Band 3/Band 2||30||—|
|Band 3/Band 7||30||—|
|Band 5/Band 7||30||—|
|GSI (Band 3 - Band 1)/(Band 3 + Band 2 + Band 1)||30||—|||
2.3 Modeling procedures
Quantitative soil properties, such as clay and sand content, can be estimated along the depth of soil profile through slice-wise algorithm  implemented in R software through algorithm for quantitative pedology (AQP) package . The functions calculate an estimative of the main trend of values, represented by median and the variation interval by quartiles (25th and 75th of data distribution). The values shown in a column corresponds to percentage of soil profiles contributing with the function at each depth.
Initially, Pearson’s linear correlation analysis was used to measure the linear association between variables, to determine among the soil properties defined by AQP, which correlated with environmental covariates (Table 1). This analysis was implemented in R , through the function cor.test, according to [42, 44, 47]. In Pearson’s correlation the p-value defines whether or not two variables are statistically correlated, and for this study, it was defined that values lower than 0.02 indicate that the correlation is significant.
The modeling procedure to execute the random forest (RF) prediction was performed in R software, through randomForest (RF). RF is a nonparametric technique developed by Breiman  as an extension of classification and regression tree (CART) systems, to improve the performance of the predictors. To implement the RF models, three parameters are necessary: number of trees in the forest (ntree); minimum amount of data in each terminal node (nodesize); and number of covariates used in each tree (mtry) . The ntree value was set to system default (500), although more stable results can be achieved with a larger number . The nodesize value was set to 5 for each terminal node, as usually selected in regression studies. The mtry value chosen in this study was according to Liaw and Wiener , which proposes an amount corresponding to one third of the total number of predictor variables for regression problems.
Although Na showed a significant correlation at the two depths, according to the Pearson correlation analysis, the preliminary results using the random forest (RF) model were very unsatisfactory. Thus, exceptionally for this property, the RF model has been replaced by ordinary kriging (OK). Semivariograms were used to analyze the spatial structure of the Na, and to generate predictive maps, in both depths. The OK was performed in R software, through krige function . OK model is the most familiar type of kriging and provides an accurate estimate for an area around a measure sample .
The model’s performance was evaluated based on independent validation set, which was not used in the training procedure. Thereby, the 290 soil samples were randomly divided into 2 independent datasets in the R software; one of these was used in the training process (200 soil samples) and another for the validation process (90 soil samples). The analysis of the model’s performance was based on the correlation between the measured values (validation samples) and estimated values, calculated by the coefficient of determination (R2), the root mean square error (RMSE), and mean error (ME), presented as Eqs. (1) and (2):
where “d” is the difference between the observed and estimated values and “n” is the number of samples used in the validation process.
The RMSE is a measure of the overall error of the estimation and commonly is used to estimate the error or uncertainty in places where the error was not measured directly; thereby, the higher the values of RMSE, the greater the differences between the datasets . The ME gives the bias and allows evaluation of overestimation (positive values) or underestimation (negative values); values close to zero are preferable.
2.4 Definition of management zones
Management zones were defined in this study according to potential for agriculture, considering variability of soil key properties along profile depth, importance of soil properties for the land management, and the performance of the models to predict the spatial variation of the properties. Based on the maps for the selected soil key properties, an unsupervised classification was performed by using a series of input raster bands (Na, CEC, clay, and sand) using the Iso Cluster and Maximum Likelihood Classification tools from ArcGIS Desktop 10.3.
The Iso Cluster tool uses a modified iterative optimization clustering procedure, also known as the migrating means technique. The algorithm separates all cells into the user-specified number of distinct unimodal groups in the multidimensional space of the input bands; the iso prefix of the isodata clustering algorithm is an abbreviation for the iterative self-organizing way of performing clustering. In the clustering process, during each iteration, all samples are assigned to existing cluster centers, and new means are recalculated for every class. The optimal number of classes to specify is usually unknown. Therefore, it is advised to enter a conservatively high number, analyze the resulting clusters, and rerun the function with a reduced number of classes .
3. Results and discussion
3.1 Algorithm for quantitative pedology (AQP)
The AQP package allows to gather a set of functions to work and to analyze large soil profile collections. Depth functions of soil key properties used to distinguish the profiles are presented in Figure 2 and Figure 3. The percentage values plotted along the profile shows the relative quantity of profiles used in the soil-depth function to calculate the statistics (median and quartiles) at each depth.
The study area presents shallow soils, sometimes with high base saturation; but the main limitation for agricultural use is the soil texture and high clay activity, which will influence the soil moisture to adequately manage the soils due to their high plasticity.
The pH and the calcium (Ca) content presented a linear trend of median values along depth (Figure 2), even though a smaller number of soil profiles (less than 25%) were used in the estimative for deeper than 150 cm depth. However, different patterns were shown for potassium (K), sodium (Na), electrical conductivity (EC), and cation exchange capacity (CEC). The soil key properties Ca, K, and CEC showed more variability around the median value, thus presenting a better predictive potential to distinguishing soils with different parent materials and high clay activity, respectively.
It is pertinent to select which soil key properties could indicate differences in soil behavior along depth and are relevant for land management. From the analyses of Figure 3, it is possible to conclude that clay and sand contents have an opposite and large variability among the profiles of the collection; thus, they have great potential to aid in the definition of soil management zones. Since both properties are important to irrigation and mechanization practices, they were selected as criteria for agricultural zoning.
3.2 Predictive models
3.2.1 Pearson’s correlation
The Pearson correlation coefficients (Table 2) showed that in general, the environmental covariates were significantly correlated with all soil properties analyzed (p < 0.05). The sand content was significantly correlated with covariates b5, b7, NDVI, b3/b7 ratio, and GSI (Table 2). These results do not agree with , who found no correlation between sand content and the Landsat 5 TM image data. Only the covariate b3/b7 was strongly correlated (r = −0.58 and −0.49, topsoil and subsoil, respectively) with the sand content , whereas the NDVI was moderately correlated (r = 0.39 and 0.36, topsoil and subsoil, respectively) and the other covariates weakly correlated, with r values below 0.27 (positive or negative) (Table 3).
Inverse relationships were observed between clay content and the environmental covariates, but magnitudes were the same as those observed for sand (Table 2), except for the covariate b4 (topsoil) which had no correlation with sand; similar results were reported by [42, 55]. The most relevant covariates were b3/b7 (r = −0.56 and −0.51, topsoil and subsoil, respectively) and NDVI (r = −0.36 and −0.37, topsoil and subsoil, respectively). These results vary among authors in the literature. Significant correlations were obtained between clay, the NDVI index, and b3/b2 and b5/b7 band ratios by , while there was no correlation between clay and the b3/b7 ratio band. On the other hand, Ahmed and Iqbal  found significant correlations between clay and bands 4 and 6 using Landsat 5 TM.
The CEC was significantly correlated with the covariates b4, b5, b7, NDVI, and b3/b7 and b5/b7 ratios (Table 2). Only the covariate b3/b7 was strongly correlated (r = −0.45 and −0.44, topsoil and subsoil, respectively) with the CEC , whereas the other covariates were weakly correlated, with r values below 0.27 (positive or negative) (Table 3). A strong correlation was observed by  between the CEC and covariates measured by ASTER spectral bands (1–8). The Na content was significantly correlated with most of the covariates, except with b7 (topsoil) and b3/b2 and b5/b7 ratios (Table 2). However, none of the covariates had a strong or moderate correlation with Na content (Table 3), which may explain the very low performance of the random forest model in the prediction of this soil property.
The study performed by Demattê et al.  highlighted that the correlation with a particular spectral band is directly related with soil characteristics in specific regions, explaining the differences between study cases.
3.2.2 Predictive models
In the training process, the RF model only maintains the covariates that had moderate or strong correlation with the soil properties (Table 3). The results obtained by the predictive models (RF and OK) using an independent dataset for validation (90 samples) are illustrated in Figure 4. In this study, the results obtained by RF models in the topsoil layer were higher than those obtained by subsoil for sand (R2 = 0.58 and 0.44 and RMSE = 91.95 and 90.32 g kg−1, topsoil and subsoil, respectively) and clay (R2 = 0.53 and 0.45 and RMSE = 75.71 and 73.45 g kg−1, topsoil and subsoil, respectively). The results were higher in subsoil than topsoil for CEC (R2 = 0.44 and 0.59 and RMSE = 8.62 and 7.47 cmolc kg−1, topsoil and subsoil, respectively) and Na content (R2 = 0.15 and 0.26 and RMSE = 0.18 and 0.37 cmolc kg−1, topsoil and subsoil, respectively) (Figure 4).
The results achieved by RF models were superior (topsoil) or similar (subsoil) to  that used terrain properties derived from a digital elevation model, for sand (variance explained = 30%) and clay (variance explained = 43%). In a study in Nigeria , percentages of variance explained (Varex) for RF models were of 48–49% for sand and 53–56% for clay in the top soil layer (0–15 cm). These values are inferior or similar to those obtained in this study for sand, similar for clay in the topsoil and superior in the subsoil. Therefore, the RMSE results for sand (19.26–19.67 g kg−1) and clay (13.11–13.59 g kg−1) were lower than those in the present study for the topsoil and subsoil layers (Figure 4). Lower Varex values were reported by  for sand (33–35%) and clay (31–35%).
The Varex obtained for CEC using the validation samples was 47% (R2 = 0.47) for topsoil layer and 59% (R2 = 0.59) for subsoil layer. The goodness of fit estimated by RMSE was 8.62 and 7.47 cmolc kg−1 (topsoil and subsoil, respectively). Regarding the model’s performance, the results can be considered satisfactory; besides that, few studies in literature have used RF to predict soil CEC, and no one has used only remote sensing data as main covariates. The present results showed worst performance from statistical indexes when compared with those obtained by Lagacherie et al. , which reached 79% for Varex for the layer between 15 and 30 cm and 3.4 cmolc kg−1 for RMSE, using as input data terrain attributes and hyperspectral data in the visible and near infrared (AISA-Dual) with 5 m of spatial resolution. The difference from this study may be related to the coarser spatial resolution of images from Landsat 5 (30 m), in comparison with Lagacherie et al.  who used hyperspectral data (5 m). The influence of spatial resolution in the prediction of soil properties is reported by other studies [62, 63].
However, in this study the performance of metrics for CEC prediction was better when compared to the values (9.49 cmolc kg−1) achieved by . The poor performance was explained by the authors as due to the small-scale variation of parental material and erosion/deposition rates, which were not captured by the spatial resolution of the covariates (100 m). The authors also highlighted importance of input dataset to improve the models’ performance.
The semivariogram obtained for Na content (Figure 4) provides a description of the spatial dependence and indicates processes related with the spatial distribution . For both depths, the best semivariogram was the exponential model. The semivariograms for topsoil and for subsoil (Figure 4) used to estimate the Na content present a R2 of 0.26 and RMSE of 0.37 cmolc kg−1 and R2 of 0.15 and RMSE of 0.18 cmolc kg−1, respectively. The descriptive statistics of soil properties prediction for the predictive models are presented in Table 4.
|Random forest||Ordinary kriging|
|g kg−1||cmolc kg−1||g kg−1||cmolc kg−1|
The RF model produced predicted values for sand, clay, and CEC within the range of the original values (Table 4), with a smaller standard deviation and coefficient of variation, as expected for this model . The same was true for Na content in the OK model. However, the map produced presented a smaller range of values than the original dataset, which, according to , means that this model had low accuracy to describe the spatial variation of Na content, corroborating results found when predicting this property at both depths. The mean values for sand, clay, CEC, and Na content in both depths (Table 4) are closer to the values of the original dataset. The CV for both models was smaller than the CV from the original dataset, except for Na content in the topsoil.
These results could be explained by the moderate correlation of soil properties with the covariates and predominance of short-scale variations that could not be modeled from the set of profiles used. The results were considered satisfactory, except for Na content, and they can be ascribed to the physical interference of soil properties in the incident and reflected energy. However, the quantification of soil properties using an orbital sensor is not an easy task due to complexity of soil dynamics and formation . The spatial distribution of soil properties according to the predictive models is shown in Figures 5 and 6.
3.3 Definition of the management zones
Based on the maps created for each soil key property, the zoning procedure was applied by using Iso Cluster and Maximum Likelihood Classification tools. The descriptive statistics and the map produced are presented on Table 5 and Figure 7, respectively.
|Na (0–20 cm)|
|Na (20–60 cm)|
|CEC (0–20 cm)|
|CEC (20–60 cm)|
|Clay content (0–20 cm)|
|Clay content (20–60 cm)|
|Sand content (0–20 cm)|
|Sand content (20–60 cm)|
The conditions to define the agricultural zones have to be adjusted according to available data and heterogeneity of soil properties in the study area, in this case located in a semiarid region.
Zone 1 presents the greater amount of Na in both layers, and it is associated with greater values of CEC as indicated by the mean values, than Zones 2 and 3. The same condition is verified for clay content, where Zone 1 has the greatest mean value in both layers (0–20 and 20–60 cm). As expected, sand content shows the opposite pattern. Zone 1 occupies 42.6% of the area (1467 ha), and it has greater risk of salinization due to the higher mean values of Na. Even though Zone 1 has the greater mean values of CEC, it is associated to the high Na content and high content of clay, and possibly high clay activity (2:1). These properties limit the soil potential and irrigation management, further increasing salinization risks.
Zone 2 is characterized by medium values of Na in both layers, when compared to Zones 1 and 3. The CEC values follow the same trend of Na, with medium values, and clay and sand contents are intermediary. This zone occupies 41.7% of the area with 1436 ha. Zone 2 presents medium effort and limitation to soil management, considering risks of salinization and requirements for irrigation control.
Zone 3 is characterized by the lowest values of Na and CEC in both layers; it has the soils with lowest clay contents and proportionally greatest sand contents. This zone occupies 15.7% of the area with 539 ha. Zone 3 has the lowest mean values of CEC and Na, which decrease risks of salinization; plus, the lower clay content is less limiting to soil mechanization.
The approach based on the combination of orbital remote sensing data, with random forest (RF) and OK models to predict properties of soils in the Brazilian semiarid region, was considered satisfactory for sand, clay and CEC contents, and unsatisfactory for Na content.
The use of remote sensing data to map soil properties in flat relief areas, where common topographic covariates may not result in adequate maps by digital soil technics, is an advance. However, more research is needed to improve the quality of the covariates used as an input data.
The zoning procedures through multivariate analysis allowed to identify areas with higher salinization risks and where the management of irrigation is more complex. In the same way, the approach allowed to recognize zone of soils with lower sodium content and clay activity, which are preferred to grow annual crops.