
Biogeochemistry

https://doi.org/10.1007/s10533-023-01029-8

Predicting high resolution total phosphorus concentrations for soils of the Upper Mississippi River Basin using machine learning

Christine L. Dolph · Se Jong Cho · Jacques C. Finlay · Amy T. Hansen · Brent Dalzell

Received: 17 November 2022 / Accepted: 8 February 2023


© The Author(s), under exclusive licence to Springer Nature Switzerland AG 2023

Responsible Editor: Justin B Richardson.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s10533-023-01029-8.

C. L. Dolph (*) · J. C. Finlay
Department of Ecology, Evolution and Behavior, University of Minnesota, 140 Gortner Laboratory, 1479 Gortner Ave, St. Paul, MN 55108, USA
e-mail: dolph008@umn.edu

S. J. Cho
The National Socio-Environmental Synthesis Center, University of Maryland, Annapolis, MD 21401, USA

S. J. Cho
US Geological Survey, Water Resources Mission Area, Reston, VA 20192, USA

A. T. Hansen
Civil, Environmental, and Architectural Engineering Department, University of Kansas, 1530 W. 15th St, Lawrence, KS 66045, USA

B. Dalzell
ARS Soil and Water Management Research Unit, U.S. Department of Agriculture, 439 Borlaug Hall, 1991 Upper Buford Circle, St. Paul, MN 55108, USA

Abstract The spatial distribution of soil phosphorus (P) is important to both biogeochemical processes and the management of agricultural landscapes, where it is critical for both crop production and conservation planning. Recent advances in the availability of large environmental datasets together with big data analytical tools like machine learning have created opportunities for evaluating and predicting spatial patterns in complex environmental variables like soil P. Here, we apply a random forest machine learning model to publicly available soil P datasets together with nearly 300 geospatial attributes summarizing aspects of soil type, land cover, land use, topography, nutrient inputs, and climate to predict total soil P at a 100 m grid scale for the Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication. Although potentially complex interacting drivers determine total soil P, the predictive accuracy of our random forest model was relatively high (R² = 0.58 and RMSE = 129.3 for an independent validation dataset). At the regional scale represented by our model, the variables with the greatest comparative importance for predicting soil P included a combination of soil sample depth, land use/land cover, underlying soil physical and geochemical properties, landscape features (such as slope, elevation and proximity to the stream network), nutrient inputs, and climate-related factors. An important product of this research is a fine-scale (100 m) raster data layer of predicted total soil P values for the UMRB for public use. This


dataset can be used to improve conservation planning and modeling efforts to improve water quality in the region.

Keywords Soil phosphorus · Modeling · Random forest · Conservation · Data mining · Water quality

Introduction

Phosphorus (P) is an essential nutrient in terrestrial and aquatic systems, with strong influences on plant productivity, element cycling and many other aspects of ecosystems (Vitousek et al. 2010). In excess, P has strong negative impacts on water quality and aquatic ecosystems arising from diverse effects of eutrophication (Schindler 2006; Clark and Longo 2018). Because soils are the main source of phosphorus for terrestrial and freshwater ecosystems, understanding of ecosystem dynamics and management of phosphorus enrichment due to human activities requires knowledge of distributions of soil P. While a large majority of total soil phosphorus exists in diverse forms that are not immediately available to plants (i.e., primary and secondary minerals, organic P), these pools of soil P determine general P availability and influence losses to aquatic ecosystems (Hou et al. 2016; Vitousek et al. 2010). Thus, knowledge of total soil P distributions can provide fundamental information for research and management in terrestrial and aquatic ecosystems.

In the Cornbelt and Great Plains regions of the United States and Canada, decades of agricultural practices including the conversion of perennial vegetation into row crops, intensive inputs of synthetic fertilizers, and the spreading of manure from concentrated animal feeding operations (CAFOs), have led to an accumulation of legacy phosphorus in soils well beyond their retention capacity (Goyette et al. 2018). Wastewater sources also contribute to P export, although their impact is comparatively small for most industrially managed agricultural watersheds (Boardman et al. 2019). In many areas across the region, legacy P continues to be compounded by ongoing fertilizer and manure inputs, exacerbating the challenges of addressing nonpoint source pollution (Stackpoole et al. 2019; Boardman et al. 2019; Van Meter et al. 2021). These anthropogenic P inputs are overlaid on complex underlying controls on soil P content and transport, including soil parent material, soil texture, and landscape geomorphology (Records et al. 2016).

Field work efforts to measure soil P at high spatial resolution can be time intensive and costly, if not infeasible, across large spatial extents. Although soil P is predicted quite well at small spatial scales by coincident soil properties such as pH, texture, particle size and soil organic matter (e.g., Hosseini et al. 2017), such microscale soil attributes are generally not available at regional scales. Moreover, P heterogeneity at regional scales may also be affected by broader scale drivers such as climate, parent material, and glacial history. Thus, efforts to predict soil P from existing geospatial and climatic datasets have considerable utility within larger biophysical modeling efforts designed to support conservation planning.

Recent advances in the availability of large environmental datasets together with analytical tools like machine learning models have created new opportunities for predicting spatial patterning in complex environmental outcomes (Zhong et al. 2021). Total P at the soil surface represents a critical pool of P that is potentially erodible, reactive, and eventually bioavailable and thus highly relevant to both crop production and water quality. Machine learning represents a potentially useful tool to help predict spatial variability in total soil P, which is driven by complex, interacting drivers. Previously, machine learning has been used together with detailed geospatial data to predict soil P across smaller geographic areas, ranging from a single field to small catchments to subregions with areal extents of up to a few hundred km² (Matos-Moreira et al. 2017; Jeong et al. 2017; Sahabiev et al. 2021; Kaya et al. 2022; Kaya and Başayığıt 2022). Machine learning has also been used to predict soil attributes such as % carbon and total nitrogen content at continental scales (Ramcharan et al. 2018), as well as nutrient concentrations in river networks (e.g., Shen et al. 2020; Sadayappan et al. 2022).

In this study, we develop a random forest machine learning model from large, publicly available datasets of soil P and nearly 300 predictive variables including soil type, land cover, topography, nutrient inputs, hydrology and climate to predict total surface soil P at a 100 m grid scale for the entire Upper Mississippi River Basin (UMRB), USA. The UMRB is one of the most intensively farmed regions in the world and is characterized by widespread water quality degradation arising from P-associated eutrophication (Jacobson et al. 2011). Predicting the spatial variability of total soil P


for this landscape has direct implications for conservation planning for water quality improvement and for understanding the resilience of soils for crop production (Qiao et al. 2022). Our objectives for this research were to (1) highlight the potential of underutilized publicly-funded and available datasets to support conservation planning; (2) predict soil P at a fine (100 m) resolution across the UMRB and provide results in an open access data framework for use in future conservation planning and analysis; (3) identify geospatial attributes important in predicting soil P at regional scales.

Methods

Study area

The focus of this study is the Upper Mississippi River Basin (UMRB), which comprises the uppermost tributaries to the Mississippi River, from the headwaters at Lake Itasca in Minnesota down to the confluence of the Mississippi and Missouri Rivers at St. Louis, Missouri (Fig. 1).

Fig. 1 Map of the study region showing a) locations of soil P samples from the USGS National Geochemical Survey database (purple dots, n = 6,664) and from the National Cooperative Soil Survey (NCSS) Soil Characterization Database (in orange, n = 558) across the Midwestern regional extent used in this study, with the Upper Mississippi River Basin (UMRB) outlined in bold, and b) close up of the UMRB showing land cover classes from the 2019 National Land Cover Dataset (Dewitz, 2021) and locations of major rivers and cities

The UMRB covers nearly half a million square kilometers (492,027 km²) of the upper Midwestern


United States, including large parts of Minnesota, Wisconsin, Iowa, Missouri, Illinois, and smaller areas of Indiana, North Dakota, and South Dakota, and corresponds to the HUC2-scale watershed (HUC ID = 07) in the USGS Watershed Boundary Dataset (2021).

The UMRB is intensively farmed for large scale monocultures of primarily corn and soybeans. CAFOs are also prolific in the basin, particularly across Iowa and southern Minnesota (Harun and Ogneva-Himmelberger 2013). Overall, cultivated crops comprise 49% of total basin land area (Dewitz, 2019); nearly 90% of those crops are corn and soybeans (USDA 2022). The intensity of agricultural land use varies across a gradient from north to south, with the uppermost parts of the basin (in Minnesota and Wisconsin) characterized by relatively higher amounts of forest and wetland cover compared to the southern regions of the basin, where crop land use can exceed 85%. However, even the northern areas of the basin are undergoing rapid land use change from wetland and forest cover towards more agriculture (Green et al. 2018). The precipitation gradient varies from comparatively drier in the north and west to wetter in the south and east. Soil textures in the basin are largely silty loam and loam, but also include areas of poorly drained clays as well as areas of well drained sands. Topography in the UMRB includes widespread flat and gently rolling areas, as well as steeper terrain and large bluffs near many rivers and streams. Elevation across the basin ranges from 85 to 640 m above sea level.

Data processing and analysis overview

Data processing and analysis were done in R Studio v. 2022.07.1 (RStudio Team 2022). All of the input data and methods used to generate the machine learning model for predicting soil P (including data acquisition, data preprocessing, data splitting, model training, evaluation and interpretation) are publicly available at https://github.com/cldolph/SoilP. The workflow used to develop the model is summarized in Fig. 2.

Fig. 2 Schematic showing the workflow for machine learning development. Input data and R scripts are available at https://github.com/cldolph/SoilP

Soil phosphorus data

For model training and testing, we used nearly 7,000 soil P measurements from two spatially extensive datasets: (1) the USGS National Geochemical Survey (NGS) Database (USGS 2004), and (2) the National Cooperative Soil Survey (NCSS) Characterization Dataset (NCSS 2021). Both of these datasets are


publicly available¹ and contain information about soil P, among other soil attributes. We focused on total soil P rather than soil test P in our model development for several reasons. First, the NGS does not contain information about soil test P. Although the NCSS does contain information about various soil test P measures (e.g., Bray 1 P, Olsen P, etc.), soil test P is not measured consistently across the geography of the dataset. Because different soil test P measures extract variable amounts of P, they may not be readily comparable to one another (Wuenscher et al. 2015), making it difficult to compile and interpret multiple soil test P measures into one cohesive dataset. By contrast, total soil P data across the NGS and NCSS was considerably more extensive compared to any one soil test P measure, creating the opportunity for stronger machine learning models. Second, total soil P is a fundamental ecosystem property with broad applications in conservation, biogeochemistry, and agronomy (Correll 1998; Sharpley et al. 2009; Shen et al. 2011). While soil test P may more directly indicate the crop availability and risk of loss of soluble P to downstream water bodies (e.g., Vadas et al. 2005), total P provides a more consistent and integrative measurement of soil P status. Moreover, previous work has shown total P to correlate with soil test P measures (especially when other soil properties are also accounted for; Allen and Mallarino 2006), and thus total P can be related to the potential for both erosive and soluble P losses. Finally, water quality monitoring and management efforts are often pegged to estimates of total P in aquatic systems, rather than estimates of ‘bioavailable’ P (Correll 1998). The complexity of P cycling and transformation in terrestrial and aquatic environments is a good reason to track total P in our initial machine learning efforts. Future applications of this approach may be applied to soil test P, particularly as more data is available from high frequency sampling to account for high variability in these forms, as the bioavailability of different components of total P may change over time.

¹ NGS: https://mrdata.usgs.gov/geochem/; NCSS: https://ncsslabdatamart.sc.egov.usda.gov/

Soil P estimates from these two datasets provided only partial spatial cover of the UMRB (Fig. 1). To expand our potential to predict soil P for different types of soils and landcover conditions present across the UMRB, we expanded the regional range of data used to build our predictive model to include locations that were outside of the UMRB boundary but that shared potentially similar soils, climate and/or land management. This expanded study region included the following 13 U.S. states: Arkansas, Illinois, Indiana, Iowa, Kansas, Minnesota, Missouri, Nebraska, North Dakota, Ohio, South Dakota, Wisconsin (Fig. 1).

National geochemical survey dataset (NGS)

The NGS contains 284 geochemical attributes for soil and stream sediment samples across the United States (n = 77,212). While the primary focus of the dataset is stream sediments, the dataset contains a considerable number of soil samples (n = 19,992). Of these soil samples, 19,574 samples contain information about total soil P. Total soil P estimates (% dry weight) were obtained using inductively coupled plasma spectrometry after acid dissolution (USGS 2004). We converted soil P estimates to mg/kg by multiplying % weight estimates by 10,000.

Samples in the NGS dataset were collected between 1900 and 2008. Since we were most interested in developing modeling approaches capable of predicting relatively current soil P conditions from concurrent spatial attribute data, we subsampled the dataset to those collected between 2000 and 2008, leaving 9,589 samples. We further restricted samples to those found in our geographic study region, leaving a total of 6,664 soil P samples (Fig. 1).

For most samples, the NGS contains information about soil depth. However, depth was represented inconsistently across the dataset, with some samples associated with minimum and maximum numeric depth estimates (in inches), some samples associated with a soil horizon category (i.e., “A”, “B”, etc.), and some samples lacking any kind of information about depth or horizon.

For samples where numeric depth information was available, we designated a predictor variable ‘depth’ as the maximum depth pertaining to that sample. For soil samples without numeric depth measurements, we assumed a maximum depth of 30 cm if samples were listed as occurring in the soil A horizon (corresponding to the surface or plowable layer). We also assumed a maximum depth of 30 cm for samples with no depth or horizon information provided (that is, we


assumed samples without depth information were surface samples). If samples occurred in any other soil horizon (and if no numeric depth information was provided) we designated samples as ‘subsurface’, and assigned them a maximum depth of 76 cm based on information from samples where depth data were present (and corresponding primarily to the soil B horizon).

National cooperative soil survey

The NCSS contains data commonly requested for agronomic and biogeochemical purposes for 404,080 soil samples across the United States analyzed by the Kellogg Soil Survey Laboratory and cooperating universities. Of these, 10,678 samples contained information about total P, with 961 samples occurring in our study region. As for the NGS dataset, we filtered samples to include only those collected since 2000, leaving 558 samples in the remaining NCSS dataset. Total soil P estimates (mg/kg) were obtained using inductively coupled plasma spectrometry after acid dissolution (Soil Survey Staff 2014). All samples included an estimate of the top and bottom depth of the soil horizon sampled in centimeters. We used the estimate of horizon bottom depth as the estimate of depth associated with each sample.

After subsampling the data, the combined NGS and NCSS datasets yielded a total of 7,222 soil P samples for model development and testing across the Midwestern United States (Fig. 1).

Predictive variables

To predict soil P across the UMRB, we sought attributes that had previously been identified as known drivers of P abundance, including soil properties such as parent material, mineral and organic content, grain size and texture, as well as attributes describing landscape position, land use, climate and anthropogenic inputs (Records et al. 2016; Deiss et al. 2018; He et al. 2021). We used geospatial attributes from multiple publicly available datasets to summarize aspects of these potential drivers of soil P. These datasets included: (i) the National Hydrography Dataset Plus v2 (NHDPlusv2)²; (ii) the National Land Cover Database (NLCD) for 2006³; (iii) the National Wetlands Inventory⁴; (iv) the gridded Soil Survey Geographic (gSSURGO) Database (10 m resolution)⁵; and (v) the U.S. EPA StreamCat dataset⁶. Overall, we included 268 attributes derived from these datasets at multiple spatial scales as predictors in soil P models. Each of these datasets is described in more detail below and specific attributes used in soil P models are listed in Supplementary Information (SI) Table S1. Once soil samples were joined to all predictors, 6,683 soil samples had complete attribute information available for use in modeling.

² https://nhdplus.com/NHDPlus/NHDPlusV2_data.php
³ https://www.mrlc.gov/data/nlcd-2006-land-cover-conus
⁴ https://www.fws.gov/wetlands/data/data-download.html
⁵ https://nrcs.app.box.com/v/soils
⁶ https://www.epa.gov/national-aquatic-resource-surveys/streamcat-dataset

National hydrography dataset plus v2 (NHDPlusv2)

We sought to include information about near channel areas (i.e., the 100 m buffer surrounding the stream network) in our predictor dataset, because of the possibility for these environments to accumulate P relative to upland areas. For example, in a review of the effects of conservation practices on water quality, Dodd and Sharpley (2015) found that near channel areas such as riparian buffers, including grass and vegetative filter strips and wetlands, accumulated labile forms of soil P over time. We derived a categorical attribute indicating whether soil samples were collected in the 100 m riparian buffer of various stream categories defined in the NHDPlusv2 layer, including ditches, intermittent streams, and perennial streams. The NHDPlusv2 includes a digital stream network layer for the conterminous United States, divided into reaches (i.e., unique links in the network) as well as catchment boundaries associated with each reach. Catchments (i.e., local drainage areas) include the immediate land area draining into each individual stream reach excluding areas draining to upstream reaches. Buffering of the stream layer was conducted in ArcMap version 10.8.1. We also used the NHDPlusv2 layer to assign soil P samples to NHD catchments, which were subsequently used to link soil samples to predictive variables in the StreamCat database (see below).
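The riparian buffer attribute described for the NHDPlusv2 layer can be illustrated with a minimal sketch. The buffering in the study was carried out in ArcMap, so the Python below is not the authors' code; it only shows one plausible way to derive the same kind of categorical attribute. It assumes that sample and stream coordinates are in a projected, meter-based coordinate system, that each reach is a simple polyline tagged with its stream category, and that a sample falling within 100 m of several reaches takes the category of the nearest one (the source does not state a tie-breaking rule). The function names, the data layout, and the label `non_riparian` are hypothetical.

```python
import math

BUFFER_M = 100.0  # buffer width stated in the text


def point_segment_distance(px, py, ax, ay, bx, by):
    """Shortest planar distance from point P to segment AB (input units)."""
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy
    if seg_len_sq == 0.0:  # degenerate segment: A and B coincide
        return math.hypot(px - ax, py - ay)
    # Projection parameter of P onto AB, clamped so it stays on the segment
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def riparian_class(point, reaches, buffer_m=BUFFER_M):
    """Category of the nearest reach within buffer_m, else 'non_riparian'.

    reaches: list of (category, [(x, y), ...]) polylines (assumed layout).
    """
    px, py = point
    best_cat, best_dist = "non_riparian", float("inf")
    for category, vertices in reaches:
        # Walk consecutive vertex pairs, i.e. the segments of the polyline
        for (ax, ay), (bx, by) in zip(vertices, vertices[1:]):
            d = point_segment_distance(px, py, ax, ay, bx, by)
            if d <= buffer_m and d < best_dist:
                best_cat, best_dist = category, d
    return best_cat


# Toy stream network (hypothetical coordinates, in meters)
reaches = [
    ("perennial", [(0.0, 0.0), (1000.0, 0.0)]),
    ("ditch", [(500.0, 300.0), (500.0, 900.0)]),
]

print(riparian_class((200.0, 60.0), reaches))   # 60 m from the perennial reach
print(riparian_class((540.0, 500.0), reaches))  # 40 m from the ditch
print(riparian_class((200.0, 150.0), reaches))  # farther than 100 m from both
```

A GIS buffer-and-join gives the same membership result for simple geometries; the distance test is used here only because it needs no external libraries.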


National land cover database (NLCD)

Land use is a likely driver of soil P dynamics through multiple potential pathways, including alteration of P inputs and outputs associated with different land cover regimes, and through management practices that may alter biological, chemical and physical processes that affect P composition (Liu et al. 2018). We used the NLCD to summarize aspects of land cover for the UMRB. NLCD layers contain information about 16 land cover classes across the United States at 30 m resolution. We chose NLCD 2006 because most soil P samples in our dataset (85%) were collected between 2000 and 2006, and all were collected before 2011. We used the spatial join function in ArcMap to link land cover classes to soil P sample locations.

National wetlands inventory

Similar to riparian buffers, wetlands can accumulate P by intercepting P-rich runoff from surrounding drainage areas (Dodds and Sharpley 2015); thus we sought to include wetland presence and type as predictors in machine learning models for soil P. The NWI is a publicly available dataset that captures distribution and attribute information for wetlands in all 50 U.S. states. We derived a categorical attribute indicating whether soil samples were collected from a series of possible wetland types included in the NWI, including freshwater emergent wetlands, freshwater forested/shrub wetlands, freshwater ponds, lakes, or riverine wetlands, or from non-wetland areas.

gSSURGO

A large body of previous work has identified the importance of soil properties such as mineral and organic content, parent material, and soil texture to P abundance (e.g., Records et al. 2016). We used the 10 m resolution raster gSSURGO dataset (Soil Survey Staff 2021) to assign soil properties to P sample locations. The gSSURGO database is organized by map unit soil polygons where each map unit is linked to hundreds of detailed soil attributes. We selected 81 attributes from the gSSURGO dataset with the potential to control soil P abundance, including aspects of parent material, grain size, organic matter content, and mineral content (Records et al. 2016), and linked them by location to soil P sample locations.

For gSSURGO attributes that are mapped at scales smaller than the map unit, we preprocessed numeric SSURGO attributes by taking the weighted average of values for all components in a map unit, where weights were the proportional contribution by area of each component to the map unit. For categorical SSURGO attributes, we selected the attribute value for the component comprising the highest percent area of the map unit. If two attribute values comprised the same percent area of the map unit, we preferentially selected the component attribute with the higher slope value because surface P runoff is likely to be higher for soils with greater slope.

StreamCat dataset

U.S. EPA’s StreamCat dataset contains information for over 600 different environmental metrics linked to individual stream reaches in the NHDPlusv2 dataset (Hill et al. 2015). These metrics summarize diverse geospatial attributes including land use, impervious surfaces and road density, soil type, point source and nutrient inputs, and climatic factors (temperature and precipitation) at the catchment and watershed scale draining into each reach. In contrast to the 30 m pixel gridded NLCD 2006 data, StreamCat attributes are summarized at the catchment (local drainage area) and watershed scales and thus offered the potential to capture land use trends at a slightly broader, but still local, geographic scale that might be relevant to soil P concentrations. We included attributes from StreamCat variables potentially relevant to soil P across our study region (Table S1). These tables included 220 attributes summarized at the watershed and catchment scales in our dataset.

Random forest modeling

Random forest regression is a nonparametric ensemble learning method that utilizes predictions from multiple decision trees to improve model accuracy. Each tree is composed of branches (“nodes”) representing yes–no questions where features (i.e., predictive variables) are used to split the dependent variable into two groups that minimize in-group variability and maximize between group variability. We selected random forest as our modeling method because these models can be highly accurate, are relatively fast to develop, can handle categorical and


numerical attributes, are robust to outliers, and can handle both non-linear and unbalanced data, all of which were pertinent criteria for our dataset. Random forest models also require very few assumptions of the input datasets. At the same time, some limitations of random forest modeling methods should be noted: mainly, the machine-learning techniques are characterized by the data mechanisms as ‘black box’ or ‘gray box’ algorithmic approaches (Jeong et al. 2016). Particularly, random forest is built on non-parametric advanced classification and regression tree (CART) analysis methods and models may not be fully described mechanistically (Breiman 2001). Also, the random forest algorithm has the potential to overfit data, and may become impractical for making predictions beyond the training data range (Jeong et al. 2016). Given the inherent random forest structure, permutation of many trees may make the algorithm slow for real-time prediction (Breiman 2001).

Random forest model development followed the general scheme for machine learning models articulated by Zhong et al. 2021 (Fig. 2). We used a tidymodels framework (Kuhn and Wickham 2020) in R (R Core Team 2022) to develop random forest models for predicting soil P from the assembled predictors. Models were evaluated based on R² and RMSE values (Chicco et al. 2021). Within this framework, we used both the ranger package (Wright and Ziegler 2017) and the randomForest (Liaw and Wiener 2002) packages at different points in the modeling process. Both ranger and randomForest have been shown to perform as well or better than other machine learning approaches for predictive purposes (Hagenauer et al. 2019) and have been shown to produce very similar model results to one another (Wright and Ziegler 2017). We used the ranger package for initial model tuning, due to its considerably faster implementation relative to randomForest (Wright and Ziegler 2017). Once optimal hyperparameter values had been identified using this approach, we applied these hyperparameter values to a randomForest implementation to create our final model, because randomForest model objects (but not ranger model objects) are compatible with conditional permutation methods for estimating the relative importance of predictors to model outcomes (see rationale for the approach we took to evaluate predictor importance below). Prior to switching from ranger to randomForest implementations, we confirmed that the two approaches produced nearly the exact same results in terms of model performance (as measured by R² and RMSE for actual vs predicted values on an independent test dataset), when the same hyperparameter values were used. A complete R script detailing our approach can be found at https://github.com/cldolph/SoilP.

Additional data curation

After pre-processing soil predictors as described above, we linked attributes to soil P samples and removed variables that did not contain useful information (e.g., all rows = 0). We also excluded attributes where information was missing (‘NA’) for > 20% of soil P samples. Finally, we also excluded three attributes from the gSSURGO dataset (pmkind, taxpartsize, and texcl) with category levels that did not encompass the entire set of the categories contained within the UMRB grid dataset (see below). Many of the remaining attributes still contained some missing values. Because random forest models cannot handle missing values in predictor variables, we used the missRanger package in R (Mayer 2021) to impute the remaining missing values for the training and testing datasets. missRanger imputes missing values for each variable by building a random forest model for each variable that uses all other variables in the dataset as covariables. Prior to random forest modeling, we normalized (i.e., centered and scaled) numeric attributes to have a mean of zero and a standard deviation of one.

Model tuning

For the soil P random forest model, we used 90% of the data for model training, and the remaining 10% for model testing. Using the training dataset and the ranger implementation for random forest modeling, we applied tenfold cross validation to tune model hyperparameters across a range of possible values. The hyperparameters selected for tuning were: mtry (i.e., number of variables randomly sampled as candidates at each split) and min_n (i.e., the minimum number of data points in a node). The trees hyperparameter (i.e., number of trees) was set to 1000 across all models. K-fold cross validation can assist in avoiding model over-fitting and works by partitioning training data into K equal sized “folds” (in our case 10). The model is iteratively trained on various


combinations of tuning hyperparameters across K-1 folds, leaving the remaining fold to evaluate
model performance for each combination. We defined a grid of 20 potential combinations of
hyperparameters using the tune_grid() function from the tidymodels collection of packages in R.
This approach draws hyperparameter values semi-randomly from parameter space such that the
various combinations cover the whole space of potential values. We selected hyperparameter
values using out-of-bag (OOB) RMSE and R2 for the associated models.

Once hyperparameter values were tuned, we re-ran the random forest model using the randomForest
package, to create a randomForest object that was compatible with our selected measure of
predictive variable importance (see next section). We evaluated overall model performance based
on the independent test dataset (comprising 10% of the original dataset).

Predictive variable importance

Many implementations of random forest models come with default measures used to quantify the
relative importance of individual predictive variables to model performance. However, recent
work has shown that many of these default strategies, including those based on measures of
impurity or permutation-based metrics, can produce biased results when model predictors (1) vary
in their scale of measurement; (2) vary in their number of categories; or (3) are highly
correlated to one another (Hooker et al. 2021). When these conditions apply, as they do for
predictors in our soil P dataset, alternative measures of importance such as conditional
permutation importance (CPI) may be more appropriate, and less biased towards collinear
predictive variables (Debeer and Strobl 2020; Hooker et al. 2021). CPI aims to capture the
dependence between a predictor and the response variable, conditionally on the values of all
other predictors. That is, CPI is a measure that can be used to assess how much each variable
contributes to accurately predicting the response variable, given what we know from all other
predictive variables. The tradeoff is that CPI methods can be computationally expensive. To
derive less biased estimates of the importance of each variable to random forest model
performance, we implemented the CPI approach from the permimp package in R (Debeer et al. 2021).
This approach builds upon the widely used party package for CPI, while offering improved
computational speed, accounting for non-linear dependence between predictors, less sensitivity
to dataset sample size, and greater stability of results (Debeer and Strobl 2020). To date,
permimp can be applied to randomForest or cforest model objects in R (but not to ranger
objects). In permimp, a threshold value, equal to 1 minus the p-value for the association
between predictor variables, is used to determine whether to include a predictor in the
conditioning for the predictor of interest. We used the default value for the threshold
parameter in permimp (0.95) because Debeer and Strobl (2020) advised utilizing threshold values
close to 1 for larger datasets.

Predicting soil P across the Upper Mississippi River Basin

One goal of this effort was to predict soil P values for unsampled locations across the UMRB.
To do this, we created a grid of points spaced 100 m apart across the entire UMRB (UMRB grid),
using the Create Fishnet tool in ArcMap version 10.8. We linked these locations to the same set
of attributes used in our random forest model (i.e., attributes from the NHDPlusv2, NLCD 2006,
NWI, gSSURGO, and StreamCat datasets). We excluded grid locations that coincided with locations
of open water. We used the High-Performance Cluster at the University of Minnesota’s
Supercomputing Institute (https://www.msi.umn.edu/) to assign attributes to the UMRB grid
points, and to generate predictions for the grid locations using the random forest model we
developed. For this effort, we assigned all predictions to the soil surface (max depth = 30 cm).
Lastly, we used the Inverse Distance Weighting (IDW) tool in ArcMap version 10.8 to interpolate
a raster surface of soil P values at the 100 m grid scale for the UMRB.

Results

Soil phosphorus

Mean soil total P for the broader Midwestern region included in the study was 580 mg/kg (range
17–4370 mg/kg). The distribution of soil P values showed a slight right skew (Fig. 3). The
dataset comprised 1,761 samples that could be considered to occur in the “plowable layer” at
the soil


Fig. 3  Distribution of P
concentrations for all soil
samples included in the
dataset (n = 6,683). Blue
dashed line indicates mean
value (580 mg/kg). Black
dot-dash line indicates
cut off (1000 mg/kg) for
samples used in the random
forest model
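
The ten-fold cross-validation and grid search described in the Methods can be illustrated with a
minimal Python sketch. It is not the tidymodels workflow used in this study: a simple
nearest-neighbour regressor stands in for the random forest, the single tuned value
`n_neighbors` stands in for hyperparameters such as mtry and min_n, scores come from held-out
folds rather than out-of-bag samples, and the depth and soil P values are synthetic. All
function names are hypothetical.

```python
import math
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def rmse(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r_squared(y_true, y_pred):
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def knn_predict(train_x, train_y, x, n_neighbors):
    """Stand-in model (NOT a random forest): mean response of the nearest neighbours."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))[:n_neighbors]
    return sum(train_y[i] for i in nearest) / len(nearest)

def cross_validate(x, y, n_neighbors, k=10):
    """Train on k-1 folds, score on the held-out fold; return (mean RMSE, mean R2)."""
    rmses, r2s = [], []
    for fold in k_fold_indices(len(x), k):
        held_out = set(fold)
        train = [i for i in range(len(x)) if i not in held_out]
        tx, ty = [x[i] for i in train], [y[i] for i in train]
        preds = [knn_predict(tx, ty, x[i], n_neighbors) for i in fold]
        truth = [y[i] for i in fold]
        rmses.append(rmse(truth, preds))
        r2s.append(r_squared(truth, preds))
    return sum(rmses) / k, sum(r2s) / k

# Synthetic data (assumption): soil P declines with sample depth, plus noise.
rng = random.Random(42)
depth = [rng.uniform(0, 200) for _ in range(200)]
soil_p = [700 - 1.5 * d + rng.gauss(0, 50) for d in depth]

grid = [1, 5, 15, 50]                      # candidate hyperparameter values
results = {g: cross_validate(depth, soil_p, g) for g in grid}
best = min(results, key=lambda g: results[g][0])   # lowest mean RMSE wins
print(results, best)
```

Each candidate value is scored only on data the model did not see during fitting, which is what
guards the tuning step against over-fitting.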

surface (i.e., ≤ 30 cm deep), and 5,128 samples that occurred at depths > 30 cm (i.e.,
‘subsurface’ samples). Surface soil samples had a mean total P concentration of 585.6 mg/kg
(range = 80–4370 mg/kg). Subsurface soil samples had a mean total P concentration of
592.6 mg/kg (range = 17–4280 mg/kg). The dataset contained 1,085 samples that were located in
“near channel” or riparian areas (i.e., within a 100 m buffer of the NHDv2Plus stream network),
and 5,798 samples that were non-riparian or located outside of the stream buffer zone (Fig. S5).
The dataset also contained 309 samples that were located within wetland soils, and 6,574 that
were located in nonwetland soils.

Across all samples, 352 samples (5%) had soil P > 1000 mg/kg and were distributed throughout the
study region (Figure S1). A very small number of samples (n = 5) had soil P > 3000 mg/kg. All of
these samples occurred in the National Geochemical Survey dataset. We examined the original
field notes for these samples to ascertain if these values could be attributed to error or
other specific site features, but there was no obvious information pointing to sampling or
analytical error.

Relative to samples with soil P < 1000 mg/kg, samples with very high soil P were located in
areas with higher median crop productivity index for corn, lower available water storage in the
root zone, relatively higher contribution of groundwater to streams in the local catchment and
watershed, relatively higher rates of N deposition (as NO3−, NH4+ and inorganic N), higher
pesticide and fertilizer use, higher precipitation and runoff, and higher cropland cover in the
catchment, watershed and riparian zone (see SI Table S2). Higher contributions of groundwater to
stream and river flow could indicate high permeability (like karst), a large contribution of
drain tile (Schilling and Libra 2003), or a lower contribution of surface water such as in the
drier western part of the basin. Overall, predictor values for samples with high soil P suggest
that very high soil P samples occur on average in areas with greater precipitation and well


drained soils, which are also associated with areas of more intense agriculture in the UMRB.
This rationale makes sense for samples located in parts of the study region (e.g., Iowa), where
these conditions are likely to occur. However, the spatial distribution of samples with high
soil P (Figure S1) indicates that many of them also occurred in northern areas of the basin
characterized by relatively little agriculture (e.g., northern Minnesota).

Model performance

The original model included all soil P samples in model training and testing; however, this
resulted in a model that tended to underpredict soil P for samples > 1000 mg/kg (SI Figure S2).
As a result, we decided to increase model prediction accuracy for the vast majority (95%) of
soil samples in our dataset by excluding the 5% of samples with soil P > 1000 mg/kg from the
training and testing datasets (see Discussion). The final selected hyperparameters for the
random forest model based on model tuning with ten-fold cross validation for this dataset were
mtry = 50, trees = 1000, min_n = 3. Evaluation of predicted vs actual soil P values for the
independent test dataset indicated a model RMSE of 129.3, and an R2 of 0.58, for soil P
samples < 1000 mg/kg (Fig. 4).

Fig. 4  Actual vs predicted total soil P (mg/kg) for soil samples in the independent test
dataset (i.e., not used to build the model). R2 = 0.58, RMSE = 129.3, p < 0.0001. Note that the
random forest model was run using centered and scaled values of P, but unscaled values are
shown here for easier interpretation

Predictive variable importance

Using the permimp function to estimate conditional permutation importance (CPI) for predictors
in our model indicated that soil sample depth had by far the strongest relative impact on model
performance, followed by land use in the immediate area (i.e., the 30 m grid cell) in which the
soil sample was collected (Fig. 5). The remaining top predictive variables are shown ranked by
importance in Fig. 5 and Table 1. In addition to sample depth and local land use, the top
predictors ranked for importance included aspects of catchment- and watershed-scale land use
(including aspects of urban land use, the extent of impervious surfaces, open water cover,
barren land cover, and urban and agricultural land use in riparian areas), underlying soil
properties (e.g., soil characteristics such as organic matter and mineral content, water
storage and permeability), landscape properties (such as slope, elevation, and whether sites
were located adjacent to the stream network), inputs (including atmospheric deposition as well
as fertilizer and manure inputs), and climate factors (including temperature and precipitation;
Table 1). Additional predictors beyond this set did not appear to contribute strongly to model
accuracy once all other variables had been accounted for.

Examining some of these attributes in relation to soil P in more detail, we find that the
highest soil P concentrations (> 750 mg/kg) were found almost exclusively in samples collected
at a depth of 100 cm or shallower, i.e., comparatively closer to the surface. Nearly all
samples deeper than 100 cm had soil P < 750 mg/kg (SI Figure S3). Among NLCD land use
categories, samples collected in areas designated as open water, shrub/scrub, cultivated crops,
developed/open space and grassland/herbaceous had higher mean soil P concentrations than the
total dataset average, whereas other categories had comparatively lower soil P (Figure S4).
Riparian soils (i.e., soils located within 100 m of the NHDv2Plus stream network) had
significantly higher soil P (mean soil P = 646 mg/kg) compared to non-riparian soils (mean soil
P = 580 mg/kg; unpaired t test t = −7.04, df = 1456, p < 0.00001;


Fig. 5  Conditional Permutation Importance (CPI) values of covariates in the random forest model
for soil P samples with soil P < 1000 mg/kg. CPI is a measure that can be used to assess how
much each variable ‘adds’ to accurately predicting the response variable, given what we know
from all other covariates. Additional covariates beyond those shown did not show a strong
impact on model performance, once all other covariates were accounted for. Full covariate
descriptions are listed in Table 1
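
The idea behind conditional permutation importance can be illustrated with a minimal Python
sketch. This is not the permimp algorithm: permimp derives its conditioning grid from the split
points of the fitted trees, whereas the sketch below simply permutes a predictor within
equal-count bins of a correlated predictor, which is a simplifying assumption. The data, the
stand-in "fitted" model, and all function names are hypothetical.

```python
import random

def mse(model, rows, y):
    return sum((model(r) - t) ** 2 for r, t in zip(rows, y)) / len(y)

def permute_column(rows, col, groups, rng):
    """Shuffle column `col` separately within each group of row indices."""
    out = [list(r) for r in rows]
    for members in groups:
        values = [rows[i][col] for i in members]
        rng.shuffle(values)
        for i, v in zip(members, values):
            out[i][col] = v
    return out

def importance(model, rows, y, col, groups, rng, repeats=20):
    """Mean increase in MSE after permuting `col` within the given groups."""
    base = mse(model, rows, y)
    increases = [mse(model, permute_column(rows, col, groups, rng), y) - base
                 for _ in range(repeats)]
    return sum(increases) / repeats

def bins_of(rows, col, n_bins):
    """Group row indices into equal-count bins of the conditioning column `col`."""
    order = sorted(range(len(rows)), key=lambda i: rows[i][col])
    size = -(-len(order) // n_bins)          # ceiling division
    return [order[s:s + size] for s in range(0, len(order), size)]

rng = random.Random(1)
x1 = [rng.gauss(0, 1) for _ in range(500)]
rows = [(a, a + rng.gauss(0, 0.1)) for a in x1]   # x2 is nearly a copy of x1
y = [2 * a for a in x1]                            # response depends on x1 only
model = lambda r: r[0] + r[1]                      # stand-in model that leans on both predictors

everyone = [list(range(len(rows)))]                # one group: ordinary (marginal) permutation
marginal_x2 = importance(model, rows, y, 1, everyone, rng)
conditional_x2 = importance(model, rows, y, 1, bins_of(rows, 0, 25), rng)
print(marginal_x2, conditional_x2)
```

Permuting x2 across the whole dataset breaks its link to x1 and inflates the error, so x2 looks
important; permuting it only among samples with similar x1 leaves the error nearly unchanged,
reflecting that x2 adds little once x1 is known.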

Figure S5). However, caution should be used when applying univariate relationships to aid in
the interpretation of importance values for the random forest model. Because the random forest
model is based on a series of successive data splitting events (i.e., decision trees) where
data are binned into branches of groups based on predictive variable split points, the
predictor values useful in parsing these groups may not be readily apparent from univariate
relationships.

Predicting soil P for the Upper Mississippi River Basin

Figure 6 shows predicted total soil P values for the soil surface (max depth = 30 cm) of the
UMRB at the 100 m grid scale, based on the random forest model. The predictions indicate the
highest concentrations of total soil P in northern Iowa, southern Minnesota, and north-central
Illinois. The lowest predicted concentrations occurred in northern and north/north-central


Table 1  Top covariates, as ranked by conditional permutation importance (CPI), to the performance of the random forest model
Abbreviation Category Short description Source layer

Depth_cm Underlying soil properties Soil depth (cm) NGS/NCSS


NLC06 Land use Land use category in the 30 m grid cell where soil sample NLCD 2006
was collected
PctImp2006CatRp100 Land use % Impervious in riparian zones within catchment StreamCat
PctImp2006Cat Land use % Impervious within catchment StreamCat
PctBl2006Ws Land use % Barren land within watershed StreamCat
P2O5Cat Underlying soil properties Mean % lithological phosphorous oxide content in surface StreamCat
geology within catchment
PctOw2006WsRp100 Land use % Open water in riparian zones within watershed StreamCat
slope_r Landscape properties Mean slope in gSSURGO map unit gSSURGO
RdDensCatRp100 Land use Road density in riparian zones within catchment StreamCat
nccpi3corn Underlying soil properties National commodity crop productivity index for corn in gSSURGO
gSSURGO map unit
PctGlacLakeCrsCat Underlying soil properties % of catchment area classified as coarse-textured glacial StreamCat
outwash & glacial lake sediment
NO3_2008Cat Inputs Mean NO3 wet deposition within catchment StreamCat
HydrlCondCat Underlying soil properties Mean lithological hydraulic conductivity in surface geol- StreamCat
ogy within catchment
sieveno200_r Underlying soil properties Mean soil fraction passing a #200 sieve (0.074 mm square gSSURGO
opening) for gSSURGO map unit
PctUrbOp2006CatRp100 Land use % Developed, open space in riparian zones within catch- StreamCat
ment
NO3_2008Ws Inputs Mean NO3 wet deposition within watershed StreamCat
ElevCat Landscape properties Mean elevation within catchment StreamCat
PctUrbLo2006Cat Land use % Developed, low intensity land use within catchment StreamCat
OmCat Underlying soil properties Mean organic matter of soils within catchment StreamCat
MgOCat Underlying soil properties Mean lithological magnesium oxide (MgO) content in StreamCat
surface geology within catchment
BFICat Landscape properties Ratio of baseflow to total flow within catchment StreamCat
PctGlacLakeCrsWs Underlying soil properties % of watershed classified as coarse-textured glacial out- StreamCat
wash and glacial lake sediment
Al2O3Cat Underlying soil properties Mean lithological aluminum oxide content in surface geol- StreamCat
ogy within catchment
aws0_30 Underlying soil properties Mean available water storage at 0–30 cm depth for gSSURGO
gSSURGO map unit
soc150_999 Underlying soil properties Mean soil organic carbon stock from 150 cm to max gSSURGO
reported depth of the soil profile for gSSURGO map unit
AgKffactCat Underlying soil properties Mean soil erodibility factor on ag land within catchment StreamCat
WetIndexWs Underlying soil properties Mean composite topographic (wetness) index within StreamCat
watershed
nccpi3sg Underlying soil properties National commodity crop productivity index for sorghum gSSURGO
in gSSURGO map unit
FertWs Inputs Mean rate of synthetic N fertilizer application to ag land StreamCat
within watershed
aws0_999 Underlying soil properties Available water storage in total soil profile in gSSURGO gSSURGO
map unit
Al2O3Ws Underlying soil properties Mean lithological aluminum oxide content in surface geol- StreamCat
ogy within watershed


aws0_20 Underlying soil properties Available water storage estimate at 0–20 cm depth in gSSURGO
gSSURGO map unit
PctUrbLo2006CatRp100 Land use % Developed, low intensity land use in riparian zones StreamCat
within catchment
WtDepWs Underlying soil properties Mean seasonal water table depth of soils within watershed StreamCat
StreamType Landscape properties Whether soil samples were collected within riparian buffer NHDv2Plus
of stream network
SN_2008Ws Inputs Mean wet deposition for average sulfur and nitrogen within StreamCat
watershed
nccpi3all Underlying soil properties National commodity crop productivity index for all crops gSSURGO
in gSSURGO map unit
nccpi3soy Underlying soil properties National commodity crop productivity index for soybean in gSSURGO
gSSURGO map unit
Fe2O3Ws Underlying soil properties Mean lithological ferric oxide content in surface geology StreamCat
within catchment
aws0_150 Underlying soil properties Mean available water storage in top 150 cm of soil depth in gSSURGO
gSSURGO map unit
PctOw2006Cat Land use/landscape properties % Open water within catchment StreamCat
RdCrsCat Land use Road crossings in catchment StreamCat
PermCat Underlying soil properties Mean permeability of soils within catchment StreamCat
aws50_100 Underlying soil properties Available water storage in 50–100 cm depth in gSSURGO map gSSURGO
unit
CaOWs Underlying soil properties Mean lithological calcium oxide content in surface geology StreamCat
within watershed
PctHay2006WsRp100 Land use % Hay in riparian zones within watershed StreamCat
TmaxCat Climate 30 year normal max temperature, 1981–2010, within catch- StreamCat
ment
PctEolFineWs Underlying soil properties % of catchment area classified as eolian sediment, fine- StreamCat
textured (glacial loess)
PrecipWs Climate PRISM climate data—Mean precipitation (mm) within the StreamCat
watershed. Period: 2008
PctCrop2006WsRp100 Land use % Crop land use in riparian zones within watershed StreamCat
ElevWs Landscape properties Mean watershed elevation (m) StreamCat
BFICat Landscape properties Ratio of baseflow to total flow within catchment StreamCat
SN_2008Cat Inputs StreamCat
RunoffWs Landscape properties/climate Mean runoff (mm) within watershed StreamCat
CompStrgthWs Underlying soil properties Mean lithological uniaxial compressive strength content in StreamCat
surface geology within watershed
ManureCat Inputs Mean rate of manure application to ag land from CAFOs StreamCat
within catchment
PctImp2006Ws Land use % Impervious surfaces within watershed StreamCat
claytotal_r Underlying soil properties % Clay for soils in gSSURGO map unit gSSURGO
Na2OCat Underlying soil properties Mean lithological sodium oxide content in surface geology StreamCat
within catchment
Na2OWs Underlying soil properties Mean lithological sodium oxide content in surface geology StreamCat
within watershed
FertCat Inputs Mean rate of synthetic nitrogen fertilizer application to ag StreamCat
land within catchment
PctUrbHi2006CatRp100 Land use % developed, high-intensity land use within catchment StreamCat


NH4_2008Cat Inputs Mean wet deposition for ammonium ion concentration StreamCat
2008 within catchment
PermWs Underlying soil properties Mean permeability of soils within watershed StreamCat
PctUrbOp2006Ws Land use % Developed, open space land use within watershed StreamCat
Attributes are designated by major categories including ‘land use’, ‘underlying soil properties’, ‘landscape properties’, ‘inputs’, and
‘climate’. More detailed descriptions of each attribute are available in SI Table S1. ‘Riparian zones’ refers to the 100 m buffer sur-
rounding the NHD stream network

Fig. 6  Predicted total soil P (mg/kg) for the soil surface (max soil depth = 30 cm) at the
100 m grid scale for the Upper Mississippi River Basin
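
The inverse distance weighting step used to turn point predictions into a raster surface can be
sketched in a few lines of Python. This is only an illustration of the general technique, not
the ArcMap tool: it uses every input point rather than a search neighbourhood, and the power of
2, the coordinates, the soil P values, and the function names are all assumptions.

```python
import math

def idw(points, x, y, power=2.0):
    """Inverse-distance-weighted estimate at (x, y) from (px, py, value) tuples."""
    numerator = denominator = 0.0
    for px, py, value in points:
        d = math.hypot(x - px, y - py)
        if d == 0.0:
            return value                  # exactly on a known point
        weight = 1.0 / d ** power         # nearer points count for more
        numerator += weight * value
        denominator += weight
    return numerator / denominator

def idw_raster(points, x_min, y_min, n_cols, n_rows, cell_size, power=2.0):
    """Evaluate idw() at the centre of every cell of a regular grid (row-major list of lists)."""
    return [[idw(points,
                 x_min + (c + 0.5) * cell_size,
                 y_min + (r + 0.5) * cell_size,
                 power)
             for c in range(n_cols)]
            for r in range(n_rows)]

# Hypothetical predicted soil P values (mg/kg) at three grid points, coordinates in metres.
predictions = [(0.0, 0.0, 500.0), (100.0, 0.0, 700.0), (0.0, 100.0, 550.0)]
surface = idw_raster(predictions, x_min=0.0, y_min=0.0, n_cols=2, n_rows=2, cell_size=100.0)
print(surface)
```

Because the weights are positive and normalised, every interpolated cell value lies between the
smallest and largest input values.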

Wisconsin, and in southern Illinois and southern Missouri.

Discussion

Soil P across the study region


Here, we have assembled a large dataset of total P concentrations for soils across the
Midwestern United States from publicly available datasets and used them to predict total soil P
at fine resolution across the UMRB, as well as to identify predictors that were comparatively
important to model accuracy. These soil datasets, collected by federal agencies over relatively
long time periods and large spatial extents, have great potential utility in the analysis of
dynamics in soil and water chemistry, but the availability of these datasets is not widely
known and they are not uniformly organized (and require considerable pre-processing), possibly
leading to their underutilization in the study of soil nutrient dynamics. Our study highlights
the potential of existing publicly-funded datasets to support ongoing studies and management
design related to soil and water health. For this reason, we provide the collated dataset
together with the metadata and code in an open science framework available in the open access
GitHub repository (https://github.com/cldolph/SoilP).

The range of total P concentrations reported here for soils of the Midwestern United States
(mean = 580 mg/kg, range = 17–4370 mg/kg) is similar to that reported by Schilling et al. (2021)
for riparian soils across Iowa (mean = 460 mg/kg; range = 109–1569 mg/kg), though it should be
noted that the latter study used the aqua regia digestion method for measuring total P, which
can sometimes result in slightly lower measured P concentrations (J. Kovar, personal
communication), rather than the HCl digestion method used for the samples in this study.
Riparian soils in our study averaged slightly higher total P (646 mg/kg) compared to those
measured by Schilling et al. (2021). Ringeval et al. (2017) developed models to simulate total
P for cropland soils globally and estimated the global average for total P in cropland soils to
be 567 mg/kg; however, their simulated total P values for soils in North America ranged
considerably higher, typically above 1000 mg/kg.

Model performance

Heterogeneity in total soil P is the outcome of potentially complex interacting drivers
including geology, hydrology, climate, biogeochemical processes, and specific land use
practices. Given this complexity, we view the predictive accuracy of our random forest model as
relatively high (R2 = 0.58 and RMSE = 129.3 for an independent validation dataset). The RMSE
for our model represents 24% of mean soil P for the validation dataset, which is comparatively
more accurate than P prediction models developed at smaller scales in Europe and Russia
(Matos-Moreira et al. 2017 and references therein; Sahabiev et al. 2021). Our model performance
is also considerably stronger than that for continental-scale predictive models developed using
similar methods for soil organic carbon and total nitrogen by Ramcharan et al. (2018). However,
the model developed here underpredicts total soil P for samples with very high soil P
(> 1000 mg/kg). Although such samples comprised a relatively small proportion of our total
dataset (~5%), underestimating soil P content for these samples may be substantively important
if they represent hotspots for P accumulation and ultimately transport. Understanding the
drivers of very high soil P samples in this study region is an important area for future
research. Knowledge of specific land use practices at finer scales than those used in our study
(e.g., fertilizer or manure application at field scales smaller than the catchment or watershed
scales available for land use practices in the StreamCat dataset) could also potentially help
with predictive accuracy. Intensive animal agriculture (e.g., CAFOs) and associated manure
inputs in particular have been shown to coincide with areas of P accumulation (Stackpoole et
al. 2019). Future research could also explore whether a classification-based approach (i.e.,
applying expert judgment to group numeric soil P estimates into biochemically meaningful
categories such as ‘very high’, ‘high’, ‘moderate’, ‘low’ etc.) rather than the regression
approach used here might enable random forest models to more accurately identify hotspots or
areas of very high soil P.

Predictive variable importance

Our analysis indicates that, at the regional scale represented by our model, the variables with
the greatest comparative importance for predicting soil P included sample depth, land use,
underlying soil properties, landscape properties, P inputs, and climate. Soil sample depth was
the most comparatively important variable to model performance, with soil P tending lower with
increased depth (Figure S3). This finding echoes


Ramcharan et al. (2018), who found depth to be consistently ranked as one of the most important
predictors for random forest models developed for soil properties including total nitrogen and
soil organic carbon. Land use in the immediate areas (30 m grid) where soil samples were
collected was the second most important variable to model accuracy, according to the
conditional permutation importance measure we used. The comparatively high ranking of land
use/land cover differs somewhat from that of Ringeval et al. (2017), who found, using a
different modeling approach, that the main driver of variability for total soil P at a global
scale was the soil biogeochemical background corresponding to P inherited from natural soils.
However, at smaller (continental) scales, Ringeval et al. (2017) found farming practices were
also important in driving heterogeneity in total soil P, though still less important than
native soil properties. At a smaller scale, Kaya and Başayiğit (2022) found that for an
alluvial plain (~100 km2) characterized by intensive agriculture, land cover classes were the
most important attributes to predicting soil P. For a small watershed in Brittany, France,
Matos-Moreira et al. (2017) found that information about agricultural practices contributed
more to machine learning models for extractable P than for total P, whereas models for total P
relied more strongly on soil and parent material attributes.

In our study, land use in the immediate area of a soil sample was more important to accurately
predicting soil P than soil properties at the gSSURGO map unit scale and more important than
measures of land use at catchment or watershed scales. Given this finding, a question for
future research would be whether more specificity in soil properties and land use practices at
the hyper-local scale (e.g., on-field practices such as fertilizer and manure input, specific
crop type, tillage practices at the field or 30 m grid scale, etc.) could lead to improvements
in models for predicting soil P across the Midwest. Matos-Moreira et al. (2017) suggested that
detailed information about crop rotation in particular may improve the accuracy of soil P
prediction models, as it correlates with a number of agricultural practices, including
fertilizer and pesticide application and soil tillage.

Model variables associated with urban or high-density areas, including extent of impervious
surfaces, urban land use, and road density, were all ranked as relatively important to model
performance. This finding suggests that soil P in urban or high-density areas may be distinct
from other landscapes. However, several ranked “urban” attributes describe the extent of low
density or “open” urban land use, which are indicative of suburban and exurban landscapes and
may indicate that these landscapes are also distinct in terms of soil P. Although soil P
dynamics in urban areas are relatively under-studied, they have been identified as hotspots for
P cycling, with distinct drivers that influence P stocks and flows (Metson et al. 2015). For
example, Hobbie et al. (2017) observed that urban watersheds in our study region retained
comparatively little of their P inputs, and that the majority of annual P inputs were exported
via stormwater flows. Additional research is needed to understand drivers of total soil P in
urban and suburban landscapes and how these may differ from those in agricultural settings.

Riparian attributes summarized at the catchment or watershed scale (e.g., the % of crop, hay or
urban land use cover in catchment or watershed-scale riparian zones) were also identified as
relatively important to model prediction accuracy. This finding could result from the fact that
land use in riparian zones may be indicative of a particularly intensive form of land use; that
is, if riparian zones are used for crops or hay or are dominated by urban land use, this may
indicate particularly intensive landscapes for these uses in a way that affects soil P.
Conversely, these drivers may be strongly affecting soil P in riparian areas, and therefore
could have been important for model accuracy for riparian samples. A categorical attribute
(‘StreamType’) describing whether a sample was collected in the riparian buffer of the stream
network (and which kind of stream reach it was adjacent to, such as a ditch, stream, river
etc.) also appeared in the list of important variables to model performance. Post hoc analysis
indicated that riparian areas (soils within 100 m of the stream network) had significantly
higher total P (646 mg/kg) compared to non-riparian areas (580 mg/kg). This echoes the finding
by previous studies which have noted riparian and wetland areas as potential sinks, and
ultimately sources, of P in agricultural landscapes (Kleinman et al. 2022).

Soil properties that were ranked as important to model performance included the native P
content (as phosphorus oxide) and mineral content (aluminum, iron, calcium, and sodium oxides)
of soils,


along with aspects of soil texture, water storage and depth, soil organic carbon/organic
matter, glacial history, hydraulic conductivity, erodibility, and suitability of soils for
growth of various crop types. Extensive research has documented the importance of soil
compounds such as iron and aluminum oxides and calcium carbonate on soil P retention and
release (Records et al. 2016). Likewise, soil organic matter content has been shown to have
direct and indirect effects on P sorption capacity (e.g., Kang et al. 2009). And as documented
recently by Plach et al. (2018), the parental origins of soil and their glacial history have a
potentially strong effect on the physical and geochemical properties of soil which can
subsequently affect P retention and transport. Our findings likewise indicate that many of
these properties are important for accurately predicting soil P across Midwestern landscapes.

Nitrate and ammonium deposition at the catchment scale, as well as fertilizer and manure inputs
at the catchment scale, were all identified as important to model performance. Atmospheric
deposition of nitrogen could be an indicator of the proximity of fossil fuel combustion,
fertilizer or manure application or livestock emissions (Russell et al. 1998), which may also
affect soil P.

Predictors summarizing aspects of climate, landscape properties, and their interaction
(including precipitation, temperature, runoff, slope, and the contribution of groundwater to
baseflow) also appeared to affect model performance. Recent work has shown that both
temperature and precipitation are important to soil P availability at a global scale and may
have contrasting effects (Hou et al. 2018). Slope and the resulting erosional processes
(whether driven by human land use, climate, or other factors) also have a profound effect on
subsequent biogeochemical processes that may affect soil P (Berhe et al. 2018).

In addition to land cover, a major driver of hydrology and water-soil interactions in the UMRB
is the vast network of subsurface tile drainage that characterizes much of the region
(Schottler et al. 2013). The predictor datasets we used did not include information about
subsurface drainage. Although tile drainage has important ramifications for downstream
transport of P (Smith et al. 2015), it may have less of an impact on total soil P. For example,
based on soil P data assembled here, average total soil P in the UMRB is 580 mg/kg. If we
assume a soil bulk density of 1.4 g/cm3 (Jacobson et al. 2011), this average soil P
concentration equates to ~8,120 kg/ha of P in the top meter of the soil (1 ha × 1 m of soil at
1.4 Mg/m3 is 14,000 Mg of soil, and 14,000 Mg × 580 g P/Mg = 8,120 kg P). By comparison,
previous studies have shown yearly P losses from tile drains to be ~0.01% of this amount (e.g.,
0.5–1.0 kg/ha/yr; King et al. 2015; Dialameh and Ghane 2022). Meanwhile, P inputs may be
~12–25 kg/ha/yr, depending on the crop rotation (Potter et al. 2010). Thus, while accounting
for tile drainage impacts is critical to understanding losses of P to downstream water bodies,
it may be less important to accurately estimating total soil P. In addition, it is likely that
the presence and density of tile drainage co-varies with other predictors already included in
our model. For example, attributes included in the model that relate to land use, slope, soil
texture and soil drainage class correlate with tile drainage (Valayamkunnath et al. 2020).
Future research in this area may consider using tile drainage as a proxy for a more complex
suite of soil and management parameters if detailed soils data are not available.

It is important to note that the estimation of variable importance within random forest models
is an active area of research (Debeer and Strobl 2020), with rapid development of prospective
improvements to “opening up” the black box of random forest models. We applied a recently
developed approach that is aimed at examining the conditional importance of predictors, which
may give different results than more commonly applied importance measures. Care should be taken
when interpreting these results relative to other modeling approaches if different methods for
importance estimation are used.

Limitations and future directions

Our initial goal in developing the machine learning model described here was to determine if we
could make accurate predictions for soil P across a broad geographic area, for use in future
biophysical modeling and conservation planning applications in the UMRB. Because the UMRB
contains considerable variation in underlying soil properties, land cover and soil P values, we
included a large number of potential predictor variables in our machine learning framework to
improve model accuracy as much as possible. Complex models often perform more accurately than

simpler ones (Wadoux et al. 2020). Indeed, our model achieved relatively high accuracy compared with other machine learning models for soil attributes, possibly because of the large number of predictors included. The downsides of this approach include the considerable time and effort required to assemble and curate such a large dataset of predictors, as well as increased difficulty in interpreting the role of so many variables in predicting soil P. A third challenge stemming from the use of hundreds of heterogeneous predictors is the way it complicates predictions for soil P at unknown locations. Because we assembled predictors from existing disparate public datasets, these variables were not summarized at consistent spatial resolution and extent. For example, data from the StreamCat dataset were summarized at catchment and watershed scales, whereas gSSURGO data are summarized at the SSURGO map unit scale, etc. This variability presents an obstacle to the application of recently developed approaches to spatial prediction; e.g., where machine learning is performed using stacks of raster data organized at the same spatial extent and resolution (Hengl and MacMillan 2019). We chose to predict soil P at the scale of a 100 m grid to provide fairly fine resolution data for a large geographic extent given the hundreds of heterogeneous predictors we had access to. However, the disadvantage of this approach is that important fine scale spatial variation in soil P could still be obscured. Future work could further investigate a soil P model in the context of parsimonious model design and evaluate whether simpler machine learning models with fewer predicting variables could perform as well as the more complex one developed here (Wadoux et al. 2020).

Conclusion

Here we have used existing, large public soil chemistry and geospatial datasets together with a random forest model to generate predictions of total soil P at fine spatial resolution across the Upper Mississippi River Basin in an open science framework. Predicted soil P values are publicly available as a raster data layer at https://github.com/cldolph/SoilP. The combination of large existing datasets with powerful analytical tools like machine learning in a high-performance computing environment offers new possibilities for understanding and predicting complex biophysical phenomena such as soil P at fine scales. Such predictions could be useful when combined with watershed models to improve estimates of P loss from this largely agricultural landscape. Improved knowledge of critical source areas of soil P, extended to include undersampled locations, is necessary to develop effective regional conservation plans and prioritize resources such that they can reduce the accumulation and transport of P across landscapes.

Acknowledgements We would like to thank Zhen Xu for alerting us to the availability of the USGS soil dataset. Random forest model tuning and predictions were performed on the High Performance Cluster at the University of Minnesota’s Supercomputing Institute, https://www.msi.umn.edu/. This project was funded by a U.S. Department of Agriculture Conservation Effects Assessment Project (CEAP) grant #NR203A750023C023.

Author contributions All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by CD. R script quality control was performed by SJC, who also designed Fig. 2. The first draft of the manuscript was written by CD and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Funding This project was funded by a U.S. Department of Agriculture Conservation Effects Assessment Project (CEAP) Grant #NR203A750023C023.

Data availability The datasets and R scripts generated during and/or analysed during the current study are available in the Soil P repository, https://github.com/cldolph/SoilP.

Declarations

Competing interests The authors have no relevant financial or non-financial interests to disclose.

References

Allen BL, Mallarino AP (2006) Relationships between extractable soil phosphorus and phosphorus saturation after long-term fertilizer or manure application. Soil Sci Soc Am J 70:454–463. https://doi.org/10.2136/sssaj2005.0031
Berhe AA, Barnes RT, Six J, Marín-Spiotta E (2018) Role of soil erosion in biogeochemical cycling of essential elements: Carbon, nitrogen, and phosphorus. Annu Rev Earth Planet Sci 46:521–548. https://doi.org/10.1146/annurev-earth-082517-010018
Boardman E, Danesh-Yazdi M, Foufoula-Georgiou E, Dolph CL, Finlay JC (2019) Fertilizer, landscape
features and climate regulate phosphorus retention and river export in diverse Midwestern watersheds. Biogeochemistry 146:293–309. https://doi.org/10.1007/s10533-019-00623-z
Breiman L (2001) Random Forests. Mach Learn 45:5–32. https://doi.org/10.1023/A:1010933404324
Chicco D, Warrens MJ, Jurman G (2021) The coefficient of determination R-squared is more informative than SMAPE, MAE, MAPE, MSE and RMSE in regression analysis evaluation. PeerJ Comput Sci 7:e623. https://doi.org/10.7717/peerj-cs.623
Clark B, Longo SB (2018) Land-Sea Ecological Rifts. Mon Rev 70:106–121. https://doi.org/10.14452/mr-070-03-2018-07_5
Correll DL (1998) The role of phosphorus in the eutrophication of receiving waters: a review. J Environ Qual 27:261–266. https://doi.org/10.2134/jeq1998.00472425002700020004x
Watershed Boundary Dataset for HUC07 (2021) Available URL: http://datagateway.nrcs.usda.gov. Accessed 21 Sep 2021
Debeer D, Hothorn T, Strobl C (2021) Package ‘permimp’. https://cran.r-project.org/web/packages/permimp/permimp.pdf
Debeer D, Strobl C (2020) Conditional permutation importance revisited. BMC Bioinformatics. https://doi.org/10.1186/s12859-020-03622-2
Deiss L, de Moraes A, Maire V (2018) Environmental drivers of soil phosphorus composition in natural ecosystems. Biogeosciences 15:4575–4592. https://doi.org/10.5194/bg-15-4575-2018
Dewitz J, US Geological Survey (2021) National Land Cover Database (NLCD) 2019 Products (ver. 2.0, June 2021). US Geol Surv Data Release. https://doi.org/10.5066/P9KZCM54
Dialameh B, Ghane E (2022) Effect of water sampling strategies on the uncertainty of phosphorus load estimation in subsurface drainage discharge. J Environ Qual 51:377–388. https://doi.org/10.1002/jeq2.20339
Dodd RJ, Sharpley AN (2015) Conservation practice effectiveness and adoption: unintended consequences and implications for sustainable phosphorus management. Nutr Cycl Agroecosyst 104:373–392. https://doi.org/10.1007/s10705-015-9748-8
Goyette J-O, Bennett EM, Maranger R (2018) Low buffering capacity and slow recovery of anthropogenic phosphorus pollution in watersheds. Nat Geosci 11:921–925. https://doi.org/10.1038/s41561-018-0238-x
Green TR, Kipka H, David O, McMaster GS (2018) Where is the USA Corn Belt, and how is it changing? Sci Total Environ 618:1613–1618. https://doi.org/10.1016/j.scitotenv.2017.09.325
Hagenauer J, Omrani H, Helbich M (2019) Assessing the performance of 38 machine learning models: the case of land consumption rates in Bavaria, Germany. Int J Geogr Inf Sci 33:1399–1419. https://doi.org/10.1080/13658816.2019.1579333
Harun SMR, Ogneva-Himmelberger Y (2013) Distribution of industrial farms in the united states and socioeconomic, health, and environmental characteristics of counties. Geogr J 2013:1–12. https://doi.org/10.1155/2013/385893
He X, Augusto L, Goll DS, Ringeval B, Wang Y, Helfenstein J, Huang Y, Yu K, Wang Z, Yang Y, Hou E (2021) Global patterns and drivers of soil total phosphorus concentration. Earth Syst Sci Data 13:5831–5846. https://doi.org/10.5194/essd-13-5831-2021
Hengl T, MacMillan RA (2019) Predictive Soil Mapping with R. OpenGeoHub foundation, Wageningen, the Netherlands, 370 pages, www.soilmapper.org, ISBN: 978–0–359–30635–0
Hill RA, Weber MH, Leibowitz SG, Olsen AR, Thornbrugh DJ (2015) The stream-catchment (StreamCat) dataset: a database of watershed metrics for the conterminous United States. JAWRA J Am Water Resour Assoc 52:120–128. https://doi.org/10.1111/1752-1688.12372
Hobbie SE, Finlay JC, Janke BD, Nidzgorski DA, Millet DB, Baker LA (2017) Contrasting nitrogen and phosphorus budgets in urban watersheds and implications for managing urban water pollution. Proc Natl Acad Sci 114:4177–4182. https://doi.org/10.1073/pnas.1618536114
Hooker G, Mentch L, Zhou S (2021) Unrestricted permutation forces extrapolation: variable importance requires at least one more model, or there is no free variable importance. Stat Comput. https://doi.org/10.1007/s11222-021-10057-z
Hosseini M, Rajabi Agereh S, Khaledian Y, Jafarzadeh Zoghalchali H, Brevik EC, Movahedi Naeini SAR (2017) Comparison of multiple statistical techniques to predict soil phosphorus. Appl Soil Ecol 114:123–131. https://doi.org/10.1016/j.apsoil.2017.02.011
Hou E, Chen C, Kuang Y, Zhang Y, Heenan M, Wen D (2016) A structural equation model analysis of phosphorus transformations in global unfertilized and uncultivated soils. Global Biogeochem Cycles 30:1300–1309. https://doi.org/10.1002/2016gb005371
Hou E, Chen C, Luo Y, Zhou G, Kuang Y, Zhang Y, Heenan M, Lu X, Wen D (2018) Effects of climate on soil phosphorus cycle and availability in natural terrestrial ecosystems. Glob Change Biol 24:3344–3356. https://doi.org/10.1111/gcb.14093
Jacobson LM, David MB, Drinkwater LE (2011) A Spatial analysis of phosphorus in the Mississippi river basin. J Environ Qual 40:931–941. https://doi.org/10.2134/jeq2010.0386
Jeong JH, Resop JP, Mueller ND, Fleisher DH, Yun K, Butler EE, Timlin DJ, Shim K-M, Gerber JS, Reddy VR, Kim S-H (2016) Random Forests for Global and Regional Crop Yield Predictions. PloS one 11:e0156571. https://doi.org/10.1371/journal.pone.0156571
Jeong G, Oeverdieck H, Park SJ, Huwe B, Ließ M (2017) Spatial soil nutrients prediction using three supervised learning methods for assessment of land potentials in complex terrain. CATENA 154:73–84. https://doi.org/10.1016/j.catena.2017.02.006
Kang J, Hesterberg D, Osmond DL (2009) Soil organic matter effects on phosphorus sorption: a path analysis. Soil Sci Soc Am J 73:360–366. https://doi.org/10.2136/sssaj2008.0113
Kaya F, Başayiğit L (2022) The effect of spatial resolution of environmental variables on the performance of machine learning models in digital mapping of soil phosphorus, in: 2022 IEEE mediterranean and middle-east geoscience and remote sensing symposium (M2GARSS). IEEE,
Piscataway. https://doi.org/10.1109/M2GARSS52314.2022.9840325
Kaya F, Keshavarzi A, Francaviglia R, Kaplan G, Başayiğit L, Dedeoğlu M (2022) Assessing machine learning-based prediction under different agricultural practices for digital mapping of soil organic carbon and available phosphorus. Agriculture 12:1062. https://doi.org/10.3390/agriculture12071062
King KW, Williams MR, Fausey NR (2015) Contributions of systematic tile drainage to watershed-scale phosphorus transport. J Environ Qual 44:486–494. https://doi.org/10.2134/jeq2014.04.0149
Kleinman PJA, Osmond DL, Christianson LE, Flaten DN, Ippolito JA, Jarvie HP, Kaye JP, King KW, Leytem AB, McGrath JM, Nelson NO, Shober AL, Smith DR, Staver KW, Sharpley AN (2022) Addressing conservation practice limitations and trade-offs for reducing phosphorus loss from agricultural fields. Agric Environ Lett. https://doi.org/10.1002/ael2.20084
Kuhn M, Wickham H (2020) Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Liaw A, Wiener M (2002) Classification and regression by randomforest. R News 2(3):18–22
Liu J, Cade-Menun BJ, Yang J, Hu Y, Liu CW, Tremblay J, LaForge K, Schellenberg M, Hamel C, Bainard LD (2018) Long-term land use affects phosphorus speciation and the composition of phosphorus cycling genes in agricultural soils. Front Microbiol. https://doi.org/10.3389/fmicb.2018.01643
Matos-Moreira M, Lemercier B, Dupas R, Michot D, Viaud V, Akkal-Corfini N, Louis B, Gascuel-Odoux C (2017) High-resolution mapping of soil phosphorus concentration in agricultural landscapes with readily available or detailed survey data. Eur J Soil Sci 68:281–294. https://doi.org/10.1111/ejss.12420
Mayer M (2021) missRanger: Fast Imputation of Missing Values. R package version 2.1.3, https://CRAN.R-project.org/package=missRanger
Metson GS, Iwaniec DM, Baker LA, Bennett EM, Childers DL, Cordell D, Grimm NB, Grove JM, Nidzgorski DA, White S (2015) Urban phosphorus sustainability: systemically incorporating social, ecological, and technological factors into phosphorus flow analysis. Environ Sci Policy 47:1–11. https://doi.org/10.1016/j.envsci.2014.10.005
NCSS (2021) National Cooperative Soil Survey, National Cooperative Soil Survey Soil Characterization Database, online. http://ncsslabdatamart.sc.egov.usda.gov/. Accessed 10 Sep 2021
Plach JM, Macrae ML, Williams MR, Lee BD, King KW (2018) Dominant glacial landforms of the lower Great Lakes region exhibit different soil phosphorus chemistry and potential risk for phosphorus loss. J Great Lakes Res 44:1057–1067. https://doi.org/10.1016/j.jglr.2018.07.005
Potter P, Ramankutty N, Bennett EM, Donner SD (2010) Characterizing the spatial patterns of global fertilizer application and manure production. Earth Interact 14:1–22. https://doi.org/10.1175/2009ei288.1
Qiao L, Wang X, Smith P, Fan J, Lu Y, Emmett B, Li R, Dorling S, Chen H, Liu S, Benton TG, Wang Y, Ma Y, Jiang R, Zhang F, Piao S, Müller C, Yang H, Hao Y, Li W, Fan M (2022) Soil quality both increases crop production and improves resilience to climate change. Nat Clim Chang 12:574–580. https://doi.org/10.1038/s41558-022-01376-8
R Core Team (2022) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/
Ramcharan A, Hengl T, Nauman T, Brungard C, Waltman S, Wills S, Thompson J (2018) Soil property and class maps of the conterminous United States at 100-meter spatial resolution. Soil Sci Soc Am J 82:186–201. https://doi.org/10.2136/sssaj2017.04.0122
Records RM, Wohl E, Arabi M (2016) Phosphorus in the river corridor. Earth Sci Rev 158:65–88. https://doi.org/10.1016/j.earscirev.2016.04.010
Ringeval B, Augusto L, Monod H, van Apeldoorn D, Bouwman L, Yang X, Achat DL, Chini LP, Van Oost K, Guenet B, Wang R, Decharme B, Nesme T, Pellerin S (2017) Phosphorus in agricultural soils: drivers of its distribution at the global scale. Glob Change Biol 23:3418–3432. https://doi.org/10.1111/gcb.13618
RStudio Team (2022) RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA URL http://www.rstudio.com/
Russell KM, Galloway JN, Macko SA, Moody JL, Scudlark JR (1998) Sources of nitrogen in wet deposition to the Chesapeake Bay region. Atmos Environ 32:2453–2465. https://doi.org/10.1016/s1352-2310(98)00044-2
Sadayappan K, Kerins D, Shen C, Li L (2022) Nitrate concentrations predominantly driven by human, climate, and soil properties in US rivers. Water Res 226:119295. https://doi.org/10.1016/j.watres.2022.119295
Sahabiev I, Smirnova E, Giniyatullin K (2021) Spatial prediction of agrochemical properties on the scale of a single field using machine learning methods based on remote sensing data. Agronomy 11:2266. https://doi.org/10.3390/agronomy11112266
Schilling KE, Libra RD (2003) Increased baseflow in Iowa over the second half of the 20th Century. J Am Water Resour Assoc 39:851–860. https://doi.org/10.1111/j.1752-1688.2003.tb04410.x
Schilling KE, Isenhart TM, Wolter CF, Streeter MT, Kovar JL (2021) Contribution of streambanks to phosphorus export from Iowa. J Soil Water Conserv 77:103–112. https://doi.org/10.2489/jswc.2022.00036
Schindler DW (2006) Recent advances in the understanding and management of eutrophication. Limnol Oceanogr 51:356–363. https://doi.org/10.4319/lo.2006.51.1_part_2.0356
Schottler SP, Ulrich J, Belmont P, Moore R, Lauer JW, Engstrom DR, Almendinger JE (2013) Twentieth century agricultural drainage creates more erosive rivers. Hydrol Process 28:1951–1961. https://doi.org/10.1002/hyp.9738
Sharpley AN, Kleinman PJA, Jordan P, Bergström L, Allen AL (2009) Evaluating the success of phosphorus management from field to watershed. J Environ Qual 38:1981–1988. https://doi.org/10.2134/jeq2008.0056
Shen J, Yuan L, Zhang J, Li H, Bai Z, Chen X, Zhang W, Fusuo Z (2011) Phosphorus dynamics: from soil to plant. Plant Physiol 156:997–1005. https://doi.org/10.1104/pp.111.175232
Shen LQ, Amatulli G, Sethi T, Raymond P, Domisch S (2020) Estimating nitrogen and phosphorus concentrations in streams and rivers, within a machine learning framework. Sci Data. https://doi.org/10.1038/s41597-020-0478-7
Smith DR, King KW, Johnson L, Francesconi W, Richards P, Baker D, Sharpley AN (2015) Surface runoff and tile drainage transport of phosphorus in the Midwestern United States. J Environ Qual 44:495–502. https://doi.org/10.2134/jeq2014.04.0176
Soil Survey Staff (2014) Kellogg Soil Survey Laboratory Methods Manual. Soil Survey Investigations Report No. 42, Version 5.0. R. Burt and Soil Survey Staff (ed.). U.S. Department of Agriculture, Natural Resources Conservation Service., p. 457. https://www.nrcs.usda.gov/Internet/FSE_DOCUMENTS/stelprdb1253872.pdf
Soil Survey Staff (2021) Gridded Soil Survey Geographic (gSSURGO) Database for . United States Department of Agriculture, Natural Resources Conservation Service. Available online at https://gdg.sc.egov.usda.gov/ Accessed 10 Sep 2021
Stackpoole SM, Stets EG, Sprague LA (2019) Variable impacts of contemporary versus legacy agricultural phosphorus on US river water quality. Proc Natl Acad Sci 116:20562–20567. https://doi.org/10.1073/pnas.1903226116
USDA National Agricultural Statistics Service Cropland Data Layer (2022) Published crop-specific data layer [Online]. Available at https://nassgeodata.gmu.edu/CropScape/. USDA-NASS, Washington, DC. Accessed 3 Feb 2023
USGS (2004) The National Geochemical Survey - database and documentation: U.S. Geological Survey Open-File Report 2004–1001, U.S. Geological Survey, Reston VA. 2021. https://doi.org/10.3133/ofr20041001 Accessed 22 June 2022
Vadas PA, Kleinman PJA, Sharpley AN, Turner BL (2005) Relating soil phosphorus to dissolved phosphorus in runoff: a single extraction coefficient for water quality modeling. J Environ Qual 34:572–580. https://doi.org/10.2134/jeq2005.0572
Valayamkunnath P, Barlage M, Chen F, Gochis DJ, Franz KJ (2020) Mapping of 30-meter resolution tile-drained croplands using a geospatial modeling approach. Sci Data. https://doi.org/10.1038/s41597-020-00596-x
Van Meter KJ, McLeod MM, Liu J, Tenkouano GT, Hall RI, Van Cappellen P, Basu NB (2021) Beyond the Mass balance: watershed phosphorus legacies and the evolution of the current water quality policy challenge. Water Res Res. https://doi.org/10.1029/2020wr029316
Vitousek PM, Porder S, Houlton BZ, Chadwick OA (2010) Terrestrial phosphorus limitation: mechanisms, implications, and nitrogen–phosphorus interactions. Ecol Appl 20:5–15. https://doi.org/10.1890/08-0127.1
Wadoux AMJ-C, Román-Dobarco M, McBratney AB (2020) Perspectives on data-driven soil research. Eur J Soil Sci 72:1675–1689. https://doi.org/10.1111/ejss.13071
Wright MN, Ziegler A (2017) ranger: a fast implementation of random forests for high dimensional data in C++ and R. J Stat Softw. https://doi.org/10.18637/jss.v077.i01
Wuenscher R, Unterfrauner H, Peticzka R, Zehetner F (2016) A comparison of 14 soil phosphorus extraction methods applied to 50 agricultural soils from Central Europe. Plant, Soil Environ 61:86–96. https://doi.org/10.17221/932/2014-pse
Zhong S, Zhang K, Bagheri M, Burken JG, Gu A, Li B, Ma X, Marrone BL, Ren ZJ, Schrier J, Shi W, Tan H, Wang T, Wang X, Wong BM, Xiao X, Yu X, Zhu J-J, Zhang H (2021) Machine learning: new ideas and tools in environmental science and engineering. Environ Sci Technol. https://doi.org/10.1021/acs.est.1c01339

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.