Modeling Phytoremediation of Heavy Metal Contaminate 2023 Journal of Hazardo
ARTICLE INFO

Editor: Lingxin CHEN

Keywords: Heavy metal; Hyperaccumulator; Machine learning; Phytoextraction; Soil remediation

ABSTRACT

As an important subtopic within phytoremediation, hyperaccumulators have garnered significant attention due to their ability to super-enrich heavy metals. Identifying the factors that affect phytoextraction efficiency has important application value in guiding the efficient remediation of heavy metal contaminated soil. However, it is challenging to identify the critical factors that affect the phytoextraction of heavy metals in soil–hyperaccumulator ecosystems because current projections of phytoremediation rely on simple linear models and are rudimentary at best. Here, machine learning (ML) approaches were used to predict the important factors that affect the phytoextraction efficiency of hyperaccumulators. The ML analysis was based on 173 data points and considered soil properties, experimental conditions, plant families, low-molecular-weight organic acids from plants, plant genes, and heavy metal properties. Heavy metal properties, especially the metal ion radius, were the most important factors affecting heavy metal accumulation in shoots, and the plant family was the most important factor affecting the bioconcentration factor, metal extraction ratio, and remediation time. Furthermore, the Crassulaceae family had the highest potential as hyperaccumulators for phytoremediation, which was related to the expression of genes encoding heavy metal transporting ATPase (HMA), metallothioneins (MTL), and natural resistance associated macrophage protein (NRAMP), and also to the secretion of malate and threonine. New insights from ML model interpretation into the effects of plant characteristics, experimental conditions, soil characteristics, and heavy metal properties on phytoextraction efficiency could guide efficient phytoremediation by identifying the best hyperaccumulators and resolving their remediation mechanisms.

* Corresponding authors.
E-mail addresses: wangxiaonan@tsinghua.edu.cn (X. Wang), yongsikok@korea.ac.kr (Y.S. Ok).
1 These authors contributed equally to this work.
https://doi.org/10.1016/j.jhazmat.2022.129904
Received 12 May 2022; Received in revised form 24 August 2022; Accepted 1 September 2022
Available online 5 September 2022
0304-3894/© 2022 Published by Elsevier B.V.
L. Shi et al. Journal of Hazardous Materials 441 (2023) 129904
Fig. 1. Schematic diagram of the machine learning workflow applied in this work. Driven by the ML model, 173 available input and output data were used to evaluate the heavy metal enrichment effect of hyperaccumulators.

Science (https://www.webofscience.com). We used ‘Hyperaccumulator’, ‘Heavy metal’, and ‘pH’ as keywords to search for papers on ‘Web of Science’. In addition, we used ‘Hyperaccumulator’, ‘Heavy metal’, ‘pH’, ‘Cation exchange capacity’, ‘Organic carbon’, ‘Temperature’, ‘Organic acid’, ‘Gene’, ‘Time’, ‘Pot’, ‘Field’, ‘Soil weight’, ‘Ni’, ‘Cd’, ‘Zn’, ‘As’, and ‘Cu’ as keywords to search for papers on ‘Google Scholar’. Data pertaining to soil, heavy metals, plants, and experiments were extracted as inputs, and the shoot heavy metal concentration (HMshoot, µg/g), shoot yield (yield, mg/plant), BCFs, metal extraction ratio (MER, %), and remediation time (RT, per kg soil [yr]) were identified as outputs to compile the original datasets (Fig. 1).

The soil properties considered were pH, cation exchange capacity (CEC, cmol/kg), and organic carbon (OC, g/kg). The heavy metal properties considered were the electronegativity of heavy metals (HM_x), the ion radius of heavy metals (HM_r, nm), and the total heavy metal concentration (μg/g). These properties were identified from only 20 papers found by searching across major databases. Information regarding the different plants obtained from the literature was categorized at the family level. The experimental conditions, including the temperature difference during planting, planting time, soil mass used for pot experiments, and pot depth, were considered for modeling. A dataset containing 173 data points was obtained, covering five heavy metals (As, Cd, Cu, Ni, and Zn), seven plant families, and seven soil types. All of the data came from the 20 papers that provided all the necessary information in common, including experimental conditions, hyperaccumulator information, heavy metal properties, and soil properties.

To interpret the phytoremediation mechanism, eight types of low molecular weight organic acids (LMWOAs) generated by plants and 21 types of gene expression in plants were considered as additional outputs to develop individual classification models that could aid in understanding the mechanism (Fig. 1). Since organic acids can chelate heavy metals, they may change the available content of heavy metals in the soil. Under the regulatory action of heavy metal-related transport genes, the accumulation of heavy metals by plants increases. Therefore, we also targeted these two aspects (LMWOAs and gene expression) to investigate their impacts on the phytoremediation efficiency of hyperaccumulators.

2.2. Data pre-processing and ML model development

For the preliminary datasets, the unit of each variable was made uniform before model development. Descriptive statistics of the compiled datasets showed that 2.9% of the temperature difference data and 4.6% of the soil weight data were missing. The missing values were filled in using the K-Nearest Neighbor method to complete the dataset for model training. After this data filling, we obtained three datasets, for phytoremediation properties, plant acid generation, and plant gene expression, respectively. Each dataset contained 173 data points without missing values. Moreover, one-hot encoding was performed to convert the categorical features (heavy metal type, plant family, LMWOA, and plant genes) into a one-hot numeric array (Pant et al., 2021). To ease downstream model training and testing, the entire dataset was partitioned into two parts: 80% for hyperparameter tuning based on 5-fold cross-validation (CV) and ML model training, and the remaining 20% as a test set for validating the generalization ability of the trained models.

Based on our previous work, extreme gradient boosting (XGBoost), an effective ML algorithm for modeling the application of carbon-related materials (Yuan et al., 2021), was employed to develop a multilabel prediction model that could predict the five above-mentioned outcomes (i.e., HMshoot, yield, BCFs, MER, and RT) at the same time. This choice was motivated by the success of tree-based ensemble algorithms and their good tradeoff between bias and variance, which helps avoid overfitting in regression models. Compared with the gradient boosting models demonstrated in our previous study (Li et al., 2021a), XGBoost uses more accurate approximations to identify the best tree model, i.e., it computes second partial derivatives (second-order gradients) of the loss function to obtain more information about the gradient direction (Chen and Guestrin, 2016). Moreover, regularization terms were integrated as penalties to avoid bias and improve model generalization. In XGBoost, four important hyperparameters, namely n_estimators, learning_rate, subsample ratio, and max_depth, were adjusted to adapt to our dataset. For tree-based models, data normalization is unnecessary because tree splits do not depend on the absolute values of the inputs (Li et al., 2021b). Therefore, the original input values were used directly for the development of the XGBoost model.

To identify the types of LMWOA generated and the effect of plant genes on the mechanism of heavy metal phytoremediation, the ANN algorithm with ‘sigmoid’ as the activation function in the output layer was applied to develop classification models. ANN was selected for classification because it has been proven able to handle gene expression profiling and is also quite popular in the phytoremediation domain. A detailed description of the ANN algorithm is given in our previous paper (Li et al., 2021b). Here, ANNs with two hidden layers were developed, and the number of neurons in each hidden layer was optimized by searching from 2 to 128. Moreover, the activation function in the hidden layers was ‘relu’, and ‘adam’ was selected as the optimizer with a learning rate of 0.001 during model training. To train the ANN classification model, in addition to the inputs considered in the previous regression model, the heavy metal type was considered by creating a binary column for each heavy metal. Subsequently, all inputs in the entire dataset were normalized before the training and test data were split, by removing the mean and scaling to unit variance, to improve convergence during model training (Li et al., 2021c).

2.3. ML model performance evaluation and model-based feature analysis

Once the ML models were developed, the remaining 20% of the data points were used to evaluate the prediction performance. For the regression model, the determination coefficient (R2) and root mean squared error (RMSE) were applied to assess the prediction accuracy (Yuan et al., 2021), with an R2 value closer to one indicating a better prediction and a smaller RMSE representing a higher accuracy. For the classification models, performance was measured by the accuracy score and the F1 score (Anon, 2022a).

$$R^2(y, \hat{y}) = 1 - \frac{\sum_{n=1}^{N} (\hat{y}_n - y_n)^2}{\sum_{n=1}^{N} (y_n - \bar{y})^2} \tag{1}$$

$$\mathrm{RMSE}(y, \hat{y}) = \sqrt{\frac{\sum_{n=1}^{N} (\hat{y}_n - y_n)^2}{N}} \tag{2}$$
$$\text{accuracy score}(y, \hat{y}) = \frac{1}{N} \sum_{i=0}^{N-1} I(\hat{y}_i = y_i) \tag{3}$$

where $y$, $\bar{y}$, and $\hat{y}$ are the true value, average of the true value, and predicted value among the total number of data points (N) for each prediction target, respectively. $I(\hat{y}_i = y_i)$ represents the indicator function, as follows:

$$I(\hat{y}_i = y_i) = \begin{cases} 1 & \text{if } \hat{y}_i = y_i, \\ 0 & \text{if } \hat{y}_i \neq y_i. \end{cases} \tag{4}$$

$$\text{F1 score} = 2 \times \frac{Pr \times Re}{Pr + Re} \tag{5}$$

$$Pr = \frac{TP}{TP + FP} \tag{6}$$

$$Re = \frac{TP}{TP + FN} \tag{7}$$

where Pr and Re are the precision and recall of the classification model, and TP, FN, and FP are the numbers of true positives, false negatives, and false positives, respectively. The “micro” option was selected for the F1 score calculation, so that the metrics were calculated globally by counting the total true positives, false negatives, and false positives.

Feature analysis, including feature importance and feature correlation, was interpreted based on the developed models with the best prediction performance. For the regression model, the feature importance was obtained automatically from the Gini importance of the tree-based XGBoost model (Rosa, 2022). This model provides a score indicating the value of each feature during the construction of the trees within the model. The importance was calculated from the improvement in performance, measured by the Gini index, when a split point was selected to develop the trees. The final feature importance was averaged over all constructed trees in the model. It should be noted that although the Gini importance tends to be biased towards numerical features, such feature importance is acceptable here since it has been cross-validated by another feature analysis method (Palansooriya et al., 2022). For the feature correlation, XGBoost was integrated with the partial dependence plot (PDP) Python tool (PDPbox) to visualize the marginal effect of a specified input variable at different values on the model outcome (Anon, 2022b). In the ANN classification models, feature importance was identified using the SHAP method, which has proved useful for explaining black box models. The SHAP method determines the importance of features based on the SHAP value, a concept from cooperative game theory. More details regarding the SHAP method are available in our previous paper (Li et al., 2020).

3. Results and discussion

3.1. Statistical analysis of data acquired

Table S1 lists the basic descriptive statistics of the heavy metal concentrations in the soil samples. Ranked by average concentration, the heavy metals in soil followed the order: Ni (1014 mg kg⁻¹; range: 553–1780 mg kg⁻¹) > Zn (631 mg kg⁻¹; range: 109–3170 mg kg⁻¹) > Cu (294 mg kg⁻¹; range: 196–801 mg kg⁻¹) > As (131 mg kg⁻¹; range: 126–135 mg kg⁻¹) > Cd (7.1 mg kg⁻¹; range: 0.34–48 mg kg⁻¹). The concentrations of all heavy metals in the soil were higher than the typical background concentrations and ecological soil screening levels. The coefficient of variation (CV) represents the degree of variability in heavy metal concentrations. The high CV of heavy metals in soils from the survey region indicated that the accumulation of heavy metals in these soils was likely due to anthropogenic activities (Manta et al., 2002).

Table S2 presents the basic statistical characteristics of the heavy metal concentrations in the hyperaccumulators. Ranked by average concentration, the heavy metals in the hyperaccumulators followed the order: Ni (8586 mg kg⁻¹; range: 1303–23387 mg kg⁻¹) > Zn (3430 mg kg⁻¹; range: 79.7–15469 mg kg⁻¹) > As (3109 mg kg⁻¹; range: 1080–5138 mg kg⁻¹) > Cd (96.8 mg kg⁻¹; range: 0.45–1310 mg kg⁻¹) > Cu (15.6 mg kg⁻¹; range: 5.7–35.6 mg kg⁻¹). The CVs of the Ni, Zn, As, Cd, and Cu concentrations were 72.07%, 114.63%, 92.29%, 190.48%, and 62.32%, respectively. The CVs of all heavy metal concentrations in the hyperaccumulators indicated high or exceptionally high variability (Ridgeway, 2020). The maximum concentrations of Ni, Zn, As, Cd, and Cu in the hyperaccumulators here were significantly higher than the defined heavy metal concentrations in typical hyperaccumulators.

Figure S3 shows the BCFs of the heavy metals in the different hyperaccumulators. The mean BCFs for each heavy metal, based on average values from different hyperaccumulators, were as follows: As (24.39) > Cd (21.96) > Ni (7.71) > Zn (7.21) > Cu (0.05). Crassulaceae exhibited the highest BCFs for Cd (37.33) and Zn (9.50); Brassicaceae for Cu (0.06) and Ni (7.71); and Pteridaceae for As (24.39).

3.2. Development of regression predictive model

In our preliminary regression model development trials, it was difficult to fit the ML models to the five identified targets (HMshoot [µg/g], yield [mg/plant], BCF, MER [%], and RT per kg soil [yr]). A data distribution analysis showed that the data were not normally distributed, owing to the dispersed data points (see Figure S1). To solve this issue and improve the model performance, a logarithmic transformation was employed to redistribute the data (see Figure S2).

Apart from the data preprocessing, several critical hyperparameters in XGBoost were adjusted using the training dataset, based on five-fold cross-validation, to obtain a model with good prediction performance. The average validation RMSE over all targets was used to determine the optimal hyperparameters. Based on the hyperparameter tuning results of XGBoost (see Figure S4), the average RMSE decreased as the number of trees (n_estimators) increased from 10 to 30, and then decreased further as the learning rate increased to 0.1. Moreover, based on an optimal n_estimators of 30 and a learning rate of 0.1, the subsample rate and maximum depth (max_depth) of the trees were further optimized. The results indicated that a smaller average RMSE was achieved with a subsample rate of 0.7 and a maximum depth of 5 (see Figure S4). Based on the optimal hyperparameters, two XGBoost models were trained, with the output data before and after logarithmic transformation (Sigmund et al., 2020). The prediction performance improved after the original output data were logarithmically transformed, particularly for the testing performance. The XGBoost model performed better with the transformed data, which were closer to a normal distribution, since their value ranges were more compact and balanced and thus more suitable for a regression model (Table 1, Figures S1 and S2).

The original experimental data versus the predicted data for the shoot heavy metal concentration (HMshoot), shoot yield (yield), BCF, MER, and RT are shown as scatter plots (see Fig. 2). The line y = x indicates that the predicted values are equal to the measured values; the closer the dots are to the y = x line, the better the prediction. Almost all the training and test points were concentrated near the y = x line; however, the test points were less densely distributed than the training points because of the slightly lower accuracy on unseen data. The R2 values of the XGBoost model for the HMshoot, yield, BCF, MER, and RT in the test dataset were 0.93, 0.79, 0.89, 0.92, and 0.91, respectively, which were higher than the average 5-fold CV R2. However, these values were still within the range of the average 5-fold CV R2 plus one standard deviation (see Figure S5). This result indicated that our developed model did not have serious overfitting issues.
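The evaluation steps described in Sections 2.3 and 3.2 (Eqs. (1) and (2), five-fold cross-validation, and the logarithmic transformation of the outputs) can be illustrated with a minimal standard-library sketch. It is not the authors' code. The helper names, the base-10 logarithm, the synthetic targets, and the mean-predicting baseline that stands in for XGBoost are all assumptions made for illustration.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination, Eq. (1)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean squared error, Eq. (2)."""
    sq_err = sum((p - t) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(sq_err / len(y_true))


def k_fold_indices(n_samples, k=5, seed=0):
    """Return a list of (train_idx, val_idx) pairs for shuffled k-fold CV."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    splits = []
    for i, val in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        splits.append((train, val))
    return splits


def cv_rmse_mean_baseline(y, k=5, log_transform=False):
    """Average validation RMSE of a baseline that predicts the training mean.

    The baseline is a hypothetical stand-in for a real regressor (not XGBoost).
    The base-10 logarithm is an assumption; the paper does not state the base.
    """
    target = [math.log10(v) for v in y] if log_transform else list(y)
    fold_scores = []
    for train, val in k_fold_indices(len(target), k):
        mean_train = sum(target[j] for j in train) / len(train)
        y_val = [target[j] for j in val]
        fold_scores.append(rmse(y_val, [mean_train] * len(y_val)))
    return sum(fold_scores) / k


# Made-up, strongly right-skewed targets spanning several orders of magnitude.
y_synthetic = [10 ** (i / 4) for i in range(20)]

print(cv_rmse_mean_baseline(y_synthetic))                      # raw scale
print(cv_rmse_mean_baseline(y_synthetic, log_transform=True))  # log10 scale
```

The two printed RMSE values are on different scales, as are the original and log-transformed rows of Table 1, so R2 is the more comparable quantity when judging the effect of the transformation.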
Table 1
Training and testing performance of the developed XGBoost models based on original and log-transformed output data.

Items            Output parameter      Training R2  Training RMSE  Testing R2  Testing RMSE
Original         HMshoot (µg/g)        0.97         629            0.77        1371
Original         Yield (mg/plant)      0.98         346            -0.15       1804
Original         BCF                   0.97         2.99           0.59        13.19
Original         MER (%)               0.97         0.80           0.52        2.11
Original         RT per kg soil (yr)   0.91         8420           0.71        9460
Log-transformed  HMshoot (µg/g)        0.96         0.57           0.93        0.66
Log-transformed  Yield (mg/plant)      0.93         0.55           0.79        0.77
Log-transformed  BCF                   0.95         0.44           0.89        0.69
Log-transformed  MER (%)               0.95         0.60           0.92        0.77
Log-transformed  RT per kg soil (yr)   0.96         0.59           0.91        0.83

Note: HMshoot: heavy metal concentration in shoots (µg/g); yield: shoot yield (mg/plant); BCF: bioconcentration factor; MER: metal extraction ratio (%); RT per kg soil (yr): remediation time (1 year, 1 kg contaminated soil).

Fig. 2. Multi-task predicted data vs original experimental data of (a) heavy metal concentration in shoot (HMshoot), (b) shoot yield (Yield), (c) BCF, (d) MER, and (e) RT based on the optimized XGBoost models with training and testing datasets. 173 available input and output data were used to develop the predictive model.

3.3. Model-based interpretation

Further, a feature analysis of each input with respect to each output was performed to understand the phytoremediation process (see Fig. 3). The input variables were categorized into four types to determine the importance of each feature type: plant family, experimental conditions, soil properties, and metal properties. Fig. 3a shows that the metal properties contributed 71% to the HMshoot in terms of importance, whereas the soil properties contributed only 4%. The ionic radius of the metal was the most important feature for the HMshoot, followed by the pot depth and total heavy metal concentration in the soil (see Fig. 3a). Although these were the top three features, they differed significantly in importance for the HMshoot: the importance of the pot depth and of the total heavy metal concentration was less than 25% of that of the metal ion radius. This showed that the heavy metal concentration in the shoots was primarily determined by the metal ion radius. Based on the correlation analysis of the top four features (see Fig. 4a), the metal ion radius and pot depth were negatively associated with the HMshoot, and they exhibited a linear relationship with a high slope in the metal ion radius range of 0.07–0.10 nm and the pot depth range of 20–100 cm. In addition, the total heavy metal concentration in soil and the soil mass (0.6–1 kg) contributed positively to the HMshoot (see Fig. 4a). The results showed that a high HMshoot was primarily due to a small metal ion radius, shallow pot depth, high total heavy metal concentration in soil, and soil mass (0.6–1 kg). Gu and Lan (2021) discovered that the adsorption capacity of Neochloris oleoabundans biomass for the divalent metal ions investigated in their study was proportional to the electronegativity and inversely proportional to the radius of the metal ions; however, the exact reason has not been clarified.

Fig. 3b shows that the experimental conditions contributed 48% to yield prediction, whereas the metal properties contributed only 4%. Soil mass was the most important feature for yield, followed by OC in soil, Brassicaceae, and planting time. Based on the correlation analysis of the top four experimental condition features (see Fig. 4b), the soil mass was positively associated with the yield in the range of 0.6–1.5 kg and negatively associated in the range of 4–6 kg. The soil OC was negatively associated with the yield when the concentration was 24.6 to 81.7 g/kg. The planting time was positively associated with the yield from 0.83 to 2.84 months. Meanwhile, the yield increased and reached a maximum when the temperature difference was 4 °C; this increase was associated with an improvement in soil enzyme activity. The yield of buckwheat (Polygonaceae family) increased owing to improvements in the accumulated temperature, temperature and water use efficiency,
and soil organic carbon. Qu and Feng (2020) discovered that buckwheat yield increased owing to improvements in the OC in the soil, accumulated temperature, and temperature and water use efficiency. In addition, the increase in OC in the soil was associated with an improvement in soil enzyme activity.

Fig. 3. Prediction plots (training and testing) based on feature importance of XGBoost for (a) heavy metal concentration in shoot (HMshoot), (b) shoot yield (Yield), (c) BCF, (d) MER, and (e) RT. Plant family: Amaranthaceae, Asteraceae, Brassicaceae, Crassulaceae, Fabaceae, Pteridaceae, and Solanaceae; Experimental conditions: T difference (°C), planting time (months), pot depth (cm), soil mass (kg); Soil properties: soil pH, soil CEC (cmol/kg), and soil OC (g/kg); Metal properties: HM_x, HM_r (nm), and total heavy metal concentration (µg/g). Note: HM_r (nm): Ion radius of heavy metals; Total HM conc (μg/g): Total heavy metal concentration in soil before planting; Soil mass (kg): Fresh weight of soil used in each treatment; HM_x: Electronegativity of heavy metals; T difference (°C): Temperature difference; Soil CEC (cmol/kg): Soil cation exchange capacity; Soil OC (g/kg): Soil organic carbon.

For the BCF (see Fig. 3c), the plant family was the most important feature type, accounting for 41%, and Crassulaceae, soil mass, and pot depth were the top three features. Employing appropriate plants is the key to the success of phytoremediation. The high BCF and MER of Crassulaceae may be due to its high biomass, high growth rate, and strong ability to absorb and accumulate heavy metals compared with other hyperaccumulators (Shen et al., 2022). The soil mass (0.6–1 kg) was positively associated with the BCF. However, the pot depth, HM_x, and temperature difference were negatively associated with the BCF (see Fig. 4c). Although we could not find reports on the relationship between soil mass and BCF in the literature, we speculate that, within a certain mass range, the total heavy metal content in the culture system gradually increases as the soil mass increases, and the enrichment capacity of plants is gradually enhanced up to a threshold; once the threshold at which plants can absorb heavy metals is reached, the BCF remains unchanged. For pot depth, Tőzsér et al. (2017) found increasing element concentrations toward deeper layers, which could explain the relatively low BCF in deep-soil experiments. For HM_x, the low accumulation efficiency of heavy metals in plants might be related to the high electronegativity (Fan et al., 2016). In addition, a higher temperature difference may increase the growth rate of plants, thereby diluting the content of heavy metals in plants and resulting in a decrease in BCF (Venzhik et al., 2015).

Fig. 3d shows that the plant family type accounted for 58% of the importance for the MER, and Crassulaceae, pot depth, and Pteridaceae were the most notable features. Pot depth was negatively correlated with the MER (Fig. 4d); this might also be due to the higher heavy metal concentration in the deeper soil layer (Tőzsér et al., 2017). However, more in-depth research may be needed to explain why the contribution of the ‘Plant family’ feature to BCF and MER is significantly higher than that of ‘Experimental condition’, ‘Soil property’, and ‘Metal property’. For RT, the plant family was the most important feature type, accounting for 51%, followed by the experimental conditions, metal properties, and soil properties (see Fig. 3e). These feature importance results were similar to those of the MER. The pot depth, temperature difference, total heavy metal concentration, and HM_x were all positively associated with RT (see Fig. 4e).

3.4. LMWOA and gene identification for in-depth interpretation

Figure S5 and Figure S6 show the accuracy results of five-fold cross-validation for hyperparameter tuning of the ANN model, based on 80% of the data points from the dataset. The accuracy for organic acids (a) and genes (b) increased as the number of neurons in the first hidden layer increased from 2 to 16. In addition, as the number of neurons in the second hidden layer increased, the accuracy increased. However, no further improvement for organic acid identification was observed as
the number of neurons in the first hidden layer increased further from 16 to 128 with more than 64 neurons in the second hidden layer. Therefore, the final optimized hyperparameters of the ANN for organic acid identification were 16 and 46 neurons for the first and second hidden layers, respectively. Similarly, the optimized numbers of neurons in the first and second layers of the ANN for gene identification were 32 and 128. Once the optimal hyperparameters of the ANN were determined, it was retrained with all the training data for LMWOA and gene identification. As shown in Fig. 5, the accuracy and F1 score of the classification model for LMWOAs were lower for the test data than for the training data, though not significantly, with a test accuracy and F1 score of approximately 0.8 and 0.85 for identifying

Fig. 4. Correlation of top-four continuous input features with log-based (a) heavy metal concentration in shoot (HMshoot), (b) shoot yield (yield), (c) bioconcentration factor (BCF), (d) metal extraction ratio (MER), and (e) RT. Note: HM_r (nm): Ion radius of heavy metals; Total HM conc (μg/g): Total heavy metal concentration in soil before planting; Soil mass (kg): Fresh weight of soil used in each treatment; Soil OC (g/kg): Soil organic carbon; T difference (°C): Temperature difference; HM_x: Electronegativity of heavy metals.
Fig. 6. Feature importance analysis with respect to (a) LMWOA and (b) genes of plants based on the explanation of ANN model using SHAP values. Note: Soil mass
(kg): Fresh weight of soil used in each treatment; Soil CEC (cmol/kg): Cation exchange capacity; T difference (℃): Temperature difference; Total HM conc (μg/g):
Total heavy metal concentration in soil before planting; HM_X: Electronegativity of heavy metals; HM_r (nm): Ion radius of heavy metals; Soil OC (g/kg): Soil organic
carbon; Plant family: Amaranthaceae, Asteraceae, Brassicaceae, Crassulaceae, Fabaceae, and Solanaceae.
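The classification metrics of Eqs. (3)–(7), used above to assess the ANN models, can be sketched in plain Python as follows. This is an illustrative sketch rather than the authors' implementation. The 0.5 cut-off applied to the sigmoid outputs, the exact-match reading of Eq. (3) for multi-label rows, the function names, and the toy labels are all assumptions.

```python
def binarize(probabilities, threshold=0.5):
    """Turn sigmoid outputs into 0/1 labels (the 0.5 cut-off is an assumption)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probabilities]


def accuracy_score(y_true, y_pred):
    """Eqs. (3)-(4): share of samples whose predicted label row matches exactly."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p))
    return hits / len(y_true)


def micro_f1(y_true, y_pred):
    """Eqs. (5)-(7) with TP, FP, and FN pooled over all samples and labels."""
    tp = fp = fn = 0
    for t_row, p_row in zip(y_true, y_pred):
        for t, p in zip(t_row, p_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    if tp == 0:
        return 0.0  # no true positives: F1 is reported as zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up example: 3 samples, 3 binary labels (e.g. three organic acids).
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
sigmoid_out = [[0.9, 0.2, 0.7], [0.1, 0.4, 0.3], [0.8, 0.6, 0.6]]
y_pred = binarize(sigmoid_out)

print(accuracy_score(y_true, y_pred))
print(micro_f1(y_true, y_pred))
```

Because the micro-averaged F1 score pools counts over every label, it can exceed the exact-match accuracy, which only credits samples whose whole label row is correct.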
containing soils. The developed ML models predicted the heavy metal concentration in shoots, yield, BCFs, MER, and RT for different heavy metals, and also identified the acid generation and gene expression for a deeper interpretation of soil–hyperaccumulator ecosystems. The approach was used to quantify the importance of variables and identify potential control factors affecting phytoremediation efficiency in soil–hyperaccumulator systems, as well as to suggest suitable hyperaccumulators for soil contaminated with a specific heavy metal to accelerate the remediation process. However, this study has some limitations that could be addressed in the future. First, extreme gradient boosting (XGBoost) should be compared with other ML algorithms to evaluate their suitability for phytoremediation models. Second, the amount of data needs to be increased to improve the model's accuracy. For example, factors that might affect the phytoremediation efficiency, such as the geographical location of the soil, climate factors, and soil texture, were not comprehensively considered because of the lack of data. Moreover, model-guided phytoremediation experiments, based on the suitable plant and on the experimental and soil conditions, could be another direction for continuing the present work.

4. Conclusions

In summary, the ‘plant family’ was the most important feature for phytoremediation, followed by experimental conditions, soil properties, and heavy metal properties. In addition, the plant family dominated the BCF, MER, and RT. The Crassulaceae family had the highest potential as hyperaccumulators for phytoremediation, which was related to the expression of HMA, MTL, and NRAMP genes. The metal ion radius was the most important factor affecting heavy metal concentration in shoots. In addition to providing a comprehensive interpretation of phytoremediation, the developed ML model can be adapted to other phytoremediation systems to evaluate the soil remediation performance by predicting the final heavy metal distribution in plants and the plant growth. Moreover, the ML model can be used to design new phytoremediation experiments and guide field phytoremediation for a specific heavy metal-contaminated soil.

CRediT authorship contribution statement

L.S. and J.L. contributed equally to this work. L.S.: data collection, writing (review and editing), and visualization; J.L.: modeling, writing (review and editing), and visualization; K.N.P.: review and editing; X.N.W.: conceptualization, writing (review and editing), and supervision; and Y.S.O.: conceptualization, writing (review and editing), and supervision.

Republic of Korea. This work was also supported by the International Postdoctoral Exchange Program Fellowship (PC2020041). This work was also supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A2C2011734). This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2021R1A6A1A10045235).

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.jhazmat.2022.129904.

References

Anon, 2022a. https://scikit-learn.org/stable/modules/model_evaluation.html#accuracy-score.
Anon, 2022b. https://github.com/SauceCat/PDPbox.
Bertin, V., Allemon, J., Sajet, P., Dieu, S., Papin, A., Collet, S., Gaucher, R., Chalot, M., Michiels, B., Raventos, C., 2017. Torrefaction and pyrolysis of metal-enriched poplars from phytotechnologies: effect of temperature and biomass chlorine content on metal distribution in end-products and valorization options. Biomass Bioenergy 96, 1–11.
Chen, T., Guestrin, C., 2016. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794.
Cipullo, S., Snapir, B., Prpich, G., Campo, P., Coulon, F., 2019. Prediction of bioavailability and toxicity of complex chemical mixtures through machine learning models. Chemosphere 215, 388–395.
Duquène, L., Vandenhove, H., Tack, F., Meers, E., Baeten, J., Wannijn, J., 2009. Enhanced phytoextraction of uranium and selected heavy metals by Indian mustard and ryegrass using biodegradable soil amendments. Sci. Total Environ. 407, 1496–1505.
Fan, C.H., Bo, D.U., Zhang, Y.C., Gao, Y.L., Chang, M., 2016. Determination of lead and cadmium in Calendula officinalis seedlings for phytoremediation of multi-contaminated loess by using flame atomic absorption spectrometry with wet digestion. Spectrosc. Spectr. Anal. 36, 2625–2628.
Gu, S.W., Lan, C.Q., 2021. Biosorption of heavy metal ions by green alga Neochloris oleoabundans: effects of metal ion properties and cell wall structure. J. Hazard. Mater. 418, 126336.
Hanandeh, I.E., Mahdi, Z., Imtiaz, M.S., 2021. Modelling of the adsorption of Pb, Cu and Ni ions from single and multi-component aqueous solutions by date seed derived biochar: comparison of six machine learning approaches. Environ. Res. 192, 110338.
Hou, D., O'Connor, D., Igalavithana, A.D., Alessi, D.S., Luo, J., Tsang, D.C.W., Sparks, D.L., Yamauchi, Y., Rinklebe, J., Ok, Y.S., 2020. Metal contamination and bioremediation of agricultural soils for food safety and sustainability. Nat. Rev. Earth Environ. 1 (7), 366–381.
Hu, B.F., Xue, J., Zhou, Y., Shao, S., Fu, Z.Y., Li, Y., Chen, S.C., Qi, L., Shi, Z., 2020. Modelling bioaccumulation of heavy metals in soil-crop ecosystems and identifying its controlling factors using machine learning. Environ. Pollut. 262, 114308.
Hu, X.T., Li, T., Xu, W.H., Chai, Y.R., 2021. Distribution of cadmium in subcellular fraction and expression difference of its transport genes among three cultivars of pepper. Ecotoxicol. Environ. Saf. 216 (15), 112182.
Jin, Y.L., Wang, L.W., Song, Y.N., Zhu, J., Qin, M.H., Wu, L.H., Hu, P.J., Li, F.B., Fang, L.
P., Chen, C., Hou, D.Y., 2021. Integrated life cycle assessment for sustainable
Statement of environmental implication remediation of contaminated agricultural soil in China. Environ. Sci. Technol. 55
(17), 12032.
Machine learning can guide the phytoremediation of heavy metal- Li, J., Pan, L., Suvarna, M., Tong, Y.W., Wang, T., 2020. Fuel properties of hydrochar and
pyrochar: prediction and exploration with machine learning. Appl. Energy 269,
contaminated soils to improve remediation efficiency. 115166.
Li, J., Pan, L., Suvarna, M., Wang, X., 2021a. Machine learning aided supercritical water
Declaration of Competing Interest gasification for H2-rich syngas production with process optimization and catalyst
screening. Chem. Eng. J. 426, 131285.
Li, J., Zhang, W.J., Liu, T.G., Yang, L.H., Li, H.L., Peng, H.Y., Jiang, S.J., Wang, X.N.,
The authors declare that they have no known competing financial 2021b. Machine learning aided bio-oil production with high energy recovery and
interests or personal relationships that could have appeared to influence low nitrogen content from hydrothermal liquefaction of biomass with experiment
verification. Chem. Eng. J. 425, 130649.
the work reported in this paper. Li, J., Zhu, X.Z., Li, Y.N., Tong, Y.W., Wang, X.N., 2021c. Multi-Task prediction and
optimization of hydrochar properties from high-moisture municipal solid waste:
Data availability application of machine learning on waste-to-resource. J. Clean. Prod. 278, 123928.
Li, J.T., Gurajala, H.K., Wu, L.H., Ent, A.V.D., Qiu, R.L., Baker, A.J.M., Tang, Y.T.,
Yang, X.E., Shu, W.S., 2018a. Hyperaccumulator plants from China: a synthesis of
Data will be made available on request. the current state of knowledge. Environ. Sci. Technol. 52, 11980–11994.
Li, N.N., Li, S.T., Wang, S.F., Xie, D.T., Luo, F., 2018b. How exogenous cadmium affects
micronutrients accumulation and the related gene expression regulation in Brassica
Acknowledgements
juncea. Int. J. Agric. Biol. 20, 2074–2082.
Li, X.Y., Geng, T., Shen, W.J., Zhang, J.R., Zhou, Y.Z., 2021a. Quantifying the influencing
This work was carried out with the support of the Cooperative factors and multi-factor interactions affecting cadmium accumulation in limestone-
Research Program for Agriculture Science and Technology Development derived agricultural soil using random forest (RF) approach. Ecotoxicol. Environ.
Saf. 209, 111773.
(Project No. PJ01475801) from Rural Development Administration, the
9
L. Shi et al. Journal of Hazardous Materials 441 (2023) 129904
Liang, H.M., Lin, T.H., Chiou, J.M., Yeh, K.C., 2009. Model evaluation of the Sheoran, V., Sheoran, A.S., Poonia, P., 2016. Factors affecting phytoextraction: a review.
phytoextraction potential of heavy metal hyperaccumulators and non- Pedosphere 26 (2), 148–166.
hyperaccumulators. Environ. Pollut. 157 (6), 1945–1952. Sigmund, G., Gharasoo, M., Hüffer, T., Hofmann, T., 2020. Deep learning neural network
Liu, H., Zhao, H.X., Wu, L.H., Liu, A.N., Zhao, F, J., Xu, W.Z., 2017a. Heavy metal ATPase approach for predicting the sorption of ionizable and polar organic pollutants to a
3 (HMA3) confers cadmium hypertolerance on the cadmium/zinc hyperaccumulator wide range of carbonaceous materials. Environ. Sci. Technol. 54 (7), 4583.
Sedum plumbizincicola. N. Phytol. 215, 687–698. Tőzsér, D., Harangi, S., Baranyai, E., Lakatos, G., Fülöp, Z., Tóthmérész, B., Simon, E.,
Liu, H., Zhao, H.X., Wu, L.H., Liu, A.N., Zhao, F.J., 2017b. Heavy metal ATPase 3 2017. Phytoextraction with Salix viminalis in a moderately to strongly contaminated
(HMA3) confers cadmium hypertolerance on the cadmium/zinc hyperaccumulator area. Environ. Sci. Pollut. Res. 25, 3275–3290.
Sedum plumbizincicola. N. Phytol. 215, 687–698. Uraguchi, S., Fujiwara, T., 2012. Cadmium transport and tolerance in rice: perspectives
Manta, D.S., Angelone, M., Bellanca, A., Neri, R., Sprovieri, M., 2002. Heavy metals in for reducing grain cadmium accumulation. Rice 5, 5.
urban soils: a case study from the city of Palermo (Sicily), Italy. Sci. Total Environ. Venzhik, Y.V., Talanova, V.V., Titov, A.F., Kholoptseva, E.S., 2015. Similarities and
300, 229–243. differences in wheat plant responses to low temperature and cadmium. Plant
Meier, S., Alvear, M., Borie, F., Aguilera, P., Ginocchio, R., Cornejo, P., 2012. Influence of Physiol. 42, 508–514.
copper on root exudate patterns in some metallophytes and agricultural plants. Verbruggen, N., Hermans, C., Schat, H., 2009. Molecular mechanisms of metal
Ecotoxicol. Environ. Saf. 75, 8–15. hyperaccumulation in plants. N. Phytol. 181 (4), 759–776.
Montoya-Mayor, M., Fernandez-Espinosa, A.J., Seijo-Delgado, I., Ternero-Rodríguez, M., Wang, J.W., Liang, S., Xiang, W.W., Dai, H.P., Duan, Y.Z., Kang, F.R., Chai, T.Y., 2019b.
2013. Determination of soluble ultra-trace metals and metalloids in rainwater and A repeat region from the Brassica juncea HMA4 gene BjHMA4R is specifically
atmospheric deposition fluxes: a 2-year survey and assessment. Chemosphere 92, involved in Cd2+ binding in the cytosol under low heavy metal concentrations. BMC
882–891. Plant Biol. 19, 89.
Niemeyer, J.C., Lolata, G.B., Carvalho, G.M.D., Da Silva, E.M., Sousa, J.P., Nogueira, M. Wang, L.W., Hou, D.Y., Shen, Z.T., Zhu, J., Jia, X.Y., Ok, Y.S., Tack, F.M.G., Rinklebe, J.,
A., 2012. Microbial indicators of soil health as tools for ecological risk assessment of 2019a. Field trials of phytomining and phytoremediation: A critical review of
a metal contaminated site in Brazil. Appl. Soil Ecol. 59, 96–105. influencing factors and effects of additives. Crit. Rev. Environ. Sci. Technol. 50 (24),
Palansooriya, K.N., Li, J., Dissanayake, P.D., Suvarna, M., Li, L.Y., Yuan, X.Z., Sarkar, B., 2724–2774.
Tsang, D.C.W., Rinklebe, J., Wang, X.N., Ok, Y.S., 2022. Prediction of soil heavy Wang, L.W., Rinklebe, J., Tack, F.M.G., Hou, D.Y., 2021. A review of green remediation
metal immobilization by biochar using machine learning. Environ. Sci. Technol. 56, strategies for heavy metal contaminated soil. Soil Use Manag. 37 (4), 936–963.
4187–4198. Wang, X.L., Souza, M.F.D., Li, H.C., Qiu, J., Ok, Y.S., Meers, E., 2022. Biodegradation and
Pant, J., Pant, R.P., Singh, M.K., Singh, D.P., Pant, H., 2021. Analysis of agricultural crop effects of EDDS and NTA on Zn in soil solutions during phytoextraction by alfalfa in
yield prediction using statistical techniques of machine learning. Mater. Today.: soils with three Zn levels. Chemosphere 292, 133519.
Proc. 3, 34. Wood, J.L., Tang, C., Franks, A.E., 2016. Microbial associated plant growth and heavy
Peng, J.S., Ding, G., Meng, S., Yi, H.Y., Gong, J.M., 2017. Enhanced metal tolerance metal accumulation to improve phytoextraction of contaminated soils. Soil Biol.
correlates with heterotypic variation in SpMTL, a metallothionein-like protein from Biochem. 103, 131–137.
the hyperaccumulator Sedum plumbizincicola. Plant Cell Environ. 40, 1368–1378. Wu, X., Su, N., Yue, X., Fang, B., Zou, J., Chen, Y., Shen, Z.G., Cui, J., 2021. IRT1 and
Qu, Y., Feng, B.L., 2020. Straw mulching improved yield of field buckwheat (Fagopyrum) ZIP2 were involved in exogenous hydrogen-rich water-reduced cadmium
by increasing water-temperature use and soil carbon in rain-fed farmland. Acta Ecol. accumulation in Brassica chinensis and Arabidopsis thaliana. J. Hazard. Mater. 407,
Sin. https://doi.org/10.1016/j.chnaes.2020.11.008. 124599.
Ridgeway, G., Generalized Boosted Models: A guide to the gbm package. 2020. Yang, P., Chen, H.J., Fan, H.Y., Li, Q.S., Gao, Q., Wang, D.S., Wang, L.L., Zhou, C.,
Robinson, B., Fernandez, J.E., Madejon, P., Maranon, T., Murillo, J.M., Green, S., Zeng, E.Y., 2019. Phosphorus supply alters the root metabolism of Chinese flowering
Clothier, B., 2002. Phytoextraction: an assessment of biogeochemical and economic cabbage (Brassica campestris L. ssp. chinensis var. utilis Tsenet Lee) and the
viability. Plant Soil 249, 117–125. mobilization of Cd bound to lepidocrocite in soil. Environ. Exp. Bot. 167, 103827.
Rosa, G.J.M.; Blackwell., The Elements of Statistical Learning: Data Mining, Inference, Ye, P., Wang, M.H., Zhang, T., Liu, X.Y., Jiang, H., Sun, Y.P., Cheng, X.Y., Yan, Q., 2020.
and Prediction. 2022. Enhanced cadmium accumulation and tolerance in transgenic hairy roots of solanum
Shen, X., Dai, M., Yang, J.W., Sun, L., Tan, X., Peng, C.S., Ali, I., Naz, I., 2022. A critical nigrum L. expressing iron-regulated transporter gene. IRT1. Life 10, 324.
review on the phytoremediation of heavy metals from environment: performance Yuan, X.Z., Suvarna, M.N., Low, S., Dissanayake, D., Ok, Y.S., 2021. Applied machine
and challenges. Chemosphere 291, 132979. learning for prediction of CO2 adsorption on biomass waste-derived porous carbons.
Environ. Sci. Technol. 55, 11925.
10