
Industrial Crops & Products 196 (2023) 116431


Machine learning prediction of deep eutectic solvents pretreatment of lignocellulosic biomass
Huanfei Xu a, c, *, Chenyang Dong a, Weixian Wang a, Yaoze Liu a, b, Bin Li c, Fusheng Liu a, **
a College of Chemical Engineering, Qingdao University of Science and Technology, 266042 Qingdao, PR China
b School of Information Science and Technology, Qingdao University of Science and Technology, Qingdao, Shandong 266061, PR China
c Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao 266101, PR China

ARTICLE INFO

Keywords: Deep eutectic solvents; Pretreatment; Lignocellulose; Partial dependence; Machine learning

ABSTRACT

The deep eutectic solvent (DES) pretreatments of lignocellulosic biomass were investigated by machine learning methods. Principal component analysis, partial least squares, linear regression, optimized gradient boosting, artificial neural networks, and random forests were used to reveal mechanisms and inner interaction relationships. The dependences of the pretreatment effect on the variables of the reaction conditions, the DES properties and the lignocellulosic biomass properties were analyzed. The influences of various variables on the pretreatment process and the partial dependence of lignin removal and carbohydrate recovery were studied. The temperature in the reaction conditions, the hydrophilicity in the DES characteristic parameters, and the hemicellulose content of the raw lignocellulose components were the top three most influential factors for the changes in lignin removal, accounting for 30%, 11%, and 6%, respectively.

1. Introduction

Research on high value-added chemicals from lignocellulosic biomass is a current focus because of increasingly serious environmental problems (Gur, 2022). Lignocellulosic biomass is a renewable resource widely distributed in nature, with low cost and easy availability in large quantities (Jha et al., 2022). It is mainly composed of polysaccharides and lignin. An important step in the high value-added utilization of biomass and in biorefining is the separation of cellulose, hemicellulose and lignin. Pretreatment with green solvents is a green fractionation method for biorefining to produce bio-fuels and bio-chemicals.

Deep eutectic solvents (DES), a novel class of green biomass pretreatment solvents, have the advantages of easy preparation, high purity, essentially no toxicity, easy biodegradation, low melting point, strong thermal stability, low volatility and low flammability (Wang and Lee, 2021). A DES is composed of hydrogen bond donors (HBD) and hydrogen bond acceptors (HBA). DES satisfy green chemistry principles and have received extensive attention in lignocellulose pretreatment (Li et al., 2022). DES can fractionate biomass (Su et al., 2021) and remove lignin from different kinds of lignocellulosic biomass, which provides a more suitable substrate for the subsequent bioethanol production process (Xu et al., 2021).

DES pretreatment efficiency is influenced by different reaction mechanisms and process variables, which are related to the acidity and alkalinity of the hydrogen bond donors of the DES, the reaction conditions, and the composition of the biomass raw material (Zhong et al., 2022). The optimal pretreatment conditions of different DES for different biomass vary between studies (Chourasia et al., 2021). Therefore, it is challenging to analyze the relationships among all process parameters involved in DES pretreatment, and to reveal the mechanism of the DES reaction, using chemical experiments alone (Shen et al., 2019). Experimental approaches to determine the relative contributions of various factors to the effect of DES pretreatment are time-consuming, costly and complex (Xu et al., 2020). A more rapid and convenient method that does not rely on trial and error to guide the screening of reaction conditions is the development trend in this field. This is an emerging research field, and only a few scholars have begun to study it. Few works have been reported on evaluating the DES pretreatment performance of lignocellulosic biomass. Multivariate analysis methods such as principal component analysis and partial least squares have been used for evaluation of DES pretreatment (Xu et al.,

* Corresponding author at: College of chemical engineering, Qingdao University of science and technology, 266042 Qingdao, PR China.
** Corresponding author.
E-mail addresses: xuhuanfeixiaotu@163.com (H. Xu), liufusheng63@sina.com (F. Liu).

https://doi.org/10.1016/j.indcrop.2023.116431
Received 23 July 2022; Received in revised form 31 January 2023; Accepted 11 February 2023
Available online 1 March 2023
0926-6690/© 2023 Published by Elsevier B.V.
H. Xu et al. Industrial Crops & Products 196 (2023) 116431

2020, 2021). There were some interesting findings, such as the negative relationship between intensive stirring during DES pretreatment and glucose yield (Massayev and Lee, 2022). Methods capable of predicting the efficiency of DES pretreatment can help determine the correlations between optimal experimental conditions, DES properties, biomass properties, and the separation of the three main components, in order to maximize the lignin removal rate and the carbohydrate retention rate. Effective and robust models incorporating all variables could be studied to highlight the relative importance of, and the relationships between, the variables, and such analytical studies have important instructive significance for DES pretreatment of lignocellulosic biomass.

Machine learning methods can learn automatically from complicated, multi-dimensional, large datasets to build predictive analysis models (Kaptan and Vattulainen, 2022). Machine learning methods have been widely used for prediction in many fields (Rolnick et al., 2023). For example, machine learning was used to predict the carbon content and char yield of biochar from biomass according to biomass components and reaction conditions (Zhu et al., 2019). Machine learning has also been studied for predicting the adsorption efficiency of materials for heavy metals (Wei et al., 2021). At present, the complex interactions between variables in DES pretreatment and the lack of systematic datasets have led to a lack of studies on machine learning-based prediction of DES pretreatment efficiency for different lignocellulosic feedstocks. In this study, different machine learning methods were used to validate the ability to predict the separation of lignocellulosic biomass components. Machine learning-based analytical models provide new insights into biomass pretreatment and biomass refining.

2. Materials and methods

2.1. DES pretreatment dataset

Variables for machine learning analysis were divided into the following categories: DES-related physicochemical property variables, pretreatment reaction condition variables, component content variables of the raw biomass materials, component content variables of the biomass samples after DES pretreatment, and pretreatment effect variables. All data were taken from the results of published research papers. The database covered different hydrogen bond donors (HBD) of choline chloride-based DES, different process variables, and one hundred and forty-one experiments with different sets of reaction conditions. The HBDs were glycerol, ethylene glycol, urea, imidazole, formic acid, butanediol, oxalic acid dihydrate, lactic acid, levulinic acid and malonic acid. Different lignocellulosic biomass sources were pretreated by different DES under many kinds of reaction conditions. All original experimental data for machine learning were aggregated into a dataset, which was Z-score normalized (Tangadpalliwar et al., 2019).

2.2. Multivariate analysis

Principal component analysis (PCA) and partial least squares analysis (PLS) were the multivariate analysis methods used in this study. For PCA, all variables in the dataset were treated as independent variables. In the PLS model, the lignin removal rate was selected as the dependent variable, and the others were independent variables. The purpose of the PCA and PLS models was to reveal the relationships between different variables of DES pretreatment and to optimize the lignin removal results. Both PCA and PLS analyses were performed in Simca software (Saiz et al., 2014).

2.3. Linear regression analysis

The Pearson correlation coefficient was used to measure the linear correlation between any two variables. NumPy was used to read the dataset, and the data were passed to a seaborn function to obtain a heat map of the correlation between every two variables. The linear regression function in the linear_model module of the Python sklearn library was used to build a linear model. The matrix composed of the process variables related to the pretreatment reactions (the features) was assigned to the independent variable, and the matrix composed of the four pretreatment result variables (the prediction targets) was assigned to the dependent variable. The four pretreatment result variables were biomass recovery rate, lignin removal rate, glucan recovery rate, and xylan recovery rate. The data were randomly separated into a training group and a test group. The training group contained 80% of the dataset and the test group the other 20%.

2.4. XGBoost analysis

The XGBoost algorithm was used to establish a regression prediction model (Pan et al., 2022). XGBoost is an optimized distributed gradient boosting library. The effects of the independent variables of biomass pretreated with deep eutectic solvent on the four dependent variables were studied and predicted. In each training process, 80% of the samples in the dataset were randomly assigned to the training set and 20% to the test set, and the optimal model parameters were selected through grid search. The model and analysis were run in Python 3.9. The XGBoost model and the SHapley Additive exPlanations (SHAP) analysis were implemented with the Python open-source libraries scikit-learn, XGBoost and SHAP, and run on the Jupyter Notebook platform.

2.5. Artificial neural network

The artificial neural network (ANN) prediction model was built with the tensorflow library (version 2.0.0). The training of the ANN model was divided into the following steps. Firstly, the data were prepared: NumPy was used to read in the prepared data, which were randomly separated into a training group (80% of the data) and a test group (the remaining 20%). Secondly, the ANN model was built. The model had a total of 5 layers: an input layer with an input dimension of 43, containing 200 neurons, with the relu activation function; three hidden layers in the middle with 100, 50 and 25 neurons, all with the relu activation function; and a final output layer with an output dimension of 1. The optimizer was 'adam', the loss function was 'mse', and the number of training iterations was set to 2000. Thirdly, the model was trained: the split data were fed into the model and training was started. After training, the final model was obtained.

2.6. Random forest

This model was imported from the sklearn library to train and optimize the predictions of the four pretreatment effect variables. The construction of a random forest has two aspects: random selection of data, and random selection of candidate features. Sampling with replacement was performed on the raw dataset to construct a new dataset, also called a sub-dataset, with the same amount of data as the raw dataset. Elements could be repeated across different sub-datasets, and elements within the same sub-dataset could also be repeated. Similar to the random selection of data, each splitting step of a tree in the random forest did not consider all candidate features; instead, a certain number of features was randomly selected from all candidate features, and the best feature was then chosen among them. The steps were: first, random sampling; second, selecting the attributes to split on; third, building a decision tree; fourth, building a large number of decision trees by the methods above to form the random forest. Hyperparameters were searched with the GridSearchCV function in the sklearn library, and the optimal hyperparameter settings for this model were obtained.
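The two sources of randomness described for the random forest in Section 2.6 (sampling with replacement to build a sub-dataset, and a random subset of candidate features at each split) can be sketched in plain Python. This is a minimal illustration, not the sklearn implementation used in the study; the function names, the toy rows and the choice of two candidate features are assumptions made for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so the same row may appear more than once."""
    return [rng.choice(rows) for _ in rows]


def random_feature_subset(n_features, k, rng):
    """Pick k distinct candidate feature indices to be considered at one split."""
    return sorted(rng.sample(range(n_features), k))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical toy rows (three features each), not taken from the dataset.
rows = [[60, 1.0, 0.5], [90, 4.0, 2.0], [120, 8.0, 1.0], [150, 16.0, 7.0]]

sub_dataset = bootstrap_sample(rows, rng)  # same size as the raw data
candidate_features = random_feature_subset(n_features=3, k=2, rng=rng)
```

Each tree of the forest would be grown on its own `sub_dataset`, drawing a fresh `candidate_features` subset at every split and choosing the best split among those candidates only.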

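The data preparation shared by the models above (Z-score normalization in Section 2.1, followed by a random 80/20 split into training and test sets) can be sketched as follows. This is a standard-library illustration under stated assumptions: the helper names are hypothetical, the temperature values are made up, and the population standard deviation is assumed for the Z-score because the text does not say which variant was used.

```python
import random
import statistics


def z_score(values):
    """Standardize one column to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / std for v in values]


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of the rows; the first test_fraction of them becomes the test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


temperatures = [60.0, 80.0, 100.0, 120.0, 150.0]  # hypothetical T-R values
scaled = z_score(temperatures)

# 141 experiments, as stated in Section 2.1; indices stand in for the rows.
train, test = train_test_split(list(range(141)))
```

With 141 experiments, a 20% test fraction rounds to 28 test rows and leaves 113 rows for training.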

2.7. Quantum chemical calculation

In the density functional theory (DFT) calculations, DES cluster conformations were constructed for different HBAs. In this section, the HBAs were tetramethyl ammonium chloride (TMAC), methyl triethyl ammonium chloride (MTAC), tetraethyl ammonium chloride (TEAC), tetrapropyl ammonium chloride (TPAC), tetrabutyl ammonium chloride (TBAC), benzyltrimethylammonium chloride (BTMAC), benzyltriethylammonium chloride (BTEAC), choline chloride (ChCl), chlorocholine chloride (CCC) and allyl trimethyl ammonium chloride (ATMAC). DESs were constructed with a 1:2 ratio of HBA to the HBD lactic acid using the Molclus and MOPAC programs (Stewart, 1990). The structures of the DES cluster conformations were optimized with ORCA at the BLYP-D3(BJ)/def2-TZVP level, and the PWPB95-D3(BJ)/def2-TZVPP level was used for more accurate single point energy calculations. The interaction energy data were collected at the PWPB95-D3(BJ)/def2-TZVPP level with basis set superposition error correction. The cluster structure was analyzed by Multiwfn with RDG analyses (Lu and Chen, 2012).

3. Results and discussion

3.1. PCA and PLS analysis

All variables analyzed in this paper are shown in Table 1. In the PCA model, the cumulative explained variance (R2) was 0.940 and the cumulative predictive ability (Q2) was 0.751. Fig. 1a-1d display the cumulative R2 (explanatory ability) and Q2 (goodness of prediction) for the parameter matrix. The overall PCA model was relatively good: it could explain 94% of the overall data information and predict 75.1% of it. For individual variables, for example, the explanatory and predictive powers for the acidity of HBD were 99% and 96%, respectively; for the length of the HBD molecular chain they were 99% and 97%; and for the polarity of HBD they were 98.6% and 97.2%. Fig. 1c shows the relationships among all variables. Variables with positive relationships were located near each other, whereas variables with negative relationships were located in opposite directions of the map, in opposed quadrants. Variables in clusters close to each other had a strong positive correlation; in contrast, variables far away from each other in different quadrants had negative correlations, and the farther the distance, the stronger the negative correlation could be. The acidity coefficient of HBD and the number of double bonds contained in HBD showed a significant negative correlation. That might be because the number of double bonds indicates, to some extent, the carboxyl group content: an HBD with more carboxyl groups was more acidic, so its acidity coefficient was relatively low. Meanwhile, the carboxyl group contains a hydroxyl group. Fig. 1c also shows that the acidity coefficient of HBD and the content of carboxyl groups in the HBD molecule were negatively correlated. The dipole moment reflected the polarity of the molecule. Lone pairs of electrons acted as hydrogen bond acceptors, not donors, whereas the hydroxyl group was a hydrogen bond donor. The more double bonds and the more carbonyl groups the HBD contained, the more polar the HBD molecule would be. Molecular polarity could help disrupt hydrogen bond networks, which was conducive to the separation and removal of biomass components (Soares et al., 2019).

Table 1
Variables for analysis.

| Categories | Variables abbreviation | Variables description | Value range |
|---|---|---|---|
| Physicochemical property variables of ChCl based DES | Tm-HBD | Melting point of HBD | −13 to 189.5 |
| | Tb-HBD | Boiling point of HBD | 100.8–386.8 |
| | Tf-HBD | Flash point of HBD | 29.9–201.9 |
| | Enthalpy of Vaporization-HBD | Enthalpy of vaporization of HBD | 22.7–69.8 |
| | Molar Volume-HBD | Molar volume of HBD | 38.4–102.8 |
| | Polarizability-HBD | Polarizability of HBD | 3.3–10.6 |
| | Surface Tension-HBD | Surface tension of HBD | 35.8–87.4 |
| | HB acceptors-HBD | Number of hydrogen bond acceptors of HBD | 2–4 |
| | HB donors-HBD | Number of hydrogen bond donors of HBD | 1–4 |
| | HB a+D-HBD | Sum of the number of hydrogen bond donors and hydrogen bond acceptors of HBD | 3–7 |
| | Freely Rotating Bonds-HBD | Freely rotating bonds of HBD | 0–3 |
| | Log P-HBD | Hydrophobicity parameter/oil-water partition coefficient | −2.3 to −0.16 |
| | Polar Surface Area-HBD | Polar surface area of HBD | 29–75 |
| | Complexity-HBD | Complexity of HBD | 6–106 |
| | C=O-HBD | Number of carbonyl groups of HBD | 0–2 |
| | OH-HBD | Number of hydroxyl groups of HBD | 0–3 |
| | Carbon chain length-HBD | Carbon chain length of HBD | 1–5 |
| | NH2/-N-HBD | Number of nitrogen of HBD | 0–2 |
| | Double bond number-HBD | Number of double bonds of HBD | 0–2 |
| | Refraction index-HBD | Refraction index of HBD | 1.342–1.528 |
| | Molar Refractivity-HBD | Molar refractivity of HBD | 8.4–26.8 |
| | Density-HBD | Density of HBD | 0.99–1.6 |
| | Mw-HBD | Molecular weight of HBD | 46.03–116.12 |
| | MUP-HBD | Dipole moment/polarity | 1.3371–4.5568 |
| | omega-HBD | Eccentricity factor | 0.2863–46.67 |
| | MPI-HBD | Molecular polarity index | 49.09–97.39 |
| | pKa-HBD | Acidity coefficient | 0.1–14.83 |
| | DES ratio HBD/HBA | Molar ratio of HBD to HBA | 0.5–7 |
| DES pretreatment conditions | time-R/h | Pretreatment time | 1–16 |
| | T-R | Pretreatment temperature | 60–150 |
| | LS ratio | Ratio of solid to liquid | 8–100 |
| | R factor | Reaction severity factor | 1.07–4.45 |
| | particle size | Particle size of raw biomass material | 0.1–6.8 |
| Raw material components | Glucan | Glucan content | 16.8–62.4 |
| | Xylan | Xylan content | 3.6–27.46 |
| | AIL | Acid insoluble lignin content | 13.7–30 |
| | ASL | Acid soluble lignin content | 0.23–5.33 |
| | Lignin | Total lignin content | 16.9–32.9 |
| Pretreated material components | P-Glucan | Glucan content in pretreated biomass samples | 17–79.3 |

(continued on next page)

In the PLS model, the cumulative explained variance of X (R2Xcum) was 0.826, the cumulative explained variance of Y (R2Ycum) was 0.941, and the cumulative predictive ability (Q2cum) was 0.844. As shown in Figs. 1e-1h, four principal components could explain 94% of the information and had 84.4% prediction ability, which showed that this PLS model was good. It could be clearly seen that the lignin removal rate had strong positive correlations with the acid soluble lignin removal rate and the acid insoluble lignin removal rate, because the total lignin consisted of these two kinds of lignin. The acidity of DES had strong effect on


the pretreatment effect. The acidity coefficient pKa of HBD showed a strong negative correlation with D-lignin. The smaller the pKa value, the more acidic the molecule. The more acidic the DES, the better the pretreatment effect could be. DES acidity especially affected the extraction rate of acid-soluble lignin, and pKa showed a strong negative correlation with the extraction rate of acid-soluble lignin in all directions. The larger the electronegativity difference between the two ends, the larger the dipole moment. The various physicochemical parameters of HBD in the composition of DES were compactly clustered together, indicating that they were strongly positively correlated with each other. The greater the electronegativity difference between the two ends of the HBD, the greater the polarity could be. The more complex the molecular structure of HBD and the greater the molecular polarity, the better the final DES pretreatment effect could be. The hydrogen bonding interactions inside carbohydrates could be destroyed, thereby disrupting the intramolecular hydrogen bonding network (Kwon et al., 2020). Therefore, pKa-HBD had a strong negative correlation with the biomass recovery. The three main components of lignocellulose are cellulose, hemicellulose and lignin. In this paper, the final glucan recovery rate reflected the degree of cellulose retention, the final xylan recovery rate reflected the degree of hemicellulose retention, and the biomass recovery rate was the total solid retention including cellulose, hemicellulose, and lignin. Under the PLS model, there was an obvious negative correlation between the lignin removal rate and the xylan recovery rate. The biomass recovery rate and the lignin removal rate also showed a strong negative correlation. That is easy to understand given the form in which lignin is present in lignocellulosic biomass and the linkages between lignin and carbohydrates in lignin-carbohydrate complexes (LCC). Lignin removal inevitably deconstructs and dissolves part of the hemicellulose, so the more lignin was removed, the less hemicellulose could be retained and the lower the biomass recovery rate could be.

Table 1 (continued)

| Categories | Variables abbreviation | Variables description | Value range |
|---|---|---|---|
| Pretreated material components | P-Xylan | Xylan content in pretreated biomass samples | 0–30 |
| | P-AIL | Acid insoluble lignin content in pretreated biomass samples | 1.6–30 |
| | P-ASL | Acid soluble lignin content in pretreated biomass samples | 0.11–4.03 |
| | P-lignin | Total lignin content in pretreated biomass samples | 4.4–32.9 |
| DES pretreatment effect | Biomass recovery | Solid biomass recovery rate | 39.6–99% |
| | R-glucan | Glucan recovery rate | 55.1–100% |
| | R-xylan | Xylan recovery rate | 0–100% |
| | D-AIL | Acid insoluble lignin removal rate | 6.9–99.3% |
| | D-ASL | Acid soluble lignin removal rate | 6.9–99.3% |
| | D-Lignin | Total lignin removal rate | 7.9–85.1% |

3.2. Linear regression analysis

Linear regression is the most typical type of regression (Ao et al., 2019); the results are shown in Fig. 2. The Pearson correlation coefficient is one of the most commonly used linear correlation coefficients (Shang et al., 2022). Based on the heat map, the correlation between every two variables could be seen. For the Pearson correlation coefficient, positive values indicate positive correlation and negative values indicate negative correlation; the greater the absolute value, the stronger the correlation. The Pearson correlation coefficient between logP and D-lignin was 0.13, which showed that the delignification rate had a positive relationship with logP. The Pearson correlation coefficients of logP with the glucan recovery rate and the xylan recovery rate were −0.03 and −0.28, respectively, which meant that carbohydrate retention was negatively correlated with logP. The results showed that the delignification effect of DES composed of lipophilic HBD was better. The Pearson correlation coefficients of D-lignin with the content of HBD double bonds, the content of carbonyl groups and the molecular complexity were 0.16, 0.11 and 0.13, respectively, so these three variables were positively correlated with the delignification rate. The effect of temperature was the most important, far more important than particle size and reaction time; temperature was very important for pretreatment (Ma et al., 2022). According to the values in the Pearson correlation coefficient matrix, the top three parameters of the pretreatment reaction conditions and DES with the greatest impact on D-Lignin were T-R (0.65), R factor (0.62) and omega-HBD (−0.22).

The red points represent the predictions of the model for the training set, and the black dotted line in the middle represents the straight line “y equals x”. The closer a predicted point is to this line, the more accurate the model is for that point. The blue points represent the predictions for the test set, the blue line is the regression line through all blue points, and the blue shaded interval is the 95% confidence interval of the regression line. The purpose is to observe the performance of the model more comprehensively and to judge, using the two sets, whether the linear model is overfitting. The quality of the model was evaluated by R2, which indicates the degree of fit of the model, that is, how well the predicted values fit the corresponding actual data. The higher the R2 value, the more reliable the model could be. The linear model performed well. All R2 values can be seen in Fig. 2a. Among the four variables (biomass recovery rate, glucan recovery rate, xylan recovery rate and delignification rate), the delignification rate showed the best fitting results. The predicted value of D-lignin was almost equal to the actual value: the test R2 was 0.95 and the training R2 was 0.96, indicating that the linear regression model had a high accuracy.

3.3. XGBoost analysis

XGBoost is a machine learning method based on the gradient boosting framework. The mean squared error of this model was about 0.098, and the mean absolute error was about 0.19, which verified that the XGBoost algorithm achieved good prediction results. SHAP is an explanatory model that obtains the importance of features by explaining the influence of different variables on each sample in the XGBoost prediction (Rzychon et al., ). In this XGBoost analysis, the physicochemical property variables of ChCl based DES, the DES pretreatment conditions and the raw material components were the input features, and the pretreatment results were the output targets. The model analysis graph plotted the Shapley value of each feature for each sample, showing which features were most important and how strongly they affected the model output. The most important features for the different DES pretreatment effect variables are shown in Fig. 3a. Temperature and reaction severity had the greatest influence on carbohydrate recovery and lignin removal. Among the physicochemical property parameters of HBD, log P had the most important influence on lignin removal. Log P, also called the hydrophobic constant, is the oil-water partition coefficient. The larger the log P value, the more lipophilic the substance; conversely, the smaller the log P value, the more hydrophilic the substance. Lignin is basically composed of phenylpropane structural units, and its structure contains a large


Fig. 1. PCA and PLS. a: the wavelet power spectrum plot of PCA model; b: 3D scatter plot of PCA model; c: PCA loading plot; d: X/Y overview plot of PCA model; e: 3D scatter plot of PLS model; f: overview plot of PLS model; g: loading plot of PLS model; h: coefficient plot.

number of hydroxyl groups, with high polarity, and most of the lignin is hydrophilic. The figure also showed that the molecular volume and polarity of HBD had a great influence on lignin removal.

For the glucan recovery rate, the most important HBD physicochemical property was the acidity coefficient pKa. The smaller the acidity coefficient pKa, the stronger the acidity and the lower the cellulose content of the carbohydrates, which agrees with other research showing that the acidity of DES influences the pretreatment results of lignocellulosic biomass (Saito et al., 2022). For the total solid recovery rate, among the main components of biomass, the glucan content had the greatest impact. The higher the cellulose content, the greater the solid recovery rate after DES pretreatment could be. This was mainly due to the structural characteristics of LCC (Sun et al., 2022). HBD polarity had an impact on the pretreatment effect, and the dipole moment and molecular polarity index (MPI) were important parameters for measuring HBD polarity. The variables related to polarity showed a strong influence on the effect of DES pretreatment. Hidden relationships between different variables could be studied through a comprehensive analysis with the methods above. The variables biomass recovery, LS ratio, T-R, R factor and particle size formed one cluster and were very tightly distributed in PCA. The details of the hidden relationships could be obtained by XGBoost. As shown in Fig. 3a3, the top three pretreatment condition variables affecting R-glucan were R factor, LS ratio and temperature; the top three physicochemical property variables of HBD were pKa-HBD, Tb-HBD, and MPI-HBD.

3.4. Artificial neural network (ANN) analysis

ANN has self-organization, self-learning and self-adaptation capabilities, a good ability of nonlinear function approximation, and good fault tolerance (Rahimi et al., 2021). In this ANN model, the physicochemical property variables of ChCl based DES, the reaction conditions of pretreatment, the components of raw biomass and the pretreated material components were the input features, and the pretreatment results were the output targets. Fig. 3b shows the actual and predicted values of the DES pretreatment effect for the ANN model, together with the R2 results. For biomass recovery rate, the R2 of the training and test sets were 0.96 and 0.91, respectively. For glucan recovery rate, they were 0.92 and 0.77. For xylan recovery rate, they were 0.99 and 0.97. For acid insoluble lignin removal rate, they were 0.99 and 0.92. For acid soluble lignin removal rate, they were 0.92 and 0.73. For total lignin removal rate, they were 0.98 and 0.97. That showed the ANN model had a high accuracy for DES pretreatment.
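The R2 values quoted for the ANN model can be related to their definition with a short sketch. The function below computes the coefficient of determination in plain Python; the dictionary simply restates the training and test R2 values reported in Section 3.4, and the train-minus-test gap is offered only as an illustrative way to compare the two sets, not as an analysis performed in the study.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# R2 values reported for the ANN model in Section 3.4: (training, test)
reported = {
    "biomass recovery": (0.96, 0.91),
    "glucan recovery": (0.92, 0.77),
    "xylan recovery": (0.99, 0.97),
    "AIL removal": (0.99, 0.92),
    "ASL removal": (0.92, 0.73),
    "total lignin removal": (0.98, 0.97),
}

# Difference between training and test R2 for each target, rounded to two decimals.
gaps = {name: round(train - test, 2) for name, (train, test) in reported.items()}
largest_gap = max(gaps, key=gaps.get)
```

On the reported numbers, the gap is smallest for total lignin removal and xylan recovery and largest for acid soluble lignin removal and glucan recovery.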


Fig. 2. Linear regression and Pearson correlation coefficient matrix. a1: training and test of biomass recovery rate; a2: training and test of total lignin removal rate; a3: training and test of glucan recovery rate; a4: training and test of xylan recovery rate; b: heat map of the Pearson correlation coefficient matrix.
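The Pearson correlation coefficient matrix behind a heat map of this kind can be illustrated with a small plain-Python sketch. The study itself used NumPy and seaborn; the functions below are a hypothetical stand-in, and the three columns contain made-up values chosen only to show one positive and one negative correlation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-deviation of x and y divided by the product of their spreads."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(columns):
    """Pairwise Pearson coefficients for a dict of equally long columns."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical values, for illustration only (not taken from the dataset).
data = {
    "T-R": [60.0, 80.0, 100.0, 120.0, 150.0],
    "D-Lignin": [10.0, 25.0, 40.0, 60.0, 80.0],
    "R-xylan": [95.0, 90.0, 70.0, 50.0, 30.0],
}
matrix = correlation_matrix(data)
```

Each entry `matrix[a][b]` lies between −1 and 1; the diagonal is 1, and the matrix is symmetric, which is why a heat map of it mirrors across the diagonal.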


Fig. 3. SHAP for XGBoost and ANN. a1: SHAP value for total lignin removal rate; a2: SHAP value for biomass recovery rate; a3: SHAP value for glucan recovery rate; a4: SHAP value for xylan recovery rate; b: ANN models for four different pretreatment effect variables.
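The Shapley values plotted in SHAP summary graphs of this kind can be illustrated by computing them exactly for a tiny model. The sketch below enumerates all feature coalitions, which is only feasible for a handful of features; the SHAP library relies on much faster tree-specific algorithms for XGBoost models, so this is not the procedure used in the study. The toy model, its coefficients, the instance and the baseline are all hypothetical, and replacing absent features with baseline values is an assumption of the sketch.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    Features outside a coalition are set to their baseline value.
    """
    n = len(instance)

    def value(coalition):
        x = [instance[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(x)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = value(set(subset) | {i})
                without_i = value(set(subset))
                total += weight * (with_i - without_i)
        phi.append(total)
    return phi


def toy_model(x):
    """Hypothetical linear model of lignin removal from (temperature, logP, particle size)."""
    temperature, log_p, particle_size = x
    return 0.5 * temperature + 10.0 * log_p - 2.0 * particle_size


instance = [120.0, -0.5, 1.0]   # hypothetical sample
baseline = [100.0, -1.0, 2.0]   # hypothetical reference point
phi = shapley_values(toy_model, instance, baseline)
```

For a linear model each Shapley value reduces to the coefficient times the feature's deviation from the baseline, and the values add up to the difference between the prediction for the instance and the prediction for the baseline.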


3.5. Random Forest analysis

As shown in Fig. 4a, the variable importance ranking of the Random Forest model showed that, among the reaction conditions, T-R and R factor had a very strong influence and were therefore very important influencing factors. In the Random Forest model, pretreatment temperature accounted for 31% of the changes in lignin removal, 38% of the changes in biomass recovery, 15% of the changes in glucan recovery and 44% of the changes in xylan recovery. Compared with the other pretreatment effect variables, temperature had the strongest influence on xylan recovery. Among the three components lignin, cellulose and hemicellulose, hemicellulose was the most sensitive to, and most dependent on, temperature changes. DES could promote the cleavage of a large number of bonds, especially the linkages in LCC and the β-O-4 bonds in lignin, resulting in an increase in low molecular weight carbohydrates and the removal of lignin, which was greatly affected by temperature. Another interesting finding was that particle size was very important in all models, possibly because the particle size determines the mass transfer and heat transfer during pretreatment. For biomass recovery, the top three important variables of HBD were omega,

Fig. 4. Random Forest. a1: variable importance ranking for biomass recovery rate; a2: variable importance ranking for total lignin removal rate; a3: variable
importance ranking for glucan recovery rate; a4: variable importance ranking for xylan recovery rate; b: all inputs and partial inputs for prediction of total lignin removal rate; c: all
inputs and partial inputs for prediction of glucan recovery rate; d: all inputs and partial inputs for prediction of xylan recovery rate; e: all inputs and partial inputs for
prediction of biomass recovery.
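
The comparison of all inputs against partial inputs in Fig. 4b to 4e amounts to refitting a model with one input removed and comparing its test-set R² with that of the full model. The sketch below illustrates that idea only. It is not the authors' implementation: a plain least-squares fit stands in for the Random Forest, the data are synthetic, and the names `fit_ols`, `drop_one_input`, `temperature`, `logP_HBD` and `unrelated` are hypothetical.

```python
import random


def fit_ols(X, y):
    """Fit y ~ intercept + X by least squares (normal equations, Gauss-Jordan)."""
    rows = [[1.0] + list(r) for r in X]
    p = len(rows[0])
    # Augmented matrix [A^T A | A^T y]
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(p)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(p)
    ]
    for c in range(p):
        pivot = max(range(c, p), key=lambda k: abs(M[k][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for k in range(p):
            if k != c:
                f = M[k][c] / M[c][c]
                M[k] = [a - f * b for a, b in zip(M[k], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]


def predict(coef, X):
    """Evaluate the fitted linear model on each row of X."""
    return [coef[0] + sum(c * v for c, v in zip(coef[1:], r)) for r in X]


def r2_score(y, y_hat):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    mean = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def drop_one_input(X_tr, y_tr, X_te, y_te, names):
    """Return test R2 for the full model and for each model lacking one input."""
    scores = {"all inputs": r2_score(y_te, predict(fit_ols(X_tr, y_tr), X_te))}
    for j, name in enumerate(names):
        keep = [k for k in range(len(names)) if k != j]
        sub_tr = [[r[k] for k in keep] for r in X_tr]
        sub_te = [[r[k] for k in keep] for r in X_te]
        coef = fit_ols(sub_tr, y_tr)
        scores["without " + name] = r2_score(y_te, predict(coef, sub_te))
    return scores


# Synthetic data (NOT the paper's dataset): the response depends strongly on the
# first input, weakly on the second, and not at all on the third.
rng = random.Random(0)
names = ["temperature", "logP_HBD", "unrelated"]
X = [[rng.random() for _ in names] for _ in range(300)]
y = [3.0 * a + 0.5 * b + rng.gauss(0.0, 0.1) for a, b, _ in X]
scores = drop_one_input(X[:200], y[:200], X[200:], y[200:], names)
for label, value in scores.items():
    print(f"{label}: R2 = {value:.3f}")
```

An input whose removal lowers the test R² the most is judged the most important. With a Random Forest the same loop applies, with only the fitting and prediction steps replaced.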


complexity and surface tension. For lignin removal, the top three important variables of HBD were Log P, Complexity and Freely Rotating Bonds, which accounted for 11%, 5% and 1.5%, respectively. For xylan recovery, the top three important variables of HBD were carbon chain length, omega and density. For glucan recovery, the top three important variables of HBD were pKa, complexity and Log P, which accounted for 4.5%, 4% and 2%, respectively. Different analysis models have different research perspectives, database information, input and output variables, and classification group construction. Therefore, it was understandable that the analysis results given by different models were not exactly the same, although the overall indicative explanation was consistent. Similar results could be seen in published articles; for example, the use of carboxylic acid and hydroxy acid HBDs was strongly associated with glucose yield (Massayev and Lee, 2022). Moreover, the results for the reaction condition variables and DES characteristic variables were similar to those of the previous papers (Xu et al., 2020, 2021). Further research was needed to understand which input feature parameters were more important for the model results, so the incomplete input method was adopted: by comparing the results of a model lacking a certain input with those of the model using all inputs, the importance of the missing input could be judged. As shown in Fig. 4b, in the model for predicting D-Lignin, when Complexity-HBD, Log P-HBD, MPI-HBD, and pKa-HBD were input into the model, the predicted values and the accuracy improved. For example, in the model lacking MPI-HBD, R² was 11% lower than that of the full input model, which showed that MPI-HBD was very meaningful for predicting D-Lignin. In addition, the test-set RMSE of the all-input model was better than that of the other models. Overall, the accuracies of these models were acceptable. Similar conclusions could be drawn from the other models. For example, for biomass recovery, particle size, Freely Rotating Bonds-HBD, Log P-HBD and Surface Tension-HBD were very important. For R-xylan, Density-HBD, carbon chain length-HBD and omega-HBD were important. For R-glucan, LS ratio, particle size, Complexity-HBD and pKa-HBD were very important.

3.6. Partial Interdependence Analysis

It was meaningful to further study the deep inner relationships among the DES pretreatment variables. Partial dependence analysis is a widely used method for revealing such insights (Yuan et al., 2021). The partial dependence analyses for the four pretreatment effect variables were obtained from the Random Forest model and can be used to visualize the dependencies on certain parameters, as shown in Fig. 5. For lignin removal, the dependence on T-R increased approximately linearly, but as the pretreatment temperature increased, the slope first increased and then decreased. The dependence on the complexity of the HBD also increased approximately linearly. Once the Complexity-HBD reached 60, it had almost no further effect on the degree of dependence. When the oil-water partition coefficient of the HBD increased from −1.2 to −1, the dependence increased sharply, and the results tended to be stable once it was greater than −1. This might be because the higher the Log P-HBD value, the stronger the lipophilicity of the substance. The main components of lignocellulosic biomass, lignin and carbohydrates, both contain hydrophilic groups in their natural macromolecules. The swelling, distribution, migration and reaction of lignocellulosic biomass during DES pretreatment were all related to the hydrophilic and lipophilic characteristics of the DES. That is to say, Log P-HBD affected the DES pretreatment to a certain extent. The R factor was positively correlated with the D-Lignin dependence; in particular, the dependence changed rapidly when the R factor was greater than 3.5. For biomass recovery, an increase of T-R reduced the degree of dependence, which decreased sharply when the pretreatment temperature was higher than 120 °C. The dependence on R factor showed a similar trend. The dependence of the biomass recovery rate on Log P-HBD decreased with increasing Log P-HBD value, and decreased significantly in the Log P-HBD range of −1.5 to −0.5 in particular. When the Complexity-HBD was greater than 30, the dependence began to decrease, and when the Complexity-HBD was greater than 60, the degree of dependence became stable. For R-glucan, the dependence on R factor first increased and then decreased, reaching its maximum near an R factor of 2.8. The dependence on T-R showed a similar trend, reaching its maximum at a T-R of 115 °C. When the value of Log P was relatively small, the glucan recovery rate was highly dependent on Log P. As Log P gradually increased, the dependence of the glucan recovery rate on it gradually decreased. In the R-xylan model, the reaction intensity had a great influence on xylan recovery, but as the reaction intensity increased, the influence became smaller. When the temperature was lower than 120 °C, the xylan recovery was highly dependent on it. When the pretreatment proceeded at higher temperatures, the dependence was significantly reduced. As the Log P of the HBD increased, its degree of influence on xylan recovery first increased and then decreased. As the MUP of the HBD increased, its impact increased. With increasing molecular volume of the HBD, the degree of dependence of xylan recovery on it first increased and then decreased.

Three-dimensional partial dependence analysis with two features could also be used to study the inner relationships and visualize their interaction. In the model for predicting biomass recovery, the degree of dependence increased as Log P-HBD and Complexity-HBD decreased, and the increase was fastest when Log P-HBD was greater than −1.5 and Complexity-HBD was smaller than 80. In the model for predicting D-Lignin, Log P-HBD had a relatively large influence on the degree of dependence, especially in the interval between −1.5 and −1. In addition, the degree of dependence on Surface Tension-HBD also changed to a certain extent. For example, in the range of 60–70, the degree of dependence increased slightly with the change of Surface Tension-HBD. The reason for this may be that the surface tension is the macroscopic manifestation of the microscopic interactions between molecules in the DES system. The smaller the void size of the DES system, the higher the DES density and the DES viscosity could be. A DES reaction system with high viscosity might limit the biomass fractionation.

3.7. Density functional theory calculations analysis

In order to analyze the mechanism of action more deeply, DFT simulation analysis was carried out in this study. The interactions of different kinds of HBAs with lactic acid were studied by density functional theory (DFT) calculations. To probe the weak interactions between hydrogen bond donors of different functional groups and lactic acid, DFT calculations were performed to obtain more information about the interactions between HBA and HBD. The reduced density gradient (RDG) can be used to describe weak interactions in real space on the basis of the electron density and its gradient (Johnson et al., 2010). As shown in Fig. 6, the RDG isosurface clearly reflected the weak interaction regions in the different DESs. HBAs based on different alkyl chains (TMAC, MTAC, TEAC, TPAC, TBAC) and different functional groups (chlorine: CCC; carbon-carbon double bond: ATMAC; benzene: BTMAC, BTEAC; hydroxyl: CHCL) mainly interacted with lactic acid through van der Waals (VDW) interactions, which are marked in green in the figure. The hydrogen bonds formed between Cl⁻ in the DES and the hydroxyl group of lactic acid are marked in blue. In particular, CHCL (containing hydroxyl groups) and CCC (containing chlorine groups) formed hydrogen bonds with lactic acid and had stronger internal interaction energies (−41.0 kcal/mol and −45.1 kcal/mol, respectively) than ATMAC (containing carbon-carbon double bonds, −35.1 kcal/mol), which was consistent with the results above on the functional groups of HBD. Moreover, the strong internal hydrogen bonding network in them makes


Fig. 5. The partial dependence plot of pretreatment effect on the key process parameters. a: 2D partial dependence plot for biomass recovery rate; b: 2D partial
dependence plot for total lignin removal rate; c: 2D partial dependence plot for glucan recovery rate; d: 2D partial dependence plot for xylan recovery rate; e: 3D
partial dependence plot for biomass recovery rate; f: 3D partial dependence plot for total lignin removal rate; g: 3D partial dependence plot for glucan recovery rate;
h: 3D partial dependence plot for xylan recovery rate.
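
The partial dependence curves and surfaces in Fig. 5 can be read as averaged model predictions: the feature of interest is forced to a grid value for every sample, the fitted model is evaluated, and the predictions are averaged. The sketch below is a minimal, model-agnostic illustration of that calculation. It is not the authors' code: `partial_dependence`, `partial_dependence_2d` and `toy_model` are hypothetical names, and the toy response and sample values are invented for the example.

```python
def partial_dependence(predict, X, feature, grid):
    """Average prediction when column `feature` is forced to each grid value."""
    curve = []
    for value in grid:
        total = 0.0
        for row in X:
            modified = list(row)
            modified[feature] = value
            total += predict(modified)
        curve.append(total / len(X))
    return curve


def partial_dependence_2d(predict, X, feat_a, feat_b, grid_a, grid_b):
    """Two-feature version: one averaged prediction per (grid_a, grid_b) pair."""
    surface = []
    for va in grid_a:
        line = []
        for vb in grid_b:
            total = 0.0
            for row in X:
                modified = list(row)
                modified[feat_a] = va
                modified[feat_b] = vb
                total += predict(modified)
            line.append(total / len(X))
        surface.append(line)
    return surface


def toy_model(row):
    """Hypothetical fitted model: rises with temperature, levels off above 120."""
    temperature, log_p = row
    return 0.5 * min(temperature, 120.0) - 10.0 * log_p


# Invented samples: [temperature, log P of the HBD]
samples = [[90.0, -1.5], [110.0, -1.0], [130.0, -0.5]]
curve = partial_dependence(toy_model, samples, 0, [80.0, 100.0, 120.0, 140.0])
print(curve)  # [50.0, 60.0, 70.0, 70.0]
```

The flat segment at the end of the curve corresponds to the kind of plateau described in the text, where a further increase of a variable no longer changes the degree of dependence. The two-feature function yields the grid of values behind a 3D surface plot.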


Fig. 6. RDG isosurface of DES. Note: Blue parts indicate strong attractive interaction and red parts indicate strong nonbonding overlap. The values indicate the
interaction energy.
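
For reference, the reduced density gradient used for such isosurfaces is commonly defined as below, following Johnson et al. (2010). This is the standard definition and not an equation stated in this paper; here ρ(r) denotes the electron density and s(r) the RDG.

```latex
s(\mathbf{r}) = \frac{1}{2\,(3\pi^{2})^{1/3}}\,
                \frac{\lvert \nabla \rho(\mathbf{r}) \rvert}{\rho(\mathbf{r})^{4/3}}
```

Regions of low density and low s correspond to noncovalent interactions. In this type of analysis the isosurfaces are usually colored by the sign of the second eigenvalue of the electron density Hessian multiplied by ρ, which gives the blue (attractive), green (van der Waals) and red (repulsive) scheme described in the caption.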


it easier to remove lignin and xylan, which was similar to the findings of Liang's study (Liang et al., 2021). This result was also consistent with what was mentioned earlier in this article. For example, increasing HBD polarity, acidity, and molecular complexity will facilitate lignin removal. By comparing the interaction energies of MTAC (−36.8 kcal/mol), TEAC (−40.1 kcal/mol), and BTEAC (−43.6 kcal/mol) with lactic acid, it was found that the interaction energies after the addition of methyl (TEAC) and the addition of phenyl (BTEAC) were larger than that of the unincorporated group (MTAC), which indicated that the addition of phenyl and methyl had a significant effect on their internal interactions.

4. Conclusion

Machine learning methods such as random forests were used to reveal the reaction mechanism of DES pretreatment of lignocellulose. Lignin removal and biomass recovery could be predicted based on lignocellulosic biomass characteristics, DES physicochemical properties and reaction conditions. Temperature was the most important reaction condition variable affecting the pretreatment. Acidity, polarity, and hydrophilicity of DES were the most important physicochemical variables of DES pretreatment. For lignin removal, the contributions of pretreatment temperature and hydrophilicity of DES were higher than those of the other variables. This study could guide the reaction design, the screening of DES, and the optimization of the pretreatment process.

CRediT authorship contribution statement

Huanfei Xu: Conceptualization, Writing – original draft, Supervision. Chenyang Dong: Writing – original draft, Software. Weixian Wang: Writing – original draft. Yaoze Liu: Writing – original draft, Software. Bin Li: Writing – review & editing. Fusheng Liu: Writing – review & editing, Supervision, Conceptualization.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data Availability

The authors are unable or have chosen not to specify which data has been used.

Acknowledgement

We thank the National Natural Science Foundation of China (22008133).

References

Ao, Y.L., Li, H.Q., Zhu, L.P., Ali, S., Yang, Z.G., 2019. The linear random forest algorithm and its advantages in machine learning assisted logging regression modeling. J. Pet. Sci. Eng. 174, 776–789.
Chourasia, V.R., Pandey, A., Pant, K.K., Henry, R.J., 2021. Improving enzymatic digestibility of sugarcane bagasse from different varieties of sugarcane using deep eutectic solvent pretreatment. Bioresour. Technol. 337.
Gur, T.M., 2022. Carbon dioxide emissions, capture, storage and utilization: review of materials, processes and technologies. Prog. Energy Combust. Sci. 89.
Jha, S., Okolie, J.A., Nanda, S., Dalai, A.K., 2022. A review of biomass resources and thermochemical conversion technologies. Chem. Eng. Technol. 45 (5), 791–799.
Johnson, E.R., Keinan, S., Mori-Sánchez, P., Contreras-García, J., Cohen, A.J., Yang, W., 2010. Revealing noncovalent interactions. J. Am. Chem. Soc. 132, 6498–6506.
Kaptan, S., Vattulainen, I., 2022. Machine learning in the analysis of biomolecular simulations. Adv. Phys. X 7, 1.
Kwon, G.J., Yang, B.S., Park, C.W., Bandi, R., Lee, E.A., Park, J.S., Han, S.Y., Kim, N.H., Lee, S.H., 2020. Treatment effects of choline chloride-based deep eutectic solvent on the chemical composition of red pine (Pinus densiflora). Bioresources 15 (3), 6457–6470.
Li, N., Meng, F.Y., Yang, H.Y., Shi, Z.J., Zhao, P., Yang, J., 2022. Enhancing enzymatic digestibility of bamboo residues using a three-constituent deep eutectic solvent pretreatment. Bioresour. Technol. 346.
Liang, X., Zhu, Y., Qi, B., Li, S., Luo, J., Wan, Y., 2021. Structure-property-performance relationships of lactic acid-based deep eutectic solvents with different hydrogen bond acceptors for corn stover pretreatment. Bioresour. Technol. 336, 125312.
Lu, T., Chen, F., 2012. Multiwfn: a multifunctional wavefunction analyzer. J. Comput. Chem. 33 (5), 580–592.
Ma, C.Y., Xu, L.H., Sun, Q., Sun, S.N., Cao, X.F., Wen, J.L., Yuan, T.Q., 2022. Ultrafast alkaline deep eutectic solvent pretreatment for enhancing enzymatic saccharification and lignin fractionation from industrial xylose residue. Bioresour. Technol. 352.
Massayev, S., Lee, K.M., 2022. Evaluation of deep eutectic solvent pretreatment towards efficacy of enzymatic saccharification using multivariate analysis techniques. J. Clean. Prod. 360.
Pan, S.W., Zheng, Z.C., Guo, Z., Luo, H.N., 2022. An optimized XGBoost method for predicting reservoir porosity using petrophysical logs. J. Pet. Sci. Eng. 208.
Rahimi, M., Abbaspour-Fard, M.H., Rohani, A., 2021. A multi-data-driven procedure towards a comprehensive understanding of the activated carbon electrodes performance (using for supercapacitor) employing ANN technique. Renew. Energy 180, 980–992.
Rolnick, D., Donti, P.L., Kaack, L.H., Kochanski, K., Lacoste, A., Sankaran, K., Ross, A.S., Milojevic-Dupont, N., Jaques, N., Waldman-Brown, A., Luccioni, A.S., Maharaj, T., Sherwin, E.D., Mukkavilli, S.K., Kording, K.P., Gomes, C.P., Ng, A.Y., Hassabis, D., Platt, J.C., Creutzig, F., Chayes, J., Bengio, Y., 2023. Tackling climate change with machine learning. ACM Comput. Surv. 55, 2.
Rzychon, M., Zogala, A., Rog, L. SHAP-based interpretation of an XGBoost model in the prediction of grindability of coals and their blends. International Journal of Coal Preparation and Utilization.
Saito, K., Hashizume, T., Kitayama, K., Watanabe, T., 2022. Characterization of novel deep eutectic solvent, choline chloride/glutamic acid, as efficient solvent for lignin dissolution. Chem. Lett. 51 (4), 407–411.
Saiz, J., Ortega-Ojeda, F., Lopez-Melero, L., Montalvo, G., Garcia-Ruiz, C., 2014. Electrophoretic fingerprinting of benzodiazepine tablets in spike drinks. Electrophoresis 35 (21–22), 3250–3257.
Shang, J., Zhang, Y.X., Schauer, J.J., Chen, S.M., Yang, S.J., Han, T.T., Zhang, D., Zhang, J.J., An, J.X., 2022. Prediction of the oxidation potential of PM2.5 exposures from pollutant composition and sources. Environ. Pollut. 293.
Shen, X.J., Wen, J.L., Mei, Q.Q., Chen, X., Sun, D., Yuan, T.Q., Sun, R.C., 2019. Facile fractionation of lignocelluloses by biomass-derived deep eutectic solvent (DES) pretreatment for cellulose enzymatic hydrolysis and lignin valorization. Green Chem. 21 (2), 275–283.
Soares, B., Silvestre, A.J.D., Pinto, P.C.R., Freire, C.S.R., Coutinho, J.A.P., 2019. Hydrotropy and Cosolvency in Lignin Solubilization with Deep Eutectic Solvents. ACS Sustain. Chem. Eng. 7 (14), 12485–12493.
Stewart, J.J.P., 1990. MOPAC: a semiempirical molecular orbital program. J. Comput.-Aided Mol. Des. 4, 1–105.
Su, Y., Huang, C.X., Lai, C.H., Yong, Q., 2021. Green solvent pretreatment for enhanced production of sugars and antioxidative lignin from poplar. Bioresour. Technol. 321.
Sun, D., Lv, Z.W., Rao, J., Tian, R., Sun, S.N., Peng, F., 2022. Effects of hydrothermal pretreatment on the dissolution and structural evolution of hemicelluloses and lignin: a review. Carbohydr. Polym. 281.
Tangadpalliwar, S.R., Vishwakarma, S., Nimbalkar, R., Garg, P., 2019. ChemSuite: a package for chemoinformatics calculations and machine learning. Chem. Biol. Drug Des. 93 (5), 960–964.
Wang, W., Lee, D.J., 2021. Lignocellulosic biomass pretreatment by deep eutectic solvents on lignin extraction and saccharification enhancement: a review. Bioresour. Technol. 339.
Wei, Y.J., Yu, J., Du, Y.L., Li, H.X., Su, C.H., 2021. Artificial intelligence simulation of Pb(II) and Cd(II) adsorption using a novel metal organic framework-based nanocomposite adsorbent. J. Mol. Liq. 343.
Xu, H.F., Kong, Y., Peng, J.J., Song, X.M., Che, X.P., Liu, S.W., Tian, W.D., 2020. Multivariate analysis of the process of deep eutectic solvent pretreatment of lignocellulosic biomass. Ind. Crops Prod. 150.
Xu, H.F., Kong, Y., Peng, J.J., Song, X.M., Liu, Y.Z., Su, Z.N., Li, B., Gao, C.H., Tian, W.D., 2021. Comprehensive analysis of important parameters of choline chloride-based deep eutectic solvent pretreatment of lignocellulosic biomass. Bioresour. Technol. 319.
Yuan, X.Z., Suvarna, M., Low, S., Dissanayake, P.D., Lee, K.B., Li, J., Wang, X.N., Ok, Y.S., 2021. Applied machine learning for prediction of CO2 adsorption on biomass waste-derived porous carbons. Environ. Sci. Technol. 55 (17), 11925–11936.
Zhong, L., Wang, C., Yang, G.H., Chen, J.C., Xu, F., Yoo, C.G., Lyu, G.J., 2022. Rapid and efficient microwave-assisted guanidine hydrochloride deep eutectic solvent pretreatment for biological conversion of castor stalk. Bioresour. Technol. 343.
Zhu, X.Z., Li, Y.N., Wang, X.N., 2019. Machine learning prediction of biochar yield and carbon contents in biochar based on biomass characteristics and pyrolysis conditions. Bioresour. Technol. 288.
