Bioresource Technology
journal homepage: www.elsevier.com/locate/biortech
ARTICLE INFO

Keywords:
Biomass
Chemical constituents
Random forest
Ultimate analysis

ABSTRACT

Chemical constituents are important properties for the utilization of biomass, and experimental approaches to determine those properties are always expensive and time-consuming. Here, a novel random forest (RF) model is developed for accurately predicting the major chemical constituents of biomass from the much more readily available ultimate analysis, and is compared with a previous correlation as well as with experimental data. Two databases are constructed from the available literature for training and application of the RF model. The training results show that the determination coefficients (R²) of the RF model predictions are 0.954, 0.933 and 0.968 for cellulose, hemicellulose and lignin, respectively. The application results show that the present RF model gives accurate predictions of chemical constituents for various biomasses with MAPE < 20%, and the R² values are 0.862, 0.904 and 0.962 for predictions of cellulose, hemicellulose and lignin, respectively. In contrast, the previous correlation only works within the narrow range used to develop it, and gives unrealistic negative predictions with MAPE > 500% for samples outside that range.
⁎ Corresponding author.
E-mail address: fanjr@zju.edu.cn (J. Fan).
https://doi.org/10.1016/j.biortech.2019.121541
Received 4 April 2019; Received in revised form 20 May 2019; Accepted 21 May 2019
Available online 25 May 2019
0960-8524/ © 2019 Elsevier Ltd. All rights reserved.
J. Xing, et al. Bioresource Technology 288 (2019) 121541
and the rest 100 samples are used as the training samples. In the database, the mass fractions of lignin, hemicellulose and cellulose are within the range of 0–56.5%, 0–77.8% and 6.7–92%, respectively. The carbon, hydrogen and oxygen fractions are within the range of 33.4–62%, 2.78–7.5% and 26.8–56.8%, respectively, and the hydrogen-carbon ratio and oxygen-carbon ratio are within the range of 0.6224–1.9567 and 0.3387–1.1900, respectively (as shown in Fig. 1). Here, the carbon fraction, hydrogen-carbon ratio and oxygen-carbon ratio are selected as the inputs of the random forest model, and the outputs are the major chemical constituents of the biomass. These three parameters are chosen as the inputs for the following two reasons. The first reason is [...]

[...] maximum values of the actual targets in the training samples, respectively.

2.2. Random forest

Random forest is an ensemble machine learning approach that gives its prediction as the majority vote of the decision trees in the forest for classification problems and as the average of the tree predictions for regression problems. Fig. 2 shows the schematic topology of the random forest. Starting with N samples with M features, a bootstrap sampling method is employed to randomly generate n sub-samples, and each sub-sample is then randomly divided into in-bag (IB) and out-of-bag (OOB) data. The IB data are split into two groups based on the selected features, and the split process is repeated until there are no data left to split. The OOB data are not involved in the training process but are used to determine the optimal number of decision trees in the forest via a trial-and-error test. The normalized mean square error (NMSE) of the OOB data, expressed in Eqs. (3) and (4), is employed to choose the optimal tree number.

$$\mathrm{MSE}_{OOB}^{N_{tree}} = \frac{\sum_{i=1}^{N_{tree}} \left( y_{i,pred}^{OOB} - y_{i,exp}^{OOB} \right)^2}{N_{tree}} \tag{3}$$

$$\mathrm{NMSE}_{OOB}^{N_{tree}} = \frac{\mathrm{MSE}_{OOB}^{N_{tree}} - \mathrm{MSE}_{OOB,min}^{N_{tree}}}{\mathrm{MSE}_{OOB,max}^{N_{tree}} - \mathrm{MSE}_{OOB,min}^{N_{tree}}} \tag{4}$$

where $\mathrm{NMSE}_{OOB}^{N_{tree}}$ and $\mathrm{MSE}_{OOB}^{N_{tree}}$ are the normalized and calculated mean square errors for the OOB data when the tree number is $N_{tree}$, respectively. $y_{i,pred}^{OOB}$ and $y_{i,exp}^{OOB}$ are the predicted and real targets for the OOB data, respectively. $\mathrm{MSE}_{OOB,min}^{N_{tree}}$ and $\mathrm{MSE}_{OOB,max}^{N_{tree}}$ are the minimum and maximum values of MSE for the OOB data over the tested range of the number of decision trees, respectively. The test results are presented in Section 3. Finally, the random forest gives its prediction as the majority vote of the trees for classification problems or as the average of the tree predictions for regression problems. Further details of this approach can be found in Ho (1995).

2.3. Evaluation indicators

To quantitatively evaluate the performances of the present RF model and the previous nonlinear correlation, two common evaluation indicators, the determination coefficient (R²) and the mean absolute percentage error (MAPE), are introduced and expressed as Eqs. (5) and (6), respectively.

$$R^2 = 1 - \frac{\sum_{i=1}^{N} \left( y_{i,pred} - y_{i,exp} \right)^2}{\sum_{i=1}^{N} \left( y_{i,exp} - \bar{y}_{exp} \right)^2} \tag{5}$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_{i,pred} - y_{i,exp}}{y_{i,exp}} \right| \times 100\% \tag{6}$$

where $\bar{y}_{exp}$ is the average value for all the training samples, and $N$ is the total number of training samples.

3. Results and discussions

3.1. Hyper-parameter optimization

In the random forest model, all three features are selected for each decision tree, and the only parameter that needs to be determined is the number of trees in the forest. In the present study, this parameter, $N_{tree}$, is determined by a trial-and-error test. Fig. 3 shows the NMSE values of the OOB data for the three major chemical constituents with different numbers of decision trees. The NMSE first decreases sharply and then stabilizes as the tree number increases for both the IB and OOB data, as previous studies found (Xing et al., 2019; Genuer et al., 2017). This is because, when the tree number is small, the complex nonlinear correlations cannot be well characterized by the limited randomly generated bootstrap sample datasets, which results in underfitting of the trained RF model. There is a critical value of the tree number for each constituent, and when the tree number is above this critical value, the testing and training performances of the RF model do not improve further with the number of trees. This indicates that, beyond the critical tree number, increasing the tree number does not significantly improve the model accuracy but increases the computational time. Thus, the test with the lowest NMSE and smallest tree number provides the optimal number of decision trees in the random forest for a good compromise between accuracy and computational time. The optimized tree numbers are 68, 51 and 66 for the following RF modelling of cellulose, hemicellulose and lignin fractions, respectively, and the training performance of the RF model is introduced in the next section.
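The tree-number selection described in Section 3.1 can be sketched in a few lines. This is a minimal illustration of the normalization in Eq. (4) followed by the "lowest NMSE, smallest tree number" rule, not the authors' code. The OOB MSE values, the function names and the optional `tolerance` argument are assumptions introduced here for illustration.

```python
def normalize_mse(mse_by_ntree):
    """Eq. (4): min-max normalise the OOB MSE over the tested tree numbers."""
    lo = min(mse_by_ntree.values())
    hi = max(mse_by_ntree.values())
    if hi == lo:
        # All tested forests perform identically; define NMSE as 0 everywhere.
        return {n: 0.0 for n in mse_by_ntree}
    return {n: (mse - lo) / (hi - lo) for n, mse in mse_by_ntree.items()}


def pick_tree_number(mse_by_ntree, tolerance=0.0):
    """Smallest tree number whose NMSE is within `tolerance` of the minimum.

    `tolerance` is a hypothetical knob (not in the paper) for trading a small
    loss of accuracy against a smaller, cheaper forest.
    """
    nmse = normalize_mse(mse_by_ntree)
    return min(n for n, value in nmse.items() if value <= tolerance)


# Hypothetical OOB MSE values from a trial-and-error sweep (not from the paper).
oob_mse = {10: 42.0, 30: 18.5, 50: 12.1, 70: 11.3, 90: 11.3, 110: 11.6}

nmse = normalize_mse(oob_mse)
best = pick_tree_number(oob_mse)
best_relaxed = pick_tree_number(oob_mse, tolerance=0.05)
print(best, best_relaxed)
```

With these made-up values, tree numbers 70 and 90 tie for the lowest NMSE and the smaller one is chosen. Allowing a small tolerance selects an even smaller forest, which mirrors the compromise between accuracy and computational time discussed above.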
[Fig. 3 and Fig. 4 plots: NMSE versus tree number for the training and OOB data, panels (a) to (c); predicted versus measured fraction (%) with best-fit line and ±20% relative-error lines.]

[...] (MAPE), which are defined in Eqs. (5) and (6), respectively, are 0.954 and 7.595%, 0.933 and 7.754%, and 0.968 and 6.428%, respectively. The determination coefficients are clearly improved compared with the previous correlation proposed by Sheng and Azevedo (2002), which indicates that the present RF model better characterizes the complex nonlinear relations between ultimate analysis and chemical constituents. It is worth noting that the high errors in the low-value region result from the following two reasons. The first is that there are limited samples in the low-value region, so the nonlinear correlation cannot be accurately characterized there. The second is that, for low values, a small deviation results in a larger relative error than it does for high values. Overall, the present RF model gives accurate predictions of the chemical constituents for various biomasses, and the determination coefficients are much improved compared with those of the previous correlation.
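The two evaluation indicators, and the remark that a small deviation produces a large relative error at low values, can be illustrated with a short sketch. It follows Eqs. (5) and (6) directly; the sample values are hypothetical and are not taken from the training database.

```python
def r_squared(y_pred, y_exp):
    """Eq. (5): determination coefficient of predictions against measurements."""
    mean_exp = sum(y_exp) / len(y_exp)
    ss_res = sum((p - e) ** 2 for p, e in zip(y_pred, y_exp))
    ss_tot = sum((e - mean_exp) ** 2 for e in y_exp)
    return 1.0 - ss_res / ss_tot


def mape(y_pred, y_exp):
    """Eq. (6): mean absolute percentage error, in percent."""
    total = sum(abs((p - e) / e) for p, e in zip(y_pred, y_exp))
    return 100.0 * total / len(y_exp)


# Hypothetical measured and predicted mass fractions (%); every prediction is
# off by the same absolute amount of one percentage point.
y_exp = [5.0, 20.0, 40.0, 50.0]
y_pred = [6.0, 21.0, 41.0, 51.0]

relative_errors = [100.0 * abs(p - e) / e for p, e in zip(y_pred, y_exp)]
r2 = r_squared(y_pred, y_exp)
error = mape(y_pred, y_exp)
print(relative_errors, r2, error)
```

Although the absolute deviation is identical for all four samples, the sample with the lowest measured value contributes by far the largest relative error. Note that Eq. (6) is undefined for a measured value of exactly zero, so this sketch raises `ZeroDivisionError` in that case; how such samples are treated in the paper is not stated in the text.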
Fig. 4. Training performances of the random forest models for predictions of (a)
cellulose fraction, (b) hemicellulose fraction and (c) lignin fraction.
3.3. Application performance

To validate the performance of the developed RF model more comprehensively, the RF model is used to predict the chemical constituents for the application database, in which all samples are outside of the training database. Since the mass fraction of volatile matter is available in the application database, a direct comparison between the present RF model and the previous correlation proposed by Sheng and Azevedo (2002), which is expressed as the following equations, is achievable.
[Fig. 5 plots: constituent fraction (%) and relative error (%) versus biomass sample number, panels (a) to (c); legend: correlation from Sheng and Azevedo (2002), present RF model, experiment.]
Fig. 5. Comparisons of the chemical constituents predicted using correlation proposed in Sheng and Azevedo (2002) and the present RF model with the experimental
data for the application database: (a) cellulose, (b) hemicellulose and (c) lignin.
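The correlation of Sheng and Azevedo (2002) compared in Fig. 5, written out as Eqs. (7) and (8), can be evaluated directly to see why it fails outside its fitted range. The sketch below uses the coefficients and the fitted ranges quoted in this section; the two input samples and the function names are hypothetical and chosen only for illustration.

```python
# Fitted ranges reported by Sheng and Azevedo (2002), as quoted in this section.
FITTED_RANGES = {"oc": (0.56, 0.83), "hc": (1.26, 1.69), "vm": (73.0, 86.0)}


def sheng_azevedo(oc, hc, vm):
    """Eqs. (7) and (8): cellulose and lignin fractions from O/C, H/C and VM.

    The hemicellulose fraction follows from X_hem = VM - X_cel - X_lig.
    """
    x_cel = (-1019.07 + 293.810 * oc - 187.639 * oc**2
             + 65.1426 * hc - 19.3025 * hc**2
             + 21.7448 * vm - 0.132123 * vm**2)
    x_lig = (612.099 + 195.366 * oc - 156.535 * oc**2
             + 511.357 * hc - 177.025 * hc**2
             - 24.3224 * vm + 0.145306 * vm**2)
    x_hem = vm - x_cel - x_lig
    return x_cel, x_lig, x_hem


def in_fitted_range(oc, hc, vm):
    """True when all three inputs lie inside the ranges used to fit the correlation."""
    values = {"oc": oc, "hc": hc, "vm": vm}
    return all(lo <= values[key] <= hi for key, (lo, hi) in FITTED_RANGES.items())


# Hypothetical samples: one inside the fitted ranges, one with VM below them.
inside = (0.70, 1.50, 80.0)
outside = (0.70, 1.50, 60.0)

cel_in, lig_in, hem_in = sheng_azevedo(*inside)
cel_out, lig_out, hem_out = sheng_azevedo(*outside)
print(in_fitted_range(*inside), cel_in, lig_in, hem_in)
print(in_fitted_range(*outside), cel_out, lig_out, hem_out)
```

For the in-range sample the quadratic gives physically plausible fractions, whereas lowering VM to 60% alone already drives the predicted cellulose fraction negative. This is consistent with the unrealistic negative predictions reported for samples outside the fitted ranges.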
$$X_{cel} = -1019.07 + 293.810(O/C) - 187.639(O/C)^2 + 65.1426(H/C) - 19.3025(H/C)^2 + 21.7448(VM) - 0.132123(VM)^2 \tag{7}$$

$$X_{lig} = 612.099 + 195.366(O/C) - 156.535(O/C)^2 + 511.357(H/C) - 177.025(H/C)^2 - 24.3224(VM) + 0.145306(VM)^2 \tag{8}$$

where $VM$ is the volatile matter fraction in weight percent (daf). $X_{cel}$, $X_{lig}$ and $X_{hem}$ are the mass fractions of cellulose, lignin and hemicellulose, respectively, and the hemicellulose fraction is calculated through $X_{hem} = VM - X_{cel} - X_{lig}$. Sheng and Azevedo (2002) stated that this correlation was developed from samples with O/C from 0.56 to 0.83, H/C from 1.26 to 1.69, and VM from 73% to 86%.

Fig. 5 shows the direct comparisons of the chemical constituents predicted with the present RF model and the previous correlation by Sheng and Azevedo (2002) as well as the experimental data. The green dashed lines mark a relative error of 20%, and the Y axis is interrupted for clarity owing to the extreme values predicted by the correlation of Sheng and Azevedo (2002). To explain the agreements and discrepancies of the predictions more comprehensively, we compare the data distribution of the validation samples with those of the present training database and the previous database used by Sheng and Azevedo (2002). Fig. 6 shows the comparison results. The dashed lines with arrowheads denote the data range of samples in the training database, and the blue balls represent samples outside the training database. Biomass samples No. 32 to No. 44 are some of the biomass samples used in the study of Sheng and Azevedo (2002). Most samples in the validation database are outside the previous database used by Sheng and Azevedo (2002), while all samples are inside the present training database. The previous correlation shows acceptable performance in predicting the cellulose and lignin fractions for biomass samples No. 32 to No. 44, but obvious errors in the prediction of the hemicellulose fraction can still be found for those samples. For other samples outside their training database, some unrealistic negative values are predicted for the three major chemical constituents, such as for biomass samples No. 2, 3, 6 and 8, and the relative errors are even larger than 500%, as seen in the bottom plot of each sub-graph in Fig. 5. Those deviations can be attributed to the fact that the volatile matter fraction, O/C or H/C are outside of the data range used to develop the correlation, as seen in Fig. 6. This also indicates that the previous correlation has a narrow application scope, and that moving slightly outside the data range brings an obvious deviation, as seen in a previous study (Estiati et al., 2019). In contrast, the present RF model reproduces the experimental data well, with relative error less than 20% for all samples in the application database, and the model performance is much better than that of the correlation even for biomass samples No. 32 to No. 44. This can be attributed to the following two reasons. The first is that the present training samples are distributed over a wider range, such as the hydrogen-carbon ratio (0.6224–1.9567) and oxygen-carbon ratio (0.3387–1.1900), and all validation samples are within the range of the present training database. Thus the RF model can give good predictions based on the well-learned complex nonlinear relations from the training database. The second reason is the strong ability of random forest to handle nonlinear problems. Randomly selecting samples and features and predicting by averaging all tree predictions make it a powerful tool for reducing the possibility of overfitting, thus the RF model has better robustness and accuracy than the previous correlation.

4. Conclusions

A novel RF model is developed for predicting biomass chemical constituents from ultimate analysis, and is compared with the previous correlation and the experimental data. The training and application databases are constructed from experimental data in the available literature. The training results show that the R² values of the RF model predictions are 0.954, 0.933 and 0.968 for cellulose, hemicellulose and lignin, respectively, which is much improved compared with the previous correlation. The application results show that the RF model gives accurate predictions for various biomasses with MAPE < 20%. In contrast, the previous correlation shows a narrow application scope and even gives unrealistic negative predictions with MAPE > 500%.

Acknowledgement

The authors are grateful for the support from the National Natural Science Foundation (Grant No: 91741203) and National Key Research and Development Program of China (Grant: 2017YFB0601805). JX especially thanks Miss Yuehan Xu for her constant support and company during his doctoral degree. JX also
[Fig. 6 plots, axes: C (%), H/C and O/C (left); VM (%), H/C and O/C (right).]
Fig. 6. Data distribution of the validation samples compared with the present training database (left) and the previous database used by Sheng and Azevedo (2002) (right).
thanks Mr. Shiling Yang for his helpful discussions.

Appendix A. Supplementary data

Supplementary data associated with this article can be found, in the online version, at https://doi.org/10.1016/j.biortech.2019.121541.

References

Cai, J.M., Xu, W.W., Liu, R.H., 2013. Sensitivity analysis of three-parallel-DAEM-reaction model for describing rice straw pyrolysis. Bioresour. Technol. 132, 423–426.
Carrier, M., Loppinet-Serani, A., Denux, D., Lasnier, J.M., Ham-Pichavant, F., Cansell, F., Aymonier, C., 2011. Thermogravimetric analysis as a new method to determine the lignocellulosic composition of biomass. Biomass. Bioenergy 35, 298–307.
Chelgani, S.C., Mesroghli, S., Hower, J.C., 2010. Simultaneous prediction of coal rank parameters based on ultimate analysis using regression and artificial neural network. Int. J. Coal Geol. 83, 31–34.
Chen, T.J., Zhang, J.Z., Wu, J.H., 2016. Kinetic and energy production analysis of pyrolysis of lignocellulosic biomass using a three-parallel Gaussian reaction model. Bioresour. Technol. 211, 502–508.
Cozzani, V., Lucchesi, A., Stoppato, G., Maschio, G., 1997. A new method to determine the composition of biomass by thermogravimetric analysis. Can. J. Chem. Eng. 75, 127–133.
Despange, F., Massart, D.L., 1998. Neural networks in multivariate calibration. Analyst 123, 157–178.
ECN.TNO, 2012. Phyllis2, database for biomass and waste. Available at: https://phyllis.nl/ (accessed: 12 May 2019).
Estiati, I., Tellabide, M., Saldarriaga, J.F., et al., 2019. Comparison of artificial neural networks with empirical correlations for estimating the average cycle time in conical spouted beds. Particuology 42, 48–57.
Genuer, R., Poggi, J., Tuleau-Malot, C., Vialaneix, N., 2017. Random forests for big data. Big Data Res. 9, 28–46.
Ho, T.K., 1995. Random decision forests. In: Proceedings of the 3rd ICDAR, pp. 278–282.
Jin, X.L., Chen, X.L., Shi, C.H., Li, M., Guan, Y.J., Yu, C.Y., Yamada, T., Sacks, E., Peng, J.H., 2017. Determination of hemicellulose, cellulose and lignin content using visible and near infrared spectroscopy in Miscanthus sinensis. Bioresour. Technol. 241, 603–609.
Jupudi, R.S., Zamansky, V., Fletcher, T.H., 2009. Prediction of light gas composition in coal devolatilization. Energy Fuel. 23, 3063–3067.
Li, X.L., Sun, C.J., Zhou, B.X., He, Y., 2015. Determination of hemicellulose, cellulose and lignin in moso bamboo by near infrared spectroscopy. Sci. Rep. 5, 17210.
Liu, L., Ye, X.P., Womac, A.R., Sokhansanj, S., 2010. Variability of biomass chemical composition and rapid analysis using FT-NIR techniques. Carbohydr. Polym. 81, 820–829.
Nair, V.V., Dhar, H., Kumar, S., Thalla, A.K., Mukherjee, S., Wong, J.W.C., 2016. Artificial neural network based modeling to evaluate methane yield from biogas in a laboratory-scale anaerobic bioreactor. Bioresour. Technol. 217, 90–99.
Reisinger, K., Haslinger, C., Herger, M., 2012. BIOBIB - a database for biofuels. Available at: http://cdmaster2.vt.tuwien.ac.at/biobib/ (accessed: 12 May 2019).
Saldarriaga, J.F., Aguado, R., Pablos, A., 2015. Fast characterization of biomass fuels by thermogravimetric analysis (TGA). Fuel 140, 744–751.
Sharma, H.S.S., 1996. Compositional analysis of neutral detergent, acid detergent, lignin and humus fractions of mushroom compost. Thermochim. Acta 285, 211–220.
Sheng, C.D., Azevedo, J.L.T., 2002. Modelling biomass devolatilization using the chemical percolation devolatilization model for the main components. Proc. Combust. Inst. 29, 407–414.
Sluiter, J.B., Ruiz, R.O., Scarlata, C.J., Sluiter, A.D., Templeton, D.W., 2010. Compositional analysis of lignocellulosic feedstocks. 1. Review and description of methods. J. Agric. Food. Chem. 58, 9043–9053.
Solomon, P.R., Hamblen, D.G., Carangelo, R.M., Serio, M.A., Deshpande, G.V., 1998. General model of coal devolatilization. Energy Fuel 2, 405–422.
Sunphorka, S., Chalermsinsuwan, B., Piumsomboon, P., 2017. Artificial neural network model for the prediction of kinetic parameters of biomass pyrolysis from its constituents. Fuel 193, 142–158.
Toscan, A., Morais, A.R.C., Paixao, S.M., Alves, L., Andreaus, J., Camassola, M., Dillon, A.J.P., Lukasik, R.M., 2017. High-pressure carbon dioxide/water pre-treatment of sugarcane bagasse and elephant grass: assessment of the effect of biomass composition on process efficiency. Bioresour. Technol. 224, 639–647.
Uzun, H., Yıldız, Z., Goldfarb, J.L., Ceylan, S., 2017. Improved prediction of higher heating value of biomass using an artificial neural network model based on proximate analysis. Bioresour. Technol. 234, 122–130.
Vani, S., Sukumaran, R.K., Savithri, S., 2015. Prediction of sugar yields during hydrolysis of lignocellulosic biomass using artificial neural network modeling. Bioresour. Technol. 188, 128–135.
Wang, C.H., Li, L.Q., Zeng, Z., Xu, X., Ma, X.M., Chen, R.F., Su, C.Q., 2019. Catalytic performance of potassium in lignocellulosic biomass pyrolysis based on an optimized three-parallel distributed activation energy model. Bioresour. Technol. 281, 412–420.
Wang, S.R., Dai, G.X., Yang, H.P., Luo, Z.Y., 2017. Lignocellulosic biomass pyrolysis mechanism: A state-of-the-art review. Prog. Energy Combust. Sci. 62, 33–86.
World Bioenergy Association, 2018. Global Bioenergy Statistics 2018. Available at: https://worldbioenergy.org/global-bioenergy-statistics (accessed: 18 March 2019).
Xing, J.K., Luo, K., Wang, H.O., Wang, S., Bai, Y., Fan, J.R., 2019. Predictive single-step kinetic model of biomass devolatilization for CFD applications: a comparison study of empirical correlations (EC), artificial neural networks (ANN) and random forest (RF). Renew. Energy 136, 104–114.
Xing, J.K., Luo, K., Heinz, P., Wang, H.O., Zhao, C.G., Fan, J.R., 2019. Predicting kinetic parameters for coal devolatilization by means of Artificial Neural Networks. Proc. Combust. Inst. 37, 2943–2950.
Yang, Z., Li, K., Zhang, M.M., Xin, D.L., Zhang, J.H., 2016. Rapid determination of chemical composition and classification of bamboo fractions using visible-near infrared spectroscopy coupled with multivariate data analysis. Biotechnol. Biofuels 9, 35.