Professional Documents
Culture Documents
Methods
View Article Online
PAPER View Journal | View Issue
A new, rapid analytical method using near infrared spectroscopy (NIRS) was developed to differentiate Radix
Angelicae sinensis samples from five different geographic origins, and to determine the contents of ethanol
extract and ferulic acid in the samples. The scattering effect and baseline shift in the NIR spectra were
corrected and the spectral features were enhanced by several pre-processing methods. By using
principal component analysis (PCA), the grouping homogeneity and sample cluster tendency were
visualized. Furthermore, random forests (RF) was applied to select the most effective wavenumber
variables from the full NIR variables and to build the qualitative models. Finally, the genetic algorithm
optimization combined with multiple linear regression (GA-MLR) was applied to select the most relevant
Received 30th June 2014
Accepted 16th August 2014
variables and to build the ethanol extract and ferulic acid quantitative models. The results showed that
the correlation coefficients of the models are Rtest ¼ 0.83 for ethanol extract and Rtest ¼ 0.81 for ferulic
DOI: 10.1039/c4ay01542h
acid. The outcome showed that NIRS can serve as routine screening in the quality control of chinese
www.rsc.org/methods herbal medicine.
This journal is © The Royal Society of Chemistry 2014 Anal. Methods, 2014, 6, 9691–9697 | 9691
View Article Online
extract are two factors whose content are required for the quality standard. Each sample was performed in triplicate, and the
assessment of RAs.1 Ferulic acid isolated from RAs is widely average spectrum was obtained.
used as the marker compound for assessing the quality of RAs
and its products. Ferulic acid is an anti-oxidant, anti-inam- 2.3 Chemical analysis
matory and anti-cancer agent. RAs ethanol extract exhibits
The determination methods of ferulic acid and ethanol extract
estrogenic activity, antibrotic action, and effects on cardio-
were according to Chinese Pharmacopoeia.1 For the quanti-
and cerebro-vascular systems, and so on.17,18 Bioactivities of the
cation of ferulic acid, an HPLC method was performed. Initially,
major constituents vary in RAs extract isolated with different
a sample of 4.0 g of powdered plant material was extracted with
concentrations of ethanol.
Published on 06 November 2014. Downloaded by University of Pittsburgh on 28/11/2014 15:00:39.
9692 | Anal. Methods, 2014, 6, 9691–9697 This journal is © The Royal Society of Chemistry 2014
View Article Online
models is used to prove the predictive ability of the built of Breiman and Culter (http://www.stat.berkeley.edu/breiman/
models. RandomForests/).
2.5.3 GA-MLR. The selection of variables for multivariate
calibration can be considered an optimization problem. The
2.5 Statistical analysis genetic algorithm (GA) is currently popular in many elds and
2.5.1 Principal component analysis. PCA allows visualiza- has been successfully applied to frequency selection problems,
tion, retaining as much as possible the information present in in which the GA manipulates binary strings called chromo-
the original data by the reduction of the data dimensionality. somes that contain genes that encode experimental factors or
PCA transforms the original measurement variables into new, variables.26,27 Genetic algorithm optimization combined with
Published on 06 November 2014. Downloaded by University of Pittsburgh on 28/11/2014 15:00:39.
uncorrelated variables called principal components. Each multiple linear regression (GA-MLR) combines the advantages
principal component is a linear combination of all the original of GA and MLR. The GA can help nd optimal values for
measurement variables.20 In this work, the PCA was carried out several disparate variables associated with the calibration
using Matlab 7.6.0(R2008a). model, while the MLR procedure can be integrated into the
2.5.2 Random forests. Random forests (RF) is a useful objective function driving the optimization.26,28–30 Generally,
classication algorithm that was rst introduced by Breiman. It the GA consists of four basic steps, where steps (ii)–(iv) are
is a classier ensembling classication trees. Each tree gives a repeated until the termination criterion is reached. For more
classication, and the tree ‘votes’ is used to classify the samples. detail, one can see ref. 31.
The forest chooses the classication having the most votes. RF GA applied to MLR has been shown to be a very efficient
is very resistant to overtting and usually performs well in optimization procedure. They have been applied on many
problems with a low samples/features ratio, like spectrometric spectral datasets and are shown to provide better results than
data.21,22 full-spectrum approaches.31–33 In this study, GA-MLR was per-
RF uses bootstrap aggregating (bagging, i.e. each new formed by the MobyDigs soware.
training set is drawn, with a replacement from the original
training set, leaving out about one-third of the cases). Each
classication tree is grown without pruning, using a new 2.6 Model assessment
bootstrap training set and, at each node, is split using random
feature selection (i.e. using the best predictive variable of a For qualitative model assessment, the total accuracy (ACC), i.e.
subset of randomly selected variables). Aer a large number ‘k’ the proportion of correct classications, was used to evaluate
of classication trees has been generated, they are used to the models. For quantitative model assessment, Qloo (correla-
predict the class membership of new data. The cases not tion coefficient of leave-one-out cross validation, Qloo), Rtr
included in the bootstrap set (i.e. the out-of-bag cases) and (correlation coefficient of training set, Rtr), Rtest (correlation
therefore not used in the construction of the trees, are used as a coefficient of test set, Rtest), RMSE (root mean square error of
test set to provide an unbiased estimate of the prediction training set, RMSE), RMSEP (root mean square error of test set,
accuracy. Each new case is applied to each of the k classication RMSEP) and RPD (ratio of prediction to deviation) were used to
trees starting from the root and is assigned to a class corre- evaluate the models.
sponding to the leaf, and then the decisions of the individual
trees are combined by majority voting. At the end of the run, on
average, each element of the original data set is out-of-bag in 3. Results and discussion
one-third of the k-tree constructing iterations, or each element
of the original data set is classied by one-third of the k trees. 3.1 Spectral analysis
The proportion of misclassications (%) over all of the out-of- Fig. 1 shows the average NIR spectra of RAs from ve different
bag elements is called the out-of-bag (OOB) error. origins. The broad band at 4763 cm1 is commonly called the
The OOB error is an unbiased estimate of the generalization “carbohydrate band”. The rst overtone of the C–H stretch in the
error. Breiman (2001)21 proved that random forests produce a 5764 cm1 region is also present. Bands at 6859 cm1 and
limiting value of the generalization error. As the number of trees 5149 cm1 were probably related to the O–H groups and
increases, the generalization error always converges. The humidity. It is difficult to nd specic bands in the raw NIR
number of trees (k) needs to be set sufficiently high to allow for spectra based on geographical origins. Otherwise, baselines of the
this convergence.23 In RF, mtry, which is the number of sample spectra vary widely, due to the particle size effect, packing
randomly selected predictive variables used to split the nodes, is density, noise, and so on. The Savitzky–Golay derivative was used
the only parameter that requires some judgment to set, but the to remove the baseline dri and enhance the spectral features,
forests are not too sensitive to its value, as long as it is in the and the results are shown as Fig. 2. As a result, the unique spectral
right ballpark. According to the OOB error rate, an optimal features associated with different samples became more
value of mtry was found. apparent. Spectra around 7000 cm1, and 6000–4000 cm1 from
Random forests could provide a useful measure of the Qinhe are obviously different from samples from Gansu. More-
importance of the predictive value of each explanatory variable.24 over, slight differences around 4000–4500 cm1 and 7000 cm1
For more detail, see (ref. 25). For the RF model development, we were seen among samples from Gansu province. The further
used the soware for RF classication available from the website feature selection could be performed by RF.
This journal is © The Royal Society of Chemistry 2014 Anal. Methods, 2014, 6, 9691–9697 | 9693
View Article Online
3.2 PCA
PCA was performed as the rst attempt to extract and visualize
the main information in the multivariate data. Pre-treatment
methods, such as vector normalization + rst derivative and
SNV + rst derivative pre-processing, were compared and the
Published on 06 November 2014. Downloaded by University of Pittsburgh on 28/11/2014 15:00:39.
Fig. 4 Distribution of the training set and test set in the principal
Fig. 3 Three-dimensional score plot using PC1, PC2, and PC3 for component space for RF qualitative models of: (A) ethanol extract. (B)
discriminating Angelicae sinensis samples from five different origins. ferulic acid, and (C) the quantitation models.
9694 | Anal. Methods, 2014, 6, 9691–9697 This journal is © The Royal Society of Chemistry 2014
View Article Online
SNV rst derivative of the spectra which turned out to be the discrimination method. For the training set, there were 12
best data pre-treatment for optimum separation of both groups samples from Minxian, of which 1 sample was identied as
in this study was used. Fig. 3 shows the three dimensional Longxi; among the 21 Tanchang samples, 3 samples were mis-
principal component score plot. The rst three components identied as Minxian, Wuwei, and Qinhe respectively;
describe 78.07%, 16.18%, and 2.54%, respectively. It can be 2 samples from Longxi were misclassied as Minxian and
seen that the samples are distinguished clearly between Min- Tanchang, respectively. For the other varieties, all the samples
xian and Tanchang, Longxi and Tanchang, Gansu province and were correctly classied. For the test set, 1 sample from Longxi
Yunnan province. RAs from Longxi and Wuwei are not effec- was identied as Tanchang, whereas the other varieties were
tively clustered into groups. Some overlaps are observed correctly identied. Most misclassications related to samples
Published on 06 November 2014. Downloaded by University of Pittsburgh on 28/11/2014 15:00:39.
between Longxi and Minxian, and Longxi and Wuwei. However, from Gansu province, since the environmental conditions such
PCA only provided visual discrimination results. For actual as topography, soil, and moisture of the different cities in the
discrimination, RF was utilized in the following studies. same province are similar.
In order to build robust models, all 96 sample spectra were
randomly divided into training and test sets. Fig. 4 shows the
distribution homogeneity of the training set and test set in the 3.4 Quantication of the ethanol extract and ferulic acid
principal component space for the RF models (A), the ethanol with GA-MLR
extract (B), and the ferulic acid (C) quantitation models, Table 4 shows the average and standard deviation values for the
respectively. It could be seen that all the test set samples follow analyzed ethanol extract and ferulic acid in RAs. The samples
the same probability distribution as the training data. with the maximum and minimum contents of ethanol extract
and ferulic acid were included in the two data sets. For the
training set, the mean SD of the ethanol extract content is
3.3 Classication of RAs with RF
0.57 g/g 0.03, and that of the ferulic acid content is 0.10%
Table 2 shows the discrimination results of RF with different 0.02%. Generally, content prediction based on NIRS for
data pretreatment. The accuracy signicantly improved aer components less than 0.1% is considered not reliable,35 which
pretreatment. The best prediction results were obtained using is a challenge for ferulic acid models. The parameters for the
the full spectra with SNV rst derivative pretreatment, resulting ethanol extract and ferulic acid content quantitation models
in an accuracy of 92.2% for the training set and 94.7% for the were set as: population (500), iteration (5000), mutation rate
test set, with the parameters set up as: mtry ¼ 7; k ¼ 700. The (0.1), crossover rate (0.5) and population (300), iteration (6000),
selected four top-ranked important variables were 4782 cm1, mutation rate (0.1), crossover rate (0.5), respectively. Including
7264 cm1, 4454 cm1, and 4326 cm1. According to (ref. 34), more descriptors in the model will t the training set better, but
these variables were related to N–H, O–H, and C–H from protein would rupture the predictions of other samples. This
and starch in the samples. phenomenon is called ‘over-tting’ of a model. Finally, four
The detailed description of the results is shown in Table 3, descriptors also proved to be optimal for both the ethanol
where the rows correspond to the real class of the samples, and extract and ferulic acid content models.
the columns correspond to the class assigned by a particular The quantication functions for ethanol extract and ferulic
acid in RAs were as follows:
Table 2 Classification results with different preprocessed spectra YEE ¼ 14.06Var1 14.75Var2 + 2.443Var3 1.485Var4
+ 0.3545, Rtr ¼ 0.84;
Model Data pretreatment ACC (training set) ACC (test set)
Table 3 Detailed description of the classification results with RF+1st Table 4 Summary of the variation ranges, means, and standard
derivative deviations of the ethanol extract and ferulic acid contents
This journal is © The Royal Society of Chemistry 2014 Anal. Methods, 2014, 6, 9691–9697 | 9695
View Article Online
Table 5 GA-MLR modeling results for ethanol extract and ferulic acid
Ethanol extract SNV + rst derivative 0.84 0.82 0.83 0.02 0.03 1.6
Ferulic acid SNV + rst derivative 0.85 0.83 0.81 0.01 0.01 1.7
of the wavenumbers of 4770 cm1, 4808 cm1, 6140 cm1, and respectively; for the ferulic acid quantitation model, they are
Published on 06 November 2014. Downloaded by University of Pittsburgh on 28/11/2014 15:00:39.
5279 cm1 respectively; YFA denotes the content of ferulic acid. 0.83, 0.81, 0.01, 0.01 and 1.7, respectively. These suggest that
The four variables (Var0 ) denoted the intensity of the wave- the predicted values from NIRS are close to the values deter-
numbers of 7229 cm1, 7047 cm1, 6669 cm1, 5951 cm1 mined by the pharmacopoeia method. However, R and RPD
respectively. Table 5 lists the GA-MLR modeling results with the higher than 0.91 and 2, respectively, indicate a good predic-
SNV + rst derivative pretreatment for ethanol extract and tion,36 which suggests that the quantitative models need to be
ferulic acid. Qloo, Rtest, RMSE, RMSEP and RPD for the ethanol improved. Fig. 5 and 6 are scatter plots showing the correlation
extract quantitation model are 0.82, 0.83, 0.02, 0.03 and 1.6, between the NIR predicted values and the reference values. It
can be seen that the actual and predicted concentrations for
ferulic acid at below 0.1% show better correlation than for the
lowest concentrations of ethanol extract, which may be because
ethanol extract is a mixture and the extraction procedure is
complicated.
4 Conclusions
NIR combined with RF or GA-MLR shows great power for the
qualitative or quantitative analysis of Radix Angelicae sinensis.
Spectra treatment with the SNV + 1st derivative proved to be
more effective for removing effects that did not contribute to the
classication. The accuracy of the test set was up to 94.7% by RF
with SNV + 1st derivative spectra. In this work, GA-MLR using
SNV + 1st derivative spectra provided acceptable quantitative
models for ethanol extract and ferulic acid. However, the
samples collected were limited, and the further research
strategy will be directed toward assembling more herbal medi-
Fig. 5 Plot of experimental vs. predicted ethanol extract values. cines and optimizing the quantitative models.
References
1 The State Pharmacopoeia Commission of People's Republic
of China, Pharmacopoeia of the People's Republic of China, vol.
1, Chemical Industry Press, Beijing, China, 2005, p. 101.
2 L. Z. Yi, Y. Z. Liang, H. Wu and D. L. Yuan, The analysis of
Radix Angelicae sinensis (Danggui), J. Chromatogr. A, 2009,
1216, 1991–2001.
3 J. Bradbury, From chinese medicine to anticancer drug, Drug
Discovery Today, 2005, 10, 1131–1132.
4 R. Upton, American Herbal Pharmacopoeia and Therapeutic
Compendium—Dang Gui Root, Scotts Valley, CA, 2003.
5 The Compile Commission of Zhonghua Bencao of the State
Administration of Traditional Chinese Medicine of the
People's Republic of China. Zhonghua Bencao, vol. 5,
Shanghai Science and Technology Press, Shanghai, 1999, p.
893.
6 G. H. Lu, K. Chan, Y. Z. Liang, K. Leung, C. L. Chan,
Fig. 6 Plot of experimental vs. predicted ferulic acid values.
Z. H. Jiang and Z. Z. Zhao, J. Chromatogr. A, 2005, 1073,
383–392.
9696 | Anal. Methods, 2014, 6, 9691–9697 This journal is © The Royal Society of Chemistry 2014
View Article Online
7 S. Wang, H. Q. Ma, Y. J. Sun, C. D. Qiao, S. J. Shao and 23 T. Bylander, Mach. Learn., 2002, 48, 287–297.
S. X. Jiang, Talanta, 2007, 72, 434–436. 24 K. J. Archer and R. V. Kimes, Comput. Stat. Data Anal., 2008,
8 S. Y. Wei, C. J. Xua, D. K. W. Mok, H. Cao, T. Y. Lau and 52, 2249–2260.
F. T. Chau, J. Chromatogr. A, 2008, 1187, 232–238. 25 R. Genuer, J. M. Poggi and C. Tuleau-Malot, Pattern Recogn.
9 X. L. Piao, J. H. Park, J. Cui, D. H. Kim and H. H. Yoo, J. Lett., 2010, 31, 2225–2236.
Pharm. Biomed. Anal., 2007, 44, 1163–1167. 26 C. B. Lucasius, M. L. M. Beckers and G. Kateman, Anal. Chim.
10 S. Zschocke, J. H. Liu, H. Stuppner and R. Bauer, Phytochem. Acta, 1994, 286, 135–153.
Anal., 1998, 9, 283–290. 27 O. Polgár, M. Fried, T. Lohner and I. Bársony, Surf. Sci., 2000,
11 I. Esteban-Diez, J. M. Gonzalez-saiz and C. Pizarro, Anal. 457, 157–177.
Published on 06 November 2014. Downloaded by University of Pittsburgh on 28/11/2014 15:00:39.
This journal is © The Royal Society of Chemistry 2014 Anal. Methods, 2014, 6, 9691–9697 | 9697