Pedro Sousa Sampaio, Andreia Soares, Ana Castanho, Ana Sofia Almeida, Jorge
Oliveira, Carla Brites
PII: S0308-8146(17)31513-3
DOI: http://dx.doi.org/10.1016/j.foodchem.2017.09.058
Reference: FOCH 21727
Please cite this article as: Sampaio, P.S., Soares, A., Castanho, A., Almeida, A.S., Oliveira, J., Brites, C.,
Optimization of rice amylose determination by NIR-spectroscopy using PLS chemometrics algorithms, Food
Chemistry (2017), doi: http://dx.doi.org/10.1016/j.foodchem.2017.09.058
This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers
we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and
review of the resulting proof before it is published in its final form. Please note that during the production process
errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
TITLE: Optimization of rice amylose determination by NIR-spectroscopy using PLS chemometrics
algorithms
Authors: Pedro Sousa Sampaio1,2*, Andreia Soares1, Ana Castanho1, Ana Sofia Almeida1, Jorge Oliveira3, Carla Brites1
Affiliation:
1 Instituto Nacional de Investigação Agrária e Veterinária (INIAV), Av. da República, Quinta do Marquês, 2780-157 Oeiras, Portugal
2 Faculty of Engineering, Lusophone University of Humanities and Technology, Campo Grande, 376, 1749-019 Lisbon, Portugal
3 University College Cork, School of Engineering, Ireland
E-mail: pnsampaio@gmail.pt
Abstract
Determining amylose content in rice with near infrared (NIR) spectroscopy, associated with a suitable multivariate
regression method, is both feasible and relevant for the rice business to enable Process Analytical Technology
applications for this critical factor, but it has not been fully exploited. Because conventional amylose determination is
time-consuming and prone to experimental error, there is an urgent need for a low-cost, nondestructive, on-line method
able to provide high accuracy and reproducibility. Different rice varieties and specific chemometrics tools, such as partial least squares
(PLS), interval-PLS, synergy interval-PLS and moving windows-PLS, were applied to develop an optimal regression
model for rice amylose determination. The model performance was evaluated by the root mean square error of
prediction (RMSEP) and the correlation coefficient (R). The high performance of the siPLS method (R=0.94;
RMSEP=1.938; 8941–8194 cm-1; 5592–5045 cm-1; and 4683–4335 cm-1) shows the feasibility of NIR technology for
Keywords: Multivariate models; Process Analytical Technologies; PLS; iPLS; siPLS; mwPLS.
1. Introduction
Rice (Oryza sativa L.), the world's main food crop, is constituted fundamentally by starch. Starch is a complex
polysaccharide of α-D-glucose units exclusively, which are joined by a sequence of α-D-(1,4)-glucosidic linkages thus
giving rise to linear or helical chains referred to as amylose. Although α-(1,6)-glucosidic linkages are much less
frequent, they form branch points between the chains thereby creating highly branched domains, denominated
amylopectin (Pandey, Rani, Madhav, Sundaram, Varaprasad, Bohra, & Kumar, 2012). Starch biosynthesis in higher
plants including rice is catalysed by four classes of enzymes, namely, ADP-Glc pyrophosphorylase (AGPase), starch
synthase, starch branching enzymes and starch debranching enzymes. The enzyme granule bound starch synthase-I
controls the synthesis of amylose in the rice endosperm, while soluble starch synthase, starch branching enzyme and
starch debranching enzymes together control the synthesis of amylopectin (Bao, Sun, & Corke, 2002; Zhang, Cheng,
Zhang, Guo, Su, & Jiang, 2011). Amylose is considered to be the most important determinant of the eating quality of
rice, and based on its content rice varieties can be classified as waxy (0-2%); very low (3-12%); low (13-20%);
intermediate (21-25%) and high (>26%) (Juliano, Perez, Blakeney, Castillo, Kongseree, & Laignelet, 1981). The fine
structure of amylose, both molecular size and chain-length distribution, is also a significant determinant of the hardness of
cooked rice (Li, Prakash, Nicholson, Fitzgerald & Gilbert, 2016). Amylose content is correlated with the retrogradation
behaviour, influencing the textural properties of cooked rice and the viscoelasticity dynamic of rice starch gel (Lu,
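The amylose classes quoted above from Juliano et al. (1981) can be written as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and because the quoted ranges leave small gaps (for example between 2% and 3%), values falling in a gap are assigned to the higher class, which is an assumption of this example rather than part of the classification.

```python
def amylose_class(amylose_percent):
    """Return the amylose class for an amylose content given in percent."""
    if amylose_percent < 0:
        raise ValueError("amylose content cannot be negative")
    # Upper bounds follow the ranges quoted in the text. Values lying between
    # two quoted ranges (e.g. 2.5%) go to the higher class: an assumption.
    bounds = [(2, "waxy"), (12, "very low"), (20, "low"), (25, "intermediate")]
    for upper, label in bounds:
        if amylose_percent <= upper:
            return label
    return "high"
```

For example, the calibration-set mean reported later in the text (19.70%) would fall in the "low" class under this reading.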
The classical method for amylose and amylopectin determination relies on the colour complex formed in the iodine
reaction, coupled with potentiometric or amperometric titration. The method is based on the inherent capacity of
amylose to accommodate polyiodide ions, chiefly I5-, within its helical structure. Because the short chains and branch
linkages of amylopectin interfere with the formation of stable structures, amylopectin is unable to form such
complexes, which are therefore specific to the amylose fraction (Hizukuri, 1996). However, the iodine affinity varies within species,
hence compromising the accuracy of this method. A survey conducted by the international network for quality rice
(INQR) showed that five different versions of the iodine binding method are currently in use and that the reproducibility
was high within laboratories but low between laboratories (Fitzgerald, Bergman, Resurreccion, Moller, Jimenez, &
Reinke, 2009). There are also other methods, such as differential scanning calorimetry (Sievert, & Holm, 1993),
potentiometry (Banks, Greenwood, & Muir, 1971), spectrophotometry (Morrison, & Laignelet, 1983), and
chromatography (Matheson & Welsh, 1988; Yun, Li, & Wood, 2013). The amylose can also be evaluated by the
enzymatic method developed by Megazyme (Gibson, Solah, & McCleary, 1997). However, this method has
some drawbacks, such as the relatively high cost per sample and, above all, the difficulty of testing a large number of samples
(Hu, Burton, & Yang, 2010; Soong, Quek, & Henry, 2015). Despite the existence of other procedures, the colorimetric
method is still commonly used, and its accuracy was improved by using standards from specific rice varieties carrying
the alleles of the Waxy gene responsible for amylose synthesis and calibration values obtained by separation of
hydrodynamic volume and molecular weight of amylose by size exclusion chromatography (ISO 6647-1,2, 2015).
Near-infrared (NIR) spectroscopy is a promising and widely accepted analytical technique that is fast, easy to use and
nondestructive, and that requires minimal or no sample preparation (Bart, Himmelsbach, McClung, &
Champagne, 2007). It has become particularly popular in recent years in the pharmaceutical industry to assist the
development of online Process Analytical Technologies (PAT) to achieve Quality by Design in manufacturing.
However, its prediction accuracy depends on sample physical status, chemical components, temperature, colour,
cleanliness, quantity used for measurement and above all, the statistical model used (Bagchi, Sharma, &
Chattopadhyay, 2016). Apparent amylose content has been predicted by NIR spectroscopy using milled rice flour (Bao,
Cai, & Corke, 2001; Delwiche, Bean, Miller, Webb, & Williams, 1995), milled whole grain (Delwiche et al., 1995;
Windham, Lyon, Champagne, Barton, Webb, & McClung, 1997; Shu, Wu, Xia, Gao, & McClung, 1999), or amylose
and proteins in rice flour (Xie, Tang, Chen, Luo, Jiao, Shao, Wei, & Hu, 2014). However, those studies faced several
drawbacks concerning the valuable rice amylose reference data and the model performance. The main difficulty of NIR
spectroscopy with multivariate analysis is related to wavenumber or spectral region selection, especially when the
spectra display unresolved peaks or fail to reveal important features. Several methods have been studied to select the
optimal variables for multivariate calibration to remove irrelevant spectral variables and improve model performance;
Multivariate calibration builds a predictive model relating measured quantities (wavenumbers) to properties of
interest (concentration data). A variety of linear regression methods based on latent variables (LVs), such as partial
least squares (PLS), have been developed to address this problem; however, drawbacks such as noise in the spectral data
can raise the calibration and prediction errors and degrade the model (Wold, & Sjostrom, 2001).
Meanwhile, spectral region selection, using appropriate algorithms, was reported to considerably improve the
performance of the full-spectrum calibration techniques, avoiding non-modeled interferences and building a well-fitted
model (Friedel, Patz, & Dietrich, 2013; Lee, Bawn, & Yoon, 2012; Nørgaard, Saudland, Wagner, Nielsen, Munck, &
Engelsen, 2000). Other studies showed that selecting the spectral regions responsible for the property of interest is
fundamental to increasing prediction performance (Kalivas, 1997; Spiegelman, McShane,
Goetz, Motamedi, Yue, & Coté, 1998). These methods can be classified into two classes: single wavelength selection
and wavelength interval selection. Several approaches have been proposed for selecting the optimal set of
spectral regions, such as interval PLS (iPLS), synergy interval PLS (siPLS) and moving window PLS (mwPLS) (Friedel et al.,
2013; Leardi, & Nørgaard, 2004; Ma, Wang, Chen, Cheng, & Lai, 2017). The principle of iPLS consists of splitting the
spectra into equal-width intervals and developing a local PLS model for each one. The sub-intervals with the lowest
root mean squared error of prediction (RMSEP) are deemed to be the best. Several methods based on iPLS were
developed to optimise the combination of the selected intervals, such as siPLS (Leardi & Nørgaard, 2004). The main
advantage of this kind of method is that it uses a graphical display to guide the choice of the best sub-intervals and
to compare the prediction performance of the local models with that of the full-spectrum model. Because testing only
a series of adjacent, non-overlapping intervals can miss more informative regions, mwPLS was
proposed to overcome this drawback. It builds a series of PLS models in a window that moves across the whole spectrum and then
chooses the informative intervals with low model complexity and a low sum of residuals. mwPLS is a
modelling technique that can be thought of as a series of diagnostic PLS regressions on every contiguous window of
size H in the parent data set. In effect, a window of size H is "moved" across the data set to collect modelling
information. The model quality and the number of latent variables (LVs) required during this process
can then be used to find the best spectral region(s) of size H. mwPLS is a promising procedure for
consecutive wavelength selection when building an optimal calibration model, and it has proven effective for
waveband selection in the analysis of many objects (Chen, Yin, Tang, & Pan, 2017; Yun et al., 2013).
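The moving-window idea described above can be illustrated by enumerating every contiguous window of H variables. This is a minimal sketch, not the iToolbox implementation; the function name and the window size used in the example are hypothetical.

```python
def moving_windows(n_variables, window_size):
    """Yield (start, stop) index pairs for every contiguous window of size H.

    Indices are zero-based and `stop` is exclusive, so spectrum[start:stop]
    selects the H variables on which one local PLS model would be fitted.
    """
    if not 1 <= window_size <= n_variables:
        raise ValueError("window size must be between 1 and n_variables")
    for start in range(n_variables - window_size + 1):
        yield start, start + window_size
```

A spectrum with m variables therefore produces m - H + 1 overlapping windows, compared with only a handful of non-overlapping intervals in iPLS.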
The objective of this study was to test the various methods proposed for developing multivariate models and to select
the one most appropriate for obtaining reliable and accurate measurements of amylose in rice. A large set of rice varieties was
used to challenge the various models. PLS, iPLS, siPLS and mwPLS procedures for NIR quantitative analysis of
amylose were investigated and compared. The different steps required for model calibration were analysed. The number
of PLS factors and the number of spectral intervals were optimised according to the root mean square error in the
calibration set. The performance of the final model was evaluated according to the RMSEP and the correlation
coefficient (R) with the prediction set. The model thus created can be considered a way to obtain a fast, non-destructive,
accurate and reproducible methodology for amylose determination in different rice varieties (after a suitable milling
procedure), providing a modern gold standard for laboratory and industrial analysis amenable for the development of
For this study, sixteen rice varieties (including Indica and Japonica subspecies) from a Portuguese Rice
Breeding Program were grown at three sites along the basins of three different rivers with very different microclimates
(Alcácer do Sal, Salvaterra de Magos and Montemor-o-Velho, Portugal) over four seasons (2012-2015),
providing 168 samples. In addition, 11 standard rice varieties characterised by different amylose contents, sourced from the
International Rice Research Institute (IRRI), Los Baños, Philippines, were used: IR 65; IR 24; IR 64; WU BAI
About 20 g of rice was ground to flour in a Cyclone Sample Mill (Falling number 3100, Perten, Sweden) equipped with
a 0.8 mm screen.
Amylose of rice was determined using the standard iodine colorimetric method according to ISO 6647-2 (2015). The
absorbance was measured using a spectrophotometer (Hitachi; Japan) at 720 nm. Amylose content was quantified using
a standard curve created from absorbance values of 4 calibrated samples from standard rice varieties carrying one of the
five alleles of the Waxy gene, which is the gene responsible for amylose synthesis (IR8, IR24, IR64, IR65) obtained
from IRRI. Pure amylose (potato origin) (Sigma-Aldrich, Germany) was also evaluated. The amylose content was
evaluated in duplicate for each sample of rice, and the reference value corresponds to the average.
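The standard-curve step can be sketched as an ordinary least-squares line through the standards, followed by averaging of the duplicate determinations. This is a minimal sketch under stated assumptions: the function names are hypothetical, a linear absorbance-to-amylose relation is assumed, and any numbers passed in are the user's own (no calibration values from this study are reproduced here).

```python
def fit_standard_curve(amylose_percent, absorbance):
    """Least-squares line: absorbance = slope * amylose + intercept."""
    n = len(amylose_percent)
    mean_x = sum(amylose_percent) / n
    mean_y = sum(absorbance) / n
    sxx = sum((x - mean_x) ** 2 for x in amylose_percent)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(amylose_percent, absorbance))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def amylose_from_duplicates(absorbances, slope, intercept):
    """Convert each replicate absorbance to amylose (%) and average them."""
    values = [(a - intercept) / slope for a in absorbances]
    return sum(values) / len(values)
```

The averaged value returned by `amylose_from_duplicates` plays the role of the reference value used later as the response in the regression models.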
Samples of approximately 25 cm3 of rice flour were loaded into a circular sample cup and pressed slightly to
obtain a similar packing density. Sample spectra were collected using an NIR transflection MPA instrument (Bruker
Optics, Germany). For each rice sample, 16 successive scans were performed over the wavenumber range 12,000–4000
cm-1 at a resolution of 16 cm-1. Two spectra were obtained for each rice sample.
Principal component analysis is a linear pattern recognition technique that allows the reduction of the dimensionality of
multivariate data to n principal components. All samples were considered for analysis to enable inferring how sample
variability may affect possible trends from the direct observation of the scores plot. The outliers were identified using
The raw NIR spectra obtained after outlier elimination were treated with different data preprocessing techniques, such as
standard normal variate (SNV) transformation, multiplicative scatter correction (MSC) and smoothing derivatives, to
obtain reliable qualitative classification and quantitative calibration models. After SNV and MSC, the spectra were
treated using first and second derivatives. The Savitzky-Golay smoothing method was used to eliminate noise such as
baseline drift, tilt, reverse, and so forth (Savitzky & Golay, 1964; Xie, Xiang, Yu, & Deng, 2009).
The PLS regression was performed after outlier identification. The matrix containing the data provided by the NIR
spectra, denoted X, and the vector Y containing the amylose content were employed to build the regression
model. The performance of the final PLS model was evaluated according to the RMSEP and the determination
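$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{1}$$
$$R = \sqrt{1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{2}$$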
where n is the number of samples in the validation test set, yi is the experimentally measured reference result for sample
i and ŷi is the model estimate for the corresponding test sample i (Eq. 1). The correlation coefficient (R)
between the predicted and the measured values was calculated for both the calibration and the validation test sets with
Eq. 2, where ȳ is the mean of the reference measurement results for all samples in the calibration and test set. The best
combination of spectral regions and preprocessing techniques was selected by picking the PLS model with a small
RMSEP, a high R and a low number of latent variables (LV) covering enough data variance.
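The two figures of merit can be computed directly from paired reference and predicted values. The sketch below assumes that Eq. 2 is read as R = sqrt(1 - SS_res / SS_tot) and, for simplicity, takes ȳ as the mean of the reference values passed in; both points are assumptions of this example, and the function names are hypothetical.

```python
import math


def rmsep(y_ref, y_pred):
    """Root mean square error of prediction (Eq. 1)."""
    n = len(y_ref)
    return math.sqrt(sum((y - yh) ** 2 for y, yh in zip(y_ref, y_pred)) / n)


def correlation_r(y_ref, y_pred):
    """Correlation coefficient read from Eq. 2 as sqrt(1 - SS_res / SS_tot)."""
    y_bar = sum(y_ref) / len(y_ref)  # assumption: mean of the supplied set
    ss_res = sum((y - yh) ** 2 for y, yh in zip(y_ref, y_pred))
    ss_tot = sum((y - y_bar) ** 2 for y in y_ref)
    # Clamp at zero so a model worse than the mean does not raise an error.
    return math.sqrt(max(0.0, 1.0 - ss_res / ss_tot))
```

Because RMSEP is expressed in the units of the reference method, it can be compared directly with the amylose range of the samples.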
The iPLS and siPLS were applied to remove irrelevant spectral variables and to improve PLS model performance. The
iPLS models were built on the spectral division into 10, 20, 25 and 50 intervals with a similar width. The iPLS routine
generates graphical information indicating the optimum number of LV used in each interval model and RMSEP values.
In this case, the subinterval that presented the lowest RMSEP values was selected. The siPLS models were constructed
with the spectral set divided into 10, 20, 25 and 50 intervals and combinations from 2 to 3 intervals. The combined
subintervals that presented the lowest RMSEP values were selected. mwPLS is a modelling technique that
can be thought of as a series of diagnostic PLS regressions on every contiguous window of size H in the parent
dataset. In effect, a window of size H is "moved" across the data set to collect modelling information. The model
quality and the number of LVs required during this process can then be used to find the best spectral
region(s) of size H.
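Splitting the spectral variables into intervals of similar width can be sketched as follows. How the iToolbox distributes the remainder when the number of variables is not divisible by the number of intervals is not stated in the text; giving the extra variables to the first intervals is an assumption of this example, and the function name is hypothetical.

```python
def split_intervals(n_variables, n_intervals):
    """Return (start, stop) index pairs for intervals of near-equal width.

    Indices are zero-based with `stop` exclusive. When n_variables is not
    divisible by n_intervals, the first intervals are one variable wider
    (an assumption of this sketch).
    """
    base, extra = divmod(n_variables, n_intervals)
    bounds = []
    start = 0
    for k in range(n_intervals):
        width = base + (1 if k < extra else 0)
        bounds.append((start, start + width))
        start += width
    return bounds
```

With 1154 variables and 25 intervals, for instance, this gives interval widths of 46 or 47 variables.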
PLS, iPLS, siPLS and mwPLS models were performed using MATLAB software (The Mathworks, Natick, MA, USA).
The iToolbox for MATLAB available from (http://www.models.life.ku.dk/itoolbox) was used for calculation of interval
selection by iPLS, siPLS and mwPLS. The statistical analysis (ANOVA) of the calibration and prediction set
A Principal Component Analysis (PCA) was performed after pre-processing for a preliminary examination of the NIR
spectra, to provide an overview of the data, reveal the similarities and differences among the samples and
consequently identify outliers. PCA is a popular variable reduction technique that replaces the measured
variables with principal components (PCs), which are linear combinations of them determined sequentially so that
the components are mutually orthogonal and each PC explains the highest possible percentage of
the total variance of the data still unexplained. It is one of the most frequently used chemometric tools and allows a
projection of data from a higher to a lower dimensional space. A data matrix composed of 354 raw spectra from rice
samples, represented by 1154 variables (i.e. wavenumbers), was submitted to PCA, allowing the selection and
elimination of outlier spectra that could interfere negatively with model construction. The samples that plotted
away from the main cluster in the PC graphs were eliminated, their position being considered evidence of very significant
differences from the other samples. PCA also revealed differences within the full sample set.
The main cluster was formed by two small groups, each characterised by samples harvested in different years.
Thus, the use of a supervised classification method, with an initial knowledge about the classes to be modelled, is
required. After that, the 313 raw NIR spectra retained were treated with different preprocessing tools, such as smoothing
derivatives, standard normal variate (SNV) transformation, multiplicative scatter correction (MSC) and the Savitzky-Golay
filter, to obtain reliable qualitative classification and quantitative calibration models.
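The projection onto the first principal component can be sketched with a simple power iteration on the mean-centred data. This is an illustration only: the study inspected the PC score plots visually, whereas the numerical flagging rule below (scores further than k standard deviations from the mean) is a hypothetical stand-in, as are the function names.

```python
import math


def first_pc(X, n_iter=200):
    """Return (loading, scores) of the first principal component.

    Uses power iteration on X^T X of the mean-centred data. If the starting
    direction happens to be orthogonal to the data, it is returned unchanged.
    """
    n, m = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(m)]
    Xc = [[row[j] - mean[j] for j in range(m)] for row in X]
    v = [1.0 / math.sqrt(m)] * m  # starting direction
    for _ in range(n_iter):
        t = [sum(a * b for a, b in zip(row, v)) for row in Xc]  # scores
        v_new = [sum(Xc[i][j] * t[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(a * a for a in v_new))
        if norm < 1e-12:
            break
        v = [a / norm for a in v_new]
    scores = [sum(a * b for a, b in zip(row, v)) for row in Xc]
    return v, scores


def flag_outliers(scores, k=3.0):
    """Indices of scores more than k standard deviations from the mean."""
    n = len(scores)
    mean = sum(scores) / n
    sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    return [i for i, s in enumerate(scores) if sd > 0 and abs(s - mean) > k * sd]
```

In practice several components would be examined together; a single component is used here only to keep the example short.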
To avoid bias in the sub-set division, all samples were placed in ascending order of amylose content, and
the calibration set was selected to cover the full range of concentrations. After outlier elimination, the 313 spectra
from all samples analysed were divided into two subsets: the calibration set (203), used to build the model, and the
validation set (110), used to test the robustness of the model. Both subsets, randomly constituted, covered
similar amylose content ranges (calibration: 0-33.75%; test set: 2.72-33.65%) and means (calibration data: 19.70%; Test
set data: 20.27%). The variability of samples due to rice varieties could impose quite a challenge for the development of
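One way to obtain calibration and validation subsets that span similar concentration ranges is to sort the samples by reference value and hold out samples at regular positions. The exact selection rule used in the study is not given, so the every-third rule below, the function name and the guarantee that both extremes stay in the calibration set are all assumptions of this sketch.

```python
def split_calibration_validation(reference_values, every=3):
    """Split sample indices after sorting by reference value.

    Every `every`-th sample of the sorted list goes to validation, starting
    from the second one and never taking the last, so the lowest and highest
    reference values always remain in the calibration set.
    """
    order = sorted(range(len(reference_values)),
                   key=lambda i: reference_values[i])
    validation = [order[k] for k in range(1, len(order) - 1, every)]
    held_out = set(validation)
    calibration = [i for i in order if i not in held_out]
    return calibration, validation
```

With `every=3` roughly one third of the samples are held out, which is close to, but not identical with, the 203/110 division reported above.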
The raw NIR spectra (12,000-4000 cm-1) of rice flour samples are plotted in Figure 1-A. A group of atoms in a
molecule may have multiple modes of oscillation caused by stretching and bending motions of the amylose group. The
strongest absorption band, observed at 5184 cm-1, is related to the combination of stretching and bending of the O-H
group of amylose, while the peak at 6835 cm-1 is related to the combination of the first overtone of O-H anti-symmetric
stretching and O-H symmetric stretching of the amylose molecule. The weak absorption band at
8316 cm-1 may be due to the second overtone of symmetric stretching of the C-H bonds of methyl (–CH3) groups. The O-H and
C-H bond vibrations are caused by compounds such as amylose, proteins and water (Pandiselvam, Thirupathi, &
Vennila, 2016). The same authors also obtained five absorption peaks between 10,792 and 6872 cm-1, mainly due to the C–H second
overtone and combination bands that correspond to amylose. Based on the studies performed by Bagchi et al. (2016), two
absorption peaks between 6872 and 5058 cm-1 were obtained, and they are related to C=O stretch, O–H and N–H stretch
and also C–H stretch first overtone associated with the protein present in the rice (Burns & Ciurczak, 1992).
The spectrum of pure amylose makes it possible to evaluate the similarities between the bands of the rice samples
and those of amylose (Fig. 1-B). The NIR spectrum of amylose also presents major peaks at 4633 cm-1, 4996 cm-1, 5184
cm-1, 6834 cm-1 and 8316 cm-1. The amylose prediction model was first developed with full-spectrum
PLS models, with and without data preprocessing (Table 1). The PLS model built on the raw spectra, without any
preprocessing, is characterised by a low R (0.70) and a high RMSEP (3.909), owing to the significant spectral noise.
As can be seen, the spectral profiles present some trends and noise and, therefore, suitable spectral preprocessing
is necessary to highlight the differences between rice varieties according to amylose content, which cannot be
distinguished by the naked eye alone when the differences in amylose content are small.
Furthermore, to make full use of the informative data and to eliminate noise present in the spectra, data
pretreatment is often needed before establishing the calibration model. Particle size, for example, determines the
spectral path length, which can lead to a substantial effect on the resultant spectrum and consequently the model (Mark,
2001). To minimise the influence of these parameters the raw spectra are usually subjected to preprocessing before
developing calibration models. Pre-treatments recommended to obtain reliable, accurate and stable models were
applied, namely smoothing derivatives (R=0.76, RMSEP=3.571 for the 1st derivative and R=0.73, RMSEP=3.761 for the 2nd
derivative), SNV transformation (R=0.69, RMSEP=4.018), MSC (R=0.71, RMSEP=3.863), as well as the Savitzky-Golay
filter (SG (69.4.4): R=0.87, RMSEP=2.678) and SNV+SG (69.4.4) (R=0.90, RMSEP=2.435), to remove and highlight the
The SG filter method was also applied for model optimisation. The SG filter offers many different smoothing
modes. Its smoothing parameters, namely the polynomial degree (PD), the derivative order of the polynomial (DOP)
and the number of smoothing points (NSP), are considered very meaningful. A too-small NSP is prone to cause
calculation error, resulting in decreased model precision, while a large NSP would over-smooth the spectral
data, leading to decreased accuracy. A reasonable choice of NSP is therefore essential for SG smoothing. The NSP can be
appropriately selected according to the PLS model prediction result, in combination with the selection of PLS latent
variables. For that reason, an optimisation study of the SG filter was previously carried out to determine the
PD, DOP and NSP that provided the best
results. Based on this preliminary study, the optimum parameters (PD=69; DOP=4; and NSP=4) were obtained.
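The mechanism of Savitzky-Golay smoothing can be illustrated with the classic 5-point quadratic filter, whose convolution coefficients (-3, 12, 17, 12, -3)/35 come from the original tables of Savitzky & Golay (1964). This sketch deliberately does not reproduce the (69.4.4) setting used in the study; the function name is hypothetical and the two points at each edge are simply left unsmoothed.

```python
# Convolution coefficients of the 5-point quadratic/cubic smoothing filter.
SG_5PT_COEFFS = (-3, 12, 17, 12, -3)
SG_5PT_NORM = 35


def savgol_5pt(spectrum):
    """Smooth a spectrum with the 5-point quadratic Savitzky-Golay filter.

    Each interior point is replaced by the value at the window centre of a
    least-squares quadratic fitted through its 5-point neighbourhood.
    The first and last two points are returned unchanged.
    """
    half = len(SG_5PT_COEFFS) // 2
    smoothed = list(spectrum)
    for i in range(half, len(spectrum) - half):
        window = spectrum[i - half:i + half + 1]
        smoothed[i] = sum(c * v for c, v in zip(SG_5PT_COEFFS, window)) / SG_5PT_NORM
    return smoothed
```

A quadratic signal passes through this filter unchanged, while an isolated spike is spread over its neighbours, which is the trade-off between noise removal and over-smoothing that the choice of NSP controls.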
Consequently, the PLS model built with these parameters was characterised by R=0.87 and
RMSEP=2.678. These results showed that it was possible to extract significant information contained in the spectral data,
Given these results, the simultaneous application of spectral pretreatment methods (MSC and SNV plus SG
smoothing) and PLS models was found to be more accurate: (MSC+SG (69.4.4) R=0.88; RMSEP=2.650; and
Near infrared (NIR) spectroscopy is characterised by excessive background noise and weak analytical
signals due to near infrared overtones and combinations. The NIR spectrum of solid samples is often accompanied by
scattering noise due to non-uniform particle size, as in rice grain that was previously ground. To make full
use of the informative data and to eliminate noise, data pretreatment is often necessary before establishing the
calibration model. Savitzky-Golay (SG) smoothing is a widely used pretreatment method that can effectively remove
noise such as baseline drift, tilt, reverse, and so forth (Gorry, 1990; Xie, Xiang, Yu, & Deng, 2009; Delwiche &
Reeves, 2010; Chen, Pan, Chen, & Lu, 2011). To overcome scattering interference, the multiplicative scatter correction
(MSC) method is also applied to the spectral data, since it can separate the informative absorbance of the analyte from the
scattering signal (Barnes et al., 1989; Silva, Ferreira, Braga, & Sena, 2012). This practical procedure
eliminates the spectral differences within the same batch of samples that are due to non-uniform particle size. Then, the
SG smoothing and MSC are both spectral pretreatment methods with much potential. Indeed, the model performance can differ
considerably depending on whether SG smoothing and MSC are used separately or combined. Moreover, the
proper smoothing mode should be selected for the pretreatment optimisation. This requires a significant number of
computer experiments, establishing different NIR spectroscopy analysis models corresponding to different pretreatment
parameters, so that a reasonable model can be determined by comparing the prediction results. It is an important way to
improve the predictive ability of NIR spectroscopy analysis, especially for the samples of complex systems (Chen,
Song, Tang, Feng, & Lin, 2013). Moreover, it is evident that the most suitable smoothing mode should be selected for
the pretreatment optimisation. This requires a large number of computer runs, establishing different NIR spectroscopy
Based on these results, the PLS models differed when the SG smoothing and MSC/SNV methods were used
separately or combined. SNV normalises spectra when the effective path length varies among
samples. Such path length variation can occur when measuring the spectra of powdery samples, as in this study, because
of variation in particle size, as well as colour, between samples. MSC can be considered a suitable method when
working with samples made up of solid particles of different sizes and structures. In the flour
obtained from rice, the particle size distribution varies according to the grain hardness, the samples lack uniformity,
and so their NIR diffuse reflectance spectrum is accompanied by scattering noise. MSC eliminates the spectral differences
within the same batch of samples that are due to non-uniform particle size (Fig. 1-C). Spectral data preprocessing removes the
irrelevant information (noise) that cannot be handled properly by regression techniques, and MSC is the most popular
normalization technique used to preprocess NIR spectral data (Næs, Isaksson, Fearn, & Davies, 2002) to
compensate for additive (baseline shift) and multiplicative (tilt) effects (Martens, & Stark, 1991). The PLS
models obtained after each pretreatment were improved
compared with the PLS model of the full spectrum without preprocessing. Nevertheless, based on the correlation
coefficients and RMSEP values of all full-spectrum PLS models, it was not possible to establish a suitable and robust
quantitative relationship between the spectral data and the amylose content of the rice. These poor models may be due
to regions in the spectra that contain non-modeled information (noise) and should, therefore, be excluded
from the model. For that reason, it is important to develop a calibration model that focuses on spectral region
selection.
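The two scatter corrections discussed above can be sketched in a few lines. SNV centres and scales each spectrum by its own mean and standard deviation, while MSC regresses each spectrum on a reference spectrum and removes the fitted offset and slope. Using the mean spectrum of the set as the MSC reference is a common choice and an assumption here; the function names are hypothetical.

```python
import math


def snv(spectrum):
    """Standard normal variate: centre and scale one spectrum by itself."""
    n = len(spectrum)
    mean = sum(spectrum) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in spectrum) / (n - 1))
    return [(v - mean) / sd for v in spectrum]


def msc(spectra):
    """Multiplicative scatter correction against the mean spectrum.

    Each spectrum s is modelled as s = offset + slope * reference and then
    corrected to (s - offset) / slope.
    """
    n_vars = len(spectra[0])
    ref = [sum(s[j] for s in spectra) / len(spectra) for j in range(n_vars)]
    ref_mean = sum(ref) / n_vars
    sxx = sum((r - ref_mean) ** 2 for r in ref)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / n_vars
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(ref, s)) / sxx
        offset = s_mean - slope * ref_mean
        corrected.append([(v - offset) / slope for v in s])
    return corrected
```

Two spectra that differ only by an additive baseline and a multiplicative factor become identical after either correction, which is the behaviour expected for samples that differ only in particle-size scattering.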
The development of spectral interval selection was first accomplished by the interval PLS (iPLS) algorithm created by
Nørgaard et al. (2000). The principle of this algorithm is to split the full spectrum into smaller equidistant regions
and to develop a PLS regression model for each sub-interval. The R and RMSEP of every
sub-interval are then determined, and the region with the lowest RMSEP is chosen to build the
calibration model. The prediction accuracy of the established iPLS model was evaluated by external test validation. The
full spectrum was split into 10, 20, 25 and 50 intervals. The optimal iPLS models obtained for 20 intervals were MSC +
2nd derivative (R=0.84 and RMSEP=2.885) and SNV + 2nd derivative (R=0.84 and RMSEP=3.012); for 25 intervals,
the optimal models were obtained with the Savitzky-Golay filter (69.4.4) (R=0.92 and RMSEP=2.133), MSC + SG (69.4.4)
(R=0.89 and RMSEP=2.475) and SNV + SG (69.4.4) (R=0.91 and RMSEP=2.330) (Table 2).
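The iPLS procedure described above can be sketched end to end with a small PLS1 (NIPALS) routine: one local model is fitted per interval on the calibration set, and the interval with the lowest RMSEP on the test set is retained. This is a plain-Python illustration under stated assumptions, not the iToolbox code used in the study: the function names are hypothetical, the number of variables is assumed to be divisible by the number of intervals, and the same number of latent variables is used for every interval.

```python
import math


def pls1_fit(X, y, n_lv):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    n, m = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(m)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(m)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                # centred y
    components = []
    for _ in range(n_lv):
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(m)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]                                  # weights
        t = [sum(E[i][j] * w[j] for j in range(m)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        c = sum(f[i] * t[i] for i in range(n)) / tt
        E = [[E[i][j] - t[i] * p[j] for j in range(m)] for i in range(n)]
        f = [f[i] - c * t[i] for i in range(n)]
        components.append((w, p, c))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict one sample by replaying the deflation steps of the fit."""
    x_mean, y_mean, components = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, c in components:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += c * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


def ipls(X_cal, y_cal, X_val, y_val, n_intervals, n_lv):
    """Return (index of best interval, list of RMSEP values per interval)."""
    width = len(X_cal[0]) // n_intervals  # assumes exact divisibility
    errors = []
    for k in range(n_intervals):
        cols = slice(k * width, (k + 1) * width)
        model = pls1_fit([row[cols] for row in X_cal], y_cal, n_lv)
        preds = [pls1_predict(model, row[cols]) for row in X_val]
        sq = sum((a - b) ** 2 for a, b in zip(y_val, preds))
        errors.append(math.sqrt(sq / len(y_val)))
    best = min(range(n_intervals), key=errors.__getitem__)
    return best, errors
```

In a fuller implementation the number of latent variables would be optimised separately for each interval, as the graphical output of the iPLS routine described above indicates.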
The scatter plot shows a good correlation between the reference measurements and the NIR predictions in the calibration
set for the iPLS model (Fig. 2C). In this case, the best iPLS model was achieved after SG filter preprocessing and is
characterised by 7 PLS components, R=0.92 and RMSEP=2.133, for the interval selected from 25 that corresponds to
wavenumbers in the range 4651–4304 cm-1 (Fig. 2A-B, Table 3). The models built after MSC+SG and SNV+SG
preprocessing for the same spectral region (4651-4304 cm-1) were also suitable, presenting high
R (0.89 and 0.91) and low RMSEP values (2.475 and 2.330), respectively. NIR spectroscopy records the spectral bands that
mainly correspond to C-H, O-H and N-H vibrations, which are overtone and combination bands, and an NIR method
was constructed to identify the origin and biochemical characteristics of rice variety. These spectral regions are
characterised by combination bands of the methyl group (-CH3) (C-H stretching and C-H bending) and CH2 combination bands specific
to the amylose molecule. Compared with the full-spectrum PLS models, the RMSEP values of the iPLS models are lower because selecting a specific
spectral range per interval automatically eliminates weak spectral information, inducing
a decrease in RMSEP relative to the full-spectrum PLS, because, according to other studies, the information was
spread over the whole spectral range (Pataca, Borges Neto, Marcucci, & Poppi, 2007). The exclusion of
uninformative and/or interfering variables avoids the inclusion of spectral regions that contain
residual information or noise that affects the final regression model. Then, based on these results, the division of the full
spectral region into small intervals allowed the selection of a suitable region for creating an optimised regression model,
3.4. siPLS
In the full spectrum, many uninformative variables could negatively affect the calibration. Accordingly, a judicious
selection of spectral regions would improve the predictive ability of the PLS model. The synergy interval PLS (siPLS)
algorithm used in this work was also developed by Nørgaard et al. (2000). The basic principle of this algorithm is
similar to iPLS. Initially, the spectra are split into a specific number of intervals (variable-wise), and then PLS
regression models for all possible combinations of two and three intervals are developed. The RMSEP is
evaluated for every combination of intervals, and the combinations with the lowest RMSEP value are selected.
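The search space of siPLS can be made concrete by enumerating the interval combinations. The sketch below lists every combination of two and three intervals and picks the one with the lowest RMSEP from a user-supplied table of results; the function names are hypothetical and no model fitting is performed here.

```python
from itertools import combinations


def interval_combinations(n_intervals, sizes=(2, 3)):
    """Yield every combination of interval indices of the given sizes."""
    for size in sizes:
        yield from combinations(range(n_intervals), size)


def best_combination(rmsep_by_combination):
    """Return the combination (a tuple of interval indices) with lowest RMSEP."""
    return min(rmsep_by_combination, key=rmsep_by_combination.get)
```

For 10, 20, 25 and 50 intervals this amounts to 165, 1330, 2600 and 20825 candidate models respectively, which shows how quickly the computational cost grows with the number of intervals.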
Therefore, spectral interval selection by siPLS was implemented to verify whether the combination
of more than one interval would yield models with better predictive capacity. Table 3 shows the best results of the
siPLS model calibration when the full spectra were split into different numbers of intervals. All models were built after the
spectrum had been divided into 10, 20, 25 and 50 equal intervals, which were then combined. The best siPLS models for
amylose were obtained with 25 and 50 intervals, characterised by a low RMSEP and a high determination coefficient (R).
The best models resulted from a combination of 3 intervals, after division into 25 intervals, and are
characterised by the wavenumber ranges 8941–8194 cm-1, 5592–5045 cm-1 and 4683–4335 cm-1 (Fig. 3-A).
Both regression models, obtained after SG filter and SNV + SG filter pre-processing, were characterised,
respectively, by 9 LV, R = 0.94 and RMSEP = 1.938%, and by 9 LV, R = 0.93 and RMSEP = 1.979%. Based on these
parameters, the regression models present high accuracy and can be considered suitable for amylose determination from
a wide variety of rice germplasm (Fig. 3-B). The regression model had a higher determination coefficient than that
reported by Xie et al. (2014). According to these models, a higher number of intervals allowed a more efficient
selection of the spectral ranges containing the most complete spectral information and, consequently, the construction
of a strong model with a high correlation coefficient and a low prediction error. The selected NIR spectral regions
(Fig. 3-C-E) include the region characterised by the second overtone of the anti-symmetric stretching of the methyl
group (–CH3) (8941-8194 cm-1), which is close to the interval between 8183 and 6850 cm-1, for which mainly the C-H
second overtone and combination bands are responsible and which corresponds to amylose (Bagchi et al., 2016). The
selected spectral range (5592-5054 cm-1) is close to the interval (5875–5495 cm-1) that, according to Fertig, Podczeck,
Jee, & Smith (2004) and Vichasilp & Kawano (2015), might also be linked to the vibration of amylose. The bands between
5149 and 5050 cm-1 correspond to the O-H stretch and O-H bend combination and the H-O-H deformation combination,
which represent the starch content (Aenugu, Kumar, Parthiban, Ghose, & Banji, 2011), and the N-H/C-H in-plane bending
is at 4878–4830 cm-1 (Burns, & Curczak, 1992). The selected spectral range (4683-4335 cm-1) can also be related
to some starch bands (4760 cm-1) and to the protein band at 4587 cm-1, according to Vichasilp & Kawano (2015).
The regression models created with the siPLS and iPLS methodologies selected the same spectral regions,
indicating that both chemometric techniques permit a more confident extraction of the biomolecular information
present.
3.5. mwPLS model
The function of the mwPLS model can be briefly described as the selection of informative regions and the
approximation of latent factors (Du, Chen, Zhong, Wang, Yu, Nordon, Littlejhon, & Holden, 2011). The informative
regions can be optimised by different moving window sizes. The window size considered in this study was set to 31.
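The moving-window search can be sketched as follows. This is a minimal illustration in Python with NumPy under stated assumptions, not the implementation used in this work: the names `pls1_rmsep` and `mwpls`, the fixed number of latent variables and the single validation set are hypothetical choices for the example, and only the default window size of 31 variables is taken from the text.

```python
import numpy as np


def pls1_rmsep(X_cal, y_cal, X_val, y_val, n_components):
    """Fit PLS1 (NIPALS) on the calibration set and return RMSEP on the validation set."""
    x_mean, y_mean = X_cal.mean(axis=0), y_cal.mean()
    Xk, yk = X_cal - x_mean, y_cal - y_mean
    # Never ask for more latent variables than the data can support.
    n_components = min(n_components, X_cal.shape[1], X_cal.shape[0] - 1)
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        norm = np.linalg.norm(w)
        if norm < 1e-12:              # nothing left to explain
            break
        w = w / norm                  # weight vector
        t = Xk @ w                    # scores
        p = Xk.T @ t / (t @ t)        # X loadings
        c = (yk @ t) / (t @ t)        # y loading
        Xk = Xk - np.outer(t, p)      # deflate X
        yk = yk - c * t               # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    if not W:
        y_hat = np.full(len(y_val), y_mean)
    else:
        W, P, q = np.array(W).T, np.array(P).T, np.array(q)
        coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector W (P'W)^-1 q
        y_hat = (X_val - x_mean) @ coef + y_mean
    return float(np.sqrt(np.mean((y_val - y_hat) ** 2)))


def mwpls(X_cal, y_cal, X_val, y_val, window=31, n_components=3):
    """Slide a fixed-size window along the variable axis; return (start indices, RMSEP per window)."""
    n_vars = X_cal.shape[1]
    starts = np.arange(0, n_vars - window + 1)
    errors = np.array([
        pls1_rmsep(X_cal[:, s:s + window], y_cal,
                   X_val[:, s:s + window], y_val, n_components)
        for s in starts
    ])
    return starts, errors
```

Plotting `errors` against the wavenumber at each window position gives the RMSEP profile along the spectrum; windows with low RMSEP mark candidate informative regions.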
The mwPLS models were built only for the models that presented low RMSEP and a high determination coefficient
(R). The plot obtained with the mwPLS algorithm shows that the spectral regions characterised by low RMSEP values
(6303-6079 cm-1; 5863-5747 cm-1; and 4737-4443 cm-1) coincide with those selected in the earlier analyses with the
iPLS and siPLS algorithms. The mwPLS has the advantage of showing the evolution of RMSEP along the full spectrum,
which helps to identify clearly the region most suitable for developing the PLS model for analytical determination
using NIR technology. The optimal model was based on the spectral region 5932-4497 cm-1 determined by the mwPLS
method. This spectral region includes the wavenumber of the strongest absorption band (5184 cm-1), associated with
the combination of the O-H stretching and O-H bending of the amylose molecule. Based on these results, the mwPLS can
be considered a practicable method to select the
Comparing the results from PLS, iPLS, and siPLS models, siPLS models showed better predictive ability. The
experimental results lead to the following conclusions: i) For PLS models, all variables from the full spectral region
were used to calibrate the models, including many noisy and uninformative variables that inevitably weaken model
performance; ii) iPLS models can reduce noise by selecting definite spectral intervals, but only one interval is chosen
to calibrate the PLS model, so some useful variables are discarded. The overall performance of the model is inevitably
weakened because relevant information outside the selected interval is not considered, which is why iPLS models give
weaker results in the validation sets; iii) In contrast to iPLS, siPLS shows clear advantages: it retains the benefits
of iPLS while overcoming its disadvantages by combining two or three intervals, giving models with fewer variables
(removing noisy spectral information) and better predictive capacity (without loss of information); iv) mwPLS was not
a suitable method for amylose prediction, as its RMSEP values were very poor compared with the other methods. This can
be related to the sensitivity of the method or to the presence of some poor or non-significant spectral regions.
4. Conclusion
A robust calibration was obtained using different combinations of derivatives, preprocessing and regression methods,
regardless of sample type. The proposed analytical methodology can accurately quantify the amylose present in rice
varieties using NIR combined with chemometric tools. Compared with the PLS, iPLS and mwPLS algorithms, the variable
selection of siPLS led to models with higher predictive ability than the full-spectrum PLS models for the different
data pre-processing methods used. The spectral regions selected by siPLS were in the wavenumber ranges 8941–8194
cm-1 (CH3, methyl group, 2nd overtone of anti-symmetric and symmetric stretching), 5592–5045 cm-1 (1st O-H stretch and
O-H bend combination and the H-O-H deformation combination) and 4683–4335 cm-1 (related to some starch and protein
bands), all of which are related to the starch content and consequently to amylose. Thus, the regions selected by siPLS
can lead to an increase in the prediction ability of the models. The PLS method was validated, as shown by the
satisfactory results obtained for all estimated figures of merit, with no systematic errors. The proposed method
presents significant advantages over conventional laboratory analysis: it is simpler, cheaper, faster and
nondestructive, generates less chemical waste and is suitable for ‘on-line’ analysis. These results suggest that the
combination of NIR spectroscopy and chemometric techniques is a simple, fast and reliable method for amylose
quantification in the quality
Acknowledgments
Funding for this research has been received from the Portuguese Fundação para a Ciência e Tecnologia (FCT) under the
grant agreement number RECI/AGR-TEC/0285/2012, BEST-RICE-4-LIFE project and P.N Sampaio acknowledges the
References
Aenugu, H.P.R., Kumar, D. S., Parthiban, N., Ghose, S.S., & Banji, D. (2011). Near-infrared spectroscopy - An
Bagchi, T. B., Sharma, S., & Chattopadhyay, K. (2016). Development of NIRS models to predict protein and amylose
content of brown rice and proximate compositions of rice bran. Food Chemistry, 191, 21–27.
Banks, W., Greenwood, C.T., & Muir, D.D. (1971). The characterization of starch and its components. Part 3. The
technique of semimicro, differential, potentiometric titration, and the factors affecting it. Starch/Stärke, 23, 118–127.
Bao, J.S., Cai, Y. Z., & Corke, H. (2001). Prediction of rice starch quality parameters by near infrared reflectance
Bao, J.S., Sun, M., & Corke, M. (2002). Analysis of the genetic behaviour of some starch properties in Indica rice
(Oryza sativa L): thermal properties, gel texture, swelling value. Theoretical and Applied Genetics, 104, 408–13.
Barnes, R.J., Dhanoa, M.S., & Lister, S.J. (1989). Standard normal variate transformation and de-trending of near-
Bart, M.N., Katrien, B., Els, B., Ann, P., Wouter, S., Karen, I. T., et al. (2007). Nondestructive measurement of fruit
and vegetable quality by means of NIR spectroscopy: A review. Journal of Postharvest Biology and Technology, 46,
99–118.
Burns, D. A., & Curczak, E. W. (1992). Handbook of near-infrared analysis. Practical spectroscopy series (Vol. 13) (pp.
Chen, H.Z., Pan, T., Chen, J.M., & Lu, Q.P. (2011). Waveband selection for NIR spectroscopy analysis of soil organic
matter based on SG smoothing and MWPLS methods. Chemometrics and Intelligent Laboratory Systems, 107, 139–146.
Chen, H., Song, Q., Tang, G., Feng, Q., & Lin, L. (2013). The combined optimisation of Savitzky-Golay smoothing and
multiplicative scatter correction for FT-NIR PLS models. Hindawi Publishing Corporation. ISRN Spectroscopy,
ID642190.
Chen, J., Yin, Z., Tang, Y., & Pan, T. (2017). Vis-NIR spectroscopy with moving-window PLS method applied to rapid
analysis of whole blood viscosity. Analytical and Bioanalytical Chemistry, 409, 2737-2745.
Delwiche, S. R., Bean, M. M., Miller, R. E., Webb, B. D., & Williams, P. C. (1995). Apparent amylose content of
milled rice by near infrared reflectance spectrophotometry. Cereal Chemistry, 72, 182–187.
Delwiche, S.R., & Reeves, J.B. (2010). A graphical method to evaluate spectral preprocessing in multivariate regression
calibrations: example with Savitzky-Golay filters and partial least squares regression. Applied Spectroscopy, 64, 73–82.
Du, W., Chen, Z., Zhong, L., Wang, S., Yu, R., Nordon, A., Littlejhon, D., & Holden, M. (2011). Maintaining the
predictive abilities of multivariate calibration models by spectral space transformation. Analytica Chimica Acta, 690,
64-70.
Fertig, C. C., Podczeck, F., Jee, R. D., & Smith, M. R. (2004). Feasibility study for the rapid determination of the
amylose content in starch by near-infrared spectroscopy. European Journal of Pharmaceutical Sciences, 21, 155–159.
Fitzgerald, M. A., Bergman, C. J., Resurreccion, A. P., Moller, J., Jimenez, R., Reinke, R. F., et al. (2009). Addressing
Friedel, M., Patz, C. D., & Dietrich, H. (2013). Comparison of different measurement techniques and variable selection
Gibson, T.S., Solah, V.A., & McCleary, B.V. (1997). A procedure to measure amylose in cereal starches and flours
Gorry, P.A. (1990). General least-squares smoothing and differentiation by the convolution (Savitzky-Golay) method.
Hizukuri, S. (1996). Starch: analytical aspects. In: Eliasson, A.C. (Ed.), Carbohydrates in Food. (pp. 347–429). Marcel
Hu, G., Burton, C., & Yang, C. (2010) Efficient measurement of amylose content in cereal grains. Journal of Cereal
ISO 6647-2:2015 Rice - Determination of amylose content - Part 2: Routine method.
Juliano, B.O., Perez, C.M., Blakeney, A.B., Castillo, D.T., Kongseree, N., Laignelet, B., et al. (1981). International
co-operative testing on the amylose content of milled rice. Starch/Stärke, 33, 157–162.
Kalivas, J. H. (1997). Two data sets of near infrared spectra. Chemometrics and Intelligent Laboratory Systems, 37,
255–259.
Leardi, R., & Nørgaard, L. (2004). Sequential application of backward interval partial least squares and genetic
algorithms for the selection of relevant spectral regions. Journal of Chemometrics, 18, 486–497.
Lee, H.W., Bawn, A., & Yoon S. (2012). Reproducibility, complementary measure of predictability for robustness
improvement of multivariate calibration models via variable selections. Analytica Chimica Acta, 757, 11-18.
Li, H., Prakash, S., Nicholson, T.M., Fitzgerald, M.A., & Gilbert, R.G. (2016). The importance of amylose and
amylopectin fine structure for textural properties of cooked rice grains. Food Chemistry, 196, 702-711.
Lu, Z.-H., Sasaki, T., Li, Y.-L., Yoshihashi, T., Li, L.-T., & Kohyama, K. (2009). Effect of amylose content and rice
type on dynamic viscoelasticity of a composite rice starch gel. Food Hydrocolloids, 23, 1712–1719.
Ma, H., Wang, J., Chen, Y., Cheng, J., & Lai, Z. (2017). Rapid authentication of starch adulteration in ultrafine granular
powder of Shanyao by near-infrared spectroscopy coupled with chemometric methods. Food Chemistry, 215, 108-115.
Mark, H. (2001). Fundamentals of near-infrared spectroscopy. In: Raghavachan, R. (Ed.), Near-infrared Applications in
Martens, H., & Stark, E. (1991). Extended multiplicative signal correction and spectral interference subtraction: new
preprocessing methods for near infrared spectroscopy. Journal of Pharmaceutical and Biomedical Analysis, 9, 625-635.
Matheson, N.K. & Welsh, L.A. (1988). Estimation and fractionation of the essentially unbranched amylose and
branched amylopectin component of starches with concanavalin A. Carbohydrate Research, 180, 301-313.
Morrison, W. R., & Laignelet, B. (1983). An improved colorimetric procedure for determining apparent and total
amylose content in cereals and other starches. Journal Cereal Science, 1, 9-20.
Naes, T., Isaksson, T., Fearn T. & Davies, A. (2002). A User-Friendly Guide to Multivariate Calibration and
Nørgaard, L., Saudland, A.J., Wagner, J.P., Nielsen, L., Munck, & Engelsen, S.B. (2000). Interval Partial least-squares
regression (iPLS): A comparative chemometric study with an example from Near-Infrared spectroscopy. Applied
Pandey, M.K., Rani, N.S., Madhav, M.S., Sundaram, R.M., Varaprasad, G.S., Sivaranjani, A.K.P., Bohra, A., Kumar,
G.R., & Kumar, A. (2012). Different isoforms of starch-synthesizing enzymes controlling amylose and amylopectin
Pandiselvam, R., Thirupathi, V., & Vennila, P. (2016). Fourier Transform – near infrared spectroscopy for rapid and
nondestructive measurement of amylose content of paddy. Scientific Journal Agricultural Engineering, 2, 93 – 100.
Pataca, L.C., Borges Neto, W., Marcucci, M. C., & Poppi, R.J. (2007). Determination of apparent reducing sugars,
moisture and acidity in honey by attenuated total reflectance-Fourier transform infrared spectrometry. Talanta, 71,
1926–1931.
Savitzky, A., & Golay, M.J.E. (1964). Smoothing and differentiation of data by simplified least squares procedures.
Shu, Q.Y., Wu, D.X., Xia, Y.W., Gao, M.W., & McClung, A. (1999). Calibration optimization for rice apparent
amylose content by near infrared reflectance spectroscopy (NIRS). Journal of Zhejiang University (Agriculture & Life
Silva, M., Ferreira, M.H., Braga, J.W., & Sena, M. (2012). Development and analytical validation of a multivariate
calibration method for determination of amoxicillin in suspension formulations by near infrared spectroscopy. Talanta,
89, 342–351.
Sievert, D., & Holm, J. (1993). Determination of amylose by differential scanning calorimetry. Methods, 45, 136-139.
Soong, Y.Y., Quek, R.Y.C., & Henry, C.J. (2015). Glycemic potency of muffins made with wheat, rice, corn, oat and
barley flours: a comparative study between in vivo and in vitro. European Journal of Nutrition, 54, 1281–1285.
Spiegelman, C.H., McShane, M.J., Goetz, M.J., Motamedi, M., Yue, Q.L., & Coté, G.L. (1998). Theoretical
justification of wavelength selection in PLS calibration: development of a new algorithm. Analytical Chemistry, 70, 35–
44.
Vichasilp, C., & Kawano, S. (2015). Prediction of starch content in meatballs using near infrared spectroscopy (NIRS).
Windham, W., Lyon, B.G., Champagne, E.T., Barton, F.E., Webb, B.D., McClung, A.M., Moldenhauer, K.A.,
Linscombe, S., & McKenzle, K.S. (1997). Prediction of cooked rice texture quality using near-infrared reflectance
Wold, S., & Sjostrom, M. (2001). PLS-regression: a basic tool of chemometrics. Chemometrics and Intelligent
Xie, S.F., Xiang, B.R., Yu, L.Y., & Deng, H.S. (2009). Tailoring noise frequency spectrum to improve NIR
Xie, L.H., Tang, S.Q., Chen, N., Luo, J., Jiao, G.A., Shao, G.N., Wei, X.J., & Hu, P.S. (2014). Optimisation of near-
infrared reflectance model in measuring protein and amylose content of rice flour. Food Chemistry, 142, 92-100.
Yun, Y., Li, H., & Wood, L.R., et al. (2013). An efficient method of wavelength interval selection based on random
frog for multivariate spectral calibration. Spectrochimica Acta Part A: Molecular and Biomolecular Spectroscopy, 111,
31-36.
Zhang, G., Cheng, Z., Zhang, X., Guo, X., Su, N., Jiang, L., Mao, L., & Wan, J. (2011). Double repression of soluble
starch synthase genes SSIIa and SSIIIa in rice (Oryza sativa L.) uncovers interactive effects on the physicochemical
FIGURE CAPTIONS
Figure 1-NIR spectra without any pre-treatment (A); and the pure amylose spectra (B). NIR spectra plot obtained after
Figure 2-iPLS model: spectra intervals construction (A); Spectra interval selection (B); Scatter plot obtained from iPLS
model (C).
Figure 3-siPLS model: Spectra interval selection (A); Scatter plot obtained from siPLS model (B). NIR spectral region
used for the regression model obtained after SG filter preprocessing and siPLS algorithm: Spectral range (8941-8194
cm-1) (C); Spectral range (5592-5054 cm-1) (D) and; Spectral range (5875–5495 cm-1) (E).
Table 1-Analysis of several PLS models using full spectra with and without preprocessing methods such as
multiplicative scatter correction (MSC), standard normal variate (SNV) and Savitzky-Golay filter (SG). Root mean
square error of prediction (RMSEP); root mean square error of calibration (RMSEC); and the corresponding
determination coefficients (Rcal and Rep).
Table 2-Results of the iPLS models: root mean square error of prediction (RMSEP), root mean square error of
calibration (RMSEC) and the corresponding determination coefficients (Rcal and Rpred) for all preprocessing methods
and each number of spectral intervals tested.
Processing              Spectra     Spectra region   PLS (LV)   Rc     RMSEC   Rp     RMSEP
                        intervals   (cm-1)
Without Preprocessing   10          6249-5369        4          0.55   4.346   0.55   4.565
                        20          8462-8022        4          0.53   4.494   0.51   4.695
                        25          6071-5724        7          0.64   3.925   0.74   3.685
                        50          5894-5724        5          0.64   3.910   0.68   4.005
MSC + 2nd Derivative    10          6249-5359        7          0.75   3.366   0.80   3.290
                        20          4467-4035        5          0.86   2.602   0.84   2.885
                        25          4652-4305        5          0.81   3.022   0.82   3.141
                        50          4474-4305        7          0.82   3.016   0.79   3.352
SNV + 2nd Derivative    10          6249-5369        6          0.75   3.366   0.80   3.274
                        20          4467-4035        5          0.86   2.560   0.84   3.012
                        25          4652-4305        4          0.81   3.018   0.81   3.203
                        50          4474-4305        7          0.82   3.012   0.81   3.219
SG (69.4.4)             10          5361-4482        8          0.89   2.334   0.87   2.720
                        20          4906-4474        9          0.88   2.400   0.88   2.561
                        25          4651-4304        7          0.90   2.228   0.92   2.133
                        50          4651-4482        8          0.78   3.215   0.74   3.701
MSC + SG (69.4.4)       10          5361-4482        9          0.87   2.488   0.86   2.796
                        20          4906-4474        9          0.87   2.519   0.88   2.652
                        25          4651-4304        6          0.90   2.212   0.89   2.475
                        50          4651-4482        8          0.77   3.253   0.73   3.768
SNV + SG (69.4.4)       10          7136-6256        5          0.61   4.066   0.57   4.499
                        20          4906-4474        9          0.87   2.512   0.88   2.650
                        25          4651-4305        7          0.90   2.236   0.91   2.330
                        50          4651-4482        8          0.78   3.254   0.73   3.736
MSC: multiplicative scatter correction; SNV: standard normal variate; SG: Savitzky-Golay filter.
Table 3-Results of the siPLS models: root mean square error of prediction (RMSEP), root mean square error of
calibration (RMSEC) and the corresponding determination coefficients (Rcal and Rpred) for all preprocessing methods
and combinations of spectral intervals tested. Values presented are related only to the best model.
Highlights
• PLS, iPLS, siPLS and mwPLS algorithms showed high accuracy for amylose prediction
• siPLS yielded the model with the highest accuracy and low error
• NIR and chemometrics can be suitable techniques for fast, ‘on-line’ and accurate amylose determination.