
Analytica Chimica Acta 689 (2011) 190–197


Biodiesel classification by base stock type (vegetable oil) using near infrared
spectroscopy data
Roman M. Balabin a,∗, Ravilya Z. Safieva b
a Department of Chemistry and Applied Biosciences, ETH Zurich, 8093 Zurich, Switzerland
b Gubkin Russian State University of Oil and Gas, 119991 Moscow, Russia

Article info

Article history:
Received 7 October 2010
Received in revised form 17 January 2011
Accepted 19 January 2011
Available online 26 January 2011

Keywords:
Petroleum (fossil) fuel
Vegetable (plant) oil
Biofuel (biodiesel, bioethanol, ethanol–gasoline fuel)
Vibrational spectroscopy (infrared, near-infrared, and Raman)
Artificial neural networks
Support vector machines

Abstract

The use of biofuels, such as bioethanol or biodiesel, has rapidly increased in the last few years. Near infrared (near-IR, NIR, or NIRS) spectroscopy (>4000 cm−1) has previously been reported as a cheap and fast alternative for biodiesel quality control when compared with infrared, Raman, or nuclear magnetic resonance (NMR) methods; in addition, NIR can easily be done in real time (on-line). In this proof-of-principle paper, we attempt to find a correlation between the near infrared spectrum of a biodiesel sample and its base stock. This correlation is used to classify fuel samples into 10 groups according to their origin (vegetable oil): sunflower, coconut, palm, soy/soya, cottonseed, castor, Jatropha, etc. Principal component analysis (PCA) is used for outlier detection and dimensionality reduction of the NIR spectral data. Four different multivariate data analysis techniques are used to solve the classification problem, including regularized discriminant analysis (RDA), partial least squares method/projection on latent structures (PLS-DA), K-nearest neighbors (KNN) technique, and support vector machines (SVMs). Classifying biodiesel by feedstock (base stock) type can be successfully solved with modern machine learning techniques and NIR spectroscopy data. KNN and SVM methods were found to be highly effective for biodiesel classification by feedstock oil type. A classification error (E) of less than 5% can be reached using an SVM-based approach. If computational time is an important consideration, the KNN technique (E = 6.2%) can be recommended for practical (industrial) implementation. Comparison with gasoline and motor oil data shows the relative simplicity of this methodology for biodiesel classification.
© 2011 Elsevier B.V. All rights reserved.

∗ Corresponding author. Tel.: +41 44 632 4783.
E-mail addresses: balabin@org.chem.ethz.ch, balabinrm@yandex.ru (R.M. Balabin).

0003-2670/$ – see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.aca.2011.01.041

1. Introduction

The applications of spectroscopy in various fields of modern science are increasing every year [1–4]. The most popular spectroscopic techniques include infrared (IR), near infrared (NIR/near-IR), and Raman spectroscopies [1–6], ultraviolet–visible (UV–vis) absorption spectroscopy (spectrophotometry) [1,4], fluorescence spectroscopy (fluorometry/spectrofluorometry) [7–10], and nuclear magnetic resonance spectroscopy (¹H and ¹³C NMR) [11]. Of these, near infrared spectroscopy has several advantages over other analytical tools because it is non-invasive, requires minimal sample preparation, and can yield a response in real time [2,3], which is an important advantage for process analytical chemistry (PAC) [12]. NIR spectroscopy is based on the absorption of electromagnetic radiation in the region from 780 to 2500 nm (12,820–4000 cm−1) [3].

NIR spectroscopic studies have been conducted throughout the past few decades. Applications of NIR spectroscopic data can be found in medical and biomedical studies and imaging [13–16], nanoscience and nanotechnology (nanotech) [17–20], Earth science (geoscience) [21], physical chemistry and chemical physics [22–25], astronomy [26,27], analytical chemistry [2,3,12,28–30], food science [3], forestry and wood science [31,32], and the pharmaceutical and petroleum industries [22–24,28–30,33,34] (see Ref. [3] for additional references). When associated with multivariate data analysis (MDA) [35–40], vibrational spectroscopic techniques (IR, NIR, and Raman) [1,35–37] have proven to be powerful tools in the analysis of a variety of fuel samples, including gasoline, diesel, alcohol fuel (ethanol–gasoline mixtures [41,42]), and kerosene (jet fuel) [43]. Vibrational spectroscopic techniques are faster than the usual (ASTM) techniques in the analysis of petroleum refining and petrochemical products [22,23,44,45]; in addition, these methods demonstrate good accuracy and precision, are non-destructive, and can be used in remote quality control [3,12].

Analysis of near-IR spectra usually involves a combination of multiple samples, each of which has a large number of correlated features [2,3,12,33]. As such, a variety of data mining algorithms have been introduced to reduce the complexity accompanying such

large amounts of data and aim to identify meaningful patterns in NIR spectra [45–47]. Both regression and classification tasks were successfully completed using IR/NIR spectral data [48–50].

Increasing energy costs and environmental concerns have emphasized the need to produce sustainable renewable fuels and chemicals [51–55]. Ideally, these alternatives should be economically competitive, technically achievable, environmentally acceptable and widely available [51]. The use of renewable energies, such as biofuels [53], biomass [55], wave, hydro, wind and solar energies, needs to become widespread in order to decrease the dependence on fossil fuels [51].

The demand for ethanol and biodiesel, called alternative fuels or biofuels, has increased over the last few years due to environmental, economical, political, and social issues [44,54]. Biodiesel is composed of fatty acid mono-alkyl esters, which are obtained through the base-catalyzed transesterification of vegetable oils or animal fats with a short chain alcohol, such as methanol (MeOH, CH3OH) or ethanol (EtOH, C2H5OH) [44]. This biofuel is the major substitute for petroleum-derived diesel. Because its physical properties are very similar to those of diesel, the use of pure or blended biodiesel does not require any modification of the diesel engine or of the existing fuel distribution and storage infrastructure [44]. Increased biodiesel use leads to the reduction of greenhouse gases, particulate matter, and sulphur emissions. In addition, waste frying oils (WFO) can be used as a raw material in biodiesel production, thus reducing and reusing industrial and household waste [51]. Recently, blends of biodiesel with mineral diesel have become commercially available world-wide [44,56].

Depending on geographic limitations and oil prices, biodiesel may be produced from a variety of feedstocks, and different technologies can be applied for biodiesel production [51–58]. Consequently, the final product can have different properties, so quality control for biodiesel is very important. The EN 14214 standard mandates twenty-five (25) parameters that must be analyzed to certify biodiesel quality [51].

Near infrared spectroscopy has already been applied to quality analysis of biodiesel fuel. In 2007, Correia and co-workers [57] published their analysis of the effect of commonly used pre-processing techniques that were applied prior to partial least squares (PLS) and principal components regression (PCR). They compared the quality of the calibration models developed to relate the near infrared spectral characteristics of a biodiesel sample to its methanol and water content. Of the 50 samples studied, 38 were used for calibration and 12 for testing. Leave-one-out cross-validation (LOOCV) was applied [57].

Chemometric treatment of NIR spectra was assessed by Galtier et al. [59] for the quantification of fatty acids and triacylglycerols in virgin olive oil samples. Their classification into five registered designations of origin (RDOs) of French virgin olive oils, including “Aix-en-Provence”, “Haute-Provence”, “Nice”, “Nyons” and “Vallee des Baux”, was successfully conducted using PLS classifiers. It was concluded [59] that chemometric treatments of NIR spectra gave results similar to those obtained by time-consuming analytical techniques, such as GC and HPLC, and that these methods constitute a fast and robust means for authentication of the French virgin olive oils.

The goals of the study conducted by Yang et al. [60] in 2005 were to use Fourier transform mid-infrared (FTIR), near-infrared (FT-NIR) and Raman (FT-Raman) spectroscopy to discriminate among 10 different edible oils and fats and to compare the performance of these spectroscopic methods. The spectral features of edible oils and fats were studied, and the characteristic vibrations of the C=C double bond were identified and used for discriminant analysis (DA) [60].

In this paper, we attempt to find a correlation between the near infrared spectrum of a biodiesel sample and its feedstock. This correlation is used to correctly classify samples into 10 groups according to their origin (type of vegetable oil) [61]. Principal component analysis (PCA) is used for outlier detection and dimensionality reduction of NIR spectral data [28–30,45–50]. Four different multivariate data analysis techniques were used to solve the problem, including regularized discriminant analysis (RDA), partial least squares method/projection on latent structures (PLS) [also known as partial least squares-discriminant analysis (PLS-DA)], K-nearest neighbors (KNN) technique, and support vector machines (SVMs) [61]. Notably, near infrared spectroscopy in combination with MDA techniques has not been previously applied to classify biodiesel fuel by its original feedstock.

2. Experimental

2.1. Materials

2.1.1. Industrial samples
Sixty-five samples of rapeseed biodiesel fuel were supplied by three companies: Yugrostexport (Russia), Biosam (Russia), and Sokhna Biodiesel Co. (Egypt).

2.1.2. Vegetable oils and used frying oil
Nine types of vegetable oils were used for biodiesel preparation. Refined and crude sunflower oils, coconut oil, and palm oil were obtained from Kochmeister (Russia). Soy (soya) bean oil, cottonseed oil, and castor oil were obtained from MasloPromBaza (Russia). Jatropha (Jatropha curcas) oil was obtained from Purandhar Agro & Biofuels (India). Linseed (flax or common flax) oil was obtained from local commercial sources. The kitchen of a local restaurant provided the used frying oil. All oil samples were used without further purification.

2.1.3. Other chemicals
ACS spectroscopic grade methanol (≥99.9%; Sigma–Aldrich), analytical grade sodium chloride (Sigma–Aldrich), analytical grade phosphoric acid (Fluka), and reagent grade potassium hydroxide (≥90%; Sigma–Aldrich) were used throughout the study without further purification.

2.2. Methods

2.2.1. Biodiesel preparation
The reactor was initially charged with an amount of oil that achieved the desired methanol/vegetable oil molar ratio, which ranged from 6:1 to 9:1, and the reactor was then placed in a constant-temperature bath with its associated equipment and heated to a temperature of 25–45 °C. The catalyst consisted of 0.4–2.0% (w/w) of potassium hydroxide (KOH) by weight of vegetable oil and was completely dissolved in the methanol under stirring; the solution was then added to the reactor. The timing of the reaction began as soon as the potassium hydroxide/methanol solution was added and continued for 3–8 h. The impeller speed was 600 rpm. Following the reaction, the mixture was transferred to a separatory funnel, and glycerol was allowed to separate by gravity for 2–3 h. After the glycerol layer was removed, the excess methanol in the methyl ester phase was removed by rotary evaporation at 70 °C. The methyl ester was washed twice with a phosphoric acid water solution (5%, v/v) and brine until a clear phase (methyl ester) was obtained.

Outlier detection was done according to Ref. [71]. The final, “outlier-free” biodiesel sample set included 403 samples from 10 feedstocks: 65 industrial samples from rapeseed oil and 338 laboratory-prepared samples from different oils. The laboratory samples included sunflower oil (44 samples), coconut oil (40 samples), palm oil (40 samples), soy oil (40 samples), cottonseed oil (30

samples), castor oil (40 samples), Jatropha oil (29 samples), linseed oil (35 samples), and used frying oil (40 samples). Each sample was obtained with a separate synthesis.

2.2.2. Near infrared (NIR) spectroscopy
The near infrared (NIR) spectra were acquired with an MPA Multi Purpose FT-NIR Analyzer (Bruker). The spectra were acquired at room temperature (20–23 °C). The NIR spectrometer was calibrated with benzene (C6H6) and cyclohexane (c-C6H12) at least twice per day to minimize the influence of variable laboratory conditions. The spectral range between 9000 and 4500 cm−1 (about 1110–2220 nm) was scanned with a resolution of 8 cm−1. Sixty-four scans were averaged for each sample spectrum, and a background spectrum (64 scans) was measured every 45 min. A photometric accuracy of ∼0.05% was obtained [2,28–30]. A cylindrical glass cell was used throughout the study. Approximately 1 mL of biodiesel sample was needed for each NIR measurement. NIR spectrum collection was repeated five times with cell rotation inside the spectrometer to minimize interferences from the cell or glass defects. The measurement of one sample took less than 5 min. The averaged and background-corrected spectra were used for subsequent data pre-processing [28–30].

2.2.3. Near infrared (NIR) spectral pre-processing
Data pre-processing was done according to Refs. [51,57]. Briefly, prior to calibration model building, various widely used pre-processing techniques were applied to the data. Eight data pre-treatment methods were tested:

• MC: Mean Centering;
• MSC-MC: Multiplicative Scatter Correction followed by Mean Centering;
• SNV-MC: Standard Normal Variate scaling plus Mean Centering;
• SGD1-MC: First-order Savitzky–Golay derivative followed by Mean Centering;
• SGD2-MC: Second-order Savitzky–Golay derivative followed by Mean Centering;
• MC-OSC: Mean Centering followed by the Orthogonal Signal Correction method;
• SGD1-MC-OSC: SGD1-MC followed by the Orthogonal Signal Correction method;
• SGD2-MC-OSC: SGD2-MC followed by the Orthogonal Signal Correction method.

The SGD2-MC-OSC method produced the best results and was used for data pre-treatment, unless otherwise specified.

2.2.4. Classifiers: RDA, PLS-DA, KNN, and SVM

2.2.4.1. Regularized discriminant analysis (RDA). Regularized discriminant analysis (RDA) can be called a “compromise” between linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). RDA introduces a double type of bias in the estimation of the covariance matrices of each class with the aim of stabilizing the prediction in the case of a deficient estimation of their elements. This is a very frequent problem encountered with real data because the estimations of the matrices are very sensitive to the presence of different data. For a detailed analysis of RDA behavior one can consult Ref. [62].

2.2.4.2. Partial least squares regression/classification (PLS-DA). Partial least squares regression (PLS-DA) is an extension of the multiple linear regression model and the principal component analysis (PCA) method. The PLS-DA method has found widespread use for multivariate analysis of spectral data. The main goal of PLS-DA usage is calibration model building, but this technique can also be applied for classification purposes. For PLS-DA classification, the class (one of M) of each sample is coded with a binary vector of length M containing M − 1 zeros and a single one. In our case, M is equal to 10. The M scores predicted by the ordinary PLS-DA method are used to predict the class of each sample. If the (normalized) j-th score is greater than or equal to the threshold parameter δ, the predicted class of the sample is j. Extra information about the idea of PLS-DA classification can be found in Ref. [63].

2.2.4.3. K-nearest neighbor (KNN). The K-nearest neighbor (KNN) method was first introduced by Fix and Hodges [64]. It has an extremely simple, intuitive theoretical background. In the KNN method, a distance (Euclidean or other, see below) is assigned between all points in a data set. The K closest neighbors of a data point (K being the number of neighbors) are then found by analyzing (sorting) the distance matrix. The K closest data points are then analyzed to determine which class label is the most common among the set (“voted” KNN). The most common class label is then assigned to the data point being analyzed. Other variants of KNN (“weighted” KNN) are also possible. In these methods, different neighbors have different “weights”, that is, different abilities to influence the class of a sample, according to their distances from the sample. Some remarks about the possible types of weighting functions are given below (Table 1). For a KNN classifier, it is necessary to have a training set that is not too small and a good discriminating distance. KNN performs well in multi-class, simultaneous problem solving. There exists an optimal value of the parameter K, which gives the best performance of the classifier. For a detailed analysis, one can consult Ref. [64].

2.2.4.4. Support vector machines (SVMs). Support vector machines were developed by Vapnik [65]. Support vector machines (SVMs) are based on some “beautifully simple ideas” [66] and provide a clear demonstration of what learning from examples is all about. SVMs also show strong performance in practical applications [65–70]. In very simple terms, an SVM corresponds to a linear method in a very high-dimensional feature space that is nonlinearly related to the input space. Even though we think of it as a linear algorithm in a high-dimensional feature space, SVM does not involve any computations in that high-dimensional space in practice. Through the use of kernels (Table 2), all necessary computations are performed directly in the input space. Support vector machines map input vectors to a higher dimensional space where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The separating hyperplane is the one that maximizes the distance between the two parallel hyperplanes. An assumption is made that the larger the margin or distance between these parallel hyperplanes, the lower the generalization error of the classifier will be. Details about SVM classifiers can be found in Refs. [65,66].

2.2.5. Data analysis

2.2.5.1. Software and computing. The initial spectra were digitised using the special software complex created by one of the authors (B.R.M.) [28]. After digitisation, each spectrum was represented as a 1 × 563 vector; the length of the vector was defined by the spectrometer resolution. The software package MATLAB 2008b (Mathworks, Inc., Natick, MA) along with the Statistics Toolbox and the Neural Network Toolbox were extensively used in designing and executing multivariate procedures. Standard programs were modified and extended by B.R.M. Self-written software was used for spectral pre-processing [28–30].

2.2.5.2. Model efficiency (calibration error) estimation. To characterize the prediction ability (efficiency) of the created classification model, the

Table 1
Parameters of the K-nearest neighbor (KNN) classification model.

| Weighting | Formula (a) | Parameters | Number of neighbors (K) |
|---|---|---|---|
| None (b) | – | – | K = 1–29 |
| Weighting 1 | $w(x_i) = \exp(-D(x_i, p))$ | – | K = 1–29 |
| Weighting 2 | $w(x_i) = D(x_i, p)^{-q}$ | q = 1–5 | K = 1–25 |
| Weighting 3 | $w(x_i) = \dfrac{1}{D(x_i, p)^2 + \varepsilon}$ | ε = 10^−3–10^2 | K = 1–20 |
| Weighting 4 | $w(x_i) = (D(x_i, p) + 1)^{-1}$ | – | K = 1–29 |
| Weighting 5 | $w(x_i) = 1$ if $D(x_i, p) \le d_0$; $w(x_i) = 0$ if $D(x_i, p) > d_0$ | d0 = 10^−4–10^2 | K = 1–20 |

(a) Two types of distances D(x_i, p) between vectors x_i and p were used: Euclidean distance (Pythagorean metric),
$D(x_i, p) = \sqrt{\sum_{l=1}^{L} (x_{i,l} - p_l)^2}$,
and Manhattan distance (taxicab geometry),
$D(x_i, p) = \sum_{l=1}^{L} |x_{i,l} - p_l|$.
(b) “Voting” KNN.

error of cross-validation (E) was used:

$E = \dfrac{N_{\mathrm{wrong}}}{N_0} = \dfrac{N_0 - N_{\mathrm{right}}}{N_0}$    (1)

where N_wrong and N_right refer to the numbers of wrongly and rightly classified samples, respectively; N_0 is the total number of samples in the validation set. The value of E is a measure of model sensitivity (SENS = 100% − E), i.e., the ability to classify in a correct manner the objects belonging to the class (minimization of false negatives). However, there is another fundamental parameter, specificity (SPEC), that expresses the ability of a given class model to correctly reject the objects of other classes (minimization of false positives) [69,70]. In our case (Tables 3–6) almost no difference is observed if any other error definition is used (e.g., the mean of SENS and SPEC) for comparison of multivariate methods. So in this work we follow the same error function (E) as in Ref. [61] and references therein. Note that any other error type can be calculated from the data in Tables 3–6 if another functional is preferred for method comparison.

Fivefold (5-fold) cross-validation was used to evaluate the model's efficiency. It was checked that each validation set contained all biodiesel classes. The results of the cross-validation check were found to be almost identical to the test results on a fully independent sample set. So, the cross-validation (CV) error will be used as the default measure of model accuracy, unless otherwise specified.

2.2.5.3. Efficiency estimation with external test set. To check the validity of the CV procedure, tests with external test sets were done. The full sample set was randomly divided into three subsets: calibration (283), cross-validation test set/test set for parameter optimization (50), and external test set (70). It was checked that all classes were present in each set. The test set has 6–10 samples of every oil type. The procedure was repeated five times and average values are reported.

2.2.5.4. Model optimisation. To truly compare different multivariate classification methods, the efficiency of the best possible model should be found. Because of the dependence of the calibration model efficiency on its parameters, the following parameters were varied and optimised using cross-validation error (RMSE) [28–30]:

• RDA: one parameter (α = 0–1);
• PLS-DA: number of latent variables (LV = 1–35) and threshold parameter (δ = 0.40–0.95);
• KNN: different types of K-nearest neighbor (KNN) method (voted and weighted by different weighting functions) were used to get the best result (Table 1); K was optimised in each case;
• SVM: different types of kernel functions (and corresponding parameters) were checked for optimization of the SVM classification model (Table 2); the cost parameter C was optimised in each case.

3. Results and discussion

3.1. Biodiesel samples set and its classes

All biodiesel samples in the set can be divided into two broad categories: biodiesel fuel produced from vegetable oils (363 samples) and biodiesel fuel produced from used frying oil (40 samples). A real variety of biodiesel feedstocks (vegetable oils) is present in the set: nine types of vegetable oils that are very different in chemical composition and properties [51–60]. One can expect that this sample set makes the final model general enough to be used in real-world applications. Of course, additional samples are needed if samples from other feedstocks are to be classified. But the general trend is believed to stay the same as presented below.

At least 29 samples per class (40.3 ± 9.9 (±σ) on average) are present in the set. This value can be called acceptable [38–40]. It makes it possible to create a robust model, since approximately the same number of samples (±25%) is observed in all cases [40,65]. It also means that the classification error can be calculated in a direct way, see Eq. (1); no correction for sample number difference and/or making copies of samples is needed [38–40,65].

One would also expect that 403 samples is enough to build an accurate model at the “sample set limit” (compare with the “basis set limit” or “complete basis set” in quantum chemistry (QC) [35,67]). It

Table 2
Parameters of the support vector machine (SVM) classification model.

| Kernel | Formula | Kernel parameters | Cost parameter (C) |
|---|---|---|---|
| Linear | $K(x_i, x_j) = x_i \cdot x_j + 1$ | – | C = 10^−3–10^2 |
| Polynomial | $K(x_i, x_j) = (x_i \cdot x_j + 1)^p$ | p = 2–8 | C = 10^−3–10^2 |
| Gaussian | $K(x_i, x_j) = \exp\left(-\dfrac{\lVert x_i - x_j \rVert^2}{\sigma^2}\right)$ | σ = 10^−2–10^2 | C = 10^−4–10^2 |
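The three kernel functions listed in Table 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the MATLAB code used in the study: the function names are hypothetical, the default parameter values are arbitrary, and the Gaussian kernel is assumed to have σ² (not 2σ²) in the denominator.

```python
import math


def dot(xi, xj):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(xi, xj))


def linear_kernel(xi, xj):
    # K(xi, xj) = xi . xj + 1
    return dot(xi, xj) + 1.0


def polynomial_kernel(xi, xj, p=2):
    # K(xi, xj) = (xi . xj + 1)^p; Table 2 scans p = 2-8
    return (dot(xi, xj) + 1.0) ** p


def gaussian_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / sigma^2)
    # Assumption: the denominator is sigma^2, as read from Table 2.
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / sigma ** 2)
```

For two spectra represented as lists of absorbance values, each function returns the scalar similarity that an SVM solver would use in place of an explicit mapping to the high-dimensional feature space.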

Table 3
Biodiesel classification by regularized discriminant analysis (RDA): confusion matrix. The rows indicate the true sample class, and the columns refer to the class assigned by the model.

Biodiesel type  Total number of samples  Sunflower  Coconut  Palm  Rapeseed  Soy  Cottonseed  Castor  Jatropha  Linseed  Used frying oil

Sunflower 44 33 1 1 3 2 1 1 1 1
Coconut 40 1 33 3 1 1 1
Palm 40 1 2 34 2 1
Rapeseed 65 3 60 1 1
Soy 40 1 1 37 1
Cottonseed 30 1 1 28
Castor 40 3 37
Jatropha 29 3 25 1
Linseed 35 3 3 1 2 1 25
Used frying oil 40 1 1 38

Table 4
Biodiesel classification by partial least squares (PLS) classification model: confusion matrix. The rows indicate the true sample class, and the columns refer to the class assigned by the model.

Biodiesel type  Total number of samples  Sunflower  Coconut  Palm  Rapeseed  Soy  Cottonseed  Castor  Jatropha  Linseed  Used frying oil

Sunflower 44 33 2 2 1 2 1 1 1 1
Coconut 40 1 35 3 1
Palm 40 3 34 1 1 1
Rapeseed 65 1 63 1
Soy 40 2 37 1
Cottonseed 30 1 1 28
Castor 40 2 38
Jatropha 29 2 27
Linseed 35 3 3 1 1 1 1 25
Used frying oil 40 40
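The PLS-DA class coding and threshold rule described in Section 2.2.4.2 can be illustrated with a short Python sketch. The paper does not state how the scores are normalized, so the normalization below (negative scores clipped to zero, then division by the sum) is purely an assumption, as are the function names `one_hot` and `predict_class`. Only the largest score is compared with δ, and a sample whose largest normalized score falls below δ is left unassigned.

```python
def one_hot(class_index, n_classes):
    """Code a class as a binary vector with a single one (M = n_classes)."""
    return [1 if j == class_index else 0 for j in range(n_classes)]


def predict_class(scores, delta):
    """Return the class index j whose normalized score is >= delta, else None.

    Assumption: scores are normalized by clipping negatives to zero and
    dividing by their sum; the source only says the score is "normalized".
    """
    clipped = [max(s, 0.0) for s in scores]
    total = sum(clipped)
    if total == 0.0:
        return None
    normalized = [s / total for s in clipped]
    j = max(range(len(normalized)), key=normalized.__getitem__)
    return j if normalized[j] >= delta else None
```

With δ scanned between 0.40 and 0.95, a higher threshold makes the classifier more conservative: fewer samples are assigned, but those that are assigned carry a more dominant score.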

Table 5
Biodiesel classification by K-nearest neighbor (KNN) classification model: confusion matrix. The rows indicate the true sample class, and the columns refer to the class assigned by the model.

Biodiesel type  Total number of samples  Sunflower  Coconut  Palm  Rapeseed  Soy  Cottonseed  Castor  Jatropha  Linseed  Used frying oil

Sunflower 44 38 2 2 2
Coconut 40 40
Palm 40 3 33 2 1 1
Rapeseed 65 64 1
Soy 40 1 1 35 2 1
Cottonseed 30 2 28
Castor 40 40
Jatropha 29 29
Linseed 35 1 1 1 1 31
Used frying oil 40 40
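The “voted” and “weighted” KNN variants of Section 2.2.4.3 differ only in how much each neighbor contributes to the vote. A minimal Python sketch is given below, using the two distances and three of the weighting functions from Table 1 (none, Weighting 1, and Weighting 4). The function names, the toy two-dimensional data in the usage note, and the tie-breaking behaviour are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import defaultdict


def euclidean(x, p):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, p)))


def manhattan(x, p):
    return sum(abs(a - b) for a, b in zip(x, p))


def vote(d):
    # "Voting" KNN: every neighbor counts equally.
    return 1.0


def weight_exp(d):
    # Weighting 1 in Table 1: w = exp(-D)
    return math.exp(-d)


def weight_inverse(d):
    # Weighting 4 in Table 1: w = (D + 1)^-1
    return 1.0 / (d + 1.0)


def knn_predict(train_x, train_y, p, k, distance=euclidean, weight=vote):
    """Classify point p from its k nearest training samples."""
    pairs = [(distance(x, p), label) for x, label in zip(train_x, train_y)]
    neighbors = sorted(pairs, key=lambda pair: pair[0])[:k]
    votes = defaultdict(float)
    for d, label in neighbors:
        votes[label] += weight(d)
    # Ties are broken by whichever label was encountered first.
    return max(votes, key=votes.get)
```

As a usage example with hypothetical data, take `train_x = [[0, 0], [0, 1], [5, 5], [5, 6], [6, 5]]` with labels `["palm", "palm", "soy", "soy", "soy"]` and the query point `[1, 1]`. With K = 5, plain voting returns "soy" (three votes against two), whereas `weight_exp` returns "palm" because the two palm samples are much closer to the query.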

Table 6
Biodiesel classification by support vector machine (SVM) classification model: confusion matrix. The rows indicate the true sample class, and the columns refer to the class assigned by the model.

Biodiesel type  Total number of samples  Sunflower  Coconut  Palm  Rapeseed  Soy  Cottonseed  Castor  Jatropha  Linseed  Used frying oil

Sunflower 44 40 1 1 2
Coconut 40 39 1
Palm 40 1 2 35 1 1
Rapeseed 65 64 1
Soy 40 38 2
Cottonseed 30 30
Castor 40 40
Jatropha 29 29
Linseed 35 1 1 1 1 1 30
Used frying oil 40 40

is well known that the accuracy of a multivariate model increases with the increasing size of the sample set and saturates at a limiting value for very large sample sets. The same effect is observed in ab initio quantum chemistry for the basis set, i.e., the set of Gaussian atom-centered functions used to create the molecular orbitals and to approximate the electron density distribution in molecular systems.

3.2. Biodiesel classification by regularized discriminant analysis (RDA)

Table 3 shows the results of regularized discriminant analysis (RDA) of biodiesel fuels. One can see that the model is not accurate enough for practical use. A classification error of 13.2% was

Table 7
Classification of biodiesel samples by different multivariate methods: regularized discriminant analysis (RDA), partial least squares classification (PLS), K-nearest neighbor (KNN), and support vector machines (SVMs).

| Biodiesel type | RDA Nright/N0 | RDA E | PLS Nright/N0 | PLS E | KNN Nright/N0 | KNN E | SVM Nright/N0 | SVM E |
|---|---|---|---|---|---|---|---|---|
| Sunflower | 33/44 | 25% | 33/44 | 25% | 38/44 | 14% | 40/44 | 9% |
| Coconut | 33/40 | 18% | 35/40 | 13% | 40/40 | 0% | 39/40 | 3% |
| Palm | 34/40 | 15% | 34/40 | 15% | 33/40 | 18% | 35/40 | 13% |
| Rapeseed | 60/65 | 8% | 63/65 | 3% | 64/65 | 2% | 64/65 | 2% |
| Soy | 37/40 | 8% | 37/40 | 8% | 35/40 | 13% | 38/40 | 5% |
| Cottonseed | 28/30 | 7% | 28/30 | 7% | 28/30 | 7% | 30/30 | 0% |
| Castor | 37/40 | 8% | 38/40 | 5% | 40/40 | 0% | 40/40 | 0% |
| Jatropha | 25/29 | 14% | 27/29 | 7% | 29/29 | 0% | 29/29 | 0% |
| Linseed | 25/35 | 29% | 25/35 | 29% | 31/35 | 11% | 30/35 | 14% |
| Used frying oil | 38/40 | 5% | 40/40 | 0% | 40/40 | 0% | 40/40 | 0% |
| Total | 350/403 | 13.2% | 360/403 | 10.7% | 378/403 | 6.2% | 385/403 | 4.5% |

E is the error of 5-fold cross-validation, see Eq. (1); Nright refers to the number of rightly classified samples; N0 is the total number of samples in the validation set.
Methods: RDA: regularized discriminant analysis; PLS: partial least squares regression; KNN: K-nearest neighbor; and SVMs: support vector machines.
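As a worked check of Eq. (1), the per-class and total errors in Table 7 can be recomputed from the Nright and N0 counts. The Python sketch below uses only the counts printed in Table 7; the variable and function names are illustrative.

```python
CLASSES = ["Sunflower", "Coconut", "Palm", "Rapeseed", "Soy",
           "Cottonseed", "Castor", "Jatropha", "Linseed", "Used frying oil"]

# Total number of samples per class (N0), in the order of CLASSES.
N0 = [44, 40, 40, 65, 40, 30, 40, 29, 35, 40]

# Correctly classified samples per class (Nright) for each method (Table 7).
N_RIGHT = {
    "RDA": [33, 33, 34, 60, 37, 28, 37, 25, 25, 38],
    "PLS": [33, 35, 34, 63, 37, 28, 38, 27, 25, 40],
    "KNN": [38, 40, 33, 64, 35, 28, 40, 29, 31, 40],
    "SVM": [40, 39, 35, 64, 38, 30, 40, 29, 30, 40],
}


def classification_error(n_right, n0):
    """Eq. (1): E = (N0 - Nright) / N0, expressed in percent."""
    return 100.0 * (n0 - n_right) / n0


def total_error(method):
    """Overall error of a method over all 403 samples."""
    return classification_error(sum(N_RIGHT[method]), sum(N0))


def per_class_error(method):
    """Per-class errors of a method, keyed by biodiesel type."""
    return {name: classification_error(r, n)
            for name, r, n in zip(CLASSES, N_RIGHT[method], N0)}
```

Rounded to one decimal place, `total_error` reproduces the totals in the last row of Table 7 (13.2%, 10.7%, 6.2%, and 4.5%), and `per_class_error("RDA")["Sunflower"]` gives exactly 25%.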

reached based on 350 correctly classified biodiesel samples out of 403.

In the cases of biodiesel from sunflower oil and linseed oil, an error of 25% or more was observed (Table 3). The best results were observed for the samples from the used frying oil: E = 5% for 38 correctly classified biodiesel samples out of 40. The confusion matrix (Table 3) shows that there is a significant chemical difference in the feedstock (used frying oil vs. vegetable oil), which makes classification between these two species rather simple, and the problem can easily be solved by a simple RDA model. The same cannot be said about classification among vegetable oils; a more sophisticated classification method is needed in this case.

3.3. Biodiesel classification by the partial least squares (PLS-DA) classification model

The confusion matrix for the partial least squares or projection to latent structures (PLS-DA) classification model (Table 4) shows that the PLS-DA method results are close to those for the RDA method. A classification error of 10.7% is obtained (360 correctly classified biodiesel samples out of 403). This is better than the RDA results (see above), but these results are still inadequate to state that the model is highly accurate.

Interestingly, the two most problematic types of samples for classification are the same as in the RDA method: sunflower and linseed oils. In both cases, E is 25% or more. An accuracy of 100% (40/40) is reached in the case of the used frying oil classification. Thus, the problem of distinguishing vegetable oil from used frying oil is completely solved at the PLS-DA level.

Though PLS-DA classification has been shown to be a more effective classification method than standard RDA, more sophisticated classification methods are needed to build highly accurate models for biodiesel classification by feedstock type.

3.4. Biodiesel classification by the K-nearest neighbors (KNN) classification model

The K-nearest neighbors (KNN) classification model is a simple but highly effective classification algorithm [64]; this fact is confirmed by this study using a biodiesel sample set (Table 5). A classification error of only 6.2% was reached by an application of the KNN model. The largest inaccuracy (E = 18%) is observed for palm oil. In the cases of coconut, castor, Jatropha, and used frying oils, 100% effectiveness in classification using this model was reached.

This final model based on K-nearest neighbors classification can be recommended for practical implementation as 378 samples [...] of accuracy is significantly better than the accuracy achieved using the RDA or PLS-DA methods (see above).

3.5. Biodiesel classification by support vector machines (SVMs)

Table 6 summarises the results of the SVM approach to biodiesel classification. The confusion matrix (Table 6) clearly shows the superiority of a support vector approach over all of the classification methods described above [29,68]. The classification error decreases from 6.2% to 4.5% (385 correctly classified samples) at a much higher computational cost (see below).

The largest inaccuracy (E = 14%) is observed for linseed oil. In the cases of cottonseed, castor, Jatropha, and used frying oils, 100% effectiveness in classification using this model was reached, which is nearly the same result as that provided by the KNN method. Importantly, no more than two samples of any one class were assigned to the same wrong class by the SVM-based approach (Table 6).

Support vector machines can be regarded as the most effective classification model for biodiesel fuels. Very accurate (E < 5%) and robust classification models can be built using an SVM approach.

3.6. General remarks. Comparison with gasoline and motor oil data

Table 7 summarises the data presented above for a general discussion. Table 7 clearly shows the non-equivalency of the classes in terms of classification accuracy. Three types of vegetable oil have an average classification error above 15%: sunflower, palm, and linseed oils. Thus, biodiesel samples from these three groups are difficult to label correctly. Tables 4–6 show that this inaccuracy is mostly based on the confusion between these three classes. The general accuracy of a classification model seems to be greatly dependent on its ability to correctly distinguish sunflower-, palm-, and linseed-based fuel samples. Only the KNN and SVM methods are rather successful in this task.

Three other types of biodiesel samples may be described as “easy-to-distinguish”. These are rapeseed, castor, and used frying oils. In these cases, the average error is below 4%. The difference in chemical composition [51–60] and, as a consequence, the difference in near infrared spectral features [58–60] may be responsible for this attribute.

Comparison of the effectiveness of RDA, PLS-DA, KNN, and SVM for different fuel and oil samples (gasoline, motor oil, and biodiesel) is presented in Fig. 1 [29,68]. Motor oils of 10 popular brands (Castrol, Mobil, Esso, TNK-BP, Shell, Total, Lukoil, BP, Mannol, Liqui moly; SAE types: 0W-20; 5W-20; 0W-30; 5W-30; 10W-30; 0W-40;
ples were successfully assigned to their correct classes. This level 5W-40; 10W-40; 15W-40; 5W-50; 10W-50) were used for classifi-
196 R.M. Balabin, R.Z. Safieva / Analytica Chimica Acta 689 (2011) 190–197

Since the selection of many parameters (e.g., kernel function


and cost parameter in SVM) is based on the minimum CV error,
the classification methods that seem to give the best performances
are those, where the number of free parameters is higher. It is
well known that an increased number of possible combinations of
parameters leads to increased possibilities of chance in classifica-
tion models [40].
The independent set of 70 biodiesel samples was used (see
Section 2.2.5). The results (E) of RDA, PLS-DA-DA, KNN, and SVM
classifications are 14.6 ± 0.6% (±), 10.6 ± 1.3%, 6.9 ± 1.2%, and
4.6 ± 0.6%, respectively. These results are very close to the results
of 5-fold cross-validation procedure, reported earlier. So, in the
Fig. 1. Comparison of the efficiency of classification models (classification error)
case of biodiesel classification CV and external results are very
for three sample sets: gasoline [29], motor oil [70], and biodiesel. RDA: regularized
discriminant analysis; PLS-DA: partial least squares method/projection on latent close. A small error increase in case of test sample set usage can be
structures; KNN: K-nearest neighbors; SVM: support vector machines. explained by, both, the underestimation of error by CV procedure
and the decrease of the calibration sample set by ∼12%.
Of course, the same situation (ECV ≈ Etest ) cannot be expected in
cation [29]. Motor oil samples (210–225 items) were classified into all other cases when near-infrared spectroscopy is used to classify
3–4 classes according to base stock (also origin) and viscosity. Three fuel samples, and independent tests are always necessary.
sets of gasoline near infrared (NIR) spectra (450, 415, and 345 spec- Note that the CV values reported (Tables 3–7) can be directly
tra) were used for classification of gasoline samples into 3–6 classes, compared to the motor oil and gasoline data [29,68] (see Section
according to their source (refinery or process) and type [68]. These 3.6). The main point here is to prove that the CV error is close to the
data can also be regarded as gasoline origin. The 14,000–8000 cm−1 test one (see above).
NIR spectral region was chosen for classification. All of the classifi-
cations are based on near infrared spectroscopic data.
One can see that the general trend that relates classification 4. Conclusions
model order with respect to accuracy is the same in all three
cases: RDA > PLS-DA > KNN > SVM. In all of the cases, RDA and PLS- The following conclusions can be drawn from this study:
DA methods yield approximately the same error of classification,
which is relatively large. (1) The task of biodiesel classification by feedstock (base stock)
The main difference between products of petroleum refining type can be successfully solved by modern machine learning
and petrochemicals (gasoline and motor oil) and renewable energy techniques based on near infrared (NIR) spectroscopic data.
sources (biodiesel) is the accuracy ratio between PLS-DA/KNN (2) K-nearest neighbors (KNN) and support vector machine (SVM)
and KNN/SVM methods. In the case of gasoline and motor oil, a methods were found to be highly effective for biodiesel classi-
SVM-based approach shows significant superiority over RDA- and fication by feedstock oil type (Fig. 1).
PLS-DA-based ones. The difference is not that pronounced in the (3) A classification error below 5% can be reached by a SVM-based
case of biodiesel classification (Fig. 1). The reason for this could be approach. The error per class lies in the range of 0–14%.
in the relative simplicity of biodiesel as a chemical system when (4) If computational time is an important consideration, the KNN
compared to gasoline and especially motor oil. Notably, the dif- technique can be recommended (E = 6.2%) for practical (indus-
ference in near-IR spectral ranges [29,68] and number of classes trial) implementation.
[29,68] may be partially responsible for the differences that are (5) Comparison with gasoline and motor oil data demonstrates the
observed. relative simplicity of this biodiesel classification task.
The choice between KNN- and SVM-based approaches is also
dependent on the computational resources available. While the Further work has to be done to proof that the proposed tech-
KNN algorithm is easy to create and use, SVM-based methods are nique is generally applicable for different biodiesel feedstocks
rather computationally demanding in terms of CPU time, and the (from different suppliers or from different origins of the same type
computer resources needed, in the model construction and opti- of vegetable oil). The results presented herein can be applied for the
mization phase. In case of online use, once the models are made, rapid and accurate analysis of other biofuels, products of petroleum
the differences are small. For off-line quality control of feedstocks, and petrochemicals. The use of vibrational spectroscopy such as
these differences are not as important as for on-line one. near infrared spectroscopy in other fields of analytical chemistry,
Note that for on-line applications, it is only necessary to assess including pharmaceutical (drug) quality control, food quality con-
the class of a given sample using a previously calculated model. trol, and analysis of tablets, can be enhanced by the application
Therefore, the time spent to calculate the model is not impor- of modern methods of multivariate data analysis [69–73]. We
tant, since this work has already been done previously (calibration believe that both classification and regression tasks for industrially
model building). In this application phase, there is only the need important systems can be solved using modern machine learning
to assign the sample to the proper class, and the time necessary to (computer based) techniques.
perform this operation is independent of the time needed to cal-
culate the models and to select the optimal one. So, it is only the Acknowledgements
building of the SVM model that takes a long time.
B.R.M. is grateful to the ITERA International Group of companies
3.7. Tests with independent (external) test set for a scholarship. Bruker Optics Inc. (Moscow, Russia) is acknowl-
edged for the use of their equipment in the preliminary studies.
All multivariate results should be tested using fully indepen-
dent (external) data sets, which were not used in the process of References
calibration model building. Only such data can be regarded as true
estimation of model real-world performance. [1] J.M. Hollas, Modern Spectroscopy, 4th ed., Blackwell, Wiley, 2003.
R.M. Balabin, R.Z. Safieva / Analytica Chimica Acta 689 (2011) 190–197 197

[2] J. Workman, A. Springsteen, Applied Spectroscopy: A Compact Reference for Practitioners, Academic Press, 1998.
[3] B. Osborne, T. Fearn, Near Infrared Spectroscopy in Food Analysis, Wiley, New York, 1986.
[4] F.J. Duarte, Tunable Laser Applications, CRC, New York, 2009.
[5] F.J. Duarte, Tunable Lasers Handbook, Academic Press, 1995.
[6] W. Demtröder, Laser Spectroscopy: Basic Principles, 4th ed., Springer, Berlin, 2008.
[7] J.R. Lakowicz, Principles of Fluorescence Spectroscopy, Kluwer Academic/Plenum Publishers, 1999.
[8] A. Sharma, S.G. Schulman, Introduction to Fluorescence Spectroscopy, Wiley, 1999.
[9] J. Logan, K. Edwards, N. Saunders, Real-Time PCR: Current Technology and Applications, Caister Academic Press, 2009.
[10] J. Eisinger, J. Flores, Anal. Biochem. 94 (1979) 15–21.
[11] J. Keeler, Understanding NMR Spectroscopy, John Wiley & Sons, 2005.
[12] J. Workman, M. Koch, B. Lavine, R. Chrisman, Anal. Chem. 81 (2009) 4623–4643.
[13] A. Burke, X. Ding, R. Singh, R.A. Kraft, N. Levi-Polyachenko, M.N. Rylander, C. Szot, C. Buchanan, J. Whitney, J. Fisher, H.C. Hatcher, R. D’Agostino, N.D. Kock, P.M. Ajayan, D.L. Carroll, S. Akman, F.M. Torti, S.V. Torti, Proc. Natl. Acad. Sci. U. S. A. 106 (2009) 12897–12902.
[14] E.R. Trivedi, A.S. Harney, M.B. Olive, I. Podgorski, K. Moin, B.F. Sloane, A.G.M. Barrett, T.J. Meade, B.M. Hoffman, Proc. Natl. Acad. Sci. U. S. A. 107 (2010) 1284–1288.
[15] X. Michalet, F.F. Pinaud, L.A. Bentolila, J.M. Tsay, S. Doose, J.J. Li, G. Sundaresan, A.M. Wu, S.S. Gambhir, S. Weiss, Science 307 (2005) 538–544.
[16] L.R. Hirsch, Proc. Natl. Acad. Sci. U. S. A. 100 (2003) 13549–13554.
[17] M.J. Therien, Nature 458 (2009) 716–717.
[18] T.F. Krauss, R.M. De La Rue, S. Brand, Nature 383 (1996) 699–702.
[19] Y. Zou, Proc. Natl. Acad. Sci. U. S. A. 106 (2009) 22135–22138.
[20] R. Adato, Proc. Natl. Acad. Sci. U. S. A. 106 (2009) 19227–19232.
[21] H. Keppler, L.S. Dubrovinsky, O. Narygina, I. Kantor, Science 322 (2008) 1529–1532.
[22] R.M. Balabin, R.Z. Safieva, J. Near Infrared Spectrosc. 15 (2007) 343–347.
[23] R.M. Balabin, R.Z. Syunyaev, J. Colloid Interface Sci. 318 (2008) 167–174.
[24] R.Z. Syunyaev, R.M. Balabin, I.S. Akhatov, J.O. Safieva, Energy Fuels 23 (2009) 1230–1238.
[25] O.C. Mullins, Anal. Chem. 62 (1990) 508–514.
[26] M.R. Swain, G. Vasisht, G. Tinetti, Nature 452 (2008) 329–331.
[27] S. Byrne, Science 325 (2009) 1674–1676.
[28] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Chemometr. Intell. Lab. Syst. 88 (2007) 183–188.
[29] R.M. Balabin, R.Z. Safieva, Fuel 87 (2008) 1096–1101.
[30] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Chemometr. Intell. Lab. Syst. 93 (2008) 58–64.
[31] L.R. Schimleck, R. Evans, A.C. Matheson, J. Wood Sci. 48 (2002) 132–137.
[32] P.D. Jones, L.R. Schimleck, G.F. Peter, R.F. Daniels, A. Clark, Wood Sci. Technol. 40 (2006) 709–720.
[33] E.W. Ciurczak, J.K. Drennen, Pharmaceutical and Medicinal Applications of Near-infrared Spectroscopy, 1st ed., CRC Press, 2002.
[34] J. Moros, N. Galipienso, R. Vilches, S. Garrigues, M. de la Guardia, Anal. Chem. 80 (2008) 7257–7264.
[35] P. Taddei, S. Affatato, C. Fagnano, B. Bordini, A. Tinti, A. Toni, J. Mol. Struct. 613 (2002) 121–129.
[36] M. Andersson, K.-G. Knuuttil, Vib. Spectrosc. 29 (2002) 133–138.
[37] L.M. Harwood, C.J. Moody, Experimental Organic Chemistry: Principles and Practice, Wiley-Blackwell, 1989.
[38] T. Næs, T. Isaksson, T. Fearn, T. Davies, A User-Friendly Guide to Multivariate Calibration and Classification, NIR Publications, Chichester, UK, 2002.
[39] R.M. Balabin, E.I. Lomakina, J. Chem. Phys. 131 (2009) 074104.
[40] B.F.J. Manly, Multivariate Statistical Methods: A Primer, 3rd ed., Chapman and Hall/CRC, 2004.
[41] R.M. Balabin, R.Z. Syunyaev, S.A. Karpov, Fuel 86 (2007) 323–327.
[42] R.M. Balabin, R.Z. Syunyaev, S.A. Karpov, Energy Fuels 21 (2007) 2460–2465.
[43] S.B. Kim, C. Temiyasathit, K. Bensalah, A. Tuncel, J. Cadeddu, W. Kabbani, A.V. Mathker, H. Liu, Expert Syst. Appl. 37 (2010) 3863–3871.
[44] M.R. Monteiro, A.R.P. Ambrozin, M.S. Santos, E.F. Boffo, E.R. Pereira-Filho, L.M. Lião, A.G. Ferreira, Talanta 78 (2009) 660–664.
[45] D.D. Lee, H.S. Seung, Nature 401 (1999) 788–790.
[46] J.B. Tenenbaum, V. de Silva, J.C. Langford, Science 290 (2000) 2319–2321.
[47] G.E. Hinton, R.R. Salakhutdinov, Science 313 (2006) 504–507.
[48] D.C. Malins, N.L. Polissar, S.J. Gunselman, Proc. Natl. Acad. Sci. U. S. A. 94 (1997) 3611–3615.
[49] K. Sakurai, Y. Goto, Proc. Natl. Acad. Sci. U. S. A. 104 (2007) 15346–15351.
[50] A.E. Cohen, W.E. Moerner, Proc. Natl. Acad. Sci. U. S. A. 104 (2007) 12622–12627.
[51] P. Baptista, P. Felizardo, J.C. Menezes, M.J.N. Correia, Talanta 77 (2008) 144–151.
[52] E.J. Steen, Nature 463 (2010) 559–562.
[53] A.J. Ragauskas, Science 311 (2006) 484–489.
[54] A.E. Farrell, R.J. Plevin, B.T. Turner, A.D. Jones, M. O’Hare, D.M. Kammen, Science 311 (2006) 506–508.
[55] S. Crossley, J. Faria, M. Shen, D.E. Resasco, Science 327 (2009) 68–72.
[56] G. Knothe, J. Am. Oil Chem. Soc. 78 (2001) 1025–1028.
[57] P. Felizardo, P. Baptista, J.C. Menezes, M.J.N. Correia, Anal. Chim. Acta 595 (2007) 107–113.
[58] F.C.C. Oliveira, C.R.R. Brandao, H.F. Ramalho, L.A.F. Costa, P.A.Z. Suarez, J.C. Rubim, Anal. Chim. Acta 587 (2007) 194–199.
[59] O. Galtier, N. Dupuy, Y. Le Dreau, D. Ollivier, C. Pinatel, J. Kister, J. Artaud, Anal. Chim. Acta 595 (2007) 136–144.
[60] H. Yang, J. Irudayaraj, M.M. Paradkar, Food Chem. 93 (2005) 25–32.
[61] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Anal. Chim. Acta 671 (2010) 27–35.
[62] J.H. Friedman, J. Am. Stat. Assoc. 84 (1989) 165–175.
[63] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1–17.
[64] E. Fix, J.L. Hodges, Int. Stat. Rev. 57 (1989) 238.
[65] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, 1995.
[66] S.R. Amendolia, G. Cossu, M.L. Ganadu, B. Golosio, G.L. Masala, G.M. Mura, Chemometr. Intell. Lab. Syst. 69 (2003) 13–20.
[67] R.M. Balabin, J. Chem. Phys. 129 (2008) 164101.
[68] R.M. Balabin, R.Z. Safieva, Fuel 87 (2008) 2745–2752.
[69] L. Pigani, G. Foca, A. Ulrici, K. Ionescu, V. Martina, F. Terzi, M. Vignali, C. Zanardi, R. Seeber, Anal. Chim. Acta 643 (2009) 67–73.
[70] L. Pigani, G. Foca, K. Ionescu, V. Martina, A. Ulrici, F. Terzi, M. Vignali, C. Zanardi, R. Seeber, Anal. Chim. Acta 614 (2008) 213–222.
[71] T. Lillhonga, P. Geladi, Anal. Chim. Acta 544 (2005) 177–183.
[72] R.M. Balabin, R.Z. Safieva, E.I. Lomakina, Microchem. J., in press.
[73] R.M. Balabin, E.I. Lomakina, R.Z. Safieva, Fuel, in press.
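
As a worked check on the figures quoted in Sections 3.3 to 3.5, the overall classification errors for PLS-DA, KNN, and SVM follow directly from the reported counts of correctly classified samples. The sketch below is only an illustration and is not the authors' code: the function name `classification_error` is hypothetical, and it assumes that all three counts refer to the same set of 403 biodiesel samples (the text states the total explicitly only for PLS-DA).

```python
def classification_error(n_correct: int, n_total: int) -> float:
    """Return the classification error E in percent.

    E = 100 * (misclassified samples) / (all samples).
    """
    if n_total <= 0 or not 0 <= n_correct <= n_total:
        raise ValueError("counts must satisfy 0 <= n_correct <= n_total, n_total > 0")
    return 100.0 * (n_total - n_correct) / n_total


# Assumption: every model was evaluated on the same 403 samples.
N_SAMPLES = 403

# Correctly classified samples as quoted in the text.
correct_counts = {"PLS-DA": 360, "KNN": 378, "SVM": 385}

# Overall error per model, rounded to one decimal as in the paper.
errors = {
    model: round(classification_error(n_correct, N_SAMPLES), 1)
    for model, n_correct in correct_counts.items()
}

for model, error in errors.items():
    print(f"{model}: E = {error}%")
```

Under that assumption the computed values reproduce the errors stated in the text (10.7%, 6.2%, and 4.5%), which supports reading all three counts against the same sample total.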
