
This article was downloaded by: [Colorado College]

On: 09 December 2014, At: 00:44


Publisher: Taylor & Francis
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered
office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH, UK

Applied Spectroscopy Reviews


Publication details, including instructions for authors and
subscription information:
http://www.tandfonline.com/loi/laps20

An Overview of Infrared Spectroscopy Based on Continuous Wavelet Transform Combined with Machine Learning Algorithms: Application to Chinese Medicines, Plant Classification, and Cancer Diagnosis

Cungui Cheng (a), Jia Liu (a), Changjiang Zhang (b), Miaozhen Cai (c), Hong Wang (a) & Wei Xiong (a)

(a) Department of Chemistry, Zhejiang Normal University, Jinhua, Zhejiang, China
(b) Department of Electronics and Information Engineering, Zhejiang Normal University, Jinhua, Zhejiang, China
(c) Department of Biology, Zhejiang Normal University, Jinhua, Zhejiang, China
Published online: 03 Mar 2010.

To cite this article: Cungui Cheng , Jia Liu , Changjiang Zhang , Miaozhen Cai , Hong Wang & Wei
Xiong (2010) An Overview of Infrared Spectroscopy Based on Continuous Wavelet Transform Combined
with Machine Learning Algorithms: Application to Chinese Medicines, Plant Classification, and Cancer
Diagnosis, Applied Spectroscopy Reviews, 45:2, 148-164, DOI: 10.1080/05704920903435912

To link to this article: http://dx.doi.org/10.1080/05704920903435912


Applied Spectroscopy Reviews, 45:148–164, 2010
Copyright © Taylor & Francis Group, LLC
ISSN: 0570-4928 print / 1520-569X online
DOI: 10.1080/05704920903435912

An Overview of Infrared Spectroscopy Based on Continuous Wavelet Transform Combined with Machine Learning Algorithms: Application to Chinese Medicines, Plant Classification, and Cancer Diagnosis

CUNGUI CHENG,1 JIA LIU,1 CHANGJIANG ZHANG,2 MIAOZHEN CAI,3 HONG WANG,1 AND WEI XIONG1

1 Department of Chemistry, Zhejiang Normal University, Jinhua, Zhejiang, China
2 Department of Electronics and Information Engineering, Zhejiang Normal University, Jinhua, Zhejiang, China
3 Department of Biology, Zhejiang Normal University, Jinhua, Zhejiang, China

Abstract: Infrared spectroscopy has long been a workhorse technique for materials analysis and can positively identify many different types of material. In recent years there have been reports of using wavelet analysis and machine learning algorithms to extract features from Fourier transform infrared (FTIR) spectra. The machine learning algorithms include the back-propagation neural network (BPNN), the radial basis function neural network (RBFNN), and the support vector machine (SVM). This article reviews the important advances in FTIR analysis employing the continuous wavelet transform (CWT) and machine learning algorithms, especially applications of the method to Chinese medicine identification, plant classification, and cancer diagnosis.

Keywords: FTIR, CWT, machine learning, Chinese medicine identification, plant classification, cancer diagnosis

Introduction
Infrared radiation is commonly defined as electromagnetic radiation with wavenumbers from 14,300 to 20 cm−1 (wavelengths of 0.7 to 500 µm). A normal molecular motion, such as a vibration, rotation, rotation/vibration, lattice mode, or a combination, difference, or overtone of these normal vibrations, may absorb infrared radiation in this region when it changes the molecule’s dipole moment. The frequencies and intensities of the resulting infrared spectrum may be used to identify a material and to quantify the amount of a particular compound in a mixture. Compounds of the same class contain structural units that absorb infrared radiation at essentially similar frequencies and intensities. These bands are called group frequencies. The infrared spectroscopist uses group frequencies, the similar frequencies and intensities shared by a class of compounds, to predict the structures of unknown molecules even in the absence of standard infrared spectra. Sample collection and presentation accessories allow the analyst to collect spectra of solids, liquids, vapors, and solutions at various temperatures. Experiments conducted under such conditions help the spectroscopist determine the structures of molecules in different phases as well as the structure-property relationships of materials. Modern instrumentation allows the collection of infrared spectra at low picogram levels. The ability of infrared spectroscopy to examine and identify materials under such a wide variety of conditions has earned this technique the premier position as the workhorse of analytical science (1). Infrared (IR) spectroscopy is one of the most common spectroscopic techniques used by organic and inorganic chemists. Put simply, it measures the absorption of different IR frequencies by a sample positioned in the path of an IR beam. The main goal of IR spectroscopic analysis is to determine the chemical functional groups in the sample, because different functional groups absorb IR radiation at different characteristic frequencies. The frequencies and intensities of the resulting infrared bands, the infrared spectrum, may be used to characterize the material and to identify the presence and amount of a particular compound in a mixture. Thus, IR spectroscopy is an important and popular tool for structural elucidation and compound identification.

Address correspondence to Cungui Cheng, Department of Chemistry, Zhejiang Normal University, Jinhua, Zhejiang, P.R. China, 321004. E-mail: ccg@zjnu.cn
Recently, spectroscopy has become one of the major tools for biomedical applications and has made significant progress in the field of clinical evaluation. Research has been carried out on a number of natural tissues using spectroscopic techniques, including Fourier transform infrared (FTIR) spectroscopy. These vibrational spectroscopic techniques are relatively simple, reproducible, and nondestructive to the tissue, and they require only small amounts of material (micrograms to nanograms) with minimal sample preparation. In addition, these techniques provide molecular-level information, allowing investigation of functional groups, bonding types, and molecular conformations. Spectral bands in vibrational spectra are molecule specific and provide direct information about the biochemical composition. These bands are relatively narrow, easy to resolve, and sensitive to molecular structure, conformation, and environment (2). The original infrared instruments were of the dispersive type, which separate the individual frequencies of energy emitted from the infrared source. Fourier transform infrared (FTIR) spectrometry was then developed to overcome the limitations of dispersive instruments, chiefly the slow scanning process. A method was needed for measuring all of the infrared frequencies simultaneously rather than individually. The solution employed a simple optical device called an interferometer, which produces a unique type of signal that has all of the infrared frequencies “encoded” into it. The signal can be measured quickly, usually in about one second, so the time per sample is reduced to a few seconds rather than several minutes. Because the analyst requires a frequency spectrum (a plot of the intensity at each individual frequency) to make an identification, the measured interferogram cannot be interpreted directly; the individual frequencies must first be decoded. This is accomplished with a well-known mathematical technique, the Fourier transform, which is performed by the computer. The computer then presents the user with the desired spectral information for analysis.
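As a rough illustration of this decoding step, the sketch below builds a synthetic interferogram from two cosine components and recovers their positions and intensities with a naive discrete Fourier transform. The bin numbers, intensities, and function names are illustrative assumptions, not values from any instrument; a real spectrometer would also apply steps such as apodization and phase correction.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform, O(n^2); adequate for a short demo."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def synthetic_interferogram(n, components):
    """Sum of cosines; each (bin, intensity) pair stands for one IR frequency."""
    return [
        sum(a * math.cos(2 * math.pi * b * t / n) for b, a in components)
        for t in range(n)
    ]


n = 64
components = [(5, 1.0), (12, 0.5)]  # hypothetical "absorption lines"
interferogram = synthetic_interferogram(n, components)

# Single-sided amplitude spectrum: scale by 2/n so a unit cosine gives 1.0.
spectrum = [abs(c) * 2 / n for c in dft(interferogram)[: n // 2]]

# The two strongest bins should be the ones encoded in the interferogram.
peaks = sorted(sorted(range(len(spectrum)), key=lambda f: spectrum[f])[-2:])
print(peaks)
```

The time-domain signal looks nothing like a spectrum, yet every encoded frequency reappears as a separate peak after the transform.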
The wavelet transform is a signal processing method developed in recent years. It is a time-frequency analysis method that builds on the Fourier transform, has good time-frequency localization characteristics, and is well suited to information processing (3–5). It can decompose a signal into parts at different frequencies and scales, can focus on any part of the signal, and has been widely used in analytical chemistry. As a transform-domain signal processing method, the wavelet transform is particularly good at analyzing nonstationary signals. It plays an important role in signal analysis and feature extraction, where it can be used to detect singularities of a signal and to identify small differences between two signals. Wavelet transforms fall into two categories: the discrete wavelet transform (DWT) and the continuous wavelet transform (CWT). The DWT is cheaper to compute than the CWT, but the CWT is better at detecting singularities of the signal (6). The CWT is the sum over all time of the signal multiplied by scaled and shifted versions of the mother wavelet function, so the wavelet coefficients are a function of scale and position. Normally, wavelet decomposition consists of calculating the “resemblance coefficients” between the signal and the wavelet located at position b with scale a. If a coefficient is large, the resemblance is strong; otherwise, it is weak. The coefficients C(a, b) are calculated as follows (4):

$$C(a, b) = \int_{\mathbb{R}} s(t)\,\frac{1}{\sqrt{a}}\,\psi\!\left(\frac{t-b}{a}\right)dt, \qquad a \in \mathbb{R}^{+} - \{0\},\; b \in \mathbb{R} \tag{1}$$
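Equation (1) can be approximated numerically by replacing the integral with a sum over the sampled signal. The sketch below is a minimal pure-Python version that assumes a real-valued Morlet wavelet of the form exp(-t²/2)·cos(5t); the test signal, sampling step, and scales are hypothetical and chosen only to show that the coefficient peaks where the wavelet overlaps a band.

```python
import math


def morlet(t):
    """Real-valued Morlet wavelet, psi(t) = exp(-t^2 / 2) * cos(5 t)."""
    return math.exp(-t * t / 2.0) * math.cos(5.0 * t)


def cwt_coefficient(signal, dt, a, b):
    """Riemann-sum approximation of C(a, b) from Eq. (1)."""
    total = 0.0
    for n, s in enumerate(signal):
        total += s * morlet((n * dt - b) / a)
    return total * dt / math.sqrt(a)


def cwt(signal, dt, scales):
    """One row of coefficients per scale a, one column per position b."""
    positions = [n * dt for n in range(len(signal))]
    return [[cwt_coefficient(signal, dt, a, b) for b in positions] for a in scales]


# Hypothetical test signal: one narrow Gaussian "band" centred on sample 100.
dt, sigma, centre = 0.05, 0.1, 100
signal = [math.exp(-(((n - centre) * dt) ** 2) / (2 * sigma**2)) for n in range(201)]

scales = [0.5, 1.0, 2.0]
coeffs = cwt(signal, dt, scales)

# The resemblance is strongest where the wavelet sits on top of the band.
best_position = coeffs[1].index(max(coeffs[1]))
print(best_position)
```

This direct evaluation costs one pass over the signal for every (a, b) pair, which is acceptable for a short demonstration; production code would normally use an FFT-based convolution instead.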

When the wavelet transform is used to analyze data, a proper wavelet basis function and number of decomposition levels should be determined according to the spectral characteristics of the signal. In a multiscale wavelet decomposition, the suitable wavelet basis and scale are determined by how well the signal decomposes at different scales and by the characteristics of the FTIR signal. There is no general criterion for choosing the optimal wavelet basis function. In general, a proper wavelet basis function is chosen by considering the properties of the wavelet basis function, the features of the signal to be analyzed, and the actual problem. The parts of the signal whose shape is similar to that of the wavelet basis function are amplified, and the other parts are suppressed. The scale is likewise chosen according to the problem at hand: a large-scale wavelet basis function should be used to describe the overall, approximate properties of the signal, and a small-scale one should be used to bring out its details.
When using the CWT to detect singularities of a curve, we should choose as the wavelet basis function a wavelet that has a shape similar to the signal to be analyzed, a short compact support, and a large number of vanishing moments. Representative wavelet basis functions include the Coiflet, Symlet, Daubechies, Morlet, Mexican hat, and Meyer wavelets. Figures 1a–f show their function curves in the time domain. Compared with the other wavelets, the Morlet wavelet has the shortest compact support (Figure 1c), so we choose the Morlet wavelet as the analyzing wavelet.
Figure 1. Wavelet basis function curves in the time domain. (a) Mexican hat wavelet; (b) Meyer wavelet; (c) Morlet wavelet; (d) Db10 wavelet; (e) Coiflet 5 wavelet; (f) Sym8 wavelet.

Machine learning uses computers to simulate human learning activities: the system acquires new knowledge through learning and uses it to recognize existing knowledge. An artificial neural network (ANN) is an engineering system that simulates the structure and intelligent behavior of the brain, based on an understanding of the brain’s organizational structure and operating mechanism. The radial basis function neural network (RBFNN) can extend or preprocess the input vector into a high-dimensional space. It not only has good generalization ability but also avoids the heavy computation of a back-propagation neural network. The SVM is a machine learning technique that has arisen in recent years; its core idea is to map the data nonlinearly into a high-dimensional feature space. The back-propagation neural network (BPNN) is one of the most common neural network structures. It is simple and effective. However, the back-propagation algorithm, which uses gradient descent, is not optimal: the weights are modified according to the gradient of the error curve, which points toward the local minimum nearest the current point. The resilient propagation (Rprop) algorithm for BPNN provides a solution to this problem. This article mainly reviews the continuous wavelet transform and the machine learning algorithms that are applied to infrared spectra.
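To make the Rprop idea mentioned above concrete, the following sketch shows a sign-based update in which each weight keeps its own step size. The article does not say which Rprop variant was used; this version skips the update after a sign change (often called iRprop−), and the growth and shrink factors 1.2 and 0.5 are commonly quoted defaults taken here as assumptions. The quadratic error surface is purely hypothetical.

```python
def sign(x):
    return (x > 0) - (x < 0)


def rprop_minimize(grad_fn, weights, steps=100, step_init=0.1,
                   eta_plus=1.2, eta_minus=0.5, step_min=1e-6, step_max=50.0):
    """Minimize a function from its gradient using only the gradient's sign."""
    w = list(weights)
    step = [step_init] * len(w)
    prev_grad = [0.0] * len(w)
    for _ in range(steps):
        grad = grad_fn(w)
        for i, g in enumerate(grad):
            if prev_grad[i] * g > 0:
                # Same direction as last time: accelerate.
                step[i] = min(step[i] * eta_plus, step_max)
            elif prev_grad[i] * g < 0:
                # Sign flipped, so the minimum was overshot: shrink the step
                # and skip this update by treating the gradient as zero.
                step[i] = max(step[i] * eta_minus, step_min)
                g = 0.0
            w[i] -= sign(g) * step[i]
            prev_grad[i] = g
    return w


def gradient(w):
    # Gradient of f(w) = (w0 - 3)^2 + 10 (w1 + 1)^2, a hypothetical error surface.
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]


w_final = rprop_minimize(gradient, [0.0, 0.0])
print(w_final)
```

Because only the sign of each partial derivative is used, the steep and shallow directions of the error surface converge at similar speeds, which is the property that motivates Rprop over plain gradient descent.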

The Application in Traditional Chinese Medicine Identification


Fourier transform infrared spectroscopy is a common analysis tool with high sensitivity, high resolution, and fast speed, and it has been widely used in the identification of traditional Chinese medicinal materials (7–9). Because a traditional Chinese medicinal material is a complex mixture of compounds, the directly measured infrared spectrum is the superposition of the infrared spectra of all its components. Interpretation by general analysis methods therefore depends greatly on the experience of the scientist (10). Traditional Chinese medicinal materials of the same family and the same genus have similar chemical components, so using FTIR directly to distinguish them may result in misjudgments and misinterpretations. Traditional Chinese medicine has wide clinical usage, both as proprietary traditional Chinese medicine preparations and as chemical components of extracts. Incorrect identification could result in a shortage of medicine, rising prices, and confusion of varieties. Therefore, a method to clearly identify traditional Chinese medicine is needed.
The use of the CWT in traditional Chinese medicine identification has been reported (10). Horizontal attenuated total reflection–Fourier transform infrared spectroscopy (HATR-FTIR) was used to measure the FTIR spectra of Stephania tetrandra S. Moore and Stephania cepharantha Hayata (Figure 2), and a one-dimensional CWT was then applied to these spectra.
Figure 2. FTIR spectra of (a) Stephania tetrandra S. Moore and (b) Stephania cepharantha Hayata.

Because Stephania tetrandra S. Moore and Stephania cepharantha Hayata belong to the same family and the same genus of traditional Chinese medicinal material, their chemical components are similar and the absorption peaks in their IR spectra are not clearly distinct. Because the CWT has an advantage in detecting singularities of a signal, it was used to extract features from the FTIR spectra and succeeded in enlarging the differences between the spectra of the two species. An RBFNN was then used to identify them efficiently. The RBFNN can extend or preprocess the input vector into a high-dimensional space; it not only has good generalization ability but also avoids the heavy computation of a back-propagation neural network, so the network learns rapidly (11). The goal was the classification and identification of two kinds of plants (Stephania tetrandra S. Moore and Stephania cepharantha Hayata). Nine feature parameters were used as the input vector, so the input layer of the network needs nine neurons. The RBFNN therefore has nine input units and two output units. The structure of the RBFNN is shown in Figure 3.

Figure 3. Structure of RBFNN.
The first layer is the input layer, which introduces the feature vector {S1, S2, . . . , S9} into the network. The second layer is a hidden layer, which is fully connected to the input layer (weight value = 1). It acts as a transformation of the input patterns: it maps the low-dimensional input data into a high-dimensional space, which favors classification and recognition by the output layer. A basis function should be selected as the transfer function of the hidden layer nodes. The Gaussian function is as follows:

$$\varphi(\lVert s - C_h \rVert, \rho) = \exp\!\left[-\frac{\lVert s - C_h \rVert^{2}}{\rho^{2}}\right] \tag{2}$$

where $C_h$ is the center of the basis function and $\rho$ is its width. Each hidden node computes the Euclidean distance between the input vector and its center, and the transfer function then transforms this distance. The third layer is the output layer, where output node j can be written as:

$$y_j = \sum_{h=1}^{H} W_{hj}\,\varphi_h(\lVert s - C_h \rVert, \rho) \tag{3}$$

where h = 1, 2, . . ., H and $\lVert \cdot \rVert$ is the Euclidean norm.
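The forward pass described by Eqs. (2) and (3) can be sketched in a few lines. The example assumes the usual positive Gaussian basis function with a single shared width ρ; the centers, weights, and input below are made-up numbers used only to exercise the code, not values from the reviewed study.

```python
import math


def gaussian_rbf(s, center, rho):
    """Hidden-node response: exp(-||s - C_h||^2 / rho^2)."""
    dist_sq = sum((si - ci) ** 2 for si, ci in zip(s, center))
    return math.exp(-dist_sq / rho**2)


def rbf_forward(s, centers, rho, weights):
    """Network outputs y_j; weights[h][j] links hidden node h to output node j."""
    hidden = [gaussian_rbf(s, c, rho) for c in centers]
    n_out = len(weights[0])
    return [
        sum(weights[h][j] * hidden[h] for h in range(len(centers)))
        for j in range(n_out)
    ]


# Hypothetical two-centre, two-output network.
centers = [[0.0, 0.0], [1.0, 1.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
outputs = rbf_forward([0.0, 0.0], centers, rho=1.0, weights=weights)
print(outputs)
```

An input lying exactly on a center drives that hidden node to its maximum response of 1, while nodes with distant centers respond weakly, which is what makes the hidden layer a localized feature map.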


A center-adjustment algorithm uses the minimum clustering distance as its criterion to divide the input data set into H classes and obtain H centers. The steps of the algorithm are as follows:
1. Select the initial centers $C_h(0)$, $1 \le h \le H$, randomly, and set the initial learning rate $\alpha(0)$.
2. Calculate the distances at the kth step:

$$l_h(k) = \lVert s(k) - C_h(k-1) \rVert, \qquad 1 \le h \le H \tag{4}$$

3. Find the node q that has the minimum distance:

$$q = \arg\min_{1 \le h \le H} l_h(k) \tag{5}$$

4. Update the centers:

$$C_h(k) = C_h(k-1), \qquad 1 \le h \le H,\; h \ne q$$
$$C_q(k) = C_q(k-1) + \alpha(k)\,[s(k) - C_q(k-1)] \tag{6}$$

5. Recalculate the qth node’s distance:

$$l_q(k) = \lVert s(k) - C_q(k-1) \rVert \tag{7}$$

6. Modify the learning rate:

$$\alpha(k+1) = \frac{\alpha(k)}{1 + \operatorname{int}[k/H]^{1/2}} \tag{8}$$

7. Set k = k + 1 and return to step 2.
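The steps above can be sketched as a single pass over the training samples. This is a minimal illustration, not the authors' implementation: the initial centers are passed in explicitly rather than drawn at random so that the run is reproducible, step 5 is left out because nothing in the sketch consumes the recomputed distance, and the sample points are invented.

```python
import math


def euclid(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def adjust_centers(samples, init_centers, alpha0=0.5):
    """One pass of the winner-only centre update (steps 1 to 7)."""
    centers = [list(c) for c in init_centers]   # step 1 (supplied by the caller)
    n_centers = len(centers)
    alpha = alpha0
    for k, s in enumerate(samples, start=1):
        dists = [euclid(s, c) for c in centers]  # step 2, Eq. (4)
        q = dists.index(min(dists))              # step 3, Eq. (5)
        # Step 4, Eq. (6): only the winning centre moves toward the sample.
        centers[q] = [c + alpha * (x - c) for c, x in zip(centers[q], s)]
        # Step 6, Eq. (8): shrink the learning rate.
        alpha = alpha / (1.0 + math.sqrt(k // n_centers))
    return centers


# Two hypothetical clusters near (0, 0) and (5, 5), presented alternately.
samples = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.2, 4.9),
           (-0.1, 0.2), (4.8, 5.1), (0.1, -0.2), (5.1, 5.2)]
centers = adjust_centers(samples, init_centers=[(1.0, 1.0), (4.0, 4.0)])
print(centers)
```

With the learning-rate schedule of Eq. (8) the step size falls off very quickly, so the first few samples dominate where each center ends up; in this run both centers move toward their clusters but stop short of the cluster means.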

The weight values can be seen as a vector

$$W_j(k) = [W_{1j}(k), W_{2j}(k), \ldots, W_{Hj}(k)]^{T}, \qquad 1 \le j \le N \tag{9}$$

At the kth step, if the output vector of the middle layer is

$$\Phi(k) = [\Phi_1(k), \Phi_2(k), \ldots, \Phi_H(k)]^{T} = [\Phi_1(l_1(k), \rho), \ldots, \Phi_H(l_H(k), \rho)]^{T} \tag{10}$$

then the estimated output at step k is

$$\hat{y}_j(k) = \sum_{i=1}^{H} W_{ij}\,\Phi_i(l_i(k), \rho) \tag{11}$$

If the actual output is $y_j(k)$, the error is $\varepsilon_j(k) = y_j(k) - \hat{y}_j(k)$. According to the recursive least squares method, the adjustment algorithm for the network weight values is as follows:

$$W_j(k+1) = W_j(k) + P(k)\,\Phi(k)\,\varepsilon_j(k) \tag{12}$$

$$P(k) = \frac{1}{\lambda(k)}\left[P(k-1) - \frac{P(k-1)\,\Phi(k)\,\Phi^{T}(k)\,P(k-1)}{\lambda(k) + \Phi^{T}(k)\,P(k-1)\,\Phi(k)}\right] \tag{13}$$
where P is the error variance matrix and λ is a forgetting factor. To make the network identify samples more accurately, a competition layer is placed after the output layer of the network. The output vector of the competition layer consists of the outputs of all the neurons in that layer, and the outputs of all neurons are zero except that of the winning one. A radial basis function neural network has two sets of adjustable parameters: the centers $C_h$ and the weights $W_{ij}$. The learning process of the network is accordingly divided into two stages. The first stage is the center adjustment, in which the centers $C_h$ of the Gaussian functions of the hidden layer nodes are determined from the training samples. The second stage is the weight adjustment, in which, once the parameters of the hidden layer have been determined, the connecting weights $W_{ij}$ of the output layer are obtained from the given training samples by the least squares principle. In these studies, the feature vector was input to the RBFNN, which was trained to classify Stephania tetrandra S. Moore and Stephania cepharantha Hayata accurately. One hundred twenty-eight pairs of FTIR spectra were used to train and test the proposed method: 78 pairs served as training samples and 50 pairs as testing samples. The experimental results show that the recognition rates for Stephania tetrandra S. Moore and Stephania cepharantha Hayata were 99.8% and 99.9%, respectively, using the proposed method.
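The second training stage and the competition layer can be illustrated together. The sketch below applies the recursive least squares update of Eqs. (12) and (13) to the weights of one output node and then shows a winner-take-all layer. The large initial P, the constant forgetting factor of 1, and the noiseless training pairs are assumptions made for the demonstration.

```python
def rls_step(w, P, phi, y, lam=1.0):
    """One recursive-least-squares update of the weights of output node j."""
    n = len(w)
    y_hat = sum(w[i] * phi[i] for i in range(n))                  # Eq. (11)
    err = y - y_hat
    P_phi = [sum(P[r][c] * phi[c] for c in range(n)) for r in range(n)]
    denom = lam + sum(phi[r] * P_phi[r] for r in range(n))
    # Eq. (13); P is symmetric, so Phi^T P equals (P Phi)^T.
    P_new = [[(P[r][c] - P_phi[r] * P_phi[c] / denom) / lam for c in range(n)]
             for r in range(n)]
    gain = [sum(P_new[r][c] * phi[c] for c in range(n)) for r in range(n)]
    w_new = [w[i] + gain[i] * err for i in range(n)]              # Eq. (12)
    return w_new, P_new


def compete(outputs):
    """Competition layer: the largest output wins, all others are set to zero."""
    winner = outputs.index(max(outputs))
    return [1 if i == winner else 0 for i in range(len(outputs))]


# Hypothetical hidden-layer outputs and targets generated by y = 2*phi_1 - phi_2.
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0), ([2.0, 1.0], 3.0)]
w = [0.0, 0.0]
P = [[1e6, 0.0], [0.0, 1e6]]  # large initial P means little trust in w
for phi, y in data:
    w, P = rls_step(w, P, phi, y)

print(w, compete([0.2, 0.9]))
```

Each sample refines the weights without revisiting earlier samples, which is why this stage is fast compared with iterating gradient descent over the whole training set.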
A method for distinguishing semen cuscutae from its sibling plant, Japanese dodder seed, based on FTIR-CWT and ANN classification has also been reported (12). Artificial neural networks come in many models, which can be divided by network structure into feed-forward and feedback networks. One of the main applications of a feed-forward network is identification and classification. In a feedback network there is no strict distinction between the input and output layers, and after learning, the important characteristics and the energy minimum of the data are extracted (13). When all the nodes of a feed-forward neural network use the sigmoid function, one hidden layer is sufficient for arbitrary classification. The network model and learning rules were chosen, the input data and output data (the latter also known as the target output data) were presented, and the network was trained to obtain the node weights and node thresholds. The weights and thresholds were determined by comparing the output data of the artificial neural network with the target output data until the errors were within the allowable range. The sigmoid function was used as the activation function. The thresholds and weights were amended so as to minimize the least square error for each input sample p. The least square error function can be written as

$$E_p = 0.5 \times \sum_j (t_{pj} - o_{pj})^{2} \tag{14}$$

where $t_{pj}$ is the target output value of sample p at the jth node of the output layer, that is, the type of the plant, and $o_{pj}$ is the actual output value of sample p at that node. The actual output value is calculated from the input layer to the output layer, whereas the errors and weights are adjusted from the output layer back to the input layer. The output value $o_{pj}$ of node j (which equals the input value when node j is in the input layer) is calculated as

$$o_{pj} = f(\mathrm{net}_{pj}) = \frac{1}{1 + e^{-\sum_i w_{ji} o_{pi} - \theta_j}} \tag{15}$$

where $w_{ji}$ is the weight that connects nodes i and j, and $\theta_j$ is the threshold of node j. A threshold can be regarded as a weight connecting a node whose output is always 1 to the other nodes, so it is adjusted in the same way as $w_{ji}$. When j is an output layer node, the amendment of the weight $w_{ji}$, which connects hidden layer node i to output layer node j, is

$$\Delta w_{ji}(n+1) = \eta\,\delta_{pj}\,o_{pi} + \alpha\,\Delta w_{ji}(n) \tag{16}$$

where η is the learning rate, α is the momentum term, and $\delta_{pj}$ is the error signal of output layer node j, calculated as follows:

$$\delta_{pj} = (t_{pj} - o_{pj})\,o_{pj}\,(1 - o_{pj}) \tag{17}$$

When j is not an output layer node, the same weight amendment is used for the weight connecting node i to hidden layer node j, but in this case $\delta_{pj}$ becomes

$$\delta_{pj} = o_{pj}\,(1 - o_{pj}) \sum_k \delta_{pk}\,w_{kj} \tag{18}$$

where $\delta_{pk}$ is the error signal of node k, which receives input from node j, and $w_{kj}$ is the weight connecting nodes j and k.
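A compact way to see Eqs. (14) to (18) working together is a one-hidden-layer network trained sample by sample. The sketch below follows the standard back-propagation rule with momentum, in which the weight change uses the output of the sending node; the network size, starting weights, learning rate, momentum, and the logical-OR task are all assumptions chosen to keep the example small.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyBPNN:
    """One hidden layer of sigmoid units; the last weight in each row is the
    threshold, treated as a weight on a constant input of 1."""

    def __init__(self, w_hidden, w_out, eta=0.5, alpha=0.5):
        self.w_hidden = [row[:] for row in w_hidden]
        self.w_out = [row[:] for row in w_out]
        self.eta, self.alpha = eta, alpha
        self.dw_hidden = [[0.0] * len(row) for row in w_hidden]
        self.dw_out = [[0.0] * len(row) for row in w_out]

    def forward(self, x):
        x_in = list(x) + [1.0]
        hidden = [sigmoid(sum(w * v for w, v in zip(row, x_in)))
                  for row in self.w_hidden]
        h_in = hidden + [1.0]
        out = [sigmoid(sum(w * v for w, v in zip(row, h_in)))
               for row in self.w_out]
        return x_in, h_in, out

    def train_sample(self, x, target):
        x_in, h_in, out = self.forward(x)
        # Eq. (17): error signal of each output node.
        delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
        # Eq. (18): error signal of each hidden node, using the old weights.
        delta_hidden = [
            h_in[j] * (1.0 - h_in[j])
            * sum(delta_out[k] * self.w_out[k][j] for k in range(len(out)))
            for j in range(len(self.w_hidden))
        ]
        self._update(self.w_out, self.dw_out, delta_out, h_in)
        self._update(self.w_hidden, self.dw_hidden, delta_hidden, x_in)

    def _update(self, weights, steps, deltas, inputs):
        # Eq. (16): new step = eta * delta * input + alpha * previous step.
        for j, row in enumerate(weights):
            for i in range(len(row)):
                steps[j][i] = (self.eta * deltas[j] * inputs[i]
                               + self.alpha * steps[j][i])
                row[i] += steps[j][i]

    def error(self, data):
        # Eq. (14), summed over all samples.
        return sum(
            0.5 * sum((t - o) ** 2 for t, o in zip(target, self.forward(x)[2]))
            for x, target in data
        )


data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [1])]  # logical OR
net = TinyBPNN(w_hidden=[[0.1, -0.2, 0.0], [0.3, 0.2, 0.0]],
               w_out=[[0.2, -0.1, 0.0]])
error_before = net.error(data)
for _ in range(500):
    for x, target in data:
        net.train_sample(x, target)
error_after = net.error(data)
print(error_before, error_after)
```

The momentum term reuses a fraction of the previous step, which smooths the trajectory across successive samples; setting alpha to 0 recovers plain gradient descent.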
We used FTIR and HATR techniques to obtain the FTIR spectra of semen cuscutae and its sibling, Japanese dodder seed. The features of their similar spectra were extracted by the continuous wavelet transform. After comparative analysis, decomposition levels 7, 10, and 13 were chosen to extract the feature vectors, which were used to train the ANN. The trained network was then used to classify semen cuscutae and Japanese dodder seeds collected from different places all over China. Tests on 32 samples showed that the sibling plants, semen cuscutae and Japanese dodder seed, could be effectively identified by FTIR combined with continuous wavelet features and an artificial neural network (13).
Jin and Cheng (14) used the continuous wavelet transform with HATR-FTIR spectroscopy to identify Fructus lycii and unofficial samples of the same genus. The HATR-FTIR spectra of Fructus lycii and the unofficial samples were obtained directly. Principal component analysis (PCA) was then applied to the FTIR spectral data to determine the information load in each spectral region, and the Morlet wavelet was used as the mother wavelet to analyze the spectra in the continuous wavelet domain. The results showed that combining HATR-FTIR with the continuous wavelet transform has important application value for identifying Fructus lycii and its unofficial samples. The method was shown to be direct, rapid, and accurate.
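As a hedged illustration of how PCA can quantify the information load of a data set, the sketch below finds the first principal component by power iteration and reports the fraction of total variance it explains. Reading "information load" as explained variance is an assumption, and the five two-variable samples are invented rather than taken from any spectrum.

```python
import math


def covariance_matrix(rows):
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]


def first_component(cov, iterations=100):
    """Dominant eigenvector and eigenvalue of cov by power iteration.

    The start vector (1, 0, ..., 0) must not be orthogonal to the dominant
    eigenvector; that holds for the demo data below.
    """
    d = len(cov)
    v = [1.0] + [0.0] * (d - 1)
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue


# Hypothetical absorbances at two wavenumbers for five samples.
rows = [(1.0, 1.1), (2.0, 1.9), (3.0, 3.2), (4.0, 3.8), (5.0, 5.1)]
cov = covariance_matrix(rows)
pc1, lam1 = first_component(cov)
load = lam1 / sum(cov[i][i] for i in range(len(cov)))
print(pc1, load)
```

When two variables rise and fall together, as here, nearly all of the variance falls on a single component, so a region with a high load carries most of the usable information.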
We identified semen celosiae and semen celosiae cristatae using the continuous wavelet transform with FTIR (15). FTIR spectroscopy was used to obtain the infrared spectra of semen celosiae and semen celosiae cristatae directly, quickly, and accurately. The CWT was then used to extract features from the FTIR spectra and succeeded in enlarging the differences between the spectra of the two materials, which greatly improved the identification accuracy. Because the Morlet wavelet can effectively detect singular points of a signal, it was selected as the mother wavelet, and a one-dimensional continuous wavelet transform was applied to the infrared spectra of semen celosiae and its easily confused varieties. We observed the differences between semen celosiae and semen celosiae cristatae at all scales in the wavelet domain and selected as the optimal scale the one at which the difference between them was most obvious. The results show that applying the continuous wavelet transform to FTIR spectra is effective for identifying traditional Chinese medicinal materials that are of the same genus but different species.
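One simple way to pick such a scale is to compare the two coefficient matrices row by row. The article does not state a numerical criterion, so the sum of absolute differences used below is an assumption, and the coefficient values and scale labels are made up for the demonstration.

```python
def scale_differences(coeffs_a, coeffs_b):
    """Sum of absolute coefficient differences at each scale (one row per scale)."""
    return [
        sum(abs(x - y) for x, y in zip(row_a, row_b))
        for row_a, row_b in zip(coeffs_a, coeffs_b)
    ]


def most_discriminating_scale(scales, coeffs_a, coeffs_b):
    """Scale whose coefficients differ most between the two spectra."""
    diffs = scale_differences(coeffs_a, coeffs_b)
    return scales[diffs.index(max(diffs))]


# Hypothetical CWT coefficients of two spectra at three scales.
scales = [4, 8, 16]
coeffs_a = [[0.1, 0.2, 0.1], [0.5, 0.9, 0.4], [0.3, 0.3, 0.2]]
coeffs_b = [[0.1, 0.25, 0.1], [0.2, 0.1, 0.3], [0.3, 0.2, 0.2]]
best_scale = most_discriminating_scale(scales, coeffs_a, coeffs_b)
print(best_scale)
```

Other criteria, such as the largest single difference or a ratio of between-class to within-class spread over many samples, would fit the same description and may select a different scale.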
We studied the recognition of Equisetum arvense L. and Hippochaete ramosissima Boerner (16) based on FTIR spectrometry with chemometrics. FTIR and HATR techniques were used to obtain the FTIR spectra of the two plants, and the features of their similar absorption bands were extracted by the CWT. The features at decomposition levels 8, 9, and 10 were used as the input data of an SVM, and a discrimination model was established by this FTIR-CWT-SVM approach. The SVM is a machine learning technique that has arisen in recent years; its core idea is to map the data nonlinearly into a high-dimensional feature space. An optimal separating hyperplane of low Vapnik–Chervonenkis dimension is constructed in this space, which minimizes the upper bound of the classification risk. The SVM builds on the principle of empirical risk minimization, which is theoretically guaranteed only as the number of samples tends to infinity. Most properties of the SVM method are contained in the structural risk minimization principle proposed by Vapnik (17), which requires the hyperplane not only to separate the classes without error but also to maximize the margin, thereby guaranteeing minimal risk. The SVM was first proposed for linearly separable problems. Because linearly inseparable problems also exist, the SVM method was extended to solve nonlinear problems. Its main ideas are as follows. Suppose there are n training samples $(x_i, y_i) \in \mathbb{R}^d \times \{\pm 1\}$. The input vector is transformed into a high-dimensional feature space by the nonlinear mapping $g := (g_1, g_2, \ldots)$, and the optimal classification is then sought in the new space. The nonlinear map is implemented by defining an appropriate inner product function.
The optimal hyperplane constructed in the feature space can be expressed as follows:

$$H(x) = \sum_{i=1}^{n} \alpha_i y_i \langle g(x), g(x_i) \rangle + \alpha_0 \tag{19}$$

The feature space is a Hilbert space. It is not necessary to know the explicit form of g(x), because only the inner product in this space is needed, and it is given by the kernel function:

$$K(x, x_i) = \langle g(x), g(x_i) \rangle \tag{20}$$

If a kernel function that satisfies Mercer’s condition is selected, samples that are linearly inseparable in the input space can be separated in the high-dimensional feature space. Substituting Eq. (20) into Eq. (19) gives:

$$H(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + \alpha_0 \tag{21}$$

The coefficients $\alpha_i$ are obtained by solving the following problem:

$$\max_{\alpha} W(\alpha) = \max_{\alpha}\left\{-\frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum_{i=1}^{n} \alpha_i\right\} \tag{22}$$

where

$$\sum_{i=1}^{n} \alpha_i y_i = 0 \tag{23}$$

$$0 \le \alpha_i \le C, \qquad i = 1, \ldots, n \tag{24}$$

The training vectors $x_i$ for which $\alpha_i > 0$ are the support vectors. $\alpha_0$ is obtained from a support vector $(x_s, y_s)$:

$$\alpha_0 = y_s - \sum_{i=1}^{n} \alpha_i y_i K(x_i, x_s) \tag{25}$$

The constant C in Eq. (24) is a positive real number introduced together with slack variables to allow for samples that cannot be classified correctly; it controls the degree of penalty for misclassification (18). The SVM procedure is shown in Figure 4.
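To show how Eqs. (21) and (25) are used once the coefficients are known, the sketch below evaluates the decision function for a two-sample problem whose dual solution can be written in closed form: with one sample per class, Eq. (23) forces both coefficients to be equal, and maximizing Eq. (22) gives α = 1 / (1 − K(x₁, x₂)). The Gaussian kernel, its γ value, the sample points, and the assumption that C is large enough not to bind are all illustrative choices.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel, exp(-gamma * ||u - v||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def bias(alphas, xs, ys, s, kernel):
    """Eq. (25): alpha_0 computed from support vector number s."""
    return ys[s] - sum(a * y * kernel(x, xs[s]) for a, x, y in zip(alphas, xs, ys))


def decision(x, alphas, xs, ys, alpha0, kernel):
    """Eq. (21): H(x); the predicted class is the sign of this value."""
    return sum(a * y * kernel(x, xi) for a, xi, y in zip(alphas, xs, ys)) + alpha0


# Two hypothetical one-dimensional training samples, one per class.
xs = [(0.0,), (2.0,)]
ys = [1, -1]
k12 = rbf_kernel(xs[0], xs[1])
alphas = [1.0 / (1.0 - k12)] * 2  # closed-form dual optimum for two points
alpha0 = bias(alphas, xs, ys, 0, rbf_kernel)

h_pos = decision(xs[0], alphas, xs, ys, alpha0, rbf_kernel)
h_neg = decision(xs[1], alphas, xs, ys, alpha0, rbf_kernel)
h_new = decision((0.3,), alphas, xs, ys, alpha0, rbf_kernel)
print(h_pos, h_neg, h_new)
```

Both training samples are support vectors here and sit exactly on the margin, so the decision function returns +1 and −1 for them; for realistic data sets the coefficients come from a quadratic programming solver rather than a closed form.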
After training, the discrimination accuracy for the 120 prediction samples was above 90%, and when the radial basis function was used as the kernel function, the discrimination accuracy reached 100%. The results show that the feature values of the FTIR spectra of the samples differed after continuous wavelet feature extraction, which makes it advantageous to classify the two plants using an SVM as the classifier. By using an SVM to classify the feature values extracted by the continuous wavelet transform, the morphologically similar plants Equisetum arvense L. and Hippochaete ramosissima Boerner were effectively identified (16).

Figure 4. The structure of SVM.

The Application in Plant Classification


Phytotaxonomy is the oldest and most systematic branch of plant science. Classical plant
classification is based on the plant's external morphology and internal anatomy. This
approach is quite limited and artificial. As science has developed, different disciplines
have interacted and new interdisciplinary fields have formed, among them a field called
phytochemical systematics. In this system, plants are classified chemically according to
differences in their secondary metabolites and in their information-carrying biological
macromolecules. Currently, chromatography, spectroscopic methods, and immunology are used
in chemical classification. Because FTIR provides comprehensive information about the
chemical composition of a material, it can be used to classify different plants, and it has
made a great contribution to plant classification. Hypericum and Triadenum have been
identified using their FTIR spectra (19). The results showed that the infrared spectra of
Hypericum and Triadenum were fingerprint-like patterns that were highly typical of different
taxa. Significant differences in the infrared spectra were found between these two genera
and among the nine sections of Hypericum. Certain differences also exist among the spectra
of species within the same section of Hypericum or within Triadenum. Furthermore, no
significant difference was found in the infrared spectral patterns of leaves at various
developmental stages or of leaves of the same species collected from different geographic
regions, although geographic differences did occasionally occur within a species. The
results indicated that the infrared spectra of leaves are of taxonomic value at the level
of species and sections in these two genera, and that this technique can be widely used for
the identification and classification of other taxa when standard spectra are available.
Typical medicinal plants in Araliaceae, Campanulaceae, Magnoliaceae, Lauraceae, Leguminosae,
Berberidaceae, Pteridophyta, etc., were studied with FTIR for the first time by Huang et al.
(20). They pointed out the similarities and differences within each family and discussed the
differences between spectra of samples taken from different parts of the same plant or
collected at different times. The characteristic functional groups of the main active
components in the plants and the primary peaks were identified. The results show that FTIR
can become

Figure 5. FTIR spectra of (a) yellow foxtail seed, (b) giant foxtail seed, and (c) green foxtail seed.

a fast, reliable, objective, and effective method of chemical taxonomy. However, FTIR
analysis alone has limitations, so a fast and accurate classification method is needed (21).
The application of the continuous wavelet transform to infrared spectra was reported by
Cheng et al. (21). Because the wavelet coefficients at each level characterize the same
signal differently, only a few coefficients are needed to represent the absorption peaks of
a spectrum, which makes the transform one of the most efficient chemometric analysis
methods. Cheng et al. (21) effectively identified the sibling plants yellow foxtail seed,
giant foxtail seed, and green foxtail seed by FTIR with continuous wavelet feature
extraction and an ANN classifier. These plants are difficult to distinguish by traditional
phytotaxonomy. FTIR and HATR techniques were used to obtain the FTIR spectra of the
yellow foxtail seed, the giant foxtail seed, and the green foxtail seed, as shown in Figure 5.
Because they are sibling plants, they have similar chemical compositions, including the
hydroxyl groups of cellulose (seed coat), starch, and the plant hormone β-sitosterol, and
their IR absorption is quite similar. The FTIR spectra of the different plants have close
absorbances and are difficult to distinguish. CWT was therefore used to extract features for
further classification, and the feature vectors were used to train the artificial neural network.

The trained neural network was then used to classify the seeds. It effectively identified the
sibling plants yellow foxtail seed, giant foxtail seed, and green foxtail seed by FTIR with
continuous wavelet feature extraction and ANN classification. We (21) reported that the
wavelet transform is a more effective signal processing method than FTIR alone and that it
plays an important role in signal analysis and feature extraction. Ehrentreich pointed out
that the wavelet transform has been established, together with FTIR, as a data processing
method in analytical chemistry. This method contributes to plant classification (21).
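The feature extraction step described above can be sketched in a few lines of Python. The sketch is only an illustration under stated assumptions: the Mexican hat wavelet, the three scales, the synthetic single-band "spectrum", and the choice of the largest absolute coefficient per scale as the feature are all hypothetical and are not taken from the cited study.

```python
import math

def mexican_hat(t):
    """Mexican hat (Ricker) wavelet, unnormalized: (1 - t^2) * exp(-t^2 / 2)."""
    return (1.0 - t * t) * math.exp(-0.5 * t * t)

def cwt(signal, scales):
    """Discrete approximation of the CWT, one row of coefficients per scale.

    W(a, b) = (1 / sqrt(a)) * sum_k signal[k] * psi((k - b) / a)
    """
    n = len(signal)
    return [
        [sum(signal[k] * mexican_hat((k - b) / a) for k in range(n)) / math.sqrt(a)
         for b in range(n)]
        for a in scales
    ]

def peak_features(coeffs):
    """One feature per scale: the largest absolute wavelet coefficient."""
    return [max(abs(c) for c in row) for row in coeffs]

# Hypothetical spectrum: a single Gaussian absorption band centred at index 50.
spectrum = [math.exp(-0.5 * ((i - 50) / 3.0) ** 2) for i in range(101)]
scales = [2, 4, 8]
coeffs = cwt(spectrum, scales)
features = peak_features(coeffs)
print(features)
```

The resulting short feature vector, rather than the full spectrum, is what would be passed to a classifier such as an ANN or SVM.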

The Application in Cancer Diagnosis


Cancer research presents a huge challenge for medical and academic researchers all over the
world. Early detection of invasive cancer is essential for reducing the mortality rate,
because curing cancer depends on early diagnosis and prompt treatment. The main method for
diagnosing cancer is tumor tissue pathology (22). However, the nonquantitative pathological
diagnosis process is tedious and is often influenced by human factors. In addition to
conventional methods of cancer diagnosis, new approaches are needed that are simple,
objective, and noninvasive. In recent years, FTIR has been developed as a diagnostic tool
for various human cancers and other diseases. Wang and Shi (23) used FTIR spectroscopy to
analyze normal and cancerous tissues of the esophagus. The application of FTIR to cancer
detection was reported by Sahu and Mordechai (24). Mark et al. (25) reported FTIR
microspectroscopy as a quantitative diagnostic tool for assignment of premalignancy grading
in cervical neoplasia. However, distinguishing normal tissue from early cancer is difficult
because their FTIR spectra are similar. To improve the accuracy of diagnosing the earlier
stages of cancer with FTIR, a novel method that extracts FTIR features using CWT analysis
and classifies them using SVM was reported by Cheng et al. (26). Our research showed that
the wavelet transform can easily detect singularities in a signal and can greatly accentuate
faint differences between signals.
The authors of this review have performed many studies and have succeeded in efficiently
distinguishing normal gastric tissues, early cancer, and advanced cancer tissues using
continuous wavelets and SVM (6). Distinguishing normal gastric tissues from early cancer
is difficult because their FTIR spectra are similar.
As shown in Figure 6, the locations, intensities, and shapes of the peaks change greatly. The
hydroxyl absorption peak from protein, nucleic acid, and lipid appears at about 3400 cm−1,
with no obvious difference in intensity. The carbonyl absorption peak from protein appears
at 1649 cm−1, and its intensity decreases in cancerous tissue. The amide II absorption peak
appears at 1538 cm−1, and its intensity also decreases in cancerous tissue. Other bands
include the symmetric and asymmetric stretching vibrations of the phosphodiester groups of
nucleic acids at 1084 and 1242 cm−1, respectively, and the absorption peak of collagen at
1342 cm−1. As the cancer advances, the absorption intensities in the FTIR spectrum
gradually decrease.
To improve the accuracy of diagnosing early-stage gastric cancer with FTIR, we developed a
novel method that extracts FTIR features using CWT and classifies them using SVM. From the
FTIR spectra of normal gastric tissues, early carcinoma tissues, and advanced gastric
carcinoma tissues, nine feature parameters were extracted with CWT. With SVM, all spectra
were classified into two categories: normal or abnormal, the latter including early
carcinoma and advanced gastric carcinoma. Of all the kernels tested, the polynomial (poly)
and radial basis function (RBF) kernels gave high accuracy rates. The accuracy rate of the
poly kernel for normal gastric tissues, early carcinoma tissues, and advanced carcinoma
tissues was 100, 96, and 100%, respectively. The accuracy rate of the RBF kernel for normal, early

Figure 6. FTIR spectra of (a) gastric normal tissues, (b) early cancer, and (c) advanced cancer
tissues.

carcinoma, and advanced carcinoma was 100, 96, and 100%, respectively. These results show
the feasibility of establishing FTIR-CWT-SVM models to identify normal tissues and early
carcinoma tissues.
We also studied the early detection of colon cancer and reported a new method that combines
wavelet-based feature extraction from FTIR spectra with classification using SVM (27). The
FTIR data collected from 36 normal Sprague–Dawley (SD) rats, 60 1,2-DMH-induced SD rats,
and 44 second-generation induced rats were first preprocessed. Then, 12 feature variables
were extracted using CWT and input into the SVM for classification as normal, dysplasia,
early carcinoma, or advanced carcinoma. Among the kernel functions used in the SVM, the
poly and RBF kernels had the highest accuracy rates. The accuracy of the poly kernel for
normal, dysplasia, early carcinoma, and advanced carcinoma was 100, 97.5, 95, and 100%,
respectively. The accuracy of the RBF kernel for normal, dysplasia, early carcinoma, and advanced

carcinoma was 100, 95, 95, and 100%, respectively. The results indicate that this method
could effectively and easily diagnose colon cancer in its early stages.
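Per-class accuracy rates of the kind quoted above are the fraction of spectra in each true class that the classifier labels correctly. A minimal Python sketch of that calculation follows; the labels are hypothetical and are not data from the study.

```python
from collections import Counter

def per_class_accuracy(true_labels, predicted_labels):
    """Fraction of samples in each true class that were predicted correctly."""
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}

# Hypothetical labels, for illustration only (not data from the study).
truth = ["normal"] * 4 + ["dysplasia"] * 4 + ["early"] * 4 + ["advanced"] * 4
predicted = list(truth)
predicted[5] = "early"  # one dysplasia spectrum misclassified as early carcinoma

accuracy = per_class_accuracy(truth, predicted)
for label, value in accuracy.items():
    print(f"{label}: {100 * value:.1f}%")
```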
We also studied wavelet feature extraction and SVM classification of FTIR lung cancer data
(26). To improve the accuracy of early-stage lung cancer diagnosis with FTIR, we developed
a novel method that extracts FTIR features using wavelet analysis and classifies them using
SVM. From the FTIR spectra of normal lung, early carcinoma, and advanced lung cancer
tissues, nine feature variables were extracted with CWT. With SVM, we classified all spectra
into two categories, normal and abnormal, the latter including early lung cancer and
advanced lung cancer. Of all the kernels tested, the poly and RBF kernels gave high accuracy
rates: for normal, early lung cancer, and advanced cancer, the accuracy rates were 100, 95,
and 100%, respectively, with the poly kernel and likewise 100, 95, and 100% with the RBF
kernel. The results show the feasibility of establishing FTIR-CWT-SVM models to identify
normal lung tissue, early lung cancer, and advanced lung cancer.
Bai and Liu (28) reported the classification of FTIR cancer data using wavelets and fuzzy
C-means clustering. They described a wavelet-based feature extraction method for FTIR
cancer data analysis in which a set of low-frequency wavelet bases represents the FTIR
data, reducing the data dimension and removing noise. The fuzzy C-means algorithm is then
used to classify the data. They compared the classification performance of the wavelet
features with that of the original FTIR data, which were provided by the Derby City General
Hospital in the UK. The results show that only 30 wavelet features are needed to represent
901 cm−1 of the FTIR data and produce good clustering results.
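The fuzzy C-means step mentioned above can be sketched as follows. This is a generic textbook form of the algorithm, not the implementation of the cited work; the two-dimensional points stand in for wavelet feature vectors, and the fuzzifier `m = 2`, the fixed iteration count, and the choice of initial centres are assumptions made for illustration.

```python
import math

def distance(p, q):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def fuzzy_c_means(points, centers, m=2.0, n_iter=50):
    """Alternate the standard FCM membership and centre updates.

    Returns (centers, memberships); memberships[k][i] is the degree to which
    point k belongs to cluster i, as computed in the last iteration.
    """
    power = 2.0 / (m - 1.0)
    n_clusters = len(centers)
    memberships = []
    for _ in range(n_iter):
        memberships = []
        for p in points:
            d = [max(distance(p, c), 1e-12) for c in centers]  # avoid division by zero
            memberships.append(
                [1.0 / sum((d[i] / d[j]) ** power for j in range(n_clusters))
                 for i in range(n_clusters)]
            )
        # Each centre is the mean of all points weighted by membership^m.
        centers = [
            tuple(
                sum(u[i] ** m * p[dim] for u, p in zip(memberships, points))
                / sum(u[i] ** m for u in memberships)
                for dim in range(len(points[0]))
            )
            for i in range(n_clusters)
        ]
    return centers, memberships

# Hypothetical feature vectors forming two well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, memberships = fuzzy_c_means(points, [points[0], points[3]])
for p, u in zip(points, memberships):
    print(p, [round(v, 3) for v in u])
```

Unlike a hard clustering, each sample keeps a graded membership in every cluster, and a crisp label can be assigned afterwards by taking the cluster with the largest membership.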
We studied the classification of FTIR cancer data using wavelets and BPNN (29). We presented
a wavelet-based feature extraction method for HATR-FTIR cancer data analysis, with
classification by an artificial neural network trained with a back-propagation algorithm.
CWT feature extraction from 168 spectra of fresh normal and abnormal lung tissue samples
gave 12 features. Classification based on BPNN then assigned the spectra to two categories:
normal or abnormal. The BPNN is one of the most common neural network structures because of
its simplicity and efficiency. However, the back-propagation algorithm based on gradient
descent is not optimal: the weights are modified according to the gradient of the error
curve, which points toward the local minimum nearest the current position. The Rprop
algorithm for BPNN provides a solution to this problem. The Rprop algorithm uses four
variables: the initial weight adjustment step s0, the maximum weight adjustment step smax,
the increasing weight adjustment step w++, and the decreasing weight adjustment step w−. The
preferred values for the increasing and decreasing weight adjustment steps are w++ = 1.2 and
w− = 0.5. The initial weight adjustment step s0 can be set arbitrarily; usually, s0 = 0.1
and smax = 50.0. The number of hidden layer neurons can be estimated by the following equation:

$$H_n = \begin{cases} I_n + 0.168 \times (I_n - O_n), & I_n > O_n \\ O_n - 0.168 \times (O_n - I_n), & I_n < O_n \end{cases} \tag{26}$$

where $H_n$ is the number of neurons in the hidden layer, $I_n$ is the number of neurons in
the input layer, and $O_n$ is the number of neurons in the output layer. The initial
computation gives 10 neurons in the hidden layer. Starting from this value, the neural
network is trained and tested while the number of hidden-layer neurons is increased or
decreased, and the optimal number is the one that gives the highest prediction accuracy.
The accuracy of identifying normal, early carcinoma, and advanced carcinoma was 100, 90,
and 100%, respectively. The results indicate that FTIR with CWT and BPNN can effectively
and easily diagnose lung cancer in its early stages (29).
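The Rprop step-size rule described above, with its four parameters, can be sketched in Python as follows. The sketch assumes the simple variant without weight backtracking and adds a lower bound `s_min` on the step, which is not among the four variables listed in the text; the quadratic error function is a hypothetical stand-in for a network's error surface.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def rprop_minimize(grad_fn, weights, s0=0.1, s_max=50.0, w_plus=1.2, w_minus=0.5,
                   s_min=1e-6, n_iter=200):
    """Minimize a function using only the sign of each partial derivative.

    s0, s_max, w_plus, and w_minus correspond to the four Rprop variables in
    the text; s_min is an assumed lower bound on the step size.
    """
    weights = list(weights)
    steps = [s0] * len(weights)
    previous = [0.0] * len(weights)
    for _ in range(n_iter):
        gradient = grad_fn(weights)
        for i, g in enumerate(gradient):
            if g * previous[i] > 0:      # same sign as last time: enlarge the step
                steps[i] = min(steps[i] * w_plus, s_max)
            elif g * previous[i] < 0:    # sign change: a minimum was passed, shrink
                steps[i] = max(steps[i] * w_minus, s_min)
            weights[i] -= sign(g) * steps[i]
            previous[i] = g
    return weights

def gradient_of_error(w):
    """Gradient of the hypothetical error E(w) = (w0 - 3)^2 + 10 * (w1 + 1)^2."""
    return [2.0 * (w[0] - 3.0), 20.0 * (w[1] + 1.0)]

w = rprop_minimize(gradient_of_error, [0.0, 0.0])
print(w)
```

Because only the sign of the gradient is used, the very different curvatures along the two weights do not affect how quickly each one converges, which is the property that makes the scheme attractive for training BPNNs.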
Acknowledgment
This research was supported by the National Natural Science Foundation of China under
project number 30800705.

References
1. Mckelvy, M., Britt, T.R., Davis, B.L., and Gillie, J.K. (1998) Infrared spectroscopy. Anal. Chem.,
70(12): 119–177.
2. Movasaghi, Z., Rehman, S., and Rehman, I.ur. (2008) Fourier transform infrared spectroscopy
of biological tissues. Appl. Spectros. Rev., 43: 134–179.
3. Tian, G.Y., Yuan, H.F., Liu, H.Y., and Lu, W.Z. (2003) The application of wavelet transform in
near infrared spectroscopy. Spectroscopy and Spectral Analysis, 23: 1111–1114.


4. Niu, Y.R. (2009) The overview on wavelet analysis and application. Sci. Tech. Inform., 3: 158–159.
5. Shao, X.G., Pan, C.Y., and Sun, L. (2000) Wavelet transform and signal processing in analytical
chemistry. Progr. Chem., 12: 233–244.
6. Cheng, C.G., Cheng, L.Y., and Xu, R.S. (2007) Classification of FTIR gastric cancer data using
wavelets and SVM. In Proceedings of the Third International Conference on Natural Computa-
tion, Vol. 1, 543–547.
7. Hong, Q.H., Cheng, C.G., Cheng, Z.F., and Li, D.T. (2007) Application of HATR-FTIR
spectroscopy to the analysis of quality mensuration of rhizoma atractylodes. Spectroscopy and
Spectral Analysis, 27: 283–286.
8. Cheng, C.G., Wu, X.H., and Wang, S.Q. (2004) Application HATR-FTIR spectroscopy combined
with PCA to the identification of Viola yedoensis Makino and its confusable varieties. Chin. J.
Anal. Chem., 32: 1529–1531.
9. Dong, B., Sun, S.Q., Zhou, H.T., and Hu, S.L. (2002) Rapid and undamaged determination of
chishao by Fourier-transform infrared spectroscopy and clustering analysis. Spectroscopy and
Spectral Analysis, 22: 234–236.
10. Zhang, C.J. and Cheng, C.G. (2008) Identification between Stephania tetrandras Moore and
Stephania cepharantha Hayata. Spectros. Int. J., 22: 371–386.
11. Sing, J.K., Basu, D.K., and Nasipuri, M. (2007) Face recognition using point symmetry distance-
based RBF network. Applied Soft Computing, 7: 58–70.
12. Cheng, C.G., Tian, Y.M., and Zhang, C.J. (2008) Research of recognition method between semen
cuscutae and its sibling plant Japanese dodder seed based on FTIR-CWT and ANN classification
method. Acta Chimica Sinica, 66: 793–798.
13. Ha, S.B.G., Ma, J.W., Zhou, Z.J., and Li, Q.Q. (2003) Artificial neural network method used for
weather and AVHRR thermal data classification. Journal of the Graduate School of the Chinese
Academy of Sciences, 20: 328–333.
14. Jin, W.Y. and Cheng, L.Y. (2008) Identification of Fructus lycii and its unofficial samples of
same genera using continuous wavelet transform with HATR-FTIR. Journal of Zhejiang Normal
University (Natural Sciences), 31: 312–315.
15. Zhang, C.J., Li, D.T., Liang, J.Z., and Cheng, C.G. (2007) Identification of semen celosiae
and semen celosiae cristatae using continuous wavelet transform with FTIR. Spectroscopy and
Spectral Analysis, 27: 50–53.
16. Cheng, C.G., Xiong, W., and Jin, W.Y. (2009) Recognition of Equisetum arvense L. and Hip-
pochaete ramosissima Boerner based on Fourier transform infrared spectrometry combined with
chemometrics. Chin. J. Anal. Chem., 37: 676–680.
17. Wang, L.P. and Fu, X.J. (2005) Data Mining with Computational Intelligence. Springer: Berlin.
18. Lee, J. and Lee, D. (2005) An improved cluster labeling method for support vector clustering.
IEEE Trans. Pattern Anal. Mach. Intell., 27: 461–464.

19. Lü, H.F., Cheng, C.G., Tang, X., and Hu, Z.H. (2004) FTIR spectrum of Hypericum and Triadenum
with reference to their identification. Acta Botanica Sinica, 46: 401–406.
20. Huang, H., Sun, S.Q., Xu, J.W., and Wang, Z. (2003) Novel application of FTIR in medical herb
chemotaxonomy. Spectroscopy and Spectral Analysis, 23: 253–257.
21. Cheng, C.G., Xiong, W., Tian, Y.M., and Zhang, C.J. (2009) A novel recognition of three kinds of
sibling plants using FTIR with continuous wavelet feature extraction combined with an artificial
neural network. Spectroscopy, 24 (2): 58–67.
22. Cheng, C.G., Sun, G.L., and Zhang, C.J. (2007) Early detection of gastric cancer using wavelet
feature extraction and neural network classification of FT-IR. Spectroscopy, 22 (11): 38–42.
23. Wang, J.S. and Shi, J.S. (2003) FT-IR spectroscopic analysis of normal and cancerous tissues of
esophagus. World J. Gastroenterol., 9: 1897–1899.
24. Sahu, R.K. and Mordechai, S. (2005) Fourier transform infrared spectroscopy in cancer detection.
Future Oncol., 1: 635–647.
25. Mark, S., Sahu, R.K., and Kantarovich, K. (2004) Fourier transform infrared microspectroscopy
as a quantitative diagnostic tool for assignment of premalignancy grading in cervical neoplasia.
J. Biomed. Optic., 9: 558–567.


26. Cheng, C.G., Tian, Y.M., and Jin, W.Y. (2007) Study on the methods of wavelet feature extraction
and SVM classification of FTIR lung cancer data. Acta Chimica Sinica, 65: 2539–2543.
27. Cheng, C.G., Tian, Y.M., and Jin, W.Y. (2008) A study on the early detection of colon cancer
using the methods of wavelet feature extraction and SVM classification of FTIR. Spectros. Int.
J., 22: 397–404.
28. Bai, L. and Liu, Y.H. (2005) Classification of FTIR cancer data using wavelets and fuzzy C-means
clustering. Proceedings of SPIE, 6001: 83–91.
29. Cheng, C.G., Tian, Y.M., and Zhang, C.J. (2007) Classification of FTIR cancer data using
wavelets and BPNN. Proceedings of SPIE, 6826: 682633–1–682633–8.
