
Extracting information from spectral data
Nicole Labbé, University of Tennessee
SWST, Advanced analytical tools for the wood industry
June 10, 2007

Data collection

Near infrared spectra
2150 data points, 350-2500 nm, 1 nm resolution, 8 scans
[Figure: NIR spectrum; x-axis Wavelength (nm), 400-2400]

Mid infrared spectra
3400 data points, 4000-600 cm⁻¹, 1 cm⁻¹ resolution, 4 scans
[Figure: MIR spectrum; x-axis Wavenumber (cm⁻¹), 4000-800]

Laser-induced breakdown spectra
30000 data points, 200-800 nm, 0.02 nm resolution, 10 scans
[Figure: LIBS spectrum; x-axis Wavelength (nm), 200-800]

Prior to the extraction of the information

Signal processing is used to transform spectral data prior to analysis.

Data pretreatment
- Local filters
- Smoothing
- Derivatives
- Baseline correction
- Multiplicative Scatter Correction (MSC)
- Orthogonal Scatter Correction (OSC)

[Figure: noisy signal of an absorption band, smoothed with a moving-average
filter (absorbance vs. wavelength)]
[Figure: raw spectral data and corrected spectral data after MSC
(absorbance vs. wavelength)]
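Two of the pretreatments listed above, moving-average smoothing and MSC, can be sketched in a few lines of NumPy. This is a minimal illustration, not production code: the function names are made up for the example, and using the mean spectrum as the MSC reference is an assumption (any reference spectrum can be supplied).

```python
import numpy as np

def moving_average(spectrum, window=5):
    """Smooth a 1-D spectrum with a simple moving-average filter."""
    kernel = np.ones(window) / window
    return np.convolve(spectrum, kernel, mode="same")

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction.

    Each spectrum is regressed against a reference spectrum (here the
    mean spectrum, by assumption) and corrected as
    (x - intercept) / slope, removing additive and multiplicative
    scatter effects."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        slope, intercept = np.polyfit(reference, x, 1)
        corrected[i] = (x - intercept) / slope
    return corrected
```

After MSC, spectra that differ from the reference only by baseline offset and scatter-induced scaling collapse onto the reference shape.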

What type of information?


1. Qualitative information
Grouping and classification of spectral objects from samples,
using supervised and unsupervised learning methods.
2. Quantitative information
Relationships between spectral data and parameter(s) of interest

How to extract the information?


1. Multivariate analysis (MVA)
Principal Component Analysis (PCA), Projection to Latent
Structures (PLS), PLS-Discriminant Analysis (PLS-DA), …
2. Two dimensional correlation spectroscopy
Homo-correlation, Hetero-correlation

Multivariate data analysis


Separating the signal from the noise in data and presenting the results as
easily interpretable plots.

Why are multivariate methods needed?
- Large data sets
- Problem of selectivity: a relationship between two variables is very simple,
  but does not always work; in many problems, several predictor variables used
  in combination give better results.

Two approaches:
- Univariate analysis
- Multivariate analysis

[Figure: near infrared spectra collected on 70 pine samples; 375 x-variables,
absorbance vs. wavelength (nm), 800-2400]

Univariate analysis
Measured cellulose content versus predicted cellulose content using one
variable (1530 nm) as a predictor (R² = 0.12).
[Figure: predicted vs. measured cellulose content (%), 25-50%]

Multivariate analysis
Measured cellulose content versus predicted cellulose content using a
multivariate method based on all variables from the spectral data (R² = 0.87).
[Figure: predicted vs. measured cellulose content (%), 25-50%]

Qualitative information
Principal Component Analysis (PCA)
Recognize patterns in data: outliers, trends, groups.
[Figure: PCA score plot, PC1 (89%) vs. PC2 (8%)]

Biomass species and near infrared spectra
Data matrix X: n samples by x variables (spectral data).

Transformation of the numbers into pictures by using
principal component analysis (scores)
[Figure: PC1 vs. PC2 score plot separating bagasse, corn stover, switchgrass,
red oak, yellow poplar, and hickory; inset: PC1 and PC2 loading spectra,
intensity (A.U.) vs. wavelength (nm), 800-2400]

Transformation of the numbers into pictures by using
principal component analysis (loadings)
[Figure: PC1 and PC2 loading spectra, intensity (A.U.) vs. wavelength (nm),
800-2400; inset: PC1 vs. PC2 score plot of the six biomass species]

Principal Component Analysis (PCA)

PCA is a projection method: it decomposes the spectral data into a
structure part and a noise part.
X is an n samples (observations) by x variables (spectral variables) matrix.

1 variable = 1 dimension (axis x1)
2 variables = 2 dimensions (plane x1, x2)
3 variables = 3-dimensional space (x1, x2, x3)

Principal Component Analysis (PCA)

Beyond 3 dimensions, it is very difficult to visualize what's going on.
[Figure: 18 near infrared spectra of wood samples; 375 x-variables,
absorbance vs. wavelength (nm), 800-2800]
Each sample is represented as a point in 375-dimensional space.

Principal Component Analysis (PCA)

X has only 3 variables (wavelengths x1, x2 and x3)
The samples (n = 18) are represented in a 3D space

The first principal component


New co-ordinate axis representing the direction of maximum variation
through the data.

Higher-order Principal Components (PC2, PC3, …)


After PC1, next best direction for approximating the original data
The second PC lies along a direction orthogonal to the first PC

PC3 will be orthogonal to both PC1 and PC2 while simultaneously lying along the
direction of the third largest variation.
The new variables (PCs) are uncorrelated with each other (orthogonal)

Scores (T) = Coordinates of samples in the PC space


Representation of the samples in the PC space
There is a set of scores for each PC (score vector)

Original
variable space

PC space

Loadings (P) = Relations between X and PCs

Relationship between the original variable space and the new PCs space
There is a set of loadings for each PC (loading vector)
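The score/loading decomposition described above can be computed directly from the singular value decomposition of the mean-centered data. Below is a minimal NumPy sketch (the function name `pca` is illustrative, not a particular software's API):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via the singular value decomposition.

    Returns scores T (coordinates of the samples in the PC space),
    loadings P (relation between the original variables and the PCs),
    and the fraction of variance explained by each PC.  When all
    components are kept, X - mean(X) == T @ P.T exactly, i.e. the
    noise part E is zero."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores
    P = Vt[:n_components].T                      # loadings
    explained = s ** 2 / np.sum(s ** 2)          # variance fractions
    return T, P, explained[:n_components]
```

The loading vectors are orthonormal (the PCs are uncorrelated), and the explained-variance fractions are what a score plot reports as, e.g., "PC1 (89%)".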

Transformation of the numbers into pictures by using PCA
[Figure: PC1 vs. PC2 score plot of bagasse, corn stover, switchgrass, red oak,
yellow poplar, and hickory, with PC1 and PC2 loading spectra,
intensity (A.U.) vs. wavelength (nm), 800-2400]

Quantitative information
Projection to Latent Structures or Partial Least Squares Regression (PLS)
Establish relationships between input and output variables, creating predictive models.

Establishing a calibration model from known X and Y data.
Using the calibration model to predict new Y-values from a new set of
X-measurements.

PLS can be seen as two interconnected PCA analyses, PCA(X) and PCA(Y)
PLS uses the Y-data structure (variation) to guide the decomposition of X
The X-data structure also influences the decomposition of Y
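One common way to compute the PLS decomposition described above is the NIPALS algorithm. The sketch below implements PLS1 (a single y variable) in NumPy; it is an illustrative implementation under that assumption, not the exact software used for the examples in these slides.

```python
import numpy as np

def pls1_fit(X, y, n_components=2):
    """Fit a PLS1 model (single y variable) with the NIPALS algorithm.

    Returns regression coefficients b and the column means, so that
    y_hat = (X_new - x_mean) @ b + y_mean."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)        # X weights, guided by y
        t = Xc @ w                       # X scores
        tt = t @ t
        p = Xc.T @ t / tt                # X loadings
        qk = (yc @ t) / tt               # y loading
        Xc = Xc - np.outer(t, p)         # deflate X
        yc = yc - qk * t                 # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P = np.array(W).T, np.array(P).T
    b = W @ np.linalg.solve(P.T @ W, np.array(q))
    return b, x_mean, y_mean

def pls1_predict(X, b, x_mean, y_mean):
    """Predict y for new X measurements from a fitted PLS1 model."""
    return (np.asarray(X, dtype=float) - x_mean) @ b + y_mean
```

Note how the weight vector w is computed from the covariance of X with y: this is exactly the sense in which the Y-data structure guides the decomposition of X.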

Biomass composition and near infrared spectra
Data: n samples by x variables (spectral data) and y variables (composition).
If the y variables are not correlated: PLS1 (a separate model for each y variable).
If the y variables are correlated: PLS2 (a single model for all y variables).

Biomass composition and near infrared spectra
2/3 of the samples for the calibration model,
1/3 of the samples for the validation (random selection).
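The 2/3 calibration, 1/3 validation split and the error figures reported on the next slides can be sketched as follows. This assumes RMSEC and RMSEP are the root-mean-square errors over the calibration and validation sets respectively (the usual definitions); the function names are made up for the example.

```python
import numpy as np

def calibration_validation_split(n_samples, cal_fraction=2/3, seed=0):
    """Randomly assign sample indices: 2/3 to the calibration set,
    1/3 to the validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_cal = int(round(cal_fraction * n_samples))
    return idx[:n_cal], idx[n_cal:]

def rmse(measured, predicted):
    """Root-mean-square error: RMSEC when computed on the calibration
    set, RMSEP when computed on the validation set."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((measured - predicted) ** 2)))
```

For the 70 pine samples of the earlier example, this would give 47 calibration and 23 validation samples.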

Calibration model to predict cellulose content in pine
r = 0.95, RMSEC = 1.6
[Figure: predicted vs. measured cellulose content (%), 25-50%; inset:
regression coefficients, intensity vs. wavelength (nm), 1000-2400]

Validation of the model to predict cellulose content in pine
r = 0.95, RMSEP = 1.55
[Figure: predicted vs. measured cellulose content (%), 25-50%]

PLS-Discriminant Analysis (PLS-DA)

A powerful method for classification. The aim is to create a predictive
model which can accurately classify future unknown samples.
Data: spectral data (x variables) and a y variable encoding class membership.
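A two-class PLS-DA model can be sketched by encoding class membership as a 0/1 dummy y variable, fitting a PLS1 model (NIPALS), and thresholding the predicted y at 0.5. The 0/1 coding matches the Y-reference values in the validation example that follows; the threshold and the implementation details are assumptions of this sketch.

```python
import numpy as np

def plsda_fit(X, y, n_components=2):
    """Two-class PLS-DA: regress a 0/1 dummy variable on X with a
    PLS1 (NIPALS) model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)    # class labels coded 0 or 1
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)
        t = Xc @ w
        tt = t @ t
        p = Xc.T @ t / tt
        qk = (yc @ t) / tt
        Xc = Xc - np.outer(t, p)
        yc = yc - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P = np.array(W).T, np.array(P).T
    b = W @ np.linalg.solve(P.T @ W, np.array(q))
    return b, xm, ym

def plsda_classify(X, model, threshold=0.5):
    """Predict the continuous y, then assign class 1 if it exceeds
    the threshold, class 0 otherwise."""
    b, xm, ym = model
    y_pred = (np.asarray(X, dtype=float) - xm) @ b + ym
    return (y_pred > threshold).astype(int), y_pred
```

Well-classified samples have predicted y close to 0 or 1, as in the validation table below, where predictions such as -0.0338 and 1.0220 round cleanly to their class labels.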

PLS-Discriminant Analysis (PLS-DA)

Development of the PLS-DA calibration model
r = 0.99, RMSEC = 0.04
[Figure: PCA score plot, PC1 (92%) vs. PC2 (6%), separating yellow poplar and
hickory; predicted vs. measured Y variable, 0.0-1.2]

PLS-Discriminant Analysis (PLS-DA)

Validation of the PLS-DA model
r = 0.99, RMSEP = 0.04

Sample            Y-reference    Predicted Y
Spectrum 00008    0.0000         -0.0338
Spectrum 00009    0.0000          0.0270
Spectrum 00015    1.0000          0.9340
Spectrum 00016    1.0000          1.0220

[Figure: predicted Y vs. Y reference, 0.0-1.2; yellow poplar near 0,
hickory near 1]

Two-dimensional correlation spectroscopy

2D correlation tools spread spectral peaks over a second dimension and
simplify the visualization of complex spectra.

A perturbation (mechanical, electrical, chemical, magnetic, optical,
thermal, …) is applied to the system; an electro-magnetic probe (e.g. IR,
UV, LIBS, …) monitors the response, producing 2D correlation maps.

[Figure: spectra collected under perturbation and the resulting dynamic
spectra, intensity (A.U.) vs. wavelength (nm), 1000-2400]

Spectral data collected under perturbation

Dynamic spectra [DYN]
[S](m×n): m samples by n variables (wavelengths) matrix

[DYN](m×n) = [S](m×n) − [1](m×1) [S̄](1×n)

where [S̄](1×n) is the reference spectrum.

Synchronous matrix
[SYNC](n×n) = [DYN]ᵀ(n×m) [DYN](m×n)

Asynchronous matrix
[ASYN](n×n) = [DYN]ᵀ(n×m) [N](m×m) [DYN](m×n)

where [N](m×m) is the Noda-Hilbert transformation matrix.

Generation of orthogonal components: synchronous and asynchronous 2D
correlation intensities.
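The synchronous and asynchronous maps defined above can be computed directly in NumPy. Two assumptions in this sketch go beyond the slide's formulas: the reference spectrum is taken as the mean spectrum, and the conventional 1/(m−1) scaling is applied to both maps.

```python
import numpy as np

def noda_hilbert(m):
    """Hilbert-Noda transformation matrix [N](m x m):
    N[j, k] = 0 when j == k, otherwise 1 / (pi * (k - j))."""
    j, k = np.indices((m, m))
    diff = k - j
    N = np.zeros((m, m))
    mask = diff != 0
    N[mask] = 1.0 / (np.pi * diff[mask])
    return N

def two_d_correlation(S):
    """Synchronous and asynchronous 2D correlation maps from an
    m samples by n variables spectral matrix S."""
    S = np.asarray(S, dtype=float)
    m = S.shape[0]
    dyn = S - S.mean(axis=0)                         # dynamic spectra
    sync = dyn.T @ dyn / (m - 1)                     # synchronous map
    asyn = dyn.T @ noda_hilbert(m) @ dyn / (m - 1)   # asynchronous map
    return sync, asyn
```

By construction the synchronous map is symmetric and the asynchronous map is antisymmetric, which is what makes the two sets of 2D correlation intensities orthogonal.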

Homo-correlation: NIR/NIR
Hetero-correlation: NIR/MBMS
Perturbation: cellulose content

Other techniques to extract information

Classification
Soft Independent Modeling of Class Analogy (SIMCA)
  The Unscrambler, User Manual. CAMO, 1998
Kernel Principal Component Analysis (k-PCA)
  Schölkopf B., Smola A.J., Müller K. (1998) Nonlinear component analysis as a
  kernel eigenvalue problem. Neural Computation 10: 1299-1319
Artificial Neural Networks (ANN) (non-linear data)
  Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User's Guide, Matlab

Other techniques to extract information

Regression
Orthogonal Projections to Latent Structures (O-PLS)
  Trygg J., Wold S. (2002) Orthogonal projections to latent structures (O-PLS).
  J. Chemometrics 16: 119-128
Artificial Neural Networks (ANN)
  Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User's Guide, Matlab
Kernel Projection to Latent Structures (k-PLS)
  Rosipal R., Trejo L.J. (2001) Kernel partial least squares regression in
  reproducing kernel Hilbert space. J. Machine Learning Res. 2: 97-123
Supervised Probabilistic Principal Component Analysis (SPPCA)
  Yu S., Yu K., Tresp V., Kriegel H., Wu M. (2006) Supervised probabilistic
  principal component analysis. Proceedings of the 12th International Conference
  on Knowledge Discovery and Data Mining (SIGKDD): 464-473
(non-linear data)

Software

www.camo.com
www.umetrics.com
www.infometrix.com
www.mathworks.com

References
A User-Friendly Guide to Multivariate Calibration and Classification; T. Næs,
T. Isaksson, T. Fearn, T. Davies, NIR Publications, Chichester, UK, 2002
Multivariate Calibration, H. Martens and T. Næs, John Wiley & Sons, Chichester,
UK, 1989
Chemometric Techniques for Quantitative Analysis, Marcel Dekker, New York, 1998
Two-Dimensional Correlation Spectroscopy, I. Noda and Y. Ozaki, John Wiley &
Sons, Chichester, UK, 2004
Neural Network Toolbox 5 User's Guide, Matlab; H. Demuth, M. Beale and
M. Hagan

Questions?
