
Extracting information from spectral data

Nicole Labbé, University of Tennessee

SWST, Advanced analytical tools for the wood industry, June 10, 2007
Data collection

Near infrared spectra
2150 data points, 350-2500 nm, 1 nm resolution, 8 scans
[Figure: NIR spectrum, wavelength (nm), 400-2400 nm]

Mid infrared spectra
3400 data points, 4000-600 cm⁻¹, 1 cm⁻¹ resolution, 4 scans
[Figure: MIR spectrum, wavenumber (cm⁻¹), 4000-800 cm⁻¹]

Laser induced breakdown spectra
30000 data points, 200-800 nm, 0.02 nm resolution, 10 scans
[Figure: LIBS spectrum, wavelength (nm), 200-800 nm]
Data pretreatment
Prior to the extraction of the information, signal processing is used to
transform the spectral data:

- Local filters
- Smoothing
- Derivatives
- Baseline correction
- Multiplicative Scatter Correction (MSC)
- Orthogonal Scatter Correction (OSC)

[Figures: a noisy absorption band smoothed with a moving-average filter; raw
spectral data versus corrected spectral data after MSC]
What type of information?
1. Qualitative information
Grouping and classification of samples from their spectra, using supervised
and non-supervised learning methods
2. Quantitative information
Relationships between spectral data and parameter(s) of interest

How to extract the information?


1. Multivariate analysis (MVA)
Principal Component Analysis (PCA), Projection to Latent
Structures (PLS), PLS-Discriminant Analysis (PLS-DA), …

2. Two dimensional correlation spectroscopy


Homo-correlation, Hetero-correlation
Multivariate data analysis
Separating the signal from the noise in data and presenting the results as
easily interpretable plots.
Why are multivariate methods needed?
-Large data sets
-Problem of selectivity
Relationship between two variables: very simple, but does not always work.
In many problems, several predictor variables used in combination give
better results.

Two approaches:
- Univariate analysis
- Multivariate analysis

[Figure: near infrared spectra (375 x-variables), absorbance versus
wavelength, 800-2400 nm]

Near infrared spectra collected on 70 pine samples

Univariate analysis
Measured cellulose content versus predicted cellulose content using one
variable (1530 nm) as a predictor (R² = 0.12)
[Figure: predicted versus measured cellulose content (%), 25-50%]

Multivariate analysis
Measured cellulose content versus predicted cellulose content using a
multivariate method based on all variables from the spectral data (R² = 0.87)
[Figure: predicted versus measured cellulose content (%), 25-50%]
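The gap between the two R² values above can be reproduced on synthetic data (this toy example stands in for the pine spectra, which are not available here): a response driven by several variables at once is predicted poorly by any single variable but well by least squares on all of them combined.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 70, 10
X = rng.normal(size=(n_samples, n_vars))
# Response depends on a weighted combination of ALL variables, plus noise.
y = X @ np.linspace(1.0, 2.0, n_vars) + rng.normal(scale=0.5, size=n_samples)

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Univariate: regress y on one predictor only.
slope, offset = np.polyfit(X[:, 0], y, deg=1)
r2_uni = r_squared(y, slope * X[:, 0] + offset)

# Multivariate: ordinary least squares on all predictors plus an intercept.
A = np.column_stack([X, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_multi = r_squared(y, A @ coef)
```

Here r2_uni comes out far below r2_multi, mirroring the 0.12 versus 0.87 pattern on the pine data.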
Qualitative information
Principal Components Analysis (PCA)
Recognize patterns in data: outliers, trends, groups…

[Figure: PCA scores plot, PC1 (89%) versus PC2 (8%)]
Biomass species and near infrared spectra
Spectral data: n samples by x variables
Transformation of the numbers into pictures by using
principal component analysis (scores)

[Figure: scores plot, PC1 versus PC2, with groups for bagasse, red oak, corn
stover, yellow poplar, hickory, and switchgrass; inset: PC1 and PC2 loadings,
intensity (A.U.) versus wavelength, 800-2400 nm]
Transformation of the numbers into pictures by using
principal component analysis (loadings)

[Figure: the same scores plot shown alongside the PC1 and PC2 loading
vectors, intensity (A.U.) versus wavelength, 800-2400 nm]
Principal Components Analysis (PCA)
PCA is a projection method: it decomposes the spectral data into a
"structure" part and a "noise" part.
X is an n samples (observations) by x variables (spectral variables) matrix.

[Figure: one variable x1 on a single axis]
1 variable = 1 dimension
Principal Components Analysis (PCA)
X is an n samples (observations) by x variables (spectral variables) matrix.

[Figures: samples plotted against axes x1 and x2, and against x1, x2, x3]
2 variables = 2 dimensions; 3 variables = a 3-dimensional space
Principal Components Analysis (PCA)
Beyond 3 dimensions, it is very difficult to visualize what's going on.

[Figure: 18 near infrared spectra of wood samples, absorbance versus
wavelength, 800-2800 nm, 375 x-variables]

Each variable is a co-ordinate axis, so each sample is a point in a
375-dimensional space.

Principal Components Analysis (PCA)
X has only 3 variables (wavelengths x1, x2 and x3)
The samples (n = 18) are represented in a 3D space

The first principal component


New co-ordinate axis representing the direction of maximum variation
through the data.
Higher-order Principal Components (PC2, PC3,…)

After PC1, next best direction for approximating the original data
The second PC lies along a direction orthogonal to the first PC

PC3 will be orthogonal to both PC1 and PC2 while simultaneously lying along the
direction of the third largest variation.
The new variables (PCs) are uncorrelated with each other (orthogonal)
Scores (T) = Coordinates of samples in the PC space

Representation of the samples in the PC space


There is a set of scores for each PC (score vector)

[Figure: projection from the original variable space onto the PC space]

Loadings (P) = Relations between X and PCs

Relationship between the original variable space and the new PCs space
There is a set of loadings for each PC (loading vector)
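A minimal sketch of scores and loadings computed via the singular value decomposition (equivalent, for this purpose, to the NIPALS algorithm used by most chemometrics packages), assuming X holds one spectrum per row:

```python
import numpy as np

def pca(X, n_components=2):
    """Return scores T, loadings P, and explained-variance ratios.
    Mean-centered X decomposes as X ≈ T @ P.T plus a residual (noise) part."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores: samples in PC space
    P = Vt[:n_components].T                      # loadings: variables vs. PCs
    explained = s ** 2 / np.sum(s ** 2)          # variance captured per PC
    return T, P, explained[:n_components]
```

Plotting the two columns of T against each other gives a scores plot like the ones shown here; plotting a column of P against wavelength gives the matching loading vector.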
Transformation of the numbers into pictures by using PCA
[Figure: PCA scores plot, PC1 versus PC2, with groups for bagasse, red oak,
corn stover, yellow poplar, hickory, and switchgrass, alongside the PC1 and
PC2 loadings, intensity (A.U.) versus wavelength, 800-2400 nm]
Quantitative information
Projection to Latent Structures or Partial Least Squares Regression (PLS)
Establish relationships between input and output variables, creating predictive models.

X + Y ⇒ Model        (establishing the calibration model from known X and Y data)
X_new + Model ⇒ Ŷ    (using the calibration model to predict new Y-values from a new set of X-measurements)

PLS can be seen as two interconnected PCA analyses, PCA(X) and PCA(Y)
PLS uses the Y-data structure (variation) to guide the decomposition of X
The X-data structure also influences the decomposition of Y
Biomass composition and near infrared spectra
[Diagram: Y matrix (n samples × y variables) alongside the spectral data X
(n samples × x variables)]

If the y variables are not correlated ⇒ PLS1
Biomass composition and near infrared spectra
[Diagram: Y matrix (n samples × y variables) alongside the spectral data X
(n samples × x variables)]

If the y variables are correlated ⇒ PLS2
Biomass composition and near infrared spectra
Random selection of the samples:
- 2/3 of the samples for the calibration model
- 1/3 of the samples for validation
Calibration model to predict cellulose content in pine: r = 0.95, RMSEC = 1.6
[Figures: predicted versus measured cellulose content (%), 25-50%; regression
coefficients (intensity) versus wavelength, 1000-2400 nm]

Validation of the model to predict cellulose content in pine: r = 0.95,
RMSEP = 1.55
[Figure: predicted versus measured cellulose content (%), 25-50%]
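The workflow above (fit on the 2/3 calibration set, then report RMSEC there and RMSEP on the 1/3 held-out set) can be sketched with a bare-bones PLS1 (NIPALS) implementation; this is an illustrative reimplementation, not the software actually used for the pine model:

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Bare-bones PLS1 (NIPALS) for a single response variable y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)        # weight vector guided by y
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = (yk @ t) / tt                # inner regression coefficient
        Xk = Xk - np.outer(t, p)         # deflate X ...
        yk = yk - c * t                  # ... and y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # regression vector in X space
    return B, x_mean, y_mean

def pls1_predict(X, B, x_mean, y_mean):
    return (X - x_mean) @ B + y_mean

def rmse(y_true, y_pred):
    """RMSEC when applied to the calibration set, RMSEP on the validation set."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

The regression vector B, plotted against wavelength, corresponds to the coefficient plot shown with the calibration model.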
PLS-Discriminant Analysis (PLS-DA)
A powerful method for classification. The aim is to create a predictive
model which can accurately classify future unknown samples.

[Diagram: dummy y variable (class membership) alongside the spectral data,
x variables]
PLS-Discriminant Analysis (PLS-DA)
Development of PLS-DA calibration model
[Figures: PCA scores plot, PC1 (92%) versus PC2 (6%), separating yellow
poplar and hickory; predicted versus measured Y variable for the PLS-DA
calibration model, r = 0.99, RMSEC = 0.04]
PLS-Discriminant Analysis (PLS-DA)
Validation of PLS-DA model
PLS-DA validation model results:

Sample           Y-reference   Predicted Y
Spectrum 00008   0.0000        -0.0338
Spectrum 00009   0.0000        0.0270
Spectrum 00015   1.0000        0.9340
Spectrum 00016   1.0000        1.0220

r = 0.99, RMSEP = 0.04
[Figure: predicted Y versus Y reference; hickory samples cluster near 1,
yellow poplar samples near 0]
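Mechanically, PLS-DA is PLS regression on a dummy 0/1 response (judging from the plot, 0 = yellow poplar and 1 = hickory here), with predictions thresholded at 0.5. A one-component sketch:

```python
import numpy as np

def plsda_fit(X, y01):
    """One-component PLS-DA: y01 holds 0/1 class labels."""
    x_mean, y_mean = X.mean(axis=0), y01.mean()
    Xc, yc = X - x_mean, y01 - y_mean
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)           # direction that separates the classes
    t = Xc @ w                          # scores along that direction
    c = (yc @ t) / (t @ t)              # inner regression coefficient
    return w, c, x_mean, y_mean

def plsda_predict(X, model):
    w, c, x_mean, y_mean = model
    y_hat = ((X - x_mean) @ w) * c + y_mean
    return y_hat, (y_hat > 0.5).astype(int)   # threshold the continuous Y
```

As in the table above, well-classified samples give continuous predictions near 0 or near 1 rather than exactly on them.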
Two-dimensional correlation spectroscopy
2D correlation tools spread spectral peaks over a second dimension to
simplify the visualization of complex spectra.

Perturbation (mechanical, electrical, chemical, magnetic, optical,
thermal, …) → System, probed with an electro-magnetic probe (e.g., IR, UV,
LIBS, …) → 2D correlation maps
[Figures: spectral data collected under perturbation, and the corresponding
dynamic spectra [DYN], intensity (A.U.) versus wavelength, 1000-2400 nm]

[S](m×n): m samples by n variables (wavelengths) matrix

Dynamic spectra (each spectrum minus the reference spectrum):
[DYN](m×n) = [S](m×n) − [1](m×1) × [S̄](1×n)

Generation of orthogonal components, the synchronous and asynchronous 2D
correlation intensities:

Synchronous matrix:
[SYNC](n×n) = [DYN]ᵀ(n×m) × [DYN](m×n)

Asynchronous matrix, where [N](m×m) is the Hilbert-Noda matrix:
[ASYN](n×n) = [DYN]ᵀ(n×m) × [N](m×m) × [DYN](m×n)
Homo-correlation: NIR/NIR        Hetero-correlation: NIR/MBMS
Perturbation: cellulose content
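A sketch of these matrices in code, taking the mean spectrum as the reference and adding the conventional 1/(m − 1) scaling (the slide's formulas omit it, which only changes the overall intensity scale):

```python
import numpy as np

def twod_correlation(S):
    """Synchronous and asynchronous 2D correlation maps (Noda's formalism).
    Rows of S are spectra collected along the perturbation sequence."""
    m, n = S.shape
    dyn = S - S.mean(axis=0)                 # dynamic spectra [DYN]
    sync = dyn.T @ dyn / (m - 1)             # synchronous map [SYNC]
    # Hilbert-Noda matrix: 0 on the diagonal, 1/(pi*(k-j)) off it.
    j, k = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    with np.errstate(divide="ignore"):
        N = np.where(j == k, 0.0, 1.0 / (np.pi * (k - j)))
    asyn = dyn.T @ N @ dyn / (m - 1)         # asynchronous map [ASYN]
    return sync, asyn
```

The synchronous map is symmetric (in-phase intensity changes), while the asynchronous map is antisymmetric (out-of-phase changes), which is why the two together resolve overlapped peaks.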


Other techniques to extract information
Classification
• Soft Independent Modeling of Class Analogy (SIMCA)
The Unscrambler, User manual. CAMO, 1998

• Kernel Principal Component Analysis (k-PCA) (for non-linear data)
Schölkopf B., Smola A.J., Müller K. (1998) Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation 10: 1299-1319

• Artificial Neural Networks (ANN)


Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User’s Guide – Matlab
Other techniques to extract information
Regression
• Orthogonal Projections to Latent Structures (O-PLS)
Trygg J., Wold S. (2002) Orthogonal projections to latent structures (O-PLS). J. Chemo. 16:119-128

• Artificial Neural Networks (ANN)


Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User’s Guide – Matlab

• Kernel Projection to Latent Structures (k-PLS) (for non-linear data)
Rosipal R., Trejo L.J. (2001) Kernel partial least squares regression in
reproducing kernel Hilbert space. J. Machine Learning Res. 2: 97-123

• Supervised Probabilistic Principal Component Analysis (SPPCA) (for
non-linear data)
Yu S., Yu K., Tresp V., Kriegel H., Wu M. (2006) Supervised probabilistic
principal component analysis. Proceedings of the 12th International
Conference on Knowledge Discovery and Data Mining (SIGKDD): 464-473
Software
• www.camo.com
• www.umetrics.com
• www.infometrix.com
• www.mathworks.com

References
• A user-friendly guide to multivariate calibration and classification, T. Næs,
T. Isaksson, T. Fearn, T. Davies, NIR Publications, Chichester, UK, 2002
• Multivariate calibration, H. Martens and T. Næs, John Wiley & Sons,
Chichester, UK, 1989
• Chemometric techniques for quantitative analysis, Marcel Dekker, New York, 1998
• Two-dimensional correlation spectroscopy, I. Noda and Y. Ozaki, John Wiley &
Sons, Chichester, UK, 2004
• Neural Network Toolbox 5 User's Guide - Matlab, H. Demuth, M. Beale and
M. Hagan
Questions?
