
Extracting information from spectral data
Nicole Labbé, University of Tennessee
SWST, Advanced analytical tools for the wood industry
June 10, 2007

Data collection

Near infrared spectra
2150 data points, 350-2500 nm, 1 nm resolution, 8 scans
[Figure: NIR spectrum; x-axis Wavelength (nm), 400-2400]

Mid infrared spectra
3400 data points, 4000-600 cm⁻¹, 1 cm⁻¹ resolution, 4 scans
[Figure: MIR spectrum; x-axis Wavenumber (cm⁻¹), 4000-800]

Laser-induced breakdown spectra
30000 data points, 200-800 nm, 0.02 nm resolution, 10 scans
[Figure: LIBS spectrum; x-axis Wavelength (nm), 200-800]

Prior to the extraction of the information

Signal processing is used to transform spectral data prior to analysis.

Data pretreatment
- Local filters
- Smoothing
- Derivatives
- Baseline correction
- Multiplicative Scatter Correction (MSC)
- Orthogonal Scatter Correction (OSC)

[Figure: noisy signal of an absorption band, smoothed with a moving-average
filter (absorbance vs. wavelength)]
[Figure: raw spectral data and corrected spectral data after MSC
(absorbance vs. wavelength)]
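Two of the pretreatments listed above, moving-average smoothing and MSC, can be sketched in a few lines of NumPy. This is a minimal illustration, not production code: the function names are made up for the example, and using the mean spectrum as the MSC reference is an assumption (any reference spectrum can be supplied).

```python
import numpy as np

def moving_average(spectrum, window=5):
    """Smooth a 1-D spectrum with a simple moving-average filter."""
    kernel = np.ones(window) / window
    return np.convolve(spectrum, kernel, mode="same")

def msc(spectra, reference=None):
    """Multiplicative Scatter Correction.

    Each spectrum is regressed against a reference spectrum (here the
    mean spectrum, by assumption) and corrected as
    (x - intercept) / slope, removing additive and multiplicative
    scatter effects."""
    spectra = np.asarray(spectra, dtype=float)
    if reference is None:
        reference = spectra.mean(axis=0)
    corrected = np.empty_like(spectra)
    for i, x in enumerate(spectra):
        slope, intercept = np.polyfit(reference, x, 1)
        corrected[i] = (x - intercept) / slope
    return corrected
```

After MSC, spectra that differ from the reference only by baseline offset and scatter-induced scaling collapse onto the reference shape.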

What type of information?


1. Qualitative information
Grouping and classification of spectral objects from samples,
using supervised and unsupervised learning methods.
2. Quantitative information
Relationships between spectral data and parameter(s) of interest

How to extract the information?


1. Multivariate analysis (MVA)
Principal Component Analysis (PCA), Projection to Latent
Structures (PLS), PLS-Discriminant Analysis (PLS-DA), …
2. Two dimensional correlation spectroscopy
Homo-correlation, Hetero-correlation

Multivariate data analysis


Separating the signal from the noise in data and presenting the results as
easily interpretable plots.

Why are multivariate methods needed?
- Large data sets
- Problem of selectivity: a relationship between two variables is very simple,
  but does not always work; in many problems, several predictor variables used
  in combination give better results.

Two approaches:
- Univariate analysis
- Multivariate analysis

[Figure: near infrared spectra collected on 70 pine samples; 375 x-variables,
absorbance vs. wavelength (nm), 800-2400]

Univariate analysis
Measured cellulose content versus predicted cellulose content using one
variable (1530 nm) as a predictor (R² = 0.12).
[Figure: predicted vs. measured cellulose content (%), 25-50%]

Multivariate analysis
Measured cellulose content versus predicted cellulose content using a
multivariate method based on all variables from the spectral data (R² = 0.87).
[Figure: predicted vs. measured cellulose content (%), 25-50%]

Qualitative information
Principal Component Analysis (PCA)
Recognize patterns in data: outliers, trends, groups.
[Figure: PCA score plot, PC1 (89%) vs. PC2 (8%)]

Biomass species and near infrared spectra
Data matrix X: n samples by x variables (spectral data).

Transformation of the numbers into pictures by using
principal component analysis (scores)
[Figure: PC1 vs. PC2 score plot separating bagasse, corn stover, switchgrass,
red oak, yellow poplar, and hickory; inset: PC1 and PC2 loading spectra,
intensity (A.U.) vs. wavelength (nm), 800-2400]

Transformation of the numbers into pictures by using
principal component analysis (loadings)
[Figure: PC1 and PC2 loading spectra, intensity (A.U.) vs. wavelength (nm),
800-2400; inset: PC1 vs. PC2 score plot of the six biomass species]

Principal Component Analysis (PCA)

PCA is a projection method: it decomposes the spectral data into a
structure part and a noise part.
X is an n samples (observations) by x variables (spectral variables) matrix.

1 variable = 1 dimension (axis x1)
2 variables = 2 dimensions (plane x1, x2)
3 variables = 3-dimensional space (x1, x2, x3)

Principal Component Analysis (PCA)

Beyond 3 dimensions, it is very difficult to visualize what's going on.
[Figure: 18 near infrared spectra of wood samples; 375 x-variables,
absorbance vs. wavelength (nm), 800-2800]
Each sample is represented as a point in 375-dimensional space.

Principal Component Analysis (PCA)

X has only 3 variables (wavelengths x1, x2 and x3)
The samples (n = 18) are represented in a 3D space

The first principal component


New co-ordinate axis representing the direction of maximum variation
through the data.

Higher-order Principal Components (PC2, PC3, …)


After PC1, next best direction for approximating the original data
The second PC lies along a direction orthogonal to the first PC

PC3 will be orthogonal to both PC1 and PC2 while simultaneously lying along the
direction of the third largest variation.
The new variables (PCs) are uncorrelated with each other (orthogonal)

Scores (T) = Coordinates of samples in the PC space


Representation of the samples in the PC space
There is a set of scores for each PC (score vector)

Original
variable space

PC space

Loadings (P) = Relations between X and PCs

Relationship between the original variable space and the new PCs space
There is a set of loadings for each PC (loading vector)
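The score/loading decomposition described above can be computed directly from the singular value decomposition of the mean-centered data. Below is a minimal NumPy sketch (the function name `pca` is illustrative, not a particular software's API):

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via the singular value decomposition.

    Returns scores T (coordinates of the samples in the PC space),
    loadings P (relation between the original variables and the PCs),
    and the fraction of variance explained by each PC.  When all
    components are kept, X - mean(X) == T @ P.T exactly, i.e. the
    noise part E is zero."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # mean-center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores
    P = Vt[:n_components].T                      # loadings
    explained = s ** 2 / np.sum(s ** 2)          # variance fractions
    return T, P, explained[:n_components]
```

The loading vectors are orthonormal (the PCs are uncorrelated), and the explained-variance fractions are what a score plot reports as, e.g., "PC1 (89%)".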

Transformation of the numbers into pictures by using PCA
[Figure: PC1 vs. PC2 score plot of bagasse, corn stover, switchgrass, red oak,
yellow poplar, and hickory, with PC1 and PC2 loading spectra,
intensity (A.U.) vs. wavelength (nm), 800-2400]

Quantitative information
Projection to Latent Structures or Partial Least Squares Regression (PLS)
Establish relationships between input and output variables, creating predictive models.

Establishing a calibration model from known X and Y data.
Using the calibration model to predict new Y-values from a new set of
X-measurements.

PLS can be seen as two interconnected PCA analyses, PCA(X) and PCA(Y)
PLS uses the Y-data structure (variation) to guide the decomposition of X
The X-data structure also influences the decomposition of Y
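One common way to compute the PLS decomposition described above is the NIPALS algorithm. The sketch below implements PLS1 (a single y variable) in NumPy; it is an illustrative implementation under that assumption, not the exact software used for the examples in these slides.

```python
import numpy as np

def pls1_fit(X, y, n_components=2):
    """Fit a PLS1 model (single y variable) with the NIPALS algorithm.

    Returns regression coefficients b and the column means, so that
    y_hat = (X_new - x_mean) @ b + y_mean."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)        # X weights, guided by y
        t = Xc @ w                       # X scores
        tt = t @ t
        p = Xc.T @ t / tt                # X loadings
        qk = (yc @ t) / tt               # y loading
        Xc = Xc - np.outer(t, p)         # deflate X
        yc = yc - qk * t                 # deflate y
        W.append(w); P.append(p); q.append(qk)
    W, P = np.array(W).T, np.array(P).T
    b = W @ np.linalg.solve(P.T @ W, np.array(q))
    return b, x_mean, y_mean

def pls1_predict(X, b, x_mean, y_mean):
    """Predict y for new X measurements from a fitted PLS1 model."""
    return (np.asarray(X, dtype=float) - x_mean) @ b + y_mean
```

Note how the weight vector w is computed from the covariance of X with y: this is exactly the sense in which the Y-data structure guides the decomposition of X.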

Biomass composition and near infrared spectra
Data: n samples by x variables (spectral data) and y variables (composition).
If the y variables are not correlated: PLS1 (a separate model for each y variable).
If the y variables are correlated: PLS2 (a single model for all y variables).

Biomass composition and near infrared spectra
2/3 of the samples for the calibration model,
1/3 of the samples for the validation (random selection).
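The 2/3 calibration, 1/3 validation split and the error figures reported on the next slides can be sketched as follows. This assumes RMSEC and RMSEP are the root-mean-square errors over the calibration and validation sets respectively (the usual definitions); the function names are made up for the example.

```python
import numpy as np

def calibration_validation_split(n_samples, cal_fraction=2/3, seed=0):
    """Randomly assign sample indices: 2/3 to the calibration set,
    1/3 to the validation set."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    n_cal = int(round(cal_fraction * n_samples))
    return idx[:n_cal], idx[n_cal:]

def rmse(measured, predicted):
    """Root-mean-square error: RMSEC when computed on the calibration
    set, RMSEP when computed on the validation set."""
    measured = np.asarray(measured, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((measured - predicted) ** 2)))
```

For the 70 pine samples of the earlier example, this would give 47 calibration and 23 validation samples.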

Calibration model to predict cellulose content in pine
r = 0.95, RMSEC = 1.6
[Figure: predicted vs. measured cellulose content (%), 25-50%; inset:
regression coefficients, intensity vs. wavelength (nm), 1000-2400]

Validation of the model to predict cellulose content in pine
r = 0.95, RMSEP = 1.55
[Figure: predicted vs. measured cellulose content (%), 25-50%]

PLS-Discriminant Analysis (PLS-DA)

A powerful method for classification. The aim is to create a predictive
model which can accurately classify future unknown samples.
Data: spectral data (x variables) and a y variable encoding class membership.
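A two-class PLS-DA model can be sketched by encoding class membership as a 0/1 dummy y variable, fitting a PLS1 model (NIPALS), and thresholding the predicted y at 0.5. The 0/1 coding matches the Y-reference values in the validation example that follows; the threshold and the implementation details are assumptions of this sketch.

```python
import numpy as np

def plsda_fit(X, y, n_components=2):
    """Two-class PLS-DA: regress a 0/1 dummy variable on X with a
    PLS1 (NIPALS) model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)    # class labels coded 0 or 1
    xm, ym = X.mean(axis=0), y.mean()
    Xc, yc = X - xm, y - ym
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xc.T @ yc
        w = w / np.linalg.norm(w)
        t = Xc @ w
        tt = t @ t
        p = Xc.T @ t / tt
        qk = (yc @ t) / tt
        Xc = Xc - np.outer(t, p)
        yc = yc - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P = np.array(W).T, np.array(P).T
    b = W @ np.linalg.solve(P.T @ W, np.array(q))
    return b, xm, ym

def plsda_classify(X, model, threshold=0.5):
    """Predict the continuous y, then assign class 1 if it exceeds
    the threshold, class 0 otherwise."""
    b, xm, ym = model
    y_pred = (np.asarray(X, dtype=float) - xm) @ b + ym
    return (y_pred > threshold).astype(int), y_pred
```

Well-classified samples have predicted y close to 0 or 1, as in the validation table below, where predictions such as -0.0338 and 1.0220 round cleanly to their class labels.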

PLS-Discriminant Analysis (PLS-DA)

Development of the PLS-DA calibration model
r = 0.99, RMSEC = 0.04
[Figure: PCA score plot, PC1 (92%) vs. PC2 (6%), separating yellow poplar and
hickory; predicted vs. measured Y variable, 0.0-1.2]

PLS-Discriminant Analysis (PLS-DA)

Validation of the PLS-DA model
r = 0.99, RMSEP = 0.04

Sample            Y-reference    Predicted Y
Spectrum 00008    0.0000         -0.0338
Spectrum 00009    0.0000          0.0270
Spectrum 00015    1.0000          0.9340
Spectrum 00016    1.0000          1.0220

[Figure: predicted Y vs. Y reference, 0.0-1.2; yellow poplar near 0,
hickory near 1]

Two-dimensional correlation spectroscopy

2D correlation tools spread spectral peaks over a second dimension and
simplify the visualization of complex spectra.

A perturbation (mechanical, electrical, chemical, magnetic, optical,
thermal, …) is applied to the system; an electro-magnetic probe (e.g. IR,
UV, LIBS, …) monitors the response, producing 2D correlation maps.

[Figure: spectra collected under perturbation and the resulting dynamic
spectra, intensity (A.U.) vs. wavelength (nm), 1000-2400]

Spectral data collected under perturbation

Dynamic spectra [DYN]
[S](m×n): m samples by n variables (wavelengths) matrix

[DYN](m×n) = [S](m×n) − [1](m×1) [S̄](1×n)

where [S̄](1×n) is the reference spectrum.

Synchronous matrix
[SYNC](n×n) = [DYN]ᵀ(n×m) [DYN](m×n)

Asynchronous matrix
[ASYN](n×n) = [DYN]ᵀ(n×m) [N](m×m) [DYN](m×n)

where [N](m×m) is the Noda-Hilbert transformation matrix.

Generation of orthogonal components: synchronous and asynchronous 2D
correlation intensities.
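The synchronous and asynchronous maps defined above can be computed directly in NumPy. Two assumptions in this sketch go beyond the slide's formulas: the reference spectrum is taken as the mean spectrum, and the conventional 1/(m−1) scaling is applied to both maps.

```python
import numpy as np

def noda_hilbert(m):
    """Hilbert-Noda transformation matrix [N](m x m):
    N[j, k] = 0 when j == k, otherwise 1 / (pi * (k - j))."""
    j, k = np.indices((m, m))
    diff = k - j
    N = np.zeros((m, m))
    mask = diff != 0
    N[mask] = 1.0 / (np.pi * diff[mask])
    return N

def two_d_correlation(S):
    """Synchronous and asynchronous 2D correlation maps from an
    m samples by n variables spectral matrix S."""
    S = np.asarray(S, dtype=float)
    m = S.shape[0]
    dyn = S - S.mean(axis=0)                         # dynamic spectra
    sync = dyn.T @ dyn / (m - 1)                     # synchronous map
    asyn = dyn.T @ noda_hilbert(m) @ dyn / (m - 1)   # asynchronous map
    return sync, asyn
```

By construction the synchronous map is symmetric and the asynchronous map is antisymmetric, which is what makes the two sets of 2D correlation intensities orthogonal.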

Homo-correlation: NIR/NIR
Hetero-correlation: NIR/MBMS
Perturbation: cellulose content

Other techniques to extract information

Classification
Soft Independent Modeling of Class Analogy (SIMCA)
  The Unscrambler, User Manual. CAMO, 1998
Kernel Principal Component Analysis (k-PCA)
  Schölkopf B., Smola A.J., Müller K. (1998) Nonlinear component analysis as a
  kernel eigenvalue problem. Neural Computation 10: 1299-1319
Artificial Neural Networks (ANN) (non-linear data)
  Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User's Guide, Matlab

Other techniques to extract information

Regression
Orthogonal Projections to Latent Structures (O-PLS)
  Trygg J., Wold S. (2002) Orthogonal projections to latent structures (O-PLS).
  J. Chemometrics 16: 119-128
Artificial Neural Networks (ANN)
  Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User's Guide, Matlab
Kernel Projection to Latent Structures (k-PLS)
  Rosipal R., Trejo L.J. (2001) Kernel partial least squares regression in
  reproducing kernel Hilbert space. J. Machine Learning Res. 2: 97-123
Supervised Probabilistic Principal Component Analysis (SPPCA)
  Yu S., Yu K., Tresp V., Kriegel H., Wu M. (2006) Supervised probabilistic
  principal component analysis. Proceedings of the 12th International Conference
  on Knowledge Discovery and Data Mining (SIGKDD): 464-473
(non-linear data)

Software

www.camo.com
www.umetrics.com
www.infometrix.com
www.mathworks.com

References
A User-Friendly Guide to Multivariate Calibration and Classification; T. Næs,
T. Isaksson, T. Fearn, T. Davies, NIR Publications, Chichester, UK, 2002
Multivariate Calibration, H. Martens and T. Næs, John Wiley & Sons, Chichester,
UK, 1989
Chemometric Techniques for Quantitative Analysis, Marcel Dekker, New York, 1998
Two-Dimensional Correlation Spectroscopy, I. Noda and Y. Ozaki, John Wiley &
Sons, Chichester, UK, 2004
Neural Network Toolbox 5 User's Guide, Matlab; H. Demuth, M. Beale and
M. Hagan

Questions?
