
Extracting information from spectral data

Nicole Labbé, University of Tennessee

SWST, Advanced analytical tools for the wood industry, June 10, 2007
Data collection

Near infrared spectra
2150 data points, 350-2500 nm, 1 nm resolution, 8 scans
[Figure: NIR spectrum, wavelength (nm), 400-2400 nm]

Mid infrared spectra
3400 data points, 4000-600 cm⁻¹, 1 cm⁻¹ resolution, 4 scans
[Figure: MIR spectrum, wavenumber (cm⁻¹), 4000-800 cm⁻¹]

Laser induced breakdown spectra
30000 data points, 200-800 nm, 0.02 nm resolution, 10 scans
[Figure: LIBS spectrum, wavelength (nm), 200-800 nm]
Data pretreatment
Prior to the extraction of the information, signal processing is used to
transform the spectral data:

- Local filters
- Smoothing
- Derivatives
- Baseline correction
- Multiplicative Scatter Correction (MSC)
- Orthogonal Scatter Correction (OSC)

[Figures: a noisy absorption band smoothed with a moving-average filter; raw
spectral data versus corrected spectral data after MSC]
What type of information?
1. Qualitative information
Grouping and classification of samples from their spectra, using supervised
and non-supervised learning methods
2. Quantitative information
Relationships between spectral data and parameter(s) of interest

How to extract the information?


1. Multivariate analysis (MVA)
Principal Component Analysis (PCA), Projection to Latent
Structures (PLS), PLS-Discriminant Analysis (PLS-DA), …

2. Two dimensional correlation spectroscopy


Homo-correlation, Hetero-correlation
Multivariate data analysis
Separating the signal from the noise in data and presenting the results as
easily interpretable plots.
Why are multivariate methods needed?
-Large data sets
-Problem of selectivity
Relationship between two variables: very simple, but does not always work.
In many problems, several predictor variables used in combination give
better results.

Two approaches:
- Univariate analysis
- Multivariate analysis

[Figure: near infrared spectra (375 x-variables), absorbance versus
wavelength, 800-2400 nm]

Near infrared spectra collected on 70 pine samples

Univariate analysis
Measured cellulose content versus predicted cellulose content using one
variable (1530 nm) as a predictor (R² = 0.12)
[Figure: predicted versus measured cellulose content (%), 25-50%]

Multivariate analysis
Measured cellulose content versus predicted cellulose content using a
multivariate method based on all variables from the spectral data (R² = 0.87)
[Figure: predicted versus measured cellulose content (%), 25-50%]
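The gap between the two R² values above can be reproduced on synthetic data (this toy example stands in for the pine spectra, which are not available here): a response driven by several variables at once is predicted poorly by any single variable but well by least squares on all of them combined.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_vars = 70, 10
X = rng.normal(size=(n_samples, n_vars))
# Response depends on a weighted combination of ALL variables, plus noise.
y = X @ np.linspace(1.0, 2.0, n_vars) + rng.normal(scale=0.5, size=n_samples)

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Univariate: regress y on one predictor only.
slope, offset = np.polyfit(X[:, 0], y, deg=1)
r2_uni = r_squared(y, slope * X[:, 0] + offset)

# Multivariate: ordinary least squares on all predictors plus an intercept.
A = np.column_stack([X, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
r2_multi = r_squared(y, A @ coef)
```

Here r2_uni comes out far below r2_multi, mirroring the 0.12 versus 0.87 pattern on the pine data.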
Qualitative information
Principal Components Analysis (PCA)
Recognize patterns in data: outliers, trends, groups…

[Figure: PCA scores plot, PC1 (89%) versus PC2 (8%)]
Biomass species and near infrared spectra
Spectral data: n samples by x variables
Transformation of the numbers into pictures by using
principal component analysis (scores)

[Figure: scores plot, PC1 versus PC2, with groups for bagasse, red oak, corn
stover, yellow poplar, hickory, and switchgrass; inset: PC1 and PC2 loadings,
intensity (A.U.) versus wavelength, 800-2400 nm]
Transformation of the numbers into pictures by using
principal component analysis (loadings)

[Figure: the same scores plot shown alongside the PC1 and PC2 loading
vectors, intensity (A.U.) versus wavelength, 800-2400 nm]
Principal Components Analysis (PCA)
PCA is a projection method: it decomposes the spectral data into a
"structure" part and a "noise" part.
X is an n samples (observations) by x variables (spectral variables) matrix.

[Figure: one variable x1 on a single axis]
1 variable = 1 dimension
Principal Components Analysis (PCA)
X is an n samples (observations) by x variables (spectral variables) matrix.

[Figures: samples plotted against axes x1 and x2, and against x1, x2, x3]
2 variables = 2 dimensions; 3 variables = a 3-dimensional space
Principal Components Analysis (PCA)
Beyond 3 dimensions, it is very difficult to visualize what's going on.

[Figure: 18 near infrared spectra of wood samples, absorbance versus
wavelength, 800-2800 nm, 375 x-variables]

Each variable is a co-ordinate axis, so each sample is a point in a
375-dimensional space.

Principal Components Analysis (PCA)
X has only 3 variables (wavelengths x1, x2 and x3)
The samples (n = 18) are represented in a 3D space

The first principal component


New co-ordinate axis representing the direction of maximum variation
through the data.
Higher-order Principal Components (PC2, PC3,…)

After PC1, next best direction for approximating the original data
The second PC lies along a direction orthogonal to the first PC

PC3 will be orthogonal to both PC1 and PC2 while simultaneously lying along the
direction of the third largest variation.
The new variables (PCs) are uncorrelated with each other (orthogonal)
Scores (T) = Coordinates of samples in the PC space

Representation of the samples in the PC space


There is a set of scores for each PC (score vector)

[Figure: projection from the original variable space onto the PC space]

Loadings (P) = Relations between X and PCs

Relationship between the original variable space and the new PCs space
There is a set of loadings for each PC (loading vector)
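A minimal sketch of scores and loadings computed via the singular value decomposition (equivalent, for this purpose, to the NIPALS algorithm used by most chemometrics packages), assuming X holds one spectrum per row:

```python
import numpy as np

def pca(X, n_components=2):
    """Return scores T, loadings P, and explained-variance ratios.
    Mean-centered X decomposes as X ≈ T @ P.T plus a residual (noise) part."""
    Xc = X - X.mean(axis=0)                      # center each variable
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_components] * s[:n_components]   # scores: samples in PC space
    P = Vt[:n_components].T                      # loadings: variables vs. PCs
    explained = s ** 2 / np.sum(s ** 2)          # variance captured per PC
    return T, P, explained[:n_components]
```

Plotting the two columns of T against each other gives a scores plot like the ones shown here; plotting a column of P against wavelength gives the matching loading vector.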
Transformation of the numbers into pictures by using PCA
[Figure: PCA scores plot, PC1 versus PC2, with groups for bagasse, red oak,
corn stover, yellow poplar, hickory, and switchgrass, alongside the PC1 and
PC2 loadings, intensity (A.U.) versus wavelength, 800-2400 nm]
Quantitative information
Projection to Latent Structures or Partial Least Squares Regression (PLS)
Establish relationships between input and output variables, creating predictive models.

X + Y ⇒ Model        (establishing the calibration model from known X and Y data)
X_new + Model ⇒ Ŷ    (using the calibration model to predict new Y-values from a new set of X-measurements)

PLS can be seen as two interconnected PCA analyses, PCA(X) and PCA(Y)
PLS uses the Y-data structure (variation) to guide the decomposition of X
The X-data structure also influences the decomposition of Y
Biomass composition and near infrared spectra
[Diagram: Y matrix (n samples × y variables) alongside the spectral data X
(n samples × x variables)]

If the y variables are not correlated ⇒ PLS1
Biomass composition and near infrared spectra
[Diagram: Y matrix (n samples × y variables) alongside the spectral data X
(n samples × x variables)]

If the y variables are correlated ⇒ PLS2
Biomass composition and near infrared spectra
Random selection of the samples:
- 2/3 of the samples for the calibration model
- 1/3 of the samples for validation
Calibration model to predict cellulose content in pine: r = 0.95, RMSEC = 1.6
[Figures: predicted versus measured cellulose content (%), 25-50%; regression
coefficients (intensity) versus wavelength, 1000-2400 nm]

Validation of the model to predict cellulose content in pine: r = 0.95,
RMSEP = 1.55
[Figure: predicted versus measured cellulose content (%), 25-50%]
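The workflow above (fit on the 2/3 calibration set, then report RMSEC there and RMSEP on the 1/3 held-out set) can be sketched with a bare-bones PLS1 (NIPALS) implementation; this is an illustrative reimplementation, not the software actually used for the pine model:

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Bare-bones PLS1 (NIPALS) for a single response variable y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)        # weight vector guided by y
        t = Xk @ w                       # scores
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = (yk @ t) / tt                # inner regression coefficient
        Xk = Xk - np.outer(t, p)         # deflate X ...
        yk = yk - c * t                  # ... and y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)  # regression vector in X space
    return B, x_mean, y_mean

def pls1_predict(X, B, x_mean, y_mean):
    return (X - x_mean) @ B + y_mean

def rmse(y_true, y_pred):
    """RMSEC when applied to the calibration set, RMSEP on the validation set."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))
```

The regression vector B, plotted against wavelength, corresponds to the coefficient plot shown with the calibration model.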
PLS-Discriminant Analysis (PLS-DA)
A powerful method for classification. The aim is to create a predictive
model which can accurately classify future unknown samples.

[Diagram: dummy y variable (class membership) alongside the spectral data,
x variables]
PLS-Discriminant Analysis (PLS-DA)
Development of PLS-DA calibration model
[Figures: PCA scores plot, PC1 (92%) versus PC2 (6%), separating yellow
poplar and hickory; predicted versus measured Y variable for the PLS-DA
calibration model, r = 0.99, RMSEC = 0.04]
PLS-Discriminant Analysis (PLS-DA)
Validation of PLS-DA model
PLS-DA validation model results:

Sample           Y-reference   Predicted Y
Spectrum 00008   0.0000        -0.0338
Spectrum 00009   0.0000        0.0270
Spectrum 00015   1.0000        0.9340
Spectrum 00016   1.0000        1.0220

r = 0.99, RMSEP = 0.04
[Figure: predicted Y versus Y reference; hickory samples cluster near 1,
yellow poplar samples near 0]
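Mechanically, PLS-DA is PLS regression on a dummy 0/1 response (judging from the plot, 0 = yellow poplar and 1 = hickory here), with predictions thresholded at 0.5. A one-component sketch:

```python
import numpy as np

def plsda_fit(X, y01):
    """One-component PLS-DA: y01 holds 0/1 class labels."""
    x_mean, y_mean = X.mean(axis=0), y01.mean()
    Xc, yc = X - x_mean, y01 - y_mean
    w = Xc.T @ yc
    w = w / np.linalg.norm(w)           # direction that separates the classes
    t = Xc @ w                          # scores along that direction
    c = (yc @ t) / (t @ t)              # inner regression coefficient
    return w, c, x_mean, y_mean

def plsda_predict(X, model):
    w, c, x_mean, y_mean = model
    y_hat = ((X - x_mean) @ w) * c + y_mean
    return y_hat, (y_hat > 0.5).astype(int)   # threshold the continuous Y
```

As in the table above, well-classified samples give continuous predictions near 0 or near 1 rather than exactly on them.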
Two-dimensional correlation spectroscopy
2D correlation tools spread spectral peaks over a second dimension to
simplify the visualization of complex spectra.

Perturbation (mechanical, electrical, chemical, magnetic, optical,
thermal, …) → System, probed with an electro-magnetic probe (e.g., IR, UV,
LIBS, …) → 2D correlation maps
[Figures: spectral data collected under perturbation, and the corresponding
dynamic spectra [DYN], intensity (A.U.) versus wavelength, 1000-2400 nm]

[S](m×n): m samples by n variables (wavelengths) matrix

Dynamic spectra (each spectrum minus the reference spectrum):
[DYN](m×n) = [S](m×n) − [1](m×1) × [S̄](1×n)

Generation of orthogonal components, the synchronous and asynchronous 2D
correlation intensities:

Synchronous matrix:
[SYNC](n×n) = [DYN]ᵀ(n×m) × [DYN](m×n)

Asynchronous matrix, where [N](m×m) is the Hilbert-Noda matrix:
[ASYN](n×n) = [DYN]ᵀ(n×m) × [N](m×m) × [DYN](m×n)
Homo-correlation: NIR/NIR        Hetero-correlation: NIR/MBMS
Perturbation: cellulose content
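A sketch of these matrices in code, taking the mean spectrum as the reference and adding the conventional 1/(m − 1) scaling (the slide's formulas omit it, which only changes the overall intensity scale):

```python
import numpy as np

def twod_correlation(S):
    """Synchronous and asynchronous 2D correlation maps (Noda's formalism).
    Rows of S are spectra collected along the perturbation sequence."""
    m, n = S.shape
    dyn = S - S.mean(axis=0)                 # dynamic spectra [DYN]
    sync = dyn.T @ dyn / (m - 1)             # synchronous map [SYNC]
    # Hilbert-Noda matrix: 0 on the diagonal, 1/(pi*(k-j)) off it.
    j, k = np.meshgrid(np.arange(m), np.arange(m), indexing="ij")
    with np.errstate(divide="ignore"):
        N = np.where(j == k, 0.0, 1.0 / (np.pi * (k - j)))
    asyn = dyn.T @ N @ dyn / (m - 1)         # asynchronous map [ASYN]
    return sync, asyn
```

The synchronous map is symmetric (in-phase intensity changes), while the asynchronous map is antisymmetric (out-of-phase changes), which is why the two together resolve overlapped peaks.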


Other techniques to extract information
Classification
• Soft Independent Modeling of Class Analogy (SIMCA)
The Unscrambler, User manual. CAMO, 1998

• Kernel Principal Component Analysis (k-PCA) (for non-linear data)
Schölkopf B., Smola A.J., Müller K. (1998) Nonlinear component analysis as a
kernel eigenvalue problem. Neural Computation 10: 1299-1319

• Artificial Neural Networks (ANN)


Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User’s Guide – Matlab
Other techniques to extract information
Regression
• Orthogonal Projections to Latent Structures (O-PLS)
Trygg J., Wold S. (2002) Orthogonal projections to latent structures (O-PLS). J. Chemo. 16:119-128

• Artificial Neural Networks (ANN)


Demuth H., Beale M. and Hagan M., Neural Network Toolbox 5 User’s Guide – Matlab

• Kernel Projection to Latent Structures (k-PLS) (for non-linear data)
Rosipal R., Trejo L.J. (2001) Kernel partial least squares regression in
reproducing kernel Hilbert space. J. Machine Learning Res. 2: 97-123

• Supervised Probabilistic Principal Component Analysis (SPPCA) (for
non-linear data)
Yu S., Yu K., Tresp V., Kriegel H., Wu M. (2006) Supervised probabilistic
principal component analysis. Proceedings of the 12th International
Conference on Knowledge Discovery and Data Mining (SIGKDD): 464-473
Software
• www.camo.com
• www.umetrics.com
• www.infometrix.com
• www.mathworks.com

References
• A user-friendly guide to multivariate calibration and classification, T. Næs,
T. Isaksson, T. Fearn, T. Davies, NIR Publications, Chichester, UK, 2002
• Multivariate calibration, H. Martens and T. Næs, John Wiley & Sons,
Chichester, UK, 1989
• Chemometric techniques for quantitative analysis, Marcel Dekker, New York, 1998
• Two-dimensional correlation spectroscopy, I. Noda and Y. Ozaki, John Wiley &
Sons, Chichester, UK, 2004
• Neural Network Toolbox 5 User's Guide - Matlab, H. Demuth, M. Beale and
M. Hagan
Questions?
