An Introduction to Multivariate Calibration and Analysis

REPORT

Kenneth R. Beebe
Bruce R. Kowalski
Department of Chemistry, BG-10
Laboratory for Chemometrics
University of Washington
Seattle, Wash. 98195

0003-2700/87/A359-1007/$01.50/0 © 1987 American Chemical Society
ANALYTICAL CHEMISTRY, VOL. 59, NO. 17, SEPTEMBER 1, 1987 · 1007 A

The advent of the laboratory computer has allowed analytical chemists to collect enormous amounts of data on a wide variety of problems of interest. With this ability, however, has come the realization that more data does not necessarily mean more information. One can collect reams of computer output without knowing anything more about the system under investigation. Only when the data are interpreted and put to use do they become valuable to the chemist and to society in general; then data become information. For this reason, data analysis methods that can handle large quantities of data are becoming a necessity for the practicing analytical chemist. The number of books in this area (1-3) is increasing as the awareness of the need for more sophisticated techniques grows. However, widespread use of the full set of analytical tools will be realized only when the analyst is familiar with the general goals and advantages of the methods.

This REPORT serves as an introduction to the area of multivariate analyses known as multivariate calibration. Our goal is to give the chemist insight into the workings of a collection of statistical techniques and to enable him or her to judge the appropriateness of the techniques for an application of interest. The list of methods described is by no means a complete representation of those that can be applied to multivariate data. Also, we do not compare the results obtained using one method over another; instead, we describe the methods in the line of a tutorial. For this reason, the examples chosen illustrate how the methods work rather than compare among them. The three multivariate methods that will be discussed are multiple linear regression (MLR), principal component regression (PCR), and partial least squares (PLS). MLR, an extension of ordinary least squares, is the easiest to understand and also the most commonly used; PCR and PLS have yet to achieve widespread acceptance in chemistry.

Calibration and prediction

Multivariate statistics is a collection of powerful mathematical tools that can be applied to chemical analysis when more than one measurement is acquired for each sample. For the sake of consistency, near-infrared (near-IR) reflectance analysis will be used as an example of such an analytical method throughout this REPORT. However, the techniques described can be used in any situation where multiple measurements are acquired.

One example of an analytical problem solved by near-IR reflectance analysis is the estimation of the protein and moisture content of wheat samples. As with many analytical methods, this procedure consists of two phases: calibration and prediction. The chemist begins by constructing a data matrix (R) from the instrument responses at selected wavelengths for a given set of calibration samples. In the case of near-IR reflectance analysis, the R matrix can be constructed from the logarithmic reflectances (log(1/ref)) or from some other combination of reflectances obtained at the various wavelengths (4). A matrix of concentration values (C) is then formed using independent or referee methods such as Kjeldahl nitrogen analysis for protein determination and oven-drying for the determination of moisture content. The goal of the calibration phase is to produce a model that relates the near-IR spectra to the values obtained by the independent or referee methods.

Figure 1 illustrates the resulting matrices for this example. R is a 3 × 8 matrix with three rows of near-IR spectra from the analysis of three samples and eight columns corresponding to the eight near-IR wavelengths chosen for the analysis. (Throughout this REPORT, these columns will be referred to as the original variables for R.) The C matrix is 3 × 2 dimensional for this example. Again, the three samples occupy the three rows and the two columns represent the protein and moisture content of the samples as determined by the referee methods. According to this notation, the terms $c_{i1}$ and $c_{i2}$ represent the protein content
and moisture content, respectively, for the ith wheat sample found in the ith row of C.

[Figure 1. Configuration of R and C matrices: R is I × J and C is I × K.]

The next step in the calibration phase is the foundation of the entire analysis. The analyst must choose an appropriate mathematical method that will best reproduce C given the R matrix. How the analyst defines best will ultimately determine the method that is chosen.

MLR assumes that the best approach to estimating C from R is to find the linear combination of the variables in R that minimizes the errors in reproducing C. It proposes a relationship between C and R such that C = RS + E, where S is a matrix of regression coefficients and E is a matrix of errors associated with the MLR model. S is estimated by linear regression, and the term

$$\mathrm{Err} = \sum_{k=1}^{K} \sum_{i=1}^{I} (c_{ik} - \hat{c}_{ik})^2 = \sum_{k=1}^{K} \sum_{i=1}^{I} e_{ik}^2 \qquad (1)$$

is minimized. In this expression, $c_{ik}$ is the actual concentration element in the ith row and kth column of C, $\hat{c}_{ik}$ is the MLR estimate of the same element using S, and $e_{ik}$ is the corresponding element of the matrix E.

This approach appears to be reasonable, and in fact it is the best method when the analyst is dealing with well-behaved systems (linear responses, no interfering signals, no analyte-analyte interactions, low noise, and no collinearities). However, problems can be encountered with the use of real data. A model chosen solely according to this criterion attempts to use all of the variance in the R matrix, including any irrelevant information, to model C. When the resulting model is then applied to a new sample, the model will assume that the correlation found between the calibration R and C matrices also exists in that sample. Because the model was built using irrelevant information in the R matrix, this assumption will not be true. Unfortunately, even noise has a very high probability of being used to build the model. The following example illustrates this point. Consider the matrices

$$
\mathbf{R} = \begin{bmatrix} 75 & 152 & 102 \\ 63 & 132 & 82 \\ 96 & 218 & 176 \\ 69 & 157 & 124 \end{bmatrix}, \quad
\mathbf{C} = \begin{bmatrix} 2 & 7 & 5 \\ 4 & 3 & 3 \\ 9 & 12 & 3 \\ 6 & 8 & 2 \end{bmatrix}, \quad
\mathbf{S} = \begin{bmatrix} -0.71 & 0.55 & 0.48 \\ 0.42 & -0.41 & -0.24 \\ -0.08 & 0.28 & 0.05 \end{bmatrix}
$$

MLR was used to determine the S matrix, as described earlier. A standard measure of the effectiveness of the model is the value of Err as presented above. For this example Err = 0.49, and it will be assumed that the matrix S is an accurate estimate of the true model. The coefficients in S, therefore, will closely approximate the true relationship between the variables in R and C.

To illustrate how a calibration method can be inappropriate, a column of random numbers ranging from zero to 100 was added to the R matrix. The addition of this column is analogous to the inclusion of a wavelength in near-IR analysis that has no useful information for describing the protein or moisture content of the samples. The resulting matrix R2 and MLR model are

$$
\mathbf{R}_2 = \begin{bmatrix} 75 & 152 & 102 & 91 \\ 63 & 132 & 82 & 36 \\ 96 & 218 & 176 & 74 \\ 69 & 157 & 124 & 51 \end{bmatrix}, \quad
\mathbf{C} = \begin{bmatrix} 2 & 7 & 5 \\ 4 & 3 & 3 \\ 9 & 12 & 3 \\ 6 & 8 & 2 \end{bmatrix}, \quad
\mathbf{S}_2 = \begin{bmatrix} 0.71 & 0.18 & 0.42 \\ -0.42 & -0.19 & -0.20 \\ 0.24 & 0.20 & 0.03 \\ -0.12 & 0.03 & 0.01 \end{bmatrix}
$$

The model resulting from the regression of the same C as in the first example onto this new matrix R2 is given in S2. For this model, Err = 0.07, where the smaller value implies that this second model is more effective at modeling C. To further evaluate these results, note that when R is multiplied by S, the jth column of R is always multiplied by the jth row of S. The importance of variable j to the model can therefore be determined by examining the jth row of S. The nonzero entries in the fourth row of S2 reveal that a variable consisting of random numbers (column number four of R2) was chosen as a significant contributor to the calibration model. In an ideal analysis, this random variable would have been ignored in the model-building phase of the analysis. Furthermore, the inclusion of this column in R2 has changed the estimated model coefficients so that they no longer represent the true model. The upper 3 × 3 portion of S2, which represents the model for the first three variables in R2, is not equal to S, which represents the true model for the same variables. The addition of a column of random numbers has resulted in a model that appears to be better, in that it is more effective at reproducing C, and yet it does not describe the true relationship. This is because MLR uses all of the matrix R to build the model, regardless of whether or not it is relevant in describing the true model. Therefore, an erroneous model can be derived and subsequently used to predict the characteristics (e.g., protein content) of new samples. Thus MLR alone often will generate misleading models with subsequent errors in prediction.

The remainder of this REPORT investigates the advantages of PCR and PLS over MLR. To understand these advantages, it is necessary to understand how the methods work. Graphical representations of many of the concepts have been included so that the reader not well versed in linear algebra will be able to follow the discussions.

Notation

Standard linear algebra notation will be employed throughout this article. Bold upper case letters refer to matrices; plain upper case letters are used for their row and column dimensions. For example, the letter R refers to the matrix of responses and is I × J dimensional. Bold lower case letters signify column vectors such as $\mathbf{r}_2$, which represents the second column of matrix R. By this convention $\mathbf{r}_2^{T}$ represents the second column of R written as a row vector. Plain lower case letters are scalars, representing single elements of vectors ($a_i$), matrices ($r_{ij}$), or other constants such as regression coefficients ($b$). The transpose of a matrix or vector is represented by a superscript T. (See
References 5 and 6 for books on linear algebra.)

[Figure 2. The matrix R in row space. Axes: Row 1, Row 2; points: Column 1, Column 2, Column 3.]

[Figure 3. The matrix R in column space.]

Matrix representations. Matrices of low dimensionality (i.e., the numbers of rows and columns are small integer values) can be represented graphically. Consider the following 2 × 3 matrix

[2 × 3 matrix R; entries not legible in this copy]

R can be represented graphically as R in row space or as R in column space. Row space is the space formed with the rows of R as the axes. It is, therefore, two-dimensional. Because there are three columns, graphing R in row space consists of plotting three points (each point corresponding to a column) in two-dimensional space. The coordinates of the points representing the columns in each dimension are simply the matrix entries in the corresponding row. The resulting plot for this matrix
is shown in Figure 2.

Analogous to row space, the column space has two points and is three-dimensional. Figure 3 is a two-dimensional representation of this three-dimensional space. This example can be used to understand higher dimensional matrices, so that a 10 × 12 matrix would have 12 points in 10-dimensional row space and 10 points in 12-dimensional column space. Examples in higher dimensions than three can be understood if one refers back to two or three dimensions and considers situations of higher dimensionality as expansions of these simpler cases.

Projections. Another concept that is important to understand is that of projecting either a point or a vector onto a vector or plane. Each of these can be viewed as being the perpendicular shadow of one object onto another. Figure 4 illustrates the result of projecting the vector a onto a plane in three-dimensional space.

Factors. The last concept that is very basic to understanding both PCR and PLS is that of a factor. This is because both PCR and PLS are factor-based modeling procedures. For our purposes, a factor is defined as any linear combination of the original variables in R or C. It can be shown that given J factors for an I × J matrix R, one can also represent the variables in R as a linear combination of these same J factors. It is important to determine factors for greater freedom in representing the useful information in the R matrix. For example, the information in an I × J matrix R can be expressed as an I × J matrix R', where the columns of R' are linear combinations of the original columns in R. The advantage of this is that if a particular column in R is not useful (as in the MLR example cited earlier), one can attempt to find a set of factors that gives that column small weights when forming R'.

This is only one of the advantages of a factor-based calibration method over the methods that simply use the raw data. The other possible advantages can be understood by examining one of the factor methods mentioned in the introduction: PCR (3, 7, 8). The following discussion concerns the preliminary step of PCR: principal component analysis (PCA). An intuitive feel for how PCA works can be gained by considering the result of PCA performed on the 5 × 2 matrix R shown below.

$$\mathbf{R} = \begin{bmatrix} 2 & 4 \\ 1 & 2 \\ 0 & 0 \\ -1 & -2 \\ -2 & -4 \end{bmatrix}$$

In column space, the matrix is five points in two dimensions as shown in Figure 5.

[Figure 5. The matrix R plotted in column space with the first eigenvector. Axes: Column 1, Column 2.]

In real analyses, the columns of R are often mean-centered and scaled. Mean centering simply involves subtracting the average of a column from each of the entries in that column. Scaling, which gives equal weights to each variable, involves dividing each entry by the standard deviation of the column, so that every column has unit variance. PCA is then performed on the covariance matrix $\mathbf{R}^{T}\mathbf{R}$ formed from the mean-centered and scaled matrix. The first eigenvector corresponding to the largest eigenvalue is, by definition, the direction in the space defined by the columns of R that describes the maximum amount of variation or spread in the samples. Figure 5 shows the data and the direction of the first eigenvector where the space defined by R is a plane.
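
The preprocessing step can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' procedure: the helper name `autoscale` is hypothetical, the 4 × 3 response matrix from the earlier MLR example is reused as input, and scaling is assumed to mean division by the column sample standard deviation so that every column ends up with unit variance.

```python
import numpy as np

def autoscale(R):
    """Mean-center each column, then scale it to unit variance (hypothetical helper)."""
    R = np.asarray(R, dtype=float)
    centered = R - R.mean(axis=0)             # subtract the column averages
    return centered / R.std(axis=0, ddof=1)   # assumption: divide by the sample standard deviation

# Response matrix from the MLR example (4 samples x 3 variables)
R = np.array([[75, 152, 102],
              [63, 132,  82],
              [96, 218, 176],
              [69, 157, 124]])

Z = autoscale(R)

# For autoscaled data, Z^T Z / (I - 1) is the correlation matrix of the columns
corr = Z.T @ Z / (Z.shape[0] - 1)

# First eigenvector: the direction of maximum spread in the scaled data
evals, evecs = np.linalg.eigh(corr)   # eigh returns eigenvalues in ascending order
v1 = evecs[:, -1]

print("correlation matrix:\n", corr.round(3))
print("first eigenvector:", v1.round(3))
```

Dividing by the standard deviation rather than some other measure of spread is what makes the cross-product matrix coincide with the correlation matrix, so each variable enters the eigenvector calculation with equal weight.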

In this example, all of the variation in the data can be described using one eigenvector. The samples all fall on a line in column space, and therefore all of the variation lies in one direction.
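
This one-dimensional case is easy to check numerically. The sketch below assumes the 5 × 2 example matrix with rows [2, 4], [1, 2], [0, 0], [-1, -2], and [-2, -4], and uses NumPy's symmetric eigensolver; the variable names are illustrative only.

```python
import numpy as np

# The 5 x 2 example matrix; its columns already have zero mean
R = np.array([[ 2.0,  4.0],
              [ 1.0,  2.0],
              [ 0.0,  0.0],
              [-1.0, -2.0],
              [-2.0, -4.0]])

# Eigen-decomposition of the symmetric matrix R^T R
evals, evecs = np.linalg.eigh(R.T @ R)   # eigenvalues come back in ascending order
order = np.argsort(evals)[::-1]          # reorder so the largest eigenvalue is first
evals, evecs = evals[order], evecs[:, order]

v1 = evecs[:, 0]
if v1[0] < 0:                            # the sign of an eigenvector is arbitrary; fix it
    v1 = -v1

scores = R @ v1                          # projection of each sample onto the first eigenvector

print("eigenvalues:", evals.round(6))
print("first eigenvector:", v1.round(3))
print("scores:", scores.round(3))
```

The first eigenvector comes out proportional to [1, 2], and the second eigenvalue is zero to within rounding, which is the numerical counterpart of all five samples lying on a single line.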

[Figure 4. The projection of a vector onto a plane.]

When all of the variation in the samples cannot be accounted for using only one eigenvector, a second eigenvector can be found that is perpendicular or orthogonal to the first and describes the maximum amount of residual variation (not described by the first eigenvector) in the data set. Figure 6 is the plot of a 30 × 2 matrix and the associated first two eigenvectors. The direction of the first eigenvector describes the maximum amount of variation or spread in the data. In this particular case, the samples in column space happen to fall within a two-dimensional ellipse, and the first eigenvector corresponds to the major axis of the ellipse.
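
The two-eigenvector case can be illustrated with simulated data. The sketch below is only a stand-in for the kind of data plotted in Figure 6: the 30 × 2 matrix is generated at random (seeded for repeatability) as an elongated cloud tilted by 45 degrees, an angle chosen here for illustration, and none of the numbers come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)             # fixed seed; the data are synthetic, not the article's

# 30 samples spread widely along one axis and narrowly along the other ...
cloud = rng.normal(size=(30, 2)) * np.array([3.0, 1.0])
# ... then rotated by 45 degrees so the long axis is tilted in column space
theta = np.deg2rad(45.0)
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
R = cloud @ rot.T
R = R - R.mean(axis=0)                     # mean-center the columns

evals, evecs = np.linalg.eigh(R.T @ R)     # ascending eigenvalues
evals, V = evals[::-1], evecs[:, ::-1]     # reorder so the largest comes first

U = R @ V                                  # coordinates of the samples on the new axes

print("fraction of variation per eigenvector:", (evals / evals.sum()).round(3))
print("v1 . v2 =", round(float(V[:, 0] @ V[:, 1]), 12))
```

Because the two eigenvectors are orthogonal, the columns of `U` are uncorrelated, and the first column carries the larger share of the variation, much as the major axis of an ellipse carries more of its extent than the minor axis.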

To illustrate how eigenvectors can define factors for a matrix R, the results of the eigenvector analysis shown in Figure 5 are presented. The eigenvector in Figure 5 is equal to the vector $\mathbf{v}^{T} = [0.447, 0.894]$. In vector notation, the first factor or principal component can be shown to equal a linear combination of the original variables as follows:

$$\mathbf{R}\mathbf{v} = \mathbf{u} \qquad (2)$$

where R is the original matrix of responses, v is the first eigenvector of $\mathbf{R}^{T}\mathbf{R}$, and u is the so-called score vector, which is the projection of R onto the first eigenvector. The vector u is a factor for the columns of R because it can be written as a linear combination of the columns. Likewise, when J eigenvectors are calculated and used to form an I × J matrix U of scores, the original variables in R can be expressed as the linear combination of the scores:

$$\mathbf{R}\mathbf{V}\mathbf{V}^{T} = \mathbf{U}\mathbf{V}^{T} \qquad (3)$$

$$\mathbf{R} = \mathbf{U}\mathbf{V}^{T}(\mathbf{V}\mathbf{V}^{T})^{-1} \qquad (4)$$

where $\mathbf{V}^{T}(\mathbf{V}\mathbf{V}^{T})^{-1}$, obtained from linear algebra, is called the generalized inverse (GI) of V. In PCA notation, the elements in the GI of V are called the principal component "loadings." These loadings range from -1 to +1 and are the cosines of the angles between the eigenvector and the variable axes. High loadings correspond to high correlations (large coefficients) where the angle between the eigenvector and a variable is very small [cos(0°) = 1]. Small loadings correspond to low correlations (small coefficients) where the eigenvector is orthogonal or nearly orthogonal to the variable [cos(90°) = 0].

In the example given in Figure 6, the second eigenvector corresponds to the minor axis of the ellipse. Because the eigenvectors lie in the same space as the original variables, they can be used as a new set of axes for the matrix R. Each point or sample has a new set of coordinates from their projections onto the new axes, and these axes may be more useful for describing the variation or spread of the samples. The new axes also can be viewed as a rotation of the original axes. In Figure 6 the transformation from the old to the new axes can be achieved by rotating the original axes by approximately 45 degrees.

In this example, the object (ellipse) formed by the sample points in column space was two-dimensional. In more complicated situations the object can be of higher dimensionality than two, and the process of determining eigenvectors continues until all of the variation in the samples is explained. The dimensionality of the space containing the samples equals the number of columns in R; thus the maximum possible dimensionality of the samples equals the number of columns in R. The collection of points, however, often can be represented by fewer dimensions. This occurs when there is collinearity between the variables in the matrix under investigation. The presence of collinearity means that some of the variables in the original matrix are correlated to the other variables and therefore contain redundant information. In this situation, PCA can be used to describe the variation in R using fewer dimensions than the number of columns in R. The eigenvectors can be used to determine a subspace in the space defined by the columns of R that includes all of the original variation. One can visualize this by considering the column space spanned by R as a room and the samples lying on a table in the room. The column space is three-dimensional, but only two dimensions are necessary to describe the position of each of the sample points relative to the others. The two eigenvectors necessary to describe the samples would define a plane that includes the top of the table. This characteristic of the factor-based methods is exploited to yield more informative models, as is discussed in the following sections.

[Figure 6. A 30 × 2 matrix plotted in column space with the first two eigenvectors.]

[Figure 7. The plane formed by the two columns of matrix R plotted in row space.]

Multiple linear regression

The first method considered, MLR, does not attempt to model underlying structure or factors that may exist in R or C. The goal of MLR is to find the linear combination of the variables $\mathbf{r}_j$ such that the model's estimates of the values of the c's in the calibration set given by

$$\hat{\mathbf{c}} = \sum_{j=1}^{J} s_j \mathbf{r}_j$$

are as close to the known values of C as possible. As stated earlier, the criterion of closeness for MLR is defined as minimizing the sum of the squares of the deviations of the predicted values from the true values (see Equation 1). In statistical language, MLR is the regression of the columns of C onto the space defined by the columns of R. The mathematical procedure for this can be found in most books on multivariate analysis (3).

Graphically, MLR finds the projection of the columns of the matrix C
onto the variables or columns of R in its row space. To illustrate this, consider the following matrix and vector

$$\mathbf{R} = \begin{bmatrix} 39 & 29 \\ 18 & 15 \\ 11 & 6 \end{bmatrix}, \qquad \mathbf{c} = \begin{bmatrix} 30 \\ 20 \\ 10 \end{bmatrix}$$

where R might represent a near-IR data set with two columns of absorbance measurements at two wavelengths and three rows corresponding to three spectra from different samples of wheat, and vector c represents the protein content of the three wheat samples from Kjeldahl nitrogen analysis.

Figure 7 is a graph of the column vectors $\mathbf{r}_1$ and $\mathbf{r}_2$ plotted in row space. The two columns of R define a subspace (a tilted plane) within the row space, and regressing c onto R is equivalent to finding the projection of c onto the subspace defined by R. To perform MLR, one could plot the c vector in this same row space and project c onto the plane formed by $\mathbf{r}_1$ and $\mathbf{r}_2$. Figure 8 is an expanded view of the projection of c onto the plane, where proj c is the estimate of c from matrix R. The resulting regression coefficients for the above example are given as

$$\mathbf{s} = \begin{bmatrix} 0.20 \\ 0.85 \end{bmatrix}$$

[Figure 8. Geometric representation of the regression of c onto R.]

From s an estimate of c can be obtained as $\hat{\mathbf{c}} = \mathrm{proj}\ \mathbf{c} = s_1\mathbf{r}_1 + s_2\mathbf{r}_2$. Or in matrix notation,

$$\mathrm{proj}\ \mathbf{c} = \begin{bmatrix} 39 & 29 \\ 18 & 15 \\ 11 & 6 \end{bmatrix} \begin{bmatrix} 0.20 \\ 0.85 \end{bmatrix} = \begin{bmatrix} 32.5 \\ 16.4 \\ 7.3 \end{bmatrix}$$

Another point illustrated by this example is that the ability to estimate c depends on how close c lies to the subspace defined by R. In this example, c is near that subspace, and R's estimate of c, proj c, is the vector in the plane defined by $\mathbf{r}_1$ and $\mathbf{r}_2$ that lies closest to c. The model error can be described geometrically as the length of the segment joining the heads of vector c and proj c.

To predict the concentrations of an unknown sample with a given response vector (e.g., $\mathbf{r}_{\mathrm{unk}}^{T} = [10 \ \ 5]$), one multiplies the vector $\mathbf{r}_{\mathrm{unk}}$ by s as follows:

$$\begin{bmatrix} 10 & 5 \end{bmatrix} \begin{bmatrix} 0.20 \\ 0.85 \end{bmatrix} = 6.25 = c_{\mathrm{pred}}$$

where $c_{\mathrm{pred}}$ is the predicted concentration in the new sample.

The method of MLR is indeed an adequate procedure in ideal situations. However, it is based on the problematic step of inverting a matrix, and it can incorporate significant amounts of irrelevant variance (information) into the model.

Principal component regression

This model-building procedure has two steps. The first step is the determination of the eigenvectors or factors for a matrix R, as was described in the section on factors. This information is used to convert R into a score matrix U by projecting the R matrix onto the space defined by the eigenvectors (see Equation 2). Until now, no attempt has been made to describe the rationale behind the use of eigenvectors as the new axes. We will show that there is a logical interpretation of the significance of the eigenvectors. If the analyst is to redefine the variables using a smaller number of factors, one reasonable approach is to find the factors that best describe the original variables. The first principal component is simply the linear combination of the original variables that points in a direction that is most correlated to all of the columns in row space. In mathematical terms, it is the linear combination that has the smallest squared errors when used to estimate the original variables. It is also the direction in column space that best describes the variation in the samples. To illustrate these characteristics, consider the following matrix R:

[matrix R; entries not legible in this copy]

Figure 9 is a representation of the columns of R and the first principal component plotted in row space. No other vector in this space will yield a smaller sum of errors when used to estimate the two columns of R. In the general case of an I × J matrix R, the first principal component will be a vector in I-dimensional space that has the smallest sum of squared errors when used to estimate the J column vectors in R.

[Figure 9. The matrix R in row space with the first principal component. Axes: Row 1, Row 2.]

The difference between MLR and PCR lies in the re-expression of R as a score matrix U. This matrix results from projecting R onto the eigenvectors V, as RV = U. The score matrix U is composed of the original data points in a new coordinate system described by the eigenvectors. The second step of PCR uses MLR to regress the C matrix onto the score matrix as US = C. The eigenvectors, V, and the matrix of regression coefficients, S, together form the PCR model.

The prediction step uses the V and S matrices derived in the calibration set as follows. Multiply the new sample response vector or matrix $\mathbf{R}_{\mathrm{unk}}$ by V to obtain $\mathbf{U}_{\mathrm{unk}}$, then multiply $\mathbf{U}_{\mathrm{unk}}$ by S to yield $\mathbf{C}_{\mathrm{unk}}$, where $\mathbf{C}_{\mathrm{unk}}$ contains the estimates of the concentrations of the analytes in the unknown samples.

If for an I × J matrix R all of the J eigenvectors are used to form V, PCR will yield results identical to MLR. The advantages of PCR are based on the concentration of the information in R into fewer factors than J. Because the eigenvectors optimally describe the maximum variation in the original variables, they can be used to determine the rotation that best describes the information in the matrix R using the minimum number of dimensions. The first factor is the most descriptive; each successive factor describes less of the information contained in R. Deleting some of the latter factors that contain mostly noise therefore does not significantly reduce the amount of useful information present. This reduction of dimensionality is often of utmost importance for practical and mathematical stability. Another important characteristic of the matrix U is that if the eigenvectors are calculated from a symmetric matrix such as the correlation or
covariance matrix, each of the columns in U will be mutually orthogonal.

One way of viewing this procedure is in terms of the space defined by R. MLR uses all of the space described by the columns of R, whereas PCA determines a subspace by possibly ignoring some of the eigenvectors. When one of the J columns in R is a linear combination of the other J - 1 columns, R can be represented as a matrix U with J - 1 columns without a loss of information. This new matrix U defines a subspace of R. The first example in this article illustrated how a noisy variable can adversely affect the true model relating R to C. It would be advantageous to ignore this variable by finding a portion of the space (a subspace) defined by R that would be more effective at modeling C.

The advantages of PCR make it the obvious choice over MLR in many situations, but there is room for improvement. At the crucial factor-building step, PCR ignores the information contained in the C matrix. The C matrix is not used until the second step of the procedure, when the factors in R have already been determined. The correlation between these factors and C is not taken into account when the factor model is built. The rotation that is determined has very good descriptive qualities concerning the R matrix, but there is not much reason to assume it will optimize the modeling of C. The ultimate goal is to model the information in the C matrix, hence it seems reasonable that C should be used to aid in determining the factors for R.

Partial least squares

The method of PLS (9, 10) is a modeling procedure that simultaneously estimates underlying factors in both R and C. These factors are used to define a subspace in R that is better able to model C. With PCR, the rotation defined by the eigenvectors was used to find a subspace in R that subsequently was used to model C. The approach taken by PLS is very similar to that of PCA, except that factors are chosen to describe the variables in C as well as in R. This is accomplished by using the columns of the C matrix to estimate the factors for R. At the same time, the columns of R are used to estimate the factors for C. The resulting models are

$$\mathbf{R} = \mathbf{T}\mathbf{P} + \mathbf{E} \qquad (5)$$

and

$$\mathbf{C} = \mathbf{U}\mathbf{Q} + \mathbf{F} \qquad (6)$$

where the elements of T and U are called the scores of R and C, respectively, and the elements of P and Q are called the loadings. The matrices E and F are the errors associated with modeling R and C with the PLS model.

The T factors are not optimal for estimating the columns of R as was the case with PCA, but are rotated so as to simultaneously describe the C matrix. In the ideal situation, the sources of variation in R are exactly equal to the sources of variation in C, and the factors for R and C are identical. In real applications, R varies in ways not correlated to the variation in C, and therefore $\mathbf{t} \neq \mathbf{u}$. However, when both matrices are used to estimate factors, the factors for the R and C matrices have the following relationship

$$\mathbf{u} = b\,\mathbf{t} + \mathbf{e} \qquad (7)$$

where the b is termed the inner relationship between u and t and is used to calculate subsequent factors if the intrinsic dimensionality of R is greater than one.

Geometrically, this states that the vectors u and t are approximately equal except for their lengths. To get a feel for where these factors lie in relationship to the original R and C matrices, one can try to picture both of these matrices and their factors plotted in row space. R will have J vectors, one for each column, whereas C will have K vectors corresponding to its columns. The factor t will lie close to the J vectors for R and will be a good estimator for these vectors. Likewise, u will lie close to the K vectors for C. If there is a good underlying model that relates R to C, this will be evident by the similarity of the vectors u and t in this space, and e in Equation 7 will have a small value. If the first principal component were plotted in this same space, one would find that the t vector would differ from it in that t would be rotated toward the C vectors.

[Figure 10. Geometric representation of the workings of PLS.]

In a similar manner, one can attempt to view PLS working in the column spaces of R and C. Figure 10 represents a hypothetical example of a 6 × 3 response matrix and corresponding 6 × 3 concentration matrix plotted in their
column spaces. The directions of the first PLS factors (t and u) would explain much of the variation of the samples in the respective spaces. If the PLS model is valid, plotting t versus u will also yield a linear relationship. In fact, PLS will compromise the ability of the factors to describe the samples in the individual spaces (Figure 10, top) to increase the correlation of t to u (Figure 10, bottom). It is this compromise that allows for the determination of a factor model that simultaneously describes R and C.

For an example of the PLS algorithm, the same R and C matrices used in the MLR example will be employed. The data were mean-centered before the analysis, and the following two factors were determined.

$$\mathbf{t} = \begin{bmatrix} 20.5 \\ -4.7 \\ -15.8 \end{bmatrix}, \qquad \mathbf{u} = \begin{bmatrix} 10 \\ 0 \\ -10 \end{bmatrix}$$

Plotting t versus u (Figure 11) reveals the nearly linear relationship that exists between them. The calculated value of b as defined in Equation 7 is 0.53.

Figure 11. The R matrix factor (t) plotted versus the C matrix factor (u).

If the intrinsic dimensionality of R is greater than one, that is, more than one factor is necessary to describe the variation, additional factors are determined. The relationship given in Equation 7 is then used to describe the model for C. Instead of using C = UQ + F, one can substitute the equality stated in Equation 7 to yield C = bTQ + G. In other words, one can estimate the scores of C from the scores of R.

An analysis of data using PLS can then be summarized as the determination of factors in R and C using all of the information available. The final model consists of score matrices T and U (called scores for R and C) that are linearly related with a coefficient, b_l, describing the relationship for each of the L factors.

To perform prediction on an unknown sample, the model used to derive the scores of R and C and the relationship b_l are used in the following manner. The vector of responses r_unk for the unknown sample is taken through the calibration model (P), and a score vector (t_unk) is calculated. Using b_l and Equation 7, the score t_unk yields an estimate of the scores for the predicted matrix of concentrations (u_unk). The vector u_unk is then transformed into concentration estimates using the calibration model for the C matrix (Q).

For more information concerning the mechanics of calibration and prediction using the PLS algorithm, see Reference 11. For an even more rigorous mathematical treatment of PLS in the context of singular value decomposition and the power method, see Reference 12. The main point of this section is that PLS estimates factors for R and C using all of the information available. In PLS these factors u and t are called latent variables and are similar to the principal components in PCR.

Conclusion

It is instructive to summarize briefly the mechanics of each of the methods
discussed in this REPORT. MLR models C from R using a least-squares criterion. All of R is available to model C, and the method is dependent on the inversion of a matrix. PCR first models [R with the factors U and then uses] MLR to estimate the relationship between U and C. The possibility of deleting factors and thus reducing dimensionality was discussed as its major advantage. Finally, PLS builds factors for R and C using both matrices. The advantage of PLS over PCR is that it incorporates more information in the model-building phase.
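
The mechanics summarized above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`mlr`, `pcr`, `inner_relation`, `pls_first_factor`) and the synthetic data are hypothetical, Equation 7 is assumed to have the form u = bt + e, and the PLS factor is computed with a NIPALS-style iteration, which is one common way of estimating t and u together.

```python
import numpy as np


def mlr(R, C):
    """MLR: least-squares solution S of C = R S, using every column of R."""
    S, *_ = np.linalg.lstsq(R, C, rcond=None)
    return S


def pcr(R, C, n_factors):
    """PCR: build factors from R alone, then use MLR between scores and C."""
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    V = Vt[:n_factors].T             # retained eigenvectors (loadings) of R
    U = R @ V                        # scores of R; C has not been used yet
    Q, *_ = np.linalg.lstsq(U, C, rcond=None)
    return V @ Q                     # maps responses to concentrations


def inner_relation(t, u):
    """Least-squares slope b in u = b t + e (assumed form of Equation 7)."""
    t = np.ravel(t)
    u = np.ravel(u)
    return float(u @ t) / float(t @ t)


def pls_first_factor(R, C, n_iter=500, tol=1e-12):
    """First PLS factor: t and u are estimated together from R and C."""
    u = C[:, 0].copy()
    t_old = np.zeros(R.shape[0])
    for _ in range(n_iter):
        w = R.T @ u
        w /= np.linalg.norm(w)       # weights for R
        t = R @ w                    # scores of R
        q = C.T @ t
        q /= np.linalg.norm(q)       # loadings for C
        u = C @ q                    # scores of C
        if np.linalg.norm(t - t_old) < tol:
            break
        t_old = t
    return w, q, t, u, inner_relation(t, u)


# Synthetic calibration data (hypothetical values): 8 samples, 2 analytes,
# 4 wavelengths, with a small amount of noise added to the responses.
rng = np.random.default_rng(0)
C = rng.uniform(1.0, 10.0, size=(8, 2))
K = np.array([[1.0, 0.5, 0.2, 0.1],
              [0.1, 0.3, 0.8, 1.0]])
R = C @ K + 0.01 * rng.standard_normal((8, 4))
Rc, Cc = R - R.mean(axis=0), C - C.mean(axis=0)   # mean-center both blocks

err_mlr = np.abs(Rc @ mlr(Rc, Cc) - Cc).max()
err_pcr = np.abs(Rc @ pcr(Rc, Cc, n_factors=2) - Cc).max()
w, q, t, u, b = pls_first_factor(Rc, Cc)

# Inner relation for the t and u vectors quoted in the worked PLS example.
b_article = inner_relation([20.5, -4.7, -15.8], [10.0, 0.0, -10.0])
print(err_mlr, err_pcr, b, round(b_article, 2))
```

With the rounded t and u values quoted in the worked example, `inner_relation` gives a slope of about 0.52, which is in line with the reported 0.53 once rounding of the printed scores is taken into account.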

The advantages gained by employing the factor-based methods are realized because of the characteristics of the factors. The first advantage of factor models comes from a computational limitation of MLR. MLR cannot be used in cases where the columns or measurement variables in R are linear combinations of each other. This is because MLR is based on the inversion of the matrix R^T R, which is not possible if the matrix R has columns that are exact linear combinations of the remaining columns. Furthermore, the inversion process is unstable when some columns are nearly linear combinations of others. A model built on such an inverse will give large errors in concentration estimates when noise is present in the R matrix. Much work has been done in statistics in the area of variable selection to eliminate or adjust for collinearities among the variables of R
(e.g., stepwise regression, Mallows' statistics [3]). These nonfactor-based methods are disadvantageous in that they are based on the ultimate deletion of variables, and in doing so they may eliminate useful information. Factor-based models can eliminate collinearities without removing useful information.

The second main advantage of factor-based models is the possibility of removing noise from the data matrix. Again, using PCA as an example, one can demonstrate that the inclusion of all of the eigenvectors can actually decrease the predictive ability of the model. To demonstrate when this is occurring, the analyst can use cross validation (13) and build a factor-based calibration model using a set of samples termed the training set. This calibration model is then applied to the R matrix of an independent set of samples (termed the test set), and an estimate of the C matrix for the test set is obtained. (For the test set, the C matrix is also known.) The sum of squared deviations of the predicted values for the c_ij's from the true c_ij's is calculated for the test set. This value is termed the PRESS value (predictive residual error sum of squares). Factors are then added one at a time, and the PRESS value is calculated. What usually occurs is that the PRESS value levels off or reaches a minimum before all of the factors have been included. Inclusion of additional factors beyond this point actually results in an increased PRESS value, which means poorer prediction. This is because the random variation or noise in the R matrix is used to fit the C matrix (this is termed overfitting). By not including the factors that describe mainly noise, it is possible to increase the predictive ability of the model. Both PCR and PLS offer the capability of such noise reduction.

Data contain relevant information as well as noise and irrelevant information. Chemometricians and statisticians are working to develop the best possible tools that can extract the full complement of useful information from all of the data chemists acquire in their investigations.

References

(1) Sharaf, M. A.; Illman, D. L.; Kowalski, B. R. Chemometrics; Wiley: New York, 1986.
(2) Malinowski, E. R.; Howery, D. G. Factor Analysis in Chemistry; Wiley: New York, 1980.
(3) Draper, N.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: New York, 1981.
(4) Wetzel, D. L. Anal. Chem. 1983, 55, 1165A.
(5) Strang, G. Linear Algebra and Its Applications; Academic Press: New York, 1980.
(6) Wesson, J. R. Lessons in Linear Algebra; Charles E. Merrill Publishing Co.: Columbus, 1974.
(7) Rummel, R. J. Applied Factor Analysis; Northwestern University Press: Evanston, 1970.
(8) Jackson, J. E. Journal of Quality Technology, 1980, 12, 201-13.
(9) Wold, H. In Research Papers in Statistics; David, F., Ed.; Wiley: New York, 1966, pp. 411-44.
(10) Wold, H. In Systems Under Indirect Observation, vol. 2; Joreskog, H.; Wold, H., Eds.; North-Holland: Amsterdam, 1982, pp. 1-54.
(11) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta, 1986, 185, 1-17.
(12) Lorber, A.; Wangen, L.; Kowalski, B. R. Journal of Chemometrics, 1986, 1, 1.
(13) Wold, S. Technometrics, 1978, 20, 397.

Bruce R. Kowalski received his Ph.D. in chemistry from the University of Washington in 1969. In 1972, he joined the chemistry faculty at Colorado State University and then returned to the University of Washington, where he is a member of the chemistry faculty and codirector of the Center for Process Analytical Chemistry. His research interests in analytical chemometrics include developing new methods of multivariate analysis for the resolution and calibration steps in analytical instrumentation, extending various chromatography-spectrometry methods, and using chemical sensors in process analysis and control.

Kenneth R. Beebe is a research assistant at the University of Washington. He received his B.S. in chemistry in 1981, and his M.S. in 1985 from the University of Washington. Currently he is enrolled in the Ph.D. program in the analytical division of the chemistry department. His research interest is in the application of various chemometrics techniques to optimize multivariate calibration procedures.
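
The cross-validation procedure described in the Conclusion (build the model on a training set, predict an independent test set, and track PRESS as factors are added one at a time) can be sketched as follows. This is a minimal illustration that assumes PCR as the factor-based method; the helper names (`pcr_coefficients`, `press`, `simulate`), the pure-component matrix `K`, and the noise level are hypothetical and chosen only for the demonstration.

```python
import numpy as np


def pcr_coefficients(R, C, n_factors):
    """PCR regression matrix from mean-centered R and C."""
    _, _, Vt = np.linalg.svd(R, full_matrices=False)
    V = Vt[:n_factors].T                       # retained eigenvectors of R
    Q, *_ = np.linalg.lstsq(R @ V, C, rcond=None)
    return V @ Q


def press(R_train, C_train, R_test, C_test, n_factors):
    """Sum of squared deviations between predicted and known test-set C."""
    r_mean, c_mean = R_train.mean(axis=0), C_train.mean(axis=0)
    B = pcr_coefficients(R_train - r_mean, C_train - c_mean, n_factors)
    C_pred = (R_test - r_mean) @ B + c_mean
    return float(np.sum((C_pred - C_test) ** 2))


# Hypothetical data: 2 analytes measured at 6 wavelengths, with added noise.
rng = np.random.default_rng(1)
K = np.array([[1.0, 0.8, 0.5, 0.3, 0.1, 0.0],
              [0.0, 0.1, 0.3, 0.5, 0.8, 1.0]])   # made-up pure responses


def simulate(n_samples):
    """Draw concentrations and the corresponding noisy responses."""
    C = rng.uniform(1.0, 10.0, size=(n_samples, 2))
    R = C @ K + 0.2 * rng.standard_normal((n_samples, 6))
    return R, C


R_train, C_train = simulate(15)                # training set
R_test, C_test = simulate(10)                  # independent test set

# Add factors one at a time and record the PRESS value for each model size.
press_values = [press(R_train, C_train, R_test, C_test, k) for k in range(1, 7)]
best = int(np.argmin(press_values)) + 1
for k, value in enumerate(press_values, start=1):
    print(f"{k} factors: PRESS = {value:.3f}")
print("factors at minimum PRESS:", best)
```

Because the simulated responses come from two analytes, PRESS drops sharply when the second factor is added; whether it then levels off or rises with further factors depends on the noise drawn, which is the behavior the analyst inspects when choosing the number of factors.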

ANALYTICAL CHEMISTRY, VOL. 59, NO. 17, SEPTEMBER 1, 1987 · 1017 A
