
CHAPTER 2

Data Analysis and Chemometrics

Paolo Oliveri, Michele Forina
Department of Drug and Food Chemistry and Technology, University of Genoa, Via Brigata Salerno, 13, Genoa, Italy

OUTLINE

2.1. Introduction
  2.1.1. From Data to Information
2.2. From Univariate to Multivariate
  2.2.1. Histograms
  2.2.2. Normality Tests
  2.2.3. ANOVA
  2.2.4. Radar Charts
2.3. Multivariate Data Analysis
  2.3.1. Principal-Component Analysis
  2.3.2. Signal Pre-Processing
  2.3.3. Supervised Data Analysis and Validation
  2.3.4. Supervised Qualitative Modeling
  2.3.5. Supervised Quantitative Modeling
  2.3.6. Artificial Neural Networks

2.1. INTRODUCTION

2.1.1. From Data to Information

Advances in technology and the increasing availability of powerful instrumentation now offer analytical food chemists the possibility of obtaining high amounts of data on each sample analyzed, in a reasonable, often negligible, time frame (Valcárcel and Cárdenas, 2005). Often, in fact, a single analysis may provide a considerable number of measured quantities, generally of the same nature. For instance, gas chromatographic (GC) analysis of fatty acid methyl esters allows us to quantify, with a single chromatogram, the fatty acid composition of a vegetable oil sample (American Oil Chemists' Society, 1998). Spectroscopic techniques as well may supply, with a single and rapid analysis on a sample, multiple data of a homogeneous nature: in fact, a spectrum can be considered as a data vector, in which the order of the variables (e.g., absorbances at consecutive wavelengths) has a physical meaning (Oliveri et al., 2011).


In other cases, a set of samples can be described by a number of heterogeneous chemical and physical parameters at the same time. For example, a global analytical characterization of a tomato sauce may involve the quantification of color and rheological parameters as well as pH and chemical composition and, possibly, a number of sensorial responses (Sharoba et al., 2005). Also in such cases, each sample may be described by a data vector, but without any implication with respect to the order of the variables. Instead, differences in magnitude and scale between the different variables may affect data analysis if a proper pre-processing approach is not followed.

The availability of large sets of data does not mean that information about the samples analyzed is immediately accessible: usually, in fact, a number of steps are required to extract and properly interpret the potential information embodied within the data (Martens and Kohler, 2008).

A deep understanding of the nature of analytical data is the first basic step for any proper data treatment, because different data types usually require different processing strategies, which closely depend on their nature and origin. For this reason, the data analyst should always have a complete awareness of the problem under study and of the whole analytical process from which the data derive, from the sampling to the instrumental analysis. Such knowledge is fundamental: it makes the difference between a chemometrician and a mathematician. A chemometrician is, first of all, a chemist, who is acquainted with his data and utilizes mathematical methods for the conversion of numerical records into relevant chemical information.

The analytical food chemist William Sealy Gosset (1876–1937), who worked at the Arthur Guinness & Son brewery in Dublin, can be considered one of the fathers of chemometrics. In fact, he studied a number of statistical tools and adapted them to better solve actual chemical problems. He had to present his studies under a pseudonym, since his company did not permit him to publish any data. Considering himself a modest contributor to the field, rather than a statistician, he adopted the pen name Student. His most famous work was on the definition of the probability distribution that is commonly referred to as the Student's t distribution (Student, 1908).

The term chemometrics was used for the first time by Svante Wold, in 1972, to identify the discipline that performs the extraction of useful chemical information from complex experimental systems (Wold, 1972).

Statistics offers a number of helpful tools that can be used for converting data into information. Univariate methods, which consider one variable at a time, independently of the others, have been and are still extensively used for such purposes. Nonetheless, they usually supply just partial answers to the problems under study, since they underutilize the potential for discovering global information embodied in the data. For instance, they are not able to take into account the inter-correlation between variables, a feature that can be very informative, if recognized and properly interpreted.

Multivariate strategies are able to take this aspect into account, allowing a more complete interpretation of data structures. However, in spite of their big potential, multivariate methods are generally less used than univariate tools.

On the other hand, a number of people try multivariate analysis as a last-ditch resort, when nothing else seems to provide the desired results, pretending that chemometrics can provide valuable information from data that do not contain any informative feature at all.
Such a demeanor is very hazardous, especially when complex methods are being used, because there is the risk of employing chance correlations to develop models with good performances only in appearance, namely, on the same samples used for model building, but with very poor prediction ability on new samples: this is the so-called overfitting. To overcome such a possibility, a proper validation of the models is always required. In particular, the more complex the technique applied, the deeper the validation recommended. For these reasons, a good understanding of the characteristics of the methods employed for data processing is always advantageous as well.

In this chapter, an overview of the chemometric techniques most commonly used for data analysis in analytical food chemistry will be presented, highlighting the potentials and limits of each one.

2.2. FROM UNIVARIATE TO MULTIVARIATE

A bidimensional table is probably the most typical way to arrange, present, and store analytical data: conventionally, in chemometrics, each row usually represents one of the samples analyzed, while each column corresponds to one of the variables measured.

As an example, Table 2A.1 reports the red-wine data set, which consists of 27 chemical and physical parameters measured on 90 wine samples, belonging to three Italian denominations of origin from the same region (Piedmont): Barolo, Grignolino, and Barbera. The original data set was composed of 178 samples (Forina et al., 1986).

Table 2A.1 also contains additional information, which is usually not processed but which may be extremely helpful in the final understanding and interpretation of the results. In particular, the two heading lines contain the numbers and the names of the variables, which are additional information for the columns, while the two heading columns include the names identifying the samples and their class, which represent additional information for the rows.

It is easy to guess that such data enclose a great deal of potential information. However, simple visual inspection of the table, which contains a considerable number of records, does not directly provide any valuable information about the samples analyzed. A conversion from data into information is necessary. Univariate methods are still the most used in many cases, although they generally offer only a very limited vision of the global situation.

2.2.1. Histograms

A good way to extract information from data is to use graphical tools. Among them, histograms are probably the most widely employed (Chambers et al., 1983).

To build a histogram, the range of interest of the variable under study is divided into a number of regular adjacent intervals. For each interval, the contribution of the measured samples is graphically displayed by a vertical rectangle, whose area is proportional to the frequency (i.e., the number of observations) within that interval. Consequently, the height of each rectangle is equal to the frequency divided by the interval width, so that it has the dimension of a frequency density. Frequently, such frequency values are normalized, dividing each of them by the total number of observations, thus obtaining relative values. It follows that, in such cases, the sum of the areas of all the rectangles, i.e., the sum of all the relative frequencies, is equal to 1.

The frequency distribution visualized by a histogram can be used to estimate the probability distribution of the variable under study and to make deductions about the samples. Figure 2.1 shows examples of histograms for a portion of the data given in Table 2A.1, namely for variables number 13, 21, and 26.

FIGURE 2.1 Histograms for three variables of Table 2A.1: phosphate (a), OD280/OD315 of diluted wines (b), and proline (c).

Three typical patterns are noticeable. In particular, variable 13 (phosphate) shows a unimodal and almost symmetric shape, which may suggest that this variable follows a normal probability distribution (Fig. 2.1a).

Conversely, variable 21 (OD280/OD315 of diluted wines) presents a bimodal distribution, which may suggest that this variable is characterized by different average values for different sample classes (Fig. 2.1b). In such cases, histograms could be drawn for each class separately, to verify the trend of the within-class distributions.

Instead, the histogram shape for variable 26 (proline) reveals an underlying asymmetric distribution (Fig. 2.1c). It is possible to convert such behavior into an almost normal one simply by applying a logarithmic transformation to the variable, as shown in Fig. 2.2.
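To make the construction concrete, the following Python sketch (assuming numpy and matplotlib are available) draws a relative-frequency-density histogram for a simulated, proline-like variable and for its logarithm; the data are synthetic stand-ins, since Table 2A.1 is given in the chapter appendix.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
proline = rng.lognormal(mean=6.5, sigma=0.35, size=90)  # simulated stand-in for variable 26

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

# density=True rescales bar heights so that the total area of the rectangles is 1,
# i.e., each height is a relative frequency divided by the interval width
ax1.hist(proline, bins=12, density=True)
ax1.set(xlabel="Proline", ylabel="Relative frequency density")

# a logarithmic transformation often turns an asymmetric distribution
# into an almost normal one (cf. Fig. 2.2)
ax2.hist(np.log(proline), bins=12, density=True)
ax2.set(xlabel="Log proline", ylabel="Relative frequency density")

plt.tight_layout()
plt.show()
```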
FIGURE 2.2 Histogram for the log-transformed variable proline of Table 2A.1.

2.2.2. Normality Tests

Assessing compatibility with a normal distribution is a basic issue in data analysis, because many methods require variables to be normally distributed. As observed, frequency distributions may be employed for this purpose. Visual examination of histogram shapes may supply a preliminary evaluation. Besides, the cumulative empirical frequency distributions (EFDs) constitute the basis for a family of statistical normality tests, which are usually referred to as Kolmogorov–Smirnov tests (Kolmogorov, 1933; Smirnov, 1939).

One of the most effective and widely employed among them is the Lilliefors test, which may be used for generally assessing how well an empirical distribution fits a theoretical one (Lilliefors, 1970). In the case of normality verification, the null hypothesis (H0) is that the observed empirical frequency distribution for a given variable is not significantly different from the theoretical normal probability distribution, at a given significance level. The alternative hypothesis (H1) is that the observed EFD is not compatible with the theoretical normal distribution, at that significance level.

The test procedure consists in ordering the values of the variable to be tested and normalizing them by means of a Student's transformation (or autoscaling):

$x'_{i,v} = \frac{x_{i,v} - \bar{x}_v}{s_v}$    (2.1)

The variable is corrected by subtracting its mean ($\bar{x}_v$) from each of its values and then dividing by its standard deviation ($s_v$). The autoscaled variable is dimensionless and presents a mean equal to 0 and a standard deviation equal to 1.

Then, the corresponding cumulative theoretical probability distribution is estimated from the statistical parameters computed, and the maximum distance between such a hypothesized distribution and the empirical one is calculated. This value is compared with a critical distance value, at a predetermined significance level, and such a comparison determines the acceptance or rejection of the null hypothesis. The critical values, which depend on the sample size, were obtained by Monte Carlo simulations and are available in tables or statistical software.

The Lilliefors test can also be performed in a graphical way (Iman, 1982), as illustrated in Figs. 2.3 and 2.4 for the same cases of Figs. 2.1 and 2.2. Charts for the Lilliefors test report the cumulative empirical frequency distributions (EFDs) for variables number 13, 21, and 26 of Table 2A.1, after column autoscaling (polygonal curves in Figs. 2.3 and 2.4), together with the cumulative theoretical probability distribution (solid sigmoid curves) and the distance limits according to the Lilliefors test, at a 5% significance (dotted sigmoid curves). When the EFD curve intersects at least one of the limits individuated by the critical distance, the null hypothesis is rejected. As for the examples reported in Fig. 2.3, the null hypothesis is accepted only for the variable phosphate, while for both the other variables examined it is rejected at the same significance level. In fact, only in the first case (Fig. 2.3a) does the polygonal EFD curve not intersect the critical distance lines at any point.

FIGURE 2.3 Graphical Lilliefors normality test for three variables of Table 2A.1: phosphate (a), OD280/OD315 of diluted wines (b), and proline (c). The polygonal curves represent the empirical frequency distributions (EFDs) of the variables after column autoscaling, the solid sigmoid curves represent the cumulative theoretical probability distributions, and the dotted sigmoid curves represent the distance limits according to the Lilliefors test, at a 5% significance.

In addition, it can be easily verified, confirming the deductions made by looking at the histogram of Fig. 2.2, that the logarithmic transformation applied to the variable proline makes it compatible with the normal distribution (see Fig. 2.4).

FIGURE 2.4 Graphical Lilliefors normality test for the log-transformed variable proline of Table 2A.1. The polygonal curve represents the empirical frequency distribution (EFD) of the variable after column autoscaling, the solid sigmoid curve represents the cumulative theoretical probability distribution, and the dotted sigmoid curves represent the distance limits according to the Lilliefors test, at a 5% significance.
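As a practical note, the Lilliefors statistic and its Monte Carlo critical values are implemented in common statistical software; a minimal sketch using the lilliefors function of the Python statsmodels package, applied to a simulated asymmetric variable and to its logarithm, might look as follows.

```python
import numpy as np
from statsmodels.stats.diagnostic import lilliefors

rng = np.random.default_rng(1)
x = rng.lognormal(mean=6.5, sigma=0.35, size=90)   # asymmetric, proline-like variable

for name, values in [("raw", x), ("log-transformed", np.log(x))]:
    # autoscaling (Eqn 2.1); not strictly required here, since the test
    # estimates the mean and standard deviation from the data itself
    z = (values - values.mean()) / values.std(ddof=1)
    stat, p = lilliefors(z, dist="norm")
    verdict = "accepted" if p > 0.05 else "rejected"
    print(f"{name}: D = {stat:.3f}, p = {p:.3f} -> H0 {verdict} at 5% significance")
```

The raw variable should fail the test, while its logarithm should be compatible with normality, mirroring Figs. 2.3c and 2.4.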

2.2.3. ANOVA

Analysis of variance (ANOVA) is the name of a group of statistical methods based on Fisher's F tests, generally aimed at verifying the existence or absence of significant differences between groups of data. The null hypothesis H0 is that all the data derive from the same stochastic population, i.e., that there is no significant difference between the groups considered. In order to verify this hypothesis, the final F test compares the variability between groups with the variability within groups (Box et al., 1978).

The simplest case is the one-way ANOVA, whose procedure is described here with a real numerical example. The two columns of Table 2.1 report the values of the alcoholic degree for Barolo and Barbera wine samples of the red-wine data set (Table 2A.1), respectively, together with some basic descriptive parameters. A summary of all the parameters computed for the ANOVA test is given in Table 2.2. The aim is to assess whether there is a significant difference between the alcohol content of the two wines or not. In fact, although the mean Barbera alcoholic percentage (13.07% abv) is noticeably less than the corresponding Barolo value (13.83% abv), the two respective ranges overlap, so that it might be suspected that the observed difference is due to chance variations.

TABLE 2.1 Alcohol Content (% abv) for Barolo and Barbera Samples of the Red-Wine Data Set, and Basic Statistical Parameters

Barolo    Barbera
14.23     12.86
13.20     12.88
13.16     12.81
14.37     12.70
13.24     12.51
14.20     12.60
14.39     12.25
14.06     12.53
14.83     13.49
13.86     12.84
14.10     12.93
14.12     13.36
13.75     13.52
14.75     13.62
14.38     12.25
13.63     13.16
14.30     13.88
13.83     12.87
14.19     13.32
13.64     13.08
14.06     13.50
12.93     12.79
13.71     13.11
12.85     13.23
13.50     12.58
13.05     13.17
13.39     13.84
13.30     12.45
13.87     14.34
14.02     13.48

Min                  12.85     12.25
Max                  14.83     14.34
Mean                 13.83     13.07
Variance             0.274     0.254
Standard deviation   0.524     0.504

The within-columns variance can be computed as a pooled variance, under the hypothesis that the variances of the different groups are homogeneous. When only two groups, and consequently two variances, are being compared, a Fisher's F test is suitable to verify this preliminary hypothesis. In the given numerical example, the test value is computed as

$F_t = \frac{s^2_{Barolo}}{s^2_{Barbera}} = \frac{0.274}{0.254} = 1.08$    (2.2)

The F critical value, at a 5% right significance level and for 29 degrees of freedom (d.o.f.) at both the numerator and the denominator, is 1.86. So it is possible to conclude that the variances of the two groups considered are not significantly different, at a 5% right significance level.

In problems involving more than two groups, the comparison among variances can be performed with multiple F tests on all the possible pairs, or by means of the Cochran's test or the Bartlett's test (Snedecor and Cochran, 1989). The former is valid when there is an equal number of data in each group, while the latter has a wider applicability.

In the numerical example discussed, the within-columns variance, computed as a pooled variance, corresponds to

$s^2_{within} = \frac{\sum_{c=1}^{C} \sum_{n=1}^{N_c} (x_{n,c} - \bar{x}_c)^2}{N - C} = \frac{15.314}{58} = 0.264$    (2.3)

Conversely, the between-columns variance is computed as

$s^2_{between} = \frac{\sum_{c=1}^{C} N_c (\bar{x}_c - \bar{x})^2}{C - 1} = \frac{8.786}{1} = 8.786$    (2.4)

Finally, the ANOVA F test value is computed as the ratio of the between-columns variance to the within-columns variance, to perform the final test which compares these sources of variation, for the verification of the ANOVA null hypothesis:

$F_{ANOVA} = \frac{s^2_{between}}{s^2_{within}} = \frac{8.786}{0.264} = 33.28$    (2.5)

TABLE 2.2 Full ANOVA Parameters for the Data Given in Table 2.1

Source of variation   d.o.f.   Sum of squares   Variance
Total                 60       10874.485
Mean                  1        10850.384
Between columns       1        8.786            8.786
Within columns        58       15.314           0.264

F test on the variances of columns Barolo and Barbera: computed F ratio = 1.08; critical F value (at 5% significance) = 1.86; significance = 41.8%. ANOVA F test: computed F ratio = 33.28; critical F value (at 5% significance) = 4.01; significance = 0.0%.

The F critical value, at a 5% right significance level, for 1 degree of freedom at the numerator and 58 degrees of freedom at the denominator, is 4.01. From the comparison with the computed test value, it follows that the null hypothesis is rejected at a 5% significance level. The conclusion is that the difference between the alcohol content of Barolo and Barbera samples is significantly larger than the variability within each of the two groups.

ANOVA tests can also be applied when the effect of two variability sources (e.g., type of wine and vintage year) is to be verified: such a scheme is usually called a two-way ANOVA. When a number of replicate measurements are available for each level combination of the two factors (nested two-way ANOVA), the model obtained also allows an estimation of the interaction between the factors, together with its significance.
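The computations of Eqns (2.2)–(2.5) take only a few lines of code; the following sketch uses two simulated groups in place of the Barolo and Barbera columns of Table 2.1 and cross-checks the result against scipy's one-way ANOVA.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
barolo = rng.normal(13.83, 0.52, 30)    # simulated stand-ins for Table 2.1
barbera = rng.normal(13.07, 0.50, 30)

# preliminary F test on the two variances (Eqn 2.2)
f_var = barolo.var(ddof=1) / barbera.var(ddof=1)
print("F variances:", f_var, "critical:", stats.f.ppf(0.95, 29, 29))

groups = [barolo, barbera]
grand = np.concatenate(groups)
n, c = grand.size, len(groups)

ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
s2_within = ss_within / (n - c)                                   # Eqn (2.3)
s2_between = sum(g.size * (g.mean() - grand.mean()) ** 2
                 for g in groups) / (c - 1)                       # Eqn (2.4)
f_anova = s2_between / s2_within                                  # Eqn (2.5)
p = stats.f.sf(f_anova, c - 1, n - c)
print("F ANOVA:", f_anova, "significance:", p)
print("scipy check:", stats.f_oneway(barolo, barbera))
```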
2.2.4. Radar Charts

Radar charts, also known as web charts, spider charts, star charts, cobweb charts, polar charts, star plots, or Kiviat diagrams, are a data display tool that can be considered as a sort of link between univariate and multivariate graphical representations (Chambers et al., 1983).

They consist of circular graphs divided into a number of equiangular spokes, called radii. Each radius represents one of the variables. A point is individuated on it, whose distance from the center is proportional to the magnitude of the related variable for that datum. Finally, all the data points, corresponding to all the variables measured on a sample, are connected with a line, which represents a sort of sample profile.

Usually, each plot represents a single sample, and multiple observations are compared by examining different plots. It is also possible to overdraw several lines on the same chart, although the outcome will be legible only for small data sets. As a matter of fact, when the number of samples is large, such a graphical representation is generally not very functional.

Within radar charts, variables can be represented without any previous scaling, revealing which variables are dominant for a given data set. Nonetheless, when variables are characterized by considerably different scales (as in the case of the red-wine data of Table 2A.1), a preliminary transformation may be helpful in order to make the contribution of all of them visible within the graph, by assuring the same a priori importance.

For instance, by looking at Fig. 2.5, it clearly appears that, without any scaling, four features are dominant, corresponding to the variables number 10, 13, 24, and 16, which are characterized by the highest mean values (see Table 2A.1). The contribution of the remaining 23 variables is not recognizable within these graphs. Furthermore, it is not possible to draw many valuable considerations about the sample profiles. In particular, it can be noticed that Grignolino wines (Fig. 2.5b) are characterized, on average, by smaller values for the four observable variables. It can also be deduced that Barolo (Fig. 2.5a) has a higher contribution from variable 26, while Barbera (Fig. 2.5c) has higher contributions from variables 24 and 10.

FIGURE 2.5 Radar charts of average profiles of Barolo (a), Grignolino (b), and Barbera (c) samples. Numbers from 1 to 27 correspond to the original variables listed in Table 2A.1.

On the other hand, Fig. 2.6 illustrates that, after application of column autoscaling (see Eqn (2.1)), the a priori differences in location and dispersion among the original variables are eliminated, thus showing the contribution of all of them and highlighting the differences among the observations. In fact, in this second graph, the profiles of the three wines appear much more dissimilar than in the previous one. By a joint examination of the three radar charts of Fig. 2.6, it can be deduced that Barolo and Barbera samples present two rather complementary profiles, while the Grignolino profile is somewhat intermediate. In particular, Barolo (Fig. 2.6a) is characterized by higher average values of variables 1, 2, 13, 15, 16, 18, 21, 22, 23, and 26. Instead, Grignolino (Fig. 2.6b) presents lower average values of all the variables, except for the number 20. Finally, Barbera (Fig. 2.6c) has a bigger contribution from variables 3, 4, 5, 9, 17, 19, and 24. Such deductions may be useful for characterization purposes.

FIGURE 2.6 Radar charts of average profiles of Barolo (a), Grignolino (b), and Barbera (c) samples. Numbers from 1 to 27 correspond to the variables listed in Table 2A.1, reprocessed by application of column autoscaling.
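A chart in the spirit of Fig. 2.6 can be drawn with matplotlib polar axes; in the following sketch the 90 x 27 data matrix is simulated, and the class-mean profiles are plotted after column autoscaling.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
X = rng.normal(size=(90, 27))                    # simulated 90 samples x 27 variables
classes = np.repeat(["Barolo", "Grignolino", "Barbera"], 30)

Xa = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # column autoscaling (Eqn 2.1)

angles = np.linspace(0, 2 * np.pi, Xa.shape[1], endpoint=False)
ax = plt.subplot(projection="polar")
for wine in ("Barolo", "Grignolino", "Barbera"):
    profile = Xa[classes == wine].mean(axis=0)
    # close the polygon by repeating the first point at the end
    ax.plot(np.append(angles, angles[0]), np.append(profile, profile[0]), label=wine)
ax.set_xticks(angles)
ax.set_xticklabels(range(1, 28))
ax.legend(loc="upper right")
plt.show()
```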
2.3. MULTIVARIATE DATA ANALYSIS

2.3.1. Principal-Component Analysis

Principal-component analysis (PCA), which originates in the work of K. Pearson (1901), is one of the basic and most useful tools in the branch of multivariate analysis. It is an exploratory method, which always offers an overview of the problem studied and often allows significant conclusions to be drawn and decisions to be made on the basis of the observed results. Furthermore, PCA can be employed for feature and noise reduction purposes and constitutes the basis for other more complex pattern recognition techniques.

PCA is based on the assumption that a high variability (i.e., a high variance value) is synonymous with a high amount of information. For this reason, PCA algorithms search for the maximum variance direction in the multidimensional space of the original data, preferably passing through the data centroid, which means that the data have to be at least mean-centered column-wise. The maximum variance direction represents the first principal component (PC). The second PC is the direction which keeps the maximum variance among all directions orthogonal to the first PC. It follows that the second PC explains the maximum information not explained by the first one or, in other words, that these two new variables are not inter-correlated. The process continues with the identification of the subsequent PCs: it may stop on reaching a variance cutoff value or continue until all the variability enclosed in the original data has been explained (Jolliffe, 2002).

Since the variance values depend on the scale of the variables, it becomes difficult to compare and impossible to combine information from variables of a different nature, unless they are properly normalized: column autoscaling (see Eqn (2.1)) is the most commonly applied transform.

Each sample can be projected in the space defined by the new variables: the coordinate values obtained are called scores.

The PCs are expressible as linear combinations of the original variables: the coefficients which multiply each variable are called loadings. They represent the cosine values (director cosines) of the angles between the PCs and the original variables. These values may vary between −1 and +1, indicating the importance of each variable in defining a given PC: the larger the cosine absolute value, the closer the two directions, and thus the larger the contribution of the original variable to the PC.

In terms of matrix algebra, the rotation from the space of the original variables to the PC space is performed by means of the orthogonal loading matrix L:

$\mathbf{S}_{N \times V} = \mathbf{X}_{N \times V} \mathbf{L}_{V \times V}$    (2.6)

where S is the score matrix and X is the original matrix, constituted by N objects (rows) described by V variables (columns).

One of the key features of PCA is its high capability for representing large amounts of complex information by way of simple bidimensional or tridimensional plots. In fact, the space described by two or three PCs can be used to represent the objects (score plot), the original variables (loading plot), or both objects and variables (biplot) (Geladi et al., 2003; Kjeldahl and Bro, 2010). Since principal components are not inter-correlated variables, no duplicate information is shown in PC plots.

Figure 2.7 represents an example of a highly informative biplot, which derives from PCA performed on the red-wine data set given in Table 2A.1. The data have been previously autoscaled in order to eliminate the magnitude differences among the variables. The two Cartesian axes correspond to the first (meaning low-order) two PCs, which together show 44.2% of the information (defined as explained variance) enclosed in the original multidimensional data space.

FIGURE 2.7 Example of PCA biplot for the autoscaled red-wine data. The scores (symbols) correspond to the wine samples of classes Barolo (circles), Grignolino (squares), and Barbera (triangles), respectively. The loadings (line segments and numbers from 1 to 27) represent the contribution of the original variables, as listed in Table 2A.1, to the information visualized in the plot.

The plot clearly shows the interrelations existing among samples, among variables, and between samples and variables. Moreover, considering also the additional row information given in Table 2A.1 (namely, the class of each sample, graphically represented by different colors), it is possible to get information about the discrimination among the three wine categories (Barolo, Grignolino, and Barbera), the dispersion of samples within each class, and the discriminatory importance of the variables measured.

In particular, it appears that PC1, which accounts for 27.4% of the total variance, i.e., of the information, is a direction effective in distinguishing among the three wine classes, especially between Barolo (circles) and Barbera (triangles) samples. Instead, PC2 (explaining 16.8% of the total variance) is useful mainly in differentiating Grignolino samples (squares) from the other two groups.

The variables which present the highest loading absolute values on PC1 are the numbers 15, 16, 21, 22, 3, 4, 5, 6, 9, and 17. This means that such variables are the most important in defining PC1 and, consequently, in discriminating among the three wine classes, particularly Barolo from Barbera. In more detail, looking at the correspondences between scores and loadings, it can be deduced from this plot that the samples that present on average the highest values for variables 3, 4, 5, 6, 9, and 17 and the lowest values for variables 15, 16, 21, and 22 belong to class Barbera. Just the opposite considerations are applicable to the samples of class Barolo (low values for the variables of the first group and high values for the variables of the second group). Instead, Grignolino wines lie in a halfway position, meaning that they are characterized by intermediate values for all these variables. Conversely, variable number 20 has the highest loading value on PC2 and is clearly in the same direction as the Grignolino cluster. This means that Grignolino wines have, on average, high values of variable 20. Opposite considerations are valid for variables 1, 2, 8, 10, 19, 23, and 24.

All of these inferences are fully in accord with the deductions already made by inspection of the radar charts in Fig. 2.6. Moreover, the information that variable 21 (OD280/OD315 of diluted wines) has a highly discriminant power is in perfect agreement with the bimodal distribution observed in the histogram of Fig. 2.1b for the same variable.

Variables 7, 12, 25, and 27 have very small loading values on both PCs, meaning that such variables give a negligible contribution to the portion of information visualized in this plot.

Further considerations may be drawn by looking at the distribution of samples inside each class. For instance, it can be easily seen that Barolo wines are characterized by the lowest within-class variability. In the case of quality control, a low sample variability about a target value is an index of high quality (Taguchi, 1986).

It is worth noticing that a single and simple biplot is able to report a considerable amount of information that would require a large number of univariate plots and tests to be extracted: PCA is, without any doubt, the most efficient way to account for the information enclosed in a data table.
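In code, the autoscaling and projection underlying a plot such as Fig. 2.7 reduce to a few scikit-learn calls; the following sketch, on a simulated data matrix, extracts scores, loadings, and explained variances and overlays them in a rudimentary biplot.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(90, 27))                 # replace with the real data matrix

Xa = StandardScaler().fit_transform(X)        # column autoscaling
pca = PCA(n_components=2)
scores = pca.fit_transform(Xa)                # S = X L (Eqn 2.6), first two columns
loadings = pca.components_.T                  # one row per original variable

print("explained variance (%):", 100 * pca.explained_variance_ratio_)

fig, ax = plt.subplots()
ax.scatter(scores[:, 0], scores[:, 1], s=15)          # score plot
for v, (lx, ly) in enumerate(loadings, start=1):      # overlay loadings -> biplot
    ax.annotate(str(v), (3 * lx, 3 * ly))             # arbitrary scaling for visibility
ax.set(xlabel="PC1", ylabel="PC2")
plt.show()
```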
2.3.2. Signal Pre-Processing

Instrumental analytical techniques commonly provide information in the form of digital signals. Spectra, chromatograms, and voltammograms are typical examples. Such signals generally require suitable pretreatment, since the analytical information is not their exclusive component. A number of different variations, from sources other than the analytical system under investigation, generally affect signals. They may be related either to the electric instrumentation components or to the surroundings. In particular, unwanted signal variations may be random or systematic. The former are generally due to sporadic interferences or associated with random phenomena (e.g., Brownian motions of particles and thermal motion of electrons, the so-called Johnson–Nyquist noise) which usually follow a standard normal or a Poisson probability distribution. This type of noise, also called white noise, is characterized by frequency values higher than those of the useful signal. Instead, systematic unwanted variations are commonly due to instrumental trends or to external influences. They may affect the signal with baseline shifts and/or drifts, which can be considered as a low-frequency contribution.

Signal processing is generally aimed at minimizing the unwanted variations, thus improving the quality of the signals and, consequently, the conversion of data into valuable information. In particular, it is possible to individuate three main objectives: reduction of random noise, reduction of systematic unwanted variations, and reduction of data size. Several pre-processing techniques address more than one of these points. Furthermore, in some cases, the transformation itself facilitates the interpretation of complex signals, as in the case of derivatives.

When several digital signals are structured into a data matrix, each of them corresponding to a row, following the chemometric convention, signal pre-processing is also known as row pre-processing. The mathematical transforms act on each single signal independently of the others.

Techniques for the reduction of random noise include the moving average (or boxcar) filters, the Savitzky–Golay smoothing (Savitzky and Golay, 1964), and the Fourier transform (FT)-based filters (Reis et al., 2009). Regarding the elimination or minimization of unwanted systematic effects, a number of mathematical methods for signal transformation are widely employed, such as the standard normal variate (SNV) transform and derivatives.

2.3.2.1. Standard Normal Variate (SNV) Transform

The SNV transform, or row autoscaling, is particularly applied in spectroscopy, since it is useful to correct for both baseline shifts and global intensity variations (Barnes et al., 1989). Each signal ($x_i$) is row-centered, by subtracting its mean ($\bar{x}_i$) from each single value ($x_{i,v}$), and then scaled by dividing by the signal standard deviation ($s_i$):

$x'_{i,v} = \frac{x_{i,v} - \bar{x}_i}{s_i}$    (2.7)

After transformation, each signal presents a mean equal to 0 and a standard deviation equal to 1.

SNV has the peculiarity of possibly shifting informative regions along the signal range, so that the interpretation of results referring to the original signals should be performed with caution (Fearn, 2009).
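The transform itself is a one-line row operation; a sketch follows, with a toy pair of signals differing only in baseline and overall intensity.

```python
import numpy as np

def snv(spectra: np.ndarray) -> np.ndarray:
    """Row autoscaling (Eqn 2.7): center each signal on its own mean
    and divide by its own standard deviation."""
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# toy example: two identical peaks with different baseline and gain
x = np.linspace(0, 1, 200)
peak = np.exp(-((x - 0.5) ** 2) / 0.005)
spectra = np.vstack([peak + 0.2, 3.0 * peak - 0.1])
corrected = snv(spectra)
print(np.allclose(corrected[0], corrected[1]))   # True: SNV removes shift and gain
```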
2.3.2.2. Derivatives

The numerical differentiation of digitized signals may correct for baseline shifts and drifts, depending on the derivation order. Furthermore, derivative profiles often exhibit an increased apparent resolution of overlapping peaks and may accentuate small structural differences between nearly identical signals (Taavitsainen, 2009).

The first derivative of a signal y = f(x) is the rate of change of y with x (i.e., y' = dy/dx), which can be interpreted, at the single points, as the slope of the line tangent to the signal. It provides a correction for baseline shifts.

The second derivative can be considered as a further derivation of the first derivative (y'' = d²y/dx²); it represents a measure of the curvature of the original signal, i.e., the rate of change of its slope. Such a transform provides a correction for both baseline shifts and drifts.

A disadvantageous consequence of derivation may be an enhancement of the random noise, which is characterized by high-frequency slope variations. To overcome this hurdle, signals are first smoothed, often by using the Savitzky–Golay algorithm (Savitzky and Golay, 1964) with a third-order polynomial.
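In practice, the smoothing and the differentiation can be performed in a single step with scipy's savgol_filter; the window width below is an illustrative choice.

```python
import numpy as np
from scipy.signal import savgol_filter

x = np.linspace(0, 10, 500)
dx = x[1] - x[0]
signal = np.exp(-((x - 5) ** 2)) + 0.3 * x + 1.0             # peak + baseline drift
signal += np.random.default_rng(5).normal(0, 0.01, x.size)   # white noise

# third-order polynomial fitted in a moving 21-point window;
# deriv=1 removes the constant baseline, deriv=2 also removes the linear drift
d1 = savgol_filter(signal, window_length=21, polyorder=3, deriv=1, delta=dx)
d2 = savgol_filter(signal, window_length=21, polyorder=3, deriv=2, delta=dx)
print(d1[:3], d2[:3])
```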

2.3.2.3. Horizontal Alignment

When a series of chromatograms is used as vectors to build a data matrix, a typical problem arises with the horizontal shifts that commonly characterize such type of data. The most common methods for peak alignment, such as correlation optimized warping (COW), search for the maximum correlation between a selected reference profile and a series of piecewise modified (shifted and warped) versions of the signals to be aligned (Nielsen et al., 1998; Jellema, 2009).
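A full COW implementation is beyond the scope of a short example; the following simplified sketch illustrates only the correlation-maximization idea with a rigid (shift-only) correction, whereas COW additionally stretches and compresses piecewise segments.

```python
import numpy as np

def align_by_shift(signal: np.ndarray, reference: np.ndarray, max_shift: int = 50) -> np.ndarray:
    """Rigidly shift `signal` so that its correlation with `reference` is
    maximal; a much simpler correction than COW, which also warps segments."""
    shifts = range(-max_shift, max_shift + 1)
    corr = [np.dot(np.roll(signal, s), reference) for s in shifts]
    best = list(shifts)[int(np.argmax(corr))]
    return np.roll(signal, best)

x = np.linspace(0, 1, 400)
reference = np.exp(-((x - 0.50) ** 2) / 0.001)
shifted = np.exp(-((x - 0.55) ** 2) / 0.001)     # same peak, horizontally shifted
aligned = align_by_shift(shifted, reference)
print(abs(aligned - reference).max() < 0.05)     # True: the peaks now coincide
```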
2.3.3. Supervised Data Analysis and Validation

Exploratory techniques for data analysis, such as PCA, are unsupervised, meaning that they just show the data as they are. Conversely, supervised chemometric methods look for determined features within the data, explicitly oriented to address particular issues.

In particular, when a model is developed with the purpose of predicting a qualitative or quantitative property of interest, its reliability in prediction should be assessed prior to using the model in practice. Prediction ability values should be presented together with their confidence interval (Forina et al., 2001; Forina et al., 2007), which depends on the number of samples used for the validation. The estimation of the predictive ability on new samples, not used for building the models, is a fundamental step in any modeling process, and several procedures have been deployed for this purpose. The most common validation strategies divide the available samples into two subsets: a training (or calibration) set used for calculating the model and an evaluation set used for assessing its reliability. A key feature of an honest validation is that the test samples have to be absolutely extraneous to the model: no information from them can be used in building the model or in the pre-processing stages, otherwise the prediction ability may be overestimated.

In many modeling techniques, some parameters are optimized by looking for the setting that provides the maximum predictive ability of the model for a given sample subset. In such cases, a correct validation strategy would involve three sample subsets: a training set, an optimization set, and an evaluation set. The optimization set is used to find the best modeling settings, while the actual reliability of the final model is estimated by way of a real prediction on the third subset, formed by objects that have never influenced the model.

The evaluation of the predictive ability of a model can be performed in a unique step or many times with different evaluation sets, depending on the strategy adopted.

2.3.3.1. Single Evaluation Set

A single evaluation set is the simplest and most rapid validation scheme. A fraction, usually between 50% and 90% of the total number, of the available samples constitutes the training set, while the remaining objects form the evaluation set. The subdivision may be arbitrary, random, or performed by way of a uniform design, such as the Kennard and Stone and the duplex algorithms (Kennard and Stone, 1969; Snee, 1977), which allow two subsets to be obtained that are uniformly distributed and representative of the total sample variability.

2.3.3.2. Cross-Validation (CV)

Cross-validation is probably the most common validation procedure. The N available samples are divided into G cancellation groups following a predetermined scheme (e.g., contiguous blocks or Venetian blinds). The model is computed G times: each time, one of the cancellation groups is used as the evaluation set, while the other groups constitute the training set. At the end of the procedure, each sample has been used G − 1 times for building a model and once for evaluation. The number of cancellation groups usually ranges from 3 to N. Cross-validation with N cancellation groups is generally known as the leave-one-out procedure (LOO). LOO has the advantage of being unique for a given data set, whereas, when G < N, different orders of the samples and different subdivision schemes generally supply different outcomes. However, especially when the total number of samples is considerable, predictions made on a unique object, although repeated many times, may yield an overly optimistic result. An extensive evaluation strategy consists in performing cross-validation many times, with different numbers of cancellation groups, from 3 up to N. Another possibility is to repeat the validation, for a given number G < N of cancellation groups, each time permuting the order of the samples, thus obtaining a different group composition each time.

2.3.3.3. Repeated Evaluation Set

This procedure, also called Monte Carlo validation, computes many models (often many thousands), each time creating a different evaluation set, with a variable number of samples, by random selection. Each sample may fall many times, or even not at all, in the evaluation set. The main drawback of this validation strategy is the longer computational time.
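With scikit-learn, cross-validation with G cancellation groups and its leave-one-out limit are readily available; in the sketch below, the data and the k-NN classifier are placeholders.

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (45, 5)), rng.normal(2, 1, (45, 5))])
y = np.array([0] * 45 + [1] * 45)

model = KNeighborsClassifier(n_neighbors=3)

for g in (3, 5, 10):
    cv = KFold(n_splits=g, shuffle=True, random_state=0)   # G cancellation groups
    scores = cross_val_score(model, X, y, cv=cv)
    print(f"G = {g:2d}: mean prediction rate = {scores.mean():.3f}")

loo = cross_val_score(model, X, y, cv=LeaveOneOut())       # G = N
print(f"LOO: mean prediction rate = {loo.mean():.3f}")
```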

2.3.4. Supervised Qualitative Modeling

2.3.4.1. Classification and Class-Modeling

A wide number of issues within food science require qualitative answers. This is the case for the characterization of ingredients or finished products, the verification of geographical origin and, more generally, quality control, the control of food adulterations, and so on.

Discriminant classification and class-modeling techniques represent the most common chemometric tools for addressing such aims (Oliveri and Downey, 2012). In fact, they build mathematical rules or models able to characterize a sample with respect to a qualitative property, namely the class to which it belongs.

A class (or category) is defined as a group of samples having in common the same values of discrete variables or proximate values of continuous variables. Frequently, such variables are qualitative factors that cannot be determined experimentally, so that they have to be estimated from the values of some experimentally measurable predictors, by way of suitable mathematical tools.

In more detail, discriminant classification techniques are able to determine to which class a sample most probably belongs, among a number of predefined classes. They work by building a delimiter between the classes, and each new object is then always assigned to the category to which it most probably belongs, even in the case of objects which are not pertinent to any class studied.

Instead, class-modeling techniques verify whether a sample is compatible or not with the characteristics of a given class of interest. In fact, they provide an answer to the general question: "Is sample X, claimed to belong to class A, actually compatible with the class A model?" This is essentially the question to be answered in most of the real qualitative problems studied within the food sciences. Such an approach is also capable of detecting outliers (Forina et al., 2008).

2.3.4.2. Evaluation Parameters

The effectiveness of a classification rule is usually evaluated by the classification rate, i.e., the percentage of objects correctly classified. This parameter is often indicated as the prediction rate, when it is estimated by means of an evaluation sample set. A classification rule can be considered valuable when the prediction rate is significantly bigger than the null-classification rate, which is defined as the probability percentage of chance correct assignments and corresponds to 100% divided by the number of categories.

A class model is characterized by two parameters: sensitivity and specificity. Sensitivity is defined as the percentage of objects belonging to the modeled class which are rightly accepted by the model. Specificity is the percentage of objects not belonging to the modeled class which are rightly rejected by the model. A class-modeling technique builds a class space, whose wideness corresponds to the confidence interval, at a pre-selected confidence level, for the class objects: sensitivity is an experimental measure of this confidence level. A decrease in the confidence level for the modeled class generally reduces the sensitivity and increases the specificity of the model. Frequently, in order to evaluate the model performance taking both features into account, an efficiency parameter is computed as the geometric mean of sensitivity and specificity.

When at least two classes are modeled, the results of the class-modeling analysis can be visualized by way of Coomans' plots (Coomans et al., 1984). Such graphs represent the samples in relation to their distances from the models of two given classes. Often, the distances are normalized by dividing by the critical distance value that characterizes the corresponding model.

In the example given in Fig. 2.8, the two Cartesian axes correspond to the distances from the model of class Barolo and from the model of class Grignolino (red-wine data set), respectively, while two straight lines parallel to the axes describe the limits of the corresponding class spaces at a 95% confidence level. The plot area is divided into four regions, which contain, respectively: the samples accepted by the model of class Barolo (upper left rectangle), the samples accepted by the model of class Grignolino (lower right rectangle), the samples accepted by both models (lower left square), and the objects rejected by both models (upper right square). All the samples belonging to the class Barbera correctly lie inside this last area.
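Given the boolean acceptance decisions of a class model on an evaluation set, the three parameters reduce to simple counts; a minimal sketch follows (the helper function and the toy decisions are hypothetical).

```python
import numpy as np

def class_model_performance(accepted: np.ndarray, is_target: np.ndarray):
    """accepted: True where the model accepts the object;
    is_target: True where the object really belongs to the modeled class."""
    sensitivity = accepted[is_target].mean()          # fraction of class objects accepted
    specificity = (~accepted[~is_target]).mean()      # fraction of foreign objects rejected
    efficiency = np.sqrt(sensitivity * specificity)   # geometric mean of the two
    return sensitivity, specificity, efficiency

# toy evaluation set: 10 target-class objects, 10 others, with some model errors
is_target = np.array([True] * 10 + [False] * 10)
accepted = np.array([True] * 9 + [False] + [False] * 8 + [True] * 2)
print(class_model_performance(accepted, is_target))   # (0.9, 0.8, ~0.85)
```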

FIGURE 2.8 Example of a Coomans' plot for the red-wine data set given in Table 2A.1. The samples are represented by class symbols: circles (Barolo), squares (Grignolino), and triangles (Barbera). The abscissa reports the normalized distance from the model of class Barolo and the ordinate the normalized distance from the model of class Grignolino.

Classification and class-modeling techniques belong to three main families:

• distance-based techniques
• probabilistic techniques
• experience-based techniques.

2.3.4.3. Distance-Based Techniques

2.3.4.3.1. K NEAREST NEIGHBORS (k-NN)

k-NN is one of the simplest approaches for classification (Vandeginste et al., 1998). As the first step, k-NN computes the distances of the test sample from each of the samples of a training set, whose class membership is known. Usually, the multivariate Euclidean distance is employed:

$D_{i,j} = \sqrt{\sum_{v=1}^{V} (x_{i,v} - x_{j,v})^2} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)'(\mathbf{x}_i - \mathbf{x}_j)}$    (2.8)

with $\mathbf{x}_i$ and $\mathbf{x}_j$ being two sample data vectors and $x_{i,v}$ the value of variable v for sample i.

In some cases, the Mahalanobis distance can be used. It can be considered as a Euclidean distance modified to take into account the dispersion and the correlation of all of the variables:

$D_{i,j} = \sqrt{(\mathbf{x}_i - \mathbf{x}_j)' \mathbf{V}^{-1} (\mathbf{x}_i - \mathbf{x}_j)}$    (2.9)

where V is the covariance matrix.

Once the matrix of distances between objects has been computed, the k samples nearest to the test sample are taken into consideration to perform the classification: generally, a majority vote is employed, meaning that the new object is classified into the class most represented within the k selected objects. Being a distance-based method, k-NN is sensitive to the measurement units and to the scaling procedures applied.

The method provides a nonlinear delimiter between categories, generally expressible as a piecewise linear function (see Fig. 2.9). The delimiter usually becomes smoother for elevated values of k. When the parameter k is optimized to obtain the highest prediction ability for a given data set, validation should be performed by way of a three-set procedure.

Being a nonprobabilistic method, k-NN is free from statistical assumptions, such as normality of the variable distributions, and from restrictions on the number of variables. This assures a wide applicability. Furthermore, in many applications, it has been shown to perform as well as or better than more complex methods (Vandeginste et al., 1998; Dudoit et al., 2002).
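A minimal k-NN mirroring Eqns (2.8) and (2.9) can be written around scipy's cdist, which supports both metrics; the function below is a bare-bones sketch, not an optimized implementation.

```python
import numpy as np
from collections import Counter
from scipy.spatial.distance import cdist

def knn_predict(X_train, y_train, X_test, k=3, metric="euclidean", VI=None):
    """Majority-vote k-NN; the default metric reproduces Eqn (2.8),
    metric='mahalanobis' with VI = inverse covariance matrix, Eqn (2.9)."""
    kwargs = {"VI": VI} if metric == "mahalanobis" else {}
    D = cdist(X_test, X_train, metric=metric, **kwargs)
    nearest = np.argsort(D, axis=1)[:, :k]
    return np.array([Counter(y_train[rows]).most_common(1)[0][0] for rows in nearest])

rng = np.random.default_rng(7)
X_train = np.vstack([rng.normal(0, 1, (30, 4)), rng.normal(3, 1, (30, 4))])
y_train = np.array(["A"] * 30 + ["B"] * 30)
X_test = rng.normal(1.5, 1, (5, 4))

VI = np.linalg.inv(np.cov(X_train, rowvar=False))   # inverse covariance matrix V^-1
print(knn_predict(X_train, y_train, X_test, k=5))
print(knn_predict(X_train, y_train, X_test, k=5, metric="mahalanobis", VI=VI))
```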

2.3.4.3.2. A NONPARAMETRIC CLASS-MODELING TECHNIQUE

Derde et al. (1986) presented a simple and efficient nonparametric class-modeling technique, closely related to k-NN, which defines the class space on the basis of a critical distance from the objects of the training set (see Fig. 2.10). Several settings can be varied, such as the type of distance (Euclidean or Mahalanobis) and the strategy adopted to determine the critical distance value. Unfortunately, this promising class-modeling technique is no longer used, and it would merit a thorough reconsideration.

FIGURE 2.9 Example of k-NN class delimiter for k = 1 (artificial data).

FIGURE 2.10 Example of nonparametric class space (shadowed region) for the class of interest (artificial data).

2.3.4.3.3. SOFT INDEPENDENT MODELING OF CLASS ANALOGY (SIMCA)

Soft independent modeling of class analogy (SIMCA) (Wold and Sjöström, 1977) was the first class-modeling technique introduced into chemometrics. This method builds class models based on PCA performed using only the samples of the category studied, generally after within-class autoscaling or centering. In more detail, SIMCA models are defined by the range of the sample scores on a selected number of low-order principal components (PCs), ideally the significant ones; the models therefore correspond to rectangles (two PCs), parallelepipeds (three PCs), or hyper-parallelepipeds (more than three PCs), referred to as the multidimensional boxes of the SIMCA inner space. Conversely, the principal components not used to describe the model define the outer space, which represents the space of uninformative variations, often due to noise. The score range can be enlarged or reduced, mainly depending on the number of samples, to avoid the possibility of under- or overestimation of the true variability (Forina and Lanteri, 1984). The standard deviation of the distances of the objects in the training set from the model corresponds to the class standard deviation. The boundaries of the SIMCA space around the model are determined (as shown in Fig. 2.11) by a critical distance, which is obtained by means of the Fisher statistics.

FIGURE 2.11 SIMCA normal-range model (segment) and class space (shadowed region) for the class of interest (artificial data).

SIMCA is a very flexible technique, since it allows variation in a large number of parameters, such as the scaling or weighting of the original variables, the number of components, and an expanded or contracted score range.
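The core mechanism, a PCA inner space fitted on the target class plus an F-based critical distance on the residuals, can be sketched as follows; the distance and degrees-of-freedom choices below are simplified assumptions, not the exact formulation of the original SIMCA papers.

```python
import numpy as np
from scipy import stats
from sklearn.decomposition import PCA

class SimpleSIMCA:
    """Sketch of a SIMCA-like class model (simplified): PCA on the target
    class defines the inner space; objects are accepted when their residual
    distance from the model is below an F-based critical distance."""

    def __init__(self, n_components=2, alpha=0.05):
        self.a, self.alpha = n_components, alpha

    def fit(self, Xc):
        n, v = Xc.shape
        self.mean_ = Xc.mean(axis=0)
        self.pca_ = PCA(n_components=self.a).fit(Xc - self.mean_)
        resid = self._residuals(Xc)
        self.dof1_ = v - self.a
        self.dof2_ = (v - self.a) * (n - self.a - 1)
        self.s0_ = np.sqrt((resid ** 2).sum() / self.dof2_)  # class standard deviation
        return self

    def _residuals(self, X):
        Z = X - self.mean_
        return Z - self.pca_.transform(Z) @ self.pca_.components_

    def accepts(self, X):
        d = np.sqrt((self._residuals(X) ** 2).sum(axis=1) / self.dof1_)
        f_crit = stats.f.ppf(1 - self.alpha, self.dof1_, self.dof2_)
        return (d / self.s0_) ** 2 <= f_crit

rng = np.random.default_rng(8)
barolo = rng.normal(0, 1, (30, 6))      # target class (simulated)
others = rng.normal(3, 1, (20, 6))      # foreign objects
model = SimpleSIMCA(n_components=2).fit(barolo)
print("sensitivity:", model.accepts(barolo).mean())
print("specificity:", 1 - model.accepts(others).mean())
```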
same Mahalanobis distance from the centroid.
Because of the above-mentioned LDA hypoth-
2.3.4.4. Probabilistic Techniques eses, the ellipses of different categories present
2.3.4.4.1. LINEAR DISCRIMINANT ANALYSIS equal eccentricity and axis orientation: they
(LDA) only differ for their location in the plane. By con-
Linear discriminant analysis (LDA) is necting the intersection points of each couple of
the first multivariate classification technique, corresponding ellipses, a straight line is identi-
introduced by Fisher (1936). It is a probabilistic fied which corresponds to the delimiter between
parametric technique, i.e., it is based on the the two classes (see Fig. 2.12). For this reason, this
estimation of multivariate probability density technique is called linear discriminant analysis.
functions, which are entirely described by The directions which maximize the separation
a minimum number of parameters: means, vari- between pairs of classes are the so-called canon-
ances, and covariances. LDA is based on the ical variables.
hypotheses that the probability density distribu-
tions are multivariate normal and that the disper- 2.3.4.4.2. QUADRATIC DISCRIMINANT
sion is the same for all the categories. This means ANALYSIS (QDA)
that the varianceecovariance matrix is the same Quadratic discriminant analysis (QDA) is
for all of the categories, while the centroids are a probabilistic parametric classification technique
different (different location). In the case of two which represents an evolution of LDA for
variables, the probability density function is bell- nonlinear class separations. Also QDA, like
shaped and its elliptic section lines correspond LDA, is based on the hypothesis that the

I. ANALYTICAL TECHNIQUES
44 2. DATA ANALYSIS AND CHEMOMETRICS

probability density distributions are multivariate statistics to define a class space, whose boundary
normal but, in this case, the dispersion is not the is an ellipse (two variables), an ellipsoid (three
same for all of the categories. It follows that the variables), or a hyper-ellipsoid (more than three
categories differ not only for the position of their variables). The dispersion of the class space is
centroid but also for the varianceecovariance defined by the critical value of the T2 statistics
matrix (different location and dispersion). Conse- at a selected confidence level (see Fig. 2.14). The
quently, the ellipses of different categories differ eccentricity and the orientation of the ellipse
also for eccentricity and axis orientation (Geisser, depend on the correlation between the variables
1964). By connecting the intersection points of and on their dispersion.
each couple of corresponding ellipses (at the These probabilistic techniques present some
same Mahalanobis distance from the respective restrictions on the number of objects that can
centroids), a quadratic delimiter is identified, be used. From a strictly mathematical point of
which is a parabola in the bidimensional case, as view, objects have to be one more than the
represented in Fig. 2.13. number of variables measured. Nevertheless,
2.3.4.4.3. UNEQUAL CLASS MODELS (UNEQ)

UNEQ is a powerful class-modeling technique, which originated in the work of Hotelling (1947) and was introduced into chemometrics by Derde and Massart (1986). The method, closely related to QDA, is based on the hypothesis of a multivariate normal distribution in each category studied and on the use of Hotelling's T² statistic to define a class space, whose boundary is an ellipse (two variables), an ellipsoid (three variables), or a hyper-ellipsoid (more than three variables). The dispersion of the class space is defined by the critical value of the T² statistic at a selected confidence level (see Fig. 2.14). The eccentricity and the orientation of the ellipse depend on the correlation between the variables and on their dispersion.

These probabilistic techniques present some restrictions on the number of objects that can be used. From a strictly mathematical point of view, the objects have to be at least one more than the number of variables measured. Nevertheless, in order to obtain reliable results, these techniques should be applied only when the ratio between the number of objects in a given category and the number of variables is at least three. Furthermore, the number of objects in each class should be nearly balanced: it is not advisable to work when the ratios between the numbers of objects in different categories are greater than three (Derde and Massart, 1989).

FIGURE 2.13 Iso-probability ellipses under QDA hypotheses and the resultant quadratic class delimiter (artificial data). [figure: plot of X2 versus X1]

FIGURE 2.14 UNEQ model (cross) and class space (shadowed region) for the class of interest (artificial data). [figure: plot of X2 versus X1]
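A compact sketch of the UNEQ acceptance rule may help to make the class space concrete; the critical-value expression below is one common formulation of the Hotelling T² limit for a new observation, and the function name and significance level are illustrative assumptions:

```python
# UNEQ-style class space: accept a new object if its Hotelling T2
# falls below the critical value at the chosen confidence level.
import numpy as np
from scipy import stats

def uneq_accepts(X_class, x_new, alpha=0.05):
    n, p = X_class.shape
    centroid = X_class.mean(axis=0)
    S = np.cov(X_class, rowvar=False)            # class variance-covariance matrix
    d = x_new - centroid
    t2 = d @ np.linalg.solve(S, d)               # squared Mahalanobis distance
    # critical T2 for a new observation (one common formulation)
    t2_crit = (p * (n - 1) * (n + 1)) / (n * (n - p)) * stats.f.ppf(1 - alpha, p, n - p)
    return t2 <= t2_crit                         # inside the (hyper-)ellipsoid?
```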
In cases involving many variables, it is possible to apply LDA and QDA-UNEQ following a preliminary reduction in the variable number, for instance, by a PCA-based compression.

2.3.4.4.4. POTENTIAL FUNCTION METHODS

Potential function techniques were introduced into chemometrics by Coomans and Broeckaert (1986). These methods estimate a probability density distribution as the sum of the contributions of each single sample in a training set. A variety of functions can be used to define the individual contributions. The most commonly used are Gaussian-like functions, with a smoothing coefficient that is formally analogous to the standard deviation of the Gaussian probability function and thus determines the shape of the distribution. Such a coefficient can be the same for all the samples of a given class (fixed potential), or it can be varied as a function of the local density of samples: the latter strategy, known as normal variable potential, is useful when the underlying multivariate distribution is very asymmetric, with regions characterized by a nonuniform density of samples (Forina et al., 1991).

The value of the smoothing coefficient can be optimized by means of a leave-one-out procedure with an optimization sample set.

As represented in Fig. 2.15, the resulting estimated overall probability distribution can be very complex, capable of effectively describing nonuniform distributions of samples. From the probability distribution, the boundary of the class space can be obtained at a selected confidence level.

FIGURE 2.15 Potential function class space (shadowed region) for the class of interest (artificial data). [figure: plot of X2 versus X1]
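As a rough fixed-potential sketch, a Gaussian kernel density estimate can play the role of the summed individual contributions; the bandwidth stands in for the smoothing coefficient, and the percentile-based boundary below is only a simple surrogate for the confidence-level boundary described above:

```python
# Fixed-potential sketch via Gaussian kernel density estimation.
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(1)
X_class = rng.normal(size=(50, 2))                     # training objects of one class

kde = KernelDensity(kernel="gaussian", bandwidth=0.5)  # bandwidth ~ smoothing coefficient
kde.fit(X_class)
potentials = kde.score_samples(X_class)                # log-density at each training object

# Surrogate boundary: accept new objects whose potential is not lower
# than the 5th percentile of the training potentials.
threshold = np.percentile(potentials, 5)
accepted = kde.score_samples(np.array([[0.2, -0.1]])) >= threshold
```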

2.3.5. Supervised Quantitative Modeling

Regression defines mathematical relationships between variables or groups of variables, and provides models for quantitative predictions.

Regression techniques can be univariate or multivariate, depending on the number of predictors and, possibly, of response variables involved. Furthermore, they can be linear or nonlinear, depending on the type of relationship they are able to model.

Univariate linear regression is a very common tool in analytical chemistry, generally used to describe the relation between a chemical quantity (typically, the concentration of an analyte in a series of standards) and a measured physical variable (e.g., absorbance values at a given wavelength). The mathematical model obtained is then used inversely, to compute the chemical quantity in real samples from the values of the physical measurements performed on them.

2.3.5.1. Ordinary Least Squares (OLS)


The ordinary least-squares (OLS) – or classical least-squares (CLS) – method is historically the most widely used and studied. It looks for the combination of parameters of the linear model (intercept and slope) that provides the minimum value for the sum of the squared residuals (i.e., the squared differences between the values estimated by the model and the corresponding true values). The statistical assumptions underlying the method allow us to calculate a confidence interval for each predicted value.

OLS can be applied to multivariate data as well, namely when the predictors are two or more. In such cases, the method is also known as multivariate linear regression (MLR) (Draper and Smith, 1981).
The model can be expressed as a mathematical relationship between the response y and the V predictors:

$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \ldots + b_V x_V$  (2.10)

that is, in matrix notation:

$\hat{\mathbf{y}} = \mathbf{X}\mathbf{b}$  (2.11)

where X is the matrix of the predictors augmented with a column of 1s, necessary for the estimation of the intercept, and b is the column vector of the regression coefficients. The regression coefficients are estimated by

$\mathbf{b} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}$  (2.12)

The elements of the vector y are the reference values of the response variable, used for building the model.
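Equations (2.10)–(2.12) translate directly into NumPy; the data below are artificial, and `np.linalg.solve` is used in place of the explicit inverse only for numerical convenience:

```python
# Direct transcription of Eqs. (2.10)-(2.12) with artificial data.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 3))                     # 20 objects, V = 3 predictors
y = 1.0 + X @ np.array([0.5, -1.2, 2.0]) + rng.normal(0.0, 0.1, 20)

Xa = np.column_stack([np.ones(len(X)), X])       # augment with a column of 1s
b = np.linalg.solve(Xa.T @ Xa, Xa.T @ y)         # b = (X'X)^-1 X'y, Eq. (2.12)
y_hat = Xa @ b                                   # fitted responses, Eq. (2.11)
```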
The uncertainty on the coefficient estimates varies inversely with the determinant of the information matrix (X′X) which, in the case of a single predictor, corresponds to its variance. In the multivariate case, the value of the determinant depends on the variance of the predictors and on their inter-correlation: a high correlation gives a small determinant of the information matrix, which means a large uncertainty on the coefficients and, consequently, unreliable regression results.

This is the typical situation when vectors corresponding to almost continuous signals (e.g., spectra) are used as predictors. In such cases, in fact, contiguous variables are considerably inter-correlated. In a spectrum, for instance, absorbances evaluated at two consecutive wavelengths frequently carry almost the same information, so that their correlation coefficient is close to 1. In such cases, standard OLS is not recommendable at all.

Furthermore, the number of objects required for OLS regression must be at least equal to the number of predictors plus 1. Such a condition is not satisfied in many practical cases.

2.3.5.2. Principal-Component Regression (PCR)

Principal-component analysis offers a very simple approach to overcome these hurdles. The model is obtained by a classical least-squares approach which uses, as predictors, a reduced number of significant principal components computed from the original variables (Jolliffe, 1982). The PCs are, by definition, orthogonal and, therefore, uncorrelated. This technique, which is very efficient in many cases, is known as principal-component regression (PCR). Since the directions which explain the highest amount of variance (i.e., the lowest-order PCs) are not always the most important for predicting a response variable, it is possible to follow a refined approach, which performs a stepwise selection of the principal components to be used in the model on the basis of their modeling efficiency.
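A minimal PCR sketch chains the two steps just described; the number of retained components is an assumption that would normally be validated, and the stepwise selection of PCs mentioned above is not implemented here:

```python
# PCR sketch: ordinary least squares on a reduced set of principal components.
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

pcr = make_pipeline(PCA(n_components=2), LinearRegression())
pcr.fit(X, y)                      # X, y as in the previous sketch
y_hat_pcr = pcr.predict(X)
```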

2.3.5.3. Partial Least Squares (PLS)

PLS (partial least squares or, equivalently, projections onto latent structures) is probably the most widely used multivariate regression technique (Wold et al., 2001) and represents a better solution to both of the problems of variable number and inter-correlation. The latent structures, more frequently called latent variables (LVs) or PLS components, are directions in the space of the predictors. In particular, the first latent variable is the direction characterized by the maximum covariance with the selected response variable. The information related to the first latent variable is then subtracted from both the original predictors and the response. The second latent variable is orthogonal to the first one, being the direction of maximum covariance between the residuals of the predictors and the residuals of the response. This approach continues for the subsequent LVs.

The optimal complexity of the PLS model, i.e., the most appropriate number of latent variables, is determined by evaluating, with a proper validation strategy, the prediction error of models with increasing complexity. The parameters considered are usually the standard deviation of the error of calibration (SDEC), computed with the objects used for building the model, and the standard deviation of the error of prediction (SDEP), computed with objects not used for building the model:

$\mathrm{SDEC(P)} = \sqrt{\dfrac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{N}}$  (2.13)

where $y_i$ is the value of the response variable y for sample i, $\hat{y}_i$ is the corresponding value computed or predicted by the model, and N is the number of samples.

In general, the calibration error always decreases as the number of LVs increases, because the fit improves (toward overfitting). On the contrary, the prediction error generally decreases down to a certain model complexity and then rises: this indicates that the LVs introduced beyond that point are bringing noise, as shown in the example given in Fig. 2.16. A simple and practical criterion is to choose, as the optimal model complexity, the LV number corresponding to the absolute minimum of SDEP or – better – to its first local minimum. In the example of Fig. 2.16, this corresponds to six LVs.

When the number of noisy (noninformative) variables is too large, the performance of PLS models may be improved by a selection of useful predictors performed in advance (Forina et al., 2007).
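The SDEP-based choice of the number of latent variables can be sketched with scikit-learn; the artificial data, the five-fold cross-validation, and the simple argmin rule below are illustrative assumptions, not the chapter's prescription:

```python
# PLS sketch: choose the number of LVs from a cross-validated SDEP curve.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 30))                    # e.g., 50 spectra x 30 wavelengths
y = X[:, :5].sum(axis=1) + rng.normal(0.0, 0.2, 50)

sdep = []
for lv in range(1, 16):
    y_cv = cross_val_predict(PLSRegression(n_components=lv), X, y, cv=5)
    sdep.append(np.sqrt(np.mean((y - y_cv.ravel()) ** 2)))  # Eq. (2.13) on left-out objects

best_lv = int(np.argmin(sdep)) + 1               # or, better, the first local minimum
model = PLSRegression(n_components=best_lv).fit(X, y)
```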

FIGURE 2.16 Example of the typical trends of the calibration (SDEC) and prediction (SDEP) errors as the complexity of the PLS model (number of latent variables) increases. [figure: SDE versus LV number, 0–30 LVs]


A number of PLS variants have been developed, for instance, for building nonlinear models and for predicting two or more response variables together (PLS-2). Furthermore, when category indices are taken as dummy response variables, PLS may work as a classification method, usually called PLS discriminant analysis (PLS-DA) or discriminant PLS (D-PLS).

2.3.6. Artificial Neural Networks

Artificial neural networks (ANNs) are a family of versatile nonparametric tools that can be employed both for data exploration and for qualitative and quantitative predictive modeling. ANNs offer some advantages: for instance, they are generally well suited for nonlinear problems, and the related software is easily available. Conversely, a number of important drawbacks should limit ANN use to the cases in which other, simpler techniques fail and, primarily, in which a large number of samples is available.

Multilayer feedforward (MLF) neural networks represent the ANN configuration most widely applied to electronic tongue data. An exemplificative scheme is shown in Fig. 2.17.

MLF networks are composed of a number of computational elements, called neurons, generally organized in three layers (Zupan, 1994). In the first one, the input layer, there are usually N neurons, which correspond to the original predictors. The predictors are scaled (generally range scaled). When their number is very large, the principal components are often used instead, in order to reduce the data amount and the computational time.

The first layer transmits the values of the predictors to the second – hidden – layer. All the neurons of the input layer are connected to the J neurons of the second layer by means of weight coefficients, meaning that the J elements of the hidden layer receive, as information, a weighted sum S of the values from the input layer. They transform the information received (S) by means of a suitable transfer function, frequently a sigmoid.

These neurons transmit information to the third – output – layer, as a weighted combination (Z) of their values. The neurons in the output layer correspond to the response variables which, in the case of classification, are the coded class indices. The output neurons transform the information Z, from the hidden layer, by means of a further sigmoid or semilinear function.

After a first random initialization of the values, a learning procedure modifies the weights wn,j and wj,o during several optimization cycles, in order to improve the performance of the net. The correction of the weights at each step is proportional to the prediction error of the previous cycle. The optimization of many parameters and the elevated number of learning cycles considerably increase the risk of overfitting and, for this reason, a deep validation with a consistent number of objects is required.

FIGURE 2.17 Exemplificative scheme of general interneuronal connections and transmission/correction mechanisms for a multilayer feedforward neural network. [figure: input layer I1–I4, hidden layer H1–Hj, and output layer O1–O2, connected by the weights wn,j and wj,o; S and Z denote the weighted sums received by the hidden and output layers]
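The S and Z quantities of the text above can be made concrete with a hand-rolled forward pass; the layer sizes and the random, untrained weights below are placeholders (a real MLF net would learn them through the error-driven correction procedure just described):

```python
# Forward pass of a small MLF network (untrained, illustrative weights).
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(4)
x = rng.uniform(size=5)                  # range-scaled predictors (input layer, N = 5)
W1 = rng.normal(size=(5, 3))             # weights w_{n,j} toward J = 3 hidden neurons
W2 = rng.normal(size=(3, 2))             # weights w_{j,o} toward 2 output neurons

S = x @ W1                               # weighted sums received by the hidden layer
H = sigmoid(S)                           # hidden-layer transfer function
Z = H @ W2                               # weighted combination sent to the output layer
O = sigmoid(Z)                           # network response (e.g., coded class indices)
```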
Another type of widely employed ANN is represented by the Kohonen self-organizing maps (SOMs), used for unsupervised exploratory analysis, and by the counterpropagation (CP) neural networks, used for nonlinear regression and classification (Kohonen, 2001). These tools, too, require a considerable number of objects to build reliable models, as well as a severe validation.

TABLE 2A.1 Red-Wine Data Set, and Basic Statistical Parameters

Name | Category | Alcohol (% abv) | Sugar-free extract (g/l) | Fixed acidity (g/l) | Tartaric acid (g/l) | Malic acid (g/l) | Uronic acids (mg/l) | pH | Ash (g/l) | Alkalinity of ash (meq/l) | Potassium (mg/l) | Calcium (mg/l) | Magnesium (mg/l) | Phosphate (g/l)
OLO0171 Barolo 14.23 24.82 73.10 1.21 1.71 0.72 3.38 2.43 15.60 950 62 127 320

OLO0271 Barolo 13.20 26.30 72.80 1.84 1.78 0.71 3.30 2.14 11.20 765 75 100 395

OLO0371 Barolo 13.16 26.30 68.50 1.94 2.36 0.84 3.48 2.67 18.60 936 70 101 497

OLO0471 Barolo 14.37 25.85 74.90 1.59 1.95 0.72 3.43 2.50 16.80 985 47 113 580

OLO0571 Barolo 13.24 26.05 83.50 1.30 2.59 1.10 3.42 2.87 21.00 1088 70 118 408

OLO0671 Barolo 14.20 28.40 79.90 2.14 1.76 0.96 3.39 2.45 15.20 868 71 112 418

OLO0771 Barolo 14.39 27.02 64.30 1.64 1.87 0.95 3.42 2.45 14.60 889 67 96 306

OLO0871 Barolo 14.06 26.40 73.50 1.33 2.15 1.14 3.54 2.61 17.60 894 50 121 502

OLO0971 Barolo 14.83 26.80 69.50 1.82 1.64 0.67 3.30 2.17 14.00 765 49 97 440

OLO1071 Barolo 13.86 27.00 68.50 1.92 1.35 0.67 3.27 2.27 16.00 794 51 98 391

OLO1171 Barolo 14.10 26.08 72.50 1.64 2.16 0.62 3.31 2.30 18.00 838 61 105 399

OLO1271 Barolo 14.12 28.35 72.90 1.51 1.48 0.96 3.20 2.32 16.80 827 60 95 424

OLO1371 Barolo 13.75 30.25 75.10 1.92 1.73 0.64 3.18 2.41 16.00 752 65 89 453

OLO1471 Barolo 14.75 30.40 98.90 2.08 1.73 0.72 3.01 2.39 11.40 910 46 91 510

OLO1571 Barolo 14.38 27.10 72.30 1.95 1.87 0.67 3.20 2.38 12.00 927 29 102 523

OLO1671 Barolo 13.63 27.15 69.60 1.48 1.81 0.67 3.47 2.70 17.20 905 28 112 385

OLO1771 Barolo 14.30 27.90 74.90 1.41 1.92 0.82 3.40 2.72 20.00 860 108 120 513

OLO1871 Barolo 13.83 26.30 64.90 1.93 1.57 0.68 3.43 2.62 20.00 905 68 115 419

OLO1971 Barolo 14.19 26.40 72.00 1.85 1.59 0.82 3.38 2.48 16.50 964 86 108 488

OLO0173 Barolo 13.64 27.72 91.50 1.35 3.10 0.82 3.30 2.56 15.20 1038 111 116 402

OLO0273 Barolo 14.06 25.32 71.10 1.34 1.63 1.00 3.47 2.28 16.00 905 79 126 323

OLO0373 Barolo 12.93 28.80 102.10 1.05 3.80 0.89 3.26 2.65 18.60 915 79 102 294

OLO0473 Barolo 13.71 27.63 80.00 2.23 1.86 1.21 3.33 2.36 16.60 815 89 101 476

OLO0573 Barolo 12.85 25.80 69.60 1.54 1.60 0.79 3.45 2.52 17.80 958 101 95 415

OLO0673 Barolo 13.50 25.00 81.60 1.55 1.81 0.95 3.42 2.61 20.00 992 62 96 476


OLO0773 Barolo 13.05 25.72 78.30 1.15 2.05 1.08 3.57 3.22 25.00 1095 63 124 536

OLO0873 Barolo 13.39 27.10 72.30 1.52 1.77 1.05 3.46 2.62 16.10 936 68 93 395

OLO0973 Barolo 13.30 22.70 68.30 1.74 1.72 1.06 3.44 2.14 17.00 882 52 94 434

OLO1073 Barolo 13.87 29.30 68.30 1.38 1.90 0.75 3.42 2.80 19.40 1085 68 107 396

OLO1173 Barolo 14.02 25.20 69.60 1.71 1.68 0.79 3.26 2.21 16.00 780 62 96 510

GRI0170 Grignolino 12.37 18.30 90.10 2.80 0.94 0.73 3.11 1.36 10.60 580 77 88 296

GRI0270 Grignolino 12.33 22.90 72.20 2.25 1.10 0.69 3.26 2.28 16.00 715 85 101 365

GRI0370 Grignolino 12.64 23.90 95.70 1.93 1.36 1.06 3.19 2.02 16.80 688 83 100 395

GRI0470 Grignolino 13.67 22.20 64.80 2.20 1.25 0.74 3.40 1.92 18.00 725 51 94 301

GRI0570 Grignolino 12.37 23.50 70.00 2.06 1.13 0.72 3.30 2.16 19.00 785 73 87 422

GRI0670 Grignolino 12.17 23.03 65.70 1.84 1.45 0.72 3.35 2.53 19.00 790 62 104 411

GRI0770 Grignolino 12.37 26.80 62.70 1.70 1.21 0.88 3.40 2.56 18.10 978 55 98 310

GRI0870 Grignolino 13.11 23.70 80.00 1.40 1.01 0.77 3.10 1.70 15.00 730 80 78 297

GRI0970 Grignolino 12.37 20.90 63.70 1.94 1.17 0.67 3.40 1.92 19.60 785 40 78 212

GRI0171 Grignolino 13.34 23.72 70.00 2.02 0.94 1.09 3.26 2.36 17.00 760 64 110 451

GRI0271 Grignolino 12.21 22.70 90.70 3.62 1.19 0.94 3.14 1.75 16.80 795 134 151 448

GRI0371 Grignolino 12.29 21.40 55.60 1.43 1.61 0.87 3.54 2.21 20.40 682 102 103 324

GRI0471 Grignolino 13.86 25.25 59.50 1.27 1.51 1.09 3.63 2.67 25.00 785 63 86 383

GRI0571 Grignolino 13.49 22.30 60.90 1.74 1.66 0.67 3.44 2.24 24.00 680 60 87 300

GRI0671 Grignolino 12.99 26.10 50.50 1.42 1.67 1.24 3.52 2.60 30.00 974 55 139 473

GRI0771 Grignolino 11.96 24.50 65.70 2.18 1.09 0.73 3.40 2.30 21.00 681 98 101 366

GRI0871 Grignolino 11.66 20.30 61.70 1.70 1.88 0.60 3.30 1.92 16.00 785 52 97 312

GRI0971 Grignolino 13.03 23.50 78.60 1.90 0.90 0.76 3.30 1.71 16.00 790 57 86 396

GRI0172 Grignolino 11.84 26.40 108.70 1.70 2.89 0.91 3.11 2.23 18.00 790 71 112 350

GRI0272 Grignolino 12.33 20.60 58.70 2.41 0.99 0.84 3.32 1.95 14.80 680 124 136 438

GRI0372 Grignolino 12.70 27.15 93.30 1.46 3.87 1.11 3.19 2.40 23.00 890 110 101 321
GRI0472 Grignolino 12.00 23.20 58.40 1.88 0.92 0.82 3.30 2.00 19.00 680 63 86 408

GRI0572 Grignolino 12.72 22.90 58.40 1.40 1.81 0.81 3.50 2.20 18.80 890 83 86 418

GRI0672 Grignolino 12.08 23.50 56.90 1.33 1.13 0.71 3.65 2.51 24.00 980 85 78 215

GRI0772 Grignolino 13.05 25.50 104.80 1.64 3.86 0.73 3.19 2.32 22.50 938 98 85 195

GRI0173 Grignolino 11.84 23.40 70.80 1.80 0.89 1.00 3.40 2.58 18.00 922 80 94 378

GRI0273 Grignolino 12.67 24.30 74.10 1.70 0.98 0.88 3.35 2.24 18.00 840 81 99 336

GRI0373 Grignolino 12.16 25.80 78.90 1.84 1.61 0.78 3.37 2.31 22.80 845 98 90 285

GRI0473 Grignolino 11.65 22.90 62.90 1.80 1.67 0.64 3.55 2.62 26.00 1045 125 88 281

GRI0573 Grignolino 11.64 24.20 72.40 1.84 2.06 0.89 3.40 2.46 21.60 962 79 84 304

ERA0174 Barbera 12.86 26.80 87.30 0.99 1.35 0.92 3.22 2.32 18.00 830 52 122 266

ERA0274 Barbera 12.88 23.95 78.90 1.85 2.99 0.98 3.50 2.40 20.00 795 55 104 269

ERA0374 Barbera 12.81 24.45 76.20 2.93 2.31 0.87 3.64 2.40 24.00 785 49 98 266

ERA0474 Barbera 12.70 24.75 91.00 1.91 3.55 1.80 3.26 2.36 21.50 805 47 106 356

ERA0574 Barbera 12.51 23.50 104.70 1.34 1.24 0.98 3.50 2.25 17.50 975 60 85 273

ERA0674 Barbera 12.60 23.60 80.60 2.26 2.46 0.97 3.31 2.20 18.50 760 103 94 275

ERA0774 Barbera 12.25 25.30 91.40 1.42 4.72 1.25 3.40 2.54 21.00 995 105 89 262

ERA0874 Barbera 12.53 27.10 99.80 1.88 5.51 1.19 3.30 2.64 25.00 930 100 96 360

ERA0974 Barbera 13.49 25.70 115.50 2.17 3.59 1.47 3.24 2.19 19.50 825 111 88 315

ERA0176 Barbera 12.84 26.20 82.00 1.79 2.96 1.26 3.50 2.61 24.00 925 48 101 398

ERA0276 Barbera 12.93 26.78 80.00 1.69 2.81 1.15 3.31 2.70 21.00 965 40 96 351

ERA0376 Barbera 13.36 24.12 97.80 2.83 2.56 0.77 3.35 2.35 20.00 880 47 89 235

ERA0476 Barbera 13.52 27.90 85.00 1.46 3.17 1.23 3.28 2.72 23.50 880 38 97 325

ERA0576 Barbera 13.62 25.52 93.70 2.70 4.95 1.56 3.41 2.35 20.00 805 57 92 191

ERA0179 Barbera 12.25 23.40 113.50 3.54 3.88 1.04 3.01 2.20 18.50 785 77 112 358

ERA0279 Barbera 13.16 22.90 117.90 3.15 3.57 1.18 3.14 2.15 21.00 805 88 102 456

ERA0379 Barbera 13.88 21.40 99.30 2.81 5.04 1.29 3.28 2.23 20.00 750 43 80 171

ERA0479 Barbera 12.87 24.35 98.90 2.51 4.61 1.25 3.18 2.48 21.50 830 63 86 366

ERA0579 Barbera 13.32 21.46 96.90 2.85 3.24 1.75 3.30 2.38 21.50 790 42 92 306

ERA0178 Barbera 13.08 26.80 120.60 2.90 3.90 1.11 3.16 2.36 21.50 790 73 113 303


ERA0278 Barbera 13.50 26.50 105.50 2.31 3.12 1.31 3.23 2.62 24.00 980 67 123 338

ERA0378 Barbera 12.79 23.40 117.80 3.12 2.67 0.82 3.21 2.48 22.00 890 53 112 407

ERA0478 Barbera 13.11 25.20 95.40 2.26 1.90 0.86 3.49 2.75 25.50 1140 74 116 289

ERA0578 Barbera 13.23 23.85 120.60 2.80 3.30 0.80 3.20 2.28 18.50 915 68 98 351

ERA0678 Barbera 12.58 21.75 102.70 2.92 1.29 0.79 3.21 2.10 20.00 875 107 103 368

ERA0778 Barbera 13.17 23.20 129.30 2.28 5.19 1.49 3.58 2.32 22.00 1045 102 93 241

ERA0878 Barbera 13.84 24.70 122.90 2.76 4.12 1.07 3.19 2.38 19.50 840 108 89 402

ERA0978 Barbera 12.45 25.35 105.90 2.23 3.03 1.24 3.62 2.64 27.00 1050 118 97 393

ERA1078 Barbera 14.34 29.10 97.50 2.73 1.68 1.60 3.42 2.70 25.00 1095 78 98 462

ERA1178 Barbera 13.48 26.95 102.50 3.75 1.67 1.37 3.41 2.64 22.50 1055 79 89 480

mean 13.13 25.07 82.46 1.97 2.22 0.95 3.35 2.37 19.27 868.7 72.6 100.6 369.5

variance 0.59 5.25 332.34 0.35 1.27 0.07 0.02 0.08 13.24 13102.4 536.4 194.7 7453.5

standard deviation 0.77 2.29 18.23 0.59 1.12 0.26 0.14 0.28 3.64 114.5 23.2 14.0 86.3

TABLE 2A.1 (continued): remaining variables for the same samples, in the same row order

Chloride (mg/l) | Total phenols (g/l) | Flavanoids | Nonflavanoid phenols | Proanthocyanins | Color intensity | Hue | OD280/OD315 of diluted wines | OD280/OD315 of flavanoids | Glycerol (g/l) | 2,3-Butanediol (g/l) | Total nitrogen (mg/l) | Proline (mg/l) | Methanol (% A.A.)

82 2.80 3.06 0.28 2.29 5.64 1.04 3.92 4.77 9.29 757 153 1065 113

90 2.65 2.76 0.26 1.28 4.38 1.05 3.40 3.80 8.93 881 194 1050 94

67 2.80 3.24 0.30 2.81 5.68 1.03 3.17 3.46 11.74 900 206 1185 125

49 3.85 3.49 0.24 2.18 7.80 0.86 3.45 3.54 10.13 1119 292 1480 80

65 2.80 2.69 0.39 1.82 4.32 1.04 2.93 3.22 10.27 799 215 735 73

58 3.27 3.39 0.34 1.97 6.75 1.05 2.85 3.16 10.85 865 364 1450 68

52 2.50 2.52 0.30 1.98 5.25 1.02 3.58 3.94 9.05 931 378 1290 80

64 2.60 2.51 0.31 1.25 5.05 1.06 3.58 3.94 10.13 865 358 1295 100

58 2.80 2.98 0.29 1.98 5.20 1.08 2.85 3.03 9.89 825 438 1045 141

64 2.98 3.15 0.22 1.85 7.22 1.01 3.55 3.75 12.65 788 350 1045 121
61 2.95 3.32 0.22 2.38 5.75 1.25 3.17 3.27 8.59 964 378 1510 123

79 2.20 2.43 0.26 1.57 5.00 1.17 2.82 3.04 11.52 894 294 1280 134

257 2.60 2.76 0.29 1.81 5.60 1.15 2.90 2.92 12.24 784 289 1320 164

50 3.10 3.69 0.43 2.81 5.40 1.25 2.73 2.82 12.29 766 224 1150 105

55 3.30 3.64 0.29 2.96 7.50 1.20 3.00 3.32 9.53 1041 324 1547 114

50 2.85 2.91 0.30 1.46 7.30 1.28 2.88 3.12 7.92 812 229 1310 97

62 2.80 3.14 0.33 1.97 6.20 1.07 2.65 3.10 9.24 836 308 1280 113

58 2.95 3.40 0.40 1.72 6.60 1.13 2.57 2.66 9.41 722 274 1130 99

28 3.30 3.93 0.32 1.86 8.70 1.23 2.82 3.17 9.85 808 230 1680 135

67 2.70 3.03 0.17 1.66 5.10 0.96 3.36 4.00 10.39 726 227 845 119

73 3.00 3.17 0.24 2.10 5.65 1.09 3.71 3.75 10.30 828 225 780 145

62 2.41 2.41 0.25 1.98 4.50 1.03 3.52 3.66 11.88 589 237 770 123

134 2.61 2.88 0.27 1.69 3.80 1.11 4.00 4.31 8.81 715 270 1035 109

73 2.48 2.37 0.26 1.46 3.93 1.09 3.63 3.82 9.14 568 248 1015 102

47 2.53 2.61 0.28 1.66 3.52 1.12 3.82 4.00 9.45 667 210 845 86

82 2.63 2.68 0.47 1.92 3.58 1.13 3.20 3.63 10.34 753 238 830 124

52 2.85 2.94 0.34 1.45 4.80 0.92 3.22 4.44 9.96 854 285 1195 85

46 2.40 2.19 0.27 1.35 3.95 1.02 2.77 3.10 9.70 757 350 1285 84

76 2.95 2.97 0.37 1.76 4.50 1.25 3.40 3.72 9.53 702 280 915 99

53 2.65 2.33 0.26 1.98 4.70 1.04 3.59 3.77 9.94 689 293 1035 100

52 1.98 0.57 0.28 0.42 1.95 1.05 1.82 2.12 5.40 736 287 520 98

108 2.05 1.09 0.63 0.41 3.27 1.25 1.67 1.42 6.90 658 345 680 127

53 2.02 1.41 0.53 0.62 5.75 0.98 1.59 1.86 8.20 691 321 450 60

47 2.10 1.79 0.32 0.73 3.80 1.23 2.46 1.73 8.60 797 262 630 87

306 3.50 3.10 0.19 1.87 4.45 1.22 2.87 3.07 7.20 748 141 420 157

116 1.89 1.75 0.45 1.03 2.95 1.45 2.23 2.73 7.50 627 219 355 58

69 2.42 2.65 0.37 2.08 4.60 1.19 2.30 2.60 7.96 680 259 678 118

148 2.98 3.18 0.26 2.28 5.30 1.12 3.18 3.33 8.20 604 100 502 114

54 2.11 2.00 0.27 1.04 4.68 1.12 3.48 4.07 7.10 554 425 510 98


111 2.53 1.30 0.55 0.42 3.17 1.02 1.93 1.92 8.10 704 363 750 137

88 1.85 1.28 0.14 2.50 2.85 1.28 3.07 3.23 6.81 714 195 718 116

50 1.10 1.02 0.37 1.46 3.05 0.91 1.82 2.00 6.38 661 301 870 78

59 2.95 2.86 0.21 1.87 3.38 1.36 3.16 3.52 7.62 748 170 410 99

43 1.88 1.84 0.27 1.03 3.74 0.98 2.78 3.50 8.04 614 160 472 64

35 3.30 2.89 0.21 1.96 3.35 1.31 3.50 3.60 8.00 731 293 985 113

48 3.38 2.14 0.13 1.65 3.21 0.99 3.13 3.15 7.70 563 183 886 57

59 1.61 1.57 0.34 1.15 3.80 1.23 2.14 2.35 6.14 596 109 428 129

122 1.95 2.03 0.24 1.46 4.60 1.19 2.48 2.85 8.40 756 167 392 145

58 1.72 1.32 0.43 0.95 2.65 0.96 2.52 3.25 5.22 514 246 500 122

99 1.90 1.85 0.35 2.76 3.40 1.06 2.31 2.70 7.96 654 259 750 59

52 2.83 2.55 0.43 1.95 2.57 1.19 3.13 3.82 8.66 700 227 463 101

27 2.42 2.26 0.30 1.43 2.50 1.38 3.12 3.52 5.80 645 199 278 95

64 2.20 2.53 0.26 1.77 3.90 1.16 3.14 3.33 7.38 664 199 714 111

53 2.00 1.58 0.40 1.40 2.20 1.31 2.72 3.50 8.11 548 203 630 115

48 1.65 1.59 0.61 1.62 4.80 0.84 2.01 2.07 8.64 649 207 515 114

95 2.20 2.21 0.22 2.35 3.05 0.79 3.08 3.81 6.36 586 138 520 141

70 2.20 1.94 0.30 1.46 2.62 1.23 3.16 3.60 7.90 600 217 450 121

54 1.78 1.69 0.43 1.56 2.45 1.33 2.26 2.92 8.04 643 195 495 116

36 1.92 1.61 0.40 1.34 2.60 1.36 3.21 3.27 9.54 608 262 562 120

70 1.95 1.69 0.48 1.35 2.80 1.00 2.75 3.60 7.97 523 223 680 120

46 1.51 1.25 0.21 0.94 4.10 0.76 1.29 1.26 6.43 673 252 630 122

72 1.30 1.22 0.24 0.83 5.40 0.74 1.42 1.34 10.10 918 319 530 102

67 1.15 1.09 0.27 0.83 5.70 0.66 1.36 1.24 10.02 1095 258 560 132

118 1.70 1.20 0.17 0.84 5.00 0.78 1.29 1.23 8.52 1020 238 600 121

29 2.00 0.58 0.60 1.25 5.45 0.75 1.51 1.40 8.32 764 178 650 79
77 1.62 0.66 0.63 0.94 7.10 0.73 1.58 1.37 6.47 573 174 695 100

144 1.38 0.47 0.53 0.80 3.85 0.75 1.27 1.12 8.25 680 217 720 107

6 1.79 0.60 0.63 1.10 5.00 0.82 1.69 1.80 8.35 821 230 515 139

56 1.62 0.48 0.58 0.88 5.70 0.81 1.82 2.23 10.40 700 245 580 150

15 2.32 0.60 0.53 0.81 4.92 0.89 2.15 2.25 10.60 940 269 590 132

25 1.54 0.50 0.53 0.75 4.60 0.77 2.31 2.34 10.62 955 260 600 82

71 1.40 0.50 0.37 0.64 5.60 0.70 2.47 2.60 10.41 814 216 780 106

21 1.55 0.52 0.50 0.55 4.35 0.89 2.06 2.21 10.20 976 201 520 118

16 2.00 0.80 0.47 1.02 4.40 0.91 2.05 2.55 8.90 899 205 550 140

14 1.38 0.78 0.29 1.14 8.21 0.65 2.00 2.23 8.16 521 218 855 97

17 1.50 0.55 0.43 1.30 4.00 0.60 1.68 2.24 5.61 696 252 830 63

10 0.98 0.34 0.40 0.68 4.90 0.58 1.33 1.81 7.94 670 156 415 154

50 1.70 0.65 0.47 0.86 7.65 0.54 1.86 2.10 8.52 806 213 625 122

21 1.93 0.76 0.45 1.25 8.42 0.55 1.62 2.19 6.12 604 219 650 106

50 1.41 1.39 0.34 1.14 9.40 0.57 1.33 1.26 7.36 733 164 550 114

106 1.40 1.57 0.22 1.25 8.60 0.59 1.30 1.29 6.28 568 129 500 107

127 1.48 1.36 0.24 1.26 10.80 0.48 1.47 1.40 7.00 898 154 480 91

55 2.20 1.28 0.26 1.56 7.10 0.61 1.33 1.25 8.57 905 249 425 125

35 1.80 0.83 0.61 1.87 10.52 0.56 1.51 1.42 10.80 915 154 675 84

100 1.48 0.58 0.53 1.40 7.60 0.58 1.55 1.34 7.52 924 142 640 100

84 1.74 0.63 0.61 1.55 7.90 0.60 1.48 1.31 9.50 969 207 725 84

6 1.80 0.83 0.48 1.56 9.01 0.57 1.64 1.92 9.29 902 159 480 132

53 1.90 0.58 0.63 1.14 7.50 0.67 1.73 2.18 10.20 865 252 880 118

49 2.80 1.31 0.53 2.70 13.00 0.57 1.96 2.25 10.82 764 223 660 182

35 2.60 1.10 0.52 2.29 11.75 0.57 1.78 2.09 11.09 1080 250 620 160

mean 66.5 2.24 1.90 0.36 1.51 5.27 0.97 2.51 2.75 8.79 759.7 240.4 779.3 110.2

variance 1977.3 0.40 0.99 0.02 0.35 4.90 0.06 0.62 0.87 2.76 20104.3 4691.8 102493.8 649.9

standard deviation 44.5 0.63 1.00 0.13 0.59 2.21 0.25 0.78 0.93 1.66 141.8 68.5 320.1 25.5

Acknowledgment

The authors wish to thank Mr. Patrick Guerin for the careful revision of the manuscript.

References

American Oil Chemist's Society, 1998. Official Methods and Recommended Practices of the American Oil Chemists' Society, fifth ed. AOCS, Champaign, IL.
Barnes, R.J., Dhanoa, M.S., Lister, S.J., 1989. Standard normal variate transformation and de-trending of near-infrared diffuse reflectance spectra. Appl. Spectrosc. 43 (5), 772–777.
Box, G.E.P., Hunter, W.G., Hunter, J.S., 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. Wiley, New York.
Chambers, J.M., Cleveland, W.S., Kleiner, B., Tukey, P.A., 1983. Graphical Methods for Data Analysis. Chapman & Hall, New York.
Coomans, D., Broeckaert, I., Derde, M.P., Tassin, A., Massart, D.L., Wold, S., 1984. Use of a microcomputer for the definition of multivariate confidence regions in medical diagnosis based on clinical laboratory profiles. Comput. Biomed. Res. 17, 1–14.
Coomans, D., Broeckaert, I., 1986. Potential Pattern Recognition in Chemical and Medical Decision Making. Research Studies Press, Letchworth, England.
Derde, M.P., Kaufman, L., Massart, D.L., 1986. A non-parametric class modelling technique. J. Chemom. 3, 375–395.
Derde, M.P., Massart, D.L., 1986. UNEQ: a disjoint modelling technique for pattern recognition based on normal distribution. Anal. Chim. Acta 184, 33–51.
Derde, M.P., Massart, D.L., 1989. Evaluation of the required sample size in some supervised pattern recognition techniques. Anal. Chim. Acta 223, 19–44.
Draper, N.R., Smith, H., 1981. Applied Regression Analysis, second ed. Wiley, New York.
Dudoit, S., Fridlyand, J., Speed, T.P., 2002. Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc. 97, 77–87.
Fearn, T., 2009. The effect of spectral pre-treatments on interpretation. NIR News 20, 16–17.
Fisher, R.A., 1936. The use of multiple measurements in taxonomic problems. Ann. Eugen. 7, 179–188.
Forina, M., Lanteri, S., 1984. Chemometrics: mathematics and statistics in chemistry. In: Kowalski, B.R. (Ed.), NATO ASI Series, Ser. C, vol. 138. Reidel Publ. Co., Dordrecht, pp. 439–466.
Forina, M., Armanino, C., Castino, M., Ubigli, M., 1986. Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25, 189–201.
Forina, M., Armanino, C., Leardi, R., Drava, G., 1991. A class-modelling technique based on potential functions. J. Chemom. 5, 435–453.
Forina, M., Lanteri, S., Rosso, S., 2001. Confidence intervals of the prediction ability and performance scores of classification methods. Chemometr. Intell. Lab. Syst. 57, 121–132.
Forina, M., Lanteri, S., Casale, M., 2007. Multivariate calibration. J. Chromatogr. A 1158, 61–93.
Forina, M., Oliveri, P., Lanteri, S., Casale, M., 2008. Class-modeling techniques, classic and new, for old and new problems. Chemometr. Intell. Lab. Syst. 93, 132–148.
Geisser, S., 1964. Posterior odds for multivariate normal distributions. J. R. Stat. Soc. Series B Stat. Methodological 26, 69–76.
Geladi, P., Manley, M., Lestander, T., 2003. Scatter plotting in multivariate data analysis. J. Chemom. 17, 503–511.
Hotelling, H., 1947. Multivariate quality control. In: Eisenhart, C., Hastay, M.W., Wallis, W.A. (Eds.), Techniques of Statistical Analysis. McGraw-Hill, New York, pp. 111–184.
Iman, R.L., 1982. Graphs for use with the Lilliefors test for normal and exponential distributions. Amer. Statist. 36, 109–112.
Jellema, R.H., 2009. Variable shift and alignment. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 85–108.
Jolliffe, I.T., 1982. A note on the use of principal components in regression. J. R. Stat. Soc. Ser. C (Appl. Stat.) 31 (3), 300–303.
Jolliffe, I.T., 2002. Principal Component Analysis, second ed. Springer, New York, pp. 201–207.
Kennard, R.W., Stone, L.A., 1969. Computer aided design of experiments. Technometrics 11, 137–148.
Kjeldahl, K., Bro, R., 2010. Some common misunderstandings in chemometrics. J. Chemom. 24, 558–564.
Kohonen, T., 2001. Self-Organizing Maps, third ed. Springer, New York, NY.
Kolmogorov, A., 1933. Sulla determinazione empirica di una legge di distribuzione. G. Inst. Ital. Attuari 4, 83–91.
Lilliefors, H.W., 1970. On the Kolmogorov–Smirnov test for normality with mean and variance unknown. J. Amer. Stat. Assoc. 62, 399–405.
Martens, H., Kohler, A., 2008. Bio-spectroscopy and bio-chemometrics: high-throughput metabolic profiling for integrative genetics. In: Proceedings of the Metabomeeting 2008 Conference, 28–29th April 2008. Ecole Normale Supérieure de Lyon, Lyon, France, p. 18.
Nielsen, N.-P.V., Carstensen, J.M., Smedsgaard, J., 1998. Aligning of single and multiple wavelength chromatographic profiles for chemometric data analysis using correlation optimised warping. J. Chromatogr. A 805, 17–35.
Oliveri, P., Casale, M., Casolino, M.C., Baldo, M.A., Nizzi-Grifi, F., Forina, M., 2011. A comparison between classical and innovative class-modelling techniques for the characterisation of a PDO olive oil. Anal. Bioanal. Chem. 399, 2105–2113.
Oliveri, P., Downey, G., 2012. Multivariate class modeling for the verification of food-authenticity claims. TrAC, Trends Anal. Chem. 35, 74–86.
Pearson, K., 1901. On lines and planes of closest fit to systems of points in space. Philos. Mag. 2 (6), 559–572.
Reis, M.S., Saraiva, P.M., Bakshi, B.R., 2009. Denoising and signal-to-noise ratio enhancement: wavelet transform and Fourier transform. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 25–55.
Savitzky, A., Golay, M.J.E., 1964. Smoothing and differentiation of data by simplified least squares procedure. Anal. Chem. 36, 1627–1639.
Sharoba, A.M., Senge, B., El-Mansy, H.A., Bahlol, H.ElM., Blochwitz, R., 2005. Chemical, sensory and rheological properties of some commercial German and Egyptian tomato ketchups. Eur. Food Res. Technol. 220, 142–151.
Smirnov, N.V., 1939. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscow 2, 3–14.
Snedecor, G.W., Cochran, W.G., 1989. Statistical Methods, eighth ed. Iowa State University Press.
Snee, R., 1977. Validation of regression models: methods and examples. Technometrics 19, 415–428.
Student, 1908. The probable error of a mean. Biometrika 6, 1–25.
Taavitsainen, V.M., 2009. Denoising and signal-to-noise ratio enhancement: derivatives. In: Brown, S.D., Tauler, R., Walczak, B. (Eds.), Comprehensive Chemometrics, vol. 2. Elsevier, Amsterdam, pp. 57–66.
Taguchi, G., 1986. Introduction to Quality Engineering. Designing Quality into Products and Processes. Asian Productivity Organization, ASI Press, Dearborn.
Valcárcel, M., Cárdenas, S., 2005. Vanguard-rearguard analytical strategies. Trends Anal. Chem. 24, 67–74.
Vandeginste, B.G.M., Massart, D.L., Buydens, L.M.C., De Jong, S., Lewi, P.J., Smeyers-Verbeke, J., 1998. Handbook of Chemometrics and Qualimetrics, vol. 20B. Elsevier, Amsterdam.
Wold, S., 1972. Spline functions, a new tool in data-analysis. Kem. Tidskr. 84, 34–37.
Wold, S., Sjöström, M., 1977. SIMCA: a method for analysing chemical data in terms of similarity and analogy. In: Kowalski, B.R. (Ed.), Chemometrics: Theory and Applications, ACS Symposium Series 52. American Chemical Society, Washington, pp. 243–282.
Wold, S., Sjöström, M., Eriksson, L., 2001. PLS-regression: a basic tool of chemometrics. Chemom. Intell. Lab. Syst. 58, 109–130.
Zupan, J., 1994. Introduction to artificial neural network (ANN) methods: what they are and how to use them. Acta Chim. Slov. 41, 327–352.
