
Chemometrics and Intelligent Laboratory Systems 84 (2006) 164 – 171

www.elsevier.com/locate/chemolab

Authentication of Italian CDO wines by class-modeling techniques


Federico Marini ⁎, Remo Bucci, Antonio L. Magrì, Andrea D. Magrì
Dipartimento di Chimica, Università di Roma “La Sapienza”, I-00185 Rome, Italy
Received 26 September 2005; received in revised form 25 April 2006; accepted 28 April 2006
Available online 10 July 2006

Abstract

Two chemometric class-modeling techniques (SIMCA and UNEQ) have been used to authenticate the origin of CDO wine samples from Italy.
While SIMCA modeling was performed using all variables, UNEQ required a preliminary variable selection, which left only 4 variables (Cu, Zn,
anthocyanins and SO2) in the models. Both techniques provided highly sensitive and specific category models and also gave a
very reliable classification of samples (only one and two samples misclassified by SIMCA and UNEQ, respectively). A further investigation of the
stability of the models with respect to the wine production year showed that, while SIMCA models failed to accept most of the
samples produced in years different from those used in the modeling phase, almost all of the same samples were accepted by the corresponding
UNEQ models. However, when SIMCA modeling was repeated using only the same 4 variables as UNEQ, comparable results were obtained,
suggesting that the choice of the experimental variables affects the classification outcomes.
© 2006 Elsevier B.V. All rights reserved.

⁎ Corresponding author.
E-mail address: fmmonet@hotmail.com (F. Marini).
0169-7439/$ - see front matter © 2006 Elsevier B.V. All rights reserved.
doi:10.1016/j.chemolab.2006.04.017

1. Introduction

The quality of wines is mainly based on three factors, which are collectively referred to as the “quality triangle”: the type of grape (the species of vine – varietals); the climate and the soil (which affect the quality of the grape); and the human factor, which includes cultivation techniques, production, preservation and ageing methods. In particular, the territory (climate-soil) is the factor which influences exclusively and decisively the character, the quality and the typical attributes of a wine. Indeed, the same vine species and the same production techniques used in different environments and soils will yield substantially different grapes and wines: that is why wine was the first foodstuff for which typicalness was explicitly recognized as a main quality factor, through the introduction of the Controlled Denominations of Origin (CDOs) in the early 1960s. A Controlled Denomination of Origin protects the name of the geographical area of production and codifies all the practices of cultivation, transformation and preservation of the wine.

The geographic position of Italy, the variety present in its territory and the multitude of microclimatic conditions are the source of a large range of different wines (over 3000). Some of these wines are widely known and well liked all over the world, while others, produced in small quantities and known only to a small number of connoisseurs, can be found and tasted mainly by visiting the particular areas where they are produced. Italian wines under Controlled Denomination of Origin number well over 330, 27 of which are under Guaranteed and Controlled Denomination of Origin, for a total production of almost 1 billion litres. These figures show that denominated products cover a significant percentage of the total wine market in Italy, so that the problem of authenticating the origin of wine samples is of high economic and social relevance. Indeed, since CDO wines are value-added products, selling a cheaper and worse wine as a denominated (and hence top quality and higher price) product could prove an attractive fraud. Therefore, there is a need for analytical tools capable of verifying whether a sample declared as produced under a certain Denomination of Origin actually is so. Unfortunately, at present, even if studies are being undertaken to look for potential experimental indices to be used as unequivocal markers of origin (e.g. by IRMS [1–4] or SNIF–NMR [4–6]), no single analytical index has been identified whose measurement could provide information about the origin of wine samples. On the other hand, the
capability of multivariate analysis techniques for the classification and differentiation of wines and alcoholic beverages has been widely exploited in recent years [7–18], suggesting that a chemometric pattern recognition approach could provide a reliable answer to the problem of the authentication of CDO wines.

To this end, in the present study we have examined the possibility of using two chemometric class-modeling techniques to authenticate the origin of wine samples from 7 different Italian Controlled Denominations of Origin.

2. Materials and methods

2.1. Samples and chemical analyses

A total of 180 wine samples from 7 Italian Controlled Denominations of Origin (Montepulciano d'Abruzzo [19], Nero d'Avola [20], Solopaca [21], Pinerolese Freisa [22], Rosso di Montalcino [23], Terrano [24] and Sagrantino [25]) produced in the years 2000–2002 have been collected and analyzed (the detailed composition of the sample set is reported in Table 1).

In particular, 35 chemical indices, selected among those commonly determined for the quality control of wines, have been determined on each sample: alcohol grade (% v/v), total acidity (expressed as g/L of tartaric acid), SO2 (mg/L), Cu (mg/L), Zn (mg/L), Pb (ppb), total polyphenols (mg/L), gallic acid (mg/L), protocatechuic acid (mg/L), tyrosol (mg/L), vanillic acid (mg/L), syringic acid (mg/L), caffeic acid (mg/L), ferulic acid (mg/L), p-coumaric acid (mg/L), procyanidin B1 (mg/L), procyanidin B2 (mg/L), (+)-catechin (mg/L), (−)-epicatechin (mg/L), ethyl gallate (mg/L), rutin (mg/L), isoquercetin (mg/L), isorhamnetin-3-O-glucoside (mg/L), kaempferol-3-O-glucoside (mg/L), myricetin (mg/L), quercetin (mg/L), kaempferol (mg/L), isorhamnetin (mg/L), rhamnetin (mg/L), trans-resveratrol (mg/L), cis-resveratrol (mg/L), trans-piceid (mg/L), cis-piceid (mg/L), proline (mg/L), total anthocyanins (mg/L). Each determination was performed according to the Official [26] or Recommended Analytical Methods [27,28].

Table 1
Composition of the dataset

Category                   No. of samples   No. of samples (year)
Montepulciano d'Abruzzo    13               7 (2000), 6 (2002)
Nero d'Avola               25               5 (2000), 14 (2001), 6 (2002)
Solopaca                   18               10 (2001), 8 (2002)
Pinerolese Freisa          35               27 (2000), 8 (2001)
Terrano                    34               11 (2000), 23 (2001)
Rosso di Montalcino        31               14 (2001), 17 (2002)
Sagrantino                 24               24 (2000)

2.2. Chemometric class-modeling techniques

When considering the available pattern recognition methods, a distinction can be made between pure classification and class-modeling techniques [29–32]. The former divide the sample space into as many regions as the number of classes under investigation, so that if a sample falls in a specific region of the hyperspace it is assigned to the corresponding class. On the other hand, class-modeling tools build a separate model for each category: samples fitting the model are accepted by that category, while samples falling outside the model are considered as outliers for the specific class. With these latter methods, if more than one class is modeled, each sample can be assigned to a single category, to more than one category or to no category at all. Two of the most commonly used class-modeling tools in chemometrics are Soft Independent Modeling of Class Analogies (SIMCA) and Unequal Class Modeling (UNEQ).

SIMCA [30] describes the similarities among the samples of a category using a principal component model based on k components (in this study, the number of model components was determined by double cross-validation [33]):

$$x_{ij}^{C} = \sum_{a=1}^{k_C} t_{ia}^{C}\, p_{ja}^{C} + e_{ij}^{C} \qquad (1)$$

where $t_{ia}^{C}$ and $p_{ja}^{C}$ are the scores and loadings for the a-th component in the model of class C and $e_{ij}^{C}$ are the residuals. A sample is accepted by the specific category if its distance to the model is not significantly different (at a specified confidence level) from the class residual standard deviation. In particular, all the SIMCA computations in this paper have considered a 95% confidence level to define the class box, and a weighted augmented distance [30] was always used.

UNEQ [34] is instead based on the assumption of multivariate normality for each class population and can be considered as the modeling analogue of Quadratic Discriminant Analysis. In this method, the class model is represented by the class centroid and the category space is defined on the basis of the Mahalanobis distance from this barycenter [35], corresponding to a desired confidence level: in the present paper, the class spaces have been built as the 95% confidence level hyperellipsoids around each centroid. Unlike SIMCA, UNEQ requires the data matrix to present a specific ratio between the number of samples and the number of variables in each category (Nsamples/Nvars > 3), so a variable reduction step was necessary: stepwise Linear Discriminant Analysis [36] has been used to reduce the number of variables, retaining at the same time the most discriminant ones.

Both SIMCA and UNEQ have been applied to model the class space of each CDO and to perform the classification among the seven categories. To evaluate and compare the performances of these two techniques, different figures of merit have been used. The most obvious is the non-error classification rate, which can be estimated either on the training samples (defined in the remainder of the paper as “modeling ability”) or on the validation samples (here labeled “prediction ability”). In particular, the validation of all the models built in this study has
been carried out using a full leave-one-out cross-validation approach [37], recalculating the local models after each sample deletion. While the non-error rate is a result provided by all classification methods, there are two additional figures of merit which are characteristic of class-modeling tools, namely sensitivity and specificity. Sensitivity is the percentage of samples from the modeled class which are accepted by the class model, while specificity is the percentage of samples from other classes which are rejected by the class model. Just as in the case of the correct classification rate, both values can be computed either on the training or on the validation samples; however, all the sensitivity and specificity values reported in this paper refer to their cross-validated estimates.

All the computations have been performed using V-Parvus 2003 [38].

3. Results and discussion

The results of SIMCA modeling on the data set after separate category autoscaling (depending on the category, 4–9 components were retained in the different class models, accounting for more than 80% of the total variance) are reported in Tables 2 and 3. In the modeling phase, no sample was an outlier for its own category model and the non-error rate was 100% for all the examined classes. On the other hand, in the validation phase, eight samples (1 from each of the categories Montepulciano d'Abruzzo, Nero d'Avola, Pinerolese Freisa, Terrano and Sagrantino, and 3 from Rosso di Montalcino) were outliers for all the class models; no other sample was rejected by the class model of its category, so that the sensitivity values reported in Table 2 reflect these percentages: sensitivity is therefore rather high. At the same time, all the models built in SIMCA were also very specific: 100% specificity was observed for all the pairs of categories, with the only exception of the specificity of Pinerolese with respect to Rosso di Montalcino (90.32%, 28 out of 31 samples refused, see Table 3). When considering the non-error rate in prediction, only 1 out of the 172 assigned samples (samples considered as outliers by all category models were not classified) was mispredicted: a Rosso di Montalcino sample classified as Pinerolese Freisa.

Table 2
SIMCA modeling – cross-validated sensitivities

Category                   Sensitivity
Montepulciano d'Abruzzo    92.31% (12/13)
Nero d'Avola               96.00% (24/25)
Solopaca                   100% (18/18)
Pinerolese Freisa          97.14% (34/35)
Terrano                    97.06% (33/34)
Rosso di Montalcino        90.32% (28/31)
Sagrantino                 95.83% (23/24)

Table 3
SIMCA modeling – cross-validated specificities

Category                   Compared with               Specificity
Montepulciano d'Abruzzo    vs. all classes             100%
Nero d'Avola               vs. all classes             100%
Solopaca                   vs. all classes             100%
Pinerolese Freisa          vs. Rosso di Montalcino     90.32% (28/31 refused)
                           vs. all other classes       100%
Terrano                    vs. all classes             100%
Rosso di Montalcino        vs. all classes             100%
Sagrantino                 vs. all classes             100%

Subsequently, the importance of the variables in describing each class (called modeling power, MP [30]) and their discriminant power (DP [30]) were examined. In particular, the variables that contribute the most to the description of each category are reported in Table 4, together with their modeling power. It can be noticed that some variables are effective in describing more than one category, for example procyanidins B1 and B2, which have the highest modeling power both for Montepulciano d'Abruzzo and Nero d'Avola, or rutin, which is common to the categories Terrano and Sagrantino.

Table 4
SIMCA modeling – modeling power

Category                   Variable                    Modeling power
Montepulciano d'Abruzzo    Procyanidin B2              0.41
                           Procyanidin B1              0.36
                           (−)-Epicatechin             0.31
Nero d'Avola               Procyanidin B1              0.43
                           Total polyphenols           0.27
                           Procyanidin B2              0.22
Solopaca                   (+)-Catechin                0.63
                           Isoquercitrin               0.34
                           Vanillic acid               0.26
Pinerolese Freisa          Total polyphenols           0.32
                           Cu                          0.26
                           Kaempferol-3-O-glucoside    0.20
Terrano                    Protocatechuic acid         0.33
                           Rutin                       0.20
                           Quercetin                   0.20
Rosso di Montalcino        Alcohol grade               0.43
                           Myricetin                   0.27
                           Vanillic acid               0.18
Sagrantino                 Rutin                       0.31
                           Quercetin                   0.29
                           Tyrosol                     0.28
                           Total acidity               0.26

On the other hand, when analyzing the discriminating ability of the different variables, Cu (DP = 78.32) was found to be the analytical index with the highest discriminating power, being most effective in discriminating between Nero d'Avola and the other six classes. Other variables showing a significant discriminating power are Pb (DP = 57.73) and Zn (DP = 31.70), which contribute significantly to discriminating Rosso di Montalcino and Sagrantino from the other categories and from each other. Lastly, a relevant discriminating ability is shown also by alcohol grade (11.81), total polyphenols (9.85), vanillic acid (8.00) and kaempferol-3-O-glucoside (7.85).

Based on these considerations, we have repeated the computation using the modeling and discriminating power to weight the variables prior to SIMCA analysis, but the results were practically the same.
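
As a quick check on how the figures of merit in Tables 2 and 3 are obtained, the following minimal Python sketch computes sensitivity and specificity from acceptance and rejection counts. The function names are illustrative only (they are not part of V-Parvus), and the counts are those reported in Tables 2 and 3.

```python
def sensitivity(accepted_own, n_own):
    """Percentage of samples of the modeled class accepted by its own model."""
    return 100.0 * accepted_own / n_own


def specificity(rejected_other, n_other):
    """Percentage of samples of another class rejected by the class model."""
    return 100.0 * rejected_other / n_other


# Cross-validated SIMCA counts taken from Table 2
print(f"Montepulciano d'Abruzzo sensitivity: {sensitivity(12, 13):.2f}%")
print(f"Rosso di Montalcino sensitivity: {sensitivity(28, 31):.2f}%")

# Pinerolese Freisa model tested on Rosso di Montalcino samples (Table 3)
print(f"Pinerolese vs. Rosso di Montalcino specificity: {specificity(28, 31):.2f}%")
```

With these definitions a class model can be highly sensitive and poorly specific (or the reverse), which is why both figures are reported alongside the non-error rate.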

Subsequently, we have tried to model the same data set with a different class-modeling technique, UNEQ. As UNEQ assumes multivariate normality of the experimental data, we first tested this assumption by combining univariate normality tests and performing Shapiro–Wilks' multivariate normality test: all the tests confirmed the data to be normally distributed within each category. The next step was then to perform a preliminary variable selection on our data set to meet the ratio of number of samples to number of variables required by this technique. In order to select a subset of original variables characterized by a significant discriminating ability, stepwise-LDA was performed before UNEQ modeling. Compared with the selection of individual variables based on their Fisher F-ratio [31], a stepwise selection iteratively includes in the subset the variable which accounts for the maximum improvement of the discriminating ability, expressed as Wilks' Lambda [36]. As the least numerous category (Montepulciano d'Abruzzo) had only 13 samples, only four variables (Cu, Zn, total anthocyanins and SO2) were retained to build the class models. It is interesting to notice that the first two indices (Cu and Zn) were found to be highly discriminating also in SIMCA modeling (see above). It should be pointed out that this kind of selection is done without any validation, so that a certain degree of overfitting can be expected. UNEQ modeling was then carried out using the four selected variables as inputs and the results are reported in Tables 5 and 6. A 100% classification ability was observed in the modeling phase, with no sample being an outlier for all the category models. On the other hand, when turning to the validation phase, the seven UNEQ class models showed a sensitivity (Table 5) comparable to that of the models built with SIMCA; it should be stressed that, with the only exception of a sample from the category Rosso di Montalcino, which was accepted by the Pinerolese Freisa category model, all the other samples rejected by their own class model were outliers for all the category models as evaluated by leave-one-out. When considering the remaining 170 samples, only two mispredictions were observed in the validation stage: one sample from Pinerolese classified as Rosso di Montalcino and one Rosso di Montalcino sample classified as Pinerolese. This result is also reflected in the specificity values which, as shown in Table 6, even if slightly worse than in SIMCA, are still rather high, being 100% for almost all the pairs of categories (the lowest values being precisely for the mutual specificity of the two categories Rosso di Montalcino and Pinerolese, and for Solopaca vs. Terrano).

Table 5
UNEQ modeling – cross-validated sensitivities

Category                   Sensitivity
Montepulciano d'Abruzzo    100% (13/13)
Nero d'Avola               96.00% (24/25)
Solopaca                   94.44% (17/18)
Pinerolese Freisa          91.42% (32/35)
Terrano                    97.06% (33/34)
Rosso di Montalcino        93.55% (29/31)
Sagrantino                 91.67% (22/24)

Table 6
UNEQ modeling – cross-validated specificities

Category                   Compared with               Specificity
Montepulciano d'Abruzzo    vs. all classes             100%
Nero d'Avola               vs. all classes             100%
Solopaca                   vs. Terrano                 50.00% (17/34 refused)
                           vs. all other classes       100%
Pinerolese Freisa          vs. Rosso di Montalcino     58.07% (18/31 refused)
                           vs. all other classes       100%
Terrano                    vs. all classes             100%
Rosso di Montalcino        vs. Pinerolese Freisa       97.14% (34/35 refused)
                           vs. all other classes       100%
Sagrantino                 vs. all classes             100%

In a second stage of our study, we have tried to investigate whether the class models built using SIMCA or UNEQ could be accurate when dealing with wine samples produced in years different from those used in the modeling phase. Unfortunately, the uneven distribution of samples in our data set (see Table 1) did not allow us to perform this analysis on all the categories: Sagrantino samples were produced only in 2000, while in the categories Montepulciano d'Abruzzo and Solopaca there were too few samples for each production year. Therefore, the stability of the models with respect to the production year was checked only for the four classes Nero d'Avola, Pinerolese, Terrano and Rosso di Montalcino.

We started by considering the category Nero d'Avola. As shown in Table 1, samples from this category were produced in three different years (5 in 2000, 14 in 2001 and 6 in 2002), so we tried two different modelings: first using samples from 2000 and 2001 as training set and testing the model on the samples from 2002, and then using samples from 2001 and 2002 as training set and those from 2000 as test set; in both cases, the other category models were built using all the remaining samples. The SIMCA models built after setting aside the 2002 Nero d'Avola samples as test set showed leave-one-out cross-validated specificities and sensitivities comparable to those observed in the modeling of the whole data set, the only difference being the sensitivity of Nero d'Avola (100%, all of the 2000 and 2001 samples accepted by the category model). These models were then used to analyze the 2002 Nero d'Avola samples: the 6 samples set aside as test set were all accepted by the Nero d'Avola category model built on the 2000 and 2001 samples and rejected by the class models of the other categories. As stated before, the analysis was repeated using the 2001 and 2002 samples to build the Nero d'Avola category model and setting aside the five 2000 samples as test set. In this case, the overall cross-validated results were identical to those observed in the modeling of the whole data set. When evaluating the Nero d'Avola class model on the samples produced in the year 2000, however, only 2 of the 5 test samples were accepted by the category model built using the samples from the years 2001–2002. Subsequently, this year-dependent modeling of the category Nero d'Avola was repeated using UNEQ: also in this case, the overall cross-validated results were practically identical to those obtained when modeling the complete data set, both for the models built using 2001–2002 and 2000–2001 samples as training set. However, as far as the validation on the samples not considered in the training phase is concerned, UNEQ performed better than SIMCA: indeed, while as in SIMCA the six 2002 samples were all accepted by the category model built on the 2000–2001 samples, all the five 2000 samples were also accepted by the class model built on the 2001–2002 samples (3 of those were rejected in SIMCA).
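
The acceptance test applied to held-out vintages can be sketched as follows: a class model is fitted on the training vintages (centroid and covariance matrix) and a test sample is accepted if its squared Mahalanobis distance from the centroid lies within the 95% limit. This is a minimal illustration under stated assumptions, not the V-Parvus implementation: it uses only two variables so that the covariance matrix can be inverted in closed form, the data are made up, and the limit is taken as the 95% chi-square quantile for 2 degrees of freedom (5.991), whereas UNEQ software may use a Hotelling T²-based limit instead.

```python
CHI2_95_2DF = 5.991  # assumed 95% limit: chi-square quantile, 2 variables


def fit_class_model(samples):
    """Return the centroid and covariance matrix of a 2-variable class."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / (n - 1)
    syy = sum((s[1] - my) ** 2 for s in samples) / (n - 1)
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_sq(x, centroid, cov):
    """Squared Mahalanobis distance of x from the class centroid."""
    (a, b), (_, d) = cov
    det = a * d - b * b
    dx, dy = x[0] - centroid[0], x[1] - centroid[1]
    # Quadratic form with the closed-form inverse of a symmetric 2x2 matrix
    return (d * dx * dx - 2 * b * dx * dy + a * dy * dy) / det


def accepted(x, centroid, cov, limit=CHI2_95_2DF):
    """True if x falls inside the class ellipse at the chosen limit."""
    return mahalanobis_sq(x, centroid, cov) <= limit


# Hypothetical training vintage (two measured indices per sample)
train = [(1.0, 2.0), (1.2, 2.1), (0.9, 1.8), (1.1, 2.3),
         (0.8, 1.9), (1.0, 2.2), (1.3, 2.0)]
centroid, cov = fit_class_model(train)

# Hypothetical samples from a vintage not used for training
test = [(1.05, 2.05), (3.0, 0.5)]
print([accepted(x, centroid, cov) for x in test])
```

The first test sample lies close to the centroid and is accepted, while the second is far outside the ellipse and is rejected as an outlier for the class.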
The year-dependent modeling was then performed considering other categories: Pinerolese (the twenty-seven 2000 samples were used as training set and the eight 2001 samples as test set), Terrano (the twenty-three 2001 samples were used as training set and the eleven 2000 samples as test set) and Rosso di Montalcino (the seventeen 2002 samples were used as training set and the fourteen 2001 samples as test set). In all cases modeling was performed using both SIMCA and UNEQ and the results were compared. Also in these cases, the overall cross-validated results on the data sets were practically identical to those obtained when modeling the complete data set, whatever the technique chosen. However, as already noticed in the year-dependent modeling of the category Nero d'Avola, as far as the analysis of the test data was concerned, significant differences were observed between SIMCA and UNEQ. Indeed, in the modeling of the category Pinerolese, when analyzing the 2001 samples with the models built on the 2000 samples, 7 of the 8 test samples were accepted by the category model built using UNEQ, while only 5 of them were accepted by the SIMCA class model. The differences between the two methods are far more marked when the modeling of the other two categories is considered. Using the models built on the 2001 samples to analyze the eleven 2000 Terrano samples, only one test sample was rejected by the class model built using UNEQ, while only three samples were accepted by the category model built with SIMCA. Lastly, as far as the category Rosso di Montalcino is concerned, 9 of the 14 test samples produced in 2001 were accepted by the class model built using UNEQ on the 2002 samples, while the analogous SIMCA model built on the same training samples accepted only 2 of the 14 test samples.

All these results can be easily visualized in Figs. 1–4, where the test set samples are represented on the class spaces built using the training samples, as described in the previous paragraphs. In particular, for SIMCA modeling it is possible to represent the class space of a single category at a time by using distance to the model vs. leverage plots: leverage is a measure of the distance of an observation from the model center and is a usual statistical figure to identify outlying observations; in the distance to the model vs. leverage plots, the simultaneous presence of two threshold values (distance and leverage) results in the class space being identified by a rectangular region in the bottom left of the graph. On the other hand, as far as UNEQ is concerned, the category space must be defined in the framework of a Coomans plot, where an additional category is represented (here we have chosen to represent the category most similar to the modeled one). In a Coomans plot [39], the two axes represent the distance of each sample from a specific category, so that each class model is drawn as a rectangle corresponding to the critical distance (p = 0.05) from the class. Any sample having a distance to the corresponding centroid greater than the critical distance is considered as being outside the class model and, as a consequence, rejected as an outlier for the specific category (graphically, it is plotted outside the rectangle defining the class model). When looking at Figs. 1–4, it is obvious that while most of the test samples fall within the class spaces of the models built with UNEQ, as far as the SIMCA models are concerned, most of them fall outside the respective category space.

Fig. 1. Modeling of Nero d'Avola samples produced in 2000, using samples from the other years as training set. (A) SIMCA distance to model vs. leverage plot for the category Nero d'Avola; (B) UNEQ Coomans plot for the pair of categories Nero d'Avola and Montepulciano d'Abruzzo. Symbols: ○ Nero d'Avola training samples (years 2001–2002); ● Nero d'Avola test samples (year 2000); × Montepulciano d'Abruzzo samples.

Then, we have tried to rationalize why SIMCA performed significantly differently from UNEQ on the reduced data sets. A first hypothesis, given the high number of principal components
used to model each category, could be that SIMCA was overfitting in the modeling step: that could happen because cross-validation alone, especially for classification, is seldom sufficient to determine an accurate number of principal components, so that an analysis of the loadings and of the eigenvalues could help in fine-tuning the optimal number of PCs. Since all the principal components we retained in the previous modeling stage had eigenvalues significantly larger than 1, all that we could do was check for noisy eigenvectors in the modeling stage, but all the retained principal components we had used in the previous SIMCA modeling proved significant; so, the number of principal components estimated by DCV seemed reliable. Nevertheless, we further tried to reduce the complexity of the model by arbitrarily including a smaller number of PCs to define each category space but, in this case, the results were worse than those obtained using the number of PCs suggested by DCV (fewer samples from each category were recognized by their respective model built using samples from different years).

Fig. 2. Modeling of Pinerolese Freisa samples produced in 2001, using samples produced in 2000 as training set. (A) SIMCA distance to model vs. leverage plot for the category Pinerolese Freisa; (B) UNEQ Coomans plot for the pair of categories Pinerolese Freisa and Rosso di Montalcino. Symbols: ◊ Pinerolese Freisa training samples (year 2000); ♦ Pinerolese Freisa test samples (year 2001); □ Rosso di Montalcino samples.

Fig. 3. Modeling of Terrano samples produced in 2000, using samples produced in 2001 as training set. (A) SIMCA distance to model vs. leverage plot for the category Terrano; (B) UNEQ Coomans plot for the pair of categories Terrano and Solopaca. Symbols: △ Terrano training samples (year 2001); ▲ Terrano test samples (year 2000); ⁎ Solopaca samples.

Having verified that the discrepancy in the results was not due to a wrong choice of the number of principal components, we tried to examine the effect of variable selection: UNEQ modeling requires a preliminary variable selection, so that not all the original indices measured on each sample are used to build the model while, as far as SIMCA modeling was concerned, we operated on all the variables. We then repeated SIMCA
modeling of the different vintages using only the four variables selected to build each UNEQ model (Cu, Zn, SO2 and total anthocyanins), to check whether the difference in the results observed in the previous paragraph could be ascribed to the choice of the descriptors. Indeed, when SIMCA modeling was repeated using 4 variables only, the results, in terms of specificity, sensitivity and acceptance/rejection of the test set samples, were identical to those of the optimal UNEQ models. In particular, all the samples of Nero d'Avola were recognized correctly by the models built using samples from the same category but of a different vintage, 7 of the 8 Pinerolese samples from 2001 were recognized by the Pinerolese model built on the 2000 samples, only 5 of the 14 Rosso di Montalcino test samples (2001) were rejected by the category model built on samples from 2002 and, lastly, only 1 Terrano sample (2000) was rejected by its category model built on the 2001 samples. Based on these results, it seems that the large discrepancy observed in the previous modeling stage between SIMCA and UNEQ could be ascribed to the presence, in SIMCA, of other variables which can be more sensitive to variations in climate and temperature and, hence, discriminate among the different vintages: when these variables are removed from the data set, the two techniques provide the same results.

Fig. 4. Modeling of Rosso di Montalcino samples produced in 2001, using samples produced in 2002 as training set. (A) SIMCA distance to model vs. leverage plot for the category Rosso di Montalcino; (B) UNEQ Coomans plot for the pair of categories Rosso di Montalcino and Pinerolese Freisa. Symbols: □ Rosso di Montalcino training samples (year 2002); ■ Rosso di Montalcino test samples (year 2001); ◊ Pinerolese Freisa samples.

4. Conclusions

The results reported in this paper show that using two chemometric class-modeling techniques, namely SIMCA and UNEQ, it is possible to authenticate the origin of Italian wine samples from different Controlled Denominations of Origin. In particular, as far as the modeling of the whole data set is concerned, the two techniques provided very similar results both in terms of sensitivities and non-error rate (2 samples mispredicted in UNEQ, and only one in SIMCA), yet SIMCA models built on the whole data set were more specific than the corresponding ones built using UNEQ (only 3 confused samples with the former technique, compared to the 31 confused samples using the latter). Therefore, according to these results SIMCA would seem to perform slightly better in authenticating the wine samples under investigation.

However, when considering the ability of the category models to deal with wine samples produced in years different from those used to build the class models, further considerations need to be made, also concerning which and how many variables have to be included in the models. Indeed, in almost all the examined cases, when all the measured variables are used, SIMCA models built on samples from specific years did not accept wine samples from the same category but produced in years not considered in the modeling phase. Given the high specificity of the models (almost 100% for each category), this need not be a bad result per se, if one were interested in discriminating also among the different vintages. However, if on the other hand the purpose is just to discriminate among the different denominated products, some sort of variable selection has to be performed on the data set. This variable selection is always necessary using UNEQ, which requires a high ratio of the number of samples per category to the number of variables in the model (greater than or equal to 3), so that only four experimental indices were included in the UNEQ models (Cu, Zn, total anthocyanins and SO2): these models resulted in a high percentage of the test samples being accepted by the corresponding category models built using samples from different vintages. When only the same four variables were used to build the corresponding SIMCA models, the results obtained were comparable (and significantly better than those of the SIMCA models using all the variables), with almost all the test samples recognized by their respective category model. So, varying the number of variables to be included in the training set could provide a way to fine-tune the model's response: including a large number of variables, which can
F. Marini et al. / Chemometrics and Intelligent Laboratory Systems 84 (2006) 164–171 171

contain information about small climatic or year-to-year [18] H.K. Sivertsen, B. Holen, F. Nicolaysen, E. Risvik, J. Sci. Food Agric. 79
variations can result in discriminating also the vintages while, (1999) 107.
[19] Decree 24/05/1968, Official J. of the Italian Republic, L178 (1968) and
including only those variables which are responsible of the successive modifications.
differences among the products, a discrimination among the [20] Decree 30/10/1994, Official J. of the Italian Republic, L238 (1994) and
denominations only is provided. successive modifications.
From a chemical point of view, both techniques agreed in [21] Decree 20/09/1973, Official J. of the Italian Republic, L28 (1974) and
successive modifications.
identifying the content of some metals (Cu, Zn and Pb in
[22] Decree 12/09/1996, Official J. of the Italian Republic, L227 (1996) and
SIMCA; Cu and Zn in UNEQ) as the most discriminant successive modifications.
variables: this could be explained by these indices being most [23] Decree 17/07/1985, Official J. of the Italian Republic, L145 (1986) and
related to soil condition and hence, at least in principle, to the successive modifications.
geographical origin of the samples. [24] Decree 25/11/1983, Official J. of the Italian Republic, L158 (1984) and
successive modifications.
[25] Decree 05/11/1992, Official J. of the Italian Republic, L269 (1992) and
References successive modifications.
[26] EC (The Commission of the European Communities), Regulation 2676/90,
[1] J.E. Gimenez-Miralles, D.M. Salazar, I. Solana, J. Agric. Food Chem. 47 Off. J. Commission Eur. Communities, L272 (1990) 1.
(1999) 2645. [27] OIV Récueil des Méthodes Internationales dʻAnalyse des vins et des
[2] A. Rossmann, F. Reniero, I. Moussa, H.-L. Schmidt, G. Versini, M.H. moûts, Office International de la Vigne et du Vin, Paris, 1990.
Merle, Z. Lebensm.-Unters. -Forsch. 208 (1999) 400. [28] K. Briviba, L. Pan, G. Rechkemmer, J. Nutr. 132 (2002) 2814.
[3] G. Versini, A. Monetti, F. Reniero, in: T.R. Watkins (Ed.), Wine – [29] B.G.M. Vandeginste, D.L. Massart, L.M.C. Buydens, S. De Jong, P.J.
Nutritional and Therapeutic Benefits, ACS Symposium Series, vol. 661, Lewi, J. Smeyers-Verbeke, Handbook of Chemometrics and Qualimetrics:
American Chemical Society, Washington, DC, 1997, p. 113. Part B, Elsevier, Amsterdam, 1998.
[4] I.J. Košir, M. Kocjančič, N. Ogrinc, J. Kidrič, Anal. Chim. Acta 429 [30] S. Wold, M. Sjostrom, in: B.R. Kowalski (Ed.), Chemometrics, Theory and
(2001) 195. Application, ACS Symposium Series, vol. 52, American Chemical
[5] R.W. Cahn, Nature 338 (1989) 708. Society, Washington, DC, 1977, pp. 243–282.
[6] G.J. Martin, C. Guillou, M.L. Martin, M.T. Cabanis, Y. Tep, J. Aerny, [31] M.A. Sharaf, D.L. Illman, B.R. Kowalski, Chemometrics, John Wiley &
J. Agric. Food Chem. 36 (1988) 316. Sons, New York, NY, 1986.
[7] M. Forina, C. Armanino, M. Castino, M. Ubigli, Vitis 25 (1986) 189. [32] D.L. Massart, B.G.M. Vandeginste, S.N. Deming, Y. Michette, L.
[8] G.J. Soleas, J. Dam, M. Carey, D.M. Goldberg, J. Agric. Food Chem. 45 Kaufman, Chemometrics: A Textbook, Elsevier, Amsterdam, The Nether-
(1997) 3871. lands, 1988.
[9] M. Forina, G. Drava, Analusis 25 (1997) M38. [33] S. Wold, Technometrics 20 (1978) 397–405.
[10] M.L. González-Sanjosé, G. Santa-Maria, C. Diez, J. Food Compos. Anal. [34] M.P. Derde, D.L. Massart, Anal. Chim. Acta 184 (1986) 33–51.
3 (1990) 54. [35] R. De Maesschalck, D. Jouan-Rimbaud, L. Massart, Chemom. Intell. Lab.
[11] R.M. Tapias, M.S. Larrechi, J. Guash, J. Rubio, F.X. Rius, Am. J. Enol. Syst. 50 (2000) 1–18.
Vitic. 37 (1986) 195. [36] G. McLachlan, Discriminant Analysis and Statistical Pattern Recognition,
[12] I. Moret, G. Scarponi, P. Cescon, J. Agric. Food Chem. 42 (1994) 1143. Wiley, New York, NY, 1992.
[13] C.M. García-Jares, M.S. García-Martín, N. Carro-Mariño, R. Cela- [37] M. Stone, J. R. Stat. Soc., B 36 (1974) 111–147.
Torrijos, J. Sci. Food Agric. 69 (1995) 175. [38] M. Forina, S. Lanteri, C. Armanino, C. Cerrato-Oliveiros, C. Casolino, V-
[14] L. Almela, S. Javaloy, J.A. Fernández-López, J.M. López-Roca, J. Sci. Parvus 2003: An extendable package of programs for data explorative
Food Agric. 70 (1996) 173. analysis, classification and regression analysis. Department of Chimica e
[15] M.J. Baxter, H.M. Crews, M.J. Dennis, I. Goodall, D. Anderson, Food Tecnologie Farmaceutiche e Alimentari, University of Genova, Genova,
Chem. 60 (1997) 443. Italy, 2003, Free download at http://www.parvus.unige.it.
[16] M. Ortega-Heras, M.L. González-Sanjosé, S. Beltrán, Quím. Anal. 18 [39] D. Coomans, I. Braeckaert, M.P. Derde, A. Tassin, D.L. Massart, S. Wold,
(1999) 127. Comput. Biomed. Res. 17 (1984) 1–14.
[17] S. Pérez-Magariño, M.L. González-Sanjosé, Food Sci. Technol. Int. 7
(2001) 237.
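
The sample-to-variable requirement mentioned in the conclusions for UNEQ (at least 3 samples per category for each variable in the model) can be illustrated with a minimal Python sketch. The function names and the class sizes below are hypothetical and chosen only for illustration; they are not the sample counts or the software used in the paper.

```python
def max_variables(class_sizes, min_ratio=3.0):
    """Largest number of variables for which every class keeps
    (samples in class) / (variables) >= min_ratio."""
    smallest_class = min(class_sizes.values())
    return int(smallest_class // min_ratio)


def check_ratio(class_sizes, n_variables, min_ratio=3.0):
    """For each class, report whether the samples-to-variables ratio
    reaches min_ratio."""
    return {
        name: (n_samples / n_variables) >= min_ratio
        for name, n_samples in class_sizes.items()
    }


# Hypothetical class sizes (illustration only, not the paper's data).
example_sizes = {"class_a": 14, "class_b": 12, "class_c": 20}

print(max_variables(example_sizes))    # variables allowed by the smallest class
print(check_ratio(example_sizes, 4))   # ratio check with a 4-variable model
print(check_ratio(example_sizes, 10))  # ratio check with a 10-variable model
```

Under this rule the smallest category limits the size of the model, which is consistent with the need for variable selection before UNEQ modeling.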

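
The acceptance or rejection of test samples by a category model can be sketched in the same spirit. The example below is a simplified, UNEQ-style class model, not the authors' implementation: it assumes a multivariate normal class, uses the squared Mahalanobis distance from the class centroid, and accepts a sample when that distance does not exceed the 95% chi-square quantile (an approximation assumed here in place of the exact critical value used by UNEQ). All data are synthetic, and the names `fit_class_model` and `acceptance_rate` are hypothetical.

```python
import numpy as np

# 95% chi-square quantiles for 1 to 4 degrees of freedom (standard table values).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def fit_class_model(X_train):
    """Estimate the class centroid and the inverse covariance matrix."""
    X_train = np.asarray(X_train, dtype=float)
    centroid = X_train.mean(axis=0)
    covariance = np.cov(X_train, rowvar=False)
    return centroid, np.linalg.inv(covariance)


def squared_mahalanobis(X, centroid, inv_cov):
    """Squared Mahalanobis distance of each row of X from the centroid."""
    diff = np.asarray(X, dtype=float) - centroid
    return np.einsum("ij,jk,ik->i", diff, inv_cov, diff)


def acceptance_rate(X_test, model, n_variables):
    """Fraction of test samples accepted by the class model at the 95% level."""
    centroid, inv_cov = model
    d2 = squared_mahalanobis(X_test, centroid, inv_cov)
    return float(np.mean(d2 <= CHI2_95[n_variables]))


# Synthetic data (assumption): 4 variables, one training "vintage",
# a slightly shifted test "vintage" of the same class, and a distant class.
rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=(30, 4))
same_class_other_vintage = rng.normal(loc=0.2, scale=1.0, size=(8, 4))
other_class = rng.normal(loc=8.0, scale=1.0, size=(8, 4))

model = fit_class_model(train)
print(acceptance_rate(same_class_other_vintage, model, 4))
print(acceptance_rate(other_class, model, 4))
```

In this toy setting, the fraction of accepted samples from the training class plays the role of sensitivity, and the fraction of rejected samples from other classes plays the role of specificity.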