Food Quality and Preference 14 (2003) 463–472

Variable selection in PCA in sensory descriptive and consumer data

Frank Westad a,*, Margrethe Hersleth a, Per Lea a, Harald Martens b

a MATFORSK, Norwegian Food Research Institute, Osloveien 1, N-1430 Ås, Norway
b Sensory Science, The Royal Veterinary and Agricultural University, DK-1958 Frederiksberg, Denmark

Received 30 July 2002; received in revised form 6 November 2002; accepted 12 November 2002
Abstract

This paper presents a general method for identifying significant variables in multivariate models. The methodology is applied on principal component analysis (PCA) of sensory descriptive and consumer data. The method is based on uncertainty estimates from cross-validation/jack-knifing, where the importance of model validation is emphasised. Student's t-tests based on the loadings and their estimated standard uncertainties are used to calculate significance on each variable for each component. Two data sets are used to demonstrate how this aids the data-analyst in interpreting loading plots by indicating the degree of significance for each variable in the plot. The usefulness of correlation loadings to visualise correlation structures between variables is also demonstrated.

© 2003 Elsevier Science Ltd. All rights reserved.

Keywords: PCA; Descriptive sensory data; Consumer data; Variable selection; Validation
1. Introduction
In multivariate analysis where data tables with sensory descriptive and consumer-related variables are studied, it is important to extract the interpretable and statistically reliable information. One objective may be to find significant sensory attributes in sensory descriptive analysis. Whereas descriptive analysis of sensory data often yields high explained variance, consumer-related data such as preference data are less structured. There may be several reasons for this phenomenon (Lawless & Heymann, 1998). Firstly, the consumers may not differentiate among the products at all, either because the product samples are too similar or because the consumers are indifferent to the attributes in the products. Consumer ratings would consequently not fit well into the product space. Secondly, some consumers may base their hedonic scores on factors (sensory or non-sensory) that were not included in the product space derived from the analytical sensory data. Thirdly, some consumers simply yield inconsistent, unreliable responses, possibly because they changed their criteria for acceptance during the test. Unreliable responses can also be the result of consumers who are not motivated to take part and therefore answer randomly.

Many different groups of background variables are usually available for the consumers, such as demography, eating habits, attitudes, etc. Socio-economic variables often serve as a basis for segmentation of the consumers before relating them to sensory descriptive data with some preference mapping method (Helgesen, Solheim, & Næs, 1997). However, the segmentation should be validated, and removing non-relevant variables is usually more essential for these data than for sensory data. The main focus in this paper is to find relevant variables in sensory descriptive and consumer data, although the method can be applied on any data table.

For the sensory data on K sensory attributes, where L assessors in a trained panel have evaluated N products, analysis of variance (ANOVA) is usually employed to assess which individual sensory attributes are significant. These tests will reveal if assessors are able to distinguish among products on selected sensory attributes, and the average response over the assessors is often computed before further analyses are performed. This data table of averaged responses is the basis for the analysis as described below. As a result from the ANOVA, one might exclude assessors or down-weight some assessors for certain attributes to yield more reliable average values in the matrix of size N × K (Lea, Rødbotten, & Næs, 1995).

Principal component analysis (PCA) is a frequently applied method for multivariate overview analysis of sensory data (Helgesen et al., 1997; Jackson, 1991). The main purpose is to interpret any latent factors spanned by characteristics such as flavour, odour, appearance and texture, and to find products that are similar or different, and what differentiates them. This is done by studying loading and score plots. In this context, it is of interest to assess which variables are significant on the individual components to simplify the interpretation. There exist rules of thumb such as a cut-off for loadings at values higher than, e.g. 0.3, to assess which variables are important. However, the challenge for such general rules is that the squared loadings sum up to 1.0, so that the cut-off depends on the number of variables as well as samples (Hair, Andersen, Tatham, & Black, 1998). Also, one should consider the amount of explained variance for the component studied. Guo, Wu, Massart, Boucon, and de Jong (2002) applied feature selection from Procrustes analysis to find the best subset of variables to preserve as much information in the complete data as possible. Work on finding important variables has also been done by, e.g. Krzanowski (1987) and Rännar, Wold, and Russel (1996).

0950-3293/03/$ - see front matter © 2003 Elsevier Science Ltd. All rights reserved.
doi:10.1016/S0950-3293(03)00015-6
* Corresponding author. Tel.: +47-64-970303; fax: +47-64-970333.
E-mail address:
1.1. Model rank
The main purpose of assessing the model rank is to prevent spurious correlations from being interpreted as meaningful information. Methods to assess the correct rank based on cross-validation have been addressed extensively in latent variable regression methods such as principal component regression (PCR) and partial least squares regression (PLSR) (Green & Kalivas, 2002; Martens & Martens, 2000). Model results from these methods include the root mean square error (RMSE) from a validation procedure, which (preferably) decreases and thereafter increases, or approaches some asymptotic value. This behaviour is not necessarily to be expected for the residual cross-validated variance in PCA, since the space into which the deleted objects are projected expands with more components. A correction for the degrees of freedom consumed as more components are extracted might aid in assessing the rank. In the cross-validation for PCA the correction K/(K − A) is employed, where K is the number of variables and A is the number of components.

Also, the explained variance for the component is of importance, as one component may not be relevant to interpret at all. In PCA, there exists an ensemble of methods (Jackson, 1991) to find the correct rank. Preferably, a robust method should give the correct rank automatically from the analysis. Cross-validation (Wold, 1987), inspection of the scree plot, ratios of eigenvalues and Bartlett's test for model dimensionality are among the existing procedures (Jackson, 1991). The term "rank" with respect to a multivariate model deserves some comments, as "rank" has various facets:

1. Numerical. This rank is the one based on numerical computations, e.g. the number of components that can be computed without singularity problems.
2. Statistical. The important issue here is to find the optimal rank from a statistical criterion, preferably based on some proper validation method.
3. Application specific. Since significant is not the same as meaningful, this judgement is typically a combination of background knowledge, model complexity, and interpretation aspects. In most situations, this rank is lower than the statistical rank, i.e. the data-analyst tends to be more conservative.
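As an illustration of the corrected cross-validated residual variance discussed above, the following sketch computes it for PCA with leave-one-out cross-validation. It assumes NumPy; the function name and the leave-one-out scheme are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def cv_residual_variance(X, max_pc):
    """Leave-one-out cross-validated residual variance for PCA,
    corrected by K/(K - A) for the degrees of freedom consumed
    as more components are extracted (illustrative sketch)."""
    X = X - X.mean(axis=0)                 # column-centre once
    n, k = X.shape
    press = np.zeros(max_pc)               # prediction error sums
    for i in range(n):
        Xtrain = np.delete(X, i, axis=0)   # leave object i out
        _, _, vt = np.linalg.svd(Xtrain, full_matrices=False)
        for a in range(1, max_pc + 1):
            P = vt[:a].T                   # K x A loading matrix
            t = X[i] @ P                   # project the deleted object
            press[a - 1] += np.sum((X[i] - t @ P.T) ** 2)
    # K/(K - A): the projection space expands with each component
    corr = np.array([k / (k - a) for a in range(1, max_pc + 1)])
    return press * corr / (n * k)
```

The component at which the corrected curve levels off (or reaches its minimum) then suggests the statistical rank.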
1.2. Uncertainty estimates
Significance testing based on uncertainty estimates in regression has been published elsewhere (Martens & Martens, 2000; Westad & Martens, 2000), and has recently been applied in a method related to PCA, independent component analysis (ICA; Westad & Kermit, in preparation). Uncertainties may be estimated from resampling methods such as jack-knifing and bootstrapping. Jack-knifing is closely connected to cross-validation; the difference lies in whether the model with all objects or the mean of all individual models from the resampling should be regarded as the "reference". We feel that it is more relevant to use the model on all objects as the reference, since this is the model we interpret in terms of scores, loadings and other relevant plots. Thus, this approach for estimation might be named modified jack-knifing (Martens & Martens, 2000), and it is applied in this paper. According to studies by Efron (1982), the difference between these two is negligible in practical applications, especially for large numbers of objects. The main objectives with estimating uncertainty in multivariate models are to assess the model stability and to find significant components and variables.

Model validation is essential in all multivariate data analysis. The validation can be either model validation on the data at hand, such as cross-validation (Wold, 1978), or system validation. One example of the second type of validation is where a survey is repeated at different times or in different segments to confirm the hypothesis we might have about the system we are trying to observe.
2. Materials and methods
2.1. Example 1: descriptive sensory evaluation of ice cream

Fifteen different samples of vanilla ice cream were evaluated by a panel using descriptive sensory analysis as described in ISO 6564:1985. The sensory panel consisted of 11 panellists selected and trained according to guidelines in ISO 8586-1:1993, and the laboratory was designed according to guidelines in ISO 8589:1988. The samples were described using 18 different sensory attributes (Table 1). The panellists were given samples from both extreme ends of the scale to acquaint themselves with the potential level of variation for the different attributes. A continuous, unstructured 1.0–9.0 scale was used for the evaluation. Each panellist did a monadic evaluation of the samples at individual speed on a computerised system for direct recording of data (CSA Compusense, version 5.24, Canada). Two replicated measurements were made for each sample of ice cream. The samples were served in a randomised order. Replicates were randomised within the same session, so that no replicate effect is needed in the models (Lea, Rødbotten, & Næs, 1997).
2.2. Example 2: consumer preference mapping of mozzarella cheese
The second data set was taken from Pagliarini, Monteleone, and Wakeling (1997), where nine commercial mozzarella cheeses were evaluated by a trained sensory panel, and six of them were selected for a preference test by 105 consumers. The six cheeses were selected to span the sensory characteristics of the nine cheeses. The samples were rated on a nine-point hedonic scale by the consumers. In this paper the focus is on analysing the preference data with N = 6 products and K = 105 consumers.
2.2.1. PCA and significance tests
For a matrix X, assume the bilinear model structure

   X = TP^T + E_A                                   (1)

where X (N × K) is a column-centred data matrix; T (N × A) is a matrix of score vectors which are linear combinations of the x-variables; P (K × A) is a matrix of loading vectors, P^T P = I; and E_A (N × K) contains the residuals after A principal components have been extracted.

The uncertainty of the loadings, s(p_ak), may be estimated from (Efron, 1982; Martens & Martens, 2000)

   s(p_ak) = sqrt[ sum_{m=1}^{M} (p_ak − p_ak(−m))^2 · (M − 1)/M ]     (2)

where M = the number of segments in the cross-validation; s(p_ak) = the estimated uncertainty of each variable k in the loading for component a; p_ak = the loading for component a using all the N objects; p_ak(−m) = the loading for variable k for component a using all objects except the object(s) left out in cross-validation segment m.
p_ak and s(p_ak) may be subjected to a modified t-test (testing p_ak = 0 with t = p_ak/s(p_ak) on M degrees of freedom) to give significance values for individual variables for each component, and can also be used as an approximate confidence interval around each variable. The jack-knife based estimates tend to be conservative due to the inherent validation aspect. Univariate tests might not be the best way to assess significance for multivariate models. For one thing, there is a danger of false positives when applying many tests. Another aspect is that a variable may be significant in a multivariate sense, although individual tests do not give significance. These aspects are not pursued in this paper, but explained variance > 50% has been shown to be a good ad hoc rule to aid the decision of significance.
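A minimal sketch of the modified jack-knife and t-test described above, assuming NumPy and SciPy and leave-one-out segments (so M = N). Submodels are only sign-aligned with the full model here; the paper uses Procrustes rotation (Section 2.2.2) for this step. The function name is illustrative.

```python
import numpy as np
from scipy import stats

def jackknife_loading_pvalues(X, n_pc):
    """Estimate loading uncertainties by modified jack-knifing (eq. 2)
    and test p_ak = 0 with t = p_ak / s(p_ak) on M degrees of freedom."""
    X = X - X.mean(axis=0)                    # column-centre once
    n, k = X.shape
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    P = vt[:n_pc].T                           # K x A loadings, all objects
    M = n                                     # leave-one-out: M segments
    sq = np.zeros((k, n_pc))
    for m in range(n):
        Xm = np.delete(X, m, axis=0)          # segment m left out
        _, _, vtm = np.linalg.svd(Xm, full_matrices=False)
        Pm = vtm[:n_pc].T
        # sign-align each submodel component with the full model
        Pm = Pm * np.sign(np.sum(P * Pm, axis=0))
        sq += (P - Pm) ** 2                   # (p_ak - p_ak(-m))^2
    s = np.sqrt(sq * (M - 1) / M)             # eq. (2)
    pvals = 2 * stats.t.sf(np.abs(P / s), df=M)
    return P, s, pvals
```

Variables whose p-value falls below a chosen level on a component can then be flagged as significant in the loading plot.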
2.2.2. Rotation of models
In PCA, cross-validation for individual segments might give components that are mirrored or flipped compared to the model on all objects. The components may even come out in a different order when the corresponding eigenvalues are similar and/or close to eigenvalues of the noise part of the data. The PCs from the cross-validation must therefore be rotated towards the PCs based on all objects before the uncertainties are estimated. Procrustes rotation (Jackson, 1991; Milan & Whittaker, 1995) can be applied to rotate loadings and scores. The aim of Procrustes rotation is to make a matrix A similar to B by estimating a rotation matrix C so that the squared residuals D are minimised:

   A = BC + D                                   (3)

In this paper, the rotation matrix C in each submodel is estimated from the scores for the objects not left out in that segment (Martens & Martens, 2001) with orthogonal Procrustes rotation, and the inverse of C is then applied in rotating the loadings. Applying the rotation matrix directly allows rotation and stretching of the submodel in the direction of the main model. Thereby, the submodel may end up closer to the main model than intended from the original objective of flipping, mirroring and reordering of the components. This may give too optimistic significance values in situations with few objects and/or a skewed distribution of samples. One alternative is then to round the elements in C to integer values (−1, 0, 1) before scores and loadings are rotated. This can, however, give a rotation matrix that is not orthonormal when the submodel is rotated by an angle close to 45 degrees. The norm of the rounded
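A sketch of the rotation step, under the simplifying assumption that the submodel loadings are rotated directly towards the main-model loadings (the paper estimates C from the scores and applies its inverse to the loadings); the function name is illustrative.

```python
import numpy as np

def procrustes_rotate(P_main, P_sub, round_c=False):
    """Orthogonal Procrustes: find an orthonormal C minimising
    ||P_main - P_sub C||, then rotate the submodel loadings.
    round_c=True rounds C to (-1, 0, 1), so the submodel is only
    flipped, mirrored or reordered rather than stretched; near a
    45-degree rotation the rounded C may lose orthonormality."""
    u, _, vt = np.linalg.svd(P_sub.T @ P_main)
    C = u @ vt                                # orthonormal rotation
    if round_c:
        C = np.round(C)
    return P_sub @ C
```

When the submodel is an exact rotation of the main model, this recovers the main-model loadings exactly.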