You are on page 1of 15

This article was downloaded by: [McMaster University]

On: 18 July 2013, At: 10:04


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954 Registered office: Mortimer House,
37-41 Mortimer Street, London W1T 3JH, UK

Journal of Personality Assessment


Publication details, including instructions for authors and subscription information:
http://www.tandfonline.com/loi/hjpa20

Nonlinear Principal Components Analysis With CATPCA:


A Tutorial
a b
Mariëlle Linting & Anita van der Kooij
a
Child and Family Studies, Leiden University, The Netherlands
b
Methods and Statistics in Psychology, Leiden University, The Netherlands
Published online: 16 Dec 2011.

To cite this article: Marille Linting & Anita van der Kooij (2012) Nonlinear Principal Components Analysis With CATPCA: A
Tutorial, Journal of Personality Assessment, 94:1, 12-25, DOI: 10.1080/00223891.2011.627965

To link to this article: http://dx.doi.org/10.1080/00223891.2011.627965

PLEASE SCROLL DOWN FOR ARTICLE

Taylor & Francis makes every effort to ensure the accuracy of all the information (the “Content”) contained
in the publications on our platform. However, Taylor & Francis, our agents, and our licensors make no
representations or warranties whatsoever as to the accuracy, completeness, or suitability for any purpose of the
Content. Any opinions and views expressed in this publication are the opinions and views of the authors, and
are not the views of or endorsed by Taylor & Francis. The accuracy of the Content should not be relied upon and
should be independently verified with primary sources of information. Taylor and Francis shall not be liable for
any losses, actions, claims, proceedings, demands, costs, expenses, damages, and other liabilities whatsoever
or howsoever caused arising directly or indirectly in connection with, in relation to or arising out of the use of
the Content.

This article may be used for research, teaching, and private study purposes. Any substantial or systematic
reproduction, redistribution, reselling, loan, sub-licensing, systematic supply, or distribution in any
form to anyone is expressly forbidden. Terms & Conditions of access and use can be found at http://
www.tandfonline.com/page/terms-and-conditions
Journal of Personality Assessment, 94(1), 12–25, 2012
Copyright C Taylor & Francis Group, LLC
ISSN: 0022-3891 print / 1532-7752 online
DOI: 10.1080/00223891.2011.627965

STATISTICAL DEVELOPMENTS AND APPLICATIONS

Nonlinear Principal Components Analysis With


CATPCA: A Tutorial
MARIËLLE LINTING1 AND ANITA VAN DER KOOIJ2
1
Child and Family Studies, Leiden University, The Netherlands
2
Methods and Statistics in Psychology, Leiden University, The Netherlands

This article is set up as a tutorial for nonlinear principal components analysis (NLPCA), systematically guiding the reader through the process
of analyzing actual data on personality assessment by the Rorschach Inkblot Test. NLPCA is a more flexible alternative to linear PCA that can
Downloaded by [McMaster University] at 10:04 18 July 2013

handle the analysis of possibly nonlinearly related variables with different types of measurement level. The method is particularly suited to analyze
nominal (qualitative) and ordinal (e.g., Likert-type) data, possibly combined with numeric data. The program CATPCA from the Categories module
in SPSS is used in the analyses, but the method description can easily be generalized to other software packages.

The objective of this article is to familiarize applied researchers that standard PCA solutions can only be interpreted sensibly
with nonlinear principal components analysis (NLPCA), a when all variables are considered numeric.
method developed to explore possibly nonlinear relational struc- NLPCA output is comparable to PCA output, and includes
tures in data sets that might contain all types of (categorical and (a) eigenvalues, indicating the variance accounted for (VAF) by
numeric) variables. The article is set up as a tutorial that focuses each principal component; (b) component loadings, reflecting
primarily on the application of the method and the interpretation correlations between the quantified variables and the princi-
of results instead of on technical details. pal components; (c) sums of squared component loadings per
NLPCA is an alternative to principal components analysis variable over components (communalities), reflecting the con-
(PCA; see, e.g., Jolliffe, 2002) that is particularly useful for tributions of the quantified variables to the total VAF; and (d)
data sets containing variables with different measurement levels component scores for each case in the data set.
(nominal, ordinal, or numeric) that might be nonlinearly related It is important to realize that when all variables in a data set are
to each other. The goal of NLPCA is equivalent to that of PCA, considered numeric and relationships are considered to be linear,
namely to reduce a data set—typically consisting of many vari- PCA and NLPCA will render exactly the same results. Only
ables with complicated correlation patterns—to a smaller num- when the analysis includes variables specified by the researcher
ber of uncorrelated summary variables (principal components) as having a nominal or ordinal analysis level, NLPCA will deal
that represent the information in the data as closely as possi- with nonlinear relationships. This specification need not concur
ble. This goal is achieved by finding a (small) number of linear with the variable’s measurement level (see “Analysis Levels”
combinations that explain as much as possible of the variance later in this article).
in the data, thereby revealing relational structures among the In this article, we refer to three types of measurement level:
observed variables. The difference between the methods is that
1. Nominal variables consist of categories that classify cases
PCA can only reveal linear relationships, whereas NPLCA can
in addition reveal nonlinear relationships by quantifying cate- (persons or objects) into separate groups that are not hierar-
chically ordered. For instance, gender, religion, and ethnicity
gorical or nonlinearly related variables in a way that is optimal
are considered to be nominal variables.
(in a statistical sense) for the PCA goal.
The main advantages of NLPCA over standard PCA are its 2. Ordinal variables also consist of categories used to distin-
guish groups, but these categories have an ascending or de-
ability to (a) deal with nonlinear relationships; (b) enable the
scending order, based on the degree to which a certain qual-
researcher to check multivariate linearity of relationships be-
tween the variables at hand (which might be particularly use- ity is present. Typically, categories are not assumed to be
equally spaced. For instance, educational degree, or values
ful for ordinal variables, such as Likert-type variables); and (c)
on a (Likert-type) rating scale should be considered ordinal.1
jointly analyze numeric, ordinal, and nominal variables. In other
words, NLPCA overcomes the main limitations of linear PCA, 3. Numeric variables are variables of which equally spaced ob-
served values represent equally spaced true values. This type
that is, the assumption of linearity of relationships and the fact
of variable is often divided into two subtypes: interval scales

Received June 22, 2010; Revised February 25, 2011. 1Note that Likert-type rating scales are often analyzed as numeric, assuming
Address correspondence to Mariëlle Linting, Child and Family Studies, Lei-
the categories to be equally spaced. However, such an a priori assumption might
den University, P.O. Box 9555, 2300 RB Leiden, The Netherlands; Email:
not be justified.
linting@fsw.leidenuniv.nl
12
NONLINEAR PCA WITH CATPCA 13

(without an absolute zero point) and ratio scales (with an ab-


solute zero point). Numeric variables can be used directly in
statistical calculations. Examples are temperature (interval),
and reaction time (ratio).
In most literature, nominal and (sometimes) ordinal variables
are classified as categorical. However, numeric variables can
also be considered categorical variables with c categories (where
c is the number of distinct observed values).
Performing NLPCA is a dynamic process in which the re-
searcher takes an active part. The procedure typically exists of
several steps in which analysis results are evaluated and con-
sequently, analysis options are revised. The remainder of this
article consists of a more extensive discussion of the NLPCA
method, followed by a step-by-step description of this dynamic
analysis process on empirical Rorschach inkblot test data. Dif-
ferent analysis options and results are displayed, discussed, and
interpreted.
NLPCA was developed from a long history of contributions to
the field of categorical data analysis; see, for example, Guttman
Downloaded by [McMaster University] at 10:04 18 July 2013

FIGURE 1.—Variable vector (gray line) for a fictive variable with five values
(1941), Kruskal (1965), Shepard (1966), Kruskal and Shepard
(squares). The partial vector from the origin (open circle) to the loading point
(1974), Young, Takane, and De Leeuw (1978), and Winsberg (closed circle) indicates the variable’s variance is accounted for.
and Ramsay (1983). For a historical overview, see Gifi (1990).
Currently, the method is available in the two major commercial
software packages: SAS (SAS Institute, 2009b) contains PRIN-
represents a different category. The closed markers represent
QUAL (SAS Institute, 2009a), and SPSS (SPSS, 2009) contains
the category points of religion, and the open markers repre-
CATPCA (Meulman, Heiser, & SPSS, 2009). Also, some re-
sent component scores (case points). With the centroid model, a
lated functions are available in R (R Development Core Team,
point representing a category in the principal components space
2005). In this article, we use CATPCA to illustrate the proce-
is located as closely as possible to each of the component scores
dure of analyzing data with NLPCA. This software is part of the
for the persons that scored in that category. In that way, a cate-
SPSS Categories module, and can be found under Dimension
gory point can be interpreted as a group point, summarizing the
Reduction, Optimal Scaling in the SPSS Analysis menu.
group of people that scored in that category. For instance, the
NLPCA METHOD category point for “Catholic” (closed circle) lies in the center
of all Catholic persons (open circles), and thus represents the
Model group of Catholic people in the data set. Category points lying
The objective of NLPCA is dimension reduction—that is, close together in the plot are closely related. For example, the
reducing a set of variables to a smaller number of principal categories Catholic and Protestant are close together, indicating
components—while taking into account (nonnumeric) measure- that people scoring these categories are quite alike in their scor-
ment levels and possible nonlinearity of relationships. In re- ing patterns on the other variables in this data set. Component 1
search practice, different methods are used to obtain the goal distinguishes Muslims from the rest, and Component 2 makes a
of dimension reduction, depending on the measurement level of distinction between Catholics and Protestants versus the rest.
the variables. PCA is applied to variables with numeric mea- In NLPCA, (a) it is possible to apply the vector model to some
surement level and multiple correspondence analysis (MCA) to variables and the centroid model to others, and (b) the vector
categorical variables. These methods employ different models, model can be generalized to ordinal and nominal variables, en-
called the vector model and centroid model, respectively. The abling the researcher to effectively analyze variables of differ-
vector model depicts a variable as a straight line (vector), thus ent measurement level within one analysis. Methodologically,
representing a variable as a direction in the component space, these features are established by, while performing dimension
whereas the centroid model depicts a variable as a set of category reduction, transforming categories of variables with nominal
points (centroids). and ordinal analysis levels to numeric values. The method used
Figure 1 shows a so-called category plot for a variable ana- for this quantification process is called optimal scaling, with
lyzed with the vector model. The squares representing the values optimal referring to the fact that the transformations are optimal
of the variable lie equally spaced and ordered on the variable for the model that is fitted. Let P be the number of principal
vector (gray line). The loading vector (black line) starts at the components selected for the analysis (based on theory, possibly
origin and ends at the point with as coordinates the loadings on combined with some statistical procedure; see Step 2d in the
each principal component. The length of this vector represents “Application” section). Then, in NLPCA, optimal implies that
the variable’s VAF. the first P components explain as much as possible of the vari-
Figure 2 is a so-called biplot, displaying a variable (here rep- ance in the transformed variables. This objective is obtained by
resented as category points) as well as case points, and providing an iterative process that stops when a preset convergence crite-
insight into the location of category points relative to the case rion is reached; that is, when the statistically optimal solution is
points. This plot shows a fictive example of the variable religion obtained. Technical details are irrelevant here, and can be found
analyzed with the centroid model. Each different marker type in, for example, Linting, Meulman, Groenen, and Van der Kooij
14 LINTING AND VAN DER KOOIJ

FIGURE 3.—Category plot for the fictive variable Religion with category points
restricted to be on a straight line (vector model). The distance on the line between
Downloaded by [McMaster University] at 10:04 18 July 2013

FIGURE 2.—Category points (closed markers) for the fictive variable Religion the category point and the origin equals the category quantification. When the
represented as centroids of component scores (open markers) for the persons category point lies in the direction from the origin to the point representing the
who scored each category. component loadings, the quantification has a positive value. Otherwise, it has a
negative value.
(2007a), and Meulman, Van der Kooij, and Heiser (2004), and
Gifi (1990). Next, the optimal scaling procedure is conceptually
explained. points from the origin in the direction where the category with
the highest quantification is located, and the category with the
Optimal Scaling Graphically Explained lowest quantification is somewhere beyond the origin in the
Optimal scaling can be pictured as a process that starts with opposite direction; the origin signifies the mean of the quantified
estimating category quantifications based on the centroid model, variables.
and then, in accordance with the analysis level of a variable The restricted category quantifications in Figure 3 are un-
(specified by the researcher), can impose restrictions on the ordered compared to the original category labels. Such un-
quantifications. In the centroid model, category quantifications ordered quantifications are referred to as nominal quantifica-
are coordinates of category points (see Figure 2), calculated for tions, and can be restricted further in two ways: (a) the ordering
each component as the mean of the component scores for the of the points is required to equal the ordering of the category
persons that scored that category. values, and (b) the points are required to be equally spaced.
The centroid model is useful when the researcher is inter- When both of these restrictions are applied, the variable is an-
ested in the location of the separate categories in the principal alyzed numerically, and the variable vector equals the vector
components space. Alternatively, the variable as a whole might resulting from linear PCA. When only the first restriction is ap-
be of interest. Then, restricting the centroid points to lie on a plied, the variable is analyzed at an ordinal analysis level. Thus,
straight line—that is, applying the vector model—is appropri- NLPCA generalizes the vector model to variables of nominal
ate. Restriction of category points to a straight line is done by and ordinal analysis level, which is further explained in the next
perpendicular projection. Figure 3 is a Category plot, showing section.
such restriction of the category points from Figure 2. Projec-
tions (depicted in gray) are the restricted category points. The
category quantification is the distance on the vector from the ANALYSIS LEVELS
category point to the origin. One of the most important decisions for researchers applying
The variable vector in Figure 3 is the best fitting line through NLPCA is the specification of a nominal, ordinal, or numeric
the centroid points, which always runs through the origin and analysis level for each variable in the data set (also see Linting
the point with as coordinates the loadings on each principal et al., 2007a). A variable’s analysis level is not necessarily equal
component (loading point). The loading vector (the black line to its measurement level, but is based on the researcher’s pref-
in Figures 1 and 3), starting at the origin and ending at the erences. The opportunity to specify analysis levels in NLPCA
loading point, represents the variable’s VAF.2 A loading vector allows the researcher to (a) analyze variables in accordance with
their measurement level, and (b) discover and handle nonlinear
2In Figures 1 and 3 we have depicted the loading vector in the category relationships, also between numeric variables.
plot for didactic purposes. However, in actual output, the category vectors and
The analysis level specified determines the amount of free-
loading vectors are displayed in separate plots. Then, it is important to realize dom allowed in transforming category values (labels) to cat-
that the importance of variables in the solution is indicated by the length of egory quantifications. When the centroid model is applied, a
the loading vector in the loadings plot (or table), and not by the length of the category quantification is the centroid of the component scores
variable vector in a category plot. of the persons that scored that category, which allows the
NONLINEAR PCA WITH CATPCA 15

FIGURE 4.—Nominal (a) and ordinal (b) quantifications of the fictive variable Religion.

maximum amount of freedom. The centroid model is applied Nominal Analysis Level
Downloaded by [McMaster University] at 10:04 18 July 2013

to a variable when the researcher specifies a multiple nominal Specifying a nominal analysis level is advisable when the
analysis level. All other analysis levels in NLPCA specify the researcher is interested in possible nonmonotonic nonlinear re-
vector (PCA) model (see Figures 1 and 3). Out of the levels us- lationships between the variable at hand and other variables in
ing the vector model, the nominal level allows the most freedom the data set. This is typically the case when a variable has a
in quantifying the variables, followed by the ordinal and then the nominal measurement level, but could also occur for ordinal or
numeric level (applied in linear PCA). Only when nominal or numeric variables. For instance, the numeric variable number
ordinal analysis levels are specified, NLPCA is allowed enough of car accidents might be nonmonotonically related to a (cat-
freedom to reveal, respectively, nonmonotonic and monotonic egorized) variable regarding age, with young and old people
nonlinear relationships between variables. causing more accidents than middle-aged people. Thus, high
The researcher should keep in mind that allowing the quan- values on number of car accidents go together with both high
tification process more freedom could have the downside of and low values of age. Such a relationship will only be revealed
capitalization on chance (i.e., possible sampling instability, es- if a nominal analysis level is specified for either age or number
pecially with small samples). Stability of the NLPCA solution of car accidents.
can be checked using the nonparametric bootstrap approach Interpretation of nominal quantifications is the most straight-
(Linting, Meulman, Groenen, & Van der Kooij, 2007b). Apart forward when a specific nonlinear pattern is present in the data.
from stability issues, nonlinear analysis levels (especially nom- For instance, when Age is nominally analyzed concurrently with
inal) could complicate interpretation. So, when no nonlinear re- number of car accidents, the transformation plot for age will
lationships between variables are found with the least restricted show a U-shape: low quantified values for the middle age cate-
analysis levels, it is advisable to repeat the analysis, using more gories and high quantified values for both the lower and higher
restricted analysis levels. age categories. For variables with a nominal measurement level,
category order is irrelevant, and thus we cannot speak of a spe-
Transformation Plots cific nonlinear pattern of the quantifications with regard to the
original category order. Then, multiple nominal (centroid) quan-
Researchers might gain insight into (nonlinearity of) rela- tifications are more insightful than nominal (vector) quantifica-
tionships among variables by inspecting transformation plots. tions. Also, centroid quantification might be preferable when
Such plots display the category numbers on the x axis and their the unrestricted category points in a category plot (e.g., Figure
quantifications on the y axis, and thus show the shape of the 3) lie far away from the variable vector. The distances from the
transformation, which gives insight into the relationship be- centroid points to the projected points on the line give an indi-
tween the variable at hand and the other variables in the data cation of the “cost” (in terms of fit) of restriction of the category
set. For example, Figure 4a displays the transformation plot points to a straight line.
for the variable from Figure 3, revealing the nonlinear pattern
of the relationship between religion and the other variables in
the data set: The distinction between Categories 1 (Catholic) Ordinal Analysis Level
and 2 (Protestant) on the one side and Categories 3 (Muslim), Specifying an ordinal level is advisable when the researcher is
4 (Jewish), and 5 (other) on the other side especially stands interested in nonlinearity of relationships, but wishes to maintain
out. The difference between Muslims and Protestants regarding the category order in the quantifications on theoretical grounds
their scoring patterns on the other variables is the largest (given or for the sake of interpretation. That is, with an ordinal anal-
that the category points are restricted to be on a straight line). ysis level, the order of the categories on the variable vector is
The variable Religion in Figure 4a has a nominal analysis level, immediately clear, without consulting the category plot. The
which is explained next, along with the other analysis levels fact that the ordinal analysis level forces the order of the cate-
using the vector model. gory numbers on the quantifications might, of course, have the
16 LINTING AND VAN DER KOOIJ

drawback that nonmonotonic relationships between variables the number of interior knots to 0 (one interval), a spline transfor-
remain undiscovered. mation equals the numeric transformation (standardization) of
Ordinal quantifications are obtained by restricting the nomi- a variable. When the degree is set to 1 with the number of inte-
nal quantifications in the following way. If for two consecutive rior knots equal to the number of categories minus 2, the spline
category labels the nominal quantifications are in a descending transformation equals the nominal or ordinal transformation of
order, the ordinal quantification is restricted to the weighted the variable. For more information on splines, see Winsberg and
mean of the two nominal quantifications (with as weights the Ramsay (1983), and specifically on splines in CATPCA, see
category frequencies); if this mean is lower than the quantifica- Linting et al. (2007a). The remainder of this article consists of
tion for the consecutive category, the ordinal quantification is a step-by-step application of NLPCA to an empirical data set in
the weighted mean of the three nominal quantifications, and so which we use spline analysis levels.
on.3 Figure 4b shows the ordinal quantification of the religion
variable from Figures 2 and 3. Because Categories 1 and 2 as APPLICATION: NLPCA STEP-BY-STEP
well as 3 and 4 are in the “wrong” order on the vector, they
receive equal ordinal quantifications (i.e., the weighted mean of The Data
the two nominal quantifications). Note that, in practice, ordinal The data analyzed in this study consist of 378 outpatients
quantification would not be advised for the religion variable, who undertook the Rorschach inkblot test at the University of
because there is no reason to assume relations between religion Tennessee (Meyer, 2000). The Rorschach test involves asking
and other variables to be monotonic. respondents to react to a standard set of 10 inkblot stimuli. Re-
sponses are coded and interpreted for personality assessment.
Numeric Analysis Level In the University of Tennessee sample, the Rorschach was ad-
Downloaded by [McMaster University] at 10:04 18 July 2013

If the researcher is not interested in nonlinear relationships ministered and scored using the Comprehensive System (Exner,
between the variables, a numeric analysis level can be specified. 1986, 1991). Final scores indicate the number of times a partic-
This type of analysis level requires an extra restriction on top ular type of response was given across all 10 inkblot cards. The
of the ordinal quantification: Not only the category order, but total number of responses across cards is recorded as R.
also the distances between categories have to be maintained in In a linear factor analytic study on a sample of 268 volunteer
the quantification process. Applying this additional restriction college students, Meyer (1992b) found four main dimensions
renders a transformation plot showing a straight line.4 When (components): The first dimension mainly reflected the strong
a numeric analysis level is specified for all variables in the relationship between the frequency of particular responses and
analysis, NLPCA will give exactly the same result as standard the general response frequency R; the second dimension re-
PCA. flected the respondent’s depth of cognitive and affective en-
gagement; the third was a holistic nonform-dominant color and
Spline Nominal and Ordinal Analysis Levels shading dimension; and the fourth was a dimension of form-
dominant shading determinants. In the study described in this
If a variable has many categories (as, e.g., a continuous vari- article, the variables selected for analysis differ from Meyer’s,
able), and the researcher is interested in nonlinear relationships because we use a selection procedure based on CATPCA results.
between that variable and other variables, nominal and ordinal However, because our selection contains variables from most of
analysis levels might lead to very irregular quantifications (go- Meyer’s dimensions, we expect to find at least some substantive
ing wildly up and down in the transformation plot) that lack in- similarities between solutions.
sightfulness and stability. Alternatively, a more restrictive spline Meyer (1992a) reported that across factor-analytic studies of
ordinal or spline nominal analysis level can be specified. the Rorschach, the first (and sometimes second) dimension was
Roughly explained, a spline transformation is obtained by di- consistently dominated by response frequency R. The domi-
viding the x axis of the transformation plot into a user-specified nance of R over at least one dimension can be evaluated as an
number of intervals and in each interval a separate transforma- artifact, only revealing the trivial relationship between giving
tion curve is fitted (e.g., a second-degree polynomial, that is, a many responses in general and obtaining high scores on specific
quadratic curve). These separate curves are joined at the inter- variables: If people obtain high scores on many variables, they
val boundaries (called internal knots). The smoothness of the automatically obtain a high total response frequency. Also note
resulting curve can be influenced by the user by specifying the that, because of differences in R between respondents, scores
number of internal knots (and thereby the number of intervals) within a particular variable are not directly comparable: A score
and the degree of the function (influencing the smoothness of of 10 on a particular determinant means something different for
the joining of the curves at the knots): More intervals and a a person with an R of 50 (20%) than it does for a person with
higher degree diminish smoothness and increase freedom of the an R of 20 (50%). Other problems with R were described by
transformation. When the degree is set to 1 (straight line) and Meyer (1992a).
As a solution to R being confounded with other variables, thus
3As this restriction is imposed on the quantifications in each iteration of the
complicating interpretation of results, we decided to divide all
Rorschach scores by R. Thus, we obtain relative instead of raw
optimal scaling process, the final result of ordinal quantification is not equal to
frequencies. After dividing by R, the relations between the other
the result of directly applying the order restriction to the final nominal category
quantifications.
variables and R are no longer trivial, but have substantive mean-
4For a variable with numerical analysis level, the quantifications can be more ing. For instance, before dividing by R, the correlation between
easily obtained by simply standardizing the observed variable (as in linear PCA). Populars (commonly seen response) and R was .17, indicating
However, we indicate how they can be obtained by optimal scaling for didactic that there is a weak tendency for respondents who give more
purposes. responses to the cards in general to give more popular responses
NONLINEAR PCA WITH CATPCA 17

(which is quite obvious). After dividing by R, however, the cor- marginal frequencies are common, with the majority of respon-
relation between Populars and R becomes –.48, indicating that dents not scoring a certain variable at all, and only some respon-
respondents who give many responses tend to give relatively few dents scoring it a few times. Therefore, merging of adjoining
popular responses (which is much more informative, indicating categories with small marginal frequencies can be considered
that respondents who give more responses to the cards are more (see Step 3).
likely to give fanciful interpretations). So, dividing by R renders
more insightful correlations than using the raw frequencies. As NLPCA Step 2: Specifying Preliminary Analysis Options
response frequency has been given interpretative significance in The second part of the analysis process is the multivariate in-
the literature (see Meyer, 1992a), R is included as a variable spection of the data, which is best started by analyzing the data
in the analyses, enabling interpretation of substantive relations at the least restricted analysis levels viable for the variables at
between R and other scores. hand. For each variable in the data set, the researcher can specify
In the following sections, we give a step-by-step descrip- an analysis level appropriate for the research objectives, regard-
tion of the way we analyzed the University of Tennessee less of the measurement level of the variable. Strong correlations
(UT) sample (after dividing the scores by R) using the pro- between analysis variables are not a problem in (NL)PCA.
gram CATPCA (SPSS). In the preliminary analysis, besides R,
we include the 66 Rorschach variables described in Table 1. Step 2a: Initial analysis levels. Rorschach data have been
From the 90 available Rorschach variables, we summed four analyzed with standard PCA or factor analysis in the past. Thus,
pairs of variables with similar content (resulting in variables to obtain comparable results we quantify the data according
52 –55 in Table 1), and excluded 19 content variables that to the vector model, choosing a nominal scaling level because
were rarely scored (including, e.g., blood, botany, and explo-
Downloaded by [McMaster University] at 10:04 18 July 2013

we do not want to preclude either monotonic or nonmonotonic


sion). These uncommon content variables were used to com- nonlinear relationships between the variables. In addition, the
pute one nominal variable indicating the most frequently used variables have many categories, so spline transformation is ad-
uncommon content variable for each respondent (see Table 2). visable. Accordingly, we start with specifying a spline nominal
This variable is included as supplementary multiple nominal analysis level for all variables in the data set. We use the default
variable in the final step of the NLPCA, which has didactical settings for the number of interior knots (2, meaning the data are
relevance. In addition, computing this extra variable is statis- divided into three intervals) and degree of the spline (2, meaning
tically sensible: A large number of very rare responses were the function estimated in each data interval is quadratic).
replaced by one variable with substantial marginal frequencies.
Step 2b: Missing options. Unlike the calculation process
in linear PCA, the NLPCA calculation process is not based on
NLPCA Step 1: Univariate Data Examination the correlation matrix, but on the data itself. Therefore, missing
The first step in the data-analytic process is univariate exam- data can be treated in a more sophisticated way than in linear
ination of the data at hand, which should precede any type of PCA by simply excluding only the cells in the data matrix con-
analysis. Although NLPCA does not make univariate assump- taining missing values from the calculation process, without the
tions about the data, and skewness of distributions in itself need need for pairwise or listwise deletion. This type of treatment
not be a problem, categories with very low frequencies might of missing data is called Passive in CATPCA and is the default
cause instability of the NLPCA solution, or might have too option. The advantage of using Passive is that all available data
much influence in the quantification process. Such categories are used in the analysis without “fabricating” extra data. To
could be merged with an adjacent category beforehand. The find out whether persons with missing values are different from
minimum number of observations per category needed for sta- others, Missing can also be included in CATPCA as an extra
bility might vary across data sets. Stability of the solution can be category. Independent of the analysis level of the variable, the
checked using the nonparametric bootstrap (Efron & Tibshirani, Missing category will obtain an optimal nominal quantification.
1993; Linting et al., 2007b). In general, based on simulations, If that quantification is very different from the other quantifica-
Markus (1994) advises a bottom threshold of 8. tions, this indicates that persons scoring a missing value are a
Transformation plots in CATPCA also give an indication special group, not comparable to the other respondents. When
whether a category with small marginal frequency causes a missing values are at random, there will be only slight differ-
problem. If the plot shows a clearly outstanding quantification ences between solutions with missing Passive and missing Extra
for a category with small marginal frequency, the variable’s category.
quantification is dominated by the contrast between the one or Regardless, the UT data do not contain missing values and
few objects scoring the concerning category and all other per- missing treatment is irrelevant here. Note that CATPCA treats
sons. Such domination is undesirable because it prevents us values of 0 as missing, so if raw data are analyzed, zeros should
from finding possibly more interesting discrimination among be recoded. An alternative is to use a discretizing option de-
the other objects. In such cases, either the object(s) scoring the scribed next.
dominant category should be excluded, or categories should
be merged. The main implication of merging categories is that Step 2c: Discretizing. The program CATPCA requires
the original (separate) categories cannot be distinguished from (positive) integer valued data. This is merely a technical require-
each other in the interpretation. However, as at least one of the ment, not inherent to the method of NLPCA. To transform the
separate categories was scored by very few persons, one might values of continuous variables to integers, the researcher might
question whether the distinction was of interest to begin with. use some computational procedure (e.g., rounding) prior to data
In the UT sample, all of the 90 available Rorschach variables analysis, but alternatively and more conveniently, a Discretizing
show significant positive skewness and kurtosis values. Small option can be specified within CATPCA.
18 LINTING AND VAN DER KOOIJ

TABLE 1.—Description of the 66 Rorschach variables in the analysis.

Type of Response Variables: Name (Description)

Location 1. W (whole card) 2. D (common detail) 3. Dd (unusual detail)


4. Space (use of white space)
Developmental Quality 5. DQv (vague perception) 6. DQo (ordinary object) 7. DQvp (vague related percepts)
8. DQp (related objects)
Organization 9. Zf (organizational frequency) 10. Zd (difference Zf - expected frequency)
Form Quality 11. FQxp (unusually detailed form quality) 12. FQxo (ordinary form quality) 13. FQxu (unusual form that fits)
14. FQxm (form that does not fit) 15. FQxnone (no form quality)
Movement 16. M (human movement) 17. FM (animal movement) 18. m (inanimate movement)
19. active (energetic movement) 20. passive (less active than “talking”)
Chromatic Color 21. FC (color secondary to form) 22. CF (form secondary to color) 23. C (pure color, no form present)
24. Cn (only color naming)
Achromatic Color 25. FC’ (achromatic color secondary to 26. C’F (form secondary to achromatic 27.C’ (pure achromatic color, no form)
form) color)
Texture 28. FT (texture secondary to form) 29. TF (form secondary to texture) 30. T (pure texture)
Downloaded by [McMaster University] at 10:04 18 July 2013

Vista – Depth 31. FV (depth secondary to form) 32. VF (form secondary to depth) 33. V (pure depth)
Diffuse Shading 34. FY (diffuse shading secondary to form) 35. YF (form secondary to diffuse shading) 36. Y (pure diffuse shading)
Reflections 37. Fr (reflection secondary to form) 38. rF (form secondary to reflection) 39. FD (dimensionality secondary to
form)
40. F (form alone)
41. Blends (>1 from m through fd) 42. Pair (2 identical percepts) 43. Popular (commonly seen response)
Content 44. H (real human) 45. (H) (fictional human) 46. Hd (human detail)
47. (Hd) (fictional human detail) 48. A (real animal) 49. (A) (fictional animal)
50. ad (animal detail) 51. (Ad) (fictional animal detail)
Cognitive Special Scores 52. Any DV (deviant verbalization, 1+2) 53. Any INC (disorganized concept, 1+2) 54. Any DR (deviant response, 1+2)
55. Any FAB (fabulized combination, 1+2) 56. Contam (merging of 2 percepts) 57. PSV (perseveration)
58. ALOG (autistic logic)
Other Special Scores 59. AB (abstract content) 60. AG (aggressive movement) 61. COP (cooperative movement)
62. GoodHR (good human representation) 63. PoorHR (poor human representation) 64. MOR (morbid content)
65. PER (personal knowledge) 66. CP (color projection)

Note. Variables shown in italics were not selected for the final analyses.

The discretizing option Multiplying is advised when continu-


ous variables are analyzed numerically. It leaves all characteris-
tics of the observed variable intact, by standardizing, then mul-
TABLE 2.—Frequency table of nominal variable indicating a respondent’s most
tiplying all values by 10, rounding, and adding a constant such frequently mentioned uncommon content (N = 378).
that 1 is the lowest resulting integer value. The option Ranking
is advised for continuous variables that will be treated (spline) Category Frequency %
ordinally or nominally. It replaces observed values by their rank
number, thus discarding the distances between observed values An (anatomy) 42 11.1
Art (artistic) 21 5.6
(which is inherent to ordinal and nominal treatment). Finally, Ay (anthropology) 12 3.2
the Grouping option can be used when the researcher wishes to Bl (blood) 1 0.3
decrease the number of categories of a variable prior to NLPCA Bt (botany) 49 13.0
(which might increase stability of results, but, of course, has Cg (clothing) 56 14.8
implications for interpretation, as the original categories cannot Cl (clouds) 2 0.5
Ex (explosion) 1 0.3
be interpreted separately). Food (food) 8 2.1
The Rorschach data divided by R are continuous values (pro- Fi (fire) 3 0.8
portions). As we specified a spline nominal analysis level for all Geog (map) 5 1.3
variables in the data set, and do not wish to limit the number of Hh (household item) 28 7.4
Hx (human experience) 10 2.6
categories a priori, the discretizing option Ranking is appropri- Ls (landscape) 32 8.5
ate. R itself is integer valued and thus needs no discretization. Na (nature) 28 7.4
Sc (science) 19 5.0
Step 2d: Number of components. In linear PCA, the num- Sx (sex) 16 4.2
Xy (x-ray) 4 1.1
ber of components chosen for analysis is only relevant for in- Idio (other content) 41 10.8
terpretation. In NLPCA, however, the number of components
NONLINEAR PCA WITH CATPCA 19

also affects analysis results, because, with optimal scaling, the


eigenvalues of the first P components are maximized (with P the
number of components specified by the user). Thus, for instance,
the two components in a two-dimensional solution are not equal
to the first two components of a three-dimensional solution (i.e.,
solutions are not nested; also see Linting et al., 2007a). Conse-
quently, when choosing the number of components, results from
different dimensionalities should be compared. In this context,
scree plots might be helpful tools.
In Step 3d, we show an actual example of the use of scree
plots to select the final number of components. In the current
step, we investigate scree plots for four-, five-, six-, and seven-
dimensional NLPCA solutions with all variables at a nominal
analysis level, which consistently show an elbow at the sec-
ond and the sixth component (not shown for sake of concise-
ness). This result suggests that a maximum of seven compo-
nents should be considered. We use the nonparametric bootstrap
(Efron & Tibshirani, 1993; Linting et al., 2007b; Timmerman,
Kiers, & Smilde, 2007) to examine the significance of the load-
ings on the first seven components, and find that on the seventh
Downloaded by [McMaster University] at 10:04 18 July 2013

FIGURE 5.—Object plot depicting component scores on Components 1 and 4.


component, all of the 95% confidence intervals contain the value
zero, indicating that none of the loadings are significant at a two-
sided significance level of .05. In addition to statistical grounds, contribute much to the solution, eliminating them will barely
interpretability should be a criterion in choosing the number of affect the total fit of the solution.
components. As interpretation of the seventh component is also VAF is the most important indication of fit, both for the prin-
unclear, we set the number of components at six. cipal components, and for the quantified variables, and should
thus be considered main criterion in variable selection. VAF by
the principal components across variables is represented by the
NLPCA Step 3: Preliminary Analysis and Adjustment of eigenvalues. Proportion of VAF by a component is its eigen-
Analysis Options value divided by the number of analysis variables. Eigenvalues
The six-component NLPCA solution on 67 Rorschach vari- and percentage VAF can be found in the Model Summary ta-
ables analyzed at a spline nominal level with discretizing option ble in CATPCA.5 VAF by each component in each separate
Ranking is first examined for outliers. After removing outliers, variable is given in the VAF table. For variables analyzed ac-
variables that contribute substantially to the solution are selected cording to the vector model—(spline) nominal, (spline) ordinal,
for further analysis. Then, the data are reexamined for outliers or numeric—these values are the squared component loadings.
and analysis options are evaluated. The sum of VAF over components (last column) is the total VAF
of the variable (communality).
Step 3a: Examine the solution for outliers. Outliers will For variable selection we look at total VAF in the variables
affect variable fit. Therefore, prior to variable selection based on (communalities): Variables with total VAF of .25 or higher are
fit (see Step 3b), it is important to make sure that the solution is selected for the final analysis. That is, at least 25% of the vari-
not dominated by outliers. Outliers in the NLPCA solution are ance in a quantified variable is explained across the principal
cases that obtain component scores (called “object scores” in components. Comrey (1973) gives the following rules of thumb
CATPCA) that lie at a large distance from the other component for VAF in a variable per component: 10% is poor, 20% is fair,
scores in the principal component space. As component scores 30% is good, 40% is very good, and 50% is excellent. As we
are standard scores, one might consider scores roughly exceed- aim for at least fair VAF across components, a criterion of 25%
ing the range of –3.5 through 3.5 outliers. The most insightful seems reasonable. The VAF criterion used should be dependent
way to detect outliers is by looking at plots of the compo- on the research question and data set at hand. For instance, if
nent scores (“Object plots” in CATPCA). For instance, Figure data refer to a specific physical phenomenon that is measurable
5 shows that case number 155 is an outlier on the fourth com- with a small amount of error, the VAF that would be expected
ponent. In fact, this case is also an outlier on the second, fifth, and required would be much higher than the 25% criterion used
and sixth components. Case 303 is an outlier on the second and here. Based on the 25% criterion, all variables shown in italics
fourth components. Both outliers are excluded from the analysis in Table 1 are excluded from further analysis. We continue the
used for variable selection. The other component scores do not analysis with 42 variables.
defer much from the main scatter. We repeat the analysis with In Step 2d, we used significance of loadings, assessed by the
376 cases to check for possible newly occurring outliers, which nonparametric bootstrap, as a statistical criterion to include or
are not present.
5Note that, when missing option Passive is used, the Model Summary table
Step 3b: Variable selection. Because the Rorschach data does not include percentage VAF values, because in that case, proportion VAF is
consist of a large number of variables, we want to exclude not exactly equal to the eigenvalue divided by the number of analysis variables.
variables with bad fit from the analysis, for sake of legibility of However, when the number of missings is not very large, this value still gives a
the output (i.e., interpretability). As variables with bad fit do not proper indication of proportion VAF.
20 LINTING AND VAN DER KOOIJ
Downloaded by [McMaster University] at 10:04 18 July 2013

FIGURE 6.—Scree plot with lines indicating the eigenvalues for Components 1 through 15 for a four-, five-, six-, and seven-dimensional CATPCA solution on 42
selected Rorschach variables analyzed at a spline nominal level (degree = 2, number of interior knots = 2).

exclude components. Note that the same procedure could be by looking at scree plots of the eigenvalues in four-, five-, six-,
used as a first step in variable selection. Alternatively, permu- and seven-dimensional solutions.6 Figure 6 shows scree plots
tation tests can be applied to assess p values for the loadings for these four solutions.7 In each plot, a slight “elbow” shows
(Linting, Van Os, & Meulman, 2011). However, even when such at the P+1th component, with P the number of components
a statistical procedure is used, effect size (VAF value) should specified by the user. This is a logical result of the optimal scal-
remain the main criterion for variable inclusion. ing process, because its objective is optimizing the eigenvalues
of the first P components. Overall, the actual scree in the plot
Step 3c: Reexamine for outliers. Because the outliers consistently starts after the sixth component, indicating that a
previously excluded might have obtained outlying component six-dimensional solution is the most appropriate. (Again, this
scores based on variables not in the final selection, we examine could be checked with the bootstrap or permutations.)
the 42-variable solution for all 378 cases for outliers. We iden-
tify and remove two other outliers than in the analysis with 67 Step 3e: Evaluate analysis levels. After the variable selec-
variables. (Cases 155 and 303 are no longer outliers, after the tion and determination of number of components, the choice of
variables based on which they obtained outlying scores were analysis levels for the variables can be evaluated by examining
removed from the analysis.) The analysis is continued with 376
cases. 6Remember that it is necessary to look at scree plots in different dimension-

alities, because NLPCA solutions are not nested.


Step 3d: Evaluate number of components. After selection 7Eigenvalues are from the bottom row of the Correlations transformed vari-

of variables and cases, we evaluate the number of components ables table.


NONLINEAR PCA WITH CATPCA 21
Downloaded by [McMaster University] at 10:04 18 July 2013

FIGURE 7.—Transformation plots for four Rorschach variables, analyzed at a spline nominal level. (a) A typical linear pattern; (b) a typical nonmonotonic
(nonlinear) pattern; (c and d) two typical examples of monotonic nonlinear patterns.

transformation plots. In Figure 7, four exemplary transformation Plateaus such as in Figure 7d might also occur when an ordi-
plots are depicted. The plots show rank numbers on the x axis, nal analysis level is applied. In that case, a nominal level should
because Ranking was used as a discretizing option. The almost be applied instead to investigate the cause of the plateau. If, as
straight line in Figure 7a indicates that variable W is practically in Figure 7d, the categories in the plateau obtain equal or very
linearly related to the other variables in the data set. Figure 7b similar nominal quantifications, ordinal treatment is appropri-
shows an approximately reversed U-shape for variable Y, in- ate. If, alternatively, the quantifications go up and down in the
dicating that persons in the low and high categories are alike, nominal transformation plot, tied values are caused by forcing
considering their scoring pattern on the other variables, and dif- an ordinal restriction on the quantifications (as described in the
fer in the same way from the persons in the middle categories. section on ordinal analysis levels). In such cases, nominal treat-
Figure 7c shows that CF is approximately linearly related to ment is more appropriate than ordinal treatment, except when
the other variables but the curve levels off at the categories 20 the nominal quantifications are very irregular or when the cat-
through 30 and at the highest categories, meaning that persons egories that are in a nonincreasing order have low frequencies.
within those categories are much alike in their scoring pattern In the latter situations, the VAF of the nominal and ordinal solu-
on the other variables. The quantification of the variable FQu in tions should be examined. If there is only a small difference, a
Figure 7d shows a lot of tied values for the middle categories. (spline) ordinal level is preferable over a nominal level, because
If a transformation plot shows a pattern like in Figure 7a, the it will give more stable results.
variable could just as well be analyzed numerically. If it shows This application illustrates that transformation plots can
a pattern like in Figure 7b, the variable is best analyzed at a be useful tools in two ways: (a) they give insight into the
(spline) nominal level. If it shows a pattern like in Figure 7c, the nature of relationships between variables, and (b) they indicate
variable is best analyzed (spline) ordinally. However, if the cat- which categories are useful for distinguishing people from
egories for which the transformation curve levels off have low each other, which could have implications for future data
frequencies, or if the total VAF of the variable is low, the results collection. If, for example, a Likert scale variable with five
when choosing a numeric level will only slightly differ from the categories (from totally agree through totally disagree) shows
results with a (spline) ordinal level for this variable. The plateau a nominal transformation that looks like Figure 4b, it can be
in the transformation of the variable in Figure 7d indicates that concluded that there is only a distinction between the first two
the persons scoring these categories cannot be structurally dis- and the last three categories. So, in future data collection, this
tinguished from each other based on their scoring patterns on variable could be reduced to two categories. Merging the first
the other variables. As the curve is monotonic, ordinal treatment two and last three categories for this variable will not change
would be appropriate here. the solution much, because two quantified values will result
22 LINTING AND VAN DER KOOIJ

that each will not differ much from the quantified values of TABLE 3.—Rotated component loadings from a six-dimensional CATPCA on
the corresponding original (unmerged) categories, as these 42 Rorschach variables, with all variables analyzed ordinally.
values were already close together. However, merging such Component
categories has the advantage of rendering more concise output 1 2 3 4 5 6
and possibly more stable results (Linting et al., 2007b).
Apart from transformation plots, information on the VAF could aid the researcher in making a final decision on the analysis levels of the variables. If granting the method more freedom does not lead to a substantial increase in VAF, the researcher could decide to use more restricted analysis levels, for the sake of stability and simpler interpretation of relations between quantified variables. With all variables at a spline nominal level, the VAF across components is 51.36%, which means that all six principal components together explain 51% of the variance in the 42 quantified Rorschach variables. In a psychological context this might be considered a reasonable value. The majority of the variables show a monotonic transformation curve. When we analyze the variables that show a nonmonotonic transformation curve (C’F, VF, Y, (Hd), and Any Fab) at a spline nominal and the other variables at a spline ordinal level (degree 2, number of interior knots 2), the VAF becomes 51.28%. Finally, when we analyze all variables spline ordinally, the VAF is 50.69%. This shows that allowing for nominal transformations does not give a large improvement in VAF over ordinal transformations, which indicates that the variables showing a nonmonotonic transformation curve do not have much weight in the solution (also see their component loadings). We decide to analyze all variables at a spline ordinal level, because of simpler interpretation. As none of the transformation plots show extreme category quantifications, merging categories is not necessary.
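To give an idea of how such a comparison is specified in syntax (a sketch only; the variable names are hypothetical stand-ins for the Rorschach variables, following the pattern of the Appendix), the analysis level is set per variable on the /ANALYSIS subcommand, for example with the nonmonotonic variables at a spline nominal level (SPNOM) and the remaining ones at a spline ordinal level (SPORD):

  CATPCA /VAR cprimef vf y hdpar anyfab othervar1 othervar2
   /ANALYSIS cprimef vf y hdpar anyfab (SPNOM, DEGREE = 2, INKNOT = 2)
     othervar1 othervar2 (SPORD, DEGREE = 2, INKNOT = 2)
   /DIM = 6
   /PRINT = VAF.

Rerunning with all variables set to the same level and comparing the total VAF then reproduces the kind of comparison described above.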
NLPCA Step 4: Final Analysis and Interpretation

Before running the final analysis, Step 3 of the process can be repeated several times, until reaching the final decision about the analysis options. When these final options are specified, we get to the final conclusions about the data at hand. In this case, we analyze the 42 remaining Rorschach variables all at a spline ordinal level (degree 2, number of interior knots 2). Annotated syntax for this final analysis is in the Appendix.

Step 4a: Variance accounted for. The total VAF across the six components is 50.69%, with a clearly dominant first component (VAF: Component 1 = 18.01%, Component 2 = 9.64%, Component 3 = 7.21%, Component 4 = 6.14%, Component 5 = 5.16%, Component 6 = 4.53%). This means that the six selected components explain about 51% of the variance in the 42 spline ordinally quantified Rorschach variables, which indicates reasonable fit.
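As a quick arithmetic check (ours, not part of the original text), the per-component percentages indeed add up to the reported total:

\[ 18.01 + 9.64 + 7.21 + 6.14 + 5.16 + 4.53 = 50.69. \]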
Step 4b: Component loadings. Two elements of the CATPCA output give insight into the component loadings: (a) the Component loadings table, and (b) the Loadings plot.

TABLE 3.—Rotated component loadings from a six-dimensional CATPCA on 42 Rorschach variables, with all variables analyzed ordinally.

            Component
Variable      1      2      3      4      5      6
FM          .735   .039  −.143  −.149  −.037  −.176
Active      .713   .135   .274   .043   .165  −.011
DQ+         .670   .223   .436  −.101   .102  −.006
F          −.655  −.160  −.140   .002  −.475  −.122
Blends      .619   .198   .083   .042   .523   .119
DQo        −.605  −.223  −.369   .129  −.042  −.334
Any FAB     .519   .001   .008   .175  −.109  −.016
Passive     .506   .019   .203  −.011   .108  −.006
AG          .474   .027   .206   .249   .067   .017
m           .373   .173  −.099   .085   .360   .161
Any DR      .323   .097  −.091   .232   .090   .308
W           .159   .914   .124  −.101   .078   .082
D          −.134  −.808  −.135  −.081  −.182  −.123
Zf          .358   .788   .291  −.062   .163  −.092
Pair        .227  −.536   .411  −.142  −.166  −.125
Dd         −.107  −.530  −.037   .344   .137   .041
MOR         .300   .347  −.031   .239  −.092   .092
Good HR    −.070   .019   .829  −.311   .101  −.076
H           .104   .110   .727  −.051   .066   .068
M           .522   .084   .684   .152   .091   .031
(H)         .176   .077   .496   .126  −.091  −.064
COP         .212   .055   .493  −.108   .021  −.120
FQ−         .130   .189  −.118   .722  −.178   .048
FQo        −.124  −.109   .215  −.702  −.237  −.153
Poor HR     .431  −.051   .304   .683  −.016   .098
Popular     .079   .067   .320  −.553  −.276  −.026
Hd          .019  −.217   .189   .528  −.130   .030
R          −.076  −.362  −.177   .520   .217   .049
C’F         .111   .037  −.011   .033   .543   .096
FQu         .037  −.062  −.086   .126   .523  −.138
Space      −.140   .350   .006   .263   .493  −.050
CF          .217   .239   .044  −.061   .476   .116
Zd          .058   .329   .161   .048   .464  −.039
A           .254  −.111  −.387  −.359  −.433  −.283
(Hd)       −.100  −.067   .163   .043   .425  −.015
FC’         .139   .009  −.160  −.130   .377  −.135
VF          .124  −.096   .029  −.236   .346   .018
FQnone     −.115   .006  −.098  −.054  −.067   .843
DQv        −.102  −.013  −.133  −.054  −.125   .772
C           .051   .119   .016   .074  −.081   .757
Y           .089  −.116  −.013   .150   .152   .496
AB          .166   .189   .131   .178   .107   .423

Note. Loadings higher than .40 are shown in bold.
8 As rotation options are not available within CATPCA in SPSS 17, VARIMAX rotation was performed by saving the transformed variables and submitting them to a linear PCA with VARIMAX rotation.

Table 3 is a loading table with (orthogonal) VARIMAX rotated component loadings of each of the analysis variables.8 Rotated Component 1 contains mainly variables indicating some type of active and integrated movement versus simplicity designated by the variables F (form alone) and DQo (ordinary objects). Rotated Component 2 gives a contrast between using the whole card and using specific details (both common and uncommon). Rotated Component 3 is all about normal, positive human perception and cooperation. Rotated Component 4 contains Popular responses and Ordinary-Form responses (well-fitting form perceptions, occurring fairly often) versus Minus-Form responses (nonfitting form perceptions) together with poor human representation, human detail, and R. Thus, after creating percentage scores for all variables, the raw variable R still contributes to a factor where it is negatively related to popular and common well-fitting perceptions, and positively related to distorted perceptions of human content or activity and human details. In other words, persons who give many responses are more likely to report a high proportion of perceptions of the latter types. Rotated Component 5 is mainly an organizational dimension containing Zd, unusual (fitting, but rare) form responses, and responses using color primary to form, and white space. (Hd) (mythical human) also loads positively on this
dimension. Thus, persons with higher proportions of unusual form responses using color and white space are more likely to report perceptions of a mythological or fictive human figure, whereas a pure animal perception (loading negatively on this dimension) is less likely. Finally, rotated Component 6 consists of vague, abstract responses (no-form requiring, vague perceptions, using pure color and pure diffuse shading, and abstract content). According to these interpretations, the NLPCA solution seems to give sensible insight into the Rorschach data.

FIGURE 8.—Rotated component loadings plot for the first two components from the CATPCA solution on 42 selected Rorschach variables analyzed at a spline ordinal level. Component 1 signifies static objects versus movement, and Component 2 holistic interpretation versus detail.

FIGURE 9.—Biplot of (a) rotated component loadings from the CATPCA solution on 42 selected Rorschach variables analyzed at a spline ordinal level, and (b) category points for the supplementary multiple nominal variable Uncommon content.

Besides component tables, two-dimensional component loading plots give an insightful overview of relations between variables (for solutions with more than two dimensions, pairwise plots can be requested). Such plots show loading vectors, which are partial variable vectors, running from the origin to the loading point (see the black lines in Figures 1 and 3). When an ordinal or nominal analysis level is used, the category plot (such as Figure 3) gives insight into the location of the categories on the variable vector. The length of the loading vector indicates variable VAF, and when vectors are long, cosines of the angles between vectors indicate correlations between variables.
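To make this geometric statement explicit, here is a brief sketch in standard PCA notation (the symbols are ours and do not appear in the article): if \(\mathbf{a}_j\) denotes the vector of loadings of quantified variable \(j\) on the retained components, the correlation between variables \(j\) and \(k\) reproduced by the solution equals

\[ \hat{r}_{jk} = \mathbf{a}_j^{\top}\mathbf{a}_k = \lVert\mathbf{a}_j\rVert\,\lVert\mathbf{a}_k\rVert \cos\theta_{jk}, \]

where \(\lVert\mathbf{a}_j\rVert^{2}\) is the VAF of variable \(j\). Thus, when both vectors are long (VAF close to 1), the cosine of the angle between them closely approximates the correlation between the quantified variables.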
In Figure 8, loadings on rotated Components 1 and 2 are shown. This plot immediately shows the relations between variables that are important in the first two components of the solution. In interpreting relations it is important to realize that scores are relative to R, so proportions of total responses. For instance, W (whole card) and D (detail) point in opposite directions, indicating a strong negative association between the two: Respondents who use the whole card in a large proportion of responses tend to use card details in a small proportion of responses. The same type of reasoning applies, for instance, to the vectors for DQ+ (objects in a meaningful relationship, abbreviated DQp in Figure 8), DQo (ordinary object), F (form only), and Blends. Variables indicating movement (M, FM, Active, Passive, AG) make a close to 90° angle with the variables indicating whole card versus detail, which indicates that these groups of variables are practically unrelated. In contrast, variables that make a small angle with each other, such as the movement variables and DQo, DQ+, F, and Blends, or D and Dd (common and unusual detail), are strongly related.

Step 4c: Object scores. The Object scores plot can be used to determine outliers (as described in Step 2 of the analysis process). Also, the object points can be depicted together with the loading vectors in a biplot. Object points can then imaginarily be projected onto variable vectors to show relationships with the variables. For instance, if we were to depict object points in Figure 8, a person located in the top left of the space would have relatively high proportions of responses using the whole card, form alone, and—to a lesser extent—white space, and relatively low proportions of responses involving details and pairs. For outliers, such projection onto the loading vectors might show on which variables the outlying scores occur. To gain insight into associations between variables and persons, CATPCA offers the option to label the object points according to any variable in the data set (e.g., background variables, like gender). For the Rorschach data, the object scores plots show evenly spread clouds of object points without outliers, and are not shown here for conciseness.
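The idea of imaginary projection can be sketched in the same notation (again, the symbols are ours, and we assume the standard CATPCA normalization with standardized object scores): with object-score vector \(\mathbf{s}_i\) for person \(i\) and loading vector \(\mathbf{a}_j\) for variable \(j\), the solution approximates person \(i\)'s quantified score on variable \(j\) as

\[ \hat{q}_{ij} = \mathbf{s}_i^{\top}\mathbf{a}_j = \lVert\mathbf{s}_i\rVert\,\lVert\mathbf{a}_j\rVert \cos\phi_{ij}, \]

so projecting an object point onto a variable vector orders the persons by their approximated quantified scores on that variable.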
NLPCA Step 5: Gain More Insight

After looking at the NLPCA solution on the Rorschach variables, complementary steps can be taken to gain more insight into the component structure. For instance, relations with categorical variables not included in the analysis could be investigated by adding such variables as supplementary. Supplementary variables do not affect the solution in any way, but their categories are fitted into the solution based on the analysis variables. Here, we add the variable Uncommon content (see the data description) as a supplementary variable with a multiple nominal analysis level.
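In syntax, adding a supplementary multiple nominal variable mirrors the pattern already shown in the Appendix (Table A1). A minimal sketch, with "uncommon" used here as a hypothetical name for the Uncommon content variable:

  CATPCA /VAR r w d space uncommon
   /ANALYSIS r w d space (SPORD, DEGREE = 2, INKNOT = 2) uncommon (MNOM)
   /SUPPLEMENTARY = VARIABLE (uncommon)
   /DIM = 6.

The supplementary variable is listed on /VAR and given an analysis level, but because it also appears on /SUPPLEMENTARY it does not influence the solution.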
Multiple nominal quantifications are estimated using the centroid model, showing the position of the categories as group
points instead of in the direction of the variable in the principal component space (see also Figure 2).9 Accordingly, variables with a multiple nominal analysis level do not obtain loadings; their total VAF is called the discrimination measure in MCA.10

9 The multiple nominal level is referred to as multiple because such a variable obtains category quantifications for each principal component, in contrast to the single, overall quantifications obtained with the other analysis levels.

10 Although not referred to as such in NLPCA, this measure is found in the VAF table in the Mean column under Centroid coordinates.
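As a brief sketch of the centroid model just mentioned (notation ours, not taken from the article): for a multiple nominal variable, the quantification of category \(c\) on dimension \(d\) is simply the mean object score on that dimension of the persons scoring that category,

\[ \bar{y}_{cd} = \frac{1}{n_c} \sum_{i \in c} s_{id}, \]

and the discrimination measure on dimension \(d\) is the variance, across persons, of these assigned category centroids.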
Figure 9 is a biplot, showing the categories of Uncommon content displayed within the rotated component loadings plot on the first two components (see Figure 8). To enhance legibility, specific variable labels have been omitted and replaced by more general indications of the content of the variables. Looking at Figure 9, we see that the categories are well spread across space, which indicates that persons scoring different Uncommon content categories can be clearly distinguished from each other according to their scoring pattern on the analysis variables. The categories Blood and Explosion are excluded from the plot, because these were very rare responses (only one person had either of these as most frequent category). The plot can be interpreted by (imaginary) projection of the categories onto the variable vectors to see how they relate to the vector variables. For example, people who have Clouds (Cl) and Human Experience (Hx) as their most frequent Uncommon content category typically also have high proportions of movement and interaction responses and relatively rarely have form alone and ordinary objects responses. Sex (Sx) and X-ray (Xy) are most often reported when the proportion of detail and pair responses is high, and are least reported when the proportion of responses using the whole card is high. In contrast, persons who have Fire (Fi) as their most frequent Uncommon content category tend to base their perception on the card as a whole instead of on details. Food and Maps (Geog) go together with high proportions of ordinary objects and form alone responses and a low proportion of movement responses. Another interesting conclusion is that Landscape (Ls), Nature (Na), Art, and Botany (Bt) are all close together in the plot, indicating that respondents who used these categories most frequently are quite alike in their scoring patterns. These results all seem to make sense in the context of personality assessment. Note that these interpretations only concern the variables loading highly on the first or second component. Interpretations of the locations of the Uncommon content categories in relation to the other variables should be inspected in plots of the components on which these variables load highly.

CONCLUSION AND DISCUSSION

This article serves two important purposes. First, it offers a nontechnical, step-by-step guide through CATPCA, illustrated by an empirical example, which should be helpful to all researchers wishing to use NLPCA in their own analyses. Second, it shows that the usefulness of NLPCA as a method goes beyond improving the VAF of the PCA solution.

In fact, when the six-dimensional NLPCA solution with all variables analyzed at a spline ordinal level (VAF = 50.69%) is compared to the linear PCA solution (VAF = 49.22%), the improvement in VAF by allowing nonlinear transformations is very small: a mere 1.5 percentage points. This increase will become larger as more freedom is granted to the method, for instance, by analyzing all variables nominally, or by increasing the degree or number of knots in the spline transformation. However, because of the many categories of most variables, granting too much freedom could lead to capitalization on chance, and would not be advisable. Even if less restricted analysis levels were used in this particular data set, the increase in VAF would not exceed a few percent. This result is due to the fact that the transformation of most variables did not show much nonlinearity, indicating that the relations between variables in this data set are not far from linear, which is useful knowledge in itself. Variables that did show nonlinear relations with the other variables had relatively low VAF in the solution. Consequently, the loadings pattern in NLPCA did not differ much from that in linear PCA either.
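As a side note for readers who want to reproduce such a benchmark themselves: a linear PCA solution can be approximated within CATPCA by giving all variables a numeric analysis level. A minimal sketch (variable names hypothetical, and assuming the NUME keyword of the /ANALYSIS subcommand is available in the version used):

  CATPCA /VAR var1 var2 var3 var4
   /ANALYSIS var1 var2 var3 var4 (NUME)
   /DIM = 2
   /PRINT = VAF.

Comparing the total VAF of this run with that of the nonlinear runs gives the kind of contrast reported above.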
The “Application” section of this article shows that, contrary to linear PCA, NLPCA gives insight into the nature of the relations in the data, and allows the researcher to make informed decisions about data analysis, instead of relying on possibly unrealistic assumptions. In fact, NLPCA can be used as a means to check assumptions about linearity of relations between variables. In addition, this article shows how including variables as multiple nominal, thus combining the PCA (vector) model with the correspondence model (a feature unique to NLPCA), can render interesting results. In linear PCA, inclusion of (multiple) nominal variables is impossible. An additional advantage of the CATPCA program in particular is that it renders more insightful (bi)plots than standard linear PCA software.

A more substantive comparison to other PCA results regarding the Rorschach would be interesting. Such a comparison is complicated, however, because different variables were selected for analysis across studies. In addition, a complete review goes beyond the scope of this article. However, we made a tentative comparison to Meyer’s (1992b) results, and found interesting similarities. The first dimension found by Meyer is comparable in content to our Dimension 4, although it shows other types of relationships between variables, because we divided the data by response frequency (R), thus removing trivial relationships between R and other Rorschach variables. Meyer’s second dimension includes all frequent location features (e.g., D, Dd, and Zd) and R, which corresponds to our second dimension. His third dimension, containing nonform-dominant holistic perceptions, resembles our Dimension 5. Meyer’s fourth dimension (form-dominant shading) is not present in our analysis, because most variables from this dimension were not selected for analysis. Compared to Meyer, we did not find a dimension dominated by R, due to division by R. On the other hand, we did find extra dimensions on movement, human-related content, and abstract content. We leave it to applied psychologists to decide whether our results provide insight into personality. However, using NLPCA on these data does provide insight into the nature of the relations among the Rorschach variables, compared to using standard linear PCA.

REFERENCES

Comrey, A. (1973). A first course on factor analysis. London, UK: Academic Press.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York, NY: Chapman & Hall.
Exner, J. (1986). The Rorschach: A comprehensive system: Vol. 1. Basic foundations (2nd ed.). New York, NY: Wiley.
Exner, J. (1991). The Rorschach: A comprehensive system: Vol. 2. Interpretation (2nd ed.). New York, NY: Wiley.
Gifi, A. (1990). Nonlinear multivariate analysis. Chichester, UK: Wiley.
Guttman, L. (1941). The quantification of a class of attributes: A theory and a method of scale construction. In P. Horst (Ed.), The prediction of personal adjustment (pp. 319–348). New York, NY: Social Science Research Council.
Jolliffe, I. T. (2002). Principal component analysis. New York, NY: Springer-Verlag.
Kruskal, J. B. (1965). Analysis of factorial experiments by estimating monotone transformations of the data. Journal of the Royal Statistical Society: Series B, 27, 251–263.
Kruskal, J. B., & Shepard, R. N. (1974). A nonmetric variety of linear factor analysis. Psychometrika, 39, 123–157.
Linting, M., Meulman, J. J., Groenen, P. J. F., & Van der Kooij, A. J. (2007a). Nonlinear principal components analysis: Introduction and application. Psychological Methods, 12, 336–358.
Linting, M., Meulman, J. J., Groenen, P. J. F., & Van der Kooij, A. J. (2007b). Stability of nonlinear principal components analysis: An empirical study using the balanced bootstrap. Psychological Methods, 12, 359–379.
Linting, M., Van Os, B. J., & Meulman, J. J. (2011). Statistical significance of the contribution of variables to the PCA solution: An alternative permutation strategy. Psychometrika, 76, 440–460.
Markus, M. T. (1994). Bootstrap confidence regions in nonlinear multivariate analysis. Leiden, The Netherlands: Leiden University DSWO Press.
Meulman, J. J., Heiser, W. J., & SPSS. (2009). SPSS Categories 17.0. Chicago, IL: SPSS.
Meulman, J. J., Van der Kooij, A. J., & Heiser, W. J. (2004). Principal components analysis with nonlinear optimal scaling transformations for ordinal and nominal data. In D. Kaplan (Ed.), Handbook of quantitative methodology for the social sciences (pp. 49–70). London, UK: Sage.
Meyer, G. (1992a). Response frequency problems in the Rorschach. Journal of Personality Assessment, 58, 231–244.
Meyer, G. (1992b). The Rorschach’s factor structure: A contemporary investigation and historical review. Journal of Personality Assessment, 59, 117–136.
Meyer, G. (2000). A replication of Rorschach and MMPI–2 convergent validity. Journal of Personality Assessment, 74, 175–215.
R Development Core Team. (2005). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
SAS Institute. (2009a). SAS/STAT 9.2 user’s guide. Cary, NC: Author.
SAS Institute. (2009b). SAS/STAT software [Computer software]. Cary, NC: Author.
Shepard, R. N. (1966). Metric structures in ordinal data. Journal of Mathematical Psychology, 3, 287–315.
SPSS. (2009). PASW Statistics 17.0 [Computer software]. Chicago, IL: Author.
Timmerman, M. E., Kiers, H. A. L., & Smilde, A. K. (2007). Estimating confidence intervals for principal component loadings: A comparison between the bootstrap and asymptotic results. British Journal of Mathematical and Statistical Psychology, 60, 295–314.
Winsberg, S., & Ramsay, J. O. (1983). Monotone spline transformations for dimension reduction. Psychometrika, 48, 575–595.
Young, F., Takane, Y., & De Leeuw, J. (1978). The principal components of mixed measurement level multivariate data: An alternating least squares method with optimal scaling. Psychometrika, 43, 279–281.

APPENDIX
ANNOTATED CATPCA SYNTAX FOR FINAL ANALYSIS ON RORSCHACH DATA

CATPCA can be entirely operated using the SPSS menu. CATPCA is entered by clicking Analyze → Dimension reduction (in versions preceding 17.0: Data reduction) → Optimal Scaling, and checking “Some variables are not multiple nominal.” The window that pops up gives the user the opportunity to specify all analysis options described in this article. After specifying the options, syntax can be generated by clicking the Paste button. Some insight into this syntax enables the user to easily adjust analysis options, without repeatedly going through the entire menu. Therefore, the annotated syntax of the final analysis described in this article is displayed in Table A1. To enhance legibility, instead of 42 variables, only 5 variables are listed.
TABLE A1.—Annotated syntax.

Syntax: CATPCA /VAR r w d space cont
Annotation: Call CATPCA and list all variables to be analyzed (including supplementary variables).

Syntax: /ANALYSIS r w d space (SPORD, DEGREE = 2, INKNOT = 2) cont (MNOM)
Annotation: Specify the analysis level of each variable. Each variable can obtain its own specification, between parentheses, but you can also list all variables with the same analysis level and state the level only once, followed by variables with different levels (as done here). When for all variables the same analysis level is chosen, “all” can be used instead of the complete variable list. Note. These remarks on the use of variable lists apply to the other options as well.

Syntax: /MISSING = r w d space cont (PASSIVE, MODEIMPU)
Annotation: Specify missing options. Here, passive treatment of missings is specified (default option); the second keyword (MODEIMPU) applies only to the calculation of the correlation matrix once the quantified variables are found. If the first keyword is ACTIVE, MODEIMPU imputes missing data with the mode of the variable (other imputation options are available).

Syntax: /SUPPLEMENTARY = VARIABLE (cont)
Annotation: List supplementary variables between parentheses.

Syntax: /DISCR r w d space (RANK)
Annotation: Specify discretizing options for the variables. (Cont is an integer-valued variable and does not need to be discretized.)

Syntax: /DIM = 6
Annotation: Specify number of dimensions.

Syntax: /PRINT = DESCR CORR LOADING OBJECT VAF
Annotation: Specify table output. DESCR: descriptives (frequencies, mode, a.o.). CORR: correlations between transformed variables. LOADING: component loadings. OBJECT: object scores. VAF: VAF per variable per dimension.

Syntax: /PLOT = BIPLOT(LOADING) (20) OBJECT (20) CATEGORY(R W D SPACE) (20) LOADING((CENTR)) (20) TRANS(ALL) (20) NDIM(1,6)
Annotation: Specify figure output. BIPLOT(LOADING): biplot of loadings and component scores. OBJECT: component scores. CATEGORY(R W D SPACE): category points. LOADING((CENTR)): biplot of component loadings and category points of multiple nominal variables. TRANS(ALL): transformation plots. (20) indicates that 20 (the default) characters of the value labels of categories are used in point labels; when specifying 0, values instead of value labels are used as point labels. NDIM(1,6): specifies dimension numbers for pairwise plots. Here, we obtain plots of dimension 1 by 2, 3, 4, 5, and 6.

Syntax: /SAVE = OBJECT TRDATA
Annotation: Specify results to be saved within the current data set. Here, object scores and quantified variables are rendered.

Syntax: /OUTFILE = DISCRDATA (“c:/DIRNAME/DISCR.SAV”).
Annotation: Write results to external data file. Here, the discretized data are sent to Discr.sav in the directory dirname.
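Footnote 8 above mentioned that the VARIMAX rotation reported in Table 3 was obtained outside CATPCA by saving the transformed variables and submitting them to a linear PCA. A minimal sketch of that follow-up step is given below; it is an illustration only, the variable names are hypothetical (CATPCA assigns its own names to the saved transformed variables, which can be checked in the Data Editor), and the FACTOR specification is one possible way to request a VARIMAX-rotated principal components solution:

  CATPCA /VAR var1 var2 var3 var4 var5
   /ANALYSIS var1 var2 var3 var4 var5 (SPORD, DEGREE = 2, INKNOT = 2)
   /DIM = 2
   /SAVE = TRDATA.
  * Rotate the saved transformed variables with a linear PCA.
  FACTOR /VARIABLES = trans1 trans2 trans3 trans4 trans5
   /PRINT = ROTATION
   /CRITERIA = FACTORS(2)
   /EXTRACTION = PC
   /ROTATION = VARIMAX.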