Introductory Notes
on
Multivariate Analysis
Prepared by
Prof Prithvi Yadav
For the course - ADA
2004
Indian Institute of Management,
Rajendra Nagar, Indore
____________________________________________________________________________________________________
Introductory Note on Multivariate Analysis :Prof Prithvi Yadav, I I M, Indore. @June, 2004 -1 -
Objectives
Index
• Scale of measurement of the independent & dependent variables:
§ Nominal, ordinal, interval, or ratio
§ Metric or nonmetric (i.e. continuous or categorical)
§ Fixed or random
§ Analytic technique used to estimate the model (e.g. OLS, maximum
likelihood estimation)
9. Sampling of multivariate techniques:
9.1 Dependency methods:
9.1.1 Multiple linear regression MLR
9.1.2 Discriminant analysis DA
9.1.3 Multivariate analysis of variance MANOVA
Multivariate analysis of covariance MANCOVA
9.1.4 Canonical correlation CC
9.1.5 Logistic regression LOGIT
9.1.6 Conjoint analysis
Structural equation modeling LISREL*
Probit*
Path analysis*
Two-stage least-squares regression*
Loglinear analysis*
Weighted least-squares regression*
Survival analysis*
9.2 Interdependency methods:
9.2.1 Principal component analysis
9.2.2 Factor analysis FA
9.2.3 Cluster analysis
9.2.4 Multidimensional scaling
9.2.5 Linear & Non Linear Techniques
Appendix
i. Multivariate Techniques by Data & Variable Type
ii. Overview of Multivariate Analysis
With the growth of computer technology in recent years, remarkable advances have been made in the
analysis of psychological, sociological, and other types of behavioral data. Computers have made it
possible to analyze large quantities of complex data with relative ease. At the same time, the ability to
conceptualize data analysis has also advanced. Equally important has been the expanded understanding and
application of a group of analytical statistical techniques known as multivariate analysis. Multivariate data
occur in all branches of science. Almost all data collected by today’s researchers can be classified as
multivariate data. For example, a marketing researcher might be interested in identifying characteristics
of individuals that would enable the researcher to determine whether a certain individual is likely to
purchase a specific product. A social scientist might be interested in studying the relationships between
teenage girls’ dating behaviors and their fathers’ attitudes. Each of these endeavors involves multivariate
data.
To begin a discussion of multivariate data analysis methods, the concept of an experimental unit must be
defined. An experimental unit is any object or item that can be measured or evaluated in some way.
Measuring and evaluating experimental units is a principal activity of most researchers. Examples of
experimental units include people, animals, insects, fields, plots of land, companies, trees, wheat kernels,
and countries. Multivariate data result whenever a researcher measures or evaluates more than one
attribute or characteristic of each experimental unit. These attributes or characteristics are usually called
variables by statisticians.
Any researcher who examines only two-variable relationships and avoids multivariate analysis is ignoring
powerful tools that can provide potentially very useful information. As one researcher states: “For the
purposes of … any … applied field, most of our tools are, or should be, multivariate. One is pushed to a
conclusion that unless a … problem is treated as a multivariate problem, it is treated superficially.”
According to Hardyck and Petrinovich “Multivariate analysis methods will predominate in the
future and will result in drastic changes in the manner in which research workers think about
problems and how they design their research. These methods make it possible to ask specific and
precise questions of considerable complexity in natural settings. This makes it possible to
conduct theoretically significant research and to evaluate the effects of naturally occurring
parametric variations in the context in which they normally occur. In this way, the natural
correlation among the manifold influences on behaviour can be preserved and separate effects of
these influences can be studied statistically without causing atypical isolation of either individuals
or variables”.
Widespread application of computers (first mainframe computers and, more recently, microcomputers)
to process large, complex data banks has spurred the use of multivariate statistical methods. Today a
number of prepackaged computer programs are available for multivariate data analysis, and others are
being developed. In fact, many researchers have appeared who realistically call themselves data analysts
instead of statisticians or (in the vernacular) “quantitative types.” These data analysts have contributed
substantially to the increase in the number of journal articles using multivariate statistical techniques. Even
for people with strong quantitative training, the availability of prepackaged programs for multivariate
analysis has facilitated the complex manipulation of data matrices that long hampered the growth of
multivariate techniques.
With several major universities already requiring entering students to purchase their own
microcomputers before matriculating, students and professors will soon be analyzing multivariate data
routinely for decisions of various kinds in diverse fields. Some of the prepackaged programs designed
for mainframe computers (e.g., SPSS, SAS packages) are now available in a form suitable for
microcomputers, and more will soon be available.
1. Univariate Analysis: a single variable X
3. Multivariate Relationship
4. Multivariate Database (N x k): rows S1, S2, …, Sn are the experimental units; columns X1, X2, …, Xk are the measured variables
5. Thoughts on Causality: a bivariate model relates a single X to Y; under multivariate theory, several predictors X1, X2, X3, X4 bear on Y jointly
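The N x k layout in item 4 can be held directly as a two-dimensional array. A minimal sketch with made-up values (5 subjects, 3 variables; the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical N x k multivariate database: N = 5 subjects (rows S1..S5),
# k = 3 measured variables (columns X1..X3). All values invented.
data = np.array([
    [68.0, 120.0, 32.0],   # S1
    [72.0, 135.0, 29.0],   # S2
    [65.0, 110.0, 35.0],   # S3
    [70.0, 128.0, 31.0],   # S4
    [74.0, 142.0, 27.0],   # S5
])

N, k = data.shape                        # N = 5 units, k = 3 variables
col_means = data.mean(axis=0)            # one mean per variable
corr = np.corrcoef(data, rowvar=False)   # k x k correlation matrix

print(N, k)        # 5 3
print(corr.shape)  # (3, 3)
```

The k x k correlation matrix computed here is exactly the kind of object the interdependence techniques later in these notes operate on.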
• The confidence we have that two or more variables are related, however small the relationship
may be.
• The minimum standard for the rejection of the null hypothesis and affirming that a relationship
exists is p ≤ 0.05
• This means that we are 95% confident that a relationship exists, with a 5% chance of being
wrong, i.e. making a Type I error
• Concerns the magnitude of the relationship
• The fact that two variables are significantly related (p ≤ 0.05) is no indication of the magnitude
of the relationship
• Nor does statistical significance indicate the nature of the relationship: causal, correlative,
spurious, etc.
Explained = 7.47%; Unexplained = 92.5%
4. Inflation of Alpha (α)
Given a study in which
• The means of 5 groups (k)
• Are to be compared using multiple t-tests
The total number of comparisons (C)
C = [k (k-1)] / 2 = [5 (5-1)] / 2 = 10
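The count of comparisons drives the inflation: if each of the C tests is run at α = 0.05 and the tests are treated as independent, the chance of at least one Type I error across the family is 1 − (1 − α)^C. A quick sketch of the arithmetic for the 5-group case above:

```python
# Familywise (inflated) alpha for C independent pairwise t-tests,
# each run at a per-test alpha of 0.05.
k = 5                                 # number of groups
C = k * (k - 1) // 2                  # number of pairwise comparisons
alpha = 0.05
familywise = 1 - (1 - alpha) ** C     # P(at least one Type I error)

print(C)                     # 10
print(round(familywise, 3))  # 0.401
```

So with 5 groups the nominal 5% error rate balloons to roughly 40%, which is the motivation for MANOVA discussed later in these notes.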
One reason for the difficulty of defining multivariate analysis is that the term multivariate is not used
consistently in the literature. To some researchers, multivariate simply means examining relationships
among more than two variables. Others use the term only for problems where there are
multiple variables, all of which are assumed to have a multivariate normal distribution. However, to be
considered truly multivariate, all of the variables must be random variables that are interrelated in such
ways that their different effects cannot meaningfully be interpreted separately. Some authors state that
the purpose of multivariate analysis is to measure, explain and/or predict the degree of relationship
among variates (weighted combinations of variables). Thus, the multivariate character lies in the multiple
variates (multiple combinations of variables), not only in the number of variables or observations.
• Some define multivariate analysis as employing only statistical techniques that assume that the
variables in question are multivariate normally distributed. This is a very limiting definition.
• Some use the term multivariate to describe a scale that produces a composite score from
multiple measures on an individual.
For example, a risk assessment score on a probationer based upon current offense, prior record, social/educational
history, etc. A better term for this might be a multi-measurement model.
8. Factors Giving Rise to Different Multivariate Techniques
Dependency or interdependency: in an interdependence analysis, interest centers on the k x k matrix of relationships among the variables X1, X2, X3, …, Xk themselves.
Multivariate analyses are often concerned with finding relationships among (1) the response variables,
(2) the experimental units, and (3) both response variables and experimental units. One might say that
relationships exist among the response variables when several of the variables really are measuring a
common entity. For example, suppose one gives tests to third-graders in reading, spelling, arithmetic,
and science. Individual students may tend to get high scores, medium scores, or low scores in all four
areas. If this did happen, then these tests would be related to one another. In such a case, the common
thing that these tests may be measuring might be “overall intelligence.”
Relationships might exist between the experimental units if some of them are similar to each other. For
example, suppose breakfast cereals are evaluated for their nutritional content. One might measure the
grams of fat, protein, carbohydrates, and sodium in each cereal. Cereals would be related to each other
if they tended to be similar with respect to the amounts of fat, protein, carbohydrates, and sodium that
are in a single serving of each cereal. One might expect sweetened cereals to be related to each other
and high-fiber cereals to be related to each other. One might also expect sweetened cereals to be much
different from high-fiber cereals.
Many multivariate techniques tend to be exploratory in nature rather than confirmatory. That is, many
multivariate methods tend to motivate hypotheses rather than test them. Consider a situation in which a
researcher may have 50 variables measured on more than 2000 experimental units. Traditional statistical
methods usually require that a researcher state some hypotheses, collect some data, and then use these
data to either substantiate or repudiate the hypotheses. An alternative situation that often exists is a case
in which a researcher has a large amount of data available and wonders whether there might be valuable
information in these data. Multivariate techniques are often useful for exploring data in an attempt to
learn if there is worthwhile and valuable information contained in these data.
One fundamental distinction between multivariate methods is that some are classified as “variable-directed” techniques, while others are classified as “individual-directed” techniques.
Variable-directed techniques are those that are primarily concerned with relationships that might exist
among the response variables being measured. Some examples of this type of technique are analyses
performed on correlation matrices, principal components analysis, factor analysis, regression analysis,
and canonical correlation analysis.
Individual-directed techniques are those that are primarily concerned with relationships that might exist
among the experimental units and/or individuals being measured. Some examples of this type of
technique are discriminant analysis, cluster analysis, and multivariate analysis of variance (MANOVA).
We quite often find it useful to create new variables for each experimental unit so they can be compared
to each other more easily. Many multivariate methods help researchers create new variables that have
desirable properties.
Some of the multivariate techniques that create new variables are principal components analysis, factor
analysis, canonical correlation analysis, canonical discriminant analysis, and canonical variates analysis.
9. Sampling of Various Multivariate Techniques
i. Dependency techniques
Involve prediction of one or more dependent variables (DVs), metric or nonmetric, from two or more independent variables (IVs).
Multiple regression is the method of analysis that is appropriate when the research problem involves a
single metric dependent variable presumed to be related to one or more metric independent variables.
The objective of multiple regression analysis is to predict the changes in the dependent variable in
response to changes in the several independent variables. This objective is most often achieved through
the statistical rule of least squares.
Whenever the researcher is interested in predicting the level of the dependent variable, multiple
regression is useful. For example, monthly expenditures on dining out (dependent variable) might be
predicted from information regarding a family's income, size, and the age of the head of the household
(independent variables). Similarly, the researcher might attempt to predict a company’s sales from
information on its expenditures for advertising, the number of salespeople, and the number of stores
selling its products.
Q. To what extent can a metric dependent variable Y be predicted or explained by 2 or more metric
and/or nonmetric variables Xk?
Y = a + b1X1 + b2X2 + … + bkXk, where a = regression constant and b1 … bk = regression coefficients
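The least-squares fit behind this equation can be sketched with NumPy. The data below are invented for illustration (monthly dining-out expenditure predicted from family income and family size, echoing the example above):

```python
import numpy as np

# Hypothetical data: predict monthly dining-out expenditure (Y) from
# family income (X1, in thousands) and family size (X2). Values invented.
X = np.array([[40.0, 2], [55.0, 3], [30.0, 4], [70.0, 2], [62.0, 5]])
Y = np.array([180.0, 240.0, 120.0, 310.0, 260.0])

# Prepend a column of ones so the first coefficient is the constant a.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = coef                  # Y-hat = a + b1*X1 + b2*X2

Y_hat = A @ coef
residuals = Y - Y_hat             # least squares minimizes sum(residuals**2)
print(Y_hat.shape)                # (5,)
```

The least-squares property guarantees the residuals are orthogonal to every predictor column, which is a handy sanity check on any fitted regression.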
Multiple Discriminant Analysis (MDA): if the single dependent variable is dichotomous (e.g., male-female)
or multichotomous (e.g., high-medium-low) and therefore nonmetric, the multivariate technique
of multiple discriminant analysis is appropriate. Discriminant analysis is useful in situations where the
total sample can be divided into groups based on a dependent variable that has several known classes.
The primary objectives of multiple discriminant analysis are to understand group differences and predict
the likelihood that an entity (individual or object) will belong to a particular class or group based on
several metric independent variables.
A company specializing in credit cards would certainly like to be able to classify credit card applicants
into two groups of individuals: (1) individuals who are good credit risks and (2) individuals who are bad
credit risks. Individuals deemed to be good credit risks would then be offered credit cards, while those
deemed to be bad risks would not. To help make this determination, the credit card company might
consider several demographic characteristics that can be measured on each individual. For example, the
company may consider educational level, salary, indebtedness, and past credit history as possible
predictors of creditworthiness. The company would then attempt to use this information to help to
decide whether an applicant for a credit card should be approved. The multivariate method that would
help the company classify applicants into one of the two credit risk groups is called discriminant analysis.
Discriminant analysis (DA) is primarily used to classify individuals or experimental units into two or more
uniquely defined populations. To develop a discriminant rule for classifying experimental units into one of
several possible categories, the researcher must have a random sample of experimental units from each
possible classification group. Then, DA provides methods that will allow researchers to build rules that
can be used to classify other experimental units into one of the classification groups.
In the credit card example, a discriminant rule is constructed using demographic data from individuals
known to be good credit risks and similar data from individuals known to be bad credit risks. Then
new applicants for credit cards are classified into one of the two risk groups using the resulting rule.
Q. Can 2 or more metric and/or nonmetric independent variables predict or explain membership in two
or more categories of a nonmetric dependent variable?
Example: To what extent can age, account balance, and education predict credit risk: very good
(1), good (2), or bad (3)?
Discriminant function:
Z = a + W1(age) + W2(balance) + W3(education)
Z = -6.04 + 0.24(age) + 0.25(balance) - 0.04(education)
Q What is the discriminant score for an 18-year-old with an account balance of 2 (Rs 20,000) and
12 years of education?
Z = -6.04 + 0.24(18) + 0.25(2) - 0.04(12) = -1.7
Discriminant analysis calculates the Bayesian probabilities of a case with a discriminant score of
Z = -1.7 being in each group. The two highest probabilities were:
good p = 0.6725
bad p = 0.2490
The group with the highest probability is good risk; therefore this case is predicted to be a good risk.
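The score computation above is simple enough to verify directly; the coefficients below are exactly the ones given in the note's fitted discriminant function:

```python
# Discriminant function from the note:
# Z = -6.04 + 0.24*age + 0.25*balance - 0.04*education
def discriminant_score(age, balance, education):
    return -6.04 + 0.24 * age + 0.25 * balance - 0.04 * education

# The 18-year-old with a balance code of 2 and 12 years of education:
z = discriminant_score(age=18, balance=2, education=12)
print(round(z, 2))   # -1.7
```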
Observed      Predicted          % Correct
              VG    G    Bad
VGood (32)    26    5    1       81.25%
Good (21)      3   16    2       76.19%
Bad (17)       6    8    3       17.64%
Total (70)    35   29    6       64.29%
Statisticians raise two main objections to individual analyses for each measured variable. One objection
is that the populations being compared may be different on some variables but not on others. Often a
researcher finds it confusing as to which populations are really different and which are similar.
Multivariate analysis of variance can help researchers to compare several populations by considering all
of the measured variables, simultaneously. A second objection is that there is inadequate protection
against making Type I errors when performing one-variable-at-a-time analyses. Recall that a Type I
error occurs whenever a true hypothesis is rejected. The more variables that a researcher analyzes, the
more likely it is that at least one of the variables will give rise to statistical significance. As the number of
variables being analyzed increases, the probability of finding at least one of these analyses statistically
significant (i.e. producing a p value of less than 0.05) approaches one. Certainly, the large risk of
making Type I errors should be a concern for experimenters. A researcher should want to be confident
when claiming that two or more populations have different means with respect to a measured variable
and that his or her claim would not be contradicted by other experimenters conducting similar analyses
on similar data sets.
A MANOVA should be performed whenever two or more different populations are being compared to
one another on a large number of measured response variables. If a MANOVA shows significant
differences between population means, then the researcher can be confident that real differences actually
exist. In this case, it is reasonable to consider one-variable-at-a-time analyses to see where the
differences actually occur. If the MANOVA does not reveal any significant differences between
population means, then the researcher must use extreme caution when interpreting one-variable-at-a-time analyses. Such analyses may be identifying nothing more than “false positives.”
The difference between ANOVA and MANOVA is in the number of dependent variables.
ANOVA: only one metric DV, but there may be one or more nonmetric IVs.
MANOVA: two or more metric DVs, but there may be one or more nonmetric IVs.
Logistic regression is often used to model the probability that an experimental unit falls into a particular
group based on information measured on the experimental unit. Such models can be used for
discrimination purposes. In the credit card example described previously, one can model the probability
that an individual with certain demographic characteristics will be a good credit risk. After developing
this model, it can be used to predict the probability that a new applicant will fall into a certain risk group.
Q How well can two or more metric and/or nonmetric IVs predict membership in a binary DV?
Prob(event) = 1 / (1 + e^(-Z))
Consider a binary DV in which one category is coded 0 and the other category 1. The event predicted
is the category coded 1. (e = 2.71828)
Example
To what extent do age, gender (m = 0, f = 1), and account balance predict type of credit risk: good
(0) or bad (1)?
Prob(event) = 1 / (1 + e^(-Z))
Q What is the probability of a 26-year-old female with an account balance of 2 (Rs 20,000) being
predicted a bad risk?
P = 1 - 0.67 = 0.33
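The 0.67/0.33 split comes from the logistic transform above. The note does not give the fitted coefficients for this model, so the sketch below simply assumes a score of Z ≈ -0.708, which is the value that reproduces the probabilities in the example:

```python
import math

def prob_event(z):
    """Logistic transform: probability of the category coded 1 (here, bad risk)."""
    return 1.0 / (1.0 + math.exp(-z))

z = -0.708             # assumed score for this case (not given in the note)
p_bad = prob_event(z)  # event = bad risk, coded 1
p_good = 1.0 - p_bad

print(round(p_bad, 2))   # 0.33
print(round(p_good, 2))  # 0.67
```

Since p_bad < 0.5, the case is classified as a good risk, matching the example's conclusion.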
Classification Table
Observed     Predicted        % Correct
             Good   Bad
Good (37)     30     7        81.08%
Bad (33)       9    24        72.73%
Total (70)    39    31        77.14%
Canonical variates analysis (CVA) is a method that creates new variables in conjunction with
multivariate analyses of variance. These new variables are useful for helping researchers determine
where the major differences among the population means occur when the populations are being
compared on many different variables by using all of the measured variables simultaneously.
Occasionally, the canonical variates may suggest important differences that might otherwise be missed.
Canonical Correlation Analysis can be viewed as a logical extension of multiple regression analysis.
Recall that multiple regression analysis involves a single metric dependent variable and several metric
independent variables. With canonical analysis the objective is to correlate simultaneously several metric
dependent variables and several metric independent variables. Whereas multiple regression involves a
single dependent variable, canonical correlation involves multiple dependent variables. The underlying
principle is to develop a linear combination of each set of variables (both independent and dependent) in
a manner that maximizes the correlation between the two sets. Stated in a different manner, the
procedure involves obtaining a set of weights for the dependent and independent variables that provide
the maximum simple correlation between the set of dependent variables and the set of independent
variables. The assignment of variables into these two groups must always be motivated by the nature of
the response variables and never by an inspection of the data. For example, a legitimate assignment
would be one in which the variables in one group are easy to obtain and inexpensive to measure, while
the variables in the other group are hard to obtain or expensive to measure. Another would be
measurements on fathers versus measurements on their daughters.
A researcher wanted to compare fathers’ attitudes with their daughters’ dating behaviors. When several
different variables have been measured on the fathers and several others measured on the daughters,
canonical correlation analysis might be used to identify new variables that summarize any relationships
that might exist between these two sets of family members.
One basic question that canonical correlation analysis is expected to answer is whether the variables in
one group can be used to predict the variables in the other group. When they can, then canonical
correlation attempts to summarize the relationships between the two sets of variables by creating new
variables from each of the two groups of original variables.
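The weight-finding idea can be sketched with NumPy. This is not the full canonical analysis, just a minimal illustration under the classical construction in which the squared canonical correlations are the eigenvalues of Sxx⁻¹ Sxy Syy⁻¹ Syx (the blocks of the joint covariance matrix); the toy data are invented:

```python
import numpy as np

def first_canonical_correlation(X, Y):
    """Largest canonical correlation between variable sets X (n x p) and Y (n x q).

    Classical construction: squared canonical correlations are the
    eigenvalues of Sxx^-1 Sxy Syy^-1 Syx, where the S's are the blocks
    of the joint covariance matrix of (X, Y).
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = len(X)
    Sxx = Xc.T @ Xc / n
    Syy = Yc.T @ Yc / n
    Sxy = Xc.T @ Yc / n
    M = np.linalg.solve(Sxx, Sxy) @ np.linalg.solve(Syy, Sxy.T)
    eigvals = np.linalg.eigvals(M).real
    return float(np.sqrt(max(eigvals.max(), 0.0)))

# Toy check: if Y is an exact linear transform of X, the two sets carry
# the same information and the first canonical correlation is 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
Y = X @ np.array([[1.0, 0.5], [0.2, 1.0], [0.3, -0.4]])
r = first_canonical_correlation(X, Y)
print(round(r, 3))   # 1.0
```

In the fathers-and-daughters example, X would hold the fathers' measured variables and Y the daughters', and the canonical weights would define the summarizing linear combinations described above.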
9.1.6 Conjoint Analysis
Conjoint Analysis is concerned with understanding how people make choices between products or
services or a combination of product and services, so that business can design new products or services
that better meet customers’ underlying needs. Although it has been a mainstream research technique for the last 10
years or so, conjoint analysis has been found to be an extremely powerful way of capturing what really
drives customers to buy one product over another and what customers really value. A key benefit of it is
the ability to produce dynamic market models that enable companies to test out what steps they would
need to take to improve their market share, or how competitors’ behavior will affect their customers. It
has become one of the most widely used quantitative methods in marketing research. It is used to
measure the perceived values of specific product features, to learn how demand for a particular product
or service is related to price, and to forecast what the likely acceptance of a product would be if
brought to market.
Conjoint analysis is a type of experiment done by market researchers. It enables the researcher to
understand consumer preferences or ratings of existing or possible products in terms of product
attributes and their levels. The purpose of conducting a conjoint experiment is to learn the relative
importance of product attributes, as well as learn what the most preferred attribute levels are. When
done well, conjoint analysis helps the researcher to understand the existing and desired product. The
researcher can simulate market share of preference of existing or possible products, even if the
particular combination of factor levels that comprises the "product" was not explicitly judged by the
subjects.
A number of approaches exist for doing conjoint analysis. In full profile conjoint analysis, a product
card consists of one level setting for each attribute under consideration. The set of such cards can be all
possible combinations of attribute levels (one combination per card) or some fraction thereof. Often, the
researcher is only interested in presenting a fraction, because all possible combinations represent too
many product alternatives to judge without concern about fatigue and the reliability of the subject data.
Fractional factorial designs require the aid of a computer. The ORTHOPLAN procedure can generate
such designs. The researcher often presents the choice alternatives as a set of physical cards that the
subjects then sort in order of preference. The PLANCARDS procedure is a utility for generating such
cards. Full-profile conjoint data can be analyzed by way of ordinary least squares regression. The
CONJOINT procedure is a specially-tailored version of regression.
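The OLS step can be sketched without SPSS's CONJOINT procedure: dummy-code the attribute levels and regress the preference ratings on them, so the coefficients become part-worth utilities. Everything below (the two attributes, their levels, and the ratings) is a hypothetical example:

```python
import numpy as np

# Hypothetical full-profile conjoint task: 2 attributes with 2 levels each,
# 4 product cards, one preference rating (1-10) per card from one subject.
# Dummy coding: brand A = 0 / brand B = 1; price low = 0 / price high = 1.
profiles = np.array([
    [0, 0],   # brand A, low price
    [0, 1],   # brand A, high price
    [1, 0],   # brand B, low price
    [1, 1],   # brand B, high price
])
ratings = np.array([9.0, 5.0, 7.0, 3.0])

# OLS: rating = intercept + w_brand * brand + w_price * price
A = np.column_stack([np.ones(len(profiles)), profiles])
coef, *_ = np.linalg.lstsq(A, ratings, rcond=None)
intercept, w_brand, w_price = coef

# Part-worth interpretation: switching to brand B changes utility by w_brand;
# moving to the high price changes it by w_price.
print(round(w_brand, 2))   # -2.0
print(round(w_price, 2))   # -4.0
```

Here price matters twice as much as brand to this subject, which is precisely the kind of relative-importance statement conjoint analysis is designed to deliver.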
9.2 Interdependence Techniques
When a researcher is beginning to think about analyzing a new data set, several questions about the data
should be considered. Important questions include these: (i) Are there any aspects of the data that are
strange or unusual? (ii) Can the data be assumed to be distributed normally? (iii) Are there any
abnormalities in the data? (iv) Are there outliers in the data? Experimental units whose measured
variables seem inconsistent with measurements on other experimental units are usually called outliers.
By far, the most important reason for performing a principal components analysis (PCA) is to use it as a
tool for screening multivariate data. New variables, called principal components scores, can be created.
These new variables can be used as input for graphing and plotting programs, and an examination of the
resulting graphical displays will often reveal abnormalities in the data that you are planning to analyze.
For example, plots of principal component scores can help identify outliers in the data when they exist.
In addition, the principal component scores can be analyzed individually to see whether distributional
assumptions such as normality of the variables and independence of the experimental units hold. Such
assumptions are often required for certain kinds of statistical analyses of the data to be valid.
Principal components analysis uses a mathematical procedure that transforms a set of correlated
response variables into a new set of uncorrelated variables that are called principal components.
Principal components analysis can be performed on either a sample variance-covariance matrix or a
correlation matrix. The type of matrix that is best often depends on the variables being measured.
Occasionally, but not often, the newly created variables are interpretable. One cannot always expect to
be able to interpret the newly created variables. In fact, it is considered to be a bonus when the principal
component variables can actually be interpreted. When using PCA to screen a multivariate data set, you
do not need to be able to interpret the principal components because PCA is extremely useful
regardless of whether the new variables can be interpreted.
Principal components analysis is quite helpful to researchers who want to partition experimental units
into subgroups so that similar experimental units belong to the same subgroup. In this case, principal
component scores can be used as input to clustering
programs. This often increases the effectiveness of the clustering programs, while reducing the cost of
using such programs. Furthermore, the principal component scores can and should always be used to
help validate the results of clustering programs.
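The score computation described above can be sketched with NumPy: standardize the data, eigendecompose the correlation matrix, and project. The sample data are made up (three deliberately correlated variables):

```python
import numpy as np

# Hypothetical sample: 6 experimental units, 3 correlated variables
# (each column is a noisy copy of a common underlying factor).
rng = np.random.default_rng(1)
base = rng.normal(size=(6, 1))
X = np.hstack([base + 0.1 * rng.normal(size=(6, 1)) for _ in range(3)])

# Standardize, then eigendecompose the correlation matrix.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = Z.T @ Z / len(Z)                   # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)   # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                   # principal component scores (uncorrelated)
explained = eigvals / eigvals.sum()    # proportion of variance per component
print(scores.shape)                    # (6, 3)
```

The resulting score columns are mutually uncorrelated, which is the property that makes them suitable as clean inputs for plotting, outlier screening, and clustering as described above.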
Factor analysis, including variations such as component analysis and common factor analysis, is a
statistical approach that can be used to analyze interrelationships among a large number of variables and
to explain these variables in terms of their common underlying dimensions (factors). The statistical
approach involves finding a way of condensing the information contained in a number of original
variables into a smaller set of dimensions (factors) with a minimum loss of information.
Factor analysis (FA) is a technique that is often used to create new variables that summarize all of the
information that might be available in the original variables. For example, consider once again giving tests
to third-graders in reading, spelling, arithmetic, and science, whereby individual students may tend to get
high scores, medium scores, or low scores in all four areas. If this really does happen, then one might
say that these test results are being explained by some underlying characteristic or factor that is common
to all four tests. In this example, it might be reasonable to assume that such an underlying characteristic
is “overall intelligence.”
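The "overall intelligence" example can be illustrated by simulation. In this sketch the factor loadings are assumed values chosen for illustration; the point is that a single latent variable produces the pattern of uniformly positive correlations described above:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
# Hypothetical single-factor model: one latent "overall intelligence"
# score drives all four observed tests, plus test-specific noise
intelligence = rng.standard_normal(n)
loadings = np.array([0.9, 0.8, 0.85, 0.7])         # assumed loadings
noise = rng.standard_normal((n, 4)) * np.sqrt(1 - loadings**2)
# columns: reading, spelling, arithmetic, science
tests = intelligence[:, None] * loadings + noise

R = np.corrcoef(tests, rowvar=False)
# Every off-diagonal correlation is substantial and positive, the
# pattern a single common factor predicts (r_ij ~ loading_i * loading_j)
```

Students high on the latent factor tend to score high on all four tests, which is precisely the behaviour the passage describes.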
Factor analysis is also used to study relationships that might exist among the measured variables in a
data set. Similar to PCA, FA is a variable-directed technique. One basic objective of FA is to
determine whether the response variables exhibit patterns of relationships with each other, such that the
variables can be partitioned into subsets of variables so that the variables in a subset are highly
correlated with each other and so that variables in different subsets have low correlations with each
other. Thus, FA is often used to study the correlation structure of the variables in a data set. One
similarity between FA and PCA is that FA can also be used to create new variables that are
uncorrelated with each other. Such variables are called factor scores.
One advantage that FA seems to have over PCA when new variables are being created is that the new
variables created by FA are generally much easier to interpret than those created by PCA. If a
researcher wants to create a smaller set of new variables that are interpretable and that summarize most
of the information in the measured variables then FA should be given serious consideration.
Cluster analysis (CA) is similar to discriminant analysis in that it is used to classify individuals or
experimental units into uniquely defined subgroups. Discriminant analysis can be used when a researcher
has previously obtained random samples from each of the uniquely defined subgroups. Cluster analysis
deals with classification problems when it is not known beforehand from which subgroups observations
originate. Cluster analysis is an analytical technique that can be used to develop meaningful subgroups of
individuals or objects. Specifically, the objective is to classify a sample of entities (individuals or objects)
into a small number of mutually exclusive groups based on the similarities among the entities. Unlike
discriminant analysis, the groups are not predefined. Instead, the technique is used to identify the groups.
Cluster analysis usually involves at least two steps. The first is the measurement of some form of
similarity or association between the entities in order to determine how many groups really exist in the
sample. The second step is to profile the persons or variables in order to determine their composition.
This step may be accomplished by applying discriminant analysis to the groups identified by the cluster
technique.
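A minimal sketch of the two steps, using SciPy's hierarchical clustering on a hypothetical two-group sample; step 2 is reduced here to simple group profiling by variable means rather than a full discriminant analysis:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
# Hypothetical entities from two subgroups, unknown beforehand
A = rng.standard_normal((20, 2))
B = rng.standard_normal((20, 2)) + 5.0
X = np.vstack([A, B])

# Step 1: measure similarity (Euclidean distance) and form groups
Z = linkage(pdist(X), method="average")
groups = fcluster(Z, t=2, criterion="maxclust")

# Step 2: profile the recovered groups by their variable means
profiles = {g: X[groups == g].mean(axis=0) for g in np.unique(groups)}
```

Unlike discriminant analysis, no group membership was supplied: the two subgroups are identified from the similarities alone and then profiled.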
9.2.4 Multidimensional Scaling
The starting point for most techniques of analysis is the data matrix. This matrix contains the variables
(properties) in the column headings, the units in the first column, and the scores in the body. As far as
contents are concerned, multidimensional scaling is especially applied to preference data. In marketing
research, for example, a number of products are presented to a sample of people, and the individuals
have to state their preference for each pair of products. The input of the analysis is then the matrix of
pairwise similarities between the products.
Such similarities were also the starting point in the investigation of the furnishing of the living room.
Therefore, the analysis chosen in this research was multidimensional scaling (MDS). All pairwise
associations were calculated for 49 characteristics of the living rooms. Recall, for example, the
association of the Persian carpet and the parquet floor. These associations formed the input of the MDS
analysis.
We must, however, qualify these observations, for the distinction between the input of multidimensional
scaling and that of other latent structure analyses is in fact relative. The matrix of bivariate associations
can be used as a starting point in factor analysis as well as in cluster analysis. Think of the correlations
between the indicators of marital adjustment, e.g., between staying at home and having outside activities
together, or between happiness in marriage and getting on each other’s nerves.
There is still a difference in research strategy. When the characteristics are presented to the respondents
in pairs and preferences are asked, the data already have the form of similarities in the observation
phase. In this case, MDS is generally chosen. When, on the other hand, scores on separate
characteristics are observed, we in fact perform two analyses: a first analysis results in the matrix of
similarities, and this matrix is then used as the input of a second latent structure analysis. In this case,
either factor analysis or cluster analysis is generally chosen.
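When the similarities are metric distances, the classical (Torgerson) MDS solution can be computed directly, as the following sketch shows; practical MDS programs for preference data usually use iterative non-metric algorithms instead. The four-point configuration below is hypothetical:

```python
import numpy as np

# Hypothetical configuration of 4 products; in practice only the
# distance matrix D would be observed
pts = np.array([[0., 0.], [2., 0.], [0., 1.], [3., 4.]])
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
B = -0.5 * J @ (D ** 2) @ J               # double-centered squared distances

eigvals, V = np.linalg.eigh(B)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]

# 2-D configuration whose pairwise distances reproduce D
coords = V[:, :2] * np.sqrt(np.maximum(eigvals[:2], 0))
```

The recovered coordinates reproduce the input distances exactly (up to rotation and translation), which is the sense in which MDS turns a similarity matrix back into a spatial map of the objects.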
9.2.5 Linear and Non-linear Techniques of Analysis
Regression, factor analysis, canonical correlation analysis and other classical techniques are metrical and
linear. A non-metric variant has, however, been developed for each metric technique; the pioneering
work of Leo Goodman and the Gifi group is noteworthy in this respect. There is really no need to
distinguish between the adjectives ‘non-metric’ and ‘non-linear’, for these non-metric techniques are
also non-linear. This is implied in the expression ‘loglinear model’, because the model only becomes
linear after taking the logarithm. The title of Gifi’s book Non-linear Multivariate Analysis also speaks
for itself.
Linearity is an old sore in statistics. In applications of classical regression analysis it became clear all too
often that the plots of the data seldom follow the pattern of a nice straight line or a flat plane. To
understand what is meant here, we only have to think of the exponential functions applied by the Club of
Rome in the analysis of increasing environmental pollution. And, in addition to regression, the same
holds of course for discriminant analysis, factor analysis and other classical techniques. A linearity test is
therefore always obligatory.
A linear function is naturally handy, as it is easy to calculate and to interpret. One can therefore consider,
in cases of non-linearity, performing certain transformations on the data in such a way as to obtain
linearity. Taking the logarithm is an example of such a transformation. It is, however, also possible to fit
a non-linear function, which will be a quadratic function, a function of the third degree or, in general, a
polynomial of the n-th degree, depending on the number of curves that are detected in the scatterplot:
one, two or n-1 respectively. It has become possible for nearly all existing techniques to fit such a
non-linear function.
Our conclusion is that, especially in recent decades, a non-linear analogue has been developed for
almost all multivariate analyses, so that for each technique we can make a sub-classification in the linear
and the non-linear versions.
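The polynomial-fitting idea can be sketched on simulated data: a scatterplot with one curve is fit poorly by a straight line but well by a quadratic:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.linspace(-3, 3, 60)
# Simulated data following a quadratic (one curve) plus small noise
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(0, 0.1, x.size)

lin = np.polyfit(x, y, 1)        # straight line: poor fit
quad = np.polyfit(x, y, 2)       # quadratic: one curve, matches the data

sse_lin = np.sum((np.polyval(lin, x) - y) ** 2)
sse_quad = np.sum((np.polyval(quad, x) - y) ** 2)
```

The residual sum of squares drops dramatically when the degree matches the number of curves in the scatterplot, which is the diagnostic the passage above recommends.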
10. Optimal Scaling
Categorical data are often found in marketing research, survey research, and research in the social and
behavioral sciences. In fact, many researchers deal almost exclusively with categorical data. Categorical
data are typically summarized in contingency tables. Analysis of tabular data requires a set of statistical
models different from the usual correlation- and regression-based approaches used for quantitative data.
Traditional analysis of two-way tables consists of displaying cell counts along with one or more sets of
percentages. If the data in the table represent a sample, the chi-square statistic might be computed along
with one or more measures of association. Multi-way tables are handled with some difficulty, since the
view of the data is influenced by which variable is the row variable, which variable is the column
variable, and which variables are control variables. Traditional methods don’t work well for three or
more variables because all statistics that might be produced are conditional statistics, which do not in
general capture the interrelationships among the variables.
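The traditional two-way analysis mentioned above can be sketched in a few lines; the counts are hypothetical:

```python
import numpy as np

# Hypothetical 2x3 contingency table of observed counts
obs = np.array([[20, 30, 50],
                [30, 30, 40]])

row = obs.sum(axis=1, keepdims=True)     # row totals
col = obs.sum(axis=0, keepdims=True)     # column totals
n = obs.sum()

exp = row @ col / n                      # expected counts under independence
chi2 = ((obs - exp) ** 2 / exp).sum()    # Pearson chi-square statistic
df = (obs.shape[0] - 1) * (obs.shape[1] - 1)
pct = obs / row                          # row percentages
```

For this table chi2 is about 3.11 on 2 degrees of freedom. With three or more variables, each such statistic would have to be computed conditionally within slices of the table, which is exactly the limitation the passage describes.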
Researchers have developed loglinear models as a comprehensive way for dealing with two-way and
multiway tables. ‘Loglinear models’ is an umbrella term for several different models: models for the
log-frequency counts in a two-way or multiway table, logit models for log-odds when one categorical
variable is dependent and there are one or more categorical predictor variables, association models for
the log-odds ratios in two-way tables, and many other special-purpose models. Loglinear models have
a number of advantages. They are comprehensive models that apply to tables of arbitrary complexity.
They provide goodness-of-fit statistics, so that model-building can be undertaken until a suitable model
is found. They provide parameter estimates and standard errors.
However, loglinear models have a number of drawbacks. If the sample size is too small, the chi-square
statistic on which the models are based is suspect. If the sample size is too large, it is difficult to arrive at
a parsimonious model, and it can be difficult to discriminate between competing models that appear to
fit the data. As the number of variables and the number of values per variable go up, models with more
parameters are needed, and in practice, researchers have had some difficulty interpreting the parameter
estimates.
Optimal scaling is a technique that can be used instead of, or as a complement to, loglinear models.
Optimal scaling extends traditional loglinear analyses by incorporating variables at mixed levels.
Nonlinear relationships are described by relaxing the metric assumptions of the variables. Rather than
interpreting parameter estimates, interpretation is often based on graphical displays in which similar
variables or categories are positioned close to each other.
The simplest form of optimal scaling is correspondence analysis for a two-way table. If the two-way
table portrays two variables that are associated (not independent), correspondence analysis assigns
scores to the categories of the row and column variables in such a way as to account for as much of the
association between the two variables as possible. Depending on the dimensionality of the table,
correspondence analysis assigns one or more sets of scores to each variable. Conventionally, row and
column categories are displayed in two-dimensional plots defined by pairs of these scores. Using
correspondence analysis and the plots it produces, you can learn the following: within a variable,
categories that are similar or different; within a variable, categories that might be collapsed; across
variables, categories that go together; what category a user-missing category most resembles; and what
the optimal correlation is between the row and the column variable.
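The score assignment can be sketched as a singular value decomposition of the standardized residuals from independence, which is the usual computational core of correspondence analysis; the table below is hypothetical:

```python
import numpy as np

# Hypothetical 3x3 two-way table of counts
N = np.array([[30., 10., 5.],
              [10., 30., 10.],
              [5., 10., 30.]])

P = N / N.sum()                        # correspondence matrix
r = P.sum(axis=1)                      # row masses
c = P.sum(axis=0)                      # column masses

# Standardized residuals from independence, then SVD
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
U, sv, Vt = np.linalg.svd(S)

# Principal coordinates: min(r, c) - 1 = 2 meaningful dimensions;
# the last singular value is zero by construction
row_scores = (U * sv) / np.sqrt(r)[:, None]
col_scores = (Vt.T * sv) / np.sqrt(c)[:, None]
```

Plotting the first two columns of `row_scores` and `col_scores` gives the two-dimensional display discussed above, with similar categories positioned close together.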
The final procedure, Categorical Regression, describes the relationship between a categorical response
variable and a combination of categorical predictor variables. The categories are quantified such that the
squared multiple correlation between the response and the combination of predictors is a maximum. The
influence of each predictor variable on the response variable is described by the corresponding
regression weight. As in the other procedures, data can be analyzed with different levels of optimal
scaling.
Following are brief guidelines for each of the five procedures:
• Use Categorical Regression to predict the values of a categorical dependent variable from a
combination of categorical independent variables.
• Use Nonlinear Principal Components Analysis to account for patterns of variation in a single set
of variables of mixed optimal scaling levels.
• Use Nonlinear Canonical Correlation Analysis to assess the extent to which two or more sets of
variables of mixed optimal scaling levels are correlated.
• Use Correspondence Analysis to analyze two-way contingency tables or data that can be
expressed as a two-way table, such as brand preference or sociometric choice data.
• Use Homogeneity Analysis to analyze a categorical multivariate data matrix when you are willing
to make no stronger assumption than that all variables are analyzed at the nominal level.
CATREG is an acronym for categorical regression with optimal scaling. The goal of regression analysis
is to predict a response variable from a set of predictor variables. The standard approach requires
continuous variables and entails deriving weights for the predictor variables such that the squared
correlation between the response and the weighted combination of predictors is a maximum. For any
given change in a predictor, the sign of the corresponding weight indicates whether the predicted
response increases or decreases. The size of the weight indicates the amount of change in the predicted
response for a one-unit increase in the predictor.
If some of the variables are not continuous, alternative analyses are available. If the response is
continuous and the predictors are categorical, analysis of variance is often employed. If the response is
categorical and the predictors are continuous, logistic regression or discriminant analysis may be
appropriate. If the response and the predictors are both categorical, loglinear models are often used.
Categorical regression with optimal scaling extends the standard approaches of regression and loglinear
modeling by quantifying categorical variables. Scale values are assigned to each category of every
variable such that these values are optimal with respect to the regression. The technique maximizes the
squared correlation between the transformed response and the weighted combination of transformed
predictors.
One advantage of the optimal scaling approach over standard regression analysis is in dealing with
nonlinear relationships between variables. If, for example, a predictor has both high and low values
associated with one value of the response, standard linear regression will not perform very well. The
predictor receives only one weight, and one weight cannot reflect the same amount of change in the
predicted response for both large and small predictor values. However, in CATREG, nonlinear
transformations of the variables are employed. The predictor described earlier could be treated as
nominal, receiving large quantifications for both large and small observed values. Thus, both values
affect the predicted response in the same manner.
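The nominal-treatment idea can be sketched on simulated data: raw category codes fit poorly, but quantifying each category (here by its response mean, which for a single nominal predictor maximizes the squared correlation) recovers the relationship:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical nominal predictor with 3 categories; categories 0 and 2
# both produce HIGH responses, category 1 LOW ones - a nonlinear pattern
cat = rng.integers(0, 3, size=300)
y = np.where(cat == 1, 1.0, 5.0) + rng.normal(0, 0.2, 300)

def r2(x, y):
    """Squared correlation from a simple linear regression of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return 1 - (y - X @ b).var() / y.var()

r2_raw = r2(cat.astype(float), y)      # raw codes 0,1,2: almost no fit
quant = np.array([y[cat == k].mean() for k in range(3)])  # quantification
r2_opt = r2(quant[cat], y)             # quantified predictor: strong fit
```

One linear weight cannot map the codes 0, 1, 2 onto high-low-high responses, but the optimal quantifications give categories 0 and 2 similar large scale values, so both affect the prediction in the same direction.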
Categorical regression with optimal scoring is equivalent to optimal scaling canonical correlation analysis
(OVERALS) with two sets, one of which contains only one variable. In the latter technique, similarity of
sets is derived by comparing each set to an unknown variable that lies somewhere between all of the
sets. In categorical regression, similarity of the transformed response and the linear combination of
transformed predictors is assessed directly.
PRINCALS is an acronym for principal components analysis via alternating least squares. Standard
principal components analysis is a statistical technique that linearly transforms an original set of variables
into a substantially smaller set of uncorrelated variables that represents as much of the information in the
original set as possible. The goal of principal components analysis is to reduce the dimensionality of the
original data set while accounting for as much of the variation as possible in the original set of variables.
Objects in the analysis receive component scores. Plots of the component scores reveal patterns among
the objects in the analysis and can reveal unusual objects in the data. Standard principal components
analysis assumes that all variables in the analysis are measured at the numerical level and that
relationships between pairs of variables are linear.
Nonlinear principal components analysis, also known as optimal scaling principal components, extends this
methodology so that you can perform principal components analysis with any mix of nominal, ordinal,
and numerical scaling levels. The aim is still to account for as much variation in the data as possible,
given the specified dimensionality of the analysis. For nominal and ordinal variables, the program
computes optimal scale values for the categories.
More generally, an optimal scaling principal components analysis of a set of ordinal scales is an
alternative to computing the correlations between the scales and analyzing them using a standard
principal components or factor analysis approach. Research has shown that naïve use of the usual
Pearson correlation coefficient as a measure of association for ordinal data can lead to nontrivial bias in
estimation of the correlations.
If all variables are declared numerical, PRINCALS produces an analysis equivalent to standard
principal components analysis (as performed, for example, by the Factor Analysis procedure). Both
procedures have their own benefits.
If all variables are declared multiple nominal, PRINCALS produces an analysis equivalent to a
homogeneity analysis run on the same variables. Thus, optimal scaling principal components analysis can
be seen as a type of homogeneity analysis in which some of the variables are declared ordinal or
numerical.
Nonlinear canonical correlation analysis (OVERALS), or canonical correlation analysis with optimal
scaling, is the most general of the five procedures in the optimal scaling family. This procedure performs
nonlinear canonical correlation analysis on two or more sets of variables.
The goal of canonical correlation analysis is to analyze the relationships between sets of variables instead
of between the variables themselves, as in principal components analysis. In standard canonical
correlation analysis, there are two sets of numerical variables. For example, one set of variables might
be demographic background items on a set of respondents, while a second set of variables might be
responses to a set of attitude items. Standard canonical correlation analysis is a statistical technique that
finds a linear combination of one set of variables and a linear combination of a second set of variables
that are maximally correlated. Given this set of linear combinations, canonical correlation analysis can
find subsequent independent sets of linear combinations, referred to as canonical variables, up to a
maximum number equal to the number of variables in the smaller set.
Optimal scaling canonical correlation analysis extends the standard analysis in several ways. First, there
can be two or more sets of variables, so you are not restricted to two sets of variables, as is true in most
popular implementations of canonical correlation analysis. Second, the scaling levels in the analysis can
be any mix of nominal, ordinal, and numerical. Third, optimal scaling canonical correlation analysis
determines the similarity among the sets by simultaneously comparing the canonical variables from each
set to a compromise set of scores assigned to the objects.
If there are two sets of variables in the analysis and all variables are defined to be numerical, optimal
scaling canonical correlation analysis is equivalent to a standard canonical correlation analysis. Although
SPSS does not have a canonical correlation analysis procedure, many of the relevant statistics can be
obtained from multivariate analysis of variance.
If there are two or more sets of variables with only one variable per set, optimal scaling canonical
correlation analysis is equivalent to optimal scaling principal components analysis. If all variables in a
one-variable-per-set analysis are multiple nominal, optimal scaling canonical correlation analysis is
equivalent to homogeneity analysis. If there are two sets of variables, one of which contains only one
variable, optimal scaling canonical correlation analysis is equivalent to categorical regression with
optimal scaling.
Optimal scaling canonical correlation analysis has various other applications. If you have two sets of
variables and one of the sets contains a nominal variable declared as single nominal, optimal scaling
canonical correlation analysis results can be interpreted in a fashion similar to regression analysis. If you
consider the variable to be multiple nominal, the optimal scaling analysis is an alternative to discriminant
analysis. Grouping the variables in more than two sets provides a variety of ways to analyze your data.
The Correspondence Analysis procedure is a very general program to make biplots for correspondence
tables, using either chi-squared distances, as in standard correspondence analysis, or Euclidean
distances, for more general biplots. This procedure also offers the ability to constrain categories to have
equal scores, a useful option to impose ordering on the categories. In addition, it offers the ability to fit
supplementary points into the space defined by the active points.
In a correspondence table, the row and column variables are assumed to represent unordered
categories; therefore, we use the nominal optimal scaling level. Both variables are inspected for their
nominal information only. That is, the only consideration is the fact that some objects are in the same
category, while others are not. Nothing is assumed about the distance or order between categories of
the same variable. One specific use of correspondence analysis is the analysis of a two-way contingency
table. The SPSS Crosstabs procedure can also be used to analyze contingency tables, but
correspondence analysis provides a graphic summary in the form of plots that show the relationships
between categories of the two variables.
If a table has r active rows and c active columns, the number of dimensions in the correspondence
analysis solution is min(r − 1, c − 1). In other words, you could perfectly represent the row categories
or the column categories of a contingency table in a space of min(r, c) − 1 dimensions. Practically
speaking, however, you would like to represent the row and column
categories of a two-way table in a low-dimensional space, say two dimensions, for the obvious reason
that two-dimensional plots are comprehensible and multidimensional spatial representations are usually
not.
When fewer than the maximum number of possible dimensions is used, the statistics produced in the
analysis describe how well the row and column categories are represented in the low-dimensional
representation. Provided that the quality of representation of the two-dimensional solution is good, you
can examine plots of the row points and the column points to learn which categories of the row variable
are similar, which categories of the column variable are similar, and which row and column categories
are similar to each other.
Independence is a common focus in contingency table analyses. However, even in small tables,
detecting the cause of departures from independence may be difficult. The utility of correspondence
analysis lies in displaying such patterns for two-way tables of any size. If there is an association between
the row and column variables (that is, if the chi-square value is significant), correspondence analysis may
help reveal the nature of the relationship.
Homogeneity Analysis
HOMALS is an acronym for homogeneity analysis via alternating least squares. The input for
homogeneity analysis, also known as multiple correspondence analysis, is the usual rectangular data
matrix, where the rows represent subjects or, more generically, objects, and the columns represent
variables. There may be two or more variables in the analysis. As in correspondence analysis, all
variables in a homogeneity analysis are inspected for their nominal information only. The analysis
considers only the fact that some objects are in the same category, while others are not. Nothing is
assumed about the distance or order between categories of the same variable.
While correspondence analysis is limited to two variables, homogeneity analysis can be thought of as the
analysis of a multiway contingency table (with more than two variables). Multiway contingency tables
can also be analyzed with the SPSS Crosstabs procedure, but Crosstabs gives separate summary
statistics for each category of each control variable. With homogeneity analysis, it is often possible to
summarize the relationship between all the variables with a single two-dimensional plot.
For a one-dimensional solution, homogeneity analysis assigns optimal scale values (category
quantifications) to each category of each variable in such a way that overall, on average, the categories
have maximum spread. For a two-dimensional solution, homogeneity analysis finds a second set of
quantifications of the categories of each variable unrelated to the first set, again attempting to maximize
spread, and so on. Because categories of a variable receive as many scorings as there are dimensions,
the variables in the analysis are said to be multiple nominal in optimal scaling level.
Homogeneity analysis also assigns scores to the objects in the analysis in such a way that the category
quantifications are the averages, or centroids, of the object scores of objects in that category.
The output for homogeneity analysis includes plots of the category quantifications and the object scores.
By design, homogeneity analysis tries to produce a solution in which objects within the same category
are plotted close together and objects in different categories are plotted far apart. This is done for all
variables in the analysis. The plots have the property that each object is as close as possible to the
category points of categories that apply to the object. In this way, the categories divide the objects into
homogeneous subgroups (thus, one reason for the name “homogeneity analysis”). Variables are
considered homogeneous when they classify objects in the same categories into the same subgroups.
If homogeneity analysis is used for two variables, the results are not completely identical to those
produced by correspondence analysis, although both are appropriate when suitably interpreted. In the
two-variable situation, correspondence analysis produces unique output summarizing the fit and the
quality of representation of the solution, including stability information. Thus, correspondence analysis is
probably preferable to homogeneity analysis in the two-variable case in most circumstances. Another
difference between the two procedures is that the input to homogeneity analysis is a data matrix, where
the rows are objects and the columns are variables, while the input to correspondence analysis can be
the same data matrix, a general proximity matrix, or a joint contingency table, which is an aggregated
matrix where both the rows and columns represent categories of variables.
Homogeneity analysis can be thought of as principal components analysis of nominal data with multiple
optimal scaling levels. If the variables in the analysis are assumed to be numerical level (linear
associations between the variables are assumed), then standard principal components analysis is
appropriate.
An advanced use of homogeneity analysis is to replace the original category values with the optimal
scale values from the first dimension and perform a secondary multivariate analysis. The Factor Analysis
procedure produces a first principal component that is equivalent to the first dimension of homogeneity
analysis. The component scores in the first dimension are equal to the object scores, and the squared
component loadings are equal to the discrimination measures. The second homogeneity analysis
dimension, however, is not equal to the second dimension of factor analysis. Since homogeneity analysis
replaces category labels with numerical scale values, many different procedures that require interval-
level (numerical) data can be applied after the homogeneity analysis. The same is true for nonlinear
principal components analysis, nonlinear canonical correlation analysis, and categorical regression.
11. Readings
Hair J.F., Anderson R.E., Tatham R.L. and Black W.C. (1998) Multivariate Data Analysis With
Readings, 5th Edition, Upper Saddle River, New Jersey, Prentice-Hall International. Chapter 1.
Gofton, L. R. and Ness, M. R. (1997) Business Marketing Research, London, Kogan Page. Chapter
8.
Heenan D.A. and Addleman R.B. (1976) Quantitative Techniques for Today’s Decision Makers,
Harvard Business Review, May-June, pp.32-62.
Hooley G.J. (1980) The Multivariate Jungle: The Academic’s Playground but the Manager’s Minefield.
European Journal of Marketing Research, Vol. 14, No 1, pp. 379-386.
Johnson, Dallas E. (1998) Applied Multivariate Methods for Data Analysis. Duxbury Press, An
International Thomson Publishing Company, Washington.
Ness M.R. (1997) Multivariate Analysis in Marketing Research, in: Padberg D., Ritson C. and Albisu
L.M. (Eds), Agro-Food Marketing, Wallingford, Oxfordshire, CAB International. Chapter 12,
pp. 253-278.
Sheth J. (1971) The Multivariate Revolution in Marketing Research. Journal of Marketing, Vol. 35, Jan,
pp.13-19.
SPSS Categories (1998). Marketing Department, SPSS Inc., 444 North Michigan Avenue, Chicago.
Tacq, Jacques (1997) Multivariate Analysis Techniques in Social Science Research: From Problem to
Analysis. Sage Publications, London.
Appendix – (i)
Multivariate Techniques by Data & Variable Type

                                       Dependent Variable
                           Metric                      Non-Metric
              -------------------------------------------------------------------
              Metric     | Multiple Regression       | Multiple Discriminant
Independent              | Analysis                  | Analysis (MDA)
Variable                 | Canonical Correlation     | Canonical Correlation with
                         | Analysis                  | Dummy Variables
              -------------------------------------------------------------------
              Non-Metric | Multivariate Analysis     | Canonical Correlation with
                         | of Variance               | Dummy Variables
Appendix – (ii)
Overview of Multivariate Techniques

[Tree diagram: Multivariate Methods, branching into Metric and Non-Metric techniques.]