
REVIEW ARTICLE

Multivariate Analysis: An Overview

Sandeep Kumar*, Siddharth Kumar Singh*, Prashant Mishra*

Abstract
Introduction: Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once. There are two general types of MVA technique: analysis of dependence and analysis of interdependence. The technique is selected depending on the type of data and the reason for the analysis. Cluster analysis: "Techniques for identifying separate groups of similar cases". Cluster analysis is also used to summarize data by defining segments of similar cases in the data. Discriminant analysis: "A statistical technique for classifying individuals or objects into mutually exclusive and exhaustive groups on the basis of a set of independent variables". Factor analysis: Multiple factor analysis (MFA) is a "statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors". Correspondence analysis: "A technique that generates graphical representations of the interactions between modalities (or "categories") of two categorical variables". Regression analysis: "Refers to any techniques for modelling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables." Multiple linear regression analysis (MLR): In multiple linear regression, several independent variables are used to predict one dependent variable with a least-squares approach. Multivariate analysis of variance (MANOVA): A generalized form of univariate analysis of variance (ANOVA). Conclusion: Because there are many potential problems and pitfalls in the use of multivariable techniques in clinical research, these procedures should be used with care.
(Kumar S, Singh SK, Mishra P. Multivariate Analysis: An Overview. www.journalofdentofacialsciences.com, 2013; 2(3): 19-26)

Introduction

*Sr Lecturer, Sri Aurobindo College of Dentistry, Indore, Madhya Pradesh
Address for Correspondence:
**Dr Sandeep Kumar
Flat No. 304, Sanskar Block, SAIMS Campus, Sanwar Road, Indore, Madhya Pradesh
e-mail: drsandeep40@yahoo.com

Multivariate analysis is the "application of methods that deal with a reasonably large number of measurements made on each object in one or more samples simultaneously".1 Many statistical techniques focus on just one or two variables. Multivariate analysis (MVA) techniques allow more than two variables to be analysed at once. The ultimate goal of these analyses is either explanation or prediction, i.e., more than just establishing an association.

Multivariate Analysis Methods: Two general types of MVA technique:1

• Analysis of dependence: One (or more) variables are dependent variables, to be explained or predicted by others, e.g. multiple regression, discriminant analysis, MANOVA, partial least squares.
• Analysis of interdependence: No variables are thought of as "dependent"; the aim is to examine the relationships among variables, objects or cases, e.g. cluster analysis, factor analysis, and principal component analysis.

Selection of technique depends on:1
• The type of data under analysis: nominal data, ordinal data
• The reason for the analysis: classifying data, data reduction

Classifying data
• Hierarchical cluster analysis
• Two step cluster analysis
• K means cluster analysis
• Discriminant analysis

Data reduction
• Factor analysis
• Correspondence analysis

Level of Measurement and Multivariate Statistical Technique:2

Independent Variable   | Dependent Variable          | Technique
Numerical              | Numerical                   | Multiple Regression
Nominal or Numerical   | Nominal                     | Logistic Regression
Nominal or Numerical   | Numerical (censored)        | Cox Regression
Nominal or Numerical   | Numerical                   | ANOVA, MANOVA
Nominal or Numerical   | Nominal (2 or more values)  | Discriminant Analysis

Cluster Analysis:3 "Techniques for identifying separate groups of similar cases". Cluster analysis is also used to summarize data by defining segments of similar cases in the data; this use of cluster analysis is known as "dissection."

Application:
• Psychology: classifying individuals according to personality types.
• Regional analyses: classifying cities into typologies based on demographic variables.
• Marketing research: classifying customers based on product use.
• Chemistry: classification of compounds based upon properties.

Overview of cluster analysis:3
Step 1 = n objects measured on p variables
Step 2 = transform into an N x N similarity matrix
Step 3 = cluster formation (mutually exclusive clusters, hierarchical clusters)
Step 4 = cluster profile

Types of Cluster Analysis:4
• Hierarchical
• Two Step
• K Means

Hierarchical cluster analysis: Performs successive fusions or divisions of the data. Hierarchical clustering is one of the most straightforward methods; it can be either agglomerative or divisive.

Agglomerative hierarchical clustering: Starts with every case being a cluster unto itself. At successive steps similar clusters are merged, so the method proceeds by forming a series of fusions of the n objects into groups.

Divisive clustering: Starts with everybody in one cluster and ends up with everyone in individual clusters. It partitions the set of n objects into finer and finer subdivisions.

Agglomerative Methods
• Single linkage or nearest neighbour method
• Complete linkage or farthest neighbour method
• Average linkage
• Ward's error sum of squares method

Divisive Methods
• Splinter average distance method
• Automatic interaction detection

The output from both agglomerative and divisive methods is typically summarized by use of a dendrogram.
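As an illustration of the agglomerative procedure, the "series of fusions" described above can be sketched in plain Python. This is a toy single-linkage example on invented 1-D data, not code from the cited sources: each case starts as its own cluster and the two closest clusters are merged at every step.

```python
# Minimal agglomerative (single-linkage) clustering on 1-D data.
# Every case begins as a cluster; the pair of clusters whose nearest
# members are closest is fused at each step.

def single_linkage(points, n_clusters):
    clusters = [[p] for p in points]          # each case is a cluster
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between nearest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # fuse the closest pair
        del clusters[j]
    return [sorted(c) for c in clusters]

print(single_linkage([1.0, 1.2, 5.0, 5.1, 9.0], 3))
# -> [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Recording the order and distance of each fusion is exactly the information a dendrogram displays.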

www.journalofdentofacialsciences.com Vol. 2 Issue 3


Dendrogram: "It is a two-dimensional tree-like diagram illustrating the fusions or partitions that have been effected at each successive level".

To form clusters using a hierarchical cluster analysis, the following need to be considered: a criterion for determining similarity or distance between cases; a criterion for determining which clusters are to be merged at successive steps; and the number of clusters needed to represent the data. The hierarchical clustering procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that starts with each case in a separate cluster and combines clusters until only one is left.

How Many Clusters Are Needed?
There is no specific number; it depends on what we are going to do with them (the type of analysis). To find a good cluster solution, the characteristics of the clusters at successive steps need to be evaluated.

Two Step Cluster Analysis: "A clustering procedure that can rapidly form clusters on the basis of either categorical or continuous data." It is used in large data sets. It requires only one pass of the data, which is important for very large data files, and it can produce solutions based on mixtures of continuous and categorical variables and for varying numbers of clusters. The clustering algorithm is based on a distance measure that gives the best results if all variables are independent, continuous variables have a normal distribution, and categorical variables have a multinomial distribution. Two step cluster analysis is designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent.

Steps Involved3
Step 1: Preclustering: making little clusters. The goal of preclustering is to reduce the size of the matrix that contains distances between all possible pairs of cases.
Step 2: Hierarchical clustering of preclusters. In the second step, SPSS uses the standard hierarchical clustering algorithm on the preclusters. This helps to explore a range of solutions with different numbers of clusters. We can select the number of clusters to be formed by clicking:
• Number of clusters: determine automatically, or specify a fixed number.
• Two options for categorical and continuous variables.
Two different criteria are used: Schwarz's Bayesian Criterion and Akaike's Information Criterion.
The clusters formed can then be utilized for:
• Examining the composition of the clusters.
• Examining the importance of individual variables within the clusters.
• Looking at all variables within a cluster.
• Looking at the relationship to other variables.

K Means cluster analysis: This procedure attempts to identify relatively homogeneous groups of cases based on selected characteristics, using an algorithm that can handle large numbers of cases. However, the algorithm requires specifying the number of clusters.

Guidelines on the appropriate use of clustering methodologies1
• Outliers should be removed prior to analysis (the methods are sensitive to them).
• The selection of the similarity or distance measure to use is still an open question.
• Care and good judgement should be used in setting parameter values.
• Wherever possible, data should be split and cross-replicated so as to assess the stability of cluster solutions.

Discriminant Analysis5
"A statistical technique for classifying individuals or objects into mutually exclusive and exhaustive groups on the basis of a set of independent variables".

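The core idea of discriminant analysis can be illustrated with a deliberately simplified rule: assign a case to the group whose centroid (mean of the independent variables) is nearest. This is a sketch of the principle only, not Fisher's full procedure, and the data are hypothetical:

```python
# Simplified nearest-centroid classification in the spirit of
# discriminant analysis: each case is placed in exactly one group
# on the basis of its independent variables.

def centroid(rows):
    n = len(rows)
    return [sum(r[i] for r in rows) / n for i in range(len(rows[0]))]

def classify(case, groups):
    # groups: dict mapping a label to its list of training cases
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    cents = {label: centroid(rows) for label, rows in groups.items()}
    return min(cents, key=lambda label: dist2(case, cents[label]))

groups = {
    "healthy": [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2]],
    "disease": [[4.0, 5.0], [4.2, 4.8], [3.8, 5.2]],
}
print(classify([1.1, 2.1], groups))   # -> healthy
```

Re-running `classify` on the training cases themselves is the re-substitution method of error-rate estimation listed below.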

Types
• Discrete discriminant analysis
• Logistic discrimination

Error rate estimation
• The re-substitution method
• The hold-out method
• The U method or cross-validation
• The jackknife method

Application
• Face identification
• Bankruptcy detection
• Marketing research

Factor Analysis1
Multiple Factor Analysis (MFA): "Statistical method used to describe variability among observed variables in terms of a potentially lower number of unobserved variables called factors". Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. Factor analysis is often used in data reduction and can also be used to generate hypotheses regarding causal mechanisms or to screen variables for subsequent analysis.

Types of factor analysis
Exploratory factor analysis (EFA) is used to uncover the underlying structure of a relatively large set of variables. The researcher's a priori assumption is that any indicator may be associated with any factor. This is the most common form of factor analysis.
Confirmatory factor analysis (CFA) seeks to determine whether the number of factors and the loadings of measured (indicator) variables on them conform to what is expected on the basis of pre-established theory. The researcher's a priori assumption is that each factor is associated with a specified subset of indicator variables.
Factor analysis is related to principal component analysis (PCA), but the two are not identical. The two methods become essentially equivalent if the error terms in the factor analysis model can be assumed to all have the same variance.

Procedure1 (in SPSS)
Performed in two steps:
• Select the factors
• Descriptive
  - Initial solution
  - Coefficient
  - KMO and Bartlett's test: check the Bartlett table; the KMO value lies between 0 and 1 (the closer to 1, the better, with significance at 0.05); factors are selected using the Kaiser criterion, and a scree plot can be used for better judgement
• Extraction
  - Principal component
  - Correlation matrix
  - Eigenvalue
  - Scree plot
• Rotation
  - No change
• Scores
  - Exclude cases listwise
• Descriptive
  - Untick
• Extraction
  - Scree plot
  - Univariate untick
  - Select the factors
• Rotation
  - Varimax (independent variables)
  - Oblimin (dependent variables)
Each variable receives a number called a "loading"; higher loadings are selected and close loadings are eliminated.

Types of factoring1
Principal Component Analysis:2 PCA was invented in 1901 by Karl Pearson. The goal of PCA is to decompose a data table with correlated measurements into a new set of uncorrelated (i.e., orthogonal) variables. These variables are called, depending upon the context, principal components, factors, eigenvectors, singular vectors, or loadings. The results of the analysis are often presented with graphs plotting the projections of the units onto the components, and the loadings of the variables.

Canonical factor analysis: Also called Rao's canonical factoring. It seeks factors which have the highest canonical correlation with the observed


variables. It is unaffected by arbitrary rescaling of the data.

Common factor analysis: Also called principal factor analysis (PFA) or principal axis factoring (PAF). It seeks the least number of factors which can account for the common variance (correlation) of a set of variables.

Image factoring: Based on the correlation matrix of predicted variables rather than actual variables, where each variable is predicted from the others using multiple regression.

Alpha factoring: Based on maximizing the reliability of factors, assuming variables are randomly sampled from a universe of variables.

Correspondence Analysis6
"Technique that generates graphical representations of the interactions between modalities (or "categories") of two categorical variables". It allows the visual discovery and interpretation of these interactions, that is, of the departure from independence of the two variables.

Steps6
Run a chi-square test of independence on the two variables:
• If the test fails to reject the independence hypothesis, then correspondence analysis will not deliver any useful information and can be ignored.
• Only if the independence hypothesis is rejected will correspondence analysis be considered as the next step in the analysis of the pair of variables.
A PCA-like transformation then allows the modalities of the variables to be represented as points in factorial planes.

Interpretation
Correspondence analysis plots should be interpreted by looking at points relative to the origin:
• Points that are in similar directions are positively associated.
• Points that are on opposite sides of the origin are negatively associated.
• Points that are far from the origin exhibit the strongest associations.
Also, the results reflect relative associations, not just which rows are highest or lowest overall.

Regression Analysis1: "Refers to any techniques for modelling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables." Regression helps us understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed.

Types2
• Multiple Linear Regression Analysis
• Partial Least Squares Regression (PLSR)
• Principal Component Regression (PCR)
• Ridge Regression (RR)
• Reduced Rank Regression (RRR) or Redundancy Analysis
• Poisson Regression Analysis
• Logistic Regression Analysis

Multiple Linear Regression Analysis (MLR):1 In multiple linear regression, several independent variables are used to predict one dependent variable with a least-squares approach. If the independent variables are orthogonal, the problem reduces to a set of univariate regressions. When the independent variables are correlated, their importance is estimated from the partial coefficients of correlation. An important problem arises when one of the independent variables can be predicted from the other variables; this is called multicollinearity.

The main approaches are:
• Forward selection, which involves starting with no variables in the model, trying out the variables one by one and including them if they are 'statistically significant'.
• Backward elimination, which involves starting with all candidate variables and testing them one by one for statistical significance, deleting any that are not significant.
• Methods that are a combination of the above, testing at each stage for variables to be included or excluded.
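The remark that multiple linear regression with orthogonal predictors reduces to a set of univariate regressions can be checked directly. In the sketch below (toy data invented for illustration, with centred predictors whose cross-products sum to zero), each multiple-regression slope is simply the univariate slope sum(x*y)/sum(x*x):

```python
# With orthogonal, centred independent variables, each slope of the
# least-squares multiple regression equals the univariate slope.
# Here y = 2*x1 + 3*x2 exactly, and sum(x1*x2) == 0.

def univariate_slope(x, y):
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

x1 = [-1, -1, 0, 0, 1, 1]
x2 = [-1, 1, -1, 1, -1, 1]              # orthogonal to x1
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

b1 = univariate_slope(x1, y)
b2 = univariate_slope(x2, y)
print(b1, b2)   # -> 2.0 3.0
```

When the predictors are correlated instead, these simple slopes no longer agree with the multiple-regression coefficients, which is why the partial coefficients of correlation are needed.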

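The chi-square test of independence that precedes correspondence analysis (described above) is easy to compute by hand. This sketch derives the expected counts under independence from a hypothetical 2x2 contingency table; the counts are invented, not data from the article:

```python
# Chi-square statistic for a contingency table: expected counts are
# computed under the hypothesis that the two categorical variables
# are independent, and squared deviations are accumulated.

def chi_square(table):
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical 2x2 table of counts
table = [[30, 10],
         [10, 30]]
print(chi_square(table))   # -> 20.0, far from independence
```

A large statistic relative to the chi-square distribution (here with 1 degree of freedom) rejects independence, so correspondence analysis would be worth pursuing.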

Partial Least Squares Regression (PLSR):1 It addresses the multicollinearity problem by computing latent vectors which explain both the independent variables and the dependent variables. It is used when the goal is to predict more than one dependent variable. It combines features from principal component analysis and multiple linear regression: the scores of the units as well as the loadings of the variables can be plotted as in principal component analysis, and the dependent variables can be estimated as in multiple linear regression. It is used to find the fundamental relations between two matrices (X and Y); the matrices X and Y are decomposed into latent structures in an iterative process.1

Uses
• Chemometrics
• Bioinformatics
• Sensometrics
• Neuroscience
• Anthropology

Principal Component Regression (PCR)1
In principal component regression, the independent variables are first submitted to a principal component analysis and the scores of the units are then used as predictors in a standard multiple linear regression.
Step 1 = Run a principal component analysis so as to reduce the dimensionality of the data.
Step 2 = Run an ordinary least squares regression on the selected components; the factors which are most correlated with the dependent variable are selected.
Step 3 = Finally, the parameters of the model are computed for the selected explanatory variables.

Ridge Regression (RR): Ridge regression accommodates the multicollinearity problem by adding a small constant (the ridge) to the diagonal of the correlation matrix. This makes the computation of the estimates for multiple linear regression possible.

Reduced Rank Regression (RRR) or Redundancy Analysis4
The dependent variables are first submitted to a PCA, and the scores of the units are then used as dependent variables in a series of standard multiple linear regressions where the original independent variables are used as predictors. The reduced rank regression model is a multivariate regression model with a coefficient matrix of reduced rank. It is related to canonical correlations and involves calculating eigenvalues and eigenvectors. It is a non-symmetric method: the components extracted from X are such that they are as closely correlated with the variables of Y as possible; similarly, the components of Y are extracted so that they are as closely correlated with the components extracted from X as possible.

Poisson Regression:2 A "form of regression analysis used to model count data and contingency tables". Poisson regression assumes the response variable Y has a Poisson distribution, and assumes the logarithm of its expected value can be modelled by a linear combination of unknown parameters. A Poisson regression model is sometimes known as a log-linear model, especially when used to model contingency tables.
Example: Poisson regression is appropriate when the dependent variable is a count, for instance of events such as the arrival of telephone calls at a call centre. The events must be independent in the sense that the arrival of one call will not make another more or less likely, but the probability per unit time of events is understood to be related to covariates such as the time of day.

Logistic Regression:1 Logistic regression (sometimes called the logistic model or logit model) is used for prediction of the probability of occurrence of an event by fitting data to a logistic (logit-function) curve. It is a generalized linear model used for binomial regression. Like many forms of regression analysis, it makes use of several predictor variables that may be either numerical or categorical.
Example: The probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age, sex and body mass index.
Uses: Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as prediction of a customer's propensity to purchase a product or cease a subscription.
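The logistic model above, P(event) = 1/(1 + exp(-(b0 + b1*x))), can be fitted by maximizing the log-likelihood. This is a minimal gradient-ascent sketch on invented data, intended only to show the mechanics, not a production fitting routine:

```python
# Minimal logistic regression by gradient ascent on the
# log-likelihood; one numerical predictor, binary outcome.
import math

def fit_logistic(xs, ys, lr=0.1, steps=5000):
    b0 = b1 = 0.0
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            p = 1 / (1 + math.exp(-(b0 + b1 * x)))
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
        b0 += lr * g0
        b1 += lr * g1
    return b0, b1

# Toy data: the event becomes more likely as x grows
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 0, 0, 1, 1, 1]
b0, b1 = fit_logistic(xs, ys)
p_at_4 = 1 / (1 + math.exp(-(b0 + b1 * 4)))
print(b1 > 0, round(p_at_4, 3))   # positive slope; high probability at x = 4
```

In practice the maximum-likelihood fit is obtained with iteratively reweighted least squares in a statistics package rather than by plain gradient ascent.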

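For the Poisson (log-linear) model described above, the special case of a single binary covariate has a closed-form maximum-likelihood fit: exp(b0) is the mean count in the baseline group and exp(b1) is the rate ratio between groups. The call counts below are invented for illustration:

```python
# Poisson regression, log E[Y] = b0 + b1*x, with one binary covariate:
# the MLE makes exp(b0) the mean count when x = 0 and exp(b1) the
# rate ratio of the x = 1 group to the x = 0 group.
import math

def poisson_fit_binary(counts0, counts1):
    rate0 = sum(counts0) / len(counts0)      # mean count, x = 0
    rate1 = sum(counts1) / len(counts1)      # mean count, x = 1
    b0 = math.log(rate0)
    b1 = math.log(rate1 / rate0)             # log rate ratio
    return b0, b1

# Hypothetical calls per hour at a call centre: off-peak vs peak
off_peak = [3, 4, 5, 4]
peak = [8, 9, 7, 8]
b0, b1 = poisson_fit_binary(off_peak, peak)
print(math.exp(b0), math.exp(b1))   # -> 4.0 2.0 (peak hours double the rate)
```

With several covariates no closed form exists and the model is fitted iteratively, as with logistic regression.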

Multivariate Analysis of Variance (MANOVA):7 It is a generalized form of univariate analysis of variance (ANOVA). It is used when there are two or more dependent variables. It helps to answer:
• Do changes in the independent variables have significant effects on the dependent variables?
• What are the interactions among the dependent variables and among the independent variables?
In MANOVA the independent variables have the same structure as in a standard ANOVA, and are used to predict a set of dependent variables. MANOVA computes a series of ordered orthogonal linear combinations of the dependent variables (i.e., factors) with the constraint that the first factor generates the largest "F" if used in an ANOVA. The sampling distribution of this F is adjusted to take into account its construction.

Structural Equation Modelling:8 Structural equation modelling (SEM) is a statistical technique for testing and estimating causal relations using a combination of statistical data and qualitative causal assumptions. It finds useful application in measurement error, missing data, mediation models and group differences.
Advanced uses
• Invariance
• Multiple group comparison
• Relations to other types of advanced models (hierarchical/multilevel models; item response theory models)
• Alternative estimation and testing techniques

Other Methods9,10
Statis: Statis is used when at least one dimension of the three-way table is common to all tables. The first step of the method performs a PCA of each table and generates a similarity table between the units for each table.

Procrustean Analysis (PA): A "form of statistical shape analysis used to analyse the distribution of a set of shapes". To compare the shapes of two or more objects, the objects must first be optimally "superimposed". This is performed by optimally translating, rotating and uniformly scaling the objects. The aim is to obtain a similar placement and size, by minimizing a measure of shape difference called the Procrustes distance between the objects.

CHAID: A "type of decision tree technique, based upon adjusted significance testing" (Bonferroni testing). In practice, CHAID is often used in direct marketing (to select groups of consumers and predict how their responses to some variables affect other variables) and in medical and psychiatric research.
Advantages: The output is highly visual and easy to interpret.
Disadvantages: It requires large sample sizes to work effectively (for reliable analysis).
CHAID is often used as an exploratory technique and is an alternative to multiple linear regression and logistic regression, especially when the data set is not well-suited to regression analysis.

Advantages of Multivariate Analyses1
• It assures that the results are not biased and influenced by other factors that are not accounted for.
• Close resemblance to how the researcher thinks.
• Easy visualisation and interpretation of data.
• More information is analysed simultaneously, giving greater power.
• The relationships between variables are understood better.

Applications of multivariate analysis1
• For developing taxonomies or systems of classification
• To investigate useful ways to conceptualize or group items
• To generate hypotheses
• To test hypotheses
It finds application in biology, medicine, psychology, neuroscience, market research, educational research, climatology, petroleum geology, crime analysis, etc.

Conclusion
There are a variety of multivariate techniques, all of which are based on assumptions about the nature of the data and the type of association


under analysis. The choice of an appropriate procedure to be used in multivariable analysis depends on whether the dependent and independent variables are continuous, dichotomous, nominal, ordinal or a combination of these. Because there are many potential problems and pitfalls in the use of multivariable techniques in clinical research, these procedures should be used with care.

References
1. Dillon WR, Goldstein M. Multivariate Analysis: Methods and Applications. 2nd edition. Wiley.
2. Kothari CR. Research Methodology: Methods and Techniques. 2nd edition. New Age International.
3. http://www.statisticshell.com/cluster.pdf (accessed 11/05/2012).
4. Sundar Rao PSS, Richard J. Introduction to Biostatistics and Research Methodology. 4th edition. PHI Learning.
5. Spiegel MR. Schaum's Outline of Statistics. 2nd edition. McGraw-Hill.
6. http://www.statsoft.com/textbook/correspondence-analysis/ (accessed 12/05/2012).
7. http://userwww.sfsu.edu/~efc/classes/biol710/manova/MANOVAnewest.pdf (accessed 15/05/2012).
8. http://www.statsoft.com/textbook/basic-statistics/ (accessed 13/05/2013).
9. Mahajan BK. Methods in Biostatistics. 6th edition. Jaypee Brothers.
10. Kerr AW. Doing Statistics with SPSS. Wiley Online Library.
