
Multivariate Data Analysis

Overview of Methods
 Sophisticated multivariate statistical methods are
becoming standard practice in the physical, natural
and social sciences, as well as in business
• Variations of existing methods are being developed,
existing techniques are being applied to new
applications, and new methods continue to be designed

[Figure: two-dimensional map (Dimension 1 by Dimension 2)
relating age group (25-29 through 50-54) to most important
information source: talking to co-workers, reading a book,
radio, talking to friends, cable television, newspaper,
broadcast television, Internet, magazines]
 The accelerated use of advanced
multivariate techniques is being
driven by
• Growing complexity in the topics
being addressed
• Ever-larger data sets
• Ability to apply computationally
intensive methods through powerful
computer tools
• Academic training
Scale              Definition                                              Examples                       Descriptive Statistics
Nominal            Non-ordered categories                                  Race, gender, marital status   Percentages, mode
Ordinal            Ordered relation between categories                     Attitudes, social class        Percentiles, median
Interval (metric)  Ordered relation, equality of differences               Economic indices               Range, mean, standard deviation
Ratio (metric)     Ordered relation, equality of differences, absolute zero  Sales, costs, frequency      All of above, coefficient of variation
 Response vs explanatory
• Response or dependent variable
▪ Variable to be modeled or predicted
• Explanatory or independent variable
▪ Variables used to predict or model the dependent variable
 Importance of identifying data and variable types
• Critical in determining analysis objectives and appropriate
analysis method
• Avoid inappropriate variable operations
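As a rough illustration of why the measurement scale matters, the sketch below computes only the descriptive statistics that the scale table lists for each scale. All data values and variable names are hypothetical and were made up for this example.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical survey records; all values are made up for illustration.
marital_status = ["single", "married", "married", "divorced", "married"]  # nominal
social_class = [1, 2, 2, 3, 3, 3, 4]             # ordinal codes (1 = lowest)
index_value = [98.0, 101.5, 100.0, 103.5, 97.0]  # interval (economic index)
sales = [120.0, 80.0, 100.0, 140.0, 60.0]        # ratio (has a true zero)


def nominal_summary(values):
    """Return the mode and the percentage share of each category."""
    counts = Counter(values)
    total = len(values)
    shares = {k: 100.0 * v / total for k, v in counts.items()}
    return counts.most_common(1)[0][0], shares


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean; needs a true zero."""
    return stdev(values) / mean(values)


mode_status, status_shares = nominal_summary(marital_status)  # nominal: mode, percentages
class_median = median(social_class)                           # ordinal: median
index_range = max(index_value) - min(index_value)             # interval: range
index_mean = mean(index_value)                                # interval: mean
sales_cv = coefficient_of_variation(sales)                    # ratio: coefficient of variation
```

Taking a mean of the marital status codes, by contrast, would be an example of an inappropriate variable operation, because nominal categories have no order or distance.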
 Dependence techniques
• One variable or a set of variables is regarded as the
dependent variable(s)
• Objective is to predict or explain the value of the
dependent variable(s) based on the values of a set of
independent variables
• Examples
▪ What is the probability that a loan applicant will default?
▪ What factors best differentiate people whose primary news source
is the Internet?
 Dependence techniques
• Multiple regression
• Logistic regression
• Discriminant analysis
• Canonical correlation
• Structural equation modeling
• Analysis of variance
• Decision trees
 Interdependence techniques
• No single group of variables is defined as
dependent or independent
• Objective is to identify and characterize the
underlying structure among the variables
• Examples
▪ What are the underlying factors that define a customer’s
perception of a brand?
▪ Which signal returns arise from the same object and
how many objects are present?
 Interdependence techniques
• Factor analysis
• Multidimensional scaling
• Correspondence analysis
• Cluster analysis
 Interdependence techniques are valuable data
reduction methods
• Data reduction attempts to manage and interpret the large
amounts of data gathered
• One goal is to combine groups of cases measured over multiple
variables into a relatively small number of understandable
segments
• Or to group variables together into categories of latent
traits and then characterize cases with respect to this
smaller number of traits
 The reduced data variables are then often used as
variables in dependence techniques
 Multiple regression is a dependence technique used
to model the relationship between the value of a
single metric dependent variable and a set of metric
independent variables
• Categorical variables can be included as “dummy” variables
 Model can be applied to predict changes in the
dependent variable’s response to changes in the
independent variables
 Regression also indicates the relative importance of
independent variables on the response of the
dependent variable
 For example, a client may be interested in
understanding the effect of price and promotional
activity on a product’s market share among both
“loyal” and “not loyal” customers
 Technical result is a linear model of the form

• Y = a_0 + a_1 X_1 + a_2 X_2 + … + a_n X_n

 The best visualizations of the results hold all but one
(or two) of the independent variables fixed and examine
how the value of the dependent variable changes with
respect to the “free” independent variables
[Figure: two panels, "Market share for loyal customers" and
"Market share for not-loyal customers", each plotting Market
Share (0 to 60) against Promotion Index (20 to 80)]

 Properties
• Single interval scale dependent variable
• Multiple independent variables, preferably on interval scale
• Familiar and useful technique
 Issues
• Assumes linear relationship between dependent and
independent variables
• Overused, and assumptions are often not fully checked
• Often misapplied to classification problems
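The linear model Y = a_0 + a_1 X_1 + … + a_n X_n can be fitted by ordinary least squares. The sketch below is a minimal illustration using only the Python standard library; the data, the coefficient values, and the function names `solve_linear_system` and `fit_ols` are assumptions made for this example, and the "loyal" dummy variable shows how a categorical variable can enter the model.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares coefficients [a0, a1, ..., an] via the normal equations."""
    design = [[1.0] + list(r) for r in rows]  # leading 1 gives the intercept a0
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * yi for d, yi in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Hypothetical observations: (price, promotion index, loyal dummy: 1 = loyal).
rows = [
    (4.0, 20.0, 0), (5.0, 40.0, 0), (6.0, 60.0, 0), (4.5, 80.0, 0),
    (4.0, 30.0, 1), (5.5, 50.0, 1), (6.0, 70.0, 1), (5.0, 25.0, 1),
]
# Market share generated from a known linear rule so the fit can be checked.
share = [30.0 - 2.0 * p + 0.25 * promo + 10.0 * loyal for p, promo, loyal in rows]

a0, a_price, a_promo, a_loyal = fit_ols(rows, share)
```

Because the made-up market shares follow an exact linear rule, the fit should recover that rule; with real data the coefficients would be estimates, and the linearity assumption would still need checking.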
 Logistic regression is a dependence technique
used to model the relationship between a single
categorical dependent variable and a set of metric
independent variables
• Typically the dependent variable takes one of two values –
success/failure, buy/do not buy
• Multinomial formulations handle more than two categories
 A logistic model gives the probability that the
dependent variable takes a target value given the
values of the independent variables
 For example, which credit and demographic
factors best predict whether a customer will
keep a loan current
• Dependent variable taken as 60 days past due or
worse
• Independent variables are credit and employment
history, and demographic descriptors
 Properties
• Powerful technique for predicting group membership and
identifying important independent variables
• Becoming more widely used
• Procedures and results similar to linear regression
 Issues
• Adequate data
• Model validation
• Communicating probabilistic concepts
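To make the idea of a probability-valued model concrete, here is a minimal sketch of a two-category logistic regression fitted by plain gradient ascent on the log-likelihood. The applicant data, the two predictors, the learning rate, and the function names are all hypothetical choices for illustration; production software typically uses more refined fitting procedures.

```python
import math


def sigmoid(z):
    """Logistic function mapping any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, rate=0.1, steps=5000):
    """Fit intercept and coefficients by gradient ascent on the log-likelihood."""
    coefs = [0.0] * (len(rows[0]) + 1)
    n = len(rows)
    for _ in range(steps):
        grad = [0.0] * len(coefs)
        for x, y in zip(rows, labels):
            features = [1.0] + list(x)  # leading 1 is the intercept term
            p = sigmoid(sum(c * f for c, f in zip(coefs, features)))
            for j, f in enumerate(features):
                grad[j] += (y - p) * f
        coefs = [c + rate * g / n for c, g in zip(coefs, grad)]
    return coefs


def predict_probability(coefs, x):
    """Probability that the dependent variable takes the target value 1."""
    features = [1.0] + list(x)
    return sigmoid(sum(c * f for c, f in zip(coefs, features)))


# Hypothetical applicants: (debt-to-income ratio, number of late payments).
applicants = [
    (0.2, 0), (0.3, 0), (0.4, 1), (0.3, 2), (0.2, 1), (0.6, 0),
    (0.5, 0), (0.6, 2), (0.7, 3), (0.8, 1), (0.5, 3), (0.4, 3),
]
# 1 = 60 days past due or worse, 0 = loan kept current (made-up outcomes).
delinquent = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

coefs = fit_logistic(applicants, delinquent)
p_low_risk = predict_probability(coefs, (0.2, 0))
p_high_risk = predict_probability(coefs, (0.8, 3))
```

The output is a probability rather than a class label, so a cutoff (for example 0.5) still has to be chosen before the model can be used to assign group membership.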
 Decision trees are a dependence technique used to
develop a model to classify the value of a single
dependent variable based on a set of independent
variables
• Dependent and independent variables can be any data
type
 The typical product of a decision tree method such as CART
(classification and regression trees) is a straightforward,
easily interpretable set of segmentation rules
• For example, classify existing customers as high or low
likelihood buyers of a new product based on
demographics and historical purchasing behavior.
The classification could be used to focus an advertising campaign
 Decision trees can also be used to examine
profiles of different market segments with
respect to underlying demographic and
psychographic variables
▪ For example, what are the most significant
demographic variables determining whether the
Internet is a person’s most important information
source?
 Properties
• Single dependent variable of any scale
• Multiple independent variables of any scale
• Free of model assumptions typical in other dependence
techniques
• Powerful statistical learning algorithm able to identify
complex variable interactions
 Issues
• Not as familiar
• Standard inferential statistics not applicable
• Often leads to asymmetric relationships
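A segmentation rule in a classification tree comes from searching for the split that makes the resulting groups as pure as possible. The sketch below shows one such search using Gini impurity, the criterion commonly associated with CART; a full tree would repeat this search recursively within each resulting group. The customer data and function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0 means a pure group)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def best_split(rows, labels):
    """Return (feature index, threshold, weighted impurity) of the best binary split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        for threshold in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[j] <= threshold]
            right = [y for r, y in zip(rows, labels) if r[j] > threshold]
            if not left or not right:
                continue  # a split must put cases on both sides
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


# Hypothetical customers: (age, number of prior purchases).
customers = [(25, 0), (32, 1), (41, 0), (58, 1), (29, 3), (36, 4), (47, 2), (52, 5)]
# Made-up likelihood of buying the new product.
likelihood = ["low", "low", "low", "low", "high", "high", "high", "high"]

feature, threshold, impurity = best_split(customers, likelihood)
```

In this toy data the search selects the rule "prior purchases <= 1", which separates the low and high likelihood buyers completely; age alone gives no clean split.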
 Factor analysis is an interdependence technique
used to identify a set of underlying latent traits
(factors) that explain the correlations between a
large number of variables
• Data summarizing
▪ Derive a set of underlying concepts that summarize a larger set
of variables
• Data reduction
▪ Develop a set of factor variables that serves as a more
parsimonious description of the data
 For example, a client is interested in defining the underlying
dimensions that influence the perception of online destinations
• Survey respondents are asked to rate a set of destinations
(including client’s) with respect to a number of traits
• Factor analysis can be applied to develop a succinct set of
perception dimensions
• This manageable set of dimensions can be used to
characterize a client’s site and to develop a focused plan to
reposition it
 Factors can then be used to provide a visual summary of the data

[Figure: scatter plot of sites A through H and the client's site
on the Competence and Sophistication dimensions (axis ticks from
.5 to 4.5); Trustworthy and Exciting also appear as factor labels]

Survey question: On a scale of 1 to 5 where "1" means "not at all
descriptive" and "5" means "extremely descriptive," how well do
each of the following words or phrases describe the website?

Traits rated: 15 Down-to-earth, 16 Daring, 17 Intelligent,
18 Confusing, 19 Friendly, 20 Up-to-date, 21 Clumsy, 22 Slick,
23 Genuine, 24 Imaginative, 25 Pretentious, 26 Upper class,
27 Honest, 28 Spirited, 29 Dependable, 30 Reliable,
31 Informative, 32 Silly, 33 Efficient, 34 Sassy
 Properties
• Very useful in identifying structure and relationships in
data
• Provides tractable set of concepts for both managerial
and analytical uses
• Provides opportunities for visualizations
 Issues
• Questionnaire design
• Variable selection
• Factor interpretation and validity
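The idea that a few latent traits can account for the correlations among many rated items can be sketched with a simplified stand-in: extracting the first principal component of the correlation matrix by power iteration. This is not a full factor analysis (there is no communality estimation or rotation), and the ratings, trait selection, and function names below are hypothetical.

```python
import math
from statistics import mean, pstdev


def correlation_matrix(columns):
    """Pearson correlations between equal-length rating columns."""
    n = len(columns[0])
    z = []
    for col in columns:
        m, s = mean(col), pstdev(col)
        z.append([(v - m) / s for v in col])  # standardize each column
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in z] for zi in z]


def first_component(corr, iterations=200):
    """Dominant eigenvalue and unit eigenvector of corr by power iteration."""
    k = len(corr)
    vec = [1.0 / math.sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = math.sqrt(sum(v * v for v in nxt))
        vec = [v / norm for v in nxt]
    eigenvalue = sum(
        vec[i] * sum(corr[i][j] * vec[j] for j in range(k)) for i in range(k)
    )
    return eigenvalue, vec


# Hypothetical 1-to-5 ratings of one site by six respondents.
honest = [5, 4, 4, 2, 1, 2]
dependable = [5, 5, 4, 2, 2, 1]
reliable = [4, 5, 4, 1, 2, 2]
daring = [3, 2, 4, 3, 2, 4]

corr = correlation_matrix([honest, dependable, reliable, daring])
eigenvalue, vector = first_component(corr)
# Loadings: correlation of each trait with the extracted component.
loadings = [math.sqrt(eigenvalue) * v for v in vector]
```

In this made-up data the first three traits load heavily on a single dimension while "daring" barely loads on it, which is the kind of pattern an analyst would then have to interpret and name (for instance as a trustworthiness dimension).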
 Cluster analysis is an interdependence technique
used to segment cases into homogeneous
groups based on a specified set of variables
• Data reduction
▪ Develop a more parsimonious description of cases which can
then be used in analytical classification methods
• Identify similarities between cases with respect to
clustering variables
• Characterize clusters with respect to other sets of
variables
 For example, we want to identify and then characterize similar
groups of TV pilot shows based on survey responses
rating shows on various traits
• For one or two traits it may be possible to do this
subjectively. Cluster analysis provides an
objective method for multiple traits
• Clusters can be characterized with respect to variables not
used in the analysis, such as show success, and cluster
membership can be used as a dependent variable in
a classification method
[Figure: TV pilot shows plotted by HUMOR against CLEVER ratings
(both axes from 20 to 60) and grouped into three clusters.
Cluster 1: Low likelihood of success
Cluster 2: Moderate likelihood of success
Cluster 3: High likelihood of success]
 Properties
• Many cluster techniques are available for data of all scales
• Can identify structure in large data sets that may be
difficult to discover in any other way
• Provides objective segmentation method
 Issues
• Selecting appropriate clustering method
• Determining appropriate number of clusters
• Validating clusters
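As one concrete instance of a clustering method, the sketch below implements basic k-means, which alternates between assigning each case to its nearest centroid and recomputing the centroids. K-means is chosen here only for illustration; the show ratings, the starting centroids, and the choice of three clusters are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to the nearest centroid, then re-average."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
            )
            for point in points
        ]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                updated.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                updated.append(old)  # keep an empty cluster's centroid in place
        centroids = updated
    return labels, centroids


# Hypothetical (CLEVER, HUMOR) ratings for nine pilot shows.
shows = [
    (25, 28), (28, 24), (30, 30),
    (40, 42), (43, 38), (45, 44),
    (55, 57), (58, 52), (60, 60),
]
# Starting centroids picked by hand; real analyses usually try several starts.
start = [shows[0], shows[4], shows[8]]

labels, centroids = kmeans(shows, start)
```

The number of clusters and the starting centroids are inputs here, which is why selecting the method, choosing the number of clusters, and validating the result remain open issues for the analyst.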
