Multivariate Data Analysis

Overview of Methods

Sophisticated multivariate statistical methods are becoming standard practice in the physical, natural and social sciences, as well as in business
• Variations of existing methods are being developed, existing techniques are being applied to new problems, and new methods continue to be designed

[Figure: two-dimensional map plotting most important information source (talking to co-workers, reading a book, talking to friends, radio, cable television, newspaper, broadcast television, Internet, magazines) and age group (25-29 through 50-54) on Dimension 1 and Dimension 2]

The accelerated use of advanced multivariate techniques is being driven by
• Growing complexity in the topics being addressed
• Ever-larger data sets
• Ability to apply computationally intensive methods through powerful computer tools
• Academic training


Scale              Definition                                                Examples                      Descriptive statistics
Nominal            Non-ordered categories                                    Race, gender, marital status  Percentages, frequency, mode
Ordinal            Ordered relation between categories                       Attitudes, social class       Percentiles, median
Interval (metric)  Ordered relation, equality of differences                 Economic indices              Range, mean, standard deviation
Ratio (metric)     Ordered relation, equality of differences, absolute zero  Sales, costs                  All of above, coefficient of variation

Response vs explanatory
• Response or dependent variable
  ▪ Variable to be modeled or predicted
• Explanatory or independent variable
  ▪ Variables used to predict or model dependent variable
Importance of identifying data and variable types
• Critical in determining analysis objectives and appropriate analysis method
• Avoid inappropriate variable operations
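To make "avoid inappropriate variable operations" concrete, here is a minimal Python sketch that computes only the summaries suited to a variable's measurement scale. The `ALLOWED` mapping and the `summarize` helper are illustrative names, not part of the slides. The mapping is one common reading of the four scales, and the sample data are made up.

```python
import statistics

# Assumed mapping from measurement scale to the summaries that are meaningful for it.
ALLOWED = {
    "nominal": ["mode"],
    "ordinal": ["mode", "median"],
    "interval": ["mode", "median", "mean", "stdev"],
    "ratio": ["mode", "median", "mean", "stdev", "cv"],
}

STATS = {
    "mode": statistics.mode,
    "median": statistics.median_low,  # returns an observed value, never an average
    "mean": statistics.mean,
    "stdev": statistics.stdev,
    "cv": lambda v: statistics.stdev(v) / statistics.mean(v),  # coefficient of variation
}

def summarize(values, scale):
    """Return only the descriptive statistics that are valid for `scale`."""
    return {name: STATS[name](values) for name in ALLOWED[scale]}

marital_status = ["single", "married", "married", "divorced"]  # nominal, made-up data
sales = [120.0, 80.0, 100.0, 100.0]                            # ratio, made-up data

nominal_summary = summarize(marital_status, "nominal")
ratio_summary = summarize(sales, "ratio")
print(nominal_summary)
print(ratio_summary)
```

Asking for a mean of `marital_status` is never attempted, because the nominal scale does not list it. That is the kind of guard the slide recommends.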

Dependence techniques
• One or a set of variables are regarded as dependent variables
• Objective is to predict or explain the value of the dependent variable(s) based on the values of a set of independent variables
• Examples
  ▪ What is the probability that a loan applicant will default?
  ▪ What factors best differentiate people whose primary news source is the Internet?

Dependence techniques
• Multiple regression
• Logistic regression
• Discriminant analysis
• Canonical correlation
• Structural equation modeling
• Analysis of variance
• Decision trees

Interdependence techniques
• No single group of variables defined as dependent or independent
• Objective is to identify and characterize underlying structure between the variables
• Examples
  ▪ What are the underlying factors that define a customer’s perception of a brand?
  ▪ Which signal returns arise from the same object and how many objects are present?

Interdependence techniques
• Factor analysis
• Multidimensional scaling
• Correspondence analysis
• Cluster analysis

Interdependence techniques are valuable data reduction methods
• Data reduction attempts to manage and interpret the large amounts of data gathered
• One goal is to combine groups of cases measured over multiple variables into a relatively small number of understandable segments
• Or to group variables together into categories of latent traits and then characterize cases with respect to this smaller number of traits
The reduced data variables are then often used as variables in dependence techniques

Multiple regression is a dependence technique used to model the relationship between the value of a single metric dependent variable and a set of metric independent variables
• Categorical variables can be included as “dummy” variables
Model can be applied to predict changes in the dependent variable’s response to changes in the independent variables
Regression also indicates the relative importance of independent variables on the response of the dependent variable

For example, a client may be interested in understanding the effect of price and promotional activity on a product’s market share among both “loyal” and “not loyal” customers
Technical result is a linear model of the form
• Y = a_0 + a_1 X_1 + a_2 X_2 + … + a_n X_n
Best visualizations of the results control all but one (or two) of the independent variables and examine how the value of dependent variable changes with respect to the “free” independent variables

[Figure: two panels, “Market share for loyal customers” and “Market share for not-loyal customers”, each plotting Market Share against Promotion Index]
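The market share example can be sketched in a few lines of Python. This is a minimal illustration, not the analysis behind the slides: the data are made up, the coefficients used to generate them are assumptions, and `solve`, `fit_ols` and `predict` are hypothetical helper names. It fits the linear model by ordinary least squares with a loyalty dummy variable, then holds price fixed and varies only the promotion index for each customer group.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_ols(X, y):
    """Least-squares coefficients [a0, a1, ..., an] via the normal equations."""
    rows = [[1.0] + list(r) for r in X]  # leading 1.0 is the intercept column
    k = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(XtX, Xty)

def predict(coefs, x):
    return coefs[0] + sum(a * xi for a, xi in zip(coefs[1:], x))

# Made-up observations: (price, promotion index, loyal dummy: 1 = loyal, 0 = not loyal).
X = [(5, 20, 0), (6, 40, 0), (4, 60, 0), (5, 80, 0),
     (5, 30, 1), (6, 50, 1), (4, 70, 1), (7, 20, 1)]
# Market share generated from an assumed model so the fitted coefficients can be checked.
y = [30 - 2 * price + 0.25 * promo + 10 * loyal for price, promo, loyal in X]

coefs = fit_ols(X, y)

# Hold price fixed at 5 and vary only the promotion index, once per customer group.
for loyal in (1, 0):
    curve = [round(predict(coefs, (5, promo, loyal)), 1) for promo in range(20, 81, 20)]
    print("loyal" if loyal else "not loyal", curve)
```

Because the data are generated without noise, the fit recovers the assumed coefficients. With real data the same code would return estimates, and the linearity assumption would need to be checked.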

Properties
• Single interval scale dependent variable
• Multiple independent variables, preferably on interval scale
• Familiar and useful technique
Issues
• Assumes linear relationship between dependent and independent variables
• Overused, and assumptions are often not fully checked
• Often misapplied to classification problems

Logistic regression is a dependence technique used to model the relationship between a single categorical dependent variable and a set of metric independent variables
• Typically the dependent variable takes one of two values: success/failure, buy/do not buy
• Multinomial formulations cover more than two categories
A logistic model gives the probability that the dependent variable takes a target value given the values of the independent variables

For example, which credit and demographic factors best predict whether a customer will keep a loan current?
• Dependent variable taken as 60 days past due or worse
• Independent variables are credit and employment history, and demographic descriptors
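As a rough illustration of the loan example, the sketch below fits a logistic model by plain gradient ascent on the log-likelihood and returns a probability of being 60 days past due or worse. The two predictors (late payments in the past year, years with current employer), the data, and the names `fit_logistic` and `predict_proba` are all assumptions made for illustration. A statistical package would normally be used instead.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.05, epochs=5000):
    """Fit weights [b0, b1, ..., bn] by batch gradient ascent on the log-likelihood."""
    w = [0.0] * (len(X[0]) + 1)
    n = len(X)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            row = [1.0] + list(xi)  # leading 1.0 is the intercept term
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, row)))
            for j, xj in enumerate(row):
                grad[j] += err * xj
        w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return w

def predict_proba(w, x):
    """Probability that the dependent variable takes the target value (1)."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

# Made-up customers: (late payments in past year, years with current employer).
X = [(0, 10), (1, 8), (0, 6), (2, 7), (1, 3),
     (3, 2), (4, 1), (2, 1), (3, 5), (0, 2)]
# 1 = loan became 60 days past due or worse, 0 = loan kept current.
y = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0]

w = fit_logistic(X, y)
p_risky = predict_proba(w, (4, 1))   # many late payments, short employment
p_safe = predict_proba(w, (0, 10))   # no late payments, long employment
print(f"risky profile: {p_risky:.2f}, safe profile: {p_safe:.2f}")
```

The output is a probability rather than a yes/no label, so a cut-off still has to be chosen before the model is used to classify applicants.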

Properties
• Powerful technique for predicting group membership and identifying important independent variables
• Becoming more widely used
• Procedures and results similar to linear regression
Issues
• Adequate data
• Model validation
• Communicating probabilistic concepts

Decision trees are a dependence technique used to develop a model to classify the value of a single dependent variable based on a set of independent variables
• Dependent and independent variables can be any data type
The typical product of CART (classification and regression trees) is a straightforward, easily interpretable set of segmentation rules
• For example, classify existing customers as high or low likelihood buyers of a new product based on demographics and historical purchasing behavior. Classification could be used to focus an advertising campaign

Decision trees can also be used to examine profiles of different market segments with respect to underlying demographic and psychographic variables
• For example, what are the most significant demographic variables determining whether the Internet is a person’s most important information source?
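The core step of a CART-style tree, choosing the single split that best separates the classes, can be sketched as follows. This is one split only, not a full tree, and it uses Gini impurity as the splitting criterion. The customer data and the helper names `gini` and `best_split` are made up for illustration.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 when every label in the group is the same."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (feature index, threshold, weighted Gini) for the best rule x[f] <= t."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        # Every observed value except the largest is a candidate threshold.
        for t in sorted({r[f] for r in rows})[:-1]:
            left = [lab for r, lab in zip(rows, labels) if r[f] <= t]
            right = [lab for r, lab in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best

# Made-up customers: (age, number of prior purchases).
rows = [(25, 0), (32, 1), (47, 0), (51, 1), (38, 4), (29, 5), (44, 6), (56, 3)]
# Likelihood of buying the new product.
labels = ["low", "low", "low", "low", "high", "high", "high", "high"]

feature, threshold, score = best_split(rows, labels)
names = ["age", "prior purchases"]
print(f"rule: {names[feature]} <= {threshold} -> low, otherwise high (Gini {score:.2f})")
```

A full tree repeats this search inside each resulting group until a stopping rule is met, which is how the segmentation rules are built up.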


Properties
• Single dependent variable of any scale
• Multiple independent variables of any scale
• Free of model assumptions typical in other dependence techniques
• Powerful statistical learning algorithm able to identify complex variable interactions
Issues
• Not as familiar
• Standard inferential statistics not applicable
• Often leads to asymmetric relationships

Factor analysis is an interdependence technique used to identify a set of underlying latent traits (factors) that explain the correlations between a large number of variables
• Data summarizing
  ▪ Derive a set of underlying concepts that summarize a larger set of variables
• Data reduction
  ▪ Develop a set of factor variables that serves as a more parsimonious description of the data

Interested in defining underlying dimensions influencing the perception of online destinations
• Survey respondents are asked to rate a set of destinations (including client’s) with respect to a number of traits
• Factor analysis can be applied to develop a succinct set of perception dimensions
• This manageable set of dimensions can be used to characterize a client’s site and to develop a focused plan to reposition it
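The idea of collapsing many rated traits into a few dimensions can be sketched with a principal-component extraction from the correlation matrix. This is a simplified stand-in for factor analysis (no rotation, no communality estimation), the ratings are made up, and `pearson` and `top_component` are hypothetical helper names.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long lists of ratings."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def top_component(R, iters=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix (power iteration)."""
    n = len(R)
    v = [1.0] + [0.0] * (n - 1)  # simple start vector; assumed not orthogonal to the answer
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    lam = sum(v[i] * sum(R[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

# Made-up 1-to-5 ratings from six respondents on four traits.
ratings = {
    "Honest":     [5, 4, 4, 2, 1, 2],
    "Dependable": [5, 5, 4, 2, 2, 1],
    "Daring":     [3, 5, 1, 4, 3, 2],
    "Spirited":   [3, 5, 2, 4, 2, 2],
}
names = list(ratings)
R = [[pearson(ratings[a], ratings[b]) for b in names] for a in names]

lam1, v1 = top_component(R)
# Deflate: subtract the first component, then extract the next one.
R2 = [[R[i][j] - lam1 * v1[i] * v1[j] for j in range(4)] for i in range(4)]
lam2, v2 = top_component(R2)

# Loadings: correlation of each trait with each extracted component.
load1 = [math.sqrt(lam1) * c for c in v1]
load2 = [math.sqrt(lam2) * c for c in v2]
for name, a, b in zip(names, load1, load2):
    print(f"{name:<11}{a:6.2f}{b:6.2f}")
```

In this made-up data the second component separates Honest and Dependable from Daring and Spirited. That is the kind of grouping an analyst would then label as a perception dimension.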

Factors can then be used to provide visual summary of data
Survey question: On a scale of 1 to 5 where "1" means "not at all descriptive" and "5" means "extremely descriptive," how well do each of the following words or phrases describe the website?
Items 15-34: Down-to-earth, Daring, Intelligent, Confusing, Friendly, Up-to-date, Clumsy, Slick, Genuine, Imaginative, Pretentious, Upper class, Honest, Spirited, Dependable, Reliable, Informative, Silly, Efficient, Sassy
[Figure: plot of sites A through H and Client on the factors Competence, Sophistication, Trustworthy and Exciting; axis scale 1.0 to 4.5]

Properties
• Very useful in identifying structure and relationships in data
• Provides tractable set of concepts for both managerial and analytical uses
• Provides opportunities for visualizations
Issues
• Questionnaire design
• Variable selection
• Factor interpretation and validity

Cluster analysis is an interdependence technique used to segment cases into homogeneous groups based on a specified set of variables
• Data reduction
  ▪ Develop a more parsimonious description of cases which can then be used in analytical classification methods
• Identify similarities between cases with respect to clustering variables
• Characterize clusters with respect to other sets of variables

Want to identify and then characterize similar groups of TV pilot shows based on survey responses rating shows on various traits
• For one or two traits it may be possible to do this subjectively. Cluster analysis provides an objective method for multiple traits
• Clusters can be characterized with respect to variables not used in the analysis, such as show success, and cluster membership can be used as a dependent variable in a classification method
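One way to group shows objectively is k-means clustering, sketched below. The slides do not say which clustering method was used, so k-means is an assumption here, as are the scores, the choice of three clusters, the starting centroids, and the `kmeans` helper name.

```python
def kmeans(points, centroids, iters=20):
    """Lloyd's algorithm: assign points to the nearest centroid, then recompute centroids."""
    for _ in range(iters):
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((p - c) ** 2 for p, c in zip(pt, centroids[k])))
            for pt in points
        ]
        new = []
        for k in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == k]
            if members:
                new.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new.append(centroids[k])  # keep the centroid of an empty cluster unchanged
        if new == centroids:  # assignments are stable, so stop early
            break
        centroids = new
    return labels, centroids

# Made-up (clever, humor) survey scores for nine pilot shows.
points = [(22, 25), (25, 30), (28, 24),
          (38, 40), (42, 37), (40, 44),
          (55, 52), (58, 57), (52, 58)]
# Starting centroids picked by hand; results can depend on this choice.
start = [points[0], points[3], points[6]]

labels, centroids = kmeans(points, start)
print(labels)
print(centroids)
```

Each show's cluster label could then be compared with a variable left out of the clustering, such as show success, or used as the dependent variable in a classification method.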

[Figure: scatter plot of TV pilot shows on HUMOR and CLEVER ratings (scale 20 to 60), grouped into Cluster 1: Low likelihood of success, Cluster 2: Moderate likelihood of success, Cluster 3: High likelihood of success]

Properties
• Many cluster techniques are available for data of all scales
• Can identify structure in large data sets that may be difficult to discover in any other way
• Provides objective segmentation method
Issues
• Selecting appropriate clustering method
• Determining appropriate number of clusters
• Validating clusters