
Data Analysis Techniques using SPSS


By
Dr. Rajat Agrawal
Department of Management Studies
Indian Institute of Technology Roorkee
Contents of this Session
 Exploratory Factor Analysis
 Discriminant Analysis
 Cluster Analysis
Exploratory Factor Analysis
 Exploratory Factor Analysis (EFA), or simply factor analysis, is a multivariate statistical technique used to uncover the underlying structure of a relatively large set of correlated variables.
 EFA serves to identify a set of latent variables (constructs or factors) from an underlying set of measured variables.
 A latent variable is a variable that cannot be observed directly; rather, it is inferred from other observed variables.
Latent Variable
For example, if we want to assess a person's aspiration towards their career, we cannot use conventional instruments such as a ruler or a weighing balance, because what we are measuring is something very abstract.
However, we can infer their career aspirations from a series of related statements, such as:
 My career achievement determines my life success.
 I will do anything to ensure that I succeed.
 No sacrifice is too great for my career.
 My career accomplishments are more important to me than anything else.

A person may be asked to respond to these statements on a Likert scale, with 1 = highly disagree, 3 = neutral and 5 = highly agree.
The question is: how can we be sure that these four statements measure our latent variable (career aspirations)? This question is answered by factor analysis.
Definition and Purpose
Factor analysis can be used to examine the underlying patterns or relationships for a large number of correlated observed variables, and to determine whether the information contained in these observed variables can be condensed or summarized into a smaller set of factors or components (Hair et al., 2009).
Factor analysis helps us construct latent variables from observed variables.
If a battery of statements can be grouped successfully using factor analysis, then such a battery forms a scale for measuring the latent variable under observation.
Basic Terminology
 KMO (Kaiser-Meyer-Olkin) Measure of Sampling Adequacy: shows how suitable the data is for running factor analysis; it is the proportion of variance among the variables that is common. The higher, the better.

Guidelines for KMO:
0.00 to 0.49    unacceptable
0.50 to 0.59    miserable
0.60 to 0.69    mediocre
0.70 to 0.79    acceptable
0.80 to 0.89    very good
0.90 and above  excellent
 Bartlett's Test of Sphericity: tests whether the correlation matrix is an identity matrix. An identity matrix is one in which all the elements of the principal diagonal are ones and all off-diagonal elements are zeros. Factor analysis cannot work on an identity matrix.
 Eigenvalue: the sum of the squared factor loadings of all variables on a factor. It shows how much of all observed variance is explained by that factor. For a factor to be considered significant, it should have an eigenvalue greater than or equal to 1.
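In symbols, if l_ij denotes the loading of observed variable i on factor j and p is the number of observed variables, the eigenvalue of factor j can be written as:

\lambda_j = \sum_{i=1}^{p} l_{ij}^{2}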
 Factor Loading: represents the correlation between the observed score and the latent score. Generally, the higher, the better. The following guidelines can be used to decide the cut-off value of factor loadings based on the sample size (Stevens, 2002):

Sample Size     Loading
100             0.512
200             0.364
300             0.298
600             0.210
1000 and above  0.162
 Percentage of Variance: represents the percentage of variance that can be attributed to a factor relative to all other factors.
Choosing a Method for EFA
 Principal Axis Factoring (most commonly used)
 Maximum Likelihood Factoring
Do not use PCA (Principal Component Analysis), as it is just a data reduction technique, whereas EFA is a technique to construct latent variables.
Factor Rotation
 Rotation can be simply defined as any of a variety of methods used to further analyze the initial EFA results.
 Rotation aims to make the pattern of loadings clearer and more pronounced.
There are two types of rotation methods:
Orthogonal: assumes that the factors are not correlated (example: Varimax rotation).
Oblique: assumes that the factors are correlated (example: Oblimin rotation).
For use in social sciences and management research, oblique rotation should be preferred.
Running EFA in SPSS
Steps in running an EFA:
• Clean and load the data into SPSS
• Examine KMO and Bartlett's test
• Examine the eigenvalues and decide on the number of factors to extract
• Examine the factor loadings and retain items as per the criteria mentioned earlier
Example Case
A smartphone OEM wants to know what customers expect from its brand. To understand what customers want, the marketing department of the OEM drafted a questionnaire from focus group interviews. The questionnaire consists of 9 items measured on a 5-point Likert scale. Conduct an EFA to uncover the underlying dimensional structure. The sample questionnaire is shown below.
Example Case -- The Questionnaire
var1  I consider myself a price-conscious buyer
var2  I always seek value for money in my purchases
var3  There is no point in buying expensive phones as cheaper brands are good enough
var4  I always seek phones that are in line with current trends
var5  I always seek phones that can run today's demanding apps and games
var6  I seek phones with the latest OS and processor
var7  For me, prompt after-sales service is important
var8  I expect the phone manufacturer to provide regular software updates
var9  I would prefer a phone brand that has a service center near my place
Running EFA on SPSS
Go to Analyze -> Dimension Reduction -> Factor.
Add all the observed variables to the factor analysis variables box.
Click the "Descriptives" button and, in the dialog box, tick the KMO and Bartlett's test option.
Next, click the "Extraction" button, open the "Method" list in the dialog box and select Principal Axis Factoring. Click Continue.
Next, click the "Rotation" button and, in the dialog box, select "Direct Oblimin". After this, click Continue and click OK.
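If you prefer command syntax, clicking Paste instead of OK in these dialogs generates commands along the lines of the sketch below (it assumes the nine items are named var1 to var9, as in the questionnaire; your names may differ):

* Request KMO and Bartlett's test, extract by Principal Axis Factoring.
* Keep factors with eigenvalues >= 1 and rotate with Direct Oblimin.
FACTOR
  /VARIABLES var1 var2 var3 var4 var5 var6 var7 var8 var9
  /PRINT INITIAL KMO EXTRACTION ROTATION
  /CRITERIA MINEIGEN(1) ITERATE(25)
  /EXTRACTION PAF
  /ROTATION OBLIMIN.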
Examining the Output
Examine the KMO and Bartlett's test values.
[Output screenshot: KMO statistic and the p-value of Bartlett's test]
Examine the eigenvalues: here we have obtained 3 factors, as there are 3 eigenvalues, all greater than 1.
[Output screenshot: table of eigenvalues]
Examine the "structure matrix" for factor loadings, irrespective of their sign. The factor on which a variable's loading is highest is the factor to which that variable belongs.
[Output screenshot: structure matrix; var1, var2 and var3 load on factor 3]
Also examine the correlations between the factors; correlations in excess of 0.85 indicate discriminant validity issues.
Follow Up
Calculate the reliability (Cronbach's Alpha) of the factors obtained. Go to Analyze -> Scale -> Reliability Analysis.
For example, factor 3 had var1, var2 and var3 in it; enter them in the items box of the dialog box.
For reliable measurement, Cronbach's Alpha should be greater than 0.7.

[Output screenshot: Cronbach's Alpha value for factor 3]
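The equivalent syntax, assuming factor 3 groups var1, var2 and var3, is a one-command sketch:

* Cronbach's Alpha for the items that load on factor 3.
RELIABILITY
  /VARIABLES=var1 var2 var3
  /SCALE('Factor 3') ALL
  /MODEL=ALPHA.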
Linear Discriminant Analysis
 Linear Discriminant Analysis (LDA) is a supervised machine learning technique.
 Its goal is to classify observations into classes (the dependent variable) based on their features (the independent variables).
 It helps in building classification models for future prediction.
Assumptions for LDA:
 Normality of data
 Homogeneity of variance
 Observations should be independently and randomly sampled.
Example Dataset
• The present example uses the "iris" dataset.
• The dataset classifies flowers into three species, "setosa", "versicolor" and "virginica", based on features such as sepal length, sepal width, petal length and petal width.
• Here, "species" is the dependent variable, and sepal length, sepal width, petal length and petal width are the independent variables.
• Therefore, for running LDA, the data should contain the independent variables on a continuous scale and the dependent variable on a nominal scale.
Training a LDA Model in SPSS
Go to Analyze -> Classify -> Discriminant.
In the dialog box, enter the independent variables in the box labeled "Independents" and the nominal dependent variable in the "Grouping Variable" box.
Next, click the "Define Range" button and enter the range of numeric codes used for the nominal variable. We enter 1 and 3 here, as we have 3 classes in the dependent variable.
Next, click the "Statistics" button and, in the dialog box, check the "Box's M" option.
Next, click the "Classify" button and, in the dialog box, check the "Combined-groups" option under Plots. This helps us visualize the discriminant scores and make predictions.
Click Continue, then click OK to generate the output.
The classification performance of classifiers is usually checked using a confusion matrix, which shows the correctly and incorrectly classified cases. To generate a confusion matrix, click the "Classify" button in the discriminant analysis dialog box and, in the dialog box that opens, check "Summary table" under the Display section.
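For reference, the pasted syntax for these choices looks roughly like the following sketch (the species codes 1 to 3 and the feature variable names are assumptions; match them to your data file):

* Fit the LDA model: request Box's M, the classification summary
* table and the combined-groups plot.
DISCRIMINANT
  /GROUPS=species(1 3)
  /VARIABLES=sepal_length sepal_width petal_length petal_width
  /ANALYSIS ALL
  /PRIORS EQUAL
  /STATISTICS=BOXM TABLE
  /PLOT=COMBINED
  /CLASSIFY=NONMISSING POOLED.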
Examining the Output
The first thing to examine is Box's M test for homogeneity of variance. We can see below that the p-value associated with this test is less than 0.05, which indicates that the assumption of homogeneity of variance has been violated (a p-value above 0.05 would indicate that the assumption holds). LDA is, however, reasonably robust to this violation when the groups are of roughly equal size, so we proceed with the analysis.
[Output screenshot: Box's M test, with the p-value highlighted]
Here, we can see that the first function has the highest eigenvalue, so we will use the discriminant coefficients of this function to form the final equation.
[Output screenshot: eigenvalues of the discriminant functions and the discriminant coefficients]
So, the final discriminant equation will be:
D = -0.422 × sepal length - 0.522 × sepal width + 0.940 × petal length + 0.585 × petal width
Making a Prediction using LDA
The combined-groups plot helps to visualize which group a particular point will belong to based on its discriminant score.
[Output screenshot: combined-groups plot of discriminant scores]
Let us pick a data point (the first one, for the sake of easiness) from our dataset:
Sepal length = 5.1
Sepal width = 3.5
Petal length = 1.4
Petal width = 0.2
Fitting this into our equation:
D = -0.422 × 5.1 - 0.522 × 3.5 + 0.940 × 1.4 + 0.585 × 0.2
  = -2.15 - 1.83 + 1.32 + 0.12
  = -2.54
Now, as per the figure shown above, this case should belong to the setosa group (since its score is less than -2).
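To score every case at once rather than by hand, a small sketch like the following can be used (variable names assumed as before; note that DISCRIMINANT can also save discriminant scores directly through its /SAVE subcommand):

* Apply the Function 1 coefficients to every case in the file.
COMPUTE dscore = -0.422*sepal_length - 0.522*sepal_width
                + 0.940*petal_length + 0.585*petal_width.
EXECUTE.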
Evaluating Performance
[Output screenshot: confusion matrix, with correctly and incorrectly classified cases marked]
Diagonal values are correctly classified cases and off-diagonal values are incorrectly classified cases.
Cluster Analysis
Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
 It is an unsupervised machine learning technique.
 Its basic aim is to group similar observations based on certain features.
 It is a descriptive analytics tool.
 It is useful for market segmentation.
Clustering Algorithms
 Connectivity based (hierarchical clustering): based on the core idea of objects being more related to nearby objects than to objects farther away. These algorithms connect "objects" to form "clusters" based on their distance.
 Centroid or prototype based (k-means): clusters are represented by a central vector, which may not necessarily be a member of the data set. When the number of clusters is fixed to k, k-means clustering can be stated as an optimization problem: find the k cluster centers and assign the objects to the nearest cluster center, such that the squared distances from the cluster centers are minimized (the objective is written out below).
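Written out, the k-means objective is to choose cluster assignments S_1, ..., S_k and centers mu_1, ..., mu_k that minimize the within-cluster sum of squares:

\min_{S_1,\dots,S_k} \sum_{j=1}^{k} \sum_{\mathbf{x} \in S_j} \lVert \mathbf{x} - \boldsymbol{\mu}_j \rVert^{2}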
Clustering in SPSS
The current example also uses the "iris" data set.

Hierarchical Clustering in SPSS
Load the data set and go to Analyze -> Classify -> Hierarchical Cluster. Add the features to the "Variables" box.
Next, click the "Statistics" button and select "Agglomeration schedule" and "Proximity matrix". Click Continue.
Next, click the "Plots" button, select "Dendrogram" and click Continue. Click OK.
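The pasted syntax for these choices is roughly the sketch below (the dialog defaults of between-groups linkage and squared Euclidean distance are assumed, as are the feature names):

* Hierarchical clustering: print the agglomeration schedule and
* proximity matrix, and plot the dendrogram.
CLUSTER sepal_length sepal_width petal_length petal_width
  /METHOD BAVERAGE
  /MEASURE=SEUCLID
  /PRINT SCHEDULE DISTANCE
  /PLOT DENDROGRAM.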
Dendrogram
We can see there are 3 clusters at the first level. Clusters 1 and 2 are similar, so they are grouped together at the second level. Lastly, at the third level, the third cluster joins the combined clusters 1 and 2.
K-Means Clustering in SPSS
Load the data set and go to Analyze -> Classify -> K-Means Cluster. Add the features to the "Variables" box and set the number of clusters to 3.
Click on "Options" and select "ANOVA table" in the dialog box. Click Continue and click OK.
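The corresponding syntax sketch (same assumed feature names) is:

* K-means with k = 3; request the initial centers and the ANOVA table.
QUICK CLUSTER sepal_length sepal_width petal_length petal_width
  /CRITERIA=CLUSTER(3) MXITER(10)
  /PRINT INITIAL ANOVA.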
The Output
[Output screenshot: final cluster centers (mean value of features in each cluster), the ANOVA table (importance of each feature in clustering) and the total number of cases in each cluster]
Thank you
