HR Analytics: Session 3

Data summarization and reduction


DR. NIVEDHITHA KS
OB & HR AREA
IIM K
Factor Analysis
A form of dimension reduction.
An interdependence technique whose primary objective is to identify the hidden structure among the
variables.
In a univariate or bivariate analysis, the variables can be analysed easily.
Multivariate analysis involves a large number of variables, which makes the analysis difficult.
Factor analysis is therefore used to reduce the dimensions by grouping the variables into common
categories known as factors.
It condenses the variables into a reduced set of variables under specific categories.
Factor Analysis-Simplified
For instance, through qualitative research, a firm identifies 85 features of gamification that
are preferred by its employees. Now the firm wants to understand how these features help
employees improve various outcomes such as mental health, cohesion, performance,
productivity, etc.
What is the issue with this evaluation?
It is difficult to analyse the data when there are too many variables. It is therefore better to
check whether these 85 features can be clubbed together based on some underlying dimension.
E.g., I like to compete with my colleagues
I like to be on the top of the leaderboard
Competition is the key element I like in gamification
If you look at the statements above, you can see that they all express a liking for competition.
Therefore, you can group all these statements under a dimension/factor called competition.
Exploratory vs Confirmatory Factor Analysis
The new composite measures (factors) reflect the items within each category; these items are highly correlated with
one another within their respective factors and with the factors themselves.
Exploratory analysis: No a priori constraint is set on the number of factors to be extracted.
E.g., you would like to assess the impact of a few crowdsourcing social media features, so you collect data from the
respondents and then, based on the data, identify the features.
Confirmatory analysis: The basic structure of the data is already known. E.g., you would like to assess the impact of
specific crowdsourcing social media features, such as PDAs, verbal feedback and network exposure.
What are the variables in the above-mentioned objective?
PDA, verbal feedback and network exposure
You may use experiments, interviews or surveys to collect data to test your assumptions.
When you already know the variables, you also know the kind of questions that should be asked to
measure each of them. These variables therefore act as factors and the questions act as items. Using
confirmatory analysis, you can confirm whether these questions (items) actually measure these variables (factors).
Steps involved in Factor Analysis
Objective
Condense the data contained in the original number of variables into a smaller set of composite
dimensions/factors with minimum loss of information
Select the unit of analysis: If the items/data are to be reduced, use factor analysis; if the
respondents are to be reduced to a definite set of dimensions, use cluster analysis.
So, what is your purpose: to reduce the number of samples/respondents, or to reduce the number
of items/attributes that are measured through some data collection method?
Data summarization Vs data reduction
Data summarization:
View the set of variables from individual level to generalized level, where individual items are
grouped based on how they represent a concept collectively.
Interdependence or dependence technique?
It is an interdependence technique because all the items/data/attributes are considered
simultaneously, without any distinction such as dependent/independent variables.
The variate (factor) is formed to maximise the explanation of the entire data set.
Data reduction:
Identify representative variables (factors) from a larger set of variables and replace the entire set
with the representative variables through factor scores or a summated scale.
Factor selection
Factor analysis is a potential candidate for “garbage in, garbage out”.
Therefore, use logical assumptions when deciding the factors.
Also identify several key items that reflect each factor using factor loadings.
Sample size: preferably 100 or more. As a general rule, the item/attribute-to-sample ratio should be 1:10.
Assumptions
Conceptual issues
Ensure that the sample is homogeneous. For instance, it is inappropriate to apply this analysis when you
are sure that the set of attributes differs based on certain other attributes. If you are sure that
male and female respondents differ on the attributes, then it is not appropriate to treat them as a single data set.
Variance of a variable
Variance is the dispersion of values for a single attribute/item around its mean.
When an item is correlated with another, it shares variance with the other. For example, if
the correlation is 0.50, then (0.50)*(0.50) = 0.25 will be the shared variance. The variance of an item has three components:
Common variance: variance that is shared with other items under the underlying factor
Specific variance: variance that cannot be explained by the correlation with other items
Error variance: variance that is not explained by the correlation and is due to measurement error
Under a particular common factor, which variance will be high and which one will be low?
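The shared-variance arithmetic above can be checked with a short Python sketch. The item responses and the helper names (pearson, shared_variance) are made up for illustration and are not part of the session material.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def shared_variance(r):
    """Proportion of variance two items share: the squared correlation."""
    return r * r


# Hypothetical responses of five employees to two survey items
item_a = [1, 2, 3, 4, 5]
item_b = [2, 1, 4, 3, 5]

r = pearson(item_a, item_b)
print("correlation:", round(r, 2))
print("shared variance:", round(shared_variance(r), 2))
print("shared variance when r = 0.50:", shared_variance(0.50))  # 0.25
```

The remaining share of each item's variance (one minus the shared part) is what the specific and error components would have to account for.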
Factor extraction
The objective of the algorithm is to extract factors:
The first factor: the best linear combination of items, which accounts for most of the variance in the
data set.
Then the second factor is extracted from the variance remaining after the first factor is
extracted (the variables that account for the variance still unexplained by the first factor)…
Likewise, n factors will be extracted, n being the number of items.
Now the question is, how many factors should we consider in Exploratory factor analysis?
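One way to see the extraction sequence is to decompose a correlation matrix into its eigenvalues. The sketch below assumes a principal-component style of extraction (the slides do not name a specific extraction method), uses NumPy, and works on a made-up correlation matrix for four items.

```python
import numpy as np

# Hypothetical correlation matrix for four survey items:
# items 1-2 correlate strongly, and so do items 3-4.
R = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.7],
    [0.1, 0.1, 0.7, 1.0],
])

# eigh handles symmetric matrices and returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(R)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of the total variance accounted for by each successive factor
explained = eigenvalues / eigenvalues.sum()

for number, (value, share) in enumerate(zip(eigenvalues, explained), start=1):
    print(f"factor {number}: eigenvalue = {value:.3f}, variance explained = {share:.1%}")
```

There are as many eigenvalues as items, and each successive one accounts for less of the variance than the one before it. These are the values that would be plotted against factor number when deciding how many factors to retain.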
Confirmation of items to the respective factors
Factor loadings are the correlations of items with the factors.
Factor loadings should preferably be greater than 0.70.
When an item has a significant factor loading on more than one factor, it is termed cross-loading.
A scree test can be conducted by plotting eigenvalues against factor numbers. The graph will look like an
elbow. The point at which the curve first begins to straighten out is considered to indicate the
number of factors.
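The loading rules above can be applied mechanically once a loading table is available. The following is a minimal sketch: the loading values, item names and factor names are invented, and the 0.40 cut-off used to call a secondary loading "significant" is an assumption, since the slides give only the 0.70 guideline.

```python
# Hypothetical loading table: item -> {factor: loading}
loadings = {
    "compete_with_colleagues": {"competition": 0.82, "recognition": 0.15},
    "top_of_leaderboard": {"competition": 0.76, "recognition": 0.45},
    "earn_badges": {"competition": 0.20, "recognition": 0.79},
}


def strong_items(table, threshold=0.70):
    """Map each item to the factors on which its loading exceeds the threshold."""
    return {
        item: [factor for factor, value in row.items() if abs(value) > threshold]
        for item, row in table.items()
    }


def cross_loading_items(table, cutoff=0.40):
    """Items whose loading reaches the (assumed) cutoff on more than one factor."""
    return [
        item
        for item, row in table.items()
        if sum(abs(value) >= cutoff for value in row.values()) > 1
    ]


print(strong_items(loadings))
print("cross-loading items:", cross_loading_items(loadings))
```

In this made-up table, only the leaderboard item reaches the cut-off on both factors, so it is the one flagged as cross-loading.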
Data reduction
Surrogate: The item with the highest factor loading on factor A is taken as a
surrogate for the entire set of items under factor A.
Summated score: Combining individual items into a composite measure by taking the average of
the items.
Factor scores: Composite measure of each factor computed for each respondent. It represents the
degree to which an individual scores high on the group of items with high factor loadings on a factor.
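The first two options can be illustrated with a few lines of Python. This is only a sketch with invented loadings and responses for a "competition" factor. It does not compute factor scores, which depend on the estimation method used by the statistical package.

```python
from statistics import mean

# Hypothetical loadings of three items on the "competition" factor
competition_loadings = {
    "compete_with_colleagues": 0.82,
    "top_of_leaderboard": 0.76,
    "competition_key_element": 0.71,
}

# Hypothetical responses (1-5 scale) from two employees
responses = [
    {"compete_with_colleagues": 5, "top_of_leaderboard": 4, "competition_key_element": 3},
    {"compete_with_colleagues": 2, "top_of_leaderboard": 1, "competition_key_element": 3},
]


def surrogate_item(factor_loadings):
    """Pick the single item with the highest absolute loading on the factor."""
    return max(factor_loadings, key=lambda item: abs(factor_loadings[item]))


def summated_score(response, items):
    """Average the respondent's answers across the items of one factor."""
    return mean([response[item] for item in items])


surrogate = surrogate_item(competition_loadings)
scores = [summated_score(r, competition_loadings) for r in responses]

print("surrogate item:", surrogate)
print("summated scores:", scores)
```

Either result replaces the three original columns with a single column per respondent, which is the point of data reduction.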
Cluster Analysis
Classification of observations/objects based on certain characteristics. Observations are usually
respondents (e.g., customers taking a survey, employees in an organisation).
Helps to identify hidden patterns among the observations.
This is a data reduction method, which helps us to classify a large number of
respondents/observations into smaller, manageable groups.
Clusters should have high intra-group similarity (homogeneous within) and low inter-group similarity (heterogeneous between).
How are the groupings in cluster analysis and factor analysis performed?
◦ Based on the proximity of the observations and the variation/correlation of the items/variables, respectively.

Note: Observations/objects represent respondents in HR analytics and are shown in the rows of the data set.
Characteristics/attributes represent the actual data you would like to collect and form the columns
of the data set.
Cluster analysis
Can be used for data reduction.
◦ E.g., from the entire set of fresh candidates, you would like to understand the nature of these candidates
and group them into several sub-groups based on commonalities, e.g., location, educational background,
etc.

Can be used for hypothesis testing.


◦ You believe that the candidates’ attitudes towards different aspects of life can be used to group the
candidates into high performers, stayers and quitters. You can test this assumption using cluster
analysis.
Cluster analysis
If you change the attributes, will the clusters change or will the respondents fall in different
clusters?
Is it descriptive or inferential?
Will it always provide clusters irrespective of the structure of the data?
Is it generalizable?
Measuring similarity
Correlational measure: The correlation between the objects is assessed. Lower correlation: different
clusters; higher correlation: same cluster.
Distance measure: Most commonly used.
◦ Euclidean distance: Straight-line distance between two objects
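As a small illustration of the distance measure, the sketch below computes the Euclidean distance between two respondents described by two metric attributes. The attribute values are invented.

```python
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two observations with the same attributes."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical employees scored on (engagement, satisfaction)
employee_1 = (1, 2)
employee_2 = (4, 6)

print(euclidean(employee_1, employee_2))  # 5.0, since sqrt(3**2 + 4**2) = 5
```

The smaller the distance, the more similar the two respondents are, and the more likely they are to end up in the same cluster.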
Measuring similarity
Association measures:

Used to compare observations when the characteristics are measured in non-metric terms. For example,
the employees say yes or no to a set of attributes such as the likeability of the office space, boss and
colleagues. Association measures assess the degree of agreement between pairs of respondents.
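One simple way to quantify agreement on yes/no attributes is the simple matching coefficient: the share of attributes on which two respondents give the same answer. The slides do not name a specific association measure, so this choice is an assumption, and the answers below are invented.

```python
def simple_matching(a, b):
    """Share of attributes on which two respondents give the same answer."""
    matches = sum(a[attribute] == b[attribute] for attribute in a)
    return matches / len(a)


# Hypothetical yes/no answers about what each employee likes
employee_a = {"office_space": "yes", "boss": "no", "colleagues": "yes"}
employee_b = {"office_space": "yes", "boss": "yes", "colleagues": "yes"}

agreement = simple_matching(employee_a, employee_b)
print(f"agreement: {agreement:.2f}")  # the two agree on 2 of 3 attributes
```

A coefficient of 1 means the two respondents answered identically, and 0 means they disagreed on every attribute.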
Hierarchical Clustering
Involves a series of n-1 clustering decisions (n represents the number of observations), combining
observations into a hierarchy or tree-like structure.
◦ Agglomerative methods: Each observation starts out as its own cluster, and clusters are successively joined based
on the similarity measures until only a single cluster remains. So, when you have 50 observations, what
will be the number of clusters at the start and at the end?
◦ Divisive methods: All observations start in a single cluster, which is then divided (first into 2,
then 3… and so on) until each observation becomes its own cluster. So, when you have 50 observations,
what will be the number of clusters at the start and at the end?
◦ The agglomerative method is the more commonly used.
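The agglomerative procedure can be sketched in plain Python. The version below assumes single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance. The slides do not prescribe a linkage rule, and the five data points are invented.

```python
from itertools import combinations
from math import dist


def single_linkage(cluster_a, cluster_b, points):
    """Distance between two clusters: that of their closest pair of members."""
    return min(dist(points[i], points[j]) for i in cluster_a for j in cluster_b)


def agglomerate(points):
    """Merge the two closest clusters repeatedly; return every stage."""
    clusters = [(i,) for i in range(len(points))]  # each observation on its own
    history = [list(clusters)]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: single_linkage(pair[0], pair[1], points),
        )
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
        history.append(list(clusters))
    return history


# Hypothetical respondents described by two metric attributes
points = [(1, 1), (1.5, 1), (5, 5), (5, 5.6), (9, 1)]

history = agglomerate(points)
for stage in history:
    print(len(stage), "clusters:", stage)
```

With five observations the procedure starts at five clusters and ends at one after four merges, which matches the n-1 clustering decisions described above.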
Non-hierarchical Clustering
Doesn’t involve a tree-like construction process.
These methods assign observations to clusters once the number of clusters is specified. The procedure
has two steps.
◦ Specify cluster seeds. For example, the first observation that has no missing values can be taken as the
seed for a cluster.
◦ Assignment of observations: Assign each observation to one of the cluster seeds based on similarity.
◦ Cluster seeds can be formed simultaneously or sequentially.
K-means: A form of non-hierarchical clustering
Partition the data into a user-specified number of clusters.
Then iteratively reassign the observations to clusters until a numerical criterion is met.
The criterion specifies a goal related to minimizing the distance of observations within a cluster
and maximizing the distance between the clusters.
Non-hierarchical methods are preferred for HR analytics as they can accommodate large data sets.
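The K-means loop can be sketched in a few lines of plain Python. The assumptions here are that the first k observations serve as cluster seeds, that similarity is measured by Euclidean distance, and that the procedure stops when no observation changes cluster. The data points are invented.

```python
from math import dist


def k_means(points, k, max_iter=100):
    """Assign points to k clusters by repeated reassignment and re-centering."""
    centroids = [points[i] for i in range(k)]  # seeds: first k observations (assumption)
    assignment = None
    for _ in range(max_iter):
        # Assign every observation to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:  # stopping criterion: nothing moved
            break
        assignment = new_assignment
        # Move each centroid to the mean of the observations assigned to it.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assignment, centroids


# Hypothetical respondents described by two metric attributes
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]

assignment, centroids = k_means(points, k=2)
print("cluster of each respondent:", assignment)
print("cluster centres:", centroids)
```

Because every pass only compares each observation with k centroids rather than with every other observation, the approach scales to large data sets more easily than building a full hierarchy.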
Hierarchical or K-means?
