TITAN (THE IT & ANALYTICS)

IIM KASHIPUR

Dossier

WELCOME TO THE MANAGEMENT WORLD

Dear students, welcome to IIM. Welcome to the management world. You will realize over the
duration of your course how your work life is going to be different from that of a regular
employee. Many of you may have worked in different industries, at different positions and in a
variety of roles. Take that experience with you, build on it over these two years and develop
yourself.

IT INDUSTRY

Information technology (IT) is the use of computers to store, retrieve, transmit and manipulate
data, or information, often in the context of a business or other enterprise. The nature of this
industry has been changing rapidly for the past few years and is expected to keep doing so in the
coming years. The IT industry is generally used as a collective term together with the Information
Technology enabled Services (ITeS) industry.

Categories include:

 IT Products
 IT Services
 Business Process Outsourcing (BPO)

India is a major player in the IT Services and BPO segments. India is also a major exporter of IT,
and the sector contributed 7.7% of Indian GDP as of 2016-17. Though the industry had been on a
track of continuous growth since its rise in India, the last two to three years have been a little
tough for the sector. The initial years saw continuous growth because the industry was purely a
support function; now the industry has become so large that it has started to have its own ups
and downs.

Let us look at some of the IT concepts from a management student's point of view, and not just
from a technical person's.

IT MANAGEMENT
IT management is the discipline whereby all of the information technology resources of a firm are managed in
accordance with its needs and priorities. These resources may include tangible investments like computer
hardware, software, data, networks and data centre facilities, as well as the staff who are hired to maintain them.

The central aim of IT management is to generate value through the use of technology. To achieve this, business
strategies and technology must be aligned.

IT Management is different from management information systems. The latter refers to management methods
tied to the automation or support of human decision making. IT Management refers to IT related management
activities in organizations. MIS is focused mainly on the business aspect, with strong input into the technology
phase of the business/organization.

A primary focus of IT management is the value creation made possible by technology. This requires the alignment
of technology and business strategies. While the value creation for an organization involves a network of
relationships between internal and external environments, technology plays an important role in improving the
overall value chain of an organization. However, this increase requires business and technology management to
work as a creative, synergistic, and collaborative team instead of a purely mechanistic span of control.

SYSTEMS AND NETWORKS


A network is a group of two or more computer systems or other devices that are linked together to exchange data.
Networks share resources, exchange files and electronic communications. For example, networked computers can
share files or multiple computers on the network can share the same printer.

Different types of networks:


 Local Area Network (LAN)
 Wide Area Network (WAN)
 Metropolitan Area Network (MAN)
 Home Area Network (HAN)
 Virtual Private Network (VPN)
 Storage Area Network (SAN)

Network standards are important to ensure that hardware and software can work together. Without standards
you could not easily develop a network to share information. Networking standards can be categorized in one of
two ways: formal and de facto (informal).

Formal
Formal standards are developed by industry organizations or governments. Formal standards exist for network
layer software, the data link layer, hardware and so on. Formal standardization is a lengthy process of developing the
specification, identifying choices and gaining industry acceptance.
De Facto
The second category of networking standards is de facto standards. These standards typically emerge in the
marketplace and are supported by technology vendors but have no official backing. For example, Microsoft
Windows is a de facto standard, but is not formally recognized by any standards organization. It is simply widely
recognized and accepted.

DATA ANALYTICS
Data analytics (DA) is the process of examining data sets in order to draw conclusions about the information they
contain, increasingly with the aid of specialized systems and software. Data analytics technologies and techniques
are widely used in commercial industries to enable organizations to make more-informed business decisions and
by scientists and researchers to verify or disprove scientific models, theories and hypotheses.

Data analytics initiatives can help businesses increase revenues, improve operational efficiency, optimize marketing
campaigns and customer service efforts, respond more quickly to emerging market trends and gain a competitive
edge over rivals -- all with the ultimate goal of boosting business performance.

ANALYTICS INDUSTRY
Analytics is used across industry sectors, and almost every industry has scope for analytics. As with most new
technologies, though, there are early adopters: Finance & Banking is the biggest sector for the use of analytics,
followed by Marketing & Advertising and E-Commerce.

The data analytics market in India is growing at a fast pace, with companies and startups offering analytics services
and products catering to various industries. Different sectors have seen different penetration and adoption of
analytics, and so is the revenue generation from these sectors.

 Analytics, data science and big data industry in India is currently estimated to be $2.71 billion annually in
revenues, growing at a healthy rate of 33.5% CAGR.
 Analytics, data science and big data industry in India is expected to grow seven times in the next seven
years. It is estimated to become a 20-billion-dollar industry in India by 2025.
 In terms of geographies served, almost 64% of analytics revenues in India come from analytics exports to the
USA. The Indian analytics industry currently services almost $1.7 billion in revenue from US firms.
 The Indian domestic market is also significant, with almost 4.7% of analytics revenues coming from Indian firms.
 The average work experience of analytics professionals in India is 7.9 years, up from 7.7 years last year.
 57% of analytics professionals have a Master's/Post Graduation degree, which is the same as last year.
SOME BASIC CONCEPTS IN STATISTICS

Continuous and Discrete variables


DISCRETE VARIABLE
A discrete variable is a type of statistical variable that can assume only a fixed number of distinct values and
lacks an inherent order. It is also known as a categorical variable, because it has separate, indivisible categories;
no values can exist in between two categories, i.e. it does not attain all the values within the limits of the
variable. Examples:
 Number of printing mistakes in a book
 Number of road accidents in New Delhi
 Number of siblings of an individual

CONTINUOUS VARIABLE
A continuous variable, as the name suggests, is a random variable that assumes all the possible values in a
continuum. Simply put, it can take any value within a given range: it is defined over an interval of values,
meaning that it can assume any value between the minimum and the maximum. Examples:
 Height of a person
 Age of a person
 Profit earned by a company
Central Tendency
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central
position within that set of data.
 Mean - The mean is the average of the data set: the sum of all the values divided by the number of values.
It is the most commonly used measure but is sensitive to outliers.
 Median - The median is the middle score for a set of data that has been arranged in order of magnitude.
The median is less affected by outliers and skewed data.
 Mode - The mode is the most frequent score in our data set.
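A quick way to see how these three measures behave is to compute them on a small, made-up data set; the sketch below uses Python's built-in statistics module (illustrative only).

# Minimal sketch: the three measures of central tendency on made-up data.
import statistics

scores = [12, 15, 15, 18, 21, 24, 95]   # 95 is an outlier

print(statistics.mean(scores))    # about 28.57 -> pulled upwards by the outlier
print(statistics.median(scores))  # 18          -> barely affected by the outlier
print(statistics.mode(scores))    # 15          -> the most frequent value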
Distribution
A distribution describes how the values of a variable are spread around the centre. There are two broad types of
distributions:
 Continuous distributions
 Discrete distributions
Examples include the Normal, Poisson, Bernoulli and Chi-square distributions.
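As a small illustration of the two families, the sketch below draws samples from one continuous distribution (Normal) and one discrete distribution (Poisson) using NumPy; the parameter values are chosen arbitrarily.

# Illustrative sketch: sampling from one continuous and one discrete distribution.
import numpy as np

rng = np.random.default_rng(seed=42)

heights = rng.normal(loc=170, scale=8, size=1000)   # continuous: Normal(mean 170, sd 8)
accidents = rng.poisson(lam=3, size=1000)           # discrete: Poisson(mean 3)

print(heights.mean(), heights.std())   # close to 170 and 8
print(np.bincount(accidents)[:6])      # how often 0, 1, 2, 3, 4, 5 events occurred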
Correlation and Regression
CORRELATION
It refers to a broad class of relationships in statistics that involve dependence. Dependence in statistics means a
relationship between two sets of data or two random variables.

Some familiar examples of dependent processes are the correlation between the physical statures of offspring
and their parents, and the correlation between the price of a product and its demand.

REGRESSION
Regression in statistics (in the sense of regression toward the mean) is the phenomenon whereby a variable that
is extreme on its first measurement will tend to be closer to the average on its second measurement, and a
variable that is extreme on its second measurement will tend to have been closer to the average on its first.
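To make the correlation idea concrete (and to show regression in its more common analytics sense of fitting a line, rather than the regression-toward-the-mean phenomenon described above), here is a minimal NumPy sketch; the price and demand figures are invented.

# Minimal sketch: Pearson correlation and a least-squares regression line.
import numpy as np

price  = np.array([10, 12, 14, 16, 18, 20], dtype=float)
demand = np.array([95, 88, 80, 74, 65, 58], dtype=float)

r = np.corrcoef(price, demand)[0, 1]                  # Pearson correlation coefficient
slope, intercept = np.polyfit(price, demand, deg=1)   # demand ~ slope*price + intercept

print(f"correlation r = {r:.3f}")                     # strongly negative, close to -1
print(f"demand ~ {slope:.2f} * price + {intercept:.2f}")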

Sampling
In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample)
of individuals from within a statistical population to estimate characteristics of the whole population. Two
advantages of sampling are that the cost is lower and data collection is faster than measuring the entire
population.
Sampling enables the selection of right data points from within the larger data set to estimate the characteristics
of the whole population. For example, there are about 600 million tweets produced every day. It is not necessary
to look at all of them to determine the topics that are discussed during the day, nor is it necessary to look at all the
tweets to determine the sentiment on each of the topics. A theoretical formulation for sampling Twitter data has
been developed.
In manufacturing, different types of sensory data such as acoustics, vibration, pressure, current, voltage and
controller data are available at short time intervals. To predict downtime it may not be necessary to look at all
the data; a sample may be sufficient.
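As a minimal sketch of simple random sampling, the snippet below estimates the mean of a large synthetic "population" of sentiment scores from a sample of 1,000 values using Python's random module; the data is made up for illustration.

# Sketch: estimate a population characteristic from a simple random sample.
import random

random.seed(7)
population = [random.gauss(0.2, 1.0) for _ in range(100_000)]   # synthetic sentiment scores

sample = random.sample(population, k=1_000)   # simple random sample without replacement

pop_mean = sum(population) / len(population)
sample_mean = sum(sample) / len(sample)
print(f"population mean = {pop_mean:.3f}, sample mean = {sample_mean:.3f}")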

SECURITY

Security, in information technology (IT), is the defense of digital information and IT assets against internal and
external, malicious and accidental threats. This defense includes detection, prevention and response to threats
through the use of security policies, software tools and IT services.

 Physical security
 Information security
Physical security
Physical security is the protection of personnel, hardware, software, networks and data from physical actions,
intrusions and other events that could damage an organization. Physical security for enterprises often includes
employee access control to the office buildings as well as specific locations, such as data centers. An example of a
common physical security threat is an attacker gaining entry to an organization and using a USB storage drive to
either copy and remove sensitive data or physically deliver malware directly to systems.

Information security
Information security, also called InfoSec, encompasses a broad set of strategies for managing the processes, tools
and policies that aim to prevent, detect and respond to threats to both digital and non-digital information assets.
InfoSec includes several specialized categories, including:

APPLICATION SECURITY - the protection of applications from threats that seek to manipulate applications in order
to access, steal, modify or delete data. These protections use software, hardware and policies, and are sometimes
called countermeasures. Common countermeasures include application firewalls, encryption programs, patch
management and biometric authentication systems.

CLOUD SECURITY - the set of policies and technologies designed to protect data and infrastructure involved in a
cloud computing environment. The top concerns that cloud security looks to address are identity and access
management, and data privacy.

ENDPOINT SECURITY - the part of network security that requires network device nodes to meet certain security
standards before they can connect to a secure network. Node devices include PCs, laptops, smartphones
and tablets. Endpoint security also extends to equipment like point-of-sale (POS) terminals, bar code readers and
IoT devices.

INTERNET SECURITY - the protection of software applications, web browsers and virtual private networks (VPNs)
that use the internet. Using techniques such as encryption, internet security aims to defend the transfer of data
from attacks like malware and phishing, as well as denial-of-service (DoS) attacks.

MOBILE SECURITY - the protection of portable devices, such as smartphones, tablets and laptops. Mobile security,
also known as wireless security, secures the devices and the networks they connect to in order to prevent theft,
data leakage and malware attacks.

NETWORK SECURITY - the protection of a network infrastructure and the devices connected to it through
technologies, policies and practices. Network security defends against threats such as unauthorized access,
malicious use and modification.
IT PROJECT MANAGEMENT MODELS
Some common models used for project management in the IT industry:

 Agile
 Waterfall
 Scrum
 Joint application development
Agile
Projects that require extreme flexibility and speed are best suited to the agile project management method.
With this method, the project manager breaks milestones down into “sprints”, or short delivery cycles.

Commonly used for in-house teams, agile project management was created for projects where there is no need
for extensive control over the deliverables. If you’re working with a team that is self-motivated and communicates
in real time, this type of project management works well because team members can rapidly adjust things as
needed, throughout each task.
Waterfall
The waterfall method builds upon the framework of the traditional method.

With the waterfall approach, it is assumed that team members are reliant upon the completion of other tasks
before their own tasks can be completed. Tasks must therefore be accomplished in sequence and it is vital that
team members correspond with one another. Everyone contributes to the overarching goals of the project and,
as they complete their tasks, they enable other team members to complete theirs, which opens up the
opportunity to begin larger tasks.

With waterfall project management, team size will often grow as the project develops and larger tasks become a
possibility. As these opportunities open up, new team members are assigned to those tasks. Project timelines and
dependencies need to be tracked extensively.
Scrum
Scrum is a derivative of agile project management.
As an iterative project management style, scrum features various “sessions” sometimes defined as “sprints” which
generally last for 30 days. These sprints are used to prioritize various project tasks and ensure they are completed
within this time.
Rather than a project manager, a Scrum Master facilitates the process and assembles small teams that
have oversight of specific tasks.
The teams should communicate with the Scrum Master to discuss task progress and results. These meetings with
the Scrum Master are ideal times to reprioritize any backlogged tasks or discuss tasks that have yet to be pooled
into the project.
Joint Application Development
This method allows the project to have a joint development process by involving the client from the very early
stages.

The team and client are to have meetings or sessions where both can collaborate freely. This allows the client to
contribute ideas to the project and also give feedback on how things are progressing.

Joint application development relies on the client contributing and holding sessions with team members
throughout the entire lifecycle of the project.

The TCP/IP model


The TCP/IP network model is a four-layer reference model. All protocols that belong to the TCP/IP protocol suite
are located in the top three layers of this model.

APPLICATION
Defines TCP/IP application protocols and how host programs interface with transport layer services to use the network.
Protocol examples include HTTP, Telnet, FTP, TFTP, SNMP, DNS, SMTP.

TRANSPORT
Provides communication session management between host computers. Defines the level of service and status of
the connection used when transporting data. Protocol examples include TCP, UDP, and RTP.
INTERNET
Packages data into IP datagrams, which contain source and destination address information that is used to forward
the datagrams between hosts and across networks. Performs routing of IP datagrams. Protocol examples include
IP, ICMP, ARP, RARP.

NETWORK INTERFACE
Specifies details of how data is physically sent through the network, including how bits are electrically signaled by
hardware devices that interface directly with a network medium, such as coaxial cable, optical fiber, or twisted-pair
copper wire. Protocol examples include Ethernet, Token Ring, FDDI, X.25, Frame Relay, RS-232, v.35.
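As a small illustration of the layering, the sketch below sends an application-layer HTTP request over a TCP (transport layer) connection that rides on IP (internet layer); the network interface layer is handled by the operating system. It assumes Python 3 and internet access to example.com.

# Minimal sketch: application data (HTTP) carried over TCP/IP using Python sockets.
import socket

with socket.create_connection(("example.com", 80), timeout=5) as sock:   # TCP over IP
    request = b"HEAD / HTTP/1.1\r\nHost: example.com\r\nConnection: close\r\n\r\n"
    sock.sendall(request)        # application-layer bytes handed to the transport layer
    reply = sock.recv(1024)      # first bytes of the HTTP response

print(reply.decode(errors="replace").splitlines()[0])   # e.g. "HTTP/1.1 200 OK"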

TYPES OF ANALYTICS
At the core of any data refinement process sits what is commonly referred to as “analytics”. But different people
use the word “analytics” to imply different things.
 Descriptive
The main focus of descriptive analytics is to summarize what happened in an organization. Descriptive
analytics examines the raw data or content (analysis that is often performed manually) to answer questions such as:
What happened?
What is happening?
Descriptive analytics is characterized by conventional business intelligence and visualizations such as
bar charts, pie charts, line graphs, or generated narratives.

 Diagnostic
As the name suggests, diagnostic analytics is used to unearth or to determine why something
happened. For example, if you’re conducting a social media marketing campaign, you may be
interested in assessing the number of likes, reviews, mentions, followers or fans. Diagnostic analytics
can help you distill thousands of mentions into a single view so that you can make progress with your
campaign.

 Prescriptive
While most data analytics provides general insights on the subject, prescriptive analytics gives you
a “laser-like” focus to answer precise questions. For instance, in the healthcare industry, you can use
prescriptive analytics to manage the patient population by measuring the number of patients who are
clinically obese.

 Exploratory
Exploratory analytics is an analytical approach that primarily focuses on identifying general patterns in
the raw data to identify outliers and features that might not have been anticipated using other analytical
types. For you to use this approach, you have to understand where the outliers are occurring and how
other environmental variables are related to making informed decisions.

 Predictive
Predictive analytics is the use of data, machine learning techniques, and statistical algorithms to
determine the likelihood of future results based on historical data. The primary goal of predictive
analytics is to help you go beyond just what has happened and provide the best possible assessment
of what is likely to happen in the future (a short illustrative sketch appears after this list).
Predictive analytics can be used in banking systems to detect fraud cases, measure levels of credit
risk, and maximize cross-sell and up-sell opportunities in an organization. This helps retain
valuable clients for your business.

 Mechanistic
As the name suggests, mechanistic analytics allows big data scientists to understand the exact
alterations in procedures or variables that result in changes to other variables. The results of
mechanistic analytics are determined by equations in engineering and the physical sciences; they
also allow data scientists to determine the parameters if they know the equation.
 Causal
Causal analytics allows big data scientists to figure out what is likely to happen if one
component of a variable is changed. When you use this approach, you should rely on
randomized studies to determine what is likely to happen next, even though you can also use
non-randomized studies to infer causation. This approach to analytics is appropriate if you are
dealing with large volumes of data.

 Inferential
This approach to analytics takes different theories about the world into account to
determine certain aspects of a large population. When you use inferential analytics, you
take a smaller sample of information from the population and use it as a basis to infer
parameters about the larger population.
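Here is the short sketch promised under "Predictive" above: a logistic-regression model scored on synthetic, fraud-style transaction data using scikit-learn. The data and feature names are invented for illustration; this is only a sketch, not a production fraud model.

# Illustrative predictive-analytics sketch: score the likelihood of a binary
# outcome ("fraud") from historical data with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000
amount = rng.exponential(scale=100, size=n)   # transaction amount (synthetic)
hour = rng.integers(0, 24, size=n)            # hour of day (synthetic)

# synthetic ground truth: large late-night transactions are more often "fraud"
logit = 0.01 * amount + 1.5 * (hour >= 22) - 3.0
fraud = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([amount, hour])
X_train, X_test, y_train, y_test = train_test_split(X, fraud, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
print("fraud probability of a 400-unit, 23:00 transaction:",
      model.predict_proba([[400, 23]])[0, 1])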
Cluster Analysis

It is a class of techniques that are used to classify objects or cases into relative groups called
clusters. Cluster analysis is also called classification analysis or numerical taxonomy. In cluster
analysis, there is no prior information about the group or cluster membership for any of the objects.

Let us talk in layman's terms to get a vivid idea of the use case for cluster analysis.

Suppose you are a Product Manager about to launch a new product. You cannot target everyone
out there because you have constraints, so you will target customers based on specific
parameters. By performing cluster analysis on the data, you will be able to decide your target
customers using the parameters you chose. We hope this is clear to some extent. Let us look at
different types of cluster analysis using SPSS software (it can be done using other tools as well).

In this document, we will be talking about two types of clustering.

1) Hierarchical Clustering

2) K-Means Clustering

Above clustering types are explained using dataset in SPSS (Statistical Package for the Social Sciences)
software.
Hierarchical clustering
Hierarchical clustering is one of the few types of clustering algorithms in SPSS in which similar objects
are grouped together. The name hierarchical clustering comes from the fact that the clustering
algorithm develops results in a hierarchical or a treelike manner.
There are two main methods in Hierarchical clustering: Agglomerative and Divisive. SPSS supports
agglomerative methods.
Agglomerative clustering starts with each object being in a separate cluster. So the number of clusters
in the initial stage is same as the number of objects. SPSS calculates the proximity among the different
objects and combines the closest objects in stages. The agglomeration happens in several stages until
the required number of clusters is formed.
Agglomerative methods are further classified by how the distance between clusters is measured, with options in SPSS such as between-groups linkage, within-groups linkage, nearest neighbour, furthest neighbour, centroid clustering, median clustering and Ward's method.

For performing the cluster analysis, go to Analyze > Classify > Hierarchical Cluster. Then select all the
variables in the dataset to be clustered so that they appear on the right side as shown below. The case
shown in this document is the Pampers case
(https://drive.google.com/file/d/1K1Bp60wENAqAmd7mpwyrdwcV1tRY_nYp/view?usp=sharing).
The method of clustering can be selected in SPSS by clicking on the Method option and then selecting the
required cluster method:

For calculating the proximity between the objects there are several measures which could also be
selected in the Method option.
The different methods and distance measures would give different clustering results, therefore it is
important to select them depending on the requirements.

Go to Statistics Tab and check the Agglomeration schedule and the proximity matrix options. The
interpretation of these two options would be explained later.

Next go to the Plots tab and check the Dendrogram and select to have Icicles generated for all the
clusters. The interpretation of these would be explained later on.
Next click on Ok to run the Hierarchical clustering Analysis.

Interpretation of the results

Proximity Matrix

The proximity matrix appears as a table with n rows and n columns. This is a matrix showing the distance
between the different variables or data points according to the interval measure selected in the
Method tab. The distance is calculated in an m-dimensional space, where m is the number of variables
selected for clustering. The distances shown in the proximity matrix form the basis on which
the grouping of objects takes place.
Agglomeration Schedule

The agglomeration schedule table shows the different stages by which clustering happens and the
objects involved in those stages. In the below example, in stage 1, the objects 236 and 276 are grouped
together into a cluster based on the distance between them, which is shown in the 'Coefficients'
column as a very low number close to zero and therefore rounded down to zero. In the same example,
the agglomeration takes place first for the objects closest to each other and subsequently for objects
at larger distances. The last column of the agglomeration schedule, 'Next Stage', shows the stage at
which one of the objects in the current stage reappears for clustering in the agglomeration schedule
table. For example, in stage 1, the object 236 reappears again in the 96th stage.
The agglomeration schedule also roughly indicates the optimum number of clusters for a dataset. As
the proximity increases in subsequent stages, at certain stages there is a huge spike in proximity
distance compared to the previous stage. This shows that a separate cluster needs to be
accommodated at this stage, i.e. if there are n such spikes, then n+1 is an optimum number of
clusters for the dataset.

Elbow Diagram

The Elbow diagram is also roughly used to find the optimum number of clusters for a dataset. The
below example shows a graph of the agglomeration schedule for a dataset with proximity distance
measured along the y-axis and the different stages along the x-axis. It can be noticed that at the 41st stage,
there is a huge spike in the proximity distance between clusters. This point is known as an elbow.
When read from right to left, this means that the elbow occurs at the 5th stage (46-41) and therefore
5 clusters is an appropriate number of clusters for the below example.

Note: There is no tab in SPSS for this method; we have added it just for your information.
Cluster Membership

The cluster membership table shows the cluster to which each object belongs.
Icicle Plot

The icicle plot shows the number of clusters remaining at different stages of clustering. The icicle plot
should be interpreted with respect to the objects in the x-axis and the number of clusters in the y-axis.
In the below example, for the objects 269 and 267, the height of the icicle plot between them stands
at approximately 133 clusters on the y-axis. This means that there are 133 clusters remaining
in a data set of 300 objects. This can also be checked in the agglomeration schedule table, as the objects
269 and 267 appear at its 167th stage, denoting that there are 133 (300-167) clusters left to be
agglomerated.

Dendrogram

The dendrogram is a treelike hierarchical structure which shows the distances at which the clusters
containing the objects are combined. The dendrogram is read from left to right: the x-axis shows the
distance between clusters and the y-axis shows the different objects. The dendrogram is useful in
selecting the appropriate number of clusters, which can be done by noticing the places where cluster
formation happens at large distances. If there are n sudden jumps in the distances between clusters
during cluster formation, or clusters get formed at disproportionate distances compared to previous
stages, then n+1 is said to be an appropriate number of clusters.
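For readers who prefer code to the SPSS dialogs, the same agglomerative workflow (proximity matrix, agglomeration schedule, dendrogram, cluster membership) can be sketched in Python with SciPy. The data below is synthetic; the Pampers dataset is not reproduced here.

# Minimal SciPy sketch of the hierarchical (agglomerative) clustering workflow.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (20, 3)),     # three loose groups of 20 objects each
               rng.normal(5, 1, (20, 3)),
               rng.normal(10, 1, (20, 3))])

proximity = squareform(pdist(X, metric="euclidean"))   # analogue of the proximity matrix
Z = linkage(X, method="ward")                          # analogue of the agglomeration schedule
labels = fcluster(Z, t=3, criterion="maxclust")        # cluster membership for 3 clusters

print(proximity.shape)          # (60, 60) distance matrix
print(Z[:5])                    # first five agglomeration stages
print(np.bincount(labels)[1:])  # number of objects in each cluster
# dendrogram(Z) would draw the tree with matplotlib, mirroring the SPSS dendrogram.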
K-Means Clustering

K-Means Clustering is an unsupervised machine learning algorithm. What is unsupervised learning?

In unsupervised learning, we do not have a y variable, i.e. a target variable (you can relate this to a
regression model, where we do have a target variable).
We hope this gives some idea of what unsupervised machine learning is.

When to use K-Means Clustering?

1) When we want to divide a dataset into homogeneous groups, which we call clusters.
2) In K-Means, you must specify the number of clusters you want at the beginning itself, unlike in
hierarchical clustering. (K represents the number of clusters.)

To make it easy, we shall do a practical exercise using a dataset.

Please find the link for the dataset.


The dataset contains data about the average working hours, a measure of average prices and a measure of
average salaries of different cities.

Requirement: We want to divide these data points into two homogeneous groups based on the
variables.
We have three variables; we will not be taking city into consideration because it is just an identifier.

We hope by now the dataset is loaded into SPSS. Then we are good to go ahead with the classification. We
shall start with k-means clustering.

Step 1: Select K-Means Cluster


Now we want to divide these data points into two clusters.

Step 2: Specify the number of clusters you need for the classification. In this case, we decided to classify the
data into two clusters. Hence, K = 2.

Step 3: Then select the variables on which you want to base the clusters and drag them to the right. As
mentioned above, we will not be selecting city.
We have successfully loaded the data. As we can see, there are three buttons on the right (Iterate, Save
and Options).
We will talk about the importance of these options one by one.

Step 4: Iteration
Let's have a look at how the K-means clustering iterative algorithm works, step by step:
1) Define the number k of clusters that you want and randomly generate their respective centre points
within the data domain.
2) Compute the distance between each observation and each centre point, and then classify the
observation into the group whose centre is closest to it.
3) Based on these classified observations, re-compute each group centre by taking the mean of all the
vectors in the group.
4) Steps 2 and 3 are repeated for a set number of iterations or until convergence has been reached.

Source : http://blog.keyrus.co.uk/k_means_clustering_iterative_algorithm.html
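The four steps above can also be written directly in NumPy to make the iteration visible. This is only a sketch of the textbook algorithm, not SPSS's exact implementation, and the toy data (hours, prices, salaries) is invented.

# Direct NumPy sketch of the iterative K-means steps listed above.
import numpy as np

def k_means(X, k=2, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)]    # step 1: random centres
    for _ in range(n_iter):                                   # step 4: repeat
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)                         # step 2: nearest centre
        new_centres = np.array([X[labels == j].mean(axis=0)   # step 3: recompute means
                                for j in range(k)])
        if np.allclose(new_centres, centres):                 # converged early
            break
        centres = new_centres
    return labels, centres

# toy data in the spirit of the SPSS example (hours, prices, salaries are made up)
X = np.array([[1600, 110, 60], [1650, 105, 58], [2300, 65, 28], [2350, 60, 26]], float)
labels, centres = k_means(X, k=2)
print(labels)    # e.g. [0 0 1 1]
print(centres)   # final cluster centres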

From the above, we can interpret that the higher the number of iterations, the more accurate and stable
the final clusters. An ideal number is 10; beyond that the results will most probably be the same. At the
end of the document, we have attached a screenshot showing this.

In SPSS, we will set the number of iterations to 10.


Step 5: Then we look at the Options tab, where we can select Initial Cluster Centres and Cluster
Information for each case.

Step 6: We have the Save tab. This helps in populating the cluster data in Variable View and Data View.
Information such as cluster membership and the distance of each data point from its cluster centre will be
populated.

Once we are done with the interpretation, we shall look at how the Save option helps in analysing the cluster
information.

Step 7: Once the above configuration is done, click OK to see the results.

Interpretation of output results

Step 1: Initial Cluster Centre

Initial Cluster Centers

                               Cluster 1    Cluster 2
Average working hours            1583.00      2375.00
Measure of average prices         115.50        63.80
Measure of average salaries        63.70        27.80
This output tells us the mean of each variable we used in the different clusters. In our case, the average
working hours of cities in cluster 1 is 1583, versus 2375 for the second cluster, and similarly for the other
variables. Note that the above output is only from the initial iteration; these averages change as the
iterations proceed.

Step 2: Iteration History.

Iteration History

Iteration    Change in Cluster Centers
             Cluster 1    Cluster 2
1              218.575      260.036
2               10.821       21.623
3               10.897       18.665
4                5.757        8.425
5                5.513        8.293
6                 .000         .000

This table shows the change in cluster centres from one iteration to the next. Here we can see that the
clusters stabilize after the 5th iteration (the change drops to zero at iteration 6). If you recall, we set the
number of iterations to 10; if we had set it to any number less than 6, there would be a deviation in the result.

Step 3: Cluster Membership

Cluster Membership
Case Number Cluster Distance
1 1 52.073
2 1 40.698
3 2 96.025
4 2 35.058
5 1 56.825
6 2 89.065
7 2 23.067
8 2 140.988
9 1 51.535
10 1 9.591
11 1 72.559
12 1 115.288
13 1 123.868
14 1 105.594
15 2 315.892
16 2 85.010
17 2 114.408
18 2 109.896
19 1 60.359
20 1 43.372
21 1 28.595
22 2 46.451
23 1 23.297
24 1 57.045
25 2 210.687
26 2 117.076
27 1 10.374
28 1 62.940
29 2 103.830
30 2 126.562
31 1 70.329
32 1 186.179
33 2 23.743
34 1 21.290
35 1 51.995
36 1 102.935
37 1 81.330
38 2 20.175
39 1 53.468
40 1 96.966
41 2 90.147
42 2 45.089
43 1 122.728
44 1 123.886
45 1 15.504
46 1 117.399

This output shows the cluster membership. Once the iterations are done, each data point is assigned to
one of the two clusters. These are the final cluster numbers of each data point; for example, case 1 falls
into cluster 1 and case 3 into cluster 2.

Step 4: Final Cluster Centers.

                               Cluster 1    Cluster 2
Average working hours            1764.68      2059.17
Measure of average prices          77.57        58.48
Measure of average salaries        48.97        24.89

The above output shows the means of the variables after the iterations. As you can see, the means have
changed from the initial cluster centres.

Step 5: Distances between Final Cluster Centers

Cluster            1          2
1                         296.087
2            296.087

Once both clusters are formed, there is a distance between their centres, and the above output shows it.
In this case, the distance between the two cluster centres is 296.087.

Step 6: Number of Cases in each Cluster

Cluster 1     28.000
Cluster 2     18.000
Valid         46.000
Missing         .000

We have divided the data points into two clusters. The above output gives an overview of the number of
cases in each cluster: out of the 46 cases, 28 belong to cluster 1 and 18 belong to cluster 2.

Hurray!! You are done with K-Means clustering.

Wait……..

As we promised above, we will now show the importance of the Save option. If you go to Data View
and Variable View, there will be two new variables added: the cluster number to which each data point
belongs, and the distance of the data point from its cluster centre.
Please find the screen snip below showing the same.
One last thing: we want to show the difference in cluster membership based on the number of iterations.
In the below screenshot there are two columns of cluster membership; the first column is the membership
when the number of iterations is 2 and the second column when it is 10. We can see there is a difference
in the cluster membership in some cases. Hence, it is advisable to keep the number of iterations high.
FACTOR ANALYSIS
Factor analysis attempts to identify underlying variables, or factors, that explain the pattern of
correlations within a set of observed variables. The main purpose of factor analysis is to reduce many
individual items into a fewer number of dimensions. It can be used to simplify data, such as reducing
the number of variables in regression models by grouping similar variables into dimensions. This
process is used to identify latent variables or constructs.
Link to the dataset used: Link

Steps:

The factor analysis can be found in Analyze/Dimension Reduction/Factor

In the dialog box of the factor analysis we start by adding our variables to the list of variables.
In the Descriptives… dialog we need to add a few statistics to verify the assumptions made by the factor
analysis. To verify the assumptions, we need the KMO measure of sampling adequacy and Bartlett's test of sphericity.

The dialog box Extraction… allows us to specify the extraction method and the cut-off value for the
extraction. Generally, SPSS can extract as many factors as we have variables. In an exploratory
analysis, the eigenvalue is calculated for each factor extracted and can be used to determine the
number of factors to extract. A cutoff value of 1 is generally used to determine factors based on
eigenvalues.
Next, an appropriate extraction method needs to be selected. Principal components is the default
extraction method in SPSS. It extracts uncorrelated linear combinations of the variables and gives the
first factor the maximum amount of explained variance. All following factors explain smaller and smaller
portions of the variance and are all uncorrelated with each other. This method is appropriate when
the goal is to reduce the data.
The second most common extraction method is principal axis factoring. This method is appropriate
when attempting to identify latent constructs, rather than simply reducing the data. In our research
question, we are interested in the dimensions behind the variables, and therefore we are going to use
principal axis factoring.
The next step is to select a rotation method. After extracting the factors, SPSS can rotate the factors
to better fit the data. The most commonly used method is varimax. Varimax is an orthogonal rotation method
that tends to produce factor loadings that are either very high or very low, making it easier to match each
item with a single factor. If non-orthogonal factors are desired (i.e., factors that can be correlated),
a direct oblimin rotation is appropriate. Here, we choose varimax.
In the Options dialog box we can manage how missing values are treated; it might be appropriate to
replace them with the mean, which does not change the correlation matrix but ensures that we do
not over-penalize missing values. We can also choose to suppress small factor loadings in the output,
which makes the factor loading tables much easier to read. Choose the value below which loadings
are suppressed as per your requirement; we have taken the value to be 0.4.

Now click on OK and you will get the results.
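For reference, a rough Python analogue of this procedure is sketched below using scikit-learn's FactorAnalysis with a varimax rotation (available in scikit-learn 0.24 or later) on synthetic questionnaire-style data. Note that scikit-learn fits factors by maximum likelihood, which is related to but not identical to SPSS's principal axis factoring, so treat this only as a sketch.

# Rough sketch: factor analysis with varimax rotation on synthetic item data.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 300
latent = rng.normal(size=(n, 3))          # three hidden constructs
loadings_true = np.zeros((3, 12))         # 12 observed items, 4 per construct
loadings_true[0, 0:4] = 0.8
loadings_true[1, 4:8] = 0.8
loadings_true[2, 8:12] = 0.8
X = latent @ loadings_true + rng.normal(scale=0.5, size=(n, 12))
X = StandardScaler().fit_transform(X)

# Eigenvalues of the correlation matrix (the "Total" column in SPSS);
# the eigenvalue-greater-than-1 rule suggests how many factors to keep.
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
print("eigenvalues:", np.round(eigenvalues, 2))

fa = FactorAnalysis(n_components=3, rotation="varimax").fit(X)
loadings = fa.components_.T               # rows = items, columns = factors
print(np.round(loadings, 2))              # analogous to the rotated loading matrix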


Interpretation of Output

1. KMO and Bartlett’s Test


KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy           .894
Bartlett's Test of Sphericity    Approx. Chi-Square   2118.207
                                 df                         66
                                 Sig.                     .000

Kaiser-Meyer-Olkin Measure of Sampling Adequacy – This measure varies between 0 and 1, and
values closer to 1 are better. A value of .6 is a suggested minimum.

Bartlett’s Test of Sphericity – This tests the null hypothesis that the correlation matrix is an identity
matrix. An identity matrix is a matrix in which all of the diagonal elements are 1 and all off-diagonal
elements are 0. You want to reject this null hypothesis.

Taken together, these tests provide a minimum standard which should be passed before a factor
analysis (or a principal components analysis) should be conducted.

2. Communalities

Communalities
           Initial    Extraction
Q1_a         1.000          .470
Q1_b         1.000          .708
Q1_c         1.000          .621
Q1_d         1.000          .591
Q1_e         1.000          .643
Q1_f         1.000          .707
Q1_g         1.000          .686
Q1_h         1.000          .718
Q1_i         1.000          .748
Q1_j         1.000          .587
Q1_k         1.000          .735
Q1_l         1.000          .621
Extraction Method: Principal Component Analysis.

Communalities – This is the proportion of each variable’s variance that can be explained by the
factors.
Extraction – The values in this column indicate the proportion of each variable’s variance that can be
explained by the retained factors. Variables with high values are well represented in the common
factor space, while variables with low values are not well represented.

3. Total Explained Variance

Initial Eigenvalues – Eigenvalues are the variances of the factors. Total variance is equal to the
number of variables used in the analysis, in this case, 12.

Total

This column contains the eigenvalues. The first factor will always account for the most variance (and
hence have the highest eigenvalue), the next factor will account for as much of the left-over
variance as it can, and so on. Hence, each successive factor accounts for less and less variance. We
only consider the factors which have high eigenvalues (for example, in this case, the first three
components).

Extraction Sums of Squared Loadings

The number of rows in this panel of the table correspond to the number of factors retained. In this
example, we requested that three factors be retained, so there are three rows, one for each retained
factor. The values in this panel of the table are calculated in the same way as the values in the left
panel, except that here the values are based on the common variance. The values in this panel of the
table will always be lower than the values in the left panel of the table, because they are based on the
common variance, which is always smaller than the total variance.

Rotation Sums of Squared Loadings – The values in this panel of the table represent the distribution
of the variance after the varimax rotation. Varimax rotation tries to maximize the variance of each of
the factors, so the total amount of variance accounted for is redistributed over the three extracted
factors.
4. Scree Plot: The scree plot graphs the eigenvalue against the component number. Consistent with the Total
Explained Variance table, the line is almost flat after the third component, meaning that each successive
factor accounts for smaller and smaller amounts of the total variance.

5. Component Matrix

Component
1 2 3
Q1_a .640 -.012 .244
Q1_b .334 .494 .594
Q1_c .723 .314 -.019
Q1_d .749 .171 -.035
Q1_e .342 .657 -.308
Q1_f .613 .463 -.343
Q1_g .776 -.264 .115
Q1_h .815 -.223 .068
Q1_i .727 -.446 -.143
Q1_j .729 -.011 -.237
Q1_k .739 -.368 -.230
Q1_l .615 -.046 .491

The value of each variable under the three components represents how much that variable
contributes to the respective component. The higher the absolute value, the more the variable
contributes to that component.
6. Rotated Component Matrix

Rotated Component Matrix

Component
1       2       3
Q1_i .862
Q1_k .840
Q1_h .774
Q1_g .760
Q1_j .626 .435
Q1_d .509 .482
Q1_a .486 .456
Q1_e .796
Q1_f .789
Q1_c .402 .571
Q1_b .802
Q1_l .450 .647

The idea of rotation is to reduce the number of variables on which the component or factor under
investigation has high loadings. Rotation does not actually change anything but makes the
interpretation of the analysis easier. In the table above we can see that Q1_i has a high loading on
only one component, which makes it easy to assign to a single factor.

----------------------------------------------------------The End---------------------------------------------------------.
