


“There will be more data created in the next five years than in the history of the planet Earth.
The ability to access that data, the ability to understand it and deliver it to the right people at the right time so that they can make the right decision, is going to be critical to every enterprise on the planet.”

How analytics can help
Business Challenge
• Internal Transactional Data
• Primary/Secondary Market Data
• Perception Data
Measure as-is
• Reporting • Performance Analyses • Trend Analyses
• Segmentation • Benchmarking and Gap Analyses
• Trigger Analyses • Key Driver Analyses
• Modeling • Optimization
Data Driven Business Decision

Business Intelligence and
• Optimization: What’s the best that can happen?
• Predictive: What will happen next?
• Forecasting: What if these trends continue?
• Statistical: Why is this happening?
• Alerts: What actions are needed?

Access and Reporting
• Query / Drill: Where exactly is the problem?
• Ad hoc reports: How many, how often, where?
• Standard: What happened?

(Chart axes: Competitive Advantage, Degree of Intelligence)
Data Rich - Information Poor

“The typical Fortune 1000 company has a big problem: it collects a lot of important business data that never gets used.
Users who could take data and turn it into competitive advantage can’t get the level of access they need.”

--Forrester Research
1. Nominal: identify and classify
2. Ordinal: order the data
3. Interval: rate the data
4. Ratio: set a reference (a true zero point)

Scales 1 and 2 are non-metric data.
Scales 3 and 4 are metric data.
• Data reporting.
• Data Cleaning:
  • Consistency Checks: identify data that are out of range, logically inconsistent, or have extreme values. Data not defined by CODING are
  • Missing Responses: values of variables that are unknown, because ambiguous answers were provided or no answer was given.
  • Treatment: substitute a neutral value, imputed
• Data divided into TRAINING DATA, TESTING
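
A minimal sketch of a consistency check and a neutral-value substitution, in plain Python. The 1-7 rating scale, the sample responses and the function names are assumptions made for illustration, and using the mean of the valid answers as the neutral value is only one possible treatment.

```python
from statistics import mean

# Hypothetical responses on a 1-7 rating scale; None marks a missing response.
LOW, HIGH = 1, 7
responses = [5, 7, None, 3, 9, 4, None, 6]


def out_of_range(values, low, high):
    """Consistency check: positions of recorded values outside the coded range."""
    return [i for i, v in enumerate(values)
            if v is not None and not low <= v <= high]


def substitute_neutral(values, low, high):
    """Replace missing and out-of-range values with the mean of the valid ones."""
    valid = [v for v in values if v is not None and low <= v <= high]
    neutral = mean(valid)
    return [v if v is not None and low <= v <= high else neutral
            for v in values]


flagged = out_of_range(responses, LOW, HIGH)        # the value 9 is flagged
cleaned = substitute_neutral(responses, LOW, HIGH)  # gaps filled with the mean
print(flagged)
print(cleaned)
```

Treating an out-of-range value like a missing one is a simplification; in practice it may be better to go back to the questionnaire first.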

Stages in Data Analysis

• Coding, Data Entry, Error Verification
• Descriptive Analysis, Univariate Analysis, Bivariate Analysis, Multivariate Analysis

A Classification of Multivariate Methods

All Multivariate Methods
Are some of the variables dependent on others?
• Yes: Dependence Methods
• No: Independence Methods
Multivariate Analysis: Classification of Dependence Methods

Dependence Methods
How many variables are dependent?
• One Dependent Variable
  • Metric (the scales are ratio or interval): Multiple Regression
  • Nonmetric (the scales are nominal or ordinal): Multiple Discriminant Analysis
• Several Dependent Variables
  • Metric (the scales are ratio or interval): Multivariate Analysis of Variance (MANOVA)
  • Nonmetric (the scales are nominal or ordinal): Conjoint Analysis
• Multiple independent and dependent variables: Canonical Analysis
Multivariate Analysis: Classification of Independence Methods

Independence Methods
Are inputs metric?
• Metric (the scales are ratio or interval):
  • Factor Analysis
  • Cluster Analysis
  • Metric Multidimensional Scaling
• Nonmetric (the scales are nominal or ordinal):
  • Nonmetric Multidimensional Scaling
Correlation Analysis
• Correlation analysis is a statistical technique used to measure the magnitude of the linear relationship between two variables.
• Correlation analysis cannot be used in isolation to describe the relationship between variables.
• It can be used along with regression analysis to determine the nature of the relationship between two variables.
• Two prominent types of correlation coefficient are:
  • Pearson product-moment correlation coefficient
  • Spearman’s rank correlation coefficient

• How strongly are sales related to advertising?
• Is there any association between market share and the size of the sales force?
• Are consumers’ perceptions of quality related to their perceptions of prices?
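
A minimal sketch of the two coefficients named above, in plain Python. The advertising and sales figures are made-up illustration data and the function names are assumptions. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values sharing an average rank.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical figures, for illustration only.
advertising = [10, 20, 30, 40, 50]
sales = [12, 25, 31, 48, 70]

r_pearson = pearson(advertising, sales)
r_spearman = spearman(advertising, sales)
print(round(r_pearson, 3), round(r_spearman, 3))
```

Because sales rise with every increase in advertising in this sample, the rank correlation is perfect while the Pearson value stays slightly below 1, reflecting that the relationship is monotonic but not exactly linear.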
Regression Analysis

• Regression analysis is used to determine the nature and closeness (strength) of relationships between two or more variables.
• It evaluates the effect of one variable on another variable; a fitted regression alone does not prove causation.
• It is used to explain the variability in the dependent (or criterion) variable based on information about one or more independent (or predictor) variables.
• Two variables: Simple Linear Regression Analysis
• More than two variables: Multiple Regression Analysis

• Can the variation in market share be accounted for by the size of the sales force?
• Are consumers’ perceptions of quality determined by their perceptions of price?
• What level of sales can be expected, given the levels of advertising expenditures, prices, and level of the
Linear Regression Analysis
• Linear regression: Y = α + βX
• Where Y: dependent variable
•       X: independent variable
• α and β: two constants called regression coefficients
• β: slope coefficient, i.e. the change in the value of Y for a one-unit change in X
• α: Y intercept, i.e. the value of Y when X = 0
• R²: the strength of association, i.e. the degree to which the variation in Y can be explained by X
• If R² = 0.10, then only 10% of the total variation in Y can be explained by the variation in X
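
A minimal sketch of fitting Y = α + βX by ordinary least squares and computing R², in plain Python. The data points and the function name are assumptions made for illustration.

```python
def fit_line(x, y):
    """Return (alpha, beta, r_squared) for the least-squares line Y = alpha + beta*X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx                 # slope: change in Y per unit change in X
    alpha = mean_y - beta * mean_x   # intercept: value of Y when X = 0
    ss_total = sum((b - mean_y) ** 2 for b in y)
    ss_error = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    r_squared = 1 - ss_error / ss_total  # share of variation in Y explained by X
    return alpha, beta, r_squared


# Hypothetical observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

alpha, beta, r_squared = fit_line(x, y)
print(round(alpha, 2), round(beta, 2), round(r_squared, 4))
```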
Test of significance of Regression Equation

• Linear regression: Y = α + βX
• The F test is used to test the significance of the linear relationship between two variables Y and X
• H0: β = 0 (There is no linear relationship between Y and X)
• H1: β ≠ 0 (There is a linear relationship between Y and X)
• Objective: To check whether the estimates from the regression model represent the real-world data.
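
A minimal sketch of the F statistic for a simple linear regression, F = (SS_regression / 1) / (SS_error / (n - 2)), in plain Python. The data and the function name are assumptions; the critical value 10.13 is the tabulated F(1, 3) value at the 0.05 level for this five-point sample.

```python
def f_statistic(x, y):
    """F = (SS_regression / 1) / (SS_error / (n - 2)) for Y = alpha + beta*X."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    beta = sxy / sxx
    alpha = mean_y - beta * mean_x
    ss_regression = sum((alpha + beta * a - mean_y) ** 2 for a in x)
    ss_error = sum((b - (alpha + beta * a)) ** 2 for a, b in zip(x, y))
    return ss_regression / (ss_error / (n - 2))


# Hypothetical observations, for illustration only.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

F_CRITICAL = 10.13  # tabulated F(1, 3) at the 0.05 significance level
f_value = f_statistic(x, y)
reject_h0 = f_value > F_CRITICAL  # True means the linear relationship is significant
print(round(f_value, 1), reject_h0)
```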

Other regression methods
• Forward regression
• Stepwise regression
• Backward regression
Logit Analysis (Logistic Regression)
• The dependent variable is binary and the several independent variables are metric.
• Estimates the probability of a binary event taking place, by logistic regression.
• The estimated probability follows an S-curve.
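
A minimal sketch of the S-curve, in plain Python: the logistic function turns a linear score into a probability between 0 and 1. The coefficients are hypothetical values chosen for illustration, not estimates from data, and a single predictor is assumed.

```python
from math import exp


def logistic_probability(x, intercept, slope):
    """Probability of the event for predictor value x under a one-predictor logit model."""
    return 1.0 / (1.0 + exp(-(intercept + slope * x)))


# Hypothetical coefficients, for illustration only.
INTERCEPT, SLOPE = -4.0, 0.8

# Tabulate the curve: probabilities start near 0, cross 0.5 at x = 5, approach 1.
curve = [logistic_probability(x, INTERCEPT, SLOPE) for x in range(0, 11)]
for x, p in enumerate(curve):
    print(x, round(p, 3))
```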

Discriminant Analysis
• Discriminant analysis aims at studying the effect of two or more predictor variables (independent variables) on a certain evaluation criterion.
• The evaluation criterion may be two or more groups:
  • Two groups, such as good or bad, like or dislike, successful or unsuccessful, above expected level or below expected level
  • Three groups, such as good, normal or poor
• To check whether the predictor variables discriminate among the groups.
• To identify the predictor variable which is more important when compared to the other predictor variable(s).
• Such analysis is called discriminant analysis.
Discriminant Analysis
• Designing a discriminant function: Y = aX1 + bX2
  where Y is a linear composite representing the discriminant function, and X1 and X2 are the predictor variables (independent variables) that affect the evaluation criterion of the problem of interest.
• Finding the discriminant ratio (K) and determining the variables which account for intergroup difference in terms of group means.
  This ratio is the maximum possible ratio between the ‘variability between groups’ and the ‘variability within groups’.
• Finding the critical value which can be used to assign a new data set (i.e. a new combination of instances for the predictor variables) to its appropriate group.
• Testing H0: The group means are equal in importance
  against H1: The group means are not equal in importance
  using the F test at a given significance level α.

• What psychographic characteristics help differentiate between price-sensitive and non-price-sensitive buyers of groceries?
• In terms of demographic characteristics, how do customers who exhibit store loyalty differ from those who do not?
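
A minimal sketch of a two-group discriminant function Y = aX1 + bX2, in plain Python. It uses Fisher's linear discriminant, which picks a and b to maximize the ratio of between-group to within-group variability. The data, the group labels and the function names are assumptions, and the midpoint cut-off assumes equally sized groups.

```python
def mean_vector(rows):
    """Mean of X1 and X2 over a list of (x1, x2) observations."""
    n = len(rows)
    return sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n


def scatter(rows, mean):
    """Within-group sums of squares and cross-products (s11, s12, s22)."""
    s11 = sum((r[0] - mean[0]) ** 2 for r in rows)
    s12 = sum((r[0] - mean[0]) * (r[1] - mean[1]) for r in rows)
    s22 = sum((r[1] - mean[1]) ** 2 for r in rows)
    return s11, s12, s22


def fit_discriminant(group_a, group_b):
    """Return (a, b, cutoff) for the discriminant function Y = a*X1 + b*X2."""
    mean_a, mean_b = mean_vector(group_a), mean_vector(group_b)
    a11, a12, a22 = scatter(group_a, mean_a)
    b11, b12, b22 = scatter(group_b, mean_b)
    w11, w12, w22 = a11 + b11, a12 + b12, a22 + b22  # pooled within-group scatter
    det = w11 * w22 - w12 * w12
    d1, d2 = mean_a[0] - mean_b[0], mean_a[1] - mean_b[1]
    a = (w22 * d1 - w12 * d2) / det   # weights = inverse(W) * (mean_a - mean_b)
    b = (-w12 * d1 + w11 * d2) / det
    score_a = a * mean_a[0] + b * mean_a[1]
    score_b = a * mean_b[0] + b * mean_b[1]
    return a, b, (score_a + score_b) / 2  # cut-off: midpoint of the group scores


def classify(x1, x2, a, b, cutoff):
    """Assign a new observation to group 'A' or 'B' by its discriminant score."""
    return "A" if a * x1 + b * x2 > cutoff else "B"


# Hypothetical ratings (X1, X2) for two groups of customers.
loyal = [(6, 7), (7, 8), (8, 7), (7, 6)]
not_loyal = [(2, 3), (3, 4), (4, 3), (3, 2)]

a, b, cutoff = fit_discriminant(loyal, not_loyal)
print(a, b, cutoff, classify(6, 6, a, b, cutoff), classify(2, 4, a, b, cutoff))
```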
Conjoint Analysis
• A technique that attempts to determine the relative importance consumers attach to salient attributes and the utilities they attach to the levels of attributes.
• TERMS: part-worth functions, relative importance weights, attribute levels.
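
A minimal sketch of how the three terms fit together, in plain Python. The part-worths below are hypothetical numbers, not estimated from a study. The relative importance of an attribute is taken as the range of its part-worths divided by the sum of the ranges over all attributes, which is a common convention.

```python
# Hypothetical part-worths (utilities) for each level of each attribute.
part_worths = {
    "price": {"low": 0.8, "medium": 0.1, "high": -0.9},
    "brand": {"A": 0.3, "B": -0.3},
    "warranty": {"1 year": -0.2, "3 years": 0.2},
}


def relative_importance(worths):
    """Range of part-worths per attribute, normalized so the weights sum to 1."""
    ranges = {attr: max(levels.values()) - min(levels.values())
              for attr, levels in worths.items()}
    total = sum(ranges.values())
    return {attr: r / total for attr, r in ranges.items()}


def total_utility(worths, profile):
    """Utility of a product profile: sum of the part-worths of its levels."""
    return sum(worths[attr][level] for attr, level in profile.items())


importance = relative_importance(part_worths)
best_profile = {"price": "low", "brand": "A", "warranty": "3 years"}
print({attr: round(w, 3) for attr, w in importance.items()})
print(round(total_utility(part_worths, best_profile), 2))
```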

Factor Analysis
• Factor analysis can be defined as a ‘set of methods in which the observable or manifest responses of individuals on a set of variables are represented as functions of a small number of latent variables called factors’.
• Factor analysis helps the researcher to reduce the number of variables to be analyzed, thereby making the analysis easier.
• Analysis based on a wide range of variables can be tedious and time consuming.
• Using factor analysis, the researcher can reduce the large number of variables into a few dimensions called factors that summarize the available data.
• It aims at grouping the original input variables into the factors that underlie them.
• For example, age, gender and marital status can be combined under a factor called demographic characteristics. Income level, education and employment status can be combined under a factor called socio-economic status. Credit card and family background can be combined under a factor called background status.

Benefits of Factor Analysis
• It identifies hidden dimensions or constructs which may not be apparent from direct analysis.
• It identifies relationships between variables.
• It helps in data reduction.
• It helps the researcher to cluster the product and population being analyzed.
Procedure followed for Factor Analysis
• Define the problem
• Construct the correlation matrix, which measures the relationships among the variables
• Select an appropriate factor analysis method
• Determine the number of factors
• Rotate the factors
• Interpret the factors
• Determine the factor scores
• Used in market segmentation for identifying the underlying variables on which to group the customers.
• In product research, employed to determine the brand attributes that influence consumer choices.
• In pricing studies, used to identify the characteristics of price-sensitive consumers.
• In advertising, used to understand the media consumption habits of the target market.
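
A minimal sketch of the first steps of a factor analysis, in plain Python: build the correlation matrix, then extract one factor. The extraction shown is the principal component method via power iteration, limited to the first factor and without rotation; this choice, the data and the function names are all assumptions made for illustration.

```python
from math import sqrt


def correlation(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(columns):
    """Correlations among all pairs of input variables."""
    return [[correlation(a, b) for b in columns] for a in columns]


def first_factor(corr, iterations=200):
    """Largest eigenvalue of the correlation matrix and the matching loadings."""
    p = len(corr)
    v = [1.0] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    loadings = [sqrt(eigenvalue) * c for c in v]  # variable-factor correlations
    return eigenvalue, loadings


# Hypothetical scores of six respondents on three variables.
income = [1, 2, 3, 4, 5, 6]
education = [2, 1, 4, 3, 6, 5]
employment = [1, 3, 2, 5, 4, 6]

corr = correlation_matrix([income, education, employment])
eigenvalue, loadings = first_factor(corr)
print(round(eigenvalue, 3), [round(l, 3) for l in loadings])
```

An eigenvalue above 1 is one common rule of thumb for keeping a factor; here all three variables load positively on the single factor extracted.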
Cluster Analysis
• Cluster analysis can be defined as a set of techniques used to classify objects into relatively homogeneous groups called clusters.
• It involves identifying similar objects and grouping them under homogeneous groups.
• A cluster is a group of objects that display high correlation with each other and low correlation with objects in other clusters.
Procedure in Cluster Analysis
1. Defining the problem: First define the problem and decide upon the variables based on which the objects are clustered.
2. Selection of similarity or distance measures: The similarity measure examines the proximity between the objects. Closer or similar objects are grouped together and the farther objects are kept in separate groups. There are three major methods to measure the similarity between objects:
   1. Euclidean distance measures
   2. Correlation coefficient
   3. Association coefficients
3. Selection of clustering approach: There are two types of clustering approaches:
   1. Hierarchical clustering approach
   2. Non-hierarchical clustering approach
   • Hierarchical clustering approach: consists of either a top-down approach or a bottom-up approach. Prominent hierarchical clustering methods are: Single linkage, Complete linkage, Average linkage, Ward’s method and Centroid method.
   • Non-hierarchical clustering approach: A cluster center is first determined and all the objects that are within the specified distance from the cluster center are included in the cluster.
4. Deciding on the number of clusters to be selected
5. Interpreting the clusters

• Segmentation of market
• Understanding buying behaviours
• Identifying new product opportunities
• Selecting test markets
• Reducing data
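
A minimal sketch of the non-hierarchical approach described in the procedure, in plain Python: a cluster center is chosen, every object within a specified Euclidean distance joins that cluster, and the step repeats on the remaining objects. Taking the first unassigned object as the center, along with the data, the radius and the function name, are assumptions made for illustration.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def sequential_threshold(points, radius):
    """Group points around successive cluster centers within `radius`."""
    unassigned = list(points)
    clusters = []
    while unassigned:
        center = unassigned[0]  # assumption: first remaining object is the center
        members = [p for p in unassigned if dist(center, p) <= radius]
        clusters.append(members)
        unassigned = [p for p in unassigned if dist(center, p) > radius]
    return clusters


# Hypothetical customers described by two variables.
customers = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 8.5), (8.5, 9), (5, 1)]

clusters = sequential_threshold(customers, radius=2.0)
for number, members in enumerate(clusters, start=1):
    print(number, members)
```

The number of clusters is not fixed in advance here; it follows from the chosen radius, so trying several radii is one way to approach step 4.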
Artificial Neural Network
• Used when the function is not well defined, e.g. stock price prediction.
• Used for unstructured functions only.
• Single-layer and multi-layer ANN.
• STEPS: Architecture, Training, Activation
• Prediction, classification
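
A minimal sketch of the three steps for a single-layer network, in plain Python. The architecture is two inputs feeding one output unit, training uses the classic perceptron learning rule, and the activation is a step function. The slides do not name a training rule, so this choice, the AND data set and the function names are assumptions made for illustration.

```python
def step(z):
    """Activation: fire (1) when the weighted sum is non-negative."""
    return 1 if z >= 0 else 0


def train_perceptron(samples, epochs=20, rate=1):
    """Perceptron rule: nudge the weights by rate * error * input after each sample."""
    weights = [0, 0]  # architecture: two inputs, one output unit
    bias = 0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = step(weights[0] * x1 + weights[1] * x2 + bias)
            error = target - output
            weights[0] += rate * error * x1
            weights[1] += rate * error * x2
            bias += rate * error
    return weights, bias


def predict(weights, bias, x1, x2):
    """Classify one input pair with the trained unit."""
    return step(weights[0] * x1 + weights[1] * x2 + bias)


# Hypothetical classification task: the logical AND of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(samples)
predictions = [predict(weights, bias, x1, x2) for (x1, x2), _ in samples]
print(weights, bias, predictions)
```

A single-layer unit like this can only separate classes with a straight line; problems that are not linearly separable need a multi-layer network.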
• Information gain, Gini index used for hierarchical
• Decile or percentile values decide cut-offs at nodes
• Also used as a substitute for the logit model.
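
A minimal sketch of the Gini index and information gain, in plain Python, assuming the context is a tree-based classifier that splits the data at each node. The candidate cut-offs are taken at quartile values of a hypothetical income variable; the data and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Entropy in bits of the class distribution."""
    n = len(labels)
    return -sum((count / n) * log2(count / n)
                for count in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy of the parent node minus the weighted entropy of its children."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


def best_cutoff(values, labels, candidates):
    """Return (cutoff, gain) for the candidate with the highest information gain."""
    best = None
    for cut in candidates:
        left = [label for v, label in zip(values, labels) if v <= cut]
        right = [label for v, label in zip(values, labels) if v > cut]
        if not left or not right:
            continue  # a split must send objects to both sides
        gain = information_gain(labels, left, right)
        if best is None or gain > best[1]:
            best = (cut, gain)
    return best


# Hypothetical node: income and whether the customer bought the product.
income = [10, 20, 30, 40, 50, 60, 70, 80]
bought = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]
quartile_cutoffs = [20, 40, 60]

cutoff, gain = best_cutoff(income, bought, quartile_cutoffs)
print(gini(bought), cutoff, gain)
```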
Thank you