
School of Business Management, NMIMS University

Advanced Data Analysis Using SPSS

By Dr. Shailaja Rego

MBA Human Resources
Year 2020-21
Trimester VI

Contents

Sr. No.  Topic                                              Page No.
1        Teaching Plan                                      3 – 6
2        Charts                                             7 – 9
3        Nonparametric Tests                                10 – 11
4        Parametric & Nonparametric Tests using SPSS        11 – 25
5        Multivariate Analysis Introduction                 26 – 31
6        Multiple Regression Analysis                       31 – 58
7        Discriminant Analysis                              59 – 73
8        Logistic Regression                                73 – 82
9        MANOVA                                             82 – 84
10       Factor Analysis                                    84 – 102
11       Canonical Correlation                              102 – 104
12       Cluster Analysis                                   104 – 126
13       Conjoint Analysis                                  126 – 133
14       Multidimensional Scaling (MDS)                     133 – 135
15       Classification & Regression Trees (CART)           136 – 149
16       K Nearest Neighbours                               150 – 154
17       Time Series Analysis Through AR Modeling           155 – 158
18       Neural Networks                                    159 – 175
19       Appendix – Time Series in IBM SPSS                 1 to 32


NMIMS – SBM: Teaching Plan (Course Description)

Course Code:            XXX
Course Title:           Advanced Data Analysis
Course Instructor/s:    Dr. Shailaja Rego
Credit Value:           3 credits (100 marks paper)
Programme & Trimester:
Pre-requisite:          Statistical Analysis
Learning Objectives / Outcomes:
Session Plan (Session no., Topic, Pre-read/Class Activity)

Session 1: Introduction to multivariate data analysis, dependence & interdependence techniques, introduction to SPSS, two sample tests, one-way ANOVA, n-way ANOVA using SPSS and interpretation of the SPSS output. Discussion on assumptions of these tests.
  Pre-read/Class Activity: Chapters 11 & 12 and Appendix III (IBM SPSS) of Business Research Methodology by Srivastava & Rego

Session 2: Introduction to SPSS and R (open source statistical software).
  Pre-read/Class Activity: Appendix II of Business Research Methodology by Srivastava & Rego, and R manual

Session 3: Basics of data analysis – types of data vs tests & techniques, data cleaning, creating an appropriate model, finding the validity of the model.
  Pre-read/Class Activity: Binder pages 5 – 7 & 8 – 28

Session 4: Introduction to econometrics, multiple regression analysis, concepts of Ordinary Least Squares Estimate (OLSE) & Best Linear Unbiased Estimate (BLUE), with practical examples in business.
  Pre-read/Class Activity: H & A Chapter 4, pg. 193 – 288; Binder pages 48 – 74

Session 5: Multiple regression assumptions – heteroscedasticity, autocorrelation & multicollinearity; outlier analysis using Cook's distance; dummy variables.
  Pre-read/Class Activity: H & A Chapter 4, pg. 193 – 288

Session 6: Case – Compass Maritime Services, LLC: Valuing Ships.

Session 7: Factor analysis – introduction, objectives of factor analysis, designing a factor analysis, assumptions in a factor analysis, assessing overall fit, interpretation of the factors.
  Pre-read/Class Activity: H & A Chapter 3, pg. 125 – 165; Binder pages 100 – 116

Session 8: Case problem based on factor analysis; case problem on principal component analysis + multiple regression analysis (variables with high multicollinearity). Case – HomeZilla: Attracting Homebuyers through Better Photos.
  Pre-read/Class Activity: H & A Chapter 3, pg. 125 – 165

Session 9: Introduction to cluster analysis, objective of cluster analysis, research design of cluster analysis, assumptions in a cluster analysis.
  Pre-read/Class Activity: H & A Chapter 8, pg. 579 – 622

Session 10: Employing hierarchical & non-hierarchical clusters, K-means clusters, two-stage clusters.
  Pre-read/Class Activity: H & A Chapter 8, pg. 579 – 622

Session 11: Case on cluster analysis – CarZuma: Car Insurance Claim Case Study.

Session 12: Introduction to discriminant analysis, objectives of discriminant analysis, research design for discriminant analysis, assumptions of discriminant analysis; estimation of the discriminant model & assessing the overall fit; applications, interpretation & hypothetical illustrations.
  Pre-read/Class Activity: H & A Chapter 5, pg. 297 – 336; Binder pages 75 – 89

Session 13: Introduction to the logistic regression model, applications, assessing the overall fit; interpretation & hypothetical illustrations.
  Pre-read/Class Activity: H & A Chapter 5, pg. 293 – 336

Session 14: Case Study – Predicting Customer Churn at QWE Inc.

Session 15: K-nearest-neighbour (KNN), Naive Bayes, Classification & Regression Trees (CART) models.
  Pre-read/Class Activity: Binder pages 152 – 165

Session 16: Random Forest models; trees vs regression/logistic regression – choosing the right technique; applications, interpretation & case problems.
  Pre-read/Class Activity: H & A Chapter 7, pg. 483 – 547

Session 17: Advanced forecasting models for stationary time series, introduction to Box-Jenkins forecasting, autoregressive models.
  Pre-read/Class Activity: H & A Chapters 11 & 12, pg. 541 – 638

Session 18: Moving average models, mixed autoregressive moving average models, identifying an appropriate autoregressive moving average model. Case – Larsen and Toubro: Spare Parts Forecasting.
  Pre-read/Class Activity: Binder pages 166 – 185

Session 19: Introduction to neural networks, biological neural networks, single-layer perceptron training.
  Pre-read/Class Activity: Neural Networks and Deep Learning, Chapter 1, pages 1 – 14

Session 20: Application of artificial neural networks in business; advantages and limitations.
  Pre-read/Class Activity: Neural Networks and Deep Learning, Chapter 1, pages 17 – 28
Teaching / Learning Methodology – Our Pedagogy
 Lecture
 Case Study
 Software – Hands on

Assessment Methods (specific assessments and weightage)
Quiz                  20%
Project (Group)       20%
Assignment (Group)    20%
Final exam            40%

Reading List and References

Prescribed Textbook
Multivariate Data Analysis by Hair & Anderson, 7th Edition, Pearson Education India (H & A), 2018.

References
1. Statistics for Management by Srivastava & Rego, 3rd Edition, McGraw-Hill Publishers, 2017.
2. Applied Multivariate Statistical Analysis by Johnson & Wichern, 6th Edition, Pearson Education India, 2015.
3. Market Research by Naresh Malhotra, 7th Edition, Pearson Education India, 2015.
4. Neural Networks and Deep Learning by Charu C. Aggarwal, Springer, September 2018.
5. Business Analytics: The Science of Data Driven Decision Making by Dinesh Kumar, Wiley Publications, 2017.

Hypothesis Testing – Univariate Techniques

Parametric Tests
  One Sample: t Test, Z test
  Two Sample
    Independent Samples: t Test, Z test
    Dependent Samples: Paired t Test

Nonparametric Tests
  One Sample: Chi-Square, K-S, Runs, Binomial
  Two Sample
    Independent Samples: Chi-Square, Mann-Whitney, Median, K-S
    Dependent Samples: Sign, Wilcoxon, Chi-Square, McNemar

Multivariate Techniques

Dependence Techniques
  One Dependent Variable – techniques: Cross Tabulation, ANOVA, ANOCOVA, Multiple Regression, Two-group Discriminant Analysis, Logit Analysis, Conjoint Analysis
  More than One Dependent Variable – techniques: MANOVA, MANOCOVA, Canonical Correlation, Multiple Discriminant Analysis

Interdependence Techniques
  Variable Interdependence – techniques: Factor Analysis
  Inter-object Similarity – techniques: Cluster Analysis, MDS

Metric Dependent Variable

One Independent Variable
  Binary: t Test
  Categorical
    One Factor: One-way ANOVA
    More than One Factor: N-way ANOVA

More than One Independent Variable
  Categorical: Factorial ANOVA
  Categorical & Interval: ANOCOVA
  Interval: Regression

NON-PARAMETRIC TESTS

Contents
1. Relevance – Advantages and Disadvantages
2. Tests for
 Randomness of a Series of Observations – Run Test
 Specified Mean or Median of a Population – Signed Rank Test
 Goodness of Fit of a Distribution – Kolmogorov-Smirnov Test
 Comparing Two Populations – Kolmogorov-Smirnov Test
 Equality of Two Means – Mann-Whitney ('U') Test
 Equality of Several Means
– Wilcoxon-Wilcox Test
– Kruskal-Wallis Rank Sum ('H') Test
– Friedman's ('F') Test – Two-Way ANOVA
 Rank Correlation – Spearman's
 Testing Equality of Several Rank Correlations
 Kendall's Rank Correlation Coefficient
 Sign Test

1 Relevance and Introduction


All the tests of significance discussed in Chapters X and XI are based on certain assumptions about the variables and their statistical distributions. The most common assumption is that the samples are drawn from a normally distributed population. This assumption is more critical when the sample size is small. When this or other assumptions for the various tests described in those chapters are not valid or are doubtful, or when the available data is of the 'ordinal' (rank) type, we take the help of non-parametric tests. For example, in Student's 't' test for testing the equality of means of two populations based on samples from the two populations, it is assumed that the samples are from normal distributions with equal variance. If we are not sure of the validity of this assumption, it is better to apply the tests given in this chapter.

While parametric tests refer to parameters such as the mean, standard deviation, correlation coefficient, etc., the non-parametric tests, also called distribution-free tests, are used for testing other features as well, such as randomness, independence, association and rank correlation.
In general, we resort to non-parametric tests when
 The assumption of a normal distribution for the variable under consideration, or some other assumption required for a parametric test, is not valid or is doubtful.
 The hypothesis to be tested does not relate to a parameter of a population.
 The numerical accuracy of the collected data is not fully assured.
 Results are required quickly through simple calculations.
However, non-parametric tests have the following limitations or disadvantages:
 They ignore a certain amount of information.
 They are often not as efficient or reliable as parametric tests.
These advantages and disadvantages are consistent with the general premise in statistics that a method which is easier to calculate does not utilise the full information contained in a sample and is therefore less reliable.

The use of non-parametric tests thus involves a trade-off: while some 'efficiency or reliability' is lost, the ability to work with less information and to calculate faster is gained.

There are a number of such tests in the statistical literature. However, we have discussed only the following tests.

Types and Names of Tests

 Randomness of a Series of Observations – Run Test
 Specified Mean or Median of a Population – Signed Rank Test
 Goodness of Fit of a Distribution – Kolmogorov-Smirnov Test
 Comparing Two Populations – Kolmogorov-Smirnov Test
 Equality of Two Means – Mann-Whitney ('U') Test
 Equality of Several Means
– Wilcoxon-Wilcox Test
– Kruskal-Wallis Rank Sum ('H') Test
– Friedman's ('F') Test
 Rank Correlation – Spearman's

Statistical analyses using SPSS

Introduction

This page shows how to perform a number of statistical tests using SPSS. Each section gives a brief
description of the aim of the statistical test, when it is used, an example showing the SPSS commands
and SPSS (often abbreviated) output with a brief interpretation of the output. You can see the page
Choosing the Correct Statistical Test for a table that shows an overview of when each test is
appropriate to use. In deciding which test is appropriate to use, it is important to consider the type of
variables that you have (i.e., whether your variables are categorical, ordinal or interval and whether
they are normally distributed), see What is the difference between categorical, ordinal and interval
variables? for more information on this.

About the hsb data file

Most of the examples in this page will use a data file called hsb2, high school and beyond. This data
file contains 200 observations from a sample of high school students with demographic information
about the students, such as their gender (female), socio-economic status (ses) and ethnic background
(race). It also contains a number of scores on standardized tests, including tests of reading (read),
writing (write), mathematics (math) and social studies (socst). You can get the hsb data file by
clicking on hsb2.

One sample t-test

A one sample t-test allows us to test whether a sample mean (of a normally distributed interval
variable) significantly differs from a hypothesized value. For example, using the hsb2 data file, say
we wish to test whether the average writing score (write) differs significantly from 50. We can do
this as shown below.

t-test
/testval = 50
/variable = write.

The mean of the variable write for this particular sample of students is 52.775, which is statistically
significantly different from the test value of 50. We would conclude that this group of students has a
significantly higher mean on the writing test than 50.

One sample median test

A one sample median test allows us to test whether a sample median differs significantly from a
hypothesized value. We will use the same variable, write, as we did in the one sample t-test example
above, but we do not need to assume that it is interval and normally distributed (we only need to
assume that write is an ordinal variable). However, we are unaware of how to perform this test in
SPSS.
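
One option in more recent releases of SPSS is the NPTESTS procedure, which includes a one-sample Wilcoxon signed-rank test against a hypothesized value. A minimal sketch, assuming the hsb2 file is open; the exact syntax may vary slightly across SPSS versions:

* One-sample test of whether the median of write differs from 50.
nptests
  /onesample test (write) wilcoxon(testvalue = 50).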

Binomial test

A one sample binomial test allows us to test whether the proportion of successes on a two-level
categorical dependent variable significantly differs from a hypothesized value. For example, using
the hsb2 data file, say we wish to test whether the proportion of females (female) differs significantly
from 50%, i.e., from .5. We can do this as shown below.

npar tests
/binomial (.5) = female.

The results indicate that there is no statistically significant difference (p = .229). In other words, the
proportion of females in this sample does not significantly differ from the hypothesized value of 50%.

Chi-square goodness of fit



A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical
variable differ from hypothesized proportions. For example, let's suppose that we believe that the
general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White
folks. We want to test whether the observed proportions from our sample differ significantly from
these hypothesized proportions.

npar test
/chisquare = race
/expected = 10 10 10 70.

These results show that racial composition in our sample does not differ significantly from the
hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).

Two independent samples t-test

An independent samples t-test is used when you want to compare the means of a normally distributed
interval dependent variable for two independent groups. For example, using the hsb2 data file, say
we wish to test whether the mean for write is the same for males and females.

t-test groups = female(0 1)


/variables = write.

The results indicate that there is a statistically significant difference between the mean writing score
for males and females (t = -3.734, p = .000). In other words, females have a statistically significantly
higher mean score on writing (54.99) than males (50.12).

Wilcoxon-Mann-Whitney test

The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and
can be used when you do not assume that the dependent variable is a normally distributed interval
variable (you only assume that the variable is at least ordinal). You will notice that the SPSS syntax
for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We
will use the same data file (the hsb2 data file) and the same variables in this example as we did in the
independent t-test example above and will not assume that write, our dependent variable, is normally
distributed.

npar test
/m-w = write by female(0 1).

The results suggest that there is a statistically significant difference between the underlying
distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001).

Chi-square test

A chi-square test is used when you want to see if there is a relationship between two categorical
variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command
to obtain the test statistic and its associated p-value. Using the hsb2 data file, let's see if there is a
relationship between the type of school attended (schtyp) and students' gender (female). Remember
that the chi-square test assumes that the expected value for each cell is five or higher. This
assumption is easily met in the examples below. However, if this assumption is not met in your data,
please see the section on Fisher's exact test below.

crosstabs
/tables = schtyp by female
/statistic = chisq.

These results indicate that there is no statistically significant relationship between the type of school
attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828).

Let's look at another example, this time looking at the linear relationship between gender (female)
and socio-economic status (ses). The point of this example is that one (or both) variables may have
more than two levels, and that the variables do not have to have the same number of levels. In this
example, female has two levels (male and female) and ses has three levels (low, medium and high).

crosstabs
/tables = female by ses
/statistic = chisq.

Again we find that there is no statistically significant relationship between the variables (chi-square
with two degrees of freedom = 4.577, p = 0.101).

Fisher's exact test



The Fisher's exact test is used when you want to conduct a chi-square test but one or more of your
cells has an expected frequency of five or less. Remember that the chi-square test assumes that each
cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and
can be used regardless of how small the expected frequency is. In SPSS unless you have the SPSS
Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are
presented by default. Please see the results from the chi squared example above.

One-way ANOVA

A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable
(with two or more categories) and a normally distributed interval dependent variable and you wish to
test for differences in the means of the dependent variable broken down by the levels of the
independent variable. For example, using the hsb2 data file, say we wish to test whether the mean of
write differs between the three program types (prog). The command for this test would be:

oneway write by prog.

The mean of the dependent variable differs significantly among the levels of program type. However,
we do not know if the difference is between only two of the levels or all three of the levels. (The F
test for the Model is the same as the F test for prog because prog was the only variable entered into
the model. If other variables had also been entered, the F test for the Model would have been
different from prog.) To see the mean of write for each level of program type,

means tables = write by prog.

From this we can see that the students in the academic program have the highest mean writing score,
while students in the vocational program have the lowest.
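
To see which pairs of program types actually differ, post-hoc comparisons can be requested on the same command; a sketch using Tukey's HSD, which goes beyond the original example:

* One-way ANOVA on write with Tukey pairwise comparisons across prog.
oneway write by prog
  /posthoc = tukey alpha(0.05).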

Kruskal Wallis test

The Kruskal Wallis test is used when you have one independent variable with two or more levels and
an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a
generalized form of the Mann-Whitney test method since it permits two or more groups. We will use
the same data file as the one way ANOVA example above (the hsb2 data file) and the same variables
as in the example above, but we will not assume that write is a normally distributed interval variable.

npar tests
/k-w = write by prog (1,3).

If some of the scores receive tied ranks, then a correction factor is used, yielding a slightly different
value of chi-squared. With or without ties, the results indicate that there is a statistically significant
difference among the three types of programs.

Paired t-test

A paired (samples) t-test is used when you have two related observations (i.e., two observations per
subject) and you want to see if the means on these two normally distributed interval variables differ
from one another. For example, using the hsb2 data file we will test whether the mean of read is
equal to the mean of write.

t-test pairs = read with write (paired).



These results indicate that the mean of read is not statistically significantly different from the mean
of write (t = -0.867, p = 0.387).

Wilcoxon signed rank sum test

The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use
the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the
two variables is interval and normally distributed (but you do assume the difference is ordinal). We
will use the same example as above, but we will not assume that the difference between read and
write is interval and normally distributed.

npar test
/wilcoxon = write with read (paired).

The results suggest that there is not a statistically significant difference between read and write.

If you believe the differences between read and write were not ordinal but could merely be classified
as positive and negative, then you may want to consider a sign test in lieu of sign rank test. Again,
we will use the same variables in this example and assume that this difference is not ordinal.

npar test
/sign = read with write (paired).

We conclude that no statistically significant difference was found (p=.556).

McNemar test

You would perform McNemar's test if you were interested in the marginal frequencies of two binary
outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-
control study) or two outcome variables from a single group. Continuing with the hsb2 dataset used
in several above examples, let us create two binary outcomes in our dataset: himath and hiread.
These outcomes can be considered in a two-way contingency table. The null hypothesis is that the
proportion of students in the himath group is the same as the proportion of students in hiread group
(i.e., that the contingency table is symmetric).

compute himath = (math>60).


compute hiread = (read>60).
execute.

crosstabs
/tables=himath BY hiread
/statistic=mcnemar
/cells=count.

McNemar's chi-square statistic suggests that there is not a statistically significant difference in the
proportion of students in the himath group and the proportion of students in the hiread group.

One-way repeated measures ANOVA

You would perform a one-way repeated measures analysis of variance if you had one categorical
independent variable and a normally distributed interval dependent variable that was repeated at least
twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more
levels of the categorical variable. This tests whether the mean of the dependent variable differs by the
categorical variable. We have an example data set called rb4wide, which is used in Kirk's book
Experimental Design. In this data set, y is the dependent variable, a is the repeated measure and s is
the variable that indicates the subject number.

glm y1 y2 y3 y4
/wsfactor a(4).

You will notice that this output gives four different p-values. The output labeled "sphericity
assumed" is the p-value (0.000) that you would get if you assumed compound symmetry in the
variance-covariance matrix. Because that assumption is often not valid, the three other p-values offer
various corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Lower-bound). No matter
which p-value you use, our results indicate that we have a statistically significant effect of a at the .05
level.

Factorial ANOVA

A factorial ANOVA has two or more categorical independent variables (either with or without the
interactions) and a single normally distributed interval dependent variable. For example, using the
hsb2 data file we will look at writing scores (write) as the dependent variable and gender (female)
and socio-economic status (ses) as independent variables, and we will include an interaction of
female by ses. Note that in SPSS, you do not need to have the interaction term(s) in your data set.
Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables
that will make up the interaction term(s).

glm write by female ses.
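
The command above fits the full factorial model, so the female*ses interaction is included by default. To spell the terms out explicitly with the asterisk notation mentioned above, a DESIGN subcommand can be added; an equivalent sketch, not part of the original example:

* Same model with the interaction term written out explicitly.
glm write by female ses
  /design = female ses female*ses.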

These results indicate that the overall model is statistically significant (F = 5.666, p = 0.00). The
variables female and ses are also statistically significant (F = 16.595, p = 0.000 and F = 6.611, p =
0.002, respectively). However, the interaction between female and ses is not statistically significant
(F = 0.133, p = 0.875).

Friedman test

You perform a Friedman test when you have one within-subjects independent variable with two or
more levels and a dependent variable that is not interval and normally distributed (but at least
ordinal). We will use this test to determine if there is a difference in the reading, writing and math
scores. The null hypothesis in this test is that the distribution of the ranks of each type of score (i.e.,
reading, writing and math) are the same. To conduct a Friedman test, the data need to be in a long
format. SPSS handles this for you, but in other statistical packages you will have to reshape the data
before you can conduct this test.

npar tests
/friedman = read write math.

Friedman's chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant.
Hence, there is no evidence that the distributions of the three types of scores are different.

Correlation

A correlation is useful when you want to see the relationship between two (or more) normally
distributed interval variables. For example, using the hsb2 data file we can run a correlation between
two continuous variables, read and write.

correlations
/variables = read write.

In the second example, we will run a correlation between a dichotomous variable, female, and a
continuous variable, write. Although it is assumed that the variables are interval and normally
distributed, we can include dummy variables when performing correlations.

correlations
/variables = female write.

In the first example above, we see that the correlation between read and write is 0.597. By squaring
the correlation and then multiplying by 100, you can determine what percentage of the variability is
shared. Let's round 0.597 to be 0.6, which when squared would be .36, multiplied by 100 would be
36%. Hence read shares about 36% of its variability with write. In the output for the second
example, we can see the correlation between write and female is 0.256. Squaring this number yields
.065536, meaning that female shares approximately 6.5% of its variability with write.

Simple linear regression

Simple linear regression allows us to look at the linear relationship between one normally distributed
interval predictor and one normally distributed interval outcome variable. For example, using the
hsb2 data file, say we wish to look at the relationship between writing scores (write) and reading
scores (read); in other words, predicting write from read.

regression variables = write read


/dependent = write
/method = enter.

We see that the relationship between write and read is positive (.552) and based on the t-value
(10.47) and p-value (0.000), we would conclude this relationship is statistically significant. Hence,
we would say there is a statistically significant positive linear relationship between reading and
writing.

Non-parametric correlation

A Spearman correlation is used when one or both of the variables are not assumed to be normally
distributed and interval (but are assumed to be ordinal). The values of the variables are converted into ranks and then correlated. In our example, we will look for a relationship between read and write.
We will not assume that both of these variables are normal and interval.

nonpar corr
/variables = read write
/print = spearman.

The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is
statistically significant.

MULTIVARIATE STATISTICAL TECHNIQUES


Contents
 Introduction
 Multiple Regression
 Discriminant Analysis
 Logistic Regression
 Multivariate Analysis of Variance (MANOVA)
 Factor Analysis
 Principal Component Analysis
 Common Factor Analysis (Principal Axis Factoring)
 Canonical Correlation Analysis
 Cluster Analysis
 Conjoint Analysis
 Multidimensional Scaling
1 Relevance and Introduction
In general, what is true about predicting the success of a politician with the help of intelligence
software, as pointed out above by the President of India, is equally true for predicting the success of
products and services with the help of statistical techniques. In this Chapter, we discuss a number of
statistical techniques which are especially useful in designing of products and services. The products
and services could be physical, financial, promotional like advertisement, behavioural like
motivational strategy through incentive package or even educational like training
programmes/seminars, etc. These techniques basically involve reduction of data and its subsequent
summarisation, presentation and interpretation. A classical example of data reduction and
summarisation is provided by SENSEX (Bombay Stock Exchange) which is one number like 18,000,
but it represents movement in share prices listed in Bombay Stock Exchange. Yet another example is
the Grade Point Average, used for assessment of MBA students, which ‘reduces’ and summarises
marks in all subjects to a single number.
In general, any live problem whether relating to individual, like predicting the cause of an ailment or
behavioural pattern, or relating to an entity, like forecasting its futuristic status in terms of products
and services, needs collection of data on several parameters. These parameters are then analysed to
summarise the entire set of data with a few indicators which are then used for drawing conclusions.
The following techniques (with their abbreviations in brackets) coupled with the appropriate
computer software like SPSS, play a very useful role in the endeavour of reduction and
summarisation of data for easy comprehension.
 Multiple Regression Analysis (MRA)
 Discriminant Analysis (DA)
 Logistic Regression (LR)
 Multivariate Analysis of Variance (MANOVA)( introduction)
 Factor Analysis (FA)
 Principal Component Analysis ( PCA)
 Canonical Correlation Analysis (CRA) ( introduction)
 Cluster Analysis
 Conjoint Analysis
 Multidimensional Scaling( MDS)

Before describing these techniques in detail, we provide a brief description of each, along with its relevance and uses, in the tabular format given below. This is aimed at providing motivation for learning these techniques and at generating confidence in using SPSS for arriving at final conclusions/solutions in a research study. The contents of the table will be fully comprehended after reading all the techniques.

Statistical Techniques, Their Relevance and Uses for Designing and Marketing of Products and Services

Multiple Regression Analysis (MRA)
 It deals with the study of the relationship between one metric dependent variable and more than one metric independent variable.
Relevance and Uses:
 One could assess the individual impact of the independent variables on the dependent variable.
 Given the values of the independent variables, one could forecast the value of the dependent variable. For example, the sale of a product depends on expenditures on advertisements as well as on R & D. Given the values of these two variables, one could establish a relationship among these variables and the dependent variable, say profit. Subsequently, if the relationship is found appropriate, it could be used to predict the profit with the knowledge of the two types of expenditures.

Discriminant Analysis (DA)
 It is a statistical technique for classification, or for determining a linear function of the variables, called the discriminant function, which helps in discriminating between two groups of entities or individuals.
Relevance and Uses:
 The basic objective of discriminant analysis is to perform a classification function.
 From the analysis of past data, it can classify a given group of entities or individuals into two categories – those which would turn out to be successful and others which would not be so. For example, it can predict whether a company or an individual would turn out to be a good borrower.
 With the help of financial parameters, a firm could be classified as worthy of extending credit or not.
 With the help of financial and personal parameters, an individual could be classified as eligible for a loan or not, or as a likely buyer of a particular product/service or not.
 Salesmen could be classified according to their age, health, sales aptitude score, communication ability score, etc.

Logistic Regression (LR)
 It is a technique that assumes the errors are drawn from a binomial distribution.
 In logistic regression the dependent variable is the probability that an event will occur; hence it is constrained between 0 and 1.
 The predictors can be all binary, a mixture of categorical and continuous, or all continuous.
Relevance and Uses:
 Logistic regression is highly useful in biometrics and health sciences. It is used frequently by epidemiologists for the probability (sometimes interpreted as risk) that an individual will acquire a disease during some specified period of vulnerability.
 Credit card scoring: various demographic and credit history variables could be used to predict whether an individual will turn out to be a 'good' or 'bad' customer.
 Market segmentation: various demographic and purchasing variables could be used to predict whether an individual will purchase an item or not.

Multivariate Analysis of Variance (MANOVA)
 It explores, simultaneously, the relationship between several non-metric independent variables (treatments, say fertilisers) and two or more metric dependent variables (say, yield & harvest time). If there is only one dependent variable, MANOVA is the same as ANOVA.
Relevance and Uses:
 Determines whether statistically significant differences of means of several variables occur simultaneously between two levels of a variable. For example, assessing whether
(i) a change in the compensation system has brought about changes in sales, profit and job satisfaction;
(ii) geographic region (North, South, East, West) has any impact on consumers' preferences, purchase intentions or attitudes towards specified products or services;
(iii) a number of fertilizers have equal impact on the yield of rice as also on the harvest time of the crop.

Principal Component Analysis (PCA)
 A technique for forming a set of new variables that are linear combinations of the original set of variables and are uncorrelated. The new variables are called Principal Components.
 These variables are fewer in number than the original variables, but they extract most of the information provided by the original variables.
Relevance and Uses:
 One could identify several financial parameters and ratios, exceeding ten, for determining the financial health of a company. Obviously, it would be extremely taxing to interpret all such pieces of information for assessing the financial health of a company. However, the task could be much simpler if these parameters and ratios could be reduced to a few indices, say two or three, which are linear combinations of the original parameters and ratios.
 A multiple regression model may be derived to forecast a parameter like sales, profit, price, etc. However, the variables under consideration could be correlated among themselves, indicating multicollinearity in the data. This could lead to misleading interpretation of the regression coefficients as also an increase in the standard errors of the estimates of the parameters. It would be very useful if new uncorrelated variables could be formed which are linear combinations of the original variables. These new variables could then be used for developing the regression model, for appropriate interpretation and better forecasts.

Common Factor Analysis (CFA)
 It is a statistical approach that is used to analyse inter-relationships among a large number of variables (indicators) and to explain these variables (indicators) in terms of a few unobservable constructs (factors). In fact, these factors impact the variables, which are reflective indicators of the factors. The statistical approach involves finding a way of condensing the information contained in a number of original variables into a smaller set of constructs (factors) – mostly one or two – with a minimum loss of information.
 It identifies the smallest number of common factors that best explain or account for most of the correlation among the indicators. For example, the intelligence quotient of a student might explain most of the marks obtained in Mathematics, Physics, Statistics, etc. As yet another example, when two variables x and y are highly correlated, only one of them could be used to represent the entire data.
Relevance and Uses: Helps in assessing
 the image of a company/enterprise
 attitudes of sales personnel and customers
 preference or priority for the characteristics of
   - a product like a television, mobile phone, etc.
   - a service like a TV programme, air travel, etc.

Canonical Correlation Analysis (CRA)
 An extension of multiple regression analysis (MRA involves one dependent variable and several metric independent variables). It is used for situations wherein there are several dependent variables and several independent variables.
 It involves developing linear combinations of the two sets of variables (both dependent and independent) and studies the relationship between the two sets. The weights in the linear combinations are derived based on the criterion that maximizes the correlation between the two sets of variables.
Relevance and Uses:
 Used in studying the relationship between the types of products purchased and consumer life styles and personal traits. Also, for assessing the impact of life styles and eating habits on health, as measured by a number of health related parameters.
 Given the assets and liabilities of a set of banks/financial institutions, it helps in examining the interrelationship of variables on the asset and liability sides.
 An HRD department might like to study the relationship between the set of behavioural, technological and social skills of a salesman and the set of variables representing sales performance, discipline and cordial relations with staff.
 The Central Bank of a country might like to study the relationship between sets of variables representing several risk factors and the financial indicators arising out of a bank's operations. Similar analysis could be carried out for any organisation.

Cluster Analysis
 It is an analytical technique that is used to develop meaningful subgroups of entities which are homogeneous or compact with respect to certain characteristics. Thus, observations in each group would be similar to each other. Further, each group should be different from the other groups with respect to the same characteristics, and therefore observations of one group would be different from the observations of the other groups.
Relevance and Uses:
 It helps in classifying a given set of entities into a smaller set of distinct groups by analysing similarities among the given set of entities.
Some situations where the technique could be used are:
 A bank could classify its large network of branches into clusters (groups) of branches which are similar to each other with respect to specified parameters.
 An investment bank could identify groups of firms that are vulnerable to takeover.
 A marketing department could identify similar markets where products or services could be tested or used for target marketing.
 An insurance company could identify groups of motor insurance policy holders with high claims.

Conjoint Analysis
 Involves determining the contribution of variables (each with several levels) to the choice preference over combinations of variables that represent realistic choice sets (products, concepts, services, companies, etc.).
Relevance and Uses:
 Useful for analyzing consumer responses, and for using the same for the design of products and services.
 Helps in determining the contributions of the predictor variables and their respective levels to the desirability of the combinations of variables. For example, how much does the quality of food contribute to the continued loyalty of a traveller to an airline? Which type of food is liked most?

Multidimensional Scaling (MDS)
 It is a set of procedures for drawing pictures of data so as to visualise and clarify the relationships described by the data.
 The requisite data is typically collected by having respondents give simple one-dimensional responses.
 It transforms consumer judgments/perceptions of similarity or preference into, usually, a two-dimensional space.
Relevance and Uses: Useful for the design of products and services. It helps in
 illustrating market segments based on indicated preferences
 identifying the products and services that are more competitive with each other
 understanding the criteria used by people while judging objects (products, services, companies, advertisements, etc.).

1.1 Multivariate Techniques


These techniques are classified into two types:
 Dependence Techniques
 Interdependence Techniques

Dependence Techniques
These are techniques that designate some of the variables as independent variables and others as dependent variables. They aim at finding the relationship between these variables and may, in turn, quantify the effect of the independent variables on the dependent variable.

The technique to be used differs as the type of independent/dependent variables changes. For example, if all the independent and dependent variables are metric (numeric), Multiple Regression Analysis can be used; if the dependent variable is metric and the independent variable(s) are categorical, ANOVA can be used. If the dependent variable is metric and some of the independent variables are metric while others are categorical, ANOCOVA (Analysis of Covariance) can be used. If the dependent variable is non-metric or categorical, multiple discriminant analysis or logistic regression are the techniques used for analysis.
All the above techniques require a single dependent variable.
If there is more than one dependent variable, the techniques used are MANOVA (Multivariate Analysis of Variance) or canonical correlation.
MANOVA is used when there is more than one dependent variable and all independent variables are categorical. If some of the independent variables are categorical and some are metric, MANOCOVA (Multivariate Analysis of Covariance) can be used. If there is more than one dependent variable and all dependent and independent variables are metric, the best suited analysis is canonical correlation.
Interdependence Techniques
Interdependence techniques do not designate any variable as independent or dependent, nor do they try to find such a relationship. These techniques can be divided into variable interdependence and inter-object similarity techniques.
The variable interdependence techniques can also be termed data reduction techniques. Factor analysis is an example of a variable interdependence technique. It is used when there are many related variables and one wants to reduce the list of variables or find the underlying factors that determine the variables.
Inter-object similarity is assessed with the help of cluster analysis and multidimensional scaling (MDS).
Brief descriptions of all the above techniques are given in subsequent sections of this Chapter.

2 Multiple Regression Analysis


In Chapter 10, we discussed correlation and regression analysis relating to two variables. Usually, one of them, called the dependent variable, is of prime consideration and depends on the other variable, called the independent variable. Thus, we have one dependent and one independent variable. However, sometimes more than one variable may influence the dependent variable. For instance, the marks obtained by students in an examination could depend not only on their intelligence quotients (IQs) but also on the time devoted to preparing for the examination. In agriculture, the yield of a crop depends not only on the fertilizer used but also upon rainfall, temperature, etc. Similarly, in economic theory, the quantity demanded of a particular commodity may depend not only on its price but also on the prices of substitute commodities and on the disposable incomes of households. Further, the price of a particular stock depends not only on the stock market in general, but also on its dividend payout as also the retained earnings of the concerned company. It may be noted that simple regression analysis helps in assessing the impact of one (independent) variable on another (dependent) variable. With the help of such analysis, given a dependent variable, one could consider, one by one, the impact of even several independent variables individually. However, with the help of multiple regression analysis, one could assess the impact of several independent variables, individually or jointly, on the dependent variable.
Some of the situations wherein a multiple regression equation, giving the relationship between the dependent variable and independent variables, is used are as follows:

Dependent Variable                         Independent Variables
Manpower in a Sales Organisation           Number of Sales Offices + Business per Sales Office
EPS (time series or cross-sectional)       Sales + Dividend + Price
Sales of a Company                         Expenditure on Advertisement + Expenditure on R & D
Return on BSE SENSEX                       Return on Stock of Reliance Industries + Return on Stock of Infosys Technologies

2.1 Estimation of Multiple Regression Equation and Calculation of Multiple Correlation Coefficient

In multiple regression analysis, the dependent variable is expressed as a function of independent variables. The equation giving such a relationship is called the regression equation. The correlation between the dependent variable and the independent variables is measured by the multiple correlation coefficient. The methodology of deriving the multiple regression equation and calculating the multiple correlation coefficient is illustrated below.

Illustration 1
Let the dependent variable of interest be y, which depends on two independent variables, say x1 and x2. The linear relationship among y, x1 and x2 can be expressed in the form of the regression equation of y on x1 and x2 as:

y = b0 + b1 x1 + b2 x2    ... (1)

where b0 is referred to as the 'intercept' and b1 & b2 are known as regression coefficients.

The sample comprises n triplets of values of y, x1 and x2, in the following format:

y      x1      x2
y1     x11     x21
y2     x12     x22
.      .       .
yn     x1n     x2n

The values of the constants b0, b1 and b2 are estimated with the help of the Principle of Least Squares, just as the values of a and b were found while fitting the equation y = a + bx in Chapter 10 on Simple Correlation and Regression Analysis. They are calculated from the above sample observations with the help of the formulas given below.
These formulas and manual calculations are given for illustration only. In real life they are easily obtained with the help of personal computers wherein the formulas are already stored.

$$b_1 = \frac{\left(\sum y_i x_{1i} - n\bar{y}\bar{x}_1\right)\left(\sum x_{2i}^2 - n\bar{x}_2^2\right) - \left(\sum y_i x_{2i} - n\bar{y}\bar{x}_2\right)\left(\sum x_{1i}x_{2i} - n\bar{x}_1\bar{x}_2\right)}{\left(\sum x_{1i}^2 - n\bar{x}_1^2\right)\left(\sum x_{2i}^2 - n\bar{x}_2^2\right) - \left(\sum x_{1i}x_{2i} - n\bar{x}_1\bar{x}_2\right)^2}$$

$$b_2 = \frac{\left(\sum y_i x_{2i} - n\bar{y}\bar{x}_2\right)\left(\sum x_{1i}^2 - n\bar{x}_1^2\right) - \left(\sum y_i x_{1i} - n\bar{y}\bar{x}_1\right)\left(\sum x_{1i}x_{2i} - n\bar{x}_1\bar{x}_2\right)}{\left(\sum x_{1i}^2 - n\bar{x}_1^2\right)\left(\sum x_{2i}^2 - n\bar{x}_2^2\right) - \left(\sum x_{1i}x_{2i} - n\bar{x}_1\bar{x}_2\right)^2} \quad \ldots (2)$$

$$b_0 = \bar{y} - b_1\bar{x}_1 - b_2\bar{x}_2 \quad \ldots (3)$$

The calculations needed in the above formulas are facilitated by preparing the following table:

y      x1      x2      y·x1      y·x2      x1·x2      y²      x1²      x2²
y1     x11     x21     y1x11     y1x21     x11x21     y1²     x11²     x21²
.      .       .       .         .         .          .       .        .
yi     x1i     x2i     yix1i     yix2i     x1ix2i     yi²     x1i²     x2i²
.      .       .       .         .         .          .       .        .
yn     x1n     x2n     ynx1n     ynx2n     x1nx2n     yn²     x1n²     x2n²

Sum    Σyi     Σx1i    Σx2i      Σyix1i    Σyix2i     Σx1ix2i  Σyi²    Σx1i²    Σx2i²

The effectiveness or reliability of the relationship thus obtained is judged by the multiple coefficient of determination, usually denoted by R², which is defined as the ratio of the variation explained by the regression equation (1) to the total variation of the dependent variable y. Thus,

$$R^2 = \frac{\text{Explained Variation in } y}{\text{Total Variation in } y} \quad \ldots (4)$$

$$R^2 = 1 - \frac{\text{Unexplained Variation}}{\text{Total Variation}} \quad \ldots (5)$$

$$R^2 = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2} \quad \ldots (6)$$

It may be recalled from Chapter 10 that the total variation in the variable y is equal to the variation explained by the regression equation plus the variation left unexplained by the regression equation. Mathematically, this is expressed as

$$\underbrace{\sum (y_i - \bar{y})^2}_{\text{Total Variation}} = \underbrace{\sum (\hat{y}_i - \bar{y})^2}_{\text{Explained Variation}} + \underbrace{\sum (y_i - \hat{y}_i)^2}_{\text{Unexplained Variation}}$$

where yi is the observed value of y, ȳ is the mean of all the yi, and ŷi is the estimate of the value yi given by the regression equation (1). It may be recalled that Σ(ŷi − ȳ)² is the variation of y explained by the estimate of y, and Σ(yi − ŷi)² is the variation of y left unexplained by the estimate ŷ. If yi is equal to the estimate ŷi for every observation, then all the variation is explained, the unexplained variation is zero, the total variation is fully explained by the regression equation, and R² is equal to 1.

The square root of R², viz. R, is known as the coefficient of multiple correlation and always lies between 0 and 1. In fact, R is the correlation between the dependent variable and its estimate derived from the multiple regression equation, and as such it has to be positive.

All the calculations and interpretations for the multiple regression equation and the coefficient of multiple correlation or determination are explained with the help of the illustration given below.

Example 1
The owner of a chain of ten stores wishes to forecast net profit with the help of next year's projected sales of food and non-food items. The data on the current year's sales of food items and non-food items, as also the net profit, for all the ten stores are as follows:

Table 1: Sales of Food and Non-Food Items and Net Profit of a Chain of Stores (Rs. Crores)

Supermarket No.    Net Profit (y)    Sales of Food Items (x1)    Sales of Non-Food Items (x2)
1                  5.6               20                          5
2                  4.7               15                          5
3                  5.4               18                          6
4                  5.5               20                          5
5                  5.1               16                          6
6                  6.8               25                          6
7                  5.8               22                          4
8                  8.2               30                          7
9                  5.8               24                          3
10                 6.2               25                          4

In this case, the relationship is expressed by equation (1), reproduced below:

y = b0 + b1 x1 + b2 x2

where y denotes net profit, x1 denotes sales of food items, x2 denotes sales of non-food items, and b0, b1 & b2 are constants. Their values are obtained from the formulas derived from the Principle of Least Squares. The required calculations can be made with the help of the following table (amounts in Rs. Crores):

Supermarket    yi      x1i     x2i    x1i²    yix1i     yi²       yix2i    x2i²    x1ix2i
1              5.6     20      5      400     112       31.36     28       25      100
2              4.7     15      5      225     70.5      22.09     23.5     25      75
3              5.4     18      6      324     97.2      29.16     32.4     36      108
4              5.5     20      5      400     110       30.25     27.5     25      100
5              5.1     16      6      256     81.6      26.01     30.6     36      96
6              6.8     25      6      625     170       46.24     40.8     36      150
7              5.8     22      4      484     127.6     33.64     23.2     16      88
8              8.2     30      7      900     246       67.24     57.4     49      210
9              5.8     24      3      576     139.2     33.64     17.4     9       72
10             6.2     25      4      625     155       38.44     24.8     16      100
Sum            59.1    215     51     4815    1309.1    358.07    305.6    273     1099
Average        5.91    21.5    5.1
Substituting these values in the formulas for b0, b1 and b2, the desired relationship is obtained as

y = 0.233 + 0.196 x1 + 0.287 x2    ... (7)

This equation is known as the multiple regression equation of y on x1 and x2, and it indicates how y changes with respect to changes in x1 and x2. The interpretation of the value of the coefficient of x1, viz. b1 = 0.196, is that if x2 (sales of non-food items) is held constant, then for every crore rupees of sales of food items, the net profit increases by Rs. 0.196 crore, i.e. Rs. 19.6 lakh. Similarly, the interpretation of the value of the coefficient of x2, viz. b2 = 0.287, is that if x1 is held constant and the sales of non-food items increase by one crore rupees, the net profit increases by Rs. 0.287 crore, i.e. Rs. 28.7 lakh.
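
In practice, the coefficients in equation (7) are obtained directly from SPSS rather than by manual calculation. A minimal sketch of the syntax, assuming the columns of Table 1 have been entered as variables with the hypothetical names profit, food_sales and nonfood_sales:

* Multiple regression of net profit on the two sales variables.
regression variables = profit food_sales nonfood_sales
  /dependent = profit
  /method = enter.

The Coefficients table in the output gives the estimates of b0, b1 and b2, and the Model Summary table gives R and R².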
The effectiveness or reliability of this relationship is judged by the multiple coefficient of determination R², defined in (4) as

R² = Explained Variation in y by the Regression Equation / Total Variation in y

The two quantities are calculated with the help of the following table, in which column (3) gives the difference between the observed value yi and its estimate ŷi derived from the fitted regression equation by substituting the corresponding values of x1 and x2.
yi          ŷi *        yi − ŷi      (yi − ŷi)²     (yi − ȳ)²
(1)         (2)         (3)          (4)            (5)
5.6         5.587       0.0127       0.0002         0.0961
4.7         4.607       0.0928       0.0086         1.4641
5.4         5.482       −0.082       0.0067         0.2601
5.5         5.587       −0.087       0.0076         0.1681
5.1         5.090       0.0099       0.0001         0.6561
6.8         6.854       −0.054       0.0029         0.7921
5.8         5.693       0.1075       0.0116         0.0121
8.2         8.121       0.0789       0.0062         5.2441
5.8         5.798       0.0023       0.0000         0.0121
6.2         6.281       −0.081       0.0065         0.0841
Sum 59.1    Sum 59.1    0            Sum 0.0504     Sum 8.789
ȳ = 5.91                             (Unexplained   (Total
                                      Variation)     Variation)
* Derived from the fitted equation y = 0.233 + 0.196 x1 + 0.287 x2

Substituting the respective values in equation (6), we get
R² = 1 − (0.0504 / 8.789) = 1 − 0.0057 = 0.9943
The interpretation of R² = 0.9943 is that 99.43% of the variation in net profit is explained jointly by the variation in sales of food items and sales of non-food items.

Incidentally, the explained variation for the above example can be calculated by subtracting the unexplained variation from the total variation: 8.789 − 0.0504 = 8.7386.
It may be recalled that in Chapter 10 on Simple Correlation and Regression Analysis, we discussed the impact of variation in only one independent variable on the dependent variable. We shall now demonstrate the usefulness of two independent variables in explaining the variation in the dependent variable (net profit in this case).
Suppose we consider only one variable, say sales of food items; then the basic data would be as follows:
Supermarket    Net Profit (y, Rs. Crores)    Sales of Food Items (x1, Rs. Crores)
1              5.6                           20
2              4.7                           15
3              5.4                           18
4              5.5                           20
5              5.1                           16
6              6.8                           25
7              5.8                           22
8              8.2                           30
9              5.8                           24
10             6.2                           25

The scatter diagram indicates a positive linear correlation between the net profit and the sales of food items.

The linear relationship is assumed as

y = a + b x1    ... (8)

which is the regression equation of y on x1. While b is the regression coefficient of y on x1, a is just a constant. In the given example, y is the net profit and x1 is the sales of food items.

The values of a and b are calculated from the following formulas given in Chapter 10:

$$b = \frac{\sum y_i x_{1i} - n\,\bar{y}\,\bar{x}_1}{\sum x_{1i}^2 - n\,\bar{x}_1^2}, \qquad a = \bar{y} - b\,\bar{x}_1$$

Thus the desired regression equation is obtained as

y = 1.61 + 0.2 x1    ... (9)

It tells us how y changes with respect to changes in x1, i.e. how the net profit increases with an increase in the sales of food items. The interpretation of b = 0.2 is that for every crore rupees of sales of food items, the net profit increases by Rs. 0.2 crore, i.e. Rs. 20 lakh.
As stated earlier, the effectiveness/reliability of the regression equation is judged by the coefficient of determination, which can be obtained as
r² = 0.876
This value of r² = 0.876 indicates that 87.6% of the variation in net profit is explained by the variation in sales of food items, and thus one may feel quite confident in forecasting net profit with the help of the sales of food items. However, before doing so, it is desirable to examine the possibility of considering some other variables as well, either as an alternative or in addition to the variable (sales of food items) already considered, to improve the reliability of the forecast.
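
The corresponding single-predictor fit can be reproduced in SPSS in the same way as before, again using the hypothetical variable names profit and food_sales:

* Simple regression of net profit on sales of food items only.
regression variables = profit food_sales
  /dependent = profit
  /method = enter.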

As mentioned in Chapter 10, the coefficient of determination is also defined as

r² = Explained Variation in y by the Regression Equation / Total Variation in y

$$r^2 = 1 - \frac{\text{Unexplained Variation in } y}{\text{Total Variation in } y} = 1 - \frac{\sum (y_i - \hat{y}_i)^2}{\sum (y_i - \bar{y})^2}$$

In the above illustration,

Total Variation = Σ(yi − ȳ)²
Unexplained Variation = Σ(yi − ŷi)²
Explained Variation = Σ(ŷi − ȳ)²

These quantities can be calculated from the following table:

Supermarket   Net Profit   Sales of Food    yi − ȳ    (yi − ȳ)²   ŷi = 1.61 + 0.2xi   yi − ŷi   (yi − ŷi)²   ŷi − ȳ   (ŷi − ȳ)²
              (yi)         Items (xi)
1             5.6          20               −0.31     0.0961      5.61                −0.01     0.0001       −0.3     0.09
2             4.7          15               −1.21     1.4641      4.61                0.09      0.0081       −1.3     1.69
3             5.4          18               −0.51     0.2601      5.21                0.19      0.0361       −0.7     0.49
4             5.5          20               −0.41     0.1681      5.61                −0.11     0.0121       −0.3     0.09
5             5.1          16               −0.81     0.6561      4.81                0.29      0.0841       −1.1     1.21
6             6.8          25               0.89      0.7921      6.61                0.19      0.0361       0.7      0.49
7             5.8          22               −0.11     0.0121      6.01                −0.21     0.0441       0.1      0.01
8             8.2          30               2.29      5.2441      7.61                0.59      0.3481       1.7      2.89
9             5.8          24               −0.11     0.0121      6.41                −0.61     0.3721       0.5      0.25
10            6.2          25               0.29      0.0841      6.61                −0.41     0.1681       0.7      0.49
Sum           59.1         215                        8.789                                     1.109                 7.7
Average       5.91         21.5

It may be noted that the unexplained variation or residual error is 1.109 when the simple regression equation (9) of net profit on sales of food items is fitted, but it reduces to 0.0504 when the multiple regression equation (7) is used, i.e. when one more variable, sales of non-food items (x2), is taken into account.

Also, it may be noted that when only one variable, viz. sales of food items, is considered, r² is 0.876, i.e. 87.6% of the variation in net profit is explained by the variation in sales of food items; but when both the variables, viz. sales of food as well as non-food items, are considered, R² is 0.9943, i.e. 99.43% of the variation in net profit is explained by the variation in these two variables together.
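The comparison above can be reproduced in a few lines of code. The following is a minimal sketch (an addition, not part of the original text) using NumPy's least-squares routine on the ten supermarkets of Illustration 1; the printed values should come out close to the r² and R² quoted above.

import numpy as np

y  = np.array([5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2])   # net profit
x1 = np.array([20, 15, 18, 20, 16, 25, 22, 30, 24, 25], float)      # food items
x2 = np.array([5, 5, 6, 5, 6, 6, 4, 7, 3, 4], float)                # non-food items

def r_squared(X, y):
    """Fit y on X by ordinary least squares and return the coefficient of determination."""
    X = np.column_stack([np.ones(len(y)), X])          # add the intercept column
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

print(r_squared(x1, y))                                # ~0.87 (food items only)
print(r_squared(np.column_stack([x1, x2]), y))         # ~0.99 (food and non-food items)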

2.2 Forecast with a Regression Equation


The multiple regression equation ( 1 ) can be used to forecast the value of the dependent variable at
any point of time, given the values of the independent variables at that point of time. For illustration,
in the above example about the net profit in a company, one may be interested in forecasting the net profit for the next year when the sales of food items are expected to increase to Rs. 30 Crores and the sales of non-food items are expected to be Rs. 7 Crores. Substituting x1 = 30 and x2 = 7 in equation (7),
we get
y = 0.233 + 0.196  30 + 0.287  7
= 8.122

Thus, the forecast net profit by the end of next year would be Rs. 8.122 Crores.

Caution : It is important to note that a regression equation is valid for estimating the
value of the dependent variable only within the range of independent variable(s) or
only slightly beyond the range. However, it can be used even much beyond the range if
no other better option is available, and it is supported by commonsense.

2.3 Correlation Matrix


The multiple correlation coefficient can also be determined with the help of the total correlation coefficients between all pairs of the dependent and independent variables. All the possible total correlations between pairs of x1, x2 and y can be presented in the form of a matrix as follows:

rx1x1   rx1y   rx1x2
ryx1    ryy    ryx2
rx2x1   rx2y   rx2x2

Since rx1x1, ryy and rx2x2 are all equal to 1, the matrix can be written as

1       rx1y   rx1x2
ryx1    1      ryx2
rx2x1   rx2y   1

Further, since rx1y and ryx1 are equal, and similarly for the other pairs, it is sufficient to write the matrix in the following form:

1   rx1y   rx1x2
-   1      ryx2
-   -      1

If there are three variables x1, x2 and y, then a simple correlation coefficient can be defined between every pair of x1, x2 and y. When there are more than two variables in a study, the simple correlation between any two variables is known as a total correlation; all nine such possible pairs are represented in the matrix given above.

For the above example relating to net profit, where there are three variables y, x1 and x2, the correlation matrix has the same form.

2.4 Adjusted Multiple Coefficient of Determination


In a multiple regression equation, the addition of an independent or explanatory variable increases the value of R². For comparing two values of R², it is necessary to take into consideration the number of independent variables on which each is based. This is done by calculating an adjusted R², denoted by R̄² ('R bar squared'). This adjusted value of R², known as the adjusted multiple coefficient of determination, takes into account n (the number of observations) and k (the number of independent variables) for comparison in the two situations, and is calculated as

R̄² = 1 – ( (n – 1) / (n – k – 1) ) (1 – R²)        …(10)

where n is the sample size or the number of observations on each of the variables, and
k is the number of independent variables. For the above example,

R̄² = 1 – ( (10 – 1) / (10 – 2 – 1) ) (1 – 0.9943) = 0.9927
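As a small illustrative sketch (an addition, not from the original text), equation (10) can be written as a one-line function and checked against the value just computed:

def adjusted_r2(r2, n, k):
    """Adjusted coefficient of determination for n observations and k predictors, equation (10)."""
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

print(adjusted_r2(0.9943, n=10, k=2))   # ~0.9927, as computed above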

To start with, when an independent variable is added, i.e. the value of k is increased, the value of R̄² increases; but when the addition of another variable does not contribute towards explaining the variability in the dependent variable, the value of R̄² decreases. This implies that the addition of that variable is redundant.

The adjusted R², i.e. R̄², is smaller than R², and the gap widens as the number of observations per independent variable decreases. However, R̄² tends to be equal to R² as the sample size increases for a given number of independent variables.

R̄² is useful in comparing two regression equations having different numbers of independent variables, or having the same number of variables but based on different sample sizes.

As an illustration, in the above example relating to the regression of net profit on sales of food items and sales of non-food items, the value of R² is 0.876 when only sales of food items is taken as the independent variable to predict net profit, but it increases to 0.9943 when the other independent variable, viz. sales of non-food items, is also taken into consideration; the corresponding adjusted value, R̄², is 0.9927.

2.5 Dummy Variable


So far, we have considered independent variables which are quantifiable and measurable, like income, sales, profit, etc. However, sometimes an independent variable may not be quantifiable and measurable but only qualitative and categorical, and could still impact the dependent variable under study. For example, the amount of an insurance policy a person takes could depend on his/her marital status, which is categorical, i.e. married or unmarried. The sale of ice-cream might depend on the season, viz. summer or other seasons. The performance of a candidate at a competitive examination depends not only on his/her I.Q. but also on the categorical variable 'coached' or 'un-coached'.

Dummy variables are very useful for capturing a variety of qualitative effects by indicating the two states of qualitative or categorical data as '0' and '1'. The dummy variable is assigned the value '1' or '0' depending on whether or not the observation possesses the specified characteristic. Some examples are male and female, married and unmarried, MBA executives and non-MBA executives, trained and not trained, advertisement I and advertisement II, and a strategy such as financial discount versus gift item for sales promotion. Thus, a dummy variable modifies the form of a non-numeric variable into a numeric one. Dummy variables are used as explanatory variables in a regression equation; they act like a 'switch' which turns various parameters 'on' and 'off' in an equation. Another advantage of a '0'/'1' dummy variable is that, even though it is a nominal-level variable, it can be treated statistically just like an interval-level variable taking the values 1 or 0. It marks or encodes a particular attribute, converting an indicator of a category into a binary variable; it is a form of coding to transform non-metric data into metric data, and it facilitates considering the two levels of an independent variable separately.

Illustration 2
It is normally expected that a person with high income will purchase life insurance policy for a higher
amount. However, it may be worth examining whether there is any difference in the amounts of
insurance purchased by married & unmarried persons. To answer these queries, an insurance agent
collected the data about the policies purchased by his clients during the last month. The data is as
follows:

Sr. No of   Annual Income           Amount of Stipulated Annual Insurance   Marital Status
Client      (in Thousands of Rs.)   Premium (in Thousands of Rs.)           (Married/Single)
1           800                     85                                      M
2           450                     50                                      M
3           350                     50                                      S
4           1500                    140                                     S
5           1000                    100                                     M
6           500                     50                                      S
7           250                     40                                      M
8           60                      10                                      S
9           800                     70                                      S
10          1400                    150                                     M
11          1300                    150                                     M
12          1200                    110                                     M

Note : The marital status is converted into an independent (dummy) variable by substituting 'M' by 1 and 'S' by 0 for the purpose of fitting the regression equation.

It may be verified that the multiple regression equation with amount of insurance premium as
dependent variable and income as well as marital status as independent variables is
Premium = 5.27 + 0.091 Income + 8.95 Marital Status

The interpretation of the coefficient 0.091 is that for every additional thousand rupees of annual income, the premium increases by 0.091 thousand, i.e. Rs. 91.
The interpretation of the coefficient 8.95, is that a married person takes an additional premium
of Rs 8,950 as compared to a single person.
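The following hedged sketch (an addition, not part of the original text) shows how the same fit can be reproduced outside SPSS by recoding the marital status as a 0/1 dummy; the printed coefficients should be close to those in the equation quoted above.

import numpy as np

income  = np.array([800, 450, 350, 1500, 1000, 500, 250, 60, 800, 1400, 1300, 1200], float)
premium = np.array([85, 50, 50, 140, 100, 50, 40, 10, 70, 150, 150, 110], float)
status  = ["M", "M", "S", "S", "M", "S", "M", "S", "S", "M", "M", "M"]
married = np.array([1.0 if s == "M" else 0.0 for s in status])   # dummy: M -> 1, S -> 0

X = np.column_stack([np.ones(len(premium)), income, married])
b, *_ = np.linalg.lstsq(X, premium, rcond=None)
print(b)   # intercept, income and marital-status coefficients
           # (approximately 5.3, 0.091 and 8.9, matching the equation quoted above)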

2.6 Partial Correlation Coefficients


So far, we have discussed total correlation coefficients and the multiple correlation coefficient. In the above case of net profit planning, we had three variables viz. x1, x2 and y. The correlation coefficients between any two of these variables, viz. ryx1, ryx2 and rx1x2, are called total correlation coefficients; a total correlation coefficient indicates the relationship between two variables ignoring the presence or effect of the third variable. The multiple correlation coefficient Ry.x1x2 indicates the correlation between y and the estimate of y obtained from the regression equation of y on x1 and x2. The partial correlation coefficients are defined as the correlation between any two variables when the effect of the third variable on these two variables is removed, or when the third variable is held constant. For example, ryx1.x2 means the correlation between y and x1 when the effect of x2 on y and x1 is removed, or x2 is held constant. The various partial correlation coefficients, viz. ryx1.x2, ryx2.x1 and rx1x2.y, are calculated from the following formulas:

ryx1.x2 = (ryx1 – ryx2 · rx1x2) / √[(1 – r²yx2)(1 – r²x1x2)]        …(11)

ryx2.x1 = (ryx2 – ryx1 · rx1x2) / √[(1 – r²yx1)(1 – r²x1x2)]        …(12)

rx1x2.y = (rx1x2 – ryx1 · ryx2) / √[(1 – r²yx1)(1 – r²yx2)]         …(13)

The values of the above partial correlation coefficients ryx1.x2, ryx2.x1 and rx1x2.y are 0.997, 0.977 and –0.973, respectively.
The interpretation of ryx2.x1 = 0.977 is that it indicates the extent of linear correlation between y and x2 when x1 is held constant, or when its impact on y and x2 is removed.
Similarly, the interpretation of rx1x2.y = –0.973 is that it indicates the extent of (negative) linear correlation between x1 and x2 when y is held constant, or when its impact on x1 and x2 is removed.
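As an illustrative sketch (an addition, not from the original text), formulas (11) to (13) can be applied to the total correlations computed from the supermarket data; the printed values can be compared with those quoted above.

import numpy as np

y  = np.array([5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2])
x1 = np.array([20, 15, 18, 20, 16, 25, 22, 30, 24, 25], float)
x2 = np.array([5, 5, 6, 5, 6, 6, 4, 7, 3, 4], float)

def partial_corr(r_ab, r_ac, r_bc):
    """Correlation of a and b when the third variable c is held constant."""
    return (r_ab - r_ac * r_bc) / np.sqrt((1 - r_ac**2) * (1 - r_bc**2))

r = np.corrcoef([y, x1, x2])                 # total correlation matrix of (y, x1, x2)
r_yx1, r_yx2, r_x1x2 = r[0, 1], r[0, 2], r[1, 2]

print(partial_corr(r_yx1, r_yx2, r_x1x2))    # ryx1.x2 -- compare with the text
print(partial_corr(r_yx2, r_yx1, r_x1x2))    # ryx2.x1
print(partial_corr(r_x1x2, r_yx1, r_yx2))    # rx1x2.y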

2.7 Partial Regression Coefficients


The regression coefficients b1 and b2 in the regression equation (1) are known as partial regression coefficients. The value of b1 indicates the change that will be caused in y by a unit change in x1 when x2 is held constant. Similarly, b2 indicates the amount by which y will change for a unit change in x2 when x1 is held constant. For illustration, in the regression equation, the interpretation of the value of 'b1', i.e. 0.196, is that if x2 (sales of non-food items) is held constant, then for every rise of Rs. 1 Crore in the sales of food items, on an average, net profit will rise by Rs. 19.6 lakhs. Similarly, the interpretation of 'b2', i.e. 0.287, is that if x1 (sales of food items) is held constant, then for every rise of Rs. 1 Crore in the sales of non-food items, on an average, net profit will rise by Rs. 28.7 lakhs.

2.8 Beta Coefficients


If the variables are standardised, i.e. measured as deviations from their means and divided by their standard deviations, then the corresponding regression coefficients are called beta coefficients. Their advantage, as in simple regression analysis (vide Section 8.5.3), is that correlation and regression between standardised variables solve the problem of dealing with different units of measurement of the variables. Thus, the magnitudes of these coefficients can be used to compare the relative contribution of each independent variable to the prediction of the dependent variable. Incidentally, for the data in Illustration 1, relating to sales and net profit of supermarkets, reproduced below,

                Net Profit     Sales of Food        Sales of Non-Food               Standardised Variables = (Variable – Mean)/s.d.
Supermarket     (Rs. Crores)   Items (Rs. Crores)   Items (Rs. Crores)   x1i²       Y         X1        X2
                yi             x1i                  x2i
1               5.6            20                   5                    400        -0.331    -0.34     -0.09
2               4.7            15                   5                    225        -1.291    -1.48     -0.09
3               5.4            18                   6                    324        -0.544    -0.8      0.792
4               5.5            20                   5                    400        -0.437    -0.34     -0.09
5               5.1            16                   6                    256        -0.864    -1.25     0.792
6               6.8            25                   6                    625        0.949     0.798     0.792
7               5.8            22                   4                    484        -0.117    0.114     -0.97
8               8.2            30                   7                    900        2.443     1.937     1.673
9               5.8            24                   3                    576        -0.117    0.57      -1.85
10              6.2            25                   4                    625        0.309     0.798     -0.97
Sum             59.1           215                  51                   4815
Mean            5.91           21.5                 5.1
Variance        0.88           19.25                1.29
s.d.            0.937          4.387                1.136
the multiple regression equation in terms of the standardised variables is as follows:
Y = 0.917 X1 + 0.348 X2
The interpretation of the beta coefficients is that the relative contribution of sales of food items to net profit is 0.917, compared with 0.348 for sales of non-food items.
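A hedged sketch (an addition, not from the original text) of the same computation: standardising all the variables and refitting gives the beta coefficients directly.

import numpy as np

y  = np.array([5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2])
x1 = np.array([20, 15, 18, 20, 16, 25, 22, 30, 24, 25], float)
x2 = np.array([5, 5, 6, 5, 6, 6, 4, 7, 3, 4], float)

def standardise(v):
    return (v - v.mean()) / v.std()        # population s.d., as in the table above

Y, X1, X2 = standardise(y), standardise(x1), standardise(x2)
X = np.column_stack([X1, X2])              # no intercept term: the standardised means are 0
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(beta)                                # approximately (0.917, 0.348)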

2.9 Properties of R2
As mentioned earlier, the coefficient of multiple correlation R is the ordinary or total correlation between the dependent variable and its estimate as derived from the regression equation, i.e. R = r(yi, ŷi), and as such is always positive. Further,
( i ) R² is greater than or equal to the square of the total correlation coefficient of y with any one of the variables x1, x2, …, xk.
( ii ) R² is high if the correlation coefficients between the independent variables, i.e. the rxixj's, are all low.
( iii ) If rxixj = 0 for every i ≠ j, then
R² = r²yx1 + r²yx2 + r²yx3 + … + r²yxk
2.10 Multicollinearity
Multicollinearity refers to the existence of high correlation between independent variables. Even if the regression equation as a whole is significant, it could happen that, due to multicollinearity, the individual regression coefficients are insignificant, indicating that they do not have much impact on the value of the dependent variable. When two independent variables are highly correlated, they basically convey the same information, and it logically appears that only one of the two variables needs to be used in the regression equation.

If the value of R² is high and the multicollinearity problem exists, the regression equation can still be used for prediction of the dependent variable given values of the independent variables. However, it should not be used for interpreting the partial regression coefficients as indicating the impact of the independent variables on the dependent variable.

The multicollinearity among independent variables can be removed with the help of Principal Component Analysis, discussed in this Chapter. It involves forming a new set of independent variables which are linear combinations of the original variables in such a way that there is no multicollinearity among the new variables.

If there are two such correlated variables, sometimes the exclusion of one may result in an abnormal change in the regression coefficient of the other variable; sometimes even the sign of the regression coefficient may change from + to – or vice versa, as demonstrated for the data given below.

y x1 x2
10 12 25
18 16 21
18 20 22
25 22 18
21 25 17
32 24 15

It may be verified that the correlation between x1 and x2 is –0.91, indicating the existence of multicollinearity.

It may be verified that the regression equation of y on x1 is

y = – 3.4 + 1.2 x1        (i)

the regression equation of y on x2 is

y = 58.0 – 1.9 x2        (ii)

and the regression equation of y on x1 and x2 is

y = 70.9 – 0.3 x1 – 2.3 x2        (iii)

It may be noted that the coefficient of x1 (1.2) in (i), which was positive when the regression equation of y on x1 alone was considered, has become negative (–0.3) in equation (iii), when x2 is also included in the regression equation. This is due to the high correlation of –0.91 between x1 and x2. It is, therefore, desirable to take adequate care of multicollinearity.
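The sign change described above is easy to verify; the following minimal sketch (an addition, not from the text) refits the three equations on the same six observations.

import numpy as np

y  = np.array([10, 18, 18, 25, 21, 32], float)
x1 = np.array([12, 16, 20, 22, 25, 24], float)
x2 = np.array([25, 21, 22, 18, 17, 15], float)

print(np.corrcoef(x1, x2)[0, 1])     # about -0.91: strong multicollinearity

def fit(X, y):
    """OLS fit of y on the given list of predictor arrays, with an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(X))
    return np.linalg.lstsq(X, y, rcond=None)[0]

print(fit([x1], y))        # y on x1 alone: positive coefficient of x1
print(fit([x2], y))        # y on x2 alone: negative coefficient of x2
print(fit([x1, x2], y))    # y on both: the coefficient of x1 changes sign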

2.11 Tests for Significance of Regression Model and Regression Coefficients

In any given situation, one can always define a dependent variable and some independent variables, and thus define a regression model or equation. However, an important issue is whether all the defined variables in the model, as a whole, have a real influence on the dependent variable and are able to explain the variation in the dependent variable. For example, one may postulate that the sales of a company manufacturing a paint (defined as the dependent variable) depend on the expenditure on R & D, advertising expenditure, price of the paint, discount to wholesalers and the number of salesmen. While these variables might be found to be significantly impacting the sales of the company, it could also happen that they do not influence the sales, as the more important factors could be the quality of the paint and the availability and pricing of another similar type of paint. Further, even if the variables mentioned above are found to be significantly contributing, as a whole, to the sales of the paint, one or some of them might not be influencing the sales in a significant way.

For example, it might happen that the sales are insensitive to the advertising expenses, i.e. increasing the expenditure on advertising might not increase the sales in a significant way. In such a case, it is advisable to exclude this variable from the model and use only the other variables. As explained in the next Section, it is not advisable to include a variable unless its contribution to the variation in the dependent variable is significant. These issues will be explained with examples in
subsequent sections.

2.12 Regression Model with More Than Two Independent Variables:


So far, we have discussed only the derivation of the regression equation and the interpretation of the correlation and regression coefficients, and we have confined ourselves to only two independent variables. However, sometimes it is advisable to have more than two independent variables. Also, for using the equation to interpret and predict the values of the dependent variable with the help of the independent variables, there are certain assumptions that must hold for the regression equation to be valid. We also have to test whether all or only some of the independent variables really have a significant impact on the dependent variable. In fact, we also have to ensure that only the optimum number of variables is used in the final regression equation. While details will be discussed later on, it may be mentioned for now that a mere increase in the number of independent variables does not ensure better predictive capability of the regression equation. Each variable has to compete with the others to be included or retained in the regression equation.

2.13 Selection of Independent Variables


Following are three prime methods of selecting independent variables in a regression model:
 General Method
 Hierarchical method
 Stepwise Method
These are described below.
2.13.1 General Method (Standard Multiple Regression)
• In standard multiple regression, all of the independent variables are entered into the regression
equation at the same time
• Multiple R and R² measure the strength of the relationship between the set of independent
variables and the dependent variable. An F test is used to determine if the relationship can be
generalized to the population represented by the sample.
• A t-test is used to evaluate the individual relationship between each independent variable and
the dependent variable.
This method is used when a researcher knows exactly which independent variables contribute
significantly in the regression equation. In this method, all the independent variables are considered
together and the regression model is derived.

It is often difficult to identify the exact set of variables that are significant in the regression model and
the process of finding these may have many steps or iterations as explained through an illustration in
the next section. This is the limitation of this method. This limitation can be overcome in the stepwise
regression method.

2.13.2 Hierarchical Method (Hierarchical Multiple Regression)


• In hierarchical multiple regression, the independent variables are entered in two stages.
• In the first stage, the independent variables that we want to control for are entered into the
regression. In the second stage, the independent variables whose relationship we want to
examine after the controls are entered.
• A statistical test of the change in R² from the first stage is used to evaluate the importance of
the variables entered in the second stage.

This method is used when a researcher has clearly identified three different types of variables namely
dependent variable, independent variable/s and the control variable/s.

This method helps the researcher to find the relationship between the independent variables and the
dependent variable, in the presence of some variables that are controlled in the experiment. Such
variables are termed as control variables. The control variables are first entered in the hierarchy, and
then the independent variables are entered. This method is available in most statistical software
including SPSS.

2.13.3 Stepwise Method (Stepwise Multiple Regression)


• Stepwise regression is designed to find the most parsimonious set of predictors that are most
effective in predicting the dependent variable.
• Variables are added to the regression equation one at a time, using the statistical criterion of
maximizing the R² of the included variables.
• When none of the possible additions can make a statistically significant improvement in R², the analysis stops.

This method is used when a researcher wants to find out, which independent variables significantly
contribute in the regression model, out of a set of independent variables. This method finds the best
fit model, i.e. the model which has a set of independent variables that contribute significantly in the
regression equation.
For example, if a researcher has identified three independent variables that may affect the dependent variable, and wants to find the combination of these variables which contributes significantly in the regression model, the researcher may use stepwise regression. The software gives the exact set of variables that contribute, and are therefore worth keeping, in the model.
There are three popular stepwise selection methods, namely forward regression, backward regression and stepwise regression. In forward regression, one independent variable is entered with the dependent variable and the regression equation is arrived at, along with the other tests like ANOVA, t-tests etc. In the next iteration, one more independent variable is added and the new model is compared with the previous model. If the new variable contributes significantly to the model it is kept; otherwise it is dropped from the model. This process is repeated for each remaining independent variable, thus arriving at a significant model containing all the contributing independent variables. The backward method works in exactly the opposite way: initially all the variables are considered, and they are removed one by one if they do not contribute to the model.
The stepwise regression method is a combination of the forward selection and backward elimination methods. The basic difference between this and the other two methods is that, even if a variable is selected in the beginning or gets selected subsequently, it has to keep on competing with the other entering variables at every stage to justify its retention in the equation.

These steps are explained in the next Section, with an example.

2.14 Selection of Independent Variables in a Regression Model


Whenever we have several independent variables which influence a dependent variable, an issue arises as to whether it is worthwhile to retain all the independent variables, or whether it is better to include only those variables which have relatively more influence on the dependent variable than the others. There are several methods to select the most appropriate or significant variables out of the given set of variables. We shall describe one of the methods, using R² and adjusted R² as the selection criteria. The method is illustrated with the help of the data given in the following example.

Example 2
The following Table gives certain parameters about some of the top rated companies in the ET 500
listings published in the issue of February 2006

Sr. No  Company                      M-Cap Oct '05      Net Sales Sept '05   Net Profit Sept '05   P/E as on Oct 31 '05
                                     Amount    Rank     Amount    Rank       Amount    Rank         Amount    Rank
1       INFOSYS TECHNOLOGIES         68560     3        7836      29         2170.9    10           32        66
2       TATA CONSULTANCY SERVICES    67912     4        8051      27         1831.4    11           30        74
3       WIPRO                        52637     7        8211      25         1655.8    13           31        67
4       BHARTI TELE-VENTURES *       60923     5        9771      20         1753.5    12           128       3
5       ITC                          44725     9        8422      24         2351.3    8            20        183
6       HERO HONDA MOTORS            14171     24       8086      26         868.4     32           16        248
7       SATYAM COMPUTER SERVICES     18878     19       3996      51         844.8     33           23        132
8       HDFC                         23625     13       3758      55         1130.1    23           21        154
9       TATA MOTORS                  18881     18       18363     10         139       17           14        304
10      SIEMENS                      7848      49       2753      75         254.7     80           38        45
11      ONGC                         134571    1        37526     5          14748.1   1            9         390
12      TATA STEEL                   19659     17       14926     11         3768.6    5            5         469
13      STEEL AUTHORITY OF INDIA     21775     14       29556     7          6442.8    3            3         478
14      NESTLE INDIA                 8080      48       2426      85         311.9     75           27        99
15      BHARAT FORGE CO.             6862      55       1412      128        190.5     97           37        48
16      RELIANCE INDUSTRIES          105634    2        74108     2          9174.0    2            13        319
17      HDFC BANK                    19822     16       3563      58         756.5     37           27        98
18      BHARAT HEAVY ELECTRICALS     28006     12       11200     17         1210.1    21           25        116
19      ICICI BANK                   36890     10       11195     18         2242.4    9            16        242
20      MARUTI UDYOG                 15767     22       11601     16         988.2     26           17        213
21      SUN PHARMACEUTICALS          11413     29       1397      130        412.2     66           30        75

* The data about Bharti Tele-Ventures is not considered for analysis because its P/E ratio is exceptionally high.

In the above example, we take Market Capitalisation as the dependent variable, and Net Profit, P/E Ratio and Net Sales as the independent variables.

We may add that this example is to be viewed as an illustration of the selection of the optimum number of independent variables, and not as an exercise in financial analysis.
The notations used for the variables are as follows.
y : Market Capitalisation
x1 : Net Sales
x2 : Net Profit
x3 : P/E Ratio
Step I :

First of all, we calculate the total correlation coefficients among the independent variables, as well as the correlation coefficients of the dependent variable with each independent variable. These are tabulated below.

              Net Sales   Net Profit   P/E Ratio
Net Sales      1.0000
Net Profit     0.7978      1.0000
P/E Ratio     -0.5760     -0.6004      1.0000
Market Cap     0.6874      0.8310     -0.2464
We note that the correlation of y with x2 is highest. We therefore start by taking only this variable in
the regression equation.
Step II :
The regression equation of y on x2 is
y = 15465 + 7.906 x2

The values of R² and adjusted R² are: R² = 0.6906, adjusted R² = 0.6734.


Step III :
Now, we derive two regression equations, one by adding x1 and one by adding x3 to see which
combination viz. x2 and x3 or x2 and x1 is better.
The regression equation of y on x2 and x1 is
Y = 14989 +7.397 x2 + 0.135 x1
The values of R² and adjusted R² are: R² = 0.6922, adjusted R² = 0.656.

The regression equation of y on x2 and x3 is


Y = – 19823 +10.163 x2 + 1352.4 x3
The values of R² and adjusted R² are: R² = 0.7903, adjusted R² = 0.7656.

Since the adjusted R² for the combination x2 and x3 (0.7656) is higher than the adjusted R² for the combination x2 and x1 (0.656), we select x3 as the additional variable along with x2.
It may also be noted that R² with the variables x2 and x3 (0.7903) is more than the value of R² with only the variable x2 (0.6906). Thus it is advisable to have x3 along with x2 in the model.

Step IV :
Now we include the last variable viz. x1 to have the model as
Y = bo + b1x1 + b2x2 + b3 x3
The requisite calculations are too cumbersome to be carried out manually, and, therefore, we use
Excel spreadsheet which yields the following regression equation.
Y = – 23532 + 0.363x1 + 8.95x2 + 1445.6x3
The values of R² and adjusted R² are: R² = 0.8016, adjusted R² = 0.7644.
It may be noted that the inclusion of x1 in the model has very marginally increased the value of R² from 0.7903 to 0.8016, but the adjusted R² has come down from 0.7656 to 0.7644. Thus it is not worthwhile to add the variable x1 to the regression model having the variables x2 and x3.

Step V :
The advisable regression model is therefore the one including only x2 and x3:
Y = – 19823 + 10.163 x2 + 1352.4 x3        (14)
This is the best regression equation fitted to the data on the basis of the adjusted R² criterion, as discussed above. We have discussed the same example to illustrate the method using SPSS in Section 2.17.
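The selection procedure of Steps I to V can be automated. The sketch below is an addition, not from the text; the column names in the final comment are illustrative assumptions. At each step it adds the candidate variable that yields the highest adjusted R² and stops when no addition improves it.

import numpy as np
import pandas as pd

def adj_r2_fit(df, y_col, x_cols):
    """Fit y_col on x_cols by OLS and return the adjusted R-squared, equation (10)."""
    y = df[y_col].to_numpy(float)
    X = np.column_stack([np.ones(len(df))] + [df[c].to_numpy(float) for c in x_cols])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    r2 = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    n, k = len(y), len(x_cols)
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

def forward_select(df, y_col, candidates):
    """Greedy forward selection driven by adjusted R-squared."""
    chosen, best = [], -np.inf
    while True:
        trials = [(adj_r2_fit(df, y_col, chosen + [c]), c) for c in candidates if c not in chosen]
        if not trials:
            break
        score, var = max(trials)
        if score <= best:
            break                      # no further improvement in adjusted R-squared
        chosen.append(var)
        best = score
    return chosen, best

# Illustrative call (hypothetical DataFrame and column names):
# forward_select(et500, "m_cap", ["net_sales", "net_profit", "pe_ratio"])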
2.15 Generalised Regression Model

In general, a regression equation, also referred to as a model, is written as follows:

yi = b0 + b1x1i + b2x2i + b3x3i + … + bkxki + ei        …(15)

where there are k independent variables viz. x1, x2, x3, … , xk, and ei is the error or residual term which is not explained by the regression model.

2.15.1 Assumptions for the Multiple Regression Model


There are certain assumptions about the error terms that ought to hold good for the regression
equation to be useful for drawing conclusions from it or using it for prediction purposes.
These are :
(i) The distribution of ei s is normal
The implication of this assumption is that the errors are symmetrical with both positive and
negative values.

(ii) E (ei) = 0
This assumption implies that the sum of positive and negative errors is zero, and thus they
cancel out each other.

(iii) Var(ei) = σ² for all values of i (Homoscedasticity)

This assumption means that the variance or fluctuations in all error terms are of the same magnitude.
Heteroscedasticity often occurs when there is a large difference among the sizes of the observations. The classic example of heteroscedasticity is that of income versus food consumption. As one's income increases, the variability of food consumption increases: a poorer person spends a rather constant amount by always eating essential food, while a wealthier person may, in addition to essential food, occasionally spend on an expensive meal. Those with higher incomes therefore display a greater variability of food consumption.

Ideally, residuals are randomly scattered around 0 (the horizontal line) providing a relatively
even distribution

Heteroscedasticity is indicated when the residuals are not evenly scattered around the line.
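As a hedged illustration (an addition, not part of the original text), one common way to check this assumption is to examine the residuals of a fitted model and run the Breusch-Pagan test; the data below are simulated purely for demonstration, and the statsmodels package is assumed to be available.

import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

rng = np.random.default_rng(0)                 # illustrative data, not from the book
x = rng.uniform(1, 10, 100)
y = 2 + 0.5 * x + rng.normal(0, 0.3 * x)       # error variance grows with x

X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(fit.resid, X)
print(lm_pvalue)      # a small p-value signals heteroscedasticity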

(iv) r(ei, ej) = 0 for all i ≠ j (No autocorrelation)

This assumption implies that the error terms are uncorrelated with each other, i.e. one error term does not influence another error term.
• In regression analysis using time series data, autocorrelation of the residuals (the 'error terms' of econometrics) is a common problem.
• Autocorrelation violates the ordinary least squares (OLS) assumption that the error terms are uncorrelated. While it does not bias the OLS coefficient estimates, the standard errors tend to be underestimated (and the t-scores overestimated) when the autocorrelations of the errors at low lags are positive.

The traditional test for the presence of first-order autocorrelation is the Durbin–Watson statistic.
Other than the above assumptions, regression analysis also requires that the independent variables should not be related to each other; in other words, there should not be any multicollinearity.
• Indicators that multicollinearity may be present in a model:
• 1) Large changes in the estimated regression coefficients when a predictor variable is added or deleted
• 2) Insignificant regression coefficients for the affected variables in the multiple regression, but a rejection of the hypothesis that those coefficients are insignificant as a group (using an F-test)
• 3) Large changes in the estimated regression coefficients when an observation is added or deleted
• A formal detection measure for multicollinearity is the tolerance, or equivalently the variance inflation factor (VIF): Tolerance = 1 – R², where this R² is obtained by regressing the predictor concerned on the other predictors, and VIF = 1 / Tolerance (a small computational sketch follows below).
• A tolerance of less than 0.20 and/or a VIF of 5 and above indicates a multicollinearity problem.
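The sketch referred to above (an addition, not from the text) computes tolerance and VIF directly from their definitions, by regressing each predictor on the remaining predictors:

import numpy as np

def tolerance_and_vif(X):
    """X: (n, k) array of predictor columns. Returns a (tolerance, VIF) pair per column."""
    n, k = X.shape
    out = []
    for j in range(k):
        yj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        b, *_ = np.linalg.lstsq(others, yj, rcond=None)
        resid = yj - others @ b
        r2 = 1 - resid @ resid / ((yj - yj.mean()) @ (yj - yj.mean()))
        tol = 1 - r2
        out.append((tol, 1 / tol))
    return out

# Example with the six observations used in Section 2.10:
x1 = np.array([12, 16, 20, 22, 25, 24], float)
x2 = np.array([25, 21, 22, 18, 17, 15], float)
print(tolerance_and_vif(np.column_stack([x1, x2])))   # low tolerance, VIF above 5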
In the presence of multicollinearity, the estimate of one variable's impact on y while controlling for
the others tends to be less precise than if predictors were uncorrelated with one another. The usual
interpretation of a regression coefficient is that it provides an estimate of the effect of a one unit
change in an independent variable, X1, holding the other variables constant. If X1 is highly correlated
with another independent variable, X2, in the given data set, then we only have observations for
which X1 and X2 have a particular relationship (either positive or negative). We don't have
observations for which X1 changes independently of X2, so we have an imprecise estimate of the
effect of independent changes in X1.
Multicollinearity does not actually bias the results; it just produces large standard errors for the related independent variables. With enough data, these errors will be reduced.
What to do if multicollinearity is present:
• 1) Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't affect the fitted model provided that the predictor variables follow the same pattern of multicollinearity as the data on which the regression model is based.

• 2) Drop one of the variables. An explanatory variable may be dropped to produce a model
with significant coefficients. However, you lose information (because you've dropped a
variable). Omission of a relevant variable results in biased coefficient estimates for the
remaining explanatory variables.
• 3) Obtain more data. This is the preferred solution. More data can produce more precise
parameter estimates (with lower standard errors).
• Note: Multicollinearity does not impact the reliability of the forecast, but rather impacts the
interpretation of the explanatory variables. As long as the collinear relationships in your
independent variables remain stable over time, multicollinearity will not affect the forecast.

Sample Size Considerations


The minimum ratio of observations to variables is 5 to 1, but the preferred ratio is 15 or 20 to 1, and
this should increase when stepwise estimation is used.
Variable Transformations
• Nonmetric variables can only be included in a regression analysis by creating dummy
variables.
• Dummy variables can only be interpreted in relation to their reference category.

Assessing Statistical Assumptions


• Testing assumptions must be done not only for each dependent and independent variable, but
for the variate as well.
• Graphical analyses (i.e., partial regression plots, residual plots and normal probability plots)
are the most widely used methods of assessing assumptions for the variate.
• Remedies for problems found in the variate must be accomplished by modifying one or more
independent variables
Regression Model & fit : the researcher must accomplish three basic tasks:
1. Select a method for specifying the regression model to be estimated,
2. Assess the statistical significance of the overall model in predicting the dependent
variable, and
3. Determine whether any of the observations exert an undue influence on the
results.

2.16 Applications in Finance

In this section, we indicate financial applications of regression analysis in some aspects relating to
stock market.
(i) Individual Stock Rates of Return, Payout Ratio, and Market Rates of Return
Let the relationship of the rate of return of a stock with the payout ratio, defined as the ratio of dividend per share to earnings per share, and with the rate of return on BSE SENSEX stocks as a whole, be
y = b0 + b1 ( payout ratio ) + b2 ( rate of return on Sensex)
Let us assume that the relevant data is available, and that the data collected over the last 10 years yields the following equation:
y = 1.23 – 0.22 payout ratio + 0.49 rate of return
The coefficient –0.22 indicates that for a 1% increase in the payout ratio, the return on the stock reduces by 0.22% when the rate of return on the SENSEX is held constant. Further, the coefficient 0.49 implies that for a 1% increase in the rate of return on the BSE SENSEX, the return on the stock increases by 0.49% when the payout ratio is held constant.
Further, let the calculations yield the value of R2 as 0.66.

The value of R2 = 0.66 implies that 66% of variation in the rate of return on the investment in the
stock is explained by pay-out ratio and the return on BSE SENSEX.

(ii) Determination of Price per Share


To further demonstrate application of multiple regression techniques, let us assume that a cross-
section regression equation is fitted with dependent variable being the price per share (y) of the 30
companies used to compile the SENSEX, and the independent variables being the dividend per share
(x1) and the retained earnings per share (x2 ) for the 30 companies. As mentioned earlier, in a cross-
section regression, all data come from a single period.
Let us assume that the relevant data is available, and that the data collected for the SENSEX stocks in a year yields the following regression equation:
y = 25.45 + 15.30 x1 + 3.55 x2

The regression equation could be used for interpreting regression coefficients and predicting average
price per share given the values of dividend paid and earnings retained.
The coefficient 15.30 of x1 (dividend per share) indicates that the average price per share increases by Rs. 15.30 when the dividend per share increases by Re. 1, with retained earnings held constant.
The regression coefficient 3.55 of x2 means that when the retained earnings increases by Rs. 1.00, the
price per share increases by Rs 3.55 when dividend per share is held constant.

The use of multiple regression analysis in carrying out cost analysis was demonstrated by Benston in 1966. He collected data from a firm's accounting, production and shipping records to establish a multiple regression equation.

Terms in Regression Analysis


• Explained variance = R2 (coefficient of determination).
• Unexplained variance = residuals (error).
• Adjusted R-Square = reduces the R2 by taking into account the sample size and the number
of independent variables in the regression model (It becomes smaller as we have fewer
observations per independent variable).
• Standard Error of the Estimate (SEE) = a measure of the accuracy of the regression
predictions. It estimates the variation of the dependent variable values around the regression
line. It should get smaller as we add more independent variables, if they predict well.
• Total Sum of Squares (SST) = total amount of variation that exists to be explained by the
independent variables. TSS = the sum of SSE and SSR.
• Sum of Squared Errors (SSE) = the variance in the dependent variable not accounted for by
the regression model = residual. The objective is to obtain the smallest possible sum of
squared errors as a measure of prediction accuracy.
• Sum of Squares Regression (SSR) = the amount of improvement in explanation of the
dependent variable attributable to the independent variables.
• Variance Inflation Factor (VIF) – measures how much the variance of the regression coefficients is inflated by multicollinearity problems. A VIF of 1 indicates no correlation among the independent measures; a VIF somewhat above 1 indicates some association between the predictor variables, but generally not enough to cause problems. A maximum acceptable VIF value would be 5; anything higher would indicate a problem with multicollinearity.
• Tolerance – the amount of variance in an independent variable that is not explained by the other independent variables. If the other variables explain a lot of the variance of a particular independent variable, we have a problem with multicollinearity. Thus, small values of tolerance indicate problems of multicollinearity. The minimum cutoff value for tolerance is typically 0.2; that is, a tolerance value smaller than 0.2 indicates a problem of multicollinearity.

2.17 Multiple Regression Using SPSS


SPSS is the most commonly used statistical tool for performing multiple regression analysis. We shall explain, in brief, the method and terms used in SPSS, as well as the interpretation of the SPSS output. We have discussed in Section 2.13 three methods of entering variables for multiple regression analysis; SPSS allows selecting one of these three methods while entering the variables. We shall illustrate Example 2, given in Section 2.14, using SPSS. It may be noted that, as Bharti Tele-Ventures has too high a P/E ratio, it is omitted from the analysis.
It may also be noted that we will discuss the output, highlighting what is different from the manual method used earlier.
The regression analysis can be done in SPSS by using ‘Analyse’ option and selecting Linear
regression as shown in the following snapshot.
SPSS Snapshot MRA 1

The next box that appears is shown in the following SPSS snapshot
SPSS Snapshot MRA 2
1. The dependent variable is to be entered here.
2. The independent variables are to be entered here.
3. If the method of selection is Hierarchical, all control variables are selected in the list of independent variables first and then, by clicking 'Next', the rest of the independent variables are entered.
4. The method of entering the variables is to be selected here.

If the method of selection is the General method, 'Enter' should be selected from the drop-down box. If the method is stepwise, 'Stepwise' should be selected from the list. We have explained the criteria for selecting the appropriate method in Section 2.13.

The next step is to click on the 'Statistics' button at the bottom of the box. When one clicks on Statistics, the following box will appear.
SPSS Snapshot MRA 3
1. Tick the check boxes for Model fit and Descriptives.
2. Tick Collinearity diagnostics, Part and partial correlations, and the Durbin-Watson statistic.
3. Next, click Continue.

The Durbin-Watson Statistics is used to check the assumption of regression analysis which states
that the error terms should be uncorrelated. While its desirable value is 2, the desirable range is 1.5 to
2.5.
For multicollinearity to be absent, VIF should be less than 5 or Tolerance should be at least 0.2.
After clicking ‘Continue’, SPSS will return to screen as in SPSS Snapshot MRA 2.
In this snapshot, click on the 'Plots' button at the bottom. The next box that would appear is given in the following Snapshot MRA 4.
SPSS Snapshot MRA 4
1. Tick the check boxes Histogram and Normal probability plot.
2. Next, click Continue.

The residual analysis is done to check the assumptions of multiple regression. According to the normality assumption, the residuals should be normally distributed; this can be checked by viewing the histogram and the normal probability plot.
After clicking 'Continue', SPSS will return to the screen shown in SPSS Snapshot MRA 2.
By clicking ‘OK’, SPSS will carry out the analysis and give the output in the Output View.

We will discuss two outputs using the same data. One, by using General method for entering
variables, and the other by selecting stepwise method for entering variables.
General Method for Entering Variables – SPSS Output
Regression

Descriptive Statistics

                      Mean        Std. Deviation   N
m_cap_amt_oct05       36285.80    34367.670        20
Net_Sal_sept05        13419.30    17025.814        20
Net_Prof_sept05        2633.38     3612.171        20
peratio_oct05            21.70       10.037        20

Descriptive statistics is generally useful in understanding overall distributions of variables.


Correlations

                                       m_cap_amt_oct05   Net_Sal_sept05   Net_Prof_sept05   peratio_oct05
Pearson Correlation  m_cap_amt_oct05        1.000             .687             .831            -.246
                     Net_Sal_sept05          .687            1.000             .798            -.576
                     Net_Prof_sept05         .831             .798            1.000            -.600
                     peratio_oct05          -.246            -.576            -.600            1.000
Sig. (1-tailed)      m_cap_amt_oct05           .              .000             .000             .148
                     Net_Sal_sept05          .000               .              .000             .004
                     Net_Prof_sept05         .000             .000               .              .003
                     peratio_oct05           .148             .004             .003               .
N                    (all variables)           20               20               20               20

The 'part and partial correlations' output is useful in understanding the relationships between the independent and dependent variables. The regression analysis is valid only if the independent variables are not strongly interrelated; if they are related to each other, they may lead to misinterpretation of the regression equation. This is termed multicollinearity, and its impact is described in Section 2.10. The above correlation matrix is useful in checking the interrelationships between the independent variables. In the above table, the correlations of the independent variables Net Sales and Net Profit with the dependent variable are high (0.687 and 0.831), which means these variables are related to market capitalisation, whereas the correlations between the independent variables themselves (0.798, -0.576 and -0.600) are also fairly high, which means that this data may have multicollinearity. Generally, very high correlations between the independent variables, say more than 0.9, may make the entire regression analysis unreliable for interpreting the regression coefficients.
Variables Entered/Removed(b)

Model   Variables Entered                                    Variables Removed   Method
1       peratio_oct05, Net_Sal_sept05, Net_Prof_sept05(a)    .                   Enter

a. All requested variables entered.
b. Dependent Variable: m_cap_amt_oct05

Since the method selected was Enter method or General method, this table does not communicate any
meaning.
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .895(a)  .802       .764                16682.585                    .982

a. Predictors: (Constant), peratio_oct05, Net_Sal_sept05, Net_Prof_sept05
b. Dependent Variable: m_cap_amt_oct05

This table gives the model summary for the set of independent and dependent variables. R² for the model is 0.802, which is high and means that around 80% of the variation in the dependent variable (market capitalisation) is explained by the three independent variables (net sales, net profit and P/E ratio). The Durbin-Watson statistic for this model is 0.982, which is very low; the accepted range is 1.5 to 2.5. It may, therefore, be noted as a caution that the assumption that the residuals are uncorrelated may not be valid.
ANOVA(b)

Model          Sum of Squares   df   Mean Square    F        Sig.
1  Regression  1.80E+10          3   5996219888     21.545   .000(a)
   Residual    4.45E+09         16   278308655.0
   Total       2.24E+10         19

a. Predictors: (Constant), peratio_oct05, Net_Sal_sept05, Net_Prof_sept05
b. Dependent Variable: m_cap_amt_oct05

The ANOVA table for the regression analysis indicates whether the model as a whole is significant and valid or not. The ANOVA is significant if the 'Sig.' column in the above table is less than the level of significance (generally taken as 5% or 1%). Since 0.000 < 0.01, we conclude that this model is significant.
If the model is not significant, it implies that no relationship exists between the set of variables.
Coefficients(a)

                     Unstandardized Coefficients   Standardized Coefficients                    Correlations
Model                B           Std. Error        Beta                       t        Sig.    Zero-order   Partial   Part
1  (Constant)        -23531.5    13843.842                                    -1.700   .109
   Net_Sal_sept05    .363        .381              .180                       .953     .355    .687         .232      .106
   Net_Prof_sept05   8.954       1.834             .941                       4.882    .000    .831         .774      .544
   peratio_oct05     1445.613    486.760           .422                       2.970    .009    -.246        .596      .331

a. Dependent Variable: m_cap_amt_oct05

This table gives the regression coefficients and their significance. The equation can be written as
Market Capitalisation = – 23531.5 + 0.363 × Net Sales + 8.954 × Net Profit + 1445.613 × P/E Ratio
It may be noted that the above equation is given only for the purpose of understanding how to derive the equation from the table. In this example, although the coefficients of net profit and P/E ratio are significant, the coefficient of net sales is not; since not all three coefficients are significant, this equation cannot be used for estimating market capitalisation.

Residuals Statistics(a)

                                   Minimum     Maximum      Mean       Std. Deviation   N
Predicted Value                    10307.05    135150.28    36285.80   30769.653        20
Std. Predicted Value               -.844       3.213        .000       1.000            20
Standard Error of Predicted Value  4242.990    16055.796    6689.075   3390.080         20
Adjusted Predicted Value           9759.82     139674.61    36258.65   30055.284        20
Residual                           -27441.7    28755.936    .000       15308.990        20
Std. Residual                      -1.645      1.724        .000       .918             20
Stud. Residual                     -1.879      1.809        -.009      .999             20
Deleted Residual                   -35824.0    31681.895    27.147     18618.671        20
Stud. Deleted Residual             -2.061      1.964        -.021      1.060            20
Mahal. Distance                    .279        16.649       2.850      4.716            20
Cook's Distance                    .000        .277         .059       .091             20
Centered Leverage Value            .015        .876         .150       .248             20

a. Dependent Variable: m_cap_amt_oct05

Charts

[Histogram of the regression standardized residuals for the dependent variable m_cap_amt_oct05 (Mean ≈ 0, Std. Dev. = 0.918, N = 20).]

The above chart is used to test the validity of the assumption that the residuals are normally distributed. Looking at the chart, one may conclude that the residuals are more or less normal; this can be tested formally using a Chi-square goodness-of-fit test.

Since not all three regression coefficients are significant, the Enter method cannot be used for estimation. It is advisable to use stepwise regression in this case, since it gives the most parsimonious set of variables in the equation.

Stepwise Method for Entering Variables – SPSS Output


It may be noted that, as Bharti Tele-Ventures has too high a P/E ratio, it is omitted from the analysis.

Regression
Descriptive Statistics

                      Mean        Std. Deviation   N
m_cap_amt_oct05       36285.80    34367.670        20
Net_Sal_sept05        13419.30    17025.814        20
Net_Prof_sept05        2633.38     3612.171        20
peratio_oct05            21.70       10.037        20

Correlations

                                       m_cap_amt_oct05   Net_Sal_sept05   Net_Prof_sept05   peratio_oct05
Pearson Correlation  m_cap_amt_oct05        1.000             .687             .831            -.246
                     Net_Sal_sept05          .687            1.000             .798            -.576
                     Net_Prof_sept05         .831             .798            1.000            -.600
                     peratio_oct05          -.246            -.576            -.600            1.000
Sig. (1-tailed)      m_cap_amt_oct05           .              .000             .000             .148
                     Net_Sal_sept05          .000               .              .000             .004
                     Net_Prof_sept05         .000             .000               .              .003
                     peratio_oct05           .148             .004             .003               .
N                    (all variables)           20               20               20               20

Variables Entered/Removed(a)

Model   Variables Entered   Variables Removed   Method
1       Net_Prof_sept05     .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).
2       peratio_oct05       .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100).

a. Dependent Variable: m_cap_amt_oct05

This table gives the summary of the entered variables in the model.

Model Summary(c)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .831(a)  .691       .673                19641.726
2       .889(b)  .790       .766                16637.346                    1.112

a. Predictors: (Constant), Net_Prof_sept05
b. Predictors: (Constant), Net_Prof_sept05, peratio_oct05
c. Dependent Variable: m_cap_amt_oct05

In the previous method, there was only one model. Since this is the stepwise method, it gives every model that is significant at each step. The Durbin-Watson statistic is improved compared with the previous model, but is still less than the desired range (1.5 to 2.5). The last model is generally the best model; this can be verified by the adjusted R², as the model with the highest adjusted R² is best. Model 2, which consists of the dependent variable market capitalisation and the independent variables Net Profit and P/E Ratio, is the best model. The following table gives the coefficients for the best model.
It may be noted that this model, though it does not contain the independent variable 'Net Sales', is slightly better than the previous model discussed using the 'Enter' or General method of entering variables: the previous model's adjusted R² is 0.764, while this model's adjusted R² is 0.766.
The following table gives ANOVA for all the iterations (in this case 2), and both are significant.
ANOVA(c)

Model          Sum of Squares   df   Mean Square    F        Sig.
1  Regression  1.55E+10          1   1.550E+10      40.169   .000(a)
   Residual    6.94E+09         18   385797411.3
   Total       2.24E+10         19
2  Regression  1.77E+10          2   8867988179     32.037   .000(b)
   Residual    4.71E+09         17   276801281.6
   Total       2.24E+10         19

a. Predictors: (Constant), Net_Prof_sept05
b. Predictors: (Constant), Net_Prof_sept05, peratio_oct05
c. Dependent Variable: m_cap_amt_oct05

Coefficients(a)

                     Unstandardized Coefficients   Standardized Coefficients                    Correlations
Model                B           Std. Error        Beta                       t        Sig.    Zero-order   Partial   Part
1  (Constant)        15465.085   5484.681                                     2.820    .011
   Net_Prof_sept05   7.906       1.247             .831                       6.338    .000    .831         .831      .831
2  (Constant)        -19822.8    13249.405                                    -1.496   .153
   Net_Prof_sept05   10.163      1.321             1.068                      7.691    .000    .831         .881      .854
   peratio_oct05     1352.358    475.527           .395                       2.844    .011    -.246        .568      .316

a. Dependent Variable: m_cap_amt_oct05

It may be noted in the above Table that the values of the constant and regression coefficients are the
same as in the equation (14), derived manually. The SPSS stepwise regression did this automatically,
and the results we got are the same.
The following table gives summary of excluded variables in the two models.

Excluded Variables(c)

                                                             Collinearity Statistics
Model                Beta In   t       Sig.   Partial Corr.  Tolerance
1  Net_Sal_sept05    .067(a)   .301    .767   .073           .363
   peratio_oct05     .395(a)   2.844   .011   .568           .639
2  Net_Sal_sept05    .180(b)   .953    .355   .232           .349

a. Predictors in the Model: (Constant), Net_Prof_sept05
b. Predictors in the Model: (Constant), Net_Prof_sept05, peratio_oct05
c. Dependent Variable: m_cap_amt_oct05

There may be a situation in which a researcher would like to divide the data into two parts, use one part to derive the model and use the other part to validate it. SPSS allows splitting the data into two groups, termed the estimation group and the validation group. The estimation group is used to fit the model, which is then validated using the validation group; this improves the validity of the model. This process is called cross-validation. It can be used only if the data set is large enough to fit the model. Random-variable functions in SPSS can be used to select the cases randomly from the SPSS file.
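A minimal sketch (an addition, not part of the original text) of the same estimation/validation idea, done here with NumPy instead of SPSS's random-variable functions:

import numpy as np

def split_fit_validate(X, y, train_frac=0.7, seed=42):
    """Fit an OLS model on a random estimation group and report R-squared on both groups."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    train, valid = idx[: int(train_frac * n)], idx[int(train_frac * n):]

    Xc = np.column_stack([np.ones(n), X])
    b, *_ = np.linalg.lstsq(Xc[train], y[train], rcond=None)   # fit on the estimation group

    def r2(rows):
        resid = y[rows] - Xc[rows] @ b
        return 1 - resid @ resid / ((y[rows] - y[rows].mean()) @ (y[rows] - y[rows].mean()))

    # A large drop in R-squared on the validation group warns of an over-fitted model.
    return r2(train), r2(valid)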

3 Discriminant Analysis
Discriminant analysis is basically a classifying technique that is used for classifying a given set of
objects, individuals, entities into two (or more) groups or categories based on the given data about
their characteristics. It is the process of deriving an equation called ‘Discriminant Function’ giving
relationship between one dependent variable which is categorical i.e. it takes only two values, say,
‘yes’ or ‘no’, represented by ‘1’ or ‘0’, and several independent variables which are continuous. The
independent variables, selected for the analysis, are such which contribute towards classifying an
object, individual, or entity in one of the two categories. For example, with the help of several
financial indicators, one may decide to extend credit to a company or not. The classification could
also be done in more than two categories.
Identifying a set of variables which discriminate ‘Best’ between the two groups is the first step in the
discriminant analysis. These variables are called discriminating variables.
One of the simplest examples of a discriminating variable is 'height' in the case of graduate students. Let there be a class of 50 students comprising boys and girls. Suppose we are given only roll numbers, and we are required to classify the students by sex, i.e. segregate boys and girls. One alternative is to take 'height' as the variable, and premise that all those with height equal to or more than 5'6" are boys and those below that height are girls. This classification should work well except in some cases where girls are taller than 5'6" or boys are shorter than that height. In fact, one could work out, from a large sample of students, the most appropriate value of the discriminating height. This example illustrates one fundamental aspect of discriminant analysis: in real life we cannot find discriminating variable(s) or a function that provides 100% accurate discrimination or classification; we can only attempt to find the best classification from a given set of data. Yet another example is the variable 'marks' (percentage or percentile) in an examination, which is used to classify students into two or more categories. As is well known, even marks cannot guarantee 100% accurate classification.

Discriminant analysis is used to analyze relationships between a non-metric dependent variable and
metric or dichotomous (Yes / No type or Dummy) independent variables. Discriminant analysis uses
the independent variables to distinguish among the groups or categories of the dependent variable.
The discriminant model can be valid or useful only if it is accurate. The accuracy of the model is
measured on the basis of its ability to predict the known group memberships in the categories of the
dependent variable.

Discriminant analysis works by creating a new variable called the discriminant function score
which is used to predict to which group a case belongs. The computations find the coefficients for the
independent variables that maximize the measure of distance between the groups defined by the
dependent variable.
The discriminant function is similar to a regression equation in which the independent variables are
multiplied by coefficients and summed to produce a score.

The general form of the discriminant function is:

D = b0 + b1X1 + b2X2 + … + bkXk     (16)

where
D = discriminant score,
bi = discriminant coefficients or weights, and
Xi = independent variables.
The weights bi are calculated using the criterion that the groups should differ as much as possible on the
basis of the discriminant function.
If the dependent variable has only two categories, the analysis is termed (two-group) discriminant
analysis. If the dependent variable has more than two categories, the analysis is termed Multiple
Discriminant Analysis.

In case of multiple discriminant analysis, there will be more than one discriminant function. If the
dependent variable has three categories like, high risk, medium risk, low risk, there will be two
discriminant functions. If dependent variable has four categories, there will be three discriminant
functions. In general, the number of discriminant functions is one less than the number of categories
of the dependent variable.
It may be noted that, in the case of multiple discriminant functions, each function needs to be tested
for significance before conclusions are drawn from it.

The following illustrations explain the concepts and the technique of deriving a Discriminant
function, and using it for classification. The objective in this example is to explain the concepts in a
popular manner without mathematical rigour.

Illustration 3
Suppose, we want to predict whether a science graduate, studying inter alia the subjects of Physics
and Mathematics, will turn out to be a successful scientist or not. Here, it is premised that the
performance of a graduate in Physics and Mathematics, to a large extent, contributes in shaping a
successful scientist. The next step is to select some successful and some unsuccessful scientists, and
record the marks obtained by them in Mathematics and Physics in their graduate examination. While
in a real-life application we would have to select a sufficient number, say 10 or more, in each
category, just for the sake of simplicity let the data on two successful and two unsuccessful
scientists be as follows:
                    Successful Scientists              Unsuccessful Scientists
                    Maths (M)     Physics (P)          Maths (M)     Physics (P)
Scientist 1         12            8                    11            7
Scientist 2         8             10                   5             9
Average             Ms = 10       Ps = 9               Mu = 8        Pu = 8

S : Successful     U : Unsuccessful
It may be mentioned that the marks as 8, 10, etc. are taken just for the sake of ease in calculations.
The discriminant function assumed is
Z = w1 M + w2 P
The requisite calculations on the above data yield
w1 = 9 and w2 = 23
Thus, the discriminant function works out to be
Z = 9 M + 23 P
and the cutoff discriminant score works out to be

Zc = [(9 × Ms + 23 × Ps) + (9 × Mu + 23 × Pu)] / 2
   = (9 × 10 + 23 × 9 + 9 × 8 + 23 × 8) / 2
   = 276.5
This cutoff score helps us to predict whether a graduate will turn out to be a successful
scientist or not. The scores for the two successful scientists are 292 and 302, both more than the
cutoff score of 276.5, while the scores for the two unsuccessful scientists are 260 and 252, both less than
276.5. If a young graduate gets 11 marks in Mathematics and 9 marks in Physics, his score as per the
discriminant function is 9 × 11 + 23 × 9 = 306. Since this is more than the cutoff score of
276.5, we can predict that this graduate will turn out to be a successful scientist. This is depicted
pictorially in the following diagram.
[Scatter diagram: Marks in Mathematics (x-axis, 0 to 14) against Marks in Physics (y-axis). The successful
scientists (12, 8) and (8, 10) and the new graduate (11, 9) lie on the 'successful' side of the discriminant
line, while the unsuccessful scientists (11, 7) and (5, 9) lie on the other side.]

It may be noted that both the successful scientists' scores lie above the discriminant line and both the
unsuccessful scientists' scores lie below it.
The student with the assumed marks is classified in the category of successful scientists.
This example illustrates that, with the help of past data about objects (entities, individuals, etc.) and
their classification into two categories, one can derive the discriminant function and the cutoff
score. Subsequently, if the same type of data is given for some other object, its discriminant score can
be worked out and the object classified into either of the two categories.
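The arithmetic of this illustration can be reproduced with a short Python sketch; the weights 9 and 23 and the cutoff of 276.5 are taken from the example above, and the classification rule simply compares each score with the cutoff:

    # Discriminant function from the illustration: Z = 9*M + 23*P, cutoff 276.5.
    def discriminant_score(maths, physics, w1=9, w2=23):
        return w1 * maths + w2 * physics

    CUTOFF = 276.5
    cases = {
        "successful scientist 1": (12, 8),
        "successful scientist 2": (8, 10),
        "unsuccessful scientist 1": (11, 7),
        "unsuccessful scientist 2": (5, 9),
        "new graduate": (11, 9),
    }
    for name, (m, p) in cases.items():
        z = discriminant_score(m, p)
        label = "successful" if z > CUTOFF else "unsuccessful"
        print(f"{name}: Z = {z}, classified as {label}")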

3.1 Some Other Applications of Discriminant Analysis

Some other applications of discriminant analysis are given below.


(i) Based on the past data available for a number of firms, about
• Current Ratio (defined as Current Assets ÷ Current Liabilities)
• Debt/Asset Ratio (defined as Total Debt ÷ Total Assets)
and the information on whether a firm succeeded or failed, a discriminant function could be
derived which could discriminate between successful and failed firms based on their current ratio
and debt/asset ratio.
As an example, the discriminant function in a particular case could be
Z = – 0.5 + 1.2x1 – 0.07x2
where,
x1 represents the current ratio, and

x2 represents the debt/asset ratio.


The function could be used to decide whether or not to sanction credit to an approaching firm, based
on its current ratio and debt/asset ratio (a short sketch applying this function is given after application (ii) below).
(ii) Suppose one wants to have a comparative assessment of a number of factors (performances
in various tests) responsible for the effective performance of an executive in an organisation.
Let the factors, with the corresponding variables given in brackets, be:
• Score in the admission test for MBA (x1)
• Score in the MBA programme (x2)
• Score in the internal examination after the initial induction training of one month (x3)
Let the discriminant function derived from the past data for 25 executives be:
Z = 3x1 + 4x2 + 6x3
It may be noted that x3 has the maximum weight, namely six. Thus we may conclude that the
most important factor for effective performance in the organisation is the score in the
internal examination after induction training.
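As a sketch, the hypothetical credit-screening function from application (i) could be applied in Python as follows; the firm's ratios and the zero cutoff are illustrative assumptions, since in practice the cutoff would come from the group centroids:

    def credit_score(current_ratio, debt_asset_ratio):
        # Discriminant function quoted in application (i): Z = -0.5 + 1.2*x1 - 0.07*x2
        return -0.5 + 1.2 * current_ratio - 0.07 * debt_asset_ratio

    # Hypothetical firm; here a score above zero is treated as "sanction credit".
    z = credit_score(current_ratio=1.8, debt_asset_ratio=0.6)
    print(round(z, 3), "sanction" if z > 0 else "do not sanction")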

3.2 Assumptions of Discriminant Analysis and Measure of Goodness of a Discriminant Function
For any statistical analysis to be valid, there are certain assumptions for the variables involved that
must hold good. Further, as mentioned above, one cannot derive a discriminant function that would
ensure 100% accurate classification. It is therefore logical to measure the goodness of a function that
would indicate the extent of confidence one could attach to the obtained results.

3.2.1 Assumptions of Discriminant Analysis


The first requirement for using discriminant analysis is that the dependent variable should be non-
metric and the independent variables should be metric or dummy.
The ability of discriminant analysis to derive a discriminant function that provides accurate
classifications is enhanced when the assumptions of normality, linearity, and homogeneity of
variance are satisfied. In discriminant analysis, the assumption of linearity applies to the relationships
between pairs of independent variables. This can be verified from the correlation matrix, defined in
Section 2.3. As in multiple regression, multicollinearity in discriminant analysis is identified by
examining 'tolerance' values. The problem of multicollinearity can be resolved by
removing or combining the affected variables with the help of Principal Component Analysis, discussed in
Section 6.3.
The assumption of homogeneity of variance is important in the classification stage of discriminant
analysis. If one of the groups defined by the dependent variable has greater variance than the others,
more cases will tend to be classified into that group. Homogeneity of variance is tested with Box's M
test, which tests the null hypothesis that the group variance–covariance matrices are equal. If we fail
to reject this null hypothesis and conclude that the variances are equal, we may use a pooled variance–
covariance matrix in classification. If we reject the null hypothesis and conclude that the variances are
heterogeneous, then we may be required to use separate covariance matrices in the classification, and
evaluate whether or not our classification accuracy is improved.
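SPSS computes Box's M itself; as a rough Python sketch one can at least inspect the group covariance matrices and their log determinants, the quantities SPSS reports alongside Box's M (the data frame and column names here are hypothetical):

    import numpy as np
    import pandas as pd

    # Hypothetical data: two groups measured on two metric predictors.
    rng = np.random.default_rng(0)
    df = pd.DataFrame({
        "group": np.repeat(["No", "Yes"], 50),
        "x1": rng.normal(size=100),
        "x2": rng.normal(size=100),
    })

    # Markedly different log determinants of the group covariance matrices
    # hint that the homogeneity-of-variance assumption may not hold.
    for name, g in df.groupby("group"):
        cov = np.cov(g[["x1", "x2"]].to_numpy(), rowvar=False)
        sign, logdet = np.linalg.slogdet(cov)
        print(name, round(logdet, 3))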

3.2.2 Tests Used for Measuring Goodness of a Discriminant Function

There are two tests for judging goodness of a discriminant function

(1) An F test (Wilks' lambda) is used to test whether the discriminant model as a whole is significant.
Wilks' λ for each independent variable is calculated using the formula (a small numeric sketch is
given after point (2) below):

Wilks' λ = Within-Group Sum of Squares / Total Sum of Squares

Wilks' λ lies between 0 and 1. Large values of λ indicate that there is no difference in the group
means for the independent variable, while small values of λ indicate that the group means differ for the
independent variable. The smaller the value of λ, the greater the discriminating power of the variable.

(2) If the F test shows significance, then the individual independent variables are assessed to see
which of these differ significantly (in mean) by group and are subsequently used to classify the
dependent variable.
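A minimal numpy sketch of the per-variable Wilks' λ defined in point (1), the within-group sum of squares divided by the total sum of squares, using made-up scores for two groups:

    import numpy as np

    # Made-up values of one independent variable for the two groups.
    group_a = np.array([12.0, 8.0, 10.0, 11.0])
    group_b = np.array([5.0, 7.0, 6.0, 9.0])
    all_values = np.concatenate([group_a, group_b])

    # Wilks' lambda for this variable = within-group SS / total SS.
    within_ss = sum(((g - g.mean()) ** 2).sum() for g in (group_a, group_b))
    total_ss = ((all_values - all_values.mean()) ** 2).sum()
    wilks_lambda = within_ss / total_ss
    print(round(wilks_lambda, 3))  # small values => the variable discriminates well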

3.3 Key Terms Related to Discriminant Analysis


Some key terms related to discriminant analysis are described below
Term                Description
Discriminating These are the independent variables which are used as criteria for discrimination
Variables
Discriminant A discriminant function, also called a canonical root, is a latent variable which is
Function created as a linear combination of discriminating (independent) variables as stated in
created as a linear combination of discriminating (independent) variables, as stated in equation (16). There are, in general, k-1 discriminant functions if the dependent variable
has k categories
Eigenvalue The eigenvalue for each discriminant function is defined as the ratio of the between-group to the
within-group sum of squares. The larger the eigenvalue, the better the function differentiates the
groups, and hence the better the model. There is one eigenvalue for each discriminant function.
For two-group DA, there is one discriminant function and one eigenvalue, which
accounts for all of the explained variance. If there is more than one discriminant
function, the first will be the largest and most important, the second next most important
in explanatory power, and so on.
Relative Percentage The relative percentage of a discriminant function equals a function's eigenvalue
divided by the sum of all eigenvalues of all discriminant functions in the model. Thus it
is the percent of discriminating power for the model associated with a given
discriminant function. Relative % is used to tell how many functions are important.
The Canonical Measures extent of association between the discriminant scores and the groups. When
Correlation, R*, R* is zero, there is no relation between the groups and the function. When the canonical
correlation is large, there is a high correlation between the discriminant functions and
the groups. It may be Noted that for two-group DA, the canonical correlation is
equivalent to the Pearson’s correlation of the discriminant scores with the grouping
variable.
Centroid Mean values for discriminant scores for a particular group. The number of centroids
equals the number of groups, being one for each group. Means for a group on all the
functions are the group centroids.
Discriminant Score The discriminant score, also called the DA score, is the value resulting from applying a discriminant
function formula to the data for a given case. The Z score is the discriminant score for
standardized data.
Cutoff: If the discriminant score of the function is less than or equal to the cutoff, the case is
classified as 0, or if above the cutoff, it is classified as 1. When group sizes are equal,
the cutoff is the mean of the two centroids (for two-group DA). If the groups are
unequal, the cutoff is the weighted mean.
Standardized also termed the standardized canonical discriminant function coefficients, are used to
discriminant compare the relative importance of the independent variables, much as beta weights are
coefficients used in regression. Note that importance is assessed relative to the model being
analyzed. Addition or deletion of variables in the model can change discriminant

coefficients markedly.
Functions at Group The mean discriminant scores for each of the dependent variable categories for each of
Centroids the discriminant functions in MDA. Two-group discriminant analysis has two centroids,
one for each group. We want the means to be well apart to show the discriminant
function is clearly discriminating. The closer the means, the more errors of
classification there likely will be
Discriminant Also called canonical plots, can be created in which the two axes are two of the
function plots discriminant functions (the dimensional meaning of which is determined by looking at
the structure coefficients, discussed above), and circles within the plot locate the
centroids of each category being analyzed. The farther apart one point is from another
on the plot, the more the dimension represented by that axis differentiates those two
groups. Thus these plots depict visually how well the discriminant functions separate the groups.
(Model) Wilks' Used to test the significance of the discriminant function as a whole. The "Sig." level
lambda for this function is the significance level of the discriminant function as a whole. The
researcher wants a finding of significance; the smaller the lambda, the more likely the function
is to be significant. A significant lambda means one can reject the null hypothesis that the
two groups have the same mean discriminant function scores and conclude the model is
discriminating.
ANOVA table for Another overall test of the DA model. It is an F test, where a "Sig." p value < .05 means
Discriminant scores the model differentiates discriminant scores between the groups significantly better than
chance (than a model with just the constant).
(Variable) Wilks' It can be used to test which independent variables contribute significantly to the discriminant
lambda function. The smaller the value of Wilks' lambda for an independent variable, the more
that variable contributes to the discriminant function. Lambda varies from 0 to 1, with 0
meaning group means differ (thus the more the variable differentiates the groups), and 1
meaning all group means are the same.
Dichotomous independents are more accurately tested with a chi-square test than with
Wilks' lambda for this purpose.
Classification Also called assignment, or prediction matrix or table, is used to assess the performance
Matrix or of DA. This is a table in which the rows are the observed categories of the dependent
Confusion Matrix and the columns are the predicted categories of the dependents. When prediction is
perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the
percentage of correct classifications. This percentage is called the hit ratio.
Expected hit ratio. The hit ratio is not relative to zero but to the percent that would have been correctly
classified by chance alone. For two-group discriminant analysis with a 50-50 split in the
dependent variable, the expected percentage is 50%. For unequally split 2-way groups
of different sizes, the expected percent is computed in the "Prior Probabilities for
Groups" table in SPSS, by multiplying the prior probabilities times the group size,
summing over all groups, and dividing the sum by N. Alternatively, if one simply assigns all cases to the
largest group, the expected percentage is the largest group size divided by
N.
Cross-validation. Leave-one-out classification is available as a form of cross-validation of the
classification table. Under this option, each case is classified using a discriminant
function based on all cases except the given case. This is thought to give a better
estimate of what classification results would be in the population.
Measures of can be computed by the crosstabs procedure in SPSS if the researcher saves the
association predicted group membership for all cases.
Mahalanobis D- Indices other than Wilks' lambda of the extent to which the discriminant functions
Square, Rao's V, discriminate between criterion groups. Each has an associated significance test. A
Hotelling's trace, measure from this group is sometimes used in stepwise discriminant analysis to
Pillai's trace, and determine if adding an independent variable to the model will significantly improve
Roys gcr(greatest classification of the dependent variable. SPSS uses Wilks' lambda by default but also
characteristic root) offers Mahalanobis distance, Rao's V, unexplained variance, and smallest F ratio on
selection.

Structure These, also known as discriminant loadings, are the simple correlations
Correlations between the independent variables and the discriminant functions.

3.4 Discriminant Analysis Using SPSS


For illustration, we will be using file bankloan.sav. This is a data file that concerns a bank's efforts to
reduce the incidence of loan defaults. The file contains financial and demographic information on 850
past and prospective customers. The first 700 cases are customers who were previously given loans.
The last 150 cases are prospective customers that the bank needs to classify as good or bad credit
risks. This file is part of SPSS cases, and is in the tutorial folder of SPSS. Within tutorial folder, this
file is in the sample files folder. For the convenience of readers we have provided this file in the CD
with the book.
As in the case of multiple regression, discriminant analysis can be done by two principal methods,
namely the enter method and the stepwise method.
We will illustrate output for both methods.
After opening the file bankloan.sav, one can click on ‘Analyse’ and ‘Classify’ as shown in the
following snapshot.
SPSS Snapshot DA 1

The next box that will appear is given in the following snapshot.

1. Enter the categorical dependent variable 'Previously defaulted' here.
2. Click on Define Range.

After entering the dependent variable and clicking on the “Define Range” as shown above, SPSS will
open following box,

SPSS Snapshot DA 2

1. Enter Minimum as '0' and Maximum as '1'.
2. Then click on Continue.

After defining the variable, one should click on ‘Continue’ button as shown above. SPSS will be back
to the previous box shown below
SPSS Snapshot DA 3

1. Enter the list of independent variables; in this example the variables are age, years with current
employer, years at current address, household income in thousands, debt to income ratio, credit card
debt in thousands and other debt in thousands.
2. Select the method of entering variables. If Enter is selected, all the variables are entered
together. If Stepwise is selected, the variables are entered in steps and the best-fitting model is
displayed in the output.
3. Next, click on 'Statistics'.
After selecting the dependent variable, the independent variables and the method of entering variables, one may
click on Statistics; SPSS will open a box as shown below.
SPSS Snapshot DA 4
1. Select Univariate ANOVAs, Box's M, Fisher's, Unstandardized and Within-groups correlation.
2. Then select Continue.

After selecting the descriptives SPSS will go back to the previous box shown below:
SPSS Snapshot DA 5

1. This field is used when one wants to split the data into two parts, one for estimation and the other
for validation. In such cases the validation variable should be entered here. Presently this is left
blank as we have not split the file.
2. Next, click on Classify.
After clicking on classify SPSS will open a box as shown below,
SPSS Snapshot DA 6
1. Select Summary table and Leave-one-out classification.
2. Click Continue.

After clicking ‘Continue’, SPSS will be back to previous box as shown in the Snapshot DA 5, then
click on the save button at the bottom. SPSS will open a box as shown below.

SPSS Snapshot DA 7
1. Select Predicted group membership, Discriminant scores and Probabilities of group membership.
2. Click Continue.

After clicking Continue, SPSS will again go back to the previous window shown in Snapshot DA 6; at
this stage one may click the OK button. SPSS will then analyse the data and the output will be
displayed in the Output Viewer of SPSS.
Output for Enter Method

We will discuss interpretation for each output.


Discriminant
Analysis Cas e Proce ssing Summary

Unwe ighted Ca ses N Percent


Valid 700 82. 4
Exclu ded Missin g or o ut-of-range
150 17. 6
gro up code s
At least one missing
0 .0
discrimin ating va riable
Both m issing or
out -of-ran ge gro up code s
0 .0
and at le ast o ne mi ssing
discrimin ating va riable
T otal 150 17. 6
T otal 850 100 .0

This table gives the case processing summary, i.e. how many valid cases were selected, how many were
excluded (due to missing data), the total, and their respective percentages.
Grou p Statistics

Valid N (listwise)
Prev iously def aulted Mean Std. Dev iation Unweighted Weighted
No Age in y ears 35.5145 7.70774 517 517.000
Y ears wit h current
9.5087 6.66374 517 517.000
employ er
Y ears at current address 8.9458 7.00062 517 517.000
Household income in
47.1547 34.22015 517 517.000
thousands
Debt to income ratio
8.6793 5.61520 517 517.000
(x100)
Credit card debt in
1.2455 1.42231 517 517.000
thousands
Other debt in thousands 2.7734 2.81394 517 517.000
Y es Age in y ears 33.0109 8.51759 183 183.000
Y ears wit h current
5.2240 5.54295 183 183.000
employ er
Y ears at current address 6.3934 5.92521 183 183.000
Household income in
41.2131 43.11553 183 183.000
thousands
Debt to income ratio
14.7279 7.90280 183 183.000
(x100)
Credit card debt in
2.4239 3.23252 183 183.000
thousands
Other debt in thousands 3.8628 4.26368 183 183.000
Total Age in y ears 34.8600 7.99734 700 700.000
Y ears wit h current
8.3886 6.65804 700 700.000
employ er
Y ears at current address 8.2786 6.82488 700 700.000
Household income in
45.6014 36.81423 700 700.000
thousands
Debt to income ratio
10.2606 6.82723 700 700.000
(x100)
Credit card debt in
1.5536 2.11720 700 700.000
thousands
Other debt in thousands 3.0582 3.28755 700 700.000

This table gives the group statistics of the independent variables for each category (here Yes and No) of
the dependent variable.
Tests of Equality of Group Means

Wilks'
Lambda F df 1 df 2 Sig.
Age in y ears .981 13.482 1 698 .000
Y ears with current
.920 60.759 1 698 .000
employ er
Y ears at current address .973 19.402 1 698 .000
Household income in
.995 3.533 1 698 .061
thousands
Debt to income ratio
.848 124.889 1 698 .000
(x100)
Credit card debt in
.940 44.472 1 698 .000
thousands
Other debt in thousands .979 15.142 1 698 .000

This table gives the test for Wilks’  for each independent variable if this is significant(<0.05 or
0.01), it means that the respective variable, mean is different for the two groups (in this case
previously defaulted and previously not defaulted). Any insignificant value will indicate that, the
variable is not different for different group or in other terms does not discriminate the dependent
variable. In the above example, all the variables are significant except household income in
thousands. This implies that the default of the loan does not depend on the household income.
68

Pooled Within-Groups Matrices

Y ears wit h Y ears at Household Debt to Credit card


current current income in income ratio debt in Other debt in
Age in y ears employ er address thousands (x100) thousands thousands
Correlation Age in y ears 1.000 .524 .588 .475 .077 .342 .368
Y ears wit h current
.524 1.000 .292 .627 .089 .509 .471
employ er
Y ears at current address .588 .292 1.000 .310 .083 .260 .257
Household income in
.475 .627 .310 1.000 .001 .608 .629
thousands
Debt to income ratio
.077 .089 .083 .001 1.000 .455 .580
(x100)
Credit card debt in
.342 .509 .260 .608 .455 1.000 .623
thousands
Other debt in thousands .368 .471 .257 .629 .580 .623 1.000

Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants

Log
Prev iously def aulted Rank Det erminant
No 7 21.292
Yes 7 24.046
Pooled within-groups 7 22.817
The ranks and natural logarithms of determinants
printed are those of the group cov ariance matrices.

Test Resul ts
Box's M 563.291
F Approx. 19.819
df 1 28
df 2 431743.0
Sig. .000
Tests null hy pothesis of equal population cov ariance matrices.

This table indicates that the Box’s M is significant. Which means the assumption of equality of
variance may not be true. This is a caution for interpreting results.

Summary of Canonical Discriminant Functions


Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .404(a)      100.0           100.0          .536
a. First 1 canonical discriminant functions were used in the analysis.

This table gives a summary of the canonical discriminant function. It indicates that the eigenvalue for this
model is 0.404 and the canonical correlation is 0.536. Since there is a single discriminant function, all of
the explained variation is contributed by that function. The squared canonical correlation (0.536² ≈ 0.29)
indicates that about 29% of the variation in the discriminant scores is accounted for by the groups.
Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .712            235.447      7    .000

This table tests the significance of the model as a whole; as seen in the Sig. column, the model is significant.
Stand ardized Canon ical Discr iminant Fu nction Coeffici en ts

Function
1
Age in y ears .122
Y ears with current
-.829
employ er
Y ears at current address -.310
Household income in
.215
thousands
Debt to income ratio
.603
(x100)
Credit card debt in
.564
thousands
Other debt in thousands -.178

Structure Matrix

Function
1
Debt to income ratio
.666
(x100)
Y ears with current
-.464
employ er
Credit card debt in
.397
thousands
Y ears at current address -.262
Other debt in thousands .232
Age in y ears -.219
Household income in
-.112
thousands
Pooled wit hin-groups correlations between discriminating
v ariables and standardized canonical discriminant f unctions
Variables ordered by absolute size of correlation wit hin f unction.

This table gives simple correlations between the independent variables and the discriminant function.
High correlation will get translated to high discriminating power.
Canonical Discriminant Functi on Coefficients

Function
1
Age in y ears .015
Y ears with current
-.130
employ er
Y ears at current address -.046
Household income in
.006
thousands
Debt to income ratio
.096
(x100)
Credit card debt in
.275
thousands
Other debt in thousands -.055
(Constant) -.576
Unstandardized coef f icient s

This table gives the unstandardized canonical discriminant function coefficients. A negative sign indicates
an inverse relation. For example, the coefficient of years with current employer is -0.130, which means that
the more years a person has spent with the current employer, the lower the chance that the person will default.
Functions at Group Centroids

Previously defaulted   Function 1
No                     -.377
Yes                    1.066
Unstandardized canonical discriminant functions evaluated at group means
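Using the unstandardized coefficients and group centroids reported above, a hedged Python sketch of scoring a new case by hand (the applicant's values are made up, and the simple midpoint cutoff is an illustrative assumption; SPSS weights the cutoff by group size when the groups are unequal):

    # Unstandardized canonical discriminant function from the SPSS output above.
    coef = {"age": 0.015, "employ": -0.130, "address": -0.046, "income": 0.006,
            "debtinc": 0.096, "creddebt": 0.275, "othdebt": -0.055}
    constant = -0.576

    # Group centroids from the output: No = -0.377, Yes = 1.066.
    centroid_no, centroid_yes = -0.377, 1.066
    cutoff = (centroid_no + centroid_yes) / 2   # simple midpoint for illustration

    # Hypothetical applicant.
    case = {"age": 40, "employ": 12, "address": 5, "income": 50,
            "debtinc": 9.0, "creddebt": 1.2, "othdebt": 2.0}

    score = constant + sum(coef[k] * case[k] for k in coef)
    print(round(score, 3), "Yes (default)" if score > cutoff else "No (non-default)")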

Classification Statistics
Classifi cation Processing Summary
Processed 850
Excluded Missing or out-of -range
0
group codes
At least one missing
0
discriminating v ariable
Used in Output 850

Prior Probabiliti es for Groups

Case s Used in Analysis


Previously defa ulted Prior Unwe ighte d We ighted
No .50 0 517 517 .000
Yes .50 0 183 183 .000
T otal 1.0 00 700 700 .000

Classifi cation Function Coefficients

Prev iously def aulted


No Y es
Age in y ears .803 .825
Y ears with current
-.102 -.289
employ er
Y ears at current address -.294 -.360
Household income in
.073 .081
thousands
Debt t o income ratio
.639 .777
(x100)
Credit card debt in
-1.004 -.608
thousands
Other debt in thousands -1.044 -1.124
(Constant) -15.569 -16.898
Fisher's linear discriminant f unctions

The classification functions are used to assign cases to groups.


There is a separate function for each group. For each case, a classification score is computed for each
function. The discriminant model assigns the case to the group whose classification function obtained
the highest score. The coefficients for Years with current employer and Years at current address are
smaller for the Yes classification function, which means that customers who have lived at the same
address and worked at the same company for many years are less likely to default. Similarly,
customers with greater debt are more likely to default.
Classification Results(b,c)

Predicted Group
Membership
Previously
defaulted No Yes Total
Original Count No 393 124 517
Yes 44 139 183
Ungrouped
99 51 150
cases
% No 76.0 24.0 100.0
Yes 24.0 76.0 100.0
Ungrouped
66.0 34.0 100.0
cases
Cross- Count No 391 126 517
validated(a) Yes 47 136 183
% No 75.6 24.4 100.0
Yes 25.7 74.3 100.0
a Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the
functions derived from all cases other than that case.
b 76.0% of original grouped cases correctly classified.
c 75.3% of cross-validated grouped cases correctly classified.

This is the classification matrix or confusion matrix. It gives the percentage of cases that are classified
correctly, i.e. the hit ratio. This hit ratio should be at least 25% higher than the proportion expected by chance.
In the above example, 532 of the 700 cases are classified correctly, i.e. overall 76% of the cases are
classified correctly, and 139 of the 183 actual defaulters are identified correctly.
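For comparison, a hedged Python sketch of the same analysis with scikit-learn; it assumes the bankloan data have been exported to a CSV file with these column names, which is an assumption about how the file would be read rather than part of the SPSS workflow above:

    import pandas as pd
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.metrics import confusion_matrix
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    # Assumed export of bankloan.sav to CSV; the column names are assumptions.
    df = pd.read_csv("bankloan.csv").dropna(subset=["default"])
    X = df[["age", "employ", "address", "income", "debtinc", "creddebt", "othdebt"]]
    y = df["default"]

    lda = LinearDiscriminantAnalysis().fit(X, y)

    # Classification (confusion) matrix and hit ratio on the estimation data.
    pred = lda.predict(X)
    print(confusion_matrix(y, pred))
    print("hit ratio:", (pred == y).mean())

    # Leave-one-out cross-validated hit ratio, analogous to the SPSS option.
    loo = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
    print("cross-validated hit ratio:", loo.mean())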
Output for Stepwise Method
The steps for the stepwise method are very similar to those of the enter method, except that Stepwise is
selected as the method, as shown in SPSS Snapshot DA 3.
If one selects the stepwise method, one also needs to select the criterion that SPSS should use to
select the best set of independent variables.
This can be selected by clicking the Method button, as shown in Snapshot DA 8 below.
SPSS Snapshot DA 8

1. select stepwise
method

2. Click on method

After clicking on method, SPSS will open a window as shown below.


SPSS Snapshot DA 9

There are five methods available in SPSS namely,


• Wilks' λ
• Unexplained variance
• Mahalanobis distance
• Smallest F ratio
• Rao's V
One may select any one of the methods.
We have selected Mahalanobis distance method.
We will now discuss the output using this method.
It may be noted that we will discuss only the output that is different than the previous method.
Stepwise Statistics
Var iables Enter ed /Removeda,b,c,d

Min. D Squared

Between Exact F
Step Entered Statistic Groups Statistic df 1 df 2 Sig.
1 Debt to
income No and
.924 124.889 1 698.000 .000
ratio Y es
(x100)
2 Y ears
with No and
1.501 101.287 2 697.000 .000
current Y es
employ er
3 Credit
card debt
No and
in 1.926 86.502 3 696.000 .000
Y es
thousand
s
4 Y ears at
No and
current 2.038 68.572 4 695.000 .000
Y es
address
At each step, the v ariable that maximizes the Mahalanobis distance between the two closest
groups is entered.
a. Maximum number of steps is 14.
b. Minimum partial F to enter is 3.84.
c. Maximum partial F to remov e is 2.71.
d. F lev el, tolerance, or VIN insuf f icient f or f urther computation.
Variables in the Analysis

Min. D Between
Step Tolerance F to Remove Squared Groups
1 Debt to income
1.000 124.889
ratio (x100)
2 Debt to income
ratio (x100) .992 130.539 .450 No and Yes
Years with
.992 66.047 .924 No and Yes
current employer
3 Debt to income
.766 35.888 1.578 No and Yes
ratio (x100)
Years with
.716 111.390 .947 No and Yes
current employer
Credit card debt
.572 44.336 1.501 No and Yes
in thousands
4 Debt to income
ratio (x100) .766 35.000 1.693 No and Yes
Years with
current employer .691 89.979 1.213 No and Yes
Credit card debt
.564 48.847 1.565 No and Yes
in thousands
Years at current
.898 11.039 1.926 No and Yes
address

Wilks' Lambda

Number of
Step Variables Lambda df1 df2 df3 Exact F

Statistic df1 df2 Sig.


1 1 .848 1 1 698 124.889 1 698.000 .000
2 2 .775 2 1 698 101.287 2 697.000 .000
3 3 .728 3 1 698 86.502 3 696.000 .000
4 4 .717 4 1 698 68.572 4 695.000 .000

These tables give a summary of the variables that are in the analysis, the variables that are not in the
analysis, and the model at each step together with its significance.

It can be concluded that the variables Debt to income ratio (x100), Years with current employer, Credit
card debt in thousands and Years at current address remain in the model and the others are removed
from the model. This means that only these variables contribute to the model.

4 Logistic Regression/ Multiple Logistic Regression


As discussed earlier, in a regression equation y = a + bx, both the dependent variable, y, and the
independent variable, x, are assumed to be normally distributed. As such, both x and y are continuous
variables and can take any value from -∞ to +∞. However, suppose there is a situation where the
distribution of the variable y is binomial, so that it takes only one of two possible values, say 0 and 1. In
such a case the regression equation is not valid, since for a given value of x it allows y to assume any
value, and not only either of the two values. This limitation of the regression equation led to the
development of logistic regression, where the dependent variable takes only two values, say 0 and 1.
As discussed above, in discriminant analysis the dependent variable also takes only two, 'Yes' and
'No' type, possible values. However, therein the limitation is that the independent variables have
to be continuous. Logistic regression is the response to this limitation as well, since it imposes no such
constraint. Thus, logistic regression is preferred and used in both the following situations:

• when the dependent variable is dichotomous/binary/categorical, and
• when the independent variables are continuous or categorical. If there is more than one
independent variable then, as discussed in the section on Multiple Correlation and Regression
Analysis, they could be either continuous, or dichotomous, or a combination of continuous and
dichotomous variables.

In logistic regression, the relationship between the dependent variable and the independent variable is not
linear. It is of the type

p = 1 / (1 + e^(-y))

where p is the probability of 'success', i.e. of the dichotomous variable taking the value 1, (1 - p) is
the probability of 'failure', i.e. of it taking the value 0, and y = a + bx.

The graph of this relationship between p and y is depicted below:

[Graph: Probability p as a function of y - the S-shaped logistic curve, rising from near 0 for large
negative values of y to near 1 for large positive values of y.]

The logistic equation above can be reduced to a linear form by converting the probability p into
log[p/(1 - p)], called the logit, as follows:

y = log[p/(1 - p)] = a + bx

The logarithm here is the natural logarithm to the base 'e'. (The natural logarithm of any number is
obtained by multiplying its logarithm to the base 10 by the natural logarithm of 10, i.e. 2.303.) This
equation looks like an ordinary regression equation; however, a unit change in the independent
variable now causes a change in the dependent variable expressed as a logit, rather than in the
probability p directly. Such regression analysis is known as Logistic Regression.
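A minimal Python sketch of these two equivalent forms of the model; the coefficients a and b below are arbitrary illustrative values, not estimates from any data in this chapter:

    import math

    a, b = -5.0, 1.5   # illustrative coefficients only

    def p_of_x(x):
        # p = 1 / (1 + e^(-y)) with y = a + b*x
        y = a + b * x
        return 1.0 / (1.0 + math.exp(-y))

    def logit(p):
        # logit = ln(p / (1 - p)); linear in x under the logistic model
        return math.log(p / (1.0 - p))

    p = p_of_x(3.3)
    print(round(p, 3), round(logit(p), 3))   # logit(p) equals a + b*3.3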
The fitting of a logistic regression equation is explained through an illustration wherein data was
recorded on the CGPA (up to first semester in the second year of MBA) of 20 MBA students, and
their success in the first interview for placement. The data collected was as follows where Pass is
indicated as “1” while Fail is indicated as “0”.
Student ( Srl. No.) 1 2 3 4 5 6 7 8 9 10

CGPA 3.12 3.21 3.15 3.45 3.14 3.25 3.16 3.28 3.22 3.41

Result of First Interview 0 1 0 0 0 1 1 1 0 1


Student ( Srl. No.) 11 12 13 14 15 16 17 18 19 20

CGPA 3.48 3.34 3.25 3.46 3.32 3.29 3.42 3.28 3.36 3.31

Result of First Interview 1 1 0 1 1 1 1 1 1 0



Now, given this data, can we find the probability of a student succeeding in the first interview given
the CGPA?
One naive approach is to fit an ordinary linear regression of the 0/1 interview result on the CGPA. This
gives the line y = 1.83x - 5.37 with an R² of only about 0.17, and its predictions are not confined to the
interval [0, 1], so it is not a satisfactory model for a probability.
[Scatter plot: Success in Interview (0 or 1) against CGPA, with the fitted line y = 1.8264x - 5.3679,
R² = 0.1706.]
Instead, let us now attempt to fit a logistic regression to the student data. We will do this by
computing the logits and then fitting a linear model to the logits. To compute the logits, we will
regroup the data by CGPA into intervals, using the midpoint of each interval for the independent
variable. We calculate the probability of success based on the number of students that passed the
interview for each range of CGPAs. This results in the following data:
Class Interval (CGPA)   Middle Point of Interval   Probability of Success p   Logit ln[p/(1 - p)]
3.1 - 3.2               3.15                       1/4 = 0.25                 -1.0986
3.2 - 3.3               3.25                       5/7 = 0.714                0.9163
3.3 - 3.4               3.35                       3/4 = 0.75                 1.0986
3.4 - 3.5               3.45                       5/6 = 0.833                1.6094

We plot the logit against the CGPA and then look for the linear fit, which gives us the equation
y = 8.306x - 26.78.
Thus, if p is the probability of passing the interview and x is the CGPA, the logistic regression can be
expressed as:

ln[p / (1 - p)] = 8.306x - 26.78

Converting the logarithm to the equivalent exponential form, this equation can also be expressed as:

p = e^(8.306x - 26.78) / (1 + e^(8.306x - 26.78))
x    2.5     2.6     2.7     2.8     2.9     3.0     3.1     3.2     3.3    3.4    3.5    3.6    3.7    3.8    3.9    4.0
y*   -6.015  -5.184  -4.354  -3.523  -2.693  -1.862  -1.031  -0.201  0.630  1.460  2.291  3.122  3.952  4.783  5.613  6.444
p    0.002   0.006   0.013   0.029   0.063   0.134   0.263   0.450   0.652  0.812  0.908  0.958  0.981  0.992  0.996  0.998
* y rounded to 3 decimal places
This can be displayed graphically as follows:

From this regression model, we can see that probability of success at the interview is below 25% for
CGPAs below 2.90 but is above 75% for CGPAs above 3.60.
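The fitted equation from this example can be evaluated directly; a short Python sketch (the particular CGPA values chosen below are arbitrary):

    import math

    def prob_pass(cgpa):
        # Fitted logistic model from the example: ln[p/(1-p)] = 8.306*CGPA - 26.78
        y = 8.306 * cgpa - 26.78
        return 1.0 / (1.0 + math.exp(-y))

    for cgpa in (2.9, 3.2, 3.45, 3.6):
        print(cgpa, round(prob_pass(cgpa), 3))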
While one could apply logistic regression to a number of situations, it has been found useful
particularly in the following situations:
• Credit - study of creditworthiness of an individual or a company. Various demographic and
credit history variables could be used to predict if an individual will turn out to be a 'good' or
'bad' customer.
• Marketing / Market Segmentation - study of purchasing behaviour of consumers. Various
demographic and purchasing information could be used to predict if an individual will
purchase an item or not.
• Customer loyalty - the analysis could be done to identify loyal or repeat customers using
various demographic and purchasing information.
• Medical - study of risk of diseases / body disorders.
4.1 Assumptions of Logistic Regression
Multiple regression requires assumptions such as linearity and normality; these are not required for
logistic regression. Discriminant analysis requires the independent variables to be metric, which is
also not necessary for logistic regression. This makes the technique more flexible than discriminant
analysis. The main care to be taken is that there are no extreme observations (outliers) in the data.
4.2 Key Terms of Logistic Regressions
Following are the key terms used in logistic regression
Factor A categorical independent variable in logistic regression is termed a factor. The
factor is dichotomous or categorical in nature, and is usually represented by dummy
variables.
Covariate The independent variable that is metric in nature is termed as covariate.
Maximum Likelihood This method is used in logistic regression to predict the odd ratio for the
Estimation dependent variable. In least square estimate, the square of error is
minimized, but in maximum likelihood estimation, the log likelihood is
maximized
Significance Test Hosmer and Lemeshow chi-square test is used to test the overall model
of goodness-of-fit test. It is the modified chi-square test, which is better
than the traditional chi-square test. Pearson chi-square test and
likelihood ratio test are used in multinomial logistic regression to
estimate the model goodness-of-fit.
Stepwise logistic In SPSS logistic regression, the three broad methods available are enter,
regression backward and forward. In the enter method, all variables are included in the
logistic regression, irrespective of whether they are significant or
insignificant. In the backward method, the model starts with all variables
and removes non-significant variables from the list. In the forward method,
logistic regression starts with a single variable, adds variables one by
one, tests their significance, and removes insignificant variables from
the model.

Odds Ratio The exponentiated beta, Exp(B), in logistic regression gives the odds ratio for the
dependent variable. The probability of the dependent variable can be
computed from this odds ratio. When Exp(B) is
greater than one, the odds of the higher category increase with the variable;
when Exp(B) is less than one, the odds of the higher
category decrease.
Measures of Effect In logistic regression, the ordinary R² is not available, so pseudo R² measures are
Size used instead to indicate the variation accounted for by the independent variables. The maximum value of
the Cox and Snell R² statistic is actually somewhat less than 1;
the Nagelkerke R² statistic is a "correction" of the Cox and Snell
statistic so that its maximum value is 1.
Classification Table The classification table shows the practical results of using the logistic
regression model. It is useful to understand validity of the model

4.3 Logistic Regressions Using SPSS


For illustration we have used the file bankloan.sav that was used in Section 3 dealing with
discriminant analysis. This is a data file that concerns a bank's efforts to reduce the rate of loan
defaults. The file contains financial and demographic information on 850 past and prospective
customers. The first 700 cases are customers who were previously given loans. The last 150 cases are
prospective customers that the bank needs to classify as good or bad credit risks. This file is part of the
SPSS cases and is in the tutorial folder of SPSS. Within the tutorial folder, this file is in the sample files
folder. For the convenience of readers we have provided this file in the CD with the book.
As in the case of multiple regression, logistic regression can be done by two principal methods,
namely the enter method and the stepwise method.
We will illustrate output for both methods.
After opening the file bankloan.sav, one can click on Analyze and then Regression - Binary Logistic, as
shown in the following snapshot.

LR Snapshot 1

SPSS will open following window.

LR Snapshot 2

1. Select the dependent variable 'Previously defaulted'.
2. Enter the list of independent variables; in this example the variables are age, years with current
employer, years at current address, household income, debt to income ratio, credit card debt in
thousands and other debt in thousands.
3. Select the method of logistic regression as Forward: LR.
4. Click on Save.

SPSS will open following window


LR Snapshot 3

1. Select Predicted Values: Group membership.
2. Click on Continue.

SPSS will take you back to the window displayed in LR Snapshot 2; at this stage click on Options. The
following window will open.

LR Snapshot 4

1. Select Hosmer-Lemeshow goodness of fit.
2. Click Continue.

SPSS will be back to the window shown in LR Snapshot 2. At this stage click OK. The following
output will be displayed.

Logistic Regression
Case Processing Summary
a
Unweighted Cases N Percent
Selected Cases Included in Analy sis 700 82.4
Missing Cases 150 17.6
Total 850 100.0
Unselected Cases 0 .0
Total 850 100.0
a. If weight is in ef f ect , see classif ication table f or the total
number of cases.

This table gives the case processing summary: 700 of the 850 cases are used for the analysis; the remaining 150
are excluded as they have missing values.
Dependent Variabl e Encodi ng

Original Value Internal Value


No 0
Yes 1

This table indicates the coding of the dependent variable: 0 = not defaulted, 1 = defaulted.
Block 0: Beginning Block
Classifi cati on Tablea,b

Predicted

Prev iously def ault ed Percentage


Observ ed No Y es Correct
Step 0 Prev iously No 517 0 100.0
def aulted Y es 183 0 .0
Ov erall Percentage 73.9
a. Constant is included in the model.
b. The cut v alue is .500

Varia bles in the Equation

B S.E. Wa ld df Sig. Exp (B)


Step 0 Constant -1.0 39 .08 6 145 .782 1 .00 0 .35 4

Variables not i n the Equation

Score df Sig.
Step Variables age 13.265 1 .000
0 employ 56.054 1 .000
address 18.931 1 .000
income 3.526 1 .060
debtinc 106.238 1 .000
creddebt 41.928 1 .000
othdebt 14.863 1 .000
Ov erall Statistics 201.271 7 .000

Block 1: Method = Forward Stepwise (Likelihood Ratio)


Omnibus Tests of Model Coefficients

Chi-square df Sig.
Step 1 Step 102.935 1 .000
Block 102.935 1 .000
Model 102.935 1 .000
Step 2 Step 70.346 1 .000
Block 173.282 2 .000
Model 173.282 2 .000
Step 3 Step 55.446 1 .000
Block 228.728 3 .000
Model 228.728 3 .000
Step 4 Step 18.905 1 .000
Block 247.633 4 .000
Model 247.633 4 .000

Model Summary

-2 Log Cox & Snell Nagelkerke


Step likelihood R Square R Square
1 701.429a .137 .200
2 b
631.083 .219 .321
3 575.636 b .279 .408
4 c
556.732 .298 .436
a. Estimation terminated at iteration number 4 because
parameter estimates changed by less than .001.
b. Estimation terminated at iteration number 5 because
parameter estimates changed by less than .001.
c. Estimation terminated at iteration number 6 because
parameter estimates changed by less than .001.

Hosmer and Lemeshow Test

Step Chi-square df Sig.


1 3.160 8 .924
2 4.158 8 .843
3 6.418 8 .600
4 8.556 8 .381

The Hosmer-Lemeshow statistic indicates a poor fit if the significance value is less than 0.05. Here,
since the value is above 0.05, the model adequately fits the data
Classifi cati on Tablea

Predicted

Prev iously def ault ed Percentage


Observ ed No Y es Correct
Step 1 Prev iously No 490 27 94.8
def aulted Y es 137 46 25.1
Ov erall Percentage 76.6
Step 2 Prev iously No 481 36 93.0
def aulted Y es 110 73 39.9
Ov erall Percentage 79.1
Step 3 Prev iously No 477 40 92.3
def aulted Y es 99 84 45.9
Ov erall Percentage 80.1
Step 4 Prev iously No 478 39 92.5
def aulted Y es 91 92 50.3
Ov erall Percentage 81.4
a. The cut v alue is .500

This is the classification table. It indicates the number of cases correctly classified as well as
incorrectly classified. Diagonal elements represent correctly classified cases and off-diagonal
elements represent incorrectly classified cases.
It may be noted that at each step the number of correctly classified cases improves over the
previous step. The last column gives the percentage of correctly classified cases, which improves
at each step.
Variables in the Equation

B S.E. Wald df Sig. Exp(B)


Step
a
debtinc .132 .014 85.377 1 .000 1.141
1 Constant -2.531 .195 168.524 1 .000 .080
Step
b employ -.141 .019 53.755 1 .000 .868
2 debtinc .145 .016 87.231 1 .000 1.156
Constant
-1.693 .219 59.771 1 .000 .184

Step
c
employ -.244 .027 80.262 1 .000 .783
3 debtinc .088 .018 23.328 1 .000 1.092
creddebt .503 .081 38.652 1 .000 1.653
Constant -1.227 .231 28.144 1 .000 .293
Step
d employ -.243 .028 74.761 1 .000 .785
4 address -.081 .020 17.183 1 .000 .922
debtinc .088 .019 22.659 1 .000 1.092
creddebt .573 .087 43.109 1 .000 1.774
Constant -.791 .252 9.890 1 .002 .453
a. Variable(s) entered on step 1: debtinc.
b. Variable(s) entered on step 2: employ .
c. Variable(s) entered on step 3: creddebt.
d. Variable(s) entered on step 4: address.

The best model is usually the last one, i.e. Step 4. It contains the variables years with current employer,
years at current address, debt to income ratio, and credit card debt. All other variables are
insignificant in the model.
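A hedged Python sketch refitting this final (Step 4) model with statsmodels; as before, it assumes a CSV export of the bankloan data with these column names, and it simply fits the four retained variables rather than reproducing SPSS's forward-stepwise search:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Assumed export of bankloan.sav to CSV; the column names are assumptions.
    df = pd.read_csv("bankloan.csv").dropna(subset=["default"])
    X = sm.add_constant(df[["employ", "address", "debtinc", "creddebt"]])
    y = df["default"]

    model = sm.Logit(y, X).fit()
    print(model.summary())        # B, standard errors and Wald-type p-values
    print(np.exp(model.params))   # Exp(B): the odds ratios, as in the SPSS table

    # Classification table at the 0.5 cut value.
    pred = (model.predict(X) >= 0.5).astype(int)
    print(pd.crosstab(y, pred, rownames=["observed"], colnames=["predicted"]))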

Model if Term Removed

Change in
Model Log -2 Log Sig. of the
Variable Likelihood Likelihood df Change
Step 1 debtinc -402.182 102.935 1 .000
Step 2 employ -350.714 70.346 1 .000
debtinc
-369.708 108.332 1 .000

Step 3 employ -349.577 123.518 1 .000


debtinc -299.710 23.783 1 .000
creddebt -315.541 55.446 1 .000
Step 4 employ -333.611 110.490 1 .000
address -287.818 18.905 1 .000
debtinc -290.006 23.281 1 .000
creddebt -311.176 65.621 1 .000

Variables not i n the Equation

Score df Sig.
Step Variables age 16.478 1 .000
1 employ 60.934 1 .000
address 23.474 1 .000
income 3.219 1 .073
creddebt 2.261 1 .133
othdebt 6.631 1 .010
Ov erall Statistics 113.910 6 .000
Step Variables age .006 1 .939
2 address 8.407 1 .004
income 21.437 1 .000
creddebt 64.958 1 .000
othdebt 4.503 1 .034
Ov erall Statistics 84.064 5 .000
Step Variables age .635 1 .426
3 address 17.851 1 .000
income .773 1 .379
othdebt .006 1 .940
Ov erall Statistics 22.221 4 .000
Step Variables age 3.632 1 .057
4 income .012 1 .912
othdebt .320 1 .572
Ov erall Statistics 4.640 3 .200

The above table lists, at each step, the variables not yet included in the equation together with their score
statistics, which indicate whether adding a particular variable would significantly improve the model.

5 Multivariate Analysis of Variance (MANOVA)


In ANOVA, we study the impact of one or more factors on a single variable. For example, we could
study the differences in yields of rice due to say, 3 fertilisers. However, if we wish to study the
impact of fertilisers on more than one variable, say in yields as also on harvest times of rice crop, then
MANOVA could be used to test the Null hypotheses that the
(i) yields due to the use of three fertilisers are equal and
(ii) harvest times due to the use of these fertilisers are equal.

Another example could be to assess the impacts of two training programmes, conducted for a group
of employees, on their knowledge as well as motivation relevant for their job. While one programme
was mostly based on ‘Class Room’ training, the other was mostly based on the ‘On Job’ training.
The data collected could be as follows:
Class Room No Training Job Based
K M K M K M
1 92 98 70 75 83 90
2 88 77 56 66 65 76
3 91 88 89 90 93 91
4 85 82 87 84 90 85
5 88 85 72 71 77 73
6 81 82 74 71 89 81
7 92 83 75 75 84 78
8 88 90 80 68 85 76
9 80 79 78 65 73 80
10 84 87 72 75 83 81

In this case one of the conclusions drawn was that both the programmes had positive impact on both
knowledge and motivation but there was no significant difference between classrooms based and job
based training programmes.
As yet another example, one could assess whether a change from Compensation System-1 to
Compensation System-2 has brought about changes in sales, profit and job satisfaction in an
organisation.
MANOVA is typically used when there is more than one dependent variable and the independent
variables are qualitative/categorical.
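As a sketch of how such a test could be run outside SPSS, the statsmodels MANOVA class can be applied to the training-programme data above; for brevity only the first three employees of each group are entered here, so the output is illustrative rather than a full reanalysis:

    import pandas as pd
    from statsmodels.multivariate.manova import MANOVA

    # Knowledge (K) and Motivation (M) scores for the first three employees in
    # each training group, taken from the table above.
    data = pd.DataFrame({
        "group": ["Class Room"] * 3 + ["No Training"] * 3 + ["Job Based"] * 3,
        "K": [92, 88, 91, 70, 56, 89, 83, 65, 93],
        "M": [98, 77, 88, 75, 66, 90, 90, 76, 91],
    })

    # Wilks' lambda, Pillai's trace, etc. for the null hypothesis that the mean
    # (K, M) vector is the same across the three groups.
    fit = MANOVA.from_formula("K + M ~ group", data=data)
    print(fit.mv_test())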

5.1 MANOVA Using SPSS


We will use the case data on Commodity market perceptions displayed at end of this Chapter.
 Open the file Commodity.sav
 Select from the menu Analyze – General Linear Model – Multivariate as shown below
MANOVA Snapshot 1

The following window will be displayed


MANOVA Snapshot 2

1. Select the dependent variables as Invest_SM and Invest_CM.
2. Select the fixed factors as Occupation and how_much_time_block_your_money; these are the
categorical independent variables.
3. Select the covariates as Age, Rate_CM and Rate_SM; covariates are the metric independent
variables.
4. Click OK.

It may be noted that the above example is really one of MANCOVA, as we have selected some
categorical independent variables and some metric ones.
We are assuming in this example that the dependent variables are the investments in the
commodity market and in the share market, and that the categorical independent variables are occupation and
how long the respondents block their investments. The metric independent variables are age and the
respondent's ratings of the commodity market and the share market. Here we assume that the investments
depend on these ratings, occupation, age and how long the respondents block their investments.
The following output will be displayed
General Linear Model
Between-Subjects Factors

Value Label N
Occupation 1 "Self
11
Employ ed"
2 Gov t 15
3 Student 4
4 House Wif e 14
how_much_time_ 1 <6 months 6
block_y our_money 2 6 to 12
8
months
3 1 to 3 y ears 5
4 > 3 y ears 10
5 4
6 6
7 3
8 2

This table gives summary of number of cases for the factors.


Multivariate Testsc

Ef f ect Value F Hy pothesis df Error df Sig.


Intercept Pillai's Trace .139 1.537a 2.000 19.000 .241
Wilks' Lambda .861 1.537a 2.000 19.000 .241
Hot elling's Trace .162 1.537a 2.000 19.000 .241
Roy 's Largest Root .162 1.537a 2.000 19.000 .241
Age Pillai's Trace .157 1.770a 2.000 19.000 .197
Wilks' Lambda .843 1.770a 2.000 19.000 .197
Hot elling's Trace .186 1.770a 2.000 19.000 .197
Roy 's Largest Root .186 1.770a 2.000 19.000 .197
Rat e_CM Pillai's Trace .096 1.011a 2.000 19.000 .383
Wilks' Lambda .904 1.011a 2.000 19.000 .383
Hot elling's Trace .106 1.011a 2.000 19.000 .383
Roy 's Largest Root .106 1.011a 2.000 19.000 .383
Rat e_SM Pillai's Trace .027 .268a 2.000 19.000 .768
Wilks' Lambda .973 .268a 2.000 19.000 .768
Hot elling's Trace .028 .268a 2.000 19.000 .768
Roy 's Largest Root .028 .268a 2.000 19.000 .768
Occupation Pillai's Trace .908 5.547 6.000 40.000 .000
Wilks' Lambda .250 6.333a 6.000 38.000 .000
Hot elling's Trace 2.366 7.099 6.000 36.000 .000
Roy 's Largest Root 2.059 13.725b 3.000 20.000 .000
how_much_time_block_ Pillai's Trace .855 2.132 14.000 40.000 .031
y our_money Wilks' Lambda .318 2.102a 14.000 38.000 .035
Hot elling's Trace 1.607 2.066 14.000 36.000 .040
Roy 's Largest Root 1.125 3.214b 7.000 20.000 .019
Occupation * how_much_ Pillai's Trace .823 1.399 20.000 40.000 .180
time_block_y our_money Wilks' Lambda .335 1.382a 20.000 38.000 .191
Hot elling's Trace 1.511 1.360 20.000 36.000 .206
Roy 's Largest Root 1.069 2.137b 10.000 20.000 .071
a. Exact statist ic
b. The stat istic is an upper bound on F that y ields a lower bound on the signif icance lev el.
c. Design: Intercept+Age+Rate_CM+Rate_SM+Occupation+how_much_time_block_y our_money +Occupation *
how_much_time_block_y our_money

This table indicates that the null hypothesis that the mean investments are equal across all occupations is
rejected, since the significance value (p-value) is less than 0.05. Thus we may
conclude at the 5% level of significance (LOS) that the two investments (share market and commodity
market) differ significantly across the occupations of the respondents.
The null hypothesis that the investments are equal across the different periods for which the investment is
blocked is also rejected, since the significance value (p-value) is less than 0.05. Thus we
may conclude at the 5% LOS that the two investments (share market
and commodity market) differ significantly across the periods for which the respondents would block their money.

The other hypotheses, concerning age, the rating of CM and the rating of SM, are not rejected (as the p-values are
greater than 0.05); this means these covariates have no significant effect on the investments.

Tests of Between-Subjects Effects

Ty pe III Sum
Source Dependent Variable of Squares df Mean Square F Sig.
Corrected Model Inv est_SM 1.014E+011a 23 4407960763 4.250 .001
Inv est_CM 1.193E+010b 23 518647980.2 2.033 .057
Intercept Inv est_SM 64681514.2 1 64681514.21 .062 .805
Inv est_CM 532489379 1 532489379.0 2.087 .164
Age Inv est_SM 313752666 1 313752666.0 .302 .588
Inv est_CM 472901520 1 472901520.3 1.854 .189
Rat e_CM Inv est_SM 224388812 1 224388812.0 .216 .647
Inv est_CM 526600356 1 526600355.7 2.064 .166
Rat e_SM Inv est_SM 528516861 1 528516860.7 .510 .484
Inv est_CM 76132638.6 1 76132638.56 .298 .591
Occupation Inv est_SM 4.254E+010 3 1.418E+010 13.672 .000
Inv est_CM 3131507520 3 1043835840 4.092 .020
how_much_time_block_ Inv est_SM 1.715E+010 7 2450532038 2.362 .062
y our_money Inv est_CM 2659656810 7 379950972.8 1.489 .227
Occupation * how_much_ Inv est_SM 2.190E+010 10 2190181184 2.111 .074
time_block_y our_money Inv est_CM 2642792207 10 264279220.7 1.036 .450
Error Inv est_SM 2.075E+010 20 1037276941
Inv est_CM 5102386227 20 255119311.4
Total Inv est_SM 2.109E+011 44
Inv est_CM 2.780E+010 44
Corrected Total Inv est_SM 1.221E+011 43
Inv est_CM 1.703E+010 43
a. R Squared = . 830 (Adjusted R Squared = .635)
b. R Squared = . 700 (Adjusted R Squared = .356)

6 Factor Analysis
Factor analysis is an interdependence technique. In interdependence techniques the variables are not
classified as independent or dependent; rather, their interrelationships are studied. Factor analysis
is a general name for two different techniques, namely Principal Component Analysis (PCA) and Common
Factor Analysis.
Factor analysis originated about a century ago, when Charles Spearman propounded that the
results of a wide variety of mental tests could be explained by a single underlying intelligence factor.
Factor analysis is done principally for two reasons:
• To identify a new, smaller set of uncorrelated variables to be used in subsequent multiple
regression analysis. In this situation Principal Component Analysis is performed on the
data. PCA considers the total variance in the data while extracting principal components from a
given set of variables.
• To identify underlying dimensions/factors that are unobservable but explain the correlations
among a set of variables. In this situation Common Factor Analysis is performed on the
data. It considers only the common variance while extracting common factors from a given set
of variables. Common factor analysis is also termed Principal Axis Factoring.

The essential purpose of factor analysis is to describe, if possible, the covariance relationships among
many variables in terms of few underlying, but unobservable, random quantities called factors.
Basically, the factor model is motivated by the following argument. Suppose variables can be
grouped by their correlations. That is, all variables, within a particular group are highly correlated
among themselves but have relatively small correlations with variables in a different group. In that
case, it is conceivable that each group of variables represents a single underlying construct, or factor,
that is responsible for the correlations.

6.1 Rotation in Factor Analysis


If several factors have high loadings on the same variable, it is difficult to interpret the factors clearly.
This can be improved by using rotation.

Rotation does not affect the communalities or the percentage of total variance explained. However, the
percentage of variance accounted for by each factor changes: rotation redistributes the variance
explained by the individual factors.

There are two basic types of rotation, viz.


 Orthogonal Rotation
 Oblique Rotation

A rotation is said to be orthogonal if the axes are maintained at right angles. Orthogonal rotations:
 are the most widely used rotational methods.
 are the preferred methods when the research goal is data reduction to either a smaller
number of variables or a set of uncorrelated measures for subsequent use in other
multivariate techniques.

The varimax procedure is a type of orthogonal rotation that maximizes the variance of the squared loadings of
each factor, so that the amount of variance accounted for is redistributed over the extracted factors. This
is the most popular method of rotation.
It may be noted that in the above diagram, the factors can be easily interpreted after the orthogonal
rotation. Variables v2,v3 and v6 contribute to factor 1 and v1,v4 and v5 contribute to factor 2.

A rotation is said to be oblique if the axes do not maintain right angles. Oblique rotation is best suited to the goal of
obtaining several theoretically meaningful factors or constructs because, realistically, very few
constructs in the “real world” are uncorrelated.
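
As an aside, the varimax idea can be illustrated with a short, generic Python reimplementation of Kaiser's criterion. This is only a sketch of the algorithm, not SPSS's internal routine; the function name and defaults are our own.

import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Orthogonally rotate a p x k loading matrix L using Kaiser's varimax criterion."""
    p, k = L.shape
    R = np.eye(k)                      # start with no rotation
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0)))
        )
        R = u @ vt                     # nearest orthogonal rotation matrix
        d_new = s.sum()
        if d_new < d * (1 + tol):      # stop when the criterion no longer improves
            break
        d = d_new
    return L @ R                       # rotated loadings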

6.2 Key Terms used in Factor Analysis

Following is the list of some key concepts used in factor analysis.

Exploratory This technique is used when a researcher has no prior knowledge about the
Factor Analysis number of factors the variables will be indicating. In such cases computer based
(EFA) techniques are used to indicate appropriate number of factors.
Confirmatory This technique is used when the researcher has the prior knowledge(on the basis of
Factor Analysis some pre-established theory) about the number of factors the variables will be
(CFA) indicating. This makes it easy as there is no decision to be taken about the number
of factors and the number is indicated in the computer based tool while conducting
analysis.
Correlation Matrix This is the matrix showing simple correlations between all possible pairs of variables.
The diagonal elements of this matrix are 1 and the matrix is symmetric, since the
correlation between two variables x and y is the same as that between y and x.
Communality The amount of variance, an original variable shares with all other variables included in
the analysis. A relatively high communality indicates that a variable has much in
common with the other variables taken as a group.
Eigenvalue Eigenvalue for each factor is the total variance explained by each factor.
Factor A linear combination of the original variables. Factor also represents the underlying
dimensions( constructs) that summarise or account for the original set of observed
variables
Factor Loadings The factor loadings, or component loadings in PCA, are the correlation coefficients
between the variables (given as rows in the output) and the factors (given as columns).
These loadings are analogous to Pearson’s correlation coefficient r; the squared factor
loading is the percentage of variance in the respective variable explained by the
factor.
Factor Matrix This contains factor loadings on all the variables on all the factors extracted
Factor Plot or This is a plot in which the factors are on different axes and the variables are drawn on these
Rotated Factor axes. This plot can be interpreted only if the number of factors is 3 or fewer.
Space
Factor Scores Each individual observation has a score, or value, associated with each of the original
variables. Factor analysis procedures derive factor scores that represent each
observation’s calculated values, or score, on each of the factors. The factor score will
represent an individual’s combined response to the several variables representing the
factor.
In PCA, the component scores may be used in subsequent analyses. When the factors
are to represent a new set of variables that may predict, or be dependent on, some
phenomenon, the factor scores may be used as the new input.
Goodness of a How well can a factor account for the correlations among the indicators ?
Factor One could examine the correlations among the indicators after the effect of the factor is
removed. For a good factor solution, the resulting partial correlations should be near
zero, because once the effect of the common factor is removed , there is nothing to link
the indicators.
Bartlett’s Test of This is the test statistic used to test the null hypothesis that there is no correlation
Sphericity between the variables.
Kaiser Meyer This is an index used to test the appropriateness of factor analysis. High values of this
Olkin (KMO) index, generally more than 0.5, may indicate that factor analysis is appropriate,
Measure of whereas lower values (less than 0.5) indicate that factor analysis may not
Sampling be appropriate.
Adequacy
Scree Plot A plot of the eigenvalues against the factors in the order of their extraction.
Trace The sum of squares of the values on the diagonal of the correlation matrix used in the
factor analysis. It represents the total amount of variance on which the factor solution is
based.

6.3 Principal Component Analysis (PCA)


Suppose, in a particular situation, k variables are required to explain the entire system under study.
Through PCA, the original variables are transformed into a new set of variables called principal
components, numbering much less than k. These are formed in such a manner that they extract
almost the entire information provided by the original variables. Thus, the original data of n
observations on each of the k variables is reduced to new data of n observations on each of the
principal components. That is why PCA is referred to as one of the data reduction and interpretation
techniques. Some indicative applications are given below.

There are a number of financial parameters/ratios for predicting health of a company. It would be
useful if only a couple of indicators could be formed as linear combination of the original
parameters/ratios in such a way that the few indicators extract most of the information contained in
the data on original variables.
Further, in a regression model, if the independent variables are correlated, implying there is
multicollinearity, then new variables can be formed as linear combinations of the original variables
which are themselves uncorrelated. The regression equation can then be derived with these new
uncorrelated independent variables and used for interpreting the regression coefficients as well as for
predicting the dependent variable with the help of these new independent variables. This is highly
useful in marketing and financial applications involving forecasting of sales, profit, price, etc.
with the help of regression equations.
Further, analysis of principal components often reveals relationships that were not previously suspected and
thereby allows interpretations that would not be ordinarily understood. A good example of this is provided by
stock market indices.
Incidentally, PCA is a means to an end and not the end in itself. PCA can be used for inputting
principal components as variables for further analysing the data using other techniques such as
cluster analysis, regression and discriminant analysis.
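
A minimal Python sketch of PCA as such a data-reduction step is given below; it is not part of the SPSS workflow, and the file name car_sales.csv and the short column names are assumptions made for illustration.

import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

cars = pd.read_csv("car_sales.csv")                      # hypothetical CSV export of car_sales.sav
cols = ["type", "price", "engine_s", "horsepow", "wheelbas",
        "width", "length", "curb_wgt", "fuel_cap", "mpg"]  # assumed column names

X = StandardScaler().fit_transform(cars[cols])           # standardise before extracting components
pca = PCA(n_components=3).fit(X)

print(pca.explained_variance_ratio_)                     # share of total variance per component
scores = pca.transform(X)                                # component scores, usable in later analyses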
6.4 Common Factor Analysis
It is yet another example of a data reduction and summarization technique. It is a statistical approach
that is used to analyse inter relationships among a large number of variables (e.g., test scores, test
items, questionnaire responses) and then explaining these variables in terms of their common
underlying dimensions (factors). For example, a hypothetical survey questionnaire may consist of 20
or even more questions, but since not all of the questions are identical, they do not all measure the
basic underlying dimensions to the same extent. By using factor analysis, we can identify the separate
dimensions being measured by the survey and determine a factor loading for each variable (test item)
on each factor.

Common Factor Analysis (unlike multiple regression, discriminant analysis, or canonical correlation, in which
one or more variables are explicitly considered as the criterion or dependent variable and all others as the
predictor or independent variables) is an interdependence technique in which all variables are simultaneously
considered. In a sense, each of the observed (original) variables is considered as a dependent variable that is a
function of some underlying, latent and hypothetical/unobserved set of factors (dimensions). One could also
consider the original variables as reflective indicators of the factors. For example, marks (variable) in an
examination reflect intelligence (factor).

The statistical approach followed in factor analysis involves finding a way of condensing the information
contained in a number of original variables into a smaller set of dimensions (factors) with a minimum loss of
information.
Common Factor Analysis was originally developed to explain students’ performance in various subjects and
to understand the link between grades and intelligence. Thus, the marks obtained in an examination reflect the
student’s intelligence quotient. A salesman’s performance in terms of sales might reflect his attitude towards the
job and the efforts made by him.
One of the studies relating to marks obtained by students in various subjects, led to the conclusion
that students’ marks are a function of two common factors viz. Quantitative and Verbal abilities. The
quantitative ability factor explains marks in subjects like Mathematics, Physics and Chemistry, and
verbal ability explains marks in subjects like Languages and History.
In another study, a detergent manufacturing company was interested in identifying the major
underlying factors or dimensions that consumers used to evaluate various detergents. These factors
are assumed to be latent; however, management believed that the various attributes or properties of
detergents were indicators of these underlying factors. Factor analysis was used to identify these
underlying factors. Data was collected on several product attributes using a five-point scale. The
analysis of responses revealed the existence of two factors, viz. the ability of the detergent to clean and its
mildness.
In general, the factor analysis performs the following functions:
 Identifies the smallest number of common factors that best explain or account for the
correlation among the indicators
 Identifies a set of dimensions that are latent ( not easily observed) in a large number of
variables
 Devises a method of combining or condensing a large number of consumers with varying
preferences into a number of distinctly different groups.
 Identifies and creates an entirely new, smaller set of variables to partially or completely
replace the original set of variables for subsequent regression or discriminant analysis.
It is especially useful in multiple regression analysis
when multicollinearity is found to exist, as the number of independent variables is reduced
by using factors, thereby minimizing or avoiding multicollinearity. In fact, factors are
used in lieu of the original variables in the regression equation.
Distinguishing Feature of Common Factor Analysis
Generally, the variables that we define in real life situations reflect the presence of unobservable
factors. These factors impact the values of those variables. For example, the marks obtained in an
examination reflect the student’s intelligence quotient. A salesman’s performance in terms of sales
might reflect his attitude towards the job and the efforts made by him.
Each of the above examples requires a scale, or an instrument, to measure the various
constructs (i.e., attitudes, image, patriotism, sales aptitude, and resistance to innovation). These
are but a few examples of the type of measurements that are desired by various business
disciplines. Factor analysis is one of the techniques that can be used to develop scales to
measure these constructs.
6.4.1 Applications of Common Factor Analysis
In one of the studies, a group of students of a Management Institute undertook
a survey of 120 potential buyers outside retail outlets and at dealer counters. Their opinions were
solicited through a questionnaire on each of 20 parameters relating to a television.
Through the use of principal component analysis and factor analysis using computer software, the
group concluded that the following five factors are most important out of the twenty parameters
on which opinions were recorded. The five factors were:
 Price (price, schemes and other offers)
 Picture Quality
 Brand Ambassador (Person of admiration)
 Wide range
 Information (Website use, Brochures, Friends’ recommendations )
In yet another study, another group of students of a Management Institute conducted a survey to
identify the factors that influence the purchasing decision for a motorcycle in the 125 cc category.
Through the use of principal component analysis and factor analysis using computer software, the
group concluded that the following three factors are most important:
 Comfort
 Assurance
 Long Term Value
6.5 Factor Analysis on Data Using SPSS
We shall first explain PCA using SPSS and then Common Factor Analysis.
Principal Component Analysis Using SPSS
For illustration we will be using the file car_sales.sav. This file is part of the SPSS cases and is in the tutorial
folder of SPSS. Within the tutorial folder, this file is in the sample_files folder. For the convenience of
readers, we have provided this file in the CD with the book. This data file contains hypothetical sales
estimates, list prices, and physical specifications for various makes and models of vehicles. The list

prices and physical specifications were obtained alternately from edmunds.com and manufacturer
sites. Following is the list of the major variables in the file.
1. Manufacturer 6. Price in thousands 11. Length
2. Model 7. Engine size 12. Curb weight
3. Sales in thousands 8. Horsepower 13. Fuel capacity
4. 4-year resale value 9. Wheelbase 14. Fuel efficiency
5. Vehicle type 10. Width
After opening the file Car_sales.sav, click on Analyze – Data Reduction – Factor as shown
in the following snapshot.
FA Snapshot 1

This command will open a new window as shown below.

FA Snapshot 2
1. Enter the variables Vehicle type, Price in thousands, Engine size, Horsepower, Wheelbase, Width, Length, Curb weight, Fuel capacity and Fuel efficiency.
2. Click on Descriptives. This will open a new window as shown below.

FA Snapshot 3
1. Click on Initial solution.
2. Click on Coefficients.
3. Click on KMO and Bartlett’s test of sphericity.
4. Click on Continue.

SPSS will take you back to the previous window as shown below.

FA Snapshot 3

Click on Extraction. The new window that will appear is shown below.

FA Snapshot 4
1. Select the method as Principal components.
2. Select Correlation matrix.
3. Select Unrotated factor solution.
4. Select Scree plot.
5. Click on Continue.

SPSS will take you back to the window shown below.

FA Snapshot 5
1. Click on Rotation. A window will open as follows.

FA Snapshot 6
1. Select Varimax rotation.
2. Select Display rotated solution.
3. Click Continue.

SPSS will take you back to the window shown in FA Snapshot 5. Click on the button ‘Scores’. This
will open a new window as shown below.

FA Snapshot 7
1. Select the Save as variables option.
2. Select the Regression method.
3. Select Display factor score coefficient matrix.
4. Click on Continue.

This will take you back to the window shown in FA Snapshot 5; in this window now click on OK. SPSS
will give the following output, which we shall explain in brief.

Factor Analysis
Correlation Matrix

(columns in the same order as the rows: Vehicle type, Price in thousands, Engine size, Horsepower, Wheelbase, Width, Length, Curb weight, Fuel capacity, Fuel efficiency)

Vehicle type         1.000  -.042   .269   .017   .397   .260   .150   .526   .599  -.577
Price in thousands   -.042  1.000   .624   .841   .108   .328   .155   .527   .424  -.492
Engine size           .269   .624  1.000   .837   .473   .692   .542   .761   .667  -.737
Horsepower            .017   .841   .837  1.000   .282   .535   .385   .611   .505  -.616
Wheelbase             .397   .108   .473   .282  1.000   .681   .840   .651   .657  -.497
Width                 .260   .328   .692   .535   .681  1.000   .706   .723   .663  -.602
Length                .150   .155   .542   .385   .840   .706  1.000   .629   .571  -.448
Curb weight           .526   .527   .761   .611   .651   .723   .629  1.000   .865  -.820
Fuel capacity         .599   .424   .667   .505   .657   .663   .571   .865  1.000  -.802
Fuel efficiency      -.577  -.492  -.737  -.616  -.497  -.602  -.448  -.820  -.802  1.000

This is the correlation matrix. The PCA can be carried out if the correlation matrix for the variables
contains at least two correlations of 0.30 or greater. It may be noted that the correlations >0.3 are
marked in circle.

KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy        .833
Bartlett's Test of Sphericity   Approx. Chi-Square     1578.819
                                df                     45
                                Sig.                   .000

The KMO measure of sampling adequacy is an index used to test the appropriateness of factor
analysis. The minimum required KMO is 0.5. The above table shows that the index for this data is
0.833 and that the Bartlett chi-square statistic is significant (p < 0.05). This means that principal component analysis
is appropriate for this data.
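
For reference, both statistics can be computed directly from the correlation matrix. The following is a minimal Python sketch of the standard formulas; the random data at the end is a stand-in used only to exercise the functions.

import numpy as np
from scipy import stats

def bartlett_sphericity(R, n):
    """Bartlett's test of sphericity for a p x p correlation matrix R from n observations."""
    p = R.shape[0]
    chi2 = -(n - 1 - (2 * p + 5) / 6.0) * np.log(np.linalg.det(R))
    df = p * (p - 1) / 2.0
    return chi2, df, stats.chi2.sf(chi2, df)          # statistic, degrees of freedom, p-value

def kmo(R):
    """Overall Kaiser-Meyer-Olkin index: observed vs. partial correlations."""
    inv = np.linalg.inv(R)
    d = np.sqrt(np.outer(np.diag(inv), np.diag(inv)))
    partial = -inv / d                                # matrix of partial correlations
    off = ~np.eye(R.shape[0], dtype=bool)
    r2, a2 = (R[off] ** 2).sum(), (partial[off] ** 2).sum()
    return r2 / (r2 + a2)

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))                     # stand-in data, only to run the functions
R = np.corrcoef(X, rowvar=False)
print(bartlett_sphericity(R, n=100), kmo(R))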
Communalities

                     Initial   Extraction
Vehicle type          1.000      .930
Price in thousands    1.000      .876
Engine size           1.000      .843
Horsepower            1.000      .933
Wheelbase             1.000      .881
Width                 1.000      .776
Length                1.000      .919
Curb weight           1.000      .891
Fuel capacity         1.000      .861
Fuel efficiency       1.000      .860
Extraction Method: Principal Component Analysis.

Extraction communalities are estimates of the variance in each variable accounted for by the
components. The communalities in this table are all high, which indicates that the extracted
components represent the variables well. If any communalities are very low in a principal
components extraction, you may need to extract another component.
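
A small illustration of where these figures come from: with standardized variables, each variable's extraction communality is the sum of its squared loadings on the retained components. The loading values below are hypothetical and serve only to show the arithmetic.

import numpy as np

# Hypothetical loadings for 3 variables on 2 retained components (illustration only).
loadings = np.array([[0.90, 0.20],
                     [0.15, 0.85],
                     [0.70, 0.45]])
communalities = (loadings ** 2).sum(axis=1)   # e.g. 0.90**2 + 0.20**2 = 0.85 for the first variable
print(communalities)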

Total Variance Explained

            Initial Eigenvalues                   Extraction Sums of Squared Loadings    Rotation Sums of Squared Loadings
Component   Total   % of Variance  Cumulative %   Total   % of Variance  Cumulative %    Total   % of Variance  Cumulative %
1           5.994   59.938         59.938         5.994   59.938         59.938          3.220   32.199         32.199
2           1.654   16.545         76.482         1.654   16.545         76.482          3.134   31.344         63.543
3           1.123   11.227         87.709         1.123   11.227         87.709          2.417   24.166         87.709
4            .339    3.389         91.098
5            .254    2.541         93.640
6            .199    1.994         95.633
7            .155    1.547         97.181
8            .130    1.299         98.480
9            .091     .905         99.385
10           .061     .615        100.000
Extraction Method: Principal Component Analysis.

This output gives the total variance explained. The table shows the total variance contributed by each
component. We can see that the percentage of total variance contributed by the first component is 59.938,
by the second component 16.545 and by the third component 11.227. It is also clear from this table
that there are three distinct components for the given set of variables.

Scree Plot: eigenvalues plotted against the component number (components 1 to 10).

The scree plot gives the number of components against the eigenvalues and helps to determine
the optimal number of components.
Incidentally, "scree" is the geological term referring to the debris which collects on the lower part of a
rocky slope
A component on the steep part of the slope explains a good percentage of the total variance, and hence the
component is justified. A shallow slope indicates that the contribution to the total variance is small, and the
component is not justified. In the above plot, the first three components have a steep slope and thereafter the
slope is shallow. This indicates that the ideal number of components is three.

Component Matrix(a)

                      Component
                      1        2        3
Vehicle type         .471     .533    -.651
Price in thousands   .580    -.729    -.092
Engine size          .871    -.290     .018
Horsepower           .740    -.618     .058
Wheelbase            .732     .480     .340
Width                .821     .114     .298
Length               .719     .304     .556
Curb weight          .934     .063    -.121
Fuel capacity        .885     .184    -.210
Fuel efficiency     -.863     .004     .339
Extraction Method: Principal Component Analysis.
a. 3 components extracted.

This table gives each variable's component loadings, but it is the next table that is easier to interpret.

Rotated Component Matrix(a)

                      Component
                      1        2        3
Vehicle type        -.101     .095     .954
Price in thousands   .935    -.003     .041
Engine size          .753     .436     .292
Horsepower           .933     .242     .056
Wheelbase            .036     .884     .314
Width                .384     .759     .231
Length               .155     .943     .069
Curb weight          .519     .533     .581
Fuel capacity        .398     .495     .676
Fuel efficiency     -.543    -.318    -.681
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 4 iterations.

This table is the most important table for interpretation. The maximum of each row (ignoring sign)
indicates the component to which the respective variable belongs. The variables ‘price in
thousands’, ‘engine size’ and ‘horsepower’ are highly correlated and contribute to a single
component; ‘wheelbase’, ‘width’ and ‘length’ contribute to the second component; and ‘vehicle type’,
‘curb weight’ and ‘fuel capacity’ contribute to the third component.
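
This row-wise reading can be expressed in one line of code. The sketch below copies three rows from the table above (price in thousands, wheelbase and vehicle type) and assigns each variable to the component with the largest absolute loading.

import numpy as np

rotated = np.array([[ .935, -.003,  .041],    # price in thousands
                    [ .036,  .884,  .314],    # wheelbase
                    [-.101,  .095,  .954]])   # vehicle type
assigned = np.abs(rotated).argmax(axis=1) + 1 # 1-based component each variable loads on most
print(assigned)                               # -> [1 2 3]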
Component Transformation Matrix

Component      1        2        3
1            .601     .627     .495
2           -.797     .422     .433
3           -.063     .655    -.753
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.

Component Score Coefficient Matrix

                      Component
                      1        2        3
Vehicle type        -.173    -.194     .615
Price in thousands   .414    -.179    -.081
Engine size          .226     .028    -.016
Horsepower           .368    -.046    -.139
Wheelbase           -.177     .397    -.042
Width                .011     .289    -.102
Length              -.105     .477    -.234
Curb weight          .070     .043     .175
Fuel capacity        .012     .017     .262
Fuel efficiency     -.107     .108    -.298
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
Component Scores.

This table gives the component score coefficients for each variable. The component scores can be saved for
each case in the SPSS file. These scores are useful for replacing highly interrelated variables in
regression analysis. In the above table, the coefficients are given component-wise; the score of a case
on each component is calculated as a linear combination of its (standardized) variable values weighted by
that component's coefficients.
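
A minimal sketch of that relationship, with random stand-in matrices in place of the real standardized data and coefficient matrix:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 10))   # stand-in for 5 cases on the 10 standardized car variables
W = rng.standard_normal((10, 3))   # stand-in for the 10 x 3 component score coefficient matrix
factor_scores = Z @ W              # 5 x 3 matrix: each case's score on each component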

Component Score Covariance Matrix

Component      1        2        3
1           1.000     .000     .000
2            .000    1.000     .000
3            .000     .000    1.000
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.

Component Scores.

Common Factor Analysis Using SPSS


For illustration we will be considering the following case.
The file Telecom.sav is provided in the CD with the book.

Case 1
In the year 2009, the TRAI (Telephone Regulatory Authority of India) was assessing the
requirements for number portability. Number portability is defined as switching a service provider,
without having to change the number. This year had seen a fierce price war in the telecom sector. Some
of the oldest service providers are still, to a certain extent, immune to this war, as most of their
consumers would not like to change their numbers. Number portability will intensify the price war and
give an opportunity to relatively new service providers. The price war is so fierce that industry
experts comment that the future lies in how one differentiates in terms of services rather than price.
With this background, a TELECOMM company conducted a research study to find the factors that
affect consumers while selecting / switching a telecom service provider. The survey was conducted
on 35 respondents. They were asked to rate 12 questions about their perception of the factors important
to them while selecting a service provider, on a 7-point scale (1 = completely disagree, 7 = completely
agree).
The research design for data collection can be stated as follows-
35 telecom users were surveyed about their perceptions and image attributes of the service providers
they owned. Twelve questions were asked to each of them, all answered on a scale of 1 to 7 (1=
completely disagree, 7= completely agree).
I decide my telephone provider on the basis of following attributes. (1= completely disagree, 7=
completely agree)
1. Availability of services (like drop boxes and different payment options, in case of post paid, and
recharge coupons in case of pre paid)
2. Good network connectivity all through the city.
3. Internet connection, with good speed.
4. Quick and appropriate response at customer care centre.
5. Connectivity while roaming (out of the state or out of country)
6. Call rates and Tariff plans
7. Additional features like unlimited SMS, lifetime prepaid, 2 phones free calling, etc.
8. Quality of network service like Minimum call drops, Minimum down time, voice quality, etc.
9. SMS and Value Added Services charges.
10. Value Added Services like MMS, caller tunes, etc.
11. Roaming charges
12. Conferencing
The sample of data collected is tabulated as follows

SrNo Q1 Q2 Q3 Q4 Q5 Q6 Q7 Q8 Q9 Q10 Q11 Q12


1 3 6 5 4 6 7 2 5 1 4 2 5
2 3 5 3 4 6 5 4 5 4 2 4 2

Carry out relevant analysis and write a report to discuss the findings for the above data.

The initial process of conducting common factor analysis is exactly the same as for principal component
analysis, except for the extraction method selected in FA Snapshot 4.
We will discuss only the steps that are different from the principal component analysis shown above.
The following steps are carried out to run factor analysis using SPSS.
1. Open the file Telecom.sav.
2. Click on Analyze -> Data Reduction -> Factor as shown in FA Snapshot 1.
3. The following window will be opened by SPSS.

FA Snapshot 8
Select the variables Q1, Q2, Q3, … through Q12.

4. Click on Descriptives; select Coefficients, Initial solution, KMO and Bartlett’s test of
sphericity, and also Anti-image, as shown in FA Snapshot 3.
It may be noted that we did not select Anti-image in PCA, but we are required to select it here.
5. Click on Extraction; the following window will be opened by SPSS.

FA Snapshot 9
1. Select the method as Principal axis factoring.
2. Select Correlation matrix.
3. Select Unrotated factor solution.
4. Select Scree plot.
5. Click on Continue.

6. SPSS will take you back to the window shown in FA Snapshot 8; at this stage click on Rotation. The
window SPSS will open is shown in FA Snapshot 6.
7. Select Varimax rotation, select Display rotated solution and click Continue, as shown in FA
Snapshot 6.
8. It may be noted that in the PCA of FA Snapshot 7 we chose to save factor scores as variables, which is not
required here.
The following output will be generated by SPSS.
Factor Analysis
Descriptive Statistics

      Mean   Std. Deviation   Analysis N
Q1    4.80   1.568            35
Q2    3.20   1.410            35
Q3    2.83   1.671            35
Q4    3.89   1.605            35
Q5    3.09   1.245            35
Q6    3.49   1.772            35
Q7    3.23   1.734            35
Q8    3.86   1.611            35
Q9    3.46   1.633            35
Q10   3.74   1.615            35
Q11   3.17   1.505            35
Q12   3.60   1.866            35

These are the descriptive statistics given by SPSS; they provide a general understanding of the variables.
Correlation Matrix

       Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9     Q10    Q11    Q12
Q1    1.000  -.128  -.294   .984  -.548  -.017  -.188  -.093   .129  -.242  -.085  -.259
Q2    -.128  1.000   .302  -.068   .543   .231   .257   .440   .355   .359   .344   .378
Q3    -.294   .302  1.000  -.282   .558   .148   .258   .056   .148   .898   .164   .930
Q4     .984  -.068  -.282  1.000  -.510  -.052  -.223  -.063   .099  -.227  -.113  -.251
Q5    -.548   .543   .558  -.510  1.000   .101   .195   .387   .067   .538   .149   .559
Q6    -.017   .231   .148  -.052   .101  1.000   .901   .159   .937   .230   .906   .096
Q7    -.188   .257   .258  -.223   .195   .901  1.000   .202   .866   .379   .943   .211
Q8    -.093   .440   .056  -.063   .387   .159   .202  1.000   .204   .042   .192   .156
Q9     .129   .355   .148   .099   .067   .937   .866   .204  1.000   .258   .889   .091
Q10   -.242   .359   .898  -.227   .538   .230   .379   .042   .258  1.000   .309   .853
Q11   -.085   .344   .164  -.113   .149   .906   .943   .192   .889   .309  1.000   .119
Q12   -.259   .378   .930  -.251   .559   .096   .211   .156   .091   .853   .119  1.000

This is the correlation matrix. The Common Factor Analysis can be carried out if the correlation
matrix for the variables contains at least two correlations of 0.30 or greater. It may be noted that some
of the correlations >0.3 are marked in circle.
KMO and Bartlett's Test

Kaiser-Meyer-Olkin Measure of Sampling Adequacy        .658
Bartlett's Test of Sphericity   Approx. Chi-Square     497.605
                                df                     66
                                Sig.                   .000

The KMO measure of sampling adequacy is an index used to test the appropriateness of factor
analysis. The minimum required KMO is 0.5. The above table shows that the index for this data is
0.658 and that the chi-square statistic is significant (.000 < 0.05). This means that factor
analysis is appropriate for this data.

Communalities

      Initial   Extraction
Q1     .980      .977
Q2     .730      .607
Q3     .926      .975
Q4     .978      .996
Q5     .684      .753
Q6     .942      .917
Q7     .942      .941
Q8     .396      .379
Q9     .949      .942
Q10    .872      .873
Q11    .934      .924
Q12    .916      .882
Extraction Method: Principal Axis Factoring.

Initial communalities are the proportion of variance accounted for in each variable by the rest of the
variables. Small communalities for a variable indicate that the proportion of variance that this
variable shares with other variables is too small. Thus, this variable does not fit the factor solution. In
the above table, most of the initial communalities are very high indicating that all the variables share
a good amount of variance with each other, an ideal situation for factor analysis.
Extraction communalities are estimates of the variance in each variable accounted for by the factors
in the factor solution. The communalities in this table are all high. It indicates that the extracted
factors represent the variables well.

Total Variance Explained

         Initial Eigenvalues                   Extraction Sums of Squared Loadings    Rotation Sums of Squared Loadings
Factor   Total   % of Variance  Cumulative %   Total   % of Variance  Cumulative %    Total   % of Variance  Cumulative %
1        4.789   39.908         39.908         4.671   38.926         38.926          3.660   30.500         30.500
2        3.035   25.288         65.196         2.956   24.635         63.561          2.929   24.410         54.910
3        1.675   13.960         79.156         1.628   13.570         77.131          2.205   18.372         73.283
4        1.321   11.006         90.163          .911    7.593         84.724          1.373   11.441         84.724
5         .526    4.382         94.545
6         .275    2.291         96.836
7         .157    1.307         98.143
8         .093     .774         98.917
9         .050     .421         99.338
10        .042     .353         99.691
11        .027     .227         99.918
12        .010     .082        100.000
Extraction Method: Principal Axis Factoring.

This output gives the variance explained by the initial solution. The table shows the total variance
contributed by each factor. We may note that the percentage of total variance contributed by the first
factor is 39.908, by the second factor 25.288, by the third factor 13.960 and by the fourth factor 11.006.
The percentage of total variance is highest for the first factor and decreases thereafter. It is
also clear from this table that there are four distinct factors for the given set of variables.

Scree Plot: eigenvalues plotted against the factor number (factors 1 to 12).

The scree plot gives the number of factors against the eigenvalues and helps to determine the
optimal number of factors. Factors on the steep part of the slope explain a larger percentage of total
variance; a shallow slope indicates that the contribution to the total variance is small. In the above plot,
the first four factors have a steep slope and thereafter the slope is shallow. It
may also be noted from the plot that four factors have eigenvalues greater than one.
Hence the ideal number of factors is four.
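
The eigenvalue-greater-than-one rule can be checked directly from the correlation matrix. The following minimal sketch assumes the survey data has been exported to a hypothetical telecom.csv with columns Q1 to Q12.

import numpy as np
import pandas as pd

survey = pd.read_csv("telecom.csv")                       # hypothetical export of Telecom.sav
R = np.corrcoef(survey[[f"Q{i}" for i in range(1, 13)]], rowvar=False)
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]        # sorted largest first
n_factors = int((eigenvalues > 1).sum())                  # four eigenvalues exceed 1 in the table above
print(eigenvalues, n_factors)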
Factor Matrix(a)

       Factor
       1        2        3        4
Q1    -.410     .564     .698     .066
Q2     .522    -.065     .139     .557
Q3     .687    -.515     .423    -.245
Q4    -.413     .528     .725     .141
Q5     .601    -.500    -.067     .371
Q6     .705     .630    -.111    -.102
Q7     .808     .486    -.176    -.144
Q8     .293     .005    -.031     .540
Q9     .690     .682     .039     .019
Q10    .727    -.372     .401    -.213
Q11    .757     .575    -.137    -.040
Q12    .644    -.523     .432    -.090
Extraction Method: Principal Axis Factoring.
a. 4 factors extracted. 14 iterations required.

This table gives each variable's factor loadings, but it is the next table that is easier to interpret.
Rotated Factor Matrix(a)

       Factor
       1        2        3        4
Q1     .001    -.147     .972    -.109
Q2     .195     .253    -.002     .711
Q3     .084     .970    -.146     .074
Q4    -.042    -.137     .987    -.035
Q5     .006     .456    -.430     .600
Q6     .953     .053    -.004     .078
Q7     .939     .161    -.162     .086
Q8     .117    -.008    -.040     .603
Q9     .936     .068     .164     .186
Q10    .210     .899    -.101     .103
Q11    .943     .078    -.059     .158
Q12    .022     .909    -.110     .206
Extraction Method: Principal Axis Factoring.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.

This table is the most important table for interpretation. The maximum in each row (ignoring sign)
indicates that the respective variable belongs to the respective factor. For example, in the first row
the maximum is 0.972 which is for factor 3; this indicates that the Q1 contributes to third factor. In
the second row maximum is 0.711; for factor 4, indicating that Q2 contributes to factor 4, and so on.
The variables ‘Q6’, ‘Q7’, ‘Q9’, and ‘Q11’ are highly correlated and contribute to a single factor
which can be named as Factor 1 or ‘Economy’.
The variables ‘Q3’, ‘Q10’, and ‘Q12’ are highly correlated and contribute to a single factor which can
be named as Factor 2 or ‘Services beyond Calling’.
The variables ‘Q1’ and ‘Q4’ are highly correlated and contribute to a single factor which can be
named as Factor 3 or ‘Customer Care’.
The variables ‘Q2’, ‘Q5’, and ‘Q8’ are highly correlated and contribute to a single factor which can
be named as Factor 4 or ‘Anytime Anywhere Service’.
We may summarise the above analysis in the following Table.
Factors                      Questions
Factor 1                     Q.6. Call rates and Tariff plans
Economy                      Q.7. Additional features like unlimited SMS, lifetime prepaid, 2 phones free calling, etc.
                             Q.9. SMS and Value Added Services charges.
                             Q.11. Roaming charges
Factor 2                     Q.3. Internet connection, with good speed.
Services beyond Calling      Q.10. Value Added Services like MMS, caller tunes, etc.
                             Q.12. Conferencing
Factor 3                     Q.1. Availability of services (like drop boxes and different payment options, in case of post paid, and recharge coupons in case of pre paid)
Customer Care                Q.4. Quick and appropriate response at customer care centre.
Factor 4                     Q.2. Good network connectivity all through the city.
Anytime Anywhere Service     Q.5. Connectivity while roaming (out of the state or out of country)
                             Q.8. Quality of network service like Minimum call drops, Minimum down time, voice quality, etc.

This implies that the telecom service provider should consider these four factors, which customers feel
are important while selecting / switching a service provider.
7 Canonical Correlation Analysis
Canonical correlation analysis, abbreviated as CRA, is an extension of multiple regression
analysis, abbreviated as MRA. While in MRA there is one metric (measurable or non-categorical)
dependent variable, say y, and several metric independent variables, say x1, x2, ....., xk, in
CRA there are several metric dependent variables, say y1, y2, ....., ym.

CRA involves developing a linear combination of each of the two sets of variables, the y’s and the x’s –
one a linear combination of the dependent variables (also called the criterion set) and the other a linear
combination of the independent variables (also called the predictor set). The two linear
combinations are derived in such a way that the correlation between the two is maximum.
While the linear combinations are referred to as canonical variables, the correlation between the two
combinations is called the canonical correlation. It measures the strength of the overall relationship
between the linear combinations of the predictor and criterion sets of variables. In the next stage the
identification process involves choosing the second pair of linear combinations having the second
largest correlation among all pairs but uncorrelated with the initial pair. The process continues for the
third pair and so on. The practical significance of a canonical correlation is that it indicates as to
how much variance in one set of dependent variables is accounted for by another set of
independent variables. The weights in the linear combination are derived based on the criterion
that maximizes the correlation between the two sets.
It can be represented as follows:
Y1 + Y2 + ……… + Yq   =   X1 + X2 + ……… + Xp
(metric)                  (metric)
Some Indicative Applications:
 A medical researcher could be interested in determining if individuals’ lifestyle and personal
habits have an impact on their health as measured by a number of health-related variables
such as hypertension, weight, blood sugar, etc.
 The marketing manager of a consumer goods firm could be interested in determining if there
is a relationship between types of products purchased and consumers’ income and
profession.
The practical significance of a canonical correlation is that it indicates as to how much variance in one set of
variables is accounted for by another set of variables.
Squared canonical correlations are referred to as canonical roots or Eigen values.
If X1, X2, X3, ………, Xp & Y1, Y2, Y3, …….,Yq are the observable variables then canonical variables will be

U1 = a1X1 + a2X2 + ……… + apXp        V1 = b1Y1 + b2Y2 + ……… + bqYq
U2 = c1X1 + c2X2 + ……… + cpXp        V2 = d1Y1 + d2Y2 + ……… + dqYq
and so on.
The U's and V's are called canonical variables, and the coefficients are called canonical coefficients.
The first pair of sample canonical variables is obtained in such a way that
Var(U1) = Var(V1) = 1 and Corr(U1, V1) is maximum.
The second pair, U2 and V2, is selected in such a way that they are uncorrelated with U1 and V1,
and the correlation between the two is maximum, and so on.
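
Outside SPSS, canonical correlations can also be obtained with scikit-learn's CCA. The sketch below uses random stand-in arrays with the same number of cases as this study (44) purely to show the mechanics.

import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((44, 7))                 # stand-in for age, ratings and risk perceptions
Y = rng.standard_normal((44, 3))                 # stand-in for the three investment amounts

cca = CCA(n_components=2).fit(X, Y)
U, V = cca.transform(X, Y)                       # pairs of canonical variables (U_i, V_i)
canonical_corrs = [np.corrcoef(U[:, i], V[:, i])[0, 1] for i in range(2)]
print(canonical_corrs)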
DA & CRA:
CR², defined as the ratio of SSB to SST, is a measure of the strength of
the discriminant function. If its value is 0.84, it implies that 84% of the variation between the two
groups is explained by the two discriminating variables.

Canonical Discriminant Function:


Canonical correlation measures the association between the discriminant scores and the groups. For
two groups, it is the usual Pearson correlation between the scores and the groups coded as 0 and 1.

CR² = SSB / SST = (SSB/SSW) / (SST/SSW) = Eigenvalue / (1 + Eigenvalue)
Wilks’ Lambda (Λ) is the proportion of the total variance in the discriminant scores not explained by
differences among the groups. It is used to test H0 that the group means of the variables are equal.
Λ is approximated by a chi-square statistic with p(G − 1) degrees of freedom:
χ² = −[n − 1 − (p + G)/2] log Λ
p : no. of variables
G : no. of groups
0 ≤ Λ ≤ 1. If Λ is small, H0 is rejected; if it is high, H0 is accepted.

MDA & PCA : Similarities and Differences


In both cases, a new axis is identified and a new variable is formed as a linear combination of
the original variables. The new variable is given by the projection of the points onto this new axis.
The difference is with respect to the criterion used to identify the new axis.
In PCA, a new axis is formed such that the projection of the points onto this new axis
account for maximum variance in the data. This is equivalent to maximizing SST, because there is no
criterion variable for dividing the sample into groups.
In MDA, the objective is not to account for the maximum variance in the data (i.e. maximum
SST), but to maximize the ratio of the between-group to within-group sum of squares (i.e. SSB/SSW), which
results in the best discrimination between the groups. The new axis, or the new linear combination that
is identified, is called the Linear Discriminant Function. The projection of an observed point onto this
discriminant function (i.e. the value of the new variable) is called the discriminant score.

Application: Asset – Liability Mismatch


A Study on Structural changes and Asset Liability Mismatch in Scheduled Commercial Banks in
India was carried out in Reserve Bank of India. It was conducted as an empirical exercise to identify
and explore the relationships and structural changes, including hedging behaviour, between asset
and liability of cross-section of scheduled commercial banks at two different time points
representing pre- and post-liberalisation periods. As there were two sets of dependent variables,

instead of regression, the study used the canonical correlation technique to investigate the asset-
liability relationship of the banks at the two time points.

7.1 Canonical Correlation Using SPSS


We will use the case data on Commodity market perceptions given in Section 8.5.
Open the file Commodity.sav
Select from the menu Analyze – General Linear Model – Multivariate as shown below

CANONICAL Snapshot 1

The following window will be displayed

CANONICAL Snapshot 2
1. Select the dependent variables as Invest_SM, Invest_CM and Invest_MF.
2. Select the covariates as Age, Rate_CM, Rate_SM, Rate_MF, Risky_CM, Risky_SM and Risky_MF. Covariates are metric independent variables.
3. Click OK.
It may be noted that the above example is discussed in Section 5.1. The difference between
MANCOVA and canonical correlation is that MANCOVA can have both factors and metric
independent variables, whereas canonical correlation can have only metric independent variables;
factors (categorical independent variables) are not possible in canonical correlation.
We are assuming in the above example that the dependent variables are the investments in the
commodity market, the share market and mutual funds. The metric independent variables are age, the
respondents’ ratings of the commodity market, share market and mutual funds, and the respondents’
perceptions of risk for the commodity market, share market and mutual funds. Here we assume that
the investments depend on these ratings, age and risk perceptions.

The following output will be displayed


General Linear Model
Multivariate Tests(b)

Effect                              Value     F          Hypothesis df   Error df   Sig.
Intercept    Pillai's Trace         .022      .241(a)    3.000           32.000     .867
             Wilks' Lambda          .978      .241(a)    3.000           32.000     .867
             Hotelling's Trace      .023      .241(a)    3.000           32.000     .867
             Roy's Largest Root     .023      .241(a)    3.000           32.000     .867
Rate_CM      Pillai's Trace         .024      .267(a)    3.000           32.000     .848
             Wilks' Lambda          .976      .267(a)    3.000           32.000     .848
             Hotelling's Trace      .025      .267(a)    3.000           32.000     .848
             Roy's Largest Root     .025      .267(a)    3.000           32.000     .848
Rate_SM      Pillai's Trace         .026      .280(a)    3.000           32.000     .839
             Wilks' Lambda          .974      .280(a)    3.000           32.000     .839
             Hotelling's Trace      .026      .280(a)    3.000           32.000     .839
             Roy's Largest Root     .026      .280(a)    3.000           32.000     .839
risky_CM     Pillai's Trace         .123     1.497(a)    3.000           32.000     .234
             Wilks' Lambda          .877     1.497(a)    3.000           32.000     .234
             Hotelling's Trace      .140     1.497(a)    3.000           32.000     .234
             Roy's Largest Root     .140     1.497(a)    3.000           32.000     .234
risky_SM     Pillai's Trace         .044      .490(a)    3.000           32.000     .692
             Wilks' Lambda          .956      .490(a)    3.000           32.000     .692
             Hotelling's Trace      .046      .490(a)    3.000           32.000     .692
             Roy's Largest Root     .046      .490(a)    3.000           32.000     .692
Age          Pillai's Trace         .152     1.914(a)    3.000           32.000     .147
             Wilks' Lambda          .848     1.914(a)    3.000           32.000     .147
             Hotelling's Trace      .179     1.914(a)    3.000           32.000     .147
             Roy's Largest Root     .179     1.914(a)    3.000           32.000     .147
Rate_FD      Pillai's Trace         .031      .338(a)    3.000           32.000     .798
             Wilks' Lambda          .969      .338(a)    3.000           32.000     .798
             Hotelling's Trace      .032      .338(a)    3.000           32.000     .798
             Roy's Largest Root     .032      .338(a)    3.000           32.000     .798
Rate_MF      Pillai's Trace         .092     1.075(a)    3.000           32.000     .373
             Wilks' Lambda          .908     1.075(a)    3.000           32.000     .373
             Hotelling's Trace      .101     1.075(a)    3.000           32.000     .373
             Roy's Largest Root     .101     1.075(a)    3.000           32.000     .373
risky_FD     Pillai's Trace         .145     1.814(a)    3.000           32.000     .164
             Wilks' Lambda          .855     1.814(a)    3.000           32.000     .164
             Hotelling's Trace      .170     1.814(a)    3.000           32.000     .164
             Roy's Largest Root     .170     1.814(a)    3.000           32.000     .164
risky_MF     Pillai's Trace         .001      .012(a)    3.000           32.000     .998
             Wilks' Lambda          .999      .012(a)    3.000           32.000     .998
             Hotelling's Trace      .001      .012(a)    3.000           32.000     .998
             Roy's Largest Root     .001      .012(a)    3.000           32.000     .998
a. Exact statistic
b. Design: Intercept+Rate_CM+Rate_SM+risky_CM+risky_SM+Age+Rate_FD+Rate_MF+risky_FD+risky_MF

This table indicates that the hypotheses about age, the ratings of CM, SM and MF, and the risk perceptions of
CM, SM and MF are not rejected (as the p-values are greater than 0.05). This means there is no significant
difference in the investments for these variables.

Tests of Between-Subjects Effects

Source            Dependent    Type III Sum      df   Mean Square     F        Sig.
                  Variable     of Squares
Corrected Model   Invest_CM    6554404686(a)      9   728267187.3     2.363    .034
                  Invest_SM    3.825E+010(b)      9   4250227452      1.723    .122
                  Invest_MF    3.458E+010(c)      9   3842114308      1.269    .289
Intercept         Invest_CM    41236149.0         1   41236149.02      .134    .717
                  Invest_SM    1820226683         1   1820226683       .738    .396
                  Invest_MF    2063858495         1   2063858495       .682    .415
Rate_CM           Invest_CM    3146747.812        1   3146747.812      .010    .920
                  Invest_SM    701640357          1   701640356.6      .284    .597
                  Invest_MF    91570171.1         1   91570171.09      .030    .863
Rate_SM           Invest_CM    5827313.435        1   5827313.435      .019    .891
                  Invest_SM    305120.840         1   305120.840       .000    .991
                  Invest_MF    289045935          1   289045935.2      .095    .759
risky_CM          Invest_CM    1383505869         1   1383505869      4.490    .041
                  Invest_SM    3599660366         1   3599660366      1.459    .235
                  Invest_MF    4765133773         1   4765133773      1.574    .218
risky_SM          Invest_CM    5068043.704        1   5068043.704      .016    .899
                  Invest_SM    3129346168         1   3129346168      1.269    .268
                  Invest_MF    3115327283         1   3115327283      1.029    .318
Age               Invest_CM    1759288000         1   1759288000      5.709    .023
                  Invest_SM    4483914274         1   4483914274      1.818    .187
                  Invest_MF    5852257582         1   5852257582      1.933    .173
Rate_FD           Invest_CM    314194024          1   314194023.6     1.020    .320
                  Invest_SM    1038652882         1   1038652882       .421    .521
                  Invest_MF    1418050354         1   1418050354       .468    .498
Rate_MF           Invest_CM    709118826          1   709118825.7     2.301    .139
                  Invest_SM    4469496231         1   4469496231      1.812    .187
                  Invest_MF    3743054315         1   3743054315      1.236    .274
risky_FD          Invest_CM    925863328          1   925863327.7     3.005    .092
                  Invest_SM    7133105111         1   7133105111      2.891    .098
                  Invest_MF    4644806048         1   4644806048      1.534    .224
risky_MF          Invest_CM    112028.404         1   112028.404       .000    .985
                  Invest_SM    46590657.4         1   46590657.39      .019    .892
                  Invest_MF    14383175.1         1   14383175.06      .005    .945
Error             Invest_CM    1.048E+010        34   308143679.0
                  Invest_SM    8.388E+010        34   2466958509
                  Invest_MF    1.029E+011        34   3027453031
Total             Invest_CM    2.780E+010        44
                  Invest_SM    2.109E+011        44
                  Invest_MF    2.302E+011        44
Corrected Total   Invest_CM    1.703E+010        43
                  Invest_SM    1.221E+011        43
                  Invest_MF    1.375E+011        43
a. R Squared = .385 (Adjusted R Squared = .222)
b. R Squared = .313 (Adjusted R Squared = .131)
c. R Squared = .251 (Adjusted R Squared = .053)

The above table gives three different models, namely a, b and c. Model a is for the first dependent
variable, Invest_CM; model b is for the dependent variable Invest_SM; and model c is for the dependent
variable Invest_MF.
The table also indicates the individual relationship between each dependent–independent variable
pair. Only two pairs, namely risky_CM and Invest_CM, and Age and Invest_CM, are significant
(p-value less than 0.05, indicated by circles). This indicates that the independent
variable, consumers’ perception of the riskiness of commodity markets (variable risky_CM),
significantly affects the dependent variable, i.e. their investment in commodity markets; in other words,
the riskiness perceived by consumers affects their investments in that market. Similarly, the variable
Age also affects their investments in commodity markets. All other combinations are not
significant.

8 Cluster Analysis
This type of analysis is used to divide a given number of entities or objects into groups called
clusters. The objective is to classify a sample of entities into a small number of mutually exclusive
clusters based on the premise that they are similar within the clusters but dissimilar among the
clusters. The criterion for similarity is defined with reference to some characteristics of the entity. For
example, for companies, it could be ‘Sales’, ‘Paid up Capital’, etc.
The basic methodological questions that are to be answered in the cluster analysis are:
 What are the relevant variables and descriptive measures of an entity?
 How do we measure the similarity between entities?
 Given that we have a measure of similarity between entities, how do we form
clusters?

 How do we decide on how many clusters are to be formed?


For measuring similarity, let us consider the following data. The data has been collected on each of
k characteristics for all the n entities under consideration of being divided into clusters. Let the k
characteristics be measured by k variables as x1, x2, x3 …, xk, and the data presented in the
following matrix form;

          Variables
          x1    x2    …    xk
Entity 1  x11   x12   …    x1k
Entity 2  x21   x22   …    x2k
……..
Entity n  xn1   xn2   …    xnk

The question is how to determine how similar or dissimilar each row of data is from the others?
This task of measuring similarity between entities is complicated by the fact that, in most cases, the
data in its original form are measured in different units or/and scales. This problem is solved by
standardizing each variable by subtracting its mean from its value and then dividing by its standard
deviation. This converts each variable to a pure number.
The measure used to define the similarity between two entities, i and j, is computed as
Dij = (xi1 − xj1)² + (xi2 − xj2)² + ……… + (xik − xjk)²
The smaller the value of Dij, the more similar are the two entities.
The basic method of clustering is illustrated through a simple example given below.
Let there be four branches of a commercial bank each described by two variables viz. deposits and
loans / credits. The following chart gives an idea of their deposits and loans / credits.

x(1) x(2)
Loans
/Credit

x(3) x(4)

Deposits
From the above chart, it is obvious that if we want two clusters, we should group branches 1 & 2
(High Deposit, High Credit) into one cluster and 3 & 4 (Low Deposit, Low Credit) into another, since
such a grouping produces clusters in which the entities (branches) within each cluster are most
similar. However, this graphical approach is not convenient for more than two variables.
In order to develop a mathematical procedure for forming the clusters, we need a criterion upon
which to judge alternative clustering patterns. This criterion defines the optimal number of entities
within each cluster.
Now we shall illustrate the methodology of using distances among the entities from clusters. We
assume the following distance similarity matrix among three entities.

Distance or Similarity Matrix

      1     2     3
1     0     5    10
2     5     0     8
3    10     8     0

The possible clusters and their respective distances are:

Cluster I   Cluster II   Distance Within      Distance Between      Total Distance Among
                         the Two Clusters     the Two Clusters      These Entities
1           2 & 3               8             15 (= 5 + 10)         23
2           1 & 3              10             13 (= 5 + 8)          23
3           1 & 2               5             18 (= 8 + 10)         23
Thus the best clustering would be to cluster entities 1 and 2 together. This would yield minimum
distance within clusters (= 5 ) , and simultaneously the maximum distance between clusters ( =18).
Obviously, if the number of entities is large, it is a prohibitive task to construct every possible
cluster pattern, compute each ‘within cluster distance’ and select the pattern which yields the
minimum. If the number of variables and dimensions are large, computers are needed.
The criterion of minimizing the within-cluster distances to form the best possible grouping into
‘k’ clusters assumes that ‘k’ clusters are to be formed. If the number of clusters to be formed is not
fixed a priori, the criterion will not specify the optimal number of clusters. However, if the
objective is to minimize the sum of the within-cluster distances and the number of clusters is free to
vary, then all that is required is to make each entity its own cluster, and the sum of the within-cluster
distances will be zero. Obviously, the more the number of clusters, the lesser will be the sum of the
within-cluster distances. Thus, making each entity its own cluster is of no value. This issue is,
therefore, resolved intuitively.
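
The same three-entity example can be reproduced with scipy's hierarchical clustering: feeding it the distance matrix above, the first merge is indeed between entities 1 and 2 at distance 5.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

D = np.array([[0, 5, 10],
              [5, 0, 8],
              [10, 8, 0]], dtype=float)
Z = linkage(squareform(D), method="single")   # condensed distances -> merge history
print(Z)                                      # first row: entities 0 and 1 merged at distance 5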

Discriminant and Cluster Analysis


It may be noted that both discriminant analysis and cluster analysis are classification
techniques. However, there is a basic difference between the two. In DA, the data is
classified into a given set of categories using some prior information about the data; the entire rule of
classification is based on the categorical dependent variable and the tolerance of the model. Cluster
analysis, on the other hand, does not assume any dependent variable. It uses different methods of classification to
classify the data into groups without any prior information. Cases with similar data fall
in the same group, and cases with distinct data are classified into different groups.
Most computer-oriented programs find the optimum number of clusters through their own
algorithms. We have described the methods of forming clusters in Section 8.3, and the use of the SPSS
package in Section 8.5.

8.1 Some Applications of Cluster Analysis


Following are two applications of cluster analysis in the banking system:
(i) Commercial Bank
In one of the banks in India, all its branches were formed into clusters by taking 15 variables
representing various aspects of the functioning of the branches. The variables are ; four types of
deposits, four types of advances, miscellaneous business such as ‘drafts issued’, receipts on behalf
of Government, foreign exchange business, etc. The bank uses this grouping of branches into
clusters for collecting information or conducting quick surveys to study any aspect, planning,
analysing and monitoring. If any sample survey is to be conducted, it is ensured that samples are
taken from branches in all the clusters so as to get a true representative of the entire bank. Since, the

branches in the same cluster are more or less similar to each other, only few branches are selected
from each cluster.
(ii) Agricultural Clusters
A study was conducted by one of the officers of the Reserve Bank of India, to form clusters of
geographic regions of the country based on agricultural parameters like cropping pattern, rainfall,
land holdings, productivity, fertility, use of fertilisers, irrigation facilities, etc. The whole country
was divided into 9 clusters. Thus, all the 67 regions of the country were allocated to these clusters.
Such classification is useful for making policies at the national level as also at regional/cluster
levels.
8.2 Key Terms in Cluster Analysis
Agglomeration Schedule While performing cluster analysis, the tool gives information on the objects or
cases being combined at each stage of the hierarchical clustering process. This
is an in-depth table which indicates the clusters and the objects combined in
each cluster; it is read from top to bottom. The table starts with
the first two cases combined together, and it also states the ‘Distance Coefficients’ and
‘Stage Cluster First Appears’. The distance coefficient is an important
measure for identifying the number of clusters for the data. A sudden jump in the
coefficient indicates better grouping. The last row of the table represents
the one-cluster solution, the second last the two-cluster solution, and so on.
Cluster Centroid Cluster centroids are the mean values of the variables under consideration
(the variables given while clustering) for all the cases in a particular cluster.
Each cluster has its own centroid for each variable.
Cluster Centers These are the initial starting points in non-hierarchical clustering. Clusters are
built around these centers, which are also termed seeds.
Cluster Membership This is the cluster to which each case belongs. It is important to save the cluster
membership in order to analyse the clusters and to further perform ANOVA on the data.
Dendrogram This is the graphical summary of the cluster solution. It is used more often
than the Agglomeration Schedule while interpreting results, as it is easier to
interpret. The cases are listed along the left vertical axis.
The horizontal axis shows the distance between clusters when they are
joined.
This graph gives an indication of the number of clusters the solution may
have. The diagram is read from right to left: the rightmost point is the single-cluster
solution, just before it is the two-cluster solution, and so on. The best
solution is where the horizontal distance is maximum. This can be a
subjective process.
Icicle Diagram It displays information about how cases are combined into clusters at each
iteration of the analysis.
Similarity/Distance It is a matrix containing the pairwise distances between the cases.
Coefficient Matrix
8.3 Clustering Procedures and Methods
The cluster analysis procedure could be
 Hierarchical
 Non- hierarchical
Hierarchical methods develop a tree-like structure (dendrogram). These could be:
 Agglomerative – starts with each case as a separate cluster; at every stage the most similar
clusters are combined, ending with a single cluster.
 Divisive – starts with all cases in a single cluster; the clusters are then divided on the basis
of the differences between the cases, ending with every case as a separate cluster.
Most common methods of clustering are Agglomerative methods. This could be further divided into
 Linkage Methods – these distance measures. There are three linkage methods, Single linkage
– minimum distance or nearest neighborhood rule, Complete linkage – Maximum distance or
107

furthest neighborhood and Average linkage – average distance between all pair of objects.
This is explained in the Diagram provided below.
 Centroid Methods – this method considers distance between the two centroids. Centroid is the
means for all the variables
 Variance Methods – this is commonly termed as Ward’s method it uses the squared distance
from the means.
Diagram (each panel illustrates the distance used between Cluster 1 and Cluster 2):
a) Single Linkage (Minimum Distance / Nearest Neighborhood)
b) Complete Linkage (Maximum Distance / Furthest Neighborhood)
c) Average Linkage (Average Distance)
d) Centroid Method
e) Ward's Method (Variance Method)

Non-hierarchical clustering is frequently termed K-means clustering.
It may be noted that each of the above methods may give different results and a different set of clusters. The method is generally selected on the basis of the clarity of the cluster solution: if a particular method does not give appropriate or clear results, another method may be tried. Cluster analysis is therefore a trial-and-error process (especially in the case of hierarchical clustering). There are no specific tests for establishing the validity of the results, although ANOVA can be used to check whether the clusters are genuinely distinct.
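For readers who want to try these methods outside SPSS, the following is a minimal Python sketch (not part of the case study; the rating values and the two-cluster cut are purely illustrative) that runs agglomerative clustering with both the nearest- and furthest-neighborhood rules using scipy.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

X = np.array([[3, 3, 3, 4],
              [4, 4, 2, 4],
              [1, 2, 5, 2],
              [2, 2, 4, 3],
              [5, 4, 1, 5]], dtype=float)        # 5 cases x 4 made-up rating variables

d = pdist(X, metric="sqeuclidean")               # squared Euclidean distances, as in SPSS

for method in ("single", "complete"):            # nearest vs furthest neighborhood
    Z = linkage(d, method=method)                # agglomeration schedule (n-1 merge stages)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into two clusters
    print(method, labels)

Switching the method argument is all that is needed to compare linkage rules, which mirrors the trial-and-error approach described above.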

8.4 Assumptions of Cluster Analysis

Since distance measures are used in cluster analysis, it is assumed that the variables are measured on comparable scales, i.e. in the same units or dimensions. If all the variables are rating questions and the rating scales are the same, this assumption is satisfied. But if the variables have different dimensions – for example, one variable is salary, another is age, and some are ratings on a 1 to 7 scale – this difference will affect the clustering. The problem is solved through standardization, which equalizes the effect of variables measured on different scales.
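As a small illustration (not from the book's dataset; the column names and values below are made up), the z-score standardization that SPSS applies can be reproduced in Python as follows.

import pandas as pd
from scipy.stats import zscore

df = pd.DataFrame({"age":    [23, 45, 31, 54],
                   "salary": [40000, 90000, 60000, 75000],
                   "rating": [5, 2, 4, 3]})
df_z = df.apply(zscore)          # every column now has mean 0 and standard deviation 1
print(df_z.round(2))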

8.5 Cluster Analysis Using SPSS


For illustration we will be considering following case study
The file Commodity.sav is provided in the CD with the book.
Case 2
INVESTMENT AWARENESS AND PERCEPTIONS
A study of investment awareness and perceptions was undertaken with the aim of achieving a better
understanding of investor behavior. The main objective was to conduct an analysis of the various
investment options available, behavior patterns and perceptions of the investors in relation to the
available options. For simplicity, four of the most popular investment options were selected for
analysis, viz.
 Investment in Commodity Markets.
 Investment in Stock Markets.
 Investment in Fixed Deposits.
 Investment in Mutual Funds.

Special focus was on the study of the levels of awareness among the investors about the
commodity markets and also perceptional ratings of the investment options by the investors.
This study was undertaken with the intention of gaining a better perspective on investor behavior patterns and of assisting the general public, individual/small investors, brokers and portfolio managers in analyzing the scope of investments and making informed decisions while investing in the above-mentioned options. However, the limitation of the study is that it considers investors from Mumbai only and hence might not be representative of the entire country.
An investment is a commitment of funds made in the expectation of some positive rate of return. If
properly undertaken, the return will be commensurate with the risk the investor assumes.
An analysis of the backgrounds and perceptions of the investors was undertaken in the report. The
data used in the analysis was collected by e-mailing and distributing the questionnaire among friends,
relatives and colleagues. 45 people were surveyed, and were asked various questions relating to their
backgrounds and knowledge about the investment markets and options. The raw data contains a wide
range of information, but only the data relevant to the objective of the study was considered.
The questionnaire used for the study is as follows

QUESTIONNAIRE

Age: _________

Occupation:
o SELF EMPLOYED
o GOVERNMENT
o STUDENT
o HOUSEWIFE
o DOCTOR
o ENGINEER
o CORPORATE PROFESSIONAL
o OTHERS (PLEASE SPECIFY) : ________________________
Gender:
o MALE
o FEMALE

1. RATE FOLLOWING ON THE PREFERENCE OF INVESTMENT {1 - Least Preferred, 5 - Most Preferred (√ the appropriate number)}

Question 1 2 3 4 5
COMMODITY MARKET
STOCK MARKET
FIXED DEPOSITS
Mutual Funds
2. HOW MUCH YOU ARE READY TO INVEST In COMMODITY MARKET?
_______________
3. HOW MUCH YOU ARE READY TO INVEST In STOCK MARKET ? ______________
4. HOW MUCH YOU ARE READY TO INVEST In FIXED DEPOSITS? ______________
5. HOW MUCH YOU ARE READY TO INVEST In MUTUAL FUNDS? ______________
6. FOR HOW MUCH TIME WOULD YOU BLOCK YOUR MONEY WITH INVESTMENTS?
o UNDER 5 MONTHS
o 6-12 MONTHS
o 1-3 YEARS
o MORE THAN 3 YEARS
7. ON A SCALE OF 1–10 (1 - LEAST RISKY & 10 - MOST RISKY), HOW RISKY DO YOU THINK IS THE COMMODITY MARKET? (Circle the appropriate number)

SAFE                                                        RISKY
8. ON A SCALE OF 1-10, HOW RISKY DO YOU THINK IS THE STOCK MARKET?

SAFE RISKY

9. ON A SCALE OF 1-10, HOW RISKY DO YOU THINK IS FIXED DEPOSITS?

SAFE RISKY
10. ON A SCALE OF 1-10 HOW RISKY DO YOU THINK ARE MUTUAL FUNDS?

SAFE RISKY

The sample of collected data is given below


Sr No  Age  Occu  Sex  Rate CM  Rate SM  Rate FD  Rate MF  Inv CM  Inv SM  Inv FD  Inv MF  Block Money  Risky CM  Risky SM  Risky FD  Risky MF
1      23   1     1    3        3        3        4        6000    8000    3000    5000    2            5         6         1         7
2      18   1     1    4        4        2        4        4000    5000    0       8000    2            7         9         3         6

Perform the appropriate analysis to find different groups of investors.


We shall use the Commodity.sav file to conduct cluster analysis.
We will start with the hierarchical cluster analysis. K-means cluster will be explained later.
After opening the file Commodity.sav, click on Analyze – Classify – Hierarchical Cluster as shown in the following snapshot.
CA Snapshot 1
The window that will be opened is shown below

CA Snapshot 2

1. Select the variables Age, Rate_CM through Risky_MF.
2. Under 'Cluster', 'Variables' can be selected if one wants to perform cluster analysis on variables rather than cases (as in factor analysis); the default is 'Cases'.
3. Click on Plots.
The following window will be opened
CA Snapshot 3

1. Select Dendrogram.
2. One may select Icicle for all clusters or for a specified range of clusters; the default is all clusters.
3. Select the orientation of the icicle diagram; the default is vertical.
4. Click on Continue.
SPSS will take us back to the window displayed in CA Snapshot 2. At this stage click on 'Method'. SPSS will open the following window.

CA Snapshot 4

Select the method of clustering from the list of methods. We shall first select the nearest neighborhood (single linkage) method, analyse the cluster solution, and if required select the furthest neighborhood method later.
The next step is to select the clustering measure. The most common measure is the squared Euclidean distance.

CA Snapshot 5
1. Select Squared Euclidean Distance from the list of measures.
2. Select Standardize: Z scores, since our data has scale differences among the variables.
3. Click on Continue.
SPSS will return to the window shown in CA Snapshot 2. At this stage click on Save, and the following window will be displayed.

CA Snapshot 6

SPSS gives the option to save the cluster membership. This is useful when deciding how many clusters should be formed for the given data, or for understanding and analysing the formed clusters by performing ANOVA on them; for the ANOVA, saving the cluster membership is required. One may save a single solution (if sure of the number of clusters) or a range of solutions. The default is None, and at this stage we will keep the default.
Click on Continue, and SPSS will return to the window shown in CA Snapshot 2. At this stage click OK.
Following output will be displayed.

We shall discuss this output in detail.


Proximities

Case Processing Summarya

Cases
Valid Missing Total
N Percent N Percent N Percent
44 97.8% 1 2.2% 45 100.0%
a. Squared Euclidean Distance used

This table gives the case processing summary and its percentages. It indicates that there are 44 valid cases out of 45; since one case has some missing values, it is excluded from the analysis.

Cluster

Single Linkage
This is the method we selected for cluster analysis
Agglomeration Schedule

Stage   Cluster Combined (Cluster 1, Cluster 2)   Coefficients   Stage Cluster First Appears (Cluster 1, Cluster 2)   Next Stage
1 6 42 .000 0 0 43
2 7 43 .829 0 0 20
3 2 10 2.190 0 0 16
4 40 41 2.361 0 0 6
5 8 44 2.636 0 0 20
6 38 40 3.002 0 4 15
7 30 32 3.749 0 0 14
8 1 14 3.808 0 0 10
9 26 29 3.891 0 0 12
10 1 16 4.145 8 0 13
11 34 35 4.326 0 0 12
12 26 34 4.587 9 11 15
13 1 18 4.698 10 0 16
14 19 30 5.105 0 7 23
15 26 38 5.751 12 6 19
16 1 2 5.921 13 3 17
17 1 20 6.052 16 0 18
18 1 9 6.236 17 0 21
19 24 26 6.389 0 15 32
20 7 8 6.791 2 5 37
21 1 5 7.298 18 0 22
22 1 15 7.481 21 0 24
23 11 19 7.631 0 14 31
24 1 21 7.711 22 0 26
25 4 12 7.735 0 0 28
26 1 13 8.289 24 0 27
27 1 27 8.511 26 0 28
28 1 4 8.656 27 25 29
29 1 22 8.807 28 0 30
30 1 33 8.994 29 0 31
31 1 11 9.066 30 23 32
32 1 24 9.071 31 19 33
33 1 17 9.245 32 0 34
34 1 28 9.451 33 0 35
35 1 31 9.483 34 0 36
36 1 37 9.946 35 0 38
37 7 36 9.953 20 0 38
38 1 7 10.561 36 37 39
39 1 25 10.705 38 0 40
40 1 23 11.289 39 0 41
41 1 39 12.785 40 0 42
42 1 3 12.900 41 0 43
43 1 6 15.449 42 1 0

This table gives the agglomeration schedule, i.e. the details of the clusters formed at each stage. It indicates that cases 6 and 42 were combined at the first stage, cases 7 and 43 at the second stage, cases 2 and 10 at the third stage, and so on. A large coefficient at the last stage (stage 43) points to the two-cluster solution (the grouping just before that final merge); a large coefficient at the stage above it (stage 42) points to the three-cluster solution, and so on. The column 'Coefficients' shows the distance coefficient, and a sudden increase in the coefficient indicates that stopping at that stage is more appropriate. This is one of the indicators for deciding the number of clusters.
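The same "jump in the coefficients" rule can be sketched in Python (illustrative only; the random data below merely stands in for the 44 standardized cases). The third column of scipy's linkage matrix plays the role of the Coefficients column, and the largest jump between successive stages suggests how many clusters to keep.

import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(44, 14))                    # stand-in for 44 standardized cases
Z = linkage(pdist(X, "sqeuclidean"), method="single")

heights = Z[:, 2]                                # the 'Coefficients' column, one value per stage
jumps = np.diff(heights)                         # the 'Difference in Coefficients' column
j = int(np.argmax(jumps))                        # the biggest jump occurs going into stage j + 2
n_clusters = len(X) - j - 1                      # clusters present just before that costly merge
print(n_clusters)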
Agglomeration Schedule (replicated, with a 'Difference in Coefficients' column added)

Stage   Cluster Combined (Cluster 1, Cluster 2)   Coefficients   Stage Cluster First Appears (Cluster 1, Cluster 2)   Next Stage   Difference in Coefficients
1 6 42 0 0 0 43
2 7 43 0.828734 0 0 20 0.828734
3 2 10 2.189939 0 0 16 1.361204
4 40 41 2.360897 0 0 6 0.170958
5 8 44 2.636238 0 0 20 0.275341
6 38 40 3.002467 0 4 15 0.366229
7 30 32 3.749321 0 0 14 0.746854
8 1 14 3.808047 0 0 10 0.058726
9 26 29 3.89128 0 0 12 0.083232
10 1 16 4.1447 8 0 13 0.253421
11 34 35 4.325831 0 0 12 0.181131
12 26 34 4.587371 9 11 15 0.26154
13 1 18 4.697703 10 0 16 0.110332
14 19 30 5.105442 0 7 23 0.407739
15 26 38 5.750915 12 6 19 0.645473
16 1 2 5.921352 13 3 17 0.170437
17 1 20 6.052442 16 0 18 0.13109
18 1 9 6.236206 17 0 21 0.183764
19 24 26 6.389243 0 15 32 0.153037
20 7 8 6.790893 2 5 37 0.40165
21 1 5 7.297921 18 0 22 0.507028
22 1 15 7.480892 21 0 24 0.182971
23 11 19 7.631185 0 14 31 0.150293
24 1 21 7.710566 22 0 26 0.079381
25 4 12 7.735374 0 0 28 0.024808
26 1 13 8.288569 24 0 27 0.553195
27 1 27 8.510957 26 0 28 0.222388
28 1 4 8.656467 27 25 29 0.14551
29 1 22 8.807405 28 0 30 0.150937
30 1 33 8.994409 29 0 31 0.187004
31 1 11 9.066141 30 23 32 0.071733
32 1 24 9.070588 31 19 33 0.004447
33 1 17 9.244565 32 0 34 0.173977
34 1 28 9.450741 33 0 35 0.206176
35 1 31 9.483015 34 0 36 0.032274
36 1 37 9.946286 35 0 38 0.463272
37 7 36 9.953277 20 0 38 0.00699
38 1 7 10.56085 36 37 39 0.607572
39 1 25 10.70496 38 0 40 0.144109
40 1 23 11.28888 39 0 41 0.583924
41 1 39 12.78464 40 0 42 1.495759
42 1 3 12.90033 41 0 43 0.115693
43 1 6 15.44882 42 1 0 2.548489
We have replicated the table with one more column added, called 'Difference in Coefficients'; this is the difference between the coefficient at the current stage and that at the previous stage. The largest difference indicates the most likely number of clusters. In the table above, the largest difference is 2.548, which corresponds to the two-cluster solution, and the next largest difference, 1.4956, corresponds to the three-cluster solution. This suggests that the data could contain two, or possibly three, clusters.
The icicle table also gives a summary of cluster formation. It is read from bottom to top: the topmost row is the single-cluster solution and the bottommost row has every case separate. The cases appear in the columns, and the first column indicates the number of clusters at that stage. Each case is separated by an empty column; a cross in the empty column means the two cases are combined, while a gap means the two cases are in separate clusters. If the number of cases is large, this table becomes difficult to interpret.
The diagram given below is called the dendrogram. The dendrogram is the most widely used tool for deciding the number of clusters and the cluster memberships. The cases are listed in the first column and are connected by lines at each stage of clustering. The graph is read from left to right: at the leftmost position every case is its own cluster, and the rightmost position is the one-cluster solution. The graph also has a rescaled distance line from 0 to 25; the greater the width of the horizontal line for a cluster, the more appropriate that cluster is. The graph shows that the two-cluster solution is a better solution, indicated by the thick dotted line.
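Before looking at the SPSS output, here is a hedged sketch of how a similar dendrogram could be drawn in Python, reusing the linkage matrix Z and data X from the earlier sketch (SPSS rescales the distances to 0–25, whereas matplotlib plots the raw merge heights).

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram

fig, ax = plt.subplots(figsize=(6, 8))
dendrogram(Z, orientation="right", labels=list(range(1, len(X) + 1)), ax=ax)
ax.set_xlabel("merge distance")                  # SPSS shows this rescaled to 0-25
plt.tight_layout()
plt.show()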
Dendrogram
* * * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * * * *
Dendrogram using Single Linkage

Rescaled Distance Cluster Combine



The above solution is not decisive, as the differences are very close. Hence we shall try a different method, i.e. furthest neighborhood. The entire process is repeated, and this time the method selected (as shown in CA Snapshot 4) is furthest neighborhood.
The output is as follows.
Proximities
Case Processing Summarya

Cases
Valid Missing Total
N Percent N Percent N Percent
44 97.8% 1 2.2% 45 100.0%
a. Squared Euclidean Distance used

Cluster
Complete Linkage
Agglomeration Schedule

Stage   Cluster Combined (Cluster 1, Cluster 2)   Coefficients   Stage Cluster First Appears (Cluster 1, Cluster 2)   Next Stage
1 6 42 .000 0 0 35
2 7 43 .829 0 0 16
3 2 10 2.190 0 0 25
4 40 41 2.361 0 0 10
5 8 44 2.636 0 0 16
6 30 32 3.749 0 0 15
7 1 14 3.808 0 0 11
8 26 29 3.891 0 0 19
9 34 35 4.326 0 0 19
10 38 40 4.386 0 4 31
11 1 16 5.796 7 0 22
12 5 18 7.298 0 0 18
13 20 21 7.711 0 0 26
14 4 12 7.735 0 0 33
15 11 30 7.770 0 6 28
16 7 8 7.784 2 5 31
17 9 15 7.968 0 0 18
18 5 9 8.697 12 17 23
19 26 34 8.830 8 9 24
20 23 28 11.289 0 0 37
21 13 31 11.457 0 0 25
22 1 22 11.552 11 0 29
23 5 33 12.218 18 0 33
24 24 26 13.013 0 19 32
25 2 13 13.898 3 21 30
26 17 20 14.285 0 13 36
27 36 37 14.858 0 0 35
28 11 19 14.963 15 0 34
29 1 27 16.980 22 0 37
30 2 39 18.176 25 0 34
31 7 38 18.897 16 10 38
32 24 25 19.342 24 0 36
33 4 5 21.241 14 23 40
34 2 11 25.851 30 28 39
35 6 36 26.366 1 27 41
36 17 24 26.691 26 32 40
37 1 23 29.225 29 20 39
38 3 7 30.498 0 31 41
39 1 2 36.323 37 34 42
40 4 17 37.523 33 36 42
41 3 6 55.294 38 35 43
42 1 4 63.846 39 40 43
43 1 3 87.611 42 41 0

Dendrogram

* * * * * * H I E R A R C H I C A L   C L U S T E R   A N A L Y S I S * * * * * *
Dendrogram using Complete Linkage

Rescaled Distance Cluster Combine
The above dendrogram clearly shows that the longest horizontal lines occur for the four-cluster solution, shown by the thick dotted line (the dotted line intersects four horizontal lines). It indicates that the cluster containing cases 7, 43, 37 and 38 is named cluster 4, the cluster containing cases 41, 42, 39, 8, 44, 9, 45 and 3 is named cluster 2, and so on.
We shall run the cluster analysis again with the same method, and this time we shall save the cluster membership for a single solution of 4 clusters, as indicated in CA Snapshot 6.
The output will be the same as discussed, except that a new variable named 'CLU4_1' is added to the SPSS file. This variable takes values between 1 and 4, each value indicating the cluster membership.
We shall then conduct an ANOVA test on the data, where the dependent variables are all the variables that were included while performing the cluster analysis and the factor is the cluster membership indicated by the variable CLU4_1. This ANOVA will indicate whether the clusters really differ on the basis of the list of variables, and which variables significantly distinguish the clusters and which do not.
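The same validation step can be sketched in Python, assuming the clustering variables and the saved membership column (here called CLU4_1, as produced by SPSS Save) sit in a pandas DataFrame df; the Kruskal–Wallis test is included as the nonparametric fallback used later when the equal-variance assumption fails.

import pandas as pd
from scipy.stats import f_oneway, kruskal

def anova_by_cluster(df: pd.DataFrame, cluster_col: str = "CLU4_1") -> pd.DataFrame:
    rows = []
    for col in df.columns.drop(cluster_col):
        # one array of values per cluster for this variable
        groups = [g[col].dropna().values for _, g in df.groupby(cluster_col)]
        f_stat, p_anova = f_oneway(*groups)       # one-way ANOVA across clusters
        h_stat, p_kw = kruskal(*groups)           # nonparametric alternative
        rows.append({"variable": col, "F": f_stat,
                     "p_anova": p_anova, "p_kruskal": p_kw})
    return pd.DataFrame(rows)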
The ANOVA procedure is as follows
Select ‘Analyze’ – Compare Means – One way ANOVA from the menu as shown below.
CA Snapshot 7

SPSS will open following window.


CA Snapshot 8

1. Select as dependent variables the same list of variables used in the cluster analysis, i.e. Age, Rate_CM through Risky_MF.
2. Select the factor (independent variable) as Complete Linkage, i.e. CLU4_1.
3. Select Post Hoc. If the ANOVA null hypothesis is rejected, post hoc tests investigate which two clusters differ.
Following window will be opened:

CA Snapshot 9
This window gives the list of post hoc tests for ANOVA. The most common are LSD and HSD (discussed in Chapter 12); we shall select LSD and click on Continue.
SPSS will take us back to CA Snapshot 8. Click on Options and the following window will be opened.
CA Snapshot 10

2.Click on Continue

1.Select Descriptive and


Homogeneity of variance test

SPSS will take back to window as shown in CA Snapshot 8, at this stage click on OK
Following Output will be displayed:
Oneway
Descriptives
                                                              95% Confidence Interval for Mean
         N    Mean    Std. Deviation    Std. Error    Lower Bound    Upper Bound    Minimum    Maximum
Ag e 1 16 20.56 2.502 .626 19.23 21.90 18 25
2 8 46.13 9.250 3.270 38.39 53.86 32 55
3 16 33.56 9.063 2.266 28.73 38.39 21 50
4 4 26.00 4.761 2.380 18.42 33.58 23 33
T otal 44 30.43 11.571 1.744 26.91 33.95 18 55
R ate_C M 1 16 3.69 .873 .218 3.22 4.15 2 5
2 8 1.50 .535 .189 1.05 1.95 1 2
3 16 1.88 .719 .180 1.49 2.26 1 3
4 4 2.50 .577 .289 1.58 3.42 2 3
T otal 44 2.52 1.171 .177 2.17 2.88 1 5
R ate_SM 1 16 3.94 .680 .170 3.58 4.30 3 5
2 8 2.13 .354 .125 1.83 2.42 2 3
3 16 2.31 .602 .151 1.99 2.63 1 3
4 4 3.50 .577 .289 2.58 4.42 3 4
T otal 44 2.98 1.000 .151 2.67 3.28 1 5
R ate_F D 1 16 2.81 .911 .228 2.33 3.30 2 5
2 8 4.13 .354 .125 3.83 4.42 4 5
3 16 4.44 .727 .182 4.05 4.83 3 5
4 4 3.00 1.414 .707 .75 5.25 2 5
T otal 44 3.66 1.098 .166 3.33 3.99 2 5
R ate_M F 1 16 4.00 1.155 .289 3.38 4.62 2 5
2 8 2.75 .463 .164 2.36 3.14 2 3
3 16 2.94 .998 .249 2.41 3.47 1 5
4 4 4.50 1.000 .500 2.91 6.09 3 5
T otal 44 3.43 1.149 .173 3.08 3.78 1 5
Inves t_C M 1 16 5937.50 9681.382 2420.346 778.66 11096.34 0 30000
2 8 36250.00 8762.746 3098.098 28924.16 43575.84 25000 50000
3 16 4593.75 7289.762 1822.440 709.31 8478.19 0 23000
4 4 57500.00 11902.381 5951.190 38560.66 76439.34 50000 75000
T otal 44 15647.73 19901.671 3000.290 9597.07 21698.39 0 75000
Inves t_SM 1 16 20156.25 18716.943 4679.236 10182.70 30129.80 3000 60000
2 8 111875.00 60999.854 21566.705 60877.85 162872.15 50000 185000
3 16 18656.25 24993.145 6248.286 5338.34 31974.16 1500 100000
4 4 115000.00 41231.056 20615.528 49392.19 180607.81 70000 150000
T otal 44 44909.09 53293.535 8034.303 28706.38 61111.81 1500 185000
Inves t_F D 1 16 6718.75 7061.560 1765.390 2955.91 10481.59 0 25000
2 8 95625.00 43951.069 15539.049 58880.99 132369.01 70000 200000
3 16 20500.00 18071.156 4517.789 10870.56 30129.44 2500 60000
4 4 63750.00 7500.000 3750.000 51815.83 75684.17 55000 70000
T otal 44 33079.55 39780.056 5997.069 20985.30 45173.79 0 200000
Inves t_M F 1 16 18593.75 21455.550 5363.887 7160.89 30026.61 2500 75000
2 8 117500.00 54837.422 19387.956 71654.77 163345.23 50000 180000
3 16 17781.25 15921.651 3980.413 9297.20 26265.30 0 50000
4 4 124250.00 72126.625 36063.312 9480.44 239019.56 22000 175000
T otal 44 45886.36 56550.540 8525.315 28693.43 63079.30 0 180000
how_muc h_time_ 1 16 2.19 1.047 .262 1.63 2.75 1 4
bloc k_your _money 2 8 6.13 1.553 .549 4.83 7.42 4 8
3 16 4.31 1.887 .472 3.31 5.32 1 7
4 4 4.25 .500 .250 3.45 5.05 4 5
T otal 44 3.86 2.030 .306 3.25 4.48 1 8
r i sky_C M 1 16 5.31 1.887 .472 4.31 6.32 2 9
2 8 6.63 .916 .324 5.86 7.39 6 8
3 16 6.00 1.966 .492 4.95 7.05 3 9
4 4 3.00 .816 .408 1.70 4.30 2 4
T otal 44 5.59 1.921 .290 5.01 6.17 2 9
r i sky_SM 1 16 5.38 2.527 .632 4.03 6.72 1 9
2 8 7.25 .886 .313 6.51 7.99 6 8
3 16 7.19 1.559 .390 6.36 8.02 3 9
4 4 4.75 2.062 1.031 1.47 8.03 3 7
T otal 44 6.32 2.122 .320 5.67 6.96 1 9
r i sky_F D 1 16 1.94 .772 .193 1.53 2.35 1 3
2 8 1.13 .354 .125 .83 1.42 1 2
3 16 1.50 .516 .129 1.22 1.78 1 2
4 4 1.00 .000 .000 1.00 1.00 1 1
T otal 44 1.55 .663 .100 1.34 1.75 1 3
r i sky_M F 1 16 5.13 2.187 .547 3.96 6.29 2 10
2 8 6.50 1.195 .423 5.50 7.50 4 8
3 16 6.56 2.159 .540 5.41 7.71 4 10
4 4 4.75 2.363 1.181 .99 8.51 3 8
T otal 44 5.86 2.120 .320 5.22 6.51 2 10

The above table gives descriptive statistics for the dependent variables for each cluster. A short summary of the table is displayed below:
Descriptives (cluster means)

Cluster   N    Age       Rate CM   Rate SM   Rate FD   Rate MF   Invest CM   Invest SM    Invest FD
1         16   20.5625   3.6875    3.9375    2.8125    4.0000     5937.50     20156.25      6718.75
2          8   46.1250   1.5000    2.1250    4.1250    2.7500    36250.00    111875.00     95625.00
3         16   33.5625   1.8750    2.3125    4.4375    2.9375     4593.75     18656.25     20500.00
4          4   26.0000   2.5000    3.5000    3.0000    4.5000    57500.00    115000.00     63750.00
Total     44   30.4318   2.5227    2.9773    3.6591    3.4318    15647.73     44909.09     33079.55

Cluster   N    Invest MF    Block Time   Risky CM   Risky SM   Risky FD   Risky MF
1         16    18593.75    2.1875       5.3125     5.3750     1.9375     5.1250
2          8   117500.00    6.1250       6.6250     7.2500     1.1250     6.5000
3         16    17781.25    4.3125       6.0000     7.1875     1.5000     6.5625
4          4   124250.00    4.2500       3.0000     4.7500     1.0000     4.7500
Total     44    45886.36    3.8636       5.5909     6.3182     1.5455     5.8636

It may be noted that these four clusters have average ages of 20.56, 46.13, 33.56 and 26, which clearly form four different age groups. The other descriptives are summarised as follows.

Cluster 1 – Average age 20.56 (young, non-working)
Cases: 31, 33, 12, 20, 2, 11, 14, 32, 40, 24, 29, 1, 15, 17, 23, 28
Prefer investing in the commodity market, share market and mutual funds; do not prefer fixed deposits. Invest smaller amounts, block money for shorter periods, and rate the share market and commodity market as least risky compared with the other groups.

Cluster 2 – Average age 46.13 (oldest)
Cases: 41, 42, 39, 8, 44, 9, 45, 3
Least prefer investing in the commodity market, share market and mutual funds; prefer fixed deposits. Invest large amounts, block money for longer periods, and rate the share market and commodity market as most risky compared with the other groups.

Cluster 3 – Average age 33.56 (middle age)
Cases: 4, 13, 6, 19, 10, 16, 34, 21, 22, 18, 27, 30, 35, 36, 25, 26
Less inclined to invest in the commodity market, share market and mutual funds; prefer fixed deposits. Invest smaller amounts, block money for shorter periods (though not as short as cluster 1), and rate the share market and commodity market as more risky.

Cluster 4 – Average age 26 (young, working)
Cases: 7, 43, 37, 38
Less inclined to invest in the commodity market and share market, but prefer mutual funds and fixed deposits. Invest larger amounts, block money for a moderate period, and rate the share market and commodity market as more risky.

It may be noted that the clusters are named in the dendrogram on the basis of the above criteria.
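Continuing the Python sketch above, the per-cluster means used to build such a profile can be obtained with a simple group-by (again assuming the DataFrame df with the saved CLU4_1 column).

profile = df.groupby("CLU4_1").mean(numeric_only=True).round(2)   # mean of every variable per cluster
sizes = df["CLU4_1"].value_counts().sort_index()                  # cluster sizes (16, 8, 16, 4 above)
print(profile)
print(sizes)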
Test of Homogeneity of Variances

                 Levene Statistic    df1    df2    Sig.
Age 7.243 3 40 .001
Rat e_CM .943 3 40 .429
Rat e_SM 1.335 3 40 .277
Rat e_FD 3.186 3 40 .034
Rat e_MF 1.136 3 40 .346
Inv est_CM .369 3 40 .775
Inv est_SM 17.591 3 40 .000
Inv est_FD 4.630 3 40 .007
Inv est_MF 15.069 3 40 .000
how_much_time_block_your_money   3.390   3   40   .027
risky _CM 1.995 3 40 .130
risky _SM 4.282 3 40 .010
risky _FD 5.294 3 40 .004
risky _MF 2.118 3 40 .113

This table gives Levene's homogeneity test, which is a prerequisite for ANOVA, as ANOVA assumes that the different groups have equal variances. If the significance is less than 5% (the level of significance), the null hypothesis that the variances are equal is rejected, i.e. the assumption does not hold, and in such cases ANOVA should not be used for that variable. In the above table, rejection of the assumption applies to the variables whose significance is below 0.05 (circled in the original output), which means ANOVA could be invalid for those variables.
It may be noted that when ANOVA is invalid, the test that can be performed instead is the nonparametric Kruskal–Wallis test discussed in Chapter 13.
ANOVA

Sum of
Squares df Mean Square F Sig.
Age Between Groups 3764. 045 3 1254. 682 25.185 .000
Within Groups 1992. 750 40 49.819
Total 5756. 795 43
Rat e_CM Between Groups 36.790 3 12.263 22.108 .000
Within Groups 22.188 40 .555
Total 58.977 43
Rat e_SM Between Groups 28.727 3 9.576 26.879 .000
Within Groups 14.250 40 .356
Total 42.977 43
Rat e_F D Between Groups 24.636 3 8.212 12.054 .000
Within Groups 27.250 40 .681
Total 51.886 43
Rat e_MF Between Groups 17.358 3 5.786 5.869 .002
Within Groups 39.438 40 .986
Total 56.795 43
Inv est_CM Between Groups 1E+010 3 4621914299 58.403 .000
Within Groups 3E+009 40 79138671.88
Total 2E+010 43
Inv est_SM Between Groups 8E+010 3 2.545E+010 22.243 .000
Within Groups 5E+010 40 1144289844
Total 1E+011 43
Invest_FD      Between Groups   5E+010      3    1.624E+010     33.585   .000
               Within Groups    2E+010     40    483427734.4
               Total            7E+010     43
Invest_MF      Between Groups   9E+010      3    3.005E+010     25.377   .000
               Within Groups    5E+010     40    1184108594
               Total            1E+011     43
how_much_time_block_your_money
               Between Groups   89.682      3    29.894         13.666   .000
               Within Groups    87.500     40    2.188
               Total           177.182     43
(Note: for risky_MF below, the null hypothesis is not rejected, as its significance is greater than 0.05.)
risky _CM Between Groups 39.324 3 13.108 4.394 .009
Within Groups 119.313 40 2.983
Total 158.636 43
risky _SM Between Groups 43.108 3 14.369 3.821 .017
Within Groups 150.438 40 3.761
Total 193.545 43
risky _FD Between Groups 5.097 3 1.699 4.920 .005
Within Groups 13.813 40 .345
Total 18.909 43
risky _MF Between Groups 24.744 3 8.248 1.959 .136
Within Groups 168.438 40 4.211
Total 193.182 43

The above ANOVA table tests the differences between the cluster means. The null hypothesis states that there is no difference between the clusters for the given variable; if the significance is less than 5% (p value less than 0.05), the null hypothesis is rejected.
It may be noted that in the above table the null hypothesis of equal means across clusters is rejected for all variables except Risky_MF. This means all the other variables vary significantly across the clusters, and it also indicates that the four-cluster solution is a good solution.

K Means Cluster
This method is used when one knows in advance how many clusters are to be formed. The procedure for K-means clustering is as follows.
CA Snapshot 11

The following window will be displayed


CA Snapshot 12
1. Select the variables Age, Rate_CM through Risky_MF.
2. Set the number of clusters to 4.
3. Click on Save.
The following window will appear.
CA Snapshot 13
1. Select Cluster membership.
2. Click on Continue.
SPSS will take us back to the window shown in CA Snapshot 12. Click on Options and the following window will appear.
CA Snapshot 14
1. Select Initial cluster centers and ANOVA table.
2. Click on Continue.
SPSS will take us back to CA Snapshot 12; at this stage click OK.


The following output will be displayed.
Quick Cluster
Initial Cluster Centers

Cluster
1 2 3 4
Age 55 22 54 45
Rat e_CM 1 4 1 1
Rat e_SM 2 4 2 2
Rat e_FD 4 4 4 4
Rat e_MF 2 3 3 3
Inv est_CM 45000 5000 50000 25000
Inv est_SM 60000 3000 60000 185000
Inv est_FD 75000 1000 200000 100000
Inv est_MF 75000 2500 50000 155000
how_much_time_block_your_money   8   4   8   6
risky _CM 8 3 8 6
risky _SM 8 6 8 7
risky _FD 1 2 1 2
risky _MF 7 3 4 7

This table gives the initial cluster centers. The initial cluster centers are the variable values of k well-spaced observations chosen as starting points.
Iteration Historya

Change in Cluster Centers


Iteration 1 2 3 4
1 31980.518 19406.551 .000 35237.293
2 7589. 872 3733. 790 .000 .000
3 .000 .000 .000 .000
a. Convergence achieved due to no or small change in cluster centers. The maximum absolute coordinate change for any center is .000. The current iteration is 3. The minimum distance between initial centers is 124824.882.

The iteration history shows the progress of the clustering process at each step.
This table has only three steps as the process has stopped due to no change in cluster centers.
Final Cluster Centers

Cluster
1 2 3 4
Age 33 27 54 34
Rat e_CM 2 3 1 2
Rat e_SM 3 3 2 3
Rat e_FD 4 4 4 4
Rat e_MF 4 3 3 4
Inv est_CM 31000 2981 50000 36667
Inv est_SM 58182 11769 60000 161667
Inv est_FD 41636 11827 200000 81667
Inv est_MF 60636 10846 50000 170000
how_much_time_block_your_money   4   3   8   5
risky _CM 5 6 8 5
risky _SM 7 6 8 5
risky _FD 1 2 1 1
risky _MF 6 6 4 5

This table gives final cluster centers.


ANOVA

Cluster Error
Mean Square df Mean Square df F Sig.
Age 318.596 3 120.025 40 2.654 .062
Rat e_CM 2.252 3 1.306 40 1.725 .177
Rat e_SM .599 3 1.029 40 .582 .630
Rat e_FD .377 3 1.269 40 .297 .827
Rat e_MF .722 3 1.366 40 .528 .665
Inv est_CM 3531738685 3 160901842.9 40 21.950 .000
Inv est_SM 3.750E+010 3 240364627.0 40 156.032 .000
Inv est_FD 1.819E+010 3 336746248.5 40 54.022 .000
Inv est_MF 4.225E+010 3 268848251.7 40 157.162 .000
how_much_time_block_your_money   10.876   3   3.614   40   3.010   .041
risky _CM 3.654 3 3.692 40 .990 .407
risky _SM 3.261 3 4.594 40 .710 .552
risky _FD .603 3 .427 40 1.411 .254
risky _MF 1.949 3 4.683 40 .416 .742
The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.

The ANOVA indicates that the clusters differ only on the investment amounts (Invest_CM, Invest_SM, Invest_FD and Invest_MF) and on the time for which money is blocked, as the significance is less than 0.05 only for these variables.
Number of Cases in each Cluster
Cluster 1 11.000
2 26.000
3 1.000
4 6.000
Valid 44.000
Missing 1.000

The above table gives the number of cases in each cluster.
It may be noted that this solution differs from the hierarchical solution, and the hierarchical clustering is more valid for this data because it was run on standardized scores, whereas this K-means run did not standardize the variables.
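A minimal Python sketch of the same K-means run, with the standardization caveat addressed by converting the variables to z-scores first (df is assumed to hold only the numeric clustering variables; the column name kmeans_cluster is illustrative).

from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X_z = StandardScaler().fit_transform(df)                  # z-scores, as in the hierarchical run
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_z)
df["kmeans_cluster"] = km.labels_ + 1                     # labels 1-4, analogous to CLU4_1
print(df["kmeans_cluster"].value_counts().sort_index())   # number of cases in each cluster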

9 Conjoint Analysis
The name "Conjoint Analysis" implies the study of the joint effects. In marketing applications, it
helps in the study of joint effects of multiple product attributes on product choice. Conjoint analysis
involves the measurement of psychological judgements
such as consumer preferences, or perceived similarities or differences between choice alternatives.

In fact, conjoint analysis is a versatile marketing research technique which provides valuable
information for new product development, assessment of demand, evolving market
segmentation strategies, and pricing decisions. This technique is used to assess a wide number
of issues including:

 The profitability and/or market share for proposed new product concepts given the existing
competition.
 The impact of new competitors’ products on profits or market share of a company if status
quo is maintained with respect to its products and services
 Customers’ switch rates either from a company’s existing products to the company’s new
products or from competitors’ products to the company’s new products.
 Competitive reaction to the company’s strategies of introducing a new product
 The differential response to alternative advertising strategies and/or advertising themes
 The customer response to alternative pricing strategies, specific price levels, and proposed
price changes
Conjoint analysis examines the trade-offs that consumers make in purchasing a product. In
evaluating products, consumers make trade-offs. A TV viewer may like to enjoy the programs on an LCD TV but might not go for it because of the high cost; in this case, cost is said to carry a high utility value in the decision. Utility can be defined as a number which represents the value that
consumers place on specific attributes. A low utility indicates less value; a high utility
indicates more value. In other words, it represents the relative ‘worth’ of the attribute. This
helps in designing products/services that are most appealing to a specific market. In addition,
because conjoint analysis identifies important attributes, it can be used to create advertising
messages that are most appealing.

The process of data collection involves showing respondents a series of cards that contain a written
description of the product or service. If a consumer product is being tested, then a picture of the
product can be included along with a written description. Several cards are prepared describing the
combination of various alternative set of features of a product or service. A consumer’s response is
collected by his/ her selection of number between 1 and 10. While ‘1’ indicates strongest dislike, ‘10’
indicates strongest like for the combination of features on the card. Such data becomes the input for
final analysis which is carried out through computer software.
The concepts and methodology are elaborated in the case study given below.
9.1 Conjoint Analysis Using SPSS
Case 3
Credit Cards
The new head of the credit card division in a bank wanted to revamp the credit card business of the
bank and convert it from loss making business to profit making business. He was given freedom to
experiment with various options that he considered as relevant. Accordingly, he organized a focus
group discussion for assessing the preference of the customers for various parameters associated with
the credit card business. Thereafter he selected the following parameters for study.
1) Transaction Time – the time taken for a credit card transaction
2) Fees – the annual fees charged by the credit card company
3) Interest Rate – the interest rate charged by the credit card company to customers who revolve their credit (customers who do not pay the full bill amount but use the partial payment option and pay at their convenience)
The levels of the above-mentioned attributes were as follows:
 Transaction Time – 1 minute, 1.5 minutes, 2 minutes
 Fees – Rs 0, Rs 1000, Rs 2000
 Interest Rate – 1.5%, 2%, 2.5% (per month)

This led to a total of 3×3×3=27 combinations. Twenty seven cards were prepared representing each
combination and the customers were asked to arrange these cards in order of their preference.
The following table shows all the possible combinations and the order given by the customer.
Input Data for Credit Card

Sr. No.   Transaction Time (min): 1, 1.5, 2   Card Fees (Rs): 0, 1000, 2000   Interest Rate: 1.5%, 2.0%, 2.5%   Rating*: 27 to 1
1 1 0 1.5 27
2 1.5 0 1.5 26
3 1 1000 1.5 25
4 1.5 1000 1.5 24
5 2 0 1.5 23
6 2 1000 1.5 22
7 1 0 2 21
8 1.5 0 2 20
9 1 2000 1.5 19
10 1.5 2000 1.5 18
11 1 1000 2 17
12 1.5 1000 2 16
13 1 2000 2 15
14 2 2000 1.5 14
15 1.5 2000 2 13
16 2 0 2 12
17 2 1000 2 11
18 1 0 2 10
19 1.5 0 2.5 9
20 1 1000 2.5 8
21 2 1000 2.5 7
22 2 2000 2 6
23 2 0 2.5 5
24 2 1000 2.5 4
25 1 2000 2.5 3
26 1.5 2000 2.5 2
27 2 2000 2.5 1
* A rating of 27 indicates the most preferred and a rating of 1 the least preferred option by the customer.
Conduct appropriate analysis to find the utility for these three factors.
The data is available in credit card.sav file, given in the CD.
Running Conjoint as a Regression Model: Introduction of Dummy Variables
Representing the dummy variables:
X1, X2 = Transaction Time
X3, X4 = Annual Fees
X5, X6 = Interest Rate
The 3 levels of transaction time are coded as follows:
Transaction Time   X1   X2
1                   1    0
1.5                 0    1
2                  -1   -1
The 3 levels of fees are coded as follows:
Fees    X3   X4
0        1    0
1000     0    1
2000    -1   -1
The 3 levels of interest rate are coded as follows:
Interest rate   X5   X6
1.5              1    0
2                0    1
2.5             -1   -1
Thus, 6 variables, i.e. X1 to X6, are used to represent the 3 levels of transaction time (1, 1.5, 2), the 3 levels of fees (0, 1000, 2000) and the 3 levels of interest rate (1.5, 2, 2.5). All six variables are independent variables in the regression run. Another variable Y, the rating of each combination given by the respondent, forms the dependent variable of the regression model.
Thus we generate the regression equation as: Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6
Input data for the regression model:


Sr Transaction Fees Interest Y X1 X2 X3 X4 X5 X6
No time Rate
1 1 0 1.5 27 1 0 1 0 1 0
2 1.5 0 1.5 26 0 1 1 0 1 0
3 1 1000 1.5 25 1 0 0 1 1 0
4 1.5 1000 1.5 24 0 1 0 1 1 0
5 2 0 1.5 23 -1 -1 1 0 1 0
6 2 1000 1.5 22 -1 -1 0 1 1 0
7 1 0 2 21 1 0 1 0 0 1
8 1.5 0 2 20 0 1 1 0 0 1
9 1 2000 1.5 19 1 0 -1 -1 1 0
10 1.5 2000 1.5 18 0 1 -1 -1 1 0
11 1 1000 2 17 1 0 0 1 0 1
12 1.5 1000 2 16 0 1 0 1 0 1
13 1 2000 2 15 1 0 -1 -1 0 1
14 2 2000 1.5 14 -1 -1 -1 -1 1 0
15 1.5 2000 2 13 0 1 -1 -1 0 1
16 2 0 2 12 -1 -1 1 0 0 1
17 2 1000 2 11 -1 -1 0 1 0 1
18 1 0 2 10 1 0 1 0 0 1
19 1.5 0 2.5 9 0 1 1 0 -1 -1
20 1 1000 2.5 8 1 0 0 1 -1 -1
21 2 1000 2.5 7 -1 -1 0 1 -1 -1
22 2 2000 2 6 -1 -1 -1 -1 0 1
23 2 0 2.5 5 -1 -1 1 0 -1 -1
24 2 1000 2.5 4 -1 -1 0 1 -1 -1
25 1 2000 2.5 3 1 0 -1 -1 -1 -1
26 1.5 2000 2.5 2 0 1 -1 -1 -1 -1
27 2 2000 2.5 1 -1 -1 -1 -1 -1 -1
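Before turning to SPSS, the same dummy-coded regression can be sketched in Python; the DataFrame cards and its column names ('time', 'fees', 'interest', 'rate') are assumed for illustration and are not part of the credit card.sav file.

import pandas as pd
import statsmodels.api as sm

def effects_code(series: pd.Series, levels: list) -> pd.DataFrame:
    """Code a 3-level attribute into two columns: level 1 -> (1, 0), level 2 -> (0, 1), level 3 -> (-1, -1)."""
    out = pd.DataFrame(0, index=series.index,
                       columns=[f"{series.name}_1", f"{series.name}_2"])
    out.loc[series == levels[0], f"{series.name}_1"] = 1
    out.loc[series == levels[1], f"{series.name}_2"] = 1
    out.loc[series == levels[2], :] = -1
    return out

X = pd.concat([effects_code(cards["time"], [1.0, 1.5, 2.0]),
               effects_code(cards["fees"], [0, 1000, 2000]),
               effects_code(cards["interest"], [1.5, 2.0, 2.5])], axis=1)
model = sm.OLS(cards["rate"], sm.add_constant(X)).fit()
print(model.params)          # the constant plus the part-worth (utility) estimates b1..b6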
This can be processed using the SPSS package as follows.
Open the file credit card.sav.
Select Analyze – Regression – Linear from the menu as shown below.
Conjoint Snapshot 1
The following window will be displayed.
Conjoint Snapshot 2
1. Select Rate as the dependent variable.
2. Select x1, x2, x3, x4, x5, x6 as the independent variables.
3. Click OK.
The output generated is as follows:


Regression
Variables Entered/Removed(b)
Model   Variables Entered          Variables Removed   Method
1       x6, x4, x2, x5, x3, x1     .                   Enter
a. All requested variables entered.
b. Dependent Variable: rate

This table summarizes the regression model


Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .963a   .927       .905                2.45038
a. Predictors: (Constant), x6, x4, x2, x5, x3, x1

This table indicates that R for the model is 0.963 and R square is 0.927, which is close to one. This means that about 92.7% of the variation in the rating is explained by the six independent variables (x1 to x6).
We conclude that the regression model fits well and explains the variation in the dependent variable quite well.
ANOVA(b)
Model            Sum of Squares   df   Mean Square   F        Sig.
1   Regression   1517.912          6   252.985       42.133   .000a
    Residual      120.088         20     6.004
    Total        1638.000         26
a. Predictors: (Constant), x6, x4, x2, x5, x3, x1
b. Dependent Variable: rat e
130

Coefficients(a)
                 Unstandardized Coefficients    Standardized Coefficients
Model            B          Std. Error          Beta            t        Sig.
1 (Constant) 13.857 .476 29.104 .000
x1 1.377 .673 .148 2.044 .054
x2 1.326 .695 .138 1.908 .071
x3 2.265 .673 .237 3.364 .003
x4 1.480 .673 .155 2.198 .040
x5 8.143 .670 .829 12.152 .000
x6 -.121 .661 -.013 -.183 .857
a. Dependent Variable: rate

The coefficients (circled in the original output) indicate the utility values for each variable.
The Regression equation is as follows:

Y= 13.857 + 1.377X1 + 1.326X2 + 2.265X3 + 1.48X4 + 8.143X5 - 0.121X6

Output and Interpretation


Utility (Uij) – The utility or the part worth contribution associated with the jth level (j, j=1,2,3) of the
ith attribute (i, i=1,2,3) for ex- U21 in our example means utility associated with zero fees.
Importance of an attribute (Ii)- is defined as the range of the part worth Uij across the levels of that
attribute. Ii={ Max (Uij)- Min (Uij)} for each attribute (i)
Normalisation: The attributes importance is normalized to desire its relative importance among all
attributes.
Wi = Ii / ΣIi , so that ΣWi = 1
The output provides the part utility of each level of each attribute, as shown below:
X1 = 1.377 (partial utility for 1 min transaction)
X2 = 1.326 (partial utility for 1.5 min transaction)
For the 2 min transaction, partial utility = -2.703 (since all the utilities for a given attribute must sum to 0, hence -1.377 - 1.326 = -2.703)
X3 = 2.265 (partial utility for zero fees)
X4 = 1.480 (partial utility for Rs 1000 fees)
For Rs 2000 fees, partial utility = -3.745 (since -2.265 - 1.480 = -3.745)
X5 = 8.143 (partial utility for 1.5% interest)
X6 = -0.121 (partial utility for 2% interest)
For 2.5% interest, partial utility = -8.022 (since -8.143 + 0.121 = -8.022)
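A small Python sketch that reproduces this arithmetic, deriving the third-level utilities from the sum-to-zero constraint and the relative importance weights from the utility ranges (the coefficient values are taken from the regression output above).

coeffs = {"time": [1.377, 1.326], "fees": [2.265, 1.480], "interest": [8.143, -0.121]}

utilities, ranges = {}, {}
for attr, (u1, u2) in coeffs.items():
    u3 = -(u1 + u2)                                 # utilities within an attribute sum to zero
    utilities[attr] = [u1, u2, u3]
    ranges[attr] = max(u1, u2, u3) - min(u1, u2, u3)

total = sum(ranges.values())
importance = {attr: 100 * r / total for attr, r in ranges.items()}
print(utilities)       # time: [1.377, 1.326, -2.703], etc.
print(importance)      # interest ~61.6%, fees ~22.9%, time ~15.5%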

Utilities Table for Conjoint Analysis

Attribute              Level     Part Utility   Range of Utility (Max – Min)   Percentage Utility
Transaction Time       1 min      1.377         1.377 - (-2.703) = 4.08        15.54%
                       1.5 min    1.326
                       2 min     -2.703
Annual Fees (Rupees)   0          2.265         2.265 - (-3.745) = 6.01        22.89%
                       1000       1.480
                       2000      -3.745
Interest Rate          1.5%       8.143         8.143 - (-8.022) = 16.165      61.57%
                       2.0%      -0.121
                       2.5%      -8.022

From the above table, we can interpret the following:


 The Interest rate is the most important attribute for the customer. This is indicated by
following:
a. The range of utility values is the highest (16.165) for the interest rate (refer to the 'Range of Utility' column); it contributes 61.57% of the total utility.
b. The highest individual utility value for this attribute is at the first level, i.e. 8.143.
 The Annual Fees is the 2nd most important attribute, as its range of utilities is 6.01 and it
contributes to 22.89% of the total.
 The last attribute in relative importance is the Transaction Time, with the utility range of
4.08, contributing to 15.54% of the total.
COMBINATION UTILITIES
The total utility of any combination can be calculated by picking up the attribute levels of our choice.
For example,
The combined utility of the combination of
1.5 min + 1000Fees + 2% Interest
= 1.326+ 1.480-0.121
= 2.685
To know the BEST COMBINATION, it is advisable to pick the highest utilities from each attribute
and then add them.

The best combination here is: 1 min + 0 fees + 1.5% interest
= 1.377 + 2.265 + 8.143
= 11.785
INDIVIDUAL ATTRIBUTES

The difference in utility with a change of one level in an attribute can also be checked.
1. Transaction Time
 Moving from 1 min to 1.5 min, there is a decrease in utility of 0.051 units.
 Moving from 1.5 min to 2 min, there is a decrease in utility of 4.029 units.
2. Annual Fees
 Increasing the fees from 0 to Rs 1000 induces a utility drop of 0.785.
 From Rs 1000 to Rs 2000, there is a decrease in utility of 5.225.
3. Interest Rate
 An interest rate increase from 1.5% to 2.0% induces a drop of 8.264 units in utility.
 An interest rate increase from 2.0% to 2.5% induces a drop of 7.901 units in utility.

10 Multidimensional Scaling
Multidimensional Scaling (MDS) transforms consumer judgments/perceptions of similarity or preference into points in a multidimensional space (usually 2 or 3 dimensions). It is useful for designing products and services. In fact, MDS is a set of procedures for drawing pictures of data so that the researcher can
 Visualise relationships described by the data more clearly
 Offer clearer explanations of those relationships
Thus MDS reveals relationships that may appear obscure when one examines only the numbers resulting from a study.
It attempts to find the structure in a set of distance measures between objects. This is done by assigning observations to specific locations in a conceptual space (usually 2 or 3 dimensions) such that the distances between points in the space match the given dissimilarities as closely as possible.
If objects A and B are judged by the respondents as being most similar compared to all other possible
pairs of objects, multidimensional technique positions these objects in the space in such a manner that
the distance between them is smaller than that between any other two objects.
Suppose, data is collected for perceiving the differences or distances among three objects say A B and
C, and the following distance matrix emerges.

A B C
A 0 4 6
B 4 0 3
C 6 3 0

This matrix can be represented by a two dimensional diagram as follows:

[Diagram: points A, B and C plotted so that the distances between them are 4 (A–B), 3 (B–C) and 6 (A–C).]

However, if the data comprises only ordinal or rank data, then the same distance matrix could be written in terms of ranks as:
A B C
A 0 2 3
B 2 0 1
C 3 1 0
and can be depicted as:
[Diagram: points A, B and C plotted using the rank-order distances 2 (A–B), 1 (B–C) and 3 (A–C).]
If the actual magnitudes of the original similarities (distances) are used to obtain a geometric representation, the process is called "Metric Multidimensional Scaling".
When only the ordinal information, in terms of ranks, is used to obtain a geometric representation, the process is called "Non-metric Multidimensional Scaling".
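As an illustrative sketch (not part of the text), the small A–B–C distance matrix above can be mapped into two dimensions with scikit-learn; setting metric=False would give the non-metric (rank-based) variant.

import numpy as np
from sklearn.manifold import MDS

D = np.array([[0, 4, 6],
              [4, 0, 3],
              [6, 3, 0]], dtype=float)           # pairwise distances between A, B and C

mds = MDS(n_components=2, dissimilarity="precomputed", metric=True, random_state=0)
coords = mds.fit_transform(D)                    # one (x, y) point per object
print(coords.round(2))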

10.1 Uses of MDS


(i) Illustrating market segments based on preferences and judgments.
(ii) Determining which products are more competitive with each other.
(iii) Deriving the criteria used by people while judging objects (products, brands, advertisements, etc.).
Illustration 4
An all-India organization had six zonal offices, each headed by a zonal manager. The top management of the
organization wanted to have a detailed assessment of all the zonal managers for selecting two of them for
higher positions in the Head Office. They approached a consultant for helping them in the selection. The
management indicated that they would like to have assessment on several parameters associated with the
functioning of a zonal manager. The management also briefed the consultant that they laid great emphasis on
the staff with a view to developing and retaining them.
The consultants collected a lot of relevant data, analyzed it and offered their recommendations. In
one of the presentations, they showed the following diagram obtained through Multi Dimensional
Scaling technique. The diagram shows the concerns of various zonal managers, indicated by letters A
to F, towards the organization and also towards the staff working under them.
[MDS map: vertical axis 'Concern for Organisation', horizontal axis 'Concern for the Staff'; the six zonal managers A to F are plotted, with B and E in the high–high region.]
It is observed that two zonal managers viz. B and E exhibit high concern for both the organisation as
well as staff. If these criteria are critical to the organisation, then these two zonal managers could be
the right candidates for higher positions in the Head Office.

Illustration 5
A similar study could be conducted for a group of companies to assess investors' perceptions of the companies' attitudes towards the interests of their shareholders vis-à-vis the interests of their staff.
For example, from the following MDS graph, it is observed that company A is perceived to be taking
more interest in the welfare of the staff than company B.

[MDS map: axes 'Interest of Shareholders' and 'Interest of Staff'; companies A and B are plotted, with A perceived as taking more interest in staff welfare than B.]
Popular Decision Tree: Classification and Regression Trees (CART)
Introductory Overview - Basic Ideas
OVERVIEW

C&RT, a recursive partitioning method, builds classification and regression trees for predicting continuous dependent variables (regression) and categorical dependent variables (classification). The classic C&RT algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). A general introduction to tree classifiers, specifically to the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context of the Classification Trees Analysis facilities, and much of the following discussion presents the same information in only a slightly different context. Another, similar type of tree-building algorithm is CHAID (Chi-square Automatic Interaction Detector; see Kass, 1980).

CLASSIFICATION AND REGRESSION PROBLEMS

There are numerous algorithms for predicting continuous variables or categorical variables from a set
of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear
Models) and GRM (General Regression Models), we can specify a linear combination (design) of
continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction
effects) to predict a continuous dependent variable. In GDA (General Discriminant Function
Analysis), we can specify such designs for predicting categorical variables, i.e., to solve classification
problems.
Regression-type problems. Regression-type problems are generally those where we attempt to
predict the values of a continuous variable from one or more continuous and/or categorical predictor
variables. For example, we may want to predict the selling prices of single family homes (a
continuous dependent variable) from various other continuous predictors (e.g., square footage) as well
as categorical predictors (e.g., style of home, such as ranch, two-story, etc.; zip code or telephone area
code where the property is located, etc.; note that this latter variable would be categorical in nature,
even though it would contain numeric values or codes). If we used simple multiple regression, or
some general linear model (GLM) to predict the selling prices of single family homes, we would
determine a linear equation for these variables that can be used to compute predicted selling prices.
There are many different analytic procedures for fitting linear models (GLM, GRM, Regression),
various types of nonlinear models (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized
Additive Models (GAM), etc.), or completely custom-defined nonlinear models (see Nonlinear
Estimation), where we can type in an arbitrary equation containing parameters to be estimated.


CHAID also analyzes regression-type problems, and produces results that are similar (in nature) to
those computed by C&RT. Note that various neural network architectures are also applicable to solve
regression-type problems.
Classification-type problems. Classification-type problems are generally those where we attempt to
predict values of a categorical dependent variable (class, group membership, etc.) from one or more
continuous and/or categorical predictor variables. For example, we may be interested in predicting
who will or will not graduate from college, or who will or will not renew a subscription. These would
be examples of simple binary classification problems, where the categorical dependent variable can
only assume two distinct and mutually exclusive values. In other cases, we might be interested in
predicting which one of multiple different alternative consumer products (e.g., makes of cars) a
person decides to purchase, or which type of failure occurs with different types of engines. In those
cases there are multiple categories or classes for the categorical dependent variable. There are a
number of methods for analyzing classification-type problems and to compute predicted
classifications, either from simple continuous predictors (e.g., binomial or multinomial logit
regression in GLZ), from categorical predictors (e.g., Log-Linear analysis of multi-way frequency
tables), or both (e.g., via ANCOVA-like designs in GLZ or GDA). The CHAID also analyzes
classification-type problems, and produces results that are similar (in nature) to those computed
by C&RT. Note that various neural network architectures are also applicable to solve classification-
type problems.

CLASSIFICATION AND REGRESSION TREES (C&RT)

In most general terms, the purpose of the analyses via tree-building algorithms is to determine a set
of if-then logical (split) conditions that permit accurate prediction or classification of cases.

CLASSIFICATION TREES

For example, consider the widely referenced Iris data classification problem introduced by Fisher
[1936; see also Discriminant Function Analysis and General Discriminant Analysis (GDA)]. The data
file Irisdat reports the lengths and widths of sepals and petals of three types of irises (Setosa,
Versicol, and Virginic). The purpose of the analysis is to learn how we can discriminate between the
three types of flowers, based on the four measures of width and length of petals and sepals.
Discriminant function analysis will estimate several linear combinations of predictor variables for
computing classification scores (or probabilities) that allow the user to determine the predicted
classification for each observation. A classification tree will determine a set of logical if-then
conditions (instead of linear equations) for predicting or classifying cases instead:
The interpretation of this tree is straightforward: if the petal width is less than or equal to 0.8, the respective flower would be classified as Setosa; if the petal width is greater than 0.8 and less than or equal to 1.75, then the respective flower would be classified as Versicol; else, it belongs to class Virginic.
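A minimal sketch of the same idea using scikit-learn's CART-style decision trees on Fisher's iris data (this is an illustration, not the exact tree reproduced in the text).

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=list(iris.feature_names)))
# Typically the first split is on petal width <= 0.8 (setosa), followed by a split
# around 1.75 separating versicolor from virginica, matching the description above.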

REGRESSION TREES

The general approach to derive predictions from few simple if-then conditions can be applied to
regression problems as well. This example is based on the data file Poverty, which contains 1960 and
1970 Census figures for a random selection of 30 counties. The research question (for that example)
was to determine the correlates of poverty, that is, the variables that best predict the percent of
families below the poverty line in a county. A reanalysis of those data, using the regression tree
analysis (with v-fold cross-validation), yields the
following results:
Again, the interpretation of these results is rather
straightforward: Counties where the percent of
households with a phone is greater than 72%
have generally a lower poverty rate. The greatest
poverty rate is evident in those counties that
show less than (or equal to) 72% of households
with a phone, and where the population change
(from the 1960 census to the 170 census) is less
than -8.3 (minus 8.3). These results are
straightforward, easily presented, and intuitively
clear as well: There are some affluent counties (where most households have a telephone), and those
generally have little poverty. Then there are counties that are generally less affluent, and among those
the ones that shrunk most showed the greatest poverty rate. A quick review of the scatterplot of
observed vs. predicted values shows how the discrimination between the latter two groups is
particularly well "explained" by the tree model.
ADVANTAGES OF CLASSIFICATION AND REGRESSION TREES (C&RT) METHODS

As mentioned earlier, there are a large number of methods that an analyst can choose from when
analyzing classification or regression problems. Tree classification techniques, when they "work" and
produce accurate predictions or predicted classifications based on few logical if-then conditions, have
a number of advantages over many of those alternative techniques.

Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple.
This simplicity is useful not only for purposes of rapid classification of new observations (it is much
easier to evaluate just one or two logical conditions, than to compute classification scores for each
possible group, or predicted values, based on all predictors and using possibly some complex
nonlinear model equations), but can also often yield a much simpler "model" for explaining why
observations are classified or predicted in a particular manner (e.g., when analyzing business
problems, it is much easier to present a few simple if-then statements to management, than some
elaborate equations).
Tree methods are nonparametric and nonlinear. The final results of using tree methods for
classification or regression can be summarized in a series of (usually few) logical if-then conditions
(tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the
predictor variables and the dependent variable are linear, follow some specific non-linear link
function [e.g., see Generalized Linear/Nonlinear Models (GLZ)], or that they are even monotonic in
nature. For example, some continuous outcome variable of interest could be positively related to a
variable Income if the income is less than some certain amount, but negatively related if it is more
than that amount (i.e., the tree could reveal multiple splits based on the same variable Income,
revealing such a non-monotonic relationship between the variables). Thus, tree methods are
particularly well suited for data mining tasks, where there is often little a priori knowledge and no
coherent set of theories or predictions regarding which variables are related and how. In those types
of data analyses, tree methods can often reveal simple relationships between just a few variables that
could have easily gone unnoticed using other analytic techniques.

GENERAL COMPUTATION ISSUES AND UNIQUE SOLUTIONS OF C&RT

The computational details involved in determining the best split conditions to construct a simple yet
useful and informative tree are quite complex. Refer to Breiman et al. (1984) for a discussion of their
CART® algorithm to learn more about the general theory of and specific computational solutions for
constructing classification and regression trees. An excellent general discussion of tree classification
and regression methods, and comparisons with other approaches to pattern recognition and neural
networks, is provided in Ripley (1996).

AVOIDING OVER-FITTING: PRUNING, CROSSVALIDATION, AND V-FOLD CROSSVALIDATION

A major issue that arises when applying regression or classification trees to "real" data with much
random error noise concerns the decision when to stop splitting. For example, if we had a data set
with 10 cases, and performed 9 splits (determined 9 if-then conditions), we could perfectly predict
every single case. In general, if we only split a sufficient number of times, eventually we will be able
to "predict" ("reproduce" would be the more appropriate term here) our original data (from which we
determined the splits). Of course, it is far from clear whether such complex results (with many splits)
will replicate in a sample of new observations; most likely they will not.

This general issue is also discussed in the literature on tree classification and regression methods, as
well as neural networks, under the topic of "overlearning" or "overfitting." If not stopped, the tree
algorithm will ultimately "extract" all information from the data, including information that is not and
cannot be predicted in the population with the current set of predictors, i.e., random or noise
variation. The general approach to addressing this issue is first to stop generating new split nodes
when subsequent splits only result in very little overall improvement of the prediction. For example,
if we can predict 90% of all cases correctly from 10 splits, and 90.1% of all cases from 11 splits, then
it obviously makes little sense to add that 11th split to the tree. There are many such criteria for
automatically stopping the splitting (tree-building) process.

Once the tree building algorithm has stopped, it is always useful to further evaluate the quality of the
prediction of the current tree in samples of observations that did not participate in the original
computations. These methods are used to "prune back" the tree, i.e., to eventually (and ideally) select
a simpler tree than the one obtained when the tree building algorithm stopped, but one that is equally
as accurate for predicting or classifying "new" observations.

Crossvalidation. One approach is to apply the tree computed from one set of observations (learning
sample) to another completely independent set of observations (testing sample). If most or all of the
140

splits determined by the analysis of the learning sample are essentially based on "random noise," then
the prediction for the testing sample will be very poor. Hence, we can infer that the selected tree is
not very good (useful), and not of the "right size."
V-fold crossvalidation. Continuing further along this line of reasoning (described in the context of
crossvalidation above), why not repeat the analysis many times over with different randomly drawn
samples from the data, for every tree size starting at the root of the tree, and apply it to the
prediction of observations from randomly selected testing samples? Then use (interpret, or accept as
our final result) the tree that shows the best average accuracy for cross-validated predicted
classifications or predicted values. In most cases, this tree will not be the one with the most terminal
nodes, i.e., the most complex tree. This method for pruning a tree, and for selecting a smaller tree
from a sequence of trees, can be very powerful, and is particularly useful for smaller data sets. It is an
essential step for generating useful (for prediction) tree models, and because it can be
computationally difficult to do, this method is often not found in tree classification or regression
software.
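
To make the idea concrete, here is a minimal sketch of selecting a tree size by v-fold cross-validation. It assumes Python with scikit-learn and a synthetic data set purely for illustration, and it uses maximum depth as a stand-in for "tree size" rather than the exact pruning sequence a C&RT module would generate.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data; in practice this would be the learning sample.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

best_depth, best_score = None, -np.inf
for depth in range(1, 11):                      # candidate tree sizes
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    # v-fold (here v = 5) cross-validated accuracy for this tree size
    score = cross_val_score(tree, X, y, cv=5).mean()
    if score > best_score:
        best_depth, best_score = depth, score

print("selected max_depth =", best_depth,
      "cross-validated accuracy =", round(best_score, 3))

In most cases the selected size will be noticeably smaller than the fully grown tree, which is exactly the pruning effect described above.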

REVIEWING LARGE TREES: UNIQUE ANALYSIS MANAGEMENT TOOLS

Another general issue that arises when applying tree classification or regression methods is that the
final trees can become very large. In practice, when the input data are complex and, for example,
contain many different categories for classification problems and many possible predictors for
performing the classification, then the resulting trees can become very large. This is not so much a
computational problem as it is a problem of presenting the trees in a manner that is easily accessible
to the data analyst, or for presentation to the "consumers" of the research.

ANALYZING ANCOVA-LIKE DESIGNS

The classic (Breiman et al., 1984) classification and regression trees algorithms can accommodate
both continuous and categorical predictors. However, in practice, it is not uncommon to combine such
variables into analysis of variance/covariance (ANCOVA)-like predictor designs with main effects or
interaction effects for categorical and continuous predictors. This method of analyzing coded
ANCOVA-like designs is relatively new. However, it is easy to see how the use of coded
predictor designs expands these powerful classification and regression techniques to the analysis of
data from experimental designs (e.g., see for example the detailed discussion of experimental design
methods for quality improvement in the context of the Experimental Design module of Industrial
Statistics).
141

Computational Details

The process of computing classification and regression trees can be characterized as involving four
basic steps:

•  Specifying the criteria for predictive accuracy
•  Selecting splits
•  Determining when to stop splitting
•  Selecting the "right-sized" tree.
These steps are very similar to those discussed in the context of Classification Trees Analysis (see
also Breiman et al., 1984, for more details). See also, Computational Formulas.

SPECIFYING THE CRITERIA FOR PREDICTIVE ACCURACY

The classification and regression trees (C&RT) algorithms are generally aimed at achieving the best
possible predictive accuracy. Operationally, the most accurate prediction is defined as the prediction
with the minimum costs. The notion of costs was developed as a way to generalize, to a broader range
of prediction situations, the idea that the best prediction has the lowest misclassification rate. In most
applications, the cost is measured in terms of proportion of misclassified cases, or variance. In this
context, it follows, therefore, that a prediction would be considered best if it has the lowest
misclassification rate or the smallest variance. The need for minimizing costs, rather than just the
proportion of misclassified cases, arises when some predictions that fail are more catastrophic than
others, or when some predictions that fail occur more frequently than others.
Priors. In the case of a categorical response (classification problem), minimizing costs amounts to
minimizing the proportion of misclassified cases when priors are taken to be proportional to the class
sizes and when misclassification costs are taken to be equal for every class.
The a priori probabilities used in minimizing costs can greatly affect the classification of cases or
objects. Therefore, care has to be taken while using the priors. If differential base rates are not of
interest for the study, or if we know that there are about an equal number of cases in each class,
then we would use equal priors. If the differential base rates are reflected in the class sizes (as they
would be, if the sample is a probability sample), then we would use priors estimated by the class
proportions of the sample. Finally, if we have specific knowledge about the base rates (for example,
based on previous research), then we would specify priors in accordance with that knowledge. The
general point is that the relative size of the priors assigned to each class can be used to "adjust" the
importance of misclassifications for each class. However, no priors are required when we are building
a regression tree.
142

Misclassification costs. Sometimes more accurate classification of the response is desired for some
classes than others for reasons not related to the relative class sizes. If the criterion for predictive
accuracy is Misclassification costs, then minimizing costs would amount to minimizing the
proportion of misclassified cases when priors are considered proportional to the class sizes and
misclassification costs are taken to be equal for every class.
Case weights. Case weights are treated strictly as case multipliers. For example, the misclassification
rates from an analysis of an aggregated data set using case weights will be identical to the
misclassification rates from the same analysis where the cases are replicated the specified number of
times in the data file.
However, note that the use of case weights for aggregated data sets in classification problems is
related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for
aggregated data sets, we could specify appropriate priors and/or misclassification costs and produce
the same results while avoiding the additional processing required to analyze multiple cases with the
same values for all variables. Suppose that in an aggregated data set with two classes having an equal
number of cases, there are case weights of 2 for all cases in the first class, and case weights of 3 for
all cases in the second class. If we specified priors of .4 and .6, respectively, specified equal
misclassification costs, and analyzed the data without case weights, we will get the same
misclassification rates as we would get if we specified priors estimated by the class sizes, specified
equal misclassification costs, and analyzed the aggregated data set using the case weights. We would
also get the same misclassification rates if we specified priors to be equal, specified the costs of
misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as
class 1 cases, and analyzed the data without case weights.
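
A small numeric check of the equivalence described in this paragraph (plain Python, with the numbers taken from the example above): case weights, priors, and misclassification costs all reduce to the same relative importance of the two classes.

def normalize(a, b):
    s = a + b
    return a / s, b / s

# (1) Equal class counts with case weights 2 (class 1) and 3 (class 2)
case_weight_view = normalize(2, 3)

# (2) Equal costs, priors specified as .4 and .6
prior_view = normalize(0.4, 0.6)

# (3) Equal priors, cost of misclassifying class 1 = 2/3 of the cost
#     of misclassifying class 2
cost_view = normalize(0.5 * (2 / 3), 0.5 * 1.0)

print(case_weight_view, prior_view, cost_view)   # each prints (0.4, 0.6)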

SELECTING SPLITS

The second basic step in classification and regression trees is to select the splits on the predictor
variables that are used to predict membership in classes of the categorical dependent variables, or to
predict values of the continuous dependent (response) variable. In general terms, the split at each
node will be found that will generate the greatest improvement in predictive accuracy. This is usually
measured with some type of node impurity measure, which provides an indication of the relative
homogeneity (the inverse of impurity) of cases in the terminal nodes. If all cases in each terminal
node show identical values, then node impurity is minimal, homogeneity is maximal, and prediction
is perfect (at least for the cases used in the computations; predictive validity for new cases is of
course a different matter...).

For classification problems, C&RT gives the user the choice of several impurity measures: The Gini
index, Chi-square, or G-square. The Gini index of node impurity is the measure most commonly
chosen for classification-type problems. As an impurity measure, it reaches a value of zero when only
143

one class is present at a node. With priors estimated from class sizes and equal misclassification
costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes
present at the node; it reaches its maximum value when class sizes at the node are equal; the Gini
index is equal to zero if all cases in a node belong to the same class. The Chi-square measure is
similar to the standard Chi-square value computed for the expected and observed classifications (with
priors adjusted for misclassification cost), and the G-square measure is similar to the maximum-
likelihood Chi-square (as for example computed in the Log-Linear module). For regression-type
problems, a least-squares deviation criterion (similar to what is computed in least squares regression)
is automatically used. Computational Formulas provides further computational details.
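
As a quick illustration of the Gini index with equal misclassification costs, here is a minimal sketch in Python (not tied to any particular software package):

def gini(proportions):
    # Gini impurity = 1 - sum of squared class proportions,
    # equivalently the sum of products of all pairs of class proportions.
    return 1.0 - sum(p * p for p in proportions)

print(gini([1.0, 0.0]))        # pure node            -> 0.0
print(gini([0.5, 0.5]))        # equal class sizes    -> 0.5 (maximum for two classes)
print(gini([0.7, 0.2, 0.1]))   # three-class example  -> 0.46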

DETERMINING WHEN TO STOP SPLITTING

As discussed in Basic Ideas, in principle, splitting could continue until all cases are perfectly
classified or predicted. However, this wouldn't make much sense since we would likely end up with a
tree structure that is as complex and "tedious" as the original data file (with many nodes possibly
containing single observations), and that would most likely not be very useful or accurate for
predicting new observations. What is required is some reasonable stopping rule. In C&RT, two
options are available that can be used to keep a check on the splitting process; namely Minimum n
and Fraction of objects.
Minimum n. One way to control splitting is to allow splitting to continue until all terminal nodes are
pure or contain no more than a specified minimum number of cases or objects. In C&RT this is done
by using the option Minimum n that allows us to specify the desired minimum number of cases as a
check on the splitting process. This option can be used when Prune on misclassification error, Prune
on deviance, or Prune on variance is active as the Stopping rule for the analysis.
Fraction of objects. Another way to control splitting is to allow splitting to continue until all
terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of
one or more classes (in the case of classification problems, or all cases in regression problems). This
option can be used when FACT-style direct stopping has been selected as the Stopping rule for the
analysis. In C&RT, the desired minimum fraction can be specified as the Fraction of objects. For
classification problems, if the priors used in the analysis are equal and class sizes are equal as well,
then splitting will stop when all terminal nodes containing more than one class have no more cases
than the specified fraction of the class sizes for one or more classes. Alternatively, if the priors used
in the analysis are not equal, splitting will stop when all terminal nodes containing more than one
class have no more cases than the specified fraction for one or more classes. See Loh and
Vanichestakul, 1988 for details.
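
Most tree implementations expose an analogous control. The sketch below uses scikit-learn's min_samples_leaf parameter as a rough stand-in for C&RT's Minimum n option; the library, parameter name, and data are assumptions for illustration only.

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=1)

# Splitting may not create terminal nodes with fewer than 20 cases.
tree = DecisionTreeClassifier(min_samples_leaf=20, random_state=1).fit(X, y)
print("number of terminal nodes:", tree.get_n_leaves())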
144

PRUNING AND SELECTING THE "RIGHT-SIZED" TREE

The size of a tree in the classification and regression trees analysis is an important issue, since an
unreasonably big tree can only make the interpretation of results more difficult. Some generalizations
can be offered about what constitutes the "right-sized" tree. It should be sufficiently complex to
account for the known facts, but at the same time it should be as simple as possible. It should exploit
information that increases predictive accuracy and ignore information that does not. It should, if
possible, lead to greater understanding of the phenomena it describes. The options available
in C&RT allow the use of either, or both, of two different strategies for selecting the "right-sized" tree
from among all the possible trees. One strategy is to grow the tree to just the right size, where the
right size is determined by the user, based on the knowledge from previous research, diagnostic
information from previous analyses, or even intuition. The other strategy is to use a set of well-
documented, structured procedures developed by Breiman et al. (1984) for selecting the "right-sized"
tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least
they take subjective judgment out of the process of selecting the "right-sized" tree.
FACT-style direct stopping. We will begin by describing the first strategy, in which the user
specifies the size to grow the tree. This strategy is followed by selecting FACT-style direct stopping
as the stopping rule for the analysis, and by specifying the Fraction of objects that allows the tree to
grow to the desired size. C&RT provides several options for obtaining diagnostic information to
determine the reasonableness of the choice of size for the tree. Specifically, three options are
available for performing cross-validation of the selected tree; namely Test sample, V-fold, and
Minimal cost-complexity.
Test sample cross-validation. The first, and most preferred type of cross-validation is the test
sample cross-validation. In this type of cross-validation, the tree is computed from the learning
sample, and its predictive accuracy is tested by applying it to predict the class membership in the test
sample. If the costs for the test sample exceed the costs for the learning sample, then this is an
indication of poor cross-validation. In that case, a different sized tree might cross-validate better. The
test and learning samples can be formed by collecting two independent data sets, or if a large learning
sample is available, by reserving a randomly selected proportion of the cases, say a third or a half, for
use as the test sample.
In the C&RT module, test sample cross-validation is performed by specifying a sample identifier
variable that contains codes for identifying the sample (learning or test) to which each case or object
belongs.
V-fold cross-validation. The second type of cross-validation available in C&RT is V-fold cross-
validation. This type of cross-validation is useful when no test sample is available and the learning
sample is too small to have the test sample taken from it. The user-specified 'v' value for v-fold cross-
validation (its default value is 3) determines the number of random subsamples, as equal in size as
145

possible, that are formed from the learning sample. A tree of the specified size is computed 'v' times,
each time leaving out one of the subsamples from the computations, and using that subsample as a
test sample for cross-validation, so that each subsample is used (v - 1) times in the learning sample
and just once as the test sample. The CV costs (cross-validation cost) computed for each of the 'v' test
samples are then averaged to give the v-fold estimate of the CV costs.
Minimal cost-complexity cross-validation pruning. In C&RT, minimal cost-complexity cross-
validation pruning is performed, if Prune on misclassification error has been selected as the Stopping
rule. On the other hand, if Prune on deviance has been selected as the Stopping rule, then minimal
deviance-complexity cross-validation pruning is performed. The only difference in the two options is
the measure of prediction error that is used. Prune on misclassification error uses the costs that
equals the misclassification rate when priors are estimated and misclassification costs are equal,
while Prune on deviance uses a measure, based on maximum-likelihood principles, called the
deviance (see Ripley, 1996). For details about the algorithms used in C&RT to implement Minimal
cost-complexity cross-validation pruning, see also the Introductory Overview and Computational
Methods sections of Classification Trees Analysis.
The sequence of trees obtained by this algorithm has a number of interesting properties. They are
nested, because the successively pruned trees contain all the nodes of the next smaller tree in the
sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the
sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest
trees is also optimally pruned, because for every size of tree in the sequence, there is no other tree of
the same size with lower costs. Proofs and/or explanations of these properties can be found in
Breiman et al. (1984).

Tree selection after pruning. The pruning, as discussed above, often results in a sequence of
optimally pruned trees. So the next task is to use an appropriate criterion to select the "right-sized"
tree from this set of optimal trees. A natural criterion would be the CV costs (cross-validation costs).
While there is nothing wrong with choosing the tree with the minimum CV costs as the "right-sized"
tree, oftentimes there will be several trees with CV costs close to the minimum. Following Breiman et
al. (1984) we could use the "automatic" tree selection procedure and choose as the "right-sized" tree
the smallest-sized (least complex) tree whose CV costs do not differ appreciably from the minimum
CV costs. In particular, they proposed a "1 SE rule" for making this selection, i.e., choose as the
"right-sized" tree the smallest-sized tree whose CV costs do not exceed the minimum CV costs plus 1
times the standard error of the CV costs for the minimum CV costs tree. In C&RT, a multiple other
than the 1 (the default) can also be specified for the SE rule. Thus, specifying a value of 0.0 would
result in the minimal CV cost tree being selected as the "right-sized" tree. Values greater than 1.0
could lead to trees much smaller than the minimal CV cost tree being selected as the "right-sized"
146

tree. One distinct advantage of the "automatic" tree selection procedure is that it helps to avoid "over
fitting" and "under fitting" of the data.
As can be seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized"
tree selection is a truly "automatic" process. The algorithms make all the decisions leading to the
selection of the "right-sized" tree, except for, perhaps, specification of a value for the SE rule. V-fold
cross-validation allows us to evaluate how well each tree "performs" when repeatedly cross-validated
in different samples randomly drawn from the data.
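
A minimal sketch of the "1 SE rule" described above, with made-up CV costs and standard errors purely to show the selection logic:

cv_results = [
    # (number of terminal nodes, CV cost, SE of CV cost) -- illustrative numbers
    (2, 0.310, 0.020),
    (4, 0.262, 0.018),
    (7, 0.248, 0.017),   # minimum CV cost
    (12, 0.251, 0.017),
    (20, 0.259, 0.018),
]

min_cost, min_se = min((c, s) for _, c, s in cv_results)
threshold = min_cost + 1.0 * min_se          # multiple of 1 (the default SE rule)

# smallest (least complex) tree whose CV cost is within the threshold
size, cost, _ = min((t for t in cv_results if t[1] <= threshold),
                    key=lambda t: t[0])
print("selected tree:", size, "terminal nodes, CV cost", cost)

With these numbers the rule picks the 4-node tree rather than the 7-node minimum-cost tree, which is the intended bias toward simpler trees.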
Computational Formulas

In Classification and Regression Trees, estimates of accuracy are computed by different formulas for
categorical and continuous dependent variables (classification and regression-type problems). For
classification-type problems (categorical dependent variable) accuracy is measured in terms of the
true classification rate of the classifier, while in the case of regression (continuous dependent
variable) accuracy is measured in terms of mean squared error of the predictor.

In addition to measuring accuracy, the following measures of node impurity are used for
classification problems: The Gini measure, generalized Chi-square measure, and generalized G-
square measure. The Chi-square measure is similar to the standard Chi-square value computed for the
expected and observed classifications (with priors adjusted for misclassification cost), and the G-
square measure is similar to the maximum-likelihood Chi-square (as for example computed in
the Log-Linear module). The Gini measure is the one most often used for measuring purity in the
context of classification problems, and it is described below.
For continuous dependent variables (regression-type problems), the least squared deviation (LSD)
measure of impurity is automatically applied.

ESTIMATION OF ACCURACY IN CLASSIFICATION

In classification problems (categorical dependent variable), three estimates of the accuracy are used:
resubstitution estimate, test sample estimate, and v-fold cross-validation. These estimates are defined
here.

Resubstitution estimate. The resubstitution estimate is the proportion of cases that are misclassified by the
classifier constructed from the entire sample. This estimate is computed in the following manner:

R(d) = \frac{1}{N} \sum_{i=1}^{N} X\left( d(x_i) \ne j_i \right)

where X is the indicator function;

X = 1, if the statement is true

X = 0, if the statement is false

147

and d(x) is the classifier; j_i denotes the observed class of case i.


The resubstitution estimate is computed using the same data as used in constructing the classifier d.
Test sample estimate. The total number of cases is divided into two subsamples ℒ_1 and ℒ_2. The test sample
estimate is the proportion of cases in the subsample ℒ_2 which are misclassified by the classifier constructed
from the subsample ℒ_1. This estimate is computed in the following way.
Let the learning sample ℒ of size N be partitioned into subsamples ℒ_1 and ℒ_2 of sizes N_1 and N_2, respectively. Then

R_{ts}(d) = \frac{1}{N_2} \sum_{(x_i, j_i) \in \mathcal{L}_2} X\left( d(x_i) \ne j_i \right)

where ℒ_2 is the subsample that is not used for constructing the classifier.
v-fold crossvalidation. The total number of cases is divided into v subsamples ℒ_1, ℒ_2, ..., ℒ_v of almost equal
sizes. The v-fold cross-validation estimate is the proportion of cases in each subsample ℒ_v that are misclassified by
the classifier constructed from the subsample ℒ − ℒ_v. This estimate is computed in the following way.
Let the learning sample ℒ of size N be partitioned into v subsamples ℒ_1, ℒ_2, ..., ℒ_v of almost equal sizes N_1, N_2, ..., N_v, respectively. Then

R_{cv}(d) = \frac{1}{N} \sum_{v} \sum_{(x_i, j_i) \in \mathcal{L}_v} X\left( d^{(v)}(x_i) \ne j_i \right)

where d^{(v)} is computed from the subsample ℒ − ℒ_v.
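
A minimal sketch of the resubstitution and test-sample estimates defined above, using scikit-learn's decision tree as an illustrative classifier d(x); the library and synthetic data are assumptions, not part of the C&RT module itself.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=6, random_state=2)

# Resubstitution: build d(x) on the whole sample, evaluate on the same sample.
d_full = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X, y)
resub = np.mean(d_full.predict(X) != y)

# Test-sample: build d(x) on L1, evaluate on the held-out subsample L2.
X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=2)
d_l1 = DecisionTreeClassifier(max_depth=4, random_state=2).fit(X1, y1)
test_sample = np.mean(d_l1.predict(X2) != y2)

print("resubstitution estimate:", round(resub, 3))
print("test-sample estimate:   ", round(test_sample, 3))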

ESTIMATION OF ACCURACY IN REGRESSION

In the regression problem (continuous dependent variable) three estimates of the accuracy are used:
resubstitution estimate, test sample estimate, and v-fold cross-validation. These estimates are defined
here.

Resubstitution estimate. The resubstitution estimate is the estimate of the expected squared error using the
predictor of the continuous dependent variable. This estimate is computed in the following way:

R(d) = \frac{1}{N} \sum_{i=1}^{N} \left( y_i - d(x_i) \right)^2

where the learning sample ℒ consists of (x_i, y_i), i = 1, 2, ..., N. The resubstitution estimate is computed using
the same data as used in constructing the predictor d.
Test sample estimate. The total number of cases is divided into two subsamples ℒ_1 and ℒ_2. The test sample
estimate of the mean squared error is computed in the following way:
Let the learning sample ℒ of size N be partitioned into subsamples ℒ_1 and ℒ_2 of sizes N_1 and N_2, respectively. Then

R_{ts}(d) = \frac{1}{N_2} \sum_{(x_i, y_i) \in \mathcal{L}_2} \left( y_i - d(x_i) \right)^2

where ℒ_2 is the subsample that is not used for constructing the predictor.
148

v-fold cross-validation. The total number of cases is divided into v subsamples ℒ_1, ℒ_2, ..., ℒ_v of almost equal
sizes. The subsample ℒ − ℒ_v is used to construct the predictor d^{(v)}. Then the v-fold cross-validation estimate is
computed from the subsamples ℒ_v in the following way:
Let the learning sample ℒ of size N be partitioned into v subsamples ℒ_1, ℒ_2, ..., ℒ_v of almost equal sizes N_1, N_2, ..., N_v, respectively. Then

R_{cv}(d) = \frac{1}{N} \sum_{v} \sum_{(x_i, y_i) \in \mathcal{L}_v} \left( y_i - d^{(v)}(x_i) \right)^2

where d^{(v)} is computed from the subsample ℒ − ℒ_v.
ESTIMATION OF NODE IMPURITY: GINI MEASURE

The Gini measure is the measure of impurity of a node and is commonly used when the dependent variable
is a categorical variable, defined as:

g(t) = \sum_{i \ne j} p(i \mid t)\, p(j \mid t) = 1 - \sum_{j} p(j \mid t)^2, if costs of misclassification are not specified,

g(t) = \sum_{i \ne j} C(i \mid j)\, p(i \mid t)\, p(j \mid t), if costs of misclassification are specified,

where the sum extends over all k categories, p(j | t) is the probability of category j at the node t, and C(i | j)
is the cost of misclassifying a category j case as category i.
ESTIMATION OF NODE IMPURITY: LEAST-SQUARED DEVIATION

Least-squared deviation (LSD) is used as the measure of impurity of a node when the response variable is
continuous, and is computed as:

R(t) = \frac{1}{N_w(t)} \sum_{i \in t} w_i\, f_i\, \left( y_i - \bar{y}(t) \right)^2

where N_w(t) is the weighted number of cases in node t, w_i is the value of the weighting variable for
case i, f_i is the value of the frequency variable, y_i is the value of the response variable, and \bar{y}(t) is the
weighted mean for node t.
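
A minimal sketch of the LSD impurity of a node, with unit case weights and frequencies assumed when none are supplied:

import numpy as np

def lsd(y, w=None, f=None):
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    f = np.ones_like(y) if f is None else np.asarray(f, dtype=float)
    n_w = np.sum(w * f)                          # weighted number of cases in the node
    y_bar = np.sum(w * f * y) / n_w              # weighted node mean
    return np.sum(w * f * (y - y_bar) ** 2) / n_w

print(lsd([3.0, 5.0, 7.0]))   # variance of the node's responses, 8/3 ~ 2.67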
149

k-Nearest Neighbor Method (KNN)

The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms. An
object is classified by a majority vote of its neighbors, with the object being assigned the class most
common amongst its k nearest neighbors. k is a positive integer, typically small. If k = 1, then the
object is simply assigned the class of its nearest neighbor. In binary (two class) classification
problems, it is helpful to choose k to be an odd number as this avoids difficulties with tied votes.

The same method can be used for regression, by simply assigning the property value for the object to
be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of
the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.

The neighbors are taken from a set of objects for which the correct classification (or, in the case of
regression, the value of the property) is known. This can be thought of as the training set for the
algorithm, though no explicit training step is required. In order to identify neighbors, the objects are
represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean
distance, though other distance measures, such as the Manhattan distance could in principle be used
instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data.

Algorithm

Example of k-NN classification. The test sample (green circle) should be classified either to the first
class of blue squares or to the second class of red triangles. If k = 3 it is classified to the second class
because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is classified to the first
class (3 squares vs. 2 triangles inside the outer circle).

The training examples are vectors in a multidimensional feature space. The space is partitioned into
regions by locations and labels of the training samples. A point in the space is assigned to the class c
if it is the most frequent class label among the k nearest training samples. Usually Euclidean distance
is used.

The training phase of the algorithm consists only of storing the feature vectors and class labels of the
training samples. In the actual classification phase, the test sample (whose class is not known) is
represented as a vector in the feature space. Distances from the new vector to all stored vectors are
computed and the k closest samples are selected. There are a number of ways to assign the new vector
to a particular class; the most commonly used technique is to assign the new vector to the most common
class amongst the k nearest neighbors. A major drawback of this technique is that classes with more
frequent examples tend to dominate the prediction of the new vector, simply because they are more
likely to appear among the k nearest neighbors due to their larger number. One way to overcome
this problem is to take into account the
150

distance of each of the k nearest neighbors from the new vector that is to be classified, and to predict the class
of the new vector by weighting the neighbors according to these distances.
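
A minimal sketch of this classification phase (Euclidean distance, simple majority vote); the tiny training set and query point are purely illustrative:

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # distances from the new vector to every stored training vector
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]              # indices of the k closest samples
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]            # most frequent class label

X_train = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.8]])
y_train = np.array(["blue", "blue", "red", "red"])
print(knn_classify(X_train, y_train, np.array([1.1, 0.9]), k=3))  # -> "blue"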

Parameter selection

The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on
the classification, but make boundaries between classes less distinct. A good k can be selected by
various heuristic techniques, for example, cross-validation. The special case where the class is
predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor
algorithm.

The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant
features, or if the feature scales are not consistent with their importance. Much research effort has
been put into selecting or scaling features to improve classification. A particularly popular approach
is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale
features by the mutual information of the training data with the training classes.
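
As a concrete illustration of selecting k by cross-validation, the sketch below assumes Python with scikit-learn and synthetic data; feature scaling is included because, as noted above, k-NN is sensitive to feature scales.

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=5, random_state=3)

# Scale the features first, then search over candidate values of k.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())
grid = GridSearchCV(pipe,
                    {"kneighborsclassifier__n_neighbors": [1, 3, 5, 7, 9, 11]},
                    cv=5)
grid.fit(X, y)
print("best k:", grid.best_params_, "cv accuracy:", round(grid.best_score_, 3))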

Properties

The naive version of the algorithm is easy to implement by computing the distances from the test
sample to all stored vectors, but it is computationally intensive, especially when the size of the
training set grows. Many optimizations have been proposed over the years; these generally seek to
reduce the number of distance evaluations actually performed. Some optimizations involve
partitioning the feature space, and only computing distances within specific nearby volumes.

The k-NN algorithm can also be adapted for use in estimating continuous variables. One such
implementation uses an inverse distance weighted average of the k-nearest multivariate neighbors.
This algorithm functions as follows (a minimal sketch in code is given after the steps):

1. Compute Euclidean or Mahalanobis distance from target plot to those that were sampled.
2. Order samples taking for account calculated distances.
3. Choose heuristically optimal k nearest neighbor based on RMSE done by cross validation
technique.
4. Calculate an inverse distance weighted average with the k-nearest multivariate neighbors.
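
A minimal sketch of steps 1, 2 and 4 (with k assumed to have been chosen already, as in step 3); the data are illustrative:

import numpy as np

def knn_regress(X_train, y_train, x_new, k=3, eps=1e-9):
    dists = np.linalg.norm(X_train - x_new, axis=1)      # step 1: Euclidean distances
    nearest = np.argsort(dists)[:k]                       # step 2: order and take k nearest
    w = 1.0 / (dists[nearest] + eps)                      # inverse-distance weights
    return np.sum(w * y_train[nearest]) / np.sum(w)       # step 4: weighted average

X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([1.1, 1.9, 3.2, 10.5])
print(knn_regress(X_train, y_train, np.array([2.5]), k=3))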

KNN -Definition
KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity
measure.
KNN –different names
•K-Nearest Neighbors

•Memory-Based Reasoning

•Example-Based Reasoning

•Instance-Based Learning
151

•Case-Based Reasoning

•Lazy Learning

KNN –Short History


•Nearest neighbors have been used in statistical estimation and pattern recognition since the
beginning of the 1970s (non-parametric techniques).
•Dynamic Memory: A Theory of Reminding and Learning in Computers and People (Schank, 1982).
•People reason by remembering and learn by doing.
•Thinking is reminding, making analogies.
•Examples = Concepts???
KNN Classification

KNN Classification – Distance

Similarity - Distance Measure


152

KNN Regression - Distance

KNN –Number of Neighbors


•If K=1, select the nearest neighbor
•If K>1,
–For classification select the most frequent neighbor.
–For regression calculate the average of K neighbors.
Distance – Categorical Variables

KNN -Applications
•Classification and Interpretation
–legal, medical, news, banking
•Problem-solving
–planning, pronunciation
•Function learning
153

–dynamic control
•Teaching and aiding
–help desk, user training
Summary
•KNN is conceptually simple, yet able to solve complex problems
•Can work with relatively little information
•Learning is simple (no learning at all!)
•Memory and CPU cost
•Feature selection problem
•Sensitive to representation

K Nearest Neighbor
Lazy Learning Algorithm
Defer the decision to generalize beyond the training
examples till a new query is encountered
Whenever we have a new point to classify, we find its K
nearest neighbors from the training data.
The distance is calculated using one of the following
measures:
•Euclidean Distance
•Minkowski Distance
•Mahalanobis Distance

Curse of Dimensionality
Distance measures usually involve all the attributes and assume that all of them have the same effect on the distance.
Such similarity metrics do not consider the relative importance of the attributes, which can result in inaccurate distances
and, in turn, reduced classification precision. Misclassification due to the presence of many irrelevant
attributes is often termed the curse of dimensionality.
For example: Each instance is described by 20 attributes out of which only 2 are relevant in
determining the classification of the target function. In this case, instances that have identical values
for the 2 relevant attributes may nevertheless be distant from one another in the 20 dimensional
instance space.
154

Time Series Analysis through AR Modeling


1 Introduction

In the previous chapter we discussed the fundamentals of time series
modeling and forecasting. The selection of a proper model is extremely important as it reflects the
underlying structure of the series and this fitted model in turn is used for future forecasting. A time
series model is said to be linear or non-linear depending on whether the current value of the series is a
linear or non-linear function of past observations. In general models for time series data can have
many forms and represent different stochastic processes. There are two widely used linear time series
models in literature, viz. Autoregressive (AR) and Moving Average (MA) models. Combining these
two, the Autoregressive Moving Average (ARMA) and Autoregressive Integrated Moving Average
(ARIMA) models have been proposed in literature. The Autoregressive Fractionally Integrated
Moving Average (ARFIMA) model generalizes ARMA and ARIMA models. For seasonal time series
forecasting, a variation of ARIMA, viz. the Seasonal Autoregressive Integrated Moving Average
(SARIMA) model is used. ARIMA model and its different variations are based on the famous Box-
Jenkins principle and so these are also broadly known as the Box-Jenkins models. Linear models
have drawn much attention due to their relative simplicity in understanding and implementation.
However, many practical time series show non-linear patterns. For example, as noted by R.
Parrelli, non-linear models are appropriate for predicting volatility changes in economic and
financial time series. Considering these facts, various nonlinear models have been suggested in
literature. Some of them are the famous Autoregressive Conditional Heteroskedasticity (ARCH)
model and its variations like Generalized ARCH (GARCH) , Exponential Generalized ARCH
(EGARCH) etc., the Threshold Autoregressive (TAR) model, the Non-linear Autoregressive (NAR)
model, the Nonlinear Moving Average (NMA) model, etc. In this chapter we shall discuss the
important linear and non-linear stochastic time series models with their different properties. This
chapter will provide a background for the upcoming chapters, in which we shall study other models
used for time series forecasting.
2 The Autoregressive Moving Average (ARMA) Models

An ARMA(p, q) model is a combination of
AR(p) and MA(q) models and is suitable for univariate time series modeling. In an AR(p) model the
future value of a variable is assumed to be a linear combination of p past observations and a random
error together with a constant term. Mathematically the AR(p) model can be expressed as:

y_t = c + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + \varepsilon_t

Here y_t and ε_t are respectively the actual value and random error (or random shock) at time period t,
φ_i (i = 1, 2, ..., p) are model parameters and c is a constant. The integer constant p is known as the
order of the model. Sometimes the constant term is omitted for simplicity. For estimating the
parameters of an AR process using the given time series, the Yule-Walker equations are used. Just as
an AR(p) model regresses against past values of the series, an MA(q) model uses past errors as the
explanatory variables. The MA(q) model is given by:

y_t = \mu + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j} + \varepsilon_t

Here μ is the mean of the series, θ_j (j = 1, 2, ..., q) are the model parameters and q is the order of the
model. The random shocks are assumed to be a white noise [21, 23] process, i.e. a sequence of
independent and identically distributed (i.i.d.) random variables with zero mean and a constant
variance σ². Generally, the random shocks are assumed to follow the typical normal distribution.
Thus conceptually a moving average model is a linear regression of the current observation of the
155

time series against the random shocks of one or more prior observations. Fitting an MA model to a
time series is more complicated than fitting an AR model because in the former the random error
terms are not foreseeable. Autoregressive (AR) and moving average (MA) models can be effectively
combined together to form a general and useful class of time series models, known as the ARMA
models. Mathematically an ARMA(p, q) model is represented as:

y_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i\, y_{t-i} + \sum_{j=1}^{q} \theta_j\, \varepsilon_{t-j}

Here the model orders p, q refer to p autoregressive and q moving average terms. Usually ARMA
models are manipulated using the lag operator notation. The lag or backshift operator L is defined as
L y_t = y_{t-1}. Polynomials of the lag operator, or lag polynomials, are used to represent ARMA models as
follows:

\varphi(L)\, y_t = \theta(L)\, \varepsilon_t, where \varphi(L) = 1 - \varphi_1 L - \cdots - \varphi_p L^p and \theta(L) = 1 + \theta_1 L + \cdots + \theta_q L^q

It is shown in [23] that an important property of AR(p) process is invertibility, i.e. an AR(p) process
can always be written in terms of an MA(∞) process. Whereas for an MA(q) process to be invertible,
all the roots of the equation θ (L) = 0 must lie outside the unit circle. This condition is known as the
Invertibility Condition for an MA process.
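
As a hands-on illustration, the sketch below simulates a stationary AR(2) process and fits it to recover the coefficients. It assumes Python with statsmodels rather than the SPSS procedures used elsewhere in this material, and the coefficients are arbitrary illustrative values.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

np.random.seed(0)
# statsmodels expects the lag polynomial 1 - 0.5L - 0.2L^2 written with
# the zero-lag coefficient included and the AR signs reversed.
ar = np.array([1, -0.5, -0.2])
ma = np.array([1])                 # no MA terms
y = ArmaProcess(ar, ma).generate_sample(nsample=500)

fit = ARIMA(y, order=(2, 0, 0)).fit()   # AR(2), no differencing, no MA part
print(fit.params)                        # constant, AR coefficients, noise variance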

3 Stationarity Analysis

When an AR(p) process is represented as \varphi(L)\, y_t = \varepsilon_t, the equation \varphi(L) = 0 is known
as the characteristic equation for the process. It is proved by Box and Jenkins that a necessary and
sufficient condition for the AR(p) process to be stationary is that all the roots of the characteristic
equation must fall outside the unit circle. Hipel and McLeod mentioned another simple algorithm (by
Schur and Pagano) for determining the stationarity of an AR process.

An MA(q) process is always stationary, irrespective of the values of the MA parameters. The conditions
regarding stationarity and invertibility of AR and MA processes also hold for an ARMA process. An
ARMA(p, q) process is stationary if all the roots of the characteristic equation ϕ(L) = 0 lie outside the
unit circle. Similarly, if all the roots of the lag equation θ (L) = 0 lie outside the unit circle, then the
ARMA(p, q) process is invertible and can be expressed as a pure AR process.

4 Autocorrelation and Partial Autocorrelation Functions (ACF and PACF)

To determine a proper
model for given time series data, it is necessary to carry out the ACF and PACF analysis. These
statistical measures reflect how the observations in a time series are related to each other. For
modeling and forecasting purpose it is often useful to plot the ACF and PACF against consecutive
time lags. These plots help in determining the order of AR and MA terms. Below we give their
mathematical definitions:
156

The autocorrelation coefficient [21, 23] at lag k is defined as:

\rho_k = \frac{\gamma_k}{\gamma_0}, where \gamma_k = \mathrm{Cov}(y_t, y_{t+k}) = E\left[ (y_t - \mu)(y_{t+k} - \mu) \right]

Here μ is the mean of the time series, i.e. μ = E[y_t]. The autocovariance at lag zero, γ_0, is the
variance of the time series. From the definition it is clear that the autocorrelation coefficient ρ_k is
dimensionless and so is independent of the scale of measurement. Also, clearly -1 ≤ ρ_k ≤ 1.
Statisticians Box and Jenkins [6] termed γ_k the theoretical Autocovariance Function
(ACVF) and ρ_k the theoretical Autocorrelation Function (ACF). Another measure, known as the
Partial Autocorrelation Function (PACF), is used to measure the correlation between an observation k
periods ago and the current observation, after controlling for observations at intermediate lags (i.e. at
lags < k) [12]. At lag 1, PACF(1) is the same as ACF(1). The detailed formulae for calculating the PACF are
given in the references. Normally, the stochastic process governing a time series is unknown and so it is not
possible to determine the actual or theoretical ACF and PACF values. Rather, these values are to be
estimated from the training data, i.e. the known time series at hand. The estimated ACF and PACF
values from the training data are respectively termed the sample ACF and sample PACF. The
most appropriate sample estimate for the ACVF at lag k is

c_k = \frac{1}{N} \sum_{t=1}^{N-k} (y_t - \bar{y})(y_{t+k} - \bar{y})

Then the estimate for the sample ACF at lag k is given by

r_k = \frac{c_k}{c_0}

As explained by Box and Jenkins [6], the sample ACF plot is useful in determining the type of model
to fit to a time series of length N. Since ACF is symmetrical about lag zero, it is only required to plot
the sample ACF for positive lags, from lag one onwards to a maximum lag of about N/4. The sample
PACF plot helps in identifying the maximum order of an AR process.
5 Autoregressive Integrated Moving Average (ARIMA) Models

The ARMA models described above
can only be used for stationary time series data. However, in practice many time series, such as those
related to socio-economic and business phenomena, show non-stationary behavior. Time series which contain
trend and seasonal patterns are also non-stationary in nature. Thus, from an application viewpoint,
ARMA models are inadequate to properly describe non-stationary time series, which are frequently
encountered in practice. For this reason the ARIMA model is proposed, which is a generalization of
an ARMA model to include the case of non-stationarity as well. In ARIMA models a non-stationary
time series is made stationary by applying finite differencing of the data points. The mathematical
formulation of the ARIMA(p, d, q) model using lag polynomials is given below:

\varphi(L)\, (1 - L)^d\, y_t = \theta(L)\, \varepsilon_t

Here, p, d and q are integers greater than or equal to zero and refer to the order of the autoregressive,
integrated, and moving average parts of the model respectively.
157

A useful generalization of ARIMA models is the Autoregressive Fractionally Integrated Moving


Average (ARFIMA) model, which allows non-integer values of the differencing parameter d.
ARFIMA has useful applications in modeling time series with long memory. In this model the
expansion of the term (1 - L)^d is done using the general binomial theorem. Various
contributions have been made by researchers towards the estimation of the general ARFIMA
parameters.

6 Seasonal Autoregressive Integrated Moving Average (SARIMA) Models

The ARIMA model
is for non-seasonal non-stationary data. Box and Jenkins have generalized this model to deal with
seasonality. Their proposed model is known as the Seasonal ARIMA (SARIMA) model. In this
model seasonal differencing of appropriate order is used to remove non-stationarity from the series. A
first-order seasonal difference is the difference between an observation and the corresponding
observation from the previous year and is calculated as z_t = y_t - y_{t-s}. For monthly time series s = 12
and for quarterly time series s = 4.

Here z_t is the seasonally differenced series.
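
A small illustration of ordinary and seasonal differencing, assuming Python with pandas and a made-up monthly series (so s = 12):

import numpy as np
import pandas as pd

np.random.seed(0)
y = pd.Series(np.random.normal(size=48).cumsum() +
              10 * np.sin(np.arange(48) * 2 * np.pi / 12))

d1 = y.diff()        # first difference:           y_t - y_{t-1}
d12 = y.diff(12)     # first seasonal difference:  z_t = y_t - y_{t-12}
print(d12.dropna().head())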

7 Some Nonlinear Time Series Models

So far we have discussed linear time series models. As
mentioned earlier, nonlinear models should also be considered for better time series analysis and
forecasting. Campbell, Lo and MacKinlay (1997) made important contributions in this direction.
According to them, almost all non-linear time series can be divided into two branches: one includes
models nonlinear in mean and the other includes models non-linear in variance (heteroskedastic). As an
illustrative example, here we present two nonlinear time series models from [28]:

8 Box-Jenkins Methodology

After describing various time series models, the next issue of
concern is how to select an appropriate model that can produce accurate forecasts based on a
description of the historical pattern in the data, and how to determine the optimal model orders.
Statisticians George Box and Gwilym Jenkins developed a practical approach to build ARIMA
158

model which best fits a given time series and also satisfies the parsimony principle. Their concept
has fundamental importance in the area of time series analysis and forecasting. The Box-Jenkins
methodology does not assume any particular pattern in the historical data of the series to be
forecasted. Rather, it uses a three-step iterative approach of model identification, parameter estimation
and diagnostic checking to determine the best parsimonious model from a general class of ARIMA
models. This three-step process is repeated several times until a satisfactory model is finally selected.
Then this model can be used for forecasting future values of the time series. The Box-Jenkins forecast
method is schematically shown in Fig.

A crucial step in appropriate model selection is the determination of the optimal model parameters.
One criterion is that the sample ACF and PACF, calculated from the training data, should match
the corresponding theoretical or actual values. Other widely used measures for model identification
are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which are
defined below:

\mathrm{AIC} = n \ln\left( \frac{SSE}{n} \right) + 2p, \qquad \mathrm{BIC} = n \ln\left( \frac{SSE}{n} \right) + p \ln n

Here n is the number of effective observations used to fit the model, p is the number of parameters in
the model and SSE is the sum of squared sample residuals. The optimal model order is chosen as the
number of model parameters which minimizes either AIC or BIC.
Other similar criteria have also been proposed in the literature for optimal model identification.
159

Neural Network Concepts


The human visual system is one of the wonders of the world. Consider the following
sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each
hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing
140 million neurons, with tens of billions of connections between them. And yet human
vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing
progressively more complex image processing. We carry in our heads a supercomputer,
tuned by evolution over hundreds of millions of years, and superbly adapted to understand
the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are
stupendously, astoundingly good at making sense of what our eyes show us. But nearly all
that work is done unconsciously. And so we don't usually appreciate how tough a problem
our visual systems solve.

The difficulty of visual pattern recognition becomes apparent if you attempt to write a
computer program to recognize digits like those above. What seems easy when we do it
ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize
shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be
not so simple to express algorithmically. When you try to make such rules precise, you
quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number
of handwritten digits, known as training examples,

and then develop a system which can learn from those training examples. In other words,
the neural network uses the examples to automatically infer rules for recognizing
handwritten digits. Furthermore, by increasing the number of training examples, the
network can learn more about handwriting, and so improve its accuracy. So while I've shown
160

just 100 training digits above, perhaps we could build a better handwriting recognizer by
using thousands or even millions or billions of training examples.

In this chapter we'll write a computer program implementing a neural network that learns to
recognize handwritten digits. The program is just 74 lines long, and uses no special neural
network libraries. But this short program can recognize digits with an accuracy over 96
percent, without human intervention. Furthermore, in later chapters we'll develop ideas
which can improve accuracy to over 99 percent. In fact, the best commercial neural networks
are now so good that they are used by banks to process cheques, and by post offices to
recognize addresses.

We're focusing on handwriting recognition because it's an excellent prototype problem for
learning about neural networks in general. As a prototype it hits a sweet spot: it's
challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to
require an extremely complicated solution, or tremendous computational power.
Furthermore, it's a great way to develop more advanced techniques, such as deep learning.
And so throughout the book we'll return repeatedly to the problem of handwriting
recognition. Later in the book, we'll discuss how these ideas may be applied to other
problems in computer vision, and also in speech, natural language processing, and other
domains.

Of course, if the point of the chapter was only to write a computer program to recognize
handwritten digits, then the chapter would be much shorter! But along the way we'll develop
many key ideas about neural networks, including two important types of artificial neuron
(the perceptron and the sigmoid neuron), and the standard learning algorithm for neural
networks, known as stochastic gradient descent. Throughout, I focus on
explaining why things are done the way they are, and on building your neural networks
intuition. That requires a lengthier discussion than if I just presented the basic mechanics of
what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the
payoffs, by the end of the chapter we'll be in position to understand what deep learning is,
and why it matters.

Perceptrons

What is a neural network? To get started, I'll explain a type of artificial neuron called
a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank
Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more
common to use other models of artificial neurons - in this book, and in much modern work
on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get
to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way
they are, it's worth taking the time to first understand perceptrons.

So how do perceptrons work? A perceptron takes several binary inputs, x1, x2, …, and
produces a single binary output:
161

In the example shown the perceptron has three inputs, x1, x2, x3. In general it could
have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He
introduced weights, w1, w2, …, real numbers expressing the importance of the
respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the
weighted sum \sum_j w_j x_j is less than or greater than some threshold value. Just like the
weights, the threshold is a real number which is a parameter of the neuron. To put it in more
precise algebraic terms:

\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \quad (1)

That's all there is to how a perceptron works!

That's the basic mathematical model. A way you can think about the perceptron is that it's a
device that makes decisions by weighing up evidence. Let me give an example. It's not a very
realistic example, but it's easy to understand, and we'll soon get to more realistic examples.
Suppose the weekend is coming up, and you've heard that there's going to be a cheese
festival in your city. You like cheese, and are trying to decide whether or not to go to the
festival. You might make your decision by weighing up three factors:

1. Is the weather good?


2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car).

We can represent these three factors by corresponding binary variables x1, x2, and x3.
For instance, we'd have x1 = 1 if the weather is good, and x1 = 0 if the weather is bad.
Similarly, x2 = 1 if your boyfriend or girlfriend wants to go, and x2 = 0 if not. And
similarly again for x3 and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival
even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But
perhaps you really loathe bad weather, and there's no way you'd go to the festival if the
weather is bad. You can use perceptrons to model this kind of decision-making. One way to
do this is to choose a weight w1 = 6 for the weather, and w2 = 2 and w3 = 2 for
the other conditions. The larger value of w1 indicates that the weather matters a lot to
you, much more than whether your boyfriend or girlfriend joins you, or the nearness of
public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these
choices, the perceptron implements the desired decision-making model,
outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes
no difference to the output whether your boyfriend or girlfriend wants to go, or whether
public transit is nearby.
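
A minimal sketch of this example in code, with the weights (6, 2, 2) and threshold 5 from the text; enumerating all eight possible inputs confirms that only the weather matters:

def perceptron(x, w, threshold):
    # output 1 if the weighted sum exceeds the threshold, else 0
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > threshold else 0

w, threshold = (6, 2, 2), 5
for x1 in (0, 1):
    for x2 in (0, 1):
        for x3 in (0, 1):
            print((x1, x2, x3), "->", perceptron((x1, x2, x3), w, threshold))
# Every input with x1 = 1 gives output 1; every input with x1 = 0 gives 0.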

By varying the weights and the threshold, we can get different models of decision-making.
For example, suppose we instead chose a threshold of 3. Then the perceptron would decide
that you should go to the festival whenever the weather was good or when both the festival
162

was near public transit and your boyfriend or girlfriend was willing to join you. In other
words, it'd be a different model of decision-making. Dropping the threshold means you're
more willing to go to the festival.

Obviously, the perceptron isn't a complete model of human decision-making! But what the
example illustrates is how a perceptron can weigh up different kinds of evidence in order to
make decisions. And it should seem plausible that a complex network of perceptrons could
make quite subtle decisions:

In this network, the first column of perceptrons - what we'll call the first layer of
perceptrons - is making three very simple decisions, by weighing the input evidence. What
about the perceptrons in the second layer? Each of those perceptrons is making a decision by
weighing up the results from the first layer of decision-making. In this way a perceptron in
the second layer can make a decision at a more complex and more abstract level than
perceptrons in the first layer. And even more complex decisions can be made by the
perceptron in the third layer. In this way, a many-layer network of perceptrons can engage
in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In
the network above the perceptrons look like they have multiple outputs. In fact, they're still
single output. The multiple output arrows are merely a useful way of indicating that the
output from a perceptron is being used as the input to several other perceptrons. It's less
unwieldy than drawing a single output line which then splits.

Let's simplify the way we describe perceptrons. The


condition \sum_j w_j x_j > threshold is cumbersome, and we can make two notational
changes to simplify it. The first change is to write \sum_j w_j x_j as a dot
product, w \cdot x \equiv \sum_j w_j x_j, where w and x are vectors whose components are the
weights and inputs, respectively. The second change is to move the threshold to the other
side of the inequality, and to replace it by what's known as the
perceptron's bias, b \equiv -\text{threshold}. Using the bias instead of the threshold, the
perceptron rule can be rewritten:

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \quad (2)


You can think of the bias as a measure of how easy it is to get the perceptron to output a 1. Or to put
it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a
perceptron with a really big bias, it's extremely easy for the perceptron to output a 1. But if the bias
is very negative, then it's difficult for the perceptron to output a 1. Obviously, introducing the bias is
only a small change in how we describe perceptrons, but we'll see later that it leads to further
notational simplifications. Because of this, in the remainder of the book we won't use the threshold,
we'll always use the bias.

I've described perceptrons as a method for weighing evidence to make decisions. Another
way perceptrons can be used is to compute the elementary logical functions we usually think
of as underlying computation, functions such as AND, OR, and NAND. For example, suppose
we have a perceptron with two inputs, each with weight −2, and an overall bias of 3.
Here's our perceptron:

Then we see that input 00 produces output 1, since (−2)∗0 + (−2)∗0 + 3 = 3 is positive.
Here, I've introduced the ∗ symbol to make the multiplications explicit. Similar
calculations show that the inputs 01 and 10 produce output 1. But the input 11 produces
output 0, since (−2)∗1 + (−2)∗1 + 3 = −1 is negative. And so our perceptron
implements a NAND gate!
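
The same truth table can be verified with a short Python sketch. The weights (−2, −2) and bias 3 are those given above, written in the w⋅x + b form of the perceptron rule; the function names are just illustrative.

```python
def perceptron(x, w, b):
    """Bias form of the perceptron rule: output 1 if w.x + b > 0, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

def nand(x1, x2):
    return perceptron([x1, x2], w=[-2, -2], b=3)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, nand(x1, x2))
# Prints 1 for inputs 00, 01, 10 and 0 for input 11 -- a NAND gate.
```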

The NAND example shows that we can use perceptrons to compute simple logical functions.
In fact, we can use networks of perceptrons to compute any logical function at all. The
reason is that the NAND gate is universal for computation, that is, we can build any
computation up out of NAND gates. For example, we can use NAND gates to build a circuit
which adds two bits, x1 and x2. This requires computing the bitwise sum, x1 ⊕ x2,
as well as a carry bit which is set to 1 when both x1 and x2 are 1, i.e., the carry bit is
just the bitwise product x1 x2:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons
with two inputs, each with weight −2, and an overall bias of 3. Here's the resulting
network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a
little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost
perceptron is used twice as input to the bottommost perceptron. When I defined the
perceptron model I didn't say whether this kind of double-output-to-the-same-place was
allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then
it's possible to simply merge the two lines, into a single connection with a weight of -4
instead of two connections with -2 weights. (If you don't find this obvious, you should stop
and prove to yourself that this is equivalent.) With that change, the network looks as follows,
with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as
marked:

Up to now I've been drawing inputs like x1 and x2 as variables floating to the left of the
network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons -
the input layer - to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs,
is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we
did have a perceptron with no inputs. Then the weighted sum ∑j wj xj would always be
zero, and so the perceptron would output 1 if b > 0, and 0 if b ≤ 0. That is, the
perceptron would simply output a fixed value, not the desired value (x1, in the example
above). It's better to think of the input perceptrons as not really being perceptrons at all, but
rather special units which are simply defined to output the desired values, x1, x2, ….

The adder example demonstrates how a network of perceptrons can be used to simulate a
circuit containing many NAND gates. And because NAND gates are universal for computation,
it follows that perceptrons are also universal for computation.
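
As a hedged illustration of the adder described above, the sketch below wires together the NAND perceptron from the previous snippet in the standard NAND-only half-adder arrangement, in which the output of the first gate is fed twice into the gate that produces the carry bit, as in the network just discussed. The function names are invented for this example.

```python
def nand(x1, x2, w=(-2, -2), b=3):
    """NAND gate built from a single perceptron with weights -2, -2 and bias 3."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

def half_adder(x1, x2):
    """Bitwise sum (x1 XOR x2) and carry (x1 AND x2), built purely from NAND perceptrons."""
    a = nand(x1, x2)          # leftmost perceptron
    b1 = nand(x1, a)
    b2 = nand(x2, a)
    bit_sum = nand(b1, b2)    # x1 XOR x2
    carry = nand(a, a)        # output of 'a' used twice, the "double input" noted above
    return bit_sum, carry

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, half_adder(x1, x2))
# (0,0)->(0,0), (0,1)->(1,0), (1,0)->(1,0), (1,1)->(0,1)
```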

The computational universality of perceptrons is simultaneously reassuring and


disappointing. It's reassuring because it tells us that networks of perceptrons can be as
powerful as any other computing device. But it's also disappointing, because it makes it
seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

However, the situation is better than this view suggests. It turns out that we can
devise learning algorithms which can automatically tune the weights and biases of a
network of artificial neurons. This tuning happens in response to external stimuli, without
direct intervention by a programmer. These learning algorithms enable us to use artificial
neurons in a way which is radically different to conventional logic gates. Instead of explicitly
laying out a circuit of NAND and other gates, our neural networks can simply learn to solve
problems, sometimes problems where it would be extremely difficult to directly design a
conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural
network? Suppose we have a network of perceptrons that we'd like to use to learn to solve
some problem. For example, the inputs to the network might be the raw pixel data from a
scanned, handwritten image of a digit. And we'd like the network to learn weights and biases
so that the output from the network correctly classifies the digit. To see how learning might
work, suppose we make a small change in some weight (or bias) in the network. What we'd
like is for this small change in weight to cause only a small corresponding change in the
output from the network. As we'll see in a moment, this property will make learning
possible. Schematically, here's what we want (obviously this network is too simple to do
handwriting recognition!):

If it were true that a small change in a weight (or bias) causes only a small change in output,
then we could use this fact to modify the weights and biases to get our network to behave
more in the manner we want. For example, suppose the network was mistakenly classifying
an image as an "8" when it should be a "9". We could figure out how to make a small change
in the weights and biases so the network gets a little closer to classifying the image as a "9".
And then we'd repeat this, changing the weights and biases over and over to produce better
and better output. The network would be learning.

The problem is that this isn't what happens when our network contains perceptrons. In fact,
a small change in the weights or bias of any single perceptron in the network can sometimes
cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then
cause the behaviour of the rest of the network to completely change in some very
complicated way. So while your "9" might now be classified correctly, the behaviour of the
network on all the other images is likely to have completely changed in some hard-to-control
way. That makes it difficult to see how to gradually modify the weights and biases so that the
network gets closer to the desired behaviour. Perhaps there's some clever way of getting
around this problem. But it's not immediately obvious how we can get a network of
perceptrons to learn.

We can overcome this problem by introducing a new type of artificial neuron called
a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small
changes in their weights and bias cause only a small change in their output. That's the
crucial fact which will allow a network of sigmoid neurons to learn.

Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we
depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, x1, x2, …. But instead of being
just 0 or 1, these inputs can also take on any values between 0 and 1. So, for
instance, 0.638… is a valid input for a sigmoid neuron. Also just like a perceptron,
the sigmoid neuron has weights for each input, w1, w2, …, and an overall bias, b.
But the output is not 0 or 1. Instead, it's σ(w⋅x + b), where σ is called the sigmoid
function (incidentally, σ is sometimes called the logistic function, and this new class of
neurons called logistic neurons; it's useful to remember this terminology, since these terms
are used by many people working with neural nets, but we'll stick with the sigmoid
terminology), and is defined by:

σ(z) ≡ 1 / (1 + e^(−z)).      (3)

To put it all a little more explicitly, the output of a sigmoid neuron with
inputs x1, x2, …, weights w1, w2, …, and bias b is

1 / (1 + exp(−∑j wj xj − b)).      (4)
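
For readers who prefer code to algebra, here is a minimal Python sketch of equations (3) and (4); the names sigmoid and sigmoid_neuron_output are just illustrative.

```python
import math

def sigmoid(z):
    """Equation (3): the sigmoid (logistic) function."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron_output(x, w, b):
    """Equation (4): output of a sigmoid neuron with inputs x, weights w, and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(sigmoid(0))                                  # 0.5
print(sigmoid_neuron_output([1, 0], [6, 2], -5))   # about 0.73, since z = 1
```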

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of
the sigmoid function may seem opaque and forbidding if you're not already familiar with it.
In fact, there are many similarities between perceptrons and sigmoid neurons, and the
algebraic form of the sigmoid function turns out to be more of a technical detail than a true
barrier to understanding.

To understand the similarity to the perceptron model, suppose z≡w⋅x+bz≡w⋅x+b is a large


positive number. Then e−z≈0e−z≈0 and so σ(z)≈1σ(z)≈1. In other words,
when z=w⋅x+bz=w⋅x+b is large and positive, the output from the sigmoid neuron is
approximately 11, just as it would have been for a perceptron. Suppose on the other hand
that z=w⋅x+bz=w⋅x+b is very negative. Then e−z→∞e−z→∞, and σ(z)≈0σ(z)≈0. So
when z=w⋅x+bz=w⋅x+b is very negative, the behaviour of a sigmoid neuron also closely
approximates a perceptron. It's only when w⋅x+bw⋅x+b is of modest size that there's much
deviation from the perceptron model.

What about the algebraic form of σ? How can we understand that? In fact, the exact form
of σ isn't so important - what really matters is the shape of the function when plotted.
Here's the shape:

[Figure: plot of the sigmoid function against z, for z from −4 to 4 and output from 0.0 to 1.0]

This shape is a smoothed out version of a step function:

[Figure: plot of the step function against z, for z from −4 to 4 and output from 0.0 to 1.0]

If σ had in fact been a step function, then the sigmoid neuron would be a perceptron, since
the output would be 1 or 0 depending on whether w⋅x + b was positive or negative
(actually, when w⋅x + b = 0 the perceptron outputs 0, while the step function outputs 1,
so strictly speaking we'd need to modify the step function at that one point - but you get
the idea). By using the actual σ function we get, as already implied above, a smoothed out
perceptron. Indeed, it's the smoothness of the σ function that is the crucial fact, not its
detailed form. The smoothness of σ means that small changes Δwj in the weights
and Δb in the bias will produce a small change Δoutput in the output from
the neuron. In fact, calculus tells us that Δoutput is well approximated by

Δoutput ≈ ∑j (∂output/∂wj) Δwj + (∂output/∂b) Δb,      (5)

where the sum is over all the weights, wj, and ∂output/∂wj and ∂output/∂b denote
partial derivatives of the output with respect to wj and b, respectively. Don't panic if you're not
comfortable with partial derivatives! While the expression above looks complicated, with all the
partial derivatives, it's actually saying something very simple (and which is very good news):
Δoutput is a linear function of the changes Δwj and Δb in the weights and bias. This linearity
makes it easy to choose small changes in the weights and biases to achieve any desired small change
in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons,
they make it much easier to figure out how changing the weights and biases will change the output.
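
A quick numeric check of the approximation in equation (5), using arbitrarily chosen weights and inputs that are not from the text: the derivative of σ is σ(z)(1 − σ(z)), so ∂output/∂wj = σ′(z)·xj and ∂output/∂b = σ′(z). The sketch compares the predicted Δoutput with the actual change for small Δwj and Δb.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Arbitrary example values (not from the text), just to illustrate equation (5).
x, w, b = [0.5, 0.8], [1.2, -0.7], 0.3
dw, db = [0.01, -0.02], 0.005           # small changes in the weights and bias

z = sum(wi * xi for wi, xi in zip(w, x)) + b
sprime = sigmoid(z) * (1 - sigmoid(z))  # derivative of the sigmoid at z

# Equation (5): predicted change, using partial derivatives sprime*xj and sprime.
predicted = sum(sprime * xi * dwi for xi, dwi in zip(x, dw)) + sprime * db

# Actual change in the neuron's output.
z_new = sum((wi + dwi) * xi for wi, dwi, xi in zip(w, dw, x)) + (b + db)
actual = sigmoid(z_new) - sigmoid(z)

print(predicted, actual)  # the two values agree to several decimal places
```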

If it's the shape of σ which really matters, and not its exact form, then why use the
particular form used for σ in Equation (3)? In fact, later in the book we will occasionally
consider neurons where the output is f(w⋅x + b) for some other activation
function f(⋅). The main thing that changes when we use a different activation function is
that the particular values for the partial derivatives in Equation (5) change. It turns out that
when we compute those partial derivatives later, using σ will simplify the algebra, simply
because exponentials have lovely properties when differentiated. In any case, σ is
commonly used in work on neural nets, and is the activation function we'll use most often in
this book.

How should we interpret the output from a sigmoid neuron? Obviously, one big difference
between perceptrons and sigmoid neurons is that sigmoid neurons don't just
output 0 or 1. They can have as output any real number between 0 and 1, so values such
as 0.173… and 0.689… are legitimate outputs. This can be useful, for example,
if we want to use the output value to represent the average intensity of the pixels in an image
input to a neural network. But sometimes it can be a nuisance. Suppose we want the output
from the network to indicate either "the input image is a 9" or "the input image is not a 9".
Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in
practice we can set up a convention to deal with this, for example, by deciding to interpret
any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating
"not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause
any confusion.

Exercises

• Sigmoid neurons simulating perceptrons, part I


Suppose we take all the weights and biases in a network of perceptrons, and multiply
them by a positive constant, c > 0. Show that the behaviour of the network doesn't
change.
• Sigmoid neurons simulating perceptrons, part II
Suppose we have the same setup as the last problem - a network of perceptrons.
Suppose also that the overall input to the network of perceptrons has been chosen.
We won't need the actual input value, we just need the input to have been fixed.
Suppose the weights and biases are such that w⋅x + b ≠ 0 for the input x to any
particular perceptron in the network. Now replace all the perceptrons in the network
by sigmoid neurons, and multiply the weights and biases by a positive
constant c > 0. Show that in the limit as c → ∞ the behaviour of this network of
sigmoid neurons is exactly the same as the network of perceptrons. How can this fail
when w⋅x + b = 0 for one of the perceptrons?
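
Without giving away the proofs, the second exercise can be explored numerically: as c grows, σ(c·(w⋅x + b)) is pushed towards 0 or 1 whenever w⋅x + b ≠ 0, but stays at 0.5 when w⋅x + b = 0. The particular values of w⋅x + b below (0.2, −0.2, 0.0) are arbitrary illustrations.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

for c in (1, 10, 100, 1000):
    print(c, sigmoid(c * 0.2), sigmoid(c * -0.2), sigmoid(c * 0.0))
# As c grows, sigmoid(c*0.2) -> 1 and sigmoid(c*-0.2) -> 0 (perceptron-like behaviour),
# while sigmoid(c*0.0) stays at 0.5 -- which is why the limit argument fails when w.x+b = 0.
```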

The architecture of neural networks

In the next section I'll introduce a neural network that can do a pretty good job classifying
handwritten digits. In preparation for that, it helps to explain some terminology that lets us
name different parts of a network. Suppose we have the network:

As mentioned earlier, the leftmost layer in this network is called the input layer, and the
neurons within the layer are called input neurons. The rightmost or output layer contains
the output neurons, or, as in this case, a single output neuron. The middle layer is called
a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term
"hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it
must have some deep philosophical or mathematical significance - but it really means
nothing more than "not an input or an output". The network above has just a single hidden
layer, but some networks have multiple hidden layers. For example, the following four-layer
network has two hidden layers:

Somewhat confusingly, and for historical reasons, such multiple layer networks are
sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid
neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I
think it's confusing, but wanted to warn you of its existence.

The design of the input and output layers in a network is often straightforward. For example,
suppose we're trying to determine whether a handwritten image depicts a "9" or not. A
natural way to design the network is to encode the intensities of the image pixels into the
input neurons. If the image is a 6464 by 6464 greyscale image, then we'd
have 4,096=64×644,096=64×64 input neurons, with the intensities scaled appropriately
between 00 and 11. The output layer will contain just a single neuron, with output values of
less than 0.50.5 indicating "input image is not a 9", and values greater than 0.50.5 indicating
"input image is a 9 ".

While the design of the input and output layers of a neural network is often straightforward,
there can be quite an art to the design of the hidden layers. In particular, it's not possible to
sum up the design process for the hidden layers with a few simple rules of thumb. Instead,
neural networks researchers have developed many design heuristics for the hidden layers,
which help people get the behaviour they want out of their nets. For example, such heuristics
can be used to help determine how to trade off the number of hidden layers against the time
required to train the network. We'll meet several such design heuristics later in this book.

Up to now, we've been discussing neural networks where the output from one layer is used
as input to the next layer. Such networks are called feedforward neural networks. This
means there are no loops in the network - information is always fed forward, never fed back.
If we did have loops, we'd end up with situations where the input to the σ function
depended on the output. That'd be hard to make sense of, and so we don't allow such loops.

However, there are other models of artificial neural networks in which feedback loops are
possible. These models are called recurrent neural networks. The idea in these models is to
have neurons which fire for some limited duration of time, before becoming quiescent. That
firing can stimulate other neurons, which may fire a little while later, also for a limited
duration. That causes still more neurons to fire, and so over time we get a cascade of
neurons firing. Loops don't cause problems in such a model, since a neuron's output only
affects its input at some later time, not instantaneously.

Recurrent neural nets have been less influential than feedforward networks, in part because
the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent
networks are still extremely interesting. They're much closer in spirit to how our brains work
than feedforward networks. And it's possible that recurrent networks can solve important
problems which can only be solved with great difficulty by feedforward networks. However,
to limit our scope, in this book we're going to concentrate on the more widely-used
feedforward networks.

A simple network to classify handwritten digits

Having defined neural networks, let's return to handwriting recognition. We can split the
problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of
breaking an image containing many digits into a sequence of separate images, each
containing a single digit. For example, we'd like to break the image

into six separate images,

We humans solve this segmentation problem with ease, but it's challenging for a computer
program to correctly break up the image. Once the image has been segmented, the program
then needs to classify each individual digit. So, for instance, we'd like our program to
recognize that the first digit above,

is a 5.

We'll focus on writing a program to solve the second problem, that is, classifying individual
digits. We do this because it turns out that the segmentation problem is not so difficult to
solve, once you have a good way of classifying individual digits. There are many approaches
to solving the segmentation problem. One approach is to trial many different ways of
segmenting the image, using the individual digit classifier to score each trial segmentation. A
trial segmentation gets a high score if the individual digit classifier is confident of its
classification in all segments, and a low score if the classifier is having a lot of trouble in one
or more segments. The idea is that if the classifier is having trouble somewhere, then it's
probably having trouble because the segmentation has been chosen incorrectly. This idea
and other variations can be used to solve the segmentation problem quite well. So instead of
worrying about segmentation we'll concentrate on developing a neural network which can
solve the more interesting and difficult problem, namely, recognizing individual handwritten
digits.

To recognize individual digits we will use a three-layer neural network:



The input layer of the network contains neurons encoding the values of the input pixels. As
discussed in the next section, our training data for the network will consist of
many 28 by 28 pixel images of scanned handwritten digits, and so the input layer
contains 784 = 28 × 28 neurons. For simplicity I've omitted most of
the 784 input neurons in the diagram above. The input pixels are greyscale, with a value
of 0.0 representing white, a value of 1.0 representing black, and in between values
representing gradually darkening shades of grey.

The second layer of the network is a hidden layer. We denote the number of neurons in this
hidden layer by n, and we'll experiment with different values for n. The example shown
illustrates a small hidden layer, containing just n = 15 neurons.

The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an
output ≈ 1, then that will indicate that the network thinks the digit is a 0. If the second
neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A
little more precisely, we number the output neurons from 0 through 9, and figure out
which neuron has the highest activation value. If that neuron is, say, neuron number 6,
then our network will guess that the input digit was a 6. And so on for the other output
neurons.
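
To make the architecture concrete, here is a hedged sketch of a single forward pass through a 784-15-10 network with randomly initialised weights (real weights would of course be learned), ending with the "highest activation wins" rule just described. All names and the random seed are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Randomly initialised parameters for a 784-15-10 network (untrained, for illustration only).
W1, b1 = rng.standard_normal((15, 784)), rng.standard_normal(15)
W2, b2 = rng.standard_normal((10, 15)), rng.standard_normal(10)

def feedforward(x):
    """One forward pass: input pixels -> hidden layer of 15 neurons -> 10 output neurons."""
    hidden = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

x = rng.random(784)        # stand-in for a 28x28 image flattened to 784 greyscale values
output = feedforward(x)
print(np.argmax(output))   # the network's guess: the output neuron with the highest activation
```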

You might wonder why we use 10 output neurons. After all, the goal of the network is to
tell us which digit (0, 1, 2, …, 9) corresponds to the input image. A seemingly natural
way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary
value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are
enough to encode the answer, since 2^4 = 16 is more than the 10 possible values for the
input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The
ultimate justification is empirical: we can try out both network designs, and it turns out that,
for this particular problem, the network with 10 output neurons learns to recognize digits
better than the network with 4 output neurons. But that leaves us
wondering why using 10 output neurons works better. Is there some heuristic that would
tell us in advance that we should use the 10-output encoding instead of the 4-output
encoding?

To understand why we do this, it helps to think about what the neural network is doing from
first principles. Consider first the case where we use 10 output neurons. Let's concentrate
on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It
does this by weighing up evidence from the hidden layer of neurons. What are those hidden
neurons doing? Well, just suppose for the sake of argument that the first neuron in the
hidden layer detects whether or not an image like the following is present:

It can do this by heavily weighting input pixels which overlap with the image, and only
lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument
that the second, third, and fourth neurons in the hidden layer detect whether or not the
following images are present:

As you may have guessed, these four images together make up the 0 image that we saw in
the line of digits shown earlier:

So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of
course, that's not the only sort of evidence we can use to conclude that the image was a 0 -
we could legitimately get a 0 in many other ways (say, through translations of the above
images, or slight distortions). But it seems safe to say that at least in this case we'd conclude
that the input was a 0.

Supposing the neural network functions in this way, we can give a plausible explanation for
why it's better to have 10 outputs from the network, rather than 4. If we had 4 outputs,
then the first output neuron would be trying to decide what
the most significant bit of the digit was. And there's no easy way to relate that most
significant bit to simple shapes like those shown above. It's hard to imagine that there's any
good historical reason the component shapes of the digit will be closely related to (say) the
most significant bit in the output.

Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural
network has to operate in the way I described, with the hidden neurons detecting simple
component shapes. Maybe a clever learning algorithm will find some assignment of weights
that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described
works pretty well, and can save you a lot of time in designing good neural network
architectures.

*********************** END ***********************
