Contents
2. Charts
3. Nonparametric Tests
7. Discriminant Analysis
8. Logistic Regression
9. MANOVA
Pre-requisite: Statistical Analysis

Session Plan

Session 1
Topic: Introduction to multivariate data analysis, dependence and interdependence techniques; introduction to SPSS; two-sample tests; one-way ANOVA; n-way ANOVA using SPSS and interpretation of the SPSS output; discussion of the assumptions of these tests.
Pre-read/Class Activity: Chapter 11, Chapter 12 and Appendix III (IBM SPSS) of Business Research Methodology by Srivastava & Rego.
References
1. Srivastava & Rego, Statistics for Management, 3rd Edition, McGraw-Hill, 2017.
2. Johnson & Wichern, Applied Multivariate Statistical Analysis, 6th Edition, Pearson Education India, 2015.
3. Naresh Malhotra, Marketing Research, 7th Edition, Pearson Education India, 2015.
4. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2018.
5. Dinesh Kumar, Business Analytics: The Science of Data-Driven Decision Making, Wiley, 2017.
Hypothesis Testing: Univariate Techniques

Parametric tests
- Independent samples: t test, Z test
- Dependent samples: paired t test

Non-parametric tests
- Independent samples: chi-square, Mann-Whitney, median, Kolmogorov-Smirnov (K-S)
- Dependent samples: sign, Wilcoxon, chi-square, McNemar
Multivariate Techniques
- Dependence techniques: classified by whether the dependent variable is metric or non-metric (e.g., binary)
- Interdependence techniques
NON-PARAMETRIC TESTS
Contents
1. Relevance: advantages and disadvantages
2. Tests for
   a. Randomness of a series of observations: run test
   b. Specified mean or median of a population: signed rank test
   c. Goodness of fit of a distribution: Kolmogorov-Smirnov test
   d. Comparing two populations: Kolmogorov-Smirnov test
   e. Equality of two means: Mann-Whitney (U) test
   f. Equality of several means: Wilcoxon-Wilcox test; Kruskal-Wallis rank sum (H) test; Friedman's (F) test (two-way ANOVA)
   g. Rank correlation: Spearman's rank correlation; testing equality of several rank correlations; Kendall's rank correlation coefficient
   h. Sign test
While parametric tests concern parameters such as the mean, standard deviation and correlation coefficient, non-parametric tests, also called distribution-free tests, can also be used to test other features, such as randomness, independence, association and rank correlation.
In general, we resort to non-parametric tests when:
- The assumption of normality for the variable under consideration, or some other assumption of a parametric test, is not valid or is doubtful.
- The hypothesis to be tested does not relate to a parameter of a population.
- The numerical accuracy of the collected data is not fully assured.
- Results are required quickly through simple calculations.
However, non-parametric tests have the following limitations or disadvantages:
- They ignore a certain amount of information.
- They are often not as efficient or reliable as parametric tests.
These advantages and disadvantages are consistent with a general premise in statistics: a method that is easier to calculate usually does not utilize the full information contained in a sample and is therefore less reliable. The use of non-parametric tests thus involves a trade-off: some efficiency or reliability is lost, but the ability to work with less information and to calculate faster is gained.
There are many such tests in the statistical literature; however, we discuss only the following tests.
Introduction
This section shows how to perform a number of statistical tests using SPSS. Each part gives a brief description of the aim of the statistical test and when it is used, followed by an example showing the SPSS commands and the (often abbreviated) SPSS output with a brief interpretation. In deciding which test is appropriate, it is important to consider the type of variables that you have, i.e., whether your variables are categorical, ordinal or interval, and whether they are normally distributed.
Most of the examples in this section use a data file called hsb2 ("high school and beyond"). This data file contains 200 observations from a sample of high school students, with demographic information about the students such as their gender (female), socio-economic status (ses) and ethnic background (race). It also contains a number of scores on standardized tests, including reading (read), writing (write), mathematics (math) and social studies (socst).
One sample t-test
A one sample t-test allows us to test whether a sample mean (of a normally distributed interval
variable) significantly differs from a hypothesized value. For example, using the hsb2 data file, say
we wish to test whether the average writing score (write) differs significantly from 50. We can do
this as shown below.
t-test
/testval = 50
/variable = write.
The mean of the variable write for this particular sample of students is 52.775, which is statistically
significantly different from the test value of 50. We would conclude that this group of students has a
significantly higher mean on the writing test than 50.
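Outside SPSS, the same one sample t-test can be sketched in Python with scipy; the scores below are made up for illustration, since the hsb2 data are not reproduced here.

```python
from scipy import stats

# Hypothetical stand-in for the hsb2 'write' scores (illustration only)
write = [52, 59, 33, 44, 52, 52, 59, 46, 57, 55, 46, 65]

# Test whether the sample mean differs from the hypothesized value of 50
t_stat, p_value = stats.ttest_1samp(write, popmean=50)
print(t_stat, p_value)
```

The interpretation is the same as for the SPSS output: a small p-value indicates the sample mean differs significantly from the test value.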
One sample median test
A one sample median test allows us to test whether a sample median differs significantly from a
hypothesized value. We will use the same variable, write, as we did in the one sample t-test example
above, but we do not need to assume that it is interval and normally distributed (we only need to
assume that write is an ordinal variable). However, we are unaware of how to perform this test in
SPSS.
Binomial test
A one sample binomial test allows us to test whether the proportion of successes on a two-level
categorical dependent variable significantly differs from a hypothesized value. For example, using
the hsb2 data file, say we wish to test whether the proportion of females (female) differs significantly
from 50%, i.e., from .5. We can do this as shown below.
npar tests
/binomial (.5) = female.
The results indicate that there is no statistically significant difference (p = .229). In other words, the
proportion of females in this sample does not significantly differ from the hypothesized value of 50%.
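The same binomial test can be sketched in Python with scipy; the counts below are hypothetical, as the actual female/male split in hsb2 is not shown above.

```python
from scipy import stats

# Hypothetical count of females among 200 students (illustration only)
females, n = 91, 200

# Exact binomial test of the proportion against the hypothesized value 0.5
result = stats.binomtest(females, n=n, p=0.5)
print(result.pvalue)
```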
Chi-square goodness of fit
A chi-square goodness of fit test allows us to test whether the observed proportions for a categorical
variable differ from hypothesized proportions. For example, let's suppose that we believe that the
general population consists of 10% Hispanic, 10% Asian, 10% African American and 70% White
folks. We want to test whether the observed proportions from our sample differ significantly from
these hypothesized proportions.
npar test
/chisquare = race
/expected = 10 10 10 70.
These results show that racial composition in our sample does not differ significantly from the
hypothesized values that we supplied (chi-square with three degrees of freedom = 5.029, p = .170).
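The arithmetic of this goodness of fit test can be reproduced in Python with scipy; the counts below are chosen to be consistent with the chi-square of 5.029 reported above, but they are an assumption, not output copied from SPSS.

```python
from scipy import stats

# Race counts consistent with the chi-square of 5.029 reported above
observed = [24, 11, 20, 145]               # Hispanic, Asian, African American, White
n = sum(observed)                          # 200 students
expected = [0.10 * n, 0.10 * n, 0.10 * n, 0.70 * n]

chi2, p = stats.chisquare(observed, f_exp=expected)
print(round(chi2, 3), round(p, 3))
```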
Independent samples t-test
An independent samples t-test is used when you want to compare the means of a normally distributed
interval dependent variable for two independent groups. For example, using the hsb2 data file, say
we wish to test whether the mean for write is the same for males and females.
The results indicate that there is a statistically significant difference between the mean writing scores for males and females (t = -3.734, p < .001). In other words, females have a statistically significantly higher mean score on writing (54.99) than males (50.12).
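An independent samples t-test can likewise be sketched in Python with scipy; the two small groups below are made up for illustration.

```python
from scipy import stats

# Made-up writing scores for two independent groups (illustration only)
males   = [50, 48, 55, 43, 52, 47]
females = [57, 54, 60, 52, 56, 58]

# Compare the two group means
t_stat, p_value = stats.ttest_ind(males, females)
print(t_stat, p_value)
```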
Wilcoxon-Mann-Whitney test
The Wilcoxon-Mann-Whitney test is a non-parametric analog to the independent samples t-test and
can be used when you do not assume that the dependent variable is a normally distributed interval
variable (you only assume that the variable is at least ordinal). You will notice that the SPSS syntax
for the Wilcoxon-Mann-Whitney test is almost identical to that of the independent samples t-test. We
will use the same data file (the hsb2 data file) and the same variables in this example as we did in the
independent t-test example above and will not assume that write, our dependent variable, is normally
distributed.
npar test
/m-w = write by female(0 1).
The results suggest that there is a statistically significant difference between the underlying
distributions of the write scores of males and the write scores of females (z = -3.329, p = 0.001).
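A parallel sketch in Python uses scipy's mannwhitneyu; the scores below are made up, and only ordinality of the dependent variable is assumed.

```python
from scipy import stats

# Made-up writing scores for two independent groups (illustration only)
males   = [50, 48, 55, 43, 52, 47]
females = [57, 54, 60, 52, 56, 58]

# Rank-based comparison of the two groups
u_stat, p_value = stats.mannwhitneyu(males, females, alternative='two-sided')
print(u_stat, p_value)
```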
Chi-square test
A chi-square test is used when you want to see if there is a relationship between two categorical
variables. In SPSS, the chisq option is used on the statistics subcommand of the crosstabs command
to obtain the test statistic and its associated p-value. Using the hsb2 data file, let's see if there is a
relationship between the type of school attended (schtyp) and students' gender (female). Remember
that the chi-square test assumes that the expected value for each cell is five or higher. This
assumption is easily met in the examples below. However, if this assumption is not met in your data,
please see the section on Fisher's exact test below.
crosstabs
/tables = schtyp by female
/statistic = chisq.
These results indicate that there is no statistically significant relationship between the type of school
attended and gender (chi-square with one degree of freedom = 0.047, p = 0.828).
Let's look at another example, this time examining the relationship between gender (female)
and socio-economic status (ses). The point of this example is that one (or both) variables may have
more than two levels, and that the variables do not have to have the same number of levels. In this
example, female has two levels (male and female) and ses has three levels (low, medium and high).
crosstabs
/tables = female by ses
/statistic = chisq.
Again we find that there is no statistically significant relationship between the variables (chi-square
with two degrees of freedom = 4.577, p = 0.101).
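For reference, the first (schtyp by gender) computation can be reproduced in Python with scipy's chi2_contingency; the counts below are hypothetical but chosen to be consistent with the chi-square of 0.047 reported above.

```python
from scipy.stats import chi2_contingency

# Hypothetical schtyp-by-gender counts consistent with the output described above
table = [[77, 91],    # public school:  male, female
         [14, 18]]    # private school: male, female

# correction=False gives the plain Pearson chi-square, as SPSS reports it
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(round(chi2, 3), round(p, 3), dof)
```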
Fisher's exact test
Fisher's exact test is used when you want to conduct a chi-square test but one or more of your
cells has an expected frequency of five or less. Remember that the chi-square test assumes that each
cell has an expected frequency of five or more, but the Fisher's exact test has no such assumption and
can be used regardless of how small the expected frequency is. In SPSS unless you have the SPSS
Exact Test Module, you can only perform a Fisher's exact test on a 2x2 table, and these results are
presented by default. Please see the results from the chi squared example above.
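A 2x2 Fisher's exact test can also be sketched in Python with scipy; the small counts below are made up to illustrate the situation where expected frequencies are too low for a chi-square test.

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with small cell counts (illustration only)
table = [[7, 9],
         [1, 3]]

# Exact test, valid regardless of how small the expected frequencies are
odds_ratio, p_value = fisher_exact(table)
print(odds_ratio, p_value)
```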
One-way ANOVA
A one-way analysis of variance (ANOVA) is used when you have a categorical independent variable
(with two or more categories) and a normally distributed interval dependent variable and you wish to
test for differences in the means of the dependent variable broken down by the levels of the
independent variable. For example, using the hsb2 data file, say we wish to test whether the mean of
write differs between the three program types (prog). The command for this test would be:
glm write by prog.
The mean of the dependent variable differs significantly among the levels of program type. However,
we do not know if the difference is between only two of the levels or all three of the levels. (The F
test for the Model is the same as the F test for prog because prog was the only variable entered into
the model. If other variables had also been entered, the F test for the Model would have been
different from that for prog.) To see the mean of write for each level of program type, the means of
write can be examined broken down by prog.
From this we can see that the students in the academic program have the highest mean writing score,
while students in the vocational program have the lowest.
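A one-way ANOVA can be sketched in Python with scipy's f_oneway; the three groups of scores below are made up, with the academic group highest and the vocational group lowest, echoing the pattern described above.

```python
from scipy import stats

# Made-up write scores for three program types (illustration only)
general    = [51, 48, 54, 50, 49]
academic   = [58, 62, 55, 60, 57]
vocational = [45, 47, 44, 48, 43]

# Test whether the three group means differ
f_stat, p_value = stats.f_oneway(general, academic, vocational)
print(f_stat, p_value)
```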
Kruskal-Wallis test
The Kruskal-Wallis test is used when you have one independent variable with two or more levels and
an ordinal dependent variable. In other words, it is the non-parametric version of ANOVA and a
generalized form of the Mann-Whitney test, since it permits more than two groups. We will use
the same data file as the one way ANOVA example above (the hsb2 data file) and the same variables
as in the example above, but we will not assume that write is a normally distributed interval variable.
npar tests
/k-w = write by prog (1,3).
If some of the scores receive tied ranks, a correction factor is used, yielding a slightly different
value of chi-square. With or without ties, the results indicate that there is a statistically significant
difference among the three types of programs.
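The same rank-based comparison can be sketched in Python with scipy's kruskal on made-up scores; only ordinality of the dependent variable is assumed.

```python
from scipy import stats

# Made-up write scores for three program types (illustration only)
general    = [51, 48, 54, 50, 49]
academic   = [58, 62, 55, 60, 57]
vocational = [45, 47, 44, 48, 43]

# Rank-based test across the three groups
h_stat, p_value = stats.kruskal(general, academic, vocational)
print(h_stat, p_value)
```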
Paired t-test
A paired (samples) t-test is used when you have two related observations (i.e., two observations per
subject) and you want to see if the means on these two normally distributed interval variables differ
from one another. For example, using the hsb2 data file we will test whether the mean of read is
equal to the mean of write.
These results indicate that the mean of read is not statistically significantly different from the mean
of write (t = -0.867, p = 0.387).
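A paired t-test can be sketched in Python with scipy's ttest_rel; the paired scores below are made up, with two observations per student.

```python
from scipy import stats

# Made-up paired reading and writing scores (illustration only)
read  = [57, 44, 63, 47, 55, 49, 52, 60]
write = [52, 49, 58, 52, 54, 53, 50, 62]

# Test whether the mean of the paired differences is zero
t_stat, p_value = stats.ttest_rel(read, write)
print(t_stat, p_value)
```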
Wilcoxon signed rank sum test
The Wilcoxon signed rank sum test is the non-parametric version of a paired samples t-test. You use
the Wilcoxon signed rank sum test when you do not wish to assume that the difference between the
two variables is interval and normally distributed (but you do assume the difference is ordinal). We
will use the same example as above, but we will not assume that the difference between read and
write is interval and normally distributed.
npar test
/wilcoxon = write with read (paired).
The results suggest that there is not a statistically significant difference between read and write.
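The Wilcoxon signed rank test can be sketched in Python with scipy; the paired scores below are made up, and only the ordinality of the differences is assumed.

```python
from scipy import stats

# Made-up paired scores (illustration only)
read  = [57, 44, 63, 47, 55, 49, 52, 60]
write = [52, 49, 58, 52, 54, 53, 50, 62]

# Signed rank test on the paired differences write - read
w_stat, p_value = stats.wilcoxon(write, read)
print(w_stat, p_value)
```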
Sign test
If you believe the differences between read and write are not ordinal but can merely be classified
as positive and negative, then you may want to consider a sign test in lieu of the signed rank test.
Again, we will use the same variables in this example and assume that this difference is not ordinal.
npar test
/sign = read with write (paired).
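The sign test amounts to a binomial test on the signs of the differences, which is easy to sketch by hand in Python; the paired scores below are made up.

```python
from scipy import stats

# Made-up paired scores (illustration only)
read  = [57, 44, 63, 47, 55, 49, 52, 60]
write = [52, 49, 58, 52, 54, 53, 50, 62]

# Sign test by hand: count positive differences, drop ties, apply a binomial test
diffs = [r - w for r, w in zip(read, write) if r != w]
n_pos = sum(d > 0 for d in diffs)
result = stats.binomtest(n_pos, n=len(diffs), p=0.5)
print(n_pos, len(diffs), result.pvalue)
```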
McNemar test
You would perform McNemar's test if you were interested in the marginal frequencies of two binary
outcomes. These binary outcomes may be the same outcome variable on matched pairs (like a case-
control study) or two outcome variables from a single group. Continuing with the hsb2 dataset used
in several above examples, let us create two binary outcomes in our dataset: himath and hiread.
These outcomes can be considered in a two-way contingency table. The null hypothesis is that the
proportion of students in the himath group is the same as the proportion of students in hiread group
(i.e., that the contingency table is symmetric).
crosstabs
/tables=himath BY hiread
/statistic=mcnemar
/cells=count.
McNemar's chi-square statistic suggests that there is not a statistically significant difference in the
proportion of students in the himath group and the proportion of students in the hiread group.
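McNemar's statistic depends only on the two discordant (off-diagonal) cells, so it can be computed by hand in Python; the counts below are hypothetical, as the actual himath-by-hiread cross-tab from hsb2 is not reproduced above.

```python
from scipy.stats import chi2

# Hypothetical discordant counts for himath by hiread (illustration only)
b, c = 26, 32                                  # the two off-diagonal cells

# McNemar chi-square with continuity correction, referred to chi-square with 1 df
stat = (abs(b - c) - 1) ** 2 / (b + c)
p_value = chi2.sf(stat, df=1)
print(round(stat, 3), round(p_value, 3))
```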
One-way repeated measures ANOVA
You would perform a one-way repeated measures analysis of variance if you had one categorical
independent variable and a normally distributed interval dependent variable that was repeated at least
twice for each subject. This is the equivalent of the paired samples t-test, but allows for two or more
levels of the categorical variable. This tests whether the mean of the dependent variable differs by the
categorical variable. We have an example data set called rb4wide, which is used in Kirk's book
Experimental Design. In this data set, y is the dependent variable, a is the repeated measure and s is
the variable that indicates the subject number.
glm y1 y2 y3 y4
/wsfactor a(4).
You will notice that this output gives four different p-values. The output labeled "sphericity
assumed" is the p-value (0.000) that you would get if you assumed compound symmetry in the
variance-covariance matrix. Because that assumption is often not valid, the three other p-values offer
various corrections (the Huynh-Feldt, H-F, Greenhouse-Geisser, G-G and Lower-bound). No matter
which p-value you use, our results indicate that we have a statistically significant effect of a at the .05
level.
Factorial ANOVA
A factorial ANOVA has two or more categorical independent variables (either with or without the
interactions) and a single normally distributed interval dependent variable. For example, using the
hsb2 data file we will look at writing scores (write) as the dependent variable and gender (female)
and socio-economic status (ses) as independent variables, and we will include an interaction of
female by ses. Note that in SPSS, you do not need to have the interaction term(s) in your data set.
Rather, you can have SPSS create it/them temporarily by placing an asterisk between the variables
that will make up the interaction term(s).
These results indicate that the overall model is statistically significant (F = 5.666, p < .001). The
variables female and ses are also statistically significant (F = 16.595, p < .001 and F = 6.611, p =
0.002, respectively). However, the interaction between female and ses is not statistically significant
(F = 0.133, p = 0.875).
Friedman test
You perform a Friedman test when you have one within-subjects independent variable with two or
more levels and a dependent variable that is not interval and normally distributed (but at least
ordinal). We will use this test to determine if there is a difference in the reading, writing and math
scores. The null hypothesis in this test is that the distributions of the ranks of each type of score (i.e.,
reading, writing and math) are the same. To conduct a Friedman test, the data need to be in a long
format. SPSS handles this for you, but in other statistical packages you will have to reshape the data
before you can conduct this test.
npar tests
/friedman = read write math.
Friedman's chi-square has a value of 0.645 and a p-value of 0.724 and is not statistically significant.
Hence, there is no evidence that the distributions of the three types of scores are different.
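A Friedman test can be sketched in Python with scipy by passing one list per repeated measure; the scores below are made up for six students.

```python
from scipy import stats

# Made-up read, write and math scores for six students (illustration only)
read  = [57, 44, 63, 47, 55, 49]
write = [52, 49, 58, 52, 54, 53]
math  = [54, 46, 60, 50, 52, 51]

# Rank-based test that the three within-subject distributions are the same
chi2_stat, p_value = stats.friedmanchisquare(read, write, math)
print(chi2_stat, p_value)
```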
Correlation
A correlation is useful when you want to see the relationship between two (or more) normally
distributed interval variables. For example, using the hsb2 data file we can run a correlation between
two continuous variables, read and write.
correlations
/variables = read write.
In the second example, we will run a correlation between a dichotomous variable, female, and a
continuous variable, write. Although it is assumed that the variables are interval and normally
distributed, we can include dummy variables when performing correlations.
correlations
/variables = female write.
In the first example above, we see that the correlation between read and write is 0.597. By squaring
the correlation and then multiplying by 100, you can determine the percentage of variability that is
shared. Rounding 0.597 to 0.6, squaring gives 0.36, and multiplying by 100 gives 36%; hence read
shares about 36% of its variability with write. In the output for the second example, we see that the
correlation between write and female is 0.256. Squaring this number yields 0.0655, meaning that
female shares approximately 6.5% of its variability with write.
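The correlation and the shared-variability computation can be sketched in Python with scipy; the scores below are made up.

```python
from scipy import stats

# Made-up reading and writing scores (illustration only)
read  = [57, 44, 63, 47, 55, 49, 52, 60]
write = [52, 49, 58, 52, 54, 53, 50, 62]

# Pearson correlation, then r-squared as a percentage of shared variability
r, p_value = stats.pearsonr(read, write)
shared = r ** 2 * 100
print(round(r, 3), round(shared, 1))
```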
Simple linear regression
Simple linear regression allows us to look at the linear relationship between one normally distributed
interval predictor and one normally distributed interval outcome variable. For example, using the
hsb2 data file, say we wish to look at the relationship between writing scores (write) and reading
scores (read); in other words, predicting write from read.
We see that the relationship between write and read is positive (.552) and based on the t-value
(10.47) and p-value (0.000), we would conclude this relationship is statistically significant. Hence,
we would say there is a statistically significant positive linear relationship between reading and
writing.
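A simple linear regression can be sketched in Python with scipy's linregress; the scores below are made up, predicting write from read.

```python
from scipy import stats

# Made-up scores: predicting write from read (illustration only)
read  = [57, 44, 63, 47, 55, 49, 52, 60]
write = [52, 49, 58, 52, 54, 53, 50, 62]

# Least-squares line write = intercept + slope * read
result = stats.linregress(read, write)
print(result.slope, result.intercept, result.pvalue)
```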
Non-parametric correlation
A Spearman correlation is used when one or both of the variables are not assumed to be normally
distributed and interval (but are assumed to be ordinal). The values of the variables are converted into
ranks and then correlated. In our example, we will look for a relationship between read and write.
We will not assume that both of these variables are normal and interval.
nonpar corr
/variables = read write
/print = spearman.
The results suggest that the relationship between read and write (rho = 0.617, p = 0.000) is
statistically significant.
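A Spearman correlation can be sketched in Python with scipy; the scores below are made up, and the values are rank-transformed internally before correlating.

```python
from scipy import stats

# Made-up scores; only ordinality is assumed (illustration only)
read  = [57, 44, 63, 47, 55, 49, 52, 60]
write = [52, 49, 58, 52, 54, 53, 50, 62]

# Correlation of the ranks of the two variables
rho, p_value = stats.spearmanr(read, write)
print(rho, p_value)
```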
Before describing these techniques in detail, we provide a brief description of each, together with its
relevance and uses, in the table below. This is aimed at providing motivation for learning these
techniques and generating confidence in using SPSS for arriving at final conclusions/solutions in a
research study. The contents of the table will be fully comprehended after reading about all the
techniques.
Statistical Techniques, Their Relevance and Uses for Designing and Marketing of Products and Services

Principal Component Analysis (PCA)
A technique for forming a set of new variables that are linear combinations of the original set of variables and are uncorrelated. The new variables are called principal components. These variables are fewer in number than the original variables, but they extract most of the information provided by the original variables.
Relevance and uses: One could identify several financial parameters, and ratios exceeding ten, for determining the financial health of a company. Obviously, it would be extremely taxing to interpret all such pieces of information for assessing the financial health of a company. However, the task could be much simpler if these parameters and ratios could be reduced to a few indices, say two or three, which are linear combinations of the original parameters and ratios.
Likewise, a multiple regression model may be derived to forecast a parameter like sales, profit or price. However, the variables under consideration could be correlated among themselves, indicating multicollinearity in the data. This could lead to misleading interpretation of the regression coefficients as well as an increase in the standard errors of the estimates of the parameters. It would be very useful if new uncorrelated variables could be formed which are linear combinations of the original variables. These new variables could then be used for developing the regression model, for appropriate interpretation and better forecasts.

Factor Analysis
Identifies the smallest number of common factors that best explain or account for most of the correlation among the indicators. For example, the intelligence quotient of a student might explain most of the marks obtained in Mathematics, Physics, Statistics, etc. As yet another example, when two variables x and y are highly correlated, only one of them could be used to represent the entire data.
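The idea of principal components can be sketched numerically in Python with numpy; the data below are made-up values of three highly correlated financial ratios, so the first component captures nearly all the variation.

```python
import numpy as np

# Made-up values of three highly correlated financial ratios for six companies
X = np.array([[2.0,  4.1, 1.0],
              [3.0,  6.2, 1.5],
              [4.0,  7.9, 2.1],
              [5.0, 10.1, 2.4],
              [6.0, 11.8, 3.1],
              [7.0, 14.2, 3.4]])

Xc = X - X.mean(axis=0)                    # centre each variable
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                         # the principal components (uncorrelated)
var_share = s ** 2 / np.sum(s ** 2)        # proportion of variance per component
print(np.round(var_share, 3))
```

Because the three ratios move almost in lockstep, one principal component summarizes nearly all the information in the original three variables.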
Conjoint Analysis
Involves determining the contribution of variables (each with several levels) to the choice preference among combinations of variables that represent realistic choice sets (products, concepts, services, companies, etc.).
Relevance and uses: Useful for analyzing consumer responses and using them for designing products and services. Helps in determining the contributions of the predictor variables and their respective levels to the desirability of the combinations of variables. For example, how much does the quality of food contribute to the continued loyalty of a traveller to an airline? Which type of food is liked most?
Multidimensional Scaling (MDS)
A set of procedures for drawing pictures of data so as to visualise and clarify the relationships described by the data. The requisite data are typically collected by having respondents give simple one-dimensional responses. MDS transforms consumer judgments/perceptions of similarity or preference into, usually, a two-dimensional space.
Relevance and uses: Useful for designing products and services. It helps in illustrating market segments based on indicated preferences, identifying the products and services that are more competitive with each other, and understanding the criteria used by people while judging objects (products, services, companies, advertisements, etc.).
Dependence Techniques
These are techniques that designate some variable(s) as independent and others as dependent. They aim at finding the relationship between these variables and may, in turn, find the effect of the independent variable(s) on the dependent variable(s).
The technique to be used differs with the type of independent/dependent variables. For example, if all the independent and dependent variables are metric (numeric), Multiple Regression Analysis can be used. If the dependent variable is metric and the independent variable(s) are categorical, ANOVA can be used. If the dependent variable is metric and some of the independent variables are metric while others are qualitative, ANCOVA (analysis of covariance) can be used. If the dependent variable is non-metric or categorical, multiple discriminant analysis or logistic regression is used for the analysis.
All the above techniques require a single dependent variable. If there is more than one dependent variable, the techniques used are MANOVA (multivariate analysis of variance) or canonical correlation.
MANOVA is used when there is more than one dependent variable and all independent variables are categorical. If some of the independent variables are categorical and some are metric, MANCOVA (multivariate analysis of covariance) can be used. If there is more than one dependent variable and all dependent and independent variables are metric, the best suited analysis is canonical correlation.
Interdependence Techniques
Interdependence techniques do not designate any variables as independent or dependent, nor do they try to find such a relationship. These techniques can be divided into variable interdependence and inter-object similarity techniques.
The variable interdependence techniques can also be termed data reduction techniques. Factor analysis is an example: it is used when there are many related variables and one wants to reduce the list of variables or find the underlying factors that determine them.
Inter-object similarity is assessed with the help of cluster analysis and multidimensional scaling (MDS).
Brief descriptions of all the above techniques are given in subsequent sections of this Chapter.
The sample comprises 'n' triplets of values of y, x1 and x2, in the following format:
y x1 x2
y1 x11 x21
y2 x12 x22
. . .
yn x1n x2n
The values of the constants b0, b1 and b2 are estimated by the Principle of Least Squares, just as the values
of a and b were found while fitting the equation y = a + b x in Chapter 10 on Simple Correlation and
Regression Analysis. They are calculated from the sample observations/values with the help of the
formulas given below. These formulas and manual calculations are given for illustration only; in real life
the estimates are easily obtained with statistical software in which the formulas are already stored.
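In practice the least-squares estimates come from software; a minimal numpy sketch with made-up data (chosen so the relation holds exactly) shows the idea.

```python
import numpy as np

# Made-up data satisfying y = 1.0 + 0.5*x1 + 0.25*x2 exactly (illustration only)
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 5.0])
y  = 1.0 + 0.5 * x1 + 0.25 * x2

# Design matrix with a column of 1s so the intercept b0 is estimated too
X = np.column_stack([np.ones_like(x1), x1, x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(coef, 3))   # recovers b0, b1, b2
```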
       (Σ yi x1i – n ȳ x̄1)(Σ x2i² – n x̄2²) – (Σ yi x2i – n ȳ x̄2)(Σ x1i x2i – n x̄1 x̄2)
b1 = ----------------------------------------------------------------------------------   …(2)
       (Σ x1i² – n x̄1²)(Σ x2i² – n x̄2²) – (Σ x1i x2i – n x̄1 x̄2)²

The formula for b2 is obtained by interchanging the subscripts 1 and 2 in (2).

b0 = ȳ – b1 x̄1 – b2 x̄2   …(3)
The calculations needed in the above formulas are facilitated by preparing the following table:

      y     x1     x2     y·x1     y·x2     x1·x2     y²     x1²     x2²
      y1    x11    x21    y1x11    y1x21    x11x21    y1²    x11²    x21²
      …     …      …      …        …        …         …      …       …
      yn    x1n    x2n    ynx1n    ynx2n    x1nx2n    yn²    x1n²    x2n²
Sum   Σyi   Σx1i   Σx2i   Σyix1i   Σyix2i   Σx1ix2i   Σyi²   Σx1i²   Σx2i²
The effectiveness or reliability of the relationship thus obtained is judged by the multiple coefficient of
determination, usually denoted by R², which is defined as the ratio of the variation explained by the
regression equation (1) to the total variation of the dependent variable y. Thus,

R² = Explained Variation in y / Total Variation in y   …(4)

R² = 1 – (Unexplained Variation / Total Variation)   …(5)

   = 1 – Σ(yi – ŷi)² / Σ(yi – ȳ)²   …(6)
It may be recalled from Chapter 10 that the total variation in the variable y is equal to the variation
explained by the regression equation plus the variation left unexplained by it. Mathematically, this is
expressed as

Σ(yi – ȳ)² = Σ(ŷi – ȳ)² + Σ(yi – ŷi)²

where yi is the observed value of y, ȳ is the mean of all the yi, and ŷi is the estimate of yi given by the
regression equation (1). Σ(ŷi – ȳ)² is the variation of y explained by the estimates, and Σ(yi – ŷi)² is the
unexplained variation. If every yi is equal to its estimate ŷi, then all the variation is explained and the
unexplained variation is zero. In that case the total variation is fully explained by the regression
equation, and R² is equal to 1.
The square root of R², viz. R, is known as the coefficient of multiple correlation and always lies
between 0 and 1. In fact, R is the correlation between the dependent variable and its estimate
derived from the multiple regression equation, and as such it has to be positive.
All the calculations and interpretations for the multiple regression equation and coefficient of multiple
correlation or determination have been explained with the help of an illustration given below:
Example 1
The owner of a chain of ten stores wishes to forecast net profit with the help of next year’s projected
sales of food and non-food items. The data about current year’s sales of food items, sales of non-food
items as also net profit for all the ten stores are available as follows:
Table 1 Sales of Food and Non-Food Items and Net Profit of a Chain of Stores
In this case, the relationship is expressed by equation (1), reproduced below:
y = b0 + b1 x1 + b2 x2
where y denotes net profit, x1 denotes sales of food items, x2 denotes sales of non-food items,
and b0, b1 and b2 are constants. Their values are obtained from the formulas derived from the
Principle of Least Squares:
The required calculations can be made with the help of the following Table:
(Amounts in Rs. Crores)

  yi       ŷi*      yi – ŷi    (yi – ŷi)²    (yi – ȳ)²
  (1)      (2)      (3)        (4)           (5)
5.6 5.587 0.0127 0.0002 0.0961
4.7 4.607 0.0928 0.0086 1.4641
5.4 5.482 -0.082 0.0067 0.2601
5.5 5.587 -0.087 0.0076 0.1681
5.1 5.09 0.0099 0.0001 0.6561
6.8 6.854 -0.054 0.0029 0.7921
5.8 5.693 0.1075 0.0116 0.0121
8.2 8.121 0.0789 0.0062 5.2441
5.8 5.798 0.0023 0.0000 0.0121
6.2 6.281 -0.081 0.0065 0.0841
Sum 59.1   59.1     0          0.0504        8.789
    (ȳ = 5.91)                 (Unexplained  (Total
                                Variation)    Variation)
* Derived from the earlier fitted equation, y = 0.233 + 0.196 x1 + 0.287 x2
Substituting the respective values in equation (6), we get
R² = 1 – (0.0504 / 8.789) = 1 – 0.0057 = 0.9943
The interpretation of R² = 0.9943 is that 99.43% of the variation in net profit is explained
jointly by the variation in sales of food items and non-food items.
Incidentally, Explained Variation for the above example can be calculated by subtracting unexplained
variation from total variation as 8.789 – 0.0504 = 8.7386
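The R² arithmetic for this example can be checked directly; the two sums of squares below are the values reported above.

```python
# Values taken from the computation above
unexplained_variation = 0.0504
total_variation = 8.789

# R-squared: share of total variation explained by the regression
r_squared = 1 - unexplained_variation / total_variation
print(round(r_squared, 4))   # 0.9943
```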
It may be recalled that in Chapter 10 on Simple Correlation and Regression Analysis, we
discussed the impact of variation in only one independent variable on the dependent
variable. We shall now demonstrate the usefulness of two independent variables in explaining the
variation in the dependent variable (net profit in this case).
Suppose we consider only one independent variable, say sales of food items; then the basic data would be as follows:
Supermarket   Net Profit (Rs. Crores)   Sales of Food Items (Rs. Crores)
                        y                            x1
1 5.6 20
2 4.7 15
3 5.4 18
4 5.5 20
5 5.1 16
6 6.8 25
7 5.8 22
8 8.2 30
9 5.8 24
10 6.2 25
The scatter diagram indicates a positive linear correlation between net profit and the sales of food
items. The relationship may be expressed as y = a + b x1, which is the regression equation of y on x1.
While 'b' is the regression coefficient of y on x1, 'a' is just a constant. In the given example, y is the
net profit and x1 is the sales of food items.
The values of 'a' and 'b' are calculated from the following formulas given in Chapter 10:
b = ( Σ x1i yi – n x̄1 ȳ ) / ( Σ x1i² – n (x̄1)² )

and, a = ȳ – b x̄1
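As a quick check, these formulas can be applied to the food-items data in plain Python (a sketch; the variable names are ours):

```python
# Least-squares slope and intercept for y (net profit) on x1 (food sales).
y  = [5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2]
x1 = [20, 15, 18, 20, 16, 25, 22, 30, 24, 25]
n  = len(y)
x_bar, y_bar = sum(x1) / n, sum(y) / n

b = (sum(xi * yi for xi, yi in zip(x1, y)) - n * x_bar * y_bar) \
    / (sum(xi ** 2 for xi in x1) - n * x_bar ** 2)
a = y_bar - b * x_bar
print(a, b)  # close to the fitted line's a = 1.61 and b = 0.2
```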
r² = 1 – (Unexplained Variation in y) / (Total Variation in y) = 1 – Σ(yi – ŷi)² / Σ(yi – ȳ)²
Total Variation = Σ(yi – ȳ)²
Unexplained Variation = Σ(yi – ŷi)²
Explained Variation = Σ(ŷi – ȳ)²
(yi = net profit; xi = sales of food items, both in Rs. Crores)

Supermarket   yi     xi     yi – ȳ   (yi – ȳ)²   ŷi = 1.61 + 0.2xi   yi – ŷi   (yi – ŷi)²   ŷi – ȳ   (ŷi – ȳ)²
1             5.6    20     -0.31    0.0961      5.61                -0.01     0.0001       -0.3     0.09
2             4.7    15     -1.21    1.4641      4.61                 0.09     0.0081       -1.3     1.69
3             5.4    18     -0.51    0.2601      5.21                 0.19     0.0361       -0.7     0.49
4             5.5    20     -0.41    0.1681      5.61                -0.11     0.0121       -0.3     0.09
5             5.1    16     -0.81    0.6561      4.81                 0.29     0.0841       -1.1     1.21
6             6.8    25      0.89    0.7921      6.61                 0.19     0.0361        0.7     0.49
7             5.8    22     -0.11    0.0121      6.01                -0.21     0.0441        0.1     0.01
8             8.2    30      2.29    5.2441      7.61                 0.59     0.3481        1.7     2.89
9             5.8    24     -0.11    0.0121      6.41                -0.61     0.3721        0.5     0.25
10            6.2    25      0.29    0.0841      6.61                -0.41     0.1681        0.7     0.49
Sum           59.1   215             8.789                                     1.109                 7.7
Average       5.91   21.5
It may be noted that the unexplained variation, or residual error, is 1.109 when the simple regression
equation (9) of net profit on sales of food items is fitted, but it reduced to 0.0504 when the
multiple regression equation (7) was used, i.e. when one more variable, sales of non-food items (x2),
was added.

Also, when only one variable, viz. sales of food items, is considered, r² is 0.876, i.e.
87.6% of the variation in net profit is explained by variation in sales of food items; but when both
variables, viz. sales of food as well as non-food items, are considered, R² is 0.9943, i.e. 99.43% of the
variation in net profit is explained by variation in both these variables.
Thus, the net profit for all the 10 stores, by the end of the next year, would be Rs. 8.122
crores.
Caution: It is important to note that a regression equation is valid for estimating the
value of the dependent variable only within the range of the independent variable(s), or
only slightly beyond it. However, it can be used even much beyond the range if
no better option is available and the result is supported by common sense.
For the above example relating to net profit, where there are three variables y, x1 and x2, the
correlation matrix is as follows:

        y        x1       x2
y       1        ryx1     ryx2
x1      –        1        rx1x2
x2      –        –        1
R̄² = 1 – ( (n – 1) / (n – k – 1) ) (1 – R²)    …(10)

where n is the sample size or the number of observations on each of the variables, and
k is the number of independent variables. For the above example,

R̄² = 1 – ( (10 – 1) / (10 – 2 – 1) ) (1 – 0.9943) = 0.9927
To start with, when an independent variable is added, i.e. the value of k is increased, the value of
R² increases; but when the addition of another variable does not contribute towards explaining
the variability in the dependent variable, the value of R̄² decreases. This implies that the
addition of that variable is redundant.

The adjusted R², i.e. R̄², is less than R², and the gap widens as the number of observations per independent
variable decreases. However, R̄² tends to equal R² as the sample size increases for a given number of
independent variables.
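Equation (10) is simple enough to wrap in a small helper (a sketch; the function name is ours):

```python
# Adjusted R^2 (equation 10): R-bar^2 = 1 - ((n - 1)/(n - k - 1)) * (1 - R^2).
def adjusted_r2(r2, n, k):
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Two-variable supermarket model: n = 10 observations, k = 2 regressors.
print(round(adjusted_r2(0.9943, 10, 2), 4))   # 0.9927, as in the text

# Adding a third variable that contributes nothing (R^2 unchanged)
# lowers the adjusted value, flagging that variable as redundant:
print(adjusted_r2(0.9943, 10, 3))
```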
As an illustration, in the above example relating to the regression equation of net profit on sales of food
items and sales of non-food items, the value of R² is 0.96 when only sales of food items is taken as the
independent variable to predict net profit, but it increases to 0.98 when another independent variable,
viz. sales of non-food items, is also taken into consideration. However, the adjusted value, R̄², is
0.96.
Dummy variables are very useful for capturing a variety of qualitative effects, by indicating the two
states of a qualitative or categorical variable as '0' and '1'. The dummy variable is assigned the value '1' or
'0' depending on whether the observation does or does not possess the specified characteristic. Some examples are
male and female, married and unmarried, MBA and non-MBA executives, trained and not trained,
advertisement I and advertisement II, or a sales-promotion strategy such as financial discount versus
gift item. Thus, a dummy variable converts a non-numeric variable into a numeric one.
Dummy variables are used as explanatory variables in a regression equation; they act like switches that turn
various parameters 'on' and 'off' in the equation. Another advantage of a '0'/'1' dummy variable
is that, even though it is a nominal-level variable, it can be treated statistically just like an interval-level
variable. Such a variable is also called an 'indicator variable' or a 'binary variable'. It is a form of
coding that transforms non-metric data into metric data, and it
facilitates considering the two levels of an independent variable separately.
Illustration 2

It is normally expected that a person with a high income will purchase a life insurance policy for a higher
amount. However, it may be worth examining whether there is any difference in the amounts of
insurance purchased by married and unmarried persons. To answer these queries, an insurance agent
collected data about the policies purchased by his clients during the last month. The data are as
follows:
Sr. No. of   Annual Income     Stipulated Annual Insurance    Marital Status
Client       (in Thousands     Premium (in Thousands          (Married/Single)
             of Rs.)           of Rs.)
1            800               85                             M
2            450               50                             M
3            350               50                             S
4            1500              140                            S
5            1000              100                            M
6            500               50                             S
7            250               40                             M
8            60                10                             S
9            800               70                             S
10           1400              150                            M
11           1300              150                            M
12           1200              110                            M
Note: The marital status is converted into an independent variable by substituting 'M' by 1 and 'S' by
0 for the purpose of fitting the regression equation.
It may be verified that the multiple regression equation with amount of insurance premium as
dependent variable and income as well as marital status as independent variables is
Premium = 5.27 + 0.091 Income + 8.95 Marital Status
The interpretation of the coefficient 0.091 is that for every additional thousand rupees of
income, the premium increases by 1000 × 0.091 = Rs 91.
The interpretation of the coefficient 8.95, is that a married person takes an additional premium
of Rs 8,950 as compared to a single person.
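This can be verified computationally as well. The sketch below solves the normal equations for the insurance data by Gaussian elimination in plain Python (illustrative; SPSS or any regression package would give the same coefficients):

```python
# Fit premium = b0 + b1*income + b2*married (dummy: M -> 1, S -> 0)
# by solving the normal equations (X'X) b = X'y; data from Illustration 2.
income  = [800, 450, 350, 1500, 1000, 500, 250, 60, 800, 1400, 1300, 1200]
premium = [85, 50, 50, 140, 100, 50, 40, 10, 70, 150, 150, 110]
married = [1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1]

X = [[1.0, inc, m] for inc, m in zip(income, married)]  # design matrix

xtx = [[sum(r[i] * r[j] for r in X) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * yi for r, yi in zip(X, premium)) for i in range(3)]
aug = [row + [v] for row, v in zip(xtx, xty)]
for i in range(3):                       # Gauss-Jordan elimination with pivoting
    p = max(range(i, 3), key=lambda r: abs(aug[r][i]))
    aug[i], aug[p] = aug[p], aug[i]
    for r in range(3):
        if r != i:
            f = aug[r][i] / aug[i][i]
            aug[r] = [a - f * b for a, b in zip(aug[r], aug[i])]
b0, b1, b2 = (aug[i][3] / aug[i][i] for i in range(3))
print(round(b0, 2), round(b1, 3), round(b2, 2))  # 5.27 0.091 8.95
```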
A partial correlation coefficient indicates the correlation between two variables when the effect of
the third variable on these two variables is removed, or when the third variable is held constant. For
example, ryx1.x2 means the correlation between y and x1 when the effect of x2 on y and x1 is removed,
or x2 is held constant. The various partial correlation coefficients, viz. ryx1.x2, ryx2.x1 and rx1x2.y,
are calculated from the total correlation coefficients.

The values of these partial correlation coefficients, ryx1.x2, ryx2.x1 and rx1x2.y, are 0.997, 0.977 and
0.973, respectively.
The interpretation of ryx2.x1 = 0.977 is that it indicates the extent of linear correlation between y
and x2 when x1 is held constant, i.e. when its impact on y and x2 is removed.

Similarly, the interpretation of rx1x2.y = 0.973 is that it indicates the extent of linear correlation
between x1 and x2 when y is held constant, i.e. when its impact on x1 and x2 is removed.
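These values can be checked with the standard first-order partial correlation formula, r_yx1.x2 = (r_yx1 – r_yx2 r_x1x2) / sqrt((1 – r_yx2²)(1 – r_x1x2²)), applied to the supermarket data (a plain-Python sketch; note that with this data rx1x2.y comes out negative, about –0.97, and the text quotes its magnitude):

```python
# Partial correlations from the pairwise (total) correlations.
# Data: supermarket net profit (y), food sales (x1), non-food sales (x2).
from math import sqrt

y  = [5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2]
x1 = [20, 15, 18, 20, 16, 25, 22, 30, 24, 25]
x2 = [5, 5, 6, 5, 6, 6, 4, 7, 3, 4]

def corr(u, v):
    n = len(u)
    ub, vb = sum(u) / n, sum(v) / n
    num = sum((a - ub) * (b - vb) for a, b in zip(u, v))
    return num / sqrt(sum((a - ub) ** 2 for a in u) * sum((b - vb) ** 2 for b in v))

def partial(r_uv, r_uw, r_vw):
    """First-order partial correlation of u and v, controlling for w."""
    return (r_uv - r_uw * r_vw) / sqrt((1 - r_uw ** 2) * (1 - r_vw ** 2))

r_yx1, r_yx2, r_x1x2 = corr(y, x1), corr(y, x2), corr(x1, x2)
r_yx1_x2 = partial(r_yx1, r_yx2, r_x1x2)   # ~0.997
r_yx2_x1 = partial(r_yx2, r_yx1, r_x1x2)   # ~0.977
r_x1x2_y = partial(r_x1x2, r_yx1, r_yx2)   # magnitude ~0.973 (negative here)
print(round(r_yx1_x2, 3), round(r_yx2_x1, 3), round(r_x1x2_y, 3))
```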
(yi = net profit in Rs. Crores; xi = sales of food items in Rs. Crores; zi = sales of non-food items;
Y, X1, X2 = standardised variables, i.e. (variable – mean)/s.d.)

Supermarket    yi      xi      zi     xi²      Y        X1       X2
1              5.6     20      5      400     -0.331   -0.34    -0.09
2              4.7     15      5      225     -1.291   -1.48    -0.09
3              5.4     18      6      324     -0.544   -0.8      0.792
4              5.5     20      5      400     -0.437   -0.34    -0.09
5              5.1     16      6      256     -0.864   -1.25     0.792
6              6.8     25      6      625      0.949    0.798    0.792
7              5.8     22      4      484     -0.117    0.114   -0.97
8              8.2     30      7      900      2.443    1.937    1.673
9              5.8     24      3      576     -0.117    0.57    -1.85
10             6.2     25      4      625      0.309    0.798   -0.97
Sum            59.1    215     51     4815
Mean           5.91    21.5    5.1
Variance       0.88    19.25   1.29
s.d.           0.937   4.387   1.136
Using the standardised variables, the multiple regression equation is as follows:

Y = 0.917 X1 + 0.348 X2

The interpretation of the beta values is that the contribution of sales of food items to net profit
(0.917) is larger than the contribution of sales of non-food items (0.348).
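The beta values follow from the unstandardised coefficients by the standard conversion beta_j = b_j × s_xj / s_y, using the standard deviations tabulated above (a sketch):

```python
# Beta (standardised) coefficients from the unstandardised ones.
b1, b2 = 0.196, 0.287           # from y = 0.233 + 0.196 x1 + 0.287 x2
s_y, s_x1, s_x2 = 0.937, 4.387, 1.136   # s.d. values from the table

beta1 = b1 * s_x1 / s_y   # close to 0.917
beta2 = b2 * s_x2 / s_y   # close to 0.348
print(beta1, beta2)
```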
2.9 Properties of R²

As mentioned earlier, the coefficient of multiple correlation R is the ordinary or total correlation
between the dependent variable and its estimate as derived from the regression equation, i.e. R = r(yi, ŷi),
and as such it is always positive. Further,

(i) R² ≥ the square of the total correlation coefficient of y with each one of the
variables x1, x2, …, xk.
If the value of R² is high and the multicollinearity problem exists, the regression equation can
still be used for predicting the dependent variable for given values of the independent variables.
However, it should not be used for interpreting the partial regression coefficients as indicating the impact
of individual independent variables on the dependent variable.
The multicollinearity among independent variables can be removed with the help of Principal
Component Analysis, discussed in this Chapter. It involves forming a new set of independent variables
which are linear combinations of the original variables, in such a way that there is no multicollinearity
among the new variables.
If there are two collinear variables, sometimes the exclusion of one may result in an abnormal change in the
regression coefficient of the other variable; sometimes even the sign of the regression coefficient
may change from + to – or vice versa, as demonstrated for the data given below.
y x1 x2
10 12 25
18 16 21
18 20 22
25 22 18
21 25 17
32 24 15
It may be verified that the correlation between x1 and x2 is –0.91, indicating the
existence of multicollinearity.
It may be noted that the coefficient of x1 (1.2), which was positive when the regression
equation of y on x1 alone was considered in (i), has become negative (–0.3) in equation (iii), when x2 is also
included in the regression equation. This is due to the high correlation of –0.91 between x1 and x2. It is,
therefore, desirable to take adequate care of multicollinearity.
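The sign flip can be demonstrated with a few lines of plain Python using the closed-form least-squares solution for one and two regressors (a sketch; function and variable names are ours):

```python
# Demonstrate the sign flip caused by multicollinearity,
# using the six-observation data above.
from math import sqrt

y  = [10, 18, 18, 25, 21, 32]
x1 = [12, 16, 20, 22, 25, 24]
x2 = [25, 21, 22, 18, 17, 15]
n  = len(y)

def s(u, v):
    """Corrected sum of cross-products, sum((u - u_bar)(v - v_bar))."""
    ub, vb = sum(u) / n, sum(v) / n
    return sum((a - ub) * (b - vb) for a, b in zip(u, v))

r_x1x2 = s(x1, x2) / sqrt(s(x1, x1) * s(x2, x2))       # ~ -0.91

b_simple = s(x1, y) / s(x1, x1)                        # y on x1 alone: ~ +1.2

# Closed-form least squares for two regressors:
d = s(x1, x1) * s(x2, x2) - s(x1, x2) ** 2
b1_both = (s(x1, y) * s(x2, x2) - s(x2, y) * s(x1, x2)) / d   # ~ -0.3

print(round(r_x1x2, 2), round(b_simple, 2), round(b1_both, 2))
```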
In any given situation, one can always define a dependent variable and some independent variables,
and thus define a regression model or equation. However, an important issue is whether all the
defined variables in the model, as a whole, have a real influence on the dependent variable and are
able to explain the variation in it. For example, one may postulate that
the sales of a paint-manufacturing company (defined as the dependent variable) depend on the
expenditure on R&D, advertising expenditure, the price of the paint, the discount to wholesalers and the
number of salesmen. While these variables might be found to impact the sales of
the company significantly, it could also happen that they do not, since the more
important factors could be the quality of the paint and the availability and pricing of other similar
paints. Further, even if the variables mentioned above are found to contribute significantly,
as a whole, to the sales of the paint, one or some of them might not be influencing the sales in a
significant way.
For example, it might happen that sales are insensitive to advertising expenses, i.e. increasing
the expenditure on advertising might not increase the sales in a significant way. In such a case, it
is advisable to exclude this variable from the model and use only the remaining variables. As
explained in the next Section, it is not advisable to include a variable unless its contribution to the
variation in the dependent variable is significant. These issues will be explained with examples in
subsequent sections.
It is often difficult to identify the exact set of variables that are significant in the regression model, and
the process of finding them may take many steps or iterations, as explained through an illustration in
the next section. This is the limitation of this method; it can be overcome by the stepwise
regression method.
This method is used when a researcher has clearly identified three different types of variables, namely the
dependent variable, the independent variable(s) and the control variable(s).

This method helps the researcher find the relationship between the independent variables and the
dependent variable in the presence of some variables that are controlled in the experiment; such
variables are termed control variables. The control variables are entered first in the hierarchy, and
then the independent variables are entered. This method is available in most statistical software,
including SPSS.
This method is used when a researcher wants to find out which independent variables, out of a set of
candidates, contribute significantly to the regression model. It finds the best-fit
model, i.e. the model whose set of independent variables all contribute significantly to the
regression equation.

For example, if a researcher has identified three independent variables that may affect the
dependent variable, and wants to find the combination of these variables that contributes
significantly to the regression model, the researcher may use stepwise regression. The software
reports the exact set of variables that contribute, i.e. are worth keeping in the model.
The three most popular stepwise regression methods are forward regression, backward
regression and stepwise regression. In forward regression, one independent variable is entered with the
dependent variable, and the regression equation is arrived at along with other tests such as ANOVA and t
tests. In the next iteration, one more independent variable is added and the result is compared with the
previous model. If the new variable contributes significantly to the model, it is kept; otherwise it is
dropped from the model. This process is repeated for each remaining independent variable, thus
arriving at a significant model containing all contributing independent variables. The
backward method is exactly the opposite: initially all the
variables are considered, and they are removed one by one if they do not contribute to the model.

The stepwise regression method is a combination of the forward selection and backward elimination
methods. The basic difference between it and the other two methods is that even a
variable selected at the beginning, or selected subsequently, has to keep on competing with
the other entering variables at every stage to justify its retention in the equation.
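The forward iteration described above can be sketched in plain Python. Here the entry criterion is improvement in adjusted R², whereas SPSS's stepwise procedure uses probability-of-F-to-enter and F-to-remove criteria; the supermarket data from earlier in the chapter is used only as a small illustration:

```python
# Minimal forward-selection sketch: at each step, add the candidate that
# yields the highest adjusted R^2; stop when no candidate improves it.
def ols_r2(xcols, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n, m = len(y), len(xcols) + 1
    X = [[1.0] + [c[i] for c in xcols] for i in range(n)]
    # Normal equations (X'X) b = X'y, solved by Gauss-Jordan elimination.
    aug = [[sum(row[i] * row[j] for row in X) for j in range(m)]
           + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(m)]
    for i in range(m):
        p = max(range(i, m), key=lambda r: abs(aug[r][i]))
        aug[i], aug[p] = aug[p], aug[i]
        for r in range(m):
            if r != i:
                f = aug[r][i] / aug[i][i]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[i])]
    b = [aug[i][m] / aug[i][i] for i in range(m)]
    y_bar = sum(y) / n
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
              for yi, row in zip(y, X))
    return 1 - sse / sum((yi - y_bar) ** 2 for yi in y)

def adj_r2(r2, n, k):
    return 1 - (n - 1) / (n - k - 1) * (1 - r2)

# Supermarket data: net profit (y), food sales (x1), non-food sales (x2).
y = [5.6, 4.7, 5.4, 5.5, 5.1, 6.8, 5.8, 8.2, 5.8, 6.2]
cols = {"x1": [20, 15, 18, 20, 16, 25, 22, 30, 24, 25],
        "x2": [5, 5, 6, 5, 6, 6, 4, 7, 3, 4]}

selected, best = [], float("-inf")
remaining = sorted(cols)
while remaining:
    scores = {name: adj_r2(ols_r2([cols[v] for v in selected + [name]], y),
                           len(y), len(selected) + 1)
              for name in remaining}
    name = max(scores, key=scores.get)
    if scores[name] <= best:
        break  # no remaining variable improves adjusted R^2
    selected.append(name)
    best = scores[name]
    remaining.remove(name)
print(selected, round(best, 4))  # x1 enters first, then x2
```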
Example 2

The following table gives certain parameters for some of the top-rated companies in the ET 500
listings published in the issue of February 2006.
Sr.  Company                      M-Cap Oct '05     Net Sales Sept '05   Net Profit Sept '05   P/E as on Oct 31 '05
No.                               Amount   Rank     Amount    Rank       Amount    Rank        Amount    Rank
1    INFOSYS TECHNOLOGIES         68560    3        7836      29         2170.9    10          32        66
2    TATA CONSULTANCY SERVICES    67912    4        8051      27         1831.4    11          30        74
3    WIPRO                        52637    7        8211      25         1655.8    13          31        67
4    BHARTI TELE-VENTURES *       60923    5        9771      20         1753.5    12          128       3
5    ITC                          44725    9        8422      24         2351.3    8           20        183
6    HERO HONDA MOTORS            14171    24       8086      26         868.4     32          16        248
7    SATYAM COMPUTER SERVICES     18878    19       3996      51         844.8     33          23        132
8    HDFC                         23625    13       3758      55         1130.1    23          21        154
9    TATA MOTORS                  18881    18       18363     10         139       17          14        304
10   SIEMENS                      7848     49       2753      75         254.7     80          38        45
11   ONGC                         134571   1        37526     5          14748.1   1           9         390
12   TATA STEEL                   19659    17       14926     11         3768.6    5           5         469
13   STEEL AUTHORITY OF INDIA     21775    14       29556     7          6442.8    3           3         478
14   NESTLE INDIA                 8080     48       2426      85         311.9     75          27        99
15   BHARAT FORGE CO.             6862     55       1412      128        190.5     97          37        48
16   RELIANCE INDUSTRIES          105634   2        74108     2          9174.0    2           13        319
17   HDFC BANK                    19822    16       3563      58         756.5     37          27        98
18   BHARAT HEAVY ELECTRICALS     28006    12       11200     17         1210.1    21          25        116
19   ICICI BANK                   36890    10       11195     18         2242.4    9           16        242
20   MARUTI UDYOG                 15767    22       11601     16         988.2     26          17        213
21   SUN PHARMACEUTICALS          11413    29       1397      130        412.2     66          30        75
* The data about Bharti Tele-Ventures is not considered for analysis because its P/E ratio is
exceptionally high.
In the above example, we take Market Capitalisation as the dependent variable, and Net Profit, P/E
Ratio and Net Sales as independent variables.

We may add that this example is to be viewed as an illustration of selecting the optimum number of
independent variables, and not of the concepts of financial analysis.
The notations used for the variables are as follows.
Y Market Capitalisation
x1 Net Sales
x2 Net Profit
x3 P/E Ratio
Step I :

First of all, we calculate the total correlation coefficients among the independent
variables, as well as the correlations of the dependent variable with each independent variable.
These are tabulated below.
1 2 3
Net Sales Net Profit P/E Ratio
Net Sales 1.0000
Net Profit 0.7978 1.0000
P/E Ratio -0.5760 -0.6004 1.0000
Market Cap 0.6874 0.8310 -0.2464
We note that the correlation of y with x2 is highest. We therefore start by taking only this variable in
the regression equation.
Step II :
The regression equation of y on x2 is
y = 15465 + 7.906 x2
Step III :

Since R̄² for the combination x2 and x3 (0.7656) is higher than R̄² for the combination x2 and x1
(0.6734), we select x3 as the additional variable along with x2.

It may be noted that R² with variables x2 and x3 (0.7903) is more than the value of R² with only the
variable x2 (0.6906). Thus it is advisable to have x3 along with x2 in the model.
Step IV :
Now we include the last variable viz. x1 to have the model as
Y = bo + b1x1 + b2x2 + b3 x3
The requisite calculations are too cumbersome to be carried out manually, and, therefore, we use
Excel spreadsheet which yields the following regression equation.
Y = – 23532 + 0.363x1 + 8.95x2 + 1445.6x3
The values of R² and R̄² are: R² = 0.8016, R̄² = 0.7644.

It may be noted that the inclusion of x1 in the model has very marginally increased the value of R² from
0.7903 to 0.8016, but the adjusted value, R̄², has come down from 0.7656 to 0.7644. Thus it
is not worthwhile to add the variable x1 to the regression model having the variables x2 and x3.
Step V :

The advisable regression model, therefore, includes only x2 and x3:

Y = – 19823 + 10.163 x2 + 1352.4 x3    …(14)

This is the best regression equation fitted to the data on the basis of the R̄² criterion, as discussed above.
We have discussed the same example, using SPSS, in Section 2.17.
2.15 Generalised Regression Model

(ii) E(ei) = 0

This assumption implies that the sum of positive and negative errors is zero, and thus they
cancel each other out. Ideally, residuals are randomly scattered around 0 (the horizontal line), providing a
relatively even distribution.
Heteroscedasticity is indicated when the residuals are not evenly scattered around the line.
The traditional test for the presence of first-order autocorrelation is the Durbin–Watson statistic.
Other than the above assumptions, regression analysis also requires that the independent variables
should not be related to each other; in other words, there should be no multicollinearity.
• Indicators that multicollinearity may be present in a model:
• 1) Large changes in the estimated regression coefficients when a predictor variable is added or
deleted.
• 2) Insignificant regression coefficients for the affected variables in the multiple regression,
coupled with a rejection of the joint hypothesis that those coefficients are all zero (using an F-test).
• 3) Large changes in the estimated regression coefficients when an observation is added or
deleted.
• A formal detection measure for multicollinearity is the tolerance, or the variance inflation factor (VIF):
• Tolerance = 1 – R²; VIF = 1 / Tolerance,
where R² here is obtained by regressing the given independent variable on all the other independent variables.
• A tolerance of less than 0.20 and/or a VIF of 5 or above indicates a multicollinearity
problem.
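The tolerance and VIF computation can be illustrated with the six-observation y/x1/x2 data used earlier in this chapter, where, with only two regressors, the R² of one regressor on the other is just their squared correlation (a plain-Python sketch):

```python
# Tolerance and VIF for a two-regressor case: R^2 of x1 on x2 equals
# the squared correlation between x1 and x2.
from math import sqrt

x1 = [12, 16, 20, 22, 25, 24]
x2 = [25, 21, 22, 18, 17, 15]

n = len(x1)
m1, m2 = sum(x1) / n, sum(x2) / n
r = (sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
     / sqrt(sum((a - m1) ** 2 for a in x1) * sum((b - m2) ** 2 for b in x2)))

tolerance = 1 - r ** 2          # ~0.17, below the 0.20 cutoff
vif = 1 / tolerance             # ~5.8, above the cutoff of 5
print(round(tolerance, 2), round(vif, 1))
```

Both indicators flag a multicollinearity problem for this data, consistent with the sign-flip behaviour shown earlier.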
In the presence of multicollinearity, the estimate of one variable's impact on y while controlling for
the others tends to be less precise than if predictors were uncorrelated with one another. The usual
interpretation of a regression coefficient is that it provides an estimate of the effect of a one unit
change in an independent variable, X1, holding the other variables constant. If X1 is highly correlated
with another independent variable, X2, in the given data set, then we only have observations for
which X1 and X2 have a particular relationship (either positive or negative). We don't have
observations for which X1 changes independently of X2, so we have an imprecise estimate of the
effect of independent changes in X1.
Multicollinearity does not actually bias the results; it just produces large standard errors for the related
independent variables. With enough data, these errors will be reduced.
What to do if multicollinearity is present:
• 1) Leave the model as is, despite multicollinearity. The presence of multicollinearity doesn't
affect the fitted model's predictions, provided that the predictor variables follow the same pattern of
multicollinearity as the data on which the regression model is based.
• 2) Drop one of the variables. An explanatory variable may be dropped to produce a model
with significant coefficients. However, you lose information (because you've dropped a
variable). Omission of a relevant variable results in biased coefficient estimates for the
remaining explanatory variables.
• 3) Obtain more data. This is the preferred solution. More data can produce more precise
parameter estimates (with lower standard errors).
• Note: Multicollinearity does not impact the reliability of the forecast, but rather impacts the
interpretation of the explanatory variables. As long as the collinear relationships in your
independent variables remain stable over time, multicollinearity will not affect the forecast.
In this section, we indicate financial applications of regression analysis in some aspects relating to the
stock market.

(i) Individual Stock Rates of Return, Payout Ratio and Market Rates of Return

Let the relationship of the rate of return of a stock with the payout ratio, defined as the ratio of dividend
per share to earnings per share, and the rate of return on BSE SENSEX stocks as a whole,
be
y = b0 + b1 ( payout ratio ) + b2 ( rate of return on Sensex)
Let us assume that the relevant data is available, and that fitting it over a period of the last 10 years yields
the following equation:

y = 1.23 – 0.22 (payout ratio) + 0.49 (rate of return on SENSEX)

The coefficient –0.22 indicates that for a 1% increase in the payout ratio, the return on the stock
reduces by 0.22% when the SENSEX rate of return is held constant. Further, the coefficient 0.49 implies that for a
1% increase in the rate of return on the BSE SENSEX, the return on the stock increases by 0.49% when
the payout ratio is held constant.
Further, let the calculations yield the value of R2 as 0.66.
The value of R2 = 0.66 implies that 66% of variation in the rate of return on the investment in the
stock is explained by pay-out ratio and the return on BSE SENSEX.
The regression equation could be used for interpreting the regression coefficients and for predicting the average
price per share, given the values of dividend paid and earnings retained.

The coefficient 15.30 of x1 (dividend per share) indicates that the average price per share
increases by Rs 15.30 when the dividend per share increases by Re 1, with retained earnings
held constant.

The regression coefficient 3.55 of x2 means that when retained earnings increase by Rs. 1.00, the
price per share increases by Rs 3.55, with dividend per share held constant.

The use of multiple regression analysis in carrying out cost analysis was demonstrated by Benston in
1966. He collected data from a firm's accounting, production and shipping records to establish a
multiple regression equation.
Low values of tolerance indicate problems of multicollinearity. The minimum cutoff value for tolerance
is typically 0.2; that is, a tolerance value smaller than 0.2 indicates a problem of
multicollinearity.
The next box that appears is shown in the following SPSS snapshot.

SPSS Snapshot MRA 2

[Screenshot of the Linear Regression dialog. Callouts: 1. The dependent variable is entered in the
'Dependent' box. 2. The independent variables are entered in the 'Independent(s)' list. 3. If the method
of selection is hierarchical, all control variables are selected in the list of independent variables in the
first block, and then, by clicking 'Next', the rest of the independent variables are entered.]

If the method of selection is the general method, 'Enter' should be selected from the drop-down box. If the
method is stepwise, 'Stepwise' should be selected from the list. We have explained the criteria for
selecting the appropriate method in Section 2.14.
The next step is to click on the 'Statistics' button at the bottom of the box. When one clicks on it,
the following box will appear.

SPSS Snapshot MRA 3

The Durbin–Watson statistic is used to check the assumption of regression analysis which states
that the error terms should be uncorrelated. While its desirable value is 2, the desirable range is 1.5 to
2.5.
For multicollinearity to be absent, VIF should be less than 5, or Tolerance should be at least 0.2.
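The Durbin–Watson statistic itself is easy to compute: DW = Σ(e_t – e_{t–1})² / Σ e_t². Applied to the residuals of the supermarket example transcribed earlier (a sketch, not SPSS output):

```python
# Durbin-Watson statistic: DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).
# Residuals below are the (y - y_hat) column of the supermarket example.
e = [0.0127, 0.0928, -0.082, -0.087, 0.0099, -0.054,
     0.1075, 0.0789, 0.0023, -0.081]

dw = (sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
      / sum(et ** 2 for et in e))
print(round(dw, 2))  # 1.79, inside the 1.5 to 2.5 comfort range
```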
After clicking ‘Continue’, SPSS will return to screen as in SPSS Snapshot MRA 2.
In this snapshot, click on the plots button at the bottom. The next box that would appear is given in
the following Snapshot MRA 4.
SPSS Snapshot MRA 4
The residual analysis is done to check the assumptions of multiple regression. According to these
assumptions, the residuals should be normally distributed; this can be checked by viewing
the histogram and the normal probability plot.
After clicking 'Continue', SPSS will return to SPSS Snapshot MRA 3.
By clicking ‘OK’, SPSS will carry out the analysis and give the output in the Output View.
We will discuss two outputs using the same data. One, by using General method for entering
variables, and the other by selecting stepwise method for entering variables.
General Method for Entering Variables – SPSS Output
Regression
Descriptive Statistics
The 'part and partial correlations matrix' is useful in understanding the relationships between the
independent and dependent variables. The regression analysis is valid only if the independent
variables are not interrelated; if they are related to each other, this may lead to
misinterpretation of the regression equation. This is termed multicollinearity, and its impact is
described in Section 2.10. The above correlation matrix is useful in checking the interrelationships
between the independent variables. In the table, the correlations of the independent variables with
the dependent variable (0.687 and 0.831) are high, which means these variables are related to market
capitalisation; the correlations between the independent variables (0.798, –0.576
and –0.6) are also high, which means this data may have multicollinearity. Generally, very high
correlations between the independent variables, say more than 0.9, may make the entire regression
analysis unreliable for interpreting the regression coefficients.
Variables Entered/Removed(b)

Model   Variables Entered                                   Variables Removed   Method
1       peratio_oct05, Net_Sal_sept05, Net_Prof_sept05(a)   .                   Enter
a. All requested variables entered.
b. Dependent Variable: m_cap_amt_oct05
Since the method selected was the Enter (general) method, this table does not convey any additional
meaning.
Model Summary(b)

This table gives the model summary for the set of independent and dependent variables. R² for the
model is 0.802, which is high and means that around 80% of the variation in the dependent variable (market
capitalisation) is explained by the three independent variables (net sales, net profit and P/E ratio). The
Durbin–Watson statistic for this model is 0.982, which is very low; the accepted value should be in
the range 1.5 to 2.5. It may, therefore, be added as a caution that the assumption that the residuals
are uncorrelated is not valid here.
ANOVA(b)

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   1.80E+10         3    5996219888    21.545   .000(a)
  Residual     4.45E+09         16   278308655.0
  Total        2.24E+10         19
a. Predictors: (Constant), peratio_oct05, Net_Sal_sept05, Net_Prof_sept05
b. Dependent Variable: m_cap_amt_oct05
The ANOVA table for the regression analysis indicates whether the model is significant and valid.
The ANOVA is significant if the 'Sig.' column in the above table is less than the level of
significance (generally taken as 5% or 1%). Since 0.000 < 0.01, we conclude that this model is
significant.

If the model is not significant, it implies that no relationship exists between the set of variables.
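The F value in the table can be recovered from the sums of squares and degrees of freedom (a sketch; the SS values are as displayed in the output, so the result matches SPSS's 21.545 only up to rounding):

```python
# Recompute the F statistic from the ANOVA table's sums of squares.
ss_reg, df_reg = 1.80e10, 3        # regression
ss_res, df_res = 4.45e9, 16        # residual

ms_reg = ss_reg / df_reg           # mean square, regression
ms_res = ss_res / df_res           # mean square, residual
f_stat = ms_reg / ms_res
print(round(f_stat, 1))  # ~21.6, close to the reported 21.545
```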
Coefficients(a)

                    Unstandardized          Standardized
                    Coefficients            Coefficients                     Correlations
Model               B           Std. Error  Beta    t        Sig.   Zero-order  Partial  Part
1 (Constant)        -23531.5    13843.842           -1.700   .109
  Net_Sal_sept05    .363        .381        .180    .953     .355   .687        .232     .106
  Net_Prof_sept05   8.954       1.834       .941    4.882    .000   .831        .774     .544
  peratio_oct05     1445.613    486.760     .422    2.970    .009   -.246       .596     .331
a. Dependent Variable: m_cap_amt_oct05
This table gives the regression coefficients and their significance. The equation can be written as:

Market capitalisation = – 23531.5 + 0.363 × Net Sales + 8.954 × Net Profits + 1445.613 × P/E Ratio

It may be noted that the above equation is given only to show how to derive the
equation from the table. In this example, although the regression coefficients of net profit and P/E ratio
are significant, that of net sales is not; since not all three coefficients are significant,
this equation cannot be used for estimating the market capitalisation.
Residuals Statistics(a)

Charts

[Histogram of the regression standardised residuals: Mean = -1.67E-16, Std. Dev. = 0.918, N = 20.]

The above chart is used to test the validity of the assumption that the residuals are normally distributed.
Looking at the chart, one may conclude that the residuals are more or less normal. This can be tested
formally using a chi-square goodness-of-fit test.
Since not all three regression coefficients are significant, the Enter-method equation cannot be used
for estimation. It is advisable to use stepwise regression in this case, since it gives the most
parsimonious set of variables in the equation.
Regression
Descriptive Statistics
Correlations
Variables Entered/Removed(a)

Model   Variables Entered   Variables Removed   Method
1       Net_Prof_sept05     .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050,
                                                Probability-of-F-to-remove >= .100).
2       peratio_oct05       .                   Stepwise (Criteria: Probability-of-F-to-enter <= .050,
                                                Probability-of-F-to-remove >= .100).
a. Dependent Variable: m_cap_amt_oct05
This table gives the summary of the entered variables in the model.
Model Summary(c)

In the previous method, there was only one model. Since this is the stepwise method, the output gives all
the models that are significant at each step. The Durbin–Watson statistic is improved from the previous model but
is still below the desired range (1.5 to 2.5). The last model is generally the best model; this can be
verified by the adjusted R²: the model with the highest adjusted R² is the best. Model 2, which consists of the
dependent variable (market capitalisation) and the independent variables net profit and P/E ratio, is the
best model. The following table gives the coefficients for the best model.

It may be noted that this model, though it does not contain the independent variable 'Net Sales', is slightly
better than the previous model discussed using the 'Enter' (general) method of entering variables: the
previous model's adjusted R² was 0.764, while this model's adjusted R² is 0.766.
The following table gives ANOVA for all the iterations (in this case 2), and both are significant.
ANOVA(c)

Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   1.55E+10         1    1.550E+10     40.169   .000(a)
  Residual     6.94E+09         18   385797411.3
  Total        2.24E+10         19
2 Regression   1.77E+10         2    8867988179    32.037   .000(b)
  Residual     4.71E+09         17   276801281.6
  Total        2.24E+10         19
a. Predictors: (Constant), Net_Prof_sept05
b. Predictors: (Constant), Net_Prof_sept05, peratio_oct05
c. Dependent Variable: m_cap_amt_oct05
Coefficients(a)

                    Unstandardized          Standardized
                    Coefficients            Coefficients                     Correlations
Model               B           Std. Error  Beta    t        Sig.   Zero-order  Partial  Part
1 (Constant)        15465.085   5484.681            2.820    .011
  Net_Prof_sept05   7.906       1.247       .831    6.338    .000   .831        .831     .831
2 (Constant)        -19822.8    13249.405           -1.496   .153
  Net_Prof_sept05   10.163      1.321       1.068   7.691    .000   .831        .881     .854
  peratio_oct05     1352.358    475.527     .395    2.844    .011   -.246       .568     .316
a. Dependent Variable: m_cap_amt_oct05
It may be noted in the above table that the values of the constant and regression coefficients are the same as in equation (14), derived manually. The SPSS stepwise regression did this automatically, and the results are the same.
The following table gives summary of excluded variables in the two models.
Excluded Variables(c)
                                                 Partial       Collinearity Statistics
Model               Beta In   t       Sig.       Correlation   Tolerance
1  Net_Sal_sept05   .067(a)   .301    .767       .073          .363
   peratio_oct05    .395(a)   2.844   .011       .568          .639
2  Net_Sal_sept05   .180(b)   .953    .355       .232          .349
a. Predictors in the Model: (Constant), Net_Prof_sept05
b. Predictors in the Model: (Constant), Net_Prof_sept05, peratio_oct05
c. Dependent Variable: m_cap_amt_oct05
There may be a situation where a researcher would like to divide the data into two parts, using one part to derive the model and the other part to validate it. SPSS allows the user to split the data into two groups, termed the estimation group and the validation group. The estimation group is used to fit the model, which is then validated using the validation group. This improves the validity of the model. This process is called cross-validation. It can be used only if the data set is large enough to fit the model. Random variable functions in SPSS can be used to select cases randomly from the SPSS file.
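As a sketch of this estimation/validation idea outside SPSS, the following Python fragment uses made-up data and hypothetical variable roles: it splits the sample randomly, fits an ordinary least squares model on the estimation group, and checks R2 on the held-out validation group.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical data: 40 cases, 2 predictors (e.g. net profit, P/E ratio)
X = rng.normal(size=(40, 2))
y = 3.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=40)

# Randomly split cases into an estimation group and a validation group
idx = rng.permutation(40)
est, val = idx[:28], idx[28:]

# Fit by ordinary least squares on the estimation group only
A = np.column_stack([np.ones(len(est)), X[est]])
coef, *_ = np.linalg.lstsq(A, y[est], rcond=None)

# Validate: R^2 of the fitted model on the held-out validation group
A_val = np.column_stack([np.ones(len(val)), X[val]])
pred = A_val @ coef
ss_res = np.sum((y[val] - pred) ** 2)
ss_tot = np.sum((y[val] - y[val].mean()) ** 2)
r2_val = 1 - ss_res / ss_tot
print(round(r2_val, 3))
```

A validation R2 close to the estimation R2 suggests the model generalizes rather than overfitting the estimation group.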
3 Discriminant Analysis
Discriminant analysis is basically a classification technique used to classify a given set of objects, individuals, or entities into two (or more) groups or categories based on the given data about their characteristics. It is the process of deriving an equation, called the 'Discriminant Function', giving the relationship between one dependent variable which is categorical, i.e. it takes only two values, say 'yes' or 'no', represented by '1' or '0', and several independent variables which are continuous. The independent variables selected for the analysis are those which contribute towards classifying an object, individual, or entity into one of the two categories. For example, with the help of several financial indicators, one may decide whether or not to extend credit to a company. The classification could also be done into more than two categories.
Identifying a set of variables which discriminate ‘Best’ between the two groups is the first step in the
discriminant analysis. These variables are called discriminating variables.
One of the simplest examples of a discriminating variable is 'height' in the case of graduate students. Let there be a class of 50 students comprising boys and girls. Suppose we are given only roll numbers, and we are required to classify them by sex, i.e. segregate boys and girls. One alternative is to take 'height' as the variable, and premise that all those of height 5'6'' or more are boys and those below that height are girls. This classification should work well except in some cases where girls are taller than 5'6'' or boys are shorter than that height. In fact, one could work out, from a large sample of students, the most appropriate value of the discriminating height. This example illustrates one fundamental aspect of discriminant analysis: in real life we cannot find discriminating variable(s) or a function that provides 100% accurate discrimination or classification. We can only attempt to find the best classification from a given set of data. Yet another example is the variable 'marks' (percentage or percentile) in an examination, used to classify students into two or more categories. As is well known, even marks cannot guarantee 100% accurate classification.
Discriminant analysis is used to analyze relationships between a non-metric dependent variable and
metric or dichotomous (Yes / No type or Dummy) independent variables. Discriminant analysis uses
the independent variables to distinguish among the groups or categories of the dependent variable.
The discriminant model can be valid or useful only if it is accurate. The accuracy of the model is
measured on the basis of its ability to predict the known group memberships in the categories of the
dependent variable.
Discriminant analysis works by creating a new variable called the discriminant function score
which is used to predict to which group a case belongs. The computations find the coefficients for the
independent variables that maximize the measure of distance between the groups defined by the
dependent variable.
The discriminant function is similar to a regression equation in which the independent variables are
multiplied by coefficients and summed to produce a score.
In case of multiple discriminant analysis, there will be more than one discriminant function. If the
dependent variable has three categories like, high risk, medium risk, low risk, there will be two
discriminant functions. If dependent variable has four categories, there will be three discriminant
functions. In general, the number of discriminant functions is one less than the number of categories
of the dependent variable.
It may be noted that in the case of multiple discriminant functions, each function needs to be significant for the results to be conclusive.
The following illustrations explain the concepts and the technique of deriving a Discriminant
function, and using it for classification. The objective in this example is to explain the concepts in a
popular manner without mathematical rigour.
Illustration 3
Suppose we want to predict whether a science graduate, studying inter alia the subjects of Physics and Mathematics, will turn out to be a successful scientist or not. Here, it is premised that the performance of a graduate in Physics and Mathematics, to a large extent, contributes to shaping a successful scientist. The next step is to select some successful and some unsuccessful scientists, and record the marks obtained by them in Mathematics and Physics in their graduate examination. While in a real-life application we would have to select a sufficient number, say 10 or more, of students in both categories, just for the sake of simplicity let the data on two successful and two unsuccessful scientists be as follows:
          Successful Scientist                  Unsuccessful Scientist
          Marks in          Marks in            Marks in          Marks in
          Mathematics (M)   Physics (P)         Mathematics (M)   Physics (P)
          12                8                   11                7
          8                 10                  5                 9
Average:  M_S = 10          P_S = 9             M_U = 8           P_U = 8
S : Successful U : Unsuccessful
It may be mentioned that marks such as 8, 10, etc. are taken just for ease of calculation.
The discriminant function assumed is
Z = w1 M + w2 P
The requisite calculations on the above data yield
w1 = 9 and w2 = 23
Thus, the discriminant function works out to be
Z = 9 M + 23 P
and the discriminant score works out to be

Z_C = [(9 M_S + 23 P_S) + (9 M_U + 23 P_U)] / 2
    = [(9 × 10 + 23 × 9) + (9 × 8 + 23 × 8)] / 2
= 276.5
This discriminant score helps us predict whether a graduate student will turn out to be a successful scientist or not. The score for the two successful scientists is 292 and 302, both being more than the discriminant score of 276.5; the score is 260 and 252 for the unsuccessful scientists, both being less than 276.5. If a young graduate gets 11 marks in Mathematics and 9 marks in Physics, his score as per the discriminant function is 9 × 11 + 23 × 9 = 306. Since this is more than the discriminant score of 276.5, we can predict that this graduate will turn out to be a successful scientist. This is depicted pictorially in the following diagram.
pictorially in the following diagram.
12
10 8, 10 Successful
5, 9 11, 9
8 12, 8
Marks in 11, 7
Physics
6
Unsuccessful
4
Marks in Mathematics
0
0 2 4 6 8 10 12 14
It may be noted that both the successful scientists' scores are above the discriminant line and the scores of both the unsuccessful scientists are below it.
The student with the assumed marks is classified in the category of successful scientists.
This example illustrates that with the help of past data, about objects including entities,
individuals, etc., and their classification in two categories, one could derive the discriminant
function and the discriminant scores. Subsequently, if the same type of data is given for some
other object, the discriminant score could be worked out for that object and thus classify it in
either of the two categories.
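The classification rule of this illustration can be sketched in a few lines of Python; the weights and the cutoff score are taken directly from the discriminant function derived above.

```python
# Discriminant function from the illustration: Z = 9M + 23P, cutoff Z_C = 276.5
w1, w2, cutoff = 9, 23, 276.5

def classify(m, p):
    """Score a graduate's marks and compare against the discriminant score."""
    z = w1 * m + w2 * p
    return z, ("successful" if z > cutoff else "unsuccessful")

# The four scientists from the data, then the new graduate with M = 11, P = 9
print(classify(12, 8))   # → (292, 'successful')
print(classify(8, 10))   # → (302, 'successful')
print(classify(11, 7))   # → (260, 'unsuccessful')
print(classify(5, 9))    # → (252, 'unsuccessful')
print(classify(11, 9))   # → (306, 'successful')
```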
(1) An F test (Wilks' lambda) is used to test whether the discriminant model as a whole is significant. Wilks' lambda (λ) for each independent variable is calculated as the ratio of the within-groups sum of squares to the total sum of squares.
λ lies between 0 and 1. Large values of λ indicate that there is no difference in the group means for the independent variable; small values of λ indicate that the group means are different for the independent variable. The smaller the value of λ, the greater the discriminating power of the variable.
(2) If the F test shows significance, then the individual independent variables are assessed to see
which of these differ significantly (in mean) by group and are subsequently used to classify the
dependent variable.
Functions at Group The mean discriminant scores for each of the dependent variable categories for each of
Centroids the discriminant functions in MDA. Two-group discriminant analysis has two centroids,
one for each group. We want the means to be well apart to show the discriminant
function is clearly discriminating. The closer the means, the more errors of
classification there likely will be
Discriminant Also called canonical plots, can be created in which the two axes are two of the
function plots discriminant functions (the dimensional meaning of which is determined by looking at
the structure coefficients, discussed above), and circles within the plot locate the
centroids of each category being analyzed. The farther apart one point is from another
on the plot, the more the dimension represented by that axis differentiates those two
groups. Thus these plots depict how well the discriminant functions separate the groups.
(Model) Wilks' Used to test the significance of the discriminant function as a whole. The "Sig." level
lambda for this function is the significance level of the discriminant function as a whole. The
researcher wants a finding of significance; the smaller the lambda, the more likely it
is significant. A significant lambda means one can reject the null hypothesis that the
two groups have the same mean discriminant function scores and conclude the model is
discriminating.
ANOVA table for Another overall test of the DA model. It is an F test, where a "Sig." p value < .05 means
Discriminant scores the model differentiates discriminant scores between the groups significantly better than
chance (than a model with just the constant).
(Variable) Wilks' It can be used to test which independent variables contribute significantly to the discriminant
lambda function. The smaller the value of Wilks' lambda for an independent variable, the more
that variable contributes to the discriminant function. Lambda varies from 0 to 1, with 0
meaning group means differ (thus the more the variable differentiates the groups), and 1
meaning all group means are the same.
Dichotomous independents are more accurately tested with a chi-square test than with
Wilks' lambda for this purpose.
Classification Also called assignment, or prediction matrix or table, is used to assess the performance
Matrix or of DA. This is a table in which the rows are the observed categories of the dependent
Confusion Matrix and the columns are the predicted categories of the dependents. When prediction is
perfect, all cases will lie on the diagonal. The percentage of cases on the diagonal is the
percentage of correct classifications. This percentage is called the hit ratio.
Expected hit ratio. The hit ratio is not relative to zero but to the percent that would have been correctly
classified by chance alone. For two-group discriminant analysis with a 50-50 split in the
dependent variable, the expected percentage is 50%. For unequally split 2-way groups
of different sizes, the expected percent is computed in the "Prior Probabilities for
Groups" table in SPSS, by multiplying the prior probabilities times the group size,
summing for all groups, and dividing the sum by N. A simple alternative strategy is to
assign all cases to the largest group; the expected percent is then the largest group
size divided by N.
Cross-validation. Leave-one-out classification is available as a form of cross-validation of the
classification table. Under this option, each case is classified using a discriminant
function based on all cases except the given case. This is thought to give a better
estimate of what classification results would be in the population.
Measures of can be computed by the crosstabs procedure in SPSS if the researcher saves the
association predicted group membership for all cases.
Mahalanobis D- Indices other than Wilks' lambda of the extent to which the discriminant functions
Square, Rao's V, discriminate between criterion groups. Each has an associated significance test. A
Hotelling's trace, measure from this group is sometimes used in stepwise discriminant analysis to
Pillai's trace, and determine if adding an independent variable to the model will significantly improve
Roy's gcr (greatest classification of the dependent variable. SPSS uses Wilks' lambda by default but also
characteristic root) offers Mahalanobis distance, Rao's V, unexplained variance, and smallest F ratio on
selection.
Structure These are also known as discriminant loadings, and can be defined as simple correlations
Correlations between the independent variables and the discriminant functions.
The next box that will appear is given in the following snapshot.
2. Click on Define
Range
After entering the dependent variable and clicking on “Define Range” as shown above, SPSS will
open the following box:
SPSS Snapshot DA 2
1. Enter Minimum as
‘0’ and Maximum as ‘1’
After defining the variable, one should click on ‘Continue’ button as shown above. SPSS will be back
to the previous box shown below
SPSS Snapshot DA 3
After selecting the descriptives SPSS will go back to the previous box shown below:
SPSS Snapshot DA 5
2.Next, click on
classify
After clicking on classify SPSS will open a box as shown below,
SPSS Snapshot DA 6
2. Click
continue
After clicking ‘Continue’, SPSS will be back to previous box as shown in the Snapshot DA 5, then
click on the save button at the bottom. SPSS will open a box as shown below.
SPSS Snapshot DA 7
2. Click continue
After clicking Continue, SPSS will again go back to the previous window shown in Snapshot DA 6. At
this stage one may click the OK button. SPSS will then analyse the data, and the output will be
displayed in the output viewer of SPSS.
Output for Enter Method
This table gives the case processing summary, i.e. how many valid cases were selected, how many were
excluded (due to missing data), the total, and their respective percentages.
Group Statistics
                                                                       Valid N (listwise)
Previously defaulted                   Mean      Std. Deviation    Unweighted   Weighted
No     Age in years                    35.5145   7.70774           517          517.000
       Years with current employer     9.5087    6.66374           517          517.000
       Years at current address        8.9458    7.00062           517          517.000
       Household income in thousands   47.1547   34.22015          517          517.000
       Debt to income ratio (x100)     8.6793    5.61520           517          517.000
       Credit card debt in thousands   1.2455    1.42231           517          517.000
       Other debt in thousands         2.7734    2.81394           517          517.000
Yes    Age in years                    33.0109   8.51759           183          183.000
       Years with current employer     5.2240    5.54295           183          183.000
       Years at current address        6.3934    5.92521           183          183.000
       Household income in thousands   41.2131   43.11553          183          183.000
       Debt to income ratio (x100)     14.7279   7.90280           183          183.000
       Credit card debt in thousands   2.4239    3.23252           183          183.000
       Other debt in thousands         3.8628    4.26368           183          183.000
Total  Age in years                    34.8600   7.99734           700          700.000
       Years with current employer     8.3886    6.65804           700          700.000
       Years at current address        8.2786    6.82488           700          700.000
       Household income in thousands   45.6014   36.81423          700          700.000
       Debt to income ratio (x100)     10.2606   6.82723           700          700.000
       Credit card debt in thousands   1.5536    2.11720           700          700.000
       Other debt in thousands         3.0582    3.28755           700          700.000
This table gives the group statistics of the independent variables for each category (here, 'Yes' and 'No') of the dependent variable.
Tests of Equality of Group Means
                                 Wilks' Lambda   F         df1   df2   Sig.
Age in years                     .981            13.482    1     698   .000
Years with current employer      .920            60.759    1     698   .000
Years at current address         .973            19.402    1     698   .000
Household income in thousands    .995            3.533     1     698   .061
Debt to income ratio (x100)      .848            124.889   1     698   .000
Credit card debt in thousands    .940            44.472    1     698   .000
Other debt in thousands          .979            15.142    1     698   .000
This table gives the Wilks' lambda test for each independent variable. If it is significant (< 0.05 or 0.01), the mean of the respective variable is different for the two groups (in this case, previously defaulted and previously not defaulted). An insignificant value indicates that the variable does not differ between the groups, or in other words does not discriminate the dependent variable. In the above example, all the variables are significant except household income in thousands. This implies that default on the loan does not depend on household income.
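The F values in this table can be reproduced from the group statistics reported earlier. The following Python sketch computes the one-way ANOVA F statistic for 'Age in years' from the per-group means, standard deviations, and group sizes (values copied from the tables above):

```python
def anova_f(groups):
    """One-way ANOVA F statistic from per-group (mean, sd, n) summaries."""
    n_total = sum(n for _, _, n in groups)
    grand_mean = sum(m * n for m, _, n in groups) / n_total
    # Between-groups and within-groups sums of squares
    ss_between = sum(n * (m - grand_mean) ** 2 for m, _, n in groups)
    ss_within = sum((n - 1) * s ** 2 for _, s, n in groups)
    df1, df2 = len(groups) - 1, n_total - len(groups)
    return (ss_between / df1) / (ss_within / df2)

# 'Age in years': (mean, sd, n) for the No and Yes groups from the SPSS tables
f_age = anova_f([(35.5145, 7.70774, 517), (33.0109, 8.51759, 183)])
print(round(f_age, 2))  # ~13.48, matching the F reported by SPSS
```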
Analysis 1
Box's Test of Equality of Covariance Matrices
Log Determinants
Previously defaulted     Rank   Log Determinant
No                       7      21.292
Yes                      7      24.046
Pooled within-groups     7      22.817
The ranks and natural logarithms of determinants printed are those of the group covariance matrices.
Test Results
Box's M        563.291
F   Approx.    19.819
    df1        28
    df2        431743.0
    Sig.       .000
Tests null hypothesis of equal population covariance matrices.
This table indicates that Box's M is significant, which means the assumption of equality of covariance matrices may not hold. This is a caution for interpreting the results.
Eigenvalues
Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          .404(a)      100.0           100.0          .536
a. First 1 canonical discriminant functions were used in the analysis.
This table gives the summary of the canonical discriminant function. The eigenvalue for this model is 0.404 and the canonical correlation is 0.536. Since there is a single discriminant function, all of the explained variation is contributed by that function. The square of the canonical correlation, 0.536² ≈ 0.287, gives the proportion of variation in the dependent variable explained by the model.
Wilks' Lambda
Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .712            235.447      7    .000
This table tests the significance of the model. As seen in the Sig. column, the model is significant.
Standardized Canonical Discriminant Function Coefficients
                                 Function 1
Age in years                     .122
Years with current employer      -.829
Years at current address         -.310
Household income in thousands    .215
Debt to income ratio (x100)      .603
Credit card debt in thousands    .564
Other debt in thousands          -.178
Structure Matrix
                                 Function 1
Debt to income ratio (x100)      .666
Years with current employer      -.464
Credit card debt in thousands    .397
Years at current address         -.262
Other debt in thousands          .232
Age in years                     -.219
Household income in thousands    -.112
Pooled within-groups correlations between discriminating variables and standardized canonical discriminant functions.
Variables ordered by absolute size of correlation within function.
This table gives the simple correlations between the independent variables and the discriminant function. A high correlation translates into high discriminating power.
Canonical Discriminant Function Coefficients
                                 Function 1
Age in years                     .015
Years with current employer      -.130
Years at current address         -.046
Household income in thousands    .006
Debt to income ratio (x100)      .096
Credit card debt in thousands    .275
Other debt in thousands          -.055
(Constant)                       -.576
Unstandardized coefficients
This table gives the unstandardized canonical discriminant function coefficients. A negative sign indicates an inverse relation. For example, the coefficient of years with current employer is -0.130, meaning that the more years a person has spent with the current employer, the lower the chance that the person will default.
Functions at Group Centroids
                        Function 1
Previously defaulted
No                      -.377
Yes                     1.066
Unstandardized canonical discriminant functions evaluated at group means.
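A sketch of how these unstandardized coefficients are used: the discriminant score of a case is the constant plus the coefficient-weighted sum of its variable values, and the case is assigned to the group with the nearer centroid. The applicant's variable values below are made up for illustration.

```python
# Unstandardized canonical discriminant function coefficients from the table above
coef = {
    "age": 0.015, "employ": -0.130, "address": -0.046, "income": 0.006,
    "debtinc": 0.096, "creddebt": 0.275, "othdebt": -0.055,
}
constant = -0.576
centroids = {"No": -0.377, "Yes": 1.066}  # functions at group centroids

# Hypothetical applicant (values invented for illustration only)
applicant = {
    "age": 40, "employ": 12, "address": 10, "income": 50,
    "debtinc": 8.0, "creddebt": 1.0, "othdebt": 2.0,
}

# Discriminant score = constant + sum of coefficient * value
score = constant + sum(coef[k] * applicant[k] for k in coef)

# Assign to the group whose centroid is nearest to the score
group = min(centroids, key=lambda g: abs(score - centroids[g]))
print(round(score, 3), group)  # → -0.763 No
```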
Classification Statistics
Classification Processing Summary
Processed                                            850
Excluded   Missing or out-of-range group codes       0
           At least one missing discriminating variable   0
Used in Output                                       850
Classification Results(b,c)
                                           Predicted Group Membership
                  Previously defaulted     No      Yes     Total
Original   Count  No                       393     124     517
                  Yes                      44      139     183
                  Ungrouped cases          99      51      150
           %      No                       76.0    24.0    100.0
                  Yes                      24.0    76.0    100.0
                  Ungrouped cases          66.0    34.0    100.0
Cross-     Count  No                       391     126     517
validated(a)      Yes                      47      136     183
           %      No                       75.6    24.4    100.0
                  Yes                      25.7    74.3    100.0
a. Cross validation is done only for those cases in the analysis. In cross validation, each case is classified by the functions derived from all cases other than that case.
b. 76.0% of original grouped cases correctly classified.
c. 75.3% of cross-validated grouped cases correctly classified.
This is the classification matrix, or confusion matrix. It gives the percentage of cases that are classified correctly, i.e. the hit ratio. This hit ratio should be at least 25% more than the proportion expected by chance alone.
In the above example, 532 of the 700 cases are classified correctly.
Overall, 76% of the cases are classified correctly.
139 out of 183 defaulters were identified correctly.
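The hit ratio can be computed directly from the counts in the classification matrix (original cases only, ungrouped cases excluded); the code below simply reproduces the arithmetic.

```python
# Confusion matrix from the classification table above:
#                 (predicted No, predicted Yes)
matrix = {"No":  (393, 124),   # actual No  (517 cases)
          "Yes": (44, 139)}    # actual Yes (183 cases)

# Correct classifications lie on the diagonal
correct = matrix["No"][0] + matrix["Yes"][1]
total = sum(sum(row) for row in matrix.values())
hit_ratio = correct / total
print(correct, total, round(hit_ratio, 3))  # → 532 700 0.76
```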
Output for Stepwise Method
All the steps of the ‘enter’ method are very similar to stepwise method except that the method to be
selected is stepwise. We have shown it in SPSS Snapshot DA3
If one selects the stepwise method, one also needs to select the method which SPSS should use to
select best set of independent variables.
This can be selected by clicking method button from Snapshot DA 8 shown below
SPSS Snapshot DA 8
1. select stepwise
method
2. Click on method
Variables Entered/Removed(a,b,c,d)
                                        Min. D Squared
                                                   Between           Exact F
Step  Entered                           Statistic  Groups       Statistic  df1  df2      Sig.
1     Debt to income ratio (x100)       .924       No and Yes   124.889    1    698.000  .000
2     Years with current employer       1.501      No and Yes   101.287    2    697.000  .000
3     Credit card debt in thousands     1.926      No and Yes   86.502     3    696.000  .000
4     Years at current address          2.038      No and Yes   68.572     4    695.000  .000
At each step, the variable that maximizes the Mahalanobis distance between the two closest groups is entered.
a. Maximum number of steps is 14.
b. Minimum partial F to enter is 3.84.
c. Maximum partial F to remove is 2.71.
d. F level, tolerance, or VIN insufficient for further computation.
Variables in the Analysis
Step                                Tolerance   F to Remove   Min. D Squared   Between Groups
1   Debt to income ratio (x100)     1.000       124.889
2   Debt to income ratio (x100)     .992        130.539       .450             No and Yes
    Years with current employer     .992        66.047        .924             No and Yes
3   Debt to income ratio (x100)     .766        35.888        1.578            No and Yes
    Years with current employer     .716        111.390       .947             No and Yes
    Credit card debt in thousands   .572        44.336        1.501            No and Yes
4   Debt to income ratio (x100)     .766        35.000        1.693            No and Yes
    Years with current employer     .691        89.979        1.213            No and Yes
    Credit card debt in thousands   .564        48.847        1.565            No and Yes
    Years at current address        .898        11.039        1.926            No and Yes
Wilks' Lambda
Number of
Step Variables Lambda df1 df2 df3 Exact F
These tables give a summary of the variables that are in the analysis, the variables that are not in the analysis, and the model at each step, along with its significance.
4 Logistic Regression
In logistic regression, the relationship between the dependent variable and the independent variables is not linear. It is of the type

p = 1 / (1 + e^(-y))

where p is the probability of 'success', i.e. of the dichotomous variable taking the value 1, and (1 - p) is the probability of 'failure', i.e. of it taking the value 0, with y = a + bx.
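A minimal Python sketch of this function shows how p stays between 0 and 1 as y varies:

```python
import math

def logistic(y):
    """p = 1 / (1 + e^(-y)): maps any real y to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-y))

# p rises from near 0 to near 1 as y = a + bx increases
for y in (-4, -2, 0, 2, 4):
    print(y, round(logistic(y), 3))
```

Note that logistic(0) = 0.5, and that logistic(-y) = 1 - logistic(y), so the curve is symmetric about y = 0.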
[Graph: Probability p as a function of y. The curve is S-shaped (sigmoid), rising from p near 0 for large negative y, through p = 0.5 at y = 0, to p near 1 for large positive y.]
The logistic equation (8) can be reduced to a linear form by converting the probability p into the log of p/(1 - p), or logit, as follows:
CGPA 3.12 3.21 3.15 3.45 3.14 3.25 3.16 3.28 3.22 3.41
CGPA 3.48 3.34 3.25 3.46 3.32 3.29 3.42 3.28 3.36 3.31
Now, given this data, can we find the probability of a student succeeding in the first interview given
the CGPA?
[Scatter plot: Success in Interview (0 or 1) against CGPA, with fitted line y = 1.8264x - 5.3679, R² = 0.1706.]
y = 1.83 x – 5.37
Instead, let us now attempt to fit a logistic regression to the student data. We will do this by computing the logits and then fitting a linear model to the logits. To compute the logits, we regroup the data by CGPA into intervals, using the midpoint of each interval for the independent variable. We calculate the probability of success based on the number of students that passed the interview in each range of CGPAs. This results in the following data:
Class Interval   Middle Point of   Probability of   Logit
(CGPA)           Class Interval    Success (p)      ln{p / (1 - p)}
We plot the logit against the CGPA and then look for the linear fit, which gives us the equation y = 8.306x - 26.78.
Thus, if p is the probability of passing the interview and x is the CGPA, the logistic regression can be expressed as:

ln(p / (1 - p)) = 8.306x - 26.78
Converting the logarithm to an equivalent exponential form, this equation can also be expressed as:

p = e^(8.306x - 26.78) / (1 + e^(8.306x - 26.78))
x    2.5     2.6     2.7     2.8     2.9     3.0     3.1     3.2     3.3    3.4    3.5    3.6    3.7    3.8    3.9    4.0
y*   -6.015  -5.184  -4.354  -3.523  -2.693  -1.862  -1.031  -0.201  0.63   1.46   2.291  3.122  3.952  4.783  5.613  6.444
p    0.002   0.006   0.013   0.029   0.063   0.134   0.263   0.45    0.652  0.812  0.908  0.958  0.981  0.992  0.996  0.998
* Up to 3 decimal places
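The values of p in this table can be reproduced from the fitted equation. A short Python sketch using the coefficients 8.306 and -26.78 from the text:

```python
import math

def prob_success(cgpa):
    """Fitted model from the text: logit = 8.306 * CGPA - 26.78."""
    y = 8.306 * cgpa - 26.78
    return 1.0 / (1.0 + math.exp(-y))

# Reproduce a few entries of the table above
for x in (3.0, 3.2, 3.6):
    print(x, round(prob_success(x), 3))  # matches the p row: 0.134, 0.45, 0.958
```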
This can be displayed graphically as an S-shaped (sigmoid) curve of p against x.
From this regression model, we can see that probability of success at the interview is below 25% for
CGPAs below 2.90 but is above 75% for CGPAs above 3.60.
While one could apply logistic regression to a number of situations, it has been found useful particularly in the following:
Credit – study of the creditworthiness of an individual or a company. Various demographic and credit-history variables could be used to predict whether an individual will turn out to be a 'good' or 'bad' customer.
Marketing / Market Segmentation – study of the purchasing behaviour of consumers. Various demographic and purchasing information could be used to predict whether an individual will purchase an item or not.
Customer loyalty – analysis to identify loyal or repeat customers using various demographic and purchasing information.
Medical – study of the risk of diseases / body disorders.
4.1 Assumptions of Logistic Regression
Multiple regression makes assumptions such as linearity and normality; these are not required for logistic regression. Discriminant analysis requires the independent variables to be metric, which is not necessary for logistic regression. This makes the technique superior to discriminant analysis in such settings. The only care to be taken is that there are no extreme observations in the data.
4.2 Key Terms of Logistic Regression
The following are the key terms used in logistic regression:
Factor A categorical independent variable in logistic regression is termed a factor. The
factor is dichotomous in nature, and is usually converted into a dummy
variable.
Covariate An independent variable that is metric in nature is termed a covariate.
Maximum Likelihood This method is used in logistic regression to estimate the odds ratio for the
Estimation dependent variable. In least squares estimation, the square of the error is
minimized; in maximum likelihood estimation, the log likelihood is
maximized.
Significance Test The Hosmer and Lemeshow chi-square test is used to test the overall
goodness of fit of the model. It is a modified chi-square test, which is better
than the traditional chi-square test. The Pearson chi-square test and the
likelihood ratio test are used in multinomial logistic regression to
assess the model's goodness of fit.
Stepwise logistic In stepwise logistic regression, the three methods available are enter,
regression backward, and forward. In the enter method, all variables are included in the
logistic regression, irrespective of whether a variable is significant or
insignificant. In the backward method, the model starts with all variables
and removes nonsignificant variables from the list. In the forward method,
the logistic regression starts with a single variable, adds variables one by
one, tests significance, and removes insignificant variables from
the model.
Odds Ratio Exponential beta (Exp(B)) in logistic regression gives the odds ratio for the
dependent variable. The probability of the dependent variable can be
computed from this odds ratio. When the exponential beta value is
greater than one, the probability of the higher category increases; when
the exponential beta value is less than one, the probability
of the higher category decreases.
Measures of Effect In logistic regression, R2 is not applicable, because R2 measures the
Size variance extracted by the independent variables. The maximum value of
the Cox and Snell R-squared statistic is actually somewhat less than 1;
the Nagelkerke R-squared statistic is a "correction" of the Cox and Snell
statistic so that its maximum value is 1.
Classification Table The classification table shows the practical results of using the logistic
regression model. It is useful for assessing the validity of the model.
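The link between the odds ratio and probability can be sketched as follows; the baseline probability and the Exp(B) values below are arbitrary illustrations, not taken from any particular model.

```python
def update_odds(p, exp_b):
    """Apply one odds ratio Exp(B) to a baseline probability p.

    Exp(B) is the factor by which the odds p/(1-p) are multiplied when
    the predictor increases by one unit; the result is converted back
    to a probability.
    """
    odds = p / (1 - p)
    new_odds = odds * exp_b
    return new_odds / (1 + new_odds)

# Exp(B) > 1 raises the probability of the higher category; Exp(B) < 1 lowers it
print(round(update_odds(0.5, 2.0), 3))  # → 0.667
print(round(update_odds(0.5, 0.5), 3))  # → 0.333
```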
LR Snapshot 1
LR Snapshot 2
1.Select dependent
variable ‘previously
defaulted’
2.Click on
Continue
SPSS will go back to the window displayed in LR Snapshot 2; at this stage click on Options. The
following window will open:
LR Snapshot 4
2. Click
Continue
1.Select Hosmer-
Lemeshow
Goodness of fit
SPSS will be back to the window shown in LR Snapshot 2. At this stage click OK. The following
output will be displayed.
Logistic Regression
Case Processing Summary
Unweighted Cases(a)                       N     Percent
Selected Cases   Included in Analysis     700   82.4
                 Missing Cases            150   17.6
                 Total                    850   100.0
Unselected Cases                          0     .0
Total                                     850   100.0
a. If weight is in effect, see classification table for the total number of cases.
This table gives the case processing summary: 700 out of 850 cases are used for the analysis, and 150 are ignored as they have missing values.
Dependent Variable Encoding
This table indicates the coding for the dependent variable: 0 => not defaulted, 1 => defaulted.
Block 0: Beginning Block
Variables not in the Equation
                                Score     df   Sig.
Step 0   Variables   age        13.265    1    .000
                     employ     56.054    1    .000
                     address    18.931    1    .000
                     income     3.526     1    .060
                     debtinc    106.238   1    .000
                     creddebt   41.928    1    .000
                     othdebt    14.863    1    .000
         Overall Statistics     201.271   7    .000
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     102.935      1    .000
         Block    102.935      1    .000
         Model    102.935      1    .000
Step 2   Step     70.346       1    .000
         Block    173.282      2    .000
         Model    173.282      2    .000
Step 3   Step     55.446       1    .000
         Block    228.728      3    .000
         Model    228.728      3    .000
Step 4   Step     18.905       1    .000
         Block    247.633      4    .000
         Model    247.633      4    .000
Model Summary
The Hosmer-Lemeshow statistic indicates a poor fit if its significance value is less than 0.05. Here, since the value is above 0.05, the model adequately fits the data.
Classifi cati on Tablea
Predicted
This table is the classification table. It indicates the number of cases correctly classified as well as
incorrectly classified. Diagonal elements represent correctly classified cases and non-diagonal
elements represent incorrectly classified cases.
It may be noted that at each step, the number of correctly classified cases improves over the previous step. The last column gives the percentage of correctly classified cases, which improves at each step.
Variables in the Equation
                        B        S.E.    Wald     df   Sig.    Exp(B)
Step 3(c)   employ      -.244    .027    80.262   1    .000    .783
            debtinc     .088     .018    23.328   1    .000    1.092
            creddebt    .503     .081    38.652   1    .000    1.653
            Constant    -1.227   .231    28.144   1    .000    .293
Step 4(d)   employ      -.243    .028    74.761   1    .000    .785
            address     -.081    .020    17.183   1    .000    .922
            debtinc     .088     .019    22.659   1    .000    1.092
            creddebt    .573     .087    43.109   1    .000    1.774
            Constant    -.791    .252    9.890    1    .002    .453
a. Variable(s) entered on step 1: debtinc.
b. Variable(s) entered on step 2: employ .
c. Variable(s) entered on step 3: creddebt.
d. Variable(s) entered on step 4: address.
The best model is usually the last one, i.e. the Step 4 model. It contains the variables years with current employer, years at current address, debt-to-income ratio, and credit card debt. All other variables are insignificant in the model.
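A sketch of how the Step 4 coefficients (the B column above) translate into a predicted probability of default through the logistic link; the applicant's values below are hypothetical:

```python
import math

# Step 4 coefficients from the 'Variables in the Equation' table (B column).
B = {"constant": -0.791, "employ": -0.243, "address": -0.081,
     "debtinc": 0.088, "creddebt": 0.573}

def p_default(employ, address, debtinc, creddebt):
    """Predicted probability of default via the logistic link."""
    logit = (B["constant"] + B["employ"] * employ + B["address"] * address
             + B["debtinc"] * debtinc + B["creddebt"] * creddebt)
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical applicant: 5 years with current employer, 2 years at current
# address, debt-to-income ratio 10, credit card debt 2.
p = p_default(employ=5, address=2, debtinc=10, creddebt=2)
```

Note how the signs behave as the table suggests: longer employment lowers the predicted probability, while a higher debt-to-income ratio or more credit card debt raises it.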
Model if Term Removed

                     Model Log     Change in -2 Log         Sig. of the
         Variable    Likelihood    Likelihood          df   Change
Step 1   debtinc       -402.182         102.935         1   .000
Step 2   employ        -350.714          70.346         1   .000
         debtinc       -369.708         108.332         1   .000
Variables not in the Equation

                               Score    df    Sig.
Step 1   Variables  age       16.478     1    .000
                    employ    60.934     1    .000
                    address   23.474     1    .000
                    income     3.219     1    .073
                    creddebt   2.261     1    .133
                    othdebt    6.631     1    .010
         Overall Statistics  113.910     6    .000
Step 2   Variables  age         .006     1    .939
                    address    8.407     1    .004
                    income    21.437     1    .000
                    creddebt  64.958     1    .000
                    othdebt    4.503     1    .034
         Overall Statistics   84.064     5    .000
Step 3   Variables  age         .635     1    .426
                    address   17.851     1    .000
                    income      .773     1    .379
                    othdebt     .006     1    .940
         Overall Statistics   22.221     4    .000
Step 4   Variables  age        3.632     1    .057
                    income      .012     1    .912
                    othdebt     .320     1    .572
         Overall Statistics    4.640     3    .200
The above table gives the score statistics for the variables not yet included in the model at each step. A significant score indicates that adding that variable would significantly improve the model; after Step 4, none of the remaining variables is significant.
Another example could be to assess the impact of two training programmes, conducted for a group of employees, on their knowledge (K) as well as motivation (M) relevant to their job. While one programme was mostly based on 'classroom' training, the other was mostly based on 'on-the-job' training.
The data collected could be as follows:
Class Room No Training Job Based
K M K M K M
1 92 98 70 75 83 90
2 88 77 56 66 65 76
3 91 88 89 90 93 91
4 85 82 87 84 90 85
5 88 85 72 71 77 73
6 81 82 74 71 89 81
7 92 83 75 75 84 78
8 88 90 80 68 85 76
9 80 79 78 65 73 80
10 84 87 72 75 83 81
In this case one of the conclusions drawn was that both the programmes had positive impact on both
knowledge and motivation but there was no significant difference between classrooms based and job
based training programmes.
As yet another example, one could assess whether a change from Compensation System-1 to Compensation System-2 has brought about changes in sales, profit and job satisfaction in an organisation.
MANOVA is typically used when there is more than one dependent variable, and the independent variables are qualitative/categorical.
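The core MANOVA test statistic can be sketched for the simplest one-way case. The two-group data below are simulated, and only Wilks' lambda (one of several multivariate statistics SPSS reports) is computed:

```python
import numpy as np

# Sketch: Wilks' lambda for a one-way MANOVA with two dependent variables,
# computed from within- and between-group SSCP matrices (simulated data).
def wilks_lambda(groups):
    """groups: list of (n_i x p) arrays, one per level of the factor."""
    allobs = np.vstack(groups)
    grand = allobs.mean(axis=0)
    # Within-group sums of squares and cross-products (SSCP).
    W = sum((g - g.mean(axis=0)).T @ (g - g.mean(axis=0)) for g in groups)
    # Between-group SSCP, weighted by group sizes.
    B = sum(len(g) * np.outer(g.mean(axis=0) - grand, g.mean(axis=0) - grand)
            for g in groups)
    return np.linalg.det(W) / np.linalg.det(W + B)  # small value => reject H0

rng = np.random.default_rng(0)
g1 = rng.normal([0, 0], 1, size=(20, 2))
g2 = rng.normal([3, 3], 1, size=(20, 2))   # clearly separated group means
lam = wilks_lambda([g1, g2])
```

With well-separated group means, lambda is far below 1, mirroring the rejected null hypotheses discussed below; identical groups would give lambda equal to 1.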
1.Select Dependent
variables as invest_SM
and invest_CM
It may be noted that the above example is of MANCOVA (multivariate analysis of covariance), as we have selected some categorical independent variables and some metric variables (covariates).
We are assuming in the above example that the dependent variables are the investments in
commodity market and in share market and the categorical independent variables are occupation and
how long the respondents block investments. The metric independent variables are age, respondent’s
rating for commodity market and share market. Here we assume that their investments depend on
their ratings, occupation, age and how long they block their investments.
The following output will be displayed
General Linear Model
Between-Subjects Factors

                                        Value Label       N
Occupation                       1      Self Employed    11
                                 2      Govt             15
                                 3      Student           4
                                 4      House Wife       14
how_much_time_block_your_money   1      < 6 months        6
                                 2      6 to 12 months    8
                                 3      1 to 3 years      5
                                 4      > 3 years        10
                                 5                        4
                                 6                        6
                                 7                        3
                                 8                        2
This table indicates that the null hypothesis that the investments are equal for all occupations is rejected, since the significance value (p-value) is less than 0.05. Thus we may conclude at the 5% level of significance (LOS) that the two investments (in the share market and the commodity market) differ significantly across the occupations of the respondents.
The null hypothesis that the investments are equal for the different lengths of time for which the money is blocked is also rejected, since the significance value (p-value) is less than 0.05. Thus we may conclude at 5% LOS that the two investments differ significantly across the periods for which the respondents are likely to block their money.
The other hypotheses, about age, rating of CM and rating of SM, are not rejected (as the p-values are greater than 0.05). This means there is no significant difference in the investments for these variables.
Tests of Between-Subjects Effects

Source                  Dependent     Type III Sum    df    Mean Square       F      Sig.
                        Variable      of Squares
Corrected Model         Invest_SM     1.014E+011a     23    4407960763      4.250    .001
                        Invest_CM     1.193E+010b     23    518647980.2     2.033    .057
Intercept               Invest_SM     64681514.2       1    64681514.21      .062    .805
                        Invest_CM     532489379        1    532489379.0     2.087    .164
Age                     Invest_SM     313752666        1    313752666.0      .302    .588
                        Invest_CM     472901520        1    472901520.3     1.854    .189
Rate_CM                 Invest_SM     224388812        1    224388812.0      .216    .647
                        Invest_CM     526600356        1    526600355.7     2.064    .166
Rate_SM                 Invest_SM     528516861        1    528516860.7      .510    .484
                        Invest_CM     76132638.6       1    76132638.56      .298    .591
Occupation              Invest_SM     4.254E+010       3    1.418E+010     13.672    .000
                        Invest_CM     3131507520       3    1043835840      4.092    .020
how_much_time_block_    Invest_SM     1.715E+010       7    2450532038      2.362    .062
your_money              Invest_CM     2659656810       7    379950972.8     1.489    .227
Occupation * how_much_  Invest_SM     2.190E+010      10    2190181184      2.111    .074
time_block_your_money   Invest_CM     2642792207      10    264279220.7     1.036    .450
Error                   Invest_SM     2.075E+010      20    1037276941
                        Invest_CM     5102386227      20    255119311.4
Total                   Invest_SM     2.109E+011      44
                        Invest_CM     2.780E+010      44
Corrected Total         Invest_SM     1.221E+011      43
                        Invest_CM     1.703E+010      43
a. R Squared = .830 (Adjusted R Squared = .635)
b. R Squared = .700 (Adjusted R Squared = .356)
6 Factor Analysis
Factor Analysis is an interdependence technique. In interdependence techniques the variables are not classified as independent or dependent; instead, their interrelationships are studied. Factor analysis is a general name for two different techniques, namely Principal Component Analysis and Common Factor Analysis.
The Factor analysis originated about a century ago when Charles Spearman propounded that the
results of a wide variety of mental tests could be explained by a single underlying intelligence factor.
The factor analysis is done principally for two reasons:
To identify a new, smaller set of uncorrelated variables to be used in subsequent multiple regression analysis. In this situation, Principal Component Analysis is performed on the data. PCA considers the total variance in the data while finding principal components from a given set of variables.
To identify underlying dimensions/factors that are unobservable but explain correlations among a set of variables. In this situation, Common Factor Analysis is performed on the data. FA considers only the common variance while finding common factors from a given set of variables. Common factor analysis is also termed Principal Axis Factoring.
The essential purpose of factor analysis is to describe, if possible, the covariance relationships among
many variables in terms of few underlying, but unobservable, random quantities called factors.
Basically, the factor model is motivated by the following argument. Suppose variables can be
grouped by their correlations. That is, all variables, within a particular group are highly correlated
among themselves but have relatively small correlations with variables in a different group. In that
case, it is conceivable that each group of variables represents a single underlying construct, or factor,
that is responsible for the correlations.
Rotation does not affect communalities and percentage of total variance explained. However the
percentage of variance accounted by each factor changes. The variance explained by individual
factors is redistributed by rotation.
The rotation is said to be orthogonal if the axes maintain a right angle. Orthogonal rotations (such as varimax) are the most widely used rotational methods, and are the preferred method when the research goal is data reduction to either a smaller number of variables or a set of uncorrelated measures for subsequent use in other multivariate techniques.
Varimax procedure is a type of orthogonal rotation wherein it maximizes the variance of each of the
factors, so that the amount of variance accounted for is redistributed over the extracted factors. This
is the most popular method of rotation.
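A minimal sketch of the varimax idea (an assumed implementation, not SPSS's exact algorithm) shows both the rotation and the fact noted above that communalities are unchanged:

```python
import numpy as np

# Sketch: varimax seeks an orthogonal rotation of the loading matrix that
# maximises the variance of the squared loadings (assumed implementation).
def varimax(L, n_iter=100, tol=1e-8):
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(n_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - Lr @ np.diag((Lr ** 2).sum(axis=0)) / p))
        R = u @ vt                     # best orthogonal update
        if s.sum() < d * (1 + tol):    # no further improvement
            break
        d = s.sum()
    return L @ R

# Two clean single-factor loading patterns, deliberately mixed by a
# 45-degree rotation; varimax should recover the simple structure.
A = np.array([[0.9, 0.0], [0.8, 0.1], [0.1, 0.9], [0.0, 0.8]])
c, s_ = np.cos(np.pi / 4), np.sin(np.pi / 4)
mixed = A @ np.array([[c, -s_], [s_, c]])
rotated = varimax(mixed)
```

The row sums of squared loadings (the communalities) of `rotated` equal those of `A`, illustrating that rotation only redistributes the variance among the factors.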
It may be noted that in the above diagram, the factors can be easily interpreted after the orthogonal
rotation. Variables v2,v3 and v6 contribute to factor 1 and v1,v4 and v5 contribute to factor 2.
The rotation is said to be oblique if the axes do not maintain a right angle. Oblique rotation is best suited to the goal of obtaining several theoretically meaningful factors or constructs because, realistically, very few constructs in the "real world" are uncorrelated.
Exploratory Factor Analysis (EFA): This technique is used when a researcher has no prior knowledge about the number of factors the variables will be indicating. In such cases computer-based techniques are used to indicate the appropriate number of factors.
Confirmatory Factor Analysis (CFA): This technique is used when the researcher has prior knowledge (on the basis of some pre-established theory) about the number of factors the variables will be indicating. This makes it easy, as there is no decision to be taken about the number of factors and the number is indicated in the computer-based tool while conducting the analysis.
Correlation Matrix: This is the matrix showing simple correlations between all possible pairs of variables. The diagonal elements of this matrix are 1, and it is a symmetric matrix, since the correlation between two variables x and y is the same as that between y and x.
Communality The amount of variance, an original variable shares with all other variables included in
the analysis. A relatively high communality indicates that a variable has much in
common with the other variables taken as a group.
Eigenvalue Eigenvalue for each factor is the total variance explained by each factor.
Factor A linear combination of the original variables. Factor also represents the underlying
dimensions( constructs) that summarise or account for the original set of observed
variables
Factor Loadings The factor loadings, or component loadings in PCA, are the correlation coefficients
between the variables (given in output as rows ) and factors (given in output columns)
These loadings are analogous to Pearson’s correlation coefficient r, the squared factor
loading is defined as the percent of variance in the respective variable explained by the
factor.
Factor Matrix This contains factor loadings on all the variables on all the factors extracted
Factor Plot or This is a plot where the factors are on different axis and the variables are drawn on these
Rotated Factor axes. This plot can be interpreted only if the number of factors are 3 or less
Space
Factor Scores Each individual observation has a score, or value, associated with each of the original
variables. Factor analysis procedures derive factor scores that represent each
observation’s calculated values, or score, on each of the factors. The factor score will
represent an individual’s combined response to the several variables representing the
factor.
The component scores may be used in subsequent analysis in PCA. When the factors
are to represent a new set of variables that they may predict or be dependent on some
phenomenon, the new input may be factor scores.
Goodness of a How well can a factor account for the correlations among the indicators ?
Factor One could examine the correlations among the indicators after the effect of the factor is
removed. For a good factor solution, the resulting partial correlations should be near
zero, because once the effect of the common factor is removed , there is nothing to link
the indicators.
Bartlett's Test of Sphericity: This is the test statistic used to test the null hypothesis that there is no correlation between the variables.
Kaiser Meyer Olkin (KMO) Measure of Sampling Adequacy: This is an index used to test the appropriateness of factor analysis. High values of this index (generally more than 0.5) indicate that factor analysis is appropriate, whereas lower values (less than 0.5) indicate that factor analysis may not be appropriate.
Scree Plot: A plot of eigenvalues against the factors in the order of their extraction.
Trace The sum of squares of the values on the diagonal of the correlation matrix used in the
factor analysis. It represents the total amount of variance on which the factor solution is
based.
There are a number of financial parameters/ratios for predicting health of a company. It would be
useful if only a couple of indicators could be formed as linear combination of the original
parameters/ratios in such a way that the few indicators extract most of the information contained in
the data on original variables.
Further, in a regression model, if the independent variables are correlated, implying there is multicollinearity, then new variables can be formed as linear combinations of the original variables which are themselves uncorrelated. The regression equation can then be derived with these new uncorrelated independent variables, and used for interpreting the regression coefficients as well as for predicting the dependent variable. This is highly useful in marketing and financial applications involving forecasting of sales, profit, price, etc. with the help of regression equations.
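The idea of replacing correlated predictors with uncorrelated linear combinations can be sketched with simulated data; the principal components come from the eigendecomposition of the correlation matrix:

```python
import numpy as np

# Sketch: principal components as uncorrelated linear combinations of the
# original variables. The data are simulated; x2 is deliberately almost
# collinear with x1 to mimic multicollinearity.
rng = np.random.default_rng(1)
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise each variable
R = np.corrcoef(Z, rowvar=False)           # correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)       # eigh returns ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

scores = Z @ eigvecs                       # new, mutually uncorrelated variables
pct_variance = 100 * eigvals / eigvals.sum()
```

The first component absorbs almost all of the x1/x2 overlap, so a couple of components could replace the three correlated predictors in a subsequent regression.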
Further, analysis of principal components often reveals relationships that were not previously suspected and
thereby allows interpretations that would not be ordinarily understood. A good example of this is provided by
stock market indices.
Incidentally, PCA is a means to an end and not the end in itself. PCA can be used for inputting
principal components as variables for further analysing the data using other techniques such as
cluster analysis, regression and discriminant analysis.
6.4 Common Factor Analysis
It is yet another example of a data reduction and summarization technique. It is a statistical approach used to analyse interrelationships among a large number of variables (e.g., test scores, test items, questionnaire responses) and then to explain these variables in terms of their common underlying dimensions (factors). For example, a hypothetical survey questionnaire may consist of 20
or even more questions, but since not all of the questions are identical, they do not all measure the
basic underlying dimensions to the same extent. By using factor analysis, we can identify the separate
dimensions being measured by the survey and determine a factor loading for each variable (test item)
on each factor.
Common Factor Analysis (unlike multiple regression, discriminant analysis, or canonical correlation, in which one or more variables are explicitly considered as the criterion or dependent variables and all others the predictor or independent variables) is an interdependence technique in which all variables are simultaneously considered. In a sense, each of the observed (original) variables is considered as a dependent variable that is a function of some underlying, latent, and hypothetical/unobserved set of factors (dimensions). One could also consider the original variables as reflective indicators of the factors. For example, marks (variable) in an examination reflect intelligence (factor).
The statistical approach followed in factor analysis involves finding a way of condensing the information
contained in a number of original variables into a smaller set of dimensions (factors) with a minimum loss of
information.
Common Factor Analysis was originally developed to explain students’ performance in various subjects and
to understand the link between grades and intelligence. Thus, the marks obtained in an examination reflect the student's intelligence quotient. A salesman's performance in terms of sales might reflect his attitude towards the job, and the efforts made by him.
One of the studies relating to marks obtained by students in various subjects, led to the conclusion
that students’ marks are a function of two common factors viz. Quantitative and Verbal abilities. The
quantitative ability factor explains marks in subjects like Mathematics, Physics and Chemistry, and
verbal ability explains marks in subjects like Languages and History.
In another study, a detergent manufacturing company was interested in identifying the major
underlying factors or dimensions that consumers used to evaluate various detergents. These factors
are assumed to be latent; however, management believed that the various attributes or properties of
detergents were indicators of these underlying factors. Factor analysis was used to identify these
underlying factors. Data was collected on several product attributes using a five-point scale. The
analysis of responses revealed the existence of two factors, viz. the ability of the detergent to clean and its mildness.
In general, the factor analysis performs the following functions:
Identifies the smallest number of common factors that best explain or account for the
correlation among the indicators
Identifies a set of dimensions that are latent ( not easily observed) in a large number of
variables
Devises a method of combining or condensing a large number of consumers with varying preferences into distinctly different groups.
Identifies and creates an entirely new, smaller set of variables to partially or completely replace the original set of variables for subsequent regression or discriminant analysis. It is especially useful in multiple regression analysis when multicollinearity is found to exist, as the number of independent variables is reduced by using factors, thereby minimising or avoiding multicollinearity. In fact, factors are used in lieu of the original variables in the regression equation.
Distinguishing Feature of Common Factor Analysis
Generally, the variables that we define in real-life situations reflect the presence of unobservable factors. These factors impact the values of those variables. For example, the marks obtained in an examination reflect the student's intelligence quotient. A salesman's performance in terms of sales might reflect his attitude towards the job and the efforts made by him.
Each of the above examples requires a scale, or an instrument, to measure the various
constructs (i.e., attitudes, image, patriotism, sales aptitude, and resistance to innovation). These
are but a few examples of the type of measurements that are desired by various business
disciplines. Factor analysis is one of the techniques that can be used to develop scales to
measure these constructs.
6.4.1 Applications of Common Factor Analysis
In one of the studies conducted by a group of students of a Management Institute, the group undertook a survey of 120 potential buyers outside retail outlets and at dealer counters. Opinions were solicited through a questionnaire on each of 20 parameters relating to a television.
Through the use of principal component analysis and factor analysis using computer software, the
group concluded that the following five parameters are most important out of the twenty parameters
on which their opinion was recorded. The five factors were:
Price (price, schemes and other offers)
Picture Quality
Brand Ambassador (Person of admiration)
Wide range
Information (Website use, Brochures, Friends’ recommendations )
In yet another study, another group of students of a Management Institute conducted a survey to
identify the factors that influence the purchasing decision of a motorcycle in the 125 cc category.
Through the use of Principal Component Analysis and factor analysis using computer software, the
group concluded that the following three parameters are most important:
Comfort
Assurance
Long Term Value
6.5 Factor Analysis on Data Using SPSS
We shall first explain PCA using SPSS and then Common Factor Analysis.
Principal Component Analysis Using SPSS
For illustration we will be using the file car_sales.sav. This file is part of the SPSS cases and is in the tutorial folder of SPSS; within the tutorial folder, it is in the sample_files folder. For the convenience of readers, we have provided this file on the CD with the book. This data file contains hypothetical sales estimates, list prices, and physical specifications for various makes and models of vehicles. The list prices and physical specifications were obtained alternately from edmunds.com and manufacturer sites. Following is the list of the major variables in the file.
1. Manufacturer
2. Model
3. Sales in thousands
4. 4-year resale value
5. Vehicle type
6. Price in thousands
7. Engine size
8. Horsepower
9. Wheelbase
10. Width
11. Length
12. Curb weight
13. Fuel capacity
14. Fuel efficiency
After opening the file Car_sales.sav, one can click on Analyse -> Data Reduction -> Factor as shown in the following snapshot.
FA Snapshot 1
Click on Descriptives and, in the dialog that opens, select Coefficients.
FA Snapshot 3
Click on Extraction. The new window that will appear is shown below.
FA Snapshot 4
In this window: 1. select the method as Principal Components; 2. select Correlation matrix; 3. select Unrotated factor solution; and then click on Continue.
SPSS will take you back to the window shown below.
FA Snapshot 5
Click on Rotation. In the window that opens: 1. select Varimax rotation; 2. select Display rotated solution; 3. click Continue.
SPSS will take you back to the window shown in FA Snapshot 5. Click on the button 'Scores'. This will open a new window as shown below.
FA Snapshot 7
1. Select Save as variables; 2. select the Regression method; 3. select Display factor score coefficient matrix; 4. click on Continue.
This will take you back to the window shown in FA Snapshot 5; in this window now click on OK. SPSS will give the following output. We shall explain each part in brief.
Factor Analysis
Correlation Matrix

                    Vehicle  Price in  Engine   Horse-  Wheel-                    Curb    Fuel      Fuel
                    type     thousands size     power   base    Width   Length   weight  capacity  efficiency
Vehicle type         1.000    -.042     .269     .017    .397    .260    .150     .526    .599      -.577
Price in thousands   -.042    1.000     .624     .841    .108    .328    .155     .527    .424      -.492
Engine size           .269     .624    1.000     .837    .473    .692    .542     .761    .667      -.737
Horsepower            .017     .841     .837    1.000    .282    .535    .385     .611    .505      -.616
Wheelbase             .397     .108     .473     .282   1.000    .681    .840     .651    .657      -.497
Width                 .260     .328     .692     .535    .681   1.000    .706     .723    .663      -.602
Length                .150     .155     .542     .385    .840    .706   1.000     .629    .571      -.448
Curb weight           .526     .527     .761     .611    .651    .723    .629    1.000    .865      -.820
Fuel capacity         .599     .424     .667     .505    .657    .663    .571     .865   1.000      -.802
Fuel efficiency      -.577    -.492    -.737    -.616   -.497   -.602   -.448    -.820   -.802      1.000
This is the correlation matrix. The PCA can be carried out if the correlation matrix for the variables contains at least two correlations of 0.30 or greater. It may be noted that many of the correlations here exceed 0.30.
The KMO measure of sampling adequacy is an index used to test the appropriateness of factor analysis. The minimum required KMO is 0.5. The above table shows that the index for this data is 0.833 and Bartlett's chi-square statistic is significant (<0.05). This means the principal component analysis is appropriate for this data.
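The KMO index itself can be sketched from a correlation matrix: it compares the squared off-diagonal correlations with the squared off-diagonal partial correlations, the latter obtained from the inverse correlation matrix. The 3x3 matrix below is hypothetical:

```python
import numpy as np

# Sketch of the overall KMO measure of sampling adequacy.
def kmo(R):
    Rinv = np.linalg.inv(R)
    d = np.sqrt(np.diag(Rinv))
    partial = -Rinv / np.outer(d, d)       # partial correlation matrix
    np.fill_diagonal(partial, 0.0)
    r_off = R - np.eye(len(R))             # zero out the diagonal of R
    r2, q2 = (r_off ** 2).sum(), (partial ** 2).sum()
    return r2 / (r2 + q2)

# Strongly intercorrelated variables give a KMO comfortably above 0.5:
R_good = np.array([[1.0, 0.8, 0.8],
                   [0.8, 1.0, 0.8],
                   [0.8, 0.8, 1.0]])
k = kmo(R_good)
```

When variables share a lot of common variance, the partial correlations shrink relative to the simple correlations, pushing the index towards 1 and past the 0.5 threshold mentioned above.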
Communalities

                    Initial    Extraction
Vehicle type         1.000       .930
Price in thousands   1.000       .876
Engine size          1.000       .843
Horsepower           1.000       .933
Wheelbase            1.000       .881
Width                1.000       .776
Length               1.000       .919
Curb weight          1.000       .891
Fuel capacity        1.000       .861
Fuel efficiency      1.000       .860
Extraction Method: Principal Component Analysis.
Extraction communalities are estimates of the variance in each variable accounted for by the
components. The communalities in this table are all high, which indicates that the extracted
components represent the variables well. If any communalities are very low in a principal
components extraction, you may need to extract another component.
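As a quick check on this table: an extraction communality is the sum of the variable's squared loadings on the extracted components. Using the 'Vehicle type' row of the component matrix reported later (.471, .533, -.651):

```python
# Communality of 'Vehicle type' from its three component loadings:
loadings = [0.471, 0.533, -0.651]
communality = sum(l * l for l in loadings)   # should be close to .930
```

The result matches the .930 shown for Vehicle type in the Extraction column above.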
Total Variance Explained

            Initial Eigenvalues                Extraction Sums of Squared Loadings   Rotation Sums of Squared Loadings
Component   Total   % of Variance  Cumul. %    Total   % of Variance  Cumul. %       Total   % of Variance  Cumul. %
1           5.994      59.938       59.938     5.994      59.938       59.938        3.220      32.199       32.199
2           1.654      16.545       76.482     1.654      16.545       76.482        3.134      31.344       63.543
3           1.123      11.227       87.709     1.123      11.227       87.709        2.417      24.166       87.709
4            .339       3.389       91.098
5            .254       2.541       93.640
6            .199       1.994       95.633
7            .155       1.547       97.181
8            .130       1.299       98.480
9            .091        .905       99.385
10           .061        .615      100.000
Extraction Method: Principal Component Analysis.
This output gives the total variance explained, i.e. the total variance contributed by each component. We can see that the percentage of total variance contributed by the first component is 59.938, by the second component 16.545 and by the third component 11.227. It is also clear from this table that there are three distinct components for the given set of variables.
Scree Plot
(Plot of the eigenvalues against component numbers 1 to 10.)
The scree plot gives the number of components against the eigenvalues and helps to determine
the optimal number of components.
Incidentally, "scree" is the geological term for the debris that collects on the lower part of a rocky slope.
The components on the steep part of the slope each explain a good percentage of the total variance; hence those components are justified. A shallow slope indicates that the contribution to total variance is small, and the component is not justified. In the above plot, the first three components lie on the steep slope and thereafter the slope is shallow. This indicates that the ideal number of components is three.
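The same decision is often made with the eigenvalue-greater-than-one (Kaiser) rule; a sketch using the eigenvalues from the table above:

```python
# Eigenvalues from the 'Total Variance Explained' table above.
eigenvalues = [5.994, 1.654, 1.123, 0.339, 0.254, 0.199,
               0.155, 0.130, 0.091, 0.061]

# Kaiser rule: retain components whose eigenvalue exceeds 1.
n_components = sum(1 for ev in eigenvalues if ev > 1)
pct_explained = 100 * sum(eigenvalues[:n_components]) / sum(eigenvalues)
```

Both rules agree here: three components are retained, jointly explaining about 87.7% of the total variance, as in the cumulative-% column of the table.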
Component Matrix(a)

                        Component
                    1        2        3
Vehicle type        .471     .533    -.651
Price in thousands  .580    -.729    -.092
Engine size         .871    -.290     .018
Horsepower          .740    -.618     .058
Wheelbase           .732     .480     .340
Width               .821     .114     .298
Length              .719     .304     .556
Curb weight         .934     .063    -.121
Fuel capacity       .885     .184    -.210
Fuel efficiency    -.863     .004     .339
Extraction Method: Principal Component Analysis.
a. 3 components extracted.
This table gives each variable's component loadings, but it is the next table which is easier to interpret.
Rotated Component Matrix(a)

                        Component
                    1        2        3
Vehicle type       -.101     .095     .954
Price in thousands  .935    -.003     .041
Engine size         .753     .436     .292
Horsepower          .933     .242     .056
Wheelbase           .036     .884     .314
Width               .384     .759     .231
Length              .155     .943     .069
Curb weight         .519     .533     .581
Fuel capacity       .398     .495     .676
Fuel efficiency    -.543    -.318    -.681
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 4 iterations.
This table is the most important table for interpretation. The maximum of each row (ignoring sign) indicates the component to which the respective variable belongs. The variables 'price in thousands', 'engine size' and 'horsepower' are highly correlated and contribute to a single component; 'wheelbase', 'width' and 'length' contribute to the second component; and 'vehicle type', 'curb weight' and 'fuel capacity' contribute to the third component.
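This interpretation rule (assign each variable to the component with the largest absolute rotated loading) can be sketched directly from the table:

```python
# Rotated loadings from the 'Rotated Component Matrix' above.
rotated = {
    "Vehicle type":       [-0.101,  0.095,  0.954],
    "Price in thousands": [ 0.935, -0.003,  0.041],
    "Engine size":        [ 0.753,  0.436,  0.292],
    "Horsepower":         [ 0.933,  0.242,  0.056],
    "Wheelbase":          [ 0.036,  0.884,  0.314],
    "Width":              [ 0.384,  0.759,  0.231],
    "Length":             [ 0.155,  0.943,  0.069],
    "Curb weight":        [ 0.519,  0.533,  0.581],
    "Fuel capacity":      [ 0.398,  0.495,  0.676],
    "Fuel efficiency":    [-0.543, -0.318, -0.681],
}

# Assign each variable to the component (1, 2 or 3) with the largest
# absolute loading, ignoring sign.
assignment = {v: max(range(3), key=lambda j: abs(l[j])) + 1
              for v, l in rotated.items()}
```

Running this reproduces the grouping described in the text: price/engine size/horsepower on component 1, wheelbase/width/length on component 2, and the remaining variables on component 3.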
Component Transformation Matrix

Component      1        2        3
1            .601     .627     .495
2           -.797     .422     .433
3           -.063     .655    -.753
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
Component Score Coefficient Matrix

                        Component
                    1        2        3
Vehicle type       -.173    -.194     .615
Price in thousands  .414    -.179    -.081
Engine size         .226     .028    -.016
Horsepower          .368    -.046    -.139
Wheelbase          -.177     .397    -.042
Width               .011     .289    -.102
Length             -.105     .477    -.234
Curb weight         .070     .043     .175
Fuel capacity       .012     .017     .262
Fuel efficiency    -.107     .108    -.298
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
Component Scores.
This table gives the component score coefficients for each variable. The component scores can be saved for each case in the SPSS file. These scores are useful for replacing interrelated variables in regression analysis. In the above table, the coefficients are given component-wise: a case's score on each component is calculated as a linear combination of its variable values weighted by that component's coefficients.
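A sketch of that calculation for component 1, using the first column of coefficients above; the standardised values for the case are hypothetical:

```python
# First-column component score coefficients (from the matrix above), in the
# variable order Vehicle type ... Fuel efficiency:
weights = [-0.173, 0.414, 0.226, 0.368, -0.177,
           0.011, -0.105, 0.070, 0.012, -0.107]

# Hypothetical standardised (z) values for one case, same variable order:
z_case = [0.0, 1.2, 0.8, 1.0, -0.3, 0.1, -0.2, 0.5, 0.4, -0.9]

# Component 1 score for this case: the weighted sum of the z-values.
score_1 = sum(w * z for w, z in zip(weights, z_case))
```

Repeating this with the second and third coefficient columns gives the case's scores on components 2 and 3, which is what SPSS stores when Save as variables is selected.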
Component Score Covariance Matrix

Component      1        2        3
1            1.000     .000     .000
2             .000    1.000     .000
3             .000     .000    1.000
Extraction Method: Principal Component Analysis.
Rotation Method: Varimax with Kaiser Normalization.
Component Scores.
Case 1
In the year 2009, TRAI (the Telecom Regulatory Authority of India) was assessing the requirements for number portability. Number portability is defined as switching one's service provider without having to change one's number. That year saw a fierce price war in the telecom sector. Some of the oldest service providers were still, to a certain extent, immune to this war, as most of their consumers would not like to change their numbers. Number portability would intensify the price war and give an opportunity to relatively new service providers. The price war was so fierce that industry experts commented that the future lay in how one differentiates in terms of services rather than price.
With this background, a telecom company conducted a research study to find the factors that affect consumers while selecting or switching a telecom service provider. The survey was conducted on 35 respondents. They were asked to rate 12 statements, about their perception of the factors important to them while selecting a service provider, on a 7-point scale (1 = completely disagree, 7 = completely agree).
The research design for data collection can be stated as follows-
35 telecom users were surveyed about their perceptions and image attributes of the service providers
they owned. Twelve questions were asked to each of them, all answered on a scale of 1 to 7 (1=
completely disagree, 7= completely agree).
I decide my telephone provider on the basis of following attributes. (1= completely disagree, 7=
completely agree)
1. Availability of services (like drop boxes and different payment options, in case of post paid, and
recharge coupons in case of pre paid)
2. Good network connectivity all through the city.
3. Internet connection, with good speed.
4. Quick and appropriate response at customer care centre.
5. Connectivity while roaming (out of the state or out of country)
6. Call rates and Tariff plans
7. Additional features like unlimited SMS, lifetime prepaid, 2 phones free calling, etc.
8. Quality of network service like Minimum call drops, Minimum down time, voice quality, etc.
9. SMS and Value Added Services charges.
10. Value Added Services like MMS, caller tunes, etc.
11. Roaming charges
12. Conferencing
The sample of data collected is tabulated as follows
Carry out relevant analysis and write a report to discuss the findings for the above data.
The initial process of conducting common factor analysis is exactly the same as for principal component analysis, except for the method selection shown in FA Snapshot 4.
We will discuss only the steps that differ from the principal component analysis shown above.
Following steps are carried out to run factor analysis using SPSS.
1. Open file telcom.sav
2. Click on Analyse ->Data Reduction ->Factor as shown in FA Snapshot 1.
3. The following window will be opened by SPSS.
FA Snapshot 8
4. Click on Descriptives; select Coefficients, select Initial solution, select KMO and Bartlett's test of sphericity, and also select Anti-image, as shown in FA Snapshot 3.
It may be noted that we did not select Anti-image in PCA, but we are required to select it here.
5. Click on Extraction. In the window SPSS opens, select Principal Axis Factoring as the method (this is the selection that differs from PCA), select Correlation matrix, select Unrotated factor solution, and click on Continue.
6. SPSS will take you back to the window shown in FA Snapshot 8. At this stage click on Rotation; the window SPSS opens is shown in FA Snapshot 6.
7. Select Varimax rotation, select Display rotated solution and click Continue, as shown in FA Snapshot 6.
8. It may be noted that in the PCA steps of FA Snapshot 7 we chose to save some variables, which is not required here.
The following output will be generated by SPSS.
Factor Analysis
Descriptive Statistics
This is the descriptive statistics table given by SPSS. It gives a general understanding of the variables.
Correlation Matrix

        Q1     Q2     Q3     Q4     Q5     Q6     Q7     Q8     Q9     Q10    Q11    Q12
Q1    1.000  -.128  -.294   .984  -.548  -.017  -.188  -.093   .129  -.242  -.085  -.259
Q2    -.128  1.000   .302  -.068   .543   .231   .257   .440   .355   .359   .344   .378
Q3    -.294   .302  1.000  -.282   .558   .148   .258   .056   .148   .898   .164   .930
Q4     .984  -.068  -.282  1.000  -.510  -.052  -.223  -.063   .099  -.227  -.113  -.251
Q5    -.548   .543   .558  -.510  1.000   .101   .195   .387   .067   .538   .149   .559
Q6    -.017   .231   .148  -.052   .101  1.000   .901   .159   .937   .230   .906   .096
Q7    -.188   .257   .258  -.223   .195   .901  1.000   .202   .866   .379   .943   .211
Q8    -.093   .440   .056  -.063   .387   .159   .202  1.000   .204   .042   .192   .156
Q9     .129   .355   .148   .099   .067   .937   .866   .204  1.000   .258   .889   .091
Q10   -.242   .359   .898  -.227   .538   .230   .379   .042   .258  1.000   .309   .853
Q11   -.085   .344   .164  -.113   .149   .906   .943   .192   .889   .309  1.000   .119
Q12   -.259   .378   .930  -.251   .559   .096   .211   .156   .091   .853   .119  1.000
This is the correlation matrix. Common factor analysis can be carried out if the correlation matrix for the variables contains at least two correlations of 0.30 or greater. It may be noted that several of the correlations exceed 0.30 (for example Q1 with Q4, Q3 with Q12, and Q6 with Q9).
KMO and Bartlett's Test
Kaiser-Meyer-Olkin Measure of Sampling Adequacy: .658
The KMO measure of sampling adequacy is an index used to test the appropriateness of factor analysis. The minimum required KMO is 0.5. The above table shows that the index for this data is 0.658 and the chi-square statistic of Bartlett's test is significant (.000 < 0.05). This means factor analysis is appropriate for this data.
Communalities
Initial Extraction
Q1 .980 .977
Q2 .730 .607
Q3 .926 .975
Q4 .978 .996
Q5 .684 .753
Q6 .942 .917
Q7 .942 .941
Q8 .396 .379
Q9 .949 .942
Q10 .872 .873
Q11 .934 .924
Q12 .916 .882
Extraction Method: Principal Axis Factoring.
Initial communalities are the proportion of variance accounted for in each variable by the rest of the
variables. Small communalities for a variable indicate that the proportion of variance that this
variable shares with other variables is too small. Thus, this variable does not fit the factor solution. In
the above table, most of the initial communalities are very high indicating that all the variables share
a good amount of variance with each other, an ideal situation for factor analysis.
Extraction communalities are estimates of the variance in each variable accounted for by the factors
in the factor solution. The communalities in this table are all high. It indicates that the extracted
factors represent the variables well.
Total Variance Explained

         Initial Eigenvalues                  Extraction Sums of Squared Loadings    Rotation Sums of Squared Loadings
Factor   Total   % of Variance  Cumulative %  Total   % of Variance  Cumulative %    Total   % of Variance  Cumulative %
1 4.789 39.908 39.908 4.671 38.926 38.926 3.660 30.500 30.500
2 3.035 25.288 65.196 2.956 24.635 63.561 2.929 24.410 54.910
3 1.675 13.960 79.156 1.628 13.570 77.131 2.205 18.372 73.283
4 1.321 11.006 90.163 .911 7.593 84.724 1.373 11.441 84.724
5 .526 4.382 94.545
6 .275 2.291 96.836
7 .157 1.307 98.143
8 .093 .774 98.917
9 .050 .421 99.338
10 .042 .353 99.691
11 .027 .227 99.918
12 .010 .082 100.000
Extraction Method: Principal Axis Factoring.
This output gives the variance explained by the initial solution. The table shows the total variance contributed by each factor. We may note that the percentage of total variance contributed by the first factor is 39.908, by the second 25.288, by the third 13.960 and by the fourth 11.006. It may be noted that the percentage of total variance is highest for the first factor and decreases thereafter. It is also clear from this table that there are four distinct factors for the given set of variables.
Scree Plot
[Scree plot: eigenvalues plotted against factor numbers 1 to 12; the curve drops steeply over the first four factors and flattens thereafter]
The scree plot plots the eigenvalues against the number of factors and helps to determine the optimal number of factors. A steep slope at a factor indicates that a larger percentage of the total variance is explained by that factor; a shallow slope indicates that the contribution to the total variance is small. In the above plot, the first four factors have a steep slope and the slope is shallow thereafter. It may also be noted from the plot that four factors have eigenvalues greater than one. Hence the ideal number of factors is four.
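The eigenvalue-greater-than-one (Kaiser) rule described above can be checked outside SPSS. A minimal NumPy sketch on synthetic data (the 500 cases, six variables and two latent factors below are invented for illustration, not the telecom data):

```python
import numpy as np

def kaiser_count(data):
    """Return the eigenvalues of the correlation matrix (descending)
    and the number of factors with eigenvalue greater than one."""
    corr = np.corrcoef(data, rowvar=False)    # variables are columns
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # sort descending
    return eigvals, int((eigvals > 1.0).sum())

# synthetic data: six observed variables driven by two latent factors
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
x = np.column_stack([latent[:, 0]] * 3 + [latent[:, 1]] * 3)
x = x + 0.3 * rng.normal(size=(500, 6))

eigvals, n_factors = kaiser_count(x)
```

With two latent factors the first two eigenvalues dominate and the Kaiser rule retains two factors, mirroring how the four eigenvalues above one lead to a four-factor solution in the telecom data.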
Factor Matrix(a)
Factor
1 2 3 4
Q1 -.410 .564 .698 .066
Q2 .522 -.065 .139 .557
Q3 .687 -.515 .423 -.245
Q4 -.413 .528 .725 .141
Q5 .601 -.500 -.067 .371
Q6 .705 .630 -.111 -.102
Q7 .808 .486 -.176 -.144
Q8 .293 .005 -.031 .540
Q9 .690 .682 .039 .019
Q10 .727 -.372 .401 -.213
Q11 .757 .575 -.137 -.040
Q12 .644 -.523 .432 -.090
Extraction Method: Principal Axis Factoring.
a. 4 factors extracted. 14 iterations required.
This table gives each variable's factor loadings, but it is the next table which is easier to interpret.
Rotated Factor Matrix(a)
Factor
1 2 3 4
Q1 .001 -.147 .972 -.109
Q2 .195 .253 -.002 .711
Q3 .084 .970 -.146 .074
Q4 -.042 -.137 .987 -.035
Q5 .006 .456 -.430 .600
Q6 .953 .053 -.004 .078
Q7 .939 .161 -.162 .086
Q8 .117 -.008 -.040 .603
Q9 .936 .068 .164 .186
Q10 .210 .899 -.101 .103
Q11 .943 .078 -.059 .158
Q12 .022 .909 -.110 .206
Extraction Method: Principal Axis Factoring.
Rotation Method: Varimax with Kaiser Normalization.
a. Rotation converged in 5 iterations.
This is the most important table for interpretation. The maximum in each row (ignoring sign) indicates that the respective variable belongs to the respective factor. For example, in the first row the maximum is 0.972, which is for factor 3; this indicates that Q1 contributes to the third factor. In the second row the maximum is 0.711, for factor 4, indicating that Q2 contributes to factor 4, and so on.
The variables ‘Q6’, ‘Q7’, ‘Q9’, and ‘Q11’ are highly correlated and contribute to a single factor
which can be named as Factor 1 or ‘Economy’.
The variables ‘Q3’, ‘Q10’, and ‘Q12’ are highly correlated and contribute to a single factor which can
be named as Factor 2 or ‘Services beyond Calling’.
The variables ‘Q1’ and ‘Q4’ are highly correlated and contribute to a single factor which can be
named as Factor 3 or ‘Customer Care’.
The variables ‘Q2’, ‘Q5’, and ‘Q8’ are highly correlated and contribute to a single factor which can
be named as Factor 4 or ‘Anytime Anywhere Service’.
We may summarise the above analysis in the following Table.
Factors                       Questions
Factor 1: Economy             Q.6. Call rates and Tariff plans
                              Q.7. Additional features like unlimited SMS, lifetime prepaid, 2 phones free calling, etc.
                              Q.9. SMS and Value Added Services charges.
                              Q.11. Roaming charges
Factor 2: Services beyond     Q.3. Internet connection, with good speed.
Calling                       Q.10. Value Added Services like MMS, caller tunes, etc.
                              Q.12. Conferencing
Factor 3: Customer Care       Q.1. Availability of services (like drop boxes and different payment options, in case of post paid, and recharge coupons in case of pre paid)
                              Q.4. Quick and appropriate response at customer care centre.
Factor 4: Anytime Anywhere    Q.2. Good network connectivity all through the city.
Service                       Q.5. Connectivity while roaming (out of the state or out of country)
                              Q.8. Quality of network service like Minimum call drops, Minimum down time, voice quality, etc.
This implies that a telecom service provider should focus on these four factors, which customers feel are important while selecting or switching a service provider.
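The varimax rotation used above can be illustrated for a two-factor case: rotate the loading matrix so each variable loads mainly on one factor, by maximizing the variance of the squared loadings. A brute-force sketch (the unrotated loadings below are invented for illustration, not the telecom data):

```python
import numpy as np

def varimax_2d(loadings, steps=1800):
    """Brute-force varimax for a two-factor solution: search for the
    orthogonal rotation angle that maximizes the variance of the
    squared loadings (the varimax criterion)."""
    best, best_v = loadings, -np.inf
    for theta in np.linspace(0.0, np.pi / 2, steps):
        c, s = np.cos(theta), np.sin(theta)
        lam = loadings @ np.array([[c, -s], [s, c]])  # rotate loadings
        v = (lam ** 2).var(axis=0).sum()              # varimax criterion
        if v > best_v:
            best_v, best = v, lam
    return best

# two pairs of variables loading equally on both unrotated factors
raw = np.array([[0.70, 0.70], [0.60, 0.60], [0.70, -0.70], [0.60, -0.60]])
rotated = varimax_2d(raw)
groups = np.abs(rotated).argmax(axis=1)  # factor assignment per variable
```

After rotation the first two variables load almost entirely on one factor and the last two on the other, which is exactly the "maximum in each row" reading applied to the rotated factor matrix above. Because the rotation is orthogonal, the communalities (row sums of squared loadings) are unchanged.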
7 Canonical Correlation Analysis
The canonical correlation analysis, abbreviated as CRA is an extension of multiple regression
analysis, abbreviated as MRA. While, in MRA, there is one metric (measurable or non categorical)
dependant variable, say y, and there are several metric independent variables, say x 1, x2,…..,xk, in
CRA, there are several metric dependent variables, say y1, y2,……,ym .
CRA involves developing a linear combination of each of the two sets of variables, the y's and the x's: one is a linear combination of the dependent variables (also called the criterion set) and the other is a linear combination of the independent variables (also called the predictor set). The two linear combinations are derived in such a way that the correlation between them is maximum.
While the linear combinations are referred to as canonical variables, the correlation between the two combinations is called the canonical correlation. It measures the strength of the overall relationship between the linear combinations of the predictor and criterion sets of variables. In the next stage, the identification process involves choosing the second pair of linear combinations having the second largest correlation among all pairs uncorrelated with the initial pair. The process continues for the third pair, and so on. The practical significance of a canonical correlation is that it indicates how much variance in the set of dependent variables is accounted for by the set of independent variables. The weights in the linear combinations are derived based on the criterion of maximizing the correlation between the two sets.
It can be represented as follows:
Y1 + Y2 + … + Ym  =  X1 + X2 + … + Xk
    (metric)             (metric)
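Numerically, the canonical correlations can be obtained as the singular values of Sxx^(-1/2) Sxy Syy^(-1/2), where Sxx and Syy are the within-set covariance matrices and Sxy the between-set covariance matrix. A NumPy sketch on made-up data (one y variable is an exact linear combination of the x's, so the first canonical correlation should be 1):

```python
import numpy as np

def canonical_correlations(x, y):
    """Canonical correlations between two sets of metric variables
    (rows = cases, columns = variables)."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = len(x)
    sxx, syy, sxy = x.T @ x / n, y.T @ y / n, x.T @ y / n

    def inv_sqrt(m):  # symmetric inverse square root via eigendecomposition
        w, v = np.linalg.eigh(m)
        return v @ np.diag(w ** -0.5) @ v.T

    k = inv_sqrt(sxx) @ sxy @ inv_sqrt(syy)
    return np.linalg.svd(k, compute_uv=False)

rng = np.random.default_rng(1)
x = rng.normal(size=(300, 3))
y = np.column_stack([x[:, 0] + x[:, 1],        # shared dimension
                     rng.normal(size=300)])    # unrelated noise
rho = canonical_correlations(x, y)
```

Here rho[0] is (numerically) 1 for the shared dimension, while rho[1] stays near zero for the unrelated variable; squaring these values gives the canonical roots, i.e. the eigenvalues.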
Some Indicative Applications:
A medical researcher could be interested in determining if individuals’ lifestyle and personal
habits have an impact on their health as measured by a number of health-related variables
such as hypertension, weight, blood sugar, etc.
The marketing manager of a consumer goods firm could be interested in determining if there
is a relationship between types of products purchased and consumers’ income and
profession.
Squared canonical correlations are referred to as canonical roots or eigenvalues.
If X1, X2, …, Xp and Y1, Y2, …, Yq are the observable variables, the canonical variables are the linear combinations U = a1X1 + a2X2 + … + apXp and V = b1Y1 + b2Y2 + … + bqYq, and the squared canonical correlation between U and V is the eigenvalue.
Wilks' Lambda (Λ) is the proportion of the total variance in the discriminant scores not explained by differences among the groups. It is used to test H0 that the means of the variables are equal across the groups. Λ is approximated by
χ²(p(G−1)) = −{n − 1 − (p + G)/2} log Λ
where p is the number of variables and G the number of groups.
0 ≤ Λ ≤ 1. If Λ is small, H0 is rejected; if it is high, H0 is accepted.
Instead of regression, the study used the canonical correlation technique to investigate the asset-liability relationship of the banks at the two time points.
CANONICAL Snapshot 1
CANONICAL Snapshot 2: Select the dependent variables invest_SM, invest_CM and Invest_MF, then click OK.
It may be noted that the above example is discussed in Section 5.1. The difference between MANCOVA and canonical correlation is that MANCOVA can have both factors and metric independent variables, whereas canonical correlation can have only metric independent variables; factors (categorical independent variables) are not possible in canonical correlation.
In the above example the dependent variables are the investments in the commodity market, the share market and mutual funds. The metric independent variables are age, the respondent's ratings of the commodity market, share market and mutual funds, and the respondent's perception of risk for the commodity market, share market and mutual funds. Here we assume that investments depend on these ratings, age and risk perceptions.
This table indicates that the hypotheses about age, the ratings of CM, SM and MF, and Risky CM, SM and MF are not rejected (as the p-values are greater than 0.05); this means there is no significant overall difference in the investments for these variables.
Source           Dependent Variable   Type III Sum of Squares   df   Mean Square    F      Sig.
Corrected Model  Invest_CM            6554404686(a)             9    728267187.3    2.363  .034
                 Invest_SM            3.825E+010(b)             9    4250227452     1.723  .122
                 Invest_MF            3.458E+010(c)             9    3842114308     1.269  .289
Intercept        Invest_CM            41236149.0                1    41236149.02    .134   .717
                 Invest_SM            1820226683                1    1820226683     .738   .396
                 Invest_MF            2063858495                1    2063858495     .682   .415
Rate_CM          Invest_CM            3146747.812               1    3146747.812    .010   .920
                 Invest_SM            701640357                 1    701640356.6    .284   .597
                 Invest_MF            91570171.1                1    91570171.09    .030   .863
Rate_SM          Invest_CM            5827313.435               1    5827313.435    .019   .891
                 Invest_SM            305120.840                1    305120.840     .000   .991
                 Invest_MF            289045935                 1    289045935.2    .095   .759
risky_CM         Invest_CM            1383505869                1    1383505869     4.490  .041
                 Invest_SM            3599660366                1    3599660366     1.459  .235
                 Invest_MF            4765133773                1    4765133773     1.574  .218
risky_SM         Invest_CM            5068043.704               1    5068043.704    .016   .899
                 Invest_SM            3129346168                1    3129346168     1.269  .268
                 Invest_MF            3115327283                1    3115327283     1.029  .318
Age              Invest_CM            1759288000                1    1759288000     5.709  .023
                 Invest_SM            4483914274                1    4483914274     1.818  .187
                 Invest_MF            5852257582                1    5852257582     1.933  .173
Rate_FD          Invest_CM            314194024                 1    314194023.6    1.020  .320
                 Invest_SM            1038652882                1    1038652882     .421   .521
                 Invest_MF            1418050354                1    1418050354     .468   .498
Rate_MF          Invest_CM            709118826                 1    709118825.7    2.301  .139
                 Invest_SM            4469496231                1    4469496231     1.812  .187
                 Invest_MF            3743054315                1    3743054315     1.236  .274
risky_FD         Invest_CM            925863328                 1    925863327.7    3.005  .092
                 Invest_SM            7133105111                1    7133105111     2.891  .098
                 Invest_MF            4644806048                1    4644806048     1.534  .224
risky_MF         Invest_CM            112028.404                1    112028.404     .000   .985
                 Invest_SM            46590657.4                1    46590657.39    .019   .892
                 Invest_MF            14383175.1                1    14383175.06    .005   .945
Error            Invest_CM            1.048E+010                34   308143679.0
                 Invest_SM            8.388E+010                34   2466958509
                 Invest_MF            1.029E+011                34   3027453031
Total            Invest_CM            2.780E+010                44
                 Invest_SM            2.109E+011                44
                 Invest_MF            2.302E+011                44
Corrected Total  Invest_CM            1.703E+010                43
                 Invest_SM            1.221E+011                43
                 Invest_MF            1.375E+011                43
a. R Squared = .385 (Adjusted R Squared = .222)
b. R Squared = .313 (Adjusted R Squared = .131)
c. R Squared = .251 (Adjusted R Squared = .053)
The above table gives three different models, namely a, b and c. Model a is for the first dependent variable, Invest_CM; model b is for the dependent variable Invest_SM; and model c is for the dependent variable Invest_MF.
The table also indicates the individual relationship between each dependent-independent variable pair. Only two pairs, namely risky_CM with Invest_CM and Age with Invest_CM, are significant (p-value less than 0.05). This indicates that the independent variable, consumers' perception of the risk of commodity markets (variable risky_CM), significantly affects the dependent variable, i.e. their investment in commodity markets: the riskiness perceived by consumers affects their investments in that market. Similarly, the variable Age also impacts their investments in commodity markets. All other combinations are not significant.
8 Cluster Analysis
This type of analysis is used to divide a given number of entities or objects into groups called
clusters. The objective is to classify a sample of entities into a small number of mutually exclusive
clusters based on the premise that they are similar within the clusters but dissimilar among the
clusters. The criterion for similarity is defined with reference to some characteristics of the entity. For
example, for companies, it could be ‘Sales’, ‘Paid up Capital’, etc.
The basic methodological questions that are to be answered in the cluster analysis are:
What are the relevant variables and descriptive measures of an entity?
How do we measure the similarity between entities?
Given that we have a measure of similarity between entities, how do we form
clusters?
Variables
            x1     x2     …     xk
Entity 1    x11    x12    …     x1k
Entity 2    x21    x22    …     x2k
……..
Entity n    xn1    xn2    …     xnk
The question is: how do we determine how similar or dissimilar each row of data is to the others?
This task of measuring the similarity between entities is complicated by the fact that, in most cases, the data in their original form are measured in different units or/and scales. This problem is solved by standardizing each variable, i.e. subtracting its mean from its value and then dividing by its standard deviation. This converts each variable to a pure number.
The measure used to define the similarity between two entities, i and j, is computed as
Dij = (xi1 − xj1)² + (xi2 − xj2)² + … + (xik − xjk)²
The smaller the value of Dij, the more similar the two entities.
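The standardize-then-compare computation above can be sketched with NumPy (the four entities and two variables below are invented for illustration):

```python
import numpy as np

def squared_distance_matrix(data):
    """Standardize each variable to a z-score, then compute the pairwise
    squared Euclidean distances D_ij between entities (rows)."""
    z = (data - data.mean(axis=0)) / data.std(axis=0)
    diff = z[:, None, :] - z[None, :, :]  # all pairwise differences
    return (diff ** 2).sum(axis=2)

# four entities measured on two variables with very different scales
entities = np.array([[900.0, 0.70],
                     [880.0, 0.69],
                     [120.0, 0.08],
                     [100.0, 0.09]])
d = squared_distance_matrix(entities)
```

Standardization keeps the large-valued first variable from dominating D: entities 1 and 2 come out mutually closest, as do entities 3 and 4.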
The basic method of clustering is illustrated through a simple example given below.
Let there be four branches of a commercial bank each described by two variables viz. deposits and
loans / credits. The following chart gives an idea of their deposits and loans / credits.
[Chart: the four branches plotted with Deposits on the horizontal axis and Loans/Credits on the vertical axis; branches 1 and 2 lie in the high-deposit, high-credit region, branches 3 and 4 in the low-deposit, low-credit region]
From the above chart, it is obvious that if we want two clusters, we should group branches 1 & 2 (High Deposit, High Credit) into one cluster and branches 3 & 4 (Low Deposit, Low Credit) into another, since such grouping produces clusters within which the entities (branches) are most similar. However, this graphical approach is not convenient for more than two variables.
In order to develop a mathematical procedure for forming the clusters, we need a criterion upon
which to judge alternative clustering patterns. This criterion defines the optimal number of entities
within each cluster.
Now we shall illustrate the methodology of using distances among the entities to form clusters. We assume the following distance similarity matrix among three entities.
branches in the same cluster are more or less similar to each other, only few branches are selected
from each cluster.
(ii) Agricultural Clusters
A study was conducted by one of the officers of the Reserve Bank of India, to form clusters of
geographic regions of the country based on agricultural parameters like cropping pattern, rainfall,
land holdings, productivity, fertility, use of fertilisers, irrigation facilities, etc. The whole country
was divided into 9 clusters. Thus, all the 67 regions of the country were allocated to these clusters.
Such classification is useful for making policies at the national level as also at regional/cluster
levels.
8.2 Key Terms in Cluster Analysis
Agglomeration Schedule While performing cluster analysis, the tool gives information on the objects or
cases being combined at each stage of the hierarchical clustering process. This
is an in-depth table which indicates the clusters and the objects combined in
each cluster; the table can be read from top to bottom. The table starts with
the first two cases combined together; it also states the 'Distance Coefficients' and
'Stage Cluster First Appears'. The distance coefficient is an important
measure for identifying the number of clusters in the data: a sudden jump in the
coefficient indicates better grouping. The last row of the table represents the
one-cluster solution, the second last the two-cluster solution, etc.
Cluster Centroid Cluster centroids are the mean values of the variables under consideration
(the variables given while clustering) for all the cases in a particular cluster.
Each cluster has a different centroid for each variable.
Cluster Centers These are the initial starting points in non-hierarchical clustering. Clusters
are built around these centers; they are also termed seeds.
Cluster Membership It is the cluster to which each case belongs. It is important to save the cluster
membership in order to analyse the clusters and further perform ANOVA on the data.
Dendrogram This is a graphical summary of the cluster solution. It is used more than the
Agglomeration Schedule when interpreting results, as it is easier to read.
The cases are listed along the left vertical axis.
The horizontal axis shows the distance between clusters when they are
joined.
The graph gives an indication of the number of clusters the solution may
have. The diagram is read from right to left: the rightmost position is the
single-cluster solution, just before it the two-cluster solution, and so on. The
best solution is where the horizontal distance is maximum. This can be a
subjective process.
Icicle Diagram It displays information about how cases are combined into clusters at each
iteration of the analysis.
Similarity/ Distance It is a matrix containing the pairwise distances between the cases.
Coefficient Matrix
8.3 Clustering Procedures and Methods
The cluster analysis procedure could be
Hierarchical
Non- hierarchical
Hierarchical methods develop a tree-like structure (dendrogram). These could be
Agglomerative – starts with each case as a separate cluster, and at every stage the most similar clusters are combined. Ends with a single cluster.
Divisive – starts with all cases in a single cluster, which is then divided on the basis of the differences between the cases. Ends with all cases in separate clusters.
The most common methods of clustering are the agglomerative methods. These can be further divided into
Linkage Methods – these use distance measures. There are three linkage methods: Single linkage – minimum distance or nearest-neighbour rule; Complete linkage – maximum distance or furthest neighbour; and Average linkage – average distance between all pairs of objects. This is explained in the diagram provided below.
Centroid Methods – this method considers the distance between the two centroids. The centroid is the mean of all the variables.
Variance Methods – commonly termed Ward's method; it uses the squared distance from the means.
Diagram
a) Single Linkage (Minimum Distance / Nearest Neighbourhood)
b) Complete Linkage (Maximum Distance / Furthest Neighbourhood)
c) Average Linkage (Average Distance)
d) Centroid Method
[Each panel shows two clusters, Cluster 1 and Cluster 2, with the corresponding distance drawn between them]
Since distance measures are used in cluster analysis, it assumes that the variables are measured on comparable scales, i.e. in the same units or dimensions. If all the variables are rating questions and the scales of these ratings are the same, this assumption is satisfied. But if the variables have different dimensions – for example, one variable is salary, another is age, and some are ratings on a 1 to 7 scale – this difference may affect the clustering. The problem is solved through standardization, which equalizes the effect of variables measured on different scales.
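Both steps, standardization and agglomerative clustering, can be sketched with SciPy (the age/salary data below are synthetic; 'complete' is the furthest-neighbour linkage used later in this chapter):

```python
import numpy as np
from scipy.stats import zscore
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# two groups of cases described by variables on very different scales
age = np.concatenate([rng.normal(25, 2, 10), rng.normal(50, 2, 10)])
salary = np.concatenate([rng.normal(30000, 2000, 10),
                         rng.normal(90000, 2000, 10)])

data = zscore(np.column_stack([age, salary]))    # standardize both variables
z = linkage(data, method='complete')             # furthest-neighbour linkage
labels = fcluster(z, t=2, criterion='maxclust')  # cut the tree at 2 clusters
```

Without the `zscore` step the salary variable (thousands of units) would dominate the distances; with it, the two age/salary groups are recovered cleanly.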
Special focus was on the study of the levels of awareness among the investors about the
commodity markets and also perceptional ratings of the investment options by the investors.
This study was undertaken with the intention of gaining a better perspective on investor behaviour patterns and of providing assistance to the general public, individual/small investors, brokers and portfolio managers to analyse the scope of investments and make informed decisions while investing in the above-mentioned options. However, the limitation of the study is that it considers investors from Mumbai only and hence might not be representative of the entire country.
An investment is a commitment of funds made in the expectation of some positive rate of return. If
properly undertaken, the return will be commensurate with the risk the investor assumes.
An analysis of the backgrounds and perceptions of the investors was undertaken in the report. The data used in the analysis were collected by e-mailing and distributing the questionnaire among friends, relatives and colleagues. 45 people were surveyed and asked various questions relating to their backgrounds and knowledge of the investment markets and options. The raw data contain a wide range of information, but only the data relevant to the objective of the study were considered.
The questionnaire used for the study is as follows
QUESTIONNAIRE
Age: _________
Occupation:
o SELF EMPLOYED
o GOVERNMENT
o STUDENT
o HOUSEWIFE
o DOCTOR
o ENGINEER
o CORPORATE PROFESSIONAL
o OTHERS (PLEASE SPECIFY) : ________________________
Gender:
o MALE
o FEMALE
Question 1 2 3 4 5
COMMODITY MARKET
STOCK MARKET
FIXED DEPOSITS
Mutual Funds
2. HOW MUCH ARE YOU READY TO INVEST IN THE COMMODITY MARKET? _______________
3. HOW MUCH ARE YOU READY TO INVEST IN THE STOCK MARKET? ______________
4. HOW MUCH ARE YOU READY TO INVEST IN FIXED DEPOSITS? ______________
5. HOW MUCH ARE YOU READY TO INVEST IN MUTUAL FUNDS? ______________
6. FOR HOW MUCH TIME WOULD YOU BLOCK YOUR MONEY WITH INVESTMENTS?
o UNDER 5 MONTHS
o 6-12 MONTHS
o 1-3 YEARS
o MORE THAN 3 YEARS
110
7. ON A SCALE OF 1-10 (1 - LEAST RISKY & 10 - MOST RISKY), HOW RISKY DO YOU THINK IS THE COMMODITY MARKET? (Circle the appropriate number)
SAFE    RISKY
8. ON A SCALE OF 1-10, HOW RISKY DO YOU THINK IS THE STOCK MARKET?
SAFE RISKY
SAFE RISKY
10. ON A SCALE OF 1-10 HOW RISKY DO YOU THINK ARE MUTUAL FUNDS?
SAFE RISKY
CA Snapshot 2
1. Select the variables Age and Rate_CM through Risky_MF.
2. Keep 'Cases' selected. ('Variables' can be selected if one wants to perform cluster analysis on variables rather than cases, like factor analysis; the default is cases.)
3. Click on Plots.
The following window will be opened.
CA Snapshot 3: Select Dendrogram, then click on Continue.
SPSS will return to the window displayed in CA Snapshot 2. At this stage click on 'Method', and SPSS will open the following window.
CA Snapshot 4
CA Snapshot 5: Click on Continue.
SPSS will return to the window shown in CA Snapshot 2. At this stage click on Save, and the following window will be displayed.
CA Snapshot 6
Click on Continue, and SPSS will return to the window shown in CA Snapshot 2. At this stage click OK, and the following output will be displayed.
Cases
Valid Missing Total
N Percent N Percent N Percent
44 97.8% 1 2.2% 45 100.0%
a. Squared Euclidean Distance used
This table gives the case processing summary and its percentages. It indicates that 44 out of 45 cases are valid. Since one case has some missing values, it is excluded from the analysis.
Cluster
Single Linkage
This is the method we selected for cluster analysis
Agglomeration Schedule
This table gives the agglomeration schedule, i.e. the details of the clusters formed at each stage. It indicates that cases 6 and 42 were combined at the first stage, cases 7 and 43 at the 2nd stage, cases 2 and 10 at the third stage, and so on. The last stage (stage 43) indicates the two-cluster solution; the stage above it (stage 42) indicates the three-cluster solution, and so on. The Coefficients column gives the distance coefficient; a sudden increase in the coefficient indicates that the combination at that stage is more appropriate. This is one of the indicators for deciding the number of clusters.
Agglomeration Schedule
[Agglomeration Schedule table: Stage, Clusters Combined, Coefficients, Stage Cluster First Appears, Next Stage, Difference in Coefficients]
The icicle table also gives a summary of the cluster formation. It is read from bottom to top: the topmost row is the single-cluster solution and the bottommost has all cases separate. The cases in the table are in the columns; the first column indicates the number of clusters for that stage. Each case is separated by an empty column: a 'cross' in the empty column means the two cases are combined, while a 'gap' means the two cases are in separate clusters.
If the number of cases is huge, this table becomes difficult to interpret.
The diagram given below is called the dendrogram. A dendrogram is the most used tool for understanding the number of clusters and the cluster memberships. The cases are in the first column and are connected by lines for each stage of clustering. The graph is read from left to right: the leftmost position has every case as a separate cluster and the rightmost is the one-cluster solution.
The graph also has a distance scale from 0 to 25. The greater the width of the horizontal line for a cluster, the more appropriate that cluster is.
The graph shows that the 2-cluster solution is the better solution, indicated by the thick dotted line.
Dendrogram
* * * * * * H I E R A R C H I C A L C L U S T E R A N A L Y S I S * * * * * *
Dendrogram using Single Linkage
The above solution is not decisive, as the differences are very close. Hence we shall try a different method, i.e. furthest neighbourhood.
The entire process is repeated, and this time the method selected (as shown in CA Snapshot 4) is furthest neighbourhood.
The output is as follows:
Proximities
Case Processing Summarya
Cases
Valid Missing Total
N Percent N Percent N Percent
44 97.8% 1 2.2% 45 100.0%
a. Squared Euclidean Distance used
Cluster
Complete Linkage
Agglomeration Schedule
Dendrogram
* * * * * * H I E R A R C H I C A L  C L U S T E R  A N A L Y S I S * * * * * *
Dendrogram using Complete Linkage
The above dendrogram clearly shows that the longest horizontal lines occur for the 4-cluster solution, shown by the thick dotted line (the dotted line intersects four horizontal lines). It indicates that the cluster containing cases 7, 43, 37 and 38 is named cluster 4, the cluster containing cases 41, 42, 39, 8, 44, 9, 45 and 3 is named cluster 2, and so on.
We shall run the cluster analysis again with the same method, and this time save the cluster membership for a single solution of '4' clusters, as indicated in CA Snapshot 6.
The output will be the same as discussed, except that a new variable named 'CLU4_1' is added to the SPSS file. This variable takes values from 1 to 4; each value indicates the case's cluster membership.
We shall conduct an ANOVA test on the data, where the dependent variables are all the variables that were included while performing the cluster analysis, and the factor is the cluster membership indicated by the variable CLU4_1. This ANOVA will indicate whether the clusters really differ on the basis of the list of variables, i.e. which variables significantly distinguish the clusters and which do not.
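The same one-way ANOVA idea can be sketched outside SPSS with scipy.stats.f_oneway. The groups below are hypothetical clusters (the sizes and age values are invented, loosely echoing the four age groups found in this data):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
# one dependent variable (age) split by hypothetical cluster membership
cluster1 = rng.normal(21, 3, 16)
cluster2 = rng.normal(46, 3, 8)
cluster3 = rng.normal(34, 3, 16)
cluster4 = rng.normal(26, 3, 4)

f_stat, p_value = f_oneway(cluster1, cluster2, cluster3, cluster4)
```

A p-value below 0.05 means the clusters differ significantly on this variable, i.e. the variable helps distinguish the clusters.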
The ANOVA procedure is as follows:
Select Analyze – Compare Means – One-Way ANOVA from the menu, as shown below.
CA Snapshot 7
CA Snapshot 9
This gives the list of post hoc tests for ANOVA. The most common are LSD and HSD (discussed in Chapter 12); we shall select LSD and click on Continue.
SPSS will return to CA Snapshot 8. Click on Options, and the following window will be opened.
CA Snapshot 10: Click on Continue.
SPSS will return to the window shown in CA Snapshot 8; at this stage click on OK. The following output will be displayed:
Oneway
Descriptives
[Descriptives table for all dependent variables by cluster]
The above table gives descriptive statistics for the dependent variables for each cluster. A short summary of the table is displayed below:
Cluster   N    Spend MF    Block Time   Risky CM   Risky SM   Risky FD   Risky MF
1         16   18593.75    2.1875       5.3125     5.375      1.9375     5.125
2         8    117500      6.125        6.625      7.25       1.125      6.5
3         16   17781.25    4.3125       6          7.1875     1.5        6.5625
4         4    124250      4.25         3          4.75       1          4.75
Total     44   45886.36    3.863636     5.590909   6.318182   1.545455   5.863636
It may be noted that these four clusters have average ages of 20.56, 46.13, 33.56 and 26. This clearly forms four different age groups. The other descriptives are summarised in the table above.
                                 Levene Statistic   df1   df2   Sig.
Age                              7.243              3     40    .001
Rate_CM                          .943               3     40    .429
Rate_SM                          1.335              3     40    .277
Rate_FD                          3.186              3     40    .034
Rate_MF                          1.136              3     40    .346
Invest_CM                        .369               3     40    .775
Invest_SM                        17.591             3     40    .000
Invest_FD                        4.630              3     40    .007
Invest_MF                        15.069             3     40    .000
how_much_time_block_your_money   3.390              3     40    .027
risky_CM                         1.995              3     40    .130
risky_SM                         4.282              3     40    .010
risky_FD                         5.294              3     40    .004
risky_MF                         2.118              3     40    .113
This table gives Levene's homogeneity test, which is a must for ANOVA, as ANOVA assumes that the different groups have equal variances. If the significance is less than 5% (the LOS), the null hypothesis that the variances are equal is rejected, i.e. the assumption is not met; in such a case ANOVA cannot be used. In the above table, the assumption is rejected for the variables whose Sig. value is below 0.05, which means ANOVA could be invalid for those variables.
It may be noted that when ANOVA is invalid, the test that can be performed is the non-parametric Kruskal-Wallis test, discussed in Chapter 13.
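As a sketch of that fallback (synthetic groups, SciPy assumed available): Levene's test flags unequal variances, and the Kruskal-Wallis test is then used in place of ANOVA.

```python
import numpy as np
from scipy.stats import levene, kruskal

rng = np.random.default_rng(4)
# three groups; the middle one has a shifted centre and a much larger spread
g1 = rng.normal(10, 1, 15)
g2 = rng.normal(20, 6, 15)
g3 = rng.normal(10, 1, 15)

lev_stat, lev_p = levene(g1, g2, g3)  # equal-variance assumption check
h_stat, kw_p = kruskal(g1, g2, g3)    # non-parametric alternative to ANOVA
```

Here Levene's test rejects equal variances (lev_p below 0.05), so ANOVA's assumption fails, and the rank-based Kruskal-Wallis test is used to compare the groups instead.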
ANOVA

                                                  Sum of Squares   df   Mean Square    F        Sig.
Age                              Between Groups   3764.045         3    1254.682       25.185   .000
                                 Within Groups    1992.750         40   49.819
                                 Total            5756.795         43
Rate_CM                          Between Groups   36.790           3    12.263         22.108   .000
                                 Within Groups    22.188           40   .555
                                 Total            58.977           43
Rate_SM                          Between Groups   28.727           3    9.576          26.879   .000
                                 Within Groups    14.250           40   .356
                                 Total            42.977           43
Rate_FD                          Between Groups   24.636           3    8.212          12.054   .000
                                 Within Groups    27.250           40   .681
                                 Total            51.886           43
Rate_MF                          Between Groups   17.358           3    5.786          5.869    .002
                                 Within Groups    39.438           40   .986
                                 Total            56.795           43
Invest_CM                        Between Groups   1E+010           3    4621914299     58.403   .000
                                 Within Groups    3E+009           40   79138671.88
                                 Total            2E+010           43
Invest_SM                        Between Groups   8E+010           3    2.545E+010     22.243   .000
                                 Within Groups    5E+010           40   1144289844
                                 Total            1E+011           43
Invest_FD                        Between Groups   5E+010           3    1.624E+010     33.585   .000
                                 Within Groups    2E+010           40   483427734.4
                                 Total            7E+010           43
Invest_MF                        Between Groups   9E+010           3    3.005E+010     25.377   .000
                                 Within Groups    5E+010           40   1184108594
                                 Total            1E+011           43
how_much_time_block_your_money   Between Groups   89.682           3    29.894         13.666   .000
                                 Within Groups    87.500           40   2.188
                                 Total            177.182          43
risky_CM                         Between Groups   39.324           3    13.108         4.394    .009
                                 Within Groups    119.313          40   2.983
                                 Total            158.636          43
risky_SM                         Between Groups   43.108           3    14.369         3.821    .017
                                 Within Groups    150.438          40   3.761
                                 Total            193.545          43
risky_FD                         Between Groups   5.097            3    1.699          4.920    .005
                                 Within Groups    13.813           40   .345
                                 Total            18.909           43
risky_MF                         Between Groups   24.744           3    8.248          1.959    .136  (H0 not rejected, as p > 0.05)
                                 Within Groups    168.438          40   4.211
                                 Total            193.182          43
The above ANOVA table tests the difference between means for the different clusters. The null hypothesis states that there is no difference between the clusters for a given variable; if the significance is less than 5% (p-value less than 0.05), the null hypothesis is rejected.
It may be noted that in the above table the null hypothesis of equal means across clusters is rejected for all variables except risky_MF. This means all other variables vary significantly across the clusters, and it also indicates that the four-cluster solution is a good solution.
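The cluster-wise one-way ANOVA can be sketched the same way outside SPSS; the example below uses hypothetical ages for four clusters (illustrative values, not the book's data).

```python
from scipy import stats

# Hypothetical ages in four clusters (illustrative, not the book's data)
clusters = [
    [19, 20, 21, 22, 20],
    [45, 46, 47, 46, 45],
    [33, 34, 33, 34, 34],
    [25, 26, 27, 26, 26],
]

# One-way ANOVA: H0 = all cluster means are equal
f_stat, p_value = stats.f_oneway(*clusters)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```

With groups this well separated, the p-value is tiny and the null hypothesis of equal means is rejected.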
K Means Cluster
This method is used when one knows in advance how many clusters are to be formed. The procedure for k-means clustering is as follows.
CA Snapshot 11
Click on Save, and the following window will appear.
CA Snapshot 13
Select Cluster Membership and click on Continue.
SPSS will take you back to the window shown in CA Snapshot 12. Click on Options and the following window will appear.
CA Snapshot 14
Click on Continue.
Initial Cluster Centers
                                           Cluster
                                   1        2        3        4
Age                               55       22       54       45
Rate_CM                            1        4        1        1
Rate_SM                            2        4        2        2
Rate_FD                            4        4        4        4
Rate_MF                            2        3        3        3
Invest_CM                      45000     5000    50000    25000
Invest_SM                      60000     3000    60000   185000
Invest_FD                      75000     1000   200000   100000
Invest_MF                      75000     2500    50000   155000
how_much_time_block_your_money     8        4        8        6
risky_CM                           8        3        8        6
risky_SM                           8        6        8        7
risky_FD                           1        2        1        2
risky_MF                           7        3        4        7
This table gives the initial cluster centers. The initial cluster centers are the variable values of the k well-spaced observations.
Iteration History
The iteration history shows the progress of the clustering process at each step. This table has only three steps, as the process stopped once there was no change in the cluster centers.
Final Cluster Centers
                                           Cluster
                                   1        2        3        4
Age                               33       27       54       34
Rate_CM                            2        3        1        2
Rate_SM                            3        3        2        3
Rate_FD                            4        4        4        4
Rate_MF                            4        3        3        4
Invest_CM                      31000     2981    50000    36667
Invest_SM                      58182    11769    60000   161667
Invest_FD                      41636    11827   200000    81667
Invest_MF                      60636    10846    50000   170000
how_much_time_block_your_money     4        3        8        5
risky_CM                           5        6        8        5
risky_SM                           7        6        8        5
risky_FD                           1        2        1        1
risky_MF                           6        6        4        5
ANOVA
                                      Cluster                  Error
                                 Mean Square   df       Mean Square   df        F     Sig.
Age                                 318.596     3          120.025    40    2.654    .062
Rate_CM                               2.252     3            1.306    40    1.725    .177
Rate_SM                                .599     3            1.029    40     .582    .630
Rate_FD                                .377     3            1.269    40     .297    .827
Rate_MF                                .722     3            1.366    40     .528    .665
Invest_CM                        3531738685     3      160901842.9    40   21.950    .000
Invest_SM                        3.750E+010     3      240364627.0    40  156.032    .000
Invest_FD                        1.819E+010     3      336746248.5    40   54.022    .000
Invest_MF                        4.225E+010     3      268848251.7    40  157.162    .000
how_much_time_block_your_money       10.876     3            3.614    40    3.010    .041
risky_CM                              3.654     3            3.692    40     .990    .407
risky_SM                              3.261     3            4.594    40     .710    .552
risky_FD                               .603     3             .427    40    1.411    .254
risky_MF                              1.949     3            4.683    40     .416    .742
The F tests should be used only for descriptive purposes because the clusters have been chosen to maximize the differences among cases in different clusters. The observed significance levels are not corrected for this and thus cannot be interpreted as tests of the hypothesis that the cluster means are equal.
The ANOVA indicates that the clusters differ only on the investment variables (invest in CM, invest in SM, invest in FD and invest in MF) and on how long one is willing to block one's money, as the significance is less than 0.05 only for these variables.
Number of Cases in each Cluster
Cluster 1 11.000
2 26.000
3 1.000
4 6.000
Valid 44.000
Missing 1.000
The above Table gives the number of cases for each cluster.
It may be noted that this solution differs from the hierarchical solution, and that the hierarchical clustering is more valid for this data: it was run on standardized scores, whereas this method did not standardize the variables.
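The effect of standardization can be seen in a short sketch (synthetic data, with scikit-learn standing in for SPSS): k-means on raw rupee amounts would be dominated by the large-scale investment column, so the variables are z-scored before clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: one column on a huge scale (investment in Rs)
# and one on a small scale (a 1-7 style rating)
invest = np.concatenate([rng.normal(5000, 500, 20), rng.normal(150000, 5000, 20)])
rating = np.concatenate([rng.normal(3, 1, 20), rng.normal(5, 1, 20)])
X = np.column_stack([invest, rating])

# Without standardization the investment column dominates the distances;
# z-scoring puts every variable on the same footing first.
X_std = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print(labels)
```

With this clear separation, the first 20 cases fall in one cluster and the remaining 20 in the other.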
9 Conjoint Analysis
The name "Conjoint Analysis" implies the study of the joint effects. In marketing applications, it
helps in the study of joint effects of multiple product attributes on product choice. Conjoint analysis
involves the measurement of psychological judgements
such as consumer preferences, or perceived similarities or differences between choice alternatives.
In fact, conjoint analysis is a versatile marketing research technique which provides valuable
information for new product development, assessment of demand, evolving market
segmentation strategies, and pricing decisions. This technique is used to assess a wide number
of issues including:
– The profitability and/or market share for proposed new product concepts, given the existing competition.
– The impact of new competitors' products on profits or market share of a company if the status quo is maintained with respect to its products and services.
– Customers' switch rates, either from a company's existing products to the company's new products or from competitors' products to the company's new products.
– Competitive reaction to the company's strategies of introducing a new product.
– The differential response to alternative advertising strategies and/or advertising themes.
– The customer response to alternative pricing strategies, specific price levels, and proposed price changes.
Conjoint analysis examines the trade-offs that consumers make in purchasing a product. In
evaluating products, consumers make trade-offs. A TV viewer may like to enjoy the programs
on a LCD TV but might not go for it because of the high cost. In this case, cost is said to have
a high utility value. Utility can be defined as a number which represents the value that
consumers place on specific attributes. A low utility indicates less value; a high utility
indicates more value. In other words, it represents the relative ‘worth’ of the attribute. This
helps in designing products/services that are most appealing to a specific market. In addition,
because conjoint analysis identifies important attributes, it can be used to create advertising
messages that are most appealing.
The process of data collection involves showing respondents a series of cards that contain a written
description of the product or service. If a consumer product is being tested, then a picture of the
product can be included along with a written description. Several cards are prepared describing the
combination of various alternative set of features of a product or service. A consumer’s response is
collected by his/ her selection of number between 1 and 10. While ‘1’ indicates strongest dislike, ‘10’
indicates strongest like for the combination of features on the card. Such data becomes the input for
final analysis which is carried out through computer software.
The concepts and methodology are elaborated in the case study given below.
9.1 Conjoint Analysis Using SPSS
Case 3
Credit Cards
The new head of the credit card division in a bank wanted to revamp the credit card business of the
bank and convert it from loss making business to profit making business. He was given freedom to
experiment with various options that he considered as relevant. Accordingly, he organized a focus
group discussion for assessing the preference of the customers for various parameters associated with
the credit card business. Thereafter he selected the following parameters for study.
1) Transaction Time- this is the time taken for credit card transaction
2) Fees - the annual fees charged by the credit card company
3) Interest rate – the interest rate charged by the credit card company for the customers who revolve
the credits.(customers who do not pay full bill amount but use partial payment option and pay at their
convenience)
The levels of the above mentioned attributes were as follows:
Transaction Time – 1 minute, 1.5 minutes, 2 minutes
Fees – 0, Rs 1000, Rs 2000
Interest rate – 1.5%, 2%, 2.5% (per month)
This led to a total of 3×3×3=27 combinations. Twenty seven cards were prepared representing each
combination and the customers were asked to arrange these cards in order of their preference.
The following table shows all the possible combinations and the order given by the customer.
Input Data for Credit Card
Card   Transaction Time (min)   Fees (Rs)   Interest Rate (% p.m.)   Preference*
7 1 0 2 21
8 1.5 0 2 20
9 1 2000 1.5 19
10 1.5 2000 1.5 18
11 1 1000 2 17
12 1.5 1000 2 16
13 1 2000 2 15
14 2 2000 1.5 14
15 1.5 2000 2 13
16 2 0 2 12
17 2 1000 2 11
18 1 0 2 10
19 1.5 0 2.5 9
20 1 1000 2.5 8
21 2 1000 2.5 7
22 2 2000 2 6
23 2 0 2.5 5
24 2 1000 2.5 4
25 1 2000 2.5 3
26 1.5 2000 2.5 2
27 2 2000 2.5 1
* A rating of 27 indicates the most preferred and a rating of 1 the least preferred option by the customer.
Conduct appropriate analysis to find the utility for these three factors.
The data is available in credit card.sav file, given in the CD.
Running Conjoint as a Regression Model: Introduction of Dummy Variables
Representing dummy variables:
X1, X2 = transaction time
X3, X4 = Annual Fees
X5, X6 = Interest Rates
The 3 levels of transaction time are coded as follows:
Transaction Time X1 X2
1 1 0
1.5 0 1
2 -1 -1
The 3 levels of fees are coded as follows:
Fees X3 X4
0 1 0
1000 0 1
2000 -1 -1
The 3 levels of interest rates are coded as follows:
Interest rates X5 X6
1.5 1 0
2 0 1
2.5 -1 -1
Thus, six variables, i.e. X1 to X6, are used to represent the 3 levels of transaction time (1, 1.5, 2), the 3 levels of fees (0, 1000, 2000) and the 3 levels of interest rates (1.5, 2, 2.5). All six variables are independent variables in the regression run. Another variable, Y, the rating of each combination given by the respondent, forms the dependent variable of the regression model.
Thus we generate the regression equation as: Y = a + b1X1 + b2X2 + b3X3 + b4X4 + b5X5 + b6X6
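This coding can also be run outside SPSS as an ordinary least-squares problem. The sketch below uses an illustrative nine-card subset with made-up ratings (not the 27-card data from the case; `effect_code` is a hypothetical helper) to build the effect-coded design matrix and solve for the part-worths with numpy.

```python
import numpy as np

# Effect (deviation) coding for one 3-level attribute, as in the text:
# level 1 -> (1, 0), level 2 -> (0, 1), level 3 -> (-1, -1)
def effect_code(levels, value):
    # Hypothetical helper: returns the two effect-coded columns
    if value == levels[0]:
        return [1, 0]
    if value == levels[1]:
        return [0, 1]
    return [-1, -1]

# Tiny illustrative design with made-up preference scores
# (NOT the 27-card credit-card data from the case)
time_lv, fee_lv, rate_lv = [1, 1.5, 2], [0, 1000, 2000], [1.5, 2, 2.5]
cards = [(1, 0, 1.5, 9), (1.5, 1000, 2, 5), (2, 2000, 2.5, 1),
         (1, 1000, 2.5, 4), (1.5, 2000, 1.5, 6), (2, 0, 2, 5),
         (1, 2000, 2, 5), (1.5, 0, 2.5, 5), (2, 1000, 1.5, 6)]
X = np.array([[1] + effect_code(time_lv, t) + effect_code(fee_lv, f)
              + effect_code(rate_lv, r) for t, f, r, y in cards])
y = np.array([y for *_, y in cards])

# Least squares fit: b[0] is the constant a, b[1:] are b1..b6 (part-worths)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(b, 3))
```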
The following window will be displayed.
Conjoint Snapshot 2
Select rate as the dependent variable and x1, x2, x3, x4, x5, x6 as the independent variables, then click on OK.
Variables Entered/Removed(b)
Model   Variables Entered           Variables Removed   Method
1       x6, x4, x2, x5, x3, x1(a)   .                   Enter
a. All requested variables entered.
b. Dependent Variable: rate
This table indicates that R-square for the above model is 0.963, which is close to one: 96.3% of the variation in the rate is explained by the six independent variables (x1 to x6). We conclude that the regression model fits well and explains the variation in the dependent variable quite well.
ANOVA(b)
Model              Sum of Squares   df   Mean Square        F      Sig.
1   Regression          1517.912     6       252.985   42.133   .000(a)
    Residual             120.088    20         6.004
    Total               1638.000    26
a. Predictors: (Constant), x6, x4, x2, x5, x3, x1
b. Dependent Variable: rate
Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model                  B         Std. Error             Beta                  t      Sig.
1   (Constant)      13.857          .476                                  29.104    .000
    x1               1.377          .673                .148                2.044   .054
    x2               1.326          .695                .138                1.908   .071
    x3               2.265          .673                .237                3.364   .003
    x4               1.480          .673                .155                2.198   .040
    x5               8.143          .670                .829               12.152   .000
    x6               -.121          .661               -.013                -.183   .857
a. Dependent Variable: rate
The coefficients in column B indicate the utility values for each variable. The regression equation is as follows:
Y = 13.857 + 1.377X1 + 1.326X2 + 2.265X3 + 1.480X4 + 8.143X5 - 0.121X6
For Annual Fees (in rupees), for example, the utilities of the three levels are:
0     :  2.265
1000  :  1.480
2000  : -3.745 (= -(2.265 + 1.480), by the effect-coding constraint)
Range = 2.265 - (-3.745) = 6.01, which corresponds to a relative importance of 22.89%.
INDIVIDUAL ATTRIBUTES
The difference in utility with the change of one level in one attribute can also be checked.
1. Transaction Time
From 1 min to 1.5 min there is a decrease in utility of 0.051 units, but the next level, 1.5 min to 2 min, has a decrease in utility of 4.029 units.
2. Annual Fees
Increasing fees from 0 to Rs.1000 induces a utility drop of 0.785, whereas from Rs.1000 to Rs.2000 there is a decrease in utility of 5.225.
3. Interest Rates
An interest rate increase from 1.5% to 2.0% induces a drop of 8.264 units in utility.
An interest rate increase from 2.0% to 2.5% induces a drop of 7.901 units in utility.
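From the part-worths reported above, the attribute importances can be computed directly. The sketch below uses the utilities from the regression output (the third level of each attribute is -(b1 + b2) by the effect-coding constraint) and reproduces the 22.89% importance of annual fees stated in the text.

```python
# Attribute importance = utility range / sum of ranges, using the
# part-worth utilities from the coefficients table above
utils = {
    "transaction_time": [1.377, 1.326, -(1.377 + 1.326)],
    "annual_fees":      [2.265, 1.480, -(2.265 + 1.480)],
    "interest_rate":    [8.143, -0.121, -(8.143 - 0.121)],
}
ranges = {k: max(v) - min(v) for k, v in utils.items()}
total = sum(ranges.values())
for k, r in ranges.items():
    print(f"{k}: range={r:.3f}, importance={100 * r / total:.2f}%")
# annual_fees has range 6.01 and importance 22.89%, matching the text
```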
10 Multidimensional Scaling
Multidimensional Scaling (MDS) transforms consumer judgments/perceptions of similarity or preference into points in a multidimensional space (usually 2 or 3 dimensions). It is useful for designing products and services. In fact, MDS is a set of procedures for drawing pictures of data so that the researcher can
– visualise relationships described by the data more clearly, and
– offer clearer explanations of those relationships.
Thus MDS reveals relationships that would otherwise appear obscure when one examines only the numbers resulting from a study.
It attempts to find the structure in a set of distance measures between objects. This is done by assigning observations to specific locations in a conceptual space (of 2 or 3 dimensions) such that the distances between points in the space match the given dissimilarities as closely as possible.
If objects A and B are judged by the respondents as being most similar compared to all other possible
pairs of objects, multidimensional technique positions these objects in the space in such a manner that
the distance between them is smaller than that between any other two objects.
Suppose, data is collected for perceiving the differences or distances among three objects say A B and
C, and the following distance matrix emerges.
A B C
A 0 4 6
B 4 0 3
C 6 3 0
These distances can be depicted as a triangle whose vertices are A, B and C and whose sides equal the distances: A-B = 4, B-C = 3, A-C = 6.
However, if the data comprises only ordinal or rank data, then the same distance matrix could be written in terms of ranks as:
A B C
A 0 2 3
B 2 0 1
C 3 1 0
and can be depicted as a triangle whose sides are proportional to the ranks: A-B = 2, B-C = 1, A-C = 3.
If the actual magnitudes of the original similarities (distances) are used to obtain a geometric representation, the process is called "Metric Multidimensional Scaling".
When only the ordinal information, in terms of ranks, is used to obtain a geometric representation, the process is called "Non-metric Multidimensional Scaling".
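As an illustration, metric MDS can be run on the small distance matrix above. A sketch with scikit-learn's MDS (SMACOF algorithm) on precomputed dissimilarities follows; the recovered inter-point distances should closely match the input matrix.

```python
import numpy as np
from sklearn.manifold import MDS

# Distance matrix for objects A, B, C from the text
D = np.array([[0, 4, 6],
              [4, 0, 3],
              [6, 3, 0]], dtype=float)

# Metric MDS on precomputed dissimilarities, embedded in 2 dimensions
mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)

# The recovered inter-point distances should reproduce D closely,
# since three points with these distances fit exactly in a plane
recovered = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
print(np.round(recovered, 2))
```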
The MDS plot positions the five zonal managers (A to E) on two dimensions: concern for the organisation and concern for the staff.
It is observed that two zonal managers viz. B and E exhibit high concern for both the organisation as
well as staff. If these criteria are critical to the organisation, then these two zonal managers could be
the right candidates for higher positions in the Head Office.
Illustration 5
Similar study could be conducted for a group of companies to have an assessment of the perception of
investors about the attitude of companies towards interest of their shareholders and vis-à-vis interest
of their staff.
For example, from the following MDS graph, it is observed that company A is perceived to be taking
more interest in the welfare of the staff than company B.
The MDS graph plots the companies on the dimensions "Interest of Shareholders" and "Interest of Staff".
11 Classification and Regression Trees
C&RT, a recursive partitioning method, builds classification and regression trees for predicting categorical dependent variables (classification) and continuous dependent variables (regression). The classic C&RT algorithm was popularized by Breiman et al. (Breiman, Friedman, Olshen, & Stone, 1984; see also Ripley, 1996). A general introduction to tree-classifiers, specifically to the QUEST (Quick, Unbiased, Efficient Statistical Trees) algorithm, is also presented in the context of the Classification Trees Analysis facilities, and much of the following discussion presents the same information in only a slightly different context. Another, similar type of tree-building algorithm is CHAID (Chi-square Automatic Interaction Detector; see Kass, 1980).
There are numerous algorithms for predicting continuous variables or categorical variables from a set
of continuous predictors and/or categorical factor effects. For example, in GLM (General Linear
Models) and GRM (General Regression Models), we can specify a linear combination (design) of
continuous predictors and categorical factor effects (e.g., with two-way and three-way interaction
effects) to predict a continuous dependent variable. In GDA (General Discriminant Function
Analysis), we can specify such designs for predicting categorical variables, i.e., to solve classification
problems.
Regression-type problems. Regression-type problems are generally those where we attempt to
predict the values of a continuous variable from one or more continuous and/or categorical predictor
variables. For example, we may want to predict the selling prices of single family homes (a
continuous dependent variable) from various other continuous predictors (e.g., square footage) as well
as categorical predictors (e.g., style of home, such as ranch, two-story, etc.; zip code or telephone area
code where the property is located, etc.; note that this latter variable would be categorical in nature,
even though it would contain numeric values or codes). If we used simple multiple regression, or
some general linear model (GLM) to predict the selling prices of single family homes, we would
determine a linear equation for these variables that can be used to compute predicted selling prices.
There are many different analytic procedures for fitting linear models (GLM, GRM, Regression),
various types of nonlinear models (e.g., Generalized Linear/Nonlinear Models (GLZ), Generalized
Additive Models (GAM), etc.), or completely custom-defined nonlinear models (see Nonlinear Estimation).
In most general terms, the purpose of the analyses via tree-building algorithms is to determine a set
of if-then logical (split) conditions that permit accurate prediction or classification of cases.
CLASSIFICATION TREES
For example, consider the widely referenced Iris data classification problem introduced by Fisher
[1936; see also Discriminant Function Analysis and General Discriminant Analysis (GDA)]. The data
file Irisdat reports the lengths and widths of sepals and petals of three types of irises (Setosa,
Versicol, and Virginic). The purpose of the analysis is to learn how we can discriminate between the
three types of flowers, based on the four measures of width and length of petals and sepals.
Discriminant function analysis will estimate several linear combinations of predictor variables for
computing classification scores (or probabilities) that allow the user to determine the predicted
classification for each observation. A classification tree, by contrast, determines a set of logical if-then conditions (rather than linear equations) for predicting or classifying cases.
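As an illustration, a depth-limited decision tree on Fisher's Iris data, using scikit-learn's CART implementation as a stand-in for the C&RT output described in the text:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Fisher's Iris data: 4 measurements, 3 species
X, y = load_iris(return_X_y=True)

# A shallow tree: a few if-then splits instead of linear equations
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=load_iris().feature_names))

acc = tree.score(X, y)
print(acc)  # about 0.96 on the training data
```

The printed rules show that a couple of petal-measurement splits are enough to separate the three species almost perfectly.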
REGRESSION TREES
The general approach of deriving predictions from a few simple if-then conditions can be applied to regression problems as well. This example is based on the data file Poverty, which contains 1960 and 1970 Census figures for a random selection of 30 counties. The research question (for that example) was to determine the correlates of poverty, that is, the variables that best predict the percent of families below the poverty line in a county. A reanalysis of those data, using regression tree analysis (with v-fold cross-validation), yields the following results.
Again, the interpretation of these results is rather straightforward: counties where the percent of households with a phone is greater than 72% generally have a lower poverty rate. The greatest poverty rate is evident in those counties that show less than (or equal to) 72% of households with a phone, and where the population change (from the 1960 census to the 1970 census) is less than -8.3 (minus 8.3). These results are straightforward, easily presented, and intuitively clear as well: there are some affluent counties (where most households have a telephone), and those generally have little poverty. Then there are counties that are generally less affluent, and among those, the ones that shrank the most show the greatest poverty rate. A quick review of the scatterplot of observed vs. predicted values shows how the discrimination between the latter two groups is particularly well "explained" by the tree model.
As mentioned earlier, there are a large number of methods that an analyst can choose from when
analyzing classification or regression problems. Tree classification techniques, when they "work" and
produce accurate predictions or predicted classifications based on few logical if-then conditions, have
a number of advantages over many of those alternative techniques.
Simplicity of results. In most cases, the interpretation of results summarized in a tree is very simple.
This simplicity is useful not only for purposes of rapid classification of new observations (it is much
easier to evaluate just one or two logical conditions, than to compute classification scores for each
possible group, or predicted values, based on all predictors and using possibly some complex
nonlinear model equations), but can also often yield a much simpler "model" for explaining why
observations are classified or predicted in a particular manner (e.g., when analyzing business
problems, it is much easier to present a few simple if-then statements to management, than some
elaborate equations).
Tree methods are nonparametric and nonlinear. The final results of using tree methods for
classification or regression can be summarized in a series of (usually few) logical if-then conditions
(tree nodes). Therefore, there is no implicit assumption that the underlying relationships between the
predictor variables and the dependent variable are linear, follow some specific non-linear link
function [e.g., see Generalized Linear/Nonlinear Models (GLZ)], or that they are even monotonic in
nature. For example, some continuous outcome variable of interest could be positively related to a
variable Income if the income is less than some certain amount, but negatively related if it is more
than that amount (i.e., the tree could reveal multiple splits based on the same variable Income,
revealing such a non-monotonic relationship between the variables). Thus, tree methods are
particularly well suited for data mining tasks, where there is often little a priori knowledge and no coherent set of theories or predictions regarding which variables are related and how. In those types of data analyses, tree methods can often reveal simple relationships between just a few variables that could have easily gone unnoticed using other analytic techniques.
The computational details involved in determining the best split conditions to construct a simple yet
useful and informative tree are quite complex. Refer to Breiman et al. (1984) for a discussion of their
CART® algorithm to learn more about the general theory of and specific computational solutions for
constructing classification and regression trees. An excellent general discussion of tree classification
and regression methods, and comparisons with other approaches to pattern recognition and neural
networks, is provided in Ripley (1996).
A major issue that arises when applying regression or classification trees to "real" data with much
random error noise concerns the decision when to stop splitting. For example, if we had a data set
with 10 cases, and performed 9 splits (determined 9 if-then conditions), we could perfectly predict
every single case. In general, if we only split a sufficient number of times, eventually we will be able
to "predict" ("reproduce" would be the more appropriate term here) our original data (from which we
determined the splits). Of course, it is far from clear whether such complex results (with many splits)
will replicate in a sample of new observations; most likely they will not.
This general issue is also discussed in the literature on tree classification and regression methods, as
well as neural networks, under the topic of "overlearning" or "overfitting." If not stopped, the tree
algorithm will ultimately "extract" all information from the data, including information that is not and
cannot be predicted in the population with the current set of predictors, i.e., random or noise
variation. The general approach to addressing this issue is first to stop generating new split nodes
when subsequent splits only result in very little overall improvement of the prediction. For example,
if we can predict 90% of all cases correctly from 10 splits, and 90.1% of all cases from 11 splits, then
it obviously makes little sense to add that 11th split to the tree. There are many such criteria for
automatically stopping the splitting (tree-building) process.
Once the tree building algorithm has stopped, it is always useful to further evaluate the quality of the
prediction of the current tree in samples of observations that did not participate in the original
computations. These methods are used to "prune back" the tree, i.e., to eventually (and ideally) select
a simpler tree than the one obtained when the tree building algorithm stopped, but one that is equally
as accurate for predicting or classifying "new" observations.
Crossvalidation. One approach is to apply the tree computed from one set of observations (learning
sample) to another completely independent set of observations (testing sample). If most or all of the
splits determined by the analysis of the learning sample are essentially based on "random noise," then
the prediction for the testing sample will be very poor. Hence, we can infer that the selected tree is
not very good (useful), and not of the "right size."
V-fold crossvalidation. Continuing further along this line of reasoning (described in the context of crossvalidation above), why not repeat the analysis many times over with different randomly drawn samples from the data, for every tree size starting at the root of the tree, applying each tree to the prediction of observations from randomly selected testing samples? Then use (interpret, or accept as the final result) the tree that shows the best average accuracy for cross-validated predicted classifications or predicted values. In most cases, this tree will not be the one with the most terminal
nodes, i.e., the most complex tree. This method for pruning a tree, and for selecting a smaller tree
from a sequence of trees, can be very powerful, and is particularly useful for smaller data sets. It is an
essential step for generating useful (for prediction) tree models, and because it can be
computationally difficult to do, this method is often not found in tree classification or regression
software.
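A sketch of this idea with scikit-learn follows (synthetic data standing in for the Poverty file): 5-fold cross-validation scores are compared across trees of increasing depth, and the depth with the best average out-of-sample score is kept. This selects a tree size by cross-validation much as the pruning procedure described above does.

```python
from sklearn.datasets import make_regression  # synthetic stand-in data
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Illustrative data, not the Poverty file from the text
X, y = make_regression(n_samples=200, n_features=5, noise=20.0, random_state=0)

# v-fold (here 5-fold) cross-validation over trees of increasing depth:
# keep the depth with the best average out-of-sample R-squared.
scores = {}
for depth in [1, 2, 3, 5, 10, None]:
    model = DecisionTreeRegressor(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(model, X, y, cv=5).mean()
best = max(scores, key=scores.get)
print(best, round(scores[best], 3))
```

The unpruned tree (depth None) typically does not win here: it fits the training folds perfectly but generalizes worse than a smaller tree.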
Another general issue that arises when applying tree classification or regression methods is that the
final trees can become very large. In practice, when the input data are complex and, for example,
contain many different categories for classification problems and many possible predictors for
performing the classification, then the resulting trees can become very large. This is not so much a
computational problem as it is a problem of presenting the trees in a manner that is easily accessible
to the data analyst, or for presentation to the "consumers" of the research.
The classic (Breiman et al., 1984) classification and regression tree algorithms can accommodate both continuous and categorical predictors. However, in practice, it is not uncommon to combine such variables into analysis of variance/covariance (ANCOVA)-like predictor designs with main effects or interaction effects for categorical and continuous predictors. This method of analyzing coded ANCOVA-like designs is relatively new. However, it is easy to see how the use of coded
predictor designs expands these powerful classification and regression techniques to the analysis of
data from experimental designs (e.g., see for example the detailed discussion of experimental design
methods for quality improvement in the context of the Experimental Design module of Industrial
Statistics).
Computational Details
The process of computing classification and regression trees can be characterized as involving four basic steps:
1. Specifying the criteria for predictive accuracy
2. Selecting splits
3. Determining when to stop splitting
4. Selecting the "right-sized" tree
The classification and regression trees (C&RT) algorithms are generally aimed at achieving the best
possible predictive accuracy. Operationally, the most accurate prediction is defined as the prediction
with the minimum costs. The notion of costs was developed as a way to generalize, to a broader range
of prediction situations, the idea that the best prediction has the lowest misclassification rate. In most
applications, the cost is measured in terms of proportion of misclassified cases, or variance. In this
context, it follows, therefore, that a prediction would be considered best if it has the lowest
misclassification rate or the smallest variance. The need for minimizing costs, rather than just the
proportion of misclassified cases, arises when some predictions that fail are more catastrophic than
others, or when some predictions that fail occur more frequently than others.
Priors. In the case of a categorical response (classification problem), minimizing costs amounts to
minimizing the proportion of misclassified cases when priors are taken to be proportional to the class
sizes and when misclassification costs are taken to be equal for every class.
The a priori probabilities used in minimizing costs can greatly affect the classification of cases or
objects. Therefore, care has to be taken while using the priors. If differential base rates are not of
interest for the study, or if we know that there are about an equal number of cases in each class,
then we would use equal priors. If the differential base rates are reflected in the class sizes (as they
would be, if the sample is a probability sample), then we would use priors estimated by the class
proportions of the sample. Finally, if we have specific knowledge about the base rates (for example,
based on previous research), then we would specify priors in accordance with that knowledge. The
general point is that the relative size of the priors assigned to each class can be used to "adjust" the
importance of misclassifications for each class. However, no priors are required when we are building
a regression tree.
Misclassification costs. Sometimes more accurate classification of the response is desired for some
classes than others for reasons not related to the relative class sizes. If the criterion for predictive
accuracy is Misclassification costs, then minimizing costs would amount to minimizing the
proportion of misclassified cases when priors are considered proportional to the class sizes and
misclassification costs are taken to be equal for every class.
Case weights. Case weights are treated strictly as case multipliers. For example, the misclassification
rates from an analysis of an aggregated data set using case weights will be identical to the
misclassification rates from the same analysis where the cases are replicated the specified number of
times in the data file.
However, note that the use of case weights for aggregated data sets in classification problems is
related to the issue of minimizing costs. Interestingly, as an alternative to using case weights for
aggregated data sets, we could specify appropriate priors and/or misclassification costs and produce
the same results while avoiding the additional processing required to analyze multiple cases with the
same values for all variables. Suppose that in an aggregated data set with two classes having an equal
number of cases, there are case weights of 2 for all cases in the first class, and case weights of 3 for
all cases in the second class. If we specified priors of .4 and .6, respectively, specified equal
misclassification costs, and analyzed the data without case weights, we will get the same
misclassification rates as we would get if we specified priors estimated by the class sizes, specified
equal misclassification costs, and analyzed the aggregated data set using the case weights. We would
also get the same misclassification rates if we specified priors to be equal, specified the costs of
misclassifying class 1 cases as class 2 cases to be 2/3 of the costs of misclassifying class 2 cases as
class 1 cases, and analyzed the data without case weights.
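The case-weight equivalence can be demonstrated directly. In the sketch below (synthetic data), fitting a tree with a uniform sample weight of 3 gives the same predictions as fitting on a data set in which every case is replicated three times.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Weighting every case by 3 ...
w = np.full(len(y), 3.0)
t1 = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)

# ... is equivalent to replicating each case 3 times in the data file
X_rep, y_rep = np.repeat(X, 3, axis=0), np.repeat(y, 3)
t2 = DecisionTreeClassifier(random_state=0).fit(X_rep, y_rep)

# Compare predictions on fresh points
grid = rng.normal(size=(50, 2))
print((t1.predict(grid) == t2.predict(grid)).mean())
```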
SELECTING SPLITS
The second basic step in classification and regression trees is to select the splits on the predictor
variables that are used to predict membership in classes of the categorical dependent variables, or to
predict values of the continuous dependent (response) variable. In general terms, the split at each
node will be found that will generate the greatest improvement in predictive accuracy. This is usually
measured with some type of node impurity measure, which provides an indication of the relative
homogeneity (the inverse of impurity) of cases in the terminal nodes. If all cases in each terminal
node show identical values, then node impurity is minimal, homogeneity is maximal, and prediction
is perfect (at least for the cases used in the computations; predictive validity for new cases is of
course a different matter...).
For classification problems, C&RT gives the user the choice of several impurity measures: The Gini
index, Chi-square, or G-square. The Gini index of node impurity is the measure most commonly
chosen for classification-type problems. As an impurity measure, it reaches a value of zero when only

one class is present at a node. With priors estimated from class sizes and equal misclassification
costs, the Gini measure is computed as the sum of products of all pairs of class proportions for classes
present at the node; it reaches its maximum value when class sizes at the node are equal; the Gini
index is equal to zero if all cases in a node belong to the same class. The Chi-square measure is
similar to the standard Chi-square value computed for the expected and observed classifications (with
priors adjusted for misclassification cost), and the G-square measure is similar to the maximum-
likelihood Chi-square (as for example computed in the Log-Linear module). For regression-type
problems, a least-squares deviation criterion (similar to what is computed in least squares regression)
is automatically used. Computational Formulas provides further computational details.
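The Gini measure just described can be sketched in a few lines (an illustrative Python sketch assuming equal misclassification costs; the function names are ours, not C&RT's). It shows both equivalent forms: the sum of products of all pairs of class proportions, and one minus the sum of squared proportions.

```python
def gini(proportions):
    """Gini impurity as the sum of products of all pairs of class proportions
    (equal misclassification costs, priors estimated from class sizes)."""
    return sum(p * q for i, p in enumerate(proportions)
               for j, q in enumerate(proportions) if i != j)

def gini_complement(proportions):
    """Equivalent form: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in proportions)

# A pure node (one class present) has zero impurity:
print(gini([1.0, 0.0]))    # 0.0
# Two equal classes give the maximum for two classes:
print(gini([0.5, 0.5]))    # 0.5
# The two formulations agree:
print(abs(gini([0.2, 0.3, 0.5]) - gini_complement([0.2, 0.3, 0.5])) < 1e-12)  # True
```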
As discussed in Basic Ideas, in principle, splitting could continue until all cases are perfectly
classified or predicted. However, this wouldn't make much sense since we would likely end up with a
tree structure that is as complex and "tedious" as the original data file (with many nodes possibly
containing single observations), and that would most likely not be very useful or accurate for
predicting new observations. What is required is some reasonable stopping rule. In C&RT, two
options are available that can be used to keep a check on the splitting process; namely Minimum n
and Fraction of objects.
Minimum n. One way to control splitting is to allow splitting to continue until all terminal nodes are
pure or contain no more than a specified minimum number of cases or objects. In C&RT this is done
by using the option Minimum n that allows us to specify the desired minimum number of cases as a
check on the splitting process. This option can be used when Prune on misclassification error, Prune
on deviance, or Prune on variance is active as the Stopping rule for the analysis.
Fraction of objects. Another way to control splitting is to allow splitting to continue until all
terminal nodes are pure or contain no more cases than a specified minimum fraction of the sizes of
one or more classes (in the case of classification problems, or all cases in regression problems). This
option can be used when FACT-style direct stopping has been selected as the Stopping rule for the
analysis. In C&RT, the desired minimum fraction can be specified as the Fraction of objects. For
classification problems, if the priors used in the analysis are equal and class sizes are equal as well,
then splitting will stop when all terminal nodes containing more than one class have no more cases
than the specified fraction of the class sizes for one or more classes. Alternatively, if the priors used
in the analysis are not equal, splitting will stop when all terminal nodes containing more than one
class have no more cases than the specified fraction for one or more classes. See Loh and
Vanichsetakul (1988) for details.
The size of a tree in the classification and regression trees analysis is an important issue, since an
unreasonably big tree can only make the interpretation of results more difficult. Some generalizations
can be offered about what constitutes the "right-sized" tree. It should be sufficiently complex to
account for the known facts, but at the same time it should be as simple as possible. It should exploit
information that increases predictive accuracy and ignore information that does not. It should, if
possible, lead to greater understanding of the phenomena it describes. The options available
in C&RT allow the use of either, or both, of two different strategies for selecting the "right-sized" tree
from among all the possible trees. One strategy is to grow the tree to just the right size, where the
right size is determined by the user, based on the knowledge from previous research, diagnostic
information from previous analyses, or even intuition. The other strategy is to use a set of well-
documented, structured procedures developed by Breiman et al. (1984) for selecting the "right-sized"
tree. These procedures are not foolproof, as Breiman et al. (1984) readily acknowledge, but at least
they take subjective judgment out of the process of selecting the "right-sized" tree.
FACT-style direct stopping. We will begin by describing the first strategy, in which the user
specifies the size to grow the tree. This strategy is followed by selecting FACT-style direct stopping
as the stopping rule for the analysis, and by specifying the Fraction of objects that allows the tree to
grow to the desired size. C&RT provides several options for obtaining diagnostic information to
determine the reasonableness of the choice of size for the tree. Specifically, three options are
available for performing cross-validation of the selected tree; namely Test sample, V-fold, and
Minimal cost-complexity.
Test sample cross-validation. The first, and most preferred type of cross-validation is the test
sample cross-validation. In this type of cross-validation, the tree is computed from the learning
sample, and its predictive accuracy is tested by applying it to predict the class membership in the test
sample. If the costs for the test sample exceed the costs for the learning sample, then this is an
indication of poor cross-validation. In that case, a different sized tree might cross-validate better. The
test and learning samples can be formed by collecting two independent data sets, or if a large learning
sample is available, by reserving a randomly selected proportion of the cases, say a third or a half, for
use as the test sample.
In the C&RT module, test sample cross-validation is performed by specifying a sample identifier
variable that contains codes for identifying the sample (learning or test) to which each case or object
belongs.
V-fold cross-validation. The second type of cross-validation available in C&RT is V-fold cross-
validation. This type of cross-validation is useful when no test sample is available and the learning
sample is too small to have the test sample taken from it. The user-specified 'v' value for v-fold cross-
validation (its default value is 3) determines the number of random subsamples, as equal in size as
possible, that are formed from the learning sample. A tree of the specified size is computed 'v' times,
each time leaving out one of the subsamples from the computations, and using that subsample as a
test sample for cross-validation, so that each subsample is used (v - 1) times in the learning sample
and just once as the test sample. The CV costs (cross-validation cost) computed for each of the 'v' test
samples are then averaged to give the v-fold estimate of the CV costs.
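The v-fold procedure can be sketched as follows (a minimal Python illustration; the toy one-split "stump" stands in for a real tree, and all names are our own):

```python
import random

def v_fold_cv_cost(cases, labels, fit, v=3, seed=0):
    """Average misclassification rate over v folds (equal costs).
    `fit` takes training (cases, labels) and returns a predict function."""
    idx = list(range(len(cases)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::v] for i in range(v)]      # v subsamples, as equal as possible
    costs = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        predict = fit([cases[i] for i in train], [labels[i] for i in train])
        errors = sum(predict(cases[i]) != labels[i] for i in held_out)
        costs.append(errors / len(held_out))   # CV cost on this test sample
    return sum(costs) / v                      # v-fold estimate of the CV costs

# Toy "tree": a single split on x > 0.
def fit_stump(xs, ys):
    return lambda x: 1 if x > 0 else 0

xs = [-3, -2, -1, 1, 2, 3]
ys = [0, 0, 0, 1, 1, 1]
print(v_fold_cv_cost(xs, ys, fit_stump, v=3))  # 0.0 (perfectly separable data)
```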
Minimal cost-complexity cross-validation pruning. In C&RT, minimal cost-complexity cross-
validation pruning is performed, if Prune on misclassification error has been selected as the Stopping
rule. On the other hand, if Prune on deviance has been selected as the Stopping rule, then minimal
deviance-complexity cross-validation pruning is performed. The only difference in the two options is
the measure of prediction error that is used. Prune on misclassification error uses a cost measure that
equals the misclassification rate when priors are estimated from the class sizes and misclassification costs are equal,
while Prune on deviance uses a measure, based on maximum-likelihood principles, called the
deviance (see Ripley, 1996). For details about the algorithms used in C&RT to implement Minimal
cost-complexity cross-validation pruning, see also the Introductory Overview and Computational
Methods sections of Classification Trees Analysis.
The sequence of trees obtained by this algorithm have a number of interesting properties. They are
nested, because the successively pruned trees contain all the nodes of the next smaller tree in the
sequence. Initially, many nodes are often pruned going from one tree to the next smaller tree in the
sequence, but fewer nodes tend to be pruned as the root node is approached. The sequence of largest
trees is also optimally pruned, because for every size of tree in the sequence, there is no other tree of
the same size with lower costs. Proofs and/or explanations of these properties can be found in
Breiman et al. (1984).
Tree selection after pruning. The pruning, as discussed above, often results in a sequence of
optimally pruned trees. So the next task is to use an appropriate criterion to select the "right-sized"
tree from this set of optimal trees. A natural criterion would be the CV costs (cross-validation costs).
While there is nothing wrong with choosing the tree with the minimum CV costs as the "right-sized"
tree, oftentimes there will be several trees with CV costs close to the minimum. Following Breiman et
al. (1984) we could use the "automatic" tree selection procedure and choose as the "right-sized" tree
the smallest-sized (least complex) tree whose CV costs do not differ appreciably from the minimum
CV costs. In particular, they proposed a "1 SE rule" for making this selection, i.e., choose as the
"right-sized" tree the smallest-sized tree whose CV costs do not exceed the minimum CV costs plus 1
times the standard error of the CV costs for the minimum CV costs tree. In C&RT, a multiple other
than 1 (the default) can also be specified for the SE rule. Thus, specifying a value of 0.0 would
result in the minimal CV cost tree being selected as the "right-sized" tree. Values greater than 1.0
could lead to trees much smaller than the minimal CV cost tree being selected as the "right-sized"
tree. One distinct advantage of the "automatic" tree selection procedure is that it helps to avoid "over
fitting" and "under fitting" of the data.
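The SE-rule selection just described can be sketched as below (illustrative Python over a hypothetical pruning sequence; the sizes, costs, and standard errors are invented for the example):

```python
def select_tree_1se(trees, se_multiple=1.0):
    """trees: list of (size, cv_cost, cv_se), one entry per optimally pruned tree.
    Returns the smallest tree whose CV cost does not exceed the minimum CV cost
    plus `se_multiple` times the SE of the minimum-cost tree (the "1 SE rule")."""
    min_cost_tree = min(trees, key=lambda t: t[1])
    threshold = min_cost_tree[1] + se_multiple * min_cost_tree[2]
    eligible = [t for t in trees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])     # least complex eligible tree

# Hypothetical pruning sequence: (number of terminal nodes, CV cost, SE).
seq = [(2, 0.30, 0.02), (5, 0.22, 0.02), (9, 0.21, 0.02), (15, 0.20, 0.02)]
print(select_tree_1se(seq))        # (5, 0.22, 0.02) -- smallest within 1 SE
print(select_tree_1se(seq, 0.0))   # (15, 0.2, 0.02) -- the minimal CV cost tree
```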
As can be seen, minimal cost-complexity cross-validation pruning and subsequent "right-sized"
tree selection is a truly "automatic" process. The algorithms make all the decisions leading to the
selection of the "right-sized" tree, except for, perhaps, specification of a value for the SE rule. V-fold
cross-validation allows us to evaluate how well each tree "performs" when repeatedly cross-validated
in different samples randomly drawn from the data.
Computational Formulas
In Classification and Regression Trees, estimates of accuracy are computed by different formulas for
categorical and continuous dependent variables (classification and regression-type problems). For
classification-type problems (categorical dependent variable) accuracy is measured in terms of the
true classification rate of the classifier, while in the case of regression (continuous dependent
variable) accuracy is measured in terms of mean squared error of the predictor.
In addition to measuring accuracy, the following measures of node impurity are used for
classification problems: The Gini measure, generalized Chi-square measure, and generalized G-
square measure. The Chi-square measure is similar to the standard Chi-square value computed for the
expected and observed classifications (with priors adjusted for misclassification cost), and the G-
square measure is similar to the maximum-likelihood Chi-square (as for example computed in
the Log-Linear module). The Gini measure is the one most often used for measuring purity in the
context of classification problems, and it is described below.
For continuous dependent variables (regression-type problems), the least squared deviation (LSD)
measure of impurity is automatically applied.
In classification problems (categorical dependent variable), three estimates of the accuracy are used:
resubstitution estimate, test sample estimate, and v-fold cross-validation. These estimates are defined
here.
Resubstitution estimate. The resubstitution estimate is the proportion of cases that are misclassified by the
classifier constructed from the entire sample; it is computed by applying that classifier back to the same
cases that were used to construct it.
Test sample estimate. The cases are divided into two subsamples; the classifier is constructed from the
first (learning) subsample, and the test sample estimate is the proportion of misclassified cases in the
second subsample, i.e., the subsample that is not used for constructing the classifier.
v-fold cross-validation. The learning sample of size N is divided into v subsamples of almost equal
sizes N1, N2, ..., Nv. For each subsample in turn, a classifier is constructed from the remaining cases,
and the v-fold cross-validation estimate is the average, over the v subsamples, of the proportion of
held-out cases misclassified by the classifier constructed without them.
In the regression problem (continuous dependent variable) three estimates of the accuracy are used:
resubstitution estimate, test sample estimate, and v-fold cross-validation. These estimates are defined
here.
Resubstitution estimate. The resubstitution estimate is the estimate of the expected squared error of the
predictor of the continuous dependent variable. Where the learning sample consists of (xi, yi), i = 1, 2, ..., N,
it is the average of the squared differences (yi - d(xi))^2 over the learning sample, computed using
the same data as were used in constructing the predictor d.
Test sample estimate. The total number of cases is divided into two subsamples of sizes N1 and N2. The
predictor is constructed from the first subsample, and the test sample estimate of the mean squared error
is the average of (yi - d(xi))^2 over the second subsample, i.e., the subsample that is not used for
constructing the predictor.
v-fold cross-validation. The total number of cases is divided into v subsamples of almost equal
sizes N1, N2, ..., Nv. For each subsample in turn, a predictor d is constructed from the remaining cases,
and the v-fold cross-validation estimate is the average, over the v subsamples, of the mean squared
error (yi - d(xi))^2 computed on the held-out subsample.
The Gini measure of impurity of a node t is commonly used when the dependent variable is a
categorical variable, and is defined as:
g(t) = sum over all pairs i ≠ j of C(i | j) p(j | t) p(i | t)
where the sum extends over all k categories, p(j | t) is the probability of category j at node t, and C(i | j)
is the cost of misclassifying a category j case as category i (with equal costs this reduces to one minus
the sum of the squared class proportions).
ESTIMATION OF NODE IMPURITY: LEAST-SQUARED DEVIATION
Least-squared deviation (LSD) is used as the measure of impurity of a node when the response variable is
continuous, and is computed as:
LSD(t) = (1 / Nw(t)) * sum over i of wi fi (yi - ybar(t))^2
where Nw(t) is the weighted number of cases in node t, wi is the value of the weighting variable for
case i, fi is the value of the frequency variable, yi is the value of the response variable, and ybar(t) is the
weighted mean for node t.
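The LSD computation can be sketched directly (illustrative Python; representing each case as a (weight, frequency, response) tuple is our assumption, not a fixed layout):

```python
def lsd(cases):
    """Least-squared deviation impurity of a node.
    cases: list of (w, f, y) = (case weight, frequency, response value)."""
    nw = sum(w * f for w, f, _ in cases)                 # weighted N in the node
    ybar = sum(w * f * y for w, f, y in cases) / nw      # weighted node mean
    return sum(w * f * (y - ybar) ** 2 for w, f, y in cases) / nw

node = [(1, 1, 2.0), (1, 1, 4.0)]   # two unweighted cases
print(lsd(node))                    # 1.0 (mean 3.0, squared deviation 1.0 each)
```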
The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms. An
object is classified by a majority vote of its neighbors, with the object being assigned the class most
common amongst its k nearest neighbors. k is a positive integer, typically small. If k = 1, then the
object is simply assigned the class of its nearest neighbor. In binary (two class) classification
problems, it is helpful to choose k to be an odd number as this avoids difficulties with tied votes.
The same method can be used for regression, by simply assigning the property value for the object to
be the average of the values of its k nearest neighbors. It can be useful to weight the contributions of
the neighbors, so that the nearer neighbors contribute more to the average than the more distant ones.
The neighbors are taken from a set of objects for which the correct classification (or, in the case of
regression, the value of the property) is known. This can be thought of as the training set for the
algorithm, though no explicit training step is required. In order to identify neighbors, the objects are
represented by position vectors in a multidimensional feature space. It is usual to use the Euclidean
distance, though other distance measures, such as the Manhattan distance could in principle be used
instead. The k-nearest neighbor algorithm is sensitive to the local structure of the data.
Algorithm
Figure: Example of k-NN classification. The test sample (green circle) should be classified either to the
first class of blue squares or to the second class of red triangles. If k = 3 it is classified to the second
class because there are 2 triangles and only 1 square inside the inner circle. If k = 5 it is classified to
the first class (3 squares vs. 2 triangles inside the outer circle).
The training examples are vectors in a multidimensional feature space. The space is partitioned into
regions by locations and labels of the training samples. A point in the space is assigned to the class c
if it is the most frequent class label among the k nearest training samples. Usually Euclidean distance
is used.
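The classification rule just described can be sketched in Python (a minimal illustration using Euclidean distance; the toy data loosely mimics the squares-vs-triangles figure, and all names are ours):

```python
import math
from collections import Counter

def knn_classify(train, query, k=3):
    """train: list of (feature_vector, label); query: a feature vector.
    Assigns the most frequent class among the k nearest training samples
    under Euclidean distance."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(train, key=lambda s: dist(s[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "square"), ((1, 2), "square"), ((2, 1), "square"),
         ((5, 5), "triangle"), ((5, 6), "triangle"), ((6, 5), "triangle")]
print(knn_classify(train, (5.5, 5.5), k=3))   # triangle
print(knn_classify(train, (1.5, 1.5), k=3))   # square
```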
The training phase of the algorithm consists only of storing the feature vectors and class labels of the
training samples. In the actual classification phase, the test sample (whose class is not known) is
represented as a vector in the feature space. Distances from the new vector to all stored vectors are
computed and the k closest samples are selected. There are a number of ways to assign the new vector
to a particular class; one of the most commonly used techniques is to predict the most common
class amongst the K nearest neighbors. A major drawback of using this technique is that classes with
more frequent examples tend to dominate the prediction of the new vector, as they tend to come up in
the K nearest neighbors simply by virtue of their large number. One way to overcome this problem is
to take into account the distance of each of the K nearest neighbors from the new vector that is to be
classified, and to predict the class of the new vector based on these distances.
Parameter selection
The best choice of k depends upon the data; generally, larger values of k reduce the effect of noise on
the classification, but make boundaries between classes less distinct. A good k can be selected by
various heuristic techniques, for example, cross-validation. The special case where the class is
predicted to be the class of the closest training sample (i.e. when k = 1) is called the nearest neighbor
algorithm.
The accuracy of the k-NN algorithm can be severely degraded by the presence of noisy or irrelevant
features, or if the feature scales are not consistent with their importance. Much research effort has
been put into selecting or scaling features to improve classification. A particularly popular approach
is the use of evolutionary algorithms to optimize feature scaling. Another popular approach is to scale
features by the mutual information of the training data with the training classes.
Properties
The naive version of the algorithm is easy to implement by computing the distances from the test
sample to all stored vectors, but it is computationally intensive, especially when the size of the
training set grows. Many optimizations have been proposed over the years; these generally seek to
reduce the number of distance evaluations actually performed. Some optimizations involve
partitioning the feature space, and only computing distances within specific nearby volumes.
The k-NN algorithm can also be adapted for use in estimating continuous variables. One such
implementation uses an inverse distance weighted average of the k-nearest multivariate neighbors.
This algorithm functions as follows:
1. Compute Euclidean or Mahalanobis distance from target plot to those that were sampled.
2. Order the samples according to the calculated distances.
3. Choose a heuristically optimal number k of nearest neighbors based on the RMSE obtained by
cross-validation.
4. Calculate an inverse distance weighted average with the k-nearest multivariate neighbors.
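Steps 1-4 can be sketched as follows (illustrative Python; plain Euclidean distance stands in for the Euclidean/Mahalanobis choice in step 1, and the small `eps` guard against division by zero is our addition):

```python
import math

def knn_regress(train, query, k=3, eps=1e-9):
    """train: list of (feature_vector, value). Predicts an inverse-distance
    weighted average of the values of the k nearest neighbors."""
    dist = lambda a, b: math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    nearest = sorted(((dist(x, query), y) for x, y in train))[:k]   # steps 1-2
    weights = [1.0 / (d + eps) for d, _ in nearest]    # nearer -> heavier weight
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)

train = [((0.0,), 0.0), ((1.0,), 1.0), ((2.0,), 2.0), ((10.0,), 10.0)]
# The exact match at x = 1.0 dominates the weighted average:
print(round(knn_regress(train, (1.0,), k=3), 3))   # 1.0
```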
KNN -Definition
KNN is a simple algorithm that stores all available cases and classifies new cases based on a similarity
measure.
KNN –different names
•K-Nearest Neighbors
•Memory-Based Reasoning
•Example-Based Reasoning
•Instance-Based Learning
•Case-Based Reasoning
•Lazy Learning
KNN -Applications
•Classification and Interpretation
–legal, medical, news, banking
•Problem-solving
–planning, pronunciation
•Function learning
–dynamic control
•Teaching and aiding
–help desk, user training
Summary
•KNN is conceptually simple, yet able to solve complex problems
•Can work with relatively little information
•Learning is simple (no learning at all!)
•Memory and CPU cost
•Feature selection problem
•Sensitive to representation
K Nearest Neighbor
Lazy Learning Algorithm
Defer the decision to generalize beyond the training examples till a new query is encountered.
Whenever we have a new point to classify, we find its K nearest neighbors from the training data.
The distance is calculated using one of the following measures:
•Euclidean Distance
•Minkowski Distance
•Mahalanobis Distance
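The first two measures can be sketched directly (illustrative Python; Mahalanobis distance is omitted here because it additionally requires the inverse covariance matrix of the data):

```python
def minkowski(a, b, p):
    """Minkowski distance of order p; p = 2 gives Euclidean, p = 1 Manhattan."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

a, b = (0, 0), (3, 4)
print(euclidean(a, b))       # 5.0
print(minkowski(a, b, 1))    # 7.0 (Manhattan distance)
```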
Curse of Dimensionality
Distance measures usually involve all the attributes and assume that all of them have the same effect on
the distance. The similarity metrics do not consider the relative importance of attributes, which results in
inaccurate distances and, in turn, affects classification precision. Wrong classification due to the presence
of many irrelevant attributes is often termed the curse of dimensionality.
For example: each instance is described by 20 attributes, out of which only 2 are relevant in
determining the classification of the target function. In this case, instances that have identical values
for the 2 relevant attributes may nevertheless be distant from one another in the 20-dimensional
instance space.
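The 20-attribute example can be checked numerically (an illustrative Python sketch with invented values):

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two instances identical on the 2 relevant attributes...
relevant = [0.0, 0.0]
# ...but differing on all 18 irrelevant attributes.
a = relevant + [0.0] * 18
b = relevant + [1.0] * 18
# Far apart in the 20-dimensional space despite identical relevant attributes:
print(round(euclid(a, b), 3))   # 4.243
```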
The autoregressive model AR(p) expresses the current value of the series as a linear combination of
its own past values:

y_t = c + ϕ1 y_{t-1} + ϕ2 y_{t-2} + ... + ϕp y_{t-p} + ε_t

Here y_t and ε_t are respectively the actual value and random error (or random shock) at time period t,
ϕi (i = 1, 2, ..., p) are model parameters and c is a constant. The integer constant p is known as the
order of the model. Sometimes the constant term is omitted for simplicity. For estimating the
parameters of an AR process from a given time series, the Yule-Walker equations are used. Just as
an AR(p) model regresses against past values of the series, an MA(q) model uses past errors as the
explanatory variables. The MA(q) model is given by:

y_t = μ + ε_t + θ1 ε_{t-1} + θ2 ε_{t-2} + ... + θq ε_{t-q}

Here μ is the mean of the series, θj (j = 1, 2, ..., q) are the model parameters and q is the order of the
model. The random shocks are assumed to be a white noise [21, 23] process, i.e. a sequence of
independent and identically distributed (i.i.d.) random variables with zero mean and a constant
variance σ². Generally, the random shocks are assumed to follow the typical normal distribution.
Thus conceptually a moving average model is a linear regression of the current observation of the
time series against the random shocks of one or more prior observations. Fitting an MA model to a
time series is more complicated than fitting an AR model because in the former the random error
terms are not foreseeable. Autoregressive (AR) and moving average (MA) models can be effectively
combined together to form a general and useful class of time series models, known as the ARMA
models. Mathematically an ARMA(p, q) model is represented as:

y_t = c + ε_t + ϕ1 y_{t-1} + ... + ϕp y_{t-p} + θ1 ε_{t-1} + ... + θq ε_{t-q}

Here the model orders p, q refer to p autoregressive and q moving average terms. Usually ARMA
models are manipulated using the lag operator notation. The lag or backshift operator L is defined as
L y_t = y_{t-1}. Polynomials of the lag operator, or lag polynomials, are used to represent ARMA models as
follows:

ϕ(L) y_t = c + θ(L) ε_t, where ϕ(L) = 1 − ϕ1 L − ϕ2 L² − ... − ϕp L^p and θ(L) = 1 + θ1 L + θ2 L² + ... + θq L^q
It is shown in [23] that an important property of AR(p) process is invertibility, i.e. an AR(p) process
can always be written in terms of an MA(∞) process. Whereas for an MA(q) process to be invertible,
all the roots of the equation θ (L) = 0 must lie outside the unit circle. This condition is known as the
Invertibility Condition for an MA process.
An MA(q) process is always stationary, irrespective of the values of the MA parameters. The conditions
regarding stationarity and invertibility of AR and MA processes also hold for an ARMA process. An
ARMA(p, q) process is stationary if all the roots of the characteristic equation ϕ(L) = 0 lie outside the
unit circle. Similarly, if all the roots of the lag equation θ (L) = 0 lie outside the unit circle, then the
ARMA(p, q) process is invertible and can be expressed as a pure AR process.
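An ARMA process with given orders can be simulated directly from its defining equation (an illustrative Python sketch using i.i.d. normal shocks; the parameter values are arbitrary but satisfy the stationarity condition, since the root of 1 − 0.5L = 0 is L = 2, outside the unit circle):

```python
import random

def simulate_arma(phi, theta, c=0.0, n=500, sigma=1.0, seed=42):
    """Simulate an ARMA(p, q) series:
    y_t = c + sum_i phi_i * y_{t-i} + eps_t + sum_j theta_j * eps_{t-j},
    with i.i.d. normal shocks of standard deviation sigma."""
    rng = random.Random(seed)
    p, q = len(phi), len(theta)
    y, eps = [], []
    for t in range(n):
        e = rng.gauss(0.0, sigma)
        val = c + e
        val += sum(phi[i] * y[t - 1 - i] for i in range(p) if t - 1 - i >= 0)
        val += sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        y.append(val)
        eps.append(e)
    return y

# A stationary, invertible ARMA(1,1):
series = simulate_arma(phi=[0.5], theta=[0.3])
print(len(series))   # 500
```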
4 Autocorrelation and Partial Autocorrelation Functions (ACF and PACF) To determine a proper
model for a given time series data, it is necessary to carry out the ACF and PACF analysis. These
statistical measures reflect how the observations in a time series are related to each other. For
modeling and forecasting purpose it is often useful to plot the ACF and PACF against consecutive
time lags. These plots help in determining the order of AR and MA terms. Below we give their
mathematical definitions:
The autocovariance at lag k and the autocorrelation coefficient at lag k are defined as:

γ_k = Cov(y_t, y_{t+k}) = E[(y_t − μ)(y_{t+k} − μ)] and ρ_k = γ_k / γ_0

Here μ is the mean of the time series, i.e. μ = E[y_t]. The autocovariance at lag zero, i.e. γ_0, is the
variance of the time series. From the definition it is clear that the autocorrelation coefficient ρ_k is
dimensionless and so is independent of the scale of measurement. Also, clearly −1 ≤ ρ_k ≤ 1.
Statisticians Box and Jenkins [6] termed γ_k the theoretical Autocovariance Function
(ACVF) and ρ_k the theoretical Autocorrelation Function (ACF). Another measure, known as the
Partial Autocorrelation Function (PACF), is used to measure the correlation between an observation k
periods ago and the current observation, after controlling for observations at intermediate lags (i.e. at
lags < k) [12]. At lag 1, PACF(1) is the same as ACF(1). The detailed formulae for calculating PACF are
given in the references. Normally, the stochastic process governing a time series is unknown and so it is
not possible to determine the actual or theoretical ACF and PACF values. Rather, these values are to be
estimated from the training data, i.e. the known time series at hand. The estimated ACF and PACF
values from the training data are respectively termed the sample ACF and sample PACF. The most
appropriate sample estimate for the ACVF at lag k is

c_k = (1/N) * sum from t = 1 to N − k of (y_t − ȳ)(y_{t+k} − ȳ)
As explained by Box and Jenkins [6], the sample ACF plot is useful in determining the type of model
to fit to a time series of length N. Since ACF is symmetrical about lag zero, it is only required to plot
the sample ACF for positive lags, from lag one onwards to a maximum lag of about N/4. The sample
PACF plot helps in identifying the maximum order of an AR process.
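The sample ACF estimate can be sketched as follows (illustrative Python; the alternating toy series is chosen so that the lag-1 autocorrelation is strongly negative):

```python
def sample_acf(y, max_lag):
    """Sample ACF r_k = c_k / c_0, with
    c_k = (1/N) * sum_{t} (y_t - ybar)(y_{t+k} - ybar)."""
    n = len(y)
    ybar = sum(y) / n
    def c(k):
        return sum((y[t] - ybar) * (y[t + k] - ybar) for t in range(n - k)) / n
    c0 = c(0)
    return [c(k) / c0 for k in range(max_lag + 1)]

y = [1, -1, 1, -1, 1, -1, 1, -1]    # alternating series
acf = sample_acf(y, 2)
print(round(acf[0], 3), round(acf[1], 3))   # 1.0 -0.875
```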
5 Autoregressive Integrated Moving Average (ARIMA) Models The ARMA models, described above,
can only be used for stationary time series data. However, in practice many time series, such as those
related to socio-economic and business phenomena, show non-stationary behavior. Time series which
contain trend and seasonal patterns are also non-stationary in nature. Thus from an application viewpoint
ARMA models are inadequate to properly describe non-stationary time series, which are frequently
encountered in practice. For this reason the ARIMA model is proposed, which is a generalization of
an ARMA model to include the case of non-stationarity as well. In ARIMA models a non-stationary
time series is made stationary by applying finite differencing of the data points. The mathematical
formulation of the ARIMA(p, d, q) model using lag polynomials is given below:

ϕ(L) (1 − L)^d y_t = θ(L) ε_t
Here, p, d and q are integers greater than or equal to zero and refer to the order of the autoregressive,
integrated, and moving average parts of the model respectively.
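The finite differencing that produces the "integrated" part can be sketched as (illustrative Python; a linear trend vanishes after a single difference):

```python
def difference(y, d=1):
    """Apply first differencing d times: the 'I' part of ARIMA(p, d, q)."""
    for _ in range(d):
        y = [y[t] - y[t - 1] for t in range(1, len(y))]
    return y

trend = [2 * t + 5 for t in range(6)]    # 5, 7, 9, 11, 13, 15
print(difference(trend, d=1))            # [2, 2, 2, 2, 2] -- constant, i.e. stationary
```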
6 Seasonal Autoregressive Integrated Moving Average (SARIMA) Models The ARIMA model
is for non-seasonal non-stationary data. Box and Jenkins have generalized this model to deal with
seasonality. Their proposed model is known as the Seasonal ARIMA (SARIMA) model. In this
model seasonal differencing of appropriate order is used to remove non-stationarity from the series. A
first order seasonal difference is the difference between an observation and the corresponding
observation from the previous year and is calculated as y_t − y_{t−s}. For monthly time series s = 12
and for quarterly time series s = 4.
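Seasonal differencing can be sketched similarly (illustrative Python; the toy quarterly series has period s = 4 and a level shift of 1 per year):

```python
def seasonal_difference(y, s=12):
    """First-order seasonal difference y_t - y_{t-s}
    (s = 12 for monthly data, s = 4 for quarterly data)."""
    return [y[t] - y[t - s] for t in range(s, len(y))]

# Quarterly pattern 10/20/30/40, rising by 1 each year:
y = [10, 20, 30, 40, 11, 21, 31, 41]
print(seasonal_difference(y, s=4))   # [1, 1, 1, 1]
```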
7 Some Nonlinear Time Series Models So far we have discussed linear time series models. As
mentioned earlier, nonlinear models should also be considered for better time series analysis and
forecasting. Campbell, Lo and MacKinlay (1997) made important contributions in this direction.
According to them, almost all nonlinear time series can be divided into two branches: one includes
models nonlinear in mean and the other includes models nonlinear in variance (heteroskedastic). As an
illustrative example, here we present two nonlinear time series models from [28]:
8 Box-Jenkins Methodology After describing various time series models, the next issue to our
concern is how to select an appropriate model that can produce accurate forecast based on a
description of historical pattern in the data and how to determine the optimal model orders.
Statisticians George Box and Gwilym Jenkins developed a practical approach to build ARIMA
158
model that best fits a given time series while satisfying the parsimony principle. Their concept
has fundamental importance in the area of time series analysis and forecasting. The Box-Jenkins
methodology does not assume any particular pattern in the historical data of the series to be
forecasted. Rather, it uses a three step iterative approach of model identification, parameter estimation
and diagnostic checking to determine the best parsimonious model from a general class of ARIMA
models . This three-step process is repeated several times until a satisfactory model is finally selected.
Then this model can be used for forecasting future values of the time series. The Box-Jenkins forecast
method is schematically shown in Fig.
A crucial step in an appropriate model selection is the determination of optimal model parameters.
One criterion is that the sample ACF and PACF, calculated from the training data should match with
the corresponding theoretical or actual values. Other widely used measures for model identification
are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC), which are
defined below:

AIC(p) = n ln(SSE/n) + 2p and BIC(p) = n ln(SSE/n) + p ln(n)

Here n is the number of effective observations used to fit the model, p is the number of parameters in
the model and SSE is the sum of sample squared residuals. The optimal model order is chosen by the
number of model parameters, which minimizes either AIC or BIC. Other similar criteria have also
been proposed in literature for optimal model identification.
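The two criteria can be computed directly from n, p and the SSE (illustrative Python using the forms AIC = n ln(SSE/n) + 2p and BIC = n ln(SSE/n) + p ln(n); note that variants of these formulas appear in the literature):

```python
import math

def aic(n, p, sse):
    """AIC = n * ln(SSE/n) + 2p."""
    return n * math.log(sse / n) + 2 * p

def bic(n, p, sse):
    """BIC = n * ln(SSE/n) + p * ln(n); penalizes extra parameters more
    heavily than AIC whenever ln(n) > 2."""
    return n * math.log(sse / n) + p * math.log(n)

n, sse = 100, 50.0
print(round(aic(n, 2, sse), 3))   # -65.315
print(round(bic(n, 3, sse), 3))   # -55.499
```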
Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each
hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing
140 million neurons, with tens of billions of connections between them. And yet human
vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing
progressively more complex image processing. We carry in our heads a supercomputer,
tuned by evolution over hundreds of millions of years, and superbly adapted to understand
the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are
stupendously, astoundingly good at making sense of what our eyes show us. But nearly all
that work is done unconsciously. And so we don't usually appreciate how tough a problem
our visual systems solve.
The difficulty of visual pattern recognition becomes apparent if you attempt to write a
computer program to recognize digits like those above. What seems easy when we do it
ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize
shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be
not so simple to express algorithmically. When you try to make such rules precise, you
quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.
Neural networks approach the problem in a different way. The idea is to take a large number
of handwritten digits, known as training examples,
and then develop a system which can learn from those training examples. In other words,
the neural network uses the examples to automatically infer rules for recognizing
handwritten digits. Furthermore, by increasing the number of training examples, the
network can learn more about handwriting, and so improve its accuracy. So while I've shown
just 100 training digits above, perhaps we could build a better handwriting recognizer by
using thousands or even millions or billions of training examples.
In this chapter we'll write a computer program implementing a neural network that learns to
recognize handwritten digits. The program is just 74 lines long, and uses no special neural
network libraries. But this short program can recognize digits with an accuracy over 96
percent, without human intervention. Furthermore, in later chapters we'll develop ideas
which can improve accuracy to over 99 percent. In fact, the best commercial neural networks
are now so good that they are used by banks to process cheques, and by post offices to
recognize addresses.
We're focusing on handwriting recognition because it's an excellent prototype problem for
learning about neural networks in general. As a prototype it hits a sweet spot: it's
challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to
require an extremely complicated solution, or tremendous computational power.
Furthermore, it's a great way to develop more advanced techniques, such as deep learning.
And so throughout the book we'll return repeatedly to the problem of handwriting
recognition. Later in the book, we'll discuss how these ideas may be applied to other
problems in computer vision, and also in speech, natural language processing, and other
domains.
Of course, if the point of the chapter was only to write a computer program to recognize
handwritten digits, then the chapter would be much shorter! But along the way we'll develop
many key ideas about neural networks, including two important types of artificial neuron
(the perceptron and the sigmoid neuron), and the standard learning algorithm for neural
networks, known as stochastic gradient descent. Throughout, I focus on
explaining why things are done the way they are, and on building your neural networks
intuition. That requires a lengthier discussion than if I just presented the basic mechanics of
what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the
payoffs, by the end of the chapter we'll be in position to understand what deep learning is,
and why it matters.
Perceptrons
What is a neural network? To get started, I'll explain a type of artificial neuron called
a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank
Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more
common to use other models of artificial neurons - in this book, and in much modern work
on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get
to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way
they are, it's worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs, x1, x2, …, and
produces a single binary output:
In the example shown the perceptron has three inputs, x1, x2, x3. In general it could
have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He
introduced weights, w1, w2, …, real numbers expressing the importance of the
respective inputs to the output. The neuron's output, 0 or 1, is determined by whether the
weighted sum Σj wj xj is less than or greater than some threshold value. Just like the
weights, the threshold is a real number which is a parameter of the neuron. To put it in more
precise algebraic terms:

output = 0 if Σj wj xj ≤ threshold, and 1 if Σj wj xj > threshold.   (1)
That's all there is to how a perceptron works!
That's the basic mathematical model. A way you can think about the perceptron is that it's a
device that makes decisions by weighing up evidence. Let me give an example. It's not a very
realistic example, but it's easy to understand, and we'll soon get to more realistic examples.
Suppose the weekend is coming up, and you've heard that there's going to be a cheese
festival in your city. You like cheese, and are trying to decide whether or not to go to the
festival. You might make your decision by weighing up three factors:
1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car.)
We can represent these three factors by corresponding binary variables x1, x2, and x3.
For instance, we'd have x1 = 1 if the weather is good, and x1 = 0 if the weather is bad.
Similarly, x2 = 1 if your boyfriend or girlfriend wants to go, and x2 = 0 if not. And
similarly again for x3 and public transit.
Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival
even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But
perhaps you really loathe bad weather, and there's no way you'd go to the festival if the
weather is bad. You can use perceptrons to model this kind of decision-making. One way to
do this is to choose a weight w1 = 6 for the weather, and w2 = 2 and w3 = 2 for
the other conditions. The larger value of w1 indicates that the weather matters a lot to
you, much more than whether your boyfriend or girlfriend joins you, or the nearness of
public transit. Finally, suppose you choose a threshold of 5 for the perceptron. With these
choices, the perceptron implements the desired decision-making model,
outputting 1 whenever the weather is good, and 0 whenever the weather is bad. It makes
no difference to the output whether your boyfriend or girlfriend wants to go, or whether
public transit is nearby.
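The decision rule of Equation (1), with these particular weights, can be sketched in a few lines of code. This is a minimal illustration; the function name is mine.

```python
def perceptron(inputs, weights, threshold):
    """Rosenblatt's rule: output 1 if the weighted sum of the
    inputs exceeds the threshold, and 0 otherwise."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# The cheese-festival model: weather matters (w1 = 6) far more than
# company (w2 = 2) or transit (w3 = 2), with a threshold of 5.
weights, threshold = [6, 2, 2], 5

print(perceptron([1, 0, 0], weights, threshold))  # good weather alone -> 1
print(perceptron([0, 1, 1], weights, threshold))  # bad weather -> 0
```

Good weather alone gives a weighted sum of 6 > 5, so the output is 1; with bad weather the sum is at most 4, so the output is 0, no matter what the other factors say.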
By varying the weights and the threshold, we can get different models of decision-making.
For example, suppose we instead chose a threshold of 3. Then the perceptron would decide
that you should go to the festival whenever the weather was good or when both the festival
was near public transit and your boyfriend or girlfriend was willing to join you. In other
words, it'd be a different model of decision-making. Dropping the threshold means you're
more willing to go to the festival.
Obviously, the perceptron isn't a complete model of human decision-making! But what the
example illustrates is how a perceptron can weigh up different kinds of evidence in order to
make decisions. And it should seem plausible that a complex network of perceptrons could
make quite subtle decisions:
In this network, the first column of perceptrons - what we'll call the first layer of
perceptrons - is making three very simple decisions, by weighing the input evidence. What
about the perceptrons in the second layer? Each of those perceptrons is making a decision by
weighing up the results from the first layer of decision-making. In this way a perceptron in
the second layer can make a decision at a more complex and more abstract level than
perceptrons in the first layer. And even more complex decisions can be made by the
perceptron in the third layer. In this way, a many-layer network of perceptrons can engage
in sophisticated decision making.
Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In
the network above the perceptrons look like they have multiple outputs. In fact, they're still
single output. The multiple output arrows are merely a useful way of indicating that the
output from a perceptron is being used as the input to several other perceptrons. It's less
unwieldy than drawing a single output line which then splits.
We can simplify the way we describe perceptrons by moving the threshold to the other side
of the inequality and replacing it with what's known as the perceptron's bias, b ≡ −threshold.
Using the bias, the perceptron rule becomes: output 1 if w·x + b > 0, and 0 otherwise. The
bias is the first of several notational simplifications. Because of this, in the remainder of the
book we won't use the threshold, we'll always use the bias.
I've described perceptrons as a method for weighing evidence to make decisions. Another
way perceptrons can be used is to compute the elementary logical functions we usually think
of as underlying computation, functions such as AND, OR, and NAND. For example, suppose
we have a perceptron with two inputs, each with weight −2, and an overall bias of 3.
For inputs x1 and x2 the output is 1 exactly when −2x1 − 2x2 + 3 > 0, which holds for every
input combination except x1 = x2 = 1. In other words, the perceptron implements a NAND gate.
The NAND example shows that we can use perceptrons to compute simple logical functions.
In fact, we can use networks of perceptrons to compute any logical function at all. The
reason is that the NAND gate is universal for computation, that is, we can build any
computation up out of NAND gates. For example, we can use NAND gates to build a circuit
which adds two bits, x1 and x2. This requires computing the bitwise sum, x1 ⊕ x2,
as well as a carry bit which is set to 1 when both x1 and x2 are 1, i.e., the carry bit is
just the bitwise product x1x2:
One notable aspect of this network of perceptrons is that the output from the leftmost
perceptron is used twice as input to the bottommost perceptron. When I defined the
perceptron model I didn't say whether this kind of double-output-to-the-same-place was
allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then
it's possible to simply merge the two lines, into a single connection with a weight of -4
instead of two connections with -2 weights. (If you don't find this obvious, you should stop
and prove to yourself that this is equivalent.) With that change, the network looks as follows,
with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as
marked:
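The NAND perceptron, and the two-bit adder built out of NAND gates, can be sketched directly. This is an illustrative sketch; the function names are mine, and the half-adder wiring is one standard NAND construction rather than a literal transcription of the diagram.

```python
def nand(x1, x2):
    """A perceptron with weights -2, -2 and bias 3: the output is 1
    unless both inputs are 1, i.e. a NAND gate."""
    return 1 if (-2 * x1) + (-2 * x2) + 3 > 0 else 0

def half_adder(x1, x2):
    """Bitwise sum x1 XOR x2 and carry bit x1 AND x2, built purely
    from NAND perceptrons."""
    a = nand(x1, x2)
    s = nand(nand(x1, a), nand(x2, a))   # bitwise sum x1 (+) x2
    carry = nand(a, a)                   # carry bit, the product x1*x2
    return s, carry
```

Since every gate here is itself a perceptron, the adder is a small network of perceptrons computing a logical function, exactly the universality point made above.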
Up to now I've been drawing inputs like x1 and x2 as variables floating to the left of the
network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons -
the input layer - to encode the inputs:
This notation for input perceptrons, in which we have an output, but no inputs,
is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we
did have a perceptron with no inputs. Then the weighted sum Σj wj xj would always be
zero, and so the perceptron would output 1 if b > 0, and 0 if b ≤ 0. That is, the
perceptron would simply output a fixed value, not the desired value (x1, in the example
above). It's better to think of the input perceptrons as not really being perceptrons at all, but
rather special units which are simply defined to output the desired values, x1, x2, ….
The adder example demonstrates how a network of perceptrons can be used to simulate a
circuit containing many NAND gates. And because NAND gates are universal for computation,
it follows that perceptrons are also universal for computation.
However, the situation is better than this view suggests. It turns out that we can
devise learning algorithms which can automatically tune the weights and biases of a
network of artificial neurons. This tuning happens in response to external stimuli, without
direct intervention by a programmer. These learning algorithms enable us to use artificial
neurons in a way which is radically different to conventional logic gates. Instead of explicitly
laying out a circuit of NAND and other gates, our neural networks can simply learn to solve
problems, sometimes problems where it would be extremely difficult to directly design a
conventional circuit.
Sigmoid neurons
Learning algorithms sound terrific. But how can we devise such algorithms for a neural
network? Suppose we have a network of perceptrons that we'd like to use to learn to solve
some problem. For example, the inputs to the network might be the raw pixel data from a
scanned, handwritten image of a digit. And we'd like the network to learn weights and biases
so that the output from the network correctly classifies the digit. To see how learning might
work, suppose we make a small change in some weight (or bias) in the network. What we'd
like is for this small change in weight to cause only a small corresponding change in the
output from the network. As we'll see in a moment, this property will make learning
possible. Schematically, here's what we want (obviously this network is too simple to do
handwriting recognition!):
If it were true that a small change in a weight (or bias) causes only a small change in output,
then we could use this fact to modify the weights and biases to get our network to behave
more in the manner we want. For example, suppose the network was mistakenly classifying
an image as an "8" when it should be a "9". We could figure out how to make a small change
in the weights and biases so the network gets a little closer to classifying the image as a "9".
And then we'd repeat this, changing the weights and biases over and over to produce better
and better output. The network would be learning.
The problem is that this isn't what happens when our network contains perceptrons. In fact,
a small change in the weights or bias of any single perceptron in the network can sometimes
cause the output of that perceptron to completely flip, say from 0 to 1. That flip may then
cause the behaviour of the rest of the network to completely change in some very
complicated way. So while your "9" might now be classified correctly, the behaviour of the
network on all the other images is likely to have completely changed in some hard-to-control
way. That makes it difficult to see how to gradually modify the weights and biases so that the
network gets closer to the desired behaviour. Perhaps there's some clever way of getting
around this problem. But it's not immediately obvious how we can get a network of
perceptrons to learn.
We can overcome this problem by introducing a new type of artificial neuron called
a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small
changes in their weights and bias cause only a small change in their output. That's the
crucial fact which will allow a network of sigmoid neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we
depicted perceptrons:
Just like a perceptron, the sigmoid neuron has inputs, x1, x2, …. But instead of being
just 0 or 1, these inputs can also take on any values between 0 and 1. So, for
instance, 0.638… is a valid input for a sigmoid neuron. Also just like a perceptron,
the sigmoid neuron has weights for each input, w1, w2, …, and an overall bias, b.
But the output is not 0 or 1. Instead, it's σ(w·x + b), where σ is called the sigmoid
function*, and is defined by:

σ(z) ≡ 1 / (1 + e^(−z)).   (3)

To put it all a little more explicitly, the output of a sigmoid neuron with
inputs x1, x2, …, weights w1, w2, …, and bias b is

1 / (1 + exp(−Σj wj xj − b)).   (4)

*Incidentally, σ is sometimes called the logistic function, and this new class of
neurons called logistic neurons. It's useful to remember this terminology, since these terms
are used by many people working with neural nets. However, we'll stick with the sigmoid
terminology.
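In code, the sigmoid neuron's output from Equations (3) and (4) is only a few lines. This is a sketch; the function names are mine.

```python
import math

def sigmoid(z):
    """The sigmoid function of Equation (3): 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(inputs, weights, bias):
    """Output of Equation (4): sigma applied to the weighted input plus bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

print(sigmoid(0))                            # 0.5: exactly halfway between 0 and 1
print(sigmoid_neuron([0.638], [1.0], 0.0))   # a legitimate non-binary output
```

Unlike a perceptron, the output varies smoothly between 0 and 1 as the weighted input changes.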
At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of
the sigmoid function may seem opaque and forbidding if you're not already familiar with it.
In fact, there are many similarities between perceptrons and sigmoid neurons, and the
algebraic form of the sigmoid function turns out to be more of a technical detail than a true
barrier to understanding.
What about the algebraic form of σ? How can we understand that? In fact, the exact form
of σ isn't so important - what really matters is the shape of the function when plotted.
Here's the shape:
[Plots: the sigmoid function and the step function, each plotted against z from −4 to 4, with outputs ranging from 0.0 to 1.0.]
If σ had in fact been a step function, then the sigmoid neuron would be a perceptron, since
the output would be 1 or 0 depending on whether w·x + b was positive or negative*. By
using the actual σ function we get, as already implied above, a smoothed out perceptron.
Indeed, it's the smoothness of the σ function that is the crucial fact, not its detailed form.
The smoothness of σ means that small changes Δwj in the weights and Δb in the bias will
produce a small change Δoutput in the output from the neuron. In fact, calculus tells us
that Δoutput is well approximated by

Δoutput ≈ Σj (∂output/∂wj) Δwj + (∂output/∂b) Δb,   (5)

where the sum is over all the weights, wj, and ∂output/∂wj and ∂output/∂b denote partial
derivatives of the output with respect to wj and b, respectively.

*Actually, when w·x + b = 0 the perceptron outputs 0, while the step function outputs 1.
So, strictly speaking, we'd need to modify the step function at that one point. But you get
the idea.
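We can check this linear approximation numerically for a one-input sigmoid neuron, using the standard derivatives ∂output/∂w = σ′(z)·x and ∂output/∂b = σ′(z), with σ′(z) = σ(z)(1 − σ(z)). The particular numbers below are mine, chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A one-input sigmoid neuron with weight w, bias b, and input x.
w, b, x = 0.8, -0.3, 0.5
out = sigmoid(w * x + b)

# Nudge the weight and bias slightly...
dw, db = 0.001, -0.002
new_out = sigmoid((w + dw) * x + (b + db))

# ...and compare the true change against the linear estimate of Eq. (5),
# using d(output)/dw = sigma'(z) * x and d(output)/db = sigma'(z).
sprime = out * (1 - out)
estimate = sprime * x * dw + sprime * db

print(abs((new_out - out) - estimate))  # tiny: the approximation is good
```

The discrepancy is of second order in the nudges, which is exactly why smoothness makes gradual learning possible.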
If it's the shape of σ which really matters, and not its exact form, then why use the
particular form used for σ in Equation (3)? In fact, later in the book we will occasionally
consider neurons where the output is f(w·x + b) for some other activation
function f(·). The main thing that changes when we use a different activation function is
that the particular values for the partial derivatives in Equation (5) change. It turns out that
when we compute those partial derivatives later, using σ will simplify the algebra, simply
because exponentials have lovely properties when differentiated. In any case, σ is
commonly used in work on neural nets, and is the activation function we'll use most often in
this book.
How should we interpret the output from a sigmoid neuron? Obviously, one big difference
between perceptrons and sigmoid neurons is that sigmoid neurons don't just
output 0 or 1. They can have as output any real number between 0 and 1, so values such
as 0.173… and 0.689… are legitimate outputs. This can be useful, for example,
if we want to use the output value to represent the average intensity of the pixels in an image
input to a neural network. But sometimes it can be a nuisance. Suppose we want the output
from the network to indicate either "the input image is a 9" or "the input image is not a 9".
Obviously, it'd be easiest to do this if the output was a 0 or a 1, as in a perceptron. But in
practice we can set up a convention to deal with this, for example, by deciding to interpret
any output of at least 0.5 as indicating a "9", and any output less than 0.5 as indicating
"not a 9". I'll always explicitly state when we're using such a convention, so it shouldn't cause
any confusion.
In the next section I'll introduce a neural network that can do a pretty good job classifying
handwritten digits. In preparation for that, it helps to explain some terminology that lets us
name different parts of a network. Suppose we have the network:
As mentioned earlier, the leftmost layer in this network is called the input layer, and the
neurons within the layer are called input neurons. The rightmost or output layer contains
the output neurons, or, as in this case, a single output neuron. The middle layer is called
a hidden layer, since the neurons in this layer are neither inputs nor outputs. The term
"hidden" perhaps sounds a little mysterious - the first time I heard the term I thought it
must have some deep philosophical or mathematical significance - but it really means
nothing more than "not an input or an output". The network above has just a single hidden
layer, but some networks have multiple hidden layers. For example, the following four-layer
network has two hidden layers:
Somewhat confusingly, and for historical reasons, such multiple layer networks are
sometimes called multilayer perceptrons or MLPs, despite being made up of sigmoid
neurons, not perceptrons. I'm not going to use the MLP terminology in this book, since I
think it's confusing, but wanted to warn you of its existence.
The design of the input and output layers in a network is often straightforward. For example,
suppose we're trying to determine whether a handwritten image depicts a "9" or not. A
natural way to design the network is to encode the intensities of the image pixels into the
input neurons. If the image is a 64 by 64 greyscale image, then we'd
have 4,096 = 64 × 64 input neurons, with the intensities scaled appropriately
between 0 and 1. The output layer will contain just a single neuron, with output values of
less than 0.5 indicating "input image is not a 9", and values greater than 0.5 indicating
"input image is a 9".
While the design of the input and output layers of a neural network is often straightforward,
there can be quite an art to the design of the hidden layers. In particular, it's not possible to
sum up the design process for the hidden layers with a few simple rules of thumb. Instead,
neural networks researchers have developed many design heuristics for the hidden layers,
which help people get the behaviour they want out of their nets. For example, such heuristics
can be used to help determine how to trade off the number of hidden layers against the time
required to train the network. We'll meet several such design heuristics later in this book.
Up to now, we've been discussing neural networks where the output from one layer is used
as input to the next layer. Such networks are called feedforward neural networks. This
means there are no loops in the network - information is always fed forward, never fed back.
If we did have loops, we'd end up with situations where the input to the σ function
depended on the output. That'd be hard to make sense of, and so we don't allow such loops.
However, there are other models of artificial neural networks in which feedback loops are
possible. These models are called recurrent neural networks. The idea in these models is to
have neurons which fire for some limited duration of time, before becoming quiescent. That
firing can stimulate other neurons, which may fire a little while later, also for a limited
duration. That causes still more neurons to fire, and so over time we get a cascade of
neurons firing. Loops don't cause problems in such a model, since a neuron's output only
affects its input at some later time, not instantaneously.
Recurrent neural nets have been less influential than feedforward networks, in part because
the learning algorithms for recurrent nets are (at least to date) less powerful. But recurrent
networks are still extremely interesting. They're much closer in spirit to how our brains work
than feedforward networks. And it's possible that recurrent networks can solve important
problems which can only be solved with great difficulty by feedforward networks. However,
to limit our scope, in this book we're going to concentrate on the more widely-used
feedforward networks.
Having defined neural networks, let's return to handwriting recognition. We can split the
problem of recognizing handwritten digits into two sub-problems. First, we'd like a way of
breaking an image containing many digits into a sequence of separate images, each
containing a single digit. For example, we'd like to break the image
We humans solve this segmentation problem with ease, but it's challenging for a computer
program to correctly break up the image. Once the image has been segmented, the program
then needs to classify each individual digit. So, for instance, we'd like our program to
recognize that the first digit above,
is a 5.
We'll focus on writing a program to solve the second problem, that is, classifying individual
digits. We do this because it turns out that the segmentation problem is not so difficult to
solve, once you have a good way of classifying individual digits. There are many approaches
to solving the segmentation problem. One approach is to trial many different ways of
segmenting the image, using the individual digit classifier to score each trial segmentation. A
trial segmentation gets a high score if the individual digit classifier is confident of its
classification in all segments, and a low score if the classifier is having a lot of trouble in one
or more segments. The idea is that if the classifier is having trouble somewhere, then it's
probably having trouble because the segmentation has been chosen incorrectly. This idea
and other variations can be used to solve the segmentation problem quite well. So instead of
worrying about segmentation we'll concentrate on developing a neural network which can
solve the more interesting and difficult problem, namely, recognizing individual handwritten
digits.
The input layer of the network contains neurons encoding the values of the input pixels. As
discussed in the next section, our training data for the network will consist of
many 28 by 28 pixel images of scanned handwritten digits, and so the input layer
contains 784 = 28 × 28 neurons. For simplicity I've omitted most of
the 784 input neurons in the diagram above. The input pixels are greyscale, with a value
of 0.0 representing white, a value of 1.0 representing black, and in between values
representing gradually darkening shades of grey.
The second layer of the network is a hidden layer. We denote the number of neurons in this
hidden layer by n, and we'll experiment with different values for n. The example shown
illustrates a small hidden layer, containing just n = 15 neurons.
The output layer of the network contains 10 neurons. If the first neuron fires, i.e., has an
output ≈ 1, then that will indicate that the network thinks the digit is a 0. If the second
neuron fires then that will indicate that the network thinks the digit is a 1. And so on. A
little more precisely, we number the output neurons from 0 through 9, and figure out
which neuron has the highest activation value. If that neuron is, say, neuron number 6,
then our network will guess that the input digit was a 6. And so on for the other output
neurons.
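This read-out can be sketched with a forward pass through a tiny 784-15-10 network. The weights here are random and untrained, so the guess is meaningless; the point is only the shapes of the layers and the "highest activation wins" rule. All names and numbers are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer(inputs, weights, biases):
    """Activations of one fully connected layer of sigmoid neurons."""
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

random.seed(0)
n_in, n_hidden, n_out = 784, 15, 10
w1 = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
b1 = [random.uniform(-1, 1) for _ in range(n_hidden)]
w2 = [[random.uniform(-1, 1) for _ in range(n_hidden)] for _ in range(n_out)]
b2 = [random.uniform(-1, 1) for _ in range(n_out)]

pixels = [random.random() for _ in range(n_in)]   # a fake 28x28 input image
hidden = layer(pixels, w1, b1)                    # 15 hidden activations
output = layer(hidden, w2, b2)                    # 10 output activations

guess = max(range(10), key=lambda j: output[j])   # highest activation wins
```

Training, discussed later in the chapter, is precisely the business of adjusting w1, b1, w2, b2 so that this guess is usually right.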
You might wonder why we use 10 output neurons. After all, the goal of the network is to
tell us which digit (0, 1, 2, …, 9) corresponds to the input image. A seemingly natural
way of doing that is to use just 4 output neurons, treating each neuron as taking on a binary
value, depending on whether the neuron's output is closer to 0 or to 1. Four neurons are
enough to encode the answer, since 2^4 = 16 is more than the 10 possible values for the
input digit. Why should our network use 10 neurons instead? Isn't that inefficient? The
ultimate justification is empirical: we can try out both network designs, and it turns out that,
for this particular problem, the network with 10 output neurons learns to recognize digits
better than the network with 4 output neurons. But that leaves us
wondering why using 10 output neurons works better. Is there some heuristic that would
tell us in advance that we should use the 10-output encoding instead of the 4-output
encoding?
To understand why we do this, it helps to think about what the neural network is doing from
first principles. Consider first the case where we use 10 output neurons. Let's concentrate
on the first output neuron, the one that's trying to decide whether or not the digit is a 0. It
does this by weighing up evidence from the hidden layer of neurons. What are those hidden
neurons doing? Well, just suppose for the sake of argument that the first neuron in the
hidden layer detects whether or not an image like the following is present:
It can do this by heavily weighting input pixels which overlap with the image, and only
lightly weighting the other inputs. In a similar way, let's suppose for the sake of argument
that the second, third, and fourth neurons in the hidden layer detect whether or not the
following images are present:
As you may have guessed, these four images together make up the 0 image that we saw in
the line of digits shown earlier:
So if all four of these hidden neurons are firing then we can conclude that the digit is a 0. Of
course, that's not the only sort of evidence we can use to conclude that the image was a 0 -
we could legitimately get a 0 in many other ways (say, through translations of the above
images, or slight distortions). But it seems safe to say that at least in this case we'd conclude
that the input was a 0.
Supposing the neural network functions in this way, we can give a
plausible explanation for why it's better to have 10 outputs from the network, rather
than 4. If we had 4 outputs, then the first output neuron would be trying to decide what
the most significant bit of the digit was. And there's no easy way to relate that most
significant bit to simple shapes like those shown above. It's hard to imagine that there's any
good historical reason the component shapes of the digit will be closely related to (say) the
most significant bit in the output.
Now, with all that said, this is all just a heuristic. Nothing says that the three-layer neural
network has to operate in the way I described, with the hidden neurons detecting simple
component shapes. Maybe a clever learning algorithm will find some assignment of weights
that lets us use only 4 output neurons. But as a heuristic the way of thinking I've described
works pretty well, and can save you a lot of time in designing good neural network
architectures.