
Discriminant Analysis

Prepared by-
Sumit Jain
Introduction-
• Discriminant analysis (DA) is a technique for analysing marketing
research data when the criterion or dependent variable is categorical
and the predictor or independent variables are interval in nature. In
other words, discriminant analysis is a statistical method that
researchers use to understand the relationship between a "dependent
variable" and one or more "independent variables." A dependent
variable is the variable that a researcher is trying to explain or
predict from the values of the independent variables. Discriminant
analysis is similar to regression analysis and analysis of variance
(ANOVA). The principal difference between discriminant analysis and
the other two methods lies in the nature of the dependent variable:
it is categorical in discriminant analysis and numerical in the other
two.



• It is a statistical technique used to classify cases into two or
more categories of the dependent variable. Like regression, it can
also be used to predict the value of the categorical dependent
variable for new cases.

• F test (Wilks' lambda): The overall significance of the discriminant
model is tested with Wilks' lambda. If the overall model is
significant, F tests are then used to check whether the mean of each
individual variable differs between the groups.
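The Wilks' lambda statistic mentioned above can be sketched in a few lines. This is a minimal illustration, assuming two predictors and two groups with made-up values; it uses the standard definition of lambda as the determinant of the within-group scatter matrix divided by the determinant of the total scatter matrix. Values close to 0 suggest well separated groups, values close to 1 suggest little separation. The conversion of lambda into an F statistic is not shown.

```python
# Hypothetical toy data: two predictors measured in two groups.
groups = {
    "A": [(2.0, 3.0), (3.0, 4.0), (4.0, 3.5), (3.5, 5.0)],
    "B": [(6.0, 7.0), (7.0, 6.5), (8.0, 8.0), (7.5, 9.0)],
}

def mean_vector(rows):
    n = len(rows)
    return tuple(sum(r[j] for r in rows) / n for j in range(2))

def sscp(rows, center):
    """2x2 sums-of-squares-and-cross-products matrix about `center`."""
    m = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = (r[0] - center[0], r[1] - center[1])
        for i in range(2):
            for j in range(2):
                m[i][j] += d[i] * d[j]
    return m

def det2(m):
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Total scatter: all cases around the grand mean.
all_rows = [r for rows in groups.values() for r in rows]
total = sscp(all_rows, mean_vector(all_rows))

# Within-group scatter: each case around its own group mean, summed over groups.
within = [[0.0, 0.0], [0.0, 0.0]]
for rows in groups.values():
    s = sscp(rows, mean_vector(rows))
    for i in range(2):
        for j in range(2):
            within[i][j] += s[i][j]

wilks_lambda = det2(within) / det2(total)
print(f"Wilks' lambda = {wilks_lambda:.3f}")
```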


Examples-
• For example, an educational researcher may want to
investigate which variables discriminate between high school
graduates who decide (1) to go to college, (2) to attend a trade
or professional school, or (3) to seek no further training or
education. For that purpose the researcher could collect data
on numerous variables prior to students' graduation. After
graduation, most students will naturally fall into one of the
three categories. Discriminant Analysis could then be used to
determine which variable(s) are the best predictors of students'
subsequent educational choice.

• As another example, a medical researcher may record different
variables relating to patients' backgrounds in order to learn
which variables best predict whether a patient is likely to
recover completely (group 1), partially (group 2), or not at all
(group 3). A biologist could record different characteristics of
similar types (groups) of flowers, and then perform a
discriminant function analysis to determine the set of
characteristics that allows for the best discrimination between
the types.


Purpose-
• The main purpose of a discriminant function analysis is to
predict group membership based on a linear combination of the
interval variables. The procedure begins with a set of
observations where both group membership and the values of the
interval variables are known. The end result of the procedure is a
model that allows prediction of group membership when only the
interval variables are known. A second purpose of discriminant
function analysis is an understanding of the data set, as a careful
examination of the prediction model that results from the
procedure can give insight into the relationship between group
membership and the variables used to predict group membership.
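
The procedure described above (fit on cases with known groups, then predict when only the interval variables are known) can be sketched as follows. This is a minimal illustration, assuming two groups, two predictors and made-up data; it uses Fisher's two-group rule as one common way to obtain the linear combination, which is not necessarily what any particular statistical package does.

```python
# Hypothetical training data: (x1, x2) values for cases whose group is known.
train = {
    0: [(1.0, 2.0), (2.0, 1.5), (1.5, 3.0), (2.5, 2.5)],
    1: [(5.0, 6.0), (6.0, 5.5), (5.5, 7.0), (6.5, 6.5)],
}

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_within_scatter(data, means):
    """Sum of the within-group scatter matrices (2x2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for g, rows in data.items():
        for r in rows:
            d = [r[0] - means[g][0], r[1] - means[g][1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def solve2(a, b):
    """Solve the 2x2 linear system a w = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [
        (b[0] * a[1][1] - a[0][1] * b[1]) / det,
        (a[0][0] * b[1] - a[1][0] * b[0]) / det,
    ]

means = {g: mean_vector(rows) for g, rows in train.items()}
sw = pooled_within_scatter(train, means)
diff = [means[1][0] - means[0][0], means[1][1] - means[0][1]]
weights = solve2(sw, diff)  # coefficients of the linear combination

def score(x):
    return weights[0] * x[0] + weights[1] * x[1]

# With equal group sizes, the midpoint of the projected group means is the cutoff.
cutoff = (score(means[0]) + score(means[1])) / 2

def predict(x):
    """Predict group membership from the interval variables alone."""
    return 1 if score(x) > cutoff else 0

print(predict((2.0, 2.0)), predict((6.0, 6.0)))
```

Inspecting `weights` also serves the second purpose: it shows how strongly each variable contributes to separating the groups in this toy data.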

Objectives-
• To classify cases into groups using a discriminant prediction
equation.
• To test theory by observing whether cases are classified as
predicted.
• To investigate differences between or among groups.
• To determine the most parsimonious way to distinguish among
groups.
• To determine the percent of variance in the dependent variable
explained by the independent variables.
• To determine the percent of variance in the dependent variable
explained by the independent variables over and above the variance
accounted for by control variables, using sequential discriminant
analysis.
• To assess the relative importance of the independent variables in
classifying cases on the dependent variable.
• To discard variables that are only weakly related to group
distinctions.
• To infer the meaning of the MDA dimensions that distinguish groups,
based on discriminant loadings.

• Multiple discriminant analysis (MDA) is an extension of
discriminant analysis and a cousin of multivariate analysis of
variance (MANOVA), sharing many of the same assumptions and tests.
MDA is used to classify a categorical dependent variable that has
more than two categories, using a number of interval or dummy
independent variables as predictors. MDA is sometimes also called
discriminant factor analysis or canonical discriminant analysis.

Assumptions in Discriminant analysis-

1. Independence: Each case should be independent of every other case.
Data with correlated observations should not be used in discriminant
analysis.

2. Adequate sample size: There must be at least two cases for each
category of the dependent variable. It is recommended, however, that
there be at least four or five times as many cases as independent
variables.

3. Interval data: The independent variables should be measured at the
interval level.

4. Variance: No independent variable should have a zero standard
deviation in any of the groups formed by the dependent variable.

5. Random error: Error terms are assumed to be randomly distributed.

6. Homogeneity of variances: The variance of each independent
variable should be equal across the groups.

7. Absence of perfect multicollinearity: There should be no perfect
multicollinearity among the independent variables.

8. Linearity: The discriminant functions are assumed to be linear
combinations of the independent variables.

9. Normality: The predictor variables should be normally distributed.
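
Some of these assumptions can be screened mechanically before running the analysis. The sketch below is only an illustration on made-up data with two predictors: it checks the sample-size rules (assumption 2), zero standard deviations within groups (assumption 4) and perfect multicollinearity (assumption 7). The variable names and the 0.999 tolerance are assumptions chosen for this example, and the remaining assumptions need other tools.

```python
from statistics import mean, pstdev

# Hypothetical data: group label -> list of (x1, x2) cases.
data = {
    "college": [(3.1, 520.0), (3.6, 610.0), (3.9, 640.0), (3.4, 580.0), (3.8, 600.0)],
    "trade": [(2.8, 480.0), (3.0, 510.0), (2.5, 450.0), (3.2, 530.0), (2.9, 470.0)],
}
n_predictors = 2

def pearson(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

checks = {}

# Assumption 2: at least two cases per group, and at least four cases per predictor.
n_total = sum(len(rows) for rows in data.values())
checks["min_two_cases_per_group"] = all(len(rows) >= 2 for rows in data.values())
checks["four_cases_per_predictor"] = n_total >= 4 * n_predictors

# Assumption 4: no predictor has a zero standard deviation within any group.
checks["no_zero_sd"] = all(
    pstdev([r[j] for r in rows]) > 0
    for rows in data.values()
    for j in range(n_predictors)
)

# Assumption 7: no perfect multicollinearity (with two predictors: |r| below 1).
x1 = [r[0] for rows in data.values() for r in rows]
x2 = [r[1] for rows in data.values() for r in rows]
r12 = pearson(x1, x2)
checks["no_perfect_collinearity"] = abs(r12) < 0.999  # tolerance is an assumption

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'VIOLATED'}")
```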


STEPS
Key Terms and Concepts-
• Discriminating variables: The independent variables used to predict
the dependent variable. They are also called predictors.
• Criterion variable: The dependent variable is also called the
criterion variable.
• Discriminant function: The linear combination of the discriminating
(independent) variables is called the discriminant function. For
example,
$$L = b_1 x_1 + b_2 x_2 + \dots + b_n x_n + c$$
where $L$ is the discriminant function, $b_1, \dots, b_n$ are the
discriminant coefficients, $x_1, \dots, x_n$ are the independent
variables, and $c$ is a constant.

• Number of discriminant functions: With two groups there is one
discriminant function. In multiple discriminant analysis with g
groups there are g-1 discriminant functions (or as many as there are
predictors, if that is fewer).
• Eigenvalue: Also called the characteristic root. Each discriminant
function has one, and it indicates how much of the variance that
function explains.
• Discriminant score: The value obtained by applying the discriminant
function to a case. It is used to assign the case to a group.
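
To make the eigenvalue and the g-1 rule concrete, here is a small sketch on made-up data with three groups and two predictors. It assumes the usual formulation in which the eigenvalues are those of W^-1 B, where W is the within-group scatter matrix and B the between-group scatter matrix; the data and variable names are hypothetical.

```python
import math

# Hypothetical data: three groups (g = 3) measured on two predictors.
groups = {
    "g1": [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)],
    "g2": [(5.0, 2.5), (6.0, 1.5), (6.0, 3.5), (7.0, 2.5)],
    "g3": [(3.0, 6.0), (4.0, 5.0), (4.0, 7.0), (5.0, 6.0)],
}

def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def add_outer(m, d, weight=1.0):
    """Add weight * d d^T to the 2x2 matrix m in place."""
    for i in range(2):
        for j in range(2):
            m[i][j] += weight * d[i] * d[j]

all_rows = [r for rows in groups.values() for r in rows]
grand_mean = mean_vector(all_rows)

within = [[0.0, 0.0], [0.0, 0.0]]   # W: cases around their own group mean
between = [[0.0, 0.0], [0.0, 0.0]]  # B: group means around the grand mean
for rows in groups.values():
    m = mean_vector(rows)
    for r in rows:
        add_outer(within, [r[0] - m[0], r[1] - m[1]])
    add_outer(between, [m[0] - grand_mean[0], m[1] - grand_mean[1]], weight=len(rows))

# M = W^-1 B, using the closed-form inverse of a 2x2 matrix.
det_w = within[0][0] * within[1][1] - within[0][1] * within[1][0]
inv_w = [[within[1][1] / det_w, -within[0][1] / det_w],
         [-within[1][0] / det_w, within[0][0] / det_w]]
m_mat = [[sum(inv_w[i][k] * between[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]

# Eigenvalues of a 2x2 matrix from its trace and determinant.
trace = m_mat[0][0] + m_mat[1][1]
det_m = m_mat[0][0] * m_mat[1][1] - m_mat[0][1] * m_mat[1][0]
root = math.sqrt(max(trace * trace - 4.0 * det_m, 0.0))
eigenvalues = [(trace + root) / 2.0, (trace - root) / 2.0]

n_functions = min(len(groups) - 1, 2)  # g - 1, capped by the number of predictors
eigenvalues = eigenvalues[:n_functions]
total = sum(eigenvalues)
for k, ev in enumerate(eigenvalues, start=1):
    print(f"function {k}: eigenvalue = {ev:.3f}, share = {ev / total:.1%}")
```

The share printed for each function is its eigenvalue divided by the sum of all eigenvalues, which is how the relative importance of the functions is usually compared.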



• Cutoff: The value that divides the discriminant scores into two
parts. A case whose discriminant score falls on the negative side of
the cutoff is assigned to the lower category, and a case whose score
falls on the positive side is assigned to the higher category.
• Unstandardized discriminant coefficients: These work like
unstandardized regression coefficients (b) and are used to compute
the discriminant score. Standardized discriminant coefficients, like
regression beta weights, are used to compare the relative importance
of the independent variables.
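
The difference between the two kinds of coefficients can be illustrated with a short sketch. Everything here is hypothetical: the coefficients, the constant, the cutoff of 0 and the data are made up, and the standardization shown (unstandardized coefficient times the pooled within-group standard deviation of the variable) is one common convention, assumed for this example.

```python
# Hypothetical unstandardized coefficients and constant (assumed, not estimated here).
b = {"gpa": 1.2, "test_score": 0.02}
constant = -14.0
cutoff = 0.0  # assumed cutoff on the discriminant score

# Hypothetical cases with known group membership.
groups = {
    "no_college": [
        {"gpa": 2.5, "test_score": 450.0},
        {"gpa": 2.9, "test_score": 470.0},
        {"gpa": 3.0, "test_score": 510.0},
    ],
    "college": [
        {"gpa": 3.4, "test_score": 580.0},
        {"gpa": 3.6, "test_score": 600.0},
        {"gpa": 3.8, "test_score": 640.0},
    ],
}

def pooled_within_sd(groups, var):
    """Square root of the pooled within-group variance of one predictor."""
    ss = 0.0
    n = 0
    for rows in groups.values():
        values = [r[var] for r in rows]
        m = sum(values) / len(values)
        ss += sum((v - m) ** 2 for v in values)
        n += len(values)
    return (ss / (n - len(groups))) ** 0.5

def discriminant_score(case):
    """Apply the unstandardized coefficients to the raw values of one case."""
    return constant + sum(coef * case[var] for var, coef in b.items())

def classify(case):
    """Scores above the cutoff go to the higher category, the rest to the lower."""
    return "college" if discriminant_score(case) > cutoff else "no_college"

# Standardized coefficients put the predictors on a comparable scale.
standardized = {var: coef * pooled_within_sd(groups, var) for var, coef in b.items()}

for var in b:
    print(f"{var}: unstandardized = {b[var]}, standardized = {standardized[var]:.3f}")
```

In this toy data the raw coefficient of `gpa` is far larger than that of `test_score` only because the two variables are measured on very different scales; after standardization `test_score` turns out to carry more weight.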



TYPES OF DISCRIMINANT ANALYSIS-

LINEAR DISCRIMINANT ANALYSIS


• The linear discriminant analysis (LDA) model is used when the
groups are separable by linear combinations of the discriminating
variables. With only two features, the separators between groups of
objects are lines. With three features, the separator is a plane, and
when the number of features (i.e. independent variables) is more than
three, the separators become hyperplanes. The final value of the
discriminant function determines the group to which a particular
observation belongs. Appropriate threshold values and the relative
significance of the individual discriminant functions lead to the
final outcome/group.


• LDA is closely related to ANOVA (analysis of variance) and
regression analysis, which also attempt to express one dependent
variable as a linear combination of other features or measurements.
In the other two methods, however, the dependent variable is a
numerical quantity, while for LDA it is a categorical variable (i.e.
the class label).


Application-
Career Counsellors
• Suppose we have two groups of high school
graduates: Those who choose to attend
college after graduation and those who do
not. We could have measured students'
stated intention to continue on to college
one year prior to graduation. If the means
for the two groups (those who actually went
to college and those who did not) are
different, then we can say that intention to
attend college as stated one year prior to
graduation allows us to discriminate
between those who are and are not college
bound (and this information may be used by
career counsellors to provide the
appropriate guidance to the respective
students).
Marketing-
• In marketing, discriminant analysis
was once often used to determine
the factors which distinguish
different types of customers and/or
products on the basis of surveys or
other forms of collected data.
Logistic regression or other methods
are now more commonly used. The
use of discriminant analysis in
marketing can be described by the
following steps:
• Formulate the problem and gather data: Identify the salient
attributes consumers use to evaluate products in this category, then
use quantitative marketing research techniques (such as surveys) to
collect data from a sample of potential customers concerning their
ratings of all the product attributes. The data collection stage is
usually done by marketing research professionals. Survey questions
ask the respondent to rate a product from one to five (or 1 to 7, or
1 to 10) on a range of chosen attributes.
• Anywhere from five to twenty
attributes are chosen. They could
include things like: ease of use,
weight, accuracy, durability,
colourfulness, price, or size. The
attributes chosen will vary depending
on the product being studied. The
same question is asked about all the
products in the study. The data for
multiple products is codified and
input into a statistical program such
as R, SPSS or SAS. (This step is the
• Estimate the discriminant function coefficients and determine their
statistical significance and validity: Choose the appropriate
discriminant analysis method. The direct method estimates the
discriminant function so that all the predictors are assessed
simultaneously. The stepwise method enters the predictors
sequentially. The two-group method should be used when the dependent
variable has two categories or states. The multiple discriminant
method is used when the dependent variable has three or more
categorical states. Use Wilks' lambda to test for significance in
SPSS, or the F statistic in SAS. The most common way to test validity
is to split the sample into an estimation (analysis) sample and a
validation (holdout) sample.
• The estimation sample is used to construct the discriminant
function. The validation sample is used to construct a classification
matrix, which contains the number of correctly classified and
incorrectly classified cases. The percentage of correctly classified
cases is called the hit ratio.
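
The classification matrix and hit ratio can be sketched as below. The holdout cases and their predicted groups are made up for illustration; in practice the predictions would come from the discriminant function estimated on the analysis sample.

```python
from collections import Counter

# Hypothetical holdout sample: (actual group, group predicted by the function).
holdout = [
    ("buyer", "buyer"), ("buyer", "buyer"), ("buyer", "non-buyer"),
    ("non-buyer", "non-buyer"), ("non-buyer", "non-buyer"),
    ("non-buyer", "buyer"), ("buyer", "buyer"), ("non-buyer", "non-buyer"),
]

# Classification matrix: counts keyed by (actual, predicted) pairs.
matrix = Counter(holdout)
labels = sorted({label for pair in holdout for label in pair})

# Print one row per actual group, one column per predicted group.
print("actual / predicted:", *labels)
for actual in labels:
    print(actual, *(matrix[(actual, predicted)] for predicted in labels))

# Hit ratio: share of holdout cases on the diagonal of the matrix.
correct = sum(count for (actual, predicted), count in matrix.items()
              if actual == predicted)
hit_ratio = correct / len(holdout)
print(f"hit ratio = {hit_ratio:.1%}")
```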

• Plot the results on a two-dimensional map, define the dimensions,
and interpret the results. The statistical program (or a related
module) will map the results. The map plots each product (usually in
two-dimensional space). The distance between products indicates how
different they are. The dimensions must be labelled by the
researcher. This requires subjective judgement and is often very
challenging.

SOCIAL SCIENCES-

Prediction of Elections:
In this case the variables can be various social and economic
factors, coupled with party effort parameters. Some of these
variables are as follows:
(1) No. of new projects implemented by the incumbent party
(2) No. of candidates in the fray
(3) National reach of the party (no. of states it is active in)
(4) SEC division of the electorate (in the form of ratios)
(5) Profession-wise division of the electorate
(6) Age-wise division of the electorate

The variables mentioned above are a few representative parameters
that might have a bearing on the coming elections. Nowadays another
important parameter is the result of exit polls, which are conducted
by various media agencies. They indicate the general expectations of
the electorate.


Outcome of terrorist attacks with hostages:

• With the increasing occurrence of terrorist attacks, it has become
very important for law enforcement bodies and governments to ensure
minimal collateral damage during rescue operations. It is often
prudent to predict the likelihood of such an operation going badly,
i.e. of casualties during the rescue. Research on this front has
already been initiated. The basic hypothesis is that various
variables may be good predictors of the safe release or execution of
the hostages. Some of these variables are as follows:



(1) Number of terrorists
(2) Strength of their support in the local population
(3) Number of weapons and amount of ammunition with the terrorists
(4) Type of weapons wielded by the attackers
(5) Ratio of terrorists to hostages
(6) Whether the terrorists are independent operators or belong to
some large-scale terrorist outfit
(7) Time since the hostages were taken
(8) Female/male ratio among the hostages
(9) Children/adults ratio among the hostages

A model carefully trained on past cases can help the government
decide whether to use force or negotiation to neutralize the
terrorist threat.


MEDICINE AND DIAGNOSTICS

• Multivariate analysis, and especially discriminant analysis, has
been widely used to study trace elements in the food and
environmental fields. In the clinical field, discriminant analysis
has been tentatively used to improve the predictive value of
tomography images in the differential diagnosis between Alzheimer's
disease (AD) and frontotemporal dementia. Similarly, the need for
non-invasive, specific and sensitive tests has led researchers to
study whether the levels of some proteins considered markers of
neuronal degeneration are useful for discriminating between patients
and control groups.
Hepatitis Disease Detection
• Research in this domain is ongoing. Here LDA is useful for
determining the most important features affecting the onset of the
disease. Once this feature reduction is done, the actual
classification is performed by a fuzzy-network-based classifier.
Here LDA acts as a data conditioning step rather than as a predictor.
[Diagram: basic diagnostic flowchart]
• The study attained 94.16% accuracy in detecting hepatitis, which is
very high. This would allow quicker medication and hence recovery for
the patient.

INSURANCE COMPANIES

Insolvency prediction (Case study on Spanish Banks)
• Unlike other financial problems, business failure affects a great
number of agents, so research on this topic has attracted growing
interest in recent decades. Insolvency, early detection of financial
distress, and conditions leading to the insolvency of insurance
companies have been a concern of parties such as insurance
regulators, investors, management, financial analysts, banks,
auditors, policy holders and consumers. This concern has arisen from
the need to protect the general public against the consequences of
insurers' insolvencies, as well as to minimize the costs associated
with this problem, such as the effects on state insurance guaranty
funds or the responsibilities of management and auditors. It has long
been recognized that some form of supervision of such entities is
needed to minimize the risk of failure. The Solvency II project is
intended to reform the existing solvency rules in the European Union.
Many insolvency cases appeared after the insurance cycles of the
1970s and 1980s in the United States and the European Union.

• Several surveys have been devoted to identifying the main causes of
insurers' insolvency. In particular, the Müller Group Report (1997)
analyses the main identified causes of insurance insolvencies in the
European Union. The main reasons can be summarized as follows:
operational risks (operational failure related to inexperienced or
incompetent management, fraud); underwriting risks (inadequate
reinsurance programme and failure to recover from reinsurers, higher
losses due to rapid growth, excessive operating costs, poor
underwriting process); insufficient provisions; and imprudent
investments. On the other hand, many insurance companies, especially
larger ones, have developed internal risk models for a number of
purposes. There is no such standardized system in Spain, where most
insurance companies rely on internal check mechanisms to predict
insolvency.
• A recent study by academics from Madrid performed an LDA to predict
the insolvency of Spanish banks using historical data from 72 banks.
The data were collected 1, 2 and 3 years prior to insolvency. Some of
the results of the study are given below.
Here Model 1, 2 and 3 are predictors with data 1,
2, and 3 years prior to insolvency respectively.
Table: List of Financial Ratios, used as variables for the Predictor
model
BDM&DM
Table: Final Results of the LDA performed in the three models
From the above results we see that LDA was probably not the best
model to apply here, as the accuracy was very low: on the test cases
it was only slightly better than the 0.5 expected by chance. Some
other, more sophisticated classification method might work better
here.
CONCLUSIONS-

• In short, Discriminant Analysis is a
very useful tool (1) for detecting the
variables that allow the researcher to
discriminate between different
(naturally occurring) groups, and (2)
for classifying cases into different
groups with a better than chance
accuracy.

