
ANALYSIS OF VARIANCE (ANOVA)

ANOVA is a statistical analysis tool that separates the total variability found within a data set into two components: random and systematic factors. The random factors do not have any statistical influence on the given data set, while the systematic factors do. The ANOVA test is used to determine the impact that independent variables have on the dependent variable in a regression analysis.

Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences between group means and their associated procedures (such as "variation" among and between groups). In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests would result in an increased chance of committing a type I error. For this reason, ANOVAs are useful in comparing (testing) three or more means (groups or variables) for statistical significance.

The terminology of ANOVA is largely from the statistical design of experiments. The experimenter adjusts factors and measures responses in an attempt to
determine an effect. Factors are assigned to experimental units by a
combination of randomization and blocking to ensure the validity of the
results. Blinding keeps the weighing impartial. Responses show a variability
that is partially the result of the effect and is partially random error.

ANOVA is the synthesis of several ideas and it is used for multiple purposes. As
a consequence, it is difficult to define concisely or precisely.

Classical ANOVA for balanced data does three things at once:

1. As exploratory data analysis, an ANOVA is an organization of an additive data decomposition, and its sums of squares indicate the variance of each
component of the decomposition (or, equivalently, each set of terms of a
linear model).
2. Comparisons of mean squares, along with F-tests ... allow testing of a
nested sequence of models.
3. Closely related to the ANOVA is a linear model fit with coefficient estimates and standard errors.
In short, ANOVA is a statistical tool used in several ways to develop and
confirm an explanation for the observed data.

Additionally:

4. It is computationally elegant and relatively robust against violations of its assumptions.
5. ANOVA provides industrial strength (multiple sample comparison) statistical
analysis.
6. It has been adapted to the analysis of a variety of experimental designs.

Design-of-experiments terms (Associated with ANOVA)

Balanced design
An experimental design where all cells (i.e. treatment combinations) have
the same number of observations.

Blocking
A schedule for conducting treatment combinations in an experimental study
such that any effects on the experimental results due to a known change in
raw materials, operators, machines, etc., become concentrated in the levels
of the blocking variable. The reason for blocking is to isolate a systematic
effect and prevent it from obscuring the main effects. Blocking is achieved
by restricting randomization.
Design
A set of experimental runs which allows the fit of a particular model and the
estimate of effects.
DOE
Design of experiments. An approach to problem solving involving collection
of data that will support valid, defensible, and supportable conclusions.
Effect
How changing the settings of a factor changes the response. The effect of a
single factor is also called a main effect.
Error
Unexplained variation in a collection of observations. DOEs typically require an understanding of both random error and lack-of-fit error.

Experimental unit
The entity to which a specific treatment combination is applied.
Factors
Process inputs an investigator manipulates to cause a change in the output.
Lack-of-fit error
Error that occurs when the analysis omits one or more important terms or
factors from the process model. Including replication in a DOE allows
separation of experimental error into its components: lack of fit and random
(pure) error.
Model
Mathematical relationship which relates changes in a given response to
changes in one or more factors.
Random error
Error that occurs due to natural variation in the process. Random error is
typically assumed to be normally distributed with zero mean and a constant
variance. Random error is also called experimental error.
Randomization
A schedule for allocating treatment material and for conducting treatment
combinations in a DOE such that the conditions in one run neither depend on
the conditions of the previous run nor predict the conditions in the
subsequent runs.
Replication
Performing the same treatment combination more than once. Including
replication allows an estimate of the random error independent of any lack of
fit error.
Responses
The output(s) of a process, sometimes called the dependent variable(s).
Treatment
A treatment is a specific combination of factor levels whose effect is to be
compared with other treatments.

There are three classes of models used in the analysis of variance, and these are
outlined here.

Fixed-effects models

The fixed-effects model of analysis of variance applies to situations in which the experimenter applies one or more treatments to the subjects of the experiment to see if the response variable values change. This allows the
experimenter to estimate the ranges of response variable values that the
treatment would generate in the population as a whole.

Random-effects models

Random-effects models are used when the treatments are not fixed. This occurs
when the various factor levels are sampled from a larger population. Because
the levels themselves are random variables, some assumptions and the method
of contrasting the treatments (a multi-variable generalization of simple
differences) differ from the fixed-effects model.

Mixed-effects models

A mixed-effects model contains experimental factors of both fixed and random-effects types, with appropriately different interpretations and analysis for the two types.

Example: Teaching experiments could be performed by a university department to find a good introductory textbook, with each text considered a treatment. The
fixed-effects model would compare a list of candidate texts. The random-effects
model would determine whether important differences exist among a list of
randomly selected texts. The mixed-effects model would compare the (fixed)
incumbent texts to randomly selected alternatives.

Assumptions of ANOVA

The analysis of variance has been studied from several approaches, the most
common of which uses a linear model that relates the response to the treatments
and blocks. Note that the model is linear in parameters but may be nonlinear
across factor levels. Interpretation is easy when data is balanced across factors
but much deeper understanding is needed for unbalanced data.

The analysis of variance can be presented in terms of a linear model, which makes the following assumptions about the probability distribution of the responses:

 Independence of observations – this is an assumption of the model that simplifies the statistical analysis.

 Normality – the distributions of the residuals are normal.

 Equality (or "homogeneity") of variances, called homoscedasticity — the variance of data in groups should be the same.
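
These assumptions can be checked in software before running the ANOVA. Below is a minimal Python sketch, assuming three hypothetical group samples g1, g2 and g3; it uses SciPy's Shapiro-Wilk test for normality of the residuals and Levene's test for homogeneity of variances.

```python
# Minimal sketch: checking ANOVA assumptions on three hypothetical groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
g1 = rng.normal(10, 2, 30)
g2 = rng.normal(12, 2, 30)
g3 = rng.normal(11, 2, 30)

# Normality of residuals: pool each group's deviations from its own mean.
residuals = np.concatenate([g - g.mean() for g in (g1, g2, g3)])
print("Shapiro-Wilk:", stats.shapiro(residuals))

# Homogeneity of variances (homoscedasticity) across the groups.
print("Levene:", stats.levene(g1, g2, g3))
```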

We shall discuss One-Way and Two-Way ANOVA Models in detail:

One-Way ANOVA Model:

The One-Way ANOVA procedure produces a one-way analysis of variance for a quantitative dependent variable by a single factor (independent) variable.
Analysis of variance is used to test the hypothesis that several means are equal.
This technique is an extension of the two-sample t-test.

In addition to determining that differences exist among the means, you may
want to know which means differ. There are two types of tests for comparing
means: a priori contrasts and post-hoc tests.

Contrasts are tests set up before running the experiment and post hoc tests are
run after the experiment has been conducted. You can also test for trends
across categories.
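
As a minimal illustration, a one-way ANOVA can be run in Python with SciPy. The sketch below assumes three hypothetical treatment groups; scipy.stats.f_oneway returns the F statistic and its p-value.

```python
# Minimal sketch of a one-way ANOVA on three hypothetical treatment groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(10, 2, 30)
g2 = rng.normal(12, 2, 30)
g3 = rng.normal(11, 2, 30)

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")
# If p < 0.05, at least one group mean differs; a post hoc test such as
# scipy.stats.tukey_hsd(g1, g2, g3) can then show which pairs differ.
```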

Two-Way ANOVA Model:

The Two-Way ANOVA procedure produces a two-way analysis of variance for a quantitative dependent variable which is affected by two factors
simultaneously. The interaction effects can also be tested.
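
A two-way ANOVA with an interaction term can be sketched with statsmodels, assuming a small hypothetical data set with two factors A and B and a response y; the formula "y ~ C(A) * C(B)" includes both main effects and their interaction.

```python
# Minimal sketch of a two-way ANOVA with interaction (hypothetical factors A and B).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(2)
df = pd.DataFrame({
    "A": np.repeat(["a1", "a2"], 20),
    "B": np.tile(np.repeat(["b1", "b2"], 10), 2),
})
df["y"] = rng.normal(10, 2, len(df)) + np.where(df["A"] == "a2", 1.5, 0.0)

model = ols("y ~ C(A) * C(B)", data=df).fit()   # '*' = main effects + interaction
print(sm.stats.anova_lm(model, typ=2))          # ANOVA table with F tests
```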

ANALYSIS OF COVARIANCE (ANCOVA)

The analysis of covariance (ANCOVA) is a technique that is occasionally useful for improving the precision of an experiment. Suppose that in an experiment
with a response variable Y, there is another variable X, such that Y is linearly
related to X. Furthermore, suppose that the researcher cannot control X but
can observe it along with Y. Such a variable X is called a covariate or a
concomitant variable. The basic idea underlying ANCOVA is that precision in
detecting the effects of treatments on Y can be increased by adjusting the
observed values of Y for the effect of the concomitant variable. If such
adjustments are not performed, the concomitant variable X could inflate the
error mean square and make true differences in the response due to
treatments harder to detect. The concept is very similar to the use of blocks to
reduce the experimental error. However, when the blocking variable is a
continuous variable, the delimitation of the blocks can be very subjective. The
ANCOVA uses information about X in two ways:

1. Variation in Y that is associated with variation in X is removed from the error variance (MSE), resulting in more precise estimates and more powerful tests.

2. Individual observations of Y are adjusted to correspond to a common value of X, thereby producing group means that are not biased by X, as well as
equitable group comparisons. A sort of hybrid of ANOVA and linear
regression analysis, ANCOVA is a method of adjusting for the effects of an
uncontrollable nuisance variable.
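
A minimal ANCOVA sketch in Python, assuming a hypothetical data set with a treatment factor and a continuous covariate X: fitting "Y ~ C(treatment) + X" with statsmodels removes the X-related variation from the error term before testing the treatment effect.

```python
# Minimal sketch of an ANCOVA: treatment effect on Y adjusted for covariate X.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(3)
n = 60
df = pd.DataFrame({
    "treatment": np.repeat(["control", "drug"], n // 2),
    "X": rng.normal(50, 10, n),                 # observed, not controlled
})
df["Y"] = 0.4 * df["X"] + np.where(df["treatment"] == "drug", 3.0, 0.0) + rng.normal(0, 2, n)

ancova = ols("Y ~ C(treatment) + X", data=df).fit()  # X soaks up covariate-related error
print(sm.stats.anova_lm(ancova, typ=2))
```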

MULTIVARIATE ANALYSIS OF VARIANCE (MANOVA)

Multivariate analysis of variance (MANOVA) is simply an ANOVA with several dependent variables. That is to say, ANOVA tests for the difference in means
between two or more groups, while MANOVA tests for the difference in two or
more vectors of means. For example, we may conduct a study where we try
two different textbooks, and we are interested in the students' improvements
in math and physics. In that case, improvements in math and physics are the
two dependent variables, and our hypothesis is that both together are affected
by the difference in textbooks. A multivariate analysis of variance (MANOVA)
could be used to test this hypothesis. Instead of a univariate F value, we would
obtain a multivariate F value (Wilks' λ) based on a comparison of the error
variance/covariance matrix and the effect variance/covariance matrix.
Although we only mention Wilks' λ here, there are other statistics that may be
used, including Hotelling's trace and Pillai's criterion. The "covariance" here is
included because the two measures are probably correlated and we must take
this correlation into account when performing the significance test. Testing the
multiple dependent variables is accomplished by creating new dependent
variables that maximize group differences. These artificial dependent variables
are linear combinations of the measured dependent variables.
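
The textbook example above can be sketched in Python with statsmodels' MANOVA class, assuming hypothetical math and physics improvement scores for two textbooks; mv_test() reports Wilks' lambda alongside Pillai's trace and Hotelling's trace.

```python
# Minimal sketch of a MANOVA: two dependent variables, one factor (textbook).
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(4)
n = 40
df = pd.DataFrame({"textbook": np.repeat(["A", "B"], n // 2)})
df["math"] = rng.normal(5, 1, n) + np.where(df["textbook"] == "B", 0.8, 0.0)
df["physics"] = 0.5 * df["math"] + rng.normal(3, 1, n)

fit = MANOVA.from_formula("math + physics ~ textbook", data=df)
print(fit.mv_test())   # Wilks' lambda, Pillai's trace, Hotelling's trace, Roy's root
```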

Assumptions of MANOVA

Normal Distribution: The dependent variable should be normally distributed within groups. Overall, the F test is robust to non-normality if the non-
normality is caused by skewness rather than by outliers. Tests for outliers
should be run before performing a MANOVA, and outliers should be
transformed or removed.

Linearity: MANOVA assumes that there are linear relationships among all pairs
of dependent variables, all pairs of covariates, and all dependent variable-
covariate pairs in each cell. Therefore, when the relationship deviates from
linearity, the power of the analysis will be compromised.
Homogeneity of Variances: Homogeneity of variances assumes that the
dependent variables exhibit equal levels of variance across the range of
predictor variables.

MANCOVA

MANCOVA is an extension of ANCOVA. It is simply a MANOVA where the artificial dependent variables are initially adjusted for differences in one or
more covariates. This can reduce error "noise" when error associated with the
covariate is removed.

Logistic regression

Logistic regression is a statistical method for analysing a dataset in which there are one or more independent variables that determine an outcome. The
outcome is measured with a dichotomous variable (in which there are only two
possible outcomes).

In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE,
failure, non-pregnant, etc.).

The goal of logistic regression is to find the best fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest:

logit(p) = b0 + b1X1 + b2X2 + ... + bkXk

where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds:

odds = p / (1 - p)

and

logit(p) = ln(p / (1 - p))

Rather than choosing parameters that minimize the sum of squared errors (like
in ordinary regression), estimation in logistic regression chooses parameters
that maximize the likelihood of observing the sample values.
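
A minimal sketch of such a maximum-likelihood fit, assuming hypothetical AGE and SMOKING predictors of a binary outcome, using statsmodels' Logit:

```python
# Minimal sketch: maximum-likelihood logistic regression with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(5)
n = 200
df = pd.DataFrame({
    "AGE": rng.integers(20, 70, n),
    "SMOKING": rng.integers(0, 2, n),
})
true_logit = -4.5 + 0.11 * df["AGE"] + 1.2 * df["SMOKING"]
df["outcome"] = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(int)

X = sm.add_constant(df[["AGE", "SMOKING"]])
model = sm.Logit(df["outcome"], X).fit()   # parameters chosen to maximize the likelihood
print(model.summary())                     # coefficients, standard errors, P-values
```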

Overall model fit

The null model -2 Log Likelihood is given by -2 * ln(L0) where L0 is the likelihood
of obtaining the observations if the independent variables had no effect on the
outcome.

The full model -2 Log Likelihood is given by -2 * ln(L) where L is the likelihood of obtaining the observations with all independent variables incorporated in the
model.

The difference between these two yields a Chi-squared statistic, which is a measure of how well the independent variables affect the outcome or dependent variable.

If the P-value for the overall model fit statistic is less than the conventional
0.05 then there is evidence that at least one of the independent variables
contributes to the prediction of the outcome.
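
A minimal sketch of this likelihood-ratio test, assuming a fitted statsmodels Logit result named model (as in the earlier sketch):

```python
# Minimal sketch of the overall (likelihood-ratio) model fit test, assuming a
# fitted statsmodels Logit result called `model` as in the sketch above.
from scipy import stats

ll_null = model.llnull                  # ln(L0): intercept-only model
ll_full = model.llf                     # ln(L): model with all independent variables
chi2 = (-2 * ll_null) - (-2 * ll_full)  # difference of the two -2 Log Likelihoods
df_model = model.df_model               # number of independent variables
p_value = stats.chi2.sf(chi2, df_model)
print(f"Chi-squared = {chi2:.2f}, df = {df_model:.0f}, p = {p_value:.4f}")
```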

Regression coefficients

The regression coefficients are the coefficients b0, b1, b2, ... bk of the regression equation:

logit(p) = b0 + b1X1 + b2X2 + ... + bkXk

An independent variable with a regression coefficient not significantly different from 0 (P>0.05) can be removed from the regression model (press function key
F7 to repeat the logistic regression procedure). If P<0.05 then the variable
contributes significantly to the prediction of the outcome variable.
The logistic regression coefficients show the change (increase when bi>0,
decrease when bi<0) in the predicted logged odds of having the characteristic
of interest for a one-unit change in the independent variables.

When the independent variables Xa and Xb are dichotomous variables (e.g. Smoking, Sex) then the influence of these variables on the dependent variable
can simply be compared by comparing their regression coefficients ba and bb.

Odds ratios with 95% CI

By taking the exponential of both sides of the regression equation as given above, the equation can be rewritten as:

p / (1 - p) = e^(b0) * e^(b1X1) * e^(b2X2) * ... * e^(bkXk)

It is clear that when a variable Xi increases by 1 unit, with all other factors remaining unchanged, then the odds will increase by a factor e^bi.

This factor e^bi is the odds ratio (O.R.) for the independent variable Xi and it gives the relative amount by which the odds of the outcome increase (O.R. greater than 1) or decrease (O.R. less than 1) when the value of the independent variable is increased by 1 unit.

For example, if the variable SMOKING is coded as 0 (= no smoking) and 1 (= smoking), and the odds ratio for this variable is 3.2, then in the model the odds for a positive outcome in cases that do smoke are 3.2 times higher than in cases that do not smoke.
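
A minimal sketch of obtaining odds ratios with 95% confidence intervals from a fitted statsmodels Logit result named model (an assumption carried over from the earlier sketch): the coefficients and their confidence limits are simply exponentiated.

```python
# Minimal sketch: odds ratios (e^bi) with 95% confidence intervals from a
# fitted statsmodels Logit result called `model` (assumption carried over).
import numpy as np

table = np.exp(model.conf_int())        # exponentiated lower / upper 95% limits
table.columns = ["OR 95% lower", "OR 95% upper"]
table["OR"] = np.exp(model.params)      # the odds ratio e^bi itself
print(table)
```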

Interpretation of the fitted equation

The logistic regression equation is:

logit(p) = -4.48 + 0.11 x AGE + 1.16 x SMOKING

So for 40-year-old cases who do smoke, logit(p) equals 1.08. Logit(p) can be back-transformed to p by the following formula:

p = 1 / (1 + e^(-logit(p)))

Alternatively, you can use the Logit table. For logit(p)=1.08 the probability p of
having a positive outcome equals 0.75.
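
A minimal sketch of this back-transformation in Python, using the fitted equation quoted above:

```python
# Minimal sketch of back-transforming logit(p) to p, using the equation above.
import math

def inverse_logit(x: float) -> float:
    # p = 1 / (1 + e^(-logit(p)))
    return 1.0 / (1.0 + math.exp(-x))

logit_p = -4.48 + 0.11 * 40 + 1.16 * 1          # 40-year-old case who smokes
print(round(logit_p, 2))                         # 1.08
print(round(inverse_logit(logit_p), 2))          # about 0.75
```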

Hosmer & Lemeshow test


The Hosmer-Lemeshow test is a statistical test for goodness of fit for the logistic regression model. The data are divided into approximately ten groups defined by increasing order of estimated risk. The observed and expected number of cases in each group is calculated and a Chi-squared statistic is calculated as follows:

Chi-squared = Σ (Og - Eg)² / (Eg (1 - Eg / ng)),  summed over the g = 1, ..., n groups

with Og, Eg and ng the observed events, expected events and number of observations for the gth risk decile group, and n the number of groups. The test statistic follows a Chi-squared distribution with n-2 degrees of freedom.

A large value of Chi-squared (with small p-value < 0.05) indicates poor fit and
small Chi-squared values (with larger p-value closer to 1) indicate a good
logistic regression model fit.
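
A minimal sketch of computing the Hosmer-Lemeshow statistic by hand, assuming arrays y (observed 0/1 outcomes) and p_hat (predicted probabilities) are available from a fitted model; the grouping into risk deciles uses pandas' qcut.

```python
# Minimal sketch of the Hosmer-Lemeshow statistic, assuming arrays y (0/1
# observed outcomes) and p_hat (predicted probabilities) from a fitted model.
import pandas as pd
from scipy import stats

def hosmer_lemeshow(y, p_hat, groups=10):
    df = pd.DataFrame({"y": y, "p": p_hat})
    # Divide cases into (approximately) ten groups of increasing estimated risk.
    df["g"] = pd.qcut(df["p"], groups, labels=False, duplicates="drop")
    obs = df.groupby("g")["y"].sum()        # Og: observed events per group
    exp = df.groupby("g")["p"].sum()        # Eg: expected events per group
    n_g = df.groupby("g")["y"].count()      # ng: observations per group
    chi2 = (((obs - exp) ** 2) / (exp * (1 - exp / n_g))).sum()
    n_groups = df["g"].nunique()
    return chi2, stats.chi2.sf(chi2, n_groups - 2)
```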

The Contingency Table for Hosmer and Lemeshow Test table shows the details of the test with observed and expected number of cases in each group.

Classification table
The classification table is another method to evaluate the predictive accuracy
of the logistic regression model. In this table the observed values for the
dependent outcome and the predicted values (at a user defined cut-off value,
for example p=0.50) are cross-classified. In our example, the model correctly
predicts 70% of the cases.

ROC curve analysis


Another method to evaluate the logistic regression model makes use of ROC
curve analysis. In this analysis, the power of the model's predicted values to
discriminate between positive and negative cases is quantified by the Area
under the ROC curve (AUC). The AUC, sometimes referred to as the c-statistic
(or concordance index), is a value that varies from 0.5 (discriminating power
not better than chance) to 1.0 (perfect discriminating power).

To perform a full ROC curve analysis on the predicted probabilities you can
save the predicted probabilities and next use this new variable in ROC curve
analysis. The Dependent variable used in Logistic Regression then acts as the
Classification variable in the ROC curve analysis dialog box.
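
A minimal sketch of this step with scikit-learn, again assuming arrays y (observed outcomes) and p_hat (predicted probabilities saved from the logistic regression):

```python
# Minimal sketch of ROC curve analysis, assuming arrays y (observed 0/1
# outcomes) and p_hat (predicted probabilities saved from the model).
from sklearn.metrics import roc_auc_score, roc_curve

auc = roc_auc_score(y, p_hat)               # c-statistic: 0.5 = chance, 1.0 = perfect
fpr, tpr, thresholds = roc_curve(y, p_hat)  # points of the ROC curve
print(f"Area under the ROC curve (AUC) = {auc:.3f}")
```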

Sample size considerations


Sample size calculation for logistic regression is a complex problem, but based
on the work of Peduzzi et al. (1996) the following guideline for a minimum
number of cases to include in your study can be suggested.

Let p be the smallest of the proportions of negative or positive cases in the population and k the number of covariates (the number of independent
variables), then the minimum number of cases to include is:

N = 10 k / p

For example, if you have 3 covariates to include in the model and the proportion of positive cases in the population is 0.20 (20%), then the minimum number of cases required is

N = 10 x 3 / 0.20 = 150

If the resulting number is less than 100 you should increase it to 100 as
suggested by Long (1997).
FACTOR ANALYSIS

Factor analysis is a general name denoting a class of procedures primarily used for data reduction and summarization. Factor analysis is an interdependence
technique in that an entire set of interdependent relationships is examined
without making the distinction between dependent and independent variables.

Factor analysis is used in the following circumstances:

 To identify underlying dimensions, or factors, that explain the correlations among a set of variables.
 To identify a new, smaller, set of uncorrelated variables to replace the
original set of correlated variables in subsequent multivariate analysis
(regression or discriminant analysis).
 To identify a smaller set of salient variables from a larger set for use in
subsequent multivariate analysis.

Factor Analysis Model

The unique factors are uncorrelated with each other. The common factors
themselves can be expressed as linear combinations of the observed variables.

Fi = Wi1X1 + Wi2X2 + Wi3X3 + . . . + WikXk

where

Fi = estimate of the ith factor

Wi = weight or factor score coefficient

k = number of variables

 It is possible to select weights or factor score coefficients so that the first factor explains the largest portion of the total variance.
 Then a second set of weights can be selected, so that the second factor
accounts for most of the residual variance, subject to being uncorrelated
with the first factor.
 This same principle could be applied to selecting additional weights for the
additional factors.
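
A minimal sketch of extracting two common factors in Python with scikit-learn's FactorAnalysis, on hypothetical standardized data; fit_transform returns the factor scores and components_ the loadings.

```python
# Minimal sketch of factor analysis with scikit-learn on hypothetical data
# (rows = respondents, columns = standardized variables); two factors assumed.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 6))                 # stand-in for real survey variables
X_std = StandardScaler().fit_transform(X)

fa = FactorAnalysis(n_components=2, random_state=0)
scores = fa.fit_transform(X_std)              # factor scores for each respondent
print(fa.components_)                         # loadings of each variable on each factor
```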

Statistics Associated with Factor Analysis

 Bartlett's test of sphericity. Bartlett's test of sphericity is a test statistic used to examine the hypothesis that the variables are uncorrelated in the
population. In other words, the population correlation matrix is an identity
matrix; each variable correlates perfectly with itself (r = 1) but has no
correlation with the other variables (r = 0).
 Correlation matrix. A correlation matrix is a lower triangle matrix
showing the simple correlations, r, between all possible pairs of variables
included in the analysis. The diagonal elements, which are all 1, are usually
omitted.
 Communality. Communality is the amount of variance a variable shares
with all the other variables being considered. This is also the proportion of
variance explained by the common factors.
 Eigenvalue. The eigenvalue represents the total variance explained by each
factor.
 Factor loadings. Factor loadings are simple correlations between the
variables and the factors.
 Factor loading plot. A factor loading plot is a plot of the original variables
using the factor loadings as coordinates.
 Factor matrix. A factor matrix contains the factor loadings of all the
variables on all the factors extracted.

 Factor scores. Factor scores are composite scores estimated for each
respondent on the derived factors.
 Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy. The
Kaiser-Meyer-Olkin (KMO) measure of sampling adequacy is an index
used to examine the appropriateness of factor analysis. High values
(between 0.5 and 1.0) indicate factor analysis is appropriate. Values below
0.5 imply that factor analysis may not be appropriate.
 Percentage of variance. The percentage of the total variance attributed to
each factor.

 Residuals are the differences between the observed correlations, as given in the input correlation matrix, and the reproduced correlations, as estimated from the factor matrix.

 Scree plot. A scree plot is a plot of the Eigenvalues against the number of
factors in order of extraction.

DISCRIMINANT ANALYSIS

Discriminant analysis is a technique for analyzing data when the criterion or dependent variable is categorical and the predictor or independent variables are
interval in nature.

The objectives of discriminant analysis are as follows:

 Development of discriminant functions, or linear combinations of the predictor or independent variables, which will best discriminate between the
categories of the criterion or dependent variable (groups).
 Examination of whether significant differences exist among the groups, in
terms of the predictor variables.
 Determination of which predictor variables contribute to most of the
intergroup differences.
 Classification of cases to one of the groups based on the values of the
predictor variables.
 Evaluation of the accuracy of classification.
 When the criterion variable has two categories, the technique is known as
two-group discriminant analysis.
 When three or more categories are involved, the technique is referred to as
multiple discriminant analysis.
 The main distinction is that, in the two-group case, it is possible to derive
only one discriminant function. In multiple discriminant analysis, more
than one function may be computed. In general, with G groups and k
predictors, it is possible to estimate up to the smaller of G - 1, or k,
discriminant functions.
 The first function has the highest ratio of between-groups to within-groups
sum of squares. The second function, uncorrelated with the first, has the
second highest ratio, and so on. However, not all the functions may be
statistically significant.

Similarities and Differences between ANOVA, Regression, and Discriminant Analysis

                                        ANOVA        Regression   Discriminant Analysis
Similarities
  Number of dependent variables         One          One          One
  Number of independent variables       Multiple     Multiple     Multiple
Differences
  Nature of the dependent variables     Metric       Metric       Categorical
  Nature of the independent variables   Categorical  Metric       Metric

Discriminant Analysis Model

The discriminant analysis model involves linear combinations of the following form:

D = b0 + b1X1 + b2X2 + b3X3 + . . . + bkXk

where

D = discriminant score

b's = discriminant coefficient or weight

X's = predictor or independent variable

The coefficients or weights (b) are estimated so that the groups differ as much
as possible on the values of the discriminant function. This occurs when the
ratio of between-group sum of squares to within-group sum of squares for the
discriminant scores is at a maximum.
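
A minimal sketch of a two-group discriminant analysis with scikit-learn's LinearDiscriminantAnalysis, on hypothetical predictors X and group labels y; coef_ and intercept_ play the role of the weights b and constant b0, and decision_function returns the discriminant scores D.

```python
# Minimal sketch of a two-group discriminant analysis with scikit-learn,
# using hypothetical predictors X and group labels y.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(1, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)

lda = LinearDiscriminantAnalysis().fit(X, y)
print(lda.coef_, lda.intercept_)     # discriminant weights (b's) and constant (b0)
D = lda.decision_function(X)         # discriminant score for each case
print(lda.score(X, y))               # proportion of cases correctly classified
```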

Assumptions of the Model

The discriminant model has the following assumptions:


•  The predictors are not highly correlated with each other
•  The mean and variance of a given predictor are not correlated
•  The correlation between two predictors is constant across groups
•  The values of each predictor have a normal distribution.
Statistics Associated with Discriminant Analysis
 Canonical Correlation. Canonical correlation measures the extent of
association between the discriminant scores and the groups. It is a measure
of association between the single discriminant function and the set of dummy
variables that define the group membership.
 Centroid. The centroid is the mean values for the discriminant scores for a
particular group. There are as many centroids as there are groups, as there is
one for each group. The means for a group on all the functions are the group
centroids.
 Classification matrix. Sometimes also called prediction matrix, the
classification matrix contains the number of correctly classified and
misclassified cases.
 Discriminant function coefficients. The discriminant function coefficients
(unstandardized) are the multipliers of variables, when the variables are in the
original units of measurement.
 Discriminant scores. The unstandardized coefficients are multiplied by the
values of the variables. These products are summed and added to the
constant term to obtain the discriminant scores.
 Eigenvalue. For each discriminant function, the eigenvalue is the ratio of between-group to within-group sums of squares. Large eigenvalues imply superior functions.
 F values and their significance. These are calculated from a one-way
ANOVA, with the grouping variable serving as the categorical independent
variable. Each predictor, in turn, serves as the metric dependent variable in
the ANOVA.
 Group means and group standard deviations. These are computed for
each predictor for each group.
 Pooled within-group correlation matrix. The pooled within-group
correlation matrix is computed by averaging the separate covariance matrices
for all the groups.
 Standardized discriminant function coefficients. The standardized
discriminant function coefficients are the discriminant function coefficients
and are used as the multipliers when the variables have been standardized to a
mean of 0 and a variance of 1.
 Structure correlations. Also referred to as discriminant loadings, the
structure correlations represent the simple correlations between the predictors
and the discriminant function.
 Total correlation matrix. If the cases are treated as if they were from a
single sample and the correlations computed, a total correlation matrix is
obtained.
 Wilks' λ. Sometimes also called the U statistic. Wilks' λ for each predictor
is the ratio of the within-group sum of squares to the total sum of squares. Its
value varies between 0 and 1. Large values of λ (near 1) indicate that group
means do not seem to be different. Small values of λ (near 0) indicate that
the group means seem to be different.
CLUSTER ANALYSIS

• Cluster analysis is a class of techniques used to classify objects or cases into relatively homogeneous groups called clusters.
• Objects in each cluster tend to be similar to each other and dissimilar to
objects in the other clusters
• Both cluster analysis and discriminant analysis are concerned with classification. However, discriminant analysis requires prior knowledge of the cluster or group membership for each object or case included, in order to develop the classification rule.
• In cluster analysis there is no a priori information about the group or
cluster membership for any of the objects.
• Groups or clusters are suggested by the data, not defined a priori.

[Figures omitted: an ideal clustering situation versus a practical clustering situation]

Statistics Associated with Cluster Analysis

• Agglomeration schedule. An agglomeration schedule gives information on the objects or cases being combined at each stage of a hierarchical clustering process.
• Cluster centroid. The cluster centroid is the mean values of the variables
for all the cases or objects in a particular cluster.
• Cluster centers. The cluster centers are the initial starting points in
nonhierarchical clustering. Clusters are built around these centers, or seeds.
• Cluster membership. Cluster membership indicates the cluster to which
each object or case belongs.
• Dendrogram. A dendrogram, or tree graph, is a graphical device for
displaying clustering results. Vertical lines represent clusters that are joined
together. The position of the line on the scale indicates the distances at
which clusters were joined. The dendrogram is read from left to right.
• Distances between cluster centers. These distances indicate how separated
the individual pairs of clusters are. Clusters that are widely separated are
distinct, and therefore desirable.

A Classification of Clustering Procedures

[Diagram omitted: agglomerative hierarchical procedures comprise linkage methods, variance methods (e.g., Ward's method), and centroid methods]

Select a Clustering Procedure – Hierarchical

 Hierarchical clustering is characterized by the development of a hierarchy or tree-like structure. Hierarchical methods can be
agglomerative or divisive.
 Agglomerative clustering starts with each object in a separate cluster.
Clusters are formed by grouping objects into bigger and bigger clusters.
This process is continued until all objects are members of a single cluster.
 Divisive clustering starts with all the objects grouped in a single cluster.
Clusters are divided or split until each object is in a separate cluster.
 Agglomerative methods are commonly used in marketing research.
They consist of linkage methods, error sums of squares or variance
methods, and centroid methods.
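
A minimal sketch of agglomerative hierarchical clustering with SciPy on hypothetical two-dimensional data, using Ward's method; linkage returns the agglomeration schedule and fcluster cuts the tree into a chosen number of clusters.

```python
# Minimal sketch of agglomerative hierarchical clustering with SciPy (Ward's method)
# on hypothetical two-dimensional data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

Z = linkage(X, method="ward")                      # the agglomeration schedule
labels = fcluster(Z, t=2, criterion="maxclust")    # cut the tree into two clusters
print(labels)
# dendrogram(Z) would draw the tree graph (requires matplotlib for display).
```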

Select a Clustering Procedure – Linkage Method

 The single linkage method is based on minimum distance, or the nearest neighbor rule. At every stage, the distance between two clusters is the distance between their two closest points.
 The complete linkage method is similar to single linkage, except that it is based on the maximum distance or the furthest neighbor approach. In complete linkage, the distance between two clusters is calculated as the distance between their two furthest points.

 The average linkage method works similarly. However, in this method, the distance between two clusters is defined as the average of the
distances between all pairs of objects, where one member of the pair is
from each of the clusters

Select a Clustering Procedure – Nonhierarchical

 The nonhierarchical clustering methods are frequently referred to as k-means clustering. These methods include sequential threshold, parallel
threshold, and optimizing partitioning
 In the sequential threshold method, a cluster center is selected and all
objects within a prespecified threshold value from the center are grouped
together. Then a new cluster center or seed is selected, and the process is
repeated for the unclustered points. Once an object is clustered with a
seed, it is no longer considered for clustering with subsequent seeds

 The parallel threshold method operates similarly, except that several cluster centers are selected simultaneously and objects within the
threshold level are grouped with the nearest center

 The optimizing partitioning method differs from the two threshold procedures in that objects can later be reassigned to clusters to optimize
an overall criterion, such as average within cluster distance for a given
number of clusters.
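
A minimal sketch of nonhierarchical (k-means) clustering with scikit-learn on hypothetical data, asking for three clusters; cluster_centers_ and labels_ correspond to the cluster centers and cluster membership described above.

```python
# Minimal sketch of nonhierarchical (k-means) clustering with scikit-learn
# on hypothetical data, asking for three clusters.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(c, 0.4, (25, 2)) for c in (0, 2, 4)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # the cluster centers (seeds after optimization)
print(km.labels_)            # cluster membership for each object
```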

Select a Clustering Procedure

 It has been suggested that the hierarchical and nonhierarchical methods be used in tandem. First, an initial clustering solution is obtained using a
hierarchical procedure, such as average linkage or Ward's. The number
of clusters and cluster centroids so obtained are used as inputs to the
optimizing partitioning method
 Choice of a clustering method and choice of a distance measure are
interrelated. For example, squared Euclidean distances should be used
with Ward's and the centroid methods. Several nonhierarchical
procedures also use squared Euclidean distances.

Decide on the Number of Clusters

 Theoretical, conceptual, or practical considerations may suggest a certain number of clusters.
 In hierarchical clustering, the distances at which clusters are combined
can be used as criteria. This information can be obtained from the
agglomeration schedule or from the dendrogram
 In nonhierarchical clustering, the ratio of total within-group variance to
between-group variance can be plotted against the number of clusters.
The point at which an elbow or a sharp bend occurs indicates an
appropriate number of clusters
 The relative sizes of the clusters should be meaningful.
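
A minimal sketch of the elbow approach with scikit-learn, on hypothetical data: the total within-cluster sum of squares (inertia) is printed or plotted against the number of clusters, and the sharp bend suggests an appropriate number.

```python
# Minimal sketch of the "elbow" heuristic: within-cluster sum of squares (inertia)
# versus number of clusters, on hypothetical data with three true groups.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(10)
X = np.vstack([rng.normal(c, 0.4, (25, 2)) for c in (0, 2, 4)])

for k in range(1, 8):
    inertia = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    print(k, round(inertia, 1))   # inertia drops sharply until k reaches the true number
```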

Interpreting and Profiling the Clusters

 Interpreting and profiling clusters involves examining the cluster centroids. The centroids enable us to describe each cluster by assigning it
a name or label
 It is often helpful to profile the clusters in terms of variables that were not
used for clustering. These may include demographic, psychographic,
product usage, media usage, or other variables.
