
Summary

Statistical Pattern Recognition: A Review


Abstract:
Pattern recognition is used for supervised and unsupervised classification. Applications
involving complex patterns demand efficient recognition techniques. The paper under
review surveys prominent methods used in the different stages of a pattern recognition
system and identifies research topics and applications at its forefront.
1. Introduction:
The human brain is capable of recognizing different patterns and characters irrespective of
size, orientation, background or foreground clutter, occlusion, or mode of representation.
Teaching a machine to do the same is pattern recognition. Pattern recognition is critical
in decision-making tasks: the more relevant the patterns, the better the decision. The
authors' goal is to use the available sensors, processors, and domain knowledge to build
pattern recognition systems that enable automatic decision making.
1.1. What is Pattern Recognition?
A pattern is defined as an entity that has some order and can be given a name. Pattern
recognition comprises three steps: 1) data acquisition and preprocessing, 2) data
representation, and 3) decision making. The sensors, preprocessing technique,
representation scheme, and decision-making model are chosen according to the problem
to be addressed. A well-defined problem leads to a compact pattern representation and a
simple decision-making strategy. The four best-known approaches to pattern recognition
are outlined below:
1.2. Template Matching:
It involves measuring the similarity between a stored template of a pattern and a new
pattern (to be recognized) while accounting for allowable translation, rotation, and scale
changes. Its computational expense is offset by the fast processors now available. The
technique does not cope well with distortions arising from the imaging process or a change
of viewpoint, or when large intraclass variations are present among the patterns.
1.3. Statistical Approach:
Each pattern is represented by d features, i.e., as a point in a d-dimensional space, with the
aim that patterns belonging to different categories occupy compact, disjoint regions of that
space. The objective is to establish decision boundaries, based on probability distributions
(either specified or learned), that separate the different pattern classes.
1.4. Syntactic Approach:
It involves breaking down a complex pattern into elementary subpatterns called
primitives. Primitives and grammatical rules (inferred from training samples) can
describe a large collection of complex patterns. The approach is used where patterns have
a definite structure that can be captured by a set of rules, e.g., EKG waveforms. It faces
difficulties with the segmentation of noisy patterns and with inferring the grammar.
1.5. Neural Networks:
These are parallel computing systems consisting of many interconnected simple
processors. They can learn complex nonlinear input-output relationships and use
sequential training procedures to adapt themselves to the available data. The network, a
weighted directed graph, is updated (its weights adjusted) for efficient data classification.
Neural networks can also perform feature extraction. Because of their implicit similarity
to statistical methods, neural networks have been described as "statistics for amateurs"
in which the statistics is concealed.
1.6. Scope and Organization:
The paper focuses on statistical methods for pattern recognition. The statistical
approach is treated under six broad topics, covered in Sections 3 through 8.

2. Statistical Pattern Recognition:


Section 1.3 gave an overview of statistical pattern recognition. To continue: the pattern is
preprocessed (a compact representation of the pattern is obtained), appropriate features
are extracted or selected (to represent the input pattern), and a classifier then partitions the
feature space, assigning a class to the pattern based on the extracted features.
The features of each class follow a class-conditional density (a PMF or PDF), and decision
boundaries are drawn using decision rules such as the Bayes rule or maximum likelihood.
The decision problem is either parametric (known densities) or nonparametric (unknown
densities); in the nonparametric case, one either estimates the density functions or
constructs the decision boundary directly from the training data.
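As an illustration of the parametric route, here is a minimal sketch (not taken from the paper) of a plug-in Bayes classifier: Gaussian class-conditional densities and class priors are estimated from labeled training data, and a pattern is assigned to the class with the largest posterior. The function names and the Gaussian assumption are choices of this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_gaussian_plugin(X, y):
    """Estimate mean, covariance, and prior of each class from labeled data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False), len(Xc) / len(X))
    return params

def bayes_decide(x, params):
    """Plug-in Bayes rule: pick the class maximizing p(x | c) * P(c)."""
    scores = {c: prior * multivariate_normal.pdf(x, mean=mu, cov=cov)
              for c, (mu, cov, prior) in params.items()}
    return max(scores, key=scores.get)
```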
One dichotomy of this method is supervised learning (each pattern belongs to a
predefined class specified by the system designer; discriminant analysis) versus
unsupervised learning (patterns are assigned to classes learned from the similarity among
patterns; clustering). Another dichotomy is whether the decision boundaries are obtained
directly by optimizing certain cost functions (the geometric approach) or indirectly by first
estimating density functions and then constructing discriminant functions for the decision
boundary (the probabilistic, density-based approach).
Irrespective of the classification or decision rule, the system must be trained on training
data, and classifier performance depends on both the number of available training samples
and their specific values. The system is then tested on samples different from those used
for training, and its performance on these samples may not be optimal, because of the
curse of dimensionality or overfitting.
3. The Curse of Dimensionality and Peaking Phenomena:
Dividing a feature space into cells and giving each cell a label requires the number of
training data points to grow exponentially with the feature dimension [18]. This is the
"curse of dimensionality." The peaking phenomenon [80], [131], [132] arises as a
consequence of the curse of dimensionality when the number of training samples used to
design a classifier is small relative to the number of features: the reliability of the estimated
probabilities decreases as features are added, so accuracy peaks and then degrades. It is
good practice to use at least ten times as many training samples per class as the number of
features when designing a classifier; for example, a classifier using 20 features should be
trained on at least 200 samples per class.
4. Dimensionality Reduction:
Low measurement cost and high classification accuracy drive the need for dimensionality
reduction (i.e., fewer features). A carefully chosen set of prominent features yields faster,
less memory-hungry classifiers with high accuracy. Following feature extraction, feature
selection filters out features with low discrimination ability. The criterion function must be
chosen carefully so that the reduction does not sacrifice discrimination power and lower
the accuracy of the system.
4.1. Feature Extraction:
Feature extraction is the determination of an appropriate subspace of lower dimensionality
within the original feature space, using either a linear or a nonlinear transformation.
Examples of linear extractors are Principal Component Analysis (PCA), discriminant
analysis, Projection Pursuit [53], Independent Component Analysis (ICA) [31], [11], [24],
[96], and the PCA network; nonlinear extractors include Kernel PCA [73], [145], the
nonlinear auto-associative network, multidimensional scaling [20], and the Self-Organizing
Map [92].
The best-known unsupervised linear extractor is PCA, which works well for Gaussian data,
while Projection Pursuit and ICA are suited to non-Gaussian distributions. Neural networks
offer both feature extraction and classification, often through nonlinear methods such as
those listed above.
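Below is a minimal NumPy sketch of PCA as a linear feature extractor, intended only as an illustration: it projects the centered patterns onto the k leading eigenvectors of the sample covariance matrix. The function name and the parameter k are choices of this sketch.

```python
import numpy as np

def pca_extract(X, k):
    """Project n x d patterns onto the k principal directions, i.e., the
    eigenvectors of the sample covariance with the largest eigenvalues."""
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)               # eigenvalues in ascending order
    leading = eigvecs[:, np.argsort(eigvals)[::-1][:k]]  # k leading eigenvectors
    return X_centered @ leading                          # n x k extracted features
```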
4.2. Feature Selection:
Selecting, from a larger pool of features, the subset that yields the smallest classification
error falls under the umbrella of feature selection. The type of classifier used and the
training and test sets directly affect which subset is selected. Feature selection is largely an
offline process, so a longer execution time can be traded for a feature subset closer to the
optimum.
Methods that return an optimal subset include exhaustive search and branch-and-bound
search. Suboptimal methods include best individual features, Sequential
Forward/Backward Selection (SFS/SBS), "plus l, take away r" selection, and Sequential
Forward/Backward Floating Selection (SFFS/SBFS) [126]; these trade optimality for
computational efficiency. A sketch of SFS is given below.
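A minimal sketch of sequential forward selection, assuming a caller-supplied criterion eval_subset (e.g., cross-validated accuracy of the chosen classifier); the function names and the simple stopping rule (stop after k features) are illustrative assumptions.

```python
def sequential_forward_selection(eval_subset, n_features, k):
    """Greedily grow a feature subset: at each step add the single feature
    whose inclusion gives the largest value of eval_subset(subset)."""
    selected, remaining = [], list(range(n_features))
    while len(selected) < k and remaining:
        # evaluate every candidate feature added to the current subset
        scores = {f: eval_subset(selected + [f]) for f in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected
```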
5. Classifiers:
There are three common approaches to designing a classifier.
1. Similarity Approach:
It follows the logic that similar patterns should be assigned to the same class. Template
matching, or a minimum-distance classifier based on a metric and class prototypes, is used
to classify a pattern by finding the nearest prototype or subspace (a minimal sketch appears
after this list).
2. Probabilistic Approach:
Decision rules such as the Bayes rule give optimal classifiers when the class-conditional
probabilities are known. In practice, an empirical Bayes ("plug-in") rule is used with
estimated probabilities, obtained parametrically or nonparametrically. Parametric models
include Gaussian, binomial, and multinomial distributions; nonparametric methods employ
the k-nearest neighbor (k-NN) rule or the Parzen classifier.
3. Geometric Approach:
It involves constructing decision boundaries directly by optimizing an error criterion.
Examples include Fisher's linear discriminant, single- and multi-layer perceptrons, decision
trees [22], [30], [129], and the support vector classifier [162]. The support vector classifier
is a prominent two-class classifier that maximizes the width of the margin between the
classes; kernel functions allow it to form nonlinear decision boundaries.
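To make the similarity approach above concrete, here is a minimal sketch of a nearest-class-mean (minimum distance) classifier with the Euclidean metric; using class means as prototypes and the function name are choices of this sketch, not prescriptions of the paper.

```python
import numpy as np

def nearest_mean_classify(X_train, y_train, X_test):
    """Minimum-distance classifier: use each class mean as a prototype and
    assign every test pattern to the class of the closest prototype."""
    classes = np.unique(y_train)
    prototypes = np.stack([X_train[y_train == c].mean(axis=0) for c in classes])
    dists = np.linalg.norm(X_test[:, None, :] - prototypes[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```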

6. Classifier Combination:
Outputs of individual classifiers can be combined to improve the overall classification
accuracy of a system; the individual classifiers may differ in their training sets, classification
methods, or training sessions. Depending on the architecture, classifiers can be combined
in parallel, in series (cascade), or in a hierarchy.
6.1. Selection and Training of Individual Classifiers:
Classifier combination is most useful when the individual classifiers are independent.
Independence may be promoted by resampling techniques such as stacking (a combining
classifier is trained on the outputs of the individual classifiers) [168], bagging (different
bootstrap datasets are drawn from the original training set and the resulting classifiers are
combined by a rule such as averaging) [21], or boosting (individual classifiers are trained
hierarchically so that later ones concentrate on the harder regions of the feature space)
[142]. Other methods include cluster analysis, which splits the classes of a training set into
subclasses, and the random subspace method, which uses different feature subsets rather
than different classifiers on different training sets. A sketch of the bagging idea follows.
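A minimal sketch of the bootstrap resampling behind bagging, assuming a caller-supplied base learner train_classifier; the names and the number of classifiers are illustrative.

```python
import numpy as np

def bagging_train(train_classifier, X, y, n_classifiers=10, seed=0):
    """Train each base classifier on a bootstrap replicate (a sample of the
    training set drawn with replacement) to obtain a diverse ensemble."""
    rng = np.random.default_rng(seed)
    ensemble = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap replicate
        ensemble.append(train_classifier(X[idx], y[idx]))
    return ensemble
```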

6.2. Combiner:
A combiner merges the outputs of the individual classifiers. It can be static (no training
required) or trainable, and non-adaptive (treating all input patterns the same) or adaptive
(weighing the decisions of the individual classifiers according to the input pattern).
Combiners expect the classifier output at one of three levels: confidence (a numerical value
representing the probability that the pattern belongs to each class), rank (a ranked list of
the classes), or abstract (a unique class label, or a subset of labels).
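Two fixed (static, non-trainable) combining rules are sketched below: majority voting on abstract-level outputs and averaging of confidence-level outputs. Integer class labels in the range 0..K-1 are assumed, and the function names are illustrative.

```python
import numpy as np

def majority_vote(labels):
    """Abstract-level combiner: labels is (n_classifiers, n_samples) of integer
    class labels; return the most frequent label for each sample."""
    return np.array([np.bincount(labels[:, i]).argmax()
                     for i in range(labels.shape[1])])

def mean_rule(confidences):
    """Confidence-level combiner: confidences is (n_classifiers, n_samples,
    n_classes); average over classifiers, then take the most probable class."""
    return confidences.mean(axis=0).argmax(axis=1)
```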
6.3. Theoretical Analysis of Combination Schemes:
Classifier combination improves recognition accuracy, as numerous experiments have
shown, but only a few theoretical explanations have been offered. The open question is
how to choose the best bias-variance tradeoff for the combined classifier(s).
6.4. An Example:
The authors used a digit dataset consisting of handwritten numerals (0-9) to evaluate
different classifiers and combination rules, with the same training and test sets used for
each of the 12 classifiers. A summary of the error rates shows that the combined results are
better than those of any individual classifier and that the performance of the different
classifiers varies substantially across feature sets.
7. Error Estimation:
Classifier performance is generally measured by its error rate. A simple analytical
expression for the error rate is not possible, even in simple cases, since it is a combined
function of the samples, the densities, the prior probabilities, and so on. The error rate is a
random variable and is estimated from the available samples; the various error-estimation
methods differ in how they split the available samples into training and test sets.
Another measure is the reject rate: doubtful patterns are rejected instead of being assigned
a class, which reduces the error.
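One common way of splitting the available samples is k-fold cross-validation; a minimal sketch follows, assuming a caller-supplied train_and_predict routine that stands in for any classifier.

```python
import numpy as np

def cross_validation_error(train_and_predict, X, y, k=5, seed=0):
    """k-fold cross-validation: every sample is tested exactly once by a
    classifier trained on the remaining folds; return the average error rate."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = train_and_predict(X[train], y[train], X[test])
        errors.append(np.mean(y_pred != y[test]))
    return float(np.mean(errors))
```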

8. Unsupervised Classification:
Unsupervised classification (data clustering) constructs decision boundaries (clusters)
from unlabeled data by perceiving similarities among patterns [81]. An appropriate
measure of similarity must be defined, and it is both data and context dependent.
Clustering techniques are either partitional, most commonly iterative square-error
clustering (which minimizes within-cluster scatter or maximizes between-cluster scatter),
or agglomerative hierarchical (the data are organized in a nested sequence of groups that
can be viewed as a tree). In partitional clustering, the number of clusters may or may not
be specified, and either a global or a local criterion is adopted.
8.1. Square-Error Clustering:
This is a commonly used partitional clustering strategy whose objective is to obtain, for a
fixed number of clusters, the partition that minimizes the square error. The K-means
algorithm is an example of such a clustering method. The designer still has to choose the
number of clusters K, the initial partition, how to update the partition, whether to adjust
the number of clusters, and the stopping criterion.
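A minimal NumPy sketch of K-means, the square-error method named above; the random initialization from the data points and the convergence test are choices of this sketch.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Alternate between assigning each pattern to its nearest center and
    recomputing each center as the mean of its cluster (square-error descent)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(np.linalg.norm(X[:, None] - centers[None], axis=2), axis=1)
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):   # partition no longer changes
            break
        centers = new_centers
    return labels, centers
```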
8.2. Mixture Decomposition:
Mixtures aptly model situations in which the patterns have been produced by a set of
alternative, probabilistically modeled sources, and they thereby provide formal approaches
to unsupervised classification. They are also well suited to representing complex class-
conditional densities in supervised learning scenarios. Finite mixtures can even be used as
a feature selection tool [127].
8.2.1. Basic Definitions:
There are two fundamental issues when fitting a mixture model to a set of observations:
1) estimating the parameters that define the mixture model and 2) estimating the number
of components. These issues are addressed in Sections 8.2.2 and 8.2.3, respectively.
8.2.2. EM Algorithm:
The algorithm interprets the given observations as incomplete data, the missing part being
a set of component labels. It alternates two steps: the E-step computes the conditional
expectation of the complete-data log-likelihood, and the M-step updates the parameter
estimates. The algorithm is critically dependent on initialization and may converge to a
point on the boundary of the parameter space with unbounded likelihood.
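A minimal sketch of EM for a Gaussian mixture with k components, intended only to make the E- and M-steps concrete; the initialization, the fixed iteration count, and the small covariance regularization term are assumptions of this sketch.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, k, n_iter=100, seed=0):
    """Fit a k-component Gaussian mixture by EM: the E-step computes each
    component's responsibility for each pattern, the M-step re-estimates the
    mixing weights, means, and covariances from those responsibilities."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    means = X[rng.choice(n, size=k, replace=False)]
    covs = np.array([np.cov(X, rowvar=False) + 1e-6 * np.eye(d)] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(component j | pattern i)
        dens = np.column_stack([w * multivariate_normal.pdf(X, m, S)
                                for w, m, S in zip(weights, means, covs)])
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update mixing weights, means, and covariances
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        covs = np.array([(r[:, j, None] * (X - means[j])).T @ (X - means[j]) / nk[j]
                         + 1e-6 * np.eye(d) for j in range(k)])
    return weights, means, covs
```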

8.2.3. Estimating the Number of Components:


The maximum-likelihood criterion is useless for selecting the number of components,
since the maximized likelihood can only increase as components are added. Alternatives
include EM-based approaches such as minimum description length (MDL) [10] and
minimum message length (MML) [2], resampling-based schemes, cross-validation
approaches, and stochastic algorithms, which generally involve Markov chain Monte Carlo
(MCMC) sampling and which the authors consider too computationally expensive for
pattern recognition applications.
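As an illustration of penalized-likelihood selection in the spirit of such criteria (not the specific MDL/MML formulations cited above), the sketch below reuses the em_gmm function from the previous sketch and picks the number of components that minimizes a BIC-style score; the penalty term and the parameter count are assumptions of this illustration.

```python
import numpy as np
from scipy.stats import multivariate_normal

def select_num_components(X, k_max=5):
    """Fit mixtures with 1..k_max components (using the em_gmm sketch above)
    and keep the k that minimizes a BIC-style penalized negative log-likelihood."""
    n, d = X.shape
    best_k, best_score = None, np.inf
    for k in range(1, k_max + 1):
        weights, means, covs = em_gmm(X, k)
        dens = sum(w * multivariate_normal.pdf(X, m, S)
                   for w, m, S in zip(weights, means, covs))
        loglik = np.log(dens).sum()
        n_params = k * (d + d * (d + 1) / 2) + (k - 1)   # means, covariances, weights
        score = -2 * loglik + n_params * np.log(n)       # BIC-style criterion
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```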
9. Discussion:
Statistical pattern recognition has experienced rapid growth due to increased interaction
and collaboration among different disciplines, the prevalence of faster and less expensive
computing, its use in challenging applications such as data mining, and the need for
principled, rather than ad hoc, approaches to solving problems.
9.1. Frontiers of Pattern Recognition:
Problems such as model selection, mixture modeling, optimization methods, local
decision-boundary learning, local invariance and (dis)similarity measures, independent
component analysis, classifier combination, semi-supervised learning, and the use of
context are all leading research areas at the frontier of pattern recognition.
9.2. Concluding Remarks:
The authors conclude by quoting Watanabe [164] from the preface of his edited work on
the frontiers of pattern recognition.
Acknowledgments:
The authors thank their reviewers for their assistance, comments and suggestions.

References:
[2] H. Akaike, "A New Look at Statistical Model Identification," IEEE Trans. Automatic Control, vol. 19, pp. 716-723, 1974.
[11] A. Bell and T. Sejnowski, "An Information-Maximization Approach to Blind Separation," Neural Computation, vol. 7, pp. 1004-1034, 1995.
[18] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[20] I. Borg and P. Groenen, Modern Multidimensional Scaling. Berlin: Springer-Verlag, 1997.
[21] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[22] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, Calif., 1984.
[24] J. Cardoso, "Blind Signal Separation: Statistical Principles," Proc. IEEE, vol. 86, pp. 2009-2025, 1998.
[30] P.A. Chou, "Optimal Partitioning for Classification and Regression Trees," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 4, pp. 340-354, Apr. 1991.
[31] P. Comon, "Independent Component Analysis, a New Concept?," Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.
[53] J.H. Friedman, "Exploratory Projection Pursuit," J. Am. Statistical Assoc., vol. 82, pp. 249-266, 1987.
[73] S. Haykin, Neural Networks, A Comprehensive Foundation, second ed. Englewood Cliffs, N.J.: Prentice Hall, 1999.
[80] A.K. Jain and B. Chandrasekaran, "Dimensionality and Sample Size Considerations in Pattern Recognition Practice," Handbook of Statistics, P.R. Krishnaiah and L.N. Kanal, eds., vol. 2, pp. 835-855. Amsterdam: North-Holland, 1982.
[81] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[92] T. Kohonen, Self-Organizing Maps. Springer Series in Information Sciences, vol. 30, Berlin, 1995.
[96] T.W. Lee, Independent Component Analysis. Dordrecht: Kluwer Academic Publishers, 1998.
[126] P. Pudil, J. Novovicova, and J. Kittler, "Floating Search Methods in Feature Selection," Pattern Recognition Letters, vol. 15, no. 11, pp. 1119-1125, 1994.
[127] P. Pudil, J. Novovicova, and J. Kittler, "Feature Selection Based on the Approximation of Class Densities by Finite Mixtures of the Special Type," Pattern Recognition, vol. 28, no. 9, pp. 1389-1398, 1995.
[129] J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.
[131] S.J. Raudys and V. Pikelis, "On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithms in Pattern Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 2, pp. 243-251, 1980.
[132] S.J. Raudys and A.K. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, 1991.
[142] R.E. Schapire, "The Strength of Weak Learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[145] B. Schölkopf, A. Smola, and K.R. Muller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[162] V.N. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons, 1998.
[164] S. Watanabe, Pattern Recognition: Human and Mechanical. New York: Wiley, 1985.
[168] D. Wolpert, "Stacked Generalization," Neural Networks, vol. 5, pp. 241-259, 1992.
