to adapt themselves to the available data. The network of weighted directed graphs is updated for efficient data classification. Neural networks can also perform feature extraction. Because of their implicit similarity to statistical methods, neural networks have been regarded as "statistics for amateurs," in which the statistics is concealed from the user.
1.6. Scope and Organization:
The paper focuses on statistical methods for pattern recognition. The statistical approach is broadly divided into five categories, covered in Sections 3 through 8.
The choice of criterion function is critical: a poor choice can reduce the discrimination power and hence the accuracy of the system.
4.1. Feature Extraction:
Feature extraction is the determination of an appropriate subspace of lower dimensionality from the original feature space, using either a linear or a nonlinear mapping. Linear extractors include Principal Component Analysis (PCA), Discriminant Analysis, Projection Pursuit [53], Independent Component Analysis (ICA) [31], [11], [24], [96], and the PCA network; nonlinear extractors include Kernel PCA [73], [145], the nonlinear auto-associative network, Multidimensional Scaling [20], and the Self-Organizing Map [92].
The best-known unsupervised linear extractor is PCA, which works well for Gaussian data; Projection Pursuit and ICA are better suited to non-Gaussian distributions. Neural networks offer both feature extraction and classification, often through nonlinear methods such as those listed above.
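The PCA extractor mentioned above can be sketched in a few lines: center the data, eigendecompose the covariance matrix, and keep the directions of largest variance. This is a minimal illustration, not the paper's implementation; all names are ours.

```python
import numpy as np

def pca(X, k):
    """Project data onto the top-k principal components.

    X : (n_samples, n_features) data matrix; k : target dimensionality.
    Returns the projected data and the projection matrix."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = np.cov(Xc, rowvar=False)           # feature covariance matrix
    vals, vecs = np.linalg.eigh(cov)         # symmetric eigendecomposition
    order = np.argsort(vals)[::-1][:k]       # largest-variance directions first
    W = vecs[:, order]                       # (n_features, k) projection matrix
    return Xc @ W, W

# Toy data: 2-D points that mostly vary along the direction [1, 1].
rng = np.random.default_rng(0)
X = (rng.normal(size=(200, 1)) @ np.array([[1.0, 1.0]])
     + 0.1 * rng.normal(size=(200, 2)))
Z, W = pca(X, k=1)
```

The recovered component is close to [1, 1]/sqrt(2), the dominant direction of variation in the toy data.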
4.2. Feature Selection:
Feature selection is the task of choosing, from a large pool of features, the subset that yields the minimal classification error. The type of classifier used and the training and test sets directly affect the outcome. Feature selection is inherently an offline process, so its methods must trade execution time against the optimality of the generated feature subset.
Methods that guarantee an optimal subset include exhaustive search and Branch & Bound search. Suboptimal methods include Best Individual Features, Sequential Forward/Backward Selection (SFS/SBS), Plus l-take away r, and Sequential Forward/Backward Floating Selection (SFFS/SBFS) [126]; all trade optimality for computational efficiency.
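Sequential Forward Selection, the simplest of the suboptimal methods above, can be sketched as a greedy loop: at each step, add the feature that most improves a supplied evaluation criterion. The criterion used below (correlation with the labels) is only a stand-in for a classifier-based score; the function names are ours.

```python
import numpy as np

def sfs(X, y, evaluate, k):
    """Sequential Forward Selection: greedily grow a feature subset.

    evaluate(X_sub, y) -> score to maximize (e.g. estimated accuracy).
    Suboptimal, but far cheaper than exhaustive search over all subsets."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_f, best_score = None, -np.inf
        for f in remaining:                          # try adding each candidate
            score = evaluate(X[:, selected + [f]], y)
            if score > best_score:
                best_f, best_score = f, score
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy data: only feature 2 carries class information.
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=100).astype(float)
X = rng.normal(size=(100, 5))
X[:, 2] += 2 * y
score = lambda Xs, t: abs(np.corrcoef(Xs.mean(axis=1), t)[0, 1])
chosen = sfs(X, y, score, k=2)
```

The greedy structure is exactly why SFS is suboptimal: a feature useful only in combination with another can be missed, which the floating variants (SFFS/SBFS) try to repair.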
5. Classifiers:
There are three common approaches to designing a classifier.
1. Similarity Approach:
It follows the logic that similar patterns should belong to the same class. Template matching, or a minimum-distance classifier defined by a metric and a prototype, classifies a pattern by finding the nearest class prototype (or nearest subspace).
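A minimal sketch of such a minimum-distance classifier, with the class mean as prototype and the Euclidean metric; class and method names are ours, not the paper's.

```python
import numpy as np

class NearestMeanClassifier:
    """Minimum-distance classifier: each class is represented by the mean
    of its training patterns (the prototype); a test pattern is assigned
    to the class whose prototype is nearest in the Euclidean metric."""

    def fit(self, X, y):
        self.labels = np.unique(y)
        self.prototypes = np.array(
            [X[y == c].mean(axis=0) for c in self.labels])
        return self

    def predict(self, X):
        # distance from every pattern to every class prototype
        d = np.linalg.norm(X[:, None, :] - self.prototypes[None, :, :], axis=2)
        return self.labels[d.argmin(axis=1)]

# Two well-separated Gaussian blobs.
rng = np.random.default_rng(2)
X0 = rng.normal([0, 0], 0.5, size=(50, 2))
X1 = rng.normal([3, 3], 0.5, size=(50, 2))
X, y = np.vstack([X0, X1]), np.array([0] * 50 + [1] * 50)
clf = NearestMeanClassifier().fit(X, y)
acc = (clf.predict(X) == y).mean()
```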
2. Probabilistic Approach:
Decision rules such as the Bayes rule yield optimal classifiers when the class-conditional probabilities are known. In practice, an empirical Bayes decision rule (plug-in rule) is used, with the probabilities estimated parametrically or non-parametrically. Parametric models include the Gaussian, binomial, and multinormal distributions; non-parametric approaches employ the k-nearest-neighbor (KNN) rule or the Parzen classifier.
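The plug-in rule with a Gaussian parametric model can be sketched as follows: estimate each class's mean, covariance, and prior from the training data, then classify by the largest estimated posterior. This is an illustration under that Gaussian assumption; the helper names are ours.

```python
import numpy as np

def fit_plug_in_bayes(X, y):
    """Estimate (mean, covariance, prior) per class from training data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (Xc.mean(axis=0), np.cov(Xc, rowvar=False),
                     len(Xc) / len(X))
    return params

def log_gauss(x, mean, cov):
    """Log of the multivariate Gaussian density at x."""
    d = x - mean
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d @ np.linalg.solve(cov, d)
                   + logdet + len(x) * np.log(2 * np.pi))

def predict(params, x):
    # plug-in Bayes rule: arg-max over classes of
    # log prior + log estimated class-conditional likelihood
    return max(params, key=lambda c: np.log(params[c][2])
               + log_gauss(x, params[c][0], params[c][1]))

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(4, 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)
model = fit_plug_in_bayes(X, y)
pred = predict(model, np.array([3.8, 4.1]))
```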
3. Geometric Approach:
It constructs decision boundaries directly by optimizing an error criterion. Examples include Fisher's linear discriminant, single- and multi-layer perceptrons, decision trees [22], [30], [129], and the support vector classifier [162].
The prominent support vector classifier is a two-class classifier that separates the classes by maximizing the width of the margin between them; nonlinear decision boundaries are obtained through nonlinear (kernel) combinations of distance functions.
6. Classifier Combination:
Outputs of individual classifiers can be combined to improve the overall classification accuracy of a system, with diversity obtained through different training sets, classification methods, or training sessions. Depending on the architecture, classifiers can be combined in parallel, in series, or in a hierarchy.
6.1. Selection and Training of Individual Classifiers:
Classifier combination is useful when the individual classifiers are largely independent. Independence may be encouraged through resampling techniques such as stacking (the decision depends on the outputs of both the stacked and the individual classifiers) [168], bagging (different datasets are drawn from the original one and the resulting classifiers are combined by a rule such as averaging) [21], or boosting (individual classifiers are trained hierarchically to learn increasingly complex regions of the feature space) [142]. Other approaches include cluster analysis, which separates the classes of a training set into subclasses, and the random subspace method, which trains classifiers on different feature subsets rather than on different training sets.
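Bagging, the middle technique above, can be sketched as: train one classifier per bootstrap resample of the training set and combine the outputs by a simple rule (majority vote for labels). The weak learner here is a nearest-mean classifier chosen for brevity; all names are illustrative.

```python
import numpy as np

def bagged_predict(X, y, x_new, n_models=25, rng=None):
    """Bagging sketch: one nearest-mean classifier per bootstrap
    resample, combined by majority vote."""
    rng = rng or np.random.default_rng(0)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap resample
        Xb, yb = X[idx], y[idx]
        means = {c: Xb[yb == c].mean(axis=0) for c in np.unique(yb)}
        votes.append(min(means, key=lambda c: np.linalg.norm(x_new - means[c])))
    # combination rule: majority vote over the individual classifiers
    return max(set(votes), key=votes.count)

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
label = bagged_predict(X, y, np.array([2.9, 3.2]))
```

Each resample draws len(X) samples with replacement, so every classifier sees a slightly different dataset, which is what makes the combined vote more stable than any single member.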
6.2. Combiner:
A combiner merges the outputs of the individual classifiers. It can be static (no training required) or trainable, and non-adaptive (treating all input patterns alike) or adaptive (weighing the decisions of the individual classifiers). Combiners expect the classifier output to be a confidence level (a numerical value representing the probability that the pattern belongs to a class), a rank (an ordered list of the classes), or an abstract output (a unique class label, or a set of labels).
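The three output types admit correspondingly simple static combiners, sketched below: averaging for confidence levels, a Borda-style count for ranks, and majority vote for abstract labels. Function names are ours.

```python
import numpy as np

def combine_confidences(posteriors):
    """Confidence level: average the per-class probability estimates."""
    return int(np.mean(posteriors, axis=0).argmax())

def combine_ranks(rank_lists):
    """Rank level (Borda count): score each class by its rank positions."""
    n = len(rank_lists[0])
    scores = {}
    for ranking in rank_lists:
        for pos, cls in enumerate(ranking):
            scores[cls] = scores.get(cls, 0) + (n - pos)
    return max(scores, key=scores.get)

def combine_labels(labels):
    """Abstract level: simple majority vote over the output labels."""
    return max(set(labels), key=labels.count)

# Three classifiers, three classes:
post = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3], [0.5, 0.3, 0.2]]
best_conf = combine_confidences(post)
best_rank = combine_ranks([["a", "b", "c"], ["b", "a", "c"], ["a", "c", "b"]])
best_vote = combine_labels(["a", "b", "a"])
```

All three combiners here are static; a trainable combiner would instead learn weights for the individual classifiers from validation data.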
6.3. Theoretical Analysis of Combination Schemes:
Classifier combination improves recognition accuracy; this is evident from numerous experiments, but only a few theoretical explanations have been offered. The central dilemma is choosing the best bias-variance trade-off for the classifier(s).
1.1. An Example:
The authors used a digit dataset consisting of handwritten numerals (0-9) to evaluate different classifiers and combination rules, with the same training and test sets for each of the 12 classifiers. A comprehensive summary of error rates shows that the combination results are better than those of the individual classifiers and that the performance of different classifiers varies substantially across feature sets.
2. Error Estimation:
Classifier performance is generally measured by the error rate. A simple analytical expression for the error rate is not available, even in simple cases, since it is a complicated function of the samples, densities, and prior probabilities. The error rate is therefore a random variable estimated from the available samples; error-estimation methods differ in how they divide those samples into training and test sets.
Another measure is the reject rate: doubtful patterns are rejected rather than assigned to a class, thereby reducing the error.
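One standard way of using the available samples, sketched below, is k-fold cross-validation: each sample serves as test data exactly once, so the estimate uses the data more fully than a single train/test split. The nearest-mean learner is an illustrative choice, not the paper's; names are ours.

```python
import numpy as np

def cv_error(X, y, fit, predict, k=5, rng=None):
    """k-fold cross-validation estimate of the error rate."""
    rng = rng or np.random.default_rng(0)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    errors = 0
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], y[train])                 # train on k-1 folds
        errors += np.sum(predict(model, X[test]) != y[test])
    return errors / len(X)                              # each sample tested once

# Illustrative learner: nearest class mean.
fit = lambda X, y: {c: X[y == c].mean(axis=0) for c in np.unique(y)}
predict = lambda m, X: np.array(
    [min(m, key=lambda c: np.linalg.norm(x - m[c])) for x in X])

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
err = cv_error(X, y, fit, predict, k=5)
```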
3. Unsupervised Classification:
Unsupervised classification (data clustering) constructs decision boundaries (clusters) from unlabeled data by perceiving similarities among patterns [81]. An appropriate measure of similarity must be defined, and it is both data and context dependent. Clustering techniques fall into two families: the more commonly used iterative square-error partitional clustering (which minimizes the within-cluster scatter or maximizes the between-cluster scatter) and agglomerative hierarchical clustering (which organizes the data into a nested sequence of groups viewable as a tree). In partitional clustering, the number of clusters may or may not be specified, but either a global or a local criterion is adopted.
3.1. Square-Error Clustering:
This commonly used partitional strategy seeks the partition that, for a fixed number of clusters, minimizes the square error; the K-means algorithm is the classic example. The designer must still settle the number of clusters K, the initial partition, how the partition is updated, whether the number of clusters is adjusted, and the stopping criterion.
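The K-means loop itself is short, as the sketch below shows: alternately assign each pattern to its nearest center and recompute centers as cluster means, which monotonically decreases the square error. K and the initialization remain the designer's choices discussed above; names are ours.

```python
import numpy as np

def kmeans(X, k, n_iter=50, rng=None):
    """Basic K-means square-error clustering."""
    rng = rng or np.random.default_rng(0)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # initial partition
    for _ in range(n_iter):
        # assignment step: each pattern goes to its nearest center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # update step: centers become the means of their clusters
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):                       # stopping criterion
            break
        centers = new
    sq_error = ((X - centers[labels]) ** 2).sum()
    return labels, centers, sq_error

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centers, sq_error = kmeans(X, k=2)
```

On these two well-separated blobs the algorithm recovers the generating groups; with overlapping data the result depends heavily on the initial partition, which is precisely the design difficulty noted above.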
3.2. Mixture Decomposition:
Mixtures aptly model situations in which the patterns have been produced by a set of alternative (probabilistically modeled) sources, and thereby support formal approaches to unsupervised classification. They are also well suited to representing complex class-conditional densities in supervised learning, and finite mixtures can even be used as a feature selection tool [127].
3.2.1. Basic Definitions:
There are two fundamental issues in fitting a mixture model to a set of observations: 1) estimating the parameters that define the mixture model, and 2) estimating the number of components. These issues are addressed in the two subsections that follow.
3.2.2. EM Algorithm:
The EM algorithm interprets the given observations as incomplete data, the missing part being the set of component labels. It alternates two steps: the E-step computes the conditional expectation of the complete log-likelihood, and the M-step updates the parameter estimates. EM is critically dependent on its initialization and may converge to a point on the boundary of the parameter space where the likelihood is unbounded.
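The E- and M-steps can be made concrete for a two-component 1-D Gaussian mixture, as in the sketch below; the dependence on the initialization noted above enters through the `init` argument. Names and the 1-D restriction are ours.

```python
import numpy as np

def em_gmm_1d(x, n_iter=100, init=(0.5, -1.0, 1.0, 1.0, 1.0)):
    """EM for a two-component 1-D Gaussian mixture.

    init = (weight, mean1, var1, mean2, var2). The observations x are
    the incomplete data; the missing part is the component label."""
    w, m1, v1, m2, v2 = init
    for _ in range(n_iter):
        # E-step: responsibilities = conditional expectation of the labels
        p1 = w * np.exp(-0.5 * (x - m1) ** 2 / v1) / np.sqrt(2 * np.pi * v1)
        p2 = (1 - w) * np.exp(-0.5 * (x - m2) ** 2 / v2) / np.sqrt(2 * np.pi * v2)
        r = p1 / (p1 + p2)
        # M-step: update the parameter estimates
        w = r.mean()
        m1 = (r * x).sum() / r.sum()
        m2 = ((1 - r) * x).sum() / (1 - r).sum()
        v1 = (r * (x - m1) ** 2).sum() / r.sum()
        v2 = ((1 - r) * (x - m2) ** 2).sum() / (1 - r).sum()
    return w, m1, v1, m2, v2

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(-2, 0.5, 300), rng.normal(2, 0.5, 300)])
w, m1, v1, m2, v2 = em_gmm_1d(x)
```

With the initial means placed on either side of zero the estimates settle near the generating values; a poor initialization (e.g. both means in one mode) can land on a much worse local optimum, illustrating the sensitivity described above.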
like data mining, and the need for a principled, rather than ad hoc, approach to solving problems.
4.1. Frontiers of Pattern Recognition:
Problems such as model selection, mixture modeling, optimization methods, local decision-boundary learning, invariant (dis)similarity measures, independent component analysis, classifier combination, semi-supervised learning, and the use of context are all leading research areas in pattern recognition.
4.2. Concluding Remarks:
The authors conclude by quoting Watanabe [164], from the preface of his edited volume Frontiers of Pattern Recognition.
Acknowledgments:
The authors thank their reviewers for their assistance, comments and suggestions.
References:
[2] H. Akaike, "A New Look at Statistical Model Identification," IEEE Trans. Automatic Control, vol. 19, pp. 716-723, 1974.
[11] A. Bell and T. Sejnowski, "An Information-Maximization Approach to Blind Separation," Neural Computation, vol. 7, pp. 1004-1034, 1995.
[18] C.M. Bishop, Neural Networks for Pattern Recognition. Oxford: Clarendon Press, 1995.
[20] I. Borg and P. Groenen, Modern Multidimensional Scaling. Berlin: Springer-Verlag, 1997.
[21] L. Breiman, "Bagging Predictors," Machine Learning, vol. 24, no. 2, pp. 123-140, 1996.
[22] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, Classification and Regression Trees. Wadsworth, Calif., 1984.
[24] J. Cardoso, "Blind Signal Separation: Statistical Principles," Proc. IEEE, vol. 86, pp. 2009-2025, 1998.
[30] P.A. Chou, "Optimal Partitioning for Classification and Regression Trees," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 4, pp. 340-354, Apr. 1991.
[31] P. Comon, "Independent Component Analysis, a New Concept?," Signal Processing, vol. 36, no. 3, pp. 287-314, 1994.
[53] J.H. Friedman, "Exploratory Projection Pursuit," J. Am. Statistical Assoc., vol. 82, pp. 249-266, 1987.
[73] S. Haykin, Neural Networks, A Comprehensive Foundation, second ed. Englewood Cliffs, N.J.: Prentice Hall, 1999.
[80] A.K. Jain and B. Chandrasekaran, "Dimensionality and Sample Size Considerations in Pattern Recognition Practice," Handbook of Statistics, P.R. Krishnaiah and L.N. Kanal, eds., vol. 2, pp. 835-855. Amsterdam: North-Holland, 1982.
[81] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice Hall, 1988.
[92] T. Kohonen, Self-Organizing Maps. Springer Series in Information Sciences, vol. 30, Berlin, 1995.
[96] T.W. Lee, Independent Component Analysis. Dordrecht: Kluwer Academic Publishers, 1998.
[126] P. Pudil, J. Novovicova, and J. Kittler, "Floating Search Methods in Feature Selection," Pattern Recognition Letters, vol. 15, no. 11, pp. 1119-1125, 1994.
[127] P. Pudil, J. Novovicova, and J. Kittler, "Feature Selection Based on the Approximation of Class Densities by Finite Mixtures of the Special Type," Pattern Recognition, vol. 28, no. 9, pp. 1389-1398, 1995.
[129] J.R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, Calif.: Morgan Kaufmann, 1993.
[131] S.J. Raudys and V. Pikelis, "On Dimensionality, Sample Size, Classification Error, and Complexity of Classification Algorithms in Pattern Recognition," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 2, pp. 243-251, 1980.
[132] S.J. Raudys and A.K. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 13, no. 3, pp. 252-264, 1991.
[142] R.E. Schapire, "The Strength of Weak Learnability," Machine Learning, vol. 5, pp. 197-227, 1990.
[145] B. Schölkopf, A. Smola, and K.R. Müller, "Nonlinear Component Analysis as a Kernel Eigenvalue Problem," Neural Computation, vol. 10, no. 5, pp. 1299-1319, 1998.
[162] V.N. Vapnik, Statistical Learning Theory. New York: John Wiley & Sons, 1998.
[164] S. Watanabe, Pattern Recognition: Human and Mechanical. New York: Wiley, 1985.
[168] D. Wolpert, "Stacked Generalization," Neural Networks, vol. 5, pp. 241-259, 1992.