


Vinay Kumar K 1

Bhavanishankar K2

1 Asst. Professor and HOD, Dept of CS, National Institute of Technology, Karnataka, India, E-mail:
2 II Year M.Tech, Dept of CS, National Institute of Technology, Karnataka,

The primary goal of pattern recognition is supervised or unsupervised classification. Among the various frameworks in which pattern recognition has been traditionally formulated, the statistical approach has been most intensively studied and used in practice. More recently, neural network techniques and methods imported from statistical learning theory have been receiving increasing attention. The design of a recognition system requires careful attention to the following issues: definition of pattern classes, sensing environment, pattern representation, feature extraction and selection, cluster analysis, classifier design and learning, selection of training and test samples, and performance evaluation. In spite of almost 50 years of research and development in this field, the general problem of recognizing complex patterns with arbitrary orientation, location, and scale remains unsolved. New and emerging applications, such as data mining, web searching, retrieval of multimedia data, face recognition, and cursive handwriting recognition, require robust and efficient pattern recognition techniques. The objective of this review paper is to summarize and compare some of the well-known methods used in various stages of a pattern recognition system and identify research topics and applications which are at the forefront of this exciting and challenging field.

Keywords: Statistical pattern recognition, classification, clustering, feature extraction, feature selection, dimensionality reduction, classifier combination, neural networks.

1. Introduction

By the time they are five years old, most children can recognize digits and letters. Small characters, large characters, handwritten, machine printed or rotated: all are easily recognized by the young. The characters may be written on a cluttered background, on crumpled paper, or may even be partially occluded. We take this ability for granted until we face the task of teaching a machine how to do the same. The best pattern recognizers in most instances are humans, yet we do not understand how humans recognize patterns. Ross [1] emphasizes that pattern recognition is critical in most human decision-making tasks; his central finding is that "the more relevant patterns at your disposal, the better your decisions will be". Successful computer programs that help banks score credit applicants, help doctors diagnose disease, and help pilots land airplanes depend in some way on pattern recognition. The aim here is to introduce pattern recognition as the best possible way of utilizing available sensors, processors, and domain knowledge to make decisions automatically.

1.1 Meaning of Pattern Recognition

Automatic (machine) recognition, description, classification, and grouping of patterns are important problems in a variety of engineering and scientific disciplines such as biology, psychology, medicine, marketing, computer vision, artificial intelligence, and remote sensing. Watanabe [2] defines a pattern as the opposite of chaos; it is an entity, vaguely defined, that could be given a name. For example, a pattern could be a fingerprint image, a handwritten cursive word, a human face, or a speech signal. Given a pattern, its recognition/classification may consist of one of the following two tasks [2]: 1) supervised classification (e.g., discriminant analysis), in which the input pattern is identified as a member of a predefined class; 2) unsupervised classification (e.g., clustering), in which the pattern is assigned to a hitherto unknown class.

Interest in the area of pattern recognition has been renewed recently due to emerging applications which are not only challenging but also computationally more demanding (refer to Table I).

| Problem Domain | Application | Input Pattern | Pattern Classes |
|---|---|---|---|
| Remote sensing | Forecasting crop yield | Multispectral image | Land use categories, growth pattern of crops |
| Data mining | Searching for meaningful patterns | Points in multidimensional space | Compact and well separated clusters |
| Speech recognition | Telephone directory enquiry without operator assistance | Speech waveform | Spoken words |
| Bioinformatics | Sequence analysis | DNA/Protein sequence | Known types of genes/patterns |
| Biometric recognition | Personal identification | Face, iris, fingerprint | Authorized users for access control |
| Industrial automation | Printed circuit board inspection | Intensity or range image | Defective/non-defective nature of product |

Table I Applications of pattern recognition

Figure 1.1 grossly oversimplifies the pattern classification procedure. The design of a pattern recognition system essentially involves the following three aspects: 1) data acquisition and preprocessing, 2) data representation, and 3) decision making. The four best-known approaches for pattern recognition are:

1. Template matching
2. Statistical classification
3. Syntactic or structural matching
4. Neural networks

Figure 1.1 Block diagram of pattern recognition system

1.2 Template Matching

One of the simplest and earliest approaches to pattern recognition is based on template matching. In template matching, a template (typically, a 2D shape) or a prototype of the pattern to be recognized is available. The pattern to be recognized is matched against the stored template while taking into account all allowable pose (translation and rotation) and scale changes. Often, the template itself is learned from the training set. Template matching is computationally demanding, but the availability of faster processors has now made this approach more feasible. Template matching has a number of disadvantages; for instance, it would fail if the patterns are distorted due to the imaging process, viewpoint change, or large intra-class variations among the patterns.
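As a rough illustration of the matching step, the sketch below slides a template over an image and scores each position with normalized cross-correlation. This is only one possible similarity measure; the function name `match_template`, the use of NumPy, and the restriction to translation only (no rotation or scale search) are assumptions made for the example, not details taken from this review.

```python
import numpy as np

def match_template(image, template):
    """Return the best (row, col) offset of `template` in `image` and its score.

    The score is the normalized cross-correlation, which lies in [-1, 1].
    Only translations are searched; rotation and scale changes are ignored.
    """
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    t = template - template.mean()          # zero-mean template
    t_norm = np.sqrt((t ** 2).sum())
    best_score, best_pos = -np.inf, (0, 0)
    for r in range(image.shape[0] - th + 1):
        for c in range(image.shape[1] - tw + 1):
            window = image[r:r + th, c:c + tw]
            w = window - window.mean()      # zero-mean image window
            denom = t_norm * np.sqrt((w ** 2).sum())
            if denom == 0:                  # flat window or flat template: score undefined
                continue
            score = (w * t).sum() / denom
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_pos, best_score

# Hypothetical data: cut a template out of a random image and find it again.
rng = np.random.default_rng(0)
image = rng.random((10, 10))
template = image[2:5, 3:7].copy()
position, score = match_template(image, template)
```

The exhaustive double loop makes the cost of the approach visible: every allowable pose adds another full scan, which is why the method is described as computationally demanding.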

1.3 Statistical Matching

In the statistical approach, each pattern is represented in terms of d features or measurements and is viewed as a point in a d-dimensional space. The goal is to choose those features that allow pattern vectors belonging to different categories to occupy compact and disjoint regions in a d-dimensional feature space. The effectiveness of the representation space (feature set) is determined by how well patterns from different classes can be separated. Given a set of training patterns from each class, the objective is to establish decision boundaries in the feature space which separate patterns belonging to different classes. In the statistical decision theoretic approach, the decision boundaries are determined by the probability distributions of the patterns belonging to each class, which must either be specified or learned.

1.4 Syntactic Approach

In many recognition problems involving complex patterns, it is more appropriate to adopt a hierarchical perspective where a pattern is viewed as being composed of simple subpatterns which are themselves built from yet simpler subpatterns. The simplest (elementary) subpatterns to be recognized are called primitives, and the given complex pattern is represented in terms of the interrelationships between these primitives. In syntactic pattern recognition, a formal analogy is drawn between the structure of patterns and the syntax of a language. The patterns are viewed as sentences belonging to a language, primitives are viewed as the alphabet of the language, and the sentences are generated according to a grammar. Thus, a large collection of complex patterns can be described by a small number of primitives and grammatical rules. The grammar for each pattern class must be inferred from the available training samples.

1.5 Neural Networks

Neural networks can be viewed as massively parallel computing systems consisting of an extremely large number of simple processors with many interconnections. Neural network models attempt to use some organizational principles in a network of weighted directed graphs in which the nodes are artificial neurons and the directed edges (with weights) are connections between neuron outputs and neuron inputs. The main characteristics of neural networks are that they have the ability to learn complex nonlinear input-output relationships, use sequential training procedures, and adapt themselves to the data. The most commonly used neural networks for pattern recognition are feed-forward networks, Radial Basis Function (RBF) networks, and the Self-Organizing Map (SOM).

| Approach | Representation | Recognition function | Typical criterion |
|---|---|---|---|
| Template matching | Samples, pixels, curves | Correlation, distance measure | Classification error |
| Statistical | Features | Discriminant function | Classification error |
| Syntactic or structural | Primitives | Rules, grammar | Acceptance error |
| Neural networks | Samples, pixels, features | Network function | Mean square error |

Table II Types of pattern recognition

2. Statistical Pattern Recognition

Statistical pattern recognition has been used successfully to design a number of commercial recognition systems. In statistical pattern recognition, a pattern is represented by a set of d features, or attributes, viewed as a d-dimensional feature vector. Well-known concepts from statistical decision theory are utilized to establish decision boundaries between pattern classes. The recognition system is operated in two modes: training (learning) and classification (testing) (Figure 2.1). The role of the preprocessing module is to segment the pattern of interest from the background, remove noise, normalize the pattern, and perform any other operation which will contribute to defining a compact representation of the pattern. In the training mode, the feature extraction/selection module finds the appropriate features for representing the input patterns and the classifier is trained to partition the feature space. The feedback path allows a designer to optimize the preprocessing and feature extraction/selection strategies. In the classification mode, the trained classifier assigns the input pattern to one of the pattern classes under consideration based on the measured features.

The decision making process in statistical pattern recognition can be summarized as follows. A given pattern is to be assigned to one of c categories $w_1, w_2, \ldots, w_c$ based on a vector of d feature values $x = (x_1, x_2, \ldots, x_d)$. The features are assumed to have a probability density or mass function conditioned on the pattern class. Thus, a pattern vector x belonging to class $w_i$ is viewed as an observation drawn randomly from the class-conditional probability function $p(x \mid w_i)$. A number of well-known decision rules, including the Bayes decision rule, the maximum likelihood rule (which can be viewed as a particular case of the Bayes rule), and the Neyman-Pearson rule, are available to define the decision boundary.

The optimal Bayes decision rule for minimizing the risk (expected value of the loss function) can be stated as follows. Assign input pattern x to the class $w_i$ for which the conditional risk

$$R(w_i \mid x) = \sum_{j=1}^{c} L(w_i, w_j)\, P(w_j \mid x) \tag{1}$$

is minimum, where $L(w_i, w_j)$ is the loss incurred in deciding $w_i$ when the true class is $w_j$ and $P(w_j \mid x)$ is the posterior probability. In the case of the 0/1 loss function, as defined in (2), the conditional risk becomes the conditional probability of misclassification.

$$L(w_i, w_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \tag{2}$$

For this choice of loss function, the Bayes decision rule can be simplified as follows (also called the maximum a posteriori (MAP) rule): assign input pattern x to class $w_i$ if

$$P(w_i \mid x) > P(w_j \mid x) \quad \text{for all } j \neq i \tag{3}$$
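The decision rules in equations (1) to (3) can be made concrete with a small numerical sketch. The likelihood values, priors, and the asymmetric loss matrix below are invented for illustration, and the helper names `posteriors` and `bayes_decision` are hypothetical; the code only assumes that the class-conditional values p(x | w_j) at the observed x are already known.

```python
import numpy as np

def posteriors(likelihoods, priors):
    """Bayes' theorem: P(w_j | x) is proportional to p(x | w_j) * P(w_j)."""
    joint = np.asarray(likelihoods, dtype=float) * np.asarray(priors, dtype=float)
    return joint / joint.sum()

def bayes_decision(posterior, loss):
    """Minimize the conditional risk of equation (1).

    loss[i, j] is L(w_i, w_j): the loss for deciding w_i when the true class is w_j.
    Returns the index of the minimum-risk class and the vector of risks.
    """
    risk = np.asarray(loss, dtype=float) @ np.asarray(posterior, dtype=float)
    return int(np.argmin(risk)), risk

# Hypothetical values of p(x | w_j) at one observed x, and hypothetical priors.
post = posteriors(likelihoods=[0.6, 0.3, 0.1], priors=[0.2, 0.5, 0.3])

# 0/1 loss of equation (2): the minimum-risk class is the MAP class of equation (3).
zero_one = 1.0 - np.eye(3)
map_class, map_risk = bayes_decision(post, zero_one)

# Assumed asymmetric loss: deciding class 1 when the truth is class 0 costs 10.
costly = np.array([[0.0, 1.0, 1.0],
                   [10.0, 0.0, 1.0],
                   [1.0, 1.0, 0.0]])
costly_class, costly_risk = bayes_decision(post, costly)
```

With the 0/1 loss the risk vector equals one minus the posterior, so minimizing risk and maximizing the posterior select the same class. With the asymmetric loss the decision moves away from the MAP class, which is the sense in which the rule accounts for different misclassification costs.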

Various strategies are utilized to design a classifier in statistical pattern recognition, depending on the kind of information available about the class-conditional densities.

2.1 Dimensionality Reduction

There are two main reasons to keep the dimensionality of the pattern representation (i.e., the number of features) as small as possible: measurement cost and classification accuracy. A limited yet salient feature set simplifies both the pattern representation and the classifiers that are built on the selected representation. Consequently, the resulting classifier will be faster and will use less memory. Moreover, a small number of features can alleviate the curse of dimensionality when the number of training samples is limited. On the other hand, a reduction in the number of features may lead to a loss in discrimination power and thereby lower the accuracy of the resulting recognition system. Watanabe's ugly duckling theorem [2] also supports the need for a careful choice of the features, since it is possible to make two arbitrary patterns similar by encoding them with a sufficiently large number of redundant features. It is important to make a distinction between feature selection and feature extraction. Some of the commonly used methods for feature selection and feature extraction are discussed below.

2.2 Feature Extraction

Feature extraction methods determine an appropriate subspace of dimensionality m (either in a linear or a nonlinear way) in the original feature space of dimensionality d (m ≤ d). Linear transforms, such as principal component analysis, factor analysis, linear discriminant analysis, and projection pursuit, have been widely used in pattern recognition for feature extraction and dimensionality reduction. The best known linear feature extractor is principal component analysis (PCA), or the Karhunen-Loève expansion, which computes the eigenvectors with the m largest eigenvalues of the d x d covariance matrix of the n d-dimensional patterns. The linear transformation is defined as Y = XH, where X is the given n x d pattern matrix, Y is the derived n x m pattern matrix, and H is the d x m matrix of the linear transformation whose columns are the eigenvectors. Since PCA uses the most expressive features (eigenvectors with the largest eigenvalues), it effectively approximates the data by a linear subspace using the mean squared error criterion.

Multidimensional scaling (MDS) is a nonlinear feature extraction technique. It aims to represent a multidimensional dataset in two or three dimensions such that the distance matrix in the original d-dimensional feature space is preserved as faithfully as possible in the projected space. Various stress functions are used for measuring the performance of this mapping [3]; the most popular criterion is the stress function introduced by Sammon [4] and Niemann [5]. A problem with MDS is that it does not give an explicit mapping function, so it is not possible to place a new pattern in a map which has been computed for a given training set without repeating the mapping. Several techniques have been investigated to address this deficiency, ranging from linear interpolation to training a neural network.

A feed-forward neural network offers an integrated procedure for feature extraction and classification; the output of each hidden layer may be interpreted as a set of new, often nonlinear, features presented to the output layer for classification. In this sense, multilayer networks serve as feature extractors.

The Self-Organizing Map (SOM), or Kohonen Map [6], can also be used for nonlinear feature extraction. In SOM, neurons are arranged in an m-dimensional grid, where m is usually 1, 2, or 3. Each neuron is connected to all the d input units. The weights on the connections for each neuron form a d-dimensional weight vector. During training, patterns are presented to the network in a random order. At each presentation, the winner, whose weight vector is the closest to the input vector, is first identified. Then, all the neurons in the neighborhood (defined on the grid) of the winner are updated such that their weight vectors move towards the input vector. Consequently, after training is done, the weight vectors of neighboring neurons in the grid are likely to represent input patterns which are close in the original feature space. Thus, a topology-preserving map is formed. When the grid is plotted in the original space, the grid connections are more or less stressed according to the density of the training data. Thus, SOM offers an m-dimensional map with a spatial connectivity, which can be interpreted as feature extraction.

2.3 Feature Selection

The problem of feature selection is defined as follows: given a set of d features, select a subset of size m that leads to the smallest classification error. There has been a resurgence of interest in applying feature selection methods due to the large number of features encountered in the following situations: 1) multi-sensor fusion: features computed from different sensor modalities are concatenated to form a feature vector with a large number of components; 2) integration of multiple data models: sensor data can be modeled using different approaches, where the model parameters serve as features, and the parameters from different models can be pooled to yield a high-dimensional feature vector.

Let Y be the given set of features, with cardinality d, and let m represent the desired number of features in the selected subset X, where X ⊆ Y. Let the feature selection criterion function for the set X be represented by J(X). Let us assume that a higher value of J indicates a better feature subset; a natural choice for the criterion function is J = (1 - Pe), where Pe denotes the classification error. The use of Pe in the criterion function makes feature selection procedures dependent on the specific classifier that is used and on the sizes of the training and test sets. The most straightforward approach to the feature selection problem would require 1) examining all C(d, m) possible subsets of size m and 2) selecting the subset with the largest value of J(.). However, the number of possible subsets grows combinatorially, making this exhaustive search impractical for even moderate values of m and d. Cover and Van Campenhout [7] showed that no nonexhaustive sequential feature selection procedure can be guaranteed to produce the optimal subset.

The only optimal feature selection method which avoids the exhaustive search is based on the branch and bound algorithm. This procedure avoids an exhaustive search by using intermediate results for obtaining bounds on the final criterion value. The key to this algorithm is the monotonicity property of the criterion function J(.): given two feature subsets X1 and X2, if X1 is a subset of X2, then J(X1) ≤ J(X2). In other words, the performance of a feature subset should not degrade whenever a feature is added to it. Most commonly used criterion functions do not satisfy this monotonicity property.

Table III lists most of the well-known feature selection methods which have been proposed in the literature [8]. Only the first two methods in this table guarantee an optimal subset. All other strategies are suboptimal due to the fact that the best pair of features need not contain the best single feature. In general, good larger feature sets do not necessarily include the good small sets. As a result, the simple method of selecting just the best individual features may fail dramatically. It might still be useful, however, as a first step to select some individually good features when reducing very large feature sets (e.g., hundreds of features). Further selection has to be done by more advanced methods that take feature dependencies into account. These operate either by evaluating growing feature sets (forward selection) or by evaluating shrinking feature sets (backward selection). A simple sequential method like SFS (SBS) adds (deletes) one feature at a time. More sophisticated techniques are the Plus l - take away r strategy and the Sequential Floating Search methods, SFFS and SBFS [9]. These methods backtrack as long as they find improvements compared to previous feature sets of the same size. In almost any large feature selection problem, these methods perform better than the straight sequential searches.
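The difference between exhaustive search and sequential forward selection (SFS) can be sketched in a few lines. The criterion values in `TOY_SCORES` are invented for illustration, chosen so that the best pair of features does not contain the best single feature; the function names and the dictionary-based criterion are assumptions of this example, and in practice J would be something like 1 - Pe estimated for a particular classifier.

```python
from itertools import combinations

def sfs(features, m, J):
    """Sequential forward selection: add one feature at a time, never discard one."""
    selected, remaining = [], list(features)
    while len(selected) < m and remaining:
        best = max(remaining, key=lambda f: J(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

def exhaustive(features, m, J):
    """Evaluate every subset of size m and return the one with the largest J."""
    return list(max(combinations(features, m), key=lambda s: J(list(s))))

# Hypothetical criterion values for every non-empty subset of three features.
TOY_SCORES = {
    frozenset("a"): 0.70, frozenset("b"): 0.60, frozenset("c"): 0.60,
    frozenset("ab"): 0.75, frozenset("ac"): 0.75, frozenset("bc"): 0.90,
    frozenset("abc"): 0.90,
}

def toy_J(subset):
    """Look up the invented criterion value for a list of feature names."""
    return TOY_SCORES[frozenset(subset)]

greedy_pair = sfs(["a", "b", "c"], 2, toy_J)          # starts from "a", the best single feature
optimal_pair = exhaustive(["a", "b", "c"], 2, toy_J)  # checks all three pairs
```

In this toy case SFS commits to feature "a" and can only reach a pair scoring 0.75, while exhaustive search finds the pair scoring 0.90. This is the nesting problem that the floating search methods try to reduce by allowing backtracking.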
| Method | Property | Comments |
|---|---|---|
| Exhaustive Search | Evaluate all C(d, m) possible subsets | Guaranteed to find the optimal subset; not feasible for even moderately large values of m and d |
| Branch and Bound Search | Uses the well-known branch and bound search method; only a fraction of all possible feature subsets need to be enumerated to find the optimal subset | Guaranteed to find the optimal subset provided the criterion function satisfies the monotonicity property; the worst-case complexity of this algorithm is exponential |
| Best Individual Features | Evaluate all the d features individually; select the best m individual features | Computationally simple; not likely to lead to an optimal subset |
| Sequential Forward Selection (SFS) | Select the best single feature and then add one feature at a time which, in combination with the selected features, maximizes the criterion function | Once a feature is retained, it cannot be discarded; computationally attractive since to select a subset of size 2 it examines only (d-1) possible subsets |
| Sequential Backward Selection (SBS) | Start with all the d features and successively delete one feature at a time | Once a feature is deleted, it cannot be brought back into the optimal subset; demands more computation than SFS |
| Plus l - take away r Selection | First enlarge the feature subset by l features using forward selection and then delete r features using backward selection | Avoids the problem of feature subset nesting encountered in SFS and SBS methods; needs to select values of l and r (l > r) |

Table III Feature selection methods

3. Classifiers

Once a feature selection or classification procedure finds a proper representation, a classifier can be designed using a number of possible approaches. Basically, three different approaches can be identified.

The simplest and the most intuitive approach to classifier design is based on the concept of similarity: patterns that are similar should be assigned to the same class. So, once a good metric has been established to define similarity, patterns can be classified by template matching or the minimum distance classifier using a few prototypes per class. The choice of the metric and the prototypes is crucial to the success of this approach. In the nearest mean classifier, selecting prototypes is very simple and robust; each pattern class is represented by a single prototype which is the mean vector of all the training patterns in that class. More advanced techniques for computing prototypes are vector quantization, learning vector quantization, and the data reduction methods associated with the one-nearest neighbor decision rule (1-NN) [12].

The second main concept used for designing pattern classifiers is based on the probabilistic approach. The optimal Bayes decision rule (with the 0/1 loss function) assigns a pattern to the class with the maximum posterior probability. This rule can be modified to take into account costs associated with different types of misclassifications. For known class-conditional densities, the Bayes decision rule gives the optimum classifier, in the sense that, for given prior probabilities, loss function, and class-conditional densities, no other decision rule will have a lower risk (i.e., expected value of the loss function, for example, probability of error). If the prior class probabilities are equal and a 0/1 loss function is adopted, the Bayes decision rule and the maximum likelihood decision rule exactly coincide.

The third category of classifiers constructs decision boundaries directly by optimizing a certain error criterion. While this approach depends on the chosen metric, sometimes classifiers of this type may approximate the Bayes classifier asymptotically. The driving force of the training procedure is, however, the minimization of a criterion such as the apparent classification error or the mean squared error (MSE) between the classifier output and some preset target value. A classical example of this type of classifier is Fisher's linear discriminant, which minimizes the MSE between the classifier output and the desired labels. Another example is the single-layer perceptron, where the separating hyperplane is iteratively updated as a function of the distances of the misclassified patterns from the hyperplane.

| Method | Property | Comments |
|---|---|---|
| Resubstitution Method | All the available data is used for training as well as testing; training and test sets are the same | Optimistically biased estimate, especially when the ratio of sample size to dimensionality is small |
| Holdout Method | Half the data is used for training and the remaining data is used for testing; training and test sets are independent | Pessimistically biased estimate; different partitionings will give different estimates |
| Leave-one-out Method | A classifier is designed using (n-1) samples and evaluated on the one remaining sample; this is repeated n times with different training sets of size (n-1) | Estimate is unbiased but it has a large variance; computationally demanding because n different classifiers have to be designed |

Table IV Error estimation methods

3.1 Classifier Combination

There are several reasons for combining multiple classifiers to solve a given classification problem. Some of them are listed below:

1. A designer may have access to a number of different classifiers, each developed in a different context and for an entirely different representation/description of the same problem. An example is the identification of persons by their voice, face, as well as handwriting.
2. Sometimes more than a single training set is available, each collected at a different time or in a different environment. These training sets may even use different features.
3. Different classifiers trained on the same data may not only differ in their global performances, but they also may show strong local differences. Each classifier may have its own region in the feature space where it performs the best.
4. Some classifiers such as neural networks show different results with different initializations due to the randomness inherent in the training procedure. Instead of selecting the best network and discarding the others, one can combine various networks, thereby taking advantage of all the attempts to learn from the data.

In the literature [10], a number of combination schemes have been proposed. A typical combination scheme consists of a set of individual classifiers and a combiner which combines the results of the individual classifiers to make the final decision. When the individual classifiers should be invoked or how they should interact with each other is determined by the architecture of the combination scheme. Various schemes for combining multiple classifiers can be grouped into three main categories according to their architecture: 1) parallel, 2) cascading (or serial combination), and 3) hierarchical (tree-like). In the parallel architecture, all the individual classifiers are invoked independently, and their results are then combined by a combiner. In the gated parallel variant, the outputs of individual classifiers are selected or weighted by a gating device before they are combined. In the cascading architecture, individual classifiers are invoked in a linear sequence. The number of possible classes for a given pattern is gradually reduced as more classifiers in the sequence have been invoked. In the hierarchical architecture, individual classifiers are combined into a structure which is similar to that of a decision tree classifier.
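A minimal sketch of the parallel architecture, using simple majority voting as the combiner, is given below. The three threshold rules stand in for trained classifiers and are purely hypothetical, as is the function name `majority_vote`; any callable that maps a pattern to a class label could take their place.

```python
from collections import Counter

def majority_vote(classifiers, x):
    """Parallel combination: invoke every classifier on x independently,
    then let the combiner return the most frequent label and the raw votes."""
    votes = [clf(x) for clf in classifiers]
    label, _count = Counter(votes).most_common(1)[0]
    return label, votes

# Hypothetical individual classifiers operating on a two-feature pattern.
def clf_a(x):
    return "A" if x[0] > 0.5 else "B"

def clf_b(x):
    return "A" if x[1] > 0.5 else "B"

def clf_c(x):
    return "A" if x[0] + x[1] > 1.0 else "B"

ensemble = [clf_a, clf_b, clf_c]
decision, votes = majority_vote(ensemble, (0.9, 0.2))
```

Here two of the three classifiers vote for class "A", so the combined decision is "A" even though one classifier disagrees. A cascading scheme would instead call the classifiers one after another and narrow the set of candidate classes at each stage.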

3.2 Combiner

After individual classifiers have been selected, they need to be combined together by a module called the combiner. Various combiners can be distinguished from each other by their trainability, adaptivity, and requirements on the output of individual classifiers. Combiners such as voting, averaging (or sum), and Borda count are static, with no training required, while others are trainable. The trainable combiners may lead to a greater improvement than static combiners, at the cost of additional training as well as the requirement of additional training data. Some combination schemes are adaptive in the sense that the combiner evaluates (or weighs) the decisions of individual classifiers depending on the input pattern. In contrast, non-adaptive combiners treat all the input patterns the same. Adaptive combination schemes can further exploit the detailed error characteristics and expertise of individual classifiers. Examples of adaptive combiners include adaptive weighting, associative switch, mixture of local experts (MLE), and hierarchical MLE.

4. Error Estimation

The classification error, or simply the error rate, Pe, is the ultimate measure of the performance of a classifier. Competing classifiers can also be evaluated based on their error probabilities. Other performance measures include the cost of measuring features and the computational requirements of the decision rule. While it is easy to define the probability of error in terms of the class-conditional densities, it is very difficult to obtain a closed-form expression for Pe. In practice, the error rate of a recognition system must be estimated from all the available samples, which are split into training and test sets [11]. The classifier is first designed using training samples, and then it is evaluated based on its classification performance on the test samples. The percentage of misclassified test samples is taken as an estimate of the error rate. In order for this error estimate to be reliable in predicting future classification performance, not only should the training set and the test set be sufficiently large, but the training samples and the test samples must be independent. This requirement of independent training and test samples is still often overlooked in practice. The main issue here is how to split the available samples into training and test sets. The methods that are commonly used to estimate the error are tabulated in Table IV. These methods differ in how they utilize the available samples as training and test sets.

5. Conclusion

Pattern recognition is a fast-moving and proliferating discipline. It is not easy to form a well-balanced and well-informed summary view of the newest developments in this field. It is still harder to have a vision of its future progress. This review throws light on the rudimentary as well as a few advanced concepts of statistical pattern recognition. Summaries of the various approaches are tabulated for quick reference. This review can further be extended to include various algorithms used to implement different stages of pattern recognition, i.e., data acquisition, preprocessing, feature selection, feature extraction, classification, etc.

References

[9] P. Pudil, J. Novovicova, and J. Kittler, Floating Search Methods in Feature Selection, Pattern Recognition Letters, vol. 15, no. 11, pp. 1119-1125, 1994.
[10] L. Xu, A. Krzyzak, and C.Y. Suen, Methods for Combining Multiple Classifiers and Their Applications in Handwritten Character Recognition, IEEE Trans. Systems, Man, and Cybernetics, vol. 22, pp. 418-435, 1992.
[11] D.J. Hand, Recent Advances in Error Rate Estimation, Pattern Recognition Letters, vol. 4, no. 5, pp. 335-346, 1986.
[12] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. London: Prentice Hall, 1982.

Author 1: Name: Bhavanishankar K. Educational Background: B.E, M.Tech (II Year), Comp Engg. Area of Expertise and Interest: Pattern Recognition. Experience details: Teaching experience of more than 3 years.

Author 2: Name: Vinay Kumar K. Educational Background: B.E, M.Tech (IITB). Area of Expertise and Interest: Artificial Intelligence, Computer Architecture. Experience details: Teaching experience of more than 20 years at NITK Surathkal.