Unit-5
Pattern Recognition: Introduction, Design principles of pattern recognition system, Statistical
pattern recognition, Parameter estimation methods - Principal Component Analysis (PCA) and
Linear Discriminant Analysis (LDA), Classification techniques - Nearest Neighbor (NN) rule,
Bayes classifier, Support Vector Machine (SVM), K-means clustering.
Figure 1.1: Characteristic (feature); a. the distinction between good and poor features, and b.
feature properties.
The goal of a classifier is to partition feature space into class-labeled decision regions. Borders
between decision regions are called decision boundaries (Figure 1.2).
The structural approach, by contrast, discriminates among different groups based upon the morphological
interrelationships (or interconnections) present within the data.
The semantics associated with each feature are determined by the coding scheme (i.e., the selection
of morphologies) used to identify primitives in the data. Feature vectors generated by structural
pattern recognition systems contain a variable number of features (one for each primitive
extracted from the data) in order to accommodate the presence of superfluous structures which
have no impact on classification.
Figure 1.11: The statistical and structural approaches to pattern recognition applied to a common
identification problem.
The essential differences are two-fold:
(1) The description generated by the statistical approach is quantitative, while the structural
approach produces a description composed of sub patterns or building blocks.
(2) The statistical approach discriminates based upon numeric differences among features from
different groups, while grammars are used by the structural approach to define a language
encompassing the acceptable configurations of primitives for each group.
The eigenvalues λ of an n * n matrix A are the roots of the equation

det(A - λI) = 0,

where I is the n * n identity matrix. This equation is called the characteristic equation (or
characteristic polynomial) and has n roots.

Let λ be an eigenvalue of A. Then there exists a vector x such that:

Ax = λx.

The vector x is called an eigenvector of A associated with the eigenvalue λ. Suppose we have a
3 * 3 matrix A with eigenvectors x1, x2, x3 and eigenvalues λ1, λ2, λ3, so that:

Ax1 = λ1x1,  Ax2 = λ2x2,  Ax3 = λ3x3.
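As a quick illustration of the relation Ax = λx, the following sketch (assuming NumPy is available; the matrix entries are arbitrary and chosen only for this example) computes the eigenvalues and eigenvectors of a 3 * 3 matrix and checks the defining relation numerically.

```python
import numpy as np

# An arbitrary symmetric 3 x 3 matrix (values chosen only for illustration).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])

# np.linalg.eig solves the characteristic equation det(A - lambda*I) = 0.
eigenvalues, eigenvectors = np.linalg.eig(A)

for i in range(3):
    lam = eigenvalues[i]
    x = eigenvectors[:, i]            # eigenvector associated with lam
    # Verify A x = lambda x (up to floating-point error).
    assert np.allclose(A @ x, lam * x)
    print(f"lambda_{i+1} = {lam:.4f}, eigenvector = {x}")
```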
Dimension Reduction
LDA is also closely related to principal component analysis (PCA) and factor analysis in
that they both look for linear combinations of variables which best explain the data.
LDA explicitly attempts to model the difference between the classes of data. PCA on the
other hand does not take into account any difference in class, and factor analysis builds
the feature combinations based on differences rather than similarities.
Discriminant analysis is also different from factor analysis in that it is not an
interdependence technique: a distinction between independent variables and dependent
variables (also called criterion variables) must be made.
LDA works when the measurements made on independent variables for each observation
are continuous quantities. When dealing with categorical independent variables, the
equivalent technique is discriminant correspondence analysis.
Example
For example, let us suppose we have a simple two variable problem in which there are two
classes of objects.
The approach of LDA is to project all the data points into a new space, normally of
lower dimension, which maximises the between-class separability while minimising the within-class variability.
The figure above illustrates the difference between PCA and LDA. In PCA we take the data as a
whole and do not consider any division into classes. The axes are optimal for representing the
data, as they indicate where the maximum variation actually lies. The LDA axis is optimal for
distinguishing between the different classes. In general, the number of axes that can be computed
by the LDA method is one less than the number of classes in the problem.
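To make the PCA/LDA contrast concrete, here is a minimal sketch assuming scikit-learn and NumPy; the two-class synthetic data is invented for the example and is not part of the notes. PCA ignores the labels and keeps the direction of maximum variance, while LDA, given two classes, produces the single (2 - 1 = 1) discriminant axis.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Two synthetic classes in 2-D (illustrative data, not from the notes).
class0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(50, 2))
class1 = rng.normal(loc=[2.0, 1.0], scale=0.5, size=(50, 2))
X = np.vstack([class0, class1])
y = np.array([0] * 50 + [1] * 50)

# PCA ignores the class labels: its axis is the direction of maximum variance.
pca_projection = PCA(n_components=1).fit_transform(X)

# LDA uses the labels: with 2 classes it yields at most 2 - 1 = 1 axis.
lda_projection = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

print("PCA projection shape:", pca_projection.shape)   # (100, 1)
print("LDA projection shape:", lda_projection.shape)   # (100, 1)
```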
Classification Techniques
In machine learning and statistics, classification is the problem of identifying to which of a set of
categories (sub-populations) a new observation belongs, on the basis of a training set of data
containing observations (or instances) whose category membership is known.
An example would be assigning a given email into "spam" or "non-spam" classes or
assigning a diagnosis to a given patient as described by observed characteristics of the
patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
Other examples are sequence labeling, which assigns a class to each member of a sequence of values (for example, part-of-speech tagging, which assigns
a part of speech to each word in an input sentence); parsing, which assigns a parse tree to an
input sentence, describing the syntactic structure of the sentence; etc.
Suppose we have pairs (X1, Y1), (X2, Y2), ..., (Xn, Yn) taking values in R^d * {1, 2}, where Y is
the class label of X, so that X | Y = r ~ Pr for r = 1, 2 (and probability distributions Pr).
Given some norm || . || on R^d and a point x in R^d, let (X(1), Y(1)), ..., (X(n), Y(n)) be a
reordering of the training data such that ||X(1) - x|| <= ||X(2) - x|| <= ... <= ||X(n) - x||.
The 1-nearest neighbor classifier
The most intuitive nearest neighbor type classifier is the one nearest neighbor classifier, which
assigns a point x to the class Y(1) of its closest neighbor in the feature space.
As the size of the training data set approaches infinity, the one nearest neighbor classifier guarantees
an error rate of no worse than twice the Bayes error rate.
The k-nearest neighbor classifier
The k-nearest neighbor classifier assigns a point x to a particular class based on a majority vote
among the classes of the k nearest training points to x.
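The following is a minimal NumPy sketch of the k-nearest neighbor rule just described (written for illustration, not taken from the notes): it sorts the training points by Euclidean distance to a query point x and takes a majority vote among the k closest labels.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x, k=3):
    """Assign x to the majority class among its k nearest training points."""
    # Euclidean distances from x to every training point.
    distances = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest training points (a reordering of the training data).
    nearest = np.argsort(distances)[:k]
    # Majority vote among their class labels.
    return Counter(y_train[nearest]).most_common(1)[0][0]

# Tiny illustrative example (data invented for the sketch).
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y_train = np.array([0, 0, 1, 1])
print(knn_classify(X_train, y_train, np.array([0.9, 1.0]), k=3))  # -> 1
```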
Properties
There are many results on the error rate of the k-nearest neighbor classifiers. The k-nearest
neighbor classifier is strongly (that is, for any joint distribution on (X, Y)) consistent provided
k = kn diverges and kn / n converges to zero as n tends to infinity.
Let Cn(knn) denote the k-nearest neighbor classifier based on a training set of size n. Under certain
regularity conditions, the excess risk yields the following asymptotic expansion

R(Cn(knn)) - R(C(Bayes)) = ( B1 * (1/k) + B2 * (k/n)^(4/d) ) * (1 + o(1)),

for constants B1 and B2. The choice k* ~ B * n^(4/(d+4)) offers a tradeoff between the two terms in
the above display, for which the k*-nearest neighbor error converges to the Bayes error at the
optimal (minimax) rate O(n^(-4/(d+4))).
Probabilistic model
Abstractly, the probability model for a classifier is a conditional model

p(C | F1, ..., Fn)

over a dependent class variable C with a small number of outcomes or classes, conditional on
several feature variables F1 through Fn. The problem is that if the number of features n is large,
or if a feature can take on a large number of values, then basing such a model on probability
tables is infeasible. We therefore reformulate the model to make it more tractable.
Using Bayes' theorem, this can be written

p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn).

In plain English, using Bayesian probability terminology, the above equation can be written as

posterior = (prior * likelihood) / evidence.
In practice, there is interest only in the numerator of that fraction, because the denominator does
not depend on C and the values of the features Fi are given, so that the denominator is
effectively constant. The numerator is equivalent to the joint probability model

p(C, F1, ..., Fn),

which can be rewritten as follows, using the chain rule for repeated applications of the definition
of conditional probability:

p(C, F1, ..., Fn) = p(C) p(F1 | C) p(F2 | C, F1) p(F3 | C, F1, F2) ... p(Fn | C, F1, ..., Fn-1).
Now the "naive" conditional independence assumptions come into play: assume that each feature Fi
is conditionally independent of every other feature Fj for j != i, given the category C.
This means that

p(Fi | C, Fj) = p(Fi | C),
p(Fi | C, Fj, Fk) = p(Fi | C),
p(Fi | C, Fj, Fk, Fl) = p(Fi | C),

and so on, for i != j, k, l. This means that under the above independence assumptions, the
conditional distribution over the class variable C is:

p(C | F1, ..., Fn) = (1 / Z) p(C) p(F1 | C) p(F2 | C) ... p(Fn | C),

where the evidence Z = p(F1, ..., Fn) is a scaling factor that depends only on F1, ..., Fn, that is,
a constant if the values of the feature variables are known.
With a multinomial event model, a feature vector x = (x1, ..., xn) is a histogram in which xi
counts how many times event i (for example, a vocabulary term) was observed; the class-conditional
likelihood p(x | C) can be computed from the per-class event probabilities p1, ..., pn. That is,

p(x | C) ∝ p1^x1 * p2^x2 * ... * pn^xn.
The multinomial naive Bayes classifier becomes a linear classifier when expressed in log-space:

log p(C | x) ∝ log p(C) + x1 log p1 + ... + xn log pn = b + w · x,

where b = log p(C) and wi = log pi.
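As a concrete illustration of this log-space form, the sketch below (a small NumPy example with invented class priors, event probabilities, and term counts) scores a term-count vector x against each class as b + w · x and picks the class with the highest score.

```python
import numpy as np

# Invented per-class parameters for a 2-class, 4-term vocabulary.
log_prior = np.log(np.array([0.6, 0.4]))      # log p(C) for classes 0 and 1
p = np.array([[0.5, 0.2, 0.2, 0.1],            # event probabilities p_i for class 0
              [0.1, 0.1, 0.3, 0.5]])           # event probabilities p_i for class 1
log_p = np.log(p)

# A document represented as term counts (the histogram x).
x = np.array([3, 0, 1, 2])

# Linear classifier in log-space: b + w . x for each class.
scores = log_prior + log_p @ x
print("class scores:", scores)
print("predicted class:", int(np.argmax(scores)))
```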
If xi is a Boolean indicating the occurrence or absence of the i-th term from the vocabulary, then
under the Bernoulli event model the likelihood of a document given a class C is

p(x | C) = p(w1 | C)^x1 (1 - p(w1 | C))^(1 - x1) * ... * p(wn | C)^xn (1 - p(wn | C))^(1 - xn),

where p(wi | C) is the probability of class C generating the term wi. This event model is
especially popular for classifying short texts. It has the benefit of explicitly modeling the absence
of terms. Note that a naive Bayes classifier with a Bernoulli event model is not the same as a
multinomial NB classifier with frequency counts truncated to one.
Applications
SVMs can be used to solve various real-world problems (a minimal usage sketch follows this list):
SVMs are helpful in text and hypertext categorization as their application can
significantly reduce the need for labeled training instances in both the standard inductive
and transductive settings.
Classification of images can also be performed using SVMs. Experimental results show
that SVMs achieve significantly higher search accuracy than traditional query refinement
schemes after just three to four rounds of relevance feedback.
SVMs are also useful in medical science to classify proteins with up to 90% of the
compounds classified correctly.
Hand-written characters can be recognized using SVM.
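The following minimal sketch (assuming scikit-learn; the toy feature vectors and labels are invented and merely stand in for text or image features) shows the basic train/predict workflow with an SVM classifier.

```python
import numpy as np
from sklearn.svm import SVC

# Toy 2-D feature vectors and binary labels (invented for illustration).
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

# A linear-kernel SVM; the kernel and C would be tuned for a real task.
clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))  # expected: [0 1]
```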
Additionally, k-means and the expectation-maximization algorithm both use cluster centers to model
the data; however, k-means clustering tends to find clusters of comparable spatial extent, while the
expectation-maximization mechanism allows clusters to have different shapes.
Description
Given a set of observations (x1, x2, ..., xn), where each observation is a d-dimensional
real vector, k-means clustering aims to partition the n observations into k (<= n) sets
S = {S1, S2, ..., Sk} so as to minimize the within-cluster sum of squares (WCSS). In other
words, its objective is to find:

arg min over S of the sum over i = 1, ..., k of the sum over x in Si of ||x - mi||^2,

where mi is the mean of the points in Si.
Standard algorithm
The most common algorithm uses an iterative refinement technique. Due to its ubiquity it is
often called the k-means algorithm; it is also referred to as Lloyd's algorithm, particularly in
the computer science community.
Given an initial set of k means m1(1), ..., mk(1), the algorithm proceeds by alternating between two
steps:
Assignment step: Assign each observation to the cluster whose mean yields the least
within-cluster sum of squares (WCSS). Since the sum of squares is the squared Euclidean
distance, this is intuitively the "nearest" mean:

Si(t) = { xp : ||xp - mi(t)||^2 <= ||xp - mj(t)||^2 for all j, 1 <= j <= k },

where each xp is assigned to exactly one Si(t), even if it could be assigned to two or
more of them.
Update step: Calculate the new means to be the centroids of the observations in the new
clusters:

mi(t+1) = (1 / |Si(t)|) * sum over xj in Si(t) of xj.

Since the arithmetic mean is a least-squares estimator, this also minimizes the within-cluster sum of squares (WCSS) objective.
The algorithm has converged when the assignments no longer change. Since both steps
optimize the WCSS objective, and there are only a finite number of such partitionings,
the algorithm must converge to a (local) optimum. There is no guarantee that the global
optimum is found using this algorithm.
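A minimal NumPy sketch of the standard (Lloyd's) algorithm described above (written for illustration with invented data; a real application would normally use a library implementation) alternates the assignment and update steps until the means stop changing.

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and update steps until convergence."""
    rng = np.random.default_rng(seed)
    # Initialise the k means with k randomly chosen observations.
    means = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each point goes to the cluster with the nearest mean.
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each mean becomes the centroid of its cluster
        # (keep the old mean if a cluster happens to be empty).
        new_means = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):
            break  # means (and hence assignments) no longer change -> converged
        means = new_means
    return means, labels

# Invented 2-D data with two obvious groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
means, labels = k_means(X, k=2)
print(means)
print(labels)
```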