
An Experimental Study of Learning Curves

for Statistical Pattern Classifiers


Tsutomu MATSUNAGA and Hiromi KIDA
Research and Development Headquarters
NTT DATA COMMUNICATIONS SYSTEMS CORPORATION
66-2 Horikawa-cho, Saiwai-ku, Kawasaki-shi, Kanagawa 210, Japan

Abstract

Statistical pattern classifiers are designed by population parameters of pattern distributions estimated by a set of training samples. Therefore, classification performance depends considerably on training sample size. Learning curves exhibit asymptotic behaviors where the probability of misclassification decreases as the number of training samples increases. This paper presents the asymptotic behaviors of the effects of training sample size and shows that learning curves for practical purposes can be obtained using available samples.
1 Introduction

Classifier design is an actual problem in the development of pattern recognition systems. Statistical pattern classifiers are widely employed in the recognition of characters and speech. The statistical classifiers are designed by population parameters of pattern distributions such as mean vectors and covariance matrices [1]. The population parameters are estimated by a set of training samples. Therefore, classification performance depends considerably on training sample size. However, only a finite, frequently small, number of training samples are usually available. Since collecting many samples requires time and cost, it is useful to know the number of training samples necessary for a desired level of classification performance for practical implementation.

Research efforts on the effects of sample size have been conducted [2, 3], but they are of little use for practical purposes since the pattern distributions considered are limited. It has been simply considered that the larger the training sample size, the better the classification performance. In a conventional approach, every time the number of training samples increases, the performance of the classifier must be re-evaluated.

Learning curves [4] exhibit asymptotic behaviors where the probability of misclassification decreases as the number of training samples increases. In this paper, learning curves of statistical classifiers are evaluated through Monte-Carlo simulations. Experimental results of two-class problems having multi-variate normal distributions are discussed while pattern distributions are controlled with population parameters. Next, an experiment on a character recognition problem whose population parameters of underlying pattern distributions are unknown is examined. The aim of this paper is to gain insight into the effects of training sample size and to show that learning curves can be obtained experimentally using available real-world samples.

2 Pattern classification

2.1 Classification problem

When there exist two classes C1 and C2, the random variable X has a continuous density function f1(X) in the first population and f2(X) in the second, and P1 and P2 are the a priori probabilities, the probability of misclassification E can be written as follows:

E = P1 ∫_{Ω2} f1(X) dX + P2 ∫_{Ω1} f2(X) dX = P1 E1 + P2 E2    (1)

where Ωi is the region in which an observation is assigned to class Ci and Ei is the probability that an observation X from Ci is misclassified. Classification problems having equal a priori probabilities are discussed in this paper.
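For illustration, equation (1) can be evaluated numerically when the densities are known. The following minimal NumPy sketch is not part of the paper; the one-dimensional Gaussian densities and the nearest-mean decision rule are assumptions chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example densities: two univariate Gaussians with unit variance.
mu1, mu2, sigma = 0.0, 2.56, 1.0
P1 = P2 = 0.5                         # equal a priori probabilities

n = 200_000
x1 = rng.normal(mu1, sigma, n)        # observations drawn from C1
x2 = rng.normal(mu2, sigma, n)        # observations drawn from C2

# Nearest-mean decision rule; an observation is misclassified when it
# falls on the wrong side of the decision boundary.
E1 = np.mean(np.abs(x1 - mu1) >= np.abs(x1 - mu2))   # P(error | C1)
E2 = np.mean(np.abs(x2 - mu2) >= np.abs(x2 - mu1))   # P(error | C2)

E = P1 * E1 + P2 * E2                 # equation (1)
print(f"estimated E = {E:.4f}")       # about 0.10 for this mean separation
```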
2.2 Statistical pattern classifier

In this paper, three commonly used statistical classifiers are discussed: the quadratic discriminant function, Fisher's linear discriminant function, and the Euclidean distance classifier.

A classifier gives a decision rule for a set of classes to classify an observation X into one class Ci. The types of the classifiers are defined as:

(a) Quadratic Discriminant Function (QDF):

g_k^Q(X) = (X − μ_k)^T Σ_k^{-1} (X − μ_k) + ln|Σ_k|    (2)

Assign X to class Ci if g_i^Q(X) = min_k {g_k^Q(X)}.

(b) Fisher's Linear Discriminant Function (FLDF):

g_k^F(X) = (X − μ_k)^T Σ_W^{-1} (X − μ_k)    (3)

Assign X to class Ci if g_i^F(X) = min_k {g_k^F(X)}.

(c) Euclidean Distance Classifier (ED):

g_k^E(X) = (X − μ_k)^T (X − μ_k)    (4)

Assign X to class Ci if g_i^E(X) = min_k {g_k^E(X)}.

where X^T indicates the transpose of the vector X and Σ^{-1} is the inverse of the matrix Σ. μ_k and Σ_k are the kth population mean vector and the kth population covariance matrix respectively, and Σ_W is the within-class scatter matrix. Since all the population parameters are not known in most practical situations, the sample estimates, which are obtained from a set of training samples, substitute for the true parameters.
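A compact sketch of how the three decision rules might be realized with plug-in sample estimates is given below. It is illustrative only: the helper names, the dictionary-based interface, and the pooled covariance used in place of Σ_W are assumptions, not the authors' implementation.

```python
import numpy as np

def fit_estimates(train):
    """train: dict mapping class label -> (N, p) array, N >= 2 assumed."""
    mu = {k: X.mean(axis=0) for k, X in train.items()}
    cov = {k: np.cov(X, rowvar=False) for k, X in train.items()}
    # Pooled covariance of the classes, used here in place of Sigma_W.
    Sw = sum(cov.values()) / len(cov)
    return mu, cov, Sw

def classify(x, mu, cov, Sw, kind="QDF"):
    """Assign x to the class with the minimum discriminant value g_k."""
    g = {}
    for k in mu:
        d = x - mu[k]
        if kind == "QDF":    # equation (2)
            g[k] = d @ np.linalg.inv(cov[k]) @ d + np.log(np.linalg.det(cov[k]))
        elif kind == "FLDF": # equation (3)
            g[k] = d @ np.linalg.inv(Sw) @ d
        else:                # "ED", equation (4)
            g[k] = d @ d
    return min(g, key=g.get)
```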
3 Learning curve

Learning curves of statistical classifiers describe increases in the probability of misclassification due to estimates of population parameters from training samples. The experimental learning curve requires

(i) a monotonically decreasing function where the probability of misclassification decreases gradually toward the generalization probability of misclassification as the number of training samples increases, and

(ii) a common function that represents the wide variety of asymptotic behaviors characterized by a small number of parameters.

From these viewpoints, the function of the experimental learning curve is chosen as:

E_N = E_{N→∞} + c_C N^{c_e}    (5)

where
E_N : approximate probability of misclassification
E_{N→∞} : generalization probability of misclassification
N : number of training samples
c_C, c_e : parameters (c_C ≥ 0, c_e ≤ 0)

Equation (5) is a general formulation of learning curves as a function of training sample size. The parameters E_{N→∞}, c_C and c_e are estimated using a least mean-squared error technique between the probabilities of misclassification obtained in experiments and the corresponding approximate probabilities given in equation (5). Experimental learning curves are calculated for the combinations of the type of the classifier, the dimensionality, and the generalization probability of misclassification. In addition, a generalization ratio η_N is defined here as:

(6)

The generalization ratio is calculated using a learning curve. This measures the learning effect produced by one training sample.
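As a sketch of this fitting step, equation (5) can be fitted by nonlinear least squares. The snippet below assumes SciPy is available and uses made-up misclassification rates as placeholder data, so only the procedure, not the numbers, reflects the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(N, E_inf, c_C, c_e):
    # Equation (5): E_N = E_inf + c_C * N**c_e, with c_C >= 0 and c_e <= 0.
    return E_inf + c_C * N ** c_e

# Placeholder misclassification rates measured at each training sample size.
N = np.array([1, 2, 5, 10, 20, 50, 100], dtype=float)
E_obs = np.array([0.226, 0.132, 0.057, 0.037, 0.025, 0.020, 0.018])

(E_inf, c_C, c_e), _ = curve_fit(
    learning_curve, N, E_obs,
    p0=[E_obs[-1], 0.2, -1.0],
    bounds=([0.0, 0.0, -np.inf], [1.0, np.inf, 0.0]),
)
print(f"E_inf = {E_inf:.4f}, c_C = {c_C:.3f}, c_e = {c_e:.2f}")
```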

4 Experiment

4.1 Artificial pattern

In the following, the probability of misclassification related to the pattern distributions is briefly described, where the populations have multi-variate normal distributions. Under the pattern distributions mentioned above, the decision rule of the quadratic discriminant function can be optimal in the Bayes sense. If the population parameters share the same covariance matrices¹, Fisher's linear discriminant function equates to the Bayes classification rule. Moreover, the Euclidean distance classifier can perform optimally only when the covariance matrices are identity matrices. Here, the increase in the probability of misclassification is shown to be proportional to 1/N and dependent on the dimensionality² [3].

In this experiment, the mean vectors μi and covariance matrices Σi are formalized as follows³:

(9), (10)

¹The probability of misclassification only depends on the Mahalanobis distance between classes.

²Fukunaga et al. [2] have derived theoretical approximate equations under the assumption of equal covariance matrices. The equation for the quadratic discriminant function:

E_Q ≈ E + (1/(4N)) (φ(δ/2)/δ) { p² + … }    (7)

The equation for Fisher's linear discriminant function:

E_F ≈ E + (φ(δ/2)/(2Nδ)) { (1 + δ²/4) p − 1 }    (8)

where E is the true probability of misclassification, δ is the Mahalanobis distance between classes, and φ is the density function of the standard normal distribution.

³It is known that any two covariance matrices can be simultaneously diagonalized and a coordinate shift can bring the mean vector of one class to zero, without loss of generality [1].
Artificial samples are generated from each class according to the given population parameters. A classifier is designed by randomly chosen training samples of equal size N from each class, and then a misclassification rate is measured on 50,000 test samples drawn independently from each class. This procedure was repeated 500 times independently, and an expected probability of misclassification is obtained by averaging over the misclassification rates in the 500 trials. The training sample sizes N are 1, 2, 5, 10, 20, 50, and 100. The dimensionality p is varied from 1 to 32 in powers of 2, except for Case 3.
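The simulation procedure described above might be organized along the following lines. This is a schematic re-creation that reuses the fit_estimates and classify helpers sketched in Section 2.2; it is not the original experimental code and is written for clarity rather than speed.

```python
import numpy as np

rng = np.random.default_rng(1)

def run_trial(N, mu, Sigma, n_test=50_000, kind="QDF"):
    # Design the classifier on N randomly chosen training samples per class
    # (for N = 1 the covariance estimate degenerates; the paper does not say
    # how that case is handled, so N >= 2 is assumed here).
    train = {k: rng.multivariate_normal(mu[k], Sigma[k], size=N) for k in mu}
    est_mu, est_cov, Sw = fit_estimates(train)
    # Measure the misclassification rate on independent test samples.
    errors, total = 0, 0
    for k in mu:
        test = rng.multivariate_normal(mu[k], Sigma[k], size=n_test)
        errors += sum(classify(x, est_mu, est_cov, Sw, kind) != k for x in test)
        total += n_test
    return errors / total

def expected_error(N, mu, Sigma, trials=500):
    # Average the misclassification rate over independent repetitions.
    return float(np.mean([run_trial(N, mu, Sigma) for _ in range(trials)]))
```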
4.1.1 Case 1: Σ1 = Σ2

When both covariance matrices are identity matrices, every classifier can perform optimally, yielding the Bayes error rate⁴. An experimental learning curve of Fisher's linear discriminant function for the Bayes error rate of 10% is shown in Figure 1 as a function of sample size, where the dimensionality is 8. The expected misclassification rate versus sample size is also plotted. The vertical lines on the curve show the standard deviation for each sample size. The experimental learning curve fits quite well to the expected misclassification rates. The curve shows behavior so similar to the theoretical approximate equation (see equation (8)) that the misclassification rate decreases exponentially as the sample size increases, although the misclassification rates given in the equation are underestimated in the range of small sample sizes. The estimated values of the parameter c_e in equation (5) do not always equal −1 (see equation (8)), and c_C increases linearly on a logarithmic scale of dimensionality as the dimensionality increases [5].

⁴The Bayes error rate can be controlled by setting the first component m1 of the mean vector to the corresponding value and all other components to 0; it remains unchanged for the fixed m1 even with different dimensionality.
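For reference, the mean shift that produces a desired Bayes error rate in this setting follows directly from the standard normal distribution; the short sketch below is not from the paper but makes the relationship explicit.

```python
from scipy.stats import norm

def mean_shift_for_bayes_error(eps):
    # With identity covariances and a mean difference delta in a single
    # component, the Bayes error is Phi(-delta / 2), so delta = -2 * Phi^{-1}(eps).
    return -2.0 * norm.ppf(eps)

print(mean_shift_for_bayes_error(0.10))   # about 2.56, cf. the fixed m1 = 2.56
```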
4.1.2 Case 2: Σ2 = λΣ1

In Figure 2, the expected misclassification rates of the quadratic discriminant function for a covariance shift λ = 4 are illustrated on a gray scale over the sample size and the dimensionality⁵. The figure reveals that more training samples are required to maintain the same degree of classification performance as the dimensionality increases. There exist peaks in classification accuracy as the dimensionality is increased, where the optimal dimensionality is assigned to the sample size. The contour map in Figure 2 was drawn by replacing the expected misclassification rate with the approximate misclassification rate given by the estimated experimental learning curves for the corresponding combinations of the training sample size and the dimensionality. The contour map mostly traces over the asymptotic behavior even though the learning curves are estimated independently for the different dimensionalities.

⁵The mean vector is fixed as m1 = 2.56, mk = 0 (k = 2, ..., p).
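The construction of such a map amounts to evaluating the fitted curves of equation (5) on a grid of sample sizes for each dimensionality. A small sketch (with placeholder fitted parameters, not the values estimated in the paper) is:

```python
import numpy as np

# Placeholder fitted (E_inf, c_C, c_e) per dimensionality p.
fits = {1: (0.10, 0.06, -1.0), 2: (0.10, 0.10, -0.9),
        4: (0.10, 0.16, -0.9), 8: (0.10, 0.25, -0.8)}

Ns = np.array([1, 2, 5, 10, 20, 50, 100], dtype=float)

# Approximate misclassification rate on the (p, N) grid, as used for the map.
grid = np.array([[E_inf + c_C * N ** c_e for N in Ns]
                 for E_inf, c_C, c_e in fits.values()])
print(grid.round(3))
```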
4.1.3 Case 3: Σ1 ≠ Σ2

The population parameters in equation (10) are shown in Table 1. The estimated learning curve parameters are shown in Table 2. As expected, the estimated rate of 2.08% for the quadratic discriminant function nearly agrees with the Bayes error rate of 1.9%⁶. The generalization ratios η100 are also shown in Table 2. The ratios indicate that the Euclidean distance classifier shows the least improvement in classification performance among the three classifiers. These ratios should be taken into consideration in classifier design.

⁶The Bayes error rate is known to be 1.9% [1].

Table 1: Population parameter

i    1     2      3     4     5     6     7     8
mi   3.86  3.10   0.84  0.84  1.64  1.08  0.26  0.01
λi   8.41  12.06  0.12  0.22  1.49  1.77  0.35  2.73

Table 2: Learning curve parameter
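The quoted Bayes error can be checked numerically. The sketch below assumes the canonical form suggested by footnote 3, namely class 1 distributed as N(0, I) and class 2 with the mean and diagonal covariance of Table 1; this parameterization is an assumption, not stated explicitly in the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

m   = np.array([3.86, 3.10, 0.84, 0.84, 1.64, 1.08, 0.26, 0.01])    # Table 1
lam = np.array([8.41, 12.06, 0.12, 0.22, 1.49, 1.77, 0.35, 2.73])   # Table 1

def log_likelihood_ratio(x):
    # log f1(x) - log f2(x) for f1 = N(0, I) and f2 = N(m, diag(lam)).
    q1 = np.sum(x ** 2, axis=1)
    q2 = np.sum((x - m) ** 2 / lam, axis=1) + np.sum(np.log(lam))
    return 0.5 * (q2 - q1)

n = 500_000
x1 = rng.standard_normal((n, 8))                      # samples from class 1
x2 = m + rng.standard_normal((n, 8)) * np.sqrt(lam)   # samples from class 2

E1 = np.mean(log_likelihood_ratio(x1) < 0)   # class-1 samples sent to class 2
E2 = np.mean(log_likelihood_ratio(x2) > 0)   # class-2 samples sent to class 1
print(0.5 * (E1 + E2))                        # close to the quoted 1.9%
```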

4.2 Character pattern

An experiment using handprinted Japanese Kanji character patterns in five classes is examined⁷. Figure 3 shows an example of the character pattern samples. The number of samples per class used is 670 for design and 500 for testing. The training sample sizes N are 1, 2, 5, 10, 20, 50, and 60. The misclassification rates and standard deviations for the Euclidean distance classifier, computed in 10 independent trials for each sample size, are shown in Table 3. The approximate misclassification rates calculated by the experimental learning curve are shown in Table 4. The table shows that the approximate rates almost agree with the rates of the corresponding sample sizes in Table 3. Finally, the misclassification rate is 1.72% when all training samples are used (the sample size is 670). As compared with the approximate misclassification rate (1.65%), it is clear that the experimental learning curve gives a good approximation.

⁷The PDC (Peripheral Direction Contributivity) feature [6], known to be useful in the recognition of Kanji characters, is used as the feature for the classification. A character pattern is represented by a 768-dimensional feature vector.
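The comparison at N = 670 is simply equation (5) evaluated beyond the fitted range. A minimal check is shown below; the fitted parameter values are placeholders, because the estimates for this particular curve are not listed in the paper.

```python
def approx_error(N, E_inf, c_C, c_e):
    # Equation (5) evaluated at an arbitrary training sample size N.
    return E_inf + c_C * N ** c_e

# Placeholder parameter values, for illustration only.
print(approx_error(670, E_inf=0.016, c_C=0.21, c_e=-0.95))
```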

Table 3: Misclassification rate (ED) [%]

N      1      2      5      10     20     50     60
mean   22.6   13.2   5.74   3.74   2.53   1.97   1.98
σ      6.55   5.55   2.76   1.08   0.454  0.457  0.342

Table 4: Approximate misclassification rate

5 Conclusion

Experimental results have shown that learning curves formulated from equation (5) can be obtained experimentally and that they express the proper asymptotic behaviors of the effects of training sample size. The main goal of this work is to establish an effective classifier design approach which makes optimal use of a finite number of available samples. The analysis of the statistical nature of learning curves and the interval estimate of the approximate probability of misclassification will be the subject of future study.

Figure 1: Learning curve (FLDF), Σ1 = Σ2, p = 8. [Misclassification rate versus sample size (1 to 100), showing the experimental mean ± standard deviation, the experimental learning curve, and the approximate equation by Fukunaga.]

Figure 2: Misclassification rate (QDF), Σ2 = 4Σ1. [Gray-scale map of the misclassification rate (roughly 7% to above 35%) over sample size and dimensionality.]

Figure 3: Example of character pattern.

References

[1] K. Fukunaga, "Introduction to Statistical Pattern Recognition," 2nd ed., Academic Press, New York, 1990.

[2] K. Fukunaga and R. Hayes, "Effects of Sample Size in Classifier Design," IEEE Trans. PAMI-11, No. 8, pp. 873-885, 1989.

[3] S. J. Raudys and A. K. Jain, "Small Sample Size Effects in Statistical Pattern Recognition: Recommendations for Practitioners," IEEE Trans. PAMI-13, No. 3, pp. 252-264, 1991.

[4] S. Amari, N. Fujita and S. Shinomoto, "Four Types of Learning Curves," Neural Computation, Vol. 4, pp. 605-618, 1992.

[5] T. Matsunaga and H. Kida, "An Experimental Study of Learning Curve on Linear Classifiers," PRU94-56, pp. 67-74, 1994. (in Japanese)

[6] N. Hagita, S. Naito, and I. Masuda, "Handprinted Kanji Characters Recognition based on Pattern Matching Method," Proc. ICTP '83, pp. 169-174, 1983.
