You are on page 1of 14

Kudo, M. and M. Shimbo (1989) Optimal Subclasses with Dichotomous Variables for Feature Selection and Discrimination.

IEEE Trans. Systems, Man,
and Cybern, 19, 1194{1199.
Kudo, M., S. Yanagi and M. Shimbo (1996). Construction of Class Regions by
a Randomized Algorithm: A Randomized Subclass Method. Pattern Recognition, 29, 581{588.
Hayamizu, S. et al. (1985) Generation of VCV/CVC balanced word sets for a
speech data base. Bul. Electrotech. Lab., 49(10), 803{834. (in Japanese)
Rissanen, J. (1989). Stochastic Complexity in Statistical Inquiry, volume 15
of Comp. Sci. World Scienti

1989. 14 .c.

7 Discussion Some methods were proposed for the recognition of sets of a variable number of points such as .

Methods C and T-B are good for . and for the recognition of curves over discrete time such as speech time series. It is very dicult to determine which of the proposed methods is the best for a given problem.gures in a plane.

nding many characteristics that only samples of one class have in common. but sometimes they .

nd also redundant commonness. Such redundant commonness is harmful to classi. especially when only a small number of training samples are available.

B. are not so good for this purpose. 8 Conclusion We proposed a methodology for classifying sets of data points in a multidimensional space or multidimensional curves.cation of novel samples. less redundant common characteristics are obtained by these methods. As a result. Some model selection criteria like MDL (Rissanen . Further study is needed to establishing e ective ways of choosing an appropriate method for a given problem. Although other simpler methods such as methods A. The classi. and T-A. 1989) may be useful for choosing an appropriate method. we have to choose a method whose complexity is neither too small nor too large.

this approach was con. Based on the results of experiments.cation rule is based on the common regions through which only curves of one class pass.

Wu. Eds. Methods of Biological Signal Processing. T.rmed to be applicable for a wide range of applications and work well for certain applications. Handbook of Pattern Recognition and Image Processing. S. Pattern Recognition and Classi. R. Fu. B. References Shiavi. Chapter 22. Bourne (1986). and J. R. Academic Press. (1994). Y. G. Young and K.

and B. Plataniotis. N. 62.Lainiotis (1996). A. Androutsos.cation in Time Series Analysis.G. Rebiner. K. Applied Mathematics and Computation. L.N. 29{45. A new time series classi. H. Juang (1993). Fundamentals of Speech Recognition. Prentice-Hall.. Venetsanopoulos and D. Inc. D.

and V. Predictive Modular Neural Networks for Time Series Classi. 191{199. A. Petridis (1997).cation approach. Signal Processing. Kehagias. 54.

cation. 10. Neural Networks. 31{49. 13 .

A subclass for the . 7.Feature 1 Feature 2 Feature 3 Feature 4 Fig.

rst speaker in the .

rst four features. The subclass and ten curves of each class (dark lines for the .

The horizontal axes are time and the vertical axes are feature values.rst speaker and light lines for the other eight speakers) are shown. The solid rectangles show the regions R(1) through which curves of the .

rst speaker should pass. and the dotted rectangles are the regions R (1) through which curves of the .

lies in a certain local part. 12 .rst speaker should not pass.

Original polygons A subclass for class 1 (10 for each class) A subclass for class 2 A subclass for class 3 Fig. Results of polygon classi. 6.

A subclasses for the . One subclass and ten samples for each class are shown. A total of 270 time series (30 per class) were used for training and the remaining 370 time series were used for testing.3 Speaker Recognition This dataset aims to distinguish two Japanese vowels /ae/ continuously uttered by nine male speakers (a 9-class problem). 6. 12 LPC cepstrum coecients were calculated from each frame of the data. By 12-degree linear prediction analysis. a sample is a time series with length 7-29 in a 12-dimensional space. K = 1) was applied to this dataset. As a result. Method T-B with (U = 4. V = 20.cation.

which is higher than that using method T-B. 7. which is more e ective when the most important cue to distinguish one class from the others 11 . The recognition rate using HMM was 96:2%. the 5-state continuous Hidden Markov Model (HMM) (Rebinar and Juang . The recognition rate was 94:1%: For comparison.rst speaker is shown in Fig. This might be because a variance of samples within a class appears as a change in the global form of curves and does not meet the conditions of our technique. 1993) was also used.

si.

ers (a linear classi.

a quadratic classi.er.

er. 5-nearest neighbor classi.

er) for which six forcibly calculated formants are used. Table 2 Results of recognition of .

ve Japanese vowels. Classi.

9 84.3 85.2 From Table 2 we can see that method A has the highest recognition rate. in other words.er Linear Quadratic 5-NN Method A Method B Recognition Rate (%) 82.9 85. One possible explanation for the reason why method A worked better than method B is that method B found too much redundant commonness.8 86. some sort of over-.

A hundred samples for each class were used. This is reasonable because the last class is uniform and some instances of the class can belong to the .1.2 Polygon Recognition Next. Method C (V = 25. 6. The numbers of subclasses are 1. 6. K = 1) was applied to this problem. we dealt with the problem of distinguish three kinds of polygons with 3-5 vertices: 1) polygons whose vertices are all located in a circle with a radius of 0.5. 2) polygons for which one vertex is located in a circle with a radius of 0. The result is shown in Fig. 3 and 9.tting to the training samples was done. and 3) polygons whose vertices are located at arbitrary positions.

limits the range where polygons of the class can exist to a small region with radius of 0. Similarly. 6.rst or second class. using the dotted rectangles.5. it can be seen that the subclass rules for of class ! . As a result. From Fig. we cannot specify the last class in a simple way. the subclass of class ! succeeded to .

However.nd a rule requiring that all polygons of the class should have at least one vertex located in a circle with radius of 0.1 (a solid rectangle). we can see also that there is much redundant commonness which is not general in the class but is common among training samples of the class. 1 2 2 3 10 . Many dotted regions of classes ! and ! in Fig. 6 are such examples.

The largest subclass of class /a/ is shown in Fig. 5. In this .

and the solid line in the left direction at level 1 shows the region for which at least one formant should exit.gure. for example. for example. the dotted line in the left direction at level 1 shows the region for which formants are prohibited to exist. all ten sets of points of /a/ satisfy these requirements and have therefore been recognized correctly. In fact. because their . nine of ten samples of /e/ violate this subclass rule. On the other hand.

A subclass obtained for vowel /a/ using method A. Ten samples in each class are shown. Level 3 2 1 11 11111 1 0 2222222 3 3 3 33 33 1 3444444 44 4 2 555 3 555555 5 6 6 6 6 6 6 4 5 6 kHz /a/ Level 3 2 111111 1 1 1 2 0 2 232222223 3 3 3 3 333 4 3 1 4 4 44 44 45 45 55 55 4 5 6 5 2 3 6 5 6 6 6 4 666 5 6 kHz /e/ Level 3 2 111 111 1 2 22222222 0 1 3 3 3 333334 2 34 444 445 5 5555655 6 5 3 56 6 4 6 6 66 6 5 6 kHz /i/ Level 3 2 1111 1 2 1 22222 2 1 0 333 3333 3 3 1 2 4 44 444444 54 5 55 5 5 55 3 5 6 4 66 66 66 5 6 kHz /o/ Level 3 2 1 11111 1 0 2 2 1 2 2 2223 2 2 333333333 4 2 4 444 444 4 54 5 3 5 55 565 555 4 66 6 6 66 66 6 5 6 kHz /u/ Fig. 5. and the remaining one sample violates one of the limitations of the rule.rst formants appear in the prohibited region. The same subclass is shown in all .

Solid lines at level k (R(k) ) show that at least k formants should be in the range. 8. and so on. The number of subclasses obtained were 3. 6. 13. Small numbers with a bar show formants: 1 is the 1st formant. The recognition rates of methods A and B are shown in Table 2 together with those of some conventional clas9 .gures. and 16. and dotted lines at level k (R (k) ) show that only less than k formants is permitted. 2 is the 2nd formant. respectively. Method B was also applied to the same dataset.

Time has U thresholds and every feature has V thresholds. the subclass method is applied to the converted binary vectors.Table 1 Required bit length. and furthermore a class is characterized by a set of subclasses. It should be noted that a subclass indicates many regions simultaneously because a bit of 1 corresponds to one region and many bits of 1 can exist in one subclass. For classi. Method A Method B Method C ?  ?  2nKV 2nK V2 2K V2 n Method T-A Method T-B 8nKUV ? ?  2nK U2 V2 After this binalization.

cation of a novel curve x without a class label: (1) First we count the number of subclasses in each class for which x satis.

es its corresponding rectangle rules. We choose the class for which the largest percentage of subclasses is satis.

(2) If no such subclass is found. where \nearness" is measured by the number of bits that is 1 in the subclass X and 0 in x. The nearest class is chosen for the assignment. we use the average of the nearness of subclasses to x in each class.1 Vowel Recognition by Formants This dataset is for recognition of the .ed and assign x to that class. 6 Experiments 6.

/e/. In total. peak frequencies [Hz] characterizing each vowel. We applied method A with V = 200 and K = 3. /o/. Through frequency analysis. by removing formants with a width larger than a reasonable value." are calculated for a sample. that is. called \formants. We removed such false formants by referring their widths. 1985). /i/. 29.ve Japanese vowels f /a/. . /u/ g. 27. 6000 vowel samples were taken from the database ETL-WD-I-1 (Hayamizu et al. In this analysis. some false formants can appear because six peak frequencies are forcibly calculated regardless of their actual existence. 12. 19 and 31 subclasses were respectively obtained for . In total. and a test set of 5000 samples (1000 for each vowel). samples are expressed by four to six formants. As a result. We divided all of the samples into two sets: a training set consisting of 1000 samples (200 for each vowel).

ve classes. 8 .

5 Binalization Taking method T-B as an example. we explain our binalization method. We consider a 1-dimensional curve de. A similar procedure is applied to the remaining methods.

: : : . . . U ? 1) in time. equally spaced V thresholds v (v = 0. . : : : . + We set up equally spaced U thresholds u(u = 0. Q Q Q Q Q Q .. 1 . . Fig. . . v ] more than or equal to k times (k = 1. . For all   vector  U V possible  rectangles in the discrete domain. bB ) by the following procedure. In addition. the total length of bits becomes ! ! V U B = 2nK 2 2 : + 1 2 2 ( ) 0 2 0 0 0 0 ( ) 0 f1 φ φ φ 0 3 R 2 1 Q φ 0 θ1 θ0 θ2 Time t θ4 θ3 (1) (2) (3) (1) (2) (3) (1) (2) (3) (1) (2) (3) R R R R R R b = . . . Similarly. 1 . . u ]  [v . By multiplying by the dimensionality n. . 0 . . . xt 2 R: 1 2 Let us assume that class ! is a positive class and let the training set S! be referred to as a positive sample set S and all other training sets be combined into a negative sample set S ?. . . . otherwise buuk vv = 0 (Fig. 0 . if x 2 S passes through Ruu vv = [u . whose ends are located at 0 and the maximum length of curves in S [ S ?. + + The binalization is carried out as follows.ned over discrete time t: x = (x . 0 . A curve x 2 S is transformed into a binary b = (b . Bits related to K = 3 for R and Q. . . . . . 1 . . . . b . 0 . . in each feature. . . . . 7 0 + . . . K ). V ? 1) are set up such that both ends are located at the minimum and maximum values of xt of every curve x 2 S [ S ?. The bit length of each method is shown in Table 1. . . 1 . 4). . . 4. . xT ). . 1 . x . 1 . . . : : : . . . 0 . .. . . . : : : . : : : . let buuk vv = 1. . . . . 0 . . . . . all of the inverted bits are added to the vector because 0's have the same importance as 1's. . . .

let us consider the following data. Sample x x x x x x y y y y y 1 2 S + 3 4 5 6 1 2 S? 3 4 5 Features 00011001110011100011 00001011110001100111 00111000110000101111 00111000110001100111 01111000010001100111 00111000110111100001 00001011110011100011 00001011110000101111 11111000000000101111 01111000010011100011 00011001111111100000 The subclass method .For example.

we can know which characteristics are important for distinguishing one class from others. 1996) of the subclass method. x g and X = fx . The most desirable situation is that where only one subclass. x . is found. then the class can be thought of as \being totally isolated from the other classes. ( ) ( ) 6 ." + Before using this method for a certain application. a binalization method has to be carried out. a bit corresponds to a characteristic to be examined. x g 1 2 4 5 2 1 3 4 6 and their corresponding binary vectors X . . X = 00011000110000100001: 1 2 The position of 1's shows the commonness of the positive class. by the position of 1's remaining in a subclass X. x . In the binalization method. x . X are 1 2 X = 00001000010001100111. a characteristic is associated with R k or R k . Once the original feature vectors have been converted into binary vectors. a semi-optimal set of subclasses is usually found by a randomized version (Kudo et al. Due to its high computational cost.nds two subclasses: X = fx . S itself. in our study.

3.curves must not pass through k times or more. respectively. Now. (1) (1) ( ( ) ( ) ) ( ) ( ) ( ) ( ) fi Q (1) P (2) Class ω1 Class ω2 R R (1) Time t Fig. As shown in the next section. The curve of !1 should pass through rectangle P (2) twice or more and rectangle Q(1) at least once. lastly. We use the notations of R!K = fR k (1  k  K )g (R k can be more than one) and R !K = fR k (1  k  K )g. we . 3. but should not pass through R (1) even once. Three rectangular regions specifying class !1 . R and R are denoted by R and R . The curves of class ! must pass through R k k times or more and must not pass through R k k times or more. An example is shown in Fig.

nd a set of rules ! = fR!K [ R !K g and a rule is of an element R!K [ R !K . 1989). we use a subclass method (Kudo and Shimbo . every pattern is assumed to be a binary pattern. In the subclass method. We . ( ( ) ( ) ( ) ) 4 Subclass Method To realize this idea.

called \subclasses.nd a family X of subsets X . another training set consisting the samples of the other classes. (Exclusiveness) (2) X is maximal with respect to the inclusion relation in the set of subsets satisfying the above condition (Maximalness) 5 . then at least one bit of 1 in X is 0 in any sample of S ?." of a training set S of a class satisfying the following conditions: + (1) Let X be the binary pattern obtained by AND operation over all elements of X in each bit.

Usually. One way is to count the number of points and the other way is to count how many times a sequence of points enter the region. the . 2).(Fig.

way (Count Type A) is used for methods A. we describe our methodology to realize the above idea. Let us assume a training set S! = fxg of N! curves for class ! 2 . and the second way (Count Type B) is for methods T-A and T-B. B and C. 3 Realization of rules In this section. We . time series. Two types of counting. 2. that is. A set of points can be treated in a similar way. We pay attention only to curves.rst 1 2 45 3 6 1 2 45 3 Count Type A (Cnt = 3) Count Type B (Cnt = 2) Count Type A ( Cnt = 3) Count Type B ( Cnt = 2) 6 Fig.

The details will be given in the following section. a region is taken as a rectangle in a plane spanned by each dimension and time.nd a set R! = fRg of regions R (it can be a half-interval. an interval. or a hyper-rectangle depending on the method used) for which the maximum number of curves of S! pass through and no curve of S 2 ( 6= !) passes through. In methods T-A and T-B. : : : . For the ith (i = 1. a rectangle R is speci. n) feature (dimension).

Mi].ed by R = [ti . where ti and Ti are the two ends (can be ?1 or 1) in time. It can happen that n rectangles de. and mi and Mi are the two ends in the ith dimension. Ti]  [mi.

In this case.ned over di erent dimensions coincidently have the same time interval: t = t =    = tn . the set of these rectangles corresponds to a hyper-rectangle de. T = T =    = Tn.

Also. However. a set of regions which any curve of S! should not pass through is denoted by R i = fR g. we denote a region which these ( ) ( ) 4 . By R k . we consider how many times a curve passes through the region. this does not usually happen. we denote a region which curves of S! must pass through at least k times. by R k .ned in the original n-dimensional space. 1 2 1 2 Similarly. As an extension.

but an interval is used instead of a threshold. Method C . : : : . both cases in the ith feature fi(i = 1.larger value than a threshold (a solid arrow) and no point to have smaller value than another threshold (a dotted arrow). n). Method B shows the same rule.

nds a hyper-rectangle in the n-dimensional space instead of a set of thresholds or intervals de.

one hyper-rectangle (solid rectangle) requires that two or more points exist inside the area of the rectangle and another hyper-rectangle (dotted rectangle) requires that no point exists in the area of the rectangle.ned in each feature. 1. 1. 1 2 3 1 2 3 fi fi Method A Method B fj 1 3 2 fi Method C fi 2 1 fi 2 3 1 3 t t Method T-A Method T-B Fig. Three methods for classi. In Fig.

cation of sets of a variable number of points and two methods for classi.

n). One sample consisting of three points is shown for methods A. T ) 2 Rn. let us consider a curve in n-dimensional space. : : : . xT ). For this special case. A parameter t is referred to as a time. Thus. 1). x . 1 2 where the length T depends on the curve. Next. the above-mentioned methods A. two more methods are examined (Fig. Method T-B shows a rule using two rectangles with the same requirement. B and C are also applicable to this case by ignoring the order of appearance. : : : . that is. and and one curve of points is shown for methods T-A and T-B. xt (t = 1. In the relationship between t and each feature fi(i = 1. Right-lower and left-upper regions will be considered also in practice. : : : .cation of time series. In addition. we use two ways to count the number of points itself in a region 3 . B and C. the case that a set of points are ordered by a time index. This case can be regarded as a special case of the above case. Let us assume that a curve is a discrete time series: x = (x . method T-A shows a rule requiring that two or more points exist in the right-upper region (solid arrows) and no point exists in the left-lower region (dotted arrows).

expression of a sample is a set of feature points in a .

a handwritten character may be expressed by a set of strokes and one stroke may be expressed by 4 numerical values for the start and end positions of the stroke. In other applications such as classi. the number of feature points (strokes) can vary depending on the sample. for example. In such a case.xed dimension.

one sample may be measured over a long duration and another sample over a short duration. we have to devise a classi.cation of a time series. Therefore. which results in a variable number of observations.

Classi. but they usually require the extraction of a special set of features and the expression of the set in a form suitable for their algorithms.cation method applicable for such cases. Many syntactic approaches may be applicable. we discuss a natural way to treat two typical expressions of a sample: a set of points in a high-dimensional space and a time series of points in a high-dimensional space. In this study.

cation of curves is very important in the .

elds of analysis of several natural signals. 1986). Many methods for classi. time-series like speech. and biological signals (see Shiavi et al. .

Particularly for analysis of time-series. Wu . many methods choose one source from a possible . 1994) and a hidden Markov model (see Rebinar and Juang . a prediction method like the auto-regression model (e..g. 1993).cation of curves are based on Fourier analysis.

1996. ..g. However.nite set of signal sources (e. Plataniotis et al. 1997). Kehagias and Petridis . no approach gives an intuitional explanation of the resulting classi.

we assume that the form of each curve is the most important cue for classi.cation rule. In this study.

cation. and we try to .

x . Such regions are easy to view and interpret. : : : . The expression is basically independent of the order of the elements. xi (i = 1. Assume that a sample x is a set of points in an n-dimensional space: x = fx . xmg.nd the regions that all or almost all curves of a class pass through and no curve of the other class does. : : : . m) 2 Rn. We . 1 2 where m is variable depending on the sample. we will present an outline of the proposed methods. 2 Rules to be found First of all.

Method A shows a rule requiring two or more points to have a 2 . 1. An example is shown in Fig. for example.nd a rule that. or prohibits more than k points to take a value less than 0 in some feature. requires k or more points to take a value greater than a threshold  in some feature (dimension).

Multidimensional Curve Classi.

Almost all classi. Sapporo 060-8628. Division of Systems and Information Engineering Graduate School of Engineering Hokkaido University Kita-13.cation Using Passing-Through Regions Mineichi Kudo a 1 Jun Toyama a Masaru Shimbo a . Kita-ku. JAPAN a Abstract A new method is proposed for classifying sets of a variable number of points and curves in a multi-dimensional space as time series. Nishi-8.

we examine a .ers proposed so far assume that there is a constant number of features and they cannot treat a variable number of features. To cope with this diculty.

xed number of questions like \how many points are in a certain range of a certain dimension." and we convert the corresponding answers into a binary vector with a .

These converted binary vectors are used as the basis for our classi.xed length.

With respect to curve classi.cation.

many conventional methods are based on a frequency analysis such as Fourier analysis.cation. However. a predictive analysis such as autoregression. or a hidden Markov model. their resulting classi.

cation rules are dicult to interpret. In addition. they also rely on the global shape of curves and cannot treat cases in which only one part of a curve is important for classi.

Key words: multidimensional curve classi. We propose some methods that are especially e ective for such cases and the obtained rule is visualized.cation.

cation. classi.

cation of sets of multidimensional points. binarization 1 Introduction Almost all conventional classi. a variable number of features. a subclass method.

ers assume that a sample is expressed by a .

hokudai.jp Preprint submitted to Elsevier Preprint 21 June 1999 .xed number of features. the most natural 1 Corresponding author. However.ac. E-mail: mine@main. in some applications.eng.