Articles
Applying Metrics to Machine-Learning Tools:
A Knowledge Engineering Approach
■ The field of knowledge engineering has been one of the most visible successes of AI to date. Knowledge acquisition is the main bottleneck in the knowledge engineer's work. Machine-learning tools have contributed positively to the process of trying to eliminate or open up this bottleneck, but how do we know whether the field is progressing? How can we determine the progress made in any of its branches? How can we be sure of an advance and take advantage of it? This article proposes a benchmark as a classificatory, comparative, and metric criterion for machine-learning tools. The benchmark centers on the knowledge engineering viewpoint, covering some of the characteristics the knowledge engineer wants to find in a machine-learning tool. The proposed model has been applied to a set of machine-learning tools, comparing expected and obtained results. Experimentation validated the model and led to interesting results.

Learning from examples is currently one of the most active AI research areas (Bratko and Lavrac 1987; Muñoz 1991). Basically, it is important because the results contribute positively to eliminating the bottleneck that knowledge acquisition constitutes in expert system building. As a result, there is a good collection of machine-learning tools for use in knowledge acquisition (Nuñez 1991; Wielinga et al. 1990; Bratko and Lavrac 1987). This proliferation of learning projects and, therefore, of techniques and tools has led to even more interest in comparative studies of such techniques and tools (Kodratoff and Michalski 1990).

This article focuses on a comparative study of how well machine-learning tools deal with ordinal, numeric, and structured attributes as well as cost. It elaborates the metrics to be applied to machine-learning tools, the introduction of these metrics into a model (a benchmark) for application, and concentration on the knowledge engineer as the user of the proposed model.

The main concepts used throughout this article are briefly explained in the next section. Then, a view of different comparative studies is given. Subsequently, the benchmark and some experiments with it are presented. Finally, the conclusions reached are summarized, and some future research directions are given.

Concepts

The development of expert systems, that is, programs that are skillful in a particular application domain, emphasized the importance of large stores of domain-specific knowledge (called knowledge bases) as a basis for high performance. Assembling and modifying the required knowledge base is a complex process that requires much expertise and careful maintenance.

A key element of this process is the transfer of expertise from a human expert to the program, that is, the task of knowledge acquisition. Because the expert generally knows little about programming, this process usually requires the mediation of a person called a knowledge engineer. However, this transfer of knowledge through the knowledge engineer is somewhat problematic. First, the knowledge engineer is not an expert in the specific application domain. Second, because most of the expert knowledge is heuristic and experimental, the expert is not capable of conveying it directly to the knowledge engineer. The field that studies these problems and their possible solutions is called knowledge engineering.
64 AI MAGAZINE
Articles
Related Work

The need for measures is nothing new in learning from examples (Michalski, Carbonell, and Mitchell 1983; O'Rorke 1982), and such measures have centered on different magnitudes. Table 1 shows some of the studies that originated from this need. The first column contains the type of study and the second some references for further information. For an overview of these papers, see Muñoz (1991).

It is evident to those of us in this field that the early studies tend to concentrate on the developer's viewpoint and later ones on the user's viewpoint. Also evident is the lack of any benchmark. These two considerations have been the major inspiration for the research presented here.

Table 1. Comparative Studies and Their References.

Type of Study: Author, Year
- Comparative study of ID & AQ families: O'Rorke, 1982
- Comparative review of learning from examples techniques (computational efficiency & others): Dietterich & Michalski, 1983
- Algorithm complexity: Utgoff, 1989
- Classification accuracy: Michalski, Mozetic, Hong & Lavrac, 1986; Mingers, 1989a; Quinlan, 1989; Utgoff, 1989
- Noisy and incomplete data: Cestnik, 1987; Quinlan, 1986 & 1989; Clark & Niblett, 1989
- Use of common sets of examples for different studies: Cestnik, 1987; Clark & Niblett, 1987 & 1989; Michalski, 1990
- Obtained results legibility: Cendrowska, 1988; Cestnik, 1989
FALL 1994 65
[Figure 3. Weights Assigned to the Target Characteristics. Bar chart, on a 0-to-5 scale, of the weights for accuracy, structured attributes, numerical attributes, ordinal attributes, nominal attributes, cost, incomplete data, incrementality, and noisy data.]
one tool from a set of them, what is he/she going to look for? The answer to this question constitutes the target characteristics.

Five experts were selected to help determine a good set of features. These experts are experienced knowledge engineers with a sound knowledge of machine learning and the use of these tools for knowledge acquisition. A list of possible features of interest was given to the experts, who assigned weights between zero and one to each of the characteristics; zero meant of no interest, and one meant really important. Figure 3 is a list of selected features and their respective weights.

There are other features that appear in the machine-learning literature but not in figure 3 (Muñoz 1991). This difference exists because the experts did not consider these characteristics relevant to the purpose at hand; that is, they assigned them a total weight lower than one.

First, accuracy refers to the classification accuracy of the acquired knowledge. A main goal in using these tools is to acquire precise knowledge; its importance is patent even for nonexperts.

Structured attributes are attributes whose values are not flat; that is, they form a structure (García 1991; Nuñez 1991, 1990; Witten and MacDonald 1990). These structures usually appear as a hierarchy (Witten and MacDonald 1990). The presence of these structures is associated with the use of background knowledge (Nuñez 1991, 1990; Mellis 1989) to guide the generalizations. The grouping of villages into counties, counties into states, and so on, is a good example of these attributes.

Ordinal attributes (Cestnik 1987; Mingers 1989a) are attributes whose values can be ordered. A person's academic qualifications and how well he/she speaks a foreign language are examples of this kind of attribute. Numeric attributes can be considered as a kind of ordinal attribute in which the order is given by the greater-than relation in real numbers. Examples of these attributes (Cestnik 1989, 1987) are temperature, distance from a
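Figure 3's weights lend themselves to a simple selection aid: weight each tool's per-characteristic scores and rank the totals. The sketch below is illustrative only; the weight values and tool scores are hypothetical placeholders, not figures from the article.

```python
# Hypothetical weights in the spirit of figure 3 (five experts each voting
# 0-1 per characteristic, so totals fall on a 0-to-5 scale).
weights = {
    "accuracy": 4.5,
    "numeric": 3.5,
    "ordinal": 3.0,
    "structured": 2.5,
    "cost": 2.0,
}

def weighted_score(tool_scores: dict, weights: dict) -> float:
    """Weighted average of per-characteristic scores, each score in [0, 1]."""
    total = sum(weights.values())
    return sum(w * tool_scores.get(name, 0.0) for name, w in weights.items()) / total

# Two hypothetical tools; a missing characteristic counts as a score of zero.
tool_a = {"accuracy": 0.9, "numeric": 0.8, "ordinal": 1.0, "cost": 0.95}
tool_b = {"accuracy": 0.85, "ordinal": 0.9, "structured": 0.7, "cost": 0.88}
```

A knowledge engineer could then pick the tool maximizing `weighted_score`, which here favors `tool_a` because the experts weighted numeric handling more heavily than structured handling.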
[Table 2 (fragment). Ordinal attribute, Task 1: 5, 3, 8, 60, 25.
* There is an infinity of possible values. This average corresponds to real numbers with only one decimal digit.
** There are three test sets, each with 118 test cases.]
Table 3. The Analysis Functions.

Ordinal attributes:
  f(p) = max(0, (p - 0.2) / 0.8)

Numerical attributes:
  f(x, y) = 0               if x < 0.5
  f(x, y) = (x + 2y) / 3    if x >= 0.5

Structured attributes:
  f(x, y) = f1(x) * f2(y), where
  f1(x) = max((x - 0.5) / 0.5, 0)
  f2(x) = max((x - 0.33) / 0.67, 0)

Cost:
  f(c, p) = (g(c) * h(p))^(1/3), where
  g(c) = 1 - c^2 / 2
  h(p) = e^(1.5 - (p + 1/(2p))) + 0.05 sin^2(π(1 - p)^1.8)
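As a sanity check, the table 3 analysis functions can be transcribed directly into code. Reading the cost function as the cube root of g(c) · h(p) is an assumption of this sketch, and p must be positive because h(p) divides by 2p.

```python
import math

def f_ordinal(p: float) -> float:
    # Accuracy at or below 0.2 scores zero; p = 1 scores one.
    return max(0.0, (p - 0.2) / 0.8)

def f_numeric(x: float, y: float) -> float:
    # x, y: accuracies on the first and second numeric tasks.
    return 0.0 if x < 0.5 else (x + 2.0 * y) / 3.0

def f_structured(x: float, y: float) -> float:
    # Product of two ramps: zero below the thresholds 0.5 and 0.33.
    f1 = max((x - 0.5) / 0.5, 0.0)
    f2 = max((y - 0.33) / 0.67, 0.0)
    return f1 * f2

def f_cost(c: float, p: float) -> float:
    # c: normalized classification cost in [0, 1]; p: accuracy, p > 0.
    g = 1.0 - c ** 2 / 2.0
    h = math.exp(1.5 - (p + 1.0 / (2.0 * p))) \
        + 0.05 * math.sin(math.pi * (1.0 - p) ** 1.8) ** 2
    return (g * h) ** (1.0 / 3.0)
```

A useful check on the reconstruction is that h(1) = e^0 = 1, so a zero-cost, perfectly accurate tool scores f_cost(0, 1) = 1.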
learning purposes and a test set containing the examples used for test purposes. The benchmark suite specifies one or more tasks for each benchmark target characteristic. The main features of these tasks are summarized in table 2. The tasks have been selected carefully on the basis of the following:

First, the examples of both the learning and test suites will contain attributes for testing the associated target characteristics. Let us call this attribute As.

Second, the attributes As will be crucial for learning purposes during the selected task; that is, their information gain (the information obtained when their value is known) will be high.

Third, the examples contained in the learning and test suites will contain extreme conditions, which means that the learning set will contain just the minimum information needed for a human being to solve the task, and the test set will explore the examples that a human being finds most difficult to deal with.

Fourth, in view of these principles, the smaller the training and test cases are, the more suitable a task is for inclusion in the benchmark suite. Thus, if a simple task is found that fits the benchmark purposes perfectly, more complex ones are rejected.

Bearing in mind these guidelines, table 2 shows how some of the tasks are really simple in terms of the learning and test-set sizes, conveying a rich combination of attributes for the pursued goal. A detailed explanation of each task is beyond the scope of this article; however, a brief description, including the most relevant attributes for each task, follows.

Ordinal attributes are tested using one task that consists of assigning a job to a person depending on some of his/her abilities. Both the abilities considered (academic rank, job experience, and foreign languages) and the job to be assigned are ordinal.

For numeric attributes, there are two tasks, both involving numeric relations: (1) classifying figures on a surface depending on their position, size, and form and (2) classifying points on a surface previously divided into different areas. The first task is a simple one and merely aims to establish whether or not a
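The information gain invoked in the second guideline is the standard entropy-reduction measure; a minimal sketch follows (the dictionary-based example format is a hypothetical choice, not the article's):

```python
import math
from collections import Counter

def entropy(labels) -> float:
    """Shannon entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute, target) -> float:
    """Expected reduction in class entropy once the attribute's value is known."""
    base = entropy([e[target] for e in examples])
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        remainder += len(subset) / len(examples) * entropy(subset)
    return base - remainder

# An attribute that perfectly separates the classes has maximal gain:
jobs = [
    {"academic_rank": "high", "job": "manager"},
    {"academic_rank": "high", "job": "manager"},
    {"academic_rank": "low", "job": "clerk"},
    {"academic_rank": "low", "job": "clerk"},
]
```

Here `information_gain(jobs, "academic_rank", "job")` equals the full class entropy of 1 bit, the situation the guideline asks the benchmark tasks to approximate.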
Results Analysis

[Figure 5. Analysis of Results for Numeric Attributes in the Interval in Which the Value of f(x, y) ≠ 0 (x > 0.5). Surface plot over 0.5 < x ≤ 1 and 0 ≤ y ≤ 1, ranging from f(x, y) = 0 at x = 0.5 to f(x, y) = 1 at x = 1, y = 1.]

Classification accuracy p was taken to measure the results with a view toward analysis. Once the knowledge has been acquired by the tools using the learning set, its classification accuracy p is determined using the test set. These values are then put into some specific functions to get meaningful measures. These functions are the analysis functions that are shown in table 3 and figures 4 to 7 for each target characteristic.

When the benchmark suite specifies two tasks, two parameters appear in the analysis function, and these parameters represent the accuracy obtained with the first and the second tasks, respectively. This value is standardized between zero and one. An exception is the function for cost analysis. Only one task is specified in the benchmark suite for this characteristic; however, the function has two arguments because, in this case, each attribute has an associated cost, and the cost involved in the classification process is also an important feature to be considered. Table 3 shows all the analysis functions, and figures 4 to 7 sketch their graphic representations for those values on which the functions are defined.

Different functions have been tried to find the one that best determines the following aspects for each target characteristic: (1) when are the results good enough for a value of one to be given to the tested tool, (2) where is the line below which a value of zero is given to the tested tool, and (3) which is
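The measure p described above is plain test-set accuracy; a minimal sketch, assuming a classifier is any callable from an attribute vector to a class label (a hypothetical interface, not the article's):

```python
def classification_accuracy(classifier, test_set) -> float:
    """Fraction of test cases the acquired knowledge classifies correctly (p)."""
    correct = sum(1 for attributes, label in test_set if classifier(attributes) == label)
    return correct / len(test_set)

# A toy induced rule and a three-case test set:
def rule(a):
    return "inside" if a["x"] > 0.5 else "outside"

cases = [({"x": 0.9}, "inside"), ({"x": 0.1}, "outside"), ({"x": 0.6}, "outside")]
p = classification_accuracy(rule, cases)  # 2 of 3 cases correct
```

The resulting p values are what get fed into the analysis functions of table 3.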
Experiment Results

[Figure 8. Bar chart (axis ticks 0 to 0.8) comparing the tools on the ordinal, numerical, structured, and cost characteristics.]

Tables 5 and 6 show the results obtained with the benchmark on the selected tools workbench. Table 5 shows the values of the
parameters needed for the analysis functions, and table 6 shows the values of the functions themselves.

The - symbol in table 5 appears only in the numeric attribute row. Based on the analysis function for these attributes, if classification accuracy with the first task is lower than 0.5, the function value will automatically be zero, with no consideration of the classification accuracy of the second task. Thus, the second task does not need to be carried out for these two tools.

A comparison of tables 4, 5, and 6 illustrates the parallelism of tables 4 and 6, which suggests the adequacy of the model. Nevertheless, several points should be stressed. The first point is the value of ASSISTANT 86 in table 6 for numeric attributes: It is not as close to one as expected because an attribute can have any numeric value in ASSISTANT 86. However, when it is considered, the user must specify the interval bounds; that is, this information is not induced from the learning set. The benchmark suite and analysis function detect this fact and consider that it is not a good way of dealing with numbers. However, this value is greater than zero, which differentiates it from other tools that do not deal with numbers at all.

Table 5 shows that the values of x and y for ALEXIS II are too close, but these values for ASSISTANT 86 are different. The reason for this difference is that ALEXIS II deals with structured attributes in a reasonable way; ASSISTANT 86 does not deal with this type of attribute but provides some mechanisms that can simulate them under certain circumstances. The proposed benchmark is robust in this sense.

Table 5 shows how the values in the AQ15 column are not zero, but those in the ID* column are. However, in table 6, the value of the analysis function is zero for both of them. The differences between the two tools, as shown in table 5, are purely coincidental. The analysis function detects that they are chance differences because the values of table 6 are zero for both tools.

In the cost row of table 6, the best result is for ID* (.95), followed by ALEXIS II (.88). It is important to mention that, according to their
developers, both tools use the same cost function. Why is there this difference? It is because ID* deals with noisy and incomplete data, but ALEXIS II does not. This mechanism allows ID* to get more economic knowledge at the expense of slightly lower accuracy. A gain in economy is of greater interest than the loss of accuracy in this case; the analysis function gives better values for ID*. This result backs up previous studies (Mingers 1989a) that show how, in some circumstances, dealing with noisy data improves the results obtained.

Tuning the Benchmark

Even with good results, as explained previously, the value of the numeric-analysis function for ALEXIS II and the values of the cost-analysis function for ALEXIS II, AQ15, and ID* suggest that these functions can be tuned. For this purpose, the tool developers have been asked to quantify the a priori reference YES for these attributes. The result of this quantification for the numeric attributes has been YES = 1. The analysis function can be tuned by applying a factor a to it, such that a * 0.86 = 1; so, a = 1.16. With respect to cost, the tool developers accept the values obtained as good, except for ALEXIS II, whose value should be greater than or equal to .9. Thus, a small factor b can be applied to this function, such that b * 0.88 = 0.9; that is, b = 1.02. This factor also maintains the values for the other tools within the quantified range.

Figure 8 is a graph showing the comparative strengths and weaknesses of each tool included in the workbench. Based on the assumption that a knowledge engineer wants to choose one of these tools to solve a particular problem, this graph can be taken as a good basis for selection.

Conclusions

The benchmark presented constitutes an important step in the search for measurement criteria to be applied to any machine-learning tool. To be precise, it is the first benchmark applicable to machine-learning tools and provides an objective and reliable measure.

Taking the knowledge engineer as the user of the benchmark is a new approach to comparative studies, close to the focus of most current studies of the user perspective.

The experiments show that the model eliminates the random effect and detects the weakness of some tools that are supposed to deal with numeric attributes, such as ASSISTANT 86.

New evidence has been found to support the fact that, in some circumstances, dealing with noisy data improves the results sought.

Finally, more cases can be added to the benchmark suite to allow comparative studies of characteristics other than ordinal, numeric, and structured attributes or cost considerations. This feature of the model is important.

Future Work

In view of the benchmark's extensibility, an important research effort would be to add new features to deal with other important characteristics, such as background knowledge and compound objects. We are, in fact, currently working along these lines. It would also be interesting to apply the benchmark to other tools that have been compared previously in other ways and to study both results.

Acknowledgments

This article was prepared and written in collaboration with the Centre of Technology Transfer in Knowledge Engineering in Madrid, Spain.

References

Alonso, F.; García, G.; Maté, J. L.; Morant, J. L.; and Pazos, J. 1990. Evaluation in Knowledge Engineering: Classifying, Comparative, and Metric Criteria. Presented at the Fourth International Symposium on Knowledge Engineering, 7–11 May, Barcelona, Spain.

Álvaro, R. 1990. Induction as a Solution for Knowledge Acquisition: AQ Algorithm. Master's thesis, Facultad de Informática, Universidad Politécnica de Madrid.

Atlas, L.; Cole, R.; Connor, J.; El-Sharkawi, M.; Mars, R. J.; Muthusamy, Y.; and Barnard, E. 1990. Performance Comparisons between Back-Propagation Networks and Classification Trees on Three Real-World Applications. Advances in Neural Information Processing Systems 2:622–629.

Bratko, I., and Lavrac, N. 1987. Proceedings of EWSL '87: Second European Working Session on Learning. Bled, Yugoslavia: Sigma.

Cendrowska, J. 1988. PRISM: An Algorithm for Inducing Modular Rules. Knowledge-Based Systems 1:225–276.

Cestnik, B. 1989. Informativity-Based Splitting of Numeric Attributes into Intervals. In Expert Systems, Theory and Applications, IASTED 89, ed. M. H. Hamza, 59–62. Anaheim, Calif.: Acta.

Cestnik, B. 1987. ASSISTANT PROFESSIONAL, A Software Tool for Inductive Learning of Decision Rules. System User Manual, Edvard Kardelj University, Ljubljana, Yugoslavia.

Clark, P., and Niblett, T. 1989. The CN2 Induction Algorithm. Machine Learning 3(4): 261–283.

Clark, P., and Niblett, T. 1987. Induction in Noisy
Domains. In Proceedings of EWSL '87: Second European Working Session on Learning, 11–30. Bled, Yugoslavia: Sigma.

Dhaliwal, J. S., and Benbasat, I. 1990. A Framework for the Comparative Evaluation of Knowledge-Acquisition Tools and Techniques. Knowledge Acquisition 2:141–166.

Dietterich, T. G., and Michalski, R. S. 1983. A Comparative Review of Selected Methods for Learning from Examples. In Machine Learning: An AI Approach, Volume 1, eds. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, 41–81. San Mateo, Calif.: Morgan Kaufmann.

Fernández, A. 1990. Comparative Study of Prune Methods for Decision Tree Induction Algorithms. Master's thesis, ETSI Telecomunicaciones, Universidad Politécnica de Madrid.

Fisher, D. H., and McKusik, K. B. 1989. An Empirical Comparison of ID3 and Back Propagation. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 788–793. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Gams, M., and Lavrac, N. 1987. Review of Five Empirical Learning Systems within Proposed Schemata. In Proceedings of EWSL '87: Second European Working Session on Learning, 46–78. Bled, Yugoslavia: Sigma.

García, L. E. 1991. A Modification to the ID3 Algorithm for Decision Tree Induction. Master's thesis, ETS Ingenieros de Telecomunicación, Universidad Politécnica de Madrid.

Hayes-Roth, F. 1989. Towards Benchmarks for Knowledge Systems and Their Implication for Data Engineering. IEEE Transactions on Knowledge and Data Engineering 1(1): 101–110.

Kodratoff, Y., and Michalski, R. S. 1990. Machine Learning: An Artificial Intelligence Approach, Volume 3. San Mateo, Calif.: Morgan Kaufmann.

López de Mántaras, R. 1991. A Distance-Based Attribute Selection Measure for Decision Tree Induction. Machine Learning 6(1): 81–92.

Mellis, W. 1989. A General Approach to the Use of Background Knowledge in a Numerical Induction Algorithm. Presented at the Third European Workshop on Knowledge Acquisition for Knowledge-Based Systems, EKAW-90, July, Paris, France.

Michalski, R. S. 1990. Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-Tiered Representation. In Machine Learning: An Artificial Intelligence Approach, Volume 3, eds. Y. Kodratoff and R. S. Michalski, 63–111. San Mateo, Calif.: Morgan Kaufmann.

Michalski, R. S.; Carbonell, J. G.; and Mitchell, T. M. 1983. Machine Learning: An Artificial Intelligence Approach, Volume 1. San Mateo, Calif.: Morgan Kaufmann.

Michalski, R. S.; Mozetic, I.; Hong, J.; and Lavrac, N. 1986. The Multipurpose Incremental Learning System AQ15 and Its Testing Application to Three Medical Domains. In Proceedings of the Fifth National Conference on Artificial Intelligence, 1041–1045. Menlo Park, Calif.: American Association for Artificial Intelligence.

Mingers, J. 1989a. An Empirical Comparison of Selection Measures for Decision Tree Induction. Machine Learning 3: 319–342.

Mingers, J. 1989b. An Empirical Comparison of Pruning Methods for Decision Tree Induction. Machine Learning 4: 227–243.

Mooney, R. J.; Shavlik, J. W.; Towell, G. G.; and Grove, A. 1989. An Experimental Comparison of Symbolic and Connectionist Learning Algorithms. In Proceedings of the Eleventh International Joint Conference on Artificial Intelligence, 775–780. Menlo Park, Calif.: International Joint Conferences on Artificial Intelligence.

Muñoz, P. L. 1991. Elaboration of a Benchmark for Machine-Learning Tools. Ph.D. diss., Facultad de Informática, Universidad Politécnica de Madrid.

Nuñez, M. 1991. The Use of Background Knowledge in Decision Tree Induction. Machine Learning 6(3): 231–250.

Nuñez, M. 1990. Decision Tree Induction Using Domain Knowledge. In Current Trends in Knowledge Acquisition, eds. B. Wielinga, J. Boose, B. Gaines, G. Schreiber, and M. van Someren, 276–288. Amsterdam: IOS.

Nuñez, M. 1988. Economic Induction: A Case Study. In Proceedings of EWSL '88: Third European Working Session on Learning, 139–145. San Mateo, Calif.: Morgan Kaufmann.

O'Rorke, P. 1982. A Comparative Study of Inductive Learning Systems AQ11P and ID3 Using a Chess Endgame Test Problem, UIUCDCS-F-82899, Department of Computer Science, University of Illinois at Urbana-Champaign.

Quinlan, J. R. 1989. Unknown Attribute Values in Induction. In Proceedings of the Sixth International Workshop on Machine Learning, 164–168. San Mateo, Calif.: Morgan Kaufmann.

Quinlan, J. R. 1986. The Effect of Noise in Concept Learning. In Machine Learning: An AI Approach, Volume 2, eds. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell, 149–169. San Mateo, Calif.: Morgan Kaufmann.

Shavlik, J. W.; Mooney, R. J.; and Towell, G. G. 1991. Symbolic and Neural Learning Algorithms: An Experimental Comparison. Machine Learning 6:111–143.

Tan, M. 1990. CSL: A Cost-Sensitive Learning System for Sensing and Grasping Objects. Presented at the IEEE International Conference on Robotics and Automation, 13–18 May, Cincinnati, Ohio.

Tan, M., and Schlimmer, J. C. 1989. Cost-Sensitive Concept Learning of Sensor Use in Approach Recognition. In Proceedings of the Sixth International Workshop on Machine Learning, 392–395. San Mateo, Calif.: Morgan Kaufmann.

Utgoff, P. E. 1989. Incremental Induction of Decision Trees. Machine Learning 4:161–186.

Utgoff, P. E. 1988. ID5: An Incremental ID3. In Proceedings of the Fifth International Conference on Machine Learning, 107–120. San Mateo, Calif.: Morgan Kaufmann.