to be performed with care (Reich 1997). It is critical to perform it with several experts to get an intersubjective opinion. If an evaluation involves comparing learned rules with other knowledge, a particular setup should be followed in which one group of experts evaluates the learned rules with respect to a baseline and another group, serving as the control group, evaluates another arbitrary set of rules compared to the same baseline. The experts must not know to which of the groups they belong. It is outside the scope of this discussion to explain statistical experimental design in depth; nevertheless, without careful attention to such design, results would be unreliable.

The discusser ran the learning program CN2 (Clark and Niblett 1989) on the 31 examples to compare its performance with INLEN. (I thank Peter Clark and Robin Boswell for the access to CN2 and NewID.) CN2 employs the covering algorithm used by INLEN (or AQ15). In addition, it uses entropy measurement to assess the quality of rules, thus can handle noise well. This ability could lead CN2 to produce rules that are more general than those of INLEN. The rules extracted by CN2 are given in Table 5. The presentation follows the self-explanatory format of Table 3. In addition, the last column states how many examples of the same decision a rule covers. No rule covered examples of other decisions. In this sense, the rules are an accurate reflection of the data.

Rules 1 and 4 are more general than their corresponding rules in Table 3, whereas rule 5 is the same for both programs. From the table, and as predicted, CN2 rules tend to be more general than INLEN rules and still correct. A rule that covers more examples is more "reliable" and interesting from an understanding viewpoint. Given the small data set, the table shows that there are two or at most three interesting rules: 1, 5, and 10. The rest cover too few examples to warrant assessment.

When using ML for creating models (e.g., rules) that improve understanding, it is fruitful to use several tools because each provides a different perspective of the data (Reich et al. 1996). The utility of each model depends on its evaluation by experts and by other assessments. One such assessment deals with how well the rules reflect the data. This can be answered quantitatively by executing a performance test. In order to explain it, the discusser executed these tests with CN2. The basic test involves checking how the rules reconstruct the decisions of the examples. This is a coverage or a resubstitution test that produces an optimistic upper-bound estimation of accuracy. The rules created by CN2 achieve 100% accuracy in this test. Another common test used on ML programs, the hold-out, uses about 70-80% of the examples for training and the rest for testing. It can be pessimistic and is inaccurate for small data sets (i.e., those having less than about 1,000 examples in the testing set). The better tests are the k-fold cross-validation (CV) or leave-one-out. These tests can be improved by using the bootstrap method. See Reich (1997) for an elaboration on this subject.

The rule set in Table 5 was created using parameters whose combination, lest the type of rule error estimation, led to the [...]ment. Nevertheless, the rules created were very specific and thus are not too interesting from the understanding purpose viewpoint.

To further illustrate the usefulness of using multiple models for better understanding, the discusser ran the decision tree induction program NewID (Boswell 1990) on the same data. The performance of NewID in a leave-one-out test was 67.7% accuracy, suggesting that its models are "less accurate" than CN2 on this data. The decision tree created (after simple "postprocessing") was

    NoSla
      One: BeCha1
        SlichgeReinf: ReRa
          Low: CoBeRa1
            Low: EXCELLENT 1
            High: GOOD 1
            Average: GOOD 1
        AllChgeReinf: GOOD 2
        None: EXCELLENT 2
      SameTwo: BeCha1
        <> SlichgeReinf: POOR 3
        SlichgeReinf: GOOD 4
      DiffTwo: BeCha1
        <> None: POOR 8
        None: GOOD 2
      None: ReRa
        Low: EXCELLENT 3
        Average: GOOD 4

The number after the decision is the number of examples this leaf covers. Interesting rules can be created by traversing the tree from its root to some leaves. For example, the rule

    IF (NoSla = DiffTwo) & (BeCha1 <> None)
    THEN ConEva = POOR

covers eight examples, thus is fairly general. This rule does not appear in Table 5 or Table 3. It might be interesting to ask experts whether it is interesting. In summary, although we would avoid using models that are poor in performance tests, a "slightly worse" model from a performance viewpoint can still give a different and complementary perspective, thus improving understanding.

Compared to CN2 or NewID, it might be that INLEN performance on a leave-one-out test is better, but this figure is not provided. Therefore, it is hard to assess the suitability of using INLEN to extract knowledge from the database. It is also hard to assess the quality of the data. Arciszewski et al. (1995) discussed the complete process of using ML programs for knowledge extraction including the process of collecting
JOURNAL OF COMPUTING IN CIVIL ENGINEERING / JULY 1998/165
examples, the evaluation of results, and the assessment of the "quality" of the data. It is recommended that when using ML programs, a similar process to the one discussed by Arciszewski et al. (1995) or by Reich (1997) should be used.

One last minor comment: The authors confused COBWEB with CLUSTER, which is also an unsupervised learning program, in their discussion of BRIDGER (page 10).

APPENDIX. REFERENCES

Arciszewski, T., Michalski, R. S., and Dybala, T. (1995). "STAR methodology-based learning about construction accidents and their prevention." Automation in Constr., 4(1), 75-85.
Boswell, R. (1990). "Manual for NewID version 4.1." Tech. Rep. TI/P2154/RAB/4/2.3, Turing Institute, Glasgow, U.K.
Reich, Y. (1995). "Measuring the value of knowledge." Int. J. Human-Comp. Studies, 42(1), 3-30.
Reich, Y. (1997). "Machine learning techniques for civil engineering problems." Microcomputers in Civ. Engrg., 12(4), 295-310.
Reich, Y., Medina, M., Shieh, T.-Y., and Jacobs, T. (1996). "Modeling and debugging engineering decision procedures with machine learning." J. Computing in Civ. Engrg., ASCE, 10(2), 157-166.
Wang, W., and Gero, J. S. (1997). "Sequence-based prediction in conceptual design of bridges." J. Computing in Civ. Engrg., ASCE, 11(1), 37-43.

Closure by Tomasz Arciszewski⁵

The writers appreciate the comments of the discusser. He correctly pointed out that it is critical to assess the quality of the automatically acquired knowledge. However, the objective of the paper was to "describe a novel approach to constructability knowledge acquisition based on the use of machine learning and to demonstrate its feasibility," not to produce any specific domain-related knowledge and to verify its quality. For all these reasons, the writers used only a very small number of examples. Thus, the use of statistical performance evaluation methods for such a small set of examples would simply be questionable, at best. Therefore, the issue of knowledge quality was only marginally addressed in the paper. The writers are aware, however, of the importance of this issue and it is their pleasure to provide a response to the discussion. This response is difficult, however, because the discusser mostly refers his comments to a paper that at the time of writing is still in print.

The writers hope that the discusser's and their own comments about the evaluation of quality of engineering decision rules will initiate a discussion about this critical subject for the future of machine learning in engineering. The second writer became involved in work on automated knowledge acquisition in the mid-1980s. He soon realized that a formal and objective evaluation of quality of acquired decision rules is the key to the acceptance of learning systems by the engineering community. For this reason, he developed in 1989 an initial outline of a method for evaluation of a collection of examples/decision rules in the context of case-based optimization (Arciszewski and Ziarko 1991). In 1992, he conducted studies of performance-based evaluation of learning systems and of collections of decision rules. This work resulted in a formal evaluation method presented in Arciszewski et al. (1992). The method has been used extensively in various machine learning experiments conducted at George Mason University dealing with symbolic selective and constructive induction-based learning systems during the period 1993-1996. The results of this work, which clearly demonstrated advantages and limitations of performance-based evaluation of quality of decision rules, are presented in Szczepanik et al. (1996). Therefore, he recently proposed a semantic evaluation of decision rules in the context of the background knowledge understood as a hierarchical system of concepts and their relationships in a given domain. This method is presented in Arciszewski (1998).

In the context of the writers' experience in the area of quality evaluation of decision rules, they entirely agree with the discusser's comments, which complement the paper and the writers' own discussion of the acquired decision rules. However, one should be very careful in reaching definite conclusions while working with only 31 examples. In this case, any formal performance-based analysis of decision rules can obviously be conducted only in the context of those examples. Moreover, in the case of such a small number of examples the use of statistical methods becomes quite controversial. Therefore, if the objective is to learn about a given domain, one should use a much larger number of examples of a balanced nature within this domain. That would give one a chance to acquire a larger, or better, collection of decision rules that could be correctly evaluated using statistical methods.

The comment about using several tools to learn about a given domain addresses a very complex issue of knowledge acquisition. The writers agree with the discusser that using various tools may produce complementary results in some cases, particularly when significantly different tools are used, for example tools producing decision rules and decision trees, utilizing selective and constructed attributes, etc. However, the recent experiments of Arciszewski et al. (1997) clearly demonstrated that this may not be always the case. When two

⁵Assoc. Prof. of Urban Sys. Engrg., George Mason Univ., Fairfax, VA 22030-4444.