
Appl Intell (2008) 28:17–28

DOI 10.1007/s10489-007-0039-1

Automated case creation and management for diagnostic CBR

Chunsheng Yang · Benoit Farley · Bob Orchard

Received: 27 October 2006 / Accepted: 24 January 2007 / Published online: 17 February 2007

© Springer Science + Business Media, LLC 2007

Abstract With the rapid development of case-based reasoning (CBR) techniques such as case retrieval and case adaptation, CBR has been widely applied to various real-world applications. A successful case-based reasoning system requires a high-quality case base, which provides rich and efficient solutions for solving real-world problems. How to automatically create and manage such a case base is a vital but unsolved problem. This paper tackles this important problem. We propose a methodology for creating cases from readily available large databases that were collected during routine operations. Building on techniques from case-based reasoning and natural language processing, we present a methodology for automatically creating cases at the initial stage of CBR system development. After a detailed description of the methodology, we introduce a case study that validates its usefulness. The experimental results show that the proposed methodology significantly reduces the human effort required for authoring cases, and that high-quality cases for diagnostic CBR systems can be created automatically from historic maintenance and operational data at the initial stage of system development.

Keywords Case-based reasoning · Case bases · Automated case creation · Methodology · Natural language processing

C. Yang · B. Farley · B. Orchard
National Research Council Canada, 1200 Montreal Road, Ottawa, Ontario K1A 0R6, Canada

1 Introduction

Case-based reasoning is one of the major reasoning paradigms in artificial intelligence. A CBR system solves a new problem by retrieving a similar one from a case base, which stores the experienced solutions to past problems. When it cannot find a solution that is similar enough to solve a new problem, a CBR system adapts the solution of a relatively similar problem to the new one. In principle, CBR is different from other AI reasoning approaches such as rule-based reasoning and model-based reasoning. In a rule-based reasoning system, for instance an expert system, the rules reflect the certain or direct relationships between the problems and their solutions. Usually, a problem can be broken down into several sub-problems and solved by firing a set of rules. The rules can be created from text-based documents or from domain experts' experience and know-how. In CBR systems, it is the cases instead of the rules that reflect or document the relationship between the problems and their solutions, which were obtained from experience. A case stores a solution for a problem. Such a relationship might not be absolute, and the solution for a similar problem may not be [...]

When we develop a case-based reasoning (CBR) system for real-world applications, a significant challenge is how to create a high-quality case base. Without a high-quality case base, it is impossible for any CBR system to function well for solving problems. There are three ways to create cases for a CBR system. The first way is to ask domain experts to author cases in terms of their experience in solving the problems. This is very expensive and usually impracticable for developers. The second way is to author cases from existing documentation. Zaluski et al. proposed to author cases from the maintenance documentation, identifying the procedures used to solve the problems [1]. This
method is useful for some applications where operational data was not collected and the historical data is not available for authoring cases. However, the cases created in such a way do not show how or why a solution works. The third method is to author cases from runtime CBR systems. The work in [2] introduced a distributed maintenance strategy, called collaborative maintenance (CM), which provides an intelligent methodology to support long-term case collection and authoring. With this method, we are unable to create any cases during the early stage of development and it takes a long time to collect enough cases for some CBR applications.

The modern operation of complex systems such as aircraft and trains generates a huge amount of operational and maintenance data. These data are a rich resource for creating cases for diagnostic CBR systems. To develop a CBR system for real-world applications, we expect developers to be able to create a high-quality case base at the early stage of system development. In this study, we tackle this problem: to automatically create cases from historical data using advanced techniques emerging in the data mining and reasoning communities. Recently, some researchers have started to tackle the automated case creation issue. Yang and Cheng proposed to identify cases from large databases by using data mining techniques [3]. We believe that automated case creation is a fundamental issue for CBR systems. In particular, we expect to create cases from available historic data at the initial stage of system development and to automatically collect or author the cases at the on-line runtime stage.

The data collected from routine operations of complex systems or equipment are semi-structured and in free text. Identifying useful information in them for case creation is a challenging and time-consuming task. Manually creating cases requires significant human effort and domain knowledge [4]. In order to reduce the human effort and overcome the difficulty of creating high-quality cases, we have developed a framework for authoring cases from free-text maintenance databases [5]. This framework focused on how to extract useful information to create the cases for CBR systems in maintenance domains. In this study, we extend our work to diagnostic CBR systems. Building on the established techniques from natural language processing (NLP) [6, 7] and CBR, we developed a data-mining-based methodology for automated case creation for diagnostic CBR systems. This methodology helps extract maintenance information and related symptoms from the readily available historic maintenance and operation databases, and automatically creates the cases that document these historical relationships between the problems and solutions.

The developed methodology has many applications. It can be used to create cases not only for diagnostic applications but also for other applications such as Q/A systems for e-learning applications. We have successfully applied this methodology to machinery maintenance applications. We developed an automated case creation system for an aircraft diagnostic system in which CBR was used as a reasoning technique for decision-making support. We have created a high-quality case base from the historical operation data provided by the airline industry. This paper presents the developed methodology in detail. We also report a case study that demonstrates how to apply the proposed method to automated case creation for a CBR application. Experimental results from the case study show that the proposed methodology significantly reduces the human effort required for authoring cases from the huge amount of historic data, and that we are able to create a high-quality case base for diagnostic CBR systems.

The remainder of the paper is organized as follows. Section 2 details the methodology for automated case creation; Section 3 presents a case study for validating the usefulness of the methodology, along with the experimental results; Section 4 discusses the results and limitations of the methodology; the final section concludes the paper and points out the direction of future work.

2 Methodology

In general, the operational data collected in real-world applications contains the symptom messages generated by the various sensors installed on the system, while the maintenance data records the maintenance actions taken to fix problems. In practice, operators or technicians create the maintenance data. When operators take any maintenance action, they create a record describing the problem and its repair action. Unfortunately, these records are often semi-structured and in free-text format. Although this data records useful experience for solving problems, it is not easily interpreted by humans to create the formal cases that might be used to solve similar problems in the future.

To describe our methodology, we first define some common concepts on the operational and maintenance data.

Definition 1 (Operation data, symptom): For a given physical system, a symptom is an instance ($sd_i$) collected in the operational data. $sd_i$ contains state parameters ($sp_{i1}, sp_{i2}, \ldots$), a timestamp ($t_i^s$), and a symptom description ($sdt_i$). $sdt_i$ is a text message describing the symptom. Therefore, we can express a symptom as $sd_i \supseteq \{sp_{i1}, sp_{i2}, \ldots, sdt_i, t_i^s\}$ and the collected operation data for the system as a database (SD), $SD \supseteq \{sd_1, sd_2, \ldots, sd_i, \ldots, sd_k\}$.

Definition 2 (Maintenance data): For a given system, its maintenance data can be formulated as a general instance ($md_i$), which contains some common attributes: eventDateTime ($t_i^e$), repairDateTime ($t_i^r$), problemDescription ($pd_i$), repairAction ($a_i$), instanceId ($id_i$), partId ($pno_i$) and repairPlace ($rp_i$). Usually, problemDescription and repairAction are free-form text. For a given real-world application, its maintenance data is collected as a database, MD, where $MD \supseteq \{md_1, md_2, \ldots, md_i, \ldots, md_q\}$.

Definition 3 (Case base): A case ($c$) is defined as $c = \{p_c, s_c, m_c\}$. $p_c$ denotes a set of problem attributes, which contain a problem description and its symptoms; $s_c$ is a set of solution attributes, either a single action or several repair actions for fixing the problem; $m_c$ contains all attributes related to case base maintenance, including redundancy, inconsistency, success times, failure times, positive actions, and negative actions. Let CB denote a case base, where $CB \supseteq \{c_1, c_2, \ldots, c_i, \ldots, c_n\}$.

From the above definitions, our task is to create a CB from the given SD and MD. Table 1 shows the proposed methodology, which automates the procedures for case base creation as four main processes: identification of symptoms, identification of solutions to the problems, creation of a potential case, and case base maintenance. These processes are described in detail below.

Table 1 The methodology for automated case creation

Input:  A given symptom database SD ⊇ {sd_1, sd_2, ..., sd_i, ..., sd_k};
        A given maintenance database MD ⊇ {md_1, md_2, ..., md_i, ..., md_q};
        Window size for case evaluation (w)
Output: CB ⊇ {c_1, c_2, ..., c_i, ..., c_n}

CB = ∅
For each md_i in MD {
    p_c = identifySymptoms(SD, md_i);        // See Section 2.1
    s_c = identifySolution(md_i);            // See Section 2.2
    If ∃ p_c ∩ ∃ s_c do
        c_tmp = createTemCase(p_c, s_c, w);  // See Section 2.3
        CBRMgr(CB, c_tmp);                   // See Section 2.4
}

2.1 Identifying symptoms for a problem

The task of this process is to find $p_c$ from the SD for a given $md_i$ in MD. In particular, identifying symptoms for a problem means finding one or several $sd_i$ in SD by using a free-text matching approach. As defined above, the symptom description ($sdt_i$) in an $sd_i$ is formal (predetermined) text, while the problem description in $md_i$ is free text. To match such free text to the formal text of symptoms, we use an N-gram algorithm. An N-gram is a fragment of N consecutive letters of a text phrase. For a given text phrase of length L, there are L − N + 1 N-grams. Such a matching algorithm helps to reduce the impact of misspellings, abbreviations, and acronyms. After considering the trade-off between algorithm performance and matching accuracy, we selected N to be three (tri-gram matching). For example, in the tri-gram matching algorithm, the word “diagnose” is disassembled into 6 tri-grams: {dia, iag, agn, gno, nos, ose}. If the word “diagnose” is matched against the misspelled “diagnoes”, the tri-gram algorithm identifies them as two similar text phrases. When matching a semi-structured or free-text string to a well-defined problem description that is designed for linking the sensor data to symptoms, the N-gram matching algorithm greatly improves the efficiency of the string or text matching.

2.2 Identifying solutions for a problem

Given an $md_i$ in MD, we have to determine the solution $s_c$. The task of solution identification is to extract repair action and component information from a given $md_i$ using NLP techniques. In general, the free text of the repair action description in $md_i$ contains one or more “sentences” with extensive use of acronyms and abbreviations, omission of certain types of words (such as the definite article), and numerous misspellings and typographic errors. Extracting the required specific information, namely, the pieces of equipment involved in the repair and the actions performed on the equipment (replace, reset, repair, etc.), from the free text is a typical natural language understanding procedure. This process, shown in Fig. 1, consists of the following main steps:

• dictionary and acronyms database creation,
• preprocessing and morphological analysis,
• grammar and parsing, and
• semantic interpretation.

Dictionary and acronyms database creation. To carry out the NLP process for understanding the free-text maintenance messages, we have to build a lexicon, which contains the words, acronyms and abbreviations used in the specific domain, and to create a knowledge base for interpreting these messages. Without the lexicon, it is not possible for readers to make any sense of what they read in a given application and to make a wise decision. For different applications, we have to build the corresponding lexicon and knowledge base. The quality of the lexicon and knowledge base directly affects the ability to create good cases from the historic data. For instance, in our application, we have built a lexicon from a huge amount of historical data. This lexicon contains 28240 noun entries, 23273 verb entries, 9073 adjective entries, 1069 adverb entries, and 1008 entries for prepositions, modal and auxiliary verbs, and
various function words. A word may have several entries, one for each of its typical usages. A certain number of common recurrent abbreviations have been added to the lexicon as well. Some acronyms are added to the lexicon in order to understand their meaning and corresponding expression. In order to make the appropriate semantic associations and derive the intended meaning of the maintenance message, we have to create a knowledge base which contains sufficient domain knowledge. In this work, we created a knowledge base which represents the FTRAMs (Free-Text Repair Action Messages) domain. In the knowledge base created for our domain application, each word of the dictionary that may be encountered in a message is semantically classified. For instance, a number of nouns represent tests, and a “test” class is created in the knowledge base with every test noun put under it. Other nouns represent pieces of equipment, and so a “piece of equipment” class is also created in the knowledge base. In the same way, we also create some knowledge for the verbs and tasks. The knowledge in the knowledge base is represented with the Prog language. Here is an example of two entries in the knowledge base:

    (make-sense START-VALVE piece-of-equipment)
    (make-sense CHANGE replacement-action)

This format corresponds to the one used in our semantic interpretation module. With the created domain knowledge, the system can conduct the interpretations of a specific class of objects, class of actions, class of grammatical operators, and so on.

[Fig. 1 Main function diagram of NLP: the message $md_i$ passes through morpho-lexical analysis (lexicon), syntactical analysis (parser/grammar), and semantic interpretation (knowledge bases for interpretation) to produce the output. Further labels in the figure: Words, Evaluation.]

Preprocessing and morphological analysis. The main objective of this step is to determine what each word is exactly or might be. Words may have different morphologies, depending on properties like grammatical number and gender for nouns or person and tense for verbs. The morphological analysis makes it possible to retrieve the root of a word and tell its grammatical features. This analysis also helps keep the dictionary smaller, because only the root form needs to be cited. In many cases, a word may have multiple interpretations. For instance, the word “tests” may be the noun “test” in the plural or the verb “test” in the third person singular of the present tense. In our morphological analysis, a word is determined in the following order: the root word in the lexicon → the acronym in the database → the abbreviation → mistyping or misspelling. For example, the words “chks” and “chgd” are determined neither as the root form of a word nor as an acronym in the database. They are determined as abbreviations: “chks” is for “checks” and “chgd” is for “changed”.

Grammar and parsing. In our approach, the grammar is based on the Alvey grammar, which was developed within the Alvey Natural Language Toolkit (ANLT) system. The ANLT sentence grammar is written in a formalism similar to that of Generalized Phrase Structure Grammar (GPSG) [16]. We found that most structures are covered by this grammar. Rules for the structures with a missing auxiliary verb or predicative verb have been derived from the existing ANLT rules. Some new rules had to be formulated to handle certain idiomatic expressions, like references to a troubleshooting or maintenance manual, or piece descriptions with serial and part numbers. We also modified some rules in order to make them less strict with regard to subject-verb agreement and to allow for noun phrases with no article. In the semi-structured and free-text data, the biggest problem is the lack of punctuation. In a message that has no punctuation signs, the grammar may return more than one parse, that is, several different word associations and propositions. Take, for instance, “CHANGED START VALVE OPERATION CHECKED OK”. If there were a period after the word “VALVE”, there would be 2 sentences. The second sentence has 2 parses: (1) an active sentence [NP + active VP] operation checked ok; and (2) a noun phrase [N + passive VP] operation checked ok. The second parse is generated because a rule says that a proposition may be a single noun phrase, and another rule says that a noun phrase may be a noun followed by a verb phrase in the passive voice. However, since there is no punctuation to tell the parser where to stop, it will return all the possible parses
including as many words as possible. For this given example, the parser will return the following possible parses with the noun phrases underlined:

(1) a. changed start valve operation checked ok
(2) a. changed start valve operation; b. checked ok
(3) a. changed start valve; b. operation checked ok
(4) a. changed start valve; b. operation checked ok
(5) a. changed start; b. valve operation checked ok
(6) a. changed start; b. valve operation checked ok
(7) a. changed; b. start valve operation checked ok
(8) a. changed; b. start valve operation checked ok

The order in which the parses are generated depends on how the grammar rules have been written, and there is no guarantee that the first one will be the right one. For this example, the right one is the third one. Therefore, we need a semantic interpretation that represents the meaning of the input proposition unambiguously.

Semantic interpretation. After the parser has analyzed a semi-structured message and returned a list of grammatically valid parses, each parse is semantically interpreted, through its associated lambda-calculus logical form, with the ultimate goal of retrieving information about pieces of equipment and the actions upon them, and about the verifications carried out and their results. This is also where one determines whether a word association proposed by the parser makes any sense, and where one rejects the parses which do not make sense. Let us take the above example again, “CHANGED START VALVE OPERATION CHECKED OK”. One of its parses, “changed start valve operation”, contains the word “changed”, which the domain knowledge base describes as an action to be done on a piece of equipment. Interpreted as such, the action would apply to the word “operation”, which is not classified as a piece of equipment. As a result, this parse is rejected on that basis, and the next candidate parse is tried for interpretation. The interpretation of “changed start valve” succeeds because “valve” is classified as a piece of equipment and “changed” can now be interpreted successfully. Each noun phrase interpreted as a piece of equipment is checked against the knowledge base to retrieve, if possible, the part number and series number.

In summary, in the natural language understanding procedure, the semi-structured free text that describes the repair action is first preprocessed to determine the nature and properties of each word and token against the lexicon, which contains the words, the acronyms and the abbreviations. Then the sequence of morphologically analyzed items is syntactically analyzed with a parser and checked against a grammar that describes the patterns of valid propositions. Finally, the result of the syntactic parsing is semantically interpreted to generate the class of repair action and the equipment on which the action is performed. For example, the free text that describes the repair action in the snag message, “#1 EIU replaced”, is analyzed as shown in Table 2.

Table 2 The result of NLP for a real maintenance message

Attribute name of solution (S)    Value
Part name                         EIU
Part number                       3957900612
Repair action                     REPLACE
Part series number                3-25-8-2-40D

2.3 Creating a potential case

Having $p_c$ and $s_c$ from the previous processes, this process creates a potential case, $c_{tmp} = \{p_{c_{tmp}}, s_{c_{tmp}}, m_{c_{tmp}}\}$ (where $p_{c_{tmp}} \equiv p_c$, $s_{c_{tmp}} \equiv s_c$; $m_{c_{tmp}}$ is to be determined). A potential case is a structured case representation, which might be added to a case base as a new case or be merged with other cases based on the case base maintenance policies. These policies are presented in Section 2.4. Before we do case base management for the created case candidate, we have to check this potential case to determine whether the symptoms related to its problem have disappeared within a period (the window size, denoted $w$) after the repair actions were taken. The window size is set according to the requirements of the real-world application. We assume that if the symptoms of the problem disappeared in the given period (window size), then the repair was successful; otherwise, the repair failed to fix the problem. We use the following rule to determine if a potential case is a positive or a negative case.

\[
c_{tmp} =
\begin{cases}
\text{posCase} & \text{if } \neg\exists\,(sd_i \cong p_{c_{tmp}}) \;\; \forall\, (t_i^s - t_i^r) \in w \\
\text{negCase} & \text{if } \exists\,(sd_i \cong p_{c_{tmp}}) \;\; \forall\, (t_i^s - t_i^r) \in w
\end{cases}
\]

where $t_i^r$ is the repair date time for a given $md_i$ in MD; $t_i^s$ is the date time for a given $sd_i$ in SD; $w$ is the window size for checking the case quality.

In real-world applications, both positive cases and negative cases are useful for CBR systems. In a sense, the negative cases document the lessons learned from the past. When a CBR system recommends a negative case for a problem, one can avoid repeating the same repair action and can try other possible solutions.

2.4 Maintaining the case base

Case base maintenance, also called case base management, plays an important role in automated case creation. The main task of case base management is to effectively manage a case base for CBR systems by adding new cases, detecting redundant cases, and merging similar cases. In the real-world
applications, the size of a case base will grow considerably if the CBR system does not provide efficient management support for removing unused cases. Such growth leads to inefficient case retrieval. On the other hand, adding a new case to a case base or merging a case with an existing case in the case base also requires the support of case base management. To date, a great deal of research effort has been devoted to case base maintenance [8–10]. The research has focused on a number of crucial issues such as the case life cycle, redundancy detection, and adding/deleting cases to/from a case base. Some of the earliest case base maintenance works [11] looked at the development of maintenance strategies for deleting/adding cases from/to existing case bases. For example, in [12], a class of competence-guided deletion policies for estimating the competence of an individual case and deleting the case from a case base is presented. Redundancy and inconsistency detection for case base management in CBR systems has also attracted a lot of attention from researchers [13, 14]. These approaches are still useful for case management in our methodology when we automatically create a case from the historic database.

In the maintenance of a complex system, a single problem/event might occur many times, and the repair action for this problem/event may be either the same or different. In such a case, we expect to create a single case to store these experiences rather than multiple cases. Therefore, we need a sophisticated approach to manage the case base when adding a potential case to an existing case base. We use the algorithm shown in Table 3 for managing the created case bases.

Table 3 An algorithm for case base management

Input: A given CB ⊇ {c_1, c_2, ..., c_i, ..., c_n};
       A potential case c_tmp = {p_ctmp, s_ctmp, m_ctmp}

For all c_i in CB {
    If ¬∃ c_i similar to c_tmp              // where c_i = {p_ci, s_ci, m_ci}
        CB = CB ∪ c_tmp;                    // add a new case to case base
    Else if (p_ci ≊ p_ctmp) ∩ (s_ci ≊ s_ctmp)
        m_ci = upgradeMattributes(c_i);
    Else if (p_ci ≊ p_ctmp) ∩ (∀ c_i ¬∃ (s_ci ≊ s_ctmp)) do
        m_ci = upgradeMattributes(c_i);
        s_ci = s_ci ∪ s_ctmp;
}

The main goal of this algorithm is to determine the attributes of $m_c$ for a given temporary case. The first step is to determine whether the potential case could be a new case. We check the redundancy or inconsistency of the potential case against the existing case base. If the case does not match any case in the existing case base, it could be a new case, and we add it to the case base. Otherwise, we move on to the second step, which conducts case base management for the existing case base if we find a case ($c_i$) that is similar to $c_{tmp}$. This includes updating an existing case in the case base, deleting a case, and merging multiple cases into a new case. This operation is realized by updating the attributes of $m_c$. If we detect a case ($c_i$) in the existing case base that is similar to the potential case $c_{tmp}$, i.e., $p_{c_i} \approxeq p_{c_{tmp}}$ and $s_{c_i} \approxeq s_{c_{tmp}}$ (where $\approxeq$ means that two items are similar, as computed with the nearest neighbor algorithm in our system), then $m_{c_i}$ is updated to reflect the effect of the repair action applied to the problem. If $c_{tmp}$ is a positive case, we increase the count of successful repair actions in $m_{c_i}$; otherwise we increase the count of unsuccessful repair actions in $m_{c_i}$. In the same way, if we detect a case ($c_i$) that has a similar problem description but a different solution, i.e., $p_{c_i} \approxeq p_{c_{tmp}}$ and $s_{c_i} \not\approxeq s_{c_{tmp}}$, we update the existing case by adding the new solution to it, so that the case becomes more powerful for solving similar problems in the future. In the algorithm of case base management, the similarity computation is supported by the CBR engine introduced in Section 3. The CBR engine provides a nearest neighbor algorithm for computing the similarity of problems and solutions between two cases.

3 Case study

The methodology for creating cases from large semi-structured databases is mainly developed for diagnostic CBR applications, and in particular for the maintenance of complex systems such as trains and aircraft. However, the format of the raw data (including operational and maintenance data) may differ between applications. To apply the proposed methodology, we have to bridge the gap between the raw data format and the defined data format. This can be done by preprocessing the original data. In this section, we present a case study to show how we apply this methodology to real-world CBR applications for automated case creation and how useful it is for reducing the human effort required for authoring cases.

This work was applied as part of a project called the Integrated Diagnostic System (IDS) [15]. IDS is an applied artificial intelligence system that supports the decision-making process in aircraft fleet maintenance. IDS integrates two kinds of reasoning techniques: rule-based reasoning and case-based reasoning. To create cases for IDS, we used the collected maintenance database. One important piece of data in the database is the snag message (snag is a common term in aviation for an equipment problem; a snag message is a record of the problem and the repair action). A snag is a transcript of the hand-written notes describing a problem (reported by pilots,
Appl Intell (2008) 28:17–28 23

Table 4 An example of the raw

NN6615 437820001NM1003286 2312 2312ACA01058P28Q0CL6YUL ACA0646RT RMA
CAPT ROLL CTL SSTU 4CE1”. R 7. I2000-09-23NNDEFN 0000000000000
0000000000000 0000000000000 00000000000000 40227AC 74577LNNS ORDER
AC74577 2001-01-22 14:07:006650
27-92-41-501 42000-09-2506.36.00FIXYWG 26525AC 26525NNNNNN 000000000000
AC26525 2001-01-30 16:00:00.898990
0100010 001Y0000000010000NNNAC002FD 9W19XFEA 150000000042983622-9852-003
4V792 111AC26525 2001-01-30 16:00:00.89916023-80-0100 Y
100010001 Y0000000010000NN AC002EA 150000000042983 1467 AC26525 2001-01-30
16:00:00.89921023-80-0100 Y

other crew or maintenance technicians) and the repair actions carried out to fix the problem. It is composed of well-defined, fixed fields describing the date, the location, a unique snag identifier, etc., as well as free text describing the problem symptoms, the pieces of equipment involved in the repair, and the actions performed on them. It is possible for someone to create a potential case by combining the information in the snag message with information in the symptom database. To help the user create cases from the historic snag database, we developed an off-line tool called SROV (Snag Ratification Object Validation). This tool allows the user to browse the snag database, clean up the contents of the snag message, and convert the snag message into a potential case. However, such case creation requires significant human effort and domain knowledge. Also, when users interpret a great deal of data every day, there is a strong likelihood of errors being made. Therefore, we applied the proposed approach to help create cases from the huge amount of maintenance data at the initial stage.

We first developed a bridge component for preprocessing the raw maintenance data into the defined data format. Then we implemented the system. Finally, we conducted experiments for automated case creation and evaluated the results by comparing them with results from manual case creation.

3.1 Preprocessing of raw data in the maintenance database

The raw data like that shown in Table 4 is preprocessed to give clean data as shown in Table 5. The parse is simple since the various fields of the raw data are in a predetermined order and of fixed size. We extract the date, the place where the fix was done, a unique identifier, etc., as well as the free text describing the problem symptoms and the repair actions. The free text contains many unnecessary symbols or words. To deal with this, we filter the unnecessary characters (such as ‘#’, ‘.’, ‘∗’ and so on) and, using a list of “poor single” words, we remove some words as well. The list of poor single words is constructed by analyzing a large set of snag messages to see which words were not helpful in matching the text of the symptom data. For example, the free text of the problem description obtained from the raw snag message, RMA 27-93-2127 AVAIL. REPEAT E/W “F/CTL ELAC 1 FAULT” “ELAC 1 OR INPUT OF CAPT ROLL CTL SSTU 4CE1”. R 7., after processing, results in RMA 27-93-2127 AVAIL REPEAT F/CTL ELAC 1 FAULT ELAC 1 INPUT CAPT ROLL CTL SSTU 4CE1, as shown in Table 5.

Table 5 A formal instance (mi) obtained from Table 4

Attribute name            Contents
eventDateTime(tie)        2001-01-22 14:07:00
instanceId(idi)           M1003286
Problemdescrition(pdi)    RMA 27-93-2127 AVAIL REPEAT F/CTL ELAC 1 FAULT ELAC 1 INPUT CAPT ROLL CTL SSTU 4CE1
partId(pnoi)              222
repairPlace(rpi)          YWG
repairDateTime(tir)       2001-01-30 16:00:00
repairAction(ai)          REPLACED CAPTAINS SIDE AMM 27-92-41-501

From Table 5, we can create a potential case candidate with our methodology. This case candidate is shown in Table 6. This temporary case will be checked against the case base management policies. If this case does not match any case in the case base, it will be added to the case base. If it matches a case in the case base, the existing similar case will be updated by integrating the case base management information of the temporary case into it.
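The character and word filtering described in Section 3.1 can be illustrated with a minimal Python sketch. It is not the ACCS implementation: the symbol set `NOISE_CHARS`, the word list `POOR_WORDS`, and the function name `clean_free_text` are assumptions, chosen only so that the example snag message above reduces to the cleaned text of Table 5. Straight quotes stand in for the typographic quotes of the printed message.

```python
# Hypothetical noise symbols, stripped from both ends of every token.
NOISE_CHARS = '#.*"\''

# Hypothetical "poor single" words. The paper builds its list by analyzing
# a large set of snag messages; that list is not reproduced here.
POOR_WORDS = {"E/W", "OR", "OF", "R", "7"}


def clean_free_text(text: str) -> str:
    """Strip noise symbols from each token and drop poor single words."""
    kept = []
    for token in text.split():
        token = token.strip(NOISE_CHARS)
        if token and token.upper() not in POOR_WORDS:
            kept.append(token)
    return " ".join(kept)


raw = ('RMA 27-93-2127 AVAIL. REPEAT E/W "F/CTL ELAC 1 FAULT" '
       '"ELAC 1 OR INPUT OF CAPT ROLL CTL SSTU 4CE1". R 7.')
cleaned = clean_free_text(raw)
print(cleaned)
```

Stripping symbols only at token boundaries keeps identifiers such as 27-93-2127 and F/CTL intact, which matters because they are later matched against the symptom data.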


Table 6 A potential case (Ctmp) created from Table 5

Ctmp      Attribute   Attribute name        Contents
m_ctmp    m_c1        Case ID               Case-1
          m_c2        Case creation date    2002-04-05
          m_c3        EventDateTime         2001-01-22 14:07:00
          m_c4        Case quality          posCase
          m_c5        Success times         1
          m_c6        Failure times         0
p_ctmp    p_c1        Problemdescrition     RMA 27-93-2127 AVAIL REPEAT F/CTL
          p_c2        Symptoms              WRN321 FLR1188 WRN320 WRN340
s_ctmp    s_c1        Repair station        YWG
          s_c2        Repair date           2001-01-30 16:00:00
          s_c3        Repair actions        Remove/Install (replace)
          s_c4        Equipment (No)        27-92-41-501
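A potential case such as the one in Table 6, together with the add-or-update policy applied to it, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' Java engine: the `Case` fields are a subset of Table 6, Jaccard overlap on symptom codes is swapped in for the nearest neighbor similarity the paper refers to, and the 0.8 threshold, the `add_or_update` name, the second and third example cases, and the symptom code WRN999 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    symptoms: frozenset      # symptom codes, e.g. {"WRN321", "FLR1188"}
    repair_action: str       # e.g. "Remove/Install (replace)"
    equipment: str           # e.g. "27-92-41-501"
    success_times: int = 0
    failure_times: int = 0


def jaccard(a: frozenset, b: frozenset) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(x: Case, y: Case) -> float:
    """Average of problem similarity and solution similarity (assumption)."""
    problem = jaccard(x.symptoms, y.symptoms)
    same_solution = (x.repair_action, x.equipment) == (y.repair_action, y.equipment)
    return (problem + (1.0 if same_solution else 0.0)) / 2


def add_or_update(case_base: list, candidate: Case, threshold: float = 0.8) -> Case:
    """Add the candidate, or fold its counters into the most similar case."""
    best = max(case_base, key=lambda c: similarity(c, candidate), default=None)
    if best is not None and similarity(best, candidate) >= threshold:
        best.success_times += candidate.success_times
        best.failure_times += candidate.failure_times
        return best
    case_base.append(candidate)
    return candidate


symptoms = frozenset({"WRN321", "FLR1188", "WRN320", "WRN340"})
case_base = []

# The candidate of Table 6: one successful repair.
first = Case("Case-1", symptoms, "Remove/Install (replace)", "27-92-41-501",
             success_times=1)
add_or_update(case_base, first)

# Hypothetical later record: same problem and fix, but the repair failed.
repeat = Case("Case-tmp", symptoms, "Remove/Install (replace)", "27-92-41-501",
              failure_times=1)
merged = add_or_update(case_base, repeat)

# Hypothetical unrelated problem: stored as a new case.
other = Case("Case-tmp2", frozenset({"WRN999"}), "Remove/Install (replace)",
             "27-92-41-501", success_times=1)
add_or_update(case_base, other)
```

Folding a matching candidate into the existing case is what lets a single case accumulate both success and failure counts, rather than storing one case per maintenance record.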

3.2 Implementation

After preprocessing the raw data in the maintenance database, we implemented the automated case creation system (ACCS) following the proposed approach.

The ACCS, as shown in Fig. 2, comprises a bridge for preprocessing the raw data and four components for case creation: symptom identification, solution identification, potential case creation, and case base maintenance. All components but the bridge were developed following the proposed methodology. The bridge was designed and implemented to preprocess the raw data from different application domains and map the data to the well-defined format for our methodology. This component has to be implemented separately for each application. The potential case creation component contains two modules: case candidate creation and case quality identification. In the implementation, the development environment is Java, Oracle, and PrologSystem. Java is used as the programming language for system implementation; Oracle is used to store and manage the case bases; and PrologSystem is used to implement the knowledge base system for natural language processing in extracting the solution information from the maintenance database.

Fig. 2 The ACCS system implementation (block diagram; blocks: Raw Data in Maintenance, Bridge, Well-defined Maintenance Data (MD), Operational Data (SD), Symptom Identification, Solution Identification, Case Candidate Creation, Case Quality Identification, Case Base Maintenance with Redundancy Detection and Inconsistency Detection, Java-based CBR Engine, Created Case; intermediate outputs: mdi, Pc, Sc, Ctmp, Cpos or Cneg)

In implementing the system, we tried to find a Java-based case-based reasoning engine among commercial or free software packages. Unfortunately, we failed to find such a CBR software package. Therefore, we developed a CBR engine for our implementation. This CBR engine is a Java-based software package that is able to support the development of CBR applications. The developed CBR engine provides not only case retrieval functionality but also case base management functionality. For retrieving cases from a case base, the CBR engine provides a nearest neighbor algorithm for computing the similarity between the problem descriptions and between the solutions of two different cases. From the viewpoint of case base management, the developed CBR engine provides the redundancy and inconsistency detection support for implementing the case base management policies that we proposed in the methodology. Redundancy detection checks whether a case similar to a potential case already exists by computing the similarity between the two cases. Inconsistency detection checks whether a potential case is a complete case that is qualified to be added or updated to a case base, and whether a potential case has solutions that conflict with those of the existing cases for a similar problem. For detecting inconsistency of a case, we mainly check whether the problem attributes or solution attributes have meaningful values assigned by the case creation processes.

The case base maintenance component in ACCS is implemented with the support of the developed Java-based CBR engine. With the support of the developed CBR engine, we also added a component to the ACCS for domain experts to evaluate the created cases and manage case bases. This case evaluation component provides several necessary functionalities for supporting case base management or maintenance through an interactive environment. This interactive environment allows users to browse the created case base and to evaluate the cases one by one by checking the original maintenance records, problem symptoms, problem descriptions, and repair actions. It also provides basic support for users to conduct case base maintenance operations such as modifying a case, deleting a case, and merging multiple cases. Figure 3 shows the main window of the tool. With the tool, the domain experts can evaluate and manage the case base easily.

Fig. 3 The main window of the case evaluation tool

3.3 Experiments and results

We performed some experiments to test the effectiveness of the methodology. To our knowledge, there is no similar system developed for automated case creation and management in diagnostic CBR applications, so we cannot evaluate the methodology by comparing our system with other similar methodologies or systems. Therefore, the objective of our evaluation is to validate the usefulness and effectiveness of the proposed methodology from two aspects: the quality of cases created using our developed methodology and the reduction of the human effort required for case creation. We used data collected over two years. This dataset contains 62523 maintenance records (messages). We conducted the experiments as follows:

First, we asked a domain expert to manually author cases using SROV, which is an offline tool developed for manual case creation from the available historical operation and maintenance data. As we mentioned above, this tool allows the user to browse the maintenance database, clean up the contents of maintenance records, and convert them into a potential case if the expert believes that the data can be useful for case creation. This tool does not allow operators to do case base management. In other words, the operators cannot determine any useful information or case management attributes for a case in which the solution may be applied to a similar problem many times. With the help of SROV, an expert cleaned these data and created 352 useful messages. From these clean data, an expert tried to create the cases in several sessions in order to reduce the influence of fatigue. The times from the different sessions were summed. As a result, the experimenter created 37 cases after taking 230 minutes.

Fig. 4 Comparison of the experimental results (bar chart comparing Auto and Manual case creation by number of cases and by time in minutes)
Second, we conducted experiments for automated case creation using the same data and our developed ACCS. As we described in Section 2, our developed methodology requires a dictionary and an acronym database in order for ACCS to be able to identify the solution from the maintenance records. Using the data provided to the manual case creation experiments, we created a substantial English lexicon. This database contains 28240 noun entries, 23273 verb entries, 9073 adjective entries, 1069 adverb entries, and 1008 entries for prepositions, modals and auxiliaries, and various function words. A word may have several entries, one for each of its typical usages. A certain number of common recurrent abbreviations have been recorded in this lexicon as well. Moreover, we also defined some acronyms in the database so that our system can access them. These acronyms include the common ones from the Aerospatiale industry in general and Airbus acronyms.

Using the created lexicon, the developed ACCS, and the same data used in the manual case creation, we conducted the experiments for automated case creation. Using ACCS, we automatically created 35 cases that have better quality. It is interesting that not every clean snag message contains completely useful information for creating a potential case, because either the symptoms are not found in the symptom database or the fix does not exist in the snag message. Among the 35 cases, 21 cases are created from a single snag message and consist of either a positive case or a negative case; 14 cases are linked to multiple snag messages, which recorded similar solutions for similar problems or the same problem, and they contain information on the successful or failed repair action in the attributes of case base management. From the statistical results, 45 snag messages were linked to those 14 cases. In total, 66 clean snag messages were useful for creating cases.

4 Discussion and limitations

Figure 4 and Table 7 show the results of the experiments for creating the cases manually and automatically. Figure 4 shows the time consumed by manual case creation and automated case creation. Table 7 shows the statistical information on case quality for manual case creation and automated case creation. From the viewpoint of case-creating time, ACCS took only 2 minutes to complete the same work that a human took 230 minutes to do. In other words, our methodology is much faster than the manual approach. From the viewpoint of case quality, the experimental results from ACCS showed that the cases created with our methodology contain richer information or experience for solving problems. As shown in Table 7, all 37 cases created in SROV are single solution cases. That is, each case contains only a single solution for a problem. On the other hand, 14 of the 35 cases created with ACCS were multiple solution cases. In other words, these cases contain multiple solutions for a problem and provide much richer information on repair actions taken in the past. Such useful information includes how many times a solution succeeded in solving a problem and how many times it failed. Moreover, among the 14 multiple solution cases, 5 cases are positive cases, 3 cases are negative cases, and 6 cases are combinations of positive cases and negative cases. The main reason why our developed methodology can create better quality cases than a human does is that we applied NLP techniques to identify the solutions from the maintenance records. The lexicon created from the huge maintenance database helps identify accurate information from the free-text data. Another reason is that we provide effective policies for case base management in the developed methodology. These policies help to combine multiple similar cases into one single case for the same problem.

Table 7 The statistical information on case quality (N/A stands for not applicable)

Case category                             Manual case creation   Automated case creation
Single solution cases                     37                     21
Multiple solution cases                   N/A                    14
Positive cases with single solution       30                     14
Negative cases with single solutions      7                      7
Positive cases with multiple solutions    N/A                    5
Negative cases with multiple solutions    N/A                    3
Combined cases with multiple solutions    N/A                    6

We are trying to apply the developed methodology to an e-learning application to automatically create cases from the collected Q/A database for case-based Q/A systems. In this application, the data is also in a semi-structured, free-text format. The questions from students are free text, and the answers from tutors are email messages. These data are very similar to our defined data. Therefore, we can directly apply our methodology to this application.

Although the developed methodology can automatically create cases from large-sized semi-structured data, the cases have to be evaluated in a field trial before we apply them to a CBR application. In many applications, they may need to be evaluated by domain experts. As we described, identifying the solution information from semi-structured text using our methodology requires a domain-oriented lexicon and knowledge base that have to be created based on different domain requirements and domain knowledge. These are the limitations of the methodology.

5 Conclusions and future work

In this paper, in order to reduce the difficulty of case creation and management for diagnostic CBR applications and to be able to effectively create cases at the initial stage of development of a CBR system, we developed a methodology for automatically creating cases, building on techniques from case-based reasoning, natural language processing, and databases. The methodology can help to create cases from large-sized semi-structured maintenance and operational databases for diagnostic CBR systems before applying them to real-world applications. After the detailed description of the methodology, we presented a case study: implementing the methodology for an automated case creation system in a real-world application, in order to demonstrate how useful the methodology is. In implementing the methodology for automated case creation, we also developed a Java-based CBR engine, which provides the fundamental support for case retrieval and case base management and is a useful and common tool for other CBR applications. We conducted experiments for automated case creation using the developed system. We evaluated the usefulness and effectiveness of the methodology by comparing the quality of cases created automatically by the system and manually by domain experts. The experimental results show that the developed system, which implements the methodology, can create cases of as high quality as those of the domain expert and can significantly reduce the human effort required for case creation from large-sized semi-structured maintenance data and operational data; in particular, it is able to author cases from the historic database at the initial stage of the development of a CBR system. The proposed methodology is applicable not only to maintenance applications but also to other applications by implementing a specific bridge, which converts the raw data format into the defined data format.

Although we have conducted experiments for testing the usefulness and effectiveness of the methodology, the cases created by the system still need to be evaluated before they are applied to real-world applications. We did some evaluation of the created cases manually by domain experts using the evaluation support tool in the developed system. We asked the domain experts to check the cases one by one to see if they are useful cases. This is not a good way to evaluate the quality of cases. For a more sophisticated system, it is desirable that the system can evaluate the quality of cases by using predefined criteria or knowledge bases. How to evaluate the quality of cases is an interesting and unsolved topic for researchers in the CBR area. This will be our future work. We need to develop an approach for evaluating the quality of cases effectively, including field trials for practitioners. Another direction for future work is to improve the adaptability of the methodology for other real-world CBR applications.

Acknowledgments Many people have been involved in this project. Special thanks go to the following for their support, discussion, and valuable suggestions: M. Zaluski, M. Halasz, R. Wylie, and F. Dube. We are also grateful to Air Canada for providing us with the aircraft fleet operational data and maintenance data.

The authors would like to thank Dr. Jian Pei at Simon Fraser University and the reviewers for their valuable comments and suggestions on revision of the paper.

Dr. Chunsheng Yang is a Research Officer at the Institute for Information Technology of the National Research Council of Canada. He is interested in data mining, reasoning technologies such as case-based reasoning, rule-based reasoning and hybrid reasoning, multi-agent systems, and distributed computing. He received an Hons. B.Sc. in Electronic Engineering from Harbin Engineering University, China, an M.Sc. in computer science from Shanghai Jiao Tong University, China, and a Ph.D. from National Hiroshima University, Japan. He worked with Fujitsu Inc., Japan, as a Senior Engineer and was engaged in the development of ATM Network Management Systems. He was an Assistant Professor at Shanghai Jiao Tong University from 1986 to 1990 working on Hypercube Distributed Computer Systems. Dr. Yang has been the author of over 30 papers and book chapters published in refereed journals and conference proceedings. He was a Program Co-Chair for the 17th International Conference on Industry and Engineering Applications of Artificial Intelligence and Expert Systems. Dr. Yang is a guest editor for the International Journal of Applied Intelligence. He has served on Program Committees for many conferences and institutions, and has been a reviewer for many conferences, journals, and organizations, including Applied Intelligence, NSERC, IEEE Trans., ACM KDD, PAKDD, AAMAS, IEA/AIE and so on. Dr. Yang is a senior IEEE member and ACM member.

Benoit Farley is a research officer at the Institute for Information Technology of the National Research Council of Canada. He received his B. Appl. Sc. and his Master's degree in Telecommunications at the Université de Sherbrooke in the province of Québec, Canada. After a number of years in computer-assisted learning and training, he has been working for the last twenty years in the field of natural language processing and understanding. His current research focuses on technolinguistic tools for aboriginal languages, more specifically Inuktitut, the language of the Inuit.

Bob Orchard is a Senior Research Officer at the National Research Council of Canada. There he is a member of the Integrated Reasoning Group of the Institute for Information Technology. He received an M.Sc. in Computer Science from the University of Western Ontario in 1974 and an Hons. B.Sc. in Mathematics from Queens University in 1972. His research interests include fuzzy logic, expert systems, and evolutionary computing. The interest in fuzzy logic led to the development of FuzzyCLIPS (an extension to the CLIPS expert system tool from NASA) and the FuzzyJ Toolkit, which are used to create fuzzy reasoning applications.