You are on page 1of 9

ADLER: An Environment for Mining Insurance Data

M. Staudt J.-U. Kietz U. Reimer


Swiss Life, Information Systems Research (CH/IFUE), CH-8022 Zurich, Switzerland
fstaudt,kietz,reimerg@swisslife.ch

and can directly be applied for improving the deci-


sion making processes, for detecting new trends and
Abstract elaborating suitable strategies etc. In order to bridge
the gap between both sides, i.e. to nd a reasonable
The rapid technical progress of hardware and way for turning data into information, we need (e-
data recording technology makes huge masses cient) algorithms that can perform parts of the nec-
of digital data about products, clients and essary transformations automatically. There will al-
competitors available even for companies in ways remain interactive steps for this data analysis
the services sector. Data homogenization and and information gathering task. However, the auto-
information extraction are the crucial tasks matic processing should cover all those parts that can
when trying to exploit its inherent (and often not be handled properly by human beings due to the
hidden) knowledge for improvements of busi- size of transformation input and output. Basically, we
ness processes. This paper reports the current are faced with two fundamental problems:
activities at Swiss Life tackling both problems. First, we have to deal with a data homogeniza-
In particular, we sketch the design of a data tion and integration problem caused by a (due
analysis environment that applies Knowledge to historical reasons) great variety of tools and data
Representation and Machine Learning tech- stores on di erent system platforms and architectures
nology to obtain information from relational that do not work together or are at least coupled in
data. The main focus lies on explaining the a very loose manner only. Not only with respect to
usage and representation of meta data within technical aspects but also due to semantic dependen-
this environment. cies, analysis processes require a preceding integration
step both at the schema and the data level. These het-
1 Introduction erogeneous \legacy" systems usually serve for concrete
operational tasks in a company such as bookkeeping or
The exploding amount of available digital data in most stock list management and are therefore often called
companies due to the rapid technical progress of hard- Online Transactional Processing (OLTP) systems. Be-
ware and data recording technology has even more in- sides heterogenity availability is an important concern,
creased the tradeo between just managing the data too: Since detailed data analyses are very time con-
on the one hand and analyzing resp. exploiting the suming they would slow down the OLTP systems to
knowledge hidden in the data for business purposes a degree that they could not serve their primary pur-
on the other hand. The supply side of data manage- poses any more.
ment is characterized by huge data collections with a A solution to this problem is the construction of
chaotic structure, often erroneous, of doubtful quality central Data Warehouses that run independently from
and only partially integrated. On the demand side we the OLTP systems and take over all data from them
need abstract and high-level information that is tai- relevant for analysis tasks. From a database point of
lored to the user's (mostly management people) needs view a Data Warehouse is a collection of materialized
The copyright of this paper belongs to the papers authors. Per- views on the data in the source systems. Additional
mission to copy without fee all or part of this material is granted views on the warehouse data realize the di erent lev-
provided that the copies are not made or distributed for direct els of abstractions required for di erent analysis goals
commercial advantage. and end user needs. The Swiss Life Data Warehouse
Proceedings of the 4th KRDB Workshop intended to reach a homogenized base for data analysis
Athens, Greece, 30-August-1997 is called MASY (see Section 2).
(F. Baader, M.A. Jeusfeld, W. Nutt, eds.) Second, the information extraction problem
http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-8/ has to be solved. Starting from a consolidated but

M. Staudt, J.-U. Kietz, U. Reimer 16-1


nevertheless still immense and confusing collection of 2 Homogenization and Integration
data, transformation, condensation and restructuring The masses of digital data available at Swiss Life {
is necessary in order to gain information with respect not only data from insurance contracts but also from
to certain analysis goals out of it. In the data ware- external sources, such as information about the socio-
house context the term Online Analytical Processing demographic structure and the purchasing power of
(OLAP) was formed to describe the interactive anal- the population in the various parts of the country {
ysis and exploration of warehouse data and is under- led to the development of the central integrated Data
stood to comprise the support for manual navigation, Warehouse MASY [Fri96]. It comprises data from
selection and representation of data. In particular six OLTP systems: contract data (about 650,000 con-
multidimensional views and operations are considered tracts, some 500,000 clients) plus externally available
as typical OLAP features. The usage of OLAP tech- data collections. Some of the data sources are shown in
nology can be characterized as veri cation-driven ap- Figure 1. The basic insurance contract data e.g. stem
proach to data analysis, i.e. the data analyst usually from the system EVBS while GPRA contains (per-
starts with some hypothesis and is supported by re- sonal) data about all Swiss Life clients and partners.
spective tool facilities for veri cation. The system BWV is a publicly available catalogue of
In contrast to OLAP, methods for \knowledge dis- all (3 Million) Swiss households.
covery in databases" (KDD) aim at the automatic de- MASY is implemented on top of ORACLE and
tection of implicit, previously unknown and potentially follows a ROLAP warehouse architecture, i.e. em-
useful patterns in the data and conversion into pieces ploys on top of the relational structures a multidimen-
of new information that can in uence decision and sional OLAP-frontend based on ORACLE EXPRESS.
strategies. The data analysis based on KDD is discov- The database itself has both a normalized warehouse
ery driven, i.e. does not start with given hypotheses scheme gained from integrating the schemas of all
but searches for new ones (within a given hypothesis source systems, and a derived (redundant) denormal-
space). Data Mining algorithms constitute the kernel ized so-called Galaxy schema intended to eciently
within the often cyclic and multi-step KDD processes. support multi-dimensional access.
Other than classic statistical approaches the exploita- The rst version of the warehouse in operation is
tion of AI technology for data mining tasks was ne- expected to contain around 20 Giga Bytes of data,
glected for a long time. However, techniques from the distributed over approximately 30 tables and 600 at-
areas of Machine Learning (esp. Inductive Logic Pro- tributes.
gramming), Fuzzy Technology and Neural Networks
promise more elaborate kind of knowledge to be dis-
covered in huge data collections. 3 Mining for Information Extraction
In order to explore how OLAP tools employed on Based on the homogenized data in MASY the
top of MASY can be complemented with Data Min- Data Mining environment ADLER o ers tools and a
ing tools Swiss Life set up the project DAWAMI. This methodology for supporting the information extrac-
project is concerned with the design and implemen- tion task. While OLAP tools concentrate on mainly
tation of the Data Mining environment ADLER (= interactive exploration, abstraction and restructuring
Analysis, Data Mining and Learning Environment of of Data (which also leads to new information) Data
Rentenanstalt/Swiss Life). DAWAMI aims in partic- Mining provides algorithms for automatic acquisition
ular at enabling end users to execute mining tasks as of knowledge about previously unknown contents of
independently as possible from data mining experts' the data. Of course, there always remain interactive
support (Goal 1) and at making a broad range of data tasks, e.g. judging the quality of identi ed clusters and
mining applications possible (Goal 2) by integrating derived decision trees wrt. to the mining goals. Fig-
a variety of di erent mining methods into a common ure 1 shows the overall architecture of ADLER and its
environment for data transformation and mining. relationship to MASY and the OLTP sources respec-
The rest of this paper is organized as follows: Sec- tively.
tion 2 summarizes the data integration done in MASY.
In Section 3 we discuss our approach to information 3.1 Integrating Data Mining Algorithms
extraction, namely setting up a suite of Data Min-
ing tools. Section 4 describes the meta data man- In order to realize Goal 2 stated in Section 1 ADLER
agement task within the mining environment which supports a great variety of data mining approaches in-
is supported by the repository system ConceptBase. cluding classical statistical ones as well as techniques
Section 5 enumerates several concrete applications for from the areas of Machine Learning and Neural Net-
DAWAMI. Section 6 concludes with an outlook to fur- works. This requires an open framework that allows
ther steps of the project and to open problems. the integration of quite di erent types of algorithms

M. Staudt, J.-U. Kietz, U. Reimer 16-2


OLAP GUI

Data Mining GUI

Data Warehouse

OLAP C4.5
Frontend

ORACLE Foil
Express
GPRA
Cilgg

LIME Workbench
EVBS RDT

VWS MASY Cobweb

KIS explora
ORACLE
BWV backprop
Kepler
Ribl

MASY
Meta Data
Swiss Life
Enterprise
Data Model
LIME
Meta Data ADLER
ConceptBase
Rochade
Meta Data Management
Data Mining Environment

Figure 1: Architecture and Embedding of ADLER


and a high extensibility. The heart of the DAWAMI single relation (attribute-based approaches) or input
architecture is the Data Mining workbench Kepler tuples from di erent relations (relational approaches).
[WWSE96]. This system is the successor of the ma- The latter category in particular allows to include ad-
chine learning toolbox Mobal [MWKE93] and relies on ditional background knowledge and arbitrary combi-
the idea of \Plug-Ins": Its tool description language nations of di erent classes of data objects. While
and the primitives of a basic API enable the build- statistical, most Machine Learning and commercially
ing of wrappers around a given implemented mining available Data Mining algorithms are attribute-based,
algorithm and its inclusion into the applicable set of Inductive Logic Programming approaches fall into the
tools. second category [Kie96]. Compared to solely applying
Figure 1 shows some examples for algorithms that statistical methods the integration of Machine Learn-
are considered as candidates for becoming components ing approaches (in addition to giving up the restriction
of ADLER. With respect to their output we can dis- to single relations as input) has e.g. the following ad-
tinguish between the following categories: vantages:
1. detection of associations, rules and constraints: a  more adequate treatment of nominal (categorial)
common application of these techniques are e.g. attributes, esp. for hierarchically ordered values,
market basket analyses.  faster heuristic methods, e.g. for clustering,
2. identi cation of decision criteria: based on deci-  more comprehensible results.
sion trees as one possible result we can support 3.2 End User Support
tasks like credit assignments.
3. discovery of pro les, prototypes and segmenta- The interface already o ered by Kepler is currently
tions: classes of clients with similar properties e.g. being extended by us with user-friendly database ac-
can be grouped together and handled in a uniform cess operators (i.e., to support the selection, con g-
way. uration and sampling of data from the warehouse
that form the basis of the intended mining opera-
Another categorization concerns the kind of input tions) and additional presentation and evaluation facil-
data allowed: sets of attribute-value pairs describing ities. These extensions contribute to Goal 1 (cf. Sec.1)
properties of certain data objects represented in one and are complemented with providing an application

M. Staudt, J.-U. Kietz, U. Reimer 16-3


methodology relating di erent mining goals with re- Before actually starting a mining algorithm a se-
spective methods. ries of data transformations is usually needed. A Data
Warehouse consists of a large number of relations while
4 Meta Data Management in ADLER most mining algorithms only accept as input tuples of
We now turn to the management of meta data within one relation. Thus, either one of the existing relations
ADLER. After explaining its basic role we describe must be selected from which tuples are to be sampled
the model implemented in the information repository or a new relation must be de ned from the existing
ConceptBase for capturing the meta data relevant for ones. Most of the time the latter case will apply in or-
mining with ADLER and give an example for the der to have all attributes relevant for the mining task
support during the preprocessing and transformation combined in one relation. The meta data management
phase which precedes the actual mining activities. is needed to keep track of the views introduced so that
the user can always look up what views he currently
4.1 The Role of Meta Data has and how they are de ned. Moreover, to be able to
de ne suitable views at all, the user needes to know
Meta data plays an important role for the integration about the existing relational schemas, the type of their
task in distributed information systems both at design attributes, which (groups of) attributes are keys, and
and run time. Figure 1 shows three di erent types of which attributes can be used to join two relations. He
meta data important in our context: will nd all of this information in the meta data repos-
itory.
 The Swiss Life IT department administrates an Besides de ning the appropriate relational views
overall enterprise data model that describes all many more transformations may be needed for a suc-
relevant objects, agents and systems in the com- cessful mining:
pany. The actual implementation platform is
Rochade.  Di erent attributes representing the same prop-
 The MASY Data Warehouse meta data is mainly erty may have a di erent encoding, resulting in
stored in certain ORACLE tables. Parts of it will di erent values with the same meaning. Such
also be ported to the Rochade repository. Typi- cases can still occur, despite the homogenization
cally, such meta data contain information about already done when building the data warehouse,
the integration and transformation process (tech- because a mining algorithm may also access data
nical meta data) and the warehouse contents it- from other sources than the warehouse alone (e.g.,
self, e.g. di erent (materialized) views and levels online databases). The additional data may be
of abstraction that are available for the end users part of the input in the case of a relational mining
(business meta data). algorithm, or is part of the background knowledge
 Among the meta data relevant for Data Mining of an attribute-based mining algorithm.
we can in particular distinguish between the fol-  Certain attributes may be inappropriate for min-
lowing categories: ing. For example, the value range of a numeric
{ background knowledge including access to attribute may be very large, although only the
external sources magnitude of a value and not the exact gure is
{ knowledge about data selection and transfor- of interest. In such a case similarity measures as
mations needed to properly feed the mining used for clustering would weight such attributes
algorithms improperly. Taking the logarithm of all attribute
{ application methodology: which tools should values instead would lead to more comprehensi-
be used to solve which problems? ble mining results. Another example is the case
of an attribute which is cardinal in its nature but
Instead of handling meta data of single applications is encoded by single characters that stand for cer-
like classical data dictionary contents within the op- tain intervals. As a consequence, it is no longer
erational database system { a very common way for possible to measure the distance between two val-
Data Warehouse solutions { it is much more appropri- ues, nor to compare for greater or less between
ate to manage it separately from the data and combine the attribute values themselves or with values of
it with other kinds of meta data. As a consequence, another attribute. It would be better to represent
we introduce an independent meta data repository in each interval by its mean value.
ADLER that plays an outstanding role throughout the  Attributes derived from several other attributes
whole KDD process. Meta data additional to the one may explicitly express facts which have only been
already introduced by the warehouse is needed due to implicit in the data so far. For example, instead
a variety of reasons. Let us have a closer look at this. of having an attribute for the total income of

M. Staudt, J.-U. Kietz, U. Reimer 16-4


Figure 2: Meta Model of ADLER: Data Representation and Preprocessing
a household and an attribute for the number of left out. The system may also examine the character-
persons living in it, better mining results could istics of the input data and of the output and suggest
be achieved when using an attribute for the per which parameters of the mining algorithm to set to a
capita income, which can easily be derived from di erent value.
the two given ones. Due to the requirements on a meta data repository
 Certain attributes may not be useful as part of discussed above the necessity of a more powerful repre-
a mining input and should be projected away. sentation formalism than just entries in dedicated re-
Examples are nominal attributes with too many lational tables arises. ADLER employs the deductive
missing values or with too many di erent values and object-oriented meta data management system
so that they have nearly key property (or are even ConceptBase [JGJ+ 95] and its knowledge representa-
keys). tion language Telos to describe the mining meta data,
relevant excerpts of the MASY meta data and the links
All of the above mentioned kinds of transformations to the enterprise meta data stored in Rochade.
are only possible when there is a meta data reposi-
tory with the required information about attributes. 4.2 The ADLER Meta Model
Moreover, with such a repository the KDD environ-
ment can decide upon the required, recommended, and Figure 2 shows the graphical representation of an O-
the possible transformations itself and perform them Telos object collection managed by ConceptBase. The
automatically, maybe after having asked the user for displayed objects consitute parts of the meta schema
permission. used in ADLER to control the preprocessing and min-
Finally, another very important functionality that ing activities and to describe the source data and their
becomes possible with a meta data repository is a com- relationships. We leave out the above mentioned ex-
ponent that guides the user through the KDD process plicit connection to the warehouse sources and start
and gives him advice (cf. Goal 1). For example, when with relations available for the mining process. Apart
the decision tree constructed by a mining algorithm is from database relations (DBRelation) we consider also
too complex ADLER may tell the user that certain at- relations extracted from at text les available as ad-
tributes in the input data are noisy and after all quite ditional background knowledge not integrated in the
insigni cant for the mining task so that they should be warehouse (mainly for experimental purposes) and so-

M. Staudt, J.-U. Kietz, U. Reimer 16-5


called dynamic views that do not exist in the database strictions both on these attributes as well as on addi-
(as database views de ned on top of base relations by tional ones (cond attr). From all involved attributes
SQL expressions) but are created on the y during the the associated relations needed for executing the neces-
mining process. Kepler provides certain dedicated sary joins and performing the selections can be derived
operators to de ne such relations and even to store implicitly.
them locally. Instances of DBView and DynamicView
are derived relations which are constructed in the pre- 4.3 Preprocessing Support: An Example
processing phase of a KDD session. All types of re-
lations have columns, namely attributes (instances of In order to give an idea how the meta schema explained
RelAttribute). above can be used in ADLER at runtime we now show
Attribute values are very often discrete and possi- an example stemming from the preparation of our rst
ble values of an attribute can be organized in code analyses of the data available in MASY. For the ex-
tables, particularly for reasons of eciency. An entry ample we assume three relations:
in such a table relates a short code word with the ac-
tually intended value. While the codes are sucient  KNDSTM: client data relation with key attribute
for internal handling, the \long" values come into play CKNDNBR for the client number and about 100 fur-
when interacting with the user. Our database cur- ther attributes, e.g. birthdate (DGBTDTE), martial
rently employs about 250 code tables with the number status (NZIVSTD), number of children (NANZKND)
of entries ranging from 2 to several hundreds. Discrete and buying power class (CKFLKLS);
attributes can be either nominal or ordinal, a further  KNVPOL: insurance policy relation; relates policy
category comprises cardinal data which is not only or- numbers and certain other information (in partic-
dered but also allows to measure distances between ular the contract date DPOLDTE) about contracts
values. These categories are covered by Metrics. Fur- with foreign key KP CKNDBR to KNDSTM;
thermore, from a semantic point of view attributes  CELL: general data (e.g. type of ats or houses
can be assigned to sorts, like money or years. The (NGBDART) or typical professional quali cation
data type of an attribute (like NUMBER, STRING) is also (NFNKSTF) of inhabitants) about \cells", i.e. units
recorded in the meta data base but not shown in Fig- of 30 - 50 households in Switzerland, with key
ure 2 as well as the allocation of relations in databases, CELLID.
owner resp. user information etc.
One or more attributes usually constitute the key The records in KNDSTM are partly linked to cells
of a relation, i.e. their value(s) uniquely identi es a by the (foreign key) attribute K CELLID. Thus we can
tuple. Single attribute keys can be called simple key, obtain further information about the client's environ-
like a client identi er (see table KNDSTM below). ment via the cell he is living in.
Particularly interesting for the restructuring of base In Telos notation the meta data repository would
data by generating new derived relations are rela- contain this information as shown as an excerpt in
tionships between attributes. On the one hand we Figure 3. Note, that we instantiate the classes in
have basic relationships given e.g. by foreign key re- Figure 2. The has attribute values of the in-
lationships between di erent relations. In this case stances of DBRelation simply name the relevant at-
we assume equals relations between the involved at- tributes. The foreign key relationships mentioned
tributes. Typically, joins between relations do not above are modelled as instances of the equals link
involve arbitrary attributes but only those where it from RelAttribute to itself.
makes sense. The model allows to state such attribute In order to construct a relation NKND for a certain
pairs by specifying joins links, possibly together with mining task which restricts the client set to those peo-
additional information concerning the join result (e.g. ple who made a new contract since the beginning of
1:n or n:m matching values). Links between attributes last year, we perform the transformation steps given
labeled comparable with give additional hints for pos- below. We are in particular interested in the age and
sible comparisons during the mining phase. More com- the (weighted) buying power class distribution and
plicated relationships between attributes result from want to include the NGBDART and NFNKSTF attributes
applying attribute transformation operators (instances from CELL.
of AttrTrans) to an attribute (as input) to obtain
another (new) attribute (output). The application of 1. The birthday attribute DGBTDTE of KNDSTM has a
such operators will be explained in the example below. speci c date default for missing values. This de-
The generation of derived relations requires operators, fault value in uences the age distribution, so it
too (RelGen). These specify target attributes of the should be explicitly elminated by NULL. The re-
(new) relation, a condition part posing certain re- sult is an attribute column DGBTDTE'.

M. Staudt, J.-U. Kietz, U. Reimer 16-6


KNDSTM in DBRelation with KNVPOL in DBRelation with CELL in Relation with
has_attribute has_attribute has_attribute
a1: CKNDNBR; a1: KP_CKNDBR; a1: CELLID;
a27: DGBTDTE; a17: DPOLDTE a49: NGBDART;
a34: NANZKND; ... a53: NFNKSTF;
a35: NZIVSTD; end ...
a45: CKFLKLS; end
a67: K_CELLID;
...
end

KP_CKNDNBR in RelAttribute with K_CELLID in RelAttribute with


equals equals
e: CKNDNBR e: CELLID
end end

Figure 3: Relation and attribute speci cation in O-Telos


2. From DGBTDTE' and the current date the age NALT Step 1: build DGBTDTE'
of each client can be obtained. NullDGBTDTE in NullIntr with
input
i: DGBTDTE
3. The buying power class of a client should be output
weighted with his marital status and the num- o: DGBTDTE'
value
ber of children. We rst code the values of the v1: "10000101";
v2: "99999999"
marital status as numerical values (new attribute end
The value attribute (de ned for NullIntr) spec-
NZIVSTD) under the assumption that single per-
sons can a ord more things (value 1) while the that the
i es
were
date values (digits of year, month and day)
`misused' for marking missing values and
value of married people is neutral (0) and the should be replaced by NULL.
value of divorced and separately living persons is
negative (-1). Step 2: build NALT
CompNALT in AttrComp with
4. The new weighted buying power class CPKFLKLS input
results from adding this value to the old class i: DGBTDTE'
output
value and further reducing it by one for every two o: NALT
comp_expression
children. c: ":o = substr(today,4,1) - substr(:i,4,1)"
end
5. Relation NKND is de ned as consisting of the at- The comp expression value allows the derivation of
tributes constructed during the previous steps and the output value :o for attribute o of CompNALT from
those of CELL mentioned above. In addition to the input value :i of i by subtracting the year com-
the join between KNDSTM and CELL and the at- ponent of :i from the system date today.
tribute transformations a join with KNVPOL is re-
Step 3: build NZIVSTD'
quired in order to eliminate clients without \new" DecNZIVSTD in DecodeAttr with
contracts. The join conditions in both cases are input
derived implicitly from the equals links in the i:
output
NZIVSTD

model. The selection condition concerning the o:


from
NZIVSTD'

contract date DPOLDTE has to be given explicitly. from1: "single";


from2: "married";
from3: "separated;
Each of the above steps is realized by instances from4: "divorced"
of AttrTrans and RelGen (Step 5). Instances of to
to1: 1;
AttrTrans belong to di erent types of transformation to2:
to3:
0;
-1;
classes (specializations of AttrTrans) with suiting pa- end to4: -1
rameters. Instances of RelGen can be understood as Each from value is replaced by its corresponding to
generalized multi-join operators which are applied to value.
attributes only but with given selection condition and Step 4: build CPKFLKLS
implicit derivation of the participating relations. The CombCPKFLKLS in AttrComb with
attribute transformations concern either one attribute input
(NullIntr, AttrDecode, AttrComp for replacing cer- i1: NZIVSTD';
i2: NKFLKLS;
tain values by NULL, performing general value replace- i3: NANZKND
ments or arbitrary computations) or several attributes output
o: CPKFLKLS
(AttrComb for arbitrary combinations of values). comp_expression
c: ":o = :i2 + :i1 - (:i3 div 2)"
end

M. Staudt, J.-U. Kietz, U. Reimer 16-7


Step 5: generate NKND ing process. In addition, the preprocessing and mining
GenNKND in RelGen with activities do not constitute a sequential but cyclic pro-
output
o: NKND cess where its is useful even during the same mining
target
t1: CKNDNBR;
session to have the possibility to go back to previous
t2: NALT; steps.
t3: CPKFLKLS;
t4: NGBDART; In this respect, ADLER clearly becomes a support
t5: NFNKSTF environment that helps the user to keep track of things
cond_attr
a1: DPOLDTE done and frees him from the burden of dealing with
condition low-level SQL stu and le in- and ouput for con-
end
c: " :a1 > '19960101'"
structing his relational views and storing intermedi-
ate results. We thus see our work very much in the
Besides the advantage of having a documentation direction [BA96] calls for.
of executed transformations, another motivation for
modelling the preprocessing steps is to enable the user
to specify the required transformations in a stepwise
5 Applications
fashion, while the actual generation of a speci ed rela- For Swiss Life Data Mining has a high potential to
tion is done automatically by ADLER. The automatic support marketing initiatives that preserve and extend
generation builds upon the concretely modelled op- the market share of the company. In order to experi-
erations like those shown above. Even for the small ment with di erent mining algorithms and to develop
number of attributes in our example the resulting a xed palette of tools o ered in ADLER a number of
(SQL) expressions become very complex and their di- concrete applications were identi ed and selected for
rect manual speci cation is error-prone and laborious. the rst phase of DAWAMI:
Of course, there should also be GUI support for pro- Potential Clients: One might think that the wealthy
tecting the user from having to type the still cryptic inhabitants of a certain geographical area are the most
Telos speci cations. promising candidates for acquiring new contracts, but
Based on the relations constructed by various trans- this is usually not the case. An interesting data mining
formations a further preprocessing task concerns the task is therefore to nd out what the typical pro les of
computation of value distributions. The result would Swiss Life clients are with respect to the various insur-
also be stored in the meta data repository as instanti- ance products. An insurance agent can then use these
ation of Distribution and a corresponding link from pro les to determine from a publicly available register
each attribute of interest. The distribution of our syn- of households those non-clients of his geographic area
thesized attributes CPKFLKLS and NALT is shown in the which are quite possibly interested in a certain kind of
diagrams in Figure 3. life insurance.
Client Losses: One way to reach lower cancellation
4.4 Meta Data for Mining Activities rates for insurance contracts is via preventive measures
While our plans for recording transformation and pre- directed to clients that are endangered due to per-
processing steps are relatively clear, the considerations sonal circumstances or better o ers from competitors.
of how to represent the actual mining activities are By mining the data about previous cancellations us-
more vague. As a rst step it is necessary to model the ing as background knowledge unemployment statistics
mining algorithms wrt. di erent input requirements, for speci c regions and questionnaire data, we expect
parameters and results. Each mining session consists to obtain classi cation rules that identify clients that
of executing algorithms (i.e. instantiating the model) may be about to terminate their insurance contracts.
in order to pursue a certain mining goal which ob- Other mining tasks concern the development of
viously relates to the type of the algorithm and the standardized Type Tests that allow to include psycho-
available data sources. We think it is useful to record graphic aspects into client characterizations, the iden-
the complete path from the rst transformation in the ti cation of di erences between the typical Swiss Life
preprocessing phase to the successfully generated re- clients and those of the competitors, and the segmen-
sult, like a decision tree, which satis es the intended tation of all persons in the MASY warehouse into the
goals. In future sessions it can be reconstructed how so called RAD 2000 Target Groups based on a set of
the whole analysis process took place. Furthermore, fuzzy and overlapping group de nitions developed by
we are able to generalize at least to a certain degree by the Swiss Life marketing department some years ago.
taking together similar mining steps and taking over 6 Outlook and Goals
the signi cant (e.g. parameter settings) properties into
the mining methodology mentioned above, thus giv- While Goal 2 stated in Section 1 is a more architec-
ing future users typical examples of a successful min- tural issue that can be handled as sketched in Section

M. Staudt, J.-U. Kietz, U. Reimer 16-8


Figure 4: Value distribution of synthesized attributes
3.1 Goal 1 is very ambitious. Data Analysis at Swiss References
Life currently is conducted by specialists of the Infor-
mation Center for Market Analyses. As a rst step [BA96] R.J. Brachman and T. Anand. The pro-
cess of knowledge discovery in database.
these people should be enabled to apply ADLER in In Proc. of the 2nd Int. Conf. On Knowl-
addition to the statistics tools (mainly SPSS) they are edge Discovery and Data Mining. AAAI
already using. In the long run we want also reach users Press, 1996.
with less background in data analysis. The success in
this direction heavily depends on the quality of the [Fri96] M. Fritz. The employment of a data ware-
ADLER GUI and the provided mining methodology. house and OLAP at Swiss Life (in Ger-
Data Mining is a special form of knowledge acqui- man). Master's thesis, University of Kon-
sition. The results of mining, e.g. a classi cation com- stanz, Dec. 1996.
ponent that allows to classify people into speci c client [JGJ+ 95] M. Jarke, R. Gallersdoerfer, M. Jeusfeld,
segments, should be made available to other analyzers M. Staudt, and S. Eherer. ConceptBase
than just the one who initiated the current mining pro- - a deductive object base for meta data
cess. Therefore, Data Mining has to be embedded in a management. Journal of Intelligent In-
general concept for Knowledge Management at Swiss formation Systems , 4(2):167{192, 1995.
Life which also includes an integration task but now on
a more conceptual than on a source data level. Meta [Kie96] J.-U. Kietz. Inductive Analysis of Rela-
data resources as mentioned above will de nitely be a tional Data (in German). PhD thesis,
central part around which Knowledge Management is Technical University Berlin, Oct. 1996.
to be built up.
Besides these more global links the activities and [MWKE93] K. Morik, S. Wrobel, J.-U. Kietz, and
plans described above will serve as a starting point W. Emde. Knowledge Acquisition and
for establishing a stable Data Mining infrastructure Machine Learning: Theory, Methods and
at Swiss Life. We want to show that the very com- Applications. Academic Press, 1993.
mon application of database technology for managing [WWSE96] S. Wrobel, D. Wettschereck, E. Sommer,
huge amounts of data in the company is naturally com- and W. Emde. Extensibility in data min-
plemented by technology from the areas of Machine ing systems. In Proc. of the 2nd Int. Conf.
Learning (for data mining) and Knowledge Represen- On Knowledge Discovery and Data Min-
tation (for modelling domains and high level system ing. AAAI Press, 1996.
integration).

M. Staudt, J.-U. Kietz, U. Reimer 16-9