
Using Artificial Intelligence for the Evaluation of the Movability of Insurances

Martin Åslin

June 29, 2013


Master’s Thesis in Computing Science, 30 ECTS credits
Supervisor at CS-UmU: Patrik Eklund
Examiner: Fredrik Georgsson

Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN
Abstract

Today the decision to move an insurance from one company or bank to another is made manually, so there is always the risk that an incorrect decision is made due to human error. The goal of this thesis is to evaluate the possibility of using artificial intelligence, AI, to make that decision instead. The thesis evaluates three AI techniques: fuzzy clustering, Bayesian networks and neural networks. These three techniques were compared and it was decided that fuzzy clustering would be the technique to use. Even though fuzzy clustering only achieved a hit rate of 69%, there is a lot of potential in it. In section 4.2 on page 32 a few improvements are discussed which should help raise the hit rate.
Contents

1 Background and motivation 1


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Insurances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.1 What is an insurance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2.2 Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.3 Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.4 Pitfalls . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Problem description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.1 Goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3.2 Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.3 Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3.4 The structure of the system . . . . . . . . . . . . . . . . . . . . . . . . 4

2 Theory 7
2.1 Overview of ’intelligent computing’ . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 The logical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 The probabilistic approach . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 The numerical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Pros and cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The chosen method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 C-Means Fuzzy Clustering Algorithm . . . . . . . . . . . . . . . . . . 17
2.3.2 Rulebase algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Inference algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3 Implementations, results and validations 25


3.1 Development environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.1 .Net . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.1.2 Useful embedded namespaces . . . . . . . . . . . . . . . . . . . . . 25
3.1.3 Database management via XML export . . . . . . . . . . . . . . . . . 26
3.2 Result . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26


3.3 Impact of the parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27


3.3.1 The Fuzziness Variable . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3.2 The Number of Runs . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 The Cluster Interval . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4 Conclusion 31
4.1 Restrictions and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 Better selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Small and focused . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Reinforced learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

5 Acknowledgments 35

References 37
List of Figures

2.1 The process of fuzzy controls . . . . . . . . . . . . . . . . . . . . . . . . . . . 8


2.2 Inference using Mamdani’s method . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 K-Means clustering example. . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Membership values for clusters . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.5 Projection of a cluster . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.6 Simple bayesian network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.7 Bayesian networks - order of nodes . . . . . . . . . . . . . . . . . . . . . . . . 12
2.8 Calculating probability in a Bayesian network. . . . . . . . . . . . . . . . . . 13
2.9 A neuron . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.10 Neural network with one hidden level . . . . . . . . . . . . . . . . . . . . . . . 15
2.11 C-Means clustering algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.12 Criterion number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.13 Create rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.14 Calculate the input. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.15 Process of Mamdani’s method. . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.16 Mamdani - Output membership function. . . . . . . . . . . . . . . . . . . . . 21
2.17 Centre of Gravity. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.18 Larsen - Output membership function. . . . . . . . . . . . . . . . . . . . . . . 22
2.19 Takagi-Sugeno - X matrix. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.20 Takagi-Sugeno - Y vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.21 Takagi-Sugeno - P vector. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
2.22 Takagi-Sugeno - β variable. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.23 Takagi-Sugeno - matrix computation . . . . . . . . . . . . . . . . . . . . . . . 23

3.1 The impact of the fuzziness variable . . . . . . . . . . . . . . . . . . . . . 27


3.2 The impact of the number of runs variable . . . . . . . . . . . . . . . . . . . . 29
3.3 The impact of the interval limits . . . . . . . . . . . . . . . . . . . . . . . . . 30

List of Tables

2.1 Randomized Membership Matrix . . . . . . . . . . . . . . . . . . . . . . . . . 17


2.2 Adjusted Membership Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

3.1 Output intervals of the inference methods . . . . . . . . . . . . . . . . . . . . 26


3.2 Output from the system . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

Chapter 1

Background and motivation

1.1 Introduction

Today's banks and insurance companies have the possibility to, with the customer's consent, take over personal insurances such as capital, service and private pension insurances. Which insurances are movable, i.e. possible to take over, or which parts of an insurance are movable, is decided partly by legislation and partly by the content of the insurance. The size of the insured capital can be of great significance for an insurance company or bank when they decide whether they want to take over the insurance. The movability is also decided by rules set internally by each separate company. This makes the movability of an insurance dependent on a lot of parameters that can change frequently. Today the majority of the evaluation of the movability of insurances is done manually and from manually created sets of rules. There are three main types of assessment criteria: green - the insurance is movable, yellow - the insurance might be movable or partly movable, and red - the insurance is not movable. Sometimes there are a number of sub-criteria within each of the main criteria, where a more detailed assessment is described. Here, for example, it can be described how to proceed in a move matter by, for example, requesting a health certificate. Manual evaluation of insurances is a time consuming process which could benefit from becoming fully or partly automated. The goal of this thesis project is to create a system that can classify insurance information.¹

¹ Most of the information about insurances in chapter 1 has been found in [16], which is an internal document at Svenska Försäkringsfabriken.

1.2 Insurances
This section will explain what an insurance is, how the structure of an insurance looks and
what you have to keep in mind when reviewing to see if the information in the insurance
is correct. When reading this keep in mind that this is based on Swedish insurances and
might differ from insurances in other countries.

1.2.1 What is an insurance


An insurance is an agreed-upon financial protection in exchange for payment. So you pay some bank or company a fee and they will give you money in case something happens, like an accident, or when you retire. Some insurances will only give you back roughly the same amount of money as you have paid them; these are called savings insurances, and an example of such an insurance is the pension insurance. There is another type, called risk insurances, which will in many cases give you a lot more money than you have paid to the insurance company, but they are only valid as long as you pay the fee and you will not get the money back if you stop paying. The risk insurance system works because of the number of people insured and the fact that not everyone will be in, e.g., an accident and in need of insurance money. So the customers pay for each other; if all the customers suddenly needed insurance money then the company would likely become bankrupt. The insurance companies have rules where they evaluate the risk of something happening and adjust the required fee, or simply do not approve the insurance.
For example: let's say an 18-year-old man wants to insure a brand new supercar in a major city. He will most likely have to pay a huge yearly fee, or the insurance company might not want to sign him because of the high risk of something happening to the car. The risks that they might look at in this case might be:

– His young age and the fact that he is male, which statistically means that he is more likely to be in some kind of accident.
– A powerful sports car is easier to lose control of and a more desirable target for thieves.
– Living in a major city with a big population means more interaction with more people, which means a higher chance that some kind of accident occurs.

1.2.2 Structure
An insurance is made up of two main parts. The first part contains overall information about the insurance, such as the insurance number, the person that is insured, the owner of the insurance, the insurance provider, when it was signed, the cost of the insurance, etc. The second part contains information about the content of the insurance. The content of the insurance is divided into moments. It is possible for insurances to include more than one moment, so e.g. an accident insurance will contain an accident moment, and a pension insurance can contain a pension moment and a health insurance moment. The possibility to add more moments to the insurance is up to the insurance provider.

1.2.3 Types
There are a lot of different insurances but they can all be divided into the two main types that were mentioned in section 1.2.1, namely savings insurances and risk insurances. Savings insurances are paid to the insured after a certain date, e.g. pension savings, while risk insurances are paid out after accidents, sickness etc. and are only valid as long as the insurance fee is paid. So if a person is never in an accident, they will not be able to use the money they have paid for the insurance. The advantage of the risk insurance, though, is that the money you get if you e.g. are in an accident can be much higher than the amount you have paid in.
Here are some of the types of insurances:

Pension this is an insurance that will let you receive money during your retirement over a certain period of time and at a certain interval. Can be signed by a private person and/or a company.

Survivor's protection this moment exists so that the husband/wife, children or other heirs of the insured get the money if the insured person dies.
Health insurance this insurance will provide the insured person with an income in case of sickness or early retirement due to ill health, though a qualifying period of sickness might be required.
Sickness-/prematurely-capital this insurance will grant you a one-time amount of money if you become sick or hurt enough that you are granted sickness benefit.
Medical treatment insurance this insurance covers costs during medical treatment/attendance.
Accident insurance usually contains three moments:
Medical treatment costs are costs that occur with sickness/injury and include costs for treatment by a doctor/dentist but also necessary travel costs during the treatment.
Disability capital a one-time amount of money one receives if one is afflicted with a permanent disability or decreased capacity to work.
Death capital a sum that is paid to the husband/wife, children or heirs of the insured person in case of death.
Premium exemption this is an add-on for insurances that makes the insurance provider take responsibility for the payment of the premium if the insured becomes so sick that the period of sickness is greater than the qualifying period of sickness.

1.2.4 Pitfalls
There are several pitfalls to keep a lookout for when reviewing an insurance, and they differ slightly between the different types of insurances. Some general ones that apply to most insurances are e.g.: are the insurance number, personal number, dates, sums, status etc. correct? Another pitfall is an insurance that has a status of continuous and a Z-time that has passed; then something is wrong. Z-time marks the last day that the insurance is valid, but that is only the case for the risk type of insurance. A Z-time in a savings type insurance means the date when they should start paying out e.g. a person's pension.

1.3 Problem description


This section will describe the goal of this thesis and cover some of the problems that have to be considered and/or solved during this thesis.

1.3.1 Goal
The process of evaluating the movability of insurances is time consuming, mainly because it is done manually and because of the complex rules. The goal of this thesis is to make a system that can, based on manually pre-classified insurance evaluations, learn to classify insurance information. This could potentially save a lot of time while reducing the number of human errors. There are a few requirements on the system:

– The result of the evaluation should consist of a flag and a text.



– Use data that has been serialized as XML.

– The system should be as general as possible.

– The system should be implemented in C# and .NET 4.0

1.3.2 Data
The data this system will use for training will be stored in XML files. The problem with the data is that the number of elements an insurance has will differ between insurances. The possibility to add additional "moments" to most of the insurances creates a lot of possible combinations and a lot of different elements that an insurance can contain. Some of these elements can of course be filtered out with a preprocessor since they might not affect any decision, but that can, and probably will, still leave us with a variable number of remaining elements. It would be possible to create a large neural network, NN, that takes all possible elements and just sets the ones not used to NULL or 0. But that would make it harder to train the system since it would have to learn a more complex model. It would probably be more efficient to somehow divide the problem into a lot of smaller NNs that are more focused on a subset of the problem, for example one network for each of the different types of insurance. They would still be large and/or complex since they can have a lot of different moments, but perhaps it is possible to divide it even further and maybe combine with other approaches such as Bayesian networks or a fuzzy clustering network.
There is another problem to consider with the data: how should the date and sum values be represented in the system? The problem is that they are values that can be continuous, meaning that they might represent 13 January 1988 to 11 February 2013 or 20-60.000 SEK.

1.3.3 Rules
There are of course a lot of rules regarding insurances. The rules can help us filter out insurances which are incorrectly filled in, and thus pointless to evaluate, and of course help us make a decision on whether or not the insurance is movable. A problem with the rules is that they are complex and might be hard to implement, and that companies can have different rules. Another problem is that the rules can change a lot in the insurance industry, which makes it important to build a system that can adapt to new rules, or a system where it is easy to change or add rules. It would be desirable if the system could detect any ambiguity in the rules before and after any change or addition of new rules.

1.3.4 The structure of the system


If we decide to make one large structure/network that should handle every case, then we will end up with a behemoth of a system that will need a lot of data samples to be trained well enough. Another problem would be the training time, which would be very long since the system would have to model a very complex function.
On the other hand, if we decide to use small expert systems, e.g. one structure/network for each type of moment, we could shorten the training time, since each network would model a less complex function with a smaller amount of training data. During retraining it might be possible to avoid retraining all of them, which can save time. But we would have to come up with a way to train the system, since insurances can have multiple moments but only one verdict. So if an insurance has a yellow flag, should we assume that both moments were yellow, or was one of them yellow and the other green? The same has to be considered after the moments have been evaluated: if one moment is green while the other is yellow, will that make the complete verdict yellow?
Chapter 2

Theory

2.1 Overview of ’intelligent computing’


Three types of approaches will be briefly explained in this section: logical, probabilistic and numerical. They will then be compared against each other and the pros and cons of each approach will be presented. At the end of this chapter there will be a more thorough explanation, including algorithms, of the chosen approach.

2.1.1 The logical approach¹

The method in the logical approach that will be studied is the fuzzy clustering method. Before explaining fuzzy clustering it is necessary to explain fuzzy sets, fuzzy logic and fuzzy control, which are all used in fuzzy clustering.

Fuzzy sets and fuzzy logic


Let's say we have the proposition it is cold outside; is this true if we know that the temperature is 17 °C? Some people will probably find it cold while some might find it warmish. The problem with the proposition is the linguistic term cold, which is not a defined line where every value below the line is considered to be cold. Instead the linguistic term cold is an arbitrary value that changes from person to person. This is where fuzzy sets come into the picture: they allow us to represent a value such as cold. What the fuzzy set does is divide the cold value into degrees of cold, usually a value between 0 and 1. So if we go back to is it cold outside when it is 17 °C?, a fuzzy set might represent this as 0.65 true instead of definitely stating that it is true or false.
When fuzzy sets are used in logical expressions it is called fuzzy logic. Fuzzy logic describes fuzzy truth values, which are a function of the truth values of the components. The standard rules for evaluating the fuzzy truth value, T, of a complex sentence are:

– T(A ∧ B) = min(T(A), T(B))

– T(A ∨ B) = max(T(A), T(B))

– T(¬A) = 1 − T(A)


1 The majority of the information for this basic explanation of fuzzy logic, sets, control and clustering has been found in [2] and [9].



So for example if we have T(Cold(outside)) = 0.65 and T(Freezing(Martin)) = 0.55, then T(Cold(outside) ∧ Freezing(Martin)) = 0.55.
These rules might not change when a variable is modified, even when we would have wanted them to change. For example, if we have A = 0.75 and B = 0.33 then min(T(A), T(B)) = 0.33; but if we change A to 0.85 we still get min(T(A), T(B)) = 0.33, even though we might want a new value since one of the values changed. There are a few ways we can improve this, but they will not be mentioned here.

Fuzzy control
Fuzzy control uses rules for making decisions. A rule, R, is expressed as R: IF <fuzzy criteria> THEN <fuzzy conclusion>. Fuzzy control has a number of rules and these rules are stored in what is called a rulebase. There are a few ways to create a rulebase:

– Have an expert write the rules.

– Observe and record the in and out data while an expert performs the actions for a period of time.

– Generate rules based on data.

In this project the last one is of most interest, since we want to minimize the number of human decisions.

Figure 2.1: The process of fuzzy control, [2] page 58.

Figure 2.1 describes the process of fuzzy control. First the input, x, gets fuzzified, which means that x is transformed into its corresponding truth value. The fuzzified x then gets combined by a logical conjunction, which in turn is combined with the output membership function of the rule. The newly created membership function is then calculated before being defuzzified.
Inference is used to create the conjunctions. In figure 2.2 we can see the Mamdani inference method, which in this case uses the minimum operation and then combines the output results by using the maximum operation. In figure 2.2 the result given by the maximum operation is the grey field in U1, since that result has a higher value than U2.

Figure 2.2: Inference using Mamdani’s method, [2] page 59.

Fuzzy clustering

Figure 2.3: K-Means clustering example.2

In cluster analysis, or clustering, one strives to divide the data into different groups (clusters). The data points that are clustered together are more similar to each other than to the points in the other clusters; figure 2.3 is an example of a set of data points that has been divided into three clusters. In fuzzy clustering, a number of data points also gets divided into clusters, but now every data point belongs to each one of the clusters, to different degrees, just like in fuzzy logic. The closer a point is to the center of a cluster, the more it belongs to that cluster. The degrees of belonging, or membership values, that a data point has have to sum up to 1.0. So a data point could have the membership values U0 = 0.17, U1 = 0.35 and U2 = 0.48, which sum up to 1.0.

2 http://en.wikipedia.org/w/index.php?title=File:KMeans-Gaussian-data.svg&page=1, 8 Okt 2012

Figure 2.4: Membership values for clusters, [2] page 68.

Looking at figure 2.4 we can see two clusters and a number of data points. The number above each point is the membership value of the point for that cluster. In the left square we can see the membership values for cluster 1, and as you can see, even the points that are really close to cluster 2's center, represented by the rightmost x, have a membership value for cluster 1. In fuzzy C-means clustering, which is the technique that will be used if fuzzy clustering is chosen, it is usually the distance from the center of the cluster that decides the membership value.

Figure 2.5: Projection of a cluster, [2] page 71.

Projections of the data points will be created after the set of data points has been divided into clusters. Each cluster will get its own set of projections, as can be seen in figure 2.5. These projections will be used to generate the rules of the clusters that together will form the rulebase. The creation of the rulebase marks the end of the training, and the system can now be used to create output. In order to get this output we need something called inference, which will translate any new input to output with the help of the rulebase.

2.1.2 The probabilistic approach³

This section will go through Bayesian networks, which are based on Bayes' theorem; the theorem will be explained first.

Bayes' theorem
The product rule can be written in two forms, namely P(A ∧ B) = P(B|A)P(A) and P(A ∧ B) = P(A|B)P(B). By combining these two forms we get P(B|A)P(A) = P(A|B)P(B), and by dividing by P(A) it turns into Bayes' theorem:

P(B|A) = P(A|B)P(B) / P(A)

P(B|A) is read as the probability of B given A. With Bayes' theorem it is possible to calculate the probability of an unknown variable by using three known probabilities, and that is a common case: a few probabilities are known while the one that we need is unknown.
In [9] they give an example where a doctor knows P(symptoms|disease), the probability of symptoms given disease, but wants to know P(disease|symptoms). In the example the doctor knows that:
– P(s|m) = 0.7
– P(m) = 1/50000
– P(s) = 0.01
where s means the patient has a stiff neck and m that the patient has meningitis. By using Bayes' theorem the doctor can calculate the probability of a patient having meningitis when the patient has a stiff neck:

P (s|m)P (m) 0.7 · 1/50000


P (m|s) = = = 0.0014
P (s) 0.01

Bayesian networks
Bayesian networks are a common approach when a system has to deal with uncertainty. They are based on Bayes' theorem, see section 2.1.2. A Bayesian network can be described as a probabilistic graphical model that can represent dependencies between variables, see figure 2.6.
Figure 2.6⁴ shows how a simple Bayesian network can look. There you have a graphical model describing the relationships between the different nodes. Each node in the network has a probability table associated with it, and since the rain node is not dependent on any other node, only the unconditional probability of each state is necessary. When building a Bayesian network it is important to make a good model of the relationships, so that a node does not depend on variables that are unnecessary for that node. The order in which the nodes are introduced in the system can have a big impact on performance. If the nodes are introduced in a 'not so good' order, it could mean that some nodes get unnecessary dependencies, and sometimes that could give dependencies that are difficult to calculate. In [9] they give the following example:
3 The majority of the information for this basic explanation of Bayes' theorem and Bayesian networks has been found in [9].
4 http://en.wikipedia.org/wiki/Bayesian network, 20 sep 2012

Figure 2.6: Simple Bayesian network

Figure 2.7: Bayesian networks - order of nodes

The Bayesian networks in figure 2.7 both describe the same problem; the only difference is the order of the nodes. In network A we have Alarm, which is dependent on Burglary and Earthquake, while MaryCalls and JohnCalls are dependent on Alarm. This means that if either a burglary is in progress or there is an earthquake, the alarm will go off, which will cause either or both of Mary and John to call the owner. In network B we have JohnCalls, which is dependent on MaryCalls, and Alarm, which is dependent on both MaryCalls and JohnCalls. Burglary is dependent on Alarm, and Earthquake depends on both Burglary and Alarm. A few details from the example in the book are needed in order for this to make sense. In the example they state that Mary often listens to loud music, so she might not hear the alarm; so if she is calling, there is a high probability that John will call as well, which makes JohnCalls dependent on MaryCalls. The book, [9], is quoted here for the dependency between Burglary and Earthquake:

If the alarm is on, it is more likely that there has been an earthquake. (the
alarm is an earthquake sensor of sorts.) But if we know that there has been
a burglary, then that explains the alarm, and the probability of an earthquake
would only be slightly above normal. Hence, we need both alarm and burglary
as parents.

P(R=T | G=T) = P(G=T, R=T) / P(G=T)
             = Σ_{S∈{T,F}} P(G=T, S, R=T) / Σ_{S,R∈{T,F}} P(G=T, S, R)

Expanding each joint probability as P(G, S, R) = P(G|S, R) P(S|R) P(R):

             = (0.99 · 0.01 · 0.2 + 0.8 · 0.99 · 0.2) / (0.99 · 0.01 · 0.2 + 0.9 · 0.4 · 0.8 + 0.8 · 0.99 · 0.2 + 0.0 · 0.6 · 0.8)
             = (0.00198 + 0.1584) / (0.00198 + 0.288 + 0.1584 + 0.0) ≈ 35.77%

Figure 2.8: Calculating probability in a Bayesian network.

As we can see, we get two additional dependencies when the Bayesian network is arranged like B instead of A. This could, as stated above, mean that the computations become harder to calculate.
Let's do an example⁵ that requires some calculations. Say that we have a Bayesian network that looks like figure 2.6 and we want to know: what is the probability that it is raining, given that the grass is wet? So what we want to know is P(R|G), where R means raining, G means grass wet, S means sprinkler turned on and T means true. In figure 2.8 we can see how it is solved by using Bayes' theorem and other statistical formulas/rules. With these it is possible to describe the probability of most of the scenarios that can occur, based on the probability functions, even if some parts are unknown. There is a lot more that can be said about Bayesian networks, but this is supposed to be a brief introduction/explanation. If this approach gets chosen, a more thorough explanation can be read in section 2.3.
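The computation in figure 2.8 can also be written out in code. The C# sketch below hard-codes the conditional probabilities implied by the numbers in the worked example; they are assumptions taken from that example, not from a real model:

// P(R=T | G=T) for the rain/sprinkler/grass network, following figure 2.8.
double pR = 0.2;                                  // P(R=T)
double[] pSGivenR = { 0.01, 0.4 };                // P(S=T|R=T), P(S=T|R=F)
// P(G=T | S, R), indexed [s, r] where 0 = true, 1 = false:
double[,] pGGivenSR = { { 0.99, 0.9 }, { 0.8, 0.0 } };

double numerator = 0, denominator = 0;
for (int s = 0; s < 2; s++)                       // sum over sprinkler states
    for (int r = 0; r < 2; r++)                   // sum over rain states
    {
        double pr = r == 0 ? pR : 1 - pR;
        double ps = s == 0 ? pSGivenR[r] : 1 - pSGivenR[r];
        double joint = pGGivenSR[s, r] * ps * pr; // P(G=T, S=s, R=r)
        denominator += joint;
        if (r == 0) numerator += joint;           // only the R=T terms
    }
Console.WriteLine(numerator / denominator);       // ≈ 0.3577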

2.1.3 The numerical approach⁶
Some problems might be too hard for designers to solve on their own, since it can sometimes be hard (if not impossible) for a designer to predict all of the situations/states in which the system might find itself. Change over time is another thing that is hard to predict, e.g. the stock market, and sometimes the designers have no idea how to program the solution. This is where learning-based AI can be a good choice, since the system will learn to become a solution. There are many different types of learning approaches, but this project will focus on neural networks and explain how those work.

5 http://en.wikipedia.org/wiki/Bayesian networks, 20 Sep 2012
6 The majority of the information for this basic explanation of neural networks has been found on Wikipedia and some in [9]. The information about artificial neural networks has been found in [9].

Neural networks
In the world of artificial neural networks (ANN), or simply neural networks (NN), one tries to achieve 'intelligence' by modelling the system after a biological neural network (BNN), like the human brain. Before I continue to explain ANNs I will try to explain how a BNN works.
Disclaimer: This will be a simple explanation since I am far from an expert in the field of neuroscience.

Figure 2.9: A neuron.⁷

A BNN is a vast network of connected nerve cells called neurons. Each neuron consists of a cell body (soma) which contains a cell nucleus, which in turn contains the cell's genetic material. Stretching out from the body are a number of dendrites, which receive signals from other neurons, and a single long fiber called the axon. The axon sends signals to other neurons. The axon and the dendrites are connected to other neurons at junctions called synapses. So that is the structure of the BNN; now to explain how it works. Let's say that you see a flower. A lot of your neurons will start firing and sending signals to other neurons until they stop at a state where you either recognise that it is in fact a flower, maybe even what type of flower, or that it is something unknown.
This is what an ANN is trying to mimic. An ANN is built with a few layers: first one input layer, then a number of hidden layers and lastly one output layer. Each layer consists of a number of nodes (neurons) and each of these nodes is connected to all the nodes in the next layer, in one direction, see figure 2.10. The connection between two nodes, let's say i and j, serves to propagate the activation ai from i to j, and this connection has a numeric weight, wij, associated with it. This weight describes the strength and sign of the connection. During the learning process the weights are updated to produce a desired signal flow. When a node derives its output it first calculates the weighted sum of all its inputs and then applies an activation function to this sum. This is called a feed-forward network and it is the type that would be used if the ANN approach is chosen.
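The output of a single node can be sketched in a few lines of C#. The sigmoid used here is one common choice of activation function, an assumption for illustration rather than something prescribed by the text:

using System;

// Output of node j: the weighted sum of its inputs passed through
// an activation function (here a sigmoid).
static double NodeOutput(double[] inputs, double[] weights)
{
    double sum = 0;
    for (int i = 0; i < inputs.Length; i++)
        sum += weights[i] * inputs[i];   // weighted sum of incoming activations
    return 1.0 / (1.0 + Math.Exp(-sum)); // sigmoid activation
}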
One of the biggest risks when working with AI and systems that need to be trained is the risk of over-training them. This could mean that the system starts over-fitting, and if the network is too big, the network might become a big lookup table. There are a few available techniques that can help reduce over-fitting, but those will be mentioned in section 2.3 if this approach is chosen.

7 http://en.wikipedia.org/wiki/Nervous system, 18 sep 2012

Figure 2.10: Neural network with one hidden level.8

2.1.4 Pros and cons


In this section the pros and cons of each of the methods will be evaluated and compared against the others, and based on those a method will be chosen as the most suited for this project. The pros and cons will be based on these criteria:

– Can the method handle a variable number of inputs?
– How would it have to handle dates/sums?
– Can it handle complex rules?
– How well can it handle changes to the rules?

Can the method handle a variable number of inputs


Fuzzy clustering and neural networks do not handle a variable number of inputs so well, so it would be necessary to either assign the missing inputs as NULL or 0, or make many small expert systems for e.g. every type of moment. If the first choice is used we will end up with a big system that will be harder to train and would require a lot more data for the training. The second choice, with the expert systems, will be simpler to train, require less training data, and after a rule change it might only be necessary to retrain a few of all the expert systems. But a problem that arises is that the results would have to be combined if an insurance contains multiple moments, and it would then be necessary to know what to do if the moments produce different results, e.g. one moment is green while another is yellow or red. We could make small expert systems for every insurance type as well, but then we would still have the problem that an insurance can contain multiple moments, i.e. the number of inputs will vary. Bayesian networks can handle variable inputs since it is possible to calculate the probability of missing variables by using Bayes' theorem and other statistical functions/rules. Given the number of input variables available in this project and their dependencies, it could mean that we would have to construct a big and complex model. That could lead to complex calculations.

8 http://en.wikipedia.org/wiki/Artificial neural network, 18 sep 2012

How can it handle dates and/or sums

This problem is the same for all methods. The date or sum will have to be converted into a numerical representation before the system can use it, which means that we need to figure out how they should be represented.

Can it handle complex rules

Neural networks are good at learning complex rules, but one can never really be sure which complex rule the network has learned. Fuzzy clustering can also learn complex rules, but unlike neural networks it is possible to show how/why it makes the choices it makes. Bayesian networks can describe complex stochastic relationships between variables.

How well can it handle rule changes

If the rules change then both the neural network and fuzzy clustering require retraining. It takes a long time to train a neural network and even longer for a fuzzy cluster. This makes it even more interesting to make one expert system for each moment, since that could help us reduce training times. The smaller expert systems would require less training data and would have to learn a less complex function, which will save us time. Bayesian networks do not require retraining, though it might be necessary to update the probability tables.

2.2 The chosen method


The method that was chosen for this project was fuzzy clustering. The two deciding reasons why fuzzy clustering was chosen were:

Show how the system 'thinks' With fuzzy clustering it is possible to show, with e.g. graphs, how the system 'thinks', which I evaluated to be a strong reason for picking fuzzy clustering.

Similar problem with good results Patrik, my supervisor at the CS department, had done similar work with fuzzy clustering, with good results.

In section 2.3 the algorithms for fuzzy clustering will be explained.

2.3 Algorithms
In this section I will describe the fuzzy clustering algorithms used by this system.

2.3.1 C-Means Fuzzy Clustering Algorithm


This is the main algorithm for the type of fuzzy clustering that will be used in this project, namely C-means fuzzy clustering. The algorithm for C-means clustering is given in figure 2.11 and can be found in [2], page 69.

Step 1: Fix c and m. Initialize U to some U^(1). Select ε > 0 as a stopping condition.
Step 2: Update the midpoints v_i for each cluster c_i.
Step 3: Compute the sets μ_k = { i : 1 ≤ i ≤ c, ||x_k − v_i|| = 0 } and update U^(ℓ) according to the following:
    if μ_k = ∅, then u_ik = 1 / Σ_{j=1}^{c} ( ||x_k − v_i|| / ||x_k − v_j|| )^{2/(m−1)}
    otherwise u_ik = 0 for all i ∉ μ_k, and Σ_{i∈μ_k} u_ik = 1
Step 4: Stop if ||U^(ℓ+1) − U^(ℓ)|| < ε; otherwise go to Step 2.

Figure 2.11: C-Means clustering algorithm

Step 1: In this step the membership matrix U is initialized. The membership matrix is a matrix that contains all the membership values in the application. It is used to check how strongly an input is tied to a particular cluster. Creating and initializing a membership matrix is very simple: just create a matrix of size C×I, where C is the number of clusters and I the number of inputs, and fill the matrix with randomized values 0 ≤ value ≤ 1, as can be seen in table 2.1. There is, however, a criterion that the values have to satisfy: the sum of all values for one cluster must be 1, which in table 2.1 is not satisfied. That is easily fixed by dividing all the values in that cluster by the sum of the values. In table 2.2 the values have been divided by the sums from table 2.1 and now sum up to 1.

Table 2.1: Membership matrix with randomized values between 0 and 1

Clusters    I1     I2     I3     I4     Sum*
Cluster1    0.05   0.53   0.48   0.16   1.22
Cluster2    0.79   0.69   0.60   0.91   2.99
Cluster3    0.67   0.80   0.40   0.35   2.22

* The sum is not part of the membership matrix; it is just there to show that it is > 1.

Table 2.2: The membership matrix has been adjusted to sum up to 1 by dividing the input values by the previous sum

Clusters    I1            I2            I3            I4            Sum*
Cluster1    0.040983607   0.434426230   0.393442623   0.131147541   1
Cluster2    0.264214047   0.230769231   0.200668896   0.304347826   1
Cluster3    0.301801802   0.360360360   0.180180180   0.157657658   1

* The sum is not part of the membership matrix; it is just there to show that it now sums up to 1.

Step 2: In this step we calculate the center, or midpoint, of each cluster. The center can be calculated with:

v_i = Σ_{k=1}^{n} (u_ik)^m x_k / Σ_{k=1}^{n} (u_ik)^m

where u_ik is the membership value of point x_k for the i:th cluster and m is a fuzziness value that works as a weighting exponent.

Step 3: In this step the membership matrix U is updated, and there are basically two cases that can occur. One is that the center of a cluster is right on top of one or more points. In that case, the points that are under the center will get a membership value of 1.0 for that cluster and a value of 0.0 for the rest of the clusters, unless two clusters have the same center, which is unlikely but still possible. If two clusters have the same center and a point lies on that center, then that point will get a value of 1.0/(number of clusters in the same location); so if there are two clusters, the point would get the membership value 1.0/2 = 0.5. In the other case, the value is updated based on this formula:

u_ik = 1 / Σ_{j=1}^{c} ( ||x_k − v_i|| / ||x_k − v_j|| )^{2/(m−1)}

where ||x_k − v_i|| is the Euclidean distance between the current cluster center, v_i, and the current point, x_k, ||x_k − v_j|| is the Euclidean distance between cluster center v_j and the current point, and the variable m is the fuzziness variable that is used as a weighting exponent.

Step 4: In this step we check if the difference between the old membership matrix and the new one is less than the chosen stopping condition, or if there has not been any change to the matrix. If not, we go back to step 2.
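Put together, steps 2-4 form a simple iteration. Below is a compact C# sketch for one-dimensional data points; the actual implementation works on full input vectors, and the zero-distance special case from Step 3 is omitted for brevity:

using System;

static double[,] CMeans(double[] x, int c, double m, double eps)
{
    int n = x.Length;
    var rnd = new Random();
    double[,] u = new double[c, n];
    // Step 1: random initialization (normalisation omitted here).
    for (int i = 0; i < c; i++)
        for (int k = 0; k < n; k++) u[i, k] = rnd.NextDouble();

    double diff;
    do
    {
        // Step 2: midpoints v_i = sum_k (u_ik)^m x_k / sum_k (u_ik)^m.
        double[] v = new double[c];
        for (int i = 0; i < c; i++)
        {
            double num = 0, den = 0;
            for (int k = 0; k < n; k++)
            {
                double w = Math.Pow(u[i, k], m);
                num += w * x[k];
                den += w;
            }
            v[i] = num / den;
        }
        // Step 3: u_ik = 1 / sum_j (d_ik / d_jk)^(2/(m-1)).
        diff = 0;
        for (int k = 0; k < n; k++)
            for (int i = 0; i < c; i++)
            {
                double dik = Math.Abs(x[k] - v[i]);
                double s = 0;
                for (int j = 0; j < c; j++)
                    s += Math.Pow(dik / Math.Abs(x[k] - v[j]), 2.0 / (m - 1));
                double newU = 1.0 / s;
                diff = Math.Max(diff, Math.Abs(newU - u[i, k]));
                u[i, k] = newU;
            }
        // Step 4: stop when the change in the membership matrix is below eps.
    } while (diff > eps);
    return u;
}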

This algorithm runs with c set to a fixed number of clusters, so it has to run several times with different c values in order to find the best c. It is preferred to have as small a c as possible, to limit the number of rules to a reasonable amount. If the c value is too big then the system will become over-fit, which could mean that each data point in the system ends up in a cluster of its own, and the rule for that cluster will then most likely become that data point. When the algorithm has run for many different c values it is time to see which c is the best. To find the best c we use something called the criterion number. The criterion number is calculated according to:

S(U, c) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m [ ||x_k − v_i||² − ||v_i − x̄||² ]

Figure 2.12: Criterion number

where x̄ is the center of all data points. The goal is to get the c which generates the smallest criterion number, though having c = number of data points usually generates the best criterion number. But, as stated above, that is not what we want, since it would make the system over-fit. After the best number of clusters has been discovered it is time to identify the rules of each of the clusters, which you can read about in section 2.3.2.
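The criterion number of figure 2.12 translates directly into code (a C# sketch, again for one-dimensional points; System.Linq provides Average):

using System;
using System.Linq;

// S(U, c) from figure 2.12: the c giving the smallest value is preferred,
// excluding the degenerate case where c equals the number of data points.
static double Criterion(double[,] u, double[] x, double[] v, double m)
{
    double mean = x.Average(); // x̄, the center of all data points
    double s = 0;
    for (int i = 0; i < v.Length; i++)
        for (int k = 0; k < x.Length; k++)
            s += Math.Pow(u[i, k], m) *
                 (Math.Pow(x[k] - v[i], 2) - Math.Pow(v[i] - mean, 2));
    return s;
}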

2.3.2 Rulebase algorithms


After we have divided all of the data points into clusters we need to make projections of these clusters in order to create rules for the rulebase. Each cluster will generate one rule based on the projections of the cluster. In figure 2.5 a projection of a cluster can be seen. The closer the projection is to the center, the higher it will be; so the highest point of the projection is the center. When the projection has been created we need to find a function which best fits this curve, and that function will become the rule.
The number of projections needed depends on how many inputs a data point consists of; so if we have data points with three inputs, Age, Sex and Income, then we would need to make three projections of the cluster, one for every input.
With these projections we can now create rules. To create a rule we use the following formula:

u_ik = e^{−β_p (π_p(x_k) − α_ip)²}

Figure 2.13: Create rules

Where u_ik is the membership value of data point x_k in the i:th cluster, and π_p is the projection on the p:th axis, or p:th input. The variable β_p is expressed as β_p = 1/(2σ²), where σ is the standard deviation of π_p. α_ip is the center of cluster i in π_p. Running this on all clusters and all axes will create the set of rules that constitutes the rulebase. After the rulebase has been created, the fuzzy clustering is completed and it is ready for real input.
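Reading figure 2.13 as a Gaussian membership curve, which is an assumption based on the definitions of β_p and α_ip above, a single rule can be evaluated like this in C#:

using System;

// Membership of input value xp (the p:th projection of a data point)
// in the rule for cluster i, following figure 2.13.
static double RuleMembership(double xp, double alphaIp, double sigma)
{
    double beta = 1.0 / (2 * sigma * sigma); // β_p = 1/(2σ²)
    double d = xp - alphaIp;                 // distance to the cluster center in π_p
    return Math.Exp(-beta * d * d);
}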

2.3.3 Inference algorithms


In order to get output from the rulebase we need something that can take the input and the rulebase and translate them into output, and for that we have inference methods. In this project two inference techniques have been implemented: Mamdani's method and Takagi-Sugeno's method.

Activation
Both of the inference methods use an activation function. The first thing this function does is calculate the input using this formula:

e^{−(x_kp − v_ip)² / (2σ²)}

Figure 2.14: Calculate the input.

Where x_kp is the p:th input of data point x_k and v_ip is the center of the p:th input of the i:th cluster. σ is the i:th cluster's σ that we used when calculating the rules in section 2.3.2. The results are saved in arrays, where each array represents one of the clusters and contains the calculations for all the dimensions of the input. So if we have a data point x with the inputs age, sex and income, then we would get an array that looks like array_r1 = [calc(age), calc(sex), calc(income)]. After all of the arrays have been created we go to the next step, which is to, depending on the configuration chosen, either select the smallest values, minimum, or the largest values, maximum, in these arrays and save them in a new array which represents the activation function.
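A C# sketch of this activation step, producing one value per cluster/rule:

using System;

// Activation value for one cluster: apply figure 2.14 per input dimension,
// then aggregate with min or max depending on the configuration.
static double Activation(double[] x, double[] v, double sigma, bool useMin)
{
    double result = useMin ? double.MaxValue : double.MinValue;
    for (int p = 0; p < x.Length; p++)
    {
        double d = x[p] - v[p];
        double calc = Math.Exp(-(d * d) / (2 * sigma * sigma));
        result = useMin ? Math.Min(result, calc) : Math.Max(result, calc);
    }
    return result;
}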

Mamdani's method
The process of Mamdani's method can be seen in figure 2.15.
After the input has been calculated and the correct values have been chosen, see the Activation paragraph above, Mamdani's method will try to find the best rule for each output. Mamdani's membership function can be seen in equation (2.16), where α_i are the activation values, U_i are the output values from the rulebase and U is a fuzzy set.
To clarify things we will go through figure 2.15 step by step. In this figure we have a system that contains three rules: 1, 2 and 3. The curves represent how each rule depends on the input. In the figure, input1 has been calculated to be 3 and input2 to be 8, and following the arrows we can see the results of the inputs where they hit the curves. The next step is to see which of the inputs has the biggest effect on the final result, and in Mamdani's case we do this with the maximum function. The result of this is a fuzzy set, U, which is represented by the green graphs in figure 2.15.
Now, to make sense of this fuzzy set we need to defuzzify it. There are a few techniques for defuzzification, but the one used in this project is called Centre-of-Gravity. First we combine all values in the fuzzy set so we get something that looks like the graph called "Result of aggregation" in figure 2.15. It is on this graph we want to find the centre of gravity, which can be done by using equation (2.17), where μ_U is the combined fuzzy set and u_k is the k:th member of the fuzzy set.
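Equation (2.17) amounts to a weighted average; a small C# sketch over a discretised fuzzy set:

// Centre-of-Gravity defuzzification, following equation (2.17):
// u = Σ u_k · μ_U(u_k) / Σ μ_U(u_k), over the l members of the fuzzy set.
static double CentreOfGravity(double[] u, double[] mu)
{
    double num = 0, den = 0;
    for (int k = 0; k < u.Length; k++)
    {
        num += u[k] * mu[k];
        den += mu[k];
    }
    return num / den;
}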
There is another method that is based on Mamdani's method but with a small alteration. The method is called Larsen's method and uses the product as implication instead of the minimum that Mamdani's method uses. This is mentioned because the implementation allows four different configurations: the activation step has two settings, maximum or minimum, and in the Mamdani step it is possible to use Larsen's method instead. Larsen's membership function can be seen in equation (2.18).
9 http://www.dma.fi.upm.es/java/fuzzy/fuzzyinf/mamdani3 en.htm, 15 Feb 2013

Figure 2.15: Process of Mamdani’s method.9

U = ∨_{i=1}^{n} (α_i ∧ U_i)

Figure 2.16: Mamdani - Output membership function.


u = Σ_{k=1}^{l} u_k · μ_U(u_k) / Σ_{k=1}^{l} μ_U(u_k)

Figure 2.17: Centre of Gravity.


U = ∨_{i=1}^{n} (α_i · U_i)

Figure 2.18: Larsen - Output membership function.

Takagi-Sugeno’s method

This method uses linear functions to create an inference. The output is represented as u_i = p_i1 + p_i2 x_1 + p_i3 x_2, one for each rule. The first thing that needs to be done is to compute the constants, p, for each rule. In [15], Takagi and Sugeno describe a way to calculate these constants.
Let X be an m × n(k + 1) matrix (figure 2.19), Y an m vector (figure 2.20) and P an n(k + 1) vector (figure 2.21).


X = [ β_11, …, β_n1, x_11·β_11, …, x_11·β_n1, …, x_k1·β_11, …, x_k1·β_n1
      ⋮
      β_1m, …, β_nm, x_1m·β_1m, …, x_1m·β_nm, …, x_km·β_1m, …, x_km·β_nm ]

Figure 2.19: Takagi-Sugeno - X matrix.

Y = [y_1, …, y_m]^T

Figure 2.20: Takagi-Sugeno - Y vector.

P = [p_10, …, p_n0, p_11, …, p_n1, …, p_1k, …, p_nk]^T

Figure 2.21: Takagi-Sugeno - P vector.

Where β is defined as seen in figure 2.22, i represents the i:th rule, j indexes and m counts the data points in the system, k is the k:th input in a data point and n is the number of rules in the system. A_ik denotes the membership function of the k:th input in rule i, x_kj is the k:th input from the j:th data point and y_m is the output of data point m. The P vector contains the constant values needed to calculate the expected output and is generated by the matrix computation seen in figure 2.23. In figure 2.21, p_kn means the n:th constant for rule k. After the P vector has been calculated we use the formula u_i = p_i1 + p_i2 x_1 + p_i3 x_2 to get the output u_i, which we then use to calculate the final output with:

u = Σ_{i=1}^{n} α_i u_i / Σ_{i=1}^{n} α_i

Where α_i is the activation value that was described in the Activation paragraph.

β_ij = ( A_i1(x_1j) ∧ ⋯ ∧ A_ik(x_kj) ) / Σ_j ( A_i1(x_1j) ∧ ⋯ ∧ A_ik(x_kj) )

Figure 2.22: Takagi-Sugeno - β variable.

P = (X^T X)^{−1} X^T Y

Figure 2.23: Takagi-Sugeno - matrix computation
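The matrix computation in figure 2.23 is an ordinary least-squares solve. Below is a sketch using the MathNet.Numerics library; the library choice and the placeholder sizes are assumptions for illustration, as the text does not state how the computation was implemented:

using MathNet.Numerics.LinearAlgebra;

// Solve P = (XᵀX)⁻¹ XᵀY for the Takagi-Sugeno constants.
// X is the m × n(k+1) matrix of figure 2.19, Y the output vector of figure 2.20.
int m = 100, cols = 9;                                  // placeholder sizes
Matrix<double> X = Matrix<double>.Build.Dense(m, cols); // filled from the β values
Vector<double> Y = Vector<double>.Build.Dense(m);       // filled from the outputs
// ... fill X and Y from the training data ...
Vector<double> P = (X.Transpose() * X).Inverse() * X.Transpose() * Y;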


Chapter 3

Implementations, results and validations

3.1 Development environment


3.1.1 .Net
One criterion for this thesis was that it should be implemented in the .Net 4.0 environment, since that is what they use at Acino. C# and .Net have excellent support for computers running Windows, since it is Microsoft that develops them both. I had never used C# prior to this project and knew that it could be a hindrance. However, I found it very similar to Java, a language that I have used many times, which made it easier to learn. I might therefore have been influenced to program in a more Java-like style and thus might not have taken full advantage of the C# language. But it was an opportunity to learn C#, which is a widely used language in the business world and thus good to know to make oneself more attractive on the job market.

3.1.2 Useful embedded namespaces

A namespace in the .Net environment is basically a collection of useful methods and functions. The System.IO namespace, for instance, contains useful methods for input and output, like reading and writing data streams. In this section I will mention some of the namespaces that I have used in this project. The descriptions are from the MSDN site¹¹.

System.IO: namespace contains types that allow reading and writing to files and data
streams, and types that provide basic file and directory support.

System.Runtime.Serialization: namespace contains classes that can be used for serializing and deserializing objects. Serialization is the process of converting an object or a graph of objects into a linear sequence of bytes for either storage or transmission to another location. Deserialization is the process of taking in stored information and recreating objects from it.
11 http://msdn.microsoft.com/en-us/


System.Linq: namespaces contain types that support queries that use Language-Integrated
Query (LINQ). This includes types that represent queries as objects in expression trees.

System.Windows.Forms: namespace contains classes for creating Windows-based applications that take full advantage of the rich user interface features available in the Microsoft Windows operating system.

System.Xml: namespaces contain types for processing XML. Child namespaces support serialization of XML documents or streams, XSD schemas, XQuery 1.0 and XPath 2.0, and LINQ to XML, which is an in-memory XML programming interface that enables easy modification of XML documents.

System.Text: namespaces contain types for character encoding and string manipulation.
A child namespace enables you to process text using regular expressions.

3.1.3 Database management via XML export


This implementation uses XML to get the information about the insurances. Currently you have to download this information to a file in order for the program to read it, but it should not be too difficult to tie the implementation to a server that hosts the insurance information. One thing to keep in mind is that there are two styles in use. At Acino, where I am doing this thesis, they have one style, and at Svenska Försäkringsfabriken they have another style. The biggest difference is that the latter uses codes for the different fields while Acino uses words/names, so the Acino format is easier to read, and that is the format that the implementation uses. The XML document is read by a preprocessor which parses out the interesting parts of the data and sends them to the main program.
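A minimal sketch of such a preprocessing step using System.Xml.Linq. The file, element and field names here are hypothetical, since the actual export format is not reproduced in this report:

using System;
using System.Xml.Linq;

// Read the exported XML and pull out the interesting parts.
var doc = XDocument.Load("insurances.xml");              // hypothetical file name
foreach (var insurance in doc.Descendants("Insurance"))  // hypothetical element name
{
    string number = (string)insurance.Element("InsuranceNumber"); // hypothetical field
    foreach (var moment in insurance.Descendants("Moment"))
    {
        // Forward the relevant fields to the main program here.
    }
}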

3.2 Result
The results were a bit of a surprise for a number of reasons. The first one was that the Mamdani methods that used the maximum function in the activation phase had a constant hit rate which never changed. Looking at a sample of the output of the methods, table 3.2, we can see that the interval of these methods only covers two outcomes, namely 3.0 = Green and 4.0 = Gray, which explains the poor results. The method with the best results is the Takagi-Sugeno method, which covers the whole range of expected results. In table 3.1 the observed upper and lower limits of each method can be seen.

Table 3.1: The output intervals of the different inference methods

Inference       Lower limit   Upper limit
Takagi-Sugeno   1.0           3.5
T/F*            0.5           2.5
T/T             0.5           2.5
F/F             3.0           4.0
F/T             3.0           4.0

* T/F, (activation function: T = Minimum, F = Maximum)/(Mamdani function: T = Product, F = Minimum)

The second surprise is that the hit rate for all the methods was overall very stable, with only a few drops. It was expected that the results would improve, or at least change, with different configurations, but they were very stable, with the exception of the T/T Mamdani method, which struggled when the fuzziness variable was > 12. Another surprise with that Mamdani method was that it struggled when the number of runs was > 140, which is strange since all that does is run the program again to find the best system, the one with the best criterion number.

3.3 Impact of the parameters


There are a few settings that can affect the results. In this section we will go through some of them and see what impact they have on the hit rate. But first one thing needs to be explained: in the charts the Mamdani inference is named T/T, T/F, F/T or F/F, where T stands for true and F for false. The first symbol states whether the Min-function or the Max-function has been used in the activation, T meaning that Min was used. The second symbol states whether the Product-function or the Min-function has been used, T meaning that the Product-function was used.

3.3.1 The Fuzziness Variable


The fuzziness variable, m, is a weighting exponent that controls how much weight is given to the closest cluster center.
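
To make the role of m concrete, the standard C-means membership update (a reminder of the algorithm in section 2.3.1; the notation there may differ slightly) is

    u_{ik} = \left( \sum_{j=1}^{c}
             \left( \frac{\| x_k - v_i \|}{\| x_k - v_j \|} \right)^{2/(m-1)}
             \right)^{-1}

where x_k is a data point, v_i a cluster center and c the number of clusters. As m approaches 1 the memberships become crisp, with all weight on the closest center, and as m grows they approach the uniform value 1/c; m thus controls how strongly the closest center dominates.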

Figure 3.1: The impact of the fuzziness variable

In figure 3.1 we can see that the Mamdani inferences that use the max function have a steady hit rate of 12% which never changes, and 12% is far from acceptable. The Mamdani inferences that instead use the min function perform around 40%, which is still not acceptable. There is a slight difference between the Mamdani variants that use the Min-function and the Product-function: the one that uses the min-function in both stages performs slightly better, and is a bit more stable, than the one that uses the min-function in the activation stage and the product-function in the next stage.

Table 3.2: Sample of the output from the system

    Expected   Takagi-Sugeno   T/F   T/T   F/F   F/T
    2          2.4             2.1   1.6   3.9   3.9
    2          2.4             2.1   1.6   3.1   3.1
    2          2.4             2.1   1.6   3.9   3.9
    2          2.4             2.1   1.6   3.9   3.9
    2          2.4             2.1   1.6   3.1   3.1
    2          2.4             2.1   1.6   3.9   3.9
    2          2.4             2.1   1.6   3.1   3.1
    2          2.4             2.1   1.6   3.1   3.1
    3          1.2             0.6   0.5   3.7   3.7
    1          1.2             2.3   2.2   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    4          3.2             0.6   0.4   3.4   3.4
    3          2.6             1.6   1.2   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    2          2.4             2.1   1.6   3.1   3.1
    2          2.4             2.1   1.6   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    2          1.4             2.3   1.9   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    3          1.2             2.4   2.4   3.9   3.9
    3          1.2             2.4   2.4   3.9   3.9
    1          1.2             2.3   2.3   4.0   4.0
    3          2.6             1.6   1.2   3.9   3.9
    2          2.4             2.1   1.6   3.1   3.1
    1          1.2             2.3   1.9   3.7   3.7
    1          1.2             2.3   1.9   3.7   3.7
    1          1.2             1.6   1.2   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    1          1.2             1.6   1.2   3.7   3.7
    1          1.2             1.6   1.2   3.7   3.7
    2          1.4             2.3   1.9   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    2          2.4             2.1   1.6   3.1   3.1
    4          3.2             0.6   0.4   3.4   3.4
    3          2.6             1.6   1.2   3.9   3.9
    3          2.6             1.6   1.2   3.9   3.9
    4          3.5             0.5   0.4   3.4   3.4

    Note: The numbers displayed are the actual outputs of the methods;
    during evaluation they are rounded to the nearest integer.

The inference method with the best performance is the Takagi-Sugeno inference, which reaches a hit rate around 69%, with a few drops. 69% is a lot better than the Mamdani methods, but still not acceptable; the hit rate we want to see is at least 90%.

3.3.2 The Number of Runs


Since the initialisation of the membership matrix is random, we can see differences between training runs. This means that it is a good idea to do a number of runs in order to get the best clustering we can.
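
A minimal sketch of this multi-run selection is shown below; FuzzySystem, RunClustering and the criterion value are stand-ins for the actual implementation.

    using System;

    // Hypothetical sketch of the multi-run selection; FuzzySystem and
    // RunClustering stand in for the real implementation.
    class MultiRunTrainer
    {
        class FuzzySystem { public double Criterion; }

        static readonly Random Rng = new Random();

        // Placeholder for one clustering run starting from a random
        // membership matrix; here it just returns a random criterion.
        static FuzzySystem RunClustering(double[][] data)
            => new FuzzySystem { Criterion = Rng.NextDouble() };

        static FuzzySystem Train(double[][] data, int numberOfRuns)
        {
            FuzzySystem best = null;
            for (int run = 0; run < numberOfRuns; run++)
            {
                // Each run starts from a new random membership matrix, so the
                // result varies; keep the system with the highest criterion.
                FuzzySystem candidate = RunClustering(data);
                if (best == null || candidate.Criterion > best.Criterion)
                    best = candidate;
            }
            return best;
        }
    }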

Figure 3.2: The impact of the number of runs variable

In figure 3.2 we can see the impact of the number of runs performed. Most of the inference methods are stable, which could mean that in some runs they were simply lucky or unlucky. The number of runs does, however, seem to have a big impact on one of the inference methods, namely the Mamdani version where the minimum function is used in the activation and the product function is used in the Mamdani method. When the product function is used inside the Mamdani method, it is known as Larsen's method.

3.3.3 The Cluster Interval


It is preferable to have as few clusters as possible in order to avoid overfitting the system. In this test we look for an interval that does not affect the hit rate while minimising the number of cluster counts we have to try. The label 20/90 means that the system will cluster between 20% and 90% of the data points. So if we have one hundred data points in the system, 90 − 20 = 70 different clustering systems would be created and then compared to see which one yields the best result.
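
As a small worked example of how such an interval label translates into cluster counts (the names below are hypothetical):

    using System;

    // Hypothetical sketch: turn an interval label such as 20/90 into the
    // range of cluster counts to build and compare.
    class ClusterInterval
    {
        static void Main()
        {
            int dataPoints = 100;
            int lower = (int)(0.20 * dataPoints); // 20 clusters
            int upper = (int)(0.90 * dataPoints); // 90 clusters
            Console.WriteLine("Clusterings to compare: " + (upper - lower)); // 70
            for (int c = lower; c < upper; c++)
            {
                // Build a clustering with c clusters and keep the one with
                // the best criterion, as in the sketch in section 3.3.2.
            }
        }
    }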
In figure 3.3 on the next page we can see that the methods are relatively stable, with a few drops and peaks. This could just be coincidence, since the interval should not make any difference in performance unless an area that usually creates the best clusters is removed from it. Say that the winning clustering usually comes from the system that uses around 60% of the data points; if we only check between 20% and 50% we might miss it and get a lower hit rate. Looking at figure 3.3 on the following page we can see that the interval between 20/60 and 20/40 seems to be the best. Keep in mind that this is mostly to reduce training time, not to improve the hit rate.

Figure 3.3: The impact of the interval limits


Chapter 4

Conclusion

There was a scheduling conflict at the beginning of the project: I had forgotten that I had planned to retake two courses at the same time, which in hindsight was not the best of ideas. When the courses started, the project was cut to a pace of 50%, so half the day was spent on one of the courses and the other half on the project. It always took a while to get back into the project, so a lot of time was lost in the end. It would probably have been best not to take any courses on the side, or at least only one.

The goal of this project was to create a system that could evaluate insurances using AI. In section 1.3.1 on page 3 some criteria are presented; the only criterion that is not fully fulfilled is "The result of the evaluation should consist of a flag and a text". The system currently only gives a flag, and the reason is that I was unsure how the texts should be interpreted by the system. The system needs the input and output to be represented by floating-point numbers; it is probably possible to look up all combinations of texts and give them a numerical representation that can be combined in a good way.
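
A minimal sketch of such a lookup, under the simple assumption that each distinct text gets the next free code:

    using System.Collections.Generic;

    // Hypothetical sketch: give each distinct text a floating-point code so
    // that text fields can be fed to the clustering alongside the numbers.
    class TextEncoder
    {
        private readonly Dictionary<string, double> codes =
            new Dictionary<string, double>();

        public double Encode(string text)
        {
            // The first time a text is seen it is assigned the next free code.
            if (!codes.TryGetValue(text, out double code))
            {
                code = codes.Count;
                codes[text] = code;
            }
            return code;
        }
    }

Such codes impose an artificial ordering on the texts (two texts with adjacent codes are not necessarily similar), which is exactly the difficulty with combining them in a good way.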
The system is currently not a good replacement for the manual evaluation of insurances. But even though the system only managed a hit rate of 69% at best, there is potential in fuzzy clustering. In section 4.2 on the following page a few suggestions are mentioned that could help boost the performance and make it a dependable replacement.

4.1 Restrictions and Limitations


There are a few restrictions and limitations in this project:

Slow learning process: The process of training the system can be quite time consuming, depending on the number of data points that are used. But since this is not something that needs to be done very often, it can probably be disregarded. A way to reduce the training time would be to make it parallel, so that multiple numbers of clusters can be evaluated at the same time; the fuzzy clustering process is well suited to this because it can easily be divided into a number of independent tasks (a sketch is given after this list).

Requires retraining: If there is a change in the rules, then the system will need to be retrained. This could mean that new training data is required, which might have to be manually evaluated and could take some time, on top of the actual training time of the system mentioned in the previous item.


Variable number of inputs: The system cannot handle a variable number of inputs, and since two insurances can have different numbers of values/attributes we cannot just read the whole insurance and send it to the system. We need to choose a number of important values/attributes that all insurances have and use those. The fewer that are used, the less complex the problem will be for the system, and the shorter the training time. If a new value/attribute is introduced, the system will have to be retrained.

No support for multiple instances: A way to improve the performance of the system would be to make different systems that each evaluate one type of insurance; this is discussed further in section 4.2. Currently the system does not support that. A few modifications would have to be made to support multiple instances: the start-up would have to support launching multiple instances and assigning tasks to them, and each instance would have to save its own training data without overwriting the others'.
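
A minimal sketch of the parallelisation suggested under Slow learning process above; BuildAndScore stands in for one full clustering run for a given number of clusters.

    using System.Threading.Tasks;

    // Hypothetical sketch of the proposed parallelisation: each task
    // clusters the data with its own number of clusters, independently
    // of the others. BuildAndScore is a placeholder.
    class ParallelTrainer
    {
        static double BuildAndScore(double[][] data, int clusters) => 0.0; // placeholder

        static double[] ScoreAll(double[][] data, int lower, int upper)
        {
            double[] scores = new double[upper - lower];
            Parallel.For(lower, upper, c =>
            {
                // The runs share no state, so they can execute side by side.
                scores[c - lower] = BuildAndScore(data, c);
            });
            return scores;
        }
    }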

4.2 Future work


There are a few improvements that could be made in the future and that might help improve the results. The proposed improvements, discussed in the following sections, are:

– Better selections of training data

– Small and focused

– Reinforcement learning

4.2.1 Better selections


It could be that the training set currently in use is too hard or complex for the system as it is. By making a more thorough selection of training data it could be possible to get better results, since it would be easier for the system to cluster.

4.2.2 Small and focused


At the moment the application is used as a single instance to evaluate all types of insurances. This means that the system needs to make one big and complex clustering that handles all these cases. By making many small systems that each handle a special case of the insurances, it should be easier for the systems to produce good clusterings. A few problems can occur when doing this, depending on how finely the insurances are divided. An insurance can contain a number of moments, and if we build a system made up of many smaller systems that each focus on one moment, we would have to come up with a way to combine the results for those moments. If, for example, one moment gives Green (movable) and another gives Red (not movable), should the combined output be Red, Green or Yellow (might be movable)?
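
One possible, conservative, combination rule would be to let the worst moment decide. The sketch below assumes a hypothetical ordering of the flags and is not taken from the implementation.

    using System.Linq;

    // Hypothetical sketch of a conservative combination rule: the insurance
    // is only as movable as its least movable moment. The Flag values and
    // their ordering are assumptions, not the real mapping.
    enum Flag { Green = 1, Yellow = 2, Red = 3 }

    class MomentCombiner
    {
        public static Flag Combine(Flag[] momentFlags)
            => momentFlags.Max(); // worst flag wins: Red > Yellow > Green
    }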

4.2.3 Reinforcement learning


During reinforcement learning a system will, after training, run the training data again and see how well it performs. If the system makes a good call it is 'rewarded', and a bad call results in a 'penalty'. The system then goes back to training and tries to adapt to the 'rewards' and 'penalties' it received. This process continues until a certain limit has been reached, e.g. a hit rate of at least 90%. By introducing reinforcement learning in the application it should be possible to improve the results.
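
A minimal sketch of that loop, with placeholder names and an added iteration cap so that it terminates even if the target hit rate is never reached:

    // Hypothetical sketch of the train/evaluate/adapt loop described above;
    // all names are placeholders for the real implementation.
    class ReinforcementTrainer
    {
        class Model { }

        static Model Train() => new Model();                    // initial training
        static double[] Feedback(Model m) => new double[0];     // rewards/penalties
        static Model Retrain(Model m, double[] feedback) => m;  // adapt to feedback
        static double HitRate(Model m) => 1.0;                  // fraction correct

        static Model TrainUntil(double target)
        {
            Model model = Train();
            int iterations = 0;
            while (HitRate(model) < target && iterations++ < 1000)
            {
                // Good calls are 'rewarded' and bad calls 'penalised'; the
                // system retrains and adapts to the feedback it received.
                model = Retrain(model, Feedback(model));
            }
            return model;
        }
    }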
Chapter 5

Acknowledgments

I would like to thank Acino for letting me do my thesis there, and I would also like to thank all of Acino's employees for making me feel welcome. I especially want to give a big thanks to Hannes Kock, my supervisor at Acino; Patrik Eklund, my supervisor at the CS department; and Anna Theorin, my contact at Svenska Försäkringsfabriken, for helping me with this thesis.
