Umeå University
Department of Computing Science
SE-901 87 UMEÅ
SWEDEN
Abstract
Today, the decision to move an insurance from one company or bank to another is made manually, so there is always the risk that an incorrect decision is made due to human error. The goal of this thesis is to evaluate the possibility of using artificial intelligence, AI, to make that decision instead. The thesis evaluates three AI techniques: fuzzy clustering, Bayesian networks and neural networks. These three techniques were compared, and it was decided that fuzzy clustering would be the technique to use. Even though fuzzy clustering only achieved a hit rate of 69%, there is a lot of potential in the technique. In section 4.2 on page 32 a few improvements are discussed which should help raise the hit rate.
Contents
2 Theory 7
2.1 Overview of ’intelligent computing’ . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.1 The logical approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1.2 The probabilistic approach . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.3 The numerical approach . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.1.4 Pros and cons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.2 The chosen method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.3.1 C-Means Fuzzy Clustering Algorithm . . . . . . . . . . . . . . . . . . 17
2.3.2 Rulebase algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3.3 Inference algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4 Conclusion 31
4.1 Restrictions and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.1 Better selections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.2 Small and focused . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.2.3 Reinforced learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
5 Acknowledgments 35
References 37
List of Figures
List of Tables
Chapter 1
1.1 Introduction
Today’s banks and insurance companies have the possibility to, with the customer’s consent, take over personal insurances such as capital, service and private pension insurances. Which insurances are movable, i.e. possible to take over, or which parts of an insurance are movable, is decided partly by legislation and partly by the content of the insurance. The size of one’s capital can be of great significance to an insurance company or bank when it decides whether it wants to take over the insurance. The movability is also decided by rules set internally by each separate company. This makes the movability of an insurance dependent on a lot of parameters that can change frequently. Today the majority of the evaluation of the movability of insurances is done manually, based on manually created sets of rules. There are three main types of assessment criteria: green - the insurance is movable, yellow - the insurance might be movable or partly movable, and red - the insurance is not movable. Sometimes there are a number of sub-criteria within each of the main criteria, where a more detailed assessment is described. Here it can, for example, be described how to proceed in a move matter, for instance by requesting a health certificate. Manual evaluation of insurances is a time consuming process which could benefit from becoming fully or partly automated. The goal of this thesis project is to create a system that can classify insurance information.
1.2 Insurances
This section will explain what an insurance is, how an insurance is structured, and what to keep in mind when reviewing whether the information in an insurance is correct. When reading this, keep in mind that it is based on Swedish insurances and might differ from insurances in other countries.
2 Chapter 1. Background and motivation
accident or if you retire. Some insurances will only give you back roughly the same amount of money as you have paid in; these are called savings insurances, and an example of such an insurance is the pension insurance. Another type is called risk insurances, which will in many cases give you much more money than you have paid to the insurance company, but they are only valid as long as you pay the fee, and you will not get the money back if you stop paying. The risk insurance system works because of the number of people the companies insure and because not everyone will be in e.g. an accident and in need of insurance money. So the customers pay for each other; if all customers suddenly needed insurance money, the company would likely go bankrupt. The insurance companies have rules by which they evaluate the risk of something happening and either adjust the required fee or simply do not approve the insurance.
For example: let’s say an 18 year old man wants to insure a brand new super car in a major city. He will then most likely have to pay a huge yearly fee, or the insurance company might not want to sign him at all because of the high risk of something happening to the car. The risks they might look at in this case could be:
– His young age and the fact that he is male, which statistically means that he is more likely to be in some kind of accident.
– A powerful sports car is easier to lose control of, and such cars are a more desirable target for thieves.
– He lives in a major city with a big population, which means more interaction with more people and therefore a higher chance that some kind of accident occurs.
1.2.2 Structure
An insurance can be split into two main parts. The first part contains general information about the insurance, such as the insurance number, the person that is insured, the owner of the insurance, the insurance provider, when it was signed, the cost of the insurance, etc. The second part contains information about the content of the insurance. The content of the insurance is divided into moments. It is possible for insurances to include more than one moment, so e.g. an accident insurance will contain an accident moment, and a pension insurance can contain a pension moment and a health insurance moment. The possibility to add more moments to the insurance is up to the insurance provider.
1.2.3 Types
There are a lot of different insurances, but they can all be divided into the two main types that were mentioned in section 1.2.1 on the previous page, namely savings insurances and risk insurances. Savings insurances are paid to the insured after a certain date, e.g. pension savings. Risk insurances are paid after accidents, sickness etc. and are only valid as long as the insurance fee is paid, so if a person is never in an accident they will not be able to use the money they have paid for the insurance. The advantage with a risk insurance, though, is that the money you get if you e.g. are in an accident can be much higher than the amount you have paid in.
Here are some of the types of insurances:
Pension this is an insurance that will let you receive money during your pension over a certain period of time and at a certain interval. Can be signed by a private person and/or a company.
Survivor’s protection this moment exists so that the husband/wife, children or other heirs of the insured get the money if the insured person dies.
Health insurance this insurance will provide the insured person with an income in case of sickness or early retirement due to ill health, though a qualifying period of sickness might be required.
Sickness-/prematurely-capital this insurance will grant you a one-time amount of money if you become sick or hurt enough that you are granted sickness benefit.
Medical treatment insurance this insurance covers costs during medical treatment/attendance.
Accident insurance usually contains three moments:
Medical treatment costs are costs that occur with sickness/injury, and include costs for treatment by a doctor/dentist but also necessary travel costs during the treatment.
Disability capital a one-time amount of money one receives if one is afflicted with a permanent disability or decreased capacity to work.
Death capital a sum that is paid to the husband/wife, children or heirs of the insured person in case of death.
Premium exemption this is an add-on that makes the insurance provider take responsibility for the payment of the premium if the insured becomes so sick that the period of sickness is greater than the qualifying period of sickness.
1.2.4 Pitfalls
There are several pitfalls to look out for when reviewing an insurance, and they differ slightly between the different types of insurances. Some general ones that apply to most insurances are e.g.: are the insurance number, personal number, dates, sums, status etc. correct? Another pitfall is an insurance with a status of continuous and a Z-time that has passed; in that case something is wrong. The Z-time marks the last day that the insurance is valid, but that is only the case for risk type insurances. A Z-time in a savings type insurance means the date when the provider should start paying e.g. a person’s pension.
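The Z-time pitfall above is mechanical enough to sketch as a validation check. The function below is a hypothetical illustration, not part of the thesis system; the field names and status values are assumptions:

```python
from datetime import date

def check_z_time(insurance_type, status, z_time, today=None):
    """Flag the Z-time pitfall described above.

    For a risk insurance, a status of 'continuous' combined with a
    Z-time in the past is inconsistent, since the Z-time marks the
    last day a risk insurance is valid. For a savings insurance the
    Z-time is a payout date, so a passed Z-time is not an error.
    """
    today = today or date.today()
    if insurance_type == "risk" and status == "continuous" and z_time < today:
        return "error: continuous risk insurance with an expired Z-time"
    return "ok"
```

A preprocessor could run checks like this one to filter out insurances that are pointless to evaluate.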
1.3.1 Goal
The process of evaluating the movability of insurances is time consuming, mainly because it is done manually and because of the complex rules. The goal of this thesis is to make a system that can, based on manually pre-classified insurance evaluations, learn to classify insurance information. This could potentially save a lot of time while reducing the number of human errors. The system also has to fulfil a few requirements.
1.3.2 Data
The data this system will use for training will be stored in XML files. The problem with the data is that the number of elements an insurance has differs between insurances. The possibility to add additional ”moments” to most of the insurances creates a lot of possible combinations and a lot of different elements that an insurance can contain. Some of these elements can of course be filtered out with a preprocessor, since they might not affect any decision, but that can, and probably will, still leave us with a variable number of remaining elements. It would be possible to create a large neural network, NN, that takes all possible elements and simply sets the ones not used to NULL or 0, but that would make it harder to train the system, since it would have to learn a more complex model. It would probably be more efficient to somehow divide the problem into a number of smaller NNs that are more focused on a subset of the problem, for example by giving each of the different types of insurance its own network. They would still be large and/or complex since they can have a lot of different moments, but perhaps it is possible to divide the problem even further, and maybe combine it with other approaches such as Bayesian networks or fuzzy clustering.
There is another problem to consider with the data, and that is how the date and sum values should be represented in the system. The problem is that they are values that can be continuous, meaning that they might represent anything from 13 January 1988 to 11 February 2013, or 20-60,000 SEK.
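One straightforward way to handle such continuous values is to scale them linearly into [0, 1] before they reach the learning system. The sketch below is an illustration of that idea; the reference dates and sums are taken from the ranges mentioned above but are otherwise assumptions, not the thesis’s actual encoding:

```python
from datetime import date

def encode_date(d, lo=date(1988, 1, 13), hi=date(2013, 2, 11)):
    """Scale a date linearly between two reference dates, giving a
    value in [0, 1] for dates inside the range."""
    return (d - lo).days / (hi - lo).days

def encode_sum(amount_sek, lo=20_000.0, hi=60_000.0):
    """Scale a monetary amount linearly between two reference sums."""
    return (amount_sek - lo) / (hi - lo)
```

Dates outside the reference range would fall outside [0, 1], so in practice the bounds would have to be chosen from the training data.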
1.3.3 Rules
There are of course a lot of rules regarding insurances. The rules can help us filter out insurances that are incorrectly filled in, and thus pointless to evaluate, and of course help us decide whether or not an insurance is movable. One problem with the rules is that they are complex and might be hard to implement, and that companies can have different rules. Another problem is that the rules in the insurance industry can change a lot, which makes it important to build a system that can adapt to new rules, or a system where it is easy to change or add rules. It would be desirable if the system could detect any ambiguity in the rules before and after any change or addition of new rules.
both moments were yellow, or was one of them yellow and the other green? The same has to be considered after the moments have been evaluated: if one moment is green while the other is yellow, will that make the complete verdict yellow?
Chapter 2
Theory
– T(¬A) = 1 − T(A)

So for example, if we have T(Cold(outside)) = 0.65 and T(Freezing(Martin)) = 0.55, then T(Cold(outside) ∧ Freezing(Martin)) = min(0.65, 0.55) = 0.55, which is probable.

These rules might not change when a variable is modified, even when we would have wanted them to change. For example, if we have T(A) = 0.75 and T(B) = 0.33, then min(T(A), T(B)) = 0.33, but if we change T(A) to 0.85 we still get min(T(A), T(B)) = 0.33, even though we might want a new value since one of the values changed. There are a few ways to improve this, but they will not be covered here.
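The truth-value operations above can be sketched in a few lines. The negation and conjunction rules come from the text; the disjunction rule (the usual max) is included for completeness as an assumption:

```python
def f_not(a):
    """T(¬A) = 1 − T(A)."""
    return 1.0 - a

def f_and(a, b):
    """T(A ∧ B) = min(T(A), T(B))."""
    return min(a, b)

def f_or(a, b):
    """T(A ∨ B) = max(T(A), T(B)); the standard dual of the min rule."""
    return max(a, b)

# The running example: Cold(outside) and Freezing(Martin).
cold_outside = 0.65
freezing_martin = 0.55
print(f_and(cold_outside, freezing_martin))  # 0.55
```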
Fuzzy control
Fuzzy control uses rules for making decisions. A rule, R, is expressed as R: IF <fuzzy criteria> THEN <fuzzy conclusion>. Fuzzy control has a number of rules, and these rules are stored in what is called a rulebase. There are a few ways to create a rulebase:
– Observe and record the in- and out-data while an expert performs the actions for a period of time.
In this project the last one is the one that is of most interest, since we want to minimize the number of human decisions.
Figure 2.1 describes the process of fuzzy control. First the input, x, gets fuzzified, which means that x is transformed into its corresponding truth value. The fuzzified x is then combined by a logical conjunction, which in turn is combined with the output membership function of the rule. The newly created membership function is then evaluated before being defuzzified.
Inference is used to create the conjunctions. In figure 2.2 on the next page we can see the Mamdani inference method, which in this case uses the minimum operation and then combines the output results using the maximum operation. In figure 2.2 the result given by the maximum operation is the grey field in U1, since that result has a higher value than U2.
Fuzzy clustering
In cluster analysis or clustering one strives to divide the data into different groups(clusters).
The data that is clustered together are more similar to each other than the ones in the other
clusters, figure 2.3 is an example of a set of data points that has been divided into three
clusters. Now in fuzzy clustering, a number of data points gets divided into clusters but
now all data points belong to each one of the clusters, but in different degrees, just like in
2: http://en.wikipedia.org/w/index.php?title=File:KMeans-Gaussian-data.svg&page=1, 8 Oct 2012
fuzzy logic. The closer a point is to the center of a cluster, the more it belongs to that cluster. The degrees of belonging, or membership values, that a data point has have to sum up to 1.0. So a data point could have the membership values U0 = 0.17, U1 = 0.35 and U2 = 0.48, which sum up to 1.0.
Looking at figure 2.4 we can see two clusters and a number of data points. The number above each point is the membership value of the point for that cluster. In the left square we can see the membership values for cluster 1, and as you can see, even the points that are really close to cluster 2’s center, represented by the rightmost x, have a membership value for cluster 1. In fuzzy C-means clustering, which is the technique that will be used if fuzzy clustering is chosen, it is usually the distance from the center of the cluster that decides the membership value.
Projections of the data points will be created after the set of data points has been divided into clusters. Each cluster will get its own set of projections, as can be seen in figure 2.5. These projections will be used to generate the rules of the clusters, which together form the rulebase. The creation of the rulebase marks the end of the training, and the system can now be used to create output. In order to get this output we need something called inference, which will translate any new input to output with the help of the rulebase.
Bayes’ theorem
The product rule can be written in two forms, namely P(A ∧ B) = P(B|A)P(A) and P(A ∧ B) = P(A|B)P(B). By combining these two formulas we get P(B|A)P(A) = P(A|B)P(B), and by dividing by P(A) it turns into Bayes’ theorem:

P(B|A) = P(A|B)P(B) / P(A)
P(B|A) is read as the probability of B given A. With Bayes’ theorem it is possible to calculate the probability of an unknown variable by using the probabilities of three known variables, and it is a common case that a few probabilities are known while the one that we need is unknown.
In [9] they give an example where a doctor knows P(symptoms|disease), the probability of symptoms given disease, but wants to know P(disease|symptoms). In the example the doctor knows that:
– P(s|m) = 0.7
– P(m) = 1/50000
– P(s) = 0.01
where s means that the patient has a stiff neck and m that the patient has meningitis. By using Bayes’ theorem the doctor can calculate that the probability of a patient having meningitis when the patient has a stiff neck is:

P(m|s) = P(s|m)P(m) / P(s) = (0.7 × 1/50000) / 0.01 = 0.0014
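The doctor’s calculation can be reproduced in a few lines of Python, directly from the three known probabilities:

```python
def bayes(p_a_given_b, p_b, p_a):
    """Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)."""
    return p_a_given_b * p_b / p_a

p_s_given_m = 0.7      # probability of a stiff neck given meningitis
p_m = 1 / 50000        # prior probability of meningitis
p_s = 0.01             # prior probability of a stiff neck

print(bayes(p_s_given_m, p_m, p_s))  # ≈ 0.0014
```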
Bayesian networks
Bayesian networks are a common approach when a system might have to deal with uncertainty. Bayesian networks are based on Bayes’ theorem, see section 2.1.2. A Bayesian network can be described as a probabilistic graphical model that can represent dependencies between variables, see figure 2.6 on the next page.
Figure 2.6 on the following page⁴ shows how a simple Bayesian network can look. There you have a graphical model describing the relationships between the different nodes. Each of the nodes in the network has a probability table associated with it, and since the rain node is not dependent on any other node, only the unconditional probability of each of its states is necessary. When building a Bayesian network it is important to make a good model of the relationships, so that a node does not depend on variables that are, for that node, unnecessary. The way the nodes are introduced into the system, i.e. their order, can have a big impact on performance. If the nodes are introduced in a ’not so good’ way, some nodes could get unnecessary dependencies, and sometimes those dependencies are difficult to calculate. In [9] they give the following example:
3: The majority of the information for this basic explanation of Bayes’ theorem and Bayesian networks has been found on Wikipedia and some in [9].
The Bayesian networks in figure 2.7 both describe the same problem; the only difference is the order of the nodes. In network A we have Alarm, which is dependent on Burglary and Earthquake, while MaryCalls and JohnCalls are dependent on Alarm. This means that if either a burglary is in progress or there is an earthquake, the alarm will go off, which will cause either or both of Mary and John to call the owner. In network B we have JohnCalls, which is dependent on MaryCalls, and Alarm, which is dependent on both MaryCalls and JohnCalls. Burglary is dependent on Alarm, and Earthquake depends on both Burglary and Alarm. A few details from the example in the book are needed in order for this to make sense. In the example they state that Mary often listens to loud music, so she might not hear the alarm; so if she is calling, there is a high probability that John will call as well, which makes JohnCalls dependent on MaryCalls. The book, [9], is quoted below for the dependency between Burglary and Earthquake:
If the alarm is on, it is more likely that there has been an earthquake. (the
alarm is an earthquake sensor of sorts.) But if we know that there has been
a burglary, then that explains the alarm, and the probability of an earthquake
would only be slightly above normal. Hence, we need both alarm and burglary
as parents.
P(R|G)_TT = P(G, R)_TT / P(G)_T
          = Σ_{S ∈ {T,F}} P(G = T, S, R = T) / Σ_{S,R ∈ {T,F}} P(G = T, S, R)
          = [ P(G,S,R)_TTT + P(G,S,R)_TFT ] / [ P(G,S,R)_TTT + P(G,S,R)_TTF + P(G,S,R)_TFT + P(G,S,R)_TFF ]
          = [ (P(G|S,R)P(S|R)P(R))_TTT + (P(G|S,R)P(S|R)P(R))_TFT ] / [ (P(G|S,R)P(S|R)P(R))_TTT + (P(G|S,R)P(S|R)P(R))_TTF + (P(G|S,R)P(S|R)P(R))_TFT + (P(G|S,R)P(S|R)P(R))_TFF ]
As we can see, we get two additional dependencies when the Bayesian network is arranged like B instead of A. This could, as stated above, mean that the computations become harder to carry out.
Let’s do an example⁵ that requires some calculations. Say that we have a Bayesian network that looks like figure 2.6 on the facing page, and we want to know: what is the probability that it is raining, given that the grass is wet? So what we want to know is P(R|G), where R means raining, G means grass wet, S means sprinkler turned on and T means true. In figure 2.8 we can see how it is solved by using Bayes’ theorem and other statistical formulas/rules.
With Bayes’ theorem and other statistical formulas/rules it is possible to describe the probability of most of the scenarios that can occur based on the probability functions, even if some parts are unknown. There is a lot more that can be said about Bayesian networks, but this is supposed to be a brief introduction/explanation. If this approach gets chosen, you can read a more thorough explanation in section 2.3 on page 16.
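The grass/sprinkler/rain query can also be computed directly by summing the joint distribution, exactly as in the derivation above. The probability tables below are not from the thesis; they are the commonly used numbers from the Wikipedia Bayesian-network example, inserted here for illustration:

```python
P_R = {True: 0.2, False: 0.8}              # P(Rain)
P_S_given_R = {True: 0.01, False: 0.4}     # P(Sprinkler = T | Rain)
P_G_given_SR = {                           # P(GrassWet = T | Sprinkler, Rain)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.8, (False, False): 0.0,
}

def p_joint(g, s, r):
    """P(G, S, R) = P(G|S,R) P(S|R) P(R), per the chain rule."""
    p_g = P_G_given_SR[(s, r)] if g else 1.0 - P_G_given_SR[(s, r)]
    p_s = P_S_given_R[r] if s else 1.0 - P_S_given_R[r]
    return p_g * p_s * P_R[r]

# P(R = T | G = T): marginalize the sprinkler out of numerator and denominator.
num = sum(p_joint(True, s, True) for s in (True, False))
den = sum(p_joint(True, s, r) for s in (True, False) for r in (True, False))
print(num / den)  # ≈ 0.3577 with these tables
```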
2.1.3 The numerical approach
Some problems might be too hard for designers solve on their own since it can sometimes
be hard (if not impossible) for a designer to predict all of the situations/states in which
the system might find itself in. The change over time is another problem that is hard to
predict, e.g. the stock market, and sometimes they don’t have an idea on how to program
the solution and this is where learning based AI can be a good choice since they will learn to
become a solution. There are many different types of learning approaches, but this project
will focus on neural networks and explain how those work.
5: http://en.wikipedia.org/wiki/Bayesian_networks, 20 Sep 2012
6: The majority of the information for this basic explanation of neural networks has been found on Wikipedia and some in [9]. The information about artificial neural networks has been found in [9].
Neural networks
In the world of artificial neural networks (ANN), or simply neural networks (NN), one tries to achieve ‘intelligence’ by modelling the system after a biological neural network (BNN), like the human brain. Before I continue to explain ANNs, I will try to explain how a BNN works.
Disclaimer: This will be a simple explanation, since I am far from an expert in the field of neuroscience.
A BNN is a vast network of connected nerve cells called neurons. Each neuron consists of a cell body (soma) which contains a cell nucleus, which in turn contains the cell’s genetic material. Stretching out from the body are a number of dendrites, which receive signals from other neurons, and a single long fiber called the axon. The axon sends signals to other neurons. The axon and the dendrites are connected to other neurons at junctions called synapses. So that is the structure of a BNN; now to explain how it works. Let’s say that you see a flower. A lot of your neurons will then start firing and sending signals to other neurons, until they stop at a state where you either recognise that it is in fact a flower, maybe even what type of flower, or that it is something unknown.
This is what an ANN is trying to mimic. An ANN is built from a few layers: first we have one input layer, then a number of hidden layers and lastly one output layer. Each layer consists of a number of nodes (neurons), and each of these nodes is connected to all the nodes in the next layer, in one direction, see figure 2.10 on the next page. The connection between two nodes, let’s say i and j, serves to propagate activation ai from i to j, and this connection has a numeric weight, wij, associated with it. This weight describes the strength and sign of the connection. During the learning process the weights will be updated to produce a desired signal flow. When a node derives its output, it calculates the weighted sum of all its inputs and then applies an activation function to this sum. This is called a feed-forward network, and it is the type that will be used in case ANNs are chosen.
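The node computation just described (weighted sum followed by an activation function) can be sketched as follows. The layer sizes, the example weights and the choice of a sigmoid activation are illustrative assumptions, not taken from the thesis:

```python
import math

def sigmoid(x):
    """A common activation function; the thesis does not prescribe one."""
    return 1.0 / (1.0 + math.exp(-x))

def node_output(inputs, weights):
    """Weighted sum of the node's inputs, passed through the activation."""
    return sigmoid(sum(a * w for a, w in zip(inputs, weights)))

def feed_forward(layer_inputs, layers):
    """Propagate activations forward through layers of weight vectors,
    one weight vector per node, as in a feed-forward network."""
    activations = layer_inputs
    for layer in layers:
        activations = [node_output(activations, w) for w in layer]
    return activations

# Two inputs -> a hidden layer of two nodes -> one output node.
out = feed_forward([1.0, 0.5], [[[0.4, -0.6], [0.3, 0.8]], [[1.0, -1.0]]])
```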
One of the biggest risks when working with AI and systems that need to be trained is the risk of overtraining them. This could mean that the system starts over-fitting, and if the network is too big, the network might become a big lookup table. There are a few techniques available that can help reduce over-fitting, but those will be mentioned in section 2.3 on the following page if this approach is chosen.
7: http://en.wikipedia.org/wiki/Nervous_system, 18 Sep 2012
the number of inputs will vary. Bayesian Networks can handle variable inputs, since it is possible to calculate the probability of missing variables by using Bayes’ theorem and other statistical functions/rules. Given the number of input variables available in this project and their dependencies, it could mean that we would have to construct a big and complex model, which could lead to complex calculations.
This problem is the same for all methods. The date or sum will have to be converted into a numerical representation before the system can use it, which means that we need to figure out how they should be represented.
Neural Networks are good at learning complex rules, but one can never really be sure which complex rule the network has learned. Fuzzy Clustering can also learn complex rules, but unlike Neural Networks it is possible to show how/why it makes the choices it makes. Bayesian Networks can describe complex stochastic relationships between variables.
If the rules change, then both the Neural Network and Fuzzy Clustering require retraining. It takes a long time to train a Neural Network and even longer to train a Fuzzy Cluster. This makes it even more attractive to build one expert system for each moment, since that could help us reduce training times. The smaller expert systems would require less training data and would have to learn a less complex function, which would save us time. Bayesian Networks do not require retraining, though it might be necessary to update the probability tables.
Show how the system ’thinks’ With fuzzy clustering it is possible to show, with e.g. graphs, how the system ’thinks’, which I judged to be a strong reason for picking fuzzy clustering.
Similar problem with good results Patrik, my supervisor at the CS department, had done similar work with fuzzy clustering, with good results.
2.3 Algorithms
In this section I will describe the fuzzy clustering algorithms used by this system.
if µ_k = ∅, then

u_ik = 1 / [ Σ_{j=1}^{c} ( ||x_k − v_i|| / ||x_k − v_j|| )^{2/(m−1)} ]

otherwise

u_ik = 0 ∀ i ∉ µ_k and Σ_{i ∈ µ_k} u_ik = 1
Step 1: In this step the membership matrix U is initialized. The membership matrix is a matrix that contains all the membership values in the application. It is used to check how strongly an input is tied to a particular cluster. Creating and initializing a membership matrix is very simple: just create a matrix of size C×I, where C is the number of clusters and I the number of inputs, and fill the matrix with randomized values 0 ≤ value ≤ 1; this can be seen in table 2.1. There is, however, a criterion that the values have to satisfy: the sum of all values for one cluster must be 1, which is not satisfied in table 2.1. That is easily fixed by dividing all the values in the cluster by the sum of the values. In table 2.2 on the next page the values have been divided by the sums in table 2.1 and now sum up to 1.
Table 2.2: The membership matrix has been adjusted to sum up to 1 by dividing the input values by the previous sums.

                          Inputs
Clusters     I1           I2           I3           I4           Sum(a)
Cluster1     0.040983607  0.434426230  0.393442623  0.131147541  1
Cluster2     0.264214047  0.230769231  0.200668896  0.304347826  1
Cluster3     0.301801802  0.360360360  0.180180180  0.157657658  1

(a) The sum is not part of the membership matrix; it is just there to show that each cluster now sums up to 1.
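Step 1 can be sketched in plain Python; the 3×4 size matches the example in the tables, and the random seed is only there to make the sketch reproducible:

```python
import random

def init_membership_matrix(n_clusters, n_inputs, seed=None):
    """Create a C x I matrix of random values in [0, 1] and normalize
    each cluster row so that its values sum to 1, as in table 2.2."""
    rng = random.Random(seed)
    matrix = []
    for _ in range(n_clusters):
        row = [rng.random() for _ in range(n_inputs)]
        total = sum(row)
        matrix.append([v / total for v in row])
    return matrix

U = init_membership_matrix(3, 4, seed=1)
```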
Step 2: In this step we calculate the center, or midpoint, of a cluster. The center can be calculated with:

v_i = Σ_{k=1}^{n} (u_ik)^m x_k / Σ_{k=1}^{n} (u_ik)^m

where u_ik is the membership value of point x_k for the i:th cluster, and m is a fuzziness value that works as a weighting exponent.
Step 3: In this step the membership matrix U is updated. There are basically two cases that can occur when updating the membership matrix. One is that the center of a cluster is right on top of one or more points. In that case, the points that are under the center will get a membership value of 1.0 for that cluster and a value of 0.0 for the rest of the clusters, unless two clusters have the same center, which is unlikely but still possible. If there are two clusters with the same center, and a point lies on that center, then that point will get a value of 1.0 / (number of clusters in the same location); so if there are two such clusters, the point gets the membership value 1.0/2 = 0.5. In the other case the value is updated based on this formula:

u_ik = 1 / [ Σ_{j=1}^{c} ( ||x_k − v_i|| / ||x_k − v_j|| )^{2/(m−1)} ]

where ||x_k − v_i|| is the Euclidean distance between the current cluster center, v_i, and the current point, x_k, ||x_k − v_j|| is the Euclidean distance between cluster center v_j and the current point, and the variable m is a fuzziness variable that is used as a weighting exponent.
Step 4: In this step we check whether the difference between the old membership matrix and the new one is less than the chosen stopping condition, or whether there has not been any change to the matrix at all. If neither holds, we go back to Step 2.
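Steps 1-4 can be sketched together as one loop. This is an illustrative one-dimensional sketch, not the thesis implementation: the data points, the fuzziness m = 2 and the stopping threshold are all assumptions, and the step 1 row-wise normalization follows the description above:

```python
import random

def fcm(points, c, m=2.0, eps=1e-6, max_iter=200, seed=0):
    """Fuzzy C-means for 1-D points: returns (membership matrix, centers)."""
    rng = random.Random(seed)
    n = len(points)
    # Step 1: random membership matrix, each cluster row normalized to sum 1.
    U = []
    for _ in range(c):
        row = [rng.random() for _ in range(n)]
        s = sum(row)
        U.append([v / s for v in row])
    for _ in range(max_iter):
        # Step 2: cluster centers as membership-weighted means.
        V = []
        for i in range(c):
            num = sum((U[i][k] ** m) * points[k] for k in range(n))
            den = sum(U[i][k] ** m for k in range(n))
            V.append(num / den)
        # Step 3: update memberships from the distances to the centers.
        newU = [[0.0] * n for _ in range(c)]
        for k in range(n):
            d = [abs(points[k] - V[i]) for i in range(c)]
            if any(di == 0 for di in d):          # point lies on a center
                hits = [i for i in range(c) if d[i] == 0]
                for i in hits:
                    newU[i][k] = 1.0 / len(hits)
            else:
                for i in range(c):
                    newU[i][k] = 1.0 / sum((d[i] / d[j]) ** (2 / (m - 1))
                                           for j in range(c))
        # Step 4: stop when the matrix has (almost) stopped changing.
        diff = max(abs(newU[i][k] - U[i][k])
                   for i in range(c) for k in range(n))
        U = newU
        if diff < eps:
            break
    return U, V

U, V = fcm([1.0, 1.2, 0.9, 8.0, 8.3, 7.9], c=2)
```

Note that after the step 3 update, the memberships of each *point* across the clusters sum to 1, which is what the update formula enforces.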
This algorithm runs with c set to a fixed number of clusters, so it has to run several times with different c values in order to find the best c. It is preferred to have as small a c as possible, to limit the number of rules to a reasonable amount. If the c value is too big, the system will become overfit, which could mean that each data point ends up in a cluster of its own, and thus the rule for that cluster will most likely become that data point. When the algorithm has been run for many different c values, it is time to see which c is the best. To find the best c we use something called the criterion number. The criterion number is calculated according to:
S(U, c) = Σ_{i=1}^{c} Σ_{k=1}^{n} (u_ik)^m [ ||x_k − v_i||² − ||v_i − x̄||² ]
where x̄ is the center of all data points. The goal is to find the c which generates the smallest criterion number, though c = number of data points usually generates the best criterion number. But, as stated above, that is not what we want, since it would make the system overfit. After the best number of clusters has been found, it is time to identify the rules of each of the clusters, which you can read about in section 2.3.2.
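The criterion number can be computed directly from the membership matrix, the centers and the data. A one-dimensional sketch, with m = 2 and the example data as illustrative assumptions:

```python
def criterion_number(U, V, points, m=2.0):
    """S(U, c): membership-weighted within-cluster distances minus the
    centers' distances to the global mean; a smaller value indicates a
    better choice of c."""
    x_bar = sum(points) / len(points)       # center of all data points
    s = 0.0
    for i, v_i in enumerate(V):
        for k, x_k in enumerate(points):
            s += (U[i][k] ** m) * ((x_k - v_i) ** 2 - (v_i - x_bar) ** 2)
    return s

# Two tight, well separated clusters give a strongly negative score.
S = criterion_number(
    U=[[0.99, 0.99, 0.01, 0.01], [0.01, 0.01, 0.99, 0.99]],
    V=[1.05, 5.05],
    points=[1.0, 1.1, 5.0, 5.1],
)
```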
where u_ik is the membership value of data point x_k in the i:th cluster and π_p is the projection on the p:th axis, or p:th input. The variable β_p is expressed as β_p = 1/(2σ²), where σ is the standard deviation of π_p, and α_ip is the center of cluster i in π_p. Running this on all clusters and all axes will create the set of rules that constitutes the rulebase. After the rulebase has been created, the fuzzy clustering is completed and the system is ready for the real input.
Activation
Both of the inference methods use an activation function. The first thing this function
does is calculate the input using the formula
exp( −(x_kp − v_ip)² / (2σ²) )
where xkp is the p:th input of data point xk and vip is the centre of the p:th input in
the i:th cluster. σ is the i:th cluster's σ that we used when calculating the rules in
section 2.3.2 on the previous page. The results are saved in arrays, where each array
represents one of the clusters and contains the calculations for all dimensions of the
input. So if we have a data point x with the inputs age, sex and income, we get an array
that looks like array1 = [calc(age), calc(sex), calc(income)]. After all of the arrays
have been created we move on to the next step, which is, depending on the chosen
configuration, to select either the smallest value (minimum) or the largest value
(maximum) from each of these arrays and save them in a new array which represents the
activation function.
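The activation step above can be sketched like this (a minimal Python illustration; the thesis implementation is in C#, and all names here are assumptions):

```python
import math

def activation(x, centers, sigmas, mode="min"):
    """Activation value per cluster for one data point.

    x       : the inputs of a data point, e.g. [age, sex, income]
    centers : centers[i][p] = centre v_ip of the p:th input in cluster i
    sigmas  : sigmas[i] = the sigma of cluster i from the rule extraction
    mode    : 'min' or 'max' -- which per-dimension value survives
    """
    pick = min if mode == "min" else max
    alphas = []
    for v_i, sigma in zip(centers, sigmas):
        # Gaussian response exp(-(x_kp - v_ip)^2 / (2 sigma^2)) per dimension
        arr = [math.exp(-((xp - vp) ** 2) / (2 * sigma ** 2))
               for xp, vp in zip(x, v_i)]
        alphas.append(pick(arr))  # one activation value per cluster
    return alphas
```

A point lying exactly on a cluster centre activates that cluster fully (value 1), and the activation decays towards 0 with distance.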
Mamdani’s method
The process of Mamdani’s method can be seen in figure 2.15 on the facing page.
After the input has been calculated and the correct values have been chosen, see the
Activation paragraph above, Mamdani’s method will try to find the best rule for each
output. Mamdani’s membership function can be seen in equation (2.16) on the next page,
where αi are the activation values, Ui are the output values from the rulebase and U is a
fuzzy set.
To clarify things we will go through figure 2.15 on the facing page step by step. In this
figure we have a system that contains three rules: 1, 2 and 3. The curves represent how
each rule depends on the input. In the figure, input1 has been calculated to be 3 and
input2 to be 8, and by following the arrows we can see where each input hits the curve.
The next step is to see which of the inputs has the greatest effect on the final result,
and in Mamdani’s case we do this with the maximum function. The result of this is a fuzzy
set, U, which is represented by the green graphs in figure 2.15 on the next page.
Now, to make sense of this fuzzy set we need to defuzzify it. There are a few techniques
for defuzzification, but the one used in this project is called Centre-of-Gravity. First
we combine all values in the fuzzy set so that we get something that looks like the graph
called ”Result of aggregation” in figure 2.15 on the facing page. It is on this graph we
want to find the centre of gravity, which can be done by using equation (2.17) on the next
page, where µU is the combined fuzzy set and uk is the k:th member of the fuzzy set.
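The aggregation and Centre-of-Gravity defuzzification described above can be sketched as follows (illustrative Python; representing each rule's output set as a discretised curve over the output axis is an assumption about the representation):

```python
def mamdani_defuzzify(alphas, rule_outputs, universe):
    """Combine the rules into one fuzzy set and take its centre of gravity.

    alphas       : activation value alpha_i per rule
    rule_outputs : rule_outputs[i][k] = membership of universe[k] in rule i's
                   output set U_i
    universe     : the discretised output axis, the u_k values
    """
    # Implication: clip each rule's output at its activation (min), then
    # aggregate over the rules with max -- this gives the combined set mu_U
    mu = [max(min(a, out[k]) for a, out in zip(alphas, rule_outputs))
          for k in range(len(universe))]
    # Centre-of-Gravity: the mu_U-weighted mean of the output axis
    weight = sum(mu)
    return sum(m * u for m, u in zip(mu, universe)) / weight if weight else 0.0
```

A symmetric aggregated set yields its midpoint, which matches the intuition of a centre of gravity.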
U = ∨_{i=1}^{n} (αi ∧ Ui)     (2.16)

There is another method that is based on Mamdani’s method but with a small alteration.
The method is called Larsen’s method, and it uses the product as implication instead of
the minimum that Mamdani’s method uses. This is mentioned because the implementation
offers four different configurations: the Activation has two settings, maximum or
minimum, and in the Mamdani method it is possible to use Larsen’s method instead.
Larsen’s membership function can be seen in equation (2.18) on the following page.

9 http://www.dma.fi.upm.es/java/fuzzy/fuzzyinf/mamdani3 en.htm, 15 Feb 2013
U = ∨_{i=1}^{n} (αi · Ui)     (2.18)
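The only difference between the two methods is the implication operator, which a short sketch makes explicit (illustrative Python; the names are assumptions):

```python
def combined_membership(alphas, rule_outputs, k, method="mamdani"):
    """Membership of the k:th output value in the aggregated set U.

    Mamdani uses the minimum as implication, alpha_i ∧ U_i; Larsen uses the
    product, alpha_i · U_i. Both aggregate over the rules with the maximum.
    """
    if method == "mamdani":
        return max(min(a, out[k]) for a, out in zip(alphas, rule_outputs))
    return max(a * out[k] for a, out in zip(alphas, rule_outputs))
```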
Takagi-Sugeno’s method
This method uses linear functions to create an inference. The output is represented as
ui = pi1 + pi2·x1 + pi3·x2, one for each rule. The first thing that needs to be done is
to compute the constants, p, for each rule. In [15], Takagi and Sugeno describe a way to
calculate these constants.
Let X be an m × n(k + 1) matrix (figure 2.19), Y an m-vector (figure 2.20) and P an
n(k + 1)-vector (figure 2.21).
    [ β11, …, βn1,   x11·β11, …, x11·βn1,   …,   xk1·β11, …, xk1·βn1 ]
X = [   ⋮                                                           ]
    [ β1m, …, βnm,   x1m·β1m, …, x1m·βnm,   …,   xkm·β1m, …, xkm·βnm ]

Y = [y1, …, ym]^T
Where β is defined as seen in figure 2.22 on the facing page, i represents the i:th rule,
j and m represent data points (m being the number of data points in the system), k is the
k:th input in a data point and n is the number of rules in the system. Aik denotes the
membership value of the k:th input in rule i, xkj is the k:th input of the j:th data point
and ym is the output of data point m. The P vector holds the constant values needed to
calculate the expected output and is generated by the matrix computation seen in
figure 2.23 on the next page. In figure 2.21, Pkn means the n:th constant for rule k.
After the P vector has been calculated we use the formula ui = pi1 + pi2·x1 + pi3·x2 to
get the output ui, which we then use to calculate the final output with:
u = ( Σ_{i=1}^{n} αi ui ) / ( Σ_{i=1}^{n} αi )
Where αi is the activation value for rule i, as described in the Activation paragraph.
P = (X^T X)^{-1} X^T Y
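These two computations, the least-squares solve for P and the activation-weighted average, can be sketched as follows (illustrative, dependency-free Python; the project itself is in C#, and a real implementation would use a linear algebra library):

```python
def solve_p(X, Y):
    """P = (X^T X)^{-1} X^T Y via the normal equations and Gaussian
    elimination with partial pivoting."""
    rows, n = len(X), len(X[0])
    # Build A = X^T X and b = X^T Y
    A = [[sum(X[r][i] * X[r][j] for r in range(rows)) for j in range(n)]
         for i in range(n)]
    b = [sum(X[r][i] * Y[r] for r in range(rows)) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            A[r] = [a - f * c for a, c in zip(A[r], A[col])]
            b[r] -= f * b[col]
    P = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        P[i] = (b[i] - sum(A[i][j] * P[j] for j in range(i + 1, n))) / A[i][i]
    return P

def takagi_sugeno_output(alphas, us):
    """Final crisp output: the activation-weighted average of the rule
    outputs u_i, i.e. u = (sum alpha_i u_i) / (sum alpha_i)."""
    return sum(a * u for a, u in zip(alphas, us)) / sum(alphas)
```

For example, fitting a single rule with one input to the data y = 2 + 3x recovers the constants p = [2, 3].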
System.IO: namespace contains types that allow reading and writing to files and data
streams, and types that provide basic file and directory support.
System.Linq: namespaces contain types that support queries that use Language-Integrated
Query (LINQ). This includes types that represent queries as objects in expression trees.
System.XML: namespaces contain types for processing XML. Child namespaces support
serialization of XML documents or streams, XSD schemas, XQuery 1.0 and XPath 2.0,
and LINQ to XML, which is an in-memory XML programming interface that enables
easy modification of XML documents.
System.Text: namespaces contain types for character encoding and string manipulation.
A child namespace enables you to process text using regular expressions.
3.2 Result
The results was a bit of a surprise for a number of reasons. The first one was that the
Mandani methods that used the maximum function in the activation phase had a constant
hit rate which never changed. Looking at a sample of the output of the functions, table 3.2
on page 28, we can see that the interval of these methods only covers two outcomes namely
case 3.0 = Green and 4.0 = Gray which explains the poor results. The method with the
best results is the Takagi-Sugeno method which covers the whole range of expected results.
In table 3.1 the observed upper and lower limits of each method can be seen.
The second was that the hit rate for all the methods was overall very stable, with only a
few drops. It was expected that the results would improve, or at least change, with
different configurations, but they did not, with the exception of the T/T Mamdani method,
which struggled when the fuzziness variable was > 12. Another surprise with that Mamdani
method was that it struggled when the number of runs was > 140, which is strange since all
that does is run the program again to find the best system, the one with the best
criterion number.
In figure 3.1 we can see that the Mamdani inferences that use the max function have a
steady hit rate of 12% which never changes, and 12% is far from acceptable. The Mamdani
inference that instead uses the min function performs at around 40%, which still is not
acceptable. There is some slight difference between the Mamdani that uses the min function
and the one that uses the product function: the one that uses the min function in both
stages performs slightly better, and is a bit more stable, than the one that uses the min
function in the activation stage and the product function in the next stage.
The inference method with the best performance is the Takagi-Sugeno inference, which
performs at a hit rate around 69% with a few drops. 69% is a lot better than the Mamdani
methods, but still not acceptable; the hit rate we want to see is at least 90%.
In figure 3.2 we can see the impact of the number of runs performed. Most of the inference
methods are stable, which could mean that in some runs they were lucky or unlucky. The
number of runs does, however, have a big impact on one of the inference methods, namely
the Mamdani version where the minimum function is used in the activation and the product
function is used in the Mamdani step. When the product function is used inside the
Mamdani method, it is known as Larsen’s method.
Conclusion
There was a scheduling conflict in the beginning of the project. I had forgotten that I
had planned to retake two courses at the same time, which in hindsight was not the best of
ideas. When the courses started the project was cut to a pace of 50%, so half the day was
spent on one of the courses and the other half on the project. But it always took a while
to get back into the project, so a lot of time was lost in the end. It would probably have
been best not to have taken courses on the side, or at least only one.
The goal for this project was to create a system that could evaluate insurances using AI.
In section 1.3.1 on page 3 some criteria are presented; the only criterion that is not
fully fulfilled is The result of the evaluation should consist of a flag and a text.
The system currently only gives a flag, and the reason for that is that I was unsure how
the texts should be interpreted by the system. The system needs the input and output to be
represented by floating-point numbers; it should be possible to look up all combinations
of texts and give them a numerical representation that can be combined in a good way.
The system is currently not a good replacement for the evaluation of insurances. But even
though the system only managed a hit rate of 69% at best, there is potential in fuzzy
clustering. In section 4.2 on the following page a few suggestions are mentioned that
could help boost the performance and make it a dependable replacement.
Slow learning process: The process of training the system can be quite time consuming,
depending on the number of data points used. But since this is not something that needs
to be done very often, it can probably be disregarded. A way to reduce the training time
would be to make the training parallel, so that multiple numbers of clusters can be
evaluated at the same time. The fuzzy clustering process is very parallel-friendly
because it can easily be divided into a number of tasks.
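A minimal sketch of that idea (Python for illustration; train_with_c is a hypothetical stand-in for a full C-means training run, with a dummy criterion function):

```python
from concurrent.futures import ThreadPoolExecutor

def train_with_c(c):
    """Hypothetical stand-in: train the system with c clusters and return
    (c, criterion number). A real version would run the full C-means
    training here; this dummy simply pretends that c = 4 scores best."""
    return c, (c - 4) ** 2

def best_cluster_count(candidates):
    # Each candidate c is independent, so the runs can be evaluated
    # concurrently (a process pool would give true CPU parallelism in Python)
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(train_with_c, candidates))
    # Keep the c with the smallest criterion number
    return min(results, key=lambda r: r[1])[0]
```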
Requires retraining: If there is a change in the rules, the system will need to be
retrained. This could mean that new training data is required, which might have to be
manually evaluated and could take some time; on top of that comes the actual training
time of the system, which is mentioned in the previous block.
Variable number of inputs: The system can not handle a variable number of inputs, and
since two insurances can have different numbers of values/attributes we can not just
read the whole insurance and send it to the system. We need to choose a number of
important values/attributes that all insurances have and use those. The fewer that are
used, the less complex the problem will be for the system, and the training time will be
reduced. If a new value/attribute is introduced, the system will have to be retrained.
No support for multiple instances: A way to improve the performance of the system would
be to create different systems that each evaluate one type of insurance. This is
discussed further in section 4.2, but currently the system does not support it. A few
modifications would have to be made in order to support multiple instances: first, fix
the start-up to support starting multiple instances and assigning tasks to them, and
then make sure that each instance can save its own training data without overwriting
the others.
– Reinforced learning: a correct evaluation results in a ’reward’ and an incorrect one
results in a ’penalty’. The system will then go back to training and try to adapt to the
’rewards’ and ’penalties’ it received. This process continues until a certain limit has
been reached, e.g. a hit rate of at least 90%. As you can see, by introducing reinforced
learning to the application it would be possible to improve the results.
Chapter 5
Acknowledgments
I would like to thank Acino for letting me do my thesis there and I would also like to thank
all of Acino’s employees for making me feel welcome. I especially want to give a big thanks
to Hannes Kock my supervisor at Acino, Patrik Eklund my supervisor at the CS-department
and Anna Theorin my contact at Svenska Försäkringsfabriken, for helping me with this
thesis.
References
[1] John Binder, Daphne Koller, Stuart Russell, Keiji Kanazawa, and Padhraic Smyth.
Adaptive probabilistic networks with hidden variables. In Machine Learning, pages
213–244, 1997.
[2] Jens Bohlin, Patrik Eklund, Lena Kallin-Westin, and Tony Riissanen. Soft computing.
2007.
[3] A. Doulamis, N. Doulamis, and S.D. Kollias. On-line retrainable neural networks:
improving the performance of neural networks in image analysis problems. Neural
Networks, IEEE Transactions on, 11(1):137–155, 2000.
[4] Nir Friedman, Dan Geiger, and Moises Goldszmidt. Bayesian network classifiers, 1997.
[5] H.R. Berenji and P. Khedkar. Learning and tuning fuzzy logic controllers through
reinforcements. Neural Networks, IEEE Transactions on, 3(5):724–740, 1992.
[6] Bipin Joshi. Beginning XML with C# 2008: From Novice to Professional. 2008.
[7] S. Marinai, M. Gori, and G. Soda. Artificial neural networks for document analysis
and recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
27(1):23–35, 2005.
[8] Dieter Merkl and Andreas Rauber. Document classification with unsupervised artificial
neural networks. In F. Crestani and G. Pasi (Eds.), Soft Computing in Information
Retrieval, pages 102–121. Würzburg (Wien): Physica-Verlag, 2000.
[9] Peter Norvig and Stuart J Russell. Artificial Intelligence A Modern Approach. Prentice
Hall, 3rd edition, 2009.
[10] Agnieszka Onisko, Marek J. Druzdzel, and Hanna Wasyluk. Learning bayesian network
parameters from small data sets: Application of noisy-or gates, 2000.
[12] J.A. Roubos, S. Mollov, R. Babuska, and H.B. Verbruggen. Fuzzy model-based predic-
tive control using takagi-sugeno models, 1999.
[13] Han saem Park, Si ho Yoo, and Sung bae Cho. Evolutionary fuzzy clustering algorithm
with knowledge-based evaluation and applications for gene expression profiling, 2005.
[18] L.-X. Wang and J.M. Mendel. Fuzzy basis functions, universal approximation, and
orthogonal least-squares learning. Neural Networks, IEEE Transactions on, 3(5):807–
814, 1992.
[19] Nevin Lianwen Zhang and David Poole. A simple approach to bayesian network com-
putations, 1994.