Statistical Learning

Part d’aquest material ha estat cedit per Bea López, a la qual li agraeixo
Curs 2012-2013
1- Instance-Based Learning (IBL)


2- Case-Based Reasoning


3- Neural Networks


4- Support Vector Machines
CONTENTS OF THE TOPIC
1.1 Introduction


1.2 Nearest Neighbour


1.3 K-Nearest Neighbour


1.4 Reduction techniques in IBL



1- Instance-Based Learning (IBL)
Model-based learning. Example models: transfer functions, regression, ARMAX, residuals, ...
1- Instance-Based Learning (IBL)
Instance-Based Learning (IBL) is a paradigm of learning in which
algorithms typically store some or all of the n available training
examples (instances) from a training set, T, during learning. Each
instance has an input vector x, and an output class c. During
generalization, these systems use a distance function to
determine how close a new input vector y is to each stored
instance, and use the nearest instance or instances to predict the
output class of y (i.e., to classify y).
IBL is nonparametric as it constructs hypotheses directly from the
training data. Training is typically very simple: Just store the
training instances.
1.1- Introduction
Example: finding a function to represent
the data.
IBL: keep the data as is. Each case is an instance
Example: Response to a Saturated Ramp

[Figure: a saturated ramp (t_rin = 100 µs, V_SAT = 1 V) is applied to the Unit Under Test. From the output waveform the delay time t_d, the rise time t_r (measured using the 0.1·V_est, 0.5·V_est and 0.9·V_est crossing points) and the final amplitude V_est are extracted, together with SP (%).]
Instance Base

Fault       SP     t_r     t_d     V_est
FAULT 1     SP1    t_r1    t_d1    V_est1
FAULT 2     SP2    t_r2    t_d2    V_est2
...         ...    ...     ...     ...
FAULT m     SPm    t_rm    t_dm    V_estm
[Figure: measured response of the circuit under test: SP = 5.73 %, t_d = 13 µs, t_r = 72 µs, V_est = -0.98 V, giving the new feature vector (5.73 %, 13 µs, 72 µs, -0.98).]
V_est (V)   t_r (µs)   t_d (µs)   SP (%)   Fault
-1.0000     76         15         4.4029   Nominal
-1.0000     80         15         1.0031   C2-50
-1.0005     72         17         7.6834   C2+50
-1.0000     77         15         3.0781   C2-20
-0.9997     73         16         5.7085   C2+20
-0.4999     75          5         5.2359   R6-50
-1.5004     80         24         3.3711   R6+50
-0.8000     75         11         4.6614   R6-20
-1.2001     77         19         4.0473   R6+20
-0.9996     86         31         2.1315   R5-50
-1.0002     76          9         4.8682   R5+50
-1.0002     77         20         3.9447   R5-20
-0.9999     75         12         4.6189   R5+20
-1.0000     92         29         0        R4-50
-0.9990     71         10         6.9145   R4+50
-1.0000     80         19         2.4917   R4-20
-0.9994     73         12         5.7311   R4+20
-0.9999     75          5         5.2359   R2,R3,C1-50
-1.0003     80         24         3.3711   R2,R3,C1+50
-1.0000     75         11         4.6614   R2,R3,C1-20
-1.0001     77         19         4.0473   R2,R3,C1+20
-1.9996     76         15         4.4029   R1-50
-0.6665     76         15         4.4029   R1+50
-1.2498     76         15         4.4029   R1-20
-0.8332     76         15         4.4029   R1+20
The distances d_1, d_2, ..., d_m from the new feature vector to each stored instance are computed; the instance with the minimum distance to the new situation is proposed as the diagnostic.

Select a metric, e.g. the normalized Euclidean distance:

$E(x, y) = \sqrt{\sum_{i=1}^{m} \left( \frac{x_i - y_i}{range_i} \right)^2}$
Instance base (general form):

          Attribute 1   Attribute 2   ...   Attribute n
FAULT 1   A11           A12           ...   A1n
FAULT 2   A21           A22           ...   A2n
...
FAULT m   Am1           Am2           ...   Amn
In the 1-Nearest Neighbour (NN) algorithm, the class of a new input is predicted by the class of its closest training sample.

1.2- Nearest Neighbour
[Figure: four training cases and a new case plotted in the (Attribute 1, Attribute 2) plane.]

Case       Class
Case 1     A
Case 2     A
Case 3     B
Case 4     B
New case   B (the class of its nearest neighbour)

Requirements:

-A set of labeled examples (training data)

- A metric to measure “closeness”
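A minimal 1-NN sketch with the normalized Euclidean metric above; the attribute ranges are taken from the stored instances, and the three rows of the instance base used here are only an illustration (variable names are not from the original material):

import numpy as np

def normalized_euclidean(x, y, ranges):
    # E(x, y) = sqrt( sum_i ((x_i - y_i) / range_i)^2 )
    return np.sqrt(np.sum(((x - y) / ranges) ** 2))

def nn_classify(query, instances, labels):
    # instances: (m, n) array of stored feature vectors; labels: list of m fault classes
    ranges = instances.max(axis=0) - instances.min(axis=0)
    ranges = np.where(ranges == 0, 1.0, ranges)       # avoid division by zero for constant attributes
    dists = [normalized_euclidean(query, inst, ranges) for inst in instances]
    return labels[int(np.argmin(dists))]              # class of the closest stored instance

# Usage with a few rows of the instance base above (V_est, t_r, t_d, SP) and the measured vector:
X = np.array([[-1.0000, 76, 15, 4.4029],              # Nominal
              [-0.8332, 76, 15, 4.4029],              # R1+20
              [-0.9994, 73, 12, 5.7311]])             # R4+20
print(nn_classify(np.array([-0.98, 72, 13, 5.73]), X, ["Nominal", "R1+20", "R4+20"]))
# -> "R4+20", the same diagnostic direction as in the worked example below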
Distances from the new feature vector to some of the stored instances:

Fault          Distance
R4+20          0.043
C2+20          0.096
R2,R3,C1+20    0.165
Nominal        0.204
R1+20          0.221
R1+50          0.272
R5-50          0.803

The minimum distance (0.043) corresponds to R4+20, which is proposed as the diagnostic.
Success (%)   V_est (V)   t_r (µs)   t_d (µs)   SP (%)   Fault
69            -1.0000     76         15         4.4029   Nominal
100           -1.0000     80         15         1.0031   C2-50
99            -1.0005     72         17         7.6834   C2+50
89            -1.0000     77         15         3.0781   C2-20
82            -0.9997     73         16         5.7085   C2+20
100           -0.4999     75          5         5.2359   R6-50
98            -1.5004     80         24         3.3711   R6+50
86            -0.8000     75         11         4.6614   R6-20
79            -1.2001     77         19         4.0473   R6+20
91            -0.9996     86         31         2.1315   R5-50
82            -1.0002     76          9         4.8682   R5+50
38            -1.0002     77         20         3.9447   R5-20
47            -0.9999     75         12         4.6189   R5+20
100           -1.0000     92         29         0        R4-50
98            -0.9990     71         10         6.9145   R4+50
88            -1.0000     80         19         2.4917   R4-20
85            -0.9994     73         12         5.7311   R4+20
94            -0.9999     75          5         5.2359   R2,R3,C1-50
79            -1.0003     80         24         3.3711   R2,R3,C1+50
36            -1.0000     75         11         4.6614   R2,R3,C1-20
41            -1.0001     77         19         4.0473   R2,R3,C1+20
100           -1.9996     76         15         4.4029   R1-50
99            -0.6665     76         15         4.4029   R1+50
87            -1.2498     76         15         4.4029   R1-20
84            -0.8332     76         15         4.4029   R1+20

Average success: 82.04 %
Example with NN: [Figure: the measured response (SP = 5.73 %, t_d = 13 µs, t_r = 72 µs, V_est = -0.98 V) yields the feature vector (5.73 %, 13 µs, 72 µs, -0.98), which is compared with every instance in the base.]
Many distance functions have been proposed to decide which instance is closest to a given input vector. Many of these metrics work well for numerical attributes but do not handle nominal (i.e., discrete and perhaps unordered) attributes appropriately, and many real-world applications have both nominal and linear attributes.
In general:
• For continuous feature vectors, just use Euclidean distance

• For discrete features, just assume distance between two
values is 0 if they are the same, 1 if different (e.g. Hamming
distance).

• To compensate for differences in units, scale all continuous attributes so that their values lie between 0 and 1.
Examples of distance functions for continuous attributes

Minkowski distance: $D(x, y) = \left( \sum_{i=1}^{m} |x_i - y_i|^r \right)^{1/r}$

The Euclidean and Manhattan distance functions are equivalent to the Minkowski distance with r = 2 and r = 1, respectively.

Examples of distance functions for nominal attributes

Hamming (overlap) distance:  D(x, y) = 0 if x = y,  1 if x ≠ y
Value Difference Metric (VDM)

$vdm_a(x, y) = \sum_{c=1}^{C} \left| \frac{N_{a,x,c}}{N_{a,x}} - \frac{N_{a,y,c}}{N_{a,y}} \right|^q = \sum_{c=1}^{C} \left| P_{a,x,c} - P_{a,y,c} \right|^q$

where
• N_{a,x} is the number of instances in the training set T that have value x for attribute a;
• N_{a,x,c} is the number of instances in T that have value x for attribute a and output class c;
• C is the number of output classes in the problem domain;
• q is a constant, usually 1 or 2; and
• P_{a,x,c} is the conditional probability that the output class is c given that attribute a has the value x, i.e., P(c | x_a).

P_{a,x,c} is defined as $P_{a,x,c} = N_{a,x,c} / N_{a,x}$, where N_{a,x} is the sum of N_{a,x,c} over all classes, i.e., $N_{a,x} = \sum_{c=1}^{C} N_{a,x,c}$, and the sum of P_{a,x,c} over all C classes is 1 for a fixed value of a and x.
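A short sketch of the VDM for a single nominal attribute, with the counts taken from parallel lists of attribute values and classes (a minimal illustration; names are not from the original material):

from collections import Counter

def vdm(values, classes, x, y, q=2):
    # values[i] is the value of attribute a for training instance i, classes[i] its output class
    N = Counter(values)                               # N_{a,x}
    Nc = Counter(zip(values, classes))                 # N_{a,x,c}
    # sum over classes of |P_{a,x,c} - P_{a,y,c}|^q
    return sum(abs(Nc[(x, c)] / N[x] - Nc[(y, c)] / N[y]) ** q for c in set(classes))

# Usage: distance between two values of one nominal attribute
values  = ["red", "red", "blue", "blue", "green"]
classes = ["A",   "B",   "B",    "B",    "A"]
print(vdm(values, classes, "red", "blue"))   # |1/2 - 0|^2 + |1/2 - 1|^2 = 0.5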
Examples of distance functions for nominal and continuous attributes
One way to handle applications with both continuous and nominal
attributes is to use a heterogeneous distance function that uses
different attribute distance functions on different kinds of attributes.
Heterogeneous Euclidean-Overlap Metric (HEOM)

This function defines the distance between two values x and y of a given attribute a as:

$d_a(x, y) = 1$ if x or y is unknown;  $d_a(x, y) = overlap(x, y)$ if a is nominal;  $d_a(x, y) = \frac{|x - y|}{range_a}$ otherwise,

where overlap(x, y) is 0 if x = y and 1 otherwise, and range_a is the range of attribute a over the training set.

The overall distance between two (possibly heterogeneous) input vectors x and y is given by the Heterogeneous Euclidean-Overlap Metric function HEOM(x, y):

$HEOM(x, y) = \sqrt{\sum_{a=1}^{m} d_a(x_a, y_a)^2}$
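A minimal HEOM sketch, assuming each attribute is flagged as nominal or continuous and the ranges come from the training data (helper names are illustrative):

import math

def heom(x, y, nominal, ranges):
    # x, y: attribute value lists; nominal[a]: True if attribute a is nominal;
    # ranges[a]: max - min of attribute a over the training set (continuous attributes only)
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            da = 1.0                                   # unknown value
        elif nominal[a]:
            da = 0.0 if xa == ya else 1.0              # overlap distance
        else:
            da = abs(xa - ya) / ranges[a]              # range-normalized difference
        total += da ** 2
    return math.sqrt(total)

# Usage: one nominal attribute and two continuous ones
print(heom(["red", 2.0, 10.0], ["blue", 3.0, 10.0],
           nominal=[True, False, False], ranges=[None, 5.0, 20.0]))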
For 1-nearest neighbour, the Voronoi diagram gives the polyhedra that partition the space into the regions of points closest to each training example (illustrated in two dimensions).
Nearest Neighbour main drawbacks:
1. They are computationally expensive classifiers, since they save all training instances and must search through them at classification time.
2. They are intolerant of attribute noise.
3. They are intolerant of irrelevant attributes.
4. They are sensitive to the choice of the algorithm's similarity function.
5. There is no natural way to work with nominal-valued or missing attributes.
6. They provide little usable information regarding the structure of the data.

Noise problem: [Figure: two panels (1-NN diagnosis vs. k-NN diagnosis) in the (Attribute 1, Attribute 2) plane, illustrating how a noisy instance close to the new case can mislead the 1-NN decision.]
k-NN is based on the principle that instances in a dataset generally lie close to other instances with similar properties. It is a supervised learning algorithm in which a new query instance is classified according to the majority class of its K nearest neighbours.
1.3 k-Nearest Neighbor (kNN)
The k-NN only requires:

- A set of labeled examples (training data)

- A metric to measure “closeness”

- An integer k
kNN classifier
• Using large values of k has two main advantages:
  – it yields smoother decision regions and reduces the effect of noise;
  – it provides probabilistic information.

• However, too large a value of k is detrimental:
  – it destroys the locality of the estimation, since farther examples are taken into account;
  – in addition, it increases the computational burden.

• A good k can be selected by parameter optimization using, for example, cross-validation.

• For most low-dimensional data, k is usually between 5 and 10 (a sketch follows below).
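A compact k-NN sketch (majority vote among the K closest stored instances, using the range-normalized Euclidean distance; names are illustrative):

import numpy as np
from collections import Counter

def knn_classify(query, instances, labels, k=3):
    # Normalize every attribute by its range over the stored instances
    ranges = instances.max(axis=0) - instances.min(axis=0)
    ranges = np.where(ranges == 0, 1.0, ranges)
    dists = np.sqrt((((instances - query) / ranges) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]                    # indices of the k closest instances
    votes = Counter(labels[i] for i in nearest)        # majority vote among their classes
    return votes.most_common(1)[0][0]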

1-NN vs. k-NN

Numerical example on the biquad circuit, with K = 3 neighbours.

Case 1: SP = 3.97, t_d = 18 µs, t_r = 77 µs, V_est = -0.8716, class R1+20
(normalized: 0.433, 0.5143, 0.9506, -0.4636)
Its neighbours are instances 37, 11 and 19, all of class R1+20, so the predicted class is R1+20.

Instance 5: SP = 4.33, t_d = 17 µs, t_r = 76 µs, V_est = -0.8811, class R1+20
Its neighbours are instances 697 (R5-20), 36 (R1+20) and 45 (R1+20); the majority class is R1+20, so instance 5 is correctly classified by its neighbours.
Distance-Weighted kNN
(taking distance into account when voting)

It is more probable that the new instance belongs to the class of the closest retrieved neighbour, so each neighbour's vote is weighted by its distance. Possible weight kernels, where ω_j is the voting weight of the j-th neighbour, D_j its distance, D_K the distance to the K-th (farthest) neighbour and ω_K the weight assigned to that farthest neighbour:

Linear:       $\omega_j = \omega_K + (1 - \omega_K)\,\frac{D_K - D_j}{D_K}$
Exponential:  $\omega_j = \omega_K^{\,D_j / D_K}$
Gaussian:     $\omega_j = \omega_K^{\,D_j^2 / D_K^2}$

[Figure: voting weight versus distance, decreasing from 1 at distance 0 down to ω_K at distance D_K.]
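A sketch of distance-weighted voting with the exponential kernel above (ω_K and the neighbour list are passed in; the numbers in the usage line are illustrative, not taken from the example that follows):

from collections import defaultdict

def weighted_vote(neighbours, w_K=0.2):
    # neighbours: list of (distance, class) pairs for the K retrieved instances
    D_K = max(d for d, _ in neighbours)                # distance to the farthest neighbour
    scores = defaultdict(float)
    for d, cls in neighbours:
        scores[cls] += w_K ** (d / D_K)                # exponential kernel weight
    return max(scores, key=scores.get)

# A single very close neighbour can outweigh two more distant ones:
print(weighted_vote([(0.05, "A"), (0.30, "B"), (0.40, "B")]))   # -> "A" (0.82 vs 0.30 + 0.20)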
Example using the exponential weight kernel

Classification of instance 633, whose true class is R5-20. The three retrieved neighbours are:

            Neighbour 1   Neighbour 2   Neighbour 3
Number      650           26            43
Class       R5-20         R2+20         R2+20
Distance    0.012         0.015         0.230

Simple (majority) voting: two of the three neighbours belong to R2+20, so the classification is R2+20, which is incorrect.

Weighted voting with the exponential kernel and ω_K = 0.2 gives weights of 0.92, 0.35 and 0.20 for the three neighbours. The vote for R5-20 (0.92) exceeds the combined vote for R2+20, so the weighted classification is R5-20, which is correct.
1.4 Reduction Techniques for IBL
Storing too many instances can result in large memory requirements and slow execution speed, and can cause oversensitivity to noise.

The basic Nearest Neighbour algorithm retains all of the training instances. It learns very quickly, because it only needs to read in the training set without much further processing, and it generalizes accurately for many applications. However, since it stores all of the training instances, it has relatively large memory requirements and must search through all available instances to classify a new input vector, so it is slow during classification.

Reducing the number of cases in memory can help. The reduction has to be done by eliminating instances while preserving classification performance.
Reduction algorithms can be characterized by:
• Points to retain: central points vs. border points
• Search direction: incremental vs. decremental
• Metric used: Euclidean, Clark, Manhattan, HVDM

Two representative algorithms are discussed below:
• DROP4
• IB3
Reduction Technique DROP4 (R. Wilson, T. Martinez)
(Decremental Reduction Optimization Procedure)

Decremental: it begins with the entire set T and removes unnecessary instances; S is the resulting reduced set. For each instance s_i the algorithm considers:
- how instance s_i is classified by the other instances, and
- how the other instances are classified without instance s_i.

Classification: given by the class of the K nearest instances.
Reduction Technique DROP4 (cont.)

Example with K = 2 neighbours. [Figure: Cases 1-5 plotted in the (Attribute 1, Attribute 2) plane.]

Case   Neighbour 1   Neighbour 2
1      2             3
2      1             4
3      5             4
4      3             5
5      3             4

Associate: case a is an associate of case b if case b has case a as a neighbour.

Case   Associate 1   Associate 2   Associate 3
1      2
2      1
3      1             4             5
4      2             3             5
5      3             4
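A small sketch of how the neighbour and associate lists used by DROP4 can be built (brute force, Euclidean distance; names are illustrative):

import numpy as np

def neighbours_and_associates(X, k=2):
    # X: (n, d) array of instances. Returns, for each instance, its k nearest
    # neighbours and the list of instances that have it as one of their neighbours.
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                    # an instance is not its own neighbour
    neighbours = {i: list(np.argsort(dists[i])[:k]) for i in range(n)}
    associates = {i: [] for i in range(n)}
    for i, ns in neighbours.items():
        for j in ns:
            associates[j].append(i)                    # i has j as a neighbour, so i is an associate of j
    return neighbours, associates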
Basic rules:
1. Remove instance s_i from S if it is correctly classified by its neighbours.
2. Remove instance s_i from S if at least as many of its associates in T would be classified correctly without s_i.

Both rules are checked using the neighbour and associate lists shown above.
Reduction Technique DROP4 (cont.)

Example 1: biquadratic filter with K = 3 neighbours.

Instance 1: SP = 3.97, t_d = 18 µs, t_r = 77 µs, V_est = -0.8716, class R1+20.
Its neighbours are 37 (R1+20), 11 (R1+20) and 19 (R1+20), so instance 1 is correctly classified by its neighbours: Rule 1 is satisfied.

Associates of instance 1 and their neighbour lists:

Associate     Neighbour 1   Neighbour 2   Neighbour 3   Neighbour 4
45 (R1+20)    1 (R1+20)     29 (R1+20)    119 (R2+20)   37 (R1+20)
37 (R1+20)    697 (R5-20)   29 (R1+20)    1 (R1+20)     45 (R1+20)
29 (R1+20)    45 (R1+20)    1 (R1+20)     28 (R1+20)    124 (R2+20)
11 (R1+20)    2 (R1+20)     45 (R1+20)    1 (R1+20)     29 (R1+20)

Analysis when dropping instance 1: each associate replaces instance 1 with its next nearest neighbour, and all of them (45, 37, 29 and 11) are still diagnosed as R1+20 (correct). The classification of the associates of instance 1 is therefore not affected by its removal: Rule 2 is satisfied and INSTANCE 1 IS DROPPED.
Example 2: biquadratic filter with K = 3 neighbours.

Instance 5: SP = 4.33, t_d = 17 µs, t_r = 76 µs, V_est = -0.8811, class R1+20.
Its neighbours are 697 (R5-20), 36 (R1+20) and 45 (R1+20); the majority class is R1+20, so instance 5 is correctly classified by its neighbours: Rule 1 is satisfied.

Associates of instance 5 and their neighbour lists:

Associate     Neighbour 1   Neighbour 2   Neighbour 3   Neighbour 4
697 (R5-20)   687 (R5-20)   5 (R1+20)     690 (R5-20)   45 (R1+20)
42 (R1+20)    5 (R1+20)     680 (R5-20)   32 (R1+20)    687 (R5-20)

With instance 5 present, associate 697 is diagnosed R5-20 (correct) and associate 42 is diagnosed R1+20 (correct). When instance 5 is dropped and each associate uses its next nearest neighbour instead, associate 697 is still diagnosed R5-20 (correct), but associate 42 is diagnosed R5-20, which is wrong. The classification of the associates of instance 5 is therefore affected by its removal: Rule 2 is NOT satisfied and INSTANCE 5 IS NOT DROPPED.
Reduction Technique IB3 (D. Aha, D. Kibler and M. Albert)
(Instance-Based Learning Algorithm 3)

Incremental: it begins with an empty set S.

It maintains a classification record for each stored instance s_i. This record indicates how well the instance classifies the instances of its own class.

Basic rules:
1. If the instance's record is higher than a certain pre-established limit, it is accepted and used for classifying the subsequent instances.
2. If it is lower than a certain limit, the instance is believed to be noisy and is dropped from the base S.
3. If it lies between the two limits, the instance is not used for prediction but its record is kept updated.
IB3 algorithm:

For each instance t in T (training set):
    Let a be the nearest acceptable instance in S to t
    (if there are no acceptable instances in S, let a be a random instance in S)
    If class(a) ≠ class(t), then add t to S
    For each instance s in S:
        If s is at least as close to t as a is:
            Update the classification record of s, and remove s from S
            if its classification record is significantly poor
Remove all non-acceptable instances from S

An instance is acceptable if the lower bound on its accuracy is statistically significantly higher (at a 90% confidence level) than the upper bound on the frequency of its class.

An instance is dropped from S if the upper bound on its accuracy is statistically significantly lower (at a 70% confidence level) than the lower bound on the frequency of its class.

Other instances are kept in S during training and are dropped at the end if they do not prove to be acceptable.
Reduction Technique IB3 (cont.)

The limits are calculated as the upper and lower confidence bounds on the probability of a Bernoulli process, whose mean is the true probability:

$\frac{p + \frac{z^2}{2n} \pm z\sqrt{\frac{p(1-p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}}$

(the + sign gives the upper bound and the - sign the lower bound).

For the accuracy limits:
  n: number of classification attempts
  p: accuracy of those attempts (number of correct matches / n)
  z: confidence index

For the class-frequency limits:
  p: frequency of instances of that class
  n: total number of instances previously processed
  z: confidence index
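A small sketch of these bounds; with p = 0.8, n = 100 and z = 0.9 it reproduces the accuracy limits of the acceptance example below (about 83.35 % and 76.17 %):

import math

def ib3_bounds(p, n, z):
    # Upper and lower confidence bounds on a Bernoulli probability, as used by IB3
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre + spread) / denom, (centre - spread) / denom

upper, lower = ib3_bounds(p=0.8, n=100, z=0.9)   # accuracy bounds of instance s_i
print(round(upper, 4), round(lower, 4))          # ~0.8335 and ~0.7617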
[Figure: the probability distribution of the success rate of instance s_i (bounds L_Si, U_Si) is compared with the probability distribution of the frequency of its class C_i (bounds L_Ci, U_Ci).]

Acceptance (confidence index z = 0.9): if L_Si > U_Ci, i.e. the probability of success is significantly higher than the probability of belonging to class C_i, the instance is accepted.

Rejection (confidence index z = 0.7): if U_Si < L_Ci, i.e. the probability of success is significantly lower than the probability of belonging to class C_i, the instance is rejected.
Example: instance s_i is ACCEPTED.

Of the N = 200 instances processed so far, N_C = 120 belong to the same class C_i as instance s_i. Instance s_i has taken part in n = 100 classification decisions (interventions), C = 80 of them correct.

Accuracy bounds (p = 80/100 = 0.8, n = 100, z = 0.9):  U_Si = 83.35 %,  L_Si = 76.17 %
Class-frequency bounds (p = 120/200 = 0.6, n = 200, z = 0.9):  U_Ci = 63.07 %,  L_Ci = 56.85 %

Since L_Si = 76.17 % > U_Ci = 63.07 %, the probability of success is significantly higher than the probability of belonging to class C_i, and instance s_i is accepted.
Example: instance s_i is REJECTED.

Of the N = 1000 instances processed so far, N_C = 700 belong to the same class C_i as instance s_i. Instance s_i has taken part in n = 900 classification decisions (interventions), C = 200 of them correct.

Accuracy bounds (z = 0.7):  U_Si = 23.49 %,  L_Si = 21 %
Class-frequency bounds (p = 700/1000 = 0.7, z = 0.7):  U_Ci = 71.29 %,  L_Ci = 68.68 %

Since U_Si = 23.49 % < L_Ci = 68.68 %, the probability of success is significantly lower than the probability of belonging to class C_i, and instance s_i is rejected.
Comparison of the resulting case sets for the biquad filter (success % per fault class):

Fault                 Classic   Spread   DROP4   IB3
R1+20 84 83 85 77
R1-20 87 94 90 90
R1+50 99 98 99 97
R1-50 100 100 100 100
R2+20,R3+20,C1+20 41 30 41 35
R2-20,R3-20,C1-20 36 35 35 31
R2+50,R3+50,C1+50 79 78 83 72
R2-50,R3-50,C1-50 94 99 96 98
R4+20 85 89 87 78
R4-20 88 84 88 80
R4+50 98 98 98 98
R4-50 100 100 100 100
R5+20 47 33 46 36
R5-20 38 40 38 43
R5+50 82 78 83 75
R5-50 91 93 94 93
R6+20 79 78 79 78
R6-20 86 83 82 74
R6+50 98 99 100 100
R6-50 100 100 100 100
C2+20 82 75 74 76
C2-20 89 90 90 83
C2+50 99 99 99 95
C2-50 100 100 100 100
NOM 69 61 72 57
Average               82.04     80.68    82.36           78.64
Size (cases)          25        12500    1112 (8.8 %)    2457 (19.6 %)
2.1 Introduction
2.2 Case Representation
2.3 Retrieval
2.4 Reuse
2.5 Revise
2.6 Retain
2.7 Training
2.8 Performance measurements
2.8.1 Training and test
2.8.2 Cross Validation
2.8.3 Confusion Matrix
2.8.4 ROC analysis
2- Case-Based Reasoning
Case-Based Reasoning is based on the idea that "similar problems have similar solutions": a new problem is solved by matching it with a similar past situation.

Advantages
• It is easy to obtain rules
• It is quite intuitive at diagnosis
• It tolerates lazy-learning schemes

Drawbacks
• Utility problem
• The order in which cases are selected during training is very important, so good training and maintenance policies must be defined
2.1 Introduction
The CBR Cycle consists of:
RETRIEVE the most similar case or cases
REUSE the information and knowledge in that case to solve the problem
REVISE the proposed solution
RETAIN the parts of this experience likely to be useful for future problem solving

[Figure: the CBR cycle. A new problem (INPUT) is matched against the CASE BASE and the GENERAL KNOWLEDGE: 1. Retrieve the most similar cases; 2. Reuse the retrieved solution; 3. Revise the proposed solution (OUTPUT); 4. Retain the learned case.]

[Figure: CBR main task and subtasks.]
2.2 Case base representation
• Decide what to store in a case
• Finding an appropriate structure for describing case contents
• Deciding how the case memory should be organized and indexed
for effective retrieval and reuse
Example of a flat case structure from a hereditary breast/ovarian cancer family study. Attributes (characteristics): individual code; year the family entered the study (2 digits); proband / not proband; country of residence; "are you the proband of your family?"; family type; date of birth; city; postal code; province of residence (code). Solution: affected by a breast/ovarian neoplasm (1/0).

1 95 0 108 0 22/03/32 LLEIDA 25003 25 1
2 95 0 108 0 12/09/28 BARCELONA 08 1
3 95 1 108 1 1 02/05/58 LLEIDA 25006 25 1
4 95 0 108 0 0 03/11/64 LLEIDA 25003 08 0
5 95 0 108 0 26/07/36 BARCELONA 08 1
6 95 1 108 1 2 26/03/51 BARCELONA 08 1
7 95 0 108 0 23/07/19 BARCELONA 08 1

Flat case structure: each case is simply a vector of attribute values together with its solution.
Case contents
• Problem description
  – Goals to be solved
  – Constraints on the goals
  – Other descriptive information (initial data)
• Solutions
  – The solution itself
  – The steps of the reasoning process (trace)
  – Justifications of the decisions
  – Alternative solutions
  – Expectations
• Feedback on the outcome (if any)
  – Success or failure of the solution
  – Fulfilled or violated expectations
  – Explanation of the failure, of the anomalies and of the repair strategies, next case, ...
Indexes: vocabulary
• Purpose: retrieve the most useful cases; it depends on the task and on the characteristics of the domain
• Indexes can come from
  – Observed features
  – Derived (inferred) features
• "Good" indexes are
  – Predictive
  – Discriminating
  – Abstract at the right level

Indexes: selection
• Indexes should be
  – predictive (e.g. categories)
  – discriminating (e.g. values)
  – explanatory
• What should be indexed
  – Solutions
  – Correct results
  – Erroneous results
Memory organizations
• Flat organization, serial search
• Hierarchical organization, shared-feature networks
• Discrimination networks with priorities
• Discrimination networks with redundancies
• Flat organization, parallel search
• Hierarchical organization, parallel search
Flat organization, serial search
• Cases are stored in a list, table or file. Example case base (Size, Colour, Shape):
  (Small, Red, Sphere)
  (Large, Red, Sphere)
  (Large, Blue, Prism)
  (Large, Red, Pyramid)
• Search: every case is visited and compared with the query, attribute by attribute (exact, partial or no match).

Input: (Large, Orange, Prism)
Output: (Large, Blue, Prism)
Flat organization ...
• Advantages
  – Returns the most similar case or set of cases
  – Adding a case is simple
• Drawbacks
  – Retrieval is costly
• Variations
  – Shallow indexing
  – Partitioning the case library
  – Parallel search and comparison
Hierarchical organization, shared-feature networks
• Each case sits in a node of a tree
• The graph subdivides the space according to the features shared by the cases
• Thresholds on the feature values can be included
• Retrieval: the tree is traversed breadth-first, following the nodes (clusters of cases) with the highest similarity

Example (the same four cases as above, organized first by size, then colour, then shape):
Input: (Large, Orange, Prism)
Output: (Large, Red, Pyramid), which differs from the case retrieved by the exhaustive flat search above
Considerations
• The hierarchy of the nodes should correspond to the importance of the features.
• In the example, if SHAPE is more important than COLOUR, the tree is reorganized so that shape is tested before colour. [Figure: alternative shared-feature tree.]
Hierarchical organization ...
• Advantages
  – More efficient retrieval
• Drawbacks
  – Adding a case is more complex
  – Optimal maintenance is costly
  – Extra space is needed for the organization
  – There is no guarantee that a better case will not be missed
2.3 Case Retrieval
Three main issues:
• Similarity function
• Matching
• Ranking
• Similarity function
  – Between features (local): total (exact) similarity or partial similarity
  – Between cases (global): total (exact) similarity or partial similarity

• Matching
  – Structural
  – Semantic
  – Organizational
  – Pragmatic

Retrieval starts with a (possibly partial) problem description and ends when a best-matching previous case is found.
Example of a similarity defined by a table, for the nominal attribute "computer use":

              Domestic   Industrial   Space
Domestic      1          0.4          0
Industrial    0.8        1            0.2
Space         0.6        0.8          1
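A minimal sketch of a table-defined local similarity for this nominal attribute (the dictionary mirrors the table above; reading the entries as sim(stored value, query value) is an assumption):

# Local similarity between values of the nominal attribute "computer use"
COMPUTER_USE_SIM = {
    ("domestic", "domestic"): 1.0, ("domestic", "industrial"): 0.4, ("domestic", "space"): 0.0,
    ("industrial", "domestic"): 0.8, ("industrial", "industrial"): 1.0, ("industrial", "space"): 0.2,
    ("space", "domestic"): 0.6, ("space", "industrial"): 0.8, ("space", "space"): 1.0,
}

def local_similarity(x, y, table=COMPUTER_USE_SIM):
    return table[(x, y)]

print(local_similarity("industrial", "domestic"))   # 0.8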
• Ranking
  – Eliminate the cases with low scores
  – Order the cases:
    · value features that always appear together (context)
    · trade off the number of features that are present against those that are absent
    · assess the inferential cost of adaptation
    · consider the most specific cases before the most general ones
    · prefer the most frequent cases
    · prefer the most recent cases

The selection of the best case must be a fast process; otherwise the efficiency of case-based reasoning is lost.
2.4 Case Reuse
• Copy
• Adapt

The reuse of a retrieved case solution in the context of the new case
focuses on two aspects:

- the differences between the past case and the current case

- what part of the retrieved case can be transferred to the new solution
2 main approaches:
• Copy: a simple classification where differences are abstracted away
and the solution class of the retrieved case is transferred to the new
case as its solution class

• Adapt:
Reuse the past solution (transformational reuse): transformational operators {T} are applied to adapt the old solution to the differences of the new case.

Reuse the past method that constructed the solution (derivational reuse): the retrieved case holds information about the method used to solve the old problem, and the old plan is "replayed" in the new context, looking at the justifications of operators, subgoals, alternatives and failed search paths.
Adaptation
• Goal:
  – Adjust a non-exact solution to fit the current problem
  – Repair a faulty solution
• What to adapt: values, structures
• Methods
  – Substitution
  – Transformation
  – Others
Example
Current case: stir-fry, chicken and peas. Retrieved case from memory: stir-fry, beef and broccoli, with the recipe "cut the broccoli into pieces, shred the beef, marinate the beef in ...". How should the retrieved recipe be adapted?

A taxonomy guides the substitution:
  Proteins: meat (red meat: beef; poultry: chicken), eggs, seafood (fish, shellfish)
  Vegetables: yellow, green (broccoli, peas)

Adapted case: stir-fry, chicken and peas, with the recipe "cut the peas into pieces, shred the chicken, marinate the chicken in ..." (beef is substituted by chicken and broccoli by peas following the taxonomy).
• Voting (majority rule): the class with more votes wins.

Example with K = 6 neighbours (1: Cancer; 2: Healthy):

Similarity   Class
0.623069     1
0.579494     1
0.567825     2
0.615917     2
0.411370     1
0.507618     1

Voting result: 4 Cancer, 2 Healthy, so the diagnosis is Cancer.
• Bilska-Wolak and Floyd method

For a new case T, the decision variable DV_bw is calculated from the K retrieved cases {C_1, C_2, ..., C_K} as

$DV_{bw} = \frac{\text{number of cancer ('+') cases in } \{C_1, C_2, \ldots, C_K\}}{K}$

A threshold level t is defined: if DV_bw ≥ t the case is classified as Cancer; if DV_bw < t it is classified as Healthy.

Example: the retrieved cases are C_1+, C_2+, C_3-, C_4- with similarities 0.8, 0.6, 0.5 and 0.4. Then DV_bw = 2/4 = 0.5; with threshold t = 0.5, DV_bw ≥ t and the case is classified as Cancer.
This decision is independent of the similarity degree of the cases.
• Bilska-Wolak and Floyd method: example with a threshold t = 0.5

Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV_bw   Diagnosis of T
0.5           0.5           0.5           0.5           0.5     +
0.8           0.8           0.5           0.5           0.5     +
0.4           0.4           0.5           0.5           0.5     +
0.4           0.5           0.4           0.5           0.5     +

In every row two of the four retrieved cases are "+", so DV_bw = 2/4 = 0.5 ≥ t and all four test cases are diagnosed as Cancer, whatever the similarity values.
• Other method

The decision variable DV takes the similarity values into account:

$DV = \frac{\sum_{i \in C^+} Sim(C_i, T)}{\sum_{i=1}^{K} Sim(C_i, T)}$

A threshold level t is defined: if DV ≥ t the case is classified as Cancer; if DV < t it is classified as Healthy.

Example: for the retrieved cases C_1+, C_2+, C_3-, C_4- with similarities 0.8, 0.6, 0.5 and 0.4,

$DV = \frac{0.8 + 0.6}{0.8 + 0.6 + 0.5 + 0.4} = \frac{1.4}{2.3} \approx 0.61$

With threshold t = 0.5, DV ≥ t and the case is classified as Cancer.
• Other method: example with a threshold t = 0.5

Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV     Diagnosis of T
0.5           0.5           0.5           0.5           0.50   +  (Cancer)
0.8           0.8           0.5           0.5           0.61   +  (Cancer)
0.4           0.4           0.5           0.5           0.44   -  (Healthy)
0.4           0.5           0.4           0.5           0.50   +  (Cancer)

For example, the second row gives DV = (0.8 + 0.8) / (0.8 + 0.8 + 0.5 + 0.5) = 0.61, while the third row gives DV = (0.4 + 0.4) / (0.4 + 0.4 + 0.5 + 0.5) = 0.44, which falls below the threshold.
• Other method vs. Bilska-Wolak: comparison with a threshold t = 0.5

Our method:
Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV     Diagnosis of T
0.5           0.5           0.5           0.5           0.50   +
0.8           0.8           0.5           0.5           0.61   +
0.4           0.4           0.5           0.5           0.44   -
0.4           0.5           0.4           0.5           0.50   +

Bilska-Wolak:
Sim(C_1+,T)   Sim(C_2+,T)   Sim(C_3-,T)   Sim(C_4-,T)   DV_bw   Diagnosis of T
0.5           0.5           0.5           0.5           0.5     +
0.8           0.8           0.5           0.5           0.5     +
0.4           0.4           0.5           0.5           0.5     +
0.4           0.5           0.4           0.5           0.5     +

Taking the similarity values into account helps discriminating between the cases.
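A sketch of both decision variables, assuming the retrieved cases are given as (similarity, label) pairs with "+" marking the cancer cases (names are illustrative):

def dv_bilska_wolak(cases):
    # Fraction of retrieved cases labelled "+", independent of the similarities
    return sum(1 for _, label in cases if label == "+") / len(cases)

def dv_weighted(cases):
    # Similarity-weighted decision variable: sum of "+" similarities over the total
    return sum(s for s, label in cases if label == "+") / sum(s for s, _ in cases)

retrieved = [(0.4, "+"), (0.4, "+"), (0.5, "-"), (0.5, "-")]   # third row of the tables above
t = 0.5
print(dv_bilska_wolak(retrieved) >= t)   # True  -> Cancer  (0.5 >= 0.5)
print(dv_weighted(retrieved) >= t)       # False -> Healthy (0.44 < 0.5)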
2.5 Case Revision
• Occurs when a case solution generated by the reuse phase is not correct
• An opportunity for learning occurs
• Consists of two tasks
(1) evaluate the case solution generated by reuse. If successful, learn
from the success
(2) otherwise repair the case solution using domain-specific knowledge
• Evaluate solution
- steps outside the CBR system
- requires asking the expert or performing the task in the real world.
- example: the success or failure of a medical treatment
• Repair fault
- involves detecting the errors of the current solution and retrieving or generating explanations for them
- errors may be predicted, handled and avoided
- the revised plan can then be retained directly, or re-evaluated and repaired again

2.6 Case Retention
•Process of incorporating what is useful to retain from the new problem solving
episode into existing knowledge

•Extract
-case base is updated regardless of how problem was solved
-failures, or information from the revise task, may be saved as well
-relevant problem descriptors, problem solutions, explanations, or
justifications can all be saved and reused later

•Index
- “indexing problem” is a major problem in case-based reasoning
- How do we structure the search space of indexes?
- trivially we can simply use all input features as indices (the approach of
syntax-based methods within instance-based and memory-based
reasoning)

•Integrate
- Integration of the new case knowledge into the existing case-base.
• Example: IB-like case-base maintenance algorithm

[Figure: evolution of L_Si and U_Ci with the number of interventions, for z = 0.9 and z = 0.3.] For z = 0.9 the cases are marked for removal sooner (lower N_cross) than when using z = 0.3.

Cases are eliminated when:
- they diagnose incorrectly (L_Si < U_Ci), or
- they take no part in the decisions on cases of their own class.
2.7 Training

Two different usages of CBR:
- Training: the solution of each presented case is known (supervised learning).
- Diagnosing (new cases): a new case i, described by its features (V_est_i, t_r_i, t_d_i, SP_i), arrives with its solution unknown, and the system must propose one.
[Figure: CBR system performance while training: percentage of correct diagnoses and number of stored cases as a function of the number of training presentations.]
2.8 Performance measurements

2.8.1 Training and test

A set of memory (training) cases and a set of test cases are needed. If no new cases are available beyond those used for training, a subset of them must be used for training and the rest for testing the CBR system. Normally, however, the system should be checked on several different test sets: by chance, the chosen test cases may make the success rate look very bad or, for the same reason, very good. The result can come out biased.
2.8.2 Cross-validation

Divide the available cases into M subsets. One of the subsets is used as the test set and the other M-1 are used to train the system. The process is repeated using each of the subsets as the test set and the rest for training (see the sketch below).

Drawbacks:
- High computation time
- If the base contains few cases, setting some of them aside for testing means losing training cases
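A minimal M-fold cross-validation sketch; the classify argument is a placeholder assumed to follow the (query, instances, labels) interface of the k-NN sketch in Section 1:

import numpy as np

def cross_validate(X, y, classify, M=5):
    # Split the case base into M folds; each fold is used once as the test set
    indices = np.arange(len(X))
    folds = np.array_split(indices, M)
    accuracies = []
    for m in range(M):
        test = folds[m]
        train = np.concatenate([folds[j] for j in range(M) if j != m])
        hits = sum(classify(X[i], X[train], [y[j] for j in train]) == y[i] for i in test)
        accuracies.append(hits / len(test))
    return float(np.mean(accuracies))    # average success rate over the M folds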
[Figure: the training process. The case base is divided into subsets S_1, S_2, ..., S_n of L cases each (Case 1..Case L, Case L+1..Case 2L, ...), and different training sets T_1, T_2, ..., T_m are built as different combinations/orderings of these subsets, because the training process is sensitive to how the training cases are sorted.]
[Figure: example of training with multiple combinations of the training sets (training 132): percentage wrong, total correct, correct with precision, correct at component level, correct at module level, and number of stored cases versus the number of trainings.]
2.8.3 Confusion Matrix

                     Real
Estimated       P         N
    P           tp        fp
    N           fn        tn

P: number of positive cases (e.g. ill)
N: number of negative cases (e.g. healthy)
TP: number of positive cases correctly classified
TN: number of negative cases correctly classified
FP: number of negative cases incorrectly classified as positive
FN: number of positive cases incorrectly classified as negative

Lower-case letters denote rates over the whole test set, e.g. tp = true positive rate:

tp = TP / (P + N),   tn = TN / (P + N),   fp = FP / (P + N),   fn = FN / (P + N)
Example of a confusion matrix (positive class: 1)

Case      Estimated   Real
Case 1    1           1
Case 2    1           1
Case 3    2           1
Case 4    2           2
Case 5    1           1
Case 6    1           2
Case 7    2           1
Case 8    2           2
Case 9    1           1
Case 10   2           2

P = 6, N = 4
TP = 4, TN = 3, FP = 1, FN = 2

tp = TP/(P+N) = 0.4,   tn = TN/(P+N) = 0.3,   fp = FP/(P+N) = 0.1,   fn = FN/(P+N) = 0.2

                     Real
Estimated       P         N
    P           0.4       0.1
    N           0.2       0.3
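A quick sketch that recomputes the rates of this example from plain Python lists:

def confusion_rates(estimated, real, positive=1):
    n_total = len(real)
    tp = sum(e == positive and r == positive for e, r in zip(estimated, real))
    tn = sum(e != positive and r != positive for e, r in zip(estimated, real))
    fp = sum(e == positive and r != positive for e, r in zip(estimated, real))
    fn = sum(e != positive and r == positive for e, r in zip(estimated, real))
    return {k: v / n_total for k, v in dict(tp=tp, tn=tn, fp=fp, fn=fn).items()}

estimated = [1, 1, 2, 2, 1, 1, 2, 2, 1, 2]
real      = [1, 1, 1, 2, 1, 2, 1, 2, 1, 2]
print(confusion_rates(estimated, real))   # {'tp': 0.4, 'tn': 0.3, 'fp': 0.1, 'fn': 0.2}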
2.8.4 ROC analysis

ROC (Receiver Operating Characteristic) curves graphically represent the relation between tp and fp of a diagnosis system as a function of a decision parameter. The procedure to draw them is the following:

1. Apply the CBR method to all the test cases.

2. Count the number of ill cases correctly detected (TP) and the number of healthy cases diagnosed as ill (FP); from these, compute the true positive (tp) and false positive (fp) rates.

3. Repeat this calculation for different thresholds t, varying t from 0 to 1, finally obtaining the ROC curve.
L’algorisme seguit ha estat el següent:

1.Donat un nou pacient (cas de test)

2.S’ha calculat la similitud del cas de test amb cada un dels casos de la
memòria.

3.S’han extret els n casos de la memòria més similars al cas de test

4. S’ha calculat el coeficient

( )
( )
¿
¿
= ¸
i
j
p sim
p sim
On p
j
són els casos extrets de la memòria que tenen diabetis i p
i
són tots
els casos extrets.

5.S’estableix un llindar de decisió t. Si ¸ > t es considera que el cas de test
és diabètic. Altrament, es considera que està sa.
Exemple. Diagnòstic de diabetis
[Figure: ROC curves (true positive rate vs. false positive rate) of the diabetes CBR system for n = 3, 5, 7, 9, 11 and 13 retrieved cases.]
The curve helps choose the most suitable threshold value; the expert decides how many FP and TP are tolerable depending on the application. A good diagnosis method gives ROC curves that lie above the diagonal and pass as close as possible to the point [0,1] (the ideal result).
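A compact sketch of the threshold sweep that produces the ROC points, assuming the γ coefficients and the true labels of the test cases have already been computed (it uses the usual ROC normalization TP/P and FP/N; names are illustrative):

import numpy as np

def roc_points(gammas, labels, thresholds=np.linspace(0.0, 1.0, 21)):
    # gammas[i]: coefficient of test case i; labels[i]: True if the patient really has diabetes
    P = sum(labels)
    N = len(labels) - P
    points = []
    for t in thresholds:
        predicted = [g > t for g in gammas]            # decision rule of step 5
        TP = sum(p and l for p, l in zip(predicted, labels))
        FP = sum(p and not l for p, l in zip(predicted, labels))
        points.append((FP / N, TP / P))                # one (fp, tp) point of the ROC curve
    return points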
