
Motivating Example

Memory-Based Learning

Instance-Based Learning

K-Nearest Neighbor

Inductive Assumption

• Similar inputs map to similar outputs
  – If not true => learning is impossible
  – If true => learning reduces to defining “similar”
• Not all similarities created equal
  – predicting a person’s height may depend on different attributes than predicting their IQ

1-Nearest Neighbor

Dist(c1, c2) = sqrt( Σ_{i=1..N} ( attr_i(c1) − attr_i(c2) )² )
NearestNeighbor = MIN_j( Dist(c_j, c_test) )
prediction_test = class_j (or value_j)

• works well if no attribute noise, class noise, class overlap
• can learn complex functions (sharp class boundaries)
• as number of training cases grows large, error rate of 1-NN is at most 2 times the Bayes optimal rate (i.e. if you knew the true probability of each class for each test case)
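A minimal sketch of 1-NN as defined above, using Euclidean distance over numeric attributes; the function names and the tiny toy dataset are illustrative, not from the slides:

```python
import numpy as np

def dist(c1, c2):
    """Euclidean distance over the attribute vectors of two cases."""
    return np.sqrt(np.sum((c1 - c2) ** 2))

def predict_1nn(train_X, train_y, test_x):
    """Return the class (or value) of the single nearest training case."""
    distances = [dist(x, test_x) for x in train_X]
    j = int(np.argmin(distances))
    return train_y[j]

# toy example: two numeric attributes, two classes
train_X = np.array([[1.0, 1.0], [1.2, 0.9], [5.0, 5.0], [5.1, 4.8]])
train_y = np.array(["+", "+", "o", "o"])
print(predict_1nn(train_X, train_y, np.array([1.1, 1.0])))  # -> "+"
```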

k-Nearest Neighbor

Dist(c1, c2) = sqrt( Σ_{i=1..N} ( attr_i(c1) − attr_i(c2) )² )
k-NearestNeighbors = { k-MIN( Dist(c_i, c_test) ) }
prediction_test = (1/k) Σ_{i=1..k} class_i (or (1/k) Σ_{i=1..k} value_i)

• Average of k points more reliable when:
  – noise in attributes
  – noise in class labels
  – classes partially overlap

[Figure: + and o cases plotted against attribute_1 and attribute_2, with partially overlapping classes. From Hastie, Tibshirani, Friedman 2001, p. 418]

How to choose “k”

• Large k:
  – less sensitive to noise (particularly class noise)
  – better probability estimates for discrete classes
  – larger training sets allow larger values of k
• Small k:
  – captures fine structure of problem space better
  – may be necessary with small training sets
• Balance must be struck between large and small k
• As training set approaches infinity, and k grows large, kNN becomes Bayes optimal

[Figure: From Hastie, Tibshirani, Friedman 2001, p. 418]
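A sketch of the k-NN prediction above in its regression (“value”) form; for discrete classes the same k neighbors would typically be combined by a majority vote. Names and toy data are illustrative:

```python
import numpy as np

def predict_knn(train_X, train_y, test_x, k=3):
    """Average the targets of the k nearest training cases (regression form)."""
    d = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))   # Euclidean distances
    nearest = np.argsort(d)[:k]                           # indices of the k-MIN distances
    return train_y[nearest].mean()

# toy regression data: y roughly tracks the first attribute
train_X = np.array([[1.0, 0.2], [2.0, 0.1], [3.0, 0.3], [4.0, 0.2]])
train_y = np.array([1.1, 2.0, 2.9, 4.2])
print(predict_knn(train_X, train_y, np.array([2.5, 0.2]), k=2))  # ~2.45
```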

[Figure: From Hastie, Tibshirani, Friedman 2001, p. 419]

Cross-Validation

• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• Leave-one-out cross-validation:
  – “remove” each case one-at-a-time
  – use as test case with remaining cases as train set
  – average performance over all test cases
• LOOCV is impractical with most learning methods, but extremely efficient with MBL!

Distance-Weighted kNN

• tradeoff between small and large k can be difficult
  – use large k, but more emphasis on nearer neighbors?

prediction_test = Σ_{i=1..k} w_i · class_i / Σ_{i=1..k} w_i (or Σ_{i=1..k} w_i · value_i / Σ_{i=1..k} w_i)
w_k = 1 / Dist(c_k, c_test)
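A sketch tying the two slides together: leave-one-out cross-validation is used to score candidate values of k for the distance-weighted prediction above. Function names and the toy 1-D regression data are illustrative:

```python
import numpy as np

def predict_weighted_knn(train_X, train_y, test_x, k):
    """Distance-weighted kNN: each of the k neighbors gets weight 1/Dist."""
    d = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]
    w = 1.0 / (d[nearest] + 1e-12)        # guard against a zero distance
    return np.sum(w * train_y[nearest]) / np.sum(w)

def loocv_error(train_X, train_y, k):
    """Leave-one-out cross-validation: each case in turn is the test case."""
    errors = []
    for i in range(len(train_X)):
        mask = np.arange(len(train_X)) != i
        pred = predict_weighted_knn(train_X[mask], train_y[mask], train_X[i], k)
        errors.append((pred - train_y[i]) ** 2)
    return float(np.mean(errors))

# pick k by LOOCV on a toy 1-D regression problem
X = np.linspace(0, 3, 20).reshape(-1, 1)
y = X.ravel() ** 2 + np.random.default_rng(0).normal(0, 0.1, 20)
best_k = min(range(1, 10), key=lambda k: loocv_error(X, y, k))
print("best k by LOOCV:", best_k)
```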

Locally Weighted Averaging

• Let k = number of training points
• Let weight fall off rapidly with distance

prediction_test = Σ_{i=1..k} w_i · class_i / Σ_{i=1..k} w_i (or Σ_{i=1..k} w_i · value_i / Σ_{i=1..k} w_i)
w_k = 1 / e^( KernelWidth · Dist(c_k, c_test) )

• KernelWidth controls size of neighborhood that has large effect on value (analogous to k)

Locally Weighted Regression

• All algorithms so far are strict averagers: interpolate, but can’t extrapolate
• Do weighted regression, centered at test point, weight controlled by distance and KernelWidth
• Local regressor can be linear, quadratic, n-th degree polynomial, neural net, …
• Yields piecewise approximation to surface that typically is more complex than local regressor
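A sketch of both ideas, assuming the exponential weight from the slide; locally weighted regression is shown with a linear local model fit by ordinary weighted least squares (one of the possible local regressors). All names are illustrative:

```python
import numpy as np

def kernel_weights(train_X, test_x, kernel_width):
    """Weights fall off exponentially with distance: w = 1 / exp(KernelWidth * Dist)."""
    d = np.sqrt(((train_X - test_x) ** 2).sum(axis=1))
    return np.exp(-kernel_width * d)

def locally_weighted_average(train_X, train_y, test_x, kernel_width=1.0):
    w = kernel_weights(train_X, test_x, kernel_width)
    return np.sum(w * train_y) / np.sum(w)

def locally_weighted_linear_regression(train_X, train_y, test_x, kernel_width=1.0):
    """Fit a weighted linear model centered on the test point, then evaluate it there."""
    w = kernel_weights(train_X, test_x, kernel_width)
    A = np.hstack([train_X, np.ones((len(train_X), 1))])   # add intercept column
    sw = np.sqrt(w)[:, None]
    beta, *_ = np.linalg.lstsq(A * sw, train_y * sw.ravel(), rcond=None)
    return np.append(test_x, 1.0) @ beta

X = np.linspace(0, 3, 30).reshape(-1, 1)
y = X.ravel() ** 2
print(locally_weighted_average(X, y, np.array([3.5])))            # interpolates only
print(locally_weighted_linear_regression(X, y, np.array([3.5])))  # can extrapolate a trend
```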

Euclidean Distance

D(c1, c2) = sqrt( Σ_{i=1..N} ( attr_i(c1) − attr_i(c2) )² )

• gives all attributes equal weight?
  – only if scale of attributes and differences are similar
  – scale attributes to equal range or equal variance
• assumes spherical classes

[Figure: + and o cases in the attribute_1 × attribute_2 plane with compact, roughly spherical classes]

Euclidean Distance?

[Figure: + and o cases in the attribute_1 × attribute_2 plane with elongated, non-spherical classes]

• if classes are not spherical?
• if some attributes are more/less important than other attributes?
• if some attributes have more/less noise in them than other attributes?
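A sketch of the two scalings mentioned above (equal range and equal variance), applied column-wise before computing distances; function names and the toy data are illustrative:

```python
import numpy as np

def scale_to_equal_range(X):
    """Rescale each attribute to the [0, 1] range."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid dividing by zero for constant attributes
    return (X - lo) / span

def scale_to_equal_variance(X):
    """Rescale each attribute to zero mean and unit variance."""
    std = X.std(axis=0)
    std = np.where(std > 0, std, 1.0)
    return (X - X.mean(axis=0)) / std

# attribute 0 is in the thousands, attribute 1 is tiny: unscaled, attribute 0 dominates distance
X = np.array([[1000.0, 0.1], [2000.0, 0.9], [1500.0, 0.5]])
print(scale_to_equal_range(X))
print(scale_to_equal_variance(X))
```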

Weighted Euclidean Distance

D(c1, c2) = sqrt( Σ_{i=1..N} w_i · ( attr_i(c1) − attr_i(c2) )² )

• large weights => attribute is more important
• small weights => attribute is less important
• zero weights => attribute doesn’t matter
• Weights allow kNN to be effective with axis-parallel elliptical classes
• Where do weights come from?

Learning Attribute Weights

• Scale attribute ranges or attribute variances to make them uniform (fast and easy)
• Prior knowledge
• Numerical optimization:
  – gradient descent, simplex methods, genetic algorithm
  – criterion is cross-validation performance
• Information Gain or Gain Ratio of single attributes
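A minimal sketch of the weighted distance, plus one of the listed options for learning the weights: a crude random search scored by leave-one-out 1-NN accuracy (standing in for the more careful numerical optimizers named above). The data and names are illustrative:

```python
import numpy as np

def weighted_dist(c1, c2, w):
    """Weighted Euclidean distance; w_i scales the contribution of attribute i."""
    return np.sqrt(np.sum(w * (c1 - c2) ** 2))

def loocv_accuracy(X, y, w):
    """Leave-one-out accuracy of 1-NN under a given weight vector."""
    hits = 0
    for i in range(len(X)):
        idx = [j for j in range(len(X)) if j != i]
        d = [weighted_dist(X[i], X[j], w) for j in idx]
        hits += y[idx[int(np.argmin(d))]] == y[i]
    return hits / len(X)

# crude numerical optimization: random search for weights, scored by LOOCV
rng = np.random.default_rng(0)
X = np.array([[0.1, 5.0], [0.2, -3.0], [0.9, 4.0], [1.0, -6.0]])  # attribute 0 is informative
y = np.array([0, 0, 1, 1])
best_w = max((rng.uniform(0, 1, 2) for _ in range(200)),
             key=lambda w: loocv_accuracy(X, y, w))
print("learned weights:", best_w)
```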

Information Gain

• Information Gain = reduction in entropy due to splitting on an attribute
• Entropy = expected number of bits needed to encode the class of a randomly drawn + or − example using the optimal info-theory coding

Entropy(S) = − p+ log2 p+ − p− log2 p−

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · Entropy(S_v)

Splitting Rules

GainRatio(S, A) = [ Entropy(S) − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · Entropy(S_v) ] / [ − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · log2( |S_v| / |S| ) ]
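A sketch of these three quantities for a discrete attribute; the helper names and the tiny +/− example are illustrative:

```python
import numpy as np

def entropy(labels):
    """Entropy of a class label array, in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(labels, attribute_values):
    """Gain(S, A): entropy reduction from splitting S on attribute A."""
    total = entropy(labels)
    n = len(labels)
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

def gain_ratio(labels, attribute_values):
    """GainRatio(S, A): gain normalized by the split information."""
    n = len(labels)
    _, counts = np.unique(attribute_values, return_counts=True)
    p = counts / n
    split_info = float(-np.sum(p * np.log2(p)))
    return information_gain(labels, attribute_values) / split_info if split_info > 0 else 0.0

labels = np.array(["+", "+", "-", "-", "+", "-"])
attr = np.array(["a", "a", "b", "b", "a", "b"])
print(information_gain(labels, attr), gain_ratio(labels, attr))
```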

Gain_Ratio Correction Factor

[Chart: “Gain Ratio for Equal Sized n-Way Splits” — correction factor (0 to 6) vs. number of splits (0 to 50)]

GainRatio Weighted Euclidean Distance

D(c1, c2) = sqrt( Σ_{i=1..N} gain_ratio_i · ( attr_i(c1) − attr_i(c2) )² )

• weight with gain_ratio after scaling?
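A sketch of how the per-attribute gain ratios could be plugged in as the distance weights; for the gain-ratio computation the attributes are assumed to be discrete (e.g. ordinals coded as small integers), and all names and data are illustrative:

```python
import numpy as np

def _entropy(labels):
    _, c = np.unique(labels, return_counts=True)
    p = c / c.sum()
    return float(-np.sum(p * np.log2(p)))

def gain_ratio(labels, attr_values):
    """Gain(S, A) divided by the split information of attribute A."""
    n = len(labels)
    gain = _entropy(labels)
    for v in np.unique(attr_values):
        sub = labels[attr_values == v]
        gain -= (len(sub) / n) * _entropy(sub)
    _, c = np.unique(attr_values, return_counts=True)
    p = c / n
    split_info = float(-np.sum(p * np.log2(p)))
    return gain / split_info if split_info > 0 else 0.0

def gain_ratio_weighted_dist(c1, c2, weights):
    """Weighted Euclidean distance with per-attribute gain ratios as the weights."""
    return np.sqrt(np.sum(weights * (c1 - c2) ** 2))

# toy data: attribute 0 predicts the class, attribute 1 is noise (both already scaled, discrete)
X = np.array([[0, 2], [0, 0], [1, 2], [1, 1], [0, 1], [1, 0]])
y = np.array(["+", "+", "-", "-", "+", "-"])
w = np.array([gain_ratio(y, X[:, i]) for i in range(X.shape[1])])
print("gain-ratio weights:", w)
print(gain_ratio_weighted_dist(X[0], X[2], w))
```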

Booleans, Nominals, Ordinals, and Reals

• Consider attribute value differences: ( attr_i(c1) − attr_i(c2) )
• Reals: easy! full continuum of differences
• Integers: not bad: discrete set of differences
• Ordinals: not bad: discrete set of differences
• Booleans: awkward: Hamming distances 0 or 1
• Nominals? not good! recode as Booleans?

Curse of Dimensionality

• as number of dimensions increases, distance between points becomes larger and more uniform
• if number of relevant attributes is fixed, increasing the number of less relevant attributes may swamp distance

D(c1, c2) = sqrt( Σ_{i=1..relevant} ( attr_i(c1) − attr_i(c2) )² + Σ_{j=1..irrelevant} ( attr_j(c1) − attr_j(c2) )² )

• when more irrelevant than relevant dimensions, distance becomes less reliable
• solutions: larger k or KernelWidth, feature selection, feature weights, more complex distance functions
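A sketch of one way to handle the mixed attribute types above inside a single distance function; the attribute names and type labels are illustrative, not from the slides:

```python
import numpy as np

def mixed_distance(c1, c2, attr_types):
    """Per-attribute differences for mixed attribute types:
    - "real" / "integer" / "ordinal": plain numeric difference
    - "boolean": Hamming difference, 0 or 1
    - "nominal": mismatch indicator (0 or 1); recoding as one-hot Booleans would instead
      contribute 2 to the squared sum for the same mismatch."""
    total = 0.0
    for v1, v2, t in zip(c1, c2, attr_types):
        if t in ("real", "integer", "ordinal"):
            diff = float(v1) - float(v2)
        else:  # "boolean" or "nominal"
            diff = 0.0 if v1 == v2 else 1.0
        total += diff ** 2
    return np.sqrt(total)

# case = (height_in_m, num_children, owns_car, favorite_color)
attr_types = ("real", "integer", "boolean", "nominal")
c1 = (1.80, 2, True, "red")
c2 = (1.65, 2, False, "blue")
print(mixed_distance(c1, c2, attr_types))  # sqrt(0.15**2 + 0 + 1 + 1)
```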

Advantages of Memory-Based Methods

• Lazy learning: don’t do any work until you know what you want to predict (and from what variables!)
  – never need to learn a global model
  – many simple local models taken together can represent a more complex global model
  – better focused learning
  – handles missing values, time-varying distributions, ...
• Very efficient cross-validation
• Intelligible learning method to many users
• Nearest neighbors support explanation and training
• Can use any distance metric: string-edit distance, …

Weaknesses of Memory-Based Methods

• Curse of Dimensionality:
  – often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn’t seem to perform as well as other methods such as neural nets
• Predicted values for regression not continuous

Combine KNN with ANN

• Train neural net on problem
• Use outputs of neural net or hidden unit activations as new feature vectors for each point
• Use KNN on new feature vectors for prediction
• Does feature selection and feature creation
• Sometimes works better than KNN or ANN

Current Research in MBL

• Condensed representations to reduce memory requirements and speed up neighbor finding, to scale to 10^6–10^12 cases
• Learn better distance metrics
• Feature selection
• Overfitting, VC-dimension, ...
• MBL in higher dimensions
• MBL in non-numeric domains:
  – Case-Based Reasoning
  – Reasoning by Analogy
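A minimal sketch of the KNN-on-ANN-features pipeline described above. For brevity the “trained” network is replaced by a randomly initialized hidden layer, so the weights W, b (and every other name here) are placeholders standing in for a net actually trained on the problem:

```python
import numpy as np

def hidden_activations(X, W, b):
    """Hidden-layer activations of a one-hidden-layer net; in practice W and b
    would come from a network trained on the problem."""
    return np.tanh(X @ W + b)

def predict_1nn(train_F, train_y, test_f):
    d = np.sqrt(((train_F - test_f) ** 2).sum(axis=1))
    return train_y[int(np.argmin(d))]

rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))
y_train = (X_train[:, 0] > 0).astype(int)
x_test = rng.normal(size=5)

# stand-in for a trained net: random weights for a 3-unit hidden layer
W, b = rng.normal(size=(5, 3)), rng.normal(size=3)

F_train = hidden_activations(X_train, W, b)   # new feature vectors for every training point
f_test = hidden_activations(x_test, W, b)
print(predict_1nn(F_train, y_train, f_test))  # kNN prediction in the learned feature space
```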

References

• “Locally Weighted Learning” by Atkeson, Moore, and Schaal
• “Tuning Locally Weighted Learning” by Schaal, Atkeson, and Moore

Closing Thought

• In many supervised learning problems, all the information you ever have about the problem is in the training set.
• Why do most learning methods discard the training data after doing learning?
• Do neural nets, decision trees, and Bayes nets capture all the information in the training set when they are trained?
• In the future, we’ll see more methods that combine MBL with these other learning methods:
  – to improve accuracy
  – for better explanation
  – for increased flexibility
