Memory-Based Learning
Instance-Based Learning
K-Nearest Neighbor
k-Nearest Neighbor

Dist(c1, c2) = sqrt( Σ_{i=1..N} (attr_i(c1) − attr_i(c2))² )

k-NearestNeighbors = { the k training cases c_i with the smallest Dist(c_i, c_test) }

prediction_test = (1/k) Σ_{i=1..k} class_i   (or (1/k) Σ_{i=1..k} value_i)

• Average of k points is more reliable when:
  – there is noise in the attributes
  – there is noise in the class labels
  – classes partially overlap

[Figure: overlapping "+" and "o" classes plotted in attribute_1 × attribute_2 space]

How to choose “k”

• Large k:
  – less sensitive to noise (particularly class noise)
  – better probability estimates for discrete classes
  – larger training sets allow larger values of k
• Small k:
  – captures the fine structure of the problem space better
  – may be necessary with small training sets
• A balance must be struck between large and small k
• As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

[Figures: kNN classification examples from Hastie, Tibshirani & Friedman (2001), pp. 418–419]
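Below is a minimal Python sketch of the prediction rule on the k-Nearest Neighbor slide above: Euclidean distance, the k closest training cases, and a majority vote (or the mean, for regression). The function and variable names are illustrative, and the attributes are assumed to be numeric and already on comparable scales.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Minimal k-nearest-neighbor prediction (classification by majority vote).

    X_train : (n_cases, n_attrs) array of numeric attributes
    y_train : length-n_cases array of class labels
    x_test  : length-n_attrs query case
    """
    # Euclidean distance from the test case to every training case
    dists = np.sqrt(((X_train - x_test) ** 2).sum(axis=1))
    # indices of the k closest training cases
    nearest = np.argsort(dists)[:k]
    # majority vote over the k neighbors (for regression, use the mean of y_train[nearest])
    return Counter(y_train[nearest]).most_common(1)[0][0]

# tiny usage example with made-up data
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9]])
y = np.array(["o", "o", "+", "+"])
print(knn_predict(X, y, np.array([4.8, 5.1]), k=3))   # -> "+"
```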
• Models usually perform better on training data than on future test cases
• 1-NN is 100% accurate on training data!
• Leave-one-out cross-validation: hold out each training case in turn and predict it from the remaining cases

• The tradeoff between small and large k can be difficult
  – use a large k, but put more emphasis on the nearer neighbors?

prediction_test = Σ_{i=1..k} w_i·class_i / Σ_{i=1..k} w_i   (or Σ_{i=1..k} w_i·value_i / Σ_{i=1..k} w_i)
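The two points above can be combined in a short sketch: leave-one-out cross-validation gives an honest error estimate for each candidate k (1-NN is perfect on the training data, so training error alone tells us nothing), and inverse-distance weights are one way, assumed here purely for illustration, to put more emphasis on nearer neighbors.

```python
import numpy as np

def loo_error(X, y, k, weighted=True):
    """Leave-one-out error of kNN: each case is predicted from all the others."""
    errors = 0
    for i in range(len(X)):
        dists = np.sqrt(((X - X[i]) ** 2).sum(axis=1))
        dists[i] = np.inf                      # exclude the held-out case itself
        nearest = np.argsort(dists)[:k]
        # inverse-distance weights emphasize nearer neighbors
        # (uniform weights recover plain kNN); the exact weighting scheme is a choice
        w = 1.0 / (dists[nearest] + 1e-9) if weighted else np.ones(k)
        votes = {}
        for label, wi in zip(y[nearest], w):
            votes[label] = votes.get(label, 0.0) + wi
        if max(votes, key=votes.get) != y[i]:
            errors += 1
    return errors / len(X)

# pick the k with the lowest leave-one-out error
# X, y = ...  (training attributes and class labels)
# best_k = min(range(1, 22, 2), key=lambda k: loo_error(X, y, k))
```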
Locally Weighted Averaging

• Let k = number of training points
• Let the weight fall off rapidly with distance

prediction_test = Σ_{i=1..k} w_i·class_i / Σ_{i=1..k} w_i   (or Σ_{i=1..k} w_i·value_i / Σ_{i=1..k} w_i)

w_k = 1 / e^( KernelWidth · Dist(c_k, c_test) )

• KernelWidth controls the size of the neighborhood that has a large effect on the value (analogous to k)

Locally Weighted Regression

• All algorithms so far are strict averagers: they interpolate, but can’t extrapolate
• Do a weighted regression, centered at the test point, with the weight controlled by distance and KernelWidth
• The local regressor can be linear, quadratic, an n-th degree polynomial, a neural net, …
• Yields a piecewise approximation to the surface that is typically more complex than the local regressor
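A sketch of both ideas, assuming the kernel from the Locally Weighted Averaging slide (w_k = 1/e^(KernelWidth·Dist)): with degree=0 it reduces to locally weighted averaging, and with degree=1 it fits a weighted linear regressor centered at the query point. The parameter and function names are illustrative.

```python
import numpy as np

def lw_predict(X, y, x_query, kernel_width=1.0, degree=1):
    """Locally weighted prediction at one query point.

    Weights follow the slide's kernel  w_k = 1 / e^(KernelWidth * Dist(c_k, c_test)).
    degree=0 gives locally weighted averaging; degree=1 fits a weighted linear
    regressor centered at the query (higher-degree local regressors work the same way).
    """
    dists = np.sqrt(((X - x_query) ** 2).sum(axis=1))
    w = np.exp(-kernel_width * dists)          # weight falls off rapidly with distance

    if degree == 0:
        # strict averager: interpolates but cannot extrapolate
        return np.sum(w * y) / np.sum(w)

    # design matrix for a local linear fit, centered at the query point
    Z = np.hstack([np.ones((len(X), 1)), X - x_query])
    W = np.diag(w)
    # weighted least squares: solve (Z' W Z) beta = Z' W y
    beta = np.linalg.solve(Z.T @ W @ Z, Z.T @ W @ y)
    return beta[0]                             # intercept = prediction at the query point

# usage on a made-up 1-D regression problem
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = np.sin(X).ravel()
print(lw_predict(X, y, np.array([2.5]), kernel_width=2.0, degree=1))
```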
Euclidean Distance

D(c1, c2) = sqrt( Σ_{i=1..N} (attr_i(c1) − attr_i(c2))² )

• gives all attributes equal weight?
  – only if the scales of the attributes and of their differences are similar
  – scale attributes to equal range or equal variance
• assumes spherical classes

[Figures: roughly spherical vs. elongated "+" and "o" classes plotted in attribute_1 × attribute_2 space]

• what if classes are not spherical?
• what if some attributes are more/less important than other attributes?
• what if some attributes have more/less noise in them than other attributes?
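One common way to give attributes comparable influence, as suggested above, is to scale each attribute to equal variance before computing distances; a small sketch with made-up data follows (scaling to an equal range such as [0, 1] is the other option mentioned on the slide).

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale every attribute to zero mean and unit variance so that no single
    attribute dominates the Euclidean distance just because of its scale."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0                 # guard against constant attributes
    return (X_train - mean) / std, (X_test - mean) / std

# example: income in dollars swamps age in years until both are rescaled
X_train = np.array([[25, 40_000.0], [32, 52_000.0], [47, 90_000.0]])
X_test = np.array([[29, 61_000.0]])
X_train_s, X_test_s = standardize(X_train, X_test)
```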
Weighted Euclidean Distance

D(c1, c2) = sqrt( Σ_{i=1..N} w_i · (attr_i(c1) − attr_i(c2))² )

Learning Attribute Weights

Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} ( |S_v| / |S| ) · Entropy(S_v)
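A small worked example of the Gain(S, A) formula above for a discrete attribute, plus the gain ratio used on the following slides (gain divided by the attribute's split information); the helper names and toy data are illustrative.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Entropy(S) = -sum_c p_c * log2(p_c) over the class proportions in S."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain(attr_values, labels):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v) for a discrete attribute A."""
    total = entropy(labels)
    n = len(labels)
    for v in set(attr_values):
        subset = [c for a, c in zip(attr_values, labels) if a == v]
        total -= len(subset) / n * entropy(subset)
    return total

def gain_ratio(attr_values, labels):
    """Gain ratio = information gain divided by the split information of the attribute."""
    return gain(attr_values, labels) / entropy(attr_values)

# toy example: an attribute that separates the classes fairly well
attr = ["low", "low", "high", "high", "high"]
cls  = ["o",   "o",   "+",    "+",    "o"]
print(gain(attr, cls), gain_ratio(attr, cls))
```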
Gain_Ratio Correction Factor

[Figure: gain-ratio correction factor (0.00–6.00) plotted against the number of splits (0–50)]

GainRatio Weighted Euclidean Distance

D(c1, c2) = sqrt( Σ_{i=1..N} gain_ratio_i · (attr_i(c1) − attr_i(c2))² )

• weight with gain_ratio after scaling?
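A sketch of the gain-ratio-weighted distance above, assuming the per-attribute gain ratios have already been computed (e.g., with a helper like the one sketched earlier) and the attributes have already been scaled.

```python
import numpy as np

def gain_ratio_distance(c1, c2, gain_ratio):
    """Weighted Euclidean distance where attribute i is weighted by its gain ratio.

    `gain_ratio` is assumed to be precomputed per attribute (information gain divided
    by split information, as in decision-tree attribute selection), so attributes that
    tell us more about the class get proportionally more influence on the distance.
    """
    diff = np.asarray(c1) - np.asarray(c2)
    return np.sqrt(np.sum(gain_ratio * diff ** 2))

# per-attribute gain ratios (illustrative values), applied after scaling the attributes
gain_ratio = np.array([0.72, 0.05, 0.31])
print(gain_ratio_distance([0.2, 1.5, -0.3], [1.0, 1.4, 0.6], gain_ratio))
```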
Advantages of Memory-Based Methods

• Lazy learning: don’t do any work until you know what you want to predict (and from what variables!)
  – never need to learn a global model
  – many simple local models taken together can represent a more complex global model
  – better focused learning
  – handles missing values, time-varying distributions, …
• Very efficient cross-validation
• Intelligible learning method to many users
• Nearest neighbors support explanation and training
• Can use any distance metric: string-edit distance, …

Weaknesses of Memory-Based Methods

• Curse of Dimensionality:
  – often works best with 25 or fewer dimensions
• Run-time cost scales with training set size
• Large training sets will not fit in memory
• Many MBL methods are strict averagers
• Sometimes doesn’t seem to perform as well as other methods such as neural nets
• Predicted values for regression are not continuous
References

• “Locally Weighted Learning” by Atkeson, Moore, and Schaal
• “Tuning Locally Weighted Learning” by Schaal, Atkeson, and Moore

Closing Thought

• In many supervised learning problems, all the information you ever have about the problem is in the training set.
• Why do most learning methods discard the training data after learning?
• Do neural nets, decision trees, and Bayes nets capture all the information in the training set when they are trained?
• In the future, we’ll see more methods that combine MBL with these other learning methods:
  – to improve accuracy
  – for better explanation
  – for increased flexibility