Nonparametric Regression
• Fit more flexible regression functions $f(X)$
• Local regression at each query point $x_0$
→ Nearest-neighbor methods
→ Kernel methods
Local average
• Only one predictor variable
• $K$-nearest-neighbor average at $x_0$: average of the $K$ closest points to $x_0$ (sketched in code below)
→ Simple and flexible estimator
→ Discontinuous (bumpy) fit
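A minimal NumPy sketch of the $K$-nearest-neighbor average; the data and function names here are illustrative, not from the slides:

```python
import numpy as np

def knn_average(x_train, y_train, x0, k=20):
    """K-nearest-neighbor average: mean of the y-values of the
    k training points whose x is closest to the query point x0."""
    dist = np.abs(x_train - x0)          # distances to x0 (single predictor)
    nearest = np.argsort(dist)[:k]       # indices of the k closest points
    return y_train[nearest].mean()       # local average

# toy data: noisy sine curve on [0, 1]
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)

print(knn_average(x, y, x0=0.5, k=20))
```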
Example: 20-nearest neighbor average
[Figure: 20-nearest-neighbor average fit; $y$ plotted against $x \in [0, 1]$]
Kernel regression
• Resolve discontinuity
• Use local weighted fits
• Weight function $K_\lambda(x_0, x)$
• Weight decreases smoothly with distance from the target point ⇒ smooth fit
• $\hat f_\lambda(x_0) = \dfrac{\sum_{i=1}^{n} K_\lambda(x_0, x_i)\, y_i}{\sum_{i=1}^{n} K_\lambda(x_0, x_i)}$
→ Nadaraya-Watson kernel-weighted average
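A sketch of the Nadaraya-Watson average in NumPy, assuming a Gaussian kernel and an illustrative bandwidth $\lambda = 0.2$:

```python
import numpy as np

def nadaraya_watson(x_train, y_train, x0, lam=0.2):
    """Kernel-weighted average at x0: sum(K * y) / sum(K)."""
    t = (x_train - x0) / lam
    weights = np.exp(-0.5 * t**2)        # Gaussian kernel weights
    return np.sum(weights * y_train) / np.sum(weights)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)

grid = np.linspace(0, 1, 5)
print([round(nadaraya_watson(x, y, x0, lam=0.2), 3) for x0 in grid])
```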
Weight function
$$K_\lambda(x_0, x) = D\!\left(\frac{|x - x_0|}{\lambda}\right)$$
• Epanechnikov: $D(t) = \frac{3}{4}(1 - t^2)\, I(|t| \le 1)$
• Tri-cube: $D(t) = (1 - |t|^3)^3\, I(|t| \le 1)$
• Gaussian: $D(t) = \phi(t)$
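The three kernels sketched as NumPy functions; boolean masks implement the indicator $I(|t| \le 1)$, and the function names are illustrative:

```python
import numpy as np

def epanechnikov(t):
    t = np.asarray(t, dtype=float)
    return 0.75 * (1 - t**2) * (np.abs(t) <= 1)       # (3/4)(1 - t^2) on [-1, 1]

def tricube(t):
    t = np.asarray(t, dtype=float)
    return (1 - np.abs(t)**3)**3 * (np.abs(t) <= 1)

def gaussian(t):
    t = np.asarray(t, dtype=float)
    return np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)   # standard normal density phi(t)

print(epanechnikov([0.0, 0.5, 2.0]))   # [0.75, 0.5625, 0.0]
print(tricube(0.5), gaussian(0.0))
```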
Kernel functions
[Figure: $D(t)$ for the Epanechnikov, tri-cube, and Gaussian kernels, $t \in [-3, 3]$]
Weight function
• Epanechnikov and tri-cube: compact support
• Gaussian: noncompact support
• Tri-cube is flatter on top than Epanechnikov
→ More efficient results but more bias
Kernel-weighted average
• Continuous fit
• Uses fixed-width neighborhoods
• λ in the kernel function controls the window size
→ Bias-variance trade-off
→ λ ↗⇒ bias↗, variance↘
Example: Gaussian kernel, $\lambda = 0.2$
[Figure: Nadaraya-Watson fit with a Gaussian kernel, $\lambda = 0.2$; $y$ plotted against $x \in [0, 1]$]
NN and kernels
• Continuous fit and adaptive neighborhoods
→ Kernels with variable window width
E.g. $K_\lambda(x_0, x) = D\!\left(\dfrac{|x - x_0|}{|x_{(k)} - x_0|}\right)$
→ $\lambda(x_0) = |x_{(k)} - x_0|$: distance to the $k$th nearest neighbor of $x_0$
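A sketch of such an adaptive window width in NumPy: $\lambda(x_0)$ is set to the distance to the $k$th nearest neighbor; the Epanechnikov kernel here is an illustrative choice:

```python
import numpy as np

def adaptive_kernel_weights(x_train, x0, k=20):
    """Kernel weights with window width lambda(x0) equal to the
    distance from x0 to its k-th nearest neighbor."""
    dist = np.abs(x_train - x0)
    lam = np.sort(dist)[k - 1]               # adaptive bandwidth lambda(x0)
    t = dist / lam
    return 0.75 * (1 - t**2) * (t <= 1)      # Epanechnikov D(t)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
w = adaptive_kernel_weights(x, x0=0.5, k=20)
print(int((w > 0).sum()))                    # roughly k points receive positive weight
```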
Boundary problems
Local averages can have problems at the boundary
• Asymmetric neighborhoods
• NN: wider neighborhood ⇒ bias↗
• Kernel: fewer points ⇒ variance↗
→ Use higher-order local regression
Local linear regression
• Use local linear fits (lines)
• Reduces bias substantially
• Solve at each target x0
$$\min_{\beta_0, \beta_1} \sum_{i=1}^{n} K_\lambda(x_0, x_i)\,\bigl(y_i - \beta_0 - \beta_1 x_i\bigr)^2$$
→ $\hat f(x_0) = \hat\beta_0(x_0) + \hat\beta_1(x_0)\, x_0$
→ Different linear model at each target x0
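A sketch of one local linear fit at $x_0$ by weighted least squares, assuming a Gaussian kernel (all names are illustrative):

```python
import numpy as np

def local_linear(x_train, y_train, x0, lam=0.2):
    """Fit y ~ beta0 + beta1 * x by weighted least squares with
    kernel weights K_lambda(x0, x_i), then evaluate the line at x0."""
    t = (x_train - x0) / lam
    w = np.exp(-0.5 * t**2)                                  # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x_train), x_train])    # design matrix [1, x]
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)   # (X^t W X)^{-1} X^t W y
    return beta[0] + beta[1] * x0                            # beta0(x0) + beta1(x0) x0

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
print(round(local_linear(x, y, x0=0.05, lam=0.2), 3))        # near-boundary point
```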
Local linear regression
• $W(x_0) = \mathrm{diag}\bigl(K_\lambda(x_0, x_i)\bigr)$
→ $\hat f(x_0) = \tilde x_0^{\,t}\,\bigl(X^t W(x_0) X\bigr)^{-1} X^t W(x_0)\, y = l(x_0)^t\, y$
• $S_\lambda^{\text{kernel}} = \bigl(l(x_1), \ldots, l(x_n)\bigr)^t$
→ $\hat{\mathbf f} = S_\lambda^{\text{kernel}}\, \mathbf y$
→ A linear operator!
Effective degrees of freedom
• $\hat{\mathbf f} = S_\lambda^{\text{kernel}}\, \mathbf y$
→ Effective degrees of freedom: $\mathrm{trace}\bigl(S_\lambda^{\text{kernel}}\bigr)$
→ Useful for selecting the tuning parameter $\lambda$
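A sketch that builds the smoother matrix row by row from the local linear fits above and takes its trace as the effective degrees of freedom; the Gaussian kernel and the bandwidth are illustrative choices:

```python
import numpy as np

def smoother_matrix(x_train, lam=0.2):
    """Rows l(x_i)^t of the local-linear smoother, so that f_hat = S @ y."""
    n = len(x_train)
    X = np.column_stack([np.ones(n), x_train])
    S = np.zeros((n, n))
    for i, x0 in enumerate(x_train):
        w = np.exp(-0.5 * ((x_train - x0) / lam) ** 2)   # kernel weights at x0
        W = np.diag(w)
        # l(x0)^t = x0_tilde^t (X^t W X)^{-1} X^t W, with x0_tilde = (1, x0)
        S[i] = np.array([1.0, x0]) @ np.linalg.solve(X.T @ W @ X, X.T @ W)
    return S

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
S = smoother_matrix(x, lam=0.2)
print(round(np.trace(S), 2))        # effective degrees of freedom trace(S)
```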
Example: Gaussian kernel, linear fit
[Figure: local linear fit with a Gaussian kernel; $y$ plotted against $x \in [0, 1]$]
Local polynomial regression
• Fit a local polynomial of degree $M$:
$$\min_{\beta_0, \beta_1, \ldots, \beta_M} \sum_{i=1}^{n} K_\lambda(x_0, x_i)\left(y_i - \beta_0 - \sum_{m=1}^{M} \beta_m x_i^m\right)^2$$
→ $\hat f(x_0) = \hat\beta_0(x_0) + \sum_{m=1}^{M} \hat\beta_m(x_0)\, x_0^m$
• Further (though smaller) reduction of bias, mainly in high-curvature regions
• Increased variance
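The local linear sketch generalizes directly to degree $M$; a minimal version using a Vandermonde design matrix (Gaussian kernel assumed, names illustrative):

```python
import numpy as np

def local_poly(x_train, y_train, x0, M=2, lam=0.2):
    """Weighted least-squares fit of a degree-M polynomial around x0,
    evaluated at x0."""
    w = np.exp(-0.5 * ((x_train - x0) / lam) ** 2)          # kernel weights
    X = np.vander(x_train, N=M + 1, increasing=True)        # columns 1, x, ..., x^M
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y_train)
    x0_row = np.vander([x0], N=M + 1, increasing=True)      # row (1, x0, ..., x0^M)
    return float((x0_row @ beta)[0])                        # f_hat(x0)

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
print(round(local_poly(x, y, x0=0.5, M=2, lam=0.2), 3))
```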
Example: Gaussian kernel, quadratic fit
[Figure: local quadratic fit with a Gaussian kernel; $y$ plotted against $x \in [0, 1]$]
More than 1 predictor
• d-dimensional kernel functions
• Typically radial functions
$$K_\lambda(x_0, x) = D\!\left(\frac{\|x - x_0\|}{\lambda}\right)$$
→ Standardize predictors
• More boundary problems
→ Use linear fits!
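A sketch of radial kernel weights in $d$ dimensions, standardizing each predictor first; the Gaussian kernel and all names are illustrative:

```python
import numpy as np

def radial_weights(X_train, x0, lam=1.0):
    """Radial kernel weights D(||x - x0|| / lambda) after standardizing
    each predictor (column) of X_train."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z = (X_train - mu) / sd                    # standardized predictors
    z0 = (x0 - mu) / sd
    t = np.linalg.norm(Z - z0, axis=1) / lam   # Euclidean distance / lambda
    return np.exp(-0.5 * t**2)                 # Gaussian D(t)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # d = 3 predictors
w = radial_weights(X, x0=X[0], lam=1.0)
print(w[:5].round(3))
```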
More than 1 predictor
More general kernel
$$K_\lambda(x_0, x) = D\!\left(\frac{(x - x_0)^t A^{-1} (x - x_0)}{\lambda}\right)$$
• A: positive semidefinite matrix
• Weighs individual components
• Can account for correlations between features
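A sketch of this structured kernel with a Mahalanobis-type distance; taking $A$ as the sample covariance of the predictors is an illustrative choice, not prescribed by the slides:

```python
import numpy as np

def structured_kernel_weights(X_train, x0, lam=1.0, A=None):
    """Weights D((x - x0)^t A^{-1} (x - x0) / lambda).
    A weighs the coordinates and captures correlations between features."""
    if A is None:
        A = np.cov(X_train, rowvar=False)      # illustrative choice of A
    A_inv = np.linalg.inv(A)
    diff = X_train - x0
    t = np.einsum('ij,jk,ik->i', diff, A_inv, diff) / lam   # quadratic form per row
    return np.exp(-0.5 * t)                    # Gaussian-type decay in the distance

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w = structured_kernel_weights(X, x0=X[0], lam=1.0)
print(w[:5].round(3))
```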