Class 13 Material
Sasha Rakhlin
Outline
Perceptron, Continued
Bias-Variance Tradeoff
Perceptron, Continued
Recall from last lecture: for any $T$ and any sequence $(x_1, y_1), \ldots, (x_T, y_T)$,
\[
\sum_{t=1}^{T} \mathbb{1}\{ y_t \langle w_t, x_t \rangle \le 0 \} \;\le\; \frac{D^2}{\gamma^2},
\]
where $\gamma$ is the margin with which the sequence is linearly separable and $D$ bounds the norms $\|x_t\|$.
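To make the bound concrete, here is a minimal one-pass Perceptron in Python. This is an illustrative sketch, not code from the course; the function name and the convention $w_1 = 0$ are my assumptions. The returned mistake count is exactly the left-hand side of the bound above.

```python
import numpy as np

def perceptron_one_pass(X, y):
    """One pass of the Perceptron over (X, y); returns iterates and mistakes.

    X : (T, d) array of examples; y : (T,) array of labels in {-1, +1}.
    On a sequence that is linearly separable with margin gamma and
    max ||x_t|| <= D, the mistake count is at most D**2 / gamma**2.
    """
    T, d = X.shape
    w = np.zeros(d)
    iterates = [w.copy()]                # w_1, w_2, ..., w_{T+1}
    mistakes = 0
    for t in range(T):
        if y[t] * np.dot(w, X[t]) <= 0:  # margin violation: update
            w = w + y[t] * X[t]
            mistakes += 1
        iterates.append(w.copy())
    return iterates, mistakes
```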
Consequence for i.i.d. data (I)

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be drawn i.i.d. from $P$, run the Perceptron for a single pass, and let $\tau$ be uniform on $\{1, \ldots, n\}$. Then the randomly selected iterate satisfies
\[
\mathbb{E}\, L_{01}(w_\tau) \;\le\; \frac{1}{n} \times \mathbb{E}\!\left[ \frac{D^2}{\gamma^2} \right].
\]
Proof: Take expectations on both sides of the mistake bound:
\[
\mathbb{E}\left[ \frac{1}{n} \sum_{t=1}^{n} \mathbb{1}\{ Y_t \langle w_t, X_t \rangle \le 0 \} \right] \;\le\; \mathbb{E}\left[ \frac{D^2}{n \gamma^2} \right].
\]
The left-hand side equals $\mathbb{E}_\tau \mathbb{E}_S\, \mathbb{1}\{ Y_\tau \langle w_\tau, X_\tau \rangle \le 0 \}$ for $\tau$ uniform on $\{1, \ldots, n\}$. The claim follows.
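One step is worth spelling out (my addition; the slide leaves it implicit): $w_t$ is a function of the first $t-1$ examples only, so $(X_t, Y_t)$ acts as a fresh test point for it, and
\[
\mathbb{E}_S\, \mathbb{1}\{ Y_t \langle w_t, X_t \rangle \le 0 \} \;=\; \mathbb{E}_S\, L_{01}(w_t).
\]
Averaging over $t = 1, \ldots, n$ is the same as drawing $\tau$ uniformly, which turns the bound into $\mathbb{E}_\tau \mathbb{E}_S\, L_{01}(w_\tau) \le \frac{1}{n} \mathbb{E}[D^2/\gamma^2]$.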
Consequence for i.i.d. data (II)

Let $(X_1, Y_1), \ldots, (X_n, Y_n)$ be drawn i.i.d. from $P$, and cycle the Perceptron through the data until it makes no further updates (the multi-pass variant; termination on separable data is guaranteed by the mistake bound). Then the final hyperplane of the Perceptron separates the data perfectly, i.e. finds the minimum $\widehat{L}_{01}(w_T) = 0$ for
\[
\widehat{L}_{01}(w) \;=\; \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{ Y_i \langle w, X_i \rangle \le 0 \}.
\]
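A sketch of the multi-pass variant (my illustration, reusing the update from the earlier snippet; `max_updates` is an added safety valve, not part of the lecture's algorithm):

```python
import numpy as np

def perceptron_multipass(X, y, max_updates=100_000):
    """Cycle through (X, y) until a full pass makes no mistakes.

    On linearly separable data the total number of updates is at most
    D**2 / gamma**2, so the outer loop terminates with zero empirical
    error; max_updates only guards against non-separable inputs.
    """
    n, d = X.shape
    w = np.zeros(d)
    updates = 0
    while updates <= max_updates:
        mistakes_this_pass = 0
        for i in range(n):
            if y[i] * np.dot(w, X[i]) <= 0:
                w = w + y[i] * X[i]
                mistakes_this_pass += 1
                updates += 1
        if mistakes_this_pass == 0:       # empirical 0-1 loss is now zero
            return w
    raise RuntimeError("no convergence; data may not be separable")
```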
Consequence for i.i.d. data (II)

For the final hyperplane $w_T$ of the multi-pass Perceptron trained on $n$ i.i.d. examples,
\[
\mathbb{E}\, L_{01}(w_T) \;\le\; \frac{1}{n+1} \times \mathbb{E}\!\left[ \frac{D^2}{\gamma^2} \right].
\]
Proof: Shortcuts: $z = (x, y)$ and $\ell(w, z) = \mathbb{1}\{ y \langle w, x \rangle \le 0 \}$. Then
\[
\mathbb{E}_S\, \mathbb{E}_Z\, \ell(w_T, Z) \;=\; \mathbb{E}_{S, Z_{n+1}}\left[ \frac{1}{n+1} \sum_{t=1}^{n+1} \ell(w^{(-t)}, Z_t) \right],
\]
where $w^{(-t)}$ denotes the output of the multi-pass Perceptron trained on all of $Z_1, \ldots, Z_{n+1}$ except $Z_t$.
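The displayed equality is the leave-one-out trick; here is my reconstruction of the standard justification, not verbatim from the slides. Since $Z_1, \ldots, Z_{n+1}$ are exchangeable, training on the first $n$ points and testing on $Z_{n+1}$ has the same expected loss as training on all but $Z_t$ and testing on $Z_t$:
\[
\mathbb{E}_S\, \mathbb{E}_Z\, \ell(w_T, Z) \;=\; \mathbb{E}\, \ell(w^{(-(n+1))}, Z_{n+1}) \;=\; \mathbb{E}\, \ell(w^{(-t)}, Z_t) \qquad \text{for every } t \le n+1.
\]
It then remains to check that the number of leave-one-out mistakes $\sum_{t=1}^{n+1} \ell(w^{(-t)}, Z_t)$ is at most $D^2/\gamma^2$: if the run on all $n+1$ points never updates on $Z_t$, removing $Z_t$ changes nothing and $Z_t$ is classified correctly, so every leave-one-out mistake is charged to one of the at most $D^2/\gamma^2$ updates.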
A few remarks
SGD vs Multi-Pass

The Perceptron update can be seen as a (sub)gradient descent step with respect to the loss
\[
\frac{1}{n} \sum_{t=1}^{n} \max\{ -Y_t \langle w, X_t \rangle,\, 0 \}.
\]
A single pass over fresh examples is stochastic gradient descent, while cycling repeatedly over a fixed sample is gradient descent on the average above: the first optimizes expected loss, the second optimizes empirical loss. (A code sketch follows.)
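The equivalence in one function (my illustration; the names are assumptions): a Perceptron update is a subgradient step of size one on the term $\max\{-y \langle w, x \rangle, 0\}$.

```python
import numpy as np

def subgradient_step(w, x, y, lr=1.0):
    """One SGD step on the loss max{-y*<w, x>, 0}.

    A subgradient at w is -y*x when y*<w, x> <= 0 (choosing the
    nonzero element at the kink) and 0 otherwise, so with lr=1.0
    the step w - lr*(-y*x) is exactly the Perceptron update.
    """
    if y * np.dot(w, x) <= 0:
        return w + lr * y * x   # w - lr * (-y * x)
    return w                    # zero subgradient: no change
```

Feeding this a stream of fresh draws implements SGD on the expected loss; sweeping it repeatedly over a fixed sample implements incremental gradient descent on the empirical loss.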
Overparametrization and overfitting
Rates vs sample complexity

A rate of convergence of order $n^{-\alpha}$ for the expected error can equivalently be read as a sample complexity, i.e. the number of examples needed to reach a target error.
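The conversion is a one-line calculation (my worked version of what the slide title suggests): set the rate equal to a target error $\varepsilon$ and solve for $n$,
\[
n^{-\alpha} = \varepsilon \quad \Longleftrightarrow \quad n(\varepsilon) = \varepsilon^{-1/\alpha},
\]
so a faster rate means fewer samples: a rate of $n^{-1/2}$ corresponds to sample complexity $\varepsilon^{-2}$, while the Perceptron's $O(1/n)$ rate corresponds to $\varepsilon^{-1}$.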
Bias-Variance Tradeoff
Relaxing the assumptions

We proved
\[
\mathbb{E}\, L_{01}(w_T) - L_{01}(f^*) \;\le\; O(1/n)
\]
under the assumption that $P$ is linearly separable with margin $\gamma$.

We may relax the margin assumption, yet still hope to do well with linear separators. That is, writing $\widehat{w}$ for the hyperplane returned by the learning algorithm, we would aim to minimize the estimation error
\[
\mathbb{E}\, L_{01}(\widehat{w}) - L_{01}(w^*),
\]
hoping that the approximation error
\[
L_{01}(w^*) - L_{01}(f^*)
\]
is small, where $w^*$ is the best hyperplane classifier with respect to $L_{01}$.
More generally, we will work with some class of functions $\mathcal{F}$ that we hope captures the relationship between $X$ and $Y$ well. The choice of $\mathcal{F}$ gives rise to a bias-variance decomposition, written out below.
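The decomposition in its standard form (my addition; the notation matches the figure on the next slide): with $f_{\mathcal{F}} \in \arg\min_{f \in \mathcal{F}} L(f)$ the best in-class function and $f^*$ the overall-best predictor,
\[
\underbrace{\mathbb{E}\, L(\widehat{f}_n) - L(f^*)}_{\text{excess risk}}
\;=\;
\underbrace{\mathbb{E}\, L(\widehat{f}_n) - L(f_{\mathcal{F}})}_{\text{estimation error (variance)}}
\;+\;
\underbrace{L(f_{\mathcal{F}}) - L(f^*)}_{\text{approximation error (bias)}}.
\]
Enlarging $\mathcal{F}$ shrinks the approximation error but typically inflates the estimation error, hence the tradeoff.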
Bias-Variance Tradeoff

[Figure: the excess risk of $\widehat{f}_n$ splits into the estimation error, from $\widehat{f}_n$ to the best in-class $f_{\mathcal{F}}$, and the approximation error, from $f_{\mathcal{F}}$ to the overall-best $f^*$.]
Bias-Variance Tradeoff

We will focus only on the estimation error, yet the ideas we develop will make it possible to read about model selection on your own.

Once again, if we guessed correctly and $f^* \in \mathcal{F}$, then the approximation error is zero and only the estimation error remains.
Bias-Variance Tradeoff

In simple problems (e.g. linearly separable data) we might get away without a bias-variance decomposition, as was done in the Perceptron case, where we compared directly to $f^*$.