By induction it follows that $\langle w_t, w^* \rangle + b_t b^* \geq t\gamma$. On the other hand, we made an update because $y_t (\langle x_t, w_{t-1} \rangle + b_{t-1}) < 0$. By using $y_t y_t = 1$,
\[
\|w_t\|^2 + b_t^2 = \|w_{t-1}\|^2 + b_{t-1}^2 + y_t^2 \|x_t\|^2 + 1 + 2 y_t (\langle w_{t-1}, x_t \rangle + b_{t-1})
\leq \|w_{t-1}\|^2 + b_{t-1}^2 + \|x_t\|^2 + 1.
\]
Since $\|x_t\|^2 \leq R^2$, we can again apply induction to conclude that $\|w_t\|^2 + b_t^2 \leq t(R^2 + 1)$. Combining the upper and the lower bounds, using the Cauchy-Schwarz inequality, and $\|w^*\| = 1$ yields
\[
t\gamma \leq \langle w_t, w^* \rangle + b_t b^*
= \left\langle \begin{bmatrix} w_t \\ b_t \end{bmatrix}, \begin{bmatrix} w^* \\ b^* \end{bmatrix} \right\rangle
\leq \left\| \begin{bmatrix} w_t \\ b_t \end{bmatrix} \right\| \, \left\| \begin{bmatrix} w^* \\ b^* \end{bmatrix} \right\|
= \sqrt{\|w_t\|^2 + b_t^2} \, \sqrt{1 + (b^*)^2}
\leq \sqrt{t(R^2 + 1)} \, \sqrt{1 + (b^*)^2}.
\]
Squaring both sides of the inequality and rearranging the terms yields an
upper bound on the number of updates and hence the number of mistakes.
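Carrying out that last step explicitly (this is pure algebra on the display above, with no additional assumptions): squaring gives $t^2 \gamma^2 \leq t (R^2 + 1)(1 + (b^*)^2)$, and dividing both sides by $t \gamma^2$ leaves
\[
t \leq \frac{(R^2 + 1)\,(1 + (b^*)^2)}{\gamma^2},
\]
so the number of updates, and hence the number of mistakes, is bounded by a quantity depending only on the radius $R$, the margin $\gamma$, and the offset $b^*$, and not on the number of training points.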
1.3.5 K-Means
All the algorithms we discussed so far are supervised, that is, they assume
that labeled training data is available. In many applications this is too much
to hope for; labeling may be expensive, error-prone, or sometimes impossible. For instance, it is very easy to crawl and collect every page within the www.purdue.edu domain, but rather time-consuming to assign a topic to
each page based on its contents. In such cases, one has to resort to unsuper-
vised learning. A prototypical unsupervised learning algorithm is K-means,
which is a clustering algorithm. Given $X = \{x_1, \ldots, x_m\}$, the goal of K-means is to partition it into $k$ clusters such that each point is more similar to points from its own cluster than to points from any other cluster.
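To make the procedure concrete, here is a minimal sketch of the classic alternating (Lloyd-style) iteration for K-means in Python. The function name, the NumPy dependency, and the random choice of initial centroids are illustrative assumptions, not something prescribed by the text:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # X: (m, d) array of m points in d dimensions; k: number of clusters.
    rng = np.random.default_rng(seed)
    # Initialize centroids with k distinct data points chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: attach every point to its nearest centroid
        # under squared Euclidean distance.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its points;
        # keep the old centroid if a cluster happens to be empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # means stopped moving, so assignments are stable
        centroids = new_centroids
    return labels, centroids

Each iteration can only decrease the sum of squared distances from points to their assigned centroids, so the loop settles at a local optimum; the quality of that optimum depends on the initialization.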