# Assignment 3

Jared Friedman
April 6, 2005
1.a. i. No such perceptron exists. Consider numbering the pixels one
through nine, first by row and then by column. Number each weight according to its corresponding pixel. Since 7/9 > 3/4, and 6/9 > 3/4, the
weights of any perceptron which performs this task must satisfy the following two relations: −w0 + w1 + w2 + w3 + w4 + w5 + w6 − w7 − w8 − w9 < 0,
−w0 + w1 + w2 + w3 + w4 + w5 + w6 + w7 − w8 − w9 > 0, which imply that
w7 > 0. However, they must also satisfy −w0 − w1 − w2 − w3 − w4 − w5 −
w6 + w7 + w8 + w9 < 0 −w0 − w1 − w2 − w3 − w4 − w5 − w6 − w7 + w8 + w9 > 0,
which imply that w7 < 0. And thus we have a contradiction and no such
perceptron exists.
ii. Here is such a perceptron. Let w1 = 2, w2 = 2, w3 = 2, w4 = −1, w5 =
−1, w6 = −1, w7 = −1, w8 = −1, w9 = −1, and let the threshold be 0.5.
iii. No such perceptron exists. Assume that there is such a perceptron.
I will derive a contradiction by considering the 2 x 2 upper left corner. If
such a perceptron exists, its weights must satisfy +w1 + w2 + w4 − w5 −
w0 − w3 − w6 − w7 − w8 − w9 > 0 and −w1 + w2 + w4 − w5 − w0 − w3 −
w6 − w7 − w8 − w9 < 0, which together imply that w1 > 0. similarly, we
have that −w1 − w2 − w4 + w5 − w0 − w3 − w6 − w7 − w8 − w9 > 0 and
+w1 − w2 − w4 + w5 − w0 − w3 − w6 − w7 − w8 − w9 < 0, which together
implied that w1 < 0.
b. i. 196+1 = 197
ii. I assume that the number of output nodes is 10, as with all networks for
digit recognition. Then, following the theorem in the notes, an upper bound
on the VC dimension is given by 2(197)(108) log2 (e108) = 339134.
iii. The decision boundary of a decision stump with n inputs is a hyperplane in n dimensions. However, this hyperplane is subject to the additional

restriction that it must be perpendicular to one of the basis vectors. Thus,
an upper bound of its VC dimension comes from the VC dimension of perceptrons with the same number of inputs, which is n + 1, or 197.
v. consider decision trees with one continuous input. Consider the set S
consisting of the first N positive integers. For any dichotomy on S, here is an
algorithm for constructing a decision tree which represents that dichotomy.
each split will be of the form X > a − 0.5, where a is an integer. For any
given split X > a − 0.5, the left branch will be a positive classification if a
is a positive instance of the dichotomy, and a negative classification if a is
a negative instance of the dichotomy; the right branch will lead to the split
X > a + 1 − 0.5. Clearly, we can use this method to construct a decision
tree to shatter a set of size N, for any N. Therefore, the VC dimension of
decision trees must be infinite. This is not a valid reason to reject decision
trees because it shows only that the upper bound we derived for the sample complexity in terms of the VC dimension is infinite; the lower bound
we derived is still finite. Also, even if the sample complexity is not polynomial, decision trees could be a weak learner that performs well in practice.
c. Multi-layer feed-forward neural networks are clearly the most appropriate algorithm for the task. Decision trees with continuous attributes are
a poor choice because the inductive bias of decision trees is to look for hypotheses where the classification depends primarily on the values of only
a few attributes. Except possibly for the pixels on the border, all the inputs to the decision tree are roughly equally important in determining the
classification. Similarly, the domain of digit recognition shows translation
independence – if all the pixel values are shifted over by one, the classification should not change – and this will require a very large decision tree to
represent. On the other hand, the fact that the VC dimension of decision
trees with continuous attributes is infinite suggests that this will be a highly
expressive hypothesis space. Decision trees may well be able to represent
a correct hypothesis, however they will have to be very large to do so, and
will likely be untrainable.
Boosted decision stumps are a better choice. For one, as we can see in
part b, the VC dimension of boosted decision stumps is very similar to the
VC dimension of neural networks with one hidden layer. However, the inductive bias is still wrong. There is no reason to think that any particular pixel
will have particularly good discriminatory value in this domain, as would be
required for decision stumps to be a suitable choice. Also, classification into
2

digits will likely depend upon a fairly complex interaction between many
pixels. Boosted decision stumps, being a kind of a weighted sum, will have
trouble modeling this interaction.
Perceptrons are a reasonable choice with respect to their inductive bias,
and will probably perform reasonably well, especially considering the short
amount of time to train them. However, they are simply not expressive
enough to achieve top performance. As we can see from part b, the VC
dimension of perceptrons is much less than the VC dimension of a neural
net with a sizable hidden layer. Multi-layer feed-forward neural networks
combine the inductive bias of perceptrons with the high dimensionality of
the other two algorithms, and are therefore the best choice.
3.a. The error function F is addressing the problem that the weights in
a neural network generally increase over a large amount of training, and can
become unreasonably large. This can lead to overfitting. To correct for this,
the error function F penalizes large weights by calculating a weighted sum
of the training set error and the square of the Euclidean distance between
the origin and the weight vector.
∂F
b. To derive a weight update rule, all we need to do is to calculate ∂w
.
ji
Since F is a sum of two terms, we can take the partial derivatives of those
P
∂F
terms separately, and we find that ∂w
= − d∈D adj δid + 2λwji . Therefore,
ji

the update rule is wji ← wji + α(

P

d∈D

3