Informally, the capacity of a classification model is related to how complicated it can be. For example,
consider the thresholding of a high-degree polynomial: if the polynomial evaluates above zero, that point is
classified as positive, otherwise as negative. A high-degree polynomial can be wiggly, so it can fit a given
set of training points well. But one can expect that the classifier will make errors on other points, because it
is too wiggly. Such a polynomial has a high capacity. A much simpler alternative is to threshold a linear
function. This function may not fit the training set well, because it has a low capacity. This notion of
capacity is made rigorous below.
Definitions
VC dimension of a set-family
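Let $H$ be a set family (a set of sets) and $C$ a set. Their intersection is defined as the following set family:
$$H \cap C := \{h \cap C \mid h \in H\}.$$
A set $C$ is said to be shattered by $H$ if $H \cap C$ contains all the subsets of $C$, i.e. $|H \cap C| = 2^{|C|}$.
The VC dimension $D$ of $H$ is the largest cardinality of a set that is shattered by $H$. If arbitrarily large sets
can be shattered, the VC dimension is $\infty$.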
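The set-family definition can be checked by brute force on small finite examples. The sketch below is illustrative only: the helper names `is_shattered` and `vc_dimension` are hypothetical, and the example family (all integer intervals inside {1, ..., 5}, plus the empty set) is an assumption chosen for the demonstration.

```python
from itertools import combinations


def is_shattered(family, c):
    """True if the family's traces on c include every subset of c."""
    c = frozenset(c)
    traces = {frozenset(h) & c for h in family}
    return len(traces) == 2 ** len(c)


def vc_dimension(family, ground_set):
    """Largest size of a subset of ground_set shattered by family (brute force)."""
    ground = sorted(ground_set)
    best = 0
    for k in range(1, len(ground) + 1):
        if any(is_shattered(family, c) for c in combinations(ground, k)):
            best = k
        else:
            # Every subset of a shattered set is shattered, so if no set of
            # size k is shattered, no larger set can be shattered either.
            break
    return best


# Illustrative family: the empty set plus all integer intervals {a, ..., b}.
ground = range(1, 6)
intervals = [set()] + [set(range(a, b + 1)) for a in ground for b in ground if a <= b]

print(vc_dimension(intervals, ground))  # 2
```

No three points a < b < c can be shattered here, because any interval containing a and c also contains b, so the subset {a, c} is never produced.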
VC dimension of a classification model
A binary classification model $f$ with some parameter vector $\theta$ is said to shatter a set of generally positioned
data points $(x_1, x_2, \ldots, x_n)$ if, for every assignment of labels to those points, there exists a $\theta$ such that the
model $f$ makes no errors when evaluating that set of data points.
The VC dimension of a model $f$ is the maximum number of points that can be arranged so that $f$ shatters
them. More formally, it is the maximum cardinal $D$ such that there exists a generally positioned data point
set of cardinality $D$ that can be shattered by $f$.
Examples
1. $f$ is a constant classifier (with no parameters); its VC dimension is 0 since it cannot shatter even a single
point. In general, the VC dimension of a finite classification model, which can return at most $2^d$ different
classifiers, is at most $d$ (this is an upper bound on the VC dimension; the Sauer–Shelah lemma gives a
lower bound on the dimension).
2. $f$ is a single-parametric threshold classifier on real numbers; i.e., for a certain threshold $\theta$, the classifier
$f_\theta$ returns 1 if the input number is larger than $\theta$ and 0 otherwise. The VC dimension of $f$ is 1 because: (a) It
can shatter a single point. For every point $x$, a classifier $f_\theta$ labels it as 0 if $\theta > x$ and labels it as 1 if $\theta < x$.
(b) It cannot shatter any set of two points. For every set of two numbers, if the smaller is labeled 1,
then the larger must also be labeled 1, so not all labelings are possible.
3. $f$ is a single-parametric interval classifier on real numbers; i.e., for a certain parameter $\theta$, the classifier
$f_\theta$ returns 1 if the input number is in the interval $[\theta, \theta + 4]$ and 0 otherwise. The VC dimension of $f$ is 2
because: (a) It can shatter some sets of two points. E.g., for every set $\{x, x + 2\}$, a classifier $f_\theta$ labels it as
(0,0) if $\theta < x - 4$ or if $\theta > x + 2$, as (1,0) if $\theta \in [x - 4, x - 2)$, as (1,1) if $\theta \in [x - 2, x]$, and as (0,1)
if $\theta \in (x, x + 2]$. (b) It cannot shatter any set of three points. For every set of three numbers, if the smallest
and the largest are labeled 1, then the middle one must also be labeled 1, so not all labelings are possible.
4. $f$ is a straight line as a classification model on points in a two-dimensional plane (this is the model used
by a perceptron). The line should separate positive data points from negative data points. There exist sets of
3 points that can indeed be shattered using this model (any 3 points that are not collinear can be shattered).
However, no set of 4 points can be shattered: by Radon's theorem, any four points can be partitioned into
two subsets with intersecting convex hulls, so it is not possible to separate one of these two subsets from the
other. Thus, the VC dimension of this particular classifier is 3. It is important to remember that while one
can choose any arrangement of points, the arrangement of those points cannot change when attempting to
shatter for some label assignment. Note that only 3 of the $2^3 = 8$ possible label assignments are shown for the
three points.
5. $f$ is a single-parametric sine classifier, i.e., for a certain parameter $\theta$, the classifier $f_\theta$ returns 1 if the input
number $x$ has $\sin(\theta x) > 0$ and 0 otherwise. The VC dimension of $f$ is infinite, since it can shatter any
finite subset of the set $\{2^{-m} \mid m \in \mathbb{N}\}$.[2]: 57
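The one-parameter examples above can be explored numerically. The following is a minimal sketch under stated assumptions: the interval classifier is taken to use the interval [θ, θ + 4], the sine classifier is tested on the points 2^-1, ..., 2^-m, and the real parameter space is replaced by a finite grid of candidate values. The function names, and the explicit parameter formula in `sine_theta`, are illustrative choices rather than part of the article. A grid search can confirm that a set is shattered, but a failure on the grid does not by itself prove that shattering is impossible; for that, the arguments in the list are needed.

```python
import math
from itertools import product


def shatters(classify, thetas, points):
    """True if every 0/1 labeling of points is realized by some theta in thetas."""
    realized = {tuple(classify(theta, x) for x in points) for theta in thetas}
    return len(realized) == 2 ** len(points)


def threshold(theta, x):
    return 1 if x > theta else 0


def interval(theta, x):
    return 1 if theta <= x <= theta + 4 else 0


def sine(theta, x):
    return 1 if math.sin(theta * x) > 0 else 0


# Finite grid standing in for the real line (an assumption of this sketch).
grid = [k / 4 for k in range(-40, 41)]

print(shatters(threshold, grid, [0.0]))            # True
print(shatters(threshold, grid, [0.0, 1.0]))       # False: (1, 0) never occurs
print(shatters(interval, grid, [0.0, 2.0]))        # True
print(shatters(interval, grid, [0.0, 2.0, 3.0]))   # False: (1, 0, 1) never occurs


def sine_theta(labels):
    """A parameter that realizes `labels` on the points 2^-1, ..., 2^-m.

    Each point labeled 0 contributes 2^i, which shifts the angle at that
    point by pi and leaves the sign of the sine at coarser points unchanged.
    """
    return math.pi * (1 + sum((1 - y) * 2 ** i for i, y in enumerate(labels, start=1)))


m = 6
points = [2.0 ** -i for i in range(1, m + 1)]
all_realized = all(
    tuple(sine(sine_theta(labels), x) for x in points) == labels
    for labels in product((0, 1), repeat=m)
)
print(all_realized)  # True: all 2^6 labelings are realized by a single parameter
```

The sine check uses a small m so that floating-point error in `math.sin` stays well below the margin of the construction.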
Uses
The VC dimension can predict a probabilistic upper bound on the test error of a classification model.
Vapnik[3] proved that, on data drawn i.i.d. from the same distribution as the training set, the test error
(i.e., risk with 0-1 loss function) stays below an upper bound with the following probability:
$$\Pr\left(\text{test error} \leq \text{training error} + \sqrt{\frac{1}{N}\left[D\left(\log\left(\frac{2N}{D}\right) + 1\right) - \log\left(\frac{\eta}{4}\right)\right]}\,\right) = 1 - \eta,$$
where $D$ is the VC dimension of the classification model, $0 < \eta \leq 1$, and $N$ is the size of the training set
(restriction: this formula is valid when $D \ll N$. When $D$ is larger, the test error may be much higher than
the training error. This is due to overfitting).
The VC dimension also appears in sample-complexity bounds. A space of binary functions with VC
dimension $D$ can be learned (in the agnostic PAC setting) with:[4]: 73
$$N = \Theta\left(\frac{D + \ln\frac{1}{\delta}}{\varepsilon^2}\right)$$
samples, where $\varepsilon$ is the learning error and $\delta$ is the failure probability. Thus, the sample complexity is a
linear function of the VC dimension of the hypothesis space.
In computational geometry
The VC dimension is one of the critical parameters in the size of ε-nets, which determines the complexity
of approximation algorithms based on them; range sets without finite VC dimension may not have finite ε-
nets at all.
Bounds
1. The VC dimension of the dual set-family of a set-family $\mathcal{F}$ is strictly less than $2^{\operatorname{vc}(\mathcal{F})+1}$, and this is best possible.
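Suppose we have a base class $B$ of simple classifiers, whose VC dimension is $D$. We can construct a more
powerful classifier by combining several different classifiers from $B$; this
technique is called boosting. Formally, given $T$ classifiers $h_1, \ldots, h_T \in B$ and a weight vector $w \in \mathbb{R}^T$,
we can define the following classifier:
$$f(x) = \operatorname{sign}\left(\sum_{t=1}^{T} w_t \cdot h_t(x)\right)$$
The VC dimension of the set of all such classifiers (for all selections of $T$ classifiers from $B$ and a weight
vector from $\mathbb{R}^T$), assuming $T, D \geq 3$, is at most:[4]: 108–109
$$T \cdot (D + 1) \cdot \left(3 \log(T \cdot (D + 1)) + 2\right)$$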
The VC dimension of a neural network, described by a directed acyclic graph $G(V, E)$ with node set $V$ and
edge set $E$, is bounded as follows:
If the activation function is the sign function and the weights are general, then the VC
dimension is at most $O(|E| \cdot \log(|E|))$.
If the activation function is the sigmoid function and the weights are general, then the VC
dimension is at least $\Omega(|E|^2)$ and at most $O(|E|^2 \cdot |V|^2)$.
If the weights come from a finite family (e.g. the weights are real numbers that can be
represented by at most 32 bits in a computer), then, for both activation functions, the VC
dimension is at most $O(|E|)$.
Generalizations
The VC dimension is defined for spaces of binary functions (functions to {0,1}). Several generalizations
have been suggested for spaces of non-binary functions.
For multi-class functions (e.g., functions to {0,...,n-1}), the Natarajan dimension[6] can be
used. Ben-David et al.[7] present a generalization of this concept.
For real-valued functions (e.g., functions to a real interval, [0,1]), Pollard's
pseudo-dimension[8][9][10] can be used.
The Rademacher complexity provides bounds similar to those of the VC dimension, and can sometimes
provide more insight than VC dimension calculations into statistical methods such as those
using kernels.
The Memory Capacity (sometimes Memory Equivalent Capacity) gives a lower bound on
capacity, rather than an upper bound (see, for example, Artificial neural network § Capacity),
and therefore indicates the point of potential overfitting.
See also
Growth function
Sauer–Shelah lemma, a bound on the number of sets in a set system in terms of the VC
dimension.
Karpinski–Macintyre theorem,[11] a bound on the VC dimension of general Pfaffian formulas.
Footnotes
1. Vapnik, V. N.; Chervonenkis, A. Ya. (1971). "On the Uniform Convergence of Relative
Frequencies of Events to Their Probabilities". Theory of Probability & Its Applications. 16 (2):
264. doi:10.1137/1116025 (https://doi.org/10.1137%2F1116025). This is an English
translation, by B. Seckler, of the Russian paper: "On the Uniform Convergence of Relative
Frequencies of Events to Their Probabilities". Dokl. Akad. Nauk. 181 (4): 781. 1968. The
translation was reproduced as: Vapnik, V. N.; Chervonenkis, A. Ya. (2015). "On the Uniform
Convergence of Relative Frequencies of Events to Their Probabilities". Measures of
Complexity. p. 11. doi:10.1007/978-3-319-21852-6_3
(https://doi.org/10.1007%2F978-3-319-21852-6_3). ISBN 978-3-319-21851-9.
2. Mohri, Mehryar; Rostamizadeh, Afshin; Talwalkar, Ameet (2012). Foundations of Machine
Learning. US, Massachusetts: MIT Press. ISBN 9780262018258.
3. Vapnik 2000.
4. Shalev-Shwartz, Shai; Ben-David, Shai (2014). Understanding Machine Learning – from
Theory to Algorithms. Cambridge University Press. ISBN 9781107057135.
5. Alon, N.; Haussler, D.; Welzl, E. (1987). "Partitioning and geometric embedding of range
spaces of finite Vapnik-Chervonenkis dimension". Proceedings of the third annual
symposium on Computational geometry – SCG '87. p. 331. doi:10.1145/41958.41994
(https://doi.org/10.1145%2F41958.41994). ISBN 978-0897912310. S2CID 7394360
(https://api.semanticscholar.org/CorpusID:7394360).
6. Natarajan 1989.
7. Ben-David, Cesa-Bianchi & Long 1992.
8. Pollard 1984.
9. Anthony & Bartlett 2009.
10. Morgenstern & Roughgarden 2015.
11. Karpinski & Macintyre 1997.
References
Moore, Andrew. "VC dimension tutorial" (https://autonlab.org/assets/tutorials/vcdim08.pdf)
(PDF).
Vapnik, Vladimir (2000). The nature of statistical learning theory. Springer.
Blumer, A.; Ehrenfeucht, A.; Haussler, D.; Warmuth, M. K. (1989). "Learnability and the
Vapnik–Chervonenkis dimension"
(http://l2r.cs.uiuc.edu/~danr/Teaching/CS446-16/Papers/p929-blumer.pdf) (PDF).
Journal of the ACM. 36 (4): 929–965. doi:10.1145/76359.76371
(https://doi.org/10.1145%2F76359.76371). S2CID 1138467
(https://api.semanticscholar.org/CorpusID:1138467).
Burges, Christopher. "Tutorial on SVMs for Pattern Recognition"
(https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/svmtutorial.pdf) (PDF).
Microsoft. (containing information also for VC dimension)
Chazelle, Bernard. "The Discrepancy Method"
(http://www.cs.princeton.edu/~chazelle/book.html).
Natarajan, B.K. (1989). "On Learning sets and functions"
(https://doi.org/10.1007%2FBF00114804). Machine Learning. 4: 67–97.
doi:10.1007/BF00114804 (https://doi.org/10.1007%2FBF00114804).
Ben-David, Shai; Cesa-Bianchi, Nicolò; Long, Philip M. (1992). "Characterizations of
learnability for classes of {0, …, n}-valued functions". Proceedings of the fifth annual
workshop on Computational learning theory – COLT '92. p. 333.
doi:10.1145/130385.130423 (https://doi.org/10.1145%2F130385.130423).
ISBN 089791497X.
Pollard, D. (1984). Convergence of Stochastic Processes. Springer. ISBN 9781461252542.
Anthony, Martin; Bartlett, Peter L. (2009). Neural Network Learning: Theoretical
Foundations. ISBN 9780521118620.
Morgenstern, Jamie H.; Roughgarden, Tim (2015). On the Pseudo-Dimension of Nearly
Optimal Auctions
(http://papers.nips.cc/paper/5766-on-the-pseudo-dimension-of-nearly-optimal-auctions).
NIPS. arXiv:1506.03684 (https://arxiv.org/abs/1506.03684).
Bibcode:2015arXiv150603684M
(https://ui.adsabs.harvard.edu/abs/2015arXiv150603684M).
Karpinski, Marek; Macintyre, Angus (February 1997). "Polynomial Bounds for VC Dimension
of Sigmoidal and General Pfaffian Neural Networks"
(https://ora.ox.ac.uk/objects/uuid:a14465ce-11d9-4f89-aeec-fcf0bea603ed).
Journal of Computer and System Sciences. 54 (1):
169–176. doi:10.1006/jcss.1997.1477 (https://doi.org/10.1006%2Fjcss.1997.1477).