Modified: April 20, 2022
Occam generalization bound
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector. Here's a formal argument for Occam's razor, adapted from Theorem 2.2 of Kearns and Vazirani's "An Introduction to Computational Learning Theory" and Theorem 1 from David McAllester's PAC-Bayes tutorial (https://arxiv.org/abs/1307.2118). It operates in a somewhat idealized setting that allows for a very simple derivation.
Result
Consider a classification task with 0/1 loss, where we have $n$ training examples sampled i.i.d. from some population, and a finite-size hypothesis space $\mathcal{H}$. Suppose we find a hypothesis $h \in \mathcal{H}$ that fits the training set perfectly. (This analysis can be generalized to other losses, where perfect fitting is not possible, using concentration inequalities; see the McAllester tutorial for details.) We will show that the expected test error of $h$ at new points can, with high probability, be bounded in terms of $\log_2 |\mathcal{H}|$ (the bits required to specify a hypothesis).
Let $\mathrm{err}(h)$ denote the population-level mistake probability (expected test error) of a hypothesis $h$, and call a hypothesis 'bad' if $\mathrm{err}(h) > \epsilon$. We want to bound the probability that our candidate hypothesis $h$ is 'bad'.
Note that a 'bad' hypothesis might still happen to label a given input correctly, but it can do so with probability at most $1 - \epsilon$. It follows that the probability that any given 'bad' hypothesis will fit $n$ independently selected training examples is at most $(1 - \epsilon)^n$.
Let $\mathcal{H}_{\mathrm{bad}} \subseteq \mathcal{H}$ be the set of 'bad' hypotheses: by a union bound, the probability that some 'bad' hypothesis is consistent with the data is at most $|\mathcal{H}_{\mathrm{bad}}|\,(1-\epsilon)^n$, which is itself at most $|\mathcal{H}|\,(1-\epsilon)^n$. If we require this to be at most $\delta$:

$$|\mathcal{H}|\,(1-\epsilon)^n \le \delta,$$
then taking the log of both sides and rearranging, it follows that

$$n \ge \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$$

is sufficient to guarantee with probability at least $1 - \delta$ that a hypothesis consistent with the training data will have test loss of at most $\epsilon$. (The result that $\ln(1-\epsilon) \approx -\epsilon$, for small $\epsilon$, follows from a first-order Taylor expansion of the natural logarithm.)
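Spelling out the algebra (my own fill-in of the intermediate steps):

$$
\begin{aligned}
|\mathcal{H}|\,(1-\epsilon)^n \le \delta
\;&\iff\; \ln|\mathcal{H}| + n\ln(1-\epsilon) \le \ln\delta \\
\;&\iff\; n \ge \frac{\ln|\mathcal{H}| + \ln\frac{1}{\delta}}{-\ln(1-\epsilon)},
\end{aligned}
$$

and since $-\ln(1-\epsilon) \ge \epsilon$ (with equality to first order in $\epsilon$), the condition $n \ge \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$ is sufficient.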
Discussion
This result shows that the number of training examples $n$ we need to bound the test error at some desired level $\epsilon$ increases
- linearly with the number of bits in the hypothesis description ($\log_2 |\mathcal{H}|$)
- linearly with the number of 9's of reliability we need (the bound fails with probability at most $\delta$, but each order of magnitude decrease in $\delta$ costs just a constant increase in $n$), so we can guarantee arbitrarily high reliability at small cost.
- inversely with the desired test error rate: sharpening the bound from $\epsilon$ to $\epsilon/2$ requires doubling the number of training examples.
Note that this is just a guarantee on generalization error conditioned on already having found a hypothesis that fits the training data perfectly. It does not guarantee that any given hypothesis class contains such a hypothesis, or that whatever training method we're using will find it if it does. So in practice we can't "just" double the number of training examples to guarantee a lower test error rate, since we may no longer be able to find an $h \in \mathcal{H}$ that perfectly fits this enlarged training set --- but if we do, then the bound applies.
Intuition
The empirical (training) loss of any hypothesis is a random variable. By the law of large numbers, it converges to the population (test) loss as $n \to \infty$. Failures of generalization occur when a hypothesis 'gets lucky' on a given dataset by obtaining an empirical loss significantly better than its population loss. The fewer hypotheses we consider, the fewer chances there are for this to happen, so limiting ourselves to smaller hypothesis classes leads to more reliable generalization. Importantly, these hypotheses don't have to be 'simple' in any recognizable sense of the term. The size of the class is all that matters for this bound: a class containing ten randomly-selected crazy hypotheses is just as good as one containing the ten 'simplest' possible hypotheses by some subjective criterion. The question of what gets to count as 'simple' is analogous to the choice of a prior, a coding scheme or compression algorithm, or the choice of programming environment that determines the constant term in Kolmogorov complexity.
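A quick simulation makes the 'more hypotheses, more chances to get lucky' point concrete. This is a toy sketch of my own (not from either reference): every hypothesis is 'bad' with a true error rate of 10%, hypotheses are treated as independent purely for simplicity (the union bound itself needs no independence), and we estimate how often at least one of them fits a training set of 50 points.

```python
import numpy as np

rng = np.random.default_rng(0)

def p_some_bad_hypothesis_fits(num_hypotheses, n, eps=0.1, trials=10_000):
    """Estimate the probability that at least one 'bad' hypothesis (true error
    rate eps) happens to classify all n i.i.d. training points correctly.
    Each hypothesis's number of correct points is Binomial(n, 1 - eps)."""
    correct_counts = rng.binomial(n, 1 - eps, size=(trials, num_hypotheses))
    some_hypothesis_fits = (correct_counts == n).any(axis=1)
    return some_hypothesis_fits.mean()

for H in [10, 100, 1000]:
    estimate = p_some_bad_hypothesis_fits(H, n=50)
    union_bound = min(1.0, H * (1 - 0.1) ** 50)
    print(f"|H| = {H:>4}: P(some bad h fits all 50 points) ~ {estimate:.3f}"
          f"   (union bound: {union_bound:.3f})")
```

The chance of being fooled grows roughly linearly with the number of bad hypotheses until it saturates, which is what the union bound predicts.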
What does this tell us in practice? Say we want to guarantee with 99% probability ($\delta = 0.01$) that a hypothesis consistent with the training set will have test loss of at most $\epsilon = 0.01$. Under this bound, we'd need

$$n \ge \frac{\ln 2}{\epsilon}\left(\log_2|\mathcal{H}| + \log_2\frac{1}{\delta}\right) \approx 69\left(\log_2|\mathcal{H}| + \log_2\frac{1}{\delta}\right)$$
where $\log_2|\mathcal{H}|$ is the number of bits to represent the hypothesis class. (Here we've divided by $\log_2 e$ to convert the logarithms to base 2.) This implies a need for about 69 extra training points every time we double the size of the hypothesis space. For example, if our hypothesis space is all functions that can be expressed as Python programs of less than 1000 characters (8000 bits), then we'd expect to need on the order of 458 + 69 * 8000 = about 552458 training examples.
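As a sanity check on that arithmetic, here's a throwaway calculation of the bound above (the small difference from the figure in the text comes from rounding $\ln 2 / \epsilon \approx 69.3$ down to 69):

```python
import math

def occam_sample_size(bits, eps, delta):
    """Sufficient n from the Occam bound: n >= (1/eps) * (ln|H| + ln(1/delta)),
    with the hypothesis space sized |H| = 2**bits."""
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / eps)

print(math.log(2) / 0.01)                             # ~69.3 extra examples per bit at eps = 0.01
print(occam_sample_size(8000, eps=0.01, delta=0.01))  # 554979, the "about 552458" above without the rounding
```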
Note that this bound has no dependence on the size of the input space. Of course, if there are fewer than 552458 possible inputs, the bound is not very interesting. But importantly, it works even if the input space is enormously large; say, the set of all 20-megapixel images represented with 32-bits-per-pixel.
Qualitatively, what this analysis shows is that if your hypothesis class is too big, then it becomes more likely that some hypothesis will 'get lucky' and fit the training data even though it performs poorly on test data. Choosing such a hypothesis is called overfitting. The way to prevent this is to constrain your hypothesis class, either explicitly or through regularization.
Failed analyses
I wanted to understand some ways of thinking about generalization that don't work in this setting. Feel free to skip this section: these are approaches that some part of me wanted to consider, so I found it valuable to analyze why they don't work. You might benefit from your own first-principles exploration of how one might prove generalization results --- your misguided intuitions will probably be different from mine.
We don't need to pick the correct hypothesis. We might be tempted to consider the probability that we've picked out exactly the correct function $h^*$, but this is too strong a condition. In general our hypothesis space probably doesn't contain the "correct" hypothesis at all! (hypotheses are models of whatever real-world phenomena generate our data, and all models are wrong). Nonetheless the bound still holds.
As an aside, even if we did assume $h^* \in \mathcal{H}$, we cannot usefully guarantee that a hypothesis that fits all training samples is in fact the 'correct' hypothesis $h^*$. This would be a much stronger condition: guaranteeing that test error will be exactly zero, versus just bounded by some $\epsilon > 0$.
Assume for simplicity that the data distribution is the uniform distribution over the input space $\mathcal{X}$, so $p(x) = 1/|\mathcal{X}|$ for all inputs $x$. Since any incorrect hypothesis $h \ne h^*$ must disagree with $h^*$ on at least one input point, it follows that $h$ makes a mistake with probability at least $1/|\mathcal{X}|$. So the probability that a wrong hypothesis classifies all of our $n$ training samples correctly is at most $(1 - 1/|\mathcal{X}|)^n$. Using Bayes' rule, we have

$$P(h \ne h^* \mid h \text{ fits the data}) = \frac{P(h \text{ fits the data} \mid h \ne h^*)\,P(h \ne h^*)}{P(h \text{ fits the data})}.$$
Under a uniform prior on hypotheses we have $P(h \ne h^*) = 1 - 1/|\mathcal{H}|$ and $P(h \text{ fits the data}) \ge P(h \text{ fits the data} \mid h = h^*)\,P(h = h^*) = 1/|\mathcal{H}|$. Substituting this in and multiplying through, we have

$$P(h \ne h^* \mid h \text{ fits the data}) \le \frac{(1 - 1/|\mathcal{X}|)^n\,(1 - 1/|\mathcal{H}|)}{1/|\mathcal{H}|} = (|\mathcal{H}| - 1)\left(1 - \frac{1}{|\mathcal{X}|}\right)^n.$$
Now suppose we want to bound the probability of choosing the wrong hypothesis (the left hand side) by some small value $\delta$. Then we have $(|\mathcal{H}| - 1)\left(1 - 1/|\mathcal{X}|\right)^n \le \delta$, or equivalently (by the same logarithm approximation as before) $n \ge |\mathcal{X}|\left(\ln(|\mathcal{H}| - 1) + \ln\frac{1}{\delta}\right)$ for large $|\mathcal{X}|$. We can already tell that this won't yield a useful bound, because to make this small we'd need $\left(1 - 1/|\mathcal{X}|\right)^n$ small, which would require $n \gtrsim |\mathcal{X}|$, i.e., we'd have to see enough points so that we're likely to see every possible point during training.
The problem with the above analysis is that it tries to prove too much. Since other functions could differ from the true function $h^*$ on only a single input, we have to see every input to be sure of getting the true function. This is true even if $\mathcal{H}$ is very small; for example, even if it contains just two possible hypotheses, those hypotheses might agree on 99.9999% of points, making them very hard to distinguish. The insight of the original formulation is that we don't need to distinguish such hypotheses, as long as we just care about test error.
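To put a number on 'very hard to distinguish', here's a back-of-the-envelope calculation using the 99.9999% agreement rate from the example above:

```python
# Probability that n uniform samples never land on the disagreement region
# when two hypotheses agree on 99.9999% of inputs.
p_agree = 1 - 1e-6
for n in [10_000, 1_000_000, 10_000_000]:
    print(f"n = {n:>10,}: P(hypotheses still indistinguishable) = {p_agree ** n:.3f}")
```

Even a million samples leaves better than a one-in-three chance that the two hypotheses look identical on the training set.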
Not a Bayesian analysis of hypotheses. We might identify two sources of uncertainty in the standard i.i.d. learning model:
- We don't know which hypothesis generated the population.
- We don't know which population members were selected to form the training data.
The probabilities in the analyses above account only for the latter of these: uncertainty over the training data. The Occam bound shows that with high probability we will sample data such that all hypotheses end up within some small fudge margin of their expected loss. This requires no assumptions about the distribution of losses incurred by individual hypotheses, but at least in principle we should be able to get tighter bounds by making such assumptions.
Let $\hat{L}(h)$ denote the empirical/training loss of a hypothesis $h$, and $L(h)$ the population/test loss. Let $p(h)$ be a prior distribution over hypotheses (perhaps the uniform distribution). Suppose we see $n$ training examples and select a hypothesis $h$ that works on all of them (so $\hat{L}(h) = 0$). Conditioned on this criterion there is a unique posterior distribution over the test loss $L(h)$, and in particular, a posterior probability that $L(h) > \epsilon$ for any error rate $\epsilon$. Indeed, by Bayes' rule, this probability is

$$P\big(L(h) > \epsilon \mid \hat{L}(h) = 0\big) = \frac{P\big(\hat{L}(h) = 0 \mid L(h) > \epsilon\big)\,P\big(L(h) > \epsilon\big)}{P\big(\hat{L}(h) = 0\big)}.$$
Insofar as Bayesian reasoning gives precise answers given all conditioning information, this is in some sense the 'perfect' generalization bound. Unfortunately, it's not particularly useful, because it requires access to the prior distribution on population losses implied by $p(h)$: in particular, the prior probability that a hypothesis has population loss greater (or less) than $\epsilon$, and the likelihood that such a hypothesis has training loss of zero. (This likelihood depends on the full distribution of population losses; just knowing that the loss is greater or less than $\epsilon$ is not sufficient to compute it. To see this, observe that a hypothesis space in which the population loss is concentrated at a point just barely above $\epsilon$ is different from one in which there's a large tail of losses well above $\epsilon$.) But in general we don't know the population loss of any of our hypotheses, let alone all of them! (If we did know the population loss of every hypothesis, the learning problem would be trivial, since we'd just choose the hypothesis with the lowest loss.)
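For concreteness, here's what that 'perfect' Bayesian computation looks like in a toy setting where we pretend to know every hypothesis's population loss. The losses and the uniform prior below are made up purely for illustration; the point is just that the exact posterior is easy once you assume this (unavailable) knowledge.

```python
import numpy as np

# Made-up population losses L(h) for a small hypothesis class, with a uniform prior.
population_losses = np.array([0.01, 0.02, 0.05, 0.3, 0.3, 0.5, 0.5, 0.5])
prior = np.full(len(population_losses), 1 / len(population_losses))

def posterior_prob_bad(eps, n):
    """P(L(h) > eps | h fit all n training points). Under 0/1 loss the
    likelihood of a perfect fit on n i.i.d. points is (1 - L(h))**n."""
    likelihood = (1 - population_losses) ** n
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return posterior[population_losses > eps].sum()

for n in [0, 10, 100]:
    print(f"n = {n:>3}: P(L(h) > 0.1 | perfect fit on n points) = {posterior_prob_bad(0.1, n):.4f}")
```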