calibration: Nonlinear Function
Created: September 26, 2020
Modified: March 26, 2024

calibration

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A nice paper that gets at some subtleties of calibration:

  • Daniel D. Johnson, Daniel Tarlow, David Duvenaud, Chris J. Maddison. Experts Don't Cheat: Learning What You Don't Know By Predicting Pairs (2024). https://arxiv.org/abs/2402.08733

A core intuition is that any calibrated model $\hat{p}$ can be seen as the projection of the true distribution $p$ ignoring some conditioning information.

To formalize this, we consider a 'grouping function' $\Phi(x)$, which induces equivalence classes of input conditions. A model $\hat{p}_{Y|X}$ is 'first-order calibrated' (or just 'calibrated') if for any condition $x$ it gives us the average true probability (real-world $p_{Y|X}$) over all conditions in the equivalence class $[x]_\Phi$:

$$\hat{p}_{Y|X}(y \mid x) = \mathbb{E}_X\left[\, p_{Y|X}(y \mid X) \mid X \in [x]_\Phi \,\right] = p\left(Y = y \mid \Phi(X) = \Phi(x)\right)$$

The usual grouping function is $\Phi(x) = \hat{p}_{Y|X}(\cdot \mid x)$, so the equivalence classes are the $x$'s that all induce the same conditional distribution. It can be shown (and may become clear through our discussion) that this is the coarsest possible grouping function: if there is any $\Phi$ such that a model is calibrated, then it will necessarily be calibrated for this one.

This definition says that any calibrated model can be seen as the projection of the true distribution $p$ ignoring some of the conditioning information. This is kind of tautologically true under the standard grouping function (the model is the result of forgetting everything about the conditioning input that is not relevant to the model's probabilities, i.e. "the model cares about whatever it cares about"), but potentially more interesting for other $\Phi$'s.

I think it helps to work through an example. Suppose $X$ is the bias of a coin (and $Y$ is the outcome of a flip), and that our model $\hat{p}$ actually just ignores the bias entirely and always predicts $\hat{p} = 0.5$. Then the standard $\Phi(x) = \hat{p}(\cdot \mid x)$ just lumps all coins into one big equivalence class. As long as coin biases are distributed symmetrically around $0.5$, then $\hat{p} = 0.5$ is calibrated under that grouping. On the other hand, calibration would no longer hold for the finer-grained grouping function $\Phi(x) = x$.
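
A tiny numerical sketch of this, just to make the conditional expectation concrete. The setup and names (`biases`, `group_averages`, etc.) are my own toy construction, not code from the paper:

```python
# A minimal numerical sketch of the coin example (my own toy construction, not from the paper).
import numpy as np

biases = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # possible values of X (coin biases)
p_x = np.full(len(biases), 1 / len(biases))      # symmetric distribution over biases
p_true = biases                                   # true p(Y=1 | X=x) is just the bias
p_hat = np.full(len(biases), 0.5)                 # a model that ignores the bias entirely

def group_averages(phi_values):
    """E[ p(Y=1|X) | X in [x]_Phi ] for each x, where phi_values[i] = Phi(x_i)."""
    out = np.empty_like(p_true)
    for g in np.unique(phi_values):
        mask = phi_values == g
        out[mask] = np.average(p_true[mask], weights=p_x[mask])
    return out

# Coarse grouping Phi(x) = p_hat(.|x): every coin lands in the same equivalence class.
print(np.allclose(p_hat, group_averages(p_hat)))    # True  -> calibrated under this grouping
# Fine grouping Phi(x) = x: each bias is its own equivalence class.
print(np.allclose(p_hat, group_averages(biases)))   # False -> not calibrated once Phi sees the bias
```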

From one perspective we could think of varying $\Phi$ as a kind of 'test' for what information the model actually makes use of. In the coin case, our model $\hat{p}$ is calibrated wrt any grouping function that lumps all coins into a single equivalence class, but not wrt grouping functions that look at the bias, so (if we didn't already know this, e.g. if $\hat{p}$ were a black-box learned model) we can conclude that $\hat{p}$ is either not looking at the bias or at least not fully and correctly incorporating the given information. In general, it seems like the finest $\Phi$ for which a model remains calibrated might be seen as specifying what conditioning information the model is able to effectively use.

It's interesting that this expectation reveals a dependence on the distribution of $X$, even though we are only modeling the conditionals $Y \mid X$. The true conditional probability $p(y \mid x)$ does not depend on any aspect of the conditionals for other inputs $x'$. But the model $\hat{p}$ is effectively 'blurring its eyes' according to $\Phi$, and so the distribution of what might lie beyond that veil actually matters. This is epistemic uncertainty: now that we don't know precisely what situation we are in, a calibrated estimate has to reckon with what situations we might be in and how likely they are.

We can formally measure epistemic uncertainty by looking at the variation in 'true' probabilities (aleatoric uncertainty) among the candidate situations we might be in:

$$\hat{\mathbf{\Sigma}}(x) = \text{Cov}\left[\, p_{Y|X}(\cdot \mid X),\; p_{Y|X}(\cdot \mid X) \mid X \in [x]_\Phi \,\right]$$

Note that this is just the natural second-order extension of the definition of calibration above. A pair of estimators $(\hat{p}, \hat{\Sigma})$ for which these two definitions hold (with respect to some grouping function $\Phi$) is second-order calibrated.
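
Continuing the toy coin sketch: for binary $Y$ this covariance matrix collapses to a scalar, the variance of $p(Y=1 \mid X)$ within the equivalence class. Again, this is my own illustration of the definition rather than code from the paper (the setup is repeated so the snippet runs on its own):

```python
# Same toy coin setup as in the sketch above (my own construction, not from the paper).
import numpy as np

biases = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # possible values of X
p_x = np.full(len(biases), 1 / len(biases))      # symmetric distribution over biases
p_true = biases                                   # true p(Y=1 | X)
p_hat = np.full(len(biases), 0.5)                 # model that ignores the bias

def group_variance(phi_values):
    """Var[ p(Y=1|X) | X in [x]_Phi ] for each x: the epistemic spread within the class."""
    out = np.empty_like(p_true)
    for g in np.unique(phi_values):
        mask = phi_values == g
        mean = np.average(p_true[mask], weights=p_x[mask])
        out[mask] = np.average((p_true[mask] - mean) ** 2, weights=p_x[mask])
    return out

print(group_variance(p_hat))   # 0.08 for every x: the model blurs together very different coins
print(group_variance(biases))  # 0.0 for every x: grouping by x itself leaves no epistemic uncertainty
```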

The next cool result from the paper is that there is a bijection between second-order calibrated predictors and (first-order) calibrated predictors $\hat{p}(y_1, y_2 \mid x)$ of independent pairs of outcomes. More precisely: the pairs are independent under the true conditional given $x$, but may not be independent given what the model knows ($\Phi(x)$), and so the pairwise model effectively serves to capture this gap. Specifically, given a paired model $\hat{p}$, the 'pair covariance'

$$\hat{\Sigma}_{Y_1, Y_2 \mid X}(x)_{y_i, y_j} = \hat{p}(y_i, y_j \mid x) - \hat{p}(y_i \mid x)\,\hat{p}(y_j \mid x),$$

which measures the difference between the predicted joint and the product of the predicted marginals (i.e. what the joint would be if the components were independent), is (in combination with the model's marginal $\hat{p}_{Y_1}$) second-order calibrated. And conversely, for any second-order calibrated predictor there is a unique paired predictor $\hat{p}(y_1, y_2 \mid x)$ such that this construction recovers it.
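
Why this works, as I understand it: if the paired model is calibrated wrt $\Phi$ and the two outcomes really are independent given $X$, then $\hat{p}(y_1, y_2 \mid x) = \mathbb{E}\left[\, p(y_1 \mid X)\, p(y_2 \mid X) \mid X \in [x]_\Phi \,\right]$ while $\hat{p}(y_i \mid x) = \mathbb{E}\left[\, p(y_i \mid X) \mid X \in [x]_\Phi \,\right]$, so subtracting the product of marginals leaves exactly $\text{Cov}\left[\, p(y_1 \mid X), p(y_2 \mid X) \mid X \in [x]_\Phi \,\right]$, i.e. $\hat{\Sigma}(x)$. A quick numerical check on the same toy coin setup (again my own sketch, not the paper's code):

```python
# Same toy coin setup (my own construction). With the coarse grouping, every coin is in one
# equivalence class, so the calibrated paired predictor just averages over all biases.
# Two flips of the same coin are independent given the bias X, but not given only what
# the constant-0.5 model knows.
import numpy as np

biases = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
p_x = np.full(len(biases), 1 / len(biases))      # symmetric distribution over biases
p_true = biases                                   # true p(Y=1 | X)

# Calibrated paired prediction for y1 = y2 = 1: E[ p(1|X) * p(1|X) | class ].
pair_joint = np.average(p_true * p_true, weights=p_x)
# Calibrated marginal: E[ p(1|X) | class ] = 0.5.
marginal = np.average(p_true, weights=p_x)

pair_cov = pair_joint - marginal * marginal       # the (1,1) entry of the pair covariance
print(pair_cov)   # 0.33 - 0.25 ≈ 0.08, matching the within-class variance from the previous sketch
```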


Q: Suppose a model is second-order calibrated. Is that better than a first-order calibrated model? A: I don't think 'second-order calibration' actually changes anything about the model $\hat{p}$ itself. It just means we have another artifact $\hat{\Sigma}$ that measures uncertainty in $\hat{p}$.


something else that's weird: an LLM models $p(y \mid x, W)$ where $x$ is the prompt and $W$ is the actual world (which it learns about through its training data). if we've learned a calibrated model of language, then $\hat{p}(y \mid x) = p(y \mid \Phi(x))$, i.e. the outputs are equivalent to hiding some information from the prompt. but that's not usually what happens! if the model hallucinates, it's not that it's missing information from the prompt, it's missing information about the world. when I ask "who's the prime minister of Mexico?" and the model makes up an answer, it's because the training data never taught it a confident answer (either because the training dataset $D = \Phi(W)$ doesn't contain the answer or because the training process failed to encode it in the weights $\theta = \Phi(D)$).

so for any actually-existing LLM I don't think it makes sense to cast the problem as epistemic uncertainty about the prompt. it's actually epistemic uncertainty about the world. does it still make sense to use this formalism then?

we assume the training data are sampled iid from $p(y \mid x, W)$ with knowledge of the real world. we 'lose information' partly through having a finite sample (missing information about the real world) and partly through failing to learn the true $p(y \mid x, D)$, which I guess will look, from the perspective of $p(y \mid x)$, like a calibration failure (once we've snuck our knowledge about $D$ into the conditional probabilities $p$). let's focus on the first case, where the answer is not in the training set.

the mechanism of 'cheating' to diagnose uncertainty works regardless of why we don't know the answer. put differently, seeing a cheat $y$ tells you information about the world, not just about the prompt.

the whole point of the calibrated model is that it's trained without whatever features it would need to resolve the uncertainty. so it doesn't matter whether those features are 'part of $x$' or not. the whole point is that they're not part of $x$! (even if they would be part of some hypothetical $x^*$).


tl;dr any calibrated model $\hat{p}(y \mid x)$ can be viewed as the image of the true process $p$ under some operation $\Phi$ that coarsens information about $x$: $\hat{p}(y \mid x) = p(y \mid \Phi(x))$. calibrated models vary by how much context they are missing.

a calibrated model is the (conditional) expectation of the true $p^*$ over an equivalence class of contexts, and we can measure the leftover spread of $p^*$ within that class as epistemic uncertainty (second-order calibration).

this 'missing context' ($\Phi(x) \neq x$) is epistemic uncertainty!


e.g. the true context for an LLM might be $x = (\text{prompt}, \text{world})$ but $\Phi(x) = (\text{prompt})$ omits all the world knowledge. this missing information is exactly epistemic uncertainty! a calibrated model is effectively averaging over the equivalence class of possible contexts with the same projected features $\Phi(x)$.


something else that's weird in my model here:

there is a well-defined quantity, an expectation or variance over the true process wrt some equivalence class of $x$'s. this seems hard to compute explicitly.

but we don't say that a calibrated model estimates this quantity. I don't know what that would even mean, given the model is not itself random. we require that the model actually computes this quantity. there's some leeway in what specific quantity it computes because we haven't fully specified $\Phi$.