compression: Nonlinear Function
Created: September 06, 2021
Modified: September 06, 2021


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

(below was originally an email to SuccessfulFriend, copying here for posterity)

learning as compression seems like kind of a folk idea with no single source? the intuition goes all the way back to Occam's razor (all else equal, prefer the simpler explanation) and is a theme that runs through a lot of different lines of work. But I also wanted to structure my understanding better, so I did a bit of looking up specific refs:

- the foundational result is Shannon's source coding theorem (https://en.wikipedia.org/wiki/Shannon%27s_source_coding_theorem), which is what lets you view probabilistic models as compressors: given a model p(X), the most efficient code for transmitting a specific value x has length -log p(x) (in bits, if the log is base 2), and there are simple constructions (e.g., Huffman or arithmetic coding) that achieve this efficiency in the limit. the implication is that all work on probabilistic modeling, Bayesian inference, finding models that give high probability to the data, etc. can be viewed as compression. (toy sketch below)
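
a minimal sketch of the -log p(x) correspondence (the toy distribution and numbers are mine, not from any of the refs):

    import math

    # toy distribution over four symbols
    p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

    # Shannon code length for each symbol: -log2 p(x) bits
    code_lengths = {x: -math.log2(px) for x, px in p.items()}
    print(code_lengths)  # {'a': 1.0, 'b': 2.0, 'c': 3.0, 'd': 3.0}

    # the expected code length equals the entropy of the distribution,
    # which Shannon's theorem says is the best achievable average rate
    entropy = sum(-px * math.log2(px) for px in p.values())
    avg_len = sum(p[x] * code_lengths[x] for x in p)
    print(entropy, avg_len)  # 1.75 1.75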

- in statistics/learning theory there's the "minimum description length principle":

https://en.wikipedia.org/wiki/Minimum_description_length

Cosma Shalizi has some good references:

http://bactra.org/notebooks/mdl.html

This seems like the tutorial that everyone links (looks good but haven't read in detail):

https://homepages.cwi.nl/~paulv/course-kc/mdlintro.pdf

roughly speaking, MDL adds a notion of learning to Shannon's basic framework: it posits that you transmit a dataset by (a) first learning a model of the dataset, which you transmit using some prior code, and (b) then transmitting the actual dataset under the code given by your learned model.

There are various ways of formalizing this, but if you use Shannon codes and probabilistic models parameterized by weights, the two-part MDL description length is

-log p(weights) - log p(data | weights)

where the first term is the length of the code needed to transmit the model weights (under some prior), and the second term is the length of the code needed to reconstruct the training data exactly given the model. (toy example below)
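
a toy version of the two-part code, for a Bernoulli model whose parameter gets quantized and sent under a uniform prior (the quantization grid and the prior are choices I'm making up just to make the accounting concrete):

    import math

    # toy dataset: 100 coin flips, 70 heads
    data = [1] * 70 + [0] * 30

    # part (a): transmit the learned parameter, quantized to a grid of K values;
    # under a uniform prior over the grid this costs log2(K) bits
    K = 128
    theta_hat = round(sum(data) / len(data) * K) / K  # quantized MLE, ~0.703
    weight_bits = math.log2(K)

    # part (b): transmit the data exactly under the model p(x=1) = theta_hat
    data_bits = sum(-math.log2(theta_hat if x == 1 else 1 - theta_hat) for x in data)

    # total two-part description length, vs. 100 bits for sending the raw flips
    print(weight_bits + data_bits)  # ~95.1 bits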

this happens to be exactly the general form of a "regularized maximum likelihood" objective, which is how essentially all practical ML models are trained: the second term is just a standard predictive loss (for LLM pretraining, the log-likelihood of the tokens in the document), and the first term is a complexity penalty / regularization term. Not all models are trained with an explicit regularization term, but there's recent work observing that even just stopping SGD before it fully converges acts as implicit regularization (you can't learn an arbitrarily complex model in a fixed number of steps), and obviously just using a finite number of parameters at finite precision is itself a limit on model complexity. If you admit these implicit limits, essentially every neural net ever trained, including LLM base models, is compressing its training data under an MDL objective. (sketch of the correspondence below)
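
to make the correspondence concrete, a minimal sketch (my own toy example: linear regression with a Gaussian likelihood and a Gaussian prior on the weights, ignoring the discretization constants you'd need to turn continuous densities into actual code lengths). Minimizing this quantity over w is exactly ridge regression / weight decay:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ rng.normal(size=5) + rng.normal(scale=1.0, size=100)

    w = rng.normal(size=5)       # some candidate weights
    sigma2, tau2 = 1.0, 1.0      # noise variance, prior variance (assumed known)

    # -log p(data | weights): Gaussian likelihood -> squared-error loss (+ constant)
    nll = 0.5 * np.sum((y - X @ w) ** 2) / sigma2 + 0.5 * len(y) * np.log(2 * np.pi * sigma2)

    # -log p(weights): Gaussian prior -> L2 penalty (+ constant)
    nlp = 0.5 * np.sum(w ** 2) / tau2 + 0.5 * len(w) * np.log(2 * np.pi * tau2)

    # the description length (converted to bits) is the regularized training loss up to constants
    print((nll + nlp) / np.log(2))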

- for RL training you have to squint a bit more to take the information-theory perspective, with the model "compressing" the reward feedback, but this seems to be a common perspective among top researchers. John Schulman has some discussion in this recent blog post (reasoning about the LoRA capacity needed for RL fine-tuning): https://thinkingmachines.ai/blog/lora/#how-much-capacity-is-needed-by-supervised-and-reinforcement-learning

- A lot of learning theory is essentially about proving generalization bounds: if you have data of size N and a model that compresses it to size << N (i.e., a simple explanation that gives high probability to the observed data), then with high probability the model will perform well on future examples. AFAIK the literature on this is pretty technical (tons of slightly different assumptions / representations / formulations), but I liked this David McAllester blog post, which gives a basic result for a very general setting: if you can find a C++ program that predicts your dataset and is substantially smaller than the dataset, then with high probability it will also predict future data (sketch below): https://machinethoughts.wordpress.com/2014/08/02/the-free-lunch-theorem/
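
the flavor of these results, as a sketch (this is the standard Occam's-razor bound for programs of a given length under 0-1 loss, not McAllester's exact statement; the numbers are made up):

    import math

    def occam_bound(train_error, program_bits, n, delta=0.05):
        # with probability >= 1 - delta over the n training examples, for every
        # program of the given length that achieves train_error:
        # true error <= train_error + sqrt((program_bits*ln2 + ln(1/delta)) / (2n))
        slack = math.sqrt((program_bits * math.log(2) + math.log(1 / delta)) / (2 * n))
        return train_error + slack

    # a 10 KB program fitting 1M examples gets a meaningful guarantee;
    # the same program fitting only 20k examples gets a vacuous one
    print(occam_bound(0.01, program_bits=80_000, n=1_000_000))  # ~0.18
    print(occam_bound(0.01, program_bits=80_000, n=20_000))     # ~1.19 (vacuous)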

Hutter's really into the theoretical formulation of MDL where you use algorithmic information (Kolmogorov complexity) as a more fundamental notion of description length, instead of the statistical models of the Shannon formulation. Of course, uncomputability ruins this in practice. I like to think of sufficiently large deep networks as basically the tractable approximation to Hutter's idealized priors over all programs.

He starts with Shannon's theorem, which allows you to talk about information in terms of probabilities, and then most of the rest of the book is framed in terms of probabilistic models and inference as a clean unifying framework (e.g., he presents the MDL principle as more or less just a degenerate special case of Bayes' rule).

where you basically try to find more sophisticated coding schemes to compress the weights. AFAIK no one really uses these; the weight priors in modern LLMs are very naive (or even implicit), which you could argue is evidence that modern practice isn't fully leaning into the MDL/compression frame.