Helmholtz machine: Nonlinear Function
Created: March 24, 2024
Modified: March 24, 2024

Helmholtz machine

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

References:
  • Dayan, Hinton, Neal, and Zemel (1995). "The Helmholtz Machine." Neural Computation.

This paper is one of the first to introduce variational inference and specifically to apply it to latent variables in a neural network. It extends the earlier work of Zemel and Hinton 1994 "Learning population codes by minimizing description length", which introduces the basic framework, to a hierarchical model in which each neuron is conditioned on activations at previous layers.

Model

The model assumes layers of binary neurons, where the bottom layer corresponds to the pixels of a binary image. Given activations $s^{\ell-1}$ at layer $\ell - 1$, the recognition model (variational posterior) $q_\phi$ factorizes over the activations at layer $\ell$:

$$q^\ell_j(s^\ell_j = 1 \mid s^{\ell-1}, \phi) = \sigma\left(\sum_i s^{\ell-1}_i \phi^{\ell-1, \ell}_{i, j}\right) = \sigma\left(\Phi^{\ell}_j s^{\ell-1}\right).$$

Overall the recognition model is the product of these Bernoulli distributions within each layer, chained bottom-up, so that each layer is inferred given the previous:

$$Q_\phi(s) = \prod_\ell \prod_j q^{\ell}_j(s^{\ell}_j \mid s^{\ell-1}, \phi)$$

In modern terminology, this is a hierarchical VAE with binary neurons, in which every layer's activation is viewed as a random variable.
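As a concrete illustration, here is a minimal NumPy sketch of this bottom-up recognition pass; the layer sizes, variable names, and the omission of bias terms are my own simplifications, not the paper's:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def recognition_pass(d, phis, rng):
    """Sample binary activations bottom-up through the recognition model.

    d    : binary data vector (the bottom layer, s^0)
    phis : list of recognition weight matrices; phis[l] maps layer l up to layer l+1
           (biases omitted for brevity)
    Returns the sampled activations s for every layer and the Bernoulli
    probabilities q for every hidden layer.
    """
    s = [d]   # s[0] is the data layer
    q = []    # q[l] are the probabilities of layer l+1 given layer l
    for Phi in phis:
        probs = sigmoid(s[-1] @ Phi)                                # q^l_j = sigma(Phi^l_j . s^{l-1})
        q.append(probs)
        s.append((rng.random(probs.shape) < probs).astype(float))   # sample each unit independently
    return s, q

# Example with arbitrary layer sizes (16 visible units, then 8 and 4 hidden units):
rng = np.random.default_rng(0)
phis = [0.1 * rng.normal(size=(16, 8)), 0.1 * rng.normal(size=(8, 4))]
d = (rng.random(16) < 0.5).astype(float)
s, q = recognition_pass(d, phis, rng)
```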

The generative model $p(s, \theta)$ is a bit of a mess: conceptually they started with a simple factorized log-linear model, paralleling the recognition model, but hacked it into something more complicated in order to get their experiments to work. However it still essentially models each neuron as independently Bernoulli given the neurons in the next (higher-up) layer.
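For completeness, here is a sketch of the log-probability under the simple factorized top-down model described above (the conceptual starting point, not the paper's actual modified generative model). It reuses `sigmoid`, `rng`, and the layer sizes from the recognition sketch; `thetas` and `top_bias` are hypothetical names:

```python
def generative_log_prob(s, thetas, top_bias):
    """Log-probability of a full set of activations under the simplified
    factorized generative model: the top layer has independent Bernoulli units
    with probabilities sigmoid(top_bias), and each lower unit is Bernoulli
    given the layer above it. thetas[l] maps layer l+1 down to layer l.
    """
    p_top = sigmoid(top_bias)
    s_top = s[-1]
    logp = np.sum(s_top * np.log(p_top) + (1 - s_top) * np.log(1 - p_top))
    for l in range(len(s) - 2, -1, -1):                 # walk down the hierarchy
        p = sigmoid(s[l + 1] @ thetas[l])               # top-down prediction of layer l
        logp += np.sum(s[l] * np.log(p) + (1 - s[l]) * np.log(1 - p))
    return logp

# Generative parameters matching the layer sizes used above:
thetas = [0.1 * rng.normal(size=(8, 16)), 0.1 * rng.normal(size=(4, 8))]
top_bias = np.zeros(4)
print(generative_log_prob(s, thetas, top_bias))
```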

Training

This model is full of discrete latent variables. It could in principle be trained by stochastic gradient ascent on the ELBO using REINFORCE or another score-function (policy-gradient) estimator. But that would be a high-variance mess, so in practice they make a deterministic, mean-field-style approximation (in the spirit of a straight-through relaxation), replacing the binary activations $s^\ell_j$ with their expectations, the corresponding Bernoulli probabilities $q^\ell_j$. This gives a deterministic objective, the "deterministic Helmholtz machine".
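Under the same assumptions as the sketches above, the deterministic version just propagates the probabilities instead of sampling, so the whole pass (and any objective built from it) becomes an ordinary differentiable function of the parameters:

```python
def deterministic_recognition_pass(d, phis):
    """Deterministic variant: propagate the Bernoulli probabilities themselves
    instead of binary samples, so the pass is a smooth function of phi."""
    q = [d]                                  # treat the data layer as fixed "probabilities"
    for Phi in phis:
        q.append(sigmoid(q[-1] @ Phi))       # expectations replace the binary activations
    return q
```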

Alternatively the paper proposes a 'stochastic Helmholtz machine' that uses a wake-sleep algorithm for training:

  1. In the 'wake' phase, data $d$ are presented and binary activations $s$ are sampled bottom-up using the recognition model. The generative parameters $\theta$ are then adjusted to maximize the likelihood of these sampled activations (or perhaps to minimize the KL divergence to the recognition distributions conditioned on the previous-layer activations? this seems to be implied by the paper's language, though I'm not sure it would make much difference…).
  2. In the 'sleep' phase, activations $s$ and ultimately an image $d$ are sampled top-down from the generative model, and these are used to train the recognition model (both phases are sketched in code below).

This corresponds to a nice local learning rule in each phase, where each neuron adjusts its parameters to simply predict its neighbors. But it doesn't cleanly optimize any particular objective.
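Here is a minimal sketch of both phases, assuming the simplified factorized generative model and the helpers defined above, with the updates written as local delta rules; the paper's actual updates and schedules differ in their details:

```python
def wake_phase(d, phis, thetas, top_bias, lr, rng):
    """Wake: recognize (sample bottom-up), then move the generative parameters
    toward predicting each sampled layer from the layer above it."""
    s, _ = recognition_pass(d, phis, rng)
    top_bias += lr * (s[-1] - sigmoid(top_bias))              # top-layer biases
    for l in range(len(thetas)):
        pred = sigmoid(s[l + 1] @ thetas[l])                  # top-down prediction of layer l
        thetas[l] += lr * np.outer(s[l + 1], s[l] - pred)     # local delta rule

def sleep_phase(phis, thetas, top_bias, lr, rng):
    """Sleep: dream (sample top-down from the generative model), then move the
    recognition parameters toward recovering each dreamed layer from below."""
    s = [None] * (len(thetas) + 1)
    s[-1] = (rng.random(top_bias.shape) < sigmoid(top_bias)).astype(float)
    for l in range(len(thetas) - 1, -1, -1):
        p = sigmoid(s[l + 1] @ thetas[l])
        s[l] = (rng.random(p.shape) < p).astype(float)
    for l in range(len(phis)):
        pred = sigmoid(s[l] @ phis[l])                        # bottom-up prediction of layer l+1
        phis[l] += lr * np.outer(s[l], s[l + 1] - pred)       # local delta rule

# Alternate the two phases:
for _ in range(100):
    wake_phase(d, phis, thetas, top_bias, lr=0.05, rng=rng)
    sleep_phase(phis, thetas, top_bias, lr=0.05, rng=rng)
```

Note how local each update is: a weight changes using only the activities of the two units it connects, which is the appeal of the learning rule mentioned above.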

Discussion

The contributions of the paper include:

  • Demonstrating the use of hierarchical variational inference for unsupervised learning (and representation learning) with neural nets
  • The use of a variational model (vs MCMC) allows for much cleaner learning rules that one could imagine actually being implemented in the cortex.
  • Compared to other concurrent neuroscience-inspired work, this was notable in taking a probabilistic / statistical perspective (which is now the dominant paradigm in the field)

Despite having both a recognition and a generative model, the recognition process is purely bottom-up:

"During recognition, the generative model is superfluous, since the recognition model contains all the information that is required. Nevertheless, the generative model plays an essential role in defining the objective function F that allows the parameters of the recognition model to be learned."

As described, then, the top-down generative model plays no direct role during recognition, and there is no interaction between units within a single layer.
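For reference, the objective function F mentioned in the quote is the variational free energy, i.e. the negative of what would now be called the ELBO. In my notation (not the paper's), writing $\alpha$ for the hidden activations above a data vector $d$:

$$F(d; \theta, \phi) = \sum_{\alpha} Q(\alpha \mid d, \phi)\,\bigl[\log Q(\alpha \mid d, \phi) - \log p(\alpha, d \mid \theta)\bigr] = -\log p(d \mid \theta) + \mathrm{KL}\bigl(Q(\alpha \mid d, \phi) \,\|\, p(\alpha \mid d, \theta)\bigr)$$

Minimizing F with respect to $\phi$ tightens the bound, and minimizing it with respect to $\theta$ raises a lower bound on the log-likelihood of the data; this is exactly the negative ELBO that a modern VAE optimizes.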

Incorporating top-down processing is suggested as a direction for future work:

"However, such effects are important in real perception and can be implemented using iterative recognition, in which the generative and recognition activations interact to produce the final activity of a unit. This can introduce substantial theoretical complications in ensuring that the activation process is stable and converges adequately quickly, and in determining how the weights should change so as to capture input examples more accurately."