Hopfield network: Nonlinear Function
Created: March 29, 2024
Modified: March 29, 2024

Hopfield network

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

A Hopfield network is 'just' an Ising model / Markov network model of neurons. It has no hidden states and is framed as an associative memory of neural activations.

From Hopfield's 1982 paper:

  • Humans build large computational circuits following some overarching plan. But evolution has no such plan, so computation in large systems of neurons might be an emergent property from composing many simple neurons.
  • The goal is 'content-addressable memory' or associative memory. To remember a citation it might be sufficient to remember any one of the authors, perhaps the date, perhaps a word of the title, or some combination of these.
  • Any physical system / dynamics model that flows towards stable attractors can be viewed as such a memory, in the sense that to end up at a given attractor it's sufficient to start anywhere in its potential well. The set of stable points $x_a, x_b, \ldots$ is the 'content' of the memory. Such a system is useful as an associative memory if we can easily manipulate its set of stable points.
  • The network is a collection of neurons with binary (0/1) activations $V_i$. The neurons have weighted connections and each neuron has an individual threshold (bias). A neuron updates itself randomly in time, setting $V_i = 1$ if the weighted combination of its neighbors is above threshold and $V_i = 0$ otherwise.
  • To store a set of state vectors $V^1, \ldots, V^n$, we construct the weight for each connection $W_{ij}$ by the following procedure:
    1. Count the number of state vectors in which neurons $i$, $j$ have the same value (either both activated or both deactivated).
    2. Also count the number of state vectors in which those neurons have different values.
    3. Subtract those two quantities. Thus, neurons that always fire together have a strongly positive weight, and those that always oppose each other have a strongly negative weight.
    • Equivalently, the weights are the sum of outer products of the states, $W = \sum_i V^i (V^i)^T$, representing the states in {-1, +1}.
    • The individual neuron updates can be seen as (local) descent steps on an energy $E = -\frac{1}{2}\sum_{i \ne j} W_{ij} V_i V_j$ (a small code sketch of the construction and the update dynamics follows this list).
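As a concrete sketch of that construction and of the retrieval dynamics — hypothetical code, assuming ±1 states, synchronous sign updates, and made-up toy patterns:

```python
import jax.numpy as jnp

def store(patterns):
    """Hebbian outer-product rule: W = sum_i V^i (V^i)^T, with the diagonal
    zeroed so neurons have no self-connections. `patterns` has shape
    (n_patterns, n_neurons) with entries in {-1, +1}."""
    W = patterns.T @ patterns
    return W - jnp.diag(jnp.diag(W))

def recall(W, state, n_steps=10):
    """Iterate V <- sign(W V). Hopfield's original dynamics update one randomly
    chosen neuron at a time; this synchronous version is just a simpler sketch
    of the same idea (ties broken towards +1)."""
    for _ in range(n_steps):
        state = jnp.where(W @ state >= 0, 1, -1)
    return state

patterns = jnp.array([[ 1, -1, 1, -1,  1, 1],
                      [-1, -1, 1,  1, -1, 1]])
W = store(patterns)
noisy = patterns[0].at[0].set(-1)   # corrupt one neuron of the first pattern
print(recall(W, noisy))             # falls back into the first pattern's basin
```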

Q: How would a Hopfield network be learned? Given a set of target states we can construct the weights by hand. But suppose we have a target state of interest: do we just set the neurons to that state and then do a few iterations of gradient descent on the energy function wrt the weights? I suppose if we start with the weights all zero, then the procedure:

  1. Set the neurons to state $V^1$
  2. Add 1 to the pairwise weights of any neurons that have the same activation, and subtract 1 from the weights of neurons that have different values
  3. Set the neurons to the next state and repeat.

This procedure reproduces the construction above, and also corresponds to gradient descent on the energy function! So yes, that should work.
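To spell out the gradient-descent claim: differentiating the energy above with respect to a weight (treating $W_{ij} = W_{ji}$ as a single parameter) gives

$$\frac{\partial E}{\partial W_{ij}} = -V_i V_j, \qquad W_{ij} \leftarrow W_{ij} - \eta \frac{\partial E}{\partial W_{ij}} = W_{ij} + \eta V_i V_j,$$

so with step size $\eta = 1$ and states in {-1, +1}, a gradient step adds 1 to the weight when the two neurons agree and subtracts 1 when they disagree, which is exactly the increment in step 2.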

It's interesting that this unifies learning and inference as just gradient descent on the energy function, with respect to weights and activations respectively.

Q: Can one have a deep Hopfield network? A: Insofar as a Hopfield network is an iterative process, you can run it in an abstract representation space rather than 'pixel space'.

Q: Can you use a Hopfield network as a layer in another model, e.g. a transformer? A: Here I suppose you have an input to the layer; you don't necessarily know the desired output, but you do have a gradient signal. So you'd take the input, run it through a few iterations of updates according to the current weights, then produce an output which goes to the rest of the network, and then backprop through the unrolled iteration to update the weights and produce a signal for the input. Presumably you can accelerate this by taking advantage of fixed-point theorems and implicit differentiation. So mechanistically, yes. But does it make any sense? If the goal is just to find the nearest point (out of a set of prototypes) to a given input, then other methods are probably more straightforward.
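A rough, hypothetical sketch of that unrolled-layer idea, using the continuous softmax update from the 'Hopfield Networks is All You Need' section below (binary updates wouldn't be differentiable); the shapes, loss, and number of iterations here are all made up for illustration:

```python
import jax
import jax.numpy as jnp

def hopfield_layer(stored, query, beta=1.0, n_iters=3):
    """Unrolled retrieval: repeat xi <- X softmax(beta X^T xi).
    `stored` (the layer's learnable patterns X) has shape (d, n_patterns),
    `query` has shape (d,). The loop stays differentiable, so backprop
    through it yields gradients for both the patterns and the query."""
    xi = query
    for _ in range(n_iters):
        xi = stored @ jax.nn.softmax(beta * stored.T @ xi)
    return xi

# Toy downstream loss on the retrieved state.
def loss(stored, query, target):
    return jnp.sum((hopfield_layer(stored, query) - target) ** 2)

stored = jax.random.normal(jax.random.PRNGKey(0), (4, 8))  # 8 patterns in R^4
query = jnp.array([1.0, 0.0, -1.0, 0.5])
target = jnp.zeros(4)

# Gradient wrt the stored patterns (learning) and wrt the query
# (the signal passed back to earlier layers).
g_stored, g_query = jax.grad(loss, argnums=(0, 1))(stored, query, target)
```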

Q: How does a Hopfield network relate to a diffusion model? A: There are parallels! Both do a sort of iterative denoising. In a Hopfield network we have explicit weights that define an energy function. In a diffusion model we learn a network to parameterize the gradient of an energy (equivalently, the score of the data density). Diffusion models work on continuous data, where gradients make sense, and where we're trying to learn a data distribution as opposed to just memorizing a set of prototypes.


'Modern' Hopfield networks use higher-order energy functions (compared to pairwise quadratic) to get a lot more capacity.


'Hopfield Networks is All You Need'

Defines a modern Hopfield network (with continuous-valued states and a log-sum-exp energy function) whose update rule is equivalent to a special case of transformer attention, where the Hopfield stored patterns $\mathbf{X}$ play the role of both the transformer keys and values:

$$\mathbf{\xi}_\text{new} = \mathbf{X} \mathbf{p} = \mathbf{X}\;\text{softmax}\left(\beta\mathbf{X}^T\mathbf{\xi}\right)$$

This apparently is a descent step on the energy function

$$\mathbf{E}(\mathbf{\xi}) = -\beta^{-1} \log \sum_{i=1}^N \exp\left(\beta \left(\mathbf{X}^T\mathbf{\xi}\right)_i\right) + \frac{1}{2}\mathbf{\xi}^T\mathbf{\xi}$$

(up to additive constant terms).
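A small hypothetical check of that descent claim (made-up patterns and dimensions; the energy below drops the paper's constant terms): iterating the update should never increase this energy.

```python
import jax
import jax.numpy as jnp

def energy(X, xi, beta=2.0):
    """E(xi) = -beta^{-1} log sum_i exp(beta (X^T xi)_i) + 0.5 xi^T xi (constants dropped)."""
    return -jax.nn.logsumexp(beta * X.T @ xi) / beta + 0.5 * xi @ xi

def update(X, xi, beta=2.0):
    """One retrieval step: xi_new = X softmax(beta X^T xi)."""
    return X @ jax.nn.softmax(beta * X.T @ xi)

X = jax.random.normal(jax.random.PRNGKey(1), (16, 5))   # 5 stored patterns in R^16
xi = jax.random.normal(jax.random.PRNGKey(2), (16,))    # random initial query
for _ in range(3):
    xi_new = update(X, xi)
    print(float(energy(X, xi)), "->", float(energy(X, xi_new)))
    xi = xi_new
```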