Modified: March 29, 2024
sparse distributed memory
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
A sparse distributed memory consists of an address matrix A and a contents matrix C. Assuming binary vectors, to retrieve from the memory we (see the sketch below the list):
- Find all rows of A within some Hamming radius of the input (query) vector.
- Sum the corresponding rows of C.
- Threshold the sum.
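A minimal NumPy sketch of that read procedure, assuming 0/1 address vectors and an integer contents matrix; the names (`sdm_read`, `radius`, and the random demo matrices) are illustrative, not from any particular implementation:

```python
import numpy as np

def sdm_read(A, C, query, radius, activation_threshold=0):
    """Read from a sparse distributed memory (illustrative sketch).

    A: (m, n) binary address matrix (0/1 entries)
    C: (m, n) integer contents matrix
    query: (n,) binary query vector
    radius: Hamming radius for selecting active locations
    """
    # Hamming distance from the query to every row of A
    distances = np.count_nonzero(A != query, axis=1)
    # Activate all locations within the Hamming radius
    active = distances <= radius
    # Sum the corresponding rows of C
    total = C[active].sum(axis=0)
    # Threshold the sum back to a binary vector
    return (total > activation_threshold).astype(int)

# Tiny demo with random (untrained) contents, just to show the shapes
rng = np.random.default_rng(0)
m, n = 1000, 256
A = rng.integers(0, 2, size=(m, n))
C = rng.integers(-3, 4, size=(m, n))
q = rng.integers(0, 2, size=n)
out = sdm_read(A, C, q, radius=100)
```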
This looks a lot like a transformer feedforward key-value mechanism. A transformer feedforward layer effectively uses dot-product similarity on the keys (rather than Hamming distance, though in a {-1, 1} encoding Hamming distance is just an affine transform of negative cosine similarity), so we get a weighted sum of values. It doesn't hard-threshold the similarity coefficients (whereas the sparse distributed memory only sums values within some Hamming radius); instead it passes them through a nonlinearity like a ReLU before the weighted sum, and the final result isn't thresholded at all. But the basic shape and semantics of the matrices are the same: the feedforward weight matrices W1 and W2 play exactly the roles of A and C in the sparse distributed memory.
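A sketch of that correspondence, with the keys as rows of a W1 matrix and the values as rows of W2 (biases and layer norm omitted; the shapes and names are illustrative assumptions, not a specific model's API):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feedforward_read(x, W1, W2):
    """Transformer feedforward block viewed as a key-value memory (sketch).

    Rows of W1 act like SDM addresses (keys); rows of W2 act like contents (values).
    """
    # Similarity of the input to each key (dot product, not a Hamming-radius test)
    coefficients = x @ W1.T          # shape: (num_keys,)
    # Soft activation via ReLU instead of a hard threshold
    coefficients = relu(coefficients)
    # Weighted sum of value rows; no thresholding of the result
    return coefficients @ W2         # shape: (d_model,)

# Tiny demo
rng = np.random.default_rng(0)
d_model, num_keys = 64, 256
W1 = rng.normal(size=(num_keys, d_model))  # "address"/key matrix
W2 = rng.normal(size=(num_keys, d_model))  # "contents"/value matrix
x = rng.normal(size=d_model)
y = feedforward_read(x, W1, W2)
```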
In a related way, it also looks a lot like an attention mechanism. Attention does perform an analogue of thresholding the similarity coefficients, in that it exponentiates-and-normalizes them (softmax).
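And a corresponding single-query scaled dot-product attention sketch, again illustrative rather than any particular library's implementation, showing the softmax as the soft analogue of the Hamming-radius selection:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_read(q, K, V):
    """Single-query dot-product attention as an associative read (sketch)."""
    # Similarity coefficients, exponentiated-and-normalized (a soft threshold)
    weights = softmax(q @ K.T / np.sqrt(K.shape[1]))
    # Weighted sum of value rows
    return weights @ V

# Tiny demo
rng = np.random.default_rng(0)
num_keys, d = 128, 64
K, V = rng.normal(size=(num_keys, d)), rng.normal(size=(num_keys, d))
q = rng.normal(size=d)
out = attention_read(q, K, V)
```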