Modified: March 29, 2024
sparse distributed memory
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.
A sparse distributed memory consists of an address matrix A and a contents matrix C. Assuming binary vectors, to retrieve from the memory we (see the sketch below the list):
- Find all rows of A within some Hamming radius of the input (query) vector.
- Sum the corresponding rows of C.
- Threshold the sum.
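A minimal NumPy sketch of that read procedure, assuming 0/1 address vectors and an integer contents matrix; the names (`sdm_read`, `radius`, and the random demo matrices) are illustrative, not from any particular implementation:

```python
import numpy as np

def sdm_read(A, C, query, radius, activation_threshold=0):
    """Read from a sparse distributed memory (illustrative sketch).

    A: (m, n) binary address matrix (0/1 entries)
    C: (m, n) integer contents matrix
    query: (n,) binary query vector
    radius: Hamming radius for selecting active locations
    """
    # Hamming distance from the query to every row of A
    distances = np.count_nonzero(A != query, axis=1)
    # Activate all locations within the Hamming radius
    active = distances <= radius
    # Sum the corresponding rows of C
    total = C[active].sum(axis=0)
    # Threshold the sum back to a binary vector
    return (total > activation_threshold).astype(int)

# Tiny demo with random (untrained) contents, just to show the shapes
rng = np.random.default_rng(0)
m, n = 1000, 256
A = rng.integers(0, 2, size=(m, n))
C = rng.integers(-3, 4, size=(m, n))
q = rng.integers(0, 2, size=n)
out = sdm_read(A, C, q, radius=100)
```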
This looks a lot like a transformer feedforward key-value mechanism. A transformer feedforward layer effectively uses dot-product similarity on the keys (rather than Hamming distance, though in a {-1, 1} encoding Hamming distance is just an affine transform of negative cosine similarity), so we get a weighted sum of values. It doesn't hard-threshold the similarity coefficients (whereas the sparse distributed memory only sums values within some Hamming radius); instead it passes them through a nonlinearity like a ReLU before the weighted sum, and the final result isn't thresholded at all. But the basic shape and semantics of the matrices are the same: the feedforward weight matrices W1 and W2 play exactly the roles of A and C in the sparse distributed memory.
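A sketch of that correspondence, with the keys as rows of a W1 matrix and the values as rows of W2 (biases and layer norm omitted; the shapes and names are illustrative assumptions, not a specific model's API):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def feedforward_read(x, W1, W2):
    """Transformer feedforward block viewed as a key-value memory (sketch).

    Rows of W1 act like SDM addresses (keys); rows of W2 act like contents (values).
    """
    # Similarity of the input to each key (dot product, not a Hamming-radius test)
    coefficients = x @ W1.T          # shape: (num_keys,)
    # Soft activation via ReLU instead of a hard threshold
    coefficients = relu(coefficients)
    # Weighted sum of value rows; no thresholding of the result
    return coefficients @ W2         # shape: (d_model,)

# Tiny demo
rng = np.random.default_rng(0)
d_model, num_keys = 64, 256
W1 = rng.normal(size=(num_keys, d_model))  # "address"/key matrix
W2 = rng.normal(size=(num_keys, d_model))  # "contents"/value matrix
x = rng.normal(size=d_model)
y = feedforward_read(x, W1, W2)
```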
In a related way, it also looks a lot like an attention mechanism. Attention does perform an analogue of thresholding the similarity coefficients, in that it exponentiates-and-normalizes them (softmax).
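And a corresponding single-query scaled dot-product attention sketch, again illustrative rather than any particular library's implementation, showing the softmax as the soft analogue of the Hamming-radius selection:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_read(q, K, V):
    """Single-query dot-product attention as an associative read (sketch)."""
    # Similarity coefficients, exponentiated-and-normalized (a soft threshold)
    weights = softmax(q @ K.T / np.sqrt(K.shape[1]))
    # Weighted sum of value rows
    return weights @ V

# Tiny demo
rng = np.random.default_rng(0)
num_keys, d = 128, 64
K, V = rng.normal(size=(num_keys, d)), rng.normal(size=(num_keys, d))
q = rng.normal(size=d)
out = attention_read(q, K, V)
```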