sparse distributed memory: Nonlinear Function
Created: March 29, 2024
Modified: March 29, 2024


This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.


A sparse distributed memory consists of an address matrix $A: M \times N$ and a contents matrix $C: M \times U$. Assuming binary vectors, to retrieve from the memory we (see the sketch after this list):

  1. Find all rows of $A$ within some Hamming radius of the input (query) vector.
  2. Sum the corresponding rows of $C$.
  3. Threshold the sum.
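
A minimal NumPy sketch of this read procedure, assuming binary {0, 1} vectors and counter-valued contents rows; the function and parameter names are illustrative, not from any particular library:

```python
import numpy as np

def sdm_read(query, A, C, radius):
    """Read from a sparse distributed memory.

    query: (N,) binary vector, A: (M, N) address matrix, C: (M, U) contents matrix.
    """
    # 1. Hamming distance from the query to every hard address (row of A),
    #    then select the rows within the activation radius.
    active = np.sum(A != query, axis=1) <= radius
    # 2. Sum the corresponding rows of C.
    s = C[active].sum(axis=0)
    # 3. Threshold the sum back to a binary vector.
    return (s > 0).astype(np.uint8)

# Toy usage: store one pattern autoassociatively by adding +/-1 counters to the
# active rows of C (the standard counter-based write, not described above),
# then read it back at the same address.
rng = np.random.default_rng(0)
M, N = 1000, 64
A = rng.integers(0, 2, size=(M, N), dtype=np.uint8)
C = np.zeros((M, N), dtype=np.int64)
data = rng.integers(0, 2, size=N, dtype=np.uint8)
active = np.sum(A != data, axis=1) <= 24
C[active] += 2 * data.astype(np.int64) - 1
print(np.array_equal(sdm_read(data, A, C, radius=24), data))  # True
```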

This looks a lot like a transformer feedforward key-value mechanism. A transformer feedforward layer effectively uses cosine (dot-product) similarity on the keys rather than Hamming distance - though in a {-1, +1} encoding Hamming distance is just an affine function of negative cosine similarity - so we get a weighted sum of values. It doesn't threshold the similarity coefficients (whereas the associative memory only sums values within some Hamming radius), and instead of thresholding the final result it uses some other nonlinearity like a ReLU. But the basic shape and semantics of the matrices are the same: the feedforward weight matrices $W_1$ and $W_2$ play exactly the roles of $A$ and $C$ in the sparse distributed memory.
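A sketch of the correspondence, assuming a bias-free two-layer feedforward block with a ReLU; the shapes deliberately mirror the SDM read above, and the names are illustrative:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def ffn_read(x, W1, W2):
    """Transformer feedforward block written to mirror the SDM read.

    W1: (M, N) plays the role of the address matrix A (keys);
    W2: (M, U) plays the role of the contents matrix C (values).
    Biases and layer norm are omitted for clarity.
    """
    sims = W1 @ x          # similarity of x to each "address" row of W1
    coeffs = relu(sims)    # soft nonlinearity instead of a hard Hamming-radius cut
    return W2.T @ coeffs   # weighted sum of "contents" rows of W2

# Note: for x, y in {-1, +1}^N, dot(x, y) = N - 2 * hamming(x, y),
# so ranking by dot product and ranking by Hamming distance agree.
```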

Relatedly, it also looks a lot like an attention mechanism. Attention does perform an analogue of thresholding the similarity coefficients, in that it exponentiates-and-normalizes them (softmax), which sharply down-weights keys far from the query.
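For comparison, here is a single-query attention read written in the same shape, assuming the key matrix plays the role of $A$ and the value matrix the role of $C$; this is an illustrative sketch, not a full multi-head implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attention_read(q, K, V):
    """Single-query attention viewed as a soft SDM read.

    K: (M, d) rows act like addresses, V: (M, u) rows act like contents.
    """
    sims = K @ q / np.sqrt(K.shape[1])   # scaled dot-product similarities
    weights = softmax(sims)              # exponentiate-and-normalize the coefficients
    return V.T @ weights                 # weighted sum of values
```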