active inference: Nonlinear Function
Created: March 19, 2024
Modified: April 14, 2025

active inference

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

How does an active inference agent work? I'll spell out a concrete story that we can plausibly implement.

As an example, suppose the agent wants to solve a maze. The state $s$ is the full drawing of the maze (which is static) and the agent's current position in it (which changes). At each step the agent observes a tuple $o_t = (d_\text{north}, d_\text{south}, d_\text{east}, d_\text{west})$ giving the distance to the nearest wall in each direction, or None if there is no wall in that direction. It knows it has exited the maze when it observes no wall in three of the four directions.
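To make the observation structure concrete, here's a minimal Python sketch (the names `Observation` and `has_exited` are my own, not anything standard):

```python
from typing import NamedTuple, Optional

class Observation(NamedTuple):
    """Distance to the nearest wall in each direction; None means no wall."""
    north: Optional[int]
    south: Optional[int]
    east: Optional[int]
    west: Optional[int]

def has_exited(o: Observation) -> bool:
    """Exit condition from above: no wall in at least three of the four directions."""
    return sum(d is None for d in o) >= 3
```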

The agent has a dynamics model predicting the environment's states $s$ and observations $o$, and the agent's own actions $a$ as a function of a policy $\pi_\theta$:

$$p_\text{dynamics}(s_{1:T}, o_{1:T}, a_{1:T} | \pi_\theta) = p(s_0) \prod_{t=1}^T p(o_t | s_t)\, p(s_{t+1} | s_t, a_t)\, p(a_t | \pi_\theta)$$
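A sketch of this factorization as a log-probability computation, assuming the component log-densities (`log_p_init`, `log_p_obs`, `log_p_trans`, `log_p_act`) are given, whether hand-coded or learned (names, signatures, and indexing conventions here are mine):

```python
def log_p_dynamics(s, o, a, theta, log_p_init, log_p_obs, log_p_trans, log_p_act):
    """Log of p(s_0) * prod_{t=1..T} p(o_t|s_t) p(s_{t+1}|s_t,a_t) p(a_t|pi_theta).

    Conventions (an illustrative choice): s = [s_0, ..., s_{T+1}] while
    o = [o_1, ..., o_T] and a = [a_1, ..., a_T].
    """
    T = len(a)
    logp = log_p_init(s[0])
    for t in range(1, T + 1):
        logp += log_p_obs(o[t - 1], s[t])              # p(o_t | s_t)
        logp += log_p_trans(s[t + 1], s[t], a[t - 1])  # p(s_{t+1} | s_t, a_t)
        logp += log_p_act(a[t - 1], theta)             # p(a_t | pi_theta)
    return logp
```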

q1: are the dynamics known or are they learned? presumably these distributions can also be parameterized. for reasonable exploration we need a prior on dynamics which is probably doable but seems important to get right. I guess the principled way to set this up is to just say that any unknown dynamics are themselves part of the unknown state of the world, to be inferred from observation.

The agent also has a preference factor specifying what states and/or observations it prefers.

$$p_\text{pref}(s_{1:T}, o_{1:T}) = \prod_{t=1}^T p_\text{pref}(o_t)$$

For example, a maze agent will prefer to observe that it's exited the maze, so the preference model will put high probability on that observation and low probability on all other observations. It should put at least nonzero probability on all possible observations to avoid pathologies. We will see that $\log p_\text{pref}$ is effectively the agent's "reward function", so a zero-probability state effectively has negative infinity reward. Why frame preferences as a distribution rather than a generic reward function? The main difference is that a distribution is inherently normalized, so it has one fewer degree of freedom than a reward function. This encodes that an agent's behavior depends only on relative reward: whether a state is 'good' or 'bad' is meaningful only in comparison to alternatives.
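A sketch of the maze agent's preference, reusing the hypothetical `has_exited` from above (the exact probabilities are an illustrative choice, and I'm glossing over careful normalization across the full observation space):

```python
import math

EPS = 1e-3  # small but nonzero, so no observation gets negative-infinity "reward"

def log_p_pref(o) -> float:
    """log p_pref(o_t): strongly prefer observing that the maze has been exited."""
    return math.log(1.0 - EPS) if has_exited(o) else math.log(EPS)
```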

These are combined to define the agent's "model":

$$p_\text{model}(s_{1:T}, o_{1:T}, a_{1:T} | \pi_\theta) = \frac{1}{Z}\, p_\text{dynamics}(s_{1:T}, o_{1:T}, a_{1:T} | \pi_\theta)\, p_\text{pref}(s_{1:T}, o_{1:T})$$

where $Z = \sum_{s, o} p_\text{dynamics}(s_{1:T}, o_{1:T}, a_{1:T} | \pi_\theta)\, p_\text{pref}(s_{1:T}, o_{1:T})$ is the normalizing constant such that this is itself a valid joint probability.
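Putting the two factors together (continuing the hypothetical helpers from the earlier sketches; the $\log Z$ term is omitted here, since variational objectives typically work with the unnormalized density):

```python
def log_p_model_unnorm(s, o, a, theta):
    """Unnormalized log p_model = log p_dynamics + log p_pref (log Z omitted).

    Assumes the component log-densities from the earlier sketch are in scope.
    """
    return (log_p_dynamics(s, o, a, theta,
                           log_p_init, log_p_obs, log_p_trans, log_p_act)
            + sum(log_p_pref(o_t) for o_t in o))
```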

The agent then maintains a belief distribution $q(s_{1:T}, a_{1:T})$ that tries to approximate the model posterior $p_\text{model}(s_{1:T}, a_{1:T} | \pi_\theta, o_{1:T})$. It does this by descending the free energy.
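To pin down what "the free energy" means here (this is just the standard variational free energy, written in the notation above):

$$F[q] = \mathbb{E}_{q(s_{1:T}, a_{1:T})}\left[ \log q(s_{1:T}, a_{1:T}) - \log p_\text{model}(s_{1:T}, o_{1:T}, a_{1:T} | \pi_\theta) \right]$$

Minimizing $F$ over $q$ is equivalent to minimizing $\mathrm{KL}\big(q(s_{1:T}, a_{1:T}) \,\|\, p_\text{model}(s_{1:T}, a_{1:T} | \pi_\theta, o_{1:T})\big)$, since the two differ only by a term that doesn't depend on $q$.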


(scratch) qs:

  • (how) does something like planning fall out of free energy minimization?
  • is the agent inferring its actions, or inferring a policy/plan?