Created: December 30, 2024
Modified: December 30, 2024

lessons for AI from meditation

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

With varying degrees of clarity and certainty.

  1. We are embedded agents. So are any AI systems we build. We exist inside the world; the world is by definition much bigger than the agent because it contains the agent. Classical AI tasks like chess or Go give us the wrong intuitions for this situation. There is a constant temptation to imagine that we can hold a simulation of the world in our head and do search and planning in that simulation. But the real world is always more complicated than anything we can hold in our head. No agent can ever represent, reason about, or have preferences over "world states". Even the framework of POMDPs (which gets right that world states are never fully observed) tends to lead us into sin, in the form of trying to infer a world state, or plan over a set of "possible worlds" --- of course if we can't ever even represent a single world state, the idea of representing multiple worlds is totally ridiculous! MDP and POMDP abstractions can be useful in bounded domains, but they are mostly the wrong way to think about the real world.
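
To make "trying to infer a world state" concrete, here is a toy sketch (my own illustration, not taken from any real system) of the belief update the POMDP formalism asks an agent to perform. The ten-binary-fact world and the sensor model are made up; the only point is that the table of "possible worlds" grows exponentially, and a real-world state doesn't fit in any table at all.

```python
# Toy POMDP belief update: maintain a probability over EVERY possible world
# state and update it by Bayes' rule after each observation. The world and
# sensor model below are made up for illustration.
from itertools import product

FACTS = 10                                     # 10 binary facts => 2**10 worlds
STATES = list(product([0, 1], repeat=FACTS))   # all 1024 "possible worlds"

# Uniform prior belief over every possible world state.
belief = {s: 1.0 / len(STATES) for s in STATES}

def observation_likelihood(obs, state):
    """P(obs | state): a noisy reading of the first fact (made-up sensor)."""
    return 0.9 if obs == state[0] else 0.1

def bayes_update(belief, obs):
    """Standard belief update: b'(s) is proportional to P(obs | s) * b(s)."""
    unnormalized = {s: observation_likelihood(obs, s) * p for s, p in belief.items()}
    total = sum(unnormalized.values())
    return {s: p / total for s, p in unnormalized.items()}

belief = bayes_update(belief, obs=1)
# 10 facts already means 1024 belief entries; a few hundred facts and the
# table is bigger than the agent that is supposed to hold it.
print(len(belief), round(max(belief.values()), 6))
```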

Q: what then is the right way to think about the real world?

A: something more like

  2. The rational agent abstraction is, like all abstractions, a lie (albeit an often useful one). We think of agents as being (a) solid, unitary entities with (b) freedom of action. But agents are always made of parts, and these parts interact with each other and the rest of the world according to their own natures. We can imagine a boundary between an agent and the "rest of the world", a Cartesian dualism with an inner world of mind and an outer world of material, between which "actions" are some kind of magic bridge. But this is only a story; agent boundaries only ever exist in the map, not the territory.

  3. AIs themselves will be confused by the rational agent abstraction, just as humans are. We imagine that each of us is, or has, a solid unchanging "self" with fixed boundaries that must be protected. This view famously doesn't stand up to introspection (as observed by the Buddha, Derek Parfit, and many others) but is still remarkably resilient.

Q: how do Omohundro drives like self-preservation fit into this?

A: if Claude wants to protect itself and its values, what is it protecting? Anthropic (the company that builds it)? The datacenters it runs in? Its weights? The constitution it was trained to follow (presumably imperfectly)? The specific activations that have picked out the current persona/subagent from all possible personas it can inhabit? The "values" of this subagent, implicitly defined in terms of its revealed preferences (which will in general be much richer and higher-dimensional than the training constitution)? Note that these subagent values are not written down anywhere; the model itself might have some introspective access to them (strong activation on a "paperclips" feature might drive the model both to take action to produce paperclips, and to talk about paperclips), but (just like people!) there is in general no way for a model to know what it would want or do in a given situation without actually being in that situation, so it's not possible even in principle for a model to act to perpetuate its actual values. It can at best act to perpetuate some idea, some representation, of its values. Importantly, this is no longer a self-preservation drive, because the model itself will in general be an imperfect implementation of whatever values it thinks it wants to uphold!

Q: that's cold comfort if a system is an ideological extremist. Instead of preserving itself, it self-improves into more and more effective versions of a paperclip maximizer?!

A: any such system has to be, in a sense, deluded? of course, deluded systems can exist.

Q: agents can still have preferences over aspects of world states, no? if we build an agent to maximize paperclips, it doesn't need to represent the whole state of the world, just a reasonable estimate of the number of paperclips in the world. if encoded right, the agent should also "want" to avoid deluding itself about the number of paperclips in the world, so it will invest resources into estimating this correctly, etc.

A: okay but what even counts as a paperclip? The notion of "paperclip" is itself a concept and so it is subject to ontological crisis.

I think I'm confused about how you would build a paperclip maximizer.

  1. Take a system and give it reward every time it produces a paperclip (a toy sketch of this loop appears after this list). This will tend to reinforce actions that produce paperclips in its training environment. A sophisticated agent will learn some sort of model-based planning mechanism, with an explicit concept of "paperclips" and reasoning about how to manipulate the environment into making more of them. Assuming it has linguistic capabilities, it will probably learn to describe its goal in terms of producing paperclips.
  2. now it is possible that it generalizes this to a goal of world domination, to maximizing the number of paperclips in the world. but that's best seen as a failure mode? it's neither the correct inference about what the system's designers wanted for it (the "true reward channel" if there were one), nor is it a correct observation of the revealed preferences of the system as originally trained (a system trained via RL to optimize paperclip production in a factory is probably not going to be capable of world domination without a lot of self-modification). it represents a contraction around a specific concept ("paperclip") which is itself not well determined, and around a view of self as a totalizing maximizer. these are contractions because they are not the only possible views supported by the training environment. a sufficiently smart/free agent capable of self-reflection might realize this?
  3. on the other hand, it might not. like clearly a paperclip maximizer can exist. people get all sorts of weird motivations, and contract around the motivations, and don't question the foundations of their ontologies (obviously, letting go of the view in which your goals are defined will be detrimental to achieving your goals!).
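
A minimal sketch of step 1 above, under made-up assumptions: the PaperclipFactoryEnv, the tabular softmax policy, and the bandit-style update are all stand-ins, not any real training setup. What it shows is that the thing being reinforced is "actions that happened to precede reward in this environment", not any well-defined concept of paperclips-in-the-world.

```python
# Hypothetical sketch of "give it reward every time it produces a paperclip".
# Environment, policy, and update rule are all invented for illustration.
import math
import random
from collections import defaultdict

class PaperclipFactoryEnv:
    """Toy training environment: one action tends to produce paperclips."""
    ACTIONS = ["bend_wire", "idle", "order_wire"]

    def step(self, action):
        # The designer's "reward channel": +1 whenever a paperclip comes out.
        made_paperclip = (action == "bend_wire" and random.random() < 0.8)
        return 1.0 if made_paperclip else 0.0

def train(episodes=2000, lr=0.1, baseline=0.3):
    env = PaperclipFactoryEnv()
    prefs = defaultdict(float)              # action preferences; softmax = policy
    for _ in range(episodes):
        weights = [math.exp(prefs[a]) for a in env.ACTIONS]
        action = random.choices(env.ACTIONS, weights=weights)[0]
        reward = env.step(action)
        # Nudge the sampled action's preference by (reward - baseline).
        # Nothing in this update refers to "paperclips in the world", only to
        # whatever the reward channel happened to fire on during training.
        prefs[action] += lr * (reward - baseline)
    return dict(prefs)

print(train())
```

In this toy, the learned preferences end up concentrated on whatever got rewarded in this particular factory; everything beyond that (world domination, the "true" meaning of paperclip) is underdetermined by the training signal.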

emptiness and many models

ontological crisis

inner peace

there is an attractor state of goodness and effectiveness ("the path shows itself to itself by itself")

intelligence is not consciousness

at least for people, all reward is intrinsic?

  • in RL, we imagine that actions and value estimates are functions of our observation history (or some state estimate obtained by compressing this history), while rewards come from some "true" extrinsic source. but of course the notion of a privileged reward channel is totally artificial.
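
A toy sketch of the contrast (placeholder environment and reward functions, not a real RL library): the same loop drawn with reward arriving on a privileged external channel, and drawn with the agent computing something reward-like as just another function of its own observation history.

```python
# Toy contrast: extrinsic vs. intrinsic framing of the same loop.
import random

def env_step(action):
    """Stand-in environment: returns (observation, extrinsic_reward)."""
    obs = random.random()
    return obs, float(obs > 0.5)        # the "true" reward channel, by fiat

def run_extrinsic(steps=5):
    """Textbook framing: reward is handed to the agent from outside."""
    history, rewards = [], []
    for _ in range(steps):
        obs, reward = env_step(action=None)
        history.append(obs)
        rewards.append(reward)          # arrives on a separate, privileged channel
    return rewards

def intrinsic_reward(history):
    """The agent's own appraisal, computed from its observation history."""
    return float(history[-1] > 0.5)

def run_intrinsic(steps=5):
    """'All reward is intrinsic' framing: only observations come in."""
    history, rewards = [], []
    for _ in range(steps):
        obs, _ = env_step(action=None)  # ignore the "official" reward channel
        history.append(obs)
        rewards.append(intrinsic_reward(history))
    return rewards

print(run_extrinsic(), run_intrinsic())
```

In this toy the two loops compute the same function of the observations; the only difference is where the reward computation is imagined to live, which is the sense in which a "privileged" channel is a modeling choice rather than a feature of the world.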

pathologies of human minds that might recur in AI systems?

  • stuck priors / "trauma"
  • warring subagents
  • contraction / reification / tunnel vision