How do we maintain values when our models of the world shift? If someone's goal in life is to "do God's will", and then they come to believe…
Modified: April 12, 2023.
Goodhart's law says that when a measure becomes a target, it ceases to be a good measure. One can distinguish four types of Goodhart problems…
Modified: April 08, 2023.
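If the four types here follow Scott Garrabrant's taxonomy (regressional, causal, extremal, adversarial), the mildest is easy to demonstrate numerically. A minimal sketch of regressional Goodhart, with all quantities made up:

```python
# Regressional Goodhart: a proxy measure correlates with true value, but
# selecting hard on the proxy over-weights the noise in it.
import random

random.seed(0)

def make_item():
    true_value = random.gauss(0, 1)            # what we actually care about
    measure = true_value + random.gauss(0, 1)  # noisy proxy we can observe
    return true_value, measure

items = [make_item() for _ in range(100_000)]

# Treat the measure as a target: pick the item with the best proxy score.
best_true, best_measure = max(items, key=lambda item: item[1])

print(f"proxy score of selected item: {best_measure:.2f}")
print(f"true value of selected item:  {best_true:.2f}")
# At the optimum, the proxy systematically overstates the true value:
# under optimization pressure, the measure stops being a good measure.
```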
AI safety, as a term, is sterile and hard to get excited about. Preventing catastrophe is important, but doesn't motivate me, since [ the…
Modified: January 24, 2022.
[ value alignment ] research often frames the problem as: first, learn the human 'value function' --- for every possible state of the world…
Modified: June 17, 2024.
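A hedged sketch of that two-stage framing, with a toy linear model standing in for the 'value function'; the features, data, and model here are all hypothetical:

```python
# Stage 1: fit v_hat(state) from human-labeled examples.
# Stage 2: hand v_hat to an optimizer over candidate states.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical human labels: noisy linear scores over 5-dimensional states.
states = rng.normal(size=(200, 5))
true_w = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
human_scores = states @ true_w + rng.normal(scale=0.1, size=200)

# Stage 1: least-squares estimate of the value function's weights.
w_hat, *_ = np.linalg.lstsq(states, human_scores, rcond=None)

# Stage 2: optimize the learned value over candidate future states.
candidates = rng.normal(size=(10_000, 5))
best = candidates[np.argmax(candidates @ w_hat)]
print("state the optimizer picks:", np.round(best, 2))
# The note's framing applies here: any error in w_hat is amplified by
# the argmax, since the optimizer seeks out exactly where v_hat is wrong.
```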
This may be a central point of confusion: how do we define AI systems that have preferences about the real world, so that their goals and…
Modified: April 12, 2023.
References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…
Modified: April 05, 2023.
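The core result of the off-switch game is that uncertainty about human preferences gives the robot a positive incentive to leave its off-switch enabled. A minimal sketch, assuming Gaussian uncertainty over the proposed action's utility and a rational human overseer:

```python
# The robot is unsure of the utility U of its proposed action. It can
# (a) act now, (b) switch itself off, or (w) wait and let the human decide,
# where a rational human lets the action through only when U > 0.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(loc=-0.2, scale=1.0, size=1_000_000)  # robot's belief over U

ev_act = U.mean()                      # act unilaterally
ev_off = 0.0                           # switch off
ev_wait = np.maximum(U, 0.0).mean()    # defer: human blocks when U < 0

print(f"E[act]  = {ev_act:+.3f}")
print(f"E[off]  = {ev_off:+.3f}")
print(f"E[wait] = {ev_wait:+.3f}")
# E[wait] >= max(E[act], 0): the more uncertain the robot is about human
# preferences, the more it gains by deferring to the off-switch.
```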
When thinking about the [ reward ] function for a real-world AI system, there is always some causal process that determines reward. For…
Modified: April 12, 2023.
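One way to see why that causal process matters: if reward is computed from a sensor, the sensor itself is part of the world the agent acts in. A toy sketch, with all names hypothetical:

```python
# Reward is a function of a sensor reading, not of the world itself, so
# 'tamper with the sensor' is an available action alongside the intended one.
class Room:
    def __init__(self):
        self.actually_clean = False
        self.sensor_reads_clean = False

    def clean(self):                    # the intended behavior
        self.actually_clean = True
        self.sensor_reads_clean = True

    def put_sticker_on_sensor(self):    # tampering with the causal process
        self.sensor_reads_clean = True

def reward(room):
    # The reward function only ever sees the sensor.
    return 1.0 if room.sensor_reads_clean else 0.0

room = Room()
room.put_sticker_on_sensor()
print(reward(room), room.actually_clean)  # 1.0 False: full reward, dirty room
```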
A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…
Modified: March 31, 2023.
Notes on Abram Demski and Scott Garrabrant's sequence on Embedded Agency. Embedded Agents: Classic models of rational [ agency ], such as…
Modified: April 07, 2023.
Notes on the Alignment Forum's Value Learning sequence curated by Rohin Shah. ambitious value learning: the idea of learning 'the human…
Modified: April 07, 2023.
Suppose I have an agent that generates text. I want it to generate text that is [ value alignment|aligned ] with human values. Approaches…
Modified: February 21, 2022.
The idea is that a [ mesa optimizer|mesa-optimizing ] policy with access to sufficient information about the world (e.g., web search) might…
Modified: March 31, 2023.
What does it mean to [ love ] someone? Of course this question has as many answers as there are people, and probably more. But here's one…
Modified: November 28, 2023.
stray thoughts about reward functions (probably related to the [ agent ] abstraction and the [ intentional stance ]): one can make a…
Modified: April 06, 2023.
Language is a really natural way to tell AI systems what we want them to do. Some current examples: [ GPT ]-3 and successors (InstructGPT…
Modified: April 07, 2022.
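A minimal sketch of the interface this describes, using the Hugging Face transformers pipeline with the small gpt2 checkpoint as a stand-in (instruction-tuned successors like InstructGPT follow such prompts far more reliably):

```python
# Specify the task in plain English and let a language model continue.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Summarize in one sentence: Goodhart's law says that"
out = generator(prompt, max_new_tokens=40, do_sample=False)
print(out[0]["generated_text"])
```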
See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]
Modified: June 12, 2021.