How do we maintain values when our models of the world shift? If someone's goal in life is to "do God's will", and then they come to believe…
Modified: April 12, 2023.
Goodhart's law says that when a measure becomes a target, it ceases to be a good measure. One can distinguish four types of Goodhart problems…
Modified: April 08, 2023.
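If the four types here follow Scott Garrabrant's taxonomy (regressional, causal, extremal, adversarial), the mildest is easy to demonstrate numerically. A minimal sketch of regressional Goodhart, with all quantities made up:

```python
# Regressional Goodhart: a proxy measure correlates with true value, but
# selecting hard on the proxy over-weights the noise in it.
import random

random.seed(0)

def make_item():
    true_value = random.gauss(0, 1)            # what we actually care about
    measure = true_value + random.gauss(0, 1)  # noisy proxy we can observe
    return true_value, measure

items = [make_item() for _ in range(100_000)]

# Treat the measure as a target: pick the item with the best proxy score.
best_true, best_measure = max(items, key=lambda item: item[1])

print(f"proxy score of selected item: {best_measure:.2f}")
print(f"true value of selected item:  {best_true:.2f}")
# At the optimum, the proxy systematically overstates the true value:
# under optimization pressure, the measure stops being a good measure.
```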
AI safety, as a term, is sterile and hard to get excited about. Preventing catastrophe is important, but doesn't motivate me, since [ the…
Modified: January 24, 2022.
[ value alignment ] research often frames the problem as: first, learn the human 'value function' --- for every possible state of the world…
Modified: June 17, 2024.
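A hedged sketch of that two-stage framing, with a toy linear model standing in for the 'value function'; the features, data, and model here are all hypothetical:

```python
# Stage 1: fit v_hat(state) from human-labeled examples.
# Stage 2: hand v_hat to an optimizer over candidate states.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical human labels: noisy linear scores over 5-dimensional states.
states = rng.normal(size=(200, 5))
true_w = np.array([1.0, 0.5, 0.0, -0.5, -1.0])
human_scores = states @ true_w + rng.normal(scale=0.1, size=200)

# Stage 1: least-squares estimate of the value function's weights.
w_hat, *_ = np.linalg.lstsq(states, human_scores, rcond=None)

# Stage 2: optimize the learned value over candidate future states.
candidates = rng.normal(size=(10_000, 5))
best = candidates[np.argmax(candidates @ w_hat)]
print("state the optimizer picks:", np.round(best, 2))
# The note's framing applies here: any error in w_hat is amplified by
# the argmax, since the optimizer seeks out exactly where v_hat is wrong.
```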
This may be a central point of confusion: how do we define AI systems that have preferences about the real world, so that their goals and…
Modified: April 12, 2023.
References: Cooperative Inverse Reinforcement Learning; The Off-Switch Game; Incorrigibility in the CIRL Framework. The CIRL setting models…
Modified: April 05, 2023.
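The core result of the off-switch game is that uncertainty about human preferences gives the robot a positive incentive to leave its off-switch enabled. A minimal sketch, assuming Gaussian uncertainty over the proposed action's utility and a rational human overseer:

```python
# The robot is unsure of the utility U of its proposed action. It can
# (a) act now, (b) switch itself off, or (w) wait and let the human decide,
# where a rational human lets the action through only when U > 0.
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(loc=-0.2, scale=1.0, size=1_000_000)  # robot's belief over U

ev_act = U.mean()                      # act unilaterally
ev_off = 0.0                           # switch off
ev_wait = np.maximum(U, 0.0).mean()    # defer: human blocks when U < 0

print(f"E[act]  = {ev_act:+.3f}")
print(f"E[off]  = {ev_off:+.3f}")
print(f"E[wait] = {ev_wait:+.3f}")
# E[wait] >= max(E[act], 0): the more uncertain the robot is about human
# preferences, the more it gains by deferring to the off-switch.
```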
When thinking about the [ reward ] function for a real-world AI system, there is always some causal process that determines reward. For…
Modified: April 12, 2023.
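One way to see why that causal process matters: if reward is computed from a sensor, the sensor itself is part of the world the agent acts in. A toy sketch, with all names hypothetical:

```python
# Reward is a function of a sensor reading, not of the world itself, so
# 'tamper with the sensor' is an available action alongside the intended one.
class Room:
    def __init__(self):
        self.actually_clean = False
        self.sensor_reads_clean = False

    def clean(self):                    # the intended behavior
        self.actually_clean = True
        self.sensor_reads_clean = True

    def put_sticker_on_sensor(self):    # tampering with the causal process
        self.sensor_reads_clean = True

def reward(room):
    # The reward function only ever sees the sensor.
    return 1.0 if room.sensor_reads_clean else 0.0

room = Room()
room.put_sticker_on_sensor()
print(reward(room), room.actually_clean)  # 1.0 False: full reward, dirty room
```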
A very incomplete and maybe nonsensical intuition I want to explore. Classically, people talk about very simple [ reward ] functions like…
Modified: March 31, 2023.
Notes on Abram Demski and Scott Garrabrant's sequence on Embedded Agency. Embedded Agents: Classic models of rational [ agency ], such as…
Modified: April 07, 2023.
Notes on the Alignment Forum's Value Learning sequence curated by Rohin Shah. ambitious value learning: the idea of learning 'the human…
Modified: April 07, 2023.
Suppose I have an agent that generates text. I want it to generate text that is [ value alignment|aligned ] with human values. Approaches…
Modified: February 21, 2022.
The idea is that a [ mesa optimizer|mesa-optimizing ] policy with access to sufficient information about the world (e.g., web search) might…
Modified: March 31, 2023.
What does it mean to [ love ] someone? Of course this question has as many answers as there are people, and probably more. But here's one…
Modified: November 28, 2023.
stray thoughts about reward functions (probably related to the [ agent ] abstraction and the [ intentional stance ]): one can make a…
Modified: April 06, 2023.
Language is a really natural way to tell AI systems what we want them to do. Some current examples: [ GPT ]-3 and successors (InstructGPT…
Modified: April 07, 2022.
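A minimal sketch of the interface this describes, using the Hugging Face transformers pipeline with the small gpt2 checkpoint as a stand-in (instruction-tuned successors like InstructGPT follow such prompts far more reliably):

```python
# Specify the task in plain English and let a language model continue.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "Summarize in one sentence: Goodhart's law says that"
out = generator(prompt, max_new_tokens=40, do_sample=False)
print(out[0]["generated_text"])
```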
See also: [ cooperative inverse reinforcement learning ], [ love is value alignment ]
Modified: June 12, 2021.