References: Direct Preference Optimization: Your Language Model is Secretly a Reward Model This seems like a compelling reframing of…
The standard [ Markov decision process ] formalism includes a reward function ; the total (discounted) reward across a trajectory is its…
Different experimental conditions may give rise to different outcomes . For example, let the variable indicate whether a person is…
A nice paper that gets at some subtleties of calibration: Daniel D. Johnson, Daniel Tarlow, David Duvenaud, Chris J. Maddison. Experts Don't…