reinforced self-training
Created: August 30, 2023
Modified: October 24, 2024

reinforced self-training

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

thoughts on the reinforced self-training (ReST) paper: https://arxiv.org/abs/2308.08998

the basic idea is very simple. we sample additional trajectories from a behavior-cloned policy, score them with the reward model, and choose the best ones to add to the dataset to train a new behavior-cloned policy. (or you can use any other offline RL method that takes in scored trajectories. but it seems like behavioral cloning actually works best lol)
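
a rough sketch of that loop as I understand it, in code. sample_trajectories, reward_model.score, and behavior_clone are hypothetical stand-ins rather than the paper's actual API, and the single fixed threshold is a simplification (I believe the paper raises the filtering threshold over several improve steps):

```python
# rough sketch of the ReST-style loop; sample_trajectories, reward_model.score,
# and behavior_clone are hypothetical stand-ins, not the paper's API
def rest_loop(policy, prompts, dataset, reward_model,
              n_rounds=3, samples_per_prompt=8, threshold=0.7):
    for _ in range(n_rounds):
        # grow: sample extra trajectories from the current policy
        candidates = [traj
                      for prompt in prompts
                      for traj in sample_trajectories(policy, prompt, samples_per_prompt)]
        # score them with the (fixed) reward model
        scored = [(traj, reward_model.score(traj)) for traj in candidates]
        # improve: keep only the high-reward trajectories and add them to the dataset
        # (a single fixed cutoff here; the paper's schedule is more involved)
        dataset = dataset + [traj for traj, r in scored if r >= threshold]
        # retrain by behavioral cloning on the augmented dataset, or swap in
        # any other offline RL method that consumes scored trajectories
        policy = behavior_clone(dataset)
    return policy
```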

I guess this assumes that sampling lots of trajectories and scoring them is cheap. the main advantage over something like PPO is that this works offline, so after putting in the work of generating and scoring trajectories, you can (in a sense) take as many gradient steps as you like.

this must be an existing method. have people just not really experimented with offline RL before?

it's similar in a way to DAgger, which also iteratively augments the dataset. But DAgger asks an expert to annotate each sampled trajectory with better moves, while ReST just filters with the existing reward model. This means ReST needs an initial policy that is already at least decent, i.e. capable of getting good reward some of the time, or else the whole thing never gets off the ground.
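
to make the contrast concrete, a hypothetical side-by-side of the two augmentation steps (rollout, expert_relabel, and reward_model.score are made-up names):

```python
def dagger_augment(policy, prompts, expert, dataset):
    # DAgger: roll out the current policy, then have an expert relabel
    # each visited state with the action it would have taken
    for prompt in prompts:
        traj = rollout(policy, prompt)
        dataset += expert_relabel(expert, traj)
    return dataset

def rest_augment(policy, prompts, reward_model, dataset, threshold=0.7):
    # ReST: roll out the current policy and keep whole trajectories that the
    # reward model already scores highly; no expert needed, but the policy
    # has to produce some good trajectories in the first place
    for prompt in prompts:
        traj = rollout(policy, prompt)
        if reward_model.score(traj) >= threshold:
            dataset.append(traj)
    return dataset
```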

they relate it to self-training, which is the idea that you train a classifier on labeled data, use it to label some unlabeled data, then retrain, etc.
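
the generic self-training loop looks roughly like this (train_classifier and predict_proba are hypothetical stand-ins; keeping only confident pseudo-labels is the usual trick):

```python
# generic self-training (pseudo-labeling) loop for a classifier
def self_train(labeled, unlabeled, n_rounds=3, confidence=0.9):
    model = train_classifier(labeled)
    for _ in range(n_rounds):
        pseudo, leftover = [], []
        for x in unlabeled:
            probs = model.predict_proba(x)
            label = max(range(len(probs)), key=lambda i: probs[i])
            if probs[label] >= confidence:
                # trust only confident predictions as pseudo-labels
                pseudo.append((x, label))
            else:
                leftover.append(x)
        # retrain on the real labels plus the pseudo-labeled examples
        model = train_classifier(labeled + pseudo)
        unlabeled = leftover
    return model
```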