Modified: March 26, 2025
Thompson sampling
This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In a bandit setting, in each round we see a context $x$, choose an action $a$, and receive a reward $r$ sampled from a distribution $p(r \mid x, a, \theta)$ with some unknown parameter $\theta$. As Bayesians, we assume a prior $p(\theta)$ over possible worlds. At each step, then, with past observations $\mathcal{D}$, we have a posterior $p(\theta \mid \mathcal{D})$ over what world we are in (i.e., what the reward function is).
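Concretely, writing the history as $\mathcal{D} = \{(x_i, a_i, r_i)\}_{i=1}^n$ (my notation here), the posterior is just the usual Bayes update; the product form assumes rewards are conditionally independent given $\theta$:

$$p(\theta \mid \mathcal{D}) \;\propto\; p(\theta) \prod_{i=1}^{n} p(r_i \mid x_i, a_i, \theta).$$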
Thompson sampling is the strategy that chooses each action with probability proportional to the chance that it is the 'best' action:

$$\pi(a \mid x, \mathcal{D}) = \int p(\theta \mid \mathcal{D}) \, \mathbf{1}\!\left[ a = \arg\max_{a'} \mathbb{E}[r \mid x, a', \theta] \right] d\theta$$
That is, for each action, we take all the worlds in which it is the best action, and total up their probability. Assuming for simplicity that the argmax is well defined, i.e., that there is a unique best action in each world, this counts each world exactly once, and so the action probabilities sum to one.
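Spelling that out (the only step is exchanging the sum over actions with the integral over worlds):

$$\sum_a \pi(a \mid x, \mathcal{D}) = \int p(\theta \mid \mathcal{D}) \sum_a \mathbf{1}\!\left[ a = \arg\max_{a'} \mathbb{E}[r \mid x, a', \theta] \right] d\theta = \int p(\theta \mid \mathcal{D}) \, d\theta = 1,$$

since with a unique best action the inner sum is exactly 1 in every world.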
(without a unique maximum, we'd need to write this as

$$\pi(a \mid x, \mathcal{D}) \propto \int p(\theta \mid \mathcal{D}) \, \mathbf{1}\!\left[ \mathbb{E}[r \mid x, a, \theta] = \max_{a'} \mathbb{E}[r \mid x, a', \theta] \right] d\theta,$$
where we only get the probabilities up to a proportionality constant. For example, in the extreme case where all worlds are totally indifferent between all actions, so every action achieves the maximum possible reward in each world, the indicator always fires and we'd get a total world-probability of 1 for each action, which obviously requires normalization to get a valid distribution over actions).
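Written out, the normalization just divides by the total indicator mass across actions (this is only me making the proportionality explicit):

$$\pi(a \mid x, \mathcal{D}) = \frac{\int p(\theta \mid \mathcal{D}) \, \mathbf{1}\!\left[ \mathbb{E}[r \mid x, a, \theta] = \max_{a'} \mathbb{E}[r \mid x, a', \theta] \right] d\theta}{\sum_{a''} \int p(\theta \mid \mathcal{D}) \, \mathbf{1}\!\left[ \mathbb{E}[r \mid x, a'', \theta] = \max_{a'} \mathbb{E}[r \mid x, a', \theta] \right] d\theta}.$$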
Suppose there are two actions, $a_1$ and $a_2$. In 75% of worlds, $a_1$ gives the highest expected reward, and in the other 25% of worlds, $a_2$ gives the highest reward. So we want to play $a_1$ with 75% probability and $a_2$ otherwise.
An easy way to implement this is to first sample a world, and then just choose the optimal action in that world. This is the usual Thompson sampling algorithm:
# Choose action: sample a world, then act optimally in that sampled world
theta ~ P(theta | D)
a = argmax_{a'} expected_reward(x, a', theta)
# Act and update posterior for next step
r = act(a)  # sample from P(r | x, a, theta_true)
P = update_posterior(P, D_new=(x, a, r))
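As a concrete sketch of this loop (my own illustration, not from any particular source): a two-armed Bernoulli bandit with conjugate Beta posteriors and no context, where the true reward rates true_probs and the Beta(1, 1) priors are arbitrary choices for the example.

import numpy as np

rng = np.random.default_rng(0)
true_probs = np.array([0.6, 0.5])     # unknown Bernoulli reward rates (plays the role of theta_true)
alpha, beta = np.ones(2), np.ones(2)  # Beta(1, 1) posterior parameters for each arm

for t in range(1000):
    # Choose action: sample a world (a reward rate per arm) from the posterior,
    # then play the arm that is best in that sampled world.
    theta = rng.beta(alpha, beta)
    a = int(np.argmax(theta))
    # Act: observe a Bernoulli reward from the true, unknown world.
    r = rng.random() < true_probs[a]
    # Update posterior: conjugate Beta-Bernoulli update for the chosen arm.
    alpha[a] += r
    beta[a] += 1 - r

print("posterior means:", alpha / (alpha + beta))

Early on the Beta posteriors are wide, so the sampled worlds disagree and both arms get played; as the posterior concentrates, the sampled argmax settles on the better arm and exploration tapers off on its own.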
Geometric exploration?

Scott Garrabrant claims we can write Thompson sampling as acting according to a geometric expectation, taken over the posterior, of the probability assigned to each world's best action $a^*(\theta) = \arg\max_{a'} \mathbb{E}[r \mid x, a', \theta]$:

$$\pi_{\text{TS}} = \arg\max_{\pi} \; \exp\left( \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\left[ \log \pi\!\left(a^*(\theta)\right) \right] \right),$$
or equivalently, as maximizing the expected log probability of the best action:

$$\pi_{\text{TS}} = \arg\max_{\pi} \; \mathbb{E}_{\theta \sim p(\theta \mid \mathcal{D})}\left[ \log \pi\!\left(a^*(\theta)\right) \right].$$
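As a quick numerical sanity check on that claim (the 75/25 numbers and the grid search are just my own illustration, reusing the two-action example above):

import numpy as np

p_best = np.array([0.75, 0.25])           # posterior probability that each action is the best one
q = np.linspace(1e-6, 1 - 1e-6, 100001)   # candidate values of pi(a_1)

# Expected log probability assigned to the best action, E_theta[log pi(a*(theta))].
objective = p_best[0] * np.log(q) + p_best[1] * np.log(1 - q)
print("maximizing pi(a_1):", q[np.argmax(objective)])  # ~0.75, the Thompson sampling probability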
Does this follow from a general property of maximizing expected logs? Let's simplify to a general case where we are optimizing a distribution to maximize an expected log probability of