Thompson sampling: Nonlinear Function
Created: March 26, 2025
Modified: March 26, 2025

Thompson sampling

This page is from my personal notes, and has not been specifically reviewed for public consumption. It might be incomplete, wrong, outdated, or stupid. Caveat lector.

In a bandit setting, in each round we see a context $x$, choose an action $a$, and receive a reward $r$ sampled from a distribution $P(r | x, a, \theta)$ with some unknown parameter $\theta$. As Bayesians, we assume a prior $p(\theta)$ over possible worlds. At each step, then, with past observations $\mathcal{D} = \{(x, a, r)\}$, we have a posterior $P(\theta | \mathcal{D})$ over what world we are in (i.e., what the reward function is).
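As one concrete instance (an illustration assuming a contextless Bernoulli bandit with an independent Beta prior on each arm's success probability, which the general setup doesn't require), the posterior stays in closed form:

$$
\theta_a \sim \mathrm{Beta}(\alpha, \beta), \qquad r \,|\, a, \theta \sim \mathrm{Bernoulli}(\theta_a), \qquad \theta_a \,|\, \mathcal{D} \sim \mathrm{Beta}(\alpha + s_a,\ \beta + f_a),
$$

where $s_a$ and $f_a$ count the observed successes and failures on arm $a$ in $\mathcal{D}$.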

Thompson sampling is the strategy that chooses each action with probability proportional to the chance that it is the 'best' action:

$$
\begin{align*}
p_\text{thompson}(a) &= P\left(a = \text{argmax}_{a'}\, \mathbb{E}[r | x, a', \theta] \,\middle|\, \mathcal{D}\right)\\
&= \mathbb{E}_{\theta | \mathcal{D}}\left[\mathbb{1}\left(a = \text{argmax}_{a'}\, \mathbb{E}[r | x, a', \theta]\right)\right]\\
&= \sum_\theta P(\theta | \mathcal{D})\, \mathbb{1}\left(a = \text{argmax}_{a'}\, \mathbb{E}[r | x, a', \theta]\right)
\end{align*}
$$

That is, for each action, we take all the worlds in which it is the best action, and total up their probability. If we assume for simplicity that the argmax is well defined, i.e., that there is a unique best action in each world, then this counts each world exactly once, and so the action probabilities sum to one.

(without a unique maximum, we'd need to write this as

$$
\begin{align*}
p_\text{thompson}(a) &\propto \sum_\theta P(\theta | \mathcal{D})\, \mathbb{1}\left(\mathbb{E}[r | x, a, \theta] = \text{max}_{a'}\, \mathbb{E}[r | x, a', \theta]\right)
\end{align*}
$$

where we only get the probabilities up to a proportionality constant. E.g., in the extreme case where every world is totally indifferent between all actions, so that every action achieves the maximum expected reward in every world, the indicator always fires, and we'd get a total world-probability of 1 for each action, which obviously requires normalization to get a valid distribution over actions).

Suppose there are two actions, $a_1$ and $a_2$. In 75% of worlds, $a_1$ gives the highest expected reward, and in the other 25% of worlds, $a_2$ gives the highest reward. So we want to play $a_1$ with 75% probability and $a_2$ otherwise.
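To make the sum-over-worlds definition concrete, here is a small sketch (the discrete world probabilities and reward tables are made up, chosen so that $a_1$ is best in 75% of the posterior mass):

import numpy as np

# Hypothetical discrete posterior over three worlds, and the expected reward
# of each action (a1, a2) in each world.
world_probs = np.array([0.40, 0.35, 0.25])
expected_rewards = np.array([
    [1.0, 0.2],   # world 1: a1 is best
    [0.6, 0.5],   # world 2: a1 is best
    [0.1, 0.9],   # world 3: a2 is best
])

best_action = expected_rewards.argmax(axis=1)   # best action in each world
p_thompson = np.array([world_probs[best_action == a].sum()
                       for a in range(expected_rewards.shape[1])])
print(p_thompson)   # [0.75 0.25]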

An easy way to implement this is to first sample a world, and then just choose the optimal action in that world. This is the usual Thompson sampling algorithm:

# Choose action
theta ~ P(theta | D)                               # sample a world from the posterior
a = argmax_a' ( expected_reward(x, a', theta) )    # best action in the sampled world

# Act and update posterior for next step
r = act(a)  # sample from P(r | x, a, theta_true)
posterior = update_posterior(posterior, new_data=(x, a, r))
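A runnable version of the same loop, as a minimal sketch assuming the Beta-Bernoulli setting (two arms, made-up true success probabilities, uniform Beta(1, 1) priors):

import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([0.3, 0.6])   # the 'true world'; unknown to the agent
alpha = np.ones(2)                  # Beta posterior parameters, one pair per arm
beta = np.ones(2)

for t in range(1000):
    # Choose action: sample a world from the posterior, act greedily in it
    theta = rng.beta(alpha, beta)
    a = int(np.argmax(theta))

    # Act and update posterior for next step
    r = rng.random() < true_theta[a]   # Bernoulli reward from the true world
    alpha[a] += r
    beta[a] += 1 - r

# Each pull adds 1 to either alpha or beta, so this is the pull count per arm;
# the better arm should dominate.
print(alpha + beta - 2)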

Geometric exploration?

Scott Garrabrant claims we can write Thompson sampling as acting according to a geometric expectation

$$
p_\text{thompson} = \text{argmax}_{p \in \Delta A}\; \mathbb{G}_{\theta \sim P(\theta | \mathcal{D})}\, \mathbb{P}_{a_t \sim p}\left(a_t = \text{argmax}_{a'}\, \mathbb{E}[r | x, a', \theta]\right)
$$

or equivalently, as maximizing the expected log probability of the best action:

$$
p_\text{thompson} = \text{argmax}_{p \in \Delta A}\; \mathbb{E}_{\theta \sim P(\theta | \mathcal{D})} \log \mathbb{P}_{a_t \sim p}\left(a_t = \text{argmax}_{a'}\, \mathbb{E}[r | x, a', \theta]\right)
$$
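Since the geometric expectation is just $\exp$ of an expected log, and $\exp$ is monotone, the two objectives pick out the same maximizer. A quick numerical check of the claim (a sketch using the 75%/25% two-action example from earlier):

import numpy as np

probs_best = np.array([0.75, 0.25])     # posterior probability that each action is best
grid = np.linspace(0.001, 0.999, 999)   # candidate values of p(a1)

# Expected log probability assigned to the best action, as a function of p(a1)
objective = probs_best[0] * np.log(grid) + probs_best[1] * np.log(1.0 - grid)
print(grid[np.argmax(objective)])       # ~0.75, matching the Thompson probability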

Does this follow from a general property of maximizing expected logs? Let's simplify to a general setting where we optimize a distribution $p$ to maximize the expected log probability it assigns to some $\theta$-dependent outcome $e(\theta)$:

$$
\text{argmax}_{p}\; \mathbb{E}_{\theta}\left[\log p(e(\theta))\right]
$$
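I think the answer is yes, by the usual cross-entropy argument (a sketch, writing $q$ for the distribution of $e(\theta)$ induced by the posterior, i.e., $q(e) = P(e(\theta) = e)$):

$$
\mathbb{E}_{\theta}\left[\log p(e(\theta))\right] = \sum_{e} q(e) \log p(e) = -H(q) - D_{\mathrm{KL}}(q \,\|\, p),
$$

which is maximized exactly when $D_{\mathrm{KL}}(q \,\|\, p) = 0$, i.e., when $p = q$. With $e(\theta) = \text{argmax}_{a'}\, \mathbb{E}[r | x, a', \theta]$, the maximizing $p$ assigns each action its posterior probability of being the best action, which is exactly $p_\text{thompson}$.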