How do you choose between trying new things and doing what you know works?
You're a restaurant critic in a new city. There are 50 restaurants you haven't tried. You have 10 dinner slots. Tonight you tried Restaurant A and it was excellent. Should you go back tomorrow? Or try one of the other 49?
If you always return to Restaurant A, you might miss a better option. If you always try new places, you'll gather information but eat at mediocre restaurants. There's a real tension: exploitation (use what you know works) vs. exploration (try new things to gain information).
This dilemma appears everywhere in decision-making:
| Domain | Exploit | Explore |
|---|---|---|
| Clinical trials | Give patients the best known drug | Test new drugs on some patients |
| Web A/B testing | Show the best-converting variant | Test new designs to learn better |
| Robotics | Follow the known-good policy | Try new actions to discover rewards |
| Recommender systems | Show items the user likes | Show new items to learn preferences |
| Drug discovery | Synthesize known effective compounds | Test novel molecular structures |
Strip away everything except the core dilemma. You face K slot machines (one-armed bandits). Each machine i has an unknown reward distribution with mean μi. At each time step t, you pull one machine and receive a reward. Your goal: maximize total reward over T pulls.
This is the multi-armed bandit problem (MAB). The name comes from imagining a gambler at a casino with K slot machines (“one-armed bandits”), each with unknown payout rates. The gambler must decide which arms to pull to maximize cumulative reward.
The hardness of the bandit problem comes from the estimation-exploration tradeoff: to estimate μi accurately, you need many pulls of arm i. But every pull of i is a pull not spent on the potentially better arm j. You're spending reward to buy knowledge.
Key quantities:
A Bayesian agent maintains a posterior distribution over the reward parameters of each arm. As rewards are observed, the posterior updates. At any time, the agent has a full probability distribution over "how good each arm is" — not just a point estimate.
For Bernoulli rewards (success/failure, like a click or no-click), the Beta distribution is the natural conjugate prior. If arm i has had αi successes and βi failures, its posterior is:
The prior Beta(1, 1) is uniform — complete ignorance. After 3 successes and 2 failures: Beta(4, 3). The posterior mean is α/(α+β) = 4/7 ≈ 0.57. The posterior variance αβ/[(α+β)2(α+β+1)] tells you uncertainty.
For Gaussian rewards (continuous, like restaurant quality scores), the normal-inverse-gamma distribution is the conjugate prior. But in practice, many algorithms use a simpler approximation: track the sample mean μ̂i and the uncertainty σi = σ0/√ni, where ni is the number of pulls and σ0 is the prior std.
Simulate pulling a Bernoulli arm with true probability p. The Beta posterior (teal) updates with each pull and tightens around the true value (orange dashed).
You've been to Restaurant A 50 times. You know it's great. Should you ever try the others? The simplest policy: most of the time, exploit (go to the best-known option), but occasionally, pick a random option to explore. This is ε-greedy.
At each time step: with probability ε, choose a uniformly random arm. With probability 1−ε, choose the arm with the highest current estimate μ̂i.
A key insight: ε-greedy has a problem even when it works. When it does explore, it explores uniformly — picking a random arm with equal probability. This is unintelligent: it spends equal time on arms that are nearly as good as the best (small Δ) and arms that are clearly terrible (large Δ). Smarter exploration should focus on arms with high uncertainty, not uniform random arms.
Decaying ε addresses the fixed exploration problem: set εt = min(1, c/(t · Δ)) for some constant c and estimated gap Δ. Early on, explore a lot. Later, when you know more, exploit more. In practice, a simple schedule like εt = 1/log(t) or εt = ε0/√t is common.
python import numpy as np class EpsilonGreedy: def __init__(self, K, eps=0.1): self.K = K; self.eps = eps self.counts = np.zeros(K) # pulls per arm self.values = np.zeros(K) # sample mean per arm def select(self): if np.random.random() < self.eps: return np.random.randint(self.K) # random exploration return np.argmax(self.values) # greedy exploitation def update(self, arm, reward): self.counts[arm] += 1 # Online mean update: no need to store all rewards n = self.counts[arm] self.values[arm] += (reward - self.values[arm]) / n
ε-greedy has a binary choice: exploit (pick the best) or explore (pick uniformly random). Boltzmann exploration (also called softmax exploration) blends these smoothly: pull arm i with probability proportional to exp(μ̂i / τ), where τ is the temperature parameter.
This is the Boltzmann distribution from statistical physics — hence the name. At high temperature (τ → ∞), all arms have equal probability (pure exploration). At low temperature (τ → 0), the best arm gets all the probability (pure exploitation).
Boltzmann exploration is more intelligent than ε-greedy: it preferentially explores arms with high estimated value over arms with low value. If arm 1 has estimate 5 and arm 2 has estimate 3 and arm 3 has estimate 1, Boltzmann explores arm 1 most during exploration phase, not uniformly. This concentrates exploration on competitive arms.
5 arms with fixed estimates. Temperature τ controls how peaked the distribution is toward the best arm (highest bar = most likely to be selected).
Both ε-greedy and Boltzmann exploration are undirected: they explore randomly, not based on where knowledge is actually lacking. The Upper Confidence Bound (UCB) algorithm is directed: it explicitly tracks which arms are uncertain and systematically explores them.
The key idea: for each arm i, maintain a confidence interval [μ̂i − CIi, μ̂i + CIi] that contains the true μi with high probability. Then select the arm with the highest upper bound of its interval:
The term √(log(t) / ni) is the exploration bonus: it's large when arm i has been pulled few times (small ni), encouraging exploration, and shrinks as the arm is pulled more. The constant c controls how aggressively to explore.
Why does UCB work? Consider arm i with true mean μi = 5. If it's been pulled enough, μ̂i ≈ 5 and the upper bound ≈ 5 + small correction. The best arm j has upper bound μ̂j + correction > 5 (since j is better). So UCB prefers j. But if i is new and μ̂i is based on only 1 pull, the upper bound is huge — UCB will explore i. Once it's explored enough to know it's suboptimal, UCB stops exploring it. Exploration is automatic and principled.
Each bar is an arm. The teal bar is the mean estimate; the full bar is the upper confidence bound. UCB selects the arm with the tallest total bar. Watch intervals shrink as arms are pulled.
UCB is deterministic and uses a pessimistic confidence interval. Thompson Sampling (1933, re-popularized 2010s) is probabilistic and Bayesian. The idea: at each step, randomly sample from your posterior distribution over each arm's mean, then pick the arm whose sample is highest.
Thompson Sampling has beautiful theoretical guarantees. For Bernoulli bandits, it achieves the Lai-Robbins lower bound asymptotically. In practice, it often outperforms UCB empirically, especially in the early rounds when uncertainty is high and the Bayesian framework naturally handles the prior.
python import numpy as np from scipy.stats import beta class ThompsonSampling: def __init__(self, K): self.K = K self.alpha = np.ones(K) # Beta prior: alpha = successes + 1 self.beta_ = np.ones(K) # Beta prior: beta = failures + 1 def select(self): # Sample one value from each arm's posterior samples = beta.rvs(self.alpha, self.beta_) return np.argmax(samples) # pick the highest sample def update(self, arm, reward): # Bernoulli reward: 1 = success, 0 = failure self.alpha[arm] += reward self.beta_[arm] += 1 - reward
UCB and Thompson Sampling are near-optimal. But there exists a provably optimal solution to the discounted bandit problem: the Gittins index (Gittins, 1979). For a bandit with discount factor γ < 1, the optimal strategy is to always pull the arm with the highest Gittins index.
The Gittins index of arm i in state (information about arm i) is defined as the "fair entrance fee" you'd charge to swap arm i for a guaranteed constant stream of rewards. More precisely, it's the value ν such that you are indifferent between:
(a) Playing arm i forever with the option to switch to a guaranteed reward ν at any time.
(b) Receiving ν every period starting now.
For Bernoulli arms with Beta posterior, the Gittins index G(α, β) depends on the current posterior state (α, β). Computing it requires solving a dynamic programming problem for each (possibly discretized) posterior state. Tables are precomputed:
| α | β | Posterior mean | Gittins index (γ=0.9) |
|---|---|---|---|
| 1 | 1 | 0.50 | 0.7029 |
| 2 | 1 | 0.67 | 0.7798 |
| 1 | 2 | 0.33 | 0.5765 |
| 5 | 5 | 0.50 | 0.5700 |
| 10 | 2 | 0.83 | 0.8516 |
Four algorithms compete on the same 5-arm bandit. Watch cumulative regret accumulate over T=1000 pulls. The best algorithm will show the slowest-rising regret curve.
The bandit problem is the purest form of the exploration/exploitation dilemma. Every algorithm here extends directly to full RL.
| Algorithm | Exploration type | Regret | Key property |
|---|---|---|---|
| ε-greedy | Undirected (uniform random) | O(Δ·T·ε) | Simple, tunable |
| Boltzmann / softmax | Undirected (value-weighted) | Similar to ε-greedy | Smooth tradeoff, sensitive to scale |
| UCB1 | Directed (confidence-based) | O(log T) optimal | Deterministic, no priors needed |
| Thompson Sampling | Directed (Bayesian) | O(log T) optimal | Randomized, prior helps early |
| Gittins Index | Directed (optimal) | Optimal for discounted | Provably optimal but limited scope |
Related chapters:
"It is better to try something and fail than to do nothing. But it is even better to try something that might succeed." — The bandit problem in one sentence.