Kochenderfer, Wheeler & Wray — Chapter 15

Exploration & Exploitation

How do you choose between trying new things and doing what you know works?

Prerequisites: basic probability + Bayesian inference. That's it.
10
Chapters
4
Simulations
10
Quizzes

Chapter 0: The Fundamental Dilemma

You're a restaurant critic in a new city. There are 50 restaurants you haven't tried. You have 10 dinner slots. Tonight you tried Restaurant A and it was excellent. Should you go back tomorrow? Or try one of the other 49?

If you always return to Restaurant A, you might miss a better option. If you always try new places, you'll gather information but eat at mediocre restaurants. There's a real tension: exploitation (use what you know works) vs. exploration (try new things to gain information).

Why this is hard: Information has value. Trying a new restaurant teaches you something, even if it's bad. But that knowledge comes at a cost — you ate a bad meal. The exploration/exploitation dilemma is fundamentally a question about the value of information: is the knowledge gained from exploring worth the opportunity cost of not exploiting?

This dilemma appears everywhere in decision-making:

DomainExploitExplore
Clinical trialsGive patients the best known drugTest new drugs on some patients
Web A/B testingShow the best-converting variantTest new designs to learn better
RoboticsFollow the known-good policyTry new actions to discover rewards
Recommender systemsShow items the user likesShow new items to learn preferences
Drug discoverySynthesize known effective compoundsTest novel molecular structures
Regret: The cost of a suboptimal choice is measured as regret: regret = (reward of best action) − (reward of chosen action). Cumulative regret over T rounds is the total cost of not always knowing the best option. The goal: algorithms that minimize cumulative regret. Perfect knowledge would give zero regret. The question is how quickly we can approach that ideal as we learn.
You have 3 slot machines. Machine A has been played 100 times with average payout 5. Machine B has been played 2 times with average payout 6. Machine C has been played 0 times. A pure exploitation strategy (always pick highest mean) would pick Machine B. What does this ignore?

Chapter 1: The Multi-Armed Bandit

Strip away everything except the core dilemma. You face K slot machines (one-armed bandits). Each machine i has an unknown reward distribution with mean μi. At each time step t, you pull one machine and receive a reward. Your goal: maximize total reward over T pulls.

This is the multi-armed bandit problem (MAB). The name comes from imagining a gambler at a casino with K slot machines (“one-armed bandits”), each with unknown payout rates. The gambler must decide which arms to pull to maximize cumulative reward.

Why it matters for RL: Every RL problem contains a bandit subproblem: in each state s, the agent must choose among K = |A| actions with unknown expected returns. The state adds structure (current action affects future states), but the core challenge of learning which actions are good without trying them infinitely many times is the bandit problem. Solving bandits well gives intuition for solving full RL.

The hardness of the bandit problem comes from the estimation-exploration tradeoff: to estimate μi accurately, you need many pulls of arm i. But every pull of i is a pull not spent on the potentially better arm j. You're spending reward to buy knowledge.

Key quantities:

Lower bound on regret: The Lai-Robbins theorem (1985) proves that for any consistent algorithm, cumulative regret must grow at least as fast as ∑i:Δi>0 Δi · log(T) / KL(μi || μ*), where KL is the Kullback-Leibler divergence. This O(log T) lower bound is fundamental: no algorithm can do better. The best algorithms (UCB, Thompson sampling) achieve this bound up to constants.
You have 3 arms with true means [5, 7, 3]. After 100 pulls you've pulled arm 2 (best) 50 times, arm 1 thirty times, and arm 3 twenty times. What is the regret from the arm 1 pulls?

Chapter 2: Bayesian Bandit Estimation

A Bayesian agent maintains a posterior distribution over the reward parameters of each arm. As rewards are observed, the posterior updates. At any time, the agent has a full probability distribution over "how good each arm is" — not just a point estimate.

For Bernoulli rewards (success/failure, like a click or no-click), the Beta distribution is the natural conjugate prior. If arm i has had αi successes and βi failures, its posterior is:

μi | data ~ Beta(αi, βi)

The prior Beta(1, 1) is uniform — complete ignorance. After 3 successes and 2 failures: Beta(4, 3). The posterior mean is α/(α+β) = 4/7 ≈ 0.57. The posterior variance αβ/[(α+β)2(α+β+1)] tells you uncertainty.

The Beta update is trivially simple: When arm i succeeds, αi += 1. When it fails, βi += 1. That's the entire Bayesian update for Bernoulli rewards. Two counters per arm. No matrix inversions, no gradient descent. The Beta distribution handles all the uncertainty tracking automatically.

For Gaussian rewards (continuous, like restaurant quality scores), the normal-inverse-gamma distribution is the conjugate prior. But in practice, many algorithms use a simpler approximation: track the sample mean μ̂i and the uncertainty σi = σ0/√ni, where ni is the number of pulls and σ0 is the prior std.

Posterior contraction: As you pull arm i more, the posterior narrows. With 2 pulls, uncertainty is high. With 100 pulls, you've almost pinned down the true mean. This narrowing is automatic in the Bayesian framework — the posterior width ∼ 1/√n. It's built into the math: more data = less uncertainty.
Beta Posterior Updating

Simulate pulling a Bernoulli arm with true probability p. The Beta posterior (teal) updates with each pull and tightens around the true value (orange dashed).

True probability p 0.60
An arm starts with prior Beta(1,1). You observe 4 successes and 6 failures. What is the posterior?

Chapter 3: ε-Greedy Exploration

You've been to Restaurant A 50 times. You know it's great. Should you ever try the others? The simplest policy: most of the time, exploit (go to the best-known option), but occasionally, pick a random option to explore. This is ε-greedy.

At each time step: with probability ε, choose a uniformly random arm. With probability 1−ε, choose the arm with the highest current estimate μ̂i.

The ε tradeoff: Large ε = lots of exploration, slow convergence to good policy, high short-term regret. Small ε = fast exploitation, little exploration, risk of getting stuck on a suboptimal arm forever. ε = 0 is pure greedy (no exploration, guaranteed to get stuck). ε = 1 is random (explores everything equally, ignores all knowledge).

A key insight: ε-greedy has a problem even when it works. When it does explore, it explores uniformly — picking a random arm with equal probability. This is unintelligent: it spends equal time on arms that are nearly as good as the best (small Δ) and arms that are clearly terrible (large Δ). Smarter exploration should focus on arms with high uncertainty, not uniform random arms.

Decaying ε addresses the fixed exploration problem: set εt = min(1, c/(t · Δ)) for some constant c and estimated gap Δ. Early on, explore a lot. Later, when you know more, exploit more. In practice, a simple schedule like εt = 1/log(t) or εt = ε0/√t is common.

ε-greedy is not optimal: Even with decaying ε, the algorithm still explores unintelligently. Its regret is O(Δ · εt · T) from random explorations + suboptimality from greedy choices. UCB and Thompson sampling (Ch 5–6) achieve O(log T) regret optimally without the uniform random exploration penalty.
python
import numpy as np

class EpsilonGreedy:
    def __init__(self, K, eps=0.1):
        self.K = K; self.eps = eps
        self.counts = np.zeros(K)     # pulls per arm
        self.values = np.zeros(K)     # sample mean per arm

    def select(self):
        if np.random.random() < self.eps:
            return np.random.randint(self.K)   # random exploration
        return np.argmax(self.values)           # greedy exploitation

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Online mean update: no need to store all rewards
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n
ε-greedy with ε=0.1 and K=10 arms. In the long run, roughly what fraction of pulls go to the best arm?

Chapter 4: Boltzmann Exploration

ε-greedy has a binary choice: exploit (pick the best) or explore (pick uniformly random). Boltzmann exploration (also called softmax exploration) blends these smoothly: pull arm i with probability proportional to exp(μ̂i / τ), where τ is the temperature parameter.

P(arm = i) = exp(μ̂i / τ) / ∑j exp(μ̂j / τ)

This is the Boltzmann distribution from statistical physics — hence the name. At high temperature (τ → ∞), all arms have equal probability (pure exploration). At low temperature (τ → 0), the best arm gets all the probability (pure exploitation).

The temperature analogy: In physics, high temperature = high randomness. Low temperature = system settles into its lowest energy state (like a ferromagnet). In bandits: high τ = explore freely. Low τ = always exploit. Annealing (slowly cooling τ) lets you explore early and exploit later — the bandit equivalent of simulated annealing.

Boltzmann exploration is more intelligent than ε-greedy: it preferentially explores arms with high estimated value over arms with low value. If arm 1 has estimate 5 and arm 2 has estimate 3 and arm 3 has estimate 1, Boltzmann explores arm 1 most during exploration phase, not uniformly. This concentrates exploration on competitive arms.

Weakness: scale sensitivity. The softmax probabilities depend on the scale of rewards. If rewards are in [0, 1], a temperature of τ = 0.1 is "cold." If rewards are in [0, 100], τ = 0.1 is "hot." You must set τ relative to the reward scale. In contrast, UCB (Ch 5) is scale-sensitive only through a single hyperparameter and adapts more automatically.
Boltzmann Action Probabilities

5 arms with fixed estimates. Temperature τ controls how peaked the distribution is toward the best arm (highest bar = most likely to be selected).

Temperature τ 1.00
With temperature τ → 0, Boltzmann exploration converges to what strategy?

Chapter 5: Upper Confidence Bound

Both ε-greedy and Boltzmann exploration are undirected: they explore randomly, not based on where knowledge is actually lacking. The Upper Confidence Bound (UCB) algorithm is directed: it explicitly tracks which arms are uncertain and systematically explores them.

The key idea: for each arm i, maintain a confidence interval [μ̂i − CIi, μ̂i + CIi] that contains the true μi with high probability. Then select the arm with the highest upper bound of its interval:

a* = argmaxi [ μ̂i + c √(log(t) / ni) ]

The term √(log(t) / ni) is the exploration bonus: it's large when arm i has been pulled few times (small ni), encouraging exploration, and shrinks as the arm is pulled more. The constant c controls how aggressively to explore.

"Optimism in the face of uncertainty": UCB embodies a beautifully rational principle. Instead of picking the arm you believe is best (μ̂ exploitation) or a random arm (exploration), pick the arm that could plausibly be the best given your current knowledge. Every arm that might be optimal gets pulled until its upper bound is proven lower than the best arm. This is the "optimism principle."

Why does UCB work? Consider arm i with true mean μi = 5. If it's been pulled enough, μ̂i ≈ 5 and the upper bound ≈ 5 + small correction. The best arm j has upper bound μ̂j + correction > 5 (since j is better). So UCB prefers j. But if i is new and μ̂i is based on only 1 pull, the upper bound is huge — UCB will explore i. Once it's explored enough to know it's suboptimal, UCB stops exploring it. Exploration is automatic and principled.

UCB1 regret bound: The UCB1 algorithm (Auer et al. 2002) achieves regret RT ≤ 8 ∑i:Δi>0 (log T / Δi) + (1 + π2/3) ∑i Δi. This is O(log T) and matches the Lai-Robbins lower bound up to constants. UCB1 is asymptotically optimal. No algorithm can do fundamentally better.
UCB in Action: Confidence Intervals

Each bar is an arm. The teal bar is the mean estimate; the full bar is the upper confidence bound. UCB selects the arm with the tallest total bar. Watch intervals shrink as arms are pulled.

Exploration c 1.0
Arm A has been pulled 1 time, estimate 4. Arm B has been pulled 100 times, estimate 5. With t=101, c=1: which arm does UCB select?

Chapter 6: Thompson Sampling

UCB is deterministic and uses a pessimistic confidence interval. Thompson Sampling (1933, re-popularized 2010s) is probabilistic and Bayesian. The idea: at each step, randomly sample from your posterior distribution over each arm's mean, then pick the arm whose sample is highest.

1. Maintain
Keep posterior p(μi | data) for each arm i (e.g., Beta(αi, βi) for Bernoulli)
2. Sample
Draw θi ~ p(μi | data) for each arm i
3. Select
Pull arm i* = argmaxi θi
4. Update
Update posterior for selected arm with observed reward. Repeat.
Why Thompson Sampling works: With high confidence (posterior concentrated near true mean), the probability of pulling suboptimal arms drops. With uncertainty (wide posterior), suboptimal arms get pulled with meaningful probability — but this is exactly calibrated exploration: we explore proportionally to the probability that the arm is actually optimal. This is Bayesian decision-making in its purest form.

Thompson Sampling has beautiful theoretical guarantees. For Bernoulli bandits, it achieves the Lai-Robbins lower bound asymptotically. In practice, it often outperforms UCB empirically, especially in the early rounds when uncertainty is high and the Bayesian framework naturally handles the prior.

Thompson Sampling as probability matching: We pull arm i with probability equal to the posterior probability that arm i is optimal: P(pull i) = P(μi = maxj μj | data). This is exactly what the sampling step computes. Sampling from the posterior over μi and picking the max automatically implements probability-optimal selection. There's no manual hyperparameter for exploration rate — the posterior width handles it.
python
import numpy as np
from scipy.stats import beta

class ThompsonSampling:
    def __init__(self, K):
        self.K = K
        self.alpha = np.ones(K)   # Beta prior: alpha = successes + 1
        self.beta_ = np.ones(K)   # Beta prior: beta = failures + 1

    def select(self):
        # Sample one value from each arm's posterior
        samples = beta.rvs(self.alpha, self.beta_)
        return np.argmax(samples)   # pick the highest sample

    def update(self, arm, reward):
        # Bernoulli reward: 1 = success, 0 = failure
        self.alpha[arm] += reward
        self.beta_[arm] += 1 - reward
Arm A has posterior Beta(10, 1) and arm B has posterior Beta(5, 5). A Thompson sample draws θA = 0.88 and θB = 0.52. Which arm is selected?

Chapter 7: Gittins Index & Optimal Exploration

UCB and Thompson Sampling are near-optimal. But there exists a provably optimal solution to the discounted bandit problem: the Gittins index (Gittins, 1979). For a bandit with discount factor γ < 1, the optimal strategy is to always pull the arm with the highest Gittins index.

The Gittins index of arm i in state (information about arm i) is defined as the "fair entrance fee" you'd charge to swap arm i for a guaranteed constant stream of rewards. More precisely, it's the value ν such that you are indifferent between:

(a) Playing arm i forever with the option to switch to a guaranteed reward ν at any time.

(b) Receiving ν every period starting now.

Why this is optimal: The Gittins index theorem (Whittle, 1980) proves that in the independent arms, discounted bandit setting, the index policy (always pull the arm with highest index) is optimal. This is remarkable: an optimal algorithm for a hard sequential decision problem reduces to computing one number per arm and picking the max. No look-ahead needed.

For Bernoulli arms with Beta posterior, the Gittins index G(α, β) depends on the current posterior state (α, β). Computing it requires solving a dynamic programming problem for each (possibly discretized) posterior state. Tables are precomputed:

αβPosterior meanGittins index (γ=0.9)
110.500.7029
210.670.7798
120.330.5765
550.500.5700
1020.830.8516
Gittins index vs. posterior mean: The index (0.7029) for an unexplored arm Beta(1,1) is much higher than its posterior mean (0.50). This reflects the value of exploration: the arm might be much better than its mean, and with discounting there's still time to exploit that information. As the arm is pulled more and the posterior narrows, the index converges toward the posterior mean. The gap between index and mean is the "exploration bonus" — and it's provably the optimal one.
Practical limitations of Gittins: The Gittins index is optimal only when arms are independent, rewards are discounted (not undiscounted), and the problem is stationary (arm distributions don't change). For contextual bandits, non-stationary arms, or undiscounted settings, it's not optimal. UCB and Thompson Sampling are more practically applicable since they handle these extensions more gracefully.
Why does an unexplored arm Beta(1,1) have a Gittins index (0.70) higher than its posterior mean (0.50)?

Chapter 8: Showcase — Bandit Algorithm Race

Four algorithms compete on the same 5-arm bandit. Watch cumulative regret accumulate over T=1000 pulls. The best algorithm will show the slowest-rising regret curve.

What to try: Easy bandits (large gaps between arms) — all methods converge quickly. Hard bandits (small gaps) — the difference in algorithm quality becomes apparent. Adjust the gap slider and watch which algorithms suffer most.
Bandit Algorithm Race: Regret Comparison
Arm gap Δ 0.20
Speed 5

Chapter 9: Connections & What's Next

The bandit problem is the purest form of the exploration/exploitation dilemma. Every algorithm here extends directly to full RL.

AlgorithmExploration typeRegretKey property
ε-greedyUndirected (uniform random)O(Δ·T·ε)Simple, tunable
Boltzmann / softmaxUndirected (value-weighted)Similar to ε-greedySmooth tradeoff, sensitive to scale
UCB1Directed (confidence-based)O(log T) optimalDeterministic, no priors needed
Thompson SamplingDirected (Bayesian)O(log T) optimalRandomized, prior helps early
Gittins IndexDirected (optimal)Optimal for discountedProvably optimal but limited scope
From bandits to full RL: In full RL, the exploration problem recurs in every state. ε-greedy Q-learning randomly explores actions. Posterior Sampling RL (PSRL) extends Thompson Sampling to full MDPs by maintaining a posterior over transition models and sampling a model to plan in. UCB applied to MDPs gives R-max-style algorithms. The bandit intuitions carry over: directed exploration beats undirected exploration.
Contextual bandits: If each pull comes with a context x (features about the user, item, state), the arm means become functions μi(x). Thompson Sampling extends naturally: maintain a Bayesian linear model for each arm, sample from the weight posterior, pick the arm with highest sampled prediction. This is the foundation of modern recommendation systems and personalized medicine.

Related chapters:

"It is better to try something and fail than to do nothing. But it is even better to try something that might succeed." — The bandit problem in one sentence.
Thompson Sampling selects arms proportionally to the probability they are optimal. Why is this better than always selecting the arm with the highest posterior mean?