Ch 15: Exploration and Exploitation

Chapter 0: The Fundamental Dilemma

You're a restaurant critic in a new city. There are 50 restaurants you haven't tried. You have 10 dinner slots. Tonight you tried Restaurant A and it was excellent. Should you go back tomorrow? Or try one of the other 49?

If you always return to Restaurant A, you might miss a better option. If you always try new places, you'll gather information but eat at mediocre restaurants. There's a real tension: exploitation (use what you know works) vs. exploration (try new things to gain information).

Why this is hard: Information has value. Trying a new restaurant teaches you something, even if it's bad. But that knowledge comes at a cost — you ate a bad meal. The exploration/exploitation dilemma is fundamentally a question about the value of information: is the knowledge gained from exploring worth the opportunity cost of not exploiting?

This dilemma appears everywhere in decision-making:

Domain	Exploit	Explore
Clinical trials	Give patients the best known drug	Test new drugs on some patients
Web A/B testing	Show the best-converting variant	Test new designs to learn better
Robotics	Follow the known-good policy	Try new actions to discover rewards
Recommender systems	Show items the user likes	Show new items to learn preferences
Drug discovery	Synthesize known effective compounds	Test novel molecular structures

Regret: The cost of a suboptimal choice is measured as regret: regret = (reward of best action) − (reward of chosen action). Cumulative regret over T rounds is the total cost of not always knowing the best option. The goal: algorithms that minimize cumulative regret. Perfect knowledge would give zero regret. The question is how quickly we can approach that ideal as we learn.

You have 3 slot machines. Machine A has been played 100 times with average payout 5. Machine B has been played 2 times with average payout 6. Machine C has been played 0 times. A pure exploitation strategy (always pick highest mean) would pick Machine B. What does this ignore?

Machine B's estimate is very uncertain (only 2 observations) and could be wrong; Machine C has never been tried Machine A has been played the most and is therefore the safest choice The average payout of Machine B might decrease with more plays

Chapter 1: The Multi-Armed Bandit

Strip away everything except the core dilemma. You face K slot machines (one-armed bandits). Each machine i has an unknown reward distribution with mean μ_i. At each time step t, you pull one machine and receive a reward. Your goal: maximize total reward over T pulls.

This is the multi-armed bandit problem (MAB). The name comes from imagining a gambler at a casino with K slot machines (“one-armed bandits”), each with unknown payout rates. The gambler must decide which arms to pull to maximize cumulative reward.

Why it matters for RL: Every RL problem contains a bandit subproblem: in each state s, the agent must choose among K = |A| actions with unknown expected returns. The state adds structure (current action affects future states), but the core challenge of learning which actions are good without trying them infinitely many times is the bandit problem. Solving bandits well gives intuition for solving full RL.

The hardness of the bandit problem comes from the estimation-exploration tradeoff: to estimate μ_i accurately, you need many pulls of arm i. But every pull of i is a pull not spent on the potentially better arm j. You're spending reward to buy knowledge.

Key quantities:

μ_i — true mean reward of arm i (unknown to agent)
μ* = max_i μ_i — optimal arm's reward (unknown)
Δ_i = μ* − μ_i — suboptimality gap of arm i
r_t — reward received at time t
Regret R_T = T·μ* − ∑_t=1^T r_t — total cost of not knowing the best arm

Lower bound on regret: The Lai-Robbins theorem (1985) proves that for any consistent algorithm, cumulative regret must grow at least as fast as ∑_{i:Δ_i>0} Δ_i · log(T) / KL(μ_i || μ*), where KL is the Kullback-Leibler divergence. This O(log T) lower bound is fundamental: no algorithm can do better. The best algorithms (UCB, Thompson sampling) achieve this bound up to constants.

You have 3 arms with true means [5, 7, 3]. After 100 pulls you've pulled arm 2 (best) 50 times, arm 1 thirty times, and arm 3 twenty times. What is the regret from the arm 1 pulls?

30 × (7 − 5) = 60 — each arm 1 pull cost us the difference from the best arm 30 × 5 = 150 — the actual rewards from arm 1 pulls 30 × 7 = 210 — what we would have gotten from the best arm

Chapter 2: Bayesian Bandit Estimation

A Bayesian agent maintains a posterior distribution over the reward parameters of each arm. As rewards are observed, the posterior updates. At any time, the agent has a full probability distribution over "how good each arm is" — not just a point estimate.

For Bernoulli rewards (success/failure, like a click or no-click), the Beta distribution is the natural conjugate prior. If arm i has had α_i successes and β_i failures, its posterior is:

μ_i | data ~ Beta(α_i, β_i)

The prior Beta(1, 1) is uniform — complete ignorance. After 3 successes and 2 failures: Beta(4, 3). The posterior mean is α/(α+β) = 4/7 ≈ 0.57. The posterior variance αβ/[(α+β)²(α+β+1)] tells you uncertainty.

The Beta update is trivially simple: When arm i succeeds, α_i += 1. When it fails, β_i += 1. That's the entire Bayesian update for Bernoulli rewards. Two counters per arm. No matrix inversions, no gradient descent. The Beta distribution handles all the uncertainty tracking automatically.

For Gaussian rewards (continuous, like restaurant quality scores), the normal-inverse-gamma distribution is the conjugate prior. But in practice, many algorithms use a simpler approximation: track the sample mean μ̂_i and the uncertainty σ_i = σ₀/√n_i, where n_i is the number of pulls and σ₀ is the prior std.

Posterior contraction: As you pull arm i more, the posterior narrows. With 2 pulls, uncertainty is high. With 100 pulls, you've almost pinned down the true mean. This narrowing is automatic in the Bayesian framework — the posterior width ∼ 1/√n. It's built into the math: more data = less uncertainty.

Beta Posterior Updating

Simulate pulling a Bernoulli arm with true probability p. The Beta posterior (teal) updates with each pull and tightens around the true value (orange dashed).

True probability p 0.60

An arm starts with prior Beta(1,1). You observe 4 successes and 6 failures. What is the posterior?

Beta(5, 7) — add successes to α and failures to β, starting from (1,1) Beta(4, 6) — the posterior is just the raw counts Beta(1, 1) — the prior doesn't change from observations

Chapter 3: ε-Greedy Exploration

You've been to Restaurant A 50 times. You know it's great. Should you ever try the others? The simplest policy: most of the time, exploit (go to the best-known option), but occasionally, pick a random option to explore. This is ε-greedy.

At each time step: with probability ε, choose a uniformly random arm. With probability 1−ε, choose the arm with the highest current estimate μ̂_i.

The ε tradeoff: Large ε = lots of exploration, slow convergence to good policy, high short-term regret. Small ε = fast exploitation, little exploration, risk of getting stuck on a suboptimal arm forever. ε = 0 is pure greedy (no exploration, guaranteed to get stuck). ε = 1 is random (explores everything equally, ignores all knowledge).

A key insight: ε-greedy has a problem even when it works. When it does explore, it explores uniformly — picking a random arm with equal probability. This is unintelligent: it spends equal time on arms that are nearly as good as the best (small Δ) and arms that are clearly terrible (large Δ). Smarter exploration should focus on arms with high uncertainty, not uniform random arms.

Decaying ε addresses the fixed exploration problem: set ε_t = min(1, c/(t · Δ)) for some constant c and estimated gap Δ. Early on, explore a lot. Later, when you know more, exploit more. In practice, a simple schedule like ε_t = 1/log(t) or ε_t = ε₀/√t is common.

ε-greedy is not optimal: Even with decaying ε, the algorithm still explores unintelligently. Its regret is O(Δ · ε_t · T) from random explorations + suboptimality from greedy choices. UCB and Thompson sampling (Ch 5–6) achieve O(log T) regret optimally without the uniform random exploration penalty.

python
import numpy as np

class EpsilonGreedy:
    def __init__(self, K, eps=0.1):
        self.K = K; self.eps = eps
        self.counts = np.zeros(K)     # pulls per arm
        self.values = np.zeros(K)     # sample mean per arm

    def select(self):
        if np.random.random() < self.eps:
            return np.random.randint(self.K)   # random exploration
        return np.argmax(self.values)           # greedy exploitation

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Online mean update: no need to store all rewards
        n = self.counts[arm]
        self.values[arm] += (reward - self.values[arm]) / n

ε-greedy with ε=0.1 and K=10 arms. In the long run, roughly what fraction of pulls go to the best arm?

About 91% — (1-ε) greedy pulls, all going to best + ε/K random from exploration About 10% — same as any other arm About 90% — exactly (1-ε) since greedy always picks the best

Chapter 4: Boltzmann Exploration

ε-greedy has a binary choice: exploit (pick the best) or explore (pick uniformly random). Boltzmann exploration (also called softmax exploration) blends these smoothly: pull arm i with probability proportional to exp(μ̂_i / τ), where τ is the temperature parameter.

P(arm = i) = exp(μ̂_i / τ) / ∑_j exp(μ̂_j / τ)

This is the Boltzmann distribution from statistical physics — hence the name. At high temperature (τ → ∞), all arms have equal probability (pure exploration). At low temperature (τ → 0), the best arm gets all the probability (pure exploitation).

The temperature analogy: In physics, high temperature = high randomness. Low temperature = system settles into its lowest energy state (like a ferromagnet). In bandits: high τ = explore freely. Low τ = always exploit. Annealing (slowly cooling τ) lets you explore early and exploit later — the bandit equivalent of simulated annealing.

Boltzmann exploration is more intelligent than ε-greedy: it preferentially explores arms with high estimated value over arms with low value. If arm 1 has estimate 5 and arm 2 has estimate 3 and arm 3 has estimate 1, Boltzmann explores arm 1 most during exploration phase, not uniformly. This concentrates exploration on competitive arms.

Weakness: scale sensitivity. The softmax probabilities depend on the scale of rewards. If rewards are in [0, 1], a temperature of τ = 0.1 is "cold." If rewards are in [0, 100], τ = 0.1 is "hot." You must set τ relative to the reward scale. In contrast, UCB (Ch 5) is scale-sensitive only through a single hyperparameter and adapts more automatically.

Boltzmann Action Probabilities

5 arms with fixed estimates. Temperature τ controls how peaked the distribution is toward the best arm (highest bar = most likely to be selected).

Temperature τ 1.00

With temperature τ → 0, Boltzmann exploration converges to what strategy?

Pure greedy — always select the arm with the highest current estimate Pure random — select each arm with equal probability ε-greedy with ε = 0.5

Chapter 5: Upper Confidence Bound

Both ε-greedy and Boltzmann exploration are undirected: they explore randomly, not based on where knowledge is actually lacking. The Upper Confidence Bound (UCB) algorithm is directed: it explicitly tracks which arms are uncertain and systematically explores them.

The key idea: for each arm i, maintain a confidence interval [μ̂_i − CI_i, μ̂_i + CI_i] that contains the true μ_i with high probability. Then select the arm with the highest upper bound of its interval:

a* = argmax_i [ μ̂_i + c √(log(t) / n_i) ]

The term √(log(t) / n_i) is the exploration bonus: it's large when arm i has been pulled few times (small n_i), encouraging exploration, and shrinks as the arm is pulled more. The constant c controls how aggressively to explore.

"Optimism in the face of uncertainty": UCB embodies a beautifully rational principle. Instead of picking the arm you believe is best (μ̂ exploitation) or a random arm (exploration), pick the arm that could plausibly be the best given your current knowledge. Every arm that might be optimal gets pulled until its upper bound is proven lower than the best arm. This is the "optimism principle."

Why does UCB work? Consider arm i with true mean μ_i = 5. If it's been pulled enough, μ̂_i ≈ 5 and the upper bound ≈ 5 + small correction. The best arm j has upper bound μ̂_j + correction > 5 (since j is better). So UCB prefers j. But if i is new and μ̂_i is based on only 1 pull, the upper bound is huge — UCB will explore i. Once it's explored enough to know it's suboptimal, UCB stops exploring it. Exploration is automatic and principled.

UCB1 regret bound: The UCB1 algorithm (Auer et al. 2002) achieves regret R_T ≤ 8 ∑_{i:Δ_i>0} (log T / Δ_i) + (1 + π²/3) ∑_i Δ_i. This is O(log T) and matches the Lai-Robbins lower bound up to constants. UCB1 is asymptotically optimal. No algorithm can do fundamentally better.

UCB in Action: Confidence Intervals

Each bar is an arm. The teal bar is the mean estimate; the full bar is the upper confidence bound. UCB selects the arm with the tallest total bar. Watch intervals shrink as arms are pulled.

Exploration c 1.0

Arm A has been pulled 1 time, estimate 4. Arm B has been pulled 100 times, estimate 5. With t=101, c=1: which arm does UCB select?

Arm A — its upper bound is 4 + √(log(101)/1) ≈ 4 + 2.1 = 6.1, higher than B's bound of 5 + √(log(101)/100) ≈ 5.21 Arm B — it has a higher estimate (5 > 4) and more data Random — UCB adds randomness to break ties

Chapter 6: Thompson Sampling

UCB is deterministic and uses a pessimistic confidence interval. Thompson Sampling (1933, re-popularized 2010s) is probabilistic and Bayesian. The idea: at each step, randomly sample from your posterior distribution over each arm's mean, then pick the arm whose sample is highest.

1. Maintain

Keep posterior p(μ_i | data) for each arm i (e.g., Beta(α_i, β_i) for Bernoulli)

↓

2. Sample

Draw θ_i ~ p(μ_i | data) for each arm i

↓

3. Select

Pull arm i* = argmax_i θ_i

↓

4. Update

Update posterior for selected arm with observed reward. Repeat.

Why Thompson Sampling works: With high confidence (posterior concentrated near true mean), the probability of pulling suboptimal arms drops. With uncertainty (wide posterior), suboptimal arms get pulled with meaningful probability — but this is exactly calibrated exploration: we explore proportionally to the probability that the arm is actually optimal. This is Bayesian decision-making in its purest form.

Thompson Sampling has beautiful theoretical guarantees. For Bernoulli bandits, it achieves the Lai-Robbins lower bound asymptotically. In practice, it often outperforms UCB empirically, especially in the early rounds when uncertainty is high and the Bayesian framework naturally handles the prior.

Thompson Sampling as probability matching: We pull arm i with probability equal to the posterior probability that arm i is optimal: P(pull i) = P(μ_i = max_j μ_j | data). This is exactly what the sampling step computes. Sampling from the posterior over μ_i and picking the max automatically implements probability-optimal selection. There's no manual hyperparameter for exploration rate — the posterior width handles it.

python
import numpy as np
from scipy.stats import beta

class ThompsonSampling:
    def __init__(self, K):
        self.K = K
        self.alpha = np.ones(K)   # Beta prior: alpha = successes + 1
        self.beta_ = np.ones(K)   # Beta prior: beta = failures + 1

    def select(self):
        # Sample one value from each arm's posterior
        samples = beta.rvs(self.alpha, self.beta_)
        return np.argmax(samples)   # pick the highest sample

    def update(self, arm, reward):
        # Bernoulli reward: 1 = success, 0 = failure
        self.alpha[arm] += reward
        self.beta_[arm] += 1 - reward

Arm A has posterior Beta(10, 1) and arm B has posterior Beta(5, 5). A Thompson sample draws θ_A = 0.88 and θ_B = 0.52. Which arm is selected?

Arm A — because 0.88 > 0.52, Thompson selects the arm with the higher sample Arm B — because it has more uncertainty and needs more exploration Arm A — because Beta(10,1) has a higher mean than Beta(5,5)

Chapter 7: Gittins Index & Optimal Exploration

UCB and Thompson Sampling are near-optimal. But there exists a provably optimal solution to the discounted bandit problem: the Gittins index (Gittins, 1979). For a bandit with discount factor γ < 1, the optimal strategy is to always pull the arm with the highest Gittins index.

The Gittins index of arm i in state (information about arm i) is defined as the "fair entrance fee" you'd charge to swap arm i for a guaranteed constant stream of rewards. More precisely, it's the value ν such that you are indifferent between:

(a) Playing arm i forever with the option to switch to a guaranteed reward ν at any time.

(b) Receiving ν every period starting now.

Why this is optimal: The Gittins index theorem (Whittle, 1980) proves that in the independent arms, discounted bandit setting, the index policy (always pull the arm with highest index) is optimal. This is remarkable: an optimal algorithm for a hard sequential decision problem reduces to computing one number per arm and picking the max. No look-ahead needed.

For Bernoulli arms with Beta posterior, the Gittins index G(α, β) depends on the current posterior state (α, β). Computing it requires solving a dynamic programming problem for each (possibly discretized) posterior state. Tables are precomputed:

α	β	Posterior mean	Gittins index (γ=0.9)
1	1	0.50	0.7029
2	1	0.67	0.7798
1	2	0.33	0.5765
5	5	0.50	0.5700
10	2	0.83	0.8516

Gittins index vs. posterior mean: The index (0.7029) for an unexplored arm Beta(1,1) is much higher than its posterior mean (0.50). This reflects the value of exploration: the arm might be much better than its mean, and with discounting there's still time to exploit that information. As the arm is pulled more and the posterior narrows, the index converges toward the posterior mean. The gap between index and mean is the "exploration bonus" — and it's provably the optimal one.

Practical limitations of Gittins: The Gittins index is optimal only when arms are independent, rewards are discounted (not undiscounted), and the problem is stationary (arm distributions don't change). For contextual bandits, non-stationary arms, or undiscounted settings, it's not optimal. UCB and Thompson Sampling are more practically applicable since they handle these extensions more gracefully.

Why does an unexplored arm Beta(1,1) have a Gittins index (0.70) higher than its posterior mean (0.50)?

The index accounts for the option value of exploring — the arm might be very good, and future returns from learning that are valuable The index is always higher than the mean due to a computational artifact The prior Beta(1,1) has a mean of 0.70, not 0.50

Chapter 8: Showcase — Bandit Algorithm Race

Four algorithms compete on the same 5-arm bandit. Watch cumulative regret accumulate over T=1000 pulls. The best algorithm will show the slowest-rising regret curve.

What to try: Easy bandits (large gaps between arms) — all methods converge quickly. Hard bandits (small gaps) — the difference in algorithm quality becomes apparent. Adjust the gap slider and watch which algorithms suffer most.

Bandit Algorithm Race: Regret Comparison

Arm gap Δ 0.20

Speed 5

Chapter 9: Connections & What's Next

The bandit problem is the purest form of the exploration/exploitation dilemma. Every algorithm here extends directly to full RL.

Algorithm	Exploration type	Regret	Key property
ε-greedy	Undirected (uniform random)	O(Δ·T·ε)	Simple, tunable
Boltzmann / softmax	Undirected (value-weighted)	Similar to ε-greedy	Smooth tradeoff, sensitive to scale
UCB1	Directed (confidence-based)	O(log T) optimal	Deterministic, no priors needed
Thompson Sampling	Directed (Bayesian)	O(log T) optimal	Randomized, prior helps early
Gittins Index	Directed (optimal)	Optimal for discounted	Provably optimal but limited scope

From bandits to full RL: In full RL, the exploration problem recurs in every state. ε-greedy Q-learning randomly explores actions. Posterior Sampling RL (PSRL) extends Thompson Sampling to full MDPs by maintaining a posterior over transition models and sampling a model to plan in. UCB applied to MDPs gives R-max-style algorithms. The bandit intuitions carry over: directed exploration beats undirected exploration.

Contextual bandits: If each pull comes with a context x (features about the user, item, state), the arm means become functions μ_i(x). Thompson Sampling extends naturally: maintain a Bayesian linear model for each arm, sample from the weight posterior, pick the arm with highest sampled prediction. This is the foundation of modern recommendation systems and personalized medicine.

Related chapters:

Ch 14: Policy Validation — exploration ideas for validation
Ch 16: Model-Based RL — Bayesian exploration in full MDPs
Ch 17: Model-Free RL — ε-greedy and UCB in Q-learning

"It is better to try something and fail than to do nothing. But it is even better to try something that might succeed." — The bandit problem in one sentence.

Thompson Sampling selects arms proportionally to the probability they are optimal. Why is this better than always selecting the arm with the highest posterior mean?

Sampling from the posterior naturally drives exploration of uncertain arms without any hardcoded exploration rate Randomized selection prevents the algorithm from being "tricked" by adversarial environments Selecting the highest mean arm would always pick the suboptimal arm due to estimation bias

Exploration & Exploitation