Multi-armed Bandits — Sutton & Barto, Chapter 2

Chapter 0: The Bandit Problem

You walk into a casino. In front of you are 10 slot machines, each with a different colored lever. You have 1000 tokens. Each pull costs one token and pays a random reward. Some machines are generous, some are stingy — but you don't know which is which. What do you do?

This is the multi-armed bandit problem, and it captures the most fundamental tension in all of decision-making: exploration vs exploitation. Do you keep pulling the lever that has paid well so far (exploit), or try a new one that might be even better (explore)?

What makes this problem different from supervised learning is the nature of feedback. A teacher tells you the right answer whether you got it right or not — that is instructive feedback. A slot machine only tells you how much this pull paid — it never tells you what the other arms would have given. That is evaluative feedback. You must actively explore to discover better options.

The core tension: If you always pull the arm that looks best, you might miss a better one you haven't tried enough. If you spend too long exploring, you waste tokens on arms you already know are bad. Every RL agent faces this tradeoff.

Explore or Exploit?

Click any arm to pull it. The reward is random — drawn from a hidden distribution you don't know. Can you find the best arm in 30 pulls?

Pulls: 0 / 30 Total: 0.0

What distinguishes evaluative feedback from instructive feedback?

Evaluative feedback is always positive Evaluative feedback tells you how good your choice was, but not what the best choice would have been Evaluative feedback is noisier than instructive feedback

Chapter 1: Action Values

To decide which arm to pull, we need a way to measure how good each arm is. The true value of action a is the expected reward you get when you select it. Sutton & Barto call this q_*(a):

q_*(a) = E[R_t | A_t = a]

We don't know q_*(a) — if we did, we'd just always pick the arm with the highest value. Instead, we maintain estimates Q_t(a) and update them as we collect rewards. The simplest estimate is the sample average: the total reward received from arm a divided by how many times we've pulled it.

Q_t(a) = (sum of rewards when a was taken) / (number of times a was taken)

By the law of large numbers, as we pull arm a more and more times, Q_t(a) converges to q_*(a). The question is: how do we balance pulling each arm enough to get good estimates while not wasting pulls on bad arms?

Key distinction: q_*(a) is the true value — fixed and unknown. Q_t(a) is our current estimate — it changes with every pull and starts off unreliable. The gap between them is what drives the need for exploration.

Estimates vs True Values

Watch how sample-average estimates (orange bars) converge toward true values (teal lines) as you collect more samples. Click "Step" to pull the best-looking arm, or "Step (random)" to explore.

Step: 0

What does the sample-average method estimate?

The expected reward for each action, by averaging past rewards from that action The maximum reward possible from each action The probability of each action being optimal

Chapter 2: ε-Greedy Methods

The simplest solution to exploration vs exploitation is the ε-greedy strategy. Most of the time (probability 1 − ε), you pick the action with the highest estimated value — you exploit. But with small probability ε, you pick a completely random action — you explore.

When ε = 0, you are purely greedy: you never explore. You'll lock onto whatever arm looked best early and never discover if another arm is actually better. When ε = 1, you are purely random: you never exploit what you've learned. The sweet spot is somewhere in between.

Why not just be greedy? A greedy agent can get permanently stuck. Suppose arm 3 pays +1 on the first pull by luck, while the truly best arm (arm 7) pays −0.5. A greedy agent will never try arm 7 again. It doesn't know what it's missing, because it never looks.

Epsilon Exploration Rate

Drag the slider to change ε. Watch how the balance between exploration (random) and exploitation (greedy) shifts. The bar shows the fraction of actions that would be exploratory.

ε 0.10

A_t = argmax_a Q_t(a) with prob. 1 − ε, random action with prob. ε

With ε = 0.1, what fraction of actions are exploratory?

1% 10% — one in every ten actions is random 90%

Chapter 3: The 10-Armed Testbed

This is the experiment Sutton & Barto use to compare methods. There are 10 arms. Each arm's true value q_*(a) is drawn from a standard normal distribution N(0,1). When you pull arm a, the reward is drawn from N(q_*(a), 1) — the true value plus unit Gaussian noise.

Below is a fully interactive 10-armed testbed. You can pull arms manually, or let the agent auto-play with different ε values. Watch how Q estimates converge, and compare the average reward and % optimal action curves across strategies.

This is THE experiment from the textbook. Figure 2.2 in Sutton & Barto compares greedy, ε=0.01, and ε=0.1. You are about to reproduce it interactively. The results are striking: a little exploration (even just 1%) dramatically outperforms pure greed in the long run.

Interactive 10-Armed Bandit Testbed

The teal markers show true values q_*(a). Orange bars are your Q estimates. Click an arm to pull it, or use auto-play below.

ε 0.10

Performance curves (averaged over runs):

Step: 0 Avg reward: 0.00 % Optimal: 0%

In the 10-armed testbed, which strategy typically achieves the highest average reward after 1000 steps?

Greedy (ε=0) — never waste a pull exploring ε=0.01 — explore very rarely ε=0.1 — explore 10% of the time, learns fastest and finds the best arm

Chapter 4: Incremental Updates

Computing a sample average from scratch every time is wasteful. You'd have to store every single reward and re-sum them. There is an elegant incremental formula that updates the estimate one step at a time, using only the current estimate and the new reward:

Q_n+1 = Q_n + (1/n)(R_n − Q_n)

Read this as: NewEstimate = OldEstimate + StepSize × (Target − OldEstimate). The term (R_n − Q_n) is the error in our estimate. We nudge the estimate toward the new observation by a fraction 1/n. This formula is exact — it gives the same result as computing the full average — and it's the template for nearly every update rule in reinforcement learning.

This pattern appears everywhere in RL. Value function updates, TD learning, policy gradient baselines — they all follow this form: step in the direction of the error, scaled by a step size. Master this formula and you have the skeleton key to the rest of the book.

General Update Rule

NewEstimate ← OldEstimate + StepSize [Target − OldEstimate]

↓

For Sample Averages

StepSize = 1/n (decreases with each pull)

↓

For Nonstationary

StepSize = α (constant, e.g. 0.1)

Incremental Update Step-by-Step

Each step shows a new reward R_n, the current estimate Q_n, the error, and the updated Q_n+1. The teal line is the true value.

n=0 Q=0.00

In the incremental update Q_n+1 = Q_n + (1/n)(R_n − Q_n), what does (R_n − Q_n) represent?

The error — the difference between the new reward and our current estimate The total reward accumulated so far The learning rate

Chapter 5: Nonstationary Tracking

The sample-average method weights all past rewards equally. That's fine if the true values never change. But what if the environment is nonstationary — the slot machines slowly get better or worse over time? Then ancient rewards are misleading, and we want to give more weight to recent experience.

The fix is simple: replace the shrinking step size 1/n with a constant step size α. This makes the update rule:

Q_n+1 = Q_n + α(R_n − Q_n) where 0 < α ≤ 1

Expanding this recursion reveals that each past reward R_i is weighted by (1 − α)ⁿ⁻ⁱ · α. Older rewards get exponentially discounted. This is called exponential recency-weighted averaging. The larger α is, the faster old data is forgotten.

Recency weighting is like memory. With α = 0.1, the weight on a reward drops to half after about 7 steps, and to 1% after 44 steps. The agent effectively has a sliding window of attention, always tracking the recent past.

Exponential Recency Weighting

Drag α to see how past rewards are weighted. Higher α forgets faster. The bars show the weight assigned to each of the last 30 rewards.

α 0.10

Why is a constant step-size α better than 1/n for nonstationary problems?

It converges faster to the true value It gives more weight to recent rewards, tracking changes in the environment It uses less memory

Chapter 6: Optimistic Initialization

Here is a surprisingly effective trick: instead of initializing Q₁(a) = 0 for all arms, initialize them to an optimistically high value, like +5. Now even the greedy agent will explore, because every arm it pulls gives a reward lower than +5, which disappoints the agent and pushes it to try other arms.

The agent systematically explores all arms early on, because each pull drives the estimate down, making other (still-optimistic) arms look more attractive. Eventually, all estimates settle near their true values, and the agent converges on the best arm.

Optimism in the face of uncertainty. By starting with high estimates, the agent treats unvisited arms as promising by default. This is a form of systematic exploration that requires no randomness at all — even a greedy policy explores when started optimistically.

Optimistic vs Realistic Initialization

Orange: greedy with Q₁=5. Blue: ε-greedy (0.1) with Q₁=0. Watch how optimistic initialization forces early exploration then converges.

Step: 0

Why does optimistic initialization cause a greedy agent to explore?

Every arm disappoints (rewards are lower than the initial estimate), so the agent keeps trying different arms It adds random noise to the action selection It increases the learning rate

Chapter 7: Upper Confidence Bounds

ε-greedy explores blindly — when it explores, it picks a random arm with no regard for which arms are most uncertain or most promising. Upper Confidence Bound (UCB) action selection is smarter: it factors in how uncertain we are about each arm's value.

A_t = argmax_a [ Q_t(a) + c √( ln(t) / N_t(a) ) ]

The second term is the exploration bonus. It grows when arm a has been pulled fewer times (N_t(a) is small) and shrinks as we pull it more. The parameter c controls how much weight we give to exploration. UCB selects the arm with the highest upper confidence bound on its value — the arm that could plausibly be the best.

Exploration with purpose. ε-greedy says "pick at random sometimes." UCB says "pick the arm I'm most uncertain about, because it could be the best." This directed exploration is provably more efficient: UCB achieves logarithmic regret — the best possible rate.

UCB Confidence Intervals

Each arm shows its Q estimate (solid bar) plus the UCB bonus (lighter region). The arm with the tallest total bar gets selected. Arms pulled less often have wider uncertainty bands.

c (exploration) 2.0

Step: 0

In UCB, what happens to an arm's exploration bonus as it gets pulled more often?

It increases — popular arms get a boost It decreases — well-explored arms are less uncertain It stays constant — the bonus doesn't depend on pull count

Chapter 8: Gradient Bandits

Instead of estimating values and picking the largest, what if we directly learned a preference for each arm? Gradient bandit algorithms maintain a numerical preference H_t(a) for each action, and use the softmax function to turn preferences into selection probabilities:

Pr{A_t = a} = e^H_t(a) / ∑_b e^H_t(b) = π_t(a)

After each reward, we update preferences using stochastic gradient ascent. The key idea: if a reward is better than the average (the baseline R̄_t), increase the preference for the chosen action and decrease the others. If the reward is below average, do the opposite.

H_t+1(A_t) = H_t(A_t) + α(R_t − R̄_t)(1 − π_t(A_t))

H_t+1(a) = H_t(a) − α(R_t − R̄_t) π_t(a) for a ≠ A_t

The baseline trick. Without the baseline R̄_t, the algorithm would increase preference for the chosen action after every positive reward, even if that reward is below average. The baseline centers the update: only rewards better than average reinforce the action. This dramatically reduces variance and speeds learning.

Gradient Bandit Preferences

Watch how preferences H(a) and selection probabilities π(a) evolve. The best arm's probability should rise toward 1.

α 0.10

Step: 0

What role does the baseline R̄_t play in the gradient bandit update?

It sets the initial preference values It centers the update so that only rewards above average reinforce the chosen action It controls the exploration rate

Chapter 9: Comparing Methods

Sutton & Barto's Figure 2.6 is one of the most important plots in the book: a parameter study that compares all the methods we've learned on a single graph. Each method is run across a range of its key parameter, and performance (average reward over the first 1000 steps) is plotted.

The key finding: no single method dominates everywhere. UCB and gradient methods tend to perform well, but the best choice depends on the parameter setting. The parameter study shows each method at its best, revealing which approaches have the highest peak performance and which are most robust.

Fair comparison. You can't compare methods by picking one parameter value for each. A method with a bad parameter looks terrible. The parameter study plots every method at every parameter value, so you see each method at its best. This is the only honest way to compare.

Parameter Study — All Methods

Click "Run Study" to run all four methods across parameter ranges (takes a moment). Each point is the average reward over 1000 steps and 200 runs. Toggle methods on/off.

Click "Run Study" to generate the parameter curves.

Method	Parameter	Range
ε-greedy	ε (exploration rate)	1/128 to 1/4
UCB	c (confidence level)	1/16 to 4
Gradient	α (step size)	1/32 to 2
Optimistic greedy	Q₁ (initial value)	1/4 to 4

Why is a parameter study more informative than comparing methods at a single parameter value?

It shows each method at its best, preventing unfair comparisons due to bad parameter choices It runs faster It eliminates the need for exploration

Chapter 10: Summary & Connections

The multi-armed bandit is the simplest reinforcement learning problem: one state, k actions, immediate rewards. Yet it introduces the central ideas that permeate the entire field:

Concept	What We Learned	Where It Goes Next
Action values	q_*(a) and Q_t(a) estimates	Q-functions in full RL (Ch 6)
Exploration vs exploitation	ε-greedy, UCB, optimistic init	Every RL algorithm must balance this
Incremental updates	Q ← Q + α(R − Q)	TD learning, SARSA, Q-learning
Nonstationary tracking	Constant step-size α	All practical RL uses this
Gradient methods	Softmax + policy gradient	Policy gradient methods (Ch 13)

From bandits to full RL. In Chapter 3, we add states — your actions affect not only the reward but also the next situation you face. This transforms the bandit into a Markov Decision Process (MDP), and the exploration-exploitation tradeoff becomes even richer. But the fundamental update rule Q ← Q + α[Target − Q] stays exactly the same.

What this chapter didn't cover. Bandits have a rich theory beyond this introduction: Bayesian approaches (Thompson sampling), contextual bandits (actions depend on features), adversarial bandits (the environment fights back), and the Gittins index (the optimal solution for discounted infinite-horizon bandits). Each extends the ideas you've learned here.

"The only way to discover the limits of the possible is to go beyond them into the impossible."

— Arthur C. Clarke

You now understand the exploration-exploitation tradeoff. Every decision you make under uncertainty is a bandit problem in disguise.

What is the single most important idea that carries from bandits into full reinforcement learning?

The softmax function for action selection The incremental update rule: NewEstimate = OldEstimate + StepSize(Target - OldEstimate) Optimistic initialization of value estimates

Chapter 3: Finite MDPs →