The simplest decision problem: which slot machine arm should you pull? The entire field of reinforcement learning starts here.
You walk into a casino. In front of you are 10 slot machines, each with a different colored lever. You have 1000 tokens. Each pull costs one token and pays a random reward. Some machines are generous, some are stingy — but you don't know which is which. What do you do?
This is the multi-armed bandit problem, and it captures the most fundamental tension in all of decision-making: exploration vs exploitation. Do you keep pulling the lever that has paid well so far (exploit), or try a new one that might be even better (explore)?
What makes this problem different from supervised learning is the nature of feedback. A teacher tells you the right answer whether you got it right or not — that is instructive feedback. A slot machine only tells you how much this pull paid — it never tells you what the other arms would have given. That is evaluative feedback. You must actively explore to discover better options.
Click any arm to pull it. The reward is random — drawn from a hidden distribution you don't know. Can you find the best arm in 30 pulls?
To decide which arm to pull, we need a way to measure how good each arm is. The true value of action a is the expected reward you get when you select it. Sutton & Barto call this q*(a):
We don't know q*(a) — if we did, we'd just always pick the arm with the highest value. Instead, we maintain estimates Qt(a) and update them as we collect rewards. The simplest estimate is the sample average: the total reward received from arm a divided by how many times we've pulled it.
By the law of large numbers, as we pull arm a more and more times, Qt(a) converges to q*(a). The question is: how do we balance pulling each arm enough to get good estimates while not wasting pulls on bad arms?
Watch how sample-average estimates (orange bars) converge toward true values (teal lines) as you collect more samples. Click "Step" to pull the best-looking arm, or "Step (random)" to explore.
The simplest solution to exploration vs exploitation is the ε-greedy strategy. Most of the time (probability 1 − ε), you pick the action with the highest estimated value — you exploit. But with small probability ε, you pick a completely random action — you explore.
When ε = 0, you are purely greedy: you never explore. You'll lock onto whatever arm looked best early and never discover if another arm is actually better. When ε = 1, you are purely random: you never exploit what you've learned. The sweet spot is somewhere in between.
Drag the slider to change ε. Watch how the balance between exploration (random) and exploitation (greedy) shifts. The bar shows the fraction of actions that would be exploratory.
This is the experiment Sutton & Barto use to compare methods. There are 10 arms. Each arm's true value q*(a) is drawn from a standard normal distribution N(0,1). When you pull arm a, the reward is drawn from N(q*(a), 1) — the true value plus unit Gaussian noise.
Below is a fully interactive 10-armed testbed. You can pull arms manually, or let the agent auto-play with different ε values. Watch how Q estimates converge, and compare the average reward and % optimal action curves across strategies.
The teal markers show true values q*(a). Orange bars are your Q estimates. Click an arm to pull it, or use auto-play below.
Performance curves (averaged over runs):
Computing a sample average from scratch every time is wasteful. You'd have to store every single reward and re-sum them. There is an elegant incremental formula that updates the estimate one step at a time, using only the current estimate and the new reward:
Read this as: NewEstimate = OldEstimate + StepSize × (Target − OldEstimate). The term (Rn − Qn) is the error in our estimate. We nudge the estimate toward the new observation by a fraction 1/n. This formula is exact — it gives the same result as computing the full average — and it's the template for nearly every update rule in reinforcement learning.
Each step shows a new reward Rn, the current estimate Qn, the error, and the updated Qn+1. The teal line is the true value.
The sample-average method weights all past rewards equally. That's fine if the true values never change. But what if the environment is nonstationary — the slot machines slowly get better or worse over time? Then ancient rewards are misleading, and we want to give more weight to recent experience.
The fix is simple: replace the shrinking step size 1/n with a constant step size α. This makes the update rule:
Expanding this recursion reveals that each past reward Ri is weighted by (1 − α)n−i · α. Older rewards get exponentially discounted. This is called exponential recency-weighted averaging. The larger α is, the faster old data is forgotten.
Drag α to see how past rewards are weighted. Higher α forgets faster. The bars show the weight assigned to each of the last 30 rewards.
Here is a surprisingly effective trick: instead of initializing Q1(a) = 0 for all arms, initialize them to an optimistically high value, like +5. Now even the greedy agent will explore, because every arm it pulls gives a reward lower than +5, which disappoints the agent and pushes it to try other arms.
The agent systematically explores all arms early on, because each pull drives the estimate down, making other (still-optimistic) arms look more attractive. Eventually, all estimates settle near their true values, and the agent converges on the best arm.
Orange: greedy with Q1=5. Blue: ε-greedy (0.1) with Q1=0. Watch how optimistic initialization forces early exploration then converges.
ε-greedy explores blindly — when it explores, it picks a random arm with no regard for which arms are most uncertain or most promising. Upper Confidence Bound (UCB) action selection is smarter: it factors in how uncertain we are about each arm's value.
The second term is the exploration bonus. It grows when arm a has been pulled fewer times (Nt(a) is small) and shrinks as we pull it more. The parameter c controls how much weight we give to exploration. UCB selects the arm with the highest upper confidence bound on its value — the arm that could plausibly be the best.
Each arm shows its Q estimate (solid bar) plus the UCB bonus (lighter region). The arm with the tallest total bar gets selected. Arms pulled less often have wider uncertainty bands.
Instead of estimating values and picking the largest, what if we directly learned a preference for each arm? Gradient bandit algorithms maintain a numerical preference Ht(a) for each action, and use the softmax function to turn preferences into selection probabilities:
After each reward, we update preferences using stochastic gradient ascent. The key idea: if a reward is better than the average (the baseline R̄t), increase the preference for the chosen action and decrease the others. If the reward is below average, do the opposite.
Watch how preferences H(a) and selection probabilities π(a) evolve. The best arm's probability should rise toward 1.
Sutton & Barto's Figure 2.6 is one of the most important plots in the book: a parameter study that compares all the methods we've learned on a single graph. Each method is run across a range of its key parameter, and performance (average reward over the first 1000 steps) is plotted.
The key finding: no single method dominates everywhere. UCB and gradient methods tend to perform well, but the best choice depends on the parameter setting. The parameter study shows each method at its best, revealing which approaches have the highest peak performance and which are most robust.
Click "Run Study" to run all four methods across parameter ranges (takes a moment). Each point is the average reward over 1000 steps and 200 runs. Toggle methods on/off.
| Method | Parameter | Range |
|---|---|---|
| ε-greedy | ε (exploration rate) | 1/128 to 1/4 |
| UCB | c (confidence level) | 1/16 to 4 |
| Gradient | α (step size) | 1/32 to 2 |
| Optimistic greedy | Q1 (initial value) | 1/4 to 4 |
The multi-armed bandit is the simplest reinforcement learning problem: one state, k actions, immediate rewards. Yet it introduces the central ideas that permeate the entire field:
| Concept | What We Learned | Where It Goes Next |
|---|---|---|
| Action values | q*(a) and Qt(a) estimates | Q-functions in full RL (Ch 6) |
| Exploration vs exploitation | ε-greedy, UCB, optimistic init | Every RL algorithm must balance this |
| Incremental updates | Q ← Q + α(R − Q) | TD learning, SARSA, Q-learning |
| Nonstationary tracking | Constant step-size α | All practical RL uses this |
| Gradient methods | Softmax + policy gradient | Policy gradient methods (Ch 13) |
You now understand the exploration-exploitation tradeoff. Every decision you make under uncertainty is a bandit problem in disguise.