Decision Making Under Uncertainty Workbook

Chapter 0: Bayesian Networks & Joint Probability

A Bayesian network (also called a belief network or directed graphical model) encodes a joint probability distribution over a set of random variables using a directed acyclic graph (DAG). Each node is a variable. Each directed edge says "the parent influences the child." The power of a Bayes net is factorization: instead of storing an exponentially large joint table, you store small conditional probability tables (CPTs) — one per node, conditioned only on its parents.

Chain rule via Bayes net:
P(x₁, x₂, ..., x_n) = ∏_i P(x_i | parents(x_i))

Example (4-node chain A → B → C → D):
P(a, b, c, d) = P(a) · P(b|a) · P(c|b) · P(d|c)

The factorization principle. A Bayes net with n binary variables needs at most ∑_i 2^|parents(i)| parameters instead of 2ⁿ − 1 for the full joint table. A chain of 10 binary variables: full joint = 1023 params, Bayes net = 1 + 9 × 2 = 19 params. That's a 54× compression.
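The parameter counts above are easy to check with a few lines of code. This is a minimal sketch for binary variables only; the function names and the `parentCounts` input format are made up for illustration.

```javascript
// Free parameters in a Bayes net over binary variables.
// parentCounts[i] = number of parents of node i (hypothetical input format).
function bayesNetParamCount(parentCounts) {
  // A binary node needs one free parameter per configuration of its parents.
  return parentCounts.reduce((sum, p) => sum + 2 ** p, 0);
}

// Free parameters in the full joint table over n binary variables.
function fullJointParamCount(n) {
  return 2 ** n - 1;
}

// 10-node chain: the first node has no parent, the other nine have one each.
const chainParents = [0, 1, 1, 1, 1, 1, 1, 1, 1, 1];
const netParams = bayesNetParamCount(chainParents);
const jointParams = fullJointParamCount(10);
console.log(netParams, jointParams, (jointParams / netParams).toFixed(1));
```

For the chain this gives 19 versus 1023 parameters, a ratio of roughly 54.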

Exercise 0.1: Sequence Probability from a Bayes Net Derive

A Markov chain generates a 4-digit sequence x₁, x₂, x₃, x₄ where each digit is in {1,2,3,4,5,6}. The transition model is:

P(x_{i+1} = j | x_i = k) = (j − k)² / ∑_{l=1..6} (l − k)²

The prior is uniform: P(x₁ = k) = 1/6 for all k. Compute P(x₁=3, x₂=5, x₃=1, x₄=4).

Hint: First compute the denominator ∑(l−k)² for each k you need, then multiply the chain P(x₁) · P(x₂|x₁) · P(x₃|x₂) · P(x₄|x₃).

probability

Show derivation

We need P(3) · P(5|3) · P(1|5) · P(4|1).

P(x₁=3) = 1/6

For k=3: ∑_l=1..6(l−3)² = 4+1+0+1+4+9 = 19

P(x₂=5 | x₁=3) = (5−3)²/19 = 4/19

For k=5: ∑_l=1..6(l−5)² = 16+9+4+1+0+1 = 31

P(x₃=1 | x₂=5) = (1−5)²/31 = 16/31

For k=1: ∑_l=1..6(l−1)² = 0+1+4+9+16+25 = 55

P(x₄=4 | x₃=1) = (4−1)²/55 = 9/55

P(3,5,1,4) = (1/6) × (4/19) × (16/31) × (9/55)

Numerator: 1 × 4 × 16 × 9 = 576

Denominator: 6 × 19 × 31 × 55 = 194,370

P = 576 / 194,370 ≈ 0.00296

This transition model assigns higher probability to transitions with large jumps (since (j−k)² grows with distance). The sequence 3→5→1→4 has moderate jumps, giving a probability of about 0.3%.
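If you want to check the arithmetic, the derivation can be reproduced in code. This is a sketch under the exercise's assumptions (digits 1 to 6, uniform prior on the first digit); `transitionProb` and `sequenceProb` are illustrative names, not part of the workbook.

```javascript
// Transition model from Exercise 0.1: P(next = j | current = k) is
// proportional to (j - k)^2, normalized over all faces.
function transitionProb(j, k, faces = [1, 2, 3, 4, 5, 6]) {
  const denom = faces.reduce((s, l) => s + (l - k) ** 2, 0);
  return (j - k) ** 2 / denom;
}

// Probability of a whole sequence, with a uniform prior on the first digit.
function sequenceProb(seq, faces = [1, 2, 3, 4, 5, 6]) {
  let prob = 1 / faces.length;
  for (let i = 1; i < seq.length; i++) {
    prob *= transitionProb(seq[i], seq[i - 1], faces);
  }
  return prob;
}

console.log(sequenceProb([3, 5, 1, 4]).toFixed(5)); // 0.00296
```

Note that a repeated digit has probability zero under this model, since (k − k)² = 0.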

Exercise 0.2: Conditional Probability from a Bayes Net Derive

Consider a Bayes net with structure: A → C, B → C, C → D. All variables are binary (0 or 1).

Compute P(C=1). (Marginalize over A and B.)

probability

Show derivation

P(C=1) = ∑_a,b P(A=a) P(B=b) P(C=1|A=a,B=b)

= 0.3×0.6×0.9 + 0.3×0.4×0.7 + 0.7×0.6×0.4 + 0.7×0.4×0.1

= 0.162 + 0.084 + 0.168 + 0.028 = 0.442

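A and B are independent (no edge between them), so P(A,B) = P(A)P(B). We enumerate all four combinations of (A,B) and weight the CPT entries. In the sum above, the first factor of each term is the marginal of A (0.3 or 0.7), the second is the marginal of B (0.6 or 0.4), and the third is the CPT entry P(C=1|a,b) for that combination (0.9, 0.7, 0.4, 0.1).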

Exercise 0.3: D-Separation Trace

In the Bayes net A → C ← B, C → D: are A and B independent given no evidence?

Yes: A and B are d-separated (the collider C blocks the path when C is not observed)
No: they share a child so they're always dependent
It depends on the CPT values

Show explanation

A → C ← B is a collider (v-structure) at C. In a collider, the path is blocked when the collider node (C) and all its descendants are not observed. Since we have no evidence, C is not observed, so A and B are d-separated — hence independent.

If we did observe C (or D, which is a descendant of C), the path would open and A and B would become dependent. This is the "explaining away" phenomenon.

Exercise 0.4: D-Separation with Evidence Trace

Same Bayes net: A → C ← B, C → D. Are A and B independent given D=1?

Yes: D is not between A and B so it doesn't matter
No: observing D (a descendant of collider C) opens the path, making A and B dependent
No: A and B are always dependent in any Bayes net

Show explanation

D is a descendant of the collider C. Observing a descendant of a collider has the same unblocking effect as observing the collider itself. Once D=1 is observed, we gain information about C, which "opens" the A — C — B path. Now A and B are dependent: knowing something about A changes our belief about C, which changes our belief about B.
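Explaining away can be seen numerically by brute-force enumeration. The sketch below covers only A, B and C, because no CPT for D is given here. The CPT numbers are an assumption: they reuse the values from the Exercise 0.2 derivation with an assumed labeling (P(A=1)=0.3, P(B=1)=0.6), and all names are illustrative.

```javascript
// Assumed CPTs for A -> C <- B (labeling of the Exercise 0.2 numbers is a guess).
const pA = [0.7, 0.3];                    // pA[a] = P(A = a)
const pB = [0.4, 0.6];                    // pB[b] = P(B = b)
const pC1 = [[0.1, 0.4], [0.7, 0.9]];     // pC1[a][b] = P(C = 1 | A = a, B = b)

// Joint probability from the factorization P(a) P(b) P(c | a, b).
function joint(a, b, c) {
  const p1 = pC1[a][b];
  return pA[a] * pB[b] * (c === 1 ? p1 : 1 - p1);
}

// P(A = 1 | evidence). Evidence may fix b and/or c; undefined means unobserved.
function probA1Given({ b, c } = {}) {
  let num = 0, den = 0;
  for (const a of [0, 1]) {
    for (const bb of [0, 1]) {
      for (const cc of [0, 1]) {
        if (b !== undefined && bb !== b) continue;
        if (c !== undefined && cc !== c) continue;
        const p = joint(a, bb, cc);
        den += p;
        if (a === 1) num += p;
      }
    }
  }
  return num / den;
}

console.log(probA1Given({}), probA1Given({ b: 1 }));             // collider unobserved
console.log(probA1Given({ c: 1 }), probA1Given({ b: 1, c: 1 })); // collider observed
```

With C unobserved, learning B leaves P(A=1) at 0.3. Once C=1 is observed, P(A=1) rises to about 0.557, and additionally learning B=1 pulls it back down to about 0.491: B "explains away" part of the evidence for A.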

Exercise 0.5: Implement bayesNetProb() Build

Implement a function that computes the joint probability of a 4-node chain A → B → C → D. You're given the prior P(A) and a single transition CPT function P(child|parent).

Multiply the prior by the product of conditional probabilities along the chain.

Show solution

javascript
function bayesNetProb(seq, prior, cpt) {
  let prob = prior(seq[0]);
  for (let i = 1; i < seq.length; i++) {
    prob *= cpt(seq[i], seq[i - 1]);
  }
  return prob;
}

Exercise 0.6: Find the Bug Debug

This function computes P(C=1) by marginalizing over parents A and B in a Bayes net A→C, B→C. It's returning a value greater than 1. Find the buggy line.

function marginalizeC(pA, pB, cpt) {
  let pC1 = 0;
  for (let a of [0, 1]) {
    for (let b of [0, 1]) {
      let pAB = pA[a] + pB[b];
      pC1 += pAB * cpt[a][b];
    }
  }
  return pC1;
}

Show explanation

Line 5 is the bug. Since A and B are independent, P(A=a, B=b) = P(A=a) × P(B=b), not P(A=a) + P(B=b). Using addition instead of multiplication gives a value that's way too large (and can exceed 1). The fix: let pAB = pA[a] * pB[b];
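For reference, here is the function with the fix applied, followed by a quick check. The input values are an assumption: they reuse the numbers from the Exercise 0.2 derivation with an assumed index order (index 0 for value 0, index 1 for value 1).

```javascript
// Corrected version: independent parents multiply, they do not add.
function marginalizeC(pA, pB, cpt) {
  let pC1 = 0;
  for (const a of [0, 1]) {
    for (const b of [0, 1]) {
      const pAB = pA[a] * pB[b]; // P(A=a, B=b) for independent A and B
      pC1 += pAB * cpt[a][b];    // cpt[a][b] = P(C=1 | A=a, B=b)
    }
  }
  return pC1;
}

// Assumed inputs (labeling of the Exercise 0.2 numbers is a guess).
const result = marginalizeC([0.7, 0.3], [0.4, 0.6], [[0.1, 0.4], [0.7, 0.9]]);
console.log(result.toFixed(3)); // 0.442
```

With the same inputs, the buggy version sums to 1.93, which is an immediate sign that something other than a probability is being computed.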

Chapter 1: Bayesian Inference & Dirichlet

You've observed data and you want to update your beliefs. The Dirichlet distribution is the conjugate prior for the categorical (multinomial) distribution — meaning if your prior is Dirichlet and your data is categorical, your posterior is also Dirichlet. This makes Bayesian updates trivially easy: just add observed counts to the prior pseudo-counts.

Dirichlet prior: Dir(α₁, α₂, ..., α_k)
After observing counts n₁, n₂, ..., n_k:
Posterior: Dir(α₁+n₁, α₂+n₂, ..., α_k+n_k)

Posterior mean for category i: (α_i + n_i) / ∑_j(α_j + n_j)
Posterior mode for category i: (α_i + n_i − 1) / (∑_j(α_j + n_j) − k) // only valid when all α+n > 1

Mean vs. Mode. The posterior mean incorporates the "phantom counts" from the prior fully. The posterior mode subtracts 1 from each category (the -1 comes from the Dirichlet density's exponent being α−1). When data is scarce, the mean is more conservative; the mode can be more extreme.

Exercise 1.1: Dirichlet Posterior Mean Derive

A 6-sided die has a Dirichlet prior Dir(2,2,2,2,2,2). You roll the die 3 times and observe: 4, 3, 4. What is the posterior mean probability of rolling a 4?

probability

Show derivation

Prior: Dir(2,2,2,2,2,2)

Counts: face 3 appears 1 time, face 4 appears 2 times

Posterior: Dir(2, 2, 2+1, 2+2, 2, 2) = Dir(2, 2, 3, 4, 2, 2)

∑ = 2+2+3+4+2+2 = 15

Mean for face 4 = 4/15 ≈ 0.267

The prior contributes 12 "phantom observations" (2 per face). Combined with 3 real observations, we have 15 total. The prior pulls the estimate toward 1/6 ≈ 0.167, while the data (face 4 appeared 2/3 times = 0.667) pulls it up. The mean 0.267 is a compromise.

Exercise 1.2: Dirichlet Posterior Mode Derive

Same setup: Dir(2,2,2,2,2,2) prior, observations 4, 3, 4. What is the posterior mode (MAP estimate) for the probability of rolling a 4?

Recall: mode_i = (α_i − 1) / (∑α_j − k) where k is the number of categories.

probability

Show derivation

Posterior: Dir(2, 2, 3, 4, 2, 2), k = 6

∑α_j = 15

Mode for face 4 = (4 − 1) / (15 − 6) = 3/9 = 1/3 ≈ 0.333

The mode is higher than the mean (0.333 vs 0.267) because the mode represents the "peak" of the posterior density, which is pulled more toward the observed data than the mean. For large sample sizes, mode and mean converge.
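The mean and mode calculations from Exercises 1.1 and 1.2 can be put side by side in code. This is a sketch with illustrative function names; returning `null` when the interior-mode formula does not apply is a design choice made here, not something the workbook prescribes.

```javascript
// Posterior parameters: prior pseudo-counts plus observed counts.
function dirichletPosterior(alphas, counts) {
  return alphas.map((a, i) => a + counts[i]);
}

function dirichletMean(params) {
  const total = params.reduce((s, v) => s + v, 0);
  return params.map(p => p / total);
}

// Interior mode (p_i - 1) / (sum - k); only valid when every parameter exceeds 1.
function dirichletMode(params) {
  if (params.some(p => p <= 1)) return null;
  const total = params.reduce((s, v) => s + v, 0);
  return params.map(p => (p - 1) / (total - params.length));
}

// Rolls 4, 3, 4 on a six-sided die with a Dir(2,2,2,2,2,2) prior.
const post = dirichletPosterior([2, 2, 2, 2, 2, 2], [0, 0, 1, 2, 0, 0]);
console.log(dirichletMean(post)[3].toFixed(3)); // 0.267
console.log(dirichletMode(post)[3].toFixed(3)); // 0.333
```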

Exercise 1.3: Informative vs. Uninformative Priors Trace

Which prior is more informative (harder to override with data)?

Dir(1, 1, 1): it's uniform so it encodes strong certainty that outcomes are equally likely
Dir(10, 10, 10): it has more pseudo-counts so it takes more data to shift the posterior
They're equally informative since both are symmetric

Show explanation

Dir(1,1,1) is the uniform Dirichlet: it says "any probability vector is equally likely." It contributes only 3 pseudo-counts total. Dir(10,10,10) contributes 30 pseudo-counts, which is like starting from the uniform Dir(1,1,1) and then seeing 27 evenly distributed observations (9 per category). You'd need far more data to move the posterior away from (1/3, 1/3, 1/3) with the Dir(10,10,10) prior.

The "informativeness" of a Dirichlet is controlled by ∑α_i (the concentration). Higher concentration = stronger prior = slower updates.
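To see "slower updates" in numbers, feed the same data to both priors. The counts below are hypothetical (nine observations, all in the first category), chosen only to make the contrast visible.

```javascript
function posteriorMean(alphas, counts) {
  const post = alphas.map((a, i) => a + counts[i]);
  const total = post.reduce((s, v) => s + v, 0);
  return post.map(v => v / total);
}

// Hypothetical data: 9 observations, all landing in category 1.
const observed = [9, 0, 0];
const weakPrior = posteriorMean([1, 1, 1], observed);      // first entry is 10/12
const strongPrior = posteriorMean([10, 10, 10], observed); // first entry is 19/39
console.log(weakPrior[0].toFixed(3), strongPrior[0].toFixed(3));
```

Under Dir(1,1,1) the estimate for category 1 jumps to about 0.83, while under Dir(10,10,10) it only reaches about 0.49.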

Exercise 1.4: Likelihood-Weighted Sampling Derive

Consider the Bayes net A → B with P(A=1)=0.4 and P(B=1|A=1)=0.7, P(B=1|A=0)=0.2. We want to estimate P(A=1|B=1) using likelihood weighting. In one sample, we sample A=1 from the prior. The evidence is B=1. What is the weight of this sample?

In likelihood weighting, evidence nodes are not sampled — instead, the weight is the product of P(evidence_node = observed_value | parents) for each evidence node.

weight

Show derivation

We sampled A=1. Evidence is B=1.

Weight = P(B=1 | A=1) = 0.7

If we had sampled A=0 instead, the weight would be P(B=1|A=0) = 0.2. Samples consistent with evidence get high weight; inconsistent ones get low weight. Over many samples, the weighted average converges to the true posterior.
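Here is a minimal likelihood-weighting sketch for this two-node net. The `rng` parameter is an addition for illustration: it defaults to `Math.random`, and passing your own generator makes runs repeatable.

```javascript
// Estimate P(A=1 | B=1) for the net A -> B by likelihood weighting.
function likelihoodWeighting(numSamples, rng = Math.random) {
  const pA1 = 0.4;
  const pB1GivenA = [0.2, 0.7]; // indexed by the value of A
  let weightA1 = 0;
  let weightTotal = 0;
  for (let i = 0; i < numSamples; i++) {
    const a = rng() < pA1 ? 1 : 0; // sample the non-evidence node from its prior
    const w = pB1GivenA[a];        // weight = P(B=1 | sampled A); B is never sampled
    weightTotal += w;
    if (a === 1) weightA1 += w;
  }
  return weightA1 / weightTotal;   // weighted fraction of samples with A=1
}

console.log(likelihoodWeighting(100000));
```

The exact answer by Bayes rule is 0.4×0.7 / (0.4×0.7 + 0.6×0.2) = 0.28 / 0.40 = 0.7, so the printed estimate should land close to that.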

Exercise 1.5: Implement dirichletPosteriorMean() Build

Given a Dirichlet prior (array of alphas) and an array of observation counts, return the posterior mean as an array of probabilities.

Show solution

javascript
function dirichletPosteriorMean(alphas, counts) {
  const post = alphas.map((a, i) => a + counts[i]);
  const total = post.reduce((s, v) => s + v, 0);
  return post.map(v => v / total);
}

Chapter 2: Value of Information

You're deciding whether to launch a rocket or refuel and wait. Fuel levels might be sufficient (probability 0.6) or insufficient (probability 0.4). A sensor can tell you — but the sensor costs money. How much is that information worth? That's the Value of Information (VOI): the expected improvement in decision quality from observing a variable before acting.

VOI formula:
VOI(sensor) = E_o[EU*(a* | sensor=o)] − EU*(a* | no sensor)

Where EU*(a*|...) means the expected utility of the optimal action given the information state.

VOI is always ≥ 0. Information can never hurt a rational decision maker. In the worst case, you observe the sensor and it doesn't change your optimal action — then VOI = 0. But it can never be negative: you can always ignore the sensor and fall back to the no-information optimal action.

Exercise 2.1: EU Without Sensor Derive

P(sufficient fuel) = 0.6, P(insufficient) = 0.4. Two actions:

Launch: +12 if fuel sufficient, −60 if insufficient
Refuel: +5 regardless

What is the optimal expected utility without a sensor?

utility

Show derivation

EU(launch) = 0.6 × 12 + 0.4 × (−60) = 7.2 − 24 = −16.8

EU(refuel) = 5

EU*(no sensor) = max(−16.8, 5) = 5

Without the sensor, refueling is optimal. The 40% chance of catastrophic failure makes launching too risky.

Exercise 2.2: Bayes Rule for Sensor Reading Derive

The sensor has: P(positive | sufficient) = 0.95, P(positive | insufficient) = 0.05. Compute P(sufficient | sensor positive) using Bayes rule.

probability

Show derivation

P(sensor+) = P(+|suff)P(suff) + P(+|insuff)P(insuff)

= 0.95 × 0.6 + 0.05 × 0.4 = 0.57 + 0.02 = 0.59

P(suff | sensor+) = P(+|suff)P(suff) / P(sensor+) = 0.57 / 0.59 = 0.966

A positive reading dramatically increases our confidence: from 60% to 96.6%. The sensor is highly reliable (95% true positive rate), so a positive reading is very informative.

Exercise 2.3: Compute VOI Derive

Using the sensor from Exercise 2.2, compute the Value of Information. Remember: VOI = E_o[EU*(best action | sensor=o)] − EU*(no sensor).

You need EU* for both sensor outcomes (positive and negative), weighted by P(sensor outcome).

utility

Show derivation

If sensor positive (P=0.59):

P(suff|+) = 0.966

EU(launch|+) = 0.966×12 + 0.034×(−60) = 11.59 − 2.03 = 9.56

EU(refuel|+) = 5

EU*(+) = max(9.56, 5) = 9.56

If sensor negative (P=0.41):

P(suff|−) = 0.05×0.6/0.41 = 0.073

EU(launch|−) = 0.073×12 + 0.927×(−60) = 0.88 − 55.61 = −54.73

EU*(−) = max(−54.73, 5) = 5

VOI:

E[EU*|sensor] = 0.59×9.56 + 0.41×5 = 5.64 + 2.05 = 7.69

VOI = 7.69 − 5 = 2.69

The sensor is worth 2.69 utility. If it costs less than that, buy it. Most of the value comes from the positive case: the sensor lets us confidently launch (EU=9.56) instead of conservatively refueling (EU=5).
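The whole rocket calculation fits in a short script, which is handy for checking the rounding. This sketch hard-codes the numbers from Exercises 2.1 to 2.3; the variable and function names are illustrative.

```javascript
const prior = [0.6, 0.4];                           // [sufficient, insufficient]
const utility = { launch: [12, -60], refuel: [5, 5] };
// P(reading | state); the negative row is the complement of the positive row.
const sensor = { positive: [0.95, 0.05], negative: [0.05, 0.95] };

// Expected utility of the best action under a belief over the two states.
function bestEU(belief) {
  return Math.max(
    ...Object.values(utility).map(u => u[0] * belief[0] + u[1] * belief[1])
  );
}

// Bayes rule: probability of the reading and the posterior over states.
function posterior(likelihood) {
  const joint = prior.map((p, s) => p * likelihood[s]);
  const pObs = joint[0] + joint[1];
  return { pObs, belief: joint.map(j => j / pObs) };
}

function valueOfInformation() {
  let euWithSensor = 0;
  for (const likelihood of Object.values(sensor)) {
    const { pObs, belief } = posterior(likelihood);
    euWithSensor += pObs * bestEU(belief);
  }
  return euWithSensor - bestEU(prior);
}

console.log(valueOfInformation().toFixed(2)); // 2.69
```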

Exercise 2.4: When Is VOI Zero? Trace

Under which condition is the Value of Information for an observation exactly zero?

When the sensor is perfectly accurate (100% reliable) When no possible observation outcome would change the optimal action When the prior probabilities are uniform VOI is never exactly zero

Show explanation

VOI = 0 when the optimal action is the same regardless of the observation. For example, if refueling gives EU=100 and launching gives EU=12 even in the best case — no sensor reading could possibly make launching better. The sensor would update your beliefs, but you'd still choose refuel every time. The information is useless for decision-making.

A perfectly accurate sensor can actually have high VOI (it resolves all uncertainty). And uniform priors don't guarantee VOI=0.

Exercise 2.5: Implement computeVOI() Build

Given prior P(state), action utilities per state, and a sensor model, compute VOI.

Show solution

javascript
function computeVOI(pState, utils, sensorModel) {
  // EU without sensor
  let euNoSensor = -Infinity;
  for (let a = 0; a < utils.length; a++) {
    let eu = 0;
    for (let s = 0; s < pState.length; s++)
      eu += pState[s] * utils[a][s];
    euNoSensor = Math.max(euNoSensor, eu);
  }
  // EU with sensor
  let euWithSensor = 0;
  for (let o = 0; o < sensorModel.length; o++) {
    let pObs = 0;
    for (let s = 0; s < pState.length; s++)
      pObs += sensorModel[o][s] * pState[s];
    if (pObs === 0) continue; // skip observations that cannot occur (avoids 0/0)
    let bestEU = -Infinity;
    for (let a = 0; a < utils.length; a++) {
      let eu = 0;
      for (let s = 0; s < pState.length; s++) {
        let pPost = sensorModel[o][s]*pState[s]/pObs;
        eu += pPost * utils[a][s];
      }
      bestEU = Math.max(bestEU, eu);
    }
    euWithSensor += pObs * bestEU;
  }
  return euWithSensor - euNoSensor;
}

Chapter 3: MDPs & Bellman Backup

A Markov Decision Process (MDP) is the mathematical framework for sequential decision making when the world is fully observable. You have states, actions, transition probabilities, rewards, and a discount factor. The Bellman backup is the core operation: it computes the value of a state by looking one step ahead and then using the values of successor states.

Bellman backup (value iteration):
U_{k+1}(s) = max_a [ R(s,a) + γ ∑_{s'} T(s'|s,a) · U_k(s') ]

Drone example (2 states, 2 actions):
States: s_c (calm), s_t (turbulent)
Actions: fly, hover
R(s_c,fly)=5, R(s_c,hover)=2, R(s_t,fly)=−1, R(s_t,hover)=1
T(s_c|s_c,fly)=0.6, T(s_t|s_c,fly)=0.4
T(s_c|s_t,fly)=0.3, T(s_t|s_t,fly)=0.7
T(s_c|s_c,hover)=1.0, T(s_c|s_t,hover)=0.4, T(s_t|s_t,hover)=0.6
γ = 0.9, U₀(s) = 0 for all s

The Bellman backup is a one-step lookahead. At each iteration, you ask: "If I take action a from state s, I get immediate reward R(s,a), then land in state s' with probability T(s'|s,a) and get discounted future value γU(s'). Which action maximizes this?" That's it. Repeat until convergence.

Exercise 3.1: First Bellman Backup — Calm State Derive

Starting from U₀(s_c)=0, U₀(s_t)=0. Compute U₁(s_c).

utility

Show derivation

U₁(s_c, fly) = 5 + 0.9×(0.6×0 + 0.4×0) = 5 + 0 = 5

U₁(s_c, hover) = 2 + 0.9×(1.0×0 + 0.0×0) = 2 + 0 = 2

U₁(s_c) = max(5, 2) = 5, so π₁(s_c) = fly

With U₀=0 everywhere, the future value is zero. The first iteration just picks the action with the highest immediate reward: fly gives 5, hover gives 2.

Exercise 3.2: First Bellman Backup — Turbulent State Derive

Compute U₁(s_t) from U₀=0.

utility

Show derivation

U₁(s_t, fly) = −1 + 0.9×(0.3×0 + 0.7×0) = −1

U₁(s_t, hover) = 1 + 0.9×(0.4×0 + 0.6×0) = 1

U₁(s_t) = max(−1, 1) = 1, so π₁(s_t) = hover

In turbulence, flying is dangerous (reward −1). Hovering is safe (reward 1). The greedy first-iteration policy: fly in calm, hover in turbulence.

Exercise 3.3: Second Bellman Iteration Derive

Now using U₁(s_c)=5 and U₁(s_t)=1, compute U₂(s_c). Which action is optimal?

utility

Show derivation

U₂(s_c, fly) = 5 + 0.9×(0.6×5 + 0.4×1) = 5 + 0.9×3.4 = 5 + 3.06 = 8.06

U₂(s_c, hover) = 2 + 0.9×(1.0×5 + 0.0×1) = 2 + 4.5 = 6.5

U₂(s_c) = max(8.06, 6.5) = 8.06, so π₂(s_c) = fly

Now future values matter! Flying gives 5 immediate reward plus 0.9×(expected future). The expected next-state value is 0.6×5 + 0.4×1 = 3.4 because flying from calm has a 60% chance of staying calm (value 5) and 40% chance of turbulence (value 1).

Exercise 3.4: What Does γ Control? Trace

What happens to the optimal policy as the discount factor γ approaches 0?

The agent becomes maximally myopic: it only cares about immediate reward and ignores future consequences
The agent becomes maximally farsighted: it weighs all future rewards equally
Nothing changes: γ only affects convergence speed
The value function becomes infinite

Show explanation

When γ → 0, the Bellman equation reduces to U(s) = max_a R(s,a). Future states don't matter at all. The agent is completely myopic — it grabs the biggest immediate reward without considering consequences. At γ=0.99, the agent weighs a reward 100 steps in the future at 0.99¹⁰⁰ ≈ 0.37 of its current value — still significant. At γ=0.5, that same reward is worth 0.5¹⁰⁰ ≈ 10⁻³⁰ — essentially zero.

Exercise 3.5: Order the Value Iteration Steps Design

Place these steps in the correct order for the value iteration algorithm.

Check convergence
Extract policy
Bellman backup all states
Initialize U(s)=0
Repeat from backup

Show correct order

Initialize → Bellman backup → Check convergence → Repeat → Extract policy

Initialize U=0, then repeatedly apply Bellman backups to all states. After each sweep, check if max_s |U_{k+1}(s) − U_k(s)| < ε. If not converged, repeat. Once converged, extract the greedy policy: π(s) = argmax_a [R(s,a) + γ ∑_{s'} T(s'|s,a) U(s')].
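The five steps can be sketched end to end on the drone MDP from the start of this chapter. The data layout (objects keyed by state and action names, transitions as `[probability, nextState]` pairs) and the function names are choices made for this example; the missing entry T(s_t|s_c,hover) is taken to be 0, as the Exercise 3.1 derivation implies.

```javascript
const states = ["calm", "turbulent"];
const actions = ["fly", "hover"];
const R = { calm: { fly: 5, hover: 2 }, turbulent: { fly: -1, hover: 1 } };
// T[s][a] = list of [probability, nextState] pairs.
const T = {
  calm: { fly: [[0.6, "calm"], [0.4, "turbulent"]], hover: [[1.0, "calm"]] },
  turbulent: {
    fly: [[0.3, "calm"], [0.7, "turbulent"]],
    hover: [[0.4, "calm"], [0.6, "turbulent"]],
  },
};
const gamma = 0.9;

// One-step lookahead value of taking action a in state s.
function qValue(s, a, U) {
  return R[s][a] + gamma * T[s][a].reduce((sum, [p, ns]) => sum + p * U[ns], 0);
}

// One sweep: Bellman backup applied to every state.
function sweep(U) {
  const next = {};
  for (const s of states) next[s] = Math.max(...actions.map(a => qValue(s, a, U)));
  return next;
}

function valueIteration(epsilon = 1e-6, maxSweeps = 10000) {
  let U = { calm: 0, turbulent: 0 };                 // initialize
  for (let k = 0; k < maxSweeps; k++) {
    const next = sweep(U);                           // backup all states
    const delta = Math.max(...states.map(s => Math.abs(next[s] - U[s])));
    U = next;
    if (delta < epsilon) break;                      // check convergence
  }
  const policy = {};                                 // extract greedy policy
  for (const s of states) {
    policy[s] = actions.reduce((best, a) => (qValue(s, a, U) > qValue(s, best, U) ? a : best));
  }
  return { U, policy };
}

console.log(sweep({ calm: 0, turbulent: 0 }));
console.log(valueIteration());
```

The first sweep reproduces U₁ = (5, 1) from Exercises 3.1 and 3.2, and a second sweep gives U₂(s_c) = 8.06 as in Exercise 3.3.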

Exercise 3.6: Implement bellmanBackup() Build

Given a state, perform one Bellman backup. Return the new utility value.

Show solution

javascript
function bellmanBackup(state, actions, R, T, U, gamma) {
  let bestQ = -Infinity;
  for (const a of actions) {
    let q = R[state][a];
    for (const [prob, ns] of T[state][a]) {
      q += gamma * prob * U[ns];
    }
    bestQ = Math.max(bestQ, q);
  }
  return bestQ;
}

Chapter 4: Gaussian Kernel Smoothing

You have a few data points and need to predict the value at a new query point. Gaussian kernel smoothing (Nadaraya-Watson estimator) is a non-parametric method: it computes a weighted average of known values, where the weight decreases with distance. Points closer to the query get more influence.

Kernel smoother:
f̂(x_q) = ∑_i w_i · y_i where w_i = k_i / ∑_j k_j

Gaussian kernel:
k_i = exp(−d_i² / (2σ²))
d_i = ||x_q − x_i||₂ // Euclidean distance

The bandwidth σ is the only knob. Small σ → sharp peaks, only nearest neighbors matter (noisy interpolation). Large σ → broad weights, everyone contributes equally (over-smoothing). As σ → ∞, the prediction converges to the global mean of all y values.

Exercise 4.1: Compute Distances Derive

Query point x_q = [0.6, 1.2]. Known points: x₁=[0,0], x₂=[1,0], x₃=[0,2]. Compute the Euclidean distance d₁ = ||x_q − x₁||.

distance

Show derivation

d₁ = √((0.6−0)² + (1.2−0)²) = √(0.36 + 1.44) = √1.80 = 1.342

Similarly: d₂ = √(0.16+1.44) = √1.60 = 1.265, and d₃ = √(0.36+0.64) = √1.00 = 1.000. Point x₃ is closest to the query.

Exercise 4.2: Compute Kernel Values Derive

Using distances d₁=1.342, d₂=1.265, d₃=1.000 and bandwidth σ=1, compute the Gaussian kernel value k₃ = exp(−d₃²/(2σ²)).

kernel value

Show derivation

k₃ = exp(−1.0² / (2×1²)) = exp(−0.5) = 0.6065

k₁ = exp(−1.80/2) = exp(−0.90) = 0.4066

k₂ = exp(−1.60/2) = exp(−0.80) = 0.4493

x₃ is closest, so it gets the largest kernel value. The kernel decays exponentially with squared distance — moving from d=1.0 to d=1.342 drops the weight from 0.607 to 0.407 (a 33% decrease).

Exercise 4.3: Weighted Prediction Derive

Known values: y₁=1, y₂=3, y₃=4. Kernel values: k₁=0.407, k₂=0.449, k₃=0.607. Compute the kernel-smoothed prediction f̂(x_q).

predicted value

Show derivation

∑k = 0.407 + 0.449 + 0.607 = 1.463

w₁=0.407/1.463=0.278, w₂=0.449/1.463=0.307, w₃=0.607/1.463=0.415

f̂ = 0.278×1 + 0.307×3 + 0.415×4 = 0.278 + 0.921 + 1.660 = 2.859

The prediction ≈ 2.86. Point x₃ (y=4) gets the most weight (41.5%) because it's closest. The result is pulled toward y₃=4 but tempered by the other points.

Exercise 4.4: Effect of Bandwidth Trace

As the bandwidth σ → ∞, what does the kernel smoother prediction converge to?

The value of the nearest neighbor
The unweighted mean of all y values: (y₁+y₂+y₃)/3
Zero: all kernel values go to zero
The maximum y value

Show explanation

As σ → ∞, exp(−d²/(2σ²)) → exp(0) = 1 for all points, regardless of distance. All kernel values become equal, so all weights become 1/n. The prediction converges to the simple average: (1+3+4)/3 = 2.667. Infinite bandwidth = maximum smoothing = global average. Conversely, as σ→0, only the nearest point matters (nearest-neighbor interpolation).
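You can watch both limits by sweeping σ on the three points from Exercises 4.1 to 4.3. This is a sketch with an illustrative function name; the specific σ values are arbitrary picks for "small", "moderate" and "huge".

```javascript
// Nadaraya-Watson estimate with a Gaussian kernel.
function gaussianSmooth(xq, xs, ys, sigma) {
  let sumK = 0;
  let sumKY = 0;
  xs.forEach((x, i) => {
    // Squared Euclidean distance between the query and point i.
    const d2 = x.reduce((s, xj, j) => s + (xq[j] - xj) ** 2, 0);
    const k = Math.exp(-d2 / (2 * sigma * sigma));
    sumK += k;
    sumKY += k * ys[i];
  });
  return sumKY / sumK;
}

const points = [[0, 0], [1, 0], [0, 2]];
const values = [1, 3, 4];
const query = [0.6, 1.2];

for (const sigma of [0.05, 1, 1e6]) {
  console.log(sigma, gaussianSmooth(query, points, values, sigma).toFixed(3));
}
```

The small bandwidth returns essentially y₃ = 4 (the nearest neighbor), σ = 1 gives about 2.859, and the huge bandwidth gives about 2.667, the plain average. One practical caveat: if σ is made much smaller still, every kernel value underflows to 0 in floating point and the ratio becomes 0/0.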

Exercise 4.5: Implement kernelSmooth() Build

Implement the Nadaraya-Watson kernel smoother with a Gaussian kernel.

Show solution

javascript
function kernelSmooth(xq, xs, ys, sigma) {
  // Accept a plain query point [x, y] or a wrapped one [[x, y]]
  if (Array.isArray(xq[0])) xq = xq[0];
  let sumKY = 0, sumK = 0;
  for (let i = 0; i < xs.length; i++) {
    let d2 = 0;
    for (let j = 0; j < xq.length; j++)
      d2 += (xq[j] - xs[i][j]) ** 2;
    const k = Math.exp(-d2 / (2 * sigma * sigma));
    sumK += k;
    sumKY += k * ys[i];
  }
  return sumKY / sumK;
}

Chapter 5: MCTS & UCB1

Monte Carlo Tree Search (MCTS) builds a search tree incrementally by simulating random rollouts. At each node, it uses the UCB1 (Upper Confidence Bound) formula to balance exploitation (picking the action with highest estimated value) and exploration (trying actions that haven't been sampled enough to be confident about).

UCB1 score:
UCB1(a) = Q̄(a) + c · √(ln N / N(a))

Where Q̄(a) = average return of action a
N = total visits to parent node
N(a) = visits to action a
c = exploration parameter (higher c → more exploration)

The exploration bonus shrinks with visits. An action visited 100 times out of 200 total gets a bonus of c·√(ln200/100) = c·0.230. An action visited only twice gets c·√(ln200/2) = c·1.628, about seven times larger. UCB1 naturally gravitates toward under-explored actions.

Exercise 5.1: UCB1 for Fly Derive

A MCTS node has been visited N=10 times total. Action "Fly" has Q̄=3.0 and N(fly)=6 visits. Exploration parameter c=0.7. Compute UCB1(fly).

UCB1 score

Show derivation

UCB1(fly) = 3.0 + 0.7 × √(ln(10) / 6)

ln(10) = 2.3026

√(2.3026 / 6) = √(0.3838) = 0.6195

UCB1(fly) = 3.0 + 0.7 × 0.6195 = 3.0 + 0.434 = 3.43

Exercise 5.2: UCB1 for Hover — Which Gets Selected? Derive

Same node (N=10, c=0.7). Action "Hover" has Q̄=3.2 and N(hover)=4. Compute UCB1(hover) and determine which action MCTS selects.

UCB1 score

Show derivation

UCB1(hover) = 3.2 + 0.7 × √(ln(10) / 4)

√(2.3026 / 4) = √(0.5756) = 0.7587

UCB1(hover) = 3.2 + 0.7 × 0.7587 = 3.2 + 0.531 = 3.73

UCB1(hover)=3.73 > UCB1(fly)=3.43. MCTS selects Hover. Hover wins not just because it has a higher Q̄ (3.2 vs 3.0), but also because it has fewer visits (4 vs 6), giving it a larger exploration bonus. Both factors align here.
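Selection at a node is just an argmax over these scores. The sketch below uses an assumed data layout (`{ name, q, n }` objects) and adopts the common convention that an action with zero visits scores infinity so it gets tried first; neither is prescribed by the exercises.

```javascript
function ucb1(qAvg, nAction, nTotal, c) {
  if (nAction === 0) return Infinity; // convention: try untested actions first
  return qAvg + c * Math.sqrt(Math.log(nTotal) / nAction);
}

// stats: array of { name, q, n }; the parent count is the sum of child visits.
function selectAction(stats, c) {
  const nTotal = stats.reduce((s, a) => s + a.n, 0);
  let best = null;
  let bestScore = -Infinity;
  for (const a of stats) {
    const score = ucb1(a.q, a.n, nTotal, c);
    if (score > bestScore) {
      best = a.name;
      bestScore = score;
    }
  }
  return { best, bestScore };
}

const node = [
  { name: "fly", q: 3.0, n: 6 },
  { name: "hover", q: 3.2, n: 4 },
];
console.log(selectAction(node, 0.7)); // hover, score about 3.73
```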

Exercise 5.3: Exploration vs. Exploitation Trace

If you increase the exploration parameter c from 0.7 to 5.0, what happens?

The algorithm converges faster because it explores more efficiently
The algorithm favors the highest-Q action even more strongly
The exploration bonus dominates: MCTS almost always picks the least-visited action regardless of Q̄
Nothing changes because UCB1 is invariant to c

Show explanation

With c=5.0, the exploration term c·√(ln(N)/N(a)) becomes ~7x larger than with c=0.7. For our example, UCB1(fly) = 3.0 + 5.0×0.620 = 6.10 and UCB1(hover) = 3.2 + 5.0×0.759 = 6.99. The Q-values (3.0 vs 3.2) are dwarfed by the exploration bonuses (3.10 vs 3.79). The algorithm becomes almost purely exploratory: the visit count difference matters more than the estimated value difference.

Exercise 5.4: UCB1 with Many Actions Derive

5 actions with visit counts [4, 3, 2, 1, 1]. Total N=11. The action with N(a)=2 has Q̄=24. c=1. Compute its UCB1 score.

UCB1 score

Show derivation

UCB1 = 24 + 1 × √(ln(11) / 2)

ln(11) = 2.3979

√(2.3979 / 2) = √(1.1990) = 1.095

UCB1 = 24 + 1.095 = 25.095

Exercise 5.5: Implement ucb1Score() Build

Implement the UCB1 scoring function.

Show solution

javascript
function ucb1Score(qAvg, nAction, nTotal, c) {
  return qAvg + c * Math.sqrt(Math.log(nTotal) / nAction);
}

Chapter 6: Policy Gradients & Optimization

Instead of learning a value function and deriving a policy, policy gradient methods directly optimize the policy parameters θ. The policy π_θ(a|s) is a parameterized probability distribution over actions. We adjust θ in the direction that increases expected return. The key formula: the gradient of the expected return involves the score function ∇ log π_θ(a|s).

Logistic policy (binary action):
π_θ(Boost | s) = σ(θ₁s + θ₂) = 1/(1 + exp(−(θ₁s + θ₂)))

Score function gradient (for action = Boost):
∇_θ log π_θ(Boost|s) = (1 − π) · [s, 1]

Policy gradient update:
θ ← θ + α · U(τ) · ∇_θ log π_θ(a|s)

Why the score function? The score function ∇ log π tells us: "Which direction in parameter space makes the taken action more likely?" Multiply by the return U(τ) to get: "Make actions that led to high returns more likely, and actions that led to low returns less likely." That's the entire intuition behind REINFORCE.

Exercise 6.1: Logistic Policy Gradient Derive

θ=[1,0], s=2. The agent took action Boost. Trajectory return U(τ)=6. Compute the policy gradient update Δθ = U(τ) · ∇ log π(Boost|s). Give the first component Δθ₁.

First compute π, then the score, then scale by U(τ).

Δθ₁

Show derivation

π(Boost|s=2) = 1/(1+exp(−(1×2+0))) = 1/(1+e⁻²) = 1/1.135 = 0.881

∇ log π(Boost|s) = (1−π) · [s, 1] = 0.119 × [2, 1] = [0.238, 0.119]

Δθ = 6 × [0.238, 0.119] = [1.43, 0.715]

The gradient pushes θ₁ up by 1.43, making Boost more likely for similar states. The gradient is small because π=0.881 is already high — the policy already "wanted" to Boost, so it gets only a modest reinforcement.
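The same computation in code, as a sketch with illustrative names. The branch for the other action is an addition: for a logistic policy the score of the non-Boost action is −π · [s, 1], which follows from differentiating log(1 − π).

```javascript
function sigmoid(z) {
  return 1 / (1 + Math.exp(-z));
}

// Score function for the logistic policy pi(Boost | s) = sigmoid(theta1*s + theta2).
function scoreFunction(theta, s, action) {
  const pi = sigmoid(theta[0] * s + theta[1]);
  // Boost: (1 - pi) * [s, 1]; the other action: -pi * [s, 1].
  const coeff = action === "boost" ? 1 - pi : -pi;
  return [coeff * s, coeff];
}

// One REINFORCE step: theta <- theta + alpha * U(tau) * score.
function reinforceStep(theta, s, action, ret, alpha) {
  const g = scoreFunction(theta, s, action);
  return theta.map((t, i) => t + alpha * ret * g[i]);
}

const score = scoreFunction([1, 0], 2, "boost");
console.log(score.map(g => (6 * g).toFixed(3))); // components of the update for U = 6
console.log(reinforceStep([1, 0], 2, "boost", 6, 0.1));
```

The learning rate 0.1 in the last line is an arbitrary value for the demonstration; Exercise 6.1 asks only for the unscaled update U(τ) · ∇ log π.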

Exercise 6.2: Gradient Scaling Derive

θ=[2,−1], raw gradient ∇=[6,8]. ||∇||=10. You normalize the gradient to unit length, then step with learning rate α=0.5. What is θ₁' (the new value of θ₁)?

θ₁'

Show derivation

||∇|| = √(6²+8²) = √(100) = 10

∇̂ = [6/10, 8/10] = [0.6, 0.8]

θ' = [2,−1] + 0.5 × [0.6, 0.8] = [2.3, −0.6]

Gradient normalization (or clipping) prevents catastrophically large updates when the gradient magnitude explodes — common in RL where returns can vary wildly across trajectories.

Exercise 6.3: Why Gradient Clipping? Trace

Why is gradient clipping or normalization especially important in policy gradient methods?

It speeds up convergence by removing small gradients
Trajectory returns have high variance, causing huge gradient magnitudes that can destabilize training
It's only needed for mathematical correctness: without it the algorithm is wrong
It reduces memory usage during backpropagation

Show explanation

In policy gradient methods, the gradient Δθ is multiplied by the trajectory return U(τ). Returns can vary enormously: one lucky rollout might return 500 while the average is 10. That single sample creates a gradient 50x larger than normal, potentially jumping the parameters to a terrible region. Gradient clipping/normalization caps the magnitude, keeping updates stable even when returns are noisy.

Exercise 6.4: Cross-Entropy Method (CEM) Trace

The Cross-Entropy Method (CEM) optimizes a policy by sampling parameter vectors from a distribution, evaluating them, and fitting the distribution to the top performers. What is the key advantage of CEM over gradient-based policy optimization?

CEM is gradient-free: it doesn't require differentiable policies or reward signals, just the ability to evaluate trajectories
CEM always converges faster than policy gradients
CEM finds the global optimum while policy gradients only find local optima
CEM uses less memory since it doesn't store trajectories

Show explanation

CEM is a derivative-free optimization method. You don't need to differentiate through the policy or the environment — you just need to evaluate "how good is this parameter vector?" by rolling out trajectories. This makes CEM applicable to non-differentiable systems, discrete action spaces, and black-box environments. The tradeoff: CEM scales poorly with parameter dimension (it's essentially sampling in parameter space).
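A bare-bones version of the sample, evaluate, refit loop looks like this. It is a sketch under simplifying assumptions: a single scalar parameter instead of a vector, a Gaussian sampling distribution, and a small floor on the standard deviation. The option names and defaults are invented for the example, and `evaluate` stands in for "roll out trajectories and return the average return."

```javascript
// Standard normal sample via the Box-Muller transform.
function gaussian(rng) {
  const u1 = 1 - rng(); // in (0, 1], avoids log(0)
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Cross-entropy method over one scalar parameter.
function crossEntropyMethod(evaluate, options = {}) {
  let { mean = 0, std = 5 } = options;
  const { popSize = 50, eliteCount = 10, iterations = 30, rng = Math.random } = options;
  for (let it = 0; it < iterations; it++) {
    // 1. Sample candidate parameters from the current distribution.
    const samples = Array.from({ length: popSize }, () => mean + std * gaussian(rng));
    // 2. Evaluate them and keep the top performers.
    const elites = samples
      .map(x => ({ x, score: evaluate(x) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, eliteCount)
      .map(e => e.x);
    // 3. Refit the distribution to the elites.
    mean = elites.reduce((s, x) => s + x, 0) / eliteCount;
    const variance = elites.reduce((s, x) => s + (x - mean) ** 2, 0) / eliteCount;
    std = Math.sqrt(variance) + 1e-3; // floor keeps sampling from collapsing to a point
  }
  return mean;
}

// Toy objective with its maximum at x = 3; no gradient is ever computed.
console.log(crossEntropyMethod(x => -((x - 3) ** 2)));
```

Notice that `evaluate` is treated as a black box, which is exactly the property the explanation above highlights.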

Exercise 6.5: Implement gradientScale() Build

Implement gradient normalization: if the gradient norm exceeds a threshold, scale it down to that threshold. Otherwise, leave it unchanged.

Show solution

javascript
function gradientScale(grad, maxNorm) {
  const norm = Math.sqrt(grad.reduce((s, g) => s + g*g, 0));
  if (norm <= maxNorm || norm === 0) return [...grad];
  return grad.map(g => g * maxNorm / norm);
}

Chapter 7: POMDPs — Belief Updates

In a Partially Observable MDP (POMDP), the agent can't see the true state — it maintains a belief (a probability distribution over states) and updates it as it takes actions and receives observations. The belief update is a Bayesian filter: predict forward through the transition model, then condition on the observation.

Belief update:
b'(s') ∝ O(o | s', a) · ∑_s T(s' | s, a) · b(s)

Simplified (no transition, just observation):
b'(s) ∝ O(o | s) · b(s)
Then normalize: b'(s) = b'(s) / ∑ b'(s)

Belief = sufficient statistic. A POMDP with belief tracking reduces to an MDP over the belief space. The catch: belief space is a continuous (|S|−1)-dimensional simplex, so you can't just enumerate states. This is why POMDP solving is fundamentally harder than MDP solving.
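The full update, with the prediction step included, can be sketched as follows. The array layout is an assumption made for this example: `T[s][sNext]` holds P(sNext | s, a) for the action already taken, and `obs[sNext]` holds O(o | sNext, a) for the observation already received.

```javascript
// Bayesian filter step: predict through the transition model, then
// condition on the observation and normalize.
function beliefUpdateWithTransition(belief, T, obs) {
  const n = belief.length;
  const unnorm = [];
  for (let sNext = 0; sNext < n; sNext++) {
    let predicted = 0;
    for (let s = 0; s < n; s++) predicted += T[s][sNext] * belief[s];
    unnorm.push(obs[sNext] * predicted);
  }
  const total = unnorm.reduce((s, v) => s + v, 0);
  if (total === 0) throw new Error("observation has zero probability under this belief");
  return unnorm.map(v => v / total);
}

// Hypothetical 2-state example: start certain of state 0, which moves to
// state 1 half the time; the observation favors state 0.
const updated = beliefUpdateWithTransition([1, 0], [[0.5, 0.5], [0, 1]], [0.9, 0.1]);
console.log(updated);
```

With an identity transition matrix this reduces to the simplified observation-only update.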

Exercise 7.1: First Belief Update Derive

A mineral survey robot has 4 possible locations: LA, LB, LC, LD. Each location is either "high mineral" or "low mineral." We simplify to 4 states: s₁=LA-high, s₂=LB-high, s₃=LC-low, s₄=LD-low.

Initial belief: uniform b₀=[0.25, 0.25, 0.25, 0.25]. The robot scans LA and gets a positive reading. Observation model: P(positive | high) = 0.7, P(positive | low) = 0.2. Only the scanned location's state affects the observation.

Compute b₁(s₁) = P(s₁ | positive scan at LA).

probability

Show derivation

Scanning LA tells us only whether LA is high or low, so each state's likelihood depends on LA's mineral level in that state. The state labels in the problem statement do not pin that down for every state, so treat each state as a full mineral configuration across all sites: s₁ and s₂ are the states in which LA is high, and s₃ and s₄ are the states in which LA is low. Then:

P(pos|s₁) = P(pos|s₂) = 0.7

P(pos|s₃) = P(pos|s₄) = 0.2

P(pos) = 0.25×0.7 + 0.25×0.7 + 0.25×0.2 + 0.25×0.2 = 0.45

b₁(s₁) = P(pos|s₁)×b₀(s₁) / P(pos) = 0.7×0.25 / 0.45 = 0.175/0.45 = 0.389

Full updated belief: b₁ = [0.389, 0.389, 0.111, 0.111]. The positive reading raises the belief in each "high mineral" state from 0.25 to 0.389 and cuts the belief in each low state from 0.25 to 0.111.

Exercise 7.2: Second Belief Update Derive

From b₁=[0.389, 0.389, 0.111, 0.111], the robot scans LB and gets a negative reading. Assume P(negative|s) depends only on whether LB is high in state s, with LB high in s₁ and s₂ and low in s₃ and s₄. Using the same sensor, P(negative | high) = 1 − 0.7 = 0.3 and P(negative | low) = 1 − 0.2 = 0.8:

s₁: LB is high → P(neg|s₁) = 0.3
s₂: LB is high → P(neg|s₂) = 0.3
s₃: LB is low → P(neg|s₃) = 0.8
s₄: LB is low → P(neg|s₄) = 0.8

Compute b₂(s₁).

probability

Show derivation

P(neg) = 0.389×0.3 + 0.389×0.3 + 0.111×0.8 + 0.111×0.8

= 0.1167 + 0.1167 + 0.0889 + 0.0889 = 0.4112

b₂(s₁) = 0.389×0.3 / 0.4112 = 0.1167 / 0.4112 = 0.284

Full: b₂ ≈ [0.284, 0.284, 0.216, 0.216]. The negative LB scan decreases our belief in states where LB is high (s₁,s₂), while increasing belief in states where LB is low (s₃,s₄).

Exercise 7.3: Expected Reward Given Belief Derive

Belief b=[0.1, 0.4, 0.2, 0.3] (a new belief for this exercise, not the b₂ computed in Exercise 7.2). The robot considers sampling at LA. If LA is high mineral (states s₁, s₂), reward = +50. If LA is low (states s₃, s₄), reward = −10. What is the expected reward of sampling LA?

expected reward

Show derivation

P(LA high) = b(s₁) + b(s₂) = 0.1 + 0.4 = 0.5

P(LA low) = b(s₃) + b(s₄) = 0.2 + 0.3 = 0.5

E[R(sample LA)] = 0.5 × 50 + 0.5 × (−10) = 25 − 5 = 20
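
The calculation above is a dot product between the belief and a per-state reward vector. A minimal sketch, assuming the rewards for one action are stored as an array aligned with the belief (the name `expectedReward` is illustrative and not part of the workbook):

```javascript
// Expected reward of one action under a belief: sum over s of b(s) * R(s).
// `rewards` is assumed to be a per-state array aligned with `belief`.
function expectedReward(belief, rewards) {
  if (belief.length !== rewards.length) {
    throw new Error("belief and rewards must have the same length");
  }
  return belief.reduce((sum, b, s) => sum + b * rewards[s], 0);
}

// Exercise 7.3: sampling LA pays +50 in s1, s2 and -10 in s3, s4.
const sampleLA = expectedReward([0.1, 0.4, 0.2, 0.3], [50, 50, -10, -10]);
```

Up to floating-point rounding, `sampleLA` should agree with the hand-derived value of 20.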

Exercise 7.4: Why Not Just Use an MDP Solver? Trace

Why can't we directly apply MDP solvers (like value iteration) to POMDPs?

• MDP solvers are too slow for large state spaces
• MDP solvers assume deterministic transitions
• MDP solvers require the agent to know the current state, but in a POMDP the state is hidden: the agent only has a belief distribution
• MDP solvers can't handle rewards that depend on actions

Show explanation

MDP value iteration computes U(s) for each state s and the policy maps states to actions. But in a POMDP, the agent never observes s directly — it only has observations. The policy must map beliefs (probability distributions over states) to actions. The belief space is continuous and (|S|−1)-dimensional, so you can't just enumerate "states." This is why POMDP solving requires specialized algorithms like point-based value iteration, QMDP, or alpha-vector methods.

Exercise 7.5: Implement beliefUpdate() Build

Implement a simple belief update (observation only, no transition).

Show solution

javascript
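function beliefUpdate(belief, obsProbs) {
  // Weight each prior probability by the likelihood of the observation in that state.
  const unnorm = belief.map((b, i) => b * obsProbs[i]);
  // The normalizer equals P(observation) under the prior belief.
  const total = unnorm.reduce((s, v) => s + v, 0);
  // A zero total means the observation is impossible under this belief.
  if (total === 0) throw new Error("observation has zero probability under this belief");
  return unnorm.map(v => v / total);
}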
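
As a usage sketch, the two scans from Exercises 7.1 and 7.2 can be chained by feeding each posterior back in as the next prior. The function is restated here only so the snippet runs on its own; the likelihood arrays follow the assumption used in those exercises that LA and LB are high in s₁, s₂ and low in s₃, s₄.

```javascript
// Observation-only belief update, restated so this snippet is self-contained.
function beliefUpdate(belief, obsProbs) {
  const unnorm = belief.map((b, i) => b * obsProbs[i]);
  const total = unnorm.reduce((s, v) => s + v, 0);
  return unnorm.map(v => v / total);
}

// Exercise 7.1: positive scan at LA, P(pos|high)=0.7, P(pos|low)=0.2.
const b0 = [0.25, 0.25, 0.25, 0.25];
const b1 = beliefUpdate(b0, [0.7, 0.7, 0.2, 0.2]);

// Exercise 7.2: negative scan at LB, applied to the posterior b1.
const b2 = beliefUpdate(b1, [0.3, 0.3, 0.8, 0.8]);

// Print both beliefs rounded to three decimals for comparison with the text.
console.log(b1.map(v => v.toFixed(3)));
console.log(b2.map(v => v.toFixed(3)));
```

The printed values should match the hand-derived beliefs to three decimals, since the code carries full precision between the two updates.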

Chapter 8: Alpha Vectors & POMDP Solvers

The value function of a POMDP is piecewise linear and convex over the belief space. It can be represented by a set of alpha vectors — each alpha vector α defines a hyperplane in belief space. The value at any belief b is the maximum dot product over all alpha vectors: V(b) = max_α α · b.

Value at belief b:
V(b) = max_{α ∈ Γ} ∑_s α(s) · b(s)

Each alpha vector is associated with a conditional plan (action + sub-plan for each observation).
The optimal action at belief b is the action associated with the winning alpha vector.

Alpha vectors make POMDP solving tractable. Instead of representing V(b) for every point in the continuous belief simplex, we store a finite set of alpha vectors. To evaluate any belief, just take the max dot product. The catch: the number of alpha vectors can grow exponentially with the planning horizon.

Exercise 8.1: Alpha Vector Evaluation Derive

Two alpha vectors for a 5-state POMDP:
α₁ = [50, 50, −10, −10, 0] (associated with "sample LA")
α₂ = [50, −10, 50, −10, 0] (associated with "sample LC")
Belief b = [0.6, 0.2, 0.1, 0.1, 0]. Compute V₁(b) = α₁ · b.

value

Show derivation

V₁(b) = 0.6×50 + 0.2×50 + 0.1×(−10) + 0.1×(−10) + 0×0

= 30 + 10 − 1 − 1 + 0 = 38

Exercise 8.2: Which Alpha Vector Wins? Derive

Compute V₂(b) = α₂ · b for the same belief. Which action does the policy prescribe?

value

Show derivation

V₂(b) = 0.6×50 + 0.2×(−10) + 0.1×50 + 0.1×(−10) + 0×0

= 30 − 2 + 5 − 1 + 0 = 32

V(b) = max(38, 32) = 38 → action = "sample LA"

α₁ wins because it pays +50 in s₁ and s₂, which together hold 80% of the belief, while α₂ pays +50 in s₁ and s₃, which hold only 70%. The 20% on s₂ outweighs the 10% on s₃, so the policy samples LA, the location we are more confident is high.
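
The evaluation in Exercises 8.1 and 8.2 can be sketched in a few lines. The layout below, which pairs each alpha vector with an action label, is an illustrative choice and not something the workbook prescribes; `evaluateAlphaVectors` is a hypothetical name.

```javascript
// Evaluate V(b) = max over alpha vectors of (alpha . b) and report the winner.
// Each entry pairs a vector with its action label (illustrative layout).
function evaluateAlphaVectors(belief, alphaSet) {
  let best = { action: null, value: -Infinity };
  for (const { action, alpha } of alphaSet) {
    const value = alpha.reduce((sum, a, s) => sum + a * belief[s], 0);
    if (value > best.value) best = { action, value };
  }
  return best;
}

// The two alpha vectors and the belief from Exercise 8.1.
const alphaSet = [
  { action: "sample LA", alpha: [50, 50, -10, -10, 0] },
  { action: "sample LC", alpha: [50, -10, 50, -10, 0] },
];
const winner = evaluateAlphaVectors([0.6, 0.2, 0.1, 0.1, 0], alphaSet);
```

Here `winner.action` is "sample LA", with a value that agrees with the hand-computed 38 up to floating-point rounding.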

Exercise 8.3: QMDP Overestimation Trace

QMDP assumes that uncertainty disappears after the first step. Why does this cause it to overestimate the value function?

• It uses the wrong discount factor
• By assuming full observability after one step, it credits future actions with perfect state knowledge they won't actually have, making every action look better than it truly is
• It ignores the observation model entirely
• It samples too few belief points

Show explanation

QMDP computes the alpha vector for action a as: α_a(s) = R(s,a) + γ∑_s'T(s'|s,a)V_MDP(s'). Here V_MDP(s') is the fully observable MDP value, which assumes perfect state knowledge from step 2 onward. In reality, the agent will still be uncertain. Since ∑_s b(s)·V_MDP(s) ≥ V_POMDP(b) for every belief b (knowing the state can never hurt), QMDP yields an upper bound on the true value. A consequence is that QMDP places no value on information-gathering actions (like scanning), because it assumes the information will arrive for free after the first step.

FIB (Fast Informed Bound) helps by computing a tighter upper bound that accounts for the observation model, narrowing the gap between QMDP and the true value.
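
The QMDP formula above can be sketched as ordinary MDP value iteration followed by one alpha vector per action. This is a minimal sketch under assumed conventions: the array layouts `R[s][a]` and `T[a][s][sp]`, the function names, and the two-state toy model are all hypothetical and not taken from the workbook.

```javascript
// QMDP sketch. Assumed model layout (hypothetical):
//   R[s][a]      immediate reward
//   T[a][s][sp]  transition probability P(sp | s, a)
// Returns alphas[a][s] = R(s,a) + gamma * sum_sp T(sp|s,a) * V_MDP(sp).
function qmdpAlphas(R, T, gamma, iterations = 200) {
  const nS = R.length;
  const nA = R[0].length;
  let V = new Array(nS).fill(0); // running estimate of V_MDP
  let Q = R.map(row => row.slice());
  for (let k = 0; k < iterations; k++) {
    Q = R.map((row, s) =>
      row.map((r, a) =>
        r + gamma * T[a][s].reduce((sum, p, sp) => sum + p * V[sp], 0)));
    V = Q.map(row => Math.max(...row));
  }
  // Transpose Q[s][a] into one alpha vector per action.
  return Array.from({ length: nA }, (_, a) => Q.map(row => row[a]));
}

// Pick the action whose alpha vector has the largest dot product with b.
function qmdpAction(belief, alphas) {
  const values = alphas.map(alpha =>
    alpha.reduce((sum, v, s) => sum + v * belief[s], 0));
  return values.indexOf(Math.max(...values));
}

// Toy model (made-up numbers): states {high, low}, actions {sample, wait},
// and the state never changes under either action.
const identity = [[1, 0], [0, 1]];
const toyAlphas = qmdpAlphas([[50, 0], [-10, 0]], [identity, identity], 0.9);
const toyAction = qmdpAction([0.5, 0.5], toyAlphas);
```

Note that the observation model never appears in this sketch, which is exactly why QMDP cannot assign value to an action whose only purpose is to gather information.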

Exercise 8.4: When Is PBVI Better? Trace

When is point-based value iteration (PBVI) a better choice than QMDP?

• When the state space is very large
• When the transition model is deterministic
• When information-gathering actions are crucial: PBVI explicitly models how observations reduce uncertainty, while QMDP cannot
• PBVI is always worse because it's more expensive

Show explanation

PBVI performs Bellman backups at a set of sampled belief points, explicitly incorporating the observation model. This means it can discover that "scanning before sampling" is valuable — it models the information gain. QMDP, by assuming observability, thinks information has zero value and will always skip scanning. For problems where you must gather information before acting (like the mineral survey), PBVI dramatically outperforms QMDP.
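
To make "a Bellman backup at a belief point" concrete, here is a rough sketch of a single point-based backup. Everything about the model layout is an assumption made for illustration (`R[s][a]`, `T[a][s][sp]`, `O[a][sp][o]`, and the names `dot` and `pointBackup`); a full PBVI solver would also need belief-point selection and repeated sweeps, which are omitted.

```javascript
// One point-based Bellman backup at a single belief point b.
// Assumed model layout (hypothetical, not from the workbook):
//   R[s][a], T[a][s][sp] = P(sp | s, a), O[a][sp][o] = P(o | sp, a)
//   Gamma: current array of alpha vectors (each an array over states)
function dot(u, v) {
  return u.reduce((sum, x, i) => sum + x * v[i], 0);
}

function pointBackup(b, Gamma, R, T, O, gamma) {
  const nS = b.length, nA = T.length, nO = O[0][0].length;
  let best = null;
  for (let a = 0; a < nA; a++) {
    // Start from the immediate reward of action a in every state.
    const alphaA = R.map(row => row[a]);
    for (let o = 0; o < nO; o++) {
      // Project every alpha vector back through (a, o):
      // g(s) = sum_sp T(sp|s,a) * O(o|sp,a) * alpha(sp)
      const projected = Gamma.map(alpha =>
        Array.from({ length: nS }, (_, s) =>
          T[a][s].reduce((sum, p, sp) => sum + p * O[a][sp][o] * alpha[sp], 0)));
      // Keep the projection that is best at this particular belief point.
      const g = projected.reduce((x, y) => (dot(y, b) > dot(x, b) ? y : x));
      for (let s = 0; s < nS; s++) alphaA[s] += gamma * g[s];
    }
    const value = dot(alphaA, b);
    if (best === null || value > best.value) best = { action: a, alpha: alphaA, value };
  }
  return best;
}

// Toy check (made-up model): two static states, one action whose observation
// reveals the state exactly, and two alpha vectors that each pay off in one state.
const eye = [[1, 0], [0, 1]];
const backed = pointBackup([0.5, 0.5], [[10, 0], [0, 10]], [[0], [0]], [eye], [eye], 1);
```

In the toy check, neither starting alpha vector is worth more than 5 at the uniform belief, but the backed-up vector is worth 10 because the observation lets the agent choose the right sub-plan afterwards. That difference is the information gain that QMDP cannot represent.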

Exercise 8.5: Conditional Plan Explosion Derive

With |A|=5 actions and |O|=2 observations, the number of t-step conditional plans is defined recursively:
• 1-step plans = |A| = 5
• t-step plans = |A| × ((t−1)-step plans)^|O|
Compute the ratio of 3-step plans to 2-step plans.

ratio

Show derivation

1-step plans = 5

2-step plans = 5 × 5² = 5 × 25 = 125

3-step plans = 5 × 125² = 5 × 15,625 = 78,125

Ratio = 78,125 / 125 = 625

The number of conditional plans grows doubly exponentially with the horizon. Going from 2 to 3 steps multiplied the plan count by 625. This explosive growth is why exact POMDP solving is intractable for long horizons — and why approximate methods (PBVI, SARSOP, etc.) are essential.
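
The recurrence is easy to check numerically. The sketch below uses BigInt because the counts exceed the safe integer range of a JavaScript Number within a few steps; the function name `countConditionalPlans` is illustrative.

```javascript
// Number of t-step conditional plans via the recurrence
//   plans(1) = |A|,  plans(t) = |A| * plans(t-1)^|O|.
// BigInt keeps the counts exact as they grow.
function countConditionalPlans(numActions, numObservations, horizon) {
  let plans = BigInt(numActions);
  for (let t = 2; t <= horizon; t++) {
    plans = BigInt(numActions) * plans ** BigInt(numObservations);
  }
  return plans;
}

const twoStep = countConditionalPlans(5, 2, 2);   // 125n
const threeStep = countConditionalPlans(5, 2, 3); // 78125n
const ratio = threeStep / twoStep;                // 625n
```

Raising `horizon` by one or two more steps shows how quickly the count outgrows anything that could be enumerated.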

Chapter 9: Capstone

Time to put it all together. These problems combine multiple concepts from earlier chapters into multi-step reasoning chains — exactly the kind of problem that appears on exams. You'll need belief updates, expected reward calculations, value of information, and solver selection.

The DMUU pipeline. Most exam problems follow this flow: (1) Model the problem as a graphical model, MDP, or POMDP. (2) Compute prior beliefs using Bayes nets or Dirichlet priors. (3) Update beliefs with observations. (4) Compute expected utility of actions. (5) Determine if information gathering improves the decision (VOI). (6) Choose the appropriate solver for the problem scale.

Exercise 9.1: Multi-Step Belief + Reward Derive

A 3-state POMDP with uniform prior b₀=[1/3, 1/3, 1/3]. The agent takes a "scan" action and observes "positive." Observation model: P(pos|s₁)=0.9, P(pos|s₂)=0.3, P(pos|s₃)=0.1. Compute the updated belief b₁(s₁).

probability

Show derivation

P(pos) = (1/3)×0.9 + (1/3)×0.3 + (1/3)×0.1 = (0.9+0.3+0.1)/3 = 1.3/3 = 0.4333

b₁(s₁) = 0.9 × (1/3) / 0.4333 = 0.3 / 0.4333 = 0.6923

Exercise 9.2: Expected Reward at Posterior Derive

Using b₁ from Exercise 9.1 (b₁ ≈ [0.692, 0.231, 0.077]). Action "extract" gives reward +100 in s₁, −20 in s₂, −50 in s₃. Action "abandon" gives +5 in all states. What is the expected reward of "extract" at this belief?

expected reward

Show derivation

E[R(extract)] = 0.692×100 + 0.231×(−20) + 0.077×(−50)

= 69.2 − 4.62 − 3.85 = 60.73

The strong positive reading boosted b(s₁) to 69.2%, making extraction highly favorable (EU=60.73 vs abandon=5). Without the scan, the prior would give EU(extract) = (1/3)(100−20−50) = 10 — much less decisive.

Exercise 9.3: Which Solver? Trace

You have a POMDP with 50 states, 4 actions, 3 observations, and a planning horizon of 10 steps. Information-gathering is critical. Which solver approach is most appropriate?

• Exact value iteration: enumerate all conditional plans
• QMDP: fast and handles the state space well
• Point-based value iteration (PBVI/SARSOP): handles info-gathering and scales to this size
• Solve it as an MDP by ignoring observations

Show explanation

Exact value iteration is intractable: with |A|=4 and |O|=3, the number of conditional plans grows doubly exponentially with the horizon (the recurrence from Exercise 8.5 already gives 4 × 4³ = 256 two-step plans), so a 10-step plan set cannot be enumerated. QMDP can't handle information-gathering (it assumes observability after step 1). Solving it as an MDP ignores partial observability entirely. PBVI/SARSOP samples belief points, performs backups with full observation modeling, and scales to 50 states and horizon 10. It's the right tool: it models information value while staying computationally tractable.

Exercise 9.4: Belief-Update-Then-Decide Pipeline Build

Implement a function that: (1) updates a belief given an observation, (2) computes expected reward for each action, and (3) returns the best action index and its expected reward.

Show solution

javascript
function beliefDecide(belief, obsProbs, rewards) {
  // 1. Update belief
  const unnorm = belief.map((b, i) => b * obsProbs[i]);
  const total = unnorm.reduce((s, v) => s + v, 0);
  const bPost = unnorm.map(v => v / total);
  // 2. Compute EU for each action
  let bestA = 0, bestEU = -Infinity;
  for (let a = 0; a < rewards.length; a++) {
    let eu = 0;
    for (let s = 0; s < bPost.length; s++)
      eu += bPost[s] * rewards[a][s];
    if (eu > bestEU) { bestEU = eu; bestA = a; }
  }
  return { action: bestA, eu: Math.round(bestEU * 100) / 100 };
}

Note: the rounding in the return is for cleaner test comparison. In practice you'd return the raw float.
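
As a usage sketch, the function can be run on the numbers from Exercises 9.1 and 9.2. It is restated compactly here only so the snippet is self-contained; the logic is intended to be the same as the solution above.

```javascript
// Compact restatement of beliefDecide so this snippet runs on its own.
function beliefDecide(belief, obsProbs, rewards) {
  const unnorm = belief.map((b, i) => b * obsProbs[i]);
  const total = unnorm.reduce((s, v) => s + v, 0);
  const bPost = unnorm.map(v => v / total);
  let bestA = 0, bestEU = -Infinity;
  rewards.forEach((row, a) => {
    const eu = row.reduce((sum, r, s) => sum + bPost[s] * r, 0);
    if (eu > bestEU) { bestEU = eu; bestA = a; }
  });
  return { action: bestA, eu: Math.round(bestEU * 100) / 100 };
}

// Uniform prior, positive scan, then choose between extract (0) and abandon (1).
const decision = beliefDecide(
  [1 / 3, 1 / 3, 1 / 3],
  [0.9, 0.3, 0.1],
  [[100, -20, -50], [5, 5, 5]]
);
```

The result selects action 0 (extract) with an expected utility of 60.77. That is slightly higher than the 60.73 from Exercise 9.2 because the code uses the exact posterior [9/13, 3/13, 1/13] instead of the belief rounded to three decimals.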

Exercise 9.5: VOI for a Second Scan Derive

After the first scan, b₁=[0.692, 0.231, 0.077]. Extract gives [100,−20,−50], abandon gives [5,5,5]. EU*(no second scan) = max(60.73, 5) = 60.73. Now consider a second scan with P(pos|s₁)=0.8, P(pos|s₂)=0.4, P(pos|s₃)=0.1.

Compute P(second scan positive) given belief b₁.

probability

Show derivation

P(pos2) = 0.692×0.8 + 0.231×0.4 + 0.077×0.1

= 0.554 + 0.092 + 0.008 = 0.654

Since we already believe s₁ is likely (69.2%), and the second scan has P(pos|s₁)=0.8, a positive result is the more likely outcome (about 65%). The interesting question is whether the marginal value of this second scan justifies its cost, given that we already have a strong belief.
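
The marginal-value question can be explored with a small value-of-information sketch. It rests on an assumption the exercise does not state: the scan has only two outcomes, so P(neg|s) = 1 − P(pos|s). The names `bestExpectedUtility` and `scanVOI` are hypothetical, and the scan's cost is left out.

```javascript
// Best achievable expected utility at a belief, over all actions.
function bestExpectedUtility(belief, rewards) {
  return Math.max(...rewards.map(row =>
    row.reduce((sum, r, s) => sum + r * belief[s], 0)));
}

// Value of information for one more binary scan, before subtracting its cost.
// Assumption: only two outcomes, so P(neg | s) = 1 - P(pos | s).
function scanVOI(belief, posProbs, rewards) {
  const euNow = bestExpectedUtility(belief, rewards);
  const negProbs = posProbs.map(p => 1 - p);
  let euAfter = 0;
  let pPos = 0;
  for (const likelihood of [posProbs, negProbs]) {
    const unnorm = belief.map((b, s) => b * likelihood[s]);
    const pObs = unnorm.reduce((sum, v) => sum + v, 0);
    if (likelihood === posProbs) pPos = pObs;
    if (pObs === 0) continue; // impossible outcome contributes nothing
    const posterior = unnorm.map(v => v / pObs);
    euAfter += pObs * bestExpectedUtility(posterior, rewards);
  }
  return { pPos, voi: euAfter - euNow };
}

const secondScan = scanVOI(
  [0.692, 0.231, 0.077],
  [0.8, 0.4, 0.1],
  [[100, -20, -50], [5, 5, 5]]
);
```

With these numbers, extract remains the best action after either outcome, so `secondScan.voi` comes out at approximately zero: under the two-outcome assumption the second scan would not change the decision, and any positive scan cost would outweigh it.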