Every exam-style problem type from probabilistic reasoning to sequential decision making. Bayesian networks, MDPs, POMDPs, policy gradients, and Monte Carlo tree search — all solvable in-browser with instant feedback.
A Bayesian network (also called a belief network or directed graphical model) encodes a joint probability distribution over a set of random variables using a directed acyclic graph (DAG). Each node is a variable. Each directed edge says "the parent influences the child." The power of a Bayes net is factorization: instead of storing an exponentially large joint table, you store small conditional probability tables (CPTs) — one per node, conditioned only on its parents.
A Markov chain generates a 4-digit sequence x1, x2, x3, x4 where each digit is in {1,2,3,4,5,6}. The transition model is:
The prior is uniform: P(x1 = k) = 1/6 for all k. Compute P(x1=3, x2=5, x3=1, x4=4).
Hint: First compute the denominator ∑(l−k)2 for each k you need, then multiply the chain P(x1) · P(x2|x1) · P(x3|x2) · P(x4|x3).
We need P(3) · P(5|3) · P(1|5) · P(4|1).
For k=3: ∑l=1..6(l−3)2 = 4+1+0+1+4+9 = 19
For k=5: ∑l=1..6(l−5)2 = 16+9+4+1+0+1 = 31
For k=1: ∑l=1..6(l−1)2 = 0+1+4+9+16+25 = 55
This transition model assigns higher probability to transitions with large jumps (since (j−k)2 grows with distance). The sequence 3→5→1→4 has moderate jumps, giving a probability of about 0.3%.
Consider a Bayes net with structure: A → C, B → C, C → D. All variables are binary (0 or 1).
Compute P(C=1). (Marginalize over A and B.)
A and B are independent (no edge between them), so P(A,B) = P(A)P(B). We just enumerate all four combinations of (A,B) and weight the CPT entries.
A → C ← B is a collider (v-structure) at C. In a collider, the path is blocked when the collider node (C) and all its descendants are not observed. Since we have no evidence, C is not observed, so A and B are d-separated — hence independent.
If we did observe C (or D, which is a descendant of C), the path would open and A and B would become dependent. This is the "explaining away" phenomenon.
D is a descendant of the collider C. Observing a descendant of a collider has the same unblocking effect as observing the collider itself. Once D=1 is observed, we gain information about C, which "opens" the A — C — B path. Now A and B are dependent: knowing something about A changes our belief about C, which changes our belief about B.
Implement a function that computes the joint probability of a 4-node chain A → B → C → D. You're given the prior P(A) and a single transition CPT function P(child|parent).
javascript function bayesNetProb(seq, prior, cpt) { let prob = prior(seq[0]); for (let i = 1; i < seq.length; i++) { prob *= cpt(seq[i], seq[i - 1]); } return prob; }
This function computes P(C=1) by marginalizing over parents A and B in a Bayes net A→C, B→C. It's returning a value greater than 1. Click the buggy line.
function marginalizeC(pA, pB, cpt) { let pC1 = 0; for (let a of [0, 1]) { for (let b of [0, 1]) { let pAB = pA[a] + pB[b]; pC1 += pAB * cpt[a][b]; } } return pC1; }
Line 5 is the bug. Since A and B are independent, P(A=a, B=b) = P(A=a) × P(B=b), not P(A=a) + P(B=b). Using addition instead of multiplication gives a value that's way too large (and can exceed 1). The fix: let pAB = pA[a] * pB[b];
You've observed data and you want to update your beliefs. The Dirichlet distribution is the conjugate prior for the categorical (multinomial) distribution — meaning if your prior is Dirichlet and your data is categorical, your posterior is also Dirichlet. This makes Bayesian updates trivially easy: just add observed counts to the prior pseudo-counts.
A 6-sided die has a Dirichlet prior Dir(2,2,2,2,2,2). You roll the die 3 times and observe: 4, 3, 4. What is the posterior mean probability of rolling a 4?
The prior contributes 12 "phantom observations" (2 per face). Combined with 3 real observations, we have 15 total. The prior pulls the estimate toward 1/6 ≈ 0.167, while the data (face 4 appeared 2/3 times = 0.667) pulls it up. The mean 0.267 is a compromise.
Same setup: Dir(2,2,2,2,2,2) prior, observations 4, 3, 4. What is the posterior mode (MAP estimate) for the probability of rolling a 4?
Recall: modei = (αi − 1) / (∑αj − k) where k is the number of categories.
The mode is higher than the mean (0.333 vs 0.267) because the mode represents the "peak" of the posterior density, which is pulled more toward the observed data than the mean. For large sample sizes, mode and mean converge.
Dir(1,1,1) is the uniform Dirichlet — it says "any probability vector is equally likely." It contributes only 3 pseudo-counts total. Dir(10,10,10) contributes 30 pseudo-counts — like having already seen 27 evenly-distributed observations. You'd need far more data to move the posterior away from (1/3, 1/3, 1/3) with the Dir(10,10,10) prior.
The "informativeness" of a Dirichlet is controlled by ∑αi (the concentration). Higher concentration = stronger prior = slower updates.
Consider the Bayes net A → B with P(A=1)=0.4 and P(B=1|A=1)=0.7, P(B=1|A=0)=0.2. We want to estimate P(A=1|B=1) using likelihood weighting. In one sample, we sample A=1 from the prior. The evidence is B=1. What is the weight of this sample?
In likelihood weighting, evidence nodes are not sampled — instead, the weight is the product of P(evidence_node = observed_value | parents) for each evidence node.
If we had sampled A=0 instead, the weight would be P(B=1|A=0) = 0.2. Samples consistent with evidence get high weight; inconsistent ones get low weight. Over many samples, the weighted average converges to the true posterior.
Given a Dirichlet prior (array of alphas) and an array of observation counts, return the posterior mean as an array of probabilities.
javascript function dirichletPosteriorMean(alphas, counts) { const post = alphas.map((a, i) => a + counts[i]); const total = post.reduce((s, v) => s + v, 0); return post.map(v => v / total); }
You're deciding whether to launch a rocket or refuel and wait. Fuel levels might be sufficient (probability 0.6) or insufficient (probability 0.4). A sensor can tell you — but the sensor costs money. How much is that information worth? That's the Value of Information (VOI): the expected improvement in decision quality from observing a variable before acting.
P(sufficient fuel) = 0.6, P(insufficient) = 0.4. Two actions:
What is the optimal expected utility without a sensor?
Without the sensor, refueling is optimal. The 40% chance of catastrophic failure makes launching too risky.
The sensor has: P(positive | sufficient) = 0.95, P(positive | insufficient) = 0.05. Compute P(sufficient | sensor positive) using Bayes rule.
A positive reading dramatically increases our confidence: from 60% to 96.6%. The sensor is highly reliable (95% true positive rate), so a positive reading is very informative.
Using the sensor from Exercise 2.2, compute the Value of Information. Remember: VOI = Eo[EU*(best action | sensor=o)] − EU*(no sensor).
You need EU* for both sensor outcomes (positive and negative), weighted by P(sensor outcome).
If sensor positive (P=0.59):
If sensor negative (P=0.41):
VOI:
The sensor is worth 2.69 utility. If it costs less than that, buy it. Most of the value comes from the positive case: the sensor lets us confidently launch (EU=9.56) instead of conservatively refueling (EU=5).
VOI = 0 when the optimal action is the same regardless of the observation. For example, if refueling gives EU=100 and launching gives EU=12 even in the best case — no sensor reading could possibly make launching better. The sensor would update your beliefs, but you'd still choose refuel every time. The information is useless for decision-making.
A perfectly accurate sensor can actually have high VOI (it resolves all uncertainty). And uniform priors don't guarantee VOI=0.
Given prior P(state), action utilities per state, and a sensor model, compute VOI.
javascript function computeVOI(pState, utils, sensorModel) { // EU without sensor let euNoSensor = -Infinity; for (let a = 0; a < utils.length; a++) { let eu = 0; for (let s = 0; s < pState.length; s++) eu += pState[s] * utils[a][s]; euNoSensor = Math.max(euNoSensor, eu); } // EU with sensor let euWithSensor = 0; for (let o = 0; o < sensorModel.length; o++) { let pObs = 0; for (let s = 0; s < pState.length; s++) pObs += sensorModel[o][s] * pState[s]; let bestEU = -Infinity; for (let a = 0; a < utils.length; a++) { let eu = 0; for (let s = 0; s < pState.length; s++) { let pPost = sensorModel[o][s]*pState[s]/pObs; eu += pPost * utils[a][s]; } bestEU = Math.max(bestEU, eu); } euWithSensor += pObs * bestEU; } return euWithSensor - euNoSensor; }
A Markov Decision Process (MDP) is the mathematical framework for sequential decision making when the world is fully observable. You have states, actions, transition probabilities, rewards, and a discount factor. The Bellman backup is the core operation: it computes the value of a state by looking one step ahead and then using the values of successor states.
Starting from U0(sc)=0, U0(st)=0. Compute U1(sc).
With U0=0 everywhere, the future value is zero. The first iteration just picks the action with the highest immediate reward: fly gives 5, hover gives 2.
Compute U1(st) from U0=0.
In turbulence, flying is dangerous (reward −1). Hovering is safe (reward 1). The greedy first-iteration policy: fly in calm, hover in turbulence.
Now using U1(sc)=5 and U1(st)=1, compute U2(sc). Which action is optimal?
Now future values matter! Flying gives 5 immediate reward plus 0.9×(expected future). The expected next-state value is 0.6×5 + 0.4×1 = 3.4 because flying from calm has a 60% chance of staying calm (value 5) and 40% chance of turbulence (value 1).
When γ → 0, the Bellman equation reduces to U(s) = maxa R(s,a). Future states don't matter at all. The agent is completely myopic — it grabs the biggest immediate reward without considering consequences. At γ=0.99, the agent weighs a reward 100 steps in the future at 0.99100 ≈ 0.37 of its current value — still significant. At γ=0.5, that same reward is worth 0.5100 ≈ 10−30 — essentially zero.
Place these steps in the correct order for the value iteration algorithm.
Initialize → Bellman backup → Check convergence → Repeat → Extract policy
Initialize U=0, then repeatedly apply Bellman backups to all states. After each sweep, check if max|Uk+1(s)−Uk(s)| < ε. If not converged, repeat. Once converged, extract the greedy policy: π(s) = argmaxa [R(s,a) + γ∑T(s'|s,a)U(s')].
Given a state, perform one Bellman backup. Return the new utility value.
javascript function bellmanBackup(state, actions, R, T, U, gamma) { let bestQ = -Infinity; for (const a of actions) { let q = R[state][a]; for (const [prob, ns] of T[state][a]) { q += gamma * prob * U[ns]; } bestQ = Math.max(bestQ, q); } return bestQ; }
You have a few data points and need to predict the value at a new query point. Gaussian kernel smoothing (Nadaraya-Watson estimator) is a non-parametric method: it computes a weighted average of known values, where the weight decreases with distance. Points closer to the query get more influence.
Query point xq = [0.6, 1.2]. Known points: x1=[0,0], x2=[1,0], x3=[0,2]. Compute the Euclidean distance d1 = ||xq − x1||.
Similarly: d2 = √(0.16+1.44) = √1.60 = 1.265, and d3 = √(0.36+0.64) = √1.00 = 1.000. Point x3 is closest to the query.
Using distances d1=1.342, d2=1.265, d3=1.000 and bandwidth σ=1, compute the Gaussian kernel value k3 = exp(−d32/(2σ2)).
x3 is closest, so it gets the largest kernel value. The kernel decays exponentially with squared distance — moving from d=1.0 to d=1.342 drops the weight from 0.607 to 0.407 (a 33% decrease).
Known values: y1=1, y2=3, y3=4. Kernel values: k1=0.407, k2=0.449, k3=0.607. Compute the kernel-smoothed prediction f̂(xq).
The prediction ≈ 2.86. Point x3 (y=4) gets the most weight (41.5%) because it's closest. The result is pulled toward y3=4 but tempered by the other points.
As σ → ∞, exp(−d2/(2σ2)) → exp(0) = 1 for all points, regardless of distance. All kernel values become equal, so all weights become 1/n. The prediction converges to the simple average: (1+3+4)/3 = 2.667. Infinite bandwidth = maximum smoothing = global average. Conversely, as σ→0, only the nearest point matters (nearest-neighbor interpolation).
Implement the Nadaraya-Watson kernel smoother with a Gaussian kernel.
javascript function kernelSmooth(xq, xs, ys, sigma) { xq = xq[0]; // unwrap single-element array let sumKY = 0, sumK = 0; for (let i = 0; i < xs.length; i++) { let d2 = 0; for (let j = 0; j < xq.length; j++) d2 += (xq[j] - xs[i][j]) ** 2; const k = Math.exp(-d2 / (2 * sigma * sigma)); sumK += k; sumKY += k * ys[i]; } return sumKY / sumK; }
Monte Carlo Tree Search (MCTS) builds a search tree incrementally by simulating random rollouts. At each node, it uses the UCB1 (Upper Confidence Bound) formula to balance exploitation (picking the action with highest estimated value) and exploration (trying actions that haven't been sampled enough to be confident about).
A MCTS node has been visited N=10 times total. Action "Fly" has Q̄=3.0 and N(fly)=6 visits. Exploration parameter c=0.7. Compute UCB1(fly).
Same node (N=10, c=0.7). Action "Hover" has Q̄=3.2 and N(hover)=4. Compute UCB1(hover) and determine which action MCTS selects.
UCB1(hover)=3.73 > UCB1(fly)=3.43. MCTS selects Hover. Hover wins not just because it has a higher Q̄ (3.2 vs 3.0), but also because it has fewer visits (4 vs 6), giving it a larger exploration bonus. Both factors align here.
With c=5.0, the exploration term c·√(ln(N)/N(a)) becomes ~7x larger than with c=0.7. For our example, UCB1(fly) = 3.0 + 5.0×0.62 = 6.10 and UCB1(hover) = 3.2 + 5.0×0.76 = 6.99. The Q-values (3.0 vs 3.2) are dwarfed by the exploration bonuses (3.10 vs 3.80). The algorithm becomes almost purely exploratory — the visit count difference matters more than the estimated value difference.
5 actions with visit counts [4, 3, 2, 1, 1]. Total N=11. The action with N(a)=2 has Q̄=24. c=1. Compute its UCB1 score.
Implement the UCB1 scoring function.
javascript function ucb1Score(qAvg, nAction, nTotal, c) { return qAvg + c * Math.sqrt(Math.log(nTotal) / nAction); }
Instead of learning a value function and deriving a policy, policy gradient methods directly optimize the policy parameters θ. The policy πθ(a|s) is a parameterized probability distribution over actions. We adjust θ in the direction that increases expected return. The key formula: the gradient of the expected return involves the score function ∇ log πθ(a|s).
θ=[1,0], s=2. The agent took action Boost. Trajectory return U(τ)=6. Compute the policy gradient update Δθ = U(τ) · ∇ log π(Boost|s). Give the first component Δθ1.
First compute π, then the score, then scale by U(τ).
The gradient pushes θ1 up by 1.43, making Boost more likely for similar states. The gradient is small because π=0.881 is already high — the policy already "wanted" to Boost, so it gets only a modest reinforcement.
θ=[2,−1], raw gradient ∇=[6,8]. ||∇||=10. You normalize the gradient to unit length, then step with learning rate α=0.5. What is θ1' (the new value of θ1)?
Gradient normalization (or clipping) prevents catastrophically large updates when the gradient magnitude explodes — common in RL where returns can vary wildly across trajectories.
In policy gradient methods, the gradient Δθ is multiplied by the trajectory return U(τ). Returns can vary enormously: one lucky rollout might return 500 while the average is 10. That single sample creates a gradient 50x larger than normal, potentially jumping the parameters to a terrible region. Gradient clipping/normalization caps the magnitude, keeping updates stable even when returns are noisy.
CEM is a derivative-free optimization method. You don't need to differentiate through the policy or the environment — you just need to evaluate "how good is this parameter vector?" by rolling out trajectories. This makes CEM applicable to non-differentiable systems, discrete action spaces, and black-box environments. The tradeoff: CEM scales poorly with parameter dimension (it's essentially sampling in parameter space).
Implement gradient normalization: if the gradient norm exceeds a threshold, scale it down to that threshold. Otherwise, leave it unchanged.
javascript function gradientScale(grad, maxNorm) { const norm = Math.sqrt(grad.reduce((s, g) => s + g*g, 0)); if (norm <= maxNorm || norm === 0) return [...grad]; return grad.map(g => g * maxNorm / norm); }
In a Partially Observable MDP (POMDP), the agent can't see the true state — it maintains a belief (a probability distribution over states) and updates it as it takes actions and receives observations. The belief update is a Bayesian filter: predict forward through the transition model, then condition on the observation.
A mineral survey robot has 4 possible locations: LA (Los Angeles), LB, LC, LD. Each location is either "high mineral" or "low mineral." We simplify to 4 states: s1=LA-high, s2=LB-high, s3=LC-low, s4=LD-low.
Initial belief: uniform b0=[0.25, 0.25, 0.25, 0.25]. The robot scans LA and gets a positive reading. Observation model: P(positive | high) = 0.7, P(positive | low) = 0.2. Only the scanned location's state affects the observation.
Compute b1(s1) = P(s1 | positive scan at LA).
Scanning LA gives info about whether LA is high or low. s1=LA-high, s2=LB-high (LA is part of the state through s1).
We need P(positive | si) for the scan at LA:
Actually, let's redefine cleanly. Assume each state encodes a particular mineral configuration across all sites. Let s1,s2 be states where LA has high minerals, and s3,s4 be states where LA has low minerals:
Full updated belief: b1 = [0.389, 0.389, 0.111, 0.111]. The positive reading doubles our confidence in the "high mineral" states.
From b1=[0.389, 0.389, 0.111, 0.111], the robot scans LB and gets a negative reading. States s1,s2 have LB as "high" (in s2) or "not LB-high." Let's say P(negative|s) depends on whether LB is high in that state:
Compute b2(s1).
Full: b2 ≈ [0.284, 0.284, 0.216, 0.216]. The negative LB scan decreases our belief in states where LB is high (s1,s2), while increasing belief in states where LB is low (s3,s4).
Belief b2=[0.1, 0.4, 0.2, 0.3]. The robot considers sampling at LA. If LA is high mineral (states s1, s2), reward = +50. If LA is low (states s3, s4), reward = −10. What is the expected reward of sampling LA?
MDP value iteration computes U(s) for each state s and the policy maps states to actions. But in a POMDP, the agent never observes s directly — it only has observations. The policy must map beliefs (probability distributions over states) to actions. The belief space is continuous and (|S|−1)-dimensional, so you can't just enumerate "states." This is why POMDP solving requires specialized algorithms like point-based value iteration, QMDP, or alpha-vector methods.
Implement a simple belief update (observation only, no transition).
javascript function beliefUpdate(belief, obsProbs) { const unnorm = belief.map((b, i) => b * obsProbs[i]); const total = unnorm.reduce((s, v) => s + v, 0); return unnorm.map(v => v / total); }
The value function of a POMDP is piecewise linear and convex over the belief space. It can be represented by a set of alpha vectors — each alpha vector α defines a hyperplane in belief space. The value at any belief b is the maximum dot product over all alpha vectors: V(b) = maxα α · b.
Two alpha vectors for a 5-state POMDP:
α1 = [50, 50, −10, −10, 0] (associated with "sample LA")
α2 = [50, −10, 50, −10, 0] (associated with "sample LC")
Belief b = [0.6, 0.2, 0.1, 0.1, 0]. Compute V1(b) = α1 · b.
Compute V2(b) = α2 · b for the same belief. Which action does the policy prescribe?
α1 wins because our belief puts 80% probability on states s1,s2 (where α1 gives +50) and only 10% on s3 (where α2 gives +50). The policy samples the location we're most confident about.
QMDP computes the alpha vector for action a as: αa(s) = R(s,a) + γ∑s'T(s'|s,a)VMDP(s'). The VMDP(s') is the fully observable MDP value — it assumes perfect state knowledge from step 2 onward. In reality, the agent will still be uncertain. Since VMDP(s) ≥ VPOMDP(b) for any b consistent with s, QMDP overestimates. This means QMDP never takes information-gathering actions (like scanning) because it thinks it won't need the information.
FIB (Fast Informed Bound) helps by computing a tighter upper bound that accounts for the observation model, narrowing the gap between QMDP and the true value.
PBVI performs Bellman backups at a set of sampled belief points, explicitly incorporating the observation model. This means it can discover that "scanning before sampling" is valuable — it models the information gain. QMDP, by assuming observability, thinks information has zero value and will always skip scanning. For problems where you must gather information before acting (like the mineral survey), PBVI dramatically outperforms QMDP.
With |A|=5 actions and |O|=2 observations, the number of t-step conditional plans is defined recursively:
• 1-step plans = |A| = 5
• t-step plans = |A| × ((t−1)-step plans)|O|
Compute the ratio of 3-step plans to 2-step plans.
The number of conditional plans grows doubly exponentially with the horizon. Going from 2 to 3 steps multiplied the plan count by 625. This explosive growth is why exact POMDP solving is intractable for long horizons — and why approximate methods (PBVI, SARSOP, etc.) are essential.
Time to put it all together. These problems combine multiple concepts from earlier chapters into multi-step reasoning chains — exactly the kind of problem that appears on exams. You'll need belief updates, expected reward calculations, value of information, and solver selection.
A 3-state POMDP with uniform prior b0=[1/3, 1/3, 1/3]. The agent takes a "scan" action and observes "positive." Observation model: P(pos|s1)=0.9, P(pos|s2)=0.3, P(pos|s3)=0.1. Compute the updated belief b1(s1).
Using b1 from Exercise 9.1 (b1 ≈ [0.692, 0.231, 0.077]). Action "extract" gives reward +100 in s1, −20 in s2, −50 in s3. Action "abandon" gives +5 in all states. What is the expected reward of "extract" at this belief?
The strong positive reading boosted b(s1) to 69.2%, making extraction highly favorable (EU=60.73 vs abandon=5). Without the scan, the prior would give EU(extract) = (1/3)(100−20−50) = 10 — much less decisive.
Exact is impossible: with |A|=4 and |O|=3, the 10-step plan tree has astronomically many branches. QMDP can't handle information-gathering (it assumes observability after step 1). MDP ignores partial observability entirely. PBVI/SARSOP samples belief points, performs backups with full observation modeling, and scales to 50 states and horizon 10. It's the right tool: it models information value while staying computationally tractable.
Implement a function that: (1) updates a belief given an observation, (2) computes expected reward for each action, and (3) returns the best action index and its expected reward.
javascript function beliefDecide(belief, obsProbs, rewards) { // 1. Update belief const unnorm = belief.map((b, i) => b * obsProbs[i]); const total = unnorm.reduce((s, v) => s + v, 0); const bPost = unnorm.map(v => v / total); // 2. Compute EU for each action let bestA = 0, bestEU = -Infinity; for (let a = 0; a < rewards.length; a++) { let eu = 0; for (let s = 0; s < bPost.length; s++) eu += bPost[s] * rewards[a][s]; if (eu > bestEU) { bestEU = eu; bestA = a; } } return { action: bestA, eu: Math.round(bestEU * 100) / 100 }; }
Note: the rounding in the return is for cleaner test comparison. In practice you'd return the raw float.
After the first scan, b1=[0.692, 0.231, 0.077]. Extract gives [100,−20,−50], abandon gives [5,5,5]. EU*(no second scan) = max(60.73, 5) = 60.73. Now consider a second scan with P(pos|s1)=0.8, P(pos|s2)=0.4, P(pos|s3)=0.1.
Compute P(second scan positive) given belief b1.
Since we already believe s1 is likely (69.2%), and the second scan has P(pos|s1)=0.8, a positive result is highly probable. The interesting question is whether the marginal value of this second scan justifies its cost — we already have a strong belief.