Bishop PRML, Chapter 14

Combining Models

Bayesian model averaging, committees, boosting, bagging, decision trees, random forests, and mixtures of experts — why ensembles beat single models.

Prerequisites: Chapters 1, 3–5, 9 (probability, regression, classification, neural networks, EM).
12
Chapters
2
Simulations
12
Quizzes

Chapter 0: Why Combine Models?

You have trained five different classifiers on the same data. Each one gets about 80% accuracy. Should you pick the best one and throw away the rest?

Surprisingly, no. If you average their predictions, you almost always do better than the best individual model. This is the central insight of Chapter 14: combining models reduces error in ways that no single model can.

The key intuition: Imagine each model makes independent errors. When one model is wrong on a data point, the others are likely right. By averaging, the individual errors cancel out while the correct signal reinforces. The more independent the errors, the more you gain from combining. A committee of five models with uncorrelated errors can reduce variance by a factor of five.

This chapter covers three distinct strategies for combining models:

Averaging methods: Train models independently, then average their predictions. Committees, bagging, and random forests.

Sequential methods: Train models one at a time, each focusing on what previous models got wrong. Boosting and AdaBoost.

And a third, more flexible approach: mixtures of experts, where a gating network learns which model to trust for each input region.

A modern perspective: Nearly every state-of-the-art machine learning system uses ensembling. Kaggle competitions are won by model stacking. GPT-4 and other frontier LLMs are rumored to use mixtures of experts. Random forests remain the go-to baseline for tabular data. The ideas in this chapter are not historical curiosities — they are the backbone of practical ML.
Check: Why does averaging predictions from multiple models typically outperform any single model?

Chapter 1: Bayesian Model Averaging

Before we combine models heuristically, let us see what the Bayesian framework says. Given a set of models {M1, ..., ML}, the predictive distribution is:

p(t|x, D) = ∑l=1L p(t|x, Ml, D) p(Ml|D)

where p(Ml|D) is the posterior probability of model l given the data (from Bayes' theorem), and p(t|x, Ml, D) is the prediction of model l with its parameters integrated out.

Averaging vs. selection: Bayesian model averaging (BMA) is not the same as a committee. BMA averages over the space of models, weighted by how well each model explains the data. As data increases, the posterior typically concentrates on one model (the "true" one), and BMA converges to model selection. A committee, by contrast, keeps all models contributing equally, which can outperform BMA when no single model is correct.

The model posterior is computed via Bayes' theorem:

p(Ml|D) ∝ p(D|Ml) p(Ml)

where p(D|Ml) is the model evidence (marginal likelihood) from Chapter 3. This automatically implements Occam's razor: simpler models that explain the data well receive higher evidence.

A crucial subtlety: within each model Ml, the prediction p(t|x, Ml, D) already integrates over all parameters w of that model:

p(t|x, Ml, D) = ∫ p(t|x, w) p(w|D, Ml) dw

So BMA is a double average: first over parameters within each model, then over models. In practice, both integrals are intractable for complex models, which motivates the practical methods in the rest of this chapter — committees, boosting, and bagging — which approximate the benefits of model averaging without computing posteriors.

Check: How does Bayesian model averaging differ from a simple committee?

Chapter 2: Committees

The simplest ensemble: train L models independently and average their predictions. For regression, the committee prediction is:

yCOM(x) = (1/L) ∑l=1L yl(x)

Why does this work? Decompose each model's prediction as yl(x) = h(x) + εl(x), where h(x) is the true function and εl is the error of model l. The expected squared error of a single model is:

EAVG = (1/L) ∑l E[εl2]

The expected squared error of the committee is:

ECOM = E[(1/L ∑l εl)2]
The variance reduction theorem: If the errors are uncorrelated (E[εl εk] = 0 for l ≠ k), then ECOM = EAVG / L. The committee error is L times smaller! In practice, errors are partially correlated (models trained on the same data share biases), so the reduction is less dramatic — but still substantial.

What about classification? For a binary problem, each model votes ±1. If each model has error rate p < 0.5 and errors are independent, the probability that the majority vote is wrong follows a binomial distribution. With L = 21 models each having p = 0.3, the committee error drops to about 0.02 — a dramatic improvement over any individual model.

How do we get diverse (uncorrelated) models? Several strategies:

Different initializations: Train neural networks from different random starting weights.

Different subsets: Train each model on a bootstrap sample of the data (bagging).

Different architectures: Use different model families (tree + SVM + neural net).

Different features: Randomly restrict each model's input features (random forests).

The correlation bottleneck: In the general case with correlated errors, the committee error is ECOM = (1/L)EAVG + (1 − 1/L)ECORR, where ECORR captures the average pairwise correlation of errors. As L → ∞, the committee error converges to ECORR, not zero. This is why diversity is so important — and why the rest of this chapter focuses on methods for creating diverse models.
Check: If L models have uncorrelated errors, by what factor does the committee reduce the average error?

Chapter 3: Committee Simulation

Watch how averaging multiple noisy models produces a smoother, more accurate prediction. Each thin line is a single model fit to a noisy sample. The thick line is the committee average. Notice how individual wiggles cancel out.

Committee of Regression Models

Each thin curve is a degree-6 polynomial fit to a random bootstrap sample. The thick curve is their average. The dashed curve is the true function.

L=8 models, degree 6
Check: What happens to the committee prediction as we add more models?

Chapter 4: Boosting

Committees combine strong learners trained independently. Boosting takes the opposite approach: combine many weak learners sequentially, each one correcting the mistakes of the previous ones.

A weak learner is a classifier barely better than random guessing — perhaps a simple decision stump (a one-feature threshold). Boosting shows that you can "boost" such weak learners into an arbitrarily strong classifier.

The boosting insight: Instead of training each model on the same data, boosting re-weights the training points. Data points that the current ensemble misclassifies get higher weight, so the next weak learner focuses on the hard cases. Over rounds, the ensemble builds up strength precisely where it needs it most — on the decision boundary.

The final classifier is a weighted vote:

Y(x) = sign(∑m=1M αm ym(x))

where ym(x) ∈ {−1, +1} is the m-th weak learner and αm is its weight. More accurate learners get higher weight.

Boosting has a remarkable theoretical property: the training error decreases exponentially with the number of rounds M. Even more surprising, the test error often continues to decrease long after the training error reaches zero — boosting increases the margin of the classification.

Boosting and the bias-variance tradeoff: While bagging reduces variance, boosting primarily reduces bias. A single decision stump has high bias (it can only model a single threshold). By combining many stumps, each compensating for the others' errors, the ensemble can model arbitrarily complex decision boundaries. The risk: if boosted too long on noisy data, boosting can start to overfit by fitting the noise. In practice, early stopping (choosing M by cross-validation) is the standard remedy.
Check: How does boosting differ from a committee?

Chapter 5: AdaBoost

The most influential boosting algorithm. AdaBoost (Adaptive Boosting) maintains a weight wn for each training point n. Initially all weights are equal: wn = 1/N.

At each round m = 1, ..., M:

1. Fit weak learner ym(x) to minimize the weighted error:

εm = ∑n: ym(xn) ≠ tn wn / ∑n wn

2. Compute the learner's weight:

αm = ln((1 − εm) / εm)

3. Update the data weights:

wn ← wn · exp(αm · I[ym(xn) ≠ tn])
Reading the formulas: When εm is small (accurate learner), αm is large — accurate learners get more vote. The weight update multiplies by exp(αm) for misclassified points, making them heavier for the next round. Correctly classified points keep their current weight. The effect: each new learner is forced to focus on the "hard" examples.

Bishop shows that AdaBoost is equivalent to minimizing an exponential loss:

E = ∑n=1N exp(−tn fm(xn))

where fm(x) = ∑j=1m αj yj(x) is the ensemble output after m rounds. This exponential loss viewpoint connects boosting to forward stagewise additive modelling — a greedy algorithm that adds one basis function at a time.

The exponential loss has the property that its population minimizer is f*(x) = (1/2) ln[p(t=1|x) / p(t=−1|x)], which is half the log-odds. So AdaBoost is implicitly estimating log-odds, just like logistic regression — but using an ensemble of weak learners as the function class.

Gradient boosting generalizes AdaBoost: Replace the exponential loss with any differentiable loss (squared error, log-loss, Huber loss), and you get gradient boosting. Each round fits a weak learner to the negative gradient of the loss. For squared error, the gradient is the residual, so each tree fits the residuals of the previous ensemble. This framework (Friedman, 2001) underlies XGBoost and LightGBM, two of the most successful ML algorithms for tabular data.
Check: What does the alpha weight in AdaBoost represent?

Chapter 6: Bagging

Bagging (Bootstrap AGGregatING) is beautifully simple: create L different training sets by sampling N points with replacement from the original data, train a model on each, and average their predictions.

yBAG(x) = (1/L) ∑l=1L yl(x)

Each bootstrap sample includes about 63% of the original data points (on average). Why 63%? The probability that a specific point is not selected in any of N draws is (1 − 1/N)N ≈ e−1 ≈ 0.37. So about 37% of points are left out of each bootstrap sample. These out-of-bag (OOB) samples provide a free estimate of generalization error without needing a separate validation set.

Bagging reduces variance, not bias: The bootstrap creates different training sets, so each model sees a slightly different version of the problem. Averaging over these models smooths out the high-variance component of the error. This is why bagging works best with unstable learners — models whose output changes significantly when the training data changes slightly. Decision trees are the prime example: a small change in data can completely alter the tree structure.
MethodTrainingCombinationBest for
CommitteeIndependent, same dataEqual averageDiverse model types
BaggingIndependent, bootstrap dataEqual averageUnstable learners (trees)
BoostingSequential, reweighted dataWeighted voteWeak learners
Check: Why does bagging work best with unstable learners like decision trees?

Chapter 7: Decision Trees

A decision tree partitions the input space into axis-aligned rectangles. At each internal node, a single feature is tested against a threshold. The data flows left or right, until it reaches a leaf node that gives the prediction.

For classification, the tree is built greedily, top-down. At each node, we choose the split (feature j, threshold τ) that maximizes the information gain:

ΔH = H(parent) − [|left|/N · H(left) + |right|/N · H(right)]

where H is the entropy (or Gini impurity) of the class distribution. We keep splitting until a stopping criterion is met (max depth, minimum samples, or purity).

Why trees are unstable: Decision trees are high-variance models. A small change in the training data can change the first split, which cascades into a completely different tree. This is a weakness for individual trees, but a strength for ensembles: the instability creates the diversity that bagging and random forests exploit.

For regression, leaf nodes predict the mean of training points in that region. The split criterion minimizes the sum of squared residuals in the resulting children:

xn ∈ left (tn − t̄left)2 + ∑xn ∈ right (tn − t̄right)2

Trees have appealing properties: interpretability, no need for feature scaling, natural handling of mixed data types. Their weakness — high variance and tendency to overfit — is addressed by ensembles.

Pruning controls tree complexity. Cost-complexity pruning adds a penalty λ per leaf node to the loss. Starting with a fully grown tree, we prune back nodes whose removal reduces the penalized loss. The optimal λ is chosen by cross-validation. This is the tree analog of regularization in linear models.

Trees as adaptive basis functions: A decision tree with L leaf nodes is equivalent to a linear model with L indicator basis functions, one per leaf region. The tree learning algorithm adaptively chooses these basis functions (the partition), while the "weights" (leaf predictions) are just region means. This perspective connects trees to the basis function framework from Chapter 3 and explains why they can model any function given enough depth.
Check: Why are individual decision trees considered "unstable" learners?

Chapter 8: Random Forests

A random forest is bagging applied to decision trees with one crucial addition: at each split, only a random subset of features is considered.

The algorithm:

1. Draw L bootstrap samples from the training data.

2. For each sample, grow a full decision tree. At each node, randomly select m features (out of D total) and find the best split among only those m features.

3. Predict by averaging (regression) or majority vote (classification).

Why random feature selection helps: In bagging alone, if one feature is very strong, every tree will split on it first, making the trees correlated. Random feature restriction forces trees to use different features, decorrelating them. Since the variance reduction of averaging depends on correlation between models (Var = ρσ2 + (1−&rho)σ2/L), reducing correlation ρ directly reduces ensemble variance. The typical choice is m = √D for classification, m = D/3 for regression.
EnsembleBase learnerData diversityFeature diversity
Bagged treesFull treeBootstrapAll features at each split
Random forestFull treeBootstrapRandom m < D features at each split
Boosted treesShallow treeReweightedAll features (typically)

Random forests inherit trees' advantages (no scaling, mixed types, interpretability via feature importance) while dramatically reducing variance. They are consistently among the best off-the-shelf classifiers and rarely require extensive tuning.

Feature importance is a valuable byproduct. For each feature, compute the total decrease in impurity across all nodes that split on it, averaged over all trees. Features that frequently produce large impurity decreases are important. Alternatively, permutation importance measures how much OOB accuracy drops when a feature's values are randomly shuffled — a more robust measure that accounts for correlations between features.

Check: What does random forests add beyond bagging?

Chapter 9: Conditional Mixture Models

So far, our ensembles used fixed combination weights (equal average or fixed αm). What if the optimal combination depends on the input?

A conditional mixture model makes the mixing coefficients input-dependent:

p(t|x) = ∑k=1K πk(x) pk(t|x)

where the gating functions πk(x) are non-negative and sum to one for each x.

Why input-dependent mixing: Consider predicting house prices. In the city center, price depends on floor area and amenities. In the suburbs, it depends on lot size and school district. A single model struggles to capture both regimes. A conditional mixture assigns different experts to different input regions — an urban expert and a suburban expert — with a gating network that learns which expert to trust based on the input.

This framework generalizes the standard mixture model from Chapter 9. There, πk was a constant. Here, πk(x) is a function of the input, typically parameterized by a softmax:

πk(x) = exp(ak(x)) / ∑j exp(aj(x))

where ak(x) is the gating network's output for expert k.

For regression with Gaussian components, each expert k predicts a mean μk(x) and variance σk2(x). The conditional mixture then represents a multimodal predictive distribution. This is powerful for problems where the same input can have multiple plausible outputs — for example, in inverse kinematics, where multiple joint configurations can produce the same end-effector position.

Check: What is the key difference between a conditional mixture and a standard mixture model?

Chapter 10: Mixtures of Experts

The mixture of experts (MoE) model, introduced by Jacobs et al. (1991), is the conditional mixture made concrete. It has two types of learnable components:

Expert networks: K neural networks (or linear models), each specializing in a region of input space. Expert k outputs pk(t|x).

Gating network: A softmax classifier that outputs πk(x) — the probability that expert k is responsible for input x.

Learning via EM: The MoE is trained with the EM algorithm. The latent variable z indicates which expert generated each data point.
E-step: Compute responsibilities rnk = πk(xn) pk(tn|xn) / ∑j πj(xn) pj(tn|xn).
M-step: Update each expert using its responsible data (weighted by rnk). Update the gating network to predict rnk from xn.

Bishop also introduces hierarchical mixtures of experts (HME): the gating network itself is a tree of softmax splits. At each level, a gating function routes the input left or right, and the leaves are expert networks. This gives the model a coarse-to-fine partition of input space — a soft, differentiable version of a decision tree.

The HME illustrates a deep connection: a hard decision tree makes crisp, axis-aligned splits. The hierarchical MoE makes soft, differentiable splits using sigmoid or softmax gates. This means the entire model — experts and gates — can be trained end-to-end by gradient descent, unlike a traditional decision tree.

From MoE to modern transformers: The mixture of experts idea has experienced a renaissance in deep learning. Models like GShard and Switch Transformer use MoE layers inside transformers: each token is routed to a subset of "expert" feed-forward networks by a learned gating function. The key challenge is load balancing — ensuring tokens are distributed evenly across experts rather than collapsing to a single dominant expert. This is the same "expert collapse" problem Bishop discusses in the context of EM training.
Mixture of Experts: Gating Visualization

Three experts (colored curves) each fit a region of the input. The bottom panel shows the gating weights πk(x) — which expert is active where. The thick black curve is the combined prediction.

K=3 experts
Check: What role does the gating network play in a mixture of experts?

Chapter 11: Summary

MethodStrategyKey ideaReduces
Bayesian MAWeighted by evidencePosterior over modelsModel uncertainty
CommitteeEqual averageError cancellationVariance
BaggingBootstrap + averageData diversityVariance
Random forestBagging + random featuresFeature decorrelationVariance (more)
BoostingSequential + reweightFocus on hard examplesBias + variance
Mixtures of expertsInput-dependent gatingSoft partitioningBias (local experts)
The combining principle: All these methods exploit the same fundamental insight: a single model's error has a systematic component (bias) and a random component (variance). By combining multiple models — whether by averaging (committees, bagging), sequential correction (boosting), or input-dependent routing (MoE) — we can attack one or both components. The bias-variance tradeoff (Ch 3) is not just a diagnostic tool; it is the design principle behind every ensemble method in this chapter.

Chapter 14 completes Bishop's PRML. From the simplest polynomial curve fit in Chapter 1 to the ensemble methods here, the thread is consistent: start with a probabilistic model, understand its limitations, and build principled machinery to overcome them.

Notice how these methods compose with everything earlier in the book. Bagging with neural networks (Ch 5). Boosted decision stumps for classification (Ch 4 + 7). Random forests using information-theoretic splits (Ch 1). Mixtures of experts trained with EM (Ch 9). The combining framework does not replace individual models — it amplifies them.

"Methods for combining multiple models together
can give improved results compared with
any single model acting alone."
— Christopher Bishop, PRML §14.1
Check: What distinguishes boosting from bagging in terms of what they reduce?