Ch 14: Combining Models

Chapter 0: Why Combine Models?

You have trained five different classifiers on the same data. Each one gets about 80% accuracy. Should you pick the best one and throw away the rest?

Surprisingly, no. If you average their predictions, you often do better than the best individual model, provided the models make different mistakes. This is the central insight of Chapter 14: combining models can reduce error in ways that no single model can.

The key intuition: Imagine each model makes independent errors. When one model is wrong on a data point, the others are likely right. By averaging, the individual errors cancel out while the correct signal reinforces. The more independent the errors, the more you gain from combining. A committee of five models with uncorrelated errors can reduce variance by a factor of five.

This chapter covers three distinct strategies for combining models:

Averaging methods: Train models independently, then average their predictions. Committees, bagging, and random forests.

Sequential methods: Train models one at a time, each focusing on what previous models got wrong. Boosting and AdaBoost.

And a third, more flexible approach: mixtures of experts, where a gating network learns which model to trust for each input region.

A modern perspective: Nearly every state-of-the-art machine learning system uses ensembling. Kaggle competitions are won by model stacking. GPT-4 and other frontier LLMs are rumored to use mixtures of experts. Random forests remain the go-to baseline for tabular data. The ideas in this chapter are not historical curiosities — they are the backbone of practical ML.

Check: Why does averaging predictions from multiple models typically outperform any single model?

• Individual errors tend to cancel out while correct signals reinforce, reducing overall variance
• Because the average model is always more complex
• Because it trains on more data

Chapter 1: Bayesian Model Averaging

Before we combine models heuristically, let us see what the Bayesian framework says. Given a set of models {M₁, ..., M_L}, the predictive distribution is:

p(t|x, D) = ∑_{l=1}^{L} p(t|x, M_l, D) p(M_l|D)

where p(M_l|D) is the posterior probability of model l given the data (from Bayes' theorem), and p(t|x, M_l, D) is the prediction of model l with its parameters integrated out.

Averaging vs. selection: Bayesian model averaging (BMA) is not the same as a committee. BMA averages over the space of models, weighted by how well each model explains the data. As data increases, the posterior typically concentrates on one model (the "true" one), and BMA converges to model selection. A committee, by contrast, keeps all models contributing equally, which can outperform BMA when no single model is correct.

The model posterior is computed via Bayes' theorem:

p(M_l|D) ∝ p(D|M_l) p(M_l)

where p(D|M_l) is the model evidence (marginal likelihood) from Chapter 3. This automatically implements Occam's razor: simpler models that explain the data well receive higher evidence.
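As a rough illustration of the posterior formula above, the sketch below turns log evidences into model weights and forms a weighted predictive mean. The log-evidence values, the per-model predictive means, and the uniform prior are all made-up assumptions for illustration; the function name `model_posterior` is hypothetical.

```python
import numpy as np

def model_posterior(log_evidence, log_prior=None):
    """Posterior p(M_l | D) from log evidences, computed stably in log space."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:
        # Assumption: uniform prior p(M_l) over the models.
        log_prior = np.full_like(log_evidence, -np.log(len(log_evidence)))
    log_post = log_evidence + log_prior
    log_post -= log_post.max()           # subtract the max to avoid underflow
    post = np.exp(log_post)
    return post / post.sum()             # normalise so the weights sum to one

# Hypothetical log evidences ln p(D | M_l) for three models (made-up numbers).
log_ev = [-105.2, -101.7, -103.9]
weights = model_posterior(log_ev)

# Hypothetical per-model predictive means at a single test input.
pred_means = np.array([1.8, 2.1, 2.4])
bma_mean = float(weights @ pred_means)   # mean of the BMA predictive mixture
```

Even modest differences in log evidence give one model most of the weight, which is the concentration effect described above.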

A crucial subtlety: within each model M_l, the prediction p(t|x, M_l, D) already integrates over all parameters w of that model:

p(t|x, M_l, D) = ∫ p(t|x, w, M_l) p(w|D, M_l) dw

So BMA is a double average: first over parameters within each model, then over models. In practice, both integrals are intractable for complex models, which motivates the practical methods in the rest of this chapter — committees, boosting, and bagging — which approximate the benefits of model averaging without computing posteriors.

Check: How does Bayesian model averaging differ from a simple committee?

• BMA weights models by posterior probability and converges to model selection; a committee weights models equally and benefits from error cancellation
• They are the same thing
• BMA always outperforms committees

Chapter 2: Committees

The simplest ensemble: train L models independently and average their predictions. For regression, the committee prediction is:

y_COM(x) = (1/L) ∑_{l=1}^{L} y_l(x)

Why does this work? Decompose each model's prediction as y_l(x) = h(x) + ε_l(x), where h(x) is the true function and ε_l is the error of model l. The expected squared error of a single model, averaged over the L models, is:

E_AVG = (1/L) ∑_l E[ε_l²]

The expected squared error of the committee is:

E_COM = E[(1/L ∑_l ε_l)²]

The variance reduction result: If the errors have zero mean and are uncorrelated (E[ε_l] = 0 and E[ε_l ε_k] = 0 for l ≠ k), then E_COM = E_AVG / L. The committee error is L times smaller! In practice, errors are correlated (models trained on the same data share biases), so the reduction is less dramatic. Even so, the committee error never exceeds E_AVG.
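A quick numerical check of the uncorrelated case, as a sketch: the errors here are drawn as independent zero-mean Gaussians, which is an idealised assumption rather than something real models satisfy.

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_points = 5, 200_000

# Assumption: zero-mean, unit-variance, mutually independent errors eps_l(x).
eps = rng.normal(0.0, 1.0, size=(L, n_points))

E_avg = np.mean(eps ** 2)                # average single-model squared error
E_com = np.mean(eps.mean(axis=0) ** 2)   # squared error of the committee average

ratio = E_avg / E_com                    # close to L when errors are uncorrelated
```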

What about classification? For a binary problem, each model votes ±1. If each model has error rate p < 0.5 and errors are independent, the number of wrong votes follows a binomial distribution, and the majority vote is wrong only when more than half the models err. With L = 21 models each having p = 0.3, the committee error drops to about 0.026, a dramatic improvement over any individual model.
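The binomial tail can be evaluated directly. This is a small sketch that assumes fully independent voters and an odd number of them; `majority_vote_error` is a hypothetical helper name.

```python
from math import comb

def majority_vote_error(L, p):
    """P(more than half of L independent voters are wrong), for odd L."""
    k_min = L // 2 + 1                     # smallest number of wrong votes that loses
    return sum(comb(L, k) * p**k * (1 - p)**(L - k) for k in range(k_min, L + 1))

err = majority_vote_error(21, 0.3)         # approximately 0.026
```

Because real models rarely err independently, this figure should be read as a best case.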

How do we get diverse (uncorrelated) models? Several strategies:

• Different initializations: Train neural networks from different random starting weights.

• Different subsets: Train each model on a bootstrap sample of the data (bagging).

• Different architectures: Use different model families (tree + SVM + neural net).

• Different features: Randomly restrict each model's input features (random forests).

The correlation bottleneck: In the general case with correlated errors, the committee error is E_COM = (1/L)E_AVG + (1 − 1/L)E_CORR, where E_CORR is the average pairwise covariance of the errors, E[ε_l ε_k] averaged over all pairs l ≠ k. As L → ∞, the committee error converges to E_CORR, not zero. This is why diversity is so important, and why the rest of this chapter focuses on methods for creating diverse models.
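The decomposition follows from expanding the square in E_COM and separating diagonal from off-diagonal terms. A short derivation, using the same symbols as above:

```latex
\begin{align}
E_{\mathrm{COM}}
  &= \mathbb{E}\Big[\Big(\tfrac{1}{L}\sum_{l=1}^{L}\epsilon_l\Big)^{2}\Big]
   = \frac{1}{L^{2}}\sum_{l=1}^{L}\mathbb{E}[\epsilon_l^{2}]
     + \frac{1}{L^{2}}\sum_{l\neq k}\mathbb{E}[\epsilon_l\epsilon_k] \\
  &= \frac{1}{L}E_{\mathrm{AVG}} + \Big(1-\frac{1}{L}\Big)E_{\mathrm{CORR}},
\qquad
E_{\mathrm{CORR}} = \frac{1}{L(L-1)}\sum_{l\neq k}\mathbb{E}[\epsilon_l\epsilon_k].
\end{align}
```

Setting E_CORR = 0 recovers the uncorrelated result E_COM = E_AVG / L.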

Check: If L models have uncorrelated errors, by what factor does the committee reduce the average error?

• By a factor of L: the committee error is E_AVG / L
• By a factor of L^2
• It does not reduce the error

Chapter 3: Committee Simulation

Watch how averaging multiple noisy models produces a smoother, more accurate prediction. Each thin line is a single model fit to a noisy sample. The thick line is the committee average. Notice how individual wiggles cancel out.

[Figure: Committee of Regression Models. Each thin curve is a degree-6 polynomial fit to a random bootstrap sample. The thick curve is their average. The dashed curve is the true function. Shown with L = 8 models.]

Check: What happens to the committee prediction as we add more models?

• It becomes smoother and closer to the true function, as individual model variance cancels out
• It overfits more because there are more parameters
• It does not change

Chapter 4: Boosting

Committees combine strong learners trained independently. Boosting takes the opposite approach: combine many weak learners sequentially, each one correcting the mistakes of the previous ones.

A weak learner is a classifier barely better than random guessing — perhaps a simple decision stump (a one-feature threshold). Boosting shows that you can "boost" such weak learners into an arbitrarily strong classifier.

The boosting insight: Instead of training each model on the same data, boosting re-weights the training points. Data points that the current ensemble misclassifies get higher weight, so the next weak learner focuses on the hard cases. Over rounds, the ensemble builds up strength precisely where it needs it most — on the decision boundary.

The final classifier is a weighted vote:

Y(x) = sign(∑_{m=1}^{M} α_m y_m(x))

where y_m(x) ∈ {−1, +1} is the m-th weak learner and α_m is its weight. More accurate learners get higher weight.

Boosting has a remarkable theoretical property: as long as every weak learner beats random guessing by some fixed margin, the training error decreases exponentially with the number of rounds M. Even more surprising, the test error often continues to decrease long after the training error reaches zero, because further rounds keep increasing the margin of the classification.

Boosting and the bias-variance tradeoff: While bagging reduces variance, boosting primarily reduces bias. A single decision stump has high bias (it can only model a single threshold). By combining many stumps, each compensating for the others' errors, the ensemble can model arbitrarily complex decision boundaries. The risk: if boosted too long on noisy data, boosting can start to overfit by fitting the noise. In practice, early stopping (choosing M by cross-validation) is the standard remedy.

Check: How does boosting differ from a committee?

• Boosting trains weak learners sequentially, each focusing on previously misclassified points; a committee trains models independently
• Boosting always uses strong learners
• There is no difference

Chapter 5: AdaBoost

The most influential boosting algorithm. AdaBoost (Adaptive Boosting) maintains a weight w_n for each training point n. Initially all weights are equal: w_n = 1/N.

At each round m = 1, ..., M:

1. Fit weak learner y_m(x) to minimize the weighted error:

ε_m = ∑_{n: y_m(x_n) ≠ t_n} w_n / ∑_n w_n

2. Compute the learner's weight:

α_m = ln((1 − ε_m) / ε_m)

3. Update the data weights:

w_n ← w_n · exp(α_m · I[y_m(x_n) ≠ t_n])

Reading the formulas: When ε_m is small (accurate learner), α_m is large — accurate learners get more vote. The weight update multiplies by exp(α_m) for misclassified points, making them heavier for the next round. Correctly classified points keep their current weight. The effect: each new learner is forced to focus on the "hard" examples.
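The three steps above can be sketched in NumPy with decision stumps as weak learners. This is a minimal illustration, not a tuned implementation: the exhaustive stump search, the clipping of ε_m, and the toy disc-shaped data are all assumptions made for the example.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """One-feature threshold classifier with outputs in {-1, +1}."""
    return sign * np.where(X[:, j] > thr, 1, -1)

def fit_stump(X, t, w):
    """Exhaustively pick the (feature, threshold, sign) with lowest weighted error."""
    best = (np.inf, 0, 0.0, 1)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                miss = stump_predict(X, j, thr, sign) != t
                err = w[miss].sum() / w.sum()
                if err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, t, M=50):
    w = np.full(len(t), 1.0 / len(t))           # initial weights w_n = 1/N
    ensemble = []
    for _ in range(M):
        eps, j, thr, sign = fit_stump(X, t, w)  # step 1: minimise weighted error
        eps = np.clip(eps, 1e-10, 1 - 1e-10)    # guard against log(0)
        alpha = np.log((1 - eps) / eps)         # step 2: learner weight
        miss = stump_predict(X, j, thr, sign) != t
        w = w * np.exp(alpha * miss)            # step 3: up-weight the mistakes
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def predict(ensemble, X):
    """Weighted vote: sign of the alpha-weighted sum of stump outputs."""
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.sign(score)

# Toy data (an assumption for illustration): label +1 inside a disc, -1 outside.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
t = np.where((X ** 2).sum(axis=1) < 0.5, 1, -1)

ensemble = adaboost(X, t, M=50)
train_acc = np.mean(predict(ensemble, X) == t)
```

No single stump can separate a disc from its surroundings, so any training accuracy well above the majority-class rate comes from the weighted combination.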

Bishop shows that AdaBoost is equivalent to minimizing an exponential loss:

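E = ∑_{n=1}^{N} exp(−t_n f_m(x_n))

where f_m(x) = (1/2) ∑_{j=1}^{m} α_j y_j(x) is the ensemble output after m rounds. The factor 1/2 is what makes this loss consistent with the definition α_m = ln((1 − ε_m) / ε_m); it does not change the sign of the final vote. This exponential-loss viewpoint connects boosting to forward stagewise additive modelling, a greedy algorithm that adds one basis function at a time.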

The exponential loss has the property that its population minimizer is f*(x) = (1/2) ln[p(t=1|x) / p(t=−1|x)], which is half the log-odds. So AdaBoost is implicitly estimating log-odds, just like logistic regression — but using an ensemble of weak learners as the function class.
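The population minimizer can be obtained by minimising the conditional expected loss pointwise in f(x). A short derivation, writing p₊ = p(t=1|x) and p₋ = p(t=−1|x):

```latex
\begin{align}
\mathbb{E}_{t\mid x}\big[e^{-t f(x)}\big]
  &= p_{+}\,e^{-f(x)} + p_{-}\,e^{f(x)}, \\
\frac{\partial}{\partial f(x)}\,\mathbb{E}_{t\mid x}\big[e^{-t f(x)}\big]
  &= -p_{+}\,e^{-f(x)} + p_{-}\,e^{f(x)} = 0
  \;\Longrightarrow\;
  e^{2 f(x)} = \frac{p_{+}}{p_{-}}, \\
f^{*}(x) &= \frac{1}{2}\ln\frac{p(t=1\mid x)}{p(t=-1\mid x)}.
\end{align}
```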

Gradient boosting generalizes AdaBoost: Replace the exponential loss with any differentiable loss (squared error, log-loss, Huber loss), and you get gradient boosting. Each round fits a weak learner to the negative gradient of the loss. For squared error, the negative gradient is the residual, so each tree fits the residuals of the previous ensemble. This framework (Friedman, 2001) underlies XGBoost and LightGBM, two of the most successful ML algorithms for tabular data.
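For the squared-error case, the residual-fitting loop can be sketched as follows. The one-dimensional regression stumps, the shrinkage factor `lr`, and the noisy sine data are assumptions chosen for the example; they are not part of the text above.

```python
import numpy as np

def fit_regression_stump(x, r):
    """Best single-threshold split of 1-D inputs x for targets r (squared error)."""
    best = (np.inf, None, r.mean(), r.mean())
    for thr in np.unique(x)[:-1]:               # skip the max so both sides are non-empty
        left, right = r[x <= thr], r[x > thr]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, thr, left.mean(), right.mean())
    return best[1:]                             # (threshold, left value, right value)

def gradient_boost(x, t, M=100, lr=0.1):
    f0 = t.mean()                               # constant initial model
    pred = np.full_like(t, f0)
    stumps = []
    for _ in range(M):
        residual = t - pred                     # negative gradient of 0.5 * (t - f)^2
        thr, left_val, right_val = fit_regression_stump(x, residual)
        pred = pred + lr * np.where(x <= thr, left_val, right_val)
        stumps.append((thr, left_val, right_val))
    return f0, stumps, pred

# Toy data (an assumption for illustration): a noisy sine curve.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

f0, stumps, pred = gradient_boost(x, t)
mse_start = np.mean((t - t.mean()) ** 2)        # error of the constant model
mse_end = np.mean((t - pred) ** 2)              # training error after boosting
```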

Check: What does the alpha weight in AdaBoost represent?

• The voting weight of a weak learner: more accurate learners (lower epsilon) get higher alpha
• The learning rate
• The number of weak learners to use

Chapter 6: Bagging

Bagging (Bootstrap AGGregatING) is beautifully simple: create L different training sets by sampling N points with replacement from the original data, train a model on each, and average their predictions.

y_BAG(x) = (1/L) ∑_{l=1}^{L} y_l(x)

Each bootstrap sample contains about 63% of the distinct original data points (on average); the rest of its N entries are repeats. Why 63%? The probability that a specific point is not selected in any of N draws is (1 − 1/N)^N, which approaches e⁻¹ ≈ 0.37 for large N. So about 37% of points are left out of each bootstrap sample. These out-of-bag (OOB) samples provide a free estimate of generalization error without needing a separate validation set.
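The 63% figure is easy to confirm by simulation. A small sketch, with the dataset size N and the number of trials chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_trials = 1000, 200

fractions = []
for _ in range(n_trials):
    idx = rng.integers(0, N, size=N)            # draw N indices with replacement
    fractions.append(np.unique(idx).size / N)   # fraction of distinct points drawn

in_bag = float(np.mean(fractions))              # close to 1 - 1/e
out_of_bag = 1.0 - in_bag                       # close to 1/e
exact = 1.0 - (1.0 - 1.0 / N) ** N              # exact expectation for this N
```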

Bagging reduces variance, not bias: The bootstrap creates different training sets, so each model sees a slightly different version of the problem. Averaging over these models smooths out the high-variance component of the error. This is why bagging works best with unstable learners — models whose output changes significantly when the training data changes slightly. Decision trees are the prime example: a small change in data can completely alter the tree structure.

Method	Training	Combination	Best for
Committee	Independent, same data	Equal average	Diverse model types
Bagging	Independent, bootstrap data	Equal average	Unstable learners (trees)
Boosting	Sequential, reweighted data	Weighted vote	Weak learners

Check: Why does bagging work best with unstable learners like decision trees?

• Unstable learners produce diverse models from bootstrap samples, and averaging diverse models reduces variance
• Because stable learners cannot be bagged
• Because bagging increases the number of training points

Chapter 7: Decision Trees

A decision tree partitions the input space into axis-aligned rectangles. At each internal node, a single feature is tested against a threshold. The data flows left or right, until it reaches a leaf node that gives the prediction.

For classification, the tree is built greedily, top-down. At each node, we choose the split (feature j, threshold τ) that maximizes the information gain:

ΔH = H(parent) − [|left|/N · H(left) + |right|/N · H(right)]

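where H is the entropy of the class distribution at a node, N is the number of training points at the parent node, and |left| and |right| are the numbers of points sent to each child. Gini impurity can be used in place of entropy. We keep splitting until a stopping criterion is met (max depth, minimum samples, or purity).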
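The split criterion can be computed in a few lines. This sketch uses entropy in bits and a tiny hand-made dataset; the helper names `entropy` and `information_gain` are hypothetical.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (in bits) of the empirical class distribution."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(x, labels, threshold):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    left, right = labels[x <= threshold], labels[x > threshold]
    n = len(labels)
    children = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - children

# Hand-made example: one feature, two balanced classes.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
labels = np.array([0, 0, 0, 1, 1, 1])

gain_good = information_gain(x, labels, 3.5)    # separates the classes perfectly
gain_poor = information_gain(x, labels, 1.5)    # isolates a single point
```

A greedy tree builder would evaluate this gain for every candidate feature and threshold at a node and keep the largest.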

Why trees are unstable: Decision trees are high-variance models. A small change in the training data can change the first split, which cascades into a completely different tree. This is a weakness for individual trees, but a strength for ensembles: the instability creates the diversity that bagging and random forests exploit.

For regression, leaf nodes predict the mean of training points in that region. The split criterion minimizes the sum of squared residuals in the resulting children:

∑_{x_n ∈ left} (t_n − t̄_left)² + ∑_{x_n ∈ right} (t_n − t̄_right)²

Trees have appealing properties: interpretability, no need for feature scaling, natural handling of mixed data types. Their weakness — high variance and tendency to overfit — is addressed by ensembles.

Pruning controls tree complexity. Cost-complexity pruning adds a penalty λ per leaf node to the loss. Starting with a fully grown tree, we prune back nodes whose removal reduces the penalized loss. The optimal λ is chosen by cross-validation. This is the tree analog of regularization in linear models.

Trees as adaptive basis functions: A decision tree with L leaf nodes is equivalent to a linear model with L indicator basis functions, one per leaf region. The tree learning algorithm adaptively chooses these basis functions (the partition), while the "weights" (leaf predictions) are just region means. This perspective connects trees to the basis function framework from Chapter 3 and explains why, given enough depth, they can approximate a very wide class of functions with a piecewise-constant fit.

Check: Why are individual decision trees considered "unstable" learners?

• A small change in training data can alter the root split, cascading into a completely different tree structure
• They always underfit
• They cannot handle continuous features

Chapter 8: Random Forests

A random forest is bagging applied to decision trees with one crucial addition: at each split, only a random subset of features is considered.

The algorithm:

1. Draw L bootstrap samples from the training data.

2. For each sample, grow a full decision tree. At each node, randomly select m features (out of D total) and find the best split among only those m features.

3. Predict by averaging (regression) or majority vote (classification).

Why random feature selection helps: In bagging alone, if one feature is very strong, every tree will split on it first, making the trees correlated. Random feature restriction forces trees to use different features, decorrelating them. Since the variance reduction of averaging depends on correlation between models (Var = ρσ² + (1−ρ)σ²/L, where σ² is the variance of a single tree and ρ the pairwise correlation), reducing correlation ρ directly reduces ensemble variance. The typical choice is m = √D for classification, m = D/3 for regression.
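The variance formula can be checked numerically. In this sketch the "tree outputs" are simulated as equicorrelated Gaussians built from a shared and a private component, which is a simplifying assumption rather than a model of real trees.

```python
import numpy as np

def ensemble_variance(rho, sigma2, L):
    """Variance of the average of L outputs with variance sigma2 and pairwise correlation rho."""
    return rho * sigma2 + (1 - rho) * sigma2 / L

rng = np.random.default_rng(0)
L, rho, sigma2, n = 10, 0.3, 1.0, 200_000

# Shared component gives the correlation; private components are independent.
shared = rng.normal(size=n) * np.sqrt(rho * sigma2)
private = rng.normal(size=(L, n)) * np.sqrt((1 - rho) * sigma2)
outputs = shared + private                      # each row: variance sigma2, correlation rho

empirical = outputs.mean(axis=0).var()
predicted = ensemble_variance(rho, sigma2, L)   # 0.3 + 0.7 / 10 = 0.37
```

As L grows, the second term vanishes and the variance floors at ρσ², which is why lowering ρ matters more than adding trees.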

Ensemble	Base learner	Data diversity	Feature diversity
Bagged trees	Full tree	Bootstrap	All features at each split
Random forest	Full tree	Bootstrap	Random m < D features at each split
Boosted trees	Shallow tree	Reweighted	All features (typically)

Random forests inherit trees' advantages (no scaling, mixed types, interpretability via feature importance) while dramatically reducing variance. They are consistently among the best off-the-shelf classifiers and rarely require extensive tuning.

Feature importance is a valuable byproduct. For each feature, compute the total decrease in impurity across all nodes that split on it, averaged over all trees. Features that frequently produce large impurity decreases are important. Alternatively, permutation importance measures how much OOB accuracy drops when a feature's values are randomly shuffled. It is less biased toward features with many distinct values than impurity-based importance, although both measures can be misleading when features are strongly correlated.

Check: What does random forests add beyond bagging?

• Random feature selection at each split, which decorrelates the trees and further reduces ensemble variance
• Deeper trees
• Weighted voting instead of equal averaging

Chapter 9: Conditional Mixture Models

So far, our ensembles used fixed combination weights (equal average or fixed α_m). What if the optimal combination depends on the input?

A conditional mixture model makes the mixing coefficients input-dependent:

p(t|x) = ∑_{k=1}^{K} π_k(x) p_k(t|x)

where the gating functions π_k(x) are non-negative and sum to one for each x.

Why input-dependent mixing: Consider predicting house prices. In the city center, price depends on floor area and amenities. In the suburbs, it depends on lot size and school district. A single model struggles to capture both regimes. A conditional mixture assigns different experts to different input regions — an urban expert and a suburban expert — with a gating network that learns which expert to trust based on the input.

This framework generalizes the standard mixture model from Chapter 9. There, π_k was a constant. Here, π_k(x) is a function of the input, typically parameterized by a softmax:

π_k(x) = exp(a_k(x)) / ∑_j exp(a_j(x))

where a_k(x) is the gating network's output for expert k.
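A minimal sketch of softmax gating with two experts. The linear gating function a_k(x) = v_k x + c_k, its parameter values, and the linear expert means are hypothetical choices made only to show the mechanics.

```python
import numpy as np

def softmax(a):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical linear gating network a_k(x) = v_k * x + c_k for K = 2 experts.
v = np.array([-4.0, 4.0])
c = np.array([0.0, 0.0])

x = np.array([-2.0, 0.0, 2.0])
gates = softmax(np.outer(x, v) + c)             # shape (3, K); each row sums to one

# Hypothetical expert means; the mixture mean is the gate-weighted average.
expert_means = np.stack([1.0 + 0.5 * x, -1.0 + 2.0 * x], axis=1)
mixture_mean = (gates * expert_means).sum(axis=1)
```

With these parameters, expert 0 dominates for negative x, expert 1 for positive x, and both share the input at x = 0.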

For regression with Gaussian components, each expert k predicts a mean μ_k(x) and variance σ_k²(x). The conditional mixture then represents a multimodal predictive distribution. This is powerful for problems where the same input can have multiple plausible outputs — for example, in inverse kinematics, where multiple joint configurations can produce the same end-effector position.

Check: What is the key difference between a conditional mixture and a standard mixture model?

• In a conditional mixture, the mixing coefficients depend on the input x, allowing different experts to dominate in different input regions
• Conditional mixtures use more components
• There is no difference

Chapter 10: Mixtures of Experts

The mixture of experts (MoE) model, introduced by Jacobs et al. (1991), is the conditional mixture made concrete. It has two types of learnable components:

Expert networks: K neural networks (or linear models), each specializing in a region of input space. Expert k outputs p_k(t|x).

Gating network: A softmax classifier that outputs π_k(x) — the probability that expert k is responsible for input x.

Learning via EM: The MoE is trained with the EM algorithm. The latent variable z indicates which expert generated each data point.
E-step: Compute responsibilities r_nk = π_k(x_n) p_k(t_n|x_n) / ∑_j π_j(x_n) p_j(t_n|x_n).
M-step: Update each expert using its responsible data (weighted by r_nk). Update the gating network to predict r_nk from x_n.
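One EM iteration for the experts can be sketched as below, assuming linear-Gaussian experts with a shared, fixed noise variance and gates held constant at 1/K. The gating update is omitted here because it typically needs its own iterative solver. All names and the toy data are assumptions for illustration.

```python
import numpy as np

def gaussian_pdf(t, mean, var):
    return np.exp(-0.5 * (t - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

def e_step(gates, expert_means, t, var):
    """Responsibilities r_nk = pi_k(x_n) p_k(t_n|x_n) / sum_j pi_j(x_n) p_j(t_n|x_n)."""
    joint = gates * gaussian_pdf(t[:, None], expert_means, var)   # shape (N, K)
    return joint / joint.sum(axis=1, keepdims=True)

def m_step_expert(x, t, r_k):
    """Responsibility-weighted least squares for one linear expert t = w0 + w1 * x."""
    A = np.stack([np.ones_like(x), x], axis=1)
    return np.linalg.solve(A.T @ (r_k[:, None] * A), A.T @ (r_k * t))

# Toy data (an assumption): two linear regimes, t = |x|.
x = np.linspace(-1.0, 1.0, 50)
t = np.abs(x)

gates = np.full((x.size, 2), 0.5)                     # gates fixed at 1/K in this sketch
expert_means = np.stack([-0.5 * x, 0.5 * x], axis=1)  # rough initial experts

r = e_step(gates, expert_means, t, var=0.1)
w_0 = m_step_expert(x, t, r[:, 0])                    # expert 0: [intercept, slope]
w_1 = m_step_expert(x, t, r[:, 1])                    # expert 1: [intercept, slope]
```

After this single iteration, expert 0 has taken responsibility for the left branch and expert 1 for the right branch, so their slopes move toward −1 and +1 respectively.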

Bishop also describes hierarchical mixtures of experts (HME), due to Jordan and Jacobs (1994): the gating network itself is a tree of softmax splits. At each level, a gating function routes the input softly left or right, and the leaves are expert networks. This gives the model a coarse-to-fine partition of input space: a soft, differentiable version of a decision tree.

The HME illustrates a deep connection: a hard decision tree makes crisp, axis-aligned splits. The hierarchical MoE makes soft, differentiable splits using sigmoid or softmax gates. This means the entire model — experts and gates — can be trained end-to-end by gradient descent, unlike a traditional decision tree.

From MoE to modern transformers: The mixture of experts idea has experienced a renaissance in deep learning. Models like GShard and Switch Transformer use MoE layers inside transformers: each token is routed to a subset of "expert" feed-forward networks by a learned gating function. The key challenge is load balancing: ensuring tokens are distributed evenly across experts rather than collapsing to a single dominant expert. A similar failure mode, in which one expert comes to dominate the gating outputs, can also arise when training classical mixtures of experts.

[Figure: Mixture of Experts: Gating Visualization. Three experts (colored curves) each fit a region of the input. The bottom panel shows the gating weights π_k(x), indicating which expert is active where. The thick black curve is the combined prediction. Shown with K = 3 experts.]

Check: What role does the gating network play in a mixture of experts?

• It assigns input-dependent weights to each expert, determining which expert handles which input region
• It trains the experts
• It prevents overfitting

Chapter 11: Summary

Method	Strategy	Key idea	Reduces
Bayesian MA	Weighted by evidence	Posterior over models	Model uncertainty
Committee	Equal average	Error cancellation	Variance
Bagging	Bootstrap + average	Data diversity	Variance
Random forest	Bagging + random features	Feature decorrelation	Variance (more)
Boosting	Sequential + reweight	Focus on hard examples	Bias + variance
Mixtures of experts	Input-dependent gating	Soft partitioning	Bias (local experts)

The combining principle: All these methods exploit the same fundamental insight: a single model's error has a systematic component (bias) and a random component (variance). By combining multiple models — whether by averaging (committees, bagging), sequential correction (boosting), or input-dependent routing (MoE) — we can attack one or both components. The bias-variance tradeoff (Ch 3) is not just a diagnostic tool; it is the design principle behind every ensemble method in this chapter.

Chapter 14 completes Bishop's PRML. From the simplest polynomial curve fit in Chapter 1 to the ensemble methods here, the thread is consistent: start with a probabilistic model, understand its limitations, and build principled machinery to overcome them.

Notice how these methods compose with everything earlier in the book. Bagging with neural networks (Ch 5). Boosted decision stumps for classification (Ch 4 + 7). Random forests using information-theoretic splits (Ch 1). Mixtures of experts trained with EM (Ch 9). The combining framework does not replace individual models — it amplifies them.

"Methods for combining multiple models together
can give improved results compared with
any single model acting alone."
— Christopher Bishop, PRML §14.1

Check: What distinguishes boosting from bagging in terms of what they reduce?

• Bagging primarily reduces variance (by averaging), while boosting reduces both bias and variance (by sequentially correcting errors)
• They both reduce only variance
• Bagging reduces bias, boosting reduces variance