Bayesian model averaging, committees, boosting, bagging, decision trees, random forests, and mixtures of experts — why ensembles beat single models.
You have trained five different classifiers on the same data. Each one gets about 80% accuracy. Should you pick the best one and throw away the rest?
Surprisingly, no. If you average their predictions, you almost always do better than the best individual model. This is the central insight of Chapter 14: combining models reduces error in ways that no single model can.
This chapter covers three distinct strategies for combining models:
Averaging methods: Train models independently, then average their predictions. Committees, bagging, and random forests.
Sequential methods: Train models one at a time, each focusing on what previous models got wrong. Boosting and AdaBoost.
And a third, more flexible approach: mixtures of experts, where a gating network learns which model to trust for each input region.
Before we combine models heuristically, let us see what the Bayesian framework says. Given a set of models {M1, ..., ML}, the predictive distribution is:
where p(Ml|D) is the posterior probability of model l given the data (from Bayes' theorem), and p(t|x, Ml, D) is the prediction of model l with its parameters integrated out.
The model posterior is computed via Bayes' theorem:
where p(D|Ml) is the model evidence (marginal likelihood) from Chapter 3. This automatically implements Occam's razor: simpler models that explain the data well receive higher evidence.
A crucial subtlety: within each model Ml, the prediction p(t|x, Ml, D) already integrates over all parameters w of that model:
So BMA is a double average: first over parameters within each model, then over models. In practice, both integrals are intractable for complex models, which motivates the practical methods in the rest of this chapter — committees, boosting, and bagging — which approximate the benefits of model averaging without computing posteriors.
The simplest ensemble: train L models independently and average their predictions. For regression, the committee prediction is:
Why does this work? Decompose each model's prediction as yl(x) = h(x) + εl(x), where h(x) is the true function and εl is the error of model l. The expected squared error of a single model is:
The expected squared error of the committee is:
What about classification? For a binary problem, each model votes ±1. If each model has error rate p < 0.5 and errors are independent, the probability that the majority vote is wrong follows a binomial distribution. With L = 21 models each having p = 0.3, the committee error drops to about 0.02 — a dramatic improvement over any individual model.
How do we get diverse (uncorrelated) models? Several strategies:
• Different initializations: Train neural networks from different random starting weights.
• Different subsets: Train each model on a bootstrap sample of the data (bagging).
• Different architectures: Use different model families (tree + SVM + neural net).
• Different features: Randomly restrict each model's input features (random forests).
Watch how averaging multiple noisy models produces a smoother, more accurate prediction. Each thin line is a single model fit to a noisy sample. The thick line is the committee average. Notice how individual wiggles cancel out.
Each thin curve is a degree-6 polynomial fit to a random bootstrap sample. The thick curve is their average. The dashed curve is the true function.
Committees combine strong learners trained independently. Boosting takes the opposite approach: combine many weak learners sequentially, each one correcting the mistakes of the previous ones.
A weak learner is a classifier barely better than random guessing — perhaps a simple decision stump (a one-feature threshold). Boosting shows that you can "boost" such weak learners into an arbitrarily strong classifier.
The final classifier is a weighted vote:
where ym(x) ∈ {−1, +1} is the m-th weak learner and αm is its weight. More accurate learners get higher weight.
Boosting has a remarkable theoretical property: the training error decreases exponentially with the number of rounds M. Even more surprising, the test error often continues to decrease long after the training error reaches zero — boosting increases the margin of the classification.
The most influential boosting algorithm. AdaBoost (Adaptive Boosting) maintains a weight wn for each training point n. Initially all weights are equal: wn = 1/N.
At each round m = 1, ..., M:
1. Fit weak learner ym(x) to minimize the weighted error:
2. Compute the learner's weight:
3. Update the data weights:
Bishop shows that AdaBoost is equivalent to minimizing an exponential loss:
where fm(x) = ∑j=1m αj yj(x) is the ensemble output after m rounds. This exponential loss viewpoint connects boosting to forward stagewise additive modelling — a greedy algorithm that adds one basis function at a time.
The exponential loss has the property that its population minimizer is f*(x) = (1/2) ln[p(t=1|x) / p(t=−1|x)], which is half the log-odds. So AdaBoost is implicitly estimating log-odds, just like logistic regression — but using an ensemble of weak learners as the function class.
Bagging (Bootstrap AGGregatING) is beautifully simple: create L different training sets by sampling N points with replacement from the original data, train a model on each, and average their predictions.
Each bootstrap sample includes about 63% of the original data points (on average). Why 63%? The probability that a specific point is not selected in any of N draws is (1 − 1/N)N ≈ e−1 ≈ 0.37. So about 37% of points are left out of each bootstrap sample. These out-of-bag (OOB) samples provide a free estimate of generalization error without needing a separate validation set.
| Method | Training | Combination | Best for |
|---|---|---|---|
| Committee | Independent, same data | Equal average | Diverse model types |
| Bagging | Independent, bootstrap data | Equal average | Unstable learners (trees) |
| Boosting | Sequential, reweighted data | Weighted vote | Weak learners |
A decision tree partitions the input space into axis-aligned rectangles. At each internal node, a single feature is tested against a threshold. The data flows left or right, until it reaches a leaf node that gives the prediction.
For classification, the tree is built greedily, top-down. At each node, we choose the split (feature j, threshold τ) that maximizes the information gain:
where H is the entropy (or Gini impurity) of the class distribution. We keep splitting until a stopping criterion is met (max depth, minimum samples, or purity).
For regression, leaf nodes predict the mean of training points in that region. The split criterion minimizes the sum of squared residuals in the resulting children:
Trees have appealing properties: interpretability, no need for feature scaling, natural handling of mixed data types. Their weakness — high variance and tendency to overfit — is addressed by ensembles.
Pruning controls tree complexity. Cost-complexity pruning adds a penalty λ per leaf node to the loss. Starting with a fully grown tree, we prune back nodes whose removal reduces the penalized loss. The optimal λ is chosen by cross-validation. This is the tree analog of regularization in linear models.
A random forest is bagging applied to decision trees with one crucial addition: at each split, only a random subset of features is considered.
The algorithm:
1. Draw L bootstrap samples from the training data.
2. For each sample, grow a full decision tree. At each node, randomly select m features (out of D total) and find the best split among only those m features.
3. Predict by averaging (regression) or majority vote (classification).
| Ensemble | Base learner | Data diversity | Feature diversity |
|---|---|---|---|
| Bagged trees | Full tree | Bootstrap | All features at each split |
| Random forest | Full tree | Bootstrap | Random m < D features at each split |
| Boosted trees | Shallow tree | Reweighted | All features (typically) |
Random forests inherit trees' advantages (no scaling, mixed types, interpretability via feature importance) while dramatically reducing variance. They are consistently among the best off-the-shelf classifiers and rarely require extensive tuning.
Feature importance is a valuable byproduct. For each feature, compute the total decrease in impurity across all nodes that split on it, averaged over all trees. Features that frequently produce large impurity decreases are important. Alternatively, permutation importance measures how much OOB accuracy drops when a feature's values are randomly shuffled — a more robust measure that accounts for correlations between features.
So far, our ensembles used fixed combination weights (equal average or fixed αm). What if the optimal combination depends on the input?
A conditional mixture model makes the mixing coefficients input-dependent:
where the gating functions πk(x) are non-negative and sum to one for each x.
This framework generalizes the standard mixture model from Chapter 9. There, πk was a constant. Here, πk(x) is a function of the input, typically parameterized by a softmax:
where ak(x) is the gating network's output for expert k.
For regression with Gaussian components, each expert k predicts a mean μk(x) and variance σk2(x). The conditional mixture then represents a multimodal predictive distribution. This is powerful for problems where the same input can have multiple plausible outputs — for example, in inverse kinematics, where multiple joint configurations can produce the same end-effector position.
The mixture of experts (MoE) model, introduced by Jacobs et al. (1991), is the conditional mixture made concrete. It has two types of learnable components:
Expert networks: K neural networks (or linear models), each specializing in a region of input space. Expert k outputs pk(t|x).
Gating network: A softmax classifier that outputs πk(x) — the probability that expert k is responsible for input x.
Bishop also introduces hierarchical mixtures of experts (HME): the gating network itself is a tree of softmax splits. At each level, a gating function routes the input left or right, and the leaves are expert networks. This gives the model a coarse-to-fine partition of input space — a soft, differentiable version of a decision tree.
The HME illustrates a deep connection: a hard decision tree makes crisp, axis-aligned splits. The hierarchical MoE makes soft, differentiable splits using sigmoid or softmax gates. This means the entire model — experts and gates — can be trained end-to-end by gradient descent, unlike a traditional decision tree.
Three experts (colored curves) each fit a region of the input. The bottom panel shows the gating weights πk(x) — which expert is active where. The thick black curve is the combined prediction.
| Method | Strategy | Key idea | Reduces |
|---|---|---|---|
| Bayesian MA | Weighted by evidence | Posterior over models | Model uncertainty |
| Committee | Equal average | Error cancellation | Variance |
| Bagging | Bootstrap + average | Data diversity | Variance |
| Random forest | Bagging + random features | Feature decorrelation | Variance (more) |
| Boosting | Sequential + reweight | Focus on hard examples | Bias + variance |
| Mixtures of experts | Input-dependent gating | Soft partitioning | Bias (local experts) |
Chapter 14 completes Bishop's PRML. From the simplest polynomial curve fit in Chapter 1 to the ensemble methods here, the thread is consistent: start with a probabilistic model, understand its limitations, and build principled machinery to overcome them.
Notice how these methods compose with everything earlier in the book. Bagging with neural networks (Ch 5). Boosted decision stumps for classification (Ch 4 + 7). Random forests using information-theoretic splits (Ch 1). Mixtures of experts trained with EM (Ch 9). The combining framework does not replace individual models — it amplifies them.