Adapter Modules (Houlsby 2019)

Chapter 0: The Transfer Problem

It's 2019. BERT just showed that pre-training a Transformer on unlabeled text, then fine-tuning on task-specific data, achieves state-of-the-art results on virtually every NLP benchmark. The recipe is simple: take the pre-trained BERT, add a classification head, and update ALL parameters on your task data.

But this approach has a fundamental scaling problem. Every new task requires a complete copy of the model. BERT-Large has 340M parameters, so each fine-tuned model is ~1.3 GB at 32-bit precision. If you have 100 tasks, that's roughly 130 GB of stored models. For a model at GPT-3 scale (175B parameters), a single copy is about 350 GB even at 16-bit precision, which makes one copy per task completely impractical.

The transfer learning dilemma: Full fine-tuning gets the best accuracy but creates a complete copy of all parameters per task. Feature extraction (freeze everything, train only the head) is parameter-efficient but sacrifices accuracy because the pre-trained representations aren't adapted to the task. Can we find a middle ground — adapt the model to each task while training only a tiny fraction of parameters?

Two extremes existed:

Strategy	What's Trained	Params per Task	Accuracy
Full fine-tuning	All parameters	340M (100%)	Best
Feature extraction	Only classifier head	~30K (0.01%)	Worse
Adapters (this paper)	Small inserted modules	~12M (3.6%)	Near-best

Houlsby et al. introduced the first major Parameter-Efficient Fine-Tuning (PEFT) method: inject small neural network modules ("adapters") inside each existing Transformer layer. Freeze all pre-trained parameters. Only train the adapters. The adapters learn task-specific transformations while the pre-trained backbone provides the general-purpose representations.
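
To make the storage argument concrete, here is a rough back-of-the-envelope sketch. It assumes 32-bit floats (4 bytes per parameter) and uses the parameter counts quoted in this chapter; the function names are illustrative and not from the paper.

```python
BYTES_PER_PARAM = 4  # assumption: weights stored as 32-bit floats


def full_finetune_storage_gb(base_params, n_tasks):
    """Every task keeps its own complete copy of the model."""
    return base_params * BYTES_PER_PARAM * n_tasks / 1e9


def adapter_storage_gb(base_params, adapter_params, n_tasks):
    """One shared frozen backbone plus one small adapter set per task."""
    return (base_params + adapter_params * n_tasks) * BYTES_PER_PARAM / 1e9


bert_large = 340e6  # BERT-Large parameter count quoted above
adapters = 12e6     # ~12M adapter parameters per task, from the table above

full = full_finetune_storage_gb(bert_large, n_tasks=100)
shared = adapter_storage_gb(bert_large, adapters, n_tasks=100)
print(f"100 full copies:         {full:.1f} GB")
print(f"backbone + 100 adapters: {shared:.1f} GB")
```

Under these assumptions, 100 full copies need about 136 GB, while one shared backbone plus 100 adapter sets needs about 6 GB. The gap widens as the number of tasks grows, because only the adapter term scales with the task count.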

The PEFT Spectrum

Drag the slider between "Feature Extraction" (train only the classifier head) and "Full Fine-Tuning" (train everything). Adapters sit in the sweet spot: training just 3.6% of the parameters while retaining about 99% of full fine-tuning accuracy.


What is the core problem with using full fine-tuning for many downstream tasks?

Each task requires a complete copy of all model parameters (340M for BERT-Large), creating enormous storage requirements that scale linearly with the number of tasks

Full fine-tuning is too slow to be practical

Fine-tuning always causes the model to forget its pre-trained knowledge

Chapter 1: Adapter Architecture

An adapter module is a small neural network inserted into each Transformer layer. Its design is brilliantly simple: a bottleneck — project down to a small dimension, apply a nonlinearity, project back up — wrapped in a residual connection.

Input

h ∈ R^d — the hidden state from the Transformer layer (d = 768 for BERT-Base)

↓

Down-projection

W_down · h → R^m where m << d (e.g., m = 64). Compresses to bottleneck dimension.

↓

Nonlinearity

ReLU(W_down · h). Introduces the capacity to learn nonlinear transformations.

↓

Up-projection

W_up · ReLU(W_down · h) → R^d. Projects back to original dimension.

↓ + residual (add input h back)

Output

h + W_up · ReLU(W_down · h). Same shape as input — can be dropped in anywhere.

Adapter(h) = h + W_up · ReLU(W_down · h + b_down) + b_up

The residual connection is crucial. If the adapter is initialized near zero (so W_up · ReLU(W_down · h) ≈ 0), the output is just h — the identity function. This means at initialization, the adapter does nothing, and the model behaves exactly like the pre-trained model. The adapter only deviates from identity as needed during training.

python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # 768 → 64
        self.relu = nn.ReLU()
        self.up   = nn.Linear(bottleneck, d_model)  # 64 → 768
        # Initialize up-projection near zero
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        # h: [batch, seq_len, 768]
        return h + self.up(self.relu(self.down(h)))
        # Output: [batch, seq_len, 768] — same shape

# Parameter count per adapter:
# down: 768 × 64 + 64 = 49,216
# up:   64 × 768 + 768 = 49,920
# Total: 99,136 per adapter ≈ 99K
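
The identity-at-initialization property can be checked numerically. The following is a minimal sketch in NumPy rather than PyTorch, so it runs without a deep-learning framework; `adapter_forward` is an illustrative helper that mirrors the formula above and is not code from the paper.

```python
import numpy as np


def adapter_forward(h, W_down, b_down, W_up, b_up):
    """h + W_up · ReLU(W_down · h + b_down) + b_up, applied to the last axis."""
    z = np.maximum(0.0, h @ W_down.T + b_down)  # down-project, then ReLU
    return h + z @ W_up.T + b_up                # up-project, add residual


rng = np.random.default_rng(0)
d, m = 768, 64
h = rng.normal(size=(2, 5, d))                  # [batch, seq_len, d]

W_down = rng.normal(scale=0.02, size=(m, d))    # any initialization works here
b_down = np.zeros(m)
W_up = np.zeros((d, m))                         # zero-initialized up-projection
b_up = np.zeros(d)

out = adapter_forward(h, W_down, b_down, W_up, b_up)
is_identity = np.array_equal(out, h)
print(out.shape, is_identity)
```

Because the up-projection is all zeros, the bottleneck branch contributes nothing and the output equals the input exactly, regardless of how the down-projection was initialized.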

Why a bottleneck? The bottleneck dimension m controls the adapter's capacity. If m = d (no bottleneck), the two projections together have 2d² ≈ 1.18M weights, twice as many as a full d × d weight matrix (d² ≈ 590K). By setting m << d (e.g., m = 64 when d = 768), we get 2 × d × m ≈ 98K weights (99K with biases), about a twelfth of the no-bottleneck case. The bottleneck forces the adapter to learn a compressed, essential transformation rather than a full-rank one.

Adapter Module Visualizer

See the adapter's bottleneck architecture. Drag the bottleneck slider to change m. Watch how the parameter count changes. The residual connection (skip arrow) ensures the adapter starts as identity.


Why does the adapter module include a residual connection (adding the input back to the output)?

So that at initialization (when W_up is near zero), the adapter acts as the identity function: the model starts exactly as the pre-trained model and only deviates as needed during training. This ensures stable starting behavior and prevents catastrophic forgetting

To make the gradient flow easier during backpropagation

To increase the adapter's parameter count

Chapter 2: Bottleneck Design

The bottleneck dimension m is the adapter's primary hyperparameter. It controls the trade-off between parameter efficiency and task-specific capacity.

Parameter count analysis

Each adapter has 2 × d × m + d + m parameters (weights + biases). With two adapters per layer (after attention and after FFN) and L layers:

Total adapter params = 2L × (2dm + d + m)

python
# Adapter parameter counts for BERT-Base (d=768, L=12)
d, L = 768, 12
for m in [8, 16, 32, 64, 128, 256]:
    params = 2 * L * (2 * d * m + d + m)
    pct = params / 110e6 * 100
    print(f"m={m:>3d}: {params:>10,d} params ({pct:.2f}%)")
# m=  8:    313,536 params (0.29%)
# m= 16:    608,640 params (0.55%)
# m= 32:  1,198,848 params (1.09%)
# m= 64:  2,379,264 params (2.16%)
# m=128:  4,740,096 params (4.31%)
# m=256:  9,461,760 params (8.60%)

Houlsby et al. found that m = 64 provides the best accuracy-efficiency trade-off for BERT-Base. At m = 64, adapters add only 2.16% extra parameters while retaining about 99% of full fine-tuning performance on the GLUE benchmark (84.0 vs. 84.7 in the table below).

Bottleneck m	Added Params	% of BERT	GLUE Avg
8	314K	0.29%	79.2
64	2.38M	2.16%	84.0
256	9.46M	8.60%	84.4
Full FT	110M	100%	84.7

Diminishing returns above m = 64. Going from m = 64 to m = 256 quadruples the parameters but only improves GLUE by 0.4 points. The low-dimensional bottleneck appears to be sufficient because the task-specific adaptation is effectively low-rank, an observation that LoRA later built on.
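
The diminishing returns can be quantified from the table above. This is a small illustrative calculation using only the figures quoted in this chapter (parameter counts rounded to millions); `marginal_gains` is a hypothetical helper name.

```python
# (bottleneck m, added params in millions, GLUE average) from the table above
rows = [(8, 0.31, 79.2), (64, 2.38, 84.0), (256, 9.46, 84.4)]


def marginal_gains(rows):
    """GLUE points gained per additional million adapter parameters."""
    gains = []
    for (m0, p0, g0), (m1, p1, g1) in zip(rows, rows[1:]):
        gains.append((m0, m1, (g1 - g0) / (p1 - p0)))
    return gains


gains = marginal_gains(rows)
for m0, m1, gain in gains:
    print(f"m={m0} -> m={m1}: {gain:.2f} GLUE points per extra 1M params")
```

With these numbers, the step from m = 8 to m = 64 buys a bit over 2 GLUE points per extra million parameters, while the step from m = 64 to m = 256 buys well under 0.1.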

Information bottleneck perspective

The adapter bottleneck can be read as a loose analogue of the information bottleneck principle: compress the input (down-projection), keep only task-relevant information (the nonlinearity selects which dimensions matter), then expand back (up-projection). In this view the bottleneck acts as a filter, discarding pre-trained features that aren't useful for the target task while amplifying useful ones.
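
The compress-then-expand structure has a simple linear-algebra consequence, sketched below with NumPy. Ignoring the ReLU and biases (a simplification made here for illustration), the adapter's update is the d × d matrix W_up · W_down, whose rank cannot exceed m. The sizes are deliberately tiny and are not BERT's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 32, 4                        # small illustrative sizes, not BERT's

W_down = rng.normal(size=(m, d))    # d -> m
W_up = rng.normal(size=(d, m))      # m -> d

update = W_up @ W_down              # d x d map applied to h (ReLU omitted)
rank = np.linalg.matrix_rank(update)
print(update.shape, rank)
```

Even with the ReLU in place, the adapter's correction W_up · ReLU(...) always lies in the column space of W_up, which has at most m dimensions. That is one way to see why m bounds how much the adapter can change the hidden state.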

Bottleneck Size vs Accuracy

Drag the slider to change the bottleneck dimension. The chart shows accuracy (vertical) and parameter count (bar). Notice the diminishing returns — most of the benefit comes from the first few dimensions.


Why does increasing the adapter bottleneck from m=64 to m=256 barely improve accuracy (84.0 → 84.4) despite 4x more parameters?

Because the task-specific adaptation is inherently low-dimensional: most downstream NLP tasks require only a small perturbation to the pre-trained representations. 64 dimensions are enough to capture the essential task-specific transformation

Because larger bottlenecks cause overfitting

Because the ReLU activation limits the adapter's capacity

Chapter 3: Where to Insert

Where in the Transformer should adapters go? Houlsby et al. tested two configurations:

Configuration 1: Two adapters per layer (recommended)

Multi-Head Attention

Standard self-attention and its output projection (frozen)

↓

Adapter 1

d → m → d bottleneck + internal residual (TRAINED)

↓ + skip connection + LayerNorm

Feed-Forward Network

Standard FFN (frozen)

↓

Adapter 2

d → m → d bottleneck + internal residual (TRAINED)

↓ + skip connection + LayerNorm

Configuration 2: One adapter per layer (more efficient)

Place only one adapter after the FFN sub-layer. This halves the adapter parameters with a small accuracy drop (~0.3 GLUE points). Later papers (AdapterFusion, MAD-X) adopted this single-adapter configuration as the default.

Configuration	Adapters per Layer	Total Adapter Params (m=64)	GLUE
Two per layer	24 total	2.38M	84.0
One per layer (after FFN)	12 total	1.19M	83.7
One per layer (after attention)	12 total	1.19M	83.4

Why after both sub-layers? Attention captures token-to-token relationships; FFN captures individual token transformations. Each sub-layer produces representations that may need task-specific adjustment. Placing an adapter after each gives the model two chances per layer to adapt its representations, one for relational features and one for individual features.
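
The two-adapter layout can be sketched as a forward pass. This is a minimal NumPy illustration, not the paper's implementation. It assumes the placement described in Houlsby et al.: each adapter is applied to the sub-layer output before the skip connection and LayerNorm. The attention and FFN sub-layers are replaced by simple stand-in functions, and all names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis (learned scale and shift omitted)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def adapter(h, W_down, W_up):
    """Bottleneck adapter with its internal residual (biases omitted)."""
    return h + np.maximum(0.0, h @ W_down.T) @ W_up.T


def layer_forward(x, W_attn, W_ffn, adapters=None):
    """One layer; `adapters` is ((W_down, W_up), (W_down, W_up)) or None."""
    a = x @ W_attn.T                   # stand-in for frozen self-attention
    if adapters is not None:
        a = adapter(a, *adapters[0])   # Adapter 1 on the attention output
    x = layer_norm(x + a)              # skip connection, then LayerNorm
    f = np.tanh(x @ W_ffn.T)           # stand-in for the frozen FFN
    if adapters is not None:
        f = adapter(f, *adapters[1])   # Adapter 2 on the FFN output
    return layer_norm(x + f)


rng = np.random.default_rng(0)
d, m = 16, 4
W_attn = rng.normal(scale=0.1, size=(d, d))
W_ffn = rng.normal(scale=0.1, size=(d, d))
x = rng.normal(size=(3, d))

# Zero-initialized up-projections: both adapters start as the identity.
zero_init = tuple((rng.normal(size=(m, d)), np.zeros((d, m))) for _ in range(2))
plain = layer_forward(x, W_attn, W_ffn)
adapted = layer_forward(x, W_attn, W_ffn, zero_init)
same_at_init = np.allclose(plain, adapted)
print(adapted.shape, same_at_init)
```

With zero-initialized up-projections, the adapted layer reproduces the plain layer, which is the behavior the residual design is meant to guarantee at the start of training.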

LayerNorm parameters

In addition to the adapter modules, Houlsby et al. also train the LayerNorm parameters (scale γ and bias β) in each layer. LayerNorms have very few parameters (2 × d = 1,536 per LayerNorm for BERT-Base, with two LayerNorms per Transformer layer), but they control the scale and shift of representations at each layer, providing a coarse but powerful adaptation signal.
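
A quick count shows how small the LayerNorm budget is. The sketch assumes the standard BERT-Base layout: two LayerNorms per Transformer layer plus one after the embeddings.

```python
d, n_layers = 768, 12             # BERT-Base

params_per_layernorm = 2 * d      # scale (gamma) and bias (beta)
layernorms_per_layer = 2          # one after attention, one after the FFN
embedding_layernorms = 1          # assumption: one LayerNorm after the embeddings

total_layernorms = n_layers * layernorms_per_layer + embedding_layernorms
total_ln_params = total_layernorms * params_per_layernorm
print(params_per_layernorm, total_layernorms, total_ln_params)
# prints: 1536 25 38400
```

The total of about 38K is in line with the figure quoted for LayerNorm-only training in the results chapter.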

python
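# Training setup for adapter-based PEFT (sketch; reuses the Adapter class above)
import torch.nn as nn

def setup_adapter_training(model, num_classes, d_model=768, bottleneck=64):
    # 1. Attach adapters to each Transformer layer. Each layer's forward()
    #    must also be changed to call them; attaching attributes alone has no effect.
    for layer in model.encoder.layers:
        layer.adapter_attn = Adapter(d_model=d_model, bottleneck=bottleneck)
        layer.adapter_ffn  = Adapter(d_model=d_model, bottleneck=bottleneck)

    # 2. Freeze all parameters (pre-trained weights and the new adapters alike)
    for p in model.parameters():
        p.requires_grad = False

    # 3. Unfreeze adapters and LayerNorms. Matching on 'LayerNorm' assumes the
    #    model names its LayerNorm modules that way; adjust for other backbones.
    for name, p in model.named_parameters():
        if 'adapter' in name or 'LayerNorm' in name:
            p.requires_grad = True

    # 4. Add task-specific classification head (new parameters are trainable by default)
    model.classifier = nn.Linear(d_model, num_classes)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,d} / {total:,d} ({trainable / total:.2%})")
    # Exact counts depend on the backbone and num_classes; for BERT-Base with
    # m=64 expect roughly 2.4M trainable out of ~112M (about 2.15%).
    return model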

Adapter Placement in Transformer

Click to toggle between one-adapter and two-adapter configurations. The orange modules are adapters (trained). Gray modules are frozen pre-trained layers. Notice how adapters are inserted after each sub-layer, wrapped with residual connections.

Why does placing two adapters per layer (after attention AND after FFN) work better than one?

Because attention and FFN produce different types of features: attention captures token relationships while FFN transforms individual tokens. Each adapter separately adapts its sub-layer's output, giving the model two chances per layer to inject task-specific transformations

Because two adapters have more parameters

Because the second adapter fixes errors from the first

Chapter 4: Training Procedure

Training adapters is almost identical to standard fine-tuning, with one key difference: most parameters are frozen.

What's trained vs frozen

Component	Status	Parameters
Token embeddings	Frozen	23.4M
Position embeddings	Frozen	393K
Self-attention (Q, K, V, O)	Frozen	2.36M per layer (28.3M total)
Feed-forward network	Frozen	4.72M per layer
Adapter modules	Trained	99K per adapter
LayerNorm (γ, β)	Trained	1.5K per LayerNorm (two per layer)
Classification head	Trained	~30K

Training details

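Houlsby et al. kept the setup close to standard BERT fine-tuning:

1. Optimizer: Adam with learning rate 3e-4, about an order of magnitude higher than the 2e-5 to 5e-5 typically used when fine-tuning all of BERT's weights (adapters generally work with a larger learning rate).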

2. Epochs: 3-10 depending on dataset size (same as fine-tuning).

3. Batch size: 32 (same as fine-tuning).

4. No special adapter-specific hyperparameters needed beyond the bottleneck dimension m.

The training speed is slightly faster than full fine-tuning because fewer parameters need gradients, but the difference isn't dramatic — the forward pass through the full model still dominates compute time.

Adapters don't dramatically speed up training. The forward pass through the frozen pre-trained layers still requires the same computation, and the backward pass must still propagate activation gradients through them. The savings come from: (1) fewer optimizer states to store (only for adapter params), (2) no weight-gradient computation for the frozen parameters, (3) smaller checkpoint files to save. The big win is at deployment: one shared base model + tiny adapter files per task.
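
The optimizer-state and checkpoint savings are easy to estimate. This rough sketch assumes Adam with two fp32 moment buffers per trainable parameter and fp32 checkpoints; the parameter counts are the BERT-Base figures quoted in this tutorial, and the helper names are illustrative.

```python
BYTES_FP32 = 4
ADAM_STATES_PER_PARAM = 2  # first and second moment estimates


def adam_state_mb(trainable_params):
    """Memory for Adam's moment buffers, assuming fp32 states."""
    return trainable_params * ADAM_STATES_PER_PARAM * BYTES_FP32 / 1e6


def checkpoint_mb(saved_params):
    """Size of a per-task checkpoint, assuming fp32 weights."""
    return saved_params * BYTES_FP32 / 1e6


full_ft = 110e6     # full fine-tuning: every BERT-Base parameter is trained
adapter_ft = 2.4e6  # adapter training with m=64

full_adam, adapter_adam = adam_state_mb(full_ft), adam_state_mb(adapter_ft)
full_ckpt, adapter_ckpt = checkpoint_mb(full_ft), checkpoint_mb(adapter_ft)
print(f"Adam state: {full_adam:.0f} MB vs {adapter_adam:.1f} MB")
print(f"Checkpoint: {full_ckpt:.0f} MB vs {adapter_ckpt:.1f} MB")
```

Under these assumptions the Adam state shrinks from about 880 MB to about 19 MB, and the per-task checkpoint from about 440 MB to under 10 MB. Activation memory and forward compute are unchanged, which is why wall-clock training time improves only modestly.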

Gradient flow through frozen layers

During backpropagation, gradients flow through the frozen layers (they need to, to compute gradients for the adapters). The frozen parameters aren't updated and need no weight gradients, but the gradients with respect to their activations are computed normally. This means an adapter at layer k receives gradient information from all subsequent layers: it can learn from the entire network's behavior, not just its local context.
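
A scalar toy example makes the chain rule explicit. The network below is an assumption made for illustration: a one-parameter "adapter" with a residual, followed by a single frozen weight. The analytic gradient is checked against a finite-difference estimate.

```python
def loss_fn(a, x=1.5, w_frozen=0.8, target=2.0):
    """Scalar toy network: adapter (trainable a) followed by a frozen layer."""
    h1 = x + a * x      # adapter with residual; a is the trainable scalar
    h2 = w_frozen * h1  # frozen layer: w_frozen is never updated
    return (h2 - target) ** 2


def grad_a(a, x=1.5, w_frozen=0.8, target=2.0):
    """Chain rule: dL/da = dL/dh2 * dh2/dh1 * dh1/da."""
    h2 = w_frozen * (x + a * x)
    dL_dh2 = 2.0 * (h2 - target)
    dh2_dh1 = w_frozen  # the gradient passes through the frozen weight
    dh1_da = x
    return dL_dh2 * dh2_dh1 * dh1_da


a, eps = 0.1, 1e-6
analytic = grad_a(a)
numeric = (loss_fn(a + eps) - loss_fn(a - eps)) / (2 * eps)
print(analytic, numeric)
```

The factor `dh2_dh1 = w_frozen` is the point: the frozen weight appears in the adapter's gradient even though no gradient is ever computed for the frozen weight itself.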

Training Dynamics

Watch how adapter parameters evolve during training while pre-trained weights stay frozen. The orange bars (adapters) change; the gray bars (frozen) stay constant. Click "Train" to animate one epoch.


During adapter training, why do gradients still flow through the frozen pre-trained layers?

Because computing the gradient for an adapter in a given layer requires backpropagating through all subsequent layers: the frozen layers aren't updated, but gradients with respect to their activations must be computed so the adapter can learn from the full network's behavior

Because the frozen layers are actually being updated slowly

To speed up training convergence

Chapter 5: Results & Analysis

Houlsby et al. evaluated adapters on the GLUE benchmark (8 NLU tasks) and SQuAD (question answering), comparing against full fine-tuning and feature extraction.

GLUE results

Method	Trained Params	MNLI	QQP	SST-2	GLUE Avg
Feature extraction	~30K (0.03%)	77.1	84.7	89.5	79.2
Adapters (m=64)	2.4M (2.2%)	84.9	88.5	93.5	84.0
Full fine-tuning	110M (100%)	86.2	89.1	94.0	84.7

Adapters come within 0.7 GLUE points of full fine-tuning while training only 2.2% of the parameters. That's a 46x reduction in trainable parameters for a relative drop of about 0.8% in the GLUE average.
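
The headline ratios follow directly from the table. A short check, using only the figures quoted above:

```python
full_params, adapter_params = 110e6, 2.4e6  # trainable parameters
full_glue, adapter_glue = 84.7, 84.0        # GLUE averages from the table

reduction = full_params / adapter_params
absolute_gap = full_glue - adapter_glue
relative_gap_pct = absolute_gap / full_glue * 100
retained_pct = adapter_glue / full_glue * 100

print(f"{reduction:.0f}x fewer trainable parameters")
print(f"{absolute_gap:.1f} points lower ({relative_gap_pct:.1f}% relative)")
print(f"{retained_pct:.1f}% of full fine-tuning performance retained")
```

The 0.7 figure is an absolute gap in GLUE points, while the roughly 0.8% figure is that gap relative to the full fine-tuning score.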

The Pareto frontier. The key metric isn't just accuracy — it's accuracy per parameter trained. Adapters sit on the Pareto frontier: no other method at the time achieved higher accuracy with fewer parameters. This efficiency-accuracy trade-off defined the PEFT field that followed.

Which layers matter most?

Houlsby et al. found that adapters in higher layers (closer to the output) contribute more to task-specific performance than adapters in lower layers. This makes sense: lower layers learn general linguistic features (syntax, grammar) that transfer across tasks, while upper layers learn more task-specific features that need adaptation.

However, removing ALL lower-layer adapters hurts performance. Even general features benefit from slight task-specific adjustments.

Comparison with other efficient methods

At the time of publication, the main alternative to full fine-tuning was training only specific layers:

Method	Strategy	Params	GLUE
Top-layer only	Train last 2 layers + head	23M (21%)	82.3
All LayerNorms + head	Train only LN params	38K (0.03%)	80.1
Adapters (m=64)	Train adapter modules	2.4M (2.2%)	84.0
Full fine-tuning	Train everything	110M (100%)	84.7

Adapters outperform top-layer training despite using roughly a tenth of the trainable parameters, and beat LayerNorm-only tuning by nearly 4 GLUE points. The distributed nature of adapters (inserted in every layer) is key: they can make fine-grained adjustments throughout the network rather than large adjustments in a few layers.
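
The Pareto claim can be checked mechanically against the numbers in this chapter's tables. In the sketch below, a method is dominated if some other method uses no more parameters and scores at least as well, with a strict improvement on one of the two; the helper names are illustrative.

```python
# (method, trainable params, GLUE average) as quoted in this chapter's tables
methods = [
    ("Feature extraction", 30e3, 79.2),
    ("All LayerNorms + head", 38e3, 80.1),
    ("Adapters (m=64)", 2.4e6, 84.0),
    ("Top-layer only", 23e6, 82.3),
    ("Full fine-tuning", 110e6, 84.7),
]


def dominates(a, b):
    """a dominates b if it is no worse on both axes and better on at least one."""
    _, params_a, score_a = a
    _, params_b, score_b = b
    no_worse = params_a <= params_b and score_a >= score_b
    strictly_better = params_a < params_b or score_a > score_b
    return no_worse and strictly_better


def pareto_frontier(points):
    """Keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


frontier = [name for name, _, _ in pareto_frontier(methods)]
print(frontier)
```

On these figures, top-layer training is the only dominated method: adapters reach a higher score with fewer trainable parameters. The remaining methods each trade parameters for accuracy along the frontier.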

Accuracy vs Parameters Trade-off

This scatter plot shows different methods on the accuracy-vs-parameters plane. Adapters (orange dot) sit on the Pareto frontier — no other method achieves higher accuracy with fewer parameters. Click methods to highlight them.

Adapters achieve 84.0 GLUE with 2.2% of parameters, while full fine-tuning achieves 84.7 with 100%. What does this tell us about the nature of task adaptation?

Task-specific adaptation is inherently low-dimensional: only a small fraction of the pre-trained model's capacity needs to change for downstream tasks. Most of the pre-trained knowledge transfers directly, and the adapter's bottleneck captures the small task-specific residual

BERT is over-parameterized and should be made smaller

The GLUE benchmark is too easy to show meaningful differences

Chapter 6: Adapter Explorer

Experiment with the adapter architecture yourself. This explorer lets you configure the bottleneck size, the number of adapters, and see the resulting parameter counts and accuracy estimates.

Adapter Configuration Explorer

Configure your adapter setup: model size, bottleneck dimension, and placement. The visualization shows the Transformer with adapters inserted, parameter counts, and estimated accuracy on GLUE. Watch how the bottleneck controls the information flow through the adapter.


Multi-Task Adapter Sharing

See how adapters enable multi-task deployment. One frozen BERT backbone serves multiple tasks; each task has its own set of adapter parameters. Click a task to "load" its adapters into the shared backbone.

The key insight from this explorer: Adapters achieve parameter efficiency not by compressing the model, but by separating general knowledge (frozen backbone) from task-specific knowledge (adapter modules). This separation underpins much of the later PEFT work, including LoRA, prefix tuning, and prompt tuning.
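
Task switching with a shared backbone can be sketched in a few lines. This toy NumPy example is an illustration under simplifying assumptions: the "backbone" is a single frozen matrix, each task owns one bottleneck adapter, and `AdapterBank` and the task names are hypothetical.

```python
import numpy as np


class AdapterBank:
    """Hypothetical container: one frozen backbone, one adapter per task."""

    def __init__(self, backbone):
        self.backbone = backbone  # shared across tasks, never modified
        self.adapters = {}        # task name -> (W_down, W_up)

    def add_task(self, task, W_down, W_up):
        self.adapters[task] = (W_down, W_up)

    def forward(self, task, x):
        h = x @ self.backbone.T   # frozen computation, identical for all tasks
        W_down, W_up = self.adapters[task]
        return h + np.maximum(0.0, h @ W_down.T) @ W_up.T


rng = np.random.default_rng(0)
d, m = 16, 4
bank = AdapterBank(rng.normal(size=(d, d)))
backbone_before = bank.backbone.copy()

for task in ("sst2", "mnli"):
    bank.add_task(task, rng.normal(size=(m, d)), rng.normal(size=(d, m)))

x = rng.normal(size=(2, d))
y_sst2 = bank.forward("sst2", x)
y_mnli = bank.forward("mnli", x)
backbone_unchanged = np.array_equal(bank.backbone, backbone_before)
print(y_sst2.shape, backbone_unchanged)
```

Switching tasks only changes which small pair of matrices is looked up; the backbone array is never copied or modified.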

How do adapters enable efficient multi-task deployment?

One frozen backbone model is loaded once; each task has its own small adapter file (~10 MB). To switch tasks, only the adapter parameters change while the backbone stays in memory. This means N tasks require 1 backbone + N tiny adapter files instead of N full model copies

All tasks share the same adapter parameters

Each task runs on a separate GPU with its own model

Chapter 7: Connections

Adapter modules launched the field of Parameter-Efficient Fine-Tuning (PEFT). Understanding their connections shows how the field evolved.

What came before

Work	Contribution	Relationship
BERT (2018)	Pre-train then fine-tune paradigm	Adapters solve BERT's multi-task deployment problem
Residual Adapters (Rebuffi 2017)	Adapter modules for vision CNNs	Houlsby adapted this concept from computer vision to NLP Transformers

What came after

Work	How It Extended Adapters
AdapterFusion (Pfeiffer 2021)	Combine multiple task adapters for multi-task learning
MAD-X (Pfeiffer 2020)	Language adapters + task adapters for cross-lingual transfer
Prefix Tuning (Li 2021)	Instead of adapter layers, prepend learnable tokens — no architecture change
LoRA (Hu 2022)	Low-rank weight updates instead of bottleneck modules — zero inference overhead
AdapterHub	Open-source library for adapter-based fine-tuning

The PEFT family tree

Additive methods:

Adapters: add bottleneck modules

Prefix tuning: add learnable tokens

Prompt tuning: add soft prompts

Reparameterization methods:

LoRA: low-rank weight updates

DoRA: decomposed weight updates

Selective methods:

BitFit: train only bias terms

All PEFT methods share the core insight from this paper: task-specific adaptation is low-dimensional. The methods differ in WHERE they inject the adaptation (new modules vs weight perturbations vs input perturbations) and HOW they parameterize it (bottleneck vs low-rank vs continuous prompts).
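
One practical consequence of the "how" is whether the adaptation can be folded into the frozen weights after training. The NumPy sketch below contrasts a LoRA-style linear update with a bottleneck adapter. It is a toy illustration with random matrices standing in for trained ones, and the matrices A and B are reused for both cases purely for convenience.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4
W = rng.normal(size=(d, d))  # frozen pre-trained weight
A = rng.normal(size=(r, d))  # down-projection (d -> r)
B = rng.normal(size=(d, r))  # up-projection (r -> d), nonzero as if trained
X = rng.normal(size=(8, d))  # a small batch of inputs, one per row

# LoRA-style update: purely linear, so it folds into a single weight matrix.
lora_unmerged = X @ W.T + (X @ A.T) @ B.T
lora_merged = X @ (W + B @ A).T
lora_mergeable = np.allclose(lora_unmerged, lora_merged)

# Bottleneck adapter: the ReLU between the projections blocks the same folding.
H = X @ W.T
adapter_out = H + np.maximum(0.0, H @ A.T) @ B.T
folded_without_relu = H + (H @ A.T) @ B.T  # what folding would compute
adapter_mergeable = np.allclose(adapter_out, folded_without_relu)

print(lora_mergeable, adapter_mergeable)
```

The linear update merges exactly, which is the source of LoRA's zero inference overhead mentioned in the table above. The adapter's nonlinearity means its extra layers have to stay in the forward pass.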

The adapter legacy: Houlsby et al.'s adapter paper didn't just introduce a technique; it established a research paradigm. The idea that you can freeze a large pre-trained model and adapt it with small, task-specific modules became a foundation for the PEFT field. Later methods such as LoRA, QLoRA, and prefix tuning build on this idea.

"The art of being wise is the art of knowing what to overlook." — William James (adapters know which parameters to overlook)

What foundational insight did the adapter paper establish for the entire PEFT field?

That task-specific adaptation is inherently low-dimensional: you can separate general knowledge (frozen backbone) from task knowledge (small trainable modules), training only a tiny fraction of parameters while achieving near-full-fine-tuning accuracy

That pre-trained models should never be fully fine-tuned

That bottleneck architectures are the only way to do PEFT
