Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, et al. (Google Research) — ICML 2019

Adapter Modules for PEFT

Parameter-Efficient Transfer Learning for NLP: insert small bottleneck "adapter" modules between Transformer layers. Freeze everything else. Train only ~3.6% of parameters per task while coming within 0.4% of full fine-tuning on GLUE.

Prerequisites: Transformer encoder + Fine-tuning basics + Residual connections. That's it.

Chapter 0: The Transfer Problem

It's 2019. BERT just showed that pre-training a Transformer on unlabeled text, then fine-tuning on task-specific data, achieves state-of-the-art results on virtually every NLP benchmark. The recipe is simple: take the pre-trained BERT, add a classification head, and update ALL parameters on your task data.

But this approach has a fundamental scaling problem. Every new task requires a complete copy of the model. BERT-Large has 340M parameters, so each fine-tuned copy is ~1.36 GB at 32-bit precision. With 100 tasks, that's ~136 GB of stored models. For a later model like GPT-3 (175B parameters), a single copy is ~350 GB even at 16-bit precision, which makes per-task copies completely impractical.
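
The storage arithmetic can be sketched in a few lines. This is a rough estimate that assumes 32-bit (4-byte) weights and decimal gigabytes; the 3.6% adapter fraction is the figure quoted for this paper, and all variable names are illustrative.

```python
BYTES_PER_PARAM = 4  # assumption: fp32 checkpoints

def checkpoint_gb(num_params, bytes_per_param=BYTES_PER_PARAM):
    """Checkpoint size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

bert_large = 340_000_000
num_tasks = 100
adapter_fraction = 0.036  # ~3.6% of the backbone is added per task

# Strategy A: one full fine-tuned copy per task
full_copies_gb = num_tasks * checkpoint_gb(bert_large)

# Strategy B: one shared frozen backbone plus a small adapter file per task
shared_gb = checkpoint_gb(bert_large) + num_tasks * checkpoint_gb(bert_large * adapter_fraction)

print(f"One full copy:      {checkpoint_gb(bert_large):.2f} GB")
print(f"{num_tasks} full copies:    {full_copies_gb:.0f} GB")
print(f"Base + {num_tasks} adapters: {shared_gb:.2f} GB")
```

Under these assumptions, the shared-backbone layout needs only a few gigabytes for all 100 tasks, because each additional task costs about 49 MB instead of 1.36 GB.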

The transfer learning dilemma: Full fine-tuning gets the best accuracy but creates a complete copy of all parameters per task. Feature extraction (freeze everything, train only the head) is parameter-efficient but sacrifices accuracy because the pre-trained representations aren't adapted to the task. Can we find a middle ground — adapt the model to each task while training only a tiny fraction of parameters?

Two extremes existed:

| Strategy | What's Trained | Params per Task (BERT-Large) | Accuracy |
|---|---|---|---|
| Full fine-tuning | All parameters | 340M (100%) | Best |
| Feature extraction | Only classifier head | ~30K (0.01%) | Worse |
| Adapters (this paper) | Small inserted modules | ~12M (3.6%) | Near-best |

Houlsby et al. introduced the first major Parameter-Efficient Fine-Tuning (PEFT) method: inject small neural network modules — "adapters" — between the existing Transformer layers. Freeze all pre-trained parameters. Only train the adapters. The adapters learn task-specific transformations while the pre-trained backbone provides the general-purpose representations.

The PEFT Spectrum

Drag the slider between "Feature Extraction" (train only the head) and "Full Fine-Tuning" (train everything). Adapters sit in the sweet spot: training just 3.6% of parameters per task while staying within 0.4% of full fine-tuning performance on GLUE.

What is the core problem with using full fine-tuning for many downstream tasks?

Chapter 1: Adapter Architecture

An adapter module is a small neural network inserted into each Transformer layer. Its design is brilliantly simple: a bottleneck — project down to a small dimension, apply a nonlinearity, project back up — wrapped in a residual connection.

Input
$h \in \mathbb{R}^d$: the hidden state from the Transformer layer (d = 768 for BERT-Base)
Down-projection
$W_{\text{down}} h + b_{\text{down}} \in \mathbb{R}^m$, where $m \ll d$ (e.g., m = 64). Compresses to the bottleneck dimension.
Nonlinearity
$\mathrm{ReLU}(W_{\text{down}} h + b_{\text{down}})$. Introduces the capacity to learn nonlinear transformations.
Up-projection
$W_{\text{up}}\,\mathrm{ReLU}(W_{\text{down}} h + b_{\text{down}}) + b_{\text{up}} \in \mathbb{R}^d$. Projects back to the original dimension.
↓ + residual (add input h back)
Output
Same shape as the input, so the adapter can be dropped in anywhere:
$$\mathrm{Adapter}(h) = h + W_{\text{up}}\,\mathrm{ReLU}(W_{\text{down}} h + b_{\text{down}}) + b_{\text{up}}$$

The residual connection is crucial. If the adapter is initialized near zero (so $W_{\text{up}}\,\mathrm{ReLU}(W_{\text{down}} h) \approx 0$), the output is just h, the identity function. This means that at initialization the adapter does nothing, and the model behaves exactly like the pre-trained model. The adapter only deviates from identity as needed during training.

```python
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, d_model=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # 768 → 64
        self.relu = nn.ReLU()
        self.up   = nn.Linear(bottleneck, d_model)  # 64 → 768
        # Zero-initialize the up-projection so the adapter starts as the identity
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, h):
        # h: [batch, seq_len, d_model]; the output has the same shape
        return h + self.up(self.relu(self.down(h)))

# Parameter count per adapter:
# down: 768 × 64 + 64 = 49,216
# up:   64 × 768 + 768 = 49,920
# Total: 99,136 per adapter ≈ 99K
```
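
The identity-at-initialization property can be checked without any deep-learning framework. The sketch below re-implements the adapter formula in plain Python on a toy example (d = 4, m = 2); the helper names `matvec` and `adapter_forward` and all numeric values are made up for illustration.

```python
def matvec(W, x, b):
    """Return W @ x + b for a matrix W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def adapter_forward(h, W_down, b_down, W_up, b_up):
    """h + W_up · ReLU(W_down · h + b_down) + b_up"""
    z = [max(0.0, v) for v in matvec(W_down, h, b_down)]
    delta = matvec(W_up, z, b_up)
    return [h_i + d_i for h_i, d_i in zip(h, delta)]

d, m = 4, 2  # toy sizes
h = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.2, -0.1, 0.3], [-0.2, 0.1, 0.4, 0.0]]
b_down = [0.0] * m
W_up = [[0.0] * m for _ in range(d)]  # zero-initialized up-projection
b_up = [0.0] * d

at_init = adapter_forward(h, W_down, b_down, W_up, b_up)
print(at_init == h)  # the adapter is exactly the identity at initialization

# Simulate a training update that moves a single up-projection weight
W_up[0][0] = 0.5
after_update = adapter_forward(h, W_down, b_down, W_up, b_up)
print(after_update)  # only the first coordinate moves away from h
```

Because the up-projection starts at zero, the random down-projection has no effect on the output until training moves the up-projection weights.
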
Why a bottleneck? The bottleneck dimension m controls the adapter's capacity. If m = d (no bottleneck), the two projections hold 2d² ≈ 1.18M weights, twice the size of a full d × d weight matrix. By setting m << d (e.g., m = 64 when d = 768), we get 2 × d × m ≈ 98K weights (99K with biases) instead. The bottleneck forces the adapter to learn a compressed, essential transformation rather than a full-rank one.
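
One way to make the "compressed rather than full-rank" claim precise is sketched below, using the adapter formula from this chapter. The symbols $z$ and $\Delta$ are introduced only for this note.

```latex
\begin{aligned}
z &= \mathrm{ReLU}(W_{\text{down}} h + b_{\text{down}}) \in \mathbb{R}^{m}, \\
\Delta(h) &= \mathrm{Adapter}(h) - h = W_{\text{up}}\, z + b_{\text{up}}, \\
\Delta(h) - b_{\text{up}} &\in \operatorname{col}(W_{\text{up}}), \qquad
\dim \operatorname{col}(W_{\text{up}}) = \operatorname{rank}(W_{\text{up}}) \le \min(d, m) = m .
\end{aligned}
```

So whatever the input, the correction the adapter adds to h lies in an affine subspace of dimension at most m. The ReLU makes the map nonlinear, but it cannot lift the correction out of that subspace.
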
Adapter Module Visualizer

See the adapter's bottleneck architecture. Drag the bottleneck slider to change m. Watch how the parameter count changes. The residual connection (skip arrow) ensures the adapter starts as identity.

Why does the adapter module include a residual connection (adding the input back to the output)?

Chapter 2: Bottleneck Design

The bottleneck dimension m is the adapter's primary hyperparameter. It controls the trade-off between parameter efficiency and task-specific capacity.

Parameter count analysis

Each adapter has 2 × d × m + d + m parameters (weights + biases). With two adapters per layer (after attention and after FFN) and L layers:

Total adapter params = 2L × (2dm + d + m)
```python
# Adapter parameter counts for BERT-Base (d=768, L=12)
d, L = 768, 12
for m in [8, 16, 32, 64, 128, 256]:
    params = 2 * L * (2 * d * m + d + m)  # two adapters per layer, weights + biases
    pct = params / 110e6 * 100            # relative to ~110M BERT-Base parameters
    print(f"m={m:>3d}: {params:>10,d} params ({pct:.2f}%)")
# m=  8:    313,536 params (0.29%)
# m= 16:    608,640 params (0.55%)
# m= 32:  1,198,848 params (1.09%)
# m= 64:  2,379,264 params (2.16%)
# m=128:  4,740,096 params (4.31%)
# m=256:  9,461,760 params (8.60%)
```

Houlsby et al. found that m = 64 provides the best accuracy-efficiency trade-off for BERT-Base. At m = 64, adapters add only 2.16% extra parameters while retaining about 99% of full fine-tuning performance on the GLUE benchmark (84.0 vs. 84.7).

| Bottleneck m | Added Params | % of BERT-Base | GLUE Avg |
|---|---|---|---|
| 8 | 314K | 0.29% | 79.2 |
| 64 | 2.38M | 2.16% | 84.0 |
| 256 | 9.46M | 8.60% | 84.4 |
| Full FT | 110M | 100% | 84.7 |
Diminishing returns above m = 64. Going from m = 64 to m = 256 quadruples the parameters but only improves GLUE by 0.4 points. A low-dimensional bottleneck is sufficient, which suggests that task-specific adaptation is largely low-rank. The same intuition later motivated LoRA.
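
To put a number on the diminishing returns, the sketch below computes GLUE points gained per extra million parameters between consecutive bottleneck sizes. Parameter counts come from the formula in this chapter; the GLUE averages are the ones quoted in the table, and the variable names are illustrative.

```python
d, L = 768, 12  # BERT-Base, two adapters per layer

def adapter_params(m):
    """Total adapter parameters: 2L × (2dm + d + m)."""
    return 2 * L * (2 * d * m + d + m)

glue = {8: 79.2, 64: 84.0, 256: 84.4}  # GLUE averages quoted in the table

steps = []
sizes = sorted(glue)
for m0, m1 in zip(sizes, sizes[1:]):
    extra_millions = (adapter_params(m1) - adapter_params(m0)) / 1e6
    gain = glue[m1] - glue[m0]
    steps.append((m0, m1, gain, extra_millions, gain / extra_millions))
    print(f"m={m0:>3d} -> m={m1:>3d}: +{gain:.1f} GLUE for +{extra_millions:.2f}M params "
          f"({gain / extra_millions:.2f} points per extra 1M)")
```

The first step buys more than two GLUE points per extra million parameters; the second buys well under a tenth of a point.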

Information bottleneck perspective

The adapter bottleneck implements a form of the information bottleneck principle: compress the input (down-projection), keep only task-relevant information (the nonlinearity selects which dimensions matter), then expand back (up-projection). The bottleneck acts as a filter, discarding pre-trained features that aren't useful for the target task while amplifying useful ones.

Bottleneck Size vs Accuracy

Drag the slider to change the bottleneck dimension. The chart shows accuracy (vertical) and parameter count (bar). Notice the diminishing returns — most of the benefit comes from the first few dimensions.

Why does increasing the adapter bottleneck from m=64 to m=256 barely improve accuracy (84.0 → 84.4) despite 4x more parameters?

Chapter 3: Where to Insert

Where in the Transformer should adapters go? Houlsby et al. tested two configurations:

Configuration 1: Two adapters per layer (recommended)

Multi-Head Attention
Standard self-attention (frozen)
↓
Adapter 1
d → m → d bottleneck + residual (TRAINED)
↓ + residual + LayerNorm
Feed-Forward Network
Standard FFN (frozen)
↓
Adapter 2
d → m → d bottleneck + residual (TRAINED)
↓ + residual + LayerNorm
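
The two-adapter layout can be sketched as plain function composition. This is a toy illustration, not the authors' code: it assumes the placement described in the original paper (each adapter is applied to the sub-layer output before the skip connection is added, and the result then goes through LayerNorm), uses a LayerNorm without learned scale and shift, and replaces attention and the FFN with trivial stand-in functions.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def sublayer_block(x, sublayer, adapter=None):
    """LayerNorm(x + adapter(sublayer(x))); without an adapter this is the usual block."""
    y = sublayer(x)
    if adapter is not None:
        y = adapter(y)  # the adapter sees the sub-layer output, before the skip connection
    return layer_norm([a + b for a, b in zip(x, y)])

def transformer_layer(x, attention, ffn, adapter_attn=None, adapter_ffn=None):
    x = sublayer_block(x, attention, adapter_attn)
    return sublayer_block(x, ffn, adapter_ffn)

# Toy stand-ins for the frozen sub-layers
attention = lambda x: [0.5 * v for v in x]
ffn = lambda x: [max(0.0, v) for v in x]
identity_adapter = lambda y: list(y)  # what a zero-initialized adapter computes

x = [1.0, -2.0, 0.5, 3.0]
plain = transformer_layer(x, attention, ffn)
adapted = transformer_layer(x, attention, ffn, identity_adapter, identity_adapter)
print(plain == adapted)  # identity adapters leave the layer's output unchanged
```

With identity adapters the layer reproduces the original computation, which is why inserting freshly initialized adapters does not disturb the pre-trained model.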

Configuration 2: One adapter per layer (more efficient)

Place only one adapter after the FFN sub-layer. This halves the adapter parameters with a small accuracy drop (~0.3 GLUE points). Later papers (AdapterFusion, MAD-X) adopted this single-adapter configuration as the default.

| Configuration | Adapters (12 layers) | Total Adapter Params (m=64) | GLUE |
|---|---|---|---|
| Two per layer | 24 total | 2.38M | 84.0 |
| One per layer (after FFN) | 12 total | 1.19M | 83.7 |
| One per layer (after attention) | 12 total | 1.19M | 83.4 |
Why after both sub-layers? Attention captures token-to-token relationships; FFN captures individual token transformations. Each sub-layer produces representations that may need task-specific adjustment. Placing an adapter after each gives the model two chances per layer to adapt its representations, one for relational features and one for individual features.

LayerNorm parameters

In addition to the adapter modules, Houlsby et al. also train the LayerNorm parameters (scale γ and bias β) in each layer. LayerNorms have very few parameters (2 × d = 1,536 per LayerNorm for BERT-Base, and each layer has two), but they control the scale and shift of representations at each layer, providing a coarse but powerful adaptation signal.
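
A quick count shows how small the LayerNorm contribution is. The sketch assumes the standard BERT-Base layout: two LayerNorms per Transformer layer plus one after the embeddings.

```python
d, L, m = 768, 12, 64

per_layernorm = 2 * d              # scale γ and bias β
per_layer = 2 * per_layernorm      # assumption: one LayerNorm after attention, one after the FFN
embedding_ln = per_layernorm       # assumption: one extra LayerNorm after the embeddings
layernorm_total = L * per_layer + embedding_ln

adapter_total = 2 * L * (2 * d * m + d + m)  # two adapters per layer, m = 64

print(f"Per LayerNorm:  {per_layernorm:,d}")
print(f"Per layer:      {per_layer:,d}")
print(f"All LayerNorms: {layernorm_total:,d}")
print(f"LayerNorms as a share of adapter params: {layernorm_total / adapter_total:.1%}")
```

Under these assumptions, training every LayerNorm adds under 40K parameters, a small fraction of what the adapters themselves add.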

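```python
import torch.nn as nn

# Training setup for adapter-based PEFT. This is a sketch: `model.encoder.layers` and the
# 'LayerNorm' parameter-name convention are assumptions about the model class.
def setup_adapter_training(model, num_classes, d_model=768, bottleneck=64):
    # 1. Attach adapters to each Transformer layer.
    #    The layer's forward() must also be changed to call them; attaching alone is not enough.
    for layer in model.encoder.layers:
        layer.adapter_attn = Adapter(d_model=d_model, bottleneck=bottleneck)
        layer.adapter_ffn  = Adapter(d_model=d_model, bottleneck=bottleneck)

    # 2. Freeze all parameters (pre-trained weights and, for now, the adapters too)
    for p in model.parameters():
        p.requires_grad = False

    # 3. Unfreeze adapters and LayerNorms
    for name, p in model.named_parameters():
        if 'adapter' in name or 'LayerNorm' in name:
            p.requires_grad = True

    # 4. Add task-specific classification head (new modules are trainable by default)
    model.classifier = nn.Linear(d_model, num_classes)

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"Trainable: {trainable:,d} / {total:,d} ({trainable / total:.2%})")
    # Exact counts depend on the backbone and num_classes; expect roughly 2% for BERT-Base with m=64.
    return model
```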
Adapter Placement in Transformer

Click to toggle between one-adapter and two-adapter configurations. The orange modules are adapters (trained). Gray modules are frozen pre-trained layers. Notice how adapters are inserted after each sub-layer, wrapped with residual connections.

Why does placing two adapters per layer (after attention AND after FFN) work better than one?

Chapter 4: Training Procedure

Training adapters is almost identical to standard fine-tuning, with one key difference: most parameters are frozen.

What's trained vs frozen

| Component | Status | Parameters |
|---|---|---|
| Token embeddings | Frozen | 23.4M |
| Position embeddings | Frozen | 393K |
| Self-attention (Q, K, V, O) | Frozen | 2.36M per layer (28.3M total) |
| Feed-forward network | Frozen | 4.72M per layer |
| Adapter modules | Trained | 99K per adapter |
| LayerNorm (γ, β) | Trained | 1.5K per LayerNorm |
| Classification head | Trained | ~30K |
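
The frozen-component sizes in the table can be reproduced from the architecture's dimensions. The sketch below assumes the `bert-base-uncased` configuration (vocabulary of 30,522 tokens, 512 positions, feed-forward width 3,072); those values are not stated in the text above.

```python
vocab_size, max_positions = 30_522, 512  # assumption: bert-base-uncased
d, d_ff, L = 768, 3072, 12

token_emb = vocab_size * d
pos_emb = max_positions * d
attn_per_layer = 4 * (d * d + d)                     # Q, K, V, O: weights + biases
ffn_per_layer = (d * d_ff + d_ff) + (d_ff * d + d)   # two linear layers: weights + biases

for name, count in [
    ("Token embeddings", token_emb),
    ("Position embeddings", pos_emb),
    ("Self-attention per layer", attn_per_layer),
    ("Self-attention, all layers", L * attn_per_layer),
    ("Feed-forward per layer", ffn_per_layer),
    ("Feed-forward, all layers", L * ffn_per_layer),
]:
    print(f"{name:<28s} {count:>12,d}")
```

The attention block comes to about 2.36M parameters per layer, so 28.3M is the total across all 12 layers rather than a per-layer figure.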

Training details

Houlsby et al. kept the setup close to standard BERT fine-tuning:

1. Optimizer: Adam with learning rate 3e-4. Note that this is roughly ten times higher than the 2e-5 to 5e-5 typically used when fine-tuning all of BERT.

2. Epochs: 3-10 depending on dataset size (same as fine-tuning).

3. Batch size: 32 (same as fine-tuning).

4. No special adapter-specific hyperparameters needed beyond the bottleneck dimension m.

The training speed is slightly faster than full fine-tuning because fewer parameters need gradients, but the difference isn't dramatic — the forward pass through the full model still dominates compute time.

Adapters don't dramatically speed up training. The forward pass through the frozen pre-trained layers still requires the same computation, and the backward pass still has to propagate activation gradients through them. The savings come from: (1) fewer optimizer states to store (only for adapter params), (2) no weight-gradient computation or storage for the frozen parameters, (3) smaller checkpoint files to save. The big win is at deployment: one shared base model + tiny adapter files per task.
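
As a rough illustration of the optimizer-state saving: Adam keeps two moment estimates per trainable parameter. The sketch assumes fp32 state and uses the approximate BERT-Base counts from this page (110M for full fine-tuning, 2.4M for adapters with m = 64).

```python
BYTES_PER_VALUE = 4  # assumption: fp32 optimizer state

def adam_state_mb(trainable_params):
    """Adam stores two moment estimates (first and second) per trainable parameter."""
    return 2 * trainable_params * BYTES_PER_VALUE / 1e6

full_ft_mb = adam_state_mb(110_000_000)
adapter_mb = adam_state_mb(2_400_000)

print(f"Full fine-tuning: {full_ft_mb:,.0f} MB of Adam state")
print(f"Adapters (m=64):  {adapter_mb:,.1f} MB of Adam state")
print(f"Ratio:            {full_ft_mb / adapter_mb:.1f}x")
```

Activation memory is not included here; it is largely unchanged, because the backward pass still runs through the whole network.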

Gradient flow through frozen layers

During backpropagation, gradients flow through the frozen layers, because that is the only path by which the loss signal reaches the adapters. The frozen weights aren't updated and no weight gradients are needed for them, but gradients with respect to their activations are computed as usual. As a result, the adapter in layer ℓ receives gradient information from all subsequent layers: it learns from the entire network's behavior, not just its local context.
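
A scalar toy model makes this concrete. In the sketch below (all names and values are illustrative), a one-dimensional "adapter" sits between two frozen weights `w1` and `w2`. The gradient for the adapter's up-weight `u` is derived by hand with the chain rule and checked against a finite-difference estimate.

```python
def forward(x, w1, d, u, w2, target):
    h1 = w1 * x               # frozen layer below the adapter
    z = max(0.0, d * h1)      # adapter: down-projection + ReLU
    h2 = h1 + u * z           # adapter: up-projection + residual
    y = w2 * h2               # frozen layer above the adapter
    return (y - target) ** 2, (h1, z, h2, y)

def grad_u(x, w1, d, u, w2, target):
    """dLoss/du via the chain rule; the signal passes back through frozen w2."""
    _, (h1, z, h2, y) = forward(x, w1, d, u, w2, target)
    dloss_dy = 2.0 * (y - target)
    dy_dh2 = w2               # gradient w.r.t. the frozen layer's input, not its weight
    dh2_du = z
    return dloss_dy * dy_dh2 * dh2_du

x, w1, d, u, target = 1.0, 2.0, 0.5, 0.1, 1.0
eps = 1e-6
grads = {}
for w2 in (1.0, 3.0):
    analytic = grad_u(x, w1, d, u, w2, target)
    numeric = (forward(x, w1, d, u + eps, w2, target)[0]
               - forward(x, w1, d, u - eps, w2, target)[0]) / (2 * eps)
    grads[w2] = (analytic, numeric)
    print(f"w2={w2}: analytic={analytic:.4f}, numeric={numeric:.4f}")
```

Changing the frozen downstream weight `w2` changes the adapter's gradient even though `w2` itself never receives an update.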

Training Dynamics

Watch how adapter parameters evolve during training while pre-trained weights stay frozen. The orange bars (adapters) change; the gray bars (frozen) stay constant. Click "Train" to animate one epoch.

During adapter training, why do gradients still flow through the frozen pre-trained layers?

Chapter 5: Results & Analysis

Houlsby et al. evaluated adapters on the GLUE benchmark (8 NLU tasks) and SQuAD (question answering), comparing against full fine-tuning and feature extraction.

GLUE results

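| Method | Trained Params | MNLI | QQP | SST-2 | GLUE Avg |
|---|---|---|---|---|---|
| Feature extraction | ~30K (0.03%) | 77.1 | 84.7 | 89.5 | 79.2 |
| Adapters (m=64) | 2.4M (2.2%) | 84.9 | 88.5 | 93.5 | 84.0 |
| Full fine-tuning | 110M (100%) | 86.2 | 89.1 | 94.0 | 84.7 |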

Adapters come within 0.7 GLUE points of full fine-tuning while training only 2.2% of the parameters. That's a ~46x reduction in trainable parameters for a relative accuracy drop of about 0.8% (0.7 / 84.7).
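
The headline ratios follow directly from the table's numbers. A minimal worked calculation, using the rounded values quoted above:

```python
full_params, adapter_params = 110e6, 2.4e6
full_glue, adapter_glue = 84.7, 84.0

reduction = full_params / adapter_params        # how many times fewer trainable params
gap_points = full_glue - adapter_glue           # absolute gap in GLUE points
gap_relative = gap_points / full_glue * 100     # gap as a percentage of the full score
retained = adapter_glue / full_glue * 100       # share of the full score retained

print(f"Trainable-parameter reduction: {reduction:.1f}x")
print(f"GLUE gap: {gap_points:.1f} points ({gap_relative:.2f}% relative)")
print(f"Share of full fine-tuning score retained: {retained:.1f}%")
```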

The Pareto frontier. The key metric isn't just accuracy — it's accuracy per parameter trained. Adapters sit on the Pareto frontier: no other method at the time achieved higher accuracy with fewer parameters. This efficiency-accuracy trade-off defined the PEFT field that followed.

Which layers matter most?

Houlsby et al. found that adapters in higher layers (closer to the output) contribute more to task-specific performance than adapters in lower layers. This makes sense: lower layers learn general linguistic features (syntax, grammar) that transfer across tasks, while upper layers learn more task-specific features that need adaptation.

However, removing ALL lower-layer adapters hurts performance. Even general features benefit from slight task-specific adjustments.

Comparison with other efficient methods

At the time of publication, the main alternative to full fine-tuning was training only specific layers:

| Method | Strategy | Params | GLUE |
|---|---|---|---|
| Top-layer only | Train last 2 layers + head | 23M (21%) | 82.3 |
| All LayerNorms + head | Train only LN params | 38K (0.03%) | 80.1 |
| Adapters (m=64) | Train adapter modules | 2.4M (2.2%) | 84.0 |
| Full fine-tuning | Train everything | 110M (100%) | 84.7 |

Adapters outperform top-layer training despite using roughly a tenth of the parameters, and beat LayerNorm-only training by almost 4 points. The distributed nature of adapters (present in every layer) is key: they can make fine-grained adjustments throughout the network rather than large adjustments in a few layers.
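
"On the Pareto frontier" has a precise meaning: no other method is at least as good on both axes and strictly better on one. The sketch below applies that definition to the (parameters, GLUE) pairs quoted in this chapter's tables; the `dominated` helper is illustrative.

```python
# (trained params, GLUE avg) as quoted in this chapter's tables
methods = {
    "Feature extraction": (30e3, 79.2),
    "All LayerNorms + head": (38e3, 80.1),
    "Adapters (m=64)": (2.4e6, 84.0),
    "Top-layer only": (23e6, 82.3),
    "Full fine-tuning": (110e6, 84.7),
}

def dominated(name):
    """True if some other method has <= params and >= accuracy, with one strict."""
    p, a = methods[name]
    return any(
        (p2 <= p and a2 >= a) and (p2 < p or a2 > a)
        for other, (p2, a2) in methods.items()
        if other != name
    )

frontier = [name for name in methods if not dominated(name)]
print("Pareto frontier:", frontier)
print("Dominated:", [name for name in methods if dominated(name)])
```

On these numbers, top-layer training is the only dominated method: adapters reach a higher score with fewer trained parameters. The cheaper methods and full fine-tuning are also non-dominated, so the frontier describes a trade-off curve rather than a single winner.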

Accuracy vs Parameters Trade-off

This scatter plot shows different methods on the accuracy-vs-parameters plane. Adapters (orange dot) sit on the Pareto frontier — no other method achieves higher accuracy with fewer parameters. Click methods to highlight them.

Adapters achieve 84.0 GLUE with 2.2% of parameters, while full fine-tuning achieves 84.7 with 100%. What does this tell us about the nature of task adaptation?

Chapter 6: Adapter Explorer

Experiment with the adapter architecture yourself. This explorer lets you configure the bottleneck size, the number of adapters, and see the resulting parameter counts and accuracy estimates.

Adapter Configuration Explorer

Configure your adapter setup: model size, bottleneck dimension, and placement. The visualization shows the Transformer with adapters inserted, parameter counts, and estimated accuracy on GLUE. Watch how the bottleneck controls the information flow through the adapter.

Multi-Task Adapter Sharing

See how adapters enable multi-task deployment. One frozen BERT backbone serves multiple tasks; each task has its own set of adapter parameters. Click a task to "load" its adapters into the shared backbone.

The key insight from this explorer: Adapters achieve parameter efficiency not by compressing the model, but by separating general knowledge (frozen backbone) from task-specific knowledge (adapter modules). This separation is the intellectual foundation for all subsequent PEFT methods, including LoRA, prefix tuning, and prompt tuning.
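
The deployment pattern can be sketched with toy scalar "layers". Everything here is hypothetical (the task names, `make_adapter`, and `run` are invented for illustration): the point is only that one list of frozen layers is shared, while each task supplies its own list of small adapters.

```python
def run(x, frozen_layers, adapters):
    """Apply each frozen layer followed by that layer's task-specific adapter."""
    for layer, adapter in zip(frozen_layers, adapters):
        x = adapter(layer(x))
    return x

# Shared backbone: the same frozen functions serve every task
frozen_layers = [lambda v: 2.0 * v, lambda v: v + 1.0]

def make_adapter(scale):
    """Residual toy adapter v + scale · ReLU(v); scale = 0 reproduces the backbone."""
    return lambda v: v + scale * max(0.0, v)

# Per-task state is just a short list of adapters (hypothetical task names)
task_adapters = {
    "sentiment": [make_adapter(0.1), make_adapter(0.0)],
    "nli": [make_adapter(0.0), make_adapter(-0.5)],
}

x = 1.0
baseline = run(x, frozen_layers, [make_adapter(0.0)] * len(frozen_layers))
outputs = {task: run(x, frozen_layers, ads) for task, ads in task_adapters.items()}
print("backbone only:", baseline)
print(outputs)
```

Switching tasks means swapping the adapter list; the backbone is never copied or modified.
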
How do adapters enable efficient multi-task deployment?

Chapter 7: Connections

Adapter modules launched the field of Parameter-Efficient Fine-Tuning (PEFT). Understanding their connections shows how the field evolved.

What came before

| Work | Contribution | Relationship |
|---|---|---|
| BERT (2018) | Pre-train then fine-tune paradigm | Adapters solve BERT's multi-task deployment problem |
| Residual Adapters (Rebuffi 2017) | Adapter modules for vision CNNs | Houlsby adapted this concept from computer vision to NLP Transformers |

What came after

| Work | How It Extended Adapters |
|---|---|
| AdapterFusion (Pfeiffer 2021) | Combine multiple task adapters for multi-task learning |
| MAD-X (Pfeiffer 2020) | Language adapters + task adapters for cross-lingual transfer |
| Prefix Tuning (Li 2021) | Instead of adapter layers, prepend learnable tokens; no architecture change |
| LoRA (Hu 2022) | Low-rank weight updates instead of bottleneck modules; zero inference overhead |
| AdapterHub | Open-source library for adapter-based fine-tuning |

The PEFT family tree

Additive methods:

Adapters: add bottleneck modules

Prefix tuning: add learnable tokens

Prompt tuning: add soft prompts

Reparameterization methods:

LoRA: low-rank weight updates

DoRA: decomposed weight updates

Selective methods:

BitFit: train only bias terms

All PEFT methods share the core insight from this paper: task-specific adaptation is low-dimensional. The methods differ in WHERE they inject the adaptation (new modules vs weight perturbations vs input perturbations) and HOW they parameterize it (bottleneck vs low-rank vs continuous prompts).

The adapter legacy: Houlsby et al.'s adapter paper didn't just introduce a technique; it established a research paradigm. The idea that you can freeze a large pre-trained model and adapt it with small, task-specific modules became the foundation for the PEFT field, and later methods such as LoRA and prefix tuning build on it and cite it as prior work.

"The art of being wise is the art of knowing what to overlook." — William James (adapters know which parameters to overlook)

What foundational insight did the adapter paper establish for the entire PEFT field?