Collobert, Weston, Bottou, Karlen, Kavukcuoglu, Kuksa (NEC Labs) — JMLR 2011

Natural Language Processing (Almost) from Scratch

One of the first papers to show that a single neural architecture can handle POS tagging, chunking, NER, and SRL — with minimal hand-crafted features. Learned embeddings transfer across tasks.

Prerequisites: Basic neural networks + Word embeddings. That's it.

Chapter 0: The Feature Engineering Burden

It's 2011. You're building a system to extract named entities from text — finding that "Barack Obama" is a PERSON and "Washington D.C." is a LOCATION in "Barack Obama spoke in Washington D.C. today." The state-of-the-art approach requires you to design hundreds of hand-crafted features.

For each word, you manually compute features like: Is the first letter capitalized? Does it contain a digit? What is its suffix (-tion, -ing, -ed)? Is the previous word "Mr." or "Dr."? Does it appear in a gazetteer (list of known place names)? What part-of-speech tag does it have? Is it in a name dictionary?

This is feature engineering — the manual process of designing input representations. It is painstaking, task-specific, and brittle. Every new NLP task (POS tagging, chunking, NER, semantic role labeling) requires its own bespoke feature set, designed by a domain expert who understands both the linguistics and the machine learning algorithm.

Traditional NLP: Feature Engineering for Each Task

Each NLP task required its own hand-designed feature pipeline. Click each task to see the features that experts crafted for it. Notice how different and specialized each feature set is.

The vision: What if a neural network could learn its own features? What if, instead of manually designing hundreds of indicators per task, you could give the network raw words and let it figure out what matters? That would mean: (1) a single architecture for all NLP tasks, (2) features that improve automatically with more data, and (3) features that transfer between tasks. Collobert et al. showed this is possible — and the features the network learns are competitive with decades of hand-engineering.

Four NLP tasks, one paper

The paper tackles four core NLP tasks simultaneously:

| Task | What it does | Example |
|---|---|---|
| POS tagging | Label each word's part of speech | "The/DT cat/NN sat/VBD" |
| Chunking | Group words into phrases | "[The cat]NP [sat]VP [on the mat]PP" |
| NER | Find named entities | "[Obama]PER visited [Paris]LOC" |
| SRL | Who did what to whom | "[Obama]A0 [visited]V [Paris]A1" |

Before this paper, each task had its own research community, its own benchmark, and its own feature engineering pipeline. The idea that a single neural network could handle all four was radical.

What makes these tasks hard?

Each task requires different types of linguistic knowledge:

Traditional systems had separate feature sets because each task seemed to require fundamentally different information. The paper's key claim: a single learned representation can capture all of this, because these tasks share underlying linguistic structure.

Traditional pipeline vs end-to-end

In traditional NLP, tasks were solved in a pipeline: first POS tag, then parse, then use parse features for NER, then use NER + parse features for SRL. Each stage depends on the previous one. Errors cascade: if the POS tagger makes a mistake, the parser makes a mistake, and NER has no chance.

Traditional Pipeline
POS → Parse → NER → SRL (errors cascade)
vs.
Collobert et al.
Shared Embeddings → Independent Task Heads (no cascade)

The neural approach eliminates pipeline errors because each task operates directly on the raw input through the shared embeddings. A mistake in POS tagging doesn't affect NER because NER doesn't use POS tag features — it learns its own features from the same raw words.

This independence between tasks is both a strength and a limitation. It prevents error cascading but also prevents tasks from helping each other at inference time.

The paper's multi-task training addresses this at the representation level (shared embeddings learn from all tasks), but not at the prediction level (each task still makes independent predictions). Modern systems like joint models and end-to-end parsers have since addressed this gap.

This "end-to-end" approach — replacing multi-stage pipelines with single neural networks that learn their own intermediate representations — became the dominant paradigm across all of deep learning. We see the same pattern in computer vision (replacing SIFT + SVM with end-to-end CNNs), speech recognition (replacing HMM-GMM pipelines with end-to-end CTC models), machine translation (replacing phrase-based SMT with sequence-to-sequence models), and robotics (replacing perception + planning + control pipelines with end-to-end learned policies).

What is the fundamental problem with the traditional approach to NLP that this paper addresses?

Chapter 1: The Unified Architecture

Collobert et al. propose a single neural network architecture for all four NLP tasks. The architecture has four stages, each building on the previous:

1. Lookup Table
Words → dense vectors (embeddings). Each word index gets mapped to a d-dimensional vector from a learned table. Shape: vocab_size × d
2. Feature Extraction
Window-based (concat embeddings of nearby words) or Sentence-based (1D convolution + max pooling over entire sentence)
3. Hidden Layers
One or more linear layers with HardTanh activation: HardTanh(x) = max(−1, min(1, x))
4. Tag Scoring
Linear layer outputs one score per possible tag. For word-level tasks (POS, NER), use a window around each word. For sentence-level (SRL), use the whole sentence.

The genius is in the simplicity. No parse trees. No gazetteers. No POS tag features. No suffix lists. Just raw words in, tag predictions out. The network learns whatever intermediate representations it needs.

The radical claim: Before this paper, much of the NLP community assumed that hand-crafted linguistic features were essential, and that without parse trees, gazetteers, and morphological analyzers, competitive performance was impossible. Collobert et al. challenged this assumption. A simple four-stage neural network, starting from raw word indices, could come close to systems built on decades of feature engineering. This was the beginning of the end for the "feature engineering era" of NLP.
The Unified NLP Architecture

Data flows from raw words through the four stages. Click each stage to see the computation details: lookup table, feature extraction (window or convolution), hidden layers, and tag scoring.

Data flow with shapes

Let's trace the exact shapes through the window approach (used for POS tagging):

| Stage | Input shape | Output shape | Parameters |
|---|---|---|---|
| Lookup table | window_size word indices | (window_size × d) | V × d (embedding matrix) |
| Concat + Linear | (window_size × d,) | (n_hidden,) | (window_size × d) × n_hidden |
| HardTanh | (n_hidden,) | (n_hidden,) | 0 |
| Linear | (n_hidden,) | (n_tags,) | n_hidden × n_tags |

With d = 50 (embedding dim), window_size = 5, n_hidden = 300, n_tags = 45 (POS tags):

Total parameters = 130,000 × 50 + 250 × 300 + 300 × 45 = 6,588,500

Compared to feature-engineered systems that used millions of indicator features, this is remarkably compact. And the 130,000 × 50 embedding matrix — the bulk of the parameters — is shared across all tasks.
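The arithmetic above can be checked in a few lines. This is a minimal sketch: the function name `window_tagger_param_count` and the `include_biases` flag are illustrative, not from the paper. The total quoted in the text counts weight matrices only, so bias vectors are excluded by default.

```python
def window_tagger_param_count(vocab_size, embed_dim, window_size,
                              n_hidden, n_tags, include_biases=False):
    """Count parameters of the window network: lookup table + two linear layers."""
    embedding = vocab_size * embed_dim               # V x d lookup table
    hidden = (window_size * embed_dim) * n_hidden    # concatenated window -> hidden
    output = n_hidden * n_tags                       # hidden -> tag scores
    total = embedding + hidden + output
    if include_biases:
        total += n_hidden + n_tags                   # one bias per hidden unit and per tag
    return total

weights_only = window_tagger_param_count(130_000, 50, 5, 300, 45)
with_biases = window_tagger_param_count(130_000, 50, 5, 300, 45, include_biases=True)
print(weights_only)   # 6588500
print(with_biases)    # 6588845
```

The embedding table alone accounts for 6,500,000 of the weights, which is why sharing it across tasks matters so much.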

Why HardTanh instead of Sigmoid or ReLU?

The paper uses HardTanh as the activation function — not sigmoid, not ReLU (which hadn't yet become standard in 2008 when the work was done). HardTanh is a piecewise linear approximation of tanh:

HardTanh(x) = −1 if x < −1, x if −1 ≤ x ≤ 1, +1 if x > 1

It has two advantages over sigmoid: (1) its outputs are zero-centered (range [−1, 1] instead of [0, 1]), which helps gradient flow by preventing the all-positive-gradients problem, and (2) its gradient is exactly 1 in the active region, avoiding the 0.25 maximum of sigmoid's derivative. It's faster to compute than tanh since it uses no exponentials.
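To make the gradient comparison concrete, here is a small pure-Python sketch of both activations and their derivatives. The helper names are illustrative; in PyTorch the activation itself is available as `nn.Hardtanh`.

```python
import math

def hardtanh(x):
    """Clip x to the range [-1, 1]."""
    return max(-1.0, min(1.0, x))

def hardtanh_grad(x):
    """Derivative of HardTanh: 1 inside (-1, 1), 0 where the output is clipped."""
    return 1.0 if -1.0 < x < 1.0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    """Derivative of the sigmoid; its maximum is 0.25, reached at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-2.0, 0.0, 0.5, 2.0):
    print(x, hardtanh(x), hardtanh_grad(x), round(sigmoid_grad(x), 4))
```

Inside the active region HardTanh passes gradients through unchanged, whereas the sigmoid shrinks them by a factor of at least four at every layer.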

The loss function: log-likelihood with Viterbi decoding

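For word-level tasks, the paper uses two loss functions:

  1. Word-level log-likelihood: a softmax over the tag scores of each word, with every word's tag predicted independently of its neighbors.
  2. Sentence-level log-likelihood: a learned transition score for each pair of consecutive tags is added to the per-word scores, the whole tag sequence is scored at once, and Viterbi decoding recovers the best sequence at test time.

The sentence-level approach outperforms word-level on all tasks because it takes tag-sequence structure into account (e.g., I-LOC should never directly follow B-PER). This is a form of structured prediction: the model learns not just which tags are likely for each word, but which tag sequences are plausible.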
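A sentence-level log-likelihood can be sketched in pure Python as follows. This is an illustrative simplification, not the paper's code: `scores[t][j]` is assumed to be the network's score for tag j at word t, `transitions[i][j]` the learned score for moving from tag i to tag j, and any start-of-sentence score is omitted for brevity. The normalizer over all tag sequences is computed with the forward recursion rather than by enumeration.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def sequence_score(scores, transitions, tags):
    """Score of one tag sequence: per-word scores plus transition scores."""
    total = scores[0][tags[0]]
    for t in range(1, len(tags)):
        total += transitions[tags[t - 1]][tags[t]] + scores[t][tags[t]]
    return total

def log_partition(scores, transitions):
    """log of the summed exp-scores of ALL tag sequences (forward recursion)."""
    n_tags = len(scores[0])
    alpha = list(scores[0])
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + scores[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)

def sentence_nll(scores, transitions, tags):
    """Negative log-likelihood of the gold tag sequence."""
    return log_partition(scores, transitions) - sequence_score(scores, transitions, tags)

# Toy example: 3 words, 2 tags (all numbers are made up).
scores = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
transitions = [[0.5, -1.0], [-0.5, 0.8]]
print(sentence_nll(scores, transitions, [0, 1, 1]))

# Sanity check: the recursion agrees with brute-force enumeration of all 2**3 paths.
brute = logsumexp([sequence_score(scores, transitions, p)
                   for p in product(range(2), repeat=3)])
print(abs(brute - log_partition(scores, transitions)) < 1e-9)
```

Replacing each `logsumexp` in the recursion with `max` turns the same loop into Viterbi decoding.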

python
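# Viterbi decoding for finding the best tag sequence
import torch

def viterbi_decode(scores, transitions):
    """Find best tag sequence using dynamic programming.

    scores: (n_words, n_tags) tensor of per-word tag scores.
    transitions: (n_tags, n_tags) tensor; transitions[i, j] scores moving from tag i to tag j.
    """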
    n_words, n_tags = scores.shape
    dp = scores[0].clone()  # best score ending in each tag
    backpointers = []

    for t in range(1, n_words):
        best_scores, best_tags = (dp.unsqueeze(1) + transitions).max(dim=0)
        dp = best_scores + scores[t]
        backpointers.append(best_tags)

    # Trace back from best final tag
    best_path = [dp.argmax().item()]
    for bp in reversed(backpointers):
        best_path.append(bp[best_path[-1]].item())
    return list(reversed(best_path))
python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    """Window approach for POS tagging, chunking, NER."""
    def __init__(self, vocab_size, embed_dim, window_size, hidden_dim, n_tags):
        super().__init__()
        self.window = window_size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.linear1 = nn.Linear(window_size * embed_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, n_tags)
        self.hardtanh = nn.Hardtanh()

    def forward(self, word_indices):
        # word_indices: (batch, window_size) — indices of words in window
        x = self.embed(word_indices)          # (batch, win, d)
        x = x.view(x.size(0), -1)              # (batch, win*d)
        x = self.hardtanh(self.linear1(x))     # (batch, hidden)
        x = self.linear2(x)                     # (batch, n_tags)
        return x

# Instantiate for POS tagging
model = WindowTagger(
    vocab_size=130000,
    embed_dim=50,
    window_size=5,
    hidden_dim=300,
    n_tags=45
)
What are the four stages of Collobert et al.'s unified NLP architecture?

Chapter 2: Window vs Sentence Approach

The paper proposes two variants of the architecture, designed for different types of NLP tasks. The choice depends on how much context the task requires.

Window approach

For tasks where local context is sufficient — like POS tagging, chunking, and NER — the network looks at a fixed-size window of words centered on the target word. If the window size is ksz = 5, the network sees the target word plus 2 words before and 2 words after.

The window is simply concatenated: if each word embedding has dimension d = 50 and the window is 5 words, the input to the first hidden layer is a 250-dimensional vector.

input = [embed(w_{t-2}); embed(w_{t-1}); embed(w_t); embed(w_{t+1}); embed(w_{t+2})]

This is fast and simple but has a critical limitation: the network cannot see beyond the window. If the answer depends on a word 10 positions away (which sometimes happens in NER — "In the state of New York, the governor..."), the window approach misses it.

Padding at sentence boundaries

At the beginning and end of a sentence, the window extends beyond the sentence. The paper handles this with padding: a special PADDING token with its own learned embedding, repeated as many times as needed on each side. This embedding learns to encode the fact that the target word is near a sentence boundary, which is itself useful information (e.g., the first word of a sentence is more likely to be a subject).

python
# Window extraction with padding
def extract_windows(sentence, window_size, pad_idx):
    """Extract a window of word indices around each position."""
    half = window_size // 2
    padded = [pad_idx] * half + sentence + [pad_idx] * half
    windows = []
    for i in range(len(sentence)):
        windows.append(padded[i:i + window_size])
    return windows

# Example: "The cat sat" with window=5
# Position 0 ("The"): [PAD, PAD, The, cat, sat]
# Position 1 ("cat"): [PAD, The, cat, sat, PAD]
# Position 2 ("sat"): [The, cat, sat, PAD, PAD]

Sentence approach

For Semantic Role Labeling (SRL), where the network needs to understand the full sentence structure (who did what to whom), a window is not enough. The sentence approach uses 1D convolution over the entire sentence, followed by max pooling to extract a fixed-size representation regardless of sentence length.

1. Embed each word
Sentence of n words → n vectors of dimension d
2. 1D convolution
Slide a filter of width k over the sequence → n features per filter
3. Max pooling over time
Take the max of each filter over all positions → fixed-size vector
4. Hidden + scoring
Standard linear → HardTanh → linear → tag scores
Why max pooling? Max pooling acts as a "did this pattern appear anywhere?" detector. If filter #17 detects the pattern "has been [verb]-ing" (a progressive construction), max pooling asks: "Does this pattern appear anywhere in the sentence?" The answer is a single number, independent of sentence length. This is how the network converts variable-length sentences into fixed-size representations.

Understanding 1D convolution on text

A 1D convolutional filter of width k operates on k consecutive word embeddings. Think of it as a pattern detector that slides across the sentence:

Each filter produces one number per position — how strongly the pattern matches at that location. With 300 filters, we get 300 features per position, each detecting a different local pattern. The max pool then selects the strongest match for each filter across all positions.
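What a single filter computes can be written out by hand. The sketch below uses plain Python lists and tiny 2-dimensional "embeddings" invented for illustration. It applies no padding, so a sentence of n words yields n − k + 1 positions rather than n.

```python
def conv1d_over_words(embeddings, filt, bias=0.0):
    """Slide one filter over a sentence.

    embeddings: list of n word vectors, each of length d.
    filt: list of k weight vectors, each of length d (one per window slot).
    Returns one activation per window position (n - k + 1 values, no padding).
    """
    k = len(filt)
    outputs = []
    for start in range(len(embeddings) - k + 1):
        total = bias
        for offset in range(k):
            word_vec = embeddings[start + offset]
            total += sum(w * x for w, x in zip(filt[offset], word_vec))
        outputs.append(total)
    return outputs

def max_over_time(feature_maps):
    """Keep the strongest response of each filter, regardless of position."""
    return [max(fm) for fm in feature_maps]

# Toy sentence of 4 words with 2-d embeddings, and one filter of width 2.
sentence = [[1, 0], [0, 1], [1, 1], [0, 0]]
filt = [[1, 0], [0, 1]]   # responds to "first dim high, then second dim high"

per_position = conv1d_over_words(sentence, filt)
print(per_position)                   # [2.0, 1.0, 1.0]
print(max_over_time([per_position]))  # [2.0]
```

A longer sentence would give a longer `per_position` list, but `max_over_time` still returns one number per filter.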

This architecture is a precursor to the 1D CNNs used in Kim (2014) for text classification, which became extremely popular before Transformers replaced them. The key limitation: even with max pooling, the representation captures which patterns appear but not where they appear relative to each other. For tasks requiring word-order sensitivity (like SRL), this is a significant weakness. Transformers solve this with positional encoding and self-attention.

python
import torch
import torch.nn as nn

# 1D convolution on text — the sentence approach
embed_dim = 50
n_filters = 300
filter_width = 5

# Create a 1D conv layer
conv = nn.Conv1d(embed_dim, n_filters, filter_width, padding=filter_width//2)

# Example: batch of 4 sentences, each 20 words, 50d embeddings
x = torch.randn(4, 20, 50)  # (batch, seq, embed)
x = x.transpose(1, 2)        # (batch, embed, seq) — Conv1d expects this

features = conv(x)            # (4, 300, 20) — 300 features per position
pooled, _ = features.max(dim=2)  # (4, 300) — max over time

print(f"Per-position features: {features.shape}")  # [4, 300, 20]
print(f"After max pool: {pooled.shape}")          # [4, 300]
# Variable sentence length → fixed 300d representation
Window vs Sentence Approach

Left: the window approach sees only nearby words. Right: the sentence approach (convolution + max pool) sees the entire sentence. Toggle between them. Notice how the sentence approach can capture long-range dependencies.

When to use which

| Approach | Context | Best for | Speed |
|---|---|---|---|
| Window | k words around target | POS, Chunking, NER | Very fast |
| Sentence | Entire sentence | SRL | Slower (conv + pool) |
python
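class SentenceTagger(nn.Module):
    """Sentence approach with 1D convolution for SRL.

    Simplified sketch: returns one tag-score vector per sentence and omits
    the relative-position features described in the text, so on its own it
    cannot tell which word is being tagged.
    """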
    def __init__(self, vocab_size, embed_dim, n_filters, filter_width, hidden_dim, n_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1D conv: embed_dim input channels, n_filters output channels
        self.conv = nn.Conv1d(embed_dim, n_filters, filter_width,
                              padding=filter_width//2)
        self.linear1 = nn.Linear(n_filters, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, n_tags)
        self.hardtanh = nn.Hardtanh()

    def forward(self, word_indices):
        # word_indices: (batch, seq_len)
        x = self.embed(word_indices)             # (batch, seq, d)
        x = x.transpose(1, 2)                    # (batch, d, seq) for Conv1d
        x = self.hardtanh(self.conv(x))           # (batch, n_filters, seq)
        x, _ = x.max(dim=2)                     # (batch, n_filters) — max pool over time
        x = self.hardtanh(self.linear1(x))       # (batch, hidden)
        x = self.linear2(x)                      # (batch, n_tags)
        return x

Position features for SRL

For Semantic Role Labeling, the sentence approach needs to know which word is the target verb. The paper adds a relative position feature: for each word, it computes the distance to the target verb and looks up this distance in a position embedding table.

pos_feat(w_i) = LT_pos(i − verb_position)

So if the verb is at position 4 and we're looking at word 2, the position feature is LT_pos(−2). Word 6 gets LT_pos(+2). This gives the network a sense of structure relative to the verb, which is essential for SRL where the role of a word (agent, patient, instrument) depends heavily on its position relative to the predicate.

This is a precursor to the positional encodings used in Transformers (Vaswani et al., 2017), though the Transformer version is absolute (position in the sentence) rather than relative (distance to a reference word).
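Turning relative distances into lookup-table indices might look like the sketch below. The clipping range `max_distance=50` and the shift to non-negative indices are assumptions made for illustration (an embedding table needs indices starting at 0). They are not details taken from the paper.

```python
def relative_position_ids(sentence_length, verb_position, max_distance=50):
    """Map each word's signed distance to the verb onto a table index.

    Distances are clipped to [-max_distance, max_distance] and shifted so the
    result lies in 0 .. 2 * max_distance (the verb itself maps to max_distance).
    """
    ids = []
    for i in range(sentence_length):
        distance = i - verb_position
        clipped = max(-max_distance, min(max_distance, distance))
        ids.append(clipped + max_distance)
    return ids

ids = relative_position_ids(sentence_length=8, verb_position=4)
print(ids[2] - 50)   # -2: word 2 is two positions before the verb
print(ids[6] - 50)   # 2: word 6 is two positions after the verb
```

With `max_distance=50` the table has 101 rows, each of which would hold a small learned vector that is concatenated to the word's other features.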

Why does the paper use max pooling after convolution in the sentence approach?

Chapter 3: The Embedding Layer

The embedding layer — the lookup table — is the paper's most influential contribution. While the idea of word embeddings existed before (Bengio et al.'s neural language model, 2003), Collobert et al. demonstrated two critical properties:

  1. Embeddings learned for one task transfer to other tasks. Embeddings trained on POS tagging improve NER, and vice versa.
  2. Embeddings pre-trained on unlabeled text (via a language model objective) improve all downstream tasks. This is the ancestor of modern pre-training.

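The lookup table is simply a matrix LT_W ∈ R^{d×|V|}. Given a word index i, the embedding is the i-th column: LT_W(i) = W_i. This is mathematically equivalent to multiplying the embedding matrix by a one-hot vector. (PyTorch's nn.Embedding stores the transpose, a |V| × d matrix, so in the code and in the gradient discussion the same vector is row i.)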
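The equivalence between a table lookup and a one-hot matrix product can be checked directly. This is a toy sketch with a made-up 2 × 3 matrix (d = 2, |V| = 3) stored as a list of rows. The helper names are illustrative.

```python
def lookup(W, i):
    """Return column i of a d x |V| matrix stored as a list of rows."""
    return [row[i] for row in W]

def one_hot(i, size):
    return [1.0 if j == i else 0.0 for j in range(size)]

def matvec(W, v):
    """Plain matrix-vector product."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

W = [[0.1, 0.2, 0.3],
     [0.4, 0.5, 0.6]]

print(lookup(W, 1))               # [0.2, 0.5]
print(matvec(W, one_hot(1, 3)))   # [0.2, 0.5]
```

The lookup is preferred in practice because it reads d numbers instead of performing d × |V| multiplications.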

Multiple feature types

The paper doesn't just embed words. It also embeds additional features — each with its own lookup table:

| Feature | Vocabulary size | Embedding dim | Purpose |
|---|---|---|---|
| Word | ~130,000 | 50 | Semantic/syntactic meaning |
| Capitalization | 4 (allLower, allUpper, firstUpper, mixed) | 5 | "Obama" vs "the" |
| Word suffix | ~2,000 (2-char suffixes) | 5 | "-ed", "-ing", "-tion" morphology |
| Relative position (SRL) | ~100 | 5 | Distance to target verb |

The final embedding for a word is the concatenation of all its feature embeddings:

embed(w) = [LT_word(w); LT_caps(caps(w)); LT_suffix(suf(w))]
dim = 50 + 5 + 5 = 60
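The discrete features caps(w) and suf(w) have to be computed from the raw token before any lookup. One possible way to do it is sketched below. The class ordering, the tie-breaking rules (for example, tokens with no letters count as allLower), and the function names are assumptions for illustration, not the paper's exact preprocessing.

```python
CAPS_CLASSES = ["allLower", "allUpper", "firstUpper", "mixed"]

def caps_id(word):
    """Index of the capitalization class of a token (see CAPS_CLASSES)."""
    if word == word.lower():
        return 0   # allLower (also covers digits and punctuation)
    if word == word.upper():
        return 1   # allUpper
    if word[0].isupper() and word[1:] == word[1:].lower():
        return 2   # firstUpper
    return 3       # mixed

def suffix(word, n=2):
    """Last n characters, lowercased, used as the key into a suffix table."""
    return word[-n:].lower()

for token in ["the", "NASA", "Obama", "iPhone", "Walked"]:
    print(token, CAPS_CLASSES[caps_id(token)], suffix(token))
```

Each of these small integer or string features is then mapped to its own 5-dimensional vector and concatenated with the 50-dimensional word vector.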
Design decision — why concatenate, not add? Adding embeddings (like modern Transformers do with position encodings) assumes the features live in the same space. Concatenation keeps each feature in its own subspace, giving the hidden layers maximum flexibility to combine them. With only a few small feature types, concatenation is practical — the dimensionality increase is modest.
Multiple Embedding Lookup Tables

Each word gets multiple embeddings concatenated: word (50d), capitalization (5d), suffix (5d). Click a word to see its composite embedding. The total input dimension is the sum of all embedding dimensions.

Training the embeddings

The embedding matrices are initialized randomly and updated by backpropagation along with all other network weights. The key insight: because the same matrix is used to look up the word at every position in the window, a word's vector receives a gradient whenever that word appears anywhere in any window, not only when it is the word being tagged. Each embedding therefore learns from every context its word occurs in.

How embedding gradients work

The gradient for a word embedding is particularly intuitive. During a forward pass, the embedding lookup selects row i from the matrix. During the backward pass, the gradient flows back to only that row. If word "cat" (index 42) appears in the current training example, only row 42 of the embedding matrix gets a gradient update — all other rows receive zero gradient.

This means rare words get fewer gradient updates than common words. The paper doesn't address this directly, but later work tackled frequency imbalance in its own ways: Word2Vec subsamples very frequent words, and GloVe caps the weight given to very frequent co-occurrences.

∂L/∂W[i, :] = ∂L/∂embed(i) (only row i gets updated)
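The row-wise update can be illustrated without any framework. In this sketch the matrix is a list of rows (one per word, the layout PyTorch uses), and `sgd_update_embeddings` is a hypothetical helper that applies one SGD step given the gradient for each word that appeared in the example.

```python
def sgd_update_embeddings(W, word_ids, grads, lr):
    """In-place SGD step that touches only the rows named in word_ids.

    W: list of rows, one embedding per vocabulary entry.
    word_ids: indices of the words that appeared in this training example.
    grads: one gradient vector per entry of word_ids.
    If a word appears twice, its row is updated twice (gradients accumulate).
    """
    for idx, grad in zip(word_ids, grads):
        row = W[idx]
        for k in range(len(row)):
            row[k] -= lr * grad[k]

W = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
before = [list(row) for row in W]

# Only word 1 appears in this example.
sgd_update_embeddings(W, word_ids=[1], grads=[[0.5, -0.5]], lr=0.1)

changed = [i for i in range(len(W)) if W[i] != before[i]]
print(changed)   # [1]
```

Rows 0 and 2 are untouched, which is exactly why words that never appear in a batch learn nothing from it.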

With a vocabulary of 130,000 words and a training set of millions of sentences, each word gets thousands of gradient updates over training — enough to learn useful representations even for moderately rare words.

Very rare words (appearing fewer than 5 times) still get poor embeddings. The paper handles this by mapping all rare words to a special "RARE" token with its own learned embedding. This is a crude solution — all rare words get the same embedding, which throws away what little information we have about them.
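A vocabulary with a shared rare-word token can be built along these lines. This is a minimal sketch: the `min_count=5` threshold follows the description above, while the function names and the choice of index 0 for the RARE token are illustrative.

```python
from collections import Counter

def build_vocab(tokenized_sentences, min_count=5, rare_token="RARE"):
    """Assign an index to every word seen at least min_count times.

    Index 0 is reserved for rare_token, which stands in for everything else.
    """
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    vocab = {rare_token: 0}
    for word, count in counts.most_common():
        if count >= min_count and word != rare_token:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, rare_token="RARE"):
    """Replace each token by its index, falling back to the RARE index."""
    rare_id = vocab[rare_token]
    return [vocab.get(tok, rare_id) for tok in sentence]

# Toy corpus: "the" occurs 6 times, "cat" and "sat" 5 times, the rest once.
corpus = [["the", "cat", "sat"]] * 5 + [["the", "magnetohydrodynamics", "lecture"]]
vocab = build_vocab(corpus, min_count=5)
print(encode(["the", "magnetohydrodynamics", "cat"], vocab))   # [1, 0, 2]
```

Both "magnetohydrodynamics" and "lecture" collapse onto index 0, so the model cannot tell them apart.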

Modern systems solve this much more elegantly with subword tokenization (BPE), using vocabulary sizes of 32,000-100,000 subword tokens. Even if the word "magnetohydrodynamics" never appeared in training, its subwords "magnet" + "o" + "hydro" + "dynamics" all have well-trained embeddings that compose to a reasonable representation.

This is one area where the 2011 architecture shows its age — character-level and subword models (Bojanowski et al., 2017; Sennrich et al., 2016) were needed to handle the long tail of rare words properly.

python
# Multiple embedding lookup tables
class MultiEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_embed = nn.Embedding(130000, 50)  # vocabulary
        self.caps_embed = nn.Embedding(4, 5)        # capitalization patterns
        self.suf_embed  = nn.Embedding(2000, 5)     # 2-char suffixes

    def forward(self, word_ids, cap_ids, suf_ids):
        w = self.word_embed(word_ids)    # (batch, seq, 50)
        c = self.caps_embed(cap_ids)    # (batch, seq, 5)
        s = self.suf_embed(suf_ids)     # (batch, seq, 5)
        return torch.cat([w, c, s], dim=-1)  # (batch, seq, 60)

Embedding dimension: the 50-dimensional sweet spot

Why 50 dimensions? The paper doesn't report an extensive hyperparameter search, but the choice reflects a trade-off:

The 2011 choice of 50d was appropriate for the architecture depth (1-2 hidden layers) and the available compute. With more layers to process the embeddings, higher dimensions become useful.

Why does the paper concatenate multiple embedding types (word, caps, suffix) rather than using word embeddings alone?

Chapter 4: Multi-Task Learning

With a single architecture for all four tasks, a natural question arises: can training on multiple tasks simultaneously improve performance on each? The answer is yes — and the mechanism is shared embeddings.

The multi-task setup works like this: the embedding layer is shared across all tasks. Each task has its own hidden layer and scoring layer. During training, we alternate between tasks — one mini-batch of POS tagging, one mini-batch of NER, one of chunking, and so on. The task-specific layers get gradients only from their own task, but the shared embedding layer gets gradients from all tasks.

Shared Embeddings: one lookup table trained by ALL tasks. Each task's gradients improve the shared representations.
↓ ↓ ↓ ↓
POS Head (hidden + 45 tags) | Chunk Head (hidden + 23 tags) | NER Head (hidden + 9 tags) | SRL Head (hidden + 114 tags)

Why multi-task helps: POS tagging teaches the embeddings that "running" is a verb (or gerund). NER teaches them that capitalized words at sentence starts might be entities. Chunking teaches them about phrase boundaries. Each task provides a different training signal that enriches the shared embeddings. The result: embeddings that capture a richer set of linguistic properties than any single task could provide alone.

Training protocol

The paper uses a simple alternating strategy:

  1. Select a task t uniformly at random
  2. Sample a mini-batch from task t's training data
  3. Forward pass through shared embeddings + task t's head
  4. Backpropagate through task t's head AND the shared embeddings
  5. Repeat

No fancy multi-task weighting or gradient balancing — just random alternation. The simplicity is part of the paper's appeal.

Modern multi-task learning is more sophisticated. Today's systems use techniques like: (1) gradient normalization to balance gradient magnitudes across tasks, (2) task-specific learning rate schedules, (3) uncertainty weighting where task weights are learned based on homoscedastic uncertainty (Kendall et al., 2018). But Collobert et al.'s simple alternation remains a reasonable baseline that often performs surprisingly well.
Multi-Task Training: Shared Embedding Improvement

Watch how training on multiple tasks simultaneously improves the shared embeddings. Each task's gradient signal enriches different aspects of the embeddings. Click "Train" to see task accuracy curves evolve together.


Results

| Task | Single-task accuracy | Multi-task accuracy | Improvement |
|---|---|---|---|
| POS | 97.12% | 97.20% | +0.08 |
| Chunking | 93.37% | 93.63% | +0.26 |
| NER | 87.58% | 88.67% | +1.09 |
| SRL | 73.54% | 74.29% | +0.75 |

The biggest improvement is on NER (+1.09 points), which has the least training data. Multi-task learning acts as a form of regularization and data augmentation: the shared embeddings benefit from the combined data of all tasks.

Why NER benefits most

NER has the least labeled training data among the four tasks. With sparse data, the embeddings for rare entity names (company names, locations, person names) receive few gradient updates from NER alone. But these same words appear frequently in POS tagging and chunking data, where they receive more updates. Multi-task learning transfers these updates to the shared embeddings, effectively providing more training signal for the rare but important words in NER.

This insight generalizes: multi-task learning helps most on low-resource tasks that share representations with high-resource tasks. Today, we see the same principle in large pre-trained models: fine-tuning GPT-3 on a small dataset works because the pre-training provided billions of gradient updates to the shared representation.

The gradient flow of multi-task learning

From a gradient perspective, multi-task learning works because each task provides a different "view" of what the embeddings should encode. POS tagging gradients push "running" and "jumping" closer together (both are verbs). NER gradients push "Obama" and "Clinton" closer together (both are person names). Chunking gradients push "the" and "a" closer together (both are determiners). These diverse signals produce embeddings that encode a richer set of properties than any single task.

∇_embed L_total = ∇_embed L_POS + ∇_embed L_NER + ∇_embed L_chunk + ∇_embed L_SRL

Accumulated over training, the embedding gradient is the sum of gradients from all tasks (any single step uses just one task's gradient, because tasks are sampled one at a time). Each task "pulls" the embedding vectors in directions useful for its own objective. The resulting embeddings find a compromise position that works reasonably well for all tasks, and often better than any single-task optimum, because the multi-task signal acts as a regularizer against overfitting to any one task's idiosyncrasies.

python
# Simulating multi-task gradient flow
import torch
import torch.nn as nn

# Shared embedding, two task heads
embed = nn.Embedding(1000, 50)
head_pos = nn.Linear(250, 45)   # POS tagging
head_ner = nn.Linear(250, 9)    # NER

# Train on POS batch
words = torch.randint(0, 1000, (32, 5))
e = embed(words).view(32, -1)
pos_loss = head_pos(e).sum()
pos_loss.backward()
# embed.weight.grad now has POS signal

# Train on NER batch (gradients accumulate!)
e2 = embed(words).view(32, -1)
ner_loss = head_ner(e2).sum()
ner_loss.backward()
# embed.weight.grad now has POS + NER signal

Negative transfer: when multi-task hurts

Multi-task learning doesn't always help. If tasks are too dissimilar (e.g., sentiment analysis and machine translation), shared representations may compromise — getting worse at both tasks. The paper avoids this because POS tagging, chunking, NER, and SRL are all syntactic-semantic tasks that require similar linguistic knowledge. The shared embeddings naturally encode features useful for all four.

python
class MultiTaskNLP(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, task_n_tags):
        super().__init__()
        # Shared embedding layer
        self.embed = nn.Embedding(vocab_size, embed_dim)

        # Task-specific heads
        self.heads = nn.ModuleDict()
        for task, n_tags in task_n_tags.items():
            self.heads[task] = nn.Sequential(
                nn.Linear(5 * embed_dim, hidden_dim),  # window=5
                nn.Hardtanh(),
                nn.Linear(hidden_dim, n_tags)
            )

    def forward(self, word_ids, task):
        x = self.embed(word_ids).view(word_ids.size(0), -1)
        return self.heads[task](x)

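# Training loop with random task selection
import random
import torch.nn.functional as F

tasks = {'pos': 45, 'chunk': 23, 'ner': 9, 'srl': 114}
model = MultiTaskNLP(130000, 50, 300, tasks)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is illustrative

def sample_batch(task, batch_size=32):
    """Placeholder with random data: swap in a real loader per task."""
    word_ids = torch.randint(0, 130000, (batch_size, 5))
    labels = torch.randint(0, tasks[task], (batch_size,))
    return word_ids, labels

for step in range(1000000):
    task = random.choice(list(tasks.keys()))
    word_ids, labels = sample_batch(task)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    scores = model(word_ids, task)
    loss = F.cross_entropy(scores, labels)
    loss.backward()  # gradients flow to task head AND shared embeddings
    optimizer.step()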
How does multi-task learning improve performance in this paper?

Chapter 5: Semi-Supervised Pre-training

The most forward-looking contribution of the paper is semi-supervised pre-training. The idea: before training on any labeled NLP task, first pre-train the word embeddings on a massive amount of unlabeled text using a language modeling objective. Then fine-tune these pre-trained embeddings on the labeled data.

This is conceptually identical to what GPT, BERT, and every modern NLP system does — just 7 years earlier, at smaller scale, and with a simpler model.

The pre-training objective

Collobert et al. use a pairwise ranking loss. Given a window of text from the corpus (the "sentence" in the formula below), they create a corrupted version by replacing the center word with a random word. The network must score the original higher than the corrupted one:

L = max(0, 1 − f(correct sentence) + f(corrupted sentence))

For example:

Correct: "The cat sat on the mat" → score: 8.3
Corrupted: "The cat democracy on the mat" → score: 2.1
loss = max(0, 1 − 8.3 + 2.1) = max(0, −5.2) = 0 ✓ (the correct sentence scores higher by more than the margin of 1)

This is not a full language model (it doesn't predict the next word). It's a discriminative objective: can the network tell real sentences from fake ones? But the effect is the same — to score real sentences highly, the embeddings must capture which words fit naturally in which contexts.
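The corruption step and the ranking loss are simple enough to sketch in plain Python. The scoring network f is left out here. The two scores are passed in as numbers, and the helper names and the toy vocabulary are illustrative.

```python
import random

def corrupt_center(window, vocabulary, rng):
    """Return a copy of the window with its center word replaced at random."""
    center = len(window) // 2
    candidates = [w for w in vocabulary if w != window[center]]
    replacement = rng.choice(candidates)
    return window[:center] + [replacement] + window[center + 1:]

def ranking_loss(score_real, score_corrupted, margin=1.0):
    """Hinge loss: zero once the real text outscores the corrupted one by the margin."""
    return max(0.0, margin - score_real + score_corrupted)

rng = random.Random(0)
window = ["the", "cat", "sat", "on", "the"]
vocabulary = ["the", "cat", "sat", "on", "mat", "democracy"]

print(corrupt_center(window, vocabulary, rng))
print(ranking_loss(8.3, 2.1))   # 0.0: already separated by more than the margin
print(ranking_loss(2.0, 1.5))   # 0.5: real text wins, but not by a full margin
```

Only pairs that violate the margin produce a gradient, so training effort concentrates on corruptions the network still finds plausible.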

The ancestor of modern pre-training: This is the same intuition behind Word2Vec (2013), GloVe (2014), ELMo (2018), BERT (2019), and GPT (2018-2024). Train on unlabeled text to learn general word representations, then fine-tune on labeled data for specific tasks. Collobert et al. were doing this as early as 2008 (the journal paper appeared in 2011, but the work started in 2008), with the full pipeline: pre-train on English Wikipedia, then fine-tune on NLP benchmarks.

Pre-training data and scale

| Property | Value |
|---|---|
| Pre-training corpus | English Wikipedia (631M words) |
| Pre-training objective | Pairwise ranking (real vs corrupted) |
| Window size | 11 words |
| Embedding dimension | 50 |
| Training time | ~1 month on a single CPU |

Pre-training: Real vs Corrupted Sentences

The network learns to distinguish real sentences from corrupted ones. Click "Corrupt" to replace the center word with a random word. The network's score should be higher for the real sentence. Watch embeddings converge as they learn what "fits."


Impact of pre-training

The paper reports accuracy with and without pre-trained embeddings:

| Task | Random init | Pre-trained | Improvement |
|---|---|---|---|
| POS | 96.37% | 97.20% | +0.83 |
| Chunking | 90.33% | 93.63% | +3.30 |
| NER | 81.47% | 88.67% | +7.20 |
| SRL | 70.99% | 74.29% | +3.30 |

NER improves by 7.2 percentage points from pre-training alone! This makes sense: NER requires knowing that "Obama" is a person-type word, which is exactly the kind of knowledge an embedding learns from reading Wikipedia. Without pre-training, the network has to learn this from the small labeled NER dataset — much harder.

What do pre-trained embeddings know?

The authors examined their pre-trained embeddings by finding nearest neighbors. The results reveal rich linguistic structure learned purely from unlabeled text:

| Query word | Nearest neighbors | Captured knowledge |
| --- | --- | --- |
| France | Austria, Belgium, Germany, Italy | European countries |
| Monday | Tuesday, Wednesday, Thursday | Days of the week |
| racing | riding, swimming, flying | Gerund activities |
| universities | colleges, schools, campuses | Educational institutions |
| he | she, it, they | Pronoun class |

These are the same kind of relationships that Word2Vec (published 2 years later) would become famous for. Collobert et al. demonstrated this first, though their pre-training objective was different (pairwise ranking vs. prediction).
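
A nearest-neighbor lookup of this kind can be sketched in plain Python. Everything specific here is an assumption made for illustration: the 3-dimensional vectors are hand-picked toy values (the real embeddings have 50 dimensions), and cosine similarity is used as the similarity measure, which is a common choice but not necessarily the metric the authors used.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, embeddings, k=3):
    """Return the k words whose vectors are most similar to the query word's."""
    q = embeddings[query]
    scored = [(cosine(q, vec), word)
              for word, vec in embeddings.items() if word != query]
    scored.sort(reverse=True)  # highest similarity first
    return [word for _, word in scored[:k]]


# Hypothetical toy vectors, chosen by hand so that related words point
# in similar directions.
toy_embeddings = {
    "france":  [0.90, 0.10, 0.00],
    "austria": [0.80, 0.20, 0.10],
    "belgium": [0.85, 0.15, 0.05],
    "monday":  [0.00, 0.90, 0.20],
    "tuesday": [0.10, 0.95, 0.15],
    "he":      [0.10, 0.10, 0.90],
}

print(nearest_neighbors("france", toy_embeddings, k=2))  # ['belgium', 'austria']
```

With learned embeddings the same lookup runs over the full vocabulary; the structure in the table emerges because words that occur in similar contexts end up with similar vectors.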

Comparison: pre-training objectives through history

| Method | Year | Objective | Key advantage |
| --- | --- | --- | --- |
| Collobert et al. | 2008/2011 | Real vs corrupted sentence | Simple, no softmax over vocab |
| Word2Vec | 2013 | Predict context/center word | 10x faster training |
| GloVe | 2014 | Matrix factorization of co-occurrences | Global statistics |
| ELMo | 2018 | Bidirectional language model | Context-dependent embeddings |
| BERT | 2019 | Masked language model | Deep bidirectional context |
| GPT | 2018-24 | Autoregressive next-word prediction | Scales to trillions of tokens |

All of these objectives share the same core insight from Collobert et al.: learn word representations by exploiting the structure of unlabeled text. The objectives differ in details, but the principle — that self-supervised learning on text produces transferable representations — was established in this 2008 work.

Scale matters: 2008 vs 2024

To appreciate how far the field has come while using the same principles:

| Property | Collobert et al. (2008) | GPT-4 (2023, unofficial estimates) | Ratio |
| --- | --- | --- | --- |
| Pre-training data | 631M words (~2.5 GB) | ~13T tokens (~50 TB) | 20,000x |
| Embedding dimension | 50 | ~12,288 | 246x |
| Total parameters | ~6.5M | ~1.8T (est.) | 277,000x |
| Training compute | 1 CPU-month | ~25,000 GPU-months | 25,000x |
| Tasks handled | 4 (POS, NER, Chunk, SRL) | Hundreds+ | ~100x |

The architecture changed (feedforward → Transformer), the scale changed (millions → trillions), but the recipe remained: (1) learn representations from unlabeled text, (2) transfer to downstream tasks. Collobert et al. proved the recipe works at small scale. The field spent the next 15 years proving it works at every scale.

Historical note: The work that became this paper began in 2008 — before deep learning's GPU-powered renaissance (AlexNet, 2012), before Word2Vec (2013), before attention (2014), before Transformers (2017). It was one of the first demonstrations that neural networks could compete with classical NLP on standard benchmarks, and it did so with a single CPU and a simple feedforward architecture. The ideas were ahead of the hardware — a recurring pattern in deep learning history.

The backpropagation algorithm (1986) waited 25 years for GPUs to make deep networks practical. Convolutional networks (1989) waited 23 years for ImageNet and GPU training. And Collobert et al.'s pre-training recipe (2008) waited 10 years for BERT to demonstrate it at full scale. Good ideas persist. They just need the right compute and data to flourish.

Today, we stand on the shoulders of these early works. Every time you type model = AutoModelForTokenClassification.from_pretrained(...), you are using the exact paradigm that Collobert et al. established: pre-trained representations, transferred to a specific task, fine-tuned with task-specific labels.

The API has changed. The scale has changed. The models are unrecognizably more powerful. But the principle — learn representations from unlabeled text, then transfer them — has not changed since this paper proved it works.

That is what makes NLP (Almost) from Scratch one of the most influential papers in the history of natural language processing. Cited over 8,000 times, it laid the groundwork for an entire paradigm shift: from hand-crafted features to learned representations, from task-specific pipelines to unified architectures, from labeled-data-only training to pre-train-then-fine-tune. The "almost" in the title was prophetic — it took a few more years, but the "almost" eventually became "completely."

Fine-tuning strategy

A crucial practical question: when fine-tuning on labeled data, should you freeze the pre-trained embeddings or continue training them? The paper tries both.

The paper finds that fine-tuning the embeddings works best when combined with a small learning rate for the embedding layer (slower than the task-specific layers). This is now standard practice in modern NLP (differential learning rates / discriminative fine-tuning).

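```python
# Differential learning rates (modern PyTorch)
import torch

# Assumes `model` exposes `embed`, `hidden`, and `output` submodules.
optimizer = torch.optim.SGD([
    {'params': model.embed.parameters(), 'lr': 0.001},  # slow for embeddings
    {'params': model.hidden.parameters(), 'lr': 0.01},  # faster for task layers
    {'params': model.output.parameters(), 'lr': 0.01},
])
# This preserves pre-trained knowledge while allowing task adaptation
```

In modern practice: The same caution carries over to Transformer fine-tuning, where the pre-trained backbone is typically updated with a small learning rate (around 2e-5) so that pre-trained knowledge is not overwritten. Discriminative fine-tuning goes a step further and assigns smaller rates to pre-trained layers than to the task-specific head. Note that the Hugging Face Trainer applies a single learning rate to all parameters by default, so per-layer rates require explicit parameter groups like the ones above.
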
```python
# Pairwise ranking loss for pre-training
import torch

def pretrain_step(model, sentence, vocab_size):
    """One step of pairwise ranking pre-training.

    `sentence` is a 1-D LongTensor of word indices (one text window);
    `model` maps such a tensor to a scalar score.
    """
    center_idx = len(sentence) // 2

    # Score the real sentence
    real_score = model(sentence)

    # Create corrupted version: replace the center word with a random word index.
    # (The random draw can occasionally equal the original word.)
    corrupted = sentence.clone()
    corrupted[center_idx] = torch.randint(0, vocab_size, (1,)).item()
    corrupt_score = model(corrupted)

    # Ranking loss: real should score higher by margin 1
    loss = torch.clamp(1 - real_score + corrupt_score, min=0)
    return loss
```

What is the pre-training objective used in this paper?

Chapter 6: Benchmark Showcase

How does the neural network — with minimal features — compare to state-of-the-art systems that use decades of hand-crafted feature engineering? The results were surprising in 2011.

Benchmark Results: Neural vs Feature-Engineered

Compare the performance of Collobert et al.'s neural approach (orange) against the best feature-engineered systems of the time (teal). Toggle "Features" to see what happens when you add hand-crafted features to the neural system. Drag the slider to animate training progress.


Key results (per-word accuracy for POS, F1 for the other tasks)

| Task | Benchmark | Neural (no features) | Neural + features | Best traditional |
| --- | --- | --- | --- | --- |
| POS | WSJ | 97.20 | 97.29 | 97.24 |
| Chunking | CoNLL 2000 | 93.63 | 94.32 | 94.13 |
| NER | CoNLL 2003 | 88.67 | 89.59 | 89.31 |
| SRL | CoNLL 2005 | 74.29 | 77.92 | 77.92 |

The headline result: With zero hand-crafted features (no gazetteers, no suffix lists, no parse trees), the neural network achieves competitive performance across all four tasks. On POS tagging it lands within 0.04 points of the best feature-engineered system (97.20 vs 97.24). When a small number of features are added (just POS tags and chunk tags for SRL), it matches or beats the state of the art on every task in the table. This was a landmark result: it showed that feature engineering could be largely automated.

What features still help

The paper is honest about when features still help. For SRL, which requires understanding sentence-level structure, adding POS tags and chunk tags as features improves F1 from 74.29 to 77.92 — a 3.6-point jump. This makes sense: SRL benefits from syntactic structure, which POS and chunk tags encode. The network could eventually learn this from raw words, but with limited labeled SRL data, explicit syntactic features provide a shortcut.

Speed comparison

Beyond accuracy, the neural approach has a speed advantage:

| System | POS tagging speed | Training time |
| --- | --- | --- |
| Neural (this paper) | ~200,000 words/sec | Hours |
| Best traditional (SVM) | ~1,000 words/sec | Days |

The neural network is 200x faster at test time because it's just matrix multiplies, while traditional systems compute hundreds of features per word and then solve constrained optimization problems.

The Viterbi decoding advantage

When using sentence-level training with a transition matrix A, the paper employs Viterbi decoding at test time to find the globally optimal tag sequence. This is important because it enforces structural constraints — for example, in NER, a B-PER (beginning of person) tag can be followed by I-PER (inside person) but not by I-LOC (inside location).

The effect on NER is significant: sentence-level training with Viterbi improves F1 from 86.96 (word-level) to 88.67 (sentence-level), a 1.7-point gain. The transition matrix learns tag-to-tag compatibility patterns that would be hard to capture with word-level predictions alone.

$$\text{best\_tags} = \arg\max_{t_1,\dots,t_n} \left( \sum_{i} f_\theta(i, t_i) + \sum_{i} A_{t_i,\, t_{i+1}} \right)$$

The first term scores individual word-tag pairs (the neural network output). The second term scores tag-tag transitions (the learned transition matrix). Viterbi finds the sequence that maximizes the combined score in O(n · T²) time, where n is the sentence length and T is the number of tags.
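
The decoding step described above can be sketched as a short dynamic program. This is an illustrative implementation under stated assumptions, not the authors' code: the tag set, the emission scores, and the transition scores below are all hypothetical values chosen so that per-word prediction and sequence decoding disagree.

```python
def viterbi(emissions, transitions):
    """Return (best tag index sequence, its score).

    emissions[i][t]   : network score for tag t at position i, f_theta(i, t)
    transitions[s][t] : score for moving from tag s to tag t, A[s][t]
    """
    n, T = len(emissions), len(emissions[0])
    best = list(emissions[0])  # best[t]: top score of any path ending in tag t
    backptr = []               # backptr[i-1][t]: best previous tag for t at position i

    for i in range(1, n):
        new_best, pointers = [], []
        for t in range(T):
            # Choose the previous tag that gives the best path into tag t.
            s_best = max(range(T), key=lambda s: best[s] + transitions[s][t])
            new_best.append(best[s_best] + transitions[s_best][t] + emissions[i][t])
            pointers.append(s_best)
        best = new_best
        backptr.append(pointers)

    # Follow the back-pointers from the best final tag.
    tag = max(range(T), key=lambda t: best[t])
    path = [tag]
    for pointers in reversed(backptr):
        tag = pointers[tag]
        path.append(tag)
    path.reverse()
    return path, max(best)


tags = ["O", "B-PER", "I-PER", "I-LOC"]
NEG = -5.0  # hypothetical penalty for structurally invalid transitions

# Hypothetical scores for the three words "Barack Obama spoke".
emissions = [
    [0.1, 2.0, 0.5, 0.2],  # "Barack": B-PER wins
    [0.3, 0.4, 1.0, 1.2],  # "Obama": word-level argmax would pick I-LOC
    [2.0, 0.1, 0.1, 0.1],  # "spoke": O wins
]
transitions = [
    # to:  O    B-PER  I-PER  I-LOC
    [0.0, 0.0, NEG, NEG],  # from O
    [0.0, 0.0, 1.0, NEG],  # from B-PER
    [0.0, 0.0, 0.5, NEG],  # from I-PER
    [0.0, 0.0, NEG, 0.5],  # from I-LOC
]

path, score = viterbi(emissions, transitions)
print([tags[t] for t in path])  # ['B-PER', 'I-PER', 'O']
```

Taken word by word, the second token would be labeled I-LOC, which cannot follow B-PER. The transition scores make the decoder prefer I-PER instead, which is the structural correction the section describes. The two nested loops over tags inside the loop over positions give the O(n · T²) cost.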

Comparison with modern systems

To put the 2011 results in perspective, here are the same benchmarks with modern systems:

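| System | Year | POS (WSJ) | NER (CoNLL) | Architecture |
| --- | --- | --- | --- | --- |
| Collobert et al. | 2011 | 97.29 | 89.59 | Feedforward + Conv |
| ELMo | 2018 | 97.84 | 92.22 | Bidirectional LSTM |
| BERT-base | 2019 | 97.85 | 92.80 | Transformer (12 layers) |
| RoBERTa | 2019 | (not listed) | 93.11 | Transformer (24 layers) |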

The gap between 2011 and modern systems is surprisingly small on POS tagging (97.29 vs 97.85) but larger on NER (89.59 vs 93.11). The main improvements came from:

  1. Bidirectional context (ELMo, 2018): Using left-to-right AND right-to-left language models over the whole sentence instead of a fixed-size window
  2. Self-attention (BERT, 2019): Replacing convolution with attention allows each word to directly attend to any other word, regardless of distance
  3. Scale (RoBERTa, 2019): 160GB of pre-training text vs 2.5GB — 64x more data
  4. Subword tokenization (BPE): Handles rare words and morphology without explicit suffix features

But the foundational principles — learned embeddings, pre-training on unlabeled text, fine-tuning — are identical. Collobert et al. built the playbook; everyone else optimized the plays.

Error analysis: where neural fails

The paper honestly examines failure cases. The neural system struggles most on rare words, long-range dependencies, and nested entities.

These limitations were addressed by subsequent work: ELMo's contextual embeddings handle rare words better, attention mechanisms capture long-range dependencies, and span-based models handle nested entities. But identifying these limitations was itself a contribution — it showed the community exactly where to push next.

The enduring lesson: The paper's most important contribution wasn't any specific accuracy number — it was the methodology. Show that learned representations can match hand-crafted features. Show that pre-training on unlabeled text helps. Show that multiple tasks benefit from shared representations. This three-part methodology is now the default across ML: computer vision (ImageNet pre-training), speech (wav2vec), protein folding (ESM), code (CodeBERT). Every time you fine-tune a pre-trained model, you're following the recipe from Collobert et al.

```python
# Evaluation: token-level F1 for sequence labeling.
# Note: this counts individual tokens; it is not the span-level CoNLL metric.
from collections import defaultdict

def compute_f1(predictions, gold, ignore_tags=frozenset({'O'})):
    """Compute micro-averaged token-level F1, skipping tags in `ignore_tags`."""
    tp = defaultdict(int)  # correctly labeled tokens, keyed by gold tag
    fp = defaultdict(int)  # wrongly predicted tokens, keyed by predicted tag
    fn = defaultdict(int)  # missed tokens, keyed by gold tag

    for pred_seq, gold_seq in zip(predictions, gold):
        for p, g in zip(pred_seq, gold_seq):
            if g not in ignore_tags:
                if p == g:
                    tp[g] += 1
                else:
                    fn[g] += 1
            if p not in ignore_tags and p != g:
                fp[p] += 1

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    # The small epsilon avoids division by zero when a count is empty.
    precision = total_tp / (total_tp + total_fp + 1e-8)
    recall = total_tp / (total_tp + total_fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return f1
```

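
The function above scores individual tokens. CoNLL-style evaluation instead counts an entity as correct only when its full span and type both match. The following is a minimal sketch of that span-level variant, assuming BIO tags; the helper names `extract_spans` and `entity_f1` are illustrative, stray `I-` tags without a preceding `B-` are simply ignored, and this is not the official evaluation script.

```python
def extract_spans(tags):
    """Return the set of (start, end, type) spans in a BIO tag sequence."""
    spans, start, label = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel "O" closes a trailing span
        inside = tag.startswith("I-") and label == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, label))          # span covers tags[start:i]
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


def entity_f1(predictions, gold):
    """Micro-averaged F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for pred_seq, gold_seq in zip(predictions, gold):
        pred_spans, gold_spans = extract_spans(pred_seq), extract_spans(gold_seq)
        tp += len(pred_spans & gold_spans)
        fp += len(pred_spans - gold_spans)
        fn += len(gold_spans - pred_spans)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


gold_tags = [["B-PER", "I-PER", "O", "B-LOC"]]
pred_tags = [["B-PER", "O", "O", "B-LOC"]]  # person span cut short

print(entity_f1(pred_tags, gold_tags))  # 0.5
```

In this example the truncated person span counts as both a false positive and a false negative, so only the location is credited. Span-level scoring is therefore stricter than token-level scoring on partially correct entities.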
What was the most surprising finding in the benchmark results?

Chapter 7: Connections

The 2011 Collobert et al. paper is a bridge between traditional NLP and modern deep learning NLP. It proved the viability of three ideas that would come to dominate the field:

  1. Learned representations over hand-crafted features — now the default everywhere
  2. Pre-training on unlabeled text — the foundation of GPT, BERT, and all LLMs
  3. Multi-task learning with shared representations — used in T5, GPT, and modern multi-modal models

The lineage to modern NLP

| Paper/System | Year | What it inherited from Collobert et al. |
| --- | --- | --- |
| Word2Vec | 2013 | Learning embeddings from unlabeled text (simpler objective, same idea) |
| GloVe | 2014 | Pre-trained word vectors for downstream tasks |
| ELMo | 2018 | Contextual embeddings pre-trained on LM, fine-tuned per task |
| BERT | 2019 | Pre-train on unlabeled text, then fine-tune the same architecture on each NLP task |
| GPT-3/4 | 2020-23 | Pre-train at massive scale → emergent multi-task capability |
| T5 | 2020 | Unified architecture for all NLP tasks (text-to-text) |

The "Almost" in the title

Why "almost" from scratch? Because the paper still uses a few hand-crafted features that help significantly, such as capitalization indicators and word suffixes.

Modern systems like BERT and GPT eliminate even these features by using subword tokenization (BPE), which naturally captures morphology, and by using deep bidirectional context, which captures capitalization patterns implicitly. They are truly "from scratch" — but it took 7 more years to get there.

```python
# Modern "from scratch" NER with Hugging Face: zero feature engineering
from transformers import pipeline

# Load pre-trained model (no hand-crafted features needed)
ner = pipeline("ner", model="dslim/bert-base-NER")

# Run inference: input is raw text, output is one dict per entity token
result = ner("Barack Obama visited Paris yesterday")
# Illustrative, abbreviated output (each dict also carries fields such as
# 'score', 'start', and 'end'; exact tokens depend on the tokenizer):
# [{'entity': 'B-PER', 'word': 'Barack'},
#  {'entity': 'I-PER', 'word': 'Obama'},
#  {'entity': 'B-LOC', 'word': 'Paris'}]
# No gazetteers, no suffix rules, no POS tags
```

What Collobert et al. got right

The paradigm shift quantified

To appreciate the magnitude of the shift this paper initiated, consider the engineering effort for a single NLP task before and after:

| Aspect | Pre-2011 (feature engineering) | Post-2011 (neural) |
| --- | --- | --- |
| Feature design time | Months per task | Zero (learned) |
| Domain expertise needed | Linguistics PhD | ML engineer |
| Transfer to new task | Start from scratch | Fine-tune existing model |
| Transfer to new language | Redesign all features | Train on new data (same architecture) |
| Inference speed | ~1K words/sec | ~200K words/sec |

What they missed

The paper's lasting legacy: Before 2011, NLP was a feature engineering discipline. After 2011, it became a representation learning discipline. This paper — along with Bengio 2003 and Mikolov 2013 — proved that neural networks could learn linguistic features from raw text. The entire modern NLP ecosystem (Hugging Face, transformers, fine-tuning) traces back to this paradigm shift.

Which three ideas from this 2011 paper became foundational to modern NLP (GPT, BERT, T5)?