NLP (Almost) from Scratch

Chapter 0: The Feature Engineering Burden

It's 2011. You're building a system to extract named entities from text — finding that "Barack Obama" is a PERSON and "Washington D.C." is a LOCATION in "Barack Obama spoke in Washington D.C. today." The state-of-the-art approach requires you to design hundreds of hand-crafted features.

For each word, you manually compute features like: Is the first letter capitalized? Does it contain a digit? What is its suffix (-tion, -ing, -ed)? Is the previous word "Mr." or "Dr."? Does it appear in a gazetteer (list of known place names)? What part-of-speech tag does it have? Is it in a name dictionary?
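To make the burden concrete, here is a minimal sketch of what such a hand-written feature function might look like. Everything in it is illustrative: the tiny `GAZETTEER`, `TITLES`, and `SUFFIXES` collections and the `token_features` helper are assumptions standing in for the much larger resources a real system would use.

```python
# Hypothetical hand-crafted feature extractor in the pre-neural style.
# The gazetteer, title list, and suffix list are tiny made-up stand-ins.
GAZETTEER = {"washington", "paris", "london"}
TITLES = {"mr.", "dr.", "mrs.", "ms."}
SUFFIXES = ("tion", "ing", "ed")

def token_features(tokens, i):
    """Return a dict of indicator features for tokens[i]."""
    word = tokens[i]
    prev_word = tokens[i - 1].lower() if i > 0 else "<s>"
    # First matching suffix from the hand-picked list, or "none".
    suffix = next((s for s in SUFFIXES if word.lower().endswith(s)), "none")
    return {
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "suffix": suffix,
        "prev_is_title": prev_word in TITLES,
        "in_gazetteer": word.lower() in GAZETTEER,
    }

tokens = ["Dr.", "Obama", "spoke", "in", "Washington", "in", "2011"]
for i, tok in enumerate(tokens):
    print(tok, token_features(tokens, i))
```

Every key in that dictionary is a design decision someone had to make, test, and maintain, and a different task would need a different dictionary.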

This is feature engineering — the manual process of designing input representations. It is painstaking, task-specific, and brittle. Every new NLP task (POS tagging, chunking, NER, semantic role labeling) requires its own bespoke feature set, designed by a domain expert who understands both the linguistics and the machine learning algorithm.

Traditional NLP: Feature Engineering for Each Task

Each NLP task required its own hand-designed feature pipeline, and each feature set was specialized to its task.

The vision: What if a neural network could learn its own features? What if, instead of manually designing hundreds of indicators per task, you could give the network raw words and let it figure out what matters? That would mean: (1) a single architecture for all NLP tasks, (2) features that improve automatically with more data, and (3) features that transfer between tasks. Collobert et al. showed this is possible — and the features the network learns are competitive with decades of hand-engineering.

Four NLP tasks, one paper

The paper tackles four core NLP tasks simultaneously:

Task	What it does	Example
POS tagging	Label each word's part of speech	"The/DT cat/NN sat/VBD"
Chunking	Group words into phrases	"[The cat]_NP [sat]_VP [on the mat]_PP"
NER	Find named entities	"[Obama]_PER visited [Paris]_LOC"
SRL	Who did what to whom	"[Obama]_A0 [visited]_V [Paris]_A1"

Before this paper, each task had its own research community, its own benchmark, and its own feature engineering pipeline. The idea that a single neural network could handle all four was radical.

What makes these tasks hard?

Each task requires different types of linguistic knowledge:

POS tagging requires understanding syntax — "running" is a verb in "She is running" but a noun in "running water."
Chunking requires grouping words into phrases — "The big red dog" is one noun phrase, not four separate words.
NER requires world knowledge — "Apple" is a company in "Apple released a product" but a fruit in "Apple pie is delicious."
SRL requires understanding deep sentence structure — in "The ball was kicked by John," John is the agent (A0) even though he's not the grammatical subject.

Traditional systems had separate feature sets because each task seemed to require fundamentally different information. The paper's key claim: a single learned representation can capture all of this, because these tasks share underlying linguistic structure.

Traditional pipeline vs end-to-end

In traditional NLP, tasks were solved in a pipeline: first POS tag, then parse, then use parse features for NER, then use NER + parse features for SRL. Each stage depends on the previous one, so errors cascade: a POS tagging mistake can mislead the parser, and a bad parse leaves NER and SRL working from corrupted features.

Traditional Pipeline

POS → Parse → NER → SRL (errors cascade)

vs.

Collobert et al.

Shared Embeddings → Independent Task Heads (no cascade)

The neural approach eliminates pipeline errors because each task operates directly on the raw input through the shared embeddings. A mistake in POS tagging doesn't affect NER because NER doesn't use POS tag features — it learns its own features from the same raw words.

This independence between tasks is both a strength and a limitation. It prevents error cascading but also prevents tasks from helping each other at inference time.

The paper's multi-task training addresses this at the representation level (shared embeddings learn from all tasks), but not at the prediction level (each task still makes independent predictions). Modern systems like joint models and end-to-end parsers have since addressed this gap.

This "end-to-end" approach — replacing multi-stage pipelines with single neural networks that learn their own intermediate representations — became the dominant paradigm across all of deep learning. We see the same pattern in computer vision (replacing SIFT + SVM with end-to-end CNNs), speech recognition (replacing HMM-GMM pipelines with end-to-end CTC models), machine translation (replacing phrase-based SMT with sequence-to-sequence models), and robotics (replacing perception + planning + control pipelines with end-to-end learned policies).

What is the fundamental problem with the traditional approach to NLP that this paper addresses?

Each NLP task requires its own hand-crafted features designed by domain experts: this is slow, task-specific, and the features don't transfer between tasks
Traditional NLP systems are too slow to run in real time
Traditional NLP only works for English text

Chapter 1: The Unified Architecture

Collobert et al. propose a single neural network architecture for all four NLP tasks. The architecture has four stages, each building on the previous:

1. Lookup Table

Words → dense vectors (embeddings). Each word index gets mapped to a d-dimensional vector from a learned table. Shape: vocab_size × d

↓

2. Feature Extraction

Window-based (concat embeddings of nearby words) or Sentence-based (1D convolution + max pooling over entire sentence)

↓

3. Hidden Layers

One or more linear layers with HardTanh activation: HardTanh(x) = max(−1, min(1, x))

↓

4. Tag Scoring

Linear layer outputs one score per possible tag. For word-level tasks (POS, NER), use a window around each word. For sentence-level (SRL), use the whole sentence.

The genius is in the simplicity. No parse trees. No gazetteers. No POS tag features. No suffix lists. Just raw words in, tag predictions out. The network learns whatever intermediate representations it needs.

The radical claim: Before this paper, the NLP community assumed that hand-crafted linguistic features were essential: that without parse trees, gazetteers, and morphological analyzers, competitive performance was impossible. Collobert et al. challenged that assumption. A simple four-stage neural network, starting from raw word indices, could compete with systems built on decades of feature engineering. This was the beginning of the end for the "feature engineering era" of NLP.

The Unified NLP Architecture

Data flows from raw words through the four stages: lookup table, feature extraction (window or convolution), hidden layers, and tag scoring.

Data flow with shapes

Let's trace the exact shapes through the window approach (used for POS tagging):

Stage	Input shape	Output shape	Parameters
Lookup table	window_size word indices	(window_size × d)	V × d (embedding matrix)
Concat + Linear	(window_size × d,)	(n_hidden,)	(window_size × d) × n_hidden
HardTanh	(n_hidden,)	(n_hidden,)	0
Linear	(n_hidden,)	(n_tags,)	n_hidden × n_tags

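With V = 130,000 (vocabulary size), d = 50 (embedding dim), window_size = 5, n_hidden = 300, n_tags = 45 (POS tags), and counting weights only (the 300 + 45 bias terms are omitted):

Total parameters = 130,000 × 50 + 250 × 300 + 300 × 45 = 6,500,000 + 75,000 + 13,500 = 6,588,500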

Almost all of these parameters sit in the embedding table: the rest of the network has only 88,500 weights, which is compact next to feature-engineered systems that used millions of indicator features. And the 130,000 × 50 embedding matrix is shared across all tasks.
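The arithmetic is easy to check in a few lines. This is a small sketch; the function name `window_tagger_params` and its `biases` flag are illustrative choices, not anything from the paper.

```python
def window_tagger_params(vocab_size, embed_dim, window_size, n_hidden, n_tags,
                         biases=False):
    """Count the parameters of the window network, layer by layer."""
    embedding = vocab_size * embed_dim
    hidden = window_size * embed_dim * n_hidden
    output = n_hidden * n_tags
    if biases:
        hidden += n_hidden   # one bias per hidden unit
        output += n_tags     # one bias per tag score
    return {
        "embedding": embedding,
        "hidden": hidden,
        "output": output,
        "total": embedding + hidden + output,
    }

counts = window_tagger_params(130_000, 50, 5, 300, 45)
print(counts["total"])                          # 6588500
print(counts["embedding"] / counts["total"])    # fraction held by the embedding table
```

Passing `biases=True` adds the 345 bias terms, which barely moves the total.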

Why HardTanh instead of Sigmoid or ReLU?

The paper uses HardTanh as the activation function — not sigmoid, not ReLU (which hadn't yet become standard in 2008 when the work was done). HardTanh is a piecewise linear approximation of tanh:

HardTanh(x) = −1 if x < −1, x if −1 ≤ x ≤ 1, +1 if x > 1

It has two advantages over sigmoid: (1) its outputs are zero-centered (range [−1, 1] instead of [0, 1]), which helps gradient flow by preventing the all-positive-gradients problem, and (2) its gradient is exactly 1 in the active region, avoiding the 0.25 maximum of sigmoid's derivative. It's faster to compute than tanh since it uses no exponentials.
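A plain-Python sketch makes the comparison tangible. The helper names are illustrative, and the gradient at exactly ±1 (where HardTanh is not differentiable) is set to 0 here as an arbitrary convention.

```python
import math

def hardtanh(x):
    """Clamp x to the range [-1, 1]."""
    return max(-1.0, min(1.0, x))

def hardtanh_grad(x):
    """Slope is 1 inside (-1, 1) and 0 outside; 0 at the kinks by convention."""
    return 1.0 if -1.0 < x < 1.0 else 0.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)

for x in (-3.0, -0.5, 0.0, 0.3, 2.5):
    print(x, hardtanh(x), hardtanh_grad(x), round(sigmoid_grad(x), 4))
```

The sigmoid slope peaks at 0.25 (at x = 0) and shrinks from there, while HardTanh passes gradients through unchanged anywhere inside its linear region and blocks them entirely once it saturates.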

The loss function: log-likelihood with Viterbi decoding

For word-level tasks, the paper uses two loss functions:

Word-level log-likelihood: Treat each word independently. The score for tag t at position i is f_θ(i, t). The loss is the negative log of the softmax probability of the correct tag.
Sentence-level log-likelihood: Model tag transitions with an additional transition matrix A, where A_ij is the score for transitioning from tag i to tag j. Use Viterbi decoding at test time to find the best tag sequence globally. This is essentially a neural CRF — a conditional random field with neural features.

The sentence-level approach outperforms word-level on all tasks because it can learn to penalize invalid tag sequences (e.g., B-PER can be followed by I-PER, O, or the start of a new entity, but never by I-LOC). This is a form of structured prediction: the model learns not just which tags are likely for each word, but which tag sequences are plausible.

python
# Viterbi decoding for finding the best tag sequence
def viterbi_decode(scores, transitions):
    """Find best tag sequence using dynamic programming."""
    n_words, n_tags = scores.shape
    dp = scores[0].clone()  # best score ending in each tag
    backpointers = []

    for t in range(1, n_words):
        best_scores, best_tags = (dp.unsqueeze(1) + transitions).max(dim=0)
        dp = best_scores + scores[t]
        backpointers.append(best_tags)

    # Trace back from best final tag
    best_path = [dp.argmax().item()]
    for bp in reversed(backpointers):
        best_path.append(bp[best_path[-1]].item())
    return list(reversed(best_path))
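Viterbi covers decoding; training with the sentence-level log-likelihood additionally needs the log of the sum over all tag sequences. The following is a minimal plain-Python sketch of that loss using the forward recursion. It assumes the same conventions as the Viterbi code above (`transitions[i][j]` scores moving from tag i to tag j, and no separate start scores), uses nested lists rather than tensors, and all function names are illustrative.

```python
import math
from itertools import product

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def path_score(scores, transitions, tags):
    """Score of one tag sequence: per-word scores plus transition scores."""
    total = scores[0][tags[0]]
    for t in range(1, len(tags)):
        total += transitions[tags[t - 1]][tags[t]] + scores[t][tags[t]]
    return total

def log_partition(scores, transitions):
    """log of the sum of exp(path_score) over every possible tag sequence."""
    n_tags = len(scores[0])
    alpha = list(scores[0])  # log-sum of all paths ending in each tag
    for t in range(1, len(scores)):
        alpha = [
            logsumexp([alpha[i] + transitions[i][j] for i in range(n_tags)])
            + scores[t][j]
            for j in range(n_tags)
        ]
    return logsumexp(alpha)

def sentence_nll(scores, transitions, gold_tags):
    """Negative log-likelihood of the gold tag sequence."""
    return log_partition(scores, transitions) - path_score(scores, transitions, gold_tags)

scores = [[2.0, 0.5, 0.1], [0.3, 1.5, 0.2], [0.1, 0.4, 1.8]]        # 3 words x 3 tags
transitions = [[0.5, 1.0, -1.0], [-0.5, 0.8, 0.2], [0.0, -2.0, 0.6]]
loss = sentence_nll(scores, transitions, [0, 1, 2])

# Sanity check: brute-force enumeration of all 27 sequences gives the same normalizer.
brute = logsumexp([path_score(scores, transitions, p)
                   for p in product(range(3), repeat=3)])
print(loss, abs(brute - log_partition(scores, transitions)) < 1e-9)
```

The recursion has the same shape as Viterbi with `max` replaced by `logsumexp`, which is why the two are usually implemented side by side.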

python
import torch
import torch.nn as nn

class WindowTagger(nn.Module):
    """Window approach for POS tagging, chunking, NER."""
    def __init__(self, vocab_size, embed_dim, window_size, hidden_dim, n_tags):
        super().__init__()
        self.window = window_size
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.linear1 = nn.Linear(window_size * embed_dim, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, n_tags)
        self.hardtanh = nn.Hardtanh()

    def forward(self, word_indices):
        # word_indices: (batch, window_size) — indices of words in window
        x = self.embed(word_indices)          # (batch, win, d)
        x = x.view(x.size(0), -1)              # (batch, win*d)
        x = self.hardtanh(self.linear1(x))     # (batch, hidden)
        x = self.linear2(x)                     # (batch, n_tags)
        return x

# Instantiate for POS tagging
model = WindowTagger(
    vocab_size=130000,
    embed_dim=50,
    window_size=5,
    hidden_dim=300,
    n_tags=45
)

What are the four stages of Collobert et al.'s unified NLP architecture?

Lookup table (words → embeddings), feature extraction (window or convolution), hidden layers (with HardTanh), and tag scoring (one score per tag)
Tokenization, parsing, feature extraction, classification
Encoder, attention, decoder, output

Chapter 2: Window vs Sentence Approach

The paper proposes two variants of the architecture, designed for different types of NLP tasks. The choice depends on how much context the task requires.

Window approach

For tasks where local context is sufficient — like POS tagging, chunking, and NER — the network looks at a fixed-size window of words centered on the target word. If the window size is k_sz = 5, the network sees the target word plus 2 words before and 2 words after.

The window is simply concatenated: if each word embedding has dimension d = 50 and the window is 5 words, the input to the first hidden layer is a 250-dimensional vector.

input = [embed(w_t-2); embed(w_t-1); embed(w_t); embed(w_t+1); embed(w_t+2)]

This is fast and simple but has a critical limitation: the network cannot see beyond the window. If the answer depends on a word 10 positions away (which sometimes happens in NER — "In the state of New York, the governor..."), the window approach misses it.

Padding at sentence boundaries

At the beginning and end of a sentence, the window extends beyond the sentence. The paper handles this with padding: a special padding token with its own learned embedding is repeated at both ends of the sentence, playing the role of "start" and "end" symbols. This boundary embedding learns to encode the fact that the target word is near the beginning or end of a sentence, which is itself useful information (e.g., the first word of a sentence is more likely to be a subject).

python
# Window extraction with padding
def extract_windows(sentence, window_size, pad_idx):
    """Extract a window of word indices around each position."""
    half = window_size // 2
    padded = [pad_idx] * half + sentence + [pad_idx] * half
    windows = []
    for i in range(len(sentence)):
        windows.append(padded[i:i + window_size])
    return windows

# Example: "The cat sat" with window=5
# Position 0 ("The"): [PAD, PAD, The, cat, sat]
# Position 1 ("cat"): [PAD, The, cat, sat, PAD]
# Position 2 ("sat"): [The, cat, sat, PAD, PAD]

Sentence approach

For Semantic Role Labeling (SRL), where the network needs to understand the full sentence structure (who did what to whom), a window is not enough. The sentence approach uses 1D convolution over the entire sentence, followed by max pooling to extract a fixed-size representation regardless of sentence length.

Embed each word

Sentence of n words → n vectors of dimension d

↓

1D convolution

Slide a filter of width k over the sequence → one feature per position per filter (n positions if the sentence is padded, n − k + 1 otherwise)

↓

Max pooling over time

Take the max of each filter over all positions → fixed-size vector

↓

Hidden + scoring

Standard linear → HardTanh → linear → tag scores

Why max pooling? Max pooling acts as a "did this pattern appear anywhere?" detector. If filter #17 detects the pattern "has been [verb]-ing" (a progressive construction), max pooling asks: "Does this pattern appear anywhere in the sentence?" The answer is a single number, independent of sentence length. This is how the network converts variable-length sentences into fixed-size representations.

Understanding 1D convolution on text

A 1D convolutional filter of width k operates on k consecutive word embeddings. Think of it as a pattern detector that slides across the sentence:

At position 1: processes [word_1, word_2, ..., word_k]
At position 2: processes [word_2, word_3, ..., word_{k+1}]
At position n−k+1: processes [word_{n−k+1}, ..., word_n]

Each filter produces one number per position — how strongly the pattern matches at that location. With 300 filters, we get 300 features per position, each detecting a different local pattern. The max pool then selects the strongest match for each filter across all positions.
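The sliding and pooling can be written out by hand in plain Python. This is a sketch under simplifying assumptions: no padding (so a sentence of n words yields n − k + 1 positions), no bias terms, random filter weights, and illustrative function names.

```python
import random

def conv1d_valid(embeddings, filters, width):
    """Slide each filter over consecutive groups of `width` word vectors.

    embeddings: list of n vectors, each of length d
    filters:    list of weight vectors, each of length width * d
    Returns n - width + 1 positions, each holding one feature per filter.
    """
    outputs = []
    for start in range(len(embeddings) - width + 1):
        # Concatenate the word vectors under the filter into one flat vector.
        window = [x for vec in embeddings[start:start + width] for x in vec]
        outputs.append([sum(w * x for w, x in zip(f, window)) for f in filters])
    return outputs

def max_over_time(features):
    """For each filter, keep its strongest response across all positions."""
    return [max(column) for column in zip(*features)]

random.seed(0)
d, width, n_filters = 4, 3, 6
filters = [[random.uniform(-1, 1) for _ in range(width * d)] for _ in range(n_filters)]

for n in (5, 12):
    sentence = [[random.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
    feats = conv1d_valid(sentence, filters, width)
    pooled = max_over_time(feats)
    print(n, len(feats), len(pooled))
```

Both sentences end up as a vector of length `n_filters`, even though the number of positions differs.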

This architecture is a precursor to the 1D CNNs used in Kim (2014) for text classification, which became extremely popular before Transformers replaced them. The key limitation: after max pooling, the representation captures which patterns appear but not where they appear relative to each other. For tasks requiring word-order sensitivity (like SRL), this is a significant weakness. Transformers solve this with positional encoding and self-attention.

python
import torch
import torch.nn as nn

# 1D convolution on text — the sentence approach
embed_dim = 50
n_filters = 300
filter_width = 5

# Create a 1D conv layer
conv = nn.Conv1d(embed_dim, n_filters, filter_width, padding=filter_width//2)

# Example: batch of 4 sentences, each 20 words, 50d embeddings
x = torch.randn(4, 20, 50)  # (batch, seq, embed)
x = x.transpose(1, 2)        # (batch, embed, seq) — Conv1d expects this

features = conv(x)            # (4, 300, 20) — 300 features per position
pooled, _ = features.max(dim=2)  # (4, 300) — max over time

print(f"Per-position features: {features.shape}")  # [4, 300, 20]
print(f"After max pool: {pooled.shape}")          # [4, 300]
# Variable sentence length → fixed 300d representation

Window vs Sentence Approach

Left: the window approach sees only nearby words. Right: the sentence approach (convolution + max pool) sees the entire sentence, so it can capture long-range dependencies.

When to use which

Approach	Context	Best for	Speed
Window	k words around target	POS, Chunking, NER	Very fast
Sentence	Entire sentence	SRL	Slower (conv + pool)

python
import torch
import torch.nn as nn

class SentenceTagger(nn.Module):
    """Sentence approach with 1D convolution for SRL."""
    def __init__(self, vocab_size, embed_dim, n_filters, filter_width, hidden_dim, n_tags):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1D conv: embed_dim input channels, n_filters output channels
        self.conv = nn.Conv1d(embed_dim, n_filters, filter_width,
                              padding=filter_width//2)
        self.linear1 = nn.Linear(n_filters, hidden_dim)
        self.linear2 = nn.Linear(hidden_dim, n_tags)
        self.hardtanh = nn.Hardtanh()

    def forward(self, word_indices):
        # word_indices: (batch, seq_len)
        x = self.embed(word_indices)             # (batch, seq, d)
        x = x.transpose(1, 2)                    # (batch, d, seq) for Conv1d
        # No nonlinearity before pooling, matching the stage list above:
        # conv -> max pool -> linear -> HardTanh -> linear
        x = self.conv(x)                         # (batch, n_filters, seq)
        x, _ = x.max(dim=2)                      # (batch, n_filters), max pool over time
        x = self.hardtanh(self.linear1(x))       # (batch, hidden)
        x = self.linear2(x)                      # (batch, n_tags)
        return x

Position features for SRL

For Semantic Role Labeling, the sentence approach needs to know which word is the target verb. The paper adds a relative position feature: for each word, it computes the distance to the target verb and looks up this distance in a position embedding table.

pos_feat(w_i) = LT_pos(i − verb_position)

So if the verb is at position 4 and we're looking at word 2, the position feature is LT_pos(−2). Word 6 gets LT_pos(+2). This gives the network a sense of structure relative to the verb — which is essential for SRL where the role of a word (agent, patient, instrument) depends heavily on its position relative to the predicate.
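As a sketch, the relative position and its lookup-table index could be computed like this. Clipping long distances to a maximum and shifting them to non-negative indices are assumptions made here so the table stays at about 100 rows; `relative_positions` and `max_dist` are illustrative names.

```python
def relative_positions(seq_len, verb_position, max_dist=50):
    """Return (signed distance, table index) for every word in the sentence.

    Distances are clipped to [-max_dist, max_dist] and shifted by max_dist,
    so indices run from 0 to 2 * max_dist (a table of 2 * max_dist + 1 rows).
    """
    result = []
    for i in range(seq_len):
        dist = i - verb_position
        clipped = max(-max_dist, min(max_dist, dist))
        result.append((dist, clipped + max_dist))
    return result

# Verb at position 4: word 2 is at distance -2, word 6 at distance +2.
for i, (dist, index) in enumerate(relative_positions(8, verb_position=4)):
    print(i, dist, index)
```

The index is what would be fed to the position lookup table; the signed distance is shown only for readability.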

This is a precursor to the positional encodings used in Transformers (Vaswani et al., 2017), though the Transformer version is absolute (position in the sentence) rather than relative (distance to a reference word).

Why does the paper use max pooling after convolution in the sentence approach?

Max pooling converts variable-length convolution outputs into a fixed-size vector by asking "does this pattern appear anywhere in the sentence?", making the representation length-independent
Max pooling reduces computational cost
Max pooling prevents overfitting

Chapter 3: The Embedding Layer

The embedding layer — the lookup table — is the paper's most influential contribution. While the idea of word embeddings existed before (Bengio et al.'s neural language model, 2003), Collobert et al. demonstrated two critical properties:

Embeddings learned for one task transfer to other tasks. Embeddings trained on POS tagging improve NER, and vice versa.
Embeddings pre-trained on unlabeled text (via a language model objective) improve all downstream tasks. This is the ancestor of modern pre-training.

The lookup table is simply a matrix W ∈ R^{|V|×d} with one row per word (the paper writes the transpose, d × |V|, with one column per word). Given a word index i, the embedding is the i-th row: LT_W(i) = W_i. This is mathematically equivalent to multiplying a one-hot vector by the embedding matrix.
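The one-hot equivalence can be verified directly. This sketch assumes the table is stored with one embedding per row (the transpose of a d × |V| layout) and uses a tiny made-up table; the helper names are illustrative.

```python
def one_hot(index, size):
    """Vector of zeros with a single 1 at `index`."""
    return [1 if k == index else 0 for k in range(size)]

def vector_times_matrix(vec, matrix):
    """Multiply a length-V row vector by a V x d matrix."""
    n_cols = len(matrix[0])
    return [sum(vec[r] * matrix[r][c] for r in range(len(matrix)))
            for c in range(n_cols)]

# Toy table: 4 words, 3 dimensions.
table = [
    [0.1, 0.2, 0.3],
    [0.4, 0.5, 0.6],
    [0.7, 0.8, 0.9],
    [1.0, 1.1, 1.2],
]

i = 2
by_lookup = table[i]
by_matmul = vector_times_matrix(one_hot(i, len(table)), table)
print(by_lookup == by_matmul)
```

In practice nobody performs the multiplication: indexing is O(d) while the product is O(|V| · d), which matters when |V| is 130,000.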

Multiple feature types

The paper doesn't just embed words. It also embeds additional features — each with its own lookup table:

Feature	Vocabulary size	Embedding dim	Purpose
Word	~130,000	50	Semantic/syntactic meaning
Capitalization	4 (allLower, allUpper, firstUpper, mixed)	5	"Obama" vs "the"
Word suffix	~2,000 (2-char suffixes)	5	Morphology cues from the last two characters ("-ed", "-ng", "-ly")
Relative position (SRL)	~100	5	Distance to target verb

The final embedding for a word is the concatenation of all its feature embeddings:

embed(w) = [LT_word(w); LT_caps(caps(w)); LT_suffix(suf(w))]

dim = 50 + 5 + 5 = 60
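Before any lookup happens, each word has to be mapped to its discrete feature values. The sketch below shows one plausible way to do that. The exact rules are assumptions: words are lowercased for the word table, tokens without letters count as allLower, and the suffix is simply the last two characters.

```python
CAPS_CLASSES = ["allLower", "allUpper", "firstUpper", "mixed"]

def caps_class(word):
    """Classify a word's capitalization pattern into one of four classes."""
    letters = [c for c in word if c.isalpha()]
    if not letters or all(c.islower() for c in letters):
        return "allLower"
    if all(c.isupper() for c in letters):
        return "allUpper"
    if letters[0].isupper() and all(c.islower() for c in letters[1:]):
        return "firstUpper"
    return "mixed"

def discrete_features(word):
    """Return the keys used for the word, caps, and suffix lookup tables."""
    return {
        "word": word.lower(),
        "caps": CAPS_CLASSES.index(caps_class(word)),
        "suffix": word.lower()[-2:],
    }

for w in ["Obama", "the", "NASA", "iPhone", "walked"]:
    print(w, discrete_features(w))
```

Lowercasing the word while keeping a separate capitalization feature lets "Apple" and "apple" share one word vector without losing the case signal.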

Design decision — why concatenate, not add? Adding embeddings (like modern Transformers do with position encodings) assumes the features live in the same space. Concatenation keeps each feature in its own subspace, giving the hidden layers maximum flexibility to combine them. With only a few small feature types, concatenation is practical — the dimensionality increase is modest.

Multiple Embedding Lookup Tables

Each word gets multiple embeddings concatenated: word (50d), capitalization (5d), suffix (5d). The total input dimension is the sum of all embedding dimensions.

Training the embeddings

The embedding matrices are initialized randomly and updated by backpropagation along with all other network weights. The key insight: because the embedding matrix is shared across all positions in the window (the same matrix is used to look up each word), the gradient signal from every word in every window updates the same matrix. This means a word's embedding learns from every window in which the word appears, whether it is the target word or part of the surrounding context.

How embedding gradients work

The gradient for a word embedding is particularly intuitive. During a forward pass, the embedding lookup selects row i from the matrix. During the backward pass, the gradient flows back to only that row. If word "cat" (index 42) appears in the current training example, only row 42 of the embedding matrix gets a gradient update — all other rows receive zero gradient.

This means rare words get fewer gradient updates than common words. The paper doesn't address this directly, but later work tackled the imbalance: Word2Vec subsamples frequent words, and GloVe caps the weight given to very frequent co-occurrences.

∂L/∂W[i, :] = ∂L/∂embed(i) (only row i gets updated)
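A toy SGD loop shows both effects: untouched rows stay exactly as they were, and frequent words accumulate more updates. This is a plain-Python sketch with made-up gradients; `sgd_update_rows` is an illustrative helper, and a real framework would do the same thing through sparse or indexed gradient updates.

```python
from collections import Counter

def sgd_update_rows(table, word_indices, grads, lr=0.1):
    """Apply per-example SGD to an embedding table stored as a list of rows.

    word_indices[k] is the word looked up for example k and grads[k] is the
    loss gradient with respect to that looked-up vector. Rows that were never
    looked up receive no update at all.
    """
    updates = Counter()
    for idx, grad in zip(word_indices, grads):
        table[idx] = [w - lr * g for w, g in zip(table[idx], grad)]
        updates[idx] += 1
    return updates

table = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
# Word 1 appears three times, word 3 once, words 0 and 2 never.
updates = sgd_update_rows(table, [1, 1, 3, 1], [[1.0, 0.0]] * 4)
print(dict(updates))        # {1: 3, 3: 1}
print(table[0], table[2])   # [0.0, 0.0] [2.0, 2.0]
```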

With a vocabulary of 130,000 words and a training set of millions of sentences, each word gets thousands of gradient updates over training — enough to learn useful representations even for moderately rare words.

Very rare words (appearing fewer than 5 times) still get poor embeddings. The paper handles this by mapping all rare words to a special "RARE" token with its own learned embedding. This is a crude solution — all rare words get the same embedding, which throws away what little information we have about them.
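A minimal sketch of that rare-word mapping, using the threshold of 5 mentioned above. The `build_vocab` and `encode` helpers and the toy corpus are illustrative assumptions, not the paper's preprocessing code.

```python
from collections import Counter

def build_vocab(sentences, min_count=5, rare_token="RARE"):
    """Give an index to every word seen at least min_count times; index 0 is RARE."""
    counts = Counter(word for sentence in sentences for word in sentence)
    vocab = {rare_token: 0}
    for word, count in counts.most_common():
        if count >= min_count:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab, rare_token="RARE"):
    """Map words to indices, sending anything outside the vocabulary to RARE."""
    return [vocab.get(word, vocab[rare_token]) for word in sentence]

corpus = [["the", "cat", "sat"]] * 5 + [["the", "magnetohydrodynamics", "sat"]]
vocab = build_vocab(corpus)
print(vocab)
print(encode(["the", "magnetohydrodynamics", "sat"], vocab))
print(encode(["the", "dog", "sat"], vocab))
```

Both "magnetohydrodynamics" (seen once) and "dog" (never seen) collapse to index 0, which is exactly the information loss described above.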

Modern systems solve this much more elegantly with subword tokenization (BPE), using vocabulary sizes of 32,000-100,000 subword tokens. Even if the word "magnetohydrodynamics" never appeared in training, a tokenizer might split it into subwords such as "magnet" + "o" + "hydro" + "dynamics", each with a well-trained embedding, and these compose to a reasonable representation.

This is one area where the 2011 architecture shows its age — character-level and subword models (Bojanowski et al., 2017; Sennrich et al., 2016) were needed to handle the long tail of rare words properly.

python
# Multiple embedding lookup tables
import torch
import torch.nn as nn

class MultiEmbedding(nn.Module):
    def __init__(self):
        super().__init__()
        self.word_embed = nn.Embedding(130000, 50)  # vocabulary
        self.caps_embed = nn.Embedding(4, 5)        # capitalization patterns
        self.suf_embed  = nn.Embedding(2000, 5)     # 2-char suffixes

    def forward(self, word_ids, cap_ids, suf_ids):
        # All three inputs are integer tensors of shape (batch, seq)
        w = self.word_embed(word_ids)    # (batch, seq, 50)
        c = self.caps_embed(cap_ids)    # (batch, seq, 5)
        s = self.suf_embed(suf_ids)     # (batch, seq, 5)
        return torch.cat([w, c, s], dim=-1)  # (batch, seq, 60)

Embedding dimension: the 50-dimensional sweet spot

Why 50 dimensions? The paper doesn't report an extensive hyperparameter search, but the choice reflects a trade-off:

Too small (5-10d): Not enough capacity to encode the semantic and syntactic properties of 130,000 words. Nearby words in the embedding space may be unrelated.
50d (paper's choice): Sufficient to capture meaningful word relationships while keeping the total parameter count manageable. The embedding matrix is 130,000 × 50 = 6.5M parameters.
300d (Word2Vec default): Captures finer-grained relationships. Word2Vec later showed that 300d embeddings outperform 50d on analogy tasks. But 300d × 130,000 = 39M parameters in the embedding matrix alone.
768-1024d (BERT/GPT): Modern models use even higher dimensions, but they also have much deeper architectures that can exploit the extra capacity.

The 2011 choice of 50d was appropriate for the architecture depth (1-2 hidden layers) and the available compute. With more layers to process the embeddings, higher dimensions become useful.

Why does the paper concatenate multiple embedding types (word, caps, suffix) rather than using word embeddings alone?

Each feature type captures different information: the word embedding captures semantics, capitalization captures entity cues (Obama vs obama), and suffixes capture morphology (-ed = past tense); concatenation gives the network access to all of these signals
Concatenation makes the network run faster
Multiple embeddings prevent overfitting

Chapter 4: Multi-Task Learning

With a single architecture for all four tasks, a natural question arises: can training on multiple tasks simultaneously improve performance on each? The answer is yes — and the mechanism is shared embeddings.

The multi-task setup works like this: the embedding layer is shared across all tasks. Each task has its own hidden layer and scoring layer. During training, we alternate between tasks — one mini-batch of POS tagging, one mini-batch of NER, one of chunking, and so on. The task-specific layers get gradients only from their own task, but the shared embedding layer gets gradients from all tasks.

Shared Embeddings

One lookup table trained by ALL tasks. Each task's gradients improve the shared representations.

↓ ↓ ↓ ↓

POS Head

Hidden + 45 tags

Chunk Head

Hidden + 23 tags

NER Head

Hidden + 9 tags

SRL Head

Hidden + 114 tags

Why multi-task helps: POS tagging teaches the embeddings that "running" is a verb (or gerund). NER teaches them that capitalized words at sentence starts might be entities. Chunking teaches them about phrase boundaries. Each task provides a different training signal that enriches the shared embeddings. The result: embeddings that capture a richer set of linguistic properties than any single task could provide alone.

Training protocol

The paper uses a simple alternating strategy:

Select a task t uniformly at random
Sample a mini-batch from task t's training data
Forward pass through shared embeddings + task t's head
Backpropagate through task t's head AND the shared embeddings
Repeat

No fancy multi-task weighting or gradient balancing — just random alternation. The simplicity is part of the paper's appeal.

Modern multi-task learning is more sophisticated. Today's systems use techniques like: (1) gradient normalization to balance gradient magnitudes across tasks, (2) task-specific learning rate schedules, (3) uncertainty weighting where task weights are learned based on homoscedastic uncertainty (Kendall et al., 2018). But Collobert et al.'s simple alternation remains a reasonable baseline that often performs surprisingly well.

Multi-Task Training: Shared Embedding Improvement

Training on multiple tasks simultaneously improves the shared embeddings: each task's gradient signal enriches different aspects of them.

Results

Task	Single-task score	Multi-task score	Improvement (points)
POS	97.12%	97.20%	+0.08
Chunking	93.37%	93.63%	+0.26
NER	87.58%	88.67%	+1.09
SRL	73.54%	74.29%	+0.75

POS is scored by per-word accuracy; chunking, NER, and SRL are scored by F1.

The biggest improvement is on NER (+1.09 points), which has the least training data. Multi-task learning acts as a form of regularization and data augmentation: the shared embeddings benefit from the combined data of all tasks.

Why NER benefits most

NER has the least labeled training data among the four tasks. With sparse data, the embeddings for rare entity names (company names, locations, person names) receive few gradient updates from NER alone. But these same words appear frequently in POS tagging and chunking data, where they receive more updates. Multi-task learning transfers these updates to the shared embeddings, effectively providing more training signal for the rare but important words in NER.

This insight generalizes: multi-task learning helps most on low-resource tasks that share representations with high-resource tasks. Today, we see the same principle in large pre-trained models: fine-tuning GPT-3 on a small dataset works because the pre-training provided billions of gradient updates to the shared representation.

The gradient flow of multi-task learning

From a gradient perspective, multi-task learning works because each task provides a different "view" of what the embeddings should encode. POS tagging gradients push "running" and "jumping" closer together (both are verbs). NER gradients push "Obama" and "Clinton" closer together (both are person names). Chunking gradients push "the" and "a" closer together (both are determiners). These diverse signals produce embeddings that encode a richer set of properties than any single task.

∇_embed L_total = ∇_embed L_POS + ∇_embed L_NER + ∇_embed L_chunk + ∇_embed L_SRL

Each individual training step uses only one task's gradient, but averaged over many alternating steps the embedding update is proportional to the sum of the gradients from all tasks. Each task "pulls" the embedding vectors in directions useful for its own objective. The resulting embeddings find a compromise position that works reasonably well for all tasks, and often better than any single-task optimum, because the multi-task signal acts as a regularizer against overfitting to any one task's idiosyncrasies.

python
# Simulating multi-task gradient flow
import torch
import torch.nn as nn

# Shared embedding, two task heads
embed = nn.Embedding(1000, 50)
head_pos = nn.Linear(250, 45)   # POS tagging
head_ner = nn.Linear(250, 9)    # NER

# Train on POS batch
words = torch.randint(0, 1000, (32, 5))
e = embed(words).view(32, -1)
pos_loss = head_pos(e).sum()
pos_loss.backward()
# embed.weight.grad now has POS signal

# Train on NER batch (gradients accumulate!)
e2 = embed(words).view(32, -1)
ner_loss = head_ner(e2).sum()
ner_loss.backward()
# embed.weight.grad now has POS + NER signal

Negative transfer: when multi-task hurts

Multi-task learning doesn't always help. If tasks are too dissimilar (e.g., sentiment analysis and machine translation), shared representations may compromise — getting worse at both tasks. The paper avoids this because POS tagging, chunking, NER, and SRL are all syntactic-semantic tasks that require similar linguistic knowledge. The shared embeddings naturally encode features useful for all four.

python
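import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskNLP(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, task_n_tags):
        super().__init__()
        # Shared embedding layer
        self.embed = nn.Embedding(vocab_size, embed_dim)

        # Task-specific heads
        self.heads = nn.ModuleDict()
        for task, n_tags in task_n_tags.items():
            self.heads[task] = nn.Sequential(
                nn.Linear(5 * embed_dim, hidden_dim),  # window=5
                nn.Hardtanh(),
                nn.Linear(hidden_dim, n_tags)
            )

    def forward(self, word_ids, task):
        x = self.embed(word_ids).view(word_ids.size(0), -1)
        return self.heads[task](x)

# Training loop with random task selection
tasks = {'pos': 45, 'chunk': 23, 'ner': 9, 'srl': 114}
model = MultiTaskNLP(130000, 50, 300, tasks)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate is illustrative

def sample_batch(task, batch_size=32):
    """Placeholder that returns random data; swap in real labeled windows per task."""
    word_ids = torch.randint(0, 130000, (batch_size, 5))
    labels = torch.randint(0, tasks[task], (batch_size,))
    return word_ids, labels

for step in range(1000000):
    task = random.choice(list(tasks.keys()))
    word_ids, labels = sample_batch(task)
    optimizer.zero_grad()  # clear gradients left over from the previous step
    scores = model(word_ids, task)
    loss = F.cross_entropy(scores, labels)
    loss.backward()  # gradients flow to task head AND shared embeddings
    optimizer.step()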

How does multi-task learning improve performance in this paper?

The shared embedding layer receives gradient signals from all four tasks, learning richer word representations than any single task could provide, because each task teaches different linguistic properties
Multi-task learning uses a larger learning rate
The tasks share their training labels with each other

Chapter 5: Semi-Supervised Pre-training

The most forward-looking contribution of the paper is semi-supervised pre-training. The idea: before training on any labeled NLP task, first pre-train the word embeddings on a massive amount of unlabeled text using a language modeling objective. Then fine-tune these pre-trained embeddings on the labeled data.

This is conceptually identical to what GPT, BERT, and every modern NLP system does — just 7 years earlier, at smaller scale, and with a simpler model.

The pre-training objective

Collobert et al. use a pairwise ranking loss. Given a window of text from the corpus, they create a corrupted version by replacing the center word with a random word. The network must score the original window higher than the corrupted one (the example below uses a short sentence as the window):

L = max(0, 1 − f(correct sentence) + f(corrupted sentence))

For example:

Correct

"The cat sat on the mat" → score: 8.3

Corrupted

"The cat democracy on the mat" → score: 2.1

loss = max(0, 1 − 8.3 + 2.1) = 0 ✓ (correct scores higher)

This is not a full language model (it doesn't predict the next word). It's a discriminative objective: can the network tell real sentences from fake ones? But the effect is the same — to score real sentences highly, the embeddings must capture which words fit naturally in which contexts.
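As a quick check of the arithmetic, the loss can be evaluated directly. This is a minimal sketch: the function name `ranking_loss` and the second pair of scores are made up for illustration and do not come from the paper.

```python
def ranking_loss(real_score, corrupt_score, margin=1.0):
    """Hinge-style pairwise ranking loss: zero once real beats corrupt by the margin."""
    return max(0.0, margin - real_score + corrupt_score)

# Scores from the example above: the real sentence wins by more than the margin.
easy = ranking_loss(8.3, 2.1)

# Hypothetical scores where the margin is violated, so the loss is positive
# and the gradient would push the two scores apart.
hard = ranking_loss(2.0, 1.6)

print(easy, hard)
```

Only pairs that violate the margin contribute gradient, so training effort concentrates on the corruptions the network still finds plausible.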

The ancestor of modern pre-training: This is the same intuition behind Word2Vec (2013), GloVe (2014), ELMo (2018), BERT (2019), and GPT (2018-2024). Train on unlabeled text to learn general word representations, then fine-tune on labeled data for specific tasks. Collobert et al. did it first in 2008 (the paper was published in 2011 but the work started in 2008), with the full pipeline: pre-train on English Wikipedia, then fine-tune on NLP benchmarks.

Pre-training data and scale

Property	Value
Pre-training corpus	English Wikipedia (631M words)
Pre-training objective	Pairwise ranking (real vs corrupted)
Window size	11 words
Embedding dimension	50
Training time	~1 month on a single CPU
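One way to picture how training pairs come out of raw text is to slide a fixed-size window over the tokens and swap the center word. The sketch below is an assumption-laden illustration: it uses a window of 5 instead of 11 to stay readable, a tiny invented vocabulary, no sentence-boundary padding, and a hypothetical helper name `make_pairs`.

```python
import random

def make_pairs(tokens, vocab, window=5, seed=0):
    """Build (real_window, corrupted_window) pairs for ranking pre-training."""
    rng = random.Random(seed)
    half = window // 2
    pairs = []
    for center in range(half, len(tokens) - half):
        real = tokens[center - half:center + half + 1]
        # Draw a replacement that differs from the true center word.
        candidates = [w for w in vocab if w != tokens[center]]
        corrupted = list(real)
        corrupted[half] = rng.choice(candidates)
        pairs.append((real, corrupted))
    return pairs

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens) | {"democracy", "blue", "ran"})
pairs = make_pairs(tokens, vocab)
for real, corrupted in pairs:
    print(" ".join(real), "|", " ".join(corrupted))
```

Every position in the corpus yields a fresh training pair, which is why unlabeled text is such a cheap source of supervision.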

[Interactive demo: Pre-training: Real vs Corrupted Sentences. The network learns to distinguish real sentences from corrupted ones in which the center word has been replaced with a random word; its score should be higher for the real sentence.]

Impact of pre-training

The paper reports results with and without pre-trained embeddings (per-word accuracy for POS, F1 for the other three tasks):

Task	Random init	Pre-trained	Improvement
POS	96.37%	97.20%	+0.83
Chunking	90.33%	93.63%	+3.30
NER	81.47%	88.67%	+7.20
SRL	70.99%	74.29%	+3.30

NER improves by 7.2 percentage points from pre-training alone! This makes sense: NER requires knowing that "Obama" is a person-type word, which is exactly the kind of knowledge an embedding learns from reading Wikipedia. Without pre-training, the network has to learn this from the small labeled NER dataset — much harder.
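Absolute gains can hide how much of the remaining error disappears. The short calculation below recomputes the Improvement column from the table and adds a relative error reduction; that second number is our own framing for illustration, not a figure reported in the paper.

```python
# (random init, pre-trained) scores copied from the table above
results = {
    "POS": (96.37, 97.20),
    "Chunking": (90.33, 93.63),
    "NER": (81.47, 88.67),
    "SRL": (70.99, 74.29),
}

def error_reduction(before, after):
    """Fraction of the remaining error (100 - score) removed by the change."""
    return (after - before) / (100.0 - before)

summary = {
    task: (round(after - before, 2), round(100 * error_reduction(before, after), 1))
    for task, (before, after) in results.items()
}
for task, (gain, rel) in summary.items():
    print(f"{task}: +{gain} points, {rel}% of remaining error removed")
```

Seen this way, the small POS gain is less trivial than it looks, because so little error was left to remove.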

What do pre-trained embeddings know?

The authors examined their pre-trained embeddings by finding nearest neighbors. The results reveal rich linguistic structure learned purely from unlabeled text:

Query word	Nearest neighbors	Captured knowledge
France	Austria, Belgium, Germany, Italy	European countries
Monday	Tuesday, Wednesday, Thursday	Days of the week
racing	riding, swimming, flying	Gerund activities
universities	colleges, schools, campuses	Educational institutions
he	she, it, they	Pronoun class

These are the same kind of relationships that Word2Vec (published 2 years later) would become famous for. Collobert et al. demonstrated this first, though their pre-training objective was different (pairwise ranking vs. prediction).
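A nearest-neighbor lookup like the one behind the table takes only a few lines. The sketch below uses cosine similarity, a common choice that may differ from the metric used in the paper, and toy 3-dimensional vectors invented purely for illustration.

```python
import numpy as np

# Toy 3-dimensional embeddings, invented for illustration only.
words = ["france", "austria", "belgium", "monday", "tuesday"]
E = np.array([
    [0.90, 0.10, 0.00],
    [0.80, 0.20, 0.10],
    [0.85, 0.15, 0.00],
    [0.00, 0.90, 0.30],
    [0.10, 0.95, 0.25],
])

def nearest(query, k=2):
    """Return the k words whose vectors have the highest cosine similarity to the query."""
    unit = E / np.linalg.norm(E, axis=1, keepdims=True)
    q = unit[words.index(query)]
    sims = unit @ q
    order = np.argsort(-sims)
    return [words[i] for i in order if words[i] != query][:k]

print(nearest("france"))
print(nearest("monday", k=1))
```

With real 50-dimensional embeddings the same function applies unchanged; only `words` and `E` would be loaded from the trained lookup table.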

Comparison: pre-training objectives through history

Method	Year	Objective	Key advantage
Collobert et al.	2008/2011	Real vs corrupted sentence	Simple, no softmax over vocab
Word2Vec	2013	Predict context/center word	10x faster training
GloVe	2014	Matrix factorization of co-occurrences	Global statistics
ELMo	2018	Bidirectional language model	Context-dependent embeddings
BERT	2019	Masked language model	Deep bidirectional context
GPT	2018-24	Autoregressive next-word prediction	Scales to trillions of tokens

All of these objectives share the same core insight from Collobert et al.: learn word representations by exploiting the structure of unlabeled text. The objectives differ in details, but the principle — that self-supervised learning on text produces transferable representations — was established in this 2008 work.

Scale matters: 2008 vs 2024

To appreciate how far the field has come while using the same principles (the GPT-4 figures are unofficial estimates, since OpenAI has not published them):

Property	Collobert 2008	GPT-4 2023	Ratio
Pre-training data	631M words (~2.5 GB)	~13T tokens (~50 TB)	20,000x
Embedding dimension	50	~12,288	246x
Total parameters	~6.5M	~1.8T (est.)	277,000x
Training compute	1 CPU-month	~25,000 GPU-months	25,000x
Tasks handled	4 (POS, NER, Chunk, SRL)	Hundreds+	~100x

The architecture changed (feedforward → Transformer), the scale changed (millions → trillions), but the recipe remained: (1) learn representations from unlabeled text, (2) transfer to downstream tasks. Collobert et al. proved the recipe works at small scale. The field spent the next 15 years proving it works at every scale.

Historical note: The work that became this paper began in 2008 — before deep learning's GPU-powered renaissance (AlexNet, 2012), before Word2Vec (2013), before attention (2014), before Transformers (2017). It was one of the first demonstrations that neural networks could compete with classical NLP on standard benchmarks, and it did so with a single CPU and a simple feedforward architecture. The ideas were ahead of the hardware — a recurring pattern in deep learning history.

The backpropagation algorithm (1986) waited 25 years for GPUs to make deep networks practical. Convolutional networks (1989) waited 23 years for ImageNet and GPU training. And Collobert et al.'s pre-training recipe (2008) waited 10 years for BERT to demonstrate it at full scale. Good ideas persist. They just need the right compute and data to flourish.

Today, we stand on the shoulders of these early works. Every time you type model = AutoModelForTokenClassification.from_pretrained(...), you are using the exact paradigm that Collobert et al. established: pre-trained representations, transferred to a specific task, fine-tuned with task-specific labels.

The API has changed. The scale has changed. The models are unrecognizably more powerful. But the principle — learn representations from unlabeled text, then transfer them — has not changed since this paper proved it works.

That is what makes NLP (Almost) from Scratch one of the most influential papers in the history of natural language processing. Cited over 8,000 times, it laid the groundwork for an entire paradigm shift: from hand-crafted features to learned representations, from task-specific pipelines to unified architectures, from labeled-data-only training to pre-train-then-fine-tune. The "almost" in the title was prophetic — it took a few more years, but the "almost" eventually became "completely."

Fine-tuning strategy

A crucial practical question: when fine-tuning on labeled data, should you freeze the pre-trained embeddings or continue training them? The paper tries both:

Frozen embeddings: The embedding layer is fixed during fine-tuning. Only the task-specific layers learn. This preserves the pre-trained representations but limits adaptation.
Fine-tuned embeddings: The embedding layer continues to update. This allows task-specific adaptation but risks "catastrophic forgetting" — overwriting useful pre-trained knowledge with task-specific noise.
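The two options above differ only in whether the lookup table receives gradient updates. A minimal PyTorch sketch, assuming random stand-in vectors in place of real pre-trained embeddings and illustrative sizes:

```python
import torch
import torch.nn as nn

# Stand-in for pre-trained vectors (random here; sizes are illustrative).
pretrained = torch.randn(1000, 50)

# Option 1: frozen. The lookup table is excluded from gradient updates.
frozen = nn.Embedding.from_pretrained(pretrained.clone(), freeze=True)

# Option 2: fine-tuned. Same starting point, but the weights stay trainable.
tuned = nn.Embedding.from_pretrained(pretrained.clone(), freeze=False)

# A dummy backward pass shows that only the trainable table collects gradients.
ids = torch.randint(0, 1000, (4, 5))
tuned(ids).sum().backward()

print(frozen.weight.requires_grad, tuned.weight.requires_grad)
print(tuned.weight.grad is not None)
```

Freezing is the safer default when labeled data is scarce; leaving the table trainable makes sense once there is enough task data to move the vectors without overwriting what pre-training learned.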

The paper finds that fine-tuning the embeddings works best when combined with a small learning rate for the embedding layer (slower than the task-specific layers). The same idea survives in modern NLP under the names differential learning rates and discriminative fine-tuning.

python
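# Differential learning rates (modern PyTorch)
import torch
import torch.nn as nn

# Stand-in model exposing the three layer groups used below (sizes are illustrative)
model = nn.ModuleDict({
    'embed': nn.Embedding(1000, 50),
    'hidden': nn.Linear(5 * 50, 300),
    'output': nn.Linear(300, 9),
})
optimizer = torch.optim.SGD([
    {'params': model['embed'].parameters(), 'lr': 0.001},  # slow for embeddings
    {'params': model['hidden'].parameters(), 'lr': 0.01},  # faster for task layers
    {'params': model['output'].parameters(), 'lr': 0.01},
])
# This preserves pre-trained knowledge while allowing task adaptation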

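In modern systems: The same caution carries over to fine-tuning pre-trained Transformers, where the usual recipe is a small learning rate (around 2e-5) so that the pre-trained weights change slowly. Note that the Hugging Face Trainer applies a single learning rate to all parameters by default; giving the backbone a smaller rate than the task head requires passing explicit parameter groups to the optimizer, as in the snippet above.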

python
# Pairwise ranking loss for pre-training
import torch

def pretrain_step(model, sentence, vocab_size):
    """One step of pairwise ranking pre-training.

    sentence: 1-D LongTensor of word ids; model maps it to a scalar score.
    """
    center_idx = len(sentence) // 2

    # Score the real sentence
    real_score = model(sentence)

    # Create corrupted version: replace center word with a random word id
    corrupted = sentence.clone()
    corrupted[center_idx] = torch.randint(0, vocab_size, (1,)).item()
    corrupt_score = model(corrupted)

    # Ranking loss: real should score higher by margin 1
    loss = torch.clamp(1 - real_score + corrupt_score, min=0)
    return loss

What is the pre-training objective used in this paper?

A pairwise ranking loss where the network must score real sentences higher than corrupted ones (where a word has been replaced with a random word), forcing the embeddings to learn which words fit in which contexts
Predicting the next word in a sequence (autoregressive language modeling)
Classifying sentences as positive or negative sentiment

Chapter 6: Benchmark Showcase

How does the neural network — with minimal features — compare to state-of-the-art systems that use decades of hand-crafted feature engineering? The results were surprising in 2011.

[Interactive chart: Benchmark Results: Neural vs Feature-Engineered. Compares the performance of Collobert et al.'s neural approach against the best feature-engineered systems of the time, with and without hand-crafted features added to the neural system.]

Key results (per-word accuracy for POS, F1 for the other tasks)

Task	Benchmark	Neural (no features)	Neural + features	Best traditional
POS	WSJ	97.20	97.29	97.24
Chunking	CoNLL 2000	93.63	94.32	94.13
NER	CoNLL 2003	88.67	89.59	89.31
SRL	CoNLL 2005	74.29	77.92	77.92

The headline result: With zero hand-crafted features (no gazetteers, no suffix lists, no parse trees), the neural network achieves competitive performance across all four tasks. On POS tagging it comes within 0.04 points of the best feature-engineered system (97.20 vs 97.24). When a small number of features are added (for example, POS tags and chunk tags for SRL), it matches or beats the best traditional system on every task in the table. This was a landmark result: it showed that feature engineering could be largely automated.

What features still help

The paper is honest about when features still help. For SRL, which requires understanding sentence-level structure, adding POS tags and chunk tags as features improves F1 from 74.29 to 77.92 — a 3.6-point jump. This makes sense: SRL benefits from syntactic structure, which POS and chunk tags encode. The network could eventually learn this from raw words, but with limited labeled SRL data, explicit syntactic features provide a shortcut.

Speed comparison

Beyond accuracy, the neural approach has a speed advantage:

System	POS tagging speed	Training time
Neural (this paper)	~200,000 words/sec	Hours
Best traditional (SVM)	~1,000 words/sec	Days

The neural network is 200x faster at test time because it's just matrix multiplies, while traditional systems compute hundreds of features per word and then solve constrained optimization problems.

The Viterbi decoding advantage

When using sentence-level training with a transition matrix A, the paper employs Viterbi decoding at test time to find the globally optimal tag sequence. This is important because it enforces structural constraints — for example, in NER, a B-PER (beginning of person) tag can be followed by I-PER (inside person) but not by I-LOC (inside location).

The effect on NER is significant: sentence-level training with Viterbi improves F1 from 86.96 (word-level) to 88.67 (sentence-level), a 1.7-point gain. The transition matrix learns tag-to-tag compatibility patterns that would be hard to capture with word-level predictions alone.

best_tags = argmax_{[t_1, ..., t_n]} ( ∑_{i=1..n} f_θ(i, t_i) + ∑_{i=1..n−1} A_{t_i, t_{i+1}} )

The first term scores individual word-tag pairs (the neural network output). The second term scores tag-tag transitions (the learned transition matrix). Viterbi finds the sequence that maximizes the combined score in O(n · T²) time, where n is the sentence length and T is the number of tags.
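The dynamic program behind that formula is short enough to write out. The sketch below follows the two-term score above in plain Python; the tag set, the emission scores, and the large negative penalty standing in for a learned "forbidden" transition are all invented for illustration.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence as a list of tag indices.

    emissions[i][t]   : network score for tag t at position i
    transitions[s][t] : score for moving from tag s to tag t
    """
    n, T = len(emissions), len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    back = []                    # back-pointers for positions 1..n-1
    for i in range(1, n):
        new_score, pointers = [], []
        for t in range(T):
            best_prev = max(range(T), key=lambda s: score[s] + transitions[s][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emissions[i][t])
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Trace back from the best final tag.
    tag = max(range(T), key=lambda t: score[t])
    path = [tag]
    for pointers in reversed(back):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

tags = ["O", "B-PER", "I-PER", "I-LOC"]
NEG = -1e4  # large penalty standing in for a learned "forbidden" transition
transitions = [[0.0] * 4 for _ in range(4)]
transitions[tags.index("B-PER")][tags.index("I-LOC")] = NEG
transitions[tags.index("O")][tags.index("I-PER")] = NEG
transitions[tags.index("O")][tags.index("I-LOC")] = NEG

# Made-up per-word scores; taken alone, position 1 slightly prefers I-LOC.
emissions = [
    [0.1, 2.0, 0.0, 0.0],
    [0.2, 0.0, 1.0, 1.2],
    [1.5, 0.0, 0.1, 0.0],
]
best = [tags[t] for t in viterbi(emissions, transitions)]
print(best)
```

A word-by-word argmax would pick I-LOC at position 1, producing the invalid sequence B-PER I-LOC. The transition scores steer the decoder to I-PER instead, which is the structural benefit described above. The two nested loops over tags inside the loop over positions give the O(n · T²) cost.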

Comparison with modern systems

To put the 2011 results in perspective, here are the same benchmarks with modern systems:

System	Year	POS (WSJ)	NER (CoNLL)	Architecture
Collobert et al.	2011	97.29	89.59	Feedforward + Conv
ELMo	2018	97.84	92.22	Bidirectional LSTM
BERT-base	2019	97.85	92.80	Transformer (12 layers)
RoBERTa	2019	—	93.11	Transformer (24 layers)

The gap between 2011 and modern systems is surprisingly small on POS tagging (97.29 vs 97.85) but larger on NER (89.59 vs 93.11). The main improvements came from:

Bidirectional context (ELMo, 2018): Using left-to-right AND right-to-left language models over the whole sentence instead of static embeddings fed through a fixed-size window around each word
Self-attention (BERT, 2019): Replacing convolution with attention allows each word to directly attend to any other word, regardless of distance
Scale (RoBERTa, 2019): 160GB of pre-training text vs 2.5GB — 64x more data
Subword tokenization (BPE): Handles rare words and morphology without explicit suffix features

But the foundational principles — learned embeddings, pre-training on unlabeled text, fine-tuning — are identical. Collobert et al. built the playbook; everyone else optimized the plays.

Error analysis: where neural fails

The paper honestly examines failure cases. The neural system struggles most on:

Rare entities: Company names and locations appearing rarely in training but commonly in test data. Pre-training helps but doesn't eliminate this.
Long-range SRL dependencies: When the agent is far from the verb ("The man who was standing at the corner kicked the ball"), max pooling loses positional information.
Nested entities: "The University of California, Berkeley" contains nested location references that the BIO tagging scheme cannot represent.

These limitations were addressed by subsequent work: ELMo's contextual embeddings handle rare words better, attention mechanisms capture long-range dependencies, and span-based models handle nested entities. But identifying these limitations was itself a contribution — it showed the community exactly where to push next.

The enduring lesson: The paper's most important contribution wasn't any specific accuracy number — it was the methodology. Show that learned representations can match hand-crafted features. Show that pre-training on unlabeled text helps. Show that multiple tasks benefit from shared representations. This three-part methodology is now the default across ML: computer vision (ImageNet pre-training), speech (wav2vec), protein folding (ESM), code (CodeBERT). Every time you fine-tune a pre-trained model, you're following the recipe from Collobert et al.

python
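# Evaluation: computing token-level F1 score for sequence labeling
from collections import defaultdict

def compute_f1(predictions, gold, ignore_tags=frozenset({'O'})):
    """Compute token-level micro-averaged F1 for sequence labeling.

    Note: the CoNLL benchmarks score whole entity/chunk spans, so this
    per-token version only approximates the official metric.
    """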
    tp = defaultdict(int)
    fp = defaultdict(int)
    fn = defaultdict(int)

    for pred_seq, gold_seq in zip(predictions, gold):
        for p, g in zip(pred_seq, gold_seq):
            if g not in ignore_tags:
                if p == g: tp[g] += 1
                else: fn[g] += 1
            if p not in ignore_tags and p != g:
                fp[p] += 1

    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    total_fn = sum(fn.values())

    precision = total_tp / (total_tp + total_fp + 1e-8)
    recall = total_tp / (total_tp + total_fn + 1e-8)
    f1 = 2 * precision * recall / (precision + recall + 1e-8)
    return f1
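The function above counts individual tokens, whereas CoNLL-style evaluation counts a prediction as correct only when the whole span matches in both boundaries and type. A sketch of that stricter variant, assuming BIO tags with `B-`/`I-` prefixes; the helper names `bio_spans` and `span_f1` are hypothetical, and treating a stray `I-` tag as the start of a new span is a design choice rather than part of any official scorer.

```python
def bio_spans(tags):
    """Extract (start, end, type) spans from a BIO tag sequence; end is exclusive."""
    spans, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a trailing span
        inside = tag.startswith("I-") and kind == tag[2:]
        if not inside:
            if kind is not None:
                spans.add((start, i, kind))
            start, kind = (i, tag[2:]) if tag != "O" else (None, None)
    return spans

def span_f1(predictions, gold):
    """Micro-averaged F1 over exact-match entity spans."""
    tp = fp = fn = 0
    for pred_seq, gold_seq in zip(predictions, gold):
        p, g = bio_spans(pred_seq), bio_spans(gold_seq)
        tp += len(p & g)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

gold = [["B-PER", "I-PER", "O", "B-LOC"]]
pred = [["B-PER", "O", "O", "B-LOC"]]
print(span_f1(pred, gold))
```

In this example the predicted person span is one token too short, so it counts as both a false positive and a false negative at the span level, even though most of its tokens are tagged correctly.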

What was the most surprising finding in the benchmark results?

With zero hand-crafted features, the neural network achieved competitive or superior performance compared to systems built with decades of feature engineering, proving that neural networks can largely automate feature discovery
The neural network was slower than traditional systems
Hand-crafted features always outperformed the neural approach

Chapter 7: Connections

The 2011 Collobert et al. paper is a bridge between traditional NLP and modern deep learning NLP. It proved the viability of three ideas that would come to dominate the field:

Learned representations over hand-crafted features — now the default everywhere
Pre-training on unlabeled text — the foundation of GPT, BERT, and all LLMs
Multi-task learning with shared representations — used in T5, GPT, and modern multi-modal models

The lineage to modern NLP

Paper/System	Year	What it inherited from Collobert et al.
Word2Vec	2013	Learning embeddings from unlabeled text (simpler objective, same idea)
GloVe	2014	Pre-trained word vectors for downstream tasks
ELMo	2018	Contextual embeddings pre-trained on LM, fine-tuned per task
BERT	2019	Pre-train on unlabeled text, then fine-tune the same architecture separately for each NLP task
GPT-3/4	2020-23	Pre-train at massive scale → emergent multi-task capability
T5	2020	Unified architecture for all NLP tasks (text-to-text)

The "Almost" in the title

Why "almost" from scratch? Because the paper still uses a few hand-crafted features that help significantly:

Capitalization features: Whether a word is capitalized, all-caps, etc. This helps NER distinguish "Obama" (entity) from "obama" (typo).
Suffix features: 2-character word endings that capture morphology ("-ed" = past tense, "-ing" = gerund).
POS tags for SRL: When POS tags from an external tagger are added as input features, SRL improves by 3.6 points.
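The first two features in this list are simple enough to compute in a few lines. The sketch below is illustrative only: the category names and the helper names `caps_feature` and `suffix2` are assumptions, not the paper's exact definitions.

```python
def caps_feature(word):
    """Coarse capitalization category for a token."""
    if word.islower() or not any(c.isalpha() for c in word):
        return "lower"
    if word.isupper():
        return "all_caps"
    if word[0].isupper():
        return "first_cap"
    return "has_cap"

def suffix2(word):
    """Last two characters of the lowercased word."""
    return word.lower()[-2:]

for w in ["Obama", "obama", "NATO", "iPhone", "walked"]:
    print(w, caps_feature(w), suffix2(w))
```

Each category or suffix can then get its own small embedding that is concatenated with the word embedding, so the network treats these features the same way it treats words.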

Modern systems like BERT and GPT eliminate even these features by using subword tokenization (BPE), which naturally captures morphology, and by using deep bidirectional context, which captures capitalization patterns implicitly. They are truly "from scratch" — but it took 7 more years to get there.

python
# Modern "from scratch" NER with HuggingFace: zero feature engineering
from transformers import pipeline

# Load a pre-trained NER model from the Hub (no hand-crafted features needed)
ner = pipeline("ner", model="dslim/bert-base-NER")

# Run inference: input is raw text, output is one dict per tagged token
result = ner("Barack Obama visited Paris yesterday")
# Abbreviated, illustrative output (each dict also carries 'score', 'index', 'start', 'end'):
# [{'entity': 'B-PER', 'word': 'Barack'},
#  {'entity': 'I-PER', 'word': 'Obama'},
#  {'entity': 'B-LOC', 'word': 'Paris'}]
# No gazetteers, no suffix rules, no POS tags

What Collobert et al. got right

The fundamental bet: learned representations will eventually beat hand-crafted features. This is now universally accepted.
Pre-training on unlabeled text transfers to labeled tasks. This is the foundation of modern NLP.
A single architecture can handle multiple NLP tasks. GPT-4 handles hundreds of tasks.

The paradigm shift quantified

To appreciate the magnitude of the shift this paper initiated, consider the engineering effort for a single NLP task before and after:

Aspect	Pre-2011 (feature engineering)	Post-2011 (neural)
Feature design time	Months per task	Zero (learned)
Domain expertise needed	Linguistics PhD	ML engineer
Transfer to new task	Start from scratch	Fine-tune existing model
Transfer to new language	Redesign all features	Train on new data (same architecture)
Inference speed	~1K words/sec	~200K words/sec

What they missed

Attention. Their model uses fixed windows or convolution. Attention (2014-2017) allows dynamic, data-dependent context aggregation.
Contextual embeddings. Their embeddings are static — "bank" gets the same vector in "river bank" and "bank account." ELMo/BERT create context-dependent vectors.
Scale. 50-dimensional embeddings on 631M words. GPT-3 uses 12,288-dimensional embeddings on 300B tokens. The principles scale, but the numbers are transformatively different.
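The static-embedding limitation is easy to see in code: a lookup table returns the same row for a word no matter what surrounds it. A minimal sketch with a three-word toy vocabulary and random vectors, all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: i for i, w in enumerate(["river", "bank", "account"])}
table = rng.normal(size=(len(vocab), 4))   # one fixed row per word type

def embed(sentence):
    """Static lookup: each token maps to its row, regardless of neighbors."""
    return np.stack([table[vocab[w]] for w in sentence.split()])

a = embed("river bank")[1]      # "bank" next to "river"
b = embed("bank account")[0]    # "bank" next to "account"
print(np.array_equal(a, b))
```

Any disambiguation therefore has to happen in the layers above the lookup table, which is the gap that contextual models such as ELMo and BERT later closed.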

The paper's lasting legacy: Before 2011, NLP was a feature engineering discipline. After 2011, it became a representation learning discipline. This paper — along with Bengio 2003 and Mikolov 2013 — proved that neural networks could learn linguistic features from raw text. The entire modern NLP ecosystem (Hugging Face, transformers, fine-tuning) traces back to this paradigm shift.

"The most important property of a program is whether it accomplishes the intention of its user." — C.A.R. Hoare

Which three ideas from this 2011 paper became foundational to modern NLP (GPT, BERT, T5)?

Learned representations (instead of hand-crafted features), pre-training on unlabeled text (then fine-tuning), and a unified architecture for multiple tasks (with shared embeddings)
Attention mechanisms, transformer blocks, and positional encoding
Reinforcement learning, reward models, and RLHF
