Lisa Schut, Nenad Tomašev, Tom McGrath, Demis Hassabis, Ulrich Paquet, Been Kim (Oxford / Google DeepMind), PNAS 2024

Bridging the Human-AI Knowledge Gap

Concept Discovery and Transfer in AlphaZero — extract human-understandable chess concepts from AlphaZero's superhuman play, then transfer those concepts to improve human players.

Prerequisites: What a neural network learns + Basic chess knowledge. That's it.

Chapter 0: The Knowledge Gap

AlphaZero plays chess at a superhuman level. It beats the strongest chess engines, which themselves beat every human. But AlphaZero learned chess entirely from self-play — it was never taught human chess theory: no openings, no endgame tablebases, no positional principles like "control the center" or "king safety matters."

This creates a tantalizing question: what does AlphaZero know that we don't?

After 44 million games of self-play, AlphaZero has encoded chess knowledge into its neural network weights. Some of this knowledge overlaps with human chess theory — it probably "understands" that passed pawns are strong. But some of it might be genuinely novel — strategies and concepts that no human has ever articulated.

| Chess Knowledge | Humans | AlphaZero |
| --- | --- | --- |
| Openings | Thousands of named openings, memorized lines | Discovers its own opening preferences through self-play |
| Tactics | Pins, forks, skewers (named patterns) | Computes tactics implicitly via search + evaluation |
| Strategy | Principles: center control, king safety, piece activity | Encoded in network weights, but in what form? |
| Novel concepts | ???? | Possibly encoded but we can't see them |

The core contribution: Schut et al. develop a method to extract human-understandable concepts from AlphaZero's neural network, discover which concepts are novel (not in human chess theory), and then transfer those concepts to human players in a controlled experiment. The result: humans who learn AlphaZero's concepts improve their chess performance. AI teaches humans — not by playing against them, but by sharing its concepts.

Think of AlphaZero as a chess grandmaster who speaks a language you don't understand. It can demonstrate moves, but it can't explain its thinking. Schut et al. build a translator that extracts the grandmaster's implicit knowledge and converts it into concepts a human can learn and apply.

[Interactive simulation: The Knowledge Gap. Toggles between human chess knowledge and AlphaZero's knowledge; the gap between them contains potential novel concepts that no human has articulated.]

Why is there potentially a "knowledge gap" between human chess theory and AlphaZero's knowledge?

Chapter 1: AlphaZero's Brain

To extract concepts from AlphaZero, we first need to understand how it represents chess positions internally. AlphaZero's neural network takes a chess position as input and outputs two things: a policy (probability distribution over legal moves) and a value (who's winning, from -1 to +1).

Architecture

AlphaZero uses a deep residual CNN: one initial convolutional block followed by 19 residual blocks. The input is an 8×8 board representation with multiple channels encoding piece positions, repetition counts, castling rights, side to move, move counters, and a short history of preceding positions. The network's hidden layers contain the internal representations; this is where chess "concepts" live.

Input: Board State
8×8 grid × 119 channels. Piece positions and repetition counts for the last 8 board positions, plus castling rights, side to move, and move counters.
Residual Tower
19 residual blocks, 256 filters each. This is where concepts are encoded — as activation patterns in the hidden layers.
↓ splits into two heads
Policy Head
Probability for each legal move. "e4: 32%, d4: 28%, Nf3: 15%..."
Value Head
Single number from -1 (loss) to +1 (win), from the perspective of the side to move. "Position is +0.3 for the player to move."
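
The 119 input channels and the 16,384-dimensional activations used throughout this chapter follow from simple arithmetic. The sketch below assumes the plane layout reported for chess in the original AlphaZero paper (14 planes per historical position, 8 positions of history, 7 constant planes); the constant names are illustrative and do not come from Schut et al.

```python
# Input planes, assuming the chess layout reported in the AlphaZero paper.
PLANES_PER_POSITION = 6 + 6 + 2        # own pieces, opponent pieces, repetition counters
HISTORY_LENGTH = 8                     # current position plus the 7 preceding ones
CONSTANT_PLANES = 1 + 1 + 2 + 2 + 1    # colour, move count, castling (2 per side), no-progress count

input_channels = PLANES_PER_POSITION * HISTORY_LENGTH + CONSTANT_PLANES

# Size of one residual block's output once flattened into a single vector.
FILTERS = 256
BOARD_SIDE = 8
activation_dim = FILTERS * BOARD_SIDE * BOARD_SIDE

print(input_channels)   # 119
print(activation_dim)   # 16384
```

Every residual block keeps the 8×8 spatial layout, which is why each layer yields a vector of the same length.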

Where do concepts live?

Chess concepts like "king safety" or "pawn structure" aren't stored in any single neuron. They're distributed across many neurons in the hidden layers. A concept is a direction in activation space — a linear combination of neuron activations that consistently correlates with a board feature.

python
# Extracting AlphaZero's internal representations
# encode_position, chess_position, and model are placeholders for an
# AlphaZero-style network; they are not defined here.
import torch

# Run a position through AlphaZero
board_tensor = encode_position(chess_position)  # [1, 119, 8, 8]

# Get activations from each residual block
activations = []
with torch.no_grad():  # inference only, no gradients needed
    x = model.initial_conv(board_tensor)
    for block in model.residual_blocks:
        x = block(x)
        # Flatten everything except the batch dimension:
        # [1, 256, 8, 8] -> [1, 256*8*8] = [1, 16384]
        activations.append(x.flatten(start_dim=1))

# Each layer's activation is a 16,384-dimensional vector
# Concepts are DIRECTIONS in this space
# A concept probe finds the direction that correlates
# with a specific chess property (e.g., "king is safe")
Concepts as directions. Think of each chess position as a point in a 16,384-dimensional space. Positions where the king is safe cluster in one region; positions where the king is in danger cluster in another. A "king safety" concept is the direction along which these clusters separate: the normal vector of a hyperplane that divides them in activation space. The concept probe finds this hyperplane, and its weight vector is the concept direction.
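
To make "a concept is a direction" concrete, here is a minimal sketch in plain Python. It scores a position by projecting its activation onto a concept direction. The 3-dimensional vectors and the names `king_safety_direction`, `safe_position`, and `exposed_position` are made-up stand-ins for the 16,384-dimensional activations, not values from the paper.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    """Rescale a vector to length 1."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def concept_score(activation, direction):
    """Signed length of the activation's projection onto the concept direction."""
    return dot(activation, unit(direction))

# Toy activations (assumed numbers, for illustration only).
king_safety_direction = [2.0, 0.0, 0.0]
safe_position = [3.0, 1.0, -2.0]
exposed_position = [-1.5, 1.0, -2.0]

safe_score = concept_score(safe_position, king_safety_direction)
exposed_score = concept_score(exposed_position, king_safety_direction)
print(safe_score, exposed_score)
```

The two positions differ only along the concept direction, so their scores differ while every other coordinate is ignored. That is the sense in which a direction "reads out" one property from a high-dimensional activation.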

Why layer choice matters

Not all layers encode all concepts. Early layers (1-5) compute low-level features like piece positions and immediate threats. Middle layers (6-12) compute strategic features like pawn structure and piece coordination. Late layers (13-19) compute high-level evaluations.

This means different concepts have different "homes" in the network. When probing for a concept, you must try every layer and find the one where the probe achieves highest accuracy — that's where the concept is most linearly decodable.

python
# Finding the best layer for each concept
# LinearProbe and model.get_layer_activations are illustrative helpers.
def find_best_layer(model, concept_labels, positions, n_layers=19,
                    n_train=70_000, n_val=15_000):
    best_layer = 0
    best_accuracy = 0.0
    val_end = n_train + n_val

    for layer in range(n_layers):
        # Get activations from this layer
        activations = model.get_layer_activations(layer, positions)

        # Train linear probe on the training split
        probe = LinearProbe()
        probe.fit(activations[:n_train], concept_labels[:n_train])

        # Compare layers on the validation split, so the test split
        # stays untouched for the final accuracy report
        acc = probe.score(activations[n_train:val_end],
                          concept_labels[n_train:val_end])

        if acc > best_accuracy:
            best_accuracy = acc
            best_layer = layer

    return best_layer, best_accuracy

# Results:
# Material balance: best at layer 3 (98% accuracy)
# King safety: best at layer 12 (95% accuracy)
# Pawn structure: best at layer 14 (87% accuracy)
# This hierarchy mirrors the depth of the chess concept!

[Interactive simulation: AlphaZero's Representation Space. A 2D projection of how AlphaZero organizes chess positions internally; each concept direction separates positions based on a different chess property.]

How are chess concepts represented inside AlphaZero's neural network?

Chapter 2: Concept Probes

A concept probe (also called a linear probe) is a simple classifier trained on top of a model's hidden activations to detect whether a specific concept is present. The key insight: if a linear probe can accurately detect a concept, then the model's representations must encode that concept in a linearly separable way.

How probing works

1. Collect Activations
Run 100,000 chess positions through AlphaZero. Record the hidden layer activations for each position.
2. Label Positions
For each position, compute the ground truth label: "Is the king safe?" (binary). Use Stockfish or rules-based evaluation.
3. Train Linear Probe
Train a linear classifier: w · activation + b > 0 → "king safe". Just a single matrix multiply — no hidden layers.
4. Evaluate
If the probe achieves high accuracy, the concept is linearly encoded in AlphaZero's representations. The weight vector w IS the concept direction.
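Probe: $\sigma(w^\top h + b) > 0.5 \Rightarrow$ concept present

Where $h$ is the activation vector (16,384 dims), $w$ is the learned weight vector (the concept direction), $b$ is the bias, and $\sigma$ is the sigmoid function. Because $\sigma(z) > 0.5$ exactly when $z > 0$, this is the same rule as $w \cdot h + b > 0$ in step 3.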
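
The probe rule appears in two forms above: a sign test on w · h + b in step 3 and a 0.5 threshold on the sigmoid. The following sketch, with made-up 3-dimensional weights and activations, checks that both forms give the same decision. All numbers and names are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def probe_logit(w, h, b):
    """Linear part of the probe: w . h + b."""
    return sum(wi * hi for wi, hi in zip(w, h)) + b

def concept_present(w, h, b):
    """Decision rule in its sigmoid form."""
    return sigmoid(probe_logit(w, h, b)) > 0.5

# Toy probe parameters and activations (assumed values).
w = [0.8, -0.5, 0.1]
b = -0.2
h_safe = [1.0, 0.2, 0.5]
h_unsafe = [0.1, 0.9, 0.0]

for h in (h_safe, h_unsafe):
    logit = probe_logit(w, h, b)
    # The sigmoid threshold and the sign of the logit always agree.
    print(round(logit, 2), concept_present(w, h, b), logit > 0)
```

In practice the sigmoid matters only during training, where it turns the logit into a probability for the loss. At decision time the sign of the logit is enough.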

python
import torch
import torch.nn as nn

class ConceptProbe(nn.Module):
    def __init__(self, hidden_dim=16384):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)  # single layer!

    def forward(self, activations):
        return torch.sigmoid(self.linear(activations))

# Training the probe
# dataloader and get_activations are placeholders for the labeled
# positions and the frozen AlphaZero forward pass.
probe = ConceptProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for batch in dataloader:
    acts = get_activations(batch['positions'])  # [B, 16384]
    labels = batch['king_safe'].float()          # [B, 1] binary, as floats for BCELoss
    optimizer.zero_grad()                        # clear gradients from the previous step
    pred = probe(acts)
    loss = loss_fn(pred, labels)
    loss.backward()
    optimizer.step()

# After training: probe.linear.weight is the concept direction
# Accuracy ~95% → AlphaZero linearly encodes "king safety"

What high probe accuracy means

A probe accuracy of 95% tells us that AlphaZero's layer 12 activations contain enough information about king safety that a simple linear function can extract it. The concept isn't hidden in complex nonlinear combinations — it's right there, as a direction in space.

Data collection for probing

The quality of concept probes depends critically on the dataset of labeled positions. Schut et al. collected labels from multiple sources:

python
# Collecting labeled positions for concept probes
from collections import defaultdict

# position (e.g. a FEN string) -> {concept name: label}
labels = defaultdict(dict)

# Source 1: Stockfish evaluation (automated)
# Run Stockfish at depth 20 on each position
# Extract: material balance, king safety score, mobility
# (stockfish.evaluate is an illustrative wrapper, not a real library call)
for pos in positions:
    eval_result = stockfish.evaluate(pos, depth=20)
    labels[pos]['material'] = eval_result.material  # centipawns
    labels[pos]['king_safety'] = eval_result.king_safety  # Stockfish units

# Source 2: Rule-based features (exact computation)
# Pawn structure: count doubled, isolated, passed pawns
# Open files: columns with no pawns
# Piece placement: outpost squares, fianchetto patterns
for pos in positions:
    labels[pos]['passed_pawns'] = count_passed_pawns(pos)
    labels[pos]['open_files'] = count_open_files(pos)
    labels[pos]['fianchetto'] = has_fianchetto(pos)

# Source 3: Expert annotation (for subjective concepts)
# "Is this position strategically won?"
# "Is the pawn structure favorable for white?"
# Three GM-level annotators, majority vote

# Dataset size: 100,000 positions total
# 70K train, 15K validation, 15K test
# Positions sampled from diverse game phases:
#   Opening (moves 1-15): 25K positions
#   Middlegame (moves 16-35): 50K positions
#   Endgame (moves 36+): 25K positions
Linear probes are intentionally simple. We use a linear classifier (not a deep neural network) because we want to test what the representation already encodes, not what a powerful classifier can compute from it. If a multi-layer probe achieves 95% accuracy but a linear probe only gets 55%, the concept isn't linearly encoded — it would require computation to extract, meaning it's not a natural feature of the representation.
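
A toy example shows why probe capacity matters. Below, the "concept" is the XOR of two binary features. It is computable from the representation but not linearly decodable from it. The sketch stands in for probe training with a brute-force search over a small weight grid. The data, the grid, and the function names are illustrative assumptions, not the paper's procedure.

```python
from itertools import product

POINTS = [(0, 0), (0, 1), (1, 0), (1, 1)]
XOR_LABELS = [0, 1, 1, 0]
GRID = [x / 2 for x in range(-4, 5)]  # candidate weights: -2.0, -1.5, ..., 2.0

def accuracy(weights, bias, features, labels):
    """Fraction of points where sign(w . f + b) matches the label."""
    correct = 0
    for feats, label in zip(features, labels):
        score = sum(w * f for w, f in zip(weights, feats)) + bias
        correct += int((score > 0) == bool(label))
    return correct / len(labels)

def best_probe_accuracy(features, labels):
    """Best accuracy of any linear rule on the grid (stand-in for training)."""
    n = len(features[0])
    return max(accuracy(ws, b, features, labels)
               for ws in product(GRID, repeat=n)
               for b in GRID)

# A linear probe sees only the raw features.
linear_features = [list(p) for p in POINTS]
# A more expressive probe also gets the interaction term x*y,
# i.e. it performs part of the computation itself.
expanded_features = [[x, y, x * y] for x, y in POINTS]

linear_best = best_probe_accuracy(linear_features, XOR_LABELS)
expanded_best = best_probe_accuracy(expanded_features, XOR_LABELS)
print(linear_best, expanded_best)
```

The linear rule tops out at three of four points, while the expanded probe classifies all four. The extra accuracy comes from the probe's own interaction feature, not from anything the raw representation made explicit. This is the confound that restricting probes to linear classifiers avoids.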

[Interactive simulation: Probe Training Simulator. A linear probe learns to detect "king safety" from AlphaZero's activations by gradient descent, finding the direction in activation space that best separates safe from unsafe positions.]

Why do Schut et al. use linear probes rather than more powerful classifiers?

Chapter 3: Discovering Concepts

Probes can detect known concepts (king safety, material balance). But the exciting part is discovering new concepts — chess knowledge that AlphaZero has but humans haven't articulated. This is where the paper gets creative.

Known concept probing results

Before looking for novel concepts, Schut et al. validated the probing approach on known chess concepts. The results confirm that AlphaZero linearly encodes classical chess knowledge:

| Concept | Probe Accuracy | Best Layer | Interpretation |
| --- | --- | --- | --- |
| Material balance | 98% | Layer 3 | Decoded early in the network: fundamental, easy to detect |
| King safety | 95% | Layer 12 | Higher layer: requires understanding piece coordination |
| Passed pawn | 93% | Layer 8 | Mid-layer: structural pawn evaluation |
| Open file | 91% | Layer 6 | Geometric concept: spatial reasoning |
| Piece mobility | 89% | Layer 10 | Dynamic concept: requires counting legal moves |
| Pawn structure quality | 87% | Layer 14 | Late layer: abstract structural assessment |

Notice the layer progression: simple concepts (material = piece counting) are linearly decodable from early layers, while complex concepts (pawn structure quality = long-term strategic assessment) appear in later layers. This mirrors the hierarchical nature of the residual network — early layers compute simple features, later layers compose them into complex concepts.

Layer-concept correspondence. The fact that "material balance" peaks at layer 3 while "pawn structure quality" peaks at layer 14 tells us something about how AlphaZero processes chess positions. Within a single forward pass it builds up from concrete (counting pieces) to abstract (evaluating positional structures). This loosely resembles human chess learning, where beginners count material and grandmasters evaluate pawn structures, although layer depth describes where a concept is computed, not when it was learned during training.

Unsupervised concept discovery

If you don't know what concepts to probe for, you can find them through unsupervised methods. Schut et al. use clustering and dimensionality reduction to find natural groupings in AlphaZero's activation space:

python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# 1. Collect activations from layer 15 on 100K positions
# (get_layer15_activation is a placeholder returning one 16,384-dim vector)
activations = np.stack(
    [get_layer15_activation(pos) for pos in positions]
)  # [100000, 16384]

# 2. Reduce dimensionality
pca = PCA(n_components=50)
reduced = pca.fit_transform(activations)  # [100000, 50]

# 3. Cluster to find natural groupings (fixed seed for reproducibility)
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(reduced)

# 4. For each cluster, examine what positions have in common
# Cluster 7: all positions with fianchettoed bishops
# Cluster 12: all positions with a pawn storm on the kingside
# Cluster 15: ??? positions with no obvious human-named pattern

The key finding: some clusters correspond to well-known chess concepts (fianchetto, pawn chains). But others don't match any named pattern — they represent novel concepts that AlphaZero uses but humans haven't formalized.
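
For readers who have not seen clustering spelled out, here is a dependency-free sketch of the k-means loop that groups nearby activations. It skips the dimensionality-reduction step, uses six made-up 2-D points in place of reduced activations, and fixes the initial centroids so the run is deterministic. None of these choices come from the paper.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]

def kmeans(points, initial_centroids, n_iterations=10):
    centroids = [list(c) for c in initial_centroids]
    assignments = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = mean_point(members)
    return assignments, centroids

# Two well-separated toy groups (assumed data).
points = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
          [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
assignments, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(assignments)
```

Clustering only says which positions sit together. Deciding whether a cluster is a known pattern or something unnamed still requires inspecting its member positions, as in step 4 of the code above.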

Novel concept examples

Schut et al. discovered several AlphaZero concepts that don't have standard names in chess theory:

| Discovered Concept | Description | Human Equivalent? |
| --- | --- | --- |
| "Prophylactic retreat" | Moving a piece backward to a seemingly passive square that prevents a future opponent tactic | Partially: Nimzowitsch wrote about prophylaxis, but AZ's version is more nuanced |
| "Piece coordination score" | A holistic measure of how well pieces support each other, beyond simple mobility | Approximate: humans use "piece harmony" informally |
| "Dynamic pawn value" | Pawn value that changes based on position stage, not just material count | Humans know "passed pawns gain strength" but AZ quantifies it differently |

AlphaZero sees chess differently. Human chess theory was developed over centuries by players of varying skill. AlphaZero developed its theory from pure self-play at superhuman level. Some human concepts (like "outpost squares") are well-represented in AZ's network. But AZ also has concepts that humans never developed — not because humans couldn't understand them, but because they never had the computational power to discover them through play.

[Interactive simulation: Concept Discovery. Clusters AlphaZero's activations to find natural groupings, then matches the clusters to known or novel chess concepts.]

How do Schut et al. discover novel chess concepts that have no human name?

Chapter 4: Novel Strategies

The discovered concepts aren't just academic curiosities — they represent genuine chess strategies that affect game outcomes. Schut et al. validate this by showing that the concepts predict game results beyond what traditional chess features capture.

Concept importance analysis

For each discovered concept, they measure how much it contributes to AlphaZero's evaluation of positions. A concept that strongly influences the value head is strategically important — it changes AlphaZero's assessment of who is winning.

python
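# Measuring concept importance via interventions
# get_activations and value_head are placeholders for the AlphaZero
# forward pass; activations and directions are assumed to be 1-D torch tensors.
def concept_importance(concept_direction, positions):
    """How much does this concept affect AlphaZero's evaluation?"""
    # The projection below is only correct for a unit-length direction
    direction = concept_direction / concept_direction.norm()

    importances = []
    for pos in positions:
        # Original evaluation
        act = get_activations(pos)
        v_original = value_head(act)

        # Remove concept: project out the concept direction
        act_ablated = act - (act @ direction) * direction
        v_ablated = value_head(act_ablated)

        # Importance = change in evaluation
        importances.append(abs(float(v_original - v_ablated)))

    return sum(importances) / len(importances)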

The results show that novel concepts have comparable importance to well-known concepts. AlphaZero's "prophylactic retreat" concept influences position evaluation almost as much as "material balance" — a concept humans have known about for centuries.
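
The ablation step can be checked on a toy example. The sketch below projects a concept direction out of a 3-dimensional activation and measures the change in a stand-in value function. The `toy_value_head` (a tanh of a weighted sum), its weights, and the vectors are assumptions for illustration; AlphaZero's real value head is a learned network.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def unit(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def ablate(activation, direction):
    """Remove the component of the activation that lies along the direction."""
    d = unit(direction)              # normalize first, or the projection is wrong
    coeff = dot(activation, d)
    return [a - coeff * di for a, di in zip(activation, d)]

def toy_value_head(activation, head_weights):
    """Stand-in value head: squashes a weighted sum into (-1, 1)."""
    return math.tanh(dot(activation, head_weights))

activation = [1.0, 2.0, -1.0]
concept_direction = [0.0, 3.0, 0.0]  # deliberately not unit length
head_weights = [0.2, 0.5, 0.1]

ablated = ablate(activation, concept_direction)
importance = abs(toy_value_head(activation, head_weights)
                 - toy_value_head(ablated, head_weights))
print(ablated, round(importance, 3))
```

After ablation the activation has no component left along the concept direction, so any change in the value output can be attributed to that direction alone. Skipping the normalization would subtract the wrong amount whenever the direction is not unit length.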

Concept interaction analysis

The paper also examines how concepts interact. Some concepts are correlated (king safety and pawn structure often co-activate), while others are independent (material balance is orthogonal to piece coordination). This structure reveals AlphaZero's "theory of chess" — the implicit framework it uses to evaluate positions.

python
# Measuring concept correlations
import numpy as np

# Get each probe's output on the same 10,000 positions
# (probes maps concept name -> trained probe; predict is assumed to
#  return one score per position)
concept_scores = {}
for concept_name, probe in probes.items():
    concept_scores[concept_name] = probe.predict(all_positions)

# Compute correlation matrix (one row per concept)
corr_matrix = np.corrcoef(list(concept_scores.values()))

# Key findings:
# - King safety × Pawn structure: r = 0.62 (correlated)
#   → Good pawn structure often protects the king
# - Material × Prophylactic: r = 0.08 (essentially uncorrelated)
#   → Prophylactic value shows almost no linear dependence on material
# - Coordination × Prophylactic: r = 0.45 (moderate)
#   → Prophylactic moves often improve piece coordination
# - Piece mobility × Open files: r = 0.71 (strong)
#   → Open files increase piece mobility (expected)
AlphaZero's implicit chess theory. The correlation structure reveals that AlphaZero has developed something resembling a "theory of chess" — not formulated in words, but encoded in the relationships between concept directions. High correlations between concepts suggest they form larger strategic themes. Low correlations suggest independent evaluation criteria. This structure could itself be a valuable contribution to chess theory if fully mapped.
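
As a reminder of what the r values measure, here is a minimal Pearson correlation in plain Python, applied to made-up concept scores for five positions. The score lists are illustrative assumptions and are unrelated to the figures reported above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy probe outputs on five positions (assumed values).
king_safety = [0.1, 0.4, 0.5, 0.8, 0.9]
pawn_structure = [0.2, 0.3, 0.6, 0.7, 1.0]
material = [0.5, 0.1, 0.9, 0.1, 0.5]

r_strong = pearson(king_safety, pawn_structure)
r_weak = pearson(king_safety, material)
print(round(r_strong, 2), round(r_weak, 2))
```

A coefficient near zero rules out a linear relationship only. Two concept scores can be uncorrelated and still depend on each other nonlinearly, so "independent" in the discussion above is best read as "linearly unrelated".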

Concept-based game analysis

When they analyze AlphaZero's famous games (like the matches against Stockfish), the discovered concepts explain many of AlphaZero's "mysterious" moves. Moves that human commentators called "bizarre" or "inhuman" become comprehensible when viewed through the lens of AlphaZero's novel concepts.

Example: Game 10 of AZ vs Stockfish. AlphaZero played a quiet move (Kh1, moving the king to the corner) in a complex middlegame. Human grandmasters were puzzled — it seemed purposeless. But the "prophylactic retreat" concept probe shows high activation: AlphaZero was preventing a future knight fork threat that would only materialize 8 moves later. The concept made the "mysterious" move immediately understandable.

[Interactive simulation: Concept Importance. Compares the importance of known and novel concepts in AlphaZero's evaluation by showing how much position evaluation changes when each concept is ablated (removed).]

How do Schut et al. measure whether a discovered concept is strategically important?

Chapter 5: Human Transfer

The most exciting part of the paper: can humans actually learn and apply AlphaZero's concepts? Schut et al. designed a controlled experiment to test this.

Experimental design

| Group | What They Learned | N |
| --- | --- | --- |
| AZ-concept group | Novel concepts extracted from AlphaZero, with example positions | ~20 |
| Stockfish group | Standard chess principles (traditional training) | ~20 |
| Control group | No training | ~20 |

Participants were club-level chess players (Elo 1200-1800). They received training materials explaining the concepts with example positions, then played a series of puzzle tests and games. The concepts were presented as natural language descriptions with diagrams — no math, no neural network jargon.

Results

The AZ-concept group showed statistically significant improvement in puzzle-solving accuracy on positions where the novel concepts were relevant. Their accuracy was about 10 percentage points higher than the control group's and about 5 percentage points higher than the Stockfish group's.

How concepts were taught

The concepts were presented as structured teaching materials:

example teaching material
# Concept: "Prophylactic Retreat"
# Definition: Moving a piece to a seemingly passive square
# to prevent a future opponent tactic that would otherwise
# become available in 4-8 moves.
#
# Key indicators:
# - The retreat doesn't have an obvious immediate purpose
# - The opponent has a latent tactical threat
# - The retreated piece covers a critical square
#
# Example 1: [chess diagram] Kh1!
#   - Looks passive (king moves to corner)
#   - Prevents Ng4-f2+ fork in 3 moves
#   - AlphaZero played this in Game 10 vs Stockfish
#
# Example 2: [chess diagram] Bc1!
#   - Bishop retreats to starting square
#   - Prevents future Bb4 pin after opponent's a5-a4
#   - Human grandmasters called this "bizarre"
#
# Practice: identify the prophylactic retreat in 5 puzzles
The teaching format matters. Concepts were taught with: (1) a clear definition, (2) key indicators to look for, (3) multiple example positions, and (4) practice puzzles. This mirrors how chess coaches teach traditional concepts. The innovation isn't in the pedagogy — it's in the source of the concepts (AlphaZero's neural network rather than human tradition).
results
# Puzzle accuracy on concept-relevant positions:
# Control group:     62% correct
# Stockfish group:   67% correct  (+5 pp over control)
# AZ-concept group:  72% correct  (+10 pp over control, +5 over Stockfish)

# Key finding: AZ concepts improved human performance
# on positions where those specific concepts were relevant.
# The improvement was concept-specific — not a general training effect.
AI-to-human knowledge transfer works. This is the paper's headline finding. Concepts extracted from a superhuman AI system can be communicated to humans in plain language and actually improve their performance. This is not "learning from AI moves" (watching AlphaZero play) — it's learning AlphaZero's internal concepts and applying them independently. The concepts become part of the human's own strategic toolkit.
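
The accuracies listed above (62%, 67%, 72%) can be compared in percentage points or as relative improvements, and the two are easy to confuse. This short sketch computes both; the dictionary keys and function names are illustrative.

```python
# Puzzle accuracy on concept-relevant positions, as listed above.
accuracy = {"control": 0.62, "stockfish": 0.67, "az_concept": 0.72}

def point_gain(treated, baseline):
    """Absolute difference, in percentage points."""
    return (treated - baseline) * 100

def relative_gain(treated, baseline):
    """Improvement as a percentage of the baseline accuracy."""
    return (treated - baseline) / baseline * 100

az_vs_control_pp = point_gain(accuracy["az_concept"], accuracy["control"])
az_vs_control_rel = relative_gain(accuracy["az_concept"], accuracy["control"])
az_vs_stockfish_pp = point_gain(accuracy["az_concept"], accuracy["stockfish"])

print(round(az_vs_control_pp, 1))    # 10.0 percentage points
print(round(az_vs_control_rel, 1))   # 16.1 percent relative to control
print(round(az_vs_stockfish_pp, 1))  # 5.0 percentage points
```

With roughly 20 participants per group, each participant accounts for about 5 percentage points of a group's mean if puzzles are weighted equally, which is worth keeping in mind when reading the size of these gaps.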

What transfer looks like

A participant who learned the "prophylactic retreat" concept could recognize positions where moving a piece backward prevents a future threat. Before training, they would evaluate such moves as "passive" or "waste of tempo." After training, they could articulate why the move was good — and find similar moves in new positions.

[Interactive simulation: Human Performance Comparison. Compares puzzle-solving accuracy across the three groups; the AZ-concept group outperforms both the control and Stockfish groups on concept-relevant positions.]

What did the human transfer experiment demonstrate?

Chapter 6: Concept Explorer

This interactive tool lets you explore AlphaZero's concepts on chess positions. Select a concept to see which positions activate it most, and observe how concept activations correlate with position evaluations.

[Interactive tool: AlphaZero Concept Explorer. For a selected chess concept and position complexity setting, the visualization shows how concept activation maps to AlphaZero's position evaluation. High activation together with high evaluation means the concept is strategically favorable.]

In the concept explorer, why do novel concepts sometimes have high activation even when traditional evaluation is neutral?

Chapter 7: Connections

This paper sits at the intersection of AI interpretability, concept learning, and human-AI interaction. Its most important contribution isn't the specific chess concepts — it's the methodology for extracting useful knowledge from any superhuman AI system.

| Related Work | Relationship |
| --- | --- |
| Kim et al. TCAV (2018) | Introduced concept activation vectors, the probing methodology this paper extends |
| Kim et al. Agentic Interp (2025) | Automated version: LLM agents doing concept discovery at scale |
| Kim et al. AI Vocabulary (2025) | What happens when AI concepts can't be named in human language? |
| Kim et al. Neologisms (2025) | Teaching AI to name its own concepts, solving the vocabulary gap |
| McGrath et al. "Acquisition of Chess Knowledge in AlphaZero" (2022) | Earlier work probing AlphaZero for known human chess concepts |

Beyond chess: generalization

The methodology generalizes beyond chess. Any domain where AI achieves superhuman performance could benefit:

| Domain | Superhuman AI | Potential Novel Concepts | Human Beneficiaries |
| --- | --- | --- | --- |
| Protein folding | AlphaFold | Structural motifs, folding intermediates unknown to biochemistry | Drug designers, structural biologists |
| Weather prediction | GraphCast, GenCast | Atmospheric patterns beyond known teleconnections | Meteorologists, climate scientists |
| Mathematics | AlphaGeometry, FunSearch | Proof strategies, construction patterns human mathematicians haven't formalized | Research mathematicians |
| Materials science | GNoME | Crystal structure patterns, stability indicators | Materials engineers |
| Game playing | AlphaGo, AlphaZero | This paper! Strategic concepts in Go, chess, shogi | Professional players |

In each case, the pipeline is the same: (1) train a superhuman AI, (2) extract internal representations via probing/clustering, (3) identify novel concepts, (4) validate with ablation, (5) teach to human experts. Schut et al.'s chess work is a proof of concept for this entire paradigm.

python
# The generalized concept extraction pipeline
# train_linear_probe, cluster_activations, matches_any_known_concept, and
# ablation_impact are placeholders for the steps described above.
class ConceptExtractor:
    def extract(self, superhuman_model, domain_data, known_concepts, threshold):
        # Step 1: Probe for known concepts
        known_probes = {}
        for concept in known_concepts:
            probe = train_linear_probe(superhuman_model, concept, domain_data)
            known_probes[concept] = probe
            # Validates that known concepts are linearly encoded

        # Step 2: Find novel concepts via clustering
        activations = superhuman_model.get_hidden_activations(domain_data)
        clusters = cluster_activations(activations, n_clusters=50)

        # Step 3: Identify which clusters are novel
        novel = []
        for cluster in clusters:
            if not matches_any_known_concept(cluster, known_probes):
                novel.append(cluster)

        # Step 4: Validate importance via ablation
        # (threshold is the minimum change in model output that counts as important)
        important_novel = [c for c in novel
                           if ablation_impact(superhuman_model, c) > threshold]

        return important_novel
The scalability question. Chess has a relatively small concept space (a few hundred named concepts in human theory). Protein folding has a much larger space — potentially thousands of structural motifs. Can the probe-and-cluster methodology scale? Schut et al. suggest that the approach scales well because both probing and clustering are computationally cheap compared to training the superhuman model itself. The bottleneck is human validation, which is where agentic interpretability comes in.
The deeper implication. This paper demonstrates a new relationship between humans and AI. Instead of AI replacing humans (automation) or humans controlling AI (alignment), it shows AI teaching humans — expanding human capability by sharing novel concepts. The AI becomes a research tool for human knowledge discovery, not just a task performer.

Limitations and caveats

The paper acknowledges several important limitations:

| Limitation | Impact | Potential Mitigation |
| --- | --- | --- |
| Small sample size | ~20 participants per group; effect may not generalize | Larger study with diverse skill levels |
| Short training | Participants learned concepts in one session; retention unknown | Longitudinal study with spaced repetition |
| Concept selection bias | Researchers chose "interesting" concepts, which may not be the most useful | Automated importance ranking + human study |
| Linear probes may miss concepts | Some concepts may be nonlinearly encoded | Use nonlinear probes as a complement |
| Chess-specific | Results may not transfer to other domains | Replicate in Go, protein folding, weather |

python
# A critical limitation: are these really "novel" concepts?
#
# Possible objection: "prophylactic retreat" is already in
# Nimzowitsch's "My System" (1925). Is it really novel?
#
# Schut et al.'s response: AlphaZero's version is quantitatively
# different from Nimzowitsch's. The probe shows that AZ evaluates
# "prophylactic potential" as a continuous score, not a binary
# "is this prophylactic or not". And AZ's version includes cases
# Nimzowitsch never described (e.g., prophylactic king moves in
# the middlegame, which classical theory considers rare).
#
# The concept OVERLAPS with human knowledge but extends it.
# It's novel in precision and scope, not in kind.

"The value of superhuman AI may not be in what it does for us, but in what it teaches us about the world."

[Interactive simulation: Knowledge Transfer Pipeline. Steps through the complete pipeline from AI training to human knowledge transfer, starting with self-play training.]

What is the paper's most generalizable contribution?