Concept Discovery and Transfer in AlphaZero — extract human-understandable chess concepts from AlphaZero's superhuman play, then transfer those concepts to improve human players.
AlphaZero plays chess at a superhuman level. It beats the strongest chess engines, which themselves beat every human. But AlphaZero learned chess entirely from self-play — it was never taught human chess theory: no openings, no endgame tablebases, no positional principles like "control the center" or "king safety matters."
This creates a tantalizing question: what does AlphaZero know that we don't?
After 44 million games of self-play, AlphaZero has encoded chess knowledge into its neural network weights. Some of this knowledge overlaps with human chess theory — it probably "understands" that passed pawns are strong. But some of it might be genuinely novel — strategies and concepts that no human has ever articulated.
| Chess Knowledge | Humans | AlphaZero |
|---|---|---|
| Openings | Thousands of named openings, memorized lines | Discovers its own opening preferences through self-play |
| Tactics | Pins, forks, skewers — named patterns | Computes tactics implicitly via search + evaluation |
| Strategy | Principles: center control, king safety, piece activity | Encoded in network weights — but in what form? |
| Novel concepts? | ??? | Possibly encoded but we can't see them |
Think of AlphaZero as a chess grandmaster who speaks a language you don't understand. It can demonstrate moves, but it can't explain its thinking. Schut et al. build a translator that extracts the grandmaster's implicit knowledge and converts it into concepts a human can learn and apply.
To extract concepts from AlphaZero, we first need to understand how it represents chess positions internally. AlphaZero's neural network takes a chess position as input and outputs two things: a policy (probability distribution over legal moves) and a value (who's winning, from -1 to +1).
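The two outputs can be made concrete with a small sketch. It assumes the network emits one raw logit per candidate move plus one raw scalar for the value, that illegal moves are masked out before the softmax, and that the value is squashed with `tanh`. The function name and the toy numbers are illustrative and do not come from the paper.

```python
import numpy as np

def policy_and_value(policy_logits, legal_mask, value_logit):
    """Turn raw network outputs into a move distribution and a scalar value."""
    # Illegal moves get -inf so they receive exactly zero probability.
    masked = np.where(legal_mask, policy_logits, -np.inf)
    masked = masked - masked.max()   # subtract the max for numerical stability
    exp = np.exp(masked)
    policy = exp / exp.sum()         # probabilities over legal moves sum to 1
    value = np.tanh(value_logit)     # squashes any real number into (-1, 1)
    return policy, value

# Toy example: 5 candidate moves, 3 of them legal (made-up numbers).
logits = np.array([2.0, 0.5, -1.0, 3.0, 0.0])
legal = np.array([True, False, True, False, True])
policy, value = policy_and_value(logits, legal, value_logit=0.8)
```

The move with the highest raw logit (index 3) is illegal here, so the most probable move after masking is index 0. The sketch assumes at least one legal move exists.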
AlphaZero uses a deep residual CNN with ~20 blocks. The input is an 8×8 board representation with multiple channels encoding piece positions, castling rights, en passant squares, and move history. The network's hidden layers contain the internal representations — this is where chess "concepts" live.
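To give a feel for the "multiple channels" idea, here is a simplified sketch that encodes only the piece positions of a single board as 12 binary 8×8 planes (one per piece type and color). The real input stack is larger because it also carries castling rights, en passant squares, and move history. The plane ordering and the function name `encode_piece_planes` are assumptions made for illustration.

```python
import numpy as np

PIECES = "PNBRQKpnbrqk"  # 6 white piece types, then 6 black (assumed ordering)

def encode_piece_planes(fen):
    """Encode the piece-placement field of a FEN string as a [12, 8, 8] binary array."""
    planes = np.zeros((12, 8, 8), dtype=np.float32)
    placement = fen.split()[0]
    for row, rank in enumerate(placement.split("/")):  # row 0 is rank 8
        col = 0
        for ch in rank:
            if ch.isdigit():
                col += int(ch)  # a digit means that many empty squares
            else:
                planes[PIECES.index(ch), row, col] = 1.0
                col += 1
    return planes

START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"
planes = encode_piece_planes(START_FEN)
```

For the starting position, each of the 32 pieces sets exactly one cell, and the white king plane has its single 1 at row 7, column 4 (the e1 square).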
Chess concepts like "king safety" or "pawn structure" aren't stored in any single neuron. They're distributed across many neurons in the hidden layers. A concept is a direction in activation space — a linear combination of neuron activations that consistently correlates with a board feature.
```python
# Extracting AlphaZero's internal representations
import torch

# Run a position through AlphaZero.
# encode_position, chess_position and model are assumed to be defined elsewhere.
board_tensor = encode_position(chess_position)  # [1, 119, 8, 8]

# Get activations from each residual block
activations = []
with torch.no_grad():  # inference only, no gradients needed
    x = model.initial_conv(board_tensor)
    for block in model.residual_blocks:
        x = block(x)
        # Keep the batch dimension: [1, 256, 8, 8] -> [1, 256*8*8] = [1, 16384]
        activations.append(x.flatten(start_dim=1))

# Each layer's activation is a 16,384-dimensional vector
# Concepts are DIRECTIONS in this space
# A concept probe finds the direction that correlates
# with a specific chess property (e.g., "king is safe")
```
Not all layers encode all concepts. Early layers (1-5) compute low-level features like piece positions and immediate threats. Middle layers (6-12) compute strategic features like pawn structure and piece coordination. Late layers (13-19) compute high-level evaluations.
This means different concepts have different "homes" in the network. When probing for a concept, you train a probe on every layer and pick the layer where it achieves the highest held-out accuracy; that is where the concept is most linearly decodable.
```python
# Finding the best layer for each concept
from sklearn.linear_model import LogisticRegression

def find_best_layer(model, concept_labels, positions, n_layers=19):
    # model.get_layer_activations is an assumed helper that returns an
    # array of shape [n_positions, 16384] for the given layer.
    best_layer = 0
    best_accuracy = 0.0
    for layer in range(n_layers):
        # Get activations from this layer
        activations = model.get_layer_activations(layer, positions)
        # Train linear probe (logistic regression stands in for LinearProbe)
        probe = LogisticRegression(max_iter=1000)
        probe.fit(activations[:70000], concept_labels[:70000])
        # Select the layer on the validation split (70K-85K);
        # the last 15K positions stay untouched as the test set.
        acc = probe.score(activations[70000:85000], concept_labels[70000:85000])
        if acc > best_accuracy:
            best_accuracy = acc
            best_layer = layer
    return best_layer, best_accuracy

# Results:
# Material balance: best at layer 3 (98% accuracy)
# King safety: best at layer 12 (95% accuracy)
# Pawn structure: best at layer 14 (87% accuracy)
# This hierarchy mirrors the depth of the chess concept!
```
A concept probe (also called a linear probe) is a simple classifier trained on top of a model's hidden activations to detect whether a specific concept is present. The key insight: if a linear probe can accurately detect a concept, then the model's representations must encode that concept in a linearly separable way.
$$P(\text{concept} \mid h) = \sigma(w^\top h + b)$$

where $h$ is the activation vector (16,384 dims), $w$ is the learned weight vector (the concept direction), $b$ is a scalar bias, and $\sigma$ is the sigmoid function.
```python
import torch
import torch.nn as nn

class ConceptProbe(nn.Module):
    def __init__(self, hidden_dim=16384):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)  # single layer!

    def forward(self, activations):
        return torch.sigmoid(self.linear(activations))

# Training the probe
# dataloader and get_activations are assumed to be defined elsewhere.
probe = ConceptProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCELoss()

for batch in dataloader:
    acts = get_activations(batch['positions'])  # [B, 16384]
    labels = batch['king_safe'].float()         # [B, 1] binary
    optimizer.zero_grad()  # clear gradients left over from the previous step
    pred = probe(acts)
    loss = loss_fn(pred, labels)
    loss.backward()
    optimizer.step()

# After training: probe.linear.weight is the concept direction
# Accuracy ~95% → AlphaZero linearly encodes "king safety"
```
A probe accuracy of 95% tells us that AlphaZero's layer 12 activations contain enough information about king safety that a simple linear function can extract it. The concept isn't hidden in complex nonlinear combinations — it's right there, as a direction in space.
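What "a direction in space" means can be sketched numerically. The example below assumes a probe of the form sigmoid(w · h + b) and uses random stand-in vectors (16 dimensions instead of 16,384) rather than real AlphaZero activations. The helper names `probe_probability` and `concept_score` are hypothetical.

```python
import numpy as np

def probe_probability(h, w, b):
    """sigmoid(w . h + b): the probe's probability that the concept is present."""
    return 1.0 / (1.0 + np.exp(-(w @ h + b)))

def concept_score(h, w):
    """Signed length of h along the unit-normalized concept direction."""
    return (h @ w) / np.linalg.norm(w)

rng = np.random.default_rng(0)
w = rng.normal(size=16)   # stand-in probe weights, not learned from real data
b = 0.0
unit = w / np.linalg.norm(w)

# Start from an activation with no component along the concept direction,
# then push it along that direction in both senses.
h_base = rng.normal(size=16)
h_base = h_base - (h_base @ unit) * unit
h_pos = h_base + 3.0 * unit
h_neg = h_base - 3.0 * unit

p_pos = probe_probability(h_pos, w, b)
p_neg = probe_probability(h_neg, w, b)
```

Moving an activation along `unit` changes the probe's output and nothing else about the vector's other components, which is the sense in which the weight vector acts as the concept's direction.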
The quality of concept probes depends critically on the dataset of labeled positions. Schut et al. collected labels from multiple sources:
```python
# Collecting labeled positions for concept probes
# NOTE: stockfish.evaluate and the count_*/has_* helpers are illustrative
# placeholders assumed to be defined elsewhere, not a real engine API.
from collections import defaultdict

labels = defaultdict(dict)  # position -> {concept name: label}

# Source 1: Stockfish evaluation (automated)
# Run Stockfish at depth 20 on each position
# Extract: material balance, king safety score, mobility
for pos in positions:
    eval_result = stockfish.evaluate(pos, depth=20)
    labels[pos]['material'] = eval_result.material        # centipawns
    labels[pos]['king_safety'] = eval_result.king_safety  # Stockfish units

# Source 2: Rule-based features (exact computation)
# Pawn structure: count doubled, isolated, passed pawns
# Open files: columns with no pawns
# Piece placement: outpost squares, fianchetto patterns
for pos in positions:
    labels[pos]['passed_pawns'] = count_passed_pawns(pos)
    labels[pos]['open_files'] = count_open_files(pos)
    labels[pos]['fianchetto'] = has_fianchetto(pos)

# Source 3: Expert annotation (for subjective concepts)
# "Is this position strategically won?"
# "Is the pawn structure favorable for white?"
# Three GM-level annotators, majority vote

# Dataset size: 100,000 positions total
# 70K train, 15K validation, 15K test
# Positions sampled from diverse game phases:
#   Opening (moves 1-15): 25K positions
#   Middlegame (moves 16-35): 50K positions
#   Endgame (moves 36+): 25K positions
```
Probes can detect known concepts (king safety, material balance). But the exciting part is discovering new concepts — chess knowledge that AlphaZero has but humans haven't articulated. This is where the paper gets creative.
Before looking for novel concepts, Schut et al. validated the probing approach on known chess concepts. The results confirm that AlphaZero linearly encodes classical chess knowledge:
| Concept | Probe Accuracy | Best Layer | Interpretation |
|---|---|---|---|
| Material balance | 98% | Layer 3 | Learned early — fundamental, easy to detect |
| King safety | 95% | Layer 12 | Higher layer — requires understanding piece coordination |
| Passed pawn | 93% | Layer 8 | Mid-layer — structural pawn evaluation |
| Open file | 91% | Layer 6 | Geometric concept — spatial reasoning |
| Piece mobility | 89% | Layer 10 | Dynamic concept — requires counting legal moves |
| Pawn structure quality | 87% | Layer 14 | Late layer — abstract structural assessment |
Notice the layer progression: simple concepts (material = piece counting) are linearly decodable from early layers, while complex concepts (pawn structure quality = long-term strategic assessment) appear in later layers. This mirrors the hierarchical nature of the residual network — early layers compute simple features, later layers compose them into complex concepts.
If you don't know what concepts to probe for, you can find them through unsupervised methods. Schut et al. use clustering and dimensionality reduction to find natural groupings in AlphaZero's activation space:
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# 1. Collect activations from layer 15 on 100K positions
# get_layer15_activation is an assumed helper returning a 16,384-dim vector.
activations = np.stack(
    [get_layer15_activation(pos) for pos in positions]
)  # [100000, 16384]

# 2. Reduce dimensionality
pca = PCA(n_components=50)
reduced = pca.fit_transform(activations)  # [100000, 50]

# 3. Cluster to find natural groupings
clusters = KMeans(n_clusters=20, n_init=10, random_state=0).fit_predict(reduced)

# 4. For each cluster, examine what positions have in common
# Cluster 7: all positions with fianchettoed bishops
# Cluster 12: all positions with a pawn storm on the kingside
# Cluster 15: ??? positions with no obvious human-named pattern
```
The key finding: some clusters correspond to well-known chess concepts (fianchetto, pawn chains). But others don't match any named pattern — they represent novel concepts that AlphaZero uses but humans haven't formalized.
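One simple way to separate "known" from "unnamed" clusters is to check how many members of each cluster carry a given concept label. The sketch below is an assumption about how such matching could be done, not the procedure from the paper. The function `match_clusters`, the 0.8 overlap threshold, and the toy labels are all hypothetical.

```python
import numpy as np

def match_clusters(cluster_ids, concept_labels, min_overlap=0.8):
    """Map each cluster to its best-matching known concept, or None if none fits.

    cluster_ids: int array of shape [n_positions]
    concept_labels: dict mapping concept name -> bool array of shape [n_positions]
    """
    matches = {}
    for c in np.unique(cluster_ids):
        members = cluster_ids == c
        best_name, best_frac = None, 0.0
        for name, present in concept_labels.items():
            frac = present[members].mean()  # share of cluster members with this concept
            if frac > best_frac:
                best_name, best_frac = name, frac
        matches[int(c)] = best_name if best_frac >= min_overlap else None
    return matches

# Toy data: 12 positions in 3 clusters (made-up labels).
cluster_ids = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
concept_labels = {
    "fianchetto": np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0], dtype=bool),
    "pawn_storm": np.array([0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0], dtype=bool),
}
matches = match_clusters(cluster_ids, concept_labels)
```

Cluster 2 matches neither label, so it comes back as `None` and would be a candidate for closer inspection. A concept that is present in most positions overall would trivially match every cluster under this rule, so a fuller version would also compare against the concept's base rate.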
Schut et al. discovered several AlphaZero concepts that don't have standard names in chess theory:
| Discovered Concept | Description | Human Equivalent? |
|---|---|---|
| "Prophylactic retreat" | Moving a piece backward to a seemingly passive square that prevents a future opponent tactic | Partially — Nimzowitsch wrote about prophylaxis, but AZ's version is more nuanced |
| "Piece coordination score" | A holistic measure of how well pieces support each other, beyond simple mobility | Approximate — humans use "piece harmony" informally |
| "Dynamic pawn value" | Pawn value that changes based on position stage, not just material count | Humans know "passed pawns gain strength" but AZ quantifies it differently |
The discovered concepts aren't just academic curiosities — they represent genuine chess strategies that affect game outcomes. Schut et al. validate this by showing that the concepts predict game results beyond what traditional chess features capture.
For each discovered concept, they measure how much it contributes to AlphaZero's evaluation of positions. A concept that strongly influences the value head is strategically important — it changes AlphaZero's assessment of who is winning.
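```python
# Measuring concept importance via interventions
import torch

def concept_importance(concept_direction, positions):
    """How much does this concept affect AlphaZero's evaluation?"""
    # get_activations and value_head are assumed helpers; act is a 1-D
    # activation vector and value_head returns a single-element tensor.
    # Normalize so the projection removes exactly the concept component.
    direction = concept_direction / concept_direction.norm()
    importances = []
    with torch.no_grad():
        for pos in positions:
            # Original evaluation
            act = get_activations(pos)
            v_original = value_head(act)
            # Remove concept: project out the concept direction
            act_ablated = act - (act @ direction) * direction
            v_ablated = value_head(act_ablated)
            # Importance = change in evaluation
            importances.append((v_original - v_ablated).abs().item())
    return sum(importances) / len(importances)
```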
The results show that novel concepts have comparable importance to well-known concepts. AlphaZero's "prophylactic retreat" concept influences position evaluation almost as much as "material balance" — a concept humans have known about for centuries.
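The projection step behind this kind of ablation can be checked on toy vectors. The sketch below uses random stand-ins for the activation and the concept direction, and it assumes a purely linear value head so that the change in evaluation has a closed form. AlphaZero's real value head is nonlinear, so the closed form is only an illustration of the mechanics.

```python
import numpy as np

def ablate(h, direction):
    """Remove the component of h along direction (normalized internally)."""
    d = direction / np.linalg.norm(direction)
    return h - (h @ d) * d

rng = np.random.default_rng(0)
h = rng.normal(size=32)  # stand-in activation vector
d = rng.normal(size=32)  # stand-in concept direction (not unit length)
u = rng.normal(size=32)  # weights of a hypothetical LINEAR value head

d_unit = d / np.linalg.norm(d)
h_ablated = ablate(h, d)

residual = h_ablated @ d_unit            # component left along the concept
delta_v = u @ h - u @ h_ablated          # change in the linear head's output
predicted = (h @ d_unit) * (u @ d_unit)  # closed form for the same change

# Skipping the normalization does NOT remove the component.
naive = h - (h @ d) * d
naive_residual = naive @ d_unit
```

With a unit-length direction the residual is zero up to rounding, and for a linear head the importance factorizes into "how much of the concept is present" times "how much the head cares about that direction".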
The paper also examines how concepts interact. Some concepts are correlated (king safety and pawn structure often co-activate), while others are independent (material balance is orthogonal to piece coordination). This structure reveals AlphaZero's "theory of chess" — the implicit framework it uses to evaluate positions.
```python
# Measuring concept correlations
import numpy as np

# Get concept activations on 10,000 positions
# probes and all_positions are assumed to be defined elsewhere.
activations = {}
for concept_name, probe in probes.items():
    activations[concept_name] = probe.predict(all_positions)

# Compute correlation matrix (rows/columns follow the order of activations.keys())
corr_matrix = np.corrcoef(np.array(list(activations.values())))

# Key findings:
# - King safety × Pawn structure: r = 0.62 (correlated)
#   → Good pawn structure often protects the king
# - Material × Prophylactic: r = 0.08 (independent)
#   → Prophylactic moves are equally important regardless of material
# - Coordination × Prophylactic: r = 0.45 (moderate)
#   → Prophylactic moves often improve piece coordination
# - Piece mobility × Open files: r = 0.71 (strong)
#   → Open files increase piece mobility (expected)
```
When they analyze AlphaZero's famous games (like the matches against Stockfish), the discovered concepts explain many of AlphaZero's "mysterious" moves. Moves that human commentators called "bizarre" or "inhuman" become comprehensible when viewed through the lens of AlphaZero's novel concepts.
The most exciting part of the paper: can humans actually learn and apply AlphaZero's concepts? Schut et al. designed a controlled experiment to test this.
| Group | What They Learned | N |
|---|---|---|
| AZ-concept group | Novel concepts extracted from AlphaZero, with example positions | ~20 |
| Stockfish group | Standard chess principles (traditional training) | ~20 |
| Control group | No training | ~20 |
Participants were club-level chess players (Elo 1200-1800). They received training materials explaining the concepts with example positions, then played a series of puzzle tests and games. The concepts were presented as natural language descriptions with diagrams — no math, no neural network jargon.
The AZ-concept group showed a statistically significant improvement in puzzle-solving accuracy on positions where the novel concepts were relevant. Their accuracy was about 10 percentage points higher than the control group's and about 5 percentage points higher than the Stockfish group's.
The concepts were presented as structured teaching materials:
```text
# Example teaching material
# Concept: "Prophylactic Retreat"
# Definition: Moving a piece to a seemingly passive square
# to prevent a future opponent tactic that would otherwise
# become available in 4-8 moves.
#
# Key indicators:
# - The retreat doesn't have an obvious immediate purpose
# - The opponent has a latent tactical threat
# - The retreated piece covers a critical square
#
# Example 1: [chess diagram] Kh1!
# - Looks passive (king moves to corner)
# - Prevents Ng4-f2+ fork in 3 moves
# - AlphaZero played this in Game 10 vs Stockfish
#
# Example 2: [chess diagram] Bc1!
# - Bishop retreats to starting square
# - Prevents future Bb4 pin after opponent's a5-a4
# - Human grandmasters called this "bizarre"
#
# Practice: identify the prophylactic retreat in 5 puzzles
```
```text
# Results
# Puzzle accuracy on concept-relevant positions:
# Control group:    62% correct
# Stockfish group:  67% correct (+5 pp over control)
# AZ-concept group: 72% correct (+10 pp over control, +5 over Stockfish)

# Key finding: AZ concepts improved human performance
# on positions where those specific concepts were relevant.
# The improvement was concept-specific, not a general training effect.
```
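With roughly 20 participants per group, a gap of a few percentage points needs a significance check. The source does not say which test was used, so the sketch below shows one generic option, a one-sided permutation test on per-participant accuracies. The function name and the input numbers are made up for illustration and are not data from the study.

```python
import random

def permutation_p_value(group_a, group_b, n_perm=5000, seed=0):
    """One-sided p-value for mean(group_a) > mean(group_b) under random relabeling."""
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = sum(group_a) / n_a - sum(group_b) / len(group_b)
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign participants to the two groups
        diff = sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / (len(pooled) - n_a)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero

# Made-up per-participant accuracies, NOT data from the study.
trained = [0.9] * 10
control = [0.1] * 10
p_separated = permutation_p_value(trained, control)
p_identical = permutation_p_value([0.5] * 10, [0.5] * 10)
```

When the two groups are indistinguishable, every relabeling matches the observed difference and the p-value is 1. When they are cleanly separated, almost no relabeling reproduces the gap and the p-value is small.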
A participant who learned the "prophylactic retreat" concept could recognize positions where moving a piece backward prevents a future threat. Before training, they would evaluate such moves as "passive" or "waste of tempo." After training, they could articulate why the move was good — and find similar moves in new positions.
This paper sits at the intersection of AI interpretability, concept learning, and human-AI interaction. Its most important contribution isn't the specific chess concepts — it's the methodology for extracting useful knowledge from any superhuman AI system.
| Related Work | Relationship |
|---|---|
| Kim et al. TCAV (2018) | Introduced concept activation vectors — the probing methodology this paper extends |
| Kim et al. Agentic Interp (2025) | Automated version — LLM agents doing concept discovery at scale |
| Kim et al. AI Vocabulary (2025) | What happens when AI concepts can't be named in human language? |
| Kim et al. Neologisms (2025) | Teaching AI to name its own concepts — solving the vocabulary gap |
| McGrath et al. "Acquisition of Chess Knowledge in AlphaZero" (2022) | Earlier work probing AlphaZero's network for human chess concepts |
The methodology generalizes beyond chess. Any domain where AI achieves superhuman performance could benefit:
| Domain | Superhuman AI | Potential Novel Concepts | Human Beneficiaries |
|---|---|---|---|
| Protein folding | AlphaFold | Structural motifs, folding intermediates unknown to biochemistry | Drug designers, structural biologists |
| Weather prediction | GraphCast, GenCast | Atmospheric patterns beyond known teleconnections | Meteorologists, climate scientists |
| Mathematics | AlphaGeometry, FunSearch | Proof strategies, construction patterns human mathematicians haven't formalized | Research mathematicians |
| Materials science | GNoME | Crystal structure patterns, stability indicators | Materials engineers |
| Game playing | AlphaGo, AlphaZero | This paper! Strategic concepts in Go, chess, shogi | Professional players |
In each case, the pipeline is the same: (1) train a superhuman AI, (2) extract internal representations via probing/clustering, (3) identify novel concepts, (4) validate with ablation, (5) teach to human experts. Schut et al.'s chess work is a proof of concept for this entire paradigm.
```python
# The generalized concept extraction pipeline
# train_linear_probe, cluster_activations, matches_any_known_concept and
# ablation_impact are assumed helpers defined elsewhere.
class ConceptExtractor:
    def extract(self, superhuman_model, domain_data, known_concepts, threshold):
        # Step 1: Probe for known concepts
        known_probes = {}
        for concept in known_concepts:
            probe = train_linear_probe(superhuman_model, concept, domain_data)
            known_probes[concept] = probe
        # Validates that known concepts are linearly encoded

        # Step 2: Find novel concepts via clustering
        activations = superhuman_model.get_hidden_activations(domain_data)
        clusters = cluster_activations(activations, n_clusters=50)

        # Step 3: Identify which clusters are novel
        novel = []
        for cluster in clusters:
            if not matches_any_known_concept(cluster, known_probes):
                novel.append(cluster)

        # Step 4: Validate importance via ablation
        # threshold is the minimum ablation impact for a concept to count
        important_novel = [c for c in novel if ablation_impact(c) > threshold]
        return important_novel
```
The paper acknowledges several important limitations:
| Limitation | Impact | Potential Mitigation |
|---|---|---|
| Small sample size | ~20 participants per group — effect may not generalize | Larger study with diverse skill levels |
| Short training | Participants learned concepts in one session — retention unknown | Longitudinal study with spaced repetition |
| Concept selection bias | Researchers chose "interesting" concepts — may not be most useful | Automated importance ranking + human study |
| Linear probes may miss concepts | Some concepts may be nonlinearly encoded | Use nonlinear probes as a complement |
| Chess-specific | Results may not transfer to other domains | Replicate in Go, protein folding, weather |
```python
# A critical limitation: are these really "novel" concepts?
#
# Possible objection: "prophylactic retreat" is already in
# Nimzowitsch's "My System" (1925). Is it really novel?
#
# Schut et al.'s response: AlphaZero's version is quantitatively
# different from Nimzowitsch's. The probe shows that AZ evaluates
# "prophylactic potential" as a continuous score, not a binary
# "is this prophylactic or not". And AZ's version includes cases
# Nimzowitsch never described (e.g., prophylactic king moves in
# the middlegame, which classical theory considers rare).
#
# The concept OVERLAPS with human knowledge but extends it.
# It's novel in precision and scope, not in kind.
```
"The value of superhuman AI may not be in what it does for us, but in what it teaches us about the world."