GIANTS — Veanors

Chapter 0: The Problem

Imagine you are reading two papers. One introduces a clever way to align language models with human preferences using preference pairs. The other shows how to sample multiple outputs and compare them within a group. You read both carefully. Something clicks — you see a connection the original authors never made. Six months later, someone publishes a paper that combines exactly those two ideas into something new.

That moment of synthesis — seeing the downstream insight that two parent papers enable — is the essence of scientific creativity. It is what separates a literature reviewer from a researcher. It is also, as far as anyone knows, something only humans can do.

Or is it?

The GIANTS question: Given summaries of two "parent" papers, can a language model predict the core insight of a downstream paper that builds on both? Not summarize it, not retrieve it — anticipate it. Generate an insight that doesn't exist yet in the model's input.

This is an extraordinarily hard task. The insight is not contained in either parent paper. It lives in the gap between them — in the creative synthesis that a human scientist would perform. And the search space is vast: there are infinitely many plausible insights that could follow from two papers, but only one is the actual ground truth.

Here is the surprising finding: frontier models are terrible at this. Gemini-3-pro, one of the most capable models in existence, scores just 5.05 out of 10 on insight anticipation. Gemini-2.5-pro scores 5.01. Even a tiny 4B-parameter base model scores 4.75. The gap between the best frontier model and a base model is only 6%. Throwing more parameters at the problem does not help.

But GIANTS-4B — a Qwen3-4B model fine-tuned with reinforcement learning — scores 5.97. That is a 34% improvement over gemini-3-pro. A 4-billion parameter model outperforming a model orders of magnitude larger, on a task that requires scientific creativity.

How? That is the story of this lesson.

What makes this task different

Consider how this compares to tasks we know LLMs are good at. Summarization: the answer is literally in the input, just compressed. Question answering: the evidence is in the context or in training data. Code generation: the specification constrains the output heavily. Insight anticipation is none of these. The output is not in the input. It is not in the training data (the test papers are temporally held out). And the specification is minimal — "what insight follows from these two ideas?" admits thousands of valid answers.

This is why the problem is so fascinating. It sits at the boundary of what we think machines can do. And as we will see, the training algorithm matters far more than the model size.

A concrete example

To make this tangible, consider two real parent papers:

Parent A: "DPO eliminates the reward model by directly optimizing the policy using preference pairs. Instead of learning a scalar reward and then running PPO, DPO optimizes a closed-form objective over paired comparisons."

Parent B: "GRPO samples a group of G responses per prompt and computes advantages by comparing rewards within the group. This eliminates the value network, cutting memory usage by 25%."

Now: what insight might a downstream paper contribute by combining these two ideas? Take a moment to think. There is no single right answer, but a good answer would identify a specific mechanism — not just "combine DPO and GRPO" but how they compose and what new capability emerges.

This is what GIANTS-4B learns to do. And it does it better than frontier models that are presumably many times its size.

Why is insight anticipation harder than standard text generation or summarization?

Because it requires generating longer text
Because the target insight is not contained in the input; it lives in the creative gap between two parent papers
Because the model needs to read more papers

Chapter 1: The Key Insight

GIANTS frames insight anticipation as auto-encoding over the citation graph. This framing is the conceptual foundation of everything else in the paper, so let's unpack it carefully.

In a standard auto-encoder, you take an input, compress it through a bottleneck, and try to reconstruct the original. The bottleneck forces the model to learn the essential structure. GIANTS applies this same logic to scientific knowledge — but the "bottleneck" is the citation graph itself.

Downstream Paper

The real paper with its full insight y*. This is what we want to reconstruct.

↓ lossy channel

Parent Summaries (x_A, x_B)

Two summaries of papers that the downstream paper cites. These are the "compressed representation." They contain the ingredients but not the recipe.

↓ decoder (LLM)

Reconstructed Insight ŷ

The model's prediction of what insight emerges from combining the two parents. Compared against y* for evaluation.

The auto-encoder analogy: The downstream paper passes through a "lossy channel" — we throw away everything except two parent summaries. The LLM acts as the decoder, trying to reconstruct the core insight from this compressed input. If the model can do this well, it has learned something deep about how scientific ideas compose.

Why is this framing so powerful? Because it turns an impossibly vague task ("predict scientific breakthroughs") into a concrete, evaluable one. For every real paper on arXiv, we know its actual insight y*. We know which papers it cites. We can always check: did the model's generated insight ŷ match the real one?

The auto-encoder analogy also explains why the task is hard. In a standard auto-encoder, the bottleneck preserves the most important information. But in GIANTS' citation-graph bottleneck, the most important information — the creative synthesis — is precisely what gets lost. The parent summaries contain the ingredients, but the recipe (the downstream insight) is not extractable from the ingredients alone. The model must invent the recipe, not just recover it.

This is not asking the model to predict which paper will be written next. It is asking: given the specific ingredients (two parent papers), can you figure out what someone actually cooked? The answer already exists. We just hide it behind the lossy channel and ask the model to reconstruct it.

The data flow

Concretely, here is what the model sees and produces:

Component	What it is	Who provides it
x_A	Summary of parent paper A (~200 words)	GiantsBench pipeline (gemini-2.5-flash)
x_B	Summary of parent paper B (~200 words)	GiantsBench pipeline
y*	Ground-truth insight of the downstream paper	Extracted from the real paper, rewritten standalone
ŷ	Model-generated insight prediction	GIANTS-4B (or baseline model)

The input is (x_A, x_B). The output is ŷ. The evaluation compares ŷ to y*. The model never sees the downstream paper. It must invent what the downstream paper discovered.
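The data flow above can be sketched as a small record type plus a prompt builder. This is a minimal illustration, not the GiantsBench schema: the class name, field names, and prompt wording are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InsightExample:
    """One GiantsBench-style record (field names are illustrative)."""
    x_a: str     # summary of parent paper A
    x_b: str     # summary of parent paper B
    y_star: str  # ground-truth downstream insight, never shown to the model


def build_prompt(example: InsightExample) -> str:
    """Format the model input from the two parent summaries only."""
    return (
        "Parent A summary:\n" + example.x_a + "\n\n"
        "Parent B summary:\n" + example.x_b + "\n\n"
        "What core insight might a paper that builds on both contribute?"
    )


example = InsightExample(
    x_a="DPO optimizes the policy directly on preference pairs.",
    x_b="GRPO compares rewards within a group of sampled responses.",
    y_star="(held out; used only by the judge)",
)
prompt = build_prompt(example)
print(prompt)
```

The point of the sketch is the asymmetry: `y_star` lives in the record but never reaches the prompt, so the only route from the downstream paper to the model is through the two parent summaries.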

Citation Graph Auto-Encoding

The downstream paper's insight passes through a lossy channel (parent summaries). The model tries to reconstruct it.

In the auto-encoding framing, what is the "lossy channel" that compresses the downstream paper?

Replacing the full downstream paper with summaries of just two of its parent papers: the ingredients without the recipe
Tokenization of the paper text
The attention mechanism's limited context window

Chapter 2: GiantsBench

To train and evaluate insight anticipation, GIANTS constructs GiantsBench — a large-scale benchmark from arXiv. This is not just a data dump. The construction involves careful filtering, annotation, and splitting to ensure the task is both hard and fair.

Source data

17,839 papers from arXiv, spanning May 2007 through January 2026. Papers must have at least 2 citations. This covers 8 scientific domains: cs.CL, cs.AI, cs.LG, cs.CV, physics, math, economics, and robotics.

Identifying parent pairs

For each downstream paper, gemini-2.5-flash reads the full text and identifies the two most synergistic parent papers — the two cited works whose combination most directly inspired the downstream paper's core contribution. This is not random: the model must reason about intellectual lineage.

The annotation pipeline

The construction of GiantsBench is a multi-stage process. Each stage is designed to maximize data quality while preventing information leakage (the downstream paper's content must never leak into the parent summaries or the task setup).

Detailed steps

1. Parent Identification

gemini-2.5-flash reads the downstream paper and picks the 2 most influential parent papers it builds on synergistically.

↓

2. Summarization

Each parent paper is summarized into a standalone paragraph. The summaries must NOT reference the downstream paper.

↓

3. Insight Extraction

The downstream paper's core insight y* is rewritten as a standalone statement — no references to the paper itself, just the idea.

↓

4. Deduplication

If multiple downstream papers share the same parent pair, keep only the most-cited downstream paper. This prevents data leakage.
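The deduplication step can be sketched in a few lines. The record fields below are hypothetical, and treating (A, B) and (B, A) as the same pair is an assumption of this example rather than something the lesson states.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    parent_a: str    # identifier of parent paper A (hypothetical field)
    parent_b: str    # identifier of parent paper B (hypothetical field)
    downstream: str  # identifier of the downstream paper
    citations: int   # citation count of the downstream paper


def dedupe_by_parent_pair(triplets: list[Triplet]) -> list[Triplet]:
    """Keep the most-cited downstream paper for each unordered parent pair."""
    best: dict[frozenset[str], Triplet] = {}
    for t in triplets:
        key = frozenset((t.parent_a, t.parent_b))  # (A, B) and (B, A) collide
        if key not in best or t.citations > best[key].citations:
            best[key] = t
    return list(best.values())


rows = [
    Triplet("dpo", "grpo", "paper-1", citations=12),
    Triplet("grpo", "dpo", "paper-2", citations=40),
    Triplet("dpo", "ppo", "paper-3", citations=5),
]
kept = dedupe_by_parent_pair(rows)
print([t.downstream for t in kept])  # ['paper-2', 'paper-3']
```

After this pass, every parent pair maps to exactly one target insight, so the same input never appears with two different answers.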

The splits — why they matter

GiantsBench uses two orthogonal splits, temporal and domain, plus an unseen-parents subset of the test set. Together they make evaluation rigorous:

Split	Rule	Size	What it tests
Temporal	Train: before July 2023. Test: after.	Test: 7,504	Can the model anticipate insights from papers it could not have memorized?
Domain	Train: cs.CL only. Test: ALL 8 domains.	—	Does insight anticipation transfer across scientific fields?
Unseen parents	Test examples where neither parent appeared in training.	5,294	Can the model generalize beyond familiar research lineages?

Why train on cs.CL only: If you train on all domains, you cannot tell whether the model learned a general skill (insight synthesis) or just memorized domain-specific patterns. By training exclusively on computational linguistics and testing on physics, economics, and robotics, GIANTS measures whether insight anticipation is a transferable capability.
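The split rules can be expressed as two small functions. This is a sketch under stated assumptions: the record fields are hypothetical, the cutoff day (July 1) is a guess at what "July 2023" means, and marking pre-cutoff papers outside cs.CL as "unused" is this example's reading of the rules, not a documented detail.

```python
from dataclasses import dataclass
from datetime import date

CUTOFF = date(2023, 7, 1)  # "July 2023" in the text; the exact day is assumed
TRAIN_DOMAIN = "cs.CL"


@dataclass(frozen=True)
class Record:
    parent_a: str
    parent_b: str
    domain: str
    published: date


def assign_split(rec: Record) -> str:
    """Return 'train', 'test', or 'unused' under the temporal and domain rules."""
    if rec.published >= CUTOFF:
        return "test"  # every domain is evaluated
    return "train" if rec.domain == TRAIN_DOMAIN else "unused"


def unseen_parents(test: list[Record], train: list[Record]) -> list[Record]:
    """Test records where neither parent appears in any training record."""
    seen = {p for r in train for p in (r.parent_a, r.parent_b)}
    return [r for r in test if r.parent_a not in seen and r.parent_b not in seen]


records = [
    Record("p1", "p2", "cs.CL", date(2022, 3, 1)),
    Record("p1", "p3", "physics", date(2024, 5, 1)),
    Record("p4", "p5", "economics", date(2024, 6, 1)),
    Record("p6", "p7", "physics", date(2021, 1, 1)),
]
train = [r for r in records if assign_split(r) == "train"]
test = [r for r in records if assign_split(r) == "test"]
hard = unseen_parents(test, train)
print([assign_split(r) for r in records])
print([r.parent_a for r in hard])
```

In the toy data, the second record lands in the test set but shares parent `p1` with a training record, so only the third record survives the unseen-parents filter.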

Scale and quality

The final benchmark contains 17,839 annotated triplets (parent pair + insight). Each one has been verified: the parent identification is sensible, the summaries are standalone (no leakage of the downstream paper's content), and the insight is faithfully extracted. The test set alone contains 7,504 examples — large enough for reliable statistical conclusions.

One subtle but important design choice: when multiple downstream papers share the same parent pair, GiantsBench keeps only the most-cited downstream paper. Why? Because popular papers are more likely to represent genuine intellectual synthesis rather than incremental follow-ups. This filtering also prevents the same parent pair from appearing multiple times with different targets, which would confuse the evaluation.

The temporal split in detail

The temporal split is not just a convenience: it is the primary defense against data contamination. Every training example comes from a paper published before July 2023, and every test example from a paper published after, so no test insight can leak in through fine-tuning. How clean the split is with respect to pre-training depends on the base model's own data cutoff. To the extent that cutoff precedes the test papers, the model cannot retrieve the answer from memory. It must generate it.

This is more rigorous than many ML benchmarks, which do not control for data contamination at all. GIANTS' temporal split makes the evaluation meaningfully future-facing: you are testing the model's ability to anticipate insights that did not exist when the model was trained.

The domain split adds a second dimension of rigor. Even if a model somehow memorized every cs.CL paper, it would gain no advantage on the physics or economics test sets. Both splits must be beaten simultaneously, which makes it extremely unlikely that any observed improvement comes from memorization rather than genuine insight anticipation ability.

Think of GiantsBench as a map of scientific creativity. Each data point represents one creative act: a researcher who read two papers and saw a connection nobody had seen before. The benchmark asks: can a model learn to replicate that cognitive leap? 17,839 examples of human creativity, across 8 domains, spanning nearly 20 years of arXiv.

GiantsBench Pipeline

The four-stage construction process: identify parents, summarize, extract insight, deduplicate. Each stage filters and refines the data.

Why does GiantsBench train on cs.CL data only, despite having 8 domains available?

Because cs.CL papers are easier to parse
To test whether insight anticipation is a transferable skill, not domain-specific memorization
Because cs.CL has the most papers on arXiv

Chapter 3: The Evaluation

How do you grade a generated insight? This is not a multiple-choice test. The model's output ŷ is free-form text describing a scientific idea. The ground truth y* is also free-form text. You cannot just check string equality — there are infinitely many ways to phrase the same insight.

LM-judge scoring

GIANTS uses an LM judge: a separate language model that reads both ŷ (generated) and y* (ground truth) and assigns a similarity score from 1 to 10. The judge evaluates whether the generated insight captures the same core idea, not whether the wording matches.

This is a crucial design decision. The judge is not checking correctness (there is no "correct" answer in the traditional sense). It is checking semantic alignment: does the generated insight describe the same fundamental idea as the ground truth, even if the words and structure differ completely?

The judge separation principle: The LM judge used during training (to compute rewards for RL) is gemini-2.5-flash. The LM judge used during evaluation (to report final scores) is gemini-3-pro. These are different models. This strict separation prevents the training process from overfitting to the evaluation judge's quirks.

Human validation

Can you trust an LM judge? GIANTS validates this directly. Human annotators score a subset of generated insights, and the correlation with the LM judge is measured:

Spearman ρ = 0.761 between LM judge and human scores
p < 0.001 — the correlation is highly statistically significant

This is strong. For context, inter-annotator agreement between human raters on subjective tasks is often reported in roughly the 0.6–0.8 range. By that yardstick, the LM judge agrees with humans about as much as humans agree with each other.
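For readers who want to see what a Spearman correlation measures, here is a minimal standard-library sketch: rank both score lists, then take the Pearson correlation of the ranks. The score lists are made up for illustration and are not data from the paper.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


judge_scores = [6, 3, 8, 5, 2]  # illustrative LM-judge scores
human_scores = [7, 2, 9, 4, 3]  # illustrative human scores
rho = spearman(judge_scores, human_scores)
print(round(rho, 3))  # 0.9
```

Because only ranks matter, the judge and the humans do not need to use the scale the same way. They only need to order the insights similarly.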

Why not BLEU or ROUGE?

Surface-level metrics like BLEU (n-gram overlap) or ROUGE (recall of reference n-grams) fail catastrophically on this task. Two descriptions of the same scientific insight can use completely different vocabulary. "Combine RL with preference pairs" and "Use reinforcement learning to optimize based on human-annotated ranking data" describe the same idea but share almost no n-grams. The LM judge captures semantic similarity, not lexical overlap.
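The two example sentences make the point concrete. The sketch below uses a simple Jaccard overlap of n-gram sets as a stand-in for BLEU and ROUGE; it is not either metric's actual formula, only the lexical-overlap idea they share.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase, split on non-alphanumerics, and collect the set of n-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap(a: str, b: str, n: int = 1) -> float:
    """Jaccard overlap of n-gram sets (a simplified stand-in for BLEU/ROUGE)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


a = "Combine RL with preference pairs"
b = "Use reinforcement learning to optimize based on human-annotated ranking data"
print(overlap(a, b, n=1))  # 0.0
```

The two sentences share no words at all, so any metric built on word overlap scores them as unrelated even though they describe the same idea.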

The scoring rubric

The LM judge does not simply ask "are these similar?" It evaluates along specific dimensions:

Core mechanism match: Does the generated insight identify the same core mechanism or approach as the ground truth?
Problem-solution alignment: Does it address the same problem that the downstream paper solves?
Novelty capture: Does it capture what makes the downstream idea new relative to the parents?
Specificity: Is it concrete enough to distinguish from vague platitudes like "combine these two ideas"?

A score of 1 means completely unrelated. A score of 5 means roughly the right direction but missing key details. A score of 8-10 means the generated insight nails the core mechanism even if the phrasing differs. This continuous scale is what makes the evaluation informative — and, as we will see, it is what makes RL training possible.
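To make the rubric tangible, here is a hypothetical sketch of how a judge prompt could be assembled and how its reply could be parsed. The prompt wording, the `Score:` reply format, and both function names are assumptions for illustration; the actual judge prompt used by GIANTS is not shown in this lesson.

```python
import re

# The four rubric dimensions named in the text.
RUBRIC = (
    "Core mechanism match",
    "Problem-solution alignment",
    "Novelty capture",
    "Specificity",
)


def build_judge_prompt(generated: str, ground_truth: str) -> str:
    """Assemble a similarity-judging prompt (wording is illustrative)."""
    criteria = "\n".join(f"- {name}" for name in RUBRIC)
    return (
        "Rate how well the generated insight matches the ground-truth insight "
        "on a scale from 1 (unrelated) to 10 (same core idea).\n"
        f"Consider:\n{criteria}\n\n"
        f"Ground truth:\n{ground_truth}\n\n"
        f"Generated:\n{generated}\n\n"
        "Answer with 'Score: <number>'."
    )


def parse_score(reply: str) -> float:
    """Extract the numeric score from a judge reply and clamp it to [1, 10]."""
    match = re.search(r"Score:\s*(\d+(?:\.\d+)?)", reply)
    if match is None:
        raise ValueError("no score found in judge reply")
    return min(10.0, max(1.0, float(match.group(1))))


print(parse_score("The mechanism matches but details are missing. Score: 7.2"))
```

Clamping the parsed value is a defensive choice for this sketch: a judge that replies outside the scale should not be able to hand out an out-of-range reward.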

Why continuous matters for RL: If the reward were binary (match or no match), most RL rollouts would get 0 reward and produce zero gradient. The model would learn almost nothing per step. A continuous 1-10 score creates a gradient in reward space — the model can tell whether it is getting warmer or colder. This makes GRPO's group-relative advantages much more informative. Instead of "3 out of 8 candidates matched," you get "candidate 4 scored 7.2 and candidate 6 scored 3.1" — a rich signal about what works and what does not.
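A toy calculation shows the difference. The rewards and the "match" threshold below are invented for illustration, and the small `eps` in the denominator is an assumption added to avoid dividing by zero when every reward ties.

```python
from statistics import mean, pstdev


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Z-score each reward within its group; eps guards against a zero spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Eight illustrative judge scores for one prompt.
continuous = [7.2, 3.1, 5.0, 4.4, 6.1, 2.8, 5.5, 4.9]

# A hypothetical binary reward: 1 only if the score clears a "match" threshold.
threshold = 8.0
binary = [1.0 if r >= threshold else 0.0 for r in continuous]

adv_binary = group_advantages(binary)
adv_continuous = group_advantages(continuous)
print(adv_binary)  # all zeros: no candidate "matched", so nothing to learn
print([round(a, 2) for a in adv_continuous])
```

With the binary reward, every candidate gets the same score, every advantage is zero, and the update does nothing. With the continuous reward, the same eight candidates produce a clear ordering the policy can learn from.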

Robustness of the judge

A natural worry: what if the LM judge is noisy or biased? GIANTS addresses this in three ways. First, the Spearman correlation of 0.761 with human judgments shows the judge is reliable. Second, the strict separation between training and evaluation judges prevents overfitting. Third, the results are robust across multiple evaluation judges — the paper reports that swapping in different evaluation judges does not change the relative ranking of methods.

This last point is important because it means the improvement is real, not an artifact of one particular judge's biases. GIANTS-4B genuinely produces better insights, as measured by multiple independent evaluation judges.

Why does GIANTS use different LM judges for training rewards and final evaluation?

To reduce computational cost
To prevent the training process from overfitting to the evaluation judge's specific biases
Because gemini-2.5-flash is more accurate than gemini-3-pro

Chapter 4: SFT vs RL — The Showcase

This is the most surprising result in the paper and the one that makes GIANTS work. Two training approaches, same data, same base model — completely different outcomes.

SFT: the obvious approach

Supervised Fine-Tuning (SFT) is the standard recipe: give the model (parent summaries, ground-truth insight) pairs and train it to predict the insight via next-token cross-entropy loss. You are literally teaching the model "when you see these parents, output this insight."

Result: 4.65. That is worse than the base model with no fine-tuning at all (4.75). The model got dumber from fine-tuning.

SFT-think: making it worse

Maybe the model needs to think step-by-step? SFT-think adds chain-of-thought distillation — training the model to produce a reasoning trace before the insight.

Result: 4.43. Even worse. Chain-of-thought made the problem harder, not easier.

Why SFT fails

The insight has low likelihood under the policy. Think about what SFT is asking. Given two parent summaries, predict the exact downstream insight token by token. But the downstream insight is a creative synthesis. It is not a deterministic function of the parents. Many plausible insights could follow from the same two papers.

Consider the probability landscape. For a standard QA pair like "What is the capital of France?" → "Paris," the target token sequence has very high conditional probability. The model naturally assigns high likelihood to "Paris" given the question. SFT works because it is reinforcing what the model already wants to say.

For insight anticipation, the situation is reversed. Given two parent summaries, the ground-truth insight is one of thousands of plausible continuations. Its probability under the pre-trained policy is extremely low. Cross-entropy loss penalizes the model severely for producing any of the other thousands of valid insights. The gradient pushes the model toward the single target, but the target is so unlikely that the model cannot find a stable equilibrium. It oscillates, hedges, and eventually collapses to bland, generic outputs that offend no loss function but contain no insight.

Forcing the model to clone one specific insight via cross-entropy is like teaching an art student by only showing them the "correct" painting for each prompt — it destroys creativity rather than building it.

The core failure: SFT tries to maximize the likelihood of one specific target. But insight anticipation is a one-to-many mapping — many valid insights can follow from the same parents. Cross-entropy loss penalizes the model for generating any insight that differs from the one ground truth, even if it's equally valid. The model learns to hedge and produce bland, generic outputs.
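The contrast between the two training signals can be written compactly. The notation below is a sketch in standard form, not a formula quoted from the paper: $\pi_\theta$ is the policy, $r$ is the judge's similarity score, and the token index $t$ is introduced here for illustration.

```latex
% SFT: maximize the likelihood of the single ground-truth insight y^*
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\sum_{t=1}^{|y^*|} \log \pi_\theta\!\left(y^*_t \,\middle|\, x_A, x_B, y^*_{<t}\right)

% RL: maximize the expected similarity of sampled insights to y^*
J_{\mathrm{RL}}(\theta)
  = \mathbb{E}_{\hat{y} \sim \pi_\theta(\cdot \mid x_A, x_B)}
    \left[\, r(\hat{y}, y^*) \,\right]
```

In the first objective, only the exact token sequence $y^*$ earns credit. In the second, any sampled $\hat{y}$ earns credit in proportion to how close the judge finds it to $y^*$, which is why a one-to-many task fits the second form better.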

RL: exploring the insight space

GRPO (Group Relative Policy Optimization) takes a fundamentally different approach. Instead of cloning one target, it:

Samples G = 8 candidate insights per prompt from the current policy
Scores each candidate against the ground truth using the LM judge (similarity reward)
Computes advantages within the group (which candidates were better than average?)
Updates the policy to make the better candidates more likely

Result: 5.97. A 34% improvement over gemini-3-pro.

Why RL succeeds: RL does not require the model to produce the exact ground-truth insight. It only requires that some of its 8 samples are closer to the ground truth than others. The model explores the space of plausible insights, gets partial credit for plausible ones, and gradually learns to steer toward the most promising regions. It optimizes for similarity, not identity.

The analogy

Imagine teaching someone to compose music. SFT is like saying: "Here is the melody. Memorize it. Every note must match." The student learns to reproduce melodies but never learns to compose. RL is like saying: "Compose 8 melodies for this chord progression. I'll tell you which ones sound best." The student explores, discovers what works, and develops a sense of musicality — a meta-skill that transfers to new chord progressions.

Insight anticipation is composition, not transcription. SFT teaches transcription. RL teaches composition. And the gap between them is not subtle — it is the difference between a model that gets worse with training (SFT at 4.65) and one that dramatically improves (GRPO at 5.97).

The numbers tell the story

Method	Score	vs Base	Training signal
Base (Qwen3-4B)	4.75	—	None (pre-training only)
SFT	4.65	−2%	Cross-entropy on (x, y*) pairs
SFT-think	4.43	−7%	Cross-entropy on (x, CoT, y*)
GIANTS-4B (GRPO)	5.97	+26%	Similarity reward + group exploration

SFT vs RL: Insight Generation

SFT produces one mediocre output. RL explores 8 candidates with varying scores.

Why does SFT fail at insight anticipation while RL succeeds?

SFT forces the model to clone one specific insight via cross-entropy, but insight anticipation is one-to-many; RL can explore multiple candidates and get partial credit for similarity
SFT uses a smaller learning rate
RL has access to more training data

Chapter 5: GIANTS-4B Training

Now that we know why RL works, let's look at how GIANTS-4B is actually trained. The recipe is GRPO with a similarity-based reward, applied to Qwen3-4B.

Base model

Qwen3-4B: a 4-billion parameter language model from the Qwen family. Importantly, this is a small model. Parameter counts for frontier models like gemini-3-pro are not public, but they are presumably far larger, plausibly by around two orders of magnitude. The choice is deliberate: GIANTS wants to show that insight anticipation is learnable even at small scale.

Why not start from a larger model? Two reasons. First, practical: GRPO requires sampling 8 completions per prompt, which means 8x the inference cost. A 4B model keeps this manageable. Second, scientific: if GIANTS-4B beats gemini-3-pro, the result is far more impressive than if GIANTS-70B beats it. The smaller the model, the stronger the evidence that the training methodology (not the model capacity) is responsible for the improvement.

GRPO mechanics

For each training prompt (a pair of parent summaries), the model generates G = 8 candidate insights. Each candidate ŷ_i is scored by the training judge (gemini-2.5-flash), producing a reward r_i between 1 and 10.

A_i = (r_i − μ_r) / σ_r

The advantage A_i of each candidate is its z-score within the group. Candidates scoring above the group mean get positive advantage; those below get negative. The policy is then updated via the standard GRPO objective — clipped policy gradient with these group-relative advantages.
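The "clipped policy gradient" part can be sketched for a single sampled candidate. This is a simplified, sequence-level illustration of the PPO-style clipped term: GRPO as described in DeepSeekMath also averages over tokens and adds a KL penalty, both omitted here. The clip range of 0.2 is a commonly used default, not a value taken from the GIANTS paper.

```python
import math


def clipped_term(logp_new: float, logp_old: float, advantage: float,
                 clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (a quantity to maximize)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y) / pi_old(y)
    clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio beyond
    # the clip range in the direction the advantage favors.
    return min(ratio * advantage, clipped * advantage)


# Unchanged policy: the term equals the advantage itself.
print(clipped_term(logp_new=-12.0, logp_old=-12.0, advantage=1.5))  # 1.5

# Policy already moved far toward a good sample: the gain is capped.
print(clipped_term(logp_new=-11.0, logp_old=-12.0, advantage=1.0))
```

The clip keeps any single group of eight samples from dragging the policy too far in one step, which matters when the advantages come from a noisy LM judge.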

Why z-scores and not raw rewards? Because z-scoring normalizes the advantages across different prompts. Some prompts are inherently harder (lower mean reward). Without normalization, easy prompts would dominate the gradient. Z-scoring ensures every prompt contributes equally to learning, regardless of its absolute difficulty level.
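A small worked example, with invented rewards, shows what the normalization buys. An "easy" prompt and a "hard" prompt have very different raw scores but the same spread around their own group mean, so they end up with identical advantages. Returning all zeros when every reward ties is a choice made for this sketch.

```python
from statistics import mean, pstdev


def z_score_advantages(rewards: list[float]) -> list[float]:
    """A_i = (r_i - mean) / std within one group; all zeros if rewards tie."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]


easy_prompt = [8.0, 9.0, 7.0, 8.0]  # illustrative rewards, high on average
hard_prompt = [2.0, 3.0, 1.0, 2.0]  # same shape, much lower on average

adv_easy = z_score_advantages(easy_prompt)
adv_hard = z_score_advantages(hard_prompt)
print([round(a, 3) for a in adv_easy])
print(adv_easy == adv_hard)  # True
```

The model is rewarded for beating its own other attempts on the same prompt, not for landing on a prompt that happens to be easy.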

Judge separation is critical. The training judge (gemini-2.5-flash) and evaluation judge (gemini-3-pro) must be different models. If you train with the same judge you evaluate on, the model can learn to exploit the judge's biases rather than genuinely improving insight quality. This is the RL analog of "teaching to the test."

What the reward captures

The reward is not binary (correct/incorrect). It is a continuous similarity score from 1 to 10. This is crucial. A generated insight that captures half the ground truth gets a ~5, not a 0. This dense reward signal gives the model much more gradient information than a sparse binary reward would. The model can distinguish between "completely wrong," "on the right track," and "nailed it."

Training data

The training set comes from GiantsBench: papers published before July 2023, restricted to the cs.CL domain. The model never sees physics, economics, math, or robotics examples during training. Everything in those domains is zero-shot at test time.

The training loop in pseudocode

python
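# GIANTS-4B training loop (simplified sketch).
# `policy`, `judge`, and `training_data` are placeholders for the model,
# the training judge (gemini-2.5-flash), and the GiantsBench training split.
from statistics import mean, pstdev

G = 8       # candidates sampled per prompt
EPS = 1e-6  # assumed guard: avoids division by zero when all rewards tie

for batch in training_data:
    # Each batch item is (parent_A_summary, parent_B_summary, ground_truth_insight)
    for xA, xB, y_star in batch:
        # 1. Sample G candidate insights from the current policy
        candidates = [policy.generate(xA, xB) for _ in range(G)]

        # 2. Score each candidate against the ground truth with the training judge
        rewards = [judge.score(cand, y_star) for cand in candidates]

        # 3. Compute group-relative advantages (z-scores within the group)
        mu = mean(rewards)
        sigma = pstdev(rewards)
        advantages = [(r - mu) / (sigma + EPS) for r in rewards]

        # 4. GRPO update: reinforce above-mean candidates, suppress below-mean ones
        policy.grpo_update(candidates, advantages)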

The computational cost: Each training step requires 8 full generations per prompt plus 8 LM-judge calls per prompt. This is expensive, far more than SFT, which needs only one forward and backward pass per example and no sampling or judge calls. But the quality of the reward signal justifies the cost. Each step gives the model 8 data points about what works and what does not, rather than SFT's single "memorize this" signal.

Why not PPO?

PPO (Proximal Policy Optimization) is the standard RL algorithm for LLM fine-tuning. Why does GIANTS choose GRPO instead? Two reasons. First, PPO requires training a critic (value) network, which roughly doubles the number of model parameters held in memory. With a 4B policy model, the critic would typically be another model of about the same size. GRPO eliminates this entirely by using group-relative advantages. Second, PPO's value function estimates are noisy for creative tasks: it is hard to learn a reliable value function when the reward landscape is as complex as "scientific insight quality." GRPO sidesteps this by computing advantages empirically from sampled rewards, which requires no function approximation.

For a deep dive into how GRPO works and why it eliminates the critic, see the DeepSeekMath lesson.

Why is a continuous similarity reward (1-10) better than a binary reward for insight anticipation?

Because continuous rewards are faster to compute
Because a partially correct insight gets partial credit, giving the model much denser gradient signal to learn from
Because binary rewards cause numerical overflow

Chapter 6: Main Results

Here is the headline table. Read it carefully — the pattern is remarkable.

Model	Score (1-10)	vs gemini-3-pro
Base (Qwen3-4B)	4.75	−6%
gemini-2.5-pro	5.01	−1%
gemini-3-pro	5.05	baseline
SFT	4.65	−8%
SFT-think	4.43	−12%
GIANTS-4B	5.97	+34%

Three things to notice

1. Scale does not help on this task. gemini-3-pro (5.05) is barely better than gemini-2.5-pro (5.01), which is barely better than a 4B base model (4.75). The difference between a tiny open model and the best frontier model is just 6%. This is not a task where more parameters mean more performance.

2. SFT actively hurts. Both SFT (4.65) and SFT-think (4.43) perform worse than the base model with no fine-tuning (4.75). Supervised fine-tuning on the ground-truth insights makes the model worse at generating insights. This is the strongest evidence that insight anticipation cannot be solved by imitation learning.

3. RL is the only thing that works. GIANTS-4B (5.97) is the only approach that meaningfully improves over the base model. And it doesn't just edge ahead — it blows past every baseline by a wide margin. The 34% improvement over gemini-3-pro is enormous for a generative evaluation task.

The scaling paradox: On most NLP benchmarks, bigger models win. On insight anticipation, scaling does essentially nothing. But a small model with the right training signal (RL + similarity reward) leaps ahead. This suggests insight anticipation requires a fundamentally different capability than what scale provides — it requires learning to explore and synthesize, not to memorize and retrieve.

What does scale give you?

On tasks like coding, math, and factual QA, larger models win because they encode more knowledge and can execute more complex reasoning chains. But insight anticipation is not bottlenecked by knowledge or chain length. The parent summaries provide all the relevant knowledge. The challenge is the creative leap — seeing the connection. And that creative leap does not emerge from having more parameters. It emerges from training the model to explore the space of possible connections and reinforce the ones that are promising.

This matters for capability research. It means there exist cognitive tasks where training methodology matters far more than model size. The frontier is not always about scale.

A closer look at the SFT collapse

The SFT and SFT-think results deserve extra attention because they are counterintuitive. Normally, supervised fine-tuning improves a base model. On standard NLP tasks, SFT is the reliable workhorse. On insight anticipation, it makes the model worse. And adding chain-of-thought (SFT-think) makes it even worse.

Why does chain-of-thought hurt? Because CoT distillation forces the model to also match a specific reasoning trace in addition to the final insight. The reasoning trace is even more idiosyncratic than the insight itself — there are infinitely many valid reasoning paths to any given insight. Forcing the model to match one specific path overwhelms it with conflicting gradients. The loss landscape becomes even more rugged, and the model converges to an even blander equilibrium.

This is a cautionary tale for the field: CoT is not a universal solution. On tasks where the target is underdetermined, CoT distillation can actively harm performance by adding more constraints to an already over-constrained optimization.

Model Comparison

Insight anticipation scores across all models. Notice how frontier models cluster together while GIANTS-4B breaks away.

What is the most surprising finding about frontier model performance on insight anticipation?

They are too slow
They barely outperform a 4B base model: scaling does not help, and the gap between Qwen3-4B and gemini-3-pro is only 6%
They refuse to generate insights

Chapter 7: Generalization

GIANTS-4B was trained on computational linguistics papers only. But GiantsBench spans 8 scientific domains. Does the model's skill transfer to fields it has never seen during training?

Cross-domain zero-shot

Yes. The gains hold across all domains. GIANTS-4B trained on cs.CL outperforms frontier models on physics, mathematics, economics, robotics, computer vision, AI, and machine learning — all without seeing a single training example from those fields.

This is not a small effect. The model consistently outperforms gemini-3-pro on every single domain, not just the domain it was trained on. The transfer is robust and consistent. If the model had only memorized NLP-specific patterns (e.g., "attention + decoding = better generation"), it would fail on physics (where the patterns involve differential equations and conservation laws) or economics (where they involve market mechanisms and equilibrium theory). The fact that it succeeds everywhere suggests it learned something deeper.

What this means: Insight anticipation is not a domain-specific skill. It is something closer to a meta-cognitive ability — the capacity to identify how two ideas compose into a third. Training this ability on one domain (NLP) transfers to completely unrelated fields (physics, economics). The model is not learning NLP facts; it is learning the structure of scientific synthesis itself.

Unseen parents

An even harder test: test-unseen-parents (5,294 examples) restricts evaluation to cases where neither parent paper appeared in the training set. The model cannot rely on any familiarity with the input papers. Even on this hardest split, GIANTS-4B maintains its advantage.
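The filtering rule behind such a split can be sketched in a few lines. This is an illustration only: the Example record, the paper identifiers, and the function name are hypothetical, and the actual GiantsBench construction may differ in its details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    parent_a: str  # hypothetical paper identifier
    parent_b: str  # hypothetical paper identifier
    child: str     # downstream paper whose insight is the target


def unseen_parent_split(train, test):
    """Keep test examples where neither parent was a parent in training."""
    seen = {p for ex in train for p in (ex.parent_a, ex.parent_b)}
    return [
        ex for ex in test
        if ex.parent_a not in seen and ex.parent_b not in seen
    ]


# Toy data, invented for illustration.
train = [Example("P1", "P2", "C1"), Example("P2", "P3", "C2")]
test = [
    Example("P1", "P9", "C3"),  # P1 was seen in training: dropped
    Example("P7", "P8", "C4"),  # both parents unseen: kept
    Example("P8", "P3", "C5"),  # P3 was seen in training: dropped
]
unseen = unseen_parent_split(train, test)
print([ex.child for ex in unseen])
```

Note that the rule requires both parents to be unseen. An example with one familiar parent is excluded, which is what makes this split stricter than the ordinary test set.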

Temporal generalization

All test examples come from papers published after July 2023 — after the training data cutoff. The model is predicting insights for papers it could not possibly have memorized during training. This rules out the possibility that GIANTS-4B is simply retrieving insights it saw in pre-training.

What transfers

Think about what the model learned during training on cs.CL. It did not learn NLP-specific facts (those are in the base model already). What GRPO training teaches is something more abstract:

How to read two summaries and identify the gap between them. What does paper A solve that paper B doesn't? What does paper B enable that paper A can't use?
How to articulate a synthesis. Not "combine A and B" (too vague) but "use A's mechanism to solve B's limitation" (specific and testable).
How to be precise under uncertainty. The model learns to commit to a specific mechanism rather than hedging with generic statements. RL rewards specificity; SFT's cross-entropy punishes it.

These are domain-agnostic skills. A physicist combining two ideas uses the same cognitive patterns as a linguist. The content differs; the structure of creative synthesis is universal.

Evidence for a universal synthesis skill: If insight anticipation were domain-specific, you would expect the model to perform well on cs.CL (its training domain) and poorly on distant domains like physics or economics. Instead, the performance gap between GIANTS-4B and baselines is roughly constant across all domains. The RL training did not teach "NLP synthesis" — it taught "scientific synthesis" as a general cognitive pattern.
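The "roughly constant gap" argument amounts to a simple check: compute the per-domain difference between the trained model and a baseline, then look at how much that difference varies. The sketch below uses made-up placeholder scores and generic domain names purely to show the computation; none of the numbers are results from the paper.

```python
from statistics import mean, pstdev


def domain_gaps(model_scores, baseline_scores):
    """Per-domain score difference (model minus baseline)."""
    return {d: model_scores[d] - baseline_scores[d] for d in model_scores}


# Placeholder values, NOT results from the paper.
toy_model = {"train_domain": 6.0, "domain_b": 5.9, "domain_c": 6.1}
toy_baseline = {"train_domain": 5.0, "domain_b": 5.0, "domain_c": 5.0}

gaps = domain_gaps(toy_model, toy_baseline)
gap_values = list(gaps.values())
average_gap = mean(gap_values)
gap_spread = pstdev(gap_values)  # small spread = gap is roughly constant

print(f"average gap {average_gap:.2f}, spread {gap_spread:.2f}")
```

A domain-specific skill would show up as a large gap on the training domain and gaps near zero elsewhere, which would produce a large spread relative to the average.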

Cross-Domain Performance

GIANTS-4B (orange) vs baselines across scientific domains. Trained only on cs.CL, tested on all. The orange line stays above the baselines everywhere.

GIANTS-4B was trained only on cs.CL data. How does it perform on physics and economics?

It fails because it never saw those domains
It still outperforms frontier models, because insight anticipation is a transferable meta-cognitive skill, not domain-specific knowledge
It performs the same as the base model

Chapter 8: Human Evaluation

LM judges are useful, but do humans actually prefer GIANTS-4B's insights? The paper runs two independent human evaluations to answer this.

Direct human comparison

Annotators are shown insights from GIANTS-4B and the base model side by side (blinded — they do not know which model produced which insight) and asked which is better. This is the gold standard of evaluation: real humans making real judgments about scientific quality.

The result: 89.7% win rate for GIANTS-4B. Nearly 9 out of 10 times, humans prefer the RL-trained model's insights over the base model's. This is a decisive human preference, not a close call.
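How decisive a win rate is depends on how many comparisons stand behind it. As a minimal sketch, the function below computes a win rate together with an approximate 95% Wilson score interval. The counts are toy values chosen for illustration; the number of annotated pairs is not given here, and ties are assumed to be excluded.

```python
import math


def wilson_interval(wins, total, z=1.96):
    """Approximate 95% Wilson score interval for a win rate."""
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


# Toy counts, NOT the paper's annotation counts.
toy_wins, toy_total = 9, 10
low, high = wilson_interval(toy_wins, toy_total)
print(f"win rate {toy_wins / toy_total:.1%}, interval [{low:.3f}, {high:.3f}]")
```

If the lower end of the interval sits above 0.5, the preference is unlikely to be a coin flip. With more annotated pairs the interval narrows around the observed rate.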

What makes GIANTS-4B's insights better? The human evaluators specifically noted:

Higher conceptual clarity. The insights are more precisely articulated. They identify the specific mechanism by which the two parent ideas combine, rather than vaguely gesturing at a connection.
Similar algorithmic complexity. The insights are not simpler — they are just as technically sophisticated as the base model's. The improvement is in precision of thought, not in dumbing things down.

SciJudge-30B

GIANTS also evaluates with SciJudge-30B, a 30-billion-parameter model trained specifically to predict the citation impact of scientific ideas. It is an independent quality signal: it was not trained by the GIANTS team and is not part of their training pipeline.

SciJudge-30B shows a 68% preference for GIANTS-4B over the base model. This is a different kind of validation: not "is this insight similar to the ground truth?" but "does this insight seem like it would be impactful?" GIANTS-4B passes both tests.

Qualitative analysis

Looking at specific examples reveals the difference between GIANTS-4B and the base model. Given parent papers on (A) attention mechanism efficiency and (B) sparse mixture-of-experts routing, the base model might generate: "An approach that combines attention with MoE." That is technically relevant but useless — it describes every paper in the intersection of those fields.

GIANTS-4B, by contrast, generates something like: "Route tokens to specialized attention heads based on input semantics, allowing each head to develop domain expertise while keeping total computation constant." That is specific. It describes a mechanism. It has engineering content. Someone could start building a system from that description.

This is what "conceptual clarity" means in practice: not just correctness, but specificity and actionability. The human evaluators are not rewarding eloquence. They are rewarding the kind of insight that a researcher could actually use.

What RL teaches the model to do differently

The qualitative analysis reveals a specific pattern in how RL changes the model's behavior. The base model tends to produce generic compositional statements: "One could combine technique A with approach B to achieve better results." These are technically not wrong, but they contain zero information. Any human could say the same without reading either paper.

GIANTS-4B produces mechanistic compositions: "Use A's routing mechanism to dynamically allocate B's computation, allowing the model to specialize different pathways for different input types." This is a specific hypothesis. It identifies which component of A interacts with which component of B, and what emergent capability results. This is the difference between a literature review and a research proposal.

RL teaches this specificity because the similarity reward rewards it. Generic statements score ~4 (vaguely in the right direction). Mechanistic statements score ~7 (close match to the ground truth). The group-relative advantage strongly reinforces the specific over the generic.
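To see why the group-relative advantage favors the specific answer, consider a toy group of four sampled insights, three generic (scored 4) and one mechanistic (scored 7). The sketch below normalizes rewards within the group by subtracting the mean and dividing by the standard deviation, which is the usual GRPO-style formulation. The use of the population standard deviation and the eps term are assumptions, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std within the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy judge scores: three generic insights and one mechanistic one.
toy_rewards = [4.0, 4.0, 4.0, 7.0]
advantages = group_relative_advantages(toy_rewards)

for r, a in zip(toy_rewards, advantages):
    print(f"reward {r:.1f} -> advantage {a:+.2f}")
```

The mechanistic sample receives a large positive advantage and every generic sample a negative one, so the policy update pushes probability toward the specific formulation and away from the safe, generic ones.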

Comparison to existing AI-for-science tools

How does GIANTS' output compare to tools researchers actually use? Current literature exploration tools (Semantic Scholar, Connected Papers, Elicit) excel at finding relevant papers and extracting claims. But they do not synthesize. They can tell you "paper A is related to paper B" but not "here is what you could build if you combined them."

GIANTS-4B is the first model evaluated on this synthesis task at scale. The 89.7% human win rate suggests that RL-trained models could eventually complement existing tools by adding a "creative suggestion" layer on top of retrieval. The researcher finds the papers; the model suggests what to build with them.

The key innovation is not just that GIANTS-4B produces better text. It is that the insights it produces are structurally different. They contain the specific mechanistic content that researchers need: which component interacts with which, what property emerges, what limitation is overcome. This kind of structured scientific thinking is precisely what separates useful AI assistance from impressive but unhelpful language generation.

Three independent validation signals all agree:
1. LM judge (gemini-3-pro): +34% improvement in similarity score
2. Human annotators: 89.7% win rate on conceptual clarity
3. SciJudge-30B (impact predictor): 68% preference
Three different judges, three different criteria, all pointing the same direction.

What does the 89.7% human win rate tell us about GIANTS-4B's improvements?

GIANTS-4B produces insights with higher conceptual clarity and precision, not just higher LM-judge scores; humans independently confirm the improvement
GIANTS-4B produces simpler insights that are easier to read
The human evaluation is not statistically significant

Chapter 9: Connections

Where GIANTS fits

Method	Relationship to GIANTS
DeepSeekMath / GRPO	GIANTS uses GRPO as its RL algorithm. DeepSeekMath introduced group-relative advantages that eliminate the critic network — the same mechanism that powers GIANTS-4B.
DAPO	Another GRPO variant for LLM RL. Where DAPO fixes failure modes of GRPO at scale (entropy collapse, dead gradients), GIANTS extends GRPO to a new task type with continuous similarity rewards.
AI Scientist v2	End-to-end automated research. AI Scientist generates full papers; GIANTS focuses on the narrower but more measurable task of predicting one specific insight. GIANTS is a component that could feed into a system like AI Scientist.
Poly-EPO	Exploration in RL for reasoning. GIANTS' key finding — that RL exploration beats SFT imitation — aligns with Poly-EPO's emphasis on maintaining diversity during policy optimization.
Simula	Synthetic data generation with LM-judge quality control. Both GIANTS and Simula use LM judges as reward signals, but for very different tasks (insight quality vs data quality).

The bigger picture

GIANTS sits at the intersection of two trends: RL for LLMs beyond correctness and AI for science. Most RL-for-LLM work (DeepSeek-R1, DAPO, Poly-EPO) uses binary rewards on verifiable tasks — is the math answer correct? GIANTS shows RL can work with soft, continuous rewards from an LM judge on a subjective, creative task. This opens the door to RL training on tasks where there is no single correct answer.
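A small illustration of why the reward type matters on open-ended tasks: if a binary "matches the ground truth" reward is applied to a group in which no sample is an exact match, every reward is zero and there is nothing to rank. A continuous score keeps the ordering between samples. The threshold and scores below are toy assumptions, not values from the paper.

```python
from statistics import pvariance


def binary_reward(score, threshold=10.0):
    """Hypothetical pass/fail reward: 1 only for a perfect match."""
    return 1.0 if score >= threshold else 0.0


# Toy similarity scores out of 10 for one group of sampled insights.
judge_scores = [3.0, 4.0, 4.0, 7.0]
binary_rewards = [binary_reward(s) for s in judge_scores]

# Zero variance within a group means every sample gets the same advantage.
binary_signal = pvariance(binary_rewards)
continuous_signal = pvariance(judge_scores)
print(binary_signal, continuous_signal)
```

With the binary reward the group carries no learning signal at all, while the continuous reward still distinguishes the sample scored 7 from the one scored 3.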

On the science side, GIANTS provides the first quantitative evidence that creative scientific synthesis can be improved through training. Previous work on AI for science focused on execution (running experiments, writing code, generating hypotheses from databases). GIANTS targets the spark — the moment of "wait, these two ideas combine into something new." If this capability continues to improve, it could fundamentally change how researchers discover connections in the literature.

Practical implications

Imagine a tool that reads your last two papers and suggests what the logical next step of your research should be. Not based on keyword matching or citation networks, but based on a deep understanding of how scientific ideas compose. That tool could surface connections you would never find by browsing arXiv. It could suggest collaborations between researchers in different fields who are working on complementary problems. It could accelerate the pace of discovery by reducing the time between "the ingredients exist" and "someone sees the recipe."

GIANTS does not build that tool. But it proves that the core capability — anticipating the insight at the intersection of two research lines — is learnable. That is the first step.

Key takeaways across the paper

Finding	Implication
SFT hurts on creative tasks	Cross-entropy loss is wrong for one-to-many mappings. Consider RL whenever the target is not unique.
Frontier scale does not help	Some cognitive tasks need the right training signal, not more parameters. Methodology can beat scale.
RL transfers across domains	Insight synthesis is a domain-agnostic meta-skill. Train on one field, deploy everywhere.
LM judges are viable rewards	Continuous LM-judge rewards enable RL on tasks where binary correctness is undefined.
4B can beat frontier at creativity	Small, well-trained models can outperform 100x larger models on tasks requiring exploration.

The meta-lesson: When a task looks impossible for SFT, do not give up — try RL. Insight anticipation looked unsolvable: frontier models could not do it, SFT made models worse, chain-of-thought made it worse still. But the right training signal (similarity reward + group exploration) unlocked a capability that no amount of scale or supervised data could provide.

Open questions

GIANTS leaves several exciting questions unanswered. Can insight anticipation improve further with more RL training? What if you chain GIANTS — generate an insight, then use it as a parent for the next generation? Could this produce a cascade of discoveries? And perhaps most intriguingly: can GIANTS generate insights that are genuinely novel — not matching any existing paper, but still scientifically valid?

The paper does not answer these questions, but the framework makes them answerable. GiantsBench provides the testbed. RL provides the training paradigm. The stage is set for AI systems that don't just analyze science — they participate in it.

Limitations to keep in mind

Evaluation is retrospective. GIANTS reconstructs insights for papers that already exist. We do not yet know whether it can generate genuinely novel insights that lead to new discoveries.
Parent identification is automated. The choice of which two papers are the "parents" is made by gemini-2.5-flash, not by the downstream paper's actual authors. The ground truth of intellectual lineage is more complex than any two-paper summary.
Judge-based rewards have limits. The LM judge can only assess similarity to existing insights. It cannot evaluate scientific validity or feasibility, so a beautifully stated but physically impossible insight could still score well.
The benchmark is arXiv-centric. Science that happens outside of arXiv (industry R&D, clinical trials, field research) is not represented.

These limitations do not diminish the contribution. They define the scope. GIANTS is a proof of concept for a new capability, not a finished product. Future work could address each limitation: validating generated insights with domain experts, using multi-parent configurations beyond pairs, adding feasibility checks via simulation or literature search, and expanding beyond arXiv to include clinical and industrial research.

"If I have seen further it is by standing on the shoulders of giants." — Isaac Newton, 1675. GIANTS asks: can a machine learn to see further, too?
