Standing on the shoulders of giants — predicting a downstream paper's core insight from its parent papers. GIANTS-4B outperforms gemini-3-pro by 34% via RL with similarity rewards.
Imagine you are reading two papers. One introduces a clever way to align language models with human preferences using preference pairs. The other shows how to sample multiple outputs and compare them within a group. You read both carefully. Something clicks — you see a connection the original authors never made. Six months later, someone publishes a paper that combines exactly those two ideas into something new.
That moment of synthesis — seeing the downstream insight that two parent papers enable — is the essence of scientific creativity. It is what separates a literature reviewer from a researcher. It is also, as far as anyone knows, something only humans can do.
Or is it?
Predicting that downstream insight from the parent papers alone is the task GIANTS calls insight anticipation, and it is extraordinarily hard. The insight is not contained in either parent paper. It lives in the gap between them, in the creative synthesis that a human scientist would perform. And the search space is vast: countless plausible insights could follow from two papers, but only one is the actual ground truth.
Here is the surprising finding: frontier models are terrible at this. Gemini-3-pro, one of the most capable models in existence, scores just 5.05 out of 10 on insight anticipation. Gemini-2.5-pro scores 5.01. Even a tiny 4B-parameter base model scores 4.75. The gap between the best frontier model and a base model is only 6%. Throwing more parameters at the problem does not help.
But GIANTS-4B — a Qwen3-4B model fine-tuned with reinforcement learning — scores 5.97. That is a 34% improvement over gemini-3-pro. A 4-billion parameter model outperforming a model orders of magnitude larger, on a task that requires scientific creativity.
How? That is the story of this lesson.
Consider how this compares to tasks we know LLMs are good at. Summarization: the answer is literally in the input, just compressed. Question answering: the evidence is in the context or in training data. Code generation: the specification constrains the output heavily. Insight anticipation is none of these. The output is not in the input. It is not in the training data (the test papers are temporally held out). And the specification is minimal — "what insight follows from these two ideas?" admits thousands of valid answers.
This is why the problem is so fascinating. It sits at the boundary of what we think machines can do. And as we will see, the training algorithm matters far more than the model size.
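To make this tangible, consider two real parent papers: the one that introduced DPO, which aligns language models with human preferences using preference pairs, and the one that introduced GRPO, which samples multiple outputs and compares them within a group.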
Now: what insight might a downstream paper contribute by combining these two ideas? Take a moment to think. There is no single right answer, but a good answer would identify a specific mechanism — not just "combine DPO and GRPO" but how they compose and what new capability emerges.
This is what GIANTS-4B learns to do. And it does it better than models 100 times its size.
GIANTS frames insight anticipation as auto-encoding over the citation graph. This framing is the conceptual foundation of everything else in the paper, so let's unpack it carefully.
In a standard auto-encoder, you take an input, compress it through a bottleneck, and try to reconstruct the original. The bottleneck forces the model to learn the essential structure. GIANTS applies this same logic to scientific knowledge — but the "bottleneck" is the citation graph itself.
Why is this framing so powerful? Because it turns an impossibly vague task ("predict scientific breakthroughs") into a concrete, evaluable one. For every real paper on arXiv, we know its actual insight y*. We know which papers it cites. We can always check: did the model's generated insight ŷ match the real one?
The auto-encoder analogy also explains why the task is hard. In a standard auto-encoder, the bottleneck preserves the most important information. But in GIANTS' citation-graph bottleneck, the most important information — the creative synthesis — is precisely what gets lost. The parent summaries contain the ingredients, but the recipe (the downstream insight) is not extractable from the ingredients alone. The model must invent the recipe, not just recover it.
This is not asking the model to predict which paper will be written next. It is asking: given the specific ingredients (two parent papers), can you figure out what someone actually cooked? The answer already exists. We just hide it behind the lossy channel and ask the model to reconstruct it.
Concretely, here is what the model sees and produces:
| Component | What it is | Who provides it |
|---|---|---|
| xA | Summary of parent paper A (~200 words) | GiantsBench pipeline (gemini-2.5-flash) |
| xB | Summary of parent paper B (~200 words) | GiantsBench pipeline |
| y* | Ground-truth insight of the downstream paper | Extracted from the real paper, rewritten standalone |
| ŷ | Model-generated insight prediction | GIANTS-4B (or baseline model) |
The input is (xA, xB). The output is ŷ. The evaluation compares ŷ to y*. The model never sees the downstream paper. It must invent what the downstream paper discovered.
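The input/output contract above can be sketched as a small data structure. This is only an illustration: the class name, field names, and prompt wording are assumptions, not GiantsBench's actual schema or prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InsightExample:
    """One GiantsBench-style example (field names are illustrative)."""
    parent_a_summary: str   # x_A
    parent_b_summary: str   # x_B
    target_insight: str     # y*, used only for scoring, never shown to the model


def build_prompt(example: InsightExample) -> str:
    """Assemble the model input from (x_A, x_B); y* is deliberately excluded."""
    return (
        f"Parent paper A:\n{example.parent_a_summary}\n\n"
        f"Parent paper B:\n{example.parent_b_summary}\n\n"
        "What core insight might a downstream paper contribute by combining these?"
    )


example = InsightExample(
    parent_a_summary="Aligns language models using preference pairs.",
    parent_b_summary="Samples several outputs and compares them within a group.",
    target_insight="(held-out ground-truth insight)",
)
prompt = build_prompt(example)
```

Keeping y* as a separate field that the prompt builder never reads makes the "model never sees the downstream paper" rule easy to check.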
The downstream paper's insight passes through a lossy channel (parent summaries). The model tries to reconstruct it.
To train and evaluate insight anticipation, GIANTS constructs GiantsBench — a large-scale benchmark from arXiv. This is not just a data dump. The construction involves careful filtering, annotation, and splitting to ensure the task is both hard and fair.
GiantsBench draws on 17,839 arXiv papers published between May 2007 and January 2026, each with at least 2 citations. The papers span 8 scientific domains: cs.CL, cs.AI, cs.LG, cs.CV, physics, math, economics, and robotics.
For each downstream paper, gemini-2.5-flash reads the full text and identifies the two most synergistic parent papers — the two cited works whose combination most directly inspired the downstream paper's core contribution. This is not random: the model must reason about intellectual lineage.
The construction of GiantsBench is a multi-stage process. Each stage is designed to maximize data quality while preventing information leakage (the downstream paper's content must never leak into the parent summaries or the task setup).
GiantsBench combines two orthogonal splits, temporal and domain, with a stricter unseen-parents test subset to make evaluation rigorous:
| Split | Rule | Size | What it tests |
|---|---|---|---|
| Temporal | Train: before July 2023. Test: after. | Test: 7,504 | Can the model anticipate insights from papers it could not have memorized? |
| Domain | Train: cs.CL only. Test: ALL 8 domains. | — | Does insight anticipation transfer across scientific fields? |
| Unseen parents | Test examples where neither parent appeared in training. | 5,294 | Can the model generalize beyond familiar research lineages? |
The final benchmark contains 17,839 annotated triplets (parent pair + insight). Each one has been verified: the parent identification is sensible, the summaries are standalone (no leakage of the downstream paper's content), and the insight is faithfully extracted. The test set alone contains 7,504 examples — large enough for reliable statistical conclusions.
One subtle but important design choice: when multiple downstream papers share the same parent pair, GiantsBench keeps only the most-cited downstream paper. Why? Because popular papers are more likely to represent genuine intellectual synthesis rather than incremental follow-ups. This filtering also prevents the same parent pair from appearing multiple times with different targets, which would confuse the evaluation.
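The deduplication rule can be sketched in a few lines. The `Triplet` type and its fields are hypothetical, and treating (A, B) and (B, A) as the same pair is an assumption; the lesson does not say how parent order is handled.

```python
from typing import NamedTuple


class Triplet(NamedTuple):
    parent_a: str     # identifier of parent paper A
    parent_b: str     # identifier of parent paper B
    downstream: str   # identifier of the downstream paper
    citations: int    # citation count of the downstream paper


def dedupe_by_parent_pair(triplets):
    """Keep only the most-cited downstream paper for each unordered parent pair."""
    best = {}
    for t in triplets:
        key = frozenset((t.parent_a, t.parent_b))  # assumed: pair order is ignored
        if key not in best or t.citations > best[key].citations:
            best[key] = t
    return list(best.values())


triplets = [
    Triplet("p1", "p2", "d1", citations=12),
    Triplet("p2", "p1", "d2", citations=40),  # same pair, reversed order
    Triplet("p1", "p3", "d3", citations=5),
]
kept = dedupe_by_parent_pair(triplets)
```

After this step each parent pair maps to exactly one target insight, which is what keeps the evaluation unambiguous.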
The temporal split is not just a convenience — it is the primary defense against data contamination. Any paper published before July 2023 could theoretically appear in the base model's pre-training data. By restricting the test set to papers after July 2023, GIANTS ensures the model is predicting insights for papers it has never seen in any form. The model cannot retrieve the answer from memory. It must generate it.
This is more rigorous than many ML benchmarks, which do not control for data contamination at all. GIANTS' temporal split makes the evaluation meaningfully future-facing: you are testing the model's ability to anticipate insights that did not exist when the model was trained.
The domain split adds a second dimension of rigor. Even if a model somehow memorized every cs.CL paper, it would gain no advantage on the physics or economics test sets. Both splits must be beaten simultaneously, which makes it extremely unlikely that any observed improvement comes from memorization rather than genuine insight anticipation ability.
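A minimal sketch of how the temporal and domain rules might combine, assuming the cutoff falls exactly on 1 July 2023 (the text only says "July 2023") and that pre-cutoff papers outside cs.CL are simply left out of training. The `Paper` type and the "unused" label are illustrative.

```python
from dataclasses import dataclass
from datetime import date

CUTOFF = date(2023, 7, 1)   # assumed exact boundary
TRAIN_DOMAIN = "cs.CL"


@dataclass(frozen=True)
class Paper:
    paper_id: str
    published: date
    domain: str


def assign_split(paper: Paper) -> str:
    """Return 'train', 'test', or 'unused' under the temporal and domain rules."""
    if paper.published >= CUTOFF:
        return "test"       # every domain is evaluated after the cutoff
    if paper.domain == TRAIN_DOMAIN:
        return "train"      # pre-cutoff cs.CL only
    return "unused"         # pre-cutoff, other domains (assumed to be dropped)


papers = [
    Paper("a", date(2022, 3, 1), "cs.CL"),
    Paper("b", date(2022, 3, 1), "physics"),
    Paper("c", date(2024, 1, 15), "physics"),
    Paper("d", date(2024, 6, 2), "cs.CL"),
]
splits = {p.paper_id: assign_split(p) for p in papers}
```

Paper "c" is the interesting case: it is both later than anything in training and from a domain the model never trained on.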
The four-stage construction process: identify parents, summarize, extract insight, deduplicate. Each stage filters and refines the data.
How do you grade a generated insight? This is not a multiple-choice test. The model's output ŷ is free-form text describing a scientific idea. The ground truth y* is also free-form text. You cannot just check string equality — there are infinitely many ways to phrase the same insight.
GIANTS uses an LM judge: a separate language model that reads both ŷ (generated) and y* (ground truth) and assigns a similarity score from 1 to 10. The judge evaluates whether the generated insight captures the same core idea, not whether the wording matches.
This is a crucial design decision. The judge is not checking correctness (there is no "correct" answer in the traditional sense). It is checking semantic alignment: does the generated insight describe the same fundamental idea as the ground truth, even if the words and structure differ completely?
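To make the judging step concrete, here is a hedged sketch of the plumbing around a judge call. The prompt wording and the "Score: n" reply format are invented for illustration and are not the paper's actual judge prompt. The call to the judge model itself is left out.

```python
import re

JUDGE_TEMPLATE = (  # hypothetical wording, not the paper's prompt
    "Ground-truth insight:\n{target}\n\n"
    "Generated insight:\n{generated}\n\n"
    "Rate from 1 to 10 how well the generated insight captures the same core idea. "
    "Reply as 'Score: <integer>'."
)


def build_judge_prompt(generated: str, target: str) -> str:
    """Fill the template with the generated insight and the ground truth."""
    return JUDGE_TEMPLATE.format(target=target, generated=generated)


def parse_score(reply: str) -> int:
    """Extract the 1-10 score from a judge reply; raise if missing or out of range."""
    match = re.search(r"Score:\s*(\d+)", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    score = int(match.group(1))
    if not 1 <= score <= 10:
        raise ValueError(f"score {score} outside the 1-10 scale")
    return score
```

Rejecting malformed replies rather than guessing a default keeps bad judge outputs from silently turning into rewards.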
Can you trust an LM judge? GIANTS validates this directly. Human annotators score a subset of generated insights, and their scores are compared with the LM judge's, giving a Spearman correlation of 0.761.
This is strong. For context, inter-annotator agreement between human raters on subjective tasks typically falls in the 0.6–0.8 range. The LM judge agrees with humans about as much as humans agree with each other.
Surface-level metrics like BLEU (n-gram overlap) or ROUGE (recall of reference n-grams) fail catastrophically on this task. Two descriptions of the same scientific insight can use completely different vocabulary. "Combine RL with preference pairs" and "Use reinforcement learning to optimize based on human-annotated ranking data" describe the same idea but share almost no n-grams. The LM judge captures semantic similarity, not lexical overlap.
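The lexical mismatch is easy to check. The sketch below uses a simplified word n-gram overlap, not the full BLEU or ROUGE definitions, on the two phrasings quoted above.

```python
def ngrams(text: str, n: int) -> set:
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(a: str, b: str, n: int) -> float:
    """Fraction of a's n-grams that also occur in b (0.0 if a has none)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    return len(grams_a & grams_b) / len(grams_a) if grams_a else 0.0


short = "Combine RL with preference pairs"
long = "Use reinforcement learning to optimize based on human-annotated ranking data"

unigram_overlap = overlap(short, long, 1)  # 0.0: not a single shared word
bigram_overlap = overlap(short, long, 2)   # 0.0
```

With whitespace tokenization the two sentences share no words at all, so any metric built purely on n-gram matches would score them as unrelated.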
The LM judge does not simply ask "are these similar?" It evaluates along specific dimensions:
A score of 1 means completely unrelated. A score of 5 means roughly the right direction but missing key details. A score of 8-10 means the generated insight nails the core mechanism even if the phrasing differs. This continuous scale is what makes the evaluation informative — and, as we will see, it is what makes RL training possible.
A natural worry: what if the LM judge is noisy or biased? GIANTS addresses this in three ways. First, the Spearman correlation of 0.761 with human judgments shows the judge is reliable. Second, the strict separation between training and evaluation judges prevents overfitting. Third, the results are robust across multiple evaluation judges — the paper reports that swapping in different evaluation judges does not change the relative ranking of methods.
This last point is important because it means the improvement is real, not an artifact of one particular judge's biases. GIANTS-4B genuinely produces better insights, as measured by multiple independent evaluation criteria.
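For readers unfamiliar with the statistic behind the 0.761 figure, Spearman correlation is the Pearson correlation of the ranks. The sketch below implements it in plain Python. The human and judge scores are made up for illustration and are not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1   # sorted positions i..j share one rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for five insights, purely for illustration
human = [2, 4, 5, 7, 9]
judge = [3, 3, 6, 8, 7]
rho = spearman(human, judge)
```

Because only ranks matter, a judge that is systematically one point harsher than humans would still reach a correlation of 1.0, which is why rank correlation suits comparing two graders who may use the scale differently.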
This is the most surprising result in the paper and the one that makes GIANTS work. Two training approaches, same data, same base model — completely different outcomes.
Supervised Fine-Tuning (SFT) is the standard recipe: give the model (parent summaries, ground-truth insight) pairs and train it to predict the insight via next-token cross-entropy loss. You are literally teaching the model "when you see these parents, output this insight."
Result: 4.65. That is worse than the untrained base model (4.75). The model got dumber from training.
Maybe the model needs to think step-by-step? SFT-think adds chain-of-thought distillation — training the model to produce a reasoning trace before the insight.
Result: 4.43. Even worse. Chain-of-thought made the problem harder, not easier.
The insight has low likelihood under the policy. Think about what SFT is asking. Given two parent summaries, predict the exact downstream insight token by token. But the downstream insight is a creative synthesis. It is not a deterministic function of the parents. Many plausible insights could follow from the same two papers.
Consider the probability landscape. For a standard QA pair like "What is the capital of France?" → "Paris," the target token sequence has very high conditional probability. The model naturally assigns high likelihood to "Paris" given the question. SFT works because it is reinforcing what the model already wants to say.
For insight anticipation, the situation is reversed. Given two parent summaries, the ground-truth insight is one of thousands of plausible continuations. Its probability under the pre-trained policy is extremely low. Cross-entropy loss penalizes the model severely for producing any of the other thousands of valid insights. The gradient pushes the model toward the single target, but the target is so unlikely that the model cannot find a stable equilibrium. It oscillates, hedges, and eventually collapses to bland, generic outputs that offend no loss function but contain no insight.
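A toy calculation shows why a one-to-many target is so hostile to cross-entropy. Assume, purely for illustration, that a prompt has K equally plausible answers and that the observed target is one of them chosen uniformly. Then even the best possible predictor cannot push the expected loss below log K. The value K = 1000 is an arbitrary stand-in for "thousands".

```python
import math


def best_expected_nll(num_valid_targets: int) -> float:
    """Lowest achievable expected negative log-likelihood (in nats) when the
    observed target is drawn uniformly from K equally plausible answers."""
    # The optimal predictor puts probability 1/K on each answer,
    # so every observed target costs -log(1/K) = log(K).
    return math.log(num_valid_targets)


qa_floor = best_expected_nll(1)          # one right answer, e.g. "Paris"
insight_floor = best_expected_nll(1000)  # assumed: a thousand plausible insights
```

For the QA case the floor is zero, so training can converge cleanly. For the insight case the floor is about 6.9 nats per example, and all of that irreducible loss arrives as gradient pushing toward whichever single insight happened to be written down.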
Forcing the model to clone one specific insight via cross-entropy is like teaching an art student by only showing them the "correct" painting for each prompt — it destroys creativity rather than building it.
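GRPO (Group Relative Policy Optimization) takes a fundamentally different approach. Instead of cloning one target, it samples a group of 8 candidate insights for each pair of parent summaries, scores each candidate against the ground truth with the LM judge, and then reinforces the candidates that score above the group mean while suppressing those that score below it.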
Result: 5.97. A 34% improvement over gemini-3-pro.
Imagine teaching someone to compose music. SFT is like saying: "Here is the melody. Memorize it. Every note must match." The student learns to reproduce melodies but never learns to compose. RL is like saying: "Compose 8 melodies for this chord progression. I'll tell you which ones sound best." The student explores, discovers what works, and develops a sense of musicality — a meta-skill that transfers to new chord progressions.
Insight anticipation is composition, not transcription. SFT teaches transcription. RL teaches composition. And the gap between them is not subtle — it is the difference between a model that gets worse with training (SFT at 4.65) and one that dramatically improves (GRPO at 5.97).
| Method | Score | vs Base | Training signal |
|---|---|---|---|
| Base (Qwen3-4B) | 4.75 | — | None (pre-training only) |
| SFT | 4.65 | −2% | Cross-entropy on (x, y*) pairs |
| SFT-think | 4.43 | −7% | Cross-entropy on (x, CoT, y*) |
| GIANTS-4B (GRPO) | 5.97 | +26% | Similarity reward + group exploration |
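The "vs Base" column is plain relative change, which can be reproduced from the scores in the table:

```python
BASE = 4.75  # Qwen3-4B base score from the table

scores = {"SFT": 4.65, "SFT-think": 4.43, "GIANTS-4B (GRPO)": 5.97}


def relative_change_pct(score: float, reference: float) -> float:
    """Percentage change of `score` relative to `reference`."""
    return (score - reference) / reference * 100


# Rounded to whole percentages, matching the table's precision
changes = {name: round(relative_change_pct(s, BASE)) for name, s in scores.items()}
```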
SFT produces one mediocre output. RL explores 8 candidates with varying scores.
Now that we know why RL works, let's look at how GIANTS-4B is actually trained. The recipe is GRPO with a similarity-based reward, applied to Qwen3-4B.
Qwen3-4B: a 4-billion parameter language model from the Qwen family. Importantly, this is a small model. It is roughly 100x smaller than frontier models like gemini-3-pro. The choice is deliberate — GIANTS wants to show that insight anticipation is learnable even at small scale.
Why not start from a larger model? Two reasons. First, practical: GRPO requires sampling 8 completions per prompt, which means 8x the inference cost. A 4B model keeps this manageable. Second, scientific: if GIANTS-4B beats gemini-3-pro, the result is far more impressive than if GIANTS-70B beats it. The smaller the model, the stronger the evidence that the training methodology (not the model capacity) is responsible for the improvement.
For each training prompt (a pair of parent summaries), the model generates G = 8 candidate insights. Each candidate ŷi is scored by the training judge (gemini-2.5-flash), producing a reward ri between 1 and 10.
The advantage Ai of each candidate is its z-score within the group. Candidates scoring above the group mean get positive advantage; those below get negative. The policy is then updated via the standard GRPO objective — clipped policy gradient with these group-relative advantages.
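In symbols, the update described above can be written as follows. This is the commonly published form of the GRPO objective rather than something this lesson spells out: the clipping range $\epsilon$ and the per-token probability ratio $\rho_{i,t}$ are standard notation introduced here, and any KL penalty toward a reference policy is omitted because GIANTS' exact hyperparameters are not given.

$$
A_i = \frac{r_i - \operatorname{mean}(r_1, \dots, r_G)}{\operatorname{std}(r_1, \dots, r_G)}, \qquad G = 8
$$

$$
J(\theta) = \mathbb{E}\left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\hat{y}_i|} \sum_{t=1}^{|\hat{y}_i|} \min\Big( \rho_{i,t} A_i,\ \operatorname{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big) A_i \Big) \right],
\qquad
\rho_{i,t} = \frac{\pi_\theta(\hat{y}_{i,t} \mid x_A, x_B, \hat{y}_{i,<t})}{\pi_{\theta_{\text{old}}}(\hat{y}_{i,t} \mid x_A, x_B, \hat{y}_{i,<t})}
$$

Every token of candidate $\hat{y}_i$ shares the same advantage $A_i$, so a candidate that beats its group is reinforced as a whole and one that trails its group is suppressed as a whole.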
Why z-scores and not raw rewards? Because z-scoring normalizes the advantages across different prompts. Some prompts are inherently harder (lower mean reward). Without normalization, easy prompts would dominate the gradient. Z-scoring ensures every prompt contributes equally to learning, regardless of its absolute difficulty level.
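A small numeric sketch of that normalization, using made-up judge scores. Whether GIANTS uses the population or sample standard deviation, and how it handles zero spread, is not stated; the population form and the small `eps` are assumptions.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward within its group; eps guards against zero spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


easy_prompt = [6, 7, 7, 8, 8, 8, 9, 9]  # made-up scores, high group mean
hard_prompt = [2, 3, 3, 4, 4, 4, 5, 5]  # same shape, shifted down by 4

adv_easy = group_advantages(easy_prompt)
adv_hard = group_advantages(hard_prompt)
```

The two groups produce the same advantages even though their raw rewards differ by 4 points, so the hard prompt contributes as much learning signal as the easy one.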
The reward is not binary (correct/incorrect). It is a continuous similarity score from 1 to 10. This is crucial. A generated insight that captures half the ground truth gets a ~5, not a 0. This dense reward signal gives the model much more gradient information than a sparse binary reward would. The model can distinguish between "completely wrong," "on the right track," and "nailed it."
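The difference between a dense and a sparse reward shows up as soon as no candidate in a group is a near-perfect match. The scores and the binary threshold of 8 below are hypothetical and chosen only to illustrate the point.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Z-score each reward within its group; eps guards against zero spread."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


judge_scores = [3, 4, 4, 5, 5, 6, 6, 7]  # made-up group: nobody "nails it"

# Hypothetical binary reward: 1 only for a near-perfect match (score >= 8)
binary = [1 if s >= 8 else 0 for s in judge_scores]

dense_adv = group_advantages(judge_scores)  # ranks the candidates against each other
binary_adv = group_advantages(binary)       # all zeros: no learning signal
```

With the binary reward every candidate looks equally bad, so the group-relative advantages vanish and the update does nothing. The continuous score still separates the candidate scoring 7 from the one scoring 3.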
The training set comes from GiantsBench: papers published before July 2023, restricted to the cs.CL domain. The model never sees physics, economics, math, or robotics examples during training. Everything in those domains is zero-shot at test time.
```python
# GIANTS-4B training loop (simplified sketch; `policy` and `judge` are placeholder objects)
from statistics import mean, pstdev

G = 8  # candidates sampled per prompt


def train(policy, judge, training_data, eps=1e-6):
    for batch in training_data:
        # Each batch holds (parent_A_summary, parent_B_summary, ground_truth_insight) triples
        for xA, xB, y_star in batch:
            # 1. Sample G candidate insights from the current policy
            candidates = [policy.generate(xA, xB) for _ in range(G)]

            # 2. Score each candidate with the training judge (gemini-2.5-flash)
            rewards = [judge.score(cand, y_star) for cand in candidates]

            # 3. Compute group-relative advantages (z-scores);
            #    eps avoids division by zero when all rewards are equal
            mu = mean(rewards)
            sigma = pstdev(rewards)
            advantages = [(r - mu) / (sigma + eps) for r in rewards]

            # 4. GRPO update: reinforce above-mean, suppress below-mean
            policy.grpo_update(candidates, advantages)
```
PPO (Proximal Policy Optimization) is the standard RL algorithm for LLM fine-tuning. Why does GIANTS choose GRPO instead? Two reasons. First, PPO requires training a critic (value) network, which doubles the memory requirement. With a 4B policy model, the critic would be another 4B model. GRPO eliminates this entirely by using group-relative advantages. Second, PPO's value function estimates are noisy for creative tasks — it is hard to learn a reliable value function when the reward landscape is as complex as "scientific insight quality." GRPO sidesteps this by computing advantages empirically from sampled rewards, which requires no function approximation.
For a deep dive into how GRPO works and why it eliminates the critic, see the DeepSeekMath lesson.
Here is the headline table. Read it carefully — the pattern is remarkable.
| Model | Score (1-10) | vs gemini-3-pro |
|---|---|---|
| Base (Qwen3-4B) | 4.75 | — |
| gemini-2.5-pro | 5.01 | −1% |
| gemini-3-pro | 5.05 | baseline |
| SFT | 4.65 | −8% |
| SFT-think | 4.43 | −12% |
| GIANTS-4B | 5.97 | +34% |
1. Frontier models do not scale at this task. gemini-3-pro (5.05) is barely better than gemini-2.5-pro (5.01), which is barely better than a 4B base model (4.75). The difference between a tiny open model and the best frontier model is just 6%. This is not a task where more parameters = more performance.
2. SFT actively hurts. Both SFT (4.65) and SFT-think (4.43) perform worse than the untrained base model (4.75). Supervised fine-tuning on the ground-truth insights makes the model worse at generating insights. This is the strongest evidence that insight anticipation cannot be solved by imitation learning.
3. RL is the only thing that works. GIANTS-4B (5.97) is the only approach that meaningfully improves over the base model. And it doesn't just edge ahead — it blows past every baseline by a wide margin. The 34% improvement over gemini-3-pro is enormous for a generative evaluation task.
On tasks like coding, math, and factual QA, larger models win because they encode more knowledge and can execute more complex reasoning chains. But insight anticipation is not bottlenecked by knowledge or chain length. The parent summaries provide all the relevant knowledge. The challenge is the creative leap — seeing the connection. And that creative leap does not emerge from having more parameters. It emerges from training the model to explore the space of possible connections and reinforce the ones that are promising.
This is a profound finding for AI safety and capability research. It means there exist cognitive tasks where training methodology matters far more than model size. The frontier is not always about scale.
The SFT and SFT-think results deserve extra attention because they are counterintuitive. Normally, supervised fine-tuning improves a base model. On standard NLP tasks, SFT is the reliable workhorse. On insight anticipation, it makes the model worse. And adding chain-of-thought (SFT-think) makes it even worse.
Why does chain-of-thought hurt? Because CoT distillation forces the model to also match a specific reasoning trace in addition to the final insight. The reasoning trace is even more idiosyncratic than the insight itself — there are infinitely many valid reasoning paths to any given insight. Forcing the model to match one specific path overwhelms it with conflicting gradients. The loss landscape becomes even more rugged, and the model converges to an even blander equilibrium.
This is a cautionary tale for the field: CoT is not a universal solution. On tasks where the target is underdetermined, CoT distillation can actively harm performance by adding more constraints to an already over-constrained optimization.
Insight anticipation scores across all models. Notice how frontier models cluster together while GIANTS-4B breaks away.
GIANTS-4B was trained on computational linguistics papers only. But GiantsBench spans 8 scientific domains. Does the model's skill transfer to fields it has never seen during training?
Yes. The gains hold across all domains. GIANTS-4B trained on cs.CL outperforms frontier models on physics, mathematics, economics, robotics, computer vision, AI, and machine learning — all without seeing a single training example from those fields.
This is not a small effect. The model consistently outperforms gemini-3-pro on every single domain, not just the domain it was trained on. The transfer is robust and consistent. If the model had only memorized NLP-specific patterns (e.g., "attention + decoding = better generation"), it would fail on physics (where the patterns involve differential equations and conservation laws) or economics (where they involve market mechanisms and equilibrium theory). The fact that it succeeds everywhere suggests it learned something deeper.
An even harder test: test-unseen-parents (5,294 examples) restricts evaluation to cases where neither parent paper appeared in the training set. The model cannot rely on any familiarity with the input papers. Even on this hardest split, GIANTS-4B maintains its advantage.
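The unseen-parents filter can be sketched as a set check. The dictionary layout and identifiers are hypothetical.

```python
def unseen_parent_subset(test_examples, train_examples):
    """Keep test examples whose two parents both never appear as parents in training."""
    seen = {p for ex in train_examples for p in ex["parents"]}
    return [ex for ex in test_examples if not (set(ex["parents"]) & seen)]


train = [{"id": "t1", "parents": ("p1", "p2")}]
test = [
    {"id": "e1", "parents": ("p1", "p9")},  # one familiar parent: excluded
    {"id": "e2", "parents": ("p7", "p8")},  # both parents new: kept
]
subset = unseen_parent_subset(test, train)
```

Requiring both parents to be new is stricter than requiring a new pair, since a familiar paper paired with a new partner is still excluded.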
All test examples come from papers published after July 2023 — after the training data cutoff. The model is predicting insights for papers it could not possibly have memorized during training. This rules out the possibility that GIANTS-4B is simply retrieving insights it saw in pre-training.
Think about what the model learned during training on cs.CL. It did not learn NLP-specific facts (those are in the base model already). What GRPO training teaches is something more abstract:
These are domain-agnostic skills. A physicist combining two ideas uses the same cognitive patterns as a linguist. The content differs; the structure of creative synthesis is universal.
GIANTS-4B (orange) vs baselines across scientific domains. Trained only on cs.CL, tested on all. The orange line stays above the baselines everywhere.
LM judges are useful, but do humans actually prefer GIANTS-4B's insights? The paper runs two independent human evaluations to answer this.
Annotators are shown insights from GIANTS-4B and the base model side by side (blinded — they do not know which model produced which insight) and asked which is better. This is the gold standard of evaluation: real humans making real judgments about scientific quality.
The result: 89.7% win rate for GIANTS-4B. Nearly 9 out of 10 times, humans prefer the RL-trained model's insights over the base model's. This is not a close call. It is a decisive, overwhelming human preference.
What makes GIANTS-4B's insights better? The human evaluators specifically noted:
GIANTS also evaluates with SciJudge-30B, a 30-billion parameter model trained specifically to predict citation impact of scientific ideas. It is an independent quality signal — not trained by the GIANTS team, not part of their training pipeline.
SciJudge-30B shows a 68% preference for GIANTS-4B over the base model. This is a different kind of validation: not "is this insight similar to the ground truth?" but "does this insight seem like it would be impactful?" GIANTS-4B passes both tests.
Looking at specific examples reveals the difference between GIANTS-4B and the base model. Given parent papers on (A) attention mechanism efficiency and (B) sparse mixture-of-experts routing, the base model might generate: "An approach that combines attention with MoE." That is technically relevant but useless — it describes every paper in the intersection of those fields.
GIANTS-4B, by contrast, generates something like: "Route tokens to specialized attention heads based on input semantics, allowing each head to develop domain expertise while keeping total computation constant." That is specific. It describes a mechanism. It has engineering content. Someone could start building a system from that description.
This is what "conceptual clarity" means in practice: not just correctness, but specificity and actionability. The human evaluators are not rewarding eloquence. They are rewarding the kind of insight that a researcher could actually use.
The qualitative analysis reveals a specific pattern in how RL changes the model's behavior. The base model tends to produce generic compositional statements: "One could combine technique A with approach B to achieve better results." These are technically not wrong, but they contain zero information. Any human could say the same without reading either paper.
GIANTS-4B produces mechanistic compositions: "Use A's routing mechanism to dynamically allocate B's computation, allowing the model to specialize different pathways for different input types." This is a specific hypothesis. It identifies which component of A interacts with which component of B, and what emergent capability results. This is the difference between a literature review and a research proposal.
RL teaches this specificity because the similarity reward rewards it. Generic statements score ~4 (vaguely in the right direction). Mechanistic statements score ~7 (close match to the ground truth). The group-relative advantage strongly reinforces the specific over the generic.
How does GIANTS' output compare to tools researchers actually use? Current literature exploration tools (Semantic Scholar, Connected Papers, Elicit) excel at finding relevant papers and extracting claims. But they do not synthesize. They can tell you "paper A is related to paper B" but not "here is what you could build if you combined them."
GIANTS-4B is the first model evaluated on this synthesis task at scale. The 89.7% human win rate suggests that RL-trained models could eventually complement existing tools by adding a "creative suggestion" layer on top of retrieval. The researcher finds the papers; the model suggests what to build with them.
The key innovation is not just that GIANTS-4B produces better text. It is that the insights it produces are structurally different. They contain the specific mechanistic content that researchers need: which component interacts with which, what property emerges, what limitation is overcome. This kind of structured scientific thinking is precisely what separates useful AI assistance from impressive but unhelpful language generation.
| Method | Relationship to GIANTS |
|---|---|
| DeepSeekMath / GRPO | GIANTS uses GRPO as its RL algorithm. DeepSeekMath introduced group-relative advantages that eliminate the critic network — the same mechanism that powers GIANTS-4B. |
| DAPO | Another GRPO variant for LLM RL. Where DAPO fixes failure modes of GRPO at scale (entropy collapse, dead gradients), GIANTS extends GRPO to a new task type with continuous similarity rewards. |
| AI Scientist v2 | End-to-end automated research. AI Scientist generates full papers; GIANTS focuses on the narrower but more measurable task of predicting one specific insight. GIANTS is a component that could feed into a system like AI Scientist. |
| Poly-EPO | Exploration in RL for reasoning. GIANTS' key finding — that RL exploration beats SFT imitation — aligns with Poly-EPO's emphasis on maintaining diversity during policy optimization. |
| Simula | Synthetic data generation with LM-judge quality control. Both GIANTS and Simula use LM judges as reward signals, but for very different tasks (insight quality vs data quality). |
GIANTS sits at the intersection of two trends: RL for LLMs beyond correctness and AI for science. Most RL-for-LLM work (DeepSeek-R1, DAPO, Poly-EPO) uses binary rewards on verifiable tasks — is the math answer correct? GIANTS shows RL can work with soft, continuous rewards from an LM judge on a subjective, creative task. This opens the door to RL training on tasks where there is no single correct answer.
On the science side, GIANTS provides the first quantitative evidence that creative scientific synthesis can be improved through training. Previous work on AI for science focused on execution (running experiments, writing code, generating hypotheses from databases). GIANTS targets the spark — the moment of "wait, these two ideas combine into something new." If this capability continues to improve, it could fundamentally change how researchers discover connections in the literature.
Imagine a tool that reads your last two papers and suggests what the logical next step of your research should be. Not based on keyword matching or citation networks, but based on a deep understanding of how scientific ideas compose. That tool could surface connections you would never find by browsing arXiv. It could suggest collaborations between researchers in different fields who are working on complementary problems. It could accelerate the pace of discovery by reducing the time between "the ingredients exist" and "someone sees the recipe."
GIANTS does not build that tool. But it proves that the core capability — anticipating the insight at the intersection of two research lines — is learnable. That is the first step.
| Finding | Implication |
|---|---|
| SFT hurts on creative tasks | Cross-entropy loss is wrong for one-to-many mappings. Consider RL whenever the target is not unique. |
| Frontier models do not scale | Some cognitive tasks need the right training signal, not more parameters. Methodology can beat scale. |
| RL transfers across domains | Insight synthesis is a domain-agnostic meta-skill. Train on one field, deploy everywhere. |
| LM judges are viable rewards | Continuous LM-judge rewards enable RL on tasks where binary correctness is undefined. |
| 4B can beat frontier at creativity | Small, well-trained models can outperform 100x larger models on tasks requiring exploration. |
GIANTS leaves several exciting questions unanswered. Can insight anticipation improve further with more RL training? What if you chain GIANTS — generate an insight, then use it as a parent for the next generation? Could this produce a cascade of discoveries? And perhaps most intriguingly: can GIANTS generate insights that are genuinely novel — not matching any existing paper, but still scientifically valid?
The paper does not answer these questions, but the framework makes them answerable. GiantsBench provides the testbed. RL provides the training paradigm. The stage is set for AI systems that don't just analyze science — they participate in it.
These limitations do not diminish the contribution. They define the scope. GIANTS is a proof of concept for a new capability, not a finished product. Future work could address each limitation: validating generated insights with domain experts, using multi-parent configurations beyond pairs, adding feasibility checks via simulation or literature search, and expanding beyond arXiv to include clinical and industrial research.
"If I have seen further it is by standing on the shoulders of giants." — Isaac Newton, 1675. GIANTS asks: can a machine learn to see further, too?