An LLM alone hallucinates facts it never learned. A knowledge graph alone can't reason in language. Put them together and you get an agent that can traverse structured knowledge while explaining its reasoning step by step.
Ask GPT-4: "Who directed the movie featuring the actress who won the Oscar for the film based on the novel by the author who won the Booker Prize in 2019?" Without external memory, the model must chain four facts from training — and any one of them could be hallucinated or stale.
Now give the model access to a knowledge graph with 80 million facts about people, works, and awards. The agent can look up "Booker Prize 2019 winner," hop to their novels, check which one became a film, find that film's lead actress, find her Oscar win, and finally find that film's director. Each hop is a verified fact, not a guess.
This is the core motivation: structured knowledge graphs give agents a reliable, queryable memory that doesn't hallucinate. The agent's language reasoning connects the hops; the graph supplies the ground truth facts.
Click each step to see how an LLM-only chain can go wrong, versus a graph-grounded chain that verifies each hop.
In 2022, Yao et al. published ReAct: Synergizing Reasoning and Acting in Language Models. The key observation: when a model just generates reasoning traces ("Chain-of-Thought"), it can drift into hallucination because it has no feedback. When a model just acts (issues API calls), it has no reasoning to connect results. ReAct interleaves both.
The ReAct loop is exactly three steps, repeated until the agent decides it has the answer:
ReAct works via few-shot prompting. You give the model 2-6 examples of the Thought/Action/Observation format, and it continues the pattern for a new question. No fine-tuning required — the in-context examples teach the format.
ReAct trace (few-shot example in prompt) Question: What is the population of the capital city of the country that borders Spain to the south? Thought: I need to find which country borders Spain to the south. Action: Search("countries bordering Spain to the south") Observation: Morocco borders Spain to the south (via the Strait of Gibraltar). Thought: The capital of Morocco is Rabat. I need its population. Action: Search("Rabat population") Observation: Rabat has a population of approximately 577,000. Thought: I have the answer. Action: Finish("Approximately 577,000")
Click "Next Step" to walk through a ReAct trace answering a multi-hop question. Watch how Thought and Action alternate, with Observations feeding back in.
ReAct is powerful but brittle: if the agent goes down a wrong reasoning path and gets a wrong answer, it doesn't know it got it wrong. It just reports the result with confidence. Reflexion (Shinn et al., 2023) adds a second loop: after each episode, the agent reads its own trace and writes a self-critique that it carries into the next attempt.
Think of it as the difference between a student who submits an exam and forgets it, versus a student who reviews the graded exam, identifies their mistake, and keeps that insight as a note for the next test. Reflexion is the "keep the note" mechanism.
Reflexion: reflection written after a failed episode # Episode 1 failed: agent answered "Christopher Nolan" for a question # about the director of a 2023 film it didn't correctly identify Reflection: "I made an error in step 2: I assumed the film was Oppenheimer without verifying that it was actually released in 2023. I should always verify year-specific facts with a lookup before building further reasoning on them. In the next attempt, verify the film title and release year before looking up its director." # Episode 2 context (prepended before the new question): # [Reflection from ep1 inserted here] # Question: [same question] # Thought: Based on my previous error, I should first verify...
Watch how success rate climbs as the agent accumulates self-reflections. Each bar is one trial on a benchmark task set.
We've seen ReAct and Reflexion in abstract. Now let's run a real graph-grounded agent on a knowledge graph question. The agent needs to answer: "Which drug targets the same protein as Imatinib, and is used to treat a disease caused by the same gene mutation?"
This requires three hops: (1) find Imatinib's protein target, (2) find other drugs targeting that same protein, (3) for each candidate drug, check if its disease has the same gene mutation as Imatinib's primary disease. No LLM can reliably answer this from memory — it must traverse the biomedical KG.
Step through the agent's reasoning. Each step shows the KG subgraph returned (nodes/edges) and the agent's Thought about what to do next. The highlighted path grows with each hop.
Retrieval-Augmented Generation (RAG) retrieves a chunk of text (a paragraph) based on embedding similarity. Graph grounding retrieves a structured subgraph — a set of typed triples — based on entity traversal. The difference matters for multi-hop reasoning.
| Dimension | RAG (text chunks) | Graph Grounding (KG triples) |
|---|---|---|
| Retrieval unit | Paragraph or passage | Entity, relation, entity triple |
| Multi-hop | Hard — requires re-retrieval | Natural — just follow edges |
| Precision | Fuzzy (embedding similarity) | Exact (typed relation lookup) |
| Hallucination | Possible from irrelevant passages | Prevented for in-graph facts |
| Coverage | High (any text) | Limited to KG entities |
The agent doesn't directly access the knowledge graph — it calls tools that abstract the graph operations. This is a crucial design decision: exposing graph APIs rather than raw graph queries lets the LLM reason in natural language while the tool handles the graph traversal logic.
A well-designed graph agent toolkit has three classes of tools. Think of these like a GPS app: you don't compute shortest paths yourself — you call "navigate to X" and get back a route.
lookup_entity(name) → entity ID + attributes. get_neighbors(entity_id, relation_type) → list of (entity, relation) pairs. get_path(entity1, entity2, max_hops) → shortest connecting path.summarize_subgraph(triples) → natural language description. check_path_consistency(path) → boolean + explanation. These post-process graph results into forms the LLM can reason over.Graph tool API (Python) class KGToolkit: def lookup_entity(self, name: str) -> dict: # Returns: {"id": "Q123", "type": "drug", "description": "..."} return self.kg.get_entity_by_name(name) def get_neighbors(self, entity_id: str, rel: str = None) -> list: # Returns: [{"entity": "BCR-ABL1", "relation": "targets"}, ...] return self.kg.neighbors(entity_id, relation_filter=rel) def sparql_query(self, query: str) -> list: # LLM writes SPARQL; tool executes against KG # Returns list of matching triple patterns return self.kg.execute(query) # Agent action format in the ReAct loop: # Action: get_neighbors("imatinib", rel="targets") # Observation: [{"entity": "BCR-ABL1", "relation": "targets"}]
Modern agents (like KG-GPT and ToG) can generate structured queries over KGs. The LLM is prompted with the graph schema (entity types, relation types) and generates a query. The tool executes it, returning a subgraph that the LLM can reason over.
Click a tool type to see what goes in, what comes out, and what graph operation runs underneath.
Prompt engineering and few-shot examples take you far, but they don't adapt. If your agent consistently makes the same class of mistake, you need a mechanism to update it. Two approaches dominate: PPO (Proximal Policy Optimization) for online RL with a learned reward, and DPO (Direct Preference Optimization) for offline learning from human preference pairs.
Cast the agent in the RL framework: the state is the current context (question + Thought/Action/Observation history), the action is the next token (or the next tool call), and the reward is given at episode end — +1 if the final answer is correct, 0 otherwise. The goal: maximize expected reward over episodes.
Where τ is a full agent episode (the trajectory of Thoughts, Actions, Observations), πθ is the LLM policy parameterized by weights θ, and R(τ) is the episode reward (correct answer = 1, wrong = 0, or a shaped intermediate reward).
PPO keeps the updated policy close to a reference policy to prevent catastrophic changes. The policy generates trajectories, a reward model scores them, and the gradient update improves high-reward trajectories while the KL penalty keeps the model from drifting too far from its pretrained behavior.
Where rt(θ) = πθ(at|st) / πold(at|st) is the probability ratio, Ât is the advantage estimate (how much better was this action than average?), and ε clips the ratio to prevent large updates.
DPO skips the explicit reward model. Instead, you collect pairs of agent trajectories on the same question: one that got the answer right (preferred) and one that got it wrong (rejected). DPO directly optimizes the LLM to prefer the correct trajectory over the incorrect one, using only cross-entropy loss over the pair.
Where τw is the winning (correct) trajectory and τl is the losing (incorrect) trajectory. DPO is simpler to implement than PPO — no separate critic network, no online sampling loop during training.
Click to see the training pipeline for each method. Notice how PPO requires online trajectory collection while DPO trains on a fixed dataset of preference pairs.
How do we know if a graph-grounded agent is actually better? We need benchmarks that test multi-hop reasoning over structured knowledge — not just reading comprehension or simple factoid lookup. Two benchmarks from CS224W stand out: STaRK and UnknowBench.
STaRK (Wu et al., 2024) tests agents on three knowledge graphs: Amazon product graphs, biomedical KGs (PrimeKG), and academic citation graphs (MAG). Queries require combining structured graph traversal (follow edges) with unstructured text reasoning (understand entity descriptions).
UnknowBench tests a different capability: does the agent know when to say "I don't know"? It contains questions whose answers are absent from the knowledge graph — the correct response is to recognize the gap and say so, rather than hallucinate an answer.
This is critical for real-world deployment. An agent that confidently gives a wrong answer is more dangerous than one that says "I couldn't find this in the graph." UnknowBench rewards appropriate uncertainty.
| Benchmark | KG Used | Question Type | Key Capability Tested | Metric |
|---|---|---|---|---|
| STaRK-Amazon | Amazon product graph | Multi-constraint retrieval | Structured + text fusion | Hit@1, NDCG |
| STaRK-Bio | PrimeKG (biomedical) | Drug-disease-gene hops | Multi-hop traversal | Hit@1, NDCG |
| STaRK-Mag | MAG (academic) | Author-paper-venue hops | Citation graph reasoning | Hit@1, NDCG |
| UnknowBench | Multiple KGs | Answerable + unanswerable | Calibrated uncertainty | F1 on abstention |
| WebQuestions | Freebase | Natural language → SPARQL | Query generation | F1 |
| HotpotQA | Wikipedia | 2-hop reasoning | Multi-hop retrieval | F1, EM |
Comparing retrieval Hit@1 scores across methods on STaRK-Bio. Graph-grounded agents outperform pure embedding retrieval on multi-hop queries.
A single agent can handle one reasoning chain. But some tasks require parallel exploration — multiple competing hypotheses, different tools for different subtasks, or expert specialization. Multi-agent systems assign different agents to different subgraphs or reasoning tracks, coordinating via a shared message-passing protocol.
Here's the key insight: the coordination structure of the agents can itself be a graph. Agents are nodes; information flows between them along edges. This creates a recursive pattern — a graph of agents operating on a knowledge graph.
Agents share partial results as structured messages: "I found that BCR-ABL1 targets Imatinib — does any other drug in your subgraph also target BCR-ABL1?" This is analogous to GNN message passing: each agent aggregates messages from its neighbors (other agents), updates its belief state, and passes new messages forward.
Click "Run" to watch agents propagate information. The orchestrator decomposes the query; workers execute in parallel; the synthesizer combines results. Watch messages flow along agent edges.
Graph-grounded agents are powerful, but they come with a set of hard, open problems that the field is actively working on. Being honest about these limitations is as important as understanding the successes.
Grounding reduces hallucination for facts in the graph, but the agent still hallucinates when reasoning over graph results. The KG returns correct triples; the LLM combines them incorrectly. This is especially common when the agent must aggregate multiple graph results — it may invent a logical connection that the data doesn't support.
Real KGs are always incomplete. Freebase covered roughly 3% of world knowledge before it was shut down. Wikidata has many missing relations. When the agent queries for a fact that should be in the graph but isn't, it either incorrectly says "I don't know" (missing a valid answer) or falls back to parametric memory and hallucinates.
Each tool call adds latency. A 5-hop reasoning chain with 2 tool calls per hop requires 10 round trips to the KG. For a graph with millions of nodes, each query takes 10-100ms. Total agent latency can hit 5-10 seconds — too slow for interactive applications.
Knowledge graphs use discrete symbolic relations (drug targets protein). LLMs operate in continuous embedding space. Bridging them requires translation: embedding entities in the LLM's token space, or converting graph results to natural language. Both introduce information loss — the structured precision of the graph is partially lost in translation.
| Challenge | Current Best Mitigation | Open Problem |
|---|---|---|
| Inter-hop hallucination | Reflexion self-critique | Formal verification of multi-hop chains |
| KG incompleteness | KG completion models (TransE, RotatE) | Dynamic KG updating from text |
| Grounding latency | Subgraph pre-retrieval, caching | Fast approximate KG reasoning |
| Symbolic-neural mismatch | Entity linking + type constraints | Unified symbolic-neural architectures |
Graph-grounded agents sit at the intersection of several major research streams. Understanding how they connect to adjacent topics shows you both where the field came from and where it's going.
| Topic | Connection to Graph Agents | Learn more |
|---|---|---|
| Knowledge Graphs | The structured memory that agents query. KG embedding methods (TransE, RotatE) help with incomplete graphs. | Lec 10: KG |
| LLM + GNN | The integration architecture: LLM as reasoner, GNN as structure encoder for the KG subgraph returned by the agent. | Lec 16: LLM+GNN |
| Heterogeneous Graphs | Biomedical KGs have typed nodes (drug, protein, disease) and typed edges — exactly the heterogeneous graph setting. | Lec 9: Hetero |
| RL Algorithms | PPO and DPO for agent optimization. Graph traversal as MDP: state = current entity, action = which edge to follow. | RL Algorithms |
| Recommender Systems | Agents recommending items via KG traversal. User → interest graph → candidate items → LLM explanation. | Lec 11: RecSys |
"The structure of the graph is the structure of knowledge. An agent that can traverse it has access to what we know; one that can reason over it has access to what we can infer."