CS224W Lecture 17

Agents + Graphs

An LLM alone hallucinates facts it never learned. A knowledge graph alone can't reason in language. Put them together and you get an agent that can traverse structured knowledge while explaining its reasoning step by step.

Prerequisites: LLM basics + Knowledge graphs. That's it.
10
Chapters
5+
Simulations
0
Assumed Knowledge

Chapter 0: Why Agents Need Graphs

Ask GPT-4: "Who directed the movie featuring the actress who won the Oscar for the film based on the novel by the author who won the Booker Prize in 2019?" Without external memory, the model must chain four facts from training — and any one of them could be hallucinated or stale.

Now give the model access to a knowledge graph with 80 million facts about people, works, and awards. The agent can look up "Booker Prize 2019 winner," hop to their novels, check which one became a film, find that film's lead actress, find her Oscar win, and finally find that film's director. Each hop is a verified fact, not a guess.

This is the core motivation: structured knowledge graphs give agents a reliable, queryable memory that doesn't hallucinate. The agent's language reasoning connects the hops; the graph supplies the ground truth facts.

The Two Failure Modes This Solves

Hallucination: LLMs generate plausible-sounding text but can confabulate facts ("The Booker Prize 2019 went to Hilary Mantel" — actually Bernardine Evaristo and Margaret Atwood). Grounding each reasoning step in a KG makes wrong answers impossible for facts that are in the graph.
Staleness: LLMs freeze at training cutoff. A knowledge graph can be updated daily with new entities and edges. An agent querying a live KG gives answers current as of today, not 18 months ago.
Reasoning Chain: LLM Alone vs. Graph-Grounded

Click each step to see how an LLM-only chain can go wrong, versus a graph-grounded chain that verifies each hop.

Why does grounding an LLM agent's reasoning in a knowledge graph reduce hallucination?

Chapter 1: ReAct Framework

In 2022, Yao et al. published ReAct: Synergizing Reasoning and Acting in Language Models. The key observation: when a model just generates reasoning traces ("Chain-of-Thought"), it can drift into hallucination because it has no feedback. When a model just acts (issues API calls), it has no reasoning to connect results. ReAct interleaves both.

The ReAct loop is exactly three steps, repeated until the agent decides it has the answer:

Thought
The LLM reasons about the current state. What do I know? What do I need to find out? Which tool should I call next? This is free-form language generation.
Action
The LLM emits a structured tool call: Search("Booker Prize 2019"), Lookup("Bernardine Evaristo novels"), or Finish("The answer is..."). The action format is specified in the prompt.
Observation
The tool executes and returns a result — a graph query result, a retrieved document, a computed value. This observation is appended to the context and the loop repeats.
↻ repeat until Finish
Why does interleaving help? The Thought grounds the Action in current reasoning; the Observation grounds the next Thought in real evidence. Neither hallucination (pure thought) nor brute search (pure action) wins alone. The loop is self-correcting: a bad observation prompts the model to reason about what went wrong.

ReAct in Practice: Prompting

ReAct works via few-shot prompting. You give the model 2-6 examples of the Thought/Action/Observation format, and it continues the pattern for a new question. No fine-tuning required — the in-context examples teach the format.

ReAct trace (few-shot example in prompt)
Question: What is the population of the capital city of the country that
         borders Spain to the south?

Thought: I need to find which country borders Spain to the south.
Action: Search("countries bordering Spain to the south")
Observation: Morocco borders Spain to the south (via the Strait of Gibraltar).

Thought: The capital of Morocco is Rabat. I need its population.
Action: Search("Rabat population")
Observation: Rabat has a population of approximately 577,000.

Thought: I have the answer.
Action: Finish("Approximately 577,000")
Key result from the paper: ReAct outperforms Chain-of-Thought prompting on knowledge-intensive tasks (HotpotQA, FEVER) because observations from real tools prevent reasoning chains from drifting into confident hallucination. On HotpotQA, ReAct reduced "hallucination rate" from 56% (CoT-only) to 14%.
ReAct Loop Visualizer

Click "Next Step" to walk through a ReAct trace answering a multi-hop question. Watch how Thought and Action alternate, with Observations feeding back in.

In the ReAct framework, what purpose does the "Observation" step serve that pure Chain-of-Thought lacks?

Chapter 2: Reflexion

ReAct is powerful but brittle: if the agent goes down a wrong reasoning path and gets a wrong answer, it doesn't know it got it wrong. It just reports the result with confidence. Reflexion (Shinn et al., 2023) adds a second loop: after each episode, the agent reads its own trace and writes a self-critique that it carries into the next attempt.

Think of it as the difference between a student who submits an exam and forgets it, versus a student who reviews the graded exam, identifies their mistake, and keeps that insight as a note for the next test. Reflexion is the "keep the note" mechanism.

The Three-Memory Architecture

Short-term memory: The current episode's ReAct trace — the running Thought/Action/Observation context window. Cleared each episode.
Long-term memory: A persistent buffer of self-reflections from past failed episodes. Prepended to new episodes so the agent "remembers what didn't work."
The Reflector: After each episode ends (success or failure), a separate LLM call reads the full trace and the task specification, then writes a short, specific critique: "I failed because I searched for the director before verifying the movie title. Next time, verify the title first."
The key insight: Reflexion converts trial-and-error into directed improvement without any gradient updates. The model's weights don't change — only its context changes. This is "in-context RL" where the reward is the evaluator's verdict and the "policy update" is the textual self-reflection.
Reflexion: reflection written after a failed episode
# Episode 1 failed: agent answered "Christopher Nolan" for a question
# about the director of a 2023 film it didn't correctly identify

Reflection: "I made an error in step 2: I assumed the film was Oppenheimer
without verifying that it was actually released in 2023. I should
always verify year-specific facts with a lookup before building
further reasoning on them. In the next attempt, verify the film
title and release year before looking up its director."

# Episode 2 context (prepended before the new question):
# [Reflection from ep1 inserted here]
# Question: [same question]
# Thought: Based on my previous error, I should first verify...
Reflexion: Improvement Over Episodes

Watch how success rate climbs as the agent accumulates self-reflections. Each bar is one trial on a benchmark task set.

What does Reflexion store in its "long-term memory" buffer?

Chapter 3: Graph-Grounded Agents SHOWCASE

We've seen ReAct and Reflexion in abstract. Now let's run a real graph-grounded agent on a knowledge graph question. The agent needs to answer: "Which drug targets the same protein as Imatinib, and is used to treat a disease caused by the same gene mutation?"

This requires three hops: (1) find Imatinib's protein target, (2) find other drugs targeting that same protein, (3) for each candidate drug, check if its disease has the same gene mutation as Imatinib's primary disease. No LLM can reliably answer this from memory — it must traverse the biomedical KG.

The architecture: The agent (LLM) issues Thought/Action pairs. The "tool" is a KG query engine. Each Action returns a subgraph: a set of (entity, relation, entity) triples. The agent reads these triples as natural language observations and reasons about the next hop.
SHOWCASE: KG Agent Reasoning Chain

Step through the agent's reasoning. Each step shows the KG subgraph returned (nodes/edges) and the agent's Thought about what to do next. The highlighted path grows with each hop.

What Makes Graph Grounding Different from RAG?

Retrieval-Augmented Generation (RAG) retrieves a chunk of text (a paragraph) based on embedding similarity. Graph grounding retrieves a structured subgraph — a set of typed triples — based on entity traversal. The difference matters for multi-hop reasoning.

DimensionRAG (text chunks)Graph Grounding (KG triples)
Retrieval unitParagraph or passageEntity, relation, entity triple
Multi-hopHard — requires re-retrievalNatural — just follow edges
PrecisionFuzzy (embedding similarity)Exact (typed relation lookup)
HallucinationPossible from irrelevant passagesPrevented for in-graph facts
CoverageHigh (any text)Limited to KG entities
A user asks: "Who directed the sequel to the film starring the actor who won Best Actor in 2020?" Why is graph traversal better than a single RAG retrieval for this question?

Chapter 4: Tool Use on Graphs

The agent doesn't directly access the knowledge graph — it calls tools that abstract the graph operations. This is a crucial design decision: exposing graph APIs rather than raw graph queries lets the LLM reason in natural language while the tool handles the graph traversal logic.

The Graph Tool Toolkit

A well-designed graph agent toolkit has three classes of tools. Think of these like a GPS app: you don't compute shortest paths yourself — you call "navigate to X" and get back a route.

Entity tools: lookup_entity(name) → entity ID + attributes. get_neighbors(entity_id, relation_type) → list of (entity, relation) pairs. get_path(entity1, entity2, max_hops) → shortest connecting path.
Query tools: Structured query languages like SPARQL let the agent express graph patterns. The LLM generates a SPARQL-like query; the tool executes it on the KG and returns matching subgraphs. This is more expressive than single-hop lookups.
Synthesis tools: summarize_subgraph(triples) → natural language description. check_path_consistency(path) → boolean + explanation. These post-process graph results into forms the LLM can reason over.
Graph tool API (Python)
class KGToolkit:
    def lookup_entity(self, name: str) -> dict:
        # Returns: {"id": "Q123", "type": "drug", "description": "..."}
        return self.kg.get_entity_by_name(name)

    def get_neighbors(self, entity_id: str, rel: str = None) -> list:
        # Returns: [{"entity": "BCR-ABL1", "relation": "targets"}, ...]
        return self.kg.neighbors(entity_id, relation_filter=rel)

    def sparql_query(self, query: str) -> list:
        # LLM writes SPARQL; tool executes against KG
        # Returns list of matching triple patterns
        return self.kg.execute(query)

# Agent action format in the ReAct loop:
# Action: get_neighbors("imatinib", rel="targets")
# Observation: [{"entity": "BCR-ABL1", "relation": "targets"}]

SPARQL-Style Queries from LLMs

Modern agents (like KG-GPT and ToG) can generate structured queries over KGs. The LLM is prompted with the graph schema (entity types, relation types) and generates a query. The tool executes it, returning a subgraph that the LLM can reason over.

Graph Tool Call Anatomy

Click a tool type to see what goes in, what comes out, and what graph operation runs underneath.

Why do graph-grounded agents expose graph operations as tool APIs rather than giving the LLM direct access to the raw graph database?

Chapter 5: RL for Agent Optimization

Prompt engineering and few-shot examples take you far, but they don't adapt. If your agent consistently makes the same class of mistake, you need a mechanism to update it. Two approaches dominate: PPO (Proximal Policy Optimization) for online RL with a learned reward, and DPO (Direct Preference Optimization) for offline learning from human preference pairs.

Treating the Agent as a Policy

Cast the agent in the RL framework: the state is the current context (question + Thought/Action/Observation history), the action is the next token (or the next tool call), and the reward is given at episode end — +1 if the final answer is correct, 0 otherwise. The goal: maximize expected reward over episodes.

J(θ) = Eτ ~ πθ [ R(τ) ]

Where τ is a full agent episode (the trajectory of Thoughts, Actions, Observations), πθ is the LLM policy parameterized by weights θ, and R(τ) is the episode reward (correct answer = 1, wrong = 0, or a shaped intermediate reward).

PPO for Agents

PPO keeps the updated policy close to a reference policy to prevent catastrophic changes. The policy generates trajectories, a reward model scores them, and the gradient update improves high-reward trajectories while the KL penalty keeps the model from drifting too far from its pretrained behavior.

LPPO(θ) = E [ min(rt(θ) Ât, clip(rt(θ), 1-ε, 1+ε) Ât) ] − β · KL[πθ || πref]

Where rt(θ) = πθ(at|st) / πold(at|st) is the probability ratio, Ât is the advantage estimate (how much better was this action than average?), and ε clips the ratio to prevent large updates.

DPO for Agents

DPO skips the explicit reward model. Instead, you collect pairs of agent trajectories on the same question: one that got the answer right (preferred) and one that got it wrong (rejected). DPO directly optimizes the LLM to prefer the correct trajectory over the incorrect one, using only cross-entropy loss over the pair.

LDPO(θ) = −log σ(β log [πθw) / πrefw)] − β log [πθl) / πrefl)])

Where τw is the winning (correct) trajectory and τl is the losing (incorrect) trajectory. DPO is simpler to implement than PPO — no separate critic network, no online sampling loop during training.

PPO: Online. Generates new trajectories during training. Expensive (requires reward model + critic). Handles exploration naturally. Good when you have a reliable reward signal (e.g., verifiable math or code answers).
DPO: Offline. Trains on pre-collected preference pairs. Cheaper (no separate reward model). No exploration. Good when you have human annotators who can rank agent trajectories.
PPO vs. DPO Training Dynamics

Click to see the training pipeline for each method. Notice how PPO requires online trajectory collection while DPO trains on a fixed dataset of preference pairs.

DPO trains on preference pairs (winning vs. losing trajectory). What is the key advantage of DPO over PPO for agent training?

Chapter 6: Benchmarks

How do we know if a graph-grounded agent is actually better? We need benchmarks that test multi-hop reasoning over structured knowledge — not just reading comprehension or simple factoid lookup. Two benchmarks from CS224W stand out: STaRK and UnknowBench.

STaRK: Semi-Structure Retrieval Benchmark

STaRK (Wu et al., 2024) tests agents on three knowledge graphs: Amazon product graphs, biomedical KGs (PrimeKG), and academic citation graphs (MAG). Queries require combining structured graph traversal (follow edges) with unstructured text reasoning (understand entity descriptions).

Example STaRK query: "Find drugs that target proteins involved in Alzheimer's disease AND have fewer than 3 reported side effects AND were approved after 2010." This requires: (1) graph traversal for protein-disease links, (2) property filtering on edge attributes, (3) temporal reasoning. No single retrieval handles all three.

UnknowBench: Testing Knowledge Boundaries

UnknowBench tests a different capability: does the agent know when to say "I don't know"? It contains questions whose answers are absent from the knowledge graph — the correct response is to recognize the gap and say so, rather than hallucinate an answer.

This is critical for real-world deployment. An agent that confidently gives a wrong answer is more dangerous than one that says "I couldn't find this in the graph." UnknowBench rewards appropriate uncertainty.

BenchmarkKG UsedQuestion TypeKey Capability TestedMetric
STaRK-AmazonAmazon product graphMulti-constraint retrievalStructured + text fusionHit@1, NDCG
STaRK-BioPrimeKG (biomedical)Drug-disease-gene hopsMulti-hop traversalHit@1, NDCG
STaRK-MagMAG (academic)Author-paper-venue hopsCitation graph reasoningHit@1, NDCG
UnknowBenchMultiple KGsAnswerable + unanswerableCalibrated uncertaintyF1 on abstention
WebQuestionsFreebaseNatural language → SPARQLQuery generationF1
HotpotQAWikipedia2-hop reasoningMulti-hop retrievalF1, EM
Agent Performance on STaRK vs. Baselines

Comparing retrieval Hit@1 scores across methods on STaRK-Bio. Graph-grounded agents outperform pure embedding retrieval on multi-hop queries.

UnknowBench tests agents on questions whose answers are absent from the knowledge graph. Why is this an important capability to benchmark separately?

Chapter 7: Multi-Agent Graph Systems

A single agent can handle one reasoning chain. But some tasks require parallel exploration — multiple competing hypotheses, different tools for different subtasks, or expert specialization. Multi-agent systems assign different agents to different subgraphs or reasoning tracks, coordinating via a shared message-passing protocol.

Here's the key insight: the coordination structure of the agents can itself be a graph. Agents are nodes; information flows between them along edges. This creates a recursive pattern — a graph of agents operating on a knowledge graph.

Common Multi-Agent Architectures

Orchestrator Agent
Receives the original question, decomposes it into sub-questions, assigns each sub-question to a specialized worker agent, and combines the results.
↓ delegates to
Worker Agents (parallel)
Each worker handles one sub-question on its assigned subgraph. Drug agent handles drug-protein edges. Disease agent handles disease-gene edges. They run in parallel.
↓ results back to
Synthesis Agent
Receives all worker results, checks consistency across agents (do their subgraphs connect correctly?), and generates the final answer.
Why parallel agents? On a large KG, one agent exploring sequentially could take many hops and many API calls. Multiple agents exploring different subgraphs simultaneously are faster. They also provide cross-validation: if two agents converge on the same intermediate entity via different paths, confidence is higher.

Message Passing Between Agents

Agents share partial results as structured messages: "I found that BCR-ABL1 targets Imatinib — does any other drug in your subgraph also target BCR-ABL1?" This is analogous to GNN message passing: each agent aggregates messages from its neighbors (other agents), updates its belief state, and passes new messages forward.

The graph analogy is exact: In a GNN, node v aggregates messages from its neighbors N(v). In a multi-agent system, agent v aggregates information from neighboring agents N(v). The "neighborhood" is defined by which agents are relevant to the same intermediate entities. Graph ML concepts map directly onto multi-agent coordination.
Multi-Agent Coordination Graph

Click "Run" to watch agents propagate information. The orchestrator decomposes the query; workers execute in parallel; the synthesizer combines results. Watch messages flow along agent edges.

In a multi-agent graph system, the orchestrator decomposes the question and assigns sub-questions to worker agents. What is the key advantage over a single-agent sequential approach?

Chapter 8: Challenges

Graph-grounded agents are powerful, but they come with a set of hard, open problems that the field is actively working on. Being honest about these limitations is as important as understanding the successes.

Challenge 1: Hallucination at the Interface

Grounding reduces hallucination for facts in the graph, but the agent still hallucinates when reasoning over graph results. The KG returns correct triples; the LLM combines them incorrectly. This is especially common when the agent must aggregate multiple graph results — it may invent a logical connection that the data doesn't support.

Example failure: The graph correctly returns that Drug A targets Protein X, and that Protein X is associated with Disease Y. The agent incorrectly concludes "Drug A is approved for Disease Y" — an inference not in the graph. Graph grounding doesn't prevent inter-hop reasoning errors.

Challenge 2: Incomplete Knowledge Graphs

Real KGs are always incomplete. Freebase covered roughly 3% of world knowledge before it was shut down. Wikidata has many missing relations. When the agent queries for a fact that should be in the graph but isn't, it either incorrectly says "I don't know" (missing a valid answer) or falls back to parametric memory and hallucinates.

Challenge 3: Grounding Latency

Each tool call adds latency. A 5-hop reasoning chain with 2 tool calls per hop requires 10 round trips to the KG. For a graph with millions of nodes, each query takes 10-100ms. Total agent latency can hit 5-10 seconds — too slow for interactive applications.

Challenge 4: Symbolic-Neural Mismatch

Knowledge graphs use discrete symbolic relations (drug targets protein). LLMs operate in continuous embedding space. Bridging them requires translation: embedding entities in the LLM's token space, or converting graph results to natural language. Both introduce information loss — the structured precision of the graph is partially lost in translation.

ChallengeCurrent Best MitigationOpen Problem
Inter-hop hallucinationReflexion self-critiqueFormal verification of multi-hop chains
KG incompletenessKG completion models (TransE, RotatE)Dynamic KG updating from text
Grounding latencySubgraph pre-retrieval, cachingFast approximate KG reasoning
Symbolic-neural mismatchEntity linking + type constraintsUnified symbolic-neural architectures
A graph-grounded agent retrieves correct KG triples but still produces a wrong final answer by inferring a logical connection the data doesn't support. Which challenge category does this fall into?

Chapter 9: Connections

Graph-grounded agents sit at the intersection of several major research streams. Understanding how they connect to adjacent topics shows you both where the field came from and where it's going.

TopicConnection to Graph AgentsLearn more
Knowledge GraphsThe structured memory that agents query. KG embedding methods (TransE, RotatE) help with incomplete graphs.Lec 10: KG
LLM + GNNThe integration architecture: LLM as reasoner, GNN as structure encoder for the KG subgraph returned by the agent.Lec 16: LLM+GNN
Heterogeneous GraphsBiomedical KGs have typed nodes (drug, protein, disease) and typed edges — exactly the heterogeneous graph setting.Lec 9: Hetero
RL AlgorithmsPPO and DPO for agent optimization. Graph traversal as MDP: state = current entity, action = which edge to follow.RL Algorithms
Recommender SystemsAgents recommending items via KG traversal. User → interest graph → candidate items → LLM explanation.Lec 11: RecSys

Key Papers

What's Next

The frontier: Graph foundation models (GFMs) that can reason over arbitrary KGs without task-specific fine-tuning. An agent that can be dropped into a new domain's KG (medical, legal, financial) and immediately answer multi-hop queries. This requires learning graph structure as a universal language — the graph analogue of GPT's universal text representation.
"The structure of the graph is the structure of knowledge. An agent that can traverse it has access to what we know; one that can reason over it has access to what we can infer."
Next: Lec 18 — Graph Generation →