CS224W Lecture 17 — Agents + Graphs

Chapter 0: Why Agents Need Graphs

Ask GPT-4: "Who directed the movie featuring the actress who won the Oscar for the film based on the novel by the author who won the Booker Prize in 2019?" Without external memory, the model must chain four facts from training — and any one of them could be hallucinated or stale.

Now give the model access to a knowledge graph with 80 million facts about people, works, and awards. The agent can look up "Booker Prize 2019 winner," hop to their novels, check which one became a film, find that film's lead actress, find her Oscar win, and finally find that film's director. Each hop is a verified fact, not a guess.

This is the core motivation: structured knowledge graphs give agents a reliable, queryable memory that doesn't hallucinate. The agent's language reasoning connects the hops; the graph supplies the ground truth facts.

The Two Failure Modes This Solves

Hallucination: LLMs generate plausible-sounding text but can confabulate facts ("The Booker Prize 2019 went to Hilary Mantel" — actually Bernardine Evaristo and Margaret Atwood). Grounding each reasoning step in a KG makes wrong answers impossible for facts that are in the graph.

Staleness: LLMs freeze at training cutoff. A knowledge graph can be updated daily with new entities and edges. An agent querying a live KG gives answers current as of today, not 18 months ago.

Reasoning Chain: LLM Alone vs. Graph-Grounded

Click each step to see how an LLM-only chain can go wrong, versus a graph-grounded chain that verifies each hop.

Why does grounding an LLM agent's reasoning in a knowledge graph reduce hallucination?

Because the LLM generates longer reasoning chains when a KG is available Because each reasoning hop retrieves a verified edge from the graph rather than relying on the LLM's parametric memory, so facts are checked against ground truth Because knowledge graphs contain only recent information, avoiding training cutoff issues

Chapter 1: ReAct Framework

In 2022, Yao et al. published ReAct: Synergizing Reasoning and Acting in Language Models. The key observation: when a model just generates reasoning traces ("Chain-of-Thought"), it can drift into hallucination because it has no feedback. When a model just acts (issues API calls), it has no reasoning to connect results. ReAct interleaves both.

The ReAct loop is exactly three steps, repeated until the agent decides it has the answer:

Thought

The LLM reasons about the current state. What do I know? What do I need to find out? Which tool should I call next? This is free-form language generation.

↓

Action

The LLM emits a structured tool call: Search("Booker Prize 2019"), Lookup("Bernardine Evaristo novels"), or Finish("The answer is..."). The action format is specified in the prompt.

↓

Observation

The tool executes and returns a result — a graph query result, a retrieved document, a computed value. This observation is appended to the context and the loop repeats.

↻ repeat until Finish

Why does interleaving help? The Thought grounds the Action in current reasoning; the Observation grounds the next Thought in real evidence. Neither hallucination (pure thought) nor brute search (pure action) wins alone. The loop is self-correcting: a bad observation prompts the model to reason about what went wrong.

ReAct in Practice: Prompting

ReAct works via few-shot prompting. You give the model 2-6 examples of the Thought/Action/Observation format, and it continues the pattern for a new question. No fine-tuning required — the in-context examples teach the format.

ReAct trace (few-shot example in prompt)
Question: What is the population of the capital city of the country that
         borders Spain to the south?

Thought: I need to find which country borders Spain to the south.
Action: Search("countries bordering Spain to the south")
Observation: Morocco borders Spain to the south (via the Strait of Gibraltar).

Thought: The capital of Morocco is Rabat. I need its population.
Action: Search("Rabat population")
Observation: Rabat has a population of approximately 577,000.

Thought: I have the answer.
Action: Finish("Approximately 577,000")

Key result from the paper: ReAct outperforms Chain-of-Thought prompting on knowledge-intensive tasks (HotpotQA, FEVER) because observations from real tools prevent reasoning chains from drifting into confident hallucination. On HotpotQA, ReAct reduced "hallucination rate" from 56% (CoT-only) to 14%.

ReAct Loop Visualizer

Click "Next Step" to walk through a ReAct trace answering a multi-hop question. Watch how Thought and Action alternate, with Observations feeding back in.

In the ReAct framework, what purpose does the "Observation" step serve that pure Chain-of-Thought lacks?

It makes the model generate more tokens, improving reasoning quality It allows the model to decide when to stop reasoning It injects verified external evidence into the context, preventing the reasoning chain from drifting into hallucinated facts

Chapter 2: Reflexion

ReAct is powerful but brittle: if the agent goes down a wrong reasoning path and gets a wrong answer, it doesn't know it got it wrong. It just reports the result with confidence. Reflexion (Shinn et al., 2023) adds a second loop: after each episode, the agent reads its own trace and writes a self-critique that it carries into the next attempt.

Think of it as the difference between a student who submits an exam and forgets it, versus a student who reviews the graded exam, identifies their mistake, and keeps that insight as a note for the next test. Reflexion is the "keep the note" mechanism.

The Three-Memory Architecture

Short-term memory: The current episode's ReAct trace — the running Thought/Action/Observation context window. Cleared each episode.

Long-term memory: A persistent buffer of self-reflections from past failed episodes. Prepended to new episodes so the agent "remembers what didn't work."

The Reflector: After each episode ends (success or failure), a separate LLM call reads the full trace and the task specification, then writes a short, specific critique: "I failed because I searched for the director before verifying the movie title. Next time, verify the title first."

The key insight: Reflexion converts trial-and-error into directed improvement without any gradient updates. The model's weights don't change — only its context changes. This is "in-context RL" where the reward is the evaluator's verdict and the "policy update" is the textual self-reflection.

Reflexion: reflection written after a failed episode
# Episode 1 failed: agent answered "Christopher Nolan" for a question
# about the director of a 2023 film it didn't correctly identify

Reflection: "I made an error in step 2: I assumed the film was Oppenheimer
without verifying that it was actually released in 2023. I should
always verify year-specific facts with a lookup before building
further reasoning on them. In the next attempt, verify the film
title and release year before looking up its director."

# Episode 2 context (prepended before the new question):
# [Reflection from ep1 inserted here]
# Question: [same question]
# Thought: Based on my previous error, I should first verify...

Reflexion: Improvement Over Episodes

Watch how success rate climbs as the agent accumulates self-reflections. Each bar is one trial on a benchmark task set.

What does Reflexion store in its "long-term memory" buffer?

The model's gradient updates from failed episodes Textual self-critiques written after failed episodes, prepended to future attempts so the agent avoids repeating the same mistakes A cache of all previous Thought/Action/Observation traces

Chapter 3: Graph-Grounded Agents SHOWCASE

We've seen ReAct and Reflexion in abstract. Now let's run a real graph-grounded agent on a knowledge graph question. The agent needs to answer: "Which drug targets the same protein as Imatinib, and is used to treat a disease caused by the same gene mutation?"

This requires three hops: (1) find Imatinib's protein target, (2) find other drugs targeting that same protein, (3) for each candidate drug, check if its disease has the same gene mutation as Imatinib's primary disease. No LLM can reliably answer this from memory — it must traverse the biomedical KG.

The architecture: The agent (LLM) issues Thought/Action pairs. The "tool" is a KG query engine. Each Action returns a subgraph: a set of (entity, relation, entity) triples. The agent reads these triples as natural language observations and reasons about the next hop.

SHOWCASE: KG Agent Reasoning Chain

Step through the agent's reasoning. Each step shows the KG subgraph returned (nodes/edges) and the agent's Thought about what to do next. The highlighted path grows with each hop.

What Makes Graph Grounding Different from RAG?

Retrieval-Augmented Generation (RAG) retrieves a chunk of text (a paragraph) based on embedding similarity. Graph grounding retrieves a structured subgraph — a set of typed triples — based on entity traversal. The difference matters for multi-hop reasoning.

Dimension	RAG (text chunks)	Graph Grounding (KG triples)
Retrieval unit	Paragraph or passage	Entity, relation, entity triple
Multi-hop	Hard — requires re-retrieval	Natural — just follow edges
Precision	Fuzzy (embedding similarity)	Exact (typed relation lookup)
Hallucination	Possible from irrelevant passages	Prevented for in-graph facts
Coverage	High (any text)	Limited to KG entities

A user asks: "Who directed the sequel to the film starring the actor who won Best Actor in 2020?" Why is graph traversal better than a single RAG retrieval for this question?

Because graph databases are faster than vector databases Because the question needs only one fact, which a graph can return faster Because answering requires three sequential lookups (2020 winner → their film → its sequel → director), and each depends on the result of the previous — a single RAG retrieval can't chain these hops reliably

Chapter 4: Tool Use on Graphs

The agent doesn't directly access the knowledge graph — it calls tools that abstract the graph operations. This is a crucial design decision: exposing graph APIs rather than raw graph queries lets the LLM reason in natural language while the tool handles the graph traversal logic.

The Graph Tool Toolkit

A well-designed graph agent toolkit has three classes of tools. Think of these like a GPS app: you don't compute shortest paths yourself — you call "navigate to X" and get back a route.

Entity tools: lookup_entity(name) → entity ID + attributes. get_neighbors(entity_id, relation_type) → list of (entity, relation) pairs. get_path(entity1, entity2, max_hops) → shortest connecting path.

Query tools: Structured query languages like SPARQL let the agent express graph patterns. The LLM generates a SPARQL-like query; the tool executes it on the KG and returns matching subgraphs. This is more expressive than single-hop lookups.

Synthesis tools: summarize_subgraph(triples) → natural language description. check_path_consistency(path) → boolean + explanation. These post-process graph results into forms the LLM can reason over.

Graph tool API (Python)
class KGToolkit:
    def lookup_entity(self, name: str) -> dict:
        # Returns: {"id": "Q123", "type": "drug", "description": "..."}
        return self.kg.get_entity_by_name(name)

    def get_neighbors(self, entity_id: str, rel: str = None) -> list:
        # Returns: [{"entity": "BCR-ABL1", "relation": "targets"}, ...]
        return self.kg.neighbors(entity_id, relation_filter=rel)

    def sparql_query(self, query: str) -> list:
        # LLM writes SPARQL; tool executes against KG
        # Returns list of matching triple patterns
        return self.kg.execute(query)

# Agent action format in the ReAct loop:
# Action: get_neighbors("imatinib", rel="targets")
# Observation: [{"entity": "BCR-ABL1", "relation": "targets"}]

SPARQL-Style Queries from LLMs

Modern agents (like KG-GPT and ToG) can generate structured queries over KGs. The LLM is prompted with the graph schema (entity types, relation types) and generates a query. The tool executes it, returning a subgraph that the LLM can reason over.

Graph Tool Call Anatomy

Click a tool type to see what goes in, what comes out, and what graph operation runs underneath.

Why do graph-grounded agents expose graph operations as tool APIs rather than giving the LLM direct access to the raw graph database?

Because graph databases are too slow for direct LLM access Because it lets the LLM reason in natural language (Thought) while the tool handles graph traversal logic, decoupling semantic reasoning from structural computation Because LLMs cannot output structured query languages

Chapter 5: RL for Agent Optimization

Prompt engineering and few-shot examples take you far, but they don't adapt. If your agent consistently makes the same class of mistake, you need a mechanism to update it. Two approaches dominate: PPO (Proximal Policy Optimization) for online RL with a learned reward, and DPO (Direct Preference Optimization) for offline learning from human preference pairs.

Treating the Agent as a Policy

Cast the agent in the RL framework: the state is the current context (question + Thought/Action/Observation history), the action is the next token (or the next tool call), and the reward is given at episode end — +1 if the final answer is correct, 0 otherwise. The goal: maximize expected reward over episodes.

J(θ) = E_{τ ~ π_θ} [ R(τ) ]

Where τ is a full agent episode (the trajectory of Thoughts, Actions, Observations), π_θ is the LLM policy parameterized by weights θ, and R(τ) is the episode reward (correct answer = 1, wrong = 0, or a shaped intermediate reward).

PPO for Agents

PPO keeps the updated policy close to a reference policy to prevent catastrophic changes. The policy generates trajectories, a reward model scores them, and the gradient update improves high-reward trajectories while the KL penalty keeps the model from drifting too far from its pretrained behavior.

L_PPO(θ) = E [ min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t) ] − β · KL[π_θ || π_ref]

Where r_t(θ) = π_θ(a_t|s_t) / π_old(a_t|s_t) is the probability ratio, Â_t is the advantage estimate (how much better was this action than average?), and ε clips the ratio to prevent large updates.

DPO for Agents

DPO skips the explicit reward model. Instead, you collect pairs of agent trajectories on the same question: one that got the answer right (preferred) and one that got it wrong (rejected). DPO directly optimizes the LLM to prefer the correct trajectory over the incorrect one, using only cross-entropy loss over the pair.

L_DPO(θ) = −log σ(β log [π_θ(τ_w) / π_ref(τ_w)] − β log [π_θ(τ_l) / π_ref(τ_l)])

Where τ_w is the winning (correct) trajectory and τ_l is the losing (incorrect) trajectory. DPO is simpler to implement than PPO — no separate critic network, no online sampling loop during training.

PPO: Online. Generates new trajectories during training. Expensive (requires reward model + critic). Handles exploration naturally. Good when you have a reliable reward signal (e.g., verifiable math or code answers).

DPO: Offline. Trains on pre-collected preference pairs. Cheaper (no separate reward model). No exploration. Good when you have human annotators who can rank agent trajectories.

PPO vs. DPO Training Dynamics

Click to see the training pipeline for each method. Notice how PPO requires online trajectory collection while DPO trains on a fixed dataset of preference pairs.

DPO trains on preference pairs (winning vs. losing trajectory). What is the key advantage of DPO over PPO for agent training?

DPO achieves higher task accuracy than PPO on all benchmarks DPO requires less training data than PPO DPO does not need a separate reward model or critic network, making it simpler and cheaper to implement — it uses only cross-entropy loss over pre-collected trajectory pairs

Chapter 6: Benchmarks

How do we know if a graph-grounded agent is actually better? We need benchmarks that test multi-hop reasoning over structured knowledge — not just reading comprehension or simple factoid lookup. Two benchmarks from CS224W stand out: STaRK and UnknowBench.

STaRK: Semi-Structure Retrieval Benchmark

STaRK (Wu et al., 2024) tests agents on three knowledge graphs: Amazon product graphs, biomedical KGs (PrimeKG), and academic citation graphs (MAG). Queries require combining structured graph traversal (follow edges) with unstructured text reasoning (understand entity descriptions).

Example STaRK query: "Find drugs that target proteins involved in Alzheimer's disease AND have fewer than 3 reported side effects AND were approved after 2010." This requires: (1) graph traversal for protein-disease links, (2) property filtering on edge attributes, (3) temporal reasoning. No single retrieval handles all three.

UnknowBench: Testing Knowledge Boundaries

UnknowBench tests a different capability: does the agent know when to say "I don't know"? It contains questions whose answers are absent from the knowledge graph — the correct response is to recognize the gap and say so, rather than hallucinate an answer.

This is critical for real-world deployment. An agent that confidently gives a wrong answer is more dangerous than one that says "I couldn't find this in the graph." UnknowBench rewards appropriate uncertainty.

Benchmark	KG Used	Question Type	Key Capability Tested	Metric
STaRK-Amazon	Amazon product graph	Multi-constraint retrieval	Structured + text fusion	Hit@1, NDCG
STaRK-Bio	PrimeKG (biomedical)	Drug-disease-gene hops	Multi-hop traversal	Hit@1, NDCG
STaRK-Mag	MAG (academic)	Author-paper-venue hops	Citation graph reasoning	Hit@1, NDCG
UnknowBench	Multiple KGs	Answerable + unanswerable	Calibrated uncertainty	F1 on abstention
WebQuestions	Freebase	Natural language → SPARQL	Query generation	F1
HotpotQA	Wikipedia	2-hop reasoning	Multi-hop retrieval	F1, EM

Agent Performance on STaRK vs. Baselines

Comparing retrieval Hit@1 scores across methods on STaRK-Bio. Graph-grounded agents outperform pure embedding retrieval on multi-hop queries.

UnknowBench tests agents on questions whose answers are absent from the knowledge graph. Why is this an important capability to benchmark separately?

Because agents that refuse to answer save computation time Because a confidently wrong answer is more dangerous than an appropriate "I don't know" — hallucination on out-of-graph queries is a critical failure mode in real deployments Because knowledge graphs are always incomplete, making this the most common query type

Chapter 7: Multi-Agent Graph Systems

A single agent can handle one reasoning chain. But some tasks require parallel exploration — multiple competing hypotheses, different tools for different subtasks, or expert specialization. Multi-agent systems assign different agents to different subgraphs or reasoning tracks, coordinating via a shared message-passing protocol.

Here's the key insight: the coordination structure of the agents can itself be a graph. Agents are nodes; information flows between them along edges. This creates a recursive pattern — a graph of agents operating on a knowledge graph.

Common Multi-Agent Architectures

Orchestrator Agent

Receives the original question, decomposes it into sub-questions, assigns each sub-question to a specialized worker agent, and combines the results.

↓ delegates to

Worker Agents (parallel)

Each worker handles one sub-question on its assigned subgraph. Drug agent handles drug-protein edges. Disease agent handles disease-gene edges. They run in parallel.

↓ results back to

Synthesis Agent

Receives all worker results, checks consistency across agents (do their subgraphs connect correctly?), and generates the final answer.

Why parallel agents? On a large KG, one agent exploring sequentially could take many hops and many API calls. Multiple agents exploring different subgraphs simultaneously are faster. They also provide cross-validation: if two agents converge on the same intermediate entity via different paths, confidence is higher.

Message Passing Between Agents

Agents share partial results as structured messages: "I found that BCR-ABL1 targets Imatinib — does any other drug in your subgraph also target BCR-ABL1?" This is analogous to GNN message passing: each agent aggregates messages from its neighbors (other agents), updates its belief state, and passes new messages forward.

The graph analogy is exact: In a GNN, node v aggregates messages from its neighbors N(v). In a multi-agent system, agent v aggregates information from neighboring agents N(v). The "neighborhood" is defined by which agents are relevant to the same intermediate entities. Graph ML concepts map directly onto multi-agent coordination.

Multi-Agent Coordination Graph

Click "Run" to watch agents propagate information. The orchestrator decomposes the query; workers execute in parallel; the synthesizer combines results. Watch messages flow along agent edges.

In a multi-agent graph system, the orchestrator decomposes the question and assigns sub-questions to worker agents. What is the key advantage over a single-agent sequential approach?

Multi-agent systems always reach more accurate answers The orchestrator agent has more reasoning ability than a single agent Workers explore different subgraphs in parallel, reducing total latency and enabling cross-validation when multiple agents find the same intermediate entity via independent paths

Chapter 8: Challenges

Graph-grounded agents are powerful, but they come with a set of hard, open problems that the field is actively working on. Being honest about these limitations is as important as understanding the successes.

Challenge 1: Hallucination at the Interface

Grounding reduces hallucination for facts in the graph, but the agent still hallucinates when reasoning over graph results. The KG returns correct triples; the LLM combines them incorrectly. This is especially common when the agent must aggregate multiple graph results — it may invent a logical connection that the data doesn't support.

Example failure: The graph correctly returns that Drug A targets Protein X, and that Protein X is associated with Disease Y. The agent incorrectly concludes "Drug A is approved for Disease Y" — an inference not in the graph. Graph grounding doesn't prevent inter-hop reasoning errors.

Challenge 2: Incomplete Knowledge Graphs

Real KGs are always incomplete. Freebase covered roughly 3% of world knowledge before it was shut down. Wikidata has many missing relations. When the agent queries for a fact that should be in the graph but isn't, it either incorrectly says "I don't know" (missing a valid answer) or falls back to parametric memory and hallucinates.

Challenge 3: Grounding Latency

Each tool call adds latency. A 5-hop reasoning chain with 2 tool calls per hop requires 10 round trips to the KG. For a graph with millions of nodes, each query takes 10-100ms. Total agent latency can hit 5-10 seconds — too slow for interactive applications.

Challenge 4: Symbolic-Neural Mismatch

Knowledge graphs use discrete symbolic relations (drug targets protein). LLMs operate in continuous embedding space. Bridging them requires translation: embedding entities in the LLM's token space, or converting graph results to natural language. Both introduce information loss — the structured precision of the graph is partially lost in translation.

Challenge	Current Best Mitigation	Open Problem
Inter-hop hallucination	Reflexion self-critique	Formal verification of multi-hop chains
KG incompleteness	KG completion models (TransE, RotatE)	Dynamic KG updating from text
Grounding latency	Subgraph pre-retrieval, caching	Fast approximate KG reasoning
Symbolic-neural mismatch	Entity linking + type constraints	Unified symbolic-neural architectures

A graph-grounded agent retrieves correct KG triples but still produces a wrong final answer by inferring a logical connection the data doesn't support. Which challenge category does this fall into?

KG incompleteness — the needed fact wasn't in the graph Symbolic-neural mismatch — the KG format wasn't compatible with the LLM Inter-hop hallucination — the LLM incorrectly combined correct graph facts into an invalid logical conclusion

Chapter 9: Connections

Graph-grounded agents sit at the intersection of several major research streams. Understanding how they connect to adjacent topics shows you both where the field came from and where it's going.

Topic	Connection to Graph Agents	Learn more
Knowledge Graphs	The structured memory that agents query. KG embedding methods (TransE, RotatE) help with incomplete graphs.	Lec 10: KG
LLM + GNN	The integration architecture: LLM as reasoner, GNN as structure encoder for the KG subgraph returned by the agent.	Lec 16: LLM+GNN
Heterogeneous Graphs	Biomedical KGs have typed nodes (drug, protein, disease) and typed edges — exactly the heterogeneous graph setting.	Lec 9: Hetero
RL Algorithms	PPO and DPO for agent optimization. Graph traversal as MDP: state = current entity, action = which edge to follow.	RL Algorithms
Recommender Systems	Agents recommending items via KG traversal. User → interest graph → candidate items → LLM explanation.	Lec 11: RecSys

Key Papers

Yao et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. ICLR 2023.
Shinn et al. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. NeurIPS 2023.
Sun et al. (2023). Think-on-Graph: Deep and Responsible Reasoning of LLM on KG. ICLR 2024.
Wu et al. (2024). STaRK: Benchmarking LLM Retrieval on Textual and Relational Knowledge Bases. NeurIPS 2024.
He et al. (2024). G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding.

What's Next

The frontier: Graph foundation models (GFMs) that can reason over arbitrary KGs without task-specific fine-tuning. An agent that can be dropped into a new domain's KG (medical, legal, financial) and immediately answer multi-hop queries. This requires learning graph structure as a universal language — the graph analogue of GPT's universal text representation.

"The structure of the graph is the structure of knowledge. An agent that can traverse it has access to what we know; one that can reason over it has access to what we can infer."

Next: Lec 18 — Graph Generation →