Language models know that "Aspirin" is a drug. GNNs know that Aspirin is three hops from migraines in the biomedical graph. Neither alone is enough. Together, they're remarkably powerful.
You're building a drug repurposing system. You have a biomedical knowledge graph with 100,000 entities: drugs, genes, diseases, proteins. Each entity has a name. Some have long text descriptions. The graph has 5 million edges representing known interactions.
You try two approaches. First, train a GNN using one-hot entity IDs as node features. It learns graph topology brilliantly — it knows that Aspirin is three hops from migraines via COX2 — but it doesn't know that Aspirin's text description mentions "anti-inflammatory" and "analgesic," which would help identify which other diseases it might treat.
Second, feed all entity text descriptions to an LLM. It understands that "Aspirin" and "ibuprofen" are both NSAIDs with similar mechanisms. But it doesn't know the structure of the graph — it doesn't know that Aspirin is connected to 47 diseases in your KG while a different drug connects to only 3.
Both approaches use only half the information. The graph structure is the missing context for the LLM. The rich text semantics are the missing features for the GNN. Combining them gets you both.
The simplest combination: use a frozen LLM to encode each node's text description into a fixed embedding vector, then use that vector as the node feature for a downstream GNN. The LLM is a preprocessing step. It runs once. The GNN trains on the resulting features.
This works because modern language models (BERT, Sentence-BERT, LLaMA) make strong text encoders, especially when pooled or fine-tuned for sentence embeddings. Given the text "aspirin: a nonsteroidal anti-inflammatory drug used to treat pain, fever, and inflammation," a BERT-base encoder produces a 768-dimensional vector that captures the semantic content. Two drugs with similar descriptions will have similar vectors, which is a richer initialization than random or one-hot features.
```python
# Step 1: encode all node texts with a frozen LLM
import torch.nn as nn
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer
from torch_geometric.nn import SAGEConv

encoder = SentenceTransformer('all-MiniLM-L6-v2')  # frozen
node_texts = ["Aspirin: NSAID...", "COX2: enzyme..."]  # ...one string per node
node_feats = encoder.encode(node_texts, convert_to_tensor=True)
# node_feats: [N, 384], one vector per node, from text alone

# Step 2: train GNN using LLM features as input
class GraphSAGE(nn.Module):
    def __init__(self, in_dim, num_classes):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, 256)  # input dim = LLM embed dim
        self.conv2 = SAGEConv(256, num_classes)

    def forward(self, x, edge_index):
        x = F.relu(self.conv1(x, edge_index))  # aggregate LLM features
        return self.conv2(x, edge_index)       # predict

num_classes = 4  # placeholder: set to the number of classes in your task
model = GraphSAGE(in_dim=node_feats.shape[1], num_classes=num_classes)
# Train model on (node_feats, edge_index, labels). LLM is never called again.
```
Now flip the roles. The GNN is the preprocessing step. It runs over the graph and computes structural embeddings for each node: where is this node in the graph topology? How central is it? What does its neighborhood look like? These structural features are then injected into an LLM as additional context.
This approach is motivated by a core LLM limitation: LLMs process text sequences. They have no native ability to represent "this entity is the hub of a star subgraph" or "these two entities are 4 hops apart." Graph structure is a different inductive bias from language. The GNN bridges this gap by converting graph topology into a fixed-size vector that the LLM can process.
Each type of structural feature assigns a different "structural description" to each node. The GNN computes these as vectors; the LLM sees them as additional input tokens.
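To make "structural description" concrete, here is a minimal sketch that computes two hand-crafted structural features (degree and hop distance to an anchor node) on a toy graph. The graph, the `structural_features` helper, and the choice of features are assumptions for illustration; a trained GNN would learn richer structural embeddings rather than use these two numbers directly.

```python
from collections import deque

# Toy undirected graph as an adjacency list (hypothetical data, for illustration only).
graph = {
    "Aspirin": ["COX1", "COX2"],
    "COX1": ["Aspirin"],
    "COX2": ["Aspirin", "Inflammation"],
    "Inflammation": ["COX2", "Migraine"],
    "Migraine": ["Inflammation"],
}

def hop_distances(graph, source):
    """Breadth-first search: number of hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in graph[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist

def structural_features(graph, anchor):
    """Per-node feature vector: [degree, hops to anchor] (-1 if unreachable)."""
    dist = hop_distances(graph, anchor)
    return {n: [len(nbrs), dist.get(n, -1)] for n, nbrs in graph.items()}

feats = structural_features(graph, anchor="Migraine")
print(feats["Aspirin"])  # [2, 3]
```

In this toy graph Aspirin has two neighbors and sits three hops from Migraine, information that no amount of text about Aspirin alone would reveal.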
The showcase architecture: an end-to-end pipeline where an LLM encodes node text into semantic features, a GNN aggregates those features through the graph structure, and the combined representation is used for prediction. Systems such as GIANT and TAPE follow this pattern; GraphGPT combines the same two components but routes graph embeddings into the LLM rather than the reverse.
Watch the data flow through each stage. Every tensor shape is shown. Every operation is motivated. By the end, you'll understand exactly why each component exists and what would break without it.
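The stages can be traced by hand on a three-node graph. The sketch below is a simplified stand-in, not the pipeline of any particular system: the "text features" are made-up 2-dimensional vectors rather than real encoder outputs, and the GNN layer is replaced by an unweighted mean over each node and its neighbors.

```python
# Stage 1 (stand-in): pretend a frozen text encoder produced these vectors.
# Real encoders emit hundreds of dimensions; 2 keeps the arithmetic readable.
node_feats = {
    "Aspirin":  [1.0, 0.0],
    "COX2":     [0.0, 1.0],
    "Migraine": [0.0, 0.0],
}  # shape: [N=3, d=2]
edges = [("Aspirin", "COX2"), ("COX2", "Migraine")]

def neighbors(node, edges):
    """Nodes sharing an (undirected) edge with the given node."""
    return [v if u == node else u for u, v in edges if node in (u, v)]

def mean_aggregate(node_feats, edges):
    """Stage 2: one message-passing round, averaging a node with its neighbors."""
    out = {}
    for node, own in node_feats.items():
        group = [own] + [node_feats[n] for n in neighbors(node, edges)]
        out[node] = [sum(col) / len(group) for col in zip(*group)]
    return out

hidden = mean_aggregate(node_feats, edges)  # shape: still [N=3, d=2]
print(hidden["Migraine"])  # [0.0, 0.5]
```

Migraine started with an all-zero vector and picked up half of COX2's features after one round. That is the effect the GNN stage contributes: each node's representation now reflects its neighbors' text as well as its own.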
Chapter 1's approach (LLM encoder → frozen → GNN trains) is called a cascade: the two components are separate, one feeds the other, no gradients flow between them. Chapter 3's full pipeline can also be a cascade, or it can be a joint training setup where gradients from the GNN task flow back into the LLM to fine-tune it for graph-specific features.
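This is one of the most important design decisions in LLM+GNN systems. The tradeoff is sharp: a cascade is cheap because the LLM runs once and its outputs can be cached, but its features are never adapted to the graph task; joint training adapts the features, but every training step must backpropagate through the LLM.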
LoRA (Low-Rank Adaptation) offers a practical compromise. Instead of updating all LLM weights, inject small low-rank adapter matrices into the LLM's attention layers. Only the adapters (typically <1% of total parameters) are trainable. The backbone LLM weights are frozen: backpropagation still passes through the backbone's layers, but only the adapter weights receive updates.
```python
# LoRA: fine-tune the LLM with <1% of parameters trainable
import torch
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor
    target_modules=["q_proj", "v_proj"],  # which layers to adapt (LLaMA-style names)
    lora_dropout=0.1,
)
llm = get_peft_model(base_llm, lora_config)  # base_llm: a pre-loaded Hugging Face model
# llm now has ~0.5% trainable params (LoRA) + ~99.5% frozen (backbone)
# Only the adapter weights are updated: fast, cheap, lower risk of catastrophic forgetting

# Use in LLM+GNN pipeline: LLM(+LoRA) → node_feats → GNN → task loss → backprop
# gnn: the downstream GNN module, defined elsewhere
trainable = [p for p in list(llm.parameters()) + list(gnn.parameters()) if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)
```
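A quick calculation shows where the "<1%" figure comes from. For a single square projection matrix of size d × d, LoRA trains two matrices of shapes d × r and r × d instead. The hidden size of 4096 below is an assumed value for illustration, not taken from a specific model in this article.

```python
def lora_trainable_fraction(d_model: int, rank: int) -> float:
    """Fraction of one d_model x d_model projection's parameters that LoRA trains.

    LoRA learns an update B @ A, where B is d_model x rank and A is rank x d_model.
    """
    full = d_model * d_model
    adapter = 2 * d_model * rank
    return adapter / full

frac = lora_trainable_fraction(d_model=4096, rank=16)
print(f"{frac:.2%}")  # 0.78%
```

The whole-model fraction is lower still, because only some projections (here `q_proj` and `v_proj`) get adapters while the MLP blocks and embeddings stay untouched.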
Flip the architecture again. Instead of using the LLM to help the GNN, use the graph to help the LLM. The LLM remains the primary model — it generates text, answers questions, summarizes. But before the LLM processes a query, you augment its context with graph-derived information about the relevant entities.
This is graph-augmented generation, a graph-structured variant of retrieval-augmented generation (RAG). In standard RAG, you retrieve relevant text passages from a corpus and prepend them to the LLM's prompt. In graph-augmented generation, you retrieve relevant graph facts and paths, convert them to text, and prepend them.
Question: "What disease does the drug that inhibits COX2 treat?" To answer, you need: (Drug, inhibits, COX2) to find the drug, then (Drug, treats, Disease) to find the disease. This is a 2-hop KG query. The LLM alone might guess "Aspirin treats pain" from parametric memory — but parametric memory can be wrong or outdated. Augmenting with the KG provides verified, updatable facts.
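The two-hop lookup can be sketched over a list of triples. The triples and helper names below are hypothetical and exist only to show the query pattern; a production KG would use a graph database or an indexed triple store.

```python
# Toy triples (hypothetical, for illustration; not drawn from a real KG).
triples = [
    ("Aspirin", "inhibits", "COX2"),
    ("Aspirin", "treats", "Pain"),
    ("Metformin", "treats", "Type 2 diabetes"),
]

def subjects(triples, relation, obj):
    """All heads h such that (h, relation, obj) is in the KG."""
    return [s for s, r, o in triples if r == relation and o == obj]

def objects(triples, subj, relation):
    """All tails t such that (subj, relation, t) is in the KG."""
    return [o for s, r, o in triples if s == subj and r == relation]

def two_hop_answer(triples, target):
    """Hop 1: find drugs that inhibit target. Hop 2: find what those drugs treat."""
    answers = []
    for drug in subjects(triples, "inhibits", target):
        for disease in objects(triples, drug, "treats"):
            answers.append((drug, disease))
    return answers

print(two_hop_answer(triples, "COX2"))  # [('Aspirin', 'Pain')]
```

Because the answer is read off stored facts, updating the KG changes the answer immediately, with no retraining of the LLM.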
Verbalization is the process of converting a graph subgraph into a natural language description the LLM can process. Simple template: "(entity1, relation, entity2)" → "entity1 [relation] entity2." More sophisticated: train a small seq2seq model to generate fluent descriptions of subgraph patterns.
```python
# Graph-augmented LLM query
# entity_linker, verbalize, kg.shortest_paths, and llm.generate are placeholders
# for whatever entity linker, KG store, and LLM client you use.
def graph_augmented_answer(question, kg, llm):
    # 1. Extract entities mentioned in the question
    entities = entity_linker(question)  # e.g. ["Aspirin", "migraine"]

    # 2. Retrieve relevant KG subgraph (2-3 hop paths)
    paths = kg.shortest_paths(entities, max_hops=3)
    # paths: [("Aspirin","inhibits","COX2"), ("COX2","causedBy","Migraine")]

    # 3. Verbalize paths to text
    context = verbalize(paths)
    # "Aspirin inhibits COX2. COX2 is causally related to Migraine."

    # 4. Augment prompt with graph context
    prompt = (
        f"Knowledge graph context: {context}\n"
        f"Question: {question}\n"
        "Answer based on the provided context:"
    )

    # 5. LLM generates grounded answer
    return llm.generate(prompt)
```
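The simple template approach to verbalization can be sketched in a few lines. The template table and the fallback wording are assumptions for illustration; a real system would define one phrase per relation in its own schema.

```python
# Hypothetical relation-to-phrase templates; a real system would cover its own schema.
TEMPLATES = {
    "inhibits": "{h} inhibits {t}.",
    "treats": "{h} treats {t}.",
}

def verbalize(paths):
    """Turn (head, relation, tail) triples into one sentence each."""
    sentences = []
    for head, relation, tail in paths:
        # Fallback for relations without a template: use the raw relation name.
        template = TEMPLATES.get(relation, "{h} {r} {t}.")
        sentences.append(template.format(h=head, r=relation, t=tail))
    return " ".join(sentences)

context = verbalize([("Aspirin", "inhibits", "COX2"), ("Aspirin", "treats", "Pain")])
print(context)  # Aspirin inhibits COX2. Aspirin treats Pain.
```

Templates are predictable and cheap, but relation names such as `causedBy` read awkwardly through the fallback, which is one motivation for the learned seq2seq verbalizers mentioned in the text.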
Beyond just using LLM embeddings as static node features, LLMs can help GNNs in more dynamic ways: generating pseudo-labels for unlabeled nodes, suggesting which edges should exist (edge prediction augmentation), and explaining GNN predictions in natural language. This chapter covers three LLM→GNN augmentation strategies that go beyond simple feature encoding.
Most graph datasets have very few labeled nodes. Training a GNN on 10 labels out of 10,000 nodes is hard. LLMs can help: given a node's text description, prompt the LLM to predict its label (e.g., paper topic category). These are pseudo-labels — noisy but much more numerous than human labels.
Real-world graphs are often missing edges that should exist. A citation graph is missing citations for papers published after the scraping date. A collaboration graph is missing co-authorships across institutions. Use an LLM to predict which edges are likely: given two node descriptions, prompt "are these likely to interact?" Add high-confidence predicted edges to the graph before GNN training.
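A minimal sketch of this edge-augmentation loop follows. The `score_pair` callable is a hypothetical stand-in for an LLM prompt that returns a confidence in [0, 1]; the demonstration uses simple word overlap instead, so the code runs without any model.

```python
from itertools import combinations

def augment_edges(node_texts, existing_edges, score_pair, threshold=0.9):
    """Return candidate edges whose score clears the threshold.

    score_pair is a hypothetical callable (text_a, text_b) -> float in [0, 1];
    in practice it would wrap an LLM prompt.
    """
    known = {frozenset(e) for e in existing_edges}
    added = []
    for a, b in combinations(sorted(node_texts), 2):
        if frozenset((a, b)) in known:
            continue  # edge already in the graph
        if score_pair(node_texts[a], node_texts[b]) >= threshold:
            added.append((a, b))
    return added

# Stand-in scorer for demonstration only: word overlap (Jaccard), not an LLM.
def toy_score(text_a, text_b):
    wa, wb = set(text_a.lower().split()), set(text_b.lower().split())
    return len(wa & wb) / len(wa | wb)

texts = {
    "p1": "graph neural networks for citation data",
    "p2": "graph neural networks for molecule data",
    "p3": "protein folding with diffusion",
}
new_edges = augment_edges(texts, existing_edges=[], score_pair=toy_score, threshold=0.5)
print(new_edges)  # [('p1', 'p2')]
```

Scoring every pair is quadratic in the number of nodes, so a real pipeline would first narrow the candidates (for example, to pairs with similar embeddings) before spending LLM calls on them.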
After the GNN makes a prediction, use the LLM to generate a natural language explanation. Pipeline: GNN identifies important subgraph (via GNNExplainer or attention weights) → verbalize the subgraph → prompt LLM to explain the prediction in terms of the verbalized subgraph → LLM outputs human-readable explanation.
```python
# LLM-augmented pseudo-label generation
# llm.generate and llm.logprob are placeholders for your LLM client's API;
# the default threshold is illustrative and should be tuned on held-out labels.
def generate_pseudolabels(unlabeled_nodes, llm, categories, threshold=-0.1):
    pseudo = {}
    for node in unlabeled_nodes:
        prompt = (
            f"Paper abstract: {node.text}\n"
            f"Classify this paper into one of: {categories}\n"
            "Respond with exactly one category name."
        )
        pred = llm.generate(prompt, max_tokens=10).strip()
        confidence = llm.logprob(pred)  # token log-probability (<= 0)
        # Keep only high-confidence predictions that name a valid category
        if confidence > threshold and pred in categories:
            pseudo[node.id] = pred
    return pseudo

# Add pseudo-labels to training set; weight them lower than human labels
# loss = human_loss + lambda * pseudo_loss  (lambda < 1)
```
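The weighted loss mentioned in the final comment can be worked through numerically. This is a sketch under assumed values: the predicted probabilities, the category names, and the weight `lam=0.5` are all made up for illustration.

```python
import math

def cross_entropy(probs, label):
    """Negative log-probability the model assigns to the given label."""
    return -math.log(probs[label])

def combined_loss(human_batch, pseudo_batch, lam=0.3):
    """Mean loss on human labels plus lam times mean loss on pseudo-labels."""
    human = sum(cross_entropy(p, y) for p, y in human_batch) / len(human_batch)
    pseudo = sum(cross_entropy(p, y) for p, y in pseudo_batch) / len(pseudo_batch)
    return human + lam * pseudo

# Each item: (model's predicted class probabilities, target label)
human_batch = [({"cs.LG": 0.5, "cs.CL": 0.5}, "cs.LG")]
pseudo_batch = [({"cs.LG": 0.25, "cs.CL": 0.75}, "cs.LG")]

loss = combined_loss(human_batch, pseudo_batch, lam=0.5)
print(round(loss, 4))  # 1.3863
```

Here the human-labeled term contributes ln 2 and the pseudo-labeled term contributes 0.5 × ln 4, so a wrong pseudo-label pulls on the model only half as hard as a wrong human label would.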
How much does combining LLMs with GNNs actually help? Let's look at real numbers from key benchmarks. Numbers tell the story better than any abstract argument.
ogbn-arxiv has 169,343 arXiv papers. Task: classify each paper into one of 40 subject areas. Each node's text is the paper's title and abstract. Edges are citation links.
| Method | Test Accuracy | Key design |
|---|---|---|
| GCN (default OGB features) | 71.7% | Averaged skip-gram word embeddings (128-dim), no LLM |
| BERT features + GCN (cascade) | 73.3% | Frozen BERT embeddings as input to GCN |
| GIANT (LM fine-tuned on neighbors) | 75.9% | LM trained to predict neighbor IDs from text |
| TAPE (LLM explanation features) | 76.5% | Frozen LLaMA explanations + GNN |
| TAPE (fine-tuned LLM + GNN) | 77.9% | Joint training with LoRA fine-tuning |
Link prediction on a citation graph: predict whether a citation link exists between two papers. Text features from abstracts significantly boost performance over graph-structure-only methods.
| Method | MRR | Key design |
|---|---|---|
| GraphSAGE (one-hot) | 82.6% | No text features |
| GraphSAGE + BERT features | 87.3% | Frozen BERT cascade |
| LLM + GNN (joint) | 90.1% | Fine-tuned LLM + GNN, joint training |
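For readers unfamiliar with the metric in the table above, MRR (mean reciprocal rank) averages 1/rank of the correct item across queries. The sketch below uses made-up rankings and assumes each query has exactly one correct answer.

```python
def mean_reciprocal_rank(ranked_lists, true_items):
    """Average of 1/rank of the true item in each ranked candidate list.

    A query whose true item is missing from its list contributes 0.
    """
    total = 0.0
    for ranked, truth in zip(ranked_lists, true_items):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(true_items)

# Two queries: the true paper is ranked 1st in the first and 3rd in the second.
ranked = [["p7", "p2", "p9"], ["p4", "p1", "p3"]]
truth = ["p7", "p3"]

mrr = mean_reciprocal_rank(ranked, truth)
print(round(mrr, 4))  # 0.6667
```

An MRR of 1.0 (or 100%) means the correct link is always ranked first, so moving a correct item from rank 2 to rank 1 helps far more than moving it from rank 20 to rank 10.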
How much LLM features matter depends on edge density: sparse graphs need better text features, while dense graphs can work with simpler ones.
LLM+GNN systems work in practice, and the benchmark results above support that. But they come with significant engineering challenges that research papers rarely discuss. This chapter is honest about what's hard.
A node's "context" in a graph includes its k-hop neighborhood, potentially hundreds of connected entities. Serializing this neighborhood as text (for graph-augmented LLMs) quickly consumes an LLM's context window. A 2-hop neighborhood in ogbn-arxiv has ~50 papers. Each abstract is ~200 words. 50 × 200 = 10,000 words. That's already a long context, and it grows roughly exponentially with hop depth: with average degree d, hop k adds on the order of d^k nodes.
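A back-of-the-envelope estimate makes the growth visible. The average degree of 7 is an assumed value, chosen only so that the 2-hop count lands near the ~50 papers quoted above; the estimate also assumes a tree-like neighborhood with no shared neighbors, so it is an upper bound.

```python
def neighborhood_words(avg_degree, hops, words_per_node):
    """Rough upper bound on (node count, word count) for a k-hop neighborhood.

    Assumes hop k contributes avg_degree ** k new nodes (no overlap).
    """
    nodes = sum(avg_degree ** k for k in range(1, hops + 1))
    return nodes, nodes * words_per_node

for hops in (1, 2, 3):
    nodes, words = neighborhood_words(avg_degree=7, hops=hops, words_per_node=200)
    print(hops, nodes, words)
# 1 7 1400
# 2 56 11200
# 3 399 79800
```

Adding one hop multiplies the text volume by roughly the average degree, which is why practical systems sample or rank neighbors instead of serializing all of them.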
LLM hidden states live in a semantic embedding space optimized for language modeling. GNN hidden states live in a space optimized for graph topology prediction. These spaces don't naturally align — a point in LLM space doesn't correspond to the same point in GNN space, even for the same entity. Combining them requires a learned alignment (projection layer), but the projection may lose information from both.
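The projection step can be sketched with plain lists. Everything here is an assumption for illustration: the dimensions are toy-sized, the weights are random rather than learned, and fusing by addition is only one option (concatenation is another).

```python
import random

def make_projection(in_dim, out_dim, seed=0):
    """Randomly initialised weight matrix [out_dim][in_dim]; training would update it."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]

def project(weights, vec):
    """Linear map: out[i] = sum_j weights[i][j] * vec[j]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

gnn_vec = [0.2, -0.1, 0.4]       # GNN hidden state, dim 3 (toy size)
llm_vec = [0.5, 0.1, -0.3, 0.0]  # LLM hidden state, dim 4 (toy size)

W = make_projection(in_dim=3, out_dim=4)
aligned = project(W, gnn_vec)                      # now dim 4, same as llm_vec
fused = [a + b for a, b in zip(aligned, llm_vec)]  # element-wise sum of the two views
print(len(aligned), len(fused))  # 4 4
```

The projection only fixes the shape mismatch. Whether the projected vector means anything in the LLM's space depends entirely on training `W` against a downstream objective.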
LLMs encode world knowledge from pre-training. KGs encode factual knowledge from structured data. These can disagree. An LLM might believe "Drug X treats Disease Y" based on medical text it saw during pre-training. The KG might lack this edge because the finding was published after the KG was compiled. Which source wins? Systems that blindly concatenate LLM context and KG facts without resolving conflicts produce inconsistent outputs.
LLM+GNN is where two major threads of machine learning converge. Understanding the connections to adjacent fields prevents reinventing wheels and reveals which open problems are really the same problem in disguise.
| LLM+GNN concept | Parallel in another field | Shared insight |
|---|---|---|
| LLM as feature encoder (frozen) | ImageNet pre-training for computer vision | Transferable representations reduce task-specific data needs |
| Graph-augmented LLM generation | Retrieval-augmented generation (RAG) | Ground LLM in external verified knowledge at inference time |
| Joint training (LLM+GNN) | Multi-modal learning (CLIP) | Two modalities (text + graph) benefit from joint optimization |
| LoRA for LLM adaptation | Adapter tuning in NLP | Efficient fine-tuning by adding small, task-specific modules |
| LLM as pseudo-labeler | Self-training / semi-supervised learning | Model generates labels for unlabeled data to expand training set |
| GNN as structural context provider | Position encodings in transformers | Inject structural information (graph position) into attention |
"The great thing about a language is you can use it to think with — and the great thing about a graph is you can use it to remember things with. Together, they almost think."
— paraphrased from the spirit of CS224W Lecture 16