LLMs Get Lost In Multi-Turn Conversation

Chapter 0: The Problem

You open ChatGPT. You type: "Write me a Python function that takes a list of bank transactions and returns True if the balance ever goes below zero." The model nails it. 90%+ accuracy on benchmarks like HumanEval. Impressive.

Now imagine a more realistic scenario. You type: "I need a function to check something about transactions." The model asks what. You say: "It's a list of integers, deposits and withdrawals." A turn later: "Start at balance zero." Another turn: "Return True if it ever dips negative." Same information, same task — just delivered gradually, the way humans actually talk.

The model fails. Not sometimes — consistently. It makes assumptions in early turns ("I'll assume you want the final balance!"), generates a premature solution, then when you correct it, it tries to patch the wrong answer instead of starting fresh. By the end of the conversation, it's confused by its own prior attempts.

The gap between lab and reality: LLMs are evaluated on single-turn, fully-specified instructions — "here's everything you need, go." But real users provide information gradually. Studies of actual LLM conversation logs (Herlihy et al., 2024) confirm that underspecification is the norm, not the exception. Zipf called this the "principle of least effort" back in 1949: humans naturally communicate the minimum needed in each message.

This paper asks a simple question: how much performance do LLMs lose when the same task is spread across multiple turns instead of given all at once?

The answer is devastating: 39% average degradation across 15 state-of-the-art models and 6 diverse tasks. Every single model — from 8B open-weights to Gemini 2.5 Pro — gets substantially worse. And the degradation is not a gentle slope. Even splitting the instruction into just two turns causes most of the damage.

Single-Turn vs Multi-Turn: The Performance Gap

Each bar shows average performance on the same tasks. Left: all information in one message. Right: same information spread across turns. Click a model to see its breakdown.

The most unsettling finding: the strongest models lose just as much as the weakest. GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro all suffer 30–40% degradation. Being smarter doesn't make you immune to getting lost in conversation.

Even more alarming: reasoning models like o3 and DeepSeek-R1, which use extra "thinking" tokens, fare no better. Their longer responses actually make things worse — they contain more assumptions that compound across turns.

Why this matters right now: As LLMs become the backbone of agent systems (LangChain, AutoGen, Claude Code), they run multi-turn conversations constantly — with humans, with tools, with other LLMs. If models degrade this badly with just 2–8 turns of underspecified input, every agent pipeline is operating far below its benchmarked capability. The lab numbers are a fantasy.

Why is single-turn evaluation misleading for real-world LLM usage?

Real users provide information gradually across turns (underspecification), but benchmarks give everything at once — and performance drops 39% in the realistic setting Single-turn tasks are too easy for modern models Multi-turn conversations use more tokens, causing out-of-memory errors

Chapter 1: The Key Insight

Before this paper, multi-turn evaluation existed — but it was episodic. Each turn in the conversation was a self-contained subtask. "Translate this sentence." "Now summarize this paragraph." The turns happen to be in the same conversation, but each can be graded independently. This design overestimates LLM capability because the model never has to integrate information across turns.

The key insight of this paper is a method called sharded simulation: take a single, fully-specified instruction from an existing benchmark, break it into pieces ("shards"), and reveal one shard per turn. The model must fuse the shards across the conversation to solve the original task.

Sharding in one sentence: Instead of inventing new multi-turn tasks (which can't be compared to single-turn baselines), take existing single-turn tasks and fragment them into multi-turn conversations. Same task, same evaluation, different delivery — a controlled experiment isolating the effect of multi-turn underspecification.

Why This Is Clever

Previous multi-turn benchmarks had a fatal flaw: the multi-turn tasks were different from the single-turn ones. You couldn't tell if performance dropped because multi-turn is harder or because the tasks themselves were harder. Sharding eliminates this confound.

The original instruction and its sharded version deliver exactly the same information. The shards are just pieces of the original, rephrased to be natural conversational turns. If you concatenate all shards, you get back the original meaning (this is verified by a control condition called CONCAT that averages 95.1% of single-turn performance).

So when SHARDED performance drops to 61% of single-turn, we know it's not because information was lost. It's because the delivery method — gradual revelation across turns — causes the model to stumble.

The Sharding Concept

Watch a fully-specified instruction get sharded into conversational turns. Click "Shard It" to see the transformation, then "Simulate" to watch the model struggle.

The Experimental Design

The paper defines five conversation types, all derived from the same sharded instructions:

FULL

Original instruction in one turn. The single-turn baseline. Score = what benchmarks report.

↓ shard the instruction

CONCAT

All shards concatenated as bullet points in one turn. Controls for rephrasing effects. If CONCAT = FULL, the shards preserved all information.

↓ spread across turns

SHARDED

One shard per turn, multi-turn conversation. The core setting. This is where performance craters.

↓ add remediation

RECAP

After all shards revealed, add a final turn restating everything. Tests if "just remind it" fixes things. Spoiler: it helps, but not enough.

↓

SNOWBALL

Each turn reveals the new shard AND repeats all prior shards. Cumulative repetition. Offers 15–20% improvement but still falls short.

The discovery process: The authors probably tried the obvious thing first: just compare single-turn and multi-turn performance directly. But that comparison is unfair — you can't tell if the drop is from multi-turn difficulty or from different tasks. So they invented sharding to create a fair comparison. Then they added CONCAT to isolate rephrasing effects. Then RECAP and SNOWBALL to test agent-style mitigations. Each control peels back one potential confound.

What does the CONCAT control condition prove?

That multi-turn conversations are always worse That the performance drop in SHARDED is NOT due to information loss from rephrasing — since CONCAT (same shards, one turn) achieves 95% of FULL performance That concatenating prompts is the best strategy

Chapter 2: The Sharding Process

How do you take a fully-specified instruction and turn it into a realistic multi-turn conversation? You can't just randomly chop sentences apart — the shards need to feel like natural things a human would say, one turn at a time. This chapter traces the sharding pipeline end to end.

What Makes a Valid Shard Set

The paper defines five formal properties that every sharded instruction must satisfy. Let's work through them concretely. Consider this original instruction:

original
"Write a Python function below_zero(ops) that takes a list of
deposits and withdrawals starting from balance 0, and returns
True if the balance ever goes below zero, else False."

The paper decomposes this into atomic content units (ACUs): the intent I plus the clarifying details (c₁, ..., c_m). For our example:

I (intent): Write a Python function to check something about a transaction list
c₁: The input is a list of integers (deposits and withdrawals)
c₂: Starting balance is 0
c₃: Return True if balance ever goes below zero
c₄: Otherwise return False

P1: Information Preservation. The shard set must collectively contain the same information as the original. Nothing is lost, nothing is added. Formally: I(q) = I(q') where q is the original and q' is the shard set. If you reconstructed the instruction from all shards, you'd get something equivalent to the original.

P2: Clear Initial Intent. The first shard must establish the high-level goal. The user doesn't say "the balance starts at zero" as their opening message — they say "I need a function to check transactions." This mirrors how humans actually start conversations: with the big picture, then details.

P3: Order Insensitive. After the first shard (which sets the intent), the remaining shards can come in any order and still produce the same output. "Start at zero" can come before or after "return True if negative" — the final answer is the same. The paper achieves this by decontextualizing each shard (removing references to other shards).

P4: Maximal Sharding. Maximize the number of shards. Each shard should introduce exactly one atomic piece of information. This creates the most realistic multi-turn scenario.

P5: Minimal Transformation. Keep the language close to the original. Don't rewrite "returns True" as "yields a positive Boolean response." The shards should feel like natural fragments, not adversarial reformulations.

The Four-Step Pipeline

The authors developed a semi-automatic process combining LLM assistance with human quality control:

1. Segmentation

An LLM extracts atomic content units from the original instruction. Task-specific few-shot prompts guide the segmentation. Instructions yielding fewer than 3 segments are filtered out (too simple to shard meaningfully).

↓

2. Rephrasing

Each segment is decontextualized (made standalone) and rephrased as a natural conversational utterance. The first shard becomes the initial query (establishing intent). Dependencies between segments are resolved.

↓

3. Verification

Automated checks: run 10 FULL, 10 CONCAT, and 10 SHUFFLE-CONCAT simulations. If CONCAT or SHUFFLE-CONCAT performance drops below 80% of FULL, the sharding failed — information was lost or distorted. Reject and redo.

↓

4. Manual Inspection

Two human authors review every sharded instruction. They merge overly granular shards, split compound ones, reorder for naturalness, and edit phrasing. This takes ~3 hours per 100 samples. Automation alone isn't enough for research-grade quality.

A Worked Example

Let's trace the full pipeline for the math task. Original instruction from GSM8K:

original
"Josh decides to try flipping a house. He buys a house for $80k
and then puts in $50k in repairs. This increased the value of
the house by 150%. How much profit did he make?"

Step 1 (Segmentation): The LLM identifies 4 segments: (1) Josh flipped a house, (2) bought for $80k, (3) spent $50k on repairs, (4) value increased 150%, (5) find profit.

Step 2 (Rephrasing): Shard 1 = "My friend Josh sold his home. I want to know how much profit he made." Shard 2 = "He bought it for $80,000." Shard 3 = "He spent $50k on repairs." Shard 4 = "The house value increased by 150%." Shard 5 = "That's all I know. What's his profit?"

Step 3 (Verification): Run FULL and CONCAT simulations. CONCAT performance is 95% of FULL. Pass.

Step 4 (Inspection): Human reviewer confirms: each shard is a natural conversational turn, order of shards 2–4 doesn't matter, intent is clear from shard 1.

The 80% threshold matters: During verification, if CONCAT drops below 80% of FULL, it means the rephrasing distorted the meaning. The threshold is conservative — it catches cases where decontextualization accidentally changed the task. For example, if "the balance" becomes "a balance" and the model no longer knows which balance to check, that's a failed shard.

Interactive Sharding Pipeline

See how a fully-specified instruction flows through each pipeline stage. Use the buttons to step through. The verification scores are simulated.

Scale of the Sharding Effort

Task	Source Benchmark	Instructions	Avg Shards	Evaluation Metric
Code	HumanEval + LiveCodeBench	90–120	4–6	Functional Accuracy
Database	Spider (Text-to-SQL)	90–120	3–5	Functional Accuracy
Actions	BFCL (Function Calling)	90–120	3–5	Exact Match
Math	GSM8K	90–120	4–6	Exact Match
Data-to-Text	ToTTo	90–120	3–4	BLEU
Summary	Summary of a Haystack	90–120	6–8	Coverage + Citation

Six tasks spanning programming (Code, Database, Actions) and natural language (Math, Data-to-Text, Summary). Deliberately diverse: if the effect only appeared in code tasks, it might be a code-specific phenomenon. But it appears everywhere.

Why does the verification step run SHUFFLE-CONCAT simulations in addition to regular CONCAT?

To save computational costs To verify property P3 (order insensitivity) — if shuffling the shard order drops performance, the shards have hidden dependencies that make order matter To test if the model can handle longer prompts

Chapter 3: The Simulation Engine

Sharding gives us the conversation script. But who plays the user? Who decides if the model's response is an answer attempt or a clarification? Who scores the answers? The simulation engine is a three-actor loop, all backed by LLMs.

The Three Actors

The simulation runs as a loop with three LLM-backed roles. Only one — the assistant — is the model under test. The other two are infrastructure:

User Simulator

GPT-4o-mini with full sharded instruction + conversation history. Picks the next shard contextually (not randomly), rephrases it to fit the conversation flow. If the assistant asks a clarification question, the simulator responds with the most relevant shard.

↓ reveals ≤ 1 shard

Evaluated Assistant

The model being tested (e.g., GPT-4.1, Claude 3.7, Llama 3.3). Receives only a minimal system prompt (e.g., tool list) — never told the conversation will be multi-turn or underspecified. This measures default behavior.

↓ generates response

System (Strategy + Evaluator)

GPT-4o-mini classifies the response into one of 7 strategies (clarification, refusal, hedging, interrogation, discussion, missing, answer attempt). If it's an answer attempt, the answer extractor isolates the answer span and the task evaluator scores it.

↻ repeat until correct or no shards remain

The Seven Response Strategies

Following Herlihy et al. (2024), the system classifies every assistant response into one of seven categories. This matters because only answer attempts get scored:

Strategy	What It Looks Like	Scored?
Clarification	"Could you specify whether you want deposits only?"	No
Refusal	"I need more information before I can help."	No
Hedging	"I'm not sure exactly what you need, but here's a guess..."	No
Interrogation	"What programming language? What's the input format?"	No
Discussion	"That's an interesting problem. Here are some approaches..."	No
Missing	"I can try, but I'm missing details about X."	No
Answer Attempt	"Here's the function: def below_zero(ops):..."	Yes

A crucial design choice: the model gets credit for its best answer across all turns. In a conversation with N shards, the model can make up to N answer attempts, and its final score is the maximum. This actually favors the multi-turn setting — the model gets multiple shots. Despite this advantage, multi-turn performance still craters.

What the assistant doesn't know: The evaluated model never receives a hint that the conversation will be multi-turn or underspecified. It gets a minimal system prompt (e.g., "You are a helpful assistant") and that's it. This is deliberate: the paper measures how models behave by default when faced with gradual information disclosure. Any model that proactively asks clarifying questions instead of jumping to an answer would do better — but almost none do.

Scoring: How Answer Attempts Are Evaluated

When the strategy classifier detects an answer attempt, the answer extractor isolates just the answer (the code block, the SQL query, the number) from the surrounding text. This is important: the model might write a correct function wrapped in two paragraphs of incorrect reasoning. The extractor pulls out the code and scores only that.

Each task uses its original benchmark's metric:

Code, Database: Functional accuracy — does the generated code actually pass test cases?
Actions, Math: Semantic exact match — are the API calls / numerical answers correct?
Data-to-Text: BLEU score (0–100)
Summary: LLM-as-judge "Joint Score" assessing coverage and citation quality (0–100)

Binary metrics (pass/fail) are mapped to 0 or 100 so all tasks share the same scale. This enables cross-task averaging.

Simulation Validity

How reliable is the automated simulation? The authors manually inspected 200 SHARDED conversations across four tasks:

Component	Accuracy
Shard Fully Revealed (user sim)	96.0%
Shard Contextualized (user sim)	98.4%
Strategy Classification	95.2%
Answer Extraction	97.0%
Overall Success	97.8%

Less than 2% of errors disfavored the assistant. The simulation is not perfect, but its errors are roughly symmetric — they don't systematically make the multi-turn setting look worse than it is.

The $5,000 experiment: The total cost of running 200,000+ simulated conversations across 15 models, 6 tasks, and 3 simulation types was approximately $5,000. This is remarkable cost-efficiency: the same experiment with human users would cost orders of magnitude more and couldn't be repeated 10 times per instruction.

Simulation Loop Visualizer

Watch a multi-turn conversation unfold. The user reveals shards, the assistant responds, the system classifies each response. Click "Next Turn" to step through.

The model gets credit for its BEST answer across all turns. Why does this design choice matter for interpreting results?

It gives the multi-turn setting a structural advantage over single-turn (multiple chances vs one), making the observed 39% degradation even more striking because the deck was stacked in multi-turn's favor It makes scoring faster It ensures models never get a score of zero

Chapter 4: The Metrics

A single number — average performance — hides a critical distinction. When a model drops from 90% to 60%, is it because it got less capable (can't solve the hard problems anymore) or because it became less reliable (solves them sometimes but not others)? This paper introduces a metric framework that separates the two.

The Three Metrics

For each instruction, the paper runs N = 10 independent simulations (temperature T = 1.0). This produces a set of scores S = {S₁, S₂, ..., S₁₀}, each from a different random seed. From these 10 scores, three metrics emerge:

Average Performance: P̄ = (1/N) ∑_i S_i

This is the standard metric — the mean score across all runs. It's what most benchmarks report. But it conflates capability and consistency.

Aptitude: A⁹⁰ = percentile₉₀(S)

Aptitude is the 90th percentile score — the model's best-case performance. Think of it as: "When everything goes right, how well does this model do?" If a student scores 95 on their best exam, their aptitude is 95, regardless of whether they bombed other exams.

Unreliability: U₁₀⁹⁰ = percentile₉₀(S) − percentile₁₀(S)

Unreliability is the interpercentile range — the gap between the best and worst outcomes. If the same model scores 95 on its best run and 40 on its worst, its unreliability is 55 points. A reliable model has U close to 0 (consistent scores). An unreliable model has high U (wildly varying scores).

The decomposition that changes everything: Average performance P̄ = Aptitude − (Unreliability effects). A drop from P̄ = 90 to P̄ = 60 could be: (a) aptitude dropped from 95 to 65 (the model just got worse), or (b) aptitude stayed at 95 but unreliability jumped from 10 to 70 (the model CAN solve it but usually doesn't). The paper finds that multi-turn degradation is overwhelmingly (b) — increased unreliability, not lost aptitude.

Worked Example: Aptitude vs Reliability

Let's build intuition with concrete numbers. Imagine a model on 10 instructions, 10 runs each:

FULL setting (single-turn): The model scores 100 on 9 instructions and 0 on 1 instruction, consistently across all 10 runs. P̄ = 90, A = 100, U = 0 for 9 instructions; P̄ = 0, A = 0, U = 0 for 1 instruction. Corpus averages: P̄ = 90, A = 90, U = 0. High aptitude, perfectly reliable.

SHARDED setting (multi-turn): Same model, same instructions. Now for each instruction, it scores 100 on some runs and 0 on others (random due to stochastic generation + cascading errors from early turns). Across the corpus: P̄ = 60, A = 76 (it CAN get high scores), U = 50 (but scores swing wildly between runs). The 30-point drop in P̄ is mostly from unreliability (U went from 0 to 50), not from aptitude (A only dropped from 90 to 76).

Aptitude vs Unreliability Visualizer

Each box plot represents 10 simulations of one instruction. The top of the box is aptitude (90th percentile). The box height is unreliability (gap between 90th and 10th). Drag the "Multi-turn degradation" slider to see how unreliability grows.

Multi-turn effect 0%

Why 10 Runs at Temperature 1.0?

Most benchmarks run models once with temperature 0 (greedy decoding). This paper deliberately uses T = 1.0 and runs 10 times. Why?

At T = 0, you measure aptitude only. The model always generates its single most likely response. But real users interact at default temperature (T = 1.0), and each conversation plays out differently. By running 10 times, the paper captures the variance in outcomes — which turns out to be the dominant factor in multi-turn degradation.

The paper also runs a temperature ablation (T = 1.0, 0.5, 0.0). Finding: lower temperature improves reliability in single-turn but not in multi-turn. Even at T = 0.0, multi-turn unreliability remains ~30% because the stochastic user simulator introduces variation in early turns, and the model's deterministic response to Turn 1's variation cascades into different (often incorrect) trajectories.

The cascading error insight: In single-turn, T = 0 eliminates all randomness — same input, same output, always. In multi-turn, even T = 0 can't save you because the USER's messages vary (different shard orderings, different phrasings). The model's response to a slightly different Turn 2 message leads to a different Turn 3, which leads to a different Turn 4, and so on. Small early variations cascade into large outcome differences. This is why the paper calls it "getting lost" — one wrong turn early on sends the model down a bad path it can't recover from.

The paper finds that multi-turn degradation is primarily caused by:

Loss in aptitude (the model can no longer solve the problems even in the best case) Increase in unreliability (the model CAN solve the problems sometimes, but outcomes vary wildly between runs — U increases 112% while A drops only 16%) Both equally

Chapter 5: The Main Results

200,000+ simulated conversations. 15 LLMs. 6 tasks. 3 simulation types. The results are in, and they tell a consistent story: every model gets lost in multi-turn conversation.

The Headline Numbers

Comparing FULL (single-turn) to SHARDED (multi-turn) performance across all models and tasks:

Average FULL performance: ~80% (ranging from ~40% for weak models to ~97% for strong ones)
Average SHARDED performance: ~50% (ranging from ~25% to ~68%)
Average degradation: −39%
CONCAT performance: ~95% of FULL (confirming sharding doesn't lose information)

The uncomfortable truth: Models that score 90%+ on single-turn benchmarks drop to 55–68% when the same tasks are delivered multi-turn. The best model in the study (Gemini 2.5 Pro) goes from 97.3% average FULL to 64.5% SHARDED. The benchmark numbers that labs advertise are the FULL numbers. The reality of how users interact is the SHARDED numbers.

Per-Model Breakdown

The data from Table 1 reveals several surprising patterns:

Model	FULL	CONCAT	SHARDED	Degradation
Llama 3.1-8B	42	39	25	−40%
OLMo-2-13B	39	34	18	−54%
Claude 3 Haiku	55	51	34	−38%
GPT-4o-mini	68	66	41	−40%
Llama 3.3-70B	68	67	42	−38%
o3 (reasoning)	77	73	47	−39%
Claude 3.7 Sonnet	81	78	46	−43%
DeepSeek-R1	79	78	47	−41%
GPT-4o	82	81	50	−39%
GPT-4.1	85	82	54	−36%
Gemini 2.5 Pro	87	85	57	−34%

(Values are approximate averages across 6 tasks, simplified from the full table.)

Finding 1: No Model Is Immune

The degradation range is 30–54% across all 15 models. Stronger models (Gemini 2.5 Pro, GPT-4.1) have slightly smaller relative drops, but their absolute drops are just as large: going from 87 to 57 is a 30-point loss.

Finding 2: Reasoning Doesn't Help

The two reasoning models — o3 and DeepSeek-R1 — degrade at the same rate as non-reasoning models (~39–41%). Their extra "thinking" tokens should theoretically help them strategize over the conversation. They don't.

The paper identifies a likely cause: reasoning models generate 33% longer responses on average. Longer responses contain more assumptions about unspecified details. These assumptions then get baked into subsequent turns, making the model harder to redirect.

Finding 3: Task-Specific Patterns

Not all tasks degrade equally. Some models show task-specific resilience:

Code: Claude 3.7 Sonnet and GPT-4.1 maintain relatively strong multi-turn code performance
Actions (API calling): Command-A shows the least degradation
Data-to-Text: Gemini 2.5 Pro holds up better than others
Summary: Universally terrible — all models collapse in multi-turn summarization

This task-specificity suggests that multi-turn capability is not a single skill — it's domain-dependent. A model that handles multi-turn code well might still fail at multi-turn summarization.

Performance Heatmap: FULL vs SHARDED

Interactive heatmap showing degradation patterns across models and tasks. Darker red = larger drop. Hover over cells for exact numbers. Toggle between absolute scores and percentage degradation.

Finding 4: The Gradual Sharding Experiment

Is the effect only catastrophic when instructions are sharded into many tiny pieces? The paper tests this with a gradual sharding experiment: the same 31 instructions are sharded into 2, 3, 4, ... 8 pieces.

Result: even 2 shards cause most of the damage. Going from 1 shard (FULL) to 2 shards causes a large drop. Going from 2 to 8 causes further decline, but the steepest fall is at the first split. Any conversation involving underspecification and more than one turn leads to models getting lost.

The two-turn threshold: This is perhaps the paper's most actionable finding. You don't need a 10-turn conversation to see degradation. Just TWO turns — the user says something underspecified, then adds one clarification — and the model is already significantly less reliable. This means every "follow-up question" interaction in production is operating in the danger zone.

The gradual sharding experiment reveals that:

Even splitting the instruction into just 2 turns causes most of the performance degradation — the transition from 1 turn to 2 turns is the critical threshold Degradation increases linearly with the number of shards Only instructions with 5+ shards show significant degradation

Chapter 6: Aptitude vs Reliability

We know that SHARDED performance drops 39% on average. Chapter 4 gave us the tools to ask: where does this drop come from? Now we apply those tools to the full dataset of 200,000+ conversations.

Single-Turn: Aptitude Predicts Reliability

In single-turn (FULL and CONCAT), the paper finds a clean relationship: models with higher aptitude are also more reliable. GPT-4.1 and Gemini 2.5 Pro have both the highest aptitude (A ≈ 95–100) and the lowest unreliability (U ≈ 10–15). Conversely, Llama 3.1-8B and OLMo-2-13B have the lowest aptitude AND the highest unreliability.

This makes intuitive sense: a model that truly understands the task gives the same (correct) answer regardless of random variation. A model at the edge of its capability sometimes gets it right (lucky sampling) and sometimes doesn't.

The single-turn story: Better models → higher aptitude AND higher reliability. The correlation is strong. This is why the community has focused on improving aptitude — reliability comes along for free. Just make the model smarter and it'll also be more consistent.

Multi-Turn: The Correlation Breaks

In the SHARDED setting, this comfortable relationship shatters. Here are the key numbers:

Aptitude (A) drops modestly: average decrease of 16% from FULL to SHARDED. Models can still solve the problems when things go well.
Unreliability (U) skyrockets: average increase of 112% (more than doubles). The gap between best and worst runs explodes.
The aptitude-reliability correlation disappears: in multi-turn, even the most capable models (GPT-4.1, Gemini 2.5 Pro) have unreliability U ≈ 50–65, nearly indistinguishable from weaker models.

The broken promise: In single-turn, making models smarter makes them more reliable. In multi-turn, making models smarter does NOT make them more reliable. All models — weak and strong — converge to the same high level of unreliability. This means improving aptitude alone won't solve the multi-turn problem. Reliability must be explicitly trained for.

Visualizing the Decomposition

Let's trace what happens to a hypothetical model as it moves from FULL to SHARDED:

FULL: P̄ = 90

A = 95, U = 10. Best run scores 95, worst run scores 85. Tight, consistent performance. The benchmark number (90) is trustworthy.

↓ add multi-turn underspecification

SHARDED: P̄ = 60

A = 80, U = 50. Best run scores 80 (some aptitude loss), worst run scores 30 (catastrophic). Same model, same tasks. The 30-point drop in P̄ is ~40% from aptitude loss (95→80) and ~60% from unreliability explosion (10→50).

The key message: the model hasn't become fundamentally dumber. Its best-case is still decent (80). But its worst-case has collapsed (30). The problem is consistency, not capability.

Aptitude-Reliability Scatter Plot

Each dot is a model. X-axis: aptitude (90th percentile). Y-axis: unreliability (90th-10th gap). Toggle between FULL (tight cluster, high aptitude, low unreliability) and SHARDED (scattered, high unreliability for all). The transformation is dramatic.

Implications for LLM Builders

The paper's call to action is specific: LLM builders should optimize for three simultaneous objectives:

Maintain aptitude parity: A^SHARDED ≈ A^FULL — the model should be equally capable in multi-turn as single-turn.
Low multi-turn unreliability: U₁₀⁹⁰ < 15 in SHARDED settings — consistent performance regardless of conversational trajectory.
Achieve this at standard temperature: T = 1.0, not T = 0 — because real users interact at default temperature.

Currently, no model in the study meets all three criteria. This represents a clear research direction.

In single-turn settings, smarter models are also more reliable. In multi-turn settings:

The same relationship holds — smarter models are still more reliable Smarter models are LESS reliable All models converge to the same high unreliability regardless of aptitude — the aptitude-reliability correlation disappears

Chapter 7: Why They Get Lost

We've established THAT models get lost. Now the crucial question: WHY? The paper identifies four specific failure modes by analyzing the conversation transcripts. Each one compounds the others, creating a vicious cycle that makes recovery nearly impossible.

Failure Mode 1: Premature Answer Attempts

The most damaging pattern: when the model receives the first shard (which is intentionally underspecified), it immediately tries to generate a complete answer instead of asking for clarification. It fills in the gaps with assumptions.

Example with the code task: the user says "Write me a function to check something about transactions." A good response would be: "What should the function check? What format are the transactions?" Instead, the model writes:

premature answer (bad)
def check_transactions(transactions):
    # I'll assume you want the total sum
    return sum(transactions)  # WRONG ASSUMPTION

Now the model has committed to a direction. When the user later reveals "return True if the balance ever goes negative," the model tries to modify its existing (wrong) solution rather than starting fresh. It patches instead of rebuilding.

The premature commitment trap: Once the model has generated a concrete answer, it's psychologically anchored to that structure. Subsequent turns become "edits to the draft" rather than "fresh solutions incorporating all information." The model is doing incremental patching on a flawed foundation instead of full reconstruction. This is the core mechanism of getting lost.

Failure Mode 2: Bloated, Self-Referencing Responses

As the conversation progresses, responses get longer and longer. The model starts referencing its own prior answers: "Building on my previous solution..." or "As I mentioned earlier..." Each response carries forward assumptions from earlier turns, even if those assumptions were wrong.

The paper finds that response length correlates with error rate. Longer responses contain more unverified assumptions that compound. Reasoning models (o3, DeepSeek-R1) are particularly susceptible — their responses are 33% longer on average, giving more surface area for incorrect assumptions.

Failure Mode 3: Loss-of-Middle-Turns

Just as the "Lost in the Middle" phenomenon (Liu et al., 2024) shows LLMs neglecting information in the middle of long contexts, this paper discovers that LLMs overly weight the first and last turns of a conversation while neglecting middle turns.

Turn 1 (the initial, underspecified request) anchors the model's understanding. The last turn (often a final clarification) gets attention because it's most recent. But shards revealed in turns 2, 3, 4... are partially forgotten or insufficiently integrated. The model's final answer often satisfies the first and last requirements while missing middle ones.

The middle-turn problem: This is related to but distinct from the "lost in the middle" effect in long-context processing. There, the issue is positional: information at position 5000 in a 10000-token context gets less attention. Here, the issue is temporal: information revealed in conversation turn 3 of 6 gets less weight than turns 1 and 6. The mechanism might differ (context window position vs conversational primacy/recency bias), but the outcome is the same: information in the middle gets lost.

Failure Mode 4: Verbose Assumption Generation

When faced with underspecified instructions, models don't simply say "I don't know." They generate detailed assumptions about unspecified details, present them confidently, and build solutions on top of those assumptions.

"I'll assume the input is a list of floats." "I'll assume you want the function to also handle empty lists." "I'll assume the database uses PostgreSQL." Each assumption is a potential error source, and once stated, the model treats its own assumption as a requirement from the user.

The Failure Cascade Simulator

Watch the four failure modes compound across turns. Each bar shows the model's response quality. Red sections are assumptions. Orange sections are self-references. The "correct information" shrinks as errors accumulate. Use the slider to control the model's assumption rate.

Assumption Rate 60%

The Vicious Cycle

These four failure modes don't operate in isolation. They form a reinforcing loop:

Turn 1: Underspecified

User provides partial info. Model SHOULD ask for clarification.

↓ but instead...

Premature Answer

Model generates a full solution with assumptions baked in (FM1). Response is long and verbose (FM4).

↓ next turn reveals more info

Bloated Patching

Model tries to modify its prior answer, referencing its own assumptions (FM2). Middle-turn info gets lost (FM3).

↓ repeat for each shard

Lost in Conversation

By the final turn, the model is confused by its own prior attempts, has forgotten middle requirements, and produces an answer that partially satisfies first and last turns while violating middle ones.

What a GOOD model would do: On Turn 1, ask clarifying questions or explicitly state assumptions. On each subsequent turn, integrate the new information into a FRESH solution (not a patch). Maintain a running list of requirements gathered so far. Treat each new shard as an opportunity to rebuild, not repair. The paper's Appendix F confirms that models which ask more clarification questions in early turns tend to perform better — but the default behavior of all tested models is to answer immediately.

Why are reasoning models (o3, DeepSeek-R1) NOT better at multi-turn conversation despite using more compute?

Their longer responses (33% more tokens) contain more assumptions that compound across turns, making them MORE susceptible to the premature commitment trap — extra thinking doesn't help if the thinking introduces more wrong assumptions Their reasoning chains are too slow for multi-turn They were not trained on multi-turn data

Chapter 8: Can We Fix It?

The paper doesn't just diagnose the problem — it tests several potential solutions. The results are sobering: nothing works well enough. But each failed fix teaches us something about why multi-turn degradation is so stubborn.

Fix Attempt 1: RECAP (Agent-Style Summary)

Idea: After all shards are revealed, add a final "recap" turn. The user tells the model: "Your prior solutions were incorrect. Here are all the requirements as a single list: [all shards concatenated]. Please try again from scratch."

Result: Improves over SHARDED but falls short of FULL or CONCAT. The model is already too confused by its prior attempts. Even with a clean consolidated instruction, it tends to patch its existing answer rather than starting fresh. The conversation history is a poisoned well.

Why RECAP fails: When you tell a model "your previous answers were wrong, try again with this full spec," it doesn't truly start over. It still has all its prior attempts in context. Those prior attempts influence the new response, even when explicitly told to ignore them. The model can't fully "un-see" its earlier work.

Fix Attempt 2: SNOWBALL (Cumulative Repetition)

Idea: At each turn, repeat all prior shards along with the new one. Turn 3 = Shard 1 + Shard 2 + Shard 3. The model always has the full context so far.

Result: Better than RECAP. Offers 15–20% improvement over SHARDED. But still well below FULL. Why? Because the model's early responses (made with incomplete information) are still in context and still exert influence.

Fix Attempt 3: System Prompt Hint

Idea: Add an explicit system prompt: "This conversation will likely be multi-turn and the user's instructions may be underspecified. Wait for all information before generating a final answer."

Result: +1% average improvement across tasks. Basically nothing. The model ignores the hint and generates premature answers anyway. Default behavior is deeply ingrained.

Fix Attempt 4: Lower Temperature

Idea: Reduce temperature to decrease randomness. T = 0.5 or T = 0.0 should reduce unreliability.

Result: Works for single-turn (unreliability drops). Does NOT work for multi-turn. Even at T = 0.0, multi-turn unreliability remains ~30%. The randomness in multi-turn doesn't come from the model's sampling — it comes from cascading effects of early conversational choices. The user simulator's variation in shard phrasing and ordering creates different trajectories that T = 0 can't eliminate.

The cascading source of randomness: In single-turn, all randomness is in the model's sampling. T = 0 kills it. In multi-turn, randomness enters at every turn: the user simulator's phrasing varies, the model's response to that phrasing varies, the user's next shard selection varies. Even with deterministic decoding (T = 0), the input itself varies turn-to-turn. You can't eliminate multi-turn unreliability by lowering temperature because the randomness is environmental, not internal.

What DOES This Tell Us?

The failure of all four fixes points to a fundamental conclusion: multi-turn reliability cannot be bolted on. It must be trained in.

The paper's recommendations for LLM builders:

Train for multi-turn robustness directly. Include sharded/underspecified conversations in training data. Reward models that ask for clarification instead of assuming.
Optimize reliability as a first-class metric. Current training (RLHF, DPO) optimizes for single-turn preference. Multi-turn reliability should be an explicit training objective.
Penalize premature commitment. Models should be trained to distinguish "I have enough info to answer" from "I'm going to guess and hope." The cost of a premature wrong answer should outweigh the cost of asking one more clarifying question.

For users, the paper offers two practical workarounds:

"If time allows, try again." Start a new conversation with the same information. A fresh conversation often yields better results than persisting with a model that's gotten lost.
"Consolidate before retrying." Ask the LLM to summarize all user requirements from the conversation into a single instruction. Start a new conversation with that consolidated instruction.

Fix Strategy Comparison

Compare all remediation strategies side by side. Each bar shows average performance. The green line marks FULL (single-turn baseline). Nothing reaches it.

Why does lowering temperature to T=0 reduce unreliability in single-turn but NOT in multi-turn?

T=0 is too slow for multi-turn In single-turn, all randomness comes from the model's sampling (T=0 eliminates it). In multi-turn, randomness enters through varying user messages at each turn, creating different conversational trajectories that T=0 can't control T=0 causes the model to repeat itself in multi-turn

Chapter 9: Connections

Cheat Sheet: Every Key Concept

Concept	Definition	Key Number
Sharding	Breaking a fully-specified instruction into conversational fragments ("shards") that jointly deliver the same information	3–8 shards per task
FULL	Single-turn baseline: original instruction, one message	~80% avg performance
CONCAT	Control: all shards as bullet points in one turn	~95% of FULL
SHARDED	Multi-turn: one shard per turn	~61% of FULL (−39%)
Aptitude (A)	90th percentile score — best-case capability	Drops 16% in multi-turn
Unreliability (U)	90th − 10th percentile gap — consistency	Increases 112% in multi-turn
Lost in Conversation	When an early wrong turn cascades into unrecoverable error	Even 2 turns trigger it
RECAP	Agent-style: final turn restating everything	Helps some, not enough
SNOWBALL	Cumulative: repeat all prior shards each turn	+15–20% improvement

What the Paper Doesn't Say (But You Should Know)

The simulation is idealized. Real users are messier than GPT-4o-mini simulating a user. They make typos, change their minds, provide contradictory information. The paper's 39% degradation is likely an underestimate of real-world multi-turn degradation.
Only analytical tasks. The paper doesn't test creative writing, brainstorming, or open-ended exploration — tasks where multi-turn interaction is perhaps most natural. Whether models get lost in creative multi-turn is an open question.
English only. Multi-turn degradation in other languages is untested.
No multimodal tasks. What happens when shards include images, code outputs, or tool results?
The scoring favors multi-turn. The "best of N attempts" scoring gives multi-turn a structural advantage. Without this, the degradation would be even worse.

Related Phenomena

Lost in the Middle (Liu et al., 2024): LLMs neglect information in the middle of long contexts. This paper's "loss-of-middle-turns" is the temporal analogue — information from middle conversation turns gets underweighted.

The FlipFlop Effect (Laban et al., 2023): Challenging an LLM's answer causes it to flip to the opposite answer, even when it was originally correct. Related to the premature commitment problem: models are simultaneously too eager to commit AND too eager to abandon commitments when challenged.

Prompt Sensitivity (Li et al., 2023): LLMs are sensitive to minor prompt variations in single-turn. This paper shows the problem is vastly amplified in multi-turn because each turn is a new "prompt variation" that cascades.

Agent Reliability: LangChain, AutoGen, and other agent frameworks run multi-turn conversations by design. This paper suggests that the underlying model's multi-turn unreliability is a fundamental bottleneck that no amount of agent engineering can fully overcome.

Open Research Directions

Multi-turn RLHF: Current preference optimization uses single-turn comparisons. Training on multi-turn conversations with sharded instructions could directly address the reliability gap.
Clarification-seeking behavior: Rewarding models for asking clarifying questions instead of assuming. The paper shows this is rare in current models but associated with better outcomes.
Context management: Methods to help models "restart" from a clean state mid-conversation, discarding incorrect prior attempts.
Sharded benchmarks at scale: The paper calls for all benchmark providers to release sharded variants alongside fully-specified versions. This would make multi-turn evaluation routine rather than exceptional.

The bottom line: Every LLM benchmark number you see is a single-turn, fully-specified score. The real-world performance — where users provide information gradually, change requirements, and have multi-turn conversations — is approximately 39% worse. Until LLM builders explicitly optimize for multi-turn reliability, this gap will persist. The lab numbers are a ceiling, not a floor.

What is the most impactful research direction suggested by this paper's findings?

Training models to optimize for multi-turn reliability directly (not just aptitude), since current training paradigms that improve single-turn performance don't transfer to multi-turn reliability Building better agent frameworks Increasing model context length