When LLMs take a wrong turn in a conversation, they get lost and do not recover. A 39% average performance drop across 15 models and 200,000+ simulated conversations.
You open ChatGPT. You type: "Write me a Python function that takes a list of bank transactions and returns True if the balance ever goes below zero." The model nails it. 90%+ accuracy on benchmarks like HumanEval. Impressive.
Now imagine a more realistic scenario. You type: "I need a function to check something about transactions." The model asks what. You say: "It's a list of integers, deposits and withdrawals." A turn later: "Start at balance zero." Another turn: "Return True if it ever dips negative." Same information, same task — just delivered gradually, the way humans actually talk.
The model fails. Not sometimes — consistently. It makes assumptions in early turns ("I'll assume you want the final balance!"), generates a premature solution, then when you correct it, it tries to patch the wrong answer instead of starting fresh. By the end of the conversation, it's confused by its own prior attempts.
This paper asks a simple question: how much performance do LLMs lose when the same task is spread across multiple turns instead of given all at once?
The answer is devastating: 39% average degradation across 15 state-of-the-art models and 6 diverse tasks. Every single model — from 8B open-weights to Gemini 2.5 Pro — gets substantially worse. And the degradation is not a gentle slope. Even splitting the instruction into just two turns causes most of the damage.
Each bar shows average performance on the same tasks. Left: all information in one message. Right: same information spread across turns. Click a model to see its breakdown.
The most unsettling finding: the strongest models lose just as much as the weakest. GPT-4.1, Claude 3.7 Sonnet, and Gemini 2.5 Pro all suffer 30–40% degradation. Being smarter doesn't make you immune to getting lost in conversation.
Even more alarming: reasoning models like o3 and DeepSeek-R1, which use extra "thinking" tokens, fare no better. Their longer responses actually make things worse — they contain more assumptions that compound across turns.
Before this paper, multi-turn evaluation existed — but it was episodic. Each turn in the conversation was a self-contained subtask. "Translate this sentence." "Now summarize this paragraph." The turns happen to be in the same conversation, but each can be graded independently. This design overestimates LLM capability because the model never has to integrate information across turns.
The key insight of this paper is a method called sharded simulation: take a single, fully-specified instruction from an existing benchmark, break it into pieces ("shards"), and reveal one shard per turn. The model must fuse the shards across the conversation to solve the original task.
Previous multi-turn benchmarks had a fatal flaw: the multi-turn tasks were different from the single-turn ones. You couldn't tell if performance dropped because multi-turn is harder or because the tasks themselves were harder. Sharding eliminates this confound.
The original instruction and its sharded version deliver exactly the same information. The shards are just pieces of the original, rephrased to be natural conversational turns. If you concatenate all shards, you get back the original meaning (this is verified by a control condition called CONCAT that averages 95.1% of single-turn performance).
So when SHARDED performance drops to 61% of single-turn, we know it's not because information was lost. It's because the delivery method — gradual revelation across turns — causes the model to stumble.
Watch a fully-specified instruction get sharded into conversational turns. Click "Shard It" to see the transformation, then "Simulate" to watch the model struggle.
The paper defines five conversation types, all derived from the same sharded instructions:
How do you take a fully-specified instruction and turn it into a realistic multi-turn conversation? You can't just randomly chop sentences apart — the shards need to feel like natural things a human would say, one turn at a time. This chapter traces the sharding pipeline end to end.
The paper defines five formal properties that every sharded instruction must satisfy. Let's work through them concretely. Consider this original instruction:
original "Write a Python function below_zero(ops) that takes a list of deposits and withdrawals starting from balance 0, and returns True if the balance ever goes below zero, else False."
The paper decomposes this into atomic content units (ACUs): the intent I plus the clarifying details (c1, ..., cm). For our example:
The authors developed a semi-automatic process combining LLM assistance with human quality control:
Let's trace the full pipeline for the math task. Original instruction from GSM8K:
original "Josh decides to try flipping a house. He buys a house for $80k and then puts in $50k in repairs. This increased the value of the house by 150%. How much profit did he make?"
Step 1 (Segmentation): The LLM identifies 4 segments: (1) Josh flipped a house, (2) bought for $80k, (3) spent $50k on repairs, (4) value increased 150%, (5) find profit.
Step 2 (Rephrasing): Shard 1 = "My friend Josh sold his home. I want to know how much profit he made." Shard 2 = "He bought it for $80,000." Shard 3 = "He spent $50k on repairs." Shard 4 = "The house value increased by 150%." Shard 5 = "That's all I know. What's his profit?"
Step 3 (Verification): Run FULL and CONCAT simulations. CONCAT performance is 95% of FULL. Pass.
Step 4 (Inspection): Human reviewer confirms: each shard is a natural conversational turn, order of shards 2–4 doesn't matter, intent is clear from shard 1.
See how a fully-specified instruction flows through each pipeline stage. Use the buttons to step through. The verification scores are simulated.
| Task | Source Benchmark | Instructions | Avg Shards | Evaluation Metric |
|---|---|---|---|---|
| Code | HumanEval + LiveCodeBench | 90–120 | 4–6 | Functional Accuracy |
| Database | Spider (Text-to-SQL) | 90–120 | 3–5 | Functional Accuracy |
| Actions | BFCL (Function Calling) | 90–120 | 3–5 | Exact Match |
| Math | GSM8K | 90–120 | 4–6 | Exact Match |
| Data-to-Text | ToTTo | 90–120 | 3–4 | BLEU |
| Summary | Summary of a Haystack | 90–120 | 6–8 | Coverage + Citation |
Six tasks spanning programming (Code, Database, Actions) and natural language (Math, Data-to-Text, Summary). Deliberately diverse: if the effect only appeared in code tasks, it might be a code-specific phenomenon. But it appears everywhere.
Sharding gives us the conversation script. But who plays the user? Who decides if the model's response is an answer attempt or a clarification? Who scores the answers? The simulation engine is a three-actor loop, all backed by LLMs.
The simulation runs as a loop with three LLM-backed roles. Only one — the assistant — is the model under test. The other two are infrastructure:
Following Herlihy et al. (2024), the system classifies every assistant response into one of seven categories. This matters because only answer attempts get scored:
| Strategy | What It Looks Like | Scored? |
|---|---|---|
| Clarification | "Could you specify whether you want deposits only?" | No |
| Refusal | "I need more information before I can help." | No |
| Hedging | "I'm not sure exactly what you need, but here's a guess..." | No |
| Interrogation | "What programming language? What's the input format?" | No |
| Discussion | "That's an interesting problem. Here are some approaches..." | No |
| Missing | "I can try, but I'm missing details about X." | No |
| Answer Attempt | "Here's the function: def below_zero(ops):..." | Yes |
A crucial design choice: the model gets credit for its best answer across all turns. In a conversation with N shards, the model can make up to N answer attempts, and its final score is the maximum. This actually favors the multi-turn setting — the model gets multiple shots. Despite this advantage, multi-turn performance still craters.
When the strategy classifier detects an answer attempt, the answer extractor isolates just the answer (the code block, the SQL query, the number) from the surrounding text. This is important: the model might write a correct function wrapped in two paragraphs of incorrect reasoning. The extractor pulls out the code and scores only that.
Each task uses its original benchmark's metric:
Binary metrics (pass/fail) are mapped to 0 or 100 so all tasks share the same scale. This enables cross-task averaging.
How reliable is the automated simulation? The authors manually inspected 200 SHARDED conversations across four tasks:
| Component | Accuracy |
|---|---|
| Shard Fully Revealed (user sim) | 96.0% |
| Shard Contextualized (user sim) | 98.4% |
| Strategy Classification | 95.2% |
| Answer Extraction | 97.0% |
| Overall Success | 97.8% |
Less than 2% of errors disfavored the assistant. The simulation is not perfect, but its errors are roughly symmetric — they don't systematically make the multi-turn setting look worse than it is.
Watch a multi-turn conversation unfold. The user reveals shards, the assistant responds, the system classifies each response. Click "Next Turn" to step through.
A single number — average performance — hides a critical distinction. When a model drops from 90% to 60%, is it because it got less capable (can't solve the hard problems anymore) or because it became less reliable (solves them sometimes but not others)? This paper introduces a metric framework that separates the two.
For each instruction, the paper runs N = 10 independent simulations (temperature T = 1.0). This produces a set of scores S = {S1, S2, ..., S10}, each from a different random seed. From these 10 scores, three metrics emerge:
This is the standard metric — the mean score across all runs. It's what most benchmarks report. But it conflates capability and consistency.
Aptitude is the 90th percentile score — the model's best-case performance. Think of it as: "When everything goes right, how well does this model do?" If a student scores 95 on their best exam, their aptitude is 95, regardless of whether they bombed other exams.
Unreliability is the interpercentile range — the gap between the best and worst outcomes. If the same model scores 95 on its best run and 40 on its worst, its unreliability is 55 points. A reliable model has U close to 0 (consistent scores). An unreliable model has high U (wildly varying scores).
Let's build intuition with concrete numbers. Imagine a model on 10 instructions, 10 runs each:
FULL setting (single-turn): The model scores 100 on 9 instructions and 0 on 1 instruction, consistently across all 10 runs. P̄ = 90, A = 100, U = 0 for 9 instructions; P̄ = 0, A = 0, U = 0 for 1 instruction. Corpus averages: P̄ = 90, A = 90, U = 0. High aptitude, perfectly reliable.
SHARDED setting (multi-turn): Same model, same instructions. Now for each instruction, it scores 100 on some runs and 0 on others (random due to stochastic generation + cascading errors from early turns). Across the corpus: P̄ = 60, A = 76 (it CAN get high scores), U = 50 (but scores swing wildly between runs). The 30-point drop in P̄ is mostly from unreliability (U went from 0 to 50), not from aptitude (A only dropped from 90 to 76).
Each box plot represents 10 simulations of one instruction. The top of the box is aptitude (90th percentile). The box height is unreliability (gap between 90th and 10th). Drag the "Multi-turn degradation" slider to see how unreliability grows.
Most benchmarks run models once with temperature 0 (greedy decoding). This paper deliberately uses T = 1.0 and runs 10 times. Why?
At T = 0, you measure aptitude only. The model always generates its single most likely response. But real users interact at default temperature (T = 1.0), and each conversation plays out differently. By running 10 times, the paper captures the variance in outcomes — which turns out to be the dominant factor in multi-turn degradation.
The paper also runs a temperature ablation (T = 1.0, 0.5, 0.0). Finding: lower temperature improves reliability in single-turn but not in multi-turn. Even at T = 0.0, multi-turn unreliability remains ~30% because the stochastic user simulator introduces variation in early turns, and the model's deterministic response to Turn 1's variation cascades into different (often incorrect) trajectories.
200,000+ simulated conversations. 15 LLMs. 6 tasks. 3 simulation types. The results are in, and they tell a consistent story: every model gets lost in multi-turn conversation.
Comparing FULL (single-turn) to SHARDED (multi-turn) performance across all models and tasks:
The data from Table 1 reveals several surprising patterns:
| Model | FULL | CONCAT | SHARDED | Degradation |
|---|---|---|---|---|
| Llama 3.1-8B | 42 | 39 | 25 | −40% |
| OLMo-2-13B | 39 | 34 | 18 | −54% |
| Claude 3 Haiku | 55 | 51 | 34 | −38% |
| GPT-4o-mini | 68 | 66 | 41 | −40% |
| Llama 3.3-70B | 68 | 67 | 42 | −38% |
| o3 (reasoning) | 77 | 73 | 47 | −39% |
| Claude 3.7 Sonnet | 81 | 78 | 46 | −43% |
| DeepSeek-R1 | 79 | 78 | 47 | −41% |
| GPT-4o | 82 | 81 | 50 | −39% |
| GPT-4.1 | 85 | 82 | 54 | −36% |
| Gemini 2.5 Pro | 87 | 85 | 57 | −34% |
(Values are approximate averages across 6 tasks, simplified from the full table.)
The degradation range is 30–54% across all 15 models. Stronger models (Gemini 2.5 Pro, GPT-4.1) have slightly smaller relative drops, but their absolute drops are just as large: going from 87 to 57 is a 30-point loss.
The two reasoning models — o3 and DeepSeek-R1 — degrade at the same rate as non-reasoning models (~39–41%). Their extra "thinking" tokens should theoretically help them strategize over the conversation. They don't.
The paper identifies a likely cause: reasoning models generate 33% longer responses on average. Longer responses contain more assumptions about unspecified details. These assumptions then get baked into subsequent turns, making the model harder to redirect.
Not all tasks degrade equally. Some models show task-specific resilience:
This task-specificity suggests that multi-turn capability is not a single skill — it's domain-dependent. A model that handles multi-turn code well might still fail at multi-turn summarization.
Interactive heatmap showing degradation patterns across models and tasks. Darker red = larger drop. Hover over cells for exact numbers. Toggle between absolute scores and percentage degradation.
Is the effect only catastrophic when instructions are sharded into many tiny pieces? The paper tests this with a gradual sharding experiment: the same 31 instructions are sharded into 2, 3, 4, ... 8 pieces.
Result: even 2 shards cause most of the damage. Going from 1 shard (FULL) to 2 shards causes a large drop. Going from 2 to 8 causes further decline, but the steepest fall is at the first split. Any conversation involving underspecification and more than one turn leads to models getting lost.
We know that SHARDED performance drops 39% on average. Chapter 4 gave us the tools to ask: where does this drop come from? Now we apply those tools to the full dataset of 200,000+ conversations.
In single-turn (FULL and CONCAT), the paper finds a clean relationship: models with higher aptitude are also more reliable. GPT-4.1 and Gemini 2.5 Pro have both the highest aptitude (A ≈ 95–100) and the lowest unreliability (U ≈ 10–15). Conversely, Llama 3.1-8B and OLMo-2-13B have the lowest aptitude AND the highest unreliability.
This makes intuitive sense: a model that truly understands the task gives the same (correct) answer regardless of random variation. A model at the edge of its capability sometimes gets it right (lucky sampling) and sometimes doesn't.
In the SHARDED setting, this comfortable relationship shatters. Here are the key numbers:
Let's trace what happens to a hypothetical model as it moves from FULL to SHARDED:
The key message: the model hasn't become fundamentally dumber. Its best-case is still decent (80). But its worst-case has collapsed (30). The problem is consistency, not capability.
Each dot is a model. X-axis: aptitude (90th percentile). Y-axis: unreliability (90th-10th gap). Toggle between FULL (tight cluster, high aptitude, low unreliability) and SHARDED (scattered, high unreliability for all). The transformation is dramatic.
The paper's call to action is specific: LLM builders should optimize for three simultaneous objectives:
Currently, no model in the study meets all three criteria. This represents a clear research direction.
We've established THAT models get lost. Now the crucial question: WHY? The paper identifies four specific failure modes by analyzing the conversation transcripts. Each one compounds the others, creating a vicious cycle that makes recovery nearly impossible.
The most damaging pattern: when the model receives the first shard (which is intentionally underspecified), it immediately tries to generate a complete answer instead of asking for clarification. It fills in the gaps with assumptions.
Example with the code task: the user says "Write me a function to check something about transactions." A good response would be: "What should the function check? What format are the transactions?" Instead, the model writes:
premature answer (bad) def check_transactions(transactions): # I'll assume you want the total sum return sum(transactions) # WRONG ASSUMPTION
Now the model has committed to a direction. When the user later reveals "return True if the balance ever goes negative," the model tries to modify its existing (wrong) solution rather than starting fresh. It patches instead of rebuilding.
As the conversation progresses, responses get longer and longer. The model starts referencing its own prior answers: "Building on my previous solution..." or "As I mentioned earlier..." Each response carries forward assumptions from earlier turns, even if those assumptions were wrong.
The paper finds that response length correlates with error rate. Longer responses contain more unverified assumptions that compound. Reasoning models (o3, DeepSeek-R1) are particularly susceptible — their responses are 33% longer on average, giving more surface area for incorrect assumptions.
Just as the "Lost in the Middle" phenomenon (Liu et al., 2024) shows LLMs neglecting information in the middle of long contexts, this paper discovers that LLMs overly weight the first and last turns of a conversation while neglecting middle turns.
Turn 1 (the initial, underspecified request) anchors the model's understanding. The last turn (often a final clarification) gets attention because it's most recent. But shards revealed in turns 2, 3, 4... are partially forgotten or insufficiently integrated. The model's final answer often satisfies the first and last requirements while missing middle ones.
When faced with underspecified instructions, models don't simply say "I don't know." They generate detailed assumptions about unspecified details, present them confidently, and build solutions on top of those assumptions.
"I'll assume the input is a list of floats." "I'll assume you want the function to also handle empty lists." "I'll assume the database uses PostgreSQL." Each assumption is a potential error source, and once stated, the model treats its own assumption as a requirement from the user.
Watch the four failure modes compound across turns. Each bar shows the model's response quality. Red sections are assumptions. Orange sections are self-references. The "correct information" shrinks as errors accumulate. Use the slider to control the model's assumption rate.
These four failure modes don't operate in isolation. They form a reinforcing loop:
The paper doesn't just diagnose the problem — it tests several potential solutions. The results are sobering: nothing works well enough. But each failed fix teaches us something about why multi-turn degradation is so stubborn.
Idea: After all shards are revealed, add a final "recap" turn. The user tells the model: "Your prior solutions were incorrect. Here are all the requirements as a single list: [all shards concatenated]. Please try again from scratch."
Result: Improves over SHARDED but falls short of FULL or CONCAT. The model is already too confused by its prior attempts. Even with a clean consolidated instruction, it tends to patch its existing answer rather than starting fresh. The conversation history is a poisoned well.
Idea: At each turn, repeat all prior shards along with the new one. Turn 3 = Shard 1 + Shard 2 + Shard 3. The model always has the full context so far.
Result: Better than RECAP. Offers 15–20% improvement over SHARDED. But still well below FULL. Why? Because the model's early responses (made with incomplete information) are still in context and still exert influence.
Idea: Add an explicit system prompt: "This conversation will likely be multi-turn and the user's instructions may be underspecified. Wait for all information before generating a final answer."
Result: +1% average improvement across tasks. Basically nothing. The model ignores the hint and generates premature answers anyway. Default behavior is deeply ingrained.
Idea: Reduce temperature to decrease randomness. T = 0.5 or T = 0.0 should reduce unreliability.
Result: Works for single-turn (unreliability drops). Does NOT work for multi-turn. Even at T = 0.0, multi-turn unreliability remains ~30%. The randomness in multi-turn doesn't come from the model's sampling — it comes from cascading effects of early conversational choices. The user simulator's variation in shard phrasing and ordering creates different trajectories that T = 0 can't eliminate.
The failure of all four fixes points to a fundamental conclusion: multi-turn reliability cannot be bolted on. It must be trained in.
The paper's recommendations for LLM builders:
For users, the paper offers two practical workarounds:
Compare all remediation strategies side by side. Each bar shows average performance. The green line marks FULL (single-turn baseline). Nothing reaches it.
| Concept | Definition | Key Number |
|---|---|---|
| Sharding | Breaking a fully-specified instruction into conversational fragments ("shards") that jointly deliver the same information | 3–8 shards per task |
| FULL | Single-turn baseline: original instruction, one message | ~80% avg performance |
| CONCAT | Control: all shards as bullet points in one turn | ~95% of FULL |
| SHARDED | Multi-turn: one shard per turn | ~61% of FULL (−39%) |
| Aptitude (A) | 90th percentile score — best-case capability | Drops 16% in multi-turn |
| Unreliability (U) | 90th − 10th percentile gap — consistency | Increases 112% in multi-turn |
| Lost in Conversation | When an early wrong turn cascades into unrecoverable error | Even 2 turns trigger it |
| RECAP | Agent-style: final turn restating everything | Helps some, not enough |
| SNOWBALL | Cumulative: repeat all prior shards each turn | +15–20% improvement |
Lost in the Middle (Liu et al., 2024): LLMs neglect information in the middle of long contexts. This paper's "loss-of-middle-turns" is the temporal analogue — information from middle conversation turns gets underweighted.
The FlipFlop Effect (Laban et al., 2023): Challenging an LLM's answer causes it to flip to the opposite answer, even when it was originally correct. Related to the premature commitment problem: models are simultaneously too eager to commit AND too eager to abandon commitments when challenged.
Prompt Sensitivity (Li et al., 2023): LLMs are sensitive to minor prompt variations in single-turn. This paper shows the problem is vastly amplified in multi-turn because each turn is a new "prompt variation" that cascades.
Agent Reliability: LangChain, AutoGen, and other agent frameworks run multi-turn conversations by design. This paper suggests that the underlying model's multi-turn unreliability is a fundamental bottleneck that no amount of agent engineering can fully overcome.