Prompt Engineering — From Absolute Zero to Mastery

Roadmap

What You'll Master

00 Why Prompting Matters
01 System Prompts
02 Few-Shot Examples
03 Chain-of-Thought
04 Structured Output
05 Prompt Testing & Iteration
06 Common Anti-Patterns
07 Advanced Techniques
08 Interactive Prompt Lab
09 Connections

Chapter 00

Why Prompting Matters

You ask a language model: "Can you help me write an email?" It responds with a short generic template. Mildly useful. Now you ask: "You are a professional business writer. Write a concise follow-up email to a client named Sarah who hasn't responded in a week. Tone: warm but direct. End with a specific call to action." The output is something you'd actually send.

Same model. Same weights. Same hardware. The only difference is the words you typed. That's the power and the puzzle of prompt engineering.

The Model Is a Distribution, Not a Database

A language model doesn't look up answers — it samples from a probability distribution over possible next tokens, conditioned on everything you've given it so far. Your prompt is the conditioning signal. Change the conditioning, and you sample from a completely different region of that distribution.

Think of the model as a massive space of possible "modes" of response. A vague prompt leaves the model wandering across many modes — professor, comedian, Wikipedia editor, customer service bot — averaging them into something generic. A precise prompt collapses that distribution onto the specific mode you actually want.
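The conditioning idea can be made concrete with a toy calculation. The sketch below is not a language model: the four response "modes" and the logit values are invented purely for illustration. It only shows that sharper conditioning concentrates probability mass on one mode, which appears as lower entropy.

```python
import math

# Hypothetical response modes, taken from the analogy above
MODES = ["professor", "comedian", "wikipedia_editor", "support_bot"]

def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    """Shannon entropy in bits; lower means more concentrated."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Invented logits: a vague prompt barely favors any mode,
# a precise prompt strongly favors one of them.
vague_logits = [0.2, 0.1, 0.3, 0.2]
precise_logits = [0.2, 0.1, 0.3, 5.0]

vague = softmax(vague_logits)
precise = softmax(precise_logits)

print(dict(zip(MODES, (round(p, 3) for p in vague))))
print(dict(zip(MODES, (round(p, 3) for p in precise))))
print("entropy (vague):  ", round(entropy_bits(vague), 3))
print("entropy (precise):", round(entropy_bits(precise), 3))
```

The vague distribution is close to uniform (entropy near 2 bits for four modes), while the precise one puts almost all of its mass on a single mode.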

The Core Insight

A language model is not a question-answering machine. It's a conditional text generator. Every token in your prompt shifts the distribution of what comes next. Prompt engineering is the practice of designing that shift intentionally.

The Wild Variance Problem

This conditioning sensitivity is dramatic. In controlled experiments, rephrasing the same factual question can change answer accuracy by 30 percentage points on the same model. Adding "Let's think step by step" to a math problem can take a model from 40% to 80% accuracy: no retraining, no new data, just five words.

That variance is both the problem and the opportunity. The problem: naive prompting produces inconsistent, hard-to-predict results. The opportunity: systematic prompting can unlock dramatically better performance, reliably and repeatably.

Prompting Is Not Magic

Prompting cannot make a model know things it wasn't trained on. It cannot make a small model perform like a large one. It cannot fix factual errors baked into the weights. What it CAN do is reliably unlock the model's existing capabilities — capabilities that bad prompts leave inaccessible.

Why "Just Try It" Fails

Most people approach prompting the way they approach cooking without a recipe: add things intuitively, taste, adjust, hope. This works sometimes. But it doesn't scale. When you have a production app making millions of calls, you can't iterate manually. You need a systematic process: understand the model's conditioning mechanisms, apply known techniques, measure outcomes, iterate.

[Interactive figure: Prompt Variance Simulator. Shows how each prompt style shifts the model's output distribution; width = probability mass, position = response quality.]

A language model produces different outputs for rephrased versions of the same question. The most accurate reason is:

- The model has random bugs that cause inconsistency
- Different phrasings activate different database entries
- Each phrasing shifts the probability distribution the model samples from
- The model selects outputs randomly regardless of the prompt

Chapter 01

System Prompts

When you use a chat API, you don't just send a user message — you send a system prompt first. The system prompt is a special instruction block that sets the context before any conversation begins. It's the most important 200 tokens in your application.

Think of it as giving your AI a job description on their first day. It tells them: who they are, what they're here to do, how to behave, what to avoid, and what format to produce. Without it, the model defaults to its training distribution — a helpful generalist that might ramble, hallucinate confidence, or misunderstand your use case.

The Four Elements of a Strong System Prompt

Every effective system prompt addresses these four things, usually in this order:

Structure

1. Role — who is the model playing?

Sets the persona, expertise, and perspective. "You are a senior data analyst at a healthcare company" constrains the model to think like that person.

Structure

2. Task — what is it here to do?

Describes the purpose of this deployment. "Your job is to extract structured medication information from clinical notes." Prevents scope creep.

Structure

3. Constraints — what it must NOT do

Boundaries prevent hallucination, off-topic responses, and format violations. "Never invent medication names. If you can't extract a field, return null."

Structure

4. Format — how to structure the output

Specifies the response shape before the model starts generating. "Respond only in JSON with the following keys: ..." Asking for format at the end often fails — the model is already mid-generation.
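One way to keep the four elements explicit and in a fixed order is to assemble the system prompt from named parts. This is a minimal sketch; the `SystemPromptSpec` class and `build_system_prompt` function are hypothetical helpers, not part of any library, and the example strings are the ones quoted above.

```python
from dataclasses import dataclass, field

@dataclass
class SystemPromptSpec:
    """Hypothetical container for the four elements of a system prompt."""
    role: str
    task: str
    constraints: list = field(default_factory=list)
    output_format: str = ""

def build_system_prompt(spec):
    """Assemble the elements in the order: role, task, constraints, format."""
    parts = [spec.role.strip(), spec.task.strip()]
    if spec.constraints:
        rules = "\n".join(f"- {c}" for c in spec.constraints)
        parts.append("Rules:\n" + rules)
    if spec.output_format:
        parts.append("Format: " + spec.output_format.strip())
    return "\n\n".join(parts)

spec = SystemPromptSpec(
    role="You are a senior data analyst at a healthcare company.",
    task="Your job is to extract structured medication information from clinical notes.",
    constraints=[
        "Never invent medication names.",
        "If you can't extract a field, return null.",
    ],
    output_format="Respond only in JSON.",
)
print(build_system_prompt(spec))
```

Building the prompt from a structure like this makes a missing element visible in code review, and lets each element be changed and tested separately.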

A Real System Prompt: Sentiment Classifier

Here's a production-grade system prompt for classifying customer review sentiment. Notice how each element is explicit, not implied:

System Prompt: Sentiment Classifier
SYSTEM:
You are a sentiment analysis engine for e-commerce reviews. Your only job is to classify each review as POSITIVE, NEGATIVE, or NEUTRAL.

Rules:
- Read the entire review before deciding
- POSITIVE: overall satisfaction, would recommend, happy with purchase
- NEGATIVE: dissatisfied, would not recommend, returns/refunds mentioned
- NEUTRAL: mixed or purely factual, no clear emotional direction
- If the review mentions a product defect alongside praise, classify NEGATIVE
- Respond with EXACTLY one word: POSITIVE, NEGATIVE, or NEUTRAL
- Do not explain your reasoning unless asked

Every rule exists for a reason. "Read the entire review before deciding" prevents snap judgments on openers like "This product is terrible — wait no, it fixed itself!" The defect rule handles the common "beautiful but broken" pattern. The format instruction ("EXACTLY one word") makes the output machine-parseable.
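"Machine-parseable" means the calling code can check the reply mechanically. A small sketch of such a check follows; `parse_label` is a hypothetical helper, and its tolerance for case and a trailing period is an assumption you may prefer to tighten.

```python
VALID_LABELS = {"POSITIVE", "NEGATIVE", "NEUTRAL"}

def parse_label(raw):
    """Return the label if the reply is exactly one valid word, else None."""
    # Assumption: surrounding whitespace, case, and a trailing period are tolerated
    cleaned = raw.strip().rstrip(".").upper()
    return cleaned if cleaned in VALID_LABELS else None

print(parse_label("POSITIVE"))
print(parse_label(" negative.\n"))
print(parse_label("I think this review is positive because..."))
```

A `None` result signals a format violation that can be logged, retried, or counted as a failure in evaluation.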

Why Position Matters

Language models do not attend to every part of a long prompt equally. They tend to follow instructions near the beginning and the end of the context more reliably than instructions in the middle, and recent tokens carry somewhat more weight than distant ones. The practical implication: don't bury your most critical constraints in the middle of a long system prompt. Critical rules should appear near the top or be repeated at both top and bottom.

Worked Example — Good vs. Bad System Prompt

Bad: "You are a helpful assistant. Help users with their questions. Be friendly. Always respond in markdown. Don't make things up. Keep responses under 200 words. You work for Acme Corp."

Why bad: "Be friendly" and "don't make things up" can conflict in edge cases, where the friendliest reply is a guess. "Always respond in markdown" is buried among other constraints. The role is generic, there are no task specifics, and there are no examples of acceptable responses.

Good: "You are Acme Corp's customer support assistant. Your job: answer questions about Acme products using only information in the context I provide. Format: plain text, under 150 words. If the context doesn't answer the question, say exactly: 'I don't have information about that — please contact support@acme.com'"

Why good: Role, task, format, and the critical fallback behavior are all explicit. The exact fallback string prevents hallucination and is measurable in tests.
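Because the fallback is an exact string, its use can be counted. The sketch below assumes you have collected a list of model replies; the helper names and the sample product reply are hypothetical. It normalizes whitespace and typographic dashes and apostrophes before comparing, which is a design choice so that harmless rendering differences do not hide a fallback.

```python
def normalize(text):
    """Collapse whitespace and map typographic dashes/apostrophes to ASCII."""
    table = {"\u2014": "-", "\u2013": "-", "\u2019": "'"}
    for fancy, plain in table.items():
        text = text.replace(fancy, plain)
    return " ".join(text.split())

FALLBACK = normalize(
    "I don't have information about that - please contact support@acme.com"
)

def is_fallback(response):
    """True if the reply is the agreed fallback string."""
    return normalize(response) == FALLBACK

def fallback_rate(responses):
    """Fraction of replies that are the fallback string."""
    if not responses:
        return 0.0
    return sum(is_fallback(r) for r in responses) / len(responses)

replies = [
    "The Acme X200 ships with a two-year warranty.",  # invented sample reply
    "I don't have information about that \u2014 please contact support@acme.com",
]
print(fallback_rate(replies))
```

Tracking this rate over a fixed set of out-of-scope questions gives a direct measure of whether the model declines when it should.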

[Interactive figure: System Prompt Anatomy. Shows each layer's contribution to output quality.]

You're building a production sentiment classifier. The model sometimes responds with "I think this review is positive because..." instead of just "POSITIVE". The most effective fix is:

- Use a larger model
- Add to the system prompt: "Respond with EXACTLY one word. No explanation."
- Ask the model to be more concise in the user message
- Fine-tune the model on examples of correct responses

Chapter 02

Few-Shot Examples

Here's a paradox: three examples often outperform three paragraphs of instructions. Not because examples are magic — but because they communicate information that words struggle to capture: the exact output format, the level of detail, the tone, the edge-case handling.

This technique is called few-shot prompting (a form of in-context learning). Instead of only telling the model what to do, you show it a handful of (input, output) pairs. The model infers the pattern and applies it to the new input.

Zero-Shot vs. Few-Shot

Terminology

Zero-shot prompting

You describe the task in words and give no examples. The model must generalize from the description alone. Works for simple, unambiguous tasks. Fails for tasks with subtle output requirements.

Terminology

Few-shot prompting

You provide K examples of (input → output) before the real input. Typical range: 3–8 examples. The model treats these as demonstrations and mimics the pattern. Especially powerful for formatting and tone.

A Real Few-Shot Prompt: Entity Extraction

Suppose you want to extract company names from news headlines. Zero-shot instructions alone struggle with tricky cases (abbreviations, partial names, possessives). Few-shot examples handle them elegantly:

Few-Shot Prompt: Entity Extraction
SYSTEM: Extract company names from news headlines. Return a JSON array. If no companies, return []. Use the official full name.

USER: "Apple's revenue beats estimates despite iPhone slowdown"
ASSISTANT: ["Apple Inc."]

USER: "TSMC and Samsung race to 2nm chip production"
ASSISTANT: ["Taiwan Semiconductor Manufacturing Company", "Samsung Electronics"]

USER: "Tech stocks fall as Fed signals more rate hikes"
ASSISTANT: []

USER: "Elon Musk's xAI secures $6B in latest funding round"
ASSISTANT:

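The three examples teach the model: (1) use the official full name, not the ticker or abbreviation; (2) when no company is mentioned, return an empty array, not null; (3) extract the company from a possessive form such as "Apple's". These lessons would take multiple paragraphs of instructions, and would still be ambiguous. Note that none of the demonstrations contains a person's name, so the final headline ("Elon Musk's xAI") tests a case the examples do not cover; if ignoring person names matters, add a demonstration that shows it.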
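In a chat API, the demonstrations above are usually sent as alternating user and assistant turns ahead of the real input. A minimal sketch of building that message list follows; `build_messages` is a hypothetical helper, and the role/content dictionary shape is the common chat format rather than any one provider's requirement.

```python
import json

SYSTEM = (
    "Extract company names from news headlines. Return a JSON array. "
    "If no companies, return []. Use the official full name."
)

# (headline, expected output) pairs from the prompt above
EXAMPLES = [
    ("Apple's revenue beats estimates despite iPhone slowdown", ["Apple Inc."]),
    ("TSMC and Samsung race to 2nm chip production",
     ["Taiwan Semiconductor Manufacturing Company", "Samsung Electronics"]),
    ("Tech stocks fall as Fed signals more rate hikes", []),
]

def build_messages(headline):
    """Return system + demonstration turns + the real input, in chat format."""
    messages = [{"role": "system", "content": SYSTEM}]
    for text, companies in EXAMPLES:
        messages.append({"role": "user", "content": text})
        # json.dumps keeps every demonstration in exactly the target format
        messages.append({"role": "assistant", "content": json.dumps(companies)})
    messages.append({"role": "user", "content": headline})
    return messages

msgs = build_messages("Elon Musk's xAI secures $6B in latest funding round")
for m in msgs:
    print(m["role"].upper() + ":", m["content"])
```

Keeping the examples as data rather than as a hand-written string makes it easy to add, remove, or reorder demonstrations when testing.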

Selecting Good Examples

Not all examples are equal. The best few-shot examples are:

| Property | Why It Matters | Bad Example |
| --- | --- | --- |
| Representative | Cover the most common case, not just easy ones | Only showing 5-star reviews in a mixed sentiment classifier |
| Edge-covering | Include the tricky cases the model is likely to get wrong | Only easy, unambiguous examples |
| Correct | Each example must have a verified correct output | Using model-generated examples without human review |
| Diverse | Avoid repetition; 3 similar examples count as 1 | Three synonymous phrasings of "cancel subscription" |
| Short enough | Long examples burn context window quickly | Examples with 500-word outputs for a summary task |

Ordering Effects

The order of examples matters — later examples (closer to the real input) tend to have stronger influence. Put your most important examples last, especially any examples that demonstrate the hardest edge cases.

Recency Bias Warning

If you put all your "positive output" examples last and "negative output" examples first, the model will lean positive even on negative inputs. Alternate the expected outputs to avoid order-induced bias.
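Alternating the expected outputs can be automated. The sketch below reorders labeled examples round-robin by label; `interleave_by_label` is a hypothetical helper and the sample reviews are invented. It reduces label clustering but does not remove ordering effects entirely, since some label must still come last.

```python
from collections import defaultdict
from itertools import zip_longest

def interleave_by_label(examples):
    """Reorder (text, label) pairs so that labels alternate round-robin."""
    buckets = defaultdict(list)
    for text, label in examples:
        buckets[label].append((text, label))
    ordered = []
    # Take one example from each label in turn until all buckets are empty
    for group in zip_longest(*buckets.values()):
        ordered.extend(item for item in group if item is not None)
    return ordered

examples = [
    ("Broke after a week.", "NEGATIVE"),
    ("Refund took a month.", "NEGATIVE"),
    ("Love it, works perfectly.", "POSITIVE"),
    ("Best purchase this year.", "POSITIVE"),
]
for text, label in interleave_by_label(examples):
    print(label, "|", text)
```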

[Interactive figure: Few-Shot Effect Visualizer. Shows how accuracy on edge cases changes as the number of examples (k) increases.]

You're using few-shot prompting for a classification task. Your model keeps misclassifying "neutral" reviews as "positive". The best next step is:

- Add more POSITIVE examples to reinforce what positive looks like
- Switch to zero-shot with better instructions
- Add examples of NEUTRAL reviews (especially ones that look positive at first glance)
- Place all positive examples at the end of the prompt

Chapter 03

Chain-of-Thought

A model is given this math problem: "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?" Asked directly, with no special prompting, a model such as GPT-3 may answer: 11. Correct, but a problem this simple proves little; direct answering breaks down as problems require more reasoning steps.

Now add five words to the prompt: "Let's think step by step." The model outputs: "Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 new balls. 5 + 6 = 11." Still 11, but now the model showed its work. On harder problems, where the direct answer fails, writing out the steps often succeeds.

Why Showing Work Improves Accuracy

A language model generates tokens one at a time, left to right. Each new token is conditioned on all previous tokens, including the tokens it just generated. When you force the model to write out intermediate steps, those steps become part of the conditioning for the final answer. The model is literally using its own scratch space to build up the answer incrementally.

The Scratchpad Intuition

A human can't multiply 237 × 89 in their head reliably. But they can write the partial products on paper and add them. CoT gives the model the same scratchpad. The "reasoning" isn't separate from the output — it IS the output, up to that point.
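The paper-and-pencil version of 237 × 89 can be written out explicitly. This sketch only illustrates the scratchpad idea with ordinary arithmetic; `partial_products` is a hypothetical helper and says nothing about how a model computes internally.

```python
def partial_products(a, b):
    """Split b into decimal place values and list each partial product of a * b."""
    steps = []
    for power, digit in enumerate(reversed(str(b))):
        place_value = int(digit) * 10 ** power
        steps.append((place_value, a * place_value))
    return steps

steps = partial_products(237, 89)
for place_value, product in steps:
    print(f"237 x {place_value} = {product}")

total = sum(product for _, product in steps)
print("sum of partial products =", total)  # 2133 + 18960 = 21093
```

Each intermediate line is easy on its own; the final sum depends on having those lines written down, which is the role the reasoning tokens play for the model.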

Zero-Shot CoT vs. Few-Shot CoT

Technique

Zero-Shot CoT

Append "Let's think step by step." to your prompt. No examples needed. Surprisingly effective — introduced by Kojima et al. 2022. Works best on math, logic, multi-hop reasoning. Costs extra tokens (the reasoning steps are output tokens).

Technique

Few-Shot CoT

Provide examples where the output includes the reasoning chain, not just the answer. The model learns both the reasoning format AND the content domain. More powerful than zero-shot CoT but costs context window for the examples.
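In code, zero-shot CoT has two parts: appending the trigger phrase, and pulling the final answer back out of the reasoning text. The sketch below uses hypothetical helpers, and taking "the last number in the text" is a simple heuristic that assumes the model ends with its answer; a stricter format instruction would make extraction more reliable.

```python
import re

COT_TRIGGER = "Let's think step by step."

def make_cot_prompt(question):
    """Append the zero-shot CoT trigger to a question."""
    return f"{question.strip()}\n\n{COT_TRIGGER}"

def extract_final_number(reasoning):
    """Heuristic: return the last number in the reasoning text, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", reasoning)
    return float(numbers[-1]) if numbers else None

prompt = make_cot_prompt(
    "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 balls. How many tennis balls does he have now?"
)
print(prompt)

# Reasoning text as quoted earlier in this chapter
sample_reasoning = (
    "Roger starts with 5 balls. He buys 2 cans x 3 balls = 6 new balls. 5 + 6 = 11."
)
print(extract_final_number(sample_reasoning))
```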

A Real CoT Prompt: Contract Analysis

Few-Shot CoT: Contract Clause Risk Analysis
SYSTEM: Analyze contract clauses for legal risk. Think step by step before giving your risk rating (LOW/MEDIUM/HIGH).

USER: Clause: "Either party may terminate with 30 days written notice."
ASSISTANT: Step 1: Identify the clause type — mutual termination right.
Step 2: Check asymmetry — both parties have equal rights here, so no one-sided risk.
Step 3: Check notice period — 30 days is standard. Short enough to exit, long enough to plan.
Step 4: Check missing elements — no cause required, no cure period, no fee.
Conclusion: This is a standard mutual termination clause with no unusual risk.
Risk: LOW

USER: Clause: "Vendor may modify pricing with 5 days notice. Client's continued use constitutes acceptance."
ASSISTANT: Step 1: Identify clause type — unilateral price modification.
Step 2: Check asymmetry — ONLY the vendor can modify, not the client. Strongly one-sided.
Step 3: Check notice period — 5 days is extremely short. Not enough time to find alternatives.
Step 4: Check implied acceptance — "continued use = acceptance" is a trap. Stopping use may not be practical.
Step 5: Check missing elements — no cap on modifications, no frequency limit.
Conclusion: This clause allows arbitrary price increases with minimal notice and automatic acceptance.
Risk: HIGH

USER: Clause: "All intellectual property created during the engagement transfers to Client upon final payment."
ASSISTANT:

The few-shot examples teach the model a consistent reasoning protocol: identify clause type → check asymmetry → check time periods → check missing elements → conclude. This protocol then generalizes to the new clause. The model doesn't invent a different reasoning path each time.
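Because every demonstration ends with a "Risk:" line, the rating can be separated from the reasoning with a small parser. A minimal sketch, assuming the model follows the demonstrated format; `parse_risk` is a hypothetical helper and the sample text is abbreviated from the example above.

```python
import re

# Matches a line consisting only of "Risk: LOW", "Risk: MEDIUM", or "Risk: HIGH"
RISK_PATTERN = re.compile(r"^Risk:\s*(LOW|MEDIUM|HIGH)\s*$", re.MULTILINE)

def parse_risk(output):
    """Return the last 'Risk: X' rating in the output, or None if absent."""
    matches = RISK_PATTERN.findall(output)
    return matches[-1] if matches else None

sample = """Step 1: Identify clause type - unilateral price modification.
Step 2: Check asymmetry - ONLY the vendor can modify, not the client.
Conclusion: This clause allows arbitrary price increases with minimal notice.
Risk: HIGH"""

print(parse_risk(sample))
```

Returning `None` rather than raising lets the caller decide whether a missing rating should trigger a retry or be recorded as a failure.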

When CoT Helps (and When It Doesn't)

| Task Type | CoT Benefit | Reason |
| --- | --- | --- |
| Multi-step math | Very high (+40%) | Intermediate values are needed for final answer |
| Logical reasoning | High (+25-35%) | Step ordering matters; steps build on each other |
| Code generation | High (planning step) | "First write the function signature, then..." scaffolds structure |
| Simple classification | Low or negative | Adds latency/cost with minimal accuracy gain; can overthink |
| Factual recall | Negligible | Either the model knows it or it doesn't; reasoning doesn't help |
| Creative writing | Moderate | Planning a structure before writing improves coherence |

[Interactive figure: CoT Accuracy Gain by Task Type. Shows how much CoT improves accuracy on benchmark datasets, by task category.]

Why does "Let's think step by step" improve accuracy on multi-step math problems?

- It triggers a separate "math mode" in the model's architecture
- The intermediate steps become part of the context, giving the model its own scratch space for computation
- It increases the model's temperature, leading to more creative answers
- It forces the model to use a calculator

Chapter 04

Structured Output

Your app needs to extract medication names, dosages, and frequencies from clinical notes — and store them in a database. The model returns: "The patient takes Lisinopril 10mg once daily and Metformin 500mg twice daily." Medically correct. Completely useless for a database write.

You need structured output: a guaranteed format (JSON, XML, CSV) with predictable fields, parseable by code without post-processing. Getting this reliably from a language model requires specific techniques.

Level 1: Format Instructions in the System Prompt

The simplest approach — describe the schema explicitly. Works on capable models for simple schemas:

System Prompt: Structured Extraction
SYSTEM: Extract medications from clinical notes. Respond with ONLY a JSON array. No explanation, no preamble, no markdown fences. Each medication object must have exactly these fields:
- "name": string (official drug name, not brand name)
- "dose_mg": number or null
- "frequency": "QD"|"BID"|"TID"|"QID"|"PRN"|"OTHER"
- "route": "oral"|"IV"|"topical"|"inhaled"|"other"

If a field cannot be extracted, use null. Never invent values.

USER: "Pt on atorvastatin 40mg PO daily, aspirin 81mg daily, and albuterol inhaler as needed."
ASSISTANT: [{"name":"atorvastatin","dose_mg":40,"frequency":"QD","route":"oral"},{"name":"aspirin","dose_mg":81,"frequency":"QD","route":"oral"},{"name":"albuterol","dose_mg":null,"frequency":"PRN","route":"inhaled"}]
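Format instructions are a request, not a guarantee, so the reply should still be checked before it reaches the database. The sketch below validates a reply against the schema described in the prompt using only the standard library; `validate_medications` is a hypothetical helper, and requiring `name` to be a string while allowing null for the other fields is an assumption about how strictly you read the prompt's rules.

```python
import json

FIELDS = {"name", "dose_mg", "frequency", "route"}
FREQUENCIES = {"QD", "BID", "TID", "QID", "PRN", "OTHER"}
ROUTES = {"oral", "IV", "topical", "inhaled", "other"}

def _is_number(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def validate_medications(raw):
    """Return a list of problems in the reply; an empty list means it is valid."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, list):
        return ["top-level value must be a JSON array"]
    problems = []
    for i, med in enumerate(data):
        if not isinstance(med, dict) or set(med) != FIELDS:
            problems.append(f"item {i}: expected exactly the fields {sorted(FIELDS)}")
            continue
        if not isinstance(med["name"], str):
            problems.append(f"item {i}: name must be a string")
        if med["dose_mg"] is not None and not _is_number(med["dose_mg"]):
            problems.append(f"item {i}: dose_mg must be a number or null")
        for key, allowed in (("frequency", FREQUENCIES), ("route", ROUTES)):
            value = med[key]
            if value is not None and not (isinstance(value, str) and value in allowed):
                problems.append(f"item {i}: {key} must be null or one of {sorted(allowed)}")
    return problems

good = '[{"name":"albuterol","dose_mg":null,"frequency":"PRN","route":"inhaled"}]'
bad = '[{"name":"aspirin","dose_mg":"81mg","frequency":"daily","route":"oral"}]'
print(validate_medications(good))
print(validate_medications(bad))
print(validate_medications("The patient takes Lisinopril 10mg once daily."))
```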

Level 2: JSON Mode / Response Format API

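Several major APIs offer a JSON or structured-output mode that constrains the model to emit syntactically valid JSON. In OpenAI's Chat Completions API this is the response_format parameter; other providers expose comparable features under different parameter names, so check each provider's documentation. This is more reliable than instructions alone because validity is enforced during generation rather than merely requested in the prompt. Note that JSON mode guarantees valid syntax, not conformance to your schema.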

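Python: JSON Mode (OpenAI)
import json
import openai

client = openai.OpenAI()  # reads OPENAI_API_KEY from the environment

# Example input; in practice this is the clinical note to process
note = "Pt on atorvastatin 40mg PO daily, aspirin 81mg daily."

response = client.chat.completions.create(
  model="gpt-4o",
  response_format={"type": "json_object"},  # JSON mode: reply is syntactically valid JSON
  messages=[
    # JSON mode requires the word "JSON" to appear somewhere in the messages
    {"role": "system", "content": "Extract data as JSON."},
    {"role": "user", "content": note}
  ]
)
# Parse the reply text into a Python object
data = json.loads(response.choices[0].message.content)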

Level 3: Schema-Enforced Output with Instructor

The most powerful approach: define your schema as a Pydantic model and use the instructor library to enforce it. The library automatically handles retries when validation fails, and you get a fully typed Python object back:

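Python: Pydantic + Instructor Pattern
from pydantic import BaseModel
from typing import Optional, Literal, List
import instructor, openai

# Example input; in practice this is the clinical note to process
note = "Pt on atorvastatin 40mg PO daily, aspirin 81mg daily."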

class Medication(BaseModel):
  name: str
  dose_mg: Optional[float] = None
  frequency: Literal["QD","BID","TID","QID","PRN","OTHER"]
  route: Literal["oral","IV","topical","inhaled","other"]

class MedList(BaseModel):
  medications: List[Medication]

client = instructor.from_openai(openai.OpenAI())

result = client.chat.completions.create(
  model="gpt-4o",
  response_model=MedList,
  messages=[{"role":"user","content":note}]
)
# result.medications is a List[Medication] — fully typed

The critical insight here: Pydantic validation IS part of the prompt. When instructor catches a validation error, it can feed the error message back to the model as a correction prompt and retry (the number of attempts is set with max_retries). The model does not learn anything lasting; it gets another attempt with its own mistake in context, inside the same client call.
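The validate-and-retry idea can be sketched without any library. This is not instructor's actual implementation; it is a simplified illustration in which `validate`, `complete_with_retries`, and `FakeModel` are all hypothetical, and a canned stand-in replaces the real API call so the example runs offline.

```python
import json

def validate(raw):
    """Raise ValueError with a readable message if the reply breaks the schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Reply is not valid JSON: {exc.msg}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("medications"), list):
        raise ValueError('Expected an object with a "medications" array')
    return data

def complete_with_retries(call_model, messages, max_retries=2):
    """Call the model, feeding validation errors back until the reply validates."""
    messages = list(messages)  # do not mutate the caller's list
    for attempt in range(max_retries + 1):
        raw = call_model(messages)
        try:
            return validate(raw)
        except ValueError as err:
            if attempt == max_retries:
                raise
            # Show the model its own reply plus the error, then ask again
            messages.append({"role": "assistant", "content": raw})
            messages.append({
                "role": "user",
                "content": f"Validation error: {err}. Please correct your reply.",
            })

class FakeModel:
    """Stand-in for a real API client: returns canned replies in order."""
    def __init__(self, replies):
        self._replies = iter(replies)
        self.calls = []  # number of messages seen on each call

    def __call__(self, messages):
        self.calls.append(len(messages))
        return next(self._replies)

fake = FakeModel(["Sure! Here you go", '{"medications": []}'])
result = complete_with_retries(fake, [{"role": "user", "content": "Extract meds."}])
print(result, fake.calls)
```

The second call sees three messages instead of one: the original request, the invalid reply, and the error text. That extra context is what makes the retry more likely to succeed than simply asking again.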

XML Tags as Parsing Anchors

For models without JSON mode, XML tags are often more reliable than raw JSON. Tags are forgiving: a missing comma can make an entire JSON reply unparseable, while a tag-delimited reply can still be read field by field. Anthropic's documentation recommends XML tags for structuring prompts and outputs with Claude models:

XML-Tagged Output Pattern
SYSTEM: Analyze the contract clause. Use these XML tags exactly:
<risk_level>LOW|MEDIUM|HIGH</risk_level>
<one_line_summary>...</one_line_summary>
<recommended_action>...</recommended_action>

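Parse with: m = re.search(r'<risk_level>(.+?)</risk_level>', output); risk = m.group(1) if m else None (re.search returns None when the tag is missing, so check before calling .group).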
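A slightly more defensive version of this parsing handles missing tags, multi-line content, and unexpected risk values. A minimal sketch; `extract_tag` and `parse_analysis` are hypothetical helpers and the sample reply is invented.

```python
import re

def extract_tag(output, tag):
    """Return the stripped text inside <tag>...</tag>, or None if the tag is missing."""
    pattern = rf"<{re.escape(tag)}>(.*?)</{re.escape(tag)}>"
    match = re.search(pattern, output, re.DOTALL)  # DOTALL: content may span lines
    return match.group(1).strip() if match else None

def parse_analysis(output):
    """Extract the three expected fields; invalid or missing values become None."""
    risk = extract_tag(output, "risk_level")
    return {
        "risk_level": risk if risk in {"LOW", "MEDIUM", "HIGH"} else None,
        "one_line_summary": extract_tag(output, "one_line_summary"),
        "recommended_action": extract_tag(output, "recommended_action"),
    }

# Invented sample reply for illustration
sample = """<risk_level>HIGH</risk_level>
<one_line_summary>
Vendor can raise prices with only 5 days notice.
</one_line_summary>
<recommended_action>Negotiate a longer notice period and a price cap.</recommended_action>"""

print(parse_analysis(sample))
print(parse_analysis("I could not analyze this clause."))
```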

[Interactive figure: Structured Output Reliability by Method. Compares parse success rates across methods on 1000 real API calls.]

You need to extract structured data from model outputs reliably in production. Ranked from most to least reliable, the correct order is:

- Instructions only → JSON mode → Pydantic+Instructor → XML tags
- XML tags → JSON mode → Instructions only → Pydantic+Instructor
- Pydantic+Instructor → JSON mode → XML tags → Instructions only
- JSON mode → Pydantic+Instructor → XML tags → Instructions only
