Prompting teaches the model what you want. Fine-tuning teaches the model who you are — your format, your vocabulary, your behavior. Learn when, why, and how to fine-tune LLMs from scratch.
You're building a customer support bot for a legal firm. The bot needs to respond in a very specific format: numbered action items, formal British English, references to specific internal case numbers, and a particular disclaimer at the bottom of every response.
You write a detailed system prompt. You add five examples. You tweak the temperature. The model still occasionally drops the disclaimer. Sometimes it uses casual language. On complex queries, it forgets the case number format. You spend a week on prompt engineering. The failure rate: 8%.
Eight percent might sound fine until you realize this is legal correspondence. One misfired response and a client complains. The problem is that you're fighting the model's prior — it saw billions of tokens of casual English, and your prompt is a thin layer of instruction on top of that enormous prior.
Prompting works by steering the model's next-token predictions toward a region of output space. This works well when the region is large and forgiving — "be helpful," "be concise" — and fails when the region is narrow and precise — "always output valid JSON," "never use the word 'however'," "every response ends with a specific three-line footer."
The model's weights encode billions of text patterns. Your system prompt is competing with all of them simultaneously. Fine-tuning, by contrast, actually shifts the weight distributions so that your desired behavior becomes the model's default — its new prior.
Figure: compliance rate versus prompt complexity and versus number of fine-tuning examples. Prompting plateaus, while fine-tuning keeps improving with more examples.
Signal 1 — Format brittleness: Your prompt says "respond in JSON" and the model outputs markdown with JSON inside it, or adds prose before the JSON block, or uses single quotes instead of double. The model learned JSON from millions of examples where JSON appeared in many contexts. Fine-tuning can make strict JSON output the model's reflex.
Signal 2 — Domain vocabulary: Your task involves proprietary terminology, internal acronyms, or domain-specific meanings that differ from general usage. A legal model needs to know that "discovery" means the pretrial evidence exchange process, not finding something. Fine-tuning on domain text changes how the model represents these tokens.
Signal 3 — Consistent persona or tone: Your prompt says "you are Alex, a friendly but professional assistant." Two hundred tokens into a long conversation, Alex starts sounding like a generic chatbot. Fine-tuning bakes the persona into the weights.
Fine-tuning is not always the answer. It costs time, money, and introduces operational complexity. The question is never "can I fine-tune?" — you almost always can. The question is "should I?"
Here's how to think about it: prompting is a rental. You pay per token, the landlord (the API provider) can change the model tomorrow, and you can move out anytime. Fine-tuning is buying a house. Higher upfront cost, but the behavior is locked in. You own it. And if you're using it constantly, it's cheaper per use.
| Task | Recommendation | Why |
|---|---|---|
| Answer general questions | Prompt | General capability, model already good at it |
| Classify support tickets into 5 categories | Few-shot or fine-tune | With 200+ examples, fine-tuning cheaper at scale |
| Extract structured data from medical records | Fine-tune | Format precision + domain vocab critical |
| Write blog posts in your brand voice | Fine-tune | Consistent persona that prompts can't hold |
| One-off translation task | Prompt | Not worth the overhead |
| Detect specific safety violations in user content | Fine-tune | High precision required, runs millions of times/day |
| Summarize documents you wrote | RAG + prompt | Retrieval beats memorization for facts |
| Answer questions about your codebase | RAG + prompt | Code changes too fast; don't fine-tune moving targets |
The most common mistake in fine-tuning projects is spending two weeks on model selection and two hours on data. The truth is the opposite: the model is mostly decided for you (use the strongest model you can afford), but the data is where all your leverage lives.
Garbage in, garbage out — but it's worse than that. With fine-tuning, inconsistent data is actively poisonous. The model will happily learn contradictions, and the result will be a model that's confidently inconsistent.
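Most fine-tuning APIs (OpenAI, Together, Fireworks) expect data in JSONL format: one JSON object per line, each representing a complete conversation. The conversation follows the familiar messages array structure: system, user, assistant turns. The two examples below are wrapped across several lines for readability; in the real file, each object must sit on a single line.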
jsonl — one line per training example
{"messages": [
{"role": "system", "content": "You are a legal document assistant. Always respond with numbered action items in formal British English."},
{"role": "user", "content": "Our client received a demand letter. What should we do?"},
{"role": "assistant", "content": "1. Acknowledge receipt of the demand letter within 5 business days.\n2. Instruct the client to refrain from any communications with the opposing party.\n3. Obtain and review all relevant documentation referenced in the letter.\n4. Schedule a consultation to assess merit and exposure.\n\n*This response does not constitute legal advice.*"}
]}
{"messages": [
{"role": "system", "content": "You are a legal document assistant. Always respond with numbered action items in formal British English."},
{"role": "user", "content": "What is the statute of limitations for breach of contract?"},
{"role": "assistant", "content": "1. Determine the jurisdiction in which the contract was formed and is to be performed.\n2. Identify whether the contract is written or oral, as limitation periods differ.\n3. Note that most jurisdictions impose a 3–6 year limitation for written contracts.\n4. Consult local statutes for any exceptions or tolling provisions.\n\n*This response does not constitute legal advice.*"}
]}
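Before uploading, it can help to check each line of the file against the structure described above. The following is a minimal sketch, not an official validator: the function name `validate_line`, the allowed role set, and the rule that the last turn must be the assistant are assumptions based on the examples shown here, and individual providers may enforce more or fewer rules.

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}  # assumed role set

def validate_line(line: str) -> list[str]:
    """Return a list of problems found in one JSONL line (empty list = OK)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in ALLOWED_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should be the assistant turn")
    return problems

good = json.dumps({"messages": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "1. Hello."},
]})
bad = '{"messages": [{"role": "user", "content": "Hi"}]}'
print(validate_line(good))  # []
print(validate_line(bad))   # ['last message should be the assistant turn']
```

Running this over every line and printing the line number with each problem list catches most formatting mistakes before a training job is billed.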
The prevailing myth is that fine-tuning requires thousands of examples. This was true for training from scratch. For fine-tuning a pre-trained model, the math is completely different.
The model already knows how to write. It knows grammar, reasoning, world knowledge. You're not teaching it cognition — you're teaching it your specific format and style. That's a much smaller delta. Empirically:
| Task Type | Examples Typically Needed | Why |
|---|---|---|
| Format / style consistency | 50–200 | The model knows the content, learns the wrapper |
| Classification (few categories) | 100–500 | Needs decision boundary examples across classes |
| Domain vocabulary / jargon | 200–1000 | Needs coverage of the new vocabulary in context |
| Complex reasoning in new domain | 500–5000 | Needs enough examples to learn domain logic |
| Full capability shift | 10,000+ | Teaching a genuinely new skill, not just style |
Deduplication: Remove near-duplicate examples. Exact duplicates waste training steps; near-duplicates (same question, slightly different phrasing) can cause the model to memorize rather than generalize. Use fuzzy matching (Jaccard similarity or embedding similarity) to find them.
Balance: If you're teaching the model five output categories, make sure each category is roughly equally represented. A 90/10 split will produce a model that almost always predicts the majority class.
Consistency: Every example of the same scenario type should produce the same output format. If three examples end with the disclaimer and two don't, the model will learn to include it 60% of the time — worse than no fine-tuning at all.
Length distribution: Include examples of short, medium, and long responses if you want the model to handle all lengths. A dataset of only short examples produces a model that truncates long responses.
python — data cleaning script
import hashlib
import json

def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

def dedup(examples):
    """Drop exact duplicates, keeping the first occurrence."""
    seen = set()
    clean = []
    for ex in examples:
        # Hash every turn (system, user, assistant) together
        key = "\n".join(m["content"] for m in ex["messages"])
        h = hashlib.md5(key.encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            clean.append(ex)
    return clean

def check_format(examples):
    """Return indices of examples whose final turn lacks the disclaimer."""
    bad = []
    for i, ex in enumerate(examples):
        last = ex["messages"][-1]["content"]
        if "does not constitute legal advice" not in last:
            bad.append(i)
    return bad

data = load_jsonl("training.jsonl")
data = dedup(data)
bad_idx = check_format(data)
print(f"{len(bad_idx)} examples missing disclaimer: fix before training")
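The hash-based check only catches exact duplicates. For the near-duplicates mentioned earlier (same question, slightly different phrasing), a word-level Jaccard similarity is one simple option. This is a rough sketch: the helper names and the 0.8 threshold are assumptions to tune on your own data, and the pairwise loop is quadratic, so larger datasets would need MinHash or embedding search instead.

```python
def word_set(text: str) -> set[str]:
    """Lower-case the text and split on whitespace."""
    return set(text.lower().split())

def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def near_duplicates(questions: list[str], threshold: float = 0.8) -> list[tuple[int, int]]:
    """Return index pairs whose word sets overlap at or above the threshold."""
    sets = [word_set(q) for q in questions]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                pairs.append((i, j))
    return pairs

questions = [
    "What should we do about a demand letter our client received?",
    "What should we do about a demand letter our client has received?",
    "What is the statute of limitations for breach of contract?",
]
print(near_duplicates(questions))  # [(0, 1)]
```

Flagged pairs are best reviewed by hand: sometimes the second phrasing is a useful paraphrase, sometimes it is noise.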
Figure: class proportions versus per-class accuracy. Imbalance lowers the model's confidence on the minority class; a balanced dataset gives the best per-class accuracy.
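A quick way to spot imbalance before training is to count labels and compare the largest class to the smallest. The sketch below assumes each example carries a single class label; the 3:1 cutoff in `flag_imbalance` is an arbitrary illustration, not a recommended constant.

```python
from collections import Counter

def class_shares(labels):
    """Fraction of the dataset that each label accounts for."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def flag_imbalance(labels, max_ratio=3.0):
    """Return (is_imbalanced, ratio), where ratio = largest class / smallest class."""
    counts = Counter(labels)
    ratio = max(counts.values()) / min(counts.values())
    return ratio > max_ratio, ratio

# Hypothetical label list reproducing the 90/10 split discussed earlier
labels = ["billing"] * 90 + ["refund"] * 10
print(class_shares(labels))    # {'billing': 0.9, 'refund': 0.1}
print(flag_imbalance(labels))  # (True, 9.0)
```

When a class is flagged, the usual remedies are collecting more minority examples or downsampling the majority class.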
A 7-billion parameter model has 7 billion floating-point numbers. Full fine-tuning means computing gradients for all 7 billion, storing optimizer state (Adam keeps two extra values per parameter), and keeping all of that in GPU memory simultaneously. For a 7B model, that's roughly 84 GB of VRAM (14 GB of 16-bit weights, 14 GB of gradients, 56 GB of 32-bit Adam state) before counting activations, which is more than three 24 GB consumer GPUs combined.
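The 84 GB figure can be reproduced with back-of-envelope arithmetic. The sketch below assumes one particular accounting: 16-bit weights, 16-bit gradients, and two 32-bit Adam moment estimates per parameter, with activations ignored. Other setups (for example, keeping an extra 32-bit master copy of the weights) come out higher.

```python
def full_finetune_memory_gb(n_params: float,
                            weight_bytes: int = 2,
                            grad_bytes: int = 2,
                            adam_state_bytes: int = 8) -> float:
    """Weights + gradients + Adam's two moment estimates, ignoring activations."""
    bytes_per_param = weight_bytes + grad_bytes + adam_state_bytes
    return n_params * bytes_per_param / 1e9

print(full_finetune_memory_gb(7e9))  # 84.0
```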
Most teams can't afford that. But here's the key insight: you don't need to.
A weight matrix W in a transformer has shape [d_model, d_model], e.g., [4096, 4096] for a 7B model. That's about 16.8 million parameters per matrix. Low-Rank Adaptation (LoRA) freezes W and learns a small low-rank delta instead:
ΔW = A × B, where A has shape [d_model, r] and B has shape [r, d_model]
Here r is the rank, typically 4, 8, or 16. Instead of learning 16.8M parameters (4096×4096), you learn 2×4096×8 = 65,536 parameters at r=8. That's a 256× reduction in trainable parameters.
At inference, the delta is simply added to the frozen weight: W' = W + α/r × A×B, where α is a scaling factor (commonly set to r or 2r, so α/r is 1 or 2). Once you merge the adapter, the full model runs at normal speed with no inference overhead.
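To make the arithmetic concrete, here is a small pure-Python sketch that counts parameters for a [4096, 4096] matrix at r=8 and checks, on a toy 3×3 example, that merging the adapter gives the same output as running the frozen weight and the adapter side by side. The matrices and helper names are made up for illustration; real implementations use tensor libraries.

```python
from fractions import Fraction

def matmul(X, Y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def lora_param_counts(d_model: int, r: int):
    """Return (full matrix params, LoRA params, reduction factor)."""
    full = d_model * d_model
    lora = 2 * d_model * r  # A is d_model x r, B is r x d_model
    return full, lora, full // lora

def merge(W, A, B, alpha, r):
    """Compute W' = W + (alpha / r) * A x B."""
    scale = Fraction(alpha, r)
    delta = matmul(A, B)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(W, delta)]

print(lora_param_counts(4096, 8))  # (16777216, 65536, 256)

# Toy example: d_model = 3, rank r = 1, alpha = 2
W = [[1, 0, 2], [0, 1, 0], [3, 0, 1]]
A = [[1], [2], [0]]
B = [[0, 1, 1]]
x = [[1], [2], [3]]

y_merged = matmul(merge(W, A, B, alpha=2, r=1), x)

# Unmerged path: W x + (alpha / r) * A (B x)
Wx = matmul(W, x)
ABx = matmul(A, matmul(B, x))
y_adapter = [[wx[0] + 2 * abx[0]] for wx, abx in zip(Wx, ABx)]

print(y_merged == y_adapter)  # True
```

The equality is why a merged adapter adds no latency: the extra low-rank product is folded into the weight once, ahead of time.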
Figure: a full weight update ΔW has millions of parameters; LoRA approximates it as the product A×B. The rank r trades approximation quality against parameter count.
Llama-style transformers have four weight matrices in each attention block: Q (query), K (key), V (value), O (output projection), plus three in the MLP (gate projection, up-projection, down-projection); older architectures have only the up and down projections. The original LoRA experiments focused on Q and V. In practice, adapting all of the linear layers often gives better results for the same rank.
python — LoRA with Hugging Face PEFT
import torch
from peft import LoraConfig, get_peft_model, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
lora_config = LoraConfig(
    r=16,  # rank: higher = more capacity, more params
    lora_alpha=32,  # scaling: the adapter update is multiplied by alpha/r
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# trainable params: 41,943,040 || all params: 8,072,204,288 || trainable%: 0.52%
QLoRA (Dettmers et al., 2023) goes further: it quantizes the frozen base model weights to a 4-bit data type (NF4), reducing 7B × 16 bits = 14 GB to 7B × 4 bits = 3.5 GB, then trains LoRA adapters in bfloat16. The result: you can fine-tune a 7B model on a single 24 GB GPU (RTX 3090 or 4090).
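The quantization introduces some noise, but the LoRA training compensates for much of it. As a rough rule of thumb, expect about 95-97% of full fine-tuning quality at 5-10% of the hardware cost; the exact gap depends on the task, so measure it on your own evaluation set.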
python — QLoRA setup with bitsandbytes
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4: designed for normally distributed weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16, store in 4-bit
    bnb_4bit_use_double_quant=True,  # quantize the quantization constants too
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
# Now add LoRA adapters as before
# GPU memory: ~6 GB instead of ~16 GB
Rank selection is the main hyperparameter in LoRA. The rule of thumb: start with r=16. If your validation loss plateaus early and you have enough data, try r=32 or r=64. If you're memory-constrained or have very few examples, r=4 or r=8 works surprisingly well for style and format tasks.
| Rank | Trainable params (Llama 3.1 8B, all linear layers) | Best for |
|---|---|---|
| r=4 | ~10M (0.13%) | Style/format, very few examples |
| r=8 | ~21M (0.26%) | Classification, tone, light adaptation |
| r=16 | ~42M (0.52%) | Most tasks — good default |
| r=64 | ~168M (2.1%) | Complex reasoning, heavy domain shift |
| Full fine-tune | 8B (100%) | When nothing else works and you have a cluster |
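The parameter counts in the table follow directly from the layer shapes: each adapted linear layer with input width d_in and output width d_out adds r × (d_in + d_out) trainable parameters. The sketch below assumes the Llama 3.1 8B shapes (hidden size 4096, key/value width 1024, MLP width 14336, 32 decoder blocks); check the model's config file if you use a different base model.

```python
# (in_features, out_features) of each adapted linear layer in one decoder block.
# Assumed shapes for Llama 3.1 8B.
LAYER_SHAPES = {
    "q_proj": (4096, 4096),
    "k_proj": (4096, 1024),
    "v_proj": (4096, 1024),
    "o_proj": (4096, 4096),
    "gate_proj": (4096, 14336),
    "up_proj": (4096, 14336),
    "down_proj": (14336, 4096),
}
NUM_LAYERS = 32

def lora_trainable_params(r, shapes=LAYER_SHAPES, num_layers=NUM_LAYERS):
    """Each adapted layer adds an r x d_in matrix and a d_out x r matrix."""
    per_block = sum(r * (d_in + d_out) for d_in, d_out in shapes.values())
    return per_block * num_layers

for r in (4, 8, 16, 64):
    print(r, f"{lora_trainable_params(r):,}")
# 4 10,485,760
# 8 20,971,520
# 16 41,943,040
# 64 167,772,160
```

The r=16 result matches the `print_trainable_parameters()` output shown in the PEFT example.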
Three main paths to fine-tuning, trading ease against control: a managed API such as OpenAI's, the Hugging Face libraries, and Axolotl.
The managed API is the simplest path. You upload a JSONL file, specify hyperparameters (or let the API choose), and wait. The API runs the training job on OpenAI's infrastructure. You pay per training token, then per inference token on the resulting model.
python — OpenAI fine-tuning API end-to-end
import time
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from environment

# Step 1: Upload training data
with open("training.jsonl", "rb") as f:
    response = client.files.create(file=f, purpose="fine-tune")
file_id = response.id
print(f"Uploaded file: {file_id}")

# Step 2: Start fine-tuning job
job = client.fine_tuning.jobs.create(
    training_file=file_id,
    model="gpt-4o-mini-2024-07-18",  # cheapest capable model
    hyperparameters={
        "n_epochs": 3,  # 3 passes through your data
        "batch_size": "auto",  # let API decide
        "learning_rate_multiplier": "auto",
    },
)
print(f"Job started: {job.id}")

# Step 3: Poll for completion (stop on any terminal status)
while True:
    status = client.fine_tuning.jobs.retrieve(job.id)
    print(f"Status: {status.status}")
    if status.status in ["succeeded", "failed", "cancelled"]:
        break
    time.sleep(30)
if status.status != "succeeded":
    raise RuntimeError(f"Fine-tuning job ended with status: {status.status}")

# Step 4: Use the fine-tuned model
ft_model = status.fine_tuned_model  # e.g., "ft:gpt-4o-mini:org:name:id"
response = client.chat.completions.create(
    model=ft_model,
    messages=[{"role": "user", "content": "Our client got a demand letter."}],
)
print(response.choices[0].message.content)
For open-weight models (Llama, Mistral, Qwen), you run training yourself. The Hugging Face ecosystem provides the building blocks: transformers for the model, peft for LoRA, trl for the SFT Trainer (Supervised Fine-Tuning).
python — HuggingFace SFT with LoRA/QLoRA
from trl import SFTTrainer, SFTConfig
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "meta-llama/Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
dataset = load_dataset("json", data_files="training.jsonl", split="train")

lora_config = LoraConfig(r=16, lora_alpha=32, target_modules="all-linear")

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./checkpoints",
        num_train_epochs=3,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,  # effective batch = 16
        learning_rate=2e-4,
        warmup_ratio=0.03,
        lr_scheduler_type="cosine",
        save_steps=100,
        logging_steps=10,
        bf16=True,
        max_seq_length=2048,  # argument name varies across TRL versions; check yours
    ),
    train_dataset=dataset,
    peft_config=lora_config,
)
trainer.train()
trainer.save_model("./final-adapter")
tokenizer.save_pretrained("./final-adapter")  # keep the tokenizer next to the adapter
Axolotl wraps the Hugging Face ecosystem in a YAML config file. Instead of writing 100 lines of Python, you write a 40-line config and run one command. It handles gradient checkpointing, flash attention, data packing, multi-GPU setups, and W&B logging automatically.
yaml — axolotl config (config.yaml)
base_model: meta-llama/Llama-3.1-8B-Instruct
model_type: AutoModelForCausalLM
load_in_4bit: true # QLoRA
adapter: qlora
lora_r: 16
lora_alpha: 32
lora_target_modules:
- q_proj
- v_proj
- k_proj
- o_proj
datasets:
  - path: training.jsonl
    type: chat_template
num_epochs: 3
micro_batch_size: 2
gradient_accumulation_steps: 8
learning_rate: 0.0002
lr_scheduler: cosine
warmup_steps: 10
output_dir: ./output
save_safetensors: true
bash — run axolotl
pip install axolotl
accelerate launch -m axolotl.cli.train config.yaml
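Both training configs above pair a micro-batch of 2 with 8 gradient accumulation steps. The toy sketch below illustrates why that behaves like a batch of 16: averaging the gradients of 8 micro-batches gives the same value as the gradient over all 16 examples at once. It uses a one-parameter model and mean squared error purely for illustration; the data and function names are made up.

```python
def mse_grad(w, batch):
    """Gradient of mean squared error for the one-parameter model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

# 16 toy (x, y) pairs
data = [(x, 3 * x + 1) for x in range(16)]
w = 0

# Gradient computed over the full batch of 16
full_grad = mse_grad(w, data)

# Same gradient built from 8 micro-batches of 2
micro_batch_size, accumulation_steps = 2, 8
accumulated = 0.0
for step in range(accumulation_steps):
    start = step * micro_batch_size
    micro = data[start:start + micro_batch_size]
    accumulated += mse_grad(w, micro) / accumulation_steps

print(full_grad, accumulated)  # -480.0 -480.0
```

Only one micro-batch of activations is held in memory at a time, which is the whole point of the technique.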
A healthy training run shows training loss falling smoothly while validation loss tracks it closely. Divergence between them signals overfitting.
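gradient_accumulation_steps=8 with per_device_train_batch_size=2 gives an effective batch size of 16. Why use gradient accumulation instead of just setting batch_size=16? Memory: the activations for 16 sequences may not fit on the GPU, so the trainer processes 2 at a time and accumulates gradients over 8 micro-batches before each optimizer step. The update matches a batch of 16 at roughly the activation memory of a batch of 2.

You ran the training job. Loss went down. Now what? Many teams ship the model and discover problems in production. Don't do that. Evaluation is where you find out whether you actually got what you wanted, before your users do.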
There are three things you need to check: (1) did it learn the target behavior, (2) did it not break anything else, and (3) is it actually better than what you had before?
Before you launch a single training job, split your labeled data into training, validation, and test sets. The test set is sacred: it never touches the training job. A typical split: 80% train, 10% validation (used during training to check for overfitting), 10% test (used only at the end).
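One way to make the 80/10/10 split reproducible is to shuffle once with a fixed seed and slice. This is a minimal sketch; the function name and seed are arbitrary, and if your data has near-duplicates or multiple examples per client, you would want to split by group instead so that related examples do not leak across sets.

```python
import random

def split_dataset(examples, train=0.8, val=0.1, seed=42):
    """Shuffle once with a fixed seed, then cut into train/val/test."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],  # everything left over is the test set
    )

train_set, val_set, test_set = split_dataset(range(500))
print(len(train_set), len(val_set), len(test_set))  # 400 50 50
```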
The test set should be representative of your real production distribution. If you fine-tuned for legal correspondence, your test set should contain the same variety of legal questions and request types you'll see in production — not just the easy ones.
Generic metrics like perplexity don't tell you if the model is doing your specific task correctly. Define task-specific metrics that match what you actually care about:
| Task | Metric | How to Compute |
|---|---|---|
| Format compliance | % valid outputs | Parse output with regex or JSON parser; count successes |
| Classification | F1 per class | Compare predicted label to ground truth |
| Extraction (structured data) | Field-level accuracy | Parse JSON output; check each field independently |
| Summarization | ROUGE-L + human eval | Automated n-gram overlap + sample review |
| Tone/style | LLM-as-judge | GPT-4 evaluates each output on a rubric |
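As an example of the first row, format compliance for the legal assistant can be measured with a regex and a string check. The sketch below assumes the target format from the training examples earlier (at least one numbered item, and the disclaimer as the final line); the sample outputs are invented for illustration.

```python
import re

DISCLAIMER = "This response does not constitute legal advice."
NUMBERED_ITEM = re.compile(r"^\d+\.\s+\S", re.MULTILINE)

def is_compliant(output: str) -> bool:
    """True if the output has a numbered item and ends with the disclaimer."""
    has_items = len(NUMBERED_ITEM.findall(output)) >= 1
    # Allow the disclaimer to be wrapped in markdown asterisks
    ends_with_disclaimer = output.rstrip().rstrip("*").endswith(DISCLAIMER)
    return has_items and ends_with_disclaimer

def compliance_rate(outputs):
    """Fraction of outputs that pass the format check."""
    return sum(is_compliant(o) for o in outputs) / len(outputs)

outputs = [
    "1. Acknowledge receipt.\n2. Review the documents.\n\n*This response does not constitute legal advice.*",
    "Sure! You should probably reply soon.",
    "1. Acknowledge receipt.\n2. Review the documents.",
    "1. Determine the jurisdiction.\n\n*This response does not constitute legal advice.*",
]
print(compliance_rate(outputs))  # 0.5
```

Running the same function over base-model and fine-tuned outputs on the test set gives a like-for-like compliance comparison.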
For subjective qualities (tone, professionalism, accuracy), human evaluation is the gold standard but expensive. LLM-as-judge uses a powerful model (GPT-4o, Claude 3.5 Sonnet) to evaluate your fine-tuned model's outputs against a rubric. It's not perfect — LLMs have biases toward verbose outputs and their own style — but it scales.
python — LLM-as-judge evaluation
import json
from openai import OpenAI

client = OpenAI()

def evaluate_response(client, question, response, rubric):
    judge_prompt = f"""Evaluate this assistant response on a 1-5 scale.
Question: {question}
Response: {response}
Rubric: {rubric}
Output ONLY a JSON object: {{"score": <1-5>, "reason": "<one sentence>"}}"""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": judge_prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(result.choices[0].message.content)

rubric = """1: Completely wrong format or missing disclaimer
2: Right format but informal language
3: Formal British English, numbered, but missing one required element
4: All elements present, professional tone
5: Perfect, exactly matches the target style with all requirements"""

# Placeholder: replace with (question, model_response) pairs from your test set
test_pairs = [
    ("Our client received a demand letter. What should we do?",
     "1. Acknowledge receipt of the demand letter within 5 business days."),
]

scores = [evaluate_response(client, q, r, rubric) for q, r in test_pairs]
avg = sum(s["score"] for s in scores) / len(scores)
print(f"Average judge score: {avg:.2f}/5.0")
Fine-tuning can cause catastrophic forgetting: the model improves on your task but loses competence elsewhere. A model fine-tuned on legal correspondence might start responding to casual "tell me a joke" requests with formal numbered action items.
Run your fine-tuned model on a set of general capability prompts (math, coding, common sense) and compare to the base model. If general scores drop more than 5-10%, you've over-trained or used inconsistent data.
Your model is trained and evaluated. Now it needs to serve real traffic. Deploying a fine-tuned model has different considerations from deploying a stock model — especially if you used LoRA adapters.
After training a LoRA adapter, you have two choices for how to serve it.
Option 1 — Merge and export: Permanently bake the adapter into the base model weights (W' = W + α/r × A×B), then serve the merged model. This is the simplest deployment: the resulting model is identical in shape to the base model, runs at full speed, and any serving infrastructure that handles the base model handles your fine-tuned model.
Option 2 — Runtime adapters: Keep the adapter separate from the base model. Load the adapter on top of the base model at inference time. This enables serving multiple adapters on a single base model — a significant cost saving when you have ten fine-tuned variants for ten different customers.
python — merging LoRA adapter into base model
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "./final-adapter")

# Merge adapter weights into base: creates a standard model
merged = model.merge_and_unload()
merged.save_pretrained("./merged-model")
AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct").save_pretrained("./merged-model")
# Now serve ./merged-model with vLLM, Ollama, or any standard serving stack
Suppose you have 20 customers, each with a fine-tuned model. Naively that's 20 separate model deployments and 20× the GPU cost. With multi-LoRA serving, you load one base model and swap adapters per request. The adapter is small (tens to a few hundred MB, depending on rank and target modules), so switching is fast. Tools like vLLM, LoRAX, and S-LoRA support this.
bash — serving multiple LoRA adapters with vLLM
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--enable-lora \
--lora-modules \
legal-firm-a=./adapters/firm-a \
legal-firm-b=./adapters/firm-b \
customer-support=./adapters/support \
--max-lora-rank 64
python — routing requests to correct adapter
import openai

client = openai.OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# Each tenant gets their adapter by name
response = client.chat.completions.create(
    model="legal-firm-a",  # routes to that LoRA adapter
    messages=[{"role": "user", "content": "Explain discovery obligations."}],
)
Fine-tuned models need versioning just like software. When you retrain on new data, the old version should remain accessible for at least a week — both for rollback in case of regressions and for A/B testing the new version against production traffic.
A simple convention: name adapters with a data version and a date stamp, for example legal-v2-20260501. Keep at least two versions deployed simultaneously during the transition window.
Fine-tuning has a fixed upfront cost (data collection + training compute) and then a reduced inference cost (smaller model, shorter prompts). Prompting has zero upfront cost but a higher ongoing inference cost (long system prompts + few-shot examples at every request).
At some request volume, fine-tuning becomes the cheaper option. Finding that crossover point is the break-even analysis.
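Define:
C_train = one-time cost of fine-tuning (training compute, plus data collection if you pay for it)
c_prompt = cost per request with the long prompt
c_ft = cost per request with the fine-tuned model
Break-even requests N = C_train / (c_prompt − c_ft)
Example (illustrative prices): You use GPT-4o with a 1500-token system prompt + 3 few-shot examples (800 tokens) = 2300 tokens of overhead per request. At $5/M input tokens, that is $0.0115 per request. A fine-tuned GPT-4o-mini with a 50-token prompt costs $0.00015 per request. Delta = $0.01135 per request. Training cost is assumed to be $50 (e.g., 5 hours at ~$10/hour; note that OpenAI bills per training token rather than per hour). Break-even = 50 / 0.01135 ≈ 4,400 requests.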
At 1,000 requests/day, you break even in 4.4 days. After that, every request saves $0.01135.
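The same calculation as a small function, so you can plug in your own numbers. The inputs below are the illustrative figures from the example, not current list prices; the function name is made up for this sketch.

```python
def break_even_requests(training_cost, prompt_cost_per_request, finetuned_cost_per_request):
    """Number of requests after which fine-tuning has paid for itself."""
    saving = prompt_cost_per_request - finetuned_cost_per_request
    if saving <= 0:
        raise ValueError("fine-tuning never pays off if it is not cheaper per request")
    return training_cost / saving

prompt_cost = 2300 * 5 / 1_000_000  # 2300 overhead tokens at $5 per million
finetuned_cost = 0.00015            # per-request figure from the example
n = break_even_requests(50, prompt_cost, finetuned_cost)

print(round(n))            # 4405
print(round(n / 1000, 1))  # 4.4 days at 1,000 requests/day
```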
| Item | OpenAI API | Self-hosted (A100) |
|---|---|---|
| Training 7B model, 500 examples, 3 epochs | ~$5-15 | ~$2-8 (cloud GPU) |
| Training 70B model, 1000 examples, 3 epochs | ~$100-300 | ~$50-150 (8×A100) |
| Inference cost (per 1M tokens) | $0.15-$5 (gpt-4o-mini to gpt-4o) | $0.10-$0.40 (self-hosted) |
| Data collection (100 examples) | $0-2,000 (human annotation) | Same |
| Serving infrastructure | Included in API price | $500-3000/month (GPU rental) |
Supervised fine-tuning is one rung on a ladder. Once you understand it, three major directions open up — each building on the same foundation but going further.
Reinforcement Learning from Human Feedback starts where SFT ends. After SFT, you have a model that mimics your examples. But mimicry has a ceiling — the model will reproduce the average of your training data, including its mediocre examples.
RLHF adds a second phase: human raters compare pairs of outputs and indicate which they prefer. A reward model is trained on these preferences. Then the policy (your SFT model) is optimized with PPO to maximize the reward model's score. This is how ChatGPT, Claude, and Gemini go from "capable base model" to "genuinely helpful assistant."
Modern shortcut: Direct Preference Optimization (DPO) skips the separate reward model and the RL loop, fine-tuning directly on preference pairs with a classification-style loss. It targets the same objective with a simpler pipeline, and has become one of the most widely used approaches for preference-based fine-tuning.
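For intuition, the DPO loss for a single preference pair can be written in a few lines. The sketch below takes the summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model (typically the SFT checkpoint). The numbers are invented, and beta=0.1 is a commonly used value assumed here for illustration, not a recommendation.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log sigmoid(beta * (chosen margin - rejected margin)) for one pair."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals log(1 + exp(-z))
    return math.log1p(math.exp(-logits))

# Policy identical to the reference: loss is log(2)
baseline = dpo_loss(-10, -12, -10, -12)
# Policy has raised the chosen response and lowered the rejected one: loss drops
improved = dpo_loss(-8, -14, -10, -12)

print(round(baseline, 4))   # 0.6931
print(improved < baseline)  # True
```

In practice a library such as TRL computes this over batches of token-level log-probabilities; the sketch only shows the shape of the objective.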
Distillation uses a large teacher model to generate training data for a smaller student model. You run GPT-4 on 10,000 prompts, collect its outputs, and fine-tune Llama 3.1 8B on those outputs. The student learns to mimic the teacher's behavior at a fraction of the inference cost.
This is called knowledge distillation via SFT. The key insight: for many tasks, a strong teacher's outputs are as good as, or more consistent than, human-written labels, so you get a teacher-quality training signal at scale without expensive human annotation. Microsoft's Phi series of small models relied heavily on this kind of synthetic, model-generated training data.
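The data-generation step can be sketched as a function that takes any teacher callable and emits chat-format records in the JSONL structure used for training. The names `build_distillation_records` and `fake_teacher` are hypothetical; the stand-in teacher exists only so the sketch runs offline, and in a real pipeline you would replace it with a call to the teacher model's API.

```python
import json

def build_distillation_records(prompts, teacher, system_prompt):
    """Turn teacher outputs into chat-format SFT records, one dict per prompt."""
    records = []
    for prompt in prompts:
        answer = teacher(prompt)
        if not answer or not answer.strip():
            continue  # skip empty teacher outputs instead of training on them
        records.append({"messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": answer.strip()},
        ]})
    return records

def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)

# Stand-in teacher so the sketch runs offline; swap in a real API call.
def fake_teacher(prompt):
    return "" if "skip" in prompt else f"1. Review: {prompt}"

records = build_distillation_records(
    ["demand letter", "skip me"],
    fake_teacher,
    "You are a legal document assistant.",
)
print(to_jsonl(records))
```

Filtering teacher outputs (for format compliance, length, or obvious errors) before training usually matters as much as the generation step itself.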
The techniques you've learned here apply beyond instruction following. Domain adaptation fine-tunes on domain-specific raw text (medical literature, legal documents, financial filings) to shift the model's knowledge distribution before task-specific SFT.
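The two-stage recipe: (1) continued pre-training on domain text, which teaches vocabulary and domain knowledge; (2) instruction fine-tuning on domain-specific Q&A pairs, which teaches task-specific behavior. Code Llama follows this pattern (continued pre-training of Llama 2 on code, followed by instruction tuning). BioMedLM and BloombergGPT are related examples of domain-focused pre-training, although both were trained from scratch rather than adapted from an existing model.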
| Technique | Builds on SFT by adding | When to use |
|---|---|---|
| RLHF / PPO | Human preference feedback + reward model + RL optimization | When quality ceiling from imitation is too low |
| DPO | Direct preference pairs, no reward model | When you want RLHF benefits without PPO complexity |
| Distillation | Teacher model generates labels | When you want small-model quality without human labels |
| Domain pre-training | Raw domain text before SFT | When domain vocabulary and knowledge are the bottleneck |
| Continual fine-tuning | Periodic re-training on new data | When task data grows over time |
"What I cannot create, I do not understand." — Richard Feynman
You've learned fine-tuning by understanding every piece: the data format, the LoRA math, the training pipeline, the evaluation framework, the deployment architecture, and the cost math. Now create something with it.